Global search inside a database
In general, searching for a value in SQL consists of running a SELECT query on predefined tables and columns. We know the structure of the database, and we need to retrieve the contents from columns that are specified in the query.
In some situations, it’s the opposite: we know a value, but don’t know in which tables or columns it may be stored, and that’s what we want to find.
This kind of reverse lookup into the database schema can be implemented with procedural code and dynamic SQL.
In this post, I’m describing a short implementation of this search in a plpgsql function, using a few intermediate to advanced features:
- the regproc datatype providing the concept of “pointer to function”.
- the views in information_schema to find tables and columns that the current user is permitted to read.
- ctid columns as pointers to the current physical locations of rows.
- the format() function to safely generate SQL queries.
- the RAISE statement in plpgsql to return results along the way from long-running functions.
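To give an idea of why format() matters here: the %I specifier quotes identifiers and %L quotes literals, so interpolated names and values cannot break out of the generated query. A quick illustrative check (the names are made up):
SELECT format('SELECT %I FROM %I.%I WHERE %I = %L',
              'my col', 'public', 'my table', 'my col', 'O''Brien');
-- => SELECT "my col" FROM public."my table" WHERE "my col" = 'O''Brien'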
The search options
The options are passed as arguments to the function:
CREATE FUNCTION global_search(
search_term text,
comparator regproc default 'pg_catalog.texteq', -- comparison function
tables text[] default null,
schemas text[] default null,
progress text default null, -- 'tables', 'hits', 'all'
max_width int default -1 -- in chars, -1 for unlimited
)
Remember that they can be passed by name, not just by position, so this kind of invocation is accepted:
select global_search('foo', progress=>'tables');
search_term
The only mandatory argument is the term to search. If everything else is left to default, the search will be for an exact match between the term and the text representation of every column in every readable table in the current search path.
comparator
The next argument is the comparison function used to compare the term with the values in all fields. By default, it’s pg_catalog.texteq(text,text), which is the function associated with the = operator applied to text data types. We could pass any function that takes two strings and returns true for a match and false for a non-match. The following built-in functions are likely to cover many needs:
| Function (text,text)->bool | Operator | Comment |
|---|---|---|
| textregexeq | ~ | regular expression case-sensitive match |
| texticregexeq | ~* | regular expression case-insensitive match |
| textlike | ~~ | the same as arg1 LIKE arg2 |
| texticlike | ~~* | the same as arg1 ILIKE arg2 |
The comparison function can typically be passed simply by name, as in select global_search('^foo', comparator=>'textregexeq') to find any field matching the ^foo regular expression. The comparator argument is of type regproc, which automatically replaces the function name with the unique identifier of the function.
regproc is an OID alias type whose text representation is the name of a function or procedure. Casting a name to regproc searches for a function with that name, taking the search path into account, and produces its OID if one and only one match is found. Otherwise it raises an error.
For instance,
=> SELECT 'nonexisting'::regproc;
ERROR: function "nonexisting" does not exist
=> SELECT oid, proname,prolang,proargtypes FROM pg_proc WHERE oid='texteq'::regproc;
oid | proname | prolang | proargtypes
-----+---------+---------+-------------
67 | texteq | 12 | 25 25
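As an aside, the two 25 values in proargtypes are the OID of the text type, which can be checked with the regtype alias type:
=> SELECT 25::regtype;
 regtype
---------
 text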
Conversely, outputting a regproc value produces the name associated with the OID contained in that variable. The name is fully qualified with a schema if necessary.
This works with temporary custom functions, too:
-- Check if two strings are equal after Unicode normalization
-- (requires PostgreSQL 13 or newer)
CREATE FUNCTION pg_temp.normalized_equality(text,text) RETURNS bool
AS 'select normalize($1, NFC) = normalize($2, NFC);'
LANGUAGE SQL STABLE STRICT;
SELECT 'pg_temp.normalized_equality'::regproc,
'pg_temp.normalized_equality'::regproc::oid;
regproc | oid
-------------------------------+-------
pg_temp_5.normalized_equality | 80666
Creating functions in pg_temp makes them stay local to the session and be dropped automatically on disconnection.
tables and schemas
The next optional arguments are arrays of table names and schemas, to restrict the scope of the search. pg_catalog can be passed as a schema name to force the search into the system tables.
For instance:
SELECT * FROM global_search('point', schemas=>'{pg_catalog}');
schemaname | tablename | columnname | columnvalue | rowctid
------------+-----------+------------+-------------+---------
pg_catalog | pg_type | typname | point | (0,28)
pg_catalog | pg_proc | proname | point | (21,5)
pg_catalog | pg_proc | proname | point | (21,25)
pg_catalog | pg_proc | proname | point | (23,3)
pg_catalog | pg_proc | proname | point | (23,4)
pg_catalog | pg_proc | proname | point | (23,5)
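Similarly, the search can be limited to specific tables; for instance, with a hypothetical customers table:
SELECT * FROM global_search('foo', tables=>'{customers}', schemas=>'{public}');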
We get the schemas to scan by default with the very handy current_schemas() function, which avoids the difficult task of parsing search_path.
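For reference, current_schemas(false) returns the effective search path without the implicit system schemas such as pg_catalog; on a default setup it typically looks like this:
=> SELECT current_schemas(false);
 current_schemas
-----------------
 {public}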
The list of all tables in the database is obtained from the information_schema.tables view. The tables that are not readable for lack of permissions are filtered out by checking information_schema.table_privileges. The schemas that cannot be scanned by the current user are filtered out by joining with information_schema.schemata.
The full query used to get the list of tables to scan is:
SELECT t.table_schema, t.table_name
FROM information_schema.tables t
JOIN information_schema.schemata s ON
(s.schema_name=t.table_schema)
WHERE (t.table_name=ANY(tables) OR tables is null)
AND t.table_schema=ANY(schemas)
AND t.table_type='BASE TABLE'
AND EXISTS (SELECT 1 FROM information_schema.table_privileges p
WHERE p.table_name=t.table_name
AND p.table_schema=t.table_schema
AND p.privilege_type='SELECT'
);
Note that for tables where access is granted only on certain columns, the SELECT privilege for the table as a whole does not figure in information_schema.table_privileges, so such tables are filtered out. We could use information_schema.column_privileges to avoid that, but the problem is that our implementation needs the ctid pseudo-column, which is never accessible to a user who has only column-level grants on a base table. Therefore this search function does not search at all in tables to which the current user has only column-level access.
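A minimal sketch of that limitation, with a hypothetical table some_table and role reader:
-- hypothetical column-level grant
GRANT SELECT (val) ON some_table TO reader;
-- as reader, the granted column is readable:
SELECT val FROM some_table;
-- but the ctid pseudo-column is not, since it requires table-level SELECT:
SELECT ctid, val FROM some_table;
ERROR:  permission denied for table some_table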
progress
The rows returned by a set-returning function are not made available until it completes. In the case of long-running functions like this one, it’s nice to see its progress along the way, as if we were running a grep command through a large collection of files. There’s a simple and effective method for that: the RAISE statement, whose message is sent immediately to the client. In the case of psql, it is displayed right away. Depending on this progress parameter, the caller can get an output on each match, on each table scanned, or none at all. When reporting each match, the function could even return nothing and only display the hits along the way; that could be added as an option.
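For example, a run with progress=>'all' emits messages of this shape while the query is still executing (table names and ctids are illustrative):
=> SELECT * FROM global_search('foo', progress=>'all');
INFO:  Searching globally in table: public.tst
INFO:  Found in public.tst.t at ctid (0,1)
...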
max_width
Except when searching for an exact match, the values found in the columns can be interesting, so they’re returned in the result set to the caller. But some values might be annoyingly large. Setting max_width to a number of characters, including zero, will truncate the returned value to that size to prevent excessively wide results.
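For instance, to cap returned values at 10 characters (a sketch; any of the searches shown below would work the same way):
SELECT * FROM global_search('^bar', comparator=>'textregexeq', max_width=>10);
-- columnvalue will contain at most 10 characters per hit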
Scanning the rows
Now to the core part: we must scan rows and apply the
comparator to all values of all columns. Ideally we want to do this
in a single pass, testing all columns at once and scanning each row
only once.
At the column level, the test is comparator(cast(column as text), 'search_term'). We could combine these tests with an OR disjunction across all columns, but then it wouldn’t tell which column(s) match, only that one of them does.
This is where arrays come to the rescue. Instead of using OR, we build an array of boolean results of the per-column tests:
array[
comparator(cast(first_column as text), 'search_term'),
comparator(cast(second_column as text), 'search_term'),
comparator(cast(third_column as text), 'search_term'),
...
]
So if, for instance, only the second column of a row matches, this will produce {false,true,false} for that row. Then applying array_positions(..., true) to that array produces an array with the position numbers of the matching columns, in that case {2}.
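A quick standalone check of that building block:
=> SELECT array_positions(array[false,true,false], true);
 array_positions
-----------------
 {2}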
Now the corresponding column names can be looked up in the columns array, which was easily obtained from an information_schema view again:
SELECT array_agg(column_name ORDER BY ordinal_position)
FROM information_schema.columns
WHERE table_name=tablename
AND table_schema=schemaname
INTO columns;
In addition to the array of matching column numbers, the row-scanning query returns the matching ctid (the non-durable physical location of the row in the table). As the documentation says:
ctid: the physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row’s ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier.
Finally, the code reads the matching values one column at a time (generally there will be only one) from the row located by the ctid, unless the caller passed max_width => 0, and it issues RETURN NEXT to add the OUT variables schemaname, tablename, columnname, columnvalue, rowctid to the result set of the function.
EXECUTE format('SELECT %s FROM %I.%I WHERE ctid=''%s''',
col_expr, schemaname, tablename, rowctid) INTO columnvalue;
[...]
RETURN NEXT;
Complete source code
Here’s the source code (about 85 lines). A maintained version is on GitHub.
CREATE or replace FUNCTION global_search(
search_term text,
comparator regproc default 'pg_catalog.texteq', -- comparison function
tables text[] default null,
schemas text[] default null,
progress text default null, -- 'tables', 'hits', 'all'
max_width int default -1 -- returned value's max width in chars, or -1 for unlimited
)
RETURNS table(schemaname text, tablename text, columnname text, columnvalue text, rowctid tid)
AS $$
DECLARE
query text;
clauses text[];
columns text[];
pos int;
positions int[];
col_expr text;
BEGIN
IF schemas IS NULL THEN
-- by default, exclude pg_catalog and non-readable schemas
schemas := current_schemas(false);
END IF;
FOR schemaname,tablename IN
-- select tables for which all columns are readable
SELECT t.table_schema, t.table_name
FROM information_schema.tables t
JOIN information_schema.schemata s ON
(s.schema_name=t.table_schema)
WHERE (t.table_name=ANY(tables) OR tables is null)
AND t.table_schema=ANY(schemas)
AND t.table_type='BASE TABLE'
AND EXISTS (SELECT 1 FROM information_schema.table_privileges p
WHERE p.table_name=t.table_name
AND p.table_schema=t.table_schema
AND p.privilege_type='SELECT'
)
LOOP
IF (progress in ('tables','all')) THEN
RAISE INFO '%', format('Searching globally in table: %I.%I',
schemaname, tablename);
END IF;
-- Get lists of columns and per-column boolean expressions
SELECT array_agg(column_name ORDER BY ordinal_position),
array_agg(format('%s(cast(%I as text), %L)', comparator, column_name, search_term)
ORDER BY ordinal_position)
FROM information_schema.columns
WHERE table_name=tablename
AND table_schema=schemaname
INTO columns, clauses;
-- Main query to get each matching row and the ordinal positions of matching columns
query := format('SELECT s.ctid, p from (SELECT ctid,'
'array_positions(array[%s],true) AS p FROM ONLY %I.%I) s'
' WHERE cardinality(p)>0',
array_to_string(clauses, ','), schemaname, tablename );
FOR rowctid,positions IN EXECUTE query -- for each matching row
LOOP
FOREACH pos IN ARRAY positions -- for each matching field
LOOP
columnname := columns[pos];
IF (max_width <> 0) THEN -- fetch value only if needed
IF (max_width > 0) THEN
-- fetch a truncated value
col_expr := format('left(%I,%s)', columnname, max_width);
ELSE
col_expr := format('%I', columnname);
END IF;
EXECUTE format('SELECT %s FROM %I.%I WHERE ctid=''%s''',
col_expr, schemaname, tablename, rowctid) INTO columnvalue;
ELSE
columnvalue:=null;
END IF;
IF (progress in ('hits', 'all')) THEN
RAISE INFO '%', format('Found in %I.%I.%I at ctid %s',
schemaname, tablename, columnname, rowctid);
END IF;
RETURN NEXT;
END LOOP;
END LOOP;
END LOOP; -- for each table
END;
$$ language plpgsql;
Examples
-- Setup
=> CREATE TABLE tst(t text);
=> INSERT INTO tst VALUES('foo'),('bar'),('baz'),('barbaz'),('Foo'),(null);
Simple equality search
=> SELECT * FROM global_search('Foo');
schemaname | tablename | columnname | columnvalue | rowctid
------------+-----------+------------+-------------+---------
public | tst | t | Foo | (0,5)
Regular expression matching
=> SELECT * FROM global_search('^bar', comparator=>'textregexeq');
schemaname | tablename | columnname | columnvalue | rowctid
------------+-----------+------------+-------------+---------
public | tst | t | bar | (0,2)
public | tst | t | barbaz | (0,4)
(2 rows)
Case-insensitive LIKE (equivalent to: column ILIKE search_term)
=> SELECT * FROM global_search('fo%', comparator=>'texticlike');
schemaname | tablename | columnname | columnvalue | rowctid
------------+-----------+------------+-------------+---------
public | tst | t | foo | (0,1)
public | tst | t | Foo | (0,5)
(2 rows)
Sometimes there are several functions with the same name. In the case of texticlike, the citext extension overloads this function with two variants that take a citext parameter instead of text. In that kind of case, you want to use the regprocedure cast with a function name qualified by its argument types to disambiguate.
For instance:
SELECT * FROM global_search('fo%', comparator=>'texticlike(text,text)'::regprocedure);
Find all “incorrect” values with a custom function
The following custom function finds values that do not conform to the Unicode NFC normalization. Note that even though the second argument is not used in that case, it needs to be declared nonetheless.
=> CREATE FUNCTION pg_temp.check_normal_form(text,text)
returns boolean as
'select $1 is not NFC normalized' -- requires Postgres 13 or newer
language sql;
=> INSERT INTO tst VALUES (E'El Nin\u0303o');
=> SELECT * FROM global_search(null, comparator=>'pg_temp.check_normal_form');
schemaname | tablename | columnname | columnvalue | rowctid
------------+-----------+------------+-------------+---------
public | tst | t | El Niño | (0,7)
Find references to the OID of a namespace
That sort of query can be useful when exploring the catalogs. 2200 is the OID of the public namespace.
=> SELECT * FROM global_search(
(select oid::text from pg_namespace where nspname='public'),
schemas=>'{pg_catalog}'
);
schemaname | tablename | columnname | columnvalue | rowctid
------------+----------------+--------------+-------------+---------
pg_catalog | pg_proc | pronamespace | 2200 | (95,5)
pg_catalog | pg_description | objoid | 2200 | (28,58)
pg_catalog | pg_namespace | oid | 2200 | (0,8)
pg_catalog | pg_depend | refobjid | 2200 | (13,70)
pg_catalog | pg_init_privs | objoid | 2200 | (2,27)
(5 rows)
Case and accent insensitive global search
Using an ICU collation for advanced non-bitwise equality tests.
-- create a collation that ignores accents and case
=> CREATE COLLATION nd (
provider = 'icu',
locale = '@colStrength=primary',
deterministic = false
);
=> CREATE FUNCTION pg_temp.ci_equal(text,text)
returns boolean as
'select $1=$2 collate "nd"'
language sql;
=> SELECT * FROM global_search('foo', comparator=>'pg_temp.ci_equal');
schemaname | tablename | columnname | columnvalue | rowctid
------------+-----------+------------+-------------+---------
public | tst | t | foo | (0,1)
public | tst | t | Foo | (0,5)
(2 rows)