Global search inside a database
In general, searching for a value in SQL consists of running a SELECT query on predefined tables and columns. We know the structure of the database, and we need to retrieve the contents from columns that are specified in the query.
In some situations, it’s the opposite: we know a value, but don’t know in which tables or columns it may be stored, and that’s what we want to find.
This kind of reverse lookup into the database schema can be implemented with procedural code and dynamic SQL.
In this post, I’m describing a short implementation of this search in a plpgsql function, using a few intermediate to advanced features:
- the regproc datatype providing the concept of “pointer to function”.
- the views in information_schema to find tables and columns that the current user is permitted to read.
- ctid columns as pointers to the current physical locations of rows.
- the format() function to safely generate SQL queries.
- the RAISE statement in plpgsql to return results along the way from long-running functions.
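To give an idea of why format() matters here: the %I specifier quotes identifiers and %L quotes literals, so interpolated names and values cannot break out of the generated query. A quick illustrative check (the names are made up):
SELECT format('SELECT %I FROM %I.%I WHERE %I = %L',
              'my col', 'public', 'my table', 'my col', 'O''Brien');
-- => SELECT "my col" FROM public."my table" WHERE "my col" = 'O''Brien'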
The search options
The options are passed as arguments to the function:
CREATE FUNCTION global_search(
search_term text,
comparator regproc default 'pg_catalog.texteq', -- comparison function
tables text[] default null,
schemas text[] default null,
progress text default null, -- 'tables', 'hits', 'all'
max_width int default -1 -- in chars, -1 for unlimited
)
Remember that they can be passed by name, not just by position, so this kind of invocation is accepted:
select global_search('foo', progress=>'tables');
search_term
The only mandatory argument is the term to search. If everything else is left to default, the search will be for an exact match between the term and the text representation of every column in every readable table in the current search path.
comparator
The next argument is the comparison function used to compare the term with the values in all fields. By default, it’s pg_catalog.texteq(text,text), which is the function associated with the = operator applied to text data types. We could pass any function that takes two strings and returns true for a match and false for a non-match. The following built-in functions are likely to cover many needs:
| Function (text,text)->bool | Operator | Comment |
|---|---|---|
| textregexeq | ~ | regular expression case-sensitive match |
| texticregexeq | ~* | regular expression case-insensitive match |
| textlike | ~~ | the same as arg1 LIKE arg2 |
| texticlike | ~~* | the same as arg1 ILIKE arg2 |
The comparison function can typically be passed simply by name, as in select global_search('^foo', comparator=>'textregexeq') to find any field matching the ^foo regular expression. The comparator argument is of type regproc, which automatically replaces the function name with the unique identifier of the function.
regproc is an OID alias type whose text representation is the name of a function or procedure. Casting a name to regproc searches for a function with that name, taking the search path into account, and produces its OID if one and only one match is found. Otherwise it raises an error.
For instance,
=> SELECT 'nonexisting'::regproc;
ERROR: function "nonexisting" does not exist
=> SELECT oid, proname,prolang,proargtypes FROM pg_proc WHERE oid='texteq'::regproc;
oid | proname | prolang | proargtypes
-----+---------+---------+-------------
67 | texteq | 12 | 25 25
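As an aside, the two 25 values in proargtypes are the OID of the text type, which can be checked with the regtype alias type:
=> SELECT 25::regtype;
 regtype
---------
 text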
Conversely, outputting a regproc value produces the name associated with the OID contained in that variable. The name is fully qualified with a schema if necessary.
This works with temporary custom functions, too:
-- Check if two strings are equal after Unicode normalization
-- (requires PostgreSQL 13 or newer)
CREATE FUNCTION pg_temp.normalized_equality(text,text) RETURNS bool
AS 'select normalize($1, NFC) = normalize($2, NFC);'
LANGUAGE SQL STABLE STRICT;
SELECT 'pg_temp.normalized_equality'::regproc,
'pg_temp.normalized_equality'::regproc::oid;
regproc | oid
-------------------------------+-------
pg_temp_5.normalized_equality | 80666
Creating functions in pg_temp makes them stay local to the session and be dropped automatically on disconnection.
tables and schemas
The next optional arguments are arrays of table names and schemas, to restrict the scope of the search. pg_catalog can be passed as a schema name to force the search into the system tables.
For instance:
SELECT * FROM global_search('point', schemas=>'{pg_catalog}');
schemaname | tablename | columnname | columnvalue | rowctid
------------+-----------+------------+-------------+---------
pg_catalog | pg_type | typname | point | (0,28)
pg_catalog | pg_proc | proname | point | (21,5)
pg_catalog | pg_proc | proname | point | (21,25)
pg_catalog | pg_proc | proname | point | (23,3)
pg_catalog | pg_proc | proname | point | (23,4)
pg_catalog | pg_proc | proname | point | (23,5)
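Similarly, the search can be limited to specific tables; for instance, with a hypothetical customers table:
SELECT * FROM global_search('foo', tables=>'{customers}', schemas=>'{public}');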
We get the schemas to scan by default with the very handy current_schemas() function, which avoids the difficult task of parsing search_path.
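For reference, current_schemas(false) returns the effective search path without the implicit system schemas such as pg_catalog; on a default setup it typically looks like this:
=> SELECT current_schemas(false);
 current_schemas
-----------------
 {public}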
The list of all tables in the database is obtained from the information_schema.tables view. The tables that are not readable for lack of permissions are filtered out by checking information_schema.table_privileges. The schemas that cannot be scanned by the current user are filtered out by joining with information_schema.schemata.
The full query used to get the list of tables to scan is:
SELECT t.table_schema, t.table_name
FROM information_schema.tables t
JOIN information_schema.schemata s ON
(s.schema_name=t.table_schema)
WHERE (t.table_name=ANY(tables) OR tables is null)
AND t.table_schema=ANY(schemas)
AND t.table_type='BASE TABLE'
AND EXISTS (SELECT 1 FROM information_schema.table_privileges p
WHERE p.table_name=t.table_name
AND p.table_schema=t.table_schema
AND p.privilege_type='SELECT'
);
Note that for tables where access is granted only on certain columns, the SELECT privilege for the table as a whole does not figure in information_schema.table_privileges, so such tables are filtered out. We could use information_schema.column_privileges to avoid that, but the problem is that our implementation needs the ctid pseudo-column, which is never accessible to a user who has only column-level grants on a base table. Therefore this search function does not search at all in tables to which the current user has only column-level access.
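A minimal sketch of that limitation, with a hypothetical table some_table and role reader:
-- hypothetical column-level grant
GRANT SELECT (val) ON some_table TO reader;
-- as reader, the granted column is readable:
SELECT val FROM some_table;
-- but the ctid pseudo-column is not, since it requires table-level SELECT:
SELECT ctid, val FROM some_table;
ERROR:  permission denied for table some_table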
progress
The rows returned by a set-returning function are not made available until it completes. In the case of long-running functions like this one, it’s nice to see its progress along the way, as if we were running a grep command through a large collection of files. There’s a simple and effective method for that: the RAISE statement, whose message is sent immediately to the client. In the case of psql, it is displayed right away. Depending on this progress parameter, the caller can get an output on each match, on each table scanned, or none at all. When reporting each match, the function could even return nothing and only display the hits along the way; that could be added as an option.
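For example, a run with progress=>'all' emits messages of this shape while the query is still executing (table names and ctids are illustrative):
=> SELECT * FROM global_search('foo', progress=>'all');
INFO:  Searching globally in table: public.tst
INFO:  Found in public.tst.t at ctid (0,1)
...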
max_width
Except when searching for an exact match, the values found in the columns can be interesting, so they’re returned in the result set to the caller. But some values might be annoyingly large. Setting max_width to a number of characters, including zero, will truncate the returned value to that size to prevent excessively wide results.
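For instance, to cap returned values at 10 characters (a sketch; any of the searches shown below would work the same way):
SELECT * FROM global_search('^bar', comparator=>'textregexeq', max_width=>10);
-- columnvalue will contain at most 10 characters per hit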
Scanning the rows
Now to the core part: we must scan rows and apply the
comparator to all values of all columns. Ideally we want to do this
in a single pass, testing all columns at once and scanning each row
only once.
At the column level, the test is comparator(cast(column as text), 'search_term'). We could combine these tests with an OR disjunction across all columns, but then it wouldn’t tell which column(s) match, only that one of them does.
This is where arrays come to the rescue. Instead of using OR, we build an array of boolean results of the per-column tests:
array[
comparator(cast(first_column as text), 'search_term'),
comparator(cast(second_column as text), 'search_term'),
comparator(cast(third_column as text), 'search_term'),
...
]
So if, for instance, only the second column of a row matches, this will produce {false,true,false} for that row. Then applying array_positions(..., true) to that array produces an array with the position numbers of the matching columns, in that case {2}.
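A quick standalone check of that building block:
=> SELECT array_positions(array[false,true,false], true);
 array_positions
-----------------
 {2}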
Now the corresponding column names can be looked up in the columns array, which was easily obtained from an information_schema view again:
SELECT array_agg(column_name ORDER BY ordinal_position)
FROM information_schema.columns
WHERE table_name=tablename
AND table_schema=schemaname
INTO columns;
In addition to the array of matching column numbers, the row-scanning query returns the matching ctid (the non-durable physical location of the row in the table). As the documentation says:
ctid: the physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row’s ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier.
Finally, the code reads the matching values one column at a time (generally there will be only one) from the row located by the ctid, unless the caller passed max_width => 0, and it issues RETURN NEXT to add the OUT variables schemaname, tablename, columnname, columnvalue, rowctid to the result set of the function.
EXECUTE format('SELECT %s FROM %I.%I WHERE ctid=''%s''',
col_expr, schemaname, tablename, rowctid) INTO columnvalue;
[...]
RETURN NEXT;
Complete source code
Here’s the source code (about 85 lines). A maintained version is on GitHub.
CREATE or replace FUNCTION global_search(
search_term text,
comparator regproc default 'pg_catalog.texteq', -- comparison function
tables text[] default null,
schemas text[] default null,
progress text default null, -- 'tables', 'hits', 'all'
max_width int default -1 -- returned value's max width in chars, or -1 for unlimited
)
RETURNS table(schemaname text, tablename text, columnname text, columnvalue text, rowctid tid)
AS $$
DECLARE
query text;
clauses text[];
columns text[];
pos int;
positions int[];
col_expr text;
BEGIN
IF schemas IS NULL THEN
-- by default, exclude pg_catalog and non-readable schemas
schemas := current_schemas(false);
END IF;
FOR schemaname,tablename IN
-- select tables for which all columns are readable
SELECT t.table_schema, t.table_name
FROM information_schema.tables t
JOIN information_schema.schemata s ON
(s.schema_name=t.table_schema)
WHERE (t.table_name=ANY(tables) OR tables is null)
AND t.table_schema=ANY(schemas)
AND t.table_type='BASE TABLE'
AND EXISTS (SELECT 1 FROM information_schema.table_privileges p
WHERE p.table_name=t.table_name
AND p.table_schema=t.table_schema
AND p.privilege_type='SELECT'
)
LOOP
IF (progress in ('tables','all')) THEN
RAISE INFO '%', format('Searching globally in table: %I.%I',
schemaname, tablename);
END IF;
-- Get lists of columns and per-column boolean expressions
SELECT array_agg(column_name ORDER BY ordinal_position),
array_agg(format('%s(cast(%I as text), %L)', comparator, column_name, search_term)
ORDER BY ordinal_position)
FROM information_schema.columns
WHERE table_name=tablename
AND table_schema=schemaname
INTO columns, clauses;
-- Main query to get each matching row and the ordinal positions of matching columns
query := format('SELECT s.ctid, p from (SELECT ctid,'
'array_positions(array[%s],true) AS p FROM ONLY %I.%I) s'
' WHERE cardinality(p)>0',
array_to_string(clauses, ','), schemaname, tablename );
FOR rowctid,positions IN EXECUTE query -- for each matching row
LOOP
FOREACH pos IN ARRAY positions -- for each matching field
LOOP
columnname := columns[pos];
IF (max_width <> 0) THEN -- fetch value only if needed
IF (max_width > 0) THEN
-- fetch a truncated value
col_expr := format('left(%I,%s)', columnname, max_width);
ELSE
col_expr := format('%I', columnname);
END IF;
EXECUTE format('SELECT %s FROM %I.%I WHERE ctid=''%s''',
col_expr, schemaname, tablename, rowctid) INTO columnvalue;
ELSE
columnvalue:=null;
END IF;
IF (progress in ('hits', 'all')) THEN
RAISE INFO '%', format('Found in %I.%I.%I at ctid %s',
schemaname, tablename, columnname, rowctid);
END IF;
RETURN NEXT;
END LOOP;
END LOOP;
END LOOP; -- for each table
END;
$$ language plpgsql;
Examples
-- Setup
=> CREATE TABLE tst(t text);
=> INSERT INTO tst VALUES('foo'),('bar'),('baz'),('barbaz'),('Foo'),(null);
Simple equality search
=> SELECT * FROM global_search('Foo');
schemaname | tablename | columnname | columnvalue | rowctid
------------+-----------+------------+-------------+---------
public | tst | t | Foo | (0,5)
Regular expression matching
=> SELECT * FROM global_search('^bar', comparator=>'textregexeq');
schemaname | tablename | columnname | columnvalue | rowctid
------------+-----------+------------+-------------+---------
public | tst | t | bar | (0,2)
public | tst | t | barbaz | (0,4)
(2 rows)
Case-insensitive LIKE (equivalent to: column ILIKE search_term)
=> SELECT * FROM global_search('fo%', comparator=>'texticlike');
schemaname | tablename | columnname | columnvalue | rowctid
------------+-----------+------------+-------------+---------
public | tst | t | foo | (0,1)
public | tst | t | Foo | (0,5)
(2 rows)
Sometimes there are several functions with the same name. In the case of texticlike, the citext extension overloads this function with two variants that take a citext parameter instead of text. In that kind of case, you want to use the regprocedure cast with a function name qualified by its argument types to disambiguate.
For instance:
SELECT * FROM global_search('fo%', comparator=>'texticlike(text,text)'::regprocedure);
Find all “incorrect” values with a custom function
The following custom function finds values that do not conform to the Unicode NFC normalization. Note that even though the second argument is not used in that case, it needs to be declared nonetheless.
=> CREATE FUNCTION pg_temp.check_normal_form(text,text)
returns boolean as
'select $1 is not NFC normalized' -- requires Postgres 13 or newer
language sql;
=> INSERT INTO tst VALUES (E'El Nin\u0303o');
=> SELECT * FROM global_search(null, comparator=>'pg_temp.check_normal_form');
schemaname | tablename | columnname | columnvalue | rowctid
------------+-----------+------------+-------------+---------
public | tst | t | El Niño | (0,7)
Find references to the OID of a namespace
That sort of query can be useful when exploring the catalogs. 2200 is the OID of the public namespace.
=> SELECT * FROM global_search(
(select oid::text from pg_namespace where nspname='public'),
schemas=>'{pg_catalog}'
);
schemaname | tablename | columnname | columnvalue | rowctid
------------+----------------+--------------+-------------+---------
pg_catalog | pg_proc | pronamespace | 2200 | (95,5)
pg_catalog | pg_description | objoid | 2200 | (28,58)
pg_catalog | pg_namespace | oid | 2200 | (0,8)
pg_catalog | pg_depend | refobjid | 2200 | (13,70)
pg_catalog | pg_init_privs | objoid | 2200 | (2,27)
(5 rows)
Case and accent insensitive global search
Using an ICU collation for advanced non-bitwise equality tests.
-- create a collation that ignores accents and case
=> CREATE COLLATION nd (
provider = 'icu',
locale = '@colStrength=primary',
deterministic = false
);
=> CREATE FUNCTION pg_temp.ci_equal(text,text)
returns boolean as
'select $1=$2 collate "nd"'
language sql;
=> SELECT * FROM global_search('foo', comparator=>'pg_temp.ci_equal');
schemaname | tablename | columnname | columnvalue | rowctid
------------+-----------+------------+-------------+---------
public | tst | t | foo | (0,1)
public | tst | t | Foo | (0,5)
(2 rows)