Text search: a custom dictionary to avoid long words
Full text search is based on transforming the initial text into a tsvector.
For example:
test=> select to_tsvector('english', 'This text is being processed.');
to_tsvector
----------------------
'process':5 'text':2
This result is a sorted list of lexemes, with their relative positions in the initial text, obtained by this process:
Raw text => Parser => Dictionaries (configurable) => tsvector
When there is enough data, these vectors are typically indexed with a GIN or GiST index to speed up text search queries.
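For example, with a hypothetical mail table holding messages in a body column, such an index could look like this:
-- expression index over the computed tsvector
-- (the mail table and body column are just for illustration)
CREATE INDEX mail_body_fts ON mail
  USING gin (to_tsvector('english', body));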
In SQL we can inspect the intermediate results of this process with the ts_debug function:
test=> select * from ts_debug('english', 'This text is being processed.');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-----------+----------------+--------------+-----------
asciiword | Word, all ASCII | This | {english_stem} | english_stem | {}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | text | {english_stem} | english_stem | {text}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | is | {english_stem} | english_stem | {}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | being | {english_stem} | english_stem | {}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | processed | {english_stem} | english_stem | {process}
blank | Space symbols | . | {} | |
The parser breaks down the text into tokens (token column), each token being associated with a type (alias and description columns). Then, depending on their types, these tokens are submitted as input to the dictionaries mapped to these types, which may produce one lexeme, several, or zero to eliminate the term from the output vector.
In the above example, spaces and punctuation are eliminated because they are not mapped to any dictionary; common terms (“this”, “is”, “being”) are eliminated as stop words by the english_stem dictionary; “text” is kept untouched, and “processed” is stemmed to “process” by english_stem.
What about undesirable pieces of text?
Sometimes the initial text is not “clean”, in the sense that it contains noise that we’d rather not index. For instance, when indexing mail messages, badly formatted messages may have base64 contents that slip into text parts. When these contents correspond to attached files, they can be pretty big. Looking at how this kind of data gets transformed into lexemes, here’s what we can see:
=# \x
=# select * from ts_debug('english', 'Q29uc2lzdGVuY3kgYWxzbyBtZWFucyB0aGF0IHdoZW4gQWxpY2UgYW5kIEJvYiBhcmUgcnVubmlu');
-[ RECORD 1 ]+-------------------------------------------------------------------------------
alias | numword
description | Word, letters and digits
token | Q29uc2lzdGVuY3kgYWxzbyBtZWFucyB0aGF0IHdoZW4gQWxpY2UgYW5kIEJvYiBhcmUgcnVubmlu
dictionaries | {simple}
dictionary | simple
lexemes | {q29uc2lzdgvuy3kgywxzbybtzwfucyb0agf0ihdozw4gqwxpy2ugyw5kiejvyibhcmugcnvubmlu}
So this text is parsed as a single numword token, from which a single long lexeme is produced, since numword tokens are associated with the simple dictionary, which just downcases the word. That brings up a question: how to avoid that kind of term in the vector?
One idea is to consider that very long tokens are uninteresting for search, and as such they can simply be eliminated. Even though some languages do have really long words, such as the German Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz with its 63 characters(!), we can imagine setting a length limit above which words are ignored, just like stop words are ignored.
Filtering by length
It’s not difficult to create a dictionary to filter out long words. A text search dictionary takes the form of two functions in C, along with a few SQL declarative statements to create it and assign it to a text search configuration.
In the PostgreSQL source code, there are several examples of dictionaries that can be used as models:
- dict_simple, a built-in dictionary, part of core.
- dict_int, a contrib module to filter out numbers longer than a given number of digits, so very close to what we want in this post.
- dict_xsyn, a contrib module to add lexemes to the output that are not in the original text but are synonyms of words in the text.
- unaccent, a contrib module providing primarily a callable SQL function to remove accents, but also exposing it as a filtering dictionary, which means that the lexemes it produces are passed to the next dictionary in the chain.
A new dictionary project can be started by pretty much copy-pasting one of these examples, since much of the code is declarative stuff that you don’t want to write from scratch.
There are two C functions to produce:
- An INIT function that receives the configuration parameters of the dictionary. We can use this function to accept our maximum length as a parameter, instead of hard-coding it in the source.
- A LEXIZE function that takes a token’s text as input and must produce zero, one or several lexemes corresponding to that piece of text. The function also indicates whether these lexemes are to be passed to the rest of the chain of dictionaries. In our case we want to eliminate the token if it’s longer than our limit, or pass it through unchanged.
Let’s call this dictionary dictmaxlen and length its parameter. Following the model of the contrib modules, we can package its code and declarations in a Postgres extension.
The declarations actually create a dictionary template rather than a dictionary, if we want to use the correct terminology. A dictionary is instantiated from a template with CREATE TEXT SEARCH DICTIONARY (TEMPLATE = ...), supplying values for the parameters.
Here are the SQL declarations for the functions and the template:
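What follows is a minimal sketch rather than the project’s exact code; in particular, the function names dictmaxlen_init and dictmaxlen_lexize and the template name dictmaxlen_template are assumptions, modeled on the dict_int contrib module:
CREATE FUNCTION dictmaxlen_init(internal)
  RETURNS internal
  AS 'MODULE_PATHNAME'
  LANGUAGE C STRICT;

CREATE FUNCTION dictmaxlen_lexize(internal, internal, internal, internal)
  RETURNS internal
  AS 'MODULE_PATHNAME'
  LANGUAGE C STRICT;

CREATE TEXT SEARCH TEMPLATE dictmaxlen_template (
  INIT = dictmaxlen_init,
  LEXIZE = dictmaxlen_lexize
);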
The only specific bit here is the name dictmaxlen; otherwise any other dictionary would have the same declarations.
C functions
Dictionary initialization (called when instantiating)
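Here is a sketch of what this function can look like, modeled on the init function of dict_int; the struct layout, the default value of 50 and the function name are assumptions, not necessarily the actual project’s code:
#include "postgres.h"
#include "commands/defrem.h"
#include "tsearch/ts_public.h"
#include "mb/pg_wchar.h"        /* for pg_mbstrlen_with_len() */

PG_MODULE_MAGIC;

typedef struct
{
    int         maxlen;         /* tokens longer than this are dropped */
} DictMaxLen;

PG_FUNCTION_INFO_V1(dictmaxlen_init);

Datum
dictmaxlen_init(PG_FUNCTION_ARGS)
{
    List       *options = (List *) PG_GETARG_POINTER(0);
    DictMaxLen *d = (DictMaxLen *) palloc0(sizeof(DictMaxLen));
    ListCell   *lc;

    d->maxlen = 50;             /* arbitrary default when LENGTH is not given */

    foreach(lc, options)
    {
        DefElem    *defel = (DefElem *) lfirst(lc);

        if (pg_strcasecmp(defel->defname, "length") == 0)
            d->maxlen = atoi(defGetString(defel));
        else
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                     errmsg("unrecognized dictionary parameter: \"%s\"",
                            defel->defname)));
    }

    PG_RETURN_POINTER(d);
}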
Generation of lexemes
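A possible implementation, continuing the same source file (again a sketch: returning an empty array to discard long tokens, and a TSL_FILTER lexeme to hand short tokens over to the next dictionary, is one way to get the behavior described above, not necessarily the project’s exact choice):
PG_FUNCTION_INFO_V1(dictmaxlen_lexize);

Datum
dictmaxlen_lexize(PG_FUNCTION_ARGS)
{
    DictMaxLen *d = (DictMaxLen *) PG_GETARG_POINTER(0);
    char       *token = (char *) PG_GETARG_POINTER(1);
    int         byte_length = PG_GETARG_INT32(2);
    TSLexeme   *res;

    if (pg_mbstrlen_with_len(token, byte_length) > d->maxlen)
    {
        /* Too long: return an empty (NULL-terminated) array so the
           token is discarded, like a stop word. */
        res = palloc0(sizeof(TSLexeme));
    }
    else
    {
        /* Short enough: emit the token unchanged, flagged so that it
           is passed on to the next dictionary in the chain. */
        res = palloc0(sizeof(TSLexeme) * 2);
        res[0].lexeme = pnstrdup(token, byte_length);
        res[0].flags = TSL_FILTER;
    }

    PG_RETURN_POINTER(res);
}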
Encapsulating the code into an extension
As with most contrib modules, it’s convenient to package this code as an extension, to distribute and deploy it easily.
Creating an extension is relatively easy with PGXS, which is part of the Postgres toolbox (on Linux distributions, it’s often provided in a development package, such as postgresql-server-dev-11 for Debian).
An extension needs a control file. The following stanza will do the job (filename: dict_maxlen.control):
# dict_maxlen extension
comment = 'text search template for a dictionary filtering out long words'
default_version = '1.0'
module_pathname = '$libdir/dict_maxlen'
relocatable = true
Thanks to PGXS, we can use a simplified Makefile that transparently includes the declarations for the paths and libraries involved in the build. Here is a ready-to-use Makefile that can build our simple extension with make && make install:
EXTENSION = dict_maxlen
EXTVERSION = 1.0
PG_CONFIG = pg_config
MODULE_big = dict_maxlen
OBJS = dict_maxlen.o
DATA = $(wildcard *.sql)
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)
Usage
Once the extension is compiled and installed, we can create it in a database and instantiate the dictionary:
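For instance, with a limit of 40 characters (the template name follows the sketch given earlier; LENGTH is the parameter accepted by our INIT function):
CREATE EXTENSION dict_maxlen;

CREATE TEXT SEARCH DICTIONARY dictmaxlen (
  TEMPLATE = dictmaxlen_template,
  LENGTH = 40
);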
Now this dictionary needs to be mapped to tokens (the output of the parser), with ALTER TEXT SEARCH CONFIGURATION ... ALTER MAPPING.
In the example above we saw that the kind of token produced for the base64 content was numword, but we may also want to map our dictionary to tokens containing only letters, word and asciiword, or to any other of the 23 kinds of tokens that the parser can generate as of Postgres 11.
With psql this mapping can be visualized with \dF+. For english, the defaults are:
postgres=> \dF+ english
Text search configuration "pg_catalog.english"
Parser: "pg_catalog.default"
Token | Dictionaries
-----------------+--------------
asciihword | english_stem
asciiword | english_stem
email | simple
file | simple
float | simple
host | simple
hword | english_stem
hword_asciipart | english_stem
hword_numpart | simple
hword_part | english_stem
int | simple
numhword | simple
numword | simple
sfloat | simple
uint | simple
url | simple
url_path | simple
version | simple
word | english_stem
To avoid messing with the built-in english configuration, let’s derive a new, specific text search configuration with its own mappings:
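A sketch of these statements, consistent with the mappings shown just below (the exact set of mapped token types is a choice, not an obligation):
CREATE TEXT SEARCH CONFIGURATION mytsconf (COPY = pg_catalog.english);

ALTER TEXT SEARCH CONFIGURATION mytsconf
  ALTER MAPPING FOR asciiword, word
  WITH dictmaxlen, english_stem;

ALTER TEXT SEARCH CONFIGURATION mytsconf
  ALTER MAPPING FOR numword
  WITH dictmaxlen, simple;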
Now let’s check the new configuration with psql:
=# \dF+ mytsconf
Text search configuration "public.mytsconf"
Parser: "pg_catalog.default"
Token | Dictionaries
-----------------+------------------------
asciihword | english_stem
asciiword | dictmaxlen,english_stem
email | simple
file | simple
float | simple
host | simple
hword | english_stem
hword_asciipart | english_stem
hword_numpart | simple
hword_part | english_stem
int | simple
numhword | simple
numword | dictmaxlen,simple
sfloat | simple
uint | simple
url | simple
url_path | simple
version | simple
word | dictmaxlen,english_stem
And check what happens now to a numword over 40 characters:
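For instance, running the base64 string from earlier through the new configuration (illustrative session, with the 40-character limit set above):
=# SELECT to_tsvector('mytsconf',
     'Q29uc2lzdGVuY3kgYWxzbyBtZWFucyB0aGF0IHdoZW4gQWxpY2UgYW5kIEJvYiBhcmUgcnVubmlu');
 to_tsvector
-------------

(1 row)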
Et voilà! The undesirable token has been left out, as expected.
We may use this mytsconf configuration by passing it as an explicit first argument to text search functions, but it could also be set as a default.
For the duration of the session:
SET default_text_search_config TO 'mytsconf';
For the database (durably):
ALTER DATABASE dbname SET default_text_search_config TO 'mytsconf';
The source code for this example is available on GitHub.