Text search: a custom dictionary to avoid long words
Full text search is based on transforming the initial text into a tsvector.
For example:
test=> select to_tsvector('english', 'This text is being processed.');
to_tsvector
----------------------
'process':5 'text':2
This result is a sorted list of lexemes, with their relative positions in the initial text, obtained by this process:
Raw text => Parser => Dictionaries (configurable) => tsvector
When there is enough data, these vectors are typically indexed with a GIN or GiST index to speed up text search queries.
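For example, with a hypothetical mail table holding messages in a body column, such an index could look like this:
-- expression index over the computed tsvector
-- (the mail table and body column are just for illustration)
CREATE INDEX mail_body_fts ON mail
  USING gin (to_tsvector('english', body));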
In SQL we can inspect the intermediate results of this process with the ts_debug function:
test=> select * from ts_debug('english', 'This text is being processed.');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-----------+----------------+--------------+-----------
asciiword | Word, all ASCII | This | {english_stem} | english_stem | {}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | text | {english_stem} | english_stem | {text}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | is | {english_stem} | english_stem | {}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | being | {english_stem} | english_stem | {}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | processed | {english_stem} | english_stem | {process}
blank | Space symbols | . | {} | |
The parser breaks down the text into tokens (token column), each token being associated with a type (alias and description columns). Then, depending on their types, these tokens are submitted as input to the dictionaries mapped to these types, which may produce one lexeme, several, or zero to eliminate the term from the output vector.
In the above example, spaces and punctuation are eliminated because they are not mapped to any dictionary; common terms (“this”, “is”, “being”) are eliminated as stop words by the english_stem dictionary; “text” is kept untouched, and “processed” is stemmed to “process” by english_stem.
What about undesirable pieces of text?
Sometimes the initial text is not “clean”, in the sense that it contains noise that we’d rather not index. For instance, when indexing mail messages, badly formatted messages may have base64 contents that slip into text parts. When these contents correspond to attached files, they can be pretty big. Looking at how this kind of data gets transformed into lexemes, here’s what we can see:
=# \x
=# select * from ts_debug('english', 'Q29uc2lzdGVuY3kgYWxzbyBtZWFucyB0aGF0IHdoZW4gQWxpY2UgYW5kIEJvYiBhcmUgcnVubmlu');
-[ RECORD 1 ]+-------------------------------------------------------------------------------
alias | numword
description | Word, letters and digits
token | Q29uc2lzdGVuY3kgYWxzbyBtZWFucyB0aGF0IHdoZW4gQWxpY2UgYW5kIEJvYiBhcmUgcnVubmlu
dictionaries | {simple}
dictionary | simple
lexemes | {q29uc2lzdgvuy3kgywxzbybtzwfucyb0agf0ihdozw4gqwxpy2ugyw5kiejvyibhcmugcnvubmlu}
So this text is parsed as a single numword token, from which a single long lexeme is produced, since numword tokens are associated with the simple dictionary, which just downcases the word. That brings up a question: how to avoid that kind of term in the vector?
One idea is to consider that very long tokens are uninteresting for search, and as such they can simply be eliminated. Even though some languages do have really long words, such as the German Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz with its 63 characters(!), we can imagine setting a length limit above which words are ignored, just like stop words are ignored.
Filtering by length
It’s not difficult to create a dictionary to filter out long words. A text search dictionary takes the form of two functions in C, along with a few SQL declarative statements to create it and assign it to a text search configuration.
In the PostgreSQL source code, there are several examples of dictionaries that can be used as models:
- dict_simple, a built-in dictionary, part of core.
- dict_int, a contrib module to filter out numbers longer than a given number of digits, so very close to what we want in this post.
- dict_xsyn, a contrib module to add lexemes to the output that are not in the original text but are synonyms of words in the text.
- unaccent, a contrib module providing primarily a callable SQL function to remove accents, but also exposing it as a filtering dictionary, which means that the lexemes it produces are passed to the next dictionary in the chain.
A new dictionary project can be started by pretty much copy-pasting one of these examples, since much of the code is declarative stuff that you don’t want to write from scratch.
There are two C functions to produce:
- An INIT function that receives the configuration parameters of the dictionary. We can use this function to accept our maximum length as a parameter, instead of hard-coding it in the source.
- A LEXIZE function that takes a token’s text as input and must produce zero, one or several lexemes corresponding to that piece of text. The function also indicates whether these lexemes are to be passed to the rest of the chain of dictionaries. In our case we want to eliminate the token if it’s longer than our limit, or pass it through unchanged.
Let’s call this dictionary dictmaxlen and length its parameter. Following the model of the contrib modules, we can package its code and declarations in a Postgres extension.
The declarations actually create a dictionary template rather than a dictionary, if we want to use the correct terminology. A dictionary is instantiated from a template with CREATE TEXT SEARCH DICTIONARY (TEMPLATE = ...), supplying values for the parameters.
Here are the SQL declarations for the functions and the template:
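What follows is a minimal sketch rather than the project’s exact code; in particular, the function names dictmaxlen_init and dictmaxlen_lexize and the template name dictmaxlen_template are assumptions, modeled on the dict_int contrib module:
CREATE FUNCTION dictmaxlen_init(internal)
  RETURNS internal
  AS 'MODULE_PATHNAME'
  LANGUAGE C STRICT;

CREATE FUNCTION dictmaxlen_lexize(internal, internal, internal, internal)
  RETURNS internal
  AS 'MODULE_PATHNAME'
  LANGUAGE C STRICT;

CREATE TEXT SEARCH TEMPLATE dictmaxlen_template (
  INIT = dictmaxlen_init,
  LEXIZE = dictmaxlen_lexize
);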
The only specific bit here is the name dictmaxlen; otherwise any other dictionary would have the same declarations.
C functions
Dictionary initialization (called when instantiating)
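Here is a sketch of what this function can look like, modeled on the init function of dict_int; the struct layout, the default value of 50 and the function name are assumptions, not necessarily the actual project’s code:
#include "postgres.h"
#include "commands/defrem.h"
#include "tsearch/ts_public.h"
#include "mb/pg_wchar.h"        /* for pg_mbstrlen_with_len() */

PG_MODULE_MAGIC;

typedef struct
{
    int         maxlen;         /* tokens longer than this are dropped */
} DictMaxLen;

PG_FUNCTION_INFO_V1(dictmaxlen_init);

Datum
dictmaxlen_init(PG_FUNCTION_ARGS)
{
    List       *options = (List *) PG_GETARG_POINTER(0);
    DictMaxLen *d = (DictMaxLen *) palloc0(sizeof(DictMaxLen));
    ListCell   *lc;

    d->maxlen = 50;             /* arbitrary default when LENGTH is not given */

    foreach(lc, options)
    {
        DefElem    *defel = (DefElem *) lfirst(lc);

        if (pg_strcasecmp(defel->defname, "length") == 0)
            d->maxlen = atoi(defGetString(defel));
        else
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                     errmsg("unrecognized dictionary parameter: \"%s\"",
                            defel->defname)));
    }

    PG_RETURN_POINTER(d);
}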
Generation of lexemes
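A possible implementation, continuing the same source file (again a sketch: returning an empty array to discard long tokens, and a TSL_FILTER lexeme to hand short tokens over to the next dictionary, is one way to get the behavior described above, not necessarily the project’s exact choice):
PG_FUNCTION_INFO_V1(dictmaxlen_lexize);

Datum
dictmaxlen_lexize(PG_FUNCTION_ARGS)
{
    DictMaxLen *d = (DictMaxLen *) PG_GETARG_POINTER(0);
    char       *token = (char *) PG_GETARG_POINTER(1);
    int         byte_length = PG_GETARG_INT32(2);
    TSLexeme   *res;

    if (pg_mbstrlen_with_len(token, byte_length) > d->maxlen)
    {
        /* Too long: return an empty (NULL-terminated) array so the
           token is discarded, like a stop word. */
        res = palloc0(sizeof(TSLexeme));
    }
    else
    {
        /* Short enough: emit the token unchanged, flagged so that it
           is passed on to the next dictionary in the chain. */
        res = palloc0(sizeof(TSLexeme) * 2);
        res[0].lexeme = pnstrdup(token, byte_length);
        res[0].flags = TSL_FILTER;
    }

    PG_RETURN_POINTER(res);
}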
Encapsulating the code into an extension
As with most contrib modules, it’s convenient to package this code as an extension, to distribute and deploy it easily.
Creating an extension is relatively easy with PGXS, which is part of the Postgres toolbox (on Linux distributions, it’s often provided in a development package, such as postgresql-server-dev-11 for Debian).
An extension needs a control file. The following stanza will do the job (filename: dict_maxlen.control):
# dict_maxlen extension
comment = 'text search template for a dictionary filtering out long words'
default_version = '1.0'
module_pathname = '$libdir/dict_maxlen'
relocatable = true
Thanks to PGXS, we can use a simplified Makefile that transparently includes the declarations for the paths and libraries involved in the build. Here is a ready-to-use Makefile that can build our simple extension with make && make install:
EXTENSION = dict_maxlen
EXTVERSION = 1.0
PG_CONFIG = pg_config
MODULE_big = dict_maxlen
OBJS = dict_maxlen.o
DATA = $(wildcard *.sql)
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)
Usage
Once the extension is compiled and installed, we can create it in a database and instantiate the dictionary:
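For instance, with a limit of 40 characters (the template name follows the sketch given earlier; LENGTH is the parameter accepted by our INIT function):
CREATE EXTENSION dict_maxlen;

CREATE TEXT SEARCH DICTIONARY dictmaxlen (
  TEMPLATE = dictmaxlen_template,
  LENGTH = 40
);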
Now this dictionary needs to be mapped to tokens (the output of the parser), with ALTER TEXT SEARCH CONFIGURATION ... ALTER MAPPING.
In the example above we saw that the kind of token produced for the base64 content was numword, but we may also want to map our dictionary to tokens containing only letters, word and asciiword, or to any other of the 23 kinds of tokens that the parser can generate as of Postgres 11.
With psql this mapping can be visualized with \dF+. For english, the defaults are:
postgres=> \dF+ english
Text search configuration "pg_catalog.english"
Parser: "pg_catalog.default"
Token | Dictionaries
-----------------+--------------
asciihword | english_stem
asciiword | english_stem
email | simple
file | simple
float | simple
host | simple
hword | english_stem
hword_asciipart | english_stem
hword_numpart | simple
hword_part | english_stem
int | simple
numhword | simple
numword | simple
sfloat | simple
uint | simple
url | simple
url_path | simple
version | simple
word | english_stem
To avoid messing with the built-in english configuration, let’s derive a new, specific text search configuration with its own mappings:
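A sketch of these statements, consistent with the mappings shown just below (the exact set of mapped token types is a choice, not an obligation):
CREATE TEXT SEARCH CONFIGURATION mytsconf (COPY = pg_catalog.english);

ALTER TEXT SEARCH CONFIGURATION mytsconf
  ALTER MAPPING FOR asciiword, word
  WITH dictmaxlen, english_stem;

ALTER TEXT SEARCH CONFIGURATION mytsconf
  ALTER MAPPING FOR numword
  WITH dictmaxlen, simple;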
Now let’s check the new configuration with psql:
=# \dF+ mytsconf
Text search configuration "public.mytsconf"
Parser: "pg_catalog.default"
Token | Dictionaries
-----------------+------------------------
asciihword | english_stem
asciiword | dictmaxlen,english_stem
email | simple
file | simple
float | simple
host | simple
hword | english_stem
hword_asciipart | english_stem
hword_numpart | simple
hword_part | english_stem
int | simple
numhword | simple
numword | dictmaxlen,simple
sfloat | simple
uint | simple
url | simple
url_path | simple
version | simple
word | dictmaxlen,english_stem
And check what happens now to a numword over 40 characters:
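For instance, running the base64 string from earlier through the new configuration (illustrative session, with the 40-character limit set above):
=# SELECT to_tsvector('mytsconf',
     'Q29uc2lzdGVuY3kgYWxzbyBtZWFucyB0aGF0IHdoZW4gQWxpY2UgYW5kIEJvYiBhcmUgcnVubmlu');
 to_tsvector
-------------

(1 row)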
Et voilà! The undesirable token has been left out, as expected.
We may use this mytsconf configuration by passing it as an explicit first argument to text search functions, but it could also be set as a default.
For the duration of the session:
SET default_text_search_config TO 'mytsconf';
For the database (durably):
ALTER DATABASE dbname SET default_text_search_config TO 'mytsconf';
The source code for this example is available on GitHub.