Since version 10, Postgres can be configured with ICU,
the reference library for Unicode, to use its collations in
To that end, ICU collations for most language/countries pairs
are automatically created at initdb time (they will be found in
and others can be added later with
Beyond collation support, ICU provides other services focused on locale-aware and multilingual text processing, following Unicode recommendations.
Once our Postgres instances are linked with ICU (most are because the major packagers like Apt, Rpm, or the Edb Installer have opted to include it), why not try to benefit from all these services through SQL?
This is the goal of
a C extension providing SQL wrapper functions to ICU. As of now, it
brings about twenty functions allowing to inspect locales and
collations, identify text boundaries, compare and sort strings,
spoof-check strings, spell numbers, transliterate between scripts.
Before looking into compare and sort functions, let’s recall what the ICU integration already provides.
Benefits of ICU versus collations from the OS
pg_collation.collversionholds the version number of each collation when created, and if ever at runtime it doesn’t match the version ICU provides (typically following an upgrade), a warning is issued, urging the user to rebuild potentially affected indexes and register the new collation version in the catalog.
Cross-OS consistency: the same ICU collation sorts identically across all operating systems, which is not the case with the C libraries. For instance,
lc_collateset to the same
en_US.UTF-8behaves differently on Linux and FreeBSD: pairs of strings as simple as
"0102"sort in opposite orders.
Indexing speed: Postgres uses abbreviated keys when available (also called sort keys in ICU terminology), because they can really speed up indexing. But since the libc support for this feature (
strxfrm) has been found to be buggy in several operating systems, it’s only activated for ICU collations.
Parameterized string comparisons: in libc, there’s typically a one-to-one mapping between locale and collation, like for instance
fr_CA.UTF-8compares strings with the linguistic rules of French as written in Canada, but without further customization or control. With ICU, collations accept a handful of parameters that offers possibilities beyond the language-country specification, as demonstrated in the online ICU Collation Demo, or Peter Eisentraut’s blog post “More robust collations with ICU support in PostgreSQL 10” announcing the feature last year, or in “What users can do with custom ICU collations in Postgres 10” (discussion in pgsql-hackers).
What Postgres can’t do yet with ICU collations
Unfortunately there is a problem preventing to use the advanced comparisons to their full potential: the current infrastructure in Postgres for text comparisons cannot handle strings being equal when they are not equivalent byte-for-byte. To enforce this constraint, when strings compare as equal with the linguistic comparison (strcoll-like), Postgres runs an strcmp tie-breaker on them, and discards the result of the ICU collator (or libc’s, for that matter) in favor of that bytewise comparison.
For instance, we might create this collation:
ks-level1 here means primary collation strength.
Be warned that this syntax with parameters as BCP-47 tags is silently ineffective
with older versions of ICU (53 or older). When unsure of the syntax,
you may pass it to
icu_collation_attributes() to check
how ICU parses it, as shown a bit below in this post.
Anyway, there are five levels of strength, the primary level considering only the base characters so that case and accent differences are ignored.
The point with the strcmp tie-breaker is that the following equality will not be satisfied, contrary to what we might expect if we were not aware of it:
Hopefully we’ll get the ability to make these case-insensitive-and-more collations work seamlessly in a future version of Postgres, but in the meantime, it’s possible to somehow circumvent this problem with functions from icu_ext. Let’s see how.
Comparing strings with functions from icu_ext
The main function is:
icu_compare(string1 text, string2 text [, collator text]).
It returns the result of
collator as the ICU collation.
It’s a signed integer, negative if string1 < string2, zero if string =
string2, and positive if string1 > string2.
When the 3rd argument
collator is present, it’s not the name of a db
collation declared with
CREATE COLLATION, but the value that would
be passed as its locale or lc_collate parameter if we had to
instantiate that collation. In other words, it’s a locale ID
from the point of view of ICU independantly of Postgres
(yes, ICU uses the term “ID” to denote a string argument that is
more less a built-up string)
collator argument is missing, it is the collation associated
string2 that is used for the comparison.
It must be an ICU collation and it must be the same for the two
arguments or the function will error out. This form
with a single argument is significantly faster due to Postgres keeping its
collations “open” (in the sense of
ucol_open()/ucol_close()) for the
duration of the session, whereas the other form with the
collator argument does open and close the ICU collation
at each call.
To get back at the previous example, this time we can witness the equality of strings under the regime of case & accent insensitiveness:
Case-sensitive, accent-insensitive comparisons are also possible:
With an implicit Postgres collation:
id | title
1 | C'est l'été
Handling combining diacritics
In Unicode, accented letters may be written in composed form in decomposed form, the latter being with an unaccented letter followed by the accent character from the Combining Diacritical Marks Block.
The two decomposed forms, NFD or NFKD
are infrequently used in UTF-8 documents, but they’re perfectly valid and
accepted by Postgres. Semantically,
'à' is supposedly equivalent to
E'a\u0300'. At least, ICU collators seem to see them as equal, even
under the strictest comparison level:
(again, the Postgres equal operator disagrees because of the tie-breaker. The point of using functions as an alternative is precisely to bypass it).
Besides, combining characters are not only about accents, since some of them achieve text effects such as strikethrough or underlines. Let’s see an example from within psql, as displayed on a gnome terminal. The query in the screenshot below takes the literal ‘Hello’, inserts combining marks from U+0330 to U+0338 after every character, returning the result and comparing it with the original string with primary and secondary comparison strengths.
In general, ICU functions seem to take combining marks into account when it makes sense, whereas non-ICU functions in Postgres (with regular expressions, for instances) consider a combining mark as a codepoint like any other.
Sorting and grouping
The ORDER BY and GROUP BY clauses are not designed to work with functions comparing two arguments, but they do work with single-argument functions that transform their input into something else. To sort or group results according to all ICU advanced linguistic rules, icu_ext exposes a function that converts a single string into a sort key:
function icu_sort_key(string text [, icu_collation text]) returns bytea
This is the same sort key as what is used implicitly by Postgres to create
indexes involving ICU collations.
The promise of the sort key is that, if
icu_compare(s1, s2, collator) returns X, then
the (faster) bytewise comparison between
icu_sort_key(s1, collator) and
icu_sort_key(s2, collator) returns X too.
The ICU documentation warns that computing a sort key is slow, much slower than doing a single comparison with the same collation. But in the context of ORDER BY in queries an to the extent of my tests, they seem to do quite well.
In fact when looking at the performance of
ORDER BY field COLLATE "icu-coll" versus
ORDER BY icu_sort_key(field, 'collation'), most of the difference appears
to be caused by
icu_sort_key having to parse the collation specification
for each call, especially for complex specifications.
Just like for
icu_compare(), to benefit from Postgres caching collator
objects for the duration of the session, it’s recommended to use the
single-argument form, which relies on the collation of the argument,
for instance, with our “fr-ci” collation declared previously:
=# SELECT icu_sort_key ('Eté' COLLATE "fr-ci")
Still on performance, here’s a comparison of EXPLAIN ANALYZE for
sorting 6.6 million short words (13 characters on average) with
ORDER BY directly on the field:
We can see here that there’s no notable performance degradation
icu_sort_key explicitly. In fact in this example it’s
slightly faster for some reason.
GROUP BY or
DISTINCT ON can use sort keys, too:
Using sort keys in indexes
The ordered position or unicity of a piece of text under a certain collation is equivalent to
the ordered position or unicity of its binary sort key computed with that collation.
So it’s possible to create an index, including to enforce a unique constraint,
icu_sort_key(column, collator) rather than just
to work around the “non-binary-equal is not equal” rule for
Using the previous example of with the books table, we could do:
=# CREATE INDEX ON books (icu_sort_key(title));
and then getting it used for exact matches with
=# SELECT title FROM books WHERE
icu_sort_key(title) = icu_sort_key('cortege' collate "mycoll");
Just to check that our index is effectively used:
=# explain select title from books where icu_sort_key(title)=icu_sort_key('cortege' collate "mycoll");
Bitmap Heap Scan on books (cost=4.30..10.64 rows=2 width=29)
Recheck Cond: (icu_sort_key(title) = '\x2d454b4f313531'::bytea)
-> Bitmap Index Scan on books_icu_sort_key_idx (cost=0.00..4.29 rows=2 width=0)
Index Cond: (icu_sort_key(title) = '\x2d454b4f313531'::bytea)
As mentioned above, when refering to a collation, old and new versions of ICU don’t necessarily understand the same syntax and tags, and they also tend to ignore anything they can’t parse instead of reporting an error.
To make sure that a collation is correctly spelled or does what you
expect, the output of
icu_collation_attributes() can be examined.
This function takes a collation name as input, collects its
properties, and returns them as a set of (attribute, value) couples with
the display name (probably the most important attribute),
plus tags such as kn/kb/kk/ka/ks/kf/kc, and finally the version of the collation.
Besides comparisons and sort keys,
icu_ext brings SQL wrappers to
a bunch of other ICU functionalities
(see the README.md
in the sources for at least one example per function):
Break down a string into its constituents and return them as sets of text. Basically it’a better version of
regexp_split_to_table(string, regexp)that use linguistic rules recommended by the Unicode standard instead of regexp-based separators.
Output the Unicode name of any character out of the 130K+ of the charset.
icu_confusable_strings_check and icu_spoof_check
Tell if pairs of strings appear similar-looking and if strings has characters that might be meant to deceive a reader.
List all locales with languages and countries in localized versions. Incidentally, that gives country and languages names translated into many languages (Use
icu_set_default_locale()to switch languages).
Spell out numbers in different locales.
icu_transforms_list and icu_transform
Powerful text transformations and transliterations (conversions between scripts). There are more than 600 basic transforms listed by icu_transforms_list and they can be combined with filters. There is an online demo for this service.
Hopefully other functions will be added into icu_ext, as time permits, and more examples on how to use the existing ones. In the meantime, feel free to submit issues or pull requests on github if some functions don’t do what you expect, or if you’d like specific ICU services being exposed and with what interface, or exposed differently than what icu_ext does at the moment…