Nondeterministic collations
Since version 12, PostgreSQL collations are created with a parameter named deterministic, that can be true or false, so that collations are now either deterministic (which they are by default), or nondeterministic.
What does that mean? This term refers to what Unicode calls deterministic comparisons between strings:
This is a comparison where strings that do not have identical binary contents (optionally, after some process of normalization) will compare as unequal
So before version 12, comparisons for collatable types in Postgres are always deterministic according to the above definition. Specifically, when the underlying collation provider (libc or ICU) reports that two strings are equal, a tie-breaker bytewise comparison is performed, so that it’s only when the strings consist of identical binary contents that they are truly equal for Postgres.
Starting with version 12, the new “deterministic” property can be set
to false
at CREATE COLLATION
time to request that string comparisons
skip the tie-breaker, so that the memory representations being different
is not an obstacle to recognize strings as equal when the underlying locale
says they are.
This does not only affect direct comparisons or lookups through WHERE
clauses, but also the results of GROUP BY, ORDER BY, DISTINCT,
PARTITION BY, unique constraints, and everything implying the equality
operator.
So what can be achieved with nondeterministic collations?
The most obvious features are case-insensitive and accent-insensitive
matching implemented with COLLATE clauses, as opposed to calling
explicit functions to do case-mapping
(upper
, lower
)
and removal of accents (unaccent
).
Now that these are accessible through the collation service,
the traditional recommendation to use the citext datatype for
case-insensitive lookups may start to be reconsidered.
Beyond that, nondeterministic collations allow to match strings that are canonically equivalent (differing only by which Unicode normal form they use), or differ only by compatible sequences, or by punctuation, or by non-displayable characters.
Except for the canonical equivalence, these matching features are
optional, and they’re activated by declaring collation attributes
inside the locale
parameter, especially the comparison levels.
Unicode Technical Report #35 provides a table of collation settings with BCP47 keys and values,
but the examples in this post will use ICU “old-style” attributes:
colStrength
, colCaseLevel
, colAlternate
rather than “new-style” keys (respectively ks
, kc
, ka
). This is because the
former work with all versions of ICU, whereas the latter work only when PostgreSQL is built
with ICU version 54 or later (released in 2014). It appears that pre-compiled binaries for Windows are currently built with ICU version 53,
so it’s better to stick to the old-style syntax at least for them.
Now, let’s go through a list of fancy comparison features that are enabled by nondeterministic collations.
1. Equality between canonically equivalent sequences of code points
This is a requirement of Unicode that PostgreSQL was not able to fulfill until now. As explained in Unicode equivalence (wikipedia):
Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E (the Latin lowercase “n”) followed by U+0303 (the combining tilde “◌̃”) is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter “ñ” of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other.
Nondeterministic collations will recognize canonically equivalent
sequences as equal without requiring any particular collation attribute
in the locale
argument.
The example below uses a language-agnostic locale: an empty string,
that selects the root collation. und
may also be used, as the 3-letter BCP-47 tag for “undefined”. Otherwise a language code may be used, optionally followed by a script code and a region code, such as 'fr-CA'
for “french as spoken in Canada”.
Example of canonical equivalence between NFD and NFC forms:
CREATE COLLATION nd (
provider = 'icu',
locale='', -- or 'und' (no language or region specified)
deterministic = false
);
SELECT s1, s2, s1 = s2 COLLATE nd AS equal FROM
(VALUES (E'El Nin\u0303o', E'El Ni\u00F1o')) AS s(s1,s2);
s1 | s2 | equal
---------+---------+-------
El Niño | El Niño | t
By contrast, with any deterministic collation, we would get f
for false in the equal
column, since these strings s1
and s2
are bytewise unequal.
2. Equality between compatible sequences of code points
Besides being equivalent, sequences of code points can be merely compatible, in which case they can optionally be considered as equal.
Quoting another part of the above-linked wikipedia entry:
Sequences that are defined as compatible are assumed to have possibly distinct appearances, but the same meaning in some contexts. Thus, for example, the code point U+FB00 (the typographic ligature “ff”) is defined to be compatible—but not canonically equivalent—to the sequence U+0066 U+0066 (two Latin “f” letters)
At tertiary strength (the default), these sequences are not equal.
Let’s see this in SQL, reusing the "nd"
collation previously defined:
SELECT s1, s2, s1 = s2 COLLATE nd AS equal FROM
(VALUES ('shelffull', E'shel\ufb00ull')) AS s(s1,s2);
s1 | s2 | equal
-----------+----------+-------
shelffull | shelffull | f
But at secondary strength, these sequences compare as equal:
CREATE COLLATION nd2 (
provider = 'icu',
locale = '@colStrength=secondary', -- or 'und-u-ks-level2'
deterministic = false
);
SELECT s1, s2, s1 = s2 COLLATE nd2 AS equal FROM
(values ('shelffull', E'shel\ufb00ull')) AS s(s1,s2);
s1 | s2 | equal
-----------+----------+-------
shelffull | shelffull | t
3. Ignoring case
The most typical use case for nondeterministic collations is probably the case-insensitive comparison. At secondary strength, strings that differ by case compare as equal:
SELECT s1, s2, s1 = s2 COLLATE nd2 AS equal FROM
(values ('Abc', 'ABC')) AS s(s1,s2);
s1 | s2 | equal
-----+-----+-------
Abc | ABC | t
4. Ignoring case and accents
Strings that differ by accents or case (or both) compare as equal at primary strength:
CREATE COLLATION nd1 (
provider = 'icu',
locale = '@colStrength=primary', -- or 'und-u-ks-level1'
deterministic = false
);
SELECT s1, s2, s1 = s2 COLLATE nd1 AS "equal-nd1",
s1 = s2 COLLATE nd2 AS "equal-nd2"
FROM (values ('Été', 'ete')) AS s(s1,s2);
s1 | s2 | equal-nd1 | equal-nd2
-----+-----+-----------+-----------
Été | ete | t | f
5. Ignoring accents but not case
It’s possible to ignore accents but not case by staying at primary strength, but setting a boolean attribute to the collation: colCaseLevel.
Example:
CREATE COLLATION nd2c (
provider = 'icu',
locale = 'und@colStrength=primary;colCaseLevel=yes' , -- or 'und-u-ks-level1-kc'
deterministic = false
);
SELECT 'Ete' = 'Eté' COLLATE nd2c AS eq1,
'Ete' = 'ete' COLLATE nd2c AS eq2;
eq1 | eq2
-----+-----
t | f
6. Ignoring spaces and punctuation
The simplest option is to ignore punctuation completely, or “blank” it as refered to in “Ignore Punctuation” Options in ICU documentation.
This is done by activating “Alternate Handling” at strength levels 1 to 3.
Since colStrength=tertiary
by default, we can leave it unspecified:
CREATE COLLATION "nd3alt" (
provider = 'icu',
locale='und@colAlternate=shifted',
deterministic = false
);
SELECT '{your-name?}' = 'your name' COLLATE "nd3alt" AS equal;
equal
-------
t
7. Matching compatible symbols
colAlternate
set to shifted
at the quaternary comparison level
may also be used to recognize equality
between punctuation or symbols that are linguistically equivalent, but appear as
distinct sequences of code points.
For instance HORIZONTAL ELLIPSIS (U+2026) is equivalent to three consecutive
ascii dots (FULL STOP, U+002E), and FULLWIDTH COMMERCIAL AT (U+FF20) is
equivalent to COMMERCIAL AT (U+0040) as used in ASCII email addresses.
CREATE COLLATION "nd4alt" (
provider = 'icu',
locale='und@colStrength=quaternary;colAlternate=shifted',
deterministic = false
);
SELECT 'Wow…!' = 'Wow...!' COLLATE "nd4alt" AS equal;
equal
-------
t
8. Ignoring code points assigned to invisible characters
At strength level 3 or below, code points
in the ranges [\u0001-\u0008]
, [\u000E-\u001F]
[\u007f-\u009F]
(control characters) are ignored in comparisons.
This is also true of code points for spacing characters such as
(to list just a few plausible ones):
- SOFT HYPHEN (U+00AD)
- ZERO WIDTH SPACE (U+200B)
- INVISIBLE SEPARATOR (U+2063)
- LEFT-TO-RIGHT MARK (U+200E)
- RIGHT-TO-LEFT MARK (U+200F)
- WORD JOINER (U+2060)
- …and many more…
Example:
SELECT s1, s2, s1 = s2 COLLATE nd AS equal FROM
(VALUES ('ABC', E'\u200eA\u0001B\u00adC')) AS s(s1,s2);
s1 | s2 | equal
-----+-----------+-------
ABC | A\x01BC | t
To have these code points not ignored, the comparison strength should
be set at the maximum level, that is colStrength=identical
(or
ks-identic
with the tags syntax).
At this level, the only difference with binary equality is the case of strings that
differ only by canonically equivalent sequences.
CREATE COLLATION "nd-identic" (
provider = 'icu',
locale='und@colStrength=identical', -- or und-u-ks-identic
deterministic = false
);
SELECT 'abc' = E'a\u0001bc' COLLATE "nd-identic" AS equal;
equal
-------
f
Transforming (beyond collations)
German umlauts are sometimes converted into sequences of US-ASCII letters like this:
- ü => ue, Ü => Ue
- ö => oe, Ö => Oe
- ä => ae, Ä => Ae
These equivalences are not recognized as equal sequences by ICU collations, even at primary strength and specifying German (de) as the language. On the other hand, ß (sharp s) and ss are equal at primary strength.
Starting with version 60, ICU provides de-ASCII as a built-in
transform rule.
Transforms are provided by a different service than collations, which is not
exposed by PostgreSQL core (see icu_transform()
in icu_ext if you need that, or more generally transliterations between scripts).