Since version 12, PostgreSQL collations are created with a parameter named deterministic, that can be true or false, so that collations are now either deterministic (which they are by default), or nondeterministic.

What does that mean? This term refers to what Unicode calls deterministic comparisons between strings:

This is a comparison where strings that do not have identical binary contents (optionally, after some process of normalization) will compare as unequal

So before version 12, comparisons for collatable types in Postgres are always deterministic according to the above definition. Specifically, when the underlying collation provider (libc or ICU) reports that two strings are equal, a tie-breaker bytewise comparison is performed, so that it’s only when the strings consist of identical binary contents that they are truly equal for Postgres.

Starting with version 12, the new “deterministic” property can be set to false at CREATE COLLATION time to request that string comparisons skip the tie-breaker, so that the memory representations being different is not an obstacle to recognize strings as equal when the underlying locale says they are. This does not only affect direct comparisons or lookups through WHERE clauses, but also the results of GROUP BY, ORDER BY, DISTINCT, PARTITION BY, unique constraints, and everything implying the equality operator.

So what can be achieved with nondeterministic collations?

The most obvious features are case-insensitive and accent-insensitive matching implemented with COLLATE clauses, as opposed to calling explicit functions to do case-mapping (upper, lower) and removal of accents (unaccent). Now that these are accessible through the collation service, the traditional recommendation to use the citext datatype for case-insensitive lookups may start to be reconsidered.

Beyond that, nondeterministic collations allow to match strings that are canonically equivalent (differing only by which Unicode normal form they use), or differ only by compatible sequences, or by punctuation, or by non-displayable characters.

Except for the canonical equivalence, these matching features are optional, and they’re activated by declaring collation attributes inside the locale parameter, especially the comparison levels.

Unicode Technical Report #35 provides a table of collation settings with BCP47 keys and values, but the examples in this post will use ICU “old-style” attributes: colStrength, colCaseLevel, colAlternate rather than “new-style” keys (respectively ks, kc, ka). This is because the former work with all versions of ICU, whereas the latter work only when PostgreSQL is built with ICU version 54 or later (released in 2014). It appears that pre-compiled binaries for Windows are currently built with ICU version 53, so it’s better to stick to the old-style syntax at least for them.

Now, let’s go through a list of fancy comparison features that are enabled by nondeterministic collations.

1. Equality between canonically equivalent sequences of code points

This is a requirement of Unicode that PostgreSQL was not able to fulfill until now. As explained in Unicode equivalence (wikipedia):

Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E (the Latin lowercase “n”) followed by U+0303 (the combining tilde “◌̃”) is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter “ñ” of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other.

Nondeterministic collations will recognize canonically equivalent sequences as equal without requiring any particular collation attribute in the locale argument.

The example below uses a language-agnostic locale: an empty string, that selects the root collation. und may also be used, as the 3-letter BCP-47 tag for “undefined”. Otherwise a language code may be used, optionally followed by a script code and a region code, such as 'fr-CA' for “french as spoken in Canada”.

Example of canonical equivalence between NFD and NFC forms:

CREATE COLLATION nd (
   provider = 'icu',
   locale='',   -- or 'und' (no language or region specified)
   deterministic = false
);

SELECT s1, s2, s1 = s2 COLLATE nd AS equal FROM
  (VALUES (E'El Nin\u0303o', E'El Ni\u00F1o')) AS s(s1,s2);
   s1    |   s2    | equal 
---------+---------+-------
 El Niño | El Niño | t

By contrast, with any deterministic collation, we would get f for false in the equal column, since these strings s1 and s2 are bytewise unequal.

2. Equality between compatible sequences of code points

Besides being equivalent, sequences of code points can be merely compatible, in which case they can optionally be considered as equal.

Quoting another part of the above-linked wikipedia entry:

Sequences that are defined as compatible are assumed to have possibly distinct appearances, but the same meaning in some contexts. Thus, for example, the code point U+FB00 (the typographic ligature “ff”) is defined to be compatible—but not canonically equivalent—to the sequence U+0066 U+0066 (two Latin “f” letters)

At tertiary strength (the default), these sequences are not equal. Let’s see this in SQL, reusing the "nd" collation previously defined:

SELECT s1, s2, s1 = s2 COLLATE nd AS equal FROM
 (VALUES ('shelffull', E'shel\ufb00ull')) AS s(s1,s2);
    s1     |    s2    | equal 
-----------+----------+-------
 shelffull | shelffull | f

But at secondary strength, these sequences compare as equal:

CREATE COLLATION nd2 (
  provider = 'icu',
  locale = '@colStrength=secondary', -- or 'und-u-ks-level2'
  deterministic = false
);

SELECT s1, s2, s1 = s2 COLLATE nd2 AS equal FROM
 (values ('shelffull', E'shel\ufb00ull')) AS s(s1,s2);
    s1     |    s2    | equal 
-----------+----------+-------
 shelffull | shelffull | t

3. Ignoring case

The most typical use case for nondeterministic collations is probably the case-insensitive comparison. At secondary strength, strings that differ by case compare as equal:

SELECT s1, s2, s1 = s2 COLLATE nd2 AS equal FROM
 (values ('Abc', 'ABC')) AS s(s1,s2);
 s1  | s2  | equal 
-----+-----+-------
 Abc | ABC | t

4. Ignoring case and accents

Strings that differ by accents or case (or both) compare as equal at primary strength:

CREATE COLLATION nd1 (
  provider = 'icu',
  locale = '@colStrength=primary', -- or 'und-u-ks-level1'
  deterministic = false
);
  
SELECT s1, s2, s1 = s2 COLLATE nd1 AS "equal-nd1",
               s1 = s2 COLLATE nd2 AS "equal-nd2"
 FROM (values ('Été', 'ete')) AS s(s1,s2);
 s1  | s2  | equal-nd1 | equal-nd2 
-----+-----+-----------+-----------
 Été | ete | t         | f

5. Ignoring accents but not case

It’s possible to ignore accents but not case by staying at primary strength, but setting a boolean attribute to the collation: colCaseLevel.

Example:

CREATE COLLATION nd2c (
  provider = 'icu',
  locale = 'und@colStrength=primary;colCaseLevel=yes' , -- or 'und-u-ks-level1-kc'
  deterministic = false
);

SELECT 'Ete' = 'Eté' COLLATE nd2c AS eq1,
       'Ete' = 'ete' COLLATE nd2c AS eq2;
 eq1 | eq2 
-----+-----
 t   | f

6. Ignoring spaces and punctuation

The simplest option is to ignore punctuation completely, or “blank” it as refered to in “Ignore Punctuation” Options in ICU documentation.

This is done by activating “Alternate Handling” at strength levels 1 to 3. Since colStrength=tertiary by default, we can leave it unspecified:

CREATE COLLATION "nd3alt" (
  provider = 'icu',
  locale='und@colAlternate=shifted',
  deterministic = false
);

SELECT '{your-name?}' = 'your name' COLLATE "nd3alt" AS equal;
 equal
-------
 t

7. Matching compatible symbols

colAlternate set to shifted at the quaternary comparison level may also be used to recognize equality between punctuation or symbols that are linguistically equivalent, but appear as distinct sequences of code points. For instance HORIZONTAL ELLIPSIS (U+2026) is equivalent to three consecutive ascii dots (FULL STOP, U+002E), and FULLWIDTH COMMERCIAL AT (U+FF20) is equivalent to COMMERCIAL AT (U+0040) as used in ASCII email addresses.

CREATE COLLATION "nd4alt" (
  provider = 'icu',
  locale='und@colStrength=quaternary;colAlternate=shifted',
  deterministic = false
);

SELECT 'Wow…!' = 'Wow...!' COLLATE "nd4alt" AS equal;
 equal
-------
 t

8. Ignoring code points assigned to invisible characters

At strength level 3 or below, code points in the ranges [\u0001-\u0008], [\u000E-\u001F] [\u007f-\u009F] (control characters) are ignored in comparisons. This is also true of code points for spacing characters such as (to list just a few plausible ones):

  • SOFT HYPHEN (U+00AD)
  • ZERO WIDTH SPACE (U+200B)
  • INVISIBLE SEPARATOR (U+2063)
  • LEFT-TO-RIGHT MARK (U+200E)
  • RIGHT-TO-LEFT MARK (U+200F)
  • WORD JOINER (U+2060)
  • …and many more…

Example:

SELECT s1, s2, s1 = s2 COLLATE nd AS equal FROM
  (VALUES ('ABC', E'\u200eA\u0001B\u00adC')) AS s(s1,s2);
 s1  |    s2     | equal 
-----+-----------+-------
 ABC | ‎A\x01B­C | t

To have these code points not ignored, the comparison strength should be set at the maximum level, that is colStrength=identical (or ks-identic with the tags syntax). At this level, the only difference with binary equality is the case of strings that differ only by canonically equivalent sequences.

CREATE COLLATION "nd-identic" (
  provider = 'icu',
  locale='und@colStrength=identical', -- or und-u-ks-identic
  deterministic = false
);

SELECT 'abc' = E'a\u0001bc' COLLATE "nd-identic" AS equal;
 equal 
-------
 f

Transforming (beyond collations)

German umlauts are sometimes converted into sequences of US-ASCII letters like this:

  • ü => ue, Ü => Ue
  • ö => oe, Ö => Oe
  • ä => ae, Ä => Ae

These equivalences are not recognized as equal sequences by ICU collations, even at primary strength and specifying German (de) as the language. On the other hand, ß (sharp s) and ss are equal at primary strength.

Starting with version 60, ICU provides de-ASCII as a built-in transform rule. Transforms are provided by a different service than collations, which is not exposed by PostgreSQL core (see icu_transform() in icu_ext if you need that, or more generally transliterations between scripts).