Looking at the new built-in collation provider in PostgreSQL 17

Among the novelties of PostgreSQL 17, recently released in beta, there’s a built-in UTF-8 locale and collation with binary string comparisons. In this post, let’s see why it’s interesting, how to use it and how it performs.

A third collation provider

Before version 17, we can choose between two providers: the libc library, which comes with the operating system, and the optional ICU library. v17 adds a third provider for collations implemented in Postgres itself.

The announce describes it as:

PostgreSQL 17 includes a built-in collation provider that provides similar sorting semantics to the C collation except with UTF-8 encoding rather than SQL_ASCII. This new collation is guaranteed to be immutable, ensuring that the return values of your sorts won’t change regardless of what system your PostgreSQL installation runs on.

There are two built-in collations, both for UTF-8.

SELECT
  collnamespace::regnamespace,
  collname,
  colllocale,
  pg_encoding_to_char(collencoding) as encoding
FROM pg_collation WHERE collprovider='b';  -- 'b' = 'built-in'

 collnamespace | collname  | colllocale | encoding 
---------------+-----------+------------+----------
 pg_catalog    | ucs_basic | C          | UTF8
 pg_catalog    | pg_c_utf8 | C.UTF-8    | UTF8

ucs_basic already existed in previous versions with the “libc” provider and was identical to the "C" collation in UTF8 databases. It’s the same in version 17, that is:

for string comparisons, it uses the byte order, without any linguistic rule.
for character classification (i.e. knowing if such or such character is a letter, a digit, a lower case or upper case letter…), it answers according to an internal table for any code point below 127 (the ASCII characters), and for the rest, it answers “No” to any of these questions.

So although it’s technically possible to use ucs_basic or "C" in UTF-8 databases, it’s not adequate for non-english contents.

The new C.UTF-8 built-in locale and its pg_c_utf8 collation address this, giving correct results for all code points of the Unicode repertoire at the time of the release (almost 150000 as of Unicode 15.1), and thus allowing proper case conversions and regular expressions with all alphabets.

Now, it’s also the case for the Unicode collations provided by libc and ICU, but the point is that the new locale/collation provides this combination of features in the same locale:

fast, immutable binary sorts (no ICU locale does that)
OS independence (no libc locale provides that except “C”)
Full Unicode character classification (“C” is missing it).

Let’s illustrate this with a Venn diagram: collations Venn diagram

Note that ICU’s independence on the OS does not mean that OS upgrades are transparent with ICU. OS distributions tend to come with the latest version of the ICU library at the time when the distribution is made, so upgrading an OS generally means upgrading ICU at the same time, unless you compile it yourself. OS independence with regard to ICU means that when using the same version of ICU on different operating systems, collations behave identically across these systems.

How to use it

To use the new collation by default when creating an instance or a database, we select it with the locale name C.UTF8 and the provider "builtin" (beware not to confuse it with the C.UTF-8 or C.utf8 locales of libc). At the instance level:

initdb --locale-provider=builtin --locale=C.UTF8 -D /usr/local/pgsql/data

In that case, all databases created in that instance will use that new locale by default.

Otherwise, on an already existing instance initialized with the libc or ICU providers, to create a specific database with the new built-in locale, we should do it like this:

CREATE DATABASE test locale_provider='builtin' builtin_locale='C.UTF8' template='template0';

If you wonder why we need to force template0 (instead of the default template1), it’s because the server assumes that we might have added localized data into template1 that might be incompatible with the locale of our new database. template0 contains non-modifiable base catalogs that are locale-agnostic, making it a safe choice, whereas template1 is template0 plus whatever additions a DBA has made.

Finally, from inside a database that doesn’t use that new locale by default, we can add COLLATE pg_c_utf8 clauses at the SQL level on columns or expressions, for instance:

CREATE INDEX idx ON tablename using btree(columnname COLLATE pg_c_utf8);

Then such an index will be considered by the query planner for queries like, for instance:

SELECT * FROM tablename WHERE columnname = 'some-value' COLLATE pg_c_utf8;

Indexing and search performance

Bytewise comparisons uses less CPU than linguistic comparisons. Let’s illustrate this on some data of the Internet Movie Database:

name_basics is a table with about 13 million rows. The primaryName column holds the names of people who worked on movies, with about 10 million distinct values. It’s a good example to sort and search into with different collations.

\d name_basics
                        Table "public.name_basics"
      Column       |         Type         | Collation | Nullable | Default 
-------------------+----------------------+-----------+----------+---------
 sn_soundex        | character varying(5) |           |          | 
 deathYear         | integer              |           |          | 
 primaryProfession | text                 |           |          | 
 nconst            | integer              |           |          | 
 primaryName       | text                 |           |          | 
 birthYear         | integer              |           |          | 
 s_soundex         | character varying(5) |           |          | 
 ns_soundex        | character varying(5) |           |          | 
 knownForTitles    | text                 |           |          | 
Indexes:
    "idx" btree ("primaryName")

Let’s see how much time it takes to build the btree index on “primaryName” on databases created with different locales, on a Linux Debian 12 system (GNU libc 2.36, ICU 72).

indexing performance

How to interpret these differences

libc’s en_US.utf8 is the slowest: it does linguistic comparisons without abbreviated keys ¹
ICU’s en-x-icu does linguistic comparisons, but with the abbreviated keys optimisation.
libc’s C.utf8 does faster binary comparisons, but Postgres does not “know it”, so instead of using an optimized fast path, it calls the slower generic function.
C and the new C.UTF-8 locales are optimized for the fastest execution and perform similarly well.

After indexing the column, let’s see how the queries compare in execution time across databases. For that test, we use pg_bench with a query searching for 1000 values at the time:

SELECT count(*) FROM name_basics WHERE "primaryName" IN
(
...<list of 1000 names previously randomly choosen from the table >...
)

The plan for this query is an Index Only Scan, matching 6572 rows. The median execution times are as follows:

lookup performance

The differences are roughly like in the index creation, except that the gap between en_US.utf8 and en-x-icu is larger when indexing. My theory is that it is because of the abbreviated keys optimization that is disabled with en_US.utf8, versus enabled with ICU collations. Abbreviated keys are used on sorts, but not on lookups.

A recent post on depesz’s blog (How much speed you’re leaving at the table if you use default locale?) also shows a benchmark comparing these collations, but using COLLATE clauses instead of defining the locale at the database level, as I do for the numbers reported above. Depesz notes that the binary comparisons perform better, as expected, but also that the built-in pg_c_utf8 collation performs significantly worse than the "C" collation, and wonders why. The answer comes as a follow-up by Jeff Davis on the pgsql-hackers list: Speed up collation cache. In short, comparisons with non-default collations currently imply cache lookups that could be better optimized. Hopefully this will be improved soon, but still, it’s good to remember that the database locale should be prefered over COLLATE clauses for faster results.

Sorting of results

The main drawback of the binary-sorting collations is that they don’t sort the “human way”. For instance, let’s say we want to output all primaryName starting with “Adèle” or “Adele”, in a sorted list. We’re going to make a query like this:

SELECT DISTINCT "primaryName" FROM name_basics
 WHERE "primaryName" ~ '^Adèle' OR "primaryName" ~ '^Adele'
 ORDER BY "primaryName" ;

With a binary sort order, we’re getting this output:

        primaryName          
------------------------------
 Adele
 Adele & The French StarKids
 Adele Aalto
 Adele Abbott
 Adele Abinante
[...]
 Adèle de Fontbrune
 Adèle de Mesnard
 Adèle de la Fuente
 Adèle van Biljon
 Adèle-Elise Prévost
(1342 rows)

The “Adele” are grouped together in the first part of the list, and the “Adèle” are in a second part. That’s because the code point of the è letter (U+00E8) is greater than the code point of the e letter (U+00065).

When using a linguistic collation like fr_FR.utf8, the output is better:

         primaryName          
------------------------------
 Adele
 Adèle
 Adele Aalto
 Adele Abbott
 Adele Abinante
 Adele Abou Ali
 Adele Aburrow
 Adele Adams
 Adele Adderley
 Adele Addison
 Adele Addo
 Adele Adeshayo
 Adele Adkins
 Adèle Ado
[...]
 Adele Zeiner
 Adele Zin
 Adele Zoppis
 Adèle Zouane
 Adele Zupicic
(1342 rows)

In this list, the accented and non-accented e letters are together, and it’s the last name that mostly drives the order. It’s clearly what we (humans) prefer to see.

To get this sort in a binary-sorting database, a COLLATE clause with a linguist collation should be added to the ORDER BY. It could be

ORDER BY "primaryName" COLLATE "en_US.utf8" (or some other libc collation). It works only if that collation does exist in the database, which depends on the OS and the installation, so it’s not portable. It also forces a specific language/country, which might not be what an application wants.
ORDER BY "primaryName" COLLATE "en-x-icu" with an ICU collation. The advantage is that every Postgres instance with ICU enabled will have this collation, so it’s almost portable. However this again forces a specific language.
ORDER BY "primaryName" COLLATE "unicode", that exists since Postgres 16. It calls for the “root” collation that is designed to work well across all languages. With Postgres 15 and older, we can use the less easy to memorize "und-x-icu".

In practice, modern client applications are likely to be equipped with sort capabilities that make them not rely on database sorts. For instance a Javascript modern app might use Navigator.language and Intl.Collator to sort according to the browser’s language. It will be more natural for most developers to get Javascript to sort the list than to find a database collation matching the browser’s language and inject it into the SQL query. If a list of results might be too large to be handled by the browser, then sure, the database-side sort is back in the game.

The other benefit of binary-sorted text indexes

Besides being faster, binary-sorted indexes supports left-anchored searches directly. In the previous query, if we have only a linguistic-sorted index on "primaryName", we’re getting a slow sequential scan:

EXPLAIN ANALYZE  select distinct "primaryName" from name_basics
  where "primaryName" ~ '^Adèle' or "primaryName" ~ '^Adele'
  order by "primaryName";

Unique  (cost=243279.95..243541.75 rows=2174 width=14) (actual time=4354.978..4356.406 rows=1342 loops=1)
   ->  Gather Merge  (cost=243279.95..243536.32 rows=2174 width=14) (actual time=4354.978..4356.212 rows=1398 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Unique  (cost=242279.92..242285.36 rows=1087 width=14) (actual time=4346.459..4346.578 rows=466 loops=3)
               ->  Sort  (cost=242279.92..242282.64 rows=1087 width=14) (actual time=4346.453..4346.482 rows=487 loops=3)
                     Sort Key: "primaryName"
                     Sort Method: quicksort  Memory: 25kB
                     Worker 0:  Sort Method: quicksort  Memory: 25kB
                     Worker 1:  Sort Method: quicksort  Memory: 25kB
                     ->  Parallel Seq Scan on name_basics  (cost=0.00..242225.11 rows=1087 width=14) (actual time=9.747..4345.114 rows=487 loops=3)
                           Filter: (("primaryName" ~ '^Adèle'::text) OR ("primaryName" ~ '^Adele'::text))
                           Rows Removed by Filter: 4345599
 Planning Time: 0.186 ms
 Execution Time: 4356.494 ms

The advice given by the documentation to accelerate this is to create a second index with varchar_pattern_ops:

The operator classes text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops support B-tree indexes on the types text, varchar, and char respectively. The difference from the default operator classes is that the values are compared strictly character by character rather than according to the locale-specific collation rules. This makes these operator classes suitable for use by queries involving pattern matching expressions (LIKE or POSIX regular expressions) when the database does not use the standard “C” locale. As an example, you might index a varchar column like this: CREATE INDEX test_index ON test_table (col varchar_pattern_ops);

The good news is that the new C.UTF-8 built-in locale and pg_c_utf8 collation benefit from the same advantage as the standard “C” locale: they can be used directly for left-anchored search, as shown with the execution plan with such an index:

Unique  (cost=162.45..175.49 rows=2607 width=14) (actual time=30.616..31.048 rows=1342 loops=1)
   ->  Sort  (cost=162.45..168.97 rows=2608 width=14) (actual time=30.612..30.699 rows=1461 loops=1)
         Sort Key: "primaryName"
         Sort Method: quicksort  Memory: 49kB
         ->  Bitmap Heap Scan on name_basics  (cost=10.44..14.46 rows=2608 width=14) (actual time=0.571..26.297 rows=1461 loops=1)
               Recheck Cond: (("primaryName" ~ '^Adèle'::text) OR ("primaryName" ~ '^Adele'::text))
               Filter: (("primaryName" ~ '^Adèle'::text) OR ("primaryName" ~ '^Adele'::text))
               Heap Blocks: exact=1451
               ->  BitmapOr  (cost=10.44..10.44 rows=1 width=0) (actual time=0.356..0.357 rows=0 loops=1)
                     ->  Bitmap Index Scan on idx  (cost=0.00..4.57 rows=1 width=0) (actual time=0.126..0.127 rows=268 loops=1)
                           Index Cond: (("primaryName" >= 'Adèle'::text) AND ("primaryName" < 'Adèlf'::text))
                     ->  Bitmap Index Scan on idx  (cost=0.00..4.57 rows=1 width=0) (actual time=0.227..0.227 rows=1193 loops=1)
                           Index Cond: (("primaryName" >= 'Adele'::text) AND ("primaryName" < 'Adelf'::text))
 Planning Time: 0.473 ms
 Execution Time: 31.132 ms

Conclusion

Let’s summarize the main collation characteristics in this table:

	“C” or ucs_basic	libc collations other than “C”	ICU collations	pg_c_utf8
Portability between OSes?	✅ Yes	❌ No	✅ Yes	✅ Yes
Bytewise comparisons?	✅ Yes	❓ Some	❌ No	✅ Yes
Abbreviated keys? ¹	-	❌ No	✅ Yes	-
Transparent OS updates? ²	✅ Yes	❌ No	❌ No	✅ Yes
Unicode classification?	❌ No	✅ Yes	✅ Yes	✅ Yes
Linguistic sort?	❌ No	✅ Yes	✅ Yes	❌ No
Advanced comparisons? ³	❌ No	❌ No	✅ Yes	❌ No

The new built-in provider in Postgres 17 provides a fast, portable, easier-to-upgrade C.UTF-8 locale / pg_c_utf8 collation. Although it does not have the more advanced capabilities we get with the ICU provider, it’s likely to be a good default locale for Unicode databases.

For more on the subject, I recommend watching this excellent presentation on YouTube: Collations from A to Z (PGConf.dev 2024) by Jeremy Schneider and Jeff Davis. Jeremy first presents quite a list of unintuitive facts and paint points related to collations (with a few references to this blog I noticed 🤓 ) and then Jeff, who is the author of this Postgres 17 new feature, details what can be done with existing collations and how the new locale fits into that context.

Notes

Abbreviated keys are binary representations of strings to accelerate sorts with linguistic collations. See Peter Geoghegan’s post Abbreviated keys: exploiting locality to improve PostgreSQL's text sort performance (from 2015) explaining that feature. In theory, libc collations could use them too, but the implementations have been found buggy and therefore unsafe to use for indexing. ↩ ↩²
Strictly speaking, OS updates are not transparent when they imply an Unicode upgrade. Even with binary sorts solving the main issue, the addition of code points in Unicode can theorically change the results of check constraints involving regular expressions, case conversion or normalization rules. ↩
Advanced comparisons means all features allowed by ICU non-deterministic collations, among which case-insensitive and accent-insensitive comparisons. ↩