When trying ICU 73, I’ve noticed that some strings are ordered differently than with the previous version with collations whose versions haven’t changed.

It turns out to be an ICU bug that is due to an uncommon move, as told in the bug’s comments:

this change was basically a cherry-pick from the then-future Unicode 15.1 change […] I think this is the first time (at least for over ten years) that we changed the root sort order without upgrading to a whole new Unicode version.

That’s a problem for Postgres, as we’re counting on these version numbers to change whenever collations change.

The Postgres context

Each time Postgres loads an ICU collation in memory, it compares the version it has stored previously (into pg_collation.collversion orpg_database.datcollversion) against the version currently reported by the library. If they differ, it warns about it with a message like the following, for the database-wide collation, right after connecting:

WARNING:  database "icudb" has a collation version mismatch
DETAIL:  The database was created using collation version X.Y, but the operating system provides version X.Z.
HINT:  Rebuild all objects in this database that use the default collation and run ALTER DATABASE icudb REFRESH COLLATION VERSION, or build PostgreSQL with the right library version.

Or this other kind of message, when the collation is refered to explicitly with a COLLATE clause:

WARNING:  collation "en-x-icu" has version mismatch
DETAIL:  The collation in the database was created using version X.Y, but the operating system provides version X.Z.
HINT:  Rebuild all objects affected by this collation and run ALTER COLLATION pg_catalog."en-x-icu" REFRESH VERSION, or build PostgreSQL with the right library version.

This warning must not be ignored. It means that indexes could be walked using comparisons that might yield different results than when the indexes were read and written to with the previous version. That can quickly lead to silent data corruption. The simplest action to take is to rebuild all indexes that refer to text columns, so that the “new order” sets in before any application code is allowed to interact with the database.

This situation can only arise when the collation library is upgraded, but not necessarily on every upgrade. Some upgrades of libc or ICU might not incur any collation change.

Example of change

With ICU-72 if we compare the two following strings with the collation en-x-icu or in fact many other common collations like und-x-icu, we get this result:

select x  from (values ('A‘B'), ('A‚2')) as _(x)
    order by x collate "en-x-icu";
  x  
-----
 A‘B
 A‚2

select collversion from pg_collation where collname='en-x-icu';
 collversion 
-------------
 153.120

The second characters in the strings are not backquote and comma but U+201A (SINGLE LOW-9 QUOTATION MARK) and U+2018 (LEFT SINGLE QUOTATION MARK). They can also be written in ASCII as values (U&'A\201A2'), (U&'A\20182')

Now with ICU-73:

select x  from (values ('A‘B'), ('A‚2')) as _(x)
    order by x collate "en-x-icu";
  x  
-----
 A‚2
 A‘B

select collversion from pg_collation where collname='en-x-icu';
 collversion 
-------------
 153.120

As we can see, the order of strings has changed, whereas the collation version number has not changed. That’s what we expect cannot happen.

It’s interesting that many collations (331 exactly) have gone from version number 153.120.42 to 153.120.43, probably reflecting the fact that ICU-72 uses the CLDR version 42 and ICU-73 has upgraded to CLDR version 43. However many other collations don’t have that .42 or .43 suffix. This is suggested in the bug comment by Markus Scherer:

I thought that the CLDR or ICU version number also feeds into the ucol_getVersion() return value; I will check. It might not be included for the root (und) collation if the version number comes from a tailoring resource bundle.

But many collations have the 153.120 version number (I’m counting 475 of them) in an initial Postgres installation with ICU-73.

Looking at Postgres/ICU during a Fedora upgrade

ICU-73 does not yet ship in Linux LTS releases, but it does in frequently updated distros. For instance, Fedora 38 comes with PostgreSQL 15 / ICU-72 and Fedora 39 with PostgreSQL 15 / ICU-73.

Let’s see what happens when doing this Fedora upgrade with a database affected by the problem.

Fedora 38

$ cat /etc/redhat-release 
Fedora release 38 (Thirty Eight)

$ postgres -V
postgres (PostgreSQL) 15.4

$ ldd /usr/bin/postgres | grep icu
    libicui18n.so.72 => /lib64/libicui18n.so.72 (0x00007f3bf2600000)
    libicuuc.so.72 => /lib64/libicuuc.so.72 (0x00007f3bf2200000)
    libicudata.so.72 => /lib64/libicudata.so.72 (0x00007f3bf0000000)

$ psql
-- Create an ICU database 
create database icudb icu_locale='en' locale_provider='icu' template 'template0';

\c icudb

-- Import a test dataset (2 million strings)
create table words(w text);
\copy words from 'words.txt'

create index idx on words(w);

Fedora 39

Following the intructions on Upgrading Fedora Using DNF System Plugin, it upgrades smoothly to Fedora 39.

After rebooting, checking the versions:

$ cat /etc/redhat-release 
Fedora release 39 (Thirty Nine)

$ postgres -V
postgres (PostgreSQL) 15.4

$ ldd /usr/bin/postgres | grep icu
	libicui18n.so.73 => /lib64/libicui18n.so.73 (0x00007fed85800000)
	libicuuc.so.73 => /lib64/libicuuc.so.73 (0x00007fed85400000)
	libicudata.so.73 => /lib64/libicudata.so.73 (0x00007fed83000000)

Now we can log in psql to the database with the ICU db-wide collation, and observe these problems:

  • there is no warning about a version mismatch.
  • when searching for certain specific existing values with the index, they’re not found.
  • when searching the same specific values without the index, they’re found.
  • when checking with the btree index checker from amcheck, a corruption is detected.

No warning:

$ psql -d icudb
psql (15.4)
Type "help" for help.

icudb=# \l
                                                 List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    | ICU Locale | Locale Provider |   Access privileges   
-----------+----------+----------+-------------+-------------+------------+-----------------+-----------------------
 icudb     | postgres | UTF8     | fr_FR.UTF-8 | fr_FR.UTF-8 | en         | icu             | 
 postgres  | postgres | UTF8     | fr_FR.UTF-8 | fr_FR.UTF-8 |            | libc            | 
 template0 | postgres | UTF8     | fr_FR.UTF-8 | fr_FR.UTF-8 |            | libc            | =c/postgres          +
           |          |          |             |             |            |                 | postgres=CTc/postgres
 template1 | postgres | UTF8     | fr_FR.UTF-8 | fr_FR.UTF-8 |            | libc            | =c/postgres          +
           |          |          |             |             |            |                 | postgres=CTc/postgres
(4 rows)

Not finding an existing value:

icudb=# select count(*) from words  where w = 's‚8';
 count 
-------
     0
(1 row)

icudb=# explain select count(*) from words  where w = 's‚8';
                                 QUERY PLAN                                 
----------------------------------------------------------------------------
 Aggregate  (cost=4.45..4.46 rows=1 width=8)
   ->  Index Only Scan using idx on words  (cost=0.43..4.45 rows=1 width=0)
         Index Cond: (w = 's‚8'::text)
(3 rows)

Now if telling the planner to not use indexes, the value is found:

SET enable_indexscan TO off;
SET enable_indexonlyscan TO off;
SET enable_bitmapscan TO off;

icudb=# select count(*) from words  where w = 's‚8';
 count 
-------
     1
(1 row)

icudb=# explain select count(*) from words  where w = 's‚8';
                                  QUERY PLAN                                  
------------------------------------------------------------------------------
 Aggregate  (cost=24584.98..24584.99 rows=1 width=8)
   ->  Gather  (cost=1000.00..24584.97 rows=1 width=0)
         Workers Planned: 2
         ->  Parallel Seq Scan on words  (cost=0.00..23584.88 rows=1 width=0)
               Filter: (w = 's‚8'::text)
(5 rows)

Let’s check the index with amcheck which confirms that it’s unusable:

icudb=# create extension amcheck;
CREATE EXTENSION
icudb=# \d
         List of relations
 Schema | Name  | Type  |  Owner   
--------+-------+-------+----------
 public | words | table | postgres
(1 row)

icudb=# select bt_index_check('idx', true);
ERREUR:  item order invariant violated for index "idx"
DETAIL:  Lower index tid=(8,50) (points to heap tid=(12309,79)) higher index tid=(8,51) (points to heap tid=(11717,201)) page lsn=0/C3F76A0.

Reindex, and problem solved:

icudb=# reindex database icudb;
REINDEX

icudb=#  select bt_index_check('idx', true);
 bt_index_check 
----------------
 
(1 row)

icudb=# select count(*) from words  where w = 's‚8';
 count 
-------
     1
(1 row)

icudb=# explain select count(*) from words  where w = 's‚8';
                                 QUERY PLAN                                 
----------------------------------------------------------------------------
 Aggregate  (cost=4.45..4.46 rows=1 width=8)
   ->  Index Only Scan using idx on words  (cost=0.43..4.45 rows=1 width=0)
         Index Cond: (w = 's‚8'::text)
(3 rows)

Conclusion

ICU-74 is at the RC stage, will be released in a few days, and does not have that issue. Chances are many Postgres installations that use ICU collations (certainly a minority, to begin with) will never see an upgrade to ICU-73.

Otherwise once you know it, it’s not really a problem: when upgrading to ICU-73, act as if Postgres issued the above-mentioned warnings.

If you’re on Linux, you’ll probably upgrade glibc at the same time, which since Postgres 15 will cause it to display the warnings for glibc collations that might be missing for ICU (somewhat ironically given that ICU is originally more reliable in that respect).

Databases that use bytewise sorts (collation “C” or “POSIX” or “C.UTF-8” for certain operating systems) for stored contents are not concerned with these upgrade difficulties. In another blog post, I’ll elaborate on why bytewise-sorting collations should be chosen more often in my opinion in Postgres databases, as opposed to the linguistic collations that are often chosen by default.

References

If you don’t use ICU but libc (the default) as the collation provider, you might be interested in these posts about collation changes with libc.

Beware of your next glibc upgrade, on this blog (2018)

Locale data changes, on the PostgreSQL wiki (2018 and later)

Don’t let collation versions corrupt your PostgreSQL indexes by Thomas Munro (2020)

Manage collation changes in PostgreSQL on Amazon Aurora and Amazon RDS by Jonathan Katz (2023)