Beware of your next glibc upgrade

GNU libc 2.28, released on August 1, 2018, has among its new features a major update of its Unicode locale data with new collation information.

From the announcement:

The localization data for ISO 14651 is updated to match the 2016 Edition 4 release of the standard, this matches data provided by Unicode 9.0.0. This update introduces significant improvements to the collation of Unicode characters. […] With the update many locales have been updated to take advantage of the new collation information. The new collation information has increased the size of the compiled locale archive or binary locales.

For Postgres databases using language and region-sensitive collations, which tend to be the default nowadays, it means that certain strings might sort differently after this upgrade. A critical consequence is that indexes that depend on such collations must be rebuilt immediately after the upgrade. Servers in WAL-based/streaming replication setups should also be upgraded together since a standby must run the same libc/locales as its primary.

The risk otherwise is index corruption issues, as mentioned for instance in these two threads from pgsql-general: “Issues with german locale on CentOS 5,6,7”, and “The dangers of streaming across versions of glibc: A cautionary tale”

So while this issue is not new, what’s special about glibc-2.28 is the scale of the update in locales, which is unprecedented in recent times. Previously and since year 2000, according to bug#14095, the locale data in glibc were modified on a case-by-case basis. This time, there’s a big merge to close the gap with the standard.

To get a feel for the extent of these changes, I’ve installed ArchLinux which already has glibc-2.28, along with PostgreSQL 10.5, and compared some query results against the same Postgres on Debian 9 (“stretch”) with glibc-2.24.

I expected changes, but not so broad. Simple tests on plain ASCII strings reveal obvious differences immediately. For instance, with the en_US.UTF-8 locale:

Debian stretch (glibc 2.24)

=# select version();
                                                             version                                                              
----------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 10.5 (Debian 10.5-1.pgdg90+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit
(1 row)

=# show lc_collate ;
 lc_collate  
-------------
 en_US.UTF-8
(1 row)

=# SELECT * FROM (values ('a'), ('$a'), ('a$'), ('b'), ('$b'), ('b$'), ('A'), ('B'))
   AS l(x) ORDER BY x ;
 x  
----
 a
 $a
 a$
 A
 b
 $b
 b$
 B
(6 rows)

ArchLinux (glibc 2.28):

=# select version();
                                   version                                   
-----------------------------------------------------------------------------
 PostgreSQL 10.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.2.0, 64-bit
(1 row)

=# show lc_collate;
 lc_collate  
-------------
 en_US.UTF-8
(1 row)

=# SELECT * FROM (values ('a'), ('$a'), ('a$'), ('b'), ('$b'), ('b$'), ('A'), ('B'))
   AS l(x) ORDER BY x ;
 x  
----
 $a
 $b
 a
 A
 a$
 b
 B
 b$
(6 rows)

The changes are not limited to UTF-8 locales. The above differences also occur with LATIN9 encoding and lc_collate = 'fr_FR.iso885915@euro', for instance. Here’s an even simpler query showing other non-alphabetic characters giving different comparison results across versions:

Debian stretch (glibc 2.24)

=# SELECT * FROM (values ('"0102"'), ('0102')) AS x(x)
   ORDER BY x;
   x    
--------
 0102
 "0102"
(2 rows)

ArchLinux (glibc 2.28):

=# SELECT * FROM (values ('"0102"'), ('0102')) AS x(x)
   ORDER BY x;
   x    
--------
 "0102"
 0102
(2 rows)

The above query is one I liked to use to illustrate the difference between FreeBSD and Linux/glibc. The en_US collation in FreeBSD 11 used to give opposite results on this than glibc, but now it turns out that the new glibc gives the same results…

Of course people generally don’t upgrade libc on their own initiative, it happens as part of a system upgrade. Whether that system upgrade includes a Postgres upgrade, and whether it includes a dump/reload or database-wide REINDEXes, that remains to be checked by administrators. pg_upgrade does not reindex automatically nor mentions the need to reindex in that particular situation.

As of this writing, only “bleeding edge” distros like ArchLinux already ship glibc 2.28. For Fedora it’s scheduled for October 30, 2018; Debian has 2.27-5 in testing, and Ubuntu “cosmic” (18.10) has 2.27-3.

In any case, Linux users, you certainly want to check if your databases are concerned by these collation updates, and if yes, watch out for when glibc 2.28 is landing on your systems and prepare an upgrade scenario to avoid any risk of data corruption!

To know which collations each database use by default:

 SELECT datname, datcollate FROM pg_database;

To know which collations are in use in indexes (to run in each database):

SELECT distinct collname FROM pg_collation JOIN
  (SELECT regexp_split_to_table(n::text,' ')::oid  AS o
    FROM (SELECT distinct indcollation AS n FROM pg_index) AS a) AS b on o=oid
 -- WHERE collprovider <> 'i'
;

Uncomment the last line with Postgres 10 or newer to filter out ICU collations if needed (ICU collations are not concerned with glibc upgrades).