What Unicode versions do we use?
With its three locale providers (libc, icu, and builtin), a PostgreSQL instance can rely on up to three different versions of Unicode at the same time.
When only storing, retrieving, or doing basic operations on UTF-8 strings, these versions don't matter. Postgres checks that the bytes in the strings constitute valid UTF-8 sequences, but not whether the code points corresponding to these sequences are assigned.
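To see the distinction outside of Postgres, here is a small Python sketch (Python's `unicodedata` module ships its own copy of the Unicode tables): a byte sequence can be valid UTF-8 while encoding a code point that no Unicode version has assigned.

```python
import unicodedata

# A perfectly valid 2-byte UTF-8 sequence...
s = b'\xcd\xb8'.decode('utf-8')          # decodes without error

# ...yet the code point it encodes, U+0378, is unassigned:
# its general category is "Cn" (other, not assigned).
print(hex(ord(s)))                       # 0x378
print(unicodedata.category(s))           # Cn
```

This mirrors what Postgres does on input: the UTF-8 validity check passes, and only assignment-aware features such as the `unicode_assigned()` function in the table would flag U+0378.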
It starts to matter when interpreting these code points as characters. The features that operate at that level of interpretation are:
| Functionality | Unicode provider | PG version | Collation matters? |
|---|---|---|---|
| upper, lower, initcap | libc, icu, builtin | all | Yes |
| casefold | icu, builtin 1 | >= 18 | Yes |
| normalization | builtin 2 | >= 13 | No |
| unaccent | builtin 2 | all | No |
| regular expressions | libc, icu, builtin | all | Yes |
| full text search parser | libc | all | No |
| linguistic sorts/comparisons | libc, icu | all | Yes |
| unicode_assigned | builtin | >= 18 | No |
In practice, most text data in our databases is already covered by whatever Unicode version is in use, because even 20 years ago, when Unicode 4.1 was current, it already had almost 100,000 characters from 59 different scripts. Still, the Unicode repertoire continues to expand pretty much every year, with lots of new symbols, emojis, and the inclusion of lesser-known languages.
So when using Postgres, if we care about these features, how do we know which version of Unicode we're working with?
Checking with SQL code
Since PostgreSQL 17, the unicode_version() and icu_unicode_version() functions tell us which Unicode version is supported by, respectively, the builtin provider and ICU. There is no equivalent for the libc provider, because the libc library itself does not expose that information. But we can write our own function in SQL to figure out a pretty good approximation of which Unicode version our libc supports.
The idea is that, for any new letter L added in a given version V of Unicode, the regular expression test L ~ '\w' COLLATE "collname" will match only if the Unicode version that the collname collation is based on is newer than or equal to V.
The following SQL expression returns the Unicode release of the default collation by testing one arbitrary letter per release that added it, from the most recent release down to the oldest.
```sql
case
  when U&'\088F' ~ '\w' then '>= 17'
  when U&'\1C89' ~ '\w' then '16'
  when U&'\+01123F' ~ '\w' then '15'
  when U&'\0870' ~ '\w' then '14'
  when U&'\08BE' ~ '\w' then '13'
  when U&'\0E86' ~ '\w' then '12'
  when U&'\0560' ~ '\w' then '11'
  when U&'\0860' ~ '\w' then '10'
  when U&'\08B6' ~ '\w' then '9'
  when U&'\08B3' ~ '\w' then '8'
  when U&'\037F' ~ '\w' then '7'
  when U&'\0526' ~ '\w' then '6'
  when U&'\0524' ~ '\w' then '5.2'
  when U&'\0370' ~ '\w' then '5.1'
  when U&'\0252' ~ '\w' then '5'
  when U&'\0237' ~ '\w' then '4.1'
  when U&'\0221' ~ '\w' then '4.0'
  when U&'\0220' ~ '\w' then '3.2'
  when U&'\03F4' ~ '\w' then '3.1'
  when U&'\01F6' ~ '\w' then '3.0'
  else '< 3.0' -- released before 1999
end
```
This list misses certain minor Unicode versions because they did not always include new letters, but for practical purposes, it should be good enough.
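For comparison, the same probing trick can be sketched in Python against the interpreter's own Unicode tables, using str.isalpha() in place of the collation-aware \w match. The code points and version labels below are the ones from the SQL expression above; unicodedata.unidata_version gives the authoritative answer to check the probe against.

```python
import unicodedata

# One probe letter per Unicode release that introduced it, newest first
# (same code points as the SQL CASE expression above).
PROBES = [
    (0x088F, '>= 17'), (0x1C89, '16'), (0x1123F, '15'), (0x0870, '14'),
    (0x08BE, '13'), (0x0E86, '12'), (0x0560, '11'), (0x0860, '10'),
    (0x08B6, '9'), (0x08B3, '8'), (0x037F, '7'), (0x0526, '6'),
    (0x0524, '5.2'), (0x0370, '5.1'), (0x0252, '5'), (0x0237, '4.1'),
    (0x0221, '4.0'), (0x0220, '3.2'), (0x03F4, '3.1'), (0x01F6, '3.0'),
]

def probed_unicode_version() -> str:
    """Return the Unicode release of this Python's tables, by probing."""
    for cp, version in PROBES:
        # An unassigned code point is never alphabetic, so the first
        # probe letter that matches reveals the supported release.
        if chr(cp).isalpha():
            return version
    return '< 3.0'

print(probed_unicode_version())
print(unicodedata.unidata_version)   # the authoritative answer
```

On a Python built against Unicode 15.x, for instance, the probe returns '15' while unidata_version reports the exact release.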
A fully fledged function based on this is available on GitHub. It runs this check for each locale provider, trying to find suitable collations for each one, and returns a set of (provider, u_version) tuples.
Some results
On Ubuntu 22.04, Postgres 17:
```
=> select * from unicode_versions();
 provider | u_version
----------+-----------
 libc     | 14
 icu      | 14
 builtin  | 15
```
On Ubuntu 25.04, Postgres 18 RC1:
```
=> select * from unicode_versions();
 provider | u_version
----------+-----------
 libc     | 15
 icu      | 15
 builtin  | 16
```
On Amazon RDS with a recently installed Postgres 17, the function reports:
```
=> select * from unicode_versions();
 provider | u_version
----------+-----------
 libc     | 10
 icu      | 10
 builtin  | 15
```
Unicode version 10 was released in 2017. RDS probably uses a patched glibc to pin the collation code and locale data (see compat-collation-for-glibc). The advantage is compatibility across OS upgrades (no need to recreate text indexes); the drawback is that Unicode support is stuck in the past.
A surprising result is seen on Windows 11, with Postgres 17 built by EDB, and the default locale for me being French_France.1252.
```
=> select * from unicode_versions();
 provider | u_version
----------+-----------
 libc     | 5.1
 icu      | 13
 builtin  | 15
```
Unicode 5.1 was released in 2008!
I find it hard to believe that Windows 11 has no newer version of Unicode, but maybe it's available through a different set of APIs than what Postgres uses. A look at the source code shows that the regular expression test against \w in Postgres 17 calls iswalpha() or _iswalpha_l().
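One can poke at that libc primitive directly. Here is a minimal Python ctypes sketch, hedged: it assumes a glibc-style platform where the wide-character argument is the UCS-4 code point and a UTF-8 locale is installed, and it only roughly reproduces what the \w test does through iswalpha().

```python
import ctypes
import ctypes.util
import locale

# Load the C library and switch it to a UTF-8 locale, so that iswalpha()
# consults that locale's wide-character classification tables.
libc = ctypes.CDLL(ctypes.util.find_library("c"))
for loc in ("C.UTF-8", "en_US.UTF-8", ""):
    try:
        locale.setlocale(locale.LC_ALL, loc)
        break
    except locale.Error:
        continue

# ASCII letters are alphabetic in any locale...
print(bool(libc.iswalpha(ord('A'))))     # True
# ...while an unassigned code point such as U+0378 never is.
print(bool(libc.iswalpha(0x0378)))       # False
# U+0E86 (a Lao letter added in Unicode 12) reveals whether this libc's
# tables are at least that recent -- the same probing idea as above.
print(bool(libc.iswalpha(0x0E86)))
```

The result of the last probe depends entirely on which locale data the system ships, which is precisely what the SQL function above measures from inside Postgres.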
Well, that’s another reason to use the builtin locale provider whenever possible…
1. libc is not listed because the C library does not do case folding. Calling casefold() with a libc collation in Postgres does work, but it does the same as lower().
2. normalize() and unaccent() predate the builtin provider, but the Unicode data they use is shared with the builtin provider and corresponds to the same version. See $(UNICODE_VERSION) set in the Postgres source code in src/Makefile.global.in.