What Unicode versions do we use?
With its three locale providers (libc, icu, and builtin), a PostgreSQL instance can rely on up to three different versions of Unicode at the same time.
When only storing, retrieving, or doing basic operations on UTF-8 strings, these versions don't matter. Postgres checks that the bytes in the strings constitute valid UTF-8 sequences, but not whether the code points corresponding to these sequences are assigned.
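To see the distinction outside of Postgres, here is a small Python sketch (Python's `unicodedata` module ships its own copy of the Unicode tables): a byte sequence can be valid UTF-8 while encoding a code point that no Unicode version has assigned.

```python
import unicodedata

# A perfectly valid 2-byte UTF-8 sequence...
s = b'\xcd\xb8'.decode('utf-8')          # decodes without error

# ...yet the code point it encodes, U+0378, is unassigned:
# its general category is "Cn" (other, not assigned).
print(hex(ord(s)))                       # 0x378
print(unicodedata.category(s))           # Cn
```

This mirrors what Postgres does on input: the UTF-8 validity check passes, and only assignment-aware features such as the `unicode_assigned()` function in the table would flag U+0378.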
It starts to matter when interpreting these code points as characters. The features that operate at that level of interpretation are:
| Functionality | Unicode provider | PG version | Collation matters? |
|---|---|---|---|
| upper, lower, initcap | libc, icu, builtin | all | Yes |
| casefold | icu, builtin 1 | >= 18 | Yes |
| normalization | builtin 2 | >= 13 | No |
| unaccent | builtin 2 | all | No |
| regular expressions | libc, icu, builtin | all | Yes |
| full text search parser | libc | all | No |
| linguistic sorts/comparisons | libc, icu | all | Yes |
| unicode_assigned | builtin | >= 18 | No |
In practice, most text data in our databases is already covered by whatever Unicode version is in use, because even 20 years ago, when Unicode 4.1 was current, it already had almost 100,000 characters from 59 different scripts. Still, the Unicode repertoire continues to expand pretty much every year, with lots of new symbols, emojis, and the inclusion of lesser-known languages.
So when using Postgres, if we care about these features, how do we know which version of Unicode we're working with?
Checking with SQL code
Since PostgreSQL 17, the unicode_version() and icu_unicode_version() functions tell us which Unicode version is supported by, respectively, the builtin provider and ICU. There is no equivalent for the libc provider, because the libc library itself does not expose that information. But we can write our own function in SQL to figure out a pretty good approximation of which Unicode version our libc supports.
The idea is that, for any new letter L added in a given version V of Unicode, the regular expression test L ~ '\w' COLLATE "collname" will match only if the Unicode version that the collname collation is based on is newer than or equal to V.
The following SQL expression returns the Unicode release of the default collation by testing one arbitrary letter per release that added it, from the most recent release down to the oldest.
```sql
case
  when U&'\088F' ~ '\w' then '>= 17'
  when U&'\1C89' ~ '\w' then '16'
  when U&'\+01123F' ~ '\w' then '15'
  when U&'\0870' ~ '\w' then '14'
  when U&'\08BE' ~ '\w' then '13'
  when U&'\0E86' ~ '\w' then '12'
  when U&'\0560' ~ '\w' then '11'
  when U&'\0860' ~ '\w' then '10'
  when U&'\08B6' ~ '\w' then '9'
  when U&'\08B3' ~ '\w' then '8'
  when U&'\037F' ~ '\w' then '7'
  when U&'\0526' ~ '\w' then '6'
  when U&'\0524' ~ '\w' then '5.2'
  when U&'\0370' ~ '\w' then '5.1'
  when U&'\0252' ~ '\w' then '5'
  when U&'\0237' ~ '\w' then '4.1'
  when U&'\0221' ~ '\w' then '4.0'
  when U&'\0220' ~ '\w' then '3.2'
  when U&'\03F4' ~ '\w' then '3.1'
  when U&'\01F6' ~ '\w' then '3.0'
  else '< 3.0' -- released before 1999
end
```
This list misses certain minor Unicode versions because they did not always include new letters, but for practical purposes, it should be good enough.
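For comparison, the same probing trick can be sketched in Python against the interpreter's own Unicode tables, using str.isalpha() in place of the collation-aware \w match. The code points and version labels below are the ones from the SQL expression above; unicodedata.unidata_version gives the authoritative answer to check the probe against.

```python
import unicodedata

# One probe letter per Unicode release that introduced it, newest first
# (same code points as the SQL CASE expression above).
PROBES = [
    (0x088F, '>= 17'), (0x1C89, '16'), (0x1123F, '15'), (0x0870, '14'),
    (0x08BE, '13'), (0x0E86, '12'), (0x0560, '11'), (0x0860, '10'),
    (0x08B6, '9'), (0x08B3, '8'), (0x037F, '7'), (0x0526, '6'),
    (0x0524, '5.2'), (0x0370, '5.1'), (0x0252, '5'), (0x0237, '4.1'),
    (0x0221, '4.0'), (0x0220, '3.2'), (0x03F4, '3.1'), (0x01F6, '3.0'),
]

def probed_unicode_version() -> str:
    """Return the Unicode release of this Python's tables, by probing."""
    for cp, version in PROBES:
        # An unassigned code point is never alphabetic, so the first
        # probe letter that matches reveals the supported release.
        if chr(cp).isalpha():
            return version
    return '< 3.0'

print(probed_unicode_version())
print(unicodedata.unidata_version)   # the authoritative answer
```

On a Python built against Unicode 15.x, for instance, the probe returns '15' while unidata_version reports the exact release.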
A fully fledged function based on this is available on GitHub. It runs this check for each locale provider, trying to find suitable collations for each one, and returns a set of (provider, u_version) tuples.
Some results
On Ubuntu 22.04, Postgres 17:
```
=> select * from unicode_versions();
 provider | u_version
----------+-----------
 libc     | 14
 icu      | 14
 builtin  | 15
```
On Ubuntu 25.04, Postgres 18 RC1:
```
=> select * from unicode_versions();
 provider | u_version
----------+-----------
 libc     | 15
 icu      | 15
 builtin  | 16
```
On Amazon RDS with a recently installed Postgres 17, the function reports:
```
=> select * from unicode_versions();
 provider | u_version
----------+-----------
 libc     | 10
 icu      | 10
 builtin  | 15
```
Unicode version 10 was released in 2017. RDS probably uses a patched glibc to pin the collation code and locale data (see compat-collation-for-glibc). The advantage is compatibility across OS upgrades (no need to recreate text indexes); the drawback is that Unicode support is stuck in the past.
A surprising result is seen on Windows 11, with Postgres 17 built by EDB, and the default locale for me being French_France.1252.
```
=> select * from unicode_versions();
 provider | u_version
----------+-----------
 libc     | 5.1
 icu      | 13
 builtin  | 15
```
Unicode 5.1 was released in 2008!
I find it hard to believe that Windows 11 has no newer version of Unicode, but maybe it's available through a different set of APIs than what Postgres uses. A look at the source code shows that the regular expression test against \w in Postgres 17 calls iswalpha() or _iswalpha_l().
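One can poke at that libc primitive directly. Here is a minimal Python ctypes sketch, hedged: it assumes a glibc-style platform where the wide-character argument is the UCS-4 code point and a UTF-8 locale is installed, and it only roughly reproduces what the \w test does through iswalpha().

```python
import ctypes
import ctypes.util
import locale

# Load the C library and switch it to a UTF-8 locale, so that iswalpha()
# consults that locale's wide-character classification tables.
libc = ctypes.CDLL(ctypes.util.find_library("c"))
for loc in ("C.UTF-8", "en_US.UTF-8", ""):
    try:
        locale.setlocale(locale.LC_ALL, loc)
        break
    except locale.Error:
        continue

# ASCII letters are alphabetic in any locale...
print(bool(libc.iswalpha(ord('A'))))     # True
# ...while an unassigned code point such as U+0378 never is.
print(bool(libc.iswalpha(0x0378)))       # False
# U+0E86 (a Lao letter added in Unicode 12) reveals whether this libc's
# tables are at least that recent -- the same probing idea as above.
print(bool(libc.iswalpha(0x0E86)))
```

The result of the last probe depends entirely on which locale data the system ships, which is precisely what the SQL function above measures from inside Postgres.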
Well, that’s another reason to use the builtin locale provider whenever possible…
1. libc is not listed because the C library does not do case folding. Calling casefold() with a libc collation in Postgres does work, but it does the same as lower().
2. normalize() and unaccent() predate the builtin provider, but the Unicode data they use is shared with the builtin provider and corresponds to the same version. See $(UNICODE_VERSION) set in the Postgres source code in src/Makefile.global.in.