Multiple strings replacement with plperl
Substituting a single string by another within a larger string is
straightforward in SQL, with the replace
function:
select replace('the string is bar', 'bar', 'foo');
replace
-------------------
the string is foo
But there isn’t a PostgreSQL core function to substitute
multiple strings each by its own independent replacement, like for
instance the strtr
function
in PHP,
or the substitutions operators with regular expressions in Python or Perl
(which we’re going to use in this post).
In PostgreSQL, regexp_replace
does support alternations to search for
multiple strings in a single pass, but it can’t do multiple
replacements, that is
regexp_replace(string, 'foo|bar', 'baz', 'g')
would replace both
foo
and bar
by baz
, but it cannot be instructed to replace foo
by
a string and bar
by another.
In some cases, nested or successive replace()
calls might do the
job, and it seems the popular answer when searching for
a solution on the web, but in the general case, they produce flawed results if
any inner replacement happens to create a match for other outer
replace
calls.
As an example, consider the necessary conversion of characters into HTML entities when creating an HTML fragment from a raw text. These five substitutions have to be applied:
> → >
< → <
& → &
" → '
' → "
Say we apply the substitutions in that order, with a query like this:
select replace(replace(replace(replace(replace(
rawtext,
'>', '>'),
'<', '<'),
'&', '&'),
'"', '''),
'''', '"');
If rawtext
is <strong> AT&T </strong>
, then the result
obtained is:
&lt;strong&gt; AT&T &lt;/strong&gt;
and this is clearly wrong, as what we want is:
<strong> AT&T </strong>
The problem with the query above is that once >
has been replaced by
>
(or <
by <
), the next replacement stage has no way to
distinguish the ampersand in AT&T
(that must be replaced) from the
ampersand in >
that comes from a previous replacement and should
be left unmodified.
In that particular case, an effective workaround is to rearrange the order of
the replacements so that &
is substituted by &
first.
But what if we have circular references between substitutions, like when two strings need to be switched? In that case, no order will succeed to produce the desired result:
-- replace foo with bar and bar with foo.
-- wrong result, version 1
select replace(replace('foo and bar', 'foo', 'bar'), 'bar', 'foo');
replace
-------------
foo and foo
-- wrong result, version 2
select replace(replace('foo and bar', 'bar', 'foo'), 'foo', 'bar');
replace
-------------
bar and bar
What is needed instead is an algorithm that scans and replaces in the string in a single pass. It also needs a rule for when several matches occur simultaneously at the same point in the string (typically the longest match wins, but another rule could be chosen).
The Perl substitution operator (s/pattern/replacement/flags
) provides
this functionality, since pattern
can
be an alternation and replacement
can be a hash with key and values holding
the substitutions. For instance:
my %subs = (foo=>bar, bar=>foo);
my $string = "foo et bar";
$string =~ s/(foo|bar)/$subs{$1}/g;
print $string;
As a bonus, since Perl 5.10 the implementation of this construct has been specifically optimized:
Trie optimisation of literal string alternations
Alternations, where possible, are optimised into more efficient matching structures. String literal alternations are merged into a trie and are matched simultaneously. This means that instead of O(N) time for matching N alternations at a given point, the new code performs in O(1) time. A new special variable, ${^RE_TRIE_MAXBUF}, has been added to fine-tune this optimization. (Yves Orton)
PostgreSQL allows to write functions in Perl With the
plperl
extension.
Using this, here’s a simple ready-to-use implementation of the multiple replacement:
CREATE FUNCTION multi_replace(string text, orig text[], repl text[])
RETURNS text
AS $BODY$
my ($string, $orig, $repl) = @_;
my %subs;
if (@$orig != @$repl) {
elog(ERROR, "array sizes mismatch");
}
if (ref @$orig[0] eq 'ARRAY' || ref @$repl[0] eq 'ARRAY') {
elog(ERROR, "array dimensions mismatch");
}
@subs{@$orig} = @$repl;
my $re = join "|", map quotemeta,
sort { (length($b) <=> length($a)) } keys %subs;
$re = qr/($re)/;
$string =~ s/$re/$subs{$1}/g;
return $string;
$BODY$ language plperl strict immutable;
This Perl function appears to be very fast, even with many strings to replace (tested up to 1000, knowing that they get combined into a single regular expression) and many actual replacements occurring.
In a plpgsql implementation, searching with the alternation appears to be
pretty fast, but iterating on the replacements is very costly.
The code is a bit longer, I won’t put it in this post but it’s on
the PostgreSQL wiki, see
multi_replace in plpgsql.
Aside from taking the input replacements through a jsonb
parameter
rather than arrays, which incidentally allows to have the two
implementations to co-exist in the database with the same function name,
the results should always be identical.
For instance, this invocation switches foo
and bar
while also leaving
foobar
unchanged by using the trick of replacing it by itself:
select multi_replace(
'foo and bar are not foobar',
'{foo,bar,foobar}',
'{bar,foo,foobar}');
multi_replace
----------------------------
bar and foo are not foobar
Be it in plperl or plpgsql, this function avoids the problems with
nesting replace()
, including allowing dynamic lists of
substitutions. But if you need that functionality with large numbers
of substitutions, it’s a case where Perl is much more efficient than
the equivalent with SQL built-in functions.