Multiple strings replacement with plperl

Substituting a single string by another within a larger string is straightforward in SQL, with the replace function:

select replace('the string is bar', 'bar', 'foo');
     replace
-------------------
 the string is foo

But there isn’t a PostgreSQL core function to substitute multiple strings each by its own independent replacement, like for instance the strtr function in PHP, or the substitutions operators with regular expressions in Python or Perl (which we’re going to use in this post).

In PostgreSQL, regexp_replace does support alternations to search for multiple strings in a single pass, but it can’t do multiple replacements, that is regexp_replace(string, 'foo|bar', 'baz', 'g') would replace both foo and bar by baz, but it cannot be instructed to replace foo by a string and bar by another.

In some cases, nested or successive replace() calls might do the job, and it seems the popular answer when searching for a solution on the web, but in the general case, they produce flawed results if any inner replacement happens to create a match for other outer replace calls.

As an example, consider the necessary conversion of characters into HTML entities when creating an HTML fragment from a raw text. These five substitutions have to be applied:

 > → &gt;
 < → &lt;
 & → &amp;
 " → &apos;
 ' → &quot;

Say we apply the substitutions in that order, with a query like this:

select replace(replace(replace(replace(replace(
  rawtext,
  '>', '&gt;'),
  '<', '&lt;'),
  '&', '&amp;'),
  '"', '&apos;'),
  '''', '&quot;');

If rawtext is <strong> AT&T </strong>, then the result obtained is:

&amp;lt;strong&amp;gt; AT&amp;T &amp;lt;/strong&amp;gt;

and this is clearly wrong, as what we want is:

&lt;strong&gt; AT&amp;T &lt;/strong&gt;

The problem with the query above is that once > has been replaced by > (or < by <), the next replacement stage has no way to distinguish the ampersand in AT&T (that must be replaced) from the ampersand in > that comes from a previous replacement and should be left unmodified.

In that particular case, an effective workaround is to rearrange the order of the replacements so that & is substituted by & first.

But what if we have circular references between substitutions, like when two strings need to be switched? In that case, no order will succeed to produce the desired result:

-- replace foo with bar and bar with foo.
-- wrong result, version 1
select replace(replace('foo and bar', 'foo', 'bar'), 'bar', 'foo');
   replace   
-------------
 foo and foo

-- wrong result, version 2
 select replace(replace('foo and bar', 'bar', 'foo'), 'foo', 'bar');
   replace   
-------------
 bar and bar

What is needed instead is an algorithm that scans and replaces in the string in a single pass. It also needs a rule for when several matches occur simultaneously at the same point in the string (typically the longest match wins, but another rule could be chosen).

The Perl substitution operator (s/pattern/replacement/flags) provides this functionality, since pattern can be an alternation and replacement can be a hash with key and values holding the substitutions. For instance:

my %subs = (foo=>bar, bar=>foo);
my $string = "foo et bar";
$string =~ s/(foo|bar)/$subs{$1}/g;
print $string;

As a bonus, since Perl 5.10 the implementation of this construct has been specifically optimized:

Trie optimisation of literal string alternations

Alternations, where possible, are optimised into more efficient matching structures. String literal alternations are merged into a trie and are matched simultaneously. This means that instead of O(N) time for matching N alternations at a given point, the new code performs in O(1) time. A new special variable, ${^RE_TRIE_MAXBUF}, has been added to fine-tune this optimization. (Yves Orton)

PostgreSQL allows to write functions in Perl With the plperl extension. Using this, here’s a simple ready-to-use implementation of the multiple replacement:

CREATE FUNCTION multi_replace(string text, orig text[], repl text[])
RETURNS text
AS $BODY$
  my ($string, $orig, $repl) = @_;
  my %subs;

  if (@$orig != @$repl) {
     elog(ERROR, "array sizes mismatch");
  }
  if (ref @$orig[0] eq 'ARRAY' || ref @$repl[0] eq 'ARRAY') {
     elog(ERROR, "array dimensions mismatch");
  }

  @subs{@$orig} = @$repl;
  
  my $re = join "|", map quotemeta,
     sort { (length($b) <=> length($a)) } keys %subs;
  $re = qr/($re)/;

  $string =~ s/$re/$subs{$1}/g;
  return $string;
$BODY$ language plperl strict immutable;

This Perl function appears to be very fast, even with many strings to replace (tested up to 1000, knowing that they get combined into a single regular expression) and many actual replacements occurring.

In a plpgsql implementation, searching with the alternation appears to be pretty fast, but iterating on the replacements is very costly. The code is a bit longer, I won’t put it in this post but it’s on the PostgreSQL wiki, see multi_replace in plpgsql. Aside from taking the input replacements through a jsonb parameter rather than arrays, which incidentally allows to have the two implementations to co-exist in the database with the same function name, the results should always be identical.

For instance, this invocation switches foo and bar while also leaving foobar unchanged by using the trick of replacing it by itself:

 select multi_replace(                                                           
   'foo and bar are not foobar',
   '{foo,bar,foobar}',
   '{bar,foo,foobar}');

       multi_replace        
----------------------------
 bar and foo are not foobar

Be it in plperl or plpgsql, this function avoids the problems with nesting replace(), including allowing dynamic lists of substitutions. But if you need that functionality with large numbers of substitutions, it’s a case where Perl is much more efficient than the equivalent with SQL built-in functions.