Advanced psql scripting with coproc
In this post, let’s see how to control psql from outside with the coproc functionality in bash.
coproc is described in the bash documentation as:
A coprocess is a shell command preceded by the coproc reserved word. A coprocess is executed asynchronously in a subshell, as if the command had been terminated with the ‘&’ control operator, with a two-way pipe established between the executing shell and the coprocess.
Of course there are simpler ways than a coprocess to invoke psql from a shell script. Let’s review them briefly to put the more advanced method into perspective.
Without coproc
As an example, let's say we want to import data from files into tables with the copy command, and we have an $opts shell variable holding our psql options. For batch jobs, it's generally a good idea to pass -v ON_ERROR_STOP=1 to stop on errors, and -At to obtain query results in the simplest format. -q may also be used to suppress command tags, unless we want to parse them.

Custom variables may also be set on the command line with the -v option, and the script can refer to them with the :'var' syntax for literals or :"var" for identifiers.
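For instance, a custom variable set with -v could be used like this (a minimal sketch; the variable name, schema name, and options shown are made up for the example, and :"var" would be used instead where an identifier is expected):

opts="-v ON_ERROR_STOP=1 -At -q"

psql $opts -v schema=staging <<'EOF'
select table_name from information_schema.tables
  where table_schema = :'schema';
EOF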
- We can run single queries with the -c option:

psql $opts -c "COPY table1 FROM stdin" < datafile1.txt
- If we have a bunch of commands to run, we can group them in a script:
psql $opts -f script.sql
- The SQL commands can also be embedded in the shell script with the here-document syntax.
Shell variables can also be used in the text (beware of quoting
special characters; in the following example, the contents
of variables are assumed to be safe):
psql $opts << END_OF_SCRIPT
begin;
\copy table_$file1 from '$file1'
\copy table_$file2 from '$file2'
...
commit;
END_OF_SCRIPT
- The commands can also be dynamically generated by the script and
piped into psql:
(
  echo "begin;"
  for file in $(ls /path/to/import_dir/*.csv)
  do
    tablename=${file%.csv}   # remove csv suffix
    echo "\\copy $tablename from $file csv"
  done
  echo "commit;"
) | psql $opts
In these examples, we have a lot of flexibility in what is fed to psql, but once the block of commands is determined, we don't have much control over the rest of the process. It might succeed, or it might fail and psql will quit with an error, but that's about it.
Sometimes we need more control, for instance when psql commands must be synchronized with external events, or when commands should be run depending on the results of previous commands. That's when the coprocess method comes in.
With coproc
To launch psql
as a coprocess, let’s use this invocation:
coproc PSQL { psql $opts ; }
Now when our bash script writes lines into ${PSQL[1]}, they go to psql's standard input, thus sending commands to psql, and when it reads from ${PSQL[0]}, it gets psql's standard output, thus getting results. When not doing that, psql just waits for input as if it were used interactively.
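As a minimal sketch of these mechanics (assuming psql was started with the -At -q options discussed earlier, so that a simple query comes back as a single plain line):

echo 'select 1;' >&${PSQL[1]}   # write a query to psql's standard input
read -r -u ${PSQL[0]} line      # read one line from psql's standard output
echo "got: $line"               # prints: got: 1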
Getting per-command results
After sending a command to psql, the script reads its output line by line, but it must not block when there are no more results. To that effect, each command sent by the script is followed by a second command echoing a “end mark” string whose output is interpreted as the end of results of the previous command. The end mark is a random string including a UUID so that the probability to conflict with actual contents is negligible:
function end_marker
{
  echo "-- END RESULTS MARK -- $(get_uuid) --"
}
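The get_uuid helper is not defined in this excerpt; a minimal version might use uuidgen when available and fall back to the kernel's generator:

function get_uuid
{
  if command -v uuidgen >/dev/null 2>&1; then
    uuidgen
  else
    cat /proc/sys/kernel/random/uuid
  fi
}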
So our function to send a command and get results is as follows:
function psql_command
{
  end=$(end_marker)
  psql_check_alive
  echo "$1" >&${PSQL[1]}
  echo "\\echo '$end'" >&${PSQL[1]}
  psql_check_alive
  while read -r -u ${PSQL[0]} result
  do
    if [[ "$result" = "$end" ]]; then
      break
    fi
    echo "$result"
  done
}
This works with both meta-commands and SQL commands. To read a psql variable, echo it:
query="select table_name from information_schema.tables where table_schema='public';"
psql_command "$query" > tables-in-public.txt
count=$(psql_command '\echo :ROW_COUNT')
echo "$count table(s) found"
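The psql_check_alive function used above is not shown in this excerpt; a minimal version could simply check that the coprocess is still running, using the PSQL_PID variable that bash sets for the coprocess:

function psql_check_alive
{
  if [[ -z "$PSQL_PID" ]] || ! kill -0 "$PSQL_PID" 2>/dev/null; then
    echo 1>&2 "psql coprocess is not running"
    exit 1
  fi
}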
Advanced error handling
By accessing SQLSTATE, we can do fine-grained error handling. For instance, the following function takes a list of statements to run in a transaction, and retries the whole transaction when it encounters errors that are worth retrying: statement timeouts, serialization failures, or deadlocks.
function retriable_transaction
{
  while true
  do
    psql_command "BEGIN;"
    for query in "$@"
    do
      results=$(psql_command "$query")
      # check for errors
      sqlstate=$(psql_command '\echo :SQLSTATE')
      case "$sqlstate" in
        00000)
          echo "$results"   # output results of $query
          ;;
        57014 | 40001 | 40P01)
          # Rollback and retry on
          # query canceled, or serialization failure, or deadlock
          # see https://www.postgresql.org/docs/current/errcodes-appendix.html
          psql_command "ROLLBACK;"
          continue 2   # restart transaction at first query
          ;;
        *)
          # rollback and stop
          err=$(psql_command '\echo :LAST_ERROR_MESSAGE')
          echo 1>&2 "SQL error: $sqlstate $err"
          psql_command "ROLLBACK;"
          return
          ;;
      esac
    done
    psql_command "COMMIT;"
    break
  done
}
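As a usage sketch (the table and column names here are made up for the example), a transfer that should survive serialization failures or deadlocks could be run as:

retriable_transaction \
  "update accounts set balance = balance - 100 where account_id = 1;" \
  "update accounts set balance = balance + 100 where account_id = 2;"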
Synchronization with other processes
Another use case for psql as a coprocess is to synchronize its commands with other programs that consume or produce data.
A job I had to implement in a project consisted of importing files from a compressed tar archive into tables with the following constraints:
- We cannot assume that there is enough disk space to extract the tar file locally, or even a single file of the archive.
- Which files are present in the archive is not known in advance.
- The import must be done within a single transaction, failing or succeeding as a whole.
GNU tar provides a --to-command option that streams the contents of each file into a program, solving the intermediate disk space problem. However, streaming directly into psql would invoke psql once per file in the archive, which would make it impossible to import everything in a single transaction.
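As a standalone illustration of --to-command (with a hypothetical archive name), the following prints each member's name and size: tar runs the command once per file, with the file's contents on standard input and its name in the TAR_FILENAME environment variable:

tar -xjf archive.tar.bz2 \
    --to-command 'echo "$TAR_FILENAME: $(wc -c) bytes"'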
So instead of doing that, we launch psql as a coprocess to
be passed \copy
commands exactly when the data for each table
is made available in a pipe. The tar process runs in parallel with the bash script and psql,
writing into two named pipes: one for the names of the files ($fifo_names
below),
from which we deduce into which table the contents should go, and
another one for the contents ($fifo_contents
below).
Here's a sketch of the design:

[diagram: tar extracts the archive in the background and writes file names and file contents into two named pipes, from which the bash script and psql read]
And here's the relevant part of the code (which expects the tar.bz2 archive streamed to its standard input):
# $psql_opts includes --single-transaction so we don't emit begin/commit
# pairs as commands
coproc PSQL { psql $psql_opts ; }

fifo_names="$(mktemp -u)-$(get_uuid)"
mkfifo -m 0600 $fifo_names

fifo_contents="$(mktemp -u)-$(get_uuid)"
mkfifo -m 0600 $fifo_contents

function cleanup
{
  rm -f $fifo_names $fifo_contents
}

trap cleanup EXIT

end_files=$(end_marker)

cat | (
  tar --to-command "echo \$TAR_FILENAME >>$fifo_names; cat > $fifo_contents" \
      -xjf - ;
  echo "$end_files" >$fifo_names
) &

while read -r copyfilename < $fifo_names; do
  if [[ "$copyfilename" = "$end_files" ]]; then
    break
  else
    tablename=${copyfilename%.copy}
    echo "Importing $copyfilename into $tablename"
    psql_command "\\copy $tablename from $fifo_contents"
  fi
done
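As a usage sketch (the script and host names are made up for this example), the archive can be streamed from a local file or straight from another machine, without ever being stored locally:

# from a local archive
./import-archive.sh < /path/to/archive.tar.bz2

# or streamed over ssh
ssh backup-host 'cat /backups/latest.tar.bz2' | ./import-archive.sh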
As we can see, this is a fairly concise solution to a problem that would be hard to tackle without the coprocess facility. If you want to improve or discuss the code in this post, feel free to provide feedback in my GitHub repository.