Advanced psql scripting with coproc
In this post, let’s see how to control psql from outside with the coproc functionality in bash.
coproc is described in the bash documentation as:
A coprocess is a shell command preceded by the coproc reserved word. A coprocess is executed asynchronously in a subshell, as if the command had been terminated with the ‘&’ control operator, with a two-way pipe established between the executing shell and the coprocess.
Of course there are simpler ways than a coprocess to invoke psql from a shell script. Let’s review them briefly to put the more advanced method into perspective.
Without coproc
As an example, let's say we want to import data from files into tables with the copy command, and we have an $opts shell variable holding our psql options. For batch jobs, it's generally a good idea to pass -v ON_ERROR_STOP=1 to stop on errors, and -At to obtain query results in the simplest format. -q may also be used to suppress command tags, unless we want to parse them.

Custom variables may also be set on the command line with the -v option, and the script can refer to them with the :'var' syntax for literals or :"var" for identifiers.
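For instance, a custom variable set with -v could be used like this (a minimal sketch; the variable name, schema name, and options shown are made up for the example, and :"var" would be used instead where an identifier is expected):

opts="-v ON_ERROR_STOP=1 -At -q"

psql $opts -v schema=staging <<'EOF'
select table_name from information_schema.tables
  where table_schema = :'schema';
EOF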
- We can run single queries with the -c option:

psql $opts -c "COPY table1 FROM stdin" < datafile1.txt
- If we have a bunch of commands to run, we can group them in a script:
psql $opts -f script.sql
- The SQL commands can also be embedded in the shell script with the here-document syntax.
Shell variables can also be used in the text (beware of quoting
special characters; in the following example, the contents
of variables are assumed to be safe):
psql $opts << END_OF_SCRIPT
begin;
\copy table_$file1 from '$file1'
\copy table_$file2 from '$file2'
...
commit;
END_OF_SCRIPT
- The commands can also be dynamically generated by the script and
piped into psql:
(
  echo "begin;"
  for file in $(ls /path/to/import_dir/*.csv)
  do
    tablename=${file%.csv}   # remove csv suffix
    echo "\\copy $tablename from $file csv"
  done
  echo "commit;"
) | psql $opts
In these examples, we have a lot of flexibility in what is fed to psql, but once the block of commands is determined, we don't have much control over the rest of the process. It might succeed, or it might fail and psql will quit with an error, but that's about it.
Sometimes we need more control, for instance when psql commands must be synchronized with external events, or when commands should be run depending on the results of previous commands. That's when the coprocess method comes in.
With coproc
To launch psql
as a coprocess, let’s use this invocation:
coproc PSQL { psql $opts ; }
Now when our bash script writes lines into ${PSQL[1]}, they go to psql's standard input, thus sending commands to psql, and when it reads from ${PSQL[0]}, it gets psql's standard output, thus getting results. When not doing that, psql just waits for input as if it were used interactively.
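As a minimal sketch of these mechanics (assuming psql was started with the -At -q options discussed earlier, so that a simple query comes back as a single plain line):

echo 'select 1;' >&${PSQL[1]}   # write a query to psql's standard input
read -r -u ${PSQL[0]} line      # read one line from psql's standard output
echo "got: $line"               # prints: got: 1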
Getting per-command results
After sending a command to psql, the script reads its output line by line, but it must not block when there are no more results. To that effect, each command sent by the script is followed by a second command echoing a “end mark” string whose output is interpreted as the end of results of the previous command. The end mark is a random string including a UUID so that the probability to conflict with actual contents is negligible:
function end_marker
{
  echo "-- END RESULTS MARK -- $(get_uuid) --"
}
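The get_uuid helper is not defined in this excerpt; a minimal version might use uuidgen when available and fall back to the kernel's generator:

function get_uuid
{
  if command -v uuidgen >/dev/null 2>&1; then
    uuidgen
  else
    cat /proc/sys/kernel/random/uuid
  fi
}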
So our function to send a command and get results is as follows:
function psql_command
{
  end=$(end_marker)
  psql_check_alive
  echo "$1" >&${PSQL[1]}
  echo "\\echo '$end'" >&${PSQL[1]}
  psql_check_alive
  while read -r -u ${PSQL[0]} result
  do
    if [[ "$result" = "$end" ]]; then
      break
    fi
    echo "$result"
  done
}
This works with both meta-commands and SQL commands. To read a psql variable, echo it:
query="select table_name from information_schema.tables where table_schema='public';"
psql_command "$query" > tables-in-public.txt
count=$(psql_command '\echo :ROW_COUNT')
echo "$count table(s) found"
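The psql_check_alive function used above is not shown in this excerpt; a minimal version could simply check that the coprocess is still running, using the PSQL_PID variable that bash sets for the coprocess:

function psql_check_alive
{
  if [[ -z "$PSQL_PID" ]] || ! kill -0 "$PSQL_PID" 2>/dev/null; then
    echo 1>&2 "psql coprocess is not running"
    exit 1
  fi
}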
Advanced error handling
By accessing SQLSTATE, we can do fine-grained error handling. For instance, the following function takes a list of statements to run in a transaction, and retries the whole transaction when it encounters errors that are worth retrying: statement timeouts, serialization failures, or deadlocks.
function retriable_transaction
{
  while true
  do
    psql_command "BEGIN;"
    for query in "$@"
    do
      results=$(psql_command "$query")
      # check for errors
      sqlstate=$(psql_command '\echo :SQLSTATE')
      case "$sqlstate" in
        00000)
          echo "$results"   # output results of $query
          ;;
        57014 | 40001 | 40P01)
          # Rollback and retry on
          # query canceled, or serialization failure, or deadlock
          # see https://www.postgresql.org/docs/current/errcodes-appendix.html
          psql_command "ROLLBACK;"
          continue 2   # restart transaction at first query
          ;;
        *)
          # rollback and stop
          err=$(psql_command '\echo :LAST_ERROR_MESSAGE')
          echo 1>&2 "SQL error: $sqlstate $err"
          psql_command "ROLLBACK;"
          return
          ;;
      esac
    done
    psql_command "COMMIT;"
    break
  done
}
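As a usage sketch (the table and column names here are made up for the example), a transfer that should survive serialization failures or deadlocks could be run as:

retriable_transaction \
  "update accounts set balance = balance - 100 where account_id = 1;" \
  "update accounts set balance = balance + 100 where account_id = 2;"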
Synchronization with other processes
Another use case for psql as a coprocess is to synchronize its commands with other programs that consume or produce data.
A job I had to implement in a project consisted of importing files from a compressed tar archive into tables with the following constraints:
- We cannot assume that there is enough disk space to extract the tar file locally, or even a single file of the archive.
- Which files are present in the archive is not known in advance.
- The import must be done within a single transaction, failing or succeeding as a whole.
GNU tar provides a --to-command option that streams the contents of each file into a program, solving the intermediate disk space problem. However, streaming directly into psql would invoke psql once per file in the archive, which would make it impossible to import everything in a single transaction.
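As a standalone illustration of --to-command (with a hypothetical archive name), the following prints each member's name and size: tar runs the command once per file, with the file's contents on standard input and its name in the TAR_FILENAME environment variable:

tar -xjf archive.tar.bz2 \
    --to-command 'echo "$TAR_FILENAME: $(wc -c) bytes"'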
So instead of doing that, we launch psql as a coprocess to
be passed \copy
commands exactly when the data for each table
is made available in a pipe. The tar process runs in parallel with the bash script and psql,
writing into two named pipes: one for the names of the files ($fifo_names
below),
from which we deduce into which table the contents should go, and
another one for the contents ($fifo_contents
below).
Here's a sketch of the design:

[diagram: tar extracts the archive in the background and writes file names and file contents into two named pipes, from which the bash script and psql read]
And here's the relevant part of the code (which expects the tar.bz2 archive streamed to its standard input):
# $psql_opts includes --single-transaction so we don't emit begin/commit
# pairs as commands
coproc PSQL { psql $psql_opts ; }

fifo_names="$(mktemp -u)-$(get_uuid)"
mkfifo -m 0600 $fifo_names

fifo_contents="$(mktemp -u)-$(get_uuid)"
mkfifo -m 0600 $fifo_contents

function cleanup
{
  rm -f $fifo_names $fifo_contents
}

trap cleanup EXIT

end_files=$(end_marker)

cat | (
  tar --to-command "echo \$TAR_FILENAME >>$fifo_names; cat > $fifo_contents" \
      -xjf - ;
  echo "$end_files" >$fifo_names
) &

while read -r copyfilename < $fifo_names; do
  if [[ "$copyfilename" = "$end_files" ]]; then
    break
  else
    tablename=${copyfilename%.copy}
    echo "Importing $copyfilename into $tablename"
    psql_command "\\copy $tablename from $fifo_contents"
  fi
done
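As a usage sketch (the script and host names are made up for this example), the archive can be streamed from a local file or straight from another machine, without ever being stored locally:

# from a local archive
./import-archive.sh < /path/to/archive.tar.bz2

# or streamed over ssh
ssh backup-host 'cat /backups/latest.tar.bz2' | ./import-archive.sh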
As we can see, this is a fairly concise solution to a problem that would be hard to tackle without the coprocess facility. If you want to improve or discuss the code in this post, feel free to provide feedback in my GitHub repository.