Static and dynamic pivots
What’s a pivot?
Pivoting is the operation by which values in a column become column names, so that visually a kind of 90° rotation happens: what was read vertically top-to-bottom appears horizontally from left to right. This concept is also called “transposition”, or “crosstab”, and it brings a bit of the spreadsheet way of thinking into the relational way.
In the simplest case, we start from only two columns, one being a function of the other. We’ll use weather as an example: let’s consider a year column and a rainy-days column, expressing the number of days when it rained more than 1 mm at a certain location.
Before pivoting:
Year | RainDays |
-------+-----------+
2012 | 112 |
2013 | 116 |
2014 | 111 |
2015 | 80 |
2016 | 110 |
2017 | 102 |
After pivoting:
2012 | 2013 | 2014 | 2015 | 2016 | 2017
------+------+------+------+------+------
112 | 116 | 111 | 80 | 110 | 102
Often there’s a second dimension, giving us 3 columns with a functional dependency: (X dimension, Y dimension) => Value
With the rain example, the second dimension could be a city name, like below:
Before pivoting:
Year | City | RainDays
-------+-----------+---------
2012 | Lyon | 112
2013 | Lyon | 116
2014 | Lyon | 111
... | ... | ...
2014 | Toulouse | 111
2015 | Toulouse | 83
Let’s consider a simple dataset with data for 13 cities over 6 years, or 78 rows. (An SQL dump for this example is available here: rainfall-example.sql; the raw data consists of monthly measurements for a few cities in France, coming from https://www.infoclimat.fr/climatologie/.)
Here’s a typical pivoted result set showing the per-city, per-year data:
City | 2012 | 2013 | 2014 | 2015 | 2016 | 2017
-----------+------+------+------+------+------+------
Ajaccio | 69 | 91 | 78 | 48 | 81 | 51
Bordeaux | 116 | 138 | 137 | 101 | 117 | 110
Brest | 178 | 161 | 180 | 160 | 165 | 144
Dijon | 114 | 124 | 116 | 93 | 116 | 103
Lille | 153 | 120 | 136 | 128 | 138 | 113
Lyon | 112 | 116 | 111 | 80 | 110 | 102
Marseille | 47 | 63 | 68 | 53 | 54 | 43
Metz | 98 | 120 | 110 | 93 | 122 | 115
Nantes | 124 | 132 | 142 | 111 | 106 | 110
Nice | 53 | 77 | 78 | 50 | 52 | 43
Paris | 114 | 111 | 113 | 85 | 120 | 110
Perpignan | 48 | 56 | 54 | 48 | 69 | 48
Toulouse | 86 | 116 | 111 | 83 | 102 | 89
(13 rows)
Generally speaking, the pivoted result of a series of (X,Y,V) tuples is a grid of N+1 columns and M rows, where:
- N is the number of distinct values of X.
- M is the number of distinct values of Y.
- the first, leftmost column contains the distinct values of Y, generally in a specific order; here, for instance, cities sorted alphabetically.
- the names of the other columns are made of the distinct values of X, also in a specific order. In the example above they represent years, in ascending order from left to right.
- for each (X,Y) couple, if an (X,Y,V) tuple exists in the dataset to pivot, V is set in the grid at the intersection of the column named X and the row starting with Y. Hence the “crosstab” appellation.
If there is no corresponding V for (X,Y) in the dataset, the corresponding field in the grid is set to NULL or to a specific marker for empty values.
When the number of columns is relatively limited, this representation has some interesting visual advantages over the top-down list of (X,Y,V) tuples:
- it fits better in the 2D space
- it’s more intuitive, as there’s no repetition in the dimensions.
- it’s obvious when some values are missing in the grid.
- each axis can be sorted distinctly.
What queries to pivot a dataset?
The canonical form
The PostgreSQL dialect doesn’t have a PIVOT clause, unlike Oracle or MS SQL Server, but it’s not essential. A query pivoting three columns (x,y,v) can be written like this:
SELECT
  y,
  (CASE WHEN x='value 1' THEN v END) AS "value 1",
  (CASE WHEN x='value 2' THEN v END) AS "value 2",
  ...repeated for each value of x transposed into a column
FROM table or subquery
[ORDER BY 1]
Often a pivot query will aggregate values simultaneously with pivoting. The typical form of such a query is:
SELECT
  y,
  AGG(v) FILTER (WHERE x='value 1') AS "value 1",
  AGG(v) FILTER (WHERE x='value 2') AS "value 2",
  ...repeated for each value of x transposed into a column
FROM table or subquery
GROUP BY y [ORDER BY 1];
The FILTER clause is new in PostgreSQL 9.4; with earlier versions a CASE WHEN expression can be used instead.
AGG(v) represents an aggregation function, which could be SUM(v) to add values, or COUNT(v) to count occurrences, MIN(), MAX()…
For the example of rainy days over 6 years, with one measurement per month and the goal to produce a City/Year pivot, the query would be as follows:
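Assuming a rainfall table with (city, year, month, raindays) columns, as in the dump mentioned above, a sketch of such a query could be:

```sql
-- Pivot years into columns; the list of years is hardcoded.
-- Assumes a rainfall(city, year, month, raindays) table.
SELECT
  city,
  SUM(raindays) FILTER (WHERE year=2012) AS "2012",
  SUM(raindays) FILTER (WHERE year=2013) AS "2013",
  SUM(raindays) FILTER (WHERE year=2014) AS "2014",
  SUM(raindays) FILTER (WHERE year=2015) AS "2015",
  SUM(raindays) FILTER (WHERE year=2016) AS "2016",
  SUM(raindays) FILTER (WHERE year=2017) AS "2017"
FROM rainfall
GROUP BY city
ORDER BY city;
```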
The crosstab() model
The tablefunc extension in contrib provides a function:
crosstab(text source_sql, text category_sql)
that is often the first solution cited in questions about pivots with Postgres.
The first argument of crosstab is the text of a query returning the data to pivot. The second argument is another query returning the names of columns after pivoting, in the desired order.
Since the function returns an anonymous SETOF RECORD type, the output columns have to be specified with an AS (col1 type, col2 type, ...) clause in the calling query, so that the output can be interpreted by the SQL engine as proper columns.
For example:
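With the same assumed rainfall table, a crosstab() call could look like this (the extension must first be installed with CREATE EXTENSION tablefunc):

```sql
SELECT * FROM crosstab(
  -- source_sql: one (city, year, value) row per cell, sorted on the row key
  $$SELECT city, year, SUM(raindays)
      FROM rainfall GROUP BY city, year ORDER BY 1,2$$,
  -- category_sql: the values that become column names, in order
  $$SELECT DISTINCT year FROM rainfall ORDER BY 1$$
) AS (
  city text,
  "2012" int, "2013" int, "2014" int,
  "2015" int, "2016" int, "2017" int
);
```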
The limitations of static pivots
Both the crosstab()-based queries and the canonical form have the drawback that the output columns must be explicitly enumerated, so that when a new value appears in the rows to transpose, it must be added manually to the list. Otherwise, with the canonical form the new data would be ignored, and with crosstab() it would be likely to cause a mismatch error.
These queries also lack flexibility: to change the order of the columns, or transpose a different column of the source data (for instance have cities on the horizontal axis instead of years), they need to be rewritten.
Also, some pivots may have hundreds of columns, so listing them manually in SQL is too tedious.
What we’d like, as SQL users, is a dynamic pivot, that is, a polymorphic query that would automatically have row values transposed into columns without the need to edit the SQL statement.
But generally speaking, an SQL query can’t have dynamic columns, or at least not in the way that a dynamic pivot would need them. One could object that in SELECT * FROM table, the * sign is replaced dynamically by a list of columns, so we somehow have dynamic columns already, but the catch is that this replacement does not happen during the execution step. The SQL engine can’t even prepare the query if it doesn’t know how many output columns there are and what their types and names are. This is why the output of crosstab(), as well as that of any function returning SETOF RECORD, must be qualified with an AS (...) clause with column names and types hardcoded in the query.
How to do a dynamic pivot
So the difficulty of a dynamic pivot is: in an SQL query, the output columns must be determined before execution. But to know which columns are formed from transposing rows, we’d need to execute the query first. To solve this chicken-and-egg problem, we need to loosen the constraints a bit.
Result in a nested structure inside a column
Firstly, a SQL query can return the pivoted part encapsulated inside a single column with a composite or array type, rather than as multiple columns. This type could be array[text], JSON, XML… This solution is used by Oracle with its PIVOT XML clause. That’s a one-step method, which is good, but the result has a non-tabular structure that does not necessarily match the user’s expectations.
Here is an example using modern PostgreSQL with JSON:
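A sketch of such a query, again assuming the rainfall table:

```sql
-- Collapse the horizontal axis into a single JSON object per city.
SELECT city, json_object_agg(year, total ORDER BY year)
FROM (
  SELECT city, year, SUM(raindays) AS total
  FROM rainfall
  GROUP BY city, year
) s
GROUP BY city
ORDER BY city;
```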
Without having to enumerate the years in the query, we get the same data as above, but as 2 columns, the first for the “vertical axis”, and the next one for all the rest in JSON format:
city | json_object_agg
-----------+----------------------------------------------------------------------------------------
Ajaccio | { "2012" : 69, "2013" : 91, "2014" : 78, "2015" : 48, "2016" : 81, "2017" : 51 }
Bordeaux | { "2012" : 116, "2013" : 138, "2014" : 137, "2015" : 101, "2016" : 117, "2017" : 110 }
Brest | { "2012" : 178, "2013" : 161, "2014" : 180, "2015" : 160, "2016" : 165, "2017" : 144 }
Dijon | { "2012" : 114, "2013" : 124, "2014" : 116, "2015" : 93, "2016" : 116, "2017" : 103 }
Lille | { "2012" : 153, "2013" : 120, "2014" : 136, "2015" : 128, "2016" : 138, "2017" : 113 }
Lyon | { "2012" : 112, "2013" : 116, "2014" : 111, "2015" : 80, "2016" : 110, "2017" : 102 }
Marseille | { "2012" : 47, "2013" : 63, "2014" : 68, "2015" : 53, "2016" : 54, "2017" : 43 }
Metz | { "2012" : 98, "2013" : 120, "2014" : 110, "2015" : 93, "2016" : 122, "2017" : 115 }
Nantes | { "2012" : 124, "2013" : 132, "2014" : 142, "2015" : 111, "2016" : 106, "2017" : 110 }
Nice | { "2012" : 53, "2013" : 77, "2014" : 78, "2015" : 50, "2016" : 52, "2017" : 43 }
Paris | { "2012" : 114, "2013" : 111, "2014" : 113, "2015" : 85, "2016" : 120, "2017" : 110 }
Perpignan | { "2012" : 48, "2013" : 56, "2014" : 54, "2015" : 48, "2016" : 69, "2017" : 48 }
Toulouse | { "2012" : 86, "2013" : 116, "2014" : 111, "2015" : 83, "2016" : 102, "2017" : 89 }
(13 rows)
That somehow does the job, but visually the alignment is not very good, and if we want to copy-paste this result into a spreadsheet or import it as tabular data, it’s clear that it’s not going to work.
Getting a tabular result in two steps
The other solutions based on a SQL query rely on a two-step process:
- a first query builds the result with all its columns, and returns an indirect reference to this result.
- a second query returns the result set, now that its structure is known by the SQL engine thanks to the previous step.
At this point, let’s insist that calling these two steps through a single function would defeat the solution: in order to call that function, a SQL query would require an AS (...) clause listing the columns of the result, making it no different from calling crosstab() in the first place.
The object that the first step returns a reference to can be a cursor: in that case the SQL query can be a function call taking the same arguments as crosstab() but returning a REFCURSOR. The function would build a dynamic SQL query and instantiate a cursor over it. Upon return, the caller could browse the resulting rows with FETCH. This variant is implemented in the dynamic_pivot() function below.
Another variant would be for the function to create a view or a table, temporary or permanent, returning the pivoted data. It would return its name, for instance, and the client-side code would run a SELECT on this table or view, and probably drop it afterwards. The SQL implementation could be similar to the cursor-returning function, except that it would execute CREATE [TEMPORARY] TABLE (or VIEW) name AS ... followed by the same dynamically built query.
In the code below, I suggest a function returning a REFCURSOR that can be used directly, or serve as a template to customize. Its input arguments are the same as crosstab()’s:
- a main query returning 3 columns to pivot, as (vertical, horizontal, value) tuples.
- a second query returning a single column with the values of the future horizontal header, in the desired order.
The function creates and returns a cursor pointing to the result, which must be consumed in the same transaction (it’s not declared WITH HOLD, although it could be if we wanted to keep the results past the transaction).
Unfortunately the main query is executed twice in this implementation, since it’s embedded as a subquery in two distinct places. Also, the type of the output columns is forced to text, as in plpgsql we can’t get at the column types of our input query. An implementation in the C language would probably avoid these plpgsql limitations (although let’s be glad to have row_to_json, added in 9.2, without which it wouldn’t even be possible to figure out the column names of the source query). Anyway, there’s a significant advantage to the plpgsql implementation: it needs only about 40 lines of code. And here they are:
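A possible plpgsql implementation along those lines (a sketch to adapt; error handling is minimal and the output columns are all text, as discussed above):

```sql
CREATE FUNCTION dynamic_pivot(central_query text, headers_query text)
RETURNS refcursor AS
$$
DECLARE
  left_column text;
  header_column text;
  value_column text;
  h_value text;
  headers_clause text;
  query text;
  j json;
  r record;
  curs refcursor;
  i int := 1;
BEGIN
  -- find the column names of the source query through row_to_json
  EXECUTE 'SELECT row_to_json(_r.*) FROM (' || central_query || ') AS _r LIMIT 1'
    INTO j;
  FOR r IN SELECT * FROM json_each_text(j)
  LOOP
    IF    (i = 1) THEN left_column   := r.key;
    ELSIF (i = 2) THEN header_column := r.key;
    ELSIF (i = 3) THEN value_column  := r.key;
    END IF;
    i := i + 1;
  END LOOP;

  -- build the snippet transposing each header value into a column
  FOR h_value IN EXECUTE headers_query
  LOOP
    headers_clause := concat(headers_clause,
      format(chr(10) || ',min(CASE WHEN %I=%L THEN %I::text END) AS %I',
             header_column, h_value, value_column, h_value));
  END LOOP;

  -- assemble the dynamic query, preserving the original row order
  query := format('SELECT %I %s FROM (SELECT *, row_number() OVER () AS rn'
                  || ' FROM (%s) AS _c) AS _d GROUP BY %I ORDER BY min(rn)',
                  left_column, headers_clause, central_query, left_column);

  -- open an unnamed cursor that the caller can FETCH from right away
  OPEN curs FOR EXECUTE query;
  RETURN curs;
END
$$ LANGUAGE plpgsql;
```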
Example of use:
=> BEGIN;
-- step 1: get the cursor (we let Postgres generate the cursor's name)
=> SELECT dynamic_pivot(
'SELECT city,year,SUM(raindays)
FROM rainfall GROUP BY city,year
ORDER BY 1',
'SELECT DISTINCT year FROM rainfall ORDER BY 1'
) AS cur
\gset
-- step 2: read the results through the cursor
=> FETCH ALL FROM :"cur";
city | 2012 | 2013 | 2014 | 2015 | 2016 | 2017
-----------+------+------+------+------+------+------
Ajaccio | 69 | 91 | 78 | 48 | 81 | 51
Bordeaux | 116 | 138 | 137 | 101 | 117 | 110
Brest | 178 | 161 | 180 | 160 | 165 | 144
Dijon | 114 | 124 | 116 | 93 | 116 | 103
Lille | 153 | 120 | 136 | 128 | 138 | 113
Lyon | 112 | 116 | 111 | 80 | 110 | 102
Marseille | 47 | 63 | 68 | 53 | 54 | 43
Metz | 98 | 120 | 110 | 93 | 122 | 115
Nantes | 124 | 132 | 142 | 111 | 106 | 110
Nice | 53 | 77 | 78 | 50 | 52 | 43
Paris | 114 | 111 | 113 | 85 | 120 | 110
Perpignan | 48 | 56 | 54 | 48 | 69 | 48
Toulouse | 86 | 116 | 111 | 83 | 102 | 89
(13 rows)
=> CLOSE :"cur";
=> COMMIT; -- will close the cursor if not already done with CLOSE.
Client-side pivot
The client-side presentation layer can also do the work of transposing rows into columns from a non-pivoted dataset. In fact, some people argue that the pivot is a purely presentational problem, which is right in the sense that it’s exactly the same data that is shown in the end.
Since version 9.6, psql offers a client-side approach for pivots through the \crosstabview command.
In interactive use, this method is probably the quickest way to visualize pivoted representations.
For instance, say we want to have a look at the (year,city) couples in our example with more than 120 rainy days per year:
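A query along these lines (assuming again the rainfall table) keeps only those couples, and \crosstabview then pivots the result client-side, using by default the first column as the vertical axis and the second as the horizontal one:

```sql
SELECT city, year, SUM(raindays) AS raindays
  FROM rainfall
  GROUP BY city, year
  HAVING SUM(raindays) > 120
\crosstabview
```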
city | 2012 | 2013 | 2014 | 2015 | 2016 | 2017
----------+------+------+------+------+------+------
Brest | 178 | 161 | 180 | 160 | 165 | 144
Nantes | 124 | 132 | 142 | | |
Lille | 153 | | 136 | 128 | 138 |
Dijon | | 124 | | | |
Bordeaux | | 138 | 137 | | |
Metz | | | | | 122 |
The horizontal header holds the values of the 2nd column of the source data. To get the other transposition, we just have to give year and city as arguments to \crosstabview in that order, without changing the query:
=# \crosstabview year city
year | Brest | Nantes | Lille | Dijon | Bordeaux | Metz
-------+-------+--------+-------+-------+----------+------
2012 | 178 | 124 | 153 | | |
2013 | 161 | 132 | | 124 | 138 |
2014 | 180 | 142 | 136 | | 137 |
2015 | 160 | | 128 | | |
2016 | 165 | | 138 | | | 122
2017 | 144 | | | | |
In the result above, cities come in no particular order.
But these columns can be sorted, possibly with complex criteria, through the command’s 4th argument. For example, the following command sorts the city columns by rank of rainfall, adding it as a 4th column to the query and passing it to crosstabview:
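One way to do that (the exact ranking expression is an assumption; here each city is ranked by its average number of rainy days over the selected rows) could be:

```sql
SELECT year, city, raindays,
       -- 4th column: per-city sort criterion for the horizontal header
       AVG(raindays) OVER (PARTITION BY city) AS rank
FROM (
  SELECT city, year, SUM(raindays) AS raindays
    FROM rainfall
    GROUP BY city, year
    HAVING SUM(raindays) > 120
) s
\crosstabview year city raindays rank
```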
Result:
year | Metz | Dijon | Nantes | Bordeaux | Lille | Brest
-------+------+-------+--------+----------+-------+-------
2012 | | | 124 | | 153 | 178
2013 | | 124 | 132 | 138 | | 161
2014 | | | 142 | 137 | 136 | 180
2015 | | | | | 128 | 160
2016 | 122 | | | | 138 | 165
2017 | | | | | | 144
We can see that the numbers get distributed in a way that shows a left-to-right gradient from the least to the most rainy, with Brest as the clear winner for this dataset.
How to unpivot a dataset?
The UNPIVOT clause does exist in some SQL dialects, but not in PostgreSQL. Nevertheless it’s easy to unpivot generically with PostgreSQL, without having to explicitly list the columns, by using an intermediate JSON representation.
Say for example that our weather data was structured as below, with a distinct column for each month of the year, spreadsheet-style.
=> \d rain_months
Column | Type |
--------+---------+
city | text |
year | integer |
m1 | integer |
m2 | integer |
m3 | integer |
m4 | integer |
m5 | integer |
m6 | integer |
m7 | integer |
m8 | integer |
m9 | integer |
m10 | integer |
m11 | integer |
m12 | integer |
By applying the json_each_text function to each row formatted with row_to_json, the columns are transposed into (key,value) tuples:
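A minimal form of this transposition could be:

```sql
-- Each row of rain_months becomes one (key, value) tuple per column.
SELECT m.key, m.value
FROM rain_months t,
     LATERAL json_each_text(row_to_json(t)) AS m;
```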
To get the final unpivoted result, we need to improve this query a bit to associate the year and city with each measurement, and to filter and type-cast the month columns:
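A possible final query, keeping only the m1..m12 keys and casting them (a sketch against the rain_months table above):

```sql
SELECT t.city,
       t.year,
       substr(m.key, 2)::int AS month,   -- 'm7' -> 7
       m.value::int          AS raindays
FROM rain_months t,
     LATERAL json_each_text(row_to_json(t)) AS m
WHERE m.key ~ '^m[0-9]+$';               -- keep only the month columns
```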
Result:
city | year | month | raindays
-----------+-------+-------+---------
Lille | 2017 | 1 | 9
Lille | 2017 | 2 | 10
Lille | 2017 | 3 | 9
etc...
(936 rows)
This query retrieves the 13 cities x 6 years x 12 months of the original dataset.