What’s a pivot?

Pivoting is the operation by which values in a column become column names, so that visually a kind of 90° rotation happens: what was read vertically top-to-bottom appears horizontally from left to right. This concept is also called “transposition”, or “crosstab”, and it brings a bit of the spreadsheet way of thinking into the relational way.

In the simplest case, we start from only two columns, one being a function of the other. We’ll use weather as an example: let’s consider a year column and a raindays column, the latter being the number of days when it rained more than 1 mm at a certain location.

Before pivoting:

 Year  | RainDays 
-------+----------
  2012 |      112
  2013 |      116
  2014 |      111
  2015 |       80
  2016 |      110
  2017 |      102

After pivoting:

 2012 | 2013 | 2014 | 2015 | 2016 | 2017 
------+------+------+------+------+------
  112 |  116 |  111 |   80 |  110 |  102

Often there’s a second dimension, giving us 3 columns with a functional dependency: (X dimension, Y dimension) => Value

With the rain example, the second dimension could be a city name, like below:

Before pivoting:

 Year  |   City    | RainDays
-------+-----------+---------
  2012 | Lyon      |     112
  2013 | Lyon      |     116
  2014 | Lyon      |     111
  ...  | ...       |     ...
  2014 | Toulouse  |     111
  2015 | Toulouse  |      83

Let’s consider a simple dataset with data for 13 cities over 6 years, or 78 rows. (An SQL dump for this example is available here: rainfall-example.sql; the raw data consists of monthly measurements for a few cities in France, coming from https://www.infoclimat.fr/climatologie/.)
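
For the queries that follow, I’ll assume the dump creates a table along these lines (this exact definition is a sketch; refer to rainfall-example.sql for the authoritative one):

CREATE TABLE rainfall (
  city     text,     -- city name, the future vertical axis
  year     int,      -- year of the measurement
  month    int,      -- month of the measurement (1 to 12)
  raindays int       -- days with more than 1 mm of rain in that month
);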

Here’s a typical pivoted result set showing the per-city, per-year data:

   City    | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 
-----------+------+------+------+------+------+------
 Ajaccio   |   69 |   91 |   78 |   48 |   81 |   51
 Bordeaux  |  116 |  138 |  137 |  101 |  117 |  110
 Brest     |  178 |  161 |  180 |  160 |  165 |  144
 Dijon     |  114 |  124 |  116 |   93 |  116 |  103
 Lille     |  153 |  120 |  136 |  128 |  138 |  113
 Lyon      |  112 |  116 |  111 |   80 |  110 |  102
 Marseille |   47 |   63 |   68 |   53 |   54 |   43
 Metz      |   98 |  120 |  110 |   93 |  122 |  115
 Nantes    |  124 |  132 |  142 |  111 |  106 |  110
 Nice      |   53 |   77 |   78 |   50 |   52 |   43
 Paris     |  114 |  111 |  113 |   85 |  120 |  110
 Perpignan |   48 |   56 |   54 |   48 |   69 |   48
 Toulouse  |   86 |  116 |  111 |   83 |  102 |   89

(13 rows)

Generally speaking, the pivoted result of a series of (X,Y,V) tuples is a grid of N+1 columns and M rows, where:

  • N is the number of distinct values of X.

  • M is the number of distinct values of Y.

  • the first, leftmost column contains the distinct values of Y, generally in a specific order; here, for instance, it could be cities sorted alphabetically.

  • the names of other columns are made of the distinct values of X, also in a specific order. In the example above they represent years, in ascending order from left to right.

  • for each (X,Y) couple, if a (X,Y,V) tuple exists in the dataset to pivot, V is set in the grid at the intersection of the column named X and the row starting with Y. Hence the “crosstab” appellation.

If there is no corresponding V for (X,Y) in the dataset, the corresponding field in the grid is set to NULL or to a specific marker for empty values.

When the number of columns is relatively limited, this representation has some interesting visual advantages over the top-down list of (X,Y,V) tuples:

  • it fits better in the 2D space
  • it’s more intuitive, as there’s no repetition in the dimensions.
  • it’s obvious when some values are missing in the grid.
  • each axis can be sorted distinctly.

Which queries can pivot a dataset?

The canonical form

The PostgreSQL dialect doesn’t have a PIVOT clause, unlike Oracle or MS SQL Server, but that’s not essential. A query pivoting three columns (x,y,v) can be written like this:

SELECT
  y,
  (CASE WHEN x='value 1' THEN v END) "value 1",
  (CASE WHEN x='value 2' THEN v END) "value 2",
  ...repeated for each x transposed into a column
  FROM table or subquery
  [ORDER BY 1]

Often a pivot query aggregates values while pivoting. The typical form of such a query is:

SELECT
  y,
  AGG(v) FILTER (WHERE x='value 1') AS "value 1",
  AGG(v) FILTER (WHERE x='value 2') AS "value 2",
  ...repeated for each x transposed into a column
  FROM table or subquery
  GROUP BY y [ORDER BY 1];

The FILTER clause is new in PostgreSQL 9.4; with earlier versions a CASE WHEN expression can be used instead.
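
For reference, the same aggregating template using CASE WHEN instead of FILTER (for versions older than 9.4) would be:

SELECT
  y,
  AGG(CASE WHEN x='value 1' THEN v END) AS "value 1",
  AGG(CASE WHEN x='value 2' THEN v END) AS "value 2",
  ...repeated for each x transposed into a column
  FROM table or subquery
  GROUP BY y [ORDER BY 1];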

AGG(v) represents an aggregation function, which could be SUM(v) to add values, or COUNT(v) to count occurrences, MIN(), MAX()…

For the example of rainy days over 6 years, with one measurement per month and the goal of producing a City/Year pivot, the query would be as follows:

SELECT
  city,
  SUM(raindays) FILTER (WHERE year=2012) AS "2012",
  SUM(raindays) FILTER (WHERE year=2013) AS "2013",
  SUM(raindays) FILTER (WHERE year=2014) AS "2014",
  SUM(raindays) FILTER (WHERE year=2015) AS "2015",
  SUM(raindays) FILTER (WHERE year=2016) AS "2016",
  SUM(raindays) FILTER (WHERE year=2017) AS "2017"
FROM rainfall 
GROUP BY city
ORDER BY city;

The crosstab() model

The tablefunc extension in contrib provides a function: crosstab(text source_sql, text category_sql)
that is often the first solution cited in questions about pivots with Postgres.
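
Being a contrib extension, tablefunc must be installed once per database before crosstab() can be used:

CREATE EXTENSION IF NOT EXISTS tablefunc;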

The first argument of crosstab is the text of a query returning the data to pivot. The second argument is another query returning the names of columns after pivoting, in the desired order. Since the function returns an anonymous SETOF RECORD type, the output columns have to be specified with an AS (col1 type, col2 type,...) clause in the calling query, so that the output can be interpreted by the SQL engine as proper columns.

For example:

SELECT * FROM crosstab(
   -- central query
   'SELECT city,year,SUM(raindays)
     FROM rainfall GROUP BY city,year ORDER BY city',
   -- query to generate the horizontal header
   'SELECT DISTINCT year FROM rainfall ORDER BY year')
  AS ("City" text,
      "2012" int,
      "2013" int,
      "2014" int,
      "2015" int,
      "2016" int,
      "2017" int);

The limitations of static pivots

Both the crosstab()-based queries and the canonical form have the drawback that the output columns must be explicitly enumerated, so that when a new value appears in the rows to transpose, it must be added manually to the list. Otherwise, with the canonical form the new data would be ignored, and with crosstab() it would be likely to cause a mismatch error.

These queries also lack flexibility: to change the order of the columns, or transpose a different column of the source data (for instance have cities on the horizontal axis instead of years), they need to be rewritten.

Also, some pivots may have hundreds of columns, so listing them manually in SQL is too tedious.
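
A partial workaround, sketched below, is to let a query generate the column expressions from the data itself and paste its output into the static pivot statement; it saves the typing, but the statement still has to be rebuilt whenever the set of values changes:

-- generate the SELECT-list entries of the static pivot query above
SELECT string_agg(
         format('SUM(raindays) FILTER (WHERE year=%s) AS %I', year, year::text),
         E',\n  ' ORDER BY year)
  FROM (SELECT DISTINCT year FROM rainfall) AS years;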

What we’d like, as SQL users, is a dynamic pivot, that is, a polymorphic query that would automatically have row values transposed into columns without the need to edit the SQL statement.

But generally speaking, an SQL query can’t have dynamic columns, or at least not in the way that a dynamic pivot would need them. One could object that in SELECT * FROM table, the * sign is replaced dynamically by a list of columns, so we somehow have dynamic columns already, but the catch is that this replacement does not happen during the execution step. The SQL engine can’t even prepare the query if it doesn’t know how many output columns there are and what their names and types are. This is why the output of crosstab(), as well as that of any function returning SETOF RECORD, must be qualified with an AS (...) clause with column names and types hardcoded in the query.

How to do a dynamic pivot

So the difficulty of a dynamic pivot is: in an SQL query, the output columns must be determined before execution. But to know which columns are formed from transposing rows, we’d need to execute the query first. To solve this chicken-and-egg problem, we need to loosen the constraints a bit.

Result in a nested structure inside a column

Firstly, a SQL query can return the pivoted part encapsulated inside a single column with a composite or array type, rather than as multiple columns. This type could be array[text], JSON, XML… This solution is used by Oracle with its PIVOT XML clause. That’s a one-step method, which is good, but the result has a non-tabular structure that does not necessarily match the user’s expectations.
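
For instance, with an array instead of JSON, the pivoted part can be packed into a single value per city (a sketch; it assumes years are wanted in ascending order and that none is missing, since the array carries no keys):

SELECT city,
       array_agg(total ORDER BY year) AS raindays_by_year
  FROM (SELECT city, year, SUM(raindays) AS total
          FROM rainfall
          GROUP BY city, year) s
 GROUP BY city
 ORDER BY city;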

Here is an example using modern PostgreSQL with JSON:

SELECT city,
       json_object_agg(year,total ORDER BY year)
   FROM (
     SELECT city, year, SUM(raindays) AS total
        FROM rainfall
        GROUP BY city,year
   ) s
  GROUP BY city
  ORDER BY city;

Without having to enumerate the years in the query, we get the same data as above, but as 2 columns, the first for the “vertical axis”, and the next one for all the rest in JSON format:

   city    |                                    json_object_agg                                     
-----------+----------------------------------------------------------------------------------------
 Ajaccio   | { "2012" : 69, "2013" : 91, "2014" : 78, "2015" : 48, "2016" : 81, "2017" : 51 }
 Bordeaux  | { "2012" : 116, "2013" : 138, "2014" : 137, "2015" : 101, "2016" : 117, "2017" : 110 }
 Brest     | { "2012" : 178, "2013" : 161, "2014" : 180, "2015" : 160, "2016" : 165, "2017" : 144 }
 Dijon     | { "2012" : 114, "2013" : 124, "2014" : 116, "2015" : 93, "2016" : 116, "2017" : 103 }
 Lille     | { "2012" : 153, "2013" : 120, "2014" : 136, "2015" : 128, "2016" : 138, "2017" : 113 }
 Lyon      | { "2012" : 112, "2013" : 116, "2014" : 111, "2015" : 80, "2016" : 110, "2017" : 102 }
 Marseille | { "2012" : 47, "2013" : 63, "2014" : 68, "2015" : 53, "2016" : 54, "2017" : 43 }
 Metz      | { "2012" : 98, "2013" : 120, "2014" : 110, "2015" : 93, "2016" : 122, "2017" : 115 }
 Nantes    | { "2012" : 124, "2013" : 132, "2014" : 142, "2015" : 111, "2016" : 106, "2017" : 110 }
 Nice      | { "2012" : 53, "2013" : 77, "2014" : 78, "2015" : 50, "2016" : 52, "2017" : 43 }
 Paris     | { "2012" : 114, "2013" : 111, "2014" : 113, "2015" : 85, "2016" : 120, "2017" : 110 }
 Perpignan | { "2012" : 48, "2013" : 56, "2014" : 54, "2015" : 48, "2016" : 69, "2017" : 48 }
 Toulouse  | { "2012" : 86, "2013" : 116, "2014" : 111, "2015" : 83, "2016" : 102, "2017" : 89 }
(13 rows)

That somehow does the job, but visually the alignment is not very good, and if we want to copy-paste this result into a spreadsheet or import it as tabular data, it’s clear that it’s not going to work.

Getting a tabular result in two steps

The other SQL-based solutions rely on a two-step process:

  1. a first query builds the result with all its columns, and returns an indirect reference to this result.

  2. a second query returns the result set, now that its structure is known by the SQL engine thanks to the previous step.

At this point, let’s stress that calling these two steps through a single function would defeat the solution: in order to call that function, a SQL query would require an AS (...) clause listing the columns of the result, making it no different from calling crosstab() in the first place.

The object that the first step returns a reference to can be a cursor: in that case the SQL query can be a function call taking the same arguments as crosstab() but returning a REFCURSOR. The function would build a dynamic SQL query and instantiate a cursor over it. Upon return, the caller could browse the resulting rows with FETCH. This variant is implemented in the dynamic_pivot() function below.

Another variant would be for the function to create a view or a table, temporary or permanent, exposing the pivoted data. It would return the object’s name, for instance, and the client-side code would run a SELECT on this table or view, and probably drop it afterwards. The SQL implementation could be similar to the cursor-returning function, except that it would execute CREATE [TEMPORARY] TABLE (or VIEW) name AS ... followed by the same dynamically built query.
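
As a minimal sketch of that variant (the helper name pivot_to_view below is hypothetical), the final step of such a function could hand back a temporary view instead of a cursor, with the dynamic query text built exactly as in the dynamic_pivot() function shown further down:

CREATE FUNCTION pivot_to_view(view_name text, pivot_query text)
 RETURNS text AS
$$
BEGIN
  -- materialize the already-built pivot query under the given name
  EXECUTE format('CREATE TEMPORARY VIEW %I AS %s', view_name, pivot_query);
  RETURN view_name;   -- the caller then runs SELECT * FROM that view
END
$$ LANGUAGE plpgsql;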

In the code below, I suggest a function returning a REFCURSOR that can be used directly, or be used as a template to customize.

Its input arguments are the same as crosstab()’s:

  • a main query returning 3 columns to pivot as (vertical, horizontal, value) tuples.

  • a second query returning a single column with the values of the future horizontal header, in the desired order.

The function creates and returns a cursor pointing to the result, which must be consumed in the same transaction (it’s not declared WITH HOLD, although it could be if we wanted to keep the results past the end of the transaction).

Unfortunately the main query is executed twice in this implementation, since it’s embedded as a subquery in two distinct places. Also, the type of the output columns is forced to text, since in plpgsql we can’t get at the column types of the input query. An implementation in C would probably avoid these plpgsql limitations (although let’s be glad to have row_to_json, added in 9.2, without which it wouldn’t even be possible to figure out the column names of the source query). Anyway, the plpgsql implementation has a significant advantage: it needs only about 40 lines of code. And here they are:

CREATE FUNCTION dynamic_pivot(central_query text, headers_query text)
 RETURNS refcursor AS
$$
DECLARE
  left_column text;
  header_column text;
  value_column text;
  h_value text;
  headers_clause text;
  query text;
  j json;
  r record;
  curs refcursor;
  i int:=1;
BEGIN
  -- find the column names of the source query
  EXECUTE 'select row_to_json(_r.*) from (' ||  central_query || ') AS _r' into j;
  FOR r in SELECT * FROM json_each_text(j)
  LOOP
    IF (i=1) THEN left_column := r.key;
      ELSEIF (i=2) THEN header_column := r.key;
      ELSEIF (i=3) THEN value_column := r.key;
    END IF;
    i := i+1;
  END LOOP;

  -- build the dynamic transposition query (based on the canonical model)
  FOR h_value IN EXECUTE headers_query
  LOOP
    headers_clause := concat(headers_clause,
      format(chr(10) || ',min(case when %I=%L then %I::text end) as %I',
             header_column,
             h_value,
             value_column,
             h_value));
  END LOOP;

  query := format('SELECT %I %s FROM (select *, row_number() over() as rn from (%s) AS _c) as _d GROUP BY %I order by min(rn)',
                  left_column,
                  headers_clause,
                  central_query,
                  left_column);

  -- open the cursor so the caller can FETCH right away
  OPEN curs FOR execute query;
  RETURN curs;
END 
$$ LANGUAGE plpgsql;

Example of use:

=> BEGIN;

-- step 1: get the cursor (we let Postgres generate the cursor's name)
=> SELECT dynamic_pivot(
       'SELECT city,year,SUM(raindays) 
          FROM rainfall GROUP BY city,year
          ORDER BY 1',
       'SELECT DISTINCT year FROM rainfall ORDER BY 1'
     ) AS cur
     \gset

-- step 2: read the results through the cursor
=> FETCH ALL FROM :"cur";

   city    | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 
-----------+------+------+------+------+------+------
 Ajaccio   | 69   | 91   | 78   | 48   | 81   | 51
 Bordeaux  | 116  | 138  | 137  | 101  | 117  | 110
 Brest     | 178  | 161  | 180  | 160  | 165  | 144
 Dijon     | 114  | 124  | 116  | 93   | 116  | 103
 Lille     | 153  | 120  | 136  | 128  | 138  | 113
 Lyon      | 112  | 116  | 111  | 80   | 110  | 102
 Marseille | 47   | 63   | 68   | 53   | 54   | 43
 Metz      | 98   | 120  | 110  | 93   | 122  | 115
 Nantes    | 124  | 132  | 142  | 111  | 106  | 110
 Nice      | 53   | 77   | 78   | 50   | 52   | 43
 Paris     | 114  | 111  | 113  | 85   | 120  | 110
 Perpignan | 48   | 56   | 54   | 48   | 69   | 48
 Toulouse  | 86   | 116  | 111  | 83   | 102  | 89
(13 rows)

=> CLOSE :"cur";

=> COMMIT;   -- will close the cursor if not already done with CLOSE.
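
As noted earlier, this implementation runs the central query twice. When that query is expensive, one possible mitigation (a sketch; the rain_src name is just an example) is to materialize its result once in a temporary table and pivot from there:

BEGIN;

CREATE TEMPORARY TABLE rain_src ON COMMIT DROP AS
  SELECT city, year, SUM(raindays) AS raindays
    FROM rainfall GROUP BY city, year ORDER BY 1;

-- both queries now read from the temporary table, so the aggregation runs only once
SELECT dynamic_pivot(
    'SELECT city, year, raindays FROM rain_src ORDER BY 1',
    'SELECT DISTINCT year FROM rain_src ORDER BY 1'
  ) AS cur
  \gset

FETCH ALL FROM :"cur";

COMMIT;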

Client-side pivot

The client-side presentation layer can also do the work of transposing rows into columns from a non-pivoted dataset. In fact, some people argue that the pivot is a purely presentational problem, which is true in the sense that it’s exactly the same data that is shown in the end.

Since version 9.6, psql offers a client-side approach for pivots through the \crosstabview command.

In interactive use, this method is probably the quickest way to visualize pivoted representations.

For instance, say we want to have a look at the (year,city) couples in our example with more than 120 rainy days per year:

=#  SELECT city, year, SUM(raindays)
    FROM rainfall
    GROUP BY city,year 
    HAVING SUM(raindays)>120
    ORDER BY city
    \crosstabview
  city    | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 
----------+------+------+------+------+------+------
 Brest    |  178 |  161 |  180 |  160 |  165 |  144
 Nantes   |  124 |  132 |  142 |      |      |     
 Lille    |  153 |      |  136 |  128 |  138 |     
 Dijon    |      |  124 |      |      |      |     
 Bordeaux |      |  138 |  137 |      |      |     
 Metz     |      |      |      |      |  122 |     

The horizontal header holds the values of the 2nd column of the source data. To get the other transposition, we just have to pass year and city as arguments to \crosstabview, in that order, without changing the query:

=# \crosstabview year city

 year  | Brest | Nantes | Lille | Dijon | Bordeaux | Metz 
-------+-------+--------+-------+-------+----------+------
  2012 |   178 |    124 |   153 |       |          |     
  2013 |   161 |    132 |       |   124 |      138 |     
  2014 |   180 |    142 |   136 |       |      137 |     
  2015 |   160 |        |   128 |       |          |     
  2016 |   165 |        |   138 |       |          |  122
  2017 |   144 |        |       |       |          |     

In the result above, cities come in no particular order.

But these columns can be sorted, possibly with complex criteria, through the command’s 4th argument. For example, the following command sorts the city columns by rank of rainfall, adding the rank as a 4th column to the query and passing it to \crosstabview:

=#  SELECT year, city, SUM(raindays),
      rank() OVER (ORDER BY SUM(raindays))
    FROM rainfall
    GROUP BY city,year
    HAVING SUM(raindays)>120
    ORDER BY year
    \crosstabview year city sum rank

Result:

  year | Metz | Dijon | Nantes | Bordeaux | Lille | Brest 
-------+------+-------+--------+----------+-------+-------
  2012 |      |       |    124 |          |   153 |   178
  2013 |      |   124 |    132 |      138 |       |   161
  2014 |      |       |    142 |      137 |   136 |   180
  2015 |      |       |        |          |   128 |   160
  2016 |  122 |       |        |          |   138 |   165
  2017 |      |       |        |          |       |   144

We can see that the numbers get distributed in a way that shows a left-to-right gradient from the least to the most rainy, with Brest as the clear winner for this dataset.

How to unpivot a dataset?

The UNPIVOT clause does exist in some SQL dialects, but not in PostgreSQL. Nevertheless it’s easy to unpivot generically with PostgreSQL, without having to explicitly list the columns, by using an intermediate JSON representation.

Say, for example, that our weather data were structured as below, with a distinct column for each month of the year, spreadsheet-style.

=> \d rain_months

 Column |  Type   
--------+---------
 city   | text
 year   | integer
 m1     | integer
 m2     | integer
 m3     | integer
 m4     | integer
 m5     | integer
 m6     | integer
 m7     | integer
 m8     | integer
 m9     | integer
 m10    | integer
 m11    | integer
 m12    | integer

By applying the json_each_text function to each row formatted with row_to_json, the columns are transposed into (key,value) tuples:

SELECT key, value FROM
  (SELECT row_to_json(t.*) AS line FROM rain_months t) AS r
  CROSS JOIN LATERAL json_each_text(r.line);

To get the final unpivoted result, we need to improve this query a bit to associate the year and city with each measurement, and to filter and type-cast the month columns:

SELECT
   r.city,
   r.year,
   substr(key,2)::int AS month,  -- replace 'm4' by 4
   value::int AS raindays
 FROM (SELECT city, year, row_to_json(t.*) AS line FROM rain_months t) AS r
  JOIN LATERAL json_each_text(r.line) ON (key ~ '^m[0-9]+');

Result:

   city    | year  | month | raindays
-----------+-------+-------+---------
 Lille     |  2017 |    1  |       9
 Lille     |  2017 |    2  |      10
 Lille     |  2017 |    3  |       9
etc...
(936 rows)

This query retrieves the 13 cities × 6 years × 12 months = 936 measurements of the original dataset.
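
As a quick check (a sketch), re-aggregating these unpivoted rows per city and year should reproduce the yearly figures used in the pivot examples earlier in this article:

SELECT city, year, SUM(value::int) AS raindays
  FROM (SELECT city, year, row_to_json(t.*) AS line FROM rain_months t) AS r
  JOIN LATERAL json_each_text(r.line) ON (key ~ '^m[0-9]+')
 GROUP BY city, year
 ORDER BY city, year;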