How To Count Distinct Items Over Sliding Time Windows in Postgres

As a member of PatientsLikeMe's Data team, from time to time we're asked to compute how many unique users did action X on the site within a date range, say 28 days, or several date ranges (1,14,28 days for example). It's easy enough to do that for a given day, but to do that for every day over a span of time (in one query) took some thinking. Here's what I came up with.

One day at a time

First, a simplified example table:

create table events (user_id integer,  event varchar,  date date)

Getting unique user counts by event on any given day is easy. Below, we'll get the counts of unique users by events for the 7 days leading up to Valentine's day:

select count(distinct user_id), event from events
where date between '2011-02-07' and '2011-02-14'
group by 2

Now Do That For Every Day

The simplest thing that could possibly work is to just issue that query to compute the stats for the time span desired. We're looking for something faster, and a bit more elegant.

Stepping back a bit, for a seven day time window, we're asking that an event on 2/7/2011 count for that day, and also count for the 6 following days - effectively we're mapping the events of each day onto itself and 6 other days. If this was a map/reduce tool, we could emit the event 7 times, one under each date key. In SQL, that sounds like a join waiting to happen. Once joined, its easy to group by the mapped date, and do a distinct count.

With a table like the one below:

from_date to_date
2011-01-01 2011-01-01
2011-01-01 2011-01-02
2011-01-01 2011-01-03
2011-01-01 2011-01-04
2011-01-01 2011-01-05
2011-01-01 2011-01-06
2011-01-01 2011-01-07
2011-01-02 2011-01-02
...

This SQL becomes easy.

select to_date, event, count(distinct user_id) from events
join dates_plus_7 on events.date = dates_plus_7.from_date
group by 1,2

The output looks like this:

to_date event count
...
2011-01-05 bar 20
2011-01-05 baz 27
2011-01-05 foo 24
2011-01-06 bar 31
...

You'll then need to trim the ends of your data to adjust for where the windows ran off the edge of the data.

That works for me on Postgresql 8.4. Your mileage may vary with other brands.

How Do I Get One of Those?

A dates table like that is a one-liner using the generate_series method:

select date::date as from_date, date::date+plus_day as to_date from
 generate_series('2011-01-01'::date, '2011-02-28'::date, '1 day') as date,
 generate_series(0,6,1) as plus_day ;

There we get the cartesian product of the set of dates in the desired range, and the set of numbers from 0 to 6. Sum the two, treating the numbers as offsets and Bob's your uncle.