How To Count Distinct Items Over Sliding Time Windows in Postgres
As a member of PatientsLikeMe's Data team, from time to time we're asked to compute how many unique users did action X on the site within a date range, say 28 days, or several date ranges (1,14,28 days for example). It's easy enough to do that for a given day, but to do that for every day over a span of time (in one query) took some thinking. Here's what I came up with.
One day at a time
First, a simplified example table:
create table events (user_id integer, event varchar, date date)
Getting unique user counts by event on any given day is easy. Below, we'll get the counts of unique users by events for the 7 days leading up to Valentine's day:
select count(distinct user_id), event from events where date between '2011-02-07' and '2011-02-14' group by 2
Now Do That For Every Day
The simplest thing that could possibly work is to just issue that query to compute the stats for the time span desired. We're looking for something faster, and a bit more elegant.
Stepping back a bit, for a seven day time window, we're asking that an event on 2/7/2011 count for that day, and also count for the 6 following days - effectively we're mapping the events of each day onto itself and 6 other days. If this was a map/reduce tool, we could emit the event 7 times, one under each date key. In SQL, that sounds like a join waiting to happen. Once joined, its easy to group by the mapped date, and do a distinct count.
With a table like the one below:
This SQL becomes easy.
select to_date, event, count(distinct user_id) from events join dates_plus_7 on events.date = dates_plus_7.from_date group by 1,2
The output looks like this:
You'll then need to trim the ends of your data to adjust for where the windows ran off the edge of the data.
That works for me on Postgresql 8.4. Your mileage may vary with other brands.
How Do I Get One of Those?
A dates table like that is a one-liner using the generate_series method:
select date::date as from_date, date::date+plus_day as to_date from generate_series('2011-01-01'::date, '2011-02-28'::date, '1 day') as date, generate_series(0,6,1) as plus_day ;
There we get the cartesian product of the set of dates in the desired range, and the set of numbers from 0 to 6. Sum the two, treating the numbers as offsets and Bob's your uncle.