Thursday, July 26, 2012

How to create a histogram from count data using sql

I am often asked to generate summary counts from various tables and frequently asked to "bucketize" the results for the user.  Once you do it the technique is pretty easy, but getting over that first hump can be challenging.  So here's a tutorial from a recent request (this is mysql but the sql is pretty standard).  (If you only have a few thousand rows of data it might be easier just to use the histogram function of Excel but if you have more than some maximum number of rows, this technique will always work.)

1. First get your counts, something like
         select some_key,count(*) from table1 group by some_key;

2. and stick these values into an intermediate table (yes, not required but easier to explain)
         create table kount as select some_key,count(*) as value from table1 group by some_key;

3. now decide what kinds of "buckets" you want for your histogram.  Be sure that you cover your min and max values from the previous count query.  So in my case I'll make buckets that represent orders of magnitude.  You create a table with these buckets and then populate a min and max value for each bucket thusly.
       create table bins (min_value int, max_value int);
       insert bins values(0,10),(10,100),(100,1000),(1000,10000),(10000,100000);

Our resultant buckets look like this:

+-----------+-----------+
| min_value | max_value |
+-----------+-----------+
|         0 |        10 |
|        10 |       100 |
|       100 |      1000 |
|      1000 |     10000 |
|     10000 |    100000 |
+-----------+-----------+

4. Now we do a join and get the counts for each bucket.  
      select b.min_value,b.max_value,count(*) from bins b 
          left outer join kounts a on a.value between b.min_value and b.max_value
          group by b.min_value 
          order by b.min_value

And our histogram values look like the following and can be put into your favorite charting product and made into a pretty graph.

+-----------+-----------+----------+
| min_value | max_value | count(*) |
+-----------+-----------+----------+
|         0 |        10 |  3189118 |
|        10 |       100 |   239142 |
|       100 |      1000 |     9004 |
|      1000 |     10000 |      208 |
|     10000 |    100000 |        4 |
+-----------+-----------+----------+

I hope this is useful to some new sql user.   Enjoy.  


No comments:

Post a Comment