Building the Lottery Data Viz
2019 April 02

I previously posted a lottery data visualization in which I attempted to correlate how often a certain number came up in the PA lottery given that some other number came up. I built the visualization using Python, SQLite, and Chart.JS on the front end. Since I like SQL, I'll walk through the queries and methods used. These methods could probably be massively improved, but I built it in a couple of hours because I was bored.


The underlying database is SQLite, with three tables.

The rules are as follows: a game has many drawings, and each drawing has many numbers of different types drawn. Since some PA lottery games have extras like the Powerball or a multiplier, we need to differentiate numbers by type. However, we will only display the Pick N games, because I think they make the best-looking charts.
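The post doesn't show the schema itself, so here is a minimal sketch of what those three tables might look like. The table names and the `num_type`, `game_id`, `drawing_date`, and `value` columns come from the queries in this post; everything else is my guess, not the actual schema:

```python
import sqlite3

# Hypothetical schema matching the description: a game has many
# drawings, and a drawing has many numbers tagged with a type.
SCHEMA = """
CREATE TABLE Game (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL                -- e.g. 'Pick 3 Day'
);
CREATE TABLE Drawing (
    id           INTEGER PRIMARY KEY,
    game_id      INTEGER NOT NULL REFERENCES Game(id),
    drawing_date TEXT NOT NULL        -- ISO-8601 date string
);
CREATE TABLE Numbers (
    id         INTEGER PRIMARY KEY,
    drawing_id INTEGER NOT NULL REFERENCES Drawing(id),
    num_type   TEXT NOT NULL,         -- 'n' = normal; other types for Powerball etc.
    value      INTEGER NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```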

Originally I attempted to do the entire prediction in SQL, but I couldn't get it to work quite right (I was getting double counts; I'm sure it's possible, but I couldn't get it). So I moved the aggregation into Python.

So I ended up retrieving all the normal numbers for a game, grouped by their drawing, with a query like the one below.

SELECT GROUP_CONCAT(n.value)
FROM Numbers n
JOIN Drawing d ON d.id = n.drawing_id
WHERE n.num_type = 'n' AND d.game_id = ? AND
    d.drawing_date BETWEEN ? AND ?
GROUP BY n.drawing_id

From there I processed all the numbers like so:

from collections import Counter
import numpy as np

# cur is a database cursor after executing the above query; each row
# holds a single comma-separated string of the numbers drawn
numbers = [Counter(map(int, drawn.split(','))) for (drawn,) in cur.fetchall()]
low, high = compute_game_minmax(game_id)
mmax = high + 1
freq = np.zeros((mmax, mmax))
for predict in range(mmax):
    for values in numbers:
        if predict in values:
            for value, count in values.items():
                # Count each co-occurring number once per drawing; when the
                # value is the predictor itself, it must have been drawn at
                # least twice, so the predictor isn't its own evidence.
                if value != predict or count >= 2:
                    freq[predict, value] += 1
# Normalize each row so it reads as P(value | predictor)
row_sums = freq.sum(axis=1)
freq2 = freq / row_sums[:, np.newaxis]

The gist of the code is as follows: build a Counter of values for each drawing, then create a matrix with a row for each possible predictor. From there I count each number only once per drawing, and when the predictor is the same as the value, its count must be at least two. This prevents double counts. From there I just wrap it up for Chart.JS to deal with.
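That final wrapping step might look something like the sketch below; the payload shape (labels plus one dataset per predictor row) is my guess at a Chart.JS-friendly format, not the exact one the site uses:

```python
import json
import numpy as np

# Toy 3x3 conditional-frequency matrix standing in for freq2 above.
freq2 = np.array([
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
    [0.4, 0.4, 0.2],
])

# One Chart.JS dataset per predictor: row i gives P(value | predictor=i).
payload = {
    "labels": list(range(freq2.shape[1])),
    "datasets": [
        {"label": f"given {i}", "data": row.tolist()}
        for i, row in enumerate(freq2)
    ],
}
print(json.dumps(payload))
```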

I also provide a histogram of the numbers pulled in a game with no groupings. That's powered by a query like so:

SELECT value, COUNT(*)
FROM Numbers
JOIN Drawing ON Drawing.id = Numbers.drawing_id
WHERE game_id = ? AND
    num_type = 'n' AND
    drawing_date BETWEEN ? AND ?
GROUP BY value
ORDER BY value

It's a pretty standard aggregation and histogram query.
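As a sanity check, here is the histogram query running against a throwaway in-memory database (table and column names are assumptions on my part, as is the sample data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Drawing (id INTEGER PRIMARY KEY, game_id INTEGER,
                      drawing_date TEXT);
CREATE TABLE Numbers (drawing_id INTEGER, num_type TEXT, value INTEGER);
INSERT INTO Drawing VALUES (1, 1, '2019-04-01'), (2, 1, '2019-04-02');
INSERT INTO Numbers VALUES (1, 'n', 7), (1, 'n', 3), (2, 'n', 7);
""")

hist = conn.execute("""
    SELECT value, COUNT(*)
    FROM Numbers
    JOIN Drawing ON Drawing.id = Numbers.drawing_id
    WHERE game_id = ? AND
        num_type = 'n' AND
        drawing_date BETWEEN ? AND ?
    GROUP BY value
    ORDER BY value
""", (1, '2019-01-01', '2019-12-31')).fetchall()
print(hist)  # each row is (value, times drawn)
```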

The Front End

The front end is powered by Chart.JS, an open-source and free JavaScript charting library. It's pretty trivial: I just tie the input fields' onchange events to reload the chart from a CGI script on the back end. (Yes, I know CGI is archaic, but is there a better solution that doesn't require running a daemon in the background or PHP? After all, I'm happy with my shared host.)
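A CGI handler for this can be quite small. The sketch below is illustrative only: a real script would read its parameters with something like `cgi.FieldStorage` and query the database, but I've pulled the response-building out into a stub function so the shape of a CGI response is visible:

```python
#!/usr/bin/env python3
import json

def chart_response(game_id, start, end):
    """Build the full CGI response (headers + body) for one chart request.

    A real handler would run the frequency query here; this stub returns
    a fixed payload so only the response structure is demonstrated.
    """
    body = json.dumps({
        "game_id": game_id,
        "range": [start, end],
        "datasets": [],   # filled from the frequency matrix in practice
    })
    # CGI: print headers, a blank line, then the body.
    return "Content-Type: application/json\r\n\r\n" + body

# Under CGI the parameters would come from the query string;
# here we just call the function directly.
print(chart_response(1, "2019-01-01", "2019-04-01"))
```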

I also pull in the game details using a CGI script, so that I don't have to manually update the available options or date ranges. I could generate them at deploy time instead, but that increases the complexity of my deploy process and removes the static nature of my site, which I would like to keep.

Getting the Numbers

As for getting the numbers, I just scrape the PA Lottery's numbers RSS feed using the feedparser Python module. This is the kind of data that needs to be public to prevent corruption, and I'm glad they still publish via RSS. I then set up a daily scheduled event on my host to fetch the numbers and stick them in the database.
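The idea of the scraper can be sketched with the standard library alone. The item layout below is made up for illustration; the real PA Lottery feed is formatted differently, and a feed-parsing library handles the messy details:

```python
import re
import xml.etree.ElementTree as ET

# A made-up item in the general shape of a lottery results feed --
# the real feed's layout differs; this only illustrates the idea.
SAMPLE = """
<rss><channel>
  <item>
    <title>Pick 3 Day: 4-0-7</title>
    <pubDate>Tue, 02 Apr 2019 13:35:00 GMT</pubDate>
  </item>
</channel></rss>
"""

root = ET.fromstring(SAMPLE)
for item in root.iter("item"):
    title = item.findtext("title")
    game, digits = title.split(": ")
    # Pull the drawn numbers out of the title text.
    values = [int(d) for d in re.findall(r"\d+", digits)]
    print(game, values)
```

A daily cron entry on the host then runs the real script and inserts each drawing's values into the database.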


So I hope you found this article an informative look into my thought process when I build something. I hope you learned that you can still do things the old way and get good, mostly hands-off performance. Finally, I hope you take a look at my data visualization. I don't think it shows much yet, but as more data accumulates I think we'll eventually get a fairly interesting picture.

Written by Henry J Schmale on 2019 April 02