Analyzing Your Takeout Data From Google Fit
2017 June 27

Google lets you export your data from many of its products. In this post I show how to analyze my exported Google Fit data and pull out various pieces of information using basic bash command line tools.

Getting The Data

The first thing we need to do is download our data from Google Fit. To do this, go to takeout.google.com. From there you can select which products' data you want to export. For this project you only need your Google Fit data, so hit the select none button, then select Fit as the only product to export. Scroll to the bottom and hit export to get your data. Depending on how much data you have, the export might take a while; Google will email you once it is complete so you can download your data.

After you've downloaded the data, go ahead and unzip it into whatever directory you want. It will extract into the following folder layout.

Takeout
`- Product
  `- Data Folders.

Since we're working with Fit data, cd into "Takeout/Fit/Daily Aggregations" under your chosen extraction directory (quote the path, because it contains a space). Google Fit exports your data in two formats: one is a weird XML format, and the other is plain CSV. We want the CSVs, because they can be operated on really easily with command line tools.

Data Format

In Daily Aggregations, we find files named like YYYY-MM-DD.csv, containing the activities of that day, and Daily Summaries.csv, which contains the aggregated data for each day in the export. Each date-named file covers a single day broken into fifteen minute windows, while the summary file describes each day as a whole. So if you want more granular data, use the date-named files; otherwise just use the daily summaries, because that reduces the amount of data crawled.

You should look at the headers in your CSV files to see what was recorded for your activities. Each file has a set of base columns, followed by one column per activity. In the date-named files that column holds the duration of the activity during the time period, and in the summary file it holds the total amount of that activity completed.
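One quick way to see which column number goes with which header is to print the header row one field per line. The following is a minimal sketch: the helper name csvListHeaders and the sample header names are made up for illustration, so check the output against your own export. It also assumes no header contains a quoted comma.

```shell
# Hypothetical helper: print each header of a CSV file with its 1-based column number.
# 1 - csv file
csvListHeaders() {
    head -n 1 "$1" | tr ',' '\n' | awk '{print NR": "$0}'
}

# Demo on a made-up file; real Google Fit headers may differ.
sample=$(mktemp)
printf 'Date,Calories (kcal),Distance (m)\n2017-06-01,2058.5,4200\n' > "$sample"
csvListHeaders "$sample"
rm -f "$sample"
```

The number printed next to each header is the column number to use when selecting a column with cut.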

Examining Daily Summaries

Since I'm interested in my min, max and average calorie use per day, we'll use the summary file. We'll start with a basic pipeline that selects one column from a CSV file, excluding the file header. From there we can pipe it into our favorite tool for analyzing a bag of numbers, be that awk or R. R gives us more from the get go, so let's use that on this data.

# 1 - csv file
# 2 - column number to select, first is indexed as 1
function csvGetColumnValues() {
    tail -n +2 "$1" | cut -d, -f"$2"
}

With this function in place we can do a simple analysis on the summaries, or on whatever bag of numbers you're looking at, to get the interesting statistical data. Use it like the below.

# Summarize the calories
# calories is the 2nd column in this
csvGetColumnValues "Daily Summaries.csv" 2 | Rscript -e 'summary(scan("stdin",quiet=TRUE))'

The Rscript call just reads in all of the values from stdin, then uses the magic summary function to get some information out of them: the kind of basic statistics you learned in school, like the min, max, median and mean. As you can see, I burn a little more than 2000 calories a day on average.

Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
562.8  1948.3  2058.5  2087.0  2192.1  3352.1
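If R isn't installed, awk can produce a rougher version of the same summary. This is only a sketch, not what the post used: numSummary is a made-up helper name, and it reports just the min, max, mean and count rather than the quartiles that R's summary gives.

```shell
# Hypothetical helper: read one number per line on stdin, print min, max, mean and count.
numSummary() {
    awk '
        NF == 0 { next }                 # skip blank lines
        {
            v = $1 + 0                   # force a numeric value
            if (n == 0 || v < min) min = v
            if (n == 0 || v > max) max = v
            sum += v
            n++
        }
        END { if (n > 0) printf "min=%g max=%g mean=%g n=%d\n", min, max, sum / n, n }
    '
}

# Demo on three made-up values.
printf '1\n2\n3\n' | numSummary
```

It slots into the same pipeline in place of Rscript: csvGetColumnValues "Daily Summaries.csv" 2 | numSummary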

Examining Data on a Date By Date Basis

Now let's look at how one might calculate a statistic derived from the date-named files. I've found that each of the date files uses a range-based approach, where the day is broken into fifteen minute increments. Let's say I want to figure out how many calories I burn during the average fifteen minute period. Let's use the function from above and some find magic.

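# Run this from inside the Daily Aggregations directory.
# We're looking at data from the 21st century, so every date file starts with "20"
FILES=$(find . -name '20*.csv')
# Print column 3 of every date file. From here we can pipe into awk or R,
# but let's use awk to find the mean.
# $FILES is left unquoted so it splits into one file per word; that is safe
# here because the date-named files have no spaces in their paths.
for f in $FILES
do
    csvGetColumnValues "$f" 3
done | awk '{s+=$1;n+=1}END{print s, n, s/n}'
# Prints the sum, the number of values, and the mean.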
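The same date-named files can also be summarized one day at a time. The sketch below totals a single column per file with awk alone. The helper name csvSumColumnPerFile and the sample header names and values are made up for illustration, and it assumes the fields contain no quoted commas.

```shell
# Hypothetical helper: sum one column in each CSV file given, skipping each header row.
# 1 - column number, remaining arguments - csv files
csvSumColumnPerFile() {
    local col=$1
    shift
    awk -F, -v col="$col" '
        FNR == 1 { next }                       # header row of each file
        { sum[FILENAME] += $col }
        END { for (f in sum) print f, sum[f] }
    ' "$@" | sort
}

# Demo with two made-up day files in a scratch directory.
dir=$(mktemp -d)
printf 'Start time,End time,Calories (kcal)\n00:00,00:15,20\n00:15,00:30,22.5\n' > "$dir/2017-06-01.csv"
printf 'Start time,End time,Calories (kcal)\n00:00,00:15,18\n' > "$dir/2017-06-02.csv"
(cd "$dir" && csvSumColumnPerFile 3 20*.csv)
rm -rf "$dir"
```

Each output line is a file name followed by that day's total, so the result can be piped onward like any other bag of numbers after cutting out the second field.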

Conclusion

You can build some pretty powerful pipelines using just the basic Unix tools and a little bit of R at times. Sure, we're not churning through gigabytes here, but it does give you an idea of how to use small tools to avoid building a big heavy cluster to crunch a relatively small amount of data.

*****
Written by Henry J Schmale on 2017 June 27