Parallelize Google Analytics API calls with R

by Carlo Bonini


As a digital analytics agency, we are always asked to gather data from multiple sources in order to get actionable insights and make smarter data-driven decisions.

Part of this process is getting data from Google Analytics, our clients’ main analytics platform.

How we do that is pretty easy: we use the wonderful RGoogleAnalytics package in R, a nice and simple API interface to Google Analytics data.

What sets us apart from the competition is the ability to work in highly complex analytics environments, where you have to deal with clients owning multiple websites, profiles, and views.

Nevertheless, 80% of the accounts we manage include lots of old, single-use websites with no data and no permissions, which puts us, the analysts, in a position where a single API call is not sufficient.

So we came up with a pretty interesting solution, and today we’re going to share it with you.

We took the RGoogleAnalytics package and parallelized the process of making multiple API calls.

Follow the process and you’re going to see the light of parallel programming:

First of all, we developed a for loop over multiple profiles:

The first important thing to do is get the list of all your profiles, which seems obvious but is actually tricky when you don’t have permissions for all of them.

Then the game becomes interesting, because we need to set up our loop. We do that by creating a couple of lists, which will be extremely useful later on:
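The setup described so far can be sketched as follows. This is a minimal illustration, not our production code: the `profiles` table is a stand-in for what `GetProfiles()` from RGoogleAnalytics would return (we assume `id` and `name` columns), and `fetch_profile()` is a hypothetical placeholder for a real query, which would normally wrap the package's `Init()`, `QueryBuilder()` and `GetReportData()` calls.

```r
# Stand-in for the profile table; in a real session this would come from
# RGoogleAnalytics::GetProfiles(token). Column names are assumptions.
profiles <- data.frame(
  id   = c("1001", "1002", "1003"),
  name = c("Main site", "Blog", "Old campaign"),
  stringsAsFactors = FALSE
)

# Hypothetical placeholder for a single API call. It only returns dummy
# data so the loop can be run without credentials.
fetch_profile <- function(profile_id, start_date, end_date) {
  data.frame(
    date     = start_date,
    sessions = as.numeric(profile_id),
    stringsAsFactors = FALSE
  )
}

# Pre-allocate one list slot per profile, then fill it sequentially.
results <- vector("list", nrow(profiles))
for (i in seq_len(nrow(profiles))) {
  results[[i]] <- fetch_profile(profiles$id[i], "2015-01-01", "2015-01-31")
}

# Name each element after its profile ID so it can be looked up later.
names(results) <- profiles$id
```

Keeping one data frame per profile in a named list makes it easy to swap the sequential loop for a parallel one later without changing the shape of the output.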

Since we would like to loop over both profiles and dates, we created a data frame containing multiple date ranges, so that you can choose the dates you are interested in.
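One possible way to build such a date table in base R is shown here. The column names `start` and `end` and the six-month range are assumptions chosen for illustration.

```r
# First day of each month in the range we care about.
month_starts <- seq(as.Date("2015-01-01"), by = "month", length.out = 6)

# The last day of a month is the day before the first day of the next one.
month_ends <- seq(as.Date("2015-02-01"), by = "month", length.out = 6) - 1

# Store the dates as "YYYY-MM-DD" strings, one row per month.
all_months <- data.frame(
  start = format(month_starts),
  end   = format(month_ends),
  stringsAsFactors = FALSE
)
```

To query only some of the months, subset the rows, for example `all_months[1:3, ]`.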

Finally, we are ready to put all eight cores of our processor to work, building one huge list from the results of our API calls.
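A parallel loop over profiles might look like the sketch below. It uses `mclapply()` from base R's `parallel` package, which is one of several ways to do this and not necessarily the one we used. `fetch_profile()` is again a hypothetical stand-in for the real API call, and here it deliberately fails for one profile to mimic a view with no data or no permissions.

```r
library(parallel)

profiles <- data.frame(
  id   = c("1001", "1002", "1003", "1004"),
  name = c("Main site", "Blog", "Old campaign", "Shop"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for one API call; profile 1003 simulates a failure.
fetch_profile <- function(profile_id, start_date, end_date) {
  if (profile_id == "1003") stop("no data for this period")
  data.frame(date = start_date, sessions = as.numeric(profile_id),
             stringsAsFactors = FALSE)
}

# mclapply() relies on forking, which is unavailable on Windows,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Each profile is fetched in a worker; try() turns an error into a
# "try-error" object instead of aborting the whole run.
results <- mclapply(profiles$id, function(id) {
  try(fetch_profile(id, "2015-01-01", "2015-01-31"), silent = TRUE)
}, mc.cores = n_cores)
names(results) <- profiles$id

# TRUE for profiles whose call succeeded.
ok <- !vapply(results, inherits, logical(1), what = "try-error")
```

On an eight-core machine you would raise `n_cores`, keeping the API's rate limits in mind.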

 

Some interesting stuff in this code:

1. We use lists so that, for each profile, we have a data frame with the query results for that profile.

2. Using the try() function, we keep errors from stopping the calls, and the loop goes on even if the API returns 0 results for the chosen period.

3. We didn’t face any problems calling the Analytics API 4 times per second, but the API policy is to avoid more than 1 call per second.

4. We decided to add a variable called ProfileName, which contains the profile name combined with the profile ID, so that we are able to distinguish profiles.

5. It’s mind-blowingly fast when you have to gather millions of rows in a couple of minutes.
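Points 1, 2 and 4 can be illustrated with a small post-processing sketch. The `results` list is a hand-made stand-in for the output of the loop, with one failed call, and the " - " separator used in `ProfileName` is an assumption.

```r
profiles <- data.frame(
  id   = c("1001", "1002"),
  name = c("Main site", "Blog"),
  stringsAsFactors = FALSE
)

# Stand-in for the list returned by the loop: one success, one failure.
results <- list(
  "1001" = data.frame(date = "2015-01-01", sessions = 120,
                      stringsAsFactors = FALSE),
  "1002" = try(stop("no results for this period"), silent = TRUE)
)

# Keep only the calls that did not fail.
ok <- !vapply(results, inherits, logical(1), what = "try-error")

# Tag every surviving data frame with "<profile name> - <profile id>".
labelled <- lapply(names(results)[ok], function(id) {
  df <- results[[id]]
  df$ProfileName <- paste(profiles$name[profiles$id == id], id, sep = " - ")
  df
})

# Stack everything into one data frame for analysis.
combined <- do.call(rbind, labelled)
```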

If you want to loop over both profiles and dates, here’s the code to do that as well:

 

Use the all_months data frame to provide the range of dates you need.
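A double loop over profiles and dates could be sketched like this, assuming an `all_months` data frame with `start` and `end` columns and the same hypothetical `fetch_profile()` stand-in for the real API call. Instead of nesting two loops, it builds one job per profile and month combination and hands the jobs to `mclapply()`.

```r
library(parallel)

profiles <- data.frame(
  id   = c("1001", "1002"),
  name = c("Main site", "Blog"),
  stringsAsFactors = FALSE
)

all_months <- data.frame(
  start = c("2015-01-01", "2015-02-01"),
  end   = c("2015-01-31", "2015-02-28"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for one API call.
fetch_profile <- function(profile_id, start_date, end_date) {
  data.frame(start = start_date, end = end_date,
             sessions = as.numeric(profile_id), stringsAsFactors = FALSE)
}

# One row per (profile, month) pair, stored as row indices.
jobs <- expand.grid(profile = seq_len(nrow(profiles)),
                    month   = seq_len(nrow(all_months)))

# Forking is unavailable on Windows, so use a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

results <- mclapply(seq_len(nrow(jobs)), function(j) {
  p <- jobs$profile[j]
  m <- jobs$month[j]
  try(fetch_profile(profiles$id[p], all_months$start[m], all_months$end[m]),
      silent = TRUE)
}, mc.cores = n_cores)

# Label each result as "<profile id>_<month start>".
names(results) <- paste(profiles$id[jobs$profile],
                        all_months$start[jobs$month], sep = "_")
```

Flattening the two loops into a single job table gives the workers more, smaller tasks to share, which tends to balance the load better than parallelizing over profiles alone.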