6 Setup

Before we start with the practical guides, it is important to ensure we correctly install and setup the required dependencies.

6.1 R

6.1.1 About R

The R programming language (R Core Team 2018) is a popular and open-source tool for data analysis and statistical computing.

We will use R throughout this paper for the practical examples.

6.1.2 Install R

To install R, head over to https://www.r-project.org/ and follow the instructions.

It will prompt you to choose a CRAN Mirror. This is an archive server that contains the latest R version.

There are instructions to install R on Linux, Mac and Windows.

This will install the base R software, which on its own is just a command line-like interface.

6.1.3 Install RStudio

To get the most out of R we recommend installing the RStudio (“RStudio” 2019) integrated development environment. It is an environment that provides a console with a range of features to enrich the use of R.

To download, visit https://www.rstudio.com/products/rstudio/ and follow the instructions for your system.

The end results should look similar to this:

6.1.4 Install Required R Packages

R has a ecosystem of open source packages that work like add-ins. These packages typically contain specialised functionality that may be required when performing analysis. You can install R packages from the R console and they will be downloaded onto your system and available whenever you use R.

The main R packages we will use in this paper are:

tidyverse - The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. (Wickham 2017)
bigrquery - The bigrquery package makes it easy to work with data stored in Google BigQuery by allowing you to query BigQuery tables and retrieve metadata about your projects, data sets, tables, and jobs. (Wickham 2019)
ChannelAttribution - Implements a Markov Model for the Online Multi-Channel Attribution Problem. (Altomare and Loris 2019)
survival - Contains the core survival analysis routines, including definition of Surv objects, Kaplan-Meier and Aalen-Johansen (multi-state) curves, Cox models, and parametric accelerated failure time models. (Therneau 2018)
CausalImpact - Implements a Bayesian approach to causal impact estimation in time series, as described in Brodersen et al. (2015)

To install these packages, and some other supporting packages used in this paper, you can run the command below

install.packages(c("tidyverse", "bigrquery", "ChannelAttribtion", "survival",
                   "lubridate", "survminer", "DBI", "CausalImpact"))

6.2 BigQuery

The first step to more advanced modelling is extracting and cleaning the data. Here we will go through a basic setup and familiarisation guide to get your Analytics data from BigQuery. It is expected that you have some experience using SQL and knowledge of basic database operations.

6.2.1 Setup

If you are an Analytics 360 customer, you can setup BigQuery for use with Google Analytics data. Refer to the Google Help Pages for detailed instructions.

If you don’t yet have Analytics 360, you can still benefit from these practical code examples by using the free Google Analytics Sample Dataset. This is a complete Google Analytics 360 data set from the Google Merchant Store, a real eCommerce platform.

To access the data set follow the instructions provided by Google:

Go to http://bigquery.cloud.google.com.
If you’re new to BigQuery, or you don’t have a project set up yet, you’ll need to create a project.
Select this link to go directly to the data set.
Click Query Table to run a query.

In the future you can access the data set within BigQuery by selecting the bigquery-public-data project from the left-hand navigation panel, then select the ga_sessions table under the google_analytics_sample data set.

Please be mindful that using BigQuery can incur a cost. Please refer to Google’s billing and quota information before use.

6.2.2 Export Schema

The way Google Analytics structures its data export into BigQuery is known as the export schema. This schema is able to take advantage of data structures that may be counter-intuitive for users who are familiar with normalised relational databases. BigQuery supports de-normalised tables, where instead of joining lots of flat, normalised tables, you can have one table with nested records.

In the bigquery data set, there is one table per day, defined at the session level that contains all Analytics related data nested within, such as hits and events. Below we will demonstrate some basic queries on this data.

For more information you can refer to the BigQuery Export Schema

6.2.3 Running Queries

6.2.3.1 Hello World

This is a basic query to run that simply sums the page views over a given day. You will note the use of #standardSQL in the header. This lets bigquery know you are using the ‘standard’ SQL flavour over it’s older ‘legacy’ flavour.

/* Hello World Example */
#standardSQL
SELECT SUM(totals.pageviews) as TotalPageviews
    
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170101`

Row	TotalPageviews
1	5362

6.2.3.2 Using Table Ranges

As the GA Export Schema provides one table per day, to scan across date ranges we need to specify a table range to query.

Notice the * in the FROM clause and the inclusion of _TABLE_SUFFIX.

/* Using Table Ranges */
#standardSQL
SELECT  date, 
        SUM(totals.pageviews) as TotalPageviews

FROM  `bigquery-public-data.google_analytics_sample.ga_sessions_*`

WHERE _TABLE_SUFFIX BETWEEN '20170725' AND '20170801'

GROUP BY date

ORDER BY date

6.2.3.3 Hit Level Data

Accessing ‘hit’ level data, that is, individual hits within each user’s session requires ‘un-nesting’ the hits data and joining it to the ga_sessions_ table

/* Now at the hit level */
#standardSQL
SELECT  fullVisitorId,  
        visitNumber, 
        date, 
        totals.hits, 
        hits.hitNumber, 
        hits.type, 
        hits.time, 
        hits.page.pagePath
        
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170101`,
  UNNEST(hits) as hits
  
WHERE fullVisitorId = '6170732910727440668' /*Random Visitor Selected*/

Now that we are equipped with the basics we can further our understanding in future chapters to extract the data we need for modelling.

6.2.4 Using R with BigQuery

We can also interact with the BigQuery API using the bigrquery package in R.

To set up we load this package, along with the DBI package which helps with database interfacing.

library(bigrquery)
library(DBI)

Next we establish a connection to BigQuery. Here you would replace bq_test_project() with the project id you have established in Google Cloud Console.

con <- dbConnect(
  bigrquery::bigquery(),
  project = "bigquery-public-data",
  dataset = "google_analytics_sample",
  billing = bq_test_project()
)

We can save any SQL queries and forward these for execution as text strings.

sql <- "
  /* Using Table Ranges */
  #standardSQL
  SELECT  date, 
        SUM(totals.pageviews) as TotalPageviews
  
  FROM  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  
  WHERE _TABLE_SUFFIX BETWEEN '20170725' AND '20170801'
  
  GROUP BY date
  
  ORDER BY date
  "

Here the query is executed and saved as a temp table. The bq_table_download function returns the results to the console.

query <- bq_project_query(x = bq_test_project(), sql)
bq_table_download(query)

date	TotalPageviews
20170725	10728
20170726	11200
20170727	10175
20170728	9359
20170729	6293
20170730	7258
20170731	11115
20170801	10939

References

Altomare, Davide, and David Loris. 2019. ChannelAttribution: Markov Model for the Online Multi-Channel Attribution Problem. https://CRAN.R-project.org/package=ChannelAttribution.

Brodersen, Kay H., Fabian Gallusser, Jim Koehler, Nicolas Remy, and Steven L. Scott. 2015. “Inferring Causal Impact Using Bayesian Structural Time-Series Models.” Annals of Applied Statistics 9: 247–74.

R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

“RStudio.” 2019. RStudio. https://www.rstudio.com/products/rstudio/.

Therneau, Terry M. 2018. Survival: Survival Analysis. https://CRAN.R-project.org/package=survival.

Wickham, Hadley. 2017. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.

Wickham, Hadley. 2019. Bigrquery: An Interface to Google’s ’Bigquery’ ’Api’. https://CRAN.R-project.org/package=bigrquery.