7 Extracting Data from BigQuery

The first step before we implement our own attribution models in R is to extract the data from BigQuery. Here we assume you are a Google Analytics 360 customer and have set up the BiqQuery export. If you are using some other form of multi-touch conversion pathway data, you can skip this section. For this example we will be using the bigquery-public-data data set which contains a sample of Analytics 360 data from the Google Merchant Store (See Section 6.2 for setup info).

7.1 About the data

When we conduct this analysis, a key consideration is how much data to use.

It’s usually not feasible, or sensible to use all available clickstream data in your organisation. Here we have made some decisions around how much data we want to base our models on.

  • Extract the last 30 days of sessions
  • Keep both converting and non-converting users
  • Look back 7 days of touch points from the visitor’s most recent session.
  • If the visitor converted, disregard any subsequent visits.

These numbers are arbitrary and should be considered in light of your organisation. Some websites with a longer conversion funnel may want to analyse 6 months of data, using a look back window of one month for example.

7.2 Get full event log

First we want to extract a log of all sessions during a time period. Here we have selected 30 days. Note how we use the * suffix in the FROM clause so we can scan multiple dates at once.

We have selected the fullVisitorId as the unique identifier. A time stamp of the visitStartTime and the channelGrouping which will show the Default Channel Group associated with an end user’s session for this View.

   SELECT fullVisitorId, 
          TIMESTAMP_SECONDS(SAFE_CAST(visitStartTime AS INT64))  AS visitStartTime,  
          channelGrouping

    FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`

    WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170131'

    ORDER BY fullVisitorId, visitStartTime

7.3 Identify those who converted

Next we identify all visitors who made a transaction during our date range, and the date of this transaction. We have specified a ‘transaction’ as our conversion goal, however there is no reason why this can’t be another goal, such as a sign up event etc.

    SELECT fullVisitorId, 
           MIN(TIMESTAMP_SECONDS(SAFE_CAST(visitStartTime AS INT64)))  AS purchasetime

    FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`

    WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170131'
     AND totals.transactions IS NOT NULL

    GROUP BY fullVisitorId

    ORDER BY fullVisitorId

7.4 Converting touchpoints in last 7 days

Now we can identify our converting path touch points. These are taken from the event log we defined in Step 1, however we ensure only visitors who converted appear in our results. Furthermore we restrict the results to only those touch points that occur on or before the purchase time (obviously touch points after this don’t influence the conversion outcome). Finally we implement a look-back period of 7 days, so any touch points older than this are disregarded.

  SELECT a.*,
         'conversion' AS outcome

  FROM event_log a
    INNER JOIN conversions b ON a.fullVisitorId = b.fullVisitorId  

  WHERE a.visitStartTime <= b.purchasetime 
    AND DATE_DIFF(DATE(b.purchasetime), DATE(a.visitStartTime), DAY)  <= 7

7.5 Non-converting touchpoints in last 7 days

For those users who did not convert, we need to identify the most recent session start time so we can calculate where to start our 7 day look back period.


SELECT fullVisitorId, 
  MAX(visitStartTime) AS maxvisittime
  
  FROM event_log
  
  GROUP BY fullVisitorId

Now we can query the touch points of our non-converting visitors and apply our 7 day time window.

SELECT a.*,
         'non_conversion' AS outcome

  FROM event_log a
    INNER JOIN maxtimes b ON a.fullVisitorId = b.fullVisitorId 

  WHERE a.fullVisitorId NOT IN (SELECT DISTINCT fullVisitorID FROM conversions)
   AND  DATE_DIFF(DATE(b.maxvisittime), DATE(a.visitStartTime), DAY) <= 7

7.6 Complete Query

We can bundle this together to run as one query:

#standardSQL

WITH 
  /* EVENT LOG */
  event_log AS (
    SELECT fullVisitorId, 
          TIMESTAMP_SECONDS(SAFE_CAST(visitStartTime AS INT64))  AS visitStartTime,  
          channelGrouping

    FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`

    WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170131'

    ORDER BY fullVisitorId, visitStartTime
  ),

 /* VISITORS WHO CONVERTED */
 conversions AS (
    SELECT fullVisitorId, 
           MIN(TIMESTAMP_SECONDS(SAFE_CAST(visitStartTime AS INT64)))  AS purchasetime

    FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`

    WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170131'
     AND totals.transactions IS NOT NULL

    GROUP BY fullVisitorId

    ORDER BY fullVisitorId
  ),
  
  /* LATEST VISIT TIME FOR ALL VISITORS */
   maxtimes AS (
    SELECT fullVisitorId, 
    MAX(visitStartTime) AS maxvisittime
    
    FROM event_log
    
    GROUP BY fullVisitorId
  )

  /*== MAIN QUERY THAT UNIONS CONVERTING AND NON CONVERTING PATHS WITHA GIVEN TIME WINDOW ==*/
  
  /* CONVERTING PATHS */
  SELECT a.*,
         'conversion' AS outcome

  FROM event_log a
    INNER JOIN conversions b ON a.fullVisitorId = b.fullVisitorId  

  WHERE a.visitStartTime <= b.purchasetime 
    AND DATE_DIFF(DATE(b.purchasetime), DATE(a.visitStartTime), DAY)  <= 7

UNION ALL

  /* NON CONVERTING PATHS */
  SELECT a.*,
         'non_conversion' AS outcome

  FROM event_log a
    INNER JOIN maxtimes b ON a.fullVisitorId = b.fullVisitorId 

  WHERE a.fullVisitorId NOT IN (SELECT DISTINCT fullVisitorID FROM conversions)
   AND  DATE_DIFF(DATE(b.maxvisittime), DATE(a.visitStartTime), DAY) <= 7