9 Algorithmic Methods

Now we move into algorithmic methods for modelling marketing attribution.

These methods use more advanced statistical algorithms to build a data-driven model that assigns conversion credit to channels in a multi-touchpoint conversion pathway.

9.1 Markov Chains

The first method we will review is Markov chains. We introduced these in Section 4.3.3.
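As a brief refresher, Markov attribution is usually described in terms of removal effects: how much the overall conversion probability falls when a channel is taken out of the chain. The base R sketch below illustrates that idea on a toy chain. The two channels and every transition probability are invented for illustration, and the helper names conv_prob() and remove_channel() are hypothetical. It is a sketch of the general technique, not a description of the package's internals.

```r
# Toy first-order Markov chain over two channels, A and B.
# All transition probabilities are invented for illustration.
states <- c("start", "A", "B", "conv", "null")
P <- matrix(0, nrow = 5, ncol = 5, dimnames = list(states, states))
P["start", c("A", "B")]        <- c(0.6, 0.4)
P["A", c("B", "conv", "null")] <- c(0.3, 0.2, 0.5)
P["B", c("A", "conv", "null")] <- c(0.2, 0.1, 0.7)
P["conv", "conv"] <- 1  # absorbing state: converted
P["null", "null"] <- 1  # absorbing state: left without converting

# Probability of eventually reaching "conv" when starting from "start".
conv_prob <- function(P) {
  transient <- c("start", "A", "B")
  Q <- P[transient, transient]                 # transient -> transient
  R <- P[transient, c("conv", "null")]         # transient -> absorbing
  B <- solve(diag(length(transient)) - Q, R)   # absorption probabilities
  dimnames(B) <- list(transient, c("conv", "null"))
  B["start", "conv"]
}

# Remove a channel by redirecting every transition into it to "null".
remove_channel <- function(P, channel) {
  P[, "null"] <- P[, "null"] + P[, channel]
  P[, channel] <- 0
  P
}

base <- conv_prob(P)

# Removal effect: relative drop in conversion probability without the channel.
removal_effect <- sapply(c(A = "A", B = "B"),
                         function(ch) 1 - conv_prob(remove_channel(P, ch)) / base)

# Normalise so the shares sum to 1; multiplying by total conversions
# would give attributed conversions per channel.
credit_share <- removal_effect / sum(removal_effect)
print(round(credit_share, 3))
```

In this toy chain, channel A receives the larger share because removing it cuts off more of the routes that lead to conversion.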

First we load the required packages. In this case we again use ChannelAttribution as well as the tidyverse and lubridate packages for data manipulation and date handling. We also read in our raw results from BigQuery, which contain the event log and timestamps of visits, complete with channel and outcome.

library(tidyverse)
library(lubridate)
library(ChannelAttribution)

paths_raw <- read_csv('bigquery/bq-results.csv')
fullVisitorId        visitStartTime           channelGrouping  outcome
07184911138250312    2017-01-18 06:57:50 UTC  (Other)          non_conversion
07184911138250312    2017-01-18 07:40:31 UTC  (Other)          non_conversion
07184911138250312    2017-01-18 08:18:50 UTC  (Other)          non_conversion
3112985461863519829  2017-01-25 20:42:23 UTC  (Other)          non_conversion
4720404071621394560  2017-01-18 07:02:40 UTC  (Other)          non_conversion
6060076679741207514  2017-01-18 06:59:27 UTC  (Other)          non_conversion
0003297619580760716  2017-01-08 05:50:50 UTC  Direct           non_conversion
00035794135966385    2017-01-20 12:46:25 UTC  Direct           non_conversion
0004867638405459898  2017-01-15 14:22:26 UTC  Direct           non_conversion
0005604256236421547  2017-01-24 21:04:26 UTC  Direct           non_conversion

Next we need to transform the data. We use the same procedure as in Section 7.1. The resulting paths data frame has one row per conversion path, with the total number of conversions and non-conversions.

path                             conversion  non_conversion
Referral                         178         4015
Organic Search                   137         21659
Direct                           71          9308
Referral > Referral              55          553
Organic Search > Organic Search  39          1507
Direct > Direct                  26          803
Referral > Referral > Referral   26          132
Paid Search                      21          1348
Display                          12          256
Direct > Referral                11          60
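The transformation itself is not repeated here, so the following is only a minimal sketch of how an event log with these columns could be collapsed into a table of this shape. It makes two assumptions that may differ from Section 7.1: each visitor contributes a single path, and a visitor counts as a converter if any of their visits converted. The toy_raw data is invented for illustration.

```r
library(dplyr)

# Toy event log with the same columns as paths_raw (values invented).
toy_raw <- tibble(
  fullVisitorId   = c("a", "a", "b", "c", "c"),
  visitStartTime  = as.POSIXct(c("2017-01-01 10:00:00", "2017-01-02 09:00:00",
                                 "2017-01-01 12:00:00",
                                 "2017-01-03 08:00:00", "2017-01-03 18:00:00"),
                               tz = "UTC"),
  channelGrouping = c("Direct", "Referral", "Direct", "Direct", "Referral"),
  outcome         = c("non_conversion", "conversion", "non_conversion",
                      "non_conversion", "non_conversion")
)

toy_paths <- toy_raw %>%
  # Order each visitor's touchpoints chronologically
  arrange(fullVisitorId, visitStartTime) %>%
  group_by(fullVisitorId) %>%
  # One path string per visitor, plus a flag for whether they converted
  summarise(path = paste(channelGrouping, collapse = " > "),
            converted = any(outcome == "conversion"),
            .groups = "drop") %>%
  # Count converters and non-converters for each distinct path
  group_by(path) %>%
  summarise(conversion = sum(converted),
            non_conversion = sum(!converted),
            .groups = "drop")

print(toy_paths)
```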

We now call the markov_model() function from the ChannelAttribution package. It takes the data frame, the name of the column that contains the conversion path, the names of the columns that hold the number of conversions and non-conversions, and the order of the Markov model.

Below we can see that its output is a list of distinct channels with the total attributed conversions per channel. The channel that receives the most credit is ‘Referral’, followed by ‘Organic Search’.

As a marketer we could multiply these by the average conversion value to get a total attributed value for each channel. By comparing this to the cost of marketing in each channel we get a robust calculation for ROI.

In fact, if the actual sales revenue per customer is recorded we can go one step further and have this model calculate the attributed value without having to estimate using the average value. This is handled with the argument var_value.

fit_m <- markov_model(Data = paths, 
                      var_path = 'path', 
                      var_conv = 'conversion', 
                      var_null = 'non_conversion', 
                      order = 1)

fit_m
##     channel_name total_conversions
## 1       Referral       289.1465114
## 2 Organic Search       209.6476986
## 3         Direct       107.5363407
## 4    Paid Search        27.3277169
## 5        Display        18.8606878
## 6         Social         7.5544153
## 7        (Other)         0.1014015
## 8     Affiliates         1.8252278
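The ROI calculation described earlier can be sketched from this output. The attributed conversions below are copied from the printed results for two paid channels. The average conversion value and the channel costs are hypothetical placeholders, not figures from the data.

```r
# Attributed conversions taken from the model output above.
attributed <- data.frame(
  channel_name      = c("Paid Search", "Display"),
  total_conversions = c(27.3277169, 18.8606878)
)

# Hypothetical inputs: replace with real figures.
avg_conversion_value <- 50                 # assumed average value per conversion
attributed$cost      <- c(1000, 800)       # assumed spend per channel

# Attributed value and return on investment per channel.
attributed$attributed_value <- attributed$total_conversions * avg_conversion_value
attributed$roi <- (attributed$attributed_value - attributed$cost) / attributed$cost

print(attributed)
```

Channels with no direct media cost, such as Direct, would need a different cost basis before a ratio like this is meaningful.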

We can also iterate on this by fitting first-, second- and third-order Markov models.

Below we display a chart of the results. As we can see, there is not much difference in the results.

fit_mult <- map_dfr(.x = c(1, 2, 3), 
                    .f = ~markov_model(Data = paths, 
                                       var_path = 'path', 
                                       var_conv = 'conversion', 
                                       var_null = 'non_conversion', 
                                       order = .x), 
                    .id = "order")
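One way to draw such a comparison is a dodged bar chart with one bar per model order. The sketch below is an assumption about how the chart could be built, not the original plotting code. The plot_orders() helper is hypothetical, and the toy data frame only mimics the columns of fit_mult with invented numbers.

```r
library(ggplot2)

# Hypothetical helper: expects columns order, channel_name and total_conversions,
# as produced by the map_dfr() call above.
plot_orders <- function(fit_mult) {
  ggplot(fit_mult, aes(x = channel_name, y = total_conversions, fill = order)) +
    geom_col(position = "dodge") +
    coord_flip() +
    labs(x = NULL, y = "Attributed conversions", fill = "Markov order")
}

# Toy stand-in for fit_mult; the numbers are invented, not model results.
toy_fit_mult <- data.frame(
  order             = rep(c("1", "2", "3"), each = 2),
  channel_name      = rep(c("Channel A", "Channel B"), times = 3),
  total_conversions = c(60, 40, 58, 42, 59, 41)
)

p <- plot_orders(toy_fit_mult)
```

With the real results, plot_orders(fit_mult) would show the three orders side by side for each channel.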

9.2 Survival Analysis

The next model we demonstrate is survival analysis. Here we define ‘survival’ as non-conversion and the event of interest is when a customer converts.

As usual we start by loading the data and the required packages.

library(tidyverse)
library(lubridate)
library(survival)
library(survminer)

paths_raw <- read_csv('bigquery/bq-results.csv')
fullVisitorId        visitStartTime           channelGrouping  outcome
07184911138250312    2017-01-18 06:57:50 UTC  (Other)          non_conversion
07184911138250312    2017-01-18 07:40:31 UTC  (Other)          non_conversion
07184911138250312    2017-01-18 08:18:50 UTC  (Other)          non_conversion
3112985461863519829  2017-01-25 20:42:23 UTC  (Other)          non_conversion
4720404071621394560  2017-01-18 07:02:40 UTC  (Other)          non_conversion
6060076679741207514  2017-01-18 06:59:27 UTC  (Other)          non_conversion
0003297619580760716  2017-01-08 05:50:50 UTC  Direct           non_conversion
00035794135966385    2017-01-20 12:46:25 UTC  Direct           non_conversion
0004867638405459898  2017-01-15 14:22:26 UTC  Direct           non_conversion
0005604256236421547  2017-01-24 21:04:26 UTC  Direct           non_conversion

The data transformation steps here are a little different.

We want to condense our data into one row per customer. For our analysis, three key pieces of information are required:

  1. The interval of time:
     a. For converting customers, between the first visit and the purchase time.
     b. For non-converting customers, between the first visit and the last recorded visit.
  2. The outcome: 1 for converted, 0 for non-converted.
  3. The channel used to convert.

Two points are worth noting. First, this analysis differs slightly from the earlier ones: we are not strictly attributing credit between channels, but rather estimating, at various points in time, the probability that a customer converts through a given channel.

Second, we censor customers who have not converted at the time of their most recent visit. In effect we declare these customers lost to follow-up. An alternative would be to measure the time interval for non-converters right up to the end of the analysis period. Both approaches are reasonable, but given that we constrained our look-back period to just 7 days we will go with the first.

surv_data <- paths_raw %>% 
  mutate(visitStartTime = ymd_hms(visitStartTime)) %>% 
  group_by(fullVisitorId) %>% 
  # First and last visit per customer
  mutate(mindate = min(visitStartTime),
         maxdate = max(visitStartTime)) %>% 
  # Keep only the final visit, which carries the outcome and the last channel
  filter(visitStartTime == maxdate) %>% 
  # Time from first to last visit, explicitly in hours
  mutate(time = difftime(maxdate, mindate, units = "hours"),
         status = ifelse(outcome == "conversion", 1, 0)) %>% 
  dplyr::select(fullVisitorId, time, status, channelGrouping)
## # A tibble: 10 x 4
## # Groups:   fullVisitorId [10]
##    fullVisitorId       time             status channelGrouping
##    <chr>               <drtn>            <dbl> <chr>          
##  1 2019849659101397972 26.6350000 hours      0 Organic Search 
##  2 0602100966737642308  0.0000000 hours      0 Organic Search 
##  3 6693833242689696465  0.8844444 hours      0 Organic Search 
##  4 7956907751868983697  0.1047222 hours      0 Paid Search    
##  5 9483710368265788494  0.0000000 hours      0 Organic Search 
##  6 2139227959373761194  0.0000000 hours      0 Social         
##  7 2973441778247906125  0.0000000 hours      0 Organic Search 
##  8 6325419508306646730  0.0000000 hours      0 Direct         
##  9 9873658775382341673  0.0000000 hours      0 Organic Search 
## 10 7744114977873417883  0.0000000 hours      0 Organic Search
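For comparison, the two censoring rules for non-converters can be sketched side by side in base R: censoring at the last recorded visit (the rule used here) and censoring at the end of the analysis period. The toy customers and the period_end value are hypothetical and exist only to show how the interval changes.

```r
# Toy per-customer summary, invented for illustration.
toy <- data.frame(
  first_visit = as.POSIXct(c("2017-01-10 08:00:00", "2017-01-20 00:00:00"), tz = "UTC"),
  last_visit  = as.POSIXct(c("2017-01-12 08:00:00", "2017-01-20 06:00:00"), tz = "UTC"),
  status      = c(1, 0)  # 1 = converted, 0 = not converted
)

# Hypothetical end of the analysis period.
period_end <- as.POSIXct("2017-01-31 00:00:00", tz = "UTC")

# Rule used in this section: everyone is measured up to their last visit.
toy$time_last_visit <- as.numeric(
  difftime(toy$last_visit, toy$first_visit, units = "hours"))

# Alternative rule: non-converters are followed to the end of the period.
end_time <- toy$last_visit
end_time[toy$status == 0] <- period_end
toy$time_period_end <- as.numeric(
  difftime(end_time, toy$first_visit, units = "hours"))

print(toy[, c("status", "time_last_visit", "time_period_end")])
```

Converters get the same interval under both rules. Only the censoring time of non-converters changes, which affects how long they remain in the at-risk set.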

Next we create a special object called a survival object using the Surv function.

surv_object <- Surv(time = surv_data$time, event = surv_data$status)

We can now compute our estimate for a survival curve using the survfit function.

We include a grouping variable of the converting channel. This will calculate one survival curve per group so we have a basis for comparison.

fit_surv <- survfit(surv_object ~ channelGrouping, data = surv_data)
## Call: survfit(formula = surv_object ~ channelGrouping, data = surv_data)
## 
##                                    n events median 0.95LCL 0.95UCL
## channelGrouping=(Other)            2      0     NA      NA      NA
## channelGrouping=Affiliates       963      0     NA      NA      NA
## channelGrouping=Direct         10555    106     NA     181      NA
## channelGrouping=Display          408     20     NA      NA      NA
## channelGrouping=Organic Search 24208    202     NA      NA      NA
## channelGrouping=Paid Search     1769     26     NA      NA      NA
## channelGrouping=Referral        5372    302     NA     174      NA
## channelGrouping=Social          9764      6     NA      NA      NA

The results are best viewed graphically.

Note that ‘survival’ in this context means ‘non-conversion’. The survival curve shows, for each time along the x-axis, the probability that a customer has not yet converted through a particular channel.

What we want is the complement of this, that is, the cumulative event plot.
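The relationship between the two curves can be sketched on a small example: the cumulative event probability at each time is one minus the Kaplan-Meier survival estimate. The toy data below is invented for illustration and is fitted without a grouping variable to keep the sketch short.

```r
library(survival)

# Toy data, invented for illustration: time in hours, status 1 = converted.
toy <- data.frame(time   = c(2, 5, 5, 8, 12, 20),
                  status = c(1, 1, 0, 0, 1, 0))

fit_toy <- survfit(Surv(time, status) ~ 1, data = toy)

# Survival is the probability of not yet having converted by each time;
# the cumulative event probability is its complement.
cum_event <- data.frame(time                  = fit_toy$time,
                        survival              = fit_toy$surv,
                        cumulative_conversion = 1 - fit_toy$surv)

print(cum_event)
```

For plotting, survminer's ggsurvplot() accepts fun = "event", which draws this complement directly from a fitted survfit object.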

We can see that the Referral channel achieves high levels of conversion very quickly, which makes it a valuable channel. It is followed by Display. As time goes on, however, from about 150 hours (~6 days) after the first visit, channels such as ‘Direct’ and ‘Organic Search’ show a strong conversion probability. This suggests that advertising and referral traffic are essential to raising awareness early on, and that brand recall is then high, with customers returning days later via direct and organic channels.