8 Heuristic Models
Here we implement some simple heuristic (rule-based) attribution methods as a baseline, using the ChannelAttribution R package.
8.1 Transform Data
The ChannelAttribution package requires the data to be structured in a particular way, with one row per path, in the form:
path | conversion | non-conversions |
---|---|---|
direct > social > search | 10 | 154 |
direct > direct > direct | 2 | 234 |
referral > direct | 7 | 187 |
Here the touch points are combined into a single path string, separated by the > character. For each path we aggregate the total number of conversions that resulted from it, along with the number of non-conversions.
If you are using marketing data from a system other than BigQuery, you will need to prepare your data per the above.
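As a minimal sketch of the target structure, the example table above could be built by hand as a data frame. The object name `paths_example` is hypothetical, the counts are simply those shown in the table, and the column name `non_conversion` is an assumption chosen to match the outcome labels used later in this chapter.

```r
library(tibble)

# Hypothetical hand-built input in the shape described above:
# one row per unique path, with aggregated outcome counts.
paths_example <- tribble(
  ~path,                      ~conversion, ~non_conversion,
  "direct > social > search", 10,          154,
  "direct > direct > direct", 2,           234,
  "referral > direct",        7,           187
)

paths_example
```

Any source system will do, provided each path ends up as one delimited string with its counts alongside.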
Now we can look at the steps required to transform our data from Chapter 7.
In this case we have the results saved as a CSV, but they may also be queried directly in R as per Section 6.2.4.
library(tidyverse)
library(lubridate)
paths_raw <- read_csv('bigquery/bq-results.csv')
Below is a snapshot of the first 20 rows. The BigQuery export is in a standard ‘long’ format, with one row per touch point, identified by its time stamp.
fullVisitorId | visitStartTime | channelGrouping | outcome |
---|---|---|---|
07184911138250312 | 2017-01-18 06:57:50 UTC | (Other) | non_conversion |
07184911138250312 | 2017-01-18 07:40:31 UTC | (Other) | non_conversion |
07184911138250312 | 2017-01-18 08:18:50 UTC | (Other) | non_conversion |
3112985461863519829 | 2017-01-25 20:42:23 UTC | (Other) | non_conversion |
4720404071621394560 | 2017-01-18 07:02:40 UTC | (Other) | non_conversion |
6060076679741207514 | 2017-01-18 06:59:27 UTC | (Other) | non_conversion |
0003297619580760716 | 2017-01-08 05:50:50 UTC | Direct | non_conversion |
00035794135966385 | 2017-01-20 12:46:25 UTC | Direct | non_conversion |
0004867638405459898 | 2017-01-15 14:22:26 UTC | Direct | non_conversion |
0005604256236421547 | 2017-01-24 21:04:26 UTC | Direct | non_conversion |
0006746295360194683 | 2017-01-24 05:16:18 UTC | Direct | non_conversion |
0009834325573666752 | 2017-01-24 22:43:34 UTC | Direct | non_conversion |
001324382917654255 | 2017-01-10 18:46:32 UTC | Direct | non_conversion |
0013701781325366363 | 2017-01-25 15:32:52 UTC | Direct | non_conversion |
0014256672578655164 | 2017-01-24 17:39:56 UTC | Direct | non_conversion |
0015731153666510386 | 2017-01-25 22:37:42 UTC | Direct | non_conversion |
0016316356325418630 | 2017-01-09 03:08:40 UTC | Direct | non_conversion |
0016883628233932470 | 2017-01-04 18:02:46 UTC | Direct | non_conversion |
0017373815580187343 | 2017-01-25 11:59:15 UTC | Direct | non_conversion |
0018094491063949293 | 2017-01-18 15:18:19 UTC | Direct | non_conversion |
We use tidyverse conventions here to make the code easier to read. In words: we start by parsing the time stamp into a date-time. Next we sort the sessions by this time stamp so that each visitor's touch points appear in the order they occurred. We then restructure the data by collapsing each visitor's touch points into one path string. Finally we count the occurrences of conversions and non-conversions for each path.
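paths <- paths_raw %>%
  mutate(visitStartTime = ymd_hms(visitStartTime)) %>%  # parse the time stamp
  group_by(fullVisitorId, outcome) %>%  # one group per visitor and outcome
  arrange(visitStartTime) %>%  # put touch points in chronological order
  summarise(path = paste(channelGrouping, collapse = ' > ')) %>%  # collapse touch points into a path string
  ungroup() %>%
  count(outcome, path, name = "n") %>%  # count each path per outcome
  spread(outcome, n) %>%  # one column per outcome
  replace_na(list(conversion = 0, non_conversion = 0)) %>%  # paths missing an outcome get 0
  arrange(desc(conversion))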
path | conversion | non_conversion |
---|---|---|
Referral | 178 | 4015 |
Organic Search | 137 | 21659 |
Direct | 71 | 9308 |
Referral > Referral | 55 | 553 |
Organic Search > Organic Search | 39 | 1507 |
Direct > Direct | 26 | 803 |
Referral > Referral > Referral | 26 | 132 |
Paid Search | 21 | 1348 |
Display | 12 | 256 |
Direct > Referral | 11 | 60 |
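To see what the transformation does row by row, the same steps can be traced on a toy data set. This is only a sketch: `toy_raw` and its four rows are invented for illustration, the time stamps omit the trailing `UTC`, and `pivot_wider()` is swapped in for `spread()` (which tidyr documents as superseded); the result should have the same shape.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical toy data: three visitors, with visitor A's rows
# deliberately out of chronological order.
toy_raw <- tribble(
  ~fullVisitorId, ~visitStartTime,       ~channelGrouping, ~outcome,
  "A",            "2017-01-02 10:00:00", "Referral",       "conversion",
  "A",            "2017-01-01 09:00:00", "Direct",         "conversion",
  "B",            "2017-01-03 12:00:00", "Direct",         "non_conversion",
  "C",            "2017-01-01 08:00:00", "Direct",         "non_conversion"
)

toy_paths <- toy_raw %>%
  mutate(visitStartTime = ymd_hms(visitStartTime)) %>%
  arrange(visitStartTime) %>%                       # chronological order
  group_by(fullVisitorId, outcome) %>%
  summarise(path = paste(channelGrouping, collapse = " > "),
            .groups = "drop") %>%                   # one path per visitor
  count(outcome, path, name = "n") %>%              # paths per outcome
  pivot_wider(names_from = outcome, values_from = n, values_fill = 0)

toy_paths
```

Visitor A's two sessions collapse into the single path `Direct > Referral` with one conversion, while visitors B and C both contribute to the `Direct` path as non-conversions.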
8.2 Heuristic Models
Now that the data are in the correct format we can use the ChannelAttribution::heuristic_models() function to compare three common models: First Touch, Last Touch and Linear. This function calculates the total number of conversions attributed to each channel under each of these models.
The results across these three methods are very similar. We can see ‘Referral’ is the channel that is attributed the most credit for conversions, followed by ‘Organic Search’.
The Last Touch method provides slightly more credit to ‘Referral’ than other methods, in contrast to ‘Direct’, which is attributed more credit when using the First Touch method.
library(ChannelAttribution)
fit_h <- heuristic_models(Data = paths, var_path = 'path', var_conv = 'conversion')
channel_name | first_touch | last_touch | linear_touch |
---|---|---|---|
Referral | 279 | 302 | 290.46667 |
Organic Search | 203 | 202 | 202.28333 |
Direct | 124 | 106 | 114.31667 |
Paid Search | 31 | 26 | 29.08333 |
Display | 19 | 20 | 19.85000 |
Social | 6 | 6 | 6.00000 |
(Other) | 0 | 0 | 0.00000 |
Affiliates | 0 | 0 | 0.00000 |
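To make the three heuristics concrete, the sketch below splits the credit for a single path by hand in base R. The helper `attribute_path()` is hypothetical and follows the usual definitions (all credit to the first touch, all credit to the last touch, or equal credit to every touch); it is not the package's internal implementation.

```r
# Hypothetical helper: split the conversions of one path across its
# touch points under the three heuristics.
attribute_path <- function(path, conversions, sep = " > ") {
  touches <- strsplit(path, sep, fixed = TRUE)[[1]]
  n <- length(touches)
  data.frame(
    channel_name = touches,
    first_touch  = conversions * (seq_len(n) == 1),  # all credit to touch 1
    last_touch   = conversions * (seq_len(n) == n),  # all credit to touch n
    linear_touch = conversions / n                   # equal share per touch
  )
}

# One path from the example table in Section 8.1
credit <- attribute_path("direct > social > search", conversions = 10)
credit

# A channel that appears several times collects every share it earns
repeated <- attribute_path("direct > direct > direct", conversions = 2)
aggregate(. ~ channel_name, data = repeated, FUN = sum)
```

Summing such per-path credits over every path gives channel totals of the kind shown in the table above, which is why the three columns each add up to the same overall number of conversions.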