Outbrain Click Prediction EDA

You’ll never believe this one simple trick to help predict who will click your ad

The goal of this project is to essentially complete a Kaggle project uploaded by Outbrain challenging users to use their billions of lines of data to see if you can find patterns and predict which users will click a certain ad. Since I work in the advertising industry and have done a number of projects with outbrain, I think it would be useful to get a better grasp on how programmatic partners are serving ads to relevant users and how they commoditize their algorithms to ensure they can create equal value out of an investment for advertising agencies, clients and themselves.


In [2]:
import pandas as pd
import seaborn as sb
%matplotlib inline
from sklearn import datasets
import matplotlib.pyplot as plt
import seaborn as sns

Let’s load some raw data.

For the purposes of simplifying this project and being kind to my Macbook Air’s processor, I’m going to combine all the datasets Outbrain shared with me in thier Kaggle and reduce the document size to a manageable size. Hopefully, once I apply my machine learning tools to help predict which ads will be clicked compared to those that will not be, based on the environment in which they are served, to the larger dataset with hopefully minimal loss.

In [3]:
df0 = pd.read_csv('Outbrain_Kaggle/page_views_sample.csv')
In [4]:
df4 = pd.read_csv('Outbrain_Kaggle/documents_entities.csv')
In [5]:
df5 = pd.read_csv('Outbrain_Kaggle/documents_meta.csv')
In [6]:
df6 = pd.read_csv('Outbrain_Kaggle/documents_topics.csv')
In [7]:
df8 = pd.read_csv('Outbrain_Kaggle/promoted_content.csv')

So things are easier to work with, lets’s join these datasets.

In [8]:
df_new1 = pd.merge(df0, df8, on='document_id')
In [9]:
df_new2 = pd.merge(df_new1, df6, on='document_id')
In [10]:
df_new3 = pd.merge(df_new2, df5, on='document_id')
In [11]:
df = pd.merge(df_new3, df4, on='document_id')

Analyzing the document dataset

Working database which is a join of 5 datasets that Outbrain provided on every instance an ad appears on a page. It includes data on the following:

UUID – ID number of each user visiting the site. (used to join just about all the datasets) Document_id – ID number of the actual website the users visits timestamp – the time in which the visitor comes to the site platform – whether they visited via desktop, mobile or tablet (desktop = 1, mobile = 2, tablet =3) geo_location – where the user is visiting from traffic_source (internal = 1, search = 2, social = 3) ad_id – the identification number of each individual ad (I.e. Google Chromebooks anthem TV spot :30 english) campaign_id – the identification of the ad campaign that an advertiser is running (i.e. Google Chromebooks campaign) advertiser_id – the identification of the advertiser running the campaign (i.e. google) topic_id – the topic number associated with the visiting site content (i.e. news, sports, lifestyle) source_id – the part of the site on which the document is displayed, e.g. edition.cnn.com

workingdatabase.csv provide information about the content in a document, as well as Outbrain’s confidence in each respective relationship. For example, an entity_id can represent a person, organization, or location. The rows in workingdatabase.csv give the confidence that the given entity was referred to in the document.

In [12]:
(14951581, 16)
In [13]:
df = df.sample(n=500000, frac=None, replace=False, weights=None, random_state=None, axis=None)
In [14]:
(500000, 16)

Oof. Our dataset is nealry 15 million oberservations long. Let’s random sample this down to 500k line items to help siimplify things.

In [15]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 500000 entries, 12365570 to 4778249
Data columns (total 16 columns):
uuid                  500000 non-null object
document_id           500000 non-null int64
timestamp             500000 non-null int64
platform              500000 non-null int64
geo_location          499993 non-null object
traffic_source        500000 non-null int64
ad_id                 500000 non-null int64
campaign_id           500000 non-null int64
advertiser_id         500000 non-null int64
topic_id              500000 non-null int64
confidence_level_x    500000 non-null float64
source_id             500000 non-null float64
publisher_id          500000 non-null float64
publish_time          498156 non-null object
entity_id             500000 non-null object
confidence_level_y    500000 non-null float64
dtypes: float64(4), int64(8), object(4)
memory usage: 64.8+ MB

Looks like we have a few nulls in the geo_location and publish_time features. Let’s see if we can fill these nulls with the mode of the dataframe to help clean everything up. Since we just have a small amount of missing data (with the most on the publish time of the event), I feel that filling the nulls with the mode will be inconsequential.

In [16]:
df['geo_location'].fillna(df['geo_location'].mode()[0], inplace=True)
In [17]:
df['publish_time'].fillna(df['publish_time'].mode()[0], inplace=True)
In [18]:
uuid                  0
document_id           0
timestamp             0
platform              0
geo_location          0
traffic_source        0
ad_id                 0
campaign_id           0
advertiser_id         0
topic_id              0
confidence_level_x    0
source_id             0
publisher_id          0
publish_time          0
entity_id             0
confidence_level_y    0
dtype: int64

So, looking at our columns I can already see that the entity id section seems like it is not going to give me much in regards to analysis, so I’m going to go ahead and delete.

In [20]:
df = df.drop('entity_id', axis=1)
In [21]:
df = df.drop('confidence_level_x', axis=1)
In [22]:
df = df.drop('confidence_level_y', axis=1)

One thing that would be pretty interesting to look at is what types of ads are served on what types of sites. With that said, let’s do a quick scatter plot to see what we can find.

In [23]:
import pandas as pd
import seaborn as sb
%matplotlib inline
from sklearn import datasets
import matplotlib.pyplot as plt

x = df.document_id
y = df.campaign_id
plt.scatter(x, y)
<matplotlib.collections.PathCollection at 0x11423dc18>

Hmm, looks like there is a heavy concentration of numbers of campaigns on sites wiht higher document ids. Maybe that has to do with the fact that newer sites have higher doc id numbers and more people are visiting newer sites than older sites, so there are more campaigns there during this campaign time frame (2015 and 2016). Let’s see if there is anything interesting if we replace doc id with topic id, which covers a wider range of website topics.

In [24]:
plot = sb.lmplot(x="document_id", y="ad_id", data=df)
In [25]:
import pandas as pd
import seaborn as sb
%matplotlib inline
from sklearn import datasets
import matplotlib.pyplot as plt

x = df.topic_id
y = df.campaign_id
plt.scatter(x, y)
<matplotlib.collections.PathCollection at 0x114891278>

Not a whole lot of information there outside of an inference that there seems to be a pretty even distribution of campaigns and topic ids. We’ll come back to this.

Looking at this dataframe, we have good infromation on the sites, topics, ad campaigns and actual ads that appear on our outbrain database, but we don’t have any information on what was actually clicked. This is where the clicks train database below comes in.

Analyzing the Train / Test Data

Now, let’s see the time frame when the ads are most often served. Using the events data provided which outlines the context in which the display id is served.

In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

events = pd.read_csv("Outbrain_Kaggle/events.csv", dtype=np.int32, index_col=0, usecols=[0,3])
/Users/danielclark/anaconda3/lib/python3.6/site-packages/numpy/lib/arraysetops.py:466: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask |= (ar1 == a)
1 61
2 81
3 182
4 234
5 338
In [27]:
train = pd.merge(pd.read_csv("Outbrain_Kaggle/clicks_train.csv", dtype=np.int32, index_col=0).sample(frac=0.1),
                 events, left_index=True, right_index=True)
test = pd.merge(pd.read_csv("Outbrain_Kaggle/clicks_test.csv", dtype=np.int32, index_col=0).sample(frac=0.1),
                events, left_index=True, right_index=True)
/Users/danielclark/anaconda3/lib/python3.6/site-packages/numpy/lib/arraysetops.py:466: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask |= (ar1 == a)
In [28]:
ad_id timestamp
16874594 180797 1026
16874596 289915 1546
16874597 305790 1571
16874597 143981 1571
16874600 57591 4078
In [29]:
ad_id clicked timestamp
3 250082 0 182
6 175234 1 395
7 300808 0 602
12 35982 0 718
15 52086 0 1000
In [30]:
test["hour"] = (test.timestamp // (3600 * 1000)) % 24
test["day"] = test.timestamp // (3600 * 24 * 1000)

train["hour"] = (train.timestamp // (3600 * 1000)) % 24
train["day"] = train.timestamp // (3600 * 24 * 1000)

train.hour.hist(bins=np.linspace(-0.5, 23.5, 25), label="train", alpha=0.7, normed=True)
test.hour.hist(bins=np.linspace(-0.5, 23.5, 25), label="test", alpha=0.7, normed=True)
plt.xlim(-0.5, 23.5)
plt.xlabel("Hour of Day")
plt.ylabel("Fraction of Events")
Text(0,0.5,'Fraction of Events')

Looking at the time distribution of clicks, we are seeing that the majorit are taking place between 9 am and 5pm, which suggest people are browsing the internet while at work. Naughty Naughty!

In [31]:
train.day.hist(bins=np.linspace(-.5, 14.5, 16), label="train", alpha=0.7, normed=True)
test.day.hist(bins=np.linspace(-.5, 14.5, 16), label="test", alpha=0.7, normed=True)
plt.xlim(-0.5, 14.5)
plt.xlabel("Days since June 14")
plt.ylabel("Fraction of Events")
Text(0,0.5,'Fraction of Events')

Looking at the differences between the timing of the train and test events, we can see that 50 percent of the testing data takes place during the time of the training data, and the other 50 percent takes place on the two days afterwards.

In [32]:
hour_day_counts = train.groupby(["hour", "day"]).count().ad_id.values.reshape(24,-1)
plt.imshow((hour_day_counts / hour_day_counts.sum(axis=0)).T,
           interpolation="none", cmap="rainbow")
plt.xlabel("Hour of Day")
plt.ylabel("Days since June 14 2016")
Text(0,0.5,'Days since June 14 2016')

Looking at the days since June 14, 2016 (which is a Tuesday). As you can see on days 4 and 5, we see a somewhat change in the hour of days in which people are browsing the internet online. i.e. Browsing kind of peaks at 10 and reduces on Saturday while browsing peaks around 10-12 on Sunday night.

In [33]:
clicks_train = pd.read_csv('Outbrain_Kaggle/clicks_train.csv')
display_id ad_id clicked
0 1 42337 0
1 1 139684 0
2 1 144739 1
3 1 156824 0
4 1 279295 0
In [34]:
df_train = pd.read_csv('Outbrain_Kaggle/clicks_train.csv')
df_test = pd.read_csv('Outbrain_Kaggle/clicks_test.csv')

sizes_train = df_train.groupby('display_id')['ad_id'].count().value_counts()
sizes_test = df_test.groupby('display_id')['ad_id'].count().value_counts()
sizes_train = sizes_train / np.sum(sizes_train)
sizes_test = sizes_test / np.sum(sizes_test)

sns.barplot(sizes_train.index, sizes_train.values, alpha=0.8, label='train')
sns.barplot(sizes_test.index, sizes_test.values, alpha=0.6, label='test')
plt.xlabel('Number of Ads in display', fontsize=12)
plt.ylabel('Proportion of set', fontsize=12)
Text(0,0.5,'Proportion of set')

Looking at the data here, it looks like the most common number of ad placements per outbrain display unit is about 4 to 6 ads per unit. If you look at the image below of an example from CNN.com, you can see for yourself.


In [35]:
(87141731, 3)
In [36]:
0    70267138
1    16874593
Name: clicked, dtype: int64
In [37]:
clck = clicks_train['clicked'].value_counts()

sns.barplot(clck.index, clck.values, alpha=0.8)
plt.xlabel('Clicked = 1, Not-Clicked = 0', fontsize=12)
plt.ylabel('Occurence count', fontsize=12)
Text(0,0.5,'Occurence count')
In [38]:

Wow, we have over 87 million lines of clicks data from the time frame for us to work with and nearly 17 million instances of people clicking ads. This will be helpful in helping us detirmining click through rates for each unique ad_id.

In [39]:
click_means = clicks_train.groupby(['display_id', 'ad_id'], as_index=False).mean().groupby('ad_id')['clicked'].mean()
1    0.000000
2    0.045455
3    0.068323
4    0.125000
5    0.000000
Name: clicked, dtype: float64
In [40]:
import matplotlib.pyplot as plt
import numpy as np

plt.hist(click_means[click_means.values != 0], bins=100)

Here I calculated the click throu rate of each of the 478k unique ad id numbers by calculating the average value summed across the “clicked” column.

Creating a histogram of our collection of unique CTR values, we see there is a right skew of our data around the 0.08 to 0.1 level. As you can see, there are large concentrations of values at 100%, 50%, 33%, 25%, ect. I suspect that these are ad id’s with a small number of impressions (i.e. <10) which gives them a round CTR number.

For my model, I would like to join this click through rate to our bigger dataset and look into which features help attribute to a higher click through rate.

Lets merge our click data with our document data

In [41]:
dfclick = df.merge(click_means.reset_index(), how="left", on='ad_id')

Let’s analyze our feature data with the clicks

Starting off, maybe we can check to see what different kinds of CTR values we can get with each doc id. Maybe from here, we can calculate which doc id is driving the highest click through rate.

In [43]:
corr = dfclick.corr()

fig, ax = plt.subplots(figsize=(9,7))

mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

ax = sns.heatmap(corr, mask=mask, ax=ax)

ax.set_xticklabels(ax.xaxis.get_ticklabels(), fontsize=14)
ax.set_yticklabels(ax.yaxis.get_ticklabels(), fontsize=14)


Just doing a quick correlation analysis, I can see that there is a slight correlation between ad_id, document_id, source_id, campaign_id and advertiser_id, this would suggest that I should be able to plot some sort of visual showing what types of ads are appearing where.

In [44]:
In [45]:
docid_means = dfclick.groupby(['document_id', 'clicked'], as_index=False).mean().groupby('document_id')['clicked'].mean()
22147    0.244346
23920    0.000000
38161    0.000000
52202    0.116942
61586    0.100000
Name: clicked, dtype: float64
In [46]:
docid_counts = dfclick['document_id'].value_counts()
plt.hist(docid_counts, bins = 100, log=True)
plt.xlabel('doc_ids that have appeared', fontsize=12)
plt.ylabel('log(Count of times doc_id appeared)', fontsize=12)

Looking at this. It seems like the impressions we are seeing in our sample data is skewed heavily towards less webpages.

In [47]:
uuid_counts = dfclick.groupby('uuid')['uuid'].count().sort_values()

f0bb39732abe15    145
239492050c07f3    148
27d04477d9c65f    150
e546c1eba0a495    150
8a8aedc97b0af2    154
Name: uuid, dtype: int64
In [48]:
for i in [2, 3, 10]:
    print('Users that appear less than {} times: {}%'.format(i, round((uuid_counts < i).mean() * 100, 2)))
plt.figure(figsize=(12, 4))
plt.hist(uuid_counts.values, bins=50, log=True)
plt.xlabel('Number of times user appeared in set', fontsize=12)
plt.ylabel('log(Count of users)', fontsize=12)
Users that appear less than 2 times: 57.25%
Users that appear less than 3 times: 77.45%
Users that appear less than 10 times: 95.78%
In [49]:
uuid_means = dfclick.groupby(['uuid', 'clicked'], as_index=False).mean().groupby('uuid')['clicked'].mean()
10006ce3083383    0.333333
1000c0ddb87278    0.000000
1000ddd5edf8fe    0.270833
1000fe9ab64ba8    0.000000
10011d8e5037fc    0.090909
Name: clicked, dtype: float64
In [50]:
plat_means = dfclick.groupby(['platform', 'clicked'], as_index=False).mean().groupby('platform')['clicked'].mean()
1    0.189616
2    0.189328
3    0.199200
Name: clicked, dtype: float64

drawing correlations between platofrm and click through rate, it seems like tablets are driving the highest click through rates folowed by desktop

In [51]:
import pandas as pd
import warnings

Userlocation = dfclick[['uuid', 'geo_location']]
Userlocation = Userlocation.loc[~Userlocation.geo_location.isin(['EU', '--']), :]
Userlocation = Userlocation.drop_duplicates('uuid', keep='first')
In [52]:
country = Userlocation.copy()
country.columns = ['uuid', 'Country']
country.Country = country.Country.str[:2]
country.loc[:, 'UserCount'] = country.groupby('Country')['Country'].transform('count')
country = country.loc[:, ['Country', 'UserCount']].drop_duplicates('Country', keep='first')
country.sort_values('UserCount', ascending=False, inplace=True)
Country UserCount
0 US 134774
39 GB 7648
27 CA 7544
81 AU 2751
52 IN 1355
814 DE 908
1186 IE 621
2 MX 589
65 NL 583
55 ZA 582
In [53]:
country['CumulativePercentage'] = 100 * country.UserCount.cumsum()/country.UserCount.sum()
country.reset_index(drop=True, inplace=True)
country[['Country', 'UserCount', 'CumulativePercentage']].head(10)
Country UserCount CumulativePercentage
0 US 134774 78.994444
1 GB 7648 83.477129
2 CA 7544 87.898858
3 AU 2751 89.511289
4 IN 1355 90.305488
5 DE 908 90.837690
6 IE 621 91.201674
7 MX 589 91.546902
8 NL 583 91.888613
9 ZA 582 92.229738
In [54]:
usa = Userlocation.loc[Userlocation.geo_location.str[:2] == 'US', :]
usa.columns = ['uuid', 'State']

usa.State = usa.State.str[3:5]

# Drop Data with missing state info
usa = usa.loc[usa.State != '', :]

usa.loc[:, 'UserCount'] = usa.groupby('State')['State'].transform('count')
usa.loc[:, ['State', 'UserCount']] = usa.loc[:, ['State', 'UserCount']].drop_duplicates('State', keep='first')
usa.sort_values('UserCount', ascending=False, inplace=True)

Looking at the geolocation of our data, the vast majority come from the US and California specifically

In [56]:
uuid State UserCount StateName CumulativePercentage
0 5e7154cc78afc5 CA 18720.0 California 14.406649
1 51a0979e08ed41 TX 10795.0 Texas 22.714330
2 27d04477d9c65f NY 9368.0 New York 29.923811
3 1e7eac31a74a69 FL 9277.0 Florida 37.063260
4 81a98f071af50c IL 5464.0 Illinois 41.268278
In [57]:
Country UserCount CumulativePercentage
0 US 134774 78.994444
1 GB 7648 83.477129
2 CA 7544 87.898858
3 AU 2751 89.511289
4 IN 1355 90.305488
In [59]:
pivot = pd.pivot_table(dfclick,index=["State"], values=['clicked'], aggfunc=[np.mean])
AK 0.050958
AL 0.146979
AR 0.044984
AZ 0.079135
CA 0.035163
CO 0.000000
CT 0.076205
DC 0.500000
DE 0.313725
FL 0.270833
GA 0.371244
HI 0.051023
IA 0.137504
ID 0.000000
IL 0.059469
IN 0.000000
KS 0.063144
KY 0.221593
LA 0.043043
MA 0.093017
MD 0.125000
ME 0.079320
MI 0.045159
MN 0.054094
MO 0.044293
MS 0.156527
MT 0.031250
NC 0.040563
ND 0.096663
NE 0.357143
NH 0.048775
NJ 0.357143
NM 0.054732
NV 0.049370
NY 0.138596
OH 0.371751
OK 0.039357
OR 0.045893
PA 0.492754
RI 0.051013
SC 0.051286
SD 0.250000
TN 0.079619
TX 0.111093
UT 0.060269
VA 0.270833
VT 0.096760
WA 0.083473
WI 0.250000
WV 0.156395
WY 0.333333

Based on our data, we can see that states like New Jersey, North Dakorta, Utah are driving a high click through rate. Even though California has a high population, it does perform strongly with a 0.27 click through rate.

In [60]:
trafpivot = pd.pivot_table(dfclick,index=["traffic_source"], values=['clicked'], aggfunc=[np.mean])
1 0.144606
2 0.256321
3 0.214739

Based on our traffic source data in relation to mean click through rate, we are seeing that search and social are driving the greatest click through rate as opposed to organic traffic sources.

In [61]:
clickx = dfclick.topic_id
clicky = dfclick.clicked
plt.scatter(clickx, clicky)
<matplotlib.collections.PathCollection at 0x1a2ee7a0b8>

Conclusion and Next Steps

Based on our intitial data analysis, We are seeing some predictability that can be found through the time stamp and through combinations of indification metrics of the ads, source websites, and webpage, we could build a model to help figure which cobination of features can help drive up our click through rate.

For Next steps, I think I will want to run a couple of different prediction models like logistic regression, random forests and kmeans to help find the following:

  • Which features are most indicative of click performance
  • If there is any sort of patterns that we can find to group ads, documents, sources, and topics together
  • if we can use our clicks train data to predict our click test data
In [63]:
dfclick.to_csv('finalcoredb.csv', sep='\t')