You’ll never believe this one simple trick to help predict who will click your ad
The goal of this project is to tackle a Kaggle challenge posted by Outbrain, which asks users to dig through billions of rows of data, find patterns, and predict which users will click a given ad. Since I work in the advertising industry and have done a number of projects with Outbrain, I think it would be useful to get a better grasp on how programmatic partners serve ads to relevant users and how they package their algorithms so that an advertising investment creates value for agencies, clients and Outbrain itself.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Let’s load some raw data.
For the purposes of simplifying this project and being kind to my MacBook Air's processor, I'm going to combine the datasets Outbrain shared in their Kaggle competition and reduce the result to a manageable size. The hope is that the machine learning tools I apply here, to predict which ads get clicked based on the environment in which they are served, will later transfer to the larger dataset with minimal loss.
# Load the raw Outbrain files: page views, document entities/metadata/topics, and promoted content
df0 = pd.read_csv('Outbrain_Kaggle/page_views_sample.csv')
df4 = pd.read_csv('Outbrain_Kaggle/documents_entities.csv')
df5 = pd.read_csv('Outbrain_Kaggle/documents_meta.csv')
df6 = pd.read_csv('Outbrain_Kaggle/documents_topics.csv')
df8 = pd.read_csv('Outbrain_Kaggle/promoted_content.csv')
To make things easier to work with, let's join these datasets.
# Join everything on document_id into a single working dataframe
df_new1 = pd.merge(df0, df8, on='document_id')
df_new2 = pd.merge(df_new1, df6, on='document_id')
df_new3 = pd.merge(df_new2, df5, on='document_id')
df = pd.merge(df_new3, df4, on='document_id')
Analyzing the document dataset
Our working database is a join of five datasets that Outbrain provided, covering every instance an ad appears on a page. It includes the following fields:
- uuid – ID of each user visiting the site (used to join just about all the datasets)
- document_id – ID of the actual website the user visits
- timestamp – the time at which the visitor comes to the site
- platform – whether they visited via desktop, mobile or tablet (desktop = 1, mobile = 2, tablet = 3)
- geo_location – where the user is visiting from
- traffic_source – how the user arrived (internal = 1, search = 2, social = 3)
- ad_id – the ID of each individual ad (e.g. Google Chromebooks anthem TV spot :30 English)
- campaign_id – the ID of the ad campaign an advertiser is running (e.g. Google Chromebooks campaign)
- advertiser_id – the ID of the advertiser running the campaign (e.g. Google)
- topic_id – the topic associated with the visited site's content (e.g. news, sports, lifestyle)
- source_id – the part of the site on which the document is displayed, e.g. edition.cnn.com
workingdatabase.csv also provides information about the content in a document, along with Outbrain's confidence in each respective relationship. For example, an entity_id can represent a person, organization, or location, and each row gives the confidence that the given entity was referred to in the document.
df.shape
df = df.sample(n=500000, replace=False)
df.shape
Oof. Our dataset is nearly 15 million observations long. Let's randomly sample it down to 500k rows to help simplify things.
df.info()
Looks like we have a few nulls in the geo_location and publish_time features. Let's see if we can fill these nulls with the mode of each column to help clean everything up. Since we only have a small amount of missing data (most of it in the publish time of the event), I feel that filling the nulls with the mode will be inconsequential.
df['geo_location'].fillna(df['geo_location'].mode()[0], inplace=True)
df['publish_time'].fillna(df['publish_time'].mode()[0], inplace=True)
df.isnull().sum()
So, looking at our columns, I can already see that the entity_id column is not going to give me much in the way of analysis, so I'm going to go ahead and drop it along with the two confidence-level columns.
df = df.drop(['entity_id', 'confidence_level_x', 'confidence_level_y'], axis=1)
One thing that would be pretty interesting to look at is what types of ads are served on what types of sites. With that said, let’s do a quick scatter plot to see what we can find.
# Scatter of campaign_id against document_id to see which campaigns run on which documents
x = df.document_id
y = df.campaign_id
plt.scatter(x, y)
Hmm, it looks like campaigns are heavily concentrated on sites with higher document IDs. Maybe that has to do with the fact that newer sites have higher document ID numbers and more people visit newer sites than older ones, so more campaigns run there during this campaign time frame (2015 and 2016). Let's see if there is anything interesting when we replace document ID with topic ID, which covers a wider range of website topics.
plot = sns.lmplot(x="document_id", y="ad_id", data=df)
x = df.topic_id
y = df.campaign_id
plt.scatter(x, y)
Not a whole lot of information there, beyond the inference that campaigns seem to be fairly evenly distributed across topic IDs. We'll come back to this.
Looking at this dataframe, we have good information on the sites, topics, ad campaigns and actual ads that appear in our Outbrain database, but we don't have any information on what was actually clicked. This is where the clicks_train data below comes in.
Analyzing the Train / Test Data
Now, let's see the time frame in which ads are most often served, using the events data provided, which outlines the context in which each display_id is served.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Keep only display_id (as the index) and timestamp from events.csv
events = pd.read_csv("Outbrain_Kaggle/events.csv", dtype=np.int32, index_col=0, usecols=[0,3])
events.head()
# Merge 10% samples of the clicks files with the event timestamps (all indexed by display_id)
train = pd.merge(pd.read_csv("Outbrain_Kaggle/clicks_train.csv", dtype=np.int32, index_col=0).sample(frac=0.1),
                 events, left_index=True, right_index=True)
test = pd.merge(pd.read_csv("Outbrain_Kaggle/clicks_test.csv", dtype=np.int32, index_col=0).sample(frac=0.1),
                events, left_index=True, right_index=True)
test.head()
train.head()
test["hour"] = (test.timestamp // (3600 * 1000)) % 24
test["day"] = test.timestamp // (3600 * 24 * 1000)
train["hour"] = (train.timestamp // (3600 * 1000)) % 24
train["day"] = train.timestamp // (3600 * 24 * 1000)
plt.figure(figsize=(12,4))
train.hour.hist(bins=np.linspace(-0.5, 23.5, 25), label="train", alpha=0.7, density=True)
test.hour.hist(bins=np.linspace(-0.5, 23.5, 25), label="test", alpha=0.7, density=True)
plt.xlim(-0.5, 23.5)
plt.legend(loc="best")
plt.xlabel("Hour of Day")
plt.ylabel("Fraction of Events")
Looking at the time distribution of clicks, we see that the majority take place between 9 am and 5 pm, which suggests people are browsing the internet while at work. Naughty, naughty!
plt.figure(figsize=(12,4))
train.day.hist(bins=np.linspace(-.5, 14.5, 16), label="train", alpha=0.7, density=True)
test.day.hist(bins=np.linspace(-.5, 14.5, 16), label="test", alpha=0.7, density=True)
plt.xlim(-0.5, 14.5)
plt.legend(loc="best")
plt.xlabel("Days since June 14")
plt.ylabel("Fraction of Events")
Looking at the differences in timing between the train and test events, we can see that about 50 percent of the test data falls within the training period, and the other 50 percent takes place on the two days afterwards.
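A quick way to sanity-check that split on our 10% samples is to compare the test days against the last day seen in train; this is just a rough check on the sampled frames built above.
# What share of the sampled test events fall within the training window?
last_train_day = train.day.max()
overlap_share = (test.day <= last_train_day).mean()
print('Last day in train sample:', last_train_day)
print('Share of test events within the train window: {:.1%}'.format(overlap_share))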
plt.figure(figsize=(12,6))
# Count events per (hour, day), then normalize each day so the hourly pattern is comparable across days
hour_day_counts = train.groupby(["hour", "day"]).count().ad_id.values.reshape(24,-1)
plt.imshow((hour_day_counts / hour_day_counts.sum(axis=0)).T,
interpolation="none", cmap="rainbow")
plt.xlabel("Hour of Day")
plt.ylabel("Days since June 14 2016")
These are days since June 14, 2016, which was a Tuesday. As you can see, on days 4 and 5 (the weekend) the hourly browsing pattern shifts somewhat: browsing peaks around 10 and then tapers off on Saturday, while on Sunday it peaks again around 10-12 at night.
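To double-check which weekdays those day indices correspond to, we can map the offsets back to dates, assuming the June 14, 2016 origin used in the axis labels above.
# Map day offsets back to weekday names, assuming the 2016-06-14 timestamp origin
day_offsets = sorted(train.day.unique())
day_dates = pd.to_datetime('2016-06-14') + pd.to_timedelta(day_offsets, unit='D')
print(pd.Series(day_dates.day_name(), index=day_offsets))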
clicks_train = pd.read_csv('Outbrain_Kaggle/clicks_train.csv')
clicks_train.head()
# Reuse the already loaded training clicks rather than reading the 87M-row file a second time
df_train = clicks_train
df_test = pd.read_csv('Outbrain_Kaggle/clicks_test.csv')
# Count how many ads each display contains, then look at the distribution of those counts
sizes_train = df_train.groupby('display_id')['ad_id'].count().value_counts()
sizes_test = df_test.groupby('display_id')['ad_id'].count().value_counts()
sizes_train = sizes_train / np.sum(sizes_train)
sizes_test = sizes_test / np.sum(sizes_test)
plt.figure(figsize=(12,4))
sns.barplot(x=sizes_train.index, y=sizes_train.values, alpha=0.8, label='train')
sns.barplot(x=sizes_test.index, y=sizes_test.values, alpha=0.6, label='test')
plt.legend()
plt.xlabel('Number of Ads in display', fontsize=12)
plt.ylabel('Proportion of set', fontsize=12)
Looking at the data here, it looks like the most common number of ad placements per Outbrain display unit is about 4 to 6 ads per unit. If you look at the image below of an example from CNN.com, you can see for yourself.
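To put numbers on that, we can print the proportions computed above directly; this just re-uses sizes_train and sizes_test from the previous cell.
# Proportion of displays by how many ads they contain, most common first
print(sizes_train.sort_values(ascending=False).head())
print(sizes_test.sort_values(ascending=False).head())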
clicks_train.shape
clicks_train['clicked'].value_counts()
clck = clicks_train['clicked'].value_counts()
plt.figure(figsize=(12,4))
sns.barplot(x=clck.index, y=clck.values, alpha=0.8)
plt.xlabel('Clicked = 1, Not-Clicked = 0', fontsize=12)
plt.ylabel('Occurence count', fontsize=12)
clicks_train['ad_id'].nunique()
Wow, we have over 87 million rows of clicks data from this time frame to work with, and nearly 17 million instances of people clicking ads. This will be helpful in determining the click-through rate for each unique ad_id.
# Empirical click-through rate per ad_id: the mean of 'clicked' across all rows for each ad
click_means = clicks_train.groupby(['display_id', 'ad_id'], as_index=False).mean().groupby('ad_id')['clicked'].mean()
click_means.head()
plt.hist(click_means[click_means.values != 0], bins=100)
plt.title("CTR")
plt.xlabel("CTR")
plt.ylabel("Counts")
Here I calculated the click-through rate for each of the 478k unique ad_ids by taking the mean of the "clicked" column for each ad.
Creating a histogram of our collection of unique CTR values, we see a right skew in the data around the 0.08 to 0.1 level. As you can see, there are also large concentrations of values at 100%, 50%, 33%, 25%, etc. I suspect these are ad_ids with a small number of impressions (i.e. <10), which gives them a round CTR number.
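One way to check that suspicion is to look at how many rows of clicks_train the ads with a perfect CTR actually have; this is just a quick sketch using the frames already in memory.
# If round CTRs really come from tiny samples, ads with CTR == 1.0 should have very few impressions
impressions_per_ad = clicks_train.groupby('ad_id').size()
perfect_ctr_ads = click_means[click_means == 1.0].index
print(impressions_per_ad.loc[perfect_ctr_ads].describe())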
For my model, I would like to join this click through rate to our bigger dataset and look into which features help attribute to a higher click through rate.
Let's merge our click data with our document data.
dfclick = df.merge(click_means.reset_index(), how="left", on='ad_id')
dfclick.head()
Let’s analyze our feature data with the clicks
Starting off, maybe we can check what kinds of CTR values we get for each document ID. From there, we can work out which document IDs drive the highest click-through rates.
corr = dfclick.corr()
fig, ax = plt.subplots(figsize=(9,7))
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
ax = sns.heatmap(corr, mask=mask, ax=ax)
ax.set_xticklabels(ax.xaxis.get_ticklabels(), fontsize=14)
ax.set_yticklabels(ax.yaxis.get_ticklabels(), fontsize=14)
plt.show()
Just doing a quick correlation analysis, I can see slight correlations among ad_id, document_id, source_id, campaign_id and advertiser_id, which suggests I should be able to plot some sort of visual showing what types of ads are appearing where.
dfclick.head()
docid_means = dfclick.groupby(['document_id', 'clicked'], as_index=False).mean().groupby('document_id')['clicked'].mean()
docid_means.head()
docid_counts = dfclick['document_id'].value_counts()
plt.hist(docid_counts, bins = 100, log=True)
plt.xlabel('doc_ids that have appeared', fontsize=12)
plt.ylabel('log(Count of times doc_id appeared)', fontsize=12)
plt.show()
Looking at this, it seems the impressions in our sample data are skewed heavily toward a small number of webpages.
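To put a rough number on that concentration, here is a quick check of how much of the sample the most frequently seen documents account for, using the docid_counts series from above.
# Share of sampled impressions covered by the 100 most frequent document_ids
top_100_share = docid_counts.head(100).sum() / docid_counts.sum()
print('Top 100 documents account for {:.1%} of impressions in the sample'.format(top_100_share))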
uuid_counts = dfclick.groupby('uuid')['uuid'].count().sort_values()
print(uuid_counts.tail())
for i in [2, 3, 10]:
    print('Users that appear less than {} times: {}%'.format(i, round((uuid_counts < i).mean() * 100, 2)))
plt.figure(figsize=(12, 4))
plt.hist(uuid_counts.values, bins=50, log=True)
plt.xlabel('Number of times user appeared in set', fontsize=12)
plt.ylabel('log(Count of users)', fontsize=12)
plt.show()
uuid_means = dfclick.groupby(['uuid', 'clicked'], as_index=False).mean().groupby('uuid')['clicked'].mean()
uuid_means.head()
plat_means = dfclick.groupby(['platform', 'clicked'], as_index=False).mean().groupby('platform')['clicked'].mean()
plat_means.head()
Drawing correlations between platform and click-through rate, it seems tablets drive the highest click-through rates, followed by desktop.
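For a quick visual of that gap, the platform means computed above can be plotted directly (platform codes per the field list earlier: 1 = desktop, 2 = mobile, 3 = tablet).
# Mean CTR by platform, highest first
plat_means.sort_values(ascending=False).plot(kind='bar', figsize=(6, 4))
plt.xlabel('Platform (1 = desktop, 2 = mobile, 3 = tablet)')
plt.ylabel('Mean CTR')
plt.show()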
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
# Pull each user's geo_location, dropping unusable 'EU' and '--' codes
Userlocation = dfclick[['uuid', 'geo_location']]
Userlocation = Userlocation.loc[~Userlocation.geo_location.isin(['EU', '--']), :]
Userlocation = Userlocation.drop_duplicates('uuid', keep='first')
country = Userlocation.copy()
country.columns = ['uuid', 'Country']
country.Country = country.Country.str[:2]
country.loc[:, 'UserCount'] = country.groupby('Country')['Country'].transform('count')
country = country.loc[:, ['Country', 'UserCount']].drop_duplicates('Country', keep='first')
country.sort_values('UserCount', ascending=False, inplace=True)
country.head(10)
country['CumulativePercentage'] = 100 * country.UserCount.cumsum()/country.UserCount.sum()
country.reset_index(drop=True, inplace=True)
country[['Country', 'UserCount', 'CumulativePercentage']].head(10)
usa = Userlocation.loc[Userlocation.geo_location.str[:2] == 'US', :]
usa.columns = ['uuid', 'State']
usa.State = usa.State.str[3:5]
# Drop Data with missing state info
usa = usa.loc[usa.State != '', :]
usa.loc[:, 'UserCount'] = usa.groupby('State')['State'].transform('count')
usa = usa.loc[:, ['State', 'UserCount']].drop_duplicates('State', keep='first')
usa.sort_values('UserCount', ascending=False, inplace=True)
Looking at the geolocation of our data, the vast majority of users come from the US, and from California specifically.
usa.head()
country.head()
# dfclick only has raw geo_location strings, so derive a State column for the US rows before pivoting
dfclick['State'] = dfclick.loc[dfclick.geo_location.str[:2] == 'US', 'geo_location'].str[3:5]
pivot = pd.pivot_table(dfclick, index=["State"], values=['clicked'], aggfunc=np.mean)
pivot
Based on our data, we can see that states like New Jersey, North Dakota, and Utah drive a high click-through rate. Even though California has a high population, it still performs strongly with a 0.27 click-through rate.
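Rather than scanning the full pivot by eye, we can sort it to confirm which states come out on top; a quick check on the pivot built above.
# Top 10 states by mean click-through rate in our sample
pivot.sort_values('clicked', ascending=False).head(10)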
trafpivot = pd.pivot_table(dfclick,index=["traffic_source"], values=['clicked'], aggfunc=[np.mean])
trafpivot
Based on our traffic source data in relation to mean click-through rate, search and social are driving a greater click-through rate than internal (organic) traffic.
# Per-ad CTR plotted against the topic of the page the ad appeared on
clickx = dfclick.topic_id
clicky = dfclick.clicked
plt.scatter(clickx, clicky)
Conclusion and Next Steps
Based on our initial data analysis, we are seeing some predictability through the timestamp, and through combinations of the identification metrics of the ads, source websites, and webpages, we could build a model to help figure out which combination of features drives up our click-through rate.
For next steps, I think I will want to run a couple of different models, such as logistic regression, random forests and k-means clustering, to help find the following (a first baseline sketch appears at the end of this post):
- Which features are most indicative of click performance
- Whether there are any patterns we can use to group ads, documents, sources, and topics together
- Whether we can use our clicks_train data to predict our clicks_test data
dfclick.to_csv('finalcoredb.csv', sep='\t')
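As a head start on those next steps, here is a minimal, illustrative baseline rather than a final model: a logistic regression on the sampled train frame built earlier, using the per-ad CTR plus the hour and day features. Note that ad_ctr is derived from all of clicks_train, so it leaks target information into the validation split; this is only meant to sketch the modeling workflow, not to report a real score.
# Illustrative baseline only: logistic regression on a few simple features.
# ad_ctr comes from click_means (computed on the full clicks_train), so it leaks target
# information; a proper model would compute it on training folds only.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

baseline = train.sample(n=500000, random_state=42).copy()
baseline['ad_ctr'] = baseline['ad_id'].map(click_means).fillna(click_means.mean())

X = baseline[['ad_ctr', 'hour', 'day']]
y = baseline['clicked']
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_tr, y_tr)
print('Validation ROC AUC: {:.3f}'.format(roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])))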