Now that the EDA is done, let's see if we can build some models that predict which sets of ads will have the highest CTR based on the various features.
To start, let's do a little more cleanup to make sure our analyses are clean and easy to work with.
%matplotlib inline
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets, preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
Load the dataframe that I developed during the project's EDA phase.
df = pd.read_csv('Outbrain_Kaggle/finalcoredb.csv', sep='\t')
#let's create dummy variables for platform id and traffic source
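The code for that step isn't shown above, so here is a minimal sketch of what it could look like; the column names 'platform' and 'traffic_source' are assumptions based on the comment and may not match the actual EDA dataframe.
# A sketch, assuming the EDA dataframe has 'platform' and 'traffic_source' columns
platform_dummies = pd.get_dummies(df['platform'], prefix='platform', drop_first=True)
source_dummies = pd.get_dummies(df['traffic_source'], prefix='traffic_source', drop_first=True)
df = pd.concat([df.drop(['platform', 'traffic_source'], axis=1),
                platform_dummies, source_dummies], axis=1)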
df['clicked'].head()
df.columns
Here are the columns we will work with to measure which features can help us establish a strong CTR.
df.fillna(0, inplace=True)
Random Forest Classifier
# Load scikit's random forest classifier library
from sklearn.ensemble import RandomForestClassifier
# Load pandas
import pandas as pd
# Load numpy
import numpy as np
# Set random seed
np.random.seed(0)
Let’s calculate the average click across our entire “clicked” column
ctr_mean = df['clicked'].mean()
ctr_mean
Let’s create a new column that labels each CTR value as above average or below average with a True or False
# Create a list to store the labels
above_avg = []

# For each row in the 'clicked' column,
for row in df['clicked']:
    # if the value is above the mean CTR,
    if row > ctr_mean:
        # label it 'True'
        above_avg.append('True')
    else:
        # otherwise label it 'False'
        above_avg.append('False')

# Create a column from the list
df['above_avg'] = above_avg
df_class = pd.get_dummies(df['above_avg'], drop_first=True)
df_class.head()
#add it back into our dataframe
df = pd.concat([df, df_class], axis=1)
df.drop('above_avg', axis=1, inplace=True)
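As an aside, the loop plus dummy-encoding round trip above can be collapsed into a single vectorized line that yields the same 0/1 column (a sketch; not needed if you ran the cells above).
# Vectorized equivalent of the loop and get_dummies steps above
df['True'] = (df['clicked'] > ctr_mean).astype(int)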
Drop clicked, True, ad_id, and Unnamed: 0 from the feature list, since they are too closely representative of our Y variable (above_avg was already dropped above).
# Create a list of the feature column's names
features = df.columns.drop(['clicked', 'True', 'ad_id', 'Unnamed: 0'])
# View features
features
Establish our X and y, as well as our train and test splits. I'm using a 2/3 training to 1/3 testing split.
y = df['True']
X= df.drop(['clicked', 'True', 'ad_id', 'Unnamed: 0'], axis=1)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
X.shape
Let's load our random forest classifier and run a cross-validated score with it to measure how well the features predict the target across different folds of the data.
rfc = RandomForestClassifier()
scores = cross_val_score(rfc, X, y, cv=10)
print(np.average(scores))
The cross-validated score shows strong predictive results on our dataset. It may be overfit, but we will see.
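As a quick overfitting check, here is a sketch that compares accuracy on the data the model was fit on against the held-out split:
# Quick overfitting check: compare train and test accuracy
rfc.fit(X_train, y_train)
print('Train accuracy: {:.3f}'.format(rfc.score(X_train, y_train)))
print('Test accuracy:  {:.3f}'.format(rfc.score(X_test, y_test)))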
from sklearn.metrics import classification_report

rfc.fit(X_train, y_train)
print(classification_report(y_test, rfc.predict(X_test)))
from sklearn.model_selection import cross_val_score
for n_trees in range(1, 100, 10):
    model = RandomForestClassifier(n_estimators=n_trees)
    scores = cross_val_score(model, X, y, scoring='roc_auc')
    print('n trees: {}, CV AUC {}, Average AUC {}'.format(n_trees, scores, scores.mean()))

scores = cross_val_score(model, X, y, scoring='roc_auc')
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))
Increasing the number of estimators improves the predictive performance of the random forest, reaching an AUC of about 0.965.
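For reference, here is a sketch of what that looks like as a standalone model; n_estimators=91 is simply the largest value the loop above tried, not a tuned optimum.
# Cross-validate a forest with more trees (91 is the last value the loop tried)
rfc_more_trees = RandomForestClassifier(n_estimators=91, random_state=0)
scores = cross_val_score(rfc_more_trees, X, y, scoring='roc_auc', cv=5)
print('Average AUC with 91 trees: {:.3f}'.format(scores.mean()))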
from sklearn.model_selection import cross_val_score
# ... #
scores = cross_val_score(rfc, X, y, scoring='roc_auc', cv=10)
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))
Looking at our AUC score, it appears the model separates the two classes well: we can expect a high rate of true positives and true negatives with relatively few false positives and false negatives.
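To see that visually rather than as a single number, here is a quick ROC-curve sketch using the forest already fit on the training split above:
# Plot the ROC curve for the held-out test set
from sklearn.metrics import roc_curve, auc
probs = rfc.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
plt.plot(fpr, tpr, label='AUC = {:.3f}'.format(auc(fpr, tpr)))
plt.plot([0, 1], [0, 1], linestyle='--', color='grey')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()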
# Show the number of observations for the test and training dataframes
print('Number of observations in the training data:', len(X_train))
print('Number of observations in the test data:',len(X_test))
# Create a random forest classifier; here we call it rfc
rfc = RandomForestClassifier(n_jobs=2, random_state=0, max_depth=50)
# Train the classifier to take the training features and learn how they relate
# to the training y (above-average CTR)
rfc.fit(X_train, y_train)
# Apply the Classifier we trained to the test data
rfc.predict(X_test[features])[0:100]
# Proportion of above-average predictions on the training set
np.mean(rfc.predict(X_train[features]))
# Proportion of above-average predictions on the test set
np.mean(rfc.predict(X_test[features]))
This compares the rate of above-average predictions on the training set and the test set. As you can see, they are closely matched.
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, rfc.predict(X_test))
Looking at our confusion matrix, we have a very small proportion of false positives relative to true positives.
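You can read that ratio off the matrix directly; a small sketch:
# Unpack the confusion matrix and compute precision and recall for the positive class
tn, fp, fn, tp = confusion_matrix(y_test, rfc.predict(X_test)).ravel()
print('Precision: {:.3f}'.format(tp / (tp + fp)))
print('Recall:    {:.3f}'.format(tp / (tp + fn)))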
rfc.score(X_test, y_test)
# View the predicted probabilities of the first 10 observations
rfc.predict_proba(X_test[features])[0:10]
# Run a prediction of our test dataset using the model above
preds = rfc.predict(X_test[features])
Run a prediction of the CTR above-average classification and view the first five values.
preds[0:5]
# View a list of the features and their importance scores
list(zip(X_train[features], rfc.feature_importances_))
feature_imp = pd.Series(rfc.feature_importances_,index=features).sort_values(ascending=False)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.legend()
plt.show()
Looking across our features, it appears that document ID (web page), campaign ID, source ID, and timestamp are the most predictive of an efficient click-through rate; i.e. the web page, site category, time, and ad campaign are the best indicators of strong ad performance.
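As a sanity check on that reading, here is a quick sketch that re-scores a fresh forest on just the four most important features (names are pulled from feature_imp, so nothing is hard-coded):
# Cross-validate using only the four most important features
top4 = list(feature_imp.index[:4])
scores_top4 = cross_val_score(RandomForestClassifier(random_state=0), X[top4], y, cv=5)
print('CV accuracy with only the top 4 features: {:.3f}'.format(scores_top4.mean()))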
Now let's do a grid search to tune our hyperparameters and see which parameter settings are most valuable for our random forest.
import numpy as np
from sklearn.model_selection import GridSearchCV

# Candidate hyperparameter values (illustrative) for the random forest
params = {"n_estimators": [10, 50, 100], "max_depth": [10, 50, None]}
gsearch = GridSearchCV(RandomForestClassifier(random_state=0), params, cv=3)

# Create a classifier object with the classifier and parameter candidates
rfcf = gsearch
rfcf.fit(X_train, y_train)
rfcf.best_params_
# Train a new classifier using the best parameters found by the grid search
rfc2 = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=50, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,
oob_score=False, random_state=0, verbose=0, warm_start=False)
# Train the classifier to take the training features and learn how they relate
# to the training y (above-average CTR)
rfc2.fit(X_train, y_train)
# For comparison, fit an SVM classifier on the same split and score it on the test set
from sklearn import svm
svm.SVC(C=1.0, class_weight=None, max_iter=100, random_state=0, tol=0.0001,
        verbose=0).fit(X_train, y_train).score(X_test, y_test)
confusion_matrix(y_test, rfc2.predict(X_test))
Looking at the confusion matrix with the "ideal" parameters applied, I see no difference between the confusion matrix of my initial random forest (rfc) and the tuned one (rfc2).
Logistic Regression Classifier
Let’s create something that we can compare our classifier to. Since we have a binary Y value we are testing for, let’s see if we can create a logistic regression model to predict which ads will be above average or below average.
# Fit a logistic regression model and store the class predictions.
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
logreg = LogisticRegression()
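A quick note: StandardScaler is imported above but never used. Logistic regression generally benefits from scaled features, so here is a sketch of how scaling would slot in; the results below use the unscaled features, matching the original run.
# Optional: scale the features before fitting (not used in the results below)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)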
Here we are going to run our logistic regression model the same way as our random forest model.
logreg.fit(X,y)
pred = logreg.predict(X)
log = LogisticRegression(n_jobs=2, random_state=0)
# Train the classifier to take the training features and learn how they relate
# to the training y (above-average CTR)
log.fit(X_train, y_train)
log.predict(X_test[features])[0:100]
df['True'].mean()
np.mean(log.predict(X_train[features]))
np.mean(log.predict(X_test[features]))
Let's compare the prediction rates on our training and test sets to those of the random forest.
Train – 0.36252835820895524 Test – 0.3625515151515151
Both the training and test sets produced consistent prediction rates, which is expected. However, it's notable that the distribution of the logistic regression's predictions is slightly skewed relative to the data: the actual dataset has a 0.384 ratio of True to not True.
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, log.predict(X_test))
log.score(X_test, y_test)
# View the predicted probabilities of the first 10 observations
log.predict_proba(X_test[features])[0:10]
logreg.predict_proba(X)[0:10]
Comparing the predicted probabilities on the full data and the test set, there seems to be some difference, but it's hard to see in the dataframe.
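To make that comparison easier than scanning the dataframe, here is a quick sketch that overlays the two predicted-probability distributions:
# Compare predicted-probability distributions for the full data and the test set
plt.hist(logreg.predict_proba(X)[:, 1], bins=50, density=True, alpha=0.5, label='full data (logreg)')
plt.hist(log.predict_proba(X_test)[:, 1], bins=50, density=True, alpha=0.5, label='test set (log)')
plt.xlabel('Predicted probability of above-average CTR')
plt.ylabel('Density')
plt.legend()
plt.show()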
preds = log.predict(X_test[features])
preds[0:5]
df['True_Prob'] = logreg.predict_proba(X)[:, 1]
df.head()
plt.rcParams['agg.path.chunksize'] = 1000
# Plot the predicted probabilities.
plt.scatter(df['clicked'], df['True'])
plt.plot(df.clicked, df.True_Prob, color='red')
plt.xlabel('clicked')
plt.ylabel('Above Average')
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
plt.rc("font", size=14)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import seaborn as sns
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)
Let's look at the log odds of above average = 1 implied by our model's intercept and coefficients.
# Per-feature log odds at a feature value of 2, holding the other features at 0
logodds = logreg.intercept_ + logreg.coef_[0] * 2
logodds
Now that we have the log odds, we will need to go through the process of converting these log odds to probability.
Convert the log odds to odds, then the odds to probability.
# Convert log odds to odds.
odds = np.exp(logodds)
odds
# Convert odds to probability.
prob = odds/(1 + odds)
prob
This gives a probability of the above-average value being true of around 50%, which is not especially predictive for our purposes.
Now let's look at the most valuable parameters for our logistic regression.
import numpy as np
from sklearn.model_selection import GridSearchCV

# Candidate regularization strengths (illustrative) for the logistic regression
params = {"C": [0.01, 0.1, 1, 10, 100]}
gsearch = GridSearchCV(LogisticRegression(), params, cv=3)

# Create a classifier object with the classifier and parameter candidates
clf = gsearch
clf.fit(X_train, y_train)
clf.best_params_
log2 = LogisticRegression(n_jobs=1,
verbose=0)
# Train the classifier to take the training features and learn how they relate
# to the training y (above-average CTR)
log2.fit(X_train, y_train)
# For comparison, fit an SVM classifier on the same split and score it on the test set
from sklearn import svm
svm.SVC(C=1.0, class_weight=None, max_iter=100, random_state=0, tol=0.0001,
        verbose=0).fit(X_train, y_train).score(X_test, y_test)
confusion_matrix(y_test, log2.predict(X_test))
After checking the ideal parameters for the logistic regression, I wasn't able to improve my confusion matrix.
Conclusion
I ran both a random forest and a logistic regression classifier to see which was more predictive on our training and testing data, using a 2/3 training to 1/3 testing split of a dataset that was reduced from 10 million to 500,000 observations for easier computation. While the logistic regression produced predictions that more closely matched the dataframe's actual 0.38 probability of an observation having an above-average CTR, I felt that our random forest model was stronger: based on the confusion matrices we output, the random forest returned a much higher ratio of correct to incorrect predictions.
Random forest: array([[97535, 4105], [ 7566, 55794]])
Logistic regression: array([[82969, 18671], [19147, 44213]])
As for next steps, I would like to apply this model to my bigger dataset of 10MM observations to confirm that its predictive power holds at a larger scale. I also have an events dataset recording every instance someone went online, but I wasn't able to join my core dataframe with that larger events dataset to make a solid observation.