Outbrain Click Prediction – Random Forest

Submission of Final Report

Now that we have completed our EDA, let's see if we can run some models that predict which sets of ads will have the highest CTR based on the various features.

To start, let's do a little more cleanup to make sure our analyses are clean and easy to work with.

In [1]:
import pandas as pd
import seaborn as sb
%matplotlib inline
from sklearn import datasets
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
import csv as csv

Load the dataframe that I developed during the project EDA phase.

In [2]:
dfwork = pd.read_csv('Outbrain_Kaggle/finalcoredb.csv', sep='\t')
In [3]:
#let's create dummy variables for platform id and traffic source
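The code for that step isn't shown in this cell. Below is a minimal sketch of how it could have been done, assuming the raw file has categorical platform and traffic-source columns (hypothetical names) that map to the desktop/mobile/tablet and internal/search/social dummies used later.

# Sketch only -- 'platform' and 'traffic_source' are assumed column names,
# since the cell that actually built these dummies is not shown.
platform_dummies = pd.get_dummies(dfwork['platform'])
traffic_dummies = pd.get_dummies(dfwork['traffic_source'])

# Attach the dummies and continue under the name `df`,
# which the rest of the notebook uses.
df = pd.concat([dfwork, platform_dummies, traffic_dummies], axis=1)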
In [12]:
df['clicked'].head()
Out[12]:
0    0.000000
1    0.000000
2    0.000000
3    0.367188
4    0.270833
Name: clicked, dtype: float64
In [13]:
df.columns
Out[13]:
Index(['Unnamed: 0', 'document_id', 'timestamp', 'ad_id', 'campaign_id',
       'advertiser_id', 'topic_id', 'source_id', 'publisher_id', 'clicked',
       'desktop', 'mobile', 'tablet', 'internal', 'search', 'social'],
      dtype='object')

These are the columns we will work with to determine which features can help us establish a strong CTR.

In [14]:
df.fillna(0, inplace=True)

Random Forest Classifier

In [15]:
# Load scikit's random forest classifier library
from sklearn.ensemble import RandomForestClassifier

# Load pandas
import pandas as pd

# Load numpy
import numpy as np

# Set random seed
np.random.seed(0)

Let's calculate the average CTR across our entire "clicked" column.

In [17]:
ctr_mean = df['clicked'].mean()
In [18]:
ctr_mean
Out[18]:
0.16305434941875965

Let's create a new column that labels each CTR value as above average or below average with a True or False.

In [19]:
# Create a list to store the data
above_avg = []

# For each row in the column,
for row in df['clicked']:
    # if the CTR is above the overall mean,
    if row > ctr_mean:
        # label it as above average
        above_avg.append('True')
    else:
        # otherwise label it as below average
        above_avg.append('False')
        
# Create a column from the list
df['above_avg'] = above_avg
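As an aside, the same 0/1 label can be built in one vectorized line instead of the loop plus the get_dummies step below; a sketch, assuming df and ctr_mean from the cells above (not what the notebook actually runs):

# Vectorized equivalent: boolean comparison cast straight to 0/1
df['True'] = (df['clicked'] > ctr_mean).astype(int)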
In [20]:
df_class = pd.get_dummies(df['above_avg'], drop_first=True)
In [21]:
df_class.head()
Out[21]:
True
0 0
1 0
2 0
3 1
4 1
In [22]:
#add it back into our dataframe
df = pd.concat([df, df_class], axis=1)
In [24]:
df.drop('above_avg', axis=1, inplace=True)

Drop above_avg, clicked, True, ad_id, and Unnamed: 0, since they are too closely representative of our Y variable.

In [25]:
# Create a list of the feature column's names
features = df.columns.drop(['clicked', 'True', 'ad_id', 'Unnamed: 0'])

# View features
features
Out[25]:
Index(['document_id', 'timestamp', 'campaign_id', 'advertiser_id', 'topic_id',
       'source_id', 'publisher_id', 'desktop', 'mobile', 'tablet', 'internal',
       'search', 'social'],
      dtype='object')

Establish our X and y, as well as our training and testing splits. I'm using a 2/3 training to 1/3 testing split.

In [26]:
y = df['True']
X= df.drop(['clicked', 'True', 'ad_id', 'Unnamed: 0'], axis=1)
In [27]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
In [28]:
X.shape
Out[28]:
(500000, 13)
In [29]:
from sklearn.model_selection import train_test_split

Let's load our random forest classifier and run a cross-validated score with it to measure how well the different combinations of features in the training data predict the outcomes in the testing data.

In [30]:
rfc = RandomForestClassifier()
scores = cross_val_score(rfc, X, y, cv=10)
print(np.average(scores))
0.9312459989544914

The cross-validated score shows strong predictive performance on our dataset. It may be overfit, but we will see.

In [31]:
from sklearn.metrics import classification_report
import pprint as pp
rfc.fit(X_train,y_train)
pp.pprint(classification_report(rfc.predict(X_test), y_test))
('             precision    recall  f1-score   support\n'
 '\n'
 '          0       0.96      0.93      0.94    105275\n'
 '          1       0.88      0.93      0.91     59725\n'
 '\n'
 'avg / total       0.93      0.93      0.93    165000\n')
In [32]:
from sklearn.cross_validation import cross_val_score
/Users/danielclark/anaconda3/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
In [98]:
from sklearn.cross_validation import cross_val_score


for n_trees in range(1, 100, 10):
    model = RandomForestClassifier(n_estimators = n_trees)
    scores = cross_val_score(model, X, y, scoring='roc_auc')
    print('n trees: {}, CV AUC {}, Average AUC {}'.format(n_trees, scores, scores.mean()))
    
    
scores = cross_val_score(model, X, y, scoring='roc_auc')
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))
n trees: 1, CV AUC [0.9281814  0.9270731  0.92784869], Average AUC 0.9277010630333707
n trees: 11, CV AUC [0.95831493 0.95826191 0.95918317], Average AUC 0.9585866713023615
n trees: 21, CV AUC [0.961231   0.96112515 0.96172492], Average AUC 0.9613603555263853
n trees: 31, CV AUC [0.96227717 0.96263208 0.96303503], Average AUC 0.9626480958589451
n trees: 41, CV AUC [0.96305231 0.96342662 0.96376722], Average AUC 0.9634153840905256
n trees: 51, CV AUC [0.96346607 0.96402944 0.96454555], Average AUC 0.9640136859462324
n trees: 61, CV AUC [0.96396412 0.96441186 0.96458517], Average AUC 0.9643203826630992
n trees: 71, CV AUC [0.96425782 0.96431858 0.96494901], Average AUC 0.9645084712910128
n trees: 81, CV AUC [0.96432358 0.96476473 0.96526965], Average AUC 0.9647859844159467
n trees: 91, CV AUC [0.96432661 0.96484705 0.96511333], Average AUC 0.964762330647793
CV AUC [0.96456384 0.96516789 0.96555938], Average AUC 0.9650970370546021

Increasing the number of estimators improves the random forest's predictive performance, with the average AUC leveling off around 0.965.
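To see where the gain levels off, the average AUC can be plotted against the number of trees; a small sketch, assuming X and y from above (it re-runs the cross-validation loop rather than reusing the printed values):

import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

tree_counts = list(range(1, 100, 10))
avg_aucs = []
for n_trees in tree_counts:
    model = RandomForestClassifier(n_estimators=n_trees)
    avg_aucs.append(cross_val_score(model, X, y, scoring='roc_auc').mean())

plt.plot(tree_counts, avg_aucs, marker='o')
plt.xlabel('Number of trees')
plt.ylabel('Average CV AUC')
plt.title('AUC vs. number of estimators')
plt.show()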

In [99]:
from sklearn.cross_validation import cross_val_score

# ... #

scores = cross_val_score(rfc, X, y, scoring='roc_auc', cv=10)
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))
CV AUC [0.96249425 0.96355798 0.96250469 0.96187606 0.96212373 0.96066386
 0.96279448 0.96153078 0.96459095 0.96281962], Average AUC 0.9624956409295521

Looking at our AUC score, the model separates the two classes well: it can achieve a high true positive rate while keeping the false positive rate low.
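To make the AUC concrete, the full ROC curve can be drawn from the fitted model's predicted probabilities; a short sketch, assuming rfc, X_test, and y_test from the cells above:

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Probability of the positive (above-average) class
probs = rfc.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)

plt.plot(fpr, tpr, label='Random Forest (AUC = {:.3f})'.format(auc(fpr, tpr)))
plt.plot([0, 1], [0, 1], linestyle='--', color='grey')  # chance line
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()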

In [35]:
# Show the number of observations for the test and training dataframes
print('Number of observations in the training data:', len(X_train))
print('Number of observations in the test data:',len(X_test))
Number of observations in the training data: 335000
Number of observations in the test data: 165000
In [36]:
# Create a random forest classifier; rfc here stands for 'random forest classifier'
rfc = RandomForestClassifier(n_jobs=2, random_state=0, max_depth=50)

# Train the Classifier to take the training features and learn how they relate
# to the training y (whether the ad's CTR is above average)
rfc.fit(X_train, y_train)
Out[36]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=50, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,
            oob_score=False, random_state=0, verbose=0, warm_start=False)
In [37]:
# Apply the Classifier we trained to the test data 
rfc.predict(X_test[features])[0:100]
Out[37]:
array([0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0], dtype=uint8)
In [38]:
np.mean(rfc.predict(X_train[features]))
Out[38]:
0.36252835820895524
In [39]:
np.mean(rfc.predict(X_test[features]))

# This calculates the mean of the model's predictions on the test set.
Out[39]:
0.3625515151515151

This compares the proportion of positive predictions on the training set and the test set. As you can see, they match almost exactly.

In [40]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, rfc.predict(X_test))
Out[40]:
array([[97647,  3961],
       [ 7532, 55860]])

Looking at our confusion matrix, false positives (3,961) and false negatives (7,532) are small relative to the correctly classified observations.
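Precision and recall for the above-average class can be read straight off that matrix; a small sketch, assuming rfc, X_test, and y_test from above (rows of the matrix are actual classes, columns are predictions):

from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, rfc.predict(X_test)).ravel()
precision = tp / (tp + fp)  # of the ads predicted above average, how many truly were
recall = tp / (tp + fn)     # of the truly above-average ads, how many we caught
print('precision: {:.3f}, recall: {:.3f}'.format(precision, recall))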

In [41]:
rfc.score(X_test, y_test)
Out[41]:
0.9303454545454546
In [42]:
# View the predicted probabilities of the first 10 observations
rfc.predict_proba(X_test[features])[0:10]
Out[42]:
array([[0.81055556, 0.18944444],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [1.        , 0.        ],
       [0.99463087, 0.00536913],
       [1.        , 0.        ],
       [0.22690476, 0.77309524],
       [1.        , 0.        ],
       [0.81462264, 0.18537736]])
In [44]:
preds = df['True'][rfc.predict(X_test[features])]
#run a prediction of our test dataset using the models above

Run a prediction of the CTR above average classification on the first five values

In [45]:
preds[0:5]
Out[45]:
0    0
1    0
1    0
1    0
0    0
Name: True, dtype: uint8
In [46]:
# View a list of the features and their importance scores
list(zip(X_train[features], rfc.feature_importances_))
Out[46]:
[('document_id', 0.24476066591999746),
 ('timestamp', 0.12598455356540972),
 ('campaign_id', 0.2104563796014575),
 ('advertiser_id', 0.07389685173299361),
 ('topic_id', 0.05833893203881163),
 ('source_id', 0.15149686395139042),
 ('publisher_id', 0.11010019094447798),
 ('desktop', 0.0015468276543025557),
 ('mobile', 0.002456213883768503),
 ('tablet', 0.0006630240272853334),
 ('internal', 0.0052573939226172785),
 ('search', 0.0027001658654666827),
 ('social', 0.01234193689202124)]
In [47]:
feature_imp = pd.Series(rfc.feature_importances_,index=features).sort_values(ascending=False)
In [48]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()

Looking across our features, it appears that document_id (the web page), campaign_id, source_id, and timestamp are the most predictive of a more efficient click-through rate. That is, the web page, site category, time, and ad campaign are the best indicators of strong ad performance.

Now let's do a grid search to tune our hyperparameters and look at the most valuable parameters for our random forest.
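Note that the grid search set up in the next few cells is built around an SVC and is never actually fit, so the random forest parameters that follow are not truly the output of a search. For reference, here is a hedged sketch of what a grid search over the random forest's own hyperparameters could look like (the parameter grid is an illustrative choice, not what the notebook ran):

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

rf_params = {
    'n_estimators': [10, 50, 100],
    'max_depth': [10, 30, 50],
    'min_samples_leaf': [1, 5],
}
rf_search = GridSearchCV(RandomForestClassifier(random_state=0),
                         rf_params, scoring='roc_auc', cv=3, n_jobs=-1)
rf_search.fit(X_train, y_train)
print(rf_search.best_params_)
print(rf_search.best_score_)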

In [49]:
import numpy as np
from sklearn.grid_search import GridSearchCV
from sklearn import datasets, svm
import matplotlib.pyplot as plt
/Users/danielclark/anaconda3/lib/python3.6/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
In [50]:
model =svm.SVC()
params = {"C":[1, 10, 100], "kernel":["rbf","linear"],"gamma":[0.001, 0.0001]}
gsearch=GridSearchCV(model,params)
In [51]:
# Create a classifier object with the classifier and parameter candidates
rfcf = gsearch
In [53]:
rfcf
Out[53]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=50, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,
            oob_score=False, random_state=0, verbose=0, warm_start=False)
In [54]:
# Train a new classifier using the same parameters as the initial random forest
rfc2 = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=50, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,
            oob_score=False, random_state=0, verbose=0, warm_start=False)
In [55]:
# Train the Classifier to take the training features and learn how they relate
# to the training y (whether the ad's CTR is above average)
rfc2.fit(X_train, y_train)
Out[55]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=50, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,
            oob_score=False, random_state=0, verbose=0, warm_start=False)
In [56]:
# Fit a support vector classifier for comparison (capped at 100 iterations)
svm.SVC(C=1.0, class_weight=None, max_iter=100, random_state=0, tol=0.0001,
          verbose=0).fit(X_train, y_train).score(X_test, y_test)
/Users/danielclark/anaconda3/lib/python3.6/site-packages/sklearn/svm/base.py:218: ConvergenceWarning: Solver terminated early (max_iter=100).  Consider pre-processing your data with StandardScaler or MinMaxScaler.
  % self.max_iter, ConvergenceWarning)
Out[56]:
0.3855212121212121
In [57]:
confusion_matrix(y_test, rfc2.predict(X_test))
Out[57]:
array([[97647,  3961],
       [ 7532, 55860]])

Looking at the confusion matrix produced with these "ideal" parameters, there is no difference between the confusion matrix of my initial random forest (rfc) and that of the new one (rfc2).

Logistic Regression Classifier

Let’s create something that we can compare our classifier to. Since we have a binary Y value we are testing for, let’s see if we can create a logistic regression model to predict which ads will be above average or below average.

In [58]:
# Fit a logistic regression model and store the class predictions.
from sklearn.linear_model import LogisticRegression

from sklearn import datasets
from sklearn.preprocessing import StandardScaler
In [59]:
logreg = LogisticRegression()

Here we are going to run our logistic regression model the exact same way as our random forest model.

In [60]:
logreg.fit(X,y)
pred = logreg.predict(X)
In [61]:
log = LogisticRegression(n_jobs=2, random_state=0)

# Train the Classifier to take the training features and learn how they relate
# to the training y (whether the ad's CTR is above average)
log.fit(X_train, y_train)
/Users/danielclark/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:1228: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 2.
  " = {}.".format(self.n_jobs))
Out[61]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=2,
          penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [62]:
log.predict(X_test[features])[0:100]
Out[62]:
array([0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0], dtype=uint8)
In [63]:
df['True'].mean()
Out[63]:
0.384438
In [64]:
np.mean(log.predict(X_train[features]))
Out[64]:
0.38302985074626866
In [65]:
np.mean(log.predict(X_test[features]))
Out[65]:
0.38378181818181817

Let's compare this to the random forest's prediction rates on its training and test sets:

Train – 0.36252835820895524 Test – 0.3625515151515151

Both models produce consistent prediction rates between their training and testing data, which is expected. However, it's notable that the logistic regression's predictions skew slightly more towards above average than the random forest's, and they sit closer to the actual dataset's ratio of 0.384 True observations.

In [66]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, log.predict(X_test))
Out[66]:
array([[82703, 18905],
       [18973, 44419]])
In [67]:
log.score(X_test, y_test)
Out[67]:
0.7704363636363636
In [68]:
# View the predicted probabilities of the first 10 observations
log.predict_proba(X_test[features])[0:10]
Out[68]:
array([[0.86578321, 0.13421679],
       [0.43339606, 0.56660394],
       [0.44309772, 0.55690228],
       [0.29822099, 0.70177901],
       [0.90259753, 0.09740247],
       [0.84710841, 0.15289159],
       [0.83407236, 0.16592764],
       [0.49403533, 0.50596467],
       [0.15522634, 0.84477366],
       [0.91226979, 0.08773021]])
In [69]:
logreg.predict_proba(X)[0:10]
Out[69]:
array([[0.84850591, 0.15149409],
       [0.89217383, 0.10782617],
       [0.90594105, 0.09405895],
       [0.4318877 , 0.5681123 ],
       [0.32673021, 0.67326979],
       [0.87844156, 0.12155844],
       [0.80200762, 0.19799238],
       [0.40426772, 0.59573228],
       [0.60296166, 0.39703834],
       [0.32621639, 0.67378361]])

Comparing our full-dataset and test-set probabilities, there seems to be some difference, but it's hard to recognize in the dataframe.

In [70]:
preds = df['True'][log.predict(X_test[features])]
In [71]:
preds[0:5]
Out[71]:
0    0
1    0
1    0
1    0
0    0
Name: True, dtype: uint8
In [72]:
df['True_Prob'] = logreg.predict_proba(X)[:, 1]
In [73]:
df.head()
Out[73]:
Unnamed: 0 document_id timestamp ad_id campaign_id advertiser_id topic_id source_id publisher_id clicked desktop mobile tablet internal search social True True_Prob
0 0 1649400 40696880 270167 25886 112 265 5315.0 1046.0 0.000000 0 0 1 1 0 0 0 0.151494
1 1 1629915 16740705 424102 28821 1635 183 6939.0 912.0 0.000000 1 0 0 1 0 0 0 0.107826
2 2 1750649 7493196 326034 25885 112 281 5315.0 1046.0 0.000000 1 0 0 1 0 0 0 0.094059
3 3 1767529 21211103 331303 16622 201 241 10198.0 9.0 0.367188 1 0 0 1 0 0 1 0.568112
4 4 1867736 65537339 369453 6395 177 36 3568.0 1035.0 0.270833 0 1 0 0 0 1 1 0.673270
In [74]:
plt.rcParams['agg.path.chunksize'] = 1000
In [75]:
# Plot the predicted probabilities.
plt.scatter(df['clicked'], df['True'])
plt.plot(df.clicked, df.True_Prob, color='red')
plt.xlabel('clicked')
plt.ylabel('Above Average')
Out[75]:
Text(0,0.5,'Above Average')
In [ ]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt 
plt.rc("font", size=14)
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
import seaborn as sns
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)
Compute the log odds of clicked when above average = 1.

In [76]:
logodds = logreg.intercept_ + logreg.coef_[0] * 2
logodds
Out[76]:
array([-3.55342578e-07,  3.08890954e-08, -2.56782921e-04, -2.39228966e-05,
       -8.21974552e-07,  4.68565452e-04,  1.28267309e-05, -1.64137959e-08,
        8.48999603e-09,  4.46816919e-09, -5.34960562e-08,  3.58955757e-08,
        1.41448498e-08])

Now that we have the log odds, we will need to go through the process of converting these log odds to probability.

Convert the log odds to odds, then the odds to probability.
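In formula terms, odds = exp(log-odds) and probability = odds / (1 + odds), which is the same as 1 / (1 + exp(-log-odds)); the two cells below apply exactly these steps.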

In [77]:
# Convert log odds to odds.
odds = np.exp(logodds)
odds
Out[77]:
array([0.99999964, 1.00000003, 0.99974325, 0.99997608, 0.99999918,
       1.00046868, 1.00001283, 0.99999998, 1.00000001, 1.        ,
       0.99999995, 1.00000004, 1.00000001])
In [78]:
# Convert odds to probability.
prob = odds/(1 + odds)
prob
Out[78]:
array([0.49999991, 0.50000001, 0.4999358 , 0.49999402, 0.49999979,
       0.50011714, 0.50000321, 0.5       , 0.5       , 0.5       ,
       0.49999999, 0.50000001, 0.5       ])

This puts the probability of our above-average label being true at around 50%, which is not especially predictive for our purposes.

Now let's look at the most valuable parameters for our logistic regression.
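As with the random forest, the grid search below is set up on an SVC rather than the logistic regression itself and is never fit. A hedged sketch of tuning the logistic regression's own regularization strength instead (the C grid is an illustrative choice):

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

log_params = {'C': [0.01, 0.1, 1, 10, 100]}
log_search = GridSearchCV(LogisticRegression(), log_params, scoring='roc_auc', cv=3)
log_search.fit(X_train, y_train)
print(log_search.best_params_)
print(log_search.best_score_)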

In [79]:
import numpy as np
from sklearn.grid_search import GridSearchCV
from sklearn import datasets, svm
import matplotlib.pyplot as plt
In [80]:
model =svm.SVC()
params = {"C":[1, 10, 100], "kernel":["rbf","linear"],"gamma":[0.001, 0.0001]}
gsearch=GridSearchCV(model,params)
In [81]:
# Create a classifier object with the classifier and parameter candidates
clf = gsearch
In [83]:
clf
Out[83]:
GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'C': [1, 10, 100], 'kernel': ['rbf', 'linear'], 'gamma': [0.001, 0.0001]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
In [94]:
log2 = LogisticRegression(n_jobs=1, verbose=0)

# Train the Classifier to take the training features and learn how they relate
# to the training y (whether the ad's CTR is above average)
log2.fit(X_train, y_train)
Out[94]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [96]:
# Fit a support vector classifier for comparison (capped at 100 iterations)
svm.SVC(C=1.0, class_weight=None, max_iter=100, random_state=0, tol=0.0001,
          verbose=0).fit(X_train, y_train).score(X_test, y_test)
/Users/danielclark/anaconda3/lib/python3.6/site-packages/sklearn/svm/base.py:218: ConvergenceWarning: Solver terminated early (max_iter=100).  Consider pre-processing your data with StandardScaler or MinMaxScaler.
  % self.max_iter, ConvergenceWarning)
Out[96]:
0.3855212121212121
In [97]:
confusion_matrix(y_test, log2.predict(X_test))
Out[97]:
array([[82696, 18912],
       [18962, 44430]])

After checking the ideal parameters for the logistic regression, I wasn't able to improve my confusion matrix.

Conclusion

I ran both a random forest and a logistic regression classifier to see which is more predictive on our training and testing data, using a 2/3 training to 1/3 testing split on a dataset reduced from 10 million to 500,000 rows for computational ease. While the logistic regression produced predictions that more closely matched the dataset's actual rate of roughly 0.38 above-average observations, I felt our random forest model was stronger: based on the confusion matrices below, the random forest returned a much higher ratio of correct to incorrect predictions.

Random Forest array([[97535, 4105], [ 7566, 55794]])

Logistic regression array([[82969, 18671], [19147, 44213]])

As far as next steps for future research, I would like to apply this model to my larger dataset of 10MM observations to confirm that its predictive power holds at a larger scale. I also have an events dataset recording every instance someone went online, but I wasn't able to join my core dataframe with that larger events dataset in order to draw solid conclusions from it.
