Now that the EDA is done, let's see if we can build some models that predict which sets of ads will have the highest CTR based on the various features.
To start, let's do a little more cleanup to make sure our analyses are clean and easy to work with.
%matplotlib inline
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets, preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
Load the dataframe that I developed during the project's EDA phase.
df = pd.read_csv('Outbrain_Kaggle/finalcoredb.csv', sep='\t')
#let's create dummy variables for platform id and traffic source
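The code for that step isn't shown above, so here is a minimal sketch of what it could look like; the column names 'platform' and 'traffic_source' are assumptions based on the comment and may not match the actual EDA dataframe.
# A sketch, assuming the EDA dataframe has 'platform' and 'traffic_source' columns
platform_dummies = pd.get_dummies(df['platform'], prefix='platform', drop_first=True)
source_dummies = pd.get_dummies(df['traffic_source'], prefix='traffic_source', drop_first=True)
df = pd.concat([df.drop(['platform', 'traffic_source'], axis=1),
                platform_dummies, source_dummies], axis=1)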
df['clicked'].head()
df.columns
Here are the columns we will work with to measure which features can help us establish a strong CTR.
df.fillna(0, inplace=True)
Random Forest Classifier
# Load scikit's random forest classifier library
from sklearn.ensemble import RandomForestClassifier
# Load pandas
import pandas as pd
# Load numpy
import numpy as np
# Set random seed
np.random.seed(0)
Let’s calculate the average click across our entire “clicked” column
ctr_mean = df['clicked'].mean()
ctr_mean
Let’s create a new column that labels each CTR value as above average or below average with a True or False
# Create a list to store the labels
above_avg = []

# For each row in the 'clicked' column,
for row in df['clicked']:
    # if the value is above the mean CTR,
    if row > ctr_mean:
        # label it 'True'
        above_avg.append('True')
    else:
        # otherwise label it 'False'
        above_avg.append('False')

# Create a column from the list
df['above_avg'] = above_avg
df_class = pd.get_dummies(df['above_avg'], drop_first=True)
df_class.head()
#add it back into our dataframe
df = pd.concat([df, df_class], axis=1)
df.drop('above_avg', axis=1, inplace=True)
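As an aside, the loop plus dummy-encoding round trip above can be collapsed into a single vectorized line that yields the same 0/1 column (a sketch; not needed if you ran the cells above).
# Vectorized equivalent of the loop and get_dummies steps above
df['True'] = (df['clicked'] > ctr_mean).astype(int)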
Drop clicked, True, ad_id, and Unnamed: 0 from the feature list, since they are too closely representative of our Y variable (above_avg was already dropped above).
# Create a list of the feature column's names
features = df.columns.drop(['clicked', 'True', 'ad_id', 'Unnamed: 0'])
# View features
features
Establish our X and y, as well as our train and test splits. I'm using a 2/3 training to 1/3 testing split.
y = df['True']
X= df.drop(['clicked', 'True', 'ad_id', 'Unnamed: 0'], axis=1)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
X.shape
Let's load our random forest classifier and run a cross-validated score with it to measure how well the features predict the target across different folds of the data.
rfc = RandomForestClassifier()
scores = cross_val_score(rfc, X, y, cv=10)
print(np.average(scores))
The cross-validated score shows strong predictive results on our dataset. It may be overfit, but we will see.
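As a quick overfitting check, here is a sketch that compares accuracy on the data the model was fit on against the held-out split:
# Quick overfitting check: compare train and test accuracy
rfc.fit(X_train, y_train)
print('Train accuracy: {:.3f}'.format(rfc.score(X_train, y_train)))
print('Test accuracy:  {:.3f}'.format(rfc.score(X_test, y_test)))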
from sklearn.metrics import classification_report

rfc.fit(X_train, y_train)
print(classification_report(y_test, rfc.predict(X_test)))
from sklearn.model_selection import cross_val_score
for n_trees in range(1, 100, 10):
    model = RandomForestClassifier(n_estimators=n_trees)
    scores = cross_val_score(model, X, y, scoring='roc_auc')
    print('n trees: {}, CV AUC {}, Average AUC {}'.format(n_trees, scores, scores.mean()))

scores = cross_val_score(model, X, y, scoring='roc_auc')
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))
Increasing the number of estimators improves the predictive performance of the random forest, reaching an AUC of about 0.965.
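For reference, here is a sketch of what that looks like as a standalone model; n_estimators=91 is simply the largest value the loop above tried, not a tuned optimum.
# Cross-validate a forest with more trees (91 is the last value the loop tried)
rfc_more_trees = RandomForestClassifier(n_estimators=91, random_state=0)
scores = cross_val_score(rfc_more_trees, X, y, scoring='roc_auc', cv=5)
print('Average AUC with 91 trees: {:.3f}'.format(scores.mean()))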
from sklearn.model_selection import cross_val_score
# ... #
scores = cross_val_score(rfc, X, y, scoring='roc_auc', cv=10)
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))
Looking at our AUC score, it appears the model separates the two classes well: we can expect a high rate of true positives and true negatives with relatively few false positives and false negatives.
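To see that visually rather than as a single number, here is a quick ROC-curve sketch using the forest already fit on the training split above:
# Plot the ROC curve for the held-out test set
from sklearn.metrics import roc_curve, auc
probs = rfc.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
plt.plot(fpr, tpr, label='AUC = {:.3f}'.format(auc(fpr, tpr)))
plt.plot([0, 1], [0, 1], linestyle='--', color='grey')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()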
# Show the number of observations for the test and training dataframes
print('Number of observations in the training data:', len(X_train))
print('Number of observations in the test data:',len(X_test))
# Create a random forest classifier; here we call it rfc
rfc = RandomForestClassifier(n_jobs=2, random_state=0, max_depth=50)
# Train the classifier to take the training features and learn how they relate
# to the training y (above-average CTR)
rfc.fit(X_train, y_train)
# Apply the Classifier we trained to the test data
rfc.predict(X_test[features])[0:100]
# Proportion of above-average predictions on the training set
np.mean(rfc.predict(X_train[features]))
# Proportion of above-average predictions on the test set
np.mean(rfc.predict(X_test[features]))
This compares the rate of above-average predictions on the training set and the test set. As you can see, they are closely matched.
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, rfc.predict(X_test))
Looking at our confusion matrix, we have a very small proportion of false positives relative to true positives.
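You can read that ratio off the matrix directly; a small sketch:
# Unpack the confusion matrix and compute precision and recall for the positive class
tn, fp, fn, tp = confusion_matrix(y_test, rfc.predict(X_test)).ravel()
print('Precision: {:.3f}'.format(tp / (tp + fp)))
print('Recall:    {:.3f}'.format(tp / (tp + fn)))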
rfc.score(X_test, y_test)
# View the predicted probabilities of the first 10 observations
rfc.predict_proba(X_test[features])[0:10]
# Run a prediction of our test dataset using the model above
preds = rfc.predict(X_test[features])
Run a prediction of the CTR above-average classification and view the first five values.
preds[0:5]
# View a list of the features and their importance scores
list(zip(X_train[features], rfc.feature_importances_))
feature_imp = pd.Series(rfc.feature_importances_,index=features).sort_values(ascending=False)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.legend()
plt.show()
Looking across our features, it appears that document ID (web page), campaign ID, source ID, and timestamp are the most predictive of an efficient click-through rate; i.e. the web page, site category, time, and ad campaign are the best indicators of strong ad performance.
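As a sanity check on that reading, here is a quick sketch that re-scores a fresh forest on just the four most important features (names are pulled from feature_imp, so nothing is hard-coded):
# Cross-validate using only the four most important features
top4 = list(feature_imp.index[:4])
scores_top4 = cross_val_score(RandomForestClassifier(random_state=0), X[top4], y, cv=5)
print('CV accuracy with only the top 4 features: {:.3f}'.format(scores_top4.mean()))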
Now let's do a grid search to tune our hyperparameters and see which parameter settings are most valuable for our random forest.
import numpy as np
from sklearn.model_selection import GridSearchCV

# Candidate hyperparameter values (illustrative) for the random forest
params = {"n_estimators": [10, 50, 100], "max_depth": [10, 50, None]}
gsearch = GridSearchCV(RandomForestClassifier(random_state=0), params, cv=3)

# Create a classifier object with the classifier and parameter candidates
rfcf = gsearch
rfcf.fit(X_train, y_train)
rfcf.best_params_
# Train a new classifier using the best parameters found by the grid search
rfc2 = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=50, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,
oob_score=False, random_state=0, verbose=0, warm_start=False)
# Train the classifier to take the training features and learn how they relate
# to the training y (above-average CTR)
rfc2.fit(X_train, y_train)
# For comparison, fit an SVM classifier on the same split and score it on the test set
from sklearn import svm
svm.SVC(C=1.0, class_weight=None, max_iter=100, random_state=0, tol=0.0001,
        verbose=0).fit(X_train, y_train).score(X_test, y_test)
confusion_matrix(y_test, rfc2.predict(X_test))
Looking at the confusion matrix with the "ideal" parameters applied, I see no difference between the confusion matrix of my initial random forest (rfc) and the tuned one (rfc2).
Logistic Regression Classifier
Let’s create something that we can compare our classifier to. Since we have a binary Y value we are testing for, let’s see if we can create a logistic regression model to predict which ads will be above average or below average.
# Fit a logistic regression model and store the class predictions.
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
logreg = LogisticRegression()
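A quick note: StandardScaler is imported above but never used. Logistic regression generally benefits from scaled features, so here is a sketch of how scaling would slot in; the results below use the unscaled features, matching the original run.
# Optional: scale the features before fitting (not used in the results below)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)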
Here we are going to run our logistic regression model the same way as our random forest model.
logreg.fit(X,y)
pred = logreg.predict(X)
log = LogisticRegression(n_jobs=2, random_state=0)
# Train the classifier to take the training features and learn how they relate
# to the training y (above-average CTR)
log.fit(X_train, y_train)
log.predict(X_test[features])[0:100]
df['True'].mean()
np.mean(log.predict(X_train[features]))
np.mean(log.predict(X_test[features]))
Let's compare the prediction rates on our training and test sets to those of the random forest.
Train – 0.36252835820895524 Test – 0.3625515151515151
Both the training and test sets produced consistent prediction rates, which is expected. However, it's notable that the distribution of the logistic regression's predictions is slightly skewed relative to the data: the actual dataset has a 0.384 ratio of True to not True.
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, log.predict(X_test))
log.score(X_test, y_test)
# View the predicted probabilities of the first 10 observations
log.predict_proba(X_test[features])[0:10]
logreg.predict_proba(X)[0:10]
Comparing the predicted probabilities on the full data and the test set, there seems to be some difference, but it's hard to see in the dataframe.
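To make that comparison easier than scanning the dataframe, here is a quick sketch that overlays the two predicted-probability distributions:
# Compare predicted-probability distributions for the full data and the test set
plt.hist(logreg.predict_proba(X)[:, 1], bins=50, density=True, alpha=0.5, label='full data (logreg)')
plt.hist(log.predict_proba(X_test)[:, 1], bins=50, density=True, alpha=0.5, label='test set (log)')
plt.xlabel('Predicted probability of above-average CTR')
plt.ylabel('Density')
plt.legend()
plt.show()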
preds = log.predict(X_test[features])
preds[0:5]
df['True_Prob'] = logreg.predict_proba(X)[:, 1]
df.head()
plt.rcParams['agg.path.chunksize'] = 1000
# Plot the predicted probabilities.
plt.scatter(df['clicked'], df['True'])
plt.plot(df.clicked, df.True_Prob, color='red')
plt.xlabel('clicked')
plt.ylabel('Above Average')
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
plt.rc("font", size=14)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import seaborn as sns
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)
Let's look at the log odds of above average = 1 implied by our model's intercept and coefficients.
# Per-feature log odds at a feature value of 2, holding the other features at 0
logodds = logreg.intercept_ + logreg.coef_[0] * 2
logodds
Now that we have the log odds, we will need to go through the process of converting these log odds to probability.
Convert the log odds to odds, then the odds to probability.
# Convert log odds to odds.
odds = np.exp(logodds)
odds
# Convert odds to probability.
prob = odds/(1 + odds)
prob
This gives a probability of the above-average value being true of around 50%, which is not especially predictive for our purposes.
Now let's look at the most valuable parameters for our logistic regression.
import numpy as np
from sklearn.model_selection import GridSearchCV

# Candidate regularization strengths (illustrative) for the logistic regression
params = {"C": [0.01, 0.1, 1, 10, 100]}
gsearch = GridSearchCV(LogisticRegression(), params, cv=3)

# Create a classifier object with the classifier and parameter candidates
clf = gsearch
clf.fit(X_train, y_train)
clf.best_params_
log2 = LogisticRegression(n_jobs=1,
verbose=0)
# Train the classifier to take the training features and learn how they relate
# to the training y (above-average CTR)
log2.fit(X_train, y_train)
# For comparison, fit an SVM classifier on the same split and score it on the test set
from sklearn import svm
svm.SVC(C=1.0, class_weight=None, max_iter=100, random_state=0, tol=0.0001,
        verbose=0).fit(X_train, y_train).score(X_test, y_test)
confusion_matrix(y_test, log2.predict(X_test))
After checking the ideal parameters for the logistic regression, I wasn't able to improve my confusion matrix.
Conclusion
I ran both a random forest and a logistic regression classifier to see which was more predictive on our training and testing data, using a 2/3 training to 1/3 testing split of a dataset that was reduced from 10 million to 500,000 observations for easier computation. While the logistic regression produced predictions that more closely matched the dataframe's actual 0.38 probability of an observation having an above-average CTR, I felt that our random forest model was stronger: based on the confusion matrices we output, the random forest returned a much higher ratio of correct to incorrect predictions.
Random forest: array([[97535, 4105], [ 7566, 55794]])
Logistic regression: array([[82969, 18671], [19147, 44213]])
As for next steps, I would like to apply this model to my bigger dataset of 10MM observations to confirm that its predictive power holds at a larger scale. I also have an events dataset recording every instance someone went online, but I wasn't able to join my core dataframe with that larger events dataset to make a solid observation.