An important issue confronting retailers and other businesses today is the prevalence of credit card fraud. This issue recently hit home, as my son was a victim a week before I wrote this.
We can apply machine learning to help detect credit card fraud, but there is a catch: the vast majority of transactions are perfectly legitimate, which reduces a typical model’s sensitivity to fraud.
As an example, consider a logistic regression model running against the Credit Card Fraud dataset posted on Kaggle. You can download it here: https://www.kaggle.com/mlg-ulb/creditcardfraud
To follow along you will need an installation of Python with
the following packages:
NumPy
Pandas
scikit-learn
You can get all those packages, and many more, with the Anaconda distribution, which you can find at:
https://www.anaconda.com/download/
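If you would rather not install the full Anaconda distribution, the three packages can also be installed individually with pip (assuming you already have Python and pip available):

pip install numpy pandas scikit-learn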
To begin, start with the necessary imports:

import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, cohen_kappa_score
from sklearn.metrics import f1_score, recall_score
We need NumPy for some basic mathematical functions and Pandas to read in the CSV file and create the data frame. We will use several functions from sklearn.metrics to evaluate the results from our models.
Next, we need to create a couple of helper functions.
PrintStats will compile and display the results from a model. Here is the code:
def PrintStats(cmat, y_test, pred):
    # separate out the confusion matrix components
    # (rows are true labels, columns are predictions)
    tneg = cmat[0][0]
    fpos = cmat[0][1]
    fneg = cmat[1][0]
    tpos = cmat[1][1]
    # calculate F1 and Recall scores
    f1Score = round(f1_score(y_test, pred), 2)
    recallScore = round(recall_score(y_test, pred), 2)
    # calculate and display metrics
    print(cmat)
    print('Accuracy: ' + str(np.round(100 * float(tpos + tneg) / float(tpos + fneg + fpos + tneg), 2)) + '%')
    print('Cohen Kappa: ' + str(np.round(cohen_kappa_score(y_test, pred), 3)))
    print("Sensitivity/Recall for Model : {recall_score}".format(recall_score=recallScore))
    print("F1 Score for Model : {f1_score}".format(f1_score=f1Score))
PrintStats takes as parameters a confusion matrix, test
labels and prediction labels and does the following:
Separates the confusion matrix into its constituent parts.
Calculates the F1, Recall, Accuracy and Cohen Kappa scores.
Prints the confusion matrix and all the calculated scores.
We also need a function, called RunModel, to actually train
the model and generate predictions against the test data. Here is the code:
def RunModel(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train.values.ravel())
    pred = model.predict(X_test)
    matrix = confusion_matrix(y_test, pred)
    return matrix, pred
The RunModel function takes as input the untrained model
along with all the test and training data, including labels. It trains the
model, runs the prediction using the test data, and returns the confusion
matrix along with the predicted labels.
With these two functions created, it’s time to see if we can
create a model to do fraud detection. Fraud detection is generally considered a
two-class problem. In other words, a transaction is either:
Class #1: Not fraud
Or
Class #2: Fraud
Our goal is to try to determine to which class a particular
transaction belongs. Step #1 is to load the CSV data and create the classes.
This code will do the trick:
df = pd.read_csv('../Datasets/creditcard.csv')
class_names = {0: 'Not Fraud', 1: 'Fraud'}
print(df.Class.value_counts().rename(index=class_names))
It generates the following result:
Not Fraud 284315
Fraud 492
Name: Class, dtype: int64
This is a fairly typical fraud dataset: out of nearly 300,000 transactions, only 492 were labelled as fraudulent. That may not seem like much, but each fraudulent transaction represents a significant expense; together, such transactions may represent billions of dollars of lost revenue each year. The imbalance also poses a detection problem: such a small percentage of fraud transactions makes it harder to weed out the offenders from the overwhelming number of good transactions.
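To put a number on the imbalance, Pandas can report the relative class frequencies directly, reusing the df and class_names defined above:

# show each class as a fraction of all transactions
print(df.Class.value_counts(normalize=True).rename(index=class_names))

Fraud works out to well under 1% of all rows (roughly 0.17%).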
Step #2 is to define the features we want to use. Normally,
we want to apply some dimension reduction and feature engineering to our data,
but that is another article (or two). Instead we’ll just use the whole dataset
here with the following code:
feature_names = df.iloc[:, 1:30].columns
target = df.iloc[:, 30:].columns
data_features = df[feature_names]
data_target = df[target]
With the dataset defined, step #3 is to split the data into
training and test sets. To do this, we need to import another function and run
the following code:
from sklearn.model_selection import train_test_split

np.random.seed(123)
X_train, X_test, y_train, y_test = train_test_split(data_features, data_target, \
    train_size=0.70, test_size=0.30, random_state=1)
The train_test_split function uses a randomizer to separate the data into training and test sets: 70% of the data for training and 30% for testing. Setting random_state ensures the same split is used for every run, and the NumPy seed keeps the sampling we do later reproducible.
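As a quick sanity check, it is worth confirming the split sizes and seeing how few fraud cases land in each partition. A minimal sketch:

# confirm the 70/30 split and count the fraud rows in each partition
print(len(X_train), len(X_test))
print(int(y_train.Class.sum()), int(y_test.Class.sum()))

With this seed, the test set holds 85443 rows, only 135 of which are fraud (the 57 + 78 you will see in the confusion matrix below).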
For step #4, we pick a machine learning technique, or model. Perhaps the most common two-class machine learning technique is logistic regression, and we will use it for this first test:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
cmat, pred = RunModel(lr, X_train, y_train, X_test, y_test)
PrintStats(cmat, y_test, pred)
The output from this run should look like this:
[[85293    15]
 [   57    78]]
Accuracy: 99.92%
Cohen Kappa: 0.684
Sensitivity/Recall for Model : 0.58
F1 Score for Model : 0.68
You might initially think the model did a good job. After all, it got 99.92% of its predictions correct. That is true, except that if you look closely at the confusion matrix you will see the following results:
85293 transactions were classified as valid that were actually valid
15 transactions were classified as fraud that were actually valid (type I error)
57 transactions were classified as valid that were actually fraud (type II error)
78 transactions were classified as fraud that were actually fraud
So, while the accuracy was great, the algorithm misclassified more than 4 in 10 fraudulent transactions. In fact, an algorithm that simply classified everything as valid would have an accuracy better than 99.9% and be entirely useless! Accuracy alone, then, is not a reliable measure of a model’s effectiveness. Instead, we look at other measures like the Cohen Kappa, Recall, and F1 scores, where in each case we want a score as close to 1 as we can get.
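You can check the “classify everything as valid” claim directly with scikit-learn’s DummyClassifier. A minimal sketch, reusing the helper functions from earlier:

from sklearn.dummy import DummyClassifier

# a baseline that labels every transaction with the majority class (not fraud)
dummy = DummyClassifier(strategy='most_frequent')
cmat, pred = RunModel(dummy, X_train, y_train, X_test, y_test)
PrintStats(cmat, y_test, pred)

Accuracy comes out above 99.8%, while Recall, F1, and Cohen Kappa all fall to 0.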
Maybe another model will work. How about a RandomForest
classifier? The code is similar to logistic regression:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, n_jobs=4)
cmat, pred = RunModel(rf, X_train, y_train, X_test, y_test)
PrintStats(cmat, y_test, pred)
Trying this classifier will get you results similar to the
following:
[[85297    11]
 [   31   104]]
Accuracy: 99.95%
Cohen Kappa: 0.832
Sensitivity/Recall for Model : 0.77
F1 Score for Model : 0.83
That’s quite a bit better. The accuracy went up only slightly, but the other scores showed significant improvement. So, one way to improve our detection is to try different models and see how they perform. Clearly, changing models helped. But there are other options too. One is over-sampling the fraud records or, conversely, under-sampling the good records. Over-sampling means adding fraud records to our training sample, thereby increasing the overall proportion of fraud records; under-sampling means removing valid records from the sample, which has the same effect. Either way, changing the sampling makes the algorithm more “sensitive” to fraud transactions.
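To illustrate the over-sampling side (this sketch is not part of the original walkthrough, and the variable names are my own), one naive approach is to randomly duplicate fraud rows in the training data, with replacement, until the two classes are balanced:

# naive random over-sampling: duplicate fraud rows with replacement
train = X_train.copy()
train['Class'] = y_train.values.ravel()
fraud = train[train.Class == 1]
valid = train[train.Class == 0]
fraud_upsampled = fraud.sample(n=len(valid), replace=True, random_state=1)
train_oversampled = pd.concat([valid, fraud_upsampled])
X_train_over = train_oversampled.drop('Class', axis=1)
y_train_over = train_oversampled.Class

The resulting X_train_over and y_train_over can be passed to RunModel just like the originals. Duplicating rows is crude, but it demonstrates the idea.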
Going back to the logistic regression classifier, let’s see how some under-sampling might improve the overall performance of the model. There are specific techniques, such as SMOTE and ADASYN, designed to strategically sample unbalanced datasets; more on SMOTE below. In our case, let’s under-sample in order to achieve an even split between fraud and valid transactions. It will make the training set pretty small, but the algorithm doesn’t need a lot of data to come up with a good classifier:
fraud_records = len(df[df.Class == 1])

# pull the indices for fraud and valid rows
fraud_indices = df[df.Class == 1].index
normal_indices = df[df.Class == 0].index

# randomly collect an equal-sized sample of valid rows
under_sample_indices = np.random.choice(normal_indices, fraud_records, False)
df_undersampled = df.iloc[np.concatenate([fraud_indices, under_sample_indices]), :]
X_undersampled = df_undersampled.iloc[:, 1:30]
Y_undersampled = df_undersampled.Class
X_undersampled_train, X_undersampled_test, Y_undersampled_train, Y_undersampled_test = \
    train_test_split(X_undersampled, Y_undersampled, test_size=0.3)

lr_undersampled = LogisticRegression(C=1)

# run the new model
cmat, pred = RunModel(lr_undersampled, X_undersampled_train, Y_undersampled_train, \
    X_undersampled_test, Y_undersampled_test)
PrintStats(cmat, Y_undersampled_test, pred)
Now look at the new results:
[[138   1]
 [ 22 135]]
Accuracy: 92.23%
Cohen Kappa: 0.845
Sensitivity/Recall for Model : 0.86
F1 Score for Model : 0.92
The accuracy went down, but all of the other scores went up.
Looking at the confusion matrix, you can see a much higher percentage of
correct classifications of fraudulent data.
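For comparison, the SMOTE technique mentioned earlier synthesizes brand-new minority-class points rather than duplicating or deleting rows. A minimal sketch, assuming the separate imbalanced-learn package is installed (pip install imbalanced-learn):

from imblearn.over_sampling import SMOTE

# synthesize new fraud examples until the classes are balanced
sm = SMOTE(random_state=1)
X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train.values.ravel())

# wrap y in a DataFrame so RunModel's .values.ravel() call still works
y_train_sm = pd.DataFrame({'Class': y_train_sm})
cmat, pred = RunModel(LogisticRegression(), X_train_sm, y_train_sm, X_test, y_test)
PrintStats(cmat, y_test, pred)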
Unfortunately, there is no free lunch. A higher number of fraud classifications almost always means a correspondingly higher number of valid transactions classified as fraudulent. Now try the “new” logistic regression classifier against the original test data:
cmat, pred = RunModel(lr_undersampled, X_undersampled_train, Y_undersampled_train, \
    X_test, y_test)
PrintStats(cmat, y_test, pred)
This time, the results are:
[[83757  1551]
 [   16   119]]
Accuracy: 98.17%
Cohen Kappa: 0.129
Sensitivity/Recall for Model : 0.88
F1 Score for Model : 0.13
The algorithm was far better at catching fraudulent transactions (16 misclassifications versus 57) but far worse at mislabeling valid ones (1551 versus 15).
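One way to reason about that tradeoff is to attach explicit dollar costs to each error type and compare models. The figures below are hypothetical, purely for illustration:

# assumed costs: a missed fraud is far more expensive than a false alarm
COST_FALSE_NEG = 500   # hypothetical average loss per missed fraud
COST_FALSE_POS = 5     # hypothetical handling cost per false alarm

def ModelCost(cmat):
    # rows of cmat are true labels, columns are predictions
    fpos = cmat[0][1]
    fneg = cmat[1][0]
    return fneg * COST_FALSE_NEG + fpos * COST_FALSE_POS

print(ModelCost(np.array([[85293, 15], [57, 78]])))     # original model: 28575
print(ModelCost(np.array([[83757, 1551], [16, 119]])))  # under-sampled model: 15755

Under these assumed costs the under-sampled model comes out cheaper despite its many false alarms; different cost assumptions could easily flip the answer.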
As a data scientist, you have to determine at what point the tradeoff is worth it. Generally, the cost of missing a fraudulent transaction is many times greater than the cost of misclassifying a good transaction as fraud. Your job is to find that balance point in your model training and proceed accordingly.

Source: https://www.accelebrate.com/blog/fraud-detection-using-python/