An important issue confronting retailers and other businesses today is the prevalence of credit card fraud. This issue recently hit home, as my son was a victim a week before I wrote this.
We can apply machine learning to help detect credit card fraud, but there is a catch: the vast majority of transactions are perfectly legitimate, which reduces a typical model’s sensitivity to fraud.
As an example, consider a logistic regression model running against the Credit Card Fraud dataset posted on Kaggle. You can download it here: https://www.kaggle.com/mlg-ulb/creditcardfraud
To follow along you will need an installation of Python with
the following packages:
NumPy
Pandas
scikit-learn
You can get all those packages, and many more, with the Anaconda distribution, which you can find at:
https://www.anaconda.com/download/
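If you would rather not install the full Anaconda distribution, the three packages can also be installed individually with pip (assuming you already have Python and pip available):

pip install numpy pandas scikit-learn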
To begin, start with the necessary imports:

import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, cohen_kappa_score
from sklearn.metrics import f1_score, recall_score
We need NumPy for some basic mathematical functions and Pandas to read in the CSV file and create the data frame. We will use several functions from sklearn.metrics to evaluate the results from our models.
Next, we need to create a couple of helper functions.
PrintStats will compile and display the results from a model. Here is the code:
def PrintStats(cmat, y_test, pred):
    # separate out the confusion matrix components
    # (rows are true labels, columns are predictions)
    tneg = cmat[0][0]
    fpos = cmat[0][1]
    fneg = cmat[1][0]
    tpos = cmat[1][1]
    # calculate F1 and Recall scores
    f1Score = round(f1_score(y_test, pred), 2)
    recallScore = round(recall_score(y_test, pred), 2)
    # calculate and display metrics
    print(cmat)
    print('Accuracy: ' + str(np.round(100 * float(tpos + tneg) / float(tpos + fneg + fpos + tneg), 2)) + '%')
    print('Cohen Kappa: ' + str(np.round(cohen_kappa_score(y_test, pred), 3)))
    print("Sensitivity/Recall for Model : {recall_score}".format(recall_score=recallScore))
    print("F1 Score for Model : {f1_score}".format(f1_score=f1Score))
PrintStats takes as parameters a confusion matrix, test
labels and prediction labels and does the following:
Separates the confusion matrix into its constituent parts.
Calculates the F1, Recall, Accuracy and Cohen Kappa scores.
Prints the confusion matrix and all the calculated scores.
We also need a function, called RunModel, to actually train
the model and generate predictions against the test data. Here is the code:
def RunModel(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train.values.ravel())
    pred = model.predict(X_test)
    matrix = confusion_matrix(y_test, pred)
    return matrix, pred
The RunModel function takes as input the untrained model
along with all the test and training data, including labels. It trains the
model, runs the prediction using the test data, and returns the confusion
matrix along with the predicted labels.
With these two functions created, it’s time to see if we can
create a model to do fraud detection. Fraud detection is generally considered a
two-class problem. In other words, a transaction is either:
Class #1: Not fraud
Or
Class #2: Fraud
Our goal is to try to determine to which class a particular
transaction belongs. Step #1 is to load the CSV data and create the classes.
This code will do the trick:
df = pd.read_csv('../Datasets/creditcard.csv')
class_names = {0: 'Not Fraud', 1: 'Fraud'}
print(df.Class.value_counts().rename(index=class_names))
It generates the following result:
Not Fraud 284315
Fraud 492
Name: Class, dtype: int64
This is a fairly typical fraud dataset: out of nearly 300,000 transactions, only 492 were labelled as fraudulent. That may not seem like much, but each fraudulent transaction represents a significant expense; together, such transactions may represent billions of dollars of lost revenue each year. The imbalance also poses a detection problem: such a small percentage of fraud transactions makes it harder to weed out the offenders from the overwhelming number of good transactions.
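To put a number on the imbalance, Pandas can report the relative class frequencies directly, reusing the df and class_names defined above:

# show each class as a fraction of all transactions
print(df.Class.value_counts(normalize=True).rename(index=class_names))

Fraud works out to well under 1% of all rows (roughly 0.17%).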
Step #2 is to define the features we want to use. Normally,
we want to apply some dimension reduction and feature engineering to our data,
but that is another article (or two). Instead we’ll just use the whole dataset
here with the following code:
feature_names = df.iloc[:, 1:30].columns
target = df.iloc[:, 30:].columns
data_features = df[feature_names]
data_target = df[target]
With the dataset defined, step #3 is to split the data into
training and test sets. To do this, we need to import another function and run
the following code:
from sklearn.model_selection import train_test_split

np.random.seed(123)
X_train, X_test, y_train, y_test = train_test_split(data_features, data_target, \
    train_size=0.70, test_size=0.30, random_state=1)
The train_test_split function uses a randomizer to separate the data into training and test sets: 70% of the data for training and 30% for testing. Setting random_state ensures the same split is used for every run, and the NumPy seed keeps the sampling we do later reproducible.
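As a quick sanity check, it is worth confirming the split sizes and seeing how few fraud cases land in each partition. A minimal sketch:

# confirm the 70/30 split and count the fraud rows in each partition
print(len(X_train), len(X_test))
print(int(y_train.Class.sum()), int(y_test.Class.sum()))

With this seed, the test set holds 85443 rows, only 135 of which are fraud (the 57 + 78 you will see in the confusion matrix below).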
For step #4, we pick a machine learning technique, or model. Perhaps the most common two-class machine learning technique is logistic regression, and we will use it for this first test:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
cmat, pred = RunModel(lr, X_train, y_train, X_test, y_test)
PrintStats(cmat, y_test, pred)
The output from this run should look like this:
[[85293    15]
 [   57    78]]
Accuracy: 99.92%
Cohen Kappa: 0.684
Sensitivity/Recall for Model : 0.58
F1 Score for Model : 0.68
You might initially think the model did a good job. After all, it got 99.92% of its predictions correct. That is true, except that if you look closely at the confusion matrix you will see the following results:
85293 transactions were classified as valid that were actually valid
15 transactions were classified as fraud that were actually valid (type I error)
57 transactions were classified as valid that were actually fraud (type II error)
78 transactions were classified as fraud that were actually fraud
So, while the accuracy was great, the algorithm misclassified more than 4 in 10 fraudulent transactions. In fact, an algorithm that simply classified everything as valid would have an accuracy better than 99.9% and be entirely useless! Accuracy alone, then, is not a reliable measure of a model’s effectiveness. Instead, we look at other measures like the Cohen Kappa, Recall, and F1 scores, where in each case we want a score as close to 1 as we can get.
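You can check the “classify everything as valid” claim directly with scikit-learn’s DummyClassifier. A minimal sketch, reusing the helper functions from earlier:

from sklearn.dummy import DummyClassifier

# a baseline that labels every transaction with the majority class (not fraud)
dummy = DummyClassifier(strategy='most_frequent')
cmat, pred = RunModel(dummy, X_train, y_train, X_test, y_test)
PrintStats(cmat, y_test, pred)

Accuracy comes out above 99.8%, while Recall, F1, and Cohen Kappa all fall to 0.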
Maybe another model will work. How about a RandomForest
classifier? The code is similar to logistic regression:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, n_jobs=4)
cmat, pred = RunModel(rf, X_train, y_train, X_test, y_test)
PrintStats(cmat, y_test, pred)
Trying this classifier will get you results similar to the
following:
[[85297    11]
 [   31   104]]
Accuracy: 99.95%
Cohen Kappa: 0.832
Sensitivity/Recall for Model : 0.77
F1 Score for Model : 0.83
That’s quite a bit better. The accuracy went up only slightly, but the other scores showed significant improvement. So, one way to improve our detection is to try different models and see how they perform. Clearly, changing models helped. But there are other options too. One is over-sampling the fraud records or, conversely, under-sampling the good records. Over-sampling means adding fraud records to our training sample, thereby increasing the overall proportion of fraud records; under-sampling means removing valid records from the sample, which has the same effect. Either way, changing the sampling makes the algorithm more “sensitive” to fraud transactions.
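To illustrate the over-sampling side (this sketch is not part of the original walkthrough, and the variable names are my own), one naive approach is to randomly duplicate fraud rows in the training data, with replacement, until the two classes are balanced:

# naive random over-sampling: duplicate fraud rows with replacement
train = X_train.copy()
train['Class'] = y_train.values.ravel()
fraud = train[train.Class == 1]
valid = train[train.Class == 0]
fraud_upsampled = fraud.sample(n=len(valid), replace=True, random_state=1)
train_oversampled = pd.concat([valid, fraud_upsampled])
X_train_over = train_oversampled.drop('Class', axis=1)
y_train_over = train_oversampled.Class

The resulting X_train_over and y_train_over can be passed to RunModel just like the originals. Duplicating rows is crude, but it demonstrates the idea.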
Going back to the logistic regression classifier, let’s see how some under-sampling might improve the overall performance of the model. There are specific techniques, such as SMOTE and ADASYN, designed to strategically sample unbalanced datasets; more on SMOTE below. In our case, let’s under-sample in order to achieve an even split between fraud and valid transactions. It will make the training set pretty small, but the algorithm doesn’t need a lot of data to come up with a good classifier:
fraud_records = len(df[df.Class == 1])

# pull the indices for fraud and valid rows
fraud_indices = df[df.Class == 1].index
normal_indices = df[df.Class == 0].index

# randomly collect an equal-sized sample of valid rows
under_sample_indices = np.random.choice(normal_indices, fraud_records, False)
df_undersampled = df.iloc[np.concatenate([fraud_indices, under_sample_indices]), :]
X_undersampled = df_undersampled.iloc[:, 1:30]
Y_undersampled = df_undersampled.Class
X_undersampled_train, X_undersampled_test, Y_undersampled_train, Y_undersampled_test = \
    train_test_split(X_undersampled, Y_undersampled, test_size=0.3)

lr_undersampled = LogisticRegression(C=1)

# run the new model
cmat, pred = RunModel(lr_undersampled, X_undersampled_train, Y_undersampled_train, \
    X_undersampled_test, Y_undersampled_test)
PrintStats(cmat, Y_undersampled_test, pred)
Now look at the new results:
[[138   1]
 [ 22 135]]
Accuracy: 92.23%
Cohen Kappa: 0.845
Sensitivity/Recall for Model : 0.86
F1 Score for Model : 0.92
The accuracy went down, but all of the other scores went up.
Looking at the confusion matrix, you can see a much higher percentage of
correct classifications of fraudulent data.
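For comparison, the SMOTE technique mentioned earlier synthesizes brand-new minority-class points rather than duplicating or deleting rows. A minimal sketch, assuming the separate imbalanced-learn package is installed (pip install imbalanced-learn):

from imblearn.over_sampling import SMOTE

# synthesize new fraud examples until the classes are balanced
sm = SMOTE(random_state=1)
X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train.values.ravel())

# wrap y in a DataFrame so RunModel's .values.ravel() call still works
y_train_sm = pd.DataFrame({'Class': y_train_sm})
cmat, pred = RunModel(LogisticRegression(), X_train_sm, y_train_sm, X_test, y_test)
PrintStats(cmat, y_test, pred)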
Unfortunately, there is no free lunch. A higher number of fraud classifications almost always means a correspondingly higher number of valid transactions classified as fraudulent. Now try the “new” logistic regression classifier against the original test data:
cmat, pred = RunModel(lr_undersampled, X_undersampled_train, Y_undersampled_train, \
    X_test, y_test)
PrintStats(cmat, y_test, pred)
This time, the results are:
[[83757  1551]
 [   16   119]]
Accuracy: 98.17%
Cohen Kappa: 0.129
Sensitivity/Recall for Model : 0.88
F1 Score for Model : 0.13
The algorithm was far better at catching fraudulent transactions (16 misclassifications versus 57) but far worse at mislabeling valid ones (1551 versus 15).
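One way to reason about that tradeoff is to attach explicit dollar costs to each error type and compare models. The figures below are hypothetical, purely for illustration:

# assumed costs: a missed fraud is far more expensive than a false alarm
COST_FALSE_NEG = 500   # hypothetical average loss per missed fraud
COST_FALSE_POS = 5     # hypothetical handling cost per false alarm

def ModelCost(cmat):
    # rows of cmat are true labels, columns are predictions
    fpos = cmat[0][1]
    fneg = cmat[1][0]
    return fneg * COST_FALSE_NEG + fpos * COST_FALSE_POS

print(ModelCost(np.array([[85293, 15], [57, 78]])))     # original model: 28575
print(ModelCost(np.array([[83757, 1551], [16, 119]])))  # under-sampled model: 15755

Under these assumed costs the under-sampled model comes out cheaper despite its many false alarms; different cost assumptions could easily flip the answer.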
As a data scientist, you have to determine at what point the tradeoff is worth it. Generally, the cost of missing a fraudulent transaction is many times greater than the cost of misclassifying a good transaction as fraud. Your job is to find that balance point in your model training and proceed accordingly.

Source: https://www.accelebrate.com/blog/fraud-detection-using-python/