Detecting Fraud with Decision Trees and Spark

Apache Spark™ is a unified analytics engine for large-scale data processing.

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Security

Security in Spark is OFF by default. This could mean you are vulnerable to attack by default. Please see Spark Security before downloading and running Spark.

Databricks

Databricks is a company founded by the original creators of Apache Spark.[1] Databricks grew out of the AMPLab project at University of California, Berkeley that was involved in making Apache Spark, an open-source distributed computing framework built atop Scala. Databricks develops a web-based platform for working with Spark, that provides automated cluster management and IPython-style notebooks. In addition to building the Databricks platform, the company is co-organizing massive open online courses about Spark[2] and runs the largest conference about Spark – Spark Summit.

We will use Databricks as our vehicle for large-scale compute and run a project to identify fraud quickly.

Source Data

PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company, the provider of a mobile financial service currently running in more than 14 countries around the world.

This synthetic dataset is scaled down to 1/4 of the original dataset and was created just for Kaggle. To load the dataset yourself, download it from Kaggle to your local machine and bring it in via Import Data, or, if you have cloud budget to burn like the god Zeus, import it via the Azure and AWS links in the notebook.
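The rest of the post works with a DataFrame named df. A minimal loading sketch for Databricks (the DBFS path below is an assumption based on Kaggle's default file name; in a Databricks notebook the spark session is predefined):

# Load the PaySim CSV into a Spark DataFrame; the DBFS path is an assumed example
df = (spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/FileStore/tables/PS_20174392719_1491204439457_log.csv"))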

About the Data

Dictionary

These are the column definitions of the referenced synthetic dataset.

step: maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps: 744 (30-day simulation).
type: CASH-IN, CASH-OUT, DEBIT, PAYMENT, or TRANSFER.
amount: amount of the transaction in local currency.
nameOrig: customer who started the transaction.
oldbalanceOrg: initial balance before the transaction.
newbalanceOrig: new balance after the transaction.
nameDest: customer who is the recipient of the transaction.
oldbalanceDest: initial balance of the recipient before the transaction. Note that there is no information for customers whose names start with M (merchants).
newbalanceDest: new balance of the recipient after the transaction. Note that there is no information for customers whose names start with M (merchants).

We will also build some initial intuition about MLflow along the way.

When you are working with Databricks, you first have to create a cluster to work with the data. The cluster takes care of your entry point into Spark programming.


Here is the underlying code in Scala and what it does:

Create SparkSession

A SparkSession can be created using a builder pattern. Internally, it requires a SparkContext, which is used to create RDDs and manage cluster resources, but you no longer need to provide it explicitly. Under the hood, the builder reuses an existing SparkContext, or creates a new one if none exists.


import org.apache.spark.sql.SparkSession

val spark: SparkSession =
  SparkSession
    .builder()
    .master("local")
    .appName("Spark Example")
    .getOrCreate()

In the Python API it looks like: spark = SparkSession.builder.enableHiveSupport().getOrCreate()

As you are in a Python + Spark environment, you are free to use other languages too. If you want to execute cells in another language, prefix the cell with a % magic command (for example %sql, %scala, or %r) to let the engine know your intention.
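For example, a cell like this runs SQL from an otherwise-Python notebook (the same magic is used again later in this post):

%sql
-- a SQL cell inside a Python notebook
select 1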

The sql function on the Spark session lets you run any SQL query passed as a string, so if you want to run select * from my_table, just call spark.sql("select * from my_table").
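Since spark.sql returns a DataFrame, you can keep working with the result. A small sketch (my_table is the placeholder name from the sentence above):

result_df = spark.sql("select * from my_table")
result_df.show(5)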

To see the schema of the dataset, type df.printSchema(). In our case:

root
 |-- step: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- nameOrig: string (nullable = true)
 |-- oldbalanceOrg: double (nullable = true)
 |-- newbalanceOrig: double (nullable = true)
 |-- nameDest: string (nullable = true)
 |-- oldbalanceDest: double (nullable = true)
 |-- newbalanceDest: double (nullable = true)

If the inferred schema is not correct, you can fix it with casts and renames, using this pattern:

df = (df
  .withColumn('dropoff_datetime', df.dropoff_datetime.cast('timestamp'))
  .withColumn('pickup_datetime', df.pickup_datetime.cast('timestamp'))
  .withColumnRenamed('_id', 'id')
  .withColumnRenamed('_rev', 'rev')
  .withColumnRenamed('dropoff_latitude', 'dropoff_lat')
  .withColumnRenamed('dropoff_longitude', 'dropoff_long')
  .withColumnRenamed('pickup_latitude', 'pickup_lat')
  .withColumnRenamed('pickup_longitude', 'pickup_long'))

(Note that this snippet comes from a different example dataset, not our financials data.)

Now calculate the differences between the originating and destination balances:

df.withColumn("orgDiff", df.newbalanceOrig - df.oldbalanceOrg).withColumn("destDiff", df.newbalanceDest - df.oldbalanceDest)

Note that with pandas it looks like:

df["orgDiff"] = df.newbalanceOrig - df.oldbalanceOrg

Create a temp view so we can explore the data in SQL:

df.createOrReplaceTempView("financials")

Let’s review the data and ask some simple questions

display(df)

step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | orgDiff | destDiff
1 | PAYMENT | 9839.64 | C1231006815 | 170136 | 160296.36 | M1979787155 | 0 | 0 | -9839.64 | 0
1 | PAYMENT | 1864.28 | C1666544295 | 21249 | 19384.72 | M2044282225 | 0 | 0 | -1864.28 | 0
1 | TRANSFER | 181 | C1305486145 | 181 | 0 | C553264065 | 0 | 0 | -181 | 0
1 | CASH_OUT | 181 | C840083671 | 181 | 0 | C38997010 | 21182 | 0 | -181 | -21182
1 | DEBIT | 5337.77 | C712410124 | 41720 | 36382.23 | C195600860 | 41898 | 40348.79 | -5337.77 | -1549.21
... (remaining sample rows omitted)

Showing the first 1000 rows.

Types of Transactions

%sql
-- Organize by Type
select type, count(1) from financials group by type

In pandas this would look like df.groupby("type")["type"].count()

How much money are we talking about (synthetically)?
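One plausible query for that question against the financials view (an assumed query, not from the original notebook):

# Total and average transaction amount by type
display(spark.sql("select type, sum(amount) as total_amount, avg(amount) as avg_amount from financials group by type"))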

As a basic intuition, decision trees work best when we have discrete variables and can split the data at nodes based on decisions. Still, here is a glimpse:

Decision Trees

Decision trees and their ensembles are popular methods for the machine learning tasks of classification and regression. Decision trees are widely used since they are easy to interpret, handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions. Tree ensemble algorithms such as random forests and boosting are among the top performers for classification and regression tasks.

Because of these facets, decision trees often perform well on top of rules-based models and are often a good starting point for fraud detection.

Creating our train and test data

Initially, split the dataset into training and test datasets:

(train, test) = df.randomSplit([0.8, 0.2], seed=12345)

Remember, in the pandas world we would import train_test_split from sklearn.model_selection instead.
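For comparison, a pandas/scikit-learn sketch (pdf here is a hypothetical pandas DataFrame):

from sklearn.model_selection import train_test_split

# 80/20 split, mirroring the randomSplit above
train_pd, test_pd = train_test_split(pdf, test_size=0.2, random_state=12345)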

We will cache the datasets.
When creating an ML model, there are typically a set of repeated steps (e.g. StringIndexer, VectorAssembler, etc.). By creating an ML pipeline, we can reuse the pipeline (and all of its steps) to retrain on a new and/or updated dataset.
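The cache call itself is a one-liner on each split:

# Cache both splits so repeated pipeline runs do not re-read the source data
train.cache()
test.cache()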

For machine learning, use the pyspark.ml library.
VectorAssembler is a transformer that combines a given list of columns into a single vector column. Create our pipeline stages:


from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoderEstimator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

# Encode the string column of transaction types to a column of label indices
indexer = StringIndexer(inputCol = "type", outputCol = "typeIndexed")

# Combine the feature columns into a single vector column
va = VectorAssembler(inputCols = ["typeIndexed", "amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest", "orgDiff", "destDiff"], outputCol = "features")

# Decision tree classifier with a capped depth
dt = DecisionTreeClassifier(labelCol = "label", featuresCol = "features", seed = 54321, maxDepth = 5)

pipeline = Pipeline(stages=[indexer, va, dt])
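The fit step is not shown in this post. A minimal sketch, assuming we derive the label column the classifier expects from the Kaggle data's isFraud flag (present in the dataset but not listed in the dictionary above):

# Assumption: derive the expected label column from the Kaggle isFraud flag
train = train.withColumn("label", train["isFraud"].cast("double"))
test = test.withColumn("label", test["isFraud"].cast("double"))

# Fit on the training split, then score the held-out test split
dt_model = pipeline.fit(train)
predictions = dt_model.transform(test)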

Our resulting tree:
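To inspect the learned tree in text form (a sketch; assumes dt_model from the fit sketch above):

# Print the structure of the trained decision tree (last stage of the pipeline)
print(dt_model.stages[-1].toDebugString)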

Our confusion matrix:
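The matrix can be reproduced as a simple aggregation over the predictions (a sketch; assumes predictions from the fit sketch above):

# Counts of (label, prediction) pairs form the confusion matrix
display(predictions.groupBy("label", "prediction").count())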

Now we can use a fully balanced approach to address the heavy class imbalance. Remember, we have already created our data pipeline, so there is no need to change that code; we simply retrain on the rebalanced data.
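One way to rebalance (an assumed approach: undersampling the majority class) while reusing the pipeline unchanged:

# Undersample the non-fraud majority class down to roughly the fraud class size (assumed approach)
fraud = train.filter("label = 1")
non_fraud = train.filter("label = 0")
ratio = fraud.count() / non_fraud.count()

balanced_train = fraud.union(non_fraud.sample(False, ratio, seed=12345))

# The pipeline itself is unchanged; just retrain on the rebalanced data
balanced_model = pipeline.fit(balanced_train)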

GitHub code: full.

Copy and paste the URL into your browser to read through the code; to play with it, import it into your Databricks account. Hit the like, subscribe, and donate buttons (the last one costs you, so don't) if you learned something new.
