Here is a very cool solution for detecting fraud cases in which the same name or the same address is used, in slightly different forms, to create what look like different entries.

While doing risk consulting and due diligence, I've come across this problem whenever we have to check for:

- Genuine data vs dummy data
- An employee who is also registered as a vendor
- A relative of an employee who is registered as a vendor
- The same company/vendor appearing multiple times in the data, etc.

The R code below is a straightforward way to detect such cases. I've used my friends' names as dummy names for this example. I hope they won't sue me for this 😛 or raise any copyright issue. 🙂

There are many distance measures that we can use to find the distance between two names, addresses, or other strings, such as Hamming distance, cosine distance, and Soundex. Soundex stands apart from the rest because it works on phonetic similarity: in the stringdist package it returns a distance of 0 (zero) if the two words are pronounced alike, i.e. they share a Soundex code, and 1 otherwise.

To be clear, a distance of 0 (zero) is the best case: the closest possible match and, for most methods, an exact match. A higher value means the two strings are further apart.
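To make the different scales concrete, here is a minimal sketch that compares a single pair of names under a few of these methods. It assumes the stringdist package is already installed; the variable names `a` and `b` are only for illustration.

```r
library(stringdist)

a <- "A K Bharti"
b <- "A. K. Bharti"

# Edit-based distances count character operations, so they are whole numbers.
stringdist(a, b, method = "lv")       # 2: the two dots have to be inserted

# q-gram based distances compare character profiles; cosine lies between 0 and 1.
stringdist(a, b, method = "cosine")

# Soundex is binary here: 0 if both strings share a Soundex code, 1 otherwise.
stringdist(a, b, method = "soundex")  # 0

# Hamming distance is only defined for strings of equal length.
stringdist(a, b, method = "hamming")  # Inf

# A string compared with itself is at distance 0.
stringdist(a, a, method = "jw")       # 0
```

Because the scales differ so much (counts, values between 0 and 1, a binary flag), a cut-off has to be chosen per method rather than once for all of them.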

    # Install stringdist if it is missing, then load it
    list.of.packages <- c("stringdist")
    new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
    repo <- 'http://nbcgib.uesc.br/mirrors/cran/'
    if (length(new.packages)) install.packages(new.packages, repos = repo)
    lapply(list.of.packages, require, character.only = TRUE)

    # Dummy names used for the example
    x <- c("R Kewlani", "Rohan Kewlani", "Aman Preet", "Man Pret", "Sum", "Sumit", "Sumi",
           "Rashid Khan", "R. Khan", "Ram", "Rashid Khn", "Tej Pratap Singh", "T P Singh",
           "Amit Kumar Bharti", "A K Bharti", "A. K. Bharti", "Amit Bharti",
           "Ashish Chaurasia", "Ashis Chau", "Asis Chauras", "M. Tauheed", "Md. Tauheed",
           "Muhammad Tauheed", "M.R. Khan", "Sudhan Agrahari", "Sudhanshu Agrahari")

    df1 <- data.frame(seqid = seq_along(x), name = x)

    # Build every (n1, n2) combination of names.
    # stringsAsFactors = TRUE is required for levels() to work: since R 4.0,
    # data.frame() no longer converts strings to factors by default.
    dfr <- data.frame(n1 = df1$name, n2 = df1$name, stringsAsFactors = TRUE)
    ndf <- expand.grid(lapply(dfr, levels), stringsAsFactors = FALSE)
    ndf <- ndf[order(ndf$n1), ]

    # Add one distance column per method
    method_list <- c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex")
    for (i in method_list) {
      ndf[, i] <- stringdist(ndf$n1, ndf$n2, method = i)
    }

    ## Hypothetically assumed threshold to predict the suspicious match cases
    ## Cosine score < 0.20 and qgram score < 10
    ## To avoid exact match from the dataset; Remove cosine score = 0.00
    suspicious_match <- ndf[ndf$cosine < 0.20 & ndf$cosine != 0 & ndf$qgram < 10, ]
    suspicious_match <- suspicious_match[order(suspicious_match$n1, suspicious_match$cosine), ]
    head(suspicious_match)

For this particular example, I found that an ensemble of the cosine and qgram scores is a suitable approach, so that is what I used, with cut-off values of 0.20 for cosine and 10 for qgram. Any pair that scores below both cut-offs is treated as a suspicious entry; exact matches (cosine score of 0) are excluded.
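The same cut-off rule can be wrapped in a small function so the thresholds are easy to change. This is only a sketch: `flag_suspicious` and its arguments are hypothetical names, not part of stringdist. Unlike the code in this article, it builds each unordered pair of distinct names once with `combn()`, so a pair is not reported twice as A/B and B/A, and it needs at least two distinct names.

```r
library(stringdist)

# Hypothetical helper: flag each pair of distinct names whose cosine and
# q-gram distances both fall below the chosen cut-offs.
flag_suspicious <- function(names, cosine_max = 0.20, qgram_max = 10) {
  nm <- sort(unique(names))
  pairs <- t(combn(nm, 2))  # one row per unordered pair, never a name with itself
  out <- data.frame(n1 = pairs[, 1], n2 = pairs[, 2], stringsAsFactors = FALSE)
  out$cosine <- stringdist(out$n1, out$n2, method = "cosine")
  out$qgram  <- stringdist(out$n1, out$n2, method = "qgram")
  hits <- out[out$cosine < cosine_max & out$qgram < qgram_max, ]
  hits[order(hits$n1, hits$cosine), ]
}

res <- flag_suspicious(c("A K Bharti", "A. K. Bharti", "Amit Bharti", "Ram"))
print(res)
```

Since a name is never paired with itself, there is no need to filter out cosine scores of 0. Distinct names with identical character counts (anagrams, for instance) would still be flagged, which is probably what you want in a fraud check.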

See the output:

    ## Output
    ## Do not copy paste
    > head(suspicious_match)
                  n1                n2 osa lv dl hamming lcs qgram     cosine   jaccard         jw soundex
    25    A K Bharti      A. K. Bharti   2  2  2     Inf   2     2 0.13397460 0.1000000 0.05555556       0
    97    A K Bharti Amit Kumar Bharti   7  7  7     Inf   7     7 0.14230997 0.1818182 0.27058824       1
    73    A K Bharti       Amit Bharti   3  3  3     Inf   5     5 0.18010841 0.2000000 0.15757576       1
    2   A. K. Bharti        A K Bharti   2  2  2     Inf   2     2 0.13397460 0.1000000 0.05555556       0
    243   Aman Preet          Man Pret   3  3  3     Inf   4     4 0.18350342 0.3000000 0.14166667       1
    100  Amit Bharti Amit Kumar Bharti   6  6  6     Inf   6     6 0.08901973 0.1818182 0.17825312       1

Excel Output:

You can adapt the code and the choice of methods to your business needs. I hope this article helps you.
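As one example of such tuning, `stringdist()` accepts a `q` argument (the q-gram size used by the qgram, cosine and jaccard methods) and a `p` argument (the Winkler prefix weight for the jw method). The sketch below only illustrates these two knobs on one pair of names from the example; which values suit your data is something to test, not something this snippet decides.

```r
library(stringdist)

a <- "Rashid Khan"
b <- "Rashid Khn"

# q = 1 (the default) compares single characters; q = 2 compares character pairs,
# which makes the score sensitive to character order as well.
qgram_1 <- stringdist(a, b, method = "qgram", q = 1)
qgram_2 <- stringdist(a, b, method = "qgram", q = 2)

# p = 0 (the default) gives plain Jaro distance; p > 0 gives Jaro-Winkler,
# which lowers the distance for strings that share a prefix.
jaro         <- stringdist(a, b, method = "jw", p = 0)
jaro_winkler <- stringdist(a, b, method = "jw", p = 0.1)

print(c(qgram_1 = qgram_1, qgram_2 = qgram_2, jaro = jaro, jaro_winkler = jaro_winkler))
```

If you change `q` or `p`, the cut-off values need to be revisited too, because the scores move with them.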

Very informative article! R is a great programming language to learn for Data Science and this is a useful concept to learn. Thanks for sharing this informative article.

Thanks

Hi Zia,

Can you explain it for two different data sets with different dimensions (n1 = 1000000, n2 = 10000000)?

If you can, then please...

nice introduction, thank you.

However:

This throws an error:

dfr <- data.frame(n1=df1$name,n2=df1$name)

I made it into a factor:

dfr <- data.frame(n1=as.factor(df1$name),n2=as.factor(df1$name))

and then it worked.

Thank you so much for the article!!!!!