R Fuzzy String Match

Here is a very cool solution for detecting fraud cases in which the same name or the same address is reused, with small variations, to pass as a different entry.

While doing risk consulting and due diligence work, I have often run into problems where we have to check for:

  1. Genuine data vs dummy data
  2. An employee is also involved as a vendor
  3. A relative of an employee is involved as a vendor
  4. The same company/vendor has multiple instances in the data, etc.

The R code below is a straightforward way to detect such cases. I've used my friends' names as dummy names for this example. I hope they won't sue me for this 😛 or raise any copyright issue. 🙂

There are many distance measures that we can use to find the distance between two names, addresses or other strings, such as Hamming distance, cosine distance and Soundex. Soundex stands out from the rest because it works on phonetic distance, i.e., if two words are pronounced alike (more precisely, if they get the same Soundex code), it returns a distance of "0" (zero), and 1 otherwise.
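As a quick illustration of the phonetic idea, here is a minimal sketch using `phonetic()` and `stringdist()` from the stringdist package. The names "Sumeet" and "Robert" are made-up inputs for this illustration and are not part of the dataset used later.

```r
library(stringdist)

# Soundex code of each name; names that sound alike share a code
phonetic(c("Sumit", "Sumeet", "Robert"), method = "soundex")

# Soundex distance is 0 when the two codes match and 1 when they differ
same_sound <- stringdist("Sumit", "Sumeet", method = "soundex")
diff_sound <- stringdist("Sumit", "Robert", method = "soundex")
```

Because the result is only ever 0 or 1, Soundex is probably more useful as a yes/no flag next to the other scores than as a ranking on its own.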

To be clear, a distance of "0" (zero) is the best case: for the edit-based measures it means an exact match. Higher values mean the two strings are further apart.
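To see how the distance grows as strings drift apart, here is a small sketch with the Levenshtein method (`"lv"`). The target and candidates are picked by hand from the dummy names in this post.

```r
library(stringdist)

target <- "Rashid Khan"
candidates <- c("Rashid Khan", "Rashid Khn", "R. Khan", "Sumit")

# Levenshtein distance: number of single-character edits between two strings
d <- stringdist(target, candidates, method = "lv")
data.frame(candidate = candidates, lv = d)
```

The exact copy scores 0, the one-letter typo scores 1, and the abbreviated and unrelated names score higher.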

# Install stringdist if it is missing, then load it
list.of.packages <- c("stringdist")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
repo <- 'http://nbcgib.uesc.br/mirrors/cran/'
if (length(new.packages)) install.packages(new.packages, repos = repo)
invisible(lapply(list.of.packages, require, character.only = TRUE))


x <- c("R Kewlani","Rohan Kewlani", "Aman Preet", "Man Pret", "Sum", "Sumit", "Sumi", "Rashid Khan", "R. Khan", "Ram", "Rashid Khn", "Tej Pratap Singh", "T P Singh", "Amit Kumar Bharti", "A K Bharti", "A. K. Bharti", "Amit Bharti", "Ashish Chaurasia", "Ashis Chau", "Asis Chauras", "M. Tauheed", "Md. Tauheed", "Muhammad Tauheed", "M.R. Khan", "Sudhan Agrahari", "Sudhanshu Agrahari")
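df1 <- data.frame(seqid = seq_along(x), name = x)
# Since R 4.0, data.frame() no longer converts strings to factors by default,
# and levels() returns NULL for character columns, so convert explicitly
dfr <- data.frame(n1 = as.factor(df1$name), n2 = as.factor(df1$name))
# All pairwise combinations of the distinct names
ndf <- expand.grid(lapply(dfr, levels))
ndf <- ndf[order(ndf$n1), ]
method_list <- c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex")
# Add one column of distances per method
for (i in method_list) {
  ndf[, i] <- stringdist(as.character(ndf$n1), as.character(ndf$n2), method = i)
}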

## Assumed (hypothetical) thresholds for flagging suspicious matches:
## cosine score < 0.20 and qgram score < 10
## Exclude exact matches (each name paired with itself) by removing cosine score = 0
suspicious_match <- ndf[ndf$cosine < 0.20 & ndf$cosine != 0 & ndf$qgram < 10, ]
suspicious_match <- suspicious_match[order(suspicious_match$n1,suspicious_match$cosine),]
head(suspicious_match)

For this particular example I found that combining the cosine and qgram scores works well, so I used both, with cut-off values of 0.20 for cosine and 10 for qgram. Every pair that scores below both cut-offs is treated as a suspicious entry.
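The cut-off rule can be wrapped in a small function so the thresholds are easy to change. This is only a sketch: `flag_suspicious` is a hypothetical helper name, and the toy table reuses three cosine/qgram values from this post's output plus two made-up rows (an exact match and an unrelated pair).

```r
# Toy score table: rows 1-3 reuse values from this post's output;
# rows 4-5 (an exact match and an unrelated pair) are made up
scores <- data.frame(
  n1     = c("A K Bharti", "A K Bharti", "Aman Preet", "Sumit", "Sumit"),
  n2     = c("A. K. Bharti", "Amit Bharti", "Man Pret", "Sumit", "Rashid Khan"),
  cosine = c(0.13397460, 0.18010841, 0.18350342, 0, 0.80),
  qgram  = c(2, 5, 4, 0, 14)
)

# Hypothetical helper: keep pairs under both cut-offs, excluding exact matches
flag_suspicious <- function(scores, cosine_cut = 0.20, qgram_cut = 10) {
  keep <- scores$cosine < cosine_cut &
    scores$cosine != 0 &
    scores$qgram < qgram_cut
  scores[keep, ]
}

flagged <- flag_suspicious(scores)                     # default cut-offs
strict  <- flag_suspicious(scores, cosine_cut = 0.15)  # tighter cosine cut-off
```

With the default cut-offs the three near-duplicate pairs are flagged. Tightening the cosine cut-off to 0.15 leaves only the "A K Bharti" / "A. K. Bharti" pair.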

See the output:

## Output
## Do not copy paste
> head(suspicious_match)
              n1                n2 osa lv dl hamming lcs qgram     cosine   jaccard         jw
25    A K Bharti      A. K. Bharti   2  2  2     Inf   2     2 0.13397460 0.1000000 0.05555556
97    A K Bharti Amit Kumar Bharti   7  7  7     Inf   7     7 0.14230997 0.1818182 0.27058824
73    A K Bharti       Amit Bharti   3  3  3     Inf   5     5 0.18010841 0.2000000 0.15757576
2   A. K. Bharti        A K Bharti   2  2  2     Inf   2     2 0.13397460 0.1000000 0.05555556
243   Aman Preet          Man Pret   3  3  3     Inf   4     4 0.18350342 0.3000000 0.14166667
100  Amit Bharti Amit Kumar Bharti   6  6  6     Inf   6     6 0.08901973 0.1818182 0.17825312
    soundex
25        0
97        1
73        1
2         0
243       1
100       1
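Note that each pair shows up twice in this output, once in each direction (rows 25 and 2 are the same two names). One possible way to keep a single row per pair, sketched here on a small hand-made table rather than the full `ndf`, is to keep only the rows where `n1` sorts before `n2`.

```r
# Hand-made stand-in for the pair table: every pair appears in both directions
pairs <- data.frame(
  n1 = c("A K Bharti", "A. K. Bharti", "Aman Preet", "Man Pret"),
  n2 = c("A. K. Bharti", "A K Bharti", "Man Pret", "Aman Preet"),
  stringsAsFactors = FALSE
)

# Keep each unordered pair once: the row where n1 sorts before n2
unique_pairs <- pairs[as.character(pairs$n1) < as.character(pairs$n2), ]
```

The same filter also drops the self-pairs, since a name never sorts before itself.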

[Image: Excel output]
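How the Excel file was produced is not shown in this post. One common option is to export the result as a CSV file, which Excel opens directly. A minimal sketch, assuming a small stand-in for `suspicious_match` and a hypothetical file name:

```r
# Stand-in for the suspicious_match data frame built earlier in the post
suspicious_match <- data.frame(
  n1     = c("A K Bharti", "Aman Preet"),
  n2     = c("A. K. Bharti", "Man Pret"),
  cosine = c(0.13397460, 0.18350342)
)

# Hypothetical file name; a temporary directory is used so nothing is overwritten
out_file <- file.path(tempdir(), "suspicious_match.csv")
write.csv(suspicious_match, out_file, row.names = FALSE)

# Read the file back to confirm what was written
check <- read.csv(out_file)
```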

You can adapt the code and the distance method to your business needs. I hope this article helps.
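As one example of adapting the approach, the sketch below makes the distance method and the cut-off into function arguments. `close_pairs` is a hypothetical helper name, and the defaults (`"jw"`, 0.15) are arbitrary starting points, not tuned values.

```r
library(stringdist)

# Hypothetical helper: all ordered pairs of distinct names whose distance
# under `method` is below `cutoff`
close_pairs <- function(names, method = "jw", cutoff = 0.15) {
  names <- unique(as.character(names))
  pairs <- expand.grid(n1 = names, n2 = names, stringsAsFactors = FALSE)
  pairs <- pairs[pairs$n1 != pairs$n2, ]          # drop self-pairs
  pairs$dist <- stringdist(pairs$n1, pairs$n2, method = method)
  pairs <- pairs[pairs$dist < cutoff, ]
  pairs[order(pairs$n1, pairs$dist), ]
}

# Usage: flag names that are a single edit apart
res <- close_pairs(c("Rashid Khan", "Rashid Khn", "Sumit"),
                   method = "lv", cutoff = 2)
```

Here only "Rashid Khan" and "Rashid Khn" are returned (once in each direction), since they differ by one deleted letter.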

 

5 thoughts on “R Fuzzy String Match”

  1. Very informative article! R is a great programming language to learn for Data Science and this is a useful concept to learn. Thanks for sharing this informative article.

  2. nice introduction, thank you.
    However:
    This throws an error:
    dfr <- data.frame(n1=df1$name,n2=df1$name)
    I made it into a factor:
    dfr <- data.frame(n1=as.factor(df1$name),n2=as.factor(df1$name))
    and then it worked.
