R Fuzzy String Match

Here is a neat way to detect fraud cases in which the same name or the same address is reused to create what look like different entries.

While doing risk consulting and due diligence, I have often had to check for:

  1. Genuine data vs. dummy data
  2. An employee who is also registered as a vendor
  3. A relative of an employee who is registered as a vendor
  4. The same company or vendor appearing more than once in the data, etc.

The R code below is a straightforward way to detect such cases. I've used my friends' names as dummy names for this example. I hope they won't sue me for this 😛 or raise any copyright issue. 🙂

There are many distance measures we can use to find the distance between two names, addresses, or other strings.

Examples include Hamming distance, cosine distance, and Soundex. Soundex stands out from the rest because it works on phonetic similarity: if two words are pronounced alike, they get the same Soundex code and the distance is "0" (zero).

To be clear, a distance of "0" (zero) is the best case, i.e. an exact match, while a higher value means the two strings are further apart.
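To make the "zero means exact match" idea concrete, here is a minimal base R sketch of Hamming distance, the simplest of the measures mentioned above. The function name `hamming_distance` and the vendor strings are made up for illustration, and returning `Inf` for strings of unequal length is a convention I am assuming here, not something taken from the original code.

```r
# Hamming distance: the number of positions at which two strings of
# equal length differ. Hypothetical helper, for illustration only.
hamming_distance <- function(a, b) {
  if (nchar(a) != nchar(b)) {
    return(Inf)  # assumed convention: undefined for unequal lengths
  }
  chars_a <- strsplit(a, "")[[1]]
  chars_b <- strsplit(b, "")[[1]]
  sum(chars_a != chars_b)
}

hamming_distance("ACME LTD", "ACME LTD")  # identical strings: 0
hamming_distance("ACME LTD", "ACNE LTD")  # one changed letter: 1
```

Because Hamming distance only handles substitutions in strings of the same length, it is rarely enough on its own for names and addresses, which is one reason to look at q-gram based measures as well.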

For this particular example, I found that an ensemble of the cosine and q-gram scores was a suitable approach, so that is what I used. I set the cut-off values to 0.20 for cosine and 10 for q-gram. Every pair scoring below the cut-off is treated as a suspicious entry.
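The cut-off rule can be sketched in base R roughly as follows. This is not the original code: the helper names, the vendor names, and the choice of q = 2 are all assumptions for illustration, and I am assuming a pair is flagged only when both scores fall below their cut-offs. The two distances follow the usual q-gram profile definitions (sum of absolute count differences, and one minus the cosine similarity of the count vectors).

```r
# Split a string into overlapping substrings of length q (q-grams).
qgrams <- function(s, q = 2) {
  n <- nchar(s)
  if (n < q) return(character(0))
  substring(s, 1:(n - q + 1), q:n)
}

# Count q-grams of both strings over their shared set of q-grams.
qgram_profiles <- function(a, b, q = 2) {
  ga <- qgrams(a, q)
  gb <- qgrams(b, q)
  keys <- union(ga, gb)
  list(a = tabulate(match(ga, keys), nbins = length(keys)),
       b = tabulate(match(gb, keys), nbins = length(keys)))
}

# q-gram distance: sum of absolute differences between the counts.
qgram_distance <- function(a, b, q = 2) {
  p <- qgram_profiles(a, b, q)
  sum(abs(p$a - p$b))
}

# Cosine distance: 1 minus the cosine similarity of the count vectors.
cosine_distance <- function(a, b, q = 2) {
  p <- qgram_profiles(a, b, q)
  1 - sum(p$a * p$b) / sqrt(sum(p$a^2) * sum(p$b^2))
}

# Made-up vendor names; the first two differ by a single letter.
vendors <- c("Rajesh Kumar Traders", "Rajesh Kumar Trader",
             "Sunita Enterprises", "Blue Ocean Logistics")

# Score every unordered pair of names, ignoring letter case.
pairs <- t(combn(vendors, 2))
scores <- data.frame(name_1 = pairs[, 1], name_2 = pairs[, 2],
                     stringsAsFactors = FALSE)
scores$cosine <- mapply(function(x, y) cosine_distance(tolower(x), tolower(y)),
                        scores$name_1, scores$name_2, USE.NAMES = FALSE)
scores$qgram <- mapply(function(x, y) qgram_distance(tolower(x), tolower(y)),
                       scores$name_1, scores$name_2, USE.NAMES = FALSE)

# Assumed ensemble rule: flag a pair only if BOTH scores are below the cut-offs.
scores$suspicious <- scores$cosine < 0.20 & scores$qgram < 10

print(scores[scores$suspicious, ])
```

With these sample names, only the "Rajesh Kumar Traders" / "Rajesh Kumar Trader" pair is flagged. On real data, the `stringdist` package offers `method = "cosine"`, `method = "qgram"`, and `method = "soundex"`, which would likely be more convenient than hand-written helpers.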

See the output:

Excel Output:


You can adapt the code and the method to your business needs. I hope this article helps.
