Market Basket Analysis in R

basketMarket Basket Analysis (MBA) or Affinity Analysis or Association Analysis is an analysis to understand combination and sequence of activities. Originally linked to understanding products bought together in a basket by a customer.

Detailed description on Market Basket Analysis

Install Required Packages

For Market Basket or Association Analysis, arules and arulesViz have to be installed and loaded

# Install and Load the libraries
#install.packages("arules")
#install.packages("arulesViz")

# Load Libraries
library(arules)
library(arulesViz) 

Read Data for Market Basket Analysis

Belgium Grocery store data is  used as sample data for market basket analysis in this blog.

## Belgium Grocery store data

fc <- file("http://fimi.ua.ac.be/data/retail.dat")
mylist <- strsplit(readLines(fc), " ")
close(fc)

head(mylist)
## [[1]]
##  [1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13"
## [15] "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27"
## [29] "28" "29"
## 
## [[2]]
## [1] "30" "31" "32"
## 
## [[3]]
## [1] "33" "34" "35"
## 
## [[4]]
##  [1] "36" "37" "38" "39" "40" "41" "42" "43" "44" "45" "46"
## 
## [[5]]
## [1] "38" "39" "47" "48"
## 
## [[6]]
##  [1] "38" "39" "48" "49" "50" "51" "52" "53" "54" "55" "56" "57" "58"

Exploratory Data Analysis

One of the first step is to check the basic information about the data. This will help us in addressing any issues with the data creation or reading. And also help in building our understanding of the data

# Number of orders
noOrder <- length(mylist)
noOrder


## [1] 88162

Counting number of items in each of these orders.

prdCount <-sapply(mylist, length)
# max number of items in an order/visit
max(prdCount)
## [1] 76

# Group Number products in 10 
grp <- cut(prdCount, breaks=c(0,1,5,10,15,20,30,40,80)) 

class(grp)

## [1] "factor"
t <-table(grp)
xl <-c("0-1","2-5","6-10","11-15","16-20","21-30","31-40","41-High")
# Bar Chart on Product count in a basket
barplot(t, xlab="Product Counts",
        ylab="# of Orders",
        main="Number of Products in each Order",
        names.arg=xl,
        col = "blue",
        border="white")


## table with % of order 

prop.table(t)*100

 

Convert data into “transaction” class before using Association Analysis functions.

## Convert into "transaction" class
rtrans <- as(mylist, "transactions")

## get frequency
freq <-itemFrequency(rtrans,type="absolute")
freq <-sort(freq,decreasing = T)
freq[1:20]

 

##    39    48    38    32    41    65    89   225   170   237    36   110 
## 50675 42135 15596 15167 14945  4472  3837  3257  3099  3032  2936  2794 
##   310   101   475   271   413   438  1327   147 
##  2594  2237  2167  2094  1880  1863  1786  1779
# Frequency plot
itemFrequencyPlot(rtrans,
                  topN=20,
                  type="absolute",
                  xlab="Products",
                  ylab="Frequency of Product Sale",
                  main="Sale Frequency of Each Product",
                  col="red",border="white")

 

Association or Market basket Analysis Rules

bel.rules <- apriori(rtrans, parameter = list(supp = 0.001, conf = 0.8))
## 
## parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen
##         0.8    0.1    1 none FALSE            TRUE   0.001      1     10
##  target   ext
##   rules FALSE
## 
## algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[16470 item(s), 88162 transaction(s)] done [0.12s].
## sorting and recoding items ... [2117 item(s)] done [0.01s].
## creating transaction tree ... done [0.05s].
## checking subsets of size 1 2 3 4 5 6 done [0.13s].
## writing ... [711 rule(s)] done [0.01s].
## creating S4 object  ... done [0.01s].
# number of rules
length(bel.rules)
## [1] 711
inspect(bel.rules[1:5])
##   lhs       rhs   support confidence  lift
## 1 {3854} => {38} 0.001066     0.9126 5.159
## 2 {1045} => {32} 0.001100     0.9065 5.270
## 3 {4030} => {48} 0.001021     0.8257 1.728
## 4 {1473} => {39} 0.001225     0.8000 1.392
## 5 {1727} => {38} 0.001838     0.9310 5.263
summary(bel.rules)
## set of 711 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4   5 
##  42 313 300  56 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    3.00    4.00    3.52    4.00    5.00 
## 
## summary of quality measures:
##     support          confidence         lift      
##  Min.   :0.00101   Min.   :0.800   Min.   :  1.4  
##  1st Qu.:0.00119   1st Qu.:0.823   1st Qu.:  1.4  
##  Median :0.00151   Median :0.853   Median :  1.6  
##  Mean   :0.00256   Mean   :0.882   Mean   :  9.6  
##  3rd Qu.:0.00230   3rd Qu.:0.964   3rd Qu.:  5.5  
##  Max.   :0.08355   Max.   :1.000   Max.   :318.1  
## 
## mining info:
##    data ntransactions support confidence
##  rtrans         88162   0.001        0.8

Association Rules build by apriori algorithm can be selected using support, lift or confidence.

 

#------------- MBA: Select Rules-----------------------
# support
s <- sort(bel.rules, by="support", decreasing=TRUE)
inspect(s[1:10])
##    lhs      rhs  support confidence  lift
## 1  {41,                                  
##     48}  => {39} 0.08355     0.8168 1.421
## 2  {170} => {38} 0.03438     0.9781 5.529
## 3  {36}  => {38} 0.03165     0.9503 5.372
## 4  {110} => {38} 0.03091     0.9753 5.513
## 5  {170,                                 
##     39}  => {38} 0.02290     0.9806 5.543
## 6  {38,                                  
##     41,                                  
##     48}  => {39} 0.02258     0.8387 1.459
## 7  {36,                                  
##     39}  => {38} 0.02206     0.9548 5.398
## 8  {110,                                 
##     39}  => {38} 0.01974     0.9892 5.592
## 9  {170,                                 
##     48}  => {38} 0.01745     0.9878 5.584
## 10 {225,                                 
##     48}  => {39} 0.01588     0.8065 1.403
# confidence
c <- sort(bel.rules, by="confidence", decreasing=TRUE)
inspect(c[1:10])

 

##    lhs      rhs   support confidence  lift
## 1  {32,                                   
##     840} => {38} 0.001032          1 5.653
## 2  {32,                                   
##     371} => {38} 0.001372          1 5.653
## 3  {170,                                  
##     438} => {38} 0.001168          1 5.653
## 4  {310,                                  
##     36}  => {38} 0.001044          1 5.653
## 5  {170,                                  
##     225} => {38} 0.001463          1 5.653
## 6  {32,                                   
##     47,                                   
##     48}  => {38} 0.001214          1 5.653
## 7  {371,                                  
##     41,                                   
##     48}  => {38} 0.001146          1 5.653
## 8  {32,                                   
##     37,                                   
##     48}  => {38} 0.001384          1 5.653
## 9  {32,                                   
##     37,                                   
##     39}  => {38} 0.001554          1 5.653
## 10 {37,                                   
##     41,                                   
##     48}  => {38} 0.001951          1 5.653
# lift

l <- sort(bel.rules, by="lift", decreasing=TRUE)
inspect(l[1:10])

 

##    lhs        rhs      support confidence  lift
## 1  {1818,                                      
##     3311,                                      
##     795}   => {1819}  0.001089     0.9057 318.1
## 2  {1818,                                      
##     1819,                                      
##     795}   => {3311}  0.001089     0.8276 302.7
## 3  {3311,                                      
##     795}   => {1819}  0.001407     0.8435 296.3
## 4  {1818,                                      
##     1819,                                      
##     3311}  => {795}   0.001089     0.8421 295.8
## 5  {1818,                                      
##     3311}  => {1819}  0.001293     0.8143 286.0
## 6  {1818,                                      
##     1819}  => {795}   0.001316     0.8000 281.0
## 7  {1080,                                      
##     1378}  => {1379}  0.001078     0.8120 252.9
## 8  {1379,                                      
##     1380}  => {309}   0.001044     0.8214 234.4
## 9  {16430,                                     
##     41}    => {16431} 0.001180     0.8595 215.9
## 10 {16430,                                     
##     39,                                        
##     48}    => {16431} 0.001202     0.8548 214.7

 


Product Combinations

## First product/combination selected, what is second product

sel.rules<-apriori(data=rtrans, parameter=list(supp=0.001,conf = 0.15,minlen=2), 
               appearance = list(default="rhs",lhs="38"),
               control = list(verbose=F))
sel.rules<-sort(sel.rules, decreasing=TRUE,by="confidence")
inspect(sel.rules[1:5])
##   lhs     rhs   support confidence  lift
## 1 {38} => {39}  0.11734     0.6633 1.154
## 2 {38} => {48}  0.09011     0.5094 1.066
## 3 {38} => {41}  0.04420     0.2499 1.474
## 4 {38} => {170} 0.03438     0.1943 5.529
## 5 {38} => {32}  0.03213     0.1816 1.056
## Second product/combination selected when first product select
rhs.rules<-apriori(data=rtrans, parameter=list(supp=0.001,conf = 0.15,minlen=2), 
                   appearance = list(default="lhs",rhs="38"),
                   control = list(verbose=F))
sel.rules<-sort(rhs.rules, decreasing=TRUE,by="confidence")
inspect(rhs.rules[1:5])
##   lhs       rhs   support confidence  lift
## 1 {3854} => {38} 0.001066     0.9126 5.159
## 2 {1727} => {38} 0.001838     0.9310 5.263
## 3 {3005} => {38} 0.002212     0.9512 5.377
## 4 {504}  => {38} 0.002779     0.8221 4.647
## 5 {2805} => {38} 0.002405     0.9550 5.398

 

1 thought on “Market Basket Analysis in R”

Leave a Comment