# Market Basket Analysis with R

## Association Rules

There are many ways to see the similarities between items. These are techniques that fall under the general umbrella of association. The outcome of this type of technique, in simple terms, is a set of rules that can be understood as “if this, then that”.

## Applications

So what kind of items are we talking about?
There are many applications of association:

• Product recommendation – like Amazon’s “customers who bought that, also bought this”
• Music recommendations – like Last FM’s artist recommendations
• Medical diagnosis – for example, finding conditions that tend to occur alongside diabetes
• Content optimisation – like in magazine websites or blogs

In this post we will focus on the retail application – it is simple and intuitive, and the dataset ships with the arules package for R, which makes the analysis repeatable.

## The Groceries Dataset

Imagine roughly 10,000 receipts sitting on your table. Each receipt represents a transaction with the items that were purchased. The receipt is a representation of what went into a customer’s basket, hence the name ‘Market Basket Analysis’.

That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. You can download the Groceries data set to take a look at it, but this is not a necessary step.
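To make the "one line per receipt" structure concrete, here is a minimal sketch that builds a tiny transactions object in two ways: from an R list, and from a CSV file in basket format. The three baskets and the temporary file are made up for illustration and are not part of the Groceries data; the sketch assumes the arules package is installed.

```r
library(arules)

# Three made-up receipts; each vector is one basket
baskets <- list(
  c("coffee", "sugar", "milk"),
  c("coffee", "milk"),
  c("bread")
)

# Option 1: coerce the list directly to a transactions object
trans_list <- as(baskets, "transactions")

# Option 2: write the same receipts to a CSV file, one receipt per line,
# and read them back in "basket" format
tmp <- tempfile(fileext = ".csv")
writeLines(c("coffee,sugar,milk", "coffee,milk", "bread"), tmp)
trans_csv <- read.transactions(tmp, format = "basket", sep = ",")

# Each row is a transaction, each column is an item
inspect(trans_csv)
summary(trans_csv)
```

Reading the file with read.csv() instead would give a data frame rather than a transactions object, which is why the sketch uses read.transactions().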

## A little bit of Math

We already discussed the concept of Items and Item Sets.

We can represent our items as an item set as follows:

$I = \{ i_1, i_2, \ldots, i_n \}$

Therefore a transaction is represented as follows:

$t_n = \{ i_j, i_k, \ldots, i_n \}$

This gives us our rules which are represented as follows:

$\{ i_1, i_2 \} \Rightarrow \{ i_k \}$

This can be read as “if a user buys the items in the item set on the left-hand side, then the user will likely buy the item on the right-hand side too”. A more human-readable example is:

{coffee,sugar} => {milk}

If a customer buys coffee and sugar, then they are also likely to buy milk.

With this we can understand three important ratios: the support, the confidence and the lift. We describe the significance of these in the following bullet points; if you are interested in a formal mathematical definition, you can find it on Wikipedia.

• Support: the fraction of transactions in our dataset that contain the item set.
• Confidence: the probability that a rule is correct for a new transaction that contains the items on the left.
• Lift: the ratio by which the confidence of a rule exceeds the expected confidence.
Note: a lift of 1 indicates that the items on the left and right are independent.
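For reference, the three ratios can be written out as follows. The notation is an assumption introduced here, not taken from the post: $T$ is the set of all transactions, and $X$ and $Y$ are the item sets on the left-hand and right-hand side of a rule.

$$
\mathrm{supp}(X) = \frac{|\{\, t \in T : X \subseteq t \,\}|}{|T|}
$$

$$
\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}
$$

$$
\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{conf}(X \Rightarrow Y)}{\mathrm{supp}(Y)} = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)\,\mathrm{supp}(Y)}
$$

Here the "expected confidence" in the lift bullet is simply $\mathrm{supp}(Y)$: how often the right-hand side turns up when the left-hand side is ignored.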
## Apriori Recommendation with R

So let's get started by loading up our libraries and data set.

```
# Load the libraries
library(arules)
library(arulesViz)
library(datasets)

# Load the data set
data(Groceries)
```

Let's explore the data before we make any rules:

```
# Create an item frequency plot for the top 20 items
itemFrequencyPlot(Groceries,topN=20,type="absolute")
```

We are now ready to mine some rules!
You will usually want to pass the minimum required support and confidence yourself; if you leave them out, apriori falls back on its defaults (support 0.1, confidence 0.8).

• We set the minimum support to 0.001
• We set the minimum confidence to 0.8
• We then show the top 5 rules

```
# Get the rules
rules <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.8))

# Show the top 5 rules, but only 2 digits
options(digits=2)
inspect(rules[1:5])
```

The output we see should look something like this:

```
  lhs                        rhs            support confidence lift
1 {liquor,red/blush wine} => {bottled beer} 0.0019  0.90       11.2
2 {curd,cereals}          => {whole milk}   0.0010  0.91        3.6
3 {yogurt,cereals}        => {whole milk}   0.0017  0.81        3.2
4 {butter,jam}            => {whole milk}   0.0010  0.83        3.3
5 {soups,bottled beer}    => {whole milk}   0.0011  0.92        3.6
```

This reads easily, for example: if someone buys yogurt and cereals, they are 81% likely to buy whole milk too.
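To see where numbers like these come from, here is a minimal base R sketch that computes support, confidence and lift by hand for the {coffee,sugar} => {milk} example. The five baskets and the helper `support()` are made up for illustration; this is not how arules computes the measures internally, only a check of the definitions.

```r
# Toy baskets (made up for illustration; not taken from Groceries)
baskets <- list(
  c("coffee", "sugar", "milk"),
  c("coffee", "sugar", "milk"),
  c("coffee", "sugar"),
  c("milk", "bread"),
  c("bread")
)

# Hypothetical helper: fraction of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- c("coffee", "sugar")
rhs <- "milk"

# Support of the whole rule: baskets containing coffee, sugar and milk (2 of 5)
supp_rule <- support(c(lhs, rhs), baskets)

# Confidence: of the 3 baskets with coffee and sugar, 2 also contain milk
conf_rule <- supp_rule / support(lhs, baskets)

# Lift: confidence divided by how often milk occurs overall (3 of 5 baskets)
lift_rule <- conf_rule / support(rhs, baskets)

print(c(support = supp_rule, confidence = conf_rule, lift = lift_rule))
```

A lift above 1 in this toy data says that milk shows up more often alongside coffee and sugar than it does in the baskets overall.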

We can get summary information about the rules with `summary(rules)`, which gives us some interesting details such as:

• The number of rules generated: 410
• The distribution of rules by length: Most rules are 4 items long
• The summary of quality measures: interesting to see ranges of support, lift, and confidence.
• The information on the data mined: total data mined, and minimum parameters.
```
set of 410 rules

rule length distribution (lhs + rhs): sizes
  3   4   5   6
 29 229 140  12

summary of quality measures:
    support            conf.           lift
 Min.   :0.00102   Min.   :0.80   Min.   : 3.1
 1st Qu.:0.00102   1st Qu.:0.83   1st Qu.: 3.3
 Median :0.00122   Median :0.85   Median : 3.6
 Mean   :0.00125   Mean   :0.87   Mean   : 4.0
 3rd Qu.:0.00132   3rd Qu.:0.91   3rd Qu.: 4.3
 Max.   :0.00315   Max.   :1.00   Max.   :11.2

mining info:
      data    n support confidence
 Groceries 9835   0.001        0.8
```

## Sorting stuff out

The first issue we see here is that the rules are not sorted. Often we will want the most relevant rules first. Let's say we wanted the most likely rules first. We can easily sort by confidence by executing the following code.

`rules<-sort(rules, by="confidence", decreasing=TRUE)`

Now our top 5 output will be sorted by confidence and therefore the most relevant rules appear.

```
  lhs                                           rhs          support conf. lift
1 {rice,sugar}                               => {whole milk} 0.0012  1     3.9
2 {canned fish,hygiene articles}             => {whole milk} 0.0011  1     3.9
3 {root vegetables,butter,rice}              => {whole milk} 0.0010  1     3.9
4 {root vegetables,whipped/sour cream,flour} => {whole milk} 0.0017  1     3.9
5 {butter,soft cheese,domestic eggs}         => {whole milk} 0.0010  1     3.9
```

Rule 4 is perhaps excessively long. Let's say you wanted more concise rules. That is also easy to do by adding a “maxlen” parameter to your apriori function:

`rules <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.8,maxlen=3))`

## Redundancies

Sometimes rules will repeat, with the items of one rule contained in another. Redundancy indicates that one item might be a given. As an analyst you can elect to drop the item from the dataset. Alternatively, you can remove the redundant rules that were generated.

We can eliminate these repeated rules using the following snippet of code:

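```
# For every pair of rules, check whether one rule's items are contained in the other's.
# as.matrix() keeps the next steps working if is.subset() returns a sparse matrix.
subset.matrix <- as.matrix(is.subset(rules, rules))
# Only compare each rule against the rules ranked above it
subset.matrix[lower.tri(subset.matrix, diag=TRUE)] <- NA
# A rule is redundant if at least one higher-ranked rule is contained in it
redundant <- colSums(subset.matrix, na.rm=TRUE) >= 1
rules.pruned <- rules[!redundant]
rules<-rules.pruned
```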
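As an alternative sketch, arules also provides `is.redundant()`. Note that it uses a different criterion from the snippet above: it flags a rule when a more general rule with the same right-hand side has equal or higher confidence, so the two approaches may prune different rules. The example below re-mines the rules so that it runs on its own; it assumes the arules package is installed.

```r
library(arules)
data(Groceries)

# Mine the same rules as earlier in the post (console output suppressed)
rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.8),
                 control = list(verbose = FALSE))

# Logical vector: TRUE where a more general rule does at least as well
redundant <- is.redundant(rules)

# Keep only the non-redundant rules
rules.pruned <- rules[!redundant]

# Compare how many rules remain
length(rules)
length(rules.pruned)
```

Which criterion is preferable depends on whether you care about repeated item combinations (the matrix approach) or about rules that add no confidence over a shorter rule (this one).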

## Targeting Items

Now that we know how to generate rules and limit the output, let's say we wanted to target specific items when generating rules. There are two types of targets we might be interested in, illustrated here with “whole milk”:

1. What are customers likely to buy before they purchase whole milk?
2. What are customers likely to buy if they purchase whole milk?

This essentially means we want to set either the left-hand side or the right-hand side. This is not difficult to do with R!

```
rules<-apriori(data=Groceries, parameter=list(supp=0.001,conf = 0.08),
               appearance = list(default="lhs",rhs="whole milk"),
               control = list(verbose=F))
rules<-sort(rules, decreasing=TRUE,by="confidence")
inspect(rules[1:5])
```

The output will look like this:

```
  lhs                                           rhs          supp.  conf. lift
1 {rice,sugar}                               => {whole milk} 0.0012 1     3.9
2 {canned fish,hygiene articles}             => {whole milk} 0.0011 1     3.9
3 {root vegetables,butter,rice}              => {whole milk} 0.0010 1     3.9
4 {root vegetables,whipped/sour cream,flour} => {whole milk} 0.0017 1     3.9
5 {butter,soft cheese,domestic eggs}         => {whole milk} 0.0010 1     3.9
```

Likewise, we can set the left-hand side to be “whole milk” and find its consequents, i.e. the items that appear on the right-hand side.
Note the following:

• We set the confidence to 0.15 since we get no rules with 0.8
• We set a minimum length of 2 to avoid empty left hand side items

```
rules<-apriori(data=Groceries, parameter=list(supp=0.001,conf = 0.15,minlen=2),
               appearance = list(default="rhs",lhs="whole milk"),
               control = list(verbose=F))
rules<-sort(rules, decreasing=TRUE,by="confidence")
inspect(rules[1:5])
```

Now our output looks like this:

```
  lhs             rhs                support confidence lift
1 {whole milk} => {other vegetables} 0.075   0.29       1.5
2 {whole milk} => {rolls/buns}       0.057   0.22       1.2
3 {whole milk} => {yogurt}           0.056   0.22       1.6
4 {whole milk} => {root vegetables}  0.049   0.19       1.8
5 {whole milk} => {tropical fruit}   0.042   0.17       1.6
6 {whole milk} => {soda}             0.040   0.16       0.9
```

## Visualization

The last step is visualization. Let's say you wanted to map out the rules in a graph. We can do that with the “arulesViz” library that we loaded at the start.

```
library(arulesViz)
plot(rules,method="graph",interactive=TRUE,shading=NA)
```

You will get a graph of the rules that you can move around.
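Several readers in the comments below report errors from the interactive call, and some found that changing or dropping the `shading` argument helped. As a hedged alternative, here is a minimal sketch that plots only the ten highest-confidence rules as a static graph, leaving out `interactive` and `shading` altogether. It assumes arules and arulesViz are installed; the exact look of the plot depends on the installed arulesViz version, and `top_rules` is a name introduced here for illustration.

```r
library(arules)
library(arulesViz)
data(Groceries)

# Re-mine the rules so the sketch runs on its own (console output suppressed)
rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.8),
                 control = list(verbose = FALSE))

# Keep the ten rules with the highest confidence so the graph stays readable
top_rules <- sort(rules, by = "confidence", decreasing = TRUE)[1:10]

# Static graph: items and rules are nodes, arrows link a rule's lhs items to its rhs
plot(top_rules, method = "graph")
```

Plotting a small subset is a deliberate choice: a graph of all 410 rules is usually too dense to read.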

## Resources

1.  by  Karthikeyan P

A Very good article on MBA. Thank you for posting.

2.  by  Ashok Menon

Enjoyed the piece..your code worked to perfection…was attempting to get a grip on MBA the R route…Many Thanks

• Hi Ashok!

Glad it worked! Let me know if I can help any more ^_^

Cheers for stopping to read the post!

3.  by  Ashok Menon

Hi Salem,

Just in case you can increase the complexity tier of the dataset – say introduce a CustomerID and say Date of Purchase to begin with…something akin to raw data that we normally get…I am not sure if it would be worth your while…all the same… also it may not be a bad idea I feel to add some literature to your visualisation piece…as a final thought if your visualisation could resemble a la SPSS/equivalent – with some dashboards thrown in – the impact would be substantial.

Cheers!!

4.  by  Prashant

Hi Saleem,

It was really nice explination about MBA using apriori algorithm.

I am trying build algorithm using different category(i.e sports).

Prashant

• Hi Prashant,

Sure, the Groceries dataset comes packaged with the arules library in R.

If you want to make your data available to others you might want to look into saving your workspace with save.image() or saving your data set with save().

I am not sure if I answered the question you were after – let me know if I haven’t 🙂

Cheers,
Salem

5.  by  Prashant

Hi Saleem,

Thanks for your promt reply. Actualy my question was little different, am trying to create market basket analysis for sports category similar to your grocery category, since i have ctreated dummy transactional data for sports items, and using it as csv file but unfortunately am unable to get very neat out like your, hence i just wondering whether my data set creation have any problem.

If you want i can share my sample data with you. I will more than happy to talk to you over phone also.

Thanks,
Prashant

• Sure! I’ll send you an email now, please send across a sample data set.

6.  by  Omid

Hi Salem,

I am trying to read in my own data, but unfortunetly I get an error:

Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘itemFrequencyPlot’ for signature ‘"data.frame"’

Do you perhaps know why? Have I read in the data wrong?

Appreciate the effort.

Regards,
Omid

• Hi Omid!

•  by  Brian

Hi Salem,

I also get the same error, but for the visualization attempt:

> # Create an item frequency plot for the top 20 items
> itemFrequencyPlot(ITVM,topN=20,type=”absolute”)
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘itemFrequencyPlot’ for signature ‘"data.frame"’

Also, is there a way to append the code so that empty RHS rules are removed?

Thanks so much for a great blog!

7.  by  Prashant

Hi Salem Again,

Like i mentioned earlier, i am working on sports category data for MBA, for my lhs product (i.e. input product) am geting missings({ }) as rhs. am using “read.transactions” to read data where it ask for “rm.duplicate=T”. I expect due to removing duplicate data i might be facing this issue.

Do you know any other options read data where i done need to lose any data & get my product details in rhs?

Thanks Again,
Prashant

• Hi Prashant!

Actually that is normal behaviour if you do not set a minimum rule length (the minlen parameter).

If you read the help file for the apriori function it is mentioned there.

If you get an empty LHS with an item on the RHS, {}=>{Cookies}, it means the rule holds no matter what else is in the basket: cookies on their own are frequent enough to pass your thresholds.

Likewise if you get an empty RHS with an item on the LHS it means that if you select the item on the LHS it will be the only item in the basket.

If you are getting a lot of these empty brackets you have different options:
– Increase your data set it might be too sparse
– Change the thresholds (support, confidence), or filter the rules on lift

I hope this helps.

•  by  Ammar

Can u please share your sports data with me…..I want it for my semester project

8.  by  angela

hi Salem
great article , my thesis is on market basket analysis , and what u wrote is perfect , but incase u have any data on any retail shop which implemented this with their numbers that would be just great. it could help me a lot and saves my time i should defend my thesis in two months and yet havent found a single data on any other shop apart ours:(
regards
F

• Hi Angela,

A lot of retailers have this implemented. In the UK Tesco’s loyalty system started based on this algorithm. If you have 2 months I think to defend your thesis it would be even better if you partnered with a local retailer and ran a pilot. Thoughts?

Salem

•  by  eric

hi angel
i have some real time basket data i could send you
contact me

•  by  Ammar

Plz send me the data asap i want it for my semester project

•  by  Migle

Hi Eric,

I’m writing my Bachelor thesis about Market Basket Analysis and looking everywhere for some data, I didn’t find anything so far, maybe you could send me yours, it would be really great!

•  by  Sabrina

Thank you so much!
Sabrina

9.  by  Manu

Hi Saleem,

I am having trouble converting csv to transaction data.
The original groceries dataset has 9835 rows and 169 columns. But when I try to convert the groceries.csv to transaction using the read(df,”transactions”) it results into a transaction data with 9835 rows but 7011 columns.

Any help in this regard will be appreciated.

• Hi Manu,

Cheers,
Salem

•  by  Manu

Thanks Saleem.
I did use it before but did not set the format then.
Setting the format to basket did the trick.

Thanks once again

•  by  Prafful

Hi Manu,
I am having somewhat similar problem. My CSV file contains 15000 rows and 33 columns but when I am trying to run apriori algorithm on it, R crashes. This file is just 5 mb. I don`t know what to do now?

•  by  Manu

Do you have any idea of an R package which uses the fp-growth for association rule mining.

Thanks
manu

•  by  Gossuinjp

Hello,

I have thefollowing comment when using the read.trans.

Error in asMethod(object) :
can not coerce list with transactions with duplicated items

10.  by  Chris

This walk through was incredibly helpful for me. I must thank you.

Once the analysis is complete, what is the best way to take the results from the console to a report in Excel or a PDF? Any tips there?

Thanks again!

Chris

• Hi Chris!

Thanks :] glad you found it useful.

Yes, you can write the rules to a CSV file like this:

`write(rules, file = "data.csv", quote=TRUE, sep = ",", col.names = NA)`

You can also save the graph to an image file using jpeg()

Let us know if that works for you 🙂

11. I am unable to find itemFrequencyPlot(). I have the arules package installed. Pls let me know where I can find this func. Thanks.

12.  by  vishu

very nice blog.I am going to work on the same.Can you please tell me which dataset you have used ? I have searched a lot and haven’t got a proper dataset.

• Hi Vishu,

I used the groceries dataset available in R.

Thanks!

13.  by  Jeremy

Hi Salem,

I encountered this error when I tried plotting the visualization:
Error in i.parse.plot.params(graph, list(…)) :

Is there a package dependency I may have missed? Perhaps igraph?
I used:
library(arules)
library(arulesViz)

• Hi Jeremy,

After some research it looks like that error is indeed generated by igraph.

Have you tried install.packages("igraph"); library(igraph) ?

Salem

•  by  Stavit

Hi,

After installing the igraph I am still encounter with the same error.. any ideas?….
This is what i am trying to run:
> library(igraph)
> library(arulesViz)
Error in i.parse.plot.params(graph, list(…)) :

Also, do you happened to know if there are other MBA algorithms which was implemented in R?…

Thanks.

•  by  Tesh

Thank you Salem for this wonderful review. I also get the same error, even after installing/ loading igraph. I am running the latest version of R as well.

Please let me know if you have any other suggestions.

•  by  Lucie

Hi, did you find a solution? I get the same error despite
– installing igraph,
– re-installing tcltk => capabilities(“tcltk”) result is TRUE

I use R 3.1.2 binary for Mac OS X 10.9 (Mavericks) and higher. Therefore I also re-installed XQuartz.

Cheers,Lucie

•  by  panini73

Within the apriori function, in parameter = list(…); provide target =
It can be one of “frequent itemsets”, “maximally frequent itemsets”, “closed frequent itemsets”, “rules”, “hyperedgesets”
Works for all but rules, which still gives the error. So I hazard a guess that it is internal (?).
Others give you nice graphs.
Hope this helps.

•  by  Bill

I was able to get the diagram by changing the shading to TRUE from the sample NA:

14.  by  shecode

I would like to use this example after doing some cleaning of my text (i.e. using tm_map to remove stop words etc
I have a corpus data structure – how do I convert this into the structure you are using?

15. i dont understand the concept of MBA is good but there is no link to the coding and the output
i want an example application for mad but according to your coding the output is displaying like v1 v2 v3 observations and N/A and FALSE there is no comparing and no items so please give me a valid example

16.  by  Lagnajita

Hello,

If I using data drawn from an excel sheet instead of the dataset does it require extra coding. I tried using the data without any change to the data and received an error message.

Thanks.

17.  by  Tesh

Actually, I may have solved it.

Setting the target=”hyperedgesets”
seems to create a graphable object. I am still very new to this so I’m not sure if there is a way to make the original work. The graphs produced this way show baskets that the genes belong to and not the genes connected to each other.

rules2 <- apriori(y, parameter = list(supp = 0.001, conf = 0.8, target="hyperedgesets"))

18.  by  Naga Harish

Hi Salem,

The article is too good got a clear picture of market basket analysis.Thank you. I am actually looking for real market basket datasets for one of my academic projects, so if you have any or if you know something about it please let me know. Thanks in advance.

Regards
Harish

19.  by  Maila

Hi Salem,

Thanks for great article.
can you tell me how can I upload my own data(.csv file, similar to groceries file)
I have data in my desktop folder. how do I set path for that?
Also, I too got this error
Error in i.parse.plot.params(graph, list(…)) :

Thanks

20.  by  Maila

This is me again
Finally, I able to load my file and able to run some rules
but the result came like this
inspect(rules[1:4])
lhs rhs support confidence lift
1 {} => { true} 0.99853 0.99853 1.0000
2 {} => { false} 0.99853 0.99853 1.0000
3 { true} => { false} 0.99780 0.99926 1.0007
4 { false} => { true} 0.99780 0.99926 1.0007

it is because my dataset in this format:
Basket ID Candy Lemons coffee milk
C11867 false false false false
C5096 false false false false
C4295 false false false false

can you give some hint, how would I change this data to transaction matrix

Thanks

21.  by  shan

Hi Salem..

If possible can we have an article on customer churning model.
Your help in this regard will be highly appreciated.
Thanks…

22.  by  Kevin

Real great stuff here. I was wondering if you knew of any transactions data sets like Groceries that also contained customer attributes (such as a customer ID linked to demographic variables) for each transaction? I’d like to extend this analysis to incorporate some of those factors, but I haven’t been able to find a suitable dataset.

Regards,

Kevin

23.  by  DP

Hi,
Just a question: do you have set the minimum support and confidence using some information from the plot or do you have used the default ?

There are according to you some rule of thumb that I can use to set properly these parameters?

DP

24.  by  DP

HI,

An additional question: how I have to set my dataset in order to use the apriori algorithm?
Do you think that a dataset like that below should be ok?

ID Product
1 A
1 A
1 B
2 C
2 A

Thanks,
DP

25.  by  Ans Elk

Excellent article, thank you sir ! 🙂

26.  by  CRISTINE

Hi,

there is the solution: it works perfectly

United we stand

27.  by  Tania

Hi Salem,
I’m trying to mine association rules on a 20,000 transactions dataset. When I try to get the item frequency plot I got this message:

Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘itemFrequencyPlot’ for signature ‘"matrix"’

Tania

•  by  Maria

You’ll have to convert your data frame into a transactions object.
with something like this:

trans <- as(df, "transactions")
itemFrequencyPlot(trans,topN=20,type="absolute")

28.  by  Survi`

Please advise on how to use my own csv file for such an analysis. Someone pls mail me the code. urgent

29.  by  Venkata Duvvuri

Great blog. I used apriori to generate rules. How Can i apply the rules to new dataset to predict RHS?

30.  by  Udit

This is just to help people who are using csv files

use read.transactions() rather than read.csv(), because read.csv() returns a data frame with automatic column names

Using MyData<-read.csv() would return data frame in MyData but now when you pass this MyData to apriori, it will accept it but give the column names as V1 , V2 and the result will be distorted

In the above example Groceries is already transaction data

Hence there was no confusion but when importing transactions from csv file, remember the above rules

I was myself facing the problem when I came through this post

Hope it helps

31.  by  Yong

FYI:: If you don’t include [shading=N] option then you may skip the error [object ‘v.color’ not found] .

32.  by  sub

Is there a way to extract how my records are there for each rule? For ex: if

A+ B ===> C

Is there is a way to figure out from the Apriori algorithm how many transactions are there in the rule set? Thanks!

33.  by  Ana

Hi Salem,

I was wondering how I need to structure my data to begin with. Would it be:

Order_ID Products
1 Banana, Milk, Coffee
2 Banana, Oragnge, Milk, Tea
3 Tea, Milk ,Coffee

Then save this as a csv? Then Import it to using the read.transaction() ? Just need some help organizing my own data so I can get started.

Thank you!!!!

34.  by  Matthew Bellissimo

Salem

Did you ever figure out what the fix is for the code?
Returns:
Error in i.parse.plot.params(graph, list(…)) :

> plot(rules,method=’graph’,interactive=TRUE)
Returns:
Error in structure(.External(.C_dotTclObjv, objv), class = “tclObj”) :
[tcl] invalid command name “font”.

I followed the rest of your blog post identically, It’s incredible. The only thing is i am stuck on this error.
Help here would be Extremely appreciated.

35.  by  dileep

Have implemented fp growth algorithm in R.I am working on that but didnt find any function like apriori(),eclat()….. Thank you for ur responce in advance.

36.  by  Yusheng

Hi,

Thanks for this great article!

Here I have a question about the dataset. Is the data from real scenario or you simulated them? And do you mind to give me some info about how can I get the the real scenario data for grocery store purchase records? I am doing a project about this and if there are some data comes from the real scenario I can use, that would be so helpful.

Cheers,

Yusheng

37. arules package really helps for products, customers, marketing analysis… Just wonder where can I get the data of customers’ orders to analyse? (=.=)”

38. You made this quite simple to grasp and immediately implement. Thank you!

39.  by  Upekha

Hi Salem,
Thanks for the article.

I have a question, which is the better method to do recommendations, market basket analysis or collaborative filtering ?

What is the difference between these two?

Thank You

40.  by  Ananya

Hi Saleem,

I have a transaction data, how can i convert it to the format to be given to affinity analysis??

41. Pingback: R studio | cl4assignments

42.  by  Mohanapriya S

hi,
In my data set there are 14 columns.but only one column is string value(fruit value),remaining columns are numeric(temperature , ph values),when i perform apriori i get only numbers in both lhr and rhs(ph ,temperature values.i didn’t get any fruit name).i want fruit name in lhr side.