Market Basket Analysis using Python

Data-Mania Writer's Guild

This article primarily demonstrates how to do market basket analysis using Python, along with a primer on association rules.

What is Market Basket Analysis?

Market basket analysis is a modelling technique based on the theory of association rules: customers who buy a certain type of item are more (or less) likely to buy another type of item. For example, a customer who buys milk and eggs is more likely to buy bread than a customer who buys neither. Market basket analysis looks for such relationships between the purchases customers make. Supermarkets and online stores rely on it for product placement and product recommendations.

How does Market Basket Analysis work?

Market basket analysis is one of the key techniques large retailers use to uncover associations between items. It helps them understand customer behavior to a certain extent and make recommendations based on the past purchases of other customers.

Market basket analysis is based on rules, association rules in particular. For example: if a customer has already chosen peanut butter and jelly, how likely are they to buy bread as well? A statement of the form “if item A is bought, item B is likely to be bought too” is called an association rule. The rule can be represented as –

  
{peanut butter, jelly} => {bread}

Why Association Rules?

This is an interesting question. Many other techniques exist, such as SVMs, random forests and clustering, so why are association rules preferred for market basket analysis? Some of the drawbacks of those techniques are –

  • Tuning such algorithms can be quite hard.
  • These algorithms tend to require quite a large amount of data to give good recommendations.
  • They also require quite a bit of feature engineering.

This is where association rules have an edge over other techniques –

  • They are relatively fast.
  • They work well on small quantities of data.
  • They need little feature engineering.

Some of the important terms involved with market basket analysis –

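Support – This is the relative frequency of an item (or itemset) in the transaction data. Support for an item A can be calculated as –

support(A) = (number of transactions containing A) / (total number of transactions)

Confidence – This is the probability of seeing the consequent in a transaction given that it also contains the antecedent. In the rule A => C, A is the antecedent and C is the consequent. Confidence is also a measure of the reliability of a rule.

confidence(A => C) = support(A and C together) / support(A)

Lift – Lift measures how much more often the antecedent and consequent occur together than would be expected if they were independent. A lift of 1 means the two are independent; a lift above 1 means they occur together more often than expected, so rules with a lift above 1 are the ones usually considered.

lift(A => C) = confidence(A => C) / support(C)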
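
To make the three terms concrete, here is a minimal sketch that computes them by hand on a handful of made-up transactions. The data and the helper names (support, confidence, lift) are assumptions for illustration only; they are not part of MLxtend or of the dataset used later in this article.

```python
# Toy transactions (hypothetical data, not the article's dataset).
transactions = [
    {"milk", "eggs", "bread"},
    {"milk", "eggs"},
    {"milk", "bread"},
    {"eggs", "bread"},
    {"milk", "eggs", "bread", "jam"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence of the rule divided by the consequent's own support."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


A, C = {"milk", "eggs"}, {"bread"}
print("support:", support(A | C, transactions))
print("confidence:", confidence(A, C, transactions))
print("lift:", lift(A, C, transactions))
```

In this toy data the lift of {milk, eggs} => {bread} comes out below 1, because bread is already in four of the five baskets; a rule can have a fairly high confidence and still tell you little if the consequent is popular on its own.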

Performing Market Basket Analysis Using Python

Python has a very good package for association analysis: MLxtend, whose syntax is similar to scikit-learn's. Let’s install MLxtend –

  • If you are using the pip package manager – pip install mlxtend
  • Alternatively, if you are using the conda package manager – conda install -c conda-forge mlxtend

For explicit directions on how to install mlxtend, feel free to visit this page.

Once MLxtend is installed, the next step is reading in the data. It’s important to get the data into the right format. Transaction data usually looks like this –

[Image: sample transaction data, one transaction per row with its items listed across the columns]

The data above is a sample from a convenience store and will not be used for the analysis in this article; I have prepared a different data set to work with the algorithm. Each row represents one transaction. In the first row, for example, the items purchased were citrus fruit, semi-finished bread, margarine and ready soups. For the apriori algorithm to work, the transaction data must be converted to a one-hot encoded table (one column per item, one row per transaction, with a 1 or True where the transaction contains the item), as shown below –

[Image: transaction data encoded with one column per item and one row per transaction]

This transformation can be done with the TransactionEncoder class from mlxtend.preprocessing. Create an encoder with te = TransactionEncoder() and then call te.fit(dataset).transform(dataset). For this article, I have already transformed a new set of data into this format. Let’s start with the analysis –
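
If it helps to see what the encoding step produces, the following is a library-free sketch of the same kind of transformation on a tiny, made-up list of transactions. It is not MLxtend's implementation; the variable names (dataset, columns, encoded) are assumptions chosen for illustration.

```python
# Hypothetical transactions, each a list of item names.
dataset = [
    ["citrus fruit", "margarine", "ready soups"],
    ["margarine", "yogurt"],
    ["citrus fruit", "yogurt", "coffee"],
]

# Sorted list of every distinct item; these become the column names.
columns = sorted({item for transaction in dataset for item in transaction})

# One row per transaction, True where the transaction contains the item.
encoded = []
for transaction in dataset:
    present = set(transaction)
    encoded.append([item in present for item in columns])

print(columns)
for row in encoded:
    print(row)
```

Each printed row lines up with the column list, so the result can be loaded into a pandas DataFrame with those column names and passed on to apriori.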

 
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

Now let’s read in the data –

 
basket_sets = pd.read_csv('a.csv')
basket_sets.head()

The data above is transaction data from a hardware store. Each row represents a transaction: the InvoiceNo column holds the invoice number and the remaining columns are the items. The InvoiceNo column needs to be dropped since it adds no value to the analysis.

 
basket_sets = basket_sets.drop('InvoiceNo', axis=1)
basket_sets.head()

[Image: first rows of basket_sets after dropping the InvoiceNo column]

The InvoiceNo column is removed. Now let’s find out which items have a support of at least 0.6.

 
apriori(basket_sets, min_support=0.6)
support	itemsets
0	0.728850	[0]
1	0.659436	[25]
2	0.904555	[35]
3	0.624729	[46]
4	0.735358	[61]
5	0.785249	[64]
6	0.685466	[65]
7	2.611714	[81]
8	1.015184	[146]
9	0.867679	[225]
10	0.824295	[238]
11	0.629067	[239]
12	0.607375	[240]
13	0.737527	[308]
14	0.637744	[315]
15	0.646421	[317]
16	0.624729	[349]
17	1.171367	[366]
18	0.780911	[542]
19	0.911063	[595]
20	1.145336	[598]
21	0.806941	[601]
22	0.780911	[614]
23	0.655098	[626]
24	0.650759	[627]
25	1.388286	[631]
26	0.954447	[633]
27	1.056399	[639]
28	0.650759	[641]
29	1.219089	[697]
...	...	...
64	1.017354	[1027]
65	0.631236	[1035]
66	2.850325	[1057]
67	0.607375	[1086]
68	0.969631	[1090]
69	0.709328	[1119]
70	1.373102	[1121]
71	0.659436	[1168]
72	0.809111	[1179]
73	0.683297	[1234]
74	0.676790	[1235]
75	1.301518	[1239]
76	1.492408	[1240]
77	0.728850	[1243]
78	1.041215	[1245]
79	1.613883	[1246]
80	2.082430	[1248]
81	2.759219	[1267]
82	2.420824	[1268]
83	1.041215	[1302]
84	0.806941	[1321]
85	1.787419	[1326]
86	0.676790	[1339]
87	0.702820	[1350]
88	1.093275	[1372]
89	0.661605	[1523]
90	1.145336	[1535]
91	0.704989	[1536]
92	0.704989	[1540]
93	0.759219	[1551]

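There are 94 itemsets (indexed 0 to 93) with a support of 0.6 and above, each containing a single item. Note that several support values are greater than 1, which cannot happen for a true relative frequency. This suggests the table holds purchase quantities rather than 0/1 flags, and recent versions of MLxtend may reject such input, so it is safer to binarize the data first, for example with basket_sets = basket_sets > 0. The table above also shows item indices rather than item names. To get the item names, pass use_colnames=True –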

 
apriori(basket_sets, min_support=0.6, use_colnames=True)
support	itemsets	
0	0.728850	[10 COLOUR SPACEBOY PEN]
1	0.659436	[36 PENCILS TUBE RED RETROSPOT]
2	0.904555	[4 TRADITIONAL SPINNING TOPS]
3	0.624729	[60 TEATIME FAIRY CAKE CASES]
4	0.735358	[ALARM CLOCK BAKELIKE GREEN]
5	0.785249	[ALARM CLOCK BAKELIKE PINK]
6	0.685466	[ALARM CLOCK BAKELIKE RED]
7	2.611714	[ASSORTED COLOUR BIRD ORNAMENT]
8	1.015184	[BLUE HARMONICA IN BOX]
9	0.867679	[CARTOON PENCIL SHARPENERS]
10	0.824295	[CHARLOTTE BAG APPLES DESIGN]
11	0.629067	[CHARLOTTE BAG DOLLY GIRL DESIGN]
12	0.607375	[CHARLOTTE BAG PINK POLKADOT]
13	0.737527	[CIRCUS PARADE LUNCH BOX]
14	0.637744	[CLOTHES PEGS RETROSPOT PACK 24]
15	0.646421	[COFFEE MUG APPLES DESIGN]
16	0.624729	[DINOSAUR KEYRINGS ASSORTED]
17	1.171367	[DOLLY GIRL LUNCH BOX]
18	0.780911	[GUMBALL COAT RACK]
19	0.911063	[ICE CREAM BUBBLES]
20	1.145336	[ICE CREAM SUNDAE LIP GLOSS]
21	0.806941	[INFLATABLE POLITICAL GLOBE]
22	0.780911	[JAM MAKING SET PRINTED]
23	0.655098	[JUMBO BAG APPLES]
24	0.650759	[JUMBO BAG DOILEY PATTERNS]
25	1.388286	[JUMBO BAG PINK POLKADOT]
26	0.954447	[JUMBO BAG RED RETROSPOT]
27	1.056399	[JUMBO BAG VINTAGE DOILY]
28	0.650759	[JUMBO BAG WOODLAND ANIMALS]
29	1.219089	[LUNCH BAG APPLE DESIGN]
...	...	...
64	1.017354	[RED RETROSPOT CHARLOTTE BAG]
65	0.631236	[RED RETROSPOT PICNIC BAG]
66	2.850325	[RED TOADSTOOL LED NIGHT LIGHT]
67	0.607375	[RETROSPOT PARTY BAG + STICKER SET]
68	0.969631	[REVOLVER WOODEN RULER]
69	0.709328	[ROUND SNACK BOXES SET OF 4 FRUITS]
70	1.373102	[ROUND SNACK BOXES SET OF4 WOODLAND]
71	0.659436	[SET OF 12 FAIRY CAKE BAKING CASES]
72	0.809111	[SET OF 20 KIDS COOKIE CUTTERS]
73	0.683297	[SET OF 60 I LOVE LONDON CAKE CASES]
74	0.676790	[SET OF 60 PANTRY DESIGN CAKE CASES]
75	1.301518	[SET OF 9 BLACK SKULL BALLOONS]
76	1.492408	[SET OF 9 HEART SHAPED BALLOONS]
77	0.728850	[SET/10 BLUE POLKADOT PARTY CANDLES]
78	1.041215	[SET/10 PINK POLKADOT PARTY CANDLES]
79	1.613883	[SET/10 RED POLKADOT PARTY CANDLES]
80	2.082430	[SET/20 RED RETROSPOT PAPER NAPKINS]
81	2.759219	[SET/6 RED SPOTTY PAPER CUPS]
82	2.420824	[SET/6 RED SPOTTY PAPER PLATES]
83	1.041215	[SMALL RED RETROSPOT WINDMILL]
84	0.806941	[SPACEBOY BIRTHDAY CARD]
85	1.787419	[SPACEBOY LUNCH BOX]
86	0.676790	[STARS GIFT TAPE]
87	0.702820	[STRAWBERRY LUNCH BOX WITH CUTLERY]
88	1.093275	[TEA PARTY BIRTHDAY CARD]
89	0.661605	[WOODLAND CHARLOTTE BAG]
90	1.145336	[WORLD WAR 2 GLIDERS ASSTD DESIGNS]
91	0.704989	[WRAP VINTAGE DOILY]
92	0.704989	[WRAP CHRISTMAS VILLAGE]
93	0.759219	[WRAP RED APPLES]

That’s much better. Let’s lower the minimum support, since a value of 0.6 returns only single-item itemsets.

 
df = basket_sets
# A lower support threshold lets multi-item itemsets through.
frequent_itemsets = apriori(df, min_support=0.06, use_colnames=True)
# Record how many items each itemset contains.
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets
support	itemsets	length	
0	0.728850	[10 COLOUR SPACEBOY PEN]	1
1	0.260304	[12 COLOURED PARTY BALLOONS]	1
2	0.427332	[12 PENCIL SMALL TUBE WOODLAND]	1
3	0.323210	[12 PENCILS SMALL TUBE RED RETROSPOT]	1
4	0.336226	[12 PENCILS SMALL TUBE SKULL]	1
5	0.227766	[12 PENCILS TALL TUBE RED RETROSPOT]	1
6	0.156182	[12 PENCILS TALL TUBE WOODLAND]	1
7	0.069414	[18PC WOODEN CUTLERY SET DISPOSABLE]	1
8	0.088937	[20 DOLLY PEGS RETROSPOT]	1
9	0.208243	[3 PIECE SPACEBOY COOKIE CUTTER SET]	1
10	0.078091	[36 DOILIES DOLLY GIRL]	1
11	0.659436	[36 PENCILS TUBE RED RETROSPOT]	1
12	0.119306	[36 PENCILS TUBE SKULLS]	1
13	0.312364	[36 PENCILS TUBE WOODLAND]	1
14	0.084599	[3D VINTAGE CHRISTMAS STICKERS]	1
15	0.078091	[4 IVORY DINNER CANDLES SILVER FLOCK]	1
16	0.904555	[4 TRADITIONAL SPINNING TOPS]	1
17	0.104121	[5 HOOK HANGER RED MAGIC TOADSTOOL]	1
18	0.236443	[6 GIFT TAGS 50'S CHRISTMAS]	1
19	0.340564	[6 GIFT TAGS VINTAGE CHRISTMAS]	1
20	0.203905	[6 RIBBONS RUSTIC CHARM]	1
21	0.468547	[60 CAKE CASES DOLLY GIRL DESIGN]	1
22	0.104121	[60 CAKE CASES VINTAGE CHRISTMAS]	1
23	0.624729	[60 TEATIME FAIRY CAKE CASES]	1
24	0.260304	[72 SWEETHEART FAIRY CAKE CASES]	1
25	0.121475	[ABC TREASURE BOOK BOX]	1
26	0.147505	[ALARM CLOCK BAKELIKE CHOCOLATE]	1
27	0.735358	[ALARM CLOCK BAKELIKE GREEN]	1
28	0.173536	[ALARM CLOCK BAKELIKE IVORY]	1
29	0.208243	[ALARM CLOCK BAKELIKE ORANGE]	1
...	...	...	...
662	0.086768	[PLASTERS IN TIN CIRCUS PARADE, PLASTERS IN TI...	2
663	0.125813	[PLASTERS IN TIN CIRCUS PARADE, POSTAGE]	2
664	0.088937	[PLASTERS IN TIN SPACEBOY, PLASTERS IN TIN WOO...	2
665	0.097614	[PLASTERS IN TIN SPACEBOY, POSTAGE]	2
666	0.117137	[PLASTERS IN TIN WOODLAND ANIMALS, POSTAGE]	2
667	0.140998	[POSTAGE, RABBIT NIGHT LIGHT]	2
668	0.069414	[POSTAGE, RED RETROSPOT CHARLOTTE BAG]	2
669	0.097614	[POSTAGE, RED RETROSPOT MINI CASES]	2
670	0.134490	[POSTAGE, RED TOADSTOOL LED NIGHT LIGHT]	2
671	0.091106	[POSTAGE, REGENCY CAKESTAND 3 TIER]	2
672	0.080260	[POSTAGE, ROUND SNACK BOXES SET OF 4 FRUITS]	2
673	0.125813	[POSTAGE, ROUND SNACK BOXES SET OF4 WOODLAND]	2
674	0.093275	[POSTAGE, SET/20 RED RETROSPOT PAPER NAPKINS]	2
675	0.099783	[POSTAGE, SET/6 RED SPOTTY PAPER CUPS]	2
676	0.091106	[POSTAGE, SET/6 RED SPOTTY PAPER PLATES]	2
677	0.082430	[POSTAGE, SPACEBOY LUNCH BOX]	2
678	0.097614	[POSTAGE, STRAWBERRY LUNCH BOX WITH CUTLERY]	2
679	0.075922	[POSTAGE, TEA PARTY BIRTHDAY CARD]	2
680	0.086768	[SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED...	2
681	0.086768	[SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED...	2
682	0.104121	[SET/6 RED SPOTTY PAPER CUPS, SET/6 RED SPOTTY...	2
683	0.060738	[ALARM CLOCK BAKELIKE GREEN, ALARM CLOCK BAKEL...	3
684	0.062907	[PLASTERS IN TIN CIRCUS PARADE, PLASTERS IN TI...	3
685	0.071584	[PLASTERS IN TIN CIRCUS PARADE, PLASTERS IN TI...	3
686	0.071584	[PLASTERS IN TIN SPACEBOY, PLASTERS IN TIN WOO...	3
687	0.071584	[POSTAGE, SET/20 RED RETROSPOT PAPER NAPKINS, ...	3
688	0.071584	[POSTAGE, SET/20 RED RETROSPOT PAPER NAPKINS, ...	3
689	0.086768	[POSTAGE, SET/6 RED SPOTTY PAPER CUPS, SET/6 R...	3
690	0.084599	[SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED...	3
691	0.069414	[POSTAGE, SET/20 RED RETROSPOT PAPER NAPKINS, ...	4
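
To illustrate why lowering the minimum support brings larger itemsets into the result, here is a small, library-free sketch of the level-wise Apriori idea. It is an assumption-laden illustration, not MLxtend's implementation: the function name mine_frequent_itemsets and the toy transactions are hypothetical.

```python
from itertools import combinations


def mine_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    result = dict(current)

    k = 2
    while current:
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        result.update(current)
        k += 1
    return result


# Hypothetical transactions for illustration.
toy_transactions = [
    frozenset({"cups", "plates", "napkins"}),
    frozenset({"cups", "plates"}),
    frozenset({"cups", "napkins"}),
    frozenset({"plates"}),
    frozenset({"cups", "plates", "napkins", "postage"}),
]

high = mine_frequent_itemsets(toy_transactions, min_support=0.8)
low = mine_frequent_itemsets(toy_transactions, min_support=0.4)
print("largest itemset at 0.8:", max(len(s) for s in high))
print("largest itemset at 0.4:", max(len(s) for s in low))
```

With the high threshold only single items survive, while the lower threshold also admits pairs and a triple, which mirrors what happens in the table above when min_support drops from 0.6 to 0.06.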

Keep only the itemsets that contain two or more items –

 
frequent_itemsets[frequent_itemsets['length'] >= 2]

        support	                                 itemsets	            length
643	0.062907	[ALARM CLOCK BAKELIKE GREEN, ALARM CLOCK BAKEL...	2
644	0.067245	[ALARM CLOCK BAKELIKE GREEN, ALARM CLOCK BAKEL...	2
645	0.071584	[ALARM CLOCK BAKELIKE GREEN, POSTAGE]			2
646	0.062907	[ALARM CLOCK BAKELIKE PINK, ALARM CLOCK BAKELI...	2
647	0.075922	[ALARM CLOCK BAKELIKE PINK, POSTAGE]			2
648	0.073753	[ALARM CLOCK BAKELIKE RED, POSTAGE]			2
649	0.062907	[DOLLY GIRL LUNCH BOX, POSTAGE]			2
650	0.060738	[DOLLY GIRL LUNCH BOX, SPACEBOY LUNCH BOX]		2
651	0.069414	[JUMBO BAG RED RETROSPOT, POSTAGE]			2
652	0.065076	[JUMBO BAG WOODLAND ANIMALS, POSTAGE]			2
653	0.088937	[LUNCH BAG APPLE DESIGN, POSTAGE]			2
654	0.104121	[LUNCH BAG RED RETROSPOT, POSTAGE]			2
655	0.078091	[LUNCH BAG SPACEBOY DESIGN, POSTAGE]			2
656	0.086768	[LUNCH BAG WOODLAND, POSTAGE]				2
657	0.097614	[LUNCH BOX WITH CUTLERY RETROSPOT, POSTAGE]		2
658	0.069414	[MINI PAINT SET VINTAGE, POSTAGE]			2
659	0.071584	[PACK OF 72 RETROSPOT CAKE CASES, POSTAGE]		2
660	0.060738	[PAPER BUNTING RETROSPOT, POSTAGE]			2
661	0.078091	[PLASTERS IN TIN CIRCUS PARADE, PLASTERS IN TI...	2
662	0.086768	[PLASTERS IN TIN CIRCUS PARADE, PLASTERS IN TI...	2
663	0.125813	[PLASTERS IN TIN CIRCUS PARADE, POSTAGE]		2
664	0.088937	[PLASTERS IN TIN SPACEBOY, PLASTERS IN TIN WOO...	2
665	0.097614	[PLASTERS IN TIN SPACEBOY, POSTAGE]	     		2
666	0.117137	[PLASTERS IN TIN WOODLAND ANIMALS, POSTAGE]		2
667	0.140998	[POSTAGE, RABBIT NIGHT LIGHT]				2
668	0.069414	[POSTAGE, RED RETROSPOT CHARLOTTE BAG]		2
669	0.097614	[POSTAGE, RED RETROSPOT MINI CASES]			2
670	0.134490	[POSTAGE, RED TOADSTOOL LED NIGHT LIGHT]		2
671	0.091106	[POSTAGE, REGENCY CAKESTAND 3 TIER]			2
672	0.080260	[POSTAGE, ROUND SNACK BOXES SET OF 4 FRUITS]		2
673	0.125813	[POSTAGE, ROUND SNACK BOXES SET OF4 WOODLAND]		2
674	0.093275	[POSTAGE, SET/20 RED RETROSPOT PAPER NAPKINS]		2
675	0.099783	[POSTAGE, SET/6 RED SPOTTY PAPER CUPS]		2
676	0.091106	[POSTAGE, SET/6 RED SPOTTY PAPER PLATES]		2
677	0.082430	[POSTAGE, SPACEBOY LUNCH BOX]				2
678	0.097614	[POSTAGE, STRAWBERRY LUNCH BOX WITH CUTLERY]		2
679	0.075922	[POSTAGE, TEA PARTY BIRTHDAY CARD]			2
680	0.086768	[SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED...	2
681	0.086768	[SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED...	2
682	0.104121	[SET/6 RED SPOTTY PAPER CUPS, SET/6 RED SPOTTY...	2
683	0.060738	[ALARM CLOCK BAKELIKE GREEN, ALARM CLOCK BAKEL...	3
684	0.062907	[PLASTERS IN TIN CIRCUS PARADE, PLASTERS IN TI...	3
685	0.071584	[PLASTERS IN TIN CIRCUS PARADE, PLASTERS IN TI...	3
686	0.071584	[PLASTERS IN TIN SPACEBOY, PLASTERS IN TIN WOO...	3
687	0.071584	[POSTAGE, SET/20 RED RETROSPOT PAPER NAPKINS, ...	3
688	0.071584	[POSTAGE, SET/20 RED RETROSPOT PAPER NAPKINS, ...	3
689	0.086768	[POSTAGE, SET/6 RED SPOTTY PAPER CUPS, SET/6 R...	3
690	0.084599	[SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED...	3
691	0.069414	[POSTAGE, SET/20 RED RETROSPOT PAPER NAPKINS, ...	4

Generating association rules with MLxtend is now straightforward. The association_rules function takes the frequent itemsets plus two main arguments: metric, which selects the measure to filter on (such as “confidence” or “lift”), and min_threshold, which sets the minimum value for that measure. Let’s build some rules with confidence as the metric and a minimum threshold of 0.5.

 
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
rules.head()

Every rule in the resulting table has a confidence of at least 0.5. Now let’s create rules with lift as the metric and a minimum threshold of 1.

 
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head()

Every rule in the resulting table has a lift of at least 1. Now that we have the rules, it’s easy to filter them by the desired lift and confidence.

 
rules[(rules['lift'] >= 5) & (rules['confidence'] >= 0.5)]

This method can be used to make product recommendations and to see which product combinations have a high lift.
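
For readers who want to see what rule generation and filtering amount to, here is a minimal sketch that derives rules from a table of itemset supports and keeps those above a confidence and lift threshold. It is a simplified illustration, not how MLxtend implements association_rules; the supports dictionary, the generate_rules function and the thresholds are all hypothetical.

```python
from itertools import combinations

# Hypothetical itemset supports, in the spirit of the frequent_itemsets table.
supports = {
    frozenset({"cups"}): 0.8,
    frozenset({"plates"}): 0.8,
    frozenset({"napkins"}): 0.6,
    frozenset({"cups", "plates"}): 0.6,
    frozenset({"cups", "napkins"}): 0.6,
    frozenset({"plates", "napkins"}): 0.4,
    frozenset({"cups", "plates", "napkins"}): 0.4,
}


def generate_rules(supports, min_confidence=0.0, min_lift=0.0):
    """Enumerate antecedent => consequent rules from frequent itemsets."""
    rules = []
    for itemset, itemset_support in supports.items():
        if len(itemset) < 2:
            continue
        # Every non-empty proper subset can serve as an antecedent.
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, size)):
                consequent = itemset - antecedent
                conf = itemset_support / supports[antecedent]
                lift = conf / supports[consequent]
                if conf >= min_confidence and lift >= min_lift:
                    rules.append((antecedent, consequent, conf, lift))
    return rules


toy_rules = generate_rules(supports, min_confidence=0.7, min_lift=1.0)
for antecedent, consequent, conf, lift in toy_rules:
    print(sorted(antecedent), "=>", sorted(consequent), round(conf, 3), round(lift, 3))
```

Note that {cups} => {plates} has a confidence of about 0.75 in this toy table yet is dropped, because its lift is below 1; filtering on both measures, as in the snippet above, avoids recommending items that are simply popular.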

Conclusion

As this simple demonstration of market basket analysis using Python shows, it’s easy to form association rules in Python with the MLxtend package. The data used in this article is relatively small, but it is enough to illustrate the main concepts of market basket analysis. Now it’s up to you to go ahead and start working on it.
