Instacart market basket analysis (Part-2)

Kuldeep Sangwan
5 min readMay 6, 2021

This is a continuation of Part 1, so to understand the problem better you can start with the first blog.

The things I would be discussing in this part:

  1. ML formulation of business problem
  2. EDA
  3. Feature Engineering
  4. ML models
  5. Introducing auto-encoders to the problem: are they of any use?
  6. Trying some approaches from the Kaggle winners
  7. Future works
  8. References

1. ML formulation of business problem

For a user to get product recommendations based on his past N orders, we need to observe patterns and generate rules that yield recommendations with high probability. Since we have over 3 million data points, we need to automate this learning process, and with Machine Learning we can produce probabilistic predictions: it works well on large data sets and derives rules from the patterns it learns in the features. The alternative would be a rule-based system, which works best when we already know the rules. But it is very difficult to generate rules by going over all the data samples manually and making sense of the patterns, and doing so cannot guarantee high predictive power.

2. EDA

We understand something better by asking questions about it. For the in-depth EDA, you can follow my code on GitHub.

2.1 Most ordered products

From this we can see that fruits are ordered the most, followed by vegetables.

2.2 What day of the week people order

Most products are ordered on Sunday and Monday.

2.3 After how many days people order again

We can see bumps at the 7th, 14th, 21st and 30th day marks, so we can infer that people mostly order on a weekly (and, to a lesser extent, monthly) cycle.
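This distribution can be sketched with a `value_counts` over the `days_since_prior_order` column; the rows below are invented stand-ins with Instacart-style column names, not the real data:

```python
import pandas as pd

# Hypothetical rows shaped like the Instacart orders table; the
# values are made up to illustrate the computation.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "days_since_prior_order": [7.0, 7.0, 14.0, 30.0, 7.0, 21.0],
})

# Count how often each gap between consecutive orders occurs; on the
# real data this is the histogram with bumps at 7, 14, 21 and 30 days.
gap_counts = (orders["days_since_prior_order"]
              .value_counts()
              .sort_index())
print(gap_counts)
```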

2.4 Number of Orders from Department/Aisle

The size of each box shows the number of orders.

3. Feature Engineering

The idea behind feature engineering is that every field we have in the different tables can be used as an order-related or product-related feature, and combined to derive further features.

Note: the features for training and testing must be computed from the prior orders. We derive user-related and product-related features from the prior data and then merge them onto the train orders.

Different combinations of the original data used to derive features:

  • user_id, product_id and days_since_prior_order — to get features like the number of days before a particular product is reordered
  • user_id and department, user_id and aisle — to get features related to the department or aisle a product belongs to.
  • product_id and users to get features related to a specific user and a product.
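The "aggregate from prior, then merge onto train" step can be sketched with pandas. The tables below are toy stand-ins with Instacart-style column names, and the feature names (`up_orders`, `up_mean_gap`) are my own:

```python
import pandas as pd

# Toy stand-ins for the real tables; the rows are invented.
prior = pd.DataFrame({
    "user_id":    [1, 1, 1, 2],
    "product_id": [10, 10, 20, 10],
    "days_since_prior_order": [7.0, 14.0, 7.0, 30.0],
})
train = pd.DataFrame({
    "user_id":    [1, 2],
    "product_id": [10, 10],
    "reordered":  [1, 0],
})

# Aggregate the prior data per (user, product) pair ...
up_feats = (prior
            .groupby(["user_id", "product_id"])
            .agg(up_orders=("product_id", "size"),
                 up_mean_gap=("days_since_prior_order", "mean"))
            .reset_index())

# ... then merge those features onto the train orders.
train_feats = train.merge(up_feats, on=["user_id", "product_id"], how="left")
print(train_feats)
```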

The ideas behind the feature creation:

3.1 User features

  • How often the user reordered items
  • Time between orders
  • Time of day the user visits
  • Whether the user ordered organic, gluten-free, or Asian items in the past
  • Features based on order sizes
  • How many of the user’s orders contained no previously purchased items
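A minimal sketch of how a few of these user features could be derived with a pandas `groupby`; the rows are invented and the feature names are my own:

```python
import pandas as pd

# Invented per-item rows: one row per product in a prior order.
prior = pd.DataFrame({
    "user_id":   [1, 1, 1, 1, 2, 2],
    "order_id":  [101, 101, 102, 102, 201, 201],
    "reordered": [0, 0, 1, 1, 0, 1],
})

user_feats = prior.groupby("user_id").agg(
    user_reorder_ratio=("reordered", "mean"),  # how often the user reorders
    user_total_items=("reordered", "size"),    # total items across orders
    user_orders=("order_id", "nunique"),       # number of prior orders
).reset_index()

# Average basket size = items per order.
user_feats["user_avg_basket"] = (user_feats["user_total_items"]
                                 / user_feats["user_orders"])
print(user_feats)
```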

3.2 Item features

  • How often the item is purchased
  • Position in the cart
  • How many users buy it as a “one-shot” item (purchased once and never again)
  • Stats on the number of items that co-occur with this item
  • Stats on the order streak
  • Probability of being reordered within N orders
  • Distribution of the day of week it is ordered
  • Probability it is reordered after the first order
  • Statistics around the time between orders
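A sketch of a few of these item-level features, again on invented rows with Instacart-style column names (`prod_*` feature names are my own):

```python
import pandas as pd

# Invented per-item rows from prior orders.
prior = pd.DataFrame({
    "product_id":        [10, 10, 10, 20, 20],
    "reordered":         [0, 1, 1, 0, 0],
    "add_to_cart_order": [1, 2, 1, 3, 4],
})

item_feats = prior.groupby("product_id").agg(
    prod_orders=("product_id", "size"),               # how often it is purchased
    prod_reorder_prob=("reordered", "mean"),          # probability it is reordered
    prod_mean_cart_pos=("add_to_cart_order", "mean"), # position in the cart
).reset_index()
print(item_feats)
```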

3.3 User x Item features

  • Number of orders in which the user purchases the item
  • Days since the user last purchased the item
  • Streak (number of orders in a row the user has purchased the item)
  • Position in the cart
  • Whether the user already ordered the item today
  • Co-occurrence statistics
  • Replacement items
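The user x item features follow the same pattern with a two-column `groupby`; this sketch derives the order count and "orders since the user last bought the item" on invented data (the `up_*` names are my own):

```python
import pandas as pd

# Invented per-item rows: order_number is the user's running order index.
prior = pd.DataFrame({
    "user_id":      [1, 1, 1, 1],
    "product_id":   [10, 10, 20, 10],
    "order_number": [1, 2, 2, 3],
})

up = prior.groupby(["user_id", "product_id"]).agg(
    up_orders=("order_number", "size"),     # orders containing the item
    up_last_order=("order_number", "max"),  # most recent order with the item
    up_first_order=("order_number", "min"),
).reset_index()

# Orders elapsed since the user last bought the item, relative to the
# user's latest order number (here taken from the same toy data).
latest = (prior.groupby("user_id")["order_number"].max()
          .rename("user_max_order").reset_index())
up = up.merge(latest, on="user_id")
up["up_orders_since_last"] = up["user_max_order"] - up["up_last_order"]
print(up)
```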

4. ML models

ML models that I tried:

  • Logistic Regression
  • Naive Bayes
  • Decision Trees
  • Random Forest
  • Gradient Boosting with XGBoost

Of all these models, the Gradient Boosting model performed best.
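The blog's models were trained with XGBoost on the engineered features; as a self-contained sketch, scikit-learn's `GradientBoostingClassifier` stands in here, and the feature matrix and label are randomly generated rather than the real data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the (user, product) feature matrix; the label
# plays the role of "will the user reorder this product".
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=42)

model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                   max_depth=3, random_state=42)
model.fit(X_tr, y_tr)

f1 = f1_score(y_te, model.predict(X_te))
print("F1:", round(f1, 3))
```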

5. Introducing auto-encoders to the problem: are they of any use?

Auto-encoders are a specific type of feed forward neural networks where the input is the same as the output. They compress the input into a lower-dimensional code and then reconstruct the output from this representation.

We want to reduce the feature dimensionality so that the model trains faster, and auto-encoders can also help reduce noise in the data. Theory aside, though, they do not help much in this scenario: the CV F1 score came out at 0.327, far below that of the Gradient Boosting model. So this was a failed attempt.
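As a dependency-light sketch of the idea (a deep-learning framework such as Keras is the more usual choice): scikit-learn's `MLPRegressor` trained to reproduce its own input behaves as a simple auto-encoder, with the narrow hidden layer as the low-dimensional code. All numbers here are invented:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))  # stand-in feature matrix

# An auto-encoder learns to reproduce its input through a bottleneck;
# here a single 5-unit hidden layer plays the role of the code.
ae = MLPRegressor(hidden_layer_sizes=(5,), activation="relu",
                  max_iter=500, random_state=0)
ae.fit(X, X)  # input == target

# Encode: apply the first layer by hand to get the 5-d representation.
code = np.maximum(0, X @ ae.coefs_[0] + ae.intercepts_[0])
print(code.shape)
```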

6. Trying some approaches from the Kaggle winners

So, to increase my F1 score I tried a few of the Kaggle winners' approaches:

  • Predicting None alongside the other products — the sample output file published on the Kaggle competition page shows that if a user does not reorder anything, None should be predicted. To follow that, we create two models: the one we have already built, which predicts products, and a second one that predicts None (we treat None as just another product and predict it alongside the rest).
  • F1 optimization to improve the F1 score — I have already discussed this approach in Part 1.
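The combination step could look like this sketch, where None simply competes with the products on probability; the product IDs, probabilities and threshold are made up for illustration:

```python
# Hypothetical per-order outputs of the two models: the product model
# gives a probability per candidate product, the second model P(None).
product_probs = {24852: 0.62, 13176: 0.41, 47209: 0.18}
p_none = 0.35
threshold = 0.30  # illustrative cut-off, not a tuned value

# Treat "None" as just another product competing on probability.
picks = [pid for pid, p in product_probs.items() if p >= threshold]
if p_none >= threshold:
    picks.append("None")
print(picks)
```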

Private and public scores on Kaggle:

7. Future works

To further improve our F1 score, we can try to predict the basket size for an order. If we know the basket size, we can pick that many top products from our other model's predictions.

I think this can improve the F1 score significantly.
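A sketch of that idea with hypothetical probabilities: given a predicted basket size k, keep the k highest-probability products instead of thresholding:

```python
# Hypothetical per-product probabilities from the reorder model.
product_probs = {24852: 0.62, 13176: 0.41, 47209: 0.18, 16797: 0.05}

# Pretend a separate model predicted this basket size for the order.
predicted_basket_size = 2

# Take the k most probable products.
top_k = sorted(product_probs, key=product_probs.get,
               reverse=True)[:predicted_basket_size]
print(top_k)
```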

8. References

  1. https://www.kaggle.com/c/instacart-market-basket-analysis
  2. https://www.kaggle.com/c/instacart-market-basket-analysis/discussion/35716
  3. https://arxiv.org/abs/1206.4625
  4. https://www.kaggle.com/mmueller/f1-score-expectation-maximization-in-o-n/
  5. https://www.kaggle.com/kruegger/approximate-caclulation-of-ef1-need-o-n
  6. https://medium.com/kaggle-blog/instacart-market-basket-analysis-feda2700cded
  7. https://www.appliedaicourse.com/

Project code — https://github.com/KuldeepSangwan/InstacartAnalysis

Contact me https://www.linkedin.com/in/kuldeep881/
