AI that Understands FMCG: Smarter Product Classification & Recommendations

The world of Fast-Moving Consumer Goods (FMCG) is defined by large product catalogs, diverse price points, and fierce brand competition. For online retailers like BigBasket, efficiently organizing, classifying, and recommending products isn’t just a technical challenge—it’s a business necessity.

In this blog, we’ll explore how Terno AI can be applied to the BigBasket Beauty & Hygiene dataset to build machine learning pipelines for product classification, similarity detection, and recommendations. Using structured prompts, Terno AI helps automate insights that can power catalog management, personalization, and market strategy.

👉 Dataset & Prompts explored via Terno AI: Conversation Link

1. Dataset Overview

Prompt:
“Describe the dataset in detail, list all columns, their data types, number of missing values, and sample values.”

This prompt ensures a strong foundation by providing a complete understanding of the dataset’s structure. Knowing the column types, null values, and sample records is crucial before applying any analytics or machine learning models. For FMCG businesses, this step helps identify potential data quality issues like missing prices, inconsistent product names, or brand duplication, which can affect model accuracy. It also enables structured planning for preprocessing and feature engineering, ensuring reliable insights.

Insight:

The Beauty & Hygiene dataset captures thousands of SKUs across sub-categories like Hair Care, Makeup, Skin Care, and more.

  • Columns include: SKU Name, Brand, Sub-Category, Sub-Sub-Category, MRP, SKU Size, and About the Product.
  • Data types: mix of categorical (Brand, Sub-Category, Sub-Sub-Category), numeric (MRP, SKU Size), and free-text fields (SKU Name, About the Product).
  • Missing values: Minimal in structured fields; some product descriptions are short or missing.

This blend of structured and unstructured data makes it a strong candidate for machine learning.
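
A profiling step like this can be approximated in a few lines of pandas. The sketch below is illustrative only: the three rows and the brand names are made up, and the column names simply follow the list above. It is not Terno AI's actual output.

```python
import pandas as pd

# Hypothetical sample rows; column names follow the list above.
df = pd.DataFrame(
    {
        "SKU Name": [
            "Herbal Lip Balm, 4.5 g",
            "Anti-Dandruff Shampoo, 180 ml",
            "Matte Lipstick, 3.5 g",
        ],
        "Brand": ["BrandA", "BrandB", "BrandC"],
        "Sub-Category": ["Skin Care", "Hair Care", "Makeup"],
        "MRP": [150.0, 190.0, 299.0],
        "About the Product": [
            "Moisturizing balm for dry lips.",
            None,  # a missing description
            "Long-lasting matte colour.",
        ],
    }
)

# One row per column: data type, missing-value count, and a sample value.
profile = pd.DataFrame(
    {
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "sample": df.iloc[0],
    }
)
print(profile)
```

On the real file, the same summary would show at a glance which fields need cleaning before any modeling starts.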

2. Duplicate & Unique Products

Prompt:
“Show the number of unique product names and check for duplicates.”

Duplicate SKUs can inflate counts and mislead analysis. By detecting them, this prompt ensures data accuracy and clean insights. For example, duplicate listings of the same shampoo SKU with slightly different spellings can bias sales analytics. For FMCG firms, this step reduces noise in machine learning models and ensures reliable classification and recommendation outputs.

Insight:

The dataset contains ~20,000 SKUs, with some duplicate product titles or size variants. Deduplication is crucial for fair modeling, as identical items across sizes should not bias classification or recommendation.
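
A minimal sketch of such a duplicate check, assuming pandas and a handful of made-up product titles. Normalizing case and whitespace first is an assumption of this example, added so that trivially different spellings compare equal.

```python
import pandas as pd

# Hypothetical product titles; the second differs only in case and spacing.
names = pd.Series(
    [
        "Anti-Dandruff Shampoo, 180 ml",
        "anti-dandruff  shampoo, 180 ml ",
        "Anti-Dandruff Shampoo, 340 ml",
        "Herbal Lip Balm, 4.5 g",
    ],
    name="SKU Name",
)

# Lowercase, collapse runs of whitespace, and trim the ends.
normalized = names.str.lower().str.replace(r"\s+", " ", regex=True).str.strip()

n_unique = normalized.nunique()
# keep=False marks every member of a duplicate group, not just the repeats.
duplicate_rows = names[normalized.duplicated(keep=False)]

print(f"{n_unique} unique names out of {len(names)} rows")
print(duplicate_rows.tolist())
```

The 180 ml and 340 ml variants stay distinct here. Whether size variants should be merged is a modeling decision, not something this check settles.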

3. Features for Sub-Category Classification

Prompt:
“Identify columns useful for machine learning classification of sub-category (features like Brand, SKU Size, MRP, About the Product).”

This step identifies predictive features for sub-category classification models. For instance, Brand, MRP, and About the Product are strong indicators of category placement. Businesses can use these models for automated categorization of new products, reducing manual effort and improving catalog management efficiency. Retailers can ensure faster onboarding of products into their systems.

Insight:

To predict a product’s sub-category, the following features are most relevant:

  • Brand (categorical signal for product type)
  • SKU Size (volume/quantity ties to product type)
  • MRP (pricing tiers differ across categories)
  • Text fields: SKU Name and About the Product (rich descriptive signal)
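
Before SKU Size can serve as a numeric feature, it usually has to be parsed out of a string. The helper below is a hypothetical sketch that assumes sizes are written as a number followed by a unit (as in "30 ml"). The real column may contain other formats that need extra handling.

```python
import re

# Assumed format: a number followed by a unit, e.g. "30 ml" or "4.5 g".
SIZE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*([a-zA-Z]+)")

def parse_size(text):
    """Return (value, unit) from a size string, or None if nothing matches."""
    match = SIZE_PATTERN.search(text or "")
    if match is None:
        return None
    return float(match.group(1)), match.group(2).lower()

print(parse_size("30 ml"))        # (30.0, 'ml')
print(parse_size("4.5 G"))        # (4.5, 'g')
print(parse_size("pack of two"))  # None
```

Keeping the unit alongside the value matters: 30 ml and 30 g are not comparable quantities, so a model should see them as different things.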

4. Text Preprocessing Pipeline

Prompt:
“Create a text preprocessing pipeline for the 'SKU Name' and 'About the Product' columns (tokenization, stopword removal, stemming/lemmatization).”

This prompt prepares product text data for machine learning models. Proper preprocessing ensures noise reduction and enhances the model’s ability to recognize meaningful patterns. For FMCG players, this is essential for building recommendation engines, similarity searches, and automated classification tools. It directly improves accuracy and business usability of AI-powered solutions.

Insight:

For free-text fields, a standard NLP pipeline was applied:

  1. Tokenization – breaking sentences into words.
  2. Stopword removal – filtering common words like and, with, for.
  3. Stemming/Lemmatization – reducing words to root forms (moisturizing → moisturize).
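
The three steps can be sketched with the standard library alone. The stopword list and the suffix-stripping stemmer below are deliberately tiny stand-ins invented for this example. A production pipeline would more likely rely on an NLP library's stopword list and a proper stemmer or lemmatizer.

```python
import re

# Small illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "for", "with", "of", "to", "in"}

def crude_stem(token):
    """Strip a few common English suffixes, keeping at least 3 characters."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())     # 1. tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]  # 2. stopword removal
    return [crude_stem(t) for t in tokens]              # 3. crude stemming

print(preprocess("Moisturizing Lip Balm with Shea Butter for Dry Lips"))
# ['moisturiz', 'lip', 'balm', 'shea', 'butter', 'dry', 'lip']
```

Note the difference from lemmatization: this stemmer produces the truncated "moisturiz", whereas a lemmatizer would return the dictionary form "moisturize".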

5. TF-IDF Vectors

Prompt:
“Convert text features into TF-IDF vectors.”

TF-IDF helps transform product text descriptions into numerical representations that machine learning models can use. This step captures important keywords (like “shampoo,” “anti-dandruff,” or “moisturizing”) that define products. For businesses, this is the backbone of recommendation systems and helps match consumer search queries with relevant products, boosting conversions.

Insight:

Both SKU Name and About the Product were transformed into TF-IDF vectors (Term Frequency–Inverse Document Frequency), capturing which words are most distinctive across products.

6. Sub-Category Classification

Prompt:
“Train a classification model to predict sub-category from product description and pricing information, report accuracy.”

This prompt builds and validates a predictive model for product categorization. High accuracy means businesses can automate catalog management, ensuring new products are correctly classified with minimal manual input. For retailers like BigBasket, this translates into faster product onboarding and improved search accuracy for customers.

Insight:

A Logistic Regression model trained on TF-IDF + numeric features achieved ~70–75% accuracy on well-represented sub-categories.
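
A minimal sketch of a TF-IDF + numeric Logistic Regression setup in scikit-learn. The twelve rows are invented, the single "text" column is an assumption of this example, and the accuracy it prints on such a tiny sample says nothing about the figure reported above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny made-up dataset: 3 sub-categories, 4 rows each.
df = pd.DataFrame({
    "text": [
        "Anti-Dandruff Shampoo for oily scalp",
        "Nourishing Hair Conditioner with argan oil",
        "Hair Oil with coconut and amla",
        "Dry Shampoo for instant freshness",
        "Moisturizing Face Cream with aloe vera",
        "Herbal Lip Balm for dry lips",
        "Sunscreen Lotion SPF 50",
        "Face Wash with neem extract",
        "Matte Liquid Lipstick long lasting",
        "Waterproof Mascara volume black",
        "Kajal Eyeliner smudge proof",
        "Compact Powder natural finish",
    ],
    "MRP": [190, 250, 120, 275, 350, 150, 499, 180, 599, 450, 199, 320],
    "Sub-Category": ["Hair Care"] * 4 + ["Skin Care"] * 4 + ["Makeup"] * 4,
})

features = ColumnTransformer([
    ("tfidf", TfidfVectorizer(), "text"),   # text column -> sparse TF-IDF
    ("price", StandardScaler(), ["MRP"]),   # numeric column -> z-scores
])
model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Stratified split so every sub-category appears in both train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    df[["text", "MRP"]], df["Sub-Category"],
    test_size=0.25, stratify=df["Sub-Category"], random_state=0,
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Held-out accuracy: {accuracy:.2f}")
```

Wrapping preprocessing and classifier in one Pipeline keeps the TF-IDF vocabulary and price scaling fitted on training rows only, which avoids leaking test data into the features.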

7. Brand Prediction

Prompt:
“Perform brand prediction based on product title and MRP. Explain the brand prediction results in detail.”

This explores whether AI models can predict the brand identity from product details. If effective, it shows that brand positioning and pricing strategies are distinct enough for recognition. For businesses, it’s a test of brand differentiation—whether their pricing and descriptions clearly stand out. Retailers can use this to check for counterfeit or misclassified products.

Insight:

Brand prediction from product title + MRP was highly accurate: ~98% accuracy after restricting the data to brands with at least 50 SKUs.

👉 This demonstrates that brand identity is strongly encoded in text + price data, making automated brand tagging highly reliable for brands with enough listings to learn from.
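
The "≥50 SKUs" filter is the step most worth making explicit, since it determines which brands the accuracy figure covers. A hypothetical helper in pandas might look like this (the function name and toy data are invented, and a threshold of 2 stands in for 50):

```python
import pandas as pd

def filter_frequent_brands(df, min_skus=50, brand_col="Brand"):
    """Keep only rows whose brand appears at least `min_skus` times."""
    counts = df[brand_col].value_counts()
    frequent = counts[counts >= min_skus].index
    return df[df[brand_col].isin(frequent)]

# Hypothetical toy data; BrandC has a single SKU and is dropped.
df = pd.DataFrame({
    "Brand": ["BrandA", "BrandA", "BrandA", "BrandB", "BrandB", "BrandC"],
    "MRP": [120, 150, 180, 300, 320, 999],
})
filtered = filter_frequent_brands(df, min_skus=2)
print(filtered["Brand"].value_counts().to_dict())
```

Brands removed by the filter are not covered by the reported accuracy, so long-tail brands would need separate evaluation.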

8. Similarity Search with Cosine Similarity

Prompt:
“Use cosine similarity on TF-IDF features to find similar products to a given SKU.”

Finding similar products allows for AI-powered product discovery and recommendation engines. This benefits businesses by cross-promoting similar products and increasing sales. For example, recommending alternative shampoos to consumers searching for a specific one ensures retention even if the product is out of stock. This directly improves customer satisfaction and sales conversions.

Insight:

By applying cosine similarity on TF-IDF vectors, Terno AI could retrieve items whose names and descriptions share distinctive terms with the query. For example:

  • Query: “Himalaya Lip Balm”
  • Similar Products: Other herbal or Ayurvedic lip balms, within the same price bracket.

This is the foundation of “Customers also viewed” recommendation systems.
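
A small sketch of the lookup, assuming scikit-learn and an invented five-item catalog. The helper name most_similar is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-catalog.
catalog = [
    "Herbal Lip Balm with Natural Butter",
    "Ayurvedic Lip Balm Strawberry",
    "Anti-Dandruff Shampoo with Tea Tree Oil",
    "Matte Liquid Lipstick Red",
    "Moisturizing Body Lotion with Aloe Vera",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(catalog)

def most_similar(query_index, top_k=2):
    """Return (index, score) pairs for the top_k items closest to the query item."""
    scores = cosine_similarity(X[query_index], X).ravel()
    scores[query_index] = -1.0  # exclude the query item itself
    ranked = scores.argsort()[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in ranked]

for idx, score in most_similar(0):
    print(f"{score:.2f}  {catalog[idx]}")
```

For the first item, the other lip balm ranks highest because it shares the words "lip" and "balm". The lipstick scores zero: TF-IDF matches on shared tokens, so "lipstick" and "lip" are unrelated as far as this method is concerned.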

9. Recommendations for a Specific SKU

Prompt:
“Generate top 5 product recommendations for a "Bblunt Back To Life Dry Shampoo For Instant Freshness - Spring Fling, 30 ml" SKU based on text similarity and price closeness.”

This prompt builds a real-world recommendation engine combining similarity and pricing logic. It’s important for ensuring recommendations are not only relevant in features but also in consumer budget. For FMCG companies, such recommendations drive cross-sell and upsell opportunities. Retailers can improve personalization, leading to higher basket value per customer.

Insight:

For the SKU:
“Bblunt Back To Life Dry Shampoo For Instant Freshness - Spring Fling, 30 ml”,
Terno AI suggested:

  • Other Bblunt dry shampoos (different variants/fragrances).
  • Competing mini dry shampoos at ~₹250–₹350.

Recommendation logic combined text similarity (dry shampoo keywords) + price closeness, ensuring results are relevant both in content and in price.
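
How the two signals were blended is not spelled out, so the sketch below makes its own assumptions: a weighted sum with a hypothetical text_weight of 0.7, and a price-closeness score of 1 / (1 + relative price gap). Products and prices are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical products and prices (MRP in rupees).
names = [
    "Dry Shampoo Instant Freshness Mini",  # query item
    "Dry Shampoo Floral Mini",
    "Dry Shampoo Salon Size",
    "Nourishing Hair Conditioner",
    "Anti-Dandruff Shampoo",
]
mrp = np.array([250.0, 275.0, 900.0, 260.0, 240.0])

def recommend(query_index, top_n=5, text_weight=0.7):
    """Rank items by a weighted blend of text similarity and price closeness."""
    X = TfidfVectorizer().fit_transform(names)
    text_sim = cosine_similarity(X[query_index], X).ravel()
    # Price closeness in (0, 1]: 1 for an identical price, lower as the gap grows.
    rel_gap = np.abs(mrp - mrp[query_index]) / mrp[query_index]
    price_sim = 1.0 / (1.0 + rel_gap)
    score = text_weight * text_sim + (1.0 - text_weight) * price_sim
    score[query_index] = -np.inf  # never recommend the item itself
    order = np.argsort(score)[::-1][:top_n]
    return [int(i) for i in order if np.isfinite(score[i])]

print([names[i] for i in recommend(0, top_n=3)])
```

The weight is a tuning knob: raising text_weight favors close textual matches even at very different prices, while lowering it pulls similarly priced but less related items up the list.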

10. Product Clustering & Visualization

Prompt:
“Perform clustering of products based on combined text and price features to identify natural product groupings. Visualize product clusters using t-SNE or PCA plots.”

Clustering helps discover hidden product groupings in the dataset, such as value shampoos, premium skincare, or mid-range cosmetics. This is important for businesses to detect competitive landscapes and pricing gaps. For retailers, it helps in shelf arrangement, catalog organization, and better targeting of promotions by consumer segment.

Insight:

Products were clustered using:

  • TF-IDF features (reduced with SVD)
  • MRP normalized
  • K-Means (k=10)
  • Visualized with PCA/t-SNE plots

Clusters naturally grouped items into categories like lip care, shampoos, fragrances, and skincare—offering insights for catalog structuring and competitive positioning.
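
The same sequence of steps can be sketched with scikit-learn on a toy catalog. The eight products are invented, and the small values used here (3 SVD components, k=3) are scaled down to fit the sample. They are not the settings of the original run.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical products and prices.
names = [
    "Herbal Lip Balm", "Strawberry Lip Balm", "Tinted Lip Balm",
    "Anti-Dandruff Shampoo", "Dry Shampoo Mini", "Herbal Shampoo",
    "Floral Eau de Parfum", "Woody Eau de Parfum",
]
mrp = np.array([150.0, 120.0, 199.0, 190.0, 275.0, 160.0, 1200.0, 1500.0])

# Text branch: TF-IDF, then SVD down to a few dense components.
tfidf = TfidfVectorizer().fit_transform(names)
text_features = TruncatedSVD(n_components=3, random_state=0).fit_transform(tfidf)

# Price branch: standardize MRP so it is on a scale comparable to the text components.
price_feature = StandardScaler().fit_transform(mrp.reshape(-1, 1))

features = np.hstack([text_features, price_feature])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# 2-D coordinates for a scatter plot (PCA here; t-SNE is the other option named above).
coords = PCA(n_components=2).fit_transform(features)
for name, label, (x, y) in zip(names, labels, coords):
    print(f"cluster {label}  ({x:+.2f}, {y:+.2f})  {name}")
```

The coordinates can be passed to any plotting library and colored by cluster label. Without normalizing MRP first, the price column would dominate the distance calculation and the text signal would barely register.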

11. Feasibility & Quality of Automation

Prompt:
“Summarize the feasibility and quality of automated product classification and recommendation for this dataset.”

This concluding step evaluates whether AI solutions are practically viable using the dataset. It highlights challenges like inconsistent data quality or opportunities like strong brand categorization. For real-world decisions, this insight helps businesses decide on AI adoption strategy—whether to invest further in automation or to improve data collection first.

Insight:

  • Classification: Strong potential for brand and sub-category tagging, though fine-grained classes need advanced handling.
  • Recommendation: High-quality out-of-the-box with TF-IDF + price.
  • Clustering: Useful for catalog curation, competitive mapping, and identifying under-served niches.

Conclusion

This exercise shows how Terno AI can unlock machine learning capabilities for FMCG datasets:

  • Sub-Category classification is achievable with careful handling of class imbalance.
  • Brand prediction is highly accurate, making it ideal for catalog hygiene.
  • Product recommendations work well using simple TF-IDF + price blending.
  • Clustering visualizations reveal natural product groupings for merchandising strategy.

For retailers like BigBasket, these pipelines can drive better personalization, faster catalog management, and sharper competitive insights—all powered by Terno AI.
