Classifying Recipe Popularity and Rating with Imbalanced Data

By Weijie Zhang (wez042@ucsd.edu)

Project Overview

This data science project aims to explore some of the characteristics of popular and highly rated food recipes. The data comes from food.com and was originally scraped and used by the authors of this recommender system paper.

Introduction
Data Cleaning and EDA
Assessment of Missingness
Hypothesis Testing
Framing a Prediction Problem
Baseline Model: A Simple Approach
Final Model: Balanced Random Forest & Binary Classification
Fairness Analysis

Introduction

Cooking and sharing recipes online has become widely popular and a significant part of many people’s lives. With the convenience of online recipe platforms like food.com, millions of users have access to a vast collection of recipes spanning a wide variety of cuisines. Among all the recipes available, however, some stand out as both particularly popular and highly rated.

In this project, we dive into an extensive collection of food recipe data sourced from food.com. The first dataset contains recipes posted on food.com from 2008 to 2018, with detailed information about each recipe such as ingredients, preparation time, and nutritional values. It contains 83782 observations, meaning that there are 83782 unique recipes in our data. The second dataset contains the review comments and ratings that users submitted for the recipes in the first dataset, for a total of 731927 reviews. Some recipes have many reviews while others have only a few.

Let’s take a closer look at the information we have. Below is the description of some of the relevant columns in the first dataset:

Column	Description
`name`	Recipe name
`minutes`	Minutes to prepare food
`submitted`	Date recipe was uploaded
`tags`	Food.com tags for recipe
`nutrition`	Nutrition information for the recipe, including calories, total fat, sugar, sodium, protein, saturated fat, and carbohydrates
`steps`	Steps to make the recipe
`description`	Recipe description
`ingredients`	List of ingredients for recipe

Let’s take a look at the second dataset:

Column	Description
`date`	Date the review was submitted
`rating`	Rating given by the user
`review`	Review comment given by the user

By combining these two datasets, we aim to answer the following research questions throughout the project: What are some of the characteristics of recipes that are both popular and highly rated? What category does a recipe fall into based on its predicted popularity level and average rating score?

Category 1: Low review count, low average rating

Category 2: Low review count, high average rating

Category 3: High review count, low average rating

Category 4: High review count, high average rating

Note: the threshold separating high from low review counts is 10 reviews, and the threshold separating high from low average rating scores is 4.8.
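
As a rough illustration of how the two thresholds map a recipe to one of the four categories, here is a minimal sketch. The function name `assign_category` and the toy values are hypothetical, and the comparison directions (strictly more than 10 reviews, a rating of at least 4.8) are assumptions rather than something the note above fixes.

```python
# Thresholds from the note above; the comparison directions are assumptions.
REVIEW_THRESHOLD = 10
RATING_THRESHOLD = 4.8


def assign_category(n_review, rating):
    """Map a review count and an average rating to a category from 1 to 4."""
    high_review = n_review > REVIEW_THRESHOLD   # assumed: strictly more than 10
    high_rating = rating >= RATING_THRESHOLD    # assumed: 4.8 or above is "high"
    # low/low -> 1, low/high -> 2, high/low -> 3, high/high -> 4
    return 1 + int(high_rating) + 2 * int(high_review)


# Hypothetical (n_review, rating) pairs, not taken from the real dataset.
toy_recipes = [(1, 4.0), (2, 5.0), (25, 4.5), (40, 4.9)]
categories = [assign_category(n, r) for n, r in toy_recipes]
print(categories)
```

Encoding the category as an integer keeps the label compact while still letting each of the two underlying yes/no decisions be recovered from it.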

Through a rigorous analysis of these datasets, we will explore univariate and bivariate relationships between key variables, investigate missing values in the dataset, conduct hypothesis testing to answer the question of interest, build and evaluate a classifier model, and ultimately conduct fairness analysis for the final model. This project will provide valuable insights into understanding the key characteristics of certain recipes that stand out as exceptional and loved by many.

Data Cleaning and EDA

Data Cleaning

We will perform the following data cleaning steps to combine and transform our dataset into tidy format, before diving into analysis.

Convert data type: We convert the steps, tags, and ingredients columns from the string data type to the list data type, which makes it easier to access each individual element. We also convert the submitted column to datetime format.
Expand the nutrition column: We expand the nutrition column in the DataFrame so that each piece of nutritional information (calories, total fat, sugar, etc.) sits in its own column. This allows us to access and analyze each quantitative column separately.
Merge the two datasets into one: We merge the review dataset into the recipes dataset so that each unique recipe has an average rating and a review list in which each element is a review comment given by a user. We also add a new column called n_review that counts the number of reviews a recipe has. We will use this count to represent the popularity of the recipe.
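
The three cleaning steps above could be sketched roughly as follows. This is only an illustration on a hypothetical two-recipe sample: the `recipe_id` column name in the reviews table is an assumption, and the real cleaning code may differ in its details.

```python
import ast

import pandas as pd

# Hypothetical stand-ins for the raw recipes and reviews tables.
recipes = pd.DataFrame({
    "id": [1, 2],
    "tags": ["['course', 'desserts']", "['main-dish']"],
    "submitted": ["2008-10-27", "2011-04-11"],
    "nutrition": ["[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]",
                  "[595.1, 46.0, 211.0, 22.0, 13.0, 51.0, 26.0]"],
})
reviews = pd.DataFrame({
    "recipe_id": [1, 2, 2],   # assumed name of the key column
    "rating": [4, 5, 5],
    "review": ["pretty good", "great", "loved it"],
})

# 1. Convert data types: string-encoded lists to lists, strings to timestamps.
recipes["tags"] = recipes["tags"].apply(ast.literal_eval)
recipes["submitted"] = pd.to_datetime(recipes["submitted"])

# 2. Expand nutrition into one column per value.
nutrition_cols = ["calories", "total_fat", "sugar", "sodium",
                  "protein", "saturated_fat", "carbohydrates"]
nutrition = pd.DataFrame(recipes["nutrition"].apply(ast.literal_eval).tolist(),
                         columns=nutrition_cols, index=recipes.index)
recipes = pd.concat([recipes.drop(columns="nutrition"), nutrition], axis=1)

# 3. Aggregate the reviews per recipe, then merge them into the recipes table.
per_recipe = (reviews.groupby("recipe_id")
              .agg(rating=("rating", "mean"),
                   n_review=("review", "count"),
                   reviews=("review", list))
              .reset_index())
cleaned = recipes.merge(per_recipe, left_on="id", right_on="recipe_id", how="left")
print(cleaned[["id", "calories", "rating", "n_review"]])
```

A left merge keeps every recipe in the result even when it has no matching reviews.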

After data cleaning, we have the following DataFrame, which is ready for analysis:

name	id	minutes	contributor_id	submitted	tags	n_steps	steps	description	ingredients	n_ingredients	calories	total_fat	sugar	sodium	protein	saturated_fat	carbohydrates	rating	n_review	reviews
1 brownies in the world best ever	333281	40	985201	2008-10-27	[‘60-minutes-or-less’, ‘time-to-make’, ‘course’, ‘main-ingredient’]	10	[‘heat the oven to 350f and arrange the rack in the middle …	these are the most; chocolatey, moist, rich, dense, fudgy, delicious brownies …	[‘bittersweet chocolate’, ‘unsalted butter’, ‘eggs’, ‘granulated sugar’]	9	138.4	10.0	50.0	3.0	3.0	19.0	6.0	4.0	1.0	[‘These were pretty good, but took forever to bake …
1 in canada chocolate chip cookies	453467	45	1848091	2011-04-11	[‘60-minutes-or-less’, ‘time-to-make’, ‘cuisine’, ‘preparation’, ‘north-american’ …	12	[‘pre-heat oven the 350 degrees f, in a mixing bowl …	this is the recipe that we use at my school cafeteria for chocolate chip cookies …	[‘white sugar’, ‘brown sugar’, ‘salt’, ‘margarine’, ‘eggs’]	11	595.1	46.0	211.0	22.0	13.0	51.0	26.0	5.0	1.0	[‘Originally I was gonna cut the recipe in half (just the 2 of us here) …
412 broccoli casserole	306168	40	50969	2008-05-30	[‘60-minutes-or-less’, ‘time-to-make’, ‘course’, ‘main-ingredient’]	6	[‘preheat oven to 350 degrees, spray a 2 quart baking dish with cooking spray …	since there are already 411 recipes for broccoli casserole …	[‘frozen broccoli cuts’, ‘cream of chicken soup’, ‘sharp cheddar cheese’]	9	194.8	20.0	6.0	32.0	22.0	36.0	3.0	5.0	4.0	[‘This was one of the best broccoli casseroles that I have ever made …
millionaire pound cake	286009	120	461724	2008-02-12	[‘time-to-make’, ‘course’, ‘cuisine’, ‘preparation’, ‘occasion’, ‘north-american’]	7	[‘freheat the oven to 300 degrees, grease a 10-inch tube pan with butter …	why a millionaire pound cake? because it’s super rich …	[‘butter’, ‘sugar’, ‘eggs’, ‘all-purpose flour’, ‘whole milk’]	7	878.3	63.0	326.0	13.0	20.0	123.0	39.0	5.0	1.0	[‘don’t let the calories and fat grams scare you off …
2000 meatloaf	475785	90	2202916	2012-03-06	[‘time-to-make’, ‘course’, ‘main-ingredient’, ‘preparation’, ‘main-dish’ …	17	[‘pan fry bacon , and set aside on a paper towel to absorb excess grease …	ready, set, cook! special edition contest entry: …	[‘meatloaf mixture’, ‘unsmoked bacon’, ‘goat cheese’, ‘unsalted butter’]	13	267.0	30.0	12.0	12.0	29.0	48.0	2.0	5.0	2.0	[‘Delicious!!!!! – the goat cheese made the difference …

Univariate Analysis

In the univariate analysis, we will look at the distribution of the calories and the distribution of the rating in food recipes in the dataset.

Distribution of Calories

(This is a histogram of the distribution of the calories of food recipes in the dataset.)

We only included 98% of the data values from the calories column for better visualization. From looking at the histogram, we see that most of the calorie values range from 0 to 1000, with fewer data values towards the right side of the plot. The distribution shows right-skewness, indicating that a significant proportion of food recipes fall within a typical calorie range, with some recipes featuring higher calorie counts.

Distribution of Rating

(This is a histogram of the distribution of the rating of food recipes in the dataset.)

As we look at the histogram, the majority of recipes have an average rating above 3.0, with only a very small minority falling below this threshold. If we want to develop a classification model to distinguish recipes based on good and bad ratings, we must address the issue of imbalanced data. This is something we should consider when building our final classification model.

Bivariate Analysis

For bivariate analysis, we will look at the relationship between sugar percent daily value and rating range, as well as the relationship between calories percent daily value and the number of ingredients in food recipes in the dataset.

Sugar vs. Rating

(This is a box plot featuring the relationship between rating and sugar percent daily value in food recipes in the dataset.)

In crafting this box plot, we only included 95% of the data values from the sugar column to exclude extreme values, and we grouped ratings into bins for better visualization. Looking at the box plot, we note a slight upward trend in the sugar daily value as the average rating decreases, particularly in the third quartile and the upper fence. This observation suggests a potential negative correlation between sugar content and average rating, indicating that recipes with lower sugar content might tend to receive higher ratings, and vice versa.

Calories vs. Number of Ingredients

(This is a scatter plot featuring the relationship between calorie percent daily value and the number of ingredients in food recipes in the dataset.)

In making this visualization, we only included 99% of the data values from the calories column to exclude extreme values, grouped ratings into bins, and introduced random noise to the n_ingredients column, which is a discrete variable, for better visualization. From the scatter plot, we see a weak positive correlation between the calorie daily value and the number of ingredients in recipes. This finding aligns with our expectation that using a wider variety of ingredients in a recipe would likely result in a higher calorie content.

Data Aggregation

So far, we have only looked at the overall distribution of one variable and the joint distribution of two variables. For a better understanding of our data, we construct the following pivot table, which shows the mean of each quantitative variable conditional on the number of ingredients. Reading each column of the pivot table from top to bottom, we can observe a generally increasing trend in quantitative variables such as calories, carbohydrates, minutes, n_steps, protein, saturated_fat, sodium, and total_fat as the number of ingredients increases. This suggests that n_ingredients might have a positive correlation with these quantitative variables.

The pivot table is shown below.

n_ingredients	calories	carbohydrates	minutes	n_steps	protein	saturated_fat	sodium	sugar	total_fat
1	288.77	10.2	47.6	7.3	10.8	25	12.9	100.1	21.8
2	238.312	8.728	56.131	5.711	13.18	18.211	7.699	62.322	15.27
3	233.924	8.413	42.754	5.485	13.873	19.791	10.844	55.622	15.406
4	263.534	9.084	40.107	6.135	16.943	24.627	12.858	60.599	18.164
5	282.587	9.288	49.087	7.104	19.914	26.891	15.712	54.497	20.726
…	…	…	…	…	…	…	…	…	…
29	886.3	34	86.222	21.222	78	72.333	59.444	141.778	60
30	631.656	17.333	116.667	20	55.222	62.222	51	41.222	53.222
31	502.05	14.667	197.5	22.333	47.667	45	46.5	44.667	39.833
32	697.35	18.5	55	34	66	87.5	53	30.5	58.5
33	338.2	14	35	6	8	12	16	18	25
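
A pivot table of this shape could be produced along the following lines. This is a minimal sketch on a hypothetical four-row sample with only two of the quantitative columns; it is not the code that generated the table above.

```python
import pandas as pd

# Hypothetical sample; the real table uses every quantitative column.
toy = pd.DataFrame({
    "n_ingredients": [3, 3, 5, 5],
    "calories": [200.0, 300.0, 400.0, 600.0],
    "minutes": [20, 40, 30, 50],
})

# Mean of each quantitative column, conditional on the number of ingredients.
pivot = toy.pivot_table(index="n_ingredients",
                        values=["calories", "minutes"],
                        aggfunc="mean").round(3)
print(pivot)
```

Note that groups with very few recipes, as is plausible for the largest ingredient counts, produce noisy means, so the tail of such a table should be read with care.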

To get a visual representation of the above data, we converted the pivot table into a series of line plots.

The line plots illustrate the relationship between the number of ingredients and the other quantitative columns: calories, carbohydrates, minutes, n_steps, protein, saturated_fat, sodium, and total_fat. We can observe clear increasing trends across all plots, which aligns with our observation from the pivot table. These findings suggest a positive correlation between the number of ingredients and these quantitative variables, indicating that, on average, a recipe with more ingredients has higher nutritional values, a longer preparation time, and a greater number of steps.

Assessment of Missingness

Not Missing At Random (NMAR) Analysis

We believe that the missing values in the rating column are Not Missing at Random (NMAR), meaning that the chance of a value being missing depends on the missing value itself. Upon browsing the food.com website, we found that some reviews in the comment section of a recipe were submitted without a rating (shown in the image below). In these cases, users opted to leave a review comment without giving the recipe a rating score. As a result of this data generating process, certain recipes have missing rating values.

NMAR example

Missingness Dependency Analysis

Besides the rating column, we also found missing values in the description column. We hypothesize that the missingness of description is Missing at Random (MAR), meaning that the chance of a value being missing depends on some other columns. We will investigate the missing dependency of description using a permutation test.

We hypothesize that the missingness of the description column depends on the sugar column, meaning that there is a systematic difference between the distribution of sugar for recipes that are missing a description and the distribution for those that are not.

Set up:

Does the missingness of description depend on the sugar column?

Null Hypothesis: the missingness of description does not depend on sugar.

Alternative Hypothesis: the missingness of description does depend on sugar.

Significance Level: 0.05

Let’s look at the distribution of sugar, conditional on the missingness of the description column.

In the plot above, the orange line represents the distribution of sugar when description is not missing, while the blue line represents the distribution of sugar when description is missing. We can see that these two distributions of sugar, conditional on the missingness of description, look quite different. To quantify this difference, we use the Kolmogorov-Smirnov (K-S) statistic, rather than the absolute difference in means, as the test statistic for the permutation test.

We shuffle the missingness of the description 1000 times and get 1000 simulated K-S test statistics of the sugar column during the permutation test. The permutation result is shown as below.

From the plot above and the result of the permutation test, the critical value of the simulated statistics at the 0.05 significance level is 0.16, and our observed test statistic is 0.18, which exceeds that critical value. The p-value is 0.019, which is less than the significance level of 0.05, so we reject the null hypothesis.

Thus, we conclude that the missingness of description likely depends on the sugar column.
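
The permutation test described above can be sketched as follows. This is an illustrative version on synthetic data, not the code behind the reported numbers: the function names are hypothetical, and the K-S statistic is computed directly with NumPy as the largest gap between the two empirical CDFs, which is the same quantity that `scipy.stats.ks_2samp` reports as its statistic.

```python
import numpy as np


def ks_statistic(a, b):
    """Two-sample K-S statistic: the largest gap between the two empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))


def permutation_test_ks(values, is_missing, n_repetitions=1000, seed=0):
    """Shuffle the missingness indicator and compare K-S statistics."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    is_missing = np.asarray(is_missing, dtype=bool)
    observed = ks_statistic(values[is_missing], values[~is_missing])
    simulated = np.empty(n_repetitions)
    for i in range(n_repetitions):
        shuffled = rng.permutation(is_missing)
        simulated[i] = ks_statistic(values[shuffled], values[~shuffled])
    # Share of simulated statistics at least as extreme as the observed one.
    p_value = float(np.mean(simulated >= observed))
    return observed, p_value


# Synthetic data standing in for sugar and the description-missing indicator.
rng = np.random.default_rng(42)
sugar = np.concatenate([rng.exponential(30, 300), rng.exponential(90, 60)])
missing = np.array([False] * 300 + [True] * 60)
observed, p_value = permutation_test_ks(sugar, missing)
print(observed, p_value)
```

The K-S statistic is sensitive to differences in shape as well as location, which is why it can separate two distributions whose means are close.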

We hypothesize that the missingness of the description column does not depend on the minutes column, meaning that the distribution of minutes when description is missing and the distribution of minutes when description is not missing are alike, and any difference is due to random chance.

Set up:

Does the missingness of description depend on the minutes column?

Null Hypothesis: the missingness of description does not depend on minutes.

Alternative Hypothesis: the missingness of description does depend on minutes.

Significance Level: 0.05

Let’s look at the distribution of minutes, conditional on the missingness of the description column.

In the plot above, the orange line illustrates the distribution of minutes when description is not missing, while the blue line illustrates the distribution of minutes when description is missing. The two distributions have similar shapes and are centered at a similar location, so the difference in means may not effectively capture any difference between them. We therefore use the Kolmogorov-Smirnov (K-S) statistic as the test statistic for the permutation test.

We shuffle the missingness of the description 1000 times and get 1000 simulated K-S test statistics of the minutes column during the permutation test. The permutation result is shown below.

From the plot above and the result of the permutation test, the critical value of the simulated statistics at the 0.05 significance level is 0.15, and our observed test statistic is 0.1, which is smaller than that critical value. The p-value is 0.408, which is greater than the significance level of 0.05, so we fail to reject the null hypothesis.

Thus, we conclude that the missingness of description likely does not depend on the minutes column.

Hypothesis Testing

The question we explore in this section is the following: Do popular recipes (those with a high review count) have a lower sugar level than less popular ones (those with a low review count)?

Recall that we define a high review count threshold as having more than 10 reviews. We will conduct a permutation test to see if the distribution of sugar levels for popular recipes and the distribution of sugar levels for non-popular recipes are similar.

Set up:

Null Hypothesis: Recipes with a high review count do not have a lower sugar level than those with a low review count. Any observed differences in our samples are merely due to random chance.

Alternative Hypothesis: Recipes with a high review count have a lower sugar level than those with a low review count. The difference observed in our samples cannot be explained by random chance alone.

Test Statistic: Since our variable of interest is numerical and our alternative hypothesis is directional (a one-tailed test), we will use the difference in means as the test statistic for the permutation test.

Significance Level: 0.05

We created a new column called is_popular which is true if the recipe has a review count of more than 10, and false otherwise.

This is a sample of the dataframe that we are going to perform the permutation test on.

	sugar	is_popular
1092	36	False
44285	4	False
60708	18	False
56138	48	False
628	0	True

We shuffle the is_popular column 1000 times and compute the difference in mean sugar for each shuffle, which gives 1000 simulated test statistics. The empirical distribution of the permutation test results is shown below.

P-value: 0.0

From the graph above, we can see that our observed difference in means, 14.58, is greater than the critical value of the simulated differences at the 0.05 significance level, which is 6.05. This suggests that the observed statistic in our sample is not merely coincidental. Furthermore, the p-value obtained from our permutation test is 0.0 (none of the 1000 simulated statistics was as extreme as the observed one), which falls below our significance level of 0.05. Therefore, we reject our null hypothesis in favor of our alternative hypothesis: recipes with high review counts likely have a lower sugar level than those with low review counts.

This result seems reasonable: food with a lower sugar level is generally considered healthier, so people might be more inclined to make, review, and rate highly recipes with lower sugar content.
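
The one-tailed permutation test above can be sketched as follows. This is an illustration on synthetic data with hypothetical function names. The sign convention (mean sugar of non-popular recipes minus mean sugar of popular recipes, so that large positive values support the alternative) is an assumption, chosen to be consistent with the positive observed difference reported above.

```python
import numpy as np


def diff_in_means(sugar, is_popular):
    """Mean sugar of non-popular recipes minus mean sugar of popular recipes."""
    return float(sugar[~is_popular].mean() - sugar[is_popular].mean())


def one_sided_permutation_test(sugar, is_popular, n_repetitions=1000, seed=0):
    """Shuffle the popularity labels and recompute the difference in means."""
    rng = np.random.default_rng(seed)
    sugar = np.asarray(sugar, dtype=float)
    is_popular = np.asarray(is_popular, dtype=bool)
    observed = diff_in_means(sugar, is_popular)
    simulated = np.array([
        diff_in_means(sugar, rng.permutation(is_popular))
        for _ in range(n_repetitions)
    ])
    # One-tailed: only large positive differences count as extreme.
    p_value = float(np.mean(simulated >= observed))
    return observed, p_value


# Synthetic data: popular recipes are given a lower sugar level on average.
rng = np.random.default_rng(1)
toy_sugar = np.concatenate([rng.exponential(60, 400), rng.exponential(45, 40)])
toy_popular = np.array([False] * 400 + [True] * 40)
observed_diff, p_value = one_sided_permutation_test(toy_sugar, toy_popular)
print(observed_diff, p_value)
```

With 1000 repetitions, a reported p-value of 0.0 means only that no simulated statistic reached the observed one, so the true p-value is small but not literally zero.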

Framing a Prediction Problem

Recall from the introduction that we are interested in the following problem:

What category does a recipe fall into based on its predicted popularity level and average rating score?

Category 1: Low review count, low average rating

Category 2: Low review count, high average rating

Category 3: High review count, low average rating

Category 4: High review count, high average rating

Note: the threshold separating high from low review counts is 10 reviews, and the threshold separating high from low average rating scores is 4.8.

Specifically, we want to classify recipes into the above 4 categories based on all other information we have.

The prediction problem we are addressing is a multi-class classification problem, since our goal is to classify recipes into one of four distinct categories. These categories are determined by two variables: rating, which splits recipes by their average rating, and n_review, which splits recipes by the number of reviews they have. Thus, each recipe has either a low or a high review count and either a low or a high average rating, resulting in four unique class labels.

To generate features for our classification model, we will use all other columns except rating and n_review, since these two variables are used to create the class labels. These are the features available to us at the time of prediction for classifying recipes by popularity and average rating.

The key metrics we will be using for evaluating our classifier model performance are accuracy, precision, recall, and f1-score:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Accuracy represents the proportion of correctly classified instances among all observations. It tells us the overall performance of how the model classifies recipes into each category. However, accuracy does not tell the full story, especially when dealing with imbalanced data.

$$\text{Precision} = \frac{TP}{TP + FP}$$

Precision measures the proportion of predicted positive instances that are correctly classified. It tells us how good the model is at avoiding false positive predictions. In our context, a false positive occurs when a low-rating recipe is mistakenly classified into a high-rating category. A high precision means that such misclassifications are rare.

$$\text{Recall} = \frac{TP}{TP + FN}$$

Recall measures the proportion of actual positive instances that are correctly classified. It tells us how good the model is at identifying the positive instances that are present without missing too many of them. In our context, it reflects the model’s ability to classify high-rating recipes into the high-rating categories. A high recall means that few high-rating recipes are missed.

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

F1-score, being the harmonic mean of precision and recall, provides a balanced summary of the model’s predictive power, considering both predicted positive and actual positive.
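
To make the four metrics concrete, here is a small worked example for a single positive class. The function name `binary_metrics` and the toy labels are hypothetical. In the multi-class setting, each class is treated in turn as the positive class and the per-class scores are then averaged.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is empty (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels: 4 actual positives, of which 2 are found, plus 1 false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here the accuracy is 0.7 even though half of the actual positives are missed (recall 0.5), which illustrates why accuracy alone can be misleading on imbalanced data.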

Baseline Model: A Simple Approach

We split our data into training and testing sets, stratifying on the class label. The training set contains 80% of our data and the testing set contains the remaining 20%. We will use the testing set to evaluate how well our model generalizes to unseen data:

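from sklearn.model_selection import train_test_split
# Hold out 20% of the recipes, stratified on the class label
X_train, X_test, y_train, y_test = train_test_split(recipe.drop(['class', 'rating', 'n_review'], axis=1), 
                                                    recipe['class'], test_size=0.2, stratify=recipe['class'])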

Feature Engineering

We perform the following feature engineering steps to transform our variables before fitting our model. We use RobustScaler instead of StandardScaler on most numerical columns because those columns contain extreme values. RobustScaler centers the data on the median and scales it by the interquartile range, rather than using the mean and standard deviation, which makes it less sensitive to outliers.

Quantitative Features:

minutes, protein, sodium, saturated_fat, total_fat, carbohydrates

Type: quantitative continuous
Feature Transformation: use RobustScaler to reduce the impact of outliers

n_steps, n_ingredients

Type: quantitative discrete
Feature Transformation: passthrough

time

Type: quantitative
Feature Transformation: extract year, month, and day from submitted timestamp column

Categorical Features:

calories, sugar

Type: quantitative to nominal
Feature Transformation: categorize calories and sugar into 8 bins and do one-hot encoding

recipe_complexity

Type: nominal
Feature Transformation: binarize n_steps and n_ingredients using the threshold of 10 to represent recipe complexity

We chose to build features from the above columns because we believe they may carry some signal for predicting a recipe’s popularity and average rating.
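
Two of the custom transformations listed above, the date extraction and the complexity binarization, could look roughly like this. Both functions are hypothetical reconstructions for illustration: the original helper implementations are not shown here, and whether a value equal to the threshold of 10 counts as complex is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def extract_date_parts(df):
    """Split the submitted timestamp into year, month and day columns."""
    ts = pd.to_datetime(df["submitted"])
    return pd.DataFrame({"year": ts.dt.year, "month": ts.dt.month,
                         "day": ts.dt.day}, index=df.index)


def recipe_complexity(df, threshold=10):
    """Flag n_steps and n_ingredients values above the threshold (assumed: strictly above)."""
    return (df[["n_steps", "n_ingredients"]] > threshold).astype(int)


# Hypothetical two-recipe sample.
toy = pd.DataFrame({
    "submitted": ["2008-10-27", "2012-03-06"],
    "n_steps": [10, 17],
    "n_ingredients": [9, 13],
})

# Wrapping plain functions lets them be used as steps in a scikit-learn pipeline.
date_parts = FunctionTransformer(extract_date_parts).fit_transform(toy[["submitted"]])
complexity = FunctionTransformer(recipe_complexity).fit_transform(toy)
print(date_parts)
print(complexity)
```

Named functions are also easier to inspect and to pickle than lambdas when they appear inside a pipeline.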

Baseline Model Building and Performance Evaluation

For the baseline model, we use a Random Forest classifier.

The main idea of the Random Forest algorithm is to fit n decision trees, each trained on a bootstrap sample of the data (bagging) and considering a random subset of features at each split, and then to predict by taking a vote among those n trees. This is an example of ensemble learning.

Here is the pipeline of our baseline model, in which we first transform the columns and then fit a RandomForestClassifier.

Pipeline(steps=[('col_trans',
                 ColumnTransformer(transformers=[('outlier', RobustScaler(),
                                                  ['minutes', 'protein',
                                                   'sodium', 'saturated_fat',
                                                   'total_fat',
                                                   'carbohydrates']),
                                                 ('pass', 'passthrough',
                                                  ['n_steps', 'n_ingredients']),
                                                 ('to_bin',
                                                  Pipeline(steps=[('outlier',
                                                                   RobustScaler()),
                                                                  ('to_bins',
                                                                   KBinsDiscretizer(n_bins=8))]),
                                                  ['calories', 'sugar']),
                                                 ('time',
                                                  Pipeline(steps=[('time',
                                                                   FunctionTransformer(func=<function <lambda> at 0x2b514b550>))]),
                                                  ['submitted']),
                                                 ('complexity',
                                                  Pipeline(steps=[('complex',
                                                                   FunctionTransformer(func=<function recipe_complexity at 0x2b5056ee0>))]),
                                                  ['n_steps',
                                                   'n_ingredients'])])),
                ('clf', RandomForestClassifier())])

After fitting our training data into the baseline model, we evaluate our model using testing data. The confusion matrix below is the result of the prediction of testing data.

By looking at the confusion matrix, we can see that the baseline model misclassifies lots of recipes into class labels 1 and 2. In addition, it is difficult to accurately classify recipes into class labels 3 and 4, which correspond to high review counts. This is caused by the imbalanced nature of class labels in our dataset. Recipes with high review counts are relatively rare and constitute a minority group of data, whereas recipes with low review counts are more prevalent and make up a majority of the data. This is something we will address in building our final model.

The result of counting all the class labels in our dataset, showing imbalanced data:

2    48979
1    28825
4     1834
3     1535
Name: class, dtype: int64

Let’s look at the precision, recall, and f1-score of our baseline model for predicting unseen data.

	precision	recall	f1-score	support
1	0.42	0.13	0.20	5765
2	0.61	0.91	0.73	9796
3	0.00	0.00	0.00	307
4	0.00	0.00	0.00	367
accuracy			0.59	16235
macro avg	0.26	0.26	0.23	16235
weighted avg	0.52	0.59	0.51	16235

We see that the model has a precision, recall, and f1-score of 0 for class labels 3 and 4, while it has a high f1-score for class label 2 because most recipes belong to class 2. Our model accuracy is 59%. Overall, the performance of our baseline model is not as good as we hoped, since it fails to identify popular recipes, which are the ones we are most interested in. We will improve on the baseline model so that it correctly classifies more popular recipes.

Final Model: Balanced Random Forest & Binary Classification

We will use the same training and testing data as in the baseline model.

Feature Engineering

The following features are from the baseline model

minutes, protein, sodium, saturated_fat, total_fat, carbohydrates

Type: quantitative continuous
Feature Transformation: use RobustScaler to reduce the impact of outliers

n_steps, n_ingredients

Type: quantitative discrete
Feature Transformation: passthrough

time

Type: quantitative
Feature Transformation: extract year, month, and day from submitted timestamp column

recipe_complexity

Type: nominal
Feature Transformation: binarize n_steps and n_ingredients to represent recipe complexity

The following text features are added for the final model:

description

Type: text data
Feature Transformation:
Build a vocabulary from words with a high Inverse Document Frequency (IDF) in the descriptions of high review count recipes (the minority class labels, classes 3 and 4)
Vectorize the description text column using TF-IDF and the vocabulary from the previous step
For each recipe, extract the top 5 highest TF-IDF values as the features

steps

Type: text data
Feature Transformation:
Build a vocabulary from words with a high Inverse Document Frequency (IDF) in the steps of high review count recipes (the minority class labels, classes 3 and 4)
Vectorize the steps text column using TF-IDF and the vocabulary from the previous step
For each recipe, extract the top 5 highest TF-IDF values as the features

We believe that incorporating the description and steps features can improve our model’s ability to identify recipes with high review counts since we built a vocabulary from those recipes and used TF-IDF to extract important text information.

Simple sentiment analysis on the reviews column:

reviews

Type: text data
Feature Transformation:
Manually create a list of sentiment words such as good, loved, hated, etc that are relevant for extracting sentiment information
For each word in the sentiment word list, binarize reviews based on whether there are enough review comments that contain that particular word

We think that adding this sentiment analysis to the reviews feature can improve our model’s ability to distinguish recipes between high average ratings and low average ratings because high-rating recipes often have more positive words while low-rating recipes often have more negative words.
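
A minimal sketch of this sentiment binarization is shown below. The word list reuses the three examples named above, while the `MIN_FRACTION` cutoff for "enough review comments" is a hypothetical choice, since the actual cutoff is not stated.

```python
import re

SENTIMENT_WORDS = ["good", "loved", "hated"]   # example words from the text above
MIN_FRACTION = 0.5                             # hypothetical cutoff for "enough" reviews


def sentiment_features(reviews, words=SENTIMENT_WORDS, min_fraction=MIN_FRACTION):
    """Return one 0/1 flag per sentiment word for a recipe's list of reviews."""
    # Lowercase each review and split it into a set of word tokens.
    tokenized = [set(re.findall(r"[a-z']+", review.lower())) for review in reviews]
    features = {}
    for word in words:
        n_hits = sum(word in tokens for tokens in tokenized)
        share = n_hits / len(tokenized) if tokenized else 0.0
        features[word] = int(share >= min_fraction)
    return features


# Hypothetical reviews for a single recipe.
example = sentiment_features(["Loved it, so good!",
                              "Good but a bit salty",
                              "Too sweet for me"])
print(example)
```

Using the share of reviews rather than a raw count keeps the flag comparable between recipes with very different numbers of reviews.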

Final Model Building

The approach we aim to use for our final model is to decompose our multi-class classification problem into two distinct binary classification problems. We will construct two separate classification models, one to determine whether a recipe has a high or low review count, and another to determine whether a recipe has a high or low average rating. Subsequently, we will combine the outcomes of these two binary classifiers to generate the four different class labels corresponding to the result of our multi-class classification problem.

We observed previously that there is a severe data imbalance in classifying recipes by review count, where the low review count category is the majority class and the high review count category is the minority class. To address this imbalance, we use the BalancedRandomForest algorithm from the imbalanced-learn library. Unlike the standard random forest implemented in sklearn, the balanced random forest draws a bootstrap sample from the minority class and then randomly draws the same number of samples, with replacement, from the majority class when constructing each decision tree. This approach should help our model achieve a higher precision in classifying the minority class.
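
The per-tree sampling idea can be illustrated with a few lines of NumPy. This sketch only demonstrates the balanced bootstrap described above. It is not the imbalanced-learn implementation, and the function name and toy labels are hypothetical.

```python
import numpy as np


def balanced_bootstrap_indices(y, rng):
    """Draw, with replacement, as many rows from each class as the minority class has."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_minority = counts.min()
    per_class = [
        rng.choice(np.flatnonzero(y == label), size=n_minority, replace=True)
        for label in classes
    ]
    return np.concatenate(per_class)


# Toy labels: 95 low review count recipes (0) and 5 high review count recipes (1).
labels = np.array([0] * 95 + [1] * 5)
rng = np.random.default_rng(0)
sample_idx = balanced_bootstrap_indices(labels, rng)
sampled_counts = np.bincount(labels[sample_idx])
print(sampled_counts)
```

Each tree would then be trained on a sample like this one, so no single tree is dominated by the majority class even though the forest as a whole still sees most of the data.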

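A simplified sketch of the model class that implements this idea:

from sklearn.base import BaseEstimator, TransformerMixin, clone

class ClassificationTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, model_1, model_2):
        # model_1 classifies review count, model_2 classifies average rating
        self.model_1 = clone(model_1)
        self.model_2 = clone(model_2)

    def fit(self, X, Y):
        # Y[0] holds the review count labels, Y[1] the average rating labels
        self.model_1.fit(X, Y[0])
        self.model_2.fit(X, Y[1])

        return self

    def predict(self, X):
        y_1 = self.model_1.predict(X)
        y_2 = self.model_2.predict(X)
        
        # combine the outcome of y_1 and y_2 
        ...
        
        return y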
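The combination step is elided in the sketch above. One possible way to write it, assuming both binary classifiers output 1 for "high" and 0 for "low" (an encoding we assume here for illustration; the helper name `combine_labels` is also hypothetical), is shown below. The class numbering follows the rest of this report: classes 3 and 4 are the high review count classes, and classes 2 and 4 are the high average rating classes.

```python
import numpy as np


def combine_labels(y_count, y_rating):
    """Map two binary predictions (assumed 1 = high, 0 = low) to classes 1-4."""
    y_count = np.asarray(y_count)
    y_rating = np.asarray(y_rating)
    # 1: low count / low rating     2: low count / high rating
    # 3: high count / low rating    4: high count / high rating
    return 1 + 2 * y_count + y_rating


print(combine_labels([0, 0, 1, 1], [0, 1, 0, 1]))  # [1 2 3 4]
```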

Model pipeline:

Pipeline(steps=[('col_trans',
                 ColumnTransformer(transformers=[('outlier', RobustScaler(),
                                                  ['minutes', 'protein',
                                                   'sodium', 'saturated_fat',
                                                   'total_fat',
                                                   'carbohydrates']),
                                                 ('pass', 'passthrough',
                                                  ['n_steps', 'n_ingredients']),
                                                 ('time',
                                                  Pipeline(steps=[('time',
                                                                   FunctionTransformer(func=<function <lambda> at 0x2b514b550>))]),
                                                  ['submitted']), ...])),
                ('clf',
                 ClassificationTransformer(model_1=BalancedRandomForestClassifier(class_weight='balanced',
                                                                                  criterion='entropy',
                                                                                  max_depth=16,
                                                                                  n_estimators=180),
                                           model_2=RandomForestClassifier(class_weight='balanced',
                                                                          criterion='entropy',
                                                                          max_depth=19,
                                                                          n_estimators=150)))])

We use the standard random forest to classify recipes based on average ratings and the balanced random forest to classify recipes based on review counts, since the review count labels are more imbalanced.

Hyperparameter Tuning

We manually iterate through a list of candidate hyperparameters, using stratified 5-fold train-test splits, separately for each of our two classifiers to find the hyperparameters that give the best accuracy. The best max_depth values were 16 and 19, and the best n_estimators values were 180 and 150, for the balanced random forest and the standard random forest, respectively, as seen in our pipeline above.
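A manual search of this kind might look like the sketch below. It is not the project's tuning code: the helper `manual_search`, the small grid, and the synthetic data from `make_classification` are stand-ins chosen so the example runs quickly on its own, and it uses sklearn's RandomForestClassifier for both roles to avoid an extra dependency.

```python
from itertools import product

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold


def manual_search(X, y, depths, n_trees, seed=0):
    """Return (best_params, best_mean_accuracy) over a small manual grid."""
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    best_params, best_score = None, -np.inf
    for max_depth, n_estimators in product(depths, n_trees):
        scores = []
        for train_idx, val_idx in folds.split(X, y):
            model = RandomForestClassifier(
                max_depth=max_depth, n_estimators=n_estimators,
                class_weight="balanced", criterion="entropy",
                random_state=seed)
            model.fit(X[train_idx], y[train_idx])
            # score() returns accuracy on the held-out fold.
            scores.append(model.score(X[val_idx], y[val_idx]))
        mean_score = float(np.mean(scores))
        if mean_score > best_score:
            best_params = {"max_depth": max_depth, "n_estimators": n_estimators}
            best_score = mean_score
    return best_params, best_score


# Toy imbalanced data standing in for the recipe features.
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.85, 0.15], random_state=0)
best_params, best_score = manual_search(X, y, depths=[4, 8], n_trees=[20, 40])
print(best_params, round(best_score, 3))
```

Stratified folds keep the class proportions roughly equal across splits, which matters when the minority class is small.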

Model Performance Evaluation

After fitting the final model on our training data, we evaluate it on the test data. The confusion matrix below shows the model's predictions on the test data.

As seen in the confusion matrix, our final model correctly classifies a considerable number of recipes in the minority classes, 3 and 4, which is an overall improvement over the baseline model.

Let’s look at the precision, recall, and f1-score of our final model for predicting unseen data.

	precision	recall	f1-score	support
1	0.59	0.47	0.53	5765
2	0.74	0.73	0.73	9796
3	0.19	0.55	0.29	307
4	0.19	0.59	0.29	367
accuracy			0.63	16235
macro avg	0.43	0.58	0.46	16235
weighted avg	0.66	0.63	0.64	16235

Although the precision for classes 3 and 4 is relatively low, their recall is comparatively high (0.55 and 0.59). This means that our model captures more than half of the high review count recipes, but it also produces many false positives by misclassifying low review count recipes into the high review count categories. We consider this trade-off acceptable. The accuracy of our final model is 63%, which is better than our baseline model, so overall the final model is an improvement.
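To see what this trade-off means in counts, the short calculation below works backwards from the rounded values in the report above. Because the inputs are rounded to two decimals, the resulting counts are only approximate; the helper name `implied_counts` is introduced here for illustration.

```python
# Rows copied from the classification report: (precision, recall, support).
report = {
    1: (0.59, 0.47, 5765),
    2: (0.74, 0.73, 9796),
    3: (0.19, 0.55, 307),
    4: (0.19, 0.59, 367),
}


def implied_counts(precision, recall, support):
    """Approximate (TP, FP, FN) implied by rounded report values."""
    tp = recall * support        # recall = TP / (TP + FN), support = TP + FN
    predicted = tp / precision   # precision = TP / (TP + FP)
    return round(tp), round(predicted - tp), round(support - tp)


for label in (3, 4):
    tp, fp, fn = implied_counts(*report[label])
    print(f"class {label}: ~{tp} correct, ~{fp} false positives, ~{fn} missed")

# The support-weighted average of recall equals overall accuracy.
total = sum(s for _, _, s in report.values())
weighted_recall = sum(r * s for _, r, s in report.values()) / total
print(f"weighted recall ~ {weighted_recall:.3f}")
```

For class 3, for example, this gives roughly 169 correctly identified recipes against roughly 720 false positives, which is the cost of recovering more than half of a class with only 307 members.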

We can gain insight into our model by visualizing its feature importances. Feature importance measures the extent to which each engineered feature contributes to the model’s classification decisions. A higher importance score indicates that the feature plays a more significant role in classifying recipes. We can see from the plot that certain features have a very high importance to the model. Specifically, the review feature corresponds to the features numbered 21 to 50. Recall that we use a list of manually created sentiment words for simple sentiment analysis. The visualization indicates that certain sentiment words are particularly useful in helping the model make accurate predictions.

Below, we present the top 10 most useful sentiment words for feature engineering in our classification model:

      great
       good
       very
  delicious
       made
        but
      loved
    perfect
      would
  wonderful

Finally, we fit our final model using all available data for fairness analysis.

final_model = pl_clf.fit(
    recipe.drop(['class', 'rating', 'n_review'], axis=1),
    [to_n_review(recipe['class']), to_rating(recipe['class'])])

Fairness Analysis

For fairness analysis, we are interested in this question: “Are recipes with the vegetarian tag more likely to be correctly classified into the high average rating category by the model than recipes without it?” In other words, is our model fair in terms of precision?

To evaluate fairness, we will compare the precision across two distinct groups. Specifically, we will compare the precision score for recipes with the ‘vegetarian’ tag against those without it. If the precision for recipes with the vegetarian tag is statistically significantly higher than the precision for recipes without it, it could potentially indicate a bias towards classifying vegetarian recipes as having a high average rating more frequently, even when they should not be classified as such. In our dataset, a high average rating corresponds to class categories 2 and 4.

Setup:

Group 1: Recipes with the vegetarian tag

Group 2: Recipes without the vegetarian tag

Null Hypothesis: Our model is fair. Our classifier’s precision is the same for recipes with and without the vegetarian tag, and any differences are due to random chance.

Alternative Hypothesis: Our model is not fair. Our classifier’s precision is higher for recipes with the vegetarian tag than for those without, and the observed difference cannot be explained by random chance alone.

Test statistic: Difference in the average precision of classes 2 and 4 (without vegetarian tag - with vegetarian tag).

Significance level: 0.05.

We fit our final model on all available data and create a new column, has_tags_vegetarian, that indicates whether each recipe has the vegetarian tag.

We then shuffle the has_tags_vegetarian column 1000 times and, for each shuffle, recompute the difference in average precision between recipes with and without the vegetarian tag, giving 1000 simulated test statistics. The empirical distribution of these simulated statistics is shown below.

From the graph above, we can see that the observed difference in precision falls well within the bulk of the simulated distribution, suggesting that a difference of this size could plausibly arise by random chance alone. The p-value we obtained from our permutation test is approximately 0.403, which is greater than our significance level of 0.05. Therefore, we fail to reject our null hypothesis that the precision of our classifier is the same for recipes with and without the vegetarian tag. In other words, we found no statistically significant evidence that our model violates precision parity between these two groups.
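The permutation procedure can be sketched as follows. This is an illustrative version on synthetic data, not the code used to produce the p-value above: the helper names, the synthetic labels, and the one-sided direction of the p-value (small values of "without - with" count as extreme, following the alternative hypothesis) are assumptions on our part.

```python
import numpy as np


def avg_precision(y_true, y_pred, classes=(2, 4)):
    """Mean of the per-class precisions for the given classes."""
    values = []
    for c in classes:
        predicted_c = y_pred == c
        if predicted_c.any():
            values.append(np.mean(y_true[predicted_c] == c))
    return float(np.mean(values)) if values else float("nan")


def precision_gap(y_true, y_pred, is_veg):
    """Test statistic: precision without the tag minus precision with it."""
    return (avg_precision(y_true[~is_veg], y_pred[~is_veg])
            - avg_precision(y_true[is_veg], y_pred[is_veg]))


def permutation_test(y_true, y_pred, is_veg, n_rounds=1000, seed=0):
    rng = np.random.default_rng(seed)
    observed = precision_gap(y_true, y_pred, is_veg)
    # Shuffling the tag breaks any link between the tag and the predictions.
    simulated = np.array([
        precision_gap(y_true, y_pred, rng.permutation(is_veg))
        for _ in range(n_rounds)
    ])
    # One-sided: higher precision WITH the tag makes the gap unusually small.
    p_value = float(np.mean(simulated <= observed))
    return observed, p_value


# Synthetic stand-in data (not the recipe dataset).
rng = np.random.default_rng(1)
y_true = rng.integers(1, 5, size=2000)
y_pred = np.where(rng.random(2000) < 0.6, y_true, rng.integers(1, 5, size=2000))
is_veg = rng.random(2000) < 0.2
observed, p_value = permutation_test(y_true, y_pred, is_veg)
print(f"observed gap = {observed:.4f}, p-value = {p_value:.3f}")
```

Note that only the group labels are shuffled; the model's predictions and the true classes stay fixed, so the simulated statistics describe how large a gap could arise if the tag were unrelated to precision.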

This is a data science project for DSC 80 at UCSD.
