Project 3 for DSC 80 at UCSD
In this project we studied the relationship between a recipe's cooking time, in minutes, and its average rating. We wanted to learn whether the time it takes to cook a recipe has any effect on its average rating. More specifically, we wanted to know:

Do shorter recipes (less than or equal to 40 minutes) tend to have higher average ratings than longer recipes (more than 40 minutes)?
We believe answering this question will be useful to creators who post their recipes online, as this analysis would give insight into which cooking times tend to earn the best ratings.
The data we use in our analysis is made up of two datasets, both obtained from the website food.com: `RAW_recipes.csv`, which contains over 70,000 recipes, and `RAW_interactions.csv`, which contains over 200,000 reviews of those recipes. The variables we are most interested in are `minutes` in the recipes dataset and `rating` in the interactions dataset.
Breakdown of datasets:
RAW_RECIPES.CSV
Variables | Description |
---|---|
name | Recipe name |
id | Recipe Id |
minutes | Minutes to prepare recipe |
contributor_id | Contributor Id |
submitted | Date Submitted |
tags | List of tags |
nutrition | Nutrition facts |
n_steps | Number of steps |
steps | List of steps |
description | Description of Recipe |
ingredients | List of ingredients |
n_ingredients | Number of ingredients |
RAW_INTERACTIONS.CSV
Variables | Description |
---|---|
user_id | User Id |
recipe_id | Recipe Id |
date | Date review was written |
rating | Rating of recipe |
review | Review |
We start by loading in the data and merging the two datasets:
```python
import numpy as np
import pandas as pd

# loading data
recipes = pd.read_csv('data/RAW_recipes.csv')
interactions = pd.read_csv('data/RAW_interactions.csv')

# merging DataFrames together, keeping every recipe even if it has no reviews
merged = recipes.merge(interactions, left_on='id', right_on='recipe_id', how='left')
```
The resulting DataFrame has over 200,000 rows and 17 columns. This makes sense, as we are merging the interactions dataset onto the recipes dataset: each recipe is duplicated once per interaction, while the interaction rows are unique, and we will use them to calculate average ratings later on.
We noticed that the merged dataset had many values of 0 in the `rating` column:
rating | count |
---|---|
5 | 169676 |
4 | 37307 |
0 | 15035 |
3 | 7172 |
1 | 2870 |
2 | 2368 |
This could pose a problem for our analysis: since we will be using the average rating of recipes, values of 0 would drag down our averages. After looking at the website the data had been scraped from (food.com), we realised that it is not possible to rate a recipe less than 1 star. Upon further analysis we identified a section labeled ‘tweaks’ which contained reviews but no ratings. Thus, we concluded that the rows with a 0-star rating came from the ‘tweaks’ section and that those ratings are actually missing by design. Therefore, we changed the ratings of 0 to NaN so that these rows did not affect our average ratings.
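A minimal sketch of this replacement, assuming the `merged` DataFrame from above:

```python
# treat 0-star ratings as missing by design
merged['rating'] = merged['rating'].replace(0, np.nan)
```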
We then grouped all rows with the same `id`, calculated the mean rating of each recipe, and added it back as a column labeled `avg_rating`.
```python
# average rating per recipe, mapped back onto every row of that recipe
avg_rating = merged.groupby('id')['rating'].mean()
merged['avg_rating'] = merged['id'].map(avg_rating)
```
We then added the column `time label`, which contains the string values '<=40' and '>40' depending on whether the recipe takes at most 40 minutes, and dropped the columns that would not be used for further analysis (`submitted`, `user_id`, `recipe_id`, `date`, `rating`, and `review`). A sketch of this step is shown below.
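A minimal sketch, again assuming the `merged` DataFrame from above:

```python
# label each recipe as short (<=40 minutes) or long (>40 minutes)
merged['time label'] = np.where(merged['minutes'] <= 40, '<=40', '>40')

# drop columns that are no longer needed for the analysis
merged = merged.drop(columns=['submitted', 'user_id', 'recipe_id',
                              'date', 'rating', 'review'])
```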
When looking at the distributions of our variables, we noticed that the `minutes` variable had some outliers that we wanted to deal with before our analysis. Plotting the distribution of the `minutes` column gave us the following histogram:
In order to identify outliers we used the interquartile range (IQR) method: with the IQR defined as the 75th percentile minus the 25th percentile, any value below (25th percentile − 1.5 × IQR) or above (75th percentile + 1.5 × IQR) is flagged as an outlier. Flagged values were removed from the dataset, as they would be unrepresentative of our general population of recipes.
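A minimal sketch of this filtering step, assuming the `merged` DataFrame from above (we name the result `cleaned_food`, matching the later code):

```python
# IQR fences on the minutes column
q1 = merged['minutes'].quantile(0.25)
q3 = merged['minutes'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# keep only recipes whose cooking time falls within the fences
cleaned_food = merged[merged['minutes'].between(lower, upper)]
```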
After removing the outliers, our new histogram became the following, and we are left with a DataFrame of shape (75791, 13):
name | id | minutes | contributor_id | tags | nutrition | n_steps | steps | description | ingredients | n_ingredients | avg_rating | time label | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 brownies in the world best ever | 333281 | 40 | 985201 | [‘60-minutes-or-less’, ‘time-to-make’, ‘course… | [138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0] | 10 | [‘heat the oven to 350f and arrange the rack i… | these are the most; chocolatey, moist, rich, d… | [‘bittersweet chocolate’, ‘unsalted butter’, ‘… | 9 | 4 | <=40 |
1 | 1 in canada chocolate chip cookies | 453467 | 45 | 1848091 | [‘60-minutes-or-less’, ‘time-to-make’, ‘cuisin… | [595.1, 46.0, 211.0, 22.0, 13.0, 51.0, 26.0] | 12 | [‘pre-heat oven the 350 degrees f’, ‘in a mixi… | this is the recipe that we use at my school ca… | [‘white sugar’, ‘brown sugar’, ‘salt’, ‘margar… | 11 | 5 | >40 |
2 | 412 broccoli casserole | 306168 | 40 | 50969 | [‘60-minutes-or-less’, ‘time-to-make’, ‘course… | [194.8, 20.0, 6.0, 32.0, 22.0, 36.0, 3.0] | 6 | [‘preheat oven to 350 degrees’, ‘spray a 2 qua… | since there are already 411 recipes for brocco… | [‘frozen broccoli cuts’, ‘cream of chicken sou… | 9 | 5 | <=40 |
6 | millionaire pound cake | 286009 | 120 | 461724 | [‘time-to-make’, ‘course’, ‘cuisine’, ‘prepara… | [878.3, 63.0, 326.0, 13.0, 20.0, 123.0, 39.0] | 7 | [‘freheat the oven to 300 degrees’, ‘grease a… | why a millionaire pound cake? because it’s su… | [‘butter’, ‘sugar’, ‘eggs’, ‘all-purpose flour… | 7 | 5 | >40 |
7 | 2000 meatloaf | 475785 | 90 | 2202916 | [‘time-to-make’, ‘course’, ‘main-ingredient’, … | [267.0, 30.0, 12.0, 12.0, 29.0, 48.0, 2.0] | 17 | [‘pan fry bacon , and set aside on a paper tow… | ready, set, cook! special edition contest entr… | [‘meatloaf mixture’, ‘unsmoked bacon’, ‘goat c… | 13 | 5 | >40 |
First we will plot the distributions of the variables `minutes` and `avg_rating` in order to get an idea of what the distributions look like.

Plotting the distribution of `minutes` gives us the following histogram:

Plotting the distribution of `avg_rating` gives us the following histogram:
The average ratings in the dataset are skewed left and tend to have more positive ratings as opposed to negative ones. Over 60% of the recipes have an average rating of 5 while less than 5% have a rating of 3 or less.
Next we will plot the distributions of these variables conditional on whether or not the recipe takes 40 minutes or less.
Looking at the two distributions, there seems to be a disproportionately large number of recipes on the site that take less than 40 minutes to cook, but their relative shapes look similar.
Next we will create a pivot table comparing the average rating of shorter vs. longer recipes across the number of ingredients, to get a better idea of how the averages may differ; a sketch of its construction is shown below, followed by the resulting table.
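A minimal sketch of the pivot table, assuming the `cleaned_food` DataFrame:

```python
# mean avg_rating for each (time label, n_ingredients) combination
pivot = cleaned_food.pivot_table(
    index='time label',
    columns='n_ingredients',
    values='avg_rating',
    aggfunc='mean',
)
```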
time label | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<=40 | 4.792 | 4.74624 | 4.71655 | 4.70764 | 4.71222 | 4.69611 | 4.67469 | 4.66409 | 4.68101 | 4.66748 | 4.67962 | 4.69613 | 4.71457 | 4.66231 | 4.68861 | 4.68911 | 4.72753 | 4.72819 | 4.71846 | 4.77003 | 4.72861 | 4.82624 | 4.83673 | 4.69231 | 4.11111 | 4.89853 | 4.8 | 5 | 4.90476 | 5 | nan | nan | 5 |
>40 | 5 | 4.63603 | 4.73703 | 4.65007 | 4.71809 | 4.64523 | 4.65759 | 4.64848 | 4.63874 | 4.65828 | 4.69217 | 4.6648 | 4.69096 | 4.65559 | 4.68107 | 4.66201 | 4.64584 | 4.75513 | 4.67791 | 4.67346 | 4.68863 | 4.89439 | 4.82418 | 4.79658 | 4.80882 | 4.81481 | 4.59048 | 4.88571 | 4.94118 | 4.69565 | 5 | 5 | nan |
Calculating the difference in means across ingredient counts:

```python
# row-wise difference (>40 minus <=40), averaged across ingredient counts
pivot.diff().mean(axis=1)
```

We get a difference in means of -0.006732, indicating a slightly higher average rating for recipes with shorter cooking times.
Our dataset, Recipes and Ratings, contained three columns with missing values. We can find these columns using pandas, as sketched below; the resulting counts for each raw dataset follow.
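A minimal sketch of the missing-value counts, using the raw DataFrames loaded earlier:

```python
# number of missing values in each column of the raw datasets
print(interactions.isna().sum())
print(recipes.isna().sum())
```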
RAW_INTERACTIONS.CSV

Column | Missing values |
---|---|
user_id | 0 |
recipe_id | 0 |
date | 0 |
rating | 0 |
review | 169 |
RAW_RECIPES.CSV

Column | Missing values |
---|---|
name | 1 |
id | 0 |
minutes | 0 |
contributor_id | 0 |
submitted | 0 |
tags | 0 |
nutrition | 0 |
n_steps | 0 |
steps | 0 |
description | 70 |
ingredients | 0 |
n_ingredients | 0 |
Since we utilize the `avg_rating` column in our data analysis, and it contains missing values, we determined it necessary to test whether the missingness of `avg_rating` depends on the `minutes` column.

To test if the `avg_rating` column is MAR (missing at random), we execute a permutation test. The code to do so is included below:
```python
def test_missing(col, k=1000):
    df = cleaned_food.copy()  # work on a copy of the cleaned dataframe

    # observed test statistic: absolute difference in the column's mean
    # between rows where avg_rating is missing and rows where it is not
    non_missing_mean = df.loc[~df['avg_rating'].isna(), col].mean()
    missing_mean = df.loc[df['avg_rating'].isna(), col].mean()
    obs_diff_mean = abs(non_missing_mean - missing_mean)

    results = []  # empty list to store simulated test statistics
    # perform k iterations of the permutation test
    for _ in range(k):
        df[col] = np.random.permutation(df[col])  # shuffle the specified column
        non_missing_mean_test = df.loc[~df['avg_rating'].isna(), col].mean()
        missing_mean_test = df.loc[df['avg_rating'].isna(), col].mean()
        results.append(abs(non_missing_mean_test - missing_mean_test))

    # p-value: fraction of simulated statistics at least as extreme as observed
    p_val = (np.array(results) >= obs_diff_mean).mean()
    return p_val
```
When we run the function above with the input column specified as `minutes`, we get the following output:

```python
test_missing('minutes')
```
```
0.0
```
Since this p-value is below our 5% significance level, we reject the null hypothesis and conclude that the missingness of `avg_rating` likely depends on the `minutes` column.
We then test whether the missingness of `avg_rating` depends on another column, `n_ingredients`, which contains the number of ingredients each recipe calls for. We do so in the same manner as our test for `minutes`:
```python
test_missing('n_ingredients')
```
```
0.06
```
In this case, the p-value of 0.06 is above our 5% significance level, so we fail to reject the null hypothesis and cannot conclude that the missingness of `avg_rating` depends on `n_ingredients`.
Recall, we are looking into the question:
What is the relationship between the cooking time and average rating of recipes?
To complete an in-depth analysis of our question based on the data, we chose to run a permutation test. Details of our test include:
Null Hypothesis: Recipes that take less than or equal to 40 minutes to complete have, on average, the same ratings as recipes that take more than 40 minutes to complete.

Alternative Hypothesis: Recipes that take less than or equal to 40 minutes to complete tend to have higher ratings than recipes that take more than 40 minutes to complete.
We chose these as our null and alternative hypotheses because we found that in plotting the relationship between minutes and average rating, most of the high ratings for recipes could be found in the recipes that took less than 40 minutes. Thus, we decided to compare the two groups (>40 and <=40) in our test.
Test Statistic: Difference in Means

Because our Alternative Hypothesis specifies a direction (ratings are higher for recipes with durations of 40 minutes or less), we use the signed difference in means: the mean rating of <=40 recipes minus the mean rating of >40 recipes.
Significance Level: 5%

We chose the conventional significance level against which to compare our p-value.
For ease of permutation testing, we added a column to our dataframe that specifies whether the recipe falls within the <=40 or >40 group.
To perform our permutation test, we created a function that takes the number of iterations of the test we’d like to perform (with a default value of 100), and returns the p-value for our test. Code is as follows:
```python
def perm_test(k=100):
    df = cleaned_food.copy()  # work on a copy of the cleaned_food dataframe

    # observed test statistic: mean avg_rating of <=40 recipes minus >40 recipes
    rows_greater_40 = df[df['time label'] == '>40']
    rows_less_40 = df[df['time label'] == '<=40']
    obs_stat = rows_less_40['avg_rating'].mean() - rows_greater_40['avg_rating'].mean()

    results = []  # empty list for the simulated test statistics
    # perform k iterations of the test
    for _ in range(k):
        df['time label'] = np.random.permutation(df['time label'])  # shuffle labels
        row_g40 = df[df['time label'] == '>40']
        row_l40 = df[df['time label'] == '<=40']
        results.append(row_l40['avg_rating'].mean() - row_g40['avg_rating'].mean())

    # one-sided p-value: fraction of shuffled statistics >= the observed one
    p_val = (np.array(results) >= obs_stat).mean()
    return p_val
```
When we run this function with the default values, we get the following output:

```python
perm_test()
```
```
0.0
```

Since this p-value falls below our 5% significance level, we reject the null hypothesis: our data are consistent with shorter recipes (40 minutes or less) tending to have higher average ratings than longer recipes.