Bryan White

Mixed Effects to the Rescue!

2020-07-06T00:00:00+00:00

Linear and logistic regression work great on canned example data found on blogs and websites but run into many problems when they are deployed in the wild. Data is often grouped into categories that are imbalanced (some customers have more purchasing history than others), hierarchical (products can be part of one ‘family’ with slight differences), and non independent (sales people have different impacts on outcomes). In each of these examples, standard regression techniques fail to properly address the structure of the data.

Linear Mixed Models - Regression for Real World Data

Linear Mixed Models (LMM) work like standard regression, but with additional terms called random effects that capture variation not explained by the independent variables used to create the model.

THE DATA

For this post I’m going to use some fast food data provided by Tidy Tuesday. If you haven’t heard of Tidy Tuesday, check it out: it has lots of great visualization examples in R.

This data works great for LMMs because it’s grouped, it has inherent hierarchies, and there are large imbalances in the number of items between groups. All of these characteristics would be a headache for traditional linear models. We’ll see how adding random effects helps.

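library(data.table) # fread() and the [, := ] syntax used below
library(lme4)       # lmer() for the mixed models fit later in the post

fast_food = fread('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-09-04/fastfood_calories.csv')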
fast_food[,item := tolower(item)]
fast_food[,has_chicken := grepl('chicken', item)]
fast_food[,chicken_strip := grepl('strip|popcorn|nuggets|chicken fries|tenders',item)]
fast_food[,is_salad := grepl('salad',item)]

subway_subset = fast_food[restaurant == 'Subway']
fast_food = fast_food[!(restaurant == 'Subway' & grepl('footlong',item)) & !(restaurant == 'Subway' & grepl('kids mini',item))]
fast_food[restaurant == 'Subway', sandwich := grepl('6', item)]
fast_food[, is_wrap := grepl('wrap', item)]
fast_food[is.na(sandwich),sandwich := grepl('sandwich', item)]

Difference Between Random and Fixed Effects

Fixed Effects are categorical predictors where you know all the possible levels. Some Examples:

Drink Size (S/M/L)
Is the chicken fried? (Y/N)

Random Effects are categorical predictors for which your data captures only a subset of all possible levels. In practice, they let each level deviate from the overall response by its own amount, which relaxes the assumption that every observation shares a single, independent error distribution. Some Examples:

Restaurant Chain
Items on customer orders

Benefits of incorporating random effects in your model

Random effects have several benefits. The main ones are:

Enable us to make better predictions on sparse groups using partial pooling
Allow us to explicitly model non independence in data, such as repeated measurements
Better capture hierarchical or clustered data such as geography or product lines

Imbalanced Data

The Random Effect term in a LMM uses a property called Partial Pooling. Partial pooling lets the model estimate an overall effect across groups and pull each group’s estimate toward it: groups with little data are pulled strongly toward the overall effect, while groups with more data stay closer to their own estimates. This is similar to how, in Bayesian statistics, repeated measurements improve your confidence in an estimate.

In the below example, we try to predict the amount of calories in a salad based on the amount of fat. By incorporating a random effect for the restaurant, we can get a stronger estimate for restaurants with fewer salad options.

salad_data = fast_food[is_salad == T]
salad_data[,fried := grepl('crispy',item)]
mixed_effects = lmer(calories ~ total_fat + (1|restaurant), data = salad_data)
fixed_effects = lm(calories ~ total_fat, data = salad_data)
AIC(fixed_effects, mixed_effects)

##               df      AIC
## fixed_effects  3 736.9616
## mixed_effects  4 720.9785
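
The shrinkage behind partial pooling can be sketched in base R without fitting a model. The toy data, the restaurant names, and both variance values below are assumptions made for illustration; lmer() estimates the variance components from the data rather than taking them as given.

```r
# Hypothetical toy data: three restaurants with very different numbers of salads.
set.seed(42)
toy = data.frame(
  restaurant = rep(c('big', 'medium', 'small'), times = c(40, 10, 2)),
  calories = c(rnorm(40, 500, 50), rnorm(10, 600, 50), rnorm(2, 700, 50))
)

grand_mean = mean(toy$calories)
by_restaurant = split(toy$calories, toy$restaurant)
group_mean = sapply(by_restaurant, mean)
group_n = lengths(by_restaurant)

# Assumed variance components (a real LMM estimates these).
sigma2_within = 50^2    # item-to-item variance inside a restaurant
tau2_between = 100^2    # restaurant-to-restaurant variance

# Shrinkage weight for a random intercept: near 1 for large groups,
# smaller for groups with few observations.
shrink_wt = group_n / (group_n + sigma2_within / tau2_between)
pooled_mean = grand_mean + shrink_wt * (group_mean - grand_mean)

round(cbind(n = group_n, raw = group_mean,
            partially_pooled = pooled_mean, weight = shrink_wt), 2)
```

The restaurant with only two salads gets the smallest weight, so its estimate moves furthest toward the grand mean, while the restaurant with forty salads barely moves.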

Non Independence

Non independence of measurements is very common in real world data. Below is an example using chicken strips, nuggets, fries, and tenders. Each restaurant has a signature method for preparing chicken and this preparation will impact the caloric content. Treating each restaurant as a fixed effect rapidly increases the degrees of freedom - and doesn’t consider that this data only represents a subset of all restaurants with chicken tenders!

Making restaurant a random effect better represents the data and makes the model more generalizable. The mixed effect model also has a lower Akaike’s Information Criterion (a measure of model fit that penalizes complexity; lower is better).

chicken_strip_data = fast_food[chicken_strip == T]
chicken_strip_data[,count := as.integer(sub('[^0-9]+','',item))]

## Warning in eval(jsub, SDenv, parent.frame()): NAs introduced by coercion

chicken_strip_data = chicken_strip_data[!is.na(count)]
mixed_effects_chicken = lmer(calories ~ count + (1|restaurant), data = chicken_strip_data)
fixed_effects_chicken = lm(calories ~ count, data = chicken_strip_data)
AIC(mixed_effects_chicken, fixed_effects_chicken)

##                       df      AIC
## mixed_effects_chicken  4 624.9130
## fixed_effects_chicken  3 649.4055
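
To see what non independence looks like in practice, here is a small base R simulation. Everything in it is made up for illustration (the restaurant names, the calories per piece, and both standard deviations); it is not the fast food data. Each simulated restaurant gets its own calorie shift, and a plain lm() that ignores restaurant leaves that shift behind in the residuals.

```r
set.seed(1)
n_restaurants = 6
n_items = 10
restaurant = rep(paste0('restaurant_', 1:n_restaurants), each = n_items)

# Assumed restaurant-level shift (the 'signature preparation') plus item noise.
restaurant_shift = rep(rnorm(n_restaurants, mean = 0, sd = 80), each = n_items)
count = sample(3:20, n_restaurants * n_items, replace = TRUE)
calories = 50 + 45 * count + restaurant_shift +
  rnorm(n_restaurants * n_items, sd = 20)

# Fit the model that treats every item as independent.
pooled_fit = lm(calories ~ count)

# If items were independent, the mean residual per restaurant would hover near 0.
resid_by_restaurant = sapply(split(resid(pooled_fit), restaurant), mean)
round(resid_by_restaurant, 1)

# Variance of a mean of n_items independent residuals with sd 20.
independent_var = 20^2 / n_items
var(resid_by_restaurant) > independent_var
```

Residuals that cluster by restaurant like this are the signal that a (1|restaurant) term is worth adding.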

Hierarchical Relationships

Nested relationships are also very common in day to day data. In the example below, each Subway item belongs to a class of item (sandwich, salad, etc.) and a flavor (meatball, tuna, etc.). Representing this structure would require many separate models if using a traditional fixed effects approach. By incorporating random effects, we can use the structure in the data to get a better overall fit.

subway_subset[,item_general := ifelse(grepl('6""', item),'6_sandwich',NA)]
subway_subset[is.na(item_general),item_general := ifelse(grepl('footlong', item),'12_sandwich',NA)]
subway_subset[is.na(item_general),item_general := ifelse(grepl('kids mini', item),'3_sandwich',NA)]


subway_subset[is.na(item_general),item_general := ifelse(grepl('salad', item),'salad',NA)]
subway_subset[is.na(item_general),item_general := ifelse(grepl('wrap', item),'wrap',NA)]
subway_subset[is.na(item_general),item_general := ifelse(grepl('pizza', item),'pizza',NA)]

subway_subset[,item := sub('6""|footlong|kids mini','',item)]
subway_subset[,item := sub('6""|footlong|kids mini|salad|wrap|pizza','',item)]

subway_model = lmer(protein ~ total_carb +  (1|item_general) + (1|item), data = subway_subset)

sauce

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5970551/

http://lme4.r-forge.r-project.org/book/Ch2.pdf

https://stats.idre.ucla.edu/other/mult-pkg/introduction-to-generalized-linear-mixed-models/

https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf

What the HECK is NSE

2020-06-19T00:00:00+00:00

Non standard evaluation (NSE) sounds like a scary subject. In this post, I’ll work to demystify the topic by comparing it to standard evaluation (SE) and showing real world examples of NSE.

What the HECK is NSE

NSE is the method by which several great features in R work, features you probably use every day without realizing it. The easiest way to explain NSE is to compare it to SE, R’s default behavior. In R, when you assign a value to a variable and call that variable, you get the value back. We can see this as:

x = 'cats'
x

## [1] "cats"

In contrast when you call the same variable ‘x’ in NSE using the quote() function, you get back ‘x’.

quote(x)

## x

Instead of seeing ‘x’ and returning what x is referencing (‘cats’), quote sees the code used to compute the value. NSE doesn’t check whether a variable has been defined in its environment, and it doesn’t execute the code it is given. It simply captures the argument as an unevaluated expression and returns it:

quote(y)

## y

quote(1+3)

## 1 + 3
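
Inspecting what quote() returns makes the distinction clearer: the results are language objects (a symbol and a call), and they only become text once they are passed through deparse(). The variable names sym and cl are just labels for this sketch.

```r
sym = quote(y)        # y is never defined, and that is fine
cl = quote(1 + 3)

class(sym)            # a symbol ("name")
class(cl)             # a "call"
is.character(sym)     # FALSE: not a string
as.list(cl)           # the pieces of the call: the function `+` and its arguments

deparse(cl)           # only now do we get the string "1 + 3"
eval(cl)              # and only now is the code run
```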

The functions of NSE

NSE relies on a handful of functions: quote(), substitute(), enquote(), deparse(), and eval()/get()

quote, substitute, and enquote are the workhorses of NSE. They operate slightly differently. quote() returns its argument as an unevaluated expression, exactly as typed. substitute() does the same at the top level, but inside a function it replaces an argument with the expression the caller supplied. This idea is more easily explained with an example:

external_ref = 'GO VOLS'
nse_example = function(internal_ref){
  list(quote = quote(internal_ref),
       substitute = substitute(internal_ref),
       SE = internal_ref)
}
nse_example(external_ref)

## $quote
## internal_ref
## 
## $substitute
## external_ref
## 
## $SE
## [1] "GO VOLS"

As the above example shows, unlike quote, which returns exactly what was passed to it, substitute looked up the environment chain to the environment outside the function and found the reference ‘external_ref’. This seems black magicy to me, but it works via a concept called a promise. I won’t cover promises here.

enquote wraps a value in a call to quote(): it evaluates its argument and returns quote(<value>). Here that means we get the function definition wrapped in quote(), rather than the bare name that quote() itself returns:

test_function = function(x,y){
  x*2 + y
}
enquote(test_function)

## base::quote(function(x,y){
##   x*2 + y
## })

quote(test_function)

## test_function

deparse simply converts the output of a NSE function call into a character vector.

x = 'GBO'
thing = substitute(x)
deparse(thing)

## [1] "x"

eval lets the user run the product of quote or substitute, and get lets the user return the value of the object named by a string. Combining them with an NSE output gives SE behavior:

NSE_obj = quote(2+4)
eval(NSE_obj)

## [1] 6

best_mascot = 'smokey'
NSE_obj = deparse(quote(best_mascot))
get(NSE_obj)

## [1] "smokey"
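
A captured expression can also be evaluated against data other than the global environment, which is the pattern that functions like with() build on. A minimal sketch, with the wins and games values made up for illustration:

```r
expr = quote(wins / games)

# Evaluate the same expression against a list...
eval(expr, list(wins = 10, games = 13))

# ...or against the columns of a data.frame.
season = data.frame(wins = c(10, 8), games = c(13, 12))
eval(expr, season)

# with() gives the same result without the explicit quote()/eval() pair.
with(season, wins / games)
```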

Where is NSE used

We’ve actually used NSE in an unexpected way already in this post. When we constructed the list object in nse_example(), we did not have to quote the names of the list elements. NSE-based functions typically work by combining substitute() to capture the argument and eval() to deploy it. Other common examples are library(), which accepts an unquoted package name, and data.frame(), which takes its column names from the names of the objects passed to it.

library(magrittr)
library('magrittr')

a = 1:5
b = 6:10
data.frame(a,b)

##   a  b
## 1 1  6
## 2 2  7
## 3 3  8
## 4 4  9
## 5 5 10
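
The column naming trick can be imitated in a few lines. named_list() below is a hypothetical helper written for this post, not a base R function; it is a sketch of the idea, and data.frame() itself does considerably more work.

```r
# Name each element after the expression that was typed for it.
named_list = function(...) {
  # Capture the arguments as unevaluated expressions (drop the leading `list`).
  arg_exprs = as.list(substitute(list(...)))[-1]
  arg_names = vapply(arg_exprs,
                     function(e) paste(deparse(e), collapse = ' '),
                     character(1))
  values = list(...)          # standard evaluation for the values themselves
  names(values) = arg_names
  values
}

a = 1:5
b = 6:10
names(named_list(a, b))
names(named_list(a, b * 2))
```

The values are evaluated normally; only the labels come from NSE.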

OK that’s great - Why do I care?

For 99% of developers I’d argue you shouldn’t really care about using NSE in practice. When using NSE, you can create very powerful and elegant functions. However, code using NSE is harder to read and share. Additionally, functions lose a property called referential transparency when invoking NSE. Referential transparency basically means you can replace arguments with their values and function behavior doesn’t change.
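
To make referential transparency concrete, here is a minimal sketch. se_double() and nse_label() are hypothetical helpers written for this illustration.

```r
se_double = function(x) x * 2                    # standard evaluation
nse_label = function(x) deparse(substitute(x))   # non standard evaluation

score = 5

# SE: swapping the variable for its value changes nothing.
identical(se_double(score), se_double(5))

# NSE: the same swap changes the result, because the function
# looks at the code that was typed rather than the value.
nse_label(score)
nse_label(5)
identical(nse_label(score), nse_label(5))
```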

I think it’s important to understand what NSE is so you know what it means when people talk about NSE and so you have some background for how some functions are doing the things they do. I don’t see myself using much NSE in my day to day code, as in most cases the same task can be completed with clearer code.

As often as possible it makes sense to code to be understood rather than clever :).

Working with R’s Azure Machine Learning SDK

2020-03-01T00:00:00+00:00

This post is a continuation of my series of posts on the now extremely expired Kaggle competition What’s Cooking. In my previous post, Multiclass Imbalance, I attempted to address the class imbalance in the data using class weights and stratified sampling adjustments. It quickly became apparent that while these methods help, finding the right mix of these methods and other model parameters is difficult to do in a one-off fashion. This process of tuning model parameters is called hyperparameter tuning - there are several packages designed to help with this in R, specifically the famous caret package. I decided to try out something new to me: Azure Machine Learning Service HyperDrive. By using the Azure based hyperparameter tuning method I’ll be able to get more models tested, have the results quicker, and have a ready made framework for tracking progress. In the future, I’ll be able to use this method for much larger datasets and more complex model types than will work on my local machine. If you’d like to see my full process, check out my repo.

Setting up hyperdrive is relatively straightforward in R using the azuremlsdk package. This package mirrors the functions available in the Python azuremlsdk, which is very useful considering the far larger body of documentation available for the Python implementation. I hope the code I have in this repo helps you and future me work with the R SDK.

Using hyperdrive, I tuned the random forest over the following parameter space:

azuremlsdk::random_parameter_sampling(list("--min_samples_leaf" = azuremlsdk::uniform(1, 15),
                                            "--n_estimators" = azuremlsdk::choice(c(500, 1000, 2000)),
                                            "--wts_method" = azuremlsdk::choice(c('no_wts', 'inv_freq','inv_log_freq'))
                                            ))

The code above basically says that I’m testing the minimum observations allowed in a terminal node, the number of trees to grow, and the class weight strategy. If I tested each of these exhaustively, I’d have 135(!) models to compare just from three parameters. Instead of the exhaustive approach, I used a random sample of the parameters with early stopping. This means Azure will run up to 135 models but if it gets lucky early on and picks the best set it will stop. The downside to this approach is that it is possible that the tuning process could choose a local maximum instead of the global maximum accuracy.
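
The 135 figure can be checked locally with base R. This sketch assumes min_samples_leaf only takes the whole numbers 1 through 15; it uses expand.grid() and sample() to illustrate exhaustive versus random search and does not call the azuremlsdk API.

```r
# Exhaustive grid over the three parameters.
param_grid = expand.grid(
  min_samples_leaf = 1:15,
  n_estimators = c(500, 1000, 2000),
  wts_method = c('no_wts', 'inv_freq', 'inv_log_freq'),
  stringsAsFactors = FALSE
)
nrow(param_grid)   # 15 * 3 * 3 combinations

# Random search: try only a handful of the combinations.
set.seed(2020)
random_draw = param_grid[sample(nrow(param_grid), 10), ]
random_draw
```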

After setting up the base AzureML configuration and the hyperdrive settings, I started the hyperdrive process, setting it to maximize balanced accuracy. I had to repeat this process a few times, as small errors in code can take a few minutes to become apparent when looking into the Azure logs. My patience and diligence were rewarded when the setup was finally error free and Azure was able to churn through 87 iterations before settling on a balanced accuracy of 88% - not the best we’ve seen! It is likely that Azure settled on a local maximum. We could have asked Azure to not stop early and really try every model - but this would have taken longer and doesn’t guarantee that the model would be as good as or better than the best model I trained locally, because a random forest is built on random samples.

In general I don’t think hyperdrive is going to be a tool that I use for every problem - but I’m really glad to have some exposure to it as I can see its value. The configuration and debugging process was a far larger time investment than the previous models I generated locally. However, once hyperdrive was running, it ran great and its output in the Azure portal GUI (seen above) was very useful. In future work, I imagine I’ll use hyperdrive instead of a manual grid search on parameters to get a better answer faster when having a very accurate model is needed. For POC projects or exploratory work, I think local configuration of a model will be best.

What’s Cooking - Multiclass Imbalance

2020-02-27T00:00:00+00:00

This post is a continuation of my series of posts on the now extremely expired Kaggle competition What’s Cooking. In my previous post, Hello World Model, I created a basic random forest model in R and found that while its overall accuracy was quite high (93%), its balanced accuracy (76%) could be improved. In this post, I’ll look at some strategies for accommodating class imbalance while still using the same random forest multiclass classifier.

There are several potential options built into the randomForest package that can help us right off. The first I’ll explore is the classwt parameter. From the documentation, this parameter sets the ‘priors of the classes’ and ‘it need not add to 1’. By default, all classes have a classwt = 1, meaning that a misclassification error during training counts the same for all classes. By changing this parameter so that one class has a classwt of 2 and the rest have a classwt of 1, you are telling the random forest that a misclassification is penalized twice as heavily for that class as for the others. This pushes the model to minimize error in the class with the higher weight. We can supply any numeric vector whose length equals the number of classes and not worry about it adding to 1, because the classwts are normalized when the model is fit. I tested two strategies for class weights. The first is a basic inverse frequency weighting:

cuisine_freq = table(train_test$train_dt_y)
base_wts = 1/(cuisine_freq/min(cuisine_freq))

rfwts_1 = randomForest(x = train_test$train_dt_x,
                       y = train_test$train_dt_y,
                       ntree = 1500,
                       mtry = 10,
                       classwt = base_wts)

Here the smallest class (british) has a classwt = 1 and the largest class (mexican) has a classwt of .03. This model resulted in a slightly weaker overall accuracy (weighted : 88%, base : 93%) but a much stronger balanced accuracy (weighted : 90% , base : 76%). Looking at the plot below, we can see that the model gave strong preference to the smaller classes, causing large class performance to suffer.

I tested using a slightly less drastic weighting method to see if I could get better class accuracy balance:

cuisine_freq = table(train_test$train_dt_y)
new_wts = log(cuisine_freq/min(cuisine_freq))
new_wts[is.infinite(new_wts)] = 1
new_wts[new_wts < 1] = 1
new_wts = 1/new_wts

rf_wts_2 = randomForest(x = train_test$train_dt_x,
                  y = train_test$train_dt_y,
                  ntree = 2500,
                  mtry = 10, 
                  classwt = new_wts)

This improved both overall accuracy (92%) and balanced accuracy (92%) relative to the first weighted model.
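
To see how the two weighting schemes differ, here they are side by side on a made-up frequency vector. The three counts are hypothetical, not the real cuisine counts.

```r
# Hypothetical class counts for illustration only.
cuisine_freq = c(british = 500, french = 1200, italian = 8000)
ratio = cuisine_freq / min(cuisine_freq)

# Scheme 1: inverse frequency. The smallest class gets weight 1.
inv_freq_wts = 1 / ratio

# Scheme 2: inverse log frequency, floored at 1 before inverting.
# log(1) is 0 for the smallest class, so the floor is what keeps
# the final 1 / x step from dividing by zero.
log_ratio = log(ratio)
log_ratio[log_ratio < 1] = 1
inv_log_wts = 1 / log_ratio

round(rbind(inv_freq_wts, inv_log_wts), 3)
```

Under the log scheme, any class less than e (about 2.7) times the size of the smallest class keeps a weight of 1, and the largest class is down weighted far less sharply.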

Another option built into the randomForest package is to use stratified sampling in the creation of each tree. By default, the model will sample your data, with replacement, to create a bootstrapped dataset for each tree it constructs. Each observation in your data is equally likely to occur in this bootstrapped dataset, so, naturally, larger classes have larger representation. We can alter this behavior by telling the model that we want to take a specific number of samples from each class, using the sampsize argument. By giving a set number from each class to sample, we can make each tree have equal representation for all classes. NOTE - it is necessary to set replace = FALSE when doing this to keep your smaller classes from having overly influential observations. Like the classwt parameter, I tried two different configurations of the sampsize parameter to improve performance.

First, I tried setting the samp size to be equal across all classes. To do this, I took 80% of the smallest class size and propagated this sample size to all classes. This resulted in poorer performance than weighting but better than the base model, (overall accuracy : 91%, balanced accuracy : 85%).

cuisine_freq = table(train_test$train_dt_y)
samp_prop = .8
equal_samp_size = round(min(cuisine_freq)*samp_prop)

samp_vec = rep(equal_samp_size, length(cuisine_freq))
names(samp_vec) = names(cuisine_freq)

rf_samp_1 = randomForest(x = train_test$train_dt_x,
                        y = train_test$train_dt_y,
                        ntree = 2500,
                        mtry = 10,
                        sampsize = samp_vec,
                        replace = F)

Thinking I could get a similar improvement to accuracy by making the samp size a bit more representative of the data, I tried another sample size method :

cuisine_freq = table(train_test$train_dt_y)
samp_prop = .8
equal_samp_size = round(cuisine_freq*samp_prop)
new_sample_prop = log(equal_samp_size/min(equal_samp_size))
new_sample_prop[new_sample_prop<1]= 1
new_samp_vec = round(new_sample_prop*min(equal_samp_size))

rf_samp_2 = randomForest(x = train_test$train_dt_x,
                         y = train_test$train_dt_y,
                         ntree = 2500,
                         mtry = 10, 
                         sampsize = new_samp_vec,
                         replace = F)

Unfortunately, it did not improve performance - in fact its performance is very similar to the base model (overall accuracy : 93%, balanced accuracy : 79%).
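
For reference, here is what the two sampsize vectors look like on a made-up set of class counts (the three counts are hypothetical). Because sampling is done without replacement, each entry has to stay at or below the size of its class.

```r
# Hypothetical class counts for illustration only.
cuisine_freq = c(british = 500, french = 1200, italian = 8000)
samp_prop = 0.8

# Method 1: the same sample size for every class.
equal_samp_vec = rep(round(min(cuisine_freq) * samp_prop), length(cuisine_freq))
names(equal_samp_vec) = names(cuisine_freq)

# Method 2: scale the smallest sample size by a log multiplier, floored at 1.
scaled = round(cuisine_freq * samp_prop)
log_mult = log(scaled / min(scaled))
log_mult[log_mult < 1] = 1
log_samp_vec = round(log_mult * min(scaled))

rbind(class_size = cuisine_freq, equal_samp_vec, log_samp_vec)
```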

There are many other ways we could address the class imbalance outside of model configurations - these change the training data by either over sampling the smaller classes, or undersampling the larger ones, or create new artificial minority class examples via SMOTE. I won’t cover these here but they may be useful in the future.

As this post shows, setting these optional parameters can really impact the performance of your model. It also shows that it can quickly get overwhelming keeping track of all the different configurations and resulting performance data. In my next post, I’ll talk about how we can deploy hyperparameter tuning to make this process more automated.

What’s Cooking - Hello World Model

2020-02-24T00:00:00+00:00

This post is a continuation of my series of posts on the now extremely expired Kaggle competition What’s Cooking. In my previous posts, Data Exploration Part 1 and Data Exploration Part 2, I covered my general strategy of creating the training set from the raw data - TLDR version, I used some text mining techniques to tokenize the ingredients and create a pseudo TFIDF weighting for each ingredient, allowing me to select the terms that are most unique to each cuisine; from these terms, I created a dummy matrix of ingredients, where each row represents a recipe and each column represents an ingredient token.

The goal of this competition is to predict a recipe’s cuisine type based on the ingredients present in the recipe. With a test matrix in hand, let’s get to modeling!

In my opinion, the best place to start this modeling process is to try out an ensembled tree method, specifically random forest. Ensembled trees are a great fit for this type of problem as they natively handle multiclass predictions and can be given a large number of predictors while remaining robust to overfitting. In this example I’ll be using Random Forest, but boosted methods such as XGBoost would be a great choice as well.

At its core, a random forest model is a collection of decision trees, each given slightly different build information and parameters, combined to form a model that is typically more predictive than is possible with a single tree. I won’t go into the details on the model in this post, but if you’re curious (or just want a refresher) - I like the description in this link. The implementation of a random forest in R is very straightforward:

rf = randomForest(x = train_test$train_dt_x,
                  y = train_test$train_dt_y,
                  ntree = 1500,
                  mtry = 10)

In the above code, x is the training matrix [18626, 71] and y is the training target, a one-dimensional vector [18626]. ntree is the number of decision trees that will be created and ensembled together. mtry is the number of predictors each tree is given to choose from at each split. Keeping mtry lowish ensures that your trees don’t all look the same (what would be the point of creating a tree monoculture?), but setting it too low ties the hands of your model and keeps it from performing at its best. In a future post we’ll go into how we can use cross validation to choose the optimal mtry :D.

This model takes a minute or two to run on my machine, but produces a pretty good starting accuracy of 93% and a balanced accuracy of 76% on the held back test data. The balanced accuracy is just the average of recall for each group. I like the balanced accuracy metric best for evaluating this model as it takes into account that the model could have (and does have) varied performance for the different classes.
Calculating these metrics in R is pretty easy with the help of the caret package:

predictions = predict(rf,train_test$test_dt_x)

confusion_calc = caret::confusionMatrix(data = predictions,
                                        reference = train_test$test_dt_y)
all_recall = confusion_calc$byClass[,'Sensitivity']
accuracy = confusion_calc$overall['Accuracy']
balanced_accuracy = mean(all_recall)
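
If you want to see what is happening under the hood, the same two metrics can be computed from a confusion table in base R. The labels below are made up for illustration and are not the competition data.

```r
# Hypothetical true and predicted cuisines for eight recipes.
actual = factor(c('italian', 'italian', 'italian', 'italian',
                  'mexican', 'mexican', 'thai', 'thai'))
predicted = factor(c('italian', 'italian', 'italian', 'italian',
                     'mexican', 'italian', 'italian', 'thai'),
                   levels = levels(actual))

confusion = table(predicted = predicted, actual = actual)

# Recall per class: correct predictions divided by the true class size.
recall_by_class = diag(confusion) / colSums(confusion)

accuracy = sum(diag(confusion)) / sum(confusion)
balanced_accuracy = mean(recall_by_class)

c(accuracy = accuracy, balanced_accuracy = balanced_accuracy)
```

Here the big class is predicted perfectly, so overall accuracy looks better than balanced accuracy, which is the same pattern the model above shows.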

And we can easily visualize the group accuracy with the following plot rendered by ggplot2:

In the next post, I’ll look into strategies to improve our model performance by dealing with the class imbalance!

What’s Cooking - Data Exploration Part 2

2020-02-18T00:00:00+00:00

This post is a continuation of my previous post, Data Exploration Part 1. The goal of these posts is to prep the data for the Kaggle competition What’s Cooking. The goal of this competition is to predict a recipe’s cuisine type based on the ingredients present in the recipe - in the previous post we reshaped the data and did some preliminary exploration of ingredient and cuisine frequencies. The data showed that there is great class imbalance that will need to be addressed when a model is built. It also showed that many ingredients, such as salt, are so prevalent that they are not informative of the cuisine type. This post will focus on creating a dummy matrix of ingredients that are useful - using an idea from the domain of text mining - TFIDF!

Term Frequency Inverse Document Frequency (TFIDF) is a way of weighting text data based on how frequently a term occurs in one document versus how many documents in the whole collection contain it. Intuitively, a word that occurs in nearly every document is likely not important to understanding any one of them, unless it occurs very frequently within that document. We can use this idea to create an IFICF, or Ingredient Frequency Inverse Cuisine Frequency - this will cause ingredients that are common across cuisines to be down weighted relative to those that are unique to a cuisine.

create_IFICF = function(dt){
  dt_sum = dt[,.(term_frequency = sum(term_frequency)),
                              by = .(ingredient_token, cuisine)]
  dt_sum[,cuisine_frequency := uniqueN(.SD$cuisine),
                 by = .(ingredient_token)]
  dt_sum[,total_cuisines := uniqueN(cuisine)]
  dt_sum[,inverse_cuisine_frequency := log(total_cuisines/cuisine_frequency)]
  dt_sum[,IFICF := term_frequency*inverse_cuisine_frequency]
  dt_sum
}
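
To make the weighting concrete, here is the same calculation on a tiny made-up table in base R. The cuisines, ingredients, and counts are hypothetical, and this is a sketch of the idea rather than a drop-in replacement for the data.table function.

```r
# Hypothetical ingredient counts per cuisine.
toy = data.frame(
  cuisine = c('chinese', 'chinese', 'southern_us', 'southern_us',
              'italian', 'italian'),
  ingredient_token = c('salt', 'sesame oil', 'salt', 'grits',
                       'salt', 'basil'),
  term_frequency = c(90, 40, 80, 30, 95, 50),
  stringsAsFactors = FALSE
)

total_cuisines = length(unique(toy$cuisine))

# Number of distinct cuisines each ingredient appears in.
cuisine_frequency = tapply(toy$cuisine, toy$ingredient_token,
                           function(x) length(unique(x)))

toy$inverse_cuisine_frequency =
  log(total_cuisines / as.numeric(cuisine_frequency[toy$ingredient_token]))
toy$IFICF = toy$term_frequency * toy$inverse_cuisine_frequency

toy[order(-toy$IFICF), ]
```

Salt appears in every cuisine, so its inverse cuisine frequency is log(1) = 0 and it drops to the bottom no matter how often it is used.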

After reweighting the data using the above transformation, we get…

Neat! The plot shows that the IFICF is working like we hoped it would - ingredient terms like sesame oil and rice vinegar have high scores for Chinese, while buttermilk, pecans, and grits have high scores for Southern US cuisine. The plot also shows that there is maybe some work to be done around the ingredient term parsing - as we have both sesame oil and sesame in the top 5 IFICF plot for Chinese. Ideally these would be collapsed, and in some cases this is possible, but programmatically this is difficult - is cream the same as whipped cream? I think there may be some ways to fix this issue by seeing if a term is always part of a bigram or trigram but other tests may be inaccurate.
For now, I’m going to move on and come back as needed.

With the IFICF score in hand - we can now create a prediction matrix with the most predictive N terms and begin modeling! In my next post, I’ll begin the modeling process. If you’d like to see the full data prep workflow, see my GitHub.

What’s Cooking - Data Exploration Part 1

2020-02-11T00:00:00+00:00

While super late to the game, the Kaggle competition What’s Cooking sounded too fun to miss. The premise is simple - can you design a model to predict a recipe’s cuisine type based on the ingredients present in the recipe? Intuitively, most people would guess that buttermilk + grits + collard greens would be southern American cuisine - but teaching that to a model could be tricky - there are 6714 unique ingredients in the dataset, with almost 11 ingredients on average per recipe! I’ll be focusing on the data prep and cleaning in this post, see the full repo here!

For the data cleaning process, I’m going to use R - this could be done in Python - but I really like the flexibility and speed of the data.table package. After reading this data into R, you’ll notice the data has the ingredients stored as a vector - which is a pretty inaccessible form for ML. By using the few lines below you can split each ingredient out, making a long version of the data.

train_data = train_data[,lapply(.SD,function(x){unlist(x)}),
                        .SDcols = 'ingredients',
                        by = .(id, cuisine)]
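
If the reshaping step is hard to picture, here is the same idea on two made-up recipes using only base R. The recipes are hypothetical, and this is a sketch of the transformation rather than the data.table code used in the post.

```r
# Two hypothetical recipes with the ingredients stored as a list column.
recipes = data.frame(id = c(1, 2),
                     cuisine = c('southern_us', 'italian'),
                     stringsAsFactors = FALSE)
recipes$ingredients = list(c('buttermilk', 'grits', 'collard greens'),
                           c('basil', 'tomato'))

# Repeat id and cuisine once per ingredient, then flatten the list.
n_per_recipe = lengths(recipes$ingredients)
long_recipes = data.frame(
  id = rep(recipes$id, n_per_recipe),
  cuisine = rep(recipes$cuisine, n_per_recipe),
  ingredients = unlist(recipes$ingredients),
  stringsAsFactors = FALSE
)
long_recipes
```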

Once the data is in the long shape, it’s much simpler to work with. Using the data.table package, I can make quick work of some summary statistics:

train_data_recipe_sum = train_data[,.(count_ingredients = .N), by = .(id, cuisine)]

train_data_cuisine_sum = train_data_recipe_sum[,.(frequency = uniqueN(id),
                                                  avg_ingredients = mean(count_ingredients),
                                                  median_ingredients = as.double(median(count_ingredients))), by = .(cuisine)]
train_data_cuisine_sum[,cuisine := factor(cuisine,
                                          levels = train_data_cuisine_sum$cuisine[order(-train_data_cuisine_sum$frequency)])]

cuisine_freq_plot = ggplot(data = train_data_cuisine_sum, aes(x = cuisine, y = frequency)) + 
  geom_bar(stat = 'identity') + 
  theme_minimal() +
  theme(axis.text.x=element_text(angle=90, vjust = .2))

You may be curious about why I had to wrap as.double() around the median() calculation - if so check out this link to a Stack Overflow post about how data.table’s strong type enforcement clashes with the default behavior of the base::median function. TLDR - median() preserves the type of the data sent to it unless that data is a logical or integer vector of even length, in which case it returns a double; when groups come back with different types, data.table throws an error.
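
The type switch is easy to see at the console with a couple of small integer vectors:

```r
odd_len = c(1L, 2L, 3L)
even_len = c(1L, 2L, 3L, 4L)

class(median(odd_len))    # integer: the middle value is returned as-is
class(median(even_len))   # numeric: the two middle values are averaged

# Wrapping in as.double() gives the same type for every group.
class(as.double(median(odd_len)))
class(as.double(median(even_len)))
```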

One of the first things I noticed about this dataset is that the cuisine type classifications are very imbalanced. Italian and Mexican cuisine have 8K and 6.5K recipes while less popular cuisines such as Russian and Brazilian have around 500. This high class imbalance will make prediction difficult - as the most efficient model is to predict that every recipe you see is Italian! In fact, if you knew nothing about the recipe and guessed Italian you’d have a 1 in 5 chance of being right - not bad. As we move forward with modelling we will have to keep this factor in mind.

On average, most recipes have about 10 ingredients, but this varies some by cuisine - for example Moroccan cuisine has the most on average with almost 13 and Japanese has the least with 9. The number of distinct ingredients in this dataset (6714) is a bit daunting, and attempting to train a model with all of the ingredients as predictive terms would create an exceptionally wide training matrix. While it is possible to train with that many predictors, in practice it is usually better to do some feature engineering first to reduce the dimensionality.

To start this process - let’s look at which ingredients are the most common:

train_data_ingredient_sum = train_data[,.(frequency = .N), by = .(ingredients)]

train_data_ingredient_sum[,ingredients := factor(ingredients,
                                             levels = train_data_ingredient_sum$ingredients[order(-train_data_ingredient_sum$frequency)])]

setorder(train_data_ingredient_sum,-frequency)

ingredient_freq_plot = ggplot(data = train_data_ingredient_sum[1:10], 
                              aes(x = ingredients, y = frequency)) + 
  geom_bar(stat = 'identity') + 
  theme_minimal() +
  theme(axis.text.x=element_text(angle=90,
                                 vjust = .2)) 

ingredient_freq_plot

The plot is about as surprising as it is useful for this analysis - most recipes use salt, onions, garlic, and fat. While in some cases it would make sense to retain these dimensions, in this case, knowing that salt is an ingredient would not help the model determine the cuisine type as it is so ubiquitous.

In the next post I’ll go into how I tackled the dimensionality reduction problem by treating the ingredients much like how you treat a corpus of text in a text analytics problem!
