Mixed Effects to the Rescue!

Linear and logistic regression work great on the canned example data found on blogs and websites, but they run into many problems when deployed in the wild. Real data is often grouped into categories that are imbalanced (some customers have more purchasing history than others), hierarchical (products can be part of one ‘family’ with slight differences), and non-independent (salespeople have different impacts on outcomes). In each of these examples, standard regression techniques fail to properly address the structure of the data.
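
To make this concrete, here's a minimal sketch of a mixed effects model using the lme4 package. The simulated data and every name in it (sales_df, revenue, ad_spend, salesperson) are illustrative assumptions, not part of the post:

```r
# Minimal mixed effects sketch with lme4; all names and data are simulated.
library(lme4)

set.seed(42)
sales_df <- data.frame(
  salesperson = factor(rep(1:20, each = 30)),  # 20 groups of 30 sales
  ad_spend    = runif(600, 0, 10)
)
# Each salesperson gets their own baseline on top of a shared ad_spend effect.
sales_df$revenue <- 5 + 2 * sales_df$ad_spend +
  rnorm(20, sd = 3)[sales_df$salesperson] + rnorm(600)

# (1 | salesperson) fits a random intercept per salesperson, so the model
# respects the grouped, non-independent structure of the data.
fit <- lmer(revenue ~ ad_spend + (1 | salesperson), data = sales_df)
summary(fit)
```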

Read More

What the HECK is NSE

Non-standard evaluation (NSE) sounds like a scary subject. In this post, I’ll work to demystify the topic by comparing it to standard evaluation (SE) and showing real-world examples of NSE.
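
As a quick preview, here's a small example of the difference using base R. subset() relies on NSE, and the hand-rolled nse_filter() below (a hypothetical helper, not from the post) shows the substitute()/eval() machinery behind it:

```r
# Standard evaluation: mtcars$mpg is evaluated *before* the call, so the
# condition must reference objects that exist in the calling environment.
mtcars[mtcars$mpg > 25, ]

# Non-standard evaluation: subset() captures the unevaluated expression
# `mpg > 25` and evaluates it inside the data frame itself.
subset(mtcars, mpg > 25)

# A hypothetical helper showing how NSE works under the hood:
nse_filter <- function(df, cond) {
  # substitute() grabs the unevaluated expression passed as `cond`;
  # eval() then evaluates it using `df` as the environment.
  df[eval(substitute(cond), df), ]
}
nse_filter(mtcars, mpg > 25)
```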

Read More

Working with R's Azure Machine Learning SDK

This post is a continuation of my series of posts on the now extremely expired Kaggle competition What’s Cooking. In my previous post, Multiclass Imbalance, I attempted to address the class imbalance in the data using class weights and stratified sampling adjustments. It quickly became apparent that while these methods help, finding the right mix of them and other model parameters is difficult to do in a one-off fashion. This process of tuning model parameters is called hyperparameter tuning, and there are several packages designed to help with this in R, most famously the caret package. I decided to try something new to me: Azure Machine Learning service’s HyperDrive. By using the Azure-based hyperparameter tuning method, I’ll be able to test more models, get the results sooner, and have a ready-made framework for tracking progress. In the future, I’ll be able to use this method for much larger datasets and more complex model types than will work on my local machine. If you’d like to see my full process, check out my repo.
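
For a sense of what tuning looks like locally before handing it off to the cloud, here's a hedged sketch using caret. The grid values are arbitrary and iris stands in for the recipe data, so this is not the post's exact setup:

```r
# Cross-validated grid search with caret over a ranger random forest.
library(caret)

ctrl <- trainControl(method = "cv", number = 5)
grid <- expand.grid(mtry          = c(1, 2, 4),
                    splitrule     = "gini",
                    min.node.size = c(1, 5))

# caret fits one model per grid row per fold and keeps the combination
# with the best cross-validated accuracy.
fit <- train(Species ~ ., data = iris,
             method    = "ranger",
             trControl = ctrl,
             tuneGrid  = grid)
fit$bestTune
```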

Read More

What's Cooking - Multiclass Imbalance

This post is a continuation of my series of posts on the now extremely expired Kaggle competition What’s Cooking. In my previous post, Hello World Model, I created a basic random forest model in R and found that while its overall accuracy was quite high (93%), its balanced accuracy (76%) could be improved. In this post, I’ll look at some strategies for accommodating class imbalance while still using the same random forest multiclass classifier.
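
As a preview of the two strategies, here's a sketch using the randomForest package; an artificially imbalanced cut of iris stands in for the recipe data:

```r
library(randomForest)

# 50 / 10 / 5 rows per class to mimic class imbalance.
imbal <- iris[c(1:50, 51:60, 101:105), ]

# Strategy 1, class weights: penalize mistakes on rare classes more heavily.
wts <- as.numeric(1 / table(imbal$Species))
fit_w <- randomForest(Species ~ ., data = imbal, classwt = wts)

# Strategy 2, stratified sampling: each tree draws the same number of rows
# from every class, so the rare classes aren't swamped by the common one.
n_min <- min(table(imbal$Species))
fit_s <- randomForest(Species ~ ., data = imbal,
                      strata   = imbal$Species,
                      sampsize = rep(n_min, nlevels(imbal$Species)))
```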

Read More

What's Cooking - Hello World Model

This post is a continuation of my series of posts on the now extremely expired Kaggle competition What’s Cooking. In my previous posts, Data Exploration Part 1 and Data Exploration Part 2, I covered my general strategy for creating the training set from the raw data. The TL;DR version: I used some text mining techniques to tokenize the ingredients and create a pseudo-TFIDF weighting for each ingredient, allowing me to select the terms that are most unique to each cuisine. From these terms, I created a dummy matrix of ingredients, where each row represents a recipe and each column represents an ingredient token.
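
To illustrate that final dummy matrix step, here's a hedged sketch built on a toy tibble; the recipes and column names are made up for demonstration:

```r
library(dplyr)
library(tidyr)

# One row per recipe-ingredient pair (toy data, not the post's).
recipes <- tibble::tribble(
  ~id, ~cuisine,   ~ingredient,
  1L,  "southern", "buttermilk",
  1L,  "southern", "grits",
  2L,  "italian",  "basil",
  2L,  "italian",  "buttermilk"
)

# Pivot to one row per recipe and one 0/1 column per ingredient token.
dummy <- recipes %>%
  mutate(present = 1L) %>%
  pivot_wider(names_from = ingredient, values_from = present,
              values_fill = 0L)
dummy
```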

Read More

What's Cooking - Data Exploration Part 2

This post is a continuation of my previous post, Data Exploration Part 1. The goal of these posts is to prep the data for the Kaggle competition What’s Cooking, which asks you to predict a recipe’s cuisine type based on the ingredients present in the recipe. In the previous post we reshaped the data and did some preliminary exploration of ingredient and cuisine frequencies. The data showed a great class imbalance that will need to be addressed when a model is built. It also showed that many ingredients, such as salt, are so prevalent that they are not informative of cuisine type. This post will focus on creating a dummy matrix of the ingredients that are useful, using an idea from the domain of text mining: TFIDF!
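
To show the idea, here's a minimal TFIDF sketch with the tidytext package, treating each cuisine as a "document" and each ingredient as a "term"; the toy data is illustrative only:

```r
library(dplyr)
library(tidytext)

# Toy data: salt appears in every cuisine, the other ingredients don't.
recipes <- tibble::tibble(
  cuisine    = c("southern", "southern", "italian", "italian"),
  ingredient = c("grits", "salt", "basil", "salt")
)

ingredient_tfidf <- recipes %>%
  count(cuisine, ingredient) %>%
  bind_tf_idf(ingredient, cuisine, n) %>%
  arrange(desc(tf_idf))
# Ubiquitous ingredients like salt score tf_idf = 0, while cuisine-specific
# ones like grits and basil rise to the top.
```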

Read More

What's Cooking - Data Exploration Part 1

While super late to the game, the Kaggle competition What’s Cooking sounded too fun to miss. The premise is simple: can you design a model to predict a recipe’s cuisine type based on the ingredients present in the recipe? Intuitively, most people would guess that buttermilk + grits + collard greens means southern American cuisine, but teaching a model to make that call could be tricky: there are 6714 unique ingredients in the dataset, with almost 11 ingredients on average per recipe! I’ll be focusing on the data prep and cleaning in this post; see the full repo here!
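
Assuming the standard Kaggle train.json is sitting in the working directory, a quick jsonlite sketch for loading the data and checking those counts might look like this:

```r
library(jsonlite)

# train.json holds one record per recipe: id, cuisine, and an
# ingredients list-column.
train <- fromJSON("train.json")

length(unique(unlist(train$ingredients)))  # 6714 unique ingredients
mean(lengths(train$ingredients))           # almost 11 per recipe on average
```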

Read More
