Welcome to The Code Forest

A place for all things data science



Recent Posts

We’ll use 50 years of NFL kicking data to inform the least – or most – important decision of your fantasy season: Drafting a kicker.

CONTINUE READING

Take your time series forecasting game to the next level by working through two real-world scenarios in the Tidyverse!

CONTINUE READING

In this post, we’ll leverage 110 years of historical data – and everything from time-series forecasting to hypothesis testing – to understand how a person’s state of birth influences their name.

CONTINUE READING

In this post, we’ll analyze reviews and ratings from Beeradvocate.com to understand what drives satisfaction amongst beer drinkers worldwide. Prost!

CONTINUE READING

Survival Analysis is the go-to method for analyzing time-to-event data. In this post, we’ll go deep on some historical player data and then leverage machine learning to predict how long new draft picks will remain in the NFL.

CONTINUE READING

College rankings are a standard input for most students when choosing a school. But to what extent does a college’s rank relate to how much a graduate makes 10 years into their career? We’ll answer this question by web scraping data from a variety of online sources with R and Python, and then build a model to understand which factors matter most to post-college pay.

CONTINUE READING

Portland, Oregon is home to some of the best watering holes in America. With so many places to quaff a West Coast Style IPA or glass of Pinot Noir, choosing which to visit (and in which order) can be a daunting task. To address this question, we’ll leverage some classic optimization techniques to minimize the total distance travelled between the top bars in Portland for a truly “optimal” Pub Crawl.
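For a flavor of the optimization involved, here is a minimal sketch using a greedy nearest-neighbor heuristic as a simple stand-in for the classic techniques the post covers. The bar names and coordinates below are made up for illustration:

```python
import math

# Hypothetical bars with planar coordinates: purely illustrative stand-ins
bars = {
    "Bar A": (0.0, 0.0),
    "Bar B": (1.0, 0.5),
    "Bar C": (0.2, 1.0),
    "Bar D": (1.5, 1.2),
}

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def nearest_neighbor_route(start):
    """Greedy TSP heuristic: always walk to the closest unvisited bar."""
    route, remaining = [start], set(bars) - {start}
    while remaining:
        nxt = min(remaining, key=lambda b: dist(bars[route[-1]], bars[b]))
        route.append(nxt)
        remaining.remove(nxt)
    return route

route = nearest_neighbor_route("Bar A")
total = sum(dist(bars[a], bars[b]) for a, b in zip(route, route[1:]))
print(route, round(total, 3))
```

A greedy route like this is fast but not guaranteed optimal; the full post explores more rigorous approaches to minimizing total distance.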

CONTINUE READING

Keras is quickly becoming the go-to prototyping solution for computer vision problems, and this post provides an overview of how to rapidly build a Convolutional Neural Network in R with the Keras library.

CONTINUE READING

Advanced machine learning algorithms like Artificial Neural Networks (ANNs) can’t model time-dependent data without some pre-processing. The additional processing hurdle often deters forecasters from implementing advanced methods in favor of classic (but less powerful) approaches. However, I’ve observed some notable accuracy gains applying ANNs to forecasting problems. Accordingly, this post provides a basic playbook for data cleaning, feature engineering, model selection, prediction, and risk assessment when forecasting with Neural Nets.
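The pre-processing step mentioned above usually means converting a raw series into lagged feature rows a network can train on. A minimal sketch of that idea (the function name is my own, not from the post):

```python
def make_lag_features(series, n_lags=3):
    """Convert a plain series into supervised-learning rows: each row pairs
    the previous n_lags values (features) with the next value (target),
    which is the shape a neural net needs to learn from time-dependent data."""
    rows = []
    for i in range(n_lags, len(series)):
        rows.append((series[i - n_lags:i], series[i]))
    return rows

series = list(range(10))   # toy series: 0, 1, ..., 9
rows = make_lag_features(series)
print(rows[0])    # ([0, 1, 2], 3)
print(len(rows))  # 7
```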

CONTINUE READING

Aaron Rodgers or Tom Brady? Carson Wentz or Drew Brees? Choosing the right Fantasy Football QB each week is challenging. To remove some of the guesswork from the decision-making process, I devised an approach that’s worked well over the past few seasons. Read on to learn more about using the Beta Distribution to pick your weekly starting QB.
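As a hedged sketch of the general idea, here is one way to compare two QBs by sampling from Beta posteriors over each player’s “good game” rate. The game counts below are invented for illustration and are not the post’s actual method or data:

```python
import random

random.seed(42)

# Hypothetical records: (good games, bad games) relative to some
# fantasy-point threshold. The counts are made up for illustration.
qb_a = (11, 5)
qb_b = (8, 8)

def prob_a_beats_b(a, b, draws=20_000):
    """Monte Carlo estimate of P(rate_A > rate_B), modeling each QB's
    'good game' rate as a Beta posterior under a flat Beta(1, 1) prior."""
    wins = 0
    for _ in range(draws):
        rate_a = random.betavariate(1 + a[0], 1 + a[1])
        rate_b = random.betavariate(1 + b[0], 1 + b[1])
        wins += rate_a > rate_b
    return wins / draws

p = prob_a_beats_b(qb_a, qb_b)
print(round(p, 3))
```

The appeal of the Beta here is that it naturally expresses uncertainty from small sample sizes: a QB with few games gets a wide posterior rather than an overconfident point estimate.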

CONTINUE READING

Drafting a rookie in Fantasy Football can be a risky move, but it can pay huge dividends if you happen to snag a diamond in the rough. After accounting for a player’s draft position, do physical attributes (height/weight) and combine performance (40-yard dash, bench press, etc.) provide any additional explanatory power for points scored during a player’s first NFL season? I’ll explore this question for rookie Running Backs and Wide Receivers.

CONTINUE READING

Tired of waiting around for your simulations to finish? Run them in parallel! This post covers how to use Spark and R’s foreach package to add parallelism to your R code.

CONTINUE READING

This post focuses on some of my favorite things – football and forecasting – and will outline how to leverage external regressors when creating forecasts. We’ll do some web scraping in R and Python to create our dataset, and then forecast how many people will visit Tom Brady’s Wikipedia page.

CONTINUE READING

Feature selection is an integral part of machine learning and this post explores what happens when lots of irrelevant features are added to the modeling process. We’ll also identify which algorithms are affected the most by such features. These questions will be addressed as we build a classifier and try to predict which wines we’ll like based on their chemical properties. So pour yourself a glass of Pinot Noir and fire up your R terminal!

CONTINUE READING

Sometimes a controlled experiment isn’t an option, yet you still want to establish causality. This post outlines a method for quantifying the effects of an intervention via counterfactual predictions.
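The core pattern is: fit a model on the pre-intervention period, extrapolate it forward as the counterfactual, and treat the gap between actuals and that forecast as the estimated effect. A minimal sketch with made-up numbers (the real post uses a proper forecasting model, not a straight line):

```python
# Made-up weekly metric with an intervention at week 10.
pre = [100 + 2 * t for t in range(10)]   # pre-intervention: a clean trend
post_actual = [130, 133, 137, 140]       # observed weeks 10-13

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

slope, intercept = fit_line(list(range(10)), pre)
counterfactual = [intercept + slope * t for t in range(10, 14)]
estimated_lift = [a - c for a, c in zip(post_actual, counterfactual)]
print(estimated_lift)  # actual minus the "no intervention" forecast, per week
```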

CONTINUE READING

That’s a dense title – Monte Carlo Simulation, Power, Mixed-Effect models. Each of these topics could be its own post. However, I’m going to discuss their interrelations in the context of experimental power and keep everything high-level. The goal is to get an intuitive idea of how we can leverage simulation to provide sample size estimates for experiments with nested data.
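Stripped of the nested-data complexity, the simulate-then-test idea looks like this. The sketch below uses a plain two-group comparison as a stand-in for the mixed-effect models the post deals with:

```python
import random
import statistics

def power_estimate(n_per_group, effect, sims=500, z_crit=1.96, seed=1):
    """Estimate power by simulation: generate many fake experiments under an
    assumed effect size and count how often a simple two-sample z-style
    test rejects the null."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(sims):
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, 1) for _ in range(n_per_group)]
        se = (statistics.variance(control) / n_per_group
              + statistics.variance(treated) / n_per_group) ** 0.5
        z = (statistics.mean(treated) - statistics.mean(control)) / se
        rejections += abs(z) > z_crit  # ~alpha = 0.05, two-sided
    return rejections / sims

power = power_estimate(n_per_group=50, effect=0.5)
print(round(power, 2))
```

To find a sample size, you would rerun this over a grid of n_per_group values until the estimated power clears your target (e.g. 0.80); with nested data, the simulation step generates group-level effects too.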

CONTINUE READING

This post covers a straightforward approach for detecting and replacing outliers in order to improve forecasting accuracy.
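One simple version of detect-and-replace, sketched here with a median/MAD rule and neighbor interpolation (an illustrative stand-in, not necessarily the post’s exact method):

```python
import statistics

def replace_outliers(series, k=3.0):
    """Flag points far from the median (measured in MAD units) and replace
    each one with the median of its immediate neighbors."""
    med = statistics.median(series)
    mad = statistics.median([abs(x - med) for x in series]) or 1.0
    cleaned = list(series)
    for i, x in enumerate(series):
        if abs(x - med) / mad > k:
            window = [series[j]
                      for j in range(max(i - 1, 0), min(i + 2, len(series)))
                      if j != i]
            cleaned[i] = statistics.median(window)
    return cleaned

data = [10, 11, 10, 12, 95, 11, 10]   # 95 is an obvious spike
cleaned = replace_outliers(data)
print(cleaned)  # the spike is replaced; everything else is untouched
```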

CONTINUE READING

Exception handling is a critical component of any data science workflow. You write code. It breaks. You build logic to deal with the exceptions. Repeat. In my experience, one point of confusion for new R users is how to handle exceptions, something that’s a bit more intuitive in Python. Accordingly, this post provides a practical overview of how to handle exceptions in R by first illustrating the concept in Python.
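In Python, the pattern is a try/except block; a minimal example of the concept (which R expresses with tryCatch()):

```python
import math

def safe_divide(a, b):
    """Try the risky operation, trap the specific failure, and fall back
    gracefully: the pattern R expresses with tryCatch()."""
    try:
        return a / b
    except ZeroDivisionError:
        return math.nan

print(safe_divide(10, 2))   # 5.0
print(safe_divide(10, 0))   # nan
```

Catching the specific exception class (rather than a bare except) keeps unrelated bugs from being silently swallowed, a habit that carries over to specifying error vs. warning handlers in R.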

CONTINUE READING

Early trend detection is a major area of focus in the analytics realm because it can inform key business strategy, yet it remains an extremely difficult task. This post outlines one trend-detection method in an effort to predict where a stock’s price will go in the future.

CONTINUE READING

This post covers how quantile regression and prediction intervals can be used to determine how much ‘wiggle room’ there is for a home’s price.
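The machinery behind quantile regression can be glimpsed from its loss function alone. The home-price numbers below are invented, and this is a simplified illustration rather than the post’s full approach:

```python
def pinball_loss(y_true, y_pred, tau):
    """Quantile ('pinball') loss: the asymmetric penalty that makes a model
    predict the tau-th quantile instead of the mean."""
    diff = y_true - y_pred
    return tau * diff if diff >= 0 else (tau - 1) * diff

# For the 90th percentile, under-predicting costs 9x more than over-predicting:
print(round(pinball_loss(300_000, 280_000, tau=0.9), 2))  # 18000.0
print(round(pinball_loss(300_000, 320_000, tau=0.9), 2))  # 2000.0
```

Fitting one model at a low tau and another at a high tau brackets the home’s price, and the gap between the two predictions is the “wiggle room.”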

CONTINUE READING