Data visualizations are fundamental tools for data scientists trying to understand their data before building predictive models. Visualizations can help you sanity check your data, verify your results, improve and understand your models, and even get started building models faster.
One of the most basic, but most powerful, data visualization techniques is the histogram. The histogram is a simple plot that allows you to understand the distribution of a feature in your dataset. A histogram is drawn by binning up a numeric feature -- e.g., 1-10, 11-20, etc. -- and counting up the number of observations in each bin. There is a similar chart for categorical variables called a bar chart. These charts are close cousins, whose names are occasionally used interchangeably. There is little practical difference between a bar chart and a histogram.
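To make the binning idea concrete, here is a minimal sketch in plain Python that groups positive integers into fixed-width bins such as 1-10 and 11-20 and counts the observations in each. The function names and the sample values are made up for illustration.

```python
from collections import Counter


def bin_label(value, width=10):
    """Return the inclusive range label (e.g. '11-20') for a positive integer."""
    low = ((value - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"


def histogram_counts(values, width=10):
    """Count how many values fall into each fixed-width bin."""
    return dict(Counter(bin_label(v, width) for v in values))


# Illustrative values only, not taken from any real dataset.
values = [3, 7, 10, 11, 15, 20, 21, 38]
counts = histogram_counts(values)
print(counts)
```

Plotting libraries do this bookkeeping for you, but it helps to remember that a histogram is nothing more than these counts drawn as bars.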
Broadly speaking, the histogram enables the data scientist to do two things:
- Understand the distribution of a particular feature
- Analyze some other continuous metric (frequently your prediction target) across the range of that distribution
For the purposes of this blog post, we’ll be looking at a Stanford study called “How Couples Meet and Stay Together.” Researchers in this study polled several thousand people over the course of several years in order to learn more about, you guessed it, how they met and stayed together (or not).
I’ve pre-processed the dataset and stored it as a .csv file. Let’s write some Python code to read in the data and investigate a few things:
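Something along these lines would do the job. This is only a sketch: the column names `respondent_age` and `broke_up` are hypothetical, and the handful of rows below are invented stand-ins so the example runs on its own rather than the study's actual data.

```python
import csv
import io

# Hypothetical stand-in for the pre-processed file: the column names and
# values are invented for illustration and are not the study's data.
SAMPLE_CSV = """respondent_age,broke_up
24,1
31,0
45,0
52,0
67,0
29,1
"""


def read_rows(handle):
    """Parse CSV text from an open file handle into a list of typed dicts."""
    return [
        {
            "respondent_age": int(row["respondent_age"]),
            "broke_up": int(row["broke_up"]),
        }
        for row in csv.DictReader(handle)
    ]


rows = read_rows(io.StringIO(SAMPLE_CSV))
ages = [row["respondent_age"] for row in rows]
breakup_rate = sum(row["broke_up"] for row in rows) / len(rows)

# A few quick checks: row count, age range, and overall break-up rate.
print(f"{len(rows)} rows, ages {min(ages)}-{max(ages)}, break-up rate {breakup_rate:.2f}")
```

With a real file, you would pass an open file handle (opened with `newline=""`, as the `csv` module recommends) to `read_rows` instead of the in-memory sample.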
Using a histogram, we should take a look at all of our features. Drawing the histogram is fairly simple with matplotlib. Let’s look at the survey respondents’ ages, for instance:
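A minimal matplotlib sketch might look like the following. The ages here are synthetic, generated from an arbitrary distribution purely so the example runs, and the five-year bin width and output file name are my own assumptions.

```python
import random


def make_synthetic_ages(n, seed=0):
    """Generate made-up ages between 18 and 90 (NOT the survey data)."""
    rng = random.Random(seed)
    return [min(90, max(18, round(rng.gauss(46, 15)))) for _ in range(n)]


def plot_age_histogram(ages, path="age_histogram.png"):
    """Draw a histogram of ages with five-year bins and save it to `path`."""
    # Imported here so the data helper above works without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(ages, bins=list(range(15, 100, 5)), edgecolor="black")
    ax.set_xlabel("Respondent age")
    ax.set_ylabel("Number of respondents")
    ax.set_title("Distribution of respondent age")
    fig.savefig(path)
    plt.close(fig)
    return path


ages = make_synthetic_ages(500)
```

Calling `plot_age_histogram(ages)` writes the chart to a PNG file. Swap the synthetic list for the age column of your own dataset to see the real distribution.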
This data visualization surprised me. Typically, when I think about “breaking up,” I’m thinking about young couples in dating relationships, but the population in this dataset is largely older. My guess is that older couples are less likely to break up than younger couples, but I don’t have to guess! Indeed, a histogram can show that this is the case.
Okay, so technically, this is a bit more than just a simple histogram, but you can see how powerful this visualization is. By looking at one simple plot, I know that my dataset probably includes long-term marriage relationships in addition to more casual relationships, and I can see how the stability of relationships tends to increase as people get older.
This raises lots of interesting questions that a histogram can help with:
- How do marriage rates change as couples age?
- Does the age gap between partners correlate with break-up rates?
- Are there any nonsensical values for age in my dataset?
- Is the man or woman more likely to be older in a relationship?
- How does age vary by how the couple met?
You may notice that for the lines that I've drawn, I used the mean value in each bin. I could certainly think of cases where the median, maximum, or other values might be more relevant or informative.
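Here is a rough sketch of how those per-bin values could be computed, with the summary statistic passed in as a parameter so the mean can be swapped for the median or anything else. The bin edges, the function names, and the tiny sample are all made up for illustration.

```python
from statistics import mean, median


def bin_index(value, edges):
    """Return i such that edges[i] <= value < edges[i + 1], or None if outside."""
    for i in range(len(edges) - 1):
        if edges[i] <= value < edges[i + 1]:
            return i
    return None


def summarize_by_bin(xs, ys, edges, stat=mean):
    """Apply `stat` to the y values whose x falls in each bin (None if empty)."""
    groups = [[] for _ in range(len(edges) - 1)]
    for x, y in zip(xs, ys):
        i = bin_index(x, edges)
        if i is not None:
            groups[i].append(y)
    return [stat(g) if g else None for g in groups]


# Illustrative values only: ages and a 0/1 break-up flag.
ages = [22, 25, 28, 35, 38, 52, 57, 58]
broke_up = [1, 1, 0, 0, 1, 0, 0, 0]
edges = [20, 30, 40, 50, 60]

mean_by_bin = summarize_by_bin(ages, broke_up, edges, stat=mean)
median_by_bin = summarize_by_bin(ages, broke_up, edges, stat=median)
print(mean_by_bin)
print(median_by_bin)
```

Plotting one of these lists against the bin midpoints, on top of the histogram bars, gives the kind of line described above.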
All these plots are great for understanding your data, explaining it, and visualizing it, but you can also use histograms to understand how your predictive models are working. You could try:
- Compare your average target value against your model’s average prediction across the range of each predictor in your data; i.e., two lines! This approach could help you identify places where your model is underperforming.
- Plot model diagnostic statistics -- e.g., partial dependence, standard deviation, etc. -- against each feature in your dataset.
- Evaluate model performance against each feature in your dataset.
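The first idea in that list, comparing the average target against the average prediction in each bin, can be sketched as follows. The predictions here are invented numbers standing in for a model's output, and the function name and bin edges are assumptions of mine.

```python
from statistics import mean


def actual_vs_predicted(feature, actual, predicted, edges):
    """Per bin, return (low, high, count, mean actual, mean predicted)."""
    rows = []
    for low, high in zip(edges[:-1], edges[1:]):
        idx = [i for i, x in enumerate(feature) if low <= x < high]
        if not idx:
            rows.append((low, high, 0, None, None))
            continue
        rows.append((
            low,
            high,
            len(idx),
            mean(actual[i] for i in idx),
            mean(predicted[i] for i in idx),
        ))
    return rows


# Illustrative values only; `predicted` stands in for a model's scores.
ages = [22, 25, 28, 35, 38, 52, 57, 58]
actual = [1, 1, 0, 0, 1, 0, 0, 0]
predicted = [0.5, 0.5, 0.5, 0.5, 0.5, 0.25, 0.25, 0.25]

rows = actual_vs_predicted(ages, actual, predicted, edges=[20, 40, 60])
for low, high, n, avg_actual, avg_pred in rows:
    print(f"{low}-{high}: n={n}, actual={avg_actual}, predicted={avg_pred}")
```

Bins where the two averages diverge are the places to look more closely. In this toy sample the made-up model predicts too low for the younger bin and too high for the older one.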
While not rocket science, histograms are invaluable for understanding what your data looks like and how your model behaves.
By Greg Michaelson, Director of DataRobot Labs