Statistical Learning in Python Blog Series Kick-off

This blog post is the first in a series that follows along with the “StatLearning” MOOC by Trevor Hastie and Rob Tibshirani in Winter 2014. We’ll show how to apply many of the techniques they cover in Python instead of R.

 

Statistical Learning

Ever since I was first exposed to data science and statistical machine learning, one book has claimed the prime real estate on my desk: The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman. It is the seminal work on statistical learning and covers a wide range of statistical techniques for data analysis that we at DataRobot use on a daily basis. To me, the best part of the book is that it presents methods from both statistics and machine learning in a coherent and accessible way.

My colleagues and I were thrilled when two of the authors, Trevor Hastie and Robert Tibshirani, announced a Massive Open Online Course (MOOC) on statistical learning: StatLearning. It’s free to the general public and will be hosted on Stanford’s OpenEdX platform. The course runs from January 21, 2014 through March 22, 2014.

The course is based on a new book that they co-authored with Gareth James and Daniela Witten, An Introduction to Statistical Learning. The book covers much of the same material as The Elements of Statistical Learning, but at a level accessible to a broader audience, and it includes many examples of applied statistical learning in R, a domain-specific language for statistical computing. Like the book, the course will include many practical examples in R.

"R is the most powerful statistical computing language on the planet."  - Norman Nie

At DataRobot, R is one of two key languages we use on a day-to-day basis (the other being Python). We agree with Norman Nie: R definitely is the most powerful statistical computing language on the planet. However, many (if not most) data science projects that reach production cannot be realized in R alone.

Python is a general-purpose programming language with a strong scientific computing stack that implements many of the statistical learning techniques taught in the course. Since more and more people are using Python for data science, we decided to create a blog series that follows along with the StatLearning course and shows how many of the techniques it presents can be applied using tools from the Python ecosystem: “numpy”, “scipy”, “pandas”, “matplotlib”, “scikit-learn”, and “statsmodels”. Over the next two months we will reproduce many of the examples presented in the course using Python in place of R. From time to time, we may also cover supplemental material and interesting case studies.
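To give a small taste of what that looks like, here is a minimal sketch of a simple linear regression fit with “numpy” alone, roughly what a call to lm(y ~ x) does in R. The data are synthetic and made up for this illustration (they do not come from the course or the book), and the variable names are our own.

```python
import numpy as np

# Synthetic data, invented for illustration: y = 2 + 3x + Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=200)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: minimize ||X @ coef - y||^2
coef, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

The estimated intercept and slope should land close to the true values of 2 and 3 used to generate the data. Libraries such as “statsmodels” and “scikit-learn” wrap this kind of fit in higher-level interfaces.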

If you haven't used Python yet, we strongly recommend you follow along and see how it can be a one-stop shop for computing in general. If you have used Python before but are new to statistical learning, this series should give you all the information you need to get started without having to learn a new language. Stay tuned!

This post was written by Jeremy Achin and Peter Prettenhofer. Please post any feedback, comments, or questions below.