This blog post is the first of a series that follows along with the “StatLearning” MOOC by Trevor Hastie and Rob Tibshirani in Winter 2014. We’ll show how to apply many of the techniques they cover in Python instead of R.
Statistical Learning
Ever since I was exposed to data science and statistical machine learning, one book has always claimed the prime real estate on my desk: The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman. It is the seminal work on statistical learning and covers a wide range of statistical techniques for data analysis that we at DataRobot use on a daily basis. To me, the best part of the book is that it presents methods from both statistics and machine learning in a coherent and accessible way.
My colleagues and I were thrilled when two of the authors, Trevor Hastie and Robert Tibshirani, announced a Massive Open Online Course (MOOC) on statistical learning: StatLearning. It’s free to the general public and will be hosted on Stanford’s OpenEdX platform. The course runs from January 21, 2014 through March 22, 2014.
The course is based on a new book that they co-authored with Gareth James and Daniela Witten, An Introduction to Statistical Learning. The book covers much of the same material as The Elements of Statistical Learning, but at a level more accessible to a broad audience. It also includes many examples of applied statistical learning using R, a domain-specific language for statistical computing. The course, like the book, will include many practical examples of statistical computing using R.
"R is the most powerful statistical computing language on the planet." - Norman Nie
At DataRobot, R is one of two key languages we use on a day-to-day basis (the other being Python). We agree with Norman Nie: R definitely is the most powerful statistical computing language on the planet. However, many (if not most) productionalized data science projects cannot be realized in R alone.
Python is a general-purpose programming language with a strong scientific computing stack that includes many of the statistical learning techniques taught in the course. Since more and more people are using Python for data science, we decided to create a blog series that follows along with the StatLearning course and shows how many of the statistical learning techniques presented in the course can be applied using tools from the Python ecosystem: “numpy”, “scipy”, “pandas”, “matplotlib”, “scikit-learn”, and “statsmodels”. Over the next two months we will reproduce many of the examples presented in the course using Python in place of R. From time to time, we may also cover some supplemental material and/or interesting case studies.
If you haven't used Python yet,
This post was written by Jeremy Achin and Peter Prettenhofer. Please post any feedback,