We are often asked how to get started in data science. While there are more and more training courses and programs out there, many of which cost significant amounts of money, we have found that some of the best resources available are free. We plan to put together a comprehensive step-by-step guide to learning data science using freely available resources (hopefully soon), but for now here are some of the best resources to help you get started.
For learning about statistics and machine learning:
Andrew Ng’s Machine Learning course covers most of the standard machine learning techniques out there. Because it is an online course, it also provides exercises in which you write your own code, at a level that is approachable to a general audience. Professor Ng is one of the cofounders of Coursera and has put a great deal of attention into this course.
An Introduction to Statistical Learning with Applications in R by James, Witten, Hastie and Tibshirani covers much of the material you need to know at a basic level. If you have a strong statistical or mathematical background, see The Elements of Statistical Learning, Second Edition by Hastie, Tibshirani and Friedman for a more thorough treatment of much of the same material. These books provide a detailed treatment of most of the statistical techniques that are at the heart of data science. They are written by the people who developed many of those techniques, and they do an excellent job of presenting them. Both are available for free online with the permission of the publisher.
For learning Python:
There are many resources out there, but the Udacity courses are particularly well done. The Introduction to Computer Science course by Dave Evans is a good beginner course for both understanding programming and getting started with Python. The Design of Computer Programs course taught by Peter Norvig, the Director of Research at Google, is an intermediate course in both Python and programming.
For learning R:
The best online courses we have found are the sequence from the Johns Hopkins Bloomberg School of Public Health on Coursera. Roger Peng’s courses are more focused on learning the R language, while Jeff Leek’s are more focused on statistical methods inside R.
The swirl package, also from Johns Hopkins, is designed to teach you how to program in R by providing a self-guided tutorial inside R. If you want to learn R but don’t want to take a course, it is definitely worth trying.
Generalized linear models (GLMs) are powerful, but their performance depends heavily on how well you understand them and how well you preprocess your data. A Practitioner’s Guide to Generalized Linear Models will help you bring your linear modeling skills to the next level. It covers all aspects of GLMs in detail and provides advice targeted at solving real problems.
In addition, Coursera offers a sequence of courses in its Data Science Specialization. Although we have yet to validate these courses, they seem very promising.