Top 5 Books on Data Science and Statistics

Beyond the resources we provide here at Data Science for Anyone, another excellent way to grow your skills and explore more advanced data science topics is to read high-quality books by experts in the field. Here are five of our recommendations for books on data science and statistics to get you started.

1. R for Data Science (by Garrett Grolemund & Hadley Wickham): This book provides an excellent overview of the fundamentals of R, including data management, data wrangling, data analysis and data visualization. It assumes that readers have no existing knowledge of R and progresses from easier-to-understand material all the way up to more advanced topics. As an added plus, the worked examples and data sets are really interesting (think baseball and car models), which adds to the appeal and accessibility of this book. Even better, it is available as a paperback or e-book, so you can take your data science learning on the go on your tablet or e-reader.


2. Introduction to Machine Learning with Python (by Andreas C. Müller & Sarah Guido): Another offering from the publisher O’Reilly, which releases many helpful coding books, this book is not limited to machine learning. It takes a clear, methodical, step-by-step approach to installing the essential Python packages, including pandas, numpy and scipy, which allow you to work with data frames and conduct statistical analyses. It also covers importing data, basic summary statistics and visualizations before getting to more complex material such as classification and regression using machine learning techniques. Much like the R for Data Science book above, this book contains numerous references to real datasets (the Iris flower dataset, in particular) and uses many code snippets that you can adapt to a wide variety of datasets and research questions.


3. A Gentle Introduction to Stata (by Alan C. Acock): This is a classic book on Stata, and there is a reason it is in its sixth edition. The great thing about this book is that it assumes you have little to no programming or statistical knowledge when you start learning Stata. It begins with fundamentals including descriptive statistics, bivariate relationships and elementary plotting, and then moves into more complicated topics such as multivariate analysis, non-continuous outcome variables, and logistic and negative binomial regression models. Much like the other books on this page, this one includes excellent references to real datasets to keep the worked examples interesting and relevant. Whether you are learning Stata for the first time or adding a second or third data science language to your toolkit, we would recommend this book as a great starting point.


4. R for SAS and SPSS Users (by Robert A. Muenchen): This book is a bit more specialized, aimed at SAS and SPSS users who want to develop skills in R, but it is also relevant for anyone who is used to a point-and-click statistics program and wants to develop skills in syntax-based programming in R. The material is ordered logically from very basic tasks (e.g., calculating descriptive statistics, recoding variables, simple plotting) to more complex operations (e.g., transforming variables, reshaping data, running factor analyses), and it strikes a nice balance between practical explanations with real data and intuitive code snippets from all three statistical packages.


5. R for Excel Users: Introduction to R for Excel Analysts (by John L Taveras): Don’t let this book’s size fool you! While a lot shorter than the other data science books reviewed on this page, it has clear and informative guides that explain R commands with simple Excel examples. Even if you aren’t an expert in Excel, if you are used to point-and-click interfaces for running analyses and want to hone your skills in syntax-based programming and analysis in R, this book has some great examples to get you started. It moves from basic concepts, including creating, subsetting and merging data frames, to more complicated topics like visualization, user-written functions and for/while loops.