These are some of the books and blogs I have read in the last few years to get up to speed on Big Data, data science, and related concepts.


  • Hadoop: The Definitive Guide - As the title indicates, this is the definitive guide for Hadoop. It provides a great introduction to the Hadoop ecosystem. While it covers things like Pig and Hive, it does not go into depth on any one topic.
  • Hadoop Operations - This book is a must-have for any Hadoop administrator. It collects practical advice on administering Hadoop clusters that is hard to find anywhere else.
  • Programming Pig - This is one of the few books on the Pig language. It provides a good introduction to Pig.

Practical Data Mining and Machine Learning

These books give a hands-on introduction to working with data and machine learning algorithms.

Statistics and Algorithms Theory

These books provide the theoretical background behind algorithms used in Big Data. They are not the best books for learning practical application. Instead, they are great references for when you need to understand the inner workings of different algorithms.

  • Introduction to Algorithms - When you are really working with Big Data, it is essential to have a good understanding of algorithms and how they scale. At the very least, you should know Big O notation and how it applies to algorithms and data structures.
  • The Elements of Statistical Learning - This book provides a great reference for machine learning algorithms including random forests, logistic regression, k-means, and ensemble learning.
  • Bayesian Data Analysis - This book provides a solid introduction to the theory behind Bayesian methods.
  • Data-Intensive Text Processing with MapReduce - This online book describes how to implement different algorithms using the MapReduce paradigm. This is very useful for understanding how to approach algorithm implementation in MapReduce.


These books are non-technical, but provide a good background for ways of thinking about data.

  • The Signal and the Noise - This book is a great introduction to thinking about data in terms of Bayesian statistics. The author provides many real-world examples of applying Bayesian statistics, including sports and politics.
  • Thinking, Fast and Slow - This book is not specifically about statistical thinking, but about how people make decisions. It covers many of the common cognitive biases that lead people to draw incorrect conclusions. Avoiding bias is essential for anyone interested in doing data analysis.
  • Antifragile - Nassim Taleb’s book Antifragile and his previous book The Black Swan both explore the limitations of statistics in prediction.
  • The Grand Design - This is a book about modern physics, but more than that, it is about using observed data to create descriptive models of our universe. This book introduced me to the concept of model-dependent realism, the view that we can only understand the universe via intermediate models.

Websites and Newsletters

These are some of the websites and newsletters I visit regularly.

  • DataTau - DataTau is a news aggregation site focused on data-related topics.
  • Data Science Weekly - Weekly newsletter concerning all things data science.
  • Hacker Newsletter - A weekly newsletter highlighting some of the best articles on Hacker News for the week.
  • Quora - A question/answer site with a very active data community. Good place to go to ask data engineering and data science questions.
  • FiveThirtyEight - This site applies statistical analysis to sports, news, politics, and life.
  • Farnam Street - Covers a wide range of topics from a variety of disciplines. It isn’t focused on data, but it is a great blog for those interested in learning about multiple topics.


  • Google Research - This is a collection of various Google Research papers. It covers artificial intelligence, machine learning, distributed systems, and much more.
  • The Fourth Quadrant - An essay by Nassim Taleb on the fundamental limitations of using statistics in making decisions.
  • Numbers Everybody Should Know - While this entire presentation is interesting, slide 13 is the one anyone working with data at scale should understand. It provides some latency statistics for read/write operations.