Statistics vs. Machine Learning: The Dilemma of the Analytics Practitioner

Author: Rajneesh Pathak

Today the analytics industry draws on multiple disciplines that help solve problems by learning from data. Techniques from statistics, operations research, machine learning / statistical learning, econometrics, and market research can solve the similar, and sometimes very diverse, problems that analytics practitioners face today. A seasoned user of analytics handles this confluence of disciplines, and the availability of competing and complementary algorithms, with ease, yet people continue to debate the differences between these disciplines and their relative superiority. With many big names in industry betting big on machine learning, the debate has only intensified.

  • “A breakthrough in machine learning would be worth ten Microsofts” (Bill Gates, Chairman, Microsoft)
  • “Machine learning is the next Internet” (Tony Tether, Director, DARPA)
  • “Machine learning is the hot new thing” (John Hennessy, President, Stanford)
  • “Web rankings today are mostly a matter of machine learning” (Prabhakar Raghavan, Dir. Research, Yahoo)
  • “Machine learning is going to result in a real revolution” (Greg Papadopoulos, Former CTO, Sun)
  • “Machine learning is today’s discontinuity” (Jerry Yang, Founder, Yahoo)
  • “Machine learning today is one of the hottest aspects of computer science” (Steve Ballmer, CEO, Microsoft)

The initial motivation for machine learning was to build computer programs “from data” rather than from written specifications. Many tasks, such as computer vision and speech recognition, cannot easily be specified in any formal way. However, it is easy to obtain “input-output” examples of the desired behavior.

Machine learning (ML) in its current form closely resembles statistical analysis. The basic motivation of ML lends itself very well to the typical statistical problems we deal with.

Consider the definition of ML/SL given in the book “An Introduction to Statistical Learning” by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.

The relationship between a response variable and its predictor(s) can be written as
Y = f(X) + e
where
f : a fixed but unknown function of X
X : the input vector (X1, X2, …, Xn)
Y : the output (response)
e : a random error term

SL/ML refers to a set of approaches for estimating f().
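To make "estimating f()" in Y = f(X) + e concrete, here is a minimal sketch using only the Python standard library. It assumes a hypothetical true function f(x) = 2 + 3x and Gaussian noise, both chosen purely for illustration, and it uses ordinary least squares as one possible estimation approach among many.

```python
import random


def true_f(x):
    # Hypothetical "true" f, chosen only for illustration.
    return 2.0 + 3.0 * x


def simulate(n, noise_sd, seed=0):
    """Draw n observations from Y = f(X) + e with Gaussian error e."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


xs, ys = simulate(n=500, noise_sd=1.0)
intercept, slope = fit_line(xs, ys)


def f_hat(x):
    # The estimate of f learned from the data.
    return intercept + slope * x


print(f"estimated f(x) = {intercept:.2f} + {slope:.2f} * x")
```

The estimate f_hat only approximates true_f because of the error term e. A more flexible method could replace fit_line without changing the overall framing.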

Even though machine learning is not the same as statistics, there is a huge overlap in both the underlying mathematics and the resulting techniques. Both fields work with data and try to find a function that takes data as input and produces the desired output.

However, statistics emphasizes statistical inference (confidence intervals, hypothesis tests, optimal estimators), whereas machine learning emphasizes prediction. In statistics, one infers the process by which the data were generated. In machine learning, one wants to predict what future data will look like for some variable of interest.
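The contrast between inference and prediction can be illustrated on a single fitted model. The sketch below assumes a hypothetical data-generating process (y = 2 + 3x plus Gaussian noise) and a simple train/test split. It reports an approximate 95% confidence interval for the slope, which is the inference view, and the mean squared error on held-out data, which is the prediction view. The 1.96 multiplier is a normal approximation, not an exact t-based interval.

```python
import math
import random

rng = random.Random(1)

# Hypothetical data-generating process, for illustration only.
xs = [rng.uniform(0.0, 10.0) for _ in range(400)]
ys = [2.0 + 3.0 * x + rng.gauss(0.0, 1.0) for x in xs]

# Hold out the last 100 points to measure predictive accuracy.
train_x, test_x = xs[:300], xs[300:]
train_y, test_y = ys[:300], ys[300:]

# Fit a line by ordinary least squares on the training data.
n = len(train_x)
mean_x = sum(train_x) / n
mean_y = sum(train_y) / n
sxx = sum((x - mean_x) ** 2 for x in train_x)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Inference view: how uncertain is the estimated slope?
residuals = [y - (intercept + slope * x) for x, y in zip(train_x, train_y)]
sigma2 = sum(r * r for r in residuals) / (n - 2)  # residual variance
se_slope = math.sqrt(sigma2 / sxx)
ci_low = slope - 1.96 * se_slope   # normal approximation
ci_high = slope + 1.96 * se_slope

# Prediction view: how well does the model do on unseen data?
squared_errors = [
    (y - (intercept + slope * x)) ** 2 for x, y in zip(test_x, test_y)
]
test_mse = sum(squared_errors) / len(squared_errors)

print(f"slope = {slope:.3f}, approx. 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
print(f"held-out MSE = {test_mse:.3f}")
```

Both numbers come from the same fit. Which one a practitioner reports first tends to reflect whether the question is about the underlying process or about future observations.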

Larry Wasserman from Carnegie Mellon observes,

“Statistics is an older field than Machine Learning (but young compared to Math, Physics etc). Thus, ideas about collecting and analyzing data in Statistics are rooted in the times before computers even existed. Of course, the field has adapted as times have changed but history matters and the result is that the way Statisticians think, teach, approach problems and choose research topics is often different than their colleagues in Machine Learning. If I had to summarize the main difference between the two fields I would say:

Statistics emphasizes formal statistical inference (confidence intervals, hypothesis tests, optimal estimators) in low dimensional problems. Machine Learning emphasizes high dimensional prediction problems.”

Statistics vs Machine Learning

There are also differences in terminology. Here are some examples:

Statistics     | Machine Learning
Estimation     | Learning
Classifier     | Hypothesis
Data point     | Example/Instance
Regression     | Supervised Learning
Classification | Supervised Learning
Covariate      | Features
Response       | Label


and of course: Statisticians use R, and Machine Learners use Matlab.

Overall, the two fields are blending together more and more and I think this is a good thing.

Robert Tibshirani, a statistician and machine learning expert at Stanford, says in his class notes that machine learning is a glamorous version of statistics.

Though statistical analysis and methodology are the predominant approach in modern machine learning, not all machine learning methods are based on probabilistic models; examples include SVMs and non-negative matrix factorization.

Machine learning also uses computers far more extensively, which helps in solving many complex problems. Another observable difference is that statistics generally deals with low-dimensional data, whereas machine learning is generally associated with high-dimensional data.

