Economists discussing machine learning, such as Athey, and Mullainathan and Spiess, make much of the supposed difference that while most machine learning work focuses on prediction, in economics it is causal inference rather than prediction that matters more.
But what really is the fundamental difference between causal inference and prediction?
Let us try to define both tasks.
Prediction
For the prediction task we are given training data on an outcome variable Y and a set of predictor variables X in the form of observed (Xi,Yi) pairs. Presumably there is a functional relationship Yi = f(Xi), but we don’t know what f is, and our task is to use the training data to predict f(X) for given values of X.
Of course the task is interesting only if the value of X for which we have to make a prediction is not among the values Xi seen in the data set, for otherwise we would just output the known Yi (ignore for the time being the resource constraints in storing the entire training database and searching it for a known Xi).
For us to have some degree of success in the prediction task, f needs to have some regularity, in the sense that values of Y at one X should yield some information about the values of Y at another X, and our prediction algorithm should be able to capture that regularity, i.e. the prediction function it produces, call it h, should in some sense be close to the true f. This means we must begin our work with knowledge of the general nature of f. [TODO: the no free lunch theorem, must also explain why we cannot work with a hypothesis set which includes all possible fs].
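The role of regularity can be illustrated with a small simulation. This is only a sketch in the spirit of the no free lunch idea, not a proof of it. Everything in it is an assumption made for illustration: the finite domain 0..999, the threshold function standing in for a "regular" f, the coin-flip table standing in for an f with no regularity, and the nearest-neighbour rule used to build h.

```python
import random


def nearest_neighbor_predictor(train):
    """Build h: predict the label of the closest seen x (illustrative choice)."""
    def h(x):
        _, nearest_y = min(train, key=lambda pair: abs(pair[0] - x))
        return nearest_y
    return h


def unseen_accuracy(f_table, rng):
    """Train on a random half of the domain, score h on the other half."""
    domain = list(range(len(f_table)))
    rng.shuffle(domain)
    half = len(domain) // 2
    seen, unseen = domain[:half], domain[half:]
    h = nearest_neighbor_predictor([(x, f_table[x]) for x in seen])
    hits = sum(h(x) == f_table[x] for x in unseen)
    return hits / len(unseen)


rng = random.Random(0)
n = 1000

# A "regular" f: nearby points almost always share a label.
regular = [int(x >= n // 2) for x in range(n)]
# An f with no regularity: every label is an independent coin flip.
irregular = [rng.randint(0, 1) for _ in range(n)]

acc_regular = unseen_accuracy(regular, rng)
acc_irregular = unseen_accuracy(irregular, rng)
print(f"regular f:   accuracy on unseen X = {acc_regular:.3f}")
print(f"irregular f: accuracy on unseen X = {acc_irregular:.3f}")
```

On the threshold function the seen points say a great deal about the unseen ones, so h does well. On the coin-flip table the seen labels carry no information about the unseen labels, and no choice of h can be expected to beat guessing.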
Often we work with a statistical version of the prediction problem where we posit that Yi = f(Xi,Zi) where Zi is an unobserved quantity which we model as a random variable with some probability distribution. When called upon to make a prediction we are only given X. Not knowing Z, we cannot predict the exact Y, so we require our prediction to be good in some statistical sense, such as expected mean squared error.
In fact, statistics can also enter into the deterministic prediction task when evaluating the performance of a prediction rule/algorithm. An algorithm may make good predictions for some values of X and not so good predictions for other values of X. We can put a probability measure on the domain of X and use a statistical average-case performance metric.
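A minimal sketch of the statistical version of the problem may help fix ideas. All specifics here are assumptions chosen for illustration: the linear form of f, the uniform distribution on X, the Gaussian Z, and the use of a least-squares line as the hypothesis h. The point is only that performance is judged by an average over fresh draws of X and Z, not by exact recovery of Y.

```python
import random
from statistics import mean


def f(x, z):
    # Hypothetical "true" relationship; z is the unobserved component.
    return 1.0 + 2.0 * x + z


def draw(n, rng):
    """Draw (X, Y) pairs; Z is generated but never shown to the predictor."""
    pairs = []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)   # probability measure on the domain of X
        z = rng.gauss(0.0, 0.5)     # unobserved noise with variance 0.25
        pairs.append((x, f(x, z)))
    return pairs


def fit_line(pairs):
    """Ordinary least squares for a single predictor; returns h."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in pairs)
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def mean_squared_error(h, pairs):
    return mean((y - h(x)) ** 2 for x, y in pairs)


rng = random.Random(0)
train = draw(500, rng)
test = draw(5000, rng)   # fresh draws stand in for "unseen" values of X

h = fit_line(train)
y_bar_train = mean(y for _, y in train)

test_mse = mean_squared_error(h, test)
baseline_mse = mean_squared_error(lambda x: y_bar_train, test)
print(f"fitted line MSE:        {test_mse:.3f}")
print(f"constant predictor MSE: {baseline_mse:.3f}")
```

Because Z is never observed, even a perfect guess at the systematic part of f leaves an error of about the variance of Z (0.25 in this setup). The fitted line should come close to that floor, while a predictor that ignores X does noticeably worse.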
Causal Inference
One way to model the causal inference task is in terms of Rubin’s counterfactual model. There is a binary treatment Ti. Each agent has an outcome without treatment and an outcome with treatment, Yi0 and Yi1. Our task becomes to predict the treatment effect Yi1 − Yi0. To help us in this we have data on some individuals, but the fundamental difficulty in causal inference is that for any individual we can measure only one of Yi0 and Yi1. We may have auxiliary data on the individual in the form of a vector Xi of covariates.
But this is just a prediction task! We have Yit = f(Xi,Ti,Zi), where Zi is a potential random noise term. Why can’t we treat this as a simple prediction problem?
In treating this as a prediction problem we would once more have to make assumptions to get around the no free lunch theorem. In fact, the way the causal inference literature is different from the prediction literature is in terms of the assumptions that are generally made.
Problems:
Selection bias. Zi may be correlated with Ti. In that case an estimate that minimizes error over the observed data, such as the mean outcome of the treated minus the mean outcome of the untreated, is biased as an estimate of the treatment effect in the whole population. This is a problem with the estimation method.
One of the assumptions we may make is unconfoundedness: that Ti is independent of Zi conditional on Xi. This is a factorization-type condition which leads us to the correct estimation method to use. Such conditions are generally not available in the prediction context.
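The selection problem and the way unconfoundedness resolves it can be sketched in a small simulation. The data-generating process is entirely hypothetical: a single binary covariate Xi, a true treatment effect of 2, treatment probabilities of 0.8 and 0.2 depending on Xi, and a Zi that depends on Xi plus independent noise. By construction Zi is correlated with Ti, but independent of Ti once we condition on Xi.

```python
import random
from statistics import mean


def simulate(n, effect=2.0, seed=0):
    """Generate (x, t, y) rows from the hypothetical process described above."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.random() < 0.5                   # binary covariate
        t = rng.random() < (0.8 if x else 0.2)   # treatment depends on x only
        z = 3.0 * x + rng.gauss(0.0, 1.0)        # z depends on x, not on t
        y = effect * t + z
        rows.append((int(x), int(t), y))
    return rows


def naive_difference(rows):
    """Mean outcome of the treated minus mean outcome of the untreated."""
    treated = [y for _, t, y in rows if t == 1]
    untreated = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(untreated)


def stratified_difference(rows):
    """Difference in means within each x stratum, weighted by stratum share."""
    n = len(rows)
    total = 0.0
    for x_value in (0, 1):
        stratum = [row for row in rows if row[0] == x_value]
        total += len(stratum) / n * naive_difference(stratum)
    return total


rows = simulate(20000)
naive = naive_difference(rows)
adjusted = stratified_difference(rows)
print(f"naive difference in means: {naive:.2f}")
print(f"stratified on x:           {adjusted:.2f}")
```

In this setup the naive difference converges to 2 + 3 × (0.8 − 0.2) = 3.8 rather than 2, because the treated are disproportionately the high-Xi agents. Comparing treated and untreated within each value of Xi, which is exactly what the conditional independence licenses, recovers something close to the true effect.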
Both prediction and causal inference ask a counterfactual question: what will be the value of an outcome variable at an unobserved point in the domain, where in the causal inference case the domain includes a component for the treatment state. In both cases the task can succeed only if the mapping from domain to outcomes is in some sense regular enough to allow unseen cases to be inferred from seen ones.
Given this similarity in formal structure, the practice of causal inference differs from garden-variety prediction essentially in two ways. First, in causal settings we privilege accuracy in the prediction of treatment effects over other functions of the outcome variables. Second, the assumptions about the regularity of the function being estimated take a very specific “factorization” form: A is independent of B given C.
The first difference presumably can be folded into standard prediction methods by choosing the correct performance metric. What about the second? My sense is that the identification assumptions being made in the current economic literature are too strong, and we are going to see a causal inference crash when a new generation of work finds the current assumptions “incredible”. So instead of trying to shoehorn machine learning into the framework of current causal inference practice, as Athey and Mullainathan seem to be trying to do, it may be better to step back and think of a more data-driven approach to causal inference that would give more robust, even if weaker, conclusions. Of course that is easier said than done, otherwise I would be on MakeMyTrip looking for good Delhi → Stockholm → Delhi options.
[There seems to be lots of recent machine learning research that tries to exploit the connections between prediction and causal inference, but I see no connection to older practice in economics and other social sciences. Keywords: transfer learning, domain adaptation, covariate shift.]