The Basic Assumptions of the Regression Model

These assumptions are broken down into parts to allow case-by-case discussion.  The first assumption, "model produces data," is made by all statistical models.  That is what a statistical model is, by definition: a producer of data.  It is the assumption that your data are generated by a probabilistic process.  The rest of the assumptions characterize the process more specifically.

Model Produces Data Assumption

Research articles and other texts rarely state the "model produces data" assumption explicitly.  Instead, it is implicit in the way the model is written, for example

"Y = b0 + b1x + e,  where e is random variation."

The fact that "random variation" is specified means that the model assumes random generation of the data. It is the assumption that the data are produced from a probabilistic model. Here is the specific assumption:

Model produces data assumption:  For every X (X may be a vector as is the case in multiple regression), the value of Y is produced at random from a probability distribution. This distribution is allowed to depend on X:  Y|X ~ p(y|X).   In other words, the regression model states that for a given X, the value of Y is produced by the model p(y|X).

(Note:  It is possible that Y is completely unrelated to X, in which case  Y|X~p(y), and this violates no assumption of the regression model.) 

Example:  The number of bottles of wine purchased by a customer is modeled by the Poisson distribution, with a mean that depends on X=time in store.  Here

p(y|X) = e^(-m) m^y / y! ,   for y = 0, 1, 2, ...

The dependence of the distribution on X=x may be expressed by

m = exp(b0 + b1x).

The parameters b0 and b1 are unknown and can be estimated using data.
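To make "model produces data" concrete, here is a minimal sketch of the wine example in Python, using only the standard library.  The parameter values b0 = -1.0 and b1 = 0.05 are hypothetical, chosen only for illustration, and the helper names (poisson_pmf, conditional_mean, draw_poisson) are not from any particular package.  The sampler uses Knuth's multiplication method, which is adequate for small means like these.

```python
import math
import random

def poisson_pmf(y, m):
    """Poisson probability p(y) = e^(-m) m^y / y! for a count y = 0, 1, 2, ..."""
    return math.exp(-m) * m ** y / math.factorial(y)

def conditional_mean(x, b0, b1):
    """Mean of Y given X = x under the log link: m = exp(b0 + b1*x)."""
    return math.exp(b0 + b1 * x)

def draw_poisson(m, rng):
    """Draw one Poisson(m) count by Knuth's multiplication method."""
    threshold = math.exp(-m)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

# Hypothetical parameter values, for illustration only.
B0, B1 = -1.0, 0.05
rng = random.Random(1)  # fixed seed so the run is repeatable

# For each X = x, the model p(y|X = x) produces the Y values at random.
for x in (5, 20, 40):
    m = conditional_mean(x, B0, B1)
    ys = [draw_poisson(m, rng) for _ in range(5)]
    print(f"x = {x:2d}   mean m = {m:.2f}   produced Y values: {ys}")
```

Each run of the loop body is the model "producing data": the distribution changes with x through m, and the Y values vary at random around that mean.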

Note that the random generation assumption by itself makes no assumption about distributions, Poisson, normal or otherwise, and makes no specific assumptions about the functional relationships between Y and X (linear, quadratic, logarithmic, etc.).  As such, the model is fairly generic, and therefore quite benign.  Statistical data really do look as if generated from distributions, simply because randomly produced data exhibit variability, and because variability is real.  The "model produces (random) data" assumption is therefore quite realistic, in that the data produced by such a model look like the data you actually see.

The following assumptions make more specific statements about distributions and functional forms of relationships.  They define the classic regression model.  The usual output from any standard regression software makes these assumptions. These assumptions are very restrictive, though, and much of the course will be about alternative models that are more realistic.  Much of the course is also about identifying when you can use the more restrictive models, despite their being wrong.


Correct Functional Specification Assumption

The means of the distributions p(y|X) fall exactly on a function that is in the family f(X;b) that you specify, for some vector b of fixed, unknown values.  (Note: some of the values in the vector b can be 0's without violating the assumption.)
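As a rough illustration of what "fall exactly on a function in the family" means, the sketch below assumes a hypothetical quadratic family f(x;b) = b0 + b1x + b2x^2 and two hypothetical true mean functions.  It relies on a standard fact: on equally spaced x values, any quadratic has third differences equal to zero.  A linear mean passes this check (it is in the family with b2 = 0), while an exponential mean does not, so no choice of b makes the quadratic family correct for it.

```python
import math

def third_differences(values):
    """Difference a list of numbers three times in succession."""
    for _ in range(3):
        values = [b - a for a, b in zip(values, values[1:])]
    return values

def linear_mean(x):
    """Hypothetical true mean E(Y | X = x) = 2 + 3x."""
    return 2.0 + 3.0 * x

def exponential_mean(x):
    """Hypothetical true mean E(Y | X = x) = exp(0.5x)."""
    return math.exp(0.5 * x)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]  # equally spaced X values

# All zeros: the means lie on a quadratic (here with b2 = 0), so the
# quadratic family is a correct functional specification.
linear_check = third_differences([linear_mean(x) for x in xs])

# Not zero: no quadratic passes through all of these means, so the
# quadratic family is an incorrect specification for this process.
exponential_check = third_differences([exponential_mean(x) for x in xs])

print("linear mean:     ", linear_check)
print("exponential mean:", [round(d, 3) for d in exponential_check])
```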

Examples:

 

Constant Variance, or Homoscedasticity Assumption

The variances of the distributions p(y|X) are constant (i.e., they are all the same number, s^2) for all specific values X=x.
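A small simulation can show what constant variance looks like compared with its violation.  This is a sketch under assumed settings: the mean function 1 + 2x, the constant standard deviation 2, and the alternative s(x) = 0.5x are all hypothetical, and normal distributions are used only for convenience, since the assumption itself concerns variances, not distributional shape.

```python
import random
import statistics

def sample_variances(sd_fn, xs, n, rng):
    """For each x, draw n values of Y | X = x and return the sample variances."""
    result = {}
    for x in xs:
        ys = [rng.gauss(1.0 + 2.0 * x, sd_fn(x)) for _ in range(n)]
        result[x] = statistics.variance(ys)
    return result

def constant_sd(x):
    """Homoscedastic case: the same standard deviation at every x."""
    return 2.0

def growing_sd(x):
    """Heteroscedastic case: the standard deviation grows with x."""
    return 0.5 * x

rng = random.Random(2)  # fixed seed so the run is repeatable
xs = [1.0, 5.0, 10.0]

homo_vars = sample_variances(constant_sd, xs, 5000, rng)
hetero_vars = sample_variances(growing_sd, xs, 5000, rng)

for x in xs:
    print(f"x = {x:4.1f}   constant-variance model: {homo_vars[x]:6.2f}"
          f"   non-constant model: {hetero_vars[x]:6.2f}")
```

Under the first model the sample variances should all be near s^2 = 4 regardless of x; under the second they should increase with x.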

 

Uncorrelated Errors (or Conditional Independence) Assumption

The error e_i = Y_i - f(X_i;b) is uncorrelated with the error e_j = Y_j - f(X_j;b), for all pairs of distinct observations (i,j).

Alternatively, the components of Y are independent, given X=x.  (This latter form is used in maximum likelihood models such as logistic regression.)
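The sketch below contrasts errors that satisfy this assumption with errors that violate it.  The violation shown is a first-order autoregressive pattern, which is only one assumed example of correlated errors (it often arises when observations are ordered in time); the coefficient 0.8 and the function names are hypothetical.

```python
import random

def lag1_correlation(errors):
    """Sample correlation between each error and the next one in sequence."""
    mean = sum(errors) / len(errors)
    num = sum((errors[i] - mean) * (errors[i + 1] - mean)
              for i in range(len(errors) - 1))
    den = sum((e - mean) ** 2 for e in errors)
    return num / den

def independent_errors(n, rng):
    """Errors drawn independently: the assumption holds."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def autoregressive_errors(n, rho, rng):
    """Each error carries over a fraction rho of the previous one: the assumption fails."""
    errors = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        errors.append(rho * errors[-1] + rng.gauss(0.0, 1.0))
    return errors

rng = random.Random(3)  # fixed seed so the run is repeatable
r_independent = lag1_correlation(independent_errors(5000, rng))
r_autoregressive = lag1_correlation(autoregressive_errors(5000, 0.8, rng))

print(f"independent errors:    lag-1 correlation = {r_independent:.3f}")
print(f"autoregressive errors: lag-1 correlation = {r_autoregressive:.3f}")
```

The first correlation should be close to 0 and the second close to 0.8.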

 

Normality Assumption

The probability distribution function p(y|X) is a normal distribution for every X=x (i.e., for every X=x, the distribution of Y is normal, as opposed to Bernoulli, Poisson, multinomial, uniform, etc.).  Specifically, the normality assumption states that for a given X=x, Y is produced by a model having the functional form

p(y|X=x) = [1 / (sqrt(2 pi) s(x))] exp{ -(y - m(x))^2 / (2 s(x)^2) },

where m(x) is the mean and s(x) is the standard deviation of the distribution of Y given X=x.

This model allows nonlinearity, where m(x) is a curved function, and also allows heteroscedasticity, where s(x) is non-constant.
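As a sketch of this general normal model, the code below evaluates the normal density with mean m(x) and standard deviation s(x) and produces a few Y values from it.  The particular choices m(x) = 10 + 3 sqrt(x) (curved) and s(x) = 1 + 0.5x (non-constant) are hypothetical, picked only to show that normality by itself restricts neither the mean function nor the variance function.

```python
import math
import random

def m(x):
    """Hypothetical curved mean function."""
    return 10.0 + 3.0 * math.sqrt(x)

def s(x):
    """Hypothetical non-constant standard deviation function."""
    return 1.0 + 0.5 * x

def conditional_density(y, x):
    """Normal density of Y given X = x, with mean m(x) and standard deviation s(x)."""
    z = (y - m(x)) / s(x)
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * s(x))

def produce_y(x, rng):
    """Produce one Y value at random from the normal model at X = x."""
    return rng.gauss(m(x), s(x))

rng = random.Random(4)  # fixed seed so the run is repeatable
for x in (1.0, 4.0, 9.0):
    ys = [round(produce_y(x, rng), 1) for _ in range(4)]
    print(f"x = {x:3.1f}   m(x) = {m(x):5.1f}   s(x) = {s(x):3.1f}   produced Y values: {ys}")
```

Setting m(x) = b0 + b1x and s(x) equal to a constant would recover the classic model, in which linearity, constant variance, and normality all hold together.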