Regression measures the average relationship between two or more variables in terms of the original units of the data. It also attempts to establish the nature of that relationship, that is, to study the functional relationship between the variables and thereby provide a mechanism for prediction or forecasting.
We begin with the simplest possible regression analysis, namely the bivariate, or two-variable, regression. “Simple” refers to the number of independent variables: in a simple regression we have only one independent variable and one dependent variable, that is, one X and one Y. This case is considered first, partly for its practical adequacy and partly because it presents the fundamental ideas of regression analysis as simply as possible; some of these ideas can be illustrated with the aid of two-dimensional graphs.
3.1 Two main objectives of simple regression
1. To establish a relationship between two variables, or more specifically, to establish the statistical relationship between two variables.
Some examples: On average we expect that people or families that earn a higher income will generally spend more. In this case we are talking about a positive relationship between income and spending. We could also analyze and test the relationship between wage and gender. We could ask whether men are more likely to earn higher wages than women, that is, whether there is gender discrimination, which would show up as a negative relationship between being a woman and wages. We can use regression models to test whether that relationship exists.
2. To forecast new observations. For instance, if we know that our sales tend to grow over time, and we know how strong this relationship is and how fast our sales grow, we can use this information to predict or forecast what our sales will be over the next quarter.
3.2 A HYPOTHETICAL EXAMPLE OF TWO VARIABLE REGRESSION MODEL
In regression analysis, our main aim is to estimate and/or predict the population average (mean) value of the dependent variable on the basis of the known values of the explanatory variables. The population mean, or expected value, of a random variable Y is denoted by the symbol E(Y). To understand this, consider Table 3.1.
TABLE 3.1: The distribution of weekly family consumption for fixed levels of weekly family income.
The data in Table 3.1 refer to a total population of 36 families in a hypothetical community. The weekly income (X) and weekly consumption expenditure (Y) of these families are both measured in taka. The families are divided into 6 income groups, from 1000 taka to 3500 taka. We therefore have 6 fixed values of X, with the corresponding Y values shown in Table 3.1, and hence 6 Y sub-populations.
There is considerable variation in weekly consumption expenditure within each income group, which can be seen clearly from Figure 3.1. But the general picture is that, on average, weekly consumption expenditure increases as income increases. We can see this clearly from the bottom row of Table 3.1, which gives the mean, or average, weekly consumption expenditure corresponding to each of the 6 levels of income. The X values are 1000, 1500, 2000, 2500, 3000 and 3500 taka, the weekly incomes of the households, and the cell values in each column are the corresponding weekly consumption expenditures. The same data are shown in the scattergram of Figure 3.1, where the red dots indicate the conditional mean values of consumption expenditure.
The table also gives the total weekly consumption of all the households at each level of income, for example 4000 taka for the households whose weekly income is 1000 taka. The last row gives the expected value of consumption for a given level of income: 800 taka is the mean weekly consumption of all those households whose weekly income is 1000 taka, 1100 taka is the mean weekly consumption of all those households whose weekly income is 1500 taka, and so on. In all we have 6 mean values for the 6 sub-populations of Y. We call these mean values conditional expected values, as they depend on the given values of the (conditioning) variable X. Symbolically, we denote them as E(Y | X), which is read as the expected value of Y given the value of X (see also Table 3.2). From the data in Table 3.1 we can also find the conditional probabilities for the population of 36 observations, which are shown in Table 3.2.
In Table 3.2 each cell value is the probability of a particular level of consumption given a particular level of income. If income is 1000 taka, each conditional probability is 1/5, since there are 5 families in this group. Similarly, at the next level of income, 1500 taka, the conditional probability for each family is 1/6, since there are 6 families in this group, and so on. The conditional means are also given; they are the same as in the previous table: 800, 1100, 1400, 1700, 2000 and 2300.
It is important to distinguish these conditional expected values from the unconditional expected value of weekly consumption expenditure, E(Y). If we add the weekly consumption expenditures of all 36 families in the population and divide by the total number of families, we get 1583.33 taka (57000/36). This is the unconditional mean, or expected value, of weekly consumption expenditure, E(Y); it is unconditional in the sense that in arriving at this number we have disregarded the income levels of the various families. Obviously, the various conditional expected values of Y given in Table 3.1 differ from the unconditional expected value of 1583.33 taka. The expected weekly consumption expenditure of a family is 1583.33 taka, but the conditional expected weekly consumption expenditure of a family whose weekly income is, say, 1000 taka is 800 taka. The conditional expected values differ across income levels, and they differ from the unconditional expected value. The unconditional expected value gives the average consumption expenditure over all income groups, while a conditional expected value gives the average consumption expenditure of a particular income group.
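The two kinds of average can be checked with a few lines of Python. The sketch below uses only figures stated in the text: the five consumption values of the families whose weekly income is 1000 taka, and the population total of 57000 taka over 36 families. The variable names are illustrative.

```python
from fractions import Fraction

# Weekly consumption (taka) of the five families whose weekly income is 1000 taka.
y_given_x1000 = [700, 750, 800, 850, 900]

# Each family in the sub-population is equally likely: P(Y = y | X = 1000) = 1/5.
p = Fraction(1, len(y_given_x1000))

# Conditional expectation E(Y | X = 1000) as a probability-weighted sum.
conditional_mean = sum(y * p for y in y_given_x1000)

# Unconditional expectation E(Y): total consumption of all 36 families
# (57000 taka, as stated in the text) divided by the number of families.
unconditional_mean = Fraction(57000, 36)

print(conditional_mean)                      # 800
print(round(float(unconditional_mean), 2))   # 1583.33
```

The conditional mean uses only the families in one income group, while the unconditional mean ignores the grouping altogether, which is why the two numbers differ.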
We can plot the conditional mean values of Y against the various X values, as shown by the dark circled points in Figure 3.2. If we join these conditional mean values, we obtain the line known as the population regression line (PRL) or, more generally, the population regression curve. More simply, it is the regression of Y on X. Obtaining the PRL is arguably the central aim of regression analysis.
The adjective “population” comes from the fact that we are dealing in this example with the entire population of 36 families. Of course, in reality a population may have many more families. Geometrically, a population regression line is simply the locus of the conditional means of the dependent variable for the given (fixed) values of the explanatory variable(s). More simply, it is the curve connecting the means of the sub-populations of Y corresponding to the given values of the regressor X, so the regression curve passes through these conditional mean values. We assume for simplicity that the Y values are symmetrically distributed around their respective conditional mean values.
3.3 THE CONCEPT OF POPULATION REGRESSION
POPULATION REGRESSION FUNCTION (PRF)
The population regression function (PRF) is the locus of the conditional means of the dependent variable Y for the fixed values of the independent variable X. We know what a population regression line is, but how do we represent it as a function? Let us continue with the income and expenditure example discussed above (Section 3.2). We have seen how the conditional mean of consumption expenditure depends on the income level. In other words, the conditional expected value E (Y | Xi) is a function of Xi, where Xi is a given value of X (here, a particular level of income). Mathematically we can write
E (Y | Xi) = f (Xi) ————————— (1)
where f (Xi) denotes some function of the explanatory variable X. In our example, E (Y | Xi) is a linear function of Xi. Equation (1) is known as the conditional expectation function (CEF), the population regression function (PRF), or population regression (PR) for short. It states merely that the expected value of the distribution of Y given Xi is functionally related to Xi. The question that arises is what form the function f (Xi) takes. The functional form of the PRF is both an empirical and a theoretical question. For instance, many economists assume, for simplicity and as an initial working hypothesis, that consumption expenditure and income are linearly related. If we assume that the PRF E (Y | Xi) is a linear function of Xi, we can write
E (Y | Xi) = β1 + β2Xi———————– (2)
where β1 and β2 are unknown but fixed parameters known as the regression coefficients; β1 and β2 are also known as the intercept and slope coefficients, respectively. Equation (2) itself is known as the linear population regression function, the linear population regression model, or simply the linear population regression. In the sequel, the terms regression, regression equation, and regression model will be used synonymously.
In regression analysis our interest is in estimating the PRF like (2), that is, estimating the values of the unknowns β1 and β2 on the basis of observations on Y and X.
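As a small numerical check, the six conditional means reported in the text (800, 1100, 1400, 1700, 2000 and 2300 taka at incomes of 1000 to 3500 taka) lie exactly on a straight line, so for this hypothetical population β1 and β2 can be read off from any two of them. The Python sketch below does this and confirms that all six means sit on the resulting line. It is only an illustration for the hypothetical population: in practice the population means are unknown, and the coefficients have to be estimated from a sample.

```python
# Income levels X (taka) and the conditional means E(Y | X) given in the text.
incomes = [1000, 1500, 2000, 2500, 3000, 3500]
cond_means = [800, 1100, 1400, 1700, 2000, 2300]

# Slope from the first and last points, then intercept from the first point.
beta2 = (cond_means[-1] - cond_means[0]) / (incomes[-1] - incomes[0])
beta1 = cond_means[0] - beta2 * incomes[0]

# Verify that every conditional mean lies on the line beta1 + beta2 * X.
on_line = all(
    abs(beta1 + beta2 * x - m) < 1e-9 for x, m in zip(incomes, cond_means)
)

print(beta1, beta2, on_line)
```

For these figures the linear PRF works out to E(Y | Xi) = 200 + 0.6Xi: each additional taka of weekly income raises mean weekly consumption by 0.6 taka.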
THE SAMPLE REGRESSION FUNCTION (SRF)
In reality the functional form f (X) is not always known; it is very difficult to know the exact form of the relationship between the dependent variable and the independent variable. Our objective is therefore to estimate the population regression function (PRF), using sample information to estimate population values.
Now consider two random samples drawn from the population represented in Table 3.1. In sample 1 the values of Y are 700, 915, 3050, 1800, 2150 and 2400, given in the first column, and the corresponding values of X, the income levels, are given in the second column and range from 1000 to 3500 taka. Sample 2 again consists of expenditure, denoted by Y, and income, denoted by X. The income levels are the same, 1000 to 3500, which means that we have kept income fixed, but the values of expenditure Y are now different and range from 800 to 2250.
The main objective of regression analysis is to estimate the population regression function with the help of the sample regression function (SRF). Note that an estimator, also known as a (sample) statistic, is simply a rule, formula, or method that tells us how to estimate the population parameter on the basis of the sample data. A particular numerical value obtained by applying the estimator to a sample is called an estimate.
Now we can draw the sample regression curve, or line, based on Table 3.3.
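The SRF can be written as
Ŷi = β̂1 + β̂2Xi    (3)
where Ŷi is the estimator of E(Y | Xi), and β̂1 and β̂2 are the estimators of β1 and β2. Just as we expressed the PRF in two equivalent forms, (2) and (9), we can express the SRF (3) in its stochastic form as follows:
Yi = β̂1 + β̂2Xi + ûi    (4)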
where, in addition to the symbols already defined, ûi denotes the (sample) residual term. Conceptually ûi is analogous to ui and can be regarded as an estimate of ui. It is introduced in the SRF for the same reasons as ui was introduced in the PRF.
To sum up, our primary objective in regression analysis is to estimate the PRF
Yi = β1 + β2Xi + ui
on the basis of the SRF (hats denote quantities estimated from the sample)
Yi = β̂1 + β̂2Xi + ûi
because, more often than not, our analysis is based upon a single sample from some population. But because of sampling fluctuations, our estimate of the PRF based on the SRF is at best an approximate one. This approximation is shown diagrammatically in Figure 3.3. For X = Xi, we have one (sample) observation Y = Yi. In terms of the SRF, the observed Yi can be expressed as
Yi = Ŷi + ûi    (5)
and in terms of the PRF, it can be expressed as
Yi = E(Y | Xi) + ui ——————————(6)
Now, obviously, in Figure 3.4 the SRF overestimates the true E(Y | Xi) for the Xi shown. By the same token, for any Xi to the left of the point A, the SRF will underestimate the true PRF. But the reader can readily see that such over- and underestimation is inevitable because of sampling fluctuations.
The critical question now is: Granted that the SRF is but an approximation of the PRF, can we devise a rule or a method that will make this approximation as “close” as possible? In other words, how should the SRF be constructed so that estimated β1 is as “close” as possible to the true β1 and estimated β2 is as “close” as possible to the true β2 even though we will never know the true β1 and β2?
Procedures can indeed be developed to construct the SRF so that it mirrors the PRF as closely as possible. It is fascinating that this can be done even though we never actually determine the PRF itself.
To restate, the main objective of regression analysis is to estimate the population regression function with the help of the sample regression function. The sample estimates are used as approximations of the population parameters. These sample estimates vary from one sample to another, so N different samples give N different SRFs, and these SRFs are generally not the same. Our task is therefore to find the sample regression function that makes this approximation as close as possible, because in that case the SRF can satisfactorily be used in place of the population regression function.
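The way SRFs change from sample to sample can be illustrated with a small simulation. The Python sketch below is only an illustration under stated assumptions: it takes the line through the conditional means in the text, E(Y | Xi) = 200 + 0.6Xi, as the PRF, draws one Y per income level with normally distributed noise (the normal shape and the standard deviation of 100 taka are assumptions, not taken from the text), and fits each sample by ordinary least squares, used here simply as one convenient rule for constructing an SRF.

```python
import random

def ols(xs, ys):
    """Return the OLS intercept and slope for a two-variable regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

random.seed(0)                       # fixed seed so the run is repeatable
incomes = [1000, 1500, 2000, 2500, 3000, 3500]
BETA1, BETA2 = 200.0, 0.6            # line through the conditional means in the text
NOISE_SD = 100.0                     # assumption: spread of Y around its mean

srfs = []
for _ in range(5):                   # five samples, one Y per income level
    sample_y = [BETA1 + BETA2 * x + random.gauss(0, NOISE_SD) for x in incomes]
    srfs.append(ols(incomes, sample_y))

for b1_hat, b2_hat in srfs:
    print(f"intercept = {b1_hat:8.2f}, slope = {b2_hat:.4f}")
```

Each printed pair is a different SRF for the same PRF. None of them is expected to reproduce the assumed intercept and slope exactly, which is the sampling fluctuation described above.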
3.4 THE MEANING OF THE TERM LINEAR
As our concern is primarily with linear models like E(Y | Xi) = β1 + β2Xi, we should know what the term linear really means. It can be interpreted in two different ways: linearity in the variables and linearity in the parameters.
A regression that graphs as a straight line is called a linear regression model. Of the two graphs in Figure 3-5, one shows a straight regression line and the other a curved one. The blue regression line shows the linear regression model, while the black regression line shows the nonlinear regression model.
Figure 3-5: Linear and non-linear regression model
Linearity in the Variables
A regression is linear in the variables if it has two or more variables and none of the variables appears with a power greater than 1. In other words, linearity in the variables means that the conditional expectation of Y is a linear function of Xi, as in E (Y | Xi) = β1 + β2Xi, which is represented by the straight (blue) line. The model may be a simple or a multiple regression model. A regression function such as E (Y | Xi) = β1 + β2X³i is not a linear function, because the variable X appears with a power of 3; it gives the black line, which is not straight.
Linearity in the Parameters
The second interpretation of linearity is that the conditional expectation of Y, E (Y | Xi), is a linear function of the parameters, the β’s; that is, none of the parameters appears with a power greater than 1. It may or may not be linear in the variable X. A function is said to be linear in a parameter, say β1, if β1 appears with a power of 1 only and is not multiplied or divided by any other parameter (for example, β1β2, β2/β1, and so on). In this interpretation E (Y | Xi) = β1 + β2X²i is a linear (in the parameters) regression model. To see this, suppose X takes the value 3. Then E (Y | X = 3) = β1 + 9β2, which is obviously linear in β1 and β2. All the models shown in Figure 3.6 are thus linear regression models, that is, models linear in the parameters. Now consider the model E(Y | Xi) = β1 + β²2 X²i.
Now suppose X = 3; then we obtain E(Y | X = 3) = β1 + 9β²2, which is nonlinear in the parameter β2. The preceding model is an example of a nonlinear (in the parameters) regression model. Of the two interpretations of linearity, linearity in the parameters is the one relevant for the development of the regression theory to be presented shortly.
Therefore, from now on the term “linear” regression will always mean a regression that is linear in the parameters, the β’s (that is, the parameters are raised to the first power only). It may or may not be linear in the explanatory variables, the X’s. Schematically, we have Table 3.3.
Thus, E (Y | Xi) = β1 + β2Xi, which is linear both in the parameters and variable, is a LRM, and so is E (Y | Xi) = β1 + β2X²i , which is linear in the parameters but nonlinear in variable X.
Note: LRM = linear regression model
NLRM = nonlinear regression model
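One way to see the difference numerically is to use the fact that a function linear in its parameters is additive in them: evaluating it at the sum of two parameter vectors gives the sum of the two evaluations. The Python sketch below applies this check at X = 3 to the two models discussed above. The function names and the parameter values are hypothetical, chosen only for illustration.

```python
def linear_in_params(beta1, beta2, x):
    """E(Y | X) = beta1 + beta2 * X**2: nonlinear in X, linear in the betas."""
    return beta1 + beta2 * x ** 2

def nonlinear_in_params(beta1, beta2, x):
    """E(Y | X) = beta1 + beta2**2 * X**2: beta2 enters with a power of 2."""
    return beta1 + beta2 ** 2 * x ** 2

def is_additive(model, b, c, x):
    """Check model(b + c) == model(b) + model(c) at a fixed x."""
    combined = model(b[0] + c[0], b[1] + c[1], x)
    separate = model(b[0], b[1], x) + model(c[0], c[1], x)
    return combined == separate

# Hypothetical parameter vectors, chosen only for illustration.
b, c = (1, 2), (3, 4)

print(is_additive(linear_in_params, b, c, x=3))     # True
print(is_additive(nonlinear_in_params, b, c, x=3))  # False
```

The first model passes the check even though X enters squared, so it is an LRM in the sense used here; the second fails because β2 is squared.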
3.5 STOCHASTIC DISTURBANCE TERM
Coming back to our example, we find that average consumption expenditure increases as income increases. But for a particular family this need not be true. For instance, there is a family with an income of tk 1500 whose consumption expenditure is just tk 850, which is below the expenditure of tk 900 of one family whose income is tk 1000. Notice, however, that the average consumption expenditure of families with a weekly income of tk 1500 is greater than the average consumption expenditure of families with a weekly income of tk 1000 (tk 1100 versus tk 800). So there are families whose consumption expenditure deviates from the average, and this deviation of individual expenditure from its expected, or mean, value can be expressed formally. From Figure 3.1 we see that, given the income level Xi, an individual family’s consumption expenditure is clustered around the average consumption of all families at that Xi, that is, around its conditional expectation. Therefore, we can express the deviation of an individual Yi around its expected value as follows:
ui = Yi − E(Y | Xi)————————————-(7)
Yi = E(Y | Xi) + ui———————————— (8)
where the deviation ui is an unobservable random variable taking positive or negative values. This ui is known as the stochastic disturbance or stochastic error term. We can therefore say that the expenditure of an individual family, given its income level, can be expressed as the sum of two components:
1. E(Y | Xi), which is simply the mean consumption expenditure of all the families with the same level of income. This component is known as the systematic, or deterministic, component.
2. ui, which is the random, or non-systematic, component. We assume that it is a proxy for all the omitted or neglected variables that may affect Y but are not (or cannot be) included in the regression model.
If E(Y | Xi) is assumed to be linear in Xi, as in (2), Eq. (8) may be written as
Yi = E(Y | Xi) + ui
= β1 + β2Xi + ui———————- (9)
Equation (9) posits that the consumption expenditure of a family is linearly related to its income plus the disturbance term. Thus, the individual consumption expenditures, given X = 1000 tk (see Table 3.1), can be expressed as
Y1 = 700 = β1 + β2(1000) + u1
Y2 = 750 = β1 + β2(1000) + u2
Y3 = 800 = β1 + β2(1000) + u3 ——————————(10)
Y4 = 850 = β1 + β2(1000) + u4
Y5 = 900 = β1 + β2(1000) + u5
Now if we take the expected value of (8) on both sides, we obtain
E(Yi | Xi) = E[E(Y | Xi)] + E(ui | Xi)
= E(Y | Xi) + E(ui | Xi)———————————- (11)
where we have used the fact that the expected value of a constant is that constant itself. Notice carefully that in (11) we have taken the conditional expectation, conditional upon the given X’s.
Since E(Yi | Xi) is the same thing as E(Y | Xi), Eq. (11) implies that
E(ui | Xi) = 0—————————————- (12)
Thus, the assumption that the regression line passes through the conditional means of Y (see Figure 3.2) implies that the conditional mean values of ui (conditional upon the given X’s) are zero. From the previous discussion, it is clear that (2) and (8) are equivalent forms if E(ui | Xi) = 0. But the stochastic specification (8) has the merit that it clearly shows that there are other variables besides income that affect consumption expenditure, and that an individual family’s consumption expenditure cannot be fully explained by the variable(s) included in the regression model alone.
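The result E(ui | Xi) = 0 can be verified directly for the sub-population with X = 1000 tk, using the five consumption values listed in (10). The Python sketch below computes each disturbance as in equation (7) and averages them; the variable names are illustrative.

```python
# Consumption (tk) of the five families with weekly income X = 1000 tk, from (10).
y_values = [700, 750, 800, 850, 900]

# Conditional mean E(Y | X = 1000), the systematic component.
cond_mean = sum(y_values) / len(y_values)

# Disturbances u_i = Y_i - E(Y | X_i), as in equation (7).
disturbances = [y - cond_mean for y in y_values]

print(disturbances)                            # [-100.0, -50.0, 0.0, 50.0, 100.0]
print(sum(disturbances) / len(disturbances))   # 0.0
```

The positive and negative deviations cancel exactly because they are measured from the sub-population's own mean.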
3.5 THE SIGNIFICANCE/THE ROLE AND IMPORTANCE OF THE STOCHASTIC DISTURBANCE TERM
The stochastic disturbance term ui captures all the omitted variables that collectively affect, or may affect, the dependent variable but are not included in the model. It is not possible to introduce explicitly into the model all the variables that affect the dependent variable. There are many reasons for this.
- Vagueness of theory: The theory determining the behavior of Y is often incomplete. We might know for certain that weekly income X influences weekly consumption expenditure Y, but we might be unsure about the other variables affecting Y. Therefore, ui may be used as a substitute for all the variables excluded or omitted from the model.
- Unavailability of data: Even if we know what some of the excluded variables are, and therefore consider a multiple regression rather than a simple regression, we may not have quantitative information about these variables. For example, in principle we could introduce family wealth as an explanatory variable in the model to explain family consumption expenditure. But unfortunately, it is difficult to collect information on family wealth. Therefore, we may exclude the wealth variable from our model despite its great theoretical relevance in explaining consumption expenditure.
- Core variables versus minor variables: In our consumption-income example, assume that besides income X1, the number of children per family X2, sex X3, religion X4, education X5, and geographical region X6 also affect consumption expenditure. But the joint effect of all or some of these variables may be so small, and at best random, that as a practical matter and for cost considerations it is impractical to introduce them into the model explicitly. So we hope that their combined effect can be treated as a random variable ui.
- Intrinsic randomness in human behavior: Even if we succeed in introducing all the relevant variables into the model, there is bound to be some “natural” randomness in individual Y’s that cannot be explained no matter how hard we try. The disturbances, the u’s, may very well reflect this natural randomness.
- Poor proxy variables: Although the regression model assumes that the Y and X variables are measured accurately, in practice the data may suffer from errors of measurement. Consider, for example, Milton Friedman’s well-known theory of the consumption function, in which he considers permanent consumption (Yp) as a function of permanent income (Xp). Since data on these variables are not directly observable, in practice we use proxy variables, such as current consumption (Y) and current income (X), which can be observed. Since the observed Y and X may not equal Yp and Xp, there is the problem of errors of measurement. The disturbance term u may in this case represent the errors of measurement, and they can have serious implications for estimating the regression coefficients, the β’s.
- Principle of parsimony: Following Occam’s razor, we would like to keep our regression model as simple as possible. If we can explain the behavior of Y “significantly” with two or three explanatory variables, and if our theory is not strong enough to suggest what other variables might be included, we do not introduce more variables; we let ui represent all other variables. Of course, we should not exclude relevant and important variables just to keep the regression model simple.
- Wrong functional form: Even if we have theoretically correct variables explaining a phenomenon, and even if we can obtain data on these variables, very often we do not know the form of the functional relationship between the regressand and the regressors. Is consumption expenditure a linear (in variable) function of income or a nonlinear (in variable) function? If it is linear, Yi = β1 + β2Xi + ui is the proper functional relationship between Y and X, but if it is nonlinear, Yi = β1 + β2Xi + β3X²i + ui may be the correct functional form. In two-variable models the functional form of the relationship can often be judged from the scattergram. But in a multiple regression model it is not easy to determine the appropriate functional form graphically, because we cannot visualize scattergrams in multiple dimensions.
For all the above reasons, the stochastic disturbance ui plays an extremely vital role in regression analysis.
3.6 The difference between a residual and an error term.
The error term is also known as the disturbance term. To discuss it, let us take a table of earnings data in which earnings is the dependent variable (Y) and tenure, the time spent in a company, is the explanatory variable (X). We build a model to help explain earnings from tenure. Let us forget about the units. We have five pairs of observations. We draw a scatterplot, labeling the vertical axis Y and the horizontal axis X, and plot the 5 points. For the first observation we have earnings of 25 and tenure of 8; we trace these values to the respective axes and put a dot where they meet, and we do the same for the other observations. Having plotted all the observations, we see that a good line will tend to run among the points. This line is called the population regression line, or true line. Since it is not possible to know the true (population) line, we have to estimate it using the OLS method or another method such as maximum likelihood. The estimators give us an estimated line that lies close to the true line: generally it will be close to it, but not necessarily equal to it.
The error term is the error in the true line’s prediction of Y, that is, the actual Y value minus the Y value that lies on the true line. It is shown in the above figure by the red arrow, and we denote it by u. The residual, û, is the corresponding deviation from the estimated line, that is, the actual Y value minus the Y value that lies on the estimated line. It is shown in the above figure by the blue arrow.
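The distinction can be made concrete with a small Python sketch. Everything in it is hypothetical except the first observation (tenure 8, earnings 25), which is the one quoted in the text: the other four data points and the “true” line are made up for illustration, since in practice the true line is never known. The estimated line is obtained by OLS.

```python
# Hypothetical data: five (tenure, earnings) pairs. Only the first pair (8, 25)
# comes from the text; the other four are made up for illustration.
tenure = [8, 3, 5, 10, 6]
earnings = [25, 17, 19, 31, 23]

# Assumption: a "true" line that we pretend to know. In practice it is unknown.
TRUE_B1, TRUE_B2 = 10.0, 2.0

# OLS estimates of the intercept and slope (the estimated line).
n = len(tenure)
x_bar = sum(tenure) / n
y_bar = sum(earnings) / n
b2_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(tenure, earnings)) / sum(
    (x - x_bar) ** 2 for x in tenure
)
b1_hat = y_bar - b2_hat * x_bar

# Error term u: actual Y minus the Y on the true line.
errors = [y - (TRUE_B1 + TRUE_B2 * x) for x, y in zip(tenure, earnings)]
# Residual u-hat: actual Y minus the Y on the estimated line.
residuals = [y - (b1_hat + b2_hat * x) for x, y in zip(tenure, earnings)]

print("errors:   ", errors)
print("residuals:", [round(r, 3) for r in residuals])
print("sum of residuals:", round(sum(residuals), 9))
```

With an intercept in the model, OLS residuals sum to zero (up to rounding) by construction, whereas the errors measured from the true line need not. The residuals can be computed from the sample, while the errors cannot, because they depend on the unknown true line.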