2.1 THE HISTORICAL ORIGIN OF REGRESSION
The term "regression" was introduced by Francis Galton in his famous paper "Family Likeness in Stature" to relate the heights of parents to those of their children. One might expect short parents to have short children and tall parents to have tall children. Galton found, however, that the average or mean height of children born to unusually tall or unusually short parents tends to move, or "regress," toward the average height in the population as a whole. In the words of Galton, this was "regression to mediocrity." The same finding was later confirmed by Karl Pearson, who analyzed thousands of records of the heights of parents and children.
2.2 THE MODERN INTERPRETATION OF REGRESSION
Regression is the study of the dependence of one variable, the dependent variable, on one or more other variables, the explanatory or independent variables, with a view to estimating and/or predicting the (population) mean or average value of the former (the dependent variable) in terms of the known or fixed (in repeated sampling) values of the latter (the independent variables).
In short, regression analysis is a statistical method for studying the relationship between two or more variables, and in particular between economic variables.
For example, in economics we have Keynes's psychological law of consumption, which states that when income increases, consumption also increases, but by less than the increase in income. In other words, the marginal propensity to consume always lies between zero and one.
2.3 TERMINOLOGY AND NOTATION OF DEPENDENT AND INDEPENDENT VARIABLE
In the regression literature the dependent variable and the independent variable go by a variety of names. A representative list is given below:

Dependent variable: explained variable, predictand, regressand, response, endogenous variable, outcome.
Independent variable: explanatory variable, predictor, regressor, stimulus, exogenous variable, covariate.
Although it is largely a matter of personal taste and tradition, in this text the dependent variable/explanatory variable or the more neutral regressand/regressor terminology will be used. As for notation, the dependent variable is generally denoted by Y, and the independent variables by X1, X2, X3, ..., Xk, where k is the number of independent variables. The subscript i or t denotes the ith or the tth observation or value; Xki (or Xkt) denotes the ith (or tth) observation on variable Xk. N (or T) denotes the total number of observations or values in the population, and n (or t) the total number of observations in a sample. As a matter of convention, the observation subscript i will be used for cross-sectional data (i.e., data collected at one point in time) and the subscript t for time series data.
2.4 TYPES OF REGRESSION ANALYSIS
Regression models are usually of two types. The simple regression model may be further divided into the simple linear regression model and the simple nonlinear regression model. Similarly, the multiple regression model may be subdivided into the multiple linear regression model and the multiple nonlinear regression model. So we have a four-way classification: regression models are divided into simple and multiple, and each of these is further divided into linear and nonlinear. At a glance:

- Simple linear regression model
- Simple nonlinear regression model
- Multiple linear regression model
- Multiple nonlinear regression model
In a simple regression model there is one independent variable, while in a multiple regression model there is more than one. The term linearity also has two different meanings: linearity in the parameters and linearity in the variables. In regression analysis, a linear regression model means a model that is linear in the parameters; it may be either linear or nonlinear in the explanatory variables.
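The distinction between linearity in parameters and linearity in variables can be sketched in a few lines of Python. The data and the model below are hypothetical illustrations: the model Y = b1 + b2*X^2 is nonlinear in X, yet it can be fit by ordinary least squares simply by regressing Y on the transformed variable Z = X^2, because it is linear in the parameters b1 and b2.

```python
# A model linear in parameters can still be nonlinear in the variables.
# Here Y = b1 + b2 * X**2 is fit with two-variable OLS by regressing Y
# on the transformed variable Z = X**2 (hypothetical data for illustration).

def ols(x, y):
    """Closed-form OLS for a two-variable model: returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
            sum((xi - mx) ** 2 for xi in x)
    return my - slope * mx, slope

X = [1, 2, 3, 4, 5]
Y = [3, 9, 19, 33, 51]      # generated exactly as Y = 1 + 2 * X**2
Z = [xi ** 2 for xi in X]   # transform: the model is linear in b1 and b2

intercept, slope = ols(Z, Y)
print(intercept, slope)     # -> 1.0 2.0 (recovers b1 = 1, b2 = 2)
```

The same closed-form slope and intercept formulas work unchanged because, after the transformation, the model is an ordinary linear regression of Y on Z.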
If we are studying the dependence of a variable on only a single explanatory variable, such as that of consumption expenditure on real income, the study is known as simple, or two-variable, regression analysis. However, if we are studying the dependence of one variable on more than one explanatory variable, as in the crop-yield example involving rainfall, temperature, sunshine, and fertilizer, it is known as multiple regression analysis. In other words, in two-variable regression there is only one explanatory variable, whereas in multiple regression there is more than one explanatory variable.
2.5 RANDOM VARIABLE VS NONSTOCHASTIC VARIABLE
The term random is a synonym for stochastic. A random, or stochastic, variable is a variable that can take on any value, positive or negative, with a given probability. In regression analysis the dependent variable is a random variable, while the independent variables are treated as fixed.
On the other hand, a variable that does not possess a probability distribution and is fixed in repeated sampling is called a nonstochastic variable. In regression analysis the independent variables are nonstochastic.
2.6 THE USES OF REGRESSION
1. According to the modern view of regression, we want to find out how the average height of sons changes given the fathers' height. In other words, our concern is with predicting the average height of sons knowing the height of their fathers. To see how this can be done, consider Figure 2.2, which is a scatter diagram, or scattergram. This figure shows the distribution of heights of sons in a hypothetical population corresponding to the given or fixed values of the fathers' height. Note that corresponding to any given height of a father there is a distribution of the heights of the sons, and that despite this variability, the average height of sons generally rises as the height of the father rises. The circled crosses in the figure indicate the average height of sons corresponding to a given height of the father. Connecting these averages, we obtain the line shown in the figure. This line is known as the regression line. It shows how the average height of sons increases with the father's height.
2. Consider Figure 2.3, which gives the distribution in a hypothetical population of heights of boys measured at fixed ages. Corresponding to any given age, we have a range of heights; obviously, not all boys of a given age are likely to have identical heights. But height on the average increases with age (of course, up to a certain age), as can be seen clearly if we draw a line (the regression line) through the circled points that represent the average height at the given ages. Thus, knowing the age, we may be able to predict from the regression line the average height corresponding to that age.
3. An economist may be interested in studying the dependence of personal consumption expenditure on disposable real personal income. Such an analysis may be helpful in estimating the marginal propensity to consume (MPC), that is, the average change in consumption expenditure for, say, a dollar's worth of change in real income (see Figure 2.4).
Figure 2.4: Personal consumption expenditure (Y) in relation to GDP (X)
4. A monopolist who can fix the price or output (but not both) may want to find out the response of the demand for a product to changes in price. Such an experiment may enable the estimation of the price elasticity of the demand for the product and may help determine the most profitable price.
5. A labor economist may want to study the rate of change of money wages in relation to the unemployment rate. This relation is captured by the well-known Phillips curve, shown in Figure 2.5, which relates changes in money wages to the unemployment rate. Such a scattergram may enable the labor economist to predict the average change in money wages given a certain unemployment rate. Such knowledge may be helpful in saying something about the inflationary process in an economy, for increases in money wages are likely to be reflected in increased prices.
6. A monetary economist may want to know how, other things remaining the same, the proportion k (k = Money/Income) responds to the inflation rate. As Figure 2.6 shows, the higher the rate of inflation π, the lower the proportion k of their income that people would want to hold in the form of money. A quantitative analysis of this relationship will enable the monetary economist to estimate the amount of money, as a proportion of income, that people would want to hold at various rates of inflation.
Figure 2.6: Money holding in relation to the inflation rate π.
7. The marketing director of a company may want to know how the demand for the company's product is related to, say, advertising expenditure. Such a study will be of considerable help in finding out the elasticity of demand with respect to advertising expenditure, that is, the percent change in demand in response to, say, a 1 percent change in the advertising budget. This knowledge may be helpful in determining the "optimum" advertising budget.
8. Finally, an agronomist may be interested in studying the dependence of crop yield, say, of wheat, on temperature, rainfall, amount of sunshine, and fertilizer. Such a dependence analysis may enable the prediction or forecasting of the average crop yield, given information about the explanatory variables.
The studies of dependence among variables discussed above are examples of regression; more technically, they illustrate some uses of regression analysis.
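The first use above, predicting the average height of sons given the fathers' heights, can be sketched numerically. The heights below are hypothetical; the point is only that the regression line arises by connecting conditional means:

```python
# Sketch of how a regression line arises from conditional means: for each
# fixed father's height, average the heights of the corresponding sons
# (hypothetical data for illustration).
from collections import defaultdict

# (father's height, son's height) pairs in inches -- hypothetical values
pairs = [(65, 66), (65, 68), (68, 67), (68, 69), (68, 71),
         (71, 70), (71, 72), (71, 74)]

groups = defaultdict(list)
for father, son in pairs:
    groups[father].append(son)

# The "circled crosses" of Figure 2.2: the average son's height at each
# fixed father's height; connecting them traces the regression line.
for father in sorted(groups):
    sons = groups[father]
    print(father, sum(sons) / len(sons))
```

With these hypothetical numbers the conditional averages (67.0, 69.0, 72.0 inches) rise with the father's height, just as the regression line in the figure does.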
2.7 STATISTICAL VERSUS DETERMINISTIC RELATIONSHIPS
In regression analysis we look for a statistical, not a deterministic, relationship among the variables. In a statistical relationship we deal essentially with stochastic or random variables, that is, variables that possess probability distributions. For example, crop yield depends on temperature, rainfall, and fertilizer, but an agronomist cannot predict crop yield exactly, because other factors collectively affect the yield but may be difficult to identify individually. Thus, there is bound to be some "intrinsic" or random variability in the dependent variable, crop yield, that cannot be fully explained no matter how many explanatory variables we include.
In deterministic phenomena, on the other hand, we deal with relationships of the type exhibited, say, by Newton's law of gravity, which states: Every particle in the universe attracts every other particle with a force directly proportional to the product of their masses and inversely proportional to the square of the distance between them. Symbolically, F = k(m1m2/r^2), where F = force, m1 and m2 are the masses of the two particles, r = distance, and k = constant of proportionality. Another example is Ohm's law, which states: For metallic conductors over a limited range of temperature, the current C is proportional to the voltage V; that is, C = (1/k)V, where 1/k is the constant of proportionality. Other examples of such deterministic relationships are Boyle's gas law, Kirchhoff's law of electricity, and Newton's law of motion. In this text we are not concerned with such deterministic relationships. Of course, if there are errors of measurement, say, in the k of Newton's law of gravity, the otherwise deterministic relationship becomes a statistical relationship. In this situation, force can be predicted only approximately from the given value of k (and m1, m2, and r), which contains errors. The variable F in this case becomes a random variable.
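The contrast can be made concrete with a short sketch (all numbers are hypothetical illustrations, not physical constants): computed from exact inputs, Newton's formula returns the same force every time, but once k carries a random measurement error, repeated computations of F scatter around the deterministic value, and F behaves as a random variable.

```python
# Sketch of the contrast drawn above: Newton's law F = k * m1 * m2 / r**2 is
# deterministic, but adding measurement error turns F into a random variable.
# All numbers are hypothetical illustrations, not physical constants.
import random

def force(k, m1, m2, r):
    return k * m1 * m2 / r ** 2

k, m1, m2, r = 2.0, 3.0, 4.0, 2.0
print(force(k, m1, m2, r))   # -> 6.0, the same value on every call

random.seed(0)
# With a small random measurement error in k, repeated "measurements" of F
# scatter around the deterministic value instead of reproducing it exactly.
noisy = [force(k + random.gauss(0, 0.1), m1, m2, r) for _ in range(3)]
print(noisy)                 # three different values near 6.0
```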
2.8 REGRESSION VS CAUSATION
Although regression analysis deals with the dependence of one variable on other variables, it does not imply causation. In the words of Kendall and Stuart, "A statistical relationship, however strong and however suggestive, can never establish causal connection: our ideas of causation must come from outside statistics." For instance, we may regress crop yield on rainfall, but nothing in the statistics prevents us from regressing rainfall on crop yield instead. Common sense, however, suggests that rainfall cannot depend on crop yield, since we cannot change rainfall by increasing or decreasing crop yield; crop yield is therefore treated as the dependent variable and rainfall as the explanatory variable. To attribute causality between variables, we must appeal to a priori or theoretical considerations. The point to note is that a statistical relationship in itself cannot logically imply causation.
2.9 REGRESSION VERSUS CORRELATION
Closely related to, but conceptually very different from, regression analysis is correlation analysis. In correlation analysis the main objective is to measure the strength or degree of linear association between two variables. For example, we may be interested in measuring the correlation coefficient between smoking and lung cancer, or between scores on statistics and mathematics examinations. In regression analysis we are not primarily interested in such a measure; instead, we are interested in estimating or predicting the average or mean value of one variable, the dependent variable, on the basis of the given values of the other variables, the independent variables. For instance, we may want to predict the average score on a statistics examination from a student's score on a mathematics examination.
Regression and correlation have some fundamental differences that are worth mentioning. In regression analysis there is an asymmetry in the way the dependent and explanatory variables are treated. The dependent variable is assumed to be statistical, random, or stochastic, that is, to have a probability distribution. The explanatory variables, on the other hand, are assumed to have fixed values (in repeated sampling). Thus, in Figure 2.3 we assumed that the variable age was fixed at given levels and height measurements were obtained at these levels. In correlation analysis, on the other hand, we treat any (two) variables symmetrically; there is no distinction between the dependent and explanatory variables. All of the variables are random in correlation analysis. After all, the correlation between scores on mathematics and statistics examinations is the same as that between scores on statistics and mathematics examinations. Moreover, as we shall see, most of the correlation theory is based on the assumption of randomness of variables.
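The asymmetry just described can be verified numerically. In the hypothetical sketch below, the correlation coefficient is unchanged when the two lists of scores are interchanged, while the regression slope is not; the two slopes are related to the correlation by b_yx * b_xy = r^2.

```python
# Sketch of the asymmetry discussed above: the correlation coefficient is the
# same whichever variable comes first, but the regression slope is not
# (hypothetical exam scores for illustration).
import math

math_scores = [60, 70, 80, 90]
stat_scores = [65, 68, 77, 86]

def corr(x, y):
    """Pearson correlation coefficient: symmetric in x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def slope(x, y):
    """OLS slope of y regressed on x: NOT symmetric in x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
           sum((a - mx) ** 2 for a in x)

print(corr(math_scores, stat_scores) == corr(stat_scores, math_scores))  # True
b_yx = slope(math_scores, stat_scores)   # statistics regressed on mathematics
b_xy = slope(stat_scores, math_scores)   # mathematics regressed on statistics
print(b_yx, b_xy)                        # two different slopes
print(abs(b_yx * b_xy - corr(math_scores, stat_scores) ** 2) < 1e-12)    # True
```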
2.10 THE NATURE AND SOURCES OF DATA
The success of any regression analysis depends on the availability of high-quality data. We therefore discuss the types of data, the issue of data accuracy, and the limitations of data.
2.10.1 Types of Data
Generally, three types of data are available for empirical analysis: time series data, cross-section data, and pooled data, the last being a combination of time series and cross-section data.
a. Time Series Data: A time series is a set of observations on a variable measured at different time periods: daily (e.g., stock prices, weather reports), weekly (e.g., money supply figures), monthly (e.g., the consumer price index), quarterly (e.g., the unemployment rate), or annually (e.g., GDP, industrial production). A problem with economic time series data is that such data are often nonstationary: most empirical work based on time series assumes that the underlying series is stationary, but many economic series are not. Simply speaking, a time series is said to be stationary if its mean and variance remain constant over time.
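A very rough, informal way to look at the stationarity idea is to split a series into halves and compare the means and variances. Serious applied work uses formal tests (for example, the augmented Dickey-Fuller test), so the sketch below, on hypothetical data, is only illustrative.

```python
# Rough, informal check of the stationarity idea described above: split a
# series into halves and compare means and variances (hypothetical data;
# real work would use a formal unit-root test instead).
from statistics import mean, variance

series = [5.1, 4.9, 5.2, 5.0, 4.8, 5.1, 4.9, 5.2, 5.0, 5.1]  # hypothetical

half = len(series) // 2
first, second = series[:half], series[half:]

# For a stationary series, the two halves should have similar means and
# similar variances; a trend or changing spread would show up here.
print(round(mean(first), 2), round(mean(second), 2))
print(round(variance(first), 3), round(variance(second), 3))
```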
Table 2.1: A hypothetical example of time series data
b. Cross-Section Data: Cross-section data are data on one or more variables collected at the same point in time for different individuals or units, such as the census of population conducted by the Census Bureau every 10 years. The problem with cross-section data is heterogeneity: when we include heterogeneous units in a statistical analysis, size and scale effects must be taken into account.
c. Pooled Data: Pooled data combine elements of time series and cross-section data. One example is the GNP per capita of all developing countries over ten years.
Table 2.3: A hypothetical example of pooled data: fish production of different districts of Bangladesh from 2010 to 2012.
2.10.2 Sources of Data
There are many sources of data, and finding all the data needed can be very time-consuming; in fact, finding data can take up more time than the analysis in a project. Some sources are governmental agencies (e.g., the BBS), international agencies (the International Monetary Fund (IMF), the World Bank, the World Health Organization (WHO), etc.), and firms.
Over the last decade, the Internet has become the newest source of information, offering a wealth of economic and financial data.
The data collected by various agencies may be experimental or non-experimental. In the social sciences, the data one generally encounters are non-experimental in nature, that is, not subject to the control of the researcher. For example, data on GNP, unemployment, stock prices, etc., are not directly under the control of the investigator. As we shall see, this lack of control often creates special problems for the researcher in pinning down the exact cause or causes affecting a particular situation.
2.10.3 Data Accuracy
Because data in the social sciences are seldom generated under controlled conditions, there will always be unknown influences. This makes it difficult for the researcher to obtain high-quality data.
Measurement Scales: Data fall into four categories that are important to know:
- Ratio scale refers to quantities for which ratios such as X1/X2 and distances such as (X2 − X1) are meaningful, and for which the data can be ordered in meaningful comparisons such as X1 ≥ X2. Ratio-scale data can be analyzed with a parametric approach to statistics.
- Interval scale refers to data for which distances between values are meaningful, as mentioned above, but ratios are not. Interval-scale data can also be analyzed with a parametric approach to statistics.
- Ordinal scale refers to an ordering that is qualitative rather than quantitative; there is a "natural order" among the categories, for example income classes (high, medium, low), sizes (large, medium, small), etc. An ordinal scale can be analyzed with both parametric and non-parametric statistics.
- Nominal scale refers to categories with no ordering among them, for instance genders (male, female) or materials (paper, plastics, wood). A nominal scale can only be analyzed with a non-parametric approach to statistics.