If a log transformation is applied to the dependent variable only, this is equivalent to assuming that it grows or decays exponentially as a function of the independent variables. If a log transformation is applied to both the dependent variable and the independent variables, this is equivalent to assuming that the effects of the independent variables are multiplicative rather than additive in their original units.
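For example, a minimal sketch in Python (with hypothetical variable and file names, not the examples from this site) shows how such a log-log model can be fit simply by logging both sides before running the regression:

```python
# Sketch of a log-log model: both the dependent and independent variables are
# logged, so each coefficient can be read as an elasticity (the percentage
# change in Y associated with a 1% change in that X). File and column names
# are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('sales.csv')                  # hypothetical data file
X = sm.add_constant(np.log(df[['price']]))     # log of the independent variable
y = np.log(df['units_sold'])                   # log of the dependent variable

results = sm.OLS(y, X).fit()
print(results.params)   # the 'price' coefficient is the estimated price elasticity
```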
On the margin, this multiplicative structure means that a small percentage change in one of the independent variables induces a proportional percentage change in the expected value of the dependent variable, other things being equal. Models of this kind are commonly used in modeling price-demand relationships, as illustrated in the beer sales example on this web site. Another possibility to consider is adding another regressor that is a nonlinear function of one of the other variables. For example, if Y has been regressed on X and the residual plot suggests a parabolic pattern, it may make sense to regress Y on both X and X squared. Higher-order terms of this kind (cubic, etc.) might also be considered.
This sort of "polynomial curve fitting" can be a nice way to draw a smooth curve through a wavy pattern of points (in fact, it is a trend-line option for scatterplots in Excel), but it is usually a terrible way to extrapolate outside the range of the sample data. Finally, it may be that you have overlooked some entirely different independent variable that explains or corrects for the nonlinear pattern or interactions among variables that you are seeing in your residual plots.
In that case the shape of the pattern, together with economic or physical reasoning, may suggest some likely suspects. For example, if the strength of the linear relationship between Y and X1 depends on the level of some other variable X2, this could perhaps be addressed by creating a new independent variable that is the product of X1 and X2.
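As a rough sketch (the column names here are made up for illustration), the product variable can be constructed explicitly or specified through the formula interface:

```python
# Sketch: creating an interaction regressor as the product of X1 and X2.
# Column and file names are illustrative only.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('data.csv')            # hypothetical data file
df['x1_x2'] = df['x1'] * df['x2']       # explicit product variable

# Regress y on x1, x2, and their product; the formula 'y ~ x1*x2' would
# expand to the same three terms automatically.
results = smf.ols('y ~ x1 + x2 + x1_x2', data=df).fit()
print(results.summary())
```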
In the case of time series data, if the trend in Y is believed to have changed at a particular point in time, then the addition of a piecewise linear trend variable (one whose string of values looks like 0, 0, …, 0, 1, 2, 3, …) could be used to fit the kink in the data.
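A minimal sketch of how such a variable might be constructed (the series length and breakpoint are purely illustrative):

```python
# Sketch: a piecewise linear trend variable that is 0 up to a hypothetical
# breakpoint and then counts 1, 2, 3, ... afterwards, to capture a kink in the trend.
import numpy as np
import pandas as pd

n = 120                       # number of periods (illustrative)
break_t = 60                  # hypothetical time of the change in trend
t = np.arange(n)

trend = t                               # ordinary linear trend: 0, 1, 2, ...
kink = np.maximum(t - break_t, 0)       # 0, 0, ..., 0, 1, 2, 3, ... after the break
# Equivalently: kink = (t - break_t) * (t > break_t), a trend times a dummy.

X = pd.DataFrame({'trend': trend, 'kink': kink})
# Regressing Y on both 'trend' and 'kink' lets the slope change at the breakpoint.
```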
A piecewise trend variable of this kind can be considered as the product of a trend variable and a dummy variable. Again, though, you need to beware of overfitting the sample data by throwing in artificially constructed variables that are poorly motivated. At the end of the day you need to be able to interpret the model and explain or sell it to others. Violations of independence are potentially very serious in time series regression models: serial correlation in the errors (i.e., correlation between consecutive errors or between errors some number of periods apart) indicates that there is room for improvement in the model.
Independence can also be violated in non-time-series models if errors tend to always have the same sign under particular conditions, i.e., if the model systematically over- or under-predicts when the independent variables take certain values. How to diagnose: The best test for serial correlation is to look at a residual time series plot (residuals vs. row number) together with a table or plot of residual autocorrelations. If your software does not provide these by default for time series data, you should figure out where in the menu or code to find them.
Pay especially close attention to significant correlations at the first couple of lags and in the vicinity of the seasonal period, because these are probably not due to mere chance and are also fixable. The Durbin-Watson statistic provides a test for significant residual autocorrelation at lag 1: the DW stat is approximately equal to 2(1 − a), where a is the lag-1 residual autocorrelation, so ideally it should be close to 2.
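As a quick sketch (assuming a fitted statsmodels model named results), both diagnostics can be obtained in a few lines:

```python
# Sketch: inspecting residual autocorrelations and the Durbin-Watson statistic
# for a fitted statsmodels OLS model named 'results' (assumed to already exist).
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.stats.stattools import durbin_watson

resid = results.resid

plot_acf(resid, lags=24)       # look for spikes at low lags and at the seasonal period
plt.show()

dw = durbin_watson(resid)      # roughly 2*(1 - lag-1 autocorrelation); close to 2 is ideal
print(f'Durbin-Watson statistic: {dw:.2f}')
```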
How to fix: Minor cases of positive serial correlation (say, a modestly positive lag-1 residual autocorrelation) suggest that there is room for fine-tuning the model, for example by adding lags of the dependent variable and/or lags of some of the independent variables. An AR(1) term adds a lag of the dependent variable to the forecasting equation, whereas an MA(1) term adds a lag of the forecast error. If there is significant correlation at lag 2, then a 2nd-order lag may be appropriate.
If there is significant negative correlation in the residuals (i.e., a strongly negative lag-1 autocorrelation), it may be a sign that one or more of the variables has been over-differenced. Differencing tends to drive autocorrelations in the negative direction, and too much differencing may lead to artificial patterns of negative correlation that lagged variables cannot correct for. If there is significant correlation at the seasonal period (e.g., at lag 4 for quarterly data or lag 12 for monthly data), this indicates that seasonality has not been properly accounted for in the model; one remedy is to seasonally adjust the variables, and another is to add seasonal dummy variables to the model. The dummy-variable approach enables additive seasonal adjustment to be performed as part of the regression model: a different additive constant can be estimated for each season of the year.
If the dependent variable has been logged, the seasonal adjustment is multiplicative. Something else to watch out for: it is possible that although your dependent variable is already seasonally adjusted, some of your independent variables may not be, causing their seasonal patterns to leak into the forecasts.
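A minimal sketch of the dummy-variable approach described above, for monthly data (the data file and column names are hypothetical):

```python
# Sketch: additive seasonal adjustment via monthly dummy variables in the regression.
# The file and column names are hypothetical; 'date' is assumed to be a datetime
# column in monthly data.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('sales.csv', parse_dates=['date'])   # hypothetical file
df['month'] = df['date'].dt.month

# C(month) expands into 11 dummy variables (one month serves as the baseline),
# so a separate additive constant is estimated for each season of the year.
results = smf.ols('y ~ x1 + C(month)', data=df).fit()
print(results.summary())
```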
Major cases of serial correlation (e.g., a Durbin-Watson statistic well below 1.0) usually indicate a fundamental structural problem in the model. You may wish to reconsider the transformations (if any) that have been applied to the dependent and independent variables. To test for non-time-series violations of independence, you can look at plots of the residuals versus independent variables or plots of residuals versus row number in situations where the rows have been sorted or grouped in some way that depends only on the values of the independent variables.
The residuals should be randomly and symmetrically distributed around zero under all conditions, and in particular there should be no correlation between consecutive errors no matter how the rows are sorted, as long as the sort is on some criterion that does not involve the dependent variable. If this is not true, it could be due to a violation of the linearity assumption or due to bias that is explainable by omitted variables (say, interaction terms or dummies for identifiable conditions).
Violations of homoscedasticity (which are called "heteroscedasticity") make it difficult to gauge the true standard deviation of the forecast errors, usually resulting in confidence intervals that are too wide or too narrow.
In particular, if the variance of the errors is increasing over time, confidence intervals for out-of-sample predictions will tend to be unrealistically narrow.
Heteroscedasticity may also have the effect of giving too much weight to a small subset of the data (namely, the subset where the error variance was largest) when estimating coefficients. How to diagnose: look at a plot of residuals versus predicted values and, in the case of time series data, a plot of residuals versus time.
Be alert for evidence of residuals that grow larger either as a function of time or as a function of the predicted value. To be really thorough, you should also generate plots of residuals versus independent variables to look for consistency there as well.
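As a sketch (assuming a fitted statsmodels model named results), these diagnostic plots take only a few lines:

```python
# Sketch: plotting residuals against fitted values and against time (row order)
# to look for heteroscedasticity. 'results' is a fitted statsmodels OLS model
# (assumed to already exist).
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.scatter(results.fittedvalues, results.resid, s=10)
ax1.axhline(0, color='grey', linewidth=1)
ax1.set_xlabel('Predicted value')
ax1.set_ylabel('Residual')

ax2.plot(results.resid.values)     # residuals in time / row order
ax2.axhline(0, color='grey', linewidth=1)
ax2.set_xlabel('Time / row number')
ax2.set_ylabel('Residual')

plt.tight_layout()
plt.show()
```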
Because of imprecision in the coefficient estimates, the errors may tend to be slightly larger for forecasts associated with predictions or values of independent variables that are extreme in both directions, although the effect should not be too dramatic.
What you hope not to see are errors that systematically get larger in one direction by a significant amount. The independent variables can be of any type. Although linear regression by itself cannot establish causation, the dependent variable is usually assumed to be affected by the independent variables.
By its nature, linear regression only looks at linear relationships between dependent and independent variables. That is, it assumes there is a straight-line relationship between them. Sometimes this is incorrect. For example, the relationship between income and age is curved: income tends to rise in the early part of adulthood, flatten out later, and decline after retirement. You can tell if this is a problem by looking at graphical representations of the relationships.
Linear regression looks at a relationship between the mean of the dependent variable and the independent variables. For example, if you look at the relationship between the birth weight of infants and maternal characteristics such as age, linear regression will look at the average weight of babies born to mothers of different ages. However, sometimes you need to look at the extremes of the dependent variable; in this example it is the low birth weights that put babies at risk, so the lower tail may matter more than the mean. Just as the mean is not a complete description of a single variable, linear regression is not a complete description of relationships among variables.
You can deal with this problem by using quantile regression. Outliers are data that are surprising. Outliers can be univariate (based on one variable) or multivariate. A multivariate outlier is an unusual combination of values: neither the age nor the income may be very extreme on its own, but very few people of that age make that much money. Linear Regression is the bicycle of regression models.
It can be used in a variety of domains. It has a nice closed-form solution, which makes model training a super-fast non-iterative process. If there is only one regression model that you have time to learn inside-out, it should be the Linear Regression model. Linearity requires little explanation. After all, if you have chosen to do Linear Regression, you are assuming that the underlying data exhibit linear relationships, specifically a relationship of the form y = β0 + β1·x1 + … + βk·xk + ε, where the βs are the regression coefficients and ε is the error term.
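As an illustrative sketch on synthetic data, the closed-form solution mentioned above can be computed directly from the normal equations:

```python
# Sketch: the closed-form OLS solution beta = (X'X)^(-1) X'y, computed on
# synthetic data. This direct solution is what makes training non-iterative.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=n)   # true intercept 3, slope 2, plus noise

X = np.column_stack([np.ones(n), x])           # design matrix with an intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations, solved directly
print(beta_hat)                                # approximately [3, 2]
```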
The second assumption that one makes while fitting OLSR models is that the residual errors left over from fitting the model to the data are independent, identically distributed random variables. After we train a Linear Regression model on a data set, running the training data back through the model produces fitted predictions, and the residual errors are the differences between those predictions and the observed values. When you roll a die twice, the probability of its coming up as one, two, …, six on the second throw does not depend on the value it came up as on the first throw.
So the two throws are independent random variables that can each take a value of 1 through 6 independently of the other throw. In the context of regression, we have seen why the residual errors of the regression model are random variables.
If the residual errors are not independent, they will likely demonstrate some sort of a pattern which is not always obvious to the naked eye. But sometimes one can detect patterns in the plot of residual errors versus the predicted values or the plot of residual errors versus actual values.
For time series data, one can also compare each residual error with the residual error from the previous time step. This is known as lag-1 auto-correlation, and it is a useful technique for finding out whether the residual errors of a time series regression model are independent. To fit the model, note that Patsy will add the regression intercept by default; one then builds and trains an Ordinary Least Squares Regression model on the training data and prints the model summary (a sketch of this workflow is given below). Running the fitted model on the data yields a PredictionResults object, and the predictions can be obtained from it. In a plot of the residual errors for this model, one can see that the residuals are more or less pattern-less for smaller values of Power Output, but they seem to show a linear pattern at the higher end of the Power Output scale.
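A sketch of that workflow follows; the file and column names are illustrative and not necessarily those of the original data set:

```python
# Sketch of the workflow described above: Patsy builds the design matrices
# (adding the intercept by default), statsmodels fits the OLS model, and the
# predictions come from the PredictionResults object. File and column names
# here are hypothetical.
import pandas as pd
import statsmodels.api as sm
from patsy import dmatrices

df = pd.read_csv('power_plant.csv')    # hypothetical file name

y_train, X_train = dmatrices('Power_Output ~ Ambient_Temp + Exhaust_Vacuum',
                             data=df, return_type='dataframe')

olsr_results = sm.OLS(y_train, X_train).fit()
print(olsr_results.summary())

# get_prediction() returns a PredictionResults object; the point predictions
# and their confidence intervals can be read off its summary frame.
predictions = olsr_results.get_prediction(X_train)
print(predictions.summary_frame().head())
```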
If the distribution of errors is not identical, one cannot reliably use tests of significance such as the F-test for regression analysis or perform confidence interval testing on the predictions.
Many of these tests depend on the residual errors being identically and normally distributed. This brings us to the next assumption. In the previous section, we saw how and why the residual errors of the regression are assumed to be independent, identically distributed (i.i.d.) random variables. Assumption 3 imposes an additional constraint: the errors should all have a normal distribution with a mean of zero. In statistical language, ε ~ N(0, σ²). It is a common misconception that linear regression models require the explanatory variables and the response variable to be normally distributed.
In fact, normality of residual errors is not even strictly required.
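Still, when you do want to check this assumption, a quick sketch (again assuming a fitted statsmodels model named results) is:

```python
# Sketch: checking whether the residual errors are approximately normal,
# using a Q-Q plot and the Jarque-Bera test. 'results' is a fitted
# statsmodels OLS model (assumed to already exist).
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.stattools import jarque_bera

sm.qqplot(results.resid, line='45', fit=True)   # points near the line suggest normality
plt.show()

jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(results.resid)
print(f'Jarque-Bera p-value: {jb_pvalue:.3f}')  # small p-value -> evidence against normality
```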