Reading header file data4-7.hdr
List of variables
0) const 1) chd 2) cal
3) unemp 4) cig
5) edfat 6) meat 7) spirits
8) beer 9) wine
period: 1, maxobs: 34,
obs range: full 1947-1980,
current 1947-1980
?logs cal unemp cig edfat meat spirits beer wine ;
?ols chd const cal unemp cig edfat meat spirits
beer wine l_cal l_unemp l_cig l_edfat l_meat l_spirit l_beer l_wine ;
Excluding the constant, p-value was highest for
variable 4 (cig)
?omit cig ;
Model selection statistics have decreased (i.e.
improved) for 7 criteria
Also, the p-values decrease for the remaining
coefficients (drastically so for l_cig). Thus, we prefer the model
without cig.
Excluding the constant, p-value was highest for
variable 2 (cal)
?omit cal ;
Model selection statistics have decreased (i.e.
improved) for 8 criteria
Also, the p-values either decrease or stay about
the same for the remaining coefficients.
Notice that the p-value for the coefficient on
l_cal becomes drastically smaller (this happens because we are omitting
a variable that was highly correlated with l_cal). So, we prefer the model
that omits cig and cal.
Excluding the constant, p-value was highest for variable 14 (l_meat)
?omit l_meat ;
Model selection statistics have decreased (i.e.
improved) for 7 criteria
Now the coefficient on meat is statistically
significant at the 10% level and the p-values for many other coefficients
become smaller. Thus we prefer the model without cig, cal and l_meat.
Excluding the constant, p-value was highest for variable 15 (l_spirit)
?omit l_spirit ;
Model selection statistics have decreased (i.e.
improved) for 7 criteria
Once again, many of the p-values become smaller
(because we are omitting a variable that is highly correlated with other
independent variables in the model.) Thus we prefer the model without cig,
cal, l_meat and l_spirit.
Excluding the constant, p-value was highest for variable 7 (spirits)
?omit spirits ;
Model selection statistics have decreased (i.e.
improved) for 8 criteria
Also, the p-values decrease. Once again this
is because the variable spirits is highly correlated with
the other independent variables in the model. Thus we want to omit cig,
cal, l_meat, l_spirit and spirits.
Excluding the constant, p-value was highest for variable 10 (l_cal)
?omit l_cal ;
Model selection statistics have decreased (i.e.
improved) for 4 criteria
Since the "judges" are split (4 think the last
model is better, but 4 think the previous model is better), it is a little
trickier to decide which model to select.
However, I would omit l_cal because:
When we leave l_cal in the model, its p-value
is .22, far from significant. When we omit it, all of the coefficients
in this last model are significant at the 10% level (or better). I can't
think of a strong theoretical reason to leave l_cal in.
Plus, l_cal is highly correlated with the other
variables in the model. When we omit it, not only do we have a more parsimonious
model (a smaller model), but we reduce the problems that arise from multicollinearity.
In the end, there are more reasons to omit it
than not to, so our final model is the one that omits cig, cal, l_meat,
l_spirit, spirits and l_cal.
Omitted variable bias does not seem to be a problem, since all of the variables we omitted were highly correlated with variables that we left in the model. Any effect they have on chd will be captured by the variables that remain in the model.
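The general-to-simple search in part I can be sketched in a few lines: repeatedly drop the non-constant regressor with the smallest |t| (equivalently, the highest p-value) until everything left clears the cutoff. This is a minimal sketch on made-up data, not esl's algorithm (esl's omit decisions also weigh the eight model selection criteria); the names and the 1.70 cutoff (roughly the two-sided 10% critical t for ~30 d.f.) are illustrative assumptions.

```python
import numpy as np

def ols_tstats(y, X):
    """Fit OLS by least squares; return coefficients and t-statistics."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)             # residual variance (sgmahat^2)
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, beta / se

def backward_eliminate(y, X, names, t_cut=1.70):
    """Repeatedly drop the non-constant regressor with the smallest |t|
    (i.e. the highest p-value) until all remaining |t| >= t_cut."""
    X, names = X.copy(), list(names)
    while X.shape[1] > 1:
        _, t = ols_tstats(y, X)
        j = 1 + int(np.argmin(np.abs(t[1:])))    # column 0 is the constant
        if abs(t[j]) >= t_cut:
            break
        X = np.delete(X, j, axis=1)
        names.pop(j)
    return names

# Made-up data: two real effects plus one pure-noise regressor.
rng = np.random.default_rng(0)
n = 34
x1, x2, junk = rng.normal(size=(3, n))
y = 5 + 2 * x1 - 3 * x2 + rng.normal(scale=0.5, size=n)
X = np.column_stack([np.ones(n), x1, x2, junk])
kept = backward_eliminate(y, X, ["const", "x1", "x2", "junk"])
print(kept)   # the pure-noise column will usually be dropped
```

Dropping one variable at a time, rather than all insignificant ones at once, matters here for the same reason it mattered above: p-values on the survivors can change drastically after each omission when the regressors are highly correlated.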
II.1
?ols chd const cal unemp cig edfat meat spirits
beer wine ;
?genr ut = uhat
Generated var. no. 10 (ut)
II.2
In order to run the auxiliary regression, first I generate the logs of the independent variables.
?logs cal unemp cig edfat meat spirits beer wine ;
List of variables
0) const 1) chd 2) cal
3) unemp 4) cig
5) edfat 6) meat 7) spirits
8) beer 9) wine
10) ut 11) l_cal 12) l_unemp
13) l_cig 14) l_edfat
15) l_meat 16) l_spirit
17) l_beer 18) l_wine
Be careful if you are using numbers instead of
the variable names. The OLS command would be:
?ols 10 0 2 3 4 5 6 7 8 9 11 12 13 14 15 16 17
18;
10 is now the dependent variable (so don't accidentally
include it between 9 and 11.)
1 was previously the dependent variable. It should
not be included in the auxiliary regression (so don't accidentally put
it between 0 and 2.)
?ols ut const cal unemp cig edfat meat spirits
beer wine l_cal l_unemp
l_cig l_edfat l_meat l_spirit l_beer l_wine ;
OLS ESTIMATES USING THE 34 OBSERVATIONS 1947-1980
Dependent variable - ut
VARIABLE       COEFFICIENT     STDERROR      T STAT      2Prob(t > |T|)
 0) constant   3405.413887     2221.59399     1.53287    0.143706
 2) cal         684.439153     2072.046278    0.33032    0.745194
 3) unemp       -12.947539        5.741953   -2.254902   0.037616 **
 4) cig         -12.806737       31.466661   -0.406994   0.689091
 5) edfat        28.700466       16.882349    1.700028   0.107347
 6) meat         -2.143243        2.240574   -0.95656    0.352196
 7) spirits      -2.869088       19.526544   -0.146933   0.884914
 8) beer          9.127586        6.249167    1.460609   0.162358
 9) wine       -212.986719       38.630151   -5.513484   < 0.0001 ***
11) l_cal      -671.818277     2053.736652   -0.32712    0.747571
12) l_unemp      60.597237       30.23124     2.004458   0.061217 *
13) l_cig        75.898994      282.509119    0.26866    0.791424
14) l_edfat   -1536.251473      818.764434   -1.876305   0.07789 *
15) l_meat      288.275233      367.83058     0.783717   0.443989
16) l_spirit    -38.899562       38.26226    -1.016656   0.323569
17) l_beer     -228.669718      119.384239   -1.91541    0.072417 *
18) l_wine      399.118534       68.881178    5.794305   < 0.0001 ***
Mean of dep. var. -7.429966e-14
S.D. of dep. variable 7.746551e+00
Error Sum of Sq (ESS) 399.614638
Std Err of Resid. (sgmahat) 4.848375
Unadjusted R-squared 0.798
Adjusted R-squared 0.608
F-statistic (16, 17) 4.20274
pvalue = Prob(F > 4.203) is 0.002678
Durbin-Watson Stat. 2.350904
First-order auto corr coeff -0.229
MODEL SELECTION STATISTICS
SGMASQ    23.506743   AIC       31.948977   FPE       35.260115
HQ        41.446626   SCHWARZ   68.533345   SHIBATA   23.506743
GCV       47.013487   RICE      undefined
Excluding the constant, p-value was highest for variable 7 (spirits)
II.3
The null hypothesis for the LM test is that the coefficients for the 8 added logged variables (l_cal l_unemp l_cig l_edfat l_meat l_spirit l_beer l_wine) are all equal to zero. [Note: The hypothesis is that the coefficients are equal to zero, NOT that the variables themselves are equal to zero.]
H0: b11 = b12 = b13 = b14 = b15 = b16 = b17 = b18 = 0
The test statistic is LM = nR²
LM = number of observations × unadjusted R-squared from the auxiliary regression
LM = 34 × 0.798 ≈ 27.1 [obtain these numbers from the previous regression output]
The LM statistic can also be calculated using esl. The commands are:
?genr LM = $nrsq
Generated var. no. 19 (LM)
?print LM ;
Varname: LM, period: 1, maxobs: 34, obs range:
full 1947-1980,
current 1947-1980
27.13896476
LM ~ χ² with 8 degrees of freedom [degrees
of freedom = number of restrictions in the null hypothesis]
Using the chi-square table in the front of the
book, the 10% critical value LM* with 8 d.f. = 13.36
Since LM > LM*, we can reject the null hypothesis
and conclude that at least some of the added (logged) variables belong
in the model.
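The arithmetic can be checked directly (a sketch; the 13.36 figure is the 10% chi-square critical value with 8 d.f. read from the table):

```python
# LM test for the 8 added log variables: LM = n * R^2
n, r2 = 34, 0.798             # n and unadjusted R-squared from the auxiliary regression
lm = n * r2
crit_10pct_8df = 13.36        # chi-square critical value, 8 d.f., 10% level

print(round(lm, 2))           # 27.13, close to esl's $nrsq value of 27.139
print(lm > crit_10pct_8df)    # True: reject H0
```

The small gap between 27.13 and esl's 27.139 is just rounding: the output reports R-squared to three decimal places, while $nrsq uses the unrounded value.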
II.4
In the next regression, we want to regress our original dependent variable (chd) against all eight original (non-logged) variables and the logged variables that have a p-value less than 0.5 in the auxiliary regression. All but two of the logged variables meet this condition; the two that do not are l_cal (p-value 0.748) and l_cig (p-value 0.791). [Note: If you are using numbers rather than variable names in the esl commands, be careful not to include ut as one of the explanatory variables.]
?ols chd const cal unemp cig edfat meat spirits
beer wine l_unemp
l_edfat l_meat l_spirit l_beer l_wine ;
(Call this Model A)
Excluding the constant, p-value was highest for
variable 15 (l_meat)
II.5
Now we will omit the variable with the highest p-value
?omit l_meat ;
(Call this Model B)
Excluding the constant, p-value was highest for
variable 16 (l_spirit)
Model selection statistics have decreased (i.e. improved) for 7 criteria
?omit l_spirit ;
(Call this Model C)
Excluding the constant, p-value was highest for
variable 2 (cal)
Model selection statistics have decreased (i.e. improved) for 7 criteria
?omit cal ;
(Call this Model D)
Excluding the constant, p-value was highest for
variable 7 (spirits)
Model selection statistics have decreased (i.e. improved) for 7 criteria
?omit spirits ;
(Call this Model E)
Excluding the constant, p-value was highest for
variable 5 (edfat)
Model selection statistics have decreased (i.e. improved) for 4 criteria
?omit edfat ;
(Call this model F)
Excluding the constant, p-value was highest for
variable 14 (l_edfat)
Model selection statistics have decreased (i.e. improved) for 1 criteria
?omit l_edfat ;
(Call this model G)
OLS ESTIMATES USING THE 34 OBSERVATIONS 1947-1980
Dependent variable - chd
VARIABLE       COEFFICIENT    STDERROR     T STAT      2Prob(t > |T|)
 0) constant   1098.1593      185.079746    5.933439   < 0.0001 ***
 3) unemp       -14.381571      5.020278   -2.864696   0.008337 ***
 4) cig           5.668438      3.339232    1.697527   0.102013
 6) meat         -0.358754      0.137352   -2.61194    0.01501 **
 8) beer          7.501769      4.181431    1.794067   0.084907 *
 9) wine       -175.350966     22.944562   -7.642376   < 0.0001 ***
12) l_unemp      63.191628     25.973479    2.432929   0.022468 **
17) l_beer     -248.604002     78.813799   -3.154321   0.004155 ***
18) l_wine      327.952805     31.448551   10.428232   < 0.0001 ***
Mean of dep. var. 354.814706
S.D. of dep. variable 14.946047
Error Sum of Sq (ESS) 546.787487
Std Err of Resid. (sgmahat) 4.676697
Unadjusted R-squared 0.926
Adjusted R-squared 0.902
F-statistic (8, 25) 39.005643
pvalue = Prob(F > 39.006) is < 0.0001
Durbin-Watson Stat. 2.361434
First-order auto corr coeff -0.254
MODEL SELECTION STATISTICS
SGMASQ    21.871499   AIC       27.306137   FPE       27.661014
HQ        31.340134   SCHWARZ   40.900737   SHIBATA   24.595977
GCV       29.745239   RICE      34.174218
Excluding the constant, p-value was highest for variable 4 (cig)
Model selection statistics have decreased (i.e. improved) for 8 criteria
?omit cig ;
(Call this model H)
Excluding the constant, p-value was highest for
variable 8 (beer)
Model selection statistics have decreased (i.e. improved) for 1 criteria
?omit beer ;
(Call this model I)
Excluding the constant, p-value was highest for
variable 12 (l_unemp)
Model selection statistics have decreased (i.e. improved) for 2 criteria
?omit l_unemp ;
(Call this model J)
OLS ESTIMATES USING THE 34 OBSERVATIONS 1947-1980
Dependent variable - chd
VARIABLE       COEFFICIENT    STDERROR     T STAT       2Prob(t > |T|)
 0) constant    948.225231    54.876022    17.279409    < 0.0001 ***
 3) unemp        -2.101086     0.846616    -2.481745    0.019344 **
 6) meat         -0.371626     0.131817    -2.819265    0.008739 ***
 9) wine       -155.913632    15.580936   -10.006692    < 0.0001 ***
17) l_beer     -123.611682    12.287921   -10.059609    < 0.0001 ***
18) l_wine      319.248737    28.091231    11.364712    < 0.0001 ***
Mean of dep. var. 354.814706
S.D. of dep. variable 14.946047
Error Sum of Sq (ESS) 712.786036
Std Err of Resid. (sgmahat) 5.045458
Unadjusted R-squared 0.903
Adjusted R-squared 0.886
F-statistic (5, 28) 52.315589
pvalue = Prob(F > 52.316) is < 0.0001
Durbin-Watson Stat. 2.011552
First-order auto corr coeff -0.105
MODEL SELECTION STATISTICS
SGMASQ    25.456644   AIC       29.837379   FPE       29.948993
HQ        32.708031   SCHWARZ   39.060811   SHIBATA   28.363458
GCV       30.911639   RICE      32.399365
Model selection statistics have decreased (i.e. improved) for 2 criteria
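As a cross-check, model J's eight MODEL SELECTION STATISTICS can be recomputed from its ESS (712.786036), n = 34 observations, and k = 6 estimated coefficients. The formulas below are inferred rather than taken from esl's documentation, but they reproduce the printed values to rounding:

```python
import math

# Model J's reported quantities: ESS, number of observations, number of coefficients.
ess, n, k = 712.786036, 34, 6
s = ess / n   # ESS/n, the common factor in every criterion

stats = {
    "SGMASQ":  ess / (n - k),                 # unbiased residual variance
    "AIC":     s * math.exp(2 * k / n),
    "FPE":     s * (n + k) / (n - k),
    "HQ":      s * math.log(n) ** (2 * k / n),
    "SCHWARZ": s * n ** (k / n),
    "SHIBATA": s * (n + 2 * k) / n,
    "GCV":     s / (1 - k / n) ** 2,
    "RICE":    s / (1 - 2 * k / n),
}
for name, value in stats.items():
    print(f"{name:8s} {value:10.6f}")
```

Every criterion has the form (ESS/n) × penalty(n, k), so each one trades fit against parsimony. Note that RICE is undefined whenever 2k ≥ n, which is why esl printed "undefined" for the 17-coefficient auxiliary regression (2 × 17 = 34 = n).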
II.6
There is no single right answer for this question. You will be given credit for justifying your choice based on the criteria we have discussed so far in class.
Some things to consider are:
You want the coefficients to be significant.
You want the selection statistics to be low.
You don't want to omit variables that will cause
bias.
You don't want to leave out variables that should
theoretically be in the model.
You don't want to include explanatory variables
that are highly correlated with each other.
A parsimonious specification (a model with few
parameters - a smaller model) is good because: (1) estimates will be more
precise because of reduced multicollinearity, (2) estimates will be more
reliable because you have more degrees of freedom, (3) power of tests will
be greater, (4) a simpler model is easier to understand than a complex
one (see p. 285 of the textbook).
I've narrowed it down to three main contenders for my top choice model: Models E, G and J.
Model J is good because all of the parameters are significant. It is a parsimonious specification. If you look at the correlation matrix (in esl, type ?corr cal unemp ... l_cal l_unemp ...) you will see that many of the variables are highly correlated, so to avoid the problems that arise with multicollinearity, we would like to leave out as many unnecessary variables as possible. The one drawback to selecting this model is that other models have lower model selection statistics.
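The ?corr check can be mimicked with synthetic stand-ins (the series below are made up, NOT the actual data set): a level variable and its own log are almost perfectly correlated, which is exactly the multicollinearity at issue between pairs like cal and l_cal.

```python
import numpy as np

# Hypothetical stand-in series, drawn at random for illustration only.
rng = np.random.default_rng(1)
n = 34
cal = rng.normal(3000, 200, size=n)     # a level series (all positive)
l_cal = np.log(cal)                     # its log: nearly a linear transform over this range
unemp = rng.normal(6, 1, size=n)        # an unrelated series

R = np.corrcoef(np.column_stack([cal, l_cal, unemp]), rowvar=False)
print(R.round(3))   # corr(cal, l_cal) is close to 1; corr with unemp is small
```

Because log is nearly linear over a narrow range of positive values, including both a variable and its log puts two almost-collinear columns in X, inflating standard errors on both.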
You could also make an argument for stopping at model G. In this model, nearly all of the parameters are significant. The only non-significant parameter has a p-value of .102, which is borderline. The risk of omitted variable bias suggests that it is better to leave a variable in the model if it appears to have some effect. In addition, the majority of model selection statistics are lower than for models A-F.
In the end, I will choose J as my final model. The remainder of the problem set will use this model to make the comparisons.
II.7
The models that I chose in I.3 and II.6 are not the same. I would choose the model from I.3.
Comparing the last model of part I with model J from part II:
Out of the eight model selection statistics, 4 are lower for the part I model and 4 are lower for the part II model, so this alone doesn't indicate that one model is substantially better than the other (remember, one of these measures includes the R²). The general-to-simple method of model selection that we used in part I is preferred over the simple-to-general method we used in part II, because it is surer and does not depend on the arbitrary 0.5 rule for selecting variables from the auxiliary regression. So, if I had to make a decision, I would choose the model from part I.
(The same answer would hold if I compare the part
I model to model G from part II.)