HOMEWORK #1 - Sketches of Answers

 
Reading header file data4-7.hdr
List of variables
0) const 1) chd 2) cal 3) unemp 4) cig
5) edfat 6) meat 7) spirits 8) beer 9) wine
period: 1, maxobs: 34, obs range: full 1947-1980,
current 1947-1980

 

?logs cal unemp cig edfat meat spirits beer wine ;

?ols chd const cal unemp cig edfat meat spirits beer wine l_cal l_unemp l_cig l_edfat l_meat l_spirit l_beer l_wine ;
Excluding the constant, p-value was highest for variable 4 (cig)

?omit cig ;
Model selection statistics have decreased (i.e. improved) for 7 criteria
Also, the p-values decrease for the remaining coefficients (drastically so for l_cig). Thus, we prefer the model without cig.
Excluding the constant, p-value was highest for variable 2 (cal)

?omit cal ;
Model selection statistics have decreased (i.e. improved) for 8 criteria
Also, the p-values either decrease or stay about the same for the remaining coefficients.
Notice that the p-value for the coefficient on l_cal becomes drastically smaller (this happens because we are omitting a variable that was highly correlated with l_cal). So, we prefer the model that omits cig and cal.

Excluding the constant, p-value was highest for variable 14 (l_meat)

?omit l_meat ;
Model selection statistics have decreased (i.e. improved) for 7 criteria
Now the coefficient on meat is statistically significant at the 10% level, and the p-values for many other coefficients become smaller. Thus we prefer the model without cig, cal, and l_meat.

Excluding the constant, p-value was highest for variable 15 (l_spirit)

?omit l_spirit ;
Model selection statistics have decreased (i.e. improved) for 7 criteria
Once again, many of the p-values become smaller (because we are omitting a variable that is highly correlated with the other independent variables in the model). Thus we prefer the model without cig, cal, l_meat, and l_spirit.

Excluding the constant, p-value was highest for variable 7 (spirits)

?omit spirits ;
Model selection statistics have decreased (i.e. improved) for 8 criteria
and the p-values decrease. Once again, this is because spirits is highly correlated with the other independent variables in the model. Thus we want to omit cig, cal, l_meat, l_spirit, and spirits.

Excluding the constant, p-value was highest for variable 10 (l_cal)

?omit l_cal ;

Model selection statistics have decreased (i.e. improved) for 4 criteria
Since the "judges" are split (4 think the last model is better, but 4 think the previous model is better), it is a little trickier to decide which model to select.
However, I would omit l_cal because:
When we leave l_cal in the model, its p-value is 0.22, far from significant. When we omit it, all of the coefficients in this last model are significant at the 10% level (or better). I can't think of a strong theoretical reason to leave l_cal in.
Also, l_cal is highly correlated with the other variables in the model. When we omit it, not only do we get a more parsimonious (smaller) model, but we also reduce the problems that arise from multicollinearity.
In the end, there are more reasons to omit it than not to, so our final model is the one that omits cig, cal, l_meat, l_spirit, spirits, and l_cal.

Omitted variable bias does not seem to be a problem, since all of the variables we omitted were highly correlated with variables that we left in the model. Any effect they have on chd will be captured by the variables that remain in the model.
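For readers who want to replicate Part I outside esl, here is a minimal sketch of the general-to-simple procedure in Python. It assumes the data4-7 series have been exported to a hypothetical CSV file (data4-7.csv) and uses pandas/statsmodels; AIC stands in for esl's eight selection criteria, so the exact sequence of omitted variables may differ from the session above.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("data4-7.csv")   # assumed CSV export of data4-7
    orig = ["cal", "unemp", "cig", "edfat", "meat", "spirits", "beer", "wine"]
    for v in orig:
        df["l_" + v] = np.log(df[v])  # same transformation as ?logs

    y = df["chd"]
    X = sm.add_constant(df[orig + ["l_" + v for v in orig]])

    while True:
        fit = sm.OLS(y, X).fit()
        pvals = fit.pvalues.drop("const")   # never consider dropping the constant
        worst = pvals.idxmax()              # regressor with the highest p-value
        if pvals[worst] < 0.10:             # everything significant at 10%: stop
            break
        trimmed = sm.OLS(y, X.drop(columns=worst)).fit()
        if trimmed.aic > fit.aic:           # criterion worsens: keep current model
            break
        print(f"dropping {worst} (p = {pvals[worst]:.3f})")
        X = X.drop(columns=worst)

    print(fit.summary())                    # the retained specification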

 

II.1

?ols chd const cal unemp cig edfat meat spirits beer wine ;
?genr ut = uhat
Generated var. no. 10 (ut)
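Outside esl, the same two steps look like this in the Python sketch (continuing from the DataFrame df built in the Part I sketch above):

    base = sm.OLS(df["chd"], sm.add_constant(df[orig])).fit()
    df["ut"] = base.resid   # plays the role of genr ut = uhat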

II.2

In order to run the auxiliary regression, first I generate the logs of the independent variables.

?logs cal unemp cig edfat meat spirits beer wine ;

List of variables
0) const 1) chd 2) cal 3) unemp 4) cig
5) edfat 6) meat 7) spirits 8) beer 9) wine
10) ut 11) l_cal 12) l_unemp 13) l_cig 14) l_edfat
15) l_meat 16) l_spirit 17) l_beer 18) l_wine

Be careful if you are using variable numbers instead of names. The OLS command would be:
?ols 10 0 2 3 4 5 6 7 8 9 11 12 13 14 15 16 17 18;
Variable 10 (ut) is now the dependent variable, so don't accidentally include it between 9 and 11.
Variable 1 (chd) was previously the dependent variable. It should not be included in the auxiliary regression, so don't accidentally put it between 0 and 2.

?ols ut const cal unemp cig edfat meat spirits beer wine l_cal l_unemp
l_cig l_edfat l_meat l_spirit l_beer l_wine ;

OLS ESTIMATES USING THE 34 OBSERVATIONS 1947-1980

Dependent variable - ut

VARIABLE        COEFFICIENT    STDERROR      T STAT      2Prob(t > |T|)
 0) constant    3405.413887    2221.59399     1.53287    0.143706
 2) cal          684.439153    2072.046278    0.33032    0.745194
 3) unemp        -12.947539       5.741953   -2.254902   0.037616  **
 4) cig          -12.806737      31.466661   -0.406994   0.689091
 5) edfat         28.700466      16.882349    1.700028   0.107347
 6) meat          -2.143243       2.240574   -0.95656    0.352196
 7) spirits       -2.869088      19.526544   -0.146933   0.884914
 8) beer           9.127586       6.249167    1.460609   0.162358
 9) wine        -212.986719      38.630151   -5.513484   < 0.0001  ***
11) l_cal       -671.818277    2053.736652   -0.32712    0.747571
12) l_unemp       60.597237      30.23124     2.004458   0.061217  *
13) l_cig         75.898994     282.509119    0.26866    0.791424
14) l_edfat    -1536.251473     818.764434   -1.876305   0.07789   *
15) l_meat       288.275233     367.83058     0.783717   0.443989
16) l_spirit     -38.899562      38.26226    -1.016656   0.323569
17) l_beer      -228.669718     119.384239   -1.91541    0.072417  *
18) l_wine       399.118534      68.881178    5.794305   < 0.0001  ***

Mean of dep. var.       -7.429966e-14   S.D. of dep. variable         7.746551e+00
Error Sum of Sq (ESS)    399.614638     Std Err of Resid. (sgmahat)   4.848375
Unadjusted R-squared     0.798          Adjusted R-squared            0.608
F-statistic (16, 17)     4.20274        pvalue = Prob(F > 4.203) is   0.002678
Durbin-Watson Stat.      2.350904       First-order auto corr coeff  -0.229

MODEL SELECTION STATISTICS
SGMASQ   23.506743   AIC       31.948977   FPE       35.260115
HQ       41.446626   SCHWARZ   68.533345   SHIBATA   23.506743
GCV      47.013487   RICE      undefined

Excluding the constant, p-value was highest for variable 7 (spirits)
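In the Python sketch, the auxiliary regression of II.2 is one more OLS call (again continuing from df; the log columns were created earlier):

    logs = ["l_" + v for v in orig]
    aux = sm.OLS(df["ut"], sm.add_constant(df[orig + logs])).fit()
    print(aux.summary())    # its R-squared (the 0.798 above) feeds the LM statistic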

 

II.3

The null hypothesis for the LM test is that the coefficients for the 8 added logged variables (l_cal l_unemp l_cig l_edfat l_meat l_spirit l_beer l_wine) are all equal to zero. [Note: The hypothesis is that the coefficients are equal to zero, NOT that the variables themselves are equal to zero.]

H0: b11 = b12 = b13 = b14 = b15 = b16 = b17 = b18 = 0

The test statistic is LM = nR2:
LM = number of observations x unadjusted R-squared from the auxiliary regression
LM = 34 x 0.798 = 27.1 [obtain the numbers from the previous regression output]

The LM statistic can also be calculated using esl. The commands are:

?genr LM = $nrsq
Generated var. no. 19 (LM)

?print LM ;
Varname: LM, period: 1, maxobs: 34, obs range: full 1947-1980,
current 1947-1980
27.13896476

LM ~ chi-square with 8 degrees of freedom [degrees of freedom = number of restrictions in the null hypothesis]
Using the chi-square table in the front of the book, LM*(10%) with 8 d.f. = 13.36
Since LM > LM*, we can reject the null hypothesis and conclude that at least some of the added (logged) variables belong in the model.
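The same calculation in the Python sketch, with scipy supplying the critical value instead of the table:

    from scipy.stats import chi2

    LM = aux.nobs * aux.rsquared    # n x R-squared, about 27.1 here
    crit = chi2.ppf(0.90, df=8)     # 10% critical value, 8 restrictions (13.36)
    print(f"LM = {LM:.2f}, critical value = {crit:.2f}")
    # LM > crit, so reject H0: the log terms jointly belong in the model.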

 

II.4

In the next regression, we want to regress our original dependent variable (chd) against all eight original (non-logged) variables plus the logged variables that have a p-value less than 0.5 in the auxiliary regression. All but two of the logged variables (l_cal and l_cig) meet this condition; the variables we want to add can be read off the previous regression output. [Note: If you are using numbers rather than variable names in the esl commands, be careful not to include ut as one of the explanatory variables.]
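In the Python sketch, this 0.5 rule is a one-line filter on the auxiliary p-values (continuing from aux above); the refit gives Model A:

    keep = [v for v in logs if aux.pvalues[v] < 0.5]   # drops l_cal and l_cig
    modelA = sm.OLS(df["chd"], sm.add_constant(df[orig + keep])).fit()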

?ols chd const cal unemp cig edfat meat spirits beer wine l_unemp
l_edfat l_meat l_spirit l_beer l_wine ;
(Call this Model A)

Excluding the constant, p-value was highest for variable 15 (l_meat)
 

II.5

Now we will omit the variable with the highest p-value.

?omit l_meat ;
(Call this Model B)
Excluding the constant, p-value was highest for variable 16 (l_spirit)

Model selection statistics have decreased (i.e. improved) for 7 criteria

?omit l_spirit ;
(Call this Model C)
Excluding the constant, p-value was highest for variable 2 (cal)

Model selection statistics have decreased (i.e. improved) for 7 criteria

?omit cal ;
(Call this Model D)
Excluding the constant, p-value was highest for variable 7 (spirits)

Model selection statistics have decreased (i.e. improved) for 7 criteria

?omit spirits ;
(Call this Model E)
Excluding the constant, p-value was highest for variable 5 (edfat)

Model selection statistics have decreased (i.e. improved) for 4 criteria

?omit edfat ;
(Call this Model F)
Excluding the constant, p-value was highest for variable 14 (l_edfat)

Model selection statistics have decreased (i.e. improved) for 1 criterion

?omit l_edfat ;
(Call this Model G)

OLS ESTIMATES USING THE 34 OBSERVATIONS 1947-1980
Dependent variable - chd

VARIABLE       COEFFICIENT    STDERROR     T STAT      2Prob(t > |T|)
 0) constant   1098.1593      185.079746    5.933439   < 0.0001  ***
 3) unemp       -14.381571      5.020278   -2.864696   0.008337  ***
 4) cig           5.668438      3.339232    1.697527   0.102013
 6) meat         -0.358754      0.137352   -2.61194    0.01501   **
 8) beer          7.501769      4.181431    1.794067   0.084907  *
 9) wine       -175.350966     22.944562   -7.642376   < 0.0001  ***
12) l_unemp      63.191628     25.973479    2.432929   0.022468  **
17) l_beer     -248.604002     78.813799   -3.154321   0.004155  ***
18) l_wine      327.952805     31.448551   10.428232   < 0.0001  ***

Mean of dep. var.       354.814706   S.D. of dep. variable         14.946047
Error Sum of Sq (ESS)   546.787487   Std Err of Resid. (sgmahat)    4.676697
Unadjusted R-squared    0.926        Adjusted R-squared             0.902
F-statistic (8, 25)     39.005643    pvalue = Prob(F > 39.006) is  < 0.0001
Durbin-Watson Stat.     2.361434     First-order auto corr coeff   -0.254

MODEL SELECTION STATISTICS
SGMASQ   21.871499   AIC       27.306137   FPE       27.661014
HQ       31.340134   SCHWARZ   40.900737   SHIBATA   24.595977
GCV      29.745239   RICE      34.174218
Excluding the constant, p-value was highest for variable 4 (cig)
Model selection statistics have decreased (i.e. improved) for 8 criteria

?omit cig ;
(Call this Model H)
Excluding the constant, p-value was highest for variable 8 (beer)

Model selection statistics have decreased (i.e. improved) for 1 criterion

?omit beer ;
(Call this Model I)
Excluding the constant, p-value was highest for variable 12 (l_unemp)

Model selection statistics have decreased (i.e. improved) for 2 criteria

?omit l_unemp ;
(Call this Model J)

OLS ESTIMATES USING THE 34 OBSERVATIONS 1947-1980
Dependent variable - chd
VARIABLE      COEFFICIENT    STDERROR     T STAT       2Prob(t > |T|)
 0) constant   948.225231    54.876022    17.279409    < 0.0001  ***
 3) unemp       -2.101086     0.846616    -2.481745    0.019344  **
 6) meat        -0.371626     0.131817    -2.819265    0.008739  ***
 9) wine      -155.913632    15.580936   -10.006692    < 0.0001  ***
17) l_beer    -123.611682    12.287921   -10.059609    < 0.0001  ***
18) l_wine     319.248737    28.091231    11.364712    < 0.0001  ***

Mean of dep. var.       354.814706   S.D. of dep. variable         14.946047
Error Sum of Sq (ESS)   712.786036   Std Err of Resid. (sgmahat)    5.045458
Unadjusted R-squared    0.903        Adjusted R-squared             0.886
F-statistic (5, 28)     52.315589    pvalue = Prob(F > 52.316) is  < 0.0001
Durbin-Watson Stat.     2.011552     First-order auto corr coeff   -0.105

MODEL SELECTION STATISTICS

SGMASQ   25.456644   AIC       29.837379   FPE       29.948993
HQ       32.708031   SCHWARZ   39.060811   SHIBATA   28.363458
GCV      30.911639   RICE      32.399365

Model selection statistics have decreased (i.e. improved) for 2 criteria

 

 
II.6

There is no single right answer for this question. You will be given credit for justifying your choice based on the criteria we have discussed so far in class.

Some things to consider are:
You want the coefficients to be significant.
You want the selection statistics to be low.
You don't want to omit variables that will cause bias.
You don't want to leave out variables that should theoretically be in the model.
You don't want to include explanatory variables that are highly correlated with each other.

A parsimonious specification (a model with few parameters, i.e. a smaller model) is good because: (1) estimates will be more precise because of reduced multicollinearity; (2) estimates will be more reliable because you have more degrees of freedom; (3) the power of tests will be greater; (4) a simpler model is easier to understand than a complex one. (See p. 285 of the textbook.)
 

I've narrowed it down to three main contenders for my top choice model: Models E, G and J.

Model J is good because all of the parameters are significant, and it is a parsimonious specification. If you look at the correlation matrix (in esl, type ?corr cal unemp ... l_cal l_unemp ...), you will see that many of the variables are highly correlated, so to avoid the problems that arise with multicollinearity we would like to leave out as many unnecessary variables as possible. The one drawback to selecting this model is that other models have lower model selection statistics.
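A quick way to see this multicollinearity in the Python sketch (a rough check, analogous to the ?corr command just mentioned):

    print(df[orig + logs].corr().round(2))   # pairwise correlations of all candidates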

You could also make an argument for stopping at Model G. In this model, nearly all of the parameters are significant; the only non-significant parameter has a p-value of 0.102 and is on the borderline of being significant. Concern about omitted variable bias suggests that it is better to leave a variable in the model if it appears to have some effect. In addition, the majority of its model selection statistics are lower than those for models A-F.

In the end, I will choose J as my final model. The remainder of the problem set will use this model to make the comparisons.

II.7

The models that I chose in I.3 and II.6 are not the same. I would choose the model from I.3.

Comparing the last model of part I with model J from part II:

Out of the eight model selection statistics, 4 are lower for the part I model and 4 are lower for the part II model, so this alone doesn't indicate that one model is substantially better than the other (remember, one of these criteria is based on the R2). The general-to-simple method of model selection that we used in part I is preferred over the simple-to-general method we used in part II because it is surer and does not depend on the arbitrary 0.5 rule for selecting variables from the auxiliary regression. So, if I had to make a decision, I would choose the model from part I.

(The same answer would hold if I compared the part I model to Model G from part II.)
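For completeness, here is how the two finalists could be refit and compared in the Python sketch. Note that statsmodels reports likelihood-based AIC/BIC rather than esl's scaled criteria, so the numbers will differ from the output above, but "lower is better" applies either way:

    partI_vars = ["unemp", "edfat", "meat", "beer", "wine",
                  "l_unemp", "l_cig", "l_edfat", "l_beer", "l_wine"]
    finalI = sm.OLS(df["chd"], sm.add_constant(df[partI_vars])).fit()
    modelJ = sm.OLS(df["chd"],
                    sm.add_constant(df[["unemp", "meat", "wine",
                                        "l_beer", "l_wine"]])).fit()
    for name, m in [("Part I final", finalI), ("Model J", modelJ)]:
        print(f"{name}: AIC = {m.aic:.1f}, BIC = {m.bic:.1f}")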