ISQS 5349, Spring 2013

Course Syllabus
SAS data sets used in the class
Old web pages, midterms and finals
Class recordings


SAS: Access via citrix.ba.ttu.edu. Note on accessing graphs from Citrix.

Why is probability needed in the regression model?

Class Topics

Preparation – Read and study everything in this column. There will be a quiz at the beginning of class on the day listed.  Refer back to these documents repeatedly.

Homework, etc.

1. 1/17 Meaning of the term “Regression”; The classical regression model and its assumptions.


Read this matrix algebra prep material, courtesy of A. Colin Cameron, UC Davis (whoo hoo!).

 

Read this paper on the meaning of “regression to the mean,” by Martin Bland.

Read this summary of the assumptions of the regression model, by Dr. Westfall.

Read Example 7.5 of Chapter 7 of Understanding Advanced Statistical Methods, by Westfall and Henning.

 

Read this discussion of the “Model produces data” concept as it relates to the regression model, by Dr. Westfall.

 

Also read this discussion of the “Model produces data” concept as it relates to the regression model, by Dr. Westfall.

HW 1, due Thursday 1/24/2013

 

Why is probability needed in the regression model?

2. 1/22  Likelihood and least squares; exact inferences in the classical parametric regression model

 

(Today’s quiz covers all the readings for 1/17 and 1/22, and counts double)

Read p. 16, starting with Example 17.4, through to the top of p. 18, of Chapter 17 of Understanding Advanced Statistical Methods, by Westfall and Henning.

 

Read this discussion of a confidence interval for the slope, from Doug Stirling, Massey University in Palmerston North, New Zealand.

 

Read this document on interpreting p-values, by Dr. Westfall.

 

Read this document on “Why you should never say ‘Accept Ho,’” written by Dr. Westfall.

 

Read this discussion of confidence intervals for E(Y|X=x) versus prediction intervals for Y|X=x, from “Musings on Using and Misusing Statistics,” by Martha K. Smith, retired UT professor.

 

Read this document on “Prediction and Generalization,” written by Dr. Westfall.

 

Read this document on “Confidence intervals and significance tests as predictions,” written by Dr. Westfall.

Codes for class:

 

Work hours versus Lot size

 

Why do assumptions matter?

 

Why is probability needed in the regression model?

3. 1/24  p-values, confidence and prediction intervals, scatterplots, LOESS smoothers

Read the SAS documentation on LOESS.

 

Read this paper on LOESS, from NIST. (The additional links within the document are very informative, but they are not required reading.)

/* Simple regression of GPA on GMAT, restricted to degree = 'P' and ethnic = 'N' */
proc reg data = isqs5349.gpa_gmat(where = (degree='P' and ethnic = 'N'));

   model gpa = gmat;

run; quit;

SAS file for constructing frequentist confidence and prediction intervals; GPA/SAT example and Toluca example

SAS file for constructing Bayesian intervals using the same examples

Bayesian analysis using transformation to solve obvious problem with nonnormality of the GPA distribution

Similar to the above, but using the GPA/GMAT data

Estimating curvature using LOESS smoothing: Toluca, Peak Energy, Car Sales, and Product complexity examples.

4. 1/29 Checking the assumptions of the classical model

Read the document, “How to check assumptions using data,” written by Dr. Westfall, and run the SAS code therein.

 

Read and run the SAS code in the document, “Why do assumptions matter?” from Dr. Westfall

HW2 due Thursday, 2/7.

Understanding how to interpret LOESS smooths by simulating data where the true mean function is known.

Testing for curvature using quadratic regression: Toluca, Peak Energy, Car Sales, and Product Complexity examples

Statistical versus practical significance: A demonstration of the difference.

Estimating the relationship between mean absolute residual and predictor variable using LOESS smoothing: Toluca and Peak Energy examples.

Understanding how to interpret LOESS smooths of absolute residuals by simulating data where the true variance function is known.

Testing for heteroscedasticity using the Breusch-Pagan test: Toluca, Peak Energy, and Product Complexity examples.

Evaluating the normality assumption using q-q plots and hypothesis tests: Toluca, Peak Energy, and Product Complexity examples.

Understanding how to interpret q-q plots by simulating data where the true error distribution is known.

5. 1/31  Using transformations to achieve a more reasonable model

Read these presentation slides, by  William G. Jacoby, Department of Political Science, Michigan State.

 

Read the document, “Comments on Transformations,” written by Dr. Westfall.

Lance Car Sales example: Analysis of model using x-1 transformation

Peak Energy Use example: Analysis of model using ln(y) transformation

6. 2/5 The multiple regression model


Read these presentation slides by Carlos Carvalho, UT Austin.

 

Read this note:  What Carvalho calls “Standard Error” in his slides is non-standard terminology.  He really means “Root Mean Squared Error,” which is the estimate of the conditional standard deviation of Y given the X variables. He used the term “Standard Error” because it’s the term that Microsoft Excel uses.  This is really unfortunate, because there are standard errors for each of the beta estimates, and these are quite different from the Root Mean Squared Error.
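To make the distinction in the note concrete, here is a sketch of the two quantities in standard matrix notation. The symbols are assumptions, not taken from Carvalho's slides: n is the number of observations, k the number of X variables (plus an intercept), SSE the sum of squared residuals, and X the design matrix including the column of ones.

```latex
\hat{\sigma} \;=\; \text{Root MSE} \;=\; \sqrt{\frac{\mathrm{SSE}}{n-k-1}},
\qquad
\mathrm{s.e.}(\hat{\beta}_j) \;=\; \hat{\sigma}\,\sqrt{\left[(X^{\top}X)^{-1}\right]_{jj}}
```

There is one Root MSE for the whole model, but a separate standard error for each beta estimate, and each standard error is the Root MSE scaled by a factor that depends on the X data.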

 

Read the document “Prediction as association and prediction as causation,” written by Dr. Westfall.

 

Next semester:  More on causality (maybe something from Mostly Harmless Econometrics, by Angrist and Pischke)

SAS code for Sales vs. Int rate and Gas Price example

Visualizing the multiple regression model: 3-D and partial plots using EXCEL.

SAS code showing that simple Y*Xj diagnostics are not completely adequate to judge the fit of the multiple regression model.

SAS code for computer time vs. Ram and Clock speed example

 

7. 2/7  The Gauss-Markov theorem, standard errors, t intervals and tests

Read very carefully this document on the Gauss-Markov Theorem, by D. Stephen G. Pollock of Leicester U. Read only page 1, up to the section “The Gauss-Markov Theorem. (An Alternative Statement).” It’s not much to read, but it is very thick, so read it slowly and carefully.

 

Read this document on the matrix form of the regression model, from François Nielsen of UNC-Chapel Hill, sections 1, 2, 3, 4, 5.1 – 5.4, 5.6.1, 5.6.2.

The multivariate normal distribution (from Wikipedia)

Information on covariance matrices, from Wikipedia

Illustration of sampling distribution of the estimated slope.

Illustrating the Gauss-Markov property, both good and bad.

The various matrices in regression: a SAS/IML file.

You can also see many of the matrices using the "xpx" and "i" options in the "MODEL" statement of PROC REG.
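As a minimal sketch of the options just mentioned, the following reuses the GPA/GMAT data set from the 1/24 class; any other data set and model would work equally well. The XPX option prints the crossproducts matrix and the I option prints the inverse of X'X.

```sas
/* Sketch only: print the X'X crossproducts matrix and its inverse */
/* for the simple regression of GPA on GMAT.                       */
proc reg data = isqs5349.gpa_gmat;
   model gpa = gmat / xpx i;
run; quit;
```

Compare the printed matrices with the ones computed directly in the SAS/IML file.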

8. 2/12  Multicollinearity

Read these presentation slides, slides 1-6, by Alicia Carriquiry of Iowa State.

 

Read this document on multicollinearity, by Dr. Westfall

 

HW 3, due Thursday, 2/21 (not 3/1 as it says in the document.)

ESP demo: Influencing the outcome of a die roll:
data dice;

   input first second @@;

cards;

3 1 4 6 6 6 1 3 6 1 2 1 1 1 
2 1 2 3 4 3 3 3 6 3 3 2 6 5
6 2 6 1 5 5 6 3 2 5 3 5

;

/* Regress the second roll on the first roll */
proc reg data=dice;

   model second = first;
run; quit;

File to illustrate problems with multicollinearity

SAS file for diagnosis and interpretation of multicollinear variables; also indicates one of many potential solutions to the problem.

9. 2/14 The ANOVA table, the F test, and the R-squared statistic 

Read this document by Matt Blackwell of U. Rochester, and others.

 

Read this document by Kristofer Jennings, Purdue U.

Full model - reduced model F test.

 

10. 2/19 Interactions; the inclusion principle

Read this summary of Mediators (intervening variables) and Moderators (interacting variables). (This is a brief summary of Baron and Kenny’s paper.)

 

Read this document about Type I and Type III F tests by Dr. Westfall

 

Read p. 1-11 in this document by William G. Jacoby, Department of Political Science, Michigan State.

Examining interactions - a SAS demo

Moderator example, from Karl Wuensch's web page http://core.ecu.edu/psyc/wuenschk/. The publication is here.

A "Hand-drawn" graph using Excel of the moderating effect.

File to illustrate problems with violating the inclusion principle

11. 2/21 Dummy variables, ANOVA, ANCOVA, and multiple comparisons, ANCOVA with interactions, LSMeans, graphical summaries

Read sections 1.1 – 1.3 of Multiple Tests and Multiple Comparisons Using SAS, by Westfall, Tobias and Wolfinger.

 

Read 10.1 – 10.4 of this document by Howard Seltman of Carnegie-Mellon.

ANOVA/ANCOVA, first file – comparing GPAs of Male and Female students (two-level ANOVA/ANCOVA).

12. 2/26 Variable and model selection

Read "Model Selection Issues" on p. 43 and 44 from the Documentation for the SAS/STAT GLMSELECT procedure.

 

Read page 1, up to “Model selection,” in this paper.

Read summary comments on variable selection, data snooping, and a strategy for variable selection, by Dr. Westfall

HW 4 due Tuesday 3/5

ANOVA/ANCOVA, second file – comparing GPAs of students in different degree plans

13. 2/28 Variable and model selection

Read Rob McCulloch’s notes.  Comment: While the notes are great and McCulloch is a very famous statistician, there is too much emphasis on polynomial models for a single variable in the presentation.  Just realize that the points he is making apply equally well to all other kinds of regression models, polynomial in a single variable, linear in multiple variables, neural nets, trees, LOESS, simple or multiple regression, any kind of regression model whatsoever. McCulloch knows very well that polynomial models should usually be avoided; I think he picked polynomial models just to make the presentation easier to understand.

 

(Next time I teach this course, in Spring 2014, students will) Read this article in Statistical Science by Galit Shmueli, "To Explain or to Predict?"

The Law of Total Variance

A SAS demo to illustrate the danger of overfitting.

A SAS simulation to illustrate that including extraneous variables does not cause bias, but does inflate the variance.

A SAS file to illustrate the variance/bias tradeoff, and show why you might prefer biased estimates in terms of estimating the mean value.

A SAS file to illustrate the variance/bias trade-off in terms of parameter estimation. Sometimes biased parameter estimates are more accurate than unbiased estimates.

SAS file for producing and comparing PRESS (n-fold cross-validation) statistics for different models

Model selection for predicting doctors per capita

14. 3/5 Heteroscedasticity: WLS, ML estimation, robust standard errors

Read about variance function estimation in this classic paper by Davidian and Carroll, but just read sections 1, 2 and 3.

 

Read all about robust standard errors in this classic paper by Long and Ervin.

First file to illustrate benefit of Weighted Least Squares – shows imprecise predictions of OLS in the presence of heteroscedasticity

Comparison of prediction limits: Homoscedastic vs. Heteroscedastic models – shows OLS prediction limits are incorrect in the presence of heteroscedasticity

Estimating the heteroscedastic variance of GE returns as a function of trading volume via maximum likelihood using PROC MODEL.

Comparing ordinary and heteroscedasticity-consistent standard errors

15. 3/7 Outlier……………..s

 

Watch this video; we won’t meet in class. Quiz on 3/19 counts double, covering the video, the 3/7 readings, and the 3/19 readings.

Read about Influence Diagnostics from the SAS documentation (up to but not including the section called “The PARTIAL Option”)

 

Read sections 1 – 3 of this classic paper by R. Dennis Cook.

 

Read Outliers: How to detect them and what to do about them?, by Dr. Westfall

 

16. 3/19  Quantile regression

Read either A gentle introduction to quantile regression for ecologists,  or Quantile Regression (from J. Econ. Perspectives)

 

Read the overview of PROC QUANTREG, p. 5352 – 5356, AND one of the examples, either 72.2, 72.3, 72.4, or 72.5.

 

Data on weekly salaries from the BLS, from 2002 to 2012. Note that the 0.10 and 0.90 quantiles have different slopes.

EXCEL spreadsheet to explain the quantile estimation method

EXCEL spreadsheet to explain the quantile estimation method in the regression case – The CAPM regression model

The CAPM model via quantile regression using PROC QUANTREG

17. 3/21  Generalized Least Squares, Correlated errors 

Read Ch. 9 up to and including Section 9.3 of this document from Michael Creel.

 

Read the SAS documentation for PROC MIXED, p. 3886 – 3896.

A first data analysis indicating the problem with correlated observations: Standard errors can be either too large or too small if data are assumed independent.

Simulation studies of inefficiency and Type I error inflation when using OLS with correlated errors: An example with clustered data

18. 3/26  Repeated measures and multilevel analysis

Read this article by David Dickey on PROC MIXED

 

Read this article by Judith Singer

 

Maybe next semester use this one: A Multilevel Model Primer Using SAS® PROC MIXED Bethany A. Bell, Mihaela Ene, Whitney Smiley, Jason A. Schoeneberger

HW5, due 4/9

 

/* Example with random coefficient/ growth curve modeling.

What affects charitable contributions? */

proc mixed data = isqs5349.charitytax covtest;

   class subject;

   model charity = income price age ms deps time/s;

   random int income price age ms deps time/ subject=subject type=fa(1);

   repeated / subject = subject type = arma(1,1);

run;

 

A case study in the analysis of a repeated measures experimental design: a real live clinical trial data set, with the type of analysis typically performed by drug companies.

19. 3/28 BLUPs

Read this excellent presentation by Carolyn Anderson.

 

Here is a correspondence of symbols used in Anderson with symbols used elsewhere

Ranking of teaching in various majors at TTU using BLUPs, with comparison to simple OLS means.

A case study in random coefficient modeling:  effects of firm strategy and investment on performance, incorporating clustering effects due to NAICS industry classification, and providing BLUPs. You'll need to download the Firm performance data set

20. 4/2 Random coefficient regression and SUR

Read Sections 1 and 2 of “Longitudinal and Panel Data,” by Frees and Kim.

 

Read this summary of Zellner’s seemingly unrelated regression model by James Powell.

 

Here is a supplemental paper, not required (not this semester anyway):  A critique of the over-used, often misunderstood, and trained parrot-ish Hausman test for fixed versus random effects.

 

Next semester – more on pure time series regression

HW 6 (last) due April 18

/* Two way random effects modeling in PROC MIXED and PROC PANEL */

 

proc mixed data = isqs5349.charitytax covtest ;

   class subject time;

   model charity = income price age ms deps/s;

   random subject time; run;

 

proc panel data = isqs5349.charitytax;

   id subject time;

   model charity = income price age ms deps / rantwo; run;

 

 

Seemingly unrelated regressions on investment data.  Data file. Excel sheet comparing UR and SUR.

Incorporating missing values using PROC MIXED

Multivariate Regression Model (Multiple Y variables, all responding to the same X variables) using PROC MIXED

A paper describing the utility of PROC MIXED for general multivariate models

21. 4/4 Binary regression models

Read the Wikipedia page on logistic regression – it’s pretty good.  Read up to, but not including the section “As a two-way latent variable model” (even though, believe it or not, Daniel McFadden won a Nobel prize in 2000 for the material in that section).

A finance link

 

Maximum likelihood logistic regression analysis "by hand" using EXCEL's solver.

Logistic Regression Examples Using SAS (including grouped data input).

Graphical Presentations using Excel.

Comparing Normit and Logit link Functions.

 

An example from this great paper:

ods graphics on;

proc logistic data=isqs5349.mergers plots(only) = effect;

model function(event="JointVenture") = ccd legalre timing fmsize;

run;

ods graphics off;

 

Concordance; Discordance.

Consider a pair of observations; one where the event occurred, one where the event did not occur.

If the observation with the event has a higher predicted outcome probability, the pair is concordant. If the observation with the event has a lower predicted outcome probability, the pair is discordant. If the predicted outcome probabilities are the same, the pair is tied.
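The pair counting described above can be sketched by hand in PROC SQL. The data set `scored` and its six rows are hypothetical, made up for illustration only (y = 1 means the event occurred, p is the predicted outcome probability); they are not from the mergers example.

```sas
/* Hypothetical toy data: y = 1 if the event occurred, p = predicted probability */
data scored;
   input y p @@;
cards;
1 0.9 1 0.6 1 0.4 0 0.7 0 0.4 0 0.2
;

/* Form every (event, non-event) pair and classify it. */
/* A comparison such as e.p > n.p evaluates to 1 or 0. */
proc sql;
   select sum(e.p > n.p) as concordant,
          sum(e.p < n.p) as discordant,
          sum(e.p = n.p) as tied
   from scored(where=(y=1)) as e,
        scored(where=(y=0)) as n;
quit;
```

With these toy values there are 3 x 3 = 9 pairs: 6 concordant, 2 discordant, and 1 tied. PROC LOGISTIC reports the same kind of summary as percentages in its table of association between predicted probabilities and observed responses.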

22. 4/9 Ordinal response regression models

Read this document by Paul Johnson at KU. It’s kind of funny, as well as informative, and thankfully, correct.

Ordered Categorical Response Model using SAS, with graphical presentations using Excel.

Excel file for ML estimation for ordinal regression.

23. 4/11 Poisson, negative binomial and other count data regression models

Read this article by Liu and Cela.

The Poisson Distribution (EXCEL file)

Poisson Regression file (SAS).    Case info.  Follow-up analysis using EXCEL.

ML estimation for Poisson regression

Wikipedia page for negative binomial distribution (pretty good!)

The negative binomial distribution compared to Poisson

ML estimation with Negative Binomial distribution

Zero-inflated models using PROC COUNTREG

24. 4/16 GLMMs with repeated or hierarchical data structures

Read this article by Oliver Schabenberger, developer of PROC GLIMMIX of SAS/STAT.

Use of PROC GLIMMIX to fit a random intercept logistic regression model.

The (disguised) data file is here; many thanks to Qiwei Gan for supplying the data.  These disguised data are used simply to illustrate the idea, not to claim any research finding.

25. 4/18 Nominal response regression models

Read “Applying discrete choice models to predict Academy Award winners,” J. R. Statist. Soc. A (2008), 375-394, by Pardoe and Simonton.

 

Multinomial logistic regression using PROC LOGISTIC of SAS. Summaries using EXCEL. 

Excel file showing ML Estimation for multinomial logistic regression.

Modeling the probability of best picture selection using the conditional logit model. (Here is the Oscar selections data set– thanks Iain Pardoe!  Here is the codebook to explain the variable names.)

Comparing multinomial logistic regression with the conditional logit model.

26. 4/23 Tobit and censored regression models

Read sections 1, 2, 3 and 5 of “Tobit Models: A Survey,” by Takeshi Amemiya, Journal of Econometrics, Volume 24, 1984.  Here is the link.

SAS file for TOBIT regression.   Data set is here. A graph illustrating the Tobit Model.

ML estimation for TOBIT model using EXCEL.

Tobit versus ZIP; comparing likelihoods.

Censoring syntax in PROC LIFEREG


Programmer success case; time to completion.  Follow-up analysis using EXCEL, with info on the lognormal distribution.

Excel spreadsheet showing ML estimation for upper censored data, using both normal and lognormal distributions.

27. 4/25 Survival analysis regression models; Cox proportional hazards model

Read this summary by Maarten Buis of Vrije Universiteit Amsterdam.

SAS file for proportional hazards regression.  Follow-up using Excel, with comparison to lognormal regression. Interpretation of the sign of the "Months" coefficient: why is it positive?

28. 4/30 Instrumental variable regression

Your professor’s explication of this morass.  (Updated and corrected, 5/1/2013 9:01AM)

Future semesters: see papers by Thad Dunning, Yale.

 

Future semesters – more time on endogeneity issues, also on sample selection bias (Heckman)

A paper with an example on instrument selection.

/* A somewhat silly example illustrating IV estimation */

proc syslin data=isqs5349.gpa_gmat 2sls;

   endogenous  gpa;

   instruments major;

   eq1: model gpa = gmat;

   eq2: model gmat = major; run;

 

proc corr cov data=isqs5349.gpa_gmat;

   var gpa gmat major;

run;

Simulation file illustrating endogeneity bias and its rectification using instrumental variables and two-stage least squares, as well as covariance structure modeling

29. 5/2 Project presentations

Game rules for presentations.

1. An overview of the statistical methodology is essential.  Make it general enough for everyone.  Do not be too specific to your own discipline. State assumptions.

2. Data analysis and SAS implementation (could be specialized procs, macros or IML code, as available) is essential.

3. Your aim must be towards the other students, not to me. Aim to present something that your classmates will find to be potentially useful. When might they want to use this methodology, in general? (no matter what field they are in). How do they do it in SAS? What do you get out of it? What are the caveats? What is the benefit of this methodology relative to other methodologies?

4. Presentations should focus on statistics. Some context is needed, but everyone in this class comes from different subject areas, so the subject theory should be minimized to only that bit which is needed to understand the statistics.

5. Feel free to present material from other sources (internet, books, journals, etc.) CITE YOUR SOURCES VERY CLEARLY.  OTHERWISE, IT’S CALLED ‘PLAGIARISM.’

6. The data analysis need not be useful at all. It can be just an exercise that doesn't work out so well. That's perfect! It shows others a caveat - what *not* to do.

7. Everyone will speak 6 minutes. A group of three will talk 18 minutes. A group of one will talk 6 minutes. DO NOT GO OVER TIME. DO NOT MAKE TOO MANY SLIDES. One slide a minute is enough. Maybe too much. DON'T JUST READ SLIDES! THAT WILL BE TOO PAINFUL FOR THE AUDIENCE!!!!!!

8. You are on your own on group composition. It can be your HW group, or not. It's totally up to you. No more than three in a group.

9. Students will be randomly called on to ask questions of presenters. Every student will ask a question or questions, and every presenter will answer these questions.

10. Presenters in groups with more than one member will be randomly selected to answer the questions.  The question will be about any aspect of the group presentation, not just what the particular presenter discussed.

11. Each question and answer session will take 2 minutes.

12. Questions must focus on statistics, statistical methodology, code, output, etc. Questions might involve subject-matter specific concepts, but only if those questions mainly involve the statistics, statistical methodology, code, output, etc. Questions will be graded on engagement and relevance.

13. Unless students prefer otherwise, the first students to select their topic will present last so they can see what everyone else has done. The next to select a topic will present second-to-last, and so on.

14. This will all happen on the last two class days.

15. Your overall project grade will be 60% on presentation, 30% on your question, and 10% on your answer. The question is more important than the answer!

Here are some titles for presentations that are in the ballpark of what I am looking for.  I am not asking you to pick one of these, though.  These just give the flavor of what I want.  Google is good.

“Comparing neural networks with classical linear regression.”

“Using hierarchical logistic regression models to predict  student retention.”

“Using Bayesian methods to estimate elasticity.”

“Using Bayesian methods to estimate optimal product position.”

“A simulation study comparing model selection methods.”

“Using time-to-event models to predict longevity (of firm, plant, animal, or human).”

“Choosing optimal thresholds for logistic regression probabilities.”

“Applying shrinkage estimates (BLUPs) to improve online rating systems.”

“The effects of model misspecification in panel data.”

 

 

 

  5/2/13 Session: 

“Tools for Model Selection and Data Mining”

 

Presentations:

 

Model Averaging as an Alternative to Variable Selection (SAS code on last page)

 

Bagging and Boosting

 

Neural Nets versus Ordinary Least Squares (Also some SAS code) (Also a YouTube presentation)

 

Generalized Additive Models for Binary Data

 

Note: Linked files are in no way considered “endorsed” by your professor.  They are simply provided “as is” so you can remember what was discussed.

30. 5/7 Projects

 

Robust Standard Errors

Winsorizing - What is it and Why is it a Bad Idea?

 Switching Regressions

Optimal Design for Regression

 

Note: Linked files are in no way considered “endorsed” by your professor.  They are simply provided “as is” so you can remember what was discussed.

 

Final Exam: Friday, 5/10, 4:30 – 7:00 PM. The final is cumulative, covering the entire course in equal parts, including the student presentations.

 

Final Exam Solutions

Old finals and solutions are available in the old courses link. But every semester is different.

 

Why is probability needed in the regression model?

 

The Quiz race!