ISQS 5349, Spring 2013
Course Syllabus
SAS data sets used in the class
Old web pages, midterms and finals
Class recordings
SAS: Access via citrix.ba.ttu.edu. Note on accessing graphs from Citrix.
Why is probability needed in the regression model?
|
Class Topics |
Preparation – Read and study everything in this column. There will be a quiz at the beginning of class on the day listed. Refer back to these documents repeatedly. |
Homework, etc. |
|||||
|
1. 1/17 Meaning of the term “Regression”; The
classical regression model and its assumptions. |
Read this paper on
the meaning of “regression to the mean,” by Martin Bland. Read this
summary of the assumptions of the regression model, by Dr. Westfall. Read Example 7.5 of Chapter 7 of Understanding Advanced Statistical Methods,
by Westfall and Henning. Read this
discussion of the “Model produces data” concept as it relates to the
regression model, by Dr. Westfall. Also read this
discussion of the “Model produces data” concept as it relates to the
regression model, by Dr. Westfall. |
|
|||||
|
2. 1/22 Likelihood
and least squares, Exact Inferences in the classical parametric regression
model (Today’s quiz covers all the readings for 1/17 and
1/22, and counts double) |
Read p. 16, starting with Example 17.4, through to
the top of p. 18, of Chapter 17 of Understanding
Advanced Statistical Methods, by Westfall and Henning. Read this
discussion of a confidence interval for the slope, from Doug Stirling, Massey University in Palmerston
North, New Zealand. Read this
document on interpreting p-values,
by Dr. Westfall. Read this
document, “Why you should never say ‘Accept Ho,’” written by Dr.
Westfall. Read this
discussion of confidence intervals for E(Y|X=x) versus Prediction interval for Y|X=x, from “Musings on Using and Misusing
Statistics,” by Martha K. Smith, retired UT professor. Read this
document on “Prediction and Generalization,” written by Dr. Westfall. Read this
document on “Confidence intervals and significance tests as predictions,”
written by Dr. Westfall. |
Codes
for class: |
|||||
|
3. 1/24
p-values, confidence and prediction intervals, Scatterplots, LOESS
smoothers |
Read the SAS
documentation on LOESS. Read this
paper on LOESS (additional links within the document are very
informative! But they are not required reading), from NIST. |
proc reg data = isqs5349.gpa_gmat(where = (degree='P' and ethnic = 'N'));
  model gpa = gmat;
run; quit;
SAS file for constructing Bayesian intervals using the same examples |
|||||
|
4. 1/29 Checking the assumptions of the classic
model |
Read the document, “How
to check assumptions using data,” written by Dr. Westfall, and run the
SAS code therein. Read and run the SAS code in the document, “Why
do assumptions matter?” from Dr. Westfall |
Testing
for curvature using quadratic regression: Toluca, Peak Energy, Car Sales, and
Product Complexity examples |
|||||
|
5. 1/31
Using transformations to achieve a more reasonable model |
Read these presentation
slides, by
William G. Jacoby, Department of Political Science, Michigan
State. Read the document, “Comments
on Transformations,” written by Dr. Westfall. |
Lance
Car Sales example: Analysis of model using the x^(-1) transformation. Peak
Energy Use example: Analysis of model using the ln(y)
transformation. |
|||||
|
6. 2/5 The multiple regression model |
Read this note:
What Carvalho calls “Standard Error” in his slides
is actually non-standard terminology.
He really means “Root Mean Squared Error,” which is the estimate of
the conditional standard deviation of Y
given the X variables. He used the
term “Standard Error” because it’s the term that Microsoft Excel uses. This is really unfortunate, because there
are standard errors for each of the beta estimates, and these are quite
different from the Root Mean Squared Error. Read the document “Prediction as association
and prediction as causation,” written by Dr. Westfall. Next semester:
More on causality (maybe something from Mostly Harmless Econometrics, by Angrist
and Pischke) |
SAS
code for Sales vs. Int rate and Gas Price example Visualizing
the Multiple Regression model- 3-D and partial plots using EXCEL. SAS
code for computer time vs. Ram and Clock speed example |
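To make the distinction in the note on “Standard Error” concrete, here is a minimal sketch in Python (used instead of SAS only so that it runs without the course data sets). It fits a simple regression to made-up numbers and computes the Root Mean Squared Error separately from the standard errors of the estimated coefficients. The data values and the function name are illustrative assumptions, not course material.

```python
import math


def simple_regression_errors(x, y):
    """Ordinary least squares of y on a single x.

    Returns (slope, intercept, rmse, se_slope, se_intercept).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Sum of squared residuals, with n - 2 degrees of freedom
    # (two estimated coefficients).
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))

    # Root Mean Squared Error: estimate of the conditional
    # standard deviation of Y given X.
    rmse = math.sqrt(sse / (n - 2))

    # Standard errors of the coefficient estimates: these depend on
    # the RMSE but also on n and on the spread of the x values.
    se_slope = rmse / math.sqrt(sxx)
    se_intercept = rmse * math.sqrt(1 / n + x_bar ** 2 / sxx)
    return slope, intercept, rmse, se_slope, se_intercept


# Hypothetical data, chosen only so the arithmetic is easy to follow.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
slope, intercept, rmse, se_slope, se_intercept = simple_regression_errors(x, y)
print(f"RMSE = {rmse:.4f}")
print(f"SE(slope) = {se_slope:.4f}, SE(intercept) = {se_intercept:.4f}")
```

The three printed quantities are all different, which is the point of the note: the Root Mean Squared Error describes the scatter of Y around the regression function, while each standard error describes the uncertainty in one estimated coefficient.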
|||||
|
7. 2/7 The
Gauss-Markov theorem, standard errors, t intervals and tests |
Read very carefully this
document on the Gauss-Markov Theorem, by D. Stephen G. Pollock of
Leicester U. Read only page 1 up to the section “The Gauss-Markov Theorem.
(An Alternative Statement).” It’s not much to read, but it is very dense, so
read it slowly and carefully. Read this document on the
matrix form of the regression model, from François Nielsen of UNC-Chapel
Hill, sections 1, 2, 3, 4, 5.1 – 5.4, 5.6.1, 5.6.2. |
The
multivariate normal distribution (from Wikipedia) Information on covariance
matrices, from Wikipedia Illustration
of sampling distribution of the estimated slope. Illustrating
the Gauss-Markov property, both good and bad. The
various matrices in regression: a SAS/IML file. You can also see many of the matrices using the "xpx" and "i" options in the "MODEL" statement of PROC REG. |
|||||
|
8. 2/12 Multicollinearity |
Read these
presentation slides, slides 1-6, by Alicia Carriquiry
of Iowa State. Read this
document on multicollinearity, by Dr. Westfall |
HW 3,
due Thursday, 2/21 (not 3/1 as it says in the document).
ESP demo: Influencing the outcome of a die roll:
data esp; /* data set name is a placeholder */
  input first second @@;
  cards;
3 1 4 6 6 6 1 3 6 1 2 1 1 1
;
proc reg data = esp;
  model second = first;
run; quit;
File to illustrate problems with multicollinearity. SAS file for diagnosis and interpretation of multicollinear variables; also indicates one of many potential solutions to the problem. |
|||||
|
9. 2/14 The ANOVA table, the F test, and the
R-squared statistic |
Read this
document by Matt Blackwell of U. Rochester, and others. Read this
document by Kristofer Jennings, Purdue U. |
Full
model - reduced model F test. |
|||||
|
10. 2/19 Interactions; the inclusion principle |
Read this summary of
Mediators (intervening variables) and Moderators (interacting variables)
(This is a brief summary of Baron and Kenny’s paper.) Read this
document about Type I and Type III F tests by Dr. Westfall. Read p. 1-11 in this
document by William G. Jacoby, Department of Political Science, Michigan State. |
Examining
interactions - a SAS demo Moderator
example, from Karl Wuensch's web page http://core.ecu.edu/psyc/wuenschk/.
The publication is here. A "Hand-drawn" graph using Excel of the moderating effect. File
to illustrate problems with violating the inclusion principle |
|||||
|
11. 2/21 Dummy variables, ANOVA, ANCOVA, and
multiple comparisons, ANCOVA with interactions, LSMeans,
graphical summaries |
Read sections 1.1 – 1.3 of Multiple
Comparisons and Multiple Tests Using SAS, by Westfall, Tobias and Wolfinger. Read 10.1 – 10.4 of this document
by Howard Seltman of Carnegie-Mellon. |
ANOVA/ANCOVA,
first file – comparing GPAs of Male and Female students (two-level
ANOVA/ANCOVA). |
|||||
|
12. 2/26 Variable and model selection |
Read "Model Selection
Issues" on p. 43 and 44 from the Documentation for
the SAS/STAT GLMSELECT procedure. Read page 1, up to “Model selection,”
in this paper. |
ANOVA/ANCOVA,
second file – comparing GPAs of students in different degree plans |
|||||
|
13. 2/28 Variable and model selection |
Read Rob
McCulloch’s notes. Comment: While
the notes are great and McCulloch is a very famous statistician, there is too
much emphasis on polynomial models for a single variable in the
presentation. Just realize that the
points he is making apply equally well to all other kinds of regression
models, polynomial in a single variable, linear in multiple variables, neural
nets, trees, LOESS, simple or multiple regression, any kind of regression
model whatsoever. McCulloch knows very well that polynomial models should
usually be avoided; I think he picked polynomial models just to make the
presentation easier to understand. Next time I
teach this course (Spring 2014), students will read this article in Statistical Science
by Galit Shmueli, "To Explain or
to Predict?" |
A
SAS demo to illustrate the danger of overfitting. SAS
file for producing and comparing PRESS (n-fold
cross-validation) statistics for different models |
|||||
|
14. 3/5 Heteroscedasticity: WLS, ML estimation,
robust standard errors |
Read about variance function estimation in this classic paper by Davidian and Carroll, but just read sections 1, 2 and 3. Read all about robust standard errors in this
classic paper by Long and Ervin. |
First
file to illustrate benefit of Weighted Least Squares – shows imprecise
predictions of OLS in the presence of heteroscedasticity Comparison
of prediction limits: Homoscedastic vs. Heteroscedastic models – shows
OLS prediction limits are incorrect in the presence of heteroscedasticity Comparing
ordinary and heteroscedasticity-consistent standard errors |
|||||
|
15. 3/7 Outlier……………..s Watch this
video; we won’t meet in class. Quiz on 3/19 counts double, covering the video,
the 3/7 readings, and the 3/19 readings. |
Read about Influence
Diagnostics from the SAS documentation (up to but not including the
section called “The PARTIAL Option”) Read sections 1 – 3 of this
classic paper by R. Dennis Cook. Read Outliers:
How to detect them and what to do about them?, by Dr. Westfall |
|
|||||
|
16. 3/19 Quantile regression |
Read either
A gentle
introduction to quantile regression for ecologists, or Quantile Regression (from J. Econ. Perspectives) Read the overview
of PROC QUANTREG, p. 5352 – 5356, AND one
of the examples, either 72.2, 72.3, 72.4, or 72.5. |
||||||
|
17. 3/21
Generalized Least Squares, Correlated errors |
Read Ch. 9 up to and including Section 9.3 of this
document from Michael Creel. Read the SAS
documentation for PROC MIXED, p. 3886 – 3896. |
Simulation studies of inefficiency and Type I error inflation when using OLS with correlated errors: An example with clustered data |
|||||
|
18. 3/26
Repeated measures and multilevel analysis |
Read this article
by David Dickey on PROC MIXED. Read this article by Judith
Singer. Maybe next semester use this
one: “A Multilevel Model Primer Using SAS® PROC MIXED,” by Bethany A. Bell, Mihaela Ene, Whitney Smiley, and
Jason A. Schoeneberger. |
/* Example with random coefficient/growth curve modeling. What affects charitable contributions? */
proc mixed data = isqs5349.charitytax covtest;
  class subject;
  model charity = income price age ms deps time / s;
  random int income price age ms deps time / subject = subject type = fa(1);
  repeated / subject = subject type = arma(1,1);
run; |
|||||
|
19. 3/28 BLUPs |
Read this excellent
presentation by Carolyn Anderson. Here
is a correspondence of symbols used in Anderson with symbols used elsewhere |
Ranking
of teaching in various majors at TTU using BLUPs, with comparison to simple
OLS means. A
case study in random coefficient modeling: effects of firm strategy and
investment on performance, incorporating clustering effects due to NAICS industry classification,
and providing BLUPs. You'll need to download the Firm
performance data set. |
|||||
|
20. 4/2 Random coefficient regression and SUR |
Read Sections 1 and 2 of “Longitudinal
and Panel Data,” by Frees and Kim. Read this
summary of Zellner’s seemingly unrelated
regression model by James Powell. Here is a supplemental paper, not required (not
this semester anyway): A
critique of the over-used, often misunderstood, and trained parrot-ish Hausman test for fixed
versus random effects. Next semester – more on pure time series
regression |
/* Two way random effects modeling in PROC MIXED and PROC PANEL */
proc mixed data = isqs5349.charitytax covtest;
  class subject time;
  model charity = income price age ms deps / s;
  random subject time;
run;
proc panel data = isqs5349.charitytax;
  id subject time;
  model charity = income price age ms deps / rantwo;
run;
Seemingly unrelated regressions on investment data. Data file. Excel sheet comparing UR and SUR. Incorporating missing values using PROC MIXED. A paper describing the utility of PROC MIXED for general multivariate models. |
|||||
|
21. 4/4 Binary regression models |
Read the Wikipedia page
on logistic regression – it’s pretty good.
Read up to, but not including the section “As a two-way latent
variable model” (even though, believe it or not, Daniel McFadden won a Nobel
prize in 2000 for the material in that section). |
Maximum likelihood logistic regression analysis "by hand" using EXCEL's solver.
An example from this great paper:
ods graphics on;
proc logistic data = isqs5349.mergers plots(only) = effect;
  model function(event="JointVenture") = ccd legalre timing fmsize;
run;
ods graphics off;
Concordance; Discordance. Consider a pair of observations: one where the event occurred, one where the event did not occur. If the observation with the event has a higher predicted outcome probability, the pair is concordant. If the observation with the event has a lower predicted outcome probability, the pair is discordant. If the predicted outcome probabilities are the same, the pair is tied. |
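The concordance definitions can be checked by brute force on a tiny example. The following is a minimal sketch in Python (used instead of SAS only so that it runs without the course data sets); the five observations and their predicted probabilities are invented for illustration, and the function name is hypothetical. It compares exact probabilities pair by pair, which is a simplification and may not match every detail of how a statistics package tallies the pairs.

```python
def classify_pairs(events, probabilities):
    """Count concordant, discordant and tied pairs.

    events: 1 if the event occurred for that observation, else 0.
    probabilities: predicted probability of the event for each observation.
    Only pairs made of one event and one non-event are considered.
    """
    event_probs = [p for e, p in zip(events, probabilities) if e == 1]
    nonevent_probs = [p for e, p in zip(events, probabilities) if e == 0]

    concordant = discordant = tied = 0
    for p_event in event_probs:
        for p_nonevent in nonevent_probs:
            if p_event > p_nonevent:
                concordant += 1   # event case was given the higher probability
            elif p_event < p_nonevent:
                discordant += 1   # event case was given the lower probability
            else:
                tied += 1         # same predicted probability
    return concordant, discordant, tied


# Hypothetical data: three events, two non-events.
events = [1, 1, 1, 0, 0]
probabilities = [0.9, 0.6, 0.3, 0.6, 0.2]

concordant, discordant, tied = classify_pairs(events, probabilities)
total_pairs = concordant + discordant + tied
print(concordant, discordant, tied, total_pairs)
```

With three events and two non-events there are 3 × 2 = 6 pairs to classify; a model that separates events from non-events well produces mostly concordant pairs.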
|||||
|
22. 4/9 Ordinal response regression models |
Read this
document by Paul Johnson at KU. It’s kind of funny, as well as informative,
and thankfully, correct. |
Ordered Categorical Response
Model using SAS, with graphical
presentations using Excel. Excel file for ML
estimation for ordinal regression. |
|||||
|
23. 4/11 Poisson, negative binomial and other
count data regression models |
Read this
article by Liu and Cela. |
The Poisson
Distribution (EXCEL file) Poisson
Regression file (SAS). Case info.
Follow-up analysis
using EXCEL. ML
estimation for Poisson regression Wikipedia
page for negative binomial distribution (pretty good!) The negative
binomial distribution compared to Poisson |
|||||
|
24. 4/16 GLMMs with repeated or hierarchical data
structures |
Read this article by
Oliver Schabenberger, developer of PROC GLIMMIX of
SAS/STAT. |
Use of
PROC GLIMMIX to fit a random intercept logistic regression model. |
|||||
|
25. 4/18 Nominal response regression models |
|
Multinomial
logistic regression using PROC LOGISTIC of SAS. Summaries
using EXCEL. |
|||||
|
26. 4/23 Tobit and
censored regression models |
Read sections 1, 2, 3 and 5
of “Tobit
Models: A Survey,” by Takeshi Amemiya, Journal of Econometrics, Volume 24,
1984. Here
is the link. |
SAS
file for TOBIT regression. Data set is here.
A graph
illustrating the Tobit Model. |
|||||
|
27. 4/25 Survival analysis regression models; Cox
proportional hazards model |
Read this summary by Maarten Buis of Vrije Universiteit Amsterdam. |
SAS
file for proportional hazards regression. Follow-up using
excel, with comparison to lognormal regression. Interpretation
of the sign of the "Months" coefficient: why is it positive? |
|||||
|
28. 4/30 Instrumental variable regression |
Your
professor’s explication of this morass.
(Updated and corrected, 5/1/2013 9:01AM) Future semesters: see papers by Thad Dunning,
Yale. Future semesters – more time on endogeneity issues, also on sample selection bias
(Heckman) |
A
paper with an example on instrument selection.
/* A somewhat silly example illustrating IV estimation */
proc syslin data = isqs5349.gpa_gmat 2sls;
  endogenous gpa;
  instruments major;
  eq1: model gpa = gmat;
  eq2: model gmat = major;
run;
proc corr cov data = isqs5349.gpa_gmat;
  var gpa gmat major;
run; |
|||||
|
29. 5/2 Project presentations |
Game rules for presentations.
1. An overview of the statistical methodology is essential. Make it general enough for everyone. Do not be too specific to your own discipline. State assumptions.
2. Data analysis and SAS implementation (could be specialized procs, macros or IML code, as available) is essential.
3. Your aim must be towards the other students, not to me. Aim to present something that your classmates will find to be potentially useful. When might they want to use this methodology, in general? (no matter what field they are in). How do they do it in SAS? What do you get out of it? What are the caveats? What is the benefit of this methodology relative to other methodologies?
4. Presentations should focus on statistics. Some context is needed, but everyone in this class comes from different subject areas, so the subject theory should be minimized to only that bit which is needed to understand the statistics.
5. Feel free to present material from other sources (internet, books, journals, etc.) CITE YOUR SOURCES VERY CLEARLY. OTHERWISE, IT’S CALLED ‘PLAGIARISM.’
6. The data analysis need not be useful at all. It can be just an exercise that doesn't work out so well. That's perfect! It shows others a caveat - what *not* to do.
7. Everyone will speak 6 minutes. A group of three will talk 18 minutes. A group of one will talk 6 minutes. DO NOT GO OVER TIME. DO NOT MAKE TOO MANY SLIDES. One slide a minute is enough. Maybe too much. DON'T JUST READ SLIDES! THAT WILL BE TOO PAINFUL FOR THE AUDIENCE!!!!!!
8. You are on your own on group composition. It can be your HW group, or not. It's totally up to you. No more than three in a group.
9. Students will be randomly called on to ask questions of presenters. Every student will ask a question or questions, and every presenter will answer these questions.
10. Presenters in groups with more than one member will be randomly selected to answer the questions. The question will be about any aspect of the group presentation, not just what the particular presenter discussed.
11. Each question and answer session will take 2 minutes.
12. Questions must focus on statistics, statistical methodology, code, output, etc. Questions might involve subject-matter specific concepts, but only if those questions mainly involve the statistics, statistical methodology, code, output, etc. Questions will be graded on engagement and relevance.
13. Unless students prefer otherwise, the first students to select their topic will present last so they can see what everyone else has done. The next to select a topic will present second-to-last, and so on.
14. This will all happen on the last two class days.
15. Your overall project grade will be 60% on presentation, 30% on your question, and 10% on your answer. The question is more important than the answer!
Here are some titles for presentations that are in the ball park of what I am looking for. I am not asking you to pick one of these though. These just give the flavor of what I want. Google is good.
“Comparing neural networks with classical linear regression.”
“Using hierarchical logistic regression models to predict student retention.”
“Using Bayesian methods to estimate elasticity.”
“Using Bayesian methods to estimate optimal product position.”
“A simulation study comparing model selection methods.”
“Using time-to-event models to predict longevity (of firm, plant, animal, or human).”
“Choosing optimal thresholds for logistic regression probabilities.”
“Applying shrinkage estimates (BLUPs) to improve online rating systems.”
“The effects of model misspecification in panel data.” |
|
|||||
|
30. 5/7 Projects |
|
Winsorizing - What is it and Why is it a Bad Idea?
Note: Linked files are in no way considered “endorsed” by your professor. They are simply provided “as is” so you can remember what was discussed. |
|||||
|
Final Exam: Friday, 5/10, 4:30 – 7:00 PM. The final is cumulative,
covering the entire course in equal parts, including the student
presentations. |
Old finals and solutions are available in the old
courses link. But every semester is different. |
The Quiz race!
