Westfall's Data Mining Class!!!
Fall 04, BA255, 2-3:30PM; we frequently meet in the computer lab,
BA363.
Syllabus
Student Presentation Summaries -
Overviews and Strengths (Critiques and grades are mailed to students
individually)
Note: Midterm 2 is 11/9 - covers material since Midterm 1. That
means association analysis, linkage analysis, cluster analysis and text mining.
The review materials are all on this web page, underneath "Midterm 1".
The test will be in room 255 at normal class time, not in the computer lab.
The final is cumulative.
Demo for Understanding Trees
SAS code node for robust assessment measures (for interval targets)
Class 9/23: Let's look at a nominal prediction problem (predicting on-line
purchase category): use this
data set
and select "category" as target. Partition the data, and run the tree.
From there we will look at assessment measures "Proportion Misclassified" and
"Leaf Impurity".
A clustering demo
A demo
for Comparison of Regression, Trees, and Neural Nets
Neural Nets for Trading Models
Midterm 1 Review topics
Midterm October 5:
Chapters 1-3 of book, p. 1-42 of
Predictive Modeling Using Enterprise Miner, p. 1-22 to 1-65, and p.
2-91 to 2-98 of Web
mining, homeworks, class notes, class enterprise miner exercises, web
notes above, and demos shown above.
"Basket Case" Demo
Tuesday Oct. 12: Demos on Association and links
Thursday Oct. 14: Demos - Links and clustering
Tuesday Oct. 19: Demos - clustering.
Thursday Oct. 21: More on clustering. Some misc. notes.
Tuesday Oct. 26: Text mining intro
Thursday Oct. 28: Text mining details
Tuesday Nov. 2: More Text mining
Thursday Nov. 4: Yep, you guessed it. More text mining.
Midterm 2 Solutions
Thursday Nov. 11. Let's meet in the regular classroom again. I need
to use the blackboard. I will go over the midterm solutions, then I want to talk
in a little more detail about linear transformations and SVD from a matrix
algebra perspective. (Here is a
SAS/IML file to compute SVD "manually".) I will also discuss more of the training material,
and talk about the clustering technique (Expectation Maximization) embedded
within Text Miner. The training material will loosely cover pages 1-86
through 2-33 of the manual.
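To give a feel for what computing an SVD "manually" can involve, here is a rough Python sketch (not the SAS/IML file linked above, whose contents are not reproduced here). It finds only the leading singular value and its two singular vectors by power iteration on A'A. The function names, the starting vector, and the example matrix are all assumptions chosen for illustration.

```python
import math

def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def norm(x):
    return math.sqrt(sum(v * v for v in x))

def leading_singular_triplet(A, iterations=200):
    """Power iteration on A'A. Returns (sigma, u, v) with A v = sigma u."""
    At = transpose(A)
    # Starting vector is an assumption; it must not be orthogonal to the answer.
    v = [1.0] + [0.0] * (len(A[0]) - 1)
    for _ in range(iterations):
        w = matvec(At, matvec(A, v))      # w = A'A v
        n = norm(w)
        v = [x / n for x in w]            # renormalize to unit length
    Av = matvec(A, v)
    sigma = norm(Av)                      # leading singular value
    u = [x / sigma for x in Av]           # matching left singular vector
    return sigma, u, v

A = [[3.0, 0.0],
     [4.0, 5.0]]
sigma, u, v = leading_singular_triplet(A)
print(sigma, u, v)
```

For this matrix A'A has eigenvalues 45 and 5, so the leading singular value is the square root of 45. The remaining singular triplets could be found by repeating the iteration on A with the leading component subtracted out.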
Tuesday, Nov. 16. We will discuss logical functions a little, talk about
clustering using normal mixtures, and predictive modeling using text.
Here is a file we will use:
http://www.ba.ttu.edu/isqs/westfall/dmtm/inssubro.sas7bdat
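As a small illustration of clustering with normal mixtures, here is a Python sketch of Expectation Maximization for two normal components in one dimension. It is a toy under stated assumptions (two components, crude starting values, made-up data) and is not the algorithm as implemented in Text Miner, nor does it use the file above.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def em_two_normals(data, iterations=100):
    """Fit a two-component 1-D normal mixture by Expectation Maximization."""
    # Crude starting values (an assumption): extremes as means, a shared sd.
    mean = [min(data), max(data)]
    sd = [(max(data) - min(data)) / 4.0] * 2
    weight = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: probability that each point belongs to component 0.
        resp = []
        for x in data:
            p0 = weight[0] * normal_pdf(x, mean[0], sd[0])
            p1 = weight[1] * normal_pdf(x, mean[1], sd[1])
            resp.append(p0 / (p0 + p1))
        # M-step: re-estimate weight, mean, and sd of each component.
        for k, r in enumerate([resp, [1.0 - q for q in resp]]):
            total = sum(r)
            weight[k] = total / len(data)
            mean[k] = sum(ri * x for ri, x in zip(r, data)) / total
            var = sum(ri * (x - mean[k]) ** 2 for ri, x in zip(r, data)) / total
            sd[k] = max(math.sqrt(var), 1e-6)   # floor avoids a collapsed component
    # resp holds the memberships from the final E-step.
    return weight, mean, sd, resp

data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]   # made-up values
weight, mean, sd, resp = em_two_normals(data)
labels = [0 if r > 0.5 else 1 for r in resp]   # hard cluster assignment
print(weight, mean, sd, labels)
```

Unlike k-means, each point gets a membership probability for each cluster; the hard labels are obtained only at the end by picking the more probable component.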
Thursday, Nov. 18. Back to predictive modeling and scoring. Let's
make sure we understand the concept of scoring. It's the main point of
predictive modeling.
Here is
some training material written by
your tour guide.
Tuesday, Nov. 23. Understanding how neural networks and memory-based reasoning are used for scoring. See this Excel Spreadsheet first for an understanding of Neural Nets. See this document for an intro to memory-based reasoning.
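Scoring with memory-based reasoning can be sketched in a few lines of Python. The sketch assumes memory-based reasoning here means k-nearest-neighbor classification with Euclidean distance and majority vote; the training cases, the function name `knn_score`, and the choice k=3 are hypothetical.

```python
import math
from collections import Counter

def knn_score(training, new_case, k=3):
    """Score one new case by majority vote among its k nearest training cases."""
    by_distance = sorted(training, key=lambda row: math.dist(row[0], new_case))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training cases: (inputs, known target).
training = [((1.0, 1.0), "no"), ((1.5, 2.0), "no"), ((2.0, 1.5), "no"),
            ((6.0, 6.5), "yes"), ((7.0, 6.0), "yes"), ((6.5, 7.0), "yes")]

# Scoring: the new cases have inputs but no known target.
print(knn_score(training, (6.2, 6.1)))
print(knn_score(training, (1.2, 1.4)))
```

There is no fitted equation here: the "model" is the stored training data itself, and scoring a new case means looking up its most similar stored cases.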
Homework Assignments (arrgghhhhh!)
Homework 1
Homework 2 (Here is a
picture of a model I will show in
class for this one)
Homework 3: None this week. Start poking around for data/text mining
resources for your term projects. Big, diverse, ugly data is best.
If most of the term project is ETL (extract, transform, load), with little
actual data mining, that is fine. Also, we haven't got to text yet,
but text mining data are especially encouraged. I
require you to consult
with me.
Homework 4. It's time to make
money on the stock market! (Update, 9/25: Please be sure that
you specify a clear path to get to your "final" model so that I can replicate it
on the same data set. After you try all of your models, try to replicate
just the one that was "best" in a separate diagram. Write down the various
tuning options so that I can duplicate it, and also hand in the EM
diagram that produced just the final version. Another note: You
can't use any information in the "current" SP500 to predict it. This can
happen easily if you are not careful, e.g., if you choose to cluster the data
along the way, and you leave the SP500 in as one of the clustering variables,
then there will be information in the current SP500 contained in the cluster
labels. This will make for a good prediction model, but one that is
useless because you cannot predict the future by using what happens in the
future! You can only use what you know now to predict the future.)
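One way to guard against this kind of leakage is to build every input from earlier periods only. The Python sketch below pairs each value of a series with the values that preceded it. The function name, the number of lags, and the index levels are made up for illustration and do not reflect the layout of the homework data.

```python
def make_lagged_rows(series, n_lags=2):
    """Pair each value with the n_lags values that preceded it.

    Inputs hold only earlier values, so nothing from the target
    period is available to the model.
    """
    rows = []
    for t in range(n_lags, len(series)):
        inputs = series[t - n_lags:t]   # values at t-n_lags through t-1
        target = series[t]              # the value to be predicted
        rows.append((inputs, target))
    return rows

sp500 = [1100.0, 1105.5, 1098.2, 1110.7, 1115.3]   # made-up index levels
for inputs, target in make_lagged_rows(sp500):
    print(inputs, "->", target)
```

Any derived variable, such as a cluster label, would need the same treatment: compute it from the lagged inputs only, never from the target period.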
Homework 5, due Tuesday, October 12.
Homework 6, due Tuesday, October 19: Perform an association analysis on the data set "transactions" (see this guide for more details, and note that both files are password protected.) Write a short report, with tables and graphs, explaining what knowledge you have gained. Be sure to include mentions of support, lift and confidence in your report in a way that makes it clear you understand their meanings (but don't just give definitions - please make the report more "business"-like.) (And recall that KDD means "Knowledge Discovery in Databases"!)
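As a reminder of how the three measures relate, here is a small Python sketch that computes them for a single rule. The baskets are invented and have nothing to do with the "transactions" data set; the function name `rule_measures` is hypothetical, and the sketch assumes the left-hand side occurs in at least one basket.

```python
def rule_measures(transactions, lhs, rhs):
    """Support, confidence, and lift for the rule lhs -> rhs."""
    n = len(transactions)
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = sum(lhs <= t for t in transactions)            # baskets containing lhs
    n_rhs = sum(rhs <= t for t in transactions)            # baskets containing rhs
    n_both = sum((lhs | rhs) <= t for t in transactions)   # baskets containing both
    support = n_both / n               # how often the rule applies at all
    confidence = n_both / n_lhs        # P(rhs | lhs)
    lift = confidence / (n_rhs / n)    # P(rhs | lhs) relative to P(rhs)
    return support, confidence, lift

# Made-up baskets, each a set of items.
baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"},
           {"bread", "milk", "eggs"}]

support, confidence, lift = rule_measures(baskets, {"bread"}, {"milk"})
print(support, confidence, lift)
```

Here 60% of baskets contain both items and 75% of bread buyers also buy milk, but the lift is below 1 because milk is in 80% of all baskets: bread buyers are slightly less likely than the average customer to buy milk.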
Homework 7, due Thursday, Oct. 28.
Homework 8: Nothing to hand in. Read the text mining documentation “Text Mining Using SAS software” (password protected). Also read the Help materials inside of SAS, called "SAS Text Miner". Also work on your projects.
Final Exam: Tuesday, Dec 14, 1:30 - 4:00PM. It is cumulative. Review course notes, demos and readings, Homeworks, and the midterms. I may use the student presentations for ideas for exam questions.