Westfall's Data Mining Class!!!
Fall 04, BA255, 2-3:30PM; we meet in the computer lab BA363 frequently.

Syllabus

Student Presentation Summaries - Overviews and Strengths (Critiques and grades are mailed to students individually)
 

Final Exam and Solutions


Note:  Midterm 2 is 11/9 and covers material since Midterm 1: association analysis, linkage analysis, cluster analysis, and text mining.  The review materials are all on this web page, underneath "Midterm 1".  The test will be in room 255 at normal class time, not in the computer lab.

The final is cumulative.
 

Demo for Understanding Trees
SAS code node for robust assessment measures (for interval targets)
Class 9/23:  Let's look at a nominal prediction problem (predicting on-line purchase category):  use this data set and select "category" as target.  Partition the data, and run the tree.  From there we will look at assessment measures "Proportion Misclassified" and "Leaf Impurity". 
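
To make the two assessment measures concrete before class, here is a minimal Python sketch. It is not Enterprise Miner output. The category labels are made up for illustration, and it assumes "Leaf Impurity" means the Gini index of a single leaf (one common impurity measure; the software may report something different).

```python
from collections import Counter

def proportion_misclassified(actual, predicted):
    """Fraction of cases whose predicted class differs from the actual class."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

def gini_impurity(labels):
    """Gini impurity of one leaf: 1 minus the sum of squared class proportions.
    Assumption: this stands in for the 'Leaf Impurity' measure."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# Hypothetical purchase categories, invented for this example.
actual = ["books", "music", "books", "toys", "music", "books"]
predicted = ["books", "books", "books", "toys", "music", "music"]
leaf = ["books", "books", "books", "music"]  # cases that landed in one leaf

print(proportion_misclassified(actual, predicted))  # 2 of 6 cases are wrong
print(gini_impurity(leaf))                          # 1 - (0.75^2 + 0.25^2)
```

A pure leaf (all cases in one category) has impurity 0; the more evenly mixed the leaf, the higher the impurity.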
A clustering demo
A demo for Comparison of Regression, Trees, and Neural Nets
Neural Nets for Trading Models

Midterm 1 Review topics

Midterm October 5: 
Chapters 1-3 of book, p. 1-42 of Predictive Modeling Using Enterprise Miner, p. 1-22 to 1-65, and p.  2-91 to 2-98 of Web mining, homeworks, class notes, class enterprise miner exercises, web notes above, and demos shown above.

Midterm 1 Solutions.

"Basket Case" Demo
Tuesday Oct. 12:  Demos on Association and links
Thursday Oct. 14:  Demos - Links and clustering
Tuesday Oct. 19:  Demos - clustering
Thursday Oct. 21:  More on clustering.  Some misc. notes.
Tuesday Oct. 26:  Text mining intro
Thursday Oct. 28:  Text mining details
Tuesday Nov. 2:  More Text mining
Thursday Nov. 4: Yep, you guessed it.  More text mining.

Midterm 2 Solutions

Thursday Nov. 11.  Let's meet in the regular classroom again.  I need to use the blackboard.  I will go over the midterm solutions, then I want to talk in a little more detail about linear transformations and SVD from a matrix algebra perspective.  (Here is a SAS/IML file to compute SVD "manually".)  I will also discuss more of the training material, and talk about the clustering technique (Expectation Maximization) embedded within the Text Miner.  The training material will loosely cover 1-86 through 2-33 of the manual.
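
For those who want to see the matrix algebra in action outside of SAS, here is a rough Python sketch that finds the largest singular value and its singular vectors by power iteration on A'A. This is not the contents of the SAS/IML file mentioned above; the matrix and the function names are hypothetical, and the method only recovers the leading singular triple, not the full SVD.

```python
import math

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def leading_singular_triple(a, iterations=100):
    """Power iteration on A'A. Returns (sigma, u, v) with A v = sigma u.
    Caveat: fails if the start vector is orthogonal to the leading
    right singular vector, or if the matrix is all zeros."""
    at = transpose(a)
    v = [1.0] + [0.0] * (len(a[0]) - 1)   # arbitrary starting vector
    for _ in range(iterations):
        w = matvec(at, matvec(a, v))      # apply A'A
        n = norm(w)
        v = [x / n for x in w]            # renormalize to unit length
    av = matvec(a, v)
    sigma = norm(av)                      # largest singular value
    u = [x / sigma for x in av]           # matching left singular vector
    return sigma, u, v

a = [[4.0, 0.0], [3.0, -5.0]]             # hypothetical 2x2 matrix
sigma, u, v = leading_singular_triple(a)
print(sigma, u, v)
```

For this matrix A'A has eigenvalues 40 and 10, so the leading singular value should come out close to the square root of 40.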

Tuesday, Nov. 16.  We will discuss logical functions a little, talk about clustering using normal mixtures, and predictive modeling using text.   Here is a file we will use:  http://www.ba.ttu.edu/isqs/westfall/dmtm/inssubro.sas7bdat
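
As a warm-up for clustering with normal mixtures, here is a small Python sketch of the EM algorithm for a mixture of two univariate normals. It is only an illustration under simple assumptions (one variable, two clusters, crude starting values, made-up data) and is not how the Text Miner implements it.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def em_two_normals(data, iterations=50):
    """EM for a two-component univariate normal mixture.
    Returns (weights, means, sds)."""
    n = len(data)
    overall_mean = sum(data) / n
    overall_sd = math.sqrt(sum((x - overall_mean) ** 2 for x in data) / n)
    # Crude starting values: smallest and largest observations as the means.
    means = [min(data), max(data)]
    sds = [overall_sd, overall_sd]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: posterior probability that each point came from each component.
        resp = []
        for x in data:
            dens = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and sds from the soft assignments.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)  # floor keeps sd away from zero
    return weights, means, sds

# Made-up data with two obvious groups, near 1 and near 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.3, 4.7, 5.1, 4.9]
weights, means, sds = em_two_normals(data)
print(weights, means, sds)
```

Unlike k-means, each point gets a probability of belonging to each cluster rather than a hard assignment.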

Thursday, Nov. 18.  Back to predictive modeling and scoring.  Let's make sure we understand the concept of scoring.  It's the main point of predictive modeling.  Here is some training material written by your tour guide.

Tuesday, Nov. 23.   Understanding how neural networks and memory-based reasoning are used for scoring.  See this Excel Spreadsheet first for an understanding of Neural Nets.   See this document for an intro to memory-based reasoning.
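
Here is a minimal Python sketch of scoring with memory-based reasoning, assuming the usual k-nearest-neighbor form: a new case is scored by majority vote among the k most similar training cases. The data, labels, and function name are hypothetical, and the choices of Euclidean distance and k=3 are assumptions made for the example.

```python
import math
from collections import Counter

def knn_score(training, new_case, k=3):
    """Score new_case by majority vote among its k nearest training cases.
    training is a list of (features, label) pairs."""
    by_distance = sorted(training, key=lambda row: math.dist(row[0], new_case))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up training cases: two inputs and a known outcome for each.
training = [
    ((1.0, 1.0), "buy"), ((1.2, 0.9), "buy"), ((0.9, 1.3), "buy"),
    ((4.0, 4.2), "no"), ((4.1, 3.9), "no"), ((3.8, 4.0), "no"),
]

print(knn_score(training, (1.1, 1.0)))  # neighbors are all "buy" cases
print(knn_score(training, (4.0, 4.0)))  # neighbors are all "no" cases
```

Note that there is no fitted equation here: the "model" is the stored training data itself, which is why the technique is called memory-based.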

Homework Assignments (arrgghhhhh!)
Homework 1
Homework 2 (Here is a picture of a model I will show in class for this one)
Homework 3:  None this week.  Start poking around for data/text mining resources for your term projects.  Big, diverse, ugly data is best.  If most of the term project is ETL (extract, transform, load), with little actual data mining, that is fine.  Also, we haven't got to text yet, but text mining data are especially encouraged.  I require you to consult with me.
Homework 4.  It's time to make money on the stock market! (Update, 9/25:  Please be sure that you specify a clear path to get to your "final" model so that I can replicate it on the same data set.  After you try all of your models, try to replicate just the one that was "best" in a separate diagram.  Write down the various tuning options so that I can duplicate it, and also hand in the EM diagram that produced just the final version.  Another note:  You can't use any information in the "current" SP500 to predict it.  This can happen easily if you are not careful.  For example, if you choose to cluster the data along the way and you leave the SP500 in as one of the clustering variables, then the cluster labels will contain information from the current SP500.  This will make for a model that looks good but is useless, because you cannot predict the future by using what happens in the future!  You can only use what you know now to predict the future.)
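
The "no peeking at the future" rule can be sketched in Python as follows. This is only an illustration with made-up index values and a hypothetical helper name; it is not part of the Enterprise Miner workflow. Every input in a row comes from strictly earlier time periods than the target.

```python
def make_lagged_rows(series, n_lags):
    """Build (inputs, target) pairs where the inputs are strictly earlier values.
    The target at time t never appears among its own inputs."""
    rows = []
    for t in range(n_lags, len(series)):
        inputs = series[t - n_lags:t]   # values at times t-n_lags ... t-1 only
        target = series[t]              # the value we want to predict
        rows.append((inputs, target))
    return rows

# Made-up index values, for illustration only.
sp500 = [1100.0, 1105.5, 1098.2, 1110.0, 1112.3]
rows = make_lagged_rows(sp500, n_lags=2)
for inputs, target in rows:
    print(inputs, "->", target)
```

Any derived variable (a cluster label, a moving average, a transformed input) should pass the same test: it must be computable from information available before the period being predicted.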

Homework 5, due Tuesday, October 12.

Homework 6, due Tuesday, October 19:  Perform an association analysis on the data set "transactions" (see this guide for more details, and note that both files are password protected.)  Write a short report, with tables and graphs, explaining what knowledge you have gained.  Be sure to include mentions of support, lift and confidence in your report in a way that makes it clear you understand their meanings (but don't just give definitions - please make the report more "business"-like.)   (And recall that KDD means "Knowledge Discovery in Databases"!)
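
As a reminder of what the three measures mean, here is a short Python sketch that computes support, confidence, and lift for a single rule. The baskets are made up and are not the "transactions" data set; the function name is hypothetical.

```python
def rule_measures(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent.
    Assumes the antecedent and the consequent each appear in at least one basket."""
    n = len(transactions)
    a, c = set(antecedent), set(consequent)
    n_a = sum(1 for t in transactions if a <= t)         # baskets containing A
    n_c = sum(1 for t in transactions if c <= t)         # baskets containing C
    n_ac = sum(1 for t in transactions if (a | c) <= t)  # baskets containing both
    support = n_ac / n              # how often A and C occur together
    confidence = n_ac / n_a         # of the baskets with A, the share that also have C
    lift = confidence / (n_c / n)   # confidence relative to C's overall frequency
    return support, confidence, lift

# Made-up market baskets, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

support, confidence, lift = rule_measures(baskets, ["bread"], ["milk"])
print(support, confidence, lift)
```

Here the rule bread -> milk has high confidence, but its lift is below 1, so buying bread does not make milk any more likely than it already is. That is the kind of distinction a business-like report should draw out.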

Homework 7, due Thursday, Oct. 28.

Homework 8:  Nothing to hand in.  Read the text mining documentation “Text Mining Using SAS software” (password protected).   Also read the Help materials inside of SAS, called "SAS Text Miner".  Also work on your projects.

Final Exam:  Tuesday, Dec 14, 1:30 - 4:00PM.  It is cumulative.  Review course notes, demos and readings, Homeworks, and the midterms.   I may use the student presentations for ideas for exam questions.