Statistical Data Mining and Knowledge Discovery

Edited by Hamparsum Bozdogan

List of Chapters

1. The Role of Bayesian and Frequentist Multivariate Modeling in Statistical Data Mining
2. Intelligent Statistical Data Mining with Information Complexity and Genetic Algorithms
3. Econometric and Statistical Data Mining, Prediction and Policy-Making
4. Data Mining Strategies for the Detection of Chemical Warfare Agents
5. Disclosure Limitation Methods Based on Bounds for Large Contingency Tables With Applications to Disability
6. Partial Membership Models with Application to Disability Survey Data
7. Automated Scoring of Polygraph Data
8. Missing Value Algorithms in Decision Trees
9. Unsupervised Learning from Incomplete Data Using a Mixture Model Approach
10. Improving the Performance of Radial Basis Function (RBF) Classification Using Information Criteria
11. Use of Kernel Based Techniques for Sensor Validation in Nuclear Power Plants
12. Data Mining and Traditional Regression
13. An Extended Sliced Inverse Regression
14. Using Genetic Programming to Improve the Group Method of Data Handling in Time Series Prediction
15. Data Mining for Monitoring Plant Devices Using GMDH and Pattern Classification
16. Statistical Modeling and Data Mining to Identify Consumer Preferences
17. Testing for Structural Change Over Time of Brand Attribute Perceptions in Market Segments
18. Kernel PCA for Feature Extraction with Information Complexity
19. Global Principal Component Analysis for Dimensionality Reduction in Distributed Data Mining
20. A New Metric for Categorical Data
21. Ordinal Logistic Modeling Using ICOMP as a Goodness-of-Fit Criterion
22. Comparing Latent Class Factor Analysis with the Traditional Approach in Data Mining
23. On Cluster Effects in Mining Complex Econometric Data
24. Neural Network-Based Data Mining Techniques for Steel Making
25. Solving Data Clustering Problem as a String Search Problem
26. Behavior-Based Recommender Systems as Value-Added Services for Scientific Libraries
27. GTP (General Text Parser) Software for Text Mining
28. Implication Intensity: From the Basic Statistical Definition to the Entropic Version
29. Use of a Secondary Splitting Criterion in Classification Forest Construction
30. A Method Integrating Self-Organizing Maps to Predict the Probability of Barrier Removal
31. Cluster Analysis of Imputed Financial Data Using an Augmentation-Based Algorithm
32. Data Mining in Federal Agencies
33. STING: Evaluation of Scientific & Technological Innovation and Progress
34. The Semantic Conference Organizer


Contents

1 The Role of Bayesian and Frequentist Multivariate Modeling in Statistical Data Mining
S. James Press, University of California, Riverside, USA

1.1 Introduction
1.2 Is Data Mining Science?
1.3 Genesis of Data Mining
1.4 The Data Cube and Databases
1.4.1 Some Examples
1.5 Structured Query Language
1.6 Statistical Problems with Data Mining
1.7 Some DM Approaches to Dimension Reduction
1.7.1 Graph-Based Multivariate Models
1.7.2 Multivariate Bayes Classification and Prediction
1.7.3 Markov Independence Models
1.8 Prior Distributions in Data Mining
1.9 Some New DM Applications

2 Intelligent Statistical Data Mining with Information Complexity and Genetic Algorithms
Hamparsum Bozdogan, University of Tennessee, Knoxville, USA

2.1 Introduction
2.2 What is Information Complexity: ICOMP?
2.2.1 The Concept of Complexity and Complexity of a System
2.2.2 Information Theoretic Measure of Complexity of a Multivariate Distribution
2.2.3 Initial Definition of Covariance Complexity
2.2.4 Definition of Maximal Covariance Complexity
2.2.5 ICOMP as an Approximation to the Sum of Two Kullback-Leibler Distances
2.3 Information Criteria for Multiple Regression Models
2.3.1 ICOMP Based on Complexity Measures
2.3.2 ICOMP Under Misspecification
2.3.3 AIC and AIC-Type Criteria
2.4 AGA for the Regression Modeling
2.5 Numerical Examples
2.5.1 Subset Selection of Best Predictors in Multiple Regression: A Simulation Example
2.5.2 Subset Selection of Best Predictors in Multiple Regression: A Real Example
2.6 Conclusion and Discussion

3 Econometric and Statistical Data Mining, Prediction and Policy-Making
Arnold Zellner, University of Chicago, Chicago, USA

3.1 Introduction
3.2 Brief Comments on Scientific Method and Data Mining
3.3 The Structural Econometric Modeling, Time Series Analysis (SEMTSA) Approach
3.3.1 The SEMTSA Approach
3.3.2 Statistical Inference Procedures
3.4 Methods Employed in Data Analysis, Modeling and Forecasting
3.5 Disaggregation and the Marshallian Macroeconomic Model
3.6 A Complete Marshallian Macroeconomic Model

4 Data Mining Strategies for the Detection of Chemical Warfare Agents
Jeffrey L. Solka, Edward J. Wegman and David J. Marchette
Naval Surface Warfare Center (NSWCDD), Dahlgren, VA, and
George Mason University, Fairfax, VA, USA

4.1 Introduction
4.2 Results
4.3 Conclusions

5 Disclosure Limitation Methods Based on Bounds for Large Contingency Tables With Applications to Disability
Adrian Dobra, Elena A. Erosheva and Stephen E. Fienberg,
Duke University, Durham, University of Washington, Seattle, and Carnegie-Mellon University, Pittsburgh, USA

5.1 Introduction
5.2 Example: National Long Term Care Survey Data
5.3 Technical Background on Cell Entry Bounds
5.4 Decomposable Frontiers
5.4.1 Calculating Decomposable Frontiers
5.4.2 Analysis of the 2^16 NLTCS Example
5.5 'Greedy' Frontiers
5.6 Bounds
5.6.1 Bounds in the Decomposable Case
5.6.2 Bounds in the Non-decomposable Case
5.7 Discussion

6 Partial Membership Models with Application to Disability Survey Data
Elena A. Erosheva, University of Washington, USA

6.1 Motivation
6.2 Functional Disability Data
6.2.1 Marginal Frequencies and Simple Statistics
6.2.2 Frequent Responses
6.2.3 Total Number of Disabilities
6.3 Full Versus Partial Membership
6.3.1 Full Membership: Latent Class Model
6.3.2 Partial Membership: GoM Model
6.4 Bayesian Estimation of the GoM Model
6.5 Analysis and Comparison
6.5.1 Latent Class Analysis
6.5.2 Grade of Membership Analysis and Comparison
6.6 Concluding Remarks

7 Automated Scoring of Polygraph Data
Aleksandra B. Slavkovic, Department of Statistics, Carnegie-Mellon University, Pittsburgh, USA

7.1 Introduction
7.2 Background
7.2.1 The Polygraph Examination
7.2.2 Instrumentation and Measurements
7.2.3 Chart Evaluations
7.3 Statistical Models for Classification and Prediction
7.4 The Data
7.5 Statistical Analysis
7.5.1 Signal Processing
7.5.2 A Simplified Approach to Feature Extraction
7.5.3 Feature Evaluation, Modeling and Classification
7.5.4 Logistic Regression
7.5.5 Classification Results
7.6 Discussion
7.7 Conclusion

8 Missing Value Algorithms in Decision Trees
Hyunjoong Kim and Sumer Yates, University of Tennessee, Knoxville, USA

8.1 Introduction
8.2 The Seven Algorithms
8.2.1 Probability Split
8.2.2 Alternative Split
8.2.3 Proxy Split
8.2.4 Surrogate Split
8.2.5 Root Node Imputation
8.2.6 Nodewise Imputation
8.2.7 Separate Node
8.3 The Simulation Study
8.3.1 Experiment-A
8.3.2 Experiment-B
8.4 Results
8.4.1 Experiment-A
8.4.2 Experiment-B
8.5 Conclusions

9 Unsupervised Learning from Incomplete Data Using a Mixture Model Approach
Lynette Hunt and Murray Jorgensen, University of Waikato, Hamilton, New Zealand

9.1 Introduction
9.2 Clustering by Mixture Models
9.2.1 Latent Class Models and Local Independence
9.2.2 Generalizing Local Independence: the Multimix Model
9.2.3 Missing Data
9.3 Applications
9.3.1 Example 1: The Iris Data
9.3.2 Example 2: The Cancer Data
9.4 Discussion

10 Improving the Performance of Radial Basis Function (RBF) Classification Using Information Criteria
Zhenqiu Liu and Hamparsum Bozdogan, University of Tennessee, Knoxville, USA

10.1 Introduction
10.1.1 Binary Classifiers
10.1.2 Logistic Regression and Mixture Model
10.1.3 Multi-Class Classifier
10.2 Regression Trees
10.2.1 Information Criteria
10.2.2 Parameter Estimation
10.3 New Kernel Functions
10.4 The EM Algorithm
10.4.1 Constrained Maximization of a Convex Function
10.4.2 The EM Algorithm
10.4.3 EM-Algorithm for PE Mixture Models
10.5 Hybrid Training
10.6 Computational Results
10.7 Conclusions

11 Use of Kernel Based Techniques for Sensor Validation in Nuclear Power Plants
Andrei V. Gribok, Aleksey M. Urmanov, J. Wesley Hines and Robert E. Uhrig, University of Tennessee, Knoxville, USA

11.1 Introduction
11.2 Collinear, Ill-Posed Problems, Regularization
11.3 Kernel Regression and MSET
11.4 Support Vector Machines
11.5 Data Description and Results
11.6 Conclusions

12 Data Mining and Traditional Regression
Christopher M. Hill, Linda C. Malone and Linda Trocine, University of Central Florida, Orlando, FL, USA

12.1 Introduction
12.2 Military Manpower Application
12.3 Data Mining and Traditional Regression
12.4 General Problems
12.5 Attempted Solutions
12.5.1 Algorithmic Improvements
12.5.2 Data Complexity Reduction
12.6 Regression Specific Issues
12.6.1 Linear Regression
12.6.2 Logistic Regression
12.7 Conclusion

13 An Extended Sliced Inverse Regression
Masahiro Mizuta, Hokkaido University, Sapporo, Japan

13.1 Introduction
13.2 Algorithms for SIR Model
13.3 Relative Projection Pursuit
13.4 SIRrpp
13.5 Concluding Remarks

14 Using Genetic Programming to Improve the Group Method of Data Handling in Time Series Prediction
M. Hiassat, M.F. Abbod, and N. Mort, University of Sheffield, Sheffield, UK

14.1 Introduction
14.2 The Data
14.3 Financial Data
14.4 Weather Data
14.5 Processing of Data
14.6 The Group Method of Data Handling (GMDH)
14.7 Genetic Programming (GP)
14.8 GP-GMDH
14.9 Results and Discussion
14.10 Conclusion and Further Work

15 Data Mining for Monitoring Plant Devices Using GMDH and Pattern Classification
B.R. Upadhyaya and B. Lu, University of Tennessee, Knoxville

15.1 Introduction
15.2 Description of the Method
15.2.1 Group Method of Data Handling (GMDH)
15.2.2 Analysis of Residuals for Fault Diagnosis
15.3 Analysis and Results
15.4 Concluding Remarks

16 Statistical Modeling and Data Mining to Identify Consumer Preferences
Francois Boussu and Jean Jacques Denimal
Ecole Nationale Superieure des Arts et Industries Textiles, Roubaix, and
University of Sciences and Technologies of Lille, France

16.1 Introduction
16.1.1 The Garment Market
16.1.2 Interest of the Proposed Method
16.2 Data Mining Method
16.2.1 From Data to Analyses
16.2.2 Proposed Methodology
16.2.3 Data Visualization Tools
16.3 Application to a Textile Data Set
16.3.1 Classical and Circular Representations of the Hierarchy
16.3.2 Graphic Plane Representations Associated with Nodes
16.3.3 Shaded Tables Visualizing Associations Between Items and Periods
16.4 Conclusion

17 Testing for Structural Change Over Time of Brand Attribute Perceptions in Market Segments
Sara Dolnicar and Friedrich Leisch, University of Wollongong and Vienna University of Technology, Austria

17.1 Introduction
17.2 The Managerial Problem
17.3 Results from Traditional Analysis
17.3.1 Brand Image Analysis
17.3.2 Competition Analysis
17.4 The PBMS and DynPBMS Approaches
17.4.1 General Principles
17.4.2 Complexity Reduction
17.4.3 Tests on Heterogeneity
17.4.4 Competition Analysis
17.5 Summary

18 Kernel PCA for Feature Extraction with Information Complexity
Zhenqiu Liu and Hamparsum Bozdogan, University of Tennessee, Knoxville, USA

18.1 Introduction
18.2 Kernel Functions
18.3 Kernel PCA
18.4 EM for Kernel PCA and On-line PCA
18.5 Choosing the Number of Components with Information Complexity
18.6 Computational Results
18.7 Conclusions

19 Global Principal Component Analysis for Dimensionality Reduction in Distributed Data Mining
Hairong Qi, Tse-Wei Wang, and J. Douglas Birdwell, University of Tennessee, Knoxville, USA

19.1 Introduction
19.2 Principal Component Analysis
19.3 Global PCA for Distributed Homogeneous Databases
19.4 Global PCA for Distributed Heterogeneous Databases
19.5 Experiments and Results
19.5.1 Global PCA for Distributed Homogeneous Databases
19.5.2 Global PCA for Distributed Heterogeneous Databases
19.6 Conclusion

20 A New Metric for Categorical Data
S. H. Al-Harbi, G. P. McKeown and V. J. Rayward-Smith, University of East Anglia, Norwich, UK

20.1 Introduction
20.2 Dissimilarity Measure
20.2.1 Mahalanobis Distance
20.3 DCV Metric
20.3.1 Weights of Fields
20.4 Synthetic Examples
20.5 Exploiting the DCV Metric
20.5.1 k-Means Algorithm
20.5.2 Case-Based Reasoning
20.5.3 Supervised Clustering
20.6 Conclusions and Future Work

21 Ordinal Logistic Modeling Using ICOMP as a Goodness-of-Fit Criterion
J. Michael Lanning and Hamparsum Bozdogan, University of Tennessee, Knoxville, USA

21.1 Introduction
21.2 Model Selection Criteria
21.2.1 Computational Software
21.3 Ordinal Logistic Regression
21.3.1 Cumulative Link Models
21.3.2 Cumulative Logit (Proportional Odds) Model
21.3.3 Proportional Hazard Model
21.3.4 Continuation-Ratio Model
21.3.5 Category Boundary Cutoffs (Correlated Data)
21.3.6 Model Selection
21.4 Example Problem: Diabetes Severity
21.4.1 Proportional Odds Model
21.4.2 Residual (Outlier) Analysis
21.4.3 Nested Dichotomous Models
21.4.4 Nested Cumulative Link Dichotomous Models
21.4.5 Nested Continuation-Ratio Link Dichotomous Models
21.5 Conclusions

22 Comparing Latent Class Factor Analysis with the Traditional Approach in Data Mining
Jay Magidson and Jeroen Vermunt, Statistical Innovations Inc., USA and Tilburg University, The Netherlands

22.1 Introduction
22.2 The Basic LC Factor Model
22.3 Examples
22.3.1 Rater Agreement
22.3.2 MBTI Personality Items
22.4 Conclusion

23 On Cluster Effects in Mining Complex Econometric Data
M. Ishaq Bhatti, Sultan Qaboos University, Muscat, Oman

23.1 Introduction
23.2 The Model
23.3 An Algorithm for Full Maximum Likelihood Estimation
23.4 Application of the Model
23.5 Fixed Coefficient Regression Models
23.6 Concluding Remarks

24 Neural Network-Based Data Mining Techniques for Steel Making
Ravindra K. Sarma, Amar Gupta, and Sanjeev Vadhavkar, Massachusetts Institute of Technology, Cambridge, USA

24.1 Introduction
24.2 Productivity from Information Technology (PROFIT) Initiative
24.3 Description of Predictive Model
24.4 NNRUN–ANN Training Suite
24.4.1 Data Manipulation
24.4.2 Automated Search For Best Network
24.5 Results and Analysis
24.6 Conclusions

25 Solving Data Clustering Problem as a String Search Problem
V. Olman, D. Xu, and Y. Xu, Oak Ridge National Laboratory, Oak Ridge, TN, USA

25.1 Introduction
25.2 Mathematical Framework
25.2.1 Definition of Cluster
25.2.2 MST-Representation of High-Dimensional Data
25.2.3 Relationship between MSTs and Clusters
25.3 Stability of MST Structure Under Noise
25.4 Statistical Assessment of Identified Clusters
25.5 Applications
25.5.1 Partitioning of Simulated Data
25.5.2 Regulatory Binding Site Identification
25.5.3 Cluster Identification in Gene Expression Profiles
25.6 Discussion

26 Behavior-Based Recommender Systems as Value-Added Services for Scientific Libraries
Andreas Geyer-Schulz, Michael Hahsler, Andreas Neumann and Anke Thede
Universität Karlsruhe (TH), Germany and WU-Wien, Austria

26.1 Introduction
26.2 Recommender Services for Legacy Library Systems
26.3 Ehrenberg’s Repeat-Buying Theory for Libraries
26.4 A Recommender System for the Library of the Universität Karlsruhe (TH)
26.5 Conclusion

27 GTP (General Text Parser) Software for Text Mining
Justin T. Giles, Ling Wo and Michael W. Berry, University of Tennessee, Knoxville, USA

27.1 Introduction
27.2 Model Facilitated by GTP
27.3 GTP Usage and Files Generated
27.4 Overview of GTP Options
27.5 Query Processing with GTP QUERY
27.6 Example
27.7 Versions of GTP and GTP QUERY
27.8 Code Evolution
27.9 Future Work

28 Implication Intensity: From the Basic Statistical Definition to the Entropic Version
Julien Blanchard, Pascale Kuntz, Fabrice Guillet and Regis Gras
Ecole Polytechnique de l’Universite de Nantes, France

28.1 Introduction
28.2 First Definitions
28.3 Entropic Version
28.4 Experimental Results
28.4.1 Experiments with Synthetic Data
28.4.2 Experiments with Real Data
28.5 Conclusion

29 Use of a Secondary Splitting Criterion in Classification Forest Construction
Chang-Yung Yu and Heping Zhang, Yale University, New Haven, USA

29.1 Introduction
29.2 A Secondary Node-Splitting Criterion
29.3 The Formation of a Deterministic Forest
29.4 Comparison Data
29.4.1 Comparison through Cross-Validation
29.4.2 Understanding the Deterministic Forest
29.5 Discussion

30 A Method Integrating Self-Organizing Maps to Predict the Probability of Barrier Removal
Zhicheng Zhang and Frederic Vanderhaegen, University of Valenciennes, Le Mont Houy, France

30.1 Introduction
30.2 A Method Integrating Self-Organizing Maps Algorithm
30.2.1 Problematic in BR Data Analysis
30.2.2 A Hierarchical SOM (HSOM) Algorithm-Based Method
30.3 Experimental Results
30.4 Discussions
30.5 Conclusions

31 Cluster Analysis of Imputed Financial Data Using an Augmentation-Based Algorithm
H. Bensmail and R. P. DeGennaro, University of Tennessee, Knoxville, TN, USA

31.1 Introduction
31.2 Data and Preliminary Tests
31.3 Clustering and Bayesian Data Augmentation
31.3.1 Imputation
31.3.2 Posterior Estimation
31.3.3 Algorithm
31.4 Bayesian Model Selection for Choosing the Number of Clusters
31.5 Analysis of Financial Data
31.6 Discussion

32 Data Mining in Federal Agencies
David L. Banks and Robert T. Olszewski, U.S. Food and Drug Administration, Rockville, MD, and
University of Pittsburgh, Pittsburgh, PA, USA

32.1 Data Quality
32.1.1 First Steps
32.1.2 Using Poor Quality Data
32.1.3 Estimating the Probability of Bad Data
32.2 Indexing Data
32.2.1 A First-Pass Method
32.2.2 Estimating Distance Functions
32.2.3 Computation
32.3 Screening for Structure with Locally Low Dimension
32.3.1 Screening Strategy
32.3.2 The Designed Experiment
32.3.3 Regression
32.4 Estimating Exposure
32.4.1 Categorical Data

33 STING: Evaluation of Scientific & Technological Innovation and Progress
S. Sirmakessis, K. Markellos, P. Markellou, G. Mayritsakis, K. Perdikouri, A. Tsakalidis, and Georgia Panagopoulou
Computer Technology Institute and National Statistical Services of Greece, IT Division, Greece

33.1 Introduction
33.2 Methodology for the Analysis of Patents
33.2.1 Textual Analysis Techniques
33.2.2 Correspondence and Cluster Analysis Techniques
33.2.3 Pre-Processing Steps
33.3 System Description
33.3.1 Database Manager Module
33.3.2 Statistical Analysis Module
33.3.3 Results Presentation Module
33.3.4 User Interface
33.4 Technology Indicators
33.4.1 Indicators Based on the Technological Sector
33.4.2 Indicators in Level of Continents/Countries/Designated States
33.4.3 Indicators for Inventors/Assignees
33.4.4 Indicators Over Time
33.5 Conclusion

34 The Semantic Conference Organizer
Kevin Heinrich, Michael W. Berry, Jack J. Dongarra, and Sathish Vadhiyar
University of Tennessee, Knoxville, USA

34.1 Background
34.2 Latent Semantic Indexing
34.3 Software Issues
34.4 Creating a Conference
34.4.1 A Simple Example
34.4.2 Benchmarks
34.5 Future Extensions

