Title:
Data Mining and Statistics for Decision Making.
Author:
Tufféry, Stéphane.
ISBN:
9780470979167
Edition:
1st ed.
Physical Description:
1 online resource (717 pages)
Series:
Wiley Series in Computational Statistics
Contents:
Preface -- Foreword -- Foreword from the French language edition -- List of trademarks
1 Overview of data mining -- 1.1 What is data mining? -- 1.2 What is data mining used for? -- 1.2.1 Data mining in different sectors -- 1.2.2 Data mining in different applications -- 1.3 Data mining and statistics -- 1.4 Data mining and information technology -- 1.5 Data mining and protection of personal data -- 1.6 Implementation of data mining
2 The development of a data mining study -- 2.1 Defining the aims -- 2.2 Listing the existing data -- 2.3 Collecting the data -- 2.4 Exploring and preparing the data -- 2.5 Population segmentation -- 2.6 Drawing up and validating predictive models -- 2.7 Synthesizing predictive models of different segments -- 2.8 Iteration of the preceding steps -- 2.9 Deploying the models -- 2.10 Training the model users -- 2.11 Monitoring the models -- 2.12 Enriching the models -- 2.13 Remarks -- 2.14 Life cycle of a model -- 2.15 Costs of a pilot project
3 Data exploration and preparation -- 3.1 The different types of data -- 3.2 Examining the distribution of variables -- 3.3 Detection of rare or missing values -- 3.4 Detection of aberrant values -- 3.5 Detection of extreme values -- 3.6 Tests of normality -- 3.7 Homoscedasticity and heteroscedasticity -- 3.8 Detection of the most discriminating variables -- 3.8.1 Qualitative, discrete or binned independent variables -- 3.8.2 Continuous independent variables -- 3.8.3 Details of single-factor non-parametric tests -- 3.8.4 ODS and automated selection of discriminating variables -- 3.9 Transformation of variables -- 3.10 Choosing ranges of values of binned variables -- 3.11 Creating new variables -- 3.12 Detecting interactions -- 3.13 Automatic variable selection -- 3.14 Detection of collinearity -- 3.15 Sampling -- 3.15.1 Using sampling -- 3.15.2 Random sampling methods
4 Using commercial data -- 4.1 Data used in commercial applications -- 4.1.1 Data on transactions and RFM data -- 4.1.2 Data on products and contracts -- 4.1.3 Lifetimes -- 4.1.4 Data on channels -- 4.1.5 Relational, attitudinal and psychographic data -- 4.1.6 Sociodemographic data -- 4.1.7 When data are unavailable -- 4.1.8 Technical data -- 4.2 Special data -- 4.2.1 Geodemographic data -- 4.2.2 Profitability -- 4.3 Data used by business sector -- 4.3.1 Data used in banking -- 4.3.2 Data used in insurance -- 4.3.3 Data used in telephony -- 4.3.4 Data used in mail order
5 Statistical and data mining software -- 5.1 Types of data mining and statistical software -- 5.2 Essential characteristics of the software -- 5.2.1 Points of comparison -- 5.2.2 Methods implemented -- 5.2.3 Data preparation functions -- 5.2.4 Other functions -- 5.2.5 Technical characteristics -- 5.3 The main software packages -- 5.3.1 Overview -- 5.3.2 IBM SPSS -- 5.3.3 SAS -- 5.3.4 R -- 5.3.5 Some elements of the R language -- 5.4 Comparison of R, SAS and IBM SPSS -- 5.5 How to reduce processing time
6 An outline of data mining methods -- 6.1 Classification of the methods -- 6.2 Comparison of the methods
7 Factor analysis -- 7.1 Principal component analysis -- 7.1.1 Introduction -- 7.1.2 Representation of variables -- 7.1.3 Representation of individuals -- 7.1.4 Use of PCA -- 7.1.5 Choosing the number of factor axes -- 7.1.6 Summary -- 7.2 Variants of principal component analysis -- 7.2.1 PCA with rotation -- 7.2.2 PCA of ranks -- 7.2.3 PCA on qualitative variables -- 7.3 Correspondence analysis -- 7.3.1 Introduction -- 7.3.2 Implementing CA with IBM SPSS Statistics -- 7.4 Multiple correspondence analysis -- 7.4.1 Introduction -- 7.4.2 Review of CA and MCA -- 7.4.3 Implementing MCA and CA with SAS
8 Neural networks -- 8.1 General information on neural networks -- 8.2 Structure of a neural network -- 8.3 Choosing the learning sample -- 8.4 Some empirical rules for network design -- 8.5 Data normalization -- 8.5.1 Continuous variables -- 8.5.2 Discrete variables -- 8.5.3 Qualitative variables -- 8.6 Learning algorithms -- 8.7 The main neural networks -- 8.7.1 The multilayer perceptron -- 8.7.2 The radial basis function network -- 8.7.3 The Kohonen network
9 Cluster analysis -- 9.1 Definition of clustering -- 9.2 Applications of clustering -- 9.3 Complexity of clustering -- 9.4 Clustering structures -- 9.4.1 Structure of the data to be clustered -- 9.4.2 Structure of the resulting clusters -- 9.5 Some methodological considerations -- 9.5.1 The optimum number of clusters -- 9.5.2 The use of certain types of variables -- 9.5.3 The use of illustrative variables -- 9.5.4 Evaluating the quality of clustering -- 9.5.5 Interpreting the resulting clusters -- 9.5.6 The criteria for correct clustering -- 9.6 Comparison of factor analysis and clustering -- 9.7 Within-cluster and between-cluster sum of squares -- 9.8 Measurements of clustering quality -- 9.8.1 All types of clustering -- 9.8.2 Agglomerative hierarchical clustering -- 9.9 Partitioning methods -- 9.9.1 The moving centres method -- 9.9.2 k-means and dynamic clouds -- 9.9.3 Processing qualitative data -- 9.9.4 k-medoids and their variants -- 9.9.5 Advantages of the partitioning methods -- 9.9.6 Disadvantages of the partitioning methods -- 9.9.7 Sensitivity to the choice of initial centres -- 9.10 Agglomerative hierarchical clustering -- 9.10.1 Introduction -- 9.10.2 The main distances used -- 9.10.3 Density estimation methods -- 9.10.4 Advantages of agglomerative hierarchical clustering -- 9.10.5 Disadvantages of agglomerative hierarchical clustering -- 9.11 Hybrid clustering methods -- 9.11.1 Introduction -- 9.11.2 Illustration using SAS Software -- 9.12 Neural clustering -- 9.12.1 Advantages -- 9.12.2 Disadvantages -- 9.13 Clustering by similarity aggregation -- 9.13.1 Principle of relational analysis -- 9.13.2 Implementing clustering by similarity aggregation -- 9.13.3 Example of use of the R amap package -- 9.13.4 Advantages of clustering by similarity aggregation -- 9.13.5 Disadvantages of clustering by similarity aggregation -- 9.14 Clustering of numeric variables -- 9.15 Overview of clustering methods
10 Association analysis -- 10.1 Principles -- 10.2 Using taxonomy -- 10.3 Using supplementary variables -- 10.4 Applications -- 10.5 Example of use
11 Classification and prediction methods -- 11.1 Introduction -- 11.2 Inductive and transductive methods -- 11.3 Overview of classification and prediction methods -- 11.3.1 The qualities expected from a classification or prediction method -- 11.3.2 Generalizability -- 11.3.3 Vapnik's learning theory -- 11.3.4 Overfitting -- 11.4 Classification by decision tree -- 11.4.1 Principle of the decision trees -- 11.4.2 Definitions - the first step in creating the tree -- 11.4.3 Splitting criterion -- 11.4.4 Distribution among nodes - the second step in creating the tree -- 11.4.5 Pruning - the third step in creating the tree -- 11.4.6 A pitfall to avoid -- 11.4.7 The CART, C5.0 and CHAID trees -- 11.4.8 Advantages of decision trees -- 11.4.9 Disadvantages of decision trees -- 11.5 Prediction by decision tree -- 11.6 Classification by discriminant analysis -- 11.6.1 The problem -- 11.6.2 Geometric descriptive discriminant analysis (discriminant factor analysis) -- 11.6.3 Geometric predictive discriminant analysis -- 11.6.4 Probabilistic discriminant analysis -- 11.6.5 Measurements of the quality of the model -- 11.6.6 Syntax of discriminant analysis in SAS -- 11.6.7 Discriminant analysis on qualitative variables (DISQUAL method) -- 11.6.8 Advantages of discriminant analysis -- 11.6.9 Disadvantages of discriminant analysis -- 11.7 Prediction by linear regression -- 11.7.1 Simple linear regression -- 11.7.2 Multiple linear regression and regularized regression -- 11.7.3 Tests in linear regression -- 11.7.4 Tests on residuals -- 11.7.5 The influence of observations -- 11.7.6 Example of linear regression -- 11.7.7 Further details of the SAS linear regression syntax -- 11.7.8 Problems of collinearity in linear regression: an example using R -- 11.7.9 Problems of collinearity in linear regression: diagnosis and solutions -- 11.7.10 PLS regression -- 11.7.11 Handling regularized regression with SAS and R -- 11.7.12 Robust regression -- 11.7.13 The general linear model -- 11.8 Classification by logistic regression -- 11.8.1 Principles of binary logistic regression -- 11.8.2 Logit, probit and log-log logistic regressions -- 11.8.3 Odds ratios -- 11.8.4 Illustration of division into categories -- 11.8.5 Estimating the parameters -- 11.8.6 Deviance and quality measurement in a model -- 11.8.7 Complete separation in logistic regression -- 11.8.8 Statistical tests in logistic regression -- 11.8.9 Effect of division into categories and choice of the reference category -- 11.8.10 Effect of collinearity -- 11.8.11 The effect of sampling on logit regression -- 11.8.12 The syntax of logistic regression in SAS Software -- 11.8.13 An example of modelling by logistic regression -- 11.8.14 Logistic regression with R -- 11.8.15 Advantages of logistic regression -- 11.8.16 Advantages of the logit model compared with probit -- 11.8.17 Disadvantages of logistic regression -- 11.9 Developments in logistic regression -- 11.9.1 Logistic regression on individuals with different weights -- 11.9.2 Logistic regression with correlated data.
Abstract:
"Business intelligence analysts and statisticians, compliance and financial experts in both commercial and government organizations across all industry sectors will benefit from this book." (Zentralblatt MATH, 2011).
Local Note:
Electronic reproduction. Ann Arbor, Michigan : ProQuest Ebook Central, 2017. Available via World Wide Web. Access may be limited to ProQuest Ebook Central affiliated libraries.