Title:

Biological Knowledge Discovery Handbook : Preprocessing, Mining and Postprocessing of Biological Data.

Author:

Elloumi, Mourad.

ISBN:

9781118617113

Personal Author:

Elloumi, Mourad.

Edition:

1st ed.

Physical Description:

1 online resource (1192 pages)

Series:

Wiley Series in Bioinformatics Ser. ; v.23

Contents:

BIOLOGICAL KNOWLEDGE DISCOVERY HANDBOOK: Preprocessing, Mining, and Postprocessing of Biological Data -- CONTENTS -- PREFACE -- CONTRIBUTORS -- SECTION I: BIOLOGICAL DATA PREPROCESSING -- PART A: BIOLOGICAL DATA MANAGEMENT -- 1 GENOME AND TRANSCRIPTOME SEQUENCE DATABASES FOR DISCOVERY, STORAGE, AND REPRESENTATION OF ALTERNATIVE SPLICING EVENTS -- 1.1 INTRODUCTION -- 1.2 SPLICING -- 1.2.1 Mechanism of Splicing -- 1.2.2 Regulation of Splicing -- 1.3 ALTERNATIVE SPLICING -- 1.3.1 Introduction to Alternative Splicing -- 1.3.2 Mechanism of Alternative Splicing -- 1.3.3 Regulation of Alternative Splicing -- 1.3.4 Evolution and Conservation of Splicing and Alternative Splicing -- 1.4 ALTERNATIVE SPLICING DATABASES -- 1.4.1 Genomic and Transcriptomic Sequence Analyses -- 1.4.2 Literature Overview of Various Alternative Splicing Databases -- 1.4.3 SDBs -- 1.5 DATA MINING FROM ALTERNATIVE SPLICING DATABASES -- 1.5.1 Implementation of dbASQ and Utility of SDBs -- 1.5.2 Identification of Transcript-Initial and Transcript-Terminal Variation -- ACKNOWLEDGMENTS -- WEB RESOURCES -- REFERENCES -- 2 CLEANING, INTEGRATING, AND WAREHOUSING GENOMIC DATA FROM BIOMEDICAL RESOURCES -- 2.1 INTRODUCTION -- 2.2 RELATED WORK -- 2.3 TYPOLOGY OF DATA QUALITY PROBLEMS IN BIOMEDICAL RESOURCES -- 2.4 CLEANING, INTEGRATING, AND WAREHOUSING BIOMEDICAL DATA -- 2.4.1 Lessons Learned from Integrating and Warehousing Biomedical Data on Liver Genes and Diseases -- 2.4.2 Data Quality-Aware Solutions -- 2.4.3 Biological Entity Resolution and Record Linkage -- 2.4.4 Ontology-Based Approaches -- 2.5 CONCLUSIONS AND PERSPECTIVES -- WEB RESOURCES -- REFERENCES -- 3 CLEANSING OF MASS SPECTROMETRY DATA FOR PROTEIN IDENTIFICATION AND QUANTIFICATION -- 3.1 INTRODUCTION -- 3.2 PREPROCESSING APPROACH FOR IMPROVING PROTEIN IDENTIFICATION -- 3.2.1 Existing Approaches.

3.2.2 New Dynamic Wavelet-Based Spectra Preprocessing Method -- 3.3 IDENTIFICATION FILTERING APPROACH FOR IMPROVING PROTEIN IDENTIFICATION -- 3.3.1 Existing Approaches -- 3.3.2 New Target-Decoy Approach for Improving Protein Identification -- 3.4 EVALUATION RESULTS -- 3.4.1 Evaluation of New Preprocessing Method -- 3.4.2 Evaluation of New Identification Filtering Method -- 3.5 CONCLUSION -- REFERENCES -- 4 FILTERING PROTEIN-PROTEIN INTERACTIONS BY INTEGRATION OF ONTOLOGY DATA -- 4.1 INTRODUCTION -- 4.2 EVALUATION OF SEMANTIC SIMILARITY -- 4.2.1 Gene Ontology -- 4.2.2 Survey of Semantic Similarity Measures -- 4.2.3 Correlation with Functional Categorizations -- 4.3 IDENTIFICATION OF FALSE PROTEIN-PROTEIN INTERACTION DATA -- 4.3.1 Classification Method -- 4.3.2 Accuracy of PPI Classification -- 4.3.3 Reliability of PPI Data -- 4.4 CONCLUSION -- REFERENCES -- PART B: BIOLOGICAL DATA MODELING -- 5 COMPLEXITY AND SYMMETRIES IN DNA SEQUENCES -- 5.1 INTRODUCTION -- 5.2 ARCHAEA -- 5.3 PATTERNS ON INDICATOR MATRIX -- 5.3.1 Indicator Matrix -- 5.3.2 Test Sequences -- 5.4 MEASURE OF COMPLEXITY AND INFORMATION -- 5.4.1 Complexity -- 5.4.2 Fractal Dimension -- 5.4.3 Entropy -- 5.5 COMPLEX ROOT REPRESENTATION OF DNA WORDS -- 5.5.1 Pseudorandom Sequence on Unit Circle -- 5.6 DNA WALKS -- 5.6.1 Walks on Pseudorandom and Deterministic Complex Sequences -- 5.6.2 Variance -- 5.7 WAVELET ANALYSIS -- 5.7.1 Haar Wavelet Basis -- 5.7.2 Discrete Haar Wavelet Transform -- 5.7.3 Haar Wavelet Coefficients and Statistical Parameters -- 5.7.4 Hurst Exponent -- 5.8 ALGORITHM OF SHORT HAAR DISCRETE WAVELET TRANSFORM -- 5.8.1 Clusters of Wavelet Coefficients -- 5.8.2 Cluster Analysis of Wavelet Coefficients of Complex DNA Representation -- 5.9 CONCLUSIONS -- REFERENCES -- 6 ONTOLOGY-DRIVEN FORMAL CONCEPTUAL DATA MODELING FOR BIOLOGICAL DATA ANALYSIS -- 6.1 INTRODUCTION.

6.2 DESCRIPTION LOGICS FOR CONCEPTUAL DATA MODELING -- 6.2.1 Generic Common Conceptual Data Model CMcom -- 6.2.2 EER, UML, and ORM in Terms of CMcom -- 6.3 EXTENSIONS -- 6.3.1 Ontology-Driven Modeling -- 6.3.2 More Expressive Languages -- 6.4 AUTOMATED REASONING AND BIOLOGICAL KNOWLEDGE DISCOVERY -- 6.4.1 Exploiting Automated Reasoning Services -- 6.4.2 Finding New Relationships and Classes by Using Instances -- 6.5 CONCLUSIONS AND OUTLOOK -- REFERENCES -- 7 BIOLOGICAL DATA INTEGRATION USING NETWORK MODELS -- 7.1 INTRODUCTION -- 7.1.1 Data Sources -- 7.1.2 Issues with Data Integration from Multiple Sources -- 7.2 BIOLOGICAL NETWORK MODELS -- 7.2.1 Sequence-Based Approach to Predict Interactions and Functional Links between Proteins -- 7.2.2 Graph-Theoretic and Probabilistic-Based Protein Interaction Network Models -- 7.2.3 Models in Genetic Interaction Networks -- 7.3 NETWORK MODELS IN UNDERSTANDING DISEASE -- 7.3.1 Interactome Network for Disease Prediction -- 7.3.2 Network Perturbation Due to Pathogens -- 7.3.3 Network View of Cancer -- 7.4 FUTURE CHALLENGES -- ACKNOWLEDGMENT -- REFERENCES -- 8 NETWORK MODELING OF STATISTICAL EPISTASIS -- 8.1 INTRODUCTION -- 8.2 EPISTASIS AND DETECTION -- 8.3 NETWORK -- 8.3.1 Fundamental Definitions -- 8.3.2 Notions on Connectivity -- 8.3.3 Vertex Centrality -- 8.3.4 Degree Distributions -- 8.3.5 Networks for Epistasis Studies -- 8.4 GENE-ASSOCIATION INTERACTION NETWORK -- 8.5 STATISTICAL EPISTASIS NETWORKS -- 8.5.1 Network Construction and Analysis -- 8.5.2 Observations -- 8.5.3 Implications -- 8.6 CONCLUDING REMARKS -- ACKNOWLEDGMENT -- REFERENCES -- 9 GRAPHICAL MODELS FOR PROTEIN FUNCTION AND STRUCTURE PREDICTION -- 9.1 INTRODUCTION -- 9.2 GRAPHICAL MODELS -- 9.2.1 Directed Graphical Model -- 9.2.2 Undirected Model (Markov Random Field) -- 9.2.3 Discriminative versus Generative Model -- 9.2.4 Sequential Model.

9.2.5 Maximum-Entropy Markov Models and Label Bias Problem -- 9.2.6 Conditional Random Field -- 9.2.7 Summary of Models and Available Resources -- 9.3 APPLICATIONS -- 9.3.1 Gene Prediction Using Conditional Random Fields -- 9.3.2 Protein Function Prediction Using Markov Random Fields -- 9.3.3 Application to Protein Tertiary Structure Prediction -- 9.4 SUMMARY -- ACKNOWLEDGMENTS -- REFERENCES -- PART C: BIOLOGICAL FEATURE EXTRACTION -- 10 ALGORITHMS AND DATA STRUCTURES FOR NEXT-GENERATION SEQUENCES -- 10.1 ALIGNERS -- 10.1.1 Hash-Based Aligners -- 10.1.2 Prefix-Based Aligners -- 10.1.3 Distributed Architectures -- 10.2 ASSEMBLERS -- 10.2.1 Greedy Assemblers: Seed and Extend -- 10.2.2 Overlap-Layout-Consensus Assemblers -- 10.2.3 DBG-Based Assemblers -- REFERENCES -- 11 ALGORITHMS FOR NEXT-GENERATION SEQUENCING DATA -- 11.1 INTRODUCTION -- 11.2 DEFINITIONS AND NOTATIONS -- 11.3 REAL: A READ ALIGNER FOR MAPPING SHORT READS TO A GENOME -- 11.3.1 Algorithm -- 11.3.2 Experimental Results -- 11.4 CREAL: MAPPING SHORT READS TO A GENOME WITH CIRCULAR STRUCTURE -- 11.4.1 Algorithm -- 11.4.2 Experimental Results -- 11.5 DYNMAP: MAPPING SHORT READS TO MULTIPLE CLOSELY RELATED GENOMES -- 11.5.1 Algorithm -- 11.5.2 Experimental Results -- 11.6 CONCLUSION -- REFERENCES -- 12 GENE REGULATORY NETWORK IDENTIFICATION WITH QUALITATIVE PROBABILISTIC NETWORKS -- 12.1 CENTRAL DOGMA: GENE EXPRESSION IN A CELL -- 12.2 MEASURING EXPRESSION LEVELS: MICROARRAY TECHNOLOGY -- 12.3 UNDERSTANDING GENE REGULATORY NETWORKS: BASIC CONCEPTS -- 12.3.1 Constructing GRNs from Microarray Data -- 12.3.2 Models for Reverse Engineering GRNs -- 12.4 BAYESIAN NETWORKS FOR LEARNING GRNs -- 12.4.1 Bayesian Networks -- 12.4.2 Dynamic Bayesian Networks -- 12.4.3 Learning GRNs -- 12.5 TOWARD QUALITATIVE MODELING OF GRNs -- 12.5.1 Motivating Factors for Using Qualitative Models -- 12.5.2 QPNs.

12.6 QPNs FOR GENE REGULATION -- 12.6.1 Dynamic QPNs -- 12.6.2 Approach -- 12.6.3 Computational Experiments and Results -- 12.7 SUMMARY AND CONCLUSIONS -- REFERENCES -- PART D: BIOLOGICAL FEATURE SELECTION -- 13 COMPARING, RANKING, AND FILTERING MOTIFS WITH CHARACTER CLASSES: APPLICATION TO BIOLOGICAL SEQUENCES ANALYSIS -- 13.1 INTRODUCTION -- 13.1.1 Ranking and Clustering Motifs -- 13.1.2 Ensemble Methods -- 13.1.3 Motif Representation -- 13.1.4 Problem Statement -- 13.2 MOTIFS WITH CHARACTER CLASSES: A CHARACTERIZATION -- 13.2.1 On Transitive Properties of Character Classes -- 13.2.2 Minimal Motifs and Motif Priority -- 13.3 FILTERING BY MEANS OF UNDERLYING MOTIFS -- 13.3.1 Algorithm for Filtering Set of Motifs into Its Underlying Representative Set -- 13.4 EXPERIMENTAL RESULTS AND DISCUSSION -- 13.5 CONCLUSION -- ACKNOWLEDGMENTS -- REFERENCES -- 14 STABILITY OF FEATURE SELECTION ALGORITHMS AND ENSEMBLE FEATURE SELECTION METHODS IN BIOINFORMATICS -- 14.1 INTRODUCTION -- 14.2 FEATURE SELECTION ALGORITHMS AND INSTABILITY -- 14.2.1 Categorization of Feature Selection Algorithm -- 14.2.2 Potential Causes of Feature Selection Instability -- 14.2.3 Remark on Feature Selection Instability -- 14.3 ENSEMBLE FEATURE SELECTION ALGORITHMS -- 14.3.1 Ensemble Based on Data Perturbation -- 14.3.2 Ensemble Based on Different Data Partitioning -- 14.3.3 Performance on Feature Selection Stability -- 14.3.4 Performance of Sample Classification -- 14.3.5 Ensemble Size -- 14.3.6 Some Key Aspects in Ensemble Feature Selection Algorithms -- 14.4 METRICS FOR STABILITY ASSESSMENT -- 14.4.1 Rank-Based Stability Metrics -- 14.4.2 Set-Based Stability Metrics -- 14.4.3 Threshold in Stability Metrics -- 14.4.4 Remark on Metrics for Stability Evaluation -- 14.5 CONCLUSIONS -- ACKNOWLEDGMENT -- REFERENCES.

15 STATISTICAL SIGNIFICANCE ASSESSMENT FOR BIOLOGICAL FEATURE SELECTION: METHODS AND ISSUES.

Abstract:

The first comprehensive overview of preprocessing, mining, and postprocessing of biological data.

Molecular biology is undergoing exponential growth in both the volume and complexity of biological data, and knowledge discovery offers the capacity to automate complex search and data analysis tasks. This book presents a broad overview of the most recent developments in techniques and approaches in the field of biological knowledge discovery and data mining (KDD), providing in-depth fundamental and technical information on the most important topics encountered. Written by top experts, Biological Knowledge Discovery Handbook: Preprocessing, Mining, and Postprocessing of Biological Data covers the three main phases of knowledge discovery (data preprocessing; data processing, also known as data mining; and data postprocessing) and analyzes both verification systems and discovery systems.

BIOLOGICAL DATA PREPROCESSING
Part A: Biological Data Management
Part B: Biological Data Modeling
Part C: Biological Feature Extraction
Part D: Biological Feature Selection

BIOLOGICAL DATA MINING
Part E: Regression Analysis of Biological Data
Part F: Biological Data Clustering
Part G: Biological Data Classification
Part H: Association Rules Learning from Biological Data
Part I: Text Mining and Application to Biological Data
Part J: High-Performance Computing for Biological Data Mining

Combining sound theory with practical applications in molecular biology, Biological Knowledge Discovery Handbook is ideal for courses in bioinformatics and biological KDD as well as for practitioners and professional researchers in computer science, life science, and mathematics.

Local Note:

Electronic reproduction. Ann Arbor, Michigan : ProQuest Ebook Central, 2017. Available via World Wide Web. Access may be limited to ProQuest Ebook Central affiliated libraries.

Subject Term:

Bioinformatics.

Computational biology.
