Title:

Web Document Analysis : Challenges and Opportunities.

Author:

Antonacopoulos, Apostolos.

ISBN:

9789812775375

Personal Author:

Antonacopoulos, Apostolos.

Physical Description:

1 online resource (346 pages)

Series:

Series in Machine Perception and Artificial Intelligence ; v.55

Contents:

CONTENTS -- PREFACE -- Part I. Content Extraction and Web Mining -- CHAPTER 1 CLUSTERING OF WEB DOCUMENTS USING A GRAPH MODEL -- 1. Introduction -- 2. Graphs: Formal Notation -- 3. The Extended k-Means Clustering Algorithm -- 4. Clustering of Web Documents using the Graph Model -- 5. Experimental Results -- Acknowledgments -- References -- CHAPTER 2 APPLICATIONS OF GRAPH PROBING TO WEB DOCUMENT ANALYSIS -- 1. Introduction -- 2. Related Work -- 3. A Formalism for Graph Probing -- 4. Experimental Evaluation -- 4.1. Graph Model -- 4.2. Generating "Random" Collections of Web Pages -- 4.3. Experiment #1: Full Graph Matching -- 4.4. Experiment #2: Subgraph Matching -- 5. Conclusions -- 6. Acknowledgments -- References -- CHAPTER 3 WEB STRUCTURE ANALYSIS FOR INFORMATION MINING -- 1. Introduction -- 2. Object Model Architecture -- 2.1. HTML Parsing Library -- 2.2. Single-Slot HTML Parsing Functions -- 2.3. Multi-Slot/Pattern HTML Parsing Functions -- 3. User Interface -- 4. News Article Extraction -- 5. Link Extraction -- 6. Stock Quote Extraction -- 7. Conclusion -- Acknowledgment -- References -- CHAPTER 4 NATURAL LANGUAGE PROCESSING FOR WEB DOCUMENT ANALYSIS -- 1. Introduction -- 2. Design Principles -- 2.1. Why XML? -- 2.2. User Orientation -- 2.3. Portability -- 3. Document Suite XDOC -- 3.1. Preprocessing Module -- 3.1.1. HTML Cleaner -- 3.1.2. Structure Tagger -- 3.1.3. POS Tagger -- 3.2. Syntactic Module -- 3.2.1. Syntactic Parser -- 3.2.2. Phrase Detector -- 3.3. Corpus Based Module -- 3.4. Semantic Module -- 3.4.1. Semantic Tagger -- 3.4.2. Case Frame Analysis -- 3.4.3. Semantic Interpretation of Syntactic Structure -- 4. Related Work -- 5. Conclusion -- References -- Part II. Document Analysis for Adaptive Content Delivery -- CHAPTER 5 REFLOWABLE DOCUMENT IMAGES -- 1. Introduction -- 2. Image and Layout Analysis -- 2.1. Text/Image Segmentation.

2.2. Preprocessing -- 2.3. Layout Analysis -- 3. HTML-Based Representations -- 4. Reader Applications -- 5. New Document Formats -- 6. Summary and Conclusions -- Acknowledgments -- References -- CHAPTER 6 EXTRACTION AND MANAGEMENT OF CONTENT FROM HTML DOCUMENTS -- 1. Introduction -- 2. Research Direction -- 3. Current State of the Art -- 3.1 Handcrafting -- 3.2 Transcoding -- 3.3 Adaptive Re-authoring -- 4. Proposed Approach -- 4.1. Web Page Segmentation -- 4.2. Contextual Analysis and Segment Labeling -- 4.3. Web-Page Summarization -- 4.4. Post-processing -- 4.5. Overall Summary of the Content Extraction and Display Process -- 5. Results -- 6. Discussion -- 6.1 Web Page Segmentation -- 6.2 Contextual Analysis and Segment Labeling -- 6.3 Web-Page Summarization -- 6.4. Display Capabilities -- 6.5. Language Independence -- 6.6. Current State of Research -- 6.7. Supported Devices -- 7. Concluding Remarks -- References -- CHAPTER 7 HTML PAGE ANALYSIS BASED ON VISUAL CUES -- 1. Introduction -- 1.1. Document Analysis for Search Engines -- 1.2. Document Analysis for Adaptive Content Delivery -- 2. Visual Similarity of HTML Objects -- 2.1. Visual Similarity of Simple Objects -- 2.2. Visual Similarity of Container Objects -- 3. Pattern Detection and Construction of Structured Documents -- 3.1. Quantization -- 3.2. Frequency Counting -- 3.3. Selection and Confirmation -- 3.4. Construction of Structured Document -- 3.5. Special Consideration of HTML Tables -- 4. Experimental Results and Analysis -- 5. Application in an Adaptive Content Delivery System -- 6. Conclusions -- Acknowledgements -- References -- Part III. Table Understanding on the Web -- CHAPTER 8 AUTOMATIC TABLE DETECTION IN HTML DOCUMENTS -- 1. Introduction -- 2. Features for Web Table Detection -- 2.1. Layout Features -- 2.2. Content Type Features -- 2.3. Word Group Feature.

2.3.1. Vector Space Approach -- 2.3.2. Naive Bayes Approach -- 2.3.3. Weighted kNN Approach -- 3. Classification Schemes -- 3.1. Decision Tree -- 3.2. SVM -- 4. Data Collection and Ground Truthing -- 4.1. Data Collection -- 4.2. Ground Truthing -- 4.3. Database Description -- 5. Experiments -- 6. Conclusion and Future Work -- 7. Acknowledgment -- References -- CHAPTER 9 A WRAPPER INDUCTION SYSTEM FOR COMPLEX DOCUMENTS, AND ITS APPLICATION TO TABULAR DATA ON THE WEB -- 1. Introduction -- 2. Issues in Wrapper Learning -- 3. An Extensible Wrapper Learning System -- 3.1. Architecture of the Learning System -- 3.2. A Generic Representation for Structured Documents -- 3.3. A Generic Representation for Extractors -- 3.4. Representing Training Data -- 3.5. Designing a Bias -- 3.6. The Master Learning Algorithm -- 4. Additional Builders -- 4.1. Composite Builders -- 4.2. Format-based Extraction -- 5. Table-based Extraction -- 5.1. Representing Tables on the World Wide Web -- 5.2. Classes of Table Presentation in HTML -- 5.3. An Abstract Geometric Table Model -- 5.4. Table Location -- 5.4.1. Application of Machine Learning -- 5.4.2. Features -- 5.4.3. Experimental Results -- 5.5. Exploiting Table Context -- 5.6. Exploiting the Table Models -- 6. Experiments -- 7. Conclusions -- Acknowledgments -- References -- CHAPTER 10 EXTRACTING ATTRIBUTES AND THEIR VALUES FROM WEB PAGES -- 1. Introduction -- 2. Ontology Extraction from HTML Tables -- 2.1. Table Structure Recognition -- 2.2. Table Integration -- 3. List Analysis -- 3.1. Term Definition -- 3.2. State Sequence Estimation Module -- 3.2.1. Estimation of P(b|s) and P(s) -- 3.2.2. Estimation of P(s'|s) and P(c|s,s') -- 4. Experiments -- 5. Conclusion and Future Work -- Acknowledgments -- References -- Part IV. Web Image Analysis and Retrieval.

CHAPTER 11 A FUZZY APPROACH TO TEXT SEGMENTATION IN WEB IMAGES BASED ON HUMAN COLOUR PERCEPTION -- 1. Introduction -- 2. Colour Segmentation Method -- 2.1. Colour Distance -- 2.2. Colour Connected-Component Identification -- 2.3. Propinquity Features -- 2.4. Fuzzy Inference -- 2.5. Colour Component Aggregation -- 3. Results and Discussion -- Acknowledgement -- References -- CHAPTER 12 SEARCHING FOR IMAGES ON THE WEB USING TEXTUAL METADATA -- 1. Introduction -- 2. Background -- 3. Image Search Architecture -- 4. Image Search Experiment -- 4.1. Results -- 5. Conclusion and Future Work -- Acknowledgments -- References -- CHAPTER 13 AN ANATOMY OF A LARGE-SCALE IMAGE SEARCH ENGINE -- 1. Introduction -- 1.1. An Illustrative Example -- 2. System Architecture -- 2.1. Feature Extractor -- 2.2. High-dimensional Indexer -- 2.3. Perception-based Search Engine -- 2.4. Content-based Search Engine -- 3. Active Learning Algorithms -- 3.1. MEGA -- 3.2. SVMActive -- 3.3. Hybrid Algorithms -- 4. Experiments -- 5. Results and Discussion -- 5.1. MEGA, SVMActive, and Pipelining MEGA with SVMActive -- 5.2. Observations -- 6. Comparison with Related Work -- 7. Conclusion -- References -- Part V. New Opportunities -- CHAPTER 14 WEB SECURITY AND DOCUMENT IMAGE ANALYSIS -- 1. Introduction -- 1.1. An Influential Precursor: Turing Tests -- 1.2. Robot Exclusion Conventions -- 1.3. Primitive Means -- 1.4. First Use: The Add-URL Problem -- 1.5. The ChatRoom Problem -- 1.6. Screening Financial Accounts -- 1.7. PessimalPrint -- 2. The First International HIP Workshop -- 3. Implications for DIA Research -- 4. Discussion -- 5. Acknowledgments -- References -- CHAPTER 15 EXPLOITING WWW RESOURCES IN EXPERIMENTAL DOCUMENT ANALYSIS RESEARCH -- 1. Introduction -- 2. Traditional Approaches -- 3. Exploiting WWW Resources -- 4. Proof of Concept: Analysis of a Digital Library.

4.1. Optical Character Recognition -- 4.2. Table Detection -- 5. Examples of Other WWW Resources -- 6. Conclusions -- 7. Acknowledgments -- References -- CHAPTER 16 STRUCTURED MEDIA FOR AUTHORING MULTIMEDIA DOCUMENTS -- 1. Motivation -- 2. Multimedia Authoring -- 3. Multimedia Modelling -- 3.1. Video Content Modelling -- 3.1.1. General Model -- 3.1.2. Description of the Video Structure -- 3.1.3. Extensions of MPEG-7 for Model Definition -- 3.2. Document Modelling with Structured Media -- 3.2.1. Multimedia Document Model -- 3.2.2. Model Extensions -- 4. Multimedia Document Authoring System -- 4.1. Video Content Description Editing Environment -- 4.2. Authoring Multimedia Documents -- 5. Conclusion -- References -- CHAPTER 17 DOCUMENT ANALYSIS REVISITED FOR WEB DOCUMENTS -- 1. Introduction -- 2. Document Model Evolution: An Analysis Perspective -- 3. Web Document Analysis -- 3.1. Goals of Web Document Analysis -- 3.2. Specificities of Web Document Analysis -- 3.2.1. Dealing with Heterogeneous Formats -- 3.2.2. Dealing with Links -- 3.2.3 Dealing with Images and Graphics -- 3.2.4. Dealing with Interactive Aspects of Web Documents -- 4. Some Relevant Applications -- 4.1. Extracting Rich Structures from a Large Collection of Documents -- 4.2. Extracting Structure from Interconnected Documents -- 4.3. Dynamic Aspects of Web Documents -- 4.4. Generation of Metadata -- 5. Methodological Issues -- 5.1. Techniques and Methods -- 5.2. A Detailed Example -- 6 Conclusion and Perspectives -- References -- AUTHOR INDEX.

Abstract:

This book provides the first comprehensive look at the emerging field of web document analysis. It sets the scene in this new field by combining state-of-the-art reviews of challenges and opportunities with research papers by leading researchers. Readers will find in-depth discussions on the many diverse and interdisciplinary areas within the field, including web image processing, applications of machine learning and graph theories for content extraction and web mining, adaptive web content delivery, multimedia document modeling and human interactive proofs for web security. Contents: Content Extraction and Web Mining; Document Analysis for Adaptive Content Delivery; Table Understanding on the Web; Web Image Analysis and Retrieval; New Opportunities. Readership: Graduate students and researchers in document-analysis and web communities.

Local Note:

Electronic reproduction. Ann Arbor, Michigan : ProQuest Ebook Central, 2017. Available via World Wide Web. Access may be limited to ProQuest Ebook Central affiliated libraries.

Subject Term:

Data mining.

Electronic books. -- local.
