Automated Data Collection with R : A Practical Guide to Web Scraping and Text Mining.
Title:
Automated Data Collection with R : A Practical Guide to Web Scraping and Text Mining.
Author:
Munzert, Simon.
ISBN:
9781118834787
Edition:
1st ed.
Physical Description:
1 online resource (477 pages)
Contents:
Automated Data Collection with R -- Contents -- Preface -- What you won't learn from reading this book -- Why R? -- Recommended reading to get started with R -- Typographic conventions -- The book's website -- Disclaimer -- Acknowledgments -- 1 Introduction -- 1.1 Case study: World Heritage Sites in Danger -- 1.2 Some remarks on web data quality -- 1.3 Technologies for disseminating, extracting, and storing web data -- 1.3.1 Technologies for disseminating content on the Web -- 1.3.2 Technologies for information extraction from web documents -- 1.3.3 Technologies for data storage -- 1.4 Structure of the book -- Part One A Primer on Web and Data Technologies -- 2 HTML -- 2.1 Browser presentation and source code -- 2.2 Syntax rules -- 2.2.1 Tags, elements, and attributes -- 2.2.2 Tree structure -- 2.2.3 Comments -- 2.2.4 Reserved and special characters -- 2.2.5 Document type definition -- 2.2.6 Spaces and line breaks -- 2.3 Tags and attributes -- 2.3.1 The anchor tag &lt;a&gt; -- 2.3.2 The metadata tag &lt;meta&gt; -- 2.3.3 The external reference tag &lt;link&gt; -- 2.3.4 Emphasizing tags &lt;b&gt;, &lt;i&gt;, &lt;strong&gt; -- 2.3.5 The paragraphs tag &lt;p&gt; -- 2.3.6 Heading tags &lt;h1&gt;, &lt;h2&gt;, &lt;h3&gt;, &lt;h4&gt; -- 2.3.7 Listing content with &lt;ul&gt;, &lt;ol&gt;, and &lt;dl&gt; -- 2.3.8 The organizational tags &lt;div&gt; and &lt;span&gt; -- 2.3.9 The &lt;form&gt; tag and its companions -- 2.3.10 The foreign script tag &lt;script&gt; -- 2.3.11 Table tags &lt;table&gt;, &lt;tr&gt;, &lt;td&gt;, and &lt;th&gt; -- 2.4 Parsing -- 2.4.1 What is parsing? -- 2.4.2 Discarding nodes -- 2.4.3 Extracting information in the building process -- Summary -- Further reading -- Problems -- 3 XML and JSON -- 3.1 A short example XML document -- 3.2 XML syntax rules -- 3.2.1 Elements and attributes -- 3.2.2 XML structure -- 3.2.3 Naming and special characters -- 3.2.4 Comments and character data -- 3.2.5 XML syntax summary -- 3.3 When is an XML document well formed or valid?.

3.4 XML extensions and technologies -- 3.4.1 Namespaces -- 3.4.2 Extensions of XML -- 3.4.3 Example: Really Simple Syndication -- 3.4.4 Example: scalable vector graphics -- 3.5 XML and R in practice -- 3.5.1 Parsing XML -- 3.5.2 Basic operations on XML documents -- 3.5.3 From XML to data frames or lists -- 3.5.4 Event-driven parsing -- 3.6 A short example JSON document -- 3.7 JSON syntax rules -- 3.8 JSON and R in practice -- Summary -- Further reading -- Problems -- 4 XPath -- 4.1 XPath-a query language for web documents -- 4.2 Identifying node sets with XPath -- 4.2.1 Basic structure of an XPath query -- 4.2.2 Node relations -- 4.2.3 XPath predicates -- 4.3 Extracting node elements -- 4.3.1 Extending the fun argument -- 4.3.2 XML namespaces -- 4.3.3 Little XPath helper tools -- Summary -- Further reading -- Problems -- 5 HTTP -- 5.1 HTTP fundamentals -- 5.1.1 A short conversation with a web server -- 5.1.2 URL syntax -- 5.1.3 HTTP messages -- 5.1.4 Request methods -- 5.1.5 Status codes -- 5.1.6 Header fields -- 5.2 Advanced features of HTTP -- 5.2.1 Identification -- 5.2.2 Authentication -- 5.2.3 Proxies -- 5.3 Protocols beyond HTTP -- 5.3.1 HTTP Secure -- 5.3.2 FTP -- 5.4 HTTP in action -- 5.4.1 The libcurl library -- 5.4.2 Basic request methods -- 5.4.3 A low-level function of RCurl -- 5.4.4 Maintaining connections across multiple requests -- 5.4.5 Options -- 5.4.6 Debugging -- 5.4.7 Error handling -- 5.4.8 RCurl or httr-what to use? -- Summary -- Further reading -- Problems -- 6 AJAX -- 6.1 JavaScript -- 6.1.1 How JavaScript is used -- 6.1.2 DOM manipulation -- 6.2 XHR -- 6.2.1 Loading external HTML/XML documents -- 6.2.2 Loading JSON -- 6.3 Exploring AJAX with Web Developer Tools -- 6.3.1 Getting started with Chrome's Web Developer Tools -- 6.3.2 The Elements panel -- 6.3.3 The Network panel -- Summary -- Further reading -- Problems.

7 SQL and relational databases -- 7.1 Overview and terminology -- 7.2 Relational Databases -- 7.2.1 Storing data in tables -- 7.2.2 Normalization -- 7.2.3 Advanced features of relational databases and DBMS -- 7.3 SQL: a language to communicate with Databases -- 7.3.1 General remarks on SQL, syntax, and our running example -- 7.3.2 Data control language-DCL -- 7.3.3 Data definition language-DDL -- 7.3.4 Data manipulation language-DML -- 7.3.5 Clauses -- 7.3.6 Transaction control language-TCL -- 7.4 Databases in action -- 7.4.1 R packages to manage databases -- 7.4.2 Speaking R-SQL via DBI-based packages -- 7.4.3 Speaking R-SQL via RODBC -- Summary -- Further reading -- Problems -- 8 Regular expressions and essential string functions -- 8.1 Regular expressions -- 8.1.1 Exact character matching -- 8.1.2 Generalizing regular expressions -- 8.1.3 The introductory example reconsidered -- 8.2 String processing -- 8.2.1 The stringr package -- 8.2.2 A couple more handy functions -- 8.3 A word on character encodings -- Summary -- Further reading -- Problems -- Part Two A Practical Toolbox for Web Scraping and Text Mining -- 9 Scraping the Web -- 9.1 Retrieval scenarios -- 9.1.1 Downloading ready-made files -- 9.1.2 Downloading multiple files from an FTP index -- 9.1.3 Manipulating URLs to access multiple pages -- 9.1.4 Convenient functions to gather links, lists, and tables from HTML documents -- 9.1.5 Dealing with HTML forms -- 9.1.6 HTTP authentication -- 9.1.7 Connections via HTTPS -- 9.1.8 Using cookies -- 9.1.9 Scraping data from AJAX-enriched webpages with Selenium/Rwebdriver -- 9.1.10 Retrieving data from APIs -- 9.1.11 Authentication with OAuth -- 9.2 Extraction strategies -- 9.2.1 Regular expressions -- 9.2.2 XPath -- 9.2.3 Application Programming Interfaces -- 9.3 Web scraping: Good practice -- 9.3.1 Is web scraping legal?.

9.3.2 What is robots.txt? -- 9.3.3 Be friendly! -- 9.4 Valuable sources of inspiration -- Summary -- Further reading -- Problems -- 10 Statistical text processing -- 10.1 The running example: Classifying press releases of the British government -- 10.2 Processing textual data -- 10.2.1 Large-scale text operations-The tm package -- 10.2.2 Building a term-document matrix -- 10.2.3 Data cleansing -- 10.2.4 Sparsity and n-grams -- 10.3 Supervised learning techniques -- 10.3.1 Support vector machines -- 10.3.2 Random Forest -- 10.3.3 Maximum entropy -- 10.3.4 The RTextTools package -- 10.3.5 Application: Government press releases -- 10.4 Unsupervised learning techniques -- 10.4.1 Latent Dirichlet allocation and correlated topic models -- 10.4.2 Application: Government press releases -- Summary -- Further reading -- 11 Managing data projects -- 11.1 Interacting with the file system -- 11.2 Processing multiple documents/links -- 11.2.1 Using for-loops -- 11.2.2 Using while-loops and control structures -- 11.2.3 Using the plyr package -- 11.3 Organizing scraping procedures -- 11.3.1 Implementation of progress feedback: Messages and progress bars -- 11.3.2 Error and exception handling -- 11.4 Executing R scripts on a regular basis -- 11.4.1 Scheduling tasks on Mac OS and Linux -- 11.4.2 Scheduling tasks on Windows platforms -- Part Three A Bag of Case Studies -- 12 Collaboration networks in the US Senate -- 12.1 Information on the bills -- 12.2 Information on the senators -- 12.3 Analyzing the network structure -- 12.3.1 Descriptive statistics -- 12.3.2 Network analysis -- 12.4 Conclusion -- 13 Parsing information from semistructured documents -- 13.1 Downloading data from the FTP server -- 13.2 Parsing semistructured text data -- 13.3 Visualizing station and temperature data -- 14 Predicting the 2014 Academy Awards using Twitter -- 14.1 Twitter APIs: Overview.

14.1.1 The REST API -- 14.1.2 The streaming APIs -- 14.1.3 Collecting and preparing the data -- 14.2 Twitter-based forecast of the 2014 Academy Awards -- 14.2.1 Visualizing the data -- 14.2.2 Mining tweets for predictions -- 14.3 Conclusion -- 15 Mapping the geographic distribution of names -- 15.1 Developing a data collection strategy -- 15.2 Website inspection -- 15.3 Data retrieval and information extraction -- 15.4 Mapping names -- 15.5 Automating the process -- Summary -- 16 Gathering data on mobile phones -- 16.1 Page exploration -- 16.1.1 Searching mobile phones of a specific brand -- 16.1.2 Extracting product information -- 16.2 Scraping procedure -- 16.2.1 Retrieving data on several producers -- 16.2.2 Data cleansing -- 16.3 Graphical analysis -- 16.4 Data storage -- 16.4.1 General considerations -- 16.4.2 Table definitions for storage -- 16.4.3 Table definitions for future storage -- 16.4.4 View definitions for convenient data access -- 16.4.5 Functions for storing data -- 16.4.6 Data storage and inspection -- 17 Analyzing sentiments of product reviews -- 17.1 Introduction -- 17.2 Collecting the data -- 17.2.1 Downloading the files -- 17.2.2 Information extraction -- 17.2.3 Database storage -- 17.3 Analyzing the data -- 17.3.1 Data preparation -- 17.3.2 Dictionary-based sentiment analysis -- 17.3.3 Mining the content of reviews -- 17.4 Conclusion -- References -- General index -- Package index -- Function index -- EULA.
Abstract:
A hands-on guide to web scraping and text mining for both beginners and experienced users of R. Introduces fundamental concepts of the architecture of the web and of databases, covering HTTP, HTML, XML, JSON, and SQL. Provides basic techniques for querying web documents and data sets (XPath and regular expressions). An extensive set of exercises guides the reader through each technique. Explores both supervised and unsupervised learning techniques as well as advanced topics such as data scraping and text management. Case studies are featured throughout, along with examples for each technique presented. R code and solutions to the exercises featured in the book are provided on a supporting website.
Local Note:
Electronic reproduction. Ann Arbor, Michigan : ProQuest Ebook Central, 2017. Available via World Wide Web. Access may be limited to ProQuest Ebook Central affiliated libraries.