Webbots, Spiders, and Screen Scrapers.
Title:
Webbots, Spiders, and Screen Scrapers.
Author:
Schrenk, Michael.
ISBN:
9781593271350
Physical Description:
1 online resource (332 pages)
Contents:
Acknowledgments -- Tables of Contents -- Introduction -- Old-School Client-Server Technology -- The Problem with Browsers -- What to Expect from This Book -- Learn from My Mistakes -- Master Webbot Techniques -- Leverage Existing Scripts -- About the Website -- About the Code -- Requirements -- Hardware -- Software -- Internet Access -- A Disclaimer (This Is Important) -- PART I: Fundamental Concepts and Techniques -- 1: What's in It for You? -- Uncovering the Internet's True Potential -- What's in It for Developers? -- Webbot Developers Are in Demand -- Webbots Are Fun to Write -- Webbots Facilitate "Constructive Hacking" -- What's in It for Business Leaders? -- Customize the Internet for Your Business -- Capitalize on the Public's Inexperience with Webbots -- Accomplish a Lot with a Small Investment -- Final Thoughts -- 2: Ideas for Webbot Projects -- Inspiration from Browser Limitations -- Webbots That Aggregate and Filter Information for Relevance -- Webbots That Interpret What They Find Online -- Webbots That Act on Your Behalf -- A Few Crazy Ideas to Get You Started -- Help Out a Busy Executive -- Save Money by Automating Tasks -- Protect Intellectual Property -- Monitor Opportunities -- Verify Access Rights on a Website -- Create an Online Clipping Service -- Plot Unauthorized Wi-Fi Networks -- Track Web Technologies -- Allow Incompatible Systems to Communicate -- Final Thoughts -- 3: Downloading Web Pages -- Think About Files, Not Web Pages -- Downloading Files with PHP's Built-in Functions -- Downloading Files with fopen() and fgets() -- Downloading Files with file() -- Introducing PHP/CURL -- Multiple Transfer Protocols -- Form Submission -- Basic Authentication -- Cookies -- Redirection -- Agent Name Spoofing -- Referer Management -- Socket Management -- Installing PHP/CURL -- LIB_http -- Familiarizing Yourself with the Default Values.

Using LIB_http -- Learning More About HTTP Headers -- Examining LIB_http's Source Code -- Final Thoughts -- 4: Parsing Techniques -- Parsing Poorly Written HTML -- Standard Parse Routines -- Using LIB_parse -- Splitting a String at a Delimiter: split_string() -- Parsing Text Between Delimiters: return_between() -- Parsing a Data Set into an Array: parse_array() -- Parsing Attribute Values: get_attribute() -- Removing Unwanted Text: remove() -- Useful PHP Functions -- Detecting Whether a String Is Within Another String -- Replacing a Portion of a String with Another String -- Parsing Unformatted Text -- Measuring the Similarity of Strings -- Final Thoughts -- Don't Trust a Poorly Coded Web Page -- Parse in Small Steps -- Don't Render Parsed Text While Debugging -- Use Regular Expressions Sparingly -- 5: Automating Form Submission -- Reverse Engineering Form Interfaces -- Form Handlers, Data Fields, Methods, and Event Triggers -- Form Handlers -- Data Fields -- Methods -- Event Triggers -- Unpredictable Forms -- JavaScript Can Change a Form Just Before Submission -- Form HTML Is Often Unreadable by Humans -- Cookies Aren't Included in the Form, but Can Affect Operation -- Analyzing a Form -- Final Thoughts -- Don't Blow Your Cover -- Correctly Emulate Browsers -- Avoid Form Errors -- 6: Managing Large Amounts of Data -- Organizing Data -- Naming Conventions -- Storing Data in Structured Files -- Storing Text in a Database -- Storing Images in a Database -- Database or File? -- Making Data Smaller -- Storing References to Image Files -- Compressing Data -- Removing Formatting -- Thumbnailing Images -- Final Thoughts -- PART II: Projects -- 7: Price-Monitoring Webbots -- The Target -- Designing the Parsing Script -- Initialization and Downloading the Target -- Further Exploration -- 8: Image-Capturing Webbots -- Example Image-Capturing Webbot.

Creating the Image-Capturing Webbot -- Binary-Safe Download Routine -- Directory Structure -- The Main Script -- Further Exploration -- Final Thoughts -- 9: Link-Verification Webbots -- Creating the Link-Verification Webbot -- Initializing the Webbot and Downloading the Target -- Setting the Page Base -- Parsing the Links -- Running a Verification Loop -- Generating Fully Resolved URLs -- Downloading the Linked Page -- Displaying the Page Status -- Running the Webbot -- LIB_http_codes -- LIB_resolve_addresses -- Further Exploration -- 10: Anonymous Browsing Webbots -- Anonymity with Proxies -- Non-proxied Environments -- Your Online Exposure -- Proxied Environments -- The Anonymizer Project -- Writing the Anonymizer -- Final Thoughts -- 11: Search-Ranking Webbots -- Description of a Search Result Page -- What the Search-Ranking Webbot Does -- Running the Search-Ranking Webbot -- How the Search-Ranking Webbot Works -- The Search-Ranking Webbot Script -- Initializing Variables -- Starting the Loop -- Fetching the Search Results -- Parsing the Search Results -- Final Thoughts -- Be Kind to Your Sources -- Search Sites May Treat Webbots Differently Than Browsers -- Spidering Search Engines Is a Bad Idea -- Familiarize Yourself with the Google API -- Further Exploration -- 12: Aggregation Webbots -- Choosing Data Sources for Webbots -- Example Aggregation Webbot -- Familiarizing Yourself with RSS Feeds -- Writing the Aggregation Webbot -- Adding Filtering to Your Aggregation Webbot -- Further Exploration -- 13: FTP Webbots -- Example FTP Webbot -- PHP and FTP -- Further Exploration -- 14: NNTP News Webbots -- NNTP Use and History -- Webbots and Newsgroups -- Identifying News Servers -- Identifying Newsgroups -- Finding Articles in Newsgroups -- Reading an Article from a Newsgroup -- Further Exploration -- 15: Webbots That Read Email -- The POP3 Protocol.

Logging into a POP3 Mail Server -- Reading Mail from a POP3 Mail Server -- Executing POP3 Commands with a Webbot -- Further Exploration -- Email-Controlled Webbots -- Email Interfaces -- 16: Webbots That Send Email -- Email, Webbots, and Spam -- Sending Mail with SMTP and PHP -- Configuring PHP to Send Mail -- Sending an Email with mail() -- Writing a Webbot That Sends Email Notifications -- Keeping Legitimate Mail out of Spam Filters -- Sending HTML-Formatted Email -- Further Exploration -- Using Returned Emails to Prune Access Lists -- Using Email as Notification That Your Webbot Ran -- Leveraging Wireless Technologies -- Writing Webbots That Send Text Messages -- 17: Converting a Website into a Function -- Writing a Function Interface -- Defining the Interface -- Analyzing the Target Web Page -- Using describe_zipcode() -- Final Thoughts -- Distributing Resources -- Using Standard Interfaces -- Designing a Custom Lightweight "Web Service" -- PART III: Advanced Technical Considerations -- 18: Spiders -- How Spiders Work -- Example Spider -- LIB_simple_spider -- harvest_links() -- archive_links() -- get_domain() -- exclude_link() -- Experimenting with the Spider -- Adding the Payload -- Further Exploration -- Save Links in a Database -- Separate the Harvest and Payload -- Distribute Tasks Across Multiple Computers -- Regulate Page Requests -- 19: Procurement Webbots and Snipers -- Procurement Webbot Theory -- Get Purchase Criteria -- Authenticate Buyer -- Verify Item -- Evaluate Purchase Triggers -- Make Purchase -- Evaluate Results -- Sniper Theory -- Get Purchase Criteria -- Authenticate Buyer -- Verify Item -- Synchronize Clocks -- Time to Bid? -- Submit Bid -- Evaluate Results -- Testing Your Own Webbots and Snipers -- Further Exploration -- Final Thoughts -- 20: Webbots and Cryptography -- Designing Webbots That Use Encryption.

SSL and PHP Built-in Functions -- Encryption and PHP/CURL -- A Quick Overview of Web Encryption -- Local Certificates -- Final Thoughts -- 21: Authentication -- What Is Authentication? -- Types of Online Authentication -- Strengthening Authentication by Combining Techniques -- Authentication and Webbots -- Example Scripts and Practice Pages -- Basic Authentication -- Session Authentication -- Authentication with Cookie Sessions -- Authentication with Query Sessions -- Final Thoughts -- 22: Advanced Cookie Management -- How Cookies Work -- PHP/CURL and Cookies -- How Cookies Challenge Webbot Design -- Purging Temporary Cookies -- Managing Multiple Users' Cookies -- Further Exploration -- 23: Scheduling Webbots and Spiders -- The Windows Task Scheduler -- Preparing Your Webbots to Run as Scheduled Tasks -- Scheduling a Webbot to Run Daily -- Complex Schedules -- Non-Calendar-Based Triggers -- Final Thoughts -- Determine the Webbot's Best Periodicity -- Avoid Single Points of Failure -- Add Variety to Your Schedule -- PART IV: Larger Considerations -- 24: Designing Stealthy Webbots and Spiders -- Why Design a Stealthy Webbot? -- Log Files -- Log-Monitoring Software -- Stealth Means Simulating Human Patterns -- Be Kind to Your Resources -- Run Your Webbot During Busy Hours -- Don't Run Your Webbot at the Same Time Each Day -- Don't Run Your Webbot on Holidays and Weekends -- Use Random, Intra-fetch Delays -- Final Thoughts -- 25: Writing Fault-Tolerant Webbots -- Types of Webbot Fault Tolerance -- Adapting to Changes in URLs -- Adapting to Changes in Page Content -- Adapting to Changes in Forms -- Adapting to Changes in Cookie Management -- Adapting to Network Outages and Network Congestion -- Error Handlers -- 26: Designing Webbot-Friendly Websites -- Optimizing Web Pages for Search Engine Spiders -- Well-Defined Links -- Google Bombs and Spam Indexing.

Title Tags.
Local Note:
Electronic reproduction. Ann Arbor, Michigan : ProQuest Ebook Central, 2017. Available via World Wide Web. Access may be limited to ProQuest Ebook Central affiliated libraries.
Electronic Access: