Title:
Hadoop Real-World Solutions Cookbook.
Author:
Owens, Jonathan.
ISBN:
9781849519137
Physical Description:
1 online resource (381 pages)
Contents:
Hadoop Real-World Solutions Cookbook -- Table of Contents -- Hadoop Real-World Solutions Cookbook -- Credits -- About the Authors -- About the Reviewers -- www.packtpub.com -- Support files, eBooks, discount offers and more -- Why Subscribe? -- Free Access for Packt account holders -- Preface -- What this book covers -- What you need for this book -- Who this book is for -- Conventions -- Reader feedback -- Customer support -- Downloading the example code -- Errata -- Piracy -- Questions
1. Hadoop Distributed File System - Importing and Exporting Data -- Introduction -- Importing and exporting data into HDFS using Hadoop shell commands -- Getting ready -- How to do it... -- How it works... -- There's more... -- See also -- Moving data efficiently between clusters using Distributed Copy -- Getting ready -- How to do it... -- How it works... -- There's more... -- Importing data from MySQL into HDFS using Sqoop -- Getting ready -- How to do it... -- How it works... -- There's more... -- See also -- Exporting data from HDFS into MySQL using Sqoop -- Getting ready -- How to do it... -- How it works... -- See also -- Configuring Sqoop for Microsoft SQL Server -- Getting ready -- How to do it... -- How it works... -- Exporting data from HDFS into MongoDB -- Getting ready -- How to do it... -- How it works... -- Importing data from MongoDB into HDFS -- Getting ready -- How to do it... -- How it works... -- Exporting data from HDFS into MongoDB using Pig -- Getting ready -- How to do it... -- How it works... -- Using HDFS in a Greenplum external table -- Getting ready -- How to do it... -- How it works... -- There's more... -- Using Flume to load data into HDFS -- Getting ready -- How to do it... -- How it works... -- There's more...
2. HDFS -- Introduction -- Reading and writing data to HDFS -- Getting ready -- How to do it... -- How it works... -- There's more... -- Compressing data using LZO -- Getting ready -- How to do it... -- How it works... -- There's more... -- See also -- Reading and writing data to SequenceFiles -- Getting ready -- How to do it... -- How it works... -- There's more... -- See also -- Using Apache Avro to serialize data -- Getting ready -- How to do it... -- How it works... -- There's more... -- See also -- Using Apache Thrift to serialize data -- Getting ready -- How to do it... -- How it works... -- See also -- Using Protocol Buffers to serialize data -- Getting ready -- How to do it... -- How it works... -- Setting the replication factor for HDFS -- Getting ready -- How to do it... -- How it works... -- There's more... -- See also -- Setting the block size for HDFS -- Getting ready -- How to do it... -- How it works...
3. Extracting and Transforming Data -- Introduction -- Transforming Apache logs into TSV format using MapReduce -- Getting ready -- How to do it... -- How it works... -- There's more... -- See also -- Using Apache Pig to filter bot traffic from web server logs -- Getting ready -- How to do it... -- How it works... -- There's more... -- See also -- Using Apache Pig to sort web server log data by timestamp -- Getting ready -- How to do it... -- How it works... -- There's more... -- See also -- Using Apache Pig to sessionize web server log data -- Getting ready -- How to do it... -- How it works... -- See also -- Using Python to extend Apache Pig functionality -- Getting ready -- How to do it... -- How it works... -- Using MapReduce and secondary sort to calculate page views -- Getting ready -- How to do it... -- How it works... -- See also -- Using Hive and Python to clean and transform geographical event data -- Getting ready -- How to do it... -- How it works... -- There's more... -- Making every column type String -- Type casting values using the AS keyword -- Testing the script locally -- Using Python and Hadoop Streaming to perform a time series analytic -- Getting ready -- How to do it... -- How it works... -- There's more... -- Using Hadoop Streaming with any language that can read from stdin and write to stdout -- Using the -file parameter to pass additional required files for MapReduce jobs -- Using MultipleOutputs in MapReduce to name output files -- Getting ready -- How to do it... -- How it works... -- Creating custom Hadoop Writable and InputFormat to read geographical event data -- Getting ready -- How to do it... -- How it works...
4. Performing Common Tasks Using Hive, Pig, and MapReduce -- Introduction -- Using Hive to map an external table over weblog data in HDFS -- Getting ready -- How to do it... -- How it works... -- There's more... -- LOCATION must point to a directory, not a file -- Dropping an external table does not delete the data stored in the table -- You can add data to the path specified by LOCATION -- Using Hive to dynamically create tables from the results of a weblog query -- Getting ready -- How to do it... -- How it works... -- There's more... -- CREATE TABLE AS cannot be used to create external tables -- DROP temporary tables -- Using the Hive string UDFs to concatenate fields in weblog data -- Getting ready -- How to do it... -- How it works... -- There's more... -- The UDF concat_ws() function will not automatically cast parameters to String -- Alias your concatenated field -- The concat_ws() function supports variable length parameter arguments -- See also -- Using Hive to intersect weblog IPs and determine the country -- Getting ready -- How to do it... -- How it works... -- There's more... -- Hive supports multitable joins -- The ON operator for inner joins does not support inequality conditions -- See also -- Generating n-grams over news archives using MapReduce -- Getting ready -- How to do it... -- How it works... -- There's more... -- Use caution when invoking FileSystem.delete() -- Use NullWritable to avoid unnecessary serialization overhead -- Using the distributed cache in MapReduce to find lines that contain matching keywords over news archives -- Getting ready -- How to do it... -- How it works... -- There's more... -- Use the distributed cache to pass JAR dependencies to map/reduce task JVMs -- Distributed cache does not work in local jobrunner mode -- Using Pig to load a table and perform a SELECT operation with GROUP BY -- Getting ready -- How to do it... -- How it works... -- See also
5. Advanced Joins -- Introduction -- Joining data in the Mapper using MapReduce -- Getting ready -- How to do it... -- How it works... -- There's more... -- See also -- Joining data using Apache Pig replicated join -- Getting ready -- How to do it... -- How it works... -- There's more... -- See also -- Joining sorted data using Apache Pig merge join -- Getting ready -- How to do it... -- How it works... -- There's more... -- See also -- Joining skewed data using Apache Pig skewed join -- Getting ready -- How to do it... -- How it works... -- Using a map-side join in Apache Hive to analyze geographical events -- Getting ready -- How to do it... -- How it works... -- There's more... -- Auto-convert to map-side join whenever possible -- Map-join behavior -- See also -- Using optimized full outer joins in Apache Hive to analyze geographical events -- Getting ready -- How to do it... -- How it works... -- There's more... -- Common join versus map-side join -- STREAMTABLE hint -- Table ordering in the query matters -- Joining data using an external key-value store (Redis) -- Getting ready -- How to do it... -- How it works... -- There's more...
6. Big Data Analysis -- Introduction -- Counting distinct IPs in weblog data using MapReduce and Combiners -- Getting ready -- How to do it... -- How it works... -- There's more... -- The Combiner does not always have to be the same class as your Reducer -- Combiners are not guaranteed to run -- Using Hive date UDFs to transform and sort event dates from geographic event data -- Getting ready -- How to do it... -- How it works... -- There's more... -- Date format strings follow Java SimpleDateFormat guidelines -- Default date and time formats -- See also -- Using Hive to build a per-month report of fatalities over geographic event data -- Getting ready -- How to do it... -- How it works... -- There's more... -- The coalesce() method can take variable length arguments -- Date reformatting code template -- See also -- Implementing a custom UDF in Hive to help validate source reliability over geographic event data -- Getting ready -- How to do it... -- How it works... -- There's more... -- Check out the existing UDFs -- User-defined table and aggregate functions -- Export HIVE_AUX_JARS_PATH in your environment -- See also -- Marking the longest period of non-violence using Hive MAP/REDUCE operators and Python -- Getting ready -- How to do it... -- How it works... -- There's more... -- SORT BY versus DISTRIBUTE BY versus CLUSTER BY versus ORDER BY -- MAP and REDUCE keywords are shorthand for SELECT TRANSFORM -- See also -- Calculating the cosine similarity of artists in the Audioscrobbler dataset using Pig -- Getting ready -- How to do it... -- How it works... -- Trim outliers from the Audioscrobbler dataset using Pig and datafu -- Getting ready -- How to do it... -- How it works... -- There's more...
7. Advanced Big Data Analysis -- Introduction -- PageRank with Apache Giraph -- Getting ready -- How to do it... -- How it works... -- There's more... -- Keep up with the Apache Giraph community.
Abstract:
Realistic, simple code examples to solve problems at scale with Hadoop and related technologies.
Local Note:
Electronic reproduction. Ann Arbor, Michigan : ProQuest Ebook Central, 2017. Available via World Wide Web. Access may be limited to ProQuest Ebook Central affiliated libraries.