Title:
Pentaho Kettle Solutions : Building Open Source ETL Solutions with Pentaho Data Integration.
Author:
Casters, Matt.
ISBN:
9780470942420
Edition:
1st ed.
Physical Description:
1 online resource (722 pages)
Contents:
Pentaho Kettle Solutions -- About the Authors -- Credits -- Acknowledgments -- Contents at a Glance -- Contents -- Introduction -- Part I: Getting Started -- Chapter 1: ETL Primer -- OLTP versus Data Warehousing -- What Is ETL? -- The Evolution of ETL Solutions -- ETL Building Blocks -- ETL, ELT, and EII -- ELT -- EII: Virtual Data Integration -- Data Integration Challenges -- Methodology: Agile BI -- ETL Design -- Data Acquisition -- Beware of Spreadsheets -- Design for Failure -- Change Data Capture -- Data Quality -- Data Profiling -- Data Validation -- ETL Tool Requirements -- Connectivity -- Platform Independence -- Scalability -- Design Flexibility -- Reuse -- Extensibility -- Data Transformations -- Testing and Debugging -- Lineage and Impact Analysis -- Logging and Auditing -- Summary -- Chapter 2: Kettle Concepts -- Design Principles -- The Building Blocks of Kettle Design -- Transformations -- Steps -- Transformation Hops -- Parallelism -- Rows of Data -- Data Conversion -- Jobs -- Job Entries -- Job Hops -- Multiple Paths and Backtracking -- Parallel Execution -- Job Entry Results -- Transformation or Job Metadata -- Database Connections -- Special Options -- The Power of the Relational Database -- Connections and Transactions -- Database Clustering -- Tools and Utilities -- Repositories -- Virtual File Systems -- Parameters and Variables -- Defining Variables -- Named Parameters -- Using Variables -- Visual Programming -- Getting Started -- Creating New Steps -- Putting It All Together -- Summary -- Chapter 3: Installation and Configuration -- Kettle Software Overview -- Integrated Development Environment: Spoon -- Command-Line Launchers: Kitchen and Pan -- Job Server: Carte -- Encr.bat and encr.sh -- Installation -- Java Environment -- Installing Java Manually.

Using Your Linux Package Management System -- Installing Kettle -- Versions and Releases -- Archive Names and Formats -- Downloading and Uncompressing -- Running Kettle Programs -- Creating a Shortcut Icon or Launcher for Spoon -- Configuration -- Configuration Files and the .kettle Directory -- The Kettle Shell Scripts -- General Structure of the Startup Scripts -- Adding an Entry to the Classpath -- Changing the Maximum Heap Size -- Managing JDBC Drivers -- Summary -- Chapter 4: An Example ETL Solution-Sakila -- Sakila -- The Sakila Sample Database -- DVD Rental Business Process -- Sakila Database Schema Diagram -- Sakila Database Subject Areas -- General Design Considerations -- Installing the Sakila Sample Database -- The Rental Star Schema -- Rental Star Schema Diagram -- Rental Fact Table -- Dimension Tables -- Keys and Change Data Capture -- Installing the Rental Star Schema -- Prerequisites and Some Basic Spoon Skills -- Setting Up the ETL Solution -- Creating Database Accounts -- Working with Spoon -- Opening Transformation and Job Files -- Opening the Step's Configuration Dialog -- Examining Streams -- Running Jobs and Transformations -- The Sample ETL Solution -- Static, Generated Dimensions -- Loading the dim_date Dimension Table -- Loading the dim_time Dimension Table -- Recurring Load -- The load_rentals Job -- The load_dim_staff Transformation -- Database Connections -- The load_dim_customer Transformation -- The load_dim_store Transformation -- The fetch_address Subtransformation -- The load_dim_actor Transformation -- The load_dim_film Transformation -- The load_fact_rental Transformation -- Summary -- Part II: ETL -- Chapter 5: ETL Subsystems -- Introduction to the 34 Subsystems -- Extraction -- Subsystems 1-3: Data Profiling, Change Data Capture, and Extraction -- Cleaning and Conforming Data.

Subsystem 4: Data Cleaning and Quality Screen Handler System -- Subsystem 5: Error Event Handler -- Subsystem 6: Audit Dimension Assembler -- Subsystem 7: Deduplication System -- Subsystem 8: Data Conformer -- Data Delivery -- Subsystem 9: Slowly Changing Dimension Processor -- Subsystem 10: Surrogate Key Creation System -- Subsystem 11: Hierarchy Dimension Builder -- Subsystem 12: Special Dimension Builder -- Subsystem 13: Fact Table Loader -- Subsystem 14: Surrogate Key Pipeline -- Subsystem 15: Multi-Valued Dimension Bridge Table Builder -- Subsystem 16: Late-Arriving Data Handler -- Subsystem 17: Dimension Manager System -- Subsystem 18: Fact Table Provider System -- Subsystem 19: Aggregate Builder -- Subsystem 20: Multidimensional (OLAP) Cube Builder -- Subsystem 21: Data Integration Manager -- Managing the ETL Environment -- Summary -- Chapter 6: Data Extraction -- Kettle Data Extraction Overview -- File-Based Extraction -- Working with Text Files -- Working with XML Files -- Special File Types -- Database-Based Extraction -- Web-Based Extraction -- Text-Based Web Extraction -- HTTP Client -- Using SOAP -- Stream-Based and Real-Time Extraction -- Working with ERP and CRM Systems -- ERP Challenges -- Kettle ERP Plugins -- Working with SAP Data -- ERP and CDC Issues -- Data Profiling -- Using eobjects.org DataCleaner -- Adding Profile Tasks -- Adding Database Connections -- Doing an Initial Profile -- Working with Regular Expressions -- Profiling and Exploring Results -- Validating and Comparing Data -- Using a Dictionary for Column Dependency Checks -- Alternative Solutions -- Text Profiling with Kettle -- CDC: Change Data Capture -- Source Data-Based CDC -- Trigger-Based CDC -- Snapshot-Based CDC -- Log-Based CDC -- Which CDC Alternative Should You Choose? -- Delivering Data -- Summary.

Chapter 7: Cleansing and Conforming -- Data Cleansing -- Data-Cleansing Steps -- Using Reference Tables -- Conforming Data Using Lookup Tables -- Conforming Data Using Reference Tables -- Data Validation -- Applying Validation Rules -- Validating Dependency Constraints -- Error Handling -- Handling Process Errors -- Transformation Errors -- Handling Data (Validation) Errors -- Auditing Data and Process Quality -- Deduplicating Data -- Handling Exact Duplicates -- The Problem of Non-Exact Duplicates -- Building Deduplication Transforms -- Step 1: Fuzzy Match -- Step 2: Select Suspects -- Step 3: Lookup Validation Value -- Step 4: Filter Duplicates -- Scripting -- Formula -- JavaScript -- User-Defined Java Expressions -- Regular Expressions -- Summary -- Chapter 8: Handling Dimension Tables -- Managing Keys -- Managing Business Keys -- Keys in the Source System -- Keys in the Data Warehouse -- Business Keys -- Storing Business Keys -- Looking Up Keys with Kettle -- Generating Surrogate Keys -- The "Add sequence" Step -- Working with auto_increment or IDENTITY Columns -- Keys for Slowly Changing Dimensions -- Loading Dimension Tables -- Snowflaked Dimension Tables -- Top-Down Level-Wise Loading -- Sakila Snowflake Example -- Sample Transformation -- Database Lookup Configuration -- Sample Job -- Star Schema Dimension Tables -- Denormalization -- Denormalizing to 1NF with the "Database lookup" Step -- Change Data Capture -- Slowly Changing Dimensions -- Types of Slowly Changing Dimensions -- Type 1 Slowly Changing Dimensions -- The Insert / Update Step -- Type 2 Slowly Changing Dimensions -- The "Dimension lookup / update" Step -- Other Types of Slowly Changing Dimensions -- Type 3 Slowly Changing Dimensions -- Hybrid Slowly Changing Dimensions -- More Dimensions -- Generated Dimensions -- Date and Time Dimensions.

Generated Mini-Dimensions -- Junk Dimensions -- Recursive Hierarchies -- Summary -- Chapter 9: Loading Fact Tables -- Loading in Bulk -- STDIN and FIFO -- Kettle Bulk Loaders -- MySQL Bulk Loading -- LucidDB Bulk Loader -- Oracle Bulk Loader -- PostgreSQL Bulk Loader -- Table Output Step -- General Bulk Load Considerations -- Dimension Lookups -- Maintaining Referential Integrity -- The Surrogate Key Pipeline -- Using In-Memory Lookups -- Stream Lookups -- Late-Arriving Data -- Late-Arriving Facts -- Late-Arriving Dimensions -- Fact Table Handling -- Periodic and Accumulating Snapshots -- Introducing State-Oriented Fact Tables -- Loading Periodic Snapshots -- Loading Accumulating Snapshots -- Loading State-Oriented Fact Tables -- Loading Aggregate Tables -- Summary -- Chapter 10: Working with OLAP Data -- OLAP Benefits and Challenges -- OLAP Storage Types -- Positioning OLAP -- Kettle OLAP Options -- Working with Mondrian -- Working with XML/A Servers -- Working with Palo -- Setting Up the Palo Kettle Plugin -- Palo Architecture -- Reading Palo Data -- Writing Palo Data -- Summary -- Part III: Management and Deployment -- Chapter 11: ETL Development Lifecycle -- Solution Design -- Best and Bad Practices -- Data Mapping -- Naming and Commentary Conventions -- Common Pitfalls -- ETL Flow Design -- Reusability and Maintainability -- Agile Development -- Testing and Debugging -- Test Activities -- ETL Testing -- Test Data Requirements -- Testing for Completeness -- Testing Data Transformations -- Test Automation and Continuous Integration -- Upgrade Tests -- Debugging -- Documenting the Solution -- Why Isn't There Any Documentation? -- Myth 1: My Software Is Self-Explanatory -- Myth 2: Documentation Is Always Outdated -- Myth 3: Who Reads Documentation Anyway? -- Kettle Documentation Features -- Generating Documentation.

Summary.
Abstract:
A complete guide to Pentaho Kettle, the Pentaho Data Integration toolset for ETL. This practical book is a complete guide to installing, configuring, and managing Pentaho Kettle. If you're a database administrator or developer, you'll first get up to speed on Kettle basics and how to apply Kettle to create ETL solutions before progressing to specialized concepts such as clustering, extensibility, and data vault models, and you'll learn how to design and build every phase of an ETL solution. The book shows developers and database administrators how to use the open-source Pentaho Kettle for enterprise-level ETL processes (Extracting, Transforming, and Loading data); assumes no prior knowledge of Kettle or ETL, and brings beginners thoroughly up to speed at their own pace; explains how to get Kettle solutions up and running, then follows the 34 ETL subsystems model created by the Kimball Group to explore the entire ETL lifecycle, including all aspects of data warehousing with Kettle; and goes beyond routine tasks to explore how to extend Kettle and scale Kettle solutions using a distributed "cloud." Get the most out of Pentaho Kettle and your data warehousing with this detailed guide, from simple single-table data migration to complex multisystem clustered data integration tasks.
Local Note:
Electronic reproduction. Ann Arbor, Michigan : ProQuest Ebook Central, 2017. Available via World Wide Web. Access may be limited to ProQuest Ebook Central affiliated libraries.