Tuesday, November 24, 2009

ETL

Extract, transform, load
The functions performed when pulling data out of one database and placing it into another of a different type. ETL is used to migrate data, often from relational databases into decision support systems.
ETL, EII and EAI
ETL differs from EII (enterprise information integration), which aggregates data temporarily from disparate data sources, and EAI (enterprise application integration), which integrates data from one application to another. See EII and EAI.

Real-life ETL cycle


The typical real-life ETL cycle consists of the following execution steps (a minimal Python sketch follows the list):

  1. Cycle initiation
  2. Build reference data
  3. Extract (from sources)
  4. Validate
  5. Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
  6. Stage (load into staging tables, if used)
  7. Audit reports (for example, on compliance with business rules; in case of failure, these also help to diagnose and repair the run)
  8. Publish (to target tables)
  9. Archive
  10. Clean up
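
A minimal sketch of this cycle in Python, assuming a CSV source with hypothetical "id" and "amount" columns and an in-memory list standing in for the staging and target tables; reference data, archiving and clean-up are left out, but the ordering of the steps is the same.

    import csv
    from datetime import date

    def extract(path):
        # Extract: read raw rows from a source CSV file.
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def is_number(value):
        try:
            float(value)
            return True
        except (TypeError, ValueError):
            return False

    def validate(rows):
        # Validate: keep rows with a non-empty id and a parseable amount.
        good, bad = [], []
        for row in rows:
            if row.get("id") and is_number(row.get("amount")):
                good.append(row)
            else:
                bad.append(row)
        return good, bad

    def transform(rows):
        # Transform: clean values and tag each row with its load date.
        for row in rows:
            row["amount"] = round(float(row["amount"]), 2)
            row["load_date"] = date.today().isoformat()
        return rows

    def run_cycle(source_path, target):
        rows = extract(source_path)                    # 3. extract from the source
        good, bad = validate(rows)                     # 4. validate
        staged = transform(good)                       # 5. transform, 6. stage (in memory here)
        print(f"audit: {len(staged)} rows ok, {len(bad)} rejected")  # 7. audit report
        target.extend(staged)                          # 8. publish to the target
        return target

Steps 1, 2, 9 and 10 (cycle initiation, reference data, archive, clean up) would be handled by whatever scheduler wraps this function.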

Challenges

  • The range of data values or the data quality in a source system may exceed what designers expected when the validation and transformation rules were specified (a small range-check sketch follows this list)
  • Data warehouses typically grow asynchronously
  • Increasing volumes of data
  • Real-time change-data capture
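
To make the first point concrete, here is a hedged sketch of a simple range check in Python; the column names and bounds are invented, and real rules would come from the business side.

    # Hypothetical expected bounds per column.
    EXPECTED_RANGES = {"age": (0, 120), "amount": (0.0, 1_000_000.0)}

    def out_of_range(row):
        # Return the columns whose values are missing, unparseable,
        # or outside the expected bounds.
        suspect = []
        for column, (low, high) in EXPECTED_RANGES.items():
            try:
                value = float(row[column])
            except (KeyError, TypeError, ValueError):
                suspect.append(column)
                continue
            if not low <= value <= high:
                suspect.append(column)
        return suspect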

Performance

In real life, the slowest part of an ETL process usually occurs in the database load phase. Common ways to speed it up:

  • Do most of the ETL processing outside of the database
  • Partition tables (and indices). Try to keep partitions similar in size (watch for null values which can skew the partitioning).
  • Do all validation in the ETL layer before the load. Disable integrity checking in the target database tables during the load.
  • Disable triggers in the target database tables during the load. Simulate their effect as a separate step.
  • Generate IDs in the ETL layer (not in the database).
  • Drop the indexes (on a table or partition) before the load and recreate them after it (sketched after this list)
  • Use parallel bulk load when possible
  • You can often bulk load inserts, but not updates and deletes.
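
A hedged sketch of the index and trigger advice above, using PostgreSQL-flavoured SQL issued through psycopg2; the table, column and index names are invented, and the point is only the ordering: drop and disable before the load, bulk load, then rebuild and re-enable.

    import psycopg2

    def bulk_load(dsn, csv_path):
        # Bulk-load a CSV into a hypothetical target_sales table with
        # indexes and triggers out of the way during the load.
        conn = psycopg2.connect(dsn)
        cur = conn.cursor()

        # Drop the index and disable triggers before the load.
        cur.execute("DROP INDEX IF EXISTS target_sales_customer_idx")
        cur.execute("ALTER TABLE target_sales DISABLE TRIGGER ALL")

        # Bulk load; a parallel load would split the file and run several
        # COPY sessions at once, ideally one per partition.
        with open(csv_path) as f:
            cur.copy_expert(
                "COPY target_sales FROM STDIN WITH (FORMAT csv, HEADER true)", f)

        # Recreate the index and re-enable triggers after the load; any effect
        # the triggers would have had is then simulated as a separate step.
        cur.execute(
            "CREATE INDEX target_sales_customer_idx ON target_sales (customer_id)")
        cur.execute("ALTER TABLE target_sales ENABLE TRIGGER ALL")

        conn.commit()
        cur.close()
        conn.close()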

Parallel processing

  • Data: splitting a single sequential file into smaller data files to provide parallel access (sketched after this list).
  • Pipeline: Allowing the simultaneous running of several components on the same data stream. For example: looking up a value on record 1 at the same time as adding two fields on record 2.
  • Component: The simultaneous running of multiple processes on different data streams in the same job, for example, sorting one input file while removing duplicates on another file.
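
A minimal sketch of the data-parallel case using Python's multiprocessing module: the input file is split into chunks and each chunk is transformed by a separate worker process. The body of transform_chunk is only a stand-in for whatever the real job does.

    from multiprocessing import Pool

    def transform_chunk(lines):
        # Stand-in transform: upper-case each line; a real job would clean
        # fields, derive columns, and so on.
        return [line.upper() for line in lines]

    def split(lines, n_chunks):
        # Split a sequential input into roughly equal chunks for parallel access.
        size = max(1, len(lines) // n_chunks)
        return [lines[i:i + size] for i in range(0, len(lines), size)]

    def parallel_transform(path, workers=4):
        with open(path) as f:
            lines = f.readlines()
        chunks = split(lines, workers)
        with Pool(workers) as pool:        # one worker process per chunk
            results = pool.map(transform_chunk, chunks)
        return [line for chunk in results for line in chunk]

Note that transform_chunk must live at module level so the worker processes can import it, and on Windows the call to parallel_transform belongs under an if __name__ == "__main__": guard.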
