Tuesday, November 24, 2009

ETL

Extract, transform, load
The functions performed when pulling data out of one database and placing it into another of a different type. ETL is used to migrate data, often from relational databases into decision support systems.
ETL, EII and EAI
ETL differs from EII (enterprise information integration), which aggregates data temporarily from disparate data sources, and EAI (enterprise application integration), which integrates data from one application to another. See EII and EAI.

Real-life ETL cycle


The typical real-life ETL cycle consists of the following execution steps (a minimal Python sketch follows the list):

  1. Cycle initiation
  2. Build reference data
  3. Extract (from sources)
  4. Validate
  5. Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
  6. Stage (load into staging tables, if used)
  7. Audit reports (for example, on compliance with business rules; in case of failure, these also help to diagnose and repair the run)
  8. Publish (to target tables)
  9. Archive
  10. Clean up
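
A minimal sketch of this cycle in Python, assuming a CSV source with hypothetical "id" and "amount" columns and an in-memory list standing in for the staging and target tables; reference data, archiving and clean-up are left out, but the ordering of the steps is the same.

    import csv
    from datetime import date

    def extract(path):
        # Extract: read raw rows from a source CSV file.
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def is_number(value):
        try:
            float(value)
            return True
        except (TypeError, ValueError):
            return False

    def validate(rows):
        # Validate: keep rows with a non-empty id and a parseable amount.
        good, bad = [], []
        for row in rows:
            if row.get("id") and is_number(row.get("amount")):
                good.append(row)
            else:
                bad.append(row)
        return good, bad

    def transform(rows):
        # Transform: clean values and tag each row with its load date.
        for row in rows:
            row["amount"] = round(float(row["amount"]), 2)
            row["load_date"] = date.today().isoformat()
        return rows

    def run_cycle(source_path, target):
        rows = extract(source_path)                    # 3. extract from the source
        good, bad = validate(rows)                     # 4. validate
        staged = transform(good)                       # 5. transform, 6. stage (in memory here)
        print(f"audit: {len(staged)} rows ok, {len(bad)} rejected")  # 7. audit report
        target.extend(staged)                          # 8. publish to the target
        return target

Steps 1, 2, 9 and 10 (cycle initiation, reference data, archive, clean up) would be handled by whatever scheduler wraps this function.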

Challenges

  • The range of data values or the data quality in a source system may exceed what designers expected when the validation and transformation rules were specified (a small range-check sketch follows this list)
  • Data warehouses typically grow asynchronously
  • Increasing volumes of data
  • Real-time change-data capture
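
To make the first point concrete, here is a hedged sketch of a simple range check in Python; the column names and bounds are invented, and real rules would come from the business side.

    # Hypothetical expected bounds per column.
    EXPECTED_RANGES = {"age": (0, 120), "amount": (0.0, 1_000_000.0)}

    def out_of_range(row):
        # Return the columns whose values are missing, unparseable,
        # or outside the expected bounds.
        suspect = []
        for column, (low, high) in EXPECTED_RANGES.items():
            try:
                value = float(row[column])
            except (KeyError, TypeError, ValueError):
                suspect.append(column)
                continue
            if not low <= value <= high:
                suspect.append(column)
        return suspect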

Performance

In real life, the slowest part of an ETL process usually occurs in the database load phase. Common ways to speed it up:

  • Do most of the ETL processing outside of the database
  • Partition tables (and indices). Try to keep partitions similar in size (watch for null values which can skew the partitioning).
  • Do all validation in the ETL layer before the load. Disable integrity checking in the target database tables during the load.
  • Disable triggers in the target database tables during the load. Simulate their effect as a separate step.
  • Generate IDs in the ETL layer (not in the database).
  • Drop the indexes (on a table or partition) before the load and recreate them after it (sketched after this list)
  • Use parallel bulk load when possible
  • You can often bulk load inserts, but not updates and deletes.
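
A hedged sketch of the index and trigger advice above, using PostgreSQL-flavoured SQL issued through psycopg2; the table, column and index names are invented, and the point is only the ordering: drop and disable before the load, bulk load, then rebuild and re-enable.

    import psycopg2

    def bulk_load(dsn, csv_path):
        # Bulk-load a CSV into a hypothetical target_sales table with
        # indexes and triggers out of the way during the load.
        conn = psycopg2.connect(dsn)
        cur = conn.cursor()

        # Drop the index and disable triggers before the load.
        cur.execute("DROP INDEX IF EXISTS target_sales_customer_idx")
        cur.execute("ALTER TABLE target_sales DISABLE TRIGGER ALL")

        # Bulk load; a parallel load would split the file and run several
        # COPY sessions at once, ideally one per partition.
        with open(csv_path) as f:
            cur.copy_expert(
                "COPY target_sales FROM STDIN WITH (FORMAT csv, HEADER true)", f)

        # Recreate the index and re-enable triggers after the load; any effect
        # the triggers would have had is then simulated as a separate step.
        cur.execute(
            "CREATE INDEX target_sales_customer_idx ON target_sales (customer_id)")
        cur.execute("ALTER TABLE target_sales ENABLE TRIGGER ALL")

        conn.commit()
        cur.close()
        conn.close()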

Parallel processing

  • Data: splitting a single sequential file into smaller data files to provide parallel access (sketched after this list).
  • Pipeline: Allowing the simultaneous running of several components on the same data stream. For example: looking up a value on record 1 at the same time as adding two fields on record 2.
  • Component: The simultaneous running of multiple processes on different data streams in the same job, for example, sorting one input file while removing duplicates on another file.
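
A minimal sketch of the data-parallel case using Python's multiprocessing module: the input file is split into chunks and each chunk is transformed by a separate worker process. The body of transform_chunk is only a stand-in for whatever the real job does.

    from multiprocessing import Pool

    def transform_chunk(lines):
        # Stand-in transform: upper-case each line; a real job would clean
        # fields, derive columns, and so on.
        return [line.upper() for line in lines]

    def split(lines, n_chunks):
        # Split a sequential input into roughly equal chunks for parallel access.
        size = max(1, len(lines) // n_chunks)
        return [lines[i:i + size] for i in range(0, len(lines), size)]

    def parallel_transform(path, workers=4):
        with open(path) as f:
            lines = f.readlines()
        chunks = split(lines, workers)
        with Pool(workers) as pool:        # one worker process per chunk
            results = pool.map(transform_chunk, chunks)
        return [line for chunk in results for line in chunk]

Note that transform_chunk must live at module level so the worker processes can import it, and on Windows the call to parallel_transform belongs under an if __name__ == "__main__": guard.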
