Change Data Capture (CDC) is a generic term for techniques that monitor operational data
sources with the objective of detecting and capturing data changes of interest .
CDC is of particular importance for data warehouse maintenance.
With
CDC techniques in place, the data warehouse can be maintained by propagating
changes captured at the sources. CDC techniques applied in practice roughly
follow three main approaches,
namely
1.log-based
CDC.
2.utilization
of audit columns
3.calculation
of snapshot differentials
1.Log-based CDC techniques parse system logs and retrieve changes of
interest. These techniques
are typically employed in conjunction with database systems. Virtually all database
systems record changes in transaction logs. This information can be leveraged for
CDC. Alternatively, changes may be explicitly recorded using database triggers or application
logic for instance.
2.utilization of audit columns :
Operational data sources often employ
so called audit columns.
Audit columns are appended to
each tuple and indicate the time at which the tuple was modified for the last time.
Usually timestamps or version numbers are used. Audit columns serve as the
selection criteria to extract changes that occurred since the last incremental
load process. Note that deletions remain undetected.
3.calculation of snapshot differentials
The snapshot differential technique is most appropriate for data that resides in unsophisticated data
sources such as flat files or legacy applications. The latter typically offer mechanisms
for dumping data into files but lack advanced query capabilities. In this case,
changes can be inferred by comparing a current source snapshot with a snapshot
taken at a previous point in time. A major drawback of the snapshot
differential approach is the need for frequent extractions of large data
volumes. However, it is applicable to virtually any type of data source.
- The above mentioned CDC approaches differ not only in their technical realization but also in their ability to detect changes. We refer to the inability to detect certain types of changes as CDC limitation.
- As mentioned before deletions cannot be detected by means of audit columns. Often a single audit column is used to record the time of both, record creation and modification. In this case insertions and updates are indistinguishable with respect to CDC. Another limitation of the audit columns approach is the inability to retrieve the initial state of records that have been updated.
- Interestingly, existing snapshot differential implementations usually have the same limitation. They do not provide the initial state of updated records while this would be feasible in principle.
- since the required data is available in the snapshot taken during the previous run.
- Log-based CDC approaches in practice typically capture all types of changes, i.e. insertions, deletions, and the initial and current state of updated records.
Defined different methods for Change data capture in datawearhouse.
- Timestamp on rows.
- Use in Optimistic locking.
- Status Indicator on rows.
- Time/Version/Status on Rows.
- Trigger on Tables
- Event Programming.
- Log Scanners on databases.
No comments:
Post a Comment