Wednesday, 20 May 2015

Data Profiling in Datawarehouse

Data Profiling in Datawarehouse

Data Profiling    

A process whereby one examines the data available in an existing database and collects statistics and information about that data. The purpose of these statistics may be to:
  • Find out whether existing data can easily be used for other purposes 
  • Give metrics on data quality including whether the data conforms to company standards
  •  Assess the risk involved in integrating data for new applications, including the challenges of joins
  •  Track data quality
  •  Assess whether metadata accurately describes the actual values in the source database
  •   Understanding data challenges early in any data intensive project, so that late project surprises are avoided. Finding data problems late in the project can incur time delays and project cost overruns.


Data Profiling categories    :

The overall process is in three steps, which must be executed in order:
  •  Column Profiling Column profiling provides critical metadata which is required in order to perform dependency profiling, and as such, must be executed before dependency profiling.
  •  Dependency Profiling, which identifies intra-table dependencies. Dependency profiling is related to the normalization of a data source, and addresses whether or not there are non-key attributes that determine or are dependent on other non-key attributes. The existence of transitive dependencies here may be evidence of second-normal form.
  •  Redundancy Profiling, which identifies overlapping values between tables. This is typically used to identify candidate foreign keys within tables, to validate attributes that should be foreign keys (but that may not have constraints to enforce integrity), and to identify other areas of data redundancy. Example: redundancy analysis could provide the analyst with the fact that the ZIP field in table A contained the same values as the ZIP_CODE field in table B, 80% of the time.
Benefits of Data Profiling:
  • The benefits of data profiling is to improve data quality, shorten the implementation cycle of major projects, and improve understanding of data for the users.
  •  Discovering business knowledge embedded in data itself is one of the significant benefits derived from data profiling..         
  • Data profiling is one of the most effective technologies for improving data accuracy in corporate databases.      
  • Although data profiling is effective, it can be challenging not slip into analysis paralysis.


Data Profiling
Data Profiling






Monday, 4 May 2015

Change data capture System in Datawerhouse

Change data capture System in Datawerhouse
Change Data Capture (CDC) is a generic term for techniques that monitor operational data sources with the objective of detecting and capturing data changes of interest . CDC is of particular importance for data warehouse maintenance.

With CDC techniques in place, the data warehouse can be maintained by propagating changes captured at the sources. CDC techniques applied in practice roughly follow three main approaches,
namely

1.log-based CDC.
2.utilization of audit columns
3.calculation of snapshot differentials

1.Log-based CDC techniques parse system logs and retrieve changes of interest. These techniques are typically employed in conjunction with database systems. Virtually all database systems record changes in transaction logs. This information can be leveraged for CDC. Alternatively, changes may be explicitly recorded using database triggers or application logic for instance.

2.utilization of audit columns :
 Operational data sources often employ so called audit columns. Audit columns are appended to each tuple and indicate the time at which the tuple was modified for the last time. Usually timestamps or version numbers are used. Audit columns serve as the selection criteria to extract changes that occurred since the last incremental load process. Note that deletions remain undetected.
3.calculation of snapshot differentials
The snapshot differential technique is most appropriate for data that resides in unsophisticated data sources such as flat files or legacy applications. The latter typically offer mechanisms for dumping data into files but lack advanced query capabilities. In this case, changes can be inferred by comparing a current source snapshot with a snapshot taken at a previous point in time. A major drawback of the snapshot differential approach is the need for frequent extractions of large data volumes. However, it is applicable to virtually any type of data source.

  • The above mentioned CDC approaches differ not only in their technical realization but also in their ability to detect changes. We refer to the inability to detect certain types of changes as CDC limitation
  • As mentioned before deletions cannot be detected by means of audit columns. Often a single audit column is used to record the time of both, record creation and modification. In this case insertions and updates are indistinguishable with respect to CDC. Another limitation of the audit columns approach is the inability to retrieve the initial state of records that have been updated. 
  • Interestingly, existing snapshot differential implementations usually have the same limitation. They do not provide the initial state of updated records while this would be feasible in principle. 
  • since the required data is available in the snapshot taken during the previous run. 
  • Log-based CDC approaches in practice typically capture all types of changes, i.e. insertions, deletions, and the initial and current state of updated records.
Defined different methods for Change data capture in datawearhouse.
  1. Timestamp on rows.
  2. Use in Optimistic locking.
  3. Status Indicator on rows.
  4. Time/Version/Status on Rows.
  5. Trigger on Tables
  6. Event Programming.
  7. Log Scanners on databases.

Facebook Commentbox

subscribe
Subscribe Us
emailSubscribe to our mailing list to get the updates to your email inbox... We can't wait more to have your email in our subscribers email list. Just put your nice email in below box: