Wednesday 20 May 2015

Data Profiling in a Data Warehouse

Data Profiling    

Data profiling is the process of examining the data available in an existing database and collecting statistics and information about that data. These statistics may be used to:
  • Find out whether existing data can easily be used for other purposes
  • Give metrics on data quality, including whether the data conforms to company standards
  • Assess the risk involved in integrating data for new applications, including the challenges of joins
  • Track data quality
  • Assess whether metadata accurately describes the actual values in the source database
  • Understand data challenges early in any data-intensive project, so that late surprises are avoided. Finding data problems late in a project can cause delays and cost overruns.


Data Profiling categories:

The overall process has three steps, which must be executed in order:
  • Column Profiling, which examines the values in each column and provides the critical metadata required for dependency profiling. As such, it must be executed before dependency profiling.
  • Dependency Profiling, which identifies intra-table dependencies. Dependency profiling is related to the normalization of a data source, and addresses whether or not there are non-key attributes that determine or are dependent on other non-key attributes. The existence of such transitive dependencies is evidence that the table is in at most second normal form, that is, it has not been normalized to third normal form.
  • Redundancy Profiling, which identifies overlapping values between tables. This is typically used to identify candidate foreign keys within tables, to validate attributes that should be foreign keys (but that may not have constraints to enforce integrity), and to identify other areas of data redundancy. For example, redundancy analysis could show the analyst that the ZIP field in table A contains the same values as the ZIP_CODE field in table B 80% of the time.
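
The three steps above can be illustrated with a minimal Python sketch. It is not taken from any particular profiling tool: the function names (profile_column, holds_dependency, overlap_ratio) and the sample rows are hypothetical, made up for illustration. Only the ZIP and ZIP_CODE field names come from the example above. Real profiling tools compute many more statistics.

```python
from collections import Counter, defaultdict

def profile_column(rows, column):
    """Column profiling: basic statistics for a single column."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None and v != ""]
    counts = Counter(non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
    }

def holds_dependency(rows, determinant, dependent):
    """Dependency profiling: True if every value of `determinant`
    maps to exactly one value of `dependent` in this data."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[dependent])
    return all(len(values) == 1 for values in seen.values())

def overlap_ratio(rows_a, col_a, rows_b, col_b):
    """Redundancy profiling: share of rows in A whose `col_a` value
    also appears somewhere in `col_b` of B."""
    if not rows_a:
        return 0.0
    b_values = {row[col_b] for row in rows_b}
    matches = sum(1 for row in rows_a if row[col_a] in b_values)
    return matches / len(rows_a)

# Hypothetical sample data, invented for this sketch.
table_a = [
    {"id": 1, "city": "Boston", "state": "MA", "ZIP": "02108"},
    {"id": 2, "city": "Boston", "state": "MA", "ZIP": "02109"},
    {"id": 3, "city": "Austin", "state": "TX", "ZIP": "73301"},
    {"id": 4, "city": "Austin", "state": "TX", "ZIP": None},
    {"id": 5, "city": "Denver", "state": "CO", "ZIP": "80202"},
]
table_b = [
    {"ZIP_CODE": "02108"},
    {"ZIP_CODE": "02109"},
    {"ZIP_CODE": "73301"},
    {"ZIP_CODE": "80202"},
]

# Step 1: column profiling.
print(profile_column(table_a, "ZIP"))
# Step 2: dependency profiling (non-key city determining non-key state).
print(holds_dependency(table_a, "city", "state"))
# Step 3: redundancy profiling between the two tables.
print(overlap_ratio(table_a, "ZIP", table_b, "ZIP_CODE"))
```

With this sample data, city determines state (a dependency between two non-key attributes), and four of the five ZIP values in table A also appear in ZIP_CODE of table B, which mirrors the 80% overlap described above. A dependency that holds in a sample is only a candidate: it still has to be confirmed against the business rules.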

Benefits of Data Profiling:
  • The benefits of data profiling are improved data quality, a shorter implementation cycle for major projects, and a better understanding of the data for its users.
  • Discovering business knowledge embedded in the data itself is one of the significant benefits derived from data profiling.
  • Data profiling is one of the most effective technologies for improving data accuracy in corporate databases.
  • Although data profiling is effective, it can be challenging not to slip into analysis paralysis.

