Tuesday, 24 February 2015

Data Warehouse Design Approaches

Data Warehouse Design Approaches






There are two major types of approaches to building or designing the Data Warehouse.
  1.  The Top-Down Approach
  2.  The Bottom-Up Approach




s  The Top Down Approach:

  •       The data flow in the top down OLAP environment begins with data extraction from the operational data sources. This data is loaded into the staging area and validated and consolidated for ensuring a level of accuracy and then transferred to the Operational Data Store (ODS). 
  •      Detailed data is regularly extracted from the ODS and temporarily hosted in the staging area for aggregation, summarization and then extracted and loaded into the Data warehouse. 
  •      Once the Data warehouse aggregation and summarization processes are complete, the data mart refresh cycles will extract the data from the Data warehouse into the staging area and perform a new set of transformations on them. This will help organize the data in particular structures required by data marts. Then the data marts can be loaded with the data and the OLAP environment becomes available to the users

DWH TOP DOWN APPROACH
DWH TOP DOWN DESIGN APPROACH
  •     The data marts are treated as sub sets of the data warehouse. Each data mart is built for an individual department and is optimized for analysis needs of the particular department for which it is created.


 The Bottom-Up Approach:
          

  •      Ralph Kimball designed the data warehouse with the data marts connected  to it with a bus structure.
  •      The bus structure contained all the common elements that are used by data marts such as conformed dimensions, measures etc defined for the enterprise as a whole. 
  •        This architecture makes the data warehouse more of a virtual reality than a physical reality
  •       All data marts could be located in one server or could be located on different servers across the enterprise while the data warehouse would be a virtual entity being nothing more than a sum total of all the data marts
  •      In this context even the cubes constructed by using OLAP tools could be considered as data marts.

Kimball's  bottom-up Design Apporach
Kimball Approach

              The bottom-up approach reverses the positions of the Data warehouse  and the Data marts. Data marts are directly loaded with the data from the   operational systems through the staging area.

                   The data flow in the bottom up approach starts with extraction of data  from operational databases into the staging area where it is processed  and consolidated and then loaded into the ODS.

         The data in the ODS is appended to or replaced by the fresh data being loaded. After the ODS is refreshed the current data is once again extracted into the staging area and processed to fit into the Data mart  structure. The data from the Data Mart, then is extracted to the staging area aggregated, summarized and so on and loaded into the Data Warehouse and made available to the end user for analysis.







Facebook Commentbox




Thursday, 19 February 2015

Surrogate key in Datawearhouse with example

Surrogate key in Datawearhouse with example

What is Surrogate Key?


  • Surrogate Keys are integers that are assigned sequentially in the dimension table which can be used as PK.
  • Surrogate key is a unique identification key, it is like an artificial or alternative key to production key, because the production key may be alphanumeric or composite key but the surrogate key is always single numeric key. 
  • Assume the production key is an alphanumeric field if you create an index for this fields it will occupy more space, so it is not advisable to join/index, because generally all theDatawearhouse fact table are having historical data. 
  • These fact Table are linked with so many dimension table. if it's a numerical fields the performance is high
  • Surrogate key is the primary key for the Dimensional table.
  • It’s a substitution for the natural primary key.
  • It is just a unique identifier or number for each row that can be used for the primary key to the table. 
  • The only requirement for a surrogate primary key is that it is unique for each row in the table.
  • Data warehouses typically use a surrogate, (also known as artificial or identity key)  key for the dimension tables primary keys. They can use Infa sequence generator, or Oracle sequence, or SQL Server Identity values for the surrogate key.
  • It is useful because the natural primary key (i.e. Customer Number in Customer table) can change and this makes updates more difficult.

Example:

          Let us consider an scenarios where you have designed a very good Data Warehouse , its catering all your reporting need and its in production. After two years organization decides to reuse there business keys of products

e.g.
         There was product called Baby Powder after two years organization decides to stop selling at its cost is high and sell is low and instead Baby Powder they launched a Talcum Powder for Men and want to give same key to the product say 336.
         Now while designing a Data Warehouse you have used Business Keys as Primary Key in dimension table with this new change you will have to update the dimension table to replace Baby Powder with talcum Powder and organization does want to remove all data of Baby Powder, what will you do now?
        So to avoid such situation its always better to use Surrogate Keys as Primary Key in dimension table along with Business Key.

e.g.
SK_Prouduct
Product_ID
Product Name
Cost
Acive
1
336
Baby Powder
444
N
2
345
Cream
34
Y
3
336
talkem Powder
44
Y


Now with above table you can use same code for another Product, mark the product as active and inactive and maintain all the product in Data Warehouse

Advantages of Surrogate Key:
  • Surrogate Key allow to cater all your data need that arises from the Business or operational changes and recycling.
  • Surrogate Keys allow the data warehouse to integrate data from all sources if they lack natural Business keys.
  • Surrogate keys are very helpful for ETL transformations.

Facebook Commentbox



Sunday, 15 February 2015

Operational Systems in datawearhouse

Operational Systems in datawearhouse

Operational Systems


Operational systems are the ones supporting the day-to-day activities of the enterprise. They are focused on processing transactions, ranging from order entry to billing to human resources transactions. In a typical organization, the operational systems use a wide variety of technologies and architectures, and they may include some vendor-packaged systems in addition to in-house custom-developed software. Operational systems are static by nature; they change only in response to an intentional change in business policies or
processes, or for technical reasons, such as system maintenance or performance tuning.


                              

Operational databases are normally "relational" - not "dimensional".  They are designed for operational, data entry purposes and are not well suited for online queries and analytics.

These operational systems are the source of most of the electronically maintained data within the CIF. Because these systems support time-sensitive realtime transaction processing, they have usually been optimized for performance and transaction throughput. Data in the operational systems environment may be duplicated across several systems, and is often not synchronized. These operational systems represent the first application of business rules to an organization’s data, and the quality of data in the operational systems has a direct impact on the quality of all other information used in the organization.


Sometimes operational systems are referred to as operational databases, transaction processing systems, or online transaction processing systems (OLTP). However, the use of the last two terms as synonyms may be confusing, because operational systems can be batch processing systems as well.
Any Enterprise must necessarily maintain a lot of data about its operation. This is its "Operational Data".

Operational systems vs. Data warehousing

The fundamental difference between operational systems and data warehousing systems is that operational systems are designed to support transaction processing whereas data warehousing systems are designed to support online analytical processing (or OLAP, for short).
Based on this fundamental difference, data usage patterns associated with operational systems are significantly different than usage patterns associated with data warehousing systems. As a result, data warehousing systems are designed and optimized using methodologies that drastically differ from that of operational systems.
The table below summarizes many of the differences between operational systems and data warehousing systems. 

Difference between operational systems and data warehousing systems
operational systems
data warehousing systems
Operational systems are generally designed to support high-volumetransaction processing with minimal back-end reporting.
Data warehousing systems are generally designed to support high-volume analytical processing (i.e. OLAP) and subsequent, often elaborate report generation.
Operational systems are generally process-oriented or process-driven, meaning that they are focused on specific business processes or tasks. Example tasks include billing, registration, etc.
Data warehousing systems are generally subject-oriented, organized around business areas that the organization needs information about. Such subject areas are usually populated with data from one or more operational systems. As an example, revenue may be a subject area of a data warehouse that incorporates data from operational systems that contain student tuition data, alumni gift data, financial aid data, etc.
Operational systems are generally concerned with current data.
Data warehousing systems are generally concerned with historical data.
Data within operational systems are generally updated regularlyaccording to need.
Data within a data warehouse is generally non-volatile, meaning that new data may be added regularly, but once loaded, the data is rarely changed, thus preserving an ever-growing history of information. In short, data within a data warehouse is generally read-only.
Operational systems are generally optimized to perform fast inserts and updates of relatively small volumes of data.
Data warehousing systems are generally optimized to perform fast retrievals of relatively large volumes of data.
Operational systems are generally application-specific, resulting in a multitude of partially or non-integrated systems and redundant data(e.g. billing data is not integrated with payroll data).
Data warehousing systems are generally integrated at a layer above the application layer, avoiding data redundancy problems.
Operational systems generally require a non-trivial level of computing skills amongst the end-user community.
Data warehousing systems generally appeal to an end-user community with a wide range of computing skills, from novice to expert users.

Facebook Commentbox




subscribe
Subscribe Us
emailSubscribe to our mailing list to get the updates to your email inbox... We can't wait more to have your email in our subscribers email list. Just put your nice email in below box: