datawarehouse concepts-BusinessIntelligence

Thursday, 12 March 2015

What is ETL (Extract, Transform, Load)

ETL refers to the methods involved in accessing and manipulating source data and loading it into target database. During the ETL process, more often, data is extracted from an OLTP database, transformed to match the data warehouse schema, and loaded into the data warehouse database.

(Extract, Transform, Load) –

ETL Process

EXTRACTION :

Data Capturing: The ETL extraction element is responsible for extracting data from the source system. During extraction, data may be removed from the source system or a copy made and the original data retained in the source system.

Data Capturing

Data Transmission: Legacy systems may require too much effort to implement such offload processes, so legacy data is often copied into the data warehouse, leaving the original data in place. Extracted data is loaded into the data warehouse staging area (a relational database usually separate from the data warehouse database), for manipulation by the remaining ETL processes.

Data Transmission

Cleansing Process:

Data extraction is generally performed within the source system itself.
Data extraction processes can be implemented using Transact-SQL stored procedures, Data Transformation Services (DTS) tasks, or custom applications developed in programming or scripting languages.

Transformation:

The ETL transformation element is responsible for data validation, data accuracy, data type conversion, and business rule application. An ETL system that uses inline transformations during extraction is less robust and flexible than one that confines transformations to the reformatting element. Transformations performed in the OLTP system impose a performance burden on the OLTP database.

Data Validation: Check that all rows in the fact table match rows in dimension tables to enforce data integrity.

DataAccuracy : Ensure that fields contain appropriate values, such as only "off" or "on" in a status field.

Data Type Conversion : Ensure that all values for a specified field are stored the same way in the data warehouse regardless of how they were stored in the source system. For example, if one source system stores "off" or "on" in its status field and another source system stores "0" or "1" in its status field, then a data type conversion transformation converts the content of one or both of the fields to a specified common value such as "off" or "on".

Transformation

Business Rule Application: Ensure that the rules of the business are enforced on the data stored in the warehouse. For example, check that all customer records contain values for both FirstName and LastName fields.

Loading :

The ETL loading element is responsible for loading transformed data into the data warehouse database.
Data warehouses are usually updated periodically rather than continuously, and large numbers of records are often loaded to multiple tables in a single data load.
The data warehouse is often taken offline during update operations so that data can be loaded faster and SQL Server 2000 Analysis Services can update OLAP cubes to incorporate the new data. BULK INSERT, bcp, and the Bulk Copy API are the best tools for data loading operations.
The design of the loading element should focus on efficiency and performance to minimize the data warehouse offline time.

Facebook Commentbox

Thursday, 5 March 2015

Logical Design In Datawearhouse

Logical design in datawearhouse

Logical Data Model (LDM) -

A logical design is conceptual and abstract. The process of logical design involves arranging data into a series of logical relationships called entities and attributes.
Logical data model includes all required entities, attributes, key groups, and relationships that represent business information and define business rules.
An entity represents a chunk of information. In relational databases, an entity often maps to a table. An attribute is a component of an entity and helps define the uniqueness of the entity. In relational databases, an attribute maps to a column.

Logical design in Data wearhouse

Important Terms:

Entity

Entity: Are the principal data object about which information is to be collected. A class of persons, places, objects, events, or concepts about which we need to capture and store data.

Persons: agency, contractor, customer, department, division, employee, instructor, student, supplier.

  Places: sales region, building, room, branch office, campus.

  Objects: book, machine, part, product, raw material, software license, software package, tool, vehicle model, vehicle.

  Events: application, award, cancellation, class, flight, invoice, order, registration, renewal, requisition, reservation, sale, trip.

  Concepts: account, block of time, bond, course, fund, qualification, stock.

Relationship:A natural business association that exists between one or more entities. The relationship may represent an event that links the entities or merely a logical affinity that exists between the entities

Relationship

An example of a relationship would be:

Employees are assigned to projects.

Student enrolling in a curriculum.

Projects have subtasks.

Departments manage one or more projects

Cardinality:The cardinality of a relationship is the actual number of related occurrences for each of the two entities. The basic types of connectivity for relations are: one-to-one, one-to-many, and many-to-many. The minimum and maximum number of occurrences of one entity that may be related to a single occurrence of the other entity. Because all relationships are bidirectional, cardinality must be defined in both directions for every relationship.

Cardinality

Cardinality Notations:

Facebook Commentbox

Tuesday, 3 March 2015

Physical design in datawearhouse

Physical design is the creation of the database with SQL statements. During the physical design process, you convert the data gathered during the logical design phase into a description of the physical database structure.

A complete physical data model will include all the database artifacts required to create relationships between tables or achieve performance goals, such as indexes, constraint definitions, linking tables, partitioned tables or clusters.

Physical data model represents how the model will be built in the database. A physical database model shows all table structures, including column name, column data type, column constraints, primary key, foreign key, and relationships between tables.

Features of a physical data model include:

Specification all tables and columns.
Foreign keys are used to identify relationships between tables.
Denormalization may occur based on user requirements.
Physical considerations may cause the physical data model to be quite different from the logical data model.
Physical data model will be different for different RDBMS. For example, data type for a column may be different between MySQL and SQL Server.

The steps for physical data model design are as follows:

Convert entities into tables.
Convert relationships into foreign keys.
Convert attributes into columns.
Modify the physical data model based on physical constraints / requirements.

Physical Design Structures:

Table spaces : A tablespace consists of one or more data files, which are physical structures within the operating system you are using. A data file is associated with only one tablespace. From a design perspective, table spaces are containers for physical design structures.

Tables and Partitioned Tables : Tables are the basic unit of data storage. They are the container for the expected amount of raw data in your data warehouse. Using partitioned tables instead of non-partitioned ones addresses the key problem of supporting very large data volumes by allowing you to decompose them into smaller and more manageable pieces.

Views : A view is a tailored presentation of the data contained in one or more tables or other views. A view takes the output of a query and treats it as a table. Views do not require any space in the database.

Integrity Constraints : Integrity constraints are used to enforce business rules associated with your database and to prevent having invalid information in the tables. Integrity constraints in data warehousing differ from constraints in OLTP environments. In OLTP environments, they primarily prevent the insertion of invalid data into a record, which is not a big problem in data warehousing environments because accuracy has already been guaranteed.

Indexes : Indexes are optional structures associated with tables or clusters. In addition to the classical B-tree indexes, bitmap indexes are very common in data warehousing environments.

Facebook Commentbox

Sunday, 1 March 2015

SQL interview Qustions and Answers for freshers (List 2)

1) What is Index?

A pointer to data having physical representation is called as Index.
Record can be located quickly and efficiently by creating Indices on existing tables.
Each index in a table has some valid name and we can have more than one index in different columns of a table.
We can speed up queries by setting up index in a column of a table.
In a table , each row is examined by sql server to fulfill our query is known as table scan and it only happen when there is no index available to help the query.
On large tables, the table scan has huge impact on performance.
Clustered and Non clustered indexes are the most widely used indexes in a database.

2) What is Trigger?

A Trigger is a process of firing an action when some event like Insert, Update or Delete occurs.
A trigger can’t be called or even executed rather they are automatically become active by the DBMS whenever some modification in associated table occur.
Triggers are event driven and can attached to particular table in a database.
Triggers are implicitly executed and stored procedures are also executed by triggers.
Referential integrity is maintained by the trigger and they are managed and stored by DBMS.
Triggers can be nested also, in which Insert, Update or Delete logic can be fired from the trigger itself.

3) What is a NOLOCK?

NOLOCK is used to improve concurrency on a busy system.
On data read, no lock can be taken on SELECT statement.
When some other process is updating the data on the same time you are reading it is known as dirty read.
Read (Shared) locks are taken by SELECT Statements.
Simultaneous access of multiple SELECT statements is allowed in Shared lock but modification process is not allowed.
The result to your system is blocking.
Update will start on completion of all the reads.

4) What is the STUFF function and how does it differ from the REPLACE function?

Using STUFF function we can overwrite the specified characters of a string.
The syntax of STUFF function is
STUFF (stringToChange, startIndex, length, new_characters )
where stringToChange is the string which will have the characters those we want to overwrite, startIndex is the starting position, length is the number of characters in the string that are to be overwrited, and new_characters are the new characters to write into the string.
While REPLACE function is used to replace specified character at all its existing occurances.
The syntax of REPLACE function is REPLACE (string_to_change, string_to_Replace, new_tring).
Every occurance of string_to_change will be replaced by new_string.

5) What are Self Join and Cross Join?

When we want to join a table to itself then SELF JOIN is used.
We can give one or more aliases to eliminate the confusion.
A self join can be used as any type, if both the tables are same.
The simple example where we can use SELF JOIN is if in a company have a hierarchal reporting structure and an employee reports to another.
A cross join give the number of rows in the first table multiplied by the number of rows in second table.
The simple example where we can use CROSS JOIJ is if in an organization wants to combine every Employee with family table to see each Employee with each family member.

6) What are the advantages of using Stored Procedures?

Procedure can reduce network traffic and latency, and can enhance application performance.
Procedure execution plans can be reused, staying cached in the management tool's memory, reducing its overhead.
Procedures provide the benefit of code reuse.
The logic can be encapsulated using procedures and can help to change procedure's code without interacting to application.
Procedures give more security to our data.

7) What is RANK function?

RANK function can be used to give a rank to each row returned from a SELECT statment.
For using this function first specify the function name, followed by the empty parentheses.
Then mention the OVER function. For this function, you have to pass an ORDER BY clause as an argument. The clause identifies the column on which you are going to apply the RANK function.
For Example
SELECT RANK() OVER(ORDER BY BirthDate DESC) AS [RowNumber], FirstName, BirthDate FROM EmpDetails
In the result you will see that the eldest employee got the first rank and the youngest employee got the last rank. Here the rows with equal age will get same ranks.
The rank depends on the row's position in the result set, but not on the sequential number of the row.

8) What are cursors and when they are useful?

When we execute any SQL operations, SQL Server opens a work area in memory which is called Cursor.
When it is required to perform the row by row operations which are not possible with the set-based operations then Cursor is used.
There are two of cursors

Implicit Cursor : SQL Server automatically manages cursors for all data manipulation statements.These cursors are called implicit cursors.
Explicit Cursor:

When the programmer wants to perform the row by row operations for the result set containing more than one row, then he explicitly declare a cursor with a name.They are managed by OPEN, FETCH and CLOSE.-%FOUND, %NOFOUND, %ROWCOUNT and %ISOPEN attributes are used in both types of cursors.

9) What is Similarity and Difference between Truncate and Delete in SQL?Similarity

Both Truncate and Delete command will delete data from given table and they will not delete the table structure from the database.

Difference

TRUNCATE is a DDL (data definition language) command whereas DELETE is a DML (data manipulation language) command.
We can’t execute a trigger with TRUNCATE whereas with DELETE command, a trigger can be executed.
TRUNCATE is faster than DELETE, because when you use DELETE to delete the data, at that time it store the whole data in rollback space from where you can get the data back after deletion, whereas
TRUNCATE will not store data in rollback space and will directly delete it.
You can’t get the deleted data back when you use TRUNCATE.
We can use any condition in WHERE clause using DELETE but it is notpossible with TRUNCATE.If table is referenced by any foreign key constraints then TRUNCATE cannot work.

10) What are COMMIT and ROLLBACK in SQL?

COMMIT statement is used to end the current transaction and once the COMMIT statement is exceucted the transaction will be permanent and undone.
Syntax: COMMIT;
Example:
BEGIN
UPDATE EmpDetails SET EmpName = ‘Arpit’ where Dept = ‘Developer’
COMMIT;
END;
ROLLBACK statement is used to end the current transaction and undone the changes which was made by that transaction.
Syntax: ROLLBACK [TO] Savepoint_name;
Example
BEGIN
Statement1;
SAVEPOINT mysavepoint;
BEGIN
Statement2;
EXCEPTION
WHEN OTHERS THEN
ROLLBACK TO mysavepoint;
Statement5;
END;
END;

11) What is a WITH(NOLOCK)?

WITH(NOLOCK) is used to unlock the data which is locked by the transaction that is not yet committed. This command is used before SELECT statement.
When the transaction is committed or rolled back then there is no need to use NOLOCK function because the data is already released by the committed transaction.
Syntax: WITH(NOLOCK)
Example:
SELECT * FROM EmpDetails WITH(NOLOCK)
WITH(NOLCOK) is similar as READ UNCOMMITTED

12) What is difference between Co-related sub query and nested sub query?

Correlated subquery executes single time for every row which is selected by the outer query.
It has a reference to a value from the row selected by the outer query.
Nested subquery executes only once for the entire nesting (outer) query. It does not contain any reference to the outer query row.
For example,
Correlated Subquery:
select e.EmpFirstName, e.Salary, e.DeptId from Employee e where e.Salary = (select max(Salary) from Employee ee where ee.DeptId = e.DeptId)
Nested Subquery:
select EmpFirstName, Salary, DeptId from Employee where (DeptId, Salary) in (select DeptId, max(Salary) from Employee group by DeptId)

13) Differentiate UNION, MINUS, UNION ALL and INTERSECT?

INTERSECT - It will give all the distinct rows from both select queries.
MINUS - It will give distinct rows returned by the first query but not by the second query.
UNION - It will give all distinct rows selected by either first query or second query.
UNION ALL - It will give all rows returned by either query with all dublicates records

14) What is a join? Explain the different types of joins?

Using Join in a query, we can retrieve referenced columns or rows from multiple tables.
Following are different types of Joins:
1. JOIN: Return details from tables if there is at least one matching row in both tables
2. LEFT JOIN: It will return all rows from the left table, even if there are no matching row in the right table
3. RIGHT JOIN: It will return all rows from the right table, even if there is no matching row in the left table.
4. FULL JOIN: It will return rows when there is a match in either of tables.

15) What is DDL, DML and DCL?

SQL commands can be divided in three large subgroups.

DDL: The SQL commands which deals with database schemas and information of how the data will be generated in database are classified as Data Definition Language.

For example CREATE TABLE or ALTER TABLE belongs to DDL.
DML: The SQL commands which deals with data manipulation are classified as Data Manipulation Language.
For example SELECT, INSERT, etc.
DCL: The SQL commands which deal with rights and permission over the database are classified as DCL.
For example GRANT, REVOKE

Thursday, 12 March 2015

What is ETL (Extract, Transform, Load)

Facebook Commentbox

Thursday, 5 March 2015

Logical Design In Datawearhouse

Logical Data Model (LDM) -

Important Terms:

Facebook Commentbox

Tuesday, 3 March 2015

Physical design in datawearhouse

Facebook Commentbox

Sunday, 1 March 2015

SQL interview Qustions and Answers for freshers (List 2)

1) What is Index?

2) What is Trigger?

3) What is a NOLOCK?

4) What is the STUFF function and how does it differ from the REPLACE function?

5) What are Self Join and Cross Join?

6) What are the advantages of using Stored Procedures?

7) What is RANK function?

8) What are cursors and when they are useful?

9) What is Similarity and Difference between Truncate and Delete in SQL?Similarity

Both Truncate and Delete command will delete data from given table and they will not delete the table structure from the database.

Difference

10) What are COMMIT and ROLLBACK in SQL?

11) What is a WITH(NOLOCK)?

12) What is difference between Co-related sub query and nested sub query?

13) Differentiate UNION, MINUS, UNION ALL and INTERSECT?

14) What is a join? Explain the different types of joins?

15) What is DDL, DML and DCL?

Facebook Commentbox

Navigation

online user

Download

add