Informatica Quick Tips: Data Warehouse Concepts

Introduction to Database and Data Warehouse:

One of the most important assets of any organization is its information. This asset is almost always kept by an organization in two forms: the operational systems of record and the data warehouse. Crudely speaking, the operational systems are where the data is put in, and the data warehouse is where we get the data out.

Database is the place where information is maintained in any organization.

• If you use the database for operational purpose then it is called as OLTP (Online

Transactional Processing System)

• If you use the database for analysis purpose then it is called as OLAP (Online Analytical

Processing System)

Operational System (OLTP):

The users of an operational system (OLTP) turn the wheels of the organization. They take orders, sign up new customers, make phone calls, withdraw funds, and log complaints are called transactions. Users of an operational system almost always deal with one or very less number of records at a time per each transaction that they make. They repeatedly perform the same operational tasks over and over (like you make several phone calls, place orders many times).

Example:

When you make a phone call, a record is loaded into an operational system with details like…

• Who you are – customer id, phone number

• Which circle are you calling from – AP, Delhi, KA etc

• To whom you are making call – phone number of receiver

• At what date and time you are making call

• Local/STD/ISD call?

• Duration of the call

• Total billed

• Total discount offer if you are under a certain plan (may be between 10pm – 6pm, it is free!)

• Which plan you are using, any discount offer etc…

Data warehouse or OLAP:

The users of a data warehouse (OLAP) watch the wheels of the organization turn – they watch the growth or progress of organization and make strategic decisions for future. They may count the new orders and compare them with last week’s orders and ask why the new customers signed up and what the customers complained about. Users of a data warehouse almost never deal with one row

at a time. Rather, their questions often require that hundreds or thousands or sometimes millions of rows be searched and compressed into an answer set (or report).

Example:

When a business analyst wants compare the growth of new mobile subscribers in 3 different states

(AP, Delhi and KA) for the 5 years

• All the new connection information in the last 5 years has to be pulled (means huge data)

• Analysis can be based on customers’ age group or marital status or profession etc…

• Analysis can be based on regions’ related attributes like metros (tier 1 city), district head- quarter’s (tier2) or rural area etc..

• Analysis can be based on type of plan that the customer chose when took new connection

• Analysis can be based on type agent (whether it is full time or part time agent) and their experience etc…

The primary difference between the application database and a data warehouse is that –

• OLTP is designed (and optimized) to record the activities/transactions

• OLAP is designed (and optimized) to respond to analysis based questions (that are answered from large set of data) that are critical for your business.

Some other key differences between OLTP and OLAP:

Feature	OLTP or Operational System	OLAP or Data warehouse
Historical data	No (Current data - continuous and periodical backup process)	Yes
Volatile (Changes)	High (SELECT, INSERT, UPDATE, DELETE operations)	Low or No Changes (Only SELECT – Read only)
#Transactions/sec	High	Low
#Records Affected (or involved) / Transaction	Low	Very High
Response Time	Less than a sec or few seconds	Few seconds to minutes
Data Redundancy	Low (Normalized – 3rd Normal	High (De-Normalized – NO 3rd

(Redundancy is a problem when data changes frequently like in case of OLTP)	form) • Means more tables • Means more joins*	Normal form. Only up to 2NF) • Means less tables • Means less joins*
Critical	Yes (mandatory)	No (Analysis is optional, but recommended)
Tables are called	Entities	Dimensions or Facts

*Joins are costly (means dangerous) when you fetch many records from tables, but are quite acceptable when you fetch small number of records. Since you generally deal with small number of records in OLTP per transaction, even though number of joins in OLTP is high, performance would not be bad.

Dimensional Modeling:

The overall term used to describe taking data structures in an OLTP format and holding the same data in an OLAP format is “Dimensional Modeling”.

Primarily a dimensional model contains 2 types of tables.

• Dimension tables

• Facts tables

Dimension: Dimensions contain category of information or logical groupings of data – contains how or in which ways you want to analyze.

For example,

• the Time dimension – contains all date/time related attributes (or columns) like date, year, month, quarter, week, weekend_date, #days in months, #days_from_year_till_date,

#days_from_quarter_till_date, fiscal_year, fiscal_month, fiscal_quarter etc…

• Product Dimension – contains prod_id, prod_name, prod_desc, band_desc, prod_category, prod_sub_category, prod_market_price, prod_selling_price, prod_color, prod_weight, prod_weight_category etc…

• Store Dimension – contains store_number, store_name, store_address, store_city, store_state, store_zip, store_district, store_region etc…

Fact Table: A fact table is a table that contains the measures of interest – contains what you want to analyze. For example, sales amount* would be a measure. This measure is stored in the fact table as

one column.

Below picture depicts how dimension and fact tables are used in reporting:

Report gives Sum of Sales amount and Quantity Sold calculated by store district and by Brand. In this case, the Sales fact table would be automatically joined with Time, Product and Store Dimensions and SUM aggregate is applied on measures called Quantity Sold and Sales Amount.

Few definitions to help understand dimensional data modeling better:

• Attribute: A unique level or a column within a dimension. For example, Month is an attribute in the Time Dimension.

• Hierarchy: The specification of levels that represents relationship between different attributes within a dimension.

• For example, one possible hierarchy in the Time dimension is Ye→ar Quarter →

Month → Day.

• In Product Dimension, product name, product_sub_category and product_category can form a hierarchy (Product Category → Product Sub Category → Product).

• Granularity (Grain): The level of detail captured in the data warehouse or the meaning of a single row in a fact table.

The first step in designing a fact table is to determine the granularity of the fact table. By granularity, we mean the lowest level of information that will be stored in the fact

table. For example, in time dimension, usually day would be the granularity. But, if some

retail business wants to do analysis along at the hourly level? (i.e., looking at how certain products may sell by different hours of the day.), then it makes sense to use 'hour' as the lowest level of granularity than day. Some possible granules are Customer by Product, Customer by Product by District, Customer by Product by District by Day, Customer by Product by District by Day by Hour.

• Surrogate Key: A surrogate key is a system-generated (non-meaningful from a business perspective) primary key for purposes of ensuring uniqueness within a dimension. Surrogates for DW are usually created during the ETL process. Surrogates are usually numeric data elements and are created on single column though their corresponding sources may contain multi-column primary keys (aka compound keys). This allows the database to query the single key column faster than it could multiple columns and hence the faster reporting.

• Natural Key: It is a primary key in the source or operational or OLTP system. For example, product ID or SKU number, Store ID etc

• These keys are only unique to their local database. Since data warehouse may pull data from several data sources, using the same Natural key coming from one of the sources as global warehouse key is not a good approach. So, Surrogate key is introduced to identify a record uniquely in each dimension.

• Natural keys can be any data type – numeric, alphanumeric. However Surrogate keys are mostly numeric (recommended) – numbers as primary keys saves space, and also improves performance when joined with other tables.

Each dimension (time, product, store etc…) that you identify for the data model gets implemented as a dimension table (time_dim, product_dim, store_dim etc…) in the data warehouse. Dimensions are the qualifiers that make the measures of the fact table meaningful, because they answer the what, when, and where aspects of a question. For example, consider the following business questions, for which the dimensions are italicized:

1. What accounts produced the highest revenue last year?

2. What was our profit by vendor?

3. How many units were sold for each product?

In the above set of questions, revenue, profit, and units sold are measures (not dimensions), as each represents quantitative or factual data

ROLAP, MOLAP and HOLAP:

In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.

1. MOLAP: This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats. Example for MOLAP products (cube technology) are Hyperion Solution's Essbase and Oracle's Express Server.

Advantages:

Excellent performance: MOLAP cubes are built for fast data retrieval, and is optimal for slicing and dicing operations.

Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.

Disadvantages:

• Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the

cube itself. This is not to say that the data in the cube cannot be derived from a large

amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.

• Requires additional investment: Cube technology are often proprietary and do not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.

2. ROLAP : This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.

Advantages:

• Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.

• Can leverage functionalities inherent in the relational database: Often, relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.

Disadvantages:

• Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.

• Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions.

3. HOLAP: HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying

relational data.

Star and Snow-flake Schemas

Star Schema: Another name for the dimensional model is the star schema. This is the most recommend and commonly used dimensional model.

Snow-flake Schema:

The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. In a star schema, each dimension is represented by a single dimensional table, whereas in a snowflake schema, that dimensional table is normalized into multiple lookup tables, each representing a level in the dimensional hierarchy.

Snow-flake schemas normalize dimensions to eliminate redundancy. That is, the dimension data has been grouped into multiple tables instead of one large table. For example, a product dimension table in a star schema might be normalized into a products table, a product_category table, and a product_manufacturer table in a snowflake schema. While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance.

Note: Oracle Corporation recommends you choose a star schema over a snowflake schema unless you have a clear reason not to.

Slowly Changing Dimensions:

Slowly Changing Dimensions (SCD) are dimensions that have data that slowly changes. This applies to cases where the attribute for a record (in dimension) varies over time.

For example, a customer first lived in Chicago, Illinois. So, the original entry in the customer table has the following record:

Customer Key	Customer ID	Name	State
101	2345	John	Illinois

At a later date, person moved to Los Angeles, California on January, 2003. How should the company now modify its customer table to reflect this change? This is the "Slowly Changing Dimension" problem.

There are in general three ways to solve this type of problem, and they are categorized as follows:

Type 1: No History – Always current information is maintained. The new record replaces the original.

Customer Key	Customer ID	Name	State
101	2345	John	California

Type 2: Unlimited History is maintained. A new record is added into the customer dimension table.

Customer Key	Customer ID	Name	State	St_Eff_Dt	End_Eff_Dt
101	2345	John	Illinois	01-Jan-2001	01-Jan-2003
105	2345	John	California	01-Jan-2003	01-Jan-3000

Some additional control fields to indicate the effective period of information in each record is also maintained. Here St_Eff_Dt and End_Eff_Dt tells that John was in Illinois between 01-Jan-2001 and

01-Jan-2003, and then in California from 01-Jan-2003 to till date (till date here is indicated by 01-Jan-

3000, or else as a practice you can use any bigger future date is a special date which for which current date or sysdate is always true).

Informatica Quick Tips

Saturday, 16 May 2015

Data Warehouse Concepts - Part 1

Star and Snow-flake Schemas

No comments:

Post a Comment