Saturday, 16 May 2015

Data Warehouse Concepts - Part 1


Introduction to Database and Data Warehouse:

One of the most important assets of any organization is its information. This asset is almost always kept by an organization in two forms: the operational systems of record and the data warehouse. Crudely speaking, the operational systems are where the data is put in, and the data warehouse is where we get the data out.

Database is the place where information is maintained in any organization.

   If  you  use  the  database  for  operational  purpose  then  it  is  called  as  OLTP  (Online
Transactional Processing System)

   If you use the database for analysis purpose then it is called as OLAP (Online Analytical
Processing System)

Operational System (OLTP):

The users of an operational system (OLTP) turn the wheels of the organization. They take orders, sign  up  new  customers,  make  phone  calls,  withdraw  funds,  and  log  complaints  are  called transactions. Users of an operational system almost always deal with one or very less number of records  at  a  time  per  each  transaction  that  they  make.  They  repeatedly  perform  the  same operational tasks over and over (like you make several phone calls, place orders many times).

Example:

When you make a phone call, a record is loaded into an operational system with details like

   Who you are customer id, phone number

   Which circle are you calling from AP, Delhi, KA etc

   To whom you are making call phone number of receiver

   At what date and time you are making call

   Local/STD/ISD call?

   Duration of the call

   Total billed

   Total discount offer if you are under a certain plan (may be between 10pm – 6pm, it is free!)

   Which plan you are using, any discount offer etc

Data warehouse or OLAP:

The users of a data warehouse (OLAP) watch the wheels of the organization turn – they watch the growth or progress of organization and make strategic decisions for future. They may count the new orders and compare them with last week’s orders and ask why the new customers signed up and what the customers complained about. Users of a data warehouse almost never deal with one row
 
at a time. Rather, their questions often require that hundreds or thousands or sometimes millions of rows be searched and compressed into an answer set (or report).


Example:


When a business analyst wants compare the growth of new mobile subscribers in 3 different states

(AP, Delhi and KA) for the 5 years


   All the new connection information in the last 5 years has to be pulled (means huge data)


   Analysis can be based on customers’ age group or marital status or profession etc…


     Analysis can be based on regions’ related attributes like metros (tier 1 city), district head- quarter’s (tier2) or rural area etc..

   Analysis can be based on type of plan that the customer chose when took new connection


     Analysis can be based on type agent (whether it is full time or part time agent) and their experience etc

The primary difference between the application database and a data warehouse is that –


   OLTP is designed (and optimized) to record the activities/transactions


     OLAP is designed (and optimized) to respond to analysis based questions (that are answered from large set of data) that are critical for your business.

Some other key differences between OLTP and OLAP:
Feature
OLTP or Operational System
OLAP or Data warehouse
Historical data
No  (Current  data  -  continuous and periodical backup process)
Yes
Volatile (Changes)
High (SELECT, INSERT, UPDATE,
DELETE operations)
Low    or    No    Changes    (Only
SELECT Read only)
#Transactions/sec
High
Low
#Records Affected (or involved)
/ Transaction
Low
Very High
Response Time
Less than a sec or few seconds
Few seconds to minutes
Data Redundancy
Low  (Normalized   3rd   Normal
High  (De-Normalized   NO  3rd
(Redundancy is a problem when
data changes frequently like in case of OLTP)
form)

   Means more tables

   Means more joins*
Normal form. Only up to 2NF)

   Means less tables

   Means less joins*
Critical
Yes (mandatory)
No   (Analysis   is   optional,   but
recommended)
Tables are called
Entities
Dimensions or Facts
*Joins are costly (means dangerous) when you fetch many records from tables, but are quite acceptable when you fetch small number of  records. Since you generally deal with small number of  records in  OLTP per transaction, even though number of joins in OLTP is high, performance would not be bad.

Dimensional Modeling:

The overall term used to describe taking data structures in an OLTP format and holding the same data in an OLAP format is Dimensional Modeling.


Primarily a dimensional model contains 2 types of tables.

   Dimension tables

   Facts tables


Dimension: Dimensions contain category of information or logical groupings of data contains how or in which ways you want to analyze.

For example,


     the Time dimension contains all date/time related attributes (or columns) like date, year, month,   quarter,   week,   weekend_date,   #days   in   months,   #days_from_year_till_date,
#days_from_quarter_till_date, fiscal_year, fiscal_month, fiscal_quarter etc


     Product Dimension contains prod_id, prod_name, prod_desc, band_desc, prod_category, prod_sub_category, prod_market_price, prod_selling_price, prod_color, prod_weight, prod_weight_category etc


     Store   Dimension      contains   store_number,   store_name,   store_address,   store_city, store_state, store_zip, store_district, store_region etc



Fact Table: A fact table is a table that contains the measures of interest contains what you want to analyze. For example, sales amount* would be a measure. This measure is stored in the fact table as
one column.


Below picture depicts how dimension and fact tables are used in reporting:


Report gives Sum of Sales amount and Quantity Sold calculated by store district and by Brand. In this case, the Sales fact table would be automatically joined with Time, Product and Store Dimensions and SUM aggregate is applied on measures called Quantity Sold and Sales Amount.

Few definitions to help understand dimensional data modeling better:



    Attribute:  A  unique  level  or  a  column  within  a  dimension.  For  example,  Month  is  an attribute in the Time Dimension.

    Hierarchy:  The  specification  of  levels  that  represents  relationship  between  different attributes within a dimension.

   For example, one possible hierarchy in the Time dimension is Year Quarter

Month → Day.


     In Product Dimension, product name, product_sub_category and product_category can form a hierarchy (Product Category Product Sub Category Product).

    Granularity (Grain): The level of detail captured in the data warehouse or the meaning of a single row in a fact table.

The first step in designing a fact table is to determine the granularity of the fact table. By granularity, we mean the lowest level of information that will be stored in the fact
                  table. For example, in time dimension, usually day would be the granularity. But, if some
retail business wants to do analysis along at the hourly level? (i.e., looking at how certain products may sell by different hours of the day.), then it makes sense to use 'hour' as the lowest  level  of  granularity  than day.  Some  possible  granules  are  Customer by  Product, Customer by Product by District, Customer by Product by District by Day, Customer by Product by District by Day by Hour.

    Surrogate Key: A surrogate key is a system-generated (non-meaningful from a business perspective)  primary  key  for  purposes  of  ensuring  uniqueness  within  a  dimension. Surrogates  for  DW  are  usually  created  during  the  ETL  process.  Surrogates  are  usually numeric  data  elements  and  are  created  on  single  column  though  their  corresponding sources may contain multi-column primary keys (aka compound keys). This allows the database to query the single key column faster than it could multiple columns and hence the faster reporting.


                     Natural Key: It is a primary key in the source or operational or OLTP system. For example                          product ID or SKU number, Store ID etc




     These keys are only unique to their local database. Since data warehouse may pull data from several data sources, using the same Natural key coming from one of the sources as global warehouse key is not a good approach. So, Surrogate key is introduced to identify a record uniquely in each dimension.

     Natural keys can be any data type numeric, alphanumeric. However Surrogate keys are mostly numeric (recommended) numbers as primary keys saves space, and also improves performance when joined with other tables.

Each dimension (time, product, store etc) that you identify for the data model gets implemented as a dimension table (time_dim, product_dim, store_dim etc) in the data warehouse. Dimensions are the qualifiers that make the measures of the fact table meaningful, because they answer the what, when, and where aspects of a question. For example, consider the following business questions, for which the dimensions are italicized:

1.   What accounts produced the highest revenue last year?


2.   What was our profit by vendor?


3.   How many units were sold for each product?

In the above set of questions, revenue, profit, and units sold are measures (not dimensions), as each represents quantitative or factual data
 

ROLAP, MOLAP and HOLAP:

In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.

1.   MOLAP: This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats. Example for MOLAP products (cube technology) are Hyperion Solution's Essbase and Oracle's Express Server.  



 
Advantages:


Excellent performance: MOLAP cubes are built for fast data retrieval, and is optimal for slicing and dicing operations.

Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.

Disadvantages:


    Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the
cube itself. This is not to say that the data in the cube cannot be derived from a large
amount  of  data.  Indeed,  this  is  possible.  But  in  this  case,  only  summary-level information will be included in the cube itself.

    Requires additional investment: Cube technology are often proprietary and do not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.

2.   ROLAP : This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.

Advantages:


     Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.

     Can leverage functionalities inherent in the relational database: Often, relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.

Disadvantages:


     Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.

     Limited   by   SQL   functionalities:   Because   ROLAP   technology   mainly   relies   on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions.

3.   HOLAP: HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying
relational data.


Star and Snow-flake Schemas

Star Schema: Another name for the dimensional model is the star schema. This is the most recommend and commonly used dimensional model.

   





Snow-flake Schema:


The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. In a star schema, each dimension is represented by a single dimensional table, whereas in  a  snowflake  schema,  that  dimensional  table  is normalized into  multiple  lookup tables,  each representing a level in the dimensional hierarchy.


Snow-flake schemas normalize dimensions to eliminate redundancy. That is, the dimension data has been grouped into multiple tables instead of one large table. For example, a product dimension table in a star schema might be normalized into a products table, a product_category table, and a product_manufacturer table in a snowflake schema. While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance.

Note:  Oracle Corporation recommends you choose a star schema over a snowflake schema unless you have a clear reason not to.

Slowly Changing Dimensions:


Slowly Changing Dimensions (SCD) are dimensions that have data that slowly changes. This applies to cases where the attribute for a record (in dimension) varies over time.
For example, a customer first lived in Chicago, Illinois. So, the original entry in the customer table has the following record:
 


              



Customer Key
Customer ID
Name
State
101
2345
John
Illinois

 
At a later date, person moved to Los Angeles, California on January, 2003. How should the company now modify its customer table to reflect this change? This is the "Slowly Changing Dimension" problem.

There are in general three ways to solve this type of problem, and they are categorized as follows:

Type 1: No History Always current information is maintained. The new record replaces the original.

                                          
Customer Key
Customer ID
Name
State
101
2345
John
California
 

Type 2: Unlimited History is maintained. A new record is added into the customer dimension table.

                                          


Customer Key
Customer ID
Name
State
St_Eff_Dt
End_Eff_Dt
101
2345
John
Illinois
01-Jan-2001
01-Jan-2003
105
2345
John
California
01-Jan-2003
01-Jan-3000





Some additional control fields to indicate the effective period of information in each record is also maintained. Here St_Eff_Dt and End_Eff_Dt tells that John was in Illinois between 01-Jan-2001 and
01-Jan-2003, and then in California from 01-Jan-2003 to till date (till date here is indicated by 01-Jan-

3000, or else as a practice you can use any bigger future date is a special date which for which current date or sysdate is always true).



 



No comments:

Post a Comment