Introduction to Database
and Data Warehouse:
One of the most important assets of any organization is its information. This asset is almost always kept by an
organization in
two forms: the operational systems of record and the data warehouse. Crudely speaking, the operational systems are where the data is put in, and the data warehouse is where we get the data out.
Database is the place where information is maintained
in any organization.
• If you
use the
database
for operational
purpose then it is called as OLTP (Online
Transactional
Processing System)
• If you use the database for analysis purpose then it is called
as OLAP (Online Analytical
Processing
System)
Operational System
(OLTP):
The users of an operational system (OLTP) turn the wheels of the organization. They take orders,
sign up new customers,
make phone
calls,
withdraw
funds, and
log complaints
are
called
transactions. Users of an operational system almost always deal with one or very less number of
records
at a time per each transaction that
they
make.
They
repeatedly
perform
the same operational
tasks over and over (like you make several phone calls, place orders many times).
Example:
When you make a phone call, a
record is loaded into an operational
system with details like…
• Who you are – customer id, phone number
• Which circle are you calling from – AP, Delhi, KA etc
• To whom you are making call – phone number of receiver
• At what date and time you are making
call
• Local/STD/ISD call?
• Duration of the call
• Total billed
• Total discount offer
if you are under
a certain plan (may be between 10pm
– 6pm, it is free!)
• Which plan you are
using, any discount offer etc…
Data
warehouse or
OLAP:
The users of a data warehouse (OLAP) watch the wheels of the organization
turn – they watch the
growth
or
progress of organization and make strategic decisions for future. They may count the new orders and compare them with last week’s orders and ask why the new customers signed up and
what the customers complained about. Users of a data warehouse almost never deal with one row
at a time. Rather, their questions often require that hundreds or thousands or sometimes millions of rows be searched and compressed
into an answer set (or
report).
Example:
When a business analyst wants compare the growth of new mobile subscribers in 3 different states
(AP, Delhi and KA) for
the
5 years
• All the new connection information in the last 5 years has to be pulled
(means huge data)
• Analysis can be based on customers’ age group or marital status or profession
etc…
• Analysis can be based on regions’ related attributes like metros (tier 1 city), district head- quarter’s (tier2) or
rural area etc..
• Analysis can
be
based on type of
plan that the customer chose when took new connection
• Analysis can be based on type agent (whether it is full time or part time agent) and their
experience etc…
The primary difference between the application
database and a data warehouse is that –
• OLTP is designed (and optimized) to record
the
activities/transactions
• OLAP is designed (and optimized) to respond to analysis based questions (that are answered from large set of
data) that are critical
for your business.
Some other key differences between OLTP and
OLAP:
Feature
|
OLTP or
Operational System
|
OLAP or Data warehouse
|
Historical data
|
No (Current
data - continuous and
periodical
backup process)
|
Yes
|
Volatile (Changes)
|
High (SELECT,
INSERT, UPDATE,
DELETE operations)
|
Low or No Changes (Only
SELECT – Read only)
|
#Transactions/sec
|
High
|
Low
|
#Records Affected (or involved)
/ Transaction
|
Low
|
Very High
|
Response Time
|
Less than a sec or
few seconds
|
Few seconds to minutes
|
Data Redundancy
|
Low (Normalized – 3rd Normal
|
High
(De-Normalized – NO
3rd
|
(Redundancy is a problem when
data changes frequently like in
case of OLTP)
|
form)
• Means more tables
• Means more joins*
|
Normal form. Only up to 2NF)
• Means less tables
• Means less joins*
|
Critical
|
Yes (mandatory)
|
No (Analysis is
optional, but
recommended)
|
Tables are called
|
Entities
|
Dimensions or Facts
|
*Joins are costly (means dangerous) when you fetch many records from
tables, but are quite acceptable when you fetch small number of records. Since you generally deal with small number of records in
OLTP per transaction, even though number of joins in OLTP is high, performance would not be bad.
Dimensional
Modeling:
The overall term used to describe taking data structures in an OLTP format and holding the same
data in an OLAP format is “Dimensional Modeling”.
Primarily a dimensional model
contains 2 types of tables.
• Dimension tables
• Facts tables
Dimension: Dimensions contain category of information or logical groupings of data – contains how
or in which ways you want to analyze.
For example,
• the Time dimension – contains all date/time related attributes (or columns) like date, year, month, quarter, week, weekend_date, #days in months, #days_from_year_till_date,
#days_from_quarter_till_date, fiscal_year, fiscal_month, fiscal_quarter
etc…
• Product Dimension – contains prod_id, prod_name, prod_desc, band_desc, prod_category,
prod_sub_category,
prod_market_price, prod_selling_price, prod_color, prod_weight, prod_weight_category etc…
• Store Dimension – contains store_number, store_name,
store_address,
store_city, store_state, store_zip, store_district, store_region
etc…
Fact Table: A fact table is a
table that contains the measures of interest – contains what you
want to analyze. For example, sales amount* would be a measure. This measure is stored in the fact table as
one column.
Below picture
depicts how dimension
and fact tables are used
in reporting:
Report gives Sum of Sales amount and Quantity Sold calculated by store district and by Brand. In this
case, the Sales fact table would be automatically joined with Time, Product and Store Dimensions and
SUM
aggregate is applied
on
measures called
Quantity Sold
and Sales Amount.
• Attribute: A unique level or
a
column within
a
dimension.
For
example, Month is
an
attribute in the Time Dimension.
• Hierarchy: The
specification
of levels
that
represents
relationship
between different attributes within a dimension.
•
For example, one possible hierarchy in the Time dimension is Ye→ar Quarter →
Month → Day.
• In Product Dimension, product name, product_sub_category and product_category can form a hierarchy (Product Category → Product Sub
Category → Product).
• Granularity (Grain): The level of detail captured in the data warehouse or the meaning of a
single row in a fact table.
The first step in designing a fact table is to determine the granularity of the fact table. By granularity, we mean the lowest level of information that will be stored in the fact
table. For example, in time dimension, usually day would be the granularity. But, if some
retail business
wants to do analysis along at the hourly level? (i.e., looking at how certain products
may sell by different hours of
the
day.), then it
makes sense to use 'hour' as the
lowest
level of granularity
than day. Some possible granules
are Customer by
Product, Customer by Product by District, Customer by Product by District by Day, Customer by
Product by District by Day by Hour.
• Surrogate Key: A surrogate key is a system-generated (non-meaningful from a business
perspective)
primary
key for purposes of ensuring uniqueness
within
a dimension.
Surrogates for DW are
usually
created during the ETL process.
Surrogates
are usually numeric data elements and are created on single column
though their
corresponding sources may contain multi-column primary keys (aka compound keys). This allows the database to query the single key column faster
than it could
multiple columns and hence the
faster reporting.
• Natural Key: It is a primary key in the source or operational or OLTP system. For example, product ID or SKU number, Store ID etc
• These keys are only unique to their local database. Since data warehouse may pull data from several data sources, using the same Natural key coming from one of
the sources as global warehouse key is not a good approach. So, Surrogate key is
introduced to identify a record uniquely in each dimension.
• Natural
keys can be any data type – numeric, alphanumeric.
However
Surrogate keys
are mostly numeric (recommended) – numbers as primary keys saves space, and also improves performance when joined with other tables.
Each dimension
(time, product, store etc…) that you
identify for
the
data model gets implemented as a dimension table (time_dim, product_dim, store_dim etc…) in the data
warehouse. Dimensions are the
qualifiers that make the
measures of
the fact table meaningful,
because they answer
the what, when, and where aspects of a question. For example, consider the following business questions, for
which the dimensions are italicized:
1. What accounts produced
the highest revenue last year?
2. What was our profit by vendor?
3. How many units were sold for each product?
In the above set of questions, revenue, profit, and
units sold are measures (not dimensions), as each represents quantitative or factual
data
ROLAP, MOLAP and HOLAP:
In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and
Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies
that combine MOLAP and
ROLAP.
1. MOLAP: This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multidimensional cube.
The storage is not in the relational database, but in proprietary
formats. Example for MOLAP products (cube technology) are Hyperion Solution's Essbase and Oracle's Express Server.
Advantages:
Excellent performance: MOLAP cubes are built for fast data retrieval, and is optimal for slicing
and dicing operations.
Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages:
• Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the
cube itself. This is not to say that the data in the cube cannot be derived
from a large
amount
of data.
Indeed, this
is
possible. But
in
this case, only summary-level information will be included
in the cube itself.
• Requires additional investment: Cube technology are often proprietary and do not
already exist in the organization. Therefore, to adopt MOLAP technology, chances
are additional investments in human and
capital resources are needed.
2. ROLAP : This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence,
each action of slicing and
dicing is equivalent to adding a "WHERE" clause in the SQL statement.
Advantages:
• Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of
the
underlying relational database. In other words, ROLAP itself
places no limitation on
data amount.
• Can leverage functionalities inherent in the relational database: Often, relational database already comes with a host of functionalities. ROLAP technologies, since
they sit on top of the relational database, can therefore leverage these functionalities.
Disadvantages:
• Performance can be slow: Because each ROLAP report is essentially
a SQL query (or
multiple SQL queries) in the relational database, the query time can be long
if the
underlying
data size is large.
• Limited by SQL functionalities:
Because ROLAP technology mainly relies on
generating SQL statements
to query the relational database, and SQL statements do
not fit all needs (for example, it is difficult to perform complex calculations using
SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have
mitigated this risk by building into the tool out-of-the-box
complex functions as well
as
the ability to allow users to define their own
functions.
3. HOLAP: HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For
summary-type information, HOLAP leverages cube technology for
faster performance.
When
detail information is needed, HOLAP can "drill through" from the cube into the underlying
relational data.
Star and Snow-flake Schemas
Star Schema: Another name for the dimensional
model is the star schema. This is the most recommend and
commonly used
dimensional model.
Snow-flake Schema:
The snowflake schema is an extension
of the star schema, where each point of the star explodes into more points. In a star
schema, each dimension is
represented by a single dimensional table, whereas in
a snowflake schema,
that
dimensional table
is normalized into multiple
lookup tables,
each representing a level in the dimensional
hierarchy.
Snow-flake schemas normalize dimensions to eliminate redundancy. That is, the dimension
data has been
grouped
into multiple tables instead of one large table. For example, a product dimension
table in a star schema might be normalized
into a products table, a
product_category table, and
a product_manufacturer
table in a snowflake schema. While this saves space, it increases the number
of dimension tables and
requires more foreign key joins. The result is more complex queries and reduced
query performance.
Slowly Changing Dimensions:
Slowly Changing
Dimensions (SCD) are dimensions that have data
that slowly changes. This applies
to cases where the attribute for a record (in
dimension) varies over time.
For example, a customer first lived in Chicago, Illinois. So, the original entry in
the
customer table has the following record:
Customer Key
|
Customer ID
|
Name
|
State
|
101
|
2345
|
John
|
Illinois
|
At a
later
date, person moved to Los Angeles, California
on January, 2003.
How should
the
company now modify its customer table to reflect this
change? This is the "Slowly Changing Dimension" problem.
There are in general three ways to solve this type of problem, and
they are categorized as follows:
Type 1: No History – Always current information
is maintained. The new record replaces the original.
Customer Key
|
Customer ID
|
Name
|
State
|
101
|
2345
|
John
|
California
|
Type 2: Unlimited
History is maintained.
A new record is added into the customer dimension table.
Customer Key
|
Customer ID
|
Name
|
State
|
St_Eff_Dt
|
End_Eff_Dt
|
101
|
2345
|
John
|
Illinois
|
01-Jan-2001
|
01-Jan-2003
|
105
|
2345
|
John
|
California
|
01-Jan-2003
|
01-Jan-3000
|
Some additional
control fields to indicate the effective period of
information in each
record is also
maintained. Here St_Eff_Dt and End_Eff_Dt tells that John
was in Illinois between 01-Jan-2001 and
01-Jan-2003, and then
in California from 01-Jan-2003 to till
date (till date here is indicated by 01-Jan-
3000, or else as a practice you can use any bigger
future date is a
special
date which for which
current date or sysdate is always true).
No comments:
Post a Comment