Hash Signatures in Data warehousing in detail
Introduction
This
document intends to describe the process of using Hash values (also known as
Hash Signatures) in Data warehousing in detail.
Slowly Changing Dimensions and Change data Capture
One
of the biggest concerns that arise while working on slowly changing dimensions
is to figure out an efficient method for capturing changed data. This issue
becomes more critical if the source system does not have a timestamp or any
kind of flag to indicate the changed data. One of the easiest methods available
is using a brute force technique to identify the changed data. But this would
result in the increase of time and cost of maintaining the data warehouse
dramatically.
One
solution for this is creating unique signature for every record that is to be
inserted and store in the target table. The signature should be created based
on the data in each record; therefore no two rows with a different set of data
will have the same signature. Now compare the signature created from the source
tables’ records with the signature stored in the target tables’ records. If
there is a match then it means the same record has arrived again and we can
simply reject it.
We
can use either a CRC or a Hash function to generate these signatures. In this
book we will discuss the steps to implement Hash signatures in a Data
warehouse.
Concept of Hashing
We
can implement Hashing in various databases. In our case we will use Oracle DB
and Informatica – ETL tool to illustrate the concept.
In
order to illustrate the concept we will use the following components in our
example.
Source Table
Target Table: TYPE -2 Dimension
Get_sign () – Hash Function
Let’s call the user defined Hash function
that we use here as get_sign (). This function should be created in the
database. The steps to create this function are discussed later in this
article. The input to this function will be all the fields from the source
concatenated into one long string and its output will be a string which contains
the Hash value for the input string.
We
will begin with loading one row from the source to the target table. We will
use 3 different cases to illustrate the process of Hashing.
Case- I
Consider
the case when a new record (as shown below) arrives from the source.
First
we need to concatenate all the fields (fields required to check for changed
data) in a record into one string. Then pass this concatenated string into the
get_sign () function. The output of the
get_sign function contains the Hash value for the record.
E.g.
the query to fetch records from the source will look something like this.
Select Student_id, Student_Name, Marks1,
Marks2, get_sign
(Student_id|| Student_Name|| Marks1|| Marks2)
from Student_Marks;
The
output of the above query is shown below.
Now
we have the Hash value of the record. To make sure that exactly a similar
record is not present in the database already we need to compare the above
generated hash value with the hash values of the records in the target. Since
this is a new record and there are no records in the target table at this point
of time, this record will be inserted.
Case- II
Consider
the case when the above record which was inserted into the target is arriving
again in the next load.
The
same record arrives again as shown below.
We
generate the Hash value for this record as well using the same procedure as in CASE - I
Now,
Compare the Hash value of this record with the hash values of the records in
the target. This time there will be a match as the same record was inserted
earlier and hence we simply need to reject the record.
Case- III
In
this case the same Student_Id arrives with a different set of values in the
other fields (different marks).
We
generate the Hash value for this record.
Now when we try to match the Hash value of
this record with the other records, there will not be a match. Since it is a
TYPE – 2 dimension we insert this record and update the Meta data of the
previous record of the same Student_Id.
Generating Hash Value using Oracle functions
The OUTLN_EDIT_PKG in oracle has the
generate_signature () function. The
generate_signature () function takes one string as input and outputs the
corresponding hash signature for the input string.
It is
a good practice to wrap the generate_signature () function into a Deterministic
user defined function get_sign () (the one which we used in our examples). A deterministic function is one which
produces the same output for a specific input. Hence Oracle can buffer the
records for further use.
Also, we may want to replace any NULL values
with a string of our choice because the generate_signature () function
generates a hash value for NULL as well.
Example
of a wrapping function:
CREATE OR REPLACE
FUNCTION get_sign (text VARCHAR2)
RETURN RAW
DETERMINISTIC
AS
signature RAW (32);
text2 VARCHAR2 (32767);
BEGIN
text2:=text;
IF text2 IS NULL THEN
text2:='1';
END IF;
sys.outln_edit_pkg.generate_signature
(text2, signature);
RETURN signature;
END;
You can use "select standard_hash(text2) from dual" as well
ReplyDelete