Monday, 18 May 2015

Hash Signatures in Data warehousing in detail





Introduction

This document intends to describe the process of using Hash values (also known as Hash Signatures) in Data warehousing in detail.

Slowly Changing Dimensions and Change data Capture

One of the biggest concerns that arise while working on slowly changing dimensions is to figure out an efficient method for capturing changed data. This issue becomes more critical if the source system does not have a timestamp or any kind of flag to indicate the changed data. One of the easiest methods available is using a brute force technique to identify the changed data. But this would result in the increase of time and cost of maintaining the data warehouse dramatically.

One solution for this is creating unique signature for every record that is to be inserted and store in the target table. The signature should be created based on the data in each record; therefore no two rows with a different set of data will have the same signature. Now compare the signature created from the source tables’ records with the signature stored in the target tables’ records. If there is a match then it means the same record has arrived again and we can simply reject it.

We can use either a CRC or a Hash function to generate these signatures. In this book we will discuss the steps to implement Hash signatures in a Data warehouse.

Concept of Hashing

We can implement Hashing in various databases. In our case we will use Oracle DB and Informatica – ETL tool to illustrate the concept.

In order to illustrate the concept we will use the following components in our example.

Source Table

 

 

 

Target Table: TYPE -2 Dimension

 



Get_sign () – Hash Function

Let’s call the user defined Hash function that we use here as get_sign (). This function should be created in the database. The steps to create this function are discussed later in this article. The input to this function will be all the fields from the source concatenated into one long string and its output will be a string which contains the Hash value for the input string.

We will begin with loading one row from the source to the target table. We will use 3 different cases to illustrate the process of Hashing.

Case- I

Consider the case when a new record (as shown below) arrives from the source.

 

 

 

First we need to concatenate all the fields (fields required to check for changed data) in a record into one string. Then pass this concatenated string into the get_sign () function.  The output of the get_sign function contains the Hash value for the record.

E.g. the query to fetch records from the source will look something like this.

Select Student_id, Student_Name, Marks1, Marks2, get_sign (Student_id|| Student_Name|| Marks1|| Marks2) from Student_Marks;

The output of the above query is shown below.

 

 

 

Now we have the Hash value of the record. To make sure that exactly a similar record is not present in the database already we need to compare the above generated hash value with the hash values of the records in the target. Since this is a new record and there are no records in the target table at this point of time, this record will be inserted.

 



 

Case- II

Consider the case when the above record which was inserted into the target is arriving again in the next load.

The same record arrives again as shown below.

 

 

 

 

We generate the Hash value for this record as well using the same procedure as in CASE - I

 

 

 

Now, Compare the Hash value of this record with the hash values of the records in the target. This time there will be a match as the same record was inserted earlier and hence we simply need to reject the record.

Case- III

In this case the same Student_Id arrives with a different set of values in the other fields (different marks).

 

 

 

We generate the Hash value for this record.

 

 

 

 Now when we try to match the Hash value of this record with the other records, there will not be a match. Since it is a TYPE – 2 dimension we insert this record and update the Meta data of the previous record of the same Student_Id.

 

 

 

 

Generating Hash Value using Oracle functions

The OUTLN_EDIT_PKG in oracle has the generate_signature () function.  The generate_signature () function takes one string as input and outputs the corresponding hash signature for the input string.

It is a good practice to wrap the generate_signature () function into a Deterministic user defined function get_sign () (the one which we used in our examples).  A deterministic function is one which produces the same output for a specific input. Hence Oracle can buffer the records for further use.

Also, we may want to replace any NULL values with a string of our choice because the generate_signature () function generates a hash value for NULL as well.

 

Example of a wrapping function:

CREATE OR REPLACE

FUNCTION get_sign (text VARCHAR2)

RETURN RAW

DETERMINISTIC

AS

signature RAW (32);

text2 VARCHAR2 (32767);

BEGIN

text2:=text;

IF text2 IS NULL THEN

text2:='1';

END IF;

sys.outln_edit_pkg.generate_signature (text2, signature);

RETURN signature;

END;

 


1 comment:

  1. You can use "select standard_hash(text2) from dual" as well

    ReplyDelete