Abstract
This article explains several important PowerCenter development best practices, covering lookups, workflow performance, pre/post-session commands, sequence generators, and FTP connections.
Content overview
- Lookup - performance considerations
- Workflow performance - basic considerations
- Pre/Post-Session commands - uses
- Sequence generator - design considerations
- FTP Connection object - platform independence
1. Lookup - Performance considerations
What is a Lookup transformation? It is not just another transformation that fetches data to check against the source data. A Lookup is an important and useful transformation when used effectively; used improperly, it can severely impair the performance of your mapping. Let us look at the different scenarios where a Lookup can cause problems, and how to tackle them.
1.1. Unwanted columns
By default, when you create a lookup on a table, PowerCenter gives you all the columns in the table. If a column is needed for neither the lookup condition nor the return value, delete it from the transformation. Leaving unwanted columns in place needlessly increases the cache size.
1.2. Size of the source versus size of lookup
Say you have 10 rows in the source, and one of the columns has to be checked against a big table (1 million rows). PowerCenter builds a cache of the lookup table and then checks the 10 source rows against that cache. Building a cache of 1 million rows takes far longer than querying the database 10 times directly. Use an uncached lookup instead of building a static cache whenever the number of source rows is much smaller than the size of the lookup table.
1.3. JOIN instead of Lookup
In the same context as above, if the Lookup transformation sits right after the source qualifier with no active transformation in between, you can instead use a SQL override in the source qualifier and join to the lookup table with a traditional database join, provided both tables are in the same database and schema.
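As a sketch, suppose a hypothetical source table ORDERS would otherwise feed a lookup on CUSTOMERS (all names here are illustrative). The source qualifier's SQL override can perform the join directly:

    -- Hypothetical source qualifier SQL override: the database join
    -- replaces a separate Lookup transformation on CUSTOMERS.
    SELECT o.order_id,
           o.customer_id,
           o.order_amount,
           c.customer_name,   -- column previously returned by the lookup
           c.credit_limit
    FROM   orders o
    LEFT OUTER JOIN customers c
           ON c.customer_id = o.customer_id   -- the old lookup condition

The left outer join mimics the default Lookup behaviour of returning NULLs when no match is found.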
1.4. Conditional call of lookup
Instead of using a connected lookup with filters to achieve a conditional lookup call, use an unconnected lookup. Does the single-column return value of an unconnected lookup bother you? Change the SQL override to concatenate the required columns into one big column, and split it back into individual columns on the calling side.
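For instance, assuming Oracle-style concatenation and hypothetical column names, the lookup SQL override could return a single packed port:

    -- Hypothetical unconnected-lookup SQL override: pack the return
    -- columns into one delimited string.
    SELECT customer_id,
           customer_name || '~' || city || '~' || TO_CHAR(credit_limit) AS lkp_return
    FROM   customers

The calling expression then splits lkp_return on the '~' delimiter (for example with SUBSTR/INSTR logic in an Expression transformation) to recover the individual columns.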
1.5. SQL query
Find the execution plan of the lookup SQL and see whether you can add indexes or hints to make the query fetch data faster. You may need the help of a database developer to accomplish this if you are not comfortable with SQL yourself.
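On Oracle, for example, you could inspect the plan of a hypothetical lookup query and then index the lookup condition column:

    -- Show how the database would execute the lookup SQL.
    EXPLAIN PLAN FOR
      SELECT customer_id, customer_name, city FROM customers;
    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

    -- If the plan shows an expensive scan on the lookup condition
    -- column, an index on that column may help (name is illustrative).
    CREATE INDEX idx_customers_cust_id ON customers (customer_id);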
1.6. Increase cache
If none of the above options improves performance, the problem may lie with the cache: the cache you assigned for the lookup is not sufficient to hold its data or index. Whatever does not fit into the cache spills into the cache files designated in $PMCacheDir. When PowerCenter does not find the data it is looking for in the cache, it swaps data in from the files and keeps doing so until the data is found. This is quite expensive because the operation is very I/O intensive. To prevent it, increase the cache size so the entire data set resides in memory. When increasing the cache, stay aware of the system constraints: if the cache size exceeds the available resources, the session will fail for lack of resources.
1.7. Cachefile file-system
In many cases, if the cache directory resides on a file system that is not local to the hosting server, writing out the cache files can take time and introduce latency. Look into this aspect as well, with the help of your system administrators.
1.8. Useful cache utilities
If the same lookup SQL is used by another lookup, use a shared cache or a reusable Lookup transformation. Likewise, if a table's data changes rarely, use the persistent cache option to build the cache once and reuse it across consecutive runs.
2. Workflow performance – basic considerations
Though performance tuning is often the most feared part of development, it is among the easiest once its intricacies are understood. With each new version of PowerCenter, developers get added flexibility to build better-performing workflows. The major building blocks of performance are the design of the mapping and, where databases are involved, SQL tuning.
Regarding the design of the mapping, I have a few basic considerations to share. Please note that these are not rules of thumb, but they will help you act sensibly in different scenarios.
- Think twice before using an Update Strategy, even though it adds a certain level of flexibility to the mapping. If you have a straight-through mapping that takes data from the source and directly inserts all records into the target, you do not need an Update Strategy.
- Use a pre-SQL delete statement if you wish to remove specific rows from the target before loading, and use the truncate option in the session properties if you wish to empty the table entirely (a sketch follows this list). Avoid a separate pipeline with an Update Strategy transformation that runs before the load.
- Suppose you have 3 sources and 3 targets with one-to-one mappings. If the loads are independent according to the business requirement, create 3 different mappings and 3 different session instances and run them all in parallel in the workflow after the Start task. I have observed the workflow runtime come down to 30-60% of the serial runtime.
- PowerCenter is built to work on high volumes of data, so keep the server fully busy: introduce parallelism into the mapping and workflow wherever possible.
- When using a Joiner or Aggregator transformation, sort the data on the join keys or group-by columns before it reaches these transformations to cut processing time.
- Do filtering at the database level rather than within the mapping; the database engine is much more efficient at filtering than PowerCenter.
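As a sketch of the pre-SQL delete mentioned above (the target table and load-date column are hypothetical):

    -- Target pre-SQL: remove only the rows about to be reloaded.
    DELETE FROM sales_target
    WHERE  load_date = TRUNC(SYSDATE);

    -- If the whole table is rebuilt each run, prefer the session's
    -- truncate option instead, which is the equivalent of:
    TRUNCATE TABLE sales_target;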
The above examples are just some things to consider when
tuning a mapping.
2.1. SQL tuning
SQL queries and actions occur in PowerCenter in the following places:
- Relational source qualifier
- Lookup SQL override
- Stored procedures
- Relational target
Using the execution plan to tune a query is the best way to understand how the database will process the data. Some things to keep in mind when reading an execution plan: full table scans are not evil, indexes are not always fast, and indexes can be slow too. Analyse the table data to see whether picking 20 records out of 20 million is better served by an index or by a table scan; fetching 10 records out of 15, for instance, may well be faster with a full table scan than through an index.
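On Oracle, for instance, hints let you test both access paths against real data (the table and index names are hypothetical):

    -- Force a full table scan, bypassing any index:
    SELECT /*+ FULL(t) */ *
    FROM   transactions t
    WHERE  status = 'OPEN';

    -- Force a specific index for comparison:
    SELECT /*+ INDEX(t idx_transactions_status) */ *
    FROM   transactions t
    WHERE  status = 'OPEN';

Compare the two plans and timings before deciding which path the query should take.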
Indexes on the relational target often create performance problems when loading records into it. If the indexes are needed for other purposes, consider dropping them at load time and rebuilding them in post-SQL. When dropping indexes on a target, weigh the integrity constraints involved and compare the time it takes to rebuild the indexes after the load against the actual load time saved.
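A minimal sketch of the drop-and-rebuild approach, with a hypothetical target index:

    -- Target pre-SQL: drop the index before the bulk load.
    DROP INDEX idx_sales_target_cust;

    -- Target post-SQL: rebuild it once the load completes.
    CREATE INDEX idx_sales_target_cust ON sales_target (customer_id);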
3. Pre/Post-Session commands - Uses
- It is good practice to email the success or failure status of a task once it completes. Likewise, when a business requirement calls for it, make use of the post-session success and failure emails for proper communication.
- The built-in email feature offers more flexibility, with session logs as attachments, and also provides other run-time data such as the workflow run instance ID.
- Any archiving activity around source and target flat files can be managed within the session itself, using the flat-file command support in the session properties that is new in PowerCenter 8.6. For example, after writing the flat-file target, you can set up a command to zip the file to save space.
- If the target flat files need any editing of data that your mapping cannot accommodate, write a shell/batch script and call it in the post-session command task. I prefer weighing the trade-offs between PowerCenter capabilities and OS capabilities in these scenarios.
4. Sequence generator – design considerations
In most cases, I would advise you to avoid the Sequence Generator transformation when populating an ID column in a relational target table. Instead, create a sequence on the target database and enable a trigger on that table to fetch the value from the database sequence.
There are many advantages to using a database sequence generator:
- Fewer PowerCenter objects are present in the mapping, which reduces development time and maintenance effort.
- ID generation is independent of PowerCenter if a different application is used in the future to populate the target.
- Migration between environments is simplified because there is no additional overhead of carrying over the persistent values of the Sequence Generator from the repository database.
In all of the above cases, a sequence created in the target database makes life much easier both for table data maintenance and for PowerCenter development. In fact, databases have mechanisms focused specifically on sequences, so in effect you implement a manual push-down optimization in your PowerCenter mapping design.
DBAs will always complain about triggers on their databases, but I would still insist on the sequence-trigger combination, even for huge volumes of data.
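A minimal Oracle-style sketch of the sequence-trigger combination (all names are hypothetical):

    CREATE SEQUENCE seq_sales_target_id START WITH 1 INCREMENT BY 1 CACHE 1000;

    CREATE OR REPLACE TRIGGER trg_sales_target_id
    BEFORE INSERT ON sales_target
    FOR EACH ROW
    BEGIN
      -- Populate the ID column from the database sequence, keeping
      -- ID generation independent of PowerCenter.
      :NEW.sale_id := seq_sales_target_id.NEXTVAL;
    END;
    /

With this in place, the mapping simply leaves the ID column unconnected and the database fills it in on insert.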
5. FTP Connection object – platform independence
If you have files to be read as a source from a Windows server while your PowerCenter server is hosted on UNIX/Linux, make use of FTP users on the Windows server and use a file reader with an FTP connection object.
This connection object can be added like any other connection, and it gives you the flexibility of platform independence. It also removes the overhead of maintaining Samba mounts on the Informatica boxes.