Informatica Map/Session Tuning
Covers BASIC, INTERMEDIATE, AND ADVANCED TUNING PRACTICES.
Table of Contents
* Basic Guidelines
* Intermediate Guidelines
* Advanced Guidelines
INFORMATICA
BASIC TUNING GUIDELINES
The following points are high-level suggestions on where to look first when
"tuning" Informatica's products. They are not permanent instructions, nor are
they an end-all solution - just some items which, if tuned first, might make a
difference. Results will vary with the level of skill available for each item.
To test performance throughput, it is generally recommended that the source
produce about 200,000 rows to process. Beyond this, the performance problems /
issues may lie in the database - partitioning tables, dropping / re-creating
indexes, striping RAID arrays, etc... Without a result set this large, your
average timings will be skewed by other users on the database, processes on the
server, or network traffic. This seems to be an ideal test size for producing
mostly accurate averages.
Try tuning your maps with these steps first, then move to tuning the session;
iterate this sequence until you are happy, or cannot achieve better performance
through continued effort. If the performance is still not acceptable, then the
architecture must be tuned (which can mean changes to which maps are created).
In this case, you can contact us - we tune the architecture and the whole
system from top to bottom.
KEEP THIS IN MIND:
In order to achieve optimal performance,
it's always a good idea to strike a balance between the tools, the database,
and the hardware resources. Allow each
to do what they do best. Varying the
architecture can make a huge difference in speed and optimization
possibilities.
1. Utilize a database (like Oracle / Sybase / Informix / DB2,
etc...) for significant data handling operations (such as sorts, groups,
aggregates). In other words, staging tables can be a huge benefit to
parallelism of operations, and parallel design - almost by mathematical
definition - nearly always cuts your execution time. Staging tables have many
benefits; please see the staging table discussion in the methodologies section
for full details.
2. Localize. Localize all target tables onto the SAME instance of
Oracle (same SID), or the same instance of Sybase. Try not to use synonyms
(remote database links) for anything (including: lookups, stored procedures,
target tables, sources, functions, privileges, etc...). Utilizing remote links
will most certainly slow things down. For Sybase users, remote mounting of
databases can definitely be a hindrance to performance.
3. If you can - localize all target tables, stored
procedures, functions, views, sequences in the SOURCE database.
Again, try not to connect across synonyms. Synonyms (remote database tables)
could potentially affect performance by as much as a factor of 3 times or more.
4. Remove external registered modules. Perform
pre-processing / post-processing utilizing PERL, SED, AWK, GREP instead. The
Application Programmers Interface (API) which calls externals is inherently
slow (as of: 1/1/2000).
Hopefully Informatica will speed this up in the future. One external module
which exhibited speed problems is the regular expression module (Unix: Sun
Solaris E450, 4 CPUs, 2 GIGS RAM, Oracle 8i and Informatica). Throughput
dropped from 1500+ rows per second without the module to 486 rows per second
with it, with no other sessions running. (This was a SPECIFIC case, with a
SPECIFIC map - it's not like this for all maps.)
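As a sketch of the PERL pre-processing alternative mentioned above (the file
name and patterns are hypothetical), a one-liner can do the regular-expression
work before the file ever reaches Informatica:
    # trim trailing whitespace and convert pipe delimiters to commas
    # before handing the file to the session (names/patterns are examples)
    perl -pe 's/\s+$/\n/; s/\|/,/g' orders_raw.dat > orders.dat
The same job could be done with SED or AWK; the point is to take the regular
expression work out of the external API call path.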
5. Remember that Informatica suggests that each session takes roughly 1 to 1.5
CPUs. In keeping with this, Informatica plays well with RDBMS engines on the
same machine, but does NOT get along (performance wise) with ANY other engine
(reporting engine, java engine, OLAP engine, java virtual machine, etc...).
6. Remove any database based sequence generators.
This requires a wrapper function / stored procedure call. Utilizing these
stored procedures has caused performance to drop by a factor of 3 times. This
slowness is not easily debugged - it can only be spotted in the Write
Throughput column. Copy the map, replace the stored proc call with an internal
sequence generator for a test run - this is how fast you COULD run your map. If
you must use a database generated sequence number, then follow the instructions
for staging table usage. If you're dealing with GIGs or terabytes of
information, this should save you lots of hours of tuning. IF YOU MUST have a
shared sequence generator, then build a staging table from the flat file, add a
SEQUENCE ID column, and call a POST TARGET LOAD stored procedure to populate
that column. Place the post target load procedure in the flat-file-to-staging-
table load map. A single call into the database, followed by a batch operation
to assign sequences, is the fastest method for utilizing shared sequence
generators.
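As a minimal sketch of that single-call batch approach (assuming Perl with
DBI/DBD::Oracle; the connect string, table, and sequence names are
hypothetical):
    use DBI;
    # connect to the local Oracle instance (login details are placeholders)
    my $dbh = DBI->connect('dbi:Oracle:ORCL', 'loader', 'secret',
                           { AutoCommit => 0, RaiseError => 1 });
    # one statement assigns sequence values to the whole staging table in a
    # batch, instead of one stored procedure call per row
    $dbh->do('UPDATE stg_orders SET seq_id = orders_seq.NEXTVAL');
    $dbh->commit;
    $dbh->disconnect;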
7. TURN OFF VERBOSE LOGGING. The session log has a tremendous impact on the
overall performance of the map. Force an override in the session, setting it to
NORMAL logging mode. Unfortunately the logging mechanism is not "parallel" in
the internal core; it is embedded directly into the operations.
8. Turn off 'collect performance statistics'. This also has an impact -
although minimal at times - as it writes a series of performance data to the
performance log. Removing this operation reduces reliance on flat file
operations. However, it may be necessary to have this turned on DURING your
tuning exercise, as it can reveal a lot about the speed of the reader and
writer threads.
9. If your source is a flat file - utilize a staging table. This way you can
also use SQL*Loader, BCP, or some other database bulk-load utility (a sketch
follows this list). Place basic logic in the source load map, and remove all
potential lookups from the code. At this point - if your reader is slow, then
check two things:
1) If you have an item in your registry or configuration file which sets the
"ThrottleReader" to a specific maximum number of blocks, it will limit your
read throughput (this only needs to be set if the sessions have a demonstrated
problem with constraint based loads).
2) Move the flat file to local internal disk (if at all possible). Try not to
read a file across the network, or from a RAID device. Most RAID arrays are
fast, but Informatica seems to top out, while internal disk continues to be
much faster. Here a link will NOT work to increase speed - it must be the full
file itself, stored locally.
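For illustration, a hedged SQL*Loader sketch (table, column, and file names are
assumptions; BCP users would do the equivalent for Sybase). The control file:
    -- stg_orders.ctl (names are examples only)
    LOAD DATA
    INFILE 'orders.dat'
    APPEND INTO TABLE stg_orders
    FIELDS TERMINATED BY ','
    (order_id, store_id, item, qty, price)
and the load itself:
    sqlldr userid=loader/secret control=stg_orders.ctl direct=true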
10. Try to eliminate the use of non-cached lookups. By issuing a non-cached
lookup, your performance will be impacted significantly - particularly if the
lookup table is also a "growing" or "updated" target table. This generally
means the indexes are changing during operation, and the optimizer loses track
of the index statistics. Again - utilize staging tables if possible. In
utilizing staging tables, views can be built in the database which join the
data together, or Informatica's joiner object can be used to join data
together - either one will help dramatically increase speed.
11. Separate complex maps - try to break the maps out into logical threaded
sections of processing. Re-arrange the architecture if necessary to allow for
parallel processing. There may be smaller components doing individual tasks;
however, the throughput will be proportionate to the degree of parallelism that
is applied. A discussion of HOW to perform this task is posted on the
methodologies page; please see that discussion for further details.
12. BALANCE.
Balance between Informatica and the power
of SQL and the database.
Try to utilize the DBMS for what it was
built for:
reading/writing/sorting/grouping/filtering
data en-masse. Use Informatica for the more complex logic, outside joins, data
integration, multiple source feeds, etc...
The balancing act is difficult without DBA knowledge. In order to achieve a
balance, you must be able to recognize which operations are best in the
database, and which are best in Informatica. This does not detract from the use
of the ETL tool, rather it enhances it - it's a MUST if you are performance
tuning for high-volume throughput.
13. TUNE the DATABASE. Don't be afraid to estimate: small, medium, large, and
extra large source data set sizes (in terms of: number of rows, average number
of bytes per row), expected throughput for each, turnaround time for load, and
whether it is a trickle feed. Give this information to your DBAs and ask them
to tune the database for the "worst case".
Help them assess which
tables are expected to be high read/high write, which operations will sort,
(order by), etc... Moving disks, assigning the right table to the right disk
space could make all the difference.
Utilize a PERL script to generate "fake" data for small,
medium, large, and extra large data sets. Run each of these through your
mappings - in this manner, the DBA can watch or monitor throughput as a real
load size occurs.
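A minimal sketch of such a generator (the column layout and value ranges are
invented for illustration):
    #!/usr/bin/perl
    # gen_fake.pl - emit N comma-delimited rows of fake data for load testing
    # usage: perl gen_fake.pl 1000000 > orders.dat
    my $rows  = $ARGV[0] || 200_000;
    my @items = ('battery', 'bulb', 'fuse', 'switch');
    for my $i (1 .. $rows) {
        my $store = 100 + int rand 50;                # fake store id
        my $item  = $items[int rand @items];          # fake item name
        my $qty   = 1 + int rand 10;                  # fake quantity
        my $price = sprintf '%.2f', 0.99 + rand 9.0;  # fake price
        print join(',', $i, $store, $item, $qty, $price), "\n";
    }
Scale the row count up through the small, medium, large, and extra large runs
and hand the resulting files to the DBA.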
14. Be sure there is enough SWAP, and TEMP space on your
PMSERVER machine. Not having enough disk space could potentially
slow down your entire server during processing (in an exponential
fashion). Sometimes this means watching the disk space while your session runs;
otherwise you may not get a good picture of the space available during
operation - particularly if your maps contain aggregates, or lookups that flow
to the disk Cache directory, or if you have a JOINER object with heterogeneous
sources.
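One simple way to watch it (a sketch - the paths are assumptions; point it at
wherever your Cache and temp directories actually live):
    #!/usr/bin/perl
    # poll free space on the cache and temp filesystems every 30 seconds
    while (1) {
        print scalar localtime, "\n";
        print `df -k /informatica/cache /tmp`;   # paths are examples
        sleep 30;
    }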
15. Place some good server load monitoring tools on your PMServer in
development - watch it closely to understand how the resources are being
utilized, and where the hot spots are. Try to follow the recommendations - it
may mean upgrading the hardware to achieve throughput. Look into EMC's disk
storage array - while expensive, it appears to be extremely fast; I've heard
(but not verified) that it has improved performance in some cases by up to 50%.
16. SESSION SETTINGS. In the session, there is only so much tuning
you can do.
Balancing the throughput is important - by
turning on "Collect Performance Statistics" you can get a good feel
for what needs to be set in the session - or what needs to be changed in the
database. Read the performance section
carefully in the Informatica manuals.
Basically what you should try to achieve is: OPTIMAL READ, OPTIMAL THROUGHPUT,
OPTIMAL WRITE. Over-tuning one of these three pieces can result in ultimately
slowing down your session. For example: your write throughput is governed by
your read and transformation speed; likewise, your read throughput is governed
by your transformation and write speed.
The best method to tune a problematic map is to break it into components for
testing. To tune for the reader: send the write output to a flat file for less
contention, check the "ThrottleReader" setting (which is not configured by
default), and increase the Default Buffer Size by a factor of 64k each shot -
ignore the warning above 128k. If the reader's throughput still climbs during
the session and then stabilizes (after a few thousand rows), try increasing the
Shared Session Memory from 12MB to 24MB. If the reader still stabilizes at the
same point, then you have a slow source, slow lookups, or your CACHE directory
is not on internal disk. If the reader's throughput continues to climb above
where it previously stabilized, make note of the session settings. Check the
Performance Statistics to make sure the writer throughput is NOT the
bottleneck - you are attempting to tune the reader here, and don't want the
writer threads to slow you down. Then change the map target back to the
database targets and run the session again. This time, make note of how much
the reader slows down; its optimal performance was reached with flat file
targets, so this time slow targets are the cause.
NOTE: if your reader session to a flat file just doesn't ever "get fast", then
you've got some basic map tuning to do. Try to merge expression objects, set
your lookups to unconnected (for re-use if possible), and check your Index and
Data cache settings if you have aggregation or lookups being performed. Etc...
If you have a slow writer, change the map to a single target table at a time -
see which target is causing the "slowness" and tune it.
Make copies of the original map, and break down the copies. Once the
"slower" of the N targets is discovered, talk to your DBA about
partitioning the table, updating statistics, removing indexes during load,
etc... There are many database things
you can do here.
17. Remove all other "applications" on the PMServer.
Except for the database / staging database or Data Warehouse itself. PMServer
plays well with RDBMS (relational database management system) - but doesn't
play well with application servers, particularly JAVA Virtual Machines, Web
Servers, Security Servers, application, and Report servers. All of these items should be broken out to
other machines.
This is critical to improving performance
on the PMServer machine.
INFORMATICA
INTERMEDIATE TUNING GUIDELINES
The following numbered items are for
intermediate level tuning.
If you have gone through all the pieces above and are still having trouble,
these are some things to look for. These are items within a map which make a
difference in performance (we've done extensive performance testing of
Informatica to be able to show these effects). Keep in mind - at this level,
performance isn't affected unless there are more than 1 million rows (average
size: 2.5 GIG of data). ALL items are Informatica MAP items and Informatica
objects - none are outside the map. Also remember, this applies to PowerMart /
PowerCenter (4.5x, 4.6x / 1.5x, 1.6x) - other versions have NOT been tested.
The order of these items is not relevant to speed; each one has its own impact
on overall performance. Again, throughput is also gauged by the number of
objects constructed within a map/maplet. Sometimes it's better to sacrifice a
little readability for a little speed. It's the old paradigm: weighing
readability and maintainability (true modularity) against raw speed. Make sure
the client agrees with the approach, or that the data sets are large enough to
warrant this type of tuning.
BE AWARE: The following tuning tips range from "minor"
cleanup to "last resort" types of things - only when data sets get
very large, should these items be addressed, otherwise, start with the BASIC
tuning list above, then work your way in to these suggestions.
To understand the intermediate section, you'll need to review these tips.
1. Filter Expressions - try to evaluate them in a port
expression. Try to create the filter
(true/false) answer inside a port expression upstream. Complex filter expressions slow down the
mapping. Again, expressions/conditions
operate fastest in an Expression Object with an output port for the result. Turns out - the longer the
expression, or the more complex - the more
severe the speed degradation. Place the
actual expression (complex or not) in an EXPRESSION OBJECT upstream from the
filter. Compute a single numerical flag:
1 for true, 0 for false as an output port.
Pump this into the filter - you should see maximum performance with this
configuration.
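For example (the port name and condition are hypothetical): in the upstream
Expression Object, add an output port FILTER_FLAG = IIF(<your complex
condition>, 1, 0), and make the Filter condition simply FILTER_FLAG = 1.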
2. Remove all "DEFAULT" value expressions where
possible. Having a default value - even
the "ERROR(xxx)" command slows down the session. It causes an unnecessary evaluation of values
for every data element in the map. The
only time you want to use "DEFAULT value is when you have to provide a
default value for a specific port. There
is another method: placing a variable with an IIF(xxxx, DEFAULT VALUE, xxxx)
condition within an expression.
This will always be faster (if assigned to
an output port) than a default value.
3. Variable Ports are "slower" than Output
Expressions. Whenever
possible, use output expressions instead of variable ports. The variables are good for "static - and
state driven" but do slow down the processing time - as they are
allocated/reallocated each pass of a row through the expression object.
4. Datatype conversion - perform it in a port expression. Simply mapping a string to an integer, or an
integer to a string will perform the conversion, however it will be slower than
creating an output port with an expression like: to_integer(xxxx) and mapping
an integer to an integer. It's because
PMServer is left to decide if the conversion can be done mid-stream which seems
to slow things down.
5. Unused Ports. Surprisingly, unused output ports have no effect on
performance. This is a good thing. However, in general it is good practice to
remove any unused ports in the mapping, including variables. Unfortunately
there is no "quick" method for identifying unused ports.
6. String Functions. String functions definitely have an impact on
performance - particularly those that change the length of a string (substring,
ltrim, rtrim, etc..). These functions slow the map down considerably; the
operations behind each string function are expensive (de-allocating and
re-allocating memory within a READER block in the session). String functions
are a necessary and important part of ETL - we do not recommend removing their
use completely, only limiting them to necessary operations. One of the ways we
advocate tuning these is to use "varchar/varchar2" data types in your database
sources, or to use delimited strings in source flat files (as much as
possible). This will help reduce the need for "trimming" input. If your sources
are in a database, perform the LTRIM/RTRIM functions on the data coming in from
the database SQL statement; this will be much faster than performing it
operationally mid-stream.
7. IIF Conditionals are costly. When possible -
arrange the logic to minimize the use of IIF conditionals. This is not
particular to Informatica, it is costly in ANY programming language. It
introduces "decisions" within the tool, it also introduces multiple
code paths across the logic (thus increasing complexity). Therefore - when
possible, avoid utilizing an IIF conditional - again, the only possibility here
might be (for example) an ORACLE DECODE function applied to a SQL source.
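For example (table and column names are hypothetical): instead of IIF(STATUS =
'A', 1, 0) in the map, override the Source Qualifier SQL with SELECT ...,
DECODE(STATUS, 'A', 1, 0) AS STATUS_FLAG FROM ORDERS - the decision is resolved
in the database before the row ever reaches Informatica.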
8. Sequence Generators slow down mappings. Unfortunately there is no "fast" and
easy way to create sequence generators. The cost is not that high for using a
sequence generator inside of Informatica, particularly if you are caching
values (a cache of around 2000 seems to be the sweet spot). However - if at all
avoidable, this is one "card" up a sleeve that can be played. If you don't
absolutely need the sequence number in the map for calculation reasons, and you
are utilizing Oracle, then let SQL*Loader create the sequence for all Insert
rows. If you're using Sybase, don't specify the Identity column as a target -
let the Sybase Server generate the column. Also - try to avoid "reusable"
sequence generators; they tend to slow the session down further, even with
cached values.
9. Test Expressions slow down sessions. Expressions such as IS_SPACES tend to
slow down mappings; this is a data validation expression which has to run
through the entire string to determine if it is spaces, much the same as
IS_NUMBER has to validate an entire string. These expressions (if at all
avoidable) should be removed in cases where it is not necessary to "test" prior
to conversion. Be aware, however, that direct conversion without testing
(conversion of an invalid value) will kill the transformation. If you
absolutely need a test expression for numerics, try this: IIF(<field> * 1 >=
0, <field>, NULL) - provided you don't care about distinguishing zero. An alpha
in this expression should return a NULL to the computation. Yes - the IIF
condition is slightly faster than IS_NUMBER, because IS_NUMBER parses the
entire string, whereas the multiplication operator is the actual speed gain.
10. Reduce the Number of OBJECTS in a map. Frequently, the idea of these tools
is to make the "data translation map" as easy as possible. All too often, that
means creating one expression for each throughput/translation (taking it to an
extreme, of course). Each object adds computational overhead to the session,
and timings may suffer. Sometimes, if performance is an issue / goal, you can
integrate several expressions into one expression object, thus reducing the
"object" overhead. In doing so - you could speed up the map.
11. Update Expressions - Session set to Update Else Insert. If you have this
switch turned on, it will definitely slow the session down - Informatica
performs 2 operations for each row: an update (with PK), then, if it returns
zero rows updated, an insert. The way to speed this up is to "know" ahead of
time whether you need to issue a DD_UPDATE or DD_INSERT inside the mapping,
then tell the update strategy what to do. After which you can change the
session setting to: INSERT and UPDATE AS UPDATE, or UPDATE AS INSERT.
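One common way to "know" ahead of time (the lookup and port names are
hypothetical): look up the target's primary key, then drive the Update Strategy
with IIF(ISNULL(LKP_ORDER_ID), DD_INSERT, DD_UPDATE) - one decision per row
instead of a wasted update attempt per row.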
12. Multiple Targets are too slow. Frequently maps are generated with multiple
targets, and sometimes multiple sources. This (despite first appearances) can
really burn up time. If the architecture permits change, and the users support
re-work, then try to change the architecture: 1 map per target is the general
rule of thumb. Once you reach one map per target, the tuning gets easier.
Sometimes it helps to reduce it to 1 source and 1 target per map, but if the
architecture allows more modularization, 1 map per target usually does the
trick. Going further, you could break it up: 1 map per target per operation
(such as insert vs update). In doing this, it will provide a few more cards to
the deck with which you can "tune" the session, as well as the target table
itself. Going this route also introduces parallel operations. For further info
on this topic, see my architecture presentations on Staging Tables, and 3rd
normal form architecture (Corporate Data Warehouse Slides).
13. Slow Sources - Flat Files. If you've got slow sources, and these sources
are flat files, you can look at some of the following possibilities. If the sources reside on a different machine,
and you've opened a named pipe to get them across the network - then you've
opened (potentially) a can of worms.
You've introduced the network speed as a variable on the speed of the
flat file source. Try to compress the
source file, FTP PUT it on the local machine (local to PMServer), decompress
it, then utilize it as a source. If
you're reaching across the network to a relational table - and the session is
pulling many many rows (over 10,000) then the source system itself may be
slow. You may be better off using a
source system extract program to dump it to file first, then follow the above
instructions. However, there is
something your SA's and Network Ops folks could do (if necessary) - this is
covered in detail in the advanced section.
They could backbone the two servers together with a dedicated network
line (no hubs, routers, or other items in between the two machines). At the very least, they could put the two
machines on the same sub-net. Now, if your file is local to PMServer but is
still slow, examine the location of the file (which device is it on). If it's
not on an INTERNAL DISK then it will be slower than if it were on an internal
disk (the C drive for you folks on NT). A unix file LINK that exists locally
while the file is remote doesn't count - the actual file must be local.
14. Too Many Aggregators. If your map has more than 1 aggregator, chances are
the session will run very, very slowly - unless the CACHE directory is
extremely fast, and your drive seek/access times are very fast. Even so,
placing aggregators end-to-end in mappings will slow the session down by a
factor of at least 2. This is because all the I/O activity becomes a bottleneck
in Informatica. What needs to be known here is that Informatica's products
(PM / PC, up through 4.7x) are NOT built for parallel processing. In other
words, the internal core doesn't put the aggregators on threads, nor does it
put the I/O on threads - therefore, being a single-threaded process, it becomes
easy for a part of the session/map to become a process "blocked" by I/O
factors. For I/O contention and resource monitoring, please see the
database/datawarehouse tuning guide.
15. Maplets containing Aggregators. Maplets are a good tool for replicating
data logic. But just because an aggregator is in a maplet doesn't mean it won't
affect the mapping. Maplets don't affect the speed of a mapping by themselves
because they are treated as part of the mapping once the session starts - in
other words, if you have an aggregator in a maplet, followed by another
aggregator in the mapping, you will still have the problem mentioned above in
#14. Reduce the number of aggregators in the entire mapping (including maplets)
to 1 if possible. If necessary, split the map up into several different maps,
and use intermediate tables in the database if required to achieve processing
goals.
16. Eliminate "too many
lookups". What happens and why?
Well - with too many lookups, your cache is eaten in memory - particularly on
the 1.6 / 4.6 products. The end result
is there is no memory left for the sessions to run in. The DTM reader/writer/transformer threads are
not left with enough memory to be able to run efficiently. PC 1.7, PM 4.7 solve some of these problems
by caching some of these lookups out to disk when the cache is full. But you still end up with contention - in
this case, with too many lookups, you're trading in Memory Contention for Disk
Contention. The memory contention might be worse than the disk contention,
because the system OS end's up thrashing (swapping in and out of TEMP/SWAP disk
space) with small block sizes to try to locate "find" your lookup
row, and as the row goes from lookup to lookup, the swapping / thrashing get's
worse.
17. Lookups & Aggregators Fight. The lookups and the aggregators fight for
memory space as discussed above. Each
requires Index Cache, and Data Cache and they "share" the same HEAP
segments inside the core. See Memory
Layout document for more information.
Particularly in the 4.6 / 1.6 products and prior - these memory areas
become critical, and when dealing with many many rows - the session is almost
certain to cause the server to "thrash" memory in and out of the OS
Swap space. If possible, separate the
maps - perform the lookups in the first section of the maps, position the data
in an intermediate target table - then a second map reads the target table and
performs the aggregation (also provides the option for a group by to be done
within the database)... Another speed
improvement...
INFORMATICA
ADVANCED TUNING GUIDELINES
The following numbered items are for
advanced level tuning. Please proceed
cautiously, one step at a time. Do not attempt to follow these guidelines if
you haven't already made it through all the basic and intermediate guidelines
first. These guidelines may require a
level of expertise which involves System Administrators, Database
Administrators, and Network Operations folks.
Please be patient. The most important aspect of advanced tuning is to be
able to pinpoint specific bottlenecks, then have the funding to address them.
As usual - these advanced tuning guidelines come last, and are pointed at
suggestions for the system.
There are other advanced tuning guidelines
available for Data Warehousing Tuning.
You can refer to those for questions surrounding your hardware /
software resources.
1. Break the mappings out: 1 per target. If necessary, 1 per source per target.
Why does this work? Eliminating multiple targets in a single mapping can
greatly increase speed... Basically it's like this: one session per map/target.
Each session establishes its own database connection, and because of the unique
database connection, the DBMS server can now handle the insert/update/delete
requests in parallel against multiple targets. It also helps to allow each
session to be specified for its intended purpose (no longer mixing a data
driven session with INSERTS only to a single target). Each session can then be
placed in to a batch marked "CONCURRENT" if preferences allow. Once this is
done, parallelism of mappings and sessions becomes obvious.
A study of parallel processing has shown again and again that operations can
sometimes be completed in half the time of their original counterparts merely
by streaming them at the same time. With multiple targets in the same mapping,
you're telling a single database connection to handle multiple diverse database
statements - sometimes hitting this target, other times hitting that target.
Think about it - in this situation it's extremely difficult for Informatica (or
any other tool, for that matter) to build BULK operations... even though "bulk"
is specified in the session. Remember that "BULK" means this is your
preference, and that the tool will revert to a NORMAL load if it can't perform
a BULK operation on a series of consecutive rows. Obviously, data driven then
forces the tool down several other layers of internal code before the data can
actually reach the database.
2. Develop maplets for complex business
logic. It appears as if Maplets do NOT cause any performance hindrance by
themselves. Extensive use of maplets means better, more manageable business
logic. The maplets allow you to better break the mappings out.
3. Keep the mappings as simple as possible. Bury complex logic (if you must) in
a maplet. If you can avoid complex logic altogether - then that would be the
key. The old rule of thumb (common sense) applies here: the straighter the path
between two points, the shorter the distance... Translated: the shorter the
distance between the source qualifier and the target, the faster the data
loads.
4. Remember the TIMING is affected by READER/TRANSFORMER/WRITER threads. With
complex mappings, don't forget that each ELEMENT (field) must be weighed - in
this light, a firm understanding of how to read the performance statistics
generated by Informatica becomes important. In other words - if the reader is
slow, the rest of the threads suffer; if the writer is slow, same effect. A
pipe is only as big as its smallest diameter... a chain is only as strong as
its weakest link. Sorry for the metaphors, but they should make sense.
5. Change Network Packet Size (for Sybase, MS-SQL
Server & Oracle users). Maximum
network packet size is a Database Wide Setting, which is usually defaulted at
512 bytes or 1024 bytes. Setting the maximum database packet size doesn't
necessarily hurt any of the other users, it does however allow the Informatica
database setting to make use of the larger packet sizes - thus transfer more
data in a single packet faster. The
typical 'best' settings are between 10k and 20k. In Oracle: you'll need to
adjust the Listener.ORA and TNSNames.ORA files.
Include the parameters: SDU, and TDU.
SDU = Service Layer Data Buffer Size (in bytes), TDU = Transport Layer
Data Buffer Size (in bytes). The SDU and
TDU should be set equally. See the
Informatica FAQ page for more information on setting these up.
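As a sketch only (the service name, host, and sizes are illustrative - verify
against the FAQ and your DBA), the TNSNames.ORA entry would look something
like:
    # tnsnames.ora - SDU/TDU raised from the default (values are examples)
    ORCL = (DESCRIPTION =
        (SDU = 16384) (TDU = 16384)
        (ADDRESS = (PROTOCOL = TCP)(HOST = dwhost)(PORT = 1521))
        (CONNECT_DATA = (SID = ORCL)))
with a matching (SDU = 16384)(TDU = 16384) pair in the SID_DESC section of the
server's Listener.ORA.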
6. Change to an IPC Database Connection for a Local Oracle Database. If
PMServer and Oracle are running on the same server, use an IPC connection
instead of a TCP/IP connection. Change the protocol in the TNSNames.ORA and
Listener.ORA files, and restart the listener on the server. Be careful - this
protocol can only be used locally; however, the speed increase from using Inter
Process Communication can be between 2x and 6x. IPC is utilized by Oracle, but
is defined as a Unix System 5 standard specification. You can find more
information on IPC by reading about it in Unix System 5 manuals.
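A hedged example of the TNSNames.ORA side of the change (the alias, KEY, and
SID are placeholders); the server's Listener.ORA needs a matching IPC address
before the listener is restarted:
    # tnsnames.ora - local-only IPC connection instead of TCP
    ORCL_IPC = (DESCRIPTION =
        (ADDRESS = (PROTOCOL = IPC)(KEY = ORCL))
        (CONNECT_DATA = (SID = ORCL)))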
7. Change Database Priorities for the PMServer Database User. Prioritizing the
database login that the connections use (set up in Server Manager) can change
the priority given to the Informatica executing tasks. Once logged in to the
database, these tasks can then take precedence over others. Sizing memory for
these tasks (in shared global areas, and server settings) must be done if
priorities are to be changed. If BCP or SQL*Loader or some other bulk-load
facility is utilized, these priorities must also be set. This can greatly
improve performance. Again, it's only suggested as a last resort method, and
doesn't substitute for tuning the database, or the mapping processes. It should
only be utilized when all other methods have been exhausted (tuned). Keep in
mind that this should only be relegated to the production machines, and only in
certain instances where the load cycle that Informatica is utilizing is NOT
impeding other users.
8. Change the UNIX User Priority. In order to gain speed, the Informatica Unix
User must be given a higher priority. The Unix SA should understand what it
takes to rank the Unix logins, and grant priorities to particular tasks.
Alternatively, simply have pmserver executed under a super user (SU) command;
this will take care of re-prioritizing Informatica's core process. This should
only be used as a last resort - once all other tuning avenues have been
exhausted, or if you have a dedicated Unix machine on which Informatica is
running.
9. Try not to load across the network. If at all possible, try to co-locate PMServer
executable with a local database. Not
having the database local means:
1) the repository is across the network
(slow),
2) the sources / targets are across the
network, also potentially slow.
If you have to load across the network, at
least try to localize the repository on a database instance on the same machine
as the server.
The other thing is: try to co-locate the
two machines (pmserver and Target database server) on the same sub-net, even
the same hub if possible.
This eliminates unnecessary routing of
packets all over the network. Having a
localized database also allows you to setup a target table locally - which you
can then "dump" following a load, ftp to the target server, and
bulk-load in to the target table. This works
extremely well for situations where append or complete refresh is taking place.
10. Set Session Shared Memory Settings between 12MB and 24MB. Typically I've
seen folks attempt to assign a session large heaps of memory (in hopes it will
increase speed). All it tends to do is slow down the processing. See the memory
layout document for further information on how this affects Informatica and its
memory handling, and why simply giving it more memory doesn't necessarily
provide speed.
11. Set the Shared Buffer Block Size around 128k. Again, something that's
covered in the memory layout document. This seems to be a "sweet spot" for
handling blocks of rows inside the Informatica process.
12. MEMORY SETTINGS: The settings above are for an
average configured machine, any machine with less than 10 GIG's of RAM should
abide by the above settings. If you've
got 12+ GIG's, and you're running only 1 to 3 sessions concurrently, go ahead
and specify the Session Shared Memory size at 1 or 2 GIG's. Keep in mind that the Shared Buffer Block
Size should be set in relative size to the Shared Memory Setting. If you set a Shared Mem to 124 MB, set the
Buffer Block Size to 12MB, keep them in relative sizes. If you don't - the
result will be more memory "handling" going on in the background, so
less actual work will be done by Informatica.
Also - this holds true for the simpler mappings. The more complex the mapping, the less likely
you are to see a gain by increasing either buffer block size, or shared memory
settings - because Informatica potentially has to process cells
(ports/fields/values) inside of a huge memory block; thus resulting in a
potential re-allocation of the whole block.
13. Use SNAPSHOTS with your Database. If you have dedicated lines, DS3/T1, etc...
between servers, use a snapshot or Advanced Replication to get data out of the
source systems and in to a staging table (duplicate of the source). Then schedule the snapshot before running
processes. The RDBMS servers are built
for this kind of data transfer - and have optimizations built in to the core to
transfer data incrementally, or as a whole refresh. It may be to your advantage. Particularly if
your sources contain 13 Million + rows.
Point the Informatica processes at the snapshot; at that point you can index it
any way you like - and increase the throughput speed without affecting the
source systems. Yes - snapshots only work if your sources are homogeneous to
your targets (on the same type of system).
14. INCREASE THE DISK SPEED. One of the most common fallacies is that a
Data Warehouse RDBMS needs only 2 controllers, and 13 disks to survive. This is fine if you're running less than 5
Million Rows total through your system, or your load window exceeds 5
hours. I recommend at least 4 to 6
controllers, and at least 50 disks - set on a Raid 0+1 array, spinning at 7200
RPM or better. If it's necessary, plunk
the money down and go get an EMC device. You should see a significant increase
in performance after installing or upgrading to such a configuration.
15. Switch to Raid 0+1. Raid level 5 is great for redundancy, horrible for Data
Warehouse performance, particularly on bulk loads. Raid 0+1 is the preferred
method for data warehouses out there, and most folks find that the mirroring is
just as safe as Raid 5, particularly since the hardware is now nearly all
hot-swappable, and the software to manage it has improved greatly.
16. Upgrade your Hardware. On your production box, if you want Gigabytes
per second throughput, or you want to create 10 indexes in 4 hours on 34
million rows, then add CPU power, RAM, and the Disk modifications discussed
above. A 4 CPU machine just won't cut
the mustard today for this size of operation.
I recommend a
minimum of 8 CPU's as a starter box, and
increase to 12 as necessary. Again, this
is for huge Data Warehousing systems - GIG's per hour/MB per Hour. A box with 4
CPU's is great for development, or for smaller systems (totalling less than 5
Million rows in the warehouse). However,
keep in mind that Bus Speed is also a huge factor here. I've heard of a 4 CPU Dec-Alpha system
outperforming a 6 CPU system... So
what's the bottom line? Disk RPM's, Bus
Speed, RAM, and # of CPU's. I'd say
potentially in that order. Both Oracle
and Sybase perform extremely well when given 6+ CPU's and 8 or 12 GIG's RAM
setup on an EMC device at 7200 RPM with minimum of 4 controllers.
SORTING - PERFORMANCE ISSUES
You can improve Aggregator transformation performance by using the Sorted Input
option. When the Sorted Input option is selected, the Informatica Server
assumes all data is sorted by group. As the Informatica Server reads rows for a
group, it performs aggregate calculations as it reads. When necessary, it
stores group information in memory. To use the Sorted Input option, you must
pass sorted data to the Aggregator transformation. You can gain added
performance with sorted ports when you partition the session.
When Sorted Input is not selected, the Informatica Server performs aggregate
calculations as it reads. However, since data is not sorted, the Informatica
Server stores data for each group until it reads the entire source, to ensure
all aggregate calculations are accurate.
For example, one Aggregator has the STORE_ID and ITEM Group By ports, with the
Sorted Input option selected. When you pass the following data through the
Aggregator, the Informatica Server performs an aggregation for the three
records in the 101/battery group as soon as it finds the new group,
201/battery:
STORE_ID  ITEM       QTY  PRICE
101       'battery'  3    2.99
101       'battery'  1    3.19
101       'battery'  2    2.59
201       'battery'  4    1.59
201       'battery'  1    1.99
If you use the Sorted Input option and do not presort data correctly, the
session fails.
Sorted Input Conditions
Do not use the Sorted Input option if any of the following conditions are true:
* The aggregate expression uses nested
aggregate functions.
* The session uses incremental aggregation.
* Input data is data-driven. You choose to
treat source data as data driven in the session properties, or the Update Strategy
transformation appears before the Aggregator transformation in the mapping.
* The mapping is upgraded from PowerMart 3.5.
If you use the Sorted Input option under these circumstances, the Informatica
Server reverts to default aggregate behavior, reading all values before
performing aggregate calculations.
Pre-Sorting Data
To use the Sorted Input option, you pass sorted data through the Aggregator.
Data must be sorted as follows:
* By the Aggregator group by ports, in the
order they appear in the Aggregator transformation.
* Using the same sort order configured for
the session. If data is not in strict ascending or descending order based on
the session sort order, the Informatica Server fails the session. For example,
if you configure a session to use a French sort order, data passing into the
Aggregator transformation must be sorted using the French sort order. If the
session uses file sources, you can use an external utility to sort file data
before starting the session. If the session uses relational sources, you can
use the Number of Sorted Ports option in the Source Qualifier transformation to
sort group by columns in the source database. Group By columns must be in the
exact same order in both the Aggregator and Source Qualifier transformations.
Indexes - Make sure indexes are in place and tables have been analyzed. You
might also be able to use index hints in the Source Qualifier.
Q: I want to add some new columns to a source table that I have used in a lot
of mappings. If I add new columns to my source table, will they be reflected
automatically in all mappings where I used this particular source?
A: If you add the columns to the source table definition, the change will be
reflected in all the mappings where that source table has been used. But if you
want to use the added columns in a mapping, you need to modify that particular
mapping.