Saturday, 16 May 2015

Informatica Development FAQs

The following sections summarize some of the questions that typically arise during Informatica PowerCenter development and suggest potential resolutions.

Q: How does source format affect performance? (i.e., is it more efficient to source from a flat file rather than a database?)
In general, a flat file that is located on the server machine loads faster than a database located on the server machine. Fixed-width files are faster than delimited files because delimited files require extra parsing. However, if there is an intent to perform intricate transformations before loading to target, it may be advisable to first load the flat file into a relational database, which allows the PowerCenter mappings to access the data in an optimized fashion by using filters and custom SQL SELECTs where appropriate.

Q: What are some considerations when designing the mapping? (i.e. what is the impact of having multiple targets populated by a single map?)
With PowerCenter, it is possible to design a mapping with multiple targets, and then load the targets in a specific order using Target Load Ordering. The recommendation is to limit the amount of complex logic in a mapping: a mapping with a limited number of objects is easier to debug, and several simpler mappings can be run concurrently to make use of more system resources. When using multiple output files (targets), consider writing to multiple disks or file systems simultaneously. This minimizes disk seeks and applies both to a session writing to multiple targets and to multiple sessions running simultaneously.

Q: What are some considerations for determining how many objects and transformations to include in a single mapping?
There are several items to consider when building a mapping. The business requirement is always the first consideration, regardless of the number of objects it takes to fulfill the requirement. The most expensive use of the DTM is passing unnecessary data through the mapping, so it is best to filter out unneeded rows as early as possible; a Filter transformation is the mapping equivalent of a SQL WHERE clause. Better still, using the filter condition in the Source Qualifier removes the rows at the database level, before they ever reach the mapping.
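For example, before moving a predicate into the Source Qualifier filter condition, you can confirm from the command line how many rows it would eliminate. A minimal sketch, assuming an Oracle source; the table, column, and credentials are illustrative only:

sqlplus -s src_user/src_pwd@src_db <<'SQL'
-- rows the session would read without a Source Filter (hypothetical ORDERS table)
SELECT COUNT(*) AS all_rows FROM orders;
-- rows it would read with the predicate pushed to the database
SELECT COUNT(*) AS filtered_rows FROM orders WHERE order_date >= DATE '2015-01-01';
SQL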

Log File Organization

Q: Where is the best place to maintain Session Logs?
One often-recommended location is the default "SessLogs" folder in the Informatica directory, keeping all log files in the same directory.

Q: What documentation is available for the error codes that appear within the error log files?
Log file errors and descriptions appear in Appendix C of the PowerCenter Troubleshooting Guide. Error information also appears in the PowerCenter Help File within the PowerCenter client applications. For database-specific errors, consult your database user guide.

Scheduling Techniques

Q: What are the benefits of using workflows with multiple tasks rather than a workflow with a stand-alone session?
Using a workflow to group logical sessions minimizes the number of objects that must be managed to successfully load the warehouse. For example, a hundred individual sessions can be logically grouped into twenty workflows. The Operations group can then work with twenty workflows to load the warehouse, which simplifies the operations tasks associated with loading the targets.
Workflows can be designed to run their sessions and tasks sequentially, concurrently, or with a mix of both along different paths:
·         A sequential workflow runs sessions and tasks one at a time, in a linear sequence. Sequential workflows help ensure that dependencies are met as needed. For example, a sequential workflow ensures that session1 runs before session2 when session2 is dependent on the load of session1, and so on. It's also possible to set up conditions to run the next session only if the previous session was successful, or to stop on errors, etc.
·         A concurrent workflow groups logical sessions and tasks together, like a sequential workflow, but runs all the tasks at one time. This can reduce the load times into the warehouse, taking advantage of hardware platforms' Symmetric Multi-Processing (SMP) architecture.
Other workflow options, such as nesting worklets within workflows, can further reduce the complexity of loading the warehouse. Together, these capabilities allow for the creation of very complex and flexible workflow streams without the use of a third-party scheduler.

Q: Assuming a workflow failure, does PowerCenter allow restart from the point of failure?
No. When a workflow fails, you can choose to start the workflow from a particular task, but not from the exact point of failure. It is possible, however, to create tasks and flows based on error-handling assumptions.

Q: What guidelines exist regarding the execution of multiple concurrent sessions / workflows within or across applications?
Workflow execution needs to be planned around two main system resources:
·         Available processors
·         Available memory
The number of sessions that can run at one time depends on the number of processors available on the server. The load manager always runs as a process. As a general rule, a session is compute-bound, meaning its throughput is limited by the availability of CPU cycles: most sessions are transformation-intensive, so the DTM is always busy, while sessions that are more I/O-bound use less processor time. Generally, a session needs about 120 percent of a processor for the DTM, reader, and writer combined; on an eight-processor server, for example, that rule of thumb allows roughly six concurrent sessions (8 / 1.2 ≈ 6.7) before the CPUs are saturated.
For concurrent sessions:
·         One session per processor is about right; you can run more, but that requires a "trial and error" approach to determine what number of sessions starts to affect session performance and possibly adversely affect other executing tasks on the server.
The sessions should run at "off-peak" hours to have as many available resources as possible.
Even after available processors are determined, it is necessary to look at overall system resource usage. Determining memory usage is more difficult than the processor calculation; it tends to vary according to system load and the number of Informatica sessions running.
The first step is to estimate memory usage, accounting for:
·         Operating system kernel and miscellaneous processes
·         Database engine
·         Informatica Load Manager
·         DTM processes, which create threads to initialize the session, read, write, and transform data, and handle pre- and post-session operations
·         Additional memory allocated for lookups, aggregates, ranks, sorters, and heterogeneous joins, on top of the shared memory segment
At this point, you should have a good idea of what is left for concurrent sessions. It is important to arrange the production run to maximize use of this memory. Remember to account for sessions with large memory requirements; you may be able to run only one large session, or several small sessions concurrently.
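As a cross-check of these estimates, standard UNIX utilities can show what the server actually consumes while sessions run. A minimal sketch (the DTM process name, pmdtm, may vary by PowerCenter version):

vmstat 5 3            # CPU and memory snapshot: three samples, five seconds apart
ps -ef | grep pmdtm   # roughly one DTM process per running session
ipcs -m               # shared memory segments allocated for the sessions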
Load Order Dependencies are also an important consideration because they often create additional constraints. For example, load the dimensions first, then facts. Also, some sources may only be available at specific times, some network links may become saturated if overloaded, and some target tables may need to be available to end users earlier than others.

Q: Is it possible to perform two "levels" of event notification: one at the application level, and one at the Informatica Server level to notify the Server Administrator?
The application level of event notification can be accomplished through post-session e-mail. Post-session e-mail allows you to create two different messages, one to be sent upon successful completion of the session, the other to be sent if the session fails. Messages can be a simple notification of session completion or failure, or a more complex notification containing specifics about the session. You can use the following variables in the text of your post-session e-mail:

E-mail Variable Description
%s Session name
%l Total records loaded
%r Total records rejected
%e Session status
%t Table details, including read throughput in bytes/second and write throughput in rows/second
%b Session start time
%c Session completion time
%i Session elapsed time (completion time minus start time)
%g Attaches the session log to the message
%m Name and version of the mapping used in the session
%d Name of the folder containing the session
%n Name of the repository containing the session
%a<filename> Attaches the named file. The file must be local to the Informatica Server. The following are valid filenames: %a<c:\data\sales.txt> or %a</users/john/data/sales.txt>
On Windows NT, you can attach a file of any type.
On UNIX, you can only attach text files. If you attach a non-text file, the send might fail.

Note: The filename cannot include the Greater Than character (>) or a line break.
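Putting the variables together, the body of a post-session e-mail might be configured along these lines (a sketch; the exact text is whatever you enter in the session properties, and it produces output like the sample shown after the rmail discussion below):

Session complete.
Session name: %s
Total Rows Loaded = %l
Total Rows Rejected = %r
%e
%t
Start Time: %b
Completion Time: %c
Elapsed time: %i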

The PowerCenter Server on UNIX uses rmail to send post-session e-mail. The repository user who starts the PowerCenter server must have the rmail tool installed in the path in order to send e-mail.

To verify the rmail tool is accessible:
1. Log in to the UNIX system as the PowerCenter user who starts the PowerCenter Server.
2. Type rmail <fully qualified email address> at the prompt and press Enter.
3. Type '.' to indicate the end of the message and press Enter.
4. You should receive a blank e-mail from the PowerCenter user's e-mail account. If not, locate the directory where rmail resides and add that directory to the path.
5. When you have verified that rmail is installed correctly, you are ready to send post-session e-mail.
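The same check can be scripted. A minimal sketch, with an illustrative address:

which rmail                           # confirm rmail is on the PATH
echo "." | rmail jsmith@example.com   # '.' alone ends the message, so a blank e-mail is sent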
An e-mail message built from the variables above should look like the following:
Session complete.
Session name: sInstrTest
Total Rows Loaded = 1
Total Rows Rejected = 0
Completed

Rows Loaded   Rows Rejected   Read Throughput (bytes/sec)   Write Throughput (rows/sec)   Table Name   Status
1             0               30                            1                             t_Q3_sales

No errors encountered.
Start Time: Tue Sep 14 12:26:31 1999
Completion Time: Tue Sep 14 12:26:41 1999
Elapsed time: 0:00:10 (h:m:s)
This information, or a subset, can also be sent to any text pager that accepts e-mail.

Backup Strategy Recommendation

Q: Can individual objects within a repository be restored from the back-up or from a prior version?
At the present time, individual objects cannot be restored from a backup using the PowerCenter Repository Manager (i.e., you can only restore the entire repository). It is possible, however, to restore the backup repository into a different database and then manually copy the individual objects back into the main repository.
Another option is to export individual objects to XML files. This allows for the granular re-import of individual objects, mappings, tasks, workflows, etc.
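The export can also be scripted with the pmrep command-line utility. A hedged sketch; the repository, host, port, folder, and mapping names are placeholders, and the exact flags vary by PowerCenter version (check pmrep help):

pmrep connect -r DEV_REPO -n admin -x admin_pwd -h repo_host -o 5001
pmrep objectexport -n m_load_sales -o mapping -f SALES_FOLDER -u m_load_sales.xml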

Server Administration

Q: What built-in functions does PowerCenter provide to notify someone in the event that the server goes down, or some other significant event occurs?
The Repository Server can be used to send messages notifying users that the server will be shut down, and to send notification messages about repository objects that are created, modified, or deleted by another user. Notification messages are received through the Informatica Client tools.

Q: What system resources should be monitored? What should be considered normal or acceptable server performance levels?
The pmprocs utility, which is available for UNIX systems only, shows the currently executing PowerCenter processes. pmprocs is a script that combines the ps and ipcs commands; it is available through Informatica Technical Support. The utility provides the following information:
- CPID - creator PID (process ID)
- LPID - last PID that accessed the resource
- Semaphores - used to synchronize the reader and writer
- 0 or 1 - shows the slot in the Load Manager's shared memory
(See Chapter 16 in the PowerCenter Repository Guide for additional details.)
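Where pmprocs is not available, the underlying commands give a similar picture; for example (on some UNIX flavors, ipcs needs extra flags such as -p to show the CPID/LPID columns):

ps -ef | grep -E 'pmserver|pmdtm'   # Load Manager and per-session DTM processes
ipcs -m                             # shared memory segments
ipcs -s                             # semaphores used to synchronize the reader and writer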
A variety of UNIX and Windows NT commands and utilities are also available. Consult your UNIX and/or Windows NT documentation.

Q: What cleanup (if any) should be performed after a UNIX server crash? Or after an Oracle instance crash?
If the UNIX server crashes, first check whether the repository database is able to come back up successfully. If it can, then try to start the PowerCenter server. Use the pmserver.err log to check whether the server started correctly. You can also use ps -ef | grep pmserver to see whether the server process (the Load Manager) is running.
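A minimal post-crash checklist might look like the following; the credentials and connect string are illustrative, and the pmserver binary is assumed to be on the PATH:

sqlplus -s repo_user/repo_pwd@repo_db <<'SQL'
SELECT 'repository database is reachable' FROM dual;
SQL
pmserver &                # restart the PowerCenter server (the Load Manager)
tail -f pmserver.err      # watch the error log for a clean startup
ps -ef | grep pmserver    # confirm the server process is running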

Metadata

Q: What recommendations or considerations exist as to naming standards or repository administration for metadata that might be extracted from the PowerCenter repository and used in others?
With PowerCenter, you can enter description information for all repository objects: sources, targets, transformations, etc. The amount of metadata that you enter should be determined by the business requirements. You can also drill down to the column level and give descriptions of the columns in a table if necessary. All information about column size and scale, datatypes, and primary keys is stored in the repository.
The decision on how much metadata to create is often driven by project timelines. While it may be beneficial for a developer to enter detailed descriptions of each column, expression, variable, etc., it is also very time consuming to do so. The decision should therefore be based on how much metadata will be required by the systems that use it.
There are some time saving tools that are available to better manage a metadata strategy and content, such as third party metadata software and, for sources and targets, data modeling tools.

Q: What procedures exist for extracting metadata from the repository?
Informatica offers an extremely rich suite of metadata-driven tools for data warehousing applications. All of these tools store, retrieve, and manage their metadata in Informatica's central repository. The motivation behind the original Metadata Exchange (MX) architecture was to provide an effective and easy-to-use interface to the repository.
Today, Informatica and several key Business Intelligence (BI) vendors, including Brio, Business Objects, Cognos, and MicroStrategy, are effectively using the MX views to report and query the Informatica metadata.
Informatica strongly discourages accessing the repository tables directly, even for SELECT access, because some releases of PowerCenter change the structure of the repository tables, leaving you with a maintenance task. Instead, MX views have been created to provide access to the metadata stored in the repository.
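For example, a simple inventory of mappings per folder can be pulled through the MX views. A sketch; the view and column names (REP_ALL_MAPPINGS, SUBJECT_AREA, MAPPING_NAME) should be verified against the MX documentation for your release:

sqlplus -s repo_user/repo_pwd@repo_db <<'SQL'
SELECT subject_area, mapping_name
FROM   rep_all_mappings
ORDER  BY subject_area, mapping_name;
SQL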
Additional products, such as Informatica's Metadata Reporter and PowerAnalyzer, allow for more robust reporting against the repository database and can present reports to end users and/or management.
