The following pages summarize some of the questions
that typically arise during development and suggest potential resolutions.
Q: How does source
format affect performance? (i.e., is it more efficient to source from a flat
file rather than a database?)
In general, a flat file that is located on the
server machine loads faster than a database located on the server machine.
Fixed-width files are faster than delimited files because delimited files
require extra parsing. However, if there is an intent to perform intricate
transformations before loading to target, it may be advisable to first load the
flat file into a relational database, which allows the PowerCenter mappings to
access the data in an optimized fashion by using filters and custom SQL SELECTs
where appropriate.
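As a hedged illustration of that staged approach, a Source Qualifier SQL override could push the filtering and a join down to the staging database so that only the needed rows ever reach the mapping. All table and column names below are hypothetical, and the date function is Oracle-style:

    -- Hypothetical SQL override against staging tables
    SELECT o.ORDER_ID,
           o.ORDER_DATE,
           c.CUSTOMER_NAME
    FROM   STG_ORDERS o,
           STG_CUSTOMERS c
    WHERE  c.CUSTOMER_ID = o.CUSTOMER_ID
      AND  o.ORDER_DATE >= TO_DATE('1999-09-01', 'YYYY-MM-DD')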
Q: What are some considerations when designing the mapping? (i.e., what is the impact of having multiple targets populated by a single map?)
With PowerCenter, it is possible to design a
mapping with multiple targets. You can then load the targets in a specific
order using Target Load Ordering. The recommendation is to limit the amount of complex logic in a mapping: not only is a mapping with fewer objects easier to debug, but several simpler mappings can also run concurrently and make use of more system resources. When using multiple output files (targets), consider
writing to multiple disks or file systems simultaneously. This minimizes disk
seeks and applies to a session writing to multiple targets, and to multiple
sessions running simultaneously.
Q: What are some
considerations for determining how many objects and transformations to include
in a single mapping?
There are several items to consider when building a
mapping. The business requirement is always the first consideration, regardless
of the number of objects it takes to fulfill the requirement. The most
expensive use of the DTM is passing unnecessary data through the mapping. It is
best to use filters as early as possible in the mapping to remove rows of data
that are not needed; a mapping filter is the equivalent of a SQL WHERE clause. Using the filter condition in the Source Qualifier to filter out rows at the database level is a good way to increase the performance of the mapping.
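For example (table and column names hypothetical), a predicate placed in the Source Qualifier's filter condition is appended to the SQL that PowerCenter generates, so unwanted rows never leave the database:

    Filter condition entered in the Source Qualifier:
        ORDERS.STATUS = 'ACTIVE'

    SQL generated for the session (roughly):
        SELECT ORDERS.ORDER_ID, ORDERS.AMOUNT, ORDERS.STATUS
        FROM   ORDERS
        WHERE  ORDERS.STATUS = 'ACTIVE'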
Log File Organization
Q: Where is the
best place to maintain Session Logs?
One often-recommended location is the default "SessLogs" folder in the Informatica directory, so that all log files are kept in one place.
Q: What
documentation is available for the error codes that appear within the error log
files?
Log file errors and descriptions appear in Appendix C of the PowerCenter Troubleshooting Guide. Error information also
appears in the PowerCenter Help File within the PowerCenter client
applications. For other database-specific errors, consult your Database User
Guide.
Scheduling Techniques
Q: What are the benefits of using workflows with multiple tasks rather
than a workflow with a stand-alone session?
Using a workflow to group logical sessions
minimizes the number of objects that must be managed to successfully load the
warehouse. For example, a hundred individual sessions can be logically grouped
into twenty workflows. The Operations group can then work with twenty workflows
to load the warehouse, which simplifies the operations tasks associated with
loading the targets.
Workflows can be created to run sessions and tasks sequentially, concurrently, or in a combination of sequential and concurrent paths.
·
A sequential workflow runs sessions and
tasks one at a time, in a linear sequence. Sequential workflows help ensure
that dependencies are met as needed. For example, a sequential workflow ensures
that session1 runs before session2 when session2 is dependent on the load of
session1, and so on. It is also possible to set up conditions to run the next session only if the previous session was successful, or to stop on errors (a link-condition sketch follows this list).
·
A concurrent workflow groups logical
sessions and tasks together, like a sequential workflow, but runs all the tasks
at one time. This can reduce the load times into the warehouse, taking
advantage of hardware platforms' Symmetric Multi-Processing (SMP) architecture.
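As a sketch of the sequential case, the link between two sessions in the Workflow Manager can carry a condition such as the following (task names are hypothetical):

    $s_load_dimensions.Status = SUCCEEDED

With this condition on the link into the fact-load session, the facts are loaded only after the dimension load completes successfully.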
Other workflow options, such as nesting worklets within workflows, can further reduce the complexity of loading the warehouse. Moreover, this capability allows for the creation of very complex and flexible workflow streams without the use of a third-party scheduler.
Q: Assuming a
workflow failure, does PowerCenter allow restart from the point of failure?
No. When a workflow fails, you can choose to start
a workflow from a particular task but not from the point of failure. It
is possible, however, to design tasks and flows around explicit error-handling logic so that a failed load can be restarted from a known point.
Q: What guidelines
exist regarding the execution of multiple concurrent sessions / workflows
within or across applications?
Workflow execution needs to be planned around two main system-resource constraints:
·
Available memory
·
Available processors
The number of sessions that can run at one time
depends on the number of processors available on the server. The Load Manager always runs as a process. As a general rule, a session is compute-bound, meaning its throughput is limited by the availability of CPU cycles: most sessions are transformation intensive and keep the DTM busy. I/O-intensive sessions, by contrast, spend more time waiting on disk and use less processor time.
Generally, a session needs about 120 percent of a processor for the DTM,
reader, and writer in total.
For concurrent sessions:
·
One session per processor is about right; you can run more, but that requires a
"trial and error" approach to determine what number of sessions
starts to affect session performance and possibly adversely affect other
executing tasks on the server.
The sessions should run at "off-peak"
hours to have as many available resources as possible.
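A rough worked example of the processor budget, assuming the 120-percent figure above (the server size is hypothetical):

    8 processors available on the server
    6 sessions x 1.2 processors per session = 7.2 processors
    8.0 - 7.2 = 0.8 processors left for the OS and other work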
Even after available processors are determined, it
is necessary to look at overall system resource usage. Determining memory usage
is more difficult than the processors calculation; it tends to vary according
to system load and number of Informatica sessions running.
The first step is to estimate memory usage,
accounting for:
·
Operating system kernel and miscellaneous processes
·
Database engine
·
Informatica Load Manager
The DTM process creates threads to initialize the
session, read, write and transform data, and handle pre- and post-session
operations.
More memory is allocated for lookups, aggregates, ranks, sorters, and heterogeneous joins, in addition to the shared memory segment.
At this point, you should have a good idea of what
is left for concurrent sessions. It is important to arrange the production run
to maximize use of this memory. Remember to account for sessions with large
memory requirements; you may be able to run only one large session, or several
small sessions concurrently.
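A hypothetical memory budget, purely to illustrate the arithmetic (all figures are invented):

    2048 MB physical memory
    - 256 MB operating system kernel and miscellaneous processes
    - 768 MB database engine
    -  64 MB Informatica Load Manager
    = 960 MB left for concurrent sessions: perhaps one cache-heavy
      session, or several small sessions with modest DTM buffer sizes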
Load Order Dependencies are also an important
consideration because they often create additional constraints. For example,
load the dimensions first, then facts. Also, some sources may only be available
at specific times, some network links may become saturated if overloaded, and
some target tables may need to be available to end users earlier than others.
Q: Is it possible to perform two "levels" of event notification: at the application level and at the Informatica Server level, to notify the Server Administrator?
The application level of event notification can be
accomplished through post-session e-mail. Post-session e-mail allows you to
create two different messages, one to be sent upon successful completion of the
session, the other to be sent if the session fails. Messages can be a simple
notification of session completion or failure, or a more complex notification
containing specifics about the session. You can use the following variables in
the text of your post-session e-mail:
E-mail Variable | Description
%s | Session name
%l | Total records loaded
%r | Total records rejected
%e | Session status
%t | Table details, including read throughput in bytes/second and write throughput in rows/second
%b | Session start time
%c | Session completion time
%i | Session elapsed time (completion time minus start time)
%g | Attaches the session log to the message
%m | Name and version of the mapping used in the session
%d | Name of the folder containing the session
%n | Name of the repository containing the session
%a<filename> | Attaches the named file. The file must be local to the Informatica Server. The following are valid filenames: %a<c:\data\sales.txt> or %a</users/john/data/sales.txt>. On Windows NT, you can attach a file of any type. On UNIX, you can only attach text files; if you attach a non-text file, the send might fail. Note: the filename cannot include the greater-than character (>) or a line break.
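For example, a failure notification composed from these variables might read (a sketch only; the exact layout is up to you):

    Session %s failed with status %e.
    Rows loaded: %l, rows rejected: %r.
    Started: %b  Completed: %c  Elapsed: %i
    %g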
The PowerCenter Server on UNIX uses rmail to send post-session e-mail. The user who starts the PowerCenter Server must have the rmail tool installed in the path in order to send e-mail.
To verify the rmail tool is accessible:
1. Login to the UNIX system as the PowerCenter user
who starts the PowerCenter Server.
2. Type rmail <fully qualified email address>
at the prompt and press Enter.
3. Type '.' to indicate the end of the message and
press Enter.
4. You should receive a blank e-mail from the
PowerCenter user's e-mail account. If not, locate the directory where rmail
resides and add that directory to the path.
5. When you have verified that rmail is installed
correctly, you are ready to send post-session e-mail.
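A condensed transcript of the verification above might look like this (the address is hypothetical):

    $ rmail jsmith@company.com
    .
    $

A blank e-mail arriving in that mailbox confirms rmail is working.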
The post-session e-mail output should look like the following:
Session complete.
Session name: sInstrTest
Total Rows Loaded = 1
Total Rows Rejected = 0
Completed

Rows Loaded | Rows Rejected | Read Throughput (bytes/sec) | Write Throughput (rows/sec) | Table Name | Status
1 | 0 | 30 | 1 | t_Q3_sales | No errors encountered.

Start Time: Tue Sep 14 12:26:31 1999
Completion Time: Tue Sep 14 12:26:41 1999
Elapsed time: 0:00:10 (h:m:s)
This information, or a subset, can also be sent to
any text pager that accepts e-mail.
Backup Strategy Recommendation
Q: Can individual
objects within a repository be restored from the back-up or from a prior
version?
At the present time, individual objects cannot be restored from a back-up using the PowerCenter Repository Manager (i.e., you can only restore the entire repository). It is possible, however, to restore the backed-up repository into a different database and then manually copy the individual objects back into the main repository.
Another option is to export individual objects to XML files. This allows for the granular re-importation of individual objects such as mappings, tasks, and workflows.
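In later versions, the pmrep command-line utility can script such exports with its ObjectExport command. The sketch below is an assumption-laden example: the repository, user, folder, and mapping names are invented, and the exact flags vary by PowerCenter version, so check the pmrep reference for yours:

    # connect to the repository, then export one mapping to XML
    pmrep connect -r MyRepo -n admin -x admin_pw -h repo_host -o 5001
    pmrep objectexport -n m_load_sales -o mapping -f SalesFolder -u m_load_sales.xml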
Server Administration
Q: What built-in functions does PowerCenter provide to notify someone in the event that the server goes down, or some other significant event occurs?
The
Repository Server can be used to send messages notifying users that the server
will be shut down. Additionally, the Repository Server can be used to send
notification messages about repository objects that are created, modified or
deleted by another user. Notification messages are received through the
Informatica Client tools.
Q: What system
resources should be monitored? What should be considered normal or acceptable
server performance levels?
The
pmprocs utility, which is available for UNIX systems only, shows the currently
executing PowerCenter processes.
Pmprocs
is a script that combines the ps and ipcs commands. It is available through
Informatica Technical Support. The utility provides the following information:
-
CPID - creator PID (process ID)
-
LPID - PID of the last process that accessed the resource
-
Semaphores - used to synchronize the reader and writer
-
0 or 1 - shows the slot in Load Manager (LM) shared memory
(See
Chapter 16 in the PowerCenter Repository Guide for additional details.)
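pmprocs itself must be obtained from Informatica, but the underlying commands can be run directly. A rough sketch of the kind of output it combines (flags vary by UNIX platform):

    ps -ef | grep pm    # currently running PowerCenter processes
    ipcs -m             # shared memory segments (CPID/LPID via ipcs -mp)
    ipcs -s             # semaphores that sync the reader and writer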
A
variety of UNIX and Windows NT commands and utilities are also available.
Consult your UNIX and/or Windows NT documentation.
Q: What cleanup (if
any) should be performed after a UNIX server crash? Or after an Oracle instance
crash?
If
the UNIX server crashes, you should first check to see if the repository
database is able to come back up successfully. If this is the case, then you
should try to start the PowerCenter server. Use the pmserver.err log to check
if the server has started correctly. You can also use ps -ef | grep pmserver to
see if the server process (the Load Manager) is running.
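A hedged sketch of those checks from the UNIX shell (paths, file locations, and the connect string are hypothetical):

    sqlplus repo_user/repo_pw@REPDB   # 1. confirm the repository database is up
    pmserver                          # 2. restart the PowerCenter server
    tail pmserver.err                 # 3. confirm it started cleanly
    ps -ef | grep pmserver            # 4. confirm the Load Manager process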
Metadata
Q: What recommendations or considerations exist as to naming standards or repository administration for metadata that might be extracted from the PowerCenter repository and used in other systems?
With PowerCenter, you can enter description information for all repository objects (sources, targets, transformations, etc.), but the amount of metadata that you enter should be determined by the business requirements. You can also drill down to the column level and give descriptions of the columns in a table if necessary. All information about column size and scale, datatypes, and primary keys is stored in the repository.
The decision on how much metadata to create is often driven by project timelines. While it may be beneficial for a developer to enter detailed descriptions of each column, expression, variable, etc., it is also very time consuming to do so. Therefore, this decision should be made on the basis of how much metadata will be required by the systems that use it.
There are some time-saving tools available to better manage a metadata strategy and its content, such as third-party metadata software and, for sources and targets, data modeling tools.
Q: What procedures
exist for extracting metadata from the repository?
Informatica
offers an extremely rich suite of metadata-driven tools for data warehousing
applications. All of these tools store, retrieve, and manage their metadata in
Informatica's central repository. The motivation behind the original Metadata
Exchange (MX) architecture was to provide an effective and easy-to-use
interface to the repository.
Today,
Informatica and several key Business Intelligence (BI) vendors, including Brio,
Business Objects, Cognos, and MicroStrategy, are effectively using the MX views
to report and query the Informatica metadata.
Informatica strongly discourages accessing the repository tables directly, even for SELECT access, because some releases of PowerCenter change the structure of the repository tables, leaving you with a maintenance task. Instead, views have been created to provide stable access to the metadata stored in the repository.
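For example, a hedged query against one of the MX views; the view and column names below are from the MX layer but should be verified against your PowerCenter version:

    SELECT subject_area, mapping_name
    FROM   rep_all_mappings
    ORDER  BY subject_area, mapping_name;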
Additional products, such as Informatica's Metadata Reporter and PowerAnalyzer, allow for more robust reporting against the repository database and are able to present reports to end users and management.