Thursday, 23 December 2010

Leveraging Metadata in Informatica Workflow-Session/Analysis

We can leverage the metadata collected in the Informatica repository for many interesting analyses; a few of the scenarios where I have used this metadata are described below.
These SQL queries run as-is against an Oracle repository database and need only minor modification for other databases.
Failed Sessions
The following query lists the sessions that failed during the last day. To cover the last 'n' days, replace SYSDATE - 1 with SYSDATE - n.
QUERY:
SELECT SUBJECT_AREA AS FOLDER_NAME,
SESSION_NAME,
LAST_ERROR AS ERROR_MESSAGE,
DECODE (RUN_STATUS_CODE,3,'Failed',4,'Stopped',5,'Aborted') AS STATUS,
ACTUAL_START AS START_TIME,
SESSION_TIMESTAMP
FROM REP_SESS_LOG
WHERE RUN_STATUS_CODE != 1
AND TRUNC(ACTUAL_START) BETWEEN TRUNC(SYSDATE -1) AND TRUNC(SYSDATE)
Long running Sessions
The following query lists sessions that ran for more than 10 minutes during the last day; the 10/(24*60) term is 10 minutes expressed as a fraction of a day. To cover the last 'n' days, replace SYSDATE - 1 with SYSDATE - n.
QUERY:
SELECT SUBJECT_AREA AS FOLDER_NAME,
SESSION_NAME,
SUCCESSFUL_SOURCE_ROWS AS SOURCE_ROWS,
SUCCESSFUL_ROWS AS TARGET_ROWS,
ACTUAL_START AS START_TIME,
SESSION_TIMESTAMP
FROM REP_SESS_LOG
WHERE RUN_STATUS_CODE = 1
AND TRUNC(ACTUAL_START) BETWEEN TRUNC(SYSDATE -1) AND TRUNC(SYSDATE)
AND (SESSION_TIMESTAMP - ACTUAL_START) > (10/(24*60))
ORDER BY SESSION_TIMESTAMP
Invalid Tasks
The following query lists the folder name, task type, task name, version number, and last-saved timestamp for all invalid tasks that are enabled.
QUERY:
SELECT SUBJECT_AREA AS FOLDER_NAME,
DECODE(IS_REUSABLE,1,'Reusable',' ') || ' ' || TASK_TYPE_NAME AS TASK_TYPE,
TASK_NAME AS OBJECT_NAME,
VERSION_NUMBER,
LAST_SAVED
FROM REP_ALL_TASKS
WHERE IS_VALID=0
AND IS_ENABLED=1
ORDER BY SUBJECT_AREA,TASK_NAME
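Another scenario along the same lines: successful sessions that wrote zero rows to the target, which often points to an upstream data issue. This is a minimal sketch reusing the same REP_SESS_LOG columns as the queries above.
QUERY:
SELECT SUBJECT_AREA AS FOLDER_NAME,
SESSION_NAME,
SUCCESSFUL_SOURCE_ROWS AS SOURCE_ROWS,
SUCCESSFUL_ROWS AS TARGET_ROWS,
ACTUAL_START AS START_TIME,
SESSION_TIMESTAMP
FROM REP_SESS_LOG
WHERE RUN_STATUS_CODE = 1
AND SUCCESSFUL_ROWS = 0
AND TRUNC(ACTUAL_START) BETWEEN TRUNC(SYSDATE - 1) AND TRUNC(SYSDATE)
ORDER BY SESSION_TIMESTAMP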
Thanks for reading. Do you have other scenarios where workflow metadata has been effective? Wishing you a very happy new year 2011.

Tuesday, 26 October 2010

Impact Analysis on Source & Target Definition Changes


Changes to source and target definitions impact the current state of the Informatica mappings that use them. This article lists the possible changes at the source and the target along with their impact.


Updating Source Definitions: When we update a source definition, the Designer propagates the changes to all mappings using that source. Some changes to source definitions can invalidate mappings.
The table below describes how mappings are impacted when a source definition is edited:
Modification | Result after modifying the source definition
Add a column. | Mappings are not invalidated.
Change a column data type. | Mappings may be invalidated. If the column is connected to an input port that uses a data type incompatible with the new one, the mapping is invalidated.
Change a column name. | Mappings may be invalidated. If you change the name of a column you just added, the mapping remains valid. If you change the name of an existing column, the mapping is invalidated.
Delete a column. | Mappings can be invalidated if the mapping uses values from the deleted column.

Adding a new column in the existing source definition:

  • When we add a new column to a source in the Source Analyzer, all mappings using the source definition remain valid.
  • However, when we add a new column and change some of its properties, the Designer invalidates mappings using the source definition.
  • We can change the following properties for a newly added source column without invalidating a mapping:
    1. Name
    2. Data type
    3. Format
    4. Usage
    5. Redefines
    6. Occurs
    7. Key type
If the changes invalidate the mapping, we must open and edit the mapping. Then click Repository > Save to save the changes to the repository. If the invalidated mapping is used in a session, we must validate the session.
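To find out which mappings were invalidated after such a change, the repository metadata can again be queried. This is a minimal sketch, assuming the REP_ALL_MAPPINGS MX view exposes an IS_VALID flag the same way REP_ALL_TASKS does in the earlier post; the exact view and column names may vary by PowerCenter version.
-- Assumption: REP_ALL_MAPPINGS exists with SUBJECT_AREA, MAPPING_NAME and IS_VALID columns
SELECT SUBJECT_AREA AS FOLDER_NAME,
MAPPING_NAME
FROM REP_ALL_MAPPINGS
WHERE IS_VALID = 0
ORDER BY SUBJECT_AREA, MAPPING_NAME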
Updating Target Definitions:
When we change a target definition, the Designer propagates the changes to any mapping using that target. Some changes to target definitions can invalidate mappings.
The following table describes how mappings are impacted when we edit a target definition:
Modification | Result after modifying the target definition
Add a column. | Mappings are not invalidated.
Change a column data type. | Mappings may be invalidated. If the column is connected to an input port that uses a data type incompatible with the new one (for example, Decimal to Date), the mapping is invalid.
Change a column name. | Mappings may be invalidated. If you change the name of a column you just added, the mapping remains valid. If you change the name of an existing column, the mapping is invalidated.
Delete a column. | Mappings may be invalidated if the mapping uses values from the deleted column.
Change the target definition type. | Mappings are not invalidated.

Adding a new column in the existing target definition:

  • When we add a new column to a target in the Target Designer, all mappings using the target definition remain valid.
  • However, when we add a new column and change some of its properties, the Designer invalidates mappings using the target definition.
  • We can change the following properties for a newly added target column without invalidating a mapping:
1. Name
2. Data type
3. Format
If the changes invalidate the mapping, validate the mapping and any session using the mapping. We can validate objects from the Query Results or View Dependencies window or from the Repository Navigator. We can validate multiple objects from these locations without opening them in the workspace. If we cannot validate the mapping or session from one of these locations, open the object in the workspace and edit it.

Re-importing a Relational Target Definition:
If a target table changes, such as when we change a column data type, we can edit the definition or we can re-import the target definition. When we re-import the target, we can either replace the existing target definition or rename the new target definition to avoid a naming conflict with the existing target definition.

To re-import a target definition:
  • In the Target Designer, follow the same steps used to import the target definition and select the target to import. The Designer notifies us that a target definition with that name already exists in the repository. If we have multiple tables to import and replace, select Apply to All Tables.
  • Click Rename, Replace, Skip, or Compare.
  • If we click Rename, enter a new name for the target definition and click OK.
  • If we have a relational target definition and click Replace, specify whether we want to retain primary key-foreign key information and target descriptions.
The following table describes the options available in the Table Exists dialog box when re-importing and replacing a relational target definition:
Option | Description
Apply to All Tables | Applies the rename, replace, or skip action to all tables in the folder.
Retain User-Defined PK-FK Relationships | Keeps the primary key-foreign key relationships in the target definition being replaced. This option is disabled when the target definition is non-relational.
Retain User-Defined Descriptions | Retains the target description and the column and port descriptions of the target definition being replaced.

Thursday, 14 October 2010

Output Files in Informatica


The Integration Service process generates output files when we run workflows and sessions. By default, the Integration Service logs status and error messages to log event files.

Log event files are binary files that the Log Manager uses to display log events. When we run each session, the Integration Service also creates a reject file. Depending on transformation cache settings and target types, the Integration Service may create additional files as well.

The Integration Service creates the following output files:
Session Details/logs:
  • When we run a session, the Integration Service creates a session log file containing load statistics, table names, error information, threads created, and so on, based on the tracing level set in the session properties.
  • We can monitor session details in the session run properties while the session is running, or after it has failed or succeeded.
Workflow Log:
  • Workflow log is available in Workflow Monitor.
  • The Integration Service process creates a workflow log for each workflow it runs.
  • It writes information in the workflow log such as
    • Initialization of processes,
    • Workflow task run information,
    • Errors encountered and
    • Workflows run summary.
  • The Integration Service can also be configured to suppress writing messages to the workflow log file.
  • As with Integration Service logs and session logs, the Integration Service process enters a code number into the workflow log file message along with message text.
Performance Detail File:
  • The Integration Service process generates performance details for session runs.
  • Through the performance details file we can determine where session performance can be improved.
  • Performance details provide transformation-by-transformation information on the flow of data through the session.
Reject Files:
  • By default, the Integration Service process creates a reject file for each target in the session. The reject file contains rows of data that the writer does not write to targets.
  • The writer may reject a row in the following circumstances:
    • It is flagged for reject by an Update Strategy or Custom transformation.
    • It violates a database constraint, such as a primary key constraint.
    • A field in the row was truncated or overflowed, and the target database is configured to reject truncated or overflowed data.
Note: By default, the Integration Service process saves the reject file in the directory entered for the service process variable $PMBadFileDir in the Workflow Manager, and names the reject file target_table_name.bad. We can view this file name at the session level:
  • Open the session, select any of the targets, and view the options:
    • Reject File directory
    • Reject file name
  • If you enable row error logging, the Integration Service process does not create a reject file.
Row Error Logs:
  • When we configure a session, we can choose to log row errors in a central location.
  • When a row error occurs, the Integration Service process logs error information that allows us to determine the cause and source of the error.
  • The Integration Service process logs information such as source name, row ID, current row data, transformation, timestamp, error code, error message, repository name, folder name, session name, and mapping information.
  • If we enable flat file logging, by default the Integration Service process saves the error log file in the directory entered for the service process variable $PMBadFileDir in the Workflow Manager. A query sketch for relational error logging follows below.
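When row error logging is directed to a relational database instead of a flat file, the errors land in PowerCenter's error log tables (PMERR_MSG and related PMERR_* tables by default). A minimal sketch for pulling the most recent errors; the table and column names are assumptions and can differ by version and configuration.
-- Assumption: relational row error logging with the default PMERR_MSG table
SELECT *
FROM PMERR_MSG
ORDER BY ERROR_TIMESTAMP DESC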
Recovery Tables Files:
  • The Integration Service process creates recovery tables on the target database system when it runs a session enabled for recovery.
  • When you run a session in recovery mode, the Integration Service process uses information in the recovery tables to complete the session.
  • When the Integration Service process performs recovery, it restores the state of operations to recover the workflow from the point of interruption.
  • The workflow state of operations includes information such as active service requests, completed and running status, workflow variable values, running workflows and sessions, and workflow schedules.
Control File:
  • When we run a session that uses an external loader, the Integration Service process creates a control file and a target flat file.
  • The control file contains information about the target flat file such as data format and loading instructions for the external loader.
  • The control file has an extension of .ctl. The Integration Service process creates the control file and the target flat file in the Integration Service variable directory, $PMTargetFileDir, by default.
Email:
  • We can compose and send email messages by creating an Email task in the Workflow Designer or Task Developer; the Email task can be placed in a workflow or associated with a session.
  • The Email task allows us to automatically communicate information about a workflow or session run to designated recipients.
  • Email tasks in the workflow send email depending on the conditional links connected to the task. For post-session email, we can create two different messages, one to be sent if the session completes successfully, the other if the session fails.
  • We can also use variables to generate information about the session name, status, and total rows loaded.
Indicator File:
  • If we use a flat file as a target, we can configure the Integration Service to create an indicator file for target row type information.
  • For each target row, the indicator file contains a number to indicate whether the row was marked for insert, update, delete, or reject.
  • The Integration Service process names this file target_name.ind and stores it in the Integration Service variable directory, $PMTargetFileDir, by default.
Target or Output File:
  • If the session writes to a target file, the Integration Service process creates the target file based on a file target definition.
  • By default, the Integration Service process names the target file based on the target definition name.
  • If a mapping contains multiple instances of the same target, the Integration Service process names the target files based on the target instance name.
  • The Integration Service process creates this file in the Integration Service variable directory, $PMTargetFileDir, by default.
Cache Files:
  • When the Integration Service process creates memory cache, it also creates cache files. The Integration Service process creates cache files for the following mapping objects:
    • Aggregator transformation
    • Joiner transformation
    • Rank transformation
    • Lookup transformation
    • Sorter transformation
    • XML target
  • By default, the DTM creates the index and data files for Aggregator, Rank, Joiner, and Lookup transformations and XML targets in the directory configured for the $PMCacheDir service process variable.

Tuesday, 28 September 2010

Informatica Power Center performance – Concurrent Workflow Execution


What is a concurrent workflow?
A concurrent workflow is a workflow that can run as multiple instances concurrently.

What is a workflow instance?
A workflow instance is a representation of a workflow.

How do we configure a concurrent workflow?

1) Allow concurrent workflows with the same instance name:
Configure one workflow instance to run multiple times concurrently. Each instance has the same source, target, and variable parameters.
Eg: Create a workflow that reads data from a message queue that determines the source data and targets. You can run the instance multiple times concurrently and pass different connection parameters to the workflow instances from the message queue.

2) Configure unique workflow instances to run concurrently:
Define each workflow instance name and configure a workflow parameter file for the instance. You can define different sources, targets, and variables in the parameter file.
Eg: Configure workflow instances to run a workflow with different sources and targets. For example, your organization receives sales data from three divisions. You create a workflow that reads the sales data and writes it to the database. You configure three instances of the workflow. Each instance has a different workflow parameter file that defines which sales file to process. You can run all instances of the workflow concurrently.
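For illustration, a minimal sketch of what the three division-specific parameter files might contain; the folder, workflow, session, and parameter names and the paths are hypothetical, and each workflow instance is started with its own file:
Parameter file for the first instance:
[SalesFolder.WF:wf_load_sales.ST:s_m_load_sales]
$InputFile_SalesSrc=/data/incoming/sales_east.dat
Parameter file for the second instance:
[SalesFolder.WF:wf_load_sales.ST:s_m_load_sales]
$InputFile_SalesSrc=/data/incoming/sales_west.dat
Parameter file for the third instance:
[SalesFolder.WF:wf_load_sales.ST:s_m_load_sales]
$InputFile_SalesSrc=/data/incoming/sales_central.dat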

How does a concurrent workflow work?

A concurrent workflow groups logical sessions and tasks together, like a sequential workflow, but runs all the tasks at one time.
Advantages of a concurrent workflow?
This can reduce load times into the warehouse, taking advantage of the hardware platform's Symmetric Multi-Processing (SMP) architecture.
LOAD SCENARIO:
Source table records count:  150,622,276

Monday, 30 August 2010

Informatica Development Best Practice – Workflow


Workflow Manager default properties can be modified to improve overall performance, and a few of them are listed below. These properties can directly impact the ETL runtime and need to be configured based on:

i)  Source Database
ii) Target Database
iii) Data Volume


Category | Technique
Session Properties | While loading staging tables for FULL loads, the "Truncate target table" option should be checked. Based on the target database and whether a primary key is defined, the Integration Service fires a TRUNCATE or DELETE statement:
Database | Primary Key Defined | No Primary Key
DB2 | TRUNCATE | TRUNCATE
INFORMIX | DELETE | DELETE
ODBC | DELETE | DELETE
ORACLE | DELETE UNRECOVERABLE | TRUNCATE
MSSQL | DELETE | TRUNCATE
SYBASE | TRUNCATE | TRUNCATE
Session Properties | The workflow property "Commit Interval" (default value: 10,000) should be increased for volumes of more than 1 million records. The database rollback segment size should also be updated when increasing the commit interval.
Insert/Update/Delete options should be set as determined by the target population method.
Target Option | Integration Service behavior
Insert | Uses the target update option (Update as Update, Update as Insert, or Update else Insert)
Update as Update | Updates all rows as updates
Update as Insert | Inserts all rows
Update else Insert | Updates existing rows, else inserts them
Partition | The maximum number of partitions for a session should be 1.5 times the number of processors on the Informatica server, e.g. 1.5 x 4 processors = 6 partitions.
Partition | Key-value partitions should be used only when an even distribution of data can be obtained; in other cases, pass-through partitions should be used.
Partition | A source filter should be added to distribute the data evenly between pass-through partitions. The key value should contain only numeric values: MOD(NVL(<numeric key value>, 0), <number of partitions defined>), e.g. MOD(NVL(product_sys_no, 0), 6). A sketch of the resulting filters follows below.
Partition | If a session contains N partitions, increase the DTM buffer size to at least N times the value used for the session with one partition.
Partition | If the source or target database is an MPP (Massively Parallel Processing) system, enable Pushdown Optimization. When enabled, the Integration Service pushes as much transformation logic as possible to the source database, the target database, or both (FULL), based on the settings. This property can be ignored for conventional databases.
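For illustration, a minimal sketch of the MOD-based source filters across six pass-through partitions, using the document's example key product_sys_no:
MOD(NVL(product_sys_no, 0), 6) = 0   -- partition 1
MOD(NVL(product_sys_no, 0), 6) = 1   -- partition 2
MOD(NVL(product_sys_no, 0), 6) = 2   -- partition 3
MOD(NVL(product_sys_no, 0), 6) = 3   -- partition 4
MOD(NVL(product_sys_no, 0), 6) = 4   -- partition 5
MOD(NVL(product_sys_no, 0), 6) = 5   -- partition 6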

Thursday, 19 August 2010

Informatica Development Best Practices – Mapping


The following are generally accepted best practices for Informatica PowerCenter ETL development; if implemented, they can significantly improve overall performance.


Category | Technique | Benefits
Source Extracts | Loading data from fixed-width files takes less time than from delimited files, since delimited files require extra parsing. In the case of fixed-width files, the Integration Service knows the start and end position of each column upfront, which reduces the processing time. | Performance Improvement
Source Extracts | Using flat files located on the server machine loads faster than a database located on the server machine. | Performance Improvement
Mapping Designer | There should be a placeholder transformation (Expression) immediately after the source and one before the target. Data type and data width changes are bound to happen during the development phase, and these placeholder transformations preserve the port links between transformations. | Best Practices
Mapping Designer | Connect only the ports that are required in targets to subsequent transformations. Also, active transformations that reduce the number of records should be used as early in the mapping as possible. | Code Optimization
Mapping Designer | If a join must be used in the mapping, select the appropriate driving/master table; the table with the lesser number of rows should be the driving/master table. | Performance Improvement
Transformations | If there are multiple lookup conditions, place the condition with the equality (=) operator first to optimize lookup performance. Also, indexes on the database table should include every column used in the lookup condition. | Code Optimization
Transformations | Persistent caches should be used if the lookup data is not expected to change often. These cache files are saved and can be reused for subsequent runs, eliminating querying the database. | Performance Improvement
Transformations | The Integration Service processes numeric operations faster than string operations. For example, if a lookup is done on a large amount of data on two columns, EMPLOYEE_NAME and EMPLOYEE_ID, configuring the lookup on EMPLOYEE_ID improves performance. | Code Optimization
Transformations | Replace complex filter expressions with a flag (Y/N). Complex logic should be moved to an Expression transformation and the result stored in a port; the Filter then takes less time to evaluate this port than to execute the entire logic in the filter expression. | Best Practices
Transformations | The PowerCenter Server automatically converts between compatible data types, which can slow down performance considerably. For example, if a mapping moves data from an Integer port to a Decimal port and then back to an Integer port, the conversion may be unnecessary. | Performance Improvement
Transformations | Assigning default values to ports and transformation errors written to the session log will always slow down session performance. Try removing default values and eliminating transformation errors. | Performance Improvement
Transformations | Complex joins in Source Qualifiers should be replaced with database views. There won't be any performance gain, but readability improves a lot, and any new condition can be evaluated easily by just changing the view's WHERE clause. | Best Practices
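To illustrate the last point, a minimal sketch of moving a complex Source Qualifier join into a database view; the table, column, and view names are hypothetical, and the mapping's source would then point at the view instead of the joined tables.
-- Hypothetical view replacing a complex Source Qualifier join
CREATE OR REPLACE VIEW V_ORDER_DETAILS AS
SELECT o.order_id,
       o.order_date,
       c.customer_name,
       p.product_name,
       oi.quantity
FROM   orders o
JOIN   order_items oi ON oi.order_id   = o.order_id
JOIN   customers   c  ON c.customer_id = o.customer_id
JOIN   products    p  ON p.product_id  = oi.product_id
WHERE  o.order_date >= TRUNC(SYSDATE) - 30;
Any new condition is then added by changing only the view's WHERE clause, without touching the mapping.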

Thursday, 12 August 2010

Change Data Capture in Informatica


Change data capture (CDC) is an approach or technique to identify changes, and only changes, in the source. I have seen applications built without CDC that later had to implement it at a much higher cost; building an ETL application without CDC is a costly miss and usually a backtracking step. In this article we discuss different methods of implementing CDC.


Scenario #01: Change detection using timestamp on source rows
In this typical scenario the source rows have two extra columns, say row_created_time and last_modified_time. Row_created_time is the time at which the record was first created; last_modified_time is the time at which the record was last modified.
  1. In the mapping, create a mapping variable $$LAST_ETL_RUN_TIME of datetime data type
  2. Evaluate the condition SetMaxVariable($$LAST_ETL_RUN_TIME, SessionStartTime); this step stores the time at which the session started into $$LAST_ETL_RUN_TIME
  3. Use $$LAST_ETL_RUN_TIME in the WHERE clause of the source SQL. During the first run or initial seed the mapping variable has a default value and pulls all the records from the source, e.g.: select * from employee where last_modified_time > '01/01/1900 00:00:000'
  4. Now let us assume the session is run on '01/01/2010 00:00:000' for the initial seed
  5. When the session is executed on '02/01/2010 00:00:000', the SQL would be: select * from employee where last_modified_time > '01/01/2010 00:00:000', thereby pulling only the records that changed between successive runs (a consolidated sketch follows below)
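Putting scenario #01 together, a minimal sketch of the Source Qualifier filter; the column and variable names are the ones used above, and the TO_DATE format mask is an assumption that must match the variable's datetime format:
-- Source filter / SQL override using the mapping variable set by SetMaxVariable
SELECT *
FROM employee
WHERE last_modified_time > TO_DATE('$$LAST_ETL_RUN_TIME', 'MM/DD/YYYY HH24:MI:SS')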
Scenario #02: Change detection using load_id or run_id
Under this scenario the source rows have a column, say load_id, a positive running number. The load_id is updated as and when the record is updated.
  1. In the mapping, create a mapping variable $$LAST_READ_LOAD_ID of integer data type
  2. Evaluate the condition SetMaxVariable($$LAST_READ_LOAD_ID, load_id); the maximum load_id is stored into the mapping variable
  3. Use $$LAST_READ_LOAD_ID in the WHERE clause of the source SQL. During the first run or initial seed the mapping variable has a default value and pulls all the records from the source, e.g.: select * from employee where load_id > 0. Assuming all records during the initial seed have load_id = 1, the mapping variable stores '1' into the repository.
  4. Now let us assume the session is run after five loads into the source; the SQL would be: select * from employee where load_id > 1, thereby limiting the source read to only the records that have changed after the initial seed
  5. Consecutive runs take care of updating the mapping variable and pulling the delta in sequence (see the sketch below)
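Similarly for scenario #02, a minimal sketch of the Source Qualifier filter using the mapping variable defined above:
-- Incremental read driven by the $$LAST_READ_LOAD_ID mapping variable
SELECT *
FROM employee
WHERE load_id > $$LAST_READ_LOAD_ID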
In the next blog we will see how to implement CDC when reading from Salesforce.com.

Wednesday, 3 March 2010

Processing Multiple XML Files through Informatica – 1


Problem Statement: The data to be processed in Informatica arrived as XML files. The number of XML files to be processed was dynamic, and the XML file name from which each record was processed also had to be captured.

Resolution:
Option 1 – Using File list as part of Indirect File Sources in session
Option 2 – Using Parameter File and workflow variable

Implementation details for option 1: Using a file list
The XML file names to be processed were read using a batch script, and a file list was created containing the XML file names. This file list name was set under the source properties at the session level. The XML files were read sequentially and the data in each XML file was processed. Since the number of XML files to be processed was dynamic, the need of the hour was to achieve looping in Informatica.
Challenge in using a file list: a file list is created in a session to run multiple source files through one source instance in the mapping. When a file list is used as multiple source files for one source instance, the properties of all files must match the source definition. File lists are configured in the session properties by entering the file name of the file list in the Source Filename field and the location of the file list in the Source File Directory field. When the session starts, the Integration Service reads the file list, then locates and reads the first source file in the list; after reading the first file, it locates and reads the next file in the list. The issue with using XML file names in a file list was further compounded by Informatica grouping records belonging to the same XML node together, which made it difficult to identify which record belonged to which XML file.
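For illustration, the file list (indirect file) named in the Source Filename field is just a plain text file with one XML file path per line; these paths are hypothetical:
/data/xml/incoming/orders_20100301_01.xml
/data/xml/incoming/orders_20100301_02.xml
/data/xml/incoming/orders_20100301_03.xml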
Batch script: batch scripts controlled the overall looping in Informatica by carrying out the tasks below:
• Reading XML file names from the staging location and creating a file list containing the XML file names.
• Moving XML files from the staging location to the archive location.
• Verifying whether there are any more XML files to be processed and, depending on the outcome, either looping the process by invoking the first workflow or ending the process.
• Invoking the appropriate workflows using pmcmd commands.
Workflow details:
There were two Informatica workflows designed to achieve the looping:
• First workflow: creates the indirect file used as the source in the session properties and triggers the second workflow. Details of the workflow are:
o A command task executes a DOS batch script which creates the indirect file after reading XML file names from a pre-defined location on the server.
o A command task executes the second workflow to process the data within the XML files.
• Second workflow: reads and processes the XML files and populates the staging tables. Details of the workflow are:
o A session reads the XML files using the indirect file and loads the data into the staging tables.
o A command task moves the XML file just processed into an archive folder, using a batch script.
o A command task executes a batch script which:
- Checks whether there are any more XML files to be processed.
- If yes, it triggers the first workflow again; this ensures all XML files are processed and loaded into the staging tables.
- If no, the process completes.
Thanks for reading; please let me know if you have faced any similar situation.