Informatica Way: 2009

Monday, 14 September 2009

Merge Rows as Columns / Transpose records

Requirement: Converting rows to columns

Customer	Product	Cost
Cust1	P1	10
Cust1	P2	20
Cust1	P3	30
Cust2	ABC	10
Cust2	P2	25
Cust2	Def	10

Customer	Product1		Cost1	Product2	Cost2	Product3	Cost3
Cust1	P1	10		P2	20	P3	30
Cust2	ABC	10		P2	25	Def	10

The illustration above shows the requirement: multiple records had to be merged into one record based on certain criteria (here, the Customer key). The design had to be reusable, since each dimension within the data mart required the same flattening logic.
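As a point of reference outside Informatica, the requirement itself can be sketched in a few lines of Python. This is only an illustration of the expected input and output, not the mapplet design described below; the function name `flatten` is made up for this sketch.

```python
from collections import OrderedDict


def flatten(rows):
    """Merge (customer, product, cost) rows into one wide record per customer."""
    grouped = OrderedDict()
    for customer, product, cost in rows:
        # Append this row's Product/Cost pair to the customer's wide record.
        grouped.setdefault(customer, []).extend([product, cost])
    return [[customer] + values for customer, values in grouped.items()]


rows = [
    ("Cust1", "P1", 10), ("Cust1", "P2", 20), ("Cust1", "P3", 30),
    ("Cust2", "ABC", 10), ("Cust2", "P2", 25), ("Cust2", "Def", 10),
]

for record in flatten(rows):
    print(record)
```

Each printed list corresponds to one row of the second table above: the customer followed by its Product/Cost pairs in arrival order.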

1. Approach:

An Aggregator transformation can group the records by a key, but retrieving the values of a particular column as individual columns is a challenge. We therefore designed a component, ‘Flattener’, based on the Expression transformation.

Flattener is a reusable component, a mapplet that performs the function of flattening records.

Flattener consists of an Expression and a Filter transformation. The Expression combines each incoming record with the ones before it based on the flattening logic. The Filter transformation decides whether the record is written to the target.

2. Design:

The mapplet can receive up to five inputs, of the following data types:

i_Col1 (string), Customer

i_Col2 (string), Product

i_Col3 (decimal), Cost

i_Col4 (decimal) and

i_Col5 (date/time)

The port names are kept generic and cover different data types, so that the mapplet can be used in any scenario where records need to be flattened.

The mapplet returns 15 sets of 5 output ports each (15×5), in the following manner:

o_F1_1 (string), Customer

o_F2_1 (string), Product1

o_F3_1 (decimal), Cost1

o_F4_1 (decimal) and

o_F5_1 (date/time)

o_F1_2 (string), Customer

o_F2_2 (string), Product2

o_F3_2 (decimal), Cost2

o_F4_2 (decimal) and

o_F5_2 (date/time)

… … and so on

The output record has repeating sets of 5 columns each (each set refers to one incoming row). The number of sets can be increased if the requirement calls for it. Only the required fields need to be mapped. For the above example we map just two strings and one decimal, for Customer, Product and Cost.

The mapplet receives records from its parent mapping. The Expression first saves each incoming value to a variable and compares it with its counterpart from the previous row, which is held in the Expression's cache for as long as the condition to flatten is satisfied.

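Syntax to store current and previous values (prv_Col2 is listed before curr_Col2 so that it picks up the previous row's value before curr_Col2 is overwritten):

Port	Datatype	Port type	Expression
i_Col2	string	input (i)	
prv_Col2	string	variable (v)	curr_Col2
curr_Col2	string	variable (v)	i_Col2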

The condition/logic to flatten records is parameterized and decided before the mapping is called, which makes the code easier to reuse. The parameterized logic is passed to the Expression transformation via a mapplet parameter. The value is evaluated as an expression and the result is a flag value, either ‘1’ or ‘2’.

Syntax for port – flag

Flag integer v $$Expr_compare

An example for parameterized expression

$$Expr_compare = iif (curr_Col1 = prv_Col1 AND curr_Col2 != prv_Col2, 1, iif (curr_Col1 != prv_Col1, 2))

A variable port named “rec_count” is incremented, based on the flag.

Syntax for port – rec_count

rec_count integer v iif (flag=2,0, iif (flag=1,rec_count + 1,rec_count))
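The row-by-row behaviour of the “flag” and “rec_count” ports can be traced with a small Python simulation. This is a sketch under assumptions: the rows arrive sorted by the key (Col1), the previous values start out empty, and a flag of 0 stands for the case where neither condition of the example $$Expr_compare holds. The function name is made up for this sketch.

```python
def flag_and_count(rows):
    """Simulate the variable ports prv_*/curr_*, flag and rec_count per row."""
    prv_col1 = prv_col2 = None  # assumed initial state before the first row
    rec_count = 0
    trace = []
    for curr_col1, curr_col2 in rows:
        # Mirrors the example $$Expr_compare expression.
        if curr_col1 == prv_col1 and curr_col2 != prv_col2:
            flag = 1  # same key, new value: keep flattening
        elif curr_col1 != prv_col1:
            flag = 2  # key changed: a new output record starts
        else:
            flag = 0  # neither condition holds (assumed default)
        # Mirrors the rec_count expression.
        if flag == 2:
            rec_count = 0
        elif flag == 1:
            rec_count += 1
        trace.append((curr_col1, curr_col2, flag, rec_count))
        # The current row becomes the "previous" row for the next iteration.
        prv_col1, prv_col2 = curr_col1, curr_col2
    return trace


rows = [("Cust1", "P1"), ("Cust1", "P2"), ("Cust1", "P3"),
        ("Cust2", "ABC"), ("Cust2", "P2"), ("Cust2", "Def")]

for step in flag_and_count(rows):
    print(step)
```

In this trace rec_count acts as the index of the output set (0 for Product1/Cost1, 1 for Product2/Cost2, and so on), and a flag of 2 marks the row at which a new customer begins.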

The Expression transformation now uses the values in the ports “flag” and “rec_count” to decide the placeholder for each incoming value, i.e. the target column into which the value will ultimately move. The process repeats as long as the comparison logic ($$Expr_compare) holds, i.e. until all records of the group are flattened. An example of the placeholder expression is shown below:

v_Field1 data type v iif(flag=2 AND rec_count=0,curr_Col1, v_Field1)

Port “write_flag_1” is set to 1 when the comparison logic fails (meaning flattening of the group is complete). The Filter transformation passes the record on only once it is completely transposed.

Filter condition:

write_flag_1 integer v iif (flag=2 AND write_flag>1 ,1,0)

3. Outcome:

After developing and implementing the code we found it to be a useful utility, so we thought of sharing it. We would like to hear suggestions from readers on achieving the same functionality in a different way. Please do share your views.

Wednesday, 2 September 2009

Process Control / Audit of Workflows in Informatica

1. Process Control – Definition
Process control, or auditing, of a workflow in Informatica means capturing job information such as start time, end time, read count, insert count, update count and delete count. This information is captured and written to a table as the workflow executes.

2. Structure of Process Control/Audit table
The structure of the process control table is given below.
Table 1: Process Control structure

Column	Data type	Length	Description
PROCESS_RUN_ID	Number(p,s)	11	A unique number used to identify a specific process run.
PROCESS_NME	Varchar2	120	The name of the process (this column will be populated with the names of the Informatica mappings).
START_TMST	Date	19	The date/time when the process started.
END_TMST	Date	19	The date/time when the process ended.
ROW_READ_CNT	Number(p,s)	16	The number of rows read by the process.
ROW_INSERT_CNT	Number(p,s)	16	The number of rows inserted by the process.
ROW_UPDATE_CNT	Number(p,s)	16	The number of rows updated by the process.
ROW_DELETE_CNT	Number(p,s)	16	The number of rows deleted by the process.
ROW_REJECT_CNT	Number(p,s)	16	The number of rows rejected by the process.
USER_ID	Varchar2	32	The ETL user identifier associated with the process.

3. Mapping Logic and Build Steps
The process control flow has two data flows: an insert flow and an update flow. The insert flow runs before the main mapping and the update flow runs after it; this order is set in the “Target Load Plan”. The source for both flows can be a dummy source that returns one record, for example select ‘process’ from dual or select count(1) from Table_A. The following mapping parameters and variables are to be created:

Table 2: Mapping Parameter and variables

$$PROCESS_RUN_ID

$$PROCESS_NAME

$$READ_COUNT

$$INSERT_COUNT

$$UPDATE_COUNT

$$DELETE_COUNT

$$REJECT_COUNT

Steps to create Insert flow:

1. Have “select ‘process’ from dual” as the SQL query in the Source Qualifier
2. Have a Sequence Generator to create running process_run_id values
3. In an Expression, call SetVariable ($$PROCESS_RUN_ID, NEXTVAL) and assign $$PROCESS_NAME to o_process_name, an output-only field
4. In the Expression, assign $$SessionStarttime to o_Starttime, an output-only field
5. In the Expression, accept the sequence id from the Sequence Generator
6. Insert into the target ‘Process Control Table’ with all the above three values

Table 3: Process Control Image after Insert flow

PROCESS_RUN_ID	1
PROCESS_NME	VENDOR_DIM_LOAD
START_TMST	8/23/2009 12:23
END_TMST
ROW_READ_CNT
ROW_INSERT_CNT
ROW_UPDATE_CNT
ROW_DELETE_CNT
ROW_REJECT_CNT
USER_ID	INFA8USER

Steps in the main mapping:

1. After the Source Qualifier, use an Expression to increment a variable (v_read_count) for each record read, and call SetMaxVariable ($$READ_COUNT, v_read_count)
2. Before the Update Strategy of each target instance, do the same for the insert/update/delete counts; all the variables are then set with their respective counts

Steps to create Update flow:

1. Have “select ‘process’ from dual” as the SQL query in the Source Qualifier
2. Use SetMaxVariable to get the process_run_id created in the insert flow
3. In an Expression, assign $$INSERT_COUNT to o_insert_count, an output-only field; assign all the other counts in the same way
4. In the Expression, assign $$SessionEndtime to o_Endtime, an output-only field
5. Update the target ‘Process Control Table’ with all the above values where process_run_id equals the process_run_id generated in the insert flow

Table 4: Process Control Image after Update flow

PROCESS_RUN_ID	1
PROCESS_NME	VENDOR_DIM_LOAD
START_TMST	8/23/2009 12:23
END_TMST	8/23/2009 12:30
ROW_READ_CNT	1000
ROW_INSERT_CNT	900
ROW_UPDATE_CNT	60
ROW_DELETE_CNT	40
ROW_REJECT_CNT	0
USER_ID	INFA8USER
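The two flows boil down to one INSERT at the start of the run and one UPDATE at the end. The sketch below replays them against an in-memory SQLite table from Python, purely to make the sequence concrete. It is not the Informatica mapping itself: the function names are made up, SQLite's auto-assigned integer key stands in for the Sequence Generator, and the column types are simplified.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE process_control (
        process_run_id  INTEGER PRIMARY KEY,  -- stands in for the Sequence Generator
        process_nme     TEXT,
        start_tmst      TEXT,
        end_tmst        TEXT,
        row_read_cnt    INTEGER,
        row_insert_cnt  INTEGER,
        row_update_cnt  INTEGER,
        row_delete_cnt  INTEGER,
        row_reject_cnt  INTEGER,
        user_id         TEXT
    )""")


def insert_flow(conn, process_name, user_id, start_time):
    """Runs before the main mapping: create the audit row, return its run id."""
    cur = conn.execute(
        "INSERT INTO process_control (process_nme, start_tmst, user_id) "
        "VALUES (?, ?, ?)",
        (process_name, start_time, user_id))
    conn.commit()
    return cur.lastrowid


def update_flow(conn, run_id, end_time, read, inserted, updated, deleted, rejected):
    """Runs after the main mapping: fill in the end time and the row counts."""
    conn.execute(
        "UPDATE process_control SET end_tmst = ?, row_read_cnt = ?, "
        "row_insert_cnt = ?, row_update_cnt = ?, row_delete_cnt = ?, "
        "row_reject_cnt = ? WHERE process_run_id = ?",
        (end_time, read, inserted, updated, deleted, rejected, run_id))
    conn.commit()


run_id = insert_flow(conn, "VENDOR_DIM_LOAD", "INFA8USER", "2009-08-23 12:23")
# ... the main mapping would run here and accumulate the counts ...
update_flow(conn, run_id, "2009-08-23 12:30", 1000, 900, 60, 40, 0)

row = conn.execute(
    "SELECT * FROM process_control WHERE process_run_id = ?", (run_id,)).fetchone()
print(row)
```

After `insert_flow` the row matches Table 3 (counts and end time still empty); after `update_flow` it matches Table 4.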

4. Merits over Informatica Metadata
This information is also available in the Informatica metadata; however, maintaining it within our own system has the following benefits:

No need to write complex queries to bring in the data from the metadata tables
Job names need not be mapping names and can be user-friendly names
Insert/delete/update counts can be audited for all targets together as well as for individual targets
This audit information can be maintained outside the metadata security level and can be used by other mappings in their transformations
Can be used by mappings that build parameter files
Can be used by mappings that govern data volume
Can be used by production support to quickly find the status of a load


Monday, 11 May 2009

Informatica Upgrade Challenge –Default SQL Join for a Source Qualifier in 7x vs. 8x

Default SQL Query Generation for a Source Qualifier:

When relational sources are joined in one Source Qualifier transformation, the PowerCenter Server joins the tables based on the related keys in each table. This default join is an equijoin of the form:

Source1.column_name = Source2.column_name

For Default joins to work, the columns in the default join must have:

A primary key-foreign key relationship
Matching data types

Today, most data warehouses are designed so that primary key – foreign key relationships are enforced in the logic rather than on the physical tables. Where fact tables are joined with dimension tables, the developer typically writes the join condition explicitly in the User Defined Join property of the Source Qualifier. The same result can be achieved with default joins by creating the relationships between the tables in Informatica instead of creating them physically on the tables.

Creating relationships between tables in Informatica is simple: drag and drop the column from one source definition to the other in the Source Analyzer.

PowerCenter Server and SQL Query Generation

When a session is executed, the PowerCenter Server has two options:

1. If the ‘SQL Query’ property is not blank, use the SQL query typed by the developer.
2. If the ‘SQL Query’ property is blank, generate a query for each Source Qualifier transformation when the session runs.

The SQL query generation process for option 2 is a bit different in PowerCenter 7x and 8x.

The Default query from Powercenter 7x is built in the below order

SELECT keyword
Field/Port Names which are linked to the next transformation from Source Qualifier
FROM Keyword
List of table names from the source definitions connected to the Source Qualifier separated by Comma
WHERE Keyword
[Value Present in the “User Defined Join” property ]
[AND Keyword] combined with Default Join Condition formed by Powercenter based on the relationship (If the User Defined Join is not present)
[AND Keyword] combined with Value present in the “Source Filter” property
[ORDER BY keyword By Default, It selects the first field which is being selected after the SELECT clause.]

Whereas in PowerCenter 8x, the default query is built in the below order

SELECT keyword
Field/Port Names which are linked to the next transformation from Source Qualifier
FROM Keyword
List of table names from the source definitions connected to the Source Qualifier separated by Comma
WHERE Keyword
[Value Present in the “User Defined Join” property ]
[AND Keyword] combined with Value present in the “Source Filter” property
[AND Keyword] combined with Default Join Condition formed by Powercenter based on the relationship (If the User Defined Join is not present)
[ORDER BY keyword By Default, It selects the first field which is being selected after the SELECT clause.]

In 8x the default join condition is appended after the Source Filter, whereas in 7x it is appended before the Source Filter.
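The effect of the ordering can be reproduced with plain string assembly. The Python sketch below is not PowerCenter's actual generator: the function names, table names and filter text are made up, and it assumes the pieces of the WHERE clause are appended on the same line, which is what the symptom described in this post suggests. The comment stripper is deliberately naive and ignores string literals.

```python
def build_default_query(version, columns, tables, default_join, source_filter):
    """Assemble the WHERE clause in the order described for 7x or 8x."""
    if version == "7x":
        conditions = [default_join, source_filter]  # join first, then filter
    elif version == "8x":
        conditions = [source_filter, default_join]  # filter first, then join
    else:
        raise ValueError("version must be '7x' or '8x'")
    return "SELECT {} FROM {} WHERE {}".format(
        ", ".join(columns), ", ".join(tables), " AND ".join(conditions))


def strip_line_comments(sql):
    """Drop everything after '--' on each line, as the database would."""
    return "\n".join(line.split("--", 1)[0].rstrip() for line in sql.splitlines())


columns = ["FACT.AMOUNT", "DIM.DIM_NAME"]
tables = ["FACT", "DIM"]
default_join = "FACT.DIM_ID = DIM.DIM_ID"
# Hypothetical source filter whose last line is commented out.
source_filter = "FACT.AMOUNT > 0\n-- AND FACT.REGION = 'EU'"

query_7x = build_default_query("7x", columns, tables, default_join, source_filter)
query_8x = build_default_query("8x", columns, tables, default_join, source_filter)

print(strip_line_comments(query_7x))
print(strip_line_comments(query_8x))
```

In the 7x ordering the join condition survives comment stripping; in the 8x ordering it lands on the commented line and disappears, leaving FACT and DIM unjoined.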

I came across an issue in a recent upgrade project because of this difference in behavior. A mapping that ran properly in 7x and extracted the required data from the source ran into a problem in 8x: the upgraded mapping generated a Cartesian join. On analysis we found that the last line of the source filter was commented out with ‘--’. In 8x this caused the default join condition, which is appended after the filter, to be commented out as well, resulting in a Cartesian product of the source tables.

So the key is to determine how many of the Informatica mappings/sessions have a Source Filter property ending in a ‘--’ comment; this helps identify the issue much earlier in the upgrade.
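A check of this kind is easy to script once the Source Filter texts have been exported by whatever means is available. The Python sketch below assumes the filters are already in a dictionary keyed by session name; the session names and filter texts are hypothetical, and the check does not try to recognise a `--` that sits inside a string literal.

```python
def filter_ends_in_comment(source_filter):
    """True if the last non-blank line of the filter contains a '--' comment."""
    lines = [line for line in source_filter.splitlines() if line.strip()]
    return bool(lines) and "--" in lines[-1]


# Hypothetical export of Source Filter properties, keyed by session name.
source_filters = {
    "s_load_sales": "FACT.AMOUNT > 0\n-- AND FACT.REGION = 'EU'",
    "s_load_vendor": "-- active vendors only\nVENDOR.ACTIVE_FLG = 'Y'",
    "s_load_orders": "ORDERS.STATUS = 'OPEN' -- open orders",
    "s_load_items": "",
}

at_risk = sorted(name for name, text in source_filters.items()
                 if filter_ends_in_comment(text))
print(at_risk)
```

Only filters whose final line carries the comment are flagged, because that is the line to which the 8x default join condition would be appended.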

Thanks for reading. Please share any other upgrade challenges that you have faced.


Wednesday, 22 April 2009

Informatica and Oracle hints in SQL overrides

Hints in a SQL statement send instructions to the Oracle optimizer and can shorten query processing time. Can we make use of these hints in SQL overrides within our Informatica mappings to improve query performance?

On a general note any Informatica help material would suggest: you can enter any valid SQL statement supported by the source database in a SQL override of a Source qualifier or a Lookup transformation or at the session properties level.

While using hints in a Source Qualifier has no complications, using them in a Lookup SQL override gets a bit tricky. A forward slash followed by an asterisk (“/*”) in a Lookup SQL override (generally used for comments in SQL and also for Oracle hints) results in session failure with an error like:

TE_7017 : Failed to Initialize Server Transformation lkp_transaction

2009-02-19 12:00:56 : DEBUG : (18785 | MAPPING) : (IS | Integration_Service_xxxx) : node01_UAT-xxxx : DBG_21263 : Invalid lookup override

SELECT SALES. SALESSEQ as SalesId, SALES.OrderID as ORDERID, SALES.OrderDATE as ORDERDATE FROM SALES, AC_SALES WHERE AC_SALES. OrderSeq >= (Select /*+ FULL(AC_Sales) PARALLEL(AC_Sales,12) */ min(OrderSeq) From AC_Sales)

This is because Informatica’s parser fails to recognize these special characters when they are used in a Lookup override. Starting with the PowerCenter 7.1.3 release, a parameter is available that enables the use of the forward slash, and therefore of hints.

§ Infa 7.x

1. Using a text editor open the PowerCenter server configuration file (pmserver.cfg).

2. Add the following entry at the end of the file:

LookupOverrideParsingSetting=1

3. Re-start the PowerCenter server (pmserver).

§ Infa 8.x

1. Connect to the Administration Console.

2. Stop the Integration Service.

3. Select the Integration Service.

4. Under the Properties tab, click Edit in the Custom Properties section.

5. Under Name enter LookupOverrideParsingSetting

6. Under Value enter 1.

7. Click OK.

8. Start the Integration Service.

§ Starting with PowerCenter 8.5, this change could be done at the session task itself as follows:

1. Edit the session.

2. Select Config Object tab.

3. Under Custom Properties add the attribute LookupOverrideParsingSetting and set the Value to 1.

4. Save the session.
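To decide which sessions need the LookupOverrideParsingSetting property, it can help to list the Lookup overrides that contain “/*” at all. The Python sketch below assumes the override text has already been extracted as a string; the helper names are made up, and the regular expression only recognises the `/*+ ... */` hint form.

```python
import re

# Matches Oracle-style hints of the form /*+ ... */ (non-greedy, may span lines).
HINT_PATTERN = re.compile(r"/\*\+(.*?)\*/", re.DOTALL)


def find_oracle_hints(sql):
    """Return the text of every /*+ ... */ hint found in the statement."""
    return [hint.strip() for hint in HINT_PATTERN.findall(sql)]


def has_block_comment(sql):
    """True if the statement contains '/*', whether as a hint or a comment."""
    return "/*" in sql


lookup_override = (
    "SELECT SALES.SALESSEQ as SalesId, SALES.OrderID as ORDERID, "
    "SALES.OrderDATE as ORDERDATE FROM SALES, AC_SALES "
    "WHERE AC_SALES.OrderSeq >= (Select /*+ FULL(AC_Sales) "
    "PARALLEL(AC_Sales,12) */ min(OrderSeq) From AC_Sales)"
)

print(has_block_comment(lookup_override))
print(find_oracle_hints(lookup_override))
```

Any override for which `has_block_comment` returns True is a candidate for the failure shown above unless the custom property is set.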

Thanks for reading this blog.

Thursday, 19 March 2009

Informatica PowerCenter 8x Key Concepts – 6

6. Integration Service (IS)

The key functions of the IS are:

Interpretation of the workflow and mapping metadata from the repository
Execution of the instructions in the metadata
Management of the data from source system to target system within memory and on disk

The three main components of the Integration Service that enable data movement are:

Integration Service Process
Load Balancer
Data Transformation Manager

6.1 Integration Service Process (ISP)

The Integration Service starts one or more Integration Service processes to run and monitor workflows. When we run a workflow, the ISP starts and locks the workflow, runs the workflow tasks, and starts the process to run sessions. The functions of the Integration Service Process are,

Locks and reads the workflow
Manages workflow scheduling, i.e. maintains session dependency
Reads the workflow parameter file
Creates the workflow log
Runs workflow tasks and evaluates the conditional links
Starts the DTM process to run the session
Writes historical run information to the repository
Sends post-session emails

6.2 Load Balancer

The Load Balancer dispatches tasks to achieve optimal performance. It dispatches tasks to a single node or across the nodes in a grid after performing a sequence of steps. Before understanding these steps we have to know about Resources, Resource Provision Thresholds, Dispatch Modes and Service Levels.

Resources – we can configure the Integration Service to check the resources available on each node and match them with the resources required to run the task. For example, if a session uses an SAP source, the Load Balancer dispatches the session only to nodes where the SAP client is installed
Resource Provision Thresholds – there are three: Maximum CPU Run Queue Length, the maximum number of runnable threads waiting for CPU resources on the node; Maximum Memory %, the maximum percentage of virtual memory allocated on the node relative to the total physical memory size; and Maximum Processes, the maximum number of running Session and Command tasks allowed for each Integration Service process running on the node
Dispatch Modes – there are three. Round-Robin: the Load Balancer dispatches tasks to available nodes in a round-robin fashion after checking the “Maximum Processes” threshold. Metric-based: checks all three resource provision thresholds and dispatches tasks in round-robin fashion. Adaptive: checks all three resource provision thresholds and also ranks nodes according to current CPU availability
Service Levels – establish priority among tasks that are waiting to be dispatched. The three components of a service level are Name, Dispatch Priority and Maximum Dispatch Wait Time. “Maximum Dispatch Wait Time” is the amount of time a task can wait in the queue, which ensures that no task waits forever

A. Dispatching tasks on a node

The Load Balancer checks different resource provision thresholds on the node depending on the Dispatch mode set. If dispatching the task causes any threshold to be exceeded, the Load Balancer places the task in the dispatch queue, and it dispatches the task later
The Load Balancer dispatches all tasks to the node that runs the master Integration Service process

B. Dispatching tasks on a grid

The Load Balancer verifies which nodes are currently running and enabled
The Load Balancer identifies nodes that have the PowerCenter resources required by the tasks in the workflow
The Load Balancer verifies that the resource provision thresholds on each candidate node are not exceeded. If dispatching the task causes a threshold to be exceeded, the Load Balancer places the task in the dispatch queue, and it dispatches the task later
The Load Balancer selects a node based on the dispatch mode
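The node selection described above can be modelled in a few lines of Python. This is a conceptual sketch only, not how the Load Balancer is implemented: the `Node` fields, the `cpu_idle_pct` measure of “CPU availability” and the function names are all assumptions made for illustration, and resource matching and service levels are left out.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    running_processes: int
    max_processes: int        # Maximum Processes threshold
    cpu_run_queue: float
    max_cpu_run_queue: float  # Maximum CPU Run Queue Length threshold
    memory_pct: float
    max_memory_pct: float     # Maximum Memory % threshold
    cpu_idle_pct: float       # assumed stand-in for "current CPU availability"


def within_thresholds(node, mode):
    """Round-robin checks only Maximum Processes; the other modes check all three."""
    ok = node.running_processes + 1 <= node.max_processes
    if mode in ("metric-based", "adaptive"):
        ok = (ok and node.cpu_run_queue <= node.max_cpu_run_queue
              and node.memory_pct <= node.max_memory_pct)
    return ok


def select_node(nodes, mode, next_index=0):
    """Return the chosen node, or None if the task must wait in the dispatch queue."""
    count = len(nodes)
    # Visit nodes in round-robin order starting at next_index.
    ordered = [nodes[(next_index + i) % count] for i in range(count)]
    candidates = [node for node in ordered if within_thresholds(node, mode)]
    if not candidates:
        return None
    if mode == "adaptive":
        # Rank the eligible nodes by CPU availability.
        return max(candidates, key=lambda node: node.cpu_idle_pct)
    return candidates[0]


nodes = [
    Node("node_a", 10, 10, 2.0, 10.0, 50.0, 150.0, 80.0),  # no free process slot
    Node("node_b", 2, 10, 12.0, 10.0, 50.0, 150.0, 10.0),  # CPU run queue too long
    Node("node_c", 3, 10, 4.0, 10.0, 60.0, 150.0, 40.0),
    Node("node_d", 5, 10, 1.0, 10.0, 40.0, 150.0, 70.0),
]

for mode in ("round-robin", "metric-based", "adaptive"):
    chosen = select_node(nodes, mode)
    print(mode, chosen.name if chosen else None)
```

With these made-up figures the three modes pick three different nodes, which shows how much the choice of dispatch mode can change where a task lands.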

6.3 Data Transformation Manager (DTM) Process

When the workflow reaches a session, the Integration Service Process starts the DTM process. The DTM is the process associated with the session task. The DTM process performs the following tasks:

Retrieves and validates session information from the repository.
Validates source and target code pages.
Verifies connection object permissions.
Performs pushdown optimization when the session is configured for pushdown optimization.
Adds partitions to the session when the session is configured for dynamic partitioning.
Expands the service process variables, session parameters, and mapping variables and parameters.
Creates the session log.
Runs pre-session shell commands, stored procedures, and SQL.
Sends a request to start worker DTM processes on other nodes when the session is configured to run on a grid.
Creates and runs mapping, reader, writer, and transformation threads to extract, transform, and load data
Runs post-session stored procedures, SQL, and shell commands and sends post-session email
After the session is complete, reports execution result to ISP

Pictorial Representation of Workflow execution:

A PowerCenter Client requests the IS to start a workflow
IS starts ISP
ISP consults LB to select node
ISP starts DTM in node selected by LB

Thanks for reading this blog.

Friday, 16 January 2009

Informatica PowerCenter 8x Key Concepts – 5

5. Repository Service

We have already discussed the metadata repository. We now discuss the Repository Service, a separate, multi-threaded process that retrieves, inserts and updates metadata in the repository database tables.
The Repository Service manages connections to the PowerCenter repository from PowerCenter client applications like Designer, Workflow Manager, Monitor, Repository Manager, console and Integration Service. The Repository Service is responsible for ensuring the consistency of metadata in the repository.

Creation & Properties:

Use the PowerCenter Administration Console Navigator window to create a Repository Service. The properties needed to create are,

Service Name – name of the service like rep_SalesPerformanceDev

Location – Domain and folder where the service is created

License – license service name

Node, Primary Node & Backup Nodes – Node on which the service process runs

CodePage – The Repository Service uses the character set encoded in the repository code page when writing data to the repository

Database type & details – Type of database, username, password, connect string and tablespace name

The above properties are sufficient to create a repository service, however we can take a look at following features which are important for better performance and maintenance.

General Properties

> OperatingMode: Values are Normal and Exclusive. Use Exclusive mode to perform administrative tasks like enabling version control or promoting local to global repository

> EnableVersionControl: Creates a versioned repository

Node Assignments: the “High availability option” is a licensed feature that allows us to choose Primary & Backup nodes for continuous running of the Repository Service. Under a normal license we would see only one node to select from

Database Properties

> DatabaseArrayOperationSize: Number of rows to fetch each time an array database operation is issued, such as insert or fetch. Default is 100

> DatabasePoolSize: Maximum number of connections to the repository database that the Repository Service can establish. If the Repository Service tries to establish more connections than specified for DatabasePoolSize, it times out the connection attempt after the number of seconds specified for DatabaseConnectionTimeout

Advanced Properties

> CommentsRequiredForCheckin: Requires users to add comments when checking in repository objects.

> Error Severity Level: Level of error messages written to the Repository Service log. Specify one of the following message levels: Fatal, Error, Warning, Info, Trace & Debug

> EnableRepAgentCaching: Enables repository agent caching. Repository agent caching provides optimal performance of the repository when you run workflows. When you enable repository agent caching, the Repository Service process caches metadata requested by the Integration Service. Default is Yes.

> RACacheCapacity: Number of objects that the cache can contain when repository agent caching is enabled. You can increase the number of objects if there is available memory on the machine running the Repository Service process. The value must be between 100 and 10,000,000,000. Default is 10,000

> AllowWritesWithRACaching: Allows you to modify metadata in the repository when repository agent caching is enabled. When you allow writes, the Repository Service process flushes the cache each time you save metadata through the PowerCenter Client tools. You might want to disable writes to improve performance in a production environment where the Integration Service makes all changes to repository metadata. Default is Yes.

Environment Variables

The database client code page on a node is usually controlled by an environment variable. For example, Oracle uses NLS_LANG, and IBM DB2 uses DB2CODEPAGE. All Integration Services and Repository Services that run on this node use the same environment variable. You can configure a Repository Service process to use a different value for the database client code page environment variable than the value set for the node.

You might want to configure the code page environment variable for a Repository Service process when the Repository Service process requires a different database client code page than the Integration Service process running on the same node.

For example, the Integration Service reads from and writes to databases using the UTF-8 code page. The Integration Service requires that the code page environment variable be set to UTF-8. However, you have a Shift-JIS repository that requires that the code page environment variable be set to Shift-JIS. Set the environment variable on the node to UTF-8. Then add the environment variable to the Repository Service process properties and set the value to Shift-JIS.
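Why the client code page matters can be seen by encoding the same text under both code pages. The Python snippet below is independent of Informatica and only illustrates the byte-level difference; the sample string is arbitrary.

```python
text = "日本語"

utf8_bytes = text.encode("utf-8")
sjis_bytes = text.encode("shift_jis")

# The same three characters occupy a different number of bytes per code page.
print(len(utf8_bytes), len(sjis_bytes))

# Reading UTF-8 bytes as if they were Shift-JIS does not give back the text.
misread = utf8_bytes.decode("shift_jis", errors="replace")
print(misread == text)
```

A process that writes with one code page while the database client assumes the other would store or read back characters incorrectly in the same way, which is why the Repository Service process can be given its own value for the environment variable.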

