Monday 30 August 2010

Informatica Development Best Practice – Workflow


Workflow Manager default properties can be modified to improve overall performance, and a few of them are listed below. These properties can impact the ETL runtime directly and need to be configured based on:

i)  Source Database
ii) Target Database
iii) Data Volume


Session Properties
While loading staging tables for FULL loads, the "Truncate target table" option should be checked. Based on the target database and whether a primary key is defined, the Integration Service fires a TRUNCATE or DELETE statement:

Database       Primary Key Defined          No Primary Key
DB2            TRUNCATE                     TRUNCATE
INFORMIX       DELETE                       DELETE
ODBC           DELETE                       DELETE
ORACLE         DELETE UNRECOVERABLE         TRUNCATE
MSSQL          DELETE                       TRUNCATE
SYBASE         TRUNCATE                     TRUNCATE

The "Commit Interval" property (default value: 10,000) should be increased for volumes of more than 1 million records. The database rollback segment size should also be increased when raising the "Commit Interval".
Insert/Update/Delete options should be set as determined by the target population method.
Target Option            Integration Service
Insert                   Uses the target update option:
                         Update as Update / Update as Insert / Update else Insert
Update as Update         Updates all rows as Update
Update as Insert         Inserts all rows
Update else Insert       Updates existing rows, else inserts them
Partition
The maximum number of partitions for a session should be about 1.5 times the number of processors on the Informatica server, e.g. 1.5 x 4 processors = 6 partitions.
Key-value partitions should be used only when an even distribution of data can be obtained. In other cases, pass-through partitions should be used.
A source filter should be added to distribute the data evenly between pass-through partitions. The key value should contain ONLY numeric values: MOD(NVL(<numeric key value>, 0), <number of partitions defined>), e.g. MOD(NVL(product_sys_no, 0), 6). A sketch of the resulting per-partition filters is shown at the end of this section.
If a session contains N partitions, increase the DTM buffer size to at least N times the value used for the session with one partition.
If the source or target database is MPP (Massively Parallel Processing), enable Pushdown Optimization. With this enabled, the Integration Service pushes as much transformation logic as possible to the source database, the target database, or both (FULL), based on the settings. This property can be ignored for conventional databases.
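
As an illustration of the per-partition source filters, here is a minimal sketch assuming six pass-through partitions and the product_sys_no key from the example above; the product table name is an assumption, not from the article:

-- Source filter for partition 1 of 6
SELECT * FROM product WHERE MOD(NVL(product_sys_no, 0), 6) = 0;
-- Source filter for partition 2 of 6
SELECT * FROM product WHERE MOD(NVL(product_sys_no, 0), 6) = 1;
-- ... and so on, up to MOD(NVL(product_sys_no, 0), 6) = 5 for partition 6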

Thursday 19 August 2010

Informatica Development Best Practices – Mapping


The following are generally accepted "best practices" for Informatica PowerCenter ETL development; if implemented, they can significantly improve overall performance.


Source Extracts
Loading data from fixed-width files takes less time than from delimited files, since delimited files require extra parsing. With fixed-width files, the Integration Service knows the start and end position of each column up front, which reduces processing time. (Performance Improvement)
Flat files located on the server machine load faster than a database located on the server machine. (Performance Improvement)
Mapping Designer
There should be a placeholder transformation (Expression) immediately after the source and one before the target. Data type and data width changes are bound to happen during the development phase, and these placeholder transformations preserve the port links between transformations. (Best Practice)
Connect only the ports that are required in the targets to subsequent transformations. Also, active transformations that reduce the number of records should be used as early in the mapping as possible. (Code Optimization)
If a join must be used in the mapping, select the appropriate driving/master table: the table with the smaller number of rows should be the driving/master table. (Performance Improvement)
Transformations
If there are multiple lookup conditions, place the condition with the "=" sign first in order to optimize lookup performance. Also, the indexes on the database table should include every column used in the lookup condition; see the sketch below. (Code Optimization)
Persistent caches should be used if the lookup data is not expected to change often. These cache files are saved and can be reused for subsequent runs, eliminating the need to query the database again. (Performance Improvement)
The Integration Service processes numeric operations faster than string operations. For example, if a lookup is done on a large amount of data on two columns, EMPLOYEE_NAME and EMPLOYEE_ID, configuring the lookup on EMPLOYEE_ID improves performance. (Code Optimization)
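As a minimal sketch, if the lookup condition used EMPLOYEE_ID and EMPLOYEE_NAME (the columns from the example above), a covering index could look like the following; the table and index names are assumptions:

-- Index covering every column used in the lookup condition
CREATE INDEX idx_employee_lkp ON employee (employee_id, employee_name);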
Replace complex filter expressions with a flag (Y/N). The complex logic should be moved to an Expression transformation and the result stored in a port; the Filter transformation then evaluates this flag port, which takes less time than executing the entire logic inside the filter expression. (Best Practice)
The PowerCenter Server automatically converts between compatible data types, which can slow down performance considerably. For example, if a mapping moves data from an Integer port to a Decimal port and then back to an Integer port, the conversion may be unnecessary. (Performance Improvement)
Assigning default values to ports and writing transformation errors to the session log will always slow down session performance. Try to remove default values and eliminate transformation errors. (Performance Improvement)
Complex joins in Source Qualifiers should be replaced with database views. There won't be any performance gain, but it improves readability a lot, and any new condition can be applied easily by just changing the view's WHERE clause; a sketch follows below. (Best Practice)
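
For example, a multi-table join could be wrapped in a database view and the Source Qualifier pointed at the view instead. This is only a sketch; the table, column, and view names are hypothetical, not from the article:

CREATE OR REPLACE VIEW v_customer_orders AS
SELECT c.customer_id,
       c.customer_name,
       o.order_id,
       o.order_amount
FROM   customers c
JOIN   orders    o ON o.customer_id = c.customer_id
WHERE  o.order_status = 'OPEN';

-- The Source Qualifier then simply reads from the view;
-- new conditions go into the view's WHERE clause.
SELECT * FROM v_customer_orders;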

Thursday 12 August 2010

Change Data Capture in Informatica


Change data capture (CDC) is an approach, or a technique, to identify changes, and only changes, in the source. I have seen applications that were built without CDC and were later mandated to implement CDC at a higher cost. Building an ETL application without CDC is a costly miss and usually a backtracking step. In this article we discuss different methods of implementing CDC.


Scenario #01: Change detection using timestamp on source rows
In this typical scenario the source rows have two extra columns, say row_created_time and last_modified_time. Row_created_time: the time at which the record was first created; last_modified_time: the time at which the record was last modified.
  1. In the mapping, create a mapping variable $$LAST_ETL_RUN_TIME of datetime data type
  2. Evaluate the condition SetMaxVariable ($$LAST_ETL_RUN_TIME, SessionStartTime); this step stores the time at which the session was started into $$LAST_ETL_RUN_TIME
  3. Use $$LAST_ETL_RUN_TIME in the 'where' clause of the source SQL. During the first run or initial seed, the mapping variable has its default value and pulls all the records from the source, like: select * from employee where last_modified_time > '01/01/1900 00:00:000'
  4. Now let us assume the session is run on '01/01/2010 00:00:000' for the initial seed
  5. When the session is executed on '02/01/2010 00:00:000', the SQL would be like: select * from employee where last_modified_time > '01/01/2010 00:00:000', thereby pulling only the records that changed between successive runs (see the sketch below)
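A minimal sketch of the corresponding Source Qualifier SQL override, assuming the employee table from the example and an Oracle source; the date format mask is an assumption, and the Integration Service expands the mapping variable before the query is sent to the database:

SELECT *
FROM   employee
WHERE  last_modified_time > TO_DATE('$$LAST_ETL_RUN_TIME', 'MM/DD/YYYY HH24:MI:SS')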
Scenario #02: Change detection using load_id or Run_id
Under this scenario the source rows have a column, say load_id, a positive running number. The load_id is updated as and when the record is updated.
  1. In the mapping, create a mapping variable $$LAST_READ_LOAD_ID of integer data type
  2. Evaluate the condition SetMaxVariable ($$LAST_READ_LOAD_ID, load_id); the maximum load_id is stored into the mapping variable
  3. Use $$LAST_READ_LOAD_ID in the 'where' clause of the source SQL. During the first run or initial seed, the mapping variable has its default value and pulls all the records from the source, like: select * from employee where load_id > 0. Assuming all records during the initial seed have load_id = 1, the mapping variable stores '1' into the repository.
  4. Now let us assume the session is run after five loads into the source; the SQL would be: select * from employee where load_id > 1, thereby limiting the source read to only the records that have changed after the initial seed (see the sketch below)
  5. Consecutive runs take care of updating the load_id and pulling the delta in sequence
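A minimal sketch of the SQL override for this scenario, again assuming the employee table from the example; the mapping variable $$LAST_READ_LOAD_ID is expanded at run time:

SELECT *
FROM   employee
WHERE  load_id > $$LAST_READ_LOAD_ID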
In the next blog we will see how to implement CDC when reading from Salesforce.com.

Wednesday 3 March 2010

Processing Multiple XML Files through Informatica – 1


Problem Statement: The data to be processed in Informatica came in as XML files, and the number of XML files to be processed was dynamic. There was also a need to capture the name of the XML file from which each record was processed.

Resolution:
Option 1 – Using File list as part of Indirect File Sources in session
Option 2 – Using Parameter File and workflow variable

Implementation Details for option 1: Using File list
The XML file names to be processed were read using a batch script, and a file list containing the XML file names was created. This file list name was set under the source properties at session level. The XML files were read sequentially and the data pertaining to every XML file was processed. Since the number of XML files to be processed was dynamic, the need of the hour was to achieve looping in Informatica.
Challenge in using a file list – A file list is created in a session to run multiple source files for one source instance in the mapping. When a file list is used in a mapping as multiple source files for one source instance, the properties of all files must match the source definition. File lists are configured in the session properties by entering the name of the file list in the Source Filename field and the location of the file list in the Source File Directory field. When the session starts, the Integration Service reads the file list, then locates and reads the first source file in the list. After the Integration Service reads the first file, it locates and reads the next file in the list. The issue with using XML file names in a file list was further compounded by Informatica grouping records pertaining to similar XML nodes together, which made it difficult to identify which record belonged to which XML file.
Batch Script – batch scripts controlled the overall looping in Informatica by performing the tasks below:
• Reading the XML file names from the staging location and creating a file list containing the XML file names.
• Moving the XML files from the staging location to the archive location.
• Verifying whether there are any more XML files to be processed and, depending on the outcome, either looping the process by invoking the first workflow or ending the process.
• Invoking the appropriate workflows using pmcmd commands.
Workflow Details –
There were two Informatica workflows designed to achieve looping:
• First workflow – creates the indirect file to be used as the source in the session properties and triggers the second workflow. Details of the workflow:
o A command task executes a DOS batch script which creates the indirect file after reading the XML file names from a pre-defined location on the server.
o A command task executes the second workflow to process the data within the XML files.
(Image: first workflow)
• Second workflow – reads and processes the XML files and populates the staging tables. Details of the workflow:
o A session reads the XML files using the indirect file and loads the data into the staging tables.
o A command task moves the XML files just processed into an archive folder, using a batch script.
o A command task executes a batch script which will:
   - Check whether there are any more XML files to be processed.
   - If yes, trigger the first workflow again. This ensures all XML files are processed and loaded into the staging tables.
   - If no, complete the process.
(Image: second workflow)
Thanks for reading; please let me know if you have faced any similar situation.

Monday 14 September 2009

Merge Rows as Columns / Transpose records


Requirement: Converting rows to columns


Source (rows):

Customer    Product    Cost
Cust1       P1         10
Cust1       P2         20
Cust1       P3         30
Cust2       ABC        10
Cust2       P2         25
Cust2       Def        10

Target (transposed):

Customer    Product1   Cost1   Product2   Cost2   Product3   Cost3
Cust1       P1         10      P2         20      P3         30
Cust2       ABC        10      P2         25      Def        10

The above illustration should help in understanding the requirement: we had to merge multiple records into one record based on certain criteria. The design had to be reusable, since each dimension within the data mart required this flattening logic. (For comparison, a database-side sketch of the same transposition is shown below.)
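For comparison only (the article implements this inside Informatica with an expression-based mapplet), the same row-to-column transposition can be sketched in SQL using a row number per customer and conditional aggregation. The table name customer_product_cost is hypothetical, and products are ordered alphabetically within each customer, since the source rows carry no explicit ordering column:

SELECT customer,
       MAX(CASE WHEN rn = 1 THEN product END) AS product1,
       MAX(CASE WHEN rn = 1 THEN cost    END) AS cost1,
       MAX(CASE WHEN rn = 2 THEN product END) AS product2,
       MAX(CASE WHEN rn = 2 THEN cost    END) AS cost2,
       MAX(CASE WHEN rn = 3 THEN product END) AS product3,
       MAX(CASE WHEN rn = 3 THEN cost    END) AS cost3
FROM  (SELECT customer, product, cost,
              ROW_NUMBER() OVER (PARTITION BY customer ORDER BY product) AS rn
       FROM   customer_product_cost) t
GROUP BY customer;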

1. Approach:
An Aggregator transformation would group the records by a key, but retrieving the values of a particular column as individual columns is a challenge; hence we designed a component, 'Flattener', based on an Expression transformation.
Flattener is a reusable component, a mapplet, that performs the function of flattening records.
Flattener consists of an Expression and a Filter transformation. The Expression is used to combine incoming records based on certain logic. The decision to write the record to the target is taken by the Filter transformation.

2. Design:
The mapplet can receive up to five inputs, of the following data types:
i_Col1 (string),  Customer
i_Col2 (string), Product
i_Col3 (decimal), Cost
i_Col4 (decimal) and
i_Col5 (date/time)
The names are kept generic and the ports cover different data types, so that the mapplet can be used in any scenario where records need to be flattened.
The mapplet gives out 15 sets of 5 output ports each, in the following manner:
o_F1_1 (string), Customer
o_F2_1 (string), Product1
o_F3_1 (decimal), Cost1
o_F4_1 (decimal) and
o_F5_1 (date/time)
o_F1_2 (string), Customer
o_F2_2 (string), Product2
o_F3_2 (decimal), Cost2
o_F4_2 (decimal) and
o_F5_2 (date/time)
… … and so on
The output record is going to have repetitive sets of 5 columns each (each set refers to one incoming row). Based on the requirement, the number of occurrences of these sets can be increased. Only the required fields need to be used/mapped; for the above example we use just two strings and one decimal, mapping Customer, Product and Cost.
The mapplet receives records from its parent mapping. The Expression initially saves each incoming value to a variable port and compares it with its counterpart from the previous row, which is held in the variable port's cache for as long as the condition to flatten is satisfied.
Syntax to store current and previous values (port name, datatype, port type, expression):
i_Col2       string    input
prv_Col2     string    variable    curr_Col2
curr_Col2    string    variable    i_Col2
Because variable ports are evaluated in order, prv_Col2 picks up curr_Col2 before it is overwritten, i.e. it still holds the previous row's value.
The condition/logic used to flatten records is parameterized and decided before the mapping is called, which makes the code more reusable. The parameterized logic is passed to the Expression transformation via a mapplet parameter. The value is used as an expression to perform the evaluation, and the result is a flag value of either '1' or '2'.
Syntax for port – flag:
flag    integer    variable    $$Expr_compare
An example of the parameterized expression:
$$Expr_compare = IIF(curr_Col1 = prv_Col1 AND curr_Col2 != prv_Col2, 1,
                     IIF(curr_Col1 != prv_Col1, 2))
A variable port named “rec_count” is incremented, based on the flag.
Syntax for port – rec_count:
rec_count    integer    variable    IIF(flag = 2, 0, IIF(flag = 1, rec_count + 1, rec_count))
The Expression transformation now uses the values in the ports "flag" and "rec_count" to decide the placeholder for each incoming input value, i.e. the column in the target table that this data ultimately moves into. This process is iterative and goes on until the comparison logic ($$Expr_compare) holds good, i.e. until all records are flattened per the logic. An example of a placeholder expression is shown below:
v_Field1    <data type>    variable    IIF(flag = 2 AND rec_count = 0, curr_Col1, v_Field1)
Port "write_flag_1" is set to 1 when the comparison logic fails (meaning flattening is complete). The Filter transformation lets the record through only once it is completely transposed.
Filter condition port:
write_flag_1    integer    variable    IIF(flag = 2 AND write_flag > 1, 1, 0)

3. Outcome:
After developing the code and implementing it, we found it to be a useful utility, so we thought of sharing it. We would like to hear suggestions from readers on performing the same functionality in a different way; please do share your views.