Change data capture (CDC) is an approach or a technique to identify changes, only changes, in the source. I have seen applications that are built without CDC and later mandate to implement CDC at a higher cost. Building an ETL application without CDC is a costly miss and usually a backtracking step. In this article we can discuss different methods of implementing CDC.
Scenario #01: Change detection using timestamp on source rows
In this typical scenario the source rows have extra two columns say row_created_time & last_modified_time. Row_created_time : time at which the record was first created ; Last_modified_time: time at which the record was last modified
- In the mapping create mapping variable $$LAST_ETL_RUN_TIME of datetime data type
- Evaluate condition SetMaxVariable ($$LAST_ETL_RUN_TIME, SessionStartTime); this steps stores the time at which the Session was started to $$LAST_ETL_RUN_TIME
- Use $$LAST_ETL_RUN_TIME in the ‘where’ clause of the source SQL. During the first run or initial seed the mapping variable would have a default value and pull all the records from the source, like: select * from employee where last_modified_date > ’01/01/1900 00:00:000’
- Now let us assume the session is run on ’01/01/2010 00:00:000’ for initial seed
- When the session is executed on ’02/01/2010 00:00:000’, the sequel would be like : select * from employee where last_modified_date > ’01/01/2010 00:00:000’, hereby pulling records that had only got changed in between successive runs
Under this scenario the source rows have a column say load_id, a positive running number. The load_id is updated as and when the record is updated
- In the mapping create mapping variable $$LAST_READ_LOAD_ID of integer data type
- Evaluate condition SetMaxVariable ($$LAST_READ_LOAD_ID,load_id); the maximum load_id is stored into mapping variable
- Use $$LAST_READ_LOAD_ID in the ‘where’ clause of the source SQL. During the first run or initial seed the mapping variable would have a default value and pull all the records from the source, like: select * from employee where load_id > 0; Assuming all records during initial seed have load_id =1, the mapping variable would store ‘1’ into the repository.
- Now let us assume the session is run after five load’s into the source, the sequel would be select * from employee where load_id >1 ; hereby we limit the source read only to the records that have been changed after the initial seed
- Consecutive runs would take care of updating the load_id & pulling the delta in sequence