We are looking at a transformation that uses the Insert/Update step to load data into a target table. The Pentaho Kettle Insert/Update step works as follows.
Let us take an example. Assume a daily load of 100,000 records into a target table that already holds 10 million records: every incoming row from the source is looked up against the 10 million records in the target table, and this repeats for all 100,000 input rows.
The Insert/Update step involves multiple round trips to the database, the number depending on the commit size. The performance of most steps depends heavily on the number of round trips and on the speed of each round trip, which is itself a combination of network bandwidth, network latency, and database performance.
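To put rough numbers on this (the 1 ms figure is illustrative, not measured): with at least one lookup round trip per incoming row and a 1 ms network round-trip time, the 100,000-row load above spends 100 seconds waiting on the network for lookups alone, and the insert/update round trips come on top of that. The exact figures depend on your network and database, but the cost scales linearly with the number of rows.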
Find below the wiki link explaining the Insert/Update step in Pentaho Kettle.
Because there is an extra lookup involved in this step, it decidedly slows down the load: every incoming row must be checked against the target table for a matching record, which is updated if found; if no match is found, the row is inserted instead. This also makes the step slower than the regular “Table Output” step.
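To make the pattern concrete, here is a minimal JDBC sketch of the per-row lookup-then-write cycle described above. It is an illustration of the access pattern, not Kettle's internal code; the connection URL, the table name target_table, and the id/value columns are assumptions made for the example.

```java
import java.sql.*;
import java.util.List;

/** Per-row lookup-then-write, as the Insert/Update step behaves. */
public class PerRowUpsert {
    record Row(long id, String value) {}

    // Stub standing in for the incoming source stream.
    static Iterable<Row> readSourceRows() { return List.of(); }

    public static void main(String[] args) throws SQLException {
        // URL and credentials are placeholders for the example.
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/dw", "etl", "secret")) {
            con.setAutoCommit(false);               // commit every commitSize rows
            final int commitSize = 1000;
            int pending = 0;

            try (PreparedStatement lookup = con.prepareStatement(
                     "SELECT 1 FROM target_table WHERE id = ?");
                 PreparedStatement update = con.prepareStatement(
                     "UPDATE target_table SET value = ? WHERE id = ?");
                 PreparedStatement insert = con.prepareStatement(
                     "INSERT INTO target_table (id, value) VALUES (?, ?)")) {

                for (Row row : readSourceRows()) {
                    lookup.setLong(1, row.id());    // round trip 1: the lookup
                    boolean exists;
                    try (ResultSet rs = lookup.executeQuery()) {
                        exists = rs.next();
                    }
                    if (exists) {                   // round trip 2: the write
                        update.setString(1, row.value());
                        update.setLong(2, row.id());
                        update.executeUpdate();
                    } else {
                        insert.setLong(1, row.id());
                        insert.setString(2, row.value());
                        insert.executeUpdate();
                    }
                    if (++pending % commitSize == 0) con.commit();
                }
                con.commit();                       // flush the final partial batch
            }
        }
    }
}
```

Each source row costs one lookup round trip plus one write round trip, and this is exactly what the commit size cannot fix: committing less often batches the transaction overhead, not the statements themselves.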
There is very little one can do about network latency, so reducing the number of round trips to the database is the first thing to consider. This can be accomplished by loading the lookup data into memory (a cache).
In Pentaho Kettle, most lookup steps have options to cache data; “Stream Lookup” and “Merge Join” are two such steps.
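As a rough sketch of the caching idea (again under the assumed target_table/id/value names, and setting aside the memory cost of holding 10 million keys), the per-row lookup can be replaced with a one-time scan of the target keys into an in-memory set, with the writes batched:

```java
import java.sql.*;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Upsert with the target keys cached in memory instead of looked up per row. */
public class CachedKeyUpsert {
    record Row(long id, String value) {}

    static Iterable<Row> readSourceRows() { return List.of(); } // stub

    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/dw", "etl", "secret")) {

            // One streamed query builds the cache of existing target keys.
            Set<Long> existingKeys = new HashSet<>();
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT id FROM target_table")) {
                while (rs.next()) existingKeys.add(rs.getLong(1));
            }

            con.setAutoCommit(false);
            try (PreparedStatement update = con.prepareStatement(
                     "UPDATE target_table SET value = ? WHERE id = ?");
                 PreparedStatement insert = con.prepareStatement(
                     "INSERT INTO target_table (id, value) VALUES (?, ?)")) {

                for (Row row : readSourceRows()) {
                    // The membership test is an in-memory hash probe,
                    // not a SELECT round trip.
                    if (existingKeys.contains(row.id())) {
                        update.setString(1, row.value());
                        update.setLong(2, row.id());
                        update.addBatch();
                    } else {
                        insert.setLong(1, row.id());
                        insert.setString(2, row.value());
                        insert.addBatch();
                        existingKeys.add(row.id()); // later duplicates become updates
                    }
                }
                insert.executeBatch();  // the writes travel in batches,
                update.executeBatch();  // not one round trip per row
                con.commit();
            }
        }
    }
}
```

The lookup now costs a hash probe instead of a network round trip, and the batched writes cut the remaining round trips to roughly one per batch.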
In the above scenario, design the transformation around the “Merge Join” step so that the data in the source and target tables is read and processed only once. This keeps memory consumption low and performance good. Note that both inputs must be sorted on the join keys for the “Merge Join” step to work.
Find below the wiki link explaining the Merge Join step.
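For intuition, here is a small sketch of the sorted merge-join technique the step is based on: both key-sorted streams are walked once in lockstep, and each source row is classified as an update (key found in the target) or an insert (key not found). The Row type and the sample data are invented for the illustration.

```java
import java.util.Iterator;
import java.util.List;

/** Single-pass merge over two key-sorted streams to classify upserts. */
public class SortedMergeUpsert {
    record Row(long key, String value) {}

    static void mergeJoin(Iterator<Row> source, Iterator<Row> target) {
        Row t = target.hasNext() ? target.next() : null;
        while (source.hasNext()) {
            Row s = source.next();
            // Advance the target stream until its key catches up.
            while (t != null && t.key() < s.key()) {
                t = target.hasNext() ? target.next() : null;
            }
            if (t != null && t.key() == s.key()) {
                System.out.println("UPDATE key=" + s.key());
            } else {
                System.out.println("INSERT key=" + s.key());
            }
        }
    }

    public static void main(String[] args) {
        // Both inputs must be sorted on the join key, as noted above.
        var source = List.of(new Row(1, "a"), new Row(3, "b"), new Row(5, "c"));
        var target = List.of(new Row(1, "x"), new Row(2, "y"), new Row(4, "z"));
        mergeJoin(source.iterator(), target.iterator());
        // Prints: UPDATE key=1, INSERT key=3, INSERT key=5
    }
}
```

Because each stream is consumed exactly once and only the current row of each is held in memory, this is why the sorted-input requirement buys both low memory use and a single pass over the data.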