Our Blog

PDI best practices – Why avoid insert/update step

Sandeep Kemparaju

We are looking at a transformation designed to use insert/update step to perform data load on the target table. Pentaho Kettle step Insert/Update works as follows

Let us take an example of loading a target table. Assume that there is a daily load of 100k records into a target table with 10 million records and every incoming row from the source table looks up against all the 10 million records in the target table. This process continues for all the 100k input records.
Pentaho Kettle
Insert/Update step involves multiple round trips to the database depending on the commit size.

Performance of most steps depends heavily upon the number of round trips and speed of round trip. Speed depends on a combination of speed of network, the latency on the network and the performance of database.

Find below the wiki link to explain the Insert/Update step in Pentaho Kettle


As there is an extra lookup process involved in this step, it definitely slows down the process as it needs to run through the entire lot of records. The look up is to check for a matching record (to update) and if none matches (to perform insert).

This step is also slower than the regular “Table Output step”.

There is very little one can do on the network latency side. Reduction of number of round trips to the database is the first thing that you should consider. This can be accomplished by having a mechanism to load lookup data in memory (cache)

In Pentaho Kettle, most Lookup steps have options to cache data. Steps such as “Stream Lookup” & “Merge Join” are some of them.

In the above scenario, design the transformation with “Merge join” step to perform operations within a single statement. By doing this, you will minimize the number of times data in the source and target tables are processed. This helps in less memory consumption and good performance. You need to ensure that the input data is sorted to perform the “Merge join” step.

Find below the wiki link to explain the Merge Join step.


Tags :

Pentaho Launches Community Edition 5.0

Suresh Narayanappa

Pentaho Corporation recently announced the immediate availability of its Open Source Pentaho Community Edition 5.0 (Pentaho CDE). It is the latest version of business analytics and open source data integration platform. The launch event also included the Pentaho Marketplace, where members of the community can download and explore all available plug ins developed by the Pentaho Community. It has extended the capabilities of the open source platform and permits community members to work together, share feedback and submit or create new plug ins to broaden Pentaho functionality.

According to the company, the new edition offers an economical entry point to people or companies during their first brush with business analytics when they want to visualize and act upon data. It is also an excellent tool-set for practiced developers, users and consultants who prefer an open code base to extend the borders of Pentaho and also business analytics.
Pentaho Community Edition
New features in Pentaho Community Edition 5.0

Pentaho, in collaboration with its community of developers, has made a powerful tools suite that offers an open source option for data analysts and developers to meet their goals. The latest edition includes:

    • Business Analytics Platform: The modern, interactive and simplified approach of Pentaho helps business users to discover, access and blend all sizes and types of data. Users can take advantage of a wide range of advanced analytics, that range from simple reports to predictive modeling, and can analyze and also visualize data through multiple dimensions, while at the same time with minimum IT dependence.


    • Data Integration: Pentaho Data Integration, better known as Pentaho Kettle, delivers powerful transformation, loading and extraction capabilities. This is a stand-alone application and is utilized to visually design jobs, and aid easier reporting and analysis.


    • Report Designer: Pentaho Report Designer is a graphic design tool which has the capability to generate reports from the data streamed through Pentaho Data Integration engine with no requirement for any kind of intermediate staging tables. Output reports can be in PDF, HTML, XML, CSV, Excel and rich-text file.


  • Auxiliary Tools: Users can download various types of auxiliary tools, like Pentaho Aggregation Designer for a simple interface to first create and then deploy aggregate tables that improve the performance of the Pentaho OLAP Cubes. Mondrian Schema Workbench is the open source designer for the visual creating and testing of the Mondrian OLAP cube schemas. The Pentaho Metadata Editor offers a simplified tool that you can use to create reports, build domains of Pentaho Metadata or the relational data models.

Tags :