
Clustering Pentaho BA Server 5.0.x

Gangadhara Boranna

Clustering means that two or more instances of Pentaho share a common repository. Pentaho 5.0.x uses the Jackrabbit Content Repository (JCR) for the BA Repository. Pentaho stores reporting-related content in the BA Repository: reports that you create, the examples we provide, report scheduling data, and audit data. The BA Repository resides on the database that you installed and consists of three repositories: Jackrabbit, Quartz, and Hibernate.
– Jackrabbit contains the solution repository, examples, security data, and content data from reports that you use Pentaho software to create.

– Quartz holds data that is related to scheduling reports and jobs.

– Hibernate holds data that is related to audit logging.
You can choose to host the BA Repository on a PostgreSQL, MySQL, or Oracle database (by default, Pentaho software is configured to use PostgreSQL). Since each node must share a common repository, as mentioned above, follow the instructions below for initializing and configuring your solution repository:

Initializing: http://infocenter.pentaho.com/help/topic/install_pdi/task_prepare_rdbms_repository.html

Configuring: http://infocenter.pentaho.com/help/topic/install_pdi/task_configure_rdbms_repository.html
You will need to add a section of code to the repository.xml file found in the \biserver-ee\pentaho-solutions\system\jackrabbit directory so that all nodes share a common journal. Please note that each node must have a unique ID; this is explained in detail below.

Configuring Each Node to Have a Shared Journal:
Before configuring the shared journal, delete the files in the directories listed below:
– Delete the contents of the tomcat\work and tomcat\temp directories.

– Navigate to the biserver-ee\pentaho-solutions\system\jackrabbit\repository directory and remove all files and folders from the inner repository folder.

– Navigate to the biserver-ee\pentaho-solutions\system\jackrabbit\repository directory and remove all files and folders from the workspaces folder.
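As a convenience, the cleanup steps above can be scripted. The sketch below (plain Python, with the relative paths from the text assumed as the install layout; adjust them to your environment) clears each directory's contents while leaving the directory itself in place:

```python
import shutil
from pathlib import Path

def clear_dir(path):
    """Remove everything inside `path`, keeping the directory itself."""
    p = Path(path)
    if not p.exists():
        return
    for child in p.iterdir():
        # Directories need rmtree; plain files are unlinked
        shutil.rmtree(child) if child.is_dir() else child.unlink()

# Directories from the steps above; adjust the roots to your install layout.
for d in [
    "tomcat/work",
    "tomcat/temp",
    "biserver-ee/pentaho-solutions/system/jackrabbit/repository/repository",
    "biserver-ee/pentaho-solutions/system/jackrabbit/repository/workspaces",
]:
    clear_dir(d)
```

Running this against a stopped server before the journal is configured matches the manual steps above; `clear_dir` silently skips directories that do not exist.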
Now, to configure the nodes for a shared journal, edit the repository.xml file found in the \biserver-ee\pentaho-solutions\system\jackrabbit directory and add the following section of code at the end.


<!-- Run with a cluster journal -->
<Cluster id="Unique_ID">
  <Journal class="org.apache.jackrabbit.core.journal.DatabaseJournal">
    <param name="revision" value="${rep.home}/revision.log"/>
    <param name="url" value="jdbc:postgresql://HOSTNAME:PORT/jackrabbit"/>
    <param name="driver" value="org.postgresql.Driver"/>
    <param name="user" value="jcr_user"/>
    <param name="password" value="password"/>
    <param name="databaseType" value="postgresql"/>
    <param name="janitorEnabled" value="true"/>
    <param name="janitorSleep" value="86400"/>
    <param name="janitorFirstRunHourOfDay" value="3"/>
  </Journal>
</Cluster>
You will need to replace the JDBC connection settings (URL, username, password, database type, etc.) to match your specific database. Jackrabbit journaling is now configured. Quartz must also be configured, to avoid duplicate schedules being created on each node.
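For example, if your repository database is MySQL rather than PostgreSQL, the same Journal element would use MySQL connection settings. The fragment below is a sketch; HOSTNAME, PORT, and the credentials are placeholders, and the driver class shown is the Connector/J driver commonly used with this Pentaho generation:

```xml
<Journal class="org.apache.jackrabbit.core.journal.DatabaseJournal">
  <param name="revision" value="${rep.home}/revision.log"/>
  <param name="url" value="jdbc:mysql://HOSTNAME:PORT/jackrabbit"/>
  <param name="driver" value="com.mysql.jdbc.Driver"/>
  <param name="user" value="jcr_user"/>
  <param name="password" value="password"/>
  <param name="databaseType" value="mysql"/>
</Journal>
```

The janitor parameters from the PostgreSQL example can be carried over unchanged; only the connection-specific values differ.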
Configuring Quartz for Clustering:
Navigate to \biserver-ee\pentaho-solutions\system\quartz and edit the quartz.properties file using a text editor. You will need to make the following changes to configure Quartz for clustering:

1. org.quartz.scheduler.instanceId = AUTO

Set this to AUTO so that each instance generates its own unique ID. The default value is 1.

2. org.quartz.jobStore.isClustered = true

The default value is false.

3. org.quartz.jobStore.clusterCheckinInterval = 20000

This property is not present by default, so you will need to add it to the quartz.properties file explicitly; it controls how often (in milliseconds) each node checks in with the cluster.
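Taken together, the clustering-related section of quartz.properties on each node would look like this (the rest of the file is unchanged):

```properties
# Each node generates its own instance ID
org.quartz.scheduler.instanceId = AUTO

# Run the JobStore in clustered mode so schedules are shared, not duplicated
org.quartz.jobStore.isClustered = true

# How often (in milliseconds) a node checks in with the cluster
org.quartz.jobStore.clusterCheckinInterval = 20000
```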


PDI best practices – Why avoid insert/update step

Sandeep Kemparaju

Consider a transformation designed to use the Insert/Update step to load data into a target table. The Pentaho Kettle Insert/Update step works as follows.

Let us take an example of loading a target table. Assume a daily load of 100k records into a target table that already holds 10 million records; every incoming row from the source is looked up against all 10 million records in the target table. This process repeats for all 100k input records.
The Insert/Update step makes multiple round trips to the database, depending on the commit size.

The performance of most steps depends heavily on the number of round trips and the speed of each round trip. That speed is a combination of network speed, network latency, and database performance.

Find below the wiki link explaining the Insert/Update step in Pentaho Kettle.


Because the step performs an extra lookup for every incoming row, it slows the load considerably: each lookup scans the target table for a matching record to update, and the row is inserted only if no match is found.

This step is also slower than the regular “Table Output” step.

There is very little you can do about network latency, so reducing the number of round trips to the database is the first thing to consider. This can be accomplished by loading the lookup data into memory (a cache).

In Pentaho Kettle, most lookup steps have options to cache data; “Stream Lookup” and “Merge Join” are two such steps.
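The caching idea can be illustrated outside of Kettle as well. The sketch below (plain Python with made-up row data, not Kettle's API) loads the target table's keys into a dictionary once, so each incoming row is matched in memory instead of triggering a database round trip:

```python
# Sketch of "cache the lookup data" (illustrative only, not Kettle's API).
# Instead of querying the target table once per incoming row, load the
# target keys into an in-memory dict and probe that dict per row.

def build_cache(target_rows):
    """One pass over the target table: key -> full row."""
    return {row["id"]: row for row in target_rows}

def classify(source_rows, cache):
    """Split incoming rows into updates (key exists) and inserts (key is new)."""
    updates, inserts = [], []
    for row in source_rows:
        (updates if row["id"] in cache else inserts).append(row)
    return updates, inserts

target = [{"id": 1, "name": "old"}, {"id": 2, "name": "keep"}]
source = [{"id": 1, "name": "new"}, {"id": 3, "name": "add"}]

cache = build_cache(target)        # one scan of the target, not one per row
updates, inserts = classify(source, cache)
```

With 100k incoming rows, this replaces 100k per-row lookups with a single pass over the target table plus in-memory probes, at the cost of holding the lookup keys in memory.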

In the above scenario, design the transformation with the “Merge Join” step so the matching is done in a single pass. This minimizes the number of times the data in the source and target tables is processed, which reduces memory consumption and improves performance. Note that the input data must be sorted on the join keys for the “Merge Join” step to work.
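For intuition, the single-pass matching that a merge join performs on sorted inputs can be sketched as follows (plain Python over lists of keyed rows, assuming unique keys on each side; this is an illustration of the technique, not Kettle's implementation):

```python
def merge_join(left, right, key):
    """Inner join of two lists of dicts, both already sorted on `key`.
    Advances two cursors in a single pass over each input."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk == rk:
            out.append({**left[i], **right[j]})  # matched: merge the rows
            i += 1
            j += 1
        elif lk < rk:
            i += 1  # left key too small: advance left cursor
        else:
            j += 1  # right key too small: advance right cursor
    return out

left = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}, {"id": 4, "a": "z"}]
right = [{"id": 2, "b": "p"}, {"id": 3, "b": "q"}, {"id": 4, "b": "r"}]
joined = merge_join(left, right, "id")  # matches ids 2 and 4
```

Because both cursors only ever move forward, each table is read exactly once; that is why sorted input is a hard requirement for this approach.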

Find below the wiki link explaining the Merge Join step.

