Here are some cool and quick tips for performance tuning in Kettle.
1. Distribution of Data: Change Number of Copies to Start. This option is helpful on a multi-processor system: it runs several copies of the selected step so that each processor is used efficiently. To find it, right-click any step of the transformation.
Once the option is open, you can raise the default value (which is 1) to something higher.
Value = number of processors - 1
Make sure to set the type of data movement (right-click on a step) to 'Distribute'.
This option partitions the data into different sets and assigns them internally to different processors. For a single step, multiple sets of data are then handled in parallel across processors, which achieves parallelization.
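As a rough illustration of the idea (not Kettle's actual implementation), the sketch below deals rows out to several copies of a step and runs each copy in its own worker thread. The function names are hypothetical, and round-robin dealing is assumed as the 'Distribute' behaviour.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def copies_to_start(cpu_count=None):
    """Rule of thumb from the tip: number of processors minus one (never below 1)."""
    n = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, n - 1)


def distribute(rows, copies):
    """Deal rows out round-robin, one bucket per step copy (assumed behaviour)."""
    buckets = [[] for _ in range(copies)]
    for i, row in enumerate(rows):
        buckets[i % copies].append(row)
    return buckets


def run_step(rows, step, copies):
    """Apply `step` to every row, with each bucket handled by its own worker thread."""
    buckets = distribute(rows, copies)
    with ThreadPoolExecutor(max_workers=copies) as pool:
        return list(pool.map(lambda bucket: [step(r) for r in bucket], buckets))
```

For example, run_step(rows, some_function, copies_to_start()) would process the rows in as many parallel buckets as the rule of thumb suggests for the current machine.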
2. Set some Commit Size: In any Output/Update step, set the Commit Size to some value rather than 0. This reduces the buffer load of storing uncommitted data.
However, it should be a reasonable value; for example, Matt Casters suggests a commit size of 1000 for every 5000 rows.
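What a commit size does can be sketched outside Kettle. The example below uses Python's built-in sqlite3 module as a stand-in database; the table name `target` and the helper function are hypothetical. It commits after every `commit_size` rows instead of holding everything in one transaction.

```python
import sqlite3


def insert_with_commit_size(conn, rows, commit_size):
    """Insert rows, committing after every `commit_size` rows.

    Returns the number of commits issued.
    """
    commits = 0
    pending = 0
    cur = conn.cursor()
    for row in rows:
        cur.execute("INSERT INTO target (id, name) VALUES (?, ?)", row)
        pending += 1
        if pending == commit_size:
            conn.commit()
            commits += 1
            pending = 0
    if pending:  # commit the final, partially filled batch
        conn.commit()
        commits += 1
    return commits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, name TEXT)")
rows = [(i, f"row-{i}") for i in range(5000)]
commits = insert_with_commit_size(conn, rows, 1000)
```

With 5000 rows and a commit size of 1000, the uncommitted data held at any moment is at most 1000 rows.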
3. Saving Round-Trips to Database Server: To save trips to the database each time data is fetched, set defaultRowPrefetch (go to Connection > Options) to a reasonable value.
By default it is 10 rows for Oracle, which means every call to the server fetches 10 rows. Increase it to 50 and you reduce the number of database calls to 1/5.
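The arithmetic behind this can be checked with a small sketch. The helper name and the 100,000-row figure are only illustrative.

```python
import math


def round_trips(total_rows, prefetch):
    """Number of fetch calls needed to read `total_rows` at `prefetch` rows per call."""
    return math.ceil(total_rows / prefetch)


total = 100_000                        # illustrative row count
calls_default = round_trips(total, 10)  # 10 rows per call
calls_tuned = round_trips(total, 50)    # 50 rows per call
print(calls_default, calls_tuned, calls_tuned / calls_default)
```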
4. Adjusting Queue Size: Setting the number of rows in the row set (go to Transformation Settings > Miscellaneous) to a value your RAM can handle can significantly help performance. Increase this value only if the system has plenty of RAM available. More rows are then held in the in-memory buffer, and processing can be much faster. The best value, however, can only be found by trial and error.
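The row set behaves like a bounded buffer between two steps. The sketch below is an analogy using Python's standard queue module, not Kettle code; `rowset_size` is a hypothetical parameter standing in for the row-set setting. A larger buffer lets the producing step run further ahead before it has to wait, at the cost of more memory.

```python
import queue
import threading


def run_pipeline(rows, rowset_size):
    """Pass rows from a producer step to a consumer step through a bounded buffer."""
    buffer = queue.Queue(maxsize=rowset_size)
    done = object()  # sentinel marking the end of the stream
    out = []

    def producer():
        for row in rows:
            buffer.put(row)  # blocks while the buffer is full
        buffer.put(done)

    def consumer():
        while True:
            row = buffer.get()  # blocks while the buffer is empty
            if row is done:
                break
            out.append(row.upper())

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```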
5. Use Step Monitoring: Enable it (Transformation Settings > Monitoring), review the performance statistics of the transformation, then fine-tune it accordingly.
6. Lazy Conversion: This improves performance, especially when you need to write from an input text file to an output text file.
The step avoids converting the binary data fetched from the input file, and thus writes the data out in its serialized form.
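The difference can be sketched as follows. This is only an analogy in plain Python, with a hypothetical two-field record (id;amount) and hypothetical function names: the eager version converts every field on input and again on output, while the lazy version passes the raw bytes straight through.

```python
def copy_eager(lines):
    """Convert every field to a typed value, then format it again on output."""
    out = []
    for line in lines:
        id_, amount = line.rstrip(b"\n").split(b";")
        row = (int(id_.decode()), float(amount.decode()))  # conversion on input
        out.append(f"{row[0]};{row[1]}\n".encode())        # conversion on output
    return out


def copy_lazy(lines):
    """Keep each field as raw bytes and write it back unchanged."""
    out = []
    for line in lines:
        fields = line.rstrip(b"\n").split(b";")
        out.append(b";".join(fields) + b"\n")
    return out
```

Besides skipping the conversion work, the lazy path also preserves the original formatting of each field, since nothing is parsed and re-rendered.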
7. Removing Fields in Select Values step: Avoid doing this wherever possible, because it forces every row to be reconstructed, which directly affects performance.
8. JavaScript Good Practices for Pentaho: Avoid JavaScript as much as possible. Even though it is the fastest scripting language, it still requires a JavaScript engine to run, which adds overhead. If it cannot be avoided, keep it to a minimum, e.g. instead of 3 JavaScript steps use 1 step with a combined script.
Avoid data conversion and variable creation in JavaScript as much as possible, as these can be handled by Pentaho Kettle steps.