Wednesday, July 17, 2013

Pentaho Kettle Parameters and Variables: Tips and Tricks

Pentaho Kettle Parameters and Variables: Tips and Tricks

This blog post is not intended to be a formal introduction to using parameters and variables in Pentaho Kettle, but more a practical showcase of possible usages.


Please read my previous blog post Pentaho Data Integration: Scheduling and command line arguments as an introduction on how to pass command line arguments to a Kettle job.


When I mention parameters below, I am always talking about named parameters.

Parameters and Variables

Definitions upfront

Named Parameter: “Named parameters are a system that allows you to parameterize your transformations and jobs.  On top of the variables system that was already in place prior to the introduction in version 3.2, named parameters offer the setting of a description and a default value.  That allows you in turn to list the required parameters for a job or transformation.” (Source)


Variable: “Variables can be used throughout Pentaho Data Integration, including in transformation steps and job entries. You define variables by setting them with the Set Variable step in a transformation or by setting them in the kettle.properties file. [...] The first usage (and only usage in previous Kettle versions) was to set an environment variable. Traditionally, this was accomplished by passing options to the Java Virtual Machine (JVM) with the -D option. The only problem with using environment variables is that the usage is not dynamic and problems arise if you try to use them in a dynamic way.  Changes to the environment variables are visible to all software running on the virtual machine.  [...] Because the scope of an environment variable is too broad, Kettle variables were introduced to provide a way to define variables that are local to the job in which the variable is set. The "Set Variable" step in a transformation allows you to specify in which job you want to set the variable's scope (i.e. parent job, grand-parent job or the root job).” (Source). “

Example

Let’s walk through this very simple example of using parameters and variables. I try to explain all the jobs and transformations involved. The files are also available for download here. You can find the following files in the folder intro_to_parameters_and_variables.

jb_main.kjb

In this extremely simple job we call a subjob call jb_slave.kjb. In this case, we defined hard coded parameter values in the job entry settings. Alternatively, to make this more dynamic, we could have just defined parameters in the job settings.

jb_slave.kjb

This subjob executes the transformations tr_set_variables.ktr and tr_show_param_values.ktr. In this case, in order to access the parameter values from the parent job, we defined the parameters without values in the job settings:
Note: This is just one of the ways you can pass parameters down to the subprocess.

tr_set_variables.ktr

This transformation sets a variable called var1 with scope Valid in parent job so that successive processes can make use if it. In this case the values originate from a Generate Rows step for demonstration purposes; in real world examples you might read in some values from a file or a database table.

tr_show_param_values.ktr

The main transformation has the sole purpose of writing all the parameter and variable values to the log. We retrieve the parameters and variable by using a Get Variables step. We also check if a value is present by using a Filter Rows step. In case one value is missing, we Abort the transformation, otherwise the values are written to the log.


There is no need to set the parameter names in this transformations; there is an advantage though if you do it:
Missing parameter values will be properly displayed as NULL, which makes it a bit easier to check for them.
If you don't define them in the transformation settings, missing parameter values will be displayed as ${PARAMETERNAME}.


Important: Variables coming from tr_set_variables.ktr MUST NOT be listed in the Parameter tab in the Transformation Settings as this overrides the variable.

Making Parameters available for all subprocesses in an easy fashion

As you saw above, defining the parameters for each subprocess just to be able to pass them down can be a bit labour intensive. Luckily, there is a faster way of doing just this:


  1. In the main job specify the parameters that you want to pass in in the Job Settings:
    This way parameters and their values can be passed in from the command line in example.
  2. Right after the Start job entry use the Set Variables job entry. Specify the variable names, reference the parameters you set up in step 1 and set the scope to Valid in the current job.
  3. There is no need to specify any parameters/variables in any of the subprocesses.


To see how this is working, run jb_main.kjb in the passing_down_parameters_in_an_easy_fashion folder (part of the provided examples).

What if I still want to be able to run my subprocess independently sometimes?

You might have some situations, when you have to run the subprocess independently (so in other words: You do not execute it from the parent/main job, but run it on its own). When we pass down parameters or variables, this can be a bit tricky and usually it just doesn’t work out of the box. Luckily, there is a way to achieve this though:
  1. In the subprocess, specify the parameter that you want to be able to pass in. In our example (which is based on the previous example), we modified the transformation tr_show_param_values.ktr and added following parameters to the Transformation Settings:
    We also amended the
    Get Variables step to make use of these parameters:
    This way, we can already run this transformation on its own. Now we only have to adjust the parent job so that we can run it from there as well.
  2. In the parent job, in the Job or Transformation job entry settings, go to the Parameters tab and tick Pass all parameter values down to the sub-transformation/sub-job. Next, as the Parameter set the name of the parameter you defined in the subprocess. As the Value define the variable that you want to pass down: ${variable}. This assumes that this variable was set beforehand by some Set Variables job entry/step.
    In our case, we modified transformation job entry in the job
    jb_slave.kjb and added following mapping to the job entry settings in the Parameters tab:
A sample for this setup is provided in the mulitpurpose_setup_allow_individual_execution_of_subprocesses folder.

Closing remarks


Using parameters and variables in Kettle jobs and transformations allows you to create highly dynamic processes. I hope this tutorial shed some light onto how this can be achieved.

12 comments:

  1. Very useful explanation about variables and parameters, a subject that always brings doubts to my mind everytime I need to use it.
    Thanks a lot, Diddy!

    ReplyDelete
    Replies
    1. Thanks a lot for your feedback Carlos! Much appreciated!

      Delete
  2. Hi.. I want to learn PDI.. From where I can start learning ? What is first basic example in PDI ?


    http://pentaho-bi-suite.blogspot.in/2013/07/pentaho-community-dashboardcde-basics.html

    ReplyDelete
    Replies
    1. I would recommend the book "Pentaho Data Integration 4 Cookbook".

      Delete
  3. Hi Diethard,
    The fact that variables coming from tr_set_variables.ktr MUST NOT be listed in the Parameter tab in the Transformation Settings (as this overrides the variable) seems to me like a bug. Would you agree that this corresponds to the bug report http://jira.pentaho.com/browse/PDI-5464?

    ReplyDelete
  4. Finally that confusion between vars and parameters was clearly explained. I too often used to read about this Pentaho topic. In every project i try to parametrize my jobs/transformations, but every single time I took a different approach. Now, I'll have a clear reference on this.

    Thanks Diethard!

    ReplyDelete
    Replies
    1. Excellent! Thanks a lot for your feedback!

      Delete
  5. Hi Diethard, Thanks for the excellent article, it clearly differenciates the Variables and parameters and how it is handled in pentaho.
    Recently I ran into a problem, I had in one job used a set variable at JVM level and set a variable RestartInterval=100. In another job I had defined this as a job parameter with a value of 300 in the job properties. how ever the value of 100 is only getting passed and the job parameter is not over writting the JVM variable.
    If I have set a value of a variable at Java Virtual Machine level, how can I unset the variable? or delete the variable from JVM?

    ReplyDelete
  6. Hello..
    I have a small problem..
    I have saved some parameters in kettle.properties. And I ran sh spoon.sh(Data integration). Here I opened a .ktr file which reads some parameters from kettle.properties. But it is not reading those values. I have set the path for KETTLE_HOME in environment variables also. Please help me out.. Thank you in advance

    ReplyDelete
    Replies
    1. Did you restart Spoon after adding the new parameters to kettle.properties? Is your KETTLE_HOME env variable using the correct path? Are you referencing the parameters the correct way? A good way of testing is: Within any step or job settings field that allows a parameter, press CTRL+Space and see if the parameter shows up in the list.

      Delete
    2. Did you restart Spoon after adding the new parameters to kettle.properties? Is your KETTLE_HOME env variable using the correct path? Are you referencing the parameters the correct way? A good way of testing is: Within any step or job settings field that allows a parameter, press CTRL+Space and see if the parameter shows up in the list.

      Delete