6. Pig support

For Pig users, SHDP provides easy creation and configuration of PigServer instances for registering and executing scripts either locally or remotely. In its simplest form, the declaration looks as follows:

<hdp:pig />

This will create a PigServer instance, named hadoop-pig, configured with a default PigContext, executing scripts in MapReduce mode. In typical scenarios however, one might want to connect to a remote Hadoop tracker and register some scripts automatically so let us take a look of how the configuration might look like:

<pig exec-type="LOCAL" job-name="pig-script" configuration-ref="hadoop-configuration" properties-location="pig-dev.properties" 
   xmlns="http://www.springframework.org/schema/hadoop">
     source=${pig.script.src}
   <script location="org/company/pig/script.pig">
     <arguments>electric=sea</arguments>
   </script>
   <script>
     A = LOAD 'src/test/resources/logs/apache_access.log' USING PigStorage() AS (name:chararray, age:int);
     B = FOREACH A GENERATE name;
     DUMP B;
   </script>
</pig> />

The example exposes quite a few options so let us review them one by one. First the top-level pig definition configures the pig instance: the execution type, the Hadoop configuration used and the job name. Notice that additional properties can be specified (either by declaring them inlined or/and loading them from an external file) - in fact, <hdp:pig/> just like the rest of the libraries configuration elements, supports common properties attributes as described in the hadoop configuration section.

The definition contains also two scripts: script.pig (read from the classpath) to which one pair of arguments, relevant to the script, is passed (notice the use of property placeholder) but also an inlined script, declared as part of the definition, without any arguments.

As you can tell, the pig namespace offers several options pertaining to Pig configuration. And, as with the other Hadoop-related integration, the underlying PigServer is bound to the enclosing application context life-cycle; that is, it will automatically start and stop along-side the application so one does not have to worry about its management.

6.1 Using the Pig tasklet

For Spring Batch environments, SHDP provides a dedicated tasklet to execute Pig queries, on demand, as part of a batch or workflow. The declaration is pretty straight forward:

<hdp:pig-tasklet id="pig-script">
   <hdp:script location="org/company/pig/handsome.pig" />
</hdp:pig-tasklet>

The syntax of the scripts declaration is similar to that of the pig namespace.