For Pig users, SHDP provides easy creation and configuration of PigServer
instances for registering and executing scripts either locally
or remotely. In its simplest form, the declaration looks as follows:
<hdp:pig />
This will create a PigServer instance, named hadoop-pig, configured with a default PigContext, executing scripts in MapReduce mode. In typical scenarios, however, one might want to connect to a remote Hadoop tracker and register some scripts automatically, so let us take a look at how the configuration might look:
<pig exec-type="LOCAL" job-name="pig-script" configuration-ref="hadoopConfiguration"
     properties-location="pig-dev.properties" xmlns="http://www.springframework.org/schema/hadoop">
  source=${pig.script.src}
  <script location="org/company/pig/script.pig">
    <arguments>electric=sea</arguments>
  </script>
  <script>
    A = LOAD 'src/test/resources/logs/apache_access.log' USING PigStorage() AS (name:chararray, age:int);
    B = FOREACH A GENERATE name;
    DUMP B;
  </script>
</pig>
The example exposes quite a few options, so let us review them one by one. First, the top-level pig definition configures the pig instance: the execution type, the Hadoop configuration used and the job name. Notice that additional properties can be specified (by declaring them inline, by loading them from an external file, or both); in fact, <hdp:pig/>, just like the rest of the library's configuration elements, supports the common properties attributes described in the hadoop configuration section.
The definition also contains two scripts: script.pig (read from the classpath), to which one pair of arguments, relevant to the script, is passed (notice the use of the property placeholder), and an inlined script, declared as part of the definition, without any arguments.
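The pig definition above relies on two things declared elsewhere in the application context: the Hadoop configuration bean named by configuration-ref and a property placeholder that resolves ${pig.script.src}. A minimal sketch of those supporting declarations might look as follows; the host names, ports and the pig-placeholders.properties file name are assumptions made for illustration, not values taken from this guide.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:context="http://www.springframework.org/schema/context"
       xmlns:hdp="http://www.springframework.org/schema/hadoop"
       xsi:schemaLocation="
         http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
         http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd
         http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

    <!-- Resolves placeholders such as ${pig.script.src}; the file name is hypothetical -->
    <context:property-placeholder location="classpath:pig-placeholders.properties"/>

    <!-- The bean referenced through configuration-ref; host names and ports are placeholders -->
    <hdp:configuration id="hadoopConfiguration">
        fs.default.name=hdfs://namenode.example.org:9000
        mapred.job.tracker=jobtracker.example.org:9001
    </hdp:configuration>

</beans>
```

With declarations along these lines in place, the pig element can sit in the same context file and pick up both the configuration bean and the resolved placeholder values.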
As you can tell, the pig namespace offers several options pertaining to Pig configuration. And, as with the other Hadoop-related integrations, the underlying PigServer is bound to the enclosing application context life-cycle; that is, it will automatically start and stop alongside the application, so one does not have to worry about its management.
For Spring Batch environments, SHDP provides a dedicated tasklet to execute Pig queries, on demand, as part of a batch or workflow. The declaration is pretty straightforward:
<hdp:pig-tasklet id="pig-script">
  <hdp:script location="org/company/pig/handsome.pig" />
</hdp:pig-tasklet>
The syntax of the scripts declaration is similar to that of the pig namespace.
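To illustrate where the tasklet fits, the sketch below references it from a Spring Batch step using the standard batch namespace. The job and step ids (pigJob, runPig) are hypothetical, and the sketch assumes that a jobRepository bean and the rest of the batch infrastructure are defined elsewhere in the context.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:batch="http://www.springframework.org/schema/batch"
       xmlns:hdp="http://www.springframework.org/schema/hadoop"
       xsi:schemaLocation="
         http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
         http://www.springframework.org/schema/batch http://www.springframework.org/schema/batch/spring-batch.xsd
         http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

    <!-- The tasklet from the declaration above -->
    <hdp:pig-tasklet id="pig-script">
        <hdp:script location="org/company/pig/handsome.pig"/>
    </hdp:pig-tasklet>

    <!-- Hypothetical job with a single step that delegates to the Pig tasklet -->
    <batch:job id="pigJob">
        <batch:step id="runPig">
            <batch:tasklet ref="pig-script"/>
        </batch:step>
    </batch:job>

</beans>
```

Because the step only holds a reference to the tasklet, the same Pig script can be reused by several jobs or combined with other steps in a larger workflow.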