For Pig users, SHDP provides easy creation and configuration of PigServer instances for registering and executing scripts either locally or remotely. In its simplest form, the declaration looks as follows:
<hdp:pig />
This will create a org.springframework.data.hadoop.pig.PigServerFactory
instance, named pigFactory, a factory that creates PigServer
instances on demand configured with a default PigContext, executing
scripts in MapReduce mode. The factory is needed since PigServer is
not thread-safe and thus cannot be used by multiple objects at the same
time. In typical scenarios however, one might want to connect to a
remote Hadoop tracker and register some scripts automatically so let us
take a look of how the configuration might look like:
<pig-factory exec-type="LOCAL" job-name="pig-script" configuration-ref="hadoopConfiguration" properties-location="pig-dev.properties" xmlns="http://www.springframework.org/schema/hadoop"> source=${pig.script.src} <script location="org/company/pig/script.pig"> <arguments>electric=sea</arguments> </script> <script> A = LOAD 'src/test/resources/logs/apache_access.log' USING PigStorage() AS (name:chararray, age:int); B = FOREACH A GENERATE name; DUMP B; </script> </pig-factory> />
The example exposes quite a few options so let us review them one by
one. First the top-level pig definition configures the pig instance: the
execution type, the Hadoop configuration used and the job name. Notice
that additional properties can be specified (either by declaring them
inlined or/and loading them from an external file) - in fact,
<hdp:pig-factory/> just like the rest of the libraries configuration
elements, supports common properties attributes as described in the
hadoop configuration section.
The definition contains also two scripts: script.pig (read from the
classpath) to which one pair of arguments, relevant to the script, is
passed (notice the use of property placeholder) but also an inlined
script, declared as part of the definition, without any arguments.
As you can tell, the pig-factory namespace offers several options
pertaining to Pig configuration.
Like the rest of the Spring Hadoop components, a runner is provided out
of the box for executing Pig scripts, either inlined or from various
locations through pig-runner element:
<hdp:pig-runner id="pigRunner" run-at-startup="true"> <hdp:script> A = LOAD 'src/test/resources/logs/apache_access.log' USING PigStorage() AS (name:chararray, age:int); ... </hdp:script> <hdp:script location="pig-scripts/script.pig"/> </hdp:pig-runner>
The runner will trigger the execution during the application start-up
(notice the run-at-startup flag which is by default false). Do note
that the runner will not run unless triggered manually or if
run-at-startup is set to true. Additionally the runner (as in fact
do all runners in SHDP) allows one or multiple pre and
post actions to be specified to be executed before and after each run.
Typically other runners (such as other jobs or scripts) can be specified
but any JDK Callable can be passed in. For more information on
runners, see the dedicated chapter.
For Spring Batch environments, SHDP provides a dedicated tasklet to execute Pig queries, on demand, as part of a batch or workflow. The declaration is pretty straightforward:
<hdp:pig-tasklet id="pig-script"> <hdp:script location="org/company/pig/handsome.pig" /> </hdp:pig-tasklet>
The syntax of the scripts declaration is similar to that of the pig
namespace.
For those that need to programmatically interact directly with Pig ,
Spring for Apache Hadoop provides a dedicated
template, similar
to the aforementioned HiveTemplate. The template handles the
redundant, boiler-plate code, required for interacting with Pig such as
creating a new PigServer, executing the scripts, catching any
exceptions and performing clean-up. One can programmatically execute
scripts but also interact with the Hive API through the
PigServerCallback. For example:
<hdp:pig-factory ... /> <!-- Pig template wires automatically to 'pigFactory'--> <hdp:pig-template /> <!-- use component scanning--> <context:component-scan base-package="some.pkg" />
public class SomeClass { @Inject private PigTemplate template; public Set<String> getDbs() { return pigTemplate.execute(new PigCallback<Set<String>() { @Override public Set<String> doInPig(PigServer pig) throws ExecException, IOException { return pig.getAliasKeySet(); } }); } }
The example above shows a basic container configuration wiring a
PigTemplate into a user class which uses it to interact with the
PigServer API. Notice that the user does not have to handle the
lifecycle of the PigServer instance or catch any exception - these are
handled automatically by the template which converts them, like the rest
of the Spring templates, into `DataAccessException`s. Thus the
application only has to track only one exception hierarchy across all
data technologies instead of one per technology.