Please read the sample prerequisites before following these instructions.
This sample demonstrates how to execute the wordcount example in the
context of a Spring Batch application. It serves as a starting point that
will be expanded upon in other samples. The sample code is located in the
distribution directory
<spring-hadoop-install-dir>/samples/batch-wordcount
The sample uses the Spring Hadoop namespace to define a Spring Batch Tasklet that runs a Hadoop job. A Spring Batch Job (not to be confused with a Hadoop job) combines multiple steps to create a flow of execution, a small workflow. Each step in the flow does some processing, which can be as complex or as simple as you require. The configuration of the flow from one step to another can be very simple, a linear sequence of steps, or complex, using conditional and programmatic branching as well as sequential and parallel step execution. A Spring Batch Tasklet is the abstraction that represents the processing for a Step and is an extensible part of Spring Batch. You can write your own implementation of a Tasklet to perform arbitrary processing, but often you configure existing Tasklets provided by Spring Batch and Spring Hadoop.
Spring Batch provides Tasklets for reading, writing, and processing data from flat files, databases, and messaging systems, and for executing system (shell) commands. Spring Hadoop provides Tasklets for running Hadoop MapReduce, Streaming, Hive, and Pig jobs, as well as for executing script files that have built-in support for easy interaction with HDFS.
The part of the configuration file that defines the Spring Batch Job flow is shown below and can be found in the file wordcount-context.xml. The elements <batch:job/>, <batch:step/>, and <batch:tasklet/> come from the XML Schema for Spring Batch.
<batch:job id="job1">
  <batch:step id="import" next="wordcount">
    <batch:tasklet ref="script-tasklet"/>
  </batch:step>
  <batch:step id="wordcount">
    <batch:tasklet ref="wordcount-tasklet"/>
  </batch:step>
</batch:job>
This configuration defines a Spring Batch Job named "job1" that contains two steps executed sequentially. The first step prepares HDFS with sample data and the second runs the Hadoop wordcount MapReduce job. The tasklets refer to the "script-tasklet" and "wordcount-tasklet" definitions, which are shown a little later.
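As a rough mental model of the linear flow described above, the following Java sketch runs named steps in order and stops at the first one that fails. The class LinearFlowSketch and its run method are hypothetical stand-ins written for this illustration; they are not Spring Batch classes, and the real framework also records step state in a JobRepository.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;

// Hypothetical stand-in for a linear batch flow; these are NOT Spring Batch classes.
class LinearFlowSketch {

    // Runs the steps in insertion order and returns the ids of the steps that completed.
    // A step signals success by returning true; false or an exception stops the flow.
    static List<String> run(Map<String, Callable<Boolean>> steps) {
        List<String> completed = new ArrayList<>();
        for (Map.Entry<String, Callable<Boolean>> step : steps.entrySet()) {
            boolean ok;
            try {
                ok = step.getValue().call();
            } catch (Exception e) {
                ok = false;
            }
            if (!ok) {
                break; // later steps are skipped once a step fails
            }
            completed.add(step.getKey());
        }
        return completed;
    }

    public static void main(String[] args) {
        Map<String, Callable<Boolean>> job1 = new LinkedHashMap<>();
        job1.put("import", () -> true);    // stands in for script-tasklet
        job1.put("wordcount", () -> true); // stands in for wordcount-tasklet
        System.out.println(run(job1));
    }
}
```

If the "import" stand-in returned false, the "wordcount" step would not run, which mirrors why the sample prepares HDFS in a separate first step.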
The SpringSource Tool Suite is a free Eclipse-powered development environment that provides visualization and authoring help for Spring Batch workflows, as shown below.
The script tasklet shown below uses Groovy to remove any data that is in the input or output directories and puts the file "nietzsche-chapter-1.txt" into HDFS.
<script-tasklet id="script-tasklet">
  <script language="groovy">
    inputPath = "${wordcount.input.path:/user/gutenberg/input/word/}"
    outputPath = "${wordcount.output.path:/user/gutenberg/output/word/}"
    if (fsh.test(inputPath)) {
      fsh.rmr(inputPath)
    }
    if (fsh.test(outputPath)) {
      fsh.rmr(outputPath)
    }
    inputFile = "src/main/resources/data/nietzsche-chapter-1.txt"
    fsh.put(inputFile, inputPath)
  </script>
</script-tasklet>
The script makes use of the predefined variable fsh, which is the embedded Filesystem Shell that Spring Hadoop provides. It also uses Spring's Property Placeholder functionality so that the input and output paths can be configured externally to the application, for example using property files. The syntax for variables recognized by Spring's Property Placeholder is ${key:defaultValue}, so in this case /user/gutenberg/input/word and /user/gutenberg/output/word are the default input and output paths. Note that you can also use Spring's Expression Language to configure values in the script or in other XML definitions.
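To make the ${key:defaultValue} lookup described above concrete, here is a minimal Java sketch of the idea. It is a simplified model written for this illustration, not Spring's implementation: the class PlaceholderSketch and its resolve method are hypothetical, and it ignores features such as nested placeholders.

```java
import java.util.Collections;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a simplified model of ${key:defaultValue}, not Spring's implementation.
class PlaceholderSketch {

    // Group 1 is the key; group 2 (optional) is the default that follows the first ':'.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\$\\{([^:}]+)(?::([^}]*))?\\}");

    // Replaces every placeholder with the configured value, or with its default if no value is set.
    static String resolve(String text, Map<String, String> properties) {
        Matcher m = PLACEHOLDER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String key = m.group(1);
            String fallback = m.group(2);
            String value = properties.containsKey(key) ? properties.get(key) : fallback;
            if (value == null) {
                throw new IllegalArgumentException("No value or default for '" + key + "'");
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String template = "${wordcount.input.path:/user/gutenberg/input/word/}";
        // No property is set, so the default after the colon is used.
        System.out.println(resolve(template, Collections.emptyMap()));
        // An externally supplied value (a made-up path) takes precedence over the default.
        System.out.println(resolve(template,
                Collections.singletonMap("wordcount.input.path", "/tmp/input/")));
    }
}
```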
The configuration of the tasklet to execute the Hadoop MapReduce jobs is shown below.
<hdp:tasklet id="wordcount-tasklet" job-ref="wordcount-job"/>

<job id="wordcount-job"
     input-path="${wordcount.input.path:/user/gutenberg/input/word/}"
     output-path="${wordcount.output.path:/user/gutenberg/output/word/}"
     mapper="org.apache.hadoop.examples.WordCount.TokenizerMapper"
     reducer="org.apache.hadoop.examples.WordCount.IntSumReducer" />
The <hdp:tasklet/> and <hdp:job/> elements are from the Spring Hadoop XML Schema. The hadoop tasklet is the bridge between the Spring Batch world and the Hadoop world. The hadoop tasklet in turn refers to your MapReduce job. Various common properties of a MapReduce job can be set, such as the mapper and reducer. The section Creating a Hadoop Job describes the additional elements you can configure in the XML.
Note: If you look at the JavaDocs for the org.apache.hadoop.examples package, you can see the Mapper and Reducer class names for many examples you may have previously used from the hadoop command line runner.
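The configuration-ref attribute of the job definition refers to common Hadoop configuration information that can be shared across many jobs. The job shown above omits the attribute and therefore relies on the configuration registered under the default id. That configuration is defined in the file hadoop-context.xml and is shown below.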
<!-- default id is 'hadoop-configuration' -->
<hdp:configuration register-url-handler="false">
  fs.default.name=${hd.fs}
</hdp:configuration>
As mentioned before, because this is a configuration file processed by the Spring container, it supports variable substitution through the use of ${var} style variables. In this case the location for HDFS is parameterized and no default value is provided. The property file hadoop.properties contains the definition of the hd.fs variable; change the value if you want to refer to a different name node location.
hd.fs=hdfs://localhost:9000
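When changing hd.fs, it can help to see how the value breaks down into scheme, host, and port. The sketch below reads the property with java.util.Properties and parses it with java.net.URI. The class NameNodeAddressSketch is hypothetical and exists only for this illustration; the sample itself lets the Spring container read hadoop.properties.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.Properties;

// Illustrative helper, not part of the sample: inspects the hd.fs value.
class NameNodeAddressSketch {

    // Parses properties-file text and returns the hd.fs entry as a URI.
    static URI nameNode(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String value = props.getProperty("hd.fs");
        if (value == null) {
            throw new IllegalArgumentException("hd.fs is not defined");
        }
        return URI.create(value);
    }

    public static void main(String[] args) {
        URI uri = nameNode("hd.fs=hdfs://localhost:9000\n");
        System.out.println(uri.getScheme() + " " + uri.getHost() + " " + uri.getPort());
    }
}
```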
The entire application is put together in the configuration file launch-context.xml, shown below.
<!-- where to read externalized configuration values -->
<context:property-placeholder
    location="classpath:batch.properties,classpath:hadoop.properties"
    ignore-resource-not-found="true"
    ignore-unresolvable="true" />

<!-- simple base configuration for batch components, e.g. JobRepository -->
<import resource="classpath:/META-INF/spring/batch-common.xml" />

<!-- shared hadoop configuration -->
<import resource="classpath:/META-INF/spring/hadoop-context.xml" />

<!-- word count workflow -->
<import resource="classpath:/META-INF/spring/wordcount-context.xml" />
In the directory <spring-hadoop-install-dir>/samples/batch-wordcount, build and run the sample:
$ ../../gradlew
If this is the first time you are using gradlew, it will download the Gradle build tool and all necessary dependencies to run the sample.
Note: If you run into some issues, drop us a line in the Spring Forums.
You can use Gradle to export all required dependencies into a directory and create a shell script to run the application. To do this execute the command
$ ../../gradlew installApp
This places the shell scripts and dependencies under the build/install/batch-wordcount directory. You can zip up that directory and share the application with others.
The main Java class used is part of Spring Batch. The class is org.springframework.batch.core.launch.support.CommandLineJobRunner. This main application requires you to specify at least a Spring configuration file and a job instance name. You can read the CommandLineJobRunner JavaDocs, as well as the corresponding section in the Spring Batch reference documentation, to find out more about the command line options it supports.
$ ./build/install/wordcount/bin/wordcount classpath:/launch-context.xml job1
You can then verify that the output from the word count is present and cat it to the screen:
$ hadoop dfs -ls /user/gutenberg/output/word
Warning: $HADOOP_HOME is deprecated.
Found 2 items
-rw-r--r--   3 mpollack supergroup          0 2012-01-31 19:29 /user/gutenberg/output/word/_SUCCESS
-rw-r--r--   3 mpollack supergroup     918472 2012-01-31 19:29 /user/gutenberg/output/word/part-r-00000

$ hadoop dfs -cat /user/gutenberg/output/word/part-r-00000 | more
Warning: $HADOOP_HOME is deprecated.
"'Spells 1
"'army' 1
"(1) 1
"(Lo)cra" 1
"13 4
"1490 1
"1498," 1
"35" 1
"40," 1
"A 9
"AS-IS". 1
"AWAY 1
"A_ 1
"Abide 1
"About 1
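If you want to inspect the results programmatically after copying the part file locally, a small parser is enough. The sketch below assumes each line holds a word and its count separated by a tab, which is the default for Hadoop's text output; the class WordCountOutputSketch is hypothetical and not part of the sample.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative parser for wordcount output lines; assumes "<word>\t<count>" per line.
class WordCountOutputSketch {

    // Returns the words and counts in file order; lines without a tab are skipped.
    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab <= 0) {
                continue; // blank or malformed line
            }
            String word = line.substring(0, tab);
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            counts.put(word, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two lines taken from the listing above, with the tab made explicit.
        List<String> sample = Arrays.asList("\"13\t4", "\"A\t9");
        System.out.println(parse(sample));
    }
}
```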