Hadoop testing has always been a cumbersome process, especially if you try to run tests during the normal project build. Traditionally developers have had few options beyond running a Hadoop cluster in local or pseudo-distributed mode and using it to execute MapReduce jobs. The Hadoop project itself uses a number of mini clusters in its own tests, and these provide better tooling for running your code in an isolated environment.

Spring Hadoop, and especially its Yarn module, faced similar testing problems. Spring Hadoop provides testing facilities in order to make testing on Hadoop much easier, especially if your code relies on Spring Hadoop itself. These testing facilities are also used internally to test Spring Hadoop, although some test cases still rely on a running Hadoop instance on the host where the project build is executed.

Testing with Spring Hadoop revolves around two central concepts: first, fire up the mini cluster, and second, use the configuration prepared by the mini cluster to talk to the Hadoop components. Let's go through the general testing facilities offered by Spring Hadoop.

Testing support for MapReduce and Yarn in Spring Hadoop is separated into different packages, mostly because these two components don't have hard dependencies on each other. You will see a lot of similarities when creating tests for MapReduce and Yarn.
Mini clusters usually contain testing components from the Hadoop project itself: clusters for MapReduce job handling and HDFS, all run within the same process. In Spring Hadoop, mini clusters implement the HadoopCluster interface, which provides methods for lifecycle and configuration. Spring Hadoop provides transitive Maven dependencies against different Hadoop distributions, and thus mini clusters are started using different implementations. This is mostly because we want to support HadoopV1 and HadoopV2 at the same time. All of this is handled automatically at runtime, so everything should be transparent to the end user.
```java
public interface HadoopCluster {
    Configuration getConfiguration();
    void start() throws Exception;
    void stop();
    FileSystem getFileSystem() throws IOException;
}
```
Currently one implementation, named StandaloneHadoopCluster, exists. It supports a simple cluster type where a number of nodes can be defined; every node then contains the utilities for MapReduce job handling and HDFS.
There are a few ways this cluster can be started, depending on the use case. It is possible to use StandaloneHadoopCluster directly, or to configure and start it through HadoopClusterFactoryBean. The existing HadoopClusterManager is used in unit tests to cache running clusters.
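As a minimal sketch of the direct approach: the lifecycle methods come from the HadoopCluster interface above, while the two-argument constructor (cluster id and node count) and the package imports are assumptions to be checked against the actual classes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
// packages assumed to match HadoopClusterFactoryBean shown below
import org.springframework.data.hadoop.test.support.HadoopCluster;
import org.springframework.data.hadoop.test.support.StandaloneHadoopCluster;

public class ManualMiniCluster {

    public static void main(String[] args) throws Exception {
        // constructor arguments (cluster id, node count) are an assumption
        HadoopCluster cluster = new StandaloneHadoopCluster("HadoopClusterTests", 1);
        cluster.start();
        try {
            // configuration and file system reflect the running mini cluster
            Configuration configuration = cluster.getConfiguration();
            FileSystem fs = cluster.getFileSystem();
            System.out.println("fs.defaultFS: " + configuration.get("fs.defaultFS"));
            System.out.println("File system: " + fs.getUri());
        } finally {
            cluster.stop();
        }
    }
}
```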
Note: It's advisable not to use HadoopClusterManager directly in your own tests; it is used internally by the framework to cache running clusters.
<bean id="hadoopCluster" class="org.springframework.data.hadoop.test.support.HadoopClusterFactoryBean"> <property name="clusterId" value="HadoopClusterTests"/> <property name="autoStart" value="true"/> <property name="nodes" value="1"/> </bean>
The example above defines a bean named hadoopCluster using the factory bean HadoopClusterFactoryBean. It defines a simple one-node cluster which is started automatically.
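If you prefer JavaConfig over XML, a roughly equivalent definition could be sketched as below; the setter names are assumed to mirror the clusterId, autoStart and nodes properties used in the XML example.

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.hadoop.test.support.HadoopClusterFactoryBean;

@Configuration
public class MiniClusterConfig {

    @Bean
    public HadoopClusterFactoryBean hadoopCluster() {
        HadoopClusterFactoryBean factory = new HadoopClusterFactoryBean();
        // setter names assumed to mirror the XML properties shown above
        factory.setClusterId("HadoopClusterTests");
        factory.setAutoStart(true);
        factory.setNodes(1);
        return factory;
    }
}
```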
Spring Hadoop components usually depend on a Hadoop configuration which is wired into them during the application context startup phase. This was explained in previous chapters, so we won't go through it again. However, there is now a catch-22: we need the configuration for the context, but it is not known until the mini cluster has done its startup magic and prepared the configuration with the correct values reflecting the runtime status of the cluster itself. The solution is to use another bean, named ConfigurationDelegatingFactoryBean, which simply delegates the configuration request to the running cluster.
<bean id="hadoopConfiguredConfiguration" class="org.springframework.data.hadoop.test.support.ConfigurationDelegatingFactoryBean"> <property name="cluster" ref="hadoopCluster"/> </bean> <hdp:configuration id="hadoopConfiguration" configuration-ref="hadoopConfiguredConfiguration"/>
In the above example we created a bean named hadoopConfiguredConfiguration using ConfigurationDelegatingFactoryBean, which simply delegates to the hadoopCluster bean. The returned hadoopConfiguredConfiguration bean is a Hadoop Configuration object, so it can be used as-is.

The latter part of the example shows how the Spring Hadoop namespace is used to create another Configuration object using hadoopConfiguredConfiguration as a reference. This scenario makes sense if there is a need to add additional configuration options to the running configuration used by other components. Usually it is suitable to use the cluster-prepared configuration as-is.
It is perfectly all right to create your tests from scratch, for example by creating the cluster manually and then getting the runtime configuration from there. This just needs some boilerplate code in your context configuration and unit test lifecycle.
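A sketch of that kind of boilerplate is shown below: the cluster is started in a JUnit class lifecycle hook and its runtime Configuration is registered in a manually built application context. The constructor arguments and package imports are assumptions, as above.

```java
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.springframework.context.support.GenericApplicationContext;
// packages assumed to match HadoopClusterFactoryBean shown earlier
import org.springframework.data.hadoop.test.support.HadoopCluster;
import org.springframework.data.hadoop.test.support.StandaloneHadoopCluster;

public class ManualContextTests {

    private static HadoopCluster cluster;
    private static GenericApplicationContext ctx;

    @BeforeClass
    public static void setup() throws Exception {
        // constructor arguments (cluster id, node count) are an assumption
        cluster = new StandaloneHadoopCluster("ManualContextTests", 1);
        cluster.start();
        // hand the runtime Configuration to a manually built context
        ctx = new GenericApplicationContext();
        ctx.getBeanFactory().registerSingleton("hadoopConfiguration", cluster.getConfiguration());
        ctx.refresh();
    }

    @AfterClass
    public static void clean() {
        ctx.close();
        cluster.stop();
    }
}
```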
Spring Hadoop adds additional testing facilities to make all this even easier.
```java
@RunWith(SpringJUnit4ClassRunner.class)
public abstract class AbstractHadoopClusterTests implements ApplicationContextAware {
    ...
}

@ContextConfiguration(loader=HadoopDelegatingSmartContextLoader.class)
@MiniHadoopCluster
public class ClusterBaseTestClassTests extends AbstractHadoopClusterTests {
    ...
}
```
The above example shows the AbstractHadoopClusterTests base class and how ClusterBaseTestClassTests is prepared to be aware of a mini cluster. HadoopDelegatingSmartContextLoader offers the same base functionality as the default DelegatingSmartContextLoader in the spring-test package. Additionally, HadoopDelegatingSmartContextLoader automatically handles running clusters and injects their Configuration into the application context.
```java
@MiniHadoopCluster(configName="hadoopConfiguration", clusterName="hadoopCluster", nodes=1, id="default")
```
Generally, the @MiniHadoopCluster annotation allows you to define the injected bean names for the mini cluster and its Configuration, as well as the number of nodes you'd like to have in the cluster.
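For example, a test could override the defaults to get a two-node cluster, as sketched below. The attribute values here are hypothetical, and autowiring the Configuration relies on the loader registering it under the name given in configName; the imported packages are assumptions as well.

```java
import static org.junit.Assert.assertNotNull;

import org.apache.hadoop.conf.Configuration;
import org.junit.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.test.context.ContextConfiguration;
// packages below assumed; adjust to the actual test support packages
import org.springframework.data.hadoop.test.context.HadoopDelegatingSmartContextLoader;
import org.springframework.data.hadoop.test.context.MiniHadoopCluster;
import org.springframework.data.hadoop.test.junit.AbstractHadoopClusterTests;

@ContextConfiguration(loader=HadoopDelegatingSmartContextLoader.class)
@MiniHadoopCluster(configName="hadoopConfiguration", clusterName="hadoopCluster", nodes=2, id="twoNodeCluster")
public class TwoNodeClusterTests extends AbstractHadoopClusterTests {

    // registered by the loader under the name given in configName
    @Autowired
    private Configuration hadoopConfiguration;

    @Test
    public void testConfigurationInjected() {
        assertNotNull(hadoopConfiguration);
    }
}
```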
Spring Hadoop testing depends on the general facilities of the Spring Test framework, meaning that everything cached during a test is reusable in other tests. One needs to understand that if a Hadoop mini cluster and its Configuration are injected into an Application Context, caching happens at the mercy of Spring Test: if a test Application Context is cached, the mini cluster instance is cached with it. While caching is always preferred, one needs to understand that if tests expect a vanilla environment to be present, the test context should be dirtied using the @DirtiesContext annotation.
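For example, a test class expecting a clean environment for every test method could dirty the context like this, using the standard Spring Test class mode:

```java
@ContextConfiguration(loader=HadoopDelegatingSmartContextLoader.class)
@MiniHadoopCluster
@DirtiesContext(classMode=DirtiesContext.ClassMode.AFTER_EACH_TEST_METHOD)
public class VanillaEnvironmentTests extends AbstractHadoopClusterTests {
    // the cached application context, and with it the cached mini
    // cluster, is discarded after every test method
    ...
}
```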
Let's study a proper example of an existing MapReduce job which is executed and tested using Spring Hadoop. This example is Hadoop's classic wordcount. We don't go through all the details of this example because we want to concentrate on the testing-specific code and configuration.
```xml
<context:property-placeholder location="hadoop.properties" />

<hdp:job id="wordcountJob"
    input-path="${wordcount.input.path}"
    output-path="${wordcount.output.path}"
    libs="file:build/libs/mapreduce-examples-wordcount-*.jar"
    mapper="org.springframework.data.hadoop.examples.TokenizerMapper"
    reducer="org.springframework.data.hadoop.examples.IntSumReducer" />

<hdp:script id="setupScript" location="copy-files.groovy">
    <hdp:property name="localSourceFile" value="data/nietzsche-chapter-1.txt" />
    <hdp:property name="inputDir" value="${wordcount.input.path}" />
    <hdp:property name="outputDir" value="${wordcount.output.path}" />
</hdp:script>

<hdp:job-runner id="runner"
    run-at-startup="false"
    kill-job-at-shutdown="false"
    wait-for-completion="false"
    pre-action="setupScript"
    job-ref="wordcountJob" />
```
In the configuration example above we can see a few differences from the actual runtime configuration. Firstly, we didn't specify any kind of Hadoop configuration; it is injected automatically by the testing framework. Secondly, because we want to explicitly wait for the job to run and finish, kill-job-at-shutdown and wait-for-completion are set to false.
```java
@ContextConfiguration(loader=HadoopDelegatingSmartContextLoader.class)
@MiniHadoopCluster
public class WordcountTests extends AbstractMapReduceTests {

    @Test
    public void testWordcountJob() throws Exception {
        // run blocks and throws exception if job failed
        JobRunner runner = getApplicationContext().getBean("runner", JobRunner.class);
        Job wordcountJob = getApplicationContext().getBean("wordcountJob", Job.class);
        runner.call();

        JobStatus finishedStatus = waitFinishedStatus(wordcountJob, 60, TimeUnit.SECONDS);
        assertThat(finishedStatus, notNullValue());

        // get output files from a job
        Path[] outputFiles = getOutputFilePaths("/user/gutenberg/output/word/");
        assertEquals(1, outputFiles.length);
        assertThat(getFileSystem().getFileStatus(outputFiles[0]).getLen(), greaterThan(0L));

        // read through the file and check that line with
        // "themselves 6" was found
        boolean found = false;
        InputStream in = getFileSystem().open(outputFiles[0]);
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line = null;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("themselves")) {
                assertThat(line, is("themselves\t6"));
                found = true;
            }
        }
        reader.close();

        assertThat("Keyword 'themselves' not found", found);
    }
}
```
In the unit test class above we simply run the job defined in the XML, explicitly wait for it to finish, and then check the output content from HDFS by searching for expected strings.
Mini clusters usually contain testing components from the Hadoop project itself: MiniYARNCluster for the Resource Manager and MiniDFSCluster for the Datanode and Namenode, all run within the same process. In Spring Hadoop, mini clusters implement the YarnCluster interface, which provides methods for lifecycle and configuration.
```java
public interface YarnCluster {
    Configuration getConfiguration();
    void start() throws Exception;
    void stop();
    File getYarnWorkDir();
}
```
Currently one implementation, named StandaloneYarnCluster, exists. It supports a simple cluster type where a number of nodes can be defined; every node then has a Yarn Node Manager and an HDFS Datanode, and additionally a Yarn Resource Manager and an HDFS Namenode are started.
There are a few ways this cluster can be started, depending on the use case. It is possible to use StandaloneYarnCluster directly, or to configure and start it through YarnClusterFactoryBean. The existing YarnClusterManager is used in unit tests to cache running clusters.
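Used directly, the lifecycle could look like the sketch below; the constructor arguments and package imports are assumptions, while the rest comes from the YarnCluster interface above.

```java
import java.io.File;

import org.apache.hadoop.conf.Configuration;
// packages assumed to match YarnClusterFactoryBean shown below
import org.springframework.yarn.test.support.StandaloneYarnCluster;
import org.springframework.yarn.test.support.YarnCluster;

public class ManualYarnMiniCluster {

    public static void main(String[] args) throws Exception {
        // constructor arguments (cluster id, node count) are an assumption
        YarnCluster cluster = new StandaloneYarnCluster("YarnClusterTests", 1);
        cluster.start();
        try {
            // configuration reflecting the running mini cluster
            Configuration configuration = cluster.getConfiguration();
            // working directory where container stdout/stderr files end up
            File workDir = cluster.getYarnWorkDir();
            System.out.println("Yarn work dir: " + workDir.getAbsolutePath());
        } finally {
            cluster.stop();
        }
    }
}
```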
Note: It's advisable not to use YarnClusterManager directly in your own tests; it is used internally by the framework to cache running clusters.
<bean id="yarnCluster" class="org.springframework.yarn.test.support.YarnClusterFactoryBean"> <property name="clusterId" value="YarnClusterTests"/> <property name="autoStart" value="true"/> <property name="nodes" value="1"/> </bean>
The example above defines a bean named yarnCluster using the factory bean YarnClusterFactoryBean. It defines a simple one-node cluster which is started automatically. The cluster working directories would then exist under the following paths:
```
target/YarnClusterTests/
target/YarnClusterTests-dfs/
```
Note: We rely on base classes from the Hadoop distribution, and the target base directory is hardcoded in Hadoop and is not configurable.
Spring Yarn components usually depend on a Hadoop configuration which is wired into them during the application context startup phase. This was explained in previous chapters, so we won't go through it again. However, there is now a catch-22: we need the configuration for the context, but it is not known until the mini cluster has done its startup magic and prepared the configuration with the correct values reflecting the runtime status of the cluster itself. The solution is to use another factory bean, named ConfigurationDelegatingFactoryBean, which simply delegates the configuration request to the running cluster.
<bean id="yarnConfiguredConfiguration" class="org.springframework.yarn.test.support.ConfigurationDelegatingFactoryBean"> <property name="cluster" ref="yarnCluster"/> </bean> <yarn:configuration id="yarnConfiguration" configuration-ref="yarnConfiguredConfiguration"/>
In the above example we created a bean named yarnConfiguredConfiguration using ConfigurationDelegatingFactoryBean, which simply delegates to the yarnCluster bean. The returned yarnConfiguredConfiguration bean is a Hadoop Configuration object, so it can be used as-is.

The latter part of the example shows how the Spring Yarn namespace is used to create another Configuration object using yarnConfiguredConfiguration as a reference. This scenario makes sense if there is a need to add additional configuration options to the running configuration used by other components. Usually it is suitable to use the cluster-prepared configuration as-is.
It is perfectly all right to create your tests from scratch, for example by creating the cluster manually and then getting the runtime configuration from there. This just needs some boilerplate code in your context configuration and unit test lifecycle. Spring Hadoop adds additional testing facilities to make all this even easier.
```java
@RunWith(SpringJUnit4ClassRunner.class)
public abstract class AbstractYarnClusterTests implements ApplicationContextAware {
    ...
}

@ContextConfiguration(loader=YarnDelegatingSmartContextLoader.class)
@MiniYarnCluster
public class ClusterBaseTestClassTests extends AbstractYarnClusterTests {
    ...
}
```
The above example shows the AbstractYarnClusterTests base class and how ClusterBaseTestClassTests is prepared to be aware of a mini cluster. YarnDelegatingSmartContextLoader offers the same base functionality as the default DelegatingSmartContextLoader in the spring-test package. Additionally, YarnDelegatingSmartContextLoader automatically handles running clusters and injects their Configuration into the application context.
```java
@MiniYarnCluster(configName="yarnConfiguration", clusterName="yarnCluster", nodes=1, id="default")
```
Generally, the @MiniYarnCluster annotation allows you to define the injected bean names for the mini cluster and its Configuration, as well as the number of nodes you'd like to have in the cluster.
Spring Hadoop Yarn testing depends on the general facilities of the Spring Test framework, meaning that everything cached during a test is reusable in other tests. One needs to understand that if a Hadoop mini cluster and its Configuration are injected into an Application Context, caching happens at the mercy of Spring Test: if a test Application Context is cached, the mini cluster instance is cached with it. While caching is always preferred, one needs to understand that if tests expect a vanilla environment to be present, the test context should be dirtied using the @DirtiesContext annotation.
Let's study a proper example of an existing Spring Yarn application and how it is tested during the build process. The Multi Context Example is a simple Spring Yarn based application which launches an Application Master and four Containers, and within those containers custom code is executed. In this case a log message is simply written.
In real life there are different ways to test whether a Hadoop Yarn application execution has been successful or not. The obvious method would be to check the application instance execution status reported by Hadoop Yarn. The status of the execution doesn't always tell the whole truth, though: for example, if the application is supposed to write something into HDFS as output, that output could be used to check the proper outcome of an execution.

This example doesn't write anything into HDFS, and covering that would be out of scope of this document anyway; it is fairly straightforward to check file content from HDFS. One other interesting method is simply to check the application log files, namely the Application Master and Container logs. Test methods can check for exceptions or expected log entries in the log files to determine whether a test is successful or not.
In this chapter we don't go through how the Multi Context Example is configured or what it actually does; for that, read the documentation about the examples. However, we do go through what needs to be done in order to test this example application using the testing support offered by Spring Hadoop.
In this example we gave instructions to copy library dependencies into HDFS, and those entries were then used within the resource localizer to tell Yarn to copy the files into the Container working directory. During unit testing, when the mini cluster is launched, there are no files present in HDFS because the cluster is initialized from scratch. Fortunately, Spring Hadoop allows you to copy files into HDFS during the localization process from the local file system where the Application Context is executed. The only thing we need is the actual library files, which can be assembled during the build process. The Spring Hadoop Examples build system relies on Gradle, so collecting dependencies is an easy task.
```xml
<yarn:localresources>
    <yarn:hdfs path="/app/multi-context/*.jar"/>
    <yarn:hdfs path="/lib/*.jar"/>
</yarn:localresources>
```
The above configuration exists in the application-context.xml and appmaster-context.xml files. It is a normal application configuration expecting static files to already be present in HDFS; this is usually done to minimize latency during application submission and execution.
```xml
<yarn:localresources>
    <yarn:copy src="file:build/dependency-libs/*" dest="/lib/"/>
    <yarn:copy src="file:build/libs/*" dest="/app/multi-context/"/>
    <yarn:hdfs path="/app/multi-context/*.jar"/>
    <yarn:hdfs path="/lib/*.jar"/>
</yarn:localresources>
```
The above example is from MultiContextTest-context.xml, which provides the runtime context configuration for talking with the mini cluster during the test phase.
When we do the context configuration for YarnClient during the testing phase, all we need to do is add copy elements which transfer the needed libraries into HDFS before the actual localization process fires up. Once those files are copied into the HDFS running in the mini cluster, we're effectively at the same point as when using a real Hadoop cluster with existing files.
Note: When running tests which depend on copying files into HDFS, it is mandatory to use a build system that is able to prepare these files for you. This can't be done from within an IDE, which has its own way of executing unit tests.
The complete example of running the test, checking the application execution status and finally checking the expected state of log files:
```java
@ContextConfiguration(loader=YarnDelegatingSmartContextLoader.class)
@MiniYarnCluster
public class MultiContextTests extends AbstractYarnClusterTests {

    @Test
    @Timed(millis=70000)
    public void testAppSubmission() throws Exception {
        YarnApplicationState state = submitApplicationAndWait();
        assertNotNull(state);
        assertTrue(state.equals(YarnApplicationState.FINISHED));

        File workDir = getYarnCluster().getYarnWorkDir();

        PathMatchingResourcePatternResolver resolver = new PathMatchingResourcePatternResolver();
        String locationPattern = "file:" + workDir.getAbsolutePath() + "/**/*.std*";
        Resource[] resources = resolver.getResources(locationPattern);

        // appmaster and 4 containers should
        // make it 10 log files
        assertThat(resources, notNullValue());
        assertThat(resources.length, is(10));

        for (Resource res : resources) {
            File file = res.getFile();
            if (file.getName().endsWith("stdout")) {
                // there has to be some content in stdout file
                assertThat(file.length(), greaterThan(0L));
                if (file.getName().equals("Container.stdout")) {
                    Scanner scanner = new Scanner(file);
                    String content = scanner.useDelimiter("\\A").next();
                    scanner.close();
                    // this is what container will log in stdout
                    assertThat(content, containsString("Hello from MultiContextBeanExample"));
                }
            } else if (file.getName().endsWith("stderr")) {
                // can't have anything in stderr files
                assertThat(file.length(), is(0L));
            }
        }
    }
}
```