12. Testing Support

Hadoop testing has always been a cumbersome process especially if you try to do testing phase during the normal project build process. Traditionally developers have had few options like running Hadoop cluster either as a local or pseudo-distributed mode and then utilise that to run MapReduce jobs. Hadoop project itself is using a lot of mini clusters during the tests which provides better tools to run your code in an isolated environment.

Spring Hadoop and especially its Yarn module faced similar testing problems. Spring Hadoop provides testing facilities order to make testing on Hadoop much easier especially if code relies on Spring Hadoop itself. These testing facilities are also used internally to test Spring Hadoop, although some test cases still rely on a running Hadoop instance on a host where project build is executed.

Two central concepts of testing using Spring Hadoop is, firstly fire up the mini cluster and secondly use the configuration prepared by the mini cluster to talk to the Hadoop components. Now let's go through the general testing facilities offered by Spring Hadoop.

Testing for MapReduce and Yarn in Spring Hadoop is separated into different packages mostly because these two components doesn't have hard dependencies with each others. You will see a lot of similarities when creating tests for MapReduce and Yarn.

12.1 Testing MapReduce

12.1.1 Mini Clusters for MapReduce

Mini clusters usually contain testing components from a Hadoop project itself. These are clusters for MapReduce Job handling and HDFS which are all run within a same process. In Spring Hadoop mini clusters are implementing interface HadoopCluster which provides methods for lifecycle and configuration. Spring Hadoop provides transitive maven dependencies against different Hadoop distributions and thus mini clusters are started using different implementations. This is mostly because we want to support HadoopV1 and HadoopV2 at a same time. All this is handled automatically at runtime so everything should be transparent to the end user.

public interface HadoopCluster {
  Configuration getConfiguration();
  void start() throws Exception;
  void stop();
  FileSystem getFileSystem() throws IOException;
}

Currently one implementation named StandaloneHadoopCluster exists which supports simple cluster type where a number of nodes can be defined and then all the nodes will contain utilities for MapReduce Job handling and HDFS.

There are few ways how this cluster can be started depending on a use case. It is possible to use StandaloneHadoopCluster directly or configure and start it through HadoopClusterFactoryBean. Existing HadoopClusterManager is used in unit tests to cache running clusters.

[Note]Note

It's advisable not to use HadoopClusterManager outside of tests because literally it is using static fields to cache cluster references. This is a same concept used in Spring Test order to cache application contexts between the unit tests within a jvm.

<bean id="hadoopCluster" class="org.springframework.data.hadoop.test.support.HadoopClusterFactoryBean">
  <property name="clusterId" value="HadoopClusterTests"/>
  <property name="autoStart" value="true"/>
  <property name="nodes" value="1"/>
</bean>

Example above defines a bean named hadoopCluster using a factory bean HadoopClusterFactoryBean. It defines a simple one node cluster which is started automatically.

12.1.2 Configuration

Spring Hadoop components usually depend on Hadoop configuration which is then wired into these components during the application context startup phase. This was explained in previous chapters so we don't go through it again. However this is now a catch-22 because we need the configuration for the context but it is not known until mini cluster has done its startup magic and prepared the configuration with correct values reflecting current runtime status of the cluster itself. Solution for this is to use other bean named ConfigurationDelegatingFactoryBean which will simply delegate the configuration request into the running cluster.

<bean id="hadoopConfiguredConfiguration" class="org.springframework.data.hadoop.test.support.ConfigurationDelegatingFactoryBean">
  <property name="cluster" ref="hadoopCluster"/>
</bean>

<hdp:configuration id="hadoopConfiguration" configuration-ref="hadoopConfiguredConfiguration"/>

In the above example we created a bean named hadoopConfiguredConfiguration using ConfigurationDelegatingFactoryBean which simple delegates to hadoopCluster bean. Returned bean hadoopConfiguredConfiguration is type of Hadoop's Configuration object so it could be used as it is.

Latter part of the example show how Spring Hadoop namespace is used to create another Configuration object which is using hadoopConfiguredConfiguration as a reference. This scenario would make sense if there is a need to add additional configuration options into running configuration used by other components. Usually it is suiteable to use cluster prepared configuration as it is.

12.1.3 Simplified Testing

It is perfecly all right to create your tests from scratch and for example create the cluster manually and then get the runtime configuration from there. This just needs some boilerplate code in your context configuration and unit test lifecycle.

Spring Hadoop adds additional facilities for the testing to make all this even easier.

@RunWith(SpringJUnit4ClassRunner.class)
public abstract class AbstractHadoopClusterTests implements ApplicationContextAware {
  ...
}

@ContextConfiguration(loader=HadoopDelegatingSmartContextLoader.class)
@MiniHadoopCluster
public class ClusterBaseTestClassTests extends AbstractHadoopClusterTests {
  ...
}

Above example shows the AbstractHadoopClusterTests and how ClusterBaseTestClassTests is prepared to be aware of a mini cluster. HadoopDelegatingSmartContextLoader offers same base functionality as the default DelegatingSmartContextLoader in a spring-test package. One additional thing what HadoopDelegatingSmartContextLoader does is to automatically handle running clusters and inject Configuration into the application context.

@MiniHadoopCluster(configName="hadoopConfiguration", clusterName="hadoopCluster", nodes=1, id="default")

Generally @MiniHadoopCluster annotation allows you to define injected bean name for mini cluster, its Configurations and a number of nodes you like to have in a cluster.

Spring Hadoop testing is dependant of general facilities of Spring Test framework meaning that everything what is cached during the test are reuseable withing other tests. One need to understand that if Hadoop mini cluster and its Configuration is injected into an Application Context, caching happens on a mercy of a Spring Testing meaning if a test Application Context is cached also mini cluster instance is cached. While caching is always prefered, one needs to understant that if tests are expecting vanilla environment to be present, test context should be dirtied using @DirtiesContext annotation.

12.1.4 Wordcount Example

Let's study a proper example of existing MapReduce Job which is executed and tested using Spring Hadoop. This example is the Hadoop's classic wordcount. We don't go through all the details of this example because we want to concentrate on testing specific code and configuration.

<context:property-placeholder location="hadoop.properties" />

<hdp:job id="wordcountJob"
  input-path="${wordcount.input.path}"
  output-path="${wordcount.output.path}"
  libs="file:build/libs/mapreduce-examples-wordcount-*.jar"
  mapper="org.springframework.data.hadoop.examples.TokenizerMapper"
  reducer="org.springframework.data.hadoop.examples.IntSumReducer" />

<hdp:script id="setupScript" location="copy-files.groovy">
  <hdp:property name="localSourceFile" value="data/nietzsche-chapter-1.txt" />
  <hdp:property name="inputDir" value="${wordcount.input.path}" />
  <hdp:property name="outputDir" value="${wordcount.output.path}" />
</hdp:script>

<hdp:job-runner id="runner"
  run-at-startup="false"
  kill-job-at-shutdown="false"
  wait-for-completion="false"
  pre-action="setupScript"
  job-ref="wordcountJob" />

In above configuration example we can see few differences with the actual runtime configuration. Firstly you can see that we didn't specify any kind of configuration for hadoop. This is because it's is injected automatically by testing framework. Secondly because we want to explicitely wait the job to be run and finished, kill-job-at-shutdown and wait-for-completion are set to false.

@ContextConfiguration(loader=HadoopDelegatingSmartContextLoader.class)
@MiniHadoopCluster
public class WordcountTests extends AbstractMapReduceTests {
  @Test
  public void testWordcountJob() throws Exception {
    // run blocks and throws exception if job failed
    JobRunner runner = getApplicationContext().getBean("runner", JobRunner.class);
    Job wordcountJob = getApplicationContext().getBean("wordcountJob", Job.class);

    runner.call();

    JobStatus finishedStatus = waitFinishedStatus(wordcountJob, 60, TimeUnit.SECONDS);
    assertThat(finishedStatus, notNullValue());

    // get output files from a job
    Path[] outputFiles = getOutputFilePaths("/user/gutenberg/output/word/");
    assertEquals(1, outputFiles.length);
    assertThat(getFileSystem().getFileStatus(outputFiles[0]).getLen(), greaterThan(0l));

    // read through the file and check that line with
    // "themselves	6" was found
    boolean found = false;
    InputStream in = getFileSystem().open(outputFiles[0]);
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    String line = null;
    while ((line = reader.readLine()) != null) {
      if (line.startsWith("themselves")) {
        assertThat(line, is("themselves\t6"));
        found = true;
      }
    }
    reader.close();
    assertThat("Keyword 'themselves' not found", found);
  }
}

In above unit test class we simply run the job defined in xml, explicitely wait it to finish and then check the output content from HDFS by searching expected strings.

12.2 Testing Yarn

12.2.1 Mini Clusters for Yarn

Mini cluster usually contain testing components from a Hadoop project itself. These are MiniYARNCluster for Resource Manager and MiniDFSCluster for Datanode and Namenode which are all run within a same process. In Spring Hadoop mini clusters are implementing interface YarnCluster which provides methods for lifecycle and configuration.

public interface YarnCluster {
  Configuration getConfiguration();
  void start() throws Exception;
  void stop();
  File getYarnWorkDir();
}

Currently one implementation named StandaloneYarnCluster exists which supports simple cluster type where a number of nodes can be defined and then all the nodes will have Yarn Node Manager and Hdfs Datanode, additionally a Yarn Resource Manager and Hdfs Namenode components are started.

There are few ways how this cluster can be started depending on a use case. It is possible to use StandaloneYarnCluster directly or configure and start it through YarnClusterFactoryBean. Existing YarnClusterManager is used in unit tests to cache running clusters.

[Note]Note

It's advisable not to use YarnClusterManager outside of tests because literally it is using static fields to cache cluster references. This is a same concept used in Spring Test order to cache application contexts between the unit tests within a jvm.

<bean id="yarnCluster" class="org.springframework.yarn.test.support.YarnClusterFactoryBean">
  <property name="clusterId" value="YarnClusterTests"/>
  <property name="autoStart" value="true"/>
  <property name="nodes" value="1"/>
</bean>

Example above defines a bean named yarnCluster using a factory bean YarnClusterFactoryBean. It defines a simple one node cluster which is started automatically. Cluster working directories would then exist under below paths:

target/YarnClusterTests/
target/YarnClusterTests-dfs/
[Note]Note

We rely on base classes from a Hadoop distribution and target base directory is hardcoded in Hadoop and is not configurable.

12.2.2 Configuration

Spring Yarn components usually depend on Hadoop configuration which is then wired into these components during the application context startup phase. This was explained in previous chapters so we don't go through it again. However this is now a catch-22 because we need the configuration for the context but it is not known until mini cluster has done its startup magic and prepared the configuration with correct values reflecting current runtime status of the cluster itself. Solution for this is to use other factory bean class named ConfigurationDelegatingFactoryBean which will simple delegate the configuration request into the running cluster.

<bean id="yarnConfiguredConfiguration" class="org.springframework.yarn.test.support.ConfigurationDelegatingFactoryBean">
  <property name="cluster" ref="yarnCluster"/>
</bean>

<yarn:configuration id="yarnConfiguration" configuration-ref="yarnConfiguredConfiguration"/>

In the above example we created a bean named yarnConfiguredConfiguration using ConfigurationDelegatingFactoryBean which simple delegates to yarnCluster bean. Returned bean yarnConfiguredConfiguration is type of Hadoop's Configuration object so it could be used as it is.

Latter part of the example show how Spring Yarn namespace is used to create another Configuration object which is using yarnConfiguredConfiguration as a reference. This scenario would make sense if there is a need to add additional configuration options into running configuration used by other components. Usually it is suiteable to use cluster prepared configuration as it is.

12.2.3 Simplified Testing

It is perfecly all right to create your tests from scratch and for example create the cluster manually and then get the runtime configuration from there. This just needs some boilerplate code in your context configuration and unit test lifecycle.

Spring Hadoop adds additional facilities for the testing to make all this even easier.

@RunWith(SpringJUnit4ClassRunner.class)
public abstract class AbstractYarnClusterTests implements ApplicationContextAware {
  ...
}

@ContextConfiguration(loader=YarnDelegatingSmartContextLoader.class)
@MiniYarnCluster
public class ClusterBaseTestClassTests extends AbstractYarnClusterTests {
  ...
}

Above example shows the AbstractYarnClusterTests and how ClusterBaseTestClassTests is prepared to be aware of a mini cluster. YarnDelegatingSmartContextLoader offers same base functionality as the default DelegatingSmartContextLoader in a spring-test package. One additional thing what YarnDelegatingSmartContextLoader does is to automatically handle running clusters and inject Configuration into the application context.

@MiniYarnCluster(configName="yarnConfiguration", clusterName="yarnCluster", nodes=1, id="default")

Generally @MiniYarnCluster annotation allows you to define injected bean names for mini cluster, its Configurations and a number of nodes you like to have in a cluster.

Spring Hadoop Yarn testing is dependant of general facilities of Spring Test framework meaning that everything what is cached during the test are reuseable withing other tests. One need to understand that if Hadoop mini cluster and its Configuration is injected into an Application Context, caching happens on a mercy of a Spring Testing meaning if a test Application Context is cached also mini cluster instance is cached. While caching is always prefered, one needs to understant that if tests are expecting vanilla environment to be present, test context should be dirtied using @DirtiesContext annotation.

12.2.4 Multi Context Example

Let's study a proper example of existing Spring Yarn application and how this is tested during the build process. Multi Context Example is a simple Spring Yarn based application which simply launches Application Master and four Containers and withing those containers a custom code is executed. In this case simply a log message is written.

In real life there are different ways to test whether Hadoop Yarn application execution has been succesful or not. The obvious method would be to check the application instance execution status reported by Hadoop Yarn. Status of the execution doesn't always tell the whole truth so i.e. if application is about to write something into HDFS as an output that could be used to check the proper outcome of an execution.

This example doesn't write anything into HDFS and anyway it would be out of scope of this document for obvious reason. It is fairly straightforward to check file content from HDFS. One other interesting method is simply to check to application log files that being the Application Master and Container logs. Test methods can check exceptions or expected log entries from a log files to determine whether test is succesful or not.

In this chapter we don't go through how Multi Context Example is configured and what it actually does, for that read the documentation about the examples. However we go through what needs to be done order to test this example application using testing support offered by Spring Hadoop.

In this example we gave instructions to copy library dependencies into Hdfs and then those entries were used within resouce localizer to tell Yarn to copy those files into Container working directory. During the unit testing when mini cluster is launched there are no files present in Hdfs because cluster is initialized from scratch. Furtunalety Spring Hadoop allows you to copy files into Hdfs during the localization process from a local file system where Application Context is executed. Only thing we need is the actual library files which can be assembled during the build process. Spring Hadoop Examples build system rely on Gradle so collecting dependencies is an easy task.

<yarn:localresources>
  <yarn:hdfs path="/app/multi-context/*.jar"/>
  <yarn:hdfs path="/lib/*.jar"/>
</yarn:localresources>

Above configuration exists in application-context.xml and appmaster-context.xml files. This is a normal application configuration expecting static files already be present in Hdfs. This is usually done to minimize latency during the application submission and execution.

<yarn:localresources>
  <yarn:copy src="file:build/dependency-libs/*" dest="/lib/"/>
  <yarn:copy src="file:build/libs/*" dest="/app/multi-context/"/>
  <yarn:hdfs path="/app/multi-context/*.jar"/>
  <yarn:hdfs path="/lib/*.jar"/>
</yarn:localresources>

Above example is from MultiContextTest-context.xml which provides the runtime context configuration talking with mini cluster during the test phase.

When we do context configuration for YarnClient during the testing phase all we need to do is to add copy elements which will transfer needed libraries into Hdfs before the actual localization process will fire up. When those files are copied into Hdfs running in a mini cluster we're basically in a same point if using a real Hadoop cluster with existing files.

[Note]Note

Running tests which depends on copying files into Hdfs it is mandatory to use build system which is able to prepare these files for you. You can't do this within IDE's which have its own ways to execute unit tests.

The complete example of running the test, checking the application execution status and finally checking the expected state of log files:

@ContextConfiguration(loader=YarnDelegatingSmartContextLoader.class)
@MiniYarnCluster
public class MultiContextTests extends AbstractYarnClusterTests {
  @Test
  @Timed(millis=70000)
  public void testAppSubmission() throws Exception {
    YarnApplicationState state = submitApplicationAndWait();
    assertNotNull(state);
    assertTrue(state.equals(YarnApplicationState.FINISHED));
  	
    File workDir = getYarnCluster().getYarnWorkDir();
  		
    PathMatchingResourcePatternResolver resolver = new PathMatchingResourcePatternResolver();
    String locationPattern = "file:" + workDir.getAbsolutePath() + "/**/*.std*";
    Resource[] resources = resolver.getResources(locationPattern);
  		
    // appmaster and 4 containers should
    // make it 10 log files
    assertThat(resources, notNullValue());
    assertThat(resources.length, is(10));
  		
    for (Resource res : resources) {
      File file = res.getFile();		
      if (file.getName().endsWith("stdout")) {
        // there has to be some content in stdout file
        assertThat(file.length(), greaterThan(0l));
        if (file.getName().equals("Container.stdout")) {
          Scanner scanner = new Scanner(file);
          String content = scanner.useDelimiter("\\A").next();
          scanner.close();
          // this is what container will log in stdout
          assertThat(content, containsString("Hello from MultiContextBeanExample"));
        }
      } else if (file.getName().endsWith("stderr")) {
        // can't have anything in stderr files
        assertThat(file.length(), is(0l));
      }
    }		
  }
}