Appendix A. Using Spring for Apache Hadoop with Amazon EMR

A popular option for creating on-demand Hadoop cluster is Amazon Elastic Map Reduce or Amazon EMR service. The user can through the command-line, API or a web UI configure, start, stop and manage a Hadoop cluster in the cloud without having to worry about the actual set-up or hardware resources used by the cluster. However, as the setup is different then a locally available cluster, so does the interaction between the application that want to use it and the target cluster. This section provides information on how to setup Amazon EMR with Spring for Apache Hadoop so the changes between a using a local, pseudo-distributed or owned cluster and EMR are minimal.

[Important]Important
This chapter assumes the user is familiar with Amazon EMR and the cost associated with it and its related services - we strongly recommend getting familiar with the official EMR documentation.

One of the big differences when using Amazon EMR versus a local cluster is the lack of access of the file system server and the job tracker. This means submitting jobs or reading and writing to the file-system isn't available out of the box - which is understandable for security reasons. If the cluster would be open, if could be easily abused while charging its rightful owner. However, it is fairly straight-forward to get access to both the file system and the job tracker so the deployment flow does not have to change.

Amazon EMR allows clusters to be created through the management console, through the API or the command-line. This documentation will focus on the command-line but the setup is not limited to it - feel free to adjust it according to your needs or preference. Make sure to properly setup the credentials so that the S3 file-system can be properly accessed.

A.1 Start up the cluster

A nice feature of Amazon EMR is starting a cluster for an indefinite period. That is rather then submitting a job and creating the cluster until it finished, one can create a cluster (along side a job) but request to be kept alive even if there is no work for it. This is easily done through the --create --alive parameters:

./elastic-mapreduce --create --alive
The output will be similar to this:
Created job flowJobFlowID

One can verify the results in the console through the list command or through the web management console. Depending on the cluster setup and the user account, the Hadoop cluster initialization should be complete anywhere between 1 to 5 minutes. The cluster is ready once its state changes from STARTING/PROVISIONING to WAITING.

[Note]Note
By default, each newly created cluster has a new public IP that is not typically reused. To simplify the setup, one can use Amazon Elastic IP, that is a static, predefined IP, so that she knows before-hand the cluster address. Refer to this section inside the EMR documentation for more information.

A.2 Accessing the Job Tracker

Once started, the cluster job tracker is available inside the newly EMR cluster, being bound to an internal IP. The easiest way to get access to it from outside is to open a SSH tunnel to it. The remote job tracker is available on the default port 9001. One can bind it to the same local port 9001 or a different one such as 20001. The SSH tunnel provides a secure connection between your machine on the cluster preventing any snooping or man-in-the-middle attacks. Further more it is quite easy to automate and be executed along side the cluster creation, programmatically or through some script. The Amazon EMR docs have dedicated sections on SSH Setup and Configuration and on opening a SSH Tunnel to the master node so please refer to them. One advantage of the SSH tunnel is that it encapsulates the connection setup to the remote cluster - no matter the cluster IP (whether static or not), the local port is entirely up to the user; one can chose to bind the cluster always to the same port - whether that is connected to a local cluster or a remote one, created on-demand is transparent to the using application.

A.3 Accessing the file-system

Amazon EMR offers Simple Storage Service, also known as S3 service, as means for durable read-write storage for EMR. While the cluster is active, one can write additional data to HDFS but unless S3 is used, the data will be lost once the cluster shuts down. Note that when using an S3 location for the first time, the proper access permissions needs to be setup. Accessing S3 is easier then the job tracker - in fact the Hadoop distribution provides not one but two file-system implementations for S3:

Table A.1. Hadoop S3 File Systems

NameURI PrefixAccess MethodDescription
S3 Native FSs3n://S3 NativeNative access to S3. The recommended file-system as the data is read/written in its native format and can be used not just by Hadoop but also other systems without any translation. The down-side is that it does not support large files (5GB) out of the box (though there is a work-around through the multipart upload feature).
S3 Block FSs3://Block BasedThe files are stored as blocks (similar to the underlying structure in HDFS). This is somewhat more efficient in terms of renames and file sizes but requires a dedicated bucket and is not inter-operable with other S3 tools.


To access the data in S3 one can either use an HDFS file-system on top of it, which requires no extra setup, or copy the data from S3 to the HDFS cluster using manual tools, distcp with S3, its dedicated version s3distcp, Hadoop DistributedCache (which SHDP supports as well) or third-party tools such as s3cmd.

For newbies and development we recommend accessing the S3 directly through the File-System abstraction as in most cases, its performance is close to that of the data inside the native HDFS. When dealing with data that is read multiple times, copying the data from S3 locally inside the cluster might improve performance but we advice running some performance tests first.

A.4 Shutting down the cluster

Once the cluster is no longer needed for a longer period of time, one can shut it down fairly straight forward:

./elastic-mapreduce --terminate JobFlowID

Note that the EMR cluster is billed by the hour and since the time is rounded upwards, starting and shutting down the cluster repeateadly might end up being more expensive then just keeping it alive. Consult the documentation for more information.

A.5 Example configuration

To put it all together, to use Amazon EMR one can use the following work-flow with SHDP:

  • Start an alive cluster and create a SSH tunnel to the Job Tracker

    Start the cluster for an indefinite period. Once the server is up, create an SSH tunnel to the remote job tracker node. The remote port is 9001 and let us assume the local tunnelled port is 20001. This step does not have to be repeated unless the cluster is terminated - one can (and should) submit multiple jobs to it.
  • Configure SHDP

  • Once the cluster is up and the SSH tunnel is in place, point SHDP to the new configuration. The example below shows how the configuration can look like:

    hadoop-context.xml

    <beans xmlns="http://www.springframework.org/schema/beans"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:context="http://www.springframework.org/schema/context"
        xmlns:hdp="http://www.springframework.org/schema/hadoop"
    	xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
    		http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd
    		http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">
    
    <!-- property placeholder backed by hadoop.properties -->		
    <context:property-placeholder location="hadoop.properties"/>
    
    <!-- Hadoop FileSystem using a placeholder and s3.properties -->
    <hdp:configuration register-url-handler="false" properties-location="s3.properties">
    	fs.default.name=${hd.fs}
    </hdp:configuration>

    hadoop.properties

    # Amazon EMR
    # S3 bucked backing the HDFS S3 fs
    hd.fs=s3n://work-emr/tmp/
    # job tracker pointing to the local SSH tunnelled port
    mapred.job.tracker=localhost:20001

    s3.properties

    # S3 credentials
    # for s3:// uri
    fs.s3.awsAccessKeyId=XXXXXXXXXXXXXXXXXXXX
    fs.s3.awsSecretAccessKey=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    
    # for s3n:// uri
    fs.s3n.awsAccessKeyId=XXXXXXXXXXXXXXXXXXXX
    fs.s3n.awsSecretAccessKey=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    
    Spring Hadoop is now ready to talk to your Amazon EMR cluster. Try it out!
  • Shutdown the tunnel and the cluster

    Once the jobs submitted are completed, unless new jobs are shortly scheduled, one can shutdown the cluster. Just like the first step, this is optional. Again, make sure you understand the billing process first.