A popular option for creating on-demand Hadoop clusters is the Amazon Elastic MapReduce (Amazon EMR) service. Through the command-line, the API or a web UI, the user can configure, start, stop and manage a Hadoop cluster in the cloud without having to worry about the actual setup or the hardware resources used by the cluster. However, since the setup differs from a locally available cluster, so does the interaction between the application that wants to use it and the target cluster. This section provides information on how to set up Amazon EMR with Spring for Apache Hadoop so that the changes between using a local, pseudo-distributed or owned cluster and EMR are minimal.
Important | |
---|---|
This chapter assumes the user is familiar with Amazon EMR and the cost associated with it and its related services - we strongly recommend getting familiar with the official EMR documentation. |
One of the big differences when using Amazon EMR versus a local cluster is the lack of access to the file system server and the job tracker. This means submitting jobs or reading and writing to the file-system isn't available out of the box - which is understandable for security reasons: if the cluster were open, it could easily be abused while charging its rightful owner. However, it is fairly straightforward to get access to both the file system and the job tracker so the deployment flow does not have to change.
Amazon EMR allows clusters to be created through the management console, through the API or through the command-line. This documentation focuses on the command-line but the setup is not limited to it - feel free to adjust it according to your needs or preferences. Make sure to properly set up the credentials so that the S3 file-system can be accessed.
A nice feature of Amazon EMR is the ability to start a cluster for an indefinite period. That is, rather than submitting a job and creating a cluster that lasts only until the job finishes, one can create a cluster (alongside a job) and request that it be kept alive even if there is no work for it. This is easily done through the --create --alive parameters:

./elastic-mapreduce --create --alive

The output will be similar to this:

Created job flow JobFlowID
One can verify the results in the console through the list command or through the web management console. Depending on the cluster setup and the user account, the Hadoop cluster initialization should be complete anywhere between 1 and 5 minutes. The cluster is ready once its state changes from STARTING/PROVISIONING to WAITING.
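For example, with the same command-line client used above, the state can be checked through the list command (the exact flags and output format depend on the client version, so treat this as a sketch):

```bash
# list the job flows and their current state (STARTING, PROVISIONING, WAITING, ...)
./elastic-mapreduce --list
```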
Note | |
---|---|
By default, each newly created cluster has a new public IP that is not typically reused. To simplify the setup, one can use an Amazon Elastic IP, that is, a static, predefined IP, so that the cluster address is known beforehand. Refer to the dedicated section inside the EMR documentation for more information. |
Once started, the cluster job tracker is available inside the newly created EMR cluster, being bound to an internal IP. The easiest way to get access to it from outside is to open an SSH tunnel to it. The remote job tracker is available on the default port 9001. One can bind it to the same local port 9001 or a different one such as 20001.
The SSH tunnel provides a secure connection between your machine and the cluster, preventing any snooping or man-in-the-middle attacks. Furthermore, it is quite easy to automate and can be executed alongside the cluster creation, programmatically or through some script. The Amazon EMR docs have dedicated sections on SSH Setup and Configuration and on opening an SSH Tunnel to the master node, so please refer to them. One advantage of the SSH tunnel is that it encapsulates the connection setup to the remote cluster - no matter the cluster IP (whether static or not), the local port is entirely up to the user; one can choose to always bind the cluster to the same port, so whether the application is connected to a local cluster or to a remote one created on-demand remains transparent to it.
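As a minimal sketch, assuming the standard ssh client, the EC2 key pair used to start the cluster and the master node addresses reported by EMR (all placeholders below), the tunnel described above could be opened along these lines:

```bash
# forward local port 20001 to the job tracker (port 9001) on the EMR master node;
# since the job tracker is bound to the master's internal address, the tunnel targets
# that internal hostname rather than localhost.
# emr-keypair.pem, <master-internal-hostname> and <master-public-dns> are placeholders.
ssh -i ~/.ssh/emr-keypair.pem -N \
    -L 20001:<master-internal-hostname>:9001 hadoop@<master-public-dns>
```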
Amazon EMR offers the Simple Storage Service, also known as S3, as a means of durable read-write storage for EMR. While the cluster is active, one can write additional data to HDFS but, unless S3 is used, the data will be lost once the cluster shuts down. Note that when using an S3 location for the first time, the proper access permissions need to be set up. Accessing S3 is easier than accessing the job tracker - in fact the Hadoop distribution provides not one but two file-system implementations for S3:
Table A.1. Hadoop S3 File Systems
Name | URI Prefix | Access Method | Description |
---|---|---|---|
S3 Native FS | s3n:// | S3 Native | Native access to S3. The recommended file-system as the data is read/written in its native format and can be used not just by Hadoop but also by other systems without any translation. The downside is that it does not support large files (over 5GB) out of the box (though there is a work-around through the multipart upload feature). |
S3 Block FS | s3:// | Block Based | The files are stored as blocks (similar to the underlying structure in HDFS). This is somewhat more efficient in terms of renames and file sizes but requires a dedicated bucket and is not interoperable with other S3 tools. |
To access the data in S3 one can either use an HDFS file-system on top of it, which requires no extra setup, or copy the data from S3 to the HDFS cluster using manual tools, distcp with S3, its dedicated version s3distcp, Hadoop DistributedCache (which SHDP supports as well) or third-party tools such as s3cmd.
For newbies and development purposes, we recommend accessing S3 directly through the File-System abstraction as, in most cases, its performance is close to that of data residing in the native HDFS. When dealing with data that is read multiple times, copying the data from S3 locally inside the cluster might improve performance, but we advise running some performance tests first.
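For illustration only, assuming the S3 credentials are already configured (see the s3.properties file later in this section) and using the work-emr bucket referenced below as a placeholder for your own, copying data between S3 and the cluster HDFS with distcp could look along these lines:

```bash
# copy an input data set from S3 into the cluster HDFS
hadoop distcp s3n://work-emr/tmp/input /user/hadoop/input

# copy the results back to S3 before shutting the cluster down
hadoop distcp /user/hadoop/output s3n://work-emr/tmp/output
```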
Once the cluster is no longer needed for an extended period of time, shutting it down is fairly straightforward:
./elastic-mapreduce --terminate JobFlowID
Note that the EMR cluster is billed by the hour and since the time is rounded upwards, starting and shutting down the cluster repeatedly might end up being more expensive than just keeping it alive. Consult the documentation for more information.
To put it all together, to use Amazon EMR one can use the following workflow with SHDP:
Start an alive cluster and create an SSH tunnel to the Job Tracker
Start the cluster for an indefinite period. Once the server is up, create an SSH tunnel to the remote job tracker node. The remote port is 9001 and let us assume the locally tunnelled port is 20001. This step does not have to be repeated unless the cluster is terminated - one can (and should) submit multiple jobs to it.
Configure SHDP
hadoop-context.xml
```xml
<beans xmlns="http://www.springframework.org/schema/beans"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xmlns:context="http://www.springframework.org/schema/context"
   xmlns:hdp="http://www.springframework.org/schema/hadoop"
   xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
      http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd
      http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

   <!-- property placeholder backed by hadoop.properties -->
   <context:property-placeholder location="hadoop.properties"/>

   <!-- Hadoop FileSystem using a placeholder and s3.properties -->
   <hdp:configuration register-url-handler="false" properties-location="s3.properties">
      fs.default.name=${hd.fs}
   </hdp:configuration>
</beans>
```
hadoop.properties
```properties
# Amazon EMR
# S3 bucket backing the HDFS S3 fs
hd.fs=s3n://work-emr/tmp/
# job tracker pointing to the local SSH tunnelled port
mapred.job.tracker=localhost:20001
```
s3.properties
```properties
# S3 credentials
# for s3:// uri
fs.s3.awsAccessKeyId=XXXXXXXXXXXXXXXXXXXX
fs.s3.awsSecretAccessKey=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
# for s3n:// uri
fs.s3n.awsAccessKeyId=XXXXXXXXXXXXXXXXXXXX
fs.s3n.awsSecretAccessKey=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
```

Spring Hadoop is now ready to talk to your Amazon EMR cluster. Try it out!
Shut down the tunnel and the cluster
Once the submitted jobs are completed, unless new jobs are scheduled shortly, one can shut down the cluster. Just like the first step, this is optional. Again, make sure you understand the billing process first.