As mentioned above, those interested in on-demand Hadoop clusters can use the Amazon Elastic MapReduce (Amazon EMR) service. An alternative, for those who want maximum control over the cluster, is Amazon Elastic Compute Cloud (EC2). EC2 is in fact the service on top of which Amazon EMR runs: resizable, configurable compute capacity in the cloud.
Important | |
---|---|
This chapter assumes the user is familiar with Amazon EC2 and the costs associated with it and its related services; we strongly recommend reading the official EC2 documentation. |
Just like Amazon EMR, using EC2 means the Hadoop cluster (or whatever service you run on it) runs in the cloud, and thus 'development' access to it is different than when running the service in a local network. There are various tips and tools that can handle the initial provisioning and configure access to the cluster. One such solution is Apache Whirr, a set of libraries for running cloud services. Though it provides a Java API as well, one can easily configure, start and stop services from the command line.
The Whirr documentation provides more detail on how to interact with the various cloud providers through Whirr. In the case of EC2, one needs Java 6 (which is required by Apache Hadoop), an account on EC2 and an SSH client (available out of the box on *nix platforms and freely downloadable, such as PuTTY, on Windows). Since Whirr does most of the heavy lifting, one only needs to tell Whirr which cloud provider and account to use, either by setting some environment properties or through the ~/.whirr/credentials file:
whirr.provider=aws-ec2
whirr.identity=your-aws-key
whirr.credential=your-aws-secret
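For those who prefer not to type the secret into a file by hand, the credentials file can be generated from the environment. The following is only a minimal sketch: `write_whirr_credentials` is a hypothetical helper (not part of Whirr), and the variable names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch: create a Whirr credentials file from environment variables.
# write_whirr_credentials is a hypothetical helper, not part of Whirr;
# AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are assumed variable names.
write_whirr_credentials() {
    dir="${1:-$HOME/.whirr}"   # target directory, defaults to ~/.whirr

    # Refuse to write a file with empty values.
    if [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
        echo "AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must be set" >&2
        return 1
    fi

    mkdir -p "$dir" || return 1
    # The subshell keeps the restrictive umask local, so a newly created
    # credentials file is readable by the current user only.
    (
        umask 077
        printf 'whirr.provider=aws-ec2\nwhirr.identity=%s\nwhirr.credential=%s\n' \
            "$AWS_ACCESS_KEY_ID" "$AWS_SECRET_ACCESS_KEY" > "$dir/credentials"
    )
}

# Usage, after exporting both variables:
#   write_whirr_credentials
```

The helper only defines a function, so nothing is written until it is called explicitly. Passing a directory as the first argument writes the file somewhere other than `~/.whirr`, which can be handy for a dry run.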
Now instruct Whirr to configure a Hadoop cluster on EC2 by adding the following properties to a configuration file (say hadoop.properties):
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,1 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
The configuration above assumes the SSH keys for your user have already been generated. Now start your Hadoop cluster:
bin/whirr launch-cluster --config hadoop.properties
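Should the key pair referenced in hadoop.properties not exist yet, it can be created before launching. The sketch below uses the standard `ssh-keygen` tool; `ensure_ssh_key` is a hypothetical helper, and the empty passphrase is an assumption made here so that the key can be used unattended.

```shell
#!/bin/sh
# Sketch: generate an RSA key pair only if no private key exists yet.
# ensure_ssh_key is a hypothetical helper, not part of Whirr.
ensure_ssh_key() {
    key="${1:-$HOME/.ssh/id_rsa}"   # private key path, defaults to ~/.ssh/id_rsa

    # Never overwrite an existing private key.
    if [ -f "$key" ]; then
        echo "Using existing key: $key"
        return 0
    fi

    mkdir -p "$(dirname "$key")" || return 1
    # -N '' sets an empty passphrase (an assumption, for unattended use);
    # ssh-keygen writes the public half next to it as "$key.pub".
    ssh-keygen -q -t rsa -N '' -f "$key"
}

# Usage:
#   ensure_ssh_key
```

Checking for the private key first keeps the helper safe to run repeatedly, since an existing key is left untouched.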
As with Amazon EMR, one cannot connect to the Hadoop cluster from outside. However, Whirr provides, out of the box, the ability to create an SSH tunnel that acts as a SOCKS proxy (on port 6666). When a cluster is created, Whirr generates a script to launch the proxy, which may be found in ~/.whirr/cluster-name. Run it as follows (in a new terminal window):
~/.whirr/myhadoopcluster/hadoop-proxy.sh
At this point, one can simply reuse the SOCKS proxy configuration from the Amazon EMR section to configure the Hadoop client.
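For a quick check from the command line, the same kind of settings can be passed to the Hadoop client as generic `-D` options. This is a sketch under assumptions: `socks_opts` is a hypothetical helper, the two keys are the standard Hadoop SOCKS properties (whether the Amazon EMR section uses exactly these keys is not confirmed here), and the client configuration is assumed to already name the cluster's namenode.

```shell
#!/bin/sh
# Sketch: print the generic options that route Hadoop RPC through a SOCKS proxy.
# socks_opts is a hypothetical helper; the default matches the proxy port
# mentioned in the text (6666 on the local machine).
socks_opts() {
    proxy="${1:-localhost:6666}"   # host:port of the SOCKS proxy
    printf '%s %s\n' \
        "-D hadoop.socks.server=$proxy" \
        "-D hadoop.rpc.socket.factory.class.default=org.apache.hadoop.net.SocksSocketFactory"
}

# Usage (left unquoted on purpose so the shell splits the options into words):
#   hadoop fs $(socks_opts) -ls /
```

Keeping the options in one place makes it easy to switch the proxy address without editing the client configuration files.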
To destroy the cluster, one can use the Amazon EC2 console or Whirr itself:
bin/whirr destroy-cluster --config hadoop.properties