Spring for Apache Hadoop Reference Manual

Authors

Costin Leau

1.0.0.RC2

Copies of this document may be made for your own use and for distribution to others, provided that you do not charge any fee for such copies and further provided that each copy contains this Copyright Notice, whether distributed in print or electronically.


Table of Contents

Preface
I. Introduction
1. Requirements
II. Spring and Hadoop
2. Hadoop Configuration, MapReduce, and Distributed Cache
2.1. Using the Spring for Apache Hadoop Namespace
2.2. Configuring Hadoop
2.3. Creating a Hadoop Job
2.3.1. Creating a Hadoop Streaming Job
2.4. Running a Hadoop Job
2.4.1. Using the Hadoop Job tasklet
2.5. Running a Hadoop Tool
2.5.1. Replacing Hadoop shell invocations with tool-runner
2.5.2. Using the Hadoop Tool tasklet
2.6. Running a Hadoop Jar
2.6.1. Using the Hadoop Jar tasklet
2.7. Configuring the Hadoop DistributedCache
2.8. Map Reduce Generic Options
3. Working with the Hadoop File System
3.1. Configuring the file-system
3.2. Scripting the Hadoop API
3.2.1. Using scripts
3.3. Scripting implicit variables
3.3.1. Running scripts
3.3.2. Using the Scripting tasklet
3.4. File System Shell (FsShell)
3.4.1. DistCp API
4. Working with HBase
4.1. Data Access Object (DAO) Support
5. Hive integration
5.1. Starting a Hive Server
5.2. Using the Hive Thrift Client
5.3. Using the Hive JDBC Client
5.4. Running a Hive script or query
5.4.1. Using the Hive tasklet
5.5. Interacting with the Hive API
6. Pig support
6.1. Running a Pig script
6.1.1. Using the Pig tasklet
6.2. Interacting with the Pig API
7. Cascading integration
7.1. Using the Cascading tasklet
7.2. Using Scalding
7.3. Spring-specific local Taps
8. Using the runner classes
9. Security Support
9.1. HDFS permissions
9.2. User impersonation (Kerberos)
III. Developing Spring for Apache Hadoop Applications
10. Guidance and Examples
10.1. Scheduling
10.2. Batch Job Listeners
IV. Spring for Apache Hadoop sample applications
11. Sample prerequisites
12. Wordcount sample using the Spring Framework
12.1. Introduction
13. Wordcount sample using Spring Batch
13.1. Introduction
13.2. Basic Spring for Apache Hadoop configuration
13.3. Build and run the sample application
13.4. Run the sample application as a standalone Java application
V. Other Resources
14. Useful Links
VI. Appendices
A. Using Spring for Apache Hadoop with Amazon EMR
A.1. Start up the cluster
A.2. Open an SSH Tunnel as a SOCKS proxy
A.3. Configuring Hadoop to use a SOCKS proxy
A.4. Accessing the file-system
A.5. Shutting down the cluster
A.6. Example configuration
B. Using Spring for Apache Hadoop with EC2/Apache Whirr
B.1. Setting up the Hadoop cluster on EC2 with Apache Whirr
C. Spring for Apache Hadoop Schema