Spring for Apache Hadoop - Reference Documentation

Authors

Costin Leau, Thomas Risberg, Janne Valkealahti

2.0.3.RELEASE-phd20

Copies of this document may be made for your own use and for distribution to others, provided that you do not charge any fee for such copies and further provided that each copy contains this Copyright Notice, whether distributed in print or electronically.


Table of Contents

Preface
I. Introduction
1. Requirements
2. Additional Resources
II. Spring and Hadoop
3. Hadoop Configuration, MapReduce, and Distributed Cache
3.1. Using the Spring for Apache Hadoop Namespace
3.2. Configuring Hadoop
3.3. Creating a Hadoop Job
3.3.1. Creating a Hadoop Streaming Job
3.4. Running a Hadoop Job
3.4.1. Using the Hadoop Job tasklet
3.5. Running a Hadoop Tool
3.5.1. Replacing Hadoop shell invocations with tool-runner
3.5.2. Using the Hadoop Tool tasklet
3.6. Running a Hadoop Jar
3.6.1. Using the Hadoop Jar tasklet
3.7. Configuring the Hadoop DistributedCache
3.8. Map Reduce Generic Options
4. Working with the Hadoop File System
4.1. Configuring the file-system
4.2. Using HDFS Resource Loader
4.3. Scripting the Hadoop API
4.3.1. Using scripts
4.4. Scripting implicit variables
4.4.1. Running scripts
4.4.2. Using the Scripting tasklet
4.5. File System Shell (FsShell)
4.5.1. DistCp API
5. Writing and reading data using the Hadoop File System
5.1. Store Abstraction
5.1.1. Writing Data
File Naming
File Rollover
Partitioning
Writer Implementations
5.1.2. Reading Data
Input Splits
Reader Implementations
5.1.3. Using Codecs
5.2. Persisting POJO datasets using Kite SDK
5.2.1. Data Formats
Using Avro
Using Parquet
5.2.2. Configuring the dataset support
5.2.3. Writing datasets
5.2.4. Reading datasets
5.2.5. Partitioning datasets
6. Working with HBase
6.1. Data Access Object (DAO) Support
7. Hive integration
7.1. Starting a Hive Server
7.2. Using the Hive Thrift Client
7.3. Using the Hive JDBC Client
7.4. Running a Hive script or query
7.4.1. Using the Hive tasklet
7.5. Interacting with the Hive API
8. Pig support
8.1. Running a Pig script
8.1.1. Using the Pig tasklet
8.2. Interacting with the Pig API
9. Using the runner classes
10. Security Support
10.1. HDFS permissions
10.2. User impersonation (Kerberos)
11. Yarn Support
11.1. Using the Spring for Apache Yarn Namespace
11.2. Using the Spring for Apache Yarn JavaConfig
11.3. Configuring Yarn
11.4. Local Resources
11.5. Container Environment
11.6. Application Client
11.7. Application Master
11.8. Application Container
11.9. Application Master Services
11.9.1. Basic Concepts
11.9.2. Using JSON
11.9.3. Converters
11.10. Application Master Service
11.11. Application Master Service Client
11.12. Using Spring Batch
11.12.1. Batch Jobs
11.12.2. Partitioning
Configuring Master
Configuring Container
11.13. Using Spring Boot Application Model
11.13.1. Auto Configuration
11.13.2. Application Files
11.13.3. Application Classpath
Simple Executable Jar
Simple Zip Archive
11.13.4. Container Runners
Custom Runner
11.13.5. Resource Localizing
11.13.6. Container as POJO
11.13.7. Configuration Properties
11.13.8. Controlling Applications
Generic Usage
Using Configuration Properties
Using YarnPushApplication
Using YarnSubmitApplication
Using YarnInfoApplication
Using YarnKillApplication
12. Testing Support
12.1. Testing MapReduce
12.1.1. Mini Clusters for MapReduce
12.1.2. Configuration
12.1.3. Simplified Testing
12.1.4. Wordcount Example
12.2. Testing Yarn
12.2.1. Mini Clusters for Yarn
12.2.2. Configuration
12.2.3. Simplified Testing
12.2.4. Multi Context Example
12.3. Testing Boot Based Applications
III. Developing Spring for Apache Hadoop Applications
13. Guidance and Examples
13.1. Scheduling
13.2. Batch Job Listeners
IV. Spring for Apache Hadoop sample applications
V. Other Resources
14. Useful Links
VI. Appendices
A. Using Spring for Apache Hadoop with Amazon EMR
A.1. Start up the cluster
A.2. Open an SSH Tunnel as a SOCKS proxy
A.3. Configuring Hadoop to use a SOCKS proxy
A.4. Accessing the file-system
A.5. Shutting down the cluster
A.6. Example configuration
B. Using Spring for Apache Hadoop with EC2/Apache Whirr
B.1. Setting up the Hadoop cluster on EC2 with Apache Whirr
C. Spring for Apache Hadoop Schema