Spring for Apache Hadoop - Reference Documentation

Authors

Costin Leau, Thomas Risberg, Janne Valkealahti

2.5.0.RELEASE

Copies of this document may be made for your own use and for distribution to others, provided that you do not charge any fee for such copies and further provided that each copy contains this Copyright Notice, whether distributed in print or electronically.


Table of Contents

Preface
I. Introduction
1. Requirements
2. Additional Resources
II. Spring and Hadoop
3. Hadoop Configuration
3.1. Using the Spring for Apache Hadoop Namespace
3.2. Using the Spring for Apache Hadoop JavaConfig
3.3. Configuring Hadoop
3.4. Boot Support
3.4.1. spring.hadoop configuration properties
3.4.2. spring.hadoop.fsshell configuration properties
4. MapReduce and Distributed Cache
4.1. Creating a Hadoop Job
4.1.1. Creating a Hadoop Streaming Job
4.2. Running a Hadoop Job
4.2.1. Using the Hadoop Job tasklet
4.3. Running a Hadoop Tool
4.3.1. Replacing Hadoop shell invocations with tool-runner
4.3.2. Using the Hadoop Tool tasklet
4.4. Running a Hadoop Jar
4.4.1. Using the Hadoop Jar tasklet
4.5. Configuring the Hadoop DistributedCache
4.6. MapReduce Generic Options
5. Working with the Hadoop File System
5.1. Configuring the file-system
5.2. Using HDFS Resource Loader
5.3. Scripting the Hadoop API
5.3.1. Using scripts
5.4. Scripting implicit variables
5.4.1. Running scripts
5.4.2. Using the Scripting tasklet
5.5. File System Shell (FsShell)
5.5.1. DistCp API
6. Writing and reading data using the Hadoop File System
6.1. Store Abstraction
6.1.1. Writing Data
File Naming
File Rollover
Partitioning
Creating a Custom Partition Strategy
Writer Implementations
Append and Sync Data
6.1.2. Reading Data
Input Splits
Reader Implementations
6.1.3. Using Codecs
6.2. Persisting POJO datasets using Kite SDK
6.2.1. Data Formats
Using Avro
Using Parquet
6.2.2. Configuring the dataset support
6.2.3. Writing datasets
6.2.4. Reading datasets
6.2.5. Partitioning datasets
6.3. Using the Spring for Apache Hadoop JavaConfig
7. Working with HBase
7.1. Data Access Object (DAO) Support
8. Hive integration
8.1. Starting a Hive Server
8.2. Using the Hive JDBC Client
8.3. Running a Hive script or query
8.3.1. Using the Hive tasklet
8.4. Interacting with the Hive API
9. Pig support
9.1. Running a Pig script
9.1.1. Using the Pig tasklet
9.2. Interacting with the Pig API
10. Apache Spark integration
10.1. Simple example for running a Spark YARN Tasklet
11. Using the runner classes
12. Security Support
12.1. HDFS permissions
12.2. User impersonation (Kerberos)
12.3. Boot Support
12.3.1. spring.hadoop.security configuration properties
13. Yarn Support
13.1. Using the Spring for Apache Yarn Namespace
13.2. Using the Spring for Apache Yarn JavaConfig
13.3. Configuring Yarn
13.4. Local Resources
13.5. Container Environment
13.6. Application Client
13.7. Application Master
13.8. Application Container
13.9. Application Master Services
13.9.1. Basic Concepts
13.9.2. Using JSON
13.9.3. Converters
13.10. Application Master Service
13.11. Application Master Service Client
13.12. Using Spring Batch
13.12.1. Batch Jobs
13.12.2. Partitioning
Configuring Master
Configuring Container
13.13. Using Spring Boot Application Model
13.13.1. Auto Configuration
13.13.2. Application Files
13.13.3. Application Classpath
Simple Executable Jar
Simple Zip Archive
13.13.4. Container Runners
Custom Runner
13.13.5. Resource Localizing
13.13.6. Container as POJO
13.13.7. Configuration Properties
spring.yarn configuration properties
spring.yarn.appmaster configuration properties
spring.yarn.appmaster.launchcontext configuration properties
spring.yarn.appmaster.localizer configuration properties
spring.yarn.appmaster.resource configuration properties
spring.yarn.appmaster.containercluster configuration properties
spring.yarn.appmaster.containercluster.clusters.<name> configuration properties
spring.yarn.appmaster.containercluster.clusters.<name>.projection configuration properties
spring.yarn.endpoints.containercluster configuration properties
spring.yarn.endpoints.containerregister configuration properties
spring.yarn.client configuration properties
spring.yarn.client.launchcontext configuration properties
spring.yarn.client.localizer configuration properties
spring.yarn.client.resource configuration properties
spring.yarn.container configuration properties
spring.yarn.batch configuration properties
spring.yarn.batch.jobs configuration properties
13.13.8. Container Groups
Grid Projection
Group Configuration
Container Restart
REST API
13.13.9. Controlling Applications
Generic Usage
Using Configuration Properties
Using YarnPushApplication
Using YarnSubmitApplication
Using YarnInfoApplication
Using YarnKillApplication
Using YarnShutdownApplication
Using YarnContainerClusterApplication
13.13.10. CLI Integration
Built-in Commands
Implementing Command
Using Shell
14. Testing Support
14.1. Testing MapReduce
14.1.1. Mini Clusters for MapReduce
14.1.2. Configuration
14.1.3. Simplified Testing
14.1.4. Wordcount Example
14.2. Testing Yarn
14.2.1. Mini Clusters for Yarn
14.2.2. Configuration
14.2.3. Simplified Testing
14.2.4. Multi Context Example
14.3. Testing Boot Based Applications
III. Developing Spring for Apache Hadoop Applications
15. Guidance and Examples
15.1. Scheduling
15.2. Batch Job Listeners
16. Other Samples
IV. Other Resources
17. Useful Links