Spring for Apache Hadoop - Reference Documentation

Authors

Costin Leau, Thomas Risberg, Janne Valkealahti

2.3.0.RC2-hdp23

Copies of this document may be made for your own use and for distribution to others, provided that you do not charge any fee for such copies and further provided that each copy contains this Copyright Notice, whether distributed in print or electronically.


Table of Contents

Preface
I. Introduction
1. Requirements
2. Additional Resources
II. Spring and Hadoop
3. Hadoop Configuration
3.1. Using the Spring for Apache Hadoop Namespace
3.2. Using the Spring for Apache Hadoop JavaConfig
3.3. Configuring Hadoop
3.4. Boot Support
3.4.1. spring.hadoop configuration properties
3.4.2. spring.hadoop.fsshell configuration properties
4. MapReduce and Distributed Cache
4.1. Creating a Hadoop Job
4.1.1. Creating a Hadoop Streaming Job
4.2. Running a Hadoop Job
4.2.1. Using the Hadoop Job tasklet
4.3. Running a Hadoop Tool
4.3.1. Replacing Hadoop shell invocations with tool-runner
4.3.2. Using the Hadoop Tool tasklet
4.4. Running a Hadoop Jar
4.4.1. Using the Hadoop Jar tasklet
4.5. Configuring the Hadoop DistributedCache
4.6. MapReduce Generic Options
5. Working with the Hadoop File System
5.1. Configuring the file-system
5.2. Using HDFS Resource Loader
5.3. Scripting the Hadoop API
5.3.1. Using scripts
5.4. Scripting implicit variables
5.4.1. Running scripts
5.4.2. Using the Scripting tasklet
5.5. File System Shell (FsShell)
5.5.1. DistCp API
6. Writing and reading data using the Hadoop File System
6.1. Store Abstraction
6.1.1. Writing Data
File Naming
File Rollover
Partitioning
Creating a Custom Partition Strategy
Writer Implementations
Append and Sync Data
6.1.2. Reading Data
Input Splits
Reader Implementations
6.1.3. Using Codecs
6.2. Persisting POJO datasets using Kite SDK
6.2.1. Data Formats
Using Avro
Using Parquet
6.2.2. Configuring the dataset support
6.2.3. Writing datasets
6.2.4. Reading datasets
6.2.5. Partitioning datasets
6.3. Using the Spring for Apache Hadoop JavaConfig
7. Working with HBase
7.1. Data Access Object (DAO) Support
8. Hive integration
8.1. Starting a Hive Server
8.2. Using the Hive JDBC Client
8.3. Running a Hive script or query
8.3.1. Using the Hive tasklet
8.4. Interacting with the Hive API
9. Pig support
9.1. Running a Pig script
9.1.1. Using the Pig tasklet
9.2. Interacting with the Pig API
10. Using the runner classes
11. Security Support
11.1. HDFS permissions
11.2. User impersonation (Kerberos)
11.3. Boot Support
11.3.1. spring.hadoop.security configuration properties
12. Yarn Support
12.1. Using the Spring for Apache Yarn Namespace
12.2. Using the Spring for Apache Yarn JavaConfig
12.3. Configuring Yarn
12.4. Local Resources
12.5. Container Environment
12.6. Application Client
12.7. Application Master
12.8. Application Container
12.9. Application Master Services
12.9.1. Basic Concepts
12.9.2. Using JSON
12.9.3. Converters
12.10. Application Master Service
12.11. Application Master Service Client
12.12. Using Spring Batch
12.12.1. Batch Jobs
12.12.2. Partitioning
Configuring Master
Configuring Container
12.13. Using Spring Boot Application Model
12.13.1. Auto Configuration
12.13.2. Application Files
12.13.3. Application Classpath
Simple Executable Jar
Simple Zip Archive
12.13.4. Container Runners
Custom Runner
12.13.5. Resource Localizing
12.13.6. Container as POJO
12.13.7. Configuration Properties
spring.yarn configuration properties
spring.yarn.appmaster configuration properties
spring.yarn.appmaster.launchcontext configuration properties
spring.yarn.appmaster.localizer configuration properties
spring.yarn.appmaster.resource configuration properties
spring.yarn.appmaster.containercluster configuration properties
spring.yarn.appmaster.containercluster.clusters.<name> configuration properties
spring.yarn.appmaster.containercluster.clusters.<name>.projection configuration properties
spring.yarn.endpoints.containercluster configuration properties
spring.yarn.endpoints.containerregister configuration properties
spring.yarn.client configuration properties
spring.yarn.client.launchcontext configuration properties
spring.yarn.client.localizer configuration properties
spring.yarn.client.resource configuration properties
spring.yarn.container configuration properties
spring.yarn.batch configuration properties
spring.yarn.batch.jobs configuration properties
12.13.8. Container Groups
Grid Projection
Group Configuration
Container Restart
REST API
12.13.9. Controlling Applications
Generic Usage
Using Configuration Properties
Using YarnPushApplication
Using YarnSubmitApplication
Using YarnInfoApplication
Using YarnKillApplication
Using YarnShutdownApplication
Using YarnContainerClusterApplication
12.13.10. CLI Integration
Built-in Commands
Implementing Command
Using Shell
13. Testing Support
13.1. Testing MapReduce
13.1.1. Mini Clusters for MapReduce
13.1.2. Configuration
13.1.3. Simplified Testing
13.1.4. Wordcount Example
13.2. Testing Yarn
13.2.1. Mini Clusters for Yarn
13.2.2. Configuration
13.2.3. Simplified Testing
13.2.4. Multi Context Example
13.3. Testing Boot Based Applications
III. Developing Spring for Apache Hadoop Applications
14. Guidance and Examples
14.1. Scheduling
14.2. Batch Job Listeners
15. Other Samples
IV. Other Resources
16. Useful Links