Spring Cloud Data Flow for Apache YARN

Sabby Anandan, Marius Bogoevici, Eric Bottard, Mark Fisher, Ilayaperumal Gopinathan, Gunnar Hillert, Mark Pollack, Patrick Peralta, Glenn Renfro, Thomas Risberg, Dave Syer, David Turanski, Janne Valkealahti

1.0.0.M2

Copies of this document may be made for your own use and for distribution to others, provided that you do not charge any fee for such copies and further provided that each copy contains this Copyright Notice, whether distributed in print or electronically.

Part I. Preface

1. About the documentation

The Spring Cloud Data Flow for Apache Yarn reference guide is available as html, pdf and epub documents. The latest copy is available at docs.spring.io/spring-cloud-dataflow-server-yarn/docs/current-SNAPSHOT/reference/html/.

2. Getting help

Having trouble with Spring Cloud Data Flow, We’d like to help!

Try the How-to’s — they provide solutions to the most common questions.
Ask a question - we monitor stackoverflow.com for questions tagged with spring-cloud.
Report bugs with Spring Cloud Dataflow for Apache YARN at github.com/spring-cloud/spring-cloud-dataflow-server-yarn/issues.

	Note
	All of Spring Cloud Data Flow is open source, including the documentation! If you find problems with the docs; or if you just want to improve them, please get involved.

Part II. Introduction

3. Introducing Spring Cloud Data Flow for Apache YARN project

The Spring Cloud Data Flow for Apache YARN project allows you to deploy the Spring Cloud Dataflow using Apache YARN as the cluster runtime environment.

Part III. Spring Cloud Data Flow Overview

This section provides a brief overview of the Spring Cloud Data Flow reference documentation. Think of it as map for the rest of the document. You can read this reference guide in a linear fashion, or you can skip sections if something doesn’t interest you.

4. Introducing Spring Cloud Data Flow

A cloud native programming and operating model for composable data microservices on a structured platform. With Spring Cloud Data Flow, developers can create, orchestrate and refactor data pipelines through single programming model for common use cases such as data ingest, real-time analytics, and data import/export.

Spring Cloud Data Flow is the cloud native redesign of Spring XD – a project that aimed to simplify development of Big Data applications. The integration and batch modules from Spring XD are refactored into Spring Boot data microservices applications that are now autonomous deployment units – thus enabling them to take full advantage of platform capabilities "natively", and they can independently evolve in isolation.

Spring Cloud Data Flow defines best practices for distributed stream and batch microservice design patterns.

4.1 Features

Orchestrate applications across a variety of distributed runtime platforms including: Cloud Foundry, Apache YARN, Apache Mesos, and Kubernetes
Separate runtime dependencies backed by ‘spring profiles’
Consume stream and batch data-microservices as maven dependency
Develop using: DSL, Shell, REST-APIs, Admin-UI, and Flo
Take advantage of metrics, health checks and remote management of data-microservices
Scale stream and batch pipelines without interrupting data flows

5. Spring Cloud Data Flow Architecture

The architecture for Spring Cloud Data Flow is separated into a number of distinct components.

5.1 Components

The Core domain model includes the concept of a stream that is a composition of spring-cloud-stream apps in a linear pipeline from a source to a sink, optionally including processor apps in between. The domain also includes the concept of a task, which may be any process that does not run indefinitely, including Spring Batch jobs.

The App Registry maintains the set of available apps, and their mappings to a URI. For example, if relying on Maven coordinates, the URI would be of the format: maven://<groupId>:<artifactId>:<version>

The Data Flow Server Core provides the REST API and UI to be used in combination with an implementation of the Deployer SPI when creating a Data Flow Server for a given deployment environment.

The Shell connects to the Data Flow Server’s REST API and supports a DSL that simplifies the process of defining a stream and managing its lifecycle.

Several Data Flow Server implementations exist, covering a range of runtime environments:

Local (intended for development only)
Cloud Foundry
Apache Yarn
Apache Mesos
Kubernetes

As mentioned above, the Spring Cloud Data Flow Server implementations all rely upon corresponding implementations of the Spring Cloud Deployer SPI, which provides the abstraction layer for deploying the apps of a given stream or task. The following are links to the deployer SPI projects that correspond to the Data Flow Servers listed above:

Part IV. Spring Cloud Data Flow Runtime

Data flow runtime can be deployed and used with YARN in two different ways, firstly using it directly with a YARN cluster and secondly letting Apache Ambari to deploy it into its cluster as a service.

6. Deploying on YARN

The Admin server application is run as a standalone application. All modules used for streams and tasks will be deployed on the YARN cluster that is targeted by the Admin server. configured to be used.

6.1 Prerequisites

These requirements are not something yarn runtime needs but generally what dataflow core needs.

Redis - Needed for some persisting runtime data.
Rabbit - If dataflow modules using rabbit bindings are used.
Kafka - If dataflow modules using kafka bindings are used.
DB - we currently use embedded H2 database, though any supported DB can be configured.

6.2 Download and Extract Distribution

Download the Spring Cloud Data Flow YARN distribution ZIP file which includes the Admin and the Shell apps:

$ wget http://repo.spring.io/milestone/org/springframework/cloud/dist/spring-cloud-dataflow-server-yarn-dist/1.0.0.M2/spring-cloud-dataflow-server-yarn-dist-1.0.0.M2.zip

Unzip the distribution ZIP file and change to the directory containing the deployment files.

$ cd spring-cloud-dataflow-server-yarn-1.0.0.M2

6.3 Configure Settings

Generic runtime settings can changed in config/servers.yml. Make sure Hadoop and Redis are running. If either one is not running on localhost you need to configure them in config/servers.yml

6.4 Start Server

If this is the first time deploying make sure the user that runs the Server app has rights to create and write to /dataflow directory in hdfs. If there is an existing deployment on hdfs remove it using:

$ hdfs dfs -rm -R /dataflow

Start the Spring Cloud Data Flow Server app for YARN

$ ./bin/dataflow-server-yarn

6.5 Connect Shell

start spring-cloud-dataflow-shell

$ ./bin/dataflow-shell

6.6 Create Stream

Create a stream:

dataflow:>stream create --name foostream --definition "time|log" --deploy

List streams:

dataflow:>stream list
╔═══════════╤═════════════════╤════════╗
║Stream Name│Stream Definition│ Status ║
╠═══════════╪═════════════════╪════════╣
║foostream  │time|log         │deployed║
╚═══════════╧═════════════════╧════════╝

After some time, destroy the stream:

dataflow:>stream destroy --name foostream

The YARN application is pushed and started automatically during a stream deployment process. Once all streams are destroyed the YARN application will exit.

6.7 Create Task

Create and launch task:

dataflow:>task create --name footask --definition "timestamp"
Created new task 'footask'
dataflow:>task launch --name footask
Launched task 'footask'

6.8 Check YARN App Statuses

Overall app status can be seen from YARN Resource Manager UI or using Spring YARN CLI which gives more info about running containers within an app itself.

$ ./bin/dataflow-server-yarn-cli shell

When stream has been submitted YARN shows it as ACCEPTED before its turned to RUNNING state.

$ submitted
  APPLICATION ID                  USER          NAME                     QUEUE    TYPE      STARTTIME       FINISHTIME  STATE     FINALSTATUS  ORIGINAL TRACKING URL
  ------------------------------  ------------  -----------------------  -------  --------  --------------  ----------  --------  -----------  ---------------------
  application_1461658614481_0001  jvalkealahti  scdstream:app:foostream  default  DATAFLOW  26/04/16 16:27  N/A         ACCEPTED  UNDEFINED

$ submitted
  APPLICATION ID                  USER          NAME                     QUEUE    TYPE      STARTTIME       FINISHTIME  STATE    FINALSTATUS  ORIGINAL TRACKING URL
  ------------------------------  ------------  -----------------------  -------  --------  --------------  ----------  -------  -----------  -------------------------
  application_1461658614481_0001  jvalkealahti  scdstream:app:foostream  default  DATAFLOW  26/04/16 16:27  N/A         RUNNING  UNDEFINED    http://192.168.1.96:58580

More info about internals for stream apps can be queried by clustersinfo and clusterinfo commands:

$ clustersinfo -a application_1461658614481_0001
  CLUSTER ID
  --------------
  foostream:log
  foostream:time

$ clusterinfo -a application_1461658614481_0001 -c foostream:time
  CLUSTER STATE  MEMBER COUNT
  -------------  ------------
  RUNNING        1

After stream is undeployed YARN app should close itself automatically:

$ submitted -v
  APPLICATION ID                  USER          NAME                     QUEUE    TYPE      STARTTIME       FINISHTIME      STATE     FINALSTATUS  ORIGINAL TRACKING URL
  ------------------------------  ------------  -----------------------  -------  --------  --------------  --------------  --------  -----------  ---------------------
  application_1461658614481_0001  jvalkealahti  scdstream:app:foostream  default  DATAFLOW  26/04/16 16:27  26/04/16 16:28  FINISHED  SUCCEEDED

Launching a task will be shown in RUNNING state while app is executing its batch jobs:

$ submitted -v
  APPLICATION ID                  USER          NAME                     QUEUE    TYPE      STARTTIME       FINISHTIME      STATE     FINALSTATUS  ORIGINAL TRACKING URL
  ------------------------------  ------------  -----------------------  -------  --------  --------------  --------------  --------  -----------  -------------------------
  application_1461658614481_0002  jvalkealahti  scdtask:timestamp        default  DATAFLOW  26/04/16 16:29  N/A             RUNNING   UNDEFINED    http://192.168.1.96:39561
  application_1461658614481_0001  jvalkealahti  scdstream:app:foostream  default  DATAFLOW  26/04/16 16:27  26/04/16 16:28  FINISHED  SUCCEEDED

$ submitted -v
  APPLICATION ID                  USER          NAME                     QUEUE    TYPE      STARTTIME       FINISHTIME      STATE     FINALSTATUS  ORIGINAL TRACKING URL
  ------------------------------  ------------  -----------------------  -------  --------  --------------  --------------  --------  -----------  ---------------------
  application_1461658614481_0002  jvalkealahti  scdtask:timestamp        default  DATAFLOW  26/04/16 16:29  26/04/16 16:29  FINISHED  SUCCEEDED
  application_1461658614481_0001  jvalkealahti  scdstream:app:foostream  default  DATAFLOW  26/04/16 16:27  26/04/16 16:28  FINISHED  SUCCEEDED

7. Deploying on AMBARI

Ambari basically automates YARN installation instead of doing it manually. Also a lot of other configuration steps are automated as much as possible to easy overall installation process.

7.1 Install Ambari Server

Generally it is only needed to install scdf-plugin-hdp plugin into ambari server which adds needed service definitions.

[root@ambari-1 ~]# yum -y install ambari-server
[root@ambari-1 ~]# ambari-server setup -s
[root@ambari-1 ~]# wget -nv http://repo.spring.io/yum-milestone-local/scdf/1.0/scdf-milestone-1.0.repo -O /etc/yum.repos.d/scdf-milestone-1.0.repo
[root@ambari-1 ~]# yum -y install scdf-plugin-hdp
[root@ambari-1 ~]# ambari-server start

	Note
	Ambari plugin only works for redhat6 based systems for now.

7.2 Deploy Data Flow

When you create your cluste and choose a stack, make sure that redhat6 section contains repository named SCDF-1.0 and that it points to repo.spring.io/yum-milestone-local/scdf/1.0.

From services choose Spring Cloud Dataflow and Kafka. Hdfs, Yarn and Zookeeper are forced dependencies.

Then in Customize Services what is really left for user to do is to add address for redis(as it’s required). Everything else is automatically configured. Technically it also allows you to switch to use rabbit by leaving Kafka out and defining rabbit settings there. But generally use of Kafka is a good choice.

	Note
	We also install H2 DB as service so that it can be accessed from every node.

Part V. Appendices

Appendix A. Building

To build the source you will need to install JDK 1.7.

The build uses the Maven wrapper so you don’t have to install a specific version of Maven. To enable the tests for Redis you should run the server before bulding. See below for more information on how run Redis.

The main build command is

$ ./mvnw clean install

You can also add '-DskipTests' if you like, to avoid running the tests.

	Note
	You can also install Maven (>=3.3.3) yourself and run the `mvn` command in place of `./mvnw` in the examples below. If you do that you also might need to add `-P spring` if your local Maven settings do not contain repository declarations for spring pre-release artifacts.

	Note
	Be aware that you might need to increase the amount of memory available to Maven by setting a `MAVEN_OPTS` environment variable with a value like `-Xmx512m -XX:MaxPermSize=128m`. We try to cover this in the `.mvn` configuration, so if you find you have to do it to make a build succeed, please raise a ticket to get the settings added to source control.

The projects that require middleware generally include a docker-compose.yml, so consider using Docker Compose to run the middeware servers in Docker containers. See the README in the scripts demo repository for specific instructions about the common cases of mongo, rabbit and redis.

A.1 Documentation

There is a "full" profile that will generate documentation. You can build just the documentation by executing

$ ./mvnw clean package -DskipTests -P full -pl {project-artifactId} -am

A.2 Working with the code

If you don’t have an IDE preference we would recommend that you use Spring Tools Suite or Eclipse when working with the code. We use the m2eclipe eclipse plugin for maven support. Other IDEs and tools should also work without issue.

A.2.1 Importing into eclipse with m2eclipse

We recommend the m2eclipe eclipse plugin when working with eclipse. If you don’t already have m2eclipse installed it is available from the "eclipse marketplace".

Unfortunately m2e does not yet support Maven 3.3, so once the projects are imported into Eclipse you will also need to tell m2eclipse to use the .settings.xml file for the projects. If you do not do this you may see many different errors related to the POMs in the projects. Open your Eclipse preferences, expand the Maven preferences, and select User Settings. In the User Settings field click Browse and navigate to the Spring Cloud project you imported selecting the .settings.xml file in that project. Click Apply and then OK to save the preference changes.

	Note
	Alternatively you can copy the repository settings from `.settings.xml` into your own `~/.m2/settings.xml`.

A.2.2 Importing into eclipse without m2eclipse

If you prefer not to use m2eclipse you can generate eclipse project metadata using the following command:

$ ./mvnw eclipse:eclipse

The generated eclipse projects can be imported by selecting import existing projects from the file menu.

Appendix B. Contributing

Spring Cloud is released under the non-restrictive Apache 2.0 license, and follows a very standard Github development process, using Github tracker for issues and merging pull requests into master. If you want to contribute even something trivial please do not hesitate, but follow the guidelines below.

B.1 Sign the Contributor License Agreement

Before we accept a non-trivial patch or pull request we will need you to sign the contributor’s agreement. Signing the contributor’s agreement does not grant anyone commit rights to the main repository, but it does mean that we can accept your contributions, and you will get an author credit if we do. Active contributors might be asked to join the core team, and given the ability to merge pull requests.

B.2 Code Conventions and Housekeeping

None of these is essential for a pull request, but they will all help. They can also be added after the original pull request but before a merge.

Use the Spring Framework code format conventions. If you use Eclipse you can import formatter settings using the eclipse-code-formatter.xml file from the Spring Cloud Build project. If using IntelliJ, you can use the Eclipse Code Formatter Plugin to import the same file.
Make sure all new .java files to have a simple Javadoc class comment with at least an @author tag identifying you, and preferably at least a paragraph on what the class is for.
Add the ASF license header comment to all new .java files (copy from existing files in the project)
Add yourself as an @author to the .java files that you modify substantially (more than cosmetic changes).
Add some Javadocs and, if you change the namespace, some XSD doc elements.
A few unit tests would help a lot as well — someone has to do it.
If no-one else is using your branch, please rebase it against the current master (or other target branch in the main project).
When writing a commit message please follow these conventions, if you are fixing an existing issue please add Fixes gh-XXXX at the end of the commit message (where XXXX is the issue number).