1.0.0.M2
Copyright © 2013-2015 Pivotal Software, Inc.
The Spring Cloud Data Flow for Apache Yarn reference guide is available as HTML, PDF and EPUB documents. The latest copy is available at docs.spring.io/spring-cloud-dataflow-server-yarn/docs/current-SNAPSHOT/reference/html/.
Copies of this document may be made for your own use and for distribution to others, provided that you do not charge any fee for such copies and further provided that each copy contains this Copyright Notice, whether distributed in print or electronically.
Having trouble with Spring Cloud Data Flow? We’d like to help! Ask a question on Stack Overflow; we monitor questions tagged spring-cloud.
Note: All of Spring Cloud Data Flow is open source, including the documentation! If you find problems with the docs, or if you just want to improve them, please get involved.
This section provides a brief overview of the Spring Cloud Data Flow reference documentation. Think of it as a map for the rest of the document. You can read this reference guide in a linear fashion, or you can skip sections if something doesn’t interest you.
Spring Cloud Data Flow provides a cloud native programming and operating model for composable data microservices on a structured platform. With Spring Cloud Data Flow, developers can create, orchestrate and refactor data pipelines through a single programming model for common use cases such as data ingest, real-time analytics, and data import/export.
Spring Cloud Data Flow is the cloud native redesign of Spring XD, a project that aimed to simplify the development of Big Data applications. The integration and batch modules from Spring XD have been refactored into Spring Boot data microservice applications that are now autonomous deployment units, enabling them to take full advantage of platform capabilities natively and to evolve independently in isolation.
Spring Cloud Data Flow defines best practices for distributed stream and batch microservice design patterns.
The architecture for Spring Cloud Data Flow is separated into a number of distinct components.
The Core domain model includes the concept of a stream that is a composition of spring-cloud-stream apps in a linear pipeline from a source to a sink, optionally including processor apps in between. The domain also includes the concept of a task, which may be any process that does not run indefinitely, including Spring Batch jobs.
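For illustration, a stream with a processor between a source and a sink could be defined in the shell DSL as sketched below (this assumes the time, transform and log apps are registered, and that the transform processor accepts an --expression option):

dataflow:>stream create --name ticktock --definition "time | transform --expression=payload.toUpperCase() | log" --deploy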
The App Registry maintains the set of available apps and their mappings to a URI. For example, if relying on Maven coordinates, the URI would be of the format:
maven://<groupId>:<artifactId>:<version>
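For instance, a log sink app packaged as a Maven artifact might be registered with a URI like the following (the coordinates here are illustrative only, not an actual published artifact):

maven://org.springframework.cloud.stream.module:log-sink:1.0.0.BUILD-SNAPSHOT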
The Data Flow Server Core provides the REST API and UI to be used in combination with an implementation of the Deployer SPI when creating a Data Flow Server for a given deployment environment.
The Shell connects to the Data Flow Server’s REST API and supports a DSL that simplifies the process of defining a stream and managing its lifecycle.
Several Data Flow Server implementations exist, covering a range of runtime environments.
As mentioned above, the Spring Cloud Data Flow Server implementations all rely upon corresponding implementations of the Spring Cloud Deployer SPI, which provides the abstraction layer for deploying the apps of a given stream or task. Each of the Data Flow Servers mentioned above has a corresponding deployer SPI project.
The Data Flow runtime can be deployed and used with YARN in two different ways: first, by using it directly with a YARN cluster, and second, by letting Apache Ambari deploy it into its cluster as a service.
The Admin server application is run as a standalone application. All modules used for streams and tasks will be deployed on the YARN cluster that the Admin server is configured to target.
These requirements are not specific to the YARN runtime; they are generally what the Data Flow core needs.
Download the Spring Cloud Data Flow YARN distribution ZIP file which includes the Admin and the Shell apps:
$ wget http://repo.spring.io/milestone/org/springframework/cloud/dist/spring-cloud-dataflow-server-yarn-dist/1.0.0.M2/spring-cloud-dataflow-server-yarn-dist-1.0.0.M2.zip
Unzip the distribution ZIP file and change to the directory containing the deployment files.
$ cd spring-cloud-dataflow-server-yarn-1.0.0.M2
Generic runtime settings can be changed in config/servers.yml. Make sure Hadoop and Redis are running. If either one is not running on localhost, you need to configure them in config/servers.yml, as sketched below.
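As an illustration, the relevant entries might look like the following sketch (the property names are assumptions based on common Spring Boot and Spring for Apache Hadoop settings; consult the config/servers.yml shipped with the distribution for the authoritative keys):

# config/servers.yml (sketch; keys are assumptions)
spring:
  hadoop:
    fsUri: hdfs://localhost:8020
    resourceManagerHost: localhost
  redis:
    host: localhost
    port: 6379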
If this is the first time deploying, make sure the user that runs the Server app has rights to create and write to the /dataflow directory in HDFS. If there is an existing deployment on HDFS, remove it using:
$ hdfs dfs -rm -R /dataflow
Start the Spring Cloud Data Flow Server app for YARN:
$ ./bin/dataflow-server-yarn
Create a stream:
dataflow:>stream create --name foostream --definition "time|log" --deploy
List streams:
dataflow:>stream list
╔═══════════╤═════════════════╤════════╗
║Stream Name│Stream Definition│ Status ║
╠═══════════╪═════════════════╪════════╣
║foostream  │time|log         │deployed║
╚═══════════╧═════════════════╧════════╝
After some time, destroy the stream:
dataflow:>stream destroy --name foostream
The YARN application is pushed and started automatically during the stream deployment process. Once all streams are destroyed, the YARN application will exit.
Create and launch a task:
dataflow:>task create --name footask --definition "timestamp"
Created new task 'footask'
dataflow:>task launch --name footask
Launched task 'footask'
Overall app status can be seen from the YARN Resource Manager UI or by using the Spring YARN CLI, which gives more info about the running containers within an app itself.
$ ./bin/dataflow-server-yarn-cli shell
When a stream has been submitted, YARN shows it as ACCEPTED before it turns to the RUNNING state.
$ submitted
  APPLICATION ID                  USER          NAME                     QUEUE    TYPE      STARTTIME       FINISHTIME  STATE     FINALSTATUS  ORIGINAL TRACKING URL
  ------------------------------  ------------  -----------------------  -------  --------  --------------  ----------  --------  -----------  ---------------------
  application_1461658614481_0001  jvalkealahti  scdstream:app:foostream  default  DATAFLOW  26/04/16 16:27  N/A         ACCEPTED  UNDEFINED

$ submitted
  APPLICATION ID                  USER          NAME                     QUEUE    TYPE      STARTTIME       FINISHTIME  STATE    FINALSTATUS  ORIGINAL TRACKING URL
  ------------------------------  ------------  -----------------------  -------  --------  --------------  ----------  -------  -----------  -------------------------
  application_1461658614481_0001  jvalkealahti  scdstream:app:foostream  default  DATAFLOW  26/04/16 16:27  N/A         RUNNING  UNDEFINED    http://192.168.1.96:58580
More info about the internals of stream apps can be queried with the clustersinfo and clusterinfo commands:
$ clustersinfo -a application_1461658614481_0001
  CLUSTER ID
  --------------
  foostream:log
  foostream:time

$ clusterinfo -a application_1461658614481_0001 -c foostream:time
  CLUSTER STATE  MEMBER COUNT
  -------------  ------------
  RUNNING        1
After the stream is undeployed, the YARN app should close itself automatically:
$ submitted -v
  APPLICATION ID                  USER          NAME                     QUEUE    TYPE      STARTTIME       FINISHTIME      STATE     FINALSTATUS  ORIGINAL TRACKING URL
  ------------------------------  ------------  -----------------------  -------  --------  --------------  --------------  --------  -----------  ---------------------
  application_1461658614481_0001  jvalkealahti  scdstream:app:foostream  default  DATAFLOW  26/04/16 16:27  26/04/16 16:28  FINISHED  SUCCEEDED
A launched task is shown in the RUNNING state while the app is executing its batch jobs:
$ submitted -v
  APPLICATION ID                  USER          NAME                     QUEUE    TYPE      STARTTIME       FINISHTIME      STATE     FINALSTATUS  ORIGINAL TRACKING URL
  ------------------------------  ------------  -----------------------  -------  --------  --------------  --------------  --------  -----------  -------------------------
  application_1461658614481_0002  jvalkealahti  scdtask:timestamp        default  DATAFLOW  26/04/16 16:29  N/A             RUNNING   UNDEFINED    http://192.168.1.96:39561
  application_1461658614481_0001  jvalkealahti  scdstream:app:foostream  default  DATAFLOW  26/04/16 16:27  26/04/16 16:28  FINISHED  SUCCEEDED

$ submitted -v
  APPLICATION ID                  USER          NAME                     QUEUE    TYPE      STARTTIME       FINISHTIME      STATE     FINALSTATUS  ORIGINAL TRACKING URL
  ------------------------------  ------------  -----------------------  -------  --------  --------------  --------------  --------  -----------  ---------------------
  application_1461658614481_0002  jvalkealahti  scdtask:timestamp        default  DATAFLOW  26/04/16 16:29  26/04/16 16:29  FINISHED  SUCCEEDED
  application_1461658614481_0001  jvalkealahti  scdstream:app:foostream  default  DATAFLOW  26/04/16 16:27  26/04/16 16:28  FINISHED  SUCCEEDED
Ambari automates the YARN installation instead of requiring you to do it manually. A lot of other configuration steps are also automated as much as possible to ease the overall installation process.
Generally, you only need to install the scdf-plugin-hdp plugin into the Ambari server, which adds the needed service definitions.
[root@ambari-1 ~]# yum -y install ambari-server
[root@ambari-1 ~]# ambari-server setup -s
[root@ambari-1 ~]# wget -nv http://repo.spring.io/yum-milestone-local/scdf/1.0/scdf-milestone-1.0.repo -O /etc/yum.repos.d/scdf-milestone-1.0.repo
[root@ambari-1 ~]# yum -y install scdf-plugin-hdp
[root@ambari-1 ~]# ambari-server start
Note: The Ambari plugin only works for redhat6 based systems for now.
When you create your cluster and choose a stack, make sure that the redhat6 section contains a repository named SCDF-1.0 and that it points to repo.spring.io/yum-milestone-local/scdf/1.0.
From the services, choose Spring Cloud Dataflow and Kafka. Hdfs, Yarn and Zookeeper are forced dependencies.
Then, in Customize Services, all that is really left for the user to do is to add the address for Redis (as it is required). Everything else is automatically configured. Technically, this also allows you to switch to Rabbit by leaving Kafka out and defining the Rabbit settings there, but Kafka is generally a good choice.
Note: We also install the H2 DB as a service so that it can be accessed from every node.
To build the source you will need to install JDK 1.7.
The build uses the Maven wrapper, so you don’t have to install a specific version of Maven. To enable the tests for Redis, you should run the server before building. See below for more information on how to run Redis.
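If you have Docker available, one quick way to get a local Redis running for the tests is shown below (this uses the official redis image with the default port):

$ docker run -d --name redis -p 6379:6379 redis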
The main build command is
$ ./mvnw clean install
You can also add '-DskipTests' if you like, to avoid running the tests.
Note: You can also install Maven (>=3.3.3) yourself and run the mvn command in place of ./mvnw in the examples below.
Note: Be aware that you might need to increase the amount of memory available to Maven by setting a MAVEN_OPTS environment variable with a value like -Xmx512m -XX:MaxPermSize=128m.
The projects that require middleware generally include a docker-compose.yml, so consider using Docker Compose to run the middleware servers in Docker containers. See the README in the scripts demo repository for specific instructions about the common cases of mongo, rabbit and redis.
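As a sketch of what such a docker-compose.yml typically contains (the service names and image tags here are assumptions, not copied from any specific project):

# docker-compose.yml (illustrative sketch)
version: '2'
services:
  redis:
    image: redis
    ports:
      - "6379:6379"
  rabbitmq:
    image: rabbitmq:3-management
    ports:
      - "5672:5672"
  mongo:
    image: mongo
    ports:
      - "27017:27017"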
There is a "full" profile that will generate documentation. You can build just the documentation by executing
$ ./mvnw clean package -DskipTests -P full -pl {project-artifactId} -am
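For example, assuming a docs module with the artifactId spring-cloud-dataflow-server-yarn-docs (the actual artifactId is an assumption; check the reactor build for the real module name), the invocation would be:

$ ./mvnw clean package -DskipTests -P full -pl spring-cloud-dataflow-server-yarn-docs -am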
If you don’t have an IDE preference, we would recommend that you use Spring Tool Suite or Eclipse when working with the code. We use the m2eclipse plugin for Maven support. Other IDEs and tools should also work without issue. We recommend the m2eclipse plugin when working with Eclipse. If you don’t already have m2eclipse installed, it is available from the Eclipse marketplace.
Unfortunately m2e does not yet support Maven 3.3, so once the projects are imported into Eclipse you will also need to tell m2eclipse to use the .settings.xml file for the projects. If you do not do this you may see many different errors related to the POMs in the projects. Open your Eclipse preferences, expand the Maven preferences, and select User Settings. In the User Settings field click Browse and navigate to the Spring Cloud project you imported, selecting the .settings.xml file in that project. Click Apply and then OK to save the preference changes.
Note: Alternatively you can copy the repository settings from the project’s .settings.xml file into your own ~/.m2/settings.xml.
Spring Cloud is released under the non-restrictive Apache 2.0 license and follows a very standard GitHub development process, using the GitHub tracker for issues and merging pull requests into master. If you want to contribute even something trivial, please do not hesitate, but follow the guidelines below.
Before we accept a non-trivial patch or pull request we will need you to sign the contributor’s agreement. Signing the contributor’s agreement does not grant anyone commit rights to the main repository, but it does mean that we can accept your contributions, and you will get an author credit if we do. Active contributors might be asked to join the core team, and given the ability to merge pull requests.
None of these is essential for a pull request, but they will all help. They can also be added after the original pull request but before a merge.
- Use the Spring Framework code format conventions. If you use Eclipse, you can import formatter settings using the eclipse-code-formatter.xml file from the Spring Cloud Build project. If using IntelliJ, you can use the Eclipse Code Formatter Plugin to import the same file.
- Make sure all new .java files have a simple Javadoc class comment with at least an @author tag identifying you, and preferably at least a paragraph on what the class is for.
- Add the ASF license header comment to all new .java files (copy from existing files in the project).
- Add yourself as an @author to the .java files that you modify substantially (more than cosmetic changes).
- When writing a commit message, please follow these conventions: if you are fixing an existing issue, please add Fixes gh-XXXX at the end of the commit message (where XXXX is the issue number).
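For example, a commit message closing a hypothetical issue number 1234 could end like this (the summary line and issue number are purely illustrative):

Fix stream status reporting in YARN deployer

Fixes gh-1234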