15. Deploying on YARN

The server application is run as a standalone application. All applications used for streams and tasks will be deployed on the YARN cluster that is targeted by the server.

15.1 Prerequisites

The YARN runtime itself does not need the following; they are requirements of the Data Flow core.

  • RabbitMQ - required if you use Data Flow apps with RabbitMQ bindings.
  • Kafka - required if you use Data Flow apps with Kafka bindings.
  • Database - an embedded H2 database is currently used, though any supported database can be configured.

15.2 Download and Extract Distribution

Download the Spring Cloud Data Flow YARN distribution ZIP file which includes the Server and the Shell apps:

$ wget http://repo.spring.io/release/org/springframework/cloud/dist/spring-cloud-dataflow-server-yarn-dist/1.0.2.RELEASE/spring-cloud-dataflow-server-yarn-dist-1.0.2.RELEASE.zip

Unzip the distribution ZIP file and change to the directory containing the deployment files.

$ cd spring-cloud-dataflow-server-yarn-1.0.2.RELEASE
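The download, unzip, and change-directory steps above can be sketched as one small script. This is only an illustration: the helper names (dist_zip, dist_url, dist_dir, fetch_dist) are made up here, and it assumes that other releases follow the same naming pattern as 1.0.2.RELEASE, which this guide does not confirm.

```shell
# Sketch: derive the ZIP name, download URL, and directory name from a version.
# All function names are hypothetical helpers, not part of the distribution.
BASE_URL="http://repo.spring.io/release/org/springframework/cloud/dist/spring-cloud-dataflow-server-yarn-dist"

# Name of the distribution ZIP file for a given version.
dist_zip() { printf 'spring-cloud-dataflow-server-yarn-dist-%s.zip\n' "$1"; }

# Full download URL for a given version.
dist_url() { printf '%s/%s/%s\n' "$BASE_URL" "$1" "$(dist_zip "$1")"; }

# Directory used after extraction (note: no "-dist" in the name, as in the cd step above).
dist_dir() { printf 'spring-cloud-dataflow-server-yarn-%s\n' "$1"; }

# Download, extract, and enter the distribution directory.
fetch_dist() {
  wget "$(dist_url "$1")" &&
  unzip "$(dist_zip "$1")" &&
  cd "$(dist_dir "$1")"
}

# Usage: fetch_dist 1.0.2.RELEASE
```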

15.3 Configure Settings

Generic runtime settings can be changed in config/servers.yml. Chapter 17, Configuring Runtime Settings and Environment, contains detailed information about configuration.

The servers.yml file is a central place to share common configuration because it is passed to Boot-based JVM processes via the option -Dspring.config.location=servers.yml.
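To make the mechanism concrete, the following sketch prints the kind of command line that hands a shared configuration file to a Boot-based JVM process. The distribution's own launch scripts take care of this for you; the function name boot_java_cmd and the jar name my-app.jar are hypothetical and appear only for illustration.

```shell
# Sketch: build a java command line that points a Boot app at a shared config file.
# boot_java_cmd and my-app.jar are illustrative names, not part of the distribution.
boot_java_cmd() {
  # $1 = configuration file, $2 = application jar
  printf 'java -Dspring.config.location=%s -jar %s\n' "$1" "$2"
}

# Print the command instead of running it, so the sketch has no side effects.
boot_java_cmd servers.yml my-app.jar
```

Because every process receives the same option, a setting placed in servers.yml is visible to all of them without further per-process configuration.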

15.4 Start Server

If this is the first deployment, make sure that the user who runs the Server app has rights to create and write to the /dataflow directory in HDFS. If there is an existing deployment in HDFS, remove it using:

$ hdfs dfs -rm -R /dataflow
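One possible way to prepare the directory is sketched below, assuming it is run by a user who is allowed to create directories at the HDFS root and to change ownership. The function name prepare_dataflow_dir and the HDFS_CMD and DATAFLOW_DIR variables are hypothetical; only the hdfs dfs subcommands themselves are standard Hadoop.

```shell
# Sketch: remove an existing deployment (if any) and create /dataflow for a given owner.
# HDFS_CMD, DATAFLOW_DIR, and prepare_dataflow_dir are illustrative names.
HDFS_CMD="${HDFS_CMD:-hdfs}"
DATAFLOW_DIR="${DATAFLOW_DIR:-/dataflow}"

prepare_dataflow_dir() {
  owner="$1"   # user that will run the Server app
  # "dfs -test -d" exits with 0 only if the path exists and is a directory.
  if "$HDFS_CMD" dfs -test -d "$DATAFLOW_DIR"; then
    echo "removing existing deployment at $DATAFLOW_DIR"
    "$HDFS_CMD" dfs -rm -R "$DATAFLOW_DIR" || return 1
  fi
  # Create the directory and hand it over to the server user.
  "$HDFS_CMD" dfs -mkdir -p "$DATAFLOW_DIR" &&
  "$HDFS_CMD" dfs -chown "$owner" "$DATAFLOW_DIR"
}

# Usage: prepare_dataflow_dir jvalkealahti
```

Note that the removal step is destructive, so it is worth checking the directory contents first on a shared cluster.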

Start the Spring Cloud Data Flow Server app for YARN:

$ ./bin/dataflow-server-yarn
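Before connecting a shell it can be useful to wait until the server answers HTTP requests. The sketch below polls a URL with curl; it assumes the server listens on http://localhost:9393 and that its root endpoint answers without authentication, neither of which is stated in this section. The names wait_for_server and CURL_CMD are hypothetical.

```shell
# Sketch: poll a URL until it answers, or give up after a number of attempts.
# wait_for_server and CURL_CMD are illustrative names; the URL is an assumption.
CURL_CMD="${CURL_CMD:-curl}"

wait_for_server() {
  url="$1"
  attempts="${2:-30}"   # how many times to try
  delay="${3:-2}"       # seconds to sleep between tries
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -s: quiet, -f: treat HTTP errors (status >= 400) as failure.
    if "$CURL_CMD" -sf -o /dev/null "$url"; then
      echo "server is up at $url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "server did not answer at $url" >&2
  return 1
}

# Usage: wait_for_server http://localhost:9393
```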

15.5 Connect Shell

Start the Spring Cloud Data Flow shell:

$ ./bin/dataflow-shell

The shell in the distribution package contains extension commands for the HDFS file system:

dataflow:>hadoop fs
hadoop fs cat              hadoop fs copyFromLocal    hadoop fs copyToLocal      hadoop fs expunge
hadoop fs ls               hadoop fs mkdir            hadoop fs mv               hadoop fs rm
dataflow:>hadoop fs ls /
rwxrwxrwx root         supergroup 0 2016-07-25 06:54:15 /
rwxrwxrwx jvalkealahti supergroup 0 2016-07-25 06:58:38 /dataflow
rwxr-xr-x jvalkealahti supergroup 0 2016-07-25 07:31:32 /repo
rwxrwxrwx root         supergroup 0 2016-07-20 16:25:31 /tmp
rwxrwxrwx jvalkealahti supergroup 0 2015-10-29 10:59:24 /user

Tip

You can set the server address automatically by placing it in the configuration under the key dataflow.uri.

15.6 Register Applications

By default, the application registry is empty. To register all out-of-the-box stream applications built with the RabbitMQ binder in bulk, use the following command. For more details, review how to register applications.

dataflow:>app import --uri http://bit.ly/stream-applications-rabbit-maven

15.6.1 Sourcing Applications from HDFS

The YARN integration also allows you to store registered applications directly in HDFS instead of relying on Maven or any other resolution mechanism. The only change needed during registration is to use an HDFS address, as shown below.

dataflow:>app register --name ftp --type sink --uri hdfs:/dataflow/artifacts/repo/ftp-sink-kafka-1.0.0.RC1.jar
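The jar has to be present in HDFS before it can be registered this way. A minimal sketch of the upload follows, using the standard hdfs dfs commands; the helper names (hdfs_app_uri, upload_app) and the REPO_DIR variable are hypothetical, and the target directory is simply taken from the registration example above.

```shell
# Sketch: copy a local application jar into HDFS and print the URI to register.
# REPO_DIR, hdfs_app_uri, and upload_app are illustrative names.
REPO_DIR="${REPO_DIR:-/dataflow/artifacts/repo}"

# Build the hdfs: URI that "app register --uri" expects for a local jar path.
hdfs_app_uri() { printf 'hdfs:%s/%s\n' "$REPO_DIR" "$(basename "$1")"; }

# Create the repository directory if needed, upload the jar, and print its URI.
upload_app() {
  hdfs dfs -mkdir -p "$REPO_DIR" &&
  hdfs dfs -copyFromLocal "$1" "$REPO_DIR/" &&
  hdfs_app_uri "$1"
}

# Usage: upload_app ./ftp-sink-kafka-1.0.0.RC1.jar
```

The printed URI can then be passed to the --uri option of app register.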

15.7 Create Stream

Create a stream:

dataflow:>stream create --name foostream --definition "time|log" --deploy

List streams:

dataflow:>stream list
╔═══════════╤═════════════════╤════════╗
║Stream Name│Stream Definition│ Status ║
╠═══════════╪═════════════════╪════════╣
║foostream  │time|log         │deployed║
╚═══════════╧═════════════════╧════════╝

After some time, destroy the stream:

dataflow:>stream destroy --name foostream

The YARN application is pushed and started automatically during a stream deployment process. Once all streams are destroyed the YARN application will exit.

15.8 Create Task

Create and launch a task:

dataflow:>task create --name footask --definition "timestamp"
Created new task 'footask'
dataflow:>task launch --name footask
Launched task 'footask'

15.9 Using YARN CLI

The overall app status can be seen in the YARN Resource Manager UI or by using the Spring YARN CLI, which gives more information about the containers running within an app.

$ ./bin/dataflow-server-yarn-cli shell

15.9.1 Check YARN App Statuses

After a stream has been submitted, YARN shows it as ACCEPTED before it moves to the RUNNING state.

$ submitted
  APPLICATION ID                  USER          NAME                     QUEUE    TYPE      STARTTIME       FINISHTIME  STATE     FINALSTATUS  ORIGINAL TRACKING URL
  ------------------------------  ------------  -----------------------  -------  --------  --------------  ----------  --------  -----------  ---------------------
  application_1461658614481_0001  jvalkealahti  scdstream:app:foostream  default  DATAFLOW  26/04/16 16:27  N/A         ACCEPTED  UNDEFINED

$ submitted
  APPLICATION ID                  USER          NAME                     QUEUE    TYPE      STARTTIME       FINISHTIME  STATE    FINALSTATUS  ORIGINAL TRACKING URL
  ------------------------------  ------------  -----------------------  -------  --------  --------------  ----------  -------  -----------  -------------------------
  application_1461658614481_0001  jvalkealahti  scdstream:app:foostream  default  DATAFLOW  26/04/16 16:27  N/A         RUNNING  UNDEFINED    http://192.168.1.96:58580

More information about the internals of stream apps can be queried with the clustersinfo and clusterinfo commands:

$ clustersinfo -a application_1461658614481_0001
  CLUSTER ID
  --------------
  foostream:log
  foostream:time

$ clusterinfo -a application_1461658614481_0001 -c foostream:time
  CLUSTER STATE  MEMBER COUNT
  -------------  ------------
  RUNNING        1

After the stream is undeployed, the YARN app should close itself automatically:

$ submitted -v
  APPLICATION ID                  USER          NAME                     QUEUE    TYPE      STARTTIME       FINISHTIME      STATE     FINALSTATUS  ORIGINAL TRACKING URL
  ------------------------------  ------------  -----------------------  -------  --------  --------------  --------------  --------  -----------  ---------------------
  application_1461658614481_0001  jvalkealahti  scdstream:app:foostream  default  DATAFLOW  26/04/16 16:27  26/04/16 16:28  FINISHED  SUCCEEDED

A launched task is shown in the RUNNING state while the app is executing its batch jobs:

$ submitted -v
  APPLICATION ID                  USER          NAME                     QUEUE    TYPE      STARTTIME       FINISHTIME      STATE     FINALSTATUS  ORIGINAL TRACKING URL
  ------------------------------  ------------  -----------------------  -------  --------  --------------  --------------  --------  -----------  -------------------------
  application_1461658614481_0002  jvalkealahti  scdtask:timestamp        default  DATAFLOW  26/04/16 16:29  N/A             RUNNING   UNDEFINED    http://192.168.1.96:39561
  application_1461658614481_0001  jvalkealahti  scdstream:app:foostream  default  DATAFLOW  26/04/16 16:27  26/04/16 16:28  FINISHED  SUCCEEDED

$ submitted -v
  APPLICATION ID                  USER          NAME                     QUEUE    TYPE      STARTTIME       FINISHTIME      STATE     FINALSTATUS  ORIGINAL TRACKING URL
  ------------------------------  ------------  -----------------------  -------  --------  --------------  --------------  --------  -----------  ---------------------
  application_1461658614481_0002  jvalkealahti  scdtask:timestamp        default  DATAFLOW  26/04/16 16:29  26/04/16 16:29  FINISHED  SUCCEEDED
  application_1461658614481_0001  jvalkealahti  scdstream:app:foostream  default  DATAFLOW  26/04/16 16:27  26/04/16 16:28  FINISHED  SUCCEEDED
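If the table printed by submitted has been saved to a file or is piped in, the state of a single application can be picked out with a short awk filter. This is only a sketch: the function name app_state is hypothetical, and it relies on the STATE column being the first field after the application ID that matches one of the YARN application states, which holds for the listings above but would break if, for example, a user or queue were named like a state.

```shell
# Sketch: print the YARN state of one application from a "submitted" table on stdin.
# app_state is an illustrative name. The column position varies because FINISHTIME
# is either "N/A" (one field) or a date and a time (two fields), so match by value.
app_state() {
  awk -v id="$1" '
    $1 == id {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^(NEW|NEW_SAVING|SUBMITTED|ACCEPTED|RUNNING|FINISHED|FAILED|KILLED)$/) {
          print $i
          exit
        }
    }'
}

# Usage: app_state application_1461658614481_0001 < submitted-output.txt
```

For the listings shown in this section, the filter yields ACCEPTED, RUNNING, or FINISHED for the respective rows.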

15.9.2 Push Apps

The YARN applications needed by Data Flow can be pushed manually into HDFS:

Spring YARN Cli (v2.4.0.RELEASE)
Hit TAB to complete. Type 'help' and hit RETURN for help, and 'exit' to quit.
$ push -t STREAM
New version installed
$ push -t TASK
New version installed

Note

The push happens automatically when a stream is deployed or a task is launched.