15. Deploying on YARN

The server application runs as a standalone application. All applications used for streams and tasks are deployed on the YARN cluster that is targeted by the server.

15.1 Prerequisites

These requirements are not specific to the YARN runtime; they are what Data Flow core generally needs.

  • RabbitMQ - needed if the deployed Data Flow apps use RabbitMQ bindings.
  • Kafka - needed if the deployed Data Flow apps use Kafka bindings.
  • Database - an embedded H2 database is used by default, though any supported database can be configured (see the sketch after this list).
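As a sketch of the database point above, an external database could be configured in config/servers.yml with standard Spring Boot datasource properties; the host, database name, credentials, and driver below are placeholder assumptions to adapt to your environment:

# Hedged sketch: replace the embedded H2 with an external MySQL database.
# All values are placeholders for your environment.
spring:
  datasource:
    url: jdbc:mysql://dbhost:3306/dataflow
    username: scdf
    password: secret
    driverClassName: com.mysql.jdbc.Driver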

15.2 Download and Extract Distribution

Download the Spring Cloud Data Flow YARN distribution ZIP file which includes the Server and the Shell apps:

$ wget http://repo.spring.io/release/org/springframework/cloud/dist/spring-cloud-dataflow-server-yarn-dist/1.2.0.RELEASE/spring-cloud-dataflow-server-yarn-dist-1.2.0.RELEASE.zip

Unzip the distribution ZIP file and change to the directory containing the deployment files.

$ unzip spring-cloud-dataflow-server-yarn-dist-1.2.0.RELEASE.zip
$ cd spring-cloud-dataflow-server-yarn-1.2.0.RELEASE

15.3 Configure Settings

Generic runtime settings can be changed in config/servers.yml. The dedicated section, Chapter 17, Configuring Runtime Settings and Environment, contains detailed information about configuration.

The servers.yml file is a central place to share common configuration, as it is passed to Boot-based JVM processes via the option -Dspring.config.location=servers.yml.
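As a minimal sketch of what such shared configuration might contain, the snippet below sets the Hadoop connection properties used by Spring for Apache Hadoop; the hosts and port are assumptions for a local single-node setup:

# Hedged sketch of config/servers.yml. Adjust hosts and ports to your cluster.
spring:
  hadoop:
    fsUri: hdfs://localhost:8020
    resourceManagerHost: localhost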

15.4 Start Server

If this is the first time deploying, make sure the user that runs the Server app has rights to create and write to the /dataflow directory in HDFS. If there is an existing deployment on HDFS, remove it using:

$ hdfs dfs -rm -R /dataflow
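If the directory does not exist yet, it can be prepared up front; youruser below is a placeholder for the actual user running the Server app:

$ hdfs dfs -mkdir /dataflow
$ hdfs dfs -chown -R youruser /dataflow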

Start the Spring Cloud Data Flow Server app for YARN:

$ ./bin/dataflow-server-yarn

15.5 Connect Shell

Start the Spring Cloud Data Flow Shell:

$ ./bin/dataflow-shell

The Shell in the distribution package contains extension commands for the HDFS file system.

dataflow:>hadoop fs
hadoop fs cat              hadoop fs copyFromLocal    hadoop fs copyToLocal      hadoop fs expunge
hadoop fs ls               hadoop fs mkdir            hadoop fs mv               hadoop fs rm
dataflow:>hadoop fs ls /
rwxrwxrwx root         supergroup 0 2016-07-25 06:54:15 /
rwxrwxrwx jvalkealahti supergroup 0 2016-07-25 06:58:38 /dataflow
rwxr-xr-x jvalkealahti supergroup 0 2016-07-25 07:31:32 /repo
rwxrwxrwx root         supergroup 0 2016-07-20 16:25:31 /tmp
rwxrwxrwx jvalkealahti supergroup 0 2015-10-29 10:59:24 /user
[Tip]Tip

You can configure the server address automatically by placing it in a configuration file using the key dataflow.uri.
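Since servers.yml is already passed to the JVM processes, one plausible place for this key is there; the address below assumes a locally running server on its default port:

# Hedged sketch: let the Shell connect automatically. Host and port
# are assumptions for a locally running server.
dataflow:
  uri: http://localhost:9393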

15.6 Register Applications

By default, the application registry is empty. If you would like to register all out-of-the-box stream applications built with the RabbitMQ binder in bulk, you can do so with the following command. For more details, review how to register applications.

dataflow:>app import --uri http://bit.ly/stream-applications-rabbit-maven
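You can then verify the registrations with the standard Shell command:

dataflow:>app list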

15.6.1 Sourcing Applications from HDFS

The YARN integration also allows you to store registered applications directly in HDFS instead of relying on Maven or any other resolution. The only thing to change during registration is to use an HDFS address, as shown below.

dataflow:>app register --name ftp --type sink --uri hdfs:/dataflow/artifacts/repo/ftp-sink-kafka-1.0.0.RC1.jar
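The artifact naturally has to exist at that HDFS path first. As an illustration, it could be copied there with the plain HDFS CLI, using the path and file name from the example above:

$ hdfs dfs -mkdir -p /dataflow/artifacts/repo
$ hdfs dfs -copyFromLocal ftp-sink-kafka-1.0.0.RC1.jar /dataflow/artifacts/repo/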

15.7 Create Stream

Create a stream:

dataflow:>stream create --name foostream --definition "time|log" --deploy

List streams:

dataflow:>stream list
╔═══════════╤═════════════════╤════════╗
║Stream Name│Stream Definition│ Status ║
╠═══════════╪═════════════════╪════════╣
║foostream  │time|log         │deployed║
╚═══════════╧═════════════════╧════════╝

After some time, destroy the stream:

dataflow:>stream destroy --name foostream

The YARN application is pushed and started automatically during the stream deployment process. Once all streams are destroyed, the YARN application exits.

15.8 Create Task

Create and launch task:

dataflow:>task create --name footask --definition "timestamp"
Created new task 'footask'
dataflow:>task launch --name footask
Launched task 'footask'

Launch tasks from streams:

The task-launcher-yarn sink itself bundles a YARN deployer but does not push any apps into HDFS; the pushed app therefore needs to exist already and match the deployer version the task-launcher-yarn sink uses.
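The matching version can be pushed beforehand with the Spring YARN CLI, as described in Section 15.9.2, for example:

$ ./bin/dataflow-server-yarn-cli shell
$ push -t TASK -v appv1
New version installed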

In the sample below, we use the tasklaunchrequest processor to pass the needed properties into the task-launcher-yarn sink. We explicitly define appVersion as appv1, which you would have pushed into HDFS prior to running this stream. With this processor you also need to define a uri for the task application itself.

stream create --name launchertest --definition "http --server.port=9000|tasklaunchrequest --deployment-properties=spring.cloud.deployer.yarn.app.appVersion=appv1 --uri=hdfs:/dataflow/repo/timestamp-task.jar|task-launcher-yarn" --deploy

To fire up a task, just post a dummy message into the http source:

http post --target http://localhost:9000 --data empty
[Note]Note

Using the http source on YARN is difficult, as you don't immediately know which cluster node the source app is running on.

15.9 Using the YARN CLI

Overall app status can be seen from the YARN Resource Manager UI or by using the Spring YARN CLI, which gives more info about the running containers within an app.

$ ./bin/dataflow-server-yarn-cli shell

15.9.1 Check YARN App Statuses

When a stream has been submitted, YARN shows it as ACCEPTED before it turns to the RUNNING state.

$ submitted
  APPLICATION ID                  USER          NAME                     QUEUE    TYPE      STARTTIME       FINISHTIME  STATE     FINALSTATUS  ORIGINAL TRACKING URL
  ------------------------------  ------------  -----------------------  -------  --------  --------------  ----------  --------  -----------  ---------------------
  application_1461658614481_0001  jvalkealahti  scdstream:app:foostream  default  DATAFLOW  26/04/16 16:27  N/A         ACCEPTED  UNDEFINED

$ submitted
  APPLICATION ID                  USER          NAME                     QUEUE    TYPE      STARTTIME       FINISHTIME  STATE    FINALSTATUS  ORIGINAL TRACKING URL
  ------------------------------  ------------  -----------------------  -------  --------  --------------  ----------  -------  -----------  -------------------------
  application_1461658614481_0001  jvalkealahti  scdstream:app:foostream  default  DATAFLOW  26/04/16 16:27  N/A         RUNNING  UNDEFINED    http://192.168.1.96:58580

More info about the internals of stream apps can be queried with the clustersinfo and clusterinfo commands:

$ clustersinfo -a application_1461658614481_0001
  CLUSTER ID
  --------------
  foostream:log
  foostream:time

$ clusterinfo -a application_1461658614481_0001 -c foostream:time
  CLUSTER STATE  MEMBER COUNT
  -------------  ------------
  RUNNING        1

After a stream is undeployed, the YARN app should close itself automatically:

$ submitted -v
  APPLICATION ID                  USER          NAME                     QUEUE    TYPE      STARTTIME       FINISHTIME      STATE     FINALSTATUS  ORIGINAL TRACKING URL
  ------------------------------  ------------  -----------------------  -------  --------  --------------  --------------  --------  -----------  ---------------------
  application_1461658614481_0001  jvalkealahti  scdstream:app:foostream  default  DATAFLOW  26/04/16 16:27  26/04/16 16:28  FINISHED  SUCCEEDED

A launched task is shown in the RUNNING state while the app is executing its batch jobs:

$ submitted -v
  APPLICATION ID                  USER          NAME                     QUEUE    TYPE      STARTTIME       FINISHTIME      STATE     FINALSTATUS  ORIGINAL TRACKING URL
  ------------------------------  ------------  -----------------------  -------  --------  --------------  --------------  --------  -----------  -------------------------
  application_1461658614481_0002  jvalkealahti  scdtask:timestamp        default  DATAFLOW  26/04/16 16:29  N/A             RUNNING   UNDEFINED    http://192.168.1.96:39561
  application_1461658614481_0001  jvalkealahti  scdstream:app:foostream  default  DATAFLOW  26/04/16 16:27  26/04/16 16:28  FINISHED  SUCCEEDED

$ submitted -v
  APPLICATION ID                  USER          NAME                     QUEUE    TYPE      STARTTIME       FINISHTIME      STATE     FINALSTATUS  ORIGINAL TRACKING URL
  ------------------------------  ------------  -----------------------  -------  --------  --------------  --------------  --------  -----------  ---------------------
  application_1461658614481_0002  jvalkealahti  scdtask:timestamp        default  DATAFLOW  26/04/16 16:29  26/04/16 16:29  FINISHED  SUCCEEDED
  application_1461658614481_0001  jvalkealahti  scdstream:app:foostream  default  DATAFLOW  26/04/16 16:27  26/04/16 16:28  FINISHED  SUCCEEDED

15.9.2 Push Apps

The YARN applications needed by Data Flow can be pushed manually into HDFS with a given version, which defaults to app.

Spring YARN Cli (v2.4.0.RELEASE)
Hit TAB to complete. Type 'help' and hit RETURN for help, and 'exit' to quit.
$ push -t STREAM
New version installed
$ push -t TASK
New version installed
$ push -t TASK -v appv1
New version installed

After the above commands, the base directories for the different app versions look as shown below. Streams and tasks can then use different versions, which allows alternate configurations to be used.

/dataflow/apps/stream/app
/dataflow/apps/task/app
/dataflow/apps/task/appv1
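
The pushed versions can also be verified with the Shell's hadoop fs extension, for example:

dataflow:>hadoop fs ls /dataflow/apps/task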
[Note]Note

A push happens automatically when a stream is deployed or a task is launched.

15.10 Using Metric Collectors

We package three different metrics collector implementations: one for RabbitMQ and two for different Kafka versions. These can be started using the shell scripts dataflow-server-metrics-collector-rabbit, dataflow-server-metrics-collector-kafka-09, and dataflow-server-metrics-collector-kafka-10, respectively. These applications do not use the servers.yml file for configuration; instead, collectors.yml is used, where custom settings can be placed.
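
For example, to start the RabbitMQ collector:

$ ./bin/dataflow-server-metrics-collector-rabbit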

[Note]Note

With Kafka 0.10.1 and later, kafka-10 should be used. With Kafka 0.10.0 and earlier, kafka-09 should be used.