4. Task / Batch

4.1 Batch Job on Cloud Foundry

In this demonstration, you will learn how to orchestrate short-lived data processing applications (e.g., Spring Batch jobs) using Spring Cloud Task and Spring Cloud Data Flow on Cloud Foundry.
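
In code, a task application is just a Spring Boot application annotated with @EnableTask, which records each run’s start time, end time, and exit code in the task repository. Below is a minimal sketch (the class name is illustrative; this is not the sample’s actual source). @EnableBatchProcessing additionally bootstraps Spring Batch so that any Job beans defined in the application are executed when the task launches:

import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.task.configuration.EnableTask;

// Minimal sketch of a task application: @EnableTask records every run in the
// task repository; @EnableBatchProcessing sets up the Spring Batch
// infrastructure so Job beans defined elsewhere run at startup.
@SpringBootApplication
@EnableTask
@EnableBatchProcessing
public class BatchJobApplication {

    public static void main(String[] args) {
        SpringApplication.run(BatchJobApplication.class, args);
    }
}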

4.1.1 Prerequisites

  • A local PCFDev instance
  • A local install of the cf CLI command-line tool
  • A running instance of the mysql service in PCFDev
  • A Running Data Flow Shell

The Spring Cloud Data Flow Shell is available for download, or you can build it yourself.

[Note]Note

The Spring Cloud Data Flow Shell and the Local server implementation are in the same repository and are both built by running ./mvnw install from the project root directory. If you have already run the build, use the jar in spring-cloud-dataflow-shell/target.

To run the Shell, open a new terminal session:

$ cd <PATH/TO/SPRING-CLOUD-DATAFLOW-SHELL-JAR>
$ java -jar spring-cloud-dataflow-shell-<VERSION>.jar
  ____                              ____ _                __
 / ___| _ __  _ __(_)_ __   __ _   / ___| | ___  _   _  __| |
 \___ \| '_ \| '__| | '_ \ / _` | | |   | |/ _ \| | | |/ _` |
  ___) | |_) | |  | | | | | (_| | | |___| | (_) | |_| | (_| |
 |____/| .__/|_|  |_|_| |_|\__, |  \____|_|\___/ \__,_|\__,_|
  ____ |_|    _          __|___/                 __________
 |  _ \  __ _| |_ __ _  |  ___| | _____      __  \ \ \ \ \ \
 | | | |/ _` | __/ _` | | |_  | |/ _ \ \ /\ / /   \ \ \ \ \ \
 | |_| | (_| | || (_| | |  _| | | (_) \ V  V /    / / / / / /
 |____/ \__,_|\__\__,_| |_|   |_|\___/ \_/\_/    /_/_/_/_/_/


Welcome to the Spring Cloud Data Flow shell. For assistance hit TAB or type "help".
dataflow:>
[Note]Note

The Spring Cloud Data Flow Shell is a Spring Boot application that connects to the Data Flow Server’s REST API and supports a DSL that simplifies the process of defining a stream or task and managing its lifecycle. Most of these samples use the shell. If you prefer, you can use the Data Flow UI at localhost:9393/dashboard (or wherever the server is hosted) to perform equivalent operations.

  • Spring Cloud Data Flow installed on Cloud Foundry. Follow the installation instructions to run Spring Cloud Data Flow on Cloud Foundry.

4.1.2 Building and Running the Demo

[Note]Note

PCF 1.7.12 or greater is required to run Tasks on Spring Cloud Data Flow. As of this writing, both PCFDev and PWS are built on versions that satisfy this requirement.

  1. Task support needs to be enabled on pcf-dev. While logged in as admin, issue the following command:

    cf enable-feature-flag task_creation
    Setting status of task_creation as admin...
    
    OK
    
    Feature task_creation Enabled.
    [Note]Note

    For this sample, all you need is the mysql service, and in PCFDev the mysql service comes with a different plan. From the CF CLI, create the service with: cf create-service p-mysql 512mb mysql and bind it to dataflow-server with: cf bind-service dataflow-server mysql.

    [Note]Note

    All the apps deployed to PCFDev start with low memory by default. It is recommended to increase it to at least 768MB for dataflow-server. The memory allotted to every app spawned by Spring Cloud Data Flow can be set through an environment variable, for example: cf set-env dataflow-server SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_MEMORY 512. Likewise, we have to skip SSL validation with: cf set-env dataflow-server SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_SKIP_SSL_VALIDATION true.

  2. Tasks in Spring Cloud Data Flow require an RDBMS to host the "task repository" (see here for more details), so let’s instruct the Spring Cloud Data Flow server to bind the mysql service to each deployed task:

    $ cf set-env dataflow-server SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_TASK_SERVICES mysql
    $ cf restage dataflow-server
    [Note]Note

    We only need the mysql service for this sample.

  3. As a recap, here is what you should see as configuration for the Spring Cloud Data Flow server:

    cf env dataflow-server
    
    ....
    User-Provided:
    SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_DOMAIN: local.pcfdev.io
    SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_MEMORY: 512
    SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_ORG: pcfdev-org
    SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_PASSWORD: pass
    SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_SKIP_SSL_VALIDATION: true
    SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_SPACE: pcfdev-space
    SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_TASK_SERVICES: mysql
    SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_URL: https://api.local.pcfdev.io
    SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_USERNAME: user
    
    No running env variables have been set
    
    No staging env variables have been set
  4. Notice that the dataflow-server application is started and ready for interaction via the dataflow-server.local.pcfdev.io endpoint
  5. Build and register the batch-job example from the Spring Cloud Task samples. For convenience, the final uber-jar artifact is provided with this sample.

    dataflow:>app register --type task --name simple_batch_job --uri https://github.com/spring-cloud/spring-cloud-dataflow-samples/raw/master/src/main/asciidoc/tasks/simple-batch-job/batch-job-1.3.0.BUILD-SNAPSHOT.jar
  6. Create the task with the simple_batch_job application

    dataflow:>task create foo --definition "simple_batch_job"
    [Note]Note

    Unlike Streams, Task definitions don’t require explicit deployment. They can be launched on-demand, scheduled, or triggered by streams.

  7. Verify that there are still no Task applications running on PCFDev - they are listed only after the initial launch/staging attempt on PCF

    $ cf apps
    Getting apps in org pcfdev-org / space pcfdev-space as user...
    OK
    
    name              requested state   instances   memory   disk   urls
    dataflow-server   started           1/1         768M     512M   dataflow-server.local.pcfdev.io
  8. Let’s launch foo

    dataflow:>task launch foo
  9. Verify the execution of foo by tailing the logs

    $ cf logs foo
    Retrieving logs for app foo in org pcfdev-org / space pcfdev-space as user...
    
    2016-08-14T18:48:54.22-0700 [APP/TASK/foo/0]OUT Creating container
    2016-08-14T18:48:55.47-0700 [APP/TASK/foo/0]OUT
    
    2016-08-14T18:49:06.59-0700 [APP/TASK/foo/0]OUT 2016-08-15 01:49:06.598  INFO 14 --- [           main] o.s.b.c.l.support.SimpleJobLauncher      : Job: [SimpleJob: [name=job1]] launched with the following parameters: [{}]
    
    ...
    ...
    
    2016-08-14T18:49:06.78-0700 [APP/TASK/foo/0]OUT 2016-08-15 01:49:06.785  INFO 14 --- [           main] o.s.b.c.l.support.SimpleJobLauncher      : Job: [SimpleJob: [name=job1]] completed with the following parameters: [{}] and the following status: [COMPLETED]
    
    ...
    ...
    
    2016-08-14T18:49:07.36-0700 [APP/TASK/foo/0]OUT 2016-08-15 01:49:07.363  INFO 14 --- [           main] o.s.b.c.l.support.SimpleJobLauncher      : Job: [SimpleJob: [name=job2]] launched with the following parameters: [{}]
    
    ...
    ...
    
    2016-08-14T18:49:07.53-0700 [APP/TASK/foo/0]OUT 2016-08-15 01:49:07.536  INFO 14 --- [           main] o.s.b.c.l.support.SimpleJobLauncher      : Job: [SimpleJob: [name=job2]] completed with the following parameters: [{}] and the following status: [COMPLETED]
    
    ...
    ...
    
    2016-08-14T18:49:07.71-0700 [APP/TASK/foo/0]OUT Exit status 0
    2016-08-14T18:49:07.78-0700 [APP/TASK/foo/0]OUT Destroying container
    2016-08-14T18:49:08.47-0700 [APP/TASK/foo/0]OUT Successfully destroyed container
    [Note]Note

    Verify that the job1 and job2 batch jobs embedded in the simple-batch-job application were launched independently and that each returned with the status COMPLETED. A sketch of such a two-job configuration appears after these steps.

    [Note]Note

    Unlike LRPs in Cloud Foundry, tasks are short-lived, so the logs aren’t always available. They are generated only while the Task application runs; at the end of the Task operation, the container that ran the Task application is destroyed to free up resources.

  10. List Tasks in Cloud Foundry

    $ cf apps
    Getting apps in org pcfdev-org / space pcfdev-space as user...
    OK
    
    name              requested state   instances   memory   disk   urls
    dataflow-server   started           1/1         768M     512M   dataflow-server.local.pcfdev.io
    foo               stopped           0/1         1G       1G
  11. Verify Task execution details

    dataflow:>task execution list
    ╔══════════════════════════╤══╤════════════════════════════╤════════════════════════════╤═════════╗
    ║        Task Name         │ID│         Start Time         │          End Time          │Exit Code║
    ╠══════════════════════════╪══╪════════════════════════════╪════════════════════════════╪═════════╣
    ║foo                       │1 │Sun Aug 14 18:49:05 PDT 2016│Sun Aug 14 18:49:07 PDT 2016│0        ║
    ╚══════════════════════════╧══╧════════════════════════════╧════════════════════════════╧═════════╝
  12. Verify Job execution details

    dataflow:>job execution list
    ╔═══╤═══════╤═════════╤════════════════════════════╤═════════════════════╤══════════════════╗
    ║ID │Task ID│Job Name │         Start Time         │Step Execution Count │Definition Status ║
    ╠═══╪═══════╪═════════╪════════════════════════════╪═════════════════════╪══════════════════╣
    ║2  │1      │job2     │Sun Aug 14 18:49:07 PDT 2016│1                    │Destroyed         ║
    ║1  │1      │job1     │Sun Aug 14 18:49:06 PDT 2016│1                    │Destroyed         ║
    ╚═══╧═══════╧═════════╧════════════════════════════╧═════════════════════╧══════════════════╝
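
For reference, a two-job configuration like the one exercised above can be written in very little code. The following is a minimal sketch, modeled on (but not copied from) the simple-batch-job sample in the Spring Cloud Task samples repository; step and bean names are illustrative:

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class JobConfiguration {

    // Each Job bean is run independently when the task application starts,
    // which is why job1 and job2 appear as separate entries in
    // "job execution list".
    @Bean
    public Job job1(JobBuilderFactory jobs, StepBuilderFactory steps) {
        Step step1 = steps.get("job1step1")
                .tasklet((contribution, chunkContext) -> {
                    System.out.println("Job1 was run");
                    return RepeatStatus.FINISHED;
                })
                .build();
        return jobs.get("job1").start(step1).build();
    }

    @Bean
    public Job job2(JobBuilderFactory jobs, StepBuilderFactory steps) {
        Step step1 = steps.get("job2step1")
                .tasklet((contribution, chunkContext) -> {
                    System.out.println("Job2 was run");
                    return RepeatStatus.FINISHED;
                })
                .build();
        return jobs.get("job2").start(step1).build();
    }
}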

4.1.3 Summary

In this sample, you have learned:

  • How to register and orchestrate Spring Batch jobs in Spring Cloud Data Flow
  • How to use the cf CLI in the context of Task applications orchestrated by Spring Cloud Data Flow
  • How to verify task executions and the task repository

4.2 Batch File Ingest

In this demonstration, you will learn how to create a data processing application using Spring Batch, which will then be run within Spring Cloud Data Flow.

4.2.1 Prerequisites

  • A Running Data Flow Server. Follow the installation instructions to run Spring Cloud Data Flow on a local host.
  • A Running Data Flow Shell

The Spring Cloud Data Flow Shell is available for download, or you can build it yourself.

[Note]Note

The Spring Cloud Data Flow Shell and the Local server implementation are in the same repository and are both built by running ./mvnw install from the project root directory. If you have already run the build, use the jar in spring-cloud-dataflow-shell/target.

To run the Shell, open a new terminal session:

$ cd <PATH/TO/SPRING-CLOUD-DATAFLOW-SHELL-JAR>
$ java -jar spring-cloud-dataflow-shell-<VERSION>.jar
  ____                              ____ _                __
 / ___| _ __  _ __(_)_ __   __ _   / ___| | ___  _   _  __| |
 \___ \| '_ \| '__| | '_ \ / _` | | |   | |/ _ \| | | |/ _` |
  ___) | |_) | |  | | | | | (_| | | |___| | (_) | |_| | (_| |
 |____/| .__/|_|  |_|_| |_|\__, |  \____|_|\___/ \__,_|\__,_|
  ____ |_|    _          __|___/                 __________
 |  _ \  __ _| |_ __ _  |  ___| | _____      __  \ \ \ \ \ \
 | | | |/ _` | __/ _` | | |_  | |/ _ \ \ /\ / /   \ \ \ \ \ \
 | |_| | (_| | || (_| | |  _| | | (_) \ V  V /    / / / / / /
 |____/ \__,_|\__\__,_| |_|   |_|\___/ \_/\_/    /_/_/_/_/_/


Welcome to the Spring Cloud Data Flow shell. For assistance hit TAB or type "help".
dataflow:>
[Note]Note

The Spring Cloud Data Flow Shell is a Spring Boot application that connects to the Data Flow Server’s REST API and supports a DSL that simplifies the process of defining a stream or task and managing its lifecycle. Most of these samples use the shell. If you prefer, you can use the Data Flow UI at localhost:9393/dashboard (or wherever the server is hosted) to perform equivalent operations.

4.2.2 Batch File Ingest Demo Overview

The source for the demo project is located here. The sample is a Spring Boot application that demonstrates how to read data from a flat file, process the records, and store the transformed data in a database using Spring Batch.

The key classes for creating the batch job are:

  • BatchConfiguration.java - this is where we define our batch job, the step, and the components used to read, process, and write our data. In the sample we use a FlatFileItemReader, which reads a delimited file, a custom PersonItemProcessor to transform the data, and a JdbcBatchItemWriter to write our data to a database.
  • Person.java - the domain object representing the data we are reading and processing in our batch job. The sample data contains records made up of a person’s first and last name.
  • PersonItemProcessor.java - this class is an ItemProcessor implementation which receives records after they have been read and before they are written. This allows us to transform the data between these two steps. In our sample ItemProcessor implementation, we simply transform the first and last name of each Person to uppercase characters, as sketched after this list.
  • Application.java - the main entry point into the Spring Boot application, which is used to launch the batch job.
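
To make the flow concrete, here is a minimal sketch of the domain object and the processor described above (hand-written for this walkthrough, so details may differ slightly from the sample’s actual source):

import org.springframework.batch.item.ItemProcessor;

// Person.java (sketch) - one record read from the flat file.
public class Person {

    private String firstName;
    private String lastName;

    public Person() {
    }

    public Person(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }

    public String getFirstName() { return firstName; }
    public void setFirstName(String firstName) { this.firstName = firstName; }
    public String getLastName() { return lastName; }
    public void setLastName(String lastName) { this.lastName = lastName; }
}

// PersonItemProcessor.java (sketch) - invoked for each record after it is
// read and before it is written, upper-casing the first and last name.
public class PersonItemProcessor implements ItemProcessor<Person, Person> {

    @Override
    public Person process(Person person) {
        return new Person(person.getFirstName().toUpperCase(),
                person.getLastName().toUpperCase());
    }
}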

Resource files are included to set up the database and provide sample data:

  • schema-all.sql - this is the database schema that will be created when the application starts up. In this sample, an in-memory database is created on startup and destroyed when the application exits.
  • data.csv - sample data file containing person records used in the demo
[Note]Note

This example expects to use the Spring Cloud Data Flow Server’s embedded H2 database. If you wish to use another repository, be sure to add the correct dependencies to the pom.xml and update the schema-all.sql.
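
Putting the pieces together, the wiring in BatchConfiguration might look like the following sketch. It uses the Spring Batch 4 builder APIs and the Person/PersonItemProcessor classes sketched above; the step-scoped reader binds the localFilePath job parameter supplied at launch time (see the launch step in the next section). The people table and its column names are assumptions to be checked against the sample’s schema-all.sql:

import javax.sql.DataSource;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.DefaultResourceLoader;

@Configuration
@EnableBatchProcessing
public class BatchConfiguration {

    // Step-scoped so the file location can be supplied as a job parameter at
    // launch time, e.g. --arguments "localFilePath=classpath:data.csv".
    @Bean
    @StepScope
    public FlatFileItemReader<Person> reader(
            @Value("#{jobParameters['localFilePath']}") String filePath) {
        return new FlatFileItemReaderBuilder<Person>()
                .name("personItemReader")
                .resource(new DefaultResourceLoader().getResource(filePath))
                .delimited()
                .names(new String[] { "firstName", "lastName" })
                .targetType(Person.class) // map columns onto Person via setters
                .build();
    }

    @Bean
    public PersonItemProcessor processor() {
        return new PersonItemProcessor();
    }

    // Table and column names are assumptions; align them with schema-all.sql.
    @Bean
    public JdbcBatchItemWriter<Person> writer(DataSource dataSource) {
        return new JdbcBatchItemWriterBuilder<Person>()
                .sql("INSERT INTO people (first_name, last_name) VALUES (:firstName, :lastName)")
                .beanMapped() // bind :firstName/:lastName to Person getters
                .dataSource(dataSource)
                .build();
    }

    @Bean
    public Job ingestJob(JobBuilderFactory jobs, StepBuilderFactory steps,
            FlatFileItemReader<Person> reader, JdbcBatchItemWriter<Person> writer) {
        Step step = steps.get("ingest")
                .<Person, Person>chunk(10)
                .reader(reader)
                .processor(processor())
                .writer(writer)
                .build();
        return jobs.get("ingestJob").start(step).build();
    }
}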

4.2.3 Building and Running the Demo

  1. Build the demo JAR

    $ mvn clean package
  2. Register the task application

    dataflow:>app register --name fileIngest --type task --uri file:///path/to/target/ingest-X.X.X.jar
    Successfully registered application 'task:fileIngest'
    dataflow:>
  3. Create the task

    dataflow:>task create fileIngestTask --definition fileIngest
    Created new task 'fileIngestTask'
    dataflow:>
  4. Launch the task

    dataflow:>task launch fileIngestTask --arguments "localFilePath=classpath:data.csv"
    Launched task 'fileIngestTask'
    dataflow:>
  5. Inspect logs

    The log file path for the launched task can be found in the local server output, for example:

    2017-10-27 14:58:18.112  INFO 19485 --- [nio-9393-exec-6] o.s.c.d.spi.local.LocalTaskLauncher      : launching task fileIngestTask-8932f73d-f17a-4bba-b44d-3fd9df042ac0
       Logs will be in /var/folders/6x/tgtx9xbn0x16xq2sx1j2rld80000gn/T/spring-cloud-dataflow-983191515779755562/fileIngestTask-1509130698071/fileIngestTask-8932f73d-f17a-4bba-b44d-3fd9df042ac0
  6. Verify Task execution details

    dataflow:>task execution list
    ╔══════════════╤══╤════════════════════════════╤════════════════════════════╤═════════╗
    ║  Task Name   │ID│         Start Time         │          End Time          │Exit Code║
    ╠══════════════╪══╪════════════════════════════╪════════════════════════════╪═════════╣
    ║fileIngestTask│1 │Fri Oct 27 14:58:20 EDT 2017│Fri Oct 27 14:58:20 EDT 2017│0        ║
    ╚══════════════╧══╧════════════════════════════╧════════════════════════════╧═════════╝
  7. Verify Job execution details

    dataflow:>job execution list
    ╔═══╤═══════╤═════════╤════════════════════════════╤═════════════════════╤══════════════════╗
    ║ID │Task ID│Job Name │         Start Time         │Step Execution Count │Definition Status ║
    ╠═══╪═══════╪═════════╪════════════════════════════╪═════════════════════╪══════════════════╣
    ║1  │1      │ingestJob│Fri Oct 27 14:58:20 EDT 2017│1                    │Created           ║
    ╚═══╧═══════╧═════════╧════════════════════════════╧═════════════════════╧══════════════════╝

4.2.4 Summary

In this sample, you have learned:

  • How to create a data processing batch job application
  • How to register and orchestrate Spring Batch jobs in Spring Cloud Data Flow
  • How to verify status via logs and shell commands