13. Deploying Streams on Kubernetes

In this getting started guide, the Data Flow Server is deployed to the Kubernetes cluster. This means that we need to make available an RDBMS service for stream and task repositories, app registry plus a transport option of either Kafka or Rabbit MQ. We also need a Redis instance if we are planning on using the analytics features.

[Important]Important

Currently, only apps registered with the Docker resource are supported by the Data Flow Server for Kubernetes. I.e. the below app registration is valid:

dataflow:>app register --type source --name time --uri docker://springcloudstream/time-source-kafka:latest

but any app registered with a Maven, HTTP or File resource (using a URI prefixed with maven://, http:// or file://) is not supported.

  1. Deploy a Kubernetes cluster.

    The Kubernetes Getting Started guide lets you choose among many deployment options so you can pick one that you are most comfortable using. We have successfully used the Vagrant option from a downloaded Kubernetes release.

    We have also used the Minikube project to run a local Kubernetes cluster for testing.

    The rest of this getting started guide assumes that you have a working Kubernetes cluster and a kubectl command line. For the MySQL service we used the gcloud comand line utility. See the docs for installing both these utilities: Installing Cloud SDK and Installing and Setting up kubectl.

  2. Create a Kafka service on the Kubernetes cluster.

    The Kafka service will be used for messaging between modules in the stream. You can instead use Rabbit MQ, but, in order to simplify, we only show the Kafka configurations in this guide. There are sample replication controller and service YAML files in the spring-cloud-dataflow-server-kubernetes repository that you can use as a starting point as they have the required metadata set for service discovery by the modules. For Kafka we use the files with a "zk" and "kafka" prefix.

    $ git clone https://github.com/spring-cloud/spring-cloud-dataflow-server-kubernetes
    $ cd spring-cloud-dataflow-server-kubernetes
    $ kubectl create -f src/etc/kubernetes/kafka-zk-controller.yml
    $ kubectl create -f src/etc/kubernetes/kafka-zk-service.yml
    $ kubectl create -f src/etc/kubernetes/kafka-controller.yml
    $ kubectl create -f src/etc/kubernetes/kafka-service.yml

    You can use the command kubectl get pods to verify that the controller and service is running. Use the command kubectl get services to check on the state of the service. Use the commands kubectl delete svc kafka and kubectl delete rc kafka-broker plus kubectl delete svc kafka-zk and kubectl delete rc kafka-zk to clean up afterwards.

  3. Create a MySQL service on the Kubernetes cluster.

    We are using MySQL for this guide, but you could use Postgres or H2 database instead. We include JDBC drivers for all three of these databases, you would just have to adjust the database URL and driver class name settings.

    Before creating the MySQL service we need to create a persistent disk and modify the password in the config file. To create a persistent disk you can use the following command:

    $ gcloud compute disks create mysql-disk --size 200 --type pd-standard

    Modify the password in the src/etc/kubernetes/mysql-controller.yml file inside the spring-cloud-dataflow-server-kubernetes repository. Then run the following commands to start the database service:

    $ kubectl create -f src/etc/kubernetes/mysql-controller.yml
    $ kubectl create -f src/etc/kubernetes/mysql-service.yml

    Again, you can use the command kubectl get pods to verify that the controller is running. Note that it can take a minute or so until there is an external IP address for the MySQL server. Use the command kubectl get services to check on the state of the service and look for when there is a value under the EXTERNAL_IP column. Use the commands kubectl delete svc mysql and kubectl delete rc mysql to clean up afterwards. Use the EXTERNAL_IP address to connect to the database and create a test database that we can use for our testing. Use your favorit SQL developer tool for this:

    CREATE DATABASE test;
  4. Create a Redis service on the Kubernetes cluster.

    The Redis service will be used for the analytics functionality. There are sample replication controller and service YAML files in the spring-cloud-dataflow-server-kubernetes repository that you can use as a starting point as they have the required metadata set for service discovery by the modules.

    $ kubectl create -f src/etc/kubernetes/redis-controller.yml
    $ kubectl create -f src/etc/kubernetes/redis-service.yml
    [Note]Note

    If you don’t need the analytics functionality you can turn this feature off by changing SPRING_CLOUD_DATAFLOW_FEATURES_ANALYTICS_ENABLED to false in the scdf-controller.yml file. If you don’t install the Redis service then you should also remove the Redis configuration settings in scdf-config-kafka.yml mentioned below.

  5. Update configuration files with values needed to connect to Kubernetes, MySQL and Redis.

    The Data Flow Server uses the Fabric8 Java client library to connect to the Kubernetes cluster. We are using environment variables to set the values needed when deploying the Data Flow server to Kubernetes. We are also using the Fabric8 Spring Cloud integration with Kubernetes library to access Kubernetes ConfigMap and Secrets settings. The ConfigMap settings are specified in the src/etc/kubernetes/scdf-config.yml file and the Secrets in the src/etc/kubernetes/scdf-secrets.yml file. Modify the password for MySQL in the latter if you changed it. It has to be provided encoded as base64.

    This approach supports using one Data Flow Server instance per Kubernetes namespace.

  6. Deploy the Spring Cloud Data Flow Server for Kubernetes using the Docker image and the configuration settings you just modified.

    $ kubectl create -f src/etc/kubernetes/scdf-config-kafka.yml
    $ kubectl create -f src/etc/kubernetes/scdf-secrets.yml
    $ kubectl create -f src/etc/kubernetes/scdf-service.yml
    $ kubectl create -f src/etc/kubernetes/scdf-controller.yml
    [Note]Note

    We haven’t tuned the memory use of the OOTB apps yet, so to be on the safe side we are increasing the memory for the pods by providing the following property: spring.cloud.deployer.kubernetes.memory=640Mi

    Use the kubectl get svc command to locate the EXTERNAL_IP address assigned to scdf, we use that to connect from the shell.

    $ kubectl get svc
    NAME         CLUSTER-IP       EXTERNAL-IP       PORT(S)    AGE
    kafka        10.103.248.211   <none>            9092/TCP   14d
    kubernetes   10.103.240.1     <none>            443/TCP    16d
    mysql        10.103.251.179   104.154.246.220   3306/TCP   10d
    redis        10.103.242.191   <none>            6379/TCP   8d
    scdf         10.103.246.82    130.211.203.246   9393/TCP   4m
    zk           10.103.243.29    <none>            2181/TCP   14d
  7. Download and run the Spring Cloud Data Flow shell.

    wget http://repo.spring.io/release/org/springframework/cloud/spring-cloud-dataflow-shell/1.1.2.RELEASE/spring-cloud-dataflow-shell-1.1.2.RELEASE.jar
    
    $ java -jar spring-cloud-dataflow-shell-1.1.2.RELEASE.jar

    Configure the Data Flow server URI with the following command (use the IP address from previous step and at the moment we are using port 9393):

      ____                              ____ _                __
     / ___| _ __  _ __(_)_ __   __ _   / ___| | ___  _   _  __| |
     \___ \| '_ \| '__| | '_ \ / _` | | |   | |/ _ \| | | |/ _` |
      ___) | |_) | |  | | | | | (_| | | |___| | (_) | |_| | (_| |
     |____/| .__/|_|  |_|_| |_|\__, |  \____|_|\___/ \__,_|\__,_|
      ____ |_|    _          __|___/                 __________
     |  _ \  __ _| |_ __ _  |  ___| | _____      __  \ \ \ \ \ \
     | | | |/ _` | __/ _` | | |_  | |/ _ \ \ /\ / /   \ \ \ \ \ \
     | |_| | (_| | || (_| | |  _| | | (_) \ V  V /    / / / / / /
     |____/ \__,_|\__\__,_| |_|   |_|\___/ \_/\_/    /_/_/_/_/_/
    
    1.1.2.RELEASE
    
    Welcome to the Spring Cloud Data Flow shell. For assistance hit TAB or type "help".
    server-unknown:>dataflow config server --uri http://130.211.203.246:9393
    Successfully targeted http://130.211.203.246:9393
    dataflow:>
  8. Register the Kafka version of the time and log apps using the shell and also register the timestamp app.

    dataflow:>app register --type source --name time --uri docker:springcloudstream/time-source-kafka:latest
    dataflow:>app register --type sink --name log --uri docker:springcloudstream/log-sink-kafka:latest
    dataflow:>app register --type task --name timestamp --uri docker:springcloudtask/timestamp-task:latest
  9. Alternatively, if you would like to register all out-of-the-box stream applications built with the Kafka binder in bulk, you can with the following command. For more details, review how to register applications.

    dataflow:>app import --uri http://bit.ly/stream-applications-kafka-docker
  10. Deploy a simple stream in the shell

    dataflow:>stream create --name ticktock --definition "time | log" --deploy

    You can use the command kubectl get pods to check on the state of the pods corresponding to this stream. We can run this from the shell by running it as an OS command by adding a "!" before the command.

    dataflow:>! kubectl get pods
    command is:kubectl get pods
    NAME                  READY     STATUS    RESTARTS   AGE
    kafka-d207a           1/1       Running   0          50m
    ticktock-log-qnk72    1/1       Running   0          2m
    ticktock-time-r65cn   1/1       Running   0          2m

    Look at the logs for the pod deployed for the log sink.

    $ kubectl logs -f ticktock-log-qnk72
    ...
    2015-12-28 18:50:02.897  INFO 1 --- [           main] o.s.c.s.module.log.LogSinkApplication    : Started LogSinkApplication in 10.973 seconds (JVM running for 50.055)
    2015-12-28 18:50:08.561  INFO 1 --- [hannel-adapter1] log.sink                                 : 2015-12-28 18:50:08
    2015-12-28 18:50:09.556  INFO 1 --- [hannel-adapter1] log.sink                                 : 2015-12-28 18:50:09
    2015-12-28 18:50:10.557  INFO 1 --- [hannel-adapter1] log.sink                                 : 2015-12-28 18:50:10
    2015-12-28 18:50:11.558  INFO 1 --- [hannel-adapter1] log.sink                                 : 2015-12-28 18:50:11
    [Note]Note

    If you need to specify any of the app specific configuration properties then you must use "long-form" of them including the app specific prefix like --jdbc.tableName=TEST_DATA. This is due to the server not being able to access the metadata for the Docker based starter apps. You will also not see the configuration properties listed when using the app info command or in the Dashboard GUI.

    [Note]Note

    If you need to be able to connect from outside of the Kubernetes cluster to an app that you deploy, like the http-source, then you can provide a deployment property of spring.cloud.deployer.kubernetes.createLoadBalancer=true for the app module to specify that you want to have a LoadBalancer with an external IP address created for your app’s service.

    To register the http-source and use it in a stream where you can post data to it, you can use the following commands:

    dataflow:>app register --type source --name http --uri docker:springcloudstream/http-source-kafka:latest
    dataflow:>stream create --name test --definition "http | log"
    dataflow:>stream deploy test --properties "app.http.spring.cloud.deployer.kubernetes.createLoadBalancer=true"

    Now, look up the external IP address for the http app (it can sometimes take a minute or two for the external IP to get assigned):

    dataflow:>! kubectl get service
    command is:kubectl get service
    NAME         CLUSTER-IP       EXTERNAL-IP      PORT(S)    AGE
    kafka        10.103.240.92    <none>           9092/TCP   7m
    kubernetes   10.103.240.1     <none>           443/TCP    4h
    test-http    10.103.251.157   130.211.200.96   8080/TCP   58s
    test-log     10.103.240.28    <none>           8080/TCP   59s
    zk           10.103.247.25    <none>           2181/TCP   7m

    Next, post some data to the test-http app:

    dataflow:>http post --target http://130.211.200.96:8080 --data "Hello"

    Finally, look at the logs for the test-log pod:

    dataflow:>! kubectl get pods
    command is:kubectl get pods
    NAME              READY     STATUS             RESTARTS   AGE
    kafka-o20qq       1/1       Running            0          9m
    mysql-o2v83       1/1       Running            0          9m
    redis-zb87a       1/1       Running            0          8m
    test-http-9obkq   1/1       Running            0          2m
    test-log-ysiz3    1/1       Running            0          2m
    dataflow:>! kubectl logs test-log-ysiz3
    command is:kubectl logs test-log-ysiz3
    ...
    2016-04-27 16:54:29.789  INFO 1 --- [           main] o.s.c.s.b.k.KafkaMessageChannelBinder$3  : started inbound.test.http.test
    2016-04-27 16:54:29.799  INFO 1 --- [           main] o.s.c.support.DefaultLifecycleProcessor  : Starting beans in phase 0
    2016-04-27 16:54:29.799  INFO 1 --- [           main] o.s.c.support.DefaultLifecycleProcessor  : Starting beans in phase 2147482647
    2016-04-27 16:54:29.895  INFO 1 --- [           main] s.b.c.e.t.TomcatEmbeddedServletContainer : Tomcat started on port(s): 8080 (http)
    2016-04-27 16:54:29.896  INFO 1 --- [  kafka-binder-] log.sink                                 : Hello

    A useful command to help in troubleshooting issues, such as a container that has a fatal error starting up, add the options --previous to view last terminated container log. You can also get more detailed information about the pods by using the kubctl describe like:

    kubectl describe pods/ticktock-log-qnk72
  11. Destroy the stream

    dataflow:>stream destroy --name ticktock
  12. Create a task and launch it

    Let’s create a simple task definition and launch it.

    dataflow:>task create task1 --definition "timestamp"
    dataflow:>task launch task1

    We can now list the tasks and executions using these commands:

    dataflow:>task list
    ╔═════════╤═══════════════╤═══════════╗
    ║Task Name│Task Definition│Task Status║
    ╠═════════╪═══════════════╪═══════════╣
    ║task1    │timestamp      │running    ║
    ╚═════════╧═══════════════╧═══════════╝
    
    dataflow:>task execution list
    ╔═════════╤══╤════════════════════════════╤════════════════════════════╤═════════╗
    ║Task Name│ID│         Start Time         │          End Time          │Exit Code║
    ╠═════════╪══╪════════════════════════════╪════════════════════════════╪═════════╣
    ║task1    │1 │Fri Jun 03 18:12:05 EDT 2016│Fri Jun 03 18:12:05 EDT 2016│0        ║
    ╚═════════╧══╧════════════════════════════╧════════════════════════════╧═════════╝
  13. Destroy the task

    dataflow:>task destroy --name task1