7. Data Science

7.1 Species Prediction

In this demonstration, you will learn how to use a PMML model in the context of a streaming data pipeline orchestrated by Spring Cloud Data Flow.

We will present the steps to prepare, configure, and run Spring Cloud Data Flow’s Local server, a Spring Boot application.

7.1.1 Prerequisites

  • A running Data Flow Shell

The Spring Cloud Data Flow Shell is available for download or you can build it yourself.

Note

The Spring Cloud Data Flow Shell and Local server implementation are in the same repository and are both built by running ./mvnw install from the project root directory. If you have already run the build, use the jar in spring-cloud-dataflow-shell/target.

To run the Shell, open a new terminal session:

$ cd <PATH/TO/SPRING-CLOUD-DATAFLOW-SHELL-JAR>
$ java -jar spring-cloud-dataflow-shell-<VERSION>.jar
  ____                              ____ _                __
 / ___| _ __  _ __(_)_ __   __ _   / ___| | ___  _   _  __| |
 \___ \| '_ \| '__| | '_ \ / _` | | |   | |/ _ \| | | |/ _` |
  ___) | |_) | |  | | | | | (_| | | |___| | (_) | |_| | (_| |
 |____/| .__/|_|  |_|_| |_|\__, |  \____|_|\___/ \__,_|\__,_|
  ____ |_|    _          __|___/                 __________
 |  _ \  __ _| |_ __ _  |  ___| | _____      __  \ \ \ \ \ \
 | | | |/ _` | __/ _` | | |_  | |/ _ \ \ /\ / /   \ \ \ \ \ \
 | |_| | (_| | || (_| | |  _| | | (_) \ V  V /    / / / / / /
 |____/ \__,_|\__\__,_| |_|   |_|\___/ \_/\_/    /_/_/_/_/_/


Welcome to the Spring Cloud Data Flow shell. For assistance hit TAB or type "help".
dataflow:>
Note

The Spring Cloud Data Flow Shell is a Spring Boot application that connects to the Data Flow Server’s REST API and supports a DSL that simplifies the process of defining a stream or task and managing its lifecycle. Most of these samples use the shell. If you prefer, you can use the Data Flow UI at localhost:9393/dashboard (or wherever the server is hosted) to perform equivalent operations.

  • A running local Data Flow Server. Follow the installation instructions to run Spring Cloud Data Flow on a local host.
  • A running instance of Kafka
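
Before moving on, it can help to confirm that the server and broker are actually accepting connections. The following is a minimal sketch, not part of the sample itself. It assumes both run on localhost, that the Data Flow Server listens on 9393 (the port used for the dashboard above), and that Kafka listens on its conventional default port 9092. The function names are hypothetical.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


# 9393 is the Data Flow Server port used in this section.
# 9092 is an assumption (Kafka's usual default); adjust it for your broker.
PREREQUISITES = {
    "Data Flow Server": ("localhost", 9393),
    "Kafka broker": ("localhost", 9092),
}


def check_prerequisites(targets=PREREQUISITES):
    """Map each prerequisite name to whether its port accepts connections."""
    return {
        name: is_listening(host, port)
        for name, (host, port) in targets.items()
    }
```

Calling check_prerequisites() returns a dictionary of names to booleans. A successful connection only shows that something is listening on the port, not that the service behind it is healthy.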

7.1.2 Building and Running the Demo

  1. Register the out-of-the-box applications for the Kafka binder

    Note

    These samples assume that the Data Flow Server can access a remote Maven repository, repo.spring.io/libs-release by default. If your Data Flow Server is running behind a firewall, or you are using a Maven proxy that prevents access to public repositories, you will need to install the sample apps in your internal Maven repository and configure the server accordingly.

    The sample applications are typically registered using Data Flow’s bulk import facility, for example with the Shell command dataflow:>app import --uri dataflow.spring.io/rabbitmq-maven-latest. The actual URI is release and binder specific, so refer to the sample instructions for the actual URL. The bulk import URI references a plain text file containing entries for all of the publicly available Spring Cloud Stream and Task applications published to repo.spring.io. For example, source.http=maven://org.springframework.cloud.stream.app:http-source-rabbit:2.1.0.RELEASE registers the http source app at the corresponding Maven address, relative to the remote repositories configured for the Data Flow Server. The format is maven://<groupId>:<artifactId>:<version>.

    If the public repository is not reachable, you will need to download the required apps or build them and then install them in your Maven repository, using whatever group, artifact, and version you choose. If you do this, register individual apps using dataflow:>app register… with the maven:// resource URI format corresponding to your installed app.

    dataflow:>app import --uri https://dataflow.spring.io/kafka-maven-latest
  2. Create and deploy the following stream

    dataflow:>stream create --name pmmlTest --definition "http --server.port=9001 | pmml --modelLocation=https://raw.githubusercontent.com/spring-cloud/spring-cloud-stream-modules/master/pmml-processor/src/test/resources/iris-flower-classification-naive-bayes-1.pmml.xml --inputs='Sepal.Length=payload.sepalLength,Sepal.Width=payload.sepalWidth,Petal.Length=payload.petalLength,Petal.Width=payload.petalWidth' --outputs='Predicted_Species=payload.predictedSpecies' --inputType='application/x-spring-tuple' --outputType='application/json'| log" --deploy
    Created and deployed new stream 'pmmlTest'
    Note

    The built-in pmml processor loads the given PMML model definition and creates an internal object representation that can be evaluated quickly. When the stream receives data, that data is used as the input for evaluating the analytical model iris-flower-classifier-1 contained in the PMML document. The result of this evaluation is a new field, predictedSpecies, created by the pmml processor by applying a classifier that uses the naiveBayes algorithm.

  3. Verify the stream is successfully deployed

    dataflow:>stream list
  4. Notice that the pmmlTest.http, pmmlTest.pmml, and pmmlTest.log Spring Cloud Stream applications are running within the Local server.

    2016-02-18 06:36:45.396  INFO 31194 --- [nio-9393-exec-1] o.s.c.d.d.l.OutOfProcessModuleDeployer   : deploying module org.springframework.cloud.stream.module:log-sink:jar:exec:1.0.0.BUILD-SNAPSHOT instance 0
       Logs will be in /var/folders/c3/ctx7_rns6x30tq7rb76wzqwr0000gp/T/spring-cloud-data-flow-3038434123335455382/pmmlTest-1455806205386/pmmlTest.log
    2016-02-18 06:36:45.402  INFO 31194 --- [nio-9393-exec-1] o.s.c.d.d.l.OutOfProcessModuleDeployer   : deploying module org.springframework.cloud.stream.module:pmml-processor:jar:exec:1.0.0.BUILD-SNAPSHOT instance 0
       Logs will be in /var/folders/c3/ctx7_rns6x30tq7rb76wzqwr0000gp/T/spring-cloud-data-flow-3038434123335455382/pmmlTest-1455806205386/pmmlTest.pmml
    2016-02-18 06:36:45.407  INFO 31194 --- [nio-9393-exec-1] o.s.c.d.d.l.OutOfProcessModuleDeployer   : deploying module org.springframework.cloud.stream.module:http-source:jar:exec:1.0.0.BUILD-SNAPSHOT instance 0
       Logs will be in /var/folders/c3/ctx7_rns6x30tq7rb76wzqwr0000gp/T/spring-cloud-data-flow-3038434123335455382/pmmlTest-1455806205386/pmmlTest.http
  5. Post sample data to the http endpoint: localhost:9001 (9001 is the port we specified for the http source in this case)

    dataflow:>http post --target http://localhost:9001 --contentType application/json --data "{ \"sepalLength\": 6.4, \"sepalWidth\": 3.2, \"petalLength\":4.5, \"petalWidth\":1.5 }"
    > POST (application/json;charset=UTF-8) http://localhost:9001 { "sepalLength": 6.4, "sepalWidth": 3.2, "petalLength":4.5, "petalWidth":1.5 }
    > 202 ACCEPTED
  6. Verify the predicted outcome by tailing the <PATH/TO/LOGAPP/pmmlTest.log/stdout_0.log> file. The predictedSpecies in this case is versicolor.

    {
      "sepalLength": 6.4,
      "sepalWidth": 3.2,
      "petalLength": 4.5,
      "petalWidth": 1.5,
      "Species": {
        "result": "versicolor",
        "type": "PROBABILITY",
        "categoryValues": [
          "setosa",
          "versicolor",
          "virginica"
        ]
      },
      "predictedSpecies": "versicolor",
      "Probability_setosa": 4.728207706362856E-9,
      "Probability_versicolor": 0.9133639504608079,
      "Probability_virginica": 0.0866360448109845
    }
  7. Let’s post with a slight variation in data.

    dataflow:>http post --target http://localhost:9001 --contentType application/json --data "{ \"sepalLength\": 6.4, \"sepalWidth\": 3.2, \"petalLength\":4.5, \"petalWidth\":1.8 }"
    > POST (application/json;charset=UTF-8) http://localhost:9001 { "sepalLength": 6.4, "sepalWidth": 3.2, "petalLength":4.5, "petalWidth":1.8 }
    > 202 ACCEPTED
    Note

    The petalWidth value changed from 1.5 to 1.8.

  8. The predictedSpecies will now be listed as virginica.

    {
      "sepalLength": 6.4,
      "sepalWidth": 3.2,
      "petalLength": 4.5,
      "petalWidth": 1.8,
      "Species": {
        "result": "virginica",
        "type": "PROBABILITY",
        "categoryValues": [
          "setosa",
          "versicolor",
          "virginica"
        ]
      },
      "predictedSpecies": "virginica",
      "Probability_setosa": 1.0443898084700813E-8,
      "Probability_versicolor": 0.1750120333571921,
      "Probability_virginica": 0.8249879561989097
    }
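
Steps 5 to 8 can also be driven from outside the shell. The following is a rough client-side sketch rather than anything shipped with the sample. It assumes the http source is still listening on localhost:9001 as configured in step 2, and the helper names build_request and most_probable_species are hypothetical. The first helper builds the same JSON POST that the shell's http post command sends. The second reads a logged result and reports which Probability_<species> field is largest, which should agree with predictedSpecies.

```python
import json
import urllib.request

# Port 9001 comes from --server.port in the stream definition (step 2).
HTTP_SOURCE_URL = "http://localhost:9001"


def build_request(sample: dict) -> urllib.request.Request:
    """Build a JSON POST for the http source without sending it."""
    body = json.dumps(sample).encode("utf-8")
    return urllib.request.Request(
        HTTP_SOURCE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def most_probable_species(logged: dict) -> str:
    """Return the species whose Probability_<species> value is largest."""
    prefix = "Probability_"
    probabilities = {
        key[len(prefix):]: value
        for key, value in logged.items()
        if key.startswith(prefix)
    }
    return max(probabilities, key=probabilities.get)


# Probabilities copied from the log output in step 6 (petalWidth 1.5).
STEP6_OUTPUT = {
    "predictedSpecies": "versicolor",
    "Probability_setosa": 4.728207706362856e-9,
    "Probability_versicolor": 0.9133639504608079,
    "Probability_virginica": 0.0866360448109845,
}

# Probabilities copied from the log output in step 8 (petalWidth 1.8).
STEP8_OUTPUT = {
    "predictedSpecies": "virginica",
    "Probability_setosa": 1.0443898084700813e-8,
    "Probability_versicolor": 0.1750120333571921,
    "Probability_virginica": 0.8249879561989097,
}

for output in (STEP6_OUTPUT, STEP8_OUTPUT):
    # The predicted label should be the class with the highest probability.
    assert most_probable_species(output) == output["predictedSpecies"]
```

To send a sample against the deployed stream, pass the built request to urllib.request.urlopen, for example urllib.request.urlopen(build_request({"sepalLength": 6.4, "sepalWidth": 3.2, "petalLength": 4.5, "petalWidth": 1.5})). Nothing is sent when the block above is run on its own.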

7.1.3 Summary

In this sample, you have learned:

  • How to use Spring Cloud Data Flow’s Local server
  • How to use Spring Cloud Data Flow’s shell application
  • How to use the pmml processor to compute real-time predictions