17. Configuring Runtime Settings and Environment

This section describes how settings related to running YARN applications can be modified.

17.1 Generic App Settings

All applications, whether they are stream apps or task apps, can be centrally configured with servers.yml, as that file is passed to every app using --spring.config.location='servers.yml'.
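
For example, common Hadoop connection settings placed in servers.yml are picked up by every deployed app. The snippet below is a minimal sketch and the host values are placeholders for an actual cluster.

spring:
  hadoop:
    fsUri: hdfs://localhost:8020
    resourceManagerHost: localhost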

17.2 Configuring Application Resources

Stream and task application masters and containers can be further tuned with memory and CPU settings. Java options additionally allow the actual JVM options to be defined.

spring:
  cloud:
    deployer:
      yarn:
        app:
          streamappmaster:
            memory: 512m
            virtualCores: 1
            javaOpts: "-Xms512m -Xmx512m"
          streamcontainer:
            priority: 5
            memory: 256m
            virtualCores: 1
            javaOpts: "-Xms64m -Xmx256m"
          taskappmaster:
            memory: 512m
            virtualCores: 1
            javaOpts: "-Xms512m -Xmx512m"
          taskcontainer:
            priority: 10
            memory: 256m
            virtualCores: 1
            javaOpts: "-Xms64m -Xmx256m"

17.3 Configure Base Directory

The base directory where all needed files are kept defaults to /dataflow and can be changed using the baseDir property.

spring:
  cloud:
    deployer:
      yarn:
        app:
          baseDir: /dataflow

17.4 Pre-populate Applications

Spring Cloud Data Flow app registration is based on URIs with various different schemes. As mentioned in Chapter 18, How YARN Deployment Works, all applications are first stored into HDFS before an application container is launched. The server can use http, file and maven based URIs as well as direct hdfs URIs.

It is possible to place these applications directly into HDFS and register an application based on that URI.
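
As a sketch of how this could look (the paths, jar name and app name below are examples only, not fixed conventions), the jar is copied into HDFS and then registered from the Data Flow shell using an hdfs URI.

hdfs dfs -mkdir -p /dataflow/apps/stream/sources
hdfs dfs -copyFromLocal time-source-kafka-1.0.0.jar /dataflow/apps/stream/sources/

dataflow:>app register --name time --type source --uri hdfs:/dataflow/apps/stream/sources/time-source-kafka-1.0.0.jar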

17.5 Configure Logging

Logging for all components is configured centrally via the servers.yml file using normal Spring Boot properties.

logging:
  level:
    org.apache.hadoop: INFO
    org.springframework.yarn: INFO

17.6 Global YARN Memory Settings

The YARN Nodemanager continuously tracks how much memory is used by individual YARN containers. If a container uses more memory than the configuration allows, it is simply killed by the Nodemanager. The application master controlling the app lifecycle is given a little more freedom, meaning that the Nodemanager is not as aggressive when deciding whether a container should be killed.

[Important]Important

These are global cluster settings and cannot be changed during an application deployment.

Let's take a quick look at the memory related settings in a YARN cluster and in YARN applications. The xml config below shows what a default vanilla Apache Hadoop uses for memory related settings. Other distributions may have different defaults.
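
The following sketch shows these properties as they would appear in yarn-site.xml, with the stock Hadoop 2.x default values; verify them against your own distribution.

<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>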

yarn.nodemanager.pmem-check-enabled
Enables a check for the physical memory of a process. This check, if enabled, directly tracks the amount of memory requested for a YARN container.
yarn.nodemanager.vmem-check-enabled
Enables a check for the virtual memory of a process. This setting is the one that usually causes containers of custom YARN applications to get killed by a node manager, either because the actual ratio between physical and virtual memory is higher than the default 2.1 or because bugs in an OS cause a wrong calculation of used virtual memory.
yarn.nodemanager.vmem-pmem-ratio
Defines the ratio of allowed virtual memory compared to physical memory. This ratio simply defines how much virtual memory a process can use, but the tracked size is always calculated from the physical memory limit.
yarn.scheduler.minimum-allocation-mb
Defines the minimum allocated memory for a container.

[Note]Note

This setting also indirectly defines the actual physical memory limit requested during a container allocation, which is always a multiple of this setting rounded to the upper bound. For example, if this setting is left at the default 1024 and a container is requested with 512M, 1024M is used. However, if the requested size is 1100M, the actual size is set to 2048M.

yarn.scheduler.maximum-allocation-mb
Defines the maximum allocated memory for a container.
yarn.nodemanager.resource.memory-mb
Defines how much memory a node controlled by a node manager is allowed to allocate. This should be set to the amount of memory the OS is able to give to YARN managed processes without causing the OS to swap, etc.

17.7 Configure Kerberos

Enabling Kerberos is relatively easy when an existing kerberized cluster is available. Just like with every other Hadoop related service, use a specific user and a keytab.

spring:
  hadoop:
    security:
      userPrincipal: scdf/[email protected]
      userKeytab: /etc/security/keytabs/scdf.service.keytab
      authMethod: kerberos
      namenodePrincipal: nn/[email protected]
      rmManagerPrincipal: rm/[email protected]
      jobHistoryPrincipal: jhs/[email protected]
[Note]Note

When using Ambari, configuration and keytab generation are fully automated.

17.7.1 Working with Kerberized Kafka

[Important]Important

Currently released Kafka based apps don't work with a cluster where ZooKeeper and Kafka itself are configured for Kerberos authentication. The workaround is to use Rabbit based apps or to build stream apps based on the new Kafka binder, which has support for kerberized Kafka.

After a Kafka based stream app has Kerberos support, some settings in Ambari's Kafka configuration need to be changed. Effectively, listeners and security.inter.broker.protocol need to use SASL_PLAINTEXT. The binder also needs to be able to create topics, so the scdf user needs to be added to Kafka's super users.

listeners=SASL_PLAINTEXT://localhost:6667
security.inter.broker.protocol=SASL_PLAINTEXT
super.users=user:kafka;user:scdf

Additional configuration is needed for the binder and for the SASL config.

spring:
  cloud:
    stream:
      kafka:
        binder:
          configuration:
            security:
              protocol: SASL_PLAINTEXT
    deployer:
      yarn:
        app:
          streamcontainer:
            saslConfig: "-Djava.security.auth.login.config=/etc/scdf/conf/scdf_kafka_jaas.conf"

Where scdf_kafka_jaas.conf looks something like the example below.

KafkaClient {
   com.sun.security.auth.module.Krb5LoginModule required
   useKeyTab=true
   keyTab="/etc/security/keytabs/scdf.service.keytab"
   storeKey=true
   useTicketCache=false
   serviceName="kafka"
   principal="scdf/[email protected]";
};
[Important]Important

When Ambari is kerberized via its wizard, everything else is automatically configured except the Kafka settings for super.users, listeners and security.inter.broker.protocol.

17.8 Configure Hdfs HA

Generic settings for Data Flow components to work with an HDFS HA setup can be seen below, where the nameservice id is set to mycluster.

spring:
  hadoop:
    fsUri: hdfs://mycluster:8020
    config:
      dfs.ha.automatic-failover.enabled=true
      dfs.nameservices=mycluster
      dfs.client.failover.proxy.provider.mycluster=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
      dfs.ha.namenodes.mycluster=nn1,nn2
      dfs.namenode.rpc-address.mycluster.nn2=ambari-3.localdomain:8020
      dfs.namenode.rpc-address.mycluster.nn1=ambari-2.localdomain:8020
[Note]Note

When using Ambari with an HDFS HA setup, this configuration is fully automated.

17.9 Configure Database

By default the Data Flow server starts an embedded H2 database using in-memory storage, effectively using the following configuration.

spring:
  datasource:
    url: jdbc:h2:tcp://localhost:19092/mem:dataflow
    username: sa
    password:
    driverClassName: org.h2.Driver

The distribution package contains a bundled self-contained H2 executable which can be used instead. This allows data to persist across server restarts and is not limited to a single host.

./bin/dataflow-server-yarn-h2 --dataflow.database.h2.directory=/var/run/scdf/data

The server is then pointed to this external instance with the datasource settings below.

spring:
  datasource:
    url: jdbc:h2:tcp://neo:19092/dataflow
    username: sa
    password:
    driverClassName: org.h2.Driver
[Important]Important

With an external H2 instance you cannot use localhost; use a real hostname instead.

[Note]Note

The port can be changed using the dataflow.database.h2.port property.
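
For example (a sketch only; the port value is arbitrary), the bundled H2 could be started on a different port, with the datasource url adjusted to match.

./bin/dataflow-server-yarn-h2 --dataflow.database.h2.port=19093 --dataflow.database.h2.directory=/var/run/scdf/data

spring:
  datasource:
    url: jdbc:h2:tcp://neo:19093/dataflow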

This bundled H2 database is also used in Ambari to provide default out-of-the-box functionality. Any database supported by Data Flow itself can be used by changing the datasource settings.
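
As an illustration (a sketch only; the host, database name, credentials and driver are assumptions, and the chosen JDBC driver must be available on the server's classpath), a MySQL datasource could be configured in servers.yml along these lines.

spring:
  datasource:
    url: jdbc:mysql://mysqlhost:3306/dataflow
    username: scdf
    password: secret
    driverClassName: org.mariadb.jdbc.Driver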

17.10 Configure Network Discovery

The YARN Deployer has to be able to talk to the Application Master, which is then responsible for controlling the containers running stream and task applications. The way this works is that the Application Master tries to discover its own address, which the YARN Deployer is then able to use. If YARN cluster nodes have multiple NICs, or the address is discovered incorrectly for some other reason, some settings can be changed to alter the default discovery logic.

Below are the generic settings that can be changed.

spring:
  yarn:
    hostdiscovery:
      pointToPoint: false
      loopback: false
      preferInterface: ['eth', 'en']
      matchIpv4: 192.168.0.0/24
      matchInterface: eth\\d*
  • pointToPoint - Skips all point-to-point interfaces, which are most likely VPNs. Defaults to false.
  • loopback - Skips the loopback interface. Defaults to false.
  • preferInterface - If multiple interface names exist, sets a preference order for discovery. The format is the interface name without the number qualifier, so for eth0 use eth. There is no default.
  • matchIpv4 - Matches an interface against its existing IP address, given in CIDR format. There is no default.
  • matchInterface - Matches an interface using a simple regex pattern, which gives even better control if complex interface combinations exist in a cluster. There is no default.