This section describes how settings related to running YARN applications can be modified. All applications, whether stream apps or task apps, can be centrally configured through servers.yml, as that file is passed to the apps using --spring.config.location=servers.yml.
Stream and task processes for the application master and its containers can be further tuned with memory and CPU settings. The javaOpts properties define the actual JVM options.
```yaml
spring:
  cloud:
    deployer:
      yarn:
        app:
          streamappmaster:
            memory: 512m
            virtualCores: 1
            javaOpts: "-Xms512m -Xmx512m"
          streamcontainer:
            priority: 5
            memory: 256m
            virtualCores: 1
            javaOpts: "-Xms64m -Xmx256m"
          taskappmaster:
            memory: 512m
            virtualCores: 1
            javaOpts: "-Xms512m -Xmx512m"
          taskcontainer:
            priority: 10
            memory: 256m
            virtualCores: 1
            javaOpts: "-Xms64m -Xmx256m"
```
The base directory where all needed files are kept defaults to /dataflow and can be changed using the baseDir property.
```yaml
spring:
  cloud:
    deployer:
      yarn:
        app:
          baseDir: /dataflow
```
Spring Cloud Data Flow app registration is based on URIs with various different endpoints. As mentioned in Chapter 18, How YARN Deployment Works, all applications are first stored in HDFS before the application container is launched. The server can use http, file, and maven based URIs, as well as direct hdfs URIs. It is also possible to place these applications directly into HDFS and register an application based on that URI.
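To illustrate the scheme-based dispatch (this is a sketch, not the server's actual implementation), the registration URIs differ only in their scheme, which standard URI parsing exposes. The hosts, coordinates, and paths below are hypothetical examples.

```python
from urllib.parse import urlparse

# Hypothetical registration URIs -- hosts, Maven coordinates and
# HDFS paths are illustrative only, not actual registered apps.
uris = [
    "http://repo.example.com/apps/http-source.jar",
    "file:///opt/apps/http-source.jar",
    "maven://org.example.stream.app:http-source-kafka:1.0.0",
    "hdfs:///dataflow/apps/stream/http-source.jar",
]

# A server can dispatch on the scheme; urlparse exposes it directly.
for uri in uris:
    print(urlparse(uri).scheme)
```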
Logging for all components is configured centrally via the servers.yml file using normal Spring Boot logging properties.
```yaml
logging:
  level:
    org.apache.hadoop: INFO
    org.springframework.yarn: INFO
```
The YARN NodeManager continuously tracks how much memory is used by individual YARN containers. If a container uses more memory than the configuration allows, the NodeManager simply kills it. The application master controlling the app lifecycle is given a little more freedom, meaning the NodeManager is not as aggressive when deciding whether that container should be killed.
Important: These are global cluster settings and cannot be changed during an application deployment.
Let's take a quick look at the memory-related settings in a YARN cluster and in YARN applications. The defaults below are what a vanilla Apache Hadoop installation uses; other distributions may have different defaults.

- yarn.nodemanager.vmem-check-enabled — enables the virtual memory check for containers. It can be worth disabling this check when the OS allocates virtual memory aggressively or bugs in an OS cause wrong calculation of used virtual memory.
- yarn.nodemanager.vmem-pmem-ratio — the ratio of allowed virtual memory to physical memory for a container; defaults to 2.1.
- yarn.scheduler.minimum-allocation-mb — defines the minimum allocated memory for a container; defaults to 1024.
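The relationship between a container's physical allocation and its virtual memory limit can be sketched numerically. This is an illustration only; the real enforcement happens inside the NodeManager, and 2.1 is the vanilla Hadoop default ratio mentioned above.

```python
# Sketch of how a container's virtual memory limit is derived from its
# physical allocation under yarn.nodemanager.vmem-pmem-ratio.
# Illustration only -- not the actual Hadoop code.

def vmem_limit_mb(pmem_alloc_mb, vmem_pmem_ratio=2.1):
    """Virtual memory (MB) a container may use before it is killed."""
    return pmem_alloc_mb * vmem_pmem_ratio

# A container with a 1024 MB physical allocation may use up to
# 1024 * 2.1 = 2150.4 MB of virtual memory with the default ratio.
print(vmem_limit_mb(1024))  # 2150.4
```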
Note: This setting also indirectly defines the actual physical memory limit requested during a container allocation. The actual physical memory limit is always a multiple of this setting, rounded to the upper bound. For example, if this setting is left at its default of 1024 and a container requests 512M, a full 1024M is allocated.
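The rounding effect can be sketched with a small calculation. This is an illustration of the behavior, not the scheduler's actual code, and it assumes the vanilla Hadoop default of 1024 MB for the minimum allocation.

```python
import math

# Sketch: YARN hands out container memory in multiples of
# yarn.scheduler.minimum-allocation-mb, rounding requests upward.
# Illustration only -- not the scheduler's implementation.

def effective_allocation_mb(requested_mb, min_allocation_mb=1024):
    """Round a memory request up to the next multiple of the minimum allocation."""
    return math.ceil(requested_mb / min_allocation_mb) * min_allocation_mb

print(effective_allocation_mb(512))   # 1024
print(effective_allocation_mb(1500))  # 2048
```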
Enabling Kerberos is relatively easy when a kerberized cluster already exists. As with every other Hadoop-related service, use a specific user and a keytab.
```yaml
spring:
  hadoop:
    security:
      userPrincipal: scdf/[email protected]
      userKeytab: /etc/security/keytabs/scdf.service.keytab
      authMethod: kerberos
      namenodePrincipal: nn/[email protected]
      rmManagerPrincipal: rm/[email protected]
      jobHistoryPrincipal: jhs/[email protected]
```
Note: When using Ambari, configuration and keytab generation are fully automated.
Important: Currently released Kafka-based apps don't work with a cluster where ZooKeeper and Kafka itself are configured for Kerberos authentication. The workaround is to use Rabbit-based apps, or to build stream apps based on the new Kafka binder, which supports kerberized Kafka.
Once a Kafka-based stream app has Kerberos support, some settings in Ambari's Kafka configuration need to be changed. Effectively, listeners and security.inter.broker.protocol need to use SASL_PLAINTEXT. The binder also needs to be able to create topics, so the scdf user needs to be added to Kafka's super users.
```properties
listeners=SASL_PLAINTEXT://localhost:6667
security.inter.broker.protocol=SASL_PLAINTEXT
super.users=user:kafka;user:scdf
```
Additional configuration is needed for the binder and for the SASL config.
```yaml
spring:
  cloud:
    stream:
      kafka:
        binder:
          configuration:
            security:
              protocol: SASL_PLAINTEXT
    deployer:
      yarn:
        app:
          streamcontainer:
            saslConfig: "-Djava.security.auth.login.config=/etc/scdf/conf/scdf_kafka_jaas.conf"
```
Here scdf_kafka_jaas.conf looks something like the following.
```
KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=true
    keyTab="/etc/security/keytabs/scdf.service.keytab"
    storeKey=true
    useTicketCache=false
    serviceName="kafka"
    principal="scdf/[email protected]";
};
```
Important: When Ambari is kerberized via its wizard, everything else is automatically configured except the Kafka settings for the scdf user.
Generic settings for the Dataflow components to work with an HDFS HA setup are shown below, where the nameservice id is set to mycluster.
```yaml
spring:
  hadoop:
    fsUri: hdfs://mycluster:8020
    config:
      dfs.ha.automatic-failover.enabled: true
      dfs.nameservices: mycluster
      dfs.client.failover.proxy.provider.mycluster: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
      dfs.ha.namenodes.mycluster: nn1,nn2
      dfs.namenode.rpc-address.mycluster.nn1: ambari-2.localdomain:8020
      dfs.namenode.rpc-address.mycluster.nn2: ambari-3.localdomain:8020
```
Note: When using Ambari and an HDFS HA setup, configuration is fully automated.
By default, a Dataflow server starts an embedded H2 database using in-memory storage, effectively using the following configuration.
```yaml
spring:
  datasource:
    url: jdbc:h2:tcp://localhost:19092/mem:dataflow
    username: sa
    password:
    driverClassName: org.h2.Driver
```
The distribution package contains a bundled self-contained H2 executable which can be used instead. This allows data to persist across server restarts and is not limited to a single host.
```
./bin/dataflow-server-yarn-h2 --dataflow.database.h2.directory=/var/run/scdf/data
```
```yaml
spring:
  datasource:
    url: jdbc:h2:tcp://neo:19092/dataflow
    username: sa
    password:
    driverClassName: org.h2.Driver
```
Important: With an external H2 instance you cannot use in-memory storage.
Note: The port can be changed using a configuration property.
This bundled H2 database is also used in Ambari to provide default out-of-the-box functionality. Any database supported by Dataflow itself can be used by changing the datasource settings.
The YARN deployer has to be able to talk to the application master, which in turn is responsible for controlling the containers running stream and task applications. The way this works is that the application master tries to discover its own address, which the YARN deployer can then use. If YARN cluster nodes have multiple NICs, or the address is otherwise discovered incorrectly, some settings can be changed to alter the default discovery logic. Below are the generic settings that can be changed.
```yaml
spring:
  yarn:
    hostdiscovery:
      pointToPoint: false
      loopback: false
      preferInterface: ['eth', 'en']
      matchIpv4: 192.168.0.0/24
      matchInterface: eth\\d*
```
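The matching semantics behind matchIpv4 and matchInterface can be illustrated in plain Python. This is a sketch of the matching logic only, not the deployer's actual implementation: matchIpv4 takes a CIDR range and matchInterface a regular expression applied to interface names.

```python
import ipaddress
import re

# Sketch of the matching semantics behind matchIpv4 and matchInterface.
# Illustration only -- not the deployer's actual implementation.

def matches_ipv4(addr, cidr="192.168.0.0/24"):
    """True if a discovered address falls inside the configured CIDR range."""
    return ipaddress.ip_address(addr) in ipaddress.ip_network(cidr)

def matches_interface(name, pattern=r"eth\d*"):
    """True if an interface name matches the configured regex."""
    return re.fullmatch(pattern, name) is not None

print(matches_ipv4("192.168.0.42"))  # True
print(matches_ipv4("10.0.0.1"))      # False
print(matches_interface("eth0"))     # True
print(matches_interface("wlan0"))    # False
```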