Vinod Kumar Vavilapalli commented on YARN-2884:
[~jianhe] mentioned this offline and the configuration approach concerns me too.
Stepping back, I think the current discovery of the Scheduler by the apps is
completely broken. Distributed Shell, for example, works only because it is a
Java application and the NM happens to put HADOOP_CONF_DIR on the classpath.
Irrespective of this JIRA, we need to fix the scheduler discovery for the apps.
The current way of depending on server configuration is unreliable in the face
The specific solution in this JIRA further breaks rolling upgrades and
configuration updates. If and when an admin forces client configuration
changes, the config written by the node will go out of sync. Overall, this
makes the situation worse.
I'd suggest that we start moving towards a better scheduler-discovery model. We
have already done similar work with the Timeline service (YARN-3039). We can
implement part of that here as environment-based discovery: the NodeManager
sets an environment variable, say YARN_SCHEDULER_ADDRESS for now, in the
AM env, and it is respected as the first-level discovery mechanism. As we add
more first-class discovery mechanisms, this env can take lower precedence.
This approach isn't too far from your current solution either: instead of
pointing to a conf-dir env, you are pointing to a scheduler-address env.
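To make the precedence idea concrete, here is a minimal sketch of how an AM-side
client might resolve the scheduler address. It is an illustration only, not the
patch's code: the class SchedulerDiscovery and its methods are hypothetical,
YARN_SCHEDULER_ADDRESS is just the name proposed above, and the "configured"
argument stands in for whatever the client read from its own configuration
(for example yarn.resourcemanager.scheduler.address).

```java
import java.net.InetSocketAddress;
import java.util.Map;

// Hypothetical helper: not part of YARN, shown only to illustrate precedence.
final class SchedulerDiscovery {
    // Env name as proposed in the comment; assumed to be set by the NodeManager.
    static final String ENV_NAME = "YARN_SCHEDULER_ADDRESS";

    // First level: the env value. Fallback: the address from client configuration.
    static InetSocketAddress resolve(Map<String, String> env, String configured) {
        String fromEnv = env.get(ENV_NAME);
        String chosen = (fromEnv != null && !fromEnv.trim().isEmpty())
                ? fromEnv
                : configured;
        if (chosen == null || chosen.trim().isEmpty()) {
            throw new IllegalStateException("no scheduler address available");
        }
        return parse(chosen.trim());
    }

    // Parses "host:port" without doing a DNS lookup.
    static InetSocketAddress parse(String hostPort) {
        int idx = hostPort.lastIndexOf(':');
        if (idx <= 0 || idx == hostPort.length() - 1) {
            throw new IllegalArgumentException("expected host:port, got " + hostPort);
        }
        String host = hostPort.substring(0, idx);
        int port = Integer.parseInt(hostPort.substring(idx + 1));
        return InetSocketAddress.createUnresolved(host, port);
    }
}
```

An AM would call SchedulerDiscovery.resolve(System.getenv(), configuredAddress)
once at startup. Because the env is consulted first, the NM can redirect the AM
without rewriting any configuration file, and later discovery mechanisms can be
slotted in ahead of the env lookup.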
> Proxying all AM-RM communications
> Key: YARN-2884
> URL: https://issues.apache.org/jira/browse/YARN-2884
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager, resourcemanager
> Reporter: Carlo Curino
> Assignee: Kishore Chaliparambil
> Attachments: YARN-2884-V1.patch, YARN-2884-V2.patch,
> YARN-2884-V3.patch, YARN-2884-V4.patch, YARN-2884-V5.patch,
> YARN-2884-V6.patch, YARN-2884-V7.patch, YARN-2884-V8.patch, YARN-2884-V9.patch
> We introduce the notion of an RMProxy, running on each node (or once per
> rack). Upon start the AM is forced (via tokens and configuration) to direct
> all its requests to a new service running on the NM that provides a proxy to
> the central RM.
> This gives us a place to:
> 1) perform distributed scheduling decisions
> 2) throttle misbehaving AMs
> 3) mask the access to a federation of RMs
This message was sent by Atlassian JIRA