[
https://issues.apache.org/jira/browse/YARN-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15151477#comment-15151477
]
Wangda Tan commented on YARN-4692:
----------------------------------
Thanks [~vinodkv] and everyone else working on this. The document is already
pretty comprehensive. Some thoughts/suggestions:
1) For running containers, instead of classifying them as service/batch, I
would prefer to tag them by application priority. For example, 0 is production
service tasks, 5 is batch jobs, etc. The reasons are:
- A service container is not always more important than other containers.
- An important service should be able to preempt containers from less important
services.
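To illustrate point 1, here is a minimal sketch of victim selection driven only
by application priority. It is not YARN code: the {{Container}} type, the
{{pick_preemption_victims}} helper, and the convention that a lower number means
a more important application (as in the 0/5 example above) are all assumptions
made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Container:
    container_id: str
    app_priority: int  # assumption: lower value = more important (0 = production service)
    memory_mb: int


def pick_preemption_victims(running, requester_priority, needed_mb):
    """Return containers to preempt, least important first.

    Only containers strictly less important than the requester are candidates,
    regardless of whether they belong to a service or a batch job.
    Returns [] if the demand cannot be satisfied.
    """
    candidates = [c for c in running if c.app_priority > requester_priority]
    # Least important first; container id breaks ties deterministically.
    candidates.sort(key=lambda c: (-c.app_priority, c.container_id))

    victims, freed = [], 0
    for c in candidates:
        if freed >= needed_mb:
            break
        victims.append(c)
        freed += c.memory_mb
    return victims if freed >= needed_mb else []
```

With this rule a priority-0 service can take resources from a priority-3
service as well as from priority-5 batch jobs, and no separate service/batch
flag is needed.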
2) Whether a container is service or batch depends on the duration of the task;
we have had lots of discussions about this on YARN-1039 already.
3) For 3.2.2 container auto restart: beyond restarting a container when it dies,
we could let the framework check the health of running tasks. For example, it
could support an embedded REST API to get the health status of containers. With
this, the framework can restart malfunctioning containers.
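A rough sketch of what point 3 could look like from the framework side. The
endpoint, the {{healthy}} JSON field, the {{HealthMonitor}} class, and the
"restart after N consecutive failures" rule are hypothetical; none of them is
an existing YARN API.

```python
import json
from urllib.request import urlopen


def rest_probe(url, timeout_s=2.0):
    """Hypothetical probe: GET a JSON document and read its 'healthy' flag.

    Any network error, timeout, or malformed response counts as unhealthy.
    """
    try:
        with urlopen(url, timeout=timeout_s) as resp:
            data = json.load(resp)
    except (OSError, ValueError):
        return False
    return isinstance(data, dict) and data.get("healthy") is True


class HealthMonitor:
    """Restarts a container after max_failures consecutive failed probes."""

    def __init__(self, probe, restart, max_failures=3):
        self.probe = probe              # callable(endpoint) -> bool
        self.restart = restart          # callable(container_id), supplied by the framework
        self.max_failures = max_failures
        self.failures = {}              # container_id -> consecutive failure count

    def check(self, container_id, endpoint):
        """Probe once; return True if a restart was triggered."""
        if self.probe(endpoint):
            self.failures[container_id] = 0
            return False
        count = self.failures.get(container_id, 0) + 1
        if count >= self.max_failures:
            self.failures[container_id] = 0
            self.restart(container_id)
            return True
        self.failures[container_id] = count
        return False
```

Requiring several consecutive failures before restarting is one way to avoid
bouncing a container on a single slow response; the threshold is an arbitrary
choice here.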
4) For 3.2.7 Scheduling / Queue model:
Beyond the queue model, we should take long-running containers into account when
reserving a large container on a node.
5) Debuggability for service containers is also very important:
- Tools similar to [cAdvisor|https://github.com/google/cadvisor] could be very
helpful for figuring out issues with service tasks.
- We also need a tool to show aggregated scheduling-related information for
apps/queues/cluster.
*For comments from [~asuresh]:*
bq. we can give applications the ability to specify Preemptability of
containers in a particular role...
Instead of adding a new field, I think we can reuse container priority and
application priority to describe preemptability.
bq. Allow LR Applications to specify peak, min and variance/mean (also many
transient and steady-state) of a Resource request to allow schedulers to make
better allocation decisions.
I think this is hard for end users to know. Our framework should be able to
figure out such metrics for running containers. For newly requested containers,
we'd better assume they will use 100% of the requested resources.
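As a sketch of that split (illustrative only; {{estimated_usage_mb}} and the
sample format are assumptions, not an existing interface): derive peak, mean,
and variance from observed samples when a container is already running, and
fall back to the full request when there is no history.

```python
from statistics import mean, pvariance


def estimated_usage_mb(requested_mb, samples_mb=None):
    """Estimate memory usage for a container.

    samples_mb: observed usage samples for a running container (hypothetical
    input, e.g. collected by the node). With no samples, i.e. a newly
    requested container, assume 100% of the request is used.
    """
    if not samples_mb:
        return {"peak": requested_mb, "mean": requested_mb, "variance": 0.0}
    return {
        "peak": max(samples_mb),
        "mean": mean(samples_mb),
        "variance": pvariance(samples_mb),
    }
```

This way the user only ever states the request, and the peak/mean/variance
figures come from measurement rather than from guesses.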
bq. In YARN-4597 Chris Douglas proposed ...
In my mind, YARN-4597 is targeted at low-latency batch tasks. If service
tasks run for one hour or more, it's not a big deal to take several minutes
to set them up.
And I agree that the reservation system (YARN-1051) is the ultimate solution for
the queue model and container allocation for services.
> [Umbrella] Simplified and first-class support for services in YARN
> ------------------------------------------------------------------
>
> Key: YARN-4692
> URL: https://issues.apache.org/jira/browse/YARN-4692
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: Vinod Kumar Vavilapalli
> Assignee: Vinod Kumar Vavilapalli
> Attachments:
> YARN-First-Class-And-Simplified-Support-For-Services-v0.pdf
>
>
> YARN-896 focused on getting the ball rolling on the support for services
> (long running applications) on YARN.
> I’d like to propose the next stage of this effort: _Simplified and first-class
> support for services in YARN_.
> The chief rationale for filing a separate new JIRA is threefold:
> - Do a fresh survey of all the things that are already implemented in the
> project
> - Weave a comprehensive story around what we further need and attempt to
> rally the community around a concrete end-goal, and
> - Additionally focus on functionality that YARN-896 and friends left for
> higher layers to take care of and see how much of that is better integrated
> into the YARN platform itself.