[
https://issues.apache.org/jira/browse/YARN-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335263#comment-14335263
]
Robert Joseph Evans commented on YARN-2140:
-------------------------------------------
Network traffic for batch jobs is inherently bursty, but also typically quite
robust in the face of slow networking and network hiccups. The SLAs are not
tight enough and the retries cover up enough problems that we typically don't
need to worry about it. However, not all applications have the same traffic
pattern or the same robustness to network issues, and we want to be able to
cleanly support as many of these patterns as possible.
I see several areas to think about; some of them conflict with each other and
need to be balanced.
Some applications are very sensitive to network latency, whereas others are
not. Some applications are very sensitive to having a consistent or guaranteed
network throughput, whereas others are not.
We want to avoid saturation of the network (primarily to be able to support the
above two) but we also want to have high resource utilization (and networking
typically degrades gracefully when over-allocated).
We want to have data spread across multiple failure domains but also want to
colocate containers near each other to reduce network traffic.
There are also two different forms of traffic that we need to consider:
external traffic and internal traffic. External traffic cannot be reduced by
colocating containers; internal traffic typically can.
Let's take a static web server as one extreme. All of the traffic is external
traffic; the servers don't talk to each other. For robustness I would want to
have them spread out across multiple racks, so that if one rack goes down for
some reason I have a backup and my web site stays up, although possibly
degraded until I can launch more containers. The web servers are sensitive to
network latency, and typically want a minimum amount of available bandwidth so
they can achieve the desired network latency in a cost-effective manner.
If the scheduler does what you initially described and always tries to place
containers on the same rack, we will be at risk of a single rack failure
bringing down my entire web site.
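For what it's worth, that kind of rack spreading can already be approximated
with the existing AMRMClient API by pinning one request to each rack and
turning locality relaxation off. A rough sketch (the rack names and resource
sizes here are placeholders, not anything from the design):
{code:java}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class SpreadWebServers {
  // Ask for one web server container on each rack. relaxLocality=false keeps
  // the scheduler from collapsing everything onto whichever rack frees up
  // first.
  public static void addSpreadRequests(AMRMClient<ContainerRequest> amRmClient) {
    Resource capability = Resource.newInstance(2048, 2); // made-up sizing
    Priority priority = Priority.newInstance(10);
    String[] racks = {"/rack1", "/rack2", "/rack3"};      // placeholder racks
    for (String rack : racks) {
      amRmClient.addContainerRequest(new ContainerRequest(
          capability, null, new String[] { rack }, priority, false));
    }
  }
}
{code}
It is clunky, though: the AM has to know the rack names up front and do all of
the bookkeeping itself, which is part of why a first-class hint API appeals to
me.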
For most applications it is a balancing act between spreading data across
failure domains and increasing the locality to reduce internal communication
overhead. Storm tends to be the other extreme because it stores no data
internally, so it can ignore failure domains, but in most cases has a
significant amount of internal network traffic. I would much rather see
scheduler APIs added that allow the App Master to give hints on how it wants
particular containers scheduled.
Then there are a number of general scheduler questions around locality that
this brings up, because we are really just extending the definition of
locality from "close to this HDFS block/machine" to "close to these other
containers", and preferably not to all other containers in the application,
because almost all applications have different types of containers.
To be able to have that type of an API, it feels like we would want to support
gang scheduling, but perhaps not.
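To make the hint idea concrete, here is a purely hypothetical sketch; none of
these classes exist in YARN and the names are made up. The point is that the
AM names groups of containers and states how the groups relate, instead of the
scheduler applying one policy to the whole application:
{code:java}
// Hypothetical API sketch only; nothing here exists in YARN.
public final class PlacementHint {
  public enum Relation { AFFINITY, ANTI_AFFINITY }
  public enum Scope { NODE, RACK }

  private final String sourceGroup;     // e.g. "bolts"
  private final String targetGroup;     // e.g. "spouts"
  private final Relation relation;
  private final Scope scope;
  private final boolean hardConstraint; // false = best effort, may be ignored

  public PlacementHint(String sourceGroup, String targetGroup,
                       Relation relation, Scope scope, boolean hardConstraint) {
    this.sourceGroup = sourceGroup;
    this.targetGroup = targetGroup;
    this.relation = relation;
    this.scope = scope;
    this.hardConstraint = hardConstraint;
  }

  // "Put the bolts near the spouts they read from, if you can."
  public static PlacementHint colocate(String group, String near) {
    return new PlacementHint(group, near, Relation.AFFINITY, Scope.RACK, false);
  }

  // "Never put all of this group in one failure domain."
  public static PlacementHint spread(String group, Scope scope) {
    return new PlacementHint(group, group, Relation.ANTI_AFFINITY, scope, true);
  }
}
{code}
Whether a hard constraint like that then forces us into gang scheduling is
exactly the open question above.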
Then there is the question of how long the scheduler should try to get the
best container. Right now that is cluster-wide, but for Storm I would really
like to be able to say it is OK to spend 10 seconds to get these containers,
but for this high-priority one, because something just crashed, I want it
ASAP.
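Again hypothetically, a per-request deadline could be as simple as the sketch
below. Today the closest thing is a scheduler-wide knob like
yarn.scheduler.capacity.node-locality-delay, which is counted in missed
scheduling opportunities rather than per request:
{code:java}
// Hypothetical per-request allocation deadline; not an existing YARN API.
public final class AllocationDeadline {
  private final long maxWaitMillis; // 0 = give me anything, right now

  private AllocationDeadline(long maxWaitMillis) {
    this.maxWaitMillis = maxWaitMillis;
  }

  // "It is OK to spend 10 seconds getting these containers well placed."
  public static AllocationDeadline bestPlacementWithin(long millis) {
    return new AllocationDeadline(millis);
  }

  // "Something just crashed, give me a replacement ASAP, wherever it fits."
  public static AllocationDeadline immediate() {
    return new AllocationDeadline(0);
  }

  public long getMaxWaitMillis() {
    return maxWaitMillis;
  }
}
{code}
The scheduler would relax from node-local to rack-local to off-switch as the
deadline burns down, the same way delay scheduling already does, just scoped
to the request instead of the cluster.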
The current design seems to only handle avoiding saturation of the local
node's outbound network connection, and attempts to reduce internal traffic by
locating containers for the same application close to one another. It seemed
to imply that it would do something with the top-of-rack switch to avoid
saturating that, but how that would work was not explicitly called out. And if
we cannot separate out internal vs. external traffic, I fear the top-of-rack
switch will either become saturated in some cases, or cause other resources in
the rack to go unused because we think the top-of-rack switch is saturated.

To me the outbound traffic shaping seems like a step in the right direction,
but I honestly don't know if it is going to be enough for me to feel safe
running production Storm topologies on a YARN cluster. Think of a Storm
topology reading a real firehose of data from Kafka at 2 Gigabit/sec,
something that would saturate the inbound link of a typical node two times
over. This is completely ignored by the current design, simply because there
is no easy way to do traffic shaping on inbound traffic. If I look at how
cloud providers like Amazon and Google handle networking, they are doing
things very similar to OpenFlow. I am not saying that we need to recreate
this, but any design we come up with should really think about supporting
something similar long term, and do something good enough short term, so that
the user-facing APIs don't need to change even if the enforcement, or our
ability to get better utilization, improves over time with better hardware
support.
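To make that last point concrete, here is a sketch of what a bandwidth-aware
ask could look like if inbound and outbound traffic were tracked separately.
This is my own illustration, not something from the attached design: outbound
demand can at least be enforced on the node (tc on egress), while inbound
demand mostly cannot, so the scheduler would have to honor it purely through
placement.
{code:java}
// Hypothetical illustration; YARN's Resource today only carries memory and
// vcores.
public final class NetworkDemand {
  // Outbound can be enforced on the node, e.g. by shaping egress with tc.
  private final int outboundMbitPerSec;
  // Inbound generally cannot be shaped on the receiving node, so the
  // scheduler would have to account for it at placement time, e.g. 2000 for
  // the Kafka firehose case above, spread across enough nodes.
  private final int inboundMbitPerSec;

  public NetworkDemand(int outboundMbitPerSec, int inboundMbitPerSec) {
    this.outboundMbitPerSec = outboundMbitPerSec;
    this.inboundMbitPerSec = inboundMbitPerSec;
  }

  public int getOutboundMbitPerSec() { return outboundMbitPerSec; }

  public int getInboundMbitPerSec() { return inboundMbitPerSec; }
}
{code}
Whether the enforcement behind an ask like that is tc today and something
OpenFlow-like tomorrow, the user-facing part stays the same, which is the
point about keeping the APIs stable.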
> Add support for network IO isolation/scheduling for containers
> --------------------------------------------------------------
>
> Key: YARN-2140
> URL: https://issues.apache.org/jira/browse/YARN-2140
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: Wei Yan
> Assignee: Wei Yan
> Attachments: NetworkAsAResourceDesign.pdf
>
>