All AMs are already free to implement their own policy of tracking
outstanding requests and reacting to them today. I don't know of any that do.

https://issues.apache.org/jira/browse/YARN-624 covers "gang scheduling", in
which an AM can say "don't assign any containers until the full set of
requirements can be met". This addresses the more complex
dining-philosophers problem in which:

AM1 requests 4x 2 GB containers, gets back 3, and holds on to them, not
starting work until it has all four.
AM2 requests 2x 4 GB containers, gets one, and waits for the other.

A timeout will tell them both they've failed and get them to react, which,
unless they are clever, will probably just mean failing themselves. However,
there are enough resources to satisfy both AMs sequentially.
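As a concrete illustration, the AM-side policy mentioned above can be
sketched as a small tracker of outstanding requests. This is a minimal,
hypothetical sketch (the class, its names, and the timeout value are
invented for illustration; it is not part of any YARN client API):

```python
class OutstandingRequestTracker:
    """Hypothetical AM-side helper: remembers when each container
    request was issued so the AM can react to ones left unfulfilled."""

    def __init__(self, timeout_secs):
        self.timeout_secs = timeout_secs
        self.pending = {}  # request_id -> time the request was issued

    def request(self, request_id, now):
        # Record an outstanding container request.
        self.pending[request_id] = now

    def allocated(self, request_id):
        # A container arrived for this request; stop tracking it.
        self.pending.pop(request_id, None)

    def timed_out(self, now):
        # Requests that have waited longer than the timeout.
        return [rid for rid, t in self.pending.items()
                if now - t > self.timeout_secs]

# Usage: an AM asks for 4 containers but only 3 are ever allocated.
tracker = OutstandingRequestTracker(timeout_secs=300)
for rid in range(4):
    tracker.request(rid, now=0)
for rid in range(3):
    tracker.allocated(rid)
# 600s later the fourth request has exceeded the timeout.
assert tracker.timed_out(now=600) == [3]
```

On a timeout the AM could release the containers it already holds, fail
fast with a clear diagnostic, or shrink its request, rather than hanging
forever as in the scenario below.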


There's another possibility in cloud environments: something gets told that
more compute capacity is required, possibly triggering requests for more
VMs.


On 2 September 2014 02:36, Wangda Tan <[email protected]> wrote:

> Hi Naga,
> AFAIK, there's no such timeout,
> Since it's a new feature request, I'd suggest creating a JIRA and moving
> the discussion there.
>
> Thanks,
> Wangda
>
>
> On Tue, Sep 2, 2014 at 8:52 AM, Naganarasimha G R (Naga) <
> [email protected]> wrote:
>
> > Hi,
> >
> >     "AM can take action if it doesn't receive any container for some
> > time."
> >
> > Can we have a generic timeout feature for all AMs on the YARN side, such
> > that if no containers are assigned to an application for a defined
> > period, YARN can time out the application attempt?
> >
> > The default could be 0, meaning the RM will not time out the app attempt,
> > and the user could set his own timeout when submitting the application.
> >
> >
> >
> > Basically we faced this issue in MR2 itself, and I was not able to find
> > any such timeout in the MapReduce config.
> >
> > Regards,
> > Naga
> > ________________________________________
> > From: Wangda Tan [[email protected]]
> > Sent: Tuesday, September 02, 2014 07:59
> > To: [email protected]
> > Subject: Re: Regarding a scenario where applications are being hung
> >
> > Hi Naga,
> > When trying to allocate a container, the behavior is:
> > First it checks the capacity of the queue; in your case the queue
> > capacity is 8 GB * 2 = 16 GB, so a 7 GB container passes the check.
> > Then it checks whether there's enough space on a single node; in your
> > case there isn't, so the ResourceRequest is skipped.
> >
> > There's no "timeout" for a ResourceRequest now; the AM can take action
> > if it doesn't receive any container for some time.
> > Preemption is another story: it is used to reclaim resources for an
> > under-satisfied queue from an over-satisfied one. That's not your case
> > either.
> >
> > Hope this helps,
> > Wangda
> >
> >
> >
> > On Mon, Sep 1, 2014 at 11:11 PM, Naganarasimha G R (Naga) <
> > [email protected]> wrote:
> >
> > > Hi Wangda,
> > >
> > >               Yes, it's the case where it's requesting 7 GB per
> > > container. But can you describe why it's the expected behavior?
> > >
> > > From a user perspective, either it should not have accepted the
> > > application request, or after some time both apps should have been
> > > killed with a proper exception or log information, or one of the AM
> > > containers should have been preempted, etc.
> > >
> > > Here none of these are happening!
> > >
> > >
> > > Regards,
> > > Naga
> > >
> > >
> > >
> > > ________________________________________
> > > From: Wangda Tan [[email protected]]
> > > Sent: Monday, September 01, 2014 22:51
> > > To: [email protected]
> > > Subject: Re: Regarding a scenario where applications are being hung
> > >
> > > Hi Naga,
> > > According to the scenario you described: if "Now each AM is requesting
> > > for container of 7Gb mem resource" is a request for a single 7 GB
> > > container (not 7 containers of 1 GB each), it is expected behavior.
> > > Please let me know if you have more questions.
> > >
> > > Thanks,
> > > Wangda
> > >
> > >
> > > On Mon, Sep 1, 2014 at 10:46 PM, Naganarasimha G R (Naga) <
> > > [email protected]> wrote:
> > >
> > > > Hi ,
> > > >
> > > >     I have one scenario which makes applications hang, so I wanted
> > > > to validate whether it's a bug (if so, I will raise a JIRA).
> > > >
> > > > Consider a cluster setup with 2 NMs of 8 GB each.
> > > >
> > > > 2 applications are launched in the default queue, with each AM
> > > > taking 2 GB.
> > > >
> > > > One AM is placed on each NM. Now each AM is requesting a container
> > > > of 7 GB of memory.
> > > >
> > > > As each NM has only 6 GB available, both applications hang forever.
> > > >
> > > >
> > > >
> > > > Is this a bug?
> > > >
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Naga
> > > >
> > > >
> > > >
> > >
> >
>
