
[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2018-01-29 Thread Tao Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343997#comment-16343997
 ] 

Tao Zhang commented on YARN-2175:
-

We're facing this issue too. Localization for some containers may take a long 
time due to underlying HDFS issues, machine network conditions, etc. The AM will 
request more containers when it doesn't see enough available containers (ones 
that have finished localization). However, YARN keeps the stuck containers in 
localization, and there is no good automatic way to kill them. A dynamically 
adjusting timeout for the "localizing" state would help here.

Compared to a "pre-configured" timeout value, it'd be better to have a 
"dynamically adjusting" timeout. E.g., we calculate the average localization 
time for the first 50% of an app's containers, then set 
*2 * avg_localizing_time_of_half_containers* as the timeout threshold for the 
remaining containers. This requires the localization times of all of the app's 
containers. Hence the *RM* would be the appropriate component to implement this 
feature (container localization timeout). The AM may not be a good choice, since 
"localization" is a common YARN process and we don't want to implement this 
feature separately for each type of ApplicationMaster.
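
A minimal sketch of the heuristic proposed above, in plain Java with no YARN 
dependencies. The class name DynamicLocalizationTimeout and its methods are 
hypothetical and do not exist in YARN. Rounding "half" up and averaging the 
first containers to finish are assumptions, since the comment does not pin 
these details down.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalLong;

/** Hypothetical helper, not part of YARN: per-app dynamic localization timeout. */
final class DynamicLocalizationTimeout {
    private final int totalContainers;   // containers requested by one app
    private final double multiplier;     // 2.0 in the proposal above
    private final List<Long> samplesMs = new ArrayList<>();

    DynamicLocalizationTimeout(int totalContainers, double multiplier) {
        if (totalContainers <= 0 || multiplier <= 0) {
            throw new IllegalArgumentException("arguments must be positive");
        }
        this.totalContainers = totalContainers;
        this.multiplier = multiplier;
    }

    /** Record the localization time of a container that finished localizing. */
    void recordLocalized(long durationMs) {
        samplesMs.add(durationMs);
    }

    /**
     * Empty until half of the containers (rounded up) have localized;
     * afterwards, multiplier times the mean of that first half.
     */
    OptionalLong timeoutMs() {
        int half = (totalContainers + 1) / 2;
        if (samplesMs.size() < half) {
            return OptionalLong.empty();
        }
        long sum = 0;
        for (int i = 0; i < half; i++) {
            sum += samplesMs.get(i);
        }
        double average = (double) sum / half;
        return OptionalLong.of(Math.round(multiplier * average));
    }

    /** True if a still-localizing container has exceeded the threshold. */
    boolean shouldKill(long elapsedMs) {
        OptionalLong limit = timeoutMs();
        return limit.isPresent() && elapsedMs > limit.getAsLong();
    }
}
```

For example, with 4 containers and a multiplier of 2, localization times of 
10 s and 20 s for the first two give an average of 15 s, so the remaining two 
would be killed once they have been localizing for more than 30 s.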

 

> Container localization has no timeouts and tasks can be stuck there for a 
> long time
> ---
>
> Key: YARN-2175
> URL: https://issues.apache.org/jira/browse/YARN-2175
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.4.0
> Reporter: Anubhav Dhoot
> Priority: Major
>
> There are no timeouts that can be used to limit the time taken by various 
> container startup operations. Localization, for example, could take a long 
> time, and there is no automated way to kill a task if it's stuck in these 
> states. These may have nothing to do with the task itself and could be an 
> issue within the platform.
> Ideally there should be configurable limits on the time spent in various 
> states within the NodeManager. The RM does not care about most of these; 
> it's only between the AM and the NM. We can start by making these global 
> configurable defaults, and in future we can make it fancier by letting the AM 
> override them in the start container request. 
> This JIRA will be used to limit localization time, and we can open others if 
> we feel we need to limit other operations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2015-03-31 Thread Karthik Kambatla (JIRA)

[
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389173#comment-14389173
]

Karthik Kambatla commented on YARN-2175:

In our testing, we see this issue with the AM container itself potentially
taking longer than 10 mins to localize and the RM kills this attempt since it
hasn't heard from it.

Container localization has no timeouts and tasks can be stuck there for a
long time
---

Key: YARN-2175
URL: https://issues.apache.org/jira/browse/YARN-2175
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.4.0
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot



[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2014-07-02 Thread Anubhav Dhoot (JIRA)

[
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050533#comment-14050533
]

Anubhav Dhoot commented on YARN-2175:
-

We have seen it happen when the source file system had issues. Some jobs would
intermittently take a long time to fail and would succeed on rerun because the
jars were put in a new distributed cache location when rerun. Without this
timeout we have no lever to mitigate underlying HDFS/hardware issues in
production until the root cause is identified and fixed.
Also, in comparison with mapreduce.task.timeout, this seems very focused on
a specific operation: localization. I would expect this timeout to default
to a large value in production (say 30 min) and to be used only for mitigation
when an issue occurs in production.


[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2014-07-02 Thread Karthik Kambatla (JIRA)

[
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050539#comment-14050539
]

Karthik Kambatla commented on YARN-2175:

I'll let Anubhav provide details on the particular instance in which we ran
into this.

bq. We should try to address the right individual problem with its solution
before we put a band-aid that may still be useful for issues that we cannot
just address directly if any.
Having worked on several MR1 production issues, I see your point. I agree we
should look into and address individual problems. That said, I also believe in
failsafes to avoid bringing down a production cluster or failing a critical job
altogether in the face of hardware issues. That gives us time to fix the
individual issues correctly when we encounter them, instead of hurrying for a
hot fix.


[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2014-07-02 Thread Karthik Kambatla (JIRA)

[
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050543#comment-14050543
]

Karthik Kambatla commented on YARN-2175:

In MR1, mapred.task.timeout handles localization as well and that has worked
very well for our customers. Should we do the same for MR2 as well?


[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2014-07-01 Thread Anubhav Dhoot (JIRA)

[
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049223#comment-14049223
]

Anubhav Dhoot commented on YARN-2175:
-

I should clarify: the AM can kill this container manually, but each AM would
have to implement the logic to detect when localization is taking too long and
kill the container. Updating description.
We can make it much simpler for administrators and AM writers by having an
automatic way to mitigate this. The NodeManager knows each state of the
container. Instead of having a back and forth between the AM and the NM, it
will be easier if we just let the NM do this. We can start with a configurable
timeout with a reasonable default. In future we can add the ability for the AM
to override this during the container request.
Lemme know what you guys think.
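
A rough sketch of the NM-side approach described above: a default timeout plus
an optional per-container override supplied with the start request. The class
LocalizationWatchdog and all of its methods are hypothetical and are not
NodeManager code. The clock is passed in by the caller, and the actual kill is
left to whoever consumes the returned IDs.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical NM-side watchdog; not an existing YARN class. */
final class LocalizationWatchdog {
    /** Start time and effective limit for one localizing container. */
    private static final class Tracked {
        final long startMs;
        final long timeoutMs;

        Tracked(long startMs, long timeoutMs) {
            this.startMs = startMs;
            this.timeoutMs = timeoutMs;
        }
    }

    private final long defaultTimeoutMs;
    private final Map<String, Tracked> localizing = new LinkedHashMap<>();

    LocalizationWatchdog(long defaultTimeoutMs) {
        this.defaultTimeoutMs = defaultTimeoutMs;
    }

    /** overrideTimeoutMs is the AM-supplied limit, or null to use the default. */
    void localizationStarted(String containerId, long nowMs, Long overrideTimeoutMs) {
        long limit = (overrideTimeoutMs != null) ? overrideTimeoutMs : defaultTimeoutMs;
        localizing.put(containerId, new Tracked(nowMs, limit));
    }

    /** Called when a container leaves the localizing state normally. */
    void localizationFinished(String containerId) {
        localizing.remove(containerId);
    }

    /** Returns, and stops tracking, the containers that exceeded their limit. */
    List<String> expire(long nowMs) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Tracked>> it = localizing.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Tracked> e = it.next();
            if (nowMs - e.getValue().startMs > e.getValue().timeoutMs) {
                expired.add(e.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

A periodic task would call expire(now) and kill whatever it returns. Keeping
the override per container means the global default can stay large while a
latency-sensitive AM opts into a tighter limit.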


[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2014-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

[
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049586#comment-14049586
]

Vinod Kumar Vavilapalli commented on YARN-2175:
---

That is a reasonable proposal, but I'd like to see if there are any other bugs
that are causing this to happen. Have we seen this in practice? If so, what is
the underlying reason? Too big a resource? The source file-system is down? Or
NM has a bug? We should try to address the right individual problem with its
solution before we put a band-aid that may still be useful for issues that we
cannot just address directly if any.

Contrast this with mapreduce.task.timeout. Arguably the config helped users
time out their jobs, but in my experience it prevented us from focusing on
fixing point bugs that stayed hidden in the framework for a long time; it kind
of hides the issues. It is still useful for those unmanageable and unsolvable
bugs, but I'd rather first fix the point problems and then apply the band-aid.
Thoughts?


[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2014-06-19 Thread Vinod Kumar Vavilapalli (JIRA)

[
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037055#comment-14037055
]

Vinod Kumar Vavilapalli commented on YARN-2175:
---

bq. there is no way to kill an task if its stuck in these states.
YARN-1619/YARN-445 should let you do this manually if not automatically.

Container localization has no timeouts and tasks can be stuck there for a
long time
---

There are no timeouts that can be used to limit the time taken by various
container startup operations. Localization, for example, could take a long
time, and there is no way to kill a task if it's stuck in these states. These
may have nothing to do with the task itself and could be an issue within the
platform.
Ideally there should be configurable limits on the time spent in various
states within the NodeManager. The RM does not care about most of these;
it's only between the AM and the NM. We can start by making these global
configurable defaults, and in future we can make it fancier by letting the AM
override them in the start container request.
This JIRA will be used to limit localization time, and we can open others if
we feel we need to limit other operations.


[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2014-06-19 Thread Jason Lowe (JIRA)

[
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037337#comment-14037337
]

Jason Lowe commented on YARN-2175:
--

I also wonder if there's been a regression, since at least in 0.23 containers
that are localizing can be killed by the ApplicationMaster. The MR AM does
this when mapreduce.task.timeout triggers a kill of a task due to lack of
progress. The MR AM kills the container and that, in turn, causes the
localizer to die because the NM tells the localizer to DIE during its next
heartbeat.

Although, if the localizer gets stuck and stops heartbeating, and the NM has
lost track of it because of the container kill, it seems we could leak a hung
localizer process.
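
A simplified model of the heartbeat exchange described above, not the NM's
actual code. The class LocalizerTracker and its methods are hypothetical; the
LIVE and DIE action names follow the wording of the comment. The sketch assumes
one way to avoid the leak: keep tracking a killed localizer until it has
actually been told to DIE, so that one that goes silent can still be found.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of NM-side localizer tracking; not YARN code. */
final class LocalizerTracker {
    enum Action { LIVE, DIE }

    private static final class State {
        long lastHeartbeatMs;
        boolean containerKilled;

        State(long nowMs) {
            this.lastHeartbeatMs = nowMs;
        }
    }

    private final Map<String, State> localizers = new LinkedHashMap<>();

    void started(String localizerId, long nowMs) {
        localizers.put(localizerId, new State(nowMs));
    }

    /** Mark the localizer for death, but keep tracking it until it is told. */
    void containerKilled(String localizerId) {
        State s = localizers.get(localizerId);
        if (s != null) {
            s.containerKilled = true;
        }
    }

    /** Response to a heartbeat: DIE for killed or unknown localizers. */
    Action heartbeat(String localizerId, long nowMs) {
        State s = localizers.get(localizerId);
        if (s == null) {
            return Action.DIE;
        }
        if (s.containerKilled) {
            localizers.remove(localizerId);  // safe to forget once DIE is delivered
            return Action.DIE;
        }
        s.lastHeartbeatMs = nowMs;
        return Action.LIVE;
    }

    /** Localizers with no heartbeat for longer than maxSilenceMs (possibly hung). */
    List<String> silentLocalizers(long nowMs, long maxSilenceMs) {
        List<String> silent = new ArrayList<>();
        for (Map.Entry<String, State> e : localizers.entrySet()) {
            if (nowMs - e.getValue().lastHeartbeatMs > maxSilenceMs) {
                silent.add(e.getKey());
            }
        }
        return silent;
    }
}
```

In this model a killed localizer that never heartbeats again stays in the map
and shows up in silentLocalizers, where its process could be cleaned up
explicitly instead of being leaked.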

