[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2018-01-29 Thread Tao Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343997#comment-16343997
 ] 

Tao Zhang commented on YARN-2175:
-

We're facing this issue too. Localization for some containers may take a long 
time due to underlying HDFS issues, machine network conditions, etc. The AM 
will request more containers when it doesn't see enough available containers 
(those that have finished localization). However, YARN leaves the slow 
containers stuck in localization, and there is no good automatic way to kill 
them. A dynamically adjusted timeout for the "localizing" state would help here.

Compared to a "pre-configured" timeout value, it'd be better to have a 
"dynamically adjusting" timeout. E.g., we calculate the average localization 
time for the first 50% of an app's containers, then set *2 * 
avg_localizing_time_of_half_containers* as the timeout threshold for the 
remaining containers. This requires the localization times of all containers. 
Hence the *RM* would be the appropriate component to implement this feature 
(container localization timeout). The AM may not be a good choice, since 
"localization" is a common process in YARN and we don't want to implement this 
feature for each type of ApplicationMaster.
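
The threshold rule above can be sketched in a few lines of Java. This is only 
an illustration under stated assumptions: the class and method names are 
hypothetical and not part of YARN, localization times are assumed to arrive in 
completion order, and "first 50%" is assumed to round up for odd container 
counts.

```java
import java.util.List;

/** Hypothetical helper, not a YARN API: derives a timeout from observed localization times. */
final class DynamicLocalizationTimeout {

    private DynamicLocalizationTimeout() {}

    /**
     * Returns 2 * (average localization time of the first 50% of containers),
     * or -1 if fewer than half of the app's containers have finished localizing.
     *
     * @param finishedLocalizationMillis localization times in completion order (assumption)
     * @param totalContainers            number of containers the app has been allocated
     */
    static long timeoutMillis(List<Long> finishedLocalizationMillis, int totalContainers) {
        // Sample size: the first half of the containers, rounded up (assumption).
        int sampleSize = (totalContainers + 1) / 2;
        if (sampleSize == 0 || finishedLocalizationMillis.size() < sampleSize) {
            return -1L; // not enough data yet; a caller could fall back to a static limit
        }
        long sum = 0L;
        for (long t : finishedLocalizationMillis.subList(0, sampleSize)) {
            sum += t;
        }
        long average = sum / sampleSize;
        return 2 * average;
    }
}
```

For an app with 4 containers whose first two localized in 1000 ms and 3000 ms, 
the average is 2000 ms, so the remaining containers would get a 4000 ms limit.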

 

> Container localization has no timeouts and tasks can be stuck there for a 
> long time
> ---
>
> Key: YARN-2175
> URL: https://issues.apache.org/jira/browse/YARN-2175
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.4.0
>Reporter: Anubhav Dhoot
>Priority: Major
>
> There are no timeouts that can be used to limit the time taken by various 
> container startup operations. Localization, for example, could take a long 
> time, and there is no automated way to kill a task if it's stuck in these 
> states. These may have nothing to do with the task itself and could be an 
> issue within the platform.
> Ideally there should be configurable limits on the time a container spends in 
> various states within the NodeManager. The RM does not care about most of 
> these; they are only between the AM and the NM. We can start by making these 
> global configurable defaults, and in the future we can make it fancier by 
> letting the AM override them in the start container request. 
> This jira will be used to limit localization time, and we can open others if 
> we feel we need to limit other operations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2015-03-31 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389173#comment-14389173
 ] 

Karthik Kambatla commented on YARN-2175:


In our testing, we see this issue with the AM container itself: it can take 
longer than 10 minutes to localize, and the RM kills the attempt because it 
hasn't heard from the AM.

 Container localization has no timeouts and tasks can be stuck there for a 
 long time
 ---

 Key: YARN-2175
 URL: https://issues.apache.org/jira/browse/YARN-2175
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.4.0
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot

 There are no timeouts that can be used to limit the time taken by various 
 container startup operations. Localization, for example, could take a long 
 time, and there is no automated way to kill a task if it's stuck in these 
 states. These may have nothing to do with the task itself and could be an 
 issue within the platform.
 Ideally there should be configurable limits on the time a container spends in 
 various states within the NodeManager. The RM does not care about most of 
 these; they are only between the AM and the NM. We can start by making these 
 global configurable defaults, and in the future we can make it fancier by 
 letting the AM override them in the start container request. 
 This jira will be used to limit localization time, and we can open others if 
 we feel we need to limit other operations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2014-07-02 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050533#comment-14050533
 ] 

Anubhav Dhoot commented on YARN-2175:
-

We have seen it happen when the source file system had issues. Some jobs would 
intermittently take a long time to fail and would succeed on rerun, because the 
jars were put in a new distributed cache location when rerun. Without this 
timeout, we have no lever to mitigate underlying HDFS/hardware issues in 
production until the root cause is identified and fixed. 
Also, in comparison with mapreduce.task.timeout, this is narrowly focused on a 
specific operation: localization. I would expect this timeout to default to a 
large value in production (say 30 min) and to be used only for mitigation when 
an issue occurs in production.




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2014-07-02 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050539#comment-14050539
 ] 

Karthik Kambatla commented on YARN-2175:


I'll let Anubhav provide details on the particular instance where we ran into 
this. 

bq. We should try to address the right individual problem with its solution 
before we put a band-aid that may still be useful for issues that we cannot 
just address directly if any.
Having worked on several MR1 production issues, I see your point. I agree we 
should look into and address individual problems. That said, I also believe in 
failsafes that avoid bringing down a production cluster or failing a critical 
job altogether in the face of hardware issues. That gives us time to fix the 
individual issues correctly when we encounter them, instead of rushing out a 
hot fix. 



[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2014-07-02 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050543#comment-14050543
 ] 

Karthik Kambatla commented on YARN-2175:


In MR1, mapred.task.timeout covers localization as well, and that has worked 
very well for our customers. Should we do the same for MR2? 



[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2014-07-01 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049223#comment-14049223
 ] 

Anubhav Dhoot commented on YARN-2175:
-

I should clarify: the AM can kill this container manually, but each AM would 
have to implement the logic to detect when localization is taking too long and 
kill the container. Updating the description.
We can make it much simpler for administrators and AM writers by having an 
automatic way to mitigate this. The NodeManager knows each state of the 
container. Instead of a back-and-forth between the AM and the NM, it will be 
easier to let the NM do this. We can start with a configurable timeout with a 
reasonable default. In the future we can add the ability for the AM to 
override this in the container request.
Lemme know what you guys think.
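
A minimal sketch of what such an NM-side check could look like, assuming a 
periodic scan over containers that are still localizing. Everything here is 
hypothetical: LocalizationWatchdog and its methods are illustrative names, not 
NodeManager APIs, and the 30-minute default is borrowed from the "say 30 min" 
figure mentioned elsewhere in this thread rather than from any real 
configuration key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not NodeManager code: finds containers stuck in localization. */
final class LocalizationWatchdog {
    /** Assumed default, following the "say 30 min" suggestion in this thread. */
    static final long DEFAULT_TIMEOUT_MS = 30L * 60L * 1000L;

    private final long defaultTimeoutMs;
    private final Map<String, Long> startTimeMs = new HashMap<>();
    private final Map<String, Long> overrideMs = new HashMap<>();

    LocalizationWatchdog(long defaultTimeoutMs) {
        this.defaultTimeoutMs = defaultTimeoutMs;
    }

    /** Records that a container entered localization, using the global default limit. */
    void localizationStarted(String containerId, long nowMs) {
        startTimeMs.put(containerId, nowMs);
    }

    /** Same, but with a per-container limit (the future "AM override" idea). */
    void localizationStarted(String containerId, long nowMs, long timeoutOverrideMs) {
        startTimeMs.put(containerId, nowMs);
        overrideMs.put(containerId, timeoutOverrideMs);
    }

    /** Stops tracking a container once localization completes or the container exits. */
    void localizationFinished(String containerId) {
        startTimeMs.remove(containerId);
        overrideMs.remove(containerId);
    }

    /** Returns the ids of containers that have been localizing longer than their limit. */
    List<String> expired(long nowMs) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : startTimeMs.entrySet()) {
            long limit = overrideMs.getOrDefault(e.getKey(), defaultTimeoutMs);
            if (nowMs - e.getValue() > limit) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

A caller would poll expired(now) on a timer and kill whatever comes back. 
Keeping the clock as a parameter instead of calling System.currentTimeMillis() 
inside the class makes the check easy to test.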



[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2014-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049586#comment-14049586
 ] 

Vinod Kumar Vavilapalli commented on YARN-2175:
---

That is a reasonable proposal, but I'd like to see if there are any other bugs 
that are causing this to happen. Have we seen this in practice? If so, what is 
the underlying reason? Too big a resource? The source file-system is down? Or 
NM has a bug? We should try to address the right individual problem with its 
solution before we put a band-aid that may still be useful for issues that we 
cannot just address directly if any.

Contrast this with mapreduce.task.timeout. Arguably the config helped users 
time out their jobs, but in my experience it kept us from fixing point bugs 
that stayed hidden in the framework for a long time; it kind of hides the 
issues. It is still useful for those unmanageable and unsolvable bugs, but I'd 
rather fix the point problems first and then apply the band-aid. 
Thoughts?



[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2014-06-19 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037055#comment-14037055
 ] 

Vinod Kumar Vavilapalli commented on YARN-2175:
---

bq. there is no way to kill an task if its stuck in these states.
YARN-1619/YARN-445 should let you do this manually if not automatically.

 Container localization has no timeouts and tasks can be stuck there for a 
 long time
 ---

 Key: YARN-2175
 URL: https://issues.apache.org/jira/browse/YARN-2175
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.4.0
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot

 There are no timeouts that can be used to limit the time taken by various 
 container startup operations. Localization for example could take a long time 
 and there is no way to kill an task if its stuck in these states. These may 
 have nothing to do with the task itself and could be an issue within the 
 platform. 
 Ideally there should be configurable limits for various states within the 
 NodeManager to limit various states. The RM does not care about most of these 
 and its only between AM and the NM. We can start by making these global 
 configurable defaults and in future we can make it fancier by letting AM 
 override them in the start container request.
 This jira will be used to limit localization time and we open others if we 
 feel we need to limit other operations.





[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time

2014-06-19 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037337#comment-14037337
 ] 

Jason Lowe commented on YARN-2175:
--

I also wonder if there's been a regression, since at least in 0.23 containers 
that are localizing can be killed by the ApplicationMaster.  The MR AM does 
this when mapreduce.task.timeout triggers a kill of a task due to lack of 
progress.  The MR AM kills the container and that, in turn, causes the 
localizer to die because the NM tells the localizer to DIE during its next 
heartbeat.

That said, if the localizer gets stuck and stops heartbeating, and the NM has 
lost track of it because of the container kill, it seems we could leak a hung 
localizer process.
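
The heartbeat exchange and the possible leak can be illustrated with a small, 
self-contained Java sketch. It is not the NodeManager's implementation: 
LocalizerTracker, its methods, and the expiry parameter are hypothetical names 
chosen for illustration. The one design choice it shows is keeping a killed 
container's localizer in the tracking table until it either heartbeats (and is 
told to DIE) or goes silent long enough to be reported as hung.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch, not NodeManager code: heartbeat replies plus hung-localizer detection. */
final class LocalizerTracker {
    /** Reply sent to a localizer on each heartbeat. */
    enum Action { LIVE, DIE }

    private final long heartbeatExpiryMs;
    private final Map<String, Long> lastHeartbeatMs = new HashMap<>();
    private final Set<String> markedForDeath = new HashSet<>();

    LocalizerTracker(long heartbeatExpiryMs) {
        this.heartbeatExpiryMs = heartbeatExpiryMs;
    }

    /** Starts tracking a newly launched localizer. */
    void register(String localizerId, long nowMs) {
        lastHeartbeatMs.put(localizerId, nowMs);
    }

    /**
     * Called when the container is killed. The entry is deliberately kept so that
     * a localizer that never heartbeats again can still be detected by hung().
     */
    void containerKilled(String localizerId) {
        if (lastHeartbeatMs.containsKey(localizerId)) {
            markedForDeath.add(localizerId);
        }
    }

    /** Handles a heartbeat: unknown or killed localizers are told to DIE and forgotten. */
    Action heartbeat(String localizerId, long nowMs) {
        if (!lastHeartbeatMs.containsKey(localizerId) || markedForDeath.contains(localizerId)) {
            lastHeartbeatMs.remove(localizerId);
            markedForDeath.remove(localizerId);
            return Action.DIE;
        }
        lastHeartbeatMs.put(localizerId, nowMs);
        return Action.LIVE;
    }

    /** Returns localizers that have been silent longer than the expiry interval. */
    List<String> hung(long nowMs) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeatMs.entrySet()) {
            if (nowMs - e.getValue() > heartbeatExpiryMs) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

Anything returned by hung() would need to be terminated at the process level, 
since a localizer that no longer heartbeats never receives the DIE reply.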
