[jira] [Comment Edited] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.

2020-04-30 Thread Jonathan Hung (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096807#comment-17096807
 ] 

Jonathan Hung edited comment on YARN-8193 at 4/30/20, 5:32 PM:
---

Hit this issue on a 2.10.0 cluster. Reuploading the patch to trigger Jenkins.


was (Author: jhung):
Hit this issue on 2.10.0 cluster. Reuploading patch

> YARN RM hangs abruptly (stops allocating resources) when running successive 
> applications.
> -
>
> Key: YARN-8193
> URL: https://issues.apache.org/jira/browse/YARN-8193
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Reporter: Zian Chen
> Assignee: Zian Chen
> Priority: Blocker
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8193-branch-2-001.patch, 
> YARN-8193-branch-2.10-001.patch, YARN-8193-branch-2.9.0-001.patch, 
> YARN-8193.001.patch, YARN-8193.002.patch
>
>
> When running massive queries in succession, at some point the RM hangs and 
> stops allocating resources. When the RM hangs, YARN throws a 
> NullPointerException at RegularContainerAllocator.getLocalityWaitFactor.
> There is sufficient space in yarn.nodemanager.local-dirs, so this is not a 
> node health issue (the RM did not report any node as unhealthy). There is no 
> fixed trigger for this (query or operation).
> The problem goes away after restarting the ResourceManager. No NM restart is 
> required. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.

2018-07-02 Thread Xiao Liang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530626#comment-16530626
 ] 

Xiao Liang edited comment on YARN-8193 at 7/3/18 12:39 AM:
---

The build failed for a reason unrelated to the patch:
 npm ERR! Error: CERT_UNTRUSTED
 npm ERR! at SecurePair.<anonymous> (tls.js:1370:32)
 npm ERR! at SecurePair.EventEmitter.emit (events.js:92:17)
 npm ERR! at SecurePair.maybeInitFinished (tls.js:982:10)
 npm ERR! at CleartextStream.read [as _read] (tls.js:469:13)
 npm ERR! at CleartextStream.Readable.read (_stream_readable.js:320:10)
 npm ERR! at EncryptedStream.write [as _write] (tls.js:366:25)
 npm ERR! at doWrite (_stream_writable.js:223:10)
 npm ERR! at writeOrBuffer (_stream_writable.js:213:5)
 npm ERR! at EncryptedStream.Writable.write (_stream_writable.js:180:11)
 npm ERR! at write (_stream_readable.js:583:24)
 npm ERR! If you need help, you may report this log at:
 npm ERR! <http://github.com/isaacs/npm/issues>
 npm ERR! or email it to:
 npm ERR! 

 npm ERR! System Linux 3.13.0-139-generic
 npm ERR! command "/usr/bin/nodejs" "/usr/bin/npm" "install" "-g" "bower"
 npm ERR! cwd /root
 npm ERR! node -v v0.10.25
 npm ERR! npm -v 1.3.10
 npm ERR! 
 npm ERR! Additional logging details can be found in:
 npm ERR! /root/npm-debug.log
 npm ERR! not ok code 0

Is there a quick fix for it?


was (Author: surmountian):
The build failed for a reason unrelated to the patch:
npm ERR! Error: CERT_UNTRUSTED
npm ERR! at SecurePair.<anonymous> (tls.js:1370:32)
npm ERR! at SecurePair.EventEmitter.emit (events.js:92:17)
npm ERR! at SecurePair.maybeInitFinished (tls.js:982:10)
npm ERR! at CleartextStream.read [as _read] (tls.js:469:13)
npm ERR! at CleartextStream.Readable.read (_stream_readable.js:320:10)
npm ERR! at EncryptedStream.write [as _write] (tls.js:366:25)
npm ERR! at doWrite (_stream_writable.js:223:10)
npm ERR! at writeOrBuffer (_stream_writable.js:213:5)
npm ERR! at EncryptedStream.Writable.write (_stream_writable.js:180:11)
npm ERR! at write (_stream_readable.js:583:24)
npm ERR! If you need help, you may report this log at:
npm ERR! <http://github.com/isaacs/npm/issues>
npm ERR! or email it to:
npm ERR! 

npm ERR! System Linux 3.13.0-139-generic
npm ERR! command "/usr/bin/nodejs" "/usr/bin/npm" "install" "-g" "bower"
npm ERR! cwd /root
npm ERR! node -v v0.10.25
npm ERR! npm -v 1.3.10
npm ERR! 
npm ERR! Additional logging details can be found in:
npm ERR! /root/npm-debug.log
npm ERR! not ok code 0

> YARN RM hangs abruptly (stops allocating resources) when running successive 
> applications.
> -
>
> Key: YARN-8193
> URL: https://issues.apache.org/jira/browse/YARN-8193
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Reporter: Zian Chen
> Assignee: Zian Chen
> Priority: Critical
> Fix For: 2.9.0, 3.2.0, 3.1.1
>
> Attachments: YARN-8193-branch-2.9.0-001.patch, YARN-8193.001.patch, 
> YARN-8193.002.patch
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Comment Edited] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.

2018-06-29 Thread tianjuan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16528549#comment-16528549
 ] 

tianjuan edited comment on YARN-8193 at 6/30/18 5:11 AM:
-

Attaching the patch for 2.9.0.


was (Author: jutia):
attaching patching for 2.9.0

> YARN RM hangs abruptly (stops allocating resources) when running successive 
> applications.
> -
>
> Key: YARN-8193
> URL: https://issues.apache.org/jira/browse/YARN-8193
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Reporter: Zian Chen
> Assignee: Zian Chen
> Priority: Critical
> Fix For: 2.9.0, 3.2.0, 3.1.1
>
> Attachments: YARN-8193-branch-2.9.0-001.patch, YARN-8193.001.patch, 
> YARN-8193.002.patch
>
>






[jira] [Comment Edited] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.

2018-04-20 Thread Zian Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16446447#comment-16446447
 ] 

Zian Chen edited comment on YARN-8193 at 4/20/18 10:27 PM:
---

After digging into the code logic: when we decide whether we can assign a 
container to a requesting application in async scheduling, we should work out 
the number of unique location asks in RegularContainerAllocator#canAssign 
before passing it into RegularContainerAllocator#getLocalityWaitFactor. We only 
set the canAssign result to true after doing the null check on the current 
application's AppPlacementAllocator and confirming that the number of unique 
location asks is equal to one.

Also, we need to do a null check in 
RegularContainerAllocator#preCheckForNodeCandidateSet when getting the 
AppPlacementAllocator, since it can be null when #pending resource is 
decreased by a different thread.
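
As a rough illustration of the null-check pattern described above (this is not 
the actual patch), the sketch below uses hypothetical stand-in types: 
NullSafeLocalityCheck, PlacementAllocator, the string scheduler key, and the 
simplified localityWaitFactor formula are all assumptions made for the 
example and are not taken from the Hadoop source. The point it shows is that 
the allocator is looked up once, checked for null, and only then used to 
compute the unique location asks.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the scheduler state; not Hadoop code.
class NullSafeLocalityCheck {

    /** Hypothetical stand-in for a per-scheduler-key placement allocator. */
    static final class PlacementAllocator {
        private final int uniqueLocationAsks;

        PlacementAllocator(int uniqueLocationAsks) {
            this.uniqueLocationAsks = uniqueLocationAsks;
        }

        int getUniqueLocationAsks() {
            return uniqueLocationAsks;
        }
    }

    // Another thread may remove an entry once its pending resource reaches zero.
    private final Map<String, PlacementAllocator> allocators = new ConcurrentHashMap<>();

    void register(String schedulerKey, int uniqueLocationAsks) {
        allocators.put(schedulerKey, new PlacementAllocator(uniqueLocationAsks));
    }

    void remove(String schedulerKey) {
        allocators.remove(schedulerKey);
    }

    /** Reads the allocator once and returns 0 if it has already been removed. */
    int uniqueLocationAsks(String schedulerKey) {
        PlacementAllocator allocator = allocators.get(schedulerKey);
        return allocator == null ? 0 : allocator.getUniqueLocationAsks();
    }

    /** True only when the allocator still exists and has exactly one unique ask. */
    boolean canAssignWithoutDelay(String schedulerKey) {
        return uniqueLocationAsks(schedulerKey) == 1;
    }

    /** Illustrative wait factor that takes an already-validated int, capped at 1. */
    static float localityWaitFactor(int uniqueLocationAsks, int clusterNodes) {
        if (uniqueLocationAsks <= 0 || clusterNodes <= 0) {
            return 0f;
        }
        return Math.min((float) uniqueLocationAsks / clusterNodes, 1.0f);
    }

    public static void main(String[] args) {
        NullSafeLocalityCheck check = new NullSafeLocalityCheck();
        check.register("key-1", 1);
        System.out.println(check.canAssignWithoutDelay("key-1"));
        check.remove("key-1"); // simulates another thread draining the request
        System.out.println(check.canAssignWithoutDelay("key-1"));
    }
}
```

Because the count is computed before the wait-factor call, a request that was 
removed concurrently yields zero asks instead of a dereference of a missing 
allocator.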


> YARN RM hangs abruptly (stops allocating resources) when running successive 
> applications.
> -
>
> Key: YARN-8193
> URL: https://issues.apache.org/jira/browse/YARN-8193
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Reporter: Zian Chen
> Assignee: Zian Chen
> Priority: Critical
>


