[jira] [Updated] (YARN-5403) yarn top command does not execute correctly

2016-07-19 Thread gu-chi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gu-chi updated YARN-5403:
-
Attachment: YARN-5403.patch

> yarn top command does not execute correctly
> -
>
> Key: YARN-5403
> URL: https://issues.apache.org/jira/browse/YARN-5403
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.7.2
>Reporter: gu-chi
> Attachments: YARN-5403.patch
>
>
> When executing {{yarn top}}, I always get the exception below:
> {quote}
> 16/07/19 19:55:12 ERROR cli.TopCLI: Could not fetch RM start time
> java.net.ConnectException: Connection refused
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:204)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:589)
>   at java.net.Socket.connect(Socket.java:538)
>   at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
>   at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
>   at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
>   at sun.net.www.http.HttpClient.(HttpClient.java:211)
>   at sun.net.www.http.HttpClient.New(HttpClient.java:308)
>   at sun.net.www.http.HttpClient.New(HttpClient.java:326)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
>   at 
> org.apache.hadoop.yarn.client.cli.TopCLI.getRMStartTime(TopCLI.java:747)
>   at org.apache.hadoop.yarn.client.cli.TopCLI.run(TopCLI.java:443)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>   at org.apache.hadoop.yarn.client.cli.TopCLI.main(TopCLI.java:421)
> YARN top - 19:55:13, up 17001d, 11:55, 0 active users, queue(s): root
> {quote}
> As I looked into it, the function {{getRMStartTime}} hardcodes HTTP no matter 
> what the {{yarn.http.policy}} setting is; it should consider using HTTPS when 
> the policy requires it.






[jira] [Updated] (YARN-5403) yarn top command does not execute correctly

2016-07-19 Thread gu-chi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gu-chi updated YARN-5403:
-
Attachment: (was: YARN-5403.patch)

> yarn top command does not execute correctly
> -
>
> Key: YARN-5403
> URL: https://issues.apache.org/jira/browse/YARN-5403
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.7.2
>Reporter: gu-chi
>
> When executing {{yarn top}}, I always get the exception below:
> {quote}
> 16/07/19 19:55:12 ERROR cli.TopCLI: Could not fetch RM start time
> java.net.ConnectException: Connection refused
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:204)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:589)
>   at java.net.Socket.connect(Socket.java:538)
>   at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
>   at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
>   at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
>   at sun.net.www.http.HttpClient.(HttpClient.java:211)
>   at sun.net.www.http.HttpClient.New(HttpClient.java:308)
>   at sun.net.www.http.HttpClient.New(HttpClient.java:326)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
>   at 
> org.apache.hadoop.yarn.client.cli.TopCLI.getRMStartTime(TopCLI.java:747)
>   at org.apache.hadoop.yarn.client.cli.TopCLI.run(TopCLI.java:443)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>   at org.apache.hadoop.yarn.client.cli.TopCLI.main(TopCLI.java:421)
> YARN top - 19:55:13, up 17001d, 11:55, 0 active users, queue(s): root
> {quote}
> As I looked into it, the function {{getRMStartTime}} hardcodes HTTP no matter 
> what the {{yarn.http.policy}} setting is; it should consider using HTTPS when 
> the policy requires it.






[jira] [Updated] (YARN-5403) yarn top command does not execute correctly

2016-07-19 Thread gu-chi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gu-chi updated YARN-5403:
-
Attachment: YARN-5403.patch

> yarn top command does not execute correctly
> -
>
> Key: YARN-5403
> URL: https://issues.apache.org/jira/browse/YARN-5403
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.7.2
>Reporter: gu-chi
> Attachments: YARN-5403.patch
>
>
> When executing {{yarn top}}, I always get the exception below:
> {quote}
> 16/07/19 19:55:12 ERROR cli.TopCLI: Could not fetch RM start time
> java.net.ConnectException: Connection refused
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:204)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:589)
>   at java.net.Socket.connect(Socket.java:538)
>   at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
>   at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
>   at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
>   at sun.net.www.http.HttpClient.(HttpClient.java:211)
>   at sun.net.www.http.HttpClient.New(HttpClient.java:308)
>   at sun.net.www.http.HttpClient.New(HttpClient.java:326)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
>   at 
> org.apache.hadoop.yarn.client.cli.TopCLI.getRMStartTime(TopCLI.java:747)
>   at org.apache.hadoop.yarn.client.cli.TopCLI.run(TopCLI.java:443)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>   at org.apache.hadoop.yarn.client.cli.TopCLI.main(TopCLI.java:421)
> YARN top - 19:55:13, up 17001d, 11:55, 0 active users, queue(s): root
> {quote}
> As I looked into it, the function {{getRMStartTime}} hardcodes HTTP no matter 
> what the {{yarn.http.policy}} setting is; it should consider using HTTPS when 
> the policy requires it.






[jira] [Created] (YARN-5403) yarn top command does not execute correctly

2016-07-19 Thread gu-chi (JIRA)
gu-chi created YARN-5403:


 Summary: yarn top command does not execute correctly
 Key: YARN-5403
 URL: https://issues.apache.org/jira/browse/YARN-5403
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.7.2
Reporter: gu-chi


When executing {{yarn top}}, I always get the exception below:
{quote}
16/07/19 19:55:12 ERROR cli.TopCLI: Could not fetch RM start time
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:204)
at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at java.net.Socket.connect(Socket.java:538)
at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at 
sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
at 
sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
at 
org.apache.hadoop.yarn.client.cli.TopCLI.getRMStartTime(TopCLI.java:747)
at org.apache.hadoop.yarn.client.cli.TopCLI.run(TopCLI.java:443)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.yarn.client.cli.TopCLI.main(TopCLI.java:421)
YARN top - 19:55:13, up 17001d, 11:55, 0 active users, queue(s): root
{quote}

As I looked into it, the function {{getRMStartTime}} hardcodes HTTP no matter 
what the {{yarn.http.policy}} setting is; it should consider using HTTPS when 
the policy requires it.
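
A minimal sketch of the idea (not the attached YARN-5403.patch; it assumes the 
standard YarnConfiguration helpers such as useHttps and the RM webapp address 
keys behave as described): derive the scheme and web address of the RM from the 
configuration instead of hardcoding "http://".
{code}
// Sketch only, not the attached patch: pick the RM web scheme and address from
// the configuration instead of hardcoding "http://".
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RmInfoUrlSketch {
  static String clusterInfoUrl(Configuration conf) {
    // YarnConfiguration.useHttps reflects the yarn.http.policy setting
    boolean https = YarnConfiguration.useHttps(conf);
    String address = https
        ? conf.get(YarnConfiguration.RM_WEBAPP_HTTPS_ADDRESS,
                   YarnConfiguration.DEFAULT_RM_WEBAPP_HTTPS_ADDRESS)
        : conf.get(YarnConfiguration.RM_WEBAPP_ADDRESS,
                   YarnConfiguration.DEFAULT_RM_WEBAPP_ADDRESS);
    return (https ? "https://" : "http://") + address + "/ws/v1/cluster/info";
  }
}
{code}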






[jira] [Resolved] (YARN-3678) DelayedProcessKiller may kill other process other than container

2016-05-10 Thread gu-chi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gu-chi resolved YARN-3678.
--
Resolution: Duplicate

> DelayedProcessKiller may kill other process other than container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0, 2.7.2
>Reporter: gu-chi
>Priority: Critical
>
> Suppose a container finishes and cleanup runs: the PID file still exists and 
> triggers signalContainer once, which kills the process whose pid is recorded 
> in the PID file. But since the container has already finished, that PID may 
> have been reused by another process, which can cause serious problems.
> As far as I know, my NM was killed unexpectedly, and what I described can be 
> the cause, even though it rarely occurs.






[jira] [Created] (YARN-4536) DelayedProcessKiller may not work under heavy workload

2016-01-04 Thread gu-chi (JIRA)
gu-chi created YARN-4536:


 Summary: DelayedProcessKiller may not work under heavy workload
 Key: YARN-4536
 URL: https://issues.apache.org/jira/browse/YARN-4536
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.1
Reporter: gu-chi


I am facing orphan container processes. Here is the scenario:
Under heavy task load, CPU usage on the NM machine can reach almost 100%. When 
a container gets a kill event it receives {{SIGTERM}}, the parent process 
exits, and the container process is left to the OS. The container process 
still needs to handle some shutdown logic but can hardly get any CPU. We would 
expect a {{SIGKILL}} from {{DelayedProcessKiller}}, but the parent process that 
was recorded as the container pid no longer exists, so the kill command cannot 
reach the container process. This is how orphan container processes come about.
The orphan process does exit eventually, but the period can be very long and 
degrades the OS; as I observed, it can be several hours.
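
For illustration, a minimal sketch of the SIGTERM-then-delayed-SIGKILL pattern 
described above (the class name and delay handling are illustrative, not the 
NodeManager implementation); the follow-up SIGKILL only helps if the pid it 
targets still refers to the container:
{code}
// Illustrative sketch, not NodeManager code: send SIGTERM, then SIGKILL after a
// grace period. If the recorded pid is gone by then, the SIGKILL has no effect.
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DelayedKillSketch {
  private static final ScheduledExecutorService SCHEDULER =
      Executors.newSingleThreadScheduledExecutor();

  static void terminateThenKill(int pid, long graceMillis) throws IOException {
    new ProcessBuilder("kill", "-15", Integer.toString(pid)).start(); // SIGTERM
    SCHEDULER.schedule(() -> {
      try {
        new ProcessBuilder("kill", "-9", Integer.toString(pid)).start(); // SIGKILL
      } catch (IOException ignored) {
        // the process (or its pid) may already be gone
      }
    }, graceMillis, TimeUnit.MILLISECONDS);
  }
}
{code}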





[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2016-01-04 Thread gu-chi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082201#comment-15082201
 ] 

gu-chi commented on YARN-3678:
--

same issue as confirmed with [~hex108]

> DelayedProcessKiller may kill other process other than container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: gu-chi
>Priority: Critical
>
> Suppose a container finishes and cleanup runs: the PID file still exists and 
> triggers signalContainer once, which kills the process whose pid is recorded 
> in the PID file. But since the container has already finished, that PID may 
> have been reused by another process, which can cause serious problems.
> As far as I know, my NM was killed unexpectedly, and what I described can be 
> the cause, even though it rarely occurs.





[jira] [Resolved] (YARN-4536) DelayedProcessKiller may not work under heavy workload

2016-01-04 Thread gu-chi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gu-chi resolved YARN-4536.
--
Resolution: Not A Problem

On further analysis, this was introduced by a custom modification on my side; 
sorry for the noise.

> DelayedProcessKiller may not work under heavy workload
> --
>
> Key: YARN-4536
> URL: https://issues.apache.org/jira/browse/YARN-4536
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.1
>Reporter: gu-chi
>
> I am facing orphan container processes. Here is the scenario:
> Under heavy task load, CPU usage on the NM machine can reach almost 100%. 
> When a container gets a kill event it receives {{SIGTERM}}, the parent 
> process exits, and the container process is left to the OS. The container 
> process still needs to handle some shutdown logic but can hardly get any CPU. 
> We would expect a {{SIGKILL}} from {{DelayedProcessKiller}}, but the parent 
> process that was recorded as the container pid no longer exists, so the kill 
> command cannot reach the container process. This is how orphan container 
> processes come about.
> The orphan process does exit eventually, but the period can be very long and 
> degrades the OS; as I observed, it can be several hours.





[jira] [Commented] (YARN-4536) DelayedProcessKiller may not work under heavy workload

2016-01-04 Thread gu-chi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081159#comment-15081159
 ] 

gu-chi commented on YARN-4536:
--

Thanks for the reply. I had not realized the process-group aspect; this seems 
to have been introduced by myself. I added a check for whether the 
container-executor process exists because I once hit YARN-3678: in my logic, 
if the parent process does not belong to this container, no kill signal is 
sent. I saw you also faced the same issue; can your patch handle that scenario 
without introducing this one?

> DelayedProcessKiller may not work under heavy workload
> --
>
> Key: YARN-4536
> URL: https://issues.apache.org/jira/browse/YARN-4536
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.1
>Reporter: gu-chi
>
> I am facing orphan container processes. Here is the scenario:
> Under heavy task load, CPU usage on the NM machine can reach almost 100%. 
> When a container gets a kill event it receives {{SIGTERM}}, the parent 
> process exits, and the container process is left to the OS. The container 
> process still needs to handle some shutdown logic but can hardly get any CPU. 
> We would expect a {{SIGKILL}} from {{DelayedProcessKiller}}, but the parent 
> process that was recorded as the container pid no longer exists, so the kill 
> command cannot reach the container process. This is how orphan container 
> processes come about.
> The orphan process does exit eventually, but the period can be very long and 
> degrades the OS; as I observed, it can be several hours.





[jira] [Created] (YARN-4481) negative pending resource of queues leads to applications in accepted status indefinitely

2015-12-18 Thread gu-chi (JIRA)
gu-chi created YARN-4481:


 Summary: negative pending resource of queues leads to applications 
in accepted status indefinitely
 Key: YARN-4481
 URL: https://issues.apache.org/jira/browse/YARN-4481
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler
Affects Versions: 2.7.2
Reporter: gu-chi
Priority: Critical


Met a scenario of negative pending resource with the capacity scheduler; in 
JMX it shows:
{noformat}
"PendingMB" : -4096,
"PendingVCores" : -1,
"PendingContainers" : -1,
{noformat}
The full JMX information is attached.
This is not just a JMX/UI display issue: the actual pending resource of the 
queue is also negative, as I can see from the following debug log:
bq. DEBUG | ResourceManager Event Processor | Skip this queue=root, because it 
doesn't need more resource, schedulingMode=RESPECT_PARTITION_EXCLUSIVITY 
node-partition= | ParentQueue.java
This leads to {{NULL_ASSIGNMENT}}.
The background: hundreds of applications were submitted, consuming all cluster 
resources, so reservations happened. While they were running, network faults 
were injected by a tool (injection types: delay, jitter, repeat, packet loss 
and disorder), and then most of the submitted applications were killed.

Is anyone else facing negative pending resource, or does anyone have an idea 
of how this can happen?
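
For anyone who wants to watch for this, a hedged sketch of polling the queue 
metrics over the RM's /jmx servlet; the RM address and the exact MBean query 
string below are assumptions for a default, non-HA setup, not taken from this 
report.
{code}
// Hedged sketch: fetch the ResourceManager /jmx output for the QueueMetrics
// beans so a negative PendingMB / PendingVCores / PendingContainers value can
// be spotted from the command line.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class QueueMetricsProbe {
  public static void main(String[] args) throws IOException {
    String rm = args.length > 0 ? args[0] : "http://localhost:8088";
    URL url = new URL(rm + "/jmx?qry=Hadoop:service=ResourceManager,name=QueueMetrics,*");
    StringBuilder json = new StringBuilder();
    try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        json.append(line).append('\n');
      }
    }
    // The JSON contains one bean per queue; negative pending counters here
    // match the symptom described in this issue.
    System.out.println(json);
  }
}
{code}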





[jira] [Updated] (YARN-4481) negative pending resource of queues leads to applications in accepted status indefinitely

2015-12-18 Thread gu-chi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gu-chi updated YARN-4481:
-
Attachment: jmx.txt

> negative pending resource of queues leads to applications in accepted status 
> indefinitely
> -
>
> Key: YARN-4481
> URL: https://issues.apache.org/jira/browse/YARN-4481
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.7.2
>Reporter: gu-chi
>Priority: Critical
> Attachments: jmx.txt
>
>
> Met a scenario of negative pending resource with the capacity scheduler; in 
> JMX it shows:
> {noformat}
> "PendingMB" : -4096,
> "PendingVCores" : -1,
> "PendingContainers" : -1,
> {noformat}
> The full JMX information is attached.
> This is not just a JMX/UI display issue: the actual pending resource of the 
> queue is also negative, as I can see from the following debug log:
> bq. DEBUG | ResourceManager Event Processor | Skip this queue=root, because 
> it doesn't need more resource, schedulingMode=RESPECT_PARTITION_EXCLUSIVITY 
> node-partition= | ParentQueue.java
> This leads to {{NULL_ASSIGNMENT}}.
> The background: hundreds of applications were submitted, consuming all 
> cluster resources, so reservations happened. While they were running, network 
> faults were injected by a tool (injection types: delay, jitter, repeat, 
> packet loss and disorder), and then most of the submitted applications were 
> killed.
> Is anyone else facing negative pending resource, or does anyone have an idea 
> of how this can happen?





[jira] [Commented] (YARN-4481) negative pending resource of queues leads to applications in accepted status indefinitely

2015-12-18 Thread gu-chi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15065175#comment-15065175
 ] 

gu-chi commented on YARN-4481:
--

Same here when using DRC.
:( Debug logging was only enabled after I saw the issue, so there is no debug 
information from before that.
I do have the RM log, but it is several GB, covering hundreds of applications.

> negative pending resource of queues leads to applications in accepted status 
> indefinitely
> -
>
> Key: YARN-4481
> URL: https://issues.apache.org/jira/browse/YARN-4481
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.7.2
>Reporter: gu-chi
>Priority: Critical
> Attachments: jmx.txt
>
>
> Met a scenario of negative pending resource with the capacity scheduler; in 
> JMX it shows:
> {noformat}
> "PendingMB" : -4096,
> "PendingVCores" : -1,
> "PendingContainers" : -1,
> {noformat}
> The full JMX information is attached.
> This is not just a JMX/UI display issue: the actual pending resource of the 
> queue is also negative, as I can see from the following debug log:
> bq. DEBUG | ResourceManager Event Processor | Skip this queue=root, because 
> it doesn't need more resource, schedulingMode=RESPECT_PARTITION_EXCLUSIVITY 
> node-partition= | ParentQueue.java
> This leads to {{NULL_ASSIGNMENT}}.
> The background: hundreds of applications were submitted, consuming all 
> cluster resources, so reservations happened. While they were running, network 
> faults were injected by a tool (injection types: delay, jitter, repeat, 
> packet loss and disorder), and then most of the submitted applications were 
> killed.
> Is anyone else facing negative pending resource, or does anyone have an idea 
> of how this can happen?





[jira] [Commented] (YARN-4481) negative pending resource of queues leads to applications in accepted status indefinitely

2015-12-18 Thread gu-chi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15065181#comment-15065181
 ] 

gu-chi commented on YARN-4481:
--

I added some extra logging to trace this. Do you have any idea how it could be 
reproduced?

> negative pending resource of queues leads to applications in accepted status 
> indefinitely
> -
>
> Key: YARN-4481
> URL: https://issues.apache.org/jira/browse/YARN-4481
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.7.2
>Reporter: gu-chi
>Priority: Critical
> Attachments: jmx.txt
>
>
> Met a scenario of negative pending resource with the capacity scheduler; in 
> JMX it shows:
> {noformat}
> "PendingMB" : -4096,
> "PendingVCores" : -1,
> "PendingContainers" : -1,
> {noformat}
> The full JMX information is attached.
> This is not just a JMX/UI display issue: the actual pending resource of the 
> queue is also negative, as I can see from the following debug log:
> bq. DEBUG | ResourceManager Event Processor | Skip this queue=root, because 
> it doesn't need more resource, schedulingMode=RESPECT_PARTITION_EXCLUSIVITY 
> node-partition= | ParentQueue.java
> This leads to {{NULL_ASSIGNMENT}}.
> The background: hundreds of applications were submitted, consuming all 
> cluster resources, so reservations happened. While they were running, network 
> faults were injected by a tool (injection types: delay, jitter, repeat, 
> packet loss and disorder), and then most of the submitted applications were 
> killed.
> Is anyone else facing negative pending resource, or does anyone have an idea 
> of how this can happen?





[jira] [Commented] (YARN-4427) NPE on handleNMContainerStatus when NM is registering to RM

2015-12-07 Thread gu-chi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044871#comment-15044871
 ] 

gu-chi commented on YARN-4427:
--

NM recovery is enabled; this is the precondition.

> NPE on handleNMContainerStatus when NM is registering to RM
> ---
>
> Key: YARN-4427
> URL: https://issues.apache.org/jira/browse/YARN-4427
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Brahma Reddy Battula
>Assignee: Brahma Reddy Battula
>Priority: Critical
>
>  *Seen the following in one of our environments when the AM got a container 
> allocated but failed to be updated in ZK, where the cluster was having 
> intermittent network problems (up and down).*
> {noformat}
> 2015-12-07 16:39:38,489 | WARN  | IPC Server handler 49 on 26003 | IPC Server 
> handler 49 on 26003, call 
> org.apache.hadoop.yarn.server.api.ResourceTrackerPB.registerNodeManager from 
> 9.91.8.220:52169 Call#17 Retry#0 | Server.java:2107
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.handleNMContainerStatus(ResourceTrackerService.java:286)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.registerNodeManager(ResourceTrackerService.java:395)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceTrackerPBServiceImpl.registerNodeManager(ResourceTrackerPBServiceImpl.java:54)
> at 
> org.apache.hadoop.yarn.proto.ResourceTracker$ResourceTrackerService$2.callBlockingMethod(ResourceTracker.java:79)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2088)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2084)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1673)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2082)
> {noformat}
> The corresponding code is below; it might not match {{branch-2.7}}/trunk 
> since we have modified it internally.
> {code}
>  284   RMAppAttempt rmAppAttempt = rmApp.getRMAppAttempt(appAttemptId);
>  285   Container masterContainer = rmAppAttempt.getMasterContainer();
>  286   if (masterContainer.getId().equals(containerStatus.getContainerId())
>  287       && containerStatus.getContainerState() == ContainerState.COMPLETE) {
>  288     ContainerStatus status =
>  289         ContainerStatus.newInstance(containerStatus.getContainerId(),
>  290             containerStatus.getContainerState(), containerStatus.getDiagnostics(),
>  291             containerStatus.getContainerExitStatus());
>  292     // sending master container finished event.
>  293     RMAppAttemptContainerFinishedEvent evt =
>  294         new RMAppAttemptContainerFinishedEvent(appAttemptId, status,
>  295             nodeId);
>  296     rmContext.getDispatcher().getEventHandler().handle(evt);
>  297   }
> {code}





[jira] [Commented] (YARN-3730) scheduler reserve more resource than required

2015-05-31 Thread gu-chi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566919#comment-14566919
 ] 

gu-chi commented on YARN-3730:
--

Thanks Naga. As the improvements are not merged into the version I am currently 
using, this feature is not invoked. I will set 
yarn.scheduler.capacity.reservations-continue-look-all-nodes to false on 
version 2.7.0 and check the outcome.

 scheduler reserve more resource than required
 -

 Key: YARN-3730
 URL: https://issues.apache.org/jira/browse/YARN-3730
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Reporter: gu-chi

 Using the capacity scheduler in an environment with 3 NMs of 9 vcores each, I 
 ran a Spark job with 4 executors of 5 cores each. As expected, only 1 executor 
 could not start and should be reserved, but actually more containers are 
 reserved, so I cannot run other smaller tasks. As I checked the capacity 
 scheduler, the 'needContainers' method in LeafQueue.java has a 'starvation' 
 computation, and this causes more containers to be reserved than required. 
 Any idea or suggestion on this?





[jira] [Updated] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-27 Thread gu-chi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gu-chi updated YARN-3678:
-
Attachment: YARN-3678.patch

 DelayedProcessKiller may kill other process other than container
 

 Key: YARN-3678
 URL: https://issues.apache.org/jira/browse/YARN-3678
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: gu-chi
Priority: Critical
 Attachments: YARN-3678.patch


 Suppose a container finishes and cleanup runs: the PID file still exists and 
 triggers signalContainer once, which kills the process whose pid is recorded 
 in the PID file. But since the container has already finished, that PID may 
 have been reused by another process, which can cause serious problems.
 As far as I know, my NM was killed unexpectedly, and what I described can be 
 the cause, even though it rarely occurs.





[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-27 Thread gu-chi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14562180#comment-14562180
 ] 

gu-chi commented on YARN-3678:
--

I opened this pull request: https://github.com/apache/hadoop/pull/20/

 DelayedProcessKiller may kill other process other than container
 

 Key: YARN-3678
 URL: https://issues.apache.org/jira/browse/YARN-3678
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: gu-chi
Priority: Critical

 Suppose a container finishes and cleanup runs: the PID file still exists and 
 triggers signalContainer once, which kills the process whose pid is recorded 
 in the PID file. But since the container has already finished, that PID may 
 have been reused by another process, which can cause serious problems.
 As far as I know, my NM was killed unexpectedly, and what I described can be 
 the cause, even though it rarely occurs.





[jira] [Created] (YARN-3730) scheduler reserve more resource than required

2015-05-27 Thread gu-chi (JIRA)
gu-chi created YARN-3730:


 Summary: scheduler reserve more resource than required
 Key: YARN-3730
 URL: https://issues.apache.org/jira/browse/YARN-3730
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Reporter: gu-chi


Using the capacity scheduler in an environment with 3 NMs of 9 vcores each, I 
ran a Spark job with 4 executors of 5 cores each. As expected, only 1 executor 
could not start and should be reserved, but actually more containers are 
reserved, so I cannot run other smaller tasks. As I checked the capacity 
scheduler, the 'needContainers' method in LeafQueue.java has a 'starvation' 
computation, and this causes more containers to be reserved than required. Any 
idea or suggestion on this?
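
Roughly paraphrased, the decision being pointed at looks like the sketch below 
(a simplified reading of the 'starvation' heuristic, not the exact 
LeafQueue.needContainers code): the more often an application's reservations 
get re-reserved, the more extra containers the queue allows itself to reserve 
beyond what is strictly required.
{code}
// Simplified paraphrase of the reservation heuristic under discussion; it is
// not the actual LeafQueue.needContainers implementation, just the shape of
// the rule.
public class ReservationHeuristicSketch {
  static boolean needContainers(int requiredContainers, int reservedContainers,
                                int reReservations, float nodeSizeFactor) {
    int starvation = 0;
    if (reservedContainers > 0) {
      // frequent re-reservations inflate the assumed demand, so the queue may
      // keep reserving even though requiredContainers <= reservedContainers
      starvation = (int) ((reReservations / (float) reservedContainers)
          * (1.0f - Math.min(nodeSizeFactor, 1.0f)));
    }
    return (starvation + requiredContainers) - reservedContainers > 0;
  }

  public static void main(String[] args) {
    // e.g. 1 container still required, 1 already reserved, many re-reservations
    System.out.println(needContainers(1, 1, 8, 0.5f)); // true: reserve even more
  }
}
{code}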





[jira] [Updated] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-27 Thread gu-chi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gu-chi updated YARN-3678:
-
Attachment: (was: YARN-3678.patch)

 DelayedProcessKiller may kill other process other than container
 

 Key: YARN-3678
 URL: https://issues.apache.org/jira/browse/YARN-3678
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: gu-chi
Priority: Critical

 Suppose a container finishes and cleanup runs: the PID file still exists and 
 triggers signalContainer once, which kills the process whose pid is recorded 
 in the PID file. But since the container has already finished, that PID may 
 have been reused by another process, which can cause serious problems.
 As far as I know, my NM was killed unexpectedly, and what I described can be 
 the cause, even though it rarely occurs.





[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-19 Thread gu-chi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14551681#comment-14551681
 ] 

gu-chi commented on YARN-3678:
--

I see the probability is low, but under heavy task load it occurs frequently. 
I would suggest adding a check before the kill: verify that the process ID 
still belongs to the container.
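
A hypothetical sketch of such a guard (illustrative only, not an attached 
patch): before signalling the pid read from the pid file, confirm the process 
still looks like the container, for example by checking that its command line 
mentions the container id.
{code}
// Hypothetical guard: only signal the pid from the PID file if
// /proc/<pid>/cmdline still references the container id.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PidGuardSketch {
  static boolean pidStillBelongsToContainer(int pid, String containerId) {
    Path cmdline = Paths.get("/proc", Integer.toString(pid), "cmdline");
    try {
      byte[] raw = Files.readAllBytes(cmdline);     // arguments separated by '\0'
      return new String(raw).replace('\0', ' ').contains(containerId);
    } catch (IOException e) {
      return false; // process already gone or unreadable: do not signal it
    }
  }
}
{code}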

 DelayedProcessKiller may kill other process other than container
 

 Key: YARN-3678
 URL: https://issues.apache.org/jira/browse/YARN-3678
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: gu-chi
Priority: Critical

 Suppose a container finishes and cleanup runs: the PID file still exists and 
 triggers signalContainer once, which kills the process whose pid is recorded 
 in the PID file. But since the container has already finished, that PID may 
 have been reused by another process, which can cause serious problems.
 As far as I know, my NM was killed unexpectedly, and what I described can be 
 the cause, even though it rarely occurs.





[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-19 Thread gu-chi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14551756#comment-14551756
 ] 

gu-chi commented on YARN-3678:
--

The PID number may not be reused by a process; it can also be reused by a 
thread. Linux treats processes and threads the same way, and killing one thread 
of a process may kill the whole process too. For a thread, being started within 
250ms is possible, right?

 DelayedProcessKiller may kill other process other than container
 

 Key: YARN-3678
 URL: https://issues.apache.org/jira/browse/YARN-3678
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: gu-chi
Priority: Critical

 Suppose a container finishes and cleanup runs: the PID file still exists and 
 triggers signalContainer once, which kills the process whose pid is recorded 
 in the PID file. But since the container has already finished, that PID may 
 have been reused by another process, which can cause serious problems.
 As far as I know, my NM was killed unexpectedly, and what I described can be 
 the cause, even though it rarely occurs.





[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-19 Thread gu-chi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14550390#comment-14550390
 ] 

gu-chi commented on YARN-3678:
--

I think decreasing the max_pid setting in the OS can increase the chance of 
reproducing this; working on it.

 DelayedProcessKiller may kill other process other than container
 

 Key: YARN-3678
 URL: https://issues.apache.org/jira/browse/YARN-3678
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: gu-chi
Priority: Critical

 Suppose a container finishes and cleanup runs: the PID file still exists and 
 triggers signalContainer once, which kills the process whose pid is recorded 
 in the PID file. But since the container has already finished, that PID may 
 have been reused by another process, which can cause serious problems.
 As far as I know, my NM was killed unexpectedly, and what I described can be 
 the cause, even though it rarely occurs.





[jira] [Created] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-19 Thread gu-chi (JIRA)
gu-chi created YARN-3678:


 Summary: DelayedProcessKiller may kill other process other than 
container
 Key: YARN-3678
 URL: https://issues.apache.org/jira/browse/YARN-3678
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: gu-chi
Priority: Critical


Suppose a container finishes and cleanup runs: the PID file still exists and 
triggers signalContainer once, which kills the process whose pid is recorded 
in the PID file. But since the container has already finished, that PID may 
have been reused by another process, which can cause serious problems.
As far as I know, my NM was killed unexpectedly, and what I described can be 
the cause, even though it rarely occurs.





[jira] [Commented] (YARN-1922) Process group remains alive after container process is killed externally

2015-05-18 Thread gu-chi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547997#comment-14547997
 ] 

gu-chi commented on YARN-1922:
--

Hi, I see the comment here saying to check in YARN-1922.5.patch, but why was 
YARN-1922.6.patch the one merged? What was the concern?
I think this solution may have a defect.
Suppose a container finishes and cleanup runs: the PID file still exists and 
triggers signalContainer once, which kills the process whose pid is recorded 
in the PID file. But since the container has already finished, that PID may 
have been reused by another process, which can cause serious problems.
As far as I know, my NM was killed unexpectedly, and what I described can be 
the cause, even though it rarely occurs.
Below is the error scenario: task cleanup had not finished, but the NM was 
killed and then restarted.

2015-05-14 21:49:03,063 | INFO  | DeletionService #1 | Deleting absolute path : 
/export/data1/yarn/nm/localdir/usercache/omm/appcache/application_1430456703237_8047/container_1430456703237_8047_01_12582917
 | 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:400)
2015-05-14 21:49:03,063 | INFO  | AsyncDispatcher event handler | Container 
container_1430456703237_8047_01_12582917 transitioned from EXITED_WITH_SUCCESS 
to DONE | 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:918)
2015-05-14 21:49:03,064 | INFO  | AsyncDispatcher event handler | Removing 
container_1430456703237_8047_01_12582917 from application 
application_1430456703237_8047 | 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl$ContainerDoneTransition.transition(ApplicationImpl.java:340)
2015-05-14 21:49:03,064 | INFO  | AsyncDispatcher event handler | Considering 
container container_1430456703237_8047_01_12582917 for log-aggregation | 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.startContainerLogAggregation(AppLogAggregatorImpl.java:342)
2015-05-14 21:49:03,064 | INFO  | AsyncDispatcher event handler | Got event 
CONTAINER_STOP for appId application_1430456703237_8047 | 
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.handle(AuxServices.java:196)
2015-05-14 21:49:03,152 | INFO  | Node Status Updater | Removed completed 
containers from NM context: [container_1430456703237_8047_01_12582917] | 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.removeCompletedContainersFromContext(NodeStatusUpdaterImpl.java:417)
2015-05-14 21:49:03,293 | INFO  | Task killer for 26924 | Using 
linux-container-executor.users as omm | 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:349)
2015-05-14 21:49:20,667 | INFO  | main | STARTUP_MSG: 
/
STARTUP_MSG: Starting NodeManager
STARTUP_MSG:   host = SR6S11/192.168.10.21
STARTUP_MSG:   args = []
STARTUP_MSG:   version = V100R001C00
STARTUP_MSG:   classpath = 

 Process group remains alive after container process is killed externally
 

 Key: YARN-1922
 URL: https://issues.apache.org/jira/browse/YARN-1922
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.4.0
 Environment: CentOS 6.4
Reporter: Billie Rinaldi
Assignee: Billie Rinaldi
 Fix For: 2.6.0

 Attachments: YARN-1922.1.patch, YARN-1922.2.patch, YARN-1922.3.patch, 
 YARN-1922.4.patch, YARN-1922.5.patch, YARN-1922.6.patch


 If the main container process is killed externally, ContainerLaunch does not 
 kill the rest of the process group.  Before sending the event that results in 
 the ContainerLaunch.containerCleanup method being called, ContainerLaunch 
 sets the completed flag to true.  Then when cleaning up, it doesn't try to 
 read the pid file if the completed flag is true.  If it read the pid file, it 
 would proceed to send the container a kill signal.  In the case of the 
 DefaultContainerExecutor, this would kill the process group.
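
For reference, a minimal illustration of the process-group kill that the 
description mentions for the DefaultContainerExecutor case (this is not the 
NodeManager code itself, and it assumes the container was started as a 
process-group/session leader):
{code}
// Illustration only: signalling the negative pid targets the whole process
// group, so children die even if the group leader has already exited.
import java.io.IOException;

public class GroupKillSketch {
  static void sigkillProcessGroup(int groupLeaderPid)
      throws IOException, InterruptedException {
    // "kill -9 -- -PID" sends SIGKILL to every member of process group PID
    Process p = new ProcessBuilder("kill", "-9", "--", "-" + groupLeaderPid)
        .inheritIO()
        .start();
    p.waitFor();
  }
}
{code}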





[jira] [Commented] (YARN-3536) ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover

2015-04-23 Thread gu-chi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510320#comment-14510320
 ] 

gu-chi commented on YARN-3536:
--

Thanks. As the exception stack trace is almost the same, I once looked into 
that ticket. That patch is already merged into the environment I currently 
use, so it is not the same cause.

 ZK exception occur when updating AppAttempt status, then NPE thrown when RM 
 do recover
 --

 Key: YARN-3536
 URL: https://issues.apache.org/jira/browse/YARN-3536
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler, resourcemanager
Affects Versions: 2.4.1
Reporter: gu-chi

 Here is a scenario where the Application status is FAILED/FINISHED but the 
 AppAttempt status is null; this causes an NPE during recovery with 
 yarn.resourcemanager.work-preserving-recovery.enabled set to true. The RM 
 should handle recovery gracefully.





[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed

2015-04-23 Thread gu-chi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14508843#comment-14508843
 ] 

gu-chi commented on YARN-2308:
--

Thanks, I saw this and think it is not the same issue. YARN-2340 is triggered 
by stopping a queue, where there is a clear clue of "Failed to submit 
application". My scenario is that a ZK exception occurred and the AppAttempt 
status update failed.

 NPE happened when RM restart after CapacityScheduler queue configuration 
 changed 
 -

 Key: YARN-2308
 URL: https://issues.apache.org/jira/browse/YARN-2308
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.6.0
Reporter: Wangda Tan
Assignee: Chang Li
Priority: Critical
 Fix For: 2.6.0

 Attachments: YARN-2308.0.patch, YARN-2308.1.patch, jira2308.patch, 
 jira2308.patch, jira2308.patch


 I encountered an NPE when the RM restarted
 {code}
 2014-07-16 07:22:46,957 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
 handling event type APP_ATTEMPT_ADDED to the scheduler
 java.lang.NullPointerException
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
 at java.lang.Thread.run(Thread.java:744)
 {code}
 And the RM then fails to restart.
 This is caused by a queue configuration change: I removed some queues and 
 added new ones. So when the RM restarts, it tries to recover historical 
 applications, and when any of those applications' queues has been removed, an 
 NPE is raised.





[jira] [Commented] (YARN-3536) ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover

2015-04-23 Thread gu-chi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14508852#comment-14508852
 ] 

gu-chi commented on YARN-3536:
--

2015-04-21 03:52:31,395 | INFO  | AsyncDispatcher event handler | 
appattempt_1429597538411_0001_02 State change from RUNNING to FINAL_SAVING 
| 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:704)
2015-04-21 03:52:31,397 | INFO  | AsyncDispatcher event handler | Updating 
application application_1429597538411_0001 with final state: FINISHING | 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.rememberTargetTransitionsAndStoreState(RMAppImpl.java:988)
2015-04-21 03:52:31,397 | WARN  | main-SendThread(VM1228:24002) | Session 
0xd4cdaa0557f0005 for server VM1228/9.91.12.28:24002, unexpected error, closing 
socket connection and attempting reconnect | 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1126)
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:368)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1105)
2015-04-21 03:52:31,499 | INFO  | AsyncDispatcher event handler | Exception 
while executing a ZK operation. | 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1098)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss for 
/rmstore/ZKRMStateRoot/RMAppRoot/application_1429597538411_0001/appattempt_1429597538411_0001_02
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1073)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:996)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:993)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1066)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1085)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:993)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:683)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:236)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:219)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:792)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:866)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:861)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:745)

 ZK exception occur when updating AppAttempt status, then NPE thrown when RM 
 do recover
 --

 Key: YARN-3536
 URL: https://issues.apache.org/jira/browse/YARN-3536
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler, resourcemanager
Affects Versions: 2.4.1
Reporter: gu-chi

 Here is a scenario that 

[jira] [Created] (YARN-3536) ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover

2015-04-23 Thread gu-chi (JIRA)
gu-chi created YARN-3536:


 Summary: ZK exception occur when updating AppAttempt status, then 
NPE thrown when RM do recover
 Key: YARN-3536
 URL: https://issues.apache.org/jira/browse/YARN-3536
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler, resourcemanager
Affects Versions: 2.4.1
Reporter: gu-chi


Here is a scenario where the Application status is FAILED/FINISHED but the 
AppAttempt status is null; this causes an NPE during recovery with 
yarn.resourcemanager.work-preserving-recovery.enabled set to true.





[jira] [Commented] (YARN-3536) ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover

2015-04-23 Thread gu-chi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14508855#comment-14508855
 ] 

gu-chi commented on YARN-3536:
--

2015-04-21 04:22:33,923 | INFO  | main-EventThread | Recovering app: 
application_1429597538411_0001 with 2 attempts and final state = FINISHED | 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:700)
2015-04-21 04:22:33,923 | INFO  | main-EventThread | Recovering attempt: 
appattempt_1429597538411_0001_01 with final state: FAILED | 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:734)
2015-04-21 04:22:33,924 | INFO  | main-EventThread | Recovering attempt: 
appattempt_1429597538411_0001_02 with final state: null | 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:734)
2015-04-21 04:22:33,924 | INFO  | main-EventThread | Create AMRMToken for 
ApplicationAttempt: appattempt_1429597538411_0001_02 | 
org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager.createAndGetAMRMToken(AMRMTokenSecretManager.java:195)
2015-04-21 04:22:33,924 | INFO  | main-EventThread | Creating password for 
appattempt_1429597538411_0001_02 | 
org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager.createPassword(AMRMTokenSecretManager.java:307)
2015-04-21 04:22:33,924 | INFO  | main-EventThread | 
appattempt_1429597538411_0001_01 State change from NEW to FAILED | 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:704)
2015-04-21 04:22:33,925 | INFO  | main-EventThread | Registering app attempt : 
appattempt_1429597538411_0001_02 | 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.registerAppAttempt(ApplicationMasterService.java:656)
2015-04-21 04:22:33,925 | ERROR | main-EventThread | Failed to load/recover 
state | 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:533)
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:607)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:941)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:97)

 ZK exception occur when updating AppAttempt status, then NPE thrown when RM 
 do recover
 --

 Key: YARN-3536
 URL: https://issues.apache.org/jira/browse/YARN-3536
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler, resourcemanager
Affects Versions: 2.4.1
Reporter: gu-chi

 Here is a scenario where the Application status is FAILED/FINISHED but the 
 AppAttempt status is null; this causes an NPE during recovery with 
 yarn.resourcemanager.work-preserving-recovery.enabled set to true.





[jira] [Updated] (YARN-3536) ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover

2015-04-23 Thread gu-chi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gu-chi updated YARN-3536:
-
Description: Here is a scenario that Application status is FAILED/FINISHED 
but AppAttempt status is null, this cause NPE when doing recover with 
yarn.resourcemanager.work-preserving-recovery.enabled set to true, RM should 
handle recovery gracefully  (was: Here is a scenario that Application status is 
FAILED/FINISHED but AppAttempt status is null, this cause NPE when doing 
recover with yarn.resourcemanager.work-preserving-recovery.enabled set to true, 
RM should handle recover gracefully)

 ZK exception occur when updating AppAttempt status, then NPE thrown when RM 
 do recover
 --

 Key: YARN-3536
 URL: https://issues.apache.org/jira/browse/YARN-3536
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler, resourcemanager
Affects Versions: 2.4.1
Reporter: gu-chi

 Here is a scenario where the Application status is FAILED/FINISHED but the 
 AppAttempt status is null; this causes an NPE during recovery with 
 yarn.resourcemanager.work-preserving-recovery.enabled set to true. The RM 
 should handle recovery gracefully.





[jira] [Updated] (YARN-3536) ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover

2015-04-23 Thread gu-chi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gu-chi updated YARN-3536:
-
Description: Here is a scenario that Application status is FAILED/FINISHED 
but AppAttempt status is null, this cause NPE when doing recover with 
yarn.resourcemanager.work-preserving-recovery.enabled set to true, RM should 
handle recover gracefully  (was: Here is a scenario that Application status is 
FAILED/FINISHED but AppAttempt status is null, this cause NPE when doing 
recover with yarn.resourcemanager.work-preserving-recovery.enabled set to true)

 ZK exception occur when updating AppAttempt status, then NPE thrown when RM 
 do recover
 --

 Key: YARN-3536
 URL: https://issues.apache.org/jira/browse/YARN-3536
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler, resourcemanager
Affects Versions: 2.4.1
Reporter: gu-chi

 Here is a scenario where the Application status is FAILED/FINISHED but the 
 AppAttempt status is null; this causes an NPE during recovery when 
 yarn.resourcemanager.work-preserving-recovery.enabled is set to true. The RM 
 should handle recovery gracefully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3536) ZK exception occurs when updating AppAttempt status, then NPE thrown when RM does recovery

2015-04-23 Thread gu-chi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508900#comment-14508900
 ] 

gu-chi commented on YARN-3536:
--

Please assign this to me; I would like to work on the fix.

 ZK exception occurs when updating AppAttempt status, then NPE thrown when RM 
 does recovery
 --

 Key: YARN-3536
 URL: https://issues.apache.org/jira/browse/YARN-3536
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler, resourcemanager
Affects Versions: 2.4.1
Reporter: gu-chi

 Here is a scenario where the Application status is FAILED/FINISHED but the 
 AppAttempt status is null; this causes an NPE during recovery when 
 yarn.resourcemanager.work-preserving-recovery.enabled is set to true. The RM 
 should handle recovery gracefully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2308) NPE happened when RM restarted after CapacityScheduler queue configuration changed

2015-04-22 Thread gu-chi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506991#comment-14506991
 ] 

gu-chi commented on YARN-2308:
--

Hi Chang Li, while going through the patches you attached, I noticed that the 
earlier versions contained:
+if (application == null) {
+  LOG.info("Can't retrieve application attempt");
+  return;
+}
but the patch that was finally merged does not include this check. Was it 
dropped on purpose, and if so, what was the concern?
I am now facing a scenario where the App status is FINISHED but the AppAttempt 
status is null; during recovery the application is null in the 
CapacityScheduler, and an NPE occurs. If the application == null check were in 
place, I believe the issue I am hitting would not occur.
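
Below is a minimal, self-contained sketch of the guard being discussed. It is 
not the actual CapacityScheduler source; the class, field, and method names are 
stand-ins used only to show why a lookup-and-return check avoids the NPE when 
the application is missing from the scheduler during recovery.
{code}
import java.util.HashMap;
import java.util.Map;

public class AddAttemptGuardSketch {

  // Stand-in for the scheduler's applications map (application id -> queue name).
  private final Map<String, String> applications = new HashMap<>();

  void addApplicationAttempt(String appId, String attemptId) {
    String queue = applications.get(appId);
    if (queue == null) {
      // Mirrors the proposed "if (application == null) { log and return; }" guard:
      // without it, queue.length() below would throw the NPE seen during recovery
      // of an application whose attempt state was never persisted.
      System.out.println("Can't retrieve application " + appId
          + ", skipping attempt " + attemptId);
      return;
    }
    System.out.println("Attempt " + attemptId + " added to queue " + queue
        + " (queue name length " + queue.length() + ")");
  }

  public static void main(String[] args) {
    AddAttemptGuardSketch scheduler = new AddAttemptGuardSketch();
    scheduler.applications.put("application_1429588953925_0001", "default");
    // Normal path: the application is known to the scheduler.
    scheduler.addApplicationAttempt("application_1429588953925_0001",
        "appattempt_1429588953925_0001_000001");
    // Recovery path from this issue: a FINISHED/FAILED application whose attempt
    // state is null is never re-added to the scheduler, so the lookup misses.
    scheduler.addApplicationAttempt("application_1429588953925_0002",
        "appattempt_1429588953925_0002_000001");
  }
}
{code}
The open question is the one asked above: whether skipping such attempts 
quietly, as in this sketch, or failing the recovery more visibly is the 
behaviour the merged patch intended.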

 NPE happened when RM restarted after CapacityScheduler queue configuration 
 changed 
 -

 Key: YARN-2308
 URL: https://issues.apache.org/jira/browse/YARN-2308
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.6.0
Reporter: Wangda Tan
Assignee: Chang Li
Priority: Critical
 Fix For: 2.6.0

 Attachments: YARN-2308.0.patch, YARN-2308.1.patch, jira2308.patch, 
 jira2308.patch, jira2308.patch


 I encountered an NPE when the RM restarted
 {code}
 2014-07-16 07:22:46,957 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
 handling event type APP_ATTEMPT_ADDED to the scheduler
 java.lang.NullPointerException
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
 at java.lang.Thread.run(Thread.java:744)
 {code}
 And the RM then fails to restart.
 This is caused by the queue configuration change: I removed some queues and 
 added new ones. When the RM restarts, it tries to recover historical 
 applications, and if the queue of any of those applications has been removed, 
 an NPE is raised.
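
 As an illustration of that trigger (the queue names and capacity values below 
 are invented for the example, not taken from the cluster described above), a 
 capacity-scheduler.xml change like the following before the restart leaves 
 stored applications pointing at a queue that no longer exists:
 {code}
 <!-- Before the restart: root has queues a and b -->
 <property>
   <name>yarn.scheduler.capacity.root.queues</name>
   <value>a,b</value>
 </property>

 <!-- After the change: queue b removed, queue c added. Applications that were
      submitted to b are still in the RM state store, so their recovery finds
      no matching queue and the scheduler hits the NPE above. -->
 <property>
   <name>yarn.scheduler.capacity.root.queues</name>
   <value>a,c</value>
 </property>
 <property>
   <name>yarn.scheduler.capacity.root.a.capacity</name>
   <value>50</value>
 </property>
 <property>
   <name>yarn.scheduler.capacity.root.c.capacity</name>
   <value>50</value>
 </property>
 {code}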



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

