[jira] [Updated] (YARN-5403) yarn top command does not execute correctly
[ https://issues.apache.org/jira/browse/YARN-5403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gu-chi updated YARN-5403:
-------------------------
    Attachment: YARN-5403.patch

> yarn top command does not execute correctly
> -------------------------------------------
>
>                 Key: YARN-5403
>                 URL: https://issues.apache.org/jira/browse/YARN-5403
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 2.7.2
>            Reporter: gu-chi
>         Attachments: YARN-5403.patch
>
> When executing {{yarn top}}, I always get the exception below:
> {quote}
> 16/07/19 19:55:12 ERROR cli.TopCLI: Could not fetch RM start time
> java.net.ConnectException: Connection refused
> 	at java.net.PlainSocketImpl.socketConnect(Native Method)
> 	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
> 	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:204)
> 	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> 	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> 	at java.net.Socket.connect(Socket.java:589)
> 	at java.net.Socket.connect(Socket.java:538)
> 	at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
> 	at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
> 	at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
> 	at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
> 	at sun.net.www.http.HttpClient.New(HttpClient.java:308)
> 	at sun.net.www.http.HttpClient.New(HttpClient.java:326)
> 	at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
> 	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
> 	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
> 	at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
> 	at org.apache.hadoop.yarn.client.cli.TopCLI.getRMStartTime(TopCLI.java:747)
> 	at org.apache.hadoop.yarn.client.cli.TopCLI.run(TopCLI.java:443)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> 	at org.apache.hadoop.yarn.client.cli.TopCLI.main(TopCLI.java:421)
> YARN top - 19:55:13, up 17001d, 11:55, 0 active users, queue(s): root
> {quote}
> Looking into it, the method {{getRMStartTime}} hardcodes HTTP no matter what the {{yarn.http.policy}} setting is; it should use HTTPS when the policy requires it.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
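The suggested fix, choosing the URL scheme from the {{yarn.http.policy}} setting instead of hardcoding HTTP, can be sketched roughly as below. This is a simplified illustration and not the attached patch; the method name and the address parameters are invented for the example, while the policy values ({{HTTP_ONLY}}, {{HTTPS_ONLY}}) and the {{/ws/v1/cluster/info}} RM REST endpoint are the real ones.

```java
// Simplified sketch: pick the RM web scheme from yarn.http.policy
// instead of hardcoding "http://". Not the actual YARN-5403 patch;
// the method and its parameters here are illustrative only.
public class RmUrlSketch {

    /** Returns the cluster-info URL for the given policy and web addresses. */
    public static String rmInfoUrl(String httpPolicy, String httpAddress,
                                   String httpsAddress) {
        // HTTPS_ONLY means the web apps are served over HTTPS only.
        if ("HTTPS_ONLY".equals(httpPolicy)) {
            return "https://" + httpsAddress + "/ws/v1/cluster/info";
        }
        return "http://" + httpAddress + "/ws/v1/cluster/info";
    }

    public static void main(String[] args) {
        System.out.println(rmInfoUrl("HTTP_ONLY", "rm:8088", "rm:8090"));
        System.out.println(rmInfoUrl("HTTPS_ONLY", "rm:8088", "rm:8090"));
    }
}
```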
[jira] [Updated] (YARN-5403) yarn top command does not execute correctly
[ https://issues.apache.org/jira/browse/YARN-5403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gu-chi updated YARN-5403:
-------------------------
    Attachment: (was: YARN-5403.patch)

> yarn top command does not execute correctly
> -------------------------------------------
>
>                 Key: YARN-5403
>                 URL: https://issues.apache.org/jira/browse/YARN-5403
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 2.7.2
>            Reporter: gu-chi
>
> When executing {{yarn top}}, I always get the exception below:
> {quote}
> 16/07/19 19:55:12 ERROR cli.TopCLI: Could not fetch RM start time
> java.net.ConnectException: Connection refused
> 	at java.net.PlainSocketImpl.socketConnect(Native Method)
> 	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
> 	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:204)
> 	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> 	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> 	at java.net.Socket.connect(Socket.java:589)
> 	at java.net.Socket.connect(Socket.java:538)
> 	at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
> 	at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
> 	at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
> 	at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
> 	at sun.net.www.http.HttpClient.New(HttpClient.java:308)
> 	at sun.net.www.http.HttpClient.New(HttpClient.java:326)
> 	at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
> 	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
> 	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
> 	at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
> 	at org.apache.hadoop.yarn.client.cli.TopCLI.getRMStartTime(TopCLI.java:747)
> 	at org.apache.hadoop.yarn.client.cli.TopCLI.run(TopCLI.java:443)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> 	at org.apache.hadoop.yarn.client.cli.TopCLI.main(TopCLI.java:421)
> YARN top - 19:55:13, up 17001d, 11:55, 0 active users, queue(s): root
> {quote}
> Looking into it, the method {{getRMStartTime}} hardcodes HTTP no matter what the {{yarn.http.policy}} setting is; it should use HTTPS when the policy requires it.
[jira] [Updated] (YARN-5403) yarn top command does not execute correctly
[ https://issues.apache.org/jira/browse/YARN-5403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gu-chi updated YARN-5403:
-------------------------
    Attachment: YARN-5403.patch

> yarn top command does not execute correctly
> -------------------------------------------
>
>                 Key: YARN-5403
>                 URL: https://issues.apache.org/jira/browse/YARN-5403
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 2.7.2
>            Reporter: gu-chi
>         Attachments: YARN-5403.patch
>
> When executing {{yarn top}}, I always get the exception below:
> {quote}
> 16/07/19 19:55:12 ERROR cli.TopCLI: Could not fetch RM start time
> java.net.ConnectException: Connection refused
> 	at java.net.PlainSocketImpl.socketConnect(Native Method)
> 	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
> 	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:204)
> 	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> 	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> 	at java.net.Socket.connect(Socket.java:589)
> 	at java.net.Socket.connect(Socket.java:538)
> 	at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
> 	at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
> 	at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
> 	at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
> 	at sun.net.www.http.HttpClient.New(HttpClient.java:308)
> 	at sun.net.www.http.HttpClient.New(HttpClient.java:326)
> 	at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
> 	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
> 	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
> 	at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
> 	at org.apache.hadoop.yarn.client.cli.TopCLI.getRMStartTime(TopCLI.java:747)
> 	at org.apache.hadoop.yarn.client.cli.TopCLI.run(TopCLI.java:443)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> 	at org.apache.hadoop.yarn.client.cli.TopCLI.main(TopCLI.java:421)
> YARN top - 19:55:13, up 17001d, 11:55, 0 active users, queue(s): root
> {quote}
> Looking into it, the method {{getRMStartTime}} hardcodes HTTP no matter what the {{yarn.http.policy}} setting is; it should use HTTPS when the policy requires it.
[jira] [Created] (YARN-5403) yarn top command does not execute correctly
gu-chi created YARN-5403:
-------------------------

             Summary: yarn top command does not execute correctly
                 Key: YARN-5403
                 URL: https://issues.apache.org/jira/browse/YARN-5403
             Project: Hadoop YARN
          Issue Type: Bug
          Components: yarn
    Affects Versions: 2.7.2
            Reporter: gu-chi

When executing {{yarn top}}, I always get the exception below:
{quote}
16/07/19 19:55:12 ERROR cli.TopCLI: Could not fetch RM start time
java.net.ConnectException: Connection refused
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:204)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:589)
	at java.net.Socket.connect(Socket.java:538)
	at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
	at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
	at sun.net.www.http.HttpClient.New(HttpClient.java:308)
	at sun.net.www.http.HttpClient.New(HttpClient.java:326)
	at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
	at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
	at org.apache.hadoop.yarn.client.cli.TopCLI.getRMStartTime(TopCLI.java:747)
	at org.apache.hadoop.yarn.client.cli.TopCLI.run(TopCLI.java:443)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
	at org.apache.hadoop.yarn.client.cli.TopCLI.main(TopCLI.java:421)
YARN top - 19:55:13, up 17001d, 11:55, 0 active users, queue(s): root
{quote}
Looking into it, the method {{getRMStartTime}} hardcodes HTTP no matter what the {{yarn.http.policy}} setting is; it should use HTTPS when the policy requires it.
[jira] [Resolved] (YARN-3678) DelayedProcessKiller may kill a process other than the container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gu-chi resolved YARN-3678.
--------------------------
    Resolution: Duplicate

> DelayedProcessKiller may kill a process other than the container
> -----------------------------------------------------------------
>
>                 Key: YARN-3678
>                 URL: https://issues.apache.org/jira/browse/YARN-3678
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.6.0, 2.7.2
>            Reporter: gu-chi
>            Priority: Critical
>
> Suppose a container has finished and cleanup runs. The PID file still exists and triggers a signalContainer call, which kills the process with the PID from the PID file. Because the container has already finished, that PID may have been reused by another process, which can cause serious issues.
> My NM was killed unexpectedly, and what I described can be the cause, even though it occurs only rarely.
[jira] [Created] (YARN-4536) DelayedProcessKiller may not work under heavy workload
gu-chi created YARN-4536:
-------------------------

             Summary: DelayedProcessKiller may not work under heavy workload
                 Key: YARN-4536
                 URL: https://issues.apache.org/jira/browse/YARN-4536
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 2.7.1
            Reporter: gu-chi

I am facing orphan container processes. Here is the scenario:
Under heavy task load, the NM machine's CPU usage can reach almost 100%. When a container gets a kill event it receives {{SIGTERM}}, the parent process exits, and the container process is left to the OS. The container process needs to handle some shutdown logic but can hardly get any CPU time. We would expect a {{SIGKILL}} from {{DelayedProcessKiller}}, but the parent process recorded as the container PID no longer exists, so the kill command cannot reach the container process. This is how orphan container processes arise.
The orphan process does exit after some time, but that period can be very long and makes the OS state worse; as I observed, it can be several hours.
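One common remedy for the scenario above is to signal the whole process group rather than the single recorded parent PID, so that {{SIGKILL}} still reaches the container's children after the parent has exited. A minimal sketch, assuming the container was started in its own process group; this is illustrative only, not the actual NodeManager session handling.

```java
// Sketch: signalling the process group (negative PID) instead of the
// single recorded parent PID, so SIGKILL still reaches the container's
// children after the parent has exited. Illustrative only.
public class ProcessGroupKillSketch {

    /** Builds the shell command that signals every member of a process group. */
    public static String[] killGroupCommand(int signal, int pgid) {
        // "kill -9 -- -PGID" sends the signal to the whole group;
        // "--" stops option parsing so the negative PGID is not read as a flag.
        return new String[] { "kill", "-" + signal, "--", "-" + pgid };
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", killGroupCommand(9, 4242)));
        // To actually send it: new ProcessBuilder(killGroupCommand(9, pgid)).start();
    }
}
```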
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill a process other than the container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082201#comment-15082201 ]

gu-chi commented on YARN-3678:
------------------------------

Same issue, as confirmed with [~hex108].

> DelayedProcessKiller may kill a process other than the container
> -----------------------------------------------------------------
>
>                 Key: YARN-3678
>                 URL: https://issues.apache.org/jira/browse/YARN-3678
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.6.0
>            Reporter: gu-chi
>            Priority: Critical
>
> Suppose a container has finished and cleanup runs. The PID file still exists and triggers a signalContainer call, which kills the process with the PID from the PID file. Because the container has already finished, that PID may have been reused by another process, which can cause serious issues.
> My NM was killed unexpectedly, and what I described can be the cause, even though it occurs only rarely.
[jira] [Resolved] (YARN-4536) DelayedProcessKiller may not work under heavy workload
[ https://issues.apache.org/jira/browse/YARN-4536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gu-chi resolved YARN-4536.
--------------------------
    Resolution: Not A Problem

On further analysis, this was introduced by a custom modification of ours; sorry for the bother.

> DelayedProcessKiller may not work under heavy workload
> ------------------------------------------------------
>
>                 Key: YARN-4536
>                 URL: https://issues.apache.org/jira/browse/YARN-4536
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.1
>            Reporter: gu-chi
>
> I am facing orphan container processes. Here is the scenario:
> Under heavy task load, the NM machine's CPU usage can reach almost 100%. When a container gets a kill event it receives {{SIGTERM}}, the parent process exits, and the container process is left to the OS. The container process needs to handle some shutdown logic but can hardly get any CPU time. We would expect a {{SIGKILL}} from {{DelayedProcessKiller}}, but the parent process recorded as the container PID no longer exists, so the kill command cannot reach the container process. This is how orphan container processes arise.
> The orphan process does exit after some time, but that period can be very long and makes the OS state worse; as I observed, it can be several hours.
[jira] [Commented] (YARN-4536) DelayedProcessKiller may not work under heavy workload
[ https://issues.apache.org/jira/browse/YARN-4536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15081159#comment-15081159 ]

gu-chi commented on YARN-4536:
------------------------------

Thanks for the reply. I had not realized the process-group behavior; this seems to have been introduced by my own change. I added a condition that checks whether the container-executor process exists, since I once hit YARN-3678: in my logic, if the parent process does not belong to this container, no kill signal is sent. I saw you also faced the same issue. Can your patch deal with that scenario without introducing this problem?

> DelayedProcessKiller may not work under heavy workload
> ------------------------------------------------------
>
>                 Key: YARN-4536
>                 URL: https://issues.apache.org/jira/browse/YARN-4536
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.1
>            Reporter: gu-chi
>
> I am facing orphan container processes. Here is the scenario:
> Under heavy task load, the NM machine's CPU usage can reach almost 100%. When a container gets a kill event it receives {{SIGTERM}}, the parent process exits, and the container process is left to the OS. The container process needs to handle some shutdown logic but can hardly get any CPU time. We would expect a {{SIGKILL}} from {{DelayedProcessKiller}}, but the parent process recorded as the container PID no longer exists, so the kill command cannot reach the container process. This is how orphan container processes arise.
> The orphan process does exit after some time, but that period can be very long and makes the OS state worse; as I observed, it can be several hours.
[jira] [Created] (YARN-4481) negative pending resource of queues leads to applications in accepted status indefinitely
gu-chi created YARN-4481:
-------------------------

             Summary: negative pending resource of queues leads to applications in accepted status indefinitely
                 Key: YARN-4481
                 URL: https://issues.apache.org/jira/browse/YARN-4481
             Project: Hadoop YARN
          Issue Type: Bug
          Components: capacity scheduler
    Affects Versions: 2.7.2
            Reporter: gu-chi
            Priority: Critical

Met a scenario of negative pending resource with the capacity scheduler. In JMX it shows:
{noformat}
"PendingMB" : -4096,
"PendingVCores" : -1,
"PendingContainers" : -1,
{noformat}
Full JMX information is attached.
This is not just a JMX UI issue; the actual pending resource of the queue is also negative, as I see from the debug log:
bq. DEBUG | ResourceManager Event Processor | Skip this queue=root, because it doesn't need more resource, schedulingMode=RESPECT_PARTITION_EXCLUSIVITY node-partition= | ParentQueue.java
This leads to the {{NULL_ASSIGNMENT}}.
The background: hundreds of applications were submitted, consuming all cluster resources, and reservations happened. While they were running, network faults (delay, jitter, repeat, packet loss, and disorder) were injected by a tool, and then most of the submitted applications were killed.
Is anyone else facing negative pending resource, or does anyone have an idea of how this can happen?
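Since the negative values surface in JMX, a small sanity check over the queue metrics can at least detect the condition early. A hedged sketch, assuming the JMX payload has already been parsed into a map keyed by the metric names quoted above; the parsing and transport are not shown.

```java
import java.util.Map;

// Sanity check over queue metrics pulled from JMX: pending resources
// should never go negative, so flag any "Pending*" metric that does.
// Metric names follow the JMX keys quoted in the report.
public class PendingMetricsCheck {

    /** Returns true if any Pending* metric in the map is negative. */
    public static boolean hasNegativePending(Map<String, Long> queueMetrics) {
        for (Map.Entry<String, Long> e : queueMetrics.entrySet()) {
            if (e.getKey().startsWith("Pending") && e.getValue() < 0) {
                return true;  // e.g. "PendingMB" : -4096
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Long> metrics =
            Map.of("PendingMB", -4096L, "PendingVCores", -1L, "AllocatedMB", 512L);
        System.out.println(hasNegativePending(metrics));
    }
}
```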
[jira] [Updated] (YARN-4481) negative pending resource of queues leads to applications in accepted status indefinitely
[ https://issues.apache.org/jira/browse/YARN-4481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gu-chi updated YARN-4481:
-------------------------
    Attachment: jmx.txt

> negative pending resource of queues leads to applications in accepted status indefinitely
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-4481
>                 URL: https://issues.apache.org/jira/browse/YARN-4481
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>    Affects Versions: 2.7.2
>            Reporter: gu-chi
>            Priority: Critical
>         Attachments: jmx.txt
>
> Met a scenario of negative pending resource with the capacity scheduler. In JMX it shows:
> {noformat}
> "PendingMB" : -4096,
> "PendingVCores" : -1,
> "PendingContainers" : -1,
> {noformat}
> Full JMX information is attached.
> This is not just a JMX UI issue; the actual pending resource of the queue is also negative, as I see from the debug log:
> bq. DEBUG | ResourceManager Event Processor | Skip this queue=root, because it doesn't need more resource, schedulingMode=RESPECT_PARTITION_EXCLUSIVITY node-partition= | ParentQueue.java
> This leads to the {{NULL_ASSIGNMENT}}.
> The background: hundreds of applications were submitted, consuming all cluster resources, and reservations happened. While they were running, network faults (delay, jitter, repeat, packet loss, and disorder) were injected by a tool, and then most of the submitted applications were killed.
> Is anyone else facing negative pending resource, or does anyone have an idea of how this can happen?
[jira] [Commented] (YARN-4481) negative pending resource of queues leads to applications in accepted status indefinitely
[ https://issues.apache.org/jira/browse/YARN-4481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15065175#comment-15065175 ]

gu-chi commented on YARN-4481:
------------------------------

Same here, using DRC. :(
Debug logging was only enabled after I saw the issue, so there is no debug information from before that. I have the RM log, but it is several GB, covering hundreds of applications.

> negative pending resource of queues leads to applications in accepted status indefinitely
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-4481
>                 URL: https://issues.apache.org/jira/browse/YARN-4481
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>    Affects Versions: 2.7.2
>            Reporter: gu-chi
>            Priority: Critical
>         Attachments: jmx.txt
>
> Met a scenario of negative pending resource with the capacity scheduler. In JMX it shows:
> {noformat}
> "PendingMB" : -4096,
> "PendingVCores" : -1,
> "PendingContainers" : -1,
> {noformat}
> Full JMX information is attached.
> This is not just a JMX UI issue; the actual pending resource of the queue is also negative, as I see from the debug log:
> bq. DEBUG | ResourceManager Event Processor | Skip this queue=root, because it doesn't need more resource, schedulingMode=RESPECT_PARTITION_EXCLUSIVITY node-partition= | ParentQueue.java
> This leads to the {{NULL_ASSIGNMENT}}.
> The background: hundreds of applications were submitted, consuming all cluster resources, and reservations happened. While they were running, network faults (delay, jitter, repeat, packet loss, and disorder) were injected by a tool, and then most of the submitted applications were killed.
> Is anyone else facing negative pending resource, or does anyone have an idea of how this can happen?
[jira] [Commented] (YARN-4481) negative pending resource of queues leads to applications in accepted status indefinitely
[ https://issues.apache.org/jira/browse/YARN-4481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15065181#comment-15065181 ]

gu-chi commented on YARN-4481:
------------------------------

I added some extra logging to trace this. Do you have any idea how it might be reproduced?

> negative pending resource of queues leads to applications in accepted status indefinitely
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-4481
>                 URL: https://issues.apache.org/jira/browse/YARN-4481
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>    Affects Versions: 2.7.2
>            Reporter: gu-chi
>            Priority: Critical
>         Attachments: jmx.txt
>
> Met a scenario of negative pending resource with the capacity scheduler. In JMX it shows:
> {noformat}
> "PendingMB" : -4096,
> "PendingVCores" : -1,
> "PendingContainers" : -1,
> {noformat}
> Full JMX information is attached.
> This is not just a JMX UI issue; the actual pending resource of the queue is also negative, as I see from the debug log:
> bq. DEBUG | ResourceManager Event Processor | Skip this queue=root, because it doesn't need more resource, schedulingMode=RESPECT_PARTITION_EXCLUSIVITY node-partition= | ParentQueue.java
> This leads to the {{NULL_ASSIGNMENT}}.
> The background: hundreds of applications were submitted, consuming all cluster resources, and reservations happened. While they were running, network faults (delay, jitter, repeat, packet loss, and disorder) were injected by a tool, and then most of the submitted applications were killed.
> Is anyone else facing negative pending resource, or does anyone have an idea of how this can happen?
[jira] [Commented] (YARN-4427) NPE on handleNMContainerStatus when NM is registering to RM
[ https://issues.apache.org/jira/browse/YARN-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15044871#comment-15044871 ]

gu-chi commented on YARN-4427:
------------------------------

NM recovery is enabled; this is the precondition.

> NPE on handleNMContainerStatus when NM is registering to RM
> -----------------------------------------------------------
>
>                 Key: YARN-4427
>                 URL: https://issues.apache.org/jira/browse/YARN-4427
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Brahma Reddy Battula
>            Assignee: Brahma Reddy Battula
>            Priority: Critical
>
> *Seen the following in one of our environments when the AM got a container allocated but failed to update it in ZK while the cluster had intermittent network problems (up and down).*
> {noformat}
> 2015-12-07 16:39:38,489 | WARN | IPC Server handler 49 on 26003 | IPC Server handler 49 on 26003, call org.apache.hadoop.yarn.server.api.ResourceTrackerPB.registerNodeManager from 9.91.8.220:52169 Call#17 Retry#0 | Server.java:2107
> java.lang.NullPointerException
> 	at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.handleNMContainerStatus(ResourceTrackerService.java:286)
> 	at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.registerNodeManager(ResourceTrackerService.java:395)
> 	at org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceTrackerPBServiceImpl.registerNodeManager(ResourceTrackerPBServiceImpl.java:54)
> 	at org.apache.hadoop.yarn.proto.ResourceTracker$ResourceTrackerService$2.callBlockingMethod(ResourceTracker.java:79)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2088)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2084)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1673)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2082)
> {noformat}
> Corresponding code; it might not match {{branch-2.7}}/trunk since we have modified it internally.
> {code}
> 284     RMAppAttempt rmAppAttempt = rmApp.getRMAppAttempt(appAttemptId);
> 285     Container masterContainer = rmAppAttempt.getMasterContainer();
> 286     if (masterContainer.getId().equals(containerStatus.getContainerId())
> 287         && containerStatus.getContainerState() == ContainerState.COMPLETE) {
> 288       ContainerStatus status =
> 289           ContainerStatus.newInstance(containerStatus.getContainerId(),
> 290               containerStatus.getContainerState(), containerStatus.getDiagnostics(),
> 291               containerStatus.getContainerExitStatus());
> 292       // sending master container finished event.
> 293       RMAppAttemptContainerFinishedEvent evt =
> 294           new RMAppAttemptContainerFinishedEvent(appAttemptId, status,
> 295               nodeId);
> 296       rmContext.getDispatcher().getEventHandler().handle(evt);
> 297     }
> {code}
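The guard that would avoid this NPE is a null check at each step of the lookup chain, since during NM re-registration the application, the attempt, or the master container may not have been recorded (for example when the ZK update failed). A standalone sketch of the pattern, with plain maps standing in for the real RMApp/RMAppAttempt lookups; this is not the committed fix.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch of the null-guard pattern that avoids the NPE at
// ResourceTrackerService.java:286: each lookup in the chain is checked
// before dereferencing. Plain maps stand in for the RMApp/RMAppAttempt
// chain; the real types and recovery semantics differ.
public class NullGuardSketch {

    /** Returns the master container id, or null if any link in the chain is missing. */
    public static String masterContainerId(Map<String, Map<String, String>> apps,
                                           String appId, String attemptId) {
        Map<String, String> attempts = apps.get(appId);
        if (attempts == null) {
            return null;            // app already removed or not yet recovered
        }
        String masterContainer = attempts.get(attemptId);
        if (masterContainer == null) {
            return null;            // AM container never stored (e.g. ZK update failed)
        }
        return masterContainer;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> apps = new HashMap<>();
        apps.put("app_1", new HashMap<>());  // attempt map exists, no container recorded
        // Both lookups return null instead of throwing a NullPointerException.
        System.out.println(masterContainerId(apps, "app_1", "attempt_1"));
        System.out.println(masterContainerId(apps, "app_2", "attempt_1"));
    }
}
```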
[jira] [Commented] (YARN-3730) scheduler reserves more resources than required
[ https://issues.apache.org/jira/browse/YARN-3730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566919#comment-14566919 ]

gu-chi commented on YARN-3730:
------------------------------

Thanks Naga. As the improvements are not merged into the version I am currently using, that feature is not invoked; I will set yarn.scheduler.capacity.reservations-continue-look-all-nodes to false on version 2.7.0 and check the outcome.

scheduler reserves more resources than required
-----------------------------------------------

                 Key: YARN-3730
                 URL: https://issues.apache.org/jira/browse/YARN-3730
             Project: Hadoop YARN
          Issue Type: Bug
          Components: scheduler
            Reporter: gu-chi

Using the capacity scheduler in an environment of 3 NMs with 9 vcores each, I ran a Spark job with 4 executors of 5 cores each. As expected, one executor is unable to start and gets reserved, but actually more containers than that are reserved. Because of this, I cannot run other, smaller jobs.
Checking the capacity scheduler, the {{needContainers}} method in LeafQueue.java computes a 'starvation' value, and this causes more containers to be reserved than required. Any idea or suggestion on this?
[jira] [Updated] (YARN-3678) DelayedProcessKiller may kill a process other than the container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gu-chi updated YARN-3678:
-------------------------
    Attachment: YARN-3678.patch

DelayedProcessKiller may kill a process other than the container
----------------------------------------------------------------

                 Key: YARN-3678
                 URL: https://issues.apache.org/jira/browse/YARN-3678
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 2.6.0
            Reporter: gu-chi
            Priority: Critical
         Attachments: YARN-3678.patch

Suppose a container has finished and cleanup runs. The PID file still exists and triggers a signalContainer call, which kills the process with the PID from the PID file. Because the container has already finished, that PID may have been reused by another process, which can cause serious issues.
My NM was killed unexpectedly, and what I described can be the cause, even though it occurs only rarely.
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill a process other than the container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562180#comment-14562180 ]

gu-chi commented on YARN-3678:
------------------------------

I opened a pull request for this: https://github.com/apache/hadoop/pull/20/

DelayedProcessKiller may kill a process other than the container
----------------------------------------------------------------

                 Key: YARN-3678
                 URL: https://issues.apache.org/jira/browse/YARN-3678
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 2.6.0
            Reporter: gu-chi
            Priority: Critical

Suppose a container has finished and cleanup runs. The PID file still exists and triggers a signalContainer call, which kills the process with the PID from the PID file. Because the container has already finished, that PID may have been reused by another process, which can cause serious issues.
My NM was killed unexpectedly, and what I described can be the cause, even though it occurs only rarely.
[jira] [Created] (YARN-3730) scheduler reserves more resources than required
gu-chi created YARN-3730:
-------------------------

             Summary: scheduler reserves more resources than required
                 Key: YARN-3730
                 URL: https://issues.apache.org/jira/browse/YARN-3730
             Project: Hadoop YARN
          Issue Type: Bug
          Components: scheduler
            Reporter: gu-chi

Using the capacity scheduler in an environment of 3 NMs with 9 vcores each, I ran a Spark job with 4 executors of 5 cores each. As expected, one executor is unable to start and gets reserved, but actually more containers than that are reserved. Because of this, I cannot run other, smaller jobs.
Checking the capacity scheduler, the {{needContainers}} method in LeafQueue.java computes a 'starvation' value, and this causes more containers to be reserved than required. Any idea or suggestion on this?
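The resource arithmetic behind the report can be checked directly: 3 NMs with 9 vcores each give 27 vcores, each node fits only one 5-vcore executor, so 3 of the 4 executors can start and exactly one should end up reserved. A small sketch of that accounting; the 1-vcore AM container is an assumption for illustration.

```java
import java.util.Arrays;

// Checks the expected reservation count for the reported scenario:
// 3 NMs x 9 vcores, 4 executors x 5 vcores each. The 1-vcore AM
// container is an assumption, not stated in the report.
public class ReservationMathSketch {

    /** Counts how many whole executors fit across the nodes after placing the AM. */
    public static int executorsThatFit(int nodes, int vcoresPerNode,
                                       int executorVcores, int amVcores) {
        int[] free = new int[nodes];
        Arrays.fill(free, vcoresPerNode);
        free[0] -= amVcores;                   // place the AM on the first node
        int fit = 0;
        for (int i = 0; i < nodes; i++) {
            fit += free[i] / executorVcores;   // whole executors per node
        }
        return fit;
    }

    public static void main(String[] args) {
        int fit = executorsThatFit(3, 9, 5, 1);
        System.out.println(fit);       // 3 executors can start
        System.out.println(4 - fit);   // 1 executor should be reserved
    }
}
```

So one reservation is the expected outcome; any extra reservations come from the 'starvation' heuristic mentioned above.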
[jira] [Updated] (YARN-3678) DelayedProcessKiller may kill a process other than the container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gu-chi updated YARN-3678:
-------------------------
    Attachment: (was: YARN-3678.patch)

DelayedProcessKiller may kill a process other than the container
----------------------------------------------------------------

                 Key: YARN-3678
                 URL: https://issues.apache.org/jira/browse/YARN-3678
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 2.6.0
            Reporter: gu-chi
            Priority: Critical

Suppose a container has finished and cleanup runs. The PID file still exists and triggers a signalContainer call, which kills the process with the PID from the PID file. Because the container has already finished, that PID may have been reused by another process, which can cause serious issues.
My NM was killed unexpectedly, and what I described can be the cause, even though it occurs only rarely.
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill a process other than the container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551681#comment-14551681 ]

gu-chi commented on YARN-3678:
------------------------------

I see that the probability is low, but under heavy task load it occurs frequently. I would suggest adding a check before the kill that verifies the process ID still belongs to the container.

DelayedProcessKiller may kill a process other than the container
----------------------------------------------------------------

                 Key: YARN-3678
                 URL: https://issues.apache.org/jira/browse/YARN-3678
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 2.6.0
            Reporter: gu-chi
            Priority: Critical

Suppose a container has finished and cleanup runs. The PID file still exists and triggers a signalContainer call, which kills the process with the PID from the PID file. Because the container has already finished, that PID may have been reused by another process, which can cause serious issues.
My NM was killed unexpectedly, and what I described can be the cause, even though it occurs only rarely.
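The suggested check, verifying that the PID from the PID file still belongs to the container before signalling it, could look roughly like the Linux-specific sketch below, which scans {{/proc/<pid>/cmdline}} for the container id. This is illustrative only; a real fix would also have to deal with the race between the check and the kill, and with permissions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch of the suggested guard: before signalling the PID from the
// container's PID file, confirm the live process actually belongs to
// that container by looking for the container id in /proc/<pid>/cmdline.
// Linux-specific and illustrative; not the actual patch.
public class PidOwnershipSketch {

    /** Returns true only if /proc/<pid>/cmdline mentions the container id. */
    public static boolean pidBelongsToContainer(long pid, String containerId) {
        Path cmdline = Paths.get("/proc/" + pid + "/cmdline");
        try {
            // cmdline arguments are NUL-separated; a substring check suffices here.
            String args = new String(Files.readAllBytes(cmdline)).replace('\0', ' ');
            return args.contains(containerId);
        } catch (IOException e) {
            return false;  // process already gone: nothing to kill
        }
    }

    public static void main(String[] args) {
        // A PID that no longer exists must never be signalled.
        System.out.println(
            pidBelongsToContainer(999999999L, "container_1463_0001_01_000002"));
    }
}
```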
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14551756#comment-14551756 ] gu-chi commented on YARN-3678: -- The pid may be reused not only by a process but also by a thread: Linux allocates process and thread IDs from the same space, and killing via a recycled ID that now names a thread can take down that thread's whole process as well. And a new thread can easily be started within 250ms, right?
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14550390#comment-14550390 ] gu-chi commented on YARN-3678: -- I think decreasing the OS max-pid setting (kernel.pid_max) should increase the chance of reproducing this; working on it.
[jira] [Created] (YARN-3678) DelayedProcessKiller may kill other process other than container
gu-chi created YARN-3678: Summary: DelayedProcessKiller may kill other process other than container Key: YARN-3678 URL: https://issues.apache.org/jira/browse/YARN-3678 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: gu-chi Priority: Critical
[jira] [Commented] (YARN-1922) Process group remains alive after container process is killed externally
[ https://issues.apache.org/jira/browse/YARN-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547997#comment-14547997 ] gu-chi commented on YARN-1922: -- Hi, I see you commented here to check in YARN-1922.5.patch, but why was YARN-1922.6.patch merged instead? What was the concern? I find this solution may have a defect. Suppose a container has finished and its cleanup runs: the PID file still exists and will trigger signalContainer once, which kills the process whose pid is in the PID file. But since the container has already finished, that pid may have been reused by another process, which can cause a serious problem. My NM was once killed unexpectedly, and what I describe here could be the cause, even though it occurs rarely. Below is the error scenario: task cleanup had not finished, but the NM was killed and then restarted.
2015-05-14 21:49:03,063 | INFO | DeletionService #1 | Deleting absolute path : /export/data1/yarn/nm/localdir/usercache/omm/appcache/application_1430456703237_8047/container_1430456703237_8047_01_12582917 | org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:400)
2015-05-14 21:49:03,063 | INFO | AsyncDispatcher event handler | Container container_1430456703237_8047_01_12582917 transitioned from EXITED_WITH_SUCCESS to DONE | org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:918)
2015-05-14 21:49:03,064 | INFO | AsyncDispatcher event handler | Removing container_1430456703237_8047_01_12582917 from application application_1430456703237_8047 | org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl$ContainerDoneTransition.transition(ApplicationImpl.java:340)
2015-05-14 21:49:03,064 | INFO | AsyncDispatcher event handler | Considering container container_1430456703237_8047_01_12582917 for log-aggregation | org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.startContainerLogAggregation(AppLogAggregatorImpl.java:342)
2015-05-14 21:49:03,064 | INFO | AsyncDispatcher event handler | Got event CONTAINER_STOP for appId application_1430456703237_8047 | org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.handle(AuxServices.java:196)
2015-05-14 21:49:03,152 | INFO | Node Status Updater | Removed completed containers from NM context: [container_1430456703237_8047_01_12582917] | org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.removeCompletedContainersFromContext(NodeStatusUpdaterImpl.java:417)
2015-05-14 21:49:03,293 | INFO | Task killer for 26924 | Using linux-container-executor.users as omm | org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:349)
2015-05-14 21:49:20,667 | INFO | main | STARTUP_MSG: / STARTUP_MSG: Starting NodeManager STARTUP_MSG: host = SR6S11/192.168.10.21 STARTUP_MSG: args = [] STARTUP_MSG: version = V100R001C00 STARTUP_MSG: classpath =
Process group remains alive after container process is killed externally
Key: YARN-1922
URL: https://issues.apache.org/jira/browse/YARN-1922
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.4.0
Environment: CentOS 6.4
Reporter: Billie Rinaldi
Assignee: Billie Rinaldi
Fix For: 2.6.0
Attachments: YARN-1922.1.patch, YARN-1922.2.patch, YARN-1922.3.patch, YARN-1922.4.patch, YARN-1922.5.patch, YARN-1922.6.patch
If the main container process is killed externally, ContainerLaunch does not kill the rest of the process group. Before sending the event that results in the ContainerLaunch.containerCleanup method being called, ContainerLaunch sets the completed flag to true. Then when cleaning up, it doesn't try to read the pid file if the completed flag is true. If it read the pid file, it would proceed to send the container a kill signal. In the case of the DefaultContainerExecutor, this would kill the process group.
[jira] [Commented] (YARN-3536) ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover
[ https://issues.apache.org/jira/browse/YARN-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510320#comment-14510320 ] gu-chi commented on YARN-3536: -- Thanks. Since the exception stack traces are almost the same, I once looked into this ticket, but that patch is already merged into the environment I use, so it is not the same cause.
ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover
Key: YARN-3536
URL: https://issues.apache.org/jira/browse/YARN-3536
Project: Hadoop YARN
Issue Type: Bug
Components: capacityscheduler, resourcemanager
Affects Versions: 2.4.1
Reporter: gu-chi
Here is a scenario where the Application status is FAILED/FINISHED but the AppAttempt status is null; this causes an NPE during recovery when yarn.resourcemanager.work-preserving-recovery.enabled is set to true. The RM should handle recovery gracefully.
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14508843#comment-14508843 ] gu-chi commented on YARN-2308: -- Thanks, I saw that and think it is not the same issue. YARN-2340 is triggered by a queue being stopped, where there is a clear "Failed to submit application" clue. My scenario is that a ZK exception occurred and the AppAttempt status update failed.
NPE happened when RM restart after CapacityScheduler queue configuration changed
Key: YARN-2308
URL: https://issues.apache.org/jira/browse/YARN-2308
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager, scheduler
Affects Versions: 2.6.0
Reporter: Wangda Tan
Assignee: Chang Li
Priority: Critical
Fix For: 2.6.0
Attachments: YARN-2308.0.patch, YARN-2308.1.patch, jira2308.patch, jira2308.patch, jira2308.patch
I encountered an NPE when the RM restarted:
{code}
2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:744)
{code}
And the RM will fail to restart. This is caused by a queue configuration change: I removed some queues and added new queues. So when the RM restarts, it tries to recover historical applications, and when the queue of any of these applications has been removed, an NPE is raised.
[jira] [Commented] (YARN-3536) ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover
[ https://issues.apache.org/jira/browse/YARN-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14508852#comment-14508852 ] gu-chi commented on YARN-3536: --
2015-04-21 03:52:31,395 | INFO | AsyncDispatcher event handler | appattempt_1429597538411_0001_02 State change from RUNNING to FINAL_SAVING | org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:704)
2015-04-21 03:52:31,397 | INFO | AsyncDispatcher event handler | Updating application application_1429597538411_0001 with final state: FINISHING | org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.rememberTargetTransitionsAndStoreState(RMAppImpl.java:988)
2015-04-21 03:52:31,397 | WARN | main-SendThread(VM1228:24002) | Session 0xd4cdaa0557f0005 for server VM1228/9.91.12.28:24002, unexpected error, closing socket connection and attempting reconnect | org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1126)
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:368)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1105)
2015-04-21 03:52:31,499 | INFO | AsyncDispatcher event handler | Exception while executing a ZK operation. | org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1098)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1429597538411_0001/appattempt_1429597538411_0001_02
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1073)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:996)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:993)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1066)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1085)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:993)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:683)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:236)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:219)
at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:792)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:866)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:861)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:745)
[jira] [Created] (YARN-3536) ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover
gu-chi created YARN-3536: Summary: ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover Key: YARN-3536 URL: https://issues.apache.org/jira/browse/YARN-3536 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, resourcemanager Affects Versions: 2.4.1 Reporter: gu-chi
[jira] [Commented] (YARN-3536) ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover
[ https://issues.apache.org/jira/browse/YARN-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14508855#comment-14508855 ] gu-chi commented on YARN-3536: --
2015-04-21 04:22:33,923 | INFO | main-EventThread | Recovering app: application_1429597538411_0001 with 2 attempts and final state = FINISHED | org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:700)
2015-04-21 04:22:33,923 | INFO | main-EventThread | Recovering attempt: appattempt_1429597538411_0001_01 with final state: FAILED | org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:734)
2015-04-21 04:22:33,924 | INFO | main-EventThread | Recovering attempt: appattempt_1429597538411_0001_02 with final state: null | org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:734)
2015-04-21 04:22:33,924 | INFO | main-EventThread | Create AMRMToken for ApplicationAttempt: appattempt_1429597538411_0001_02 | org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager.createAndGetAMRMToken(AMRMTokenSecretManager.java:195)
2015-04-21 04:22:33,924 | INFO | main-EventThread | Creating password for appattempt_1429597538411_0001_02 | org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager.createPassword(AMRMTokenSecretManager.java:307)
2015-04-21 04:22:33,924 | INFO | main-EventThread | appattempt_1429597538411_0001_01 State change from NEW to FAILED | org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:704)
2015-04-21 04:22:33,925 | INFO | main-EventThread | Registering app attempt : appattempt_1429597538411_0001_02 | org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.registerAppAttempt(ApplicationMasterService.java:656)
2015-04-21 04:22:33,925 | ERROR | main-EventThread | Failed to load/recover state | org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:533)
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:607)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:941)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:97)
[jira] [Updated] (YARN-3536) ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover
[ https://issues.apache.org/jira/browse/YARN-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gu-chi updated YARN-3536: - Description: Here is a scenario that Application status is FAILED/FINISHED but AppAttempt status is null, this cause NPE when doing recover with yarn.resourcemanager.work-preserving-recovery.enabled set to true, RM should handle recovery gracefully (was: Here is a scenario that Application status is FAILED/FINISHED but AppAttempt status is null, this cause NPE when doing recover with yarn.resourcemanager.work-preserving-recovery.enabled set to true, RM should handle recover gracefully)
[jira] [Updated] (YARN-3536) ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover
[ https://issues.apache.org/jira/browse/YARN-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gu-chi updated YARN-3536: - Description: Here is a scenario that Application status is FAILED/FINISHED but AppAttempt status is null, this cause NPE when doing recover with yarn.resourcemanager.work-preserving-recovery.enabled set to true, RM should handle recover gracefully (was: Here is a scenario that Application status is FAILED/FINISHED but AppAttempt status is null, this cause NPE when doing recover with yarn.resourcemanager.work-preserving-recovery.enabled set to true)
[jira] [Commented] (YARN-3536) ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover
[ https://issues.apache.org/jira/browse/YARN-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14508900#comment-14508900 ] gu-chi commented on YARN-3536: -- Please assign this to me for fixing.
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14506991#comment-14506991 ] gu-chi commented on YARN-2308: -- Hi Chang Li, as I went through the patches you attached, earlier versions had
+if (application==null) {
+ LOG.info(can't retireve application attempt);
+ return;
+}
but the patch that was finally merged does not have this modification. Was this dropped on purpose? What was the concern? I am now facing a scenario where the App status is FINISHED and the AppAttempt status is null; during recovery the application is null in the CapacityScheduler and an NPE occurs. I think that if the application==null condition were there, the issue I am hitting would not occur.
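The guard quoted in this comment can be illustrated in isolation. The following is a hedged sketch, not the real CapacityScheduler.addApplicationAttempt: the scheduler's application table is reduced to a plain map, and the class and method names here are illustrative only. The point is that returning early on a null lookup turns the recovery-time NPE into a logged no-op.

```java
import java.util.HashMap;
import java.util.Map;

public class AddAttemptGuard {
    // Stand-in for the scheduler's internal application table.
    static final Map<String, Object> applications = new HashMap<>();

    // Returns false instead of throwing an NPE when the application is
    // unknown to the scheduler (e.g. its queue was removed before restart).
    static boolean addApplicationAttempt(String appId) {
        Object application = applications.get(appId);
        if (application == null) {
            System.out.println("Ignoring attempt for unknown application " + appId);
            return false;
        }
        // ... normal attempt registration would continue here ...
        return true;
    }

    public static void main(String[] args) {
        applications.put("application_1", new Object());
        System.out.println(addApplicationAttempt("application_1"));
        System.out.println(addApplicationAttempt("application_2"));
    }
}
```

With the guard in place, a recovered attempt whose application is missing is skipped and logged rather than crashing the event dispatcher.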