[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15229673#comment-15229673 ]

Sangjin Lee commented on YARN-4736:
-----------------------------------

Agreed. Please feel free to remove the label.

> Issues with HBaseTimelineWriterImpl
> -----------------------------------
>
>                 Key: YARN-4736
>                 URL: https://issues.apache.org/jira/browse/YARN-4736
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>    Affects Versions: YARN-2928
>            Reporter: Naganarasimha G R
>            Assignee: Vrushali C
>            Priority: Critical
>              Labels: yarn-2928-1st-milestone
>         Attachments: NM_Hang_hbase1.0.3.tar.gz, hbaseException.log, threaddump.log
>
> Faced some issues while running ATSv2 in a single-node Hadoop cluster, with HBase (using its embedded ZooKeeper) launched on the same node.
> # Due to some NPE issues the NM tried to shut down, but the NM daemon process could not exit because of locks.
> # Got some exceptions related to HBase after the application finished execution successfully.
> Will attach the logs and the trace for the same.
[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227648#comment-15227648 ]

Naganarasimha G R commented on YARN-4736:
-----------------------------------------

Hi [~sjlee0], as per the discussions that have happened, I think we can remove this from the *"yarn-2928-1st-milestone"* list.
[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198175#comment-15198175 ]

Sangjin Lee commented on YARN-4736:
-----------------------------------

Yes, sorry I meant HBASE-15436.

I think we're more OK with the situation where the entire HBase cluster is down or the master is down. That's a critical situation, and all bets are off at that point. My concern is more if one region server went down or is in a state where it times out writes and your {{BufferedMutatorImpl}} needs to flush to it. If that flush operation times out after 30+ minutes, that would be a significant problem.

[~anoop.hbase], would things take 30+ minutes to time out if a region server (rather than the cluster itself) is down or misbehaving? Thoughts?
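For context on the knobs involved: the 30+ minute figure comes from the client's retry schedule, which is configurable. Below is a minimal sketch of tightening those settings, assuming the standard HBase 1.x client configuration keys; the values are purely illustrative, not something proposed in this thread.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class FailFastConnectionSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Illustrative values: fail fast instead of retrying for tens of minutes.
    conf.setInt("hbase.client.retries.number", 3);          // 1.x default is 35, hence "attempts=36" in the logs
    conf.setInt("hbase.rpc.timeout", 10_000);               // per-RPC timeout in ms (default 60000)
    conf.setInt("hbase.client.operation.timeout", 30_000);  // overall cap per client operation, in ms
    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      // A writer would create its BufferedMutator from this connection.
      System.out.println("connection open: " + !conn.isClosed());
    }
  }
}
{code}

The trade-off is that aggressive settings make transient region moves look like hard failures, so the retry budget still has to cover normal region reassignment (see Anoop's comment just below).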
[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198862#comment-15198862 ]

Anoop Sam John commented on YARN-4736:
--------------------------------------

When one RS is down, the regions on it will get moved to other RS(s). And if it was a META-serving RS, then META will get moved to a new RS. So either way, subsequent retries should get things moving. The issue here was that the entire HBase cluster went down and no one brought it back. There are certain must-fix things in HBase; will talk about those in an HBase JIRA.
[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196777#comment-15196777 ]

Anoop Sam John commented on YARN-4736:
--------------------------------------

I see, you mean HBASE-15436.
[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196776#comment-15196776 ]

Anoop Sam John commented on YARN-4736:
--------------------------------------

HDFS-15436??
[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195773#comment-15195773 ]

Sangjin Lee commented on YARN-4736:
-----------------------------------

Dropped a comment on HDFS-15436. Even though there may be nothing we can do about this, it does sound like a pretty critical issue. We cannot have the Hadoop cluster become unstable when a region server or region servers get into this state.
[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15194798#comment-15194798 ]

Anoop Sam John commented on YARN-4736:
--------------------------------------

Ya, we know now why the HBase client is not getting shut down. It will get shut down after several minutes:

{noformat}
Time = N * 36 * socket timeout
{noformat}

where N is the number of mutations in the BufferedMutator's list (waiting to get flushed to the server).
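Plugging in numbers makes the scale obvious. A worked example of the formula, assuming the HBase 1.x defaults of hbase.rpc.timeout = 60000 ms and 36 attempts (35 retries + 1, matching "Failed after attempts=36" in the logs later in this thread):

{code:java}
public class FlushDelayEstimate {
  public static void main(String[] args) {
    long socketTimeoutMs = 60_000L; // assumed hbase.rpc.timeout default
    int attempts = 36;              // matches "Failed after attempts=36"
    int n = 1;                      // N: mutations pending in the BufferedMutator
    long worstCaseMs = n * attempts * socketTimeoutMs;
    // Prints "36 min" for N = 1, which lines up with the ~36-minute gap
    // between 00:02:28 and 00:39:03 in the logs quoted later in the thread.
    System.out.println(worstCaseMs / 60_000 + " min");
  }
}
{code}

Even a single pending mutation accounts for the observed delay; every additional buffered mutation adds another 36 minutes in the worst case.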
[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15194418#comment-15194418 ]

Naganarasimha G R commented on YARN-4736:
-----------------------------------------

Also, it seems like [~anoop.hbase] has got to the cause of this issue in HBASE-15436, so I think no handling is needed on the YARN side, right?
[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15186306#comment-15186306 ]

Sangjin Lee commented on YARN-4736:
-----------------------------------

bq. BufferedMutatorImpl#flush() appeared in the stack trace. However, if the HBase cluster was shut down, the flush wouldn't succeed.

Right, I fully expect that the flush would not succeed if the cluster is shut down. What I'm concerned about is that not only did the flush fail, it does appear that it remained stuck in flush, preventing the client from shutting it down. That would be unexpected.

I know this is against 1.0.2, which is somewhat of an older version, but would it be helpful if I filed an HBASE JIRA to see if others have seen this or have suggestions?
[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172795#comment-15172795 ]

Ted Yu commented on YARN-4736:
------------------------------

bq. so planning to test with the hbase-1.0.3 tar.

There have been more releases since the 1.0.3 release; e.g., you can try out the 1.2.0 release.

BufferedMutatorImpl#flush() appeared in the stack trace. However, if the HBase cluster was shut down, the flush wouldn't succeed. I haven't seen the above issue happen on a live 1.x cluster.
[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172779#comment-15172779 ]

Sangjin Lee commented on YARN-4736:
-----------------------------------

{quote}
Was checking with our HBase team; from the logs and the trace they were informing that the issue might be due to bad connectivity with ZooKeeper, but it is strange to see that in the local-node setup. So I suspect that there is some issue with my HBase setup, so planning to test with the hbase-1.0.3 tar.
{quote}

Regardless of what happens on the cluster, the client (the NM in this case) should not lock up. So in that sense, I think the "deadlock" we're seeing should be looked at from the client side. [~te...@apache.org], do you recall seeing any issues like this?

{quote}
But would also like to know whether you guys are facing the same issue, or is it only me who is facing it?
{quote}

I haven't had a chance to try to reproduce it yet. I'll try it as soon as feasible. Does this happen every time you run the timeline service performance test with the latest branch?
[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172390#comment-15172390 ]

Naganarasimha G R commented on YARN-4736:
-----------------------------------------

Hi [~vrushalic],

Was checking with our HBase team; from the logs and the trace they were informing that the issue might be due to bad connectivity with ZooKeeper, but it is strange to see that in the local-node setup. So I suspect that there is some issue with my HBase setup, so planning to test with the hbase-1.0.3 tar. But would also like to know whether you guys are facing the same issue, or is it only me who is facing it?
[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169753#comment-15169753 ]

Naganarasimha G R commented on YARN-4736:
-----------------------------------------

Thanks for the analysis, [~sjlee0].

bq. Both the thread dump and the HBase exception log are from the client process (NM side), correct?

Yes, both are client-side exceptions (i.e., NM), but I am not sure issue 2 is related to issue 1; based on your explanation, though, it seems to be.

bq. some time after that, it looks like you issued a signal to stop the client process (NM)?

Yes. A significant amount of time after the completion of the job (00:02:28), the error log came (00:39:03), and after another significant gap I tried to stop the NM at 01:09:19. Even when I try to stop the NM immediately after job completion, I am able to see this issue (the NM not stopping completely).

bq. That's why I thought there seems to be an HBase bug that is causing the flush operation to be wedged in this state. At least that explains why you were not able to shut down the collector (and therefore the NM).

Maybe, yes; but to confirm, are the server-side logs required? I am using *HBase version 1.0.2*. I can share any other information if required, too. cc [~te...@apache.org]
[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169328#comment-15169328 ]

Sangjin Lee commented on YARN-4736:
-----------------------------------

Both the thread dump and the HBase exception log are from the client process (NM side), correct? I believe so because both are showing HBase client stack traces. Could you confirm?

Since these are both from the client side, I am not sure what was going on on the HBase server side. Also, I'm not sure why HBase started refusing connections as evidenced by your exception log (that's why I assumed that the HBase process might have gone away at that point).

Here is how I put together the sequence of events:
- at some point the HBase process starts refusing connections (hbaseException.log)
- the periodic flush gets trapped in this bad state, finally logging the {{RetriesExhaustedException}} after 36 minutes
{noformat}
2016-02-26 00:02:28,270 INFO org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager: The collector service for application_1456425026132_0001 was removed
2016-02-26 00:39:03,879 ERROR org.apache.hadoop.hbase.client.AsyncProcess: Failed to get region location
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions:
{noformat}
- some time after that, it looks like you issued a signal to stop the client process (NM)?
{noformat}
2016-02-26 01:09:19,799 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: RECEIVED SIGNAL 15: SIGTERM
{noformat}
- but the service stop fails to shut down the periodic flush task thread
{noformat}
2016-02-26 01:09:50,035 WARN org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager: failed to stop the flusher task in time. will still proceed to close the writer.
2016-02-26 01:09:50,035 INFO org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl: closing the entity table
{noformat}
- at this point the NM process is hung, because the flush is stuck in this state while holding the {{BufferedMutatorImpl}} lock, and the closing of the entity table needs to acquire that lock

That's why I thought there seems to be an HBase bug that is causing the flush operation to be wedged in this state. At least that explains why you were not able to shut down the collector (and therefore the NM).
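The "failed to stop the flusher task in time" warning shows the flusher is already stopped with a bounded wait; the part that then hangs is the writer close, which needs the lock the stuck flush is holding. For illustration only, a minimal sketch of that bounded-wait pattern (hypothetical names; this is not the actual TimelineCollectorManager code), with the caveat noted in the comments that the stuck thread keeps the mutator lock, so a later close() on the same mutator can still block:

{code:java}
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.apache.hadoop.hbase.client.BufferedMutator;

public class BoundedFlushSketch {
  /** Flush, but stop waiting after timeoutMs so shutdown can proceed. */
  static void flushWithTimeout(BufferedMutator mutator, long timeoutMs) throws IOException {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<?> flush = executor.submit(() -> {
      mutator.flush(); // may block indefinitely while holding the mutator's internal lock
      return null;
    });
    try {
      flush.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // Give up waiting. The flush thread may still hold the mutator's lock,
      // which is exactly why close() can hang afterwards (the bug above).
      flush.cancel(true);
    } catch (Exception e) {
      throw new IOException("flush failed", e);
    } finally {
      executor.shutdownNow();
    }
  }
}
{code}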
[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15168659#comment-15168659 ]

Naganarasimha G R commented on YARN-4736:
-----------------------------------------

Hi [~sjlee0],

Hope the attached files are mapped correctly to the issues:
*For Issue 1*: threaddump.log
*For Issue 2*: hbaseException.log

bq. This could be a bug in HBase. It seems like the HBase cluster was already shut down, but the flush operation took a long time to finally error out (36 minutes):

This log is for issue 2. Actually, neither the cluster nor the NM was shut down in this case; the app completed successfully, but after ~40 minutes this error was shown.
[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15168159#comment-15168159 ]

Sangjin Lee commented on YARN-4736:
-----------------------------------

This could be a bug in HBase. It seems like the HBase cluster was already shut down, but the flush operation took a long time to finally error out (36 minutes):

{noformat}
2016-02-26 00:02:28,270 INFO org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager: The collector service for application_1456425026132_0001 was removed
2016-02-26 00:39:03,879 ERROR org.apache.hadoop.hbase.client.AsyncProcess: Failed to get region location
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions:
Fri Feb 26 00:39:03 IST 2016, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=68065: row 'timelineservice.entity,naga!yarn_cluster!flow_1456425026132_1!���!M�����!YARN_CONTAINER!container_1456425026132_0001_01_01,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=localhost,16201,1456365764939, seqNum=0
	at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:264)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:215)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:56)
	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
	at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.loadCache(ClientSmallReversedScanner.java:211)
	at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.next(ClientSmallReversedScanner.java:185)
	at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1200)
	at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1109)
	at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:369)
	at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:320)
	at org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:206)
	at org.apache.hadoop.hbase.client.BufferedMutatorImpl.flush(BufferedMutatorImpl.java:183)
	at org.apache.hadoop.yarn.server.timelineservice.storage.common.BufferedMutatorDelegator.flush(BufferedMutatorDelegator.java:66)
	at org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.flush(HBaseTimelineWriterImpl.java:457)
	at org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager$WriterFlushTask.run(TimelineCollectorManager.java:230)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketTimeoutException: callTimeout=60000, callDuration=68065: row 'timelineservice.entity,naga!yarn_cluster!flow_1456425026132_1!���!M�����!YARN_CONTAINER!container_1456425026132_0001_01_01,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=localhost,16201,1456365764939, seqNum=0
	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:310)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:291)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	... 3 more
Caused by: java.net.ConnectException: Connection refused
{noformat}

But this seems to have put the HBase client stuck in the flush state. That thread seems to be stuck:

{noformat}
"pool-14-thread-1" prio=10 tid=0x7f4215268000 nid=0x46e6 waiting on condition [0x7f41fe75d000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0xeeb5a010> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
	at
{noformat}