[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl

2016-04-06 Thread Sangjin Lee (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15229673#comment-15229673 ]

Sangjin Lee commented on YARN-4736:
---

Agreed. Please feel free to remove the label.

> Issues with HBaseTimelineWriterImpl
> ---
>
> Key: YARN-4736
> URL: https://issues.apache.org/jira/browse/YARN-4736
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Vrushali C
>Priority: Critical
>  Labels: yarn-2928-1st-milestone
> Attachments: NM_Hang_hbase1.0.3.tar.gz, hbaseException.log, 
> threaddump.log
>
>
> Faced some issues while running ATSv2 on a single-node Hadoop cluster, with 
> HBase (using its embedded ZooKeeper) launched on the same node.
> # Due to some NPE issues, I could see the NM trying to shut down, but the 
> NM daemon process never completed shutdown because of the locks.
> # Got some HBase-related exceptions after the application finished execution 
> successfully.
> Will attach the logs and the stack trace for the same.



[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl

2016-04-05 Thread Naganarasimha G R (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227648#comment-15227648 ]

Naganarasimha G R commented on YARN-4736:
-

Hi [~sjlee0], 
As per the discussions that have taken place, I think we can remove this from 
the *"yarn-2928-1st-milestone"* list.


[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl

2016-03-20 Thread Sangjin Lee (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198175#comment-15198175 ]

Sangjin Lee commented on YARN-4736:
---

Yes, sorry I meant HBASE-15436.

I think we're more OK with the situation where the entire HBase cluster is down 
or the master is down. That's a critical situation, and all bets are off at 
that point.

My concern is more about the case where one region server goes down, or gets 
into a state where it times out writes, and your {{BufferedMutatorImpl}} needs 
to flush to it. If that flush operation times out only after 30+ minutes, that 
would be a significant problem. [~anoop.hbase], would things take 30+ minutes 
to time out if a region server (rather than the cluster itself) is down or 
misbehaving? Thoughts?
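
To make the concern concrete, here is a minimal sketch of bounding the wait on a 
flush (my illustration only, not the actual ATSv2 writer code; the 
{{BoundedFlush}} class, the helper name, and the timeout value are all 
assumptions):

{code:java}
import java.io.IOException;
import java.util.concurrent.*;

import org.apache.hadoop.hbase.client.BufferedMutator;

/**
 * Hypothetical helper: bound the wait on BufferedMutator#flush() so a
 * misbehaving region server cannot hang the calling thread indefinitely.
 */
public final class BoundedFlush {
  public static void flushWithTimeout(BufferedMutator mutator,
                                      long timeout, TimeUnit unit)
      throws IOException, InterruptedException {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    // Run the flush on a separate thread so the caller can give up waiting.
    Future<Void> f = pool.submit(() -> {
      mutator.flush();        // may block for roughly retries * socket timeout
      return null;
    });
    try {
      f.get(timeout, unit);   // wait, but only up to the given bound
    } catch (TimeoutException e) {
      f.cancel(true);         // best effort only; buffered data may be lost
    } catch (ExecutionException e) {
      throw new IOException("flush failed", e.getCause());
    } finally {
      pool.shutdownNow();
    }
  }
}
{code}

The caveat is that cancelling the future merely interrupts the helper thread; it 
does not necessarily unstick the flush or release {{BufferedMutatorImpl}}'s 
internal lock, which is exactly the failure mode being discussed here.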


[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl

2016-03-19 Thread Anoop Sam John (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198862#comment-15198862 ]

Anoop Sam John commented on YARN-4736:
--

When one RS is down, the regions it hosted get moved to other RS(s). And if it 
was the META-serving RS, then META gets moved to a new RS. So either way, the 
next retries should get things moving again. The issue here was that the entire 
HBase cluster went down and no one brought it back. 
There are certain must-fix things in HBase; will talk about those in the HBase 
JIRA.



[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl

2016-03-15 Thread Anoop Sam John (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196777#comment-15196777 ]

Anoop Sam John commented on YARN-4736:
--

I see, you mean HBASE-15436.


[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl

2016-03-15 Thread Anoop Sam John (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196776#comment-15196776 ]

Anoop Sam John commented on YARN-4736:
--

HDFS-15436??


[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl

2016-03-15 Thread Sangjin Lee (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195773#comment-15195773 ]

Sangjin Lee commented on YARN-4736:
---

Dropped a comment on HDFS-15436. Even though there may be nothing we can do 
about this, it does sound like a pretty critical issue. We cannot have the 
Hadoop cluster become unstable when a region server or region servers get into 
this state.


[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl

2016-03-15 Thread Anoop Sam John (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15194798#comment-15194798 ]

Anoop Sam John commented on YARN-4736:
--

Ya, we know now why the HBase client is not getting shut down. It will get 
shut down after several minutes:
Time = N * 36 * socket timeout
where N is the number of Mutations in the BufferedMutator's list (waiting to 
get flushed to the server).
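
To make the arithmetic concrete (reading the numbers off the attached 
hbaseException.log, so this is an inference rather than a measured figure): 
with the logged 36 retry attempts and a roughly 60-second socket timeout per 
attempt, even N = 1 works out to about 1 * 36 * 60 s = 36 minutes, which lines 
up with the gap between the collector removal at 00:02:28 and the 
{{RetriesExhaustedException}} at 00:39:03.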


[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl

2016-03-14 Thread Naganarasimha G R (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15194418#comment-15194418 ]

Naganarasimha G R commented on YARN-4736:
-

Also, it seems like [~anoop.hbase] has found the cause of this issue in 
HBASE-15436, so I think no handling is needed from the YARN side, right?


[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl

2016-03-08 Thread Sangjin Lee (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15186306#comment-15186306 ]

Sangjin Lee commented on YARN-4736:
---

bq. {{BufferedMutatorImpl#flush()}} appeared in the stack trace. However, if 
the HBase cluster was shut down, the flush wouldn't succeed.

Right, I fully expect that the flush would not succeed if the cluster is shut 
down. What I'm concerned about is that not only did the flush fail, it does 
appear that it remained stuck in flush, preventing the client from shutting it 
down. That would be unexpected.

I know this is against 1.0.2, which is a somewhat older version, but would it 
be helpful if I filed an HBASE JIRA to see if others have seen this or have 
suggestions?


[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl

2016-02-29 Thread Ted Yu (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172795#comment-15172795 ]

Ted Yu commented on YARN-4736:
--

bq. so planning to test with hbase-1.0.3 tar. 

There have been more releases since 1.0.3; e.g., you can try out the 1.2.0 
release.

{{BufferedMutatorImpl#flush()}} appeared in the stack trace. However, if the 
HBase cluster was shut down, the flush wouldn't succeed.

I haven't seen the above issue happen on a live 1.x cluster.


[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl

2016-02-29 Thread Sangjin Lee (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172779#comment-15172779 ]

Sangjin Lee commented on YARN-4736:
---

{quote}
I was checking with our HBase team; from the logs and the trace, they were 
saying the issue might be due to bad connectivity with ZooKeeper, but it is 
strange to see that in a local-node setup. So I suspect there is some issue 
with my HBase setup, and I am planning to test with the hbase-1.0.3 tar.
{quote}

Regardless of what happens on the cluster, the client (the NM in this case) 
should not lock up. So in that sense, I think the "deadlock" we're seeing 
should be looked at from the client side. [~te...@apache.org], do you recall 
seeing any issues like this?

{quote}
But I would also like to know whether you guys are facing the same issue, or 
is it only me?
{quote}

I haven't had a chance to try to reproduce it yet. I'll try it as soon as 
feasible. Does this happen every time you run the timeline service performance 
test with the latest branch?


[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl

2016-02-29 Thread Naganarasimha G R (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172390#comment-15172390 ]

Naganarasimha G R commented on YARN-4736:
-

Hi [~vrushalic],
I was checking with our HBase team; from the logs and the trace, they were 
saying the issue might be due to bad connectivity with ZooKeeper, but it is 
strange to see that in a local-node setup. So I suspect there is some issue 
with my HBase setup, and I am planning to test with the hbase-1.0.3 tar. 
But I would also like to know whether you guys are facing the same issue, or 
is it only me?


[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl

2016-02-26 Thread Naganarasimha G R (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169753#comment-15169753 ]

Naganarasimha G R commented on YARN-4736:
-

Thanks for the analysis [~sjlee0],
bq. Both the thread dump and the HBase exception log are from the client 
process (the NM side), correct? 
Yes, both are client-side exceptions (i.e., from the NM). I am not sure whether 
issue 2 is related to issue 1, but based on your explanation it seems to be. 

bq. some time after that, it looks like you issued a signal to stop the client 
process (NM)?
Yes. A significant amount of time after the completion of the job (00:02:28), 
the error log appeared (00:39:03), and it was a significant time after that 
when I tried to stop the NM, at 01:09:19. And even when I try to stop the NM 
immediately after job completion, I can see this issue (the NM not stopping 
completely).

bq. That's why I thought there seems to be an HBase bug that is causing the 
flush operation to be wedged in this state. At least that explains why you were 
not able to shut down the collector (and therefore the NM).
Yes, maybe; but to confirm, are the server-side logs required? I am using 
*HBase - Version 1.0.2*. I can share any other information required too. cc/ 
[~te...@apache.org]. 

[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl

2016-02-26 Thread Sangjin Lee (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169328#comment-15169328 ]

Sangjin Lee commented on YARN-4736:
---

Both the thread dump and the HBase exception log are from the client process 
(the NM side), correct? I believe so, because both show HBase client stack 
traces. Could you confirm?

Since these are both from the client side, I am not sure what was going on on 
the HBase server side. Also, I'm not sure why HBase started refusing 
connections, as evidenced by your exception log (that's why I assumed the HBase 
process might have gone away at that point).

Here is how I put together the sequence of events:
- at some point the HBase process starts refusing connections 
(hbaseException.log)
- the periodic flush gets trapped in this bad state, finally logging the 
{{RetriesExhaustedException}} after 36 minutes
{noformat}
2016-02-26 00:02:28,270 INFO 
org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager:
 The collector service for application_1456425026132_0001 was removed
2016-02-26 00:39:03,879 ERROR org.apache.hadoop.hbase.client.AsyncProcess: 
Failed to get region location 
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
attempts=36, exceptions:
{noformat}
- some time after that, it looks like you issued a signal to stop the client 
process (NM)?
{noformat}
2016-02-26 01:09:19,799 ERROR 
org.apache.hadoop.yarn.server.nodemanager.NodeManager: RECEIVED SIGNAL 15: 
SIGTERM
{noformat}
- but the service stop fails to shut down the periodic flush task thread
{noformat}
2016-02-26 01:09:50,035 WARN 
org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager:
 failed to stop the flusher task in time. will still proceed to close the 
writer.
2016-02-26 01:09:50,035 INFO 
org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl: 
closing the entity table
{noformat}
- at this point the NM process is hung because the flush is in this stuck state 
while holding the {{BufferedMutatorImpl}} lock and the closing of the entity 
table needs to acquire that lock

That's why I thought there seems to be an HBase bug that is causing the flush 
operation to be wedged in this state. At least that explains why you were not 
able to shut down the collector (and therefore the NM).
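
As an aside, here is a tiny self-contained reduction of that hang (purely 
illustrative; the real lock is internal to {{BufferedMutatorImpl}}, and the 
sleep stands in for the wedged RPC retries):

{code:java}
/**
 * Hypothetical two-thread reduction of the hang: the flush thread holds the
 * mutator's lock while stuck on retries, and the close path needs that lock.
 */
public final class StuckFlushDemo {
  public static void main(String[] args) throws InterruptedException {
    final Object mutatorLock = new Object(); // stands in for BufferedMutatorImpl's internal lock

    Thread flusher = new Thread(() -> {
      synchronized (mutatorLock) {           // flush() acquires the lock...
        try {
          Thread.sleep(Long.MAX_VALUE);      // ...then stalls on an unresponsive region server
        } catch (InterruptedException ignored) {
        }
      }
    }, "periodic-flush");

    Thread closer = new Thread(() -> {
      synchronized (mutatorLock) {           // close() needs the same lock, so shutdown wedges
        System.out.println("entity table closed");
      }
    }, "service-stop");

    flusher.start();
    Thread.sleep(100);                       // let the flusher take the lock first
    closer.start();                          // blocks for as long as the flush is stuck
  }
}
{code}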


[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl

2016-02-26 Thread Naganarasimha G R (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15168659#comment-15168659 ]

Naganarasimha G R commented on YARN-4736:
-

Hi [~sjlee0],
Hope the attached files map correctly to the issues:
*For Issue 1*: threaddump.log
*For Issue 2*: hbaseException.log
bq. This could be a bug in HBase. It seems like the HBase cluster was already 
shut down, but the flush operation took a long time to finally error out (36 
minutes):
This log is for case 2. Actually, neither the cluster nor the NM was shut down 
in this case; the app completed successfully, but after about 40 minutes this 
error was shown.


[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl

2016-02-25 Thread Sangjin Lee (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15168159#comment-15168159 ]

Sangjin Lee commented on YARN-4736:
---

This could be a bug in HBase. It seems like the HBase cluster was already shut 
down, but the flush operation took a long time to finally error out (36 
minutes):
{noformat}
2016-02-26 00:02:28,270 INFO 
org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager:
 The collector service for application_1456425026132_0001 was removed
2016-02-26 00:39:03,879 ERROR org.apache.hadoop.hbase.client.AsyncProcess: 
Failed to get region location 
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
attempts=36, exceptions:
Fri Feb 26 00:39:03 IST 2016, null, java.net.SocketTimeoutException: 
callTimeout=60000, callDuration=68065: row 
'timelineservice.entity,naga!yarn_cluster!flow_1456425026132_1!���!M�����!YARN_CONTAINER!container_1456425026132_0001_01_01,99'
 on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
hostname=localhost,16201,1456365764939, seqNum=0

at 
org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:264)
at 
org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:215)
at 
org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:56)
at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
at 
org.apache.hadoop.hbase.client.ClientSmallReversedScanner.loadCache(ClientSmallReversedScanner.java:211)
at 
org.apache.hadoop.hbase.client.ClientSmallReversedScanner.next(ClientSmallReversedScanner.java:185)
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1200)
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1109)
at 
org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:369)
at 
org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:320)
at 
org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:206)
at 
org.apache.hadoop.hbase.client.BufferedMutatorImpl.flush(BufferedMutatorImpl.java:183)
at 
org.apache.hadoop.yarn.server.timelineservice.storage.common.BufferedMutatorDelegator.flush(BufferedMutatorDelegator.java:66)
at 
org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.flush(HBaseTimelineWriterImpl.java:457)
at 
org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager$WriterFlushTask.run(TimelineCollectorManager.java:230)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketTimeoutException: callTimeout=60000, 
callDuration=68065: row 
'timelineservice.entity,naga!yarn_cluster!flow_1456425026132_1!���!M�����!YARN_CONTAINER!container_1456425026132_0001_01_01,99'
 on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
hostname=localhost,16201,1456365764939, seqNum=0
at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159)
at 
org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:310)
at 
org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:291)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
... 3 more
Caused by: java.net.ConnectException: Connection refused
{noformat}

But this seems to have left the HBase client stuck in the flush state. That 
thread appears to be stuck:
{noformat}
"pool-14-thread-1" prio=10 tid=0x7f4215268000 nid=0x46e6 waiting on 
condition [0x7f41fe75d000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0xeeb5a010> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at