[jira] [Commented] (HBASE-15436) BufferedMutatorImpl.flush() appears to get stuck

2016-03-30 Thread Anoop Sam John (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218554#comment-15218554
 ] 

Anoop Sam John commented on HBASE-15436:


Thanks Nicolas.
bq. iirc, a long time ago the buffer was attached to the Table object, so the policy (or at least the objective :-)) when one of the puts had failed (i.e. reached the max retry number) was simple: all the operations currently in the buffer were considered failed as well, even if we had not tried to send them. As a consequence the buffer was empty after the failure of a single put. It was then up to the client to continue or not. Maybe we should do the same with the buffered mutator, in all cases, close or not? I haven't looked at the BufferedMutator code, but I can have a look if you wish [~anoop.hbase].
Both BufferedMutator and the normal Table use the same AsyncProcess path. I don't remember our old behaviour of failing everything once one op failed (after max retries).
Also, I feel we need to add a closed check in the retry loop. If the user has called close() on the BufferedMutator, it should still be a graceful close, but not one the user has to wait minutes for. If we are in the middle of an attempt and it fails, we should at least check the close flag before the next retry. A rough sketch of that check follows.
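For illustration only, a minimal sketch of what checking a close flag between retries could look like. This is not the actual AsyncProcess/BufferedMutatorImpl code; the class name, retry budget, and pause are all made-up assumptions.
{code}
import java.io.IOException;
import java.io.InterruptedIOException;

public class ClosableRetryingOp {
  private volatile boolean closed;           // set by close()
  private final int maxRetries = 36;         // hypothetical retry budget
  private final long pauseMillis = 1000L;    // hypothetical pause between attempts

  public void close() {
    closed = true;                           // signal in-flight retries to stop
  }

  public void callWithRetries(Runnable attempt) throws IOException {
    for (int tries = 0; tries < maxRetries; tries++) {
      if (closed) {
        // Early out: a close() issued mid-retry should not make the caller wait for minutes.
        throw new IOException("Mutator closed; abandoning remaining retries");
      }
      try {
        attempt.run();
        return;                              // success
      } catch (RuntimeException e) {
        // fall through and retry after a pause
      }
      try {
        Thread.sleep(pauseMillis);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        throw new InterruptedIOException("Interrupted while waiting to retry");
      }
    }
    throw new IOException("Retries exhausted");
  }
}
{code}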


> BufferedMutatorImpl.flush() appears to get stuck
> ------------------------------------------------
>
> Key: HBASE-15436
> URL: https://issues.apache.org/jira/browse/HBASE-15436
> Project: HBase
> Issue Type: Bug
> Components: Client
> Affects Versions: 1.0.2
> Reporter: Sangjin Lee
> Attachments: hbaseException.log, threaddump.log
>
> We noticed an instance where the thread that was executing a flush
> ({{BufferedMutatorImpl.flush()}}) got stuck when the (local one-node) cluster
> shut down, and was unable to get out of that stuck state.
> The setup is a single node HBase cluster, and apparently the cluster went
> away when the client was executing flush. The flush eventually logged a
> failure after 30+ minutes of retrying. That is understandable.
> What is unexpected is that the thread is stuck in this state (i.e. in the
> {{flush()}} call). I would have expected the {{flush()}} call to return after
> the complete failure.





[jira] [Commented] (HBASE-15436) BufferedMutatorImpl.flush() appears to get stuck

2016-03-30 Thread Nicolas Liochon (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217714#comment-15217714
 ] 

Nicolas Liochon commented on HBASE-15436:
-

bq. There should be a cap size above which we should block the writes. We should not take more than this limit. Maybe something like 1.5 times the flush size.
We definitely want to take more than this limit, but maybe not as much as we're taking today (or maybe we want to be clearer about what these settings mean).
There is a limit, given by the number of tasks executed in parallel (hbase.client.max.total.tasks). If I understand correctly, this setting is now per client (and not per HTable).
Ideally these parameters should be hidden from the user (i.e. the defaults are OK for a standard client without too many memory constraints). The snippet below shows where these settings live.
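As a point of reference, a hedged example of setting these client-side knobs: hbase.client.max.total.tasks is the per-connection cap on in-flight tasks mentioned above, and hbase.client.write.buffer is the buffer size that triggers a background flush. The values are illustrative, not recommendations.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ClientLimitsExample {
  public static Configuration buildClientConf() {
    Configuration conf = HBaseConfiguration.create();
    // Cap on concurrent in-flight tasks for the whole connection (per client, not per table).
    conf.setInt("hbase.client.max.total.tasks", 100);
    // Buffer size at which BufferedMutator kicks off a background flush (2 MB here).
    conf.setLong("hbase.client.write.buffer", 2 * 1024 * 1024);
    return conf;
  }
}
{code}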

bq. How long we should wait? Whether we should come out faster?
iirc, a long time ago the buffer was attached to the Table object, so the policy (or at least the objective :-)) when one of the puts had failed (i.e. reached the max retry number) was simple: all the operations currently in the buffer were considered failed as well, even if we had not tried to send them. As a consequence the buffer was empty after the failure of a single put. It was then up to the client to continue or not. Maybe we should do the same with the buffered mutator, in all cases, close or not? I haven't looked at the BufferedMutator code, but I can have a look if you wish [~anoop.hbase].

bq. What if we were doing multi Get to META table to know the region location for N mutations at a time.
It seems like a good idea. There are many possible optimisations in how we use meta, and this is one of them.






[jira] [Commented] (HBASE-15436) BufferedMutatorImpl.flush() appears to get stuck

2016-03-20 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15197343#comment-15197343
 ] 

Naganarasimha G R commented on HBASE-15436:
---

Valid point; let me discuss this more with the ATS team...



[jira] [Commented] (HBASE-15436) BufferedMutatorImpl.flush() appears to get stuck

2016-03-19 Thread Anoop Sam John (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198866#comment-15198866
 ] 

Anoop Sam John commented on HBASE-15436:


There are some must-fix things:
1. The BufferedMutator flush keeps retrying and taking more time. It kicked in because the size of all Mutations accumulated so far met the flush size (say 2 MB). The flush takes time and we keep accepting new mutations into the list, which may lead to a client-side OOME. We may need to accept more mutations after a background flush has started (normally things keep moving fast enough), but this cannot be unbounded. There should be a cap above which we block the writes; we should not take more than this limit. Maybe something like 1.5 times the flush size (a rough sketch of this back-pressure follows the list).
2. The row lookups into META happen one row at a time. So one row lookup fails only after 36 retries, each with a 1 minute timeout. Isn't the 1 minute timeout itself too high? And even after that it just fails this one Mutation and continues with the remaining ones. What if we did a multi Get to the META table to find the region location for N mutations at a time?
3. When close() is explicitly called on BufferedMutator, we try for a graceful shutdown (i.e. wait for a flush if one is in progress and/or call flush before close). In that case, what if the cluster is down and it takes too long? How long should we wait? Should we come out faster (maybe losing some Mutations, but that is in any case known behaviour for async writes)?
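The back-pressure idea in point 1 could look roughly like the sketch below. This is not BufferedMutatorImpl code; the class, the 2 MB trigger, and the 1.5x hard cap are illustrative assumptions.
{code}
public class CappedWriteBuffer {
  private final long flushTriggerBytes = 2L * 1024 * 1024;            // e.g. 2 MB flush trigger
  private final long hardCapBytes = (long) (flushTriggerBytes * 1.5); // block writers above this
  private long bufferedBytes;

  /** Called for each mutation; blocks the writer while the buffer is over the hard cap. */
  public synchronized void accept(long mutationSize) throws InterruptedException {
    while (bufferedBytes + mutationSize > hardCapBytes) {
      wait();                        // back-pressure: wait for a flush to drain some bytes
    }
    bufferedBytes += mutationSize;
    if (bufferedBytes >= flushTriggerBytes) {
      // kick off a background flush here (omitted in this sketch)
    }
  }

  /** Called by the background flush once bytes have actually been written out. */
  public synchronized void onFlushed(long bytesDrained) {
    bufferedBytes -= bytesDrained;
    notifyAll();                     // wake writers blocked on the cap
  }
}
{code}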



[jira] [Commented] (HBASE-15436) BufferedMutatorImpl.flush() appears to get stuck

2016-03-15 Thread Anoop Sam John (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196846#comment-15196846
 ] 

Anoop Sam John commented on HBASE-15436:


bq. And also in my case it was happening due to the HBase master and Region server going down abruptly because of connectivity problems with zookeeper.
So the HBase cluster will come back after some time? In normal usage the OM will bring it up again? What is your use case?




[jira] [Commented] (HBASE-15436) BufferedMutatorImpl.flush() appears to get stuck

2016-03-15 Thread Anoop Sam John (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196723#comment-15196723
 ] 

Anoop Sam John commented on HBASE-15436:


When close() is called on BufferedMutator, what is the expectation? That all prior (async) writes should get synced to the HBase RSs?
When data is pumped in asynchronously, it is expected that there may be data loss if the client goes down abruptly. So we can say that if close is called, we try to flush the data and shut down gracefully. If the flush is not completing normally, we may need to close without flushing the remaining mutations (?). How about the decision making? A bounded-wait close along these lines is sketched below. cc [~tedyu]
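One possible shape for such a "graceful but bounded" close, sketched with a plain executor. This is illustrative only; the timeout, the WARN wording, and the helper class are assumptions, not BufferedMutatorImpl behaviour.
{code}
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedClose {
  private final ExecutorService pool = Executors.newSingleThreadExecutor();

  /** Try a final flush on close, but never wait longer than the given timeout. */
  public void close(Runnable flush, long timeoutSeconds) throws IOException {
    Future<?> pending = pool.submit(flush);            // best-effort final flush
    try {
      pending.get(timeoutSeconds, TimeUnit.SECONDS);   // wait, but not forever
    } catch (TimeoutException e) {
      pending.cancel(true);
      System.err.println("WARN: close() timed out; unflushed mutations are lost");
    } catch (Exception e) {
      throw new IOException("flush during close failed", e);
    } finally {
      pool.shutdownNow();
    }
  }
}
{code}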



[jira] [Commented] (HBASE-15436) BufferedMutatorImpl.flush() appears to get stuck

2016-03-15 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196687#comment-15196687
 ] 

Naganarasimha G R commented on HBASE-15436:
---

Thanks for looking into it [~anoopsamjohn].
bq. So this kind of a scenario the application should take care? I mean shutdown the clients (the NMs in this case) before the HBase cluster goes down
Well, this could be an OM/admin operation which I think YARN/the platform will have little control over. And in my case it was happening due to the HBase master and Region server going down abruptly because of connectivity problems with zookeeper. I attached the HBase logs from the last time I hit this on 1.0.3 in YARN-4736.
I faced this issue when trying to test ATS Next Gen with HBase on a pseudo cluster, and it was easily reproduced when the zookeeper data folder was set to the default {{tmp/hbase-}}. Not sure whether that is coincidence or the cause.



[jira] [Commented] (HBASE-15436) BufferedMutatorImpl.flush() appears to get stuck

2016-03-15 Thread Anoop Sam John (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196669#comment-15196669
 ] 

Anoop Sam John commented on HBASE-15436:


And yes, I believe all 1.0+ releases have similar code.



[jira] [Commented] (HBASE-15436) BufferedMutatorImpl.flush() appears to get stuck

2016-03-15 Thread Anoop Sam John (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196668#comment-15196668
 ] 

Anoop Sam John commented on HBASE-15436:


The HBase cluster fully went down. So should the application take care of this kind of scenario? I mean, shut down the clients (the NMs in this case) before the HBase cluster goes down (?)

Another thing: when the flush in BufferedMutator cannot complete (for example because of temporary unavailability of the HBase RS(s)), do the put ops on it get blocked? I don't think so. That will make the client side (NM) go out of memory at some point? This we need to fix.

When close() is called on BufferedMutator, what is the expectation? That all prior (async) writes should get synced to the HBase RSs?



[jira] [Commented] (HBASE-15436) BufferedMutatorImpl.flush() appears to get stuck

2016-03-15 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195735#comment-15195735
 ] 

Sangjin Lee commented on HBASE-15436:
-

Thanks [~anoop.hbase]. That sounds plausible.

This does represent a pretty critical issue then, no? If a region server is in 
a state where a socket timeout is thrown in this manner, flush will be stuck 
for a LONG time. In a high throughput situation, this would imply a pretty 
severe consequence and induce huge instability on the client.

For background, we are working on using HBase for the timeline service v.2 (see YARN-4736), and node managers will be HBase clients. If a region server or region servers are in an unhealthy state, this issue would cause a pretty big cascading effect on the client cluster, correct?

Does this behavior exist in all later releases as well?



[jira] [Commented] (HBASE-15436) BufferedMutatorImpl.flush() appears to get stuck

2016-03-13 Thread Anoop Sam John (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192411#comment-15192411
 ] 

Anoop Sam John commented on HBASE-15436:


{code}
"pool-14-thread-1" prio=10 tid=0x7f4215268000 nid=0x46e6 waiting on condition [0x7f41fe75d000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0xeeb5a010> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
	at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:374)
	at org.apache.hadoop.hbase.util.BoundedCompletionService.take(BoundedCompletionService.java:75)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:190)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:56)
	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
	at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.loadCache(ClientSmallReversedScanner.java:211)
	at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.next(ClientSmallReversedScanner.java:185)
	at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1200)
	at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1109)
	at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:369)
	at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:320)
	at org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:206)
	at org.apache.hadoop.hbase.client.BufferedMutatorImpl.flush(BufferedMutatorImpl.java:183)
{code}

When I say the flush is still continuing through each of the Mutations, and you say the thread doing the flush op looks like it is doing nothing: the issue is that the thread doing the flush op works in a loop, and that op in turn issues a Meta table scan. The scan op is handed off to another thread in a pool, and the original flush thread waits for that scan thread to complete. You can clearly see this in the trace above.
So this thread waits for the result, and that result is an Exception (SocketTimeout) which it only sees after minutes. Then the flush thread comes back to life, continues the loop, and goes into this wait mode again..!! The same pattern is illustrated in the small example below.
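A stripped-down illustration of that wait pattern using plain JDK classes (not HBase code): the calling thread hands the work to a pool and parks on CompletionService.take(), so it shows up as WAITING even though the operation is still "in progress" until the worker finally fails with a timeout. All names and the 5 second sleep are made up.
{code}
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParkedCallerDemo {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    CompletionService<String> cs = new ExecutorCompletionService<>(pool);

    cs.submit(new Callable<String>() {       // stands in for the meta scan
      public String call() throws Exception {
        Thread.sleep(5000);                  // stands in for the long socket timeout
        throw new java.net.SocketTimeoutException("callTimeout exceeded");
      }
    });

    // The "flush" thread parks here in WAITING state, just like the
    // BoundedCompletionService.take() frame in the dump above, and only
    // wakes up when the worker completes with the timeout exception.
    try {
      cs.take().get();
    } finally {
      pool.shutdownNow();
    }
  }
}
{code}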



[jira] [Commented] (HBASE-15436) BufferedMutatorImpl.flush() appears to get stuck

2016-03-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15191553#comment-15191553
 ] 

Sangjin Lee commented on HBASE-15436:
-

Thanks [~anoop.hbase] for your comments. To answer your question,

bq. So after seeing this log, how long did you wait?

I believe the user tried to shut it down about 30 minutes after this failure:
{noformat}
Fri Feb 26 00:39:03 IST 2016, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=68065: row 'timelineservice.entity,naga!yarn_cluster!flow_1456425026132_1!���!M�����!YARN_CONTAINER!container_1456425026132_0001_01_01,99' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=localhost,16201,1456365764939, seqNum=0
...
2016-02-26 01:09:19,799 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: RECEIVED SIGNAL 15: SIGTERM
{noformat}

Also, I'm not too sure it is the case that flush is still going through the 
mutations. This is the stack trace of the thread that was in the {{flush()}} 
call (taken *after* this exception was seen):

{noformat}
"pool-14-thread-1" prio=10 tid=0x7f4215268000 nid=0x46e6 waiting on condition [0x7f41fe75d000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0xeeb5a010> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
	at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:374)
	at org.apache.hadoop.hbase.util.BoundedCompletionService.take(BoundedCompletionService.java:75)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:190)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:56)
	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
	at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.loadCache(ClientSmallReversedScanner.java:211)
	at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.next(ClientSmallReversedScanner.java:185)
	at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1200)
	at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1109)
	at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:369)
	at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:320)
	at org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:206)
	at org.apache.hadoop.hbase.client.BufferedMutatorImpl.flush(BufferedMutatorImpl.java:183)
	- locked <0xc246f268> (a org.apache.hadoop.hbase.client.BufferedMutatorImpl)
	at org.apache.hadoop.yarn.server.timelineservice.storage.common.BufferedMutatorDelegator.flush(BufferedMutatorDelegator.java:66)
	at org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.flush(HBaseTimelineWriterImpl.java:457)
	at org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager$WriterFlushTask.run(TimelineCollectorManager.java:230)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
{noformat}

The stack trace strongly indicates that it is waiting for more tasks to be 
completed and *is idle*. I wasn't the one who observed this, and don't have any 
more thread dumps around that time.


[jira] [Commented] (HBASE-15436) BufferedMutatorImpl.flush() appears to get stuck

2016-03-11 Thread Anoop Sam John (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15190717#comment-15190717
 ] 

Anoop Sam John commented on HBASE-15436:


So you are saying that even after you see the log about the failure (after some 30+ minutes, in fact 36 minutes I guess, as the socket timeout seems to be 1 minute and there are 36 attempts), the flush is still not coming out. After seeing this log, how long did you wait?
This is an async way of writing to the table: when the size of the accumulated puts reaches some configured size, we do a flush; until then, puts are accumulated on the client side.
I believe I found the issue. This is not a deadlock or anything like that.
To this flush we pass all the Rows to flush (write to the RSs); by Rows I mean Mutations. It tries to group the mutations per server and contacts each server with the list of mutations destined for it. To do this grouping it checks the region location for each row, and that scan against META (as shown in the logs) fails. For the 1st Mutation in this list alone it took 36 minutes, because the scan to META has retries and each attempt fails only after the SocketTimeout.

See AsyncProcess#submit:
{code}
do {
  ...
  int posInList = -1;
  Iterator it = rows.iterator();
  while (it.hasNext()) {
    Row r = it.next();
    HRegionLocation loc;
    try {
      if (r == null) throw new IllegalArgumentException("#" + id + ", row cannot be null");
      // Make sure we get 0-s replica.
      RegionLocations locs = connection.locateRegion(
          tableName, r.getRow(), true, true, RegionReplicaUtil.DEFAULT_REPLICA_ID);
      ...
    } catch (IOException ex) {
      locationErrors = new ArrayList();
      locationErrorRows = new ArrayList();
      LOG.error("Failed to get region location ", ex);
      // This action failed before creating ars. Retain it, but do not add to submit list.
      // We will then add it to ars in an already-failed state.
      retainedActions.add(new Action(r, ++posInList));
      locationErrors.add(ex);
      locationErrorRows.add(posInList);
      it.remove();
      break; // Backward compat: we stop considering actions on location error.
    }
    ...
  }
} while (retainedActions.isEmpty() && atLeastOne && (locationErrors == null));
{code}
The List 'rows' is the same List which BufferedMutatorImpl holds (i.e. writeAsyncBuffer). So when the region location lookup fails for the 1st Mutation, that Mutation also gets removed from this List, as you can see above. It will eventually be marked as a failed op, and the flow comes back to BufferedMutatorImpl#backgroundFlushCommits.
Here we can see:
{code}
if (synchronous || ap.hasError()) {
  while (!writeAsyncBuffer.isEmpty()) {
    ap.submit(tableName, writeAsyncBuffer, true, null, false);
  }
  ...
}
{code}
The loop continues until writeAsyncBuffer is empty. So in these 36 minutes we could remove only one item from the list. Then it goes on and removes the 2nd, and so on. So if there are 100 Mutations in the list when we called flush(), it would only finish after 36 * 100 minutes!

I don't know much about the design considerations of this AsyncProcess, etc. Maybe we should narrow the lock on the close() method down from method level, set something like a closing state to true, and have the retries within these flows check that state and exit early with a fat WARN log saying we will lose some of the mutations applied so far. (?)



[jira] [Commented] (HBASE-15436) BufferedMutatorImpl.flush() appears to get stuck

2016-03-09 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15188261#comment-15188261
 ] 

Sangjin Lee commented on HBASE-15436:
-

See YARN-4736 for more details.
