What version are you on?

>  HTimedOutException is logged for all the nodes. 
TimedOutException happens when fewer than CL replica nodes respond to the 
coordinator in time. 
You could get the error from every node in your cluster if the 3 replicas that 
store the key are having problems. 
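If it helps to see the mechanics, here is a rough sketch in Java (not Cassandra's 
actual code, just an illustration) of what the coordinator does: it works out how 
many replica acks the CL requires and times out if that many do not arrive within 
rpc_timeout.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class CoordinatorWaitSketch {
    // acks the coordinator must collect before it can answer the client
    static int requiredAcks(String cl, int rf) {
        switch (cl) {
            case "ONE":    return 1;
            case "QUORUM": return (rf / 2) + 1;
            case "ALL":    return rf;
            default: throw new IllegalArgumentException(cl);
        }
    }

    // each replica that applies the mutation counts the latch down once;
    // with RF 3 and CL ONE the latch starts at 1, so a timeout here means
    // none of the three replicas answered within rpc_timeout
    static void awaitWrite(CountDownLatch acks, long rpcTimeoutMs)
            throws TimeoutException, InterruptedException {
        if (!acks.await(rpcTimeoutMs, TimeUnit.MILLISECONDS)) {
            throw new TimeoutException("too few replica acks: client sees TimedOutException");
        }
    }
}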

> MutationStage 16 2177067 879092633 0 0
This looks like mutations are blocked or running very slowly. 

> FlushWriter 0 0 5616 0 1321
The All Time Blocked number means there were 1,321 times a thread tried to 
flush a memtable but the queue of flushers was full. Do you use secondary 
indexes? If so, take a look at the comments for memtable_flush_queue_size in 
the yaml file. 
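To make that concrete, here is a conceptual sketch (again, not Cassandra's code) of 
why the counter grows: the flush stage has a fixed pool of writer threads feeding a 
bounded queue sized by memtable_flush_queue_size, and once the queue is full the 
thread handing over the next memtable has to wait, which is what the All Time 
Blocked column counts. Secondary indexes matter because flushing a CF with indexes 
also queues a flush for each index memtable, which fills the queue faster.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class FlushQueueSketch {
    static final AtomicLong allTimeBlocked = new AtomicLong();

    static ThreadPoolExecutor flushWriter(int flushWriters, int memtableFlushQueueSize) {
        return new ThreadPoolExecutor(
            flushWriters, flushWriters, 0, TimeUnit.SECONDS,
            new ArrayBlockingQueue<Runnable>(memtableFlushQueueSize),
            new RejectedExecutionHandler() {
                // queue is full: make the submitter wait for room and
                // remember that it happened (the "All time blocked" column)
                public void rejectedExecution(Runnable task, ThreadPoolExecutor pool) {
                    allTimeBlocked.incrementAndGet();
                    try {
                        pool.getQueue().put(task);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
    }
}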

>   and cluster settings, should it still be possible for a write to succeed 
>   on one of the nodes even though node-3 is too busy or failing for any 
>   reason?
Yes. 
With RF 3 and CL ONE the write succeeds as long as at least one replica responds 
in time, so up to two replicas can fail. If you got a TimedOutException at CL ONE 
it sounds like more than one of the replicas was having problems. 
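For reference, this is roughly what I would expect the client side to look like 
when pinning writes to CL ONE. It is only a minimal sketch; the Hector class and 
method names are from memory of the 1.x API, and the host list and row key are 
placeholders, so double check everything against the version you are running.

import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.HConsistencyLevel;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class WriteAtOne {
    public static void main(String[] args) {
        // placeholder host list: all nine nodes would normally be listed here
        Cluster cluster = HFactory.getOrCreateCluster("MyCluster",
            new CassandraHostConfigurator("node1:9160,node2:9160,node3:9160"));

        // pin the default write CL to ONE for this keyspace
        ConfigurableConsistencyLevel ccl = new ConfigurableConsistencyLevel();
        ccl.setDefaultWriteConsistencyLevel(HConsistencyLevel.ONE);

        Keyspace ks = HFactory.createKeyspace("MyKeyspace", cluster, ccl);

        // a single write: with RF 3 it only needs one replica ack to succeed
        Mutator<String> mutator = HFactory.createMutator(ks, StringSerializer.get());
        mutator.insert("some-key", "MyCF",
            HFactory.createStringColumn("col", "value"));
    }
}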

> * when the Hector client fails over to other nodes, basically all the nodes 
>   fail, why is this so?
Sorry, I don't understand this question. 

> * what factors increase the MutationStage active and pending values?
Check the log for ERRORs. 
Check for failing or overloaded IO. 
See the comment above about memtable_flush_queue_size. 
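If you want to watch the MutationStage backlog continuously rather than polling 
nodetool, the stages are also exposed over JMX. The sketch below is an assumption 
about the MBean name and attributes (org.apache.cassandra.request:type=MutationStage 
with PendingTasks and ActiveCount); verify them in jconsole for your version before 
relying on it.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class MutationStageWatcher {
    public static void main(String[] args) throws Exception {
        // 7199 is the default Cassandra JMX port; point this at the node you suspect
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://node-3:7199/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            // assumed MBean name; confirm it in jconsole for your version
            ObjectName mutationStage =
                new ObjectName("org.apache.cassandra.request:type=MutationStage");
            System.out.println("active  = " + mbs.getAttribute(mutationStage, "ActiveCount"));
            System.out.println("pending = " + mbs.getAttribute(mutationStage, "PendingTasks"));
        } finally {
            jmxc.close();
        }
    }
}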

Hope that helps. 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 18/09/2012, at 4:24 PM, Jason Wee <peich...@gmail.com> wrote:

> Hello,
> 
> Some context on our environment: we have a cluster of 9 nodes with a few 
> keyspaces. The client writes to a keyspace with a replication factor of 3 using 
> a consistency level of one. The Hector client is configured with all the nodes 
> in the cluster specified, the intention being that on any write request two 
> nodes can fail and the write still succeeds on one node.
> 
> However, under certain situations we see HTimedOutException logged while 
> writing to the cluster. The Hector client then fails over to the next node in 
> the cluster, but we noticed that the same HTimedOutException is logged for all 
> the nodes, so the cluster is effectively not working as a whole. We checked all 
> the nodes in the cluster for load. Only node-3 seems to have a high pending 
> MutationStage when nodetool tpstats is run; the other nodes are fine, with 0 
> active and 0 pending for all the stages. 
> 
> ./nodetool -h localhost tpstats
> Pool Name               Active   Pending    Completed   Blocked   All time blocked
> ReadStage                    0         0     11116983         0                  0
> RequestResponseStage         0         0   1252368951         0                  0
> MutationStage               16   2177067    879092633         0                  0
> ReadRepairStage              0         0      3648106         0                  0
> ReplicateOnWriteStage        0         0     33722610         0                  0
> GossipStage                  0         0     20504608         0                  0
> AntiEntropyStage             0         0         1197         0                  0
> MigrationStage               0         0           89         0                  0
> MemtablePostFlusher          0         0         5659         0                  0
> StreamStage                  0         0          296         0                  0
> FlushWriter                  0         0         5616         0               1321
> MiscStage                    0         0         5964         0                  0
> AntiEntropySessions          0         0           88         0                  0
> InternalResponseStage        0         0           27         0                  0
> HintedHandoff                1         2         5976         0                  0
> 
> Message type       Dropped
> RANGE_SLICE              0
> READ_REPAIR              0
> BINARY                   0
> READ                   178
> MUTATION             17467
> REQUEST_RESPONSE         0
> 
> We then checked whether any compaction was running on node-3 and found the 
> following:
> 
> ./nodetool -h localhost compactionstats
> pending tasks: 196
> compaction type   keyspace     column family   bytes compacted   bytes total   progress
> Cleanup           MyKeyspace   MyCF                 6946398685   10230720119     67.90%
> 
> 
> Question:
> * with a replication factor of 3 on the keyspace and a client write consistency 
>   level of one, given the situation above and the current Hector client and 
>   cluster settings, should it still be possible for a write to succeed 
>   on one of the nodes even though node-3 is too busy or failing for any 
>   reason? 
>   
> * when the Hector client fails over to other nodes, basically all the nodes 
>   fail, why is this so?
>   
> * what factors increase the MutationStage active and pending values?
> 
> Thank you for any comments and insight
> 
> Regards,
> Jason
