Re: Unable to bootstrap new node

Keith Wright Thu, 03 Oct 2013 07:15:46 -0700

Thanks for the response. We are still having issues bootstrapping a node. 
Quick background on where we are (1.2.8 with vnodes):


 *   A node started complaining about corrupted SSTables. We tried deleting 
them one by one, but it quickly became a whack-a-mole problem, so we decided 
to wipe the node and bootstrap it again.
 *   We shut down that node and ran nodetool removenode from another node.
 *   We wiped the affected node's data and then attempted to bootstrap it (with 
the same IP, of course).
 *   Every time we attempt to add the node, 2 of the 4 nodes sending data 
(the same 2 nodes each time) have streaming failures, which I believe are 
caused by GC (see logging below). The streaming from these two nodes fails 
within the first couple of minutes of bootstrapping the node.
 *   We tried restarting the nodes that failed to stream, but the bootstrapping 
node did not automatically re-attempt the streaming, and again we couldn't find 
a way to force it to.
 *   We tried raising the heap and new size on the nodes to reduce GC pressure 
(from our original 10 GB to 14 GB), but no luck. We also decreased 
stream_throughput_outbound_megabits_per_sec from 400 to 200.
 *   Eventually the bootstrapping node just hangs, as it never gets data from 
the 2 nodes, and I can find no way to make it re-attempt.

I'm at a bit of a loss. Honestly, bootstrapping nodes has been a total 
nightmare for me and makes me very concerned about our ability to fix/grow our 
cluster as needed. I hoped vnodes would help, but so far no luck. Here are the 
options as I see them:

 *   Hope someone here has a great idea on how to fix it :)
 *   Assuming I can't get the node to bootstrap, I can start it with bootstrap 
disabled and trigger a repair. Is there any way to ensure it doesn't serve any 
reads during that time? I can disable the thrift/binary ports, but it will 
still handle requests from other coordinator nodes. We usually run at read ANY, 
so to ensure we don't miss data we would need to run at QUORUM until the repair 
completes.
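
The reasoning behind reading at QUORUM while a node is empty can be sketched as follows. This is only an illustration: the replication factor of 3 is an assumption (the thread does not state it), and the helper names are hypothetical. It counts how many replicas a read consults in the worst case once the wiped node is excluded.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond at QUORUM."""
    return replication_factor // 2 + 1


def populated_replicas_consulted(read_replicas: int, empty_replicas: int = 1) -> int:
    """Worst case: every empty replica is among the replicas the read consults."""
    return max(read_replicas - empty_replicas, 0)


RF = 3  # assumed replication factor, not stated in the thread

# A single-replica read may land entirely on the empty node.
print(populated_replicas_consulted(1))           # 0

# A QUORUM read always includes at least one replica that still has data.
print(populated_replicas_consulted(quorum(RF)))  # 1
```

Note that this only shows a populated replica is consulted. Whether that replica holds the latest write still depends on the consistency level used for writes, so the repair remains necessary.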

Thanks for the help!

Existing streaming node 1 (10.8.44.98):
ERROR [GossipTasks:1] 2013-10-03 13:09:28,654 AbstractStreamSession.java (line 
110) Stream failed because /10.8.44.84 died or was restarted/removed (streams 
may still be active in background, but further streams won't be started)
ERROR [GossipTasks:1] 2013-10-03 13:09:28,720 AbstractStreamSession.java (line 
110) Stream failed because /10.8.44.84 died or was restarted/removed (streams 
may still be active in background, but further streams won't be started)

Existing streaming node 2 (10.8.44.72):
ERROR [GossipTasks:1] 2013-10-03 13:10:02,174 AbstractStreamSession.java (line 
110) Stream failed because /10.8.44.84 died or was restarted/removed (streams 
may still be active in background, but further streams won't be started)
ERROR [GossipTasks:1] 2013-10-03 13:10:02,185 AbstractStreamSession.java (line 
110) Stream failed because /10.8.44.84 died or was restarted/removed (streams 
may still be active in background, but further streams won't be started)
ERROR [ReplicateOnWriteStage:38] 2013-10-03 13:10:02,265 FailureDetector.java 
(line 154) unknown endpoint /10.8.44.84
ERROR [ReplicateOnWriteStage:36] 2013-10-03 13:10:02,302 FailureDetector.java 
(line 154) unknown endpoint /10.8.44.84
ERROR [Native-Transport-Requests:151] 2013-10-03 13:10:02,282 
FailureDetector.java (line 154) unknown endpoint /10.8.44.84
ERROR [ReplicateOnWriteStage:37] 2013-10-03 13:10:02,318 FailureDetector.java 
(line 154) unknown endpoint /10.8.44.84

Bootstrapping node (10.8.44.84):
ERROR [GossipTasks:1] 2013-10-03 13:09:23,196 AbstractStreamSession.java (line 
110) Stream failed because /10.8.44.98 died or was restarted/removed (streams 
may still be active in background, but further streams won't be started)
ERROR [GossipTasks:1] 2013-10-03 13:09:24,199 AbstractStreamSession.java (line 
110) Stream failed because /10.8.44.72 died or was restarted/removed (streams 
may still be active in background, but further streams won't be started)
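
For anyone comparing logs across nodes, a small script along these lines may help tally which peers each node reported stream failures against. It is a sketch only: the function name is hypothetical, and the pattern is written against the message text shown in the excerpts above, so other Cassandra versions may word it differently.

```python
import re
from collections import Counter

# Matches the "Stream failed because /<ip> died" message from the excerpts above.
STREAM_FAILED = re.compile(r"Stream failed because /(\d{1,3}(?:\.\d{1,3}){3}) died")


def count_stream_failures(log_text: str) -> Counter:
    """Count stream-failure messages per peer IP.

    Whitespace is collapsed first so messages wrapped across lines still match.
    """
    flat = " ".join(log_text.split())
    return Counter(STREAM_FAILED.findall(flat))


# Log lines copied from the bootstrapping node (10.8.44.84) above.
sample = """
ERROR [GossipTasks:1] 2013-10-03 13:09:23,196 AbstractStreamSession.java (line
110) Stream failed because /10.8.44.98 died or was restarted/removed (streams
may still be active in background, but further streams won't be started)
ERROR [GossipTasks:1] 2013-10-03 13:09:24,199 AbstractStreamSession.java (line
110) Stream failed because /10.8.44.72 died or was restarted/removed (streams
may still be active in background, but further streams won't be started)
"""

for peer, failures in sorted(count_stream_failures(sample).items()):
    print(peer, failures)
```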


From: Robert Coli <[email protected]>
Reply-To: [email protected]
Date: Wednesday, October 2, 2013 1:55 PM
To: [email protected]
Subject: Re: Unable to bootstrap new node

On Wed, Oct 2, 2013 at 8:12 AM, Keith Wright <[email protected]> wrote:
> We are running C* 1.2.8 with Vnodes enabled and are attempting to bootstrap 
> a new node and are having issues. When we add the node we see it bootstrap and 
> we see data start to stream over from other nodes however we are seeing one of 
> the other nodes get stuck in full GCs to the point where we had to restart one 
> of the nodes. I assume this is because building the merkle tree is expensive.

Merkle trees are only involved in "repair", not in normal bootstrap. Have you 
considered lowering the throttle for streaming? Bootstrap will be slower but 
should be less likely to overwhelm heap.
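
To get a feel for the trade-off, the throttle values mentioned in this thread (400 and 200 megabits per second) can be converted to megabytes per second and to a rough lower bound on transfer time. This is a back-of-the-envelope sketch: the 100 GB figure is purely hypothetical, the helper names are made up for illustration, and the throttle is only an upper limit on each sending node, so real transfers may be slower.

```python
def megabits_to_megabytes_per_sec(megabits_per_sec: float) -> float:
    """Convert a throttle in megabits/s to megabytes/s (8 bits per byte)."""
    return megabits_per_sec / 8


def min_hours_to_stream(data_gb: float, megabits_per_sec: float) -> float:
    """Lower bound on hours for one node to send data_gb at the throttle limit."""
    data_mb = data_gb * 1000  # decimal units: 1 GB = 1000 MB
    seconds = data_mb / megabits_to_megabytes_per_sec(megabits_per_sec)
    return seconds / 3600


for throttle in (400, 200):  # values from the thread
    print(throttle, "megabits/s =", megabits_to_megabytes_per_sec(throttle), "MB/s")

# Hypothetical: one node streaming 100 GB at each throttle setting.
HYPOTHETICAL_GB = 100
for throttle in (400, 200):
    print(throttle, round(min_hours_to_stream(HYPOTHETICAL_GB, throttle), 2), "hours")
```

Halving the throttle doubles the minimum transfer time, which is the cost being traded for lower heap pressure on the sending nodes.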

> Any way to force the streaming to restart? Have others seen this?

In the bootstrap case, you can just wipe the bootstrapping node and re-start 
the bootstrap.

In the general case regarding hung streaming:

https://issues.apache.org/jira/browse/CASSANDRA-3486

The only solution to hung non-bootstrap streaming is to restart all nodes 
participating in the streaming. With vnodes, this will probably approach 100% 
of nodes...

=Rob
