Are you using broadcast_address? If yes, you might be affected by https://issues.apache.org/jira/browse/CASSANDRA-3503.
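If you are not sure whether it is set, something along these lines should show it (the cassandra.yaml path here assumes a packaged install; adjust it for your layout):

grep -n "broadcast_address" /etc/cassandra/cassandra.yaml   # config path assumed; adjust for your install

An uncommented broadcast_address line means the setting is in effect on that node.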
>>> Nodes are all up while repairing is running.

I should have been clearer: are you seeing messages like the following in
the logs (nodes going UP/DOWN) during the period of the repair?

INFO [GossipStage:1] 2012-05-01 19:52:00,515 Gossiper.java (line 804)
InetAddress /xx.xx.xx.xx is now UP

Regards,
</VJ>

On Wed, May 9, 2012 at 5:49 AM, Bill Au <bill.w...@gmail.com> wrote:

> I am running 1.0.8. Two data centers with 8 machines in each dc. Nodes
> are all up while the repair is running. No dropped Mutations/Messages. I
> do see HintedHandoff messages.
>
> Bill
>
> On Tue, May 8, 2012 at 11:15 PM, Vijay <vijay2...@gmail.com> wrote:
>
>> What is the version you are using? Is it a multi-DC setup? Are you
>> seeing a lot of dropped Mutations/Messages? Are the nodes going up and
>> down all the time while the repair is running?
>>
>> Regards,
>> </VJ>
>>
>> On Tue, May 8, 2012 at 2:05 PM, Bill Au <bill.w...@gmail.com> wrote:
>>
>>> There are no error messages in my log.
>>>
>>> I ended up restarting all the nodes in my cluster. After that I was
>>> able to run repair successfully on one of the nodes. It took about 40
>>> minutes. Feeling lucky, I ran repair on another node and it is stuck
>>> again.
>>>
>>> tpstats shows 1 active and 1 pending AntiEntropySessions. netstats and
>>> compactionstats show no activity. I took a close look at the log file;
>>> it shows that the node requested merkle trees from 4 nodes (including
>>> itself). It actually received 3 of those merkle trees. It looks like it
>>> is stuck waiting for that last one. I checked the node where the
>>> request was sent to, and there isn't anything in its log about repair.
>>> So it looks like the merkle tree request has gotten lost somehow. It
>>> has been 8 hours since the repair was issued and it is still stuck. I
>>> am going to let it run a bit longer to see if it will eventually finish.
>>>
>>> I have observed that if I restart all the nodes, I am able to run
>>> repair successfully on a single node. I have done that twice already.
>>> But after that all repairs will hang. Since we are supposed to run
>>> repair periodically, having to restart all nodes before running repair
>>> on each node isn't really viable for us.
>>>
>>> Bill
>>>
>>> On Tue, May 8, 2012 at 6:04 AM, aaron morton <aa...@thelastpickle.com> wrote:
>>>
>>>> When you look in the logs please let me know if you see this error…
>>>> https://issues.apache.org/jira/browse/CASSANDRA-4223
>>>>
>>>> I look at nodetool compactionstats (for the Merkle tree phase),
>>>> nodetool netstats for the streaming, and this to check for streaming
>>>> progress:
>>>>
>>>> while true; do date; diff <(nodetool -h localhost netstats) <(sleep 5
>>>> && nodetool -h localhost netstats); done
>>>>
>>>> Or use DataStax OpsCenter where possible:
>>>> http://www.datastax.com/products/opscenter
>>>>
>>>> Cheers
>>>>
>>>> -----------------
>>>> Aaron Morton
>>>> Freelance Developer
>>>> @aaronmorton
>>>> http://www.thelastpickle.com
>>>>
>>>> On 8/05/2012, at 2:15 PM, Ben Coverston wrote:
>>>>
>>>> Check the log files for warnings or errors. They may indicate why your
>>>> repair failed.
>>>>
>>>> On Mon, May 7, 2012 at 10:09 AM, Bill Au <bill.w...@gmail.com> wrote:
>>>>
>>>>> I restarted the nodes and then restarted the repair. It is still
>>>>> hanging like before. Do I keep repeating until the repair actually
>>>>> finishes?
>>>>>
>>>>> Bill
>>>>>
>>>>> On Fri, May 4, 2012 at 2:18 PM, Rob Coli <rc...@palominodb.com> wrote:
>>>>>
>>>>>> On Fri, May 4, 2012 at 10:30 AM, Bill Au <bill.w...@gmail.com> wrote:
>>>>>> > I know repair may take a long time to run. I am running repair on
>>>>>> > a node with about 15 GB of data and it is taking more than 24
>>>>>> > hours. Is that normal? Is there any way to get status of the
>>>>>> > repair? tpstats does show 2 active and 2 pending
>>>>>> > AntiEntropySessions. But netstats and compactionstats show no
>>>>>> > activity.
>>>>>>
>>>>>> As indicated by various recent threads to this effect, many versions
>>>>>> of cassandra (including current 1.0.x release) contain bugs which
>>>>>> sometimes prevent repair from completing. The other threads suggest
>>>>>> that some of these bugs result in the state you are in now, where you
>>>>>> do not see anything that looks like appropriate activity.
>>>>>> Unfortunately the only solution offered on these other threads is the
>>>>>> one I will now offer, which is to restart the participating nodes and
>>>>>> re-start the repair. I am unaware of any JIRA tickets tracking these
>>>>>> bugs (which doesn't mean they don't exist, of course) so you might
>>>>>> want to file one. :)
>>>>>>
>>>>>> =Rob
>>>>>>
>>>>>> --
>>>>>> =Robert Coli
>>>>>> AIM&GTALK - rc...@palominodb.com
>>>>>> YAHOO - rcoli.palominob
>>>>>> SKYPE - rcoli_palominodb
>>>>
>>>> --
>>>> Ben Coverston
>>>> DataStax -- The Apache Cassandra Company
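P.S. For the UP/DOWN question above, a quick pass over the Cassandra system log should show any gossip state changes during the repair window, for example (the log path here assumes the packaged default; adjust it if your logs live elsewhere):

grep Gossiper /var/log/cassandra/system.log | grep "is now"   # log path assumed; adjust for your install

Any state-change lines logged around the time of the repair would point at the kind of node flapping I am asking about.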