Are you using broadcast_address? If yes, you might be affected by https://issues.apache.org/jira/browse/CASSANDRA-3503.
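If you are not sure whether it is set, something along these lines should show it (the cassandra.yaml path here assumes a packaged install; adjust it for your layout):

grep -n "broadcast_address" /etc/cassandra/cassandra.yaml   # config path assumed; adjust for your install

An uncommented broadcast_address line means the setting is in effect on that node.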
>>> Nodes are all up while repairing is running.

I should have been clearer: are you seeing messages like the following in
the logs (nodes going UP/DOWN) during the period of the repair?

INFO [GossipStage:1] 2012-05-01 19:52:00,515 Gossiper.java (line 804)
InetAddress /xx.xx.xx.xx is now UP

Regards,
</VJ>

On Wed, May 9, 2012 at 5:49 AM, Bill Au <bill.w...@gmail.com> wrote:

> I am running 1.0.8. Two data centers with 8 machines in each dc. Nodes
> are all up while the repair is running. No dropped Mutations/Messages. I
> do see HintedHandoff messages.
>
> Bill
>
> On Tue, May 8, 2012 at 11:15 PM, Vijay <vijay2...@gmail.com> wrote:
>
>> What is the version you are using? Is it a multi-DC setup? Are you
>> seeing a lot of dropped Mutations/Messages? Are the nodes going up and
>> down all the time while the repair is running?
>>
>> Regards,
>> </VJ>
>>
>> On Tue, May 8, 2012 at 2:05 PM, Bill Au <bill.w...@gmail.com> wrote:
>>
>>> There are no error messages in my log.
>>>
>>> I ended up restarting all the nodes in my cluster. After that I was
>>> able to run repair successfully on one of the nodes. It took about 40
>>> minutes. Feeling lucky, I ran repair on another node and it is stuck
>>> again.
>>>
>>> tpstats shows 1 active and 1 pending AntiEntropySessions. netstats and
>>> compactionstats show no activity. I took a close look at the log file;
>>> it shows that the node requested merkle trees from 4 nodes (including
>>> itself). It actually received 3 of those merkle trees. It looks like it
>>> is stuck waiting for that last one. I checked the node where the
>>> request was sent to, and there isn't anything in its log about repair.
>>> So it looks like the merkle tree request has gotten lost somehow. It
>>> has been 8 hours since the repair was issued and it is still stuck. I
>>> am going to let it run a bit longer to see if it will eventually finish.
>>>
>>> I have observed that if I restart all the nodes, I am able to run
>>> repair successfully on a single node. I have done that twice already.
>>> But after that all repairs will hang. Since we are supposed to run
>>> repair periodically, having to restart all nodes before running repair
>>> on each node isn't really viable for us.
>>>
>>> Bill
>>>
>>> On Tue, May 8, 2012 at 6:04 AM, aaron morton <aa...@thelastpickle.com> wrote:
>>>
>>>> When you look in the logs please let me know if you see this error…
>>>> https://issues.apache.org/jira/browse/CASSANDRA-4223
>>>>
>>>> I look at nodetool compactionstats (for the Merkle tree phase),
>>>> nodetool netstats for the streaming, and this to check for streaming
>>>> progress:
>>>>
>>>> while true; do date; diff <(nodetool -h localhost netstats) <(sleep 5
>>>> && nodetool -h localhost netstats); done
>>>>
>>>> Or use DataStax OpsCenter where possible:
>>>> http://www.datastax.com/products/opscenter
>>>>
>>>> Cheers
>>>>
>>>> -----------------
>>>> Aaron Morton
>>>> Freelance Developer
>>>> @aaronmorton
>>>> http://www.thelastpickle.com
>>>>
>>>> On 8/05/2012, at 2:15 PM, Ben Coverston wrote:
>>>>
>>>> Check the log files for warnings or errors. They may indicate why your
>>>> repair failed.
>>>>
>>>> On Mon, May 7, 2012 at 10:09 AM, Bill Au <bill.w...@gmail.com> wrote:
>>>>
>>>>> I restarted the nodes and then restarted the repair. It is still
>>>>> hanging like before. Do I keep repeating until the repair actually
>>>>> finishes?
>>>>>
>>>>> Bill
>>>>>
>>>>> On Fri, May 4, 2012 at 2:18 PM, Rob Coli <rc...@palominodb.com> wrote:
>>>>>
>>>>>> On Fri, May 4, 2012 at 10:30 AM, Bill Au <bill.w...@gmail.com> wrote:
>>>>>> > I know repair may take a long time to run. I am running repair on
>>>>>> > a node with about 15 GB of data and it is taking more than 24
>>>>>> > hours. Is that normal? Is there any way to get status of the
>>>>>> > repair? tpstats does show 2 active and 2 pending
>>>>>> > AntiEntropySessions. But netstats and compactionstats show no
>>>>>> > activity.
>>>>>>
>>>>>> As indicated by various recent threads to this effect, many versions
>>>>>> of cassandra (including current 1.0.x release) contain bugs which
>>>>>> sometimes prevent repair from completing. The other threads suggest
>>>>>> that some of these bugs result in the state you are in now, where you
>>>>>> do not see anything that looks like appropriate activity.
>>>>>> Unfortunately the only solution offered on these other threads is the
>>>>>> one I will now offer, which is to restart the participating nodes and
>>>>>> re-start the repair. I am unaware of any JIRA tickets tracking these
>>>>>> bugs (which doesn't mean they don't exist, of course) so you might
>>>>>> want to file one. :)
>>>>>>
>>>>>> =Rob
>>>>>>
>>>>>> --
>>>>>> =Robert Coli
>>>>>> AIM&GTALK - rc...@palominodb.com
>>>>>> YAHOO - rcoli.palominob
>>>>>> SKYPE - rcoli_palominodb
>>>>
>>>> --
>>>> Ben Coverston
>>>> DataStax -- The Apache Cassandra Company
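P.S. For the UP/DOWN question above, a quick pass over the Cassandra system log should show any gossip state changes during the repair window, for example (the log path here assumes the packaged default; adjust it if your logs live elsewhere):

grep Gossiper /var/log/cassandra/system.log | grep "is now"   # log path assumed; adjust for your install

Any state-change lines logged around the time of the repair would point at the kind of node flapping I am asking about.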