Re: Frustration with "repair" process in 1.1.11

2013-11-02 Thread Yuki Morishita
Hi, Oleg,

As you have already encountered, many people have been complaining about
repair, so we have been working actively to mitigate the problem.

> I run repair (with -pr) on DC2. First time I run it it gets *stuck* (i.e. 
> frozen) within the first 30 seconds, with no error or any sort of message

You said you are on 1.1.11, so
https://issues.apache.org/jira/browse/CASSANDRA-5393 comes to mind first.
The issue was that some messages, including repair ones, could get lost
silently, so we added a retry mechanism.
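
Roughly, the fix is the standard retry-on-timeout idea: if a request never
gets an acknowledgement, re-send it a few times instead of leaving the
coordinator waiting forever. A minimal sketch of the idea in Python, with
purely illustrative names (this is not Cassandra's actual messaging code):

    import time

    def send_with_retry(send_fn, message, ack_received,
                        timeout=10.0, max_attempts=3):
        # send_fn, message and ack_received are placeholders, not Cassandra APIs.
        for _ in range(max_attempts):
            send_fn(message)
            deadline = time.time() + timeout
            while time.time() < deadline:
                if ack_received():
                    return True
                time.sleep(0.1)
        # Pre-5393 behaviour: a silently dropped message simply left the
        # repair session hanging, with no error and no retry.
        return False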

> I then run it again -- and it completes in seconds on each node, with about 
> 50 gigs of data on each.

Cassandra calculates the difference between replicas using a Merkle tree
built from each row's hash value.
Cassandra used to detect many spurious differences if you do a lot of
deletes or have a lot of TTLed columns.
In 1.2 we fixed this (see
https://issues.apache.org/jira/browse/CASSANDRA-4905), so if this is
the case, upgrading would save space.
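
The idea, very roughly, looks like this (a toy sketch, not our actual
MerkleTree code; the hashing and tree layout are simplified):

    import hashlib

    def _h(*parts):
        # Illustrative hash helper; Cassandra uses its own hashing scheme.
        m = hashlib.md5()
        for p in parts:
            m.update(p if isinstance(p, bytes) else str(p).encode())
        return m.digest()

    def build_merkle_leaves(rows):
        # rows: list of (key, columns) pairs sorted by token.
        # Each leaf covers a small token range; one leaf per row here for simplicity.
        return [_h(key, columns) for key, columns in rows]

    def differing_ranges(leaves_a, leaves_b):
        # A real implementation walks the trees top-down and prunes whole
        # subtrees whose hashes already match; comparing leaves is enough
        # to show which ranges would need to be exchanged.
        return [i for i in range(min(len(leaves_a), len(leaves_b)))
                if leaves_a[i] != leaves_b[i]]

Only the ranges that hash differently get streamed between replicas, which
is why tombstones and expired columns feeding into those hashes used to
cause so much unnecessary streaming.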

> Is there any improvement and clarity in 1.2 ? How about 2.0 ?

Yes. The reason repair hangs prior to 2.0 is either 1) Merkle tree
creation failure (validation failure) or 2) streaming failure, with the
failing node never reporting back to the coordinator.
To fix this, we redesigned the message flow around the repair process in
https://issues.apache.org/jira/browse/CASSANDRA-5426.
At the same time, we improved data streaming among nodes by, again,
redesigning the streaming protocol
(http://www.datastax.com/dev/blog/streaming-in-cassandra-2-0).
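
At a very high level, the 2.0 flow looks something like the sketch below
(simplified Python pseudocode with made-up names, not the actual repair
classes): every participant now reports success or failure for each phase
back to the coordinator, so a failed validation or stream fails the whole
session quickly instead of hanging it.

    def run_repair_session(coordinator, neighbors, ranges):
        # coordinator and its methods are illustrative placeholders.
        # Phase 1: validation -- ask each neighbor for a Merkle tree.
        trees = {}
        for node in neighbors:
            ok, tree = coordinator.request_validation(node, ranges)
            if not ok:
                return coordinator.fail_session("validation failed on %s" % node)
            trees[node] = tree

        # Phase 2: sync -- compare trees pairwise and stream only the
        # differing ranges, again with explicit success/failure replies.
        for a, b in coordinator.replica_pairs(neighbors):
            diffs = coordinator.compare_trees(trees[a], trees[b])
            if diffs and not coordinator.request_streaming(a, b, diffs):
                return coordinator.fail_session(
                    "streaming failed between %s and %s" % (a, b))

        return coordinator.complete_session()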

So, with all that said, Cassandra 2.0.x is much improved compared to 1.1.11.

Hope this helps,

Yuki

On Fri, Nov 1, 2013 at 2:15 PM, Oleg Dulin  wrote:
> First I need to vent.
>
>
> 
>
> One of my cassandra clusters is a dual data center setup, with DC1 acting as
> primary, and DC2 acting as a hot backup.
>
>
> Well, guess what ? I am pretty sure that it falls behind on replication. So
> I am told I need to run repair.
>
>
> I run repair (with -pr) on DC2. First time I run it it gets *stuck* (i.e.
> frozen) within the first 30 seconds, with no error or any sort of message. I
> then run it again -- and it completes in seconds on each node, with about 50
> gigs of data on each.
>
>
> That seems suspicious, so I do some research.
>
>
> I am told on IRC that running repair -pr will only do the repair on "100"
> tokens (the offset from DC1 to DC2)… Seriously ???
>
>
> The repair process is, indeed, a joke:
> https://issues.apache.org/jira/browse/CASSANDRA-5396 . Repair is the worst
> thing you can do to your cluster: it consumes enormous resources and can
> leave your cluster in an inconsistent state. Oh, and by the way, you must run
> it every week…. Whoever invented that process must not live in the real
> world, with real applications.
>
> 
>
>
> No… let's have a constructive conversation.
>
>
> How do I know, with certainty, that my DC2 cluster is up to date on
> replication ? I have a few options:
>
>
> 1) I set read repair chance to 100% on critical column families and I write
> a tool to scan every CF, every column of every row. This strikes me as very
> silly.
>
> Q1: Do I need to scan every column or is looking at one column enough to
> trigger a read repair ?
>
>
> 2) Can someone explain to me how repair works, such that I don't totally
> trash my cluster or spill into the work week ?
>
>
> Is there any improvement and clarity in 1.2 ? How about 2.0 ?
>
>
>
>
> --
>
> Regards,
>
> Oleg Dulin
>
> http://www.olegdulin.com



-- 
Yuki Morishita
 t:yukim (http://twitter.com/yukim)


Frustration with "repair" process in 1.1.11

2013-11-01 Thread Oleg Dulin

First I need to vent.


One of my cassandra clusters is a dual data center setup, with DC1 
acting as primary, and DC2 acting as a hot backup.


Well, guess what ? I am pretty sure that it falls behind on 
replication. So I am told I need to run repair.


I run repair (with -pr) on DC2. First time I run it it gets *stuck* 
(i.e. frozen) within the first 30 seconds, with no error or any sort of 
message. I then run it again -- and it completes in seconds on each 
node, with about 50 gigs of data on each.


That seems suspicious, so I do some research.

I am told on IRC that running repair -pr will only do the repair on 
"100" tokens (the offset from DC1 to DC2)… Seriously ???


The repair process is, indeed, a joke: 
https://issues.apache.org/jira/browse/CASSANDRA-5396 . Repair is the 
worst thing you can do to your cluster: it consumes enormous resources 
and can leave your cluster in an inconsistent state. Oh, and by the way, 
you must run it every week…. Whoever invented that process must not 
live in the real world, with real applications.



No… let's have a constructive conversation.

How do I know, with certainty, that my DC2 cluster is up to date on 
replication ? I have a few options:


1) I set read repair chance to 100% on critical column families and I 
write a tool to scan every CF, every column of every row. This strikes 
me as very silly. 
Q1: Do I need to scan every column or is looking at one column enough 
to trigger a read repair ?


2) Can someone explain to me how repair works, such that I don't 
totally trash my cluster or spill into the work week ?


Is there any improvement and clarity in 1.2 ? How about 2.0 ?



--
Regards,
Oleg Dulin
http://www.olegdulin.com