Re: Tombstone passed GC period causes un-repairable inconsistent data

2018-06-21 Thread Jeff Jirsa
Think he's talking about
https://issues.apache.org/jira/browse/CASSANDRA-6434

Doesn't solve every problem if you don't run repair at all, but if you're
not running repairs, you're nearly guaranteed problems with resurrection
after gcgs anyway.




Re: Tombstone passed GC period causes un-repairable inconsistent data

2018-06-21 Thread Jay Zhuang
Yes, I also agree that the user should run (incremental) repair within GCGS
to prevent this from happening.

@Sankalp, would you please point us to the patch from Marcus that you
mentioned? The problem is basically the same as
https://issues.apache.org/jira/browse/CASSANDRA-14145

CASSANDRA-11427 <https://issues.apache.org/jira/browse/CASSANDRA-11427> is
actually the opposite of this problem: because the purgeable tombstone is
repaired there, this un-repairable problem cannot be reproduced. I tried
2.2.5 (before that fix); it is able to repair the purgeable tombstone from
node1 to node2, so the data is deleted as expected. But that doesn't mean
it's the right behavior, as it also causes purgeable tombstones to keep
bouncing around the nodes.
I think https://issues.apache.org/jira/browse/CASSANDRA-14145 will fix the
problem by distinguishing repaired from unrepaired data.

How about having hint dispatch deliver/replay purgeable (not live)
tombstones? That would reduce the chance of hitting this issue, especially
when GCGS is shorter than the hinted handoff window.
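
(For reference, a minimal sketch of the two knobs being compared here; the
defaults shown are the stock values as far as I know, and foo.bar is the
table from the repro further down:)

# cassandra.yaml -- how long hints are collected for a down node
max_hint_window_in_ms: 10800000    # 3 hours by default

-- per-table tombstone GC grace; the repro shrinks this to 30 seconds
ALTER TABLE foo.bar WITH gc_grace_seconds = 864000;   -- 10 days by default

When gc_grace_seconds is smaller than max_hint_window_in_ms, a replica can
come back inside the hint window but after the tombstone has already become
purgeable on the other nodes, which is where replaying purgeable tombstones
from hints would help.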

On Wed, Jun 20, 2018 at 9:36 AM sankalp kohli 
wrote:

> I agree with Stefan that we should use incremental repair and use patches
> from Marcus to drop tombstones only from repaired data.
> Regarding deep repair, you can bump the gc grace and run the repair. The
> issue will be that you will stream a lot of data, and your blocking read
> repairs will also go up when you bump the gc grace to a higher value.
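
(A rough sketch of that deep-repair workaround, assuming the foo.bar table
from the repro below, a full rather than incremental repair, and purely
illustrative gc_grace values:)

# make the old tombstone non-purgeable again so repair will stream it
$ cqlsh -e "ALTER TABLE foo.bar WITH gc_grace_seconds = 864000;"
# full repair of just that table, so node2 receives the tombstone it missed
$ nodetool repair --full foo bar
# once repair has finished on every node, restore the original setting
$ cqlsh -e "ALTER TABLE foo.bar WITH gc_grace_seconds = 30;"

As the quoted note above says, the price is the extra streaming and the
elevated blocking read repairs while gc_grace is raised.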
>
> On Wed, Jun 20, 2018 at 1:10 AM Stefan Podkowinski 
> wrote:
>
> > Sounds like an older issue that I tried to address two years ago:
> > https://issues.apache.org/jira/browse/CASSANDRA-11427
> >
> > As you can see, the result wasn't as expected and we got some unintended
> > side effects from the patch. I'm not sure I'd be willing to give this
> > another try, considering the behaviour we'd like to fix in the first
> > place is rather harmless and the read repairs shouldn't happen at all
> > for users who regularly run repairs within gc_grace.
> >
> > What I'd suggest is to think more in the direction of a
> > post-full-repair world and to fully embrace incremental repairs, as
> > fixed by Blake in 4.0. In that case, we should stop doing read repairs
> > at all for repaired data, as described in
> > https://issues.apache.org/jira/browse/CASSANDRA-13912. RRs are certainly
> > useful, but can be very risky if not very, very carefully implemented.
> > So I'm wondering if we shouldn't disable RRs for everything but
> > unrepaired data. I'd btw also be interested to hear any opinions on this
> > in the context of transient replicas.
> >
> >
> > On 20.06.2018 03:07, Jay Zhuang wrote:
> > > Hi,
> > >
> > > We know that deleted data may re-appear if repair is not run within
> > > gc_grace_seconds. When the tombstone is not propagated to all nodes,
> > > the data will re-appear. But it also causes the following 2 issues
> > > before the tombstone is compacted away:
> > > a. inconsistent query results
> > >
> > > With consistency level ONE or QUORUM, a read may or may not return the
> > > value.
> > > b. lots of read repairs, but nothing gets repaired
> > >
> > > With consistency level ALL, it always triggers a read repair.
> > > With consistency level QUORUM, it also very likely (2/3 of the time)
> > > causes a read repair. But it doesn't repair the data, so it triggers a
> > > read repair every time.
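
(Both symptoms can be watched from cqlsh against the repro table below; a
small sketch, noting that CONSISTENCY and TRACING are built-in cqlsh
commands but the exact trace wording differs between versions:)

CONSISTENCY QUORUM;
TRACING ON;
SELECT * FROM foo.bar WHERE id = 1;
-- repeat the SELECT: the row shows up on some executions and not others
-- (issue a), and most traces end with a digest mismatch plus a read repair
-- that never converges (issue b)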
> > >
> > >
> > > Here are the reproducing steps:
> > >
> > > 1. Create a 3-node cluster
> > > 2. Create a table (with small gc_grace_seconds):
> > >
> > > CREATE KEYSPACE foo WITH replication = {'class': 'SimpleStrategy',
> > > 'replication_factor': 3};
> > > CREATE TABLE foo.bar (
> > >     id int PRIMARY KEY,
> > >     name text
> > > ) WITH gc_grace_seconds=30;
> > >
> > > 3. Insert data with consistency all:
> > >
> > > INSERT INTO foo.bar (id, name) VALUES(1, 'cstar');
> > >
> > > 4. Stop one node:
> > >
> > > $ ccm node2 stop
> > >
> > > 5. Delete the data with consistency quorum:
> > >
> > > DELETE FROM foo.bar WHERE id=1;
> > >
> > > 6. Wait 30 seconds and then start node2:
> > >
> > > $ ccm node2 start
> > >
> > > Now the tombstone is on node1 and node3 but not on node2.
> > >
> > > With a quorum read, it may or may not return the value, and read repair
> > > will send the data from node2 to node1 and node3, but it doesn't repair
> > > anything.
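
(One way to confirm where the tombstone actually lives, sketched assuming a
ccm cluster on 3.x with sstabledump on the PATH and the default ~/.ccm data
layout; <cluster> is a placeholder for the cluster name:)

$ ccm node1 nodetool flush
$ ccm node2 nodetool flush
$ ccm node3 nodetool flush
# node1 and node3 should show a partition-level deletion_info for id=1,
# while node2 still shows only the live 'cstar' row
$ sstabledump ~/.ccm/<cluster>/node1/data0/foo/bar-*/*-big-Data.db
$ sstabledump ~/.ccm/<cluster>/node2/data0/foo/bar-*/*-big-Data.db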
> > >
> > > I'd like to discuss a few potential solutions and workarounds:
> > >
> > > 1. Can hint replay send GCed tombstones?
> > >
> > > 2. Can we have a "deep repair" which detects such issues and repairs the
> > > GCed tombstone? Or temporarily increase gc_grace_seconds for the repair?
> > >
> > > What other suggestions do you have if the user is hitting this issue?
> > >
> > >
> > > Thanks,
> > >
> > > Jay
> > >
> >