Hi Ben,

I'm continuing to investigate the data loss issue. I'm examining the logs and source code and trying to reproduce the issue with a simple test. I'm also trying my destructive test with DROP instead of TRUNCATE.
BTW, I want to discuss the issue titled "failure node rejoin" again. Will this issue be fixed? Other nodes should refuse such an unexpected rejoin. Or should I be more careful when adding failed nodes back to the existing cluster?

Thanks,
yuji

On Fri, Nov 11, 2016 at 1:00 PM, Ben Slater <ben.sla...@instaclustr.com> wrote:

> From a quick look I couldn’t find any defects other than the ones you’ve
> found that seem potentially relevant to your issue (if anyone else on the
> list knows of one, please chime in). Maybe the next step, if you haven’t
> done so already, is to check your Cassandra logs for any signs of issues
> (i.e. WARNING or ERROR logs) in the failing case.
>
> Cheers
> Ben
>
> On Fri, 11 Nov 2016 at 13:07 Yuji Ito <y...@imagine-orb.com> wrote:
>
>> Thanks Ben,
>>
>> I tried 2.2.8 and could reproduce the problem.
>> So I'm investigating some bug fixes for repair and the commitlog between
>> 2.2.8 and 3.0.9:
>>
>> - CASSANDRA-12508: "nodetool repair returns status code 0 for some errors"
>> - CASSANDRA-12436: "Under some races commit log may incorrectly think it
>>   has unflushed data"
>>   - related to CASSANDRA-9669 and CASSANDRA-11828 (is the fix for 2.2
>>     different from that for 3.0?)
>>
>> Do you know of other bug fixes related to the commitlog?
>>
>> Regards
>> yuji
>>
>> On Wed, Nov 9, 2016 at 11:34 AM, Ben Slater <ben.sla...@instaclustr.com> wrote:
>>
>> There have been a few commit log bugs around in the last couple of months,
>> so perhaps you’ve hit something that was fixed recently. It would be
>> interesting to know whether the problem still occurs in 2.2.8.
>>
>> I suspect what is happening is that when you do your initial read
>> (without flush) to check the number of rows, the data is in memtables and
>> theoretically the commitlogs but not sstables. With the forced stop the
>> memtables are lost and Cassandra should read the commitlog from disk at
>> startup to reconstruct the memtables. However, it looks like that didn’t
>> happen for some (bad) reason.
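[Editor's note: Ben's explanation of the write path (memtable + commitlog, flush to sstables, replay on restart) can be sketched as a toy model. This is plain Python for illustration only, with made-up class and method names; it is not Cassandra code.]

```python
# Toy model of Cassandra's write durability path, illustrating why a
# forced stop loses data only if commitlog replay misbehaves.

class ToyNode:
    def __init__(self):
        self.memtable = {}   # in-memory; lost on a forced stop (kill -9)
        self.commitlog = []  # on disk; replayed at startup
        self.sstables = {}   # on disk; written by flush

    def write(self, key, value):
        # Every write goes to both the commitlog and the memtable.
        self.commitlog.append((key, value))
        self.memtable[key] = value

    def flush(self):
        # `nodetool flush` persists memtable contents as sstables and
        # lets the corresponding commitlog segments be discarded.
        self.sstables.update(self.memtable)
        self.memtable.clear()
        self.commitlog.clear()

    def forced_stop(self):
        # kill -9: only in-memory state is lost.
        self.memtable.clear()

    def restart(self, replay_commitlog=True):
        # At startup Cassandra replays the commitlog to rebuild memtables.
        if replay_commitlog:
            for key, value in self.commitlog:
                self.memtable[key] = value

    def read(self, key):
        return self.memtable.get(key, self.sstables.get(key))

node = ToyNode()
node.write("k1", "v1")
node.forced_stop()
node.restart(replay_commitlog=True)
print(node.read("k1"))   # "v1": commitlog replay recovered the write

buggy = ToyNode()
buggy.write("k1", "v1")
buggy.forced_stop()
buggy.restart(replay_commitlog=False)  # simulates a broken replay
print(buggy.read("k1"))  # None: the write is lost, as in the failing test
```

In this model an explicit `flush()` before `forced_stop()` preserves the data even when replay is broken, which matches the observation later in the thread that adding `nodetool flush` made the problem disappear.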
>>
>> Good news that 3.0.9 fixes the problem, so it's up to you if you want to
>> investigate further and see if you can narrow it down to file a JIRA
>> (although the first step of that would be trying 2.2.9 to make sure it’s
>> not already fixed there).
>>
>> Cheers
>> Ben
>>
>> On Wed, 9 Nov 2016 at 12:56 Yuji Ito <y...@imagine-orb.com> wrote:
>>
>> I tried C* 3.0.9 instead of 2.2.
>> The data loss problem hasn't happened so far (without `nodetool flush`).
>>
>> Thanks
>>
>> On Fri, Nov 4, 2016 at 3:50 PM, Yuji Ito <y...@imagine-orb.com> wrote:
>>
>> Thanks Ben,
>>
>> When I added `nodetool flush` on all nodes after step 2, the problem
>> didn't happen.
>> Did replay from old commit logs delete rows?
>>
>> Perhaps the flush operation just detected that some nodes were down in
>> step 2 (just after truncating tables).
>> (Insertion and the check in step 2 would succeed if one node was down,
>> because the consistency level was SERIAL.
>> If the flush failed on more than one node, the test would retry step 2.)
>> However, if so, the problem would happen without deleting Cassandra data.
>>
>> Regards,
>> yuji
>>
>> On Mon, Oct 24, 2016 at 8:37 AM, Ben Slater <ben.sla...@instaclustr.com> wrote:
>>
>> Definitely sounds to me like something is not working as expected, but I
>> don’t really have any idea what would cause that (other than the fairly
>> extreme failure scenario). A couple of things I can think of to try to
>> narrow it down:
>> 1) Run nodetool flush on all nodes after step 2 - that will make sure all
>> data is written to sstables rather than relying on commit logs
>> 2) Run the test with consistency level quorum rather than serial
>> (shouldn’t be any different, but quorum is more widely used so maybe there
>> is a bug that’s specific to serial)
>>
>> Cheers
>> Ben
>>
>> On Mon, 24 Oct 2016 at 10:29 Yuji Ito <y...@imagine-orb.com> wrote:
>>
>> Hi Ben,
>>
>> The test without killing nodes has been working well without data loss.
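[Editor's note: Ben's suggestion of trying QUORUM rests on quorum intersection. For the replication_factor=3 keyspace used in this test, the arithmetic is a one-liner; a sketch for reference:]

```python
# With replication_factor = 3, a quorum is floor(RF/2) + 1 = 2 replicas,
# so any write quorum and any read quorum overlap in at least one replica.
import math

rf = 3
quorum = math.floor(rf / 2) + 1
print(quorum)                # 2
print(quorum + quorum > rf)  # True: overlap guarantees reads see quorum writes
```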
>> I've repeated my test about 200 times after removing data and
>> rebuilding/repairing.
>>
>> Regards,
>>
>> On Fri, Oct 21, 2016 at 3:14 PM, Yuji Ito <y...@imagine-orb.com> wrote:
>>
>> > Just to confirm, are you saying:
>> > a) after operation 2, you select all and get 1000 rows
>> > b) after operation 3 (which only does updates and reads) you select and
>> > only get 953 rows?
>>
>> That's right!
>>
>> I've started the test without killing nodes.
>> I'll report the result to you next Monday.
>>
>> Thanks
>>
>> On Fri, Oct 21, 2016 at 3:05 PM, Ben Slater <ben.sla...@instaclustr.com> wrote:
>>
>> Just to confirm, are you saying:
>> a) after operation 2, you select all and get 1000 rows
>> b) after operation 3 (which only does updates and reads) you select and
>> only get 953 rows?
>>
>> If so, that would be very unexpected. If you run your tests without
>> killing nodes, do you get the expected (1,000) rows?
>>
>> Cheers
>> Ben
>>
>> On Fri, 21 Oct 2016 at 17:00 Yuji Ito <y...@imagine-orb.com> wrote:
>>
>> > Are you certain your tests don’t generate any overlapping inserts (by
>> > PK)?
>>
>> Yes. Operation 2) also checks the number of rows just after all
>> insertions.
>>
>> On Fri, Oct 21, 2016 at 2:51 PM, Ben Slater <ben.sla...@instaclustr.com> wrote:
>>
>> OK. Are you certain your tests don’t generate any overlapping inserts (by
>> PK)? Cassandra basically treats any inserts with the same primary key as
>> updates (so 1000 insert operations may not necessarily result in 1000 rows
>> in the DB).
>>
>> On Fri, 21 Oct 2016 at 16:30 Yuji Ito <y...@imagine-orb.com> wrote:
>>
>> Thanks Ben,
>>
>> > 1) At what stage did you have (or expect to have) 1000 rows (and have
>> > the mismatch between actual and expected) - at the end of operation (2)
>> > or after operation (3)?
>>
>> After operation 3), at operation 4), which reads all rows by cqlsh with
>> CL.SERIAL.
>>
>> > 2) What replication factor and replication strategy is used by the test
>> > keyspace?
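[Editor's note: Ben's point above — Cassandra inserts are upserts, so same-PK inserts overwrite rather than add rows — can be demonstrated with a plain-Python dict model. Illustrative only; not driver code.]

```python
# N INSERTs with colliding primary keys yield fewer than N rows,
# because an INSERT on an existing PK acts as an update (an "upsert").
# A dict models this: one entry per key, later writes overwrite earlier ones.
import random

random.seed(42)
# 1000 inserts drawn from only 900 possible PKs, so collisions are guaranteed.
inserts = [(random.randrange(900), "value") for _ in range(1000)]

table = {}
for pk, value in inserts:
    table[pk] = value  # same-PK insert overwrites; it does not add a row

print(len(inserts))  # 1000 insert operations
print(len(table))    # fewer rows, because some PKs collided
```

In Yuji's test this is ruled out because operation 2) verifies the row count immediately after the inserts, before any nodes are killed.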
>> > What consistency level is used by your operations?
>>
>> - create keyspace testkeyspace WITH REPLICATION =
>>   {'class':'SimpleStrategy','replication_factor':3};
>> - the consistency level is SERIAL
>>
>> On Fri, Oct 21, 2016 at 12:04 PM, Ben Slater <ben.sla...@instaclustr.com> wrote:
>>
>> A couple of questions:
>> 1) At what stage did you have (or expect to have) 1000 rows (and have the
>> mismatch between actual and expected) - at the end of operation (2) or
>> after operation (3)?
>> 2) What replication factor and replication strategy is used by the test
>> keyspace? What consistency level is used by your operations?
>>
>> Cheers
>> Ben
>>
>> On Fri, 21 Oct 2016 at 13:57 Yuji Ito <y...@imagine-orb.com> wrote:
>>
>> Thanks Ben,
>>
>> I tried to run a rebuild and repair after the failed node rejoined the
>> cluster as a "new" node with -Dcassandra.replace_address_first_boot.
>> The failed node could rejoin and I could read all rows successfully.
>> (Sometimes a repair failed because the node couldn't access another node.
>> If it failed, I retried the repair.)
>>
>> But some rows were lost after the destructive test was repeated (after
>> about 5-6 hours).
>> After the test inserted 1000 rows, there were only 953 rows at the end of
>> the test.
>>
>> My destructive test:
>> - each C* node is killed & restarted at a random interval (within about
>>   5 min) throughout this test
>> 1) truncate all tables
>> 2) insert initial rows (check if all rows are inserted successfully)
>> 3) request a lot of reads/writes to random rows for about 30 min
>> 4) check all rows
>> If operation 1), 2) or 4) fails due to a C* failure, the test retries the
>> operation.
>>
>> Does anyone have a similar problem?
>> What causes the data loss?
>> Does the test need any operation when a C* node is restarted? (Currently,
>> I just restart the C* process.)
>>
>> Regards,
>>
>> ...
>
> [Message clipped]
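[Editor's note: the retry behaviour the destructive test describes for operations 1), 2) and 4) — retry when a request fails because a node happens to be down mid-kill/restart — can be sketched as below. The helper and the `flaky_check_rows` operation are hypothetical names for illustration; the real test drives cqlsh or a Cassandra driver.]

```python
# Sketch of a retry wrapper for test steps that may hit a temporarily
# downed node: run the operation, retrying on failure up to `retries` times.
import time

def with_retry(operation, retries=5, delay=1.0):
    """Run `operation`, retrying on any exception up to `retries` attempts."""
    for attempt in range(retries):
        try:
            return operation()
        except Exception:  # e.g. a coordinator or replica is down
            if attempt == retries - 1:
                raise      # give up after the final attempt
            time.sleep(delay)

# Example: an operation that fails twice (node down), then succeeds.
calls = {"n": 0}

def flaky_check_rows():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("node unavailable")
    return 1000  # expected row count after step 2

rows = with_retry(flaky_check_rows, retries=5, delay=0.0)
print(rows)        # 1000
print(calls["n"])  # 3 attempts in total
```

One design note: retrying step 2) after a partial failure is only safe because Cassandra inserts are idempotent upserts; re-running the same INSERTs cannot create extra rows.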