By “undocumented limitation” I meant that TRUNCATE is mainly used in development and testing, not in production scenarios, so a sufficient fix (and certainly a better-than-nothing fix) might be just to document that if you issue a TRUNCATE while there are still hinted hand-offs pending, the writes replayed by those hand-offs after the truncate will bring the data back to life. Of course, an actual fix would be better.
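In the meantime, clearing stored hints everywhere before truncating should avoid the resurrection. A rough, untested sketch, using the node variables from Yuji's script (and remember that truncatehints drops all hints on an endpoint, not just those for the table being truncated):

  # Drop ALL stored hints on every node so none can replay into the
  # table after the TRUNCATE, then truncate.
  for ip in $node1_ip $node2_ip $node3_ip; do
      nodetool -h $ip truncatehints
  done
  cqlsh $node1_ip -e "truncate table testdb.testtbl;"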
Cheers
Ben

On Sat, 26 Nov 2016 at 21:08 Hiroyuki Yamada <mogwa...@gmail.com> wrote:
> Hi Yuji and Ben,
>
> I tried out the revised script and the same issue occurred for me, too.
> I think it's definitely a bug that should be solved asap.
>
> >Ben
> What do you mean by "an undocumented limitation"?
>
> Thanks,
> Hiro
>
> On Sat, Nov 26, 2016 at 3:13 PM, Ben Slater <ben.sla...@instaclustr.com> wrote:
> > Nice detective work! Seems to me that it's at best an undocumented limitation
> > and potentially could be viewed as a bug - maybe log another JIRA?
> >
> > One note - there is a nodetool truncatehints command that could be used to
> > clear out the hints
> > (http://cassandra.apache.org/doc/latest/tools/nodetool/truncatehints.html?highlight=truncate).
> > However, it seems to clear all hints on a particular endpoint, not just those
> > for a specific table.
> >
> > Cheers
> > Ben
> >
> > On Fri, 25 Nov 2016 at 17:42 Yuji Ito <y...@imagine-orb.com> wrote:
> >>
> >> Hi all,
> >>
> >> I revised the script to reproduce the issue.
> >> I think the issue happens more frequently than before.
> >> Killing another node has been added to the previous script.
> >>
> >> ==== [script] ====
> >> #!/bin/sh
> >>
> >> node1_ip=<node1 IP address>
> >> node2_ip=<node2 IP address>
> >> node3_ip=<node3 IP address>
> >> node2_user=<user name>
> >> node3_user=<user name>
> >> rows=10000
> >>
> >> echo "consistency quorum;" > init_data.cql
> >> for key in $(seq 0 $(expr $rows - 1))
> >> do
> >>     echo "insert into testdb.testtbl (key, val) values($key, 1111) IF NOT EXISTS;" >> init_data.cql
> >> done
> >>
> >> while true
> >> do
> >>     echo "truncate the table"
> >>     cqlsh $node1_ip -e "truncate table testdb.testtbl" > /dev/null 2>&1
> >>     if [ $? -ne 0 ]; then
> >>         echo "truncating failed"
> >>         continue
> >>     else
> >>         break
> >>     fi
> >> done
> >>
> >> echo "kill C* process on node3"
> >> pdsh -l $node3_user -R ssh -w $node3_ip "ps auxww | grep CassandraDaemon | awk '{if (\$13 ~ /cassand/) print \$2}' | xargs sudo kill -9"
> >>
> >> echo "insert $rows rows"
> >> cqlsh $node1_ip -f init_data.cql > insert_log 2>&1
> >>
> >> echo "restart C* process on node3"
> >> pdsh -l $node3_user -R ssh -w $node3_ip "sudo /etc/init.d/cassandra start"
> >>
> >> while true
> >> do
> >>     echo "truncate the table again"
> >>     cqlsh $node1_ip -e "truncate table testdb.testtbl"
> >>     if [ $? -ne 0 ]; then
> >>         echo "truncating failed"
> >>         continue
> >>     else
> >>         echo "truncation succeeded!"
> >>         break
> >>     fi
> >> done
> >>
> >> echo "kill C* process on node2"
> >> pdsh -l $node2_user -R ssh -w $node2_ip "ps auxww | grep CassandraDaemon | awk '{if (\$13 ~ /cassand/) print \$2}' | xargs sudo kill -9"
> >>
> >> cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select count(*) from testdb.testtbl;"
> >> sleep 10
> >> cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select count(*) from testdb.testtbl;"
> >>
> >> echo "restart C* process on node2"
> >> pdsh -l $node2_user -R ssh -w $node2_ip "sudo /etc/init.d/cassandra start"
> >>
> >> Thanks,
> >> yuji
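(One extra data point that might help confirm the mechanism: on 2.2, stored hints live in the system.hints table, so a non-zero count on the coordinator while node3 is down should show exactly what later replays into the truncated table. Something like this, run against node1 before node3 is restarted - untested, and it assumes hints are stored on the coordinator as in 2.x:

  cqlsh $node1_ip -e "select count(*) from system.hints;"
)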
> >> On Fri, Nov 18, 2016 at 7:52 PM, Yuji Ito <y...@imagine-orb.com> wrote:
> >>> I investigated the source code and the logs of the killed node.
> >>> I guess that unexpected writes are executed while truncation is being executed.
> >>>
> >>> Some writes were executed after the first flush in the truncation, and these writes could be read.
> >>> These writes were requested as MUTATION by another node for hinted handoff.
> >>> Their data was stored in a new memtable and flushed (the second flush) to a new SSTable before the snapshot in the truncation.
> >>> So the truncation discarded only the old SSTables, not the new one.
> >>> That's because the ReplayPosition used for discarding SSTables was that of the first flush.
> >>>
> >>> I copied some parts of the log below.
> >>> Lines starting with "##" are my comments.
> >>> The point is that the ReplayPosition is moved forward by the second flush.
> >>> That means some writes were executed after the first flush.
> >>>
> >>> == log ==
> >>> ## started truncation
> >>> TRACE [SharedPool-Worker-16] 2016-11-17 08:36:04,612 ColumnFamilyStore.java:2790 - truncating testtbl
> >>> ## the first flush, at the start of truncation
> >>> DEBUG [SharedPool-Worker-16] 2016-11-17 08:36:04,612 ColumnFamilyStore.java:952 - Enqueuing flush of testtbl: 591360 (0%) on-heap, 0 (0%) off-heap
> >>> INFO [MemtableFlushWriter:1] 2016-11-17 08:36:04,613 Memtable.java:352 - Writing Memtable-testtbl@1863835308(42.625KiB serialized bytes, 2816 ops, 0%/0% of on/off-heap limit)
> >>> ...
> >>> DEBUG [MemtableFlushWriter:1] 2016-11-17 08:36:04,973 Memtable.java:386 - Completed flushing /var/lib/cassandra/data/testdb/testtbl-562848f0a55611e68b1451065d58fdfb/tmp-lb-1-big-Data.db (17.651KiB) for commitlog position ReplayPosition(segmentId=1479371760395, position=315867)
> >>> ## this ReplayPosition was used for discarding SSTables
> >>> ...
> >>> TRACE [MemtablePostFlush:1] 2016-11-17 08:36:05,022 CommitLog.java:298 - discard completed log segments for ReplayPosition(segmentId=1479371760395, position=315867), table 562848f0-a556-11e6-8b14-51065d58fdfb
> >>> ## end of the first flush
> >>> DEBUG [SharedPool-Worker-16] 2016-11-17 08:36:05,028 ColumnFamilyStore.java:2823 - Discarding sstable data for truncated CF + indexes
> >>> ## the second flush, before the snapshot
> >>> DEBUG [SharedPool-Worker-16] 2016-11-17 08:36:05,028 ColumnFamilyStore.java:952 - Enqueuing flush of testtbl: 698880 (0%) on-heap, 0 (0%) off-heap
> >>> INFO [MemtableFlushWriter:2] 2016-11-17 08:36:05,029 Memtable.java:352 - Writing Memtable-testtbl@1186728207(50.375KiB serialized bytes, 3328 ops, 0%/0% of on/off-heap limit)
> >>> ...
> >>> DEBUG [MemtableFlushWriter:2] 2016-11-17 08:36:05,258 Memtable.java:386 - Completed flushing /var/lib/cassandra/data/testdb/testtbl-562848f0a55611e68b1451065d58fdfb/tmp-lb-2-big-Data.db (17.696KiB) for commitlog position ReplayPosition(segmentId=1479371760395, position=486627)
> >>> ...
> >>> TRACE [MemtablePostFlush:1] 2016-11-17 08:36:05,289 CommitLog.java:298 - discard completed log segments for ReplayPosition(segmentId=1479371760395, position=486627), table 562848f0-a556-11e6-8b14-51065d58fdfb
> >>> ## end of the second flush: the position was moved forward
> >>> ...
> >>> ## only the old SSTable was deleted, because only it was older than ReplayPosition(segmentId=1479371760395, position=315867)
> >>> TRACE [NonPeriodicTasks:1] 2016-11-17 08:36:05,303 SSTable.java:118 - Deleted /var/lib/cassandra/data/testdb/testtbl-562848f0a55611e68b1451065d58fdfb/lb-1-big
> >>> ...
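(Yuji's decoding checks out. Reading the blob as big-endian fields - 8-byte segment id, 4-byte position, 8-byte truncation timestamp in milliseconds, which is my guess at the layout based on his numbers - gives the same values, e.g. in bash:

  printf '%d\n' 0x00000158716da30b   # segmentId    -> 1479371760395
  printf '%d\n' 0x0004d1db           # position     -> 315867
  printf '%d\n' 0x00000158716db524   # truncated_at -> 1479371765028, i.e. 08:36:05,028

So the recorded truncation point really is the first flush's position (315867), even though the second flush had already advanced the commitlog position to 486627.)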
> >>> TRACE [SharedPool-Worker-16] 2016-11-17 08:36:05,320 ColumnFamilyStore.java:2841 - truncate complete
> >>> TRACE [SharedPool-Worker-16] 2016-11-17 08:36:05,320 TruncateVerbHandler.java:53 - Truncation(keyspace='testdb', cf='testtbl') applied. Enqueuing response to 36512@/10.91.145.7
> >>> TRACE [SharedPool-Worker-16] 2016-11-17 08:36:05,320 MessagingService.java:728 - /10.91.145.27 sending REQUEST_RESPONSE to 36512@/10.91.145.7
> >>> ## end of truncation
> >>> ====
> >>>
> >>> Actually, "truncated_at" for the table in system.local after running the script was 0x00000158716da30b0004d1db00000158716db524.
> >>> That means segmentId=1479371760395, position=315867, and
> >>> truncated_at=1479371765028 (2016-11-17 08:36:05,028).
> >>>
> >>> thanks,
> >>> yuji
> >>>
> >>> On Wed, Nov 16, 2016 at 5:25 PM, Yuji Ito <y...@imagine-orb.com> wrote:
> >>>> Hi,
> >>>>
> >>>> I could find stale data after truncating a table.
> >>>> It seems that truncation starts while recovery is still being executed, just after a node restarts.
> >>>> Does recovery still continue after the truncation finishes?
> >>>> Is that expected?
> >>>>
> >>>> I use C* 2.2.8 and can reproduce it as below.
> >>>>
> >>>> ==== [create table] ====
> >>>> cqlsh $ip -e "drop keyspace testdb;"
> >>>> cqlsh $ip -e "CREATE KEYSPACE testdb WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'};"
> >>>> cqlsh $ip -e "CREATE TABLE testdb.testtbl (key int PRIMARY KEY, val int);"
> >>>>
> >>>> ==== [script] ====
> >>>> #!/bin/sh
> >>>>
> >>>> node1_ip=<node1 IP address>
> >>>> node2_ip=<node2 IP address>
> >>>> node3_ip=<node3 IP address>
> >>>> node3_user=<user name>
> >>>> rows=10000
> >>>>
> >>>> echo "consistency quorum;" > init_data.cql
> >>>> for key in $(seq 0 $(expr $rows - 1))
> >>>> do
> >>>>     echo "insert into testdb.testtbl (key, val) values($key, 1111) IF NOT EXISTS;" >> init_data.cql
> >>>> done
> >>>>
> >>>> while true
> >>>> do
> >>>>     echo "truncate the table"
> >>>>     cqlsh $node1_ip -e "truncate table testdb.testtbl"
> >>>>     if [ $? -ne 0 ]; then
> >>>>         echo "truncating failed"
> >>>>         continue
> >>>>     else
> >>>>         break
> >>>>     fi
> >>>> done
> >>>>
> >>>> echo "kill C* process on node3"
> >>>> pdsh -l $node3_user -R ssh -w $node3_ip "ps auxww | grep CassandraDaemon | awk '{if (\$13 ~ /cassand/) print \$2}' | xargs sudo kill -9"
> >>>>
> >>>> echo "insert $rows rows"
> >>>> cqlsh $node1_ip -f init_data.cql > insert_log 2>&1
> >>>>
> >>>> echo "restart C* process on node3"
> >>>> pdsh -l $node3_user -R ssh -w $node3_ip "sudo /etc/init.d/cassandra start"
> >>>>
> >>>> while true
> >>>> do
> >>>>     echo "truncate the table again"
> >>>>     cqlsh $node1_ip -e "truncate table testdb.testtbl"
> >>>>     if [ $? -ne 0 ]; then
> >>>>         echo "truncating failed"
> >>>>         continue
> >>>>     else
> >>>>         break
> >>>>     fi
> >>>> done
> >>>>
> >>>> cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select count(*) from testdb.testtbl;"
> >>>> sleep 10
> >>>> cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select count(*) from testdb.testtbl;"
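(A small aside on the script: the ps/grep/awk pipeline assumes the class name lands in field $13 of the ps output, which depends on the JVM arguments. Where pgrep is available on the target hosts, this sketch should do the same thing more robustly:

  pdsh -l $node3_user -R ssh -w $node3_ip "pgrep -f CassandraDaemon | xargs sudo kill -9"
)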
> >>>>
> >>>> ==== [result] ====
> >>>> truncate the table
> >>>> kill C* process on node3
> >>>> insert 10000 rows
> >>>> restart C* process on node3
> >>>> 10.91.145.27: Starting Cassandra: OK
> >>>> truncate the table again
> >>>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency level ALL
> >>>> truncating failed
> >>>> truncate the table again
> >>>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency level ALL
> >>>> truncating failed
> >>>> truncate the table again
> >>>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency level ALL
> >>>> truncating failed
> >>>> truncate the table again
> >>>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency level ALL
> >>>> truncating failed
> >>>> truncate the table again
> >>>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency level ALL
> >>>> truncating failed
> >>>> truncate the table again
> >>>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency level ALL
> >>>> truncating failed
> >>>> truncate the table again
> >>>> Consistency level set to SERIAL.
> >>>>
> >>>>  count
> >>>> -------
> >>>>    300
> >>>>
> >>>> (1 rows)
> >>>>
> >>>> Warnings :
> >>>> Aggregation query used without partition key
> >>>>
> >>>> Consistency level set to SERIAL.
> >>>>
> >>>>  count
> >>>> -------
> >>>>   2304
> >>>>
> >>>> (1 rows)
> >>>>
> >>>> Warnings :
> >>>> Aggregation query used without partition key
> >>>> ====
> >>>>
> >>>> I found it while I was investigating the data loss problem (ref. the "failure node rejoin" thread).
> >>>> I'm not sure whether this problem is related to that data loss.
> >>>>
> >>>> Thanks,
> >>>> yuji
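(For what it's worth, the jump from 300 to 2304 between the two SERIAL counts taken ten seconds apart looks like hints still replaying after the truncate. Extending the script's final reads into a loop - an untested sketch - would make the resurrection easy to watch:

  for i in $(seq 1 5); do
      cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select count(*) from testdb.testtbl;"
      sleep 5
  done
)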