By “undocumented limitation” I meant that TRUNCATE is mainly used in development and testing, not in production scenarios, so a sufficient fix (and certainly a better-than-nothing fix) might be just to document that if you issue a TRUNCATE while there are still hinted hand-offs pending, the writes replayed by those hand-offs after the truncate will bring the data back to life. Of course, an actual fix would be better.
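In the meantime, clearing stored hints everywhere before truncating should avoid the resurrection. A rough, untested sketch, using the node variables from Yuji's script (and remember that truncatehints drops all hints on an endpoint, not just those for the table being truncated):

  # Drop ALL stored hints on every node so none can replay into the
  # table after the TRUNCATE, then truncate.
  for ip in $node1_ip $node2_ip $node3_ip; do
      nodetool -h $ip truncatehints
  done
  cqlsh $node1_ip -e "truncate table testdb.testtbl;"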
Cheers
Ben

On Sat, 26 Nov 2016 at 21:08 Hiroyuki Yamada <mogwa...@gmail.com> wrote:
> Hi Yuji and Ben,
>
> I tried out the revised script and the same issue occurred for me, too.
> I think it's definitely a bug that should be solved asap.
>
> >Ben
> What do you mean by "an undocumented limitation"?
>
> Thanks,
> Hiro
>
> On Sat, Nov 26, 2016 at 3:13 PM, Ben Slater <ben.sla...@instaclustr.com> wrote:
> > Nice detective work! Seems to me that it's at best an undocumented limitation
> > and potentially could be viewed as a bug - maybe log another JIRA?
> >
> > One note - there is a nodetool truncatehints command that could be used to
> > clear out the hints
> > (http://cassandra.apache.org/doc/latest/tools/nodetool/truncatehints.html?highlight=truncate).
> > However, it seems to clear all hints on a particular endpoint, not just those
> > for a specific table.
> >
> > Cheers
> > Ben
> >
> > On Fri, 25 Nov 2016 at 17:42 Yuji Ito <y...@imagine-orb.com> wrote:
> >>
> >> Hi all,
> >>
> >> I revised the script to reproduce the issue.
> >> I think the issue happens more frequently than before.
> >> Killing another node has been added to the previous script.
> >>
> >> ==== [script] ====
> >> #!/bin/sh
> >>
> >> node1_ip=<node1 IP address>
> >> node2_ip=<node2 IP address>
> >> node3_ip=<node3 IP address>
> >> node2_user=<user name>
> >> node3_user=<user name>
> >> rows=10000
> >>
> >> echo "consistency quorum;" > init_data.cql
> >> for key in $(seq 0 $(expr $rows - 1))
> >> do
> >>     echo "insert into testdb.testtbl (key, val) values($key, 1111) IF NOT EXISTS;" >> init_data.cql
> >> done
> >>
> >> while true
> >> do
> >>     echo "truncate the table"
> >>     cqlsh $node1_ip -e "truncate table testdb.testtbl" > /dev/null 2>&1
> >>     if [ $? -ne 0 ]; then
> >>         echo "truncating failed"
> >>         continue
> >>     else
> >>         break
> >>     fi
> >> done
> >>
> >> echo "kill C* process on node3"
> >> pdsh -l $node3_user -R ssh -w $node3_ip "ps auxww | grep CassandraDaemon | awk '{if (\$13 ~ /cassand/) print \$2}' | xargs sudo kill -9"
> >>
> >> echo "insert $rows rows"
> >> cqlsh $node1_ip -f init_data.cql > insert_log 2>&1
> >>
> >> echo "restart C* process on node3"
> >> pdsh -l $node3_user -R ssh -w $node3_ip "sudo /etc/init.d/cassandra start"
> >>
> >> while true
> >> do
> >>     echo "truncate the table again"
> >>     cqlsh $node1_ip -e "truncate table testdb.testtbl"
> >>     if [ $? -ne 0 ]; then
> >>         echo "truncating failed"
> >>         continue
> >>     else
> >>         echo "truncation succeeded!"
> >>         break
> >>     fi
> >> done
> >>
> >> echo "kill C* process on node2"
> >> pdsh -l $node2_user -R ssh -w $node2_ip "ps auxww | grep CassandraDaemon | awk '{if (\$13 ~ /cassand/) print \$2}' | xargs sudo kill -9"
> >>
> >> cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select count(*) from testdb.testtbl;"
> >> sleep 10
> >> cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select count(*) from testdb.testtbl;"
> >>
> >> echo "restart C* process on node2"
> >> pdsh -l $node2_user -R ssh -w $node2_ip "sudo /etc/init.d/cassandra start"
> >>
> >> Thanks,
> >> yuji
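(One extra data point that might help confirm the mechanism: on 2.2, stored hints live in the system.hints table, so a non-zero count on the coordinator while node3 is down should show exactly what later replays into the truncated table. Something like this, run against node1 before node3 is restarted - untested, and it assumes hints are stored on the coordinator as in 2.x:

  cqlsh $node1_ip -e "select count(*) from system.hints;"
)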
> >> On Fri, Nov 18, 2016 at 7:52 PM, Yuji Ito <y...@imagine-orb.com> wrote:
> >>> I investigated the source code and the logs of the killed node.
> >>> I guess that unexpected writes are executed while truncation is being executed.
> >>>
> >>> Some writes were executed after the first flush in the truncation, and these writes could be read.
> >>> These writes were requested as MUTATION by another node for hinted handoff.
> >>> Their data was stored in a new memtable and flushed (the second flush) to a new SSTable before the snapshot in the truncation.
> >>> So the truncation discarded only the old SSTables, not the new one.
> >>> That's because the ReplayPosition used for discarding SSTables was that of the first flush.
> >>>
> >>> I copied some parts of the log below.
> >>> Lines starting with "##" are my comments.
> >>> The point is that the ReplayPosition is moved forward by the second flush.
> >>> That means some writes were executed after the first flush.
> >>>
> >>> == log ==
> >>> ## started truncation
> >>> TRACE [SharedPool-Worker-16] 2016-11-17 08:36:04,612 ColumnFamilyStore.java:2790 - truncating testtbl
> >>> ## the first flush, at the start of truncation
> >>> DEBUG [SharedPool-Worker-16] 2016-11-17 08:36:04,612 ColumnFamilyStore.java:952 - Enqueuing flush of testtbl: 591360 (0%) on-heap, 0 (0%) off-heap
> >>> INFO [MemtableFlushWriter:1] 2016-11-17 08:36:04,613 Memtable.java:352 - Writing Memtable-testtbl@1863835308(42.625KiB serialized bytes, 2816 ops, 0%/0% of on/off-heap limit)
> >>> ...
> >>> DEBUG [MemtableFlushWriter:1] 2016-11-17 08:36:04,973 Memtable.java:386 - Completed flushing /var/lib/cassandra/data/testdb/testtbl-562848f0a55611e68b1451065d58fdfb/tmp-lb-1-big-Data.db (17.651KiB) for commitlog position ReplayPosition(segmentId=1479371760395, position=315867)
> >>> ## this ReplayPosition was used for discarding SSTables
> >>> ...
> >>> TRACE [MemtablePostFlush:1] 2016-11-17 08:36:05,022 CommitLog.java:298 - discard completed log segments for ReplayPosition(segmentId=1479371760395, position=315867), table 562848f0-a556-11e6-8b14-51065d58fdfb
> >>> ## end of the first flush
> >>> DEBUG [SharedPool-Worker-16] 2016-11-17 08:36:05,028 ColumnFamilyStore.java:2823 - Discarding sstable data for truncated CF + indexes
> >>> ## the second flush, before the snapshot
> >>> DEBUG [SharedPool-Worker-16] 2016-11-17 08:36:05,028 ColumnFamilyStore.java:952 - Enqueuing flush of testtbl: 698880 (0%) on-heap, 0 (0%) off-heap
> >>> INFO [MemtableFlushWriter:2] 2016-11-17 08:36:05,029 Memtable.java:352 - Writing Memtable-testtbl@1186728207(50.375KiB serialized bytes, 3328 ops, 0%/0% of on/off-heap limit)
> >>> ...
> >>> DEBUG [MemtableFlushWriter:2] 2016-11-17 08:36:05,258 Memtable.java:386 - Completed flushing /var/lib/cassandra/data/testdb/testtbl-562848f0a55611e68b1451065d58fdfb/tmp-lb-2-big-Data.db (17.696KiB) for commitlog position ReplayPosition(segmentId=1479371760395, position=486627)
> >>> ...
> >>> TRACE [MemtablePostFlush:1] 2016-11-17 08:36:05,289 CommitLog.java:298 - discard completed log segments for ReplayPosition(segmentId=1479371760395, position=486627), table 562848f0-a556-11e6-8b14-51065d58fdfb
> >>> ## end of the second flush: the position was moved forward
> >>> ...
> >>> ## only the old SSTable was deleted, because only it was older than ReplayPosition(segmentId=1479371760395, position=315867)
> >>> TRACE [NonPeriodicTasks:1] 2016-11-17 08:36:05,303 SSTable.java:118 - Deleted /var/lib/cassandra/data/testdb/testtbl-562848f0a55611e68b1451065d58fdfb/lb-1-big
> >>> ...
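(Yuji's decoding checks out. Reading the blob as big-endian fields - 8-byte segment id, 4-byte position, 8-byte truncation timestamp in milliseconds, which is my guess at the layout based on his numbers - gives the same values, e.g. in bash:

  printf '%d\n' 0x00000158716da30b   # segmentId    -> 1479371760395
  printf '%d\n' 0x0004d1db           # position     -> 315867
  printf '%d\n' 0x00000158716db524   # truncated_at -> 1479371765028, i.e. 08:36:05,028

So the recorded truncation point really is the first flush's position (315867), even though the second flush had already advanced the commitlog position to 486627.)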
> >>> TRACE [SharedPool-Worker-16] 2016-11-17 08:36:05,320 ColumnFamilyStore.java:2841 - truncate complete
> >>> TRACE [SharedPool-Worker-16] 2016-11-17 08:36:05,320 TruncateVerbHandler.java:53 - Truncation(keyspace='testdb', cf='testtbl') applied. Enqueuing response to 36512@/10.91.145.7
> >>> TRACE [SharedPool-Worker-16] 2016-11-17 08:36:05,320 MessagingService.java:728 - /10.91.145.27 sending REQUEST_RESPONSE to 36512@/10.91.145.7
> >>> ## end of truncation
> >>> ====
> >>>
> >>> Actually, "truncated_at" for the table in system.local after running the script was 0x00000158716da30b0004d1db00000158716db524.
> >>> That means segmentId=1479371760395, position=315867, and
> >>> truncated_at=1479371765028 (2016-11-17 08:36:05,028).
> >>>
> >>> thanks,
> >>> yuji
> >>>
> >>> On Wed, Nov 16, 2016 at 5:25 PM, Yuji Ito <y...@imagine-orb.com> wrote:
> >>>> Hi,
> >>>>
> >>>> I could find stale data after truncating a table.
> >>>> It seems that truncation starts while recovery is still being executed, just after a node restarts.
> >>>> Does recovery still continue after the truncation finishes?
> >>>> Is that expected?
> >>>>
> >>>> I use C* 2.2.8 and can reproduce it as below.
> >>>>
> >>>> ==== [create table] ====
> >>>> cqlsh $ip -e "drop keyspace testdb;"
> >>>> cqlsh $ip -e "CREATE KEYSPACE testdb WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'};"
> >>>> cqlsh $ip -e "CREATE TABLE testdb.testtbl (key int PRIMARY KEY, val int);"
> >>>>
> >>>> ==== [script] ====
> >>>> #!/bin/sh
> >>>>
> >>>> node1_ip=<node1 IP address>
> >>>> node2_ip=<node2 IP address>
> >>>> node3_ip=<node3 IP address>
> >>>> node3_user=<user name>
> >>>> rows=10000
> >>>>
> >>>> echo "consistency quorum;" > init_data.cql
> >>>> for key in $(seq 0 $(expr $rows - 1))
> >>>> do
> >>>>     echo "insert into testdb.testtbl (key, val) values($key, 1111) IF NOT EXISTS;" >> init_data.cql
> >>>> done
> >>>>
> >>>> while true
> >>>> do
> >>>>     echo "truncate the table"
> >>>>     cqlsh $node1_ip -e "truncate table testdb.testtbl"
> >>>>     if [ $? -ne 0 ]; then
> >>>>         echo "truncating failed"
> >>>>         continue
> >>>>     else
> >>>>         break
> >>>>     fi
> >>>> done
> >>>>
> >>>> echo "kill C* process on node3"
> >>>> pdsh -l $node3_user -R ssh -w $node3_ip "ps auxww | grep CassandraDaemon | awk '{if (\$13 ~ /cassand/) print \$2}' | xargs sudo kill -9"
> >>>>
> >>>> echo "insert $rows rows"
> >>>> cqlsh $node1_ip -f init_data.cql > insert_log 2>&1
> >>>>
> >>>> echo "restart C* process on node3"
> >>>> pdsh -l $node3_user -R ssh -w $node3_ip "sudo /etc/init.d/cassandra start"
> >>>>
> >>>> while true
> >>>> do
> >>>>     echo "truncate the table again"
> >>>>     cqlsh $node1_ip -e "truncate table testdb.testtbl"
> >>>>     if [ $? -ne 0 ]; then
> >>>>         echo "truncating failed"
> >>>>         continue
> >>>>     else
> >>>>         break
> >>>>     fi
> >>>> done
> >>>>
> >>>> cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select count(*) from testdb.testtbl;"
> >>>> sleep 10
> >>>> cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select count(*) from testdb.testtbl;"
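(A small aside on the script: the ps/grep/awk pipeline assumes the class name lands in field $13 of the ps output, which depends on the JVM arguments. Where pgrep is available on the target hosts, this sketch should do the same thing more robustly:

  pdsh -l $node3_user -R ssh -w $node3_ip "pgrep -f CassandraDaemon | xargs sudo kill -9"
)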
> >>>>
> >>>> ==== [result] ====
> >>>> truncate the table
> >>>> kill C* process on node3
> >>>> insert 10000 rows
> >>>> restart C* process on node3
> >>>> 10.91.145.27: Starting Cassandra: OK
> >>>> truncate the table again
> >>>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency level ALL
> >>>> truncating failed
> >>>> truncate the table again
> >>>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency level ALL
> >>>> truncating failed
> >>>> truncate the table again
> >>>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency level ALL
> >>>> truncating failed
> >>>> truncate the table again
> >>>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency level ALL
> >>>> truncating failed
> >>>> truncate the table again
> >>>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency level ALL
> >>>> truncating failed
> >>>> truncate the table again
> >>>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency level ALL
> >>>> truncating failed
> >>>> truncate the table again
> >>>> Consistency level set to SERIAL.
> >>>>
> >>>>  count
> >>>> -------
> >>>>    300
> >>>>
> >>>> (1 rows)
> >>>>
> >>>> Warnings :
> >>>> Aggregation query used without partition key
> >>>>
> >>>> Consistency level set to SERIAL.
> >>>>
> >>>>  count
> >>>> -------
> >>>>   2304
> >>>>
> >>>> (1 rows)
> >>>>
> >>>> Warnings :
> >>>> Aggregation query used without partition key
> >>>> ====
> >>>>
> >>>> I found it while I was investigating the data loss problem (ref. the "failure node rejoin" thread).
> >>>> I'm not sure whether this problem is related to that data loss.
> >>>>
> >>>> Thanks,
> >>>> yuji
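(For what it's worth, the jump from 300 to 2304 between the two SERIAL counts taken ten seconds apart looks like hints still replaying after the truncate. Extending the script's final reads into a loop - an untested sketch - would make the resurrection easy to watch:

  for i in $(seq 1 5); do
      cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select count(*) from testdb.testtbl;"
      sleep 5
  done
)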