Hi all,

I revised the script for reproducing the issue; it now seems to trigger the problem more frequently than before. The difference from the previous script is that it additionally kills another node (node2).
==== [script] ====
#!/bin/sh

node1_ip=<node1 IP address>
node2_ip=<node2 IP address>
node3_ip=<node3 IP address>
node2_user=<user name>
node3_user=<user name>
rows=10000

echo "consistency quorum;" > init_data.cql
for key in $(seq 0 $(expr $rows - 1))
do
    echo "insert into testdb.testtbl (key, val) values($key, 1111) IF NOT EXISTS;" >> init_data.cql
done

while true
do
    echo "truncate the table"
    cqlsh $node1_ip -e "truncate table testdb.testtbl" > /dev/null 2>&1
    if [ $? -ne 0 ]; then
        echo "truncating failed"
        continue
    else
        break
    fi
done

echo "kill C* process on node3"
pdsh -l $node3_user -R ssh -w $node3_ip "ps auxww | grep CassandraDaemon | awk '{if (\$13 ~ /cassand/) print \$2}' | xargs sudo kill -9"

echo "insert $rows rows"
cqlsh $node1_ip -f init_data.cql > insert_log 2>&1

echo "restart C* process on node3"
pdsh -l $node3_user -R ssh -w $node3_ip "sudo /etc/init.d/cassandra start"

while true
do
    echo "truncate the table again"
    cqlsh $node1_ip -e "truncate table testdb.testtbl"
    if [ $? -ne 0 ]; then
        echo "truncating failed"
        continue
    else
        echo "truncation succeeded!"
        break
    fi
done

echo "kill C* process on node2"
pdsh -l $node2_user -R ssh -w $node2_ip "ps auxww | grep CassandraDaemon | awk '{if (\$13 ~ /cassand/) print \$2}' | xargs sudo kill -9"

cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select count(*) from testdb.testtbl;"
sleep 10
cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select count(*) from testdb.testtbl;"

echo "restart C* process on node2"
pdsh -l $node2_user -R ssh -w $node2_ip "sudo /etc/init.d/cassandra start"
====

Thanks,
yuji


On Fri, Nov 18, 2016 at 7:52 PM, Yuji Ito <y...@imagine-orb.com> wrote:

> I investigated the source code and the logs of the killed node.
> My guess is that unexpected writes are applied while truncation is being
> executed.
>
> Some writes were applied after the first flush of the truncation, and
> those writes could be read afterwards.
> The writes arrived as MUTATION requests from another node performing
> hinted handoff. Their data was stored in a new memtable and flushed (a
> second flush) to a new SSTable before the snapshot taken during
> truncation.
> As a result, the truncation discarded only the old SSTables, not the new
> one: the ReplayPosition used for discarding SSTables was the one recorded
> at the first flush.
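> To make the sequence concrete, here is a simplified, self-contained model
> of the race as I understand it. The names (ReplayPosition, SSTable, flush)
> mirror the real ones, but this is an illustrative sketch, not the actual
> ColumnFamilyStore code:
>
> import java.util.ArrayList;
> import java.util.List;
>
> public class TruncateRaceSketch {
>     // Commit log coordinate: ordered by segment, then by offset.
>     static class ReplayPosition {
>         final long segmentId;
>         final int position;
>         ReplayPosition(long s, int p) { segmentId = s; position = p; }
>         boolean isAfter(ReplayPosition o) {
>             return segmentId > o.segmentId
>                 || (segmentId == o.segmentId && position > o.position);
>         }
>         public String toString() {
>             return "ReplayPosition(segmentId=" + segmentId + ", position=" + position + ")";
>         }
>     }
>
>     static class SSTable {
>         final String name;
>         final ReplayPosition replayPosition;  // position stamped at flush time
>         SSTable(String n, ReplayPosition rp) { name = n; replayPosition = rp; }
>     }
>
>     static final List<SSTable> sstables = new ArrayList<>();
>     static int commitLogOffset = 0;
>     static int flushes = 0;
>
>     // A write appends to the commit log; a flush stamps the new SSTable
>     // with the commit log position at that moment.
>     static void write(int ops) { commitLogOffset += ops * 112; }
>     static ReplayPosition flush() {
>         ReplayPosition rp = new ReplayPosition(1479371760395L, commitLogOffset);
>         sstables.add(new SSTable("lb-" + (++flushes) + "-big", rp));
>         return rp;
>     }
>
>     public static void main(String[] args) {
>         write(2816);                           // rows written before truncate
>         ReplayPosition truncatedAt = flush();  // 1st flush: truncate records this position
>         write(3328);                           // RACE: hinted-handoff mutations land here
>         flush();                               // 2nd flush (before snapshot) -> lb-2-big
>         // Discard keeps every SSTable stamped after the recorded position,
>         // so lb-2-big and its rows survive the truncate.
>         sstables.removeIf(t -> !t.replayPosition.isAfter(truncatedAt));
>         sstables.forEach(t -> System.out.println("survived: " + t.name + " " + t.replayPosition));
>     }
> }
>
> Running this prints only "survived: lb-2-big ...", which matches what the
> actual log shows below.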
> I copied some parts of the log below. The "##" lines are my comments.
> The point is that the ReplayPosition is moved forward by the second
> flush, which means some writes were applied after the first flush.
>
> == log ==
> ## truncation started
> TRACE [SharedPool-Worker-16] 2016-11-17 08:36:04,612 ColumnFamilyStore.java:2790 - truncating testtbl
> ## the first flush, at the start of truncation
> DEBUG [SharedPool-Worker-16] 2016-11-17 08:36:04,612 ColumnFamilyStore.java:952 - Enqueuing flush of testtbl: 591360 (0%) on-heap, 0 (0%) off-heap
> INFO [MemtableFlushWriter:1] 2016-11-17 08:36:04,613 Memtable.java:352 - Writing Memtable-testtbl@1863835308(42.625KiB serialized bytes, 2816 ops, 0%/0% of on/off-heap limit)
> ...
> DEBUG [MemtableFlushWriter:1] 2016-11-17 08:36:04,973 Memtable.java:386 - Completed flushing /var/lib/cassandra/data/testdb/testtbl-562848f0a55611e68b1451065d58fdfb/tmp-lb-1-big-Data.db (17.651KiB) for commitlog position ReplayPosition(segmentId=1479371760395, position=315867)
> ## this ReplayPosition was used for discarding SSTables
> ...
> TRACE [MemtablePostFlush:1] 2016-11-17 08:36:05,022 CommitLog.java:298 - discard completed log segments for ReplayPosition(segmentId=1479371760395, position=315867), table 562848f0-a556-11e6-8b14-51065d58fdfb
> ## end of the first flush
> DEBUG [SharedPool-Worker-16] 2016-11-17 08:36:05,028 ColumnFamilyStore.java:2823 - Discarding sstable data for truncated CF + indexes
> ## the second flush, before the snapshot
> DEBUG [SharedPool-Worker-16] 2016-11-17 08:36:05,028 ColumnFamilyStore.java:952 - Enqueuing flush of testtbl: 698880 (0%) on-heap, 0 (0%) off-heap
> INFO [MemtableFlushWriter:2] 2016-11-17 08:36:05,029 Memtable.java:352 - Writing Memtable-testtbl@1186728207(50.375KiB serialized bytes, 3328 ops, 0%/0% of on/off-heap limit)
> ...
> DEBUG [MemtableFlushWriter:2] 2016-11-17 08:36:05,258 Memtable.java:386 - Completed flushing /var/lib/cassandra/data/testdb/testtbl-562848f0a55611e68b1451065d58fdfb/tmp-lb-2-big-Data.db (17.696KiB) for commitlog position ReplayPosition(segmentId=1479371760395, position=486627)
> ...
> TRACE [MemtablePostFlush:1] 2016-11-17 08:36:05,289 CommitLog.java:298 - discard completed log segments for ReplayPosition(segmentId=1479371760395, position=486627), table 562848f0-a556-11e6-8b14-51065d58fdfb
> ## end of the second flush: the position has moved forward
> ...
> ## only the old SSTable was deleted, because it alone was older than ReplayPosition(segmentId=1479371760395, position=315867)
> TRACE [NonPeriodicTasks:1] 2016-11-17 08:36:05,303 SSTable.java:118 - Deleted /var/lib/cassandra/data/testdb/testtbl-562848f0a55611e68b1451065d58fdfb/lb-1-big
> ...
> TRACE [SharedPool-Worker-16] 2016-11-17 08:36:05,320 ColumnFamilyStore.java:2841 - truncate complete
> TRACE [SharedPool-Worker-16] 2016-11-17 08:36:05,320 TruncateVerbHandler.java:53 - Truncation(keyspace='testdb', cf='testtbl') applied. Enqueuing response to 36512@/10.91.145.7
> TRACE [SharedPool-Worker-16] 2016-11-17 08:36:05,320 MessagingService.java:728 - /10.91.145.27 sending REQUEST_RESPONSE to 36512@/10.91.145.7
> ## end of truncation
> ====
>
> In fact, after running the script, "truncated_at" for the table in
> system.local was 0x00000158716da30b0004d1db00000158716db524, which
> decodes to segmentId=1479371760395, position=315867, and
> truncated_at=1479371765028 (2016-11-17 08:36:05,028).
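> For reference, the blob appears to be the serialized ReplayPosition
> followed by the truncation time: an 8-byte segmentId, a 4-byte position,
> and an 8-byte millisecond timestamp, all big-endian. A quick standalone
> decode (my own snippet, not Cassandra code) confirms those numbers:
>
> import java.nio.ByteBuffer;
> import java.time.Instant;
>
> public class DecodeTruncatedAt {
>     public static void main(String[] args) {
>         String hex = "00000158716da30b0004d1db00000158716db524";
>         // Convert the hex string into its 20 raw bytes.
>         byte[] raw = new byte[hex.length() / 2];
>         for (int i = 0; i < raw.length; i++)
>             raw[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
>         ByteBuffer buf = ByteBuffer.wrap(raw);  // big-endian by default
>         System.out.println("segmentId    = " + buf.getLong());  // 1479371760395
>         System.out.println("position     = " + buf.getInt());   // 315867
>         long ts = buf.getLong();                                // 1479371765028
>         System.out.println("truncated_at = " + ts + " = " + Instant.ofEpochMilli(ts));
>     }
> }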
> thanks,
> yuji
>
>
> On Wed, Nov 16, 2016 at 5:25 PM, Yuji Ito <y...@imagine-orb.com> wrote:
>
>> Hi,
>>
>> I found stale data after truncating a table.
>> It seems that truncation starts while recovery is still being executed
>> just after a node restarts.
>> Does recovery continue even after the truncation finishes?
>> Is that expected?
>>
>> I use C* 2.2.8 and can reproduce the problem as below.
>>
>> ==== [create table] ====
>> cqlsh $ip -e "drop keyspace testdb;"
>> cqlsh $ip -e "CREATE KEYSPACE testdb WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'};"
>> cqlsh $ip -e "CREATE TABLE testdb.testtbl (key int PRIMARY KEY, val int);"
>>
>> ==== [script] ====
>> #!/bin/sh
>>
>> node1_ip=<node1 IP address>
>> node2_ip=<node2 IP address>
>> node3_ip=<node3 IP address>
>> node3_user=<user name>
>> rows=10000
>>
>> echo "consistency quorum;" > init_data.cql
>> for key in $(seq 0 $(expr $rows - 1))
>> do
>>     echo "insert into testdb.testtbl (key, val) values($key, 1111) IF NOT EXISTS;" >> init_data.cql
>> done
>>
>> while true
>> do
>>     echo "truncate the table"
>>     cqlsh $node1_ip -e "truncate table testdb.testtbl"
>>     if [ $? -ne 0 ]; then
>>         echo "truncating failed"
>>         continue
>>     else
>>         break
>>     fi
>> done
>>
>> echo "kill C* process on node3"
>> pdsh -l $node3_user -R ssh -w $node3_ip "ps auxww | grep CassandraDaemon | awk '{if (\$13 ~ /cassand/) print \$2}' | xargs sudo kill -9"
>>
>> echo "insert $rows rows"
>> cqlsh $node1_ip -f init_data.cql > insert_log 2>&1
>>
>> echo "restart C* process on node3"
>> pdsh -l $node3_user -R ssh -w $node3_ip "sudo /etc/init.d/cassandra start"
>>
>> while true
>> do
>>     echo "truncate the table again"
>>     cqlsh $node1_ip -e "truncate table testdb.testtbl"
>>     if [ $? -ne 0 ]; then
>>         echo "truncating failed"
>>         continue
>>     else
>>         break
>>     fi
>> done
>>
>> cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select count(*) from testdb.testtbl;"
>> sleep 10
>> cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select count(*) from testdb.testtbl;"
>>
>> ==== [result] ====
>> truncate the table
>> kill C* process on node3
>> insert 10000 rows
>> restart C* process on node3
>> 10.91.145.27: Starting Cassandra: OK
>> truncate the table again
>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency level ALL
>> truncating failed
>> truncate the table again
>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency level ALL
>> truncating failed
>> truncate the table again
>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency level ALL
>> truncating failed
>> truncate the table again
>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency level ALL
>> truncating failed
>> truncate the table again
>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency level ALL
>> truncating failed
>> truncate the table again
>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency level ALL
>> truncating failed
>> truncate the table again
>> Consistency level set to SERIAL.
>>
>>  count
>> -------
>>    300
>>
>> (1 rows)
>>
>> Warnings :
>> Aggregation query used without partition key
>>
>> Consistency level set to SERIAL.
>>
>>  count
>> -------
>>   2304
>>
>> (1 rows)
>>
>> Warnings :
>> Aggregation query used without partition key
>> ====
>>
>> I found this while investigating the data loss problem (see the "failure
>> node rejoin" thread).
>> I'm not sure whether this problem is related to that data loss.
>>
>> Thanks,
>> yuji