Any chance you're using NVMe with an older Linux kernel? I've seen a *lot* of filesystem errors from using older CentOS versions. You'll want to be on a kernel version newer than 4.15.
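
If it helps to screen all the hosts quickly, here is a rough sketch of that kernel check. It is only an illustration: the helper names `parse_kernel_release` and `is_newer_than` are made up for this example, and it compares just the leading major.minor of a `uname -r` style string against 4.15. Vendor kernels such as RHEL's carry backported fixes, so treat the result as a first-pass filter rather than proof either way.

```python
import platform
import re


def parse_kernel_release(release):
    """Return (major, minor) from a `uname -r` style string.

    Example input: '3.10.0-957.5.1.el7.x86_64'.
    """
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def is_newer_than(release, threshold=(4, 15)):
    """True when the release's major.minor is strictly greater than threshold."""
    return parse_kernel_release(release) > threshold


if __name__ == "__main__":
    # platform.release() reports the same string as `uname -r` on Linux.
    release = platform.release()
    status = "newer than 4.15" if is_newer_than(release) else "4.15 or older"
    print(f"{release}: {status}")
```

Fed the release string from the `uname -a` output quoted below (`3.10.0-957.5.1.el7.x86_64`), the parser yields major 3, minor 10, which falls on the "4.15 or older" side.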
On Thu, Aug 8, 2019 at 9:31 AM Philip Ó Condúin <philipocond...@gmail.com> wrote:

> *@Jeff* - If it was hardware that would explain it all, but do you think
> it's possible to have every server in the cluster with a hardware issue?
> The data is sensitive and the customer would lose their mind if I sent it
> off-site, which is a pity because I could really do with the help.
> The corruption is occurring irregularly on every server and instance and
> column family in the cluster. Out of 72 instances, we are getting maybe 10
> corrupt files per day.
> We are using vnodes (256) and it is happening in both DCs.
>
> *@Asad* - internode compression is set to ALL on every server. I have
> checked the packets for the private interconnect and I can't see any
> dropped packets. There are dropped packets for other interfaces, but not
> for the private ones. I will get the network team to double-check this.
> The corruption is only on the application schema; we are not getting
> corruption on any system or cass keyspaces. Corruption is happening in
> both DCs. We are getting corruption for the one application schema we have,
> across all tables in the keyspace; it's not limited to one table.
> I'm not sure why the app team decided not to use default compression. I
> must ask them.
>
> I have been checking /var/log/messages today going back a few weeks
> and can see a serious amount of broken pipe errors across all servers and
> instances.
> Here is a snippet from one server, but most pipe errors are similar:
>
> Jul 9 03:00:08 cassandra: INFO 02:00:08 Writing Memtable-sstable_activity@1126262628(43.631KiB serialized bytes, 18072 ops, 0%/0% of on/off-heap limit)
> Jul 9 03:00:13 kernel: fnic_handle_fip_timer: 8 callbacks suppressed
> Jul 9 03:00:19 kernel: fnic_handle_fip_timer: 8 callbacks suppressed
> Jul 9 03:00:22 cassandra: ERROR 02:00:22 Got an IOException during write!
> Jul 9 03:00:22 cassandra: java.io.IOException: Broken pipe > Jul 9 03:00:22 cassandra: at sun.nio.ch.FileDispatcherImpl.write0(Native > Method) ~[na:1.8.0_172] > Jul 9 03:00:22 cassandra: at > sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_172] > Jul 9 03:00:22 cassandra: at > sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_172] > Jul 9 03:00:22 cassandra: at sun.nio.ch.IOUtil.write(IOUtil.java:65) > ~[na:1.8.0_172] > Jul 9 03:00:22 cassandra: at > sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) > ~[na:1.8.0_172] > Jul 9 03:00:22 cassandra: at > org.apache.thrift.transport.TNonblockingSocket.write(TNonblockingSocket.java:165) > ~[libthrift-0.9.2.jar:0.9.2] > Jul 9 03:00:22 cassandra: at > com.thinkaurelius.thrift.util.mem.Buffer.writeTo(Buffer.java:104) > ~[thrift-server-0.3.7.jar:na] > Jul 9 03:00:22 cassandra: at > com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.streamTo(FastMemoryOutputTransport.java:112) > ~[thrift-server-0.3.7.jar:na] > Jul 9 03:00:22 cassandra: at > com.thinkaurelius.thrift.Message.write(Message.java:222) > ~[thrift-server-0.3.7.jar:na] > Jul 9 03:00:22 cassandra: at > com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.handleWrite(TDisruptorServer.java:598) > [thrift-server-0.3.7.jar:na] > Jul 9 03:00:22 cassandra: at > com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.processKey(TDisruptorServer.java:569) > [thrift-server-0.3.7.jar:na] > Jul 9 03:00:22 cassandra: at > com.thinkaurelius.thrift.TDisruptorServer$AbstractSelectorThread.select(TDisruptorServer.java:423) > [thrift-server-0.3.7.jar:na] > Jul 9 03:00:22 cassandra: at > com.thinkaurelius.thrift.TDisruptorServer$AbstractSelectorThread.run(TDisruptorServer.java:383) > [thrift-server-0.3.7.jar:na] > Jul 9 03:00:25 kernel: fnic_handle_fip_timer: 8 callbacks suppressed > Jul 9 03:00:30 cassandra: ERROR 02:00:30 Got an IOException during write! 
> Jul 9 03:00:30 cassandra: java.io.IOException: Broken pipe > Jul 9 03:00:30 cassandra: at sun.nio.ch.FileDispatcherImpl.write0(Native > Method) ~[na:1.8.0_172] > Jul 9 03:00:30 cassandra: at > sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_172] > Jul 9 03:00:30 cassandra: at > sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_172] > Jul 9 03:00:30 cassandra: at sun.nio.ch.IOUtil.write(IOUtil.java:65) > ~[na:1.8.0_172] > Jul 9 03:00:30 cassandra: at > sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) > ~[na:1.8.0_172] > Jul 9 03:00:30 cassandra: at > org.apache.thrift.transport.TNonblockingSocket.write(TNonblockingSocket.java:165) > ~[libthrift-0.9.2.jar:0.9.2] > Jul 9 03:00:30 cassandra: at > com.thinkaurelius.thrift.util.mem.Buffer.writeTo(Buffer.java:104) > ~[thrift-server-0.3.7.jar:na] > Jul 9 03:00:30 cassandra: at > com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.streamTo(FastMemoryOutputTransport.java:112) > ~[thrift-server-0.3.7.jar:na] > Jul 9 03:00:30 cassandra: at > com.thinkaurelius.thrift.Message.write(Message.java:222) > ~[thrift-server-0.3.7.jar:na] > Jul 9 03:00:30 cassandra: at > com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.handleWrite(TDisruptorServer.java:598) > [thrift-server-0.3.7.jar:na] > Jul 9 03:00:30 cassandra: at > com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.processKey(TDisruptorServer.java:569) > [thrift-server-0.3.7.jar:na] > Jul 9 03:00:30 cassandra: at > com.thinkaurelius.thrift.TDisruptorServer$AbstractSelectorThread.select(TDisruptorServer.java:423) > [thrift-server-0.3.7.jar:na] > Jul 9 03:00:30 cassandra: at > com.thinkaurelius.thrift.TDisruptorServer$AbstractSelectorThread.run(TDisruptorServer.java:383) > [thrift-server-0.3.7.jar:na] > Jul 9 03:00:31 kernel: fnic_handle_fip_timer: 8 callbacks suppressed > Jul 9 03:00:37 kernel: fnic_handle_fip_timer: 8 callbacks suppressed > Jul 9 03:00:43 kernel: fnic_handle_fip_timer: 8 callbacks 
suppressed
>
> On Thu, 8 Aug 2019 at 15:42, ZAIDI, ASAD A <az1...@att.com> wrote:
>
>> Did you check that packets are NOT being dropped on the network interfaces
>> the Cassandra instances are using (ifconfig -a)? Internode compression is
>> set for all endpoints; maybe the network is playing a role here?
>>
>> Is this corruption limited to certain keyspaces/tables or DCs, or is it
>> widespread? From the log snippet you shared, it looked like only a specific
>> keyspace/table is affected. Is that correct?
>>
>> When you remove a corrupted sstable of a certain table, I guess you
>> verify all nodes for corrupted sstables of the same table (maybe with the
>> nodetool scrub tool) to limit the spread of corruption, right?
>>
>> Just curious to know: you're not using the lz4/default compressor for all
>> tables, so there must be some reason for it.
>>
>> *From:* Philip Ó Condúin [mailto:philipocond...@gmail.com]
>> *Sent:* Thursday, August 08, 2019 6:20 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Datafile Corruption
>>
>> Hi All,
>>
>> Thank you so much for the replies.
>>
>> Currently, I have the following list that can potentially cause some sort
>> of corruption in a Cassandra cluster.
>>
>>    - Sudden power cut - *We have had no power cuts in the datacenters*
>>    - Network issues - *no network issues from what I can tell*
>>    - Disk full - *I don't think this is an issue for us, see disks below.*
>>    - An issue in a Cassandra version, like Cassandra-13752 - *couldn't find any Jira issues similar to ours.*
>>    - Bit flips - *we have compression enabled so I don't think this should be an issue.*
>>    - Repair during upgrade has caused corruption too - *we have not upgraded*
>>    - Dropping and adding columns with the same name but a different type - *I will need to ask the apps team how they are using the database.*
>>
>> Ok, let me try and explain the issue we are having. I am under a lot of
>> pressure from above to get this fixed and I can't figure it out.
>>
>> This is a PRE-PROD environment.
>>
>>    - 2 datacenters.
>>    - 9 physical servers in each datacenter
>>    - 4 Cassandra instances on each server
>>    - 72 Cassandra instances across the 2 data centres, 36 in site A, 36 in site B.
>>
>> We also have 2 Reaper nodes we use for repair: one Reaper node in each
>> datacenter, each running with its own Cassandra back end, in a cluster
>> together.
>>
>> OS Details [Red Hat Linux]
>> cass_a@x 0 10:53:01 ~ $ uname -a
>> Linux x 3.10.0-957.5.1.el7.x86_64 #1 SMP Wed Dec 19 10:46:58 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
>>
>> cass_a@x 0 10:57:31 ~ $ cat /etc/*release
>> NAME="Red Hat Enterprise Linux Server"
>> VERSION="7.6 (Maipo)"
>> ID="rhel"
>>
>> Storage Layout
>> cass_a@xx 0 10:46:28 ~ $ df -h
>> Filesystem                Size  Used  Avail  Use%  Mounted on
>> /dev/mapper/vg01-lv_root   20G  2.2G    18G   11%  /
>> devtmpfs                   63G     0    63G    0%  /dev
>> tmpfs                      63G     0    63G    0%  /dev/shm
>> tmpfs                      63G  4.1G    59G    7%  /run
>> tmpfs                      63G     0    63G    0%  /sys/fs/cgroup
>>
>> 4 cassandra instances
>> /dev/sdd                  1.5T  802G   688G   54%  /data/ssd4
>> /dev/sda                  1.5T  798G   692G   54%  /data/ssd1
>> /dev/sdb                  1.5T  681G   810G   46%  /data/ssd2
>> /dev/sdc                  1.5T  558G   932G   38%  /data/ssd3
>>
>> Cassandra load is about 200GB and the rest of the space is snapshots
>>
>> CPU
>> cass_a@x 127 10:58:47 ~ $ lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
>> CPU(s):                64
>> Thread(s) per core:    2
>> Core(s) per socket:    16
>> Socket(s):             2
>>
>> *Description of problem:*
>> During repair of the cluster, we are seeing multiple corruptions in the
>> log files on a lot of instances. There seems to be no pattern to the
>> corruption. It seems that the repair job is finding all the corrupted
>> files for us. The repair will hang on the node where the corrupted file is
>> found. To fix this we remove/rename the datafile and bounce the Cassandra
>> instance. Our hardware/OS team have stated there is no problem on their
>> side. I do not believe it is the repair causing the corruption.
>>
>> We have maintenance scripts that run every night running compactions and
>> creating snapshots. I decided to turn these off, fixed any corruptions we
>> currently had and ran a major compaction on all nodes. Once this was done we
>> had a "clean" cluster and we left the cluster for a few days.
>> After the process we noticed one corruption in the cluster. This datafile was created
>> after I turned off the maintenance scripts, so my theory of the scripts
>> causing the issue was wrong. We then kicked off another repair and started
>> to find more corrupt files created after the maintenance script was turned
>> off.
>>
>> So let me give you an example of a corrupted file and maybe someone might
>> be able to work through it with me?
>>
>> When this corrupted file was reported in the log, it looks like it was the
>> repair that found it.
>>
>> $ journalctl -u cassmeta-cass_b.service --since "2019-08-07 22:25:00" --until "2019-08-07 22:45:00"
>>
>> Aug 07 22:30:33 cassandra[34611]: INFO 21:30:33 Writing Memtable-compactions_in_progress@830377457(0.008KiB serialized bytes, 1 ops, 0%/0% of on/off-heap limit)
>> Aug 07 22:30:33 cassandra[34611]: ERROR 21:30:33 Failed creating a merkle tree for [repair #9587a200-b95a-11e9-8920-9f72868b8375 on KeyspaceMetadata/x, (-1476350953672479093,-1474461
>> Aug 07 22:30:33 cassandra[34611]: ERROR 21:30:33 Exception in thread Thread[ValidationExecutor:825,1,main]
>> Aug 07 22:30:33 cassandra[34611]: org.apache.cassandra.io.FSReadError: org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: /x/ssd2/data/KeyspaceMetadata/x-1e453cb0
>> Aug 07 22:30:33 cassandra[34611]: at org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:365) ~[apache-cassandra-2.2.13.jar:2.2.13]
>> Aug 07 22:30:33 cassandra[34611]: at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:361) ~[apache-cassandra-2.2.13.jar:2.2.13]
>> Aug 07 22:30:33 cassandra[34611]: at org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:340) ~[apache-cassandra-2.2.13.jar:2.2.13]
>> Aug 07 22:30:33 cassandra[34611]: at org.apache.cassandra.db.composites.AbstractCType$Serializer.deserialize(AbstractCType.java:382)
~[apache-cassandra-2.2.13.jar:2.2.13] >> Aug 07 22:30:33 cassandra[34611]: at >> org.apache.cassandra.db.composites.AbstractCType$Serializer.deserialize(AbstractCType.java:366) >> ~[apache-cassandra-2.2.13.jar:2.2.13] >> Aug 07 22:30:33 cassandra[34611]: at >> org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:81) >> ~[apache-cassandra-2.2.13.jar:2.2.13] >> Aug 07 22:30:33 cassandra[34611]: at >> org.apache.cassandra.db.AbstractCell$1.computeNext(AbstractCell.java:52) >> ~[apache-cassandra-2.2.13.jar:2.2.13] >> Aug 07 22:30:33 cassandra[34611]: at >> org.apache.cassandra.db.AbstractCell$1.computeNext(AbstractCell.java:46) >> ~[apache-cassandra-2.2.13.jar:2.2.13] >> Aug 07 22:30:33 cassandra[34611]: at >> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) >> ~[guava-16.0.jar:na] >> Aug 07 22:30:33 cassandra[34611]: at >> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) >> ~[guava-16.0.jar:na] >> Aug 07 22:30:33 cassandra[34611]: at >> org.apache.cassandra.io.sstable.SSTableIdentityIterator.hasNext(SSTableIdentityIterator.java:169) >> ~[apache-cassandra-2.2.13.jar:2.2.13] >> Aug 07 22:30:33 cassandra[34611]: at >> org.apache.cassandra.utils.MergeIterator$OneToOne.computeNext(MergeIterator.java:202) >> ~[apache-cassandra-2.2.13.jar:2.2.13] >> Aug 07 22:30:33 cassandra[34611]: at >> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) >> ~[guava-16.0.jar:na] >> Aug 07 22:30:33 cassandra[34611]: at >> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) >> ~[guava-16.0.jar:na] >> Aug 07 22:30:33 cassandra[34611]: at >> com.google.common.collect.Iterators$7.computeNext(Iterators.java:645) >> ~[guava-16.0.jar:na] >> Aug 07 22:30:33 cassandra[34611]: at >> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) >> ~[guava-16.0.jar:na] >> Aug 07 22:30:33 cassandra[34611]: at >> 
com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) >> ~[guava-16.0.jar:na] >> Aug 07 22:30:33 cassandra[34611]: at >> org.apache.cassandra.db.ColumnIndex$Builder.buildForCompaction(ColumnIndex.java:174) >> ~[apache-cassandra-2.2.13.jar:2.2.13] >> Aug 07 22:30:33 cassandra[34611]: at >> org.apache.cassandra.db.compaction.LazilyCompactedRow.update(LazilyCompactedRow.java:187) >> ~[apache-cassandra-2.2.13.jar:2.2.13] >> Aug 07 22:30:33 cassandra[34611]: at >> org.apache.cassandra.repair.Validator.rowHash(Validator.java:201) >> ~[apache-cassandra-2.2.13.jar:2.2.13] >> Aug 07 22:30:33 cassandra[34611]: at >> org.apache.cassandra.repair.Validator.add(Validator.java:150) >> ~[apache-cassandra-2.2.13.jar:2.2.13] >> Aug 07 22:30:33 cassandra[34611]: at >> org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:1166) >> ~[apache-cassandra-2.2.13.jar:2.2.13] >> Aug 07 22:30:33 cassandra[34611]: at >> org.apache.cassandra.db.compaction.CompactionManager.access$600(CompactionManager.java:76) >> ~[apache-cassandra-2.2.13.jar:2.2.13] >> Aug 07 22:30:33 cassandra[34611]: at >> org.apache.cassandra.db.compaction.CompactionManager$10.call(CompactionManager.java:736) >> ~[apache-cassandra-2.2.13.jar:2.2.13] >> Aug 07 22:30:33 cassandra[34611]: at >> java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_172] >> Aug 07 22:30:33 cassandra[34611]: at >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) >> ~[na:1.8.0_172] >> Aug 07 22:30:33 cassandra[34611]: at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) >> [na:1.8.0_172] >> Aug 07 22:30:33 cassandra[34611]: at >> java.lang.Thread.run(Thread.java:748) [na:1.8.0_172] >> Aug 07 22:30:33 cassandra[34611]: Caused by: >> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: >> /data/ssd2/data/KeyspaceMetadata/x-x/l >> Aug 07 22:30:33 cassandra[34611]: at >> 
org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBufferMmap(CompressedRandomAccessReader.java:216) >> ~[apache-cassandra-2.2.13.jar:2.2.13] >> Aug 07 22:30:33 cassandra[34611]: at >> org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBuffer(CompressedRandomAccessReader.java:226) >> ~[apache-cassandra-2.2.13.jar:2.2.13] >> Aug 07 22:30:33 cassandra[34611]: at >> org.apache.cassandra.io.compress.CompressedThrottledReader.reBuffer(CompressedThrottledReader.java:42) >> ~[apache-cassandra-2.2.13.jar:2.2.13] >> Aug 07 22:30:33 cassandra[34611]: at >> org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:352) >> ~[apache-cassandra-2.2.13.jar:2.2.13] >> Aug 07 22:30:33 cassandra[34611]: ... 27 common frames omitted >> Aug 07 22:30:33 cassandra[34611]: Caused by: >> org.apache.cassandra.io.compress.CorruptBlockException: >> (/data/ssd2/data/KeyspaceMetadata/x-x/lb-26203-big >> Aug 07 22:30:33 cassandra[34611]: at >> org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBufferMmap(CompressedRandomAccessReader.java:185) >> ~[apache-cassandra-2.2.13.jar:2.2.13] >> Aug 07 22:30:33 cassandra[34611]: ... 
30 common frames omitted
>> Aug 07 22:30:33 cassandra[34611]: INFO 21:30:33 Not a global repair, will not do anticompaction
>> Aug 07 22:30:33 cassandra[34611]: ERROR 21:30:33 Stopping gossiper
>> Aug 07 22:30:33 cassandra[34611]: WARN 21:30:33 Stopping gossip by operator request
>> Aug 07 22:30:33 cassandra[34611]: INFO 21:30:33 Announcing shutdown
>> Aug 07 22:30:33 cassandra[34611]: INFO 21:30:33 Node /10.2.57.37 state jump to shutdown
>>
>> So I went to the file system to see when this corrupt file was created,
>> and it was created on July 30th at 15:55.
>>
>> root@x 0 01:14:03 ~ # ls -l /data/ssd2/data/KeyspaceMetadata/x-x/lb-26203-big-Data.db
>> -rw-r--r-- 1 cass_b cass_b 3182243670 Jul 30 15:55 /data/ssd2/data/KeyspaceMetadata/x-x/lb-26203-big-Data.db
>>
>> So I checked /var/log/messages for errors around that time.
>> The only thing that stands out to me is the message "Cannot perform a
>> full major compaction as repaired and unrepaired sstables cannot be
>> compacted together". I'm not sure if this would be an issue though and
>> cause corruption.
>>
>> Jul 30 15:55:06 x systemd: Created slice User Slice of root.
>> Jul 30 15:55:06 x systemd: Started Session c165280 of user root.
>> Jul 30 15:55:06 x audispd: node=x. type=USER_START msg=audit(1564498506.021:457933): pid=17533 uid=0 auid=4294967295 ses=4294967295 msg='op=PAM:session_open grantors=pam_keyinit,pam_limits,pam_keyinit,pam_limits,pam_tty_audit,pam_systemd,pam_unix acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=? res=success'
>> Jul 30 15:55:06 x systemd: Removed slice User Slice of root.
>> Jul 30 15:55:14 x tag_audit_log: type=USER_CMD >> msg=audit(1564498506.013:457932): pid=17533 uid=509 auid=4294967295 >> ses=4294967295 msg='cwd="/" >> cmd=2F7573722F7362696E2F69706D692D73656E736F7273202D2D71756965742D6361636865202D2D7364722D63616368652D7265637265617465202D2D696E746572707265742D6F656D2D64617461202D2D6F75747075742D73656E736F722D7374617465202D2D69676E6F72652D6E6F742D617661696C61626C652D73656E736F7273202D2D6F75747075742D73656E736F722D7468726573686F6C6473 >> terminal=? res=success' >> Jul 30 15:55:14 x tag_audit_log: type=USER_START >> msg=audit(1564498506.021:457933): pid=17533 uid=0 auid=4294967295 >> ses=4294967295 msg='op=PAM:session_open >> grantors=pam_keyinit,pam_limits,pam_keyinit,pam_limits,pam_tty_audit,pam_systemd,pam_unix >> acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=? res=success' >> Jul 30 15:55:19 x cassandra: INFO 14:55:19 Writing >> Memtable-compactions_in_progress@1462227999(0.008KiB serialized bytes, 1 >> ops, 0%/0% of on/off-heap limit) >> Jul 30 15:55:19 x cassandra: INFO 14:55:19 Cannot perform a full major >> compaction as repaired and unrepaired sstables cannot be compacted >> together. These two set of sstables will be compacted separately. >> Jul 30 15:55:19 x cassandra: INFO 14:55:19 Writing >> Memtable-compactions_in_progress@1198535528(1.002KiB serialized bytes, >> 57 ops, 0%/0% of on/off-heap limit) >> Jul 30 15:55:20 x cassandra: INFO 14:55:20 Writing >> Memtable-compactions_in_progress@2039409834(0.008KiB serialized bytes, 1 >> ops, 0%/0% of on/off-heap limit) >> Jul 30 15:55:24 x audispd: node=x. type=USER_LOGOUT >> msg=audit(1564498524.409:457934): pid=46620 uid=0 auid=464400029 ses=2747 >> msg='op=login id=464400029 exe="/usr/sbin/sshd" hostname=? addr=? >> terminal=/dev/pts/0 res=success' >> Jul 30 15:55:24 x audispd: node=x. type=USER_LOGOUT >> msg=audit(1564498524.409:457935): pid=4878 uid=0 auid=464400029 ses=2749 >> msg='op=login id=464400029 exe="/usr/sbin/sshd" hostname=? addr=? 
>> terminal=/dev/pts/1 res=success' >> >> Jul 30 15:55:57 x systemd: Created slice User Slice of root. >> Jul 30 15:55:57 x systemd: Started Session c165288 of user root. >> Jul 30 15:55:57 x audispd: node=x. type=USER_START >> msg=audit(1564498557.294:457958): pid=19687 uid=0 auid=4294967295 >> ses=4294967295 msg='op=PAM:session_open >> grantors=pam_keyinit,pam_limits,pam_keyinit,pam_limits,pam_tty_audit,pam_systemd,pam_unix >> acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=? res=success' >> Jul 30 15:55:57 x audispd: node=x. type=USER_START >> msg=audit(1564498557.298:457959): pid=19690 uid=0 auid=4294967295 >> ses=4294967295 msg='op=PAM:session_open >> grantors=pam_keyinit,pam_systemd,pam_keyinit,pam_limits,pam_unix >> acct="cass_b" exe="/usr/sbin/runuser" hostname=? addr=? terminal=? >> res=success' >> Jul 30 15:55:58 x systemd: Removed slice User Slice of root. >> Jul 30 15:56:02 x cassandra: INFO 14:56:02 Writing >> Memtable-compactions_in_progress@1532791194(0.008KiB serialized bytes, 1 >> ops, 0%/0% of on/off-heap limit) >> Jul 30 15:56:02 x cassandra: INFO 14:56:02 Cannot perform a full major >> compaction as repaired and unrepaired sstables cannot be compacted >> together. These two set of sstables will be compacted separately. >> Jul 30 15:56:02 x cassandra: INFO 14:56:02 Writing >> Memtable-compactions_in_progress@1455399453(0.281KiB serialized bytes, >> 16 ops, 0%/0% of on/off-heap limit) >> Jul 30 15:56:04 x tag_audit_log: type=USER_CMD >> msg=audit(1564498555.190:457951): pid=19294 uid=509 auid=4294967295 >> ses=4294967295 msg='cwd="/" >> cmd=72756E75736572202D73202F62696E2F62617368202D6C20636173735F62202D632063617373616E6472612D6D6574612F63617373616E6472612F62696E2F6E6F6465746F6F6C2074707374617473 >> terminal=? res=success' >> >> >> >> We have checked a number of other things like NTP setting etc but nothing >> is telling us what could cause so many corruptions across the entire >> cluster. 
>> Things were healthy with this cluster for months. The only thing I can
>> think of is that we started loading data, from a load of 20GB per instance up
>> to 200GB where it sits now; maybe this just highlighted the issue.
>>
>> Compaction and Compression on Keyspace CL's [mixture]
>> All CF's are using compression.
>>
>> AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.*SizeTieredCompactionStrategy*', 'max_threshold': '32'}
>> AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.*SnappyCompressor*'}
>>
>> AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.*SizeTieredCompactionStrategy*', 'max_threshold': '32'}
>> AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.*LZ4Compressor*'}
>>
>> AND compaction = {'class': 'org.apache.cassandra.db.compaction.*LeveledCompactionStrategy*'}
>> AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.*SnappyCompressor*'}
>>
>> --We are also using internode network compression:
>> internode_compression: all
>>
>> Does anyone have any idea what I should check next?
>> Our next theory is that there may be an issue with Checksum, but I'm not
>> sure where to go with this.
>>
>> Any help would be very much appreciated before I lose the last bit of
>> hair I have on my head.
>>
>> Kind Regards,
>>
>> Phil
>>
>> On Wed, 7 Aug 2019 at 20:51, Nitan Kainth <nitankai...@gmail.com> wrote:
>>
>> Repair during upgrade has caused corruption too.
>>
>> Also, dropping and adding columns with the same name but a different type
>>
>> Regards,
>>
>> Nitan
>>
>> Cell: 510 449 9629
>>
>> On Aug 7, 2019, at 2:42 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>>
>> Is compression enabled?
>>
>> If not, bit flips on disk can corrupt data files and reads + repair may
>> send that corruption to other hosts in the cluster
>>
>> On Aug 7, 2019, at 3:46 AM, Philip Ó Condúin <philipocond...@gmail.com> wrote:
>>
>> Hi All,
>>
>> I am currently experiencing multiple datafile corruptions across most
>> nodes in my cluster, there seems to be no pattern to the corruption. I'm
>> starting to think it might be a bug, we're using Cassandra 2.2.13.
>>
>> Without going into detail about the issue I just want to confirm
>> something.
>>
>> Can someone share with me a list of scenarios that would cause corruption?
>>
>> 1. OS failure
>> 2. Cassandra disturbed during the writing
>>
>> etc etc.
>>
>> I need to investigate each scenario and don't want to leave any out.
>>
>> --
>> Regards,
>> Phil
>>
>> --
>> Regards,
>> Phil
>
> --
> Regards,
> Phil