[jira] [Commented] (CASSANDRA-15274) Multiple Corrupt datafiles across entire environment
[ https://issues.apache.org/jira/browse/CASSANDRA-15274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961880#comment-16961880 ]

Benedict Elliott Smith commented on CASSANDRA-15274:

Thanks for coming back with an update [~philoconduin]. Sounds like a very annoying problem to diagnose! Glad to hear that in this case Cassandra wasn't to blame.

> Multiple Corrupt datafiles across entire environment
> ----------------------------------------------------
>
> Key: CASSANDRA-15274
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15274
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Compaction
> Reporter: Phil O Conduin
> Priority: Normal
> Labels: impact-high
>
> Cassandra Version: 2.2.13
> PRE-PROD environment.
> * 2 datacenters.
> * 9 physical servers in each datacenter - (_Cisco UCS C220 M4 SFF_)
> * 4 Cassandra instances on each server (cass_a, cass_b, cass_c, cass_d)
> * 72 Cassandra instances across the 2 data centres, 36 in site A, 36 in site B.
>
> We also have 2 Reaper Nodes we use for repair. One reaper node in each
> datacenter each running with its own Cassandra back end in a cluster together.
> OS Details [Red Hat Linux]
> cass_a@x 0 10:53:01 ~ $ uname -a
> Linux x 3.10.0-957.5.1.el7.x86_64 #1 SMP Wed Dec 19 10:46:58 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
> cass_a@x 0 10:57:31 ~ $ cat /etc/*release
> NAME="Red Hat Enterprise Linux Server"
> VERSION="7.6 (Maipo)"
> ID="rhel"
>
> Storage Layout
> cass_a@xx 0 10:46:28 ~ $ df -h
> Filesystem                Size  Used Avail Use% Mounted on
> /dev/mapper/vg01-lv_root   20G  2.2G   18G  11% /
> devtmpfs                   63G     0   63G   0% /dev
> tmpfs                      63G     0   63G   0% /dev/shm
> tmpfs                      63G  4.1G   59G   7% /run
> tmpfs                      63G     0   63G   0% /sys/fs/cgroup
> >> 4 cassandra instances
> /dev/sdd                  1.5T  802G  688G  54% /data/ssd4
> /dev/sda                  1.5T  798G  692G  54% /data/ssd1
> /dev/sdb                  1.5T  681G  810G  46% /data/ssd2
> /dev/sdc                  1.5T  558G  932G  38% /data/ssd3
> Cassandra load is about 200GB and the rest of the space is snapshots
>
> CPU
> cass_a@x 127 10:58:47 ~ $ lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
> CPU(s): 64
> Thread(s) per core: 2
> Core(s) per socket: 16
> Socket(s): 2
>
> *Description of problem:*
> During repair of the cluster, we are seeing multiple corruptions in the log
> files on a lot of instances. There seems to be no pattern to the corruption.
> It seems that the repair job is finding all the corrupted files for us. The
> repair will hang on the node where the corrupted file is found. To fix this
> we remove/rename the datafile and bounce the Cassandra instance. Our
> hardware/OS team have stated there is no problem on their side. I do not
> believe it is the repair causing the corruption.
>
> So let me give you an example of a corrupted file and maybe someone might be
> able to work through it with me?
> When this corrupted file was reported in the log it looks like it was the
> repair that found it.
> $ journalctl -u cassmeta-cass_b.service --since "2019-08-07 22:25:00" --until "2019-08-07 22:45:00"
> Aug 07 22:30:33 cassandra[34611]: INFO 21:30:33 Writing Memtable-compactions_in_progress@830377457(0.008KiB serialized bytes, 1 ops, 0%/0% of on/off-heap limit)
> Aug 07 22:30:33 cassandra[34611]: ERROR 21:30:33 Failed creating a merkle tree for [repair #9587a200-b95a-11e9-8920-9f72868b8375 on KeyspaceMetadata/x, (-1476350953672479093,-1474461
> Aug 07 22:30:33 cassandra[34611]: ERROR 21:30:33 Exception in thread Thread[ValidationExecutor:825,1,main]
> Aug 07 22:30:33 cassandra[34611]: org.apache.cassandra.io.FSReadError: org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: /x/ssd2/data/KeyspaceMetadata/x-1e453cb0
> Aug 07 22:30:33 cassandra[34611]: at org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:365) ~[apache-cassandra-2.2.13.jar:2.2.13]
> Aug 07 22:30:33 cassandra[34611]: at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:361) ~[apache-cassandra-2.2.13.jar:2.2.13]
> Aug 07 22:30:33 cassandra[34611]: at org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:340) ~[apache-cassandra-2.2.13.jar:2.2.13]
> Aug 07 22:30:33 cassandra[34611]: at org.apache.cassandra.db.composites.AbstractCType$Serializer.deserialize(AbstractCType.java:382) ~[apache-cassandra-2.2.13.jar:2.2.13]
> Aug 07 22:30:33 cassandra[34611]: at org.apache.cassandra.db.composites.AbstractCType$Serializer.deserialize(AbstractCType.j
[jira] [Commented] (CASSANDRA-15274) Multiple Corrupt datafiles across entire environment
[ https://issues.apache.org/jira/browse/CASSANDRA-15274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961436#comment-16961436 ]

Phil O Conduin commented on CASSANDRA-15274:

Hi Benedict,

Sorry I forgot to come back and update this jira. Our datafile corruption issues were caused by the OS wrongly taking a block that belonged to a C* data file, thinking it was no longer used, and treating it as a free block that would later be reused.

For example: C* deletes a file after compaction, the OS collects all the blocks that are now free and sends a TRIM command to the SSD, but from time to time the SSD picks the wrong block, not the one reported by the OS, and trims it. This causes a zeroed block to appear in the datafile, and the block is later used for a different file.

So the symptom is that we suddenly see 4096 zeroes in the datafile, which means the SSD has just trimmed the block. After some time we can see other data written to those blocks, which means the block is now used by another file, and that gives us a corrupt file.

We turned off the scheduled TRIM function on the OS and we are no longer getting corruptions. This was very difficult to pinpoint.
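The symptom reported in this thread, a block of 4096 zero bytes appearing inside a data file, can be looked for directly. The following is a minimal Python sketch and not part of Cassandra: the function name `find_zero_blocks` is hypothetical, and the 4096-byte default is simply the block size mentioned in the comment. A zero-filled block is only a hint, since a file may legitimately contain zeroes.

```python
def find_zero_blocks(path, block_size=4096):
    """Return the offsets of aligned, full-size blocks that contain only zero bytes."""
    zero_block = bytes(block_size)  # reference block of block_size zero bytes
    offsets = []
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(block_size)
            if not block:
                break
            # A short final block never equals zero_block, so it is not reported.
            if block == zero_block:
                offsets.append(offset)
            offset += len(block)
    return offsets
```

Running this periodically against the same file and comparing the results would show whether zeroed blocks appear after the file was written, which is the pattern described above.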
[jira] [Commented] (CASSANDRA-15274) Multiple Corrupt datafiles across entire environment
[ https://issues.apache.org/jira/browse/CASSANDRA-15274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959051#comment-16959051 ]

Benedict Elliott Smith commented on CASSANDRA-15274:

Sorry for the slow response, this dropped off my radar due to a number of competing commitments.

The stack traces you have posted are still being thrown by CRC mismatches AFAICT. If you are still having problems, I'd be willing to take a look at a raw sstable that you are able to provide (I will also need you to provide the schema for the table).
[jira] [Commented] (CASSANDRA-15274) Multiple Corrupt datafiles across entire environment
[ https://issues.apache.org/jira/browse/CASSANDRA-15274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920810#comment-16920810 ]

Phil O Conduin commented on CASSANDRA-15274:

Hi,

We managed to remove the CRC check from the code and build it. When we run sstable2json on a corrupt file we are no longer seeing an issue with the CRC. This time it is not the CRC check but an exception during the attempt to decompress the chunk, so I think we have the answer to our question: it is not just a CRC check problem.

As another area of investigation, we created a script that generates MD5 checksums for all sstable files. The script runs from cron twice per day and logs the checksums of all sstable files. We capture the MD5 and then compare it over the lifetime of the file. We have proved that the MD5 checksum is not changing. This would indicate a possible bug in Cassandra at the time of compacting/writing the file.

Taking the latest file for example:

*First reported in cassandra log Sep 01 08:39:48*

{{Sep 01 08:39:48 sa-ref-met-009.btmx-ref.synchronoss.net cassandra[16223]: ERROR 07:39:48 Failed creating a merkle tree for [repair #fb265fa0-cc8a-11e9-9296-5b5fb0093f98 on KeyspaceMetadata/CF_ConversationIndex1, (-2320162195562336336,-2318312110429971422]], /10.2.41.38 (see log for details)}}
{{Sep 01 08:39:48 sa-ref-met-009.btmx-ref.synchronoss.net cassandra[16223]: INFO 07:39:48 [repair #fb265fa0-cc8a-11e9-9296-5b5fb0093f98] Received merkle tree for CF_ConversationIndex1 from /10.2.41.38}}
{{Sep 01 08:39:48 sa-ref-met-009.btmx-ref.synchronoss.net cassandra[16223]: WARN 07:39:48 [repair #fb265fa0-cc8a-11e9-9296-5b5fb0093f98] CF_ConversationIndex1 sync failed}}
{{Sep 01 08:39:48 sa-ref-met-009.btmx-ref.synchronoss.net cassandra[16223]: INFO 07:39:48 [repair #fb265fa0-cc8a-11e9-9296-5b5fb0093f98] Requesting merkle trees for CF_RecentIndex (to [/10.2.41.34, /10.2.41.48, /10.2.57.54, /10.2.57.46, /10.2.57.12, /10.2.41.38])}}
{{Sep 01 08:39:48 sa-ref-met-009.btmx-ref.synchronoss.net cassandra[16223]: ERROR 07:39:48 Exception in thread Thread[RepairJobTask:24,5,main]}}
{{Sep 01 08:39:48 sa-ref-met-009.btmx-ref.synchronoss.net cassandra[16223]: org.apache.cassandra.exceptions.RepairException: [repair #fb265fa0-cc8a-11e9-9296-5b5fb0093f98 on KeyspaceMetadata/CF_ConversationIndex1, (-2320162195562336336,-2318312110429971422]] Validation failed in /10.2.41.38}}
{{Sep 01 08:39:48 sa-ref-met-009.btmx-ref.synchronoss.net cassandra[16223]: at org.apache.cassandra.repair.ValidationTask.treeReceived(ValidationTask.java:64) ~[apache-cassandra-2.2.13.jar:2.2.13]}}
{{Sep 01 08:39:48 sa-ref-met-009.btmx-ref.synchronoss.net cassandra[16223]: at org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:178) ~[apache-cassandra-2.2.13.jar:2.2.13]}}
{{Sep 01 08:39:48 sa-ref-met-009.btmx-ref.synchronoss.net cassandra[16223]: at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:478) ~[apache-cassandra-2.2.13.jar:2.2.13]}}
{{Sep 01 08:39:48 sa-ref-met-009.btmx-ref.synchronoss.net cassandra[16223]: at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:174) ~[apache-cassandra-2.2.13.jar:2.2.13]}}
{{Sep 01 08:39:48 sa-ref-met-009.btmx-ref.synchronoss.net cassandra[16223]: at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67) ~[apache-cassandra-2.2.13.jar:2.2.13]}}
{{Sep 01 08:39:48 sa-ref-met-009.btmx-ref.synchronoss.net cassandra[16223]: at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_172]}}
{{Sep 01 08:39:48 sa-ref-met-009.btmx-ref.synchronoss.net cassandra[16223]: at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_172]}}
{{Sep 01 08:39:48 sa-ref-met-009.btmx-ref.synchronoss.net cassandra[16223]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_172]}}
{{Sep 01 08:39:48 sa-ref-met-009.btmx-ref.synchronoss.net cassandra[16223]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_172]}}
{{Sep 01 08:39:48 sa-ref-met-009.btmx-ref.synchronoss.net cassandra[16223]: at java.lang.Thread.run(Thread.java:748) [na:1.8.0_172]}}
{{Sep 01 08:39:48 sa-ref-met-009.btmx-ref.synchronoss.net cassandra[16223]: ERROR 07:39:48 Exception in thread Thread[ValidationExecutor:53,1,main]}}
{{Sep 01 08:39:48 sa-ref-met-009.btmx-ref.synchronoss.net cassandra[16223]: org.apache.cassandra.io.FSReadError: org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: /data/ssd2/data/KeyspaceMetadata/CF_ConversationIndex1-1e77be609c7911e8ac12255de1fb512a/lb-26352-big-Data.db}}
{{Sep 01 08:39:48 sa-ref-met-009.btmx-ref.synchronoss.net cassandra[16223]: at org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:365) ~[apache-ca
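The checksum tracking described in this comment (MD5 of every sstable file, captured on a schedule and compared over the lifetime of the file) could be sketched roughly as follows. The actual script was not posted, so this is only an illustration in Python: the function names, the `-Data.db` suffix filter, and the in-memory dictionaries are all assumptions.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_checksums(data_dir, suffix="-Data.db"):
    """Map each file under data_dir whose name ends with suffix to its MD5."""
    checksums = {}
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if name.endswith(suffix):
                path = os.path.join(root, name)
                checksums[path] = md5_of_file(path)
    return checksums


def changed_files(previous, current):
    """Return paths present in both snapshots whose checksum differs."""
    return sorted(
        path for path in current
        if path in previous and previous[path] != current[path]
    )
```

A cron job would call `snapshot_checksums` on each run, store the result, and pass the previous and current snapshots to `changed_files`. Because SSTables are not modified after they are written, any path it reports would point at a change made underneath Cassandra.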
[jira] [Commented] (CASSANDRA-15274) Multiple Corrupt datafiles across entire environment
[ https://issues.apache.org/jira/browse/CASSANDRA-15274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914135#comment-16914135 ]

Phil O Conduin commented on CASSANDRA-15274:

Hi [~benedict],

We are having trouble building the code to bypass the CRC check with setCrcCheckChance. On the new build, when we run sstable2json it still hits the chunk exception. Any chance you could help us with our version of the code, 2.2.13?
[jira] [Commented] (CASSANDRA-15274) Multiple Corrupt datafiles across entire environment
[ https://issues.apache.org/jira/browse/CASSANDRA-15274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911135#comment-16911135 ]

feroz shaik commented on CASSANDRA-15274:

[~benedict] - can you please share the SSTableExport.java (with CRC disabled) that is 2.2 compatible, if you have it handy? I was trying to add that line as you suggested but had some issues.
[jira] [Commented] (CASSANDRA-15274) Multiple Corrupt datafiles across entire environment
[ https://issues.apache.org/jira/browse/CASSANDRA-15274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908901#comment-16908901 ]

Benedict commented on CASSANDRA-15274:

You may need to modify the code in {{SSTableExport}} to include the line {{metadata.compressionParameters.setCrcCheckChance(0);}} at the start of {{export}}.
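The effect of setting the CRC check chance to 0 can be illustrated with a small standalone sketch. This is Python and not Cassandra's reader code: `read_chunk`, `ChecksumMismatch`, and the use of zlib with CRC32 are assumptions chosen for illustration, and the real on-disk format differs. The idea shown is only that a checksum is verified with a configurable probability, and that with the probability at 0 a damaged chunk gets past the check and reaches the decompressor.

```python
import random
import zlib


class ChecksumMismatch(Exception):
    """Raised when a chunk's stored checksum does not match its bytes."""


def read_chunk(compressed, stored_crc, crc_check_chance, rng=random):
    """Decompress a chunk, verifying its checksum with the given probability."""
    # rng.random() is in [0.0, 1.0), so a chance of 0 never checks
    # and a chance of 1 always checks.
    if rng.random() < crc_check_chance:
        if zlib.crc32(compressed) != stored_crc:
            raise ChecksumMismatch("chunk checksum does not match")
    # With the check skipped, a damaged chunk is handed to the decompressor,
    # which may then fail on its own.
    return zlib.decompress(compressed)
```

With a chance of 1.0 a damaged chunk fails at the checksum comparison; with a chance of 0 the same chunk is passed to the decompressor and can fail there instead. That would be consistent with a tool that still reports an error on a corrupt file after the CRC check has been disabled.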
[jira] [Commented] (CASSANDRA-15274) Multiple Corrupt datafiles across entire environment
[ https://issues.apache.org/jira/browse/CASSANDRA-15274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908894#comment-16908894 ]

Vladimir Vavro commented on CASSANDRA-15274:

Since the affected version is 2.2.x, sstabledump is not available, but sstable2json is. We tried to export one file and the attempt failed. Judging by this part of the error message, it again failed during the CRC check:

Caused by: org.apache.cassandra.io.compress.CorruptBlockException: (/data/ssd2/data/KeyspaceMetadata/CF_ConversationIndex1-1e77be609c7911e8ac12255de1fb512a/lb-10664-big-Data.db): corruption detected, chunk at 7392105638 of length 35173.
    at org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBufferMmap(CompressedRandomAccessReader.java:185)

Is it possible that sstable2json uses the same code to handle the data as Cassandra normally does? If so, is it different for the newer utilities sstableexport/sstabledump?
[jira] [Commented] (CASSANDRA-15274) Multiple Corrupt datafiles across entire environment
[ https://issues.apache.org/jira/browse/CASSANDRA-15274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907229#comment-16907229 ]

Benedict commented on CASSANDRA-15274:

"if they print their entire contents successfully there's already a reasonable chance that the data is not corrupted"

This comment was alluding to that likelihood: with a truly corrupted stream we would most likely fail to parse the data long before we printed any garbage out. If we manage to print the data out, and we do this for every "corrupted" block (and there are many of them), it becomes very likely that the files aren't truly corrupted.
[jira] [Commented] (CASSANDRA-15274) Multiple Corrupt datafiles across entire environment
[ https://issues.apache.org/jira/browse/CASSANDRA-15274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907212#comment-16907212 ]

Vladimir Vavro commented on CASSANDRA-15274:

If it is possible to export a suspicious sstable into JSON format, the challenge might be to verify whether the exported data are valid or corrupted. However, my understanding is that for the export tool the CRC check is optional, whereas decompression obviously is not. If the binary data are corrupted before decompression, shouldn't we see binary garbage in the JSON output?
[jira] [Commented] (CASSANDRA-15274) Multiple Corrupt datafiles across entire environment
[ https://issues.apache.org/jira/browse/CASSANDRA-15274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906354#comment-16906354 ]

Benedict commented on CASSANDRA-15274:

{{sstableexport}} / {{sstabledump}} are your friend here: pick a corrupted sstable and print its contents. I'm pretty sure that by default these tools do not verify the checksum, so if they print their entire contents successfully there's already a reasonable chance that the data is not corrupted. To be sure, export the data for the same partition keys from sstables on other nodes and compare the results: if the same data is produced, that gives high confidence that the data in the files is still valid.

This isn't quite as simple as it sounds, as there could be many records, many of which are not contained in corrupt blocks, so it would be easier to modify {{sstableexport}} to detect the specifically corrupted blocks and print only the data contained within them. There's also the problem that compaction can lead to different data on each node. But picking a large and old sstable may give you a chance of fairly similar data residing on each node in comparable sstables.
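The cross-node comparison suggested in this comment could be sketched roughly as follows. This is only an illustration, not a tool from the thread: it assumes each export is a JSON array of objects with a "key" field and a "cells" list (the shape sstable2json is believed to emit on 2.x, which should be checked against real output), and the function names `load_export`, `index_by_key` and `compare_exports` are hypothetical.

```python
import json


def load_export(path):
    """Load one JSON export file (assumed: a list of {"key": ..., "cells": [...]})."""
    with open(path) as handle:
        return json.load(handle)


def index_by_key(rows):
    """Map each partition key to the set of its cells, serialized so they are hashable."""
    return {
        row["key"]: {json.dumps(cell, sort_keys=True) for cell in row["cells"]}
        for row in rows
    }


def compare_exports(local_rows, remote_rows):
    """Classify every partition key of the suspect export against another node's export."""
    local = index_by_key(local_rows)
    remote = index_by_key(remote_rows)
    report = {"match": [], "differ": [], "missing_remote": []}
    for key in sorted(local):
        if key not in remote:
            # Compaction may have placed this partition in a different sstable.
            report["missing_remote"].append(key)
        elif local[key] == remote[key]:
            report["match"].append(key)
        else:
            report["differ"].append(key)
    return report
```

Calling `compare_exports(load_export("suspect.json"), load_export("other_node.json"))` (both file names are placeholders) returns the keys grouped into matching, differing, and absent on the other node. Because compaction can leave different data in each node's sstables, a key in "differ" or "missing_remote" is a prompt for a closer look rather than proof of corruption.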
[jira] [Commented] (CASSANDRA-15274) Multiple Corrupt datafiles across entire environment
[ https://issues.apache.org/jira/browse/CASSANDRA-15274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906344#comment-16906344 ]

Phil O Conduin commented on CASSANDRA-15274:

[~benedict] thanks a lot for the explanation. We also have a ticket open with Cisco for help on this.

Can you explain a little more about how we can check for actual corruption? How would I go about comparing the data written to the files?
[jira] [Commented] (CASSANDRA-15274) Multiple Corrupt datafiles across entire environment
[ https://issues.apache.org/jira/browse/CASSANDRA-15274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906288#comment-16906288 ]

Benedict commented on CASSANDRA-15274:

This error is _very_ suggestive of actual data file corruption, independent of C*. This exception is thrown only when the raw data for a block, whose checksum was computed on write, no longer produces the same checksum. C* never modifies a file once written, so in particular, if these errors are being encountered for the first time against sstables that are older than your last successful repair, we can essentially guarantee that the problem is with your system and not C*. How certain are you that your disks are reliable?

You can try to rule out actual corruption by comparing the contents of data written to the files reporting these failures with the same data as it exists on other nodes in the cluster (whether or not the files on the other nodes report these errors).
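The write-time checksum idea described in this comment can be illustrated with a minimal sketch. It is not Cassandra's on-disk format or necessarily its checksum algorithm: the chunk size, the use of `zlib.crc32`, and the function names `write_chunks` and `find_corrupt_chunks` are all assumptions made for the example.

```python
import zlib

CHUNK_SIZE = 64 * 1024  # illustrative chunk size, not taken from the thread


def write_chunks(data, chunk_size=CHUNK_SIZE):
    """Split data into chunks and record a checksum for each chunk at write time."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return [(chunk, zlib.crc32(chunk)) for chunk in chunks]


def find_corrupt_chunks(stored):
    """Return the indexes of chunks whose bytes no longer match the recorded checksum."""
    return [
        index
        for index, (chunk, checksum) in enumerate(stored)
        if zlib.crc32(chunk) != checksum
    ]
```

Since the file is never rewritten after the checksums are recorded, a mismatch on read means the stored bytes changed somewhere between the writer and the reader (disk, controller, memory, or transfer), which is why such an error points away from the database itself.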
[jira] [Commented] (CASSANDRA-15274) Multiple Corrupt datafiles across entire environment
[ https://issues.apache.org/jira/browse/CASSANDRA-15274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905808#comment-16905808 ]

feroz shaik commented on CASSANDRA-15274:

Thank you [~philoconduin]. I just want to add the other things we have already checked for this problem, so that the community is aware of them.

# Power disruptions, if any - Nothing of that sort reported by the infra team.
# Storage-related glitches/issues - Nothing.
# Network issues - Nothing (we have not looked in detail at packet captures, drops, etc., but monitoring shows it is clean).
# Schema change - It was reported on some forum that dropping a column and re-creating it with a different datatype could cause corruption. We checked this, and there was no such schema change on the cluster.
# CRC check - This is something we are still investigating. If the CRC check were not working properly, why would it fail only for certain data files and not all of them? From what we have seen, the corruption can be on any CF, with no pattern tied to a particular compaction strategy, etc.

Another important consideration is our PROD env, which is the same as PRE-PROD in terms of infrastructure, C* config setup and schema. The only difference is the amount of data residing there: only 6-10 GB on average, compared to about 200 GB on average on pre-prod. We do not have any issues there (PROD).