[jira] [Updated] (CASSANDRA-7704) FileNotFoundException during STREAM-OUT triggers 100% CPU usage
[ https://issues.apache.org/jira/browse/CASSANDRA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7704: Fix Version/s: (was: 2.1.0) (was: 2.0.10) 2.1.3 2.0.12 > FileNotFoundException during STREAM-OUT triggers 100% CPU usage > --- > > Key: CASSANDRA-7704 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7704 > Project: Cassandra > Issue Type: Bug >Reporter: Rick Branson >Assignee: Benedict > Fix For: 2.0.12, 2.1.3 > > Attachments: 7704-2.1.txt, 7704.txt, backtrace.txt, other-errors.txt > > > See attached backtrace which was what triggered this. This stream failed and > then ~12 seconds later it emitted that exception. At that point, all CPUs > went to 100%. A thread dump shows all the ReadStage threads stuck inside > IntervalTree.searchInternal inside of CFS.markReferenced(). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-7704) FileNotFoundException during STREAM-OUT triggers 100% CPU usage
[ https://issues.apache.org/jira/browse/CASSANDRA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7704: Attachment: (was: 7704.20.v2.txt) > FileNotFoundException during STREAM-OUT triggers 100% CPU usage > --- > > Key: CASSANDRA-7704 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7704 > Project: Cassandra > Issue Type: Bug >Reporter: Rick Branson >Assignee: Benedict > Fix For: 2.0.10, 2.1.0 > > Attachments: 7704-2.1.txt, 7704.txt, backtrace.txt, other-errors.txt > > > See attached backtrace which was what triggered this. This stream failed and > then ~12 seconds later it emitted that exception. At that point, all CPUs > went to 100%. A thread dump shows all the ReadStage threads stuck inside > IntervalTree.searchInternal inside of CFS.markReferenced(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-7704) FileNotFoundException during STREAM-OUT triggers 100% CPU usage
[ https://issues.apache.org/jira/browse/CASSANDRA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7704: Attachment: 7704-2.1.txt Attaching a new version which does not cancel the task that was run, and updates the unit tests to match the new behaviour > FileNotFoundException during STREAM-OUT triggers 100% CPU usage > --- > > Key: CASSANDRA-7704 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7704 > Project: Cassandra > Issue Type: Bug >Reporter: Rick Branson >Assignee: Benedict > Fix For: 2.0.10, 2.1.0 > > Attachments: 7704-2.1.txt, 7704.txt, backtrace.txt, other-errors.txt > > > See attached backtrace which was what triggered this. This stream failed and > then ~12 seconds later it emitted that exception. At that point, all CPUs > went to 100%. A thread dump shows all the ReadStage threads stuck inside > IntervalTree.searchInternal inside of CFS.markReferenced(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-7704) FileNotFoundException during STREAM-OUT triggers 100% CPU usage
[ https://issues.apache.org/jira/browse/CASSANDRA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7704: Fix Version/s: 2.1.0 2.0.10 > FileNotFoundException during STREAM-OUT triggers 100% CPU usage > --- > > Key: CASSANDRA-7704 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7704 > Project: Cassandra > Issue Type: Bug >Reporter: Rick Branson >Assignee: Benedict > Fix For: 2.0.10, 2.1.0 > > Attachments: 7704.20.v2.txt, 7704.txt, backtrace.txt, other-errors.txt > > > See attached backtrace which was what triggered this. This stream failed and > then ~12 seconds later it emitted that exception. At that point, all CPUs > went to 100%. A thread dump shows all the ReadStage threads stuck inside > IntervalTree.searchInternal inside of CFS.markReferenced(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-7704) FileNotFoundException during STREAM-OUT triggers 100% CPU usage
[ https://issues.apache.org/jira/browse/CASSANDRA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7704: Attachment: 7704.20.v2.txt FTR, there was a (probably innocuous) mistake in that patch; fixed version attached. > FileNotFoundException during STREAM-OUT triggers 100% CPU usage > --- > > Key: CASSANDRA-7704 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7704 > Project: Cassandra > Issue Type: Bug >Reporter: Rick Branson >Assignee: Benedict > Attachments: 7704.20.v2.txt, 7704.txt, backtrace.txt, other-errors.txt > > > See attached backtrace which was what triggered this. This stream failed and > then ~12 seconds later it emitted that exception. At that point, all CPUs > went to 100%. A thread dump shows all the ReadStage threads stuck inside > IntervalTree.searchInternal inside of CFS.markReferenced(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-7704) FileNotFoundException during STREAM-OUT triggers 100% CPU usage
[ https://issues.apache.org/jira/browse/CASSANDRA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Branson updated CASSANDRA-7704: Attachment: other-errors.txt There wasn't anything in the logs that indicated *why* the failure happened. I attached anything that looked suspect. The IndexOutOfBoundsException occurred on the bootstrapping node *after* the stream failure occurred on the node that was streaming out. There was a CompactionTask that ran at 2014-08-05 18:00:25,804 (4 minutes before the StreamOut task) that tried to compact the SSTable referenced in the FileNotFoundException. There were no other log messages related to that file, though. > FileNotFoundException during STREAM-OUT triggers 100% CPU usage > --- > > Key: CASSANDRA-7704 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7704 > Project: Cassandra > Issue Type: Bug >Reporter: Rick Branson >Assignee: Benedict > Attachments: 7704.txt, backtrace.txt, other-errors.txt > > > See attached backtrace which was what triggered this. This stream failed and > then ~12 seconds later it emitted that exception. At that point, all CPUs > went to 100%. A thread dump shows all the ReadStage threads stuck inside > IntervalTree.searchInternal inside of CFS.markReferenced(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-7704) FileNotFoundException during STREAM-OUT triggers 100% CPU usage
[ https://issues.apache.org/jira/browse/CASSANDRA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7704: Attachment: 7704.txt Attaching a patch that I think addresses this. There are a number of concurrency bugs here, and whilst we could fix them with more advanced lock-free techniques, there is no compelling reason this class doesn't use synchronized everywhere, which would probably have avoided this problem in the first place. There is only one place where the execution is not guaranteed to be prompt, and I have left this out of the synchronization. At the same time I have simplified the logic, fixed the logic for cancelling timeouts, and made the scheduled executor for timeouts globally shared (there's no good reason to spin up a new executor for each set of transfers). In this particular instance the issue seems to have been a lack of atomicity between abort() and complete(); an ACK arrived at the same time as abort() was cancelling all transfers, causing a reference to be released twice. This could also occur with the timeouts, but since they occur only every 12hrs, the risk is low. > FileNotFoundException during STREAM-OUT triggers 100% CPU usage > --- > > Key: CASSANDRA-7704 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7704 > Project: Cassandra > Issue Type: Bug >Reporter: Rick Branson > Attachments: 7704.txt, backtrace.txt > > > See attached backtrace which was what triggered this. This stream failed and > then ~12 seconds later it emitted that exception. At that point, all CPUs > went to 100%. A thread dump shows all the ReadStage threads stuck inside > IntervalTree.searchInternal inside of CFS.markReferenced(). -- This message was sent by Atlassian JIRA (v6.2#6252)
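For readers following the thread, the approach described in the first patch comment (make abort() and complete() mutually atomic with synchronized, and share one timeout executor) might look roughly like the sketch below. This is not the attached patch and not Cassandra's code: RefCounted, TransferTracker, and every method and field name here are hypothetical and exist only to illustrate the idea, and the timeout action (treating a timed-out transfer like a completed one) is an assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a reference-counted file; not a Cassandra class. */
class RefCounted {
    final AtomicInteger refs = new AtomicInteger(1);

    void release() {
        // A count below zero means the same reference was released twice.
        if (refs.decrementAndGet() < 0)
            throw new IllegalStateException("reference released twice");
    }
}

/** Hypothetical tracker for in-flight transfers; names are illustrative only. */
class TransferTracker {
    // One daemon executor shared by every tracker, instead of one per set of transfers.
    private static final ScheduledExecutorService TIMEOUTS =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "transfer-timeouts");
            t.setDaemon(true);
            return t;
        });

    private static final class Entry {
        final RefCounted file;
        final ScheduledFuture<?> timeout;

        Entry(RefCounted file, ScheduledFuture<?> timeout) {
            this.file = file;
            this.timeout = timeout;
        }
    }

    private final Map<Integer, Entry> inFlight = new HashMap<>();

    synchronized void add(int id, RefCounted file, long timeout, TimeUnit unit) {
        // Assumption: a timed-out transfer is handled like a completed one.
        ScheduledFuture<?> f = TIMEOUTS.schedule(() -> complete(id), timeout, unit);
        inFlight.put(id, new Entry(file, f));
    }

    /** Called when an ACK arrives. Removing the entry first makes the release happen at most once. */
    synchronized void complete(int id) {
        Entry e = inFlight.remove(id);
        if (e == null)
            return; // already completed, timed out, or aborted
        e.timeout.cancel(false);
        e.file.release();
    }

    /** Cancels every remaining transfer. It holds the same lock as complete(), so the two cannot interleave. */
    synchronized void abort() {
        for (Entry e : inFlight.values()) {
            e.timeout.cancel(false);
            e.file.release();
        }
        inFlight.clear();
    }
}
```

Because complete() and abort() hold the same monitor and each removes an entry before (or together with) releasing it, an ACK racing with an abort can release a given reference only once, whichever call wins the lock.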