[jira] [Commented] (CASSANDRA-12489) consecutive repairs of same range always finds 'out of sync' in sane cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900951#comment-15900951 ]

Benjamin Roth commented on CASSANDRA-12489:
-------------------------------------------

Awesome. I see that a lot of work has been done by thelastpickle since I initially forked it from spotify (which didn't support 3.x back then). A simple changelog.md would be even more awesome, to see whether there have been important changes.

> consecutive repairs of same range always finds 'out of sync' in sane cluster
>
>                 Key: CASSANDRA-12489
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12489
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Streaming and Messaging
>            Reporter: Benjamin Roth
>            Assignee: Benjamin Roth
>              Labels: lhf
>         Attachments: trace_3_10.1.log.gz, trace_3_10.2.log.gz, trace_3_10.3.log.gz, trace_3_10.4.log.gz, trace_3_9.1.log.gz, trace_3_9.2.log.gz
>
> No matter how often or when I run the same subrange repair, it ALWAYS tells me that some ranges are out of sync. Tested in 3.9 + 3.10 (git trunk of 2016-08-17). The cluster is sane: all nodes are up and the cluster is not overloaded.
> I guess this is not desired behaviour. I'd expect a repair to do what it says, and a consecutive repair shouldn't report any "out of sync" ranges any more if the cluster is sane.
> Especially for tables with MVs, this puts a lot of pressure on the cluster during repair, as ranges are repaired over and over again.
> See traces of different runs attached.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12489) consecutive repairs of same range always finds 'out of sync' in sane cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900209#comment-15900209 ]

Kurt Greaves commented on CASSANDRA-12489:
-------------------------------------------

Reaper supports incremental repairs as well, without splitting the ranges (or at least that's what I'm led to believe). In that case it should simply schedule repairs so they don't occur at the same time. As Marcus says, incremental repair reduces the amount of data that needs to be repaired, while splitting the ranges still repairs all the data, just in a more manageable fashion.
[jira] [Commented] (CASSANDRA-12489) consecutive repairs of same range always finds 'out of sync' in sane cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898060#comment-15898060 ]

Marcus Eriksson commented on CASSANDRA-12489:
---------------------------------------------

The idea is that with incremental repairs we don't need to use the tools that split the ranges; instead, the amount of data to repair stays small, since we only include unrepaired data.
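The selection rule Marcus describes (incremental repair only considers unrepaired data) can be sketched in a few lines. This is a toy model, not Cassandra code: the `SSTable` class and helper below are hypothetical, though the convention that a repairedAt of 0 means "unrepaired" mirrors the field Cassandra keeps in sstable metadata.

```python
# Toy model of incremental repair's candidate selection.
# Assumption: repaired_at == 0 marks an sstable as unrepaired,
# mirroring the repairedAt field in Cassandra's sstable metadata.
UNREPAIRED = 0

class SSTable:
    def __init__(self, name, repaired_at=UNREPAIRED):
        self.name = name
        self.repaired_at = repaired_at

def sstables_for_incremental_repair(sstables):
    """Incremental repair builds Merkle trees over unrepaired sstables only."""
    return [s for s in sstables if s.repaired_at == UNREPAIRED]

tables = [
    SSTable("a", repaired_at=1472681000),  # already repaired, skipped
    SSTable("b"),
    SSTable("c"),
]
print([s.name for s in sstables_for_incremental_repair(tables)])  # ['b', 'c']
```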
[jira] [Commented] (CASSANDRA-12489) consecutive repairs of same range always finds 'out of sync' in sane cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898051#comment-15898051 ]

Benjamin Roth commented on CASSANDRA-12489:
-------------------------------------------

Thanks for the answer, that's what I thought. But what reason do incremental repairs then have to exist in the real world, if (most, many, whatever) people use a tool that makes repairs manageable and thereby eliminates this case? The use case and the real benefit are quite limited then, aren't they? That's probably a philosophical question, but I'm curious what others think about it, and whether I am missing a valuable use case.
[jira] [Commented] (CASSANDRA-12489) consecutive repairs of same range always finds 'out of sync' in sane cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898043#comment-15898043 ]

Marcus Eriksson commented on CASSANDRA-12489:
---------------------------------------------

Yeah, typically people use tools like Spotify's Reaper, which splits the node's range into n (1000?) parts. If we have an sstable that covers the full range of the node, we would rewrite it n times: we write each repaired range into a new sstable, and the unrepaired parts get written to another sstable (and that sstable gets rewritten on the next repair, and so on).
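The range splitting Marcus refers to can be sketched as follows. `split_range` is a hypothetical helper that assumes a non-wrapping, linear token space; real Reaper also has to handle ring wraparound and vnodes.

```python
def split_range(start, end, n):
    """Split the token range (start, end] into n contiguous subranges,
    Reaper-style. Assumes a non-wrapping range on a linear token space
    (a simplification; the real tool handles ring wraparound too)."""
    step = (end - start) / n
    bounds = [start + round(i * step) for i in range(n)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))

# Each subrange is then repaired in its own session, one after another.
parts = split_range(0, 1000, 8)
```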
[jira] [Commented] (CASSANDRA-12489) consecutive repairs of same range always finds 'out of sync' in sane cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898028#comment-15898028 ]

Benjamin Roth commented on CASSANDRA-12489:
-------------------------------------------

May I ask what the reason is that incremental + subrange repair doesn't do anticompaction? Is it because anticompaction is too expensive in this case? Or, to put it differently: is a subrange full repair cheaper than a subrange incremental repair with anticompaction?
[jira] [Commented] (CASSANDRA-12489) consecutive repairs of same range always finds 'out of sync' in sane cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15455756#comment-15455756 ]

Marcus Eriksson commented on CASSANDRA-12489:
---------------------------------------------

Hmm, we should probably not mark streamed sstables as repaired in this case.
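Marcus's suggestion, expressed as a toy decision rule (the function name and parameters are hypothetical, not the actual Cassandra API): sstables streamed in by a subrange repair session would keep repairedAt = 0, so the next incremental repair still includes them.

```python
# Sketch of the suggested fix; hypothetical helper, not Cassandra code.
UNREPAIRED = 0

def repaired_at_for_stream(session_repaired_at, is_subrange_session):
    """Decide the repairedAt value for an sstable received via streaming.
    For a subrange (partial-range) session, keep it unrepaired so the
    next incremental repair still covers it; otherwise inherit the
    session's repairedAt timestamp."""
    if is_subrange_session:
        return UNREPAIRED
    return session_repaired_at
```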
[jira] [Commented] (CASSANDRA-12489) consecutive repairs of same range always finds 'out of sync' in sane cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453736#comment-15453736 ]

Paulo Motta commented on CASSANDRA-12489:
-----------------------------------------

It seems this problem is due to the combined use of incremental and subrange repair. On one hand, subrange incremental repair does not mark the original sstables as repaired (CASSANDRA-10422); on the other hand, incremental repair marks streamed sstables as repaired. So on the next execution of subrange incremental repair, the mismatch will persist.

We should therefore either:
A) Disable the combination of incremental + subrange repair, or
B) Mark sstables originating from incremental subrange repair as unrepaired.

WDYT [~krummas]?
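The feedback loop Paulo describes can be simulated with a toy model in which each replica's sstables are plain (repairedAt, keys) pairs; all names here are illustrative, not Cassandra code. The originals on replica a stay unrepaired while the streamed copies on replica b are marked repaired, so every run re-detects the same mismatch and re-streams the same data:

```python
UNREPAIRED = 0

def unrepaired_keys(sstables):
    # Merkle trees for incremental repair cover only unrepaired sstables.
    keys = set()
    for repaired_at, k in sstables:
        if repaired_at == UNREPAIRED:
            keys |= k
    return keys

def subrange_incremental_repair(a, b, now):
    """Toy model of the buggy behaviour: the originals are NOT marked
    repaired (CASSANDRA-10422), but the streamed copies ARE."""
    mismatch = unrepaired_keys(a) ^ unrepaired_keys(b)
    missing_on_b = unrepaired_keys(a) - unrepaired_keys(b)
    if missing_on_b:
        b.append((now, set(missing_on_b)))  # streamed sstable marked repaired
    return bool(mismatch)

a = [(UNREPAIRED, {"k1", "k2"})]  # replica a: original, unrepaired sstable
b = []                            # replica b: starts empty
first = subrange_incremental_repair(a, b, now=1)
second = subrange_incremental_repair(a, b, now=2)
# Both runs report out of sync, even though b already holds the data
# after the first one - it is just invisible to the next comparison.
```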