[jira] [Commented] (CASSANDRA-13797) RepairJob blocks on syncTasks
[ https://issues.apache.org/jira/browse/CASSANDRA-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398846#comment-16398846 ] Blake Eggleston commented on CASSANDRA-13797: - Agreed this should be a new ticket. Maybe the right fix is to backport CASSANDRA-13521 with concurrent validations set to something reasonable? I think the incorrect assumption of this ticket that's causing problems was that the validation executor was bounded, which obviously isn't the case. > RepairJob blocks on syncTasks > - > > Key: CASSANDRA-13797 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13797 > Project: Cassandra > Issue Type: Bug > Components: Repair >Reporter: Blake Eggleston >Assignee: Blake Eggleston >Priority: Major > Fix For: 3.0.15, 3.11.1, 4.0 > > > The thread running {{RepairJob}} blocks while it waits for the validations it > starts to complete ([see > here|https://github.com/bdeggleston/cassandra/blob/9fdec0a82851f5c35cd21d02e8c4da8fc685edb2/src/java/org/apache/cassandra/repair/RepairJob.java#L185]). > However, the downstream callbacks (ie: the post-repair cleanup stuff) aren't > waiting for {{RepairJob#run}} to return, they're waiting for a result to be > set on RepairJob the future, which happens after the sync tasks have > completed. This post repair cleanup stuff also immediately shuts down the > executor {{RepairJob#run}} is running in. So in noop repair sessions, where > there's nothing to stream, I'm seeing the callbacks sometimes fire before > {{RepairJob#run}} wakes up, and causing an {{InterruptedException}} is thrown. > I'm pretty sure this can just be removed, but I'd like a second opinion. This > appears to just be a holdover from before repair coordination became async. I > thought it might be doing some throttling by blocking, but each repair > session gets it's own executor, and validation is throttled by the fixed > size executors doing the actual work of validation, so I don't think we need > to keep this around. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13797) RepairJob blocks on syncTasks
[ https://issues.apache.org/jira/browse/CASSANDRA-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398542#comment-16398542 ] Jeremiah Jordan commented on CASSANDRA-13797: - [~KurtG] when a ticket has already gone out in a release we prefer to open a brand new ticket rather than re-open an existing one, as fix versions get confusing if you re-open and already released ticket. > RepairJob blocks on syncTasks > - > > Key: CASSANDRA-13797 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13797 > Project: Cassandra > Issue Type: Bug > Components: Repair >Reporter: Blake Eggleston >Assignee: Blake Eggleston >Priority: Major > Fix For: 3.0.15, 3.11.1, 4.0 > > > The thread running {{RepairJob}} blocks while it waits for the validations it > starts to complete ([see > here|https://github.com/bdeggleston/cassandra/blob/9fdec0a82851f5c35cd21d02e8c4da8fc685edb2/src/java/org/apache/cassandra/repair/RepairJob.java#L185]). > However, the downstream callbacks (ie: the post-repair cleanup stuff) aren't > waiting for {{RepairJob#run}} to return, they're waiting for a result to be > set on RepairJob the future, which happens after the sync tasks have > completed. This post repair cleanup stuff also immediately shuts down the > executor {{RepairJob#run}} is running in. So in noop repair sessions, where > there's nothing to stream, I'm seeing the callbacks sometimes fire before > {{RepairJob#run}} wakes up, and causing an {{InterruptedException}} is thrown. > I'm pretty sure this can just be removed, but I'd like a second opinion. This > appears to just be a holdover from before repair coordination became async. I > thought it might be doing some throttling by blocking, but each repair > session gets it's own executor, and validation is throttled by the fixed > size executors doing the actual work of validation, so I don't think we need > to keep this around. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13797) RepairJob blocks on syncTasks
[ https://issues.apache.org/jira/browse/CASSANDRA-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397897#comment-16397897 ] Kurt Greaves commented on CASSANDRA-13797: -- This issue is actually pretty serious for anyone running vnodes and a mid sized cluster. This isn't the first time we've had unbounded validation compactions kicked off simultaneously and it's caused a lot of problems at Instaclustr in the past. We should really fix this by 3.11.3 because it easily causes massive latency spikes whenever a repair kicks off due to validations taking up all the CPU. I'd like a simple revert but that doesn't fix the issue in the description. Don't think this warrants a new ticket so I think reopening this one is in order. [~bdeggleston] [~krummas] WDYT? > RepairJob blocks on syncTasks > - > > Key: CASSANDRA-13797 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13797 > Project: Cassandra > Issue Type: Bug > Components: Repair >Reporter: Blake Eggleston >Assignee: Blake Eggleston >Priority: Major > Fix For: 3.0.15, 3.11.1, 4.0 > > > The thread running {{RepairJob}} blocks while it waits for the validations it > starts to complete ([see > here|https://github.com/bdeggleston/cassandra/blob/9fdec0a82851f5c35cd21d02e8c4da8fc685edb2/src/java/org/apache/cassandra/repair/RepairJob.java#L185]). > However, the downstream callbacks (ie: the post-repair cleanup stuff) aren't > waiting for {{RepairJob#run}} to return, they're waiting for a result to be > set on RepairJob the future, which happens after the sync tasks have > completed. This post repair cleanup stuff also immediately shuts down the > executor {{RepairJob#run}} is running in. So in noop repair sessions, where > there's nothing to stream, I'm seeing the callbacks sometimes fire before > {{RepairJob#run}} wakes up, and causing an {{InterruptedException}} is thrown. > I'm pretty sure this can just be removed, but I'd like a second opinion. This > appears to just be a holdover from before repair coordination became async. I > thought it might be doing some throttling by blocking, but each repair > session gets it's own executor, and validation is throttled by the fixed > size executors doing the actual work of validation, so I don't think we need > to keep this around. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13797) RepairJob blocks on syncTasks
[ https://issues.apache.org/jira/browse/CASSANDRA-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383119#comment-16383119 ] Vincent White commented on CASSANDRA-13797: --- After upgrading an ~18 node, vnode multi-DC cluster from 3.11.0 to 3.11.1 it started seeing some nodes running hundreds of concurrent validation compactions, rolling back it went back to 1 concurrent validation per CF. I haven't had a chance to reproduce it at that scale but my locale testing show that if I have enough data, or just add a sleep(999) to validation compactions to simulate long validations, they continue to accumulate over a few seconds until the repair session has looped through all the common ranges. > RepairJob blocks on syncTasks > - > > Key: CASSANDRA-13797 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13797 > Project: Cassandra > Issue Type: Bug > Components: Repair >Reporter: Blake Eggleston >Assignee: Blake Eggleston >Priority: Major > Fix For: 3.0.15, 3.11.1, 4.0 > > > The thread running {{RepairJob}} blocks while it waits for the validations it > starts to complete ([see > here|https://github.com/bdeggleston/cassandra/blob/9fdec0a82851f5c35cd21d02e8c4da8fc685edb2/src/java/org/apache/cassandra/repair/RepairJob.java#L185]). > However, the downstream callbacks (ie: the post-repair cleanup stuff) aren't > waiting for {{RepairJob#run}} to return, they're waiting for a result to be > set on RepairJob the future, which happens after the sync tasks have > completed. This post repair cleanup stuff also immediately shuts down the > executor {{RepairJob#run}} is running in. So in noop repair sessions, where > there's nothing to stream, I'm seeing the callbacks sometimes fire before > {{RepairJob#run}} wakes up, and causing an {{InterruptedException}} is thrown. > I'm pretty sure this can just be removed, but I'd like a second opinion. This > appears to just be a holdover from before repair coordination became async. I > thought it might be doing some throttling by blocking, but each repair > session gets it's own executor, and validation is throttled by the fixed > size executors doing the actual work of validation, so I don't think we need > to keep this around. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13797) RepairJob blocks on syncTasks
[ https://issues.apache.org/jira/browse/CASSANDRA-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382950#comment-16382950 ] Blake Eggleston commented on CASSANDRA-13797: - [~VincentWhite] are you seeing this happen? It's been a little while since I've thought about this, but I do remember worrying about something like and eventually convincing myself that this runaway validation case you're describing is prevented by something else. > RepairJob blocks on syncTasks > - > > Key: CASSANDRA-13797 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13797 > Project: Cassandra > Issue Type: Bug > Components: Repair >Reporter: Blake Eggleston >Assignee: Blake Eggleston >Priority: Major > Fix For: 3.0.15, 3.11.1, 4.0 > > > The thread running {{RepairJob}} blocks while it waits for the validations it > starts to complete ([see > here|https://github.com/bdeggleston/cassandra/blob/9fdec0a82851f5c35cd21d02e8c4da8fc685edb2/src/java/org/apache/cassandra/repair/RepairJob.java#L185]). > However, the downstream callbacks (ie: the post-repair cleanup stuff) aren't > waiting for {{RepairJob#run}} to return, they're waiting for a result to be > set on RepairJob the future, which happens after the sync tasks have > completed. This post repair cleanup stuff also immediately shuts down the > executor {{RepairJob#run}} is running in. So in noop repair sessions, where > there's nothing to stream, I'm seeing the callbacks sometimes fire before > {{RepairJob#run}} wakes up, and causing an {{InterruptedException}} is thrown. > I'm pretty sure this can just be removed, but I'd like a second opinion. This > appears to just be a holdover from before repair coordination became async. I > thought it might be doing some throttling by blocking, but each repair > session gets it's own executor, and validation is throttled by the fixed > size executors doing the actual work of validation, so I don't think we need > to keep this around. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13797) RepairJob blocks on syncTasks
[ https://issues.apache.org/jira/browse/CASSANDRA-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382839#comment-16382839 ] Vincent White commented on CASSANDRA-13797: --- Now that we don't wait for the validations of each repair job to finish before moving onto the next one, I don't see anything to stop the repair coordinator from spinning through all the token ranges and effectively triggering all the validations tasks at once, which could be a significant amount of validation compactions on each node depending on your topology and common ranges for that keyspace. I'm also not sure of the overhead of creating all the futures/listeners on the coordinator at once in this case. In 3 the validation executor thread pool has no size limit so a new validation is started as soon as a validation request is received. I admit I haven't caught up on the changes to repair in trunk, and while the validation executor pool size is configurable in trunk, its default is still Integer.MAX_VALUE. I understand this same affect (hundreds of concurrent validations) can still happen if you trigger a repair across a keyspace with a large number of column families but with this change there is no way of avoiding it without using subrange repairs on a single column family (if you have a topology/replication that cant be merged into a small number of common ranges) . > RepairJob blocks on syncTasks > - > > Key: CASSANDRA-13797 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13797 > Project: Cassandra > Issue Type: Bug > Components: Repair >Reporter: Blake Eggleston >Assignee: Blake Eggleston >Priority: Major > Fix For: 3.0.15, 3.11.1, 4.0 > > > The thread running {{RepairJob}} blocks while it waits for the validations it > starts to complete ([see > here|https://github.com/bdeggleston/cassandra/blob/9fdec0a82851f5c35cd21d02e8c4da8fc685edb2/src/java/org/apache/cassandra/repair/RepairJob.java#L185]). > However, the downstream callbacks (ie: the post-repair cleanup stuff) aren't > waiting for {{RepairJob#run}} to return, they're waiting for a result to be > set on RepairJob the future, which happens after the sync tasks have > completed. This post repair cleanup stuff also immediately shuts down the > executor {{RepairJob#run}} is running in. So in noop repair sessions, where > there's nothing to stream, I'm seeing the callbacks sometimes fire before > {{RepairJob#run}} wakes up, and causing an {{InterruptedException}} is thrown. > I'm pretty sure this can just be removed, but I'd like a second opinion. This > appears to just be a holdover from before repair coordination became async. I > thought it might be doing some throttling by blocking, but each repair > session gets it's own executor, and validation is throttled by the fixed > size executors doing the actual work of validation, so I don't think we need > to keep this around. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13797) RepairJob blocks on syncTasks
[ https://issues.apache.org/jira/browse/CASSANDRA-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188407#comment-16188407 ] Jeff Jirsa commented on CASSANDRA-13797: Please keep an eye on fixvers. > RepairJob blocks on syncTasks > - > > Key: CASSANDRA-13797 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13797 > Project: Cassandra > Issue Type: Bug > Components: Repair >Reporter: Blake Eggleston >Assignee: Blake Eggleston > Fix For: 3.0.15, 3.11.1, 4.0 > > > The thread running {{RepairJob}} blocks while it waits for the validations it > starts to complete ([see > here|https://github.com/bdeggleston/cassandra/blob/9fdec0a82851f5c35cd21d02e8c4da8fc685edb2/src/java/org/apache/cassandra/repair/RepairJob.java#L185]). > However, the downstream callbacks (ie: the post-repair cleanup stuff) aren't > waiting for {{RepairJob#run}} to return, they're waiting for a result to be > set on RepairJob the future, which happens after the sync tasks have > completed. This post repair cleanup stuff also immediately shuts down the > executor {{RepairJob#run}} is running in. So in noop repair sessions, where > there's nothing to stream, I'm seeing the callbacks sometimes fire before > {{RepairJob#run}} wakes up, and causing an {{InterruptedException}} is thrown. > I'm pretty sure this can just be removed, but I'd like a second opinion. This > appears to just be a holdover from before repair coordination became async. I > thought it might be doing some throttling by blocking, but each repair > session gets it's own executor, and validation is throttled by the fixed > size executors doing the actual work of validation, so I don't think we need > to keep this around. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13797) RepairJob blocks on syncTasks
[ https://issues.apache.org/jira/browse/CASSANDRA-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164601#comment-16164601 ] Jeremiah Jordan commented on CASSANDRA-13797: - If adding 3.0 you also need 3.11 > RepairJob blocks on syncTasks > - > > Key: CASSANDRA-13797 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13797 > Project: Cassandra > Issue Type: Bug >Reporter: Blake Eggleston >Assignee: Blake Eggleston > Fix For: 3.0.x, 3.11.x, 4.x > > > The thread running {{RepairJob}} blocks while it waits for the validations it > starts to complete ([see > here|https://github.com/bdeggleston/cassandra/blob/9fdec0a82851f5c35cd21d02e8c4da8fc685edb2/src/java/org/apache/cassandra/repair/RepairJob.java#L185]). > However, the downstream callbacks (ie: the post-repair cleanup stuff) aren't > waiting for {{RepairJob#run}} to return, they're waiting for a result to be > set on RepairJob the future, which happens after the sync tasks have > completed. This post repair cleanup stuff also immediately shuts down the > executor {{RepairJob#run}} is running in. So in noop repair sessions, where > there's nothing to stream, I'm seeing the callbacks sometimes fire before > {{RepairJob#run}} wakes up, and causing an {{InterruptedException}} is thrown. > I'm pretty sure this can just be removed, but I'd like a second opinion. This > appears to just be a holdover from before repair coordination became async. I > thought it might be doing some throttling by blocking, but each repair > session gets it's own executor, and validation is throttled by the fixed > size executors doing the actual work of validation, so I don't think we need > to keep this around. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13797) RepairJob blocks on syncTasks
[ https://issues.apache.org/jira/browse/CASSANDRA-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16163095#comment-16163095 ] Blake Eggleston commented on CASSANDRA-13797: - Adding 3.0 branch & tests since it's also affected. [3.0|https://github.com/bdeggleston/cassandra/tree/13797-3.0] [utest|https://circleci.com/gh/bdeggleston/cassandra/117] [dtest|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/302/] > RepairJob blocks on syncTasks > - > > Key: CASSANDRA-13797 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13797 > Project: Cassandra > Issue Type: Bug >Reporter: Blake Eggleston >Assignee: Blake Eggleston > Fix For: 4.0 > > > The thread running {{RepairJob}} blocks while it waits for the validations it > starts to complete ([see > here|https://github.com/bdeggleston/cassandra/blob/9fdec0a82851f5c35cd21d02e8c4da8fc685edb2/src/java/org/apache/cassandra/repair/RepairJob.java#L185]). > However, the downstream callbacks (ie: the post-repair cleanup stuff) aren't > waiting for {{RepairJob#run}} to return, they're waiting for a result to be > set on RepairJob the future, which happens after the sync tasks have > completed. This post repair cleanup stuff also immediately shuts down the > executor {{RepairJob#run}} is running in. So in noop repair sessions, where > there's nothing to stream, I'm seeing the callbacks sometimes fire before > {{RepairJob#run}} wakes up, and causing an {{InterruptedException}} is thrown. > I'm pretty sure this can just be removed, but I'd like a second opinion. This > appears to just be a holdover from before repair coordination became async. I > thought it might be doing some throttling by blocking, but each repair > session gets it's own executor, and validation is throttled by the fixed > size executors doing the actual work of validation, so I don't think we need > to keep this around. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13797) RepairJob blocks on syncTasks
[ https://issues.apache.org/jira/browse/CASSANDRA-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16162185#comment-16162185 ] Blake Eggleston commented on CASSANDRA-13797: - rebased on latest trunk: https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/ > RepairJob blocks on syncTasks > - > > Key: CASSANDRA-13797 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13797 > Project: Cassandra > Issue Type: Bug >Reporter: Blake Eggleston >Assignee: Blake Eggleston > Fix For: 4.0 > > > The thread running {{RepairJob}} blocks while it waits for the validations it > starts to complete ([see > here|https://github.com/bdeggleston/cassandra/blob/9fdec0a82851f5c35cd21d02e8c4da8fc685edb2/src/java/org/apache/cassandra/repair/RepairJob.java#L185]). > However, the downstream callbacks (ie: the post-repair cleanup stuff) aren't > waiting for {{RepairJob#run}} to return, they're waiting for a result to be > set on RepairJob the future, which happens after the sync tasks have > completed. This post repair cleanup stuff also immediately shuts down the > executor {{RepairJob#run}} is running in. So in noop repair sessions, where > there's nothing to stream, I'm seeing the callbacks sometimes fire before > {{RepairJob#run}} wakes up, and causing an {{InterruptedException}} is thrown. > I'm pretty sure this can just be removed, but I'd like a second opinion. This > appears to just be a holdover from before repair coordination became async. I > thought it might be doing some throttling by blocking, but each repair > session gets it's own executor, and validation is throttled by the fixed > size executors doing the actual work of validation, so I don't think we need > to keep this around. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13797) RepairJob blocks on syncTasks
[ https://issues.apache.org/jira/browse/CASSANDRA-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146204#comment-16146204 ] Blake Eggleston commented on CASSANDRA-13797: - let's see if this one finishes: [dtest|https://issues.apache.org/jira/browse/CASSANDRA-13797] > RepairJob blocks on syncTasks > - > > Key: CASSANDRA-13797 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13797 > Project: Cassandra > Issue Type: Bug >Reporter: Blake Eggleston >Assignee: Blake Eggleston > Fix For: 4.0 > > > The thread running {{RepairJob}} blocks while it waits for the validations it > starts to complete ([see > here|https://github.com/bdeggleston/cassandra/blob/9fdec0a82851f5c35cd21d02e8c4da8fc685edb2/src/java/org/apache/cassandra/repair/RepairJob.java#L185]). > However, the downstream callbacks (ie: the post-repair cleanup stuff) aren't > waiting for {{RepairJob#run}} to return, they're waiting for a result to be > set on RepairJob the future, which happens after the sync tasks have > completed. This post repair cleanup stuff also immediately shuts down the > executor {{RepairJob#run}} is running in. So in noop repair sessions, where > there's nothing to stream, I'm seeing the callbacks sometimes fire before > {{RepairJob#run}} wakes up, and causing an {{InterruptedException}} is thrown. > I'm pretty sure this can just be removed, but I'd like a second opinion. This > appears to just be a holdover from before repair coordination became async. I > thought it might be doing some throttling by blocking, but each repair > session gets it's own executor, and validation is throttled by the fixed > size executors doing the actual work of validation, so I don't think we need > to keep this around. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13797) RepairJob blocks on syncTasks
[ https://issues.apache.org/jira/browse/CASSANDRA-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141688#comment-16141688 ] Marcus Eriksson commented on CASSANDRA-13797: - +1 > RepairJob blocks on syncTasks > - > > Key: CASSANDRA-13797 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13797 > Project: Cassandra > Issue Type: Bug >Reporter: Blake Eggleston >Assignee: Blake Eggleston > Fix For: 4.0 > > > The thread running {{RepairJob}} blocks while it waits for the validations it > starts to complete ([see > here|https://github.com/bdeggleston/cassandra/blob/9fdec0a82851f5c35cd21d02e8c4da8fc685edb2/src/java/org/apache/cassandra/repair/RepairJob.java#L185]). > However, the downstream callbacks (ie: the post-repair cleanup stuff) aren't > waiting for {{RepairJob#run}} to return, they're waiting for a result to be > set on RepairJob the future, which happens after the sync tasks have > completed. This post repair cleanup stuff also immediately shuts down the > executor {{RepairJob#run}} is running in. So in noop repair sessions, where > there's nothing to stream, I'm seeing the callbacks sometimes fire before > {{RepairJob#run}} wakes up, and causing an {{InterruptedException}} is thrown. > I'm pretty sure this can just be removed, but I'd like a second opinion. This > appears to just be a holdover from before repair coordination became async. I > thought it might be doing some throttling by blocking, but each repair > session gets it's own executor, and validation is throttled by the fixed > size executors doing the actual work of validation, so I don't think we need > to keep this around. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org