[jira] [Commented] (CASSANDRA-12245) initial view build can be parallel
[ https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274436#comment-16274436 ] Andrés de la Peña commented on CASSANDRA-12245: --- Committed to trunk as [4c80eeece37d79f434078224a0504400ae10a20d|https://github.com/apache/cassandra/commit/4c80eeece37d79f434078224a0504400ae10a20d]. Thanks for the thorough review :) > initial view build can be parallel > -- > > Key: CASSANDRA-12245 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12245 > Project: Cassandra > Issue Type: Improvement > Components: Materialized Views >Reporter: Tom van der Woerdt >Assignee: Andrés de la Peña > Fix For: 4.x > > > On a node with lots of data (~3TB) building a materialized view takes several > weeks, which is not ideal. It's doing this in a single thread. > There are several potential ways this can be optimized : > * do vnodes in parallel, instead of going through the entire range in one > thread > * just iterate through sstables, not worrying about duplicates, and include > the timestamp of the original write in the MV mutation. since this doesn't > exclude duplicates it does increase the amount of work and could temporarily > surface ghost rows (yikes) but I guess that's why they call it eventual > consistency. doing it this way can avoid holding references to all tables on > disk, allows parallelization, and removes the need to check other sstables > for existing data. this is essentially the 'do a full repair' path -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-12245) initial view build can be parallel
[ https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274261#comment-16274261 ] Andrés de la Peña commented on CASSANDRA-12245: --- bq. One thing I noticed is that even though the builder task and the view builder is aborted, the other tasks of the same builder keep running. At least until we have the ability to start and stop view builders, I think that stopping a subtask should also abort the other subtasks of the same view builder - since the view builder will not complete anyway. What do you think? I've done this [here|https://github.com/pauloricardomg/cassandra/commit/81853218eee702b778ba801426ba19d48336cf77] and the tests didn't need any change. I've also extended {{SplitterTest}} with a couple more test cases here. Makes sense. I started with a similar solution but changed to stop only the specified task in an attempt to better fulfill the {{nodetool stop}} documentation. But it's true that stopping all the tasks is more useful, especially without another way to stop the view build. bq. The tests looks good, but sometimes they were failing on my machine because the view builder task finished on some nodes before they were stopped and also {{_wait_for_view_build_start}} did not guarantee the view builder started in all nodes before issuing {{nodetool stop VIEW_BUILD}}, so I fixed this [on this commit|https://github.com/pauloricardomg/cassandra-dtest/commit/667315e42bd2b7d04ac038e79149f1b0e63ba0f2]. I also extended test_resume_stopped_build to verify that view was not built after abort ([here|https://github.com/pauloricardomg/cassandra-dtest/commit/f4c3ad7ac9e4ea64576d669a1cf30b0ef4e02a3f]). Nice fix, thanks! I have merged your changes and rebased again, I'll commit if a final CI round looks well: ||[cassandra|https://github.com/adelapena/cassandra/tree/12245-trunk]||[cassandra-dtest|https://github.com/adelapena/cassandra-dtest/tree/12245]|| ||[utest|http://jenkins-cassandra.datastax.lan/view/Dev/view/adelapena/job/adelapena-12245-trunk-testall/]||[dtest|http://jenkins-cassandra.datastax.lan/view/Dev/view/adelapena/job/adelapena-12245-trunk-dtest/]|| > initial view build can be parallel > -- > > Key: CASSANDRA-12245 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12245 > Project: Cassandra > Issue Type: Improvement > Components: Materialized Views >Reporter: Tom van der Woerdt >Assignee: Andrés de la Peña > Fix For: 4.x > > > On a node with lots of data (~3TB) building a materialized view takes several > weeks, which is not ideal. It's doing this in a single thread. > There are several potential ways this can be optimized : > * do vnodes in parallel, instead of going through the entire range in one > thread > * just iterate through sstables, not worrying about duplicates, and include > the timestamp of the original write in the MV mutation. since this doesn't > exclude duplicates it does increase the amount of work and could temporarily > surface ghost rows (yikes) but I guess that's why they call it eventual > consistency. doing it this way can avoid holding references to all tables on > disk, allows parallelization, and removes the need to check other sstables > for existing data. this is essentially the 'do a full repair' path -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-12245) initial view build can be parallel
[ https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271895#comment-16271895 ] Paulo Motta commented on CASSANDRA-12245: - bq. Good catch! Done here. I have added an overloaded version of ViewBuilderTask.stop to throw the CompactionInterruptedException only if the stop call comes from a different place than ViewBuilder. That is, the exception is not thrown in the case of a schema change (such as a drop), when the current build should be stopped without errors and maybe restarted. Good job! One thing I noticed is that even though the builder task and the view builder is aborted, the other tasks of the same builder keep running. At least until we have the ability to start and stop view builders, I think that stopping a subtask should also abort the other subtasks of the same view builder - since the view builder will not complete anyway. What do you think? I've done this [here|https://github.com/pauloricardomg/cassandra/commit/81853218eee702b778ba801426ba19d48336cf77] and the tests didn't need any change. I've also extended {{SplitterTest}} with a couple more test cases [here|https://github.com/pauloricardomg/cassandra/commit/428a990d6b3d79df9a4848d0f0f87502e72e470e]. bq. I have added a couple of dtests here. test_resume_stopped_build uses `nodetool stop VIEW_BUILD` to interrupt the running task of an ongoing view build and verifies that the unmarked build is resumed after restarting the nodes. test_drop_with_stopped_build verifies that a view with interrupted taks can still be dropped, which is something that has been problematic while writting the patch. The tests looks good, but sometimes they were failing on my machine because the view builder task finished on some nodes before they were stopped and also {{_wait_for_view_build_start}} did not guarantee the view builder started in all nodes before issuing {{nodetool stop VIEW_BUILD}}, so I fixed this [on this commit|https://github.com/pauloricardomg/cassandra-dtest/commit/fc62dc849d5a4d5e24d2bada6e6f8ce0f2d32b4d]. I also extended {{test_resume_stopped_build}} to verify that view was not built after abort ([here|https://github.com/pauloricardomg/cassandra-dtest/commit/6e38919d3c64a54688ae97bcf03611fff7d59dfe]). I've rebased and submitted a new CI run with the suggestions above [here|http://jenkins-cassandra.datastax.lan/view/Dev/view/adelapena/job/pauloricardomg-12245-trunk-dtest/] and [here|http://jenkins-cassandra.datastax.lan/view/Dev/view/adelapena/job/pauloricardomg-12245-trunk-testall/]. Besides these minor nits, I'm happy with the latest version of the patch and tests. If you agree with the suggestions above and CI looks good, feel free to incorporate them into your branches and commit. Excellent job and thanks for your patience! :) > initial view build can be parallel > -- > > Key: CASSANDRA-12245 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12245 > Project: Cassandra > Issue Type: Improvement > Components: Materialized Views >Reporter: Tom van der Woerdt >Assignee: Andrés de la Peña > Fix For: 4.x > > > On a node with lots of data (~3TB) building a materialized view takes several > weeks, which is not ideal. It's doing this in a single thread. > There are several potential ways this can be optimized : > * do vnodes in parallel, instead of going through the entire range in one > thread > * just iterate through sstables, not worrying about duplicates, and include > the timestamp of the original write in the MV mutation. since this doesn't > exclude duplicates it does increase the amount of work and could temporarily > surface ghost rows (yikes) but I guess that's why they call it eventual > consistency. doing it this way can avoid holding references to all tables on > disk, allows parallelization, and removes the need to check other sstables > for existing data. this is essentially the 'do a full repair' path -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-12245) initial view build can be parallel
[ https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16264264#comment-16264264 ] Andrés de la Peña commented on CASSANDRA-12245: --- bq. I had a final look at the patch and tests, and it mostly looks good, but during the last review I found one edge case still not addressed: if the view build is stopped via {{nodetool stop VIEW_BUILD}}, the view build will be [considered successful|https://github.com/adelapena/cassandra/blob/e1ace2f47be71d48ab1987d0e2c7a07cc9486e97/src/java/org/apache/cassandra/db/view/ViewBuilder.java#L174], so we should probably throw a {{CompactionInterruptedException}} when the view build is stopped and not resubmit a new view build task in this case. Good catch! Done [here|https://github.com/adelapena/cassandra/commit/640d5b70232deab5b6b54b6158b28e86e53f7612]. I have added an overloaded version of [{{ViewBuilderTask.stop}}|https://github.com/adelapena/cassandra/blob/640d5b70232deab5b6b54b6158b28e86e53f7612/src/java/org/apache/cassandra/db/view/ViewBuilderTask.java#L204-L208] to throw the {{CompactionInterruptedException}} only if the stop call comes from a different place than {{ViewBuilder}}. That is, the exception is not thrown in the case of a schema change (such as a drop), when the current build should be stopped without errors and maybe restarted. I've also [extended {{CompactionInterruptedException}}|https://github.com/adelapena/cassandra/blob/640d5b70232deab5b6b54b6158b28e86e53f7612/src/java/org/apache/cassandra/db/view/ViewBuilderTask.java#L215-L249] with {{equals}} and {{hashCode}} methods that consider equals all the compaction interrupted exceptions produced by the same view. Otherwise, the multiple calls to [{{Futures#allAsList}}|https://github.com/adelapena/cassandra/blob/640d5b70232deab5b6b54b6158b28e86e53f7612/src/java/org/apache/cassandra/db/view/ViewBuilder.java#L175]'s [{{FutureCallback#onFailure}}|https://github.com/adelapena/cassandra/blob/640d5b70232deab5b6b54b6158b28e86e53f7612/src/java/org/apache/cassandra/db/view/ViewBuilder.java#L186-L199] with different exceptions make Guava's to log messages of the form {{SEVERE: You called allAsList, and more than one input failed, and the failures were different exceptions}}, which can be quite spamming. bq. Would you mind adding a dtest for this as well to ensure the view will not be marked as built after the view build is stopped? I have added a couple of dtests [here|https://github.com/adelapena/cassandra-dtest/commit/3af871a9daa8cc5aa9dc098f6126ca0a6fe4a015]. [{{test_resume_stopped_build}}|https://github.com/adelapena/cassandra-dtest/blob/3af871a9daa8cc5aa9dc098f6126ca0a6fe4a015/materialized_views_test.py#L1132-L1179] uses `nodetool stop VIEW_BUILD` to interrupt the running task of an ongoing view build and verifies that the unmarked build is resumed after restarting the nodes. [{{test_drop_with_stopped_build}}|https://github.com/adelapena/cassandra-dtest/blob/3af871a9daa8cc5aa9dc098f6126ca0a6fe4a015/materialized_views_test.py#L1074-L1130] verifies that a view with interrupted taks can still be dropped, which is something that has been problematic while writting the patch. CI for utests is clean, the updated dtests are still running: ||[utest|http://jenkins-cassandra.datastax.lan/view/Dev/view/adelapena/job/adelapena-12245-trunk-testall/]||[dtest|http://jenkins-cassandra.datastax.lan/view/Dev/view/adelapena/job/adelapena-12245-trunk-dtest/]|| > initial view build can be parallel > -- > > Key: CASSANDRA-12245 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12245 > Project: Cassandra > Issue Type: Improvement > Components: Materialized Views >Reporter: Tom van der Woerdt >Assignee: Andrés de la Peña > Fix For: 4.x > > > On a node with lots of data (~3TB) building a materialized view takes several > weeks, which is not ideal. It's doing this in a single thread. > There are several potential ways this can be optimized : > * do vnodes in parallel, instead of going through the entire range in one > thread > * just iterate through sstables, not worrying about duplicates, and include > the timestamp of the original write in the MV mutation. since this doesn't > exclude duplicates it does increase the amount of work and could temporarily > surface ghost rows (yikes) but I guess that's why they call it eventual > consistency. doing it this way can avoid holding references to all tables on > disk, allows parallelization, and removes the need to check other sstables > for existing data. this is essentially the 'do a full repair' path -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail:
[jira] [Commented] (CASSANDRA-12245) initial view build can be parallel
[ https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261518#comment-16261518 ] Paulo Motta commented on CASSANDRA-12245: - bq. I have moved the base table flush from ViewBuilderTask to ViewBuilder here, to do a single flush at the begining of the view build. The following writes will be writen to the MV through the regular path so it seems that they won't need any further flushes. I think that with this we don't need to check the table size and give a special treatment to small ones, what do you think? Much better this way, good job! bq. It seems that the configuration section of the doc is currently empty. I think that writting this section (structure, introduction, etc.) is probably out of the scope of this ticket and it might be done in a separate, dedicated ticket. Instead, I have updated the NEWS.txt file with more detailed info and I have added a note to the doc about CREATE MATERIALIZED VIEW statement. oh, my bad! I just noticed the configuration section is [dynamically generated|https://github.com/apache/cassandra/blob/8b3a60b9a7dbefeecc06bace617279612ec7092d/doc/Makefile#L20] from cassandra.yaml. In any case, the NEWS.txt entry and new MV doc looks great! bq. I have updated the dtest interrupt_build_process_test to make sure that the build is really interrupted also in 3.x through new byteman scripts. Without that, the build could finish before the cluster stop. Awesome! bq. Here is the new version of the patch, rebased and squashed. The udpated dtests can be found here. I had a final look at the patch and tests, and it mostly looks good, but during the last review I found one edge case still not addressed: if the view build is stopped via {{nodetool stop VIEW_BUILD}}, the view build will be [considered successful|https://github.com/adelapena/cassandra/blob/e1ace2f47be71d48ab1987d0e2c7a07cc9486e97/src/java/org/apache/cassandra/db/view/ViewBuilder.java#L174], so we should probably throw a {{CompactionInterruptedException}} when the view build is stopped and not resubmit a new view build task in this case. I think this last case is also present on 3.0 so it's not something introduced by this patch but it would be nice to address it here. Would you mind adding a dtest for this as well to ensure the view will not be marked as built after the view build is stopped? Sorry for delaying this further, but after this is addressed, this should be ready to go after a new clean CI round. > initial view build can be parallel > -- > > Key: CASSANDRA-12245 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12245 > Project: Cassandra > Issue Type: Improvement > Components: Materialized Views >Reporter: Tom van der Woerdt >Assignee: Andrés de la Peña > Fix For: 4.x > > > On a node with lots of data (~3TB) building a materialized view takes several > weeks, which is not ideal. It's doing this in a single thread. > There are several potential ways this can be optimized : > * do vnodes in parallel, instead of going through the entire range in one > thread > * just iterate through sstables, not worrying about duplicates, and include > the timestamp of the original write in the MV mutation. since this doesn't > exclude duplicates it does increase the amount of work and could temporarily > surface ghost rows (yikes) but I guess that's why they call it eventual > consistency. doing it this way can avoid holding references to all tables on > disk, allows parallelization, and removes the need to check other sstables > for existing data. this is essentially the 'do a full repair' path -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-12245) initial view build can be parallel
[ https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16258054#comment-16258054 ] Andrés de la Peña commented on CASSANDRA-12245: --- Thanks for the comments, this is almost finished :) [Here|https://github.com/apache/cassandra/compare/trunk...adelapena:12245-trunk] is the new version of the patch, rebased and squashed. The udpated dtests can be found [here|https://github.com/apache/cassandra-dtest/compare/master...adelapena:12245]. bq. One minor thing is that we should probably only split the view build tasks at all if the base table is larger than a given size (let's say 500MB or so?), to avoid 4 * num_processor flushes for base tables with negligible size, WDYT? As discussed, I have moved the base table flush from {{ViewBuilderTask}} to {{ViewBuilder}} [here|https://github.com/adelapena/cassandra/commit/478ed88b490378caf4f8ddc82c8e3aa3f90e5264], to do a single flush at the begining of the view build. The following writes will be writen to the MV through the regular path so it seems that they won't need any further flushes. I think that with this we don't need to check the table size and give a special treatment to small ones, what do you think? bq. I noticed we don't stop in-progess view builds when a view is removed, would you mind adding that? Right, good catch. Done [here|https://github.com/adelapena/cassandra/commit/e1ace2f47be71d48ab1987d0e2c7a07cc9486e97]. I have also added [this dtest|https://github.com/adelapena/cassandra-dtest/blob/12245/materialized_views_test.py#L1025-L1067] to verify that the build is properly stopped. bq. ViewBuildExecutor is being constructed with minThreads=1 and maxPoolSize=concurrent_materialized_view_builders, but according to the {{DebuggableThreadPoolExecutor}}'s' [javadoc|https://github.com/apache/cassandra/blob/8b3a60b9a7dbefeecc06bace617279612ec7092d/src/java/org/apache/cassandra/concurrent/DebuggableThreadPoolExecutor.java#L33], this will actually make the executor with size 1 since maxPoolSize is not supported by {{DebuggableThreadPoolExecutor}} - and even if it were, new threads would only be created after the queue of the initial threads were full (which is quite unintuitive), but we actually want the pool to have concurrent_materialized_view_builders concurrent threads at most, so we should use the {{threadCount}} constructor instead - at some point we should actually remove the maximumPoolSize Done [here|https://github.com/adelapena/cassandra/commit/fc14b034bb5d36c23435f313541445dc5adb0078]. bq. I think we could take a {{buildAllViews}} parameter on reload, and set that to false during Keyspace initialization, since views will be build during daemon initialization and keyspace changes anyway, WDYT? Makes sense, done [here|https://github.com/adelapena/cassandra/commit/c4f19a5461434c0d5ca5e1301d92da26cca5083e]. bq. One last thing, can you please add the new yaml option {{concurrent_materialized_view_builders}} to the configuration section of the doc? It seems that [the configuration section|https://github.com/apache/cassandra/blob/trunk/doc/source/configuration/index.rst] of the doc is currently empty. I think that writting this section (structure, introduction, etc.) is probably out of the scope of this ticket and it might be done in a separate, dedicated ticket. Instead, I have [updated|https://github.com/adelapena/cassandra/commit/30293f852584189a5b46c2dce5ae4042ae62d3e4] the NEWS.txt file with more detailed info and I have added [a note|https://github.com/adelapena/cassandra/commit/82c446398d0b6b4b1b13b35b3502489fc71fe703] to the doc about {{CREATE MATERIALIZED VIEW}} statement. WDYT? I have updated the dtest {{interrupt_build_process_test}} to make sure that the build is really interrupted also in 3.x through [new byteman scripts|https://github.com/adelapena/cassandra-dtest/blob/f7aac39ee5d0c661b2f7f5b1db2a7347635f85c5/materialized_views_test.py#L962-L963]. Without that, the build could finish before the cluster stop. The CI results look good, at least for MVs: ||[utest|http://jenkins-cassandra.datastax.lan/view/Dev/view/adelapena/job/adelapena-12245-trunk-testall/]||[dtest|http://jenkins-cassandra.datastax.lan/view/Dev/view/adelapena/job/adelapena-12245-trunk-dtest/]|| > initial view build can be parallel > -- > > Key: CASSANDRA-12245 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12245 > Project: Cassandra > Issue Type: Improvement > Components: Materialized Views >Reporter: Tom van der Woerdt >Assignee: Andrés de la Peña > Fix For: 4.x > > > On a node with lots of data (~3TB) building a materialized view takes several > weeks, which is not ideal. It's doing this in a single thread. > There are several potential ways this can be
[jira] [Commented] (CASSANDRA-12245) initial view build can be parallel
[ https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16245074#comment-16245074 ] Paulo Motta commented on CASSANDRA-12245: - Getting back to this after a while, sorry about the delay. We are really close now, I think we can wrap up after this last round of review. See follow-up below: bq. This makes a lot of sense. I'm worried about creating thousands of tasks for large datasets if the number of tasks is relative to the amount of data. Instead, I think we could fix the number of partitions to the higher reasonable number of parallel tasks, something like a multiple of the number of available processors. This would provide the desired immediate performance improvement if the user increases the number of concurrent view builders while keeping the number of tasks limited, independently of the amount of data. What do you think? Does it make any sense? Awesome! One minor thing is that we should probably only split the view build tasks at all if the base table is larger than a given size (let's say 500MB or so?), to avoid 4 * num_processor flushes for base tables with negligible size, WDYT? bq. One case that we hadn't considered is that if the token ranges change or are split in a different way when resuming a build then the progress would be lost, because ViewBuildTask won't found any entry for the new range at system.view_builds_in_progress. This would be specially true if we split the ranges by their data size. So, independently of how we finally split the ranges, I think it makes sense to load all the ranges with any progress from system.view_builds_in_progress at ViewBuilder before splitting the local ranges, create a task for those of them that are not already finish, and then split any remaining uncovered local range. It also has the advantage of skipping the creation of tasks for already completed ranges. What do you think? Good idea! This will prevent a node from computing new tasks for already built subranges if a new node joins the ring during view build. I will add a dtest to verify that scenario on CASSANDRA-13762. bq. I have also removed the method SystemKeyspace.beginViewBuild because I don't see any need of saving a range without progress. Indeed, if the view build is restarted it is probably better to don't restore the task without progress and let their tokens to be processed by the split logic. +1 bq. The first wait does the opposite to _wait_for_view, it waits until there is some progress. But we can use self._wait_for_view("ks", "t_by_v") here. Did you mean that? Yep. bq. Makes sense. The new test for ViewBuilderTask is here, and I've extended ViewTest.testViewBuilderResume to run with different number of concurrent view builders. Awesome, this looks much better now, thanks! I had a final skim through the patch and I think we covered everything and this looks much more robust now. Three minor things before we can close this: - I noticed we don't stop in-progess view builds when a view is removed, would you mind adding that? - ViewBuildExecutor is being constructed with minThreads=1 and maxPoolSize=concurrent_materialized_view_builders, but according to the {{DebuggableThreadPoolExecutor}}'s' [javadoc|https://github.com/apache/cassandra/blob/8b3a60b9a7dbefeecc06bace617279612ec7092d/src/java/org/apache/cassandra/concurrent/DebuggableThreadPoolExecutor.java#L33], this will actually make the executor with size 1 since maxPoolSize is not supported by {{DebuggableThreadPoolExecutor}} - and even if it were, new threads would only be created after the queue of the initial threads were full (which is quite unintuitive), but we actually want the pool to have concurrent_materialized_view_builders concurrent threads at most, so we should use the {{threadCount}} constructor instead - at some point we should actually remove the maximumPoolSize constructors from the TPE's since it can be quite confusing. - In the linked dtest branch results, {{base_replica_repair_with_contention_test}} is showing the following error: {noformat} ERROR [InternalResponseStage:1] 2017-10-11 11:03:16,005 CassandraDaemon.java:211 - Exception in thread Thread[InternalResponseStage:1,5,main] java.lang.RuntimeException: javax.management.InstanceAlreadyExistsException: org.apache.cassandra.db:type=Tables,keyspace=ks,table=t at org.apache.cassandra.db.ColumnFamilyStore.(ColumnFamilyStore.java:484) ~[main/:na] at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:639) ~[main/:na] at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:613) ~[main/:na] at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:604) ~[main/:na] at org.apache.cassandra.db.Keyspace.initCf(Keyspace.java:420) ~[main/:na]
[jira] [Commented] (CASSANDRA-12245) initial view build can be parallel
[ https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16200126#comment-16200126 ] Andrés de la Peña commented on CASSANDRA-12245: --- Sorry for the delay, and thank for the thorough review. I've updated [the patch|https://github.com/apache/cassandra/compare/trunk...adelapena:12245-trunk] addressing the comments. bq. It seems like the way the number of tokens in a range was computed by abs(range.right - range.left) may not work correctly for some wrap-around cases, as shown by [this test case|https://github.com/pauloricardomg/cassandra/blob/2760bbbc25a2ad9a9bbf9d29a0dc19e1e3bfb237/test/unit/org/apache/cassandra/dht/SplitterTest.java#L184]. Even though this shouldn't break when local ranges are used , I fixed it on [this commit|https://github.com/pauloricardomg/cassandra/commit/2760bbbc25a2ad9a9bbf9d29a0dc19e1e3bfb237] to make sure split works correctly for wrap-around ranges. Can you confirm this is correct? I think this is right, merged. bq. Other than that, it seems like you added unit tests only for Murmur3Partitioner, would you mind extending testSplit() to RandomPartitioner? Done [here|https://github.com/apache/cassandra/commit/6f92541abaa61ed789f50f67c8800af3f97162a5]. bq. I think having a dedicated executor will ensure view building doesn't compete with compactions for the compaction executor, good job! One problem I see though is that if the user is finding its view building slow it will try to increase the number of concurrent view builders via nodetool, but it will have no effect since the range was split in the previously number of concurrent view builders. Given this will be a pretty common scenario for large datasets, how about splitting the range in multiple smaller tasks, so that if the user increases {{concurrent_view_builders}} the other tasks immediately start executing? bq. We could use a simple approach of splitting the local range in let's say 1000 hard-coded parts, or be smarter and make each split have ~100MB or so. In this way we can keep {{concurrent_materialized_view_builders=1}} by default, and users with large base tables are able to increase it and see immediate effect via nodetool. WDYT? This makes a lot of sense. I'm worried about creating thousands of tasks for large datasets if the number of tasks is relative to the amount of data. Instead, I think we could fix the number of partitions to the higher reasonable number of parallel tasks, something like [a multiple of the number of available processors|https://github.com/adelapena/cassandra/blob/e460b3b76935cad60a9dc1e00c8d3af8bfa9584a/src/java/org/apache/cassandra/db/view/ViewBuilder.java#L59]. This would provide the desired immediate performance improvement if the user increases the number of concurrent view builders while keeping the number of tasks limited, independently of the amount of data. What do you think? Does it make any sense? bq. Great, looks much cleaner indeed! One minor thing is that if there's a failure after some {{ViewBuildTasks}} were completed, it will resume that subtask from its last token while it already finished. Could we maybe set the last_token = end_token when the task is finished to flag it was already finished and avoid resuming the task when that is the case? Done [here|https://github.com/adelapena/cassandra/blob/d505d8014524b72422ba1eee036494bda39f53f0/src/java/org/apache/cassandra/db/view/ViewBuilderTask.java#L166-L168]. One case that we hadn't considered is that if the token ranges change or are split in a different way when resuming a build then the progress would be lost, because {{ViewBuildTask}} won't found any entry for the new range at {{system.view_builds_in_progress}}. This would be specially true if we split the ranges by their data size. So, independently of how we finally split the ranges, I think it makes sense to load all the ranges with any progress from {{system.view_builds_in_progress}} at {{ViewBuilder}} before splitting the local ranges, create a task for those of them that are not already finish, and then split any remaining uncovered local range. It also has the advantage of skipping the creation of tasks for already completed ranges. What do you think? I have also removed the method {{SystemKeyspace.beginViewBuild}} because I don't see any need of saving a range without progress. Indeed, if the view build is restarted it is probably better to don't restore the task without progress and let their tokens to be processed by the split logic. bq. The dtest looks mostly good, except for the following nits: bq. * {{concurrent_materialized_view_builders=1}} when the nodes are restarted. Can you set the configuration value during cluster setup phase (instead of setting via nodetool) to make sure the restarted view builds will be parallel? Sure, done
[jira] [Commented] (CASSANDRA-12245) initial view build can be parallel
[ https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171536#comment-16171536 ] Paulo Motta commented on CASSANDRA-12245: - Finally getting to this after a while, sorry for the delay. Thanks for the update! Had another look at the patch and this is looking much better, see some follow-up comments below: bq. I have moved the methods to split the ranges to the Splitter, reusing its valueForToken method. Tests here. Awesome, looks much better now! It seems like the way the number of tokens in a range was computed by {{abs(range.right - range.left)}} may not work correctly for some wrap-around cases, as shown by [this test case|https://github.com/pauloricardomg/cassandra/blob/2760bbbc25a2ad9a9bbf9d29a0dc19e1e3bfb237/test/unit/org/apache/cassandra/dht/SplitterTest.java#L184]. Even though this shouldn't break when local ranges are used , I fixed it on [this commit|https://github.com/pauloricardomg/cassandra/commit/2760bbbc25a2ad9a9bbf9d29a0dc19e1e3bfb237] to make sure split works correctly for wrap-around ranges. Can you confirm this is correct? Other than that, it seems like you added unit tests only for {{Murmur3Partitioner}}, would you mind extending {{testSplit()}} to {{RandomPartitioner}}? bq. Agree. I have added a new dedicated executor in the CompactionManager, similar to the executors used for validation and cache cleanup. The concurrency of this executor is determined by the new config property concurrent_materialized_view_builders, which defaults to a perhaps too much conservative value of 1. This property can be modified through both JMX and the new setconcurrentviewbuilders and getconcurrentviewbuilders nodetool commands. These commands are tested here. I think having a dedicated executor will ensure view building doesn't compete with compactions for the compaction executor, good job! One problem I see though is that if the user is finding its view building slow it will try to increase the number of concurrent view builders via nodetool, but it will have no effect since the range was split in the previously number of concurrent view builders. Given this will be a pretty common scenario for large datasets, how about splitting the range in multiple smaller tasks, so that if the user increases {{concurrent_view_builders}} the other tasks immediately start executing? We could use a simple approach of splitting the local range in let's say 1000 hard-coded parts, or be smarter and make each split have ~100MB or so. In this way we can keep {{concurrent_materialized_view_builders=1}} by default, and users with large base tables are able to increase it and see immediate effect via nodetool. WDYT? bq. I would prefer to do this in another ticket. Agreed. bq. I have moved the marking of system tables (and the retries in case of failure) from the ViewBuilderTask to the ViewBuilder, using a callback to do the marking. I think the code is clearer this way. Great, looks much cleaner indeed! One minor thing is that if there's a failure after some {{ViewBuildTasks}} were completed, it will resume that subtask from its last token while it already finished. Could we maybe set the last_token = end_token when the task is finished to flag it was already finished and avoid resuming the task when that is the case? bq. Updated here. It also uses a byteman script to make sure that the MV build isn't finished before stopping the cluster, which is more likely to happen. The dtest looks mostly good, except for the following nits: * {{concurrent_materialized_view_builders=1}} when the nodes are restarted. Can you set the configuration value during cluster setup phase (instead of setting via nodetool) to make sure the restarted view builds will be parallel? * can probably use {{self._wait_for_view("ks", "t_by_v")}} [here|https://github.com/adelapena/cassandra-dtest/blob/cf893982c361fd7b6018b2570d3a5a33badd5424/materialized_views_test.py#L982] * We cannot ensure key 1 was not built [here|https://github.com/adelapena/cassandra-dtest/blob/cf893982c361fd7b6018b2570d3a5a33badd5424/materialized_views_test.py#L980] which may cause flakiness, so it's probably better to check for {{self.assertNotEqual(len(list(session.execute("SELECT count\(*\) FROM t_by_v;"))), 1)}} or something like that. * It would be nice to check that the view build was actually removed on restart, by checking for the log entry {{Resuming view build for range}} bq. I'm not sure about if it still makes sense for the builder task to extend CompactionInfo.Holder. If so, I'm neither sure about how to use prevToken.size(range.right) (that returns a double) to create CompationInfo objects. WDYT? I think it's still useful to show view build progress via {{nodetool compactionstats}}, so I created a new method {{Splitter.positionInRange(Token token, Range range)}} which gives the
[jira] [Commented] (CASSANDRA-12245) initial view build can be parallel
[ https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138233#comment-16138233 ] Andrés de la Peña commented on CASSANDRA-12245: --- [Here|https://github.com/apache/cassandra/compare/trunk...adelapena:12245-trunk] is a new version of the patch addressing the review comments. The updated dtests are [here|https://github.com/apache/cassandra-dtest/compare/master...adelapena:12245]. bq. It would be nice to maybe try to reuse the {{Splitter}} methods if possible, so we can reuse tests, or if that's not straightforward maybe put the methods on splitter and add some tests to make sure it's working correctly. I have moved the [methods to split the ranges|https://github.com/adelapena/cassandra/blob/af6b2547b58b8215c65be7546173f3c48b398072/src/java/org/apache/cassandra/dht/Splitter.java#L137-L187] to the {{Splitter}}, reusing its [{{valueForToken}}|https://github.com/adelapena/cassandra/blob/af6b2547b58b8215c65be7546173f3c48b398072/src/java/org/apache/cassandra/dht/Splitter.java#L47] method. Tests [here|https://github.com/adelapena/cassandra/blob/af6b2547b58b8215c65be7546173f3c48b398072/test/unit/org/apache/cassandra/dht/SplitterTest.java#L162-L224]. bq. Can probably remove the generation field from the builds in progress table and remove this comment Removed. bq. {{views_builds_in_progress_v2}} sounds a bit hacky, so perhaps we should call it {{system.view_builds_in_progress}} (remove the s) and also add a NOTICE entry informing the previous table was replaced and data files can be removed. Renamed to {{system.view_builds_in_progress}}. Added an [NEWS.txt entry|https://github.com/adelapena/cassandra/blob/af6b2547b58b8215c65be7546173f3c48b398072/NEWS.txt#L183-L185] informing about the replacement. bq. I'm a bit concerned about starving the compaction executor for a long period during view build of large base tables, so we should probably have another option like {{concurret_view_builders}} with a conservative default and perhaps control the concurrency at the {{ViewBuilderController}}. WDYT? Agree. I have added a new dedicated [executor|https://github.com/adelapena/cassandra/blob/af6b2547b58b8215c65be7546173f3c48b398072/src/java/org/apache/cassandra/db/compaction/CompactionManager.java#L126] in the {{CompactionManager}}, similar to the executors used for validation and cache cleanup. The concurrency of this executor is determined by the new config property [{{concurrent_materialized_view_builders}}|https://github.com/adelapena/cassandra/blob/af6b2547b58b8215c65be7546173f3c48b398072/conf/cassandra.yaml#L756], which defaults to a perhaps too much conservative value of {{1}}. This property can be modified through both JMX and the new [{{setconcurrentviewbuilders}}|https://github.com/adelapena/cassandra/blob/af6b2547b58b8215c65be7546173f3c48b398072/src/java/org/apache/cassandra/tools/nodetool/SetConcurrentViewBuilders.java] and [{{getconcurrentviewbuilders}}|https://github.com/adelapena/cassandra/blob/af6b2547b58b8215c65be7546173f3c48b398072/src/java/org/apache/cassandra/tools/nodetool/GetConcurrentViewBuilders.java] nodetool commands. These commands are tested [here|https://github.com/adelapena/cassandra-dtest/blob/cf893982c361fd7b6018b2570d3a5a33badd5424/nodetool_test.py#L162-L190]. I'm not sure about if it still makes sense for the builder task to extend {{CompactionInfo.Holder}}. If so, I'm neither sure about how to use {{prevToken.size(range.right)}} (that returns a {{double}}) to create {{CompationInfo}} objects. WDYT? bq. ViewBuilder seems to be reimplementing some of the logic of PartitionRangeReadCommand, so I wonder if we shoud take this chance to simplify and use that instead of manually constructing the commands via ReducingKeyIterator and multiple SinglePartitionReadCommands? We can totally do this in other ticket if you prefer. I would prefer to do this in another ticket. bq. Perform view marking on ViewBuilderController instead of ViewBuilder I have moved the marking of system tables (and the retries in case of failure) from the {{ViewBuilderTask}} to the {{ViewBuilder}}, using a callback to do the marking. I think the code is clearer this way. bq. Updating the view built status at every key is perhaps a bit inefficient and unnecessary, so perhaps we should update it every 1000 keys or so. Done, it is updated [every 1000 keys|https://github.com/adelapena/cassandra/blob/af6b2547b58b8215c65be7546173f3c48b398072/src/java/org/apache/cassandra/db/view/ViewBuilderTask.java#L62]. It doesn't seem to make a great difference in some small benchmarks that I have run. bq. Would be nice to update the {{interrupt_build_process_test}} to stop halfway through the build (instead of the start of the build) and verify it it's being resumed correctly with the new changes. Updated
[jira] [Commented] (CASSANDRA-12245) initial view build can be parallel
[ https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126791#comment-16126791 ] Paulo Motta commented on CASSANDRA-12245: - Thanks for the patch and sorry for the delay! Had an initial look at the patch, overall looks good, I have the following comments/questions/remarks: bq. The newly created ViewBuilderController reads the local token ranges, splits them to satisfy a concurrency factor, and runs a ViewBuilder for each of them. It would be nice to maybe try to reuse the {{Splitter}} methods if possible, so we can reuse tests, or if that's not straightforward maybe put the methods on splitter and add some tests to make sure it's working correctly. bq. When ViewBuilderController receives the finalization signal of the last ViewBuilder, it double-checks if there are new local ranges that weren't considered at the beginning of the build. If there are new ranges, new {{ViewBuilder}}s are created for them. This will not work if the range movement which created the new local range finishes after the view has finished building. This problem exists currently and is unrelated to the view build process itself, but more related to the range movement completion which should ensure the views are properly built before the operation finishes, so I created CASSANDRA-13762 to handle this properly. bq. Given that we have a ViewBuilder per local range, the key of the table system.views_builds_in_progress is modified to include the bounds of the token range. So, we will have an entry in the table per each ViewBuilder. The number of covered keys per range is also recorded in the table. Can probably remove the generation field from the builds in progress table and [remove this comment|https://github.com/adelapena/cassandra/blob/94f3d0d02bb5f849e4d54857d0b33531a5650643/src/java/org/apache/cassandra/db/view/ViewBuilder.java#L124] bq. I have updated the patch to use a new separate table, system.views_builds_in_progress_v2 {{views_builds_in_progress_v2}} sounds a bit hacky, so perhaps we should call it {{system.view_builds_in_progress}} (remove the s) and also add a NOTICE entry informing the previous table was replaced and data files can be removed. bq. The downside is that pending view builds will be restarted during an upgrade to 4.x, which seems reasonable to me. Sounds reasonable to me too. bq. ViewBuilder and ViewBuilderController are probably not the best names. Maybe we could rename ViewBuilder to something like ViewBuilderTask or ViewBuilderForRange, and rename ViewBuilderController to ViewBuilder. {{ViewBuilder}} and {{ViewBuilderTask}} LGTM bq. The concurrency factor is based on conf.concurrent_compactors because the views are built on the CompactionManager, but we may be interested in a different value. I'm a bit concerned about starving the compaction executor for a long period during view build of large base tables, so we should probably have another option like {{concurret_view_builders}} with a conservative default and perhaps control the concurrency at the {{ViewBuilderController}}. WDYT? bq. The patch tries to evenly split the token ranges in the minimum number of parts to satisfy the concurrency factor, and it never merges ranges. So, with the default 256 virtual nodes (and a lesser concurrency factor) we create 256 build tasks. We might be interested in a different planning. If we want the number of tasks to be lesser than the number of local ranges we should modify the ViewBuilder task to be responsible for several ranges, although it will complicate the status tracking. I think this is good to start with, we can improve the planning later if necessary. I don't think there is much gain from merging ranges to have smaller tasks. bq. Probably there is a better way of implementing ViewBuilder.getCompactionInfo. The patch uses keysBuilt/ColumnFamilyStore.estimatedKeysForRange to estimate the completion, which could deal to have task completion status over 100%, depending on the estimation. How about using {{prevToken.size(range.right)}} (introduced by CASSANDRA-7032)? Even though this will not be available for BytesToken (used by ByteOrderedPartitioner, which is rarely used, so could maybe fallback to the current imprecise calculation in that case). Other comments: * Avoid submitting view builder when view is already built instead of checking on the ViewBuilder ([here|https://github.com/adelapena/cassandra/blob/94f3d0d02bb5f849e4d54857d0b33531a5650643/src/java/org/apache/cassandra/db/view/ViewBuilder.java#L109]) * ViewBuilder seems to be reimplementing some of the logic of {{PartitionRangeReadCommand}}, so I wonder if we shoud take this chance to simplify and use that instead of manually constructing the commands via ReducingKeyIterator and multiple {{SinglePartitionReadCommands}}? We can totally do this in other ticket if
[jira] [Commented] (CASSANDRA-12245) initial view build can be parallel
[ https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107194#comment-16107194 ] Andrés de la Peña commented on CASSANDRA-12245: --- Good point, [~doanduyhai]. I have [updated|https://github.com/adelapena/cassandra/commit/35588ca00337465fedfb42e6cc001773bb739d2f] the patch to use a new separate table, {{system.views_builds_in_progress_v2}}. The downside is that pending view builds will be restarted during an upgrade to 4.x, which seems reasonable to me. I have also [updated the dtests|https://github.com/adelapena/cassandra-dtest/commit/91f6d3d88de97ec7e18b5243f60c359022bf41c3] to choose the proper system table depending on the version. > initial view build can be parallel > -- > > Key: CASSANDRA-12245 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12245 > Project: Cassandra > Issue Type: Improvement > Components: Materialized Views >Reporter: Tom van der Woerdt >Assignee: Andrés de la Peña > Fix For: 4.x > > > On a node with lots of data (~3TB) building a materialized view takes several > weeks, which is not ideal. It's doing this in a single thread. > There are several potential ways this can be optimized : > * do vnodes in parallel, instead of going through the entire range in one > thread > * just iterate through sstables, not worrying about duplicates, and include > the timestamp of the original write in the MV mutation. since this doesn't > exclude duplicates it does increase the amount of work and could temporarily > surface ghost rows (yikes) but I guess that's why they call it eventual > consistency. doing it this way can avoid holding references to all tables on > disk, allows parallelization, and removes the need to check other sstables > for existing data. this is essentially the 'do a full repair' path -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-12245) initial view build can be parallel
[ https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16080299#comment-16080299 ] DOAN DuyHai commented on CASSANDRA-12245: - [~adelapena] since you change the key of the table {{system.views_builds_in_progress}}, how would we handle migration from previous version of the table to the new version ? ALTER TABLE does not work on primary key columns. > initial view build can be parallel > -- > > Key: CASSANDRA-12245 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12245 > Project: Cassandra > Issue Type: Improvement > Components: Materialized Views >Reporter: Tom van der Woerdt >Assignee: Andrés de la Peña > Fix For: 4.x > > > On a node with lots of data (~3TB) building a materialized view takes several > weeks, which is not ideal. It's doing this in a single thread. > There are several potential ways this can be optimized : > * do vnodes in parallel, instead of going through the entire range in one > thread > * just iterate through sstables, not worrying about duplicates, and include > the timestamp of the original write in the MV mutation. since this doesn't > exclude duplicates it does increase the amount of work and could temporarily > surface ghost rows (yikes) but I guess that's why they call it eventual > consistency. doing it this way can avoid holding references to all tables on > disk, allows parallelization, and removes the need to check other sstables > for existing data. this is essentially the 'do a full repair' path -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-12245) initial view build can be parallel
[ https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16080228#comment-16080228 ] Andrés de la Peña commented on CASSANDRA-12245: --- [Here|https://github.com/apache/cassandra/compare/trunk...adelapena:12245-trunk] is a first draft of the patch. At first, it seems that the current implementation is not properly retrieving the view build status because of a mistake at [{{SystemKeyspace.getViewBuildStatus}}|https://github.com/apache/cassandra/blob/9c6f87c35f364ec6a88775cb3d0cf143e36635e7/src/java/org/apache/cassandra/db/SystemKeyspace.java#L501-L521]. This makes the view build to always start from the beginning. The provided patch does the following: * {{ViewBuilder}} changes to be responsible for the building of a view for a single given token range. * The newly created {{ViewBuilderController}} reads the local token ranges, splits them to satisfy a concurrency factor, and runs a {{ViewBuilder}} for each of them. * Each {{ViewBuilder}} task calls to {{ViewBuilderController}} at termination to know if it is the last pending build task. If it is, it marks the view as built. * When {{ViewBuilderController}} receives the finalization signal of the last {{ViewBuilder}}, it double-checks if there are new local ranges that weren't considered at the beginning of the build. If there are new ranges, new {{ViewBuilder}}s are created for them. * Given that we have a {{ViewBuilder}} per local range, the key of the table {{system.views_builds_in_progress}} is modified to include the bounds of the token range. So, we will have an entry in the table per each {{ViewBuilder}}. The number of covered keys per range is also recorded in the table. Some considerations about the implementation: * {{ViewBuilder}} and {{ViewBuilderController}} are probably not the best names. Maybe we could rename {{ViewBuilder}} to something like {{ViewBuilderTask}} or {{ViewBuilderForRange}}, and rename {{ViewBuilderController}} to {{ViewBuilder}}. * The concurrency factor is based on {{conf.concurrent_compactors}} because the views are built on the {{CompactionManager}}, but we may be interested in a different value. * The patch tries to evenly split the token ranges in the minimum number of parts to satisfy the concurrency factor, and it never merges ranges. So, with the default 256 virtual nodes (and a lesser concurrency factor) we create 256 build tasks. We might be interested in a different planning. If we want the number of tasks to be lesser than the number of local ranges we should modify the {{ViewBuilder}} task to be responsible for several ranges, although it will complicate the status tracking. * Probably there is a better way of implementing {{ViewBuilder.getCompactionInfo}}. The patch uses {{keysBuilt}}/{{ColumnFamilyStore.estimatedKeysForRange}} to estimate the completion, which could deal to have task completion status over 100%, depending on the estimation. [~slebresne], [~tjake], what do you think? > initial view build can be parallel > -- > > Key: CASSANDRA-12245 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12245 > Project: Cassandra > Issue Type: Improvement >Reporter: Tom van der Woerdt >Assignee: Andrés de la Peña > > On a node with lots of data (~3TB) building a materialized view takes several > weeks, which is not ideal. It's doing this in a single thread. > There are several potential ways this can be optimized : > * do vnodes in parallel, instead of going through the entire range in one > thread > * just iterate through sstables, not worrying about duplicates, and include > the timestamp of the original write in the MV mutation. since this doesn't > exclude duplicates it does increase the amount of work and could temporarily > surface ghost rows (yikes) but I guess that's why they call it eventual > consistency. doing it this way can avoid holding references to all tables on > disk, allows parallelization, and removes the need to check other sstables > for existing data. this is essentially the 'do a full repair' path -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-12245) initial view build can be parallel
[ https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15385924#comment-15385924 ] Tom van der Woerdt commented on CASSANDRA-12245: Good point about the partial row updates, hadn't thought about that case. Could probably make it do the row retrieval conditionally based on whether all fields are in the sstable or not (most rows will be, thanks to compaction), but then you probably run into a lot of corner cases. > initial view build can be parallel > -- > > Key: CASSANDRA-12245 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12245 > Project: Cassandra > Issue Type: Improvement >Reporter: Tom van der Woerdt > > On a node with lots of data (~3TB) building a materialized view takes several > weeks, which is not ideal. It's doing this in a single thread. > There are several potential ways this can be optimized : > * do vnodes in parallel, instead of going through the entire range in one > thread > * just iterate through sstables, not worrying about duplicates, and include > the timestamp of the original write in the MV mutation. since this doesn't > exclude duplicates it does increase the amount of work and could temporarily > surface ghost rows (yikes) but I guess that's why they call it eventual > consistency. doing it this way can avoid holding references to all tables on > disk, allows parallelization, and removes the need to check other sstables > for existing data. this is essentially the 'do a full repair' path -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-12245) initial view build can be parallel
[ https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15385657#comment-15385657 ] Sylvain Lebresne commented on CASSANDRA-12245: -- bq. just iterate through sstables, not worrying about duplicates, and include the timestamp of the original write in the MV mutation. I could misunderstand this but if you talking about dealing with each sstable completely individually, I don't think we can do that. It's not that we have to merge sstables to avoid dealing with duplicates, it's we need to current full version of a row to create the proper MV entry. For instance, say you have a first insert to a row in one sstable that contains the following columns/values {{c1=2, c2=3, c3='foo'}}, and a later update to that row in another sstable containing {{c1=4}}. And further say that {{c1}} is part of your view PK. If we deal with both sstable individually, in the first case we'll insert an entry for {{c1=2}}, with the other value, while for the 2nd sstable we'll create a *different* entry for {{c1=4}} with no other values. Which is doubly wrong as the {{c1=2}} entry shouldn't exist, and the {{c1=4}} entry should have the other values for {{c2}} and {{c3}}. With that said, several weeks to build a view over 3TB is indeed not ideal, and we can indeed at least do vnodes in parallel. > initial view build can be parallel > -- > > Key: CASSANDRA-12245 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12245 > Project: Cassandra > Issue Type: Improvement >Reporter: Tom van der Woerdt > > On a node with lots of data (~3TB) building a materialized view takes several > weeks, which is not ideal. It's doing this in a single thread. > There are several potential ways this can be optimized : > * do vnodes in parallel, instead of going through the entire range in one > thread > * just iterate through sstables, not worrying about duplicates, and include > the timestamp of the original write in the MV mutation. since this doesn't > exclude duplicates it does increase the amount of work and could temporarily > surface ghost rows (yikes) but I guess that's why they call it eventual > consistency. doing it this way can avoid holding references to all tables on > disk, allows parallelization, and removes the need to check other sstables > for existing data. this is essentially the 'do a full repair' path -- This message was sent by Atlassian JIRA (v6.3.4#6332)