[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order
[ https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16702126#comment-16702126 ]

Toke Eskildsen commented on SOLR-13013:
---------------------------------------

Ah! I understand now, thanks. Guess I got a bit too focused on index data IO.

> Change export to extract DocValues in docID order
> -------------------------------------------------
>
>                 Key: SOLR-13013
>                 URL: https://issues.apache.org/jira/browse/SOLR-13013
>             Project: Solr
>          Issue Type: Improvement
>   Security Level: Public (Default Security Level. Issues are Public)
>       Components: Export Writer
> Affects Versions: 7.5, master (8.0)
>         Reporter: Toke Eskildsen
>         Priority: Major
>          Fix For: master (8.0)
>      Attachments: SOLR-13013_proof_of_concept.patch, SOLR-13013_proof_of_concept.patch
>
> The streaming export writer uses a sliding window of 30,000 documents for paging through the result set in a given sort order. Each time a window has been calculated, the values for the export fields are retrieved from the underlying DocValues structures in document sort order and delivered.
> The iterative DocValues API introduced in Lucene/Solr 7 does not support random access. The current export implementation bypasses this by creating a new DocValues-iterator for each individual value to retrieve. This slows down export, as the iterator has to seek to the given docID from the start for each value. The slowdown scales with shard size (see LUCENE-8374 for details). An alternative is to extract the DocValues in docID order, with re-use of DocValues-iterators. The idea is as follows:
> # Change the FieldWriters for export to re-use the DocValues-iterators if subsequent requests are for docIDs higher than the previous ones
> # Calculate the sliding window of SortDocs as usual
> # Take a note of the order of the SortDocs in the sliding window
> # Re-sort the SortDocs in docID order
> # Extract the DocValues to a temporary on-heap structure
> # Re-sort the extracted values to the original sliding window order
> # Deliver the values
> One big difference from the current export code is of course the need to hold the whole sliding-window-sized result set in memory. This might well be a showstopper, as there is no real limit to how large this partial result set can be. Maybe such an optimization could be requested explicitly if the user knows that there is enough memory?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
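As a rough illustration, the re-sort/extract/re-sort steps above can be sketched in plain Java. All class and method names here are invented for the sketch; this is not Solr's actual FieldWriter code, and the forward-only source merely models the behaviour of a Lucene 7+ DocValues iterator.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Sketch of the proposed window extraction (invented names, not Solr's classes). */
public class DocIdOrderExtract {

    /** Models a Lucene 7+ DocValues iterator: values can only be read by
     *  advancing forward; going backwards means starting over from scratch. */
    static class ForwardOnlyValues {
        private final long[] values;
        private int docID = -1;
        ForwardOnlyValues(long[] values) { this.values = values; }
        long get(int target) {
            if (target < docID) docID = -1;  // backwards target: restart from the beginning
            while (docID < target) docID++;  // linear advance, as the iterator API requires
            return values[docID];
        }
    }

    /** Steps 3-7 of the proposal: note the window order, re-sort by docID,
     *  extract all values in one forward sweep, restore the window order. */
    static long[] extractWindow(int[] windowDocIDs, ForwardOnlyValues dv) {
        int n = windowDocIDs.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        // Re-sort window positions so that docIDs are visited in increasing order
        Arrays.sort(order, Comparator.comparingInt(i -> windowDocIDs[i]));
        long[] result = new long[n];  // temporary on-heap structure, in original window order
        for (Integer pos : order) {
            result[pos] = dv.get(windowDocIDs[pos]);  // strictly increasing targets: iterator re-used
        }
        return result;
    }
}
```

Because every `get` target in the sweep is larger than the previous one, the iterator never restarts; the current per-value code effectively pays the restart cost on almost every call.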
[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order
[ https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16702034#comment-16702034 ]

Joel Bernstein commented on SOLR-13013:
---------------------------------------

You're exactly right that improving the performance of export helps the MapReduce use cases as well. It's just that in a sharded, replicated environment with a tier of worker nodes performing a reduce operation, you can get massive throughput already, simply because you can have dozens of servers pushing out an export and reducing in parallel. But you could easily argue that your use case is the more common one and we should really try to make it as fast as possible.

I wouldn't worry too much about testing this in sharded scenarios. We can extrapolate the single-shard findings to multiple shards, realizing that the aggregator node will quickly become the bottleneck and the /export will spend much of its time blocked while writing data. Having a tier of worker nodes removes this bottleneck in the case where the worker nodes are performing some form of reduce operation.
[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order
[ https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701946#comment-16701946 ]

Toke Eskildsen commented on SOLR-13013:
---------------------------------------

[~joel.bernstein] Unfortunately I don't have proper hardware at hand to test with our large shards in a multi-shard setup. I _could_ put them on a spinning drive, now that I think about it, but I am also afraid that my test box does not have adequate memory to fully cache the DocValues structures when using multiple shards, so that would complicate testing somewhat. I'll see what else we have lying around; if nothing else, I could just delete 3/4 of the data in 4 of the shards and run with those instead (that takes some days to do, though).

Up until now we have used export exclusively to do simple query-based data dumps, so that was my go-to case. It is probably due to my limited understanding of Streaming Expressions that I do not see the methodological problem in my test: I get that multi-sharding, replicas and hashing (I'm a bit unsure about the hashing part) can distribute and parallelize the load to make processing faster, but only "more and smaller shards" of those three would reduce the total amount of work, as I understand it? So with regard to that, any optimization to the export should work equally well for a single-shard simple export and a more complex distributed setup, measured as total work to be done?

I am on (even) more shaky ground with the local reduce operation. Isn't that a step after the export part and therefore extremely dependent on raw export speed? Or is there some shortcut mechanism I haven't understood?
[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order
[ https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701898#comment-16701898 ]

Joel Bernstein commented on SOLR-13013:
---------------------------------------

Interesting findings. I can work on getting this patch committed, possibly for the 8.0 release.

A couple of thoughts about the design of the /export handler. The /export handler was very much designed to support MapReduce operations (distributed grouping, rollups, relational algebra) in Streaming Expressions. Scaling these MapReduce operations took the following path:

1) Sharding: The /export handler benefits tremendously from sharding. The benefits go well beyond linear, because adding a second shard both doubles the computing power and more than halves the amount of work that needs to be done by each shard.

2) Hash partitioning and worker collections: Sharding very quickly causes bottlenecks on a single aggregator node. Streaming Expressions' parallel function, combined with the hash partitioner, allows the /exports to be partitioned into X number of slices and brings into play not just the shards but also the replicas. When a reduce operation happens on the worker nodes (rollups, innerJoins) that limits the number of records emitted in the final stream, this is an extremely powerful scaling tool.

So, from a pure /export standpoint with no reduce operation, all from a single shard, you are working somewhat against the design goals of the system. That being said, the faster we make the pure export from a single shard, the more use cases the /export handler serves.
[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order
[ https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701727#comment-16701727 ]

Toke Eskildsen commented on SOLR-13013:
---------------------------------------

I cherry-picked some DENSE fields from our netarchive index and tried exporting them from a single shard, to demonstrate the problem with large indexes in Lucene/Solr 7+ and to performance-test the current patch. I made sure everything was warmed (practically zero IO on the index SSD according to iostat) and tested with combinations of SOLR-13013 and LUCENE-8374 turned on and off:

{code}
> curl -s "http://localhost:9090/solr/ns80/select?q=*:*" | jq .response.numFound
307171504
> curl -s "http://localhost:9090/solr/ns80/select?q=text:hestevogn" | jq .response.numFound
52654
> curl -s -w "%{time_total} seconds"$'\n' "http://localhost:9090/solr/ns80/export?q=text:hestevogn=id+asc=content_type_ext,content_type_served,crawl_date,content_length=true=true" -o t_export_true_true
0.433661 seconds
> curl -s -w "%{time_total} seconds"$'\n' "http://localhost:9090/solr/ns80/export?q=text:hestevogn=id+asc=content_type_ext,content_type_served,crawl_date,content_length=true=false" -o t_export_true_false
0.555844 seconds
> curl -s -w "%{time_total} seconds"$'\n' "http://localhost:9090/solr/ns80/export?q=text:hestevogn=id+asc=content_type_ext,content_type_served,crawl_date,content_length=false=true" -o t_export_false_true
1.037004 seconds
> curl -s -w "%{time_total} seconds"$'\n' "http://localhost:9090/solr/ns80/export?q=text:hestevogn=id+asc=content_type_ext,content_type_served,crawl_date,content_length=false=false" -o t_export_false_false
843.477925 seconds
> diff -s t_export_true_true t_export_true_false ; diff -s t_export_true_true t_export_false_true ; diff -s t_export_true_true t_export_false_false
Files t_export_true_true and t_export_true_false are identical
Files t_export_true_true and t_export_false_true are identical
Files t_export_true_true and t_export_false_false are identical
{code}

Observations from this ad-hoc test (which of course should be independently verified): Exporting from a large index with vanilla Solr master is not ideal. It does not make much sense to talk about what performance factors the patches provide, as they are mostly about changing time complexity: our factor-1500 speed-up with SOLR-13013, on this shard, with this request, will be something else entirely for other setups.

* The explicit sort in SOLR-13013 seems the superior solution; the addition of the O(n) → O(1) lookup improvement in LUCENE-8374 only makes it slightly faster.
* On the other hand, LUCENE-8374 works quite well for export and does not require any changes to export. This might influence whether energy should be spent on a "best as possible" fallback in case of memory problems, or whether a simpler "full fallback to sliding window sort order" is preferable.
* On the gripping hand, testing with a smaller index is likely to result in SOLR-13013 being (relative to LUCENE-8374) even faster, as SOLR-13013 avoids re-opening DV-readers all the time.

More testing needed (no surprise there).
[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order
[ https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16699385#comment-16699385 ]

Toke Eskildsen commented on SOLR-13013:
---------------------------------------

[~joel.bernstein] I am glad that it looks useful. I expect that it needs at least a full re-implementation of the {{MapWriter}} parts. I am unfamiliar with that part of the code, so it would be great if you took over. I won't take any offense if you rewrite everything. I'd be happy to try and review, or at least do some testing on our 300M docs/segment shards, as they are very affected by the DV API change.
[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order
[ https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16699241#comment-16699241 ]

Joel Bernstein commented on SOLR-13013:
---------------------------------------

I've read through the patch and it looks like a big win. We can probably trade off window size if memory is an issue. The 30,000 magic number was chosen after a lot of testing to determine the best window size: one that limits the number of passes over the results without bogging down the calls to an overly large priority queue. Up to about 30,000 I was seeing performance improvements; as the window size drops and more passes need to be made over the data, performance drops.

[~toke], how would you like to proceed? Do you want to commit this yourself, or would you like me to work on the patch further and get it committed? I can spend time testing in either case.
[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order
[ https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16699102#comment-16699102 ]

Yonik Seeley commented on SOLR-13013:
-------------------------------------

bq. Are you thinking about making something generic? Maybe a bulk request wrapper for doc values, that temporarily re-sorts internally?

Yep. Something that collects out-of-order docIDs along with other value sources that should be internally retrieved mostly in order. It shouldn't hold up this issue, though. I just bring it up to get it on other people's radar (it's been on my TODO list for years...) and because it's related to this issue.
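The bulk wrapper Yonik describes could look roughly like the sketch below. The API shape is purely hypothetical (no such Lucene/Solr class exists); it only shows the collect-then-resolve-in-order pattern, with the value source modelled as a simple functional interface.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical bulk wrapper: callers register docIDs in any order, and the
 *  wrapper resolves them against a forward-friendly source in docID order. */
public class BulkDocValues {

    /** Models any value source that is cheap to read in increasing docID order. */
    interface ForwardSource { long value(int docID); }

    private final List<int[]> requests = new ArrayList<>(); // each entry: {docID, slot}
    private final ForwardSource source;

    BulkDocValues(ForwardSource source) { this.source = source; }

    /** Register a docID now; returns the slot where its value will appear later. */
    int add(int docID) {
        int slot = requests.size();
        requests.add(new int[]{docID, slot});
        return slot;
    }

    /** Resolve all registered docIDs with a single in-order sweep over the source. */
    long[] resolve() {
        long[] out = new long[requests.size()];
        requests.sort(Comparator.comparingInt(r -> r[0]));   // visit in docID order
        for (int[] r : requests) out[r[1]] = source.value(r[0]);
        requests.clear();
        return out;
    }
}
```

The caller keeps the returned slot indices, so results come back in request order even though the underlying reads happen in docID order.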
[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order
[ https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698258#comment-16698258 ]

Yonik Seeley commented on SOLR-13013:
-------------------------------------

Great results! Retrieving results in order in batches has also been a TODO for augmenters (specifically, the ability to retrieve function query results alongside field results) ever since they were added to Solr, since some function queries need to be accessed in order to be efficient. With the change to iterators for docvalues, and the ability to retrieve stored fields using document values, this becomes even more important.
[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order
[ https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698207#comment-16698207 ] Toke Eskildsen commented on SOLR-13013:
---
[~janhoy] Nice idea! The values are collected as Objects, so it would involve some instanceof-checks to estimate the memory overhead. I am unsure how that would affect performance, but it would be great if we could both avoid the risk of OOM and make it simpler for the user.

Technically a fallback would be easy to do. It would even be possible to do a partial fallback: if the memory limit for the buffered values is reached before all 30K SortDocs have been processed, switch back to the standard "sort order DV-resolving with immediate delivery", but use the already collected values whenever possible.

Adjusting the window size for subsequent windows is tricky, as it requires weighing query + sort cost against DV-retrieval cost. It would be possible to collect a runtime profile of the two parts and use that for qualified guessing, but then it starts sounding like quite a large project. If we determine that the base idea of this JIRA issue has merit, the first version could just use a simple fallback to sort order DV-resolving, without any re-use of already collected values, and stay in that mode for the rest of the current export. Re-use and/or window shrinking could be later enhancements.

> Change export to extract DocValues in docID order
> -------------------------------------------------
>
> Key: SOLR-13013
> URL: https://issues.apache.org/jira/browse/SOLR-13013
> Project: Solr
> Issue Type: Improvement
> Security Level: Public (Default Security Level. Issues are Public)
> Components: Export Writer
> Affects Versions: 7.5, master (8.0)
> Reporter: Toke Eskildsen
> Priority: Major
> Fix For: master (8.0)
> Attachments: SOLR-13013_proof_of_concept.patch
>
> The streaming export writer uses a sliding window of 30,000 documents for paging through the result set in a given sort order. Each time a window has been calculated, the values for the export fields are retrieved from the underlying DocValues structures in document sort order and delivered.
>
> The iterative DocValues API introduced in Lucene/Solr 7 does not support random access. The current export implementation bypasses this by creating a new DocValues-iterator for each individual value to retrieve. This slows down export, as the iterator has to seek to the given docID from the start for each value. The slowdown scales with shard size (see LUCENE-8374 for details). An alternative is to extract the DocValues in docID order, with re-use of DocValues-iterators. The idea is as follows:
> # Change the FieldWriters for export to re-use the DocValues-iterators if subsequent requests are for docIDs higher than the previous ones
> # Calculate the sliding window of SortDocs as usual
> # Take note of the order of the SortDocs in the sliding window
> # Re-sort the SortDocs in docID order
> # Extract the DocValues to a temporary on-heap structure
> # Re-sort the extracted values to the original sliding window order
> # Deliver the values
>
> One big difference from the current export code is of course the need to hold the whole sliding-window-sized result set in memory. This might well be a showstopper, as there is no real limit to how large this partial result set can be. Maybe such an optimization could be requested explicitly if the user knows that there is enough memory?

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
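The numbered steps in the description above can be sketched as follows. This is a simplified illustration with made-up types and names (Doc, ForwardValues, extractWindow), not the actual Solr SortDoc/FieldWriter code; it only shows how recording the window positions, re-sorting by docID, and restoring the delivery order fit together under a forward-only values iterator.

```java
import java.util.Arrays;
import java.util.Comparator;

public class WindowExtractSketch {
  // Stand-in for a SortDoc: its docID plus its position in the delivery (sort) order.
  static final class Doc {
    final int docId;
    final int windowPos;
    Doc(int docId, int windowPos) { this.docId = docId; this.windowPos = windowPos; }
  }

  // Stand-in for a Lucene 7+ DocValues-iterator: forward-only access.
  interface ForwardValues {
    long advanceTo(int docId); // callers must pass non-decreasing docIDs
  }

  // Toy values source where value(doc) = doc * 10; it throws if asked to move
  // backwards, mimicking the constraint that makes docID-order extraction necessary.
  static final class TimesTenValues implements ForwardValues {
    private int pos = -1;
    public long advanceTo(int docId) {
      if (docId < pos) throw new IllegalStateException("forward-only iterator");
      pos = docId;
      return docId * 10L;
    }
  }

  // Steps 3-7 from the description: note the window order, re-sort by docID,
  // extract in a single forward pass, and place each value at its delivery slot.
  static long[] extractWindow(Doc[] window, ForwardValues values) {
    Doc[] byDocId = window.clone();
    Arrays.sort(byDocId, Comparator.comparingInt(d -> d.docId));
    long[] buffered = new long[window.length];
    for (Doc d : byDocId) {
      buffered[d.windowPos] = values.advanceTo(d.docId);
    }
    return buffered; // already back in the original sliding-window order
  }
}
```

Note that the single {{Arrays.sort}} plus one forward pass replaces one fresh iterator per value; the buffered array is the "temporary on-heap structure" whose size is the concern discussed in the thread.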
[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order
[ https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698182#comment-16698182 ] Jan Høydahl commented on SOLR-13013:
---
Cool. Should there be a setting for max memory usage and, if it is exceeded, adjust the window size or fall back to the old logic?
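The max-memory guard discussed in the thread could look roughly like the sketch below. All names here are hypothetical, and the instanceof-based size estimates are rough guesses rather than anything from Solr; the point is only the shape: buffer values until a byte budget is exceeded, then signal a fallback to the old per-value resolution path.

```java
import java.util.ArrayList;
import java.util.List;

public class BufferBudgetSketch {
  // Rough per-value footprint via instanceof-checks, as mentioned in the thread.
  // These byte counts are illustrative approximations, not measured values.
  static long estimateBytes(Object value) {
    if (value instanceof Long || value instanceof Double) return 24;  // boxed primitive
    if (value instanceof String) return 40 + 2L * ((String) value).length();
    return 32; // conservative default for other boxed types
  }

  static final class Buffer {
    final long budgetBytes;
    long usedBytes = 0;
    final List<Object> values = new ArrayList<>();
    Buffer(long budgetBytes) { this.budgetBytes = budgetBytes; }

    /** Returns false when the budget would be exceeded: caller should fall back. */
    boolean offer(Object value) {
      long cost = estimateBytes(value);
      if (usedBytes + cost > budgetBytes) return false;
      usedBytes += cost;
      values.add(value);
      return true;
    }
  }
}
```

A partial fallback, as suggested in the thread, would keep the values already in {{values}} and resolve only the remainder in sort order.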
[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order
[ https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698037#comment-16698037 ] Joel Bernstein commented on SOLR-13013:
---
I'll spend some time this week testing out the patch. The approach sounds really promising.
[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order
[ https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697846#comment-16697846 ] Toke Eskildsen commented on SOLR-13013:
---
I have uploaded a proof of concept for the idea in the issue description. The structure that collects and holds the temporary values is made by mashing the keyboard until it worked, and the performance test is Frankensteined from existing unit-test code in TestExportWriter. Nevertheless, the unit tests in TestExportWriter pass, and a performance test can be executed with

{{TES_SIZES="1000,1,10,20,30" ant -Dtests.heapsize=5g -Dtests.codec=Lucene80 -Dtestmethod=testExportSpeed -Dtestcase=TestExportWriter test | grep "TES:"}}

It takes 10+ minutes and writes a summary at the end. For a quicker test, use TES_SIZES="1000,1" or something like that. For my desktop the result was

{{[junit4] 1> TES: Concatenated output:}}
{{[junit4] 1> TES: Test 1/5: 1000 documents, trie: 11098 / 7525 docs/sec ( 147%), points: 7639 / 11552 docs/sec ( 66%)}}
{{[junit4] 1> TES: Test 2/5: 1000 documents, trie: 15135 / 9269 docs/sec ( 163%), points: 27769 / 15986 docs/sec ( 174%)}}
{{[junit4] 1> TES: Test 3/5: 1000 documents, trie: 11505 / 9593 docs/sec ( 120%), points: 37643 / 13584 docs/sec ( 277%)}}
{{[junit4] 1> TES: Test 4/5: 1000 documents, trie: 17495 / 9730 docs/sec ( 180%), points: 39103 / 18222 docs/sec ( 215%)}}
{{[junit4] 1> TES: Test 5/5: 1000 documents, trie: 17657 / 10331 docs/sec ( 171%), points: 37633 / 19104 docs/sec ( 197%)}}
{{[junit4] 1> TES: --}}
{{[junit4] 1> TES: Test 1/5: 1 documents, trie: 17018 / 7901 docs/sec ( 215%), points: 38606 / 12381 docs/sec ( 312%)}}
{{[junit4] 1> TES: Test 2/5: 1 documents, trie: 17191 / 7879 docs/sec ( 218%), points: 39920 / 12404 docs/sec ( 322%)}}
{{[junit4] 1> TES: Test 3/5: 1 documents, trie: 17218 / 7881 docs/sec ( 218%), points: 41696 / 12410 docs/sec ( 336%)}}
{{[junit4] 1> TES: Test 4/5: 1 documents, trie: 17451 / 7884 docs/sec ( 221%), points: 41719 / 12360 docs/sec ( 338%)}}
{{[junit4] 1> TES: Test 5/5: 1 documents, trie: 17227 / 7855 docs/sec ( 219%), points: 41879 / 12436 docs/sec ( 337%)}}
{{[junit4] 1> TES: --}}
{{[junit4] 1> TES: Test 1/5: 10 documents, trie: 15849 / 3718 docs/sec ( 426%), points: 36037 / 4841 docs/sec ( 744%)}}
{{[junit4] 1> TES: Test 2/5: 10 documents, trie: 16348 / 3717 docs/sec ( 440%), points: 37994 / 4858 docs/sec ( 782%)}}
{{[junit4] 1> TES: Test 3/5: 10 documents, trie: 15378 / 3718 docs/sec ( 414%), points: 38831 / 4872 docs/sec ( 797%)}}
{{[junit4] 1> TES: Test 4/5: 10 documents, trie: 16042 / 3710 docs/sec ( 432%), points: 39084 / 4876 docs/sec ( 802%)}}
{{[junit4] 1> TES: Test 5/5: 10 documents, trie: 16009 / 3713 docs/sec ( 431%), points: 39503 / 4865 docs/sec ( 812%)}}
{{[junit4] 1> TES: --}}
{{[junit4] 1> TES: Test 1/5: 20 documents, trie: 15403 / 3031 docs/sec ( 508%), points: 37349 / 3531 docs/sec (1058%)}}
{{[junit4] 1> TES: Test 2/5: 20 documents, trie: 15853 / 3018 docs/sec ( 525%), points: 37509 / 3544 docs/sec (1058%)}}
{{[junit4] 1> TES: Test 3/5: 20 documents, trie: 14993 / 3018 docs/sec ( 497%), points: 38468 / 3547 docs/sec (1084%)}}
{{[junit4] 1> TES: Test 4/5: 20 documents, trie: 15191 / 3023 docs/sec ( 502%), points: 38684 / 3538 docs/sec (1093%)}}
{{[junit4] 1> TES: Test 5/5: 20 documents, trie: 15678 / 3035 docs/sec ( 517%), points: 38729 / 3542 docs/sec (1093%)}}
{{[junit4] 1> TES: --}}
{{[junit4] 1> TES: Test 1/5: 30 documents, trie: 15529 / 2834 docs/sec ( 548%), points: 36911 / 3652 docs/sec (1011%)}}
{{[junit4] 1> TES: Test 2/5: 30 documents, trie: 15455 / 2846 docs/sec ( 543%), points: 37705 / 3630 docs/sec (1039%)}}
{{[junit4] 1> TES: Test 3/5: 30 documents, trie: 15805 / 2866 docs/sec ( 551%), points: 37583 / 3660 docs/sec (1027%)}}
{{[junit4] 1> TES: Test 4/5: 30 documents, trie: 15653 / 2883 docs/sec ( 543%), points: 39365 / 3591 docs/sec (1096%)}}
{{[junit4] 1> TES: Test 5/5: 30 documents, trie: 15736 / 2895 docs/sec ( 543%), points: 38606 / 3667 docs/sec (1053%)}}

The two numbers for trie and points are sorted followed by non_sorted; the numbers in the parentheses are sorted/non_sorted. As can be seen, non_sorted export performance degrades as index size (measured in number of documents) goes up. Also, as can be seen from the percentages, reusing the DocValues-iterators and ensuring docID order improved the speed significantly.

The patch is not at all production-ready. See it as a "is this idea worth exploring?". Ping to [~joel.bernstein], as I expect he will be interested in this.
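The speedups above come from eliminating repeated seeks from docID 0. A toy cost model (illustrative only, not Lucene code; the names and unit costs are made up) makes the asymmetry concrete: a freshly created forward-only iterator must walk from the start for every value, while a reused iterator extracting in docID order advances past each document at most once per window.

```java
import java.util.Arrays;

public class IteratorCostSketch {
  // Cost of the current approach: each lookup creates a fresh iterator
  // that walks from docID 0 up to the requested docID.
  static long costRecreated(int[] docIdsInSortOrder) {
    long steps = 0;
    for (int docId : docIdsInSortOrder) {
      steps += docId + 1; // fresh iterator visits 0..docId
    }
    return steps;
  }

  // Cost of the proposed approach: re-sort to docID order and advance
  // a single reused iterator only by the forward distance each time.
  static long costReused(int[] docIdsInSortOrder) {
    int[] sorted = docIdsInSortOrder.clone();
    Arrays.sort(sorted);
    long steps = 0;
    int pos = -1;
    for (int docId : sorted) {
      steps += docId - pos;
      pos = docId;
    }
    return steps;
  }
}
```

Under this model the recreated-iterator cost grows with the sum of the docIDs (hence with shard size, matching the degradation of the non_sorted numbers above), while the reused-iterator cost is bounded by the highest docID in the window.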