[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-510028794 @jpountz JFYI I ran ant beast in 3 batches of 10 times each and it ran clean: WARNING: All illegal access operations will be denied in a future release [beaster] Beast round 1 results: /Users/atris/merge_refactor/lucene-solr/lucene/build/sandbox/test/1 [beaster] Beast round 2 results: /Users/atris/merge_refactor/lucene-solr/lucene/build/sandbox/test/2 [beaster] Beast round 3 results: /Users/atris/merge_refactor/lucene-solr/lucene/build/sandbox/test/3 [beaster] Beast round 4 results: /Users/atris/merge_refactor/lucene-solr/lucene/build/sandbox/test/4 [beaster] Beast round 5 results: /Users/atris/merge_refactor/lucene-solr/lucene/build/sandbox/test/5 [beaster] Beast round 6 results: /Users/atris/merge_refactor/lucene-solr/lucene/build/sandbox/test/6 [beaster] Beast round 7 results: /Users/atris/merge_refactor/lucene-solr/lucene/build/sandbox/test/7 [beaster] Beast round 8 results: /Users/atris/merge_refactor/lucene-solr/lucene/build/sandbox/test/8 [beaster] Beast round 9 results: /Users/atris/merge_refactor/lucene-solr/lucene/build/sandbox/test/9 [beaster] Beast round 10 results: /Users/atris/merge_refactor/lucene-solr/lucene/build/sandbox/test/10 [beaster] Beasting finished Successfully. -check-totals: beast: BUILD SUCCESSFUL This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-510007530 > @atris Can you give some result comparison or other evidence? Please see the blog post listed above. In addition, the targeted use case for this collector makes the benefits obvious: I ran a TermQuery on a 5 TB dataset manufactured from Elasticsearch's nyc_taxis dataset. A string field was modified to have a specific value only for 50,000 documents, and number of hits requested was 25,00,000 by each of 30 concurrent requests (the same test mentioned in Toke's blogpost). The space savings were in order of multiple GBs, as expected. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-509928095 Not just the cost of prepopulating the PQ, but also building it in first place. If total number of hits < number of requested hits and number of requested hits is a large value, then it makes sense to skip the PQ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-509655033 I just discovered ant beast that seems like the right tool for ensuring that such random tests get reproduced and are not dependent on lucky/unlucky seeds. Further PRs will have ant beast run on the new random tests before I raise a PR This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-509599328 @jpountz Thanks for looking through! I refactored the tests to remove the dependency on random() to populate the string Field. That triggered the test for me in the first run. Maybe something to do with the seed? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-509588693 @jpountz @mikemccand Thanks for your comments, fixed the same. I am surprised that the test is failing : it seems to be consistently passing for me. I ran a couple of times. I have also reduced the dataset for tests: there is no real functional difference between 100 and 250k documents for testing, since these are not performance tests. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-509534291 Added a dedicated test for checking order of hits when TopDocs are returned from just the hits list and no PQ is built This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-509525838 @jpountz Thanks, I have fixed the comments. The reason I did not use the BooleanQuery for the PQ build and not build tests was because those tests only exercise whether the priority queue was built or not which is dependent on the number of hits collected vs number of hits requested. However, per your suggestion, I have fixed the same. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-509518620 FYI ant precommit passes -- ran it with latest iteration This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-509207157 @jpountz I have updated the PR per your comments. ant precommit passes. Apologies, this iteration also got force pushed. I have a local daemon which auto squashes and force pushes to my fork each time I create a new commit on a branch (this was meant to help the committer merge the PR without needing to squash). I will disable that in the next runs. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-509169211 @jpountz Thanks for the comments, update the PR. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-509109583 @jpountz I have pushed a new iteration which does as discussed i.e. builds a hits list and populates hits as long as the number of collected hits are lesser than number of hits requested. Once that threshold is reached, a priority queue is built and minimum competitive score is set and used to filter further hits. Please let me know if it looks fine. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-508914746 @jpountz @tokee Thanks for your comments. I am planning to maintain an ArrayList of ScoreDocs and collect numHits, then do a sort on score and return top N as Adrien suggested. This should optimize the performance bottleneck around PQ allocating slots and prepopulating sentinel values. I am inclined to pursue the idea of not using ScoreDoc and representing score + docID as a separate issue primarily due to the nuances involved with handling shard indices. I do feel that should not be a major blocker anymore given the fact that TopDocs.merge can now tie break on docIDs and ignore shard indices completely during tie breaking. Thoughts? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-508804369 > Actually I don't think we need a growable priority queue. For such large number of hits it'd be probably more efficient to collect hits in an ArrayList first and only turn it into a PQ once there are `numHits` hits? Would that mean collecting all hits in the ArrayList, building a heap out of them iteratively once we exhaust documents and then calling top() numHits times? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-508784416 > Not prepopulating the hit queue is only one part of the problem, we would also need to not allocate `numHits` slots in the priority queue right away? > That would require a different PriorityQueue implementation, since the default implementation always allocates numHits slots and sets maxSize to numHits. We would need an implementation that only allocates two slots initially, but has the ability to "resize" as needed. WDYT? > I'd rather like that we don't do any change to TopDocsCollector and not try to extend it, in my opinion those collectors are solving a very different problem and would need to evolve independently. My main objective to use TopScoreDocsCollector was to reuse code as much as I could -- but I agree with your point. Will change the implementation and keep TopScoreDocsCollector pristine. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-508377686 Any thoughts on this? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-507544094 I ran some tests with N requested as 100K but the matching docs being only around 10K. This gives a sizeable improvement in the overall size of the priority queue. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org