[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits

2019-07-10 Thread GitBox
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For 
Large Number Of Hits
URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-510028794
 
 
   @jpountz JFYI I ran ant beast in 3 batches of 10 times each and it ran clean:
   
   WARNING: All illegal access operations will be denied in a future release
 [beaster] Beast round 1 results: 
/Users/atris/merge_refactor/lucene-solr/lucene/build/sandbox/test/1
 [beaster] Beast round 2 results: 
/Users/atris/merge_refactor/lucene-solr/lucene/build/sandbox/test/2
 [beaster] Beast round 3 results: 
/Users/atris/merge_refactor/lucene-solr/lucene/build/sandbox/test/3
 [beaster] Beast round 4 results: 
/Users/atris/merge_refactor/lucene-solr/lucene/build/sandbox/test/4
 [beaster] Beast round 5 results: 
/Users/atris/merge_refactor/lucene-solr/lucene/build/sandbox/test/5
 [beaster] Beast round 6 results: 
/Users/atris/merge_refactor/lucene-solr/lucene/build/sandbox/test/6
 [beaster] Beast round 7 results: 
/Users/atris/merge_refactor/lucene-solr/lucene/build/sandbox/test/7
 [beaster] Beast round 8 results: 
/Users/atris/merge_refactor/lucene-solr/lucene/build/sandbox/test/8
 [beaster] Beast round 9 results: 
/Users/atris/merge_refactor/lucene-solr/lucene/build/sandbox/test/9
 [beaster] Beast round 10 results: 
/Users/atris/merge_refactor/lucene-solr/lucene/build/sandbox/test/10
 [beaster] Beasting finished Successfully.
   
   -check-totals:
   
   beast:
   
   BUILD SUCCESSFUL


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits

2019-07-10 Thread GitBox
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For 
Large Number Of Hits
URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-510007530
 
 
   > @atris Can you give some result comparison or other evidence?
   
   Please see the blog post listed above.
   
   In addition, the targeted use case for this collector makes the benefits 
obvious: I ran a TermQuery on a 5 TB dataset manufactured from Elasticsearch's 
nyc_taxis dataset. A string field was modified to have a specific value only 
for 50,000 documents, and number of hits requested was 25,00,000 by each of 30 
concurrent requests (the same test mentioned in Toke's blogpost). The space 
savings were in order of multiple GBs, as expected.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits

2019-07-10 Thread GitBox
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For 
Large Number Of Hits
URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-509928095
 
 
   Not just the cost of prepopulating the PQ, but also building it in first 
place. If total number of hits < number of requested hits and number of 
requested hits is  a large value, then it makes sense to skip the PQ


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits

2019-07-09 Thread GitBox
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For 
Large Number Of Hits
URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-509655033
 
 
   I just discovered ant beast that seems like the right tool for ensuring that 
such random tests get reproduced and are not dependent on lucky/unlucky seeds. 
Further PRs will have ant beast run on the new random tests before I raise a PR


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits

2019-07-09 Thread GitBox
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For 
Large Number Of Hits
URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-509599328
 
 
   @jpountz Thanks for looking through!
   
   I refactored the tests to remove the dependency on random() to populate the 
string Field.  That triggered the test for me in the first run. Maybe something 
to do with the seed?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits

2019-07-09 Thread GitBox
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For 
Large Number Of Hits
URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-509588693
 
 
   @jpountz @mikemccand Thanks for your comments, fixed the same.
   
   I am surprised that the test is failing : it seems to be consistently 
passing for me. I ran a couple of times.
   
   I have also reduced the dataset for tests: there is no real functional 
difference between 100 and 250k documents for testing, since these are not 
performance tests.
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits

2019-07-09 Thread GitBox
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For 
Large Number Of Hits
URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-509534291
 
 
   Added a dedicated test for checking order of hits when TopDocs are returned 
from just the hits list and no PQ is built


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits

2019-07-09 Thread GitBox
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For 
Large Number Of Hits
URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-509525838
 
 
   @jpountz Thanks, I have fixed the comments. The reason I did not use the 
BooleanQuery for the PQ build and not build tests was because those tests only 
exercise whether the priority queue was built or not which is dependent on the 
number of hits collected vs number of hits requested. However, per your 
suggestion, I have fixed the same.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits

2019-07-09 Thread GitBox
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For 
Large Number Of Hits
URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-509518620
 
 
   FYI ant precommit passes -- ran it with latest iteration


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits

2019-07-08 Thread GitBox
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For 
Large Number Of Hits
URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-509207157
 
 
   @jpountz  I have updated the PR per your comments. ant precommit passes.
   
   Apologies, this iteration also got force pushed. I have a local daemon which 
auto squashes and force pushes to my fork each time I create a new commit on a 
branch (this was meant to help the committer merge the PR without needing to 
squash). I will disable that in the next runs.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits

2019-07-08 Thread GitBox
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For 
Large Number Of Hits
URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-509169211
 
 
   @jpountz Thanks for the comments, update the PR.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits

2019-07-08 Thread GitBox
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For 
Large Number Of Hits
URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-509109583
 
 
   @jpountz I have pushed a new iteration which does as discussed i.e. builds a 
hits list and populates hits as long as the number of collected hits are lesser 
than number of hits requested. Once that threshold is reached, a priority queue 
is built and minimum competitive score is set and used to filter further hits.
   
   Please let me know if it looks fine.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits

2019-07-06 Thread GitBox
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For 
Large Number Of Hits
URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-508914746
 
 
   @jpountz @tokee Thanks for your comments.
   
   I am planning to maintain an ArrayList of ScoreDocs and collect numHits, 
then do a sort on score and return top N as Adrien suggested. This should 
optimize the performance bottleneck around PQ allocating slots and 
prepopulating sentinel values.
   
   I am inclined to pursue the idea of not using ScoreDoc and representing 
score + docID as a separate issue primarily due to the nuances involved with 
handling shard indices. I do feel that should not be a major blocker anymore 
given the fact that TopDocs.merge can now tie break on docIDs and ignore shard 
indices completely during tie breaking.
   
   Thoughts?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits

2019-07-05 Thread GitBox
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For 
Large Number Of Hits
URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-508804369
 
 
   
   > Actually I don't think we need a growable priority queue. For such large 
number of hits it'd be probably more efficient to collect hits in an ArrayList 
first and only turn it into a PQ once there are `numHits` hits?
   
   Would that mean collecting all hits in the ArrayList, building a heap out of 
them iteratively once we exhaust documents and then calling top() numHits times?
   
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits

2019-07-05 Thread GitBox
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For 
Large Number Of Hits
URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-508784416
 
 
   > Not prepopulating the hit queue is only one part of the problem, we would 
also need to not allocate `numHits` slots in the priority queue right away?
   >
   That would require a different PriorityQueue implementation, since the 
default implementation always allocates numHits slots and sets maxSize to 
numHits. We would need an implementation that only allocates two slots 
initially, but has the ability to "resize" as needed. WDYT?
   
   > I'd rather like that we don't do any change to TopDocsCollector and not 
try to extend it, in my opinion those collectors are solving a very different 
problem and would need to evolve independently.
   
   My main objective to use TopScoreDocsCollector was to reuse code as much as 
I could -- but I agree with your point. Will change the implementation and keep 
TopScoreDocsCollector pristine.
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits

2019-07-04 Thread GitBox
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For 
Large Number Of Hits
URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-508377686
 
 
   Any thoughts on this?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For Large Number Of Hits

2019-07-02 Thread GitBox
atris commented on issue #754: LUCENE-8875: Introduce Optimized Collector For 
Large Number Of Hits
URL: https://github.com/apache/lucene-solr/pull/754#issuecomment-507544094
 
 
   I ran some tests with N requested as 100K but the matching docs being only 
around 10K. This gives a sizeable improvement in the overall size of the 
priority queue.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org