RE: Increase search performance
Thanks for the feedback!

-Original Message-
From: Adrien Grand [mailto:jpou...@gmail.com]
Sent: Friday, February 02, 2018 1:42 PM
To: java-user@lucene.apache.org
Subject: Re: Increase search performance

> If needsScores returns false on the collector, then scores won't be
> computed. Your prototype should work well.
Re: Increase search performance
If needsScores returns false on the collector, then scores won't be computed. Your prototype should work well.

On Fri, Feb 2, 2018 at 04:46, Atul Bisaria <atul.bisa...@ericsson.com> wrote:

> Hi Adrien,
>
> Please correct me if I am wrong, but I believe using an extended
> IntComparator in a custom Sort object for randomization would still score
> documents (using IndexSearcher.search(Query, int, Sort), for example).
>
> So I tried a custom collector via IndexSearcher.search(Query, Collector),
> where the collector does not score documents at all.
>
> I have refactored RandomOrderCollector to fix the memory usage problem as
> described below. Let me know if this looks OK now.
RE: Increase search performance
Hi Adrien,

Please correct me if I am wrong, but I believe using an extended IntComparator in a custom Sort object for randomization would still score documents (using IndexSearcher.search(Query, int, Sort), for example).

So I tried a custom collector via IndexSearcher.search(Query, Collector), where the collector does not score documents at all.

I have refactored RandomOrderCollector to fix the memory usage problem as described below. Let me know if this looks OK now.

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.Random;

    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.SimpleCollector;

    class RandomOrderCollector extends SimpleCollector {
        private final int maxHitsRequired;
        private int docBase;
        private final ScoreDoc[] matches;
        private int numHits;
        private final Random random = new Random();

        public RandomOrderCollector(int maxHitsRequired) {
            this.maxHitsRequired = maxHitsRequired;
            this.matches = new ScoreDoc[maxHitsRequired];
        }

        @Override
        public boolean needsScores() {
            return false;
        }

        @Override
        public void collect(int doc) throws IOException {
            int absoluteDoc = docBase + doc;
            int randomScore = random.nextInt(); // assign a random score to each doc

            if (numHits < maxHitsRequired) {
                matches[numHits++] = new ScoreDoc(absoluteDoc, randomScore);
            } else {
                // replace a randomly chosen slot if the new doc drew a higher score
                int index = random.nextInt(maxHitsRequired);
                if (matches[index].score < randomScore) {
                    matches[index] = new ScoreDoc(absoluteDoc, randomScore);
                }
            }
        }

        @Override
        protected void doSetNextReader(LeafReaderContext context) throws IOException {
            super.doSetNextReader(context);
            this.docBase = context.docBase;
        }

        public ScoreDoc[] getHits() {
            // trim trailing nulls when fewer than maxHitsRequired docs matched
            return numHits < maxHitsRequired ? Arrays.copyOf(matches, numHits) : matches;
        }
    }

Best Regards,
Atul Bisaria

-Original Message-
From: Adrien Grand [mailto:jpou...@gmail.com]
Sent: Thursday, February 01, 2018 6:11 PM
To: java-user@lucene.apache.org
Subject: Re: Increase search performance

> Yes, this collector won't perform well if you have many matches since
> memory usage is linear with the number of matches.
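The fixed-size replacement scheme in the refactored collector is close in spirit to classic reservoir sampling. As a point of comparison, here is a self-contained sketch of textbook Algorithm R (my own illustration, not code from the thread): it also uses O(k) memory regardless of how many docs match, but its acceptance rule (keep the n-th element with probability k/n) yields a uniform sample, whereas the collector's replace-if-greater heuristic is an approximation.

```java
import java.util.*;

// Illustrative sketch (not from the thread): textbook reservoir sampling
// ("Algorithm R"). Keeps a uniformly random sample of up to k elements from
// a stream of unknown length in O(k) memory.
public class ReservoirSample {

    static int[] sample(Iterator<Integer> stream, int k, Random random) {
        int[] reservoir = new int[k];
        int seen = 0;
        while (stream.hasNext()) {
            int value = stream.next();
            if (seen < k) {
                reservoir[seen] = value;              // fill the reservoir first
            } else {
                // element number (seen + 1) survives with probability k / (seen + 1)
                int slot = random.nextInt(seen + 1);
                if (slot < k) {
                    reservoir[slot] = value;          // evict a uniformly chosen victim
                }
            }
            seen++;
        }
        // trim trailing unused slots when the stream was shorter than k
        return seen < k ? Arrays.copyOf(reservoir, seen) : reservoir;
    }

    public static void main(String[] args) {
        // sample 10 "doc ids" out of a stream of 1000
        List<Integer> docs = new ArrayList<>();
        for (int d = 0; d < 1000; d++) docs.add(d);
        int[] picked = sample(docs.iterator(), 10, new Random(42));
        System.out.println(Arrays.toString(picked)); // 10 distinct ids in [0, 1000)
    }
}
```

A collector's collect(docBase + doc) calls would feed the stream; only getHits() changes between the two acceptance rules.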
Re: Increase search performance
Yes, this collector won't perform well if you have many matches since memory usage is linear with the number of matches. A better option would be to extend e.g. IntComparator and implement getNumericDocValues by returning a fake NumericDocValues instance that e.g. does a bit mix of the doc id and a per-request seed (for instance, HPPC's BitMixer can do that:
https://github.com/carrotsearch/hppc/blob/master/hppc/src/main/java/com/carrotsearch/hppc/BitMixer.java ).

On Thu, Feb 1, 2018 at 12:31, Atul Bisaria <atul.bisa...@ericsson.com> wrote:

> Hi Adrien,
>
> Thanks for your reply.
>
> I have also tried testing with UsageTrackingQueryCachingPolicy, but did
> not observe a significant change in either latency or throughput.
>
> Given my specific search requirements of no scoring and sorting the
> search results in a random order (the reason for the custom Sort object),
> I have also explored writing a custom collector and observed quite a
> difference in latency figures.
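To make the suggestion concrete, below is a self-contained sketch of the idea. It is my own illustration, not the linked HPPC class: I use the MurmurHash3 fmix32 finalizer as a stand-in for BitMixer. Because every step of fmix32 is invertible, the function is a bijection on 32-bit ints, so mixing the doc id with a per-request seed assigns every document a distinct, well-scattered pseudo-random sort key with no per-match memory; a custom comparator's fake NumericDocValues would simply return these keys.

```java
// Illustrative sketch: a MurmurHash3-style finalizer ("fmix32") standing in
// for HPPC's BitMixer. Distinct doc ids always map to distinct keys for a
// fixed seed, and a new seed reshuffles the order on every request.
public class DocKeyMixer {

    static int fmix32(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // The per-request random sort key a fake NumericDocValues could return.
    static int randomKey(int docId, int requestSeed) {
        return fmix32(docId ^ requestSeed);
    }

    public static void main(String[] args) {
        int seed = 0x9e3779b9; // per-request seed, fixed here for the demo
        for (int doc = 0; doc < 5; doc++) {
            System.out.println(doc + " -> " + randomKey(doc, seed));
        }
    }
}
```

Sorting by these keys gives a random but stable order within one request, which is what a Sort-based approach needs.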
RE: Increase search performance
Hi Adrien,

Thanks for your reply.

I have also tried testing with UsageTrackingQueryCachingPolicy, but did not observe a significant change in either latency or throughput.

Given my specific search requirements of no scoring and sorting the search results in a random order (the reason for the custom Sort object), I have also explored writing a custom collector and observed quite a difference in latency figures.

Let me know if this custom collector code has any loopholes I could be missing:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.search.SimpleCollector;

    class RandomOrderCollector extends SimpleCollector {
        private int maxHitsRequired;
        private int docBase;

        private final List<Integer> matches = new ArrayList<>();

        public RandomOrderCollector(int maxHitsRequired) {
            this.maxHitsRequired = maxHitsRequired;
        }

        @Override
        public boolean needsScores() {
            return false;
        }

        @Override
        public void collect(int doc) throws IOException {
            matches.add(docBase + doc); // memory grows with every match
        }

        @Override
        protected void doSetNextReader(LeafReaderContext context) throws IOException {
            super.doSetNextReader(context);
            this.docBase = context.docBase;
        }

        public List<Integer> getHits() {
            Collections.shuffle(matches);
            maxHitsRequired = Math.min(matches.size(), maxHitsRequired);
            return matches.subList(0, maxHitsRequired);
        }
    }

Best Regards,
Atul Bisaria

-Original Message-
From: Adrien Grand [mailto:jpou...@gmail.com]
Sent: Wednesday, January 31, 2018 6:33 PM
To: java-user@lucene.apache.org
Subject: Re: Increase search performance

> If you don't sort by score, then wrapping with a ConstantScoreQuery won't
> help, as Lucene will figure out that scores are not needed anyway.
Re: Increase search performance
Hi Atul,

On Tue, Jan 30, 2018 at 16:24, Atul Bisaria <atul.bisa...@ericsson.com> wrote:

> 1. Using ConstantScoreQuery so that scoring overhead is removed, since
> scoring is not required in my search use case. I also use a custom Sort
> object which does not sort by score (see code below).

If you don't sort by score, then wrapping with a ConstantScoreQuery won't help, as Lucene will figure out that scores are not needed anyway.

> 2. Using query cache
>
> My understanding is that the query cache would cache query results and
> hence lead to a significant increase in performance. Is this
> understanding correct?

It depends what you mean by performance. If you are optimizing for worst-case latency, then the query cache might make things worse, because caching a query requires visiting all of its matches, while live query execution can sometimes skip over non-interesting matches (e.g. in conjunctions).

However, if you are looking to improve throughput, then the query cache's default policy of caching queries that look reused usually helps.

> I am using Lucene version 5.4.1, where the query cache seems to be
> enabled by default (https://issues.apache.org/jira/browse/LUCENE-6784),
> but I am not able to see any significant change in search performance.
>
> Here is the code I am testing with:
>
>     DirectoryReader reader = DirectoryReader.open(directory); // using MMapDirectory
>
>     IndexSearcher searcher = new IndexSearcher(reader);
>     // IndexReader and IndexSearcher are created only once
>
>     searcher.setQueryCachingPolicy(QueryCachingPolicy.ALWAYS_CACHE);

Don't do that: this will always cache all filters, which usually makes things slower for the reason mentioned above. I would rather advise that you use an instance of UsageTrackingQueryCachingPolicy.
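The worst-case-latency point can be sketched with plain arrays (an illustration of the general principle, not Lucene code; all names here are mine). A conjunction that leads with its sparsest clause can advance() past almost all of a dense clause's postings, while building a cache entry for the dense clause would have to touch every one of its matches once:

```java
import java.util.*;

// Illustration (not Lucene code): intersecting a sparse and a dense sorted
// postings list. Leading with the sparse clause and using binary-search
// advance() probes only a handful of positions in the dense list, whereas
// caching the dense clause means iterating all of its matches.
public class SkipVsScanDemo {

    // Position 'postings' at the first index >= target, like a skip list.
    static int advance(int[] postings, int from, int target) {
        int idx = Arrays.binarySearch(postings, from, postings.length, target);
        return idx >= 0 ? idx : -idx - 1;
    }

    // Returns the intersection; counts one probe of the dense list per lead doc.
    static List<Integer> intersect(int[] sparse, int[] dense, int[] probes) {
        List<Integer> hits = new ArrayList<>();
        int j = 0;
        for (int doc : sparse) {
            j = advance(dense, j, doc);
            probes[0]++;
            if (j < dense.length && dense[j] == doc) hits.add(doc);
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] dense = new int[100_000];
        for (int d = 0; d < dense.length; d++) dense[d] = 2 * d;  // every even doc id
        int[] sparse = {10, 31, 50_000, 120_000};

        int[] probes = {0};
        System.out.println(intersect(sparse, dense, probes));  // [10, 50000, 120000]
        System.out.println(probes[0] + " probes vs " + dense.length + " docs to cache");
    }
}
```

Four probes against the dense list, versus 100,000 iterations to materialize it into a cached bit set: that is the gap Adrien describes, and why ALWAYS_CACHE can hurt tail latency.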