RE: Increase search performance
Thanks for the feedback!

-----Original Message-----
From: Adrien Grand [mailto:jpou...@gmail.com]
Sent: Friday, February 02, 2018 1:42 PM
To: java-user@lucene.apache.org
Subject: Re: Increase search performance

If needsScores returns false on the collector, then scores won't be computed. Your prototype should work well.
RE: Increase search performance
Hi Adrien,

Please correct me if I am wrong, but I believe using an extended IntComparator in a custom Sort object for randomization would still score documents (using IndexSearcher.search(Query, int, Sort), for example). So I tried a custom collector via IndexSearcher.search(Query, Collector), where the custom collector does not score documents at all.

I have refactored RandomOrderCollector to fix the memory usage problem as described below. Let me know if this looks ok now.

    import java.io.IOException;
    import java.util.Random;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.SimpleCollector;

    class RandomOrderCollector extends SimpleCollector {
        private int maxHitsRequired;
        private int docBase;
        private ScoreDoc[] matches;
        private int numHits;
        private Random random = new Random();

        public RandomOrderCollector(int maxHitsRequired) {
            this.maxHitsRequired = maxHitsRequired;
            this.matches = new ScoreDoc[maxHitsRequired];
        }

        @Override
        public boolean needsScores() {
            return false;
        }

        @Override
        public void collect(int doc) throws IOException {
            int absoluteDoc = docBase + doc;
            int randomScore = random.nextInt(); // assign a random score to each doc
            if (numHits < maxHitsRequired) {
                matches[numHits++] = new ScoreDoc(absoluteDoc, randomScore);
            } else {
                int index = random.nextInt(maxHitsRequired);
                if (matches[index].score < randomScore) {
                    matches[index] = new ScoreDoc(absoluteDoc, randomScore);
                }
            }
        }

        @Override
        protected void doSetNextReader(LeafReaderContext context) throws IOException {
            super.doSetNextReader(context);
            this.docBase = context.docBase;
        }

        public ScoreDoc[] getHits() {
            return matches;
        }
    }

Best Regards,
Atul Bisaria

-----Original Message-----
From: Adrien Grand [mailto:jpou...@gmail.com]
Sent: Thursday, February 01, 2018 6:11 PM
To: java-user@lucene.apache.org
Subject: Re: Increase search performance

Yes, this collector won't perform well if you have many matches since memory usage is linear with the number of matches. A better option would be to extend e.g. IntComparator and implement getNumericDocValues by returning a fake NumericDocValues instance that e.g. does a bit mix of the doc id and a per-request seed (for instance HPPC's BitMixer can do that: https://github.com/carrotsearch/hppc/blob/master/hppc/src/main/java/com/carrotsearch/hppc/BitMixer.java ).
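Adrien's suggestion above, deriving a sort key by bit-mixing the doc id with a per-request seed instead of storing a random score per hit, can be sketched in plain Java. The mixer below is a splitmix64-style finalizer, similar in spirit to HPPC's BitMixer; the class name `RandomSortKey` and the way it is wired into a comparator are illustrative assumptions, not Lucene or HPPC API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the bit-mix idea: a deterministic, well-distributed sort key
// per (docId, seed) pair, so no per-hit state has to be stored.
// RandomSortKey is a hypothetical name; the constants are the splitmix64
// finalizer constants (similar in spirit to HPPC's BitMixer).
public class RandomSortKey {

    // Mix a doc id with a per-request seed into a pseudo-random long.
    // Every step is invertible, so the function is a bijection of
    // (docId ^ seed): distinct inputs never collide.
    public static long mix(int docId, long seed) {
        long z = docId ^ seed;
        z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
        z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
        return z ^ (z >>> 33);
    }

    // Order doc ids by their mixed key: a repeatable "random" order that
    // changes whenever the per-request seed changes.
    public static List<Integer> randomOrder(List<Integer> docIds, long seed) {
        List<Integer> sorted = new ArrayList<>(docIds);
        sorted.sort(Comparator.comparingLong(doc -> mix(doc, seed)));
        return sorted;
    }
}
```

In a real index one would plug such a mixer into the comparator's fake NumericDocValues, as Adrien describes, so the "random" sort happens inside Lucene without materializing any per-hit state.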
RE: Increase search performance
Hi Adrien,

Thanks for your reply.

I have also tried testing with UsageTrackingQueryCachingPolicy, but did not observe a significant change in either latency or throughput.

Given my specific search requirements of no scoring and sorting the search results in a random order (the reason for the custom Sort object), I have also explored writing a custom collector, and could observe quite a difference in latency figures. Let me know if this custom collector code has any loopholes I could be missing:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.search.SimpleCollector;

    class RandomOrderCollector extends SimpleCollector {
        private int maxHitsRequired;
        private int docBase;
        private List<Integer> matches = new ArrayList<>();

        public RandomOrderCollector(int maxHitsRequired) {
            this.maxHitsRequired = maxHitsRequired;
        }

        @Override
        public boolean needsScores() {
            return false;
        }

        @Override
        public void collect(int doc) throws IOException {
            matches.add(docBase + doc);
        }

        @Override
        protected void doSetNextReader(LeafReaderContext context) throws IOException {
            super.doSetNextReader(context);
            this.docBase = context.docBase;
        }

        public List<Integer> getHits() {
            Collections.shuffle(matches);
            maxHitsRequired = Math.min(matches.size(), maxHitsRequired);
            return matches.subList(0, maxHitsRequired);
        }
    }

Best Regards,
Atul Bisaria

-----Original Message-----
From: Adrien Grand [mailto:jpou...@gmail.com]
Sent: Wednesday, January 31, 2018 6:33 PM
To: java-user@lucene.apache.org
Subject: Re: Increase search performance

Hi Atul,

On Tue, Jan 30, 2018 at 16:24, Atul Bisaria <atul.bisa...@ericsson.com> wrote:

> 1. Using ConstantScoreQuery so that scoring overhead is removed since
> scoring is not required in my search use case. I also use a custom
> Sort object which does not sort by score (see code below).

If you don't sort by score, then wrapping with a ConstantScoreQuery won't help, as Lucene will figure out scores are not needed anyway.

> 2. Using query cache
>
> My understanding is that query cache would cache query results and
> hence lead to significant increase in performance. Is this understanding
> correct?

It depends what you mean by performance. If you are optimizing for worst-case latency, then the query cache might make things worse, because caching a query requires visiting all matches, while query execution can sometimes just skip over non-interesting matches (e.g. in conjunctions). However, if you are looking to improve throughput, then the default policy of caching queries that look reused usually helps.

> I am using Lucene version 5.4.1 where query cache seems to be enabled
> by default (https://issues.apache.org/jira/browse/LUCENE-6784), but I
> am not able to see any significant change in search performance.
>
> Here is the code I am testing with:
>
> DirectoryReader reader = DirectoryReader.open(directory); // using MMapDirectory
> IndexSearcher searcher = new IndexSearcher(reader); // IndexReader and IndexSearcher are created only once
> searcher.setQueryCachingPolicy(QueryCachingPolicy.ALWAYS_CACHE);

Don't do that, this will always cache all filters, which usually makes things slower for the reason mentioned above. I would rather advise that you use an instance of UsageTrackingQueryCachingPolicy.
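The shuffle-based collector quoted above holds every match in memory before shuffling. A standard fix for that, shown below on plain ints so it runs outside Lucene, is reservoir sampling: keep a uniform random sample of k doc ids in O(k) memory while streaming over the matches. `ReservoirSampler` is an illustrative name, not Lucene API; in a real SimpleCollector the same logic would live in collect(int).

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative reservoir sampler (not Lucene API): keeps a uniform random
// sample of k doc ids from a stream of matches in O(k) memory, avoiding
// the collect-everything-then-shuffle approach.
public class ReservoirSampler {
    private final int[] reservoir;
    private final Random random;
    private int seen; // number of docs collected so far

    public ReservoirSampler(int k, long seed) {
        this.reservoir = new int[k];
        this.random = new Random(seed);
    }

    // Called once per matching doc, mirroring Collector.collect(int).
    public void collect(int doc) {
        if (seen < reservoir.length) {
            reservoir[seen] = doc; // fill the reservoir first
        } else {
            // Keep the new doc with probability k / (seen + 1); this leaves
            // every doc seen so far equally likely to be in the sample.
            int slot = random.nextInt(seen + 1);
            if (slot < reservoir.length) {
                reservoir[slot] = doc;
            }
        }
        seen++;
    }

    // The sampled doc ids (fewer than k if fewer matches were seen).
    public int[] sample() {
        return Arrays.copyOf(reservoir, Math.min(seen, reservoir.length));
    }
}
```

This keeps the latency benefit of skipping scoring while bounding memory by maxHitsRequired rather than by the total number of matches.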
Increase search performance
In the search use case in my application, I don't need to score query results since all results are equal. Query patterns are also more or less fixed. Given these conditions, I am trying to increase search performance by:

1. Using ConstantScoreQuery so that scoring overhead is removed, since scoring is not required in my search use case. I also use a custom Sort object which does not sort by score (see code below). Is this enough to remove scoring overhead in search?

2. Using query cache.

My understanding is that the query cache would cache query results and hence lead to a significant increase in performance. Is this understanding correct? I am using Lucene version 5.4.1, where the query cache seems to be enabled by default (https://issues.apache.org/jira/browse/LUCENE-6784), but I am not able to see any significant change in search performance.

Here is the code I am testing with:

    DirectoryReader reader = DirectoryReader.open(directory); // using MMapDirectory
    IndexSearcher searcher = new IndexSearcher(reader); // IndexReader and IndexSearcher are created only once
    searcher.setQueryCachingPolicy(QueryCachingPolicy.ALWAYS_CACHE);

    // search code
    QueryParser parser = new QueryParser("fieldname", analyzer);
    Query query = new ConstantScoreQuery(parser.parse("text"));
    ScoreDoc[] hits = searcher.search(query, 20, sort).scoreDocs;

Given the above conditions in my application, is there anything more I can do to get better search performance?