RE: Increase search performance

2018-02-02 Thread Atul Bisaria
Thanks for the feedback!

Re: Increase search performance

2018-02-02 Thread Adrien Grand
If needsScores returns false on the collector, then scores won't be
computed.

Your prototype should work well.

RE: Increase search performance

2018-02-01 Thread Atul Bisaria
Hi Adrien,

Please correct me if I am wrong, but I believe using an extended IntComparator
in a custom Sort object for randomization would still score documents (with
IndexSearcher.search(Query, int, Sort), for example).

So I tried a custom collector via IndexSearcher.search(Query, Collector),
where the collector does not score documents at all.

I have refactored RandomOrderCollector to fix the memory usage problem as 
described below. Let me know if this looks ok now.

import java.io.IOException;
import java.util.Arrays;
import java.util.Random;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.SimpleCollector;

class RandomOrderCollector extends SimpleCollector
{
    private final int maxHitsRequired;
    private int docBase;

    private final ScoreDoc[] matches;

    private int numHits;

    private final Random random = new Random();

    public RandomOrderCollector(int maxHitsRequired)
    {
        this.maxHitsRequired = maxHitsRequired;
        this.matches = new ScoreDoc[maxHitsRequired];
    }

    @Override
    public boolean needsScores()
    {
        return false; // no scoring: Lucene can skip score computation
    }

    @Override
    public void collect(int doc) throws IOException
    {
        int absoluteDoc = docBase + doc;
        int randomScore = random.nextInt(); // assign a random score to each doc

        if (numHits < maxHitsRequired)
        {
            // buffer not full yet: keep every match
            matches[numHits++] = new ScoreDoc(absoluteDoc, randomScore);
        }
        else
        {
            // buffer full: replace a random slot if the new random score wins
            int index = random.nextInt(maxHitsRequired);
            if (matches[index].score < randomScore)
            {
                matches[index] = new ScoreDoc(absoluteDoc, randomScore);
            }
        }
    }

    @Override
    protected void doSetNextReader(LeafReaderContext context) throws IOException
    {
        super.doSetNextReader(context);
        this.docBase = context.docBase;
    }

    public ScoreDoc[] getHits()
    {
        // trim unused slots when fewer than maxHitsRequired docs matched,
        // so callers never see null entries
        return Arrays.copyOf(matches, numHits);
    }
}
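
A minimal usage sketch, assuming Lucene 5.4.x and existing searcher
(IndexSearcher) and query (Query) instances, with Document imported from
org.apache.lucene.document:

RandomOrderCollector collector = new RandomOrderCollector(10);
searcher.search(query, collector);           // needsScores() is false, so no scores are computed
ScoreDoc[] randomHits = collector.getHits(); // up to 10 matching docs in random order
for (ScoreDoc hit : randomHits)
{
    Document document = searcher.doc(hit.doc); // resolve each global doc id
}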

Best Regards,
Atul Bisaria

Re: Increase search performance

2018-02-01 Thread Adrien Grand
Yes, this collector won't perform well if you have many matches, since
memory usage is linear in the number of matches. A better option would be
to extend e.g. IntComparator and implement getNumericDocValues by returning
a fake NumericDocValues instance that e.g. does a bit mix of the doc id and
a per-request seed (for instance, HPPC's BitMixer can do that:
https://github.com/carrotsearch/hppc/blob/master/hppc/src/main/java/com/carrotsearch/hppc/BitMixer.java
).
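
A sketch of how that suggestion might look, written against the Lucene 5.4
API and untested; the mix() method below is a stand-in for HPPC's BitMixer
(it is just the murmur3 32-bit finalizer), and the field name "_random" is
arbitrary since the comparator never reads a real field:

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.FieldComparator;
import org.apache.lucene.search.FieldComparatorSource;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

import java.io.IOException;

public final class RandomOrderSort
{
    // murmur3 32-bit finalizer: a cheap, stateless bit mix
    private static int mix(int v)
    {
        v ^= v >>> 16;
        v *= 0x85ebca6b;
        v ^= v >>> 13;
        v *= 0xc2b2ae35;
        v ^= v >>> 16;
        return v;
    }

    public static Sort create(final int seed)
    {
        return new Sort(new SortField("_random", new FieldComparatorSource()
        {
            @Override
            public FieldComparator<?> newComparator(String fieldname, int numHits,
                    int sortPos, boolean reversed)
            {
                // null missingValue: every doc gets a value, so no
                // docs-with-field lookup is needed
                return new FieldComparator.IntComparator(numHits, fieldname, null)
                {
                    @Override
                    protected NumericDocValues getNumericDocValues(
                            LeafReaderContext context, String field) throws IOException
                    {
                        final int docBase = context.docBase;
                        // fake doc values: a deterministic pseudo-random value
                        // per (seed, global doc id), with no per-hit memory
                        return new NumericDocValues()
                        {
                            @Override
                            public long get(int docID)
                            {
                                return mix(seed ^ (docBase + docID));
                            }
                        };
                    }
                };
            }
        }));
    }
}

A fresh seed per request then gives a fresh random order while only the
top-N priority queue is kept in memory, e.g.
searcher.search(query, 10, RandomOrderSort.create(random.nextInt())).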


RE: Increase search performance

2018-02-01 Thread Atul Bisaria
Hi Adrien,

Thanks for your reply.

I have also tried testing with UsageTrackingQueryCachingPolicy, but did not
observe a significant change in either latency or throughput.

Given my specific search requirements, no scoring and search results sorted
in random order (the reason for the custom Sort object), I have also explored
writing a custom collector and observed quite a difference in latency.

Let me know if this custom collector code has any loopholes I might be missing:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.SimpleCollector;

class RandomOrderCollector extends SimpleCollector
{
    private int maxHitsRequired;
    private int docBase;

    private final List<Integer> matches = new ArrayList<>();

    public RandomOrderCollector(int maxHitsRequired)
    {
        this.maxHitsRequired = maxHitsRequired;
    }

    @Override
    public boolean needsScores()
    {
        return false; // no scoring: Lucene can skip score computation
    }

    @Override
    public void collect(int doc) throws IOException
    {
        matches.add(docBase + doc); // keep the global doc id of every match
    }

    @Override
    protected void doSetNextReader(LeafReaderContext context) throws IOException
    {
        super.doSetNextReader(context);
        this.docBase = context.docBase;
    }

    public List<Integer> getHits()
    {
        // shuffle all collected matches, then return the first maxHitsRequired
        Collections.shuffle(matches);
        maxHitsRequired = Math.min(matches.size(), maxHitsRequired);

        return matches.subList(0, maxHitsRequired);
    }
}

Best Regards,
Atul Bisaria


Re: Increase search performance

2018-01-31 Thread Adrien Grand
Hi Atul,


On Tue, Jan 30, 2018 at 16:24, Atul Bisaria wrote:

> 1. Using ConstantScoreQuery so that scoring overhead is removed since
> scoring is not required in my search use case. I also use a custom Sort
> object which does not sort by score (see code below).
>

If you don't sort by score, then wrapping with a ConstantScoreQuery won't
help as Lucene will figure out scores are not needed anyway.


> 2. Using query cache
>
>
>
> My understanding is that the query cache would cache query results and hence
> lead to a significant increase in performance. Is this understanding correct?
>

It depends on what you mean by performance. If you are optimizing for
worst-case latency, then the query cache might make things worse, because
caching a query requires visiting all matches, while query execution can
sometimes just skip over non-interesting matches (e.g. in conjunctions).

However, if you are looking at improving throughput, then the query cache's
default policy of caching queries that look reused usually helps.


> I am using Lucene version 5.4.1 where query cache seems to be enabled by
> default (https://issues.apache.org/jira/browse/LUCENE-6784), but I am not
> able to see any significant change in search performance.
>




> Here is the code I am testing with:
>
>
>
> DirectoryReader reader = DirectoryReader.open(directory);  //using
> MMapDirectory
>
> IndexSearcher searcher = new IndexSearcher(reader);
> //IndexReader and IndexSearcher are created only once
>
> searcher.setQueryCachingPolicy(QueryCachingPolicy.ALWAYS_CACHE);
>

Don't do that: it will always cache all filters, which usually makes
things slower for the reason mentioned above. I would advise using an
instance of UsageTrackingQueryCachingPolicy instead.
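
For example, the suggested setup might look like this (a minimal sketch,
assuming Lucene 5.4.x and an existing Directory named directory):

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.UsageTrackingQueryCachingPolicy;

DirectoryReader reader = DirectoryReader.open(directory); // e.g. an MMapDirectory
IndexSearcher searcher = new IndexSearcher(reader);       // create reader and searcher once

// cache only queries the policy has observed being reused,
// instead of unconditionally caching everything
searcher.setQueryCachingPolicy(new UsageTrackingQueryCachingPolicy());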