Re: Optimizing fq query performance

2019-04-18 Thread John Davis
FYI
https://issues.apache.org/jira/browse/SOLR-11437
https://issues.apache.org/jira/browse/SOLR-12488

On Thu, Apr 18, 2019 at 7:24 AM Shawn Heisey  wrote:

> On 4/17/2019 11:49 PM, John Davis wrote:
> > I did a few tests with our solr-7.4.0 instance, and field:* vs field:[* TO
> > *] doesn't seem materially different compared to has_field:1. If no one
> > knows why Lucene optimizes one but not the other, it's not clear that it
> > actually optimizes either of them.
>
> Queries using a boolean field will be even faster than the all-inclusive
> range query ... but they require work at index time to function
> properly.  If you can do it this way, that's definitely preferred.  I
> was providing you with something that would work even without the
> separate boolean field.
>
> If the cardinality of the field you're searching is very low (only a few
> possible values for that field across the whole index) then a wildcard
> query can be fast.  It is only when the cardinality is high that the
> wildcard query is slow.  Still, it is better to use the range query for
> determining whether the field exists, unless you have a separate boolean
> field for that purpose, in which case the boolean query will be a little
> bit faster.
>
> Thanks,
> Shawn
>


Re: Optimizing fq query performance

2019-04-18 Thread Shawn Heisey

On 4/17/2019 11:49 PM, John Davis wrote:

I did a few tests with our solr-7.4.0 instance, and field:* vs field:[* TO
*] doesn't seem materially different compared to has_field:1. If no one
knows why Lucene optimizes one but not the other, it's not clear that it
actually optimizes either of them.


Queries using a boolean field will be even faster than the all-inclusive 
range query ... but they require work at index time to function 
properly.  If you can do it this way, that's definitely preferred.  I 
was providing you with something that would work even without the 
separate boolean field.
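
As a rough sketch of what that index-time work might look like (the
has_field1 name is just an example, nothing built in): add a boolean
field to your schema,

   <field name="has_field1" type="boolean" indexed="true" stored="false"/>

set it to true in every document that actually contains field1 when you
index, and then the existence check becomes a simple term query:

   fq=has_field1:true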


If the cardinality of the field you're searching is very low (only a few 
possible values for that field across the whole index) then a wildcard 
query can be fast.  It is only when the cardinality is high that the 
wildcard query is slow.  Still, it is better to use the range query for 
determining whether the field exists, unless you have a separate boolean 
field for that purpose, in which case the boolean query will be a little 
bit faster.


Thanks,
Shawn


Re: Optimizing fq query performance

2019-04-17 Thread John Davis
I did a few tests with our solr-7.4.0 instance, and field:* vs field:[* TO
*] doesn't seem materially different compared to has_field:1. If no one
knows why Lucene optimizes one but not the other, it's not clear that it
actually optimizes either of them.

On Wed, Apr 17, 2019 at 4:27 PM Shawn Heisey  wrote:

> On 4/17/2019 1:21 PM, John Davis wrote:
> > If what you describe is the case for the range query [* TO *], why would
> > lucene not optimize field:* in a similar way?
>
> I don't know.  Low level lucene operation is a mystery to me.
>
> I have seen first-hand that the range query is MUCH faster than the
> wildcard query.
>
> Thanks,
> Shawn
>


Re: Optimizing fq query performance

2019-04-17 Thread Shawn Heisey

On 4/17/2019 1:21 PM, John Davis wrote:

If what you describe is the case for the range query [* TO *], why would lucene
not optimize field:* in a similar way?


I don't know.  Low level lucene operation is a mystery to me.

I have seen first-hand that the range query is MUCH faster than the 
wildcard query.


Thanks,
Shawn


Re: Optimizing fq query performance

2019-04-17 Thread John Davis
If what you describe is the case for the range query [* TO *], why would lucene
not optimize field:* in a similar way?

On Wed, Apr 17, 2019 at 10:36 AM Shawn Heisey  wrote:

> On 4/17/2019 10:51 AM, John Davis wrote:
> > Can you clarify why field:[* TO *] is a lot more efficient than field:*?
>
> It's a range query.  For every document, Lucene just has to answer two
> questions -- is the value more than any possible value and is the value
> less than any possible value.  The answer will be yes if the field
> exists, and no if it doesn't.  With one million documents, there are two
> million questions that Lucene has to answer.  Which probably seems like
> a lot ... but keep reading.  (Side note:  It wouldn't surprise me if
> Lucene has an optimization specifically for the all inclusive range such
> that it actually only asks one question, not two)
>
> With a wildcard query, there are as many questions as there are values
> in the field.  Every question is asked for every single document.  So if
> you have a million documents and there are three hundred thousand
> different values contained in the field across the whole index, that's
> 300 billion questions.
>
> Thanks,
> Shawn
>


Re: Optimizing fq query performance

2019-04-17 Thread Shawn Heisey

On 4/17/2019 10:51 AM, John Davis wrote:

Can you clarify why field:[* TO *] is a lot more efficient than field:*?


It's a range query.  For every document, Lucene just has to answer two 
questions -- is the value more than any possible value and is the value 
less than any possible value.  The answer will be yes if the field 
exists, and no if it doesn't.  With one million documents, there are two 
million questions that Lucene has to answer.  Which probably seems like 
a lot ... but keep reading.  (Side note:  It wouldn't surprise me if 
Lucene has an optimization specifically for the all inclusive range such 
that it actually only asks one question, not two)


With a wildcard query, there are as many questions as there are values 
in the field.  Every question is asked for every single document.  So if 
you have a million documents and there are three hundred thousand 
different values contained in the field across the whole index, that's 
300 billion questions.
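
If you want to see the difference on your own index, one way (just a
sketch, using the field name from earlier in the thread) is to run the
two filters with caching disabled and compare QTime in the response
header:

   q=*:*&rows=0&fq={!cache=false}field1:[* TO *]
   q=*:*&rows=0&fq={!cache=false}field1:*

The {!cache=false} local param keeps the filterCache from hiding the
cost on repeat runs.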


Thanks,
Shawn


Re: Optimizing fq query performance

2019-04-17 Thread John Davis
Can you clarify why field:[* TO *] is a lot more efficient than field:*?

On Sun, Apr 14, 2019 at 12:14 PM Shawn Heisey  wrote:

> On 4/13/2019 12:58 PM, John Davis wrote:
> > We noticed a sizable performance degradation when we add certain fq
> filters
> > to the query even though the result set does not change between the two
> > queries. I would've expected solr to optimize internally by picking the
> > most constrained fq filter first, but maybe my understanding is wrong.
>
> All filters cover the entire index, unless the query parser that you're
> using implements the PostFilter interface, the filter cost is set high
> enough, and caching is disabled.  All three of those conditions must be
> met in order for a filter to only run on results instead of the entire
> index.
>
> http://yonik.com/advanced-filter-caching-in-solr/
> https://lucidworks.com/2017/11/27/caching-and-filters-and-post-filters/
>
> Most query parsers don't implement the PostFilter interface.  The lucene
> and edismax parsers do not implement PostFilter.  Unless you've
> specified the query parser in the fq parameter, it will use the lucene
> query parser, and it cannot be a PostFilter.
>
> > Here's an example:
> >
> > query1: fq = 'field1:* AND field2:value'
> > query2: fq = 'field2:value'
>
> If the point of the "field1:*" query clause is "make sure field1 exists
> in the document" then you would be a lot better off with this query clause:
>
> field1:[* TO *]
>
> This is an all-inclusive range query.  It works with all field types
> where I have tried it, and that includes TextField types.   It will be a
> lot more efficient than the wildcard query.
>
> Here's what happens with "field1:*".  If the cardinality of field1 is
> ten million different values, then the query that gets constructed for
> Lucene will literally contain ten million values.  And every single one
> of them will need to be compared to every document.  That's a LOT of
> comparisons.  Wildcard queries are normally very slow.
>
> Thanks,
> Shawn
>


Re: Optimizing fq query performance

2019-04-14 Thread Shawn Heisey

On 4/13/2019 12:58 PM, John Davis wrote:

We noticed a sizable performance degradation when we add certain fq filters
to the query even though the result set does not change between the two
queries. I would've expected solr to optimize internally by picking the
most constrained fq filter first, but maybe my understanding is wrong.


All filters cover the entire index, unless the query parser that you're 
using implements the PostFilter interface, the filter cost is set high 
enough, and caching is disabled.  All three of those conditions must be 
met in order for a filter to only run on results instead of the entire 
index.


http://yonik.com/advanced-filter-caching-in-solr/
https://lucidworks.com/2017/11/27/caching-and-filters-and-post-filters/
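
For illustration, here's roughly what a filter that meets all three
conditions looks like (the frange parser does implement PostFilter; the
function and field names are only placeholders):

   fq={!frange l=5 cache=false cost=200}div(popularity,price)

The combination of cache=false and a cost of 100 or more is what makes a
PostFilter-capable parser actually run as a post filter.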

Most query parsers don't implement the PostFilter interface.  The lucene 
and edismax parsers do not implement PostFilter.  Unless you've 
specified the query parser in the fq parameter, it will use the lucene 
query parser, and it cannot be a PostFilter.



Here's an example:

query1: fq = 'field1:* AND field2:value'
query2: fq = 'field2:value'


If the point of the "field1:*" query clause is "make sure field1 exists 
in the document" then you would be a lot better off with this query clause:


field1:[* TO *]

This is an all-inclusive range query.  It works with all field types 
where I have tried it, and that includes TextField types.   It will be a 
lot more efficient than the wildcard query.
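
For your query1 that might look something like this (splitting the two
clauses into separate fq parameters is optional, but it lets each one be
cached and reused independently):

   fq=field1:[* TO *]&fq=field2:value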


Here's what happens with "field1:*".  If the cardinality of field1 is 
ten million different values, then the query that gets constructed for 
Lucene will literally contain ten million values.  And every single one 
of them will need to be compared to every document.  That's a LOT of 
comparisons.  Wildcard queries are normally very slow.


Thanks,
Shawn


Re: Optimizing fq query performance

2019-04-14 Thread Erick Erickson
Patches welcome, but how would that be done? There’s no fixed schema at the 
Lucene level. It’s even possible that no two documents in the index have any 
fields in common. Given the structure of an inverted index, answering the 
question “for document X, does it have any value?” is rather “interesting”. You 
might be able to do something with docValues and function queries, but that’s 
overkill.
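
For the curious, the kind of thing I mean is a function query over
docValues, something like this untested sketch (it assumes field1 has
docValues enabled):

   fq={!frange l=1 u=1}if(exists(field1),1,0)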

In some sense, fq=field:* already does this dynamically: the result is put in 
the filterCache, where it requires no recalculation the next time, so a 
dedicated optimization seems like more effort than it’s worth.

Best,
Erick

> On Apr 13, 2019, at 11:24 PM, John Davis  wrote:
> 
>> field1:* is slow in general for indexed fields because all terms for the
>> field need to be iterated (e.g. does term1 match doc1, does term2 match
>> doc1, etc)
> 
> This feels like something could be optimized internally by tracking
> existence of the field in a doc instead of making users index yet another
> field to track existence?
> 
> BTW does this same behavior apply for tlong fields too where the value
> might be more continuous vs discrete strings?
> 
> On Sat, Apr 13, 2019 at 12:30 PM Yonik Seeley  wrote:
> 
>> More constrained but matching the same set of documents just guarantees
>> that there is more information to evaluate per document matched.
>> For your specific case, you can optimize fq = 'field1:* AND field2:value'
>> to &fq=field1:*&fq=field2:value
>> This will at least cause field1:* to be cached and reused if it's a common
>> pattern.
>> field1:* is slow in general for indexed fields because all terms for the
>> field need to be iterated (e.g. does term1 match doc1, does term2 match
>> doc1, etc)
>> One can optimize this by indexing a term in a different field to turn it
>> into a single term query (i.e. exists:field1)
>> 
>> -Yonik
>> 
>> On Sat, Apr 13, 2019 at 2:58 PM John Davis 
>> wrote:
>> 
>>> Hi there,
>>> 
>>> We noticed a sizable performance degradation when we add certain fq
>> filters
>>> to the query even though the result set does not change between the two
>>> queries. I would've expected solr to optimize internally by picking the
>>> most constrained fq filter first, but maybe my understanding is wrong.
>>> Here's an example:
>>> 
>>> query1: fq = 'field1:* AND field2:value'
>>> query2: fq = 'field2:value'
>>> 
>>> If we assume that the result set is identical between the two queries and
>>> field1 is in general more frequent in the index, we noticed query1 takes
>>> 100x longer than query2. In case it matters field1 is of type tlongs
>> while
>>> field2 is a string.
>>> 
>>> Any tips for optimizing this?
>>> 
>>> John
>>> 
>> 



Re: Optimizing fq query performance

2019-04-13 Thread John Davis
> field1:* is slow in general for indexed fields because all terms for the
> field need to be iterated (e.g. does term1 match doc1, does term2 match
> doc1, etc)

This feels like something could be optimized internally by tracking
existence of the field in a doc instead of making users index yet another
field to track existence?

BTW does this same behavior apply for tlong fields too where the value
might be more continuous vs discrete strings?

On Sat, Apr 13, 2019 at 12:30 PM Yonik Seeley  wrote:

> More constrained but matching the same set of documents just guarantees
> that there is more information to evaluate per document matched.
> For your specific case, you can optimize fq = 'field1:* AND field2:value'
> to &fq=field1:*&fq=field2:value
> This will at least cause field1:* to be cached and reused if it's a common
> pattern.
> field1:* is slow in general for indexed fields because all terms for the
> field need to be iterated (e.g. does term1 match doc1, does term2 match
> doc1, etc)
> One can optimize this by indexing a term in a different field to turn it
> into a single term query (i.e. exists:field1)
>
> -Yonik
>
> On Sat, Apr 13, 2019 at 2:58 PM John Davis 
> wrote:
>
> > Hi there,
> >
> > We noticed a sizable performance degradation when we add certain fq
> filters
> > to the query even though the result set does not change between the two
> > queries. I would've expected solr to optimize internally by picking the
> > most constrained fq filter first, but maybe my understanding is wrong.
> > Here's an example:
> >
> > query1: fq = 'field1:* AND field2:value'
> > query2: fq = 'field2:value'
> >
> > If we assume that the result set is identical between the two queries and
> > field1 is in general more frequent in the index, we noticed query1 takes
> > 100x longer than query2. In case it matters field1 is of type tlongs
> while
> > field2 is a string.
> >
> > Any tips for optimizing this?
> >
> > John
> >
>


Re: Optimizing fq query performance

2019-04-13 Thread Erick Erickson
Also note that field1:* does not necessarily match all documents. A document 
without that field will not match. So it really can’t be optimized the way you 
might expect since, as Yonik says, all the terms have to be enumerated….
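
If you want to see which documents a field1:* filter would skip on your
own data, a purely negative filter, something like

   fq=-field1:[* TO *]

matches only the documents that have no value in field1 at all.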

Best,
Erick

> On Apr 13, 2019, at 12:30 PM, Yonik Seeley  wrote:
> 
> More constrained but matching the same set of documents just guarantees
> that there is more information to evaluate per document matched.
> For your specific case, you can optimize fq = 'field1:* AND field2:value'
> to &fq=field1:*&fq=field2:value
> This will at least cause field1:* to be cached and reused if it's a common
> pattern.
> field1:* is slow in general for indexed fields because all terms for the
> field need to be iterated (e.g. does term1 match doc1, does term2 match
> doc1, etc)
> One can optimize this by indexing a term in a different field to turn it
> into a single term query (i.e. exists:field1)
> 
> -Yonik
> 
> On Sat, Apr 13, 2019 at 2:58 PM John Davis 
> wrote:
> 
>> Hi there,
>> 
>> We noticed a sizable performance degradation when we add certain fq filters
>> to the query even though the result set does not change between the two
>> queries. I would've expected solr to optimize internally by picking the
>> most constrained fq filter first, but maybe my understanding is wrong.
>> Here's an example:
>> 
>> query1: fq = 'field1:* AND field2:value'
>> query2: fq = 'field2:value'
>> 
>> If we assume that the result set is identical between the two queries and
>> field1 is in general more frequent in the index, we noticed query1 takes
>> 100x longer than query2. In case it matters field1 is of type tlongs while
>> field2 is a string.
>> 
>> Any tips for optimizing this?
>> 
>> John
>> 



Re: Optimizing fq query performance

2019-04-13 Thread Yonik Seeley
More constrained but matching the same set of documents just guarantees
that there is more information to evaluate per document matched.
For your specific case, you can optimize fq = 'field1:* AND field2:value'
to &fq=field1:*&fq=field2:value
This will at least cause field1:* to be cached and reused if it's a common
pattern.
field1:* is slow in general for indexed fields because all terms for the
field need to be iterated (e.g. does term1 match doc1, does term2 match
doc1, etc)
One can optimize this by indexing a term in a different field to turn it
into a single term query (i.e. exists:field1)
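
A rough sketch of that, with "exists" as a made-up multivalued string
field populated at index time with the names of the fields each document
actually has:

   { "id":"doc1", "field1":123, "field2":"value",
     "exists":["field1","field2"] }

   fq=exists:field1&fq=field2:value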

-Yonik

On Sat, Apr 13, 2019 at 2:58 PM John Davis 
wrote:

> Hi there,
>
> We noticed a sizable performance degradation when we add certain fq filters
> to the query even though the result set does not change between the two
> queries. I would've expected solr to optimize internally by picking the
> most constrained fq filter first, but maybe my understanding is wrong.
> Here's an example:
>
> query1: fq = 'field1:* AND field2:value'
> query2: fq = 'field2:value'
>
> If we assume that the result set is identical between the two queries and
> field1 is in general more frequent in the index, we noticed query1 takes
> 100x longer than query2. In case it matters field1 is of type tlongs while
> field2 is a string.
>
> Any tips for optimizing this?
>
> John
>


Optimizing fq query performance

2019-04-13 Thread John Davis
Hi there,

We noticed a sizable performance degradation when we add certain fq filters
to the query even though the result set does not change between the two
queries. I would've expected solr to optimize internally by picking the
most constrained fq filter first, but maybe my understanding is wrong.
Here's an example:

query1: fq = 'field1:* AND field2:value'
query2: fq = 'field2:value'

If we assume that the result set is identical between the two queries and
field1 is in general more frequent in the index, we noticed query1 takes
100x longer than query2. In case it matters field1 is of type tlongs while
field2 is a string.

Any tips for optimizing this?

John