Re: field:* queries can be painfully slow if there are many terms.

2016-09-22 Thread Erick Erickson
Thanks Mike. I'm not sure this _should_ be fixed mind you, but thought I'd ask.

On Thu, Sep 22, 2016 at 10:16 AM, Michael McCandless wrote:
> You could index the prefix terms (edge ngrams), assuming your queries
> are prefix queries; this way there would typically be far fewer terms
> to visit than all 200 M terms.
>
> Auto-prefix terms also tried to solve this more "automatically", so
> you don't have to mess with edge ngrams, but we reverted it because of
> the added code complexity and lack of real-world use cases, especially
> once we switched numerics from postings to dimensional points.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Sep 22, 2016 at 1:01 PM, Erick Erickson wrote:
>> In MultiTermConstantScoreWrapper there's this block around line 174 in 6x:
>>
>> do {
>>   docs = termsEnum.postings(docs, PostingsEnum.NONE);
>>   builder.add(docs);
>> } while (termsEnum.next() != null);
>>
>> In the case of lots and lots of terms in a multiValued field this can
>> take quite a bit of time. In my test case I have 100K docs with 200M
>> terms (pathological I understand, but it illustrates the issue). If
>> I'm reading this right it loops through all the terms and, for each
>> term, creates a sub-list of docs for the term and adds the sub-list to
>> the "master list". So a query like 'field:*' takes 20+ seconds.
>>
>> Is there anything we can/should do to short circuit this kind of
>> thing? In this case I got 200M terms by ngramming 3-32 (again, far too
>> many ngrams I understand). It's not clear to me whether it's an easy
>> check to say "stop when all the docs have been added to the master
>> list"
>>
>> I can raise a JIRA if it makes sense.
>>
>> For supporting this particular use-case, we could index a separate
>> field "has_field1_value" but the general case still holds.
>>
>> Erick

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: field:* queries can be painfully slow if there are many terms.

2016-09-22 Thread Michael McCandless
You could index the prefix terms (edge ngrams), assuming your queries
are prefix queries; this way there would typically be far fewer terms
to visit than all 200 M terms.
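
To make the difference in term counts concrete, here is a minimal sketch in plain Java (no Lucene dependency). The class and method names are made up for illustration; in a real analysis chain the equivalent work would presumably be done by Lucene's NGramTokenFilter and EdgeNGramTokenFilter. It assumes the 3-32 gram sizes mentioned in the original question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: compares plain n-grams with edge n-grams.
final class NgramSketch {

    /** Every substring of token whose length is in [min, max], as a plain n-gram filter would emit. */
    static List<String> ngrams(String token, int min, int max) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < token.length(); start++) {
            for (int len = min; len <= max && start + len <= token.length(); len++) {
                out.add(token.substring(start, start + len));
            }
        }
        return out;
    }

    /** Only the prefixes of token whose length is in [min, max], as an edge n-gram filter would emit. */
    static List<String> edgeNgrams(String token, int min, int max) {
        List<String> out = new ArrayList<>();
        for (int len = min; len <= max && len <= token.length(); len++) {
            out.add(token.substring(0, len));
        }
        return out;
    }

    public static void main(String[] args) {
        String token = "pathological";
        System.out.println(ngrams(token, 3, 32).size() + " n-grams");
        System.out.println(edgeNgrams(token, 3, 32).size() + " edge n-grams");
    }
}
```

For the 12-character token "pathological" this produces 55 n-grams but only 10 edge n-grams, so the number of terms grows roughly linearly with token length instead of quadratically. The trade-off is that edge n-grams only support prefix matching, not matching in the middle of a token.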

Auto-prefix terms also tried to solve this more "automatically", so
you don't have to mess with edge ngrams, but we reverted it because of
the added code complexity and lack of real-world use cases, especially
once we switched numerics from postings to dimensional points.

Mike McCandless

http://blog.mikemccandless.com

On Thu, Sep 22, 2016 at 1:01 PM, Erick Erickson wrote:
> In MultiTermConstantScoreWrapper there's this block around line 174 in 6x:
>
> do {
>   docs = termsEnum.postings(docs, PostingsEnum.NONE);
>   builder.add(docs);
> } while (termsEnum.next() != null);
>
> In the case of lots and lots of terms in a multiValued field this can
> take quite a bit of time. In my test case I have 100K docs with 200M
> terms (pathological I understand, but it illustrates the issue). If
> I'm reading this right it loops through all the terms and, for each
> term, creates a sub-list of docs for the term and adds the sub-list to
> the "master list". So a query like 'field:*' takes 20+ seconds.
>
> Is there anything we can/should do to short circuit this kind of
> thing? In this case I got 200M terms by ngramming 3-32 (again, far too
> many ngrams I understand). It's not clear to me whether it's an easy
> check to say "stop when all the docs have been added to the master
> list"
>
> I can raise a JIRA if it makes sense.
>
> For supporting this particular use-case, we could index a separate
> field "has_field1_value" but the general case still holds.
>
> Erick



field:* queries can be painfully slow if there are many terms.

2016-09-22 Thread Erick Erickson
In MultiTermConstantScoreWrapper there's this block around line 174 in 6x:

do {
  docs = termsEnum.postings(docs, PostingsEnum.NONE);
  builder.add(docs);
} while (termsEnum.next() != null);

In the case of lots and lots of terms in a multiValued field this can
take quite a bit of time. In my test case I have 100K docs with 200M
terms (pathological I understand, but it illustrates the issue). If
I'm reading this right it loops through all the terms and, for each
term, creates a sub-list of docs for the term and adds the sub-list to
the "master list". So a query like 'field:*' takes 20+ seconds.

Is there anything we can/should do to short-circuit this kind of
thing? In this case I got 200M terms by ngramming 3-32 (again, far too
many ngrams I understand). It's not clear to me whether it's an easy
check to say "stop when all the docs have been added to the master
list"
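
To illustrate the kind of check I have in mind, here is a toy simulation in plain Java. It is not the Lucene code: postings are modelled as int arrays, the "master list" as a java.util.BitSet, and the class and method names are made up. The loop stops visiting terms as soon as every doc ID below maxDoc has been collected.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical simulation of the term loop with an early exit; not Lucene code.
final class ShortCircuitSketch {

    /**
     * Unions each term's postings into matched, stopping once all maxDoc docs are present.
     * Returns the number of terms whose postings were actually read.
     */
    static int unionUntilFull(List<int[]> postingsPerTerm, int maxDoc, BitSet matched) {
        int termsVisited = 0;
        int matchedCount = matched.cardinality();
        for (int[] postings : postingsPerTerm) {
            if (matchedCount == maxDoc) {
                break; // every doc is already in the result, remaining terms cannot add anything
            }
            termsVisited++;
            for (int doc : postings) {
                if (!matched.get(doc)) {
                    matched.set(doc);
                    matchedCount++; // count incrementally so the check stays cheap
                }
            }
        }
        return termsVisited;
    }

    public static void main(String[] args) {
        List<int[]> postings = Arrays.asList(
            new int[] {0, 1}, new int[] {2, 3}, new int[] {0}, new int[] {1, 2});
        BitSet matched = new BitSet(4);
        int visited = unionUntilFull(postings, 4, matched);
        System.out.println("visited " + visited + " of " + postings.size() + " terms");
    }
}
```

In this toy run only 2 of the 4 terms are read. The obvious limitation is that the exit only triggers when every doc has a value in the field; if even one doc is missing the field, the loop still walks all the terms.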

I can raise a JIRA if it makes sense.

For supporting this particular use-case, we could index a separate
field "has_field1_value" but the general case still holds.
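
As a rough picture of why the separate field helps, here is a toy inverted index in plain Java (made-up class, not Lucene). At index time it adds a single marker term to "has_field1_value" whenever field1 has at least one term, so the "does this doc have a value" question becomes one postings lookup instead of a union over every term in field1.

```java
import java.util.BitSet;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy index for illustration only; not Lucene code.
final class MarkerFieldSketch {
    // field name -> term -> doc IDs containing that term
    private final Map<String, Map<String, BitSet>> index = new TreeMap<>();

    /** Indexes the field1 terms of one doc and, if there are any, a single marker term. */
    void addDoc(int docId, List<String> field1Terms) {
        for (String term : field1Terms) {
            post("field1", term, docId);
        }
        if (!field1Terms.isEmpty()) {
            post("has_field1_value", "true", docId);
        }
    }

    private void post(String field, String term, int docId) {
        index.computeIfAbsent(field, f -> new TreeMap<>())
             .computeIfAbsent(term, t -> new BitSet())
             .set(docId);
    }

    /** Rough equivalent of field:* : unions the postings of every term in the field. */
    BitSet docsWithAnyTerm(String field) {
        BitSet result = new BitSet();
        Map<String, BitSet> terms = index.getOrDefault(field, Collections.emptyMap());
        for (BitSet postings : terms.values()) {
            result.or(postings);
        }
        return result;
    }

    /** Rough equivalent of has_field1_value:true : a single postings lookup. */
    BitSet docsWithMarker() {
        Map<String, BitSet> terms = index.getOrDefault("has_field1_value", Collections.emptyMap());
        BitSet postings = terms.get("true");
        return postings == null ? new BitSet() : (BitSet) postings.clone();
    }
}
```

Both methods return the same set of docs; the difference is that the first touches every term in field1 while the second touches exactly one term, at the cost of maintaining the extra field at index time.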

Erick
