Re: field:* queries can be painfully slow if there are many terms.
Thanks Mike. I'm not sure this _should_ be fixed mind you, but thought I'd ask.

On Thu, Sep 22, 2016 at 10:16 AM, Michael McCandless wrote:
> You could index the prefix terms (edge ngrams), assuming your queries
> are prefix queries; this way there would typically be far fewer terms
> to visit than all 200 M terms.

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
Re: field:* queries can be painfully slow if there are many terms.
You could index the prefix terms (edge ngrams), assuming your queries are prefix queries; this way there would typically be far fewer terms to visit than all 200M terms.

Auto-prefix terms also tried to solve this more "automatically", so you don't have to mess with edge ngrams, but we reverted it because of the added code complexity and the lack of real-world use cases, especially once we switched numerics from postings to dimensional points.

Mike McCandless

http://blog.mikemccandless.com

On Thu, Sep 22, 2016 at 1:01 PM, Erick Erickson wrote:
> Is there anything we can/should do to short circuit this kind of
> thing?
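To put rough numbers on the edge-ngram suggestion, here is a minimal sketch in plain Java that counts the grams produced for a single token under each scheme. The class and method names (`NGramCounts`, `allNGrams`, `edgeNGrams`) are made up for illustration and are not Lucene APIs; in a real index the work would be done by an analysis chain (Lucene ships `NGramTokenFilter` and `EdgeNGramTokenFilter` for this), and the number of unique terms depends on the data, so treat the counts as per-token only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: compares how many grams one token
// yields with full n-gramming versus edge (prefix-only) n-gramming.
final class NGramCounts {

    /** All substrings of length minGram..maxGram, starting at every offset. */
    static List<String> allNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < token.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= token.length(); len++) {
                grams.add(token.substring(start, start + len));
            }
        }
        return grams;
    }

    /** Only the prefixes of length minGram..maxGram. */
    static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int len = minGram; len <= maxGram && len <= token.length(); len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        String token = "lucene";
        // Same 3-32 gram sizes as in the original report.
        System.out.println("all grams:  " + allNGrams(token, 3, 32));
        System.out.println("edge grams: " + edgeNGrams(token, 3, 32));
    }
}
```

The gap widens with token length: full n-gramming grows roughly quadratically with the token length (until the 32-character cap applies), while edge n-gramming grows linearly.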
field:* queries can be painfully slow if there are many terms.
In MultiTermConstantScoreWrapper there's this block around line 174 in 6x:

    do {
      docs = termsEnum.postings(docs, PostingsEnum.NONE);
      builder.add(docs);
    } while (termsEnum.next() != null);

In the case of lots and lots of terms in a multiValued field this can take quite a bit of time. In my test case I have 100K docs with 200M terms (pathological, I understand, but it illustrates the issue). If I'm reading this right, it loops through all the terms and, for each term, creates a sub-list of docs for the term and adds the sub-list to the "master list". So a query like 'field:*' takes 20+ seconds.

Is there anything we can/should do to short-circuit this kind of thing? In this case I got 200M terms by ngramming 3-32 (again, far too many ngrams, I understand). It's not clear to me whether it's an easy check to say "stop when all the docs have been added to the master list".

I can raise a JIRA if it makes sense.

For supporting this particular use-case, we could index a separate field "has_field1_value", but the general case still holds.

Erick
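The "stop when all the docs have been added to the master list" idea can be sketched outside Lucene with a plain `java.util.BitSet`. This is only an illustration under stated assumptions: `EarlyExitUnion`, `union`, and the `int[]` posting lists are hypothetical stand-ins, not the `TermsEnum`/`DocIdSetBuilder` code in MultiTermConstantScoreWrapper, and `maxDoc` is assumed to be the number of documents that could match (ignoring deletions and per-segment handling).

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch, not Lucene code: union per-term doc-ID lists and stop
// as soon as every document in [0, maxDoc) has been collected.
final class EarlyExitUnion {

    /**
     * Adds each term's doc IDs to {@code result} and returns how many terms
     * were visited before all {@code maxDoc} documents were matched.
     */
    static int union(List<int[]> postingsPerTerm, int maxDoc, BitSet result) {
        int visited = 0;
        int matched = result.cardinality();
        for (int[] postings : postingsPerTerm) {
            if (matched == maxDoc) {
                break; // every doc is already in the result; remaining terms add nothing
            }
            visited++;
            for (int doc : postings) {
                if (!result.get(doc)) {
                    result.set(doc);
                    matched++;
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Four terms over three documents; the first two terms already cover all docs.
        List<int[]> terms = Arrays.asList(
                new int[] {0, 1}, new int[] {1, 2}, new int[] {0, 2}, new int[] {1});
        BitSet result = new BitSet(3);
        int visited = union(terms, 3, result);
        System.out.println("visited " + visited + " of " + terms.size()
                + " terms; matched " + result.cardinality() + " docs");
    }
}
```

Note the limits of this check: it only helps when every document has at least one term in the field. If even one document lacks a value, the loop still has to visit all terms, and tracking the matched count adds some bookkeeping per posting, which is part of why it is not obvious that this belongs in the general code path.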