Re: Understanding Negative Filter Queries
> > Well, they’ll be exactly the same if (and only if) every document has a
> > tag. Otherwise, the first one will exclude a doc that has no tag and the
> > second one will include it.

That's a good point/catch.

> How slow is “very slow”?

Well, in the case I was looking at it was about 10x slower, with the caveat
that there were 15 or so of these negative fq, all some version of
`fq={!cache=false}(tag:* -tag:)` (*don't shoot me, I didn't write it lol*),
over 15 million documents. Which to me means that each fq was doing each
step that you described below:

> The second form only has to index into the terms dictionary for the tag
> field value “email”, then zip down the posting list for all the docs that
> have it. The first form has to first identify all the docs that have a
> tag, accumulate that list, _then_ find the “email” value and zip down the
> postings list.

Thanks yet again, Erick. That solidified in my mind how this works. Much
appreciated!

On Tue, Jul 14, 2020 at 7:22 AM Erick Erickson wrote:

> Yeah, there are optimizations there. BTW, these two queries are subtly
> different.
>
> Well, they’ll be exactly the same if (and only if) every document has a
> tag. Otherwise, the first one will exclude a doc that has no tag and the
> second one will include it.
>
> How slow is “very slow”?
>
> The second form only has to index into the terms dictionary for the tag
> field value “email”, then zip down the posting list for all the docs that
> have it. The first form has to first identify all the docs that have a
> tag, accumulate that list, _then_ find the “email” value and zip down the
> postings list.
>
> You could get around this, if you require the first form's functionality,
> by, say, including a boolean field “has_tags”; then the first one would be
>
> fq=has_tags:true -tags:email
>
> Best,
> Erick
>
> > On Jul 14, 2020, at 8:05 AM, Emir Arnautović <
> > emir.arnauto...@sematext.com> wrote:
> >
> > Hi Chris,
> > tag:* is a wildcard query while *:* is a match-all query. I believe
> > that adjusting pure-negative queries is turned on by default, so you
> > can safely just use -tag:email and it’ll be translated to
> > *:* -tag:email.
> >
> > HTH,
> > Emir
> > --
> > Monitoring - Log Management - Alerting - Anomaly Detection
> > Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >
> >> On 14 Jul 2020, at 14:00, Chris Dempsey wrote:
> >>
> >> I'm trying to understand the difference between something like
> >> fq={!cache=false}(tag:* -tag:email), which is very slow compared to
> >> fq={!cache=false}(*:* -tag:email) on Solr 7.7.1.
> >>
> >> I believe in the case of `tag:*` Solr spends some effort to gather all
> >> of the documents that have a value for `tag` and then removes those
> >> with `-tag:email`, while with `*:*` Solr simply uses the full document
> >> set as-is and then removes those with `-tag:email` (*and I believe
> >> Erick mentioned there were special optimizations for `*:*`*)?
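As a toy illustration (plain Python over hypothetical documents, not Solr code), the subtle difference Erick points out plays out like this:

```python
# Each doc is a dict; doc 3 has no "tag" field at all.
docs = [
    {"id": 1, "tag": "email"},
    {"id": 2, "tag": "phone"},
    {"id": 3},  # untagged
]

def q_tag_star_minus_email(docs):
    # (tag:* -tag:email): start from docs that HAVE a tag, then drop "email"
    return {d["id"] for d in docs if "tag" in d and d["tag"] != "email"}

def q_match_all_minus_email(docs):
    # (*:* -tag:email): start from ALL docs, then drop "email"
    return {d["id"] for d in docs if d.get("tag") != "email"}

print(q_tag_star_minus_email(docs))   # {2} — the untagged doc is excluded
print(q_match_all_minus_email(docs))  # {2, 3} — the untagged doc is included
```

So the two forms agree only when every document carries at least one tag, which is exactly the "if (and only if)" caveat above.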
Understanding Negative Filter Queries
I'm trying to understand the difference between something like
fq={!cache=false}(tag:* -tag:email), which is very slow compared to
fq={!cache=false}(*:* -tag:email) on Solr 7.7.1.

I believe in the case of `tag:*` Solr spends some effort to gather all of
the documents that have a value for `tag` and then removes those with
`-tag:email`, while with `*:*` Solr simply uses the full document set
as-is and then removes those with `-tag:email` (*and I believe Erick
mentioned there were special optimizations for `*:*`*)?
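A toy model of the work difference (hypothetical postings, plain Python, not Lucene internals): `tag:*` has to walk every term's posting list for the field to build the "has a tag" set, while `*:*` is essentially a free bitset of all live docs.

```python
# Toy inverted index: term -> posting list of doc ids for field "tag".
postings = {
    "email": [1, 4, 7],
    "phone": [2, 4],
    "chat":  [5],
}
all_doc_ids = set(range(1, 9))  # 8 docs; docs 3, 6, 8 have no tag

def docs_with_tag_minus_email():
    # tag:* -tag:email — walk EVERY term's posting list to accumulate the
    # "has a tag" set, then subtract the "email" postings.
    has_tag = set()
    for term, plist in postings.items():  # work grows with #terms/postings
        has_tag.update(plist)
    return has_tag - set(postings["email"])

def all_docs_minus_email():
    # *:* -tag:email — match-all is a bitset of all live docs; only the
    # "email" posting list is ever read.
    return all_doc_ids - set(postings["email"])

print(docs_with_tag_minus_email())  # {2, 5}
print(all_docs_minus_email())       # {2, 3, 5, 6, 8}
```

With one such fq the difference is small, but fifteen of them over 15 million docs repeats the expensive accumulation step every time.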
Re: Multiple fq vs combined fq performance
Thanks for the suggestion, Alex. It doesn't appear that
IndexOrDocValuesQuery (at least in Solr 7.7.1) supports the PostFilter
interface. I've tried various values for cost on each of the fq and it
doesn't change the QTime.

So, after digging around a bit: even though
{!cache=false}taggedTickets_ticketId:100241 matches one and only one
document in the collection, that doesn't matter to the other two fq, which
continue to look over the whole index of the collection, correct?

On Thu, Jul 9, 2020 at 4:24 PM Alexandre Rafalovitch wrote:

> I _think_ it will run all 3 and then do index hopping. But if you know
> one fq is super expensive, you could assign it a cost. A value over 100
> will try to use PostFilter and apply the query on top of the results
> from the other queries.
>
> https://lucene.apache.org/solr/guide/8_4/common-query-parameters.html#cache-parameter
>
> Hope it helps,
> Alex.
>
> On Thu., Jul. 9, 2020, 2:05 p.m. Chris Dempsey, wrote:
>
> > Hi all! In a collection where we have ~54 million documents, we've
> > noticed that running a query with the following:
> >
> > "fq":["{!cache=false}_class:taggedTickets",
> >       "{!cache=false}taggedTickets_ticketId:100241",
> >       "{!cache=false}companyId:22476"]
> >
> > when I debugQuery I see:
> >
> > "parsed_filter_queries":[
> >   "{!cache=false}_class:taggedTickets",
> >   "{!cache=false}IndexOrDocValuesQuery(taggedTickets_ticketId:[100241
> >    TO 100241])",
> >   "{!cache=false}IndexOrDocValuesQuery(companyId:[22476 TO 22476])"
> > ]
> >
> > runs in roughly ~450ms, but if we remove
> > `{!cache=false}companyId:22476` it drops down to ~5ms (it's important
> > to note that `taggedTickets_ticketId` is globally unique).
> >
> > If we change the fqs to:
> >
> > "fq":["{!cache=false}_class:taggedTickets",
> >       "{!cache=false}+companyId:22476 +taggedTickets_ticketId:100241"]
> >
> > when I debugQuery I see:
> >
> > "parsed_filter_queries":[
> >   "{!cache=false}_class:taggedTickets",
> >   "{!cache=false}+IndexOrDocValuesQuery(companyId:[22476 TO 22476])
> >    +IndexOrDocValuesQuery(taggedTickets_ticketId:[100241 TO 100241])"
> > ]
> >
> > we get the correct result back in ~5ms.
> >
> > My current thought is that in the slow scenario Solr is still running
> > `{!cache=false}IndexOrDocValuesQuery(companyId:[22476 TO 22476])` even
> > though it "has the answer" from the first two fq.
> >
> > Am I off-base or misunderstanding how `fq` are processed?
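Chris's reading — that each fq builds its full DocSet independently before the sets are intersected, while a single `+a +b` conjunction can be driven from the rarest clause — can be sketched with a toy model (plain Python, hypothetical doc counts, not Solr internals):

```python
# Toy DocSets standing in for the three filters over a small "index".
index = {
    "_class:taggedTickets": set(range(1_000)),
    "companyId:22476": set(range(0, 1_000, 2)),    # broad: half the docs
    "taggedTickets_ticketId:100241": {42},          # globally unique
}

def separate_fqs(queries):
    # Each fq computes its full DocSet regardless of the others, then
    # the sets are intersected; work is the sum of all set sizes.
    work = sum(len(index[q]) for q in queries)
    result = set.intersection(*(index[q] for q in queries))
    return result, work

def combined_conjunction(rare, other):
    # A single +rare +other clause can iterate the rarest clause and just
    # verify each candidate against the other clause.
    candidates = index[rare]
    work = len(candidates)
    result = {d for d in candidates if d in index[other]}
    return result, work
```

Both paths return the same single document, but the combined form touches one candidate instead of every doc matched by each filter, which is consistent with the ~450ms vs ~5ms observation.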
Multiple fq vs combined fq performance
Hi all! In a collection where we have ~54 million documents, we've noticed
that running a query with the following:

"fq":["{!cache=false}_class:taggedTickets",
      "{!cache=false}taggedTickets_ticketId:100241",
      "{!cache=false}companyId:22476"]

when I debugQuery I see:

"parsed_filter_queries":[
  "{!cache=false}_class:taggedTickets",
  "{!cache=false}IndexOrDocValuesQuery(taggedTickets_ticketId:[100241 TO 100241])",
  "{!cache=false}IndexOrDocValuesQuery(companyId:[22476 TO 22476])"
]

runs in roughly ~450ms, but if we remove `{!cache=false}companyId:22476`
it drops down to ~5ms (it's important to note that
`taggedTickets_ticketId` is globally unique).

If we change the fqs to:

"fq":["{!cache=false}_class:taggedTickets",
      "{!cache=false}+companyId:22476 +taggedTickets_ticketId:100241"]

when I debugQuery I see:

"parsed_filter_queries":[
  "{!cache=false}_class:taggedTickets",
  "{!cache=false}+IndexOrDocValuesQuery(companyId:[22476 TO 22476]) +IndexOrDocValuesQuery(taggedTickets_ticketId:[100241 TO 100241])"
]

we get the correct result back in ~5ms.

My current thought is that in the slow scenario Solr is still running
`{!cache=false}IndexOrDocValuesQuery(companyId:[22476 TO 22476])` even
though it "has the answer" from the first two fq.

Am I off-base or misunderstanding how `fq` are processed?
Re: Prefix + Suffix Wildcards in Searches
@Mikhail Thanks for the link! I'll read through that.

On Tue, Jun 30, 2020 at 6:28 AM Chris Dempsey wrote:

> @Erick,
>
> You've got the idea. Basically the users can attach zero or more tags
> (*that they create*) to a document. So as an example, say they've
> created these tags (*just a small subset of the total*):
>
> - paid
> - invoice-paid
> - ms-reply-unpaid-2019
> - credit-ms-reply-unpaid
> - ms-reply-paid-2019
> - ms-reply-paid-2020
>
> and attached them in various combinations to documents. They then want
> to find all documents by tag that don't contain the characters "paid"
> anywhere in the tag, don't contain tags with the characters
> "ms-reply-unpaid", but do include documents tagged with the characters
> "ms-reply-paid".
>
> The obvious suggestion would be to have the users use the entire tag
> (i.e. don't let them do a "contains") as a condition, to eliminate the
> wildcards - which would work - but unfortunately we have customers with
> (*not joking*) over 100K different tags (*why they have a taxonomy like
> that is yet a different issue*). I'm willing to accept that in our
> scenario n-grams might be the Solr-based answer (the other being to
> change what "contains" means within our application), but thought I'd
> check I hadn't overlooked any other options. :)
>
> On Mon, Jun 29, 2020 at 3:54 PM Mikhail Khludnev wrote:
>
>> Hello, Chris.
>> I suppose index-time analysis can yield these terms:
>> "paid", "ms-reply-unpaid", "ms-reply-paid", and thus let you avoid
>> these expensive wildcard queries. Here's why it's worth avoiding them:
>>
>> https://www.slideshare.net/lucidworks/search-like-sql-mikhail-khludnev-epam
>>
>> On Mon, Jun 29, 2020 at 6:17 PM Chris Dempsey wrote:
>>
>> > Hello, all! I'm relatively new to Solr and Lucene (*using Solr
>> > 7.7.1*), but I'm looking into options for optimizing something like
>> > this:
>> >
>> > > fq=(tag:* -tag:*paid*) OR (tag:* -tag:*ms-reply-unpaid*) OR
>> > > tag:*ms-reply-paid*
>> >
>> > It's probably not a surprise that we're seeing performance issues
>> > with something like this. My understanding is that using the
>> > wildcard on both ends forces a full scan of the terms dictionary.
>> > Something like the above can't take advantage of something like the
>> > ReversedWildcardFilter either. I believe constructing `n-grams` is
>> > an option (*at the expense of index size*), but is there anything
>> > I'm overlooking as a possible avenue to look into?
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
Re: Prefix + Suffix Wildcards in Searches
@Erick,

You've got the idea. Basically the users can attach zero or more tags
(*that they create*) to a document. So as an example, say they've created
these tags (*just a small subset of the total*):

- paid
- invoice-paid
- ms-reply-unpaid-2019
- credit-ms-reply-unpaid
- ms-reply-paid-2019
- ms-reply-paid-2020

and attached them in various combinations to documents. They then want to
find all documents by tag that don't contain the characters "paid"
anywhere in the tag, don't contain tags with the characters
"ms-reply-unpaid", but do include documents tagged with the characters
"ms-reply-paid".

The obvious suggestion would be to have the users use the entire tag
(i.e. don't let them do a "contains") as a condition, to eliminate the
wildcards - which would work - but unfortunately we have customers with
(*not joking*) over 100K different tags (*why they have a taxonomy like
that is yet a different issue*). I'm willing to accept that in our
scenario n-grams might be the Solr-based answer (the other being to change
what "contains" means within our application), but thought I'd check I
hadn't overlooked any other options. :)

On Mon, Jun 29, 2020 at 3:54 PM Mikhail Khludnev wrote:

> Hello, Chris.
> I suppose index-time analysis can yield these terms:
> "paid", "ms-reply-unpaid", "ms-reply-paid", and thus let you avoid these
> expensive wildcard queries. Here's why it's worth avoiding them:
>
> https://www.slideshare.net/lucidworks/search-like-sql-mikhail-khludnev-epam
>
> On Mon, Jun 29, 2020 at 6:17 PM Chris Dempsey wrote:
>
> > Hello, all! I'm relatively new to Solr and Lucene (*using Solr
> > 7.7.1*), but I'm looking into options for optimizing something like
> > this:
> >
> > > fq=(tag:* -tag:*paid*) OR (tag:* -tag:*ms-reply-unpaid*) OR
> > > tag:*ms-reply-paid*
> >
> > It's probably not a surprise that we're seeing performance issues with
> > something like this. My understanding is that using the wildcard on
> > both ends forces a full scan of the terms dictionary. Something like
> > the above can't take advantage of something like the
> > ReversedWildcardFilter either. I believe constructing `n-grams` is an
> > option (*at the expense of index size*), but is there anything I'm
> > overlooking as a possible avenue to look into?
>
> --
> Sincerely yours
> Mikhail Khludnev
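A minimal Python sketch of the n-gram idea under discussion (a toy trigram index built by hand, not an actual Solr NGramFilterFactory configuration): index every trigram of each tag at index time, so "contains" becomes exact trigram lookups plus a verify step instead of a double-wildcard scan.

```python
from collections import defaultdict

def trigrams(s):
    # All 3-character substrings of a tag.
    return {s[i:i + 3] for i in range(len(s) - 2)}

tags = ["paid", "invoice-paid", "ms-reply-unpaid-2019",
        "credit-ms-reply-unpaid", "ms-reply-paid-2019", "ms-reply-paid-2020"]

# Index time: trigram -> set of tags containing it.
gram_index = defaultdict(set)
for tag in tags:
    for g in trigrams(tag):
        gram_index[g].add(tag)

def contains(needle):
    # Candidates must contain every trigram of the needle; then verify,
    # since the grams could appear in a different order.
    candidates = set.intersection(*(gram_index[g] for g in trigrams(needle)))
    return {t for t in candidates if needle in t}

print(sorted(contains("ms-reply-paid")))
```

The index grows with the number of distinct trigrams per tag, which is the "at the expense of index size" trade-off mentioned above, but each lookup touches only exact terms.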
Re: Prefix + Suffix Wildcards in Searches
First off, thanks for taking a look, Erick! I see you helping lots of
folks out here and I've learned a lot from your answers. Much appreciated!

> How regular are your patterns? Are they arbitrary?

Good question. :) That's data I should have included in the initial post,
but both the values in the `tag` field and the search query itself are
totally arbitrary (*i.e. user-entered values*). I see where you're going
if the set of either part were limited.

> What’s the field type anyway? Is this field tokenized?

On Mon, Jun 29, 2020 at 10:33 AM Erick Erickson wrote:

> How regular are your patterns? Are they arbitrary?
> What I’m wondering is whether you could shift your work to the indexing
> end, perhaps even in an auxiliary field. Could you, say, just index
> “paid”, “ms-reply-unpaid”, etc.? Then there are no wildcards at all.
> This is akin to “concept search”.
>
> Otherwise, ngramming is your best bet.
>
> What’s the field type anyway? Is this field tokenized?
>
> There are lots of options, but so much depends on whether you can
> process the data such that you won’t need wildcards.
>
> Best,
> Erick
>
> > On Jun 29, 2020, at 11:16 AM, Chris Dempsey wrote:
> >
> > Hello, all! I'm relatively new to Solr and Lucene (*using Solr
> > 7.7.1*), but I'm looking into options for optimizing something like
> > this:
> >
> >> fq=(tag:* -tag:*paid*) OR (tag:* -tag:*ms-reply-unpaid*) OR
> >> tag:*ms-reply-paid*
> >
> > It's probably not a surprise that we're seeing performance issues with
> > something like this. My understanding is that using the wildcard on
> > both ends forces a full scan of the terms dictionary. Something like
> > the above can't take advantage of something like the
> > ReversedWildcardFilter either. I believe constructing `n-grams` is an
> > option (*at the expense of index size*), but is there anything I'm
> > overlooking as a possible avenue to look into?
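Erick's index-time "concept search" idea can be sketched in a few lines of Python (the `PATTERNS` list is hypothetical; the whole approach presumes a known, limited pattern set, which is exactly what Chris says he doesn't have):

```python
# Known patterns, longest first, extracted into an auxiliary field at
# index time so queries become exact term matches, with no wildcards.
PATTERNS = ["ms-reply-unpaid", "ms-reply-paid", "paid"]

def extract_concepts(tag):
    # Emit each known pattern the tag contains. Note that "paid" matches
    # inside "unpaid" too, just as the *paid* wildcard would.
    return [p for p in PATTERNS if p in tag]

print(extract_concepts("ms-reply-paid-2019"))    # ['ms-reply-paid', 'paid']
print(extract_concepts("ms-reply-unpaid-2019"))  # ['ms-reply-unpaid', 'paid']
```

With fully arbitrary user-entered tags and search strings, no fixed `PATTERNS` list exists, which is why the thread falls back to ngramming.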
Prefix + Suffix Wildcards in Searches
Hello, all! I'm relatively new to Solr and Lucene (*using Solr 7.7.1*),
but I'm looking into options for optimizing something like this:

> fq=(tag:* -tag:*paid*) OR (tag:* -tag:*ms-reply-unpaid*) OR
> tag:*ms-reply-paid*

It's probably not a surprise that we're seeing performance issues with
something like this. My understanding is that using the wildcard on both
ends forces a full scan of the terms dictionary. Something like the above
can't take advantage of something like the ReversedWildcardFilter either.
I believe constructing `n-grams` is an option (*at the expense of index
size*), but is there anything I'm overlooking as a possible avenue to
look into?
Re: Default Values and Missing Field Queries
Thanks for the clarification and pointers, Erick! Much appreciated!

On Mon, May 25, 2020 at 11:18 AM Erick Erickson wrote:

> Try q=*:* -boolfield:false
>
> And it's not as costly as you might think; there's special handling for
> *:* queries. And if you put that in an fq clause instead, the result set
> will be put into the filter cache and be reused, assuming you want to do
> this repeatedly.
>
> BTW, Solr doesn't use strict Boolean logic, which may be a bit
> confusing. Google for Chris Hostetter's (Hossman) blog at Lucidworks for
> a great explanation.
>
> And yes, your understanding of adding a new field is correct.
>
> Best,
> Erick
>
> On Mon, May 25, 2020, 11:39 Chris Dempsey wrote:
>
> > I'm new to Solr and made an honest stab at finding this info in the
> > docs.
> >
> > I'm working on an update to an existing large collection in Solr 7.7
> > to add a BoolField to mark documents as "soft deleted" or not. My
> > understanding is that after updating the schema, the new field will
> > only exist and have a value (or the default value) for documents
> > indexed after the change, correct? If that's the case, is it possible
> > to query for all documents that have that field set to `true` or where
> > the field is completely missing? Is it a Bad Idea(tm) from a
> > performance or resource-usage standpoint to use a "where field X
> > doesn't exist" query (i.e. am I going to end up running a "table scan"
> > if I do)?
> >
> > Thanks in advance!
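As a toy illustration of Erick's `q=*:* -boolfield:false` trick (plain Python over hypothetical docs, not Solr code): documents indexed before the schema change have no such field at all, yet still match, because the negative clause only removes docs explicitly set to false.

```python
docs = [
    {"id": 1},                    # indexed before the schema change: no field
    {"id": 2, "deleted": True},
    {"id": 3, "deleted": False},
]

def match_all_minus_deleted_false(docs):
    # *:* -deleted:false — keep everything except docs where the field is
    # explicitly False; missing-field docs survive the subtraction.
    return {d["id"] for d in docs if d.get("deleted") is not False}

print(match_all_minus_deleted_false(docs))  # {1, 2}
```

This is why the query returns both `true` docs and documents where the field is completely missing, with no "table scan" over field existence.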
Default Values and Missing Field Queries
I'm new to Solr and made an honest stab at finding this info in the docs.

I'm working on an update to an existing large collection in Solr 7.7 to
add a BoolField to mark documents as "soft deleted" or not. My
understanding is that after updating the schema, the new field will only
exist and have a value (or the default value) for documents indexed after
the change, correct? If that's the case, is it possible to query for all
documents that have that field set to `true` or where the field is
completely missing? Is it a Bad Idea(tm) from a performance or
resource-usage standpoint to use a "where field X doesn't exist" query
(i.e. am I going to end up running a "table scan" if I do)?

Thanks in advance!