Re: Understanding Negative Filter Queries

2020-07-14 Thread Chris Dempsey
>
> Well, they’ll be exactly the same if (and only if) every document has a
> tag. Otherwise, the
> first one will exclude a doc that has no tag and the second one will
> include it.


That's a good point/catch.

> How slow is “very slow”?
>

Well, in the case I was looking at it was about 10x slower, but with the
caveat that there were 15 or so of these negative fq, all some version of
`fq={!cache=false}(tag:* -tag:)` (*don't shoot me, I didn't write it lol*),
over 15 million documents. Which to me means that each fq was doing each
step that you described below:

> The second form only has to index into the terms dictionary for the tag
> field
> value “email”, then zip down the posting list for all the docs that have
> it. The
> first form has to first identify all the docs that have a tag, accumulate
> that list,
> _then_ find the “email” value and zip down the postings list.
>

Thanks yet again Erick. That solidified in my mind how this works. Much
appreciated!
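
P.S. For anyone finding this in the archives later, here's a minimal sketch of
the has_tags approach Erick describes below. The field name and attributes are
illustrative only (not our actual schema), and the indexing code would have to
set has_tags whenever a document carries at least one tag:

    <field name="has_tags" type="boolean" indexed="true" stored="false" default="false"/>

    fq={!cache=false}has_tags:true -tag:email

That keeps the "has a tag, but not this tag" semantics of the first form
without having to walk every tag value the way tag:* does.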





On Tue, Jul 14, 2020 at 7:22 AM Erick Erickson 
wrote:

> Yeah, there are optimizations there. BTW, these two queries are subtly
> different.
>
> Well, they’ll be exactly the same if (and only if) every document has a
> tag. Otherwise, the
> first one will exclude a doc that has no tag and the second one will
> include it.
>
> How slow is “very slow”?
>
> The second form only has to index into the terms dictionary for the tag
> field
> value “email”, then zip down the posting list for all the docs that have
> it. The
> first form has to first identify all the docs that have a tag, accumulate
> that list,
> _then_ find the “email” value and zip down the postings list.
>
> You could get around this if you require the first form functionality by,
> say,
> including a boolean field “has_tags”, then the first one would be
>
> fq=has_tags:true -tags:email
>
> Best,
> Erick
>
> > On Jul 14, 2020, at 8:05 AM, Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
> >
> > Hi Chris,
> > tag:* is a wildcard query while *:* is match all query. I believe that
> adjusting pure negative is turned on by default so you can safely just use
> -tag:email and it’ll be translated to *:* -tag:email.
> >
> > HTH,
> > Emir
> > --
> > Monitoring - Log Management - Alerting - Anomaly Detection
> > Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >
> >
> >
> >> On 14 Jul 2020, at 14:00, Chris Dempsey  wrote:
> >>
> >> I'm trying to understand the difference between something like
> >> fq={!cache=false}(tag:* -tag:email) which is very slow compared to
> >> fq={!cache=false}(*:* -tag:email) on Solr 7.7.1.
> >>
> >> I believe in the case of `tag:*` Solr spends some effort to gather all
> of
> >> the documents that have a value for `tag` and then removes those with
> >> `-tag:email`, while in the `*:*` case Solr simply uses the document set
> >> as-is and then removes those with `-tag:email` (*and I believe Erick mentioned
> >> there were special optimizations for `*:*`*)?
> >
>
>


Understanding Negative Filter Queries

2020-07-14 Thread Chris Dempsey
I'm trying to understand the difference between something like
fq={!cache=false}(tag:* -tag:email) which is very slow compared to
fq={!cache=false}(*:* -tag:email) on Solr 7.7.1.

I believe in the case of `tag:*` Solr spends some effort to gather all of
the documents that have a value for `tag` and then removes those with
`-tag:email`, while in the `*:*` case Solr simply uses the document set as-is
and then removes those with `-tag:email` (*and I believe Erick mentioned
there were special optimizations for `*:*`*)?
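
A follow-up note, per Emir's reply in the thread above: Solr adjusts a pure
negative filter by adding the implicit *:* for you, so the faster form can be
written as just the bare negative clause. A sketch, using the same field name
as above:

    fq={!cache=false}-tag:email    (treated internally as *:* -tag:email)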


Re: Multiple fq vs combined fq performance

2020-07-10 Thread Chris Dempsey
Thanks for the suggestion, Alex. It doesn't appear that
IndexOrDocValuesQuery (at least in Solr 7.7.1) supports the PostFilter
interface. I've tried various values for cost on each of the fq and it
doesn't change the QTime.

So, after digging around a bit: even though
{!cache=false}taggedTickets_ticketId:100241 matches one and only one
document in the collection, that doesn't matter to the other two fq, which
continue to look over the whole index of the collection, correct?
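
For reference, the cost syntax Alex mentions looks like this (values are
illustrative, and as noted above it only helps when the underlying query
actually implements PostFilter, which the numeric point query here does not
appear to do in 7.7.1):

    fq={!cache=false cost=200}companyId:22476

One parser that does produce a PostFilter-capable query is frange, so a
hypothetical workaround, assuming companyId is a numeric field with
docValues, would be something like:

    fq={!frange cache=false cost=200 l=22476 u=22476}companyId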

On Thu, Jul 9, 2020 at 4:24 PM Alexandre Rafalovitch 
wrote:

> I _think_ it will run all 3 and then do index hopping. But if you know one
> fq is super expensive, you could assign it a cost. A value over 100 will
> make Solr try to use a PostFilter and apply that query on top of the
> results from the other queries.
>
>
>
> https://lucene.apache.org/solr/guide/8_4/common-query-parameters.html#cache-parameter
>
> Hope it helps,
> Alex.
>
> On Thu., Jul. 9, 2020, 2:05 p.m. Chris Dempsey,  wrote:
>
> > Hi all! In a collection where we have ~54 million documents we've noticed
> > running a query with the following:
> >
> > "fq":["{!cache=false}_class:taggedTickets",
> >   "{!cache=false}taggedTickets_ticketId:100241",
> >   "{!cache=false}companyId:22476"]
> >
> > when I debugQuery I see:
> >
> > "parsed_filter_queries":[
> >   "{!cache=false}_class:taggedTickets",
> >   "{!cache=false}IndexOrDocValuesQuery(taggedTickets_ticketId:[100241
> > TO 100241])",
> >   "{!cache=false}IndexOrDocValuesQuery(companyId:[22476 TO 22476])"
> > ]
> >
> > runs in roughly ~450ms but if we remove `{!cache=false}companyId:22476`
> it
> > drops down to ~5ms (it's important to note that `taggedTickets_ticketId`
> is
> > globally unique).
> >
> > If we change the fqs to:
> >
> > "fq":["{!cache=false}_class:taggedTickets",
> >   "{!cache=false}+companyId:22476
> +taggedTickets_ticketId:100241"]
> >
> > when I debugQuery I see:
> >
> > "parsed_filter_queries":[
> >"{!cache=false}_class:taggedTickets",
> >"{!cache=false}+IndexOrDocValuesQuery(companyId:[22476 TO 22476])
> > +IndexOrDocValuesQuery(taggedTickets_ticketId:[100241 TO
> 100241])"
> > ]
> >
> > we get the correct result back in ~5ms.
> >
> > My current thought is that in the slow scenario Solr is still running
> > `{!cache=false}IndexOrDocValuesQuery(companyId:[22476
> > TO 22476])` even though it "has the answer" from the first two fq.
> >
> > Am I off-base or misunderstanding how `fq` are processed?
> >
>


Multiple fq vs combined fq performance

2020-07-09 Thread Chris Dempsey
Hi all! In a collection where we have ~54 million documents we've noticed
running a query with the following:

"fq":["{!cache=false}_class:taggedTickets",
  "{!cache=false}taggedTickets_ticketId:100241",
  "{!cache=false}companyId:22476"]

when I debugQuery I see:

"parsed_filter_queries":[
  "{!cache=false}_class:taggedTickets",
  "{!cache=false}IndexOrDocValuesQuery(taggedTickets_ticketId:[100241
TO 100241])",
  "{!cache=false}IndexOrDocValuesQuery(companyId:[22476 TO 22476])"
]

runs in roughly ~450ms but if we remove `{!cache=false}companyId:22476` it
drops down to ~5ms (it's important to note that `taggedTickets_ticketId` is
globally unique).

If we change the fqs to:

"fq":["{!cache=false}_class:taggedTickets",
  "{!cache=false}+companyId:22476 +taggedTickets_ticketId:100241"]

when I debugQuery I see:

"parsed_filter_queries":[
   "{!cache=false}_class:taggedTickets",
   "{!cache=false}+IndexOrDocValuesQuery(companyId:[22476 TO 22476])
+IndexOrDocValuesQuery(taggedTickets_ticketId:[100241 TO 100241])"
]

we get the correct result back in ~5ms.

My current thought is that in the slow scenario Solr is still running
`{!cache=false}IndexOrDocValuesQuery(companyId:[22476
TO 22476])` even though it "has the answer" from the first two fq.

Am I off-base or misunderstanding how `fq` are processed?


Re: Prefix + Suffix Wildcards in Searches

2020-06-30 Thread Chris Dempsey
@Mikhail

Thanks for the link! I'll read through that.

On Tue, Jun 30, 2020 at 6:28 AM Chris Dempsey  wrote:

> @Erick,
>
> You've got the idea. Basically the users can attach zero or more tags (*that
> they create*) to a document. So as an example say they've created the
> tags (this example is just a small subset of the total tags):
>
>- paid
>- invoice-paid
>- ms-reply-unpaid-2019
>- credit-ms-reply-unpaid
>- ms-reply-paid-2019
>- ms-reply-paid-2020
>
> and attached them in various combinations to documents. They then want to
> find all documents by tag that don't contain the characters "paid" anywhere
> in the tag, don't contain tags with the characters "ms-reply-unpaid", but
> do include documents tagged with the characters "ms-reply-paid".
>
> The obvious suggestion would be to have the users just use the entire tag
> (i.e. don't let them do a "contains") as a condition to eliminate the
> wildcards - which would work - but unfortunately we have customers with (*not
> joking*) over 100K different tags (*why they have a taxonomy like that is
> a different issue*). I'm willing to accept that in our scenario n-grams
> might be the Solr-based answer (the other being to change what "contains"
> means within our application) but thought I'd check I hadn't overlooked any
> other options. :)
>
> On Mon, Jun 29, 2020 at 3:54 PM Mikhail Khludnev  wrote:
>
>> Hello, Chris.
>> I suppose index time analysis can yield these terms:
>> "paid","ms-reply-unpaid","ms-reply-paid", and thus let you avoid these
>> expensive wildcard queries. Here's why it's worth avoiding them:
>>
>> https://www.slideshare.net/lucidworks/search-like-sql-mikhail-khludnev-epam
>>
>> On Mon, Jun 29, 2020 at 6:17 PM Chris Dempsey  wrote:
>>
>> > Hello, all! I'm relatively new to Solr and Lucene (*using Solr 7.7.1*)
>> but
>> > I'm looking into options for optimizing something like this:
>> >
>> > > fq=(tag:* -tag:*paid*) OR (tag:* -tag:*ms-reply-unpaid*) OR
>> > tag:*ms-reply-paid*
>> >
>> > It's probably not a surprise that we're seeing performance issues with
>> > something like this. My understanding is that using the wildcard on both
>> > ends forces a full-text index search. Something like the above can't
>> take
> >> advantage of something like the ReversedWildcardFilter either. I believe
>> > constructing `n-grams` is an option (*at the expense of index size*)
>> but is
>> > there anything I'm overlooking as a possible avenue to look into?
>> >
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>>
>


Re: Prefix + Suffix Wildcards in Searches

2020-06-30 Thread Chris Dempsey
@Erick,

You've got the idea. Basically the users can attach zero or more tags (*that
they create*) to a document. So as an example say they've created the tags
(this example is just a small subset of the total tags):

   - paid
   - invoice-paid
   - ms-reply-unpaid-2019
   - credit-ms-reply-unpaid
   - ms-reply-paid-2019
   - ms-reply-paid-2020

and attached them in various combinations to documents. They then want to
find all documents by tag that don't contain the characters "paid" anywhere
in the tag, don't contain tags with the characters "ms-reply-unpaid", but
do include documents tagged with the characters "ms-reply-paid".

The obvious suggestion would be to have the users just use the entire tag
(i.e. don't let them do a "contains") as a condition to eliminate the
wildcards - which would work - but unfortunately we have customers with (*not
joking*) over 100K different tags (*why they have a taxonomy like that is a
different issue*). I'm willing to accept that in our scenario n-grams might
be the Solr-based answer (the other being to change what "contains" means
within our application) but thought I'd check I hadn't overlooked any other
options. :)
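
For completeness, a minimal sketch of what the n-gram route could look like.
The field/type names and gram sizes below are illustrative only, not our
actual schema:

    <fieldType name="tag_ngram" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="20"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <field name="tag_grams" type="tag_ngram" indexed="true" stored="false" multiValued="true"/>
    <copyField source="tag" dest="tag_grams"/>

A "contains" search then becomes a plain term lookup against the gram field,
e.g. fq=tag_grams:"ms-reply-paid", at the cost of a noticeably larger index.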

On Mon, Jun 29, 2020 at 3:54 PM Mikhail Khludnev  wrote:

> Hello, Chris.
> I suppose index time analysis can yield these terms:
> "paid","ms-reply-unpaid","ms-reply-paid", and thus let you avoid these
> expensive wildcard queries. Here's why it's worth avoiding them:
> https://www.slideshare.net/lucidworks/search-like-sql-mikhail-khludnev-epam
>
> On Mon, Jun 29, 2020 at 6:17 PM Chris Dempsey  wrote:
>
> > Hello, all! I'm relatively new to Solr and Lucene (*using Solr 7.7.1*)
> but
> > I'm looking into options for optimizing something like this:
> >
> > > fq=(tag:* -tag:*paid*) OR (tag:* -tag:*ms-reply-unpaid*) OR
> > tag:*ms-reply-paid*
> >
> > It's probably not a surprise that we're seeing performance issues with
> > something like this. My understanding is that using the wildcard on both
> > ends forces a full-text index search. Something like the above can't take
> > advantage of something like the ReversedWildcardFilter either. I believe
> > constructing `n-grams` is an option (*at the expense of index size*) but
> is
> > there anything I'm overlooking as a possible avenue to look into?
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


Re: Prefix + Suffix Wildcards in Searches

2020-06-29 Thread Chris Dempsey
First off, thanks for taking a look, Erick! I see you helping lots of folks
out here and I've learned a lot from your answers. Much appreciated!

> How regular are your patterns? Are they arbitrary?

Good question. :) That's data that I should have included in the initial
post but both the values in the `tag` field and the search query itself are
totally arbitrary (*i.e. user entered values*). I see where you're going if
the set of either part was limited.

> What’s the field type anyway? Is this field tokenized?


On Mon, Jun 29, 2020 at 10:33 AM Erick Erickson 
wrote:

> How regular are your patterns? Are they arbitrary?
> What I’m wondering is if you could shift your work to the
> indexing end, perhaps even in an auxiliary field. Could you,
> say, just index “paid”, “ms-reply-unpaid” etc.? Then there
> are no wildcards at all. This is akin to “concept search”.
>
> Otherwise ngramming is your best bet.
>
> What’s the field type anyway? Is this field tokenized?
>
> There are lots of options, but so much depends on whether
> you can process the data such that you won’t need wildcards.
>
> Best,
> Erick
>
> > On Jun 29, 2020, at 11:16 AM, Chris Dempsey  wrote:
> >
> > Hello, all! I'm relatively new to Solr and Lucene (*using Solr 7.7.1*)
> but
> > I'm looking into options for optimizing something like this:
> >
> >> fq=(tag:* -tag:*paid*) OR (tag:* -tag:*ms-reply-unpaid*) OR
> > tag:*ms-reply-paid*
> >
> > It's probably not a surprise that we're seeing performance issues with
> > something like this. My understanding is that using the wildcard on both
> > ends forces a full-text index search. Something like the above can't take
> > advantage of something like the ReversedWildcardFilter either. I believe
> > constructing `n-grams` is an option (*at the expense of index size*) but
> is
> > there anything I'm overlooking as a possible avenue to look into?
>
>


Prefix + Suffix Wildcards in Searches

2020-06-29 Thread Chris Dempsey
Hello, all! I'm relatively new to Solr and Lucene (*using Solr 7.7.1*) but
I'm looking into options for optimizing something like this:

> fq=(tag:* -tag:*paid*) OR (tag:* -tag:*ms-reply-unpaid*) OR
tag:*ms-reply-paid*

It's probably not a surprise that we're seeing performance issues with
something like this. My understanding is that using the wildcard on both
ends forces a full-text index search. Something like the above can't take
advantage of something like the ReversedWildcardFilter either. I believe
constructing `n-grams` is an option (*at the expense of index size*) but is
there anything I'm overlooking as a possible avenue to look into?


Re: Default Values and Missing Field Queries

2020-05-25 Thread Chris Dempsey
Thanks for the clarification and pointers Erick! Much appreciated!
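
For the archives, the two query shapes this boils down to, as a sketch (the
field name softDeleted is hypothetical):

    fq=-softDeleted:false
    fq=softDeleted:true OR (*:* -softDeleted:[* TO *])

Both are intended to match documents where the flag is explicitly true as
well as older documents indexed before the field existed; the first relies on
Solr adding the implicit *:* to a pure negative filter.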

On Mon, May 25, 2020 at 11:18 AM Erick Erickson 
wrote:

> Try q=*:* -boolfield:false
>
> And it's not as costly as you might think, there's special handling for *:*
> queries. And if you put that in an fq clause instead, the result set will
> be put into the filter cache and be reused assuming you want to do this
> repeatedly.
>
> BTW, Solr doesn't use strict Boolean logic, which may be a bit confusing.
> Google for Chris Hostetter's (Hossman) blog at Lucidworks for a great
> explanation.
>
> And yes, your understanding of adding a new field is correct
>
> Best,
> Erick
> On Mon, May 25, 2020, 11:39 Chris Dempsey  wrote:
>
> > I'm new to Solr and made an honest stab at finding this info in the docs.
> >
> > I'm working on an update to an existing large collection in Solr 7.7 to
> add
> > a BoolField to mark it as "soft deleted" or not. My understanding is that
> > updating the schema will mean the new field will only exist and have a
> > value (or the default value) for documents indexed after the change,
> > correct? If that's the case, is it possible to query for all documents
> that
> > have that field set to `true` or if that field is completely missing? Is
> > it a Bad Idea(tm) from a performance or resource usage standpoint to use a
> > "where field X doesn't exist" query (i.e. am I going to end up running a
> > "table scan" if I do)?
> >
> > Thanks in advance!
> >
>


Default Values and Missing Field Queries

2020-05-25 Thread Chris Dempsey
I'm new to Solr and made an honest stab at finding this info in the docs.

I'm working on an update to an existing large collection in Solr 7.7 to add
a BoolField to mark it as "soft deleted" or not. My understanding is that
updating the schema will mean the new field will only exist and have a
value (or the default value) for documents indexed after the change,
correct? If that's the case, is it possible to query for all documents that
have that field set to `true` or if that field is completely missing? Is it
a Bad Idea(tm) from a performance or resource usage standpoint to use a
"where field X doesn't exist" query (i.e. am I going to end up running a
"table scan" if I do)?

Thanks in advance!
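
P.S. A sketch of the kind of field definition being discussed (the name and
default value are hypothetical). Note the default is only applied to documents
indexed after the schema change; existing documents are left without a value:

    <field name="softDeleted" type="boolean" indexed="true" stored="true" default="false"/>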