Re: Prefix + Suffix Wildcards in Searches

Chris Dempsey Tue, 30 Jun 2020 04:29:55 -0700

@Mikhail

Thanks for the link! I'll read through that.


On Tue, Jun 30, 2020 at 6:28 AM Chris Dempsey <cdal...@gmail.com> wrote:

> @Erick,
>
> You've got the idea. Basically the users can attach zero or more tags (*that
> they create*) to a document. So as an example say they've created the
> tags (this example is just a small subset of the total tags):
>
>    - paid
>    - invoice-paid
>    - ms-reply-unpaid-2019
>    - credit-ms-reply-unpaid
>    - ms-reply-paid-2019
>    - ms-reply-paid-2020
>
> and attached them in various combinations to documents. They then want to
> find all documents by tag that don't contain the characters "paid" anywhere
> in the tag, don't contain tags with the characters "ms-reply-unpaid", but
> do include documents tagged with the characters "ms-reply-paid".
>
> The obvious suggestion would be to have the users just use the entire tag
> (i.e. don't let them do a "contains") as a condition to eliminate the
> wildcards, which would work, but unfortunately we have customers with (*not
> joking*) over 100K different tags (*why anyone would have a taxonomy like
> that is a different issue*). I'm willing to accept that in our scenario
> n-grams might be the Solr-based answer (the other being to change what
> "contains" means within our application) but thought I'd check I hadn't
> overlooked any other options. :)
>
> On Mon, Jun 29, 2020 at 3:54 PM Mikhail Khludnev <m...@apache.org> wrote:
>
>> Hello, Chris.
>> I suppose index-time analysis can yield these terms:
>> "paid","ms-reply-unpaid","ms-reply-paid", and thus let you avoid these
>> expensive wildcard queries. Here's why it's worth avoiding them:
>>
>> https://www.slideshare.net/lucidworks/search-like-sql-mikhail-khludnev-epam
>>
>> On Mon, Jun 29, 2020 at 6:17 PM Chris Dempsey <cdal...@gmail.com> wrote:
>>
>> > Hello, all! I'm relatively new to Solr and Lucene (*using Solr 7.7.1*) but
>> > I'm looking into options for optimizing something like this:
>> >
>> > > fq=(tag:* -tag:*paid*) OR (tag:* -tag:*ms-reply-unpaid*) OR tag:*ms-reply-paid*
>> >
>> > It's probably not a surprise that we're seeing performance issues with
>> > something like this. My understanding is that using the wildcard on both
>> > ends forces a full-text index search. Something like the above can't take
>> > advantage of something like the ReversedWildcardFilter either. I believe
>> > constructing `n-grams` is an option (*at the expense of index size*) but is
>> > there anything I'm overlooking as a possible avenue to look into?
>> >
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>>
>
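
For readers following the thread, here is a minimal sketch of the n-gram idea
discussed above. It is not Solr code and not anyone's actual setup: it is a toy
in-memory inverted index in Python that shows why indexing every character
n-gram of a tag turns a "contains" condition into an exact term lookup, and why
that costs index size. The gram size limits (3 and 15), the document ids, the
tag-to-document assignments, and the "draft" tag are all assumptions made up
for the illustration. In Solr the rough equivalent would be an index-time-only
NGramFilterFactory with suitable minGramSize/maxGramSize values, which would
need testing against real data.

```python
from collections import defaultdict

MIN_GRAM = 3    # assumed lower bound on gram length (not from the thread)
MAX_GRAM = 15   # assumed upper bound; must cover the longest "contains" string


def char_ngrams(tag, min_n=MIN_GRAM, max_n=MAX_GRAM):
    """Return every substring of `tag` whose length is in [min_n, max_n]."""
    grams = set()
    for n in range(min_n, min(max_n, len(tag)) + 1):
        for start in range(len(tag) - n + 1):
            grams.add(tag[start:start + n])
    return grams


def build_index(docs):
    """Map each gram to the set of document ids that have a tag containing it."""
    index = defaultdict(set)
    for doc_id, tags in docs.items():
        for tag in tags:
            for gram in char_ngrams(tag):
                index[gram].add(doc_id)
    return index


def contains(index, needle):
    """Ids of documents with a tag containing `needle`: one exact lookup, no scan."""
    if not MIN_GRAM <= len(needle) <= MAX_GRAM:
        raise ValueError("needle length is outside the indexed gram sizes")
    return set(index.get(needle, ()))


# Hypothetical documents; the tag strings come from the thread except "draft".
docs = {
    1: ["paid"],
    2: ["invoice-paid"],
    3: ["ms-reply-unpaid-2019"],
    4: ["ms-reply-paid-2020", "credit-ms-reply-unpaid"],
    5: ["draft"],
}
index = build_index(docs)
tagged = set(docs)  # stands in for tag:*

# Same boolean shape as the fq in the original message.
result = (
    (tagged - contains(index, "paid"))
    | (tagged - contains(index, "ms-reply-unpaid"))
    | contains(index, "ms-reply-paid")
)
print(sorted(result))
print(len(index), "grams indexed for", sum(len(t) for t in docs.values()), "tags")
```

The second print line is the trade-off in miniature: the number of grams grows
much faster than the number of tags, which is the index size cost mentioned in
the original question.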
