@Mikhail Thanks for the link! I'll read through that.
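
For the archive, a minimal Python sketch of the n-gram idea discussed in the thread below. It is a toy illustration of the principle, not how Solr implements it: each tag is split into fixed-size character grams at "index time", so a "contains" lookup becomes a set intersection over exact grams instead of a scan of every term. The names `ngrams`, `TagNgramIndex`, and `contains` are made up for this example, and the gram size of 3 is an arbitrary assumption. In Solr the comparable piece would presumably be an n-gram filter in the index-time analyzer, with the gram sizes tuned to the tags.

```python
from collections import defaultdict


def ngrams(text, n):
    """Return the set of all character n-grams of `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class TagNgramIndex:
    """Toy inverted index (hypothetical): character n-gram -> tags containing it."""

    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(set)
        self.tags = set()

    def add(self, tag):
        # "Index time": pay the cost once by storing every gram of the tag.
        self.tags.add(tag)
        for gram in ngrams(tag, self.n):
            self.postings[gram].add(tag)

    def contains(self, needle):
        # Needles shorter than one gram cannot use the index; fall back to a scan.
        if len(needle) < self.n:
            return {t for t in self.tags if needle in t}
        # "Query time": intersect the postings of every gram in the needle.
        grams = ngrams(needle, self.n)
        candidates = set.intersection(*(self.postings.get(g, set()) for g in grams))
        # Sharing all grams does not guarantee a contiguous match, so verify.
        return {t for t in candidates if needle in t}


index = TagNgramIndex(n=3)
for tag in [
    "paid",
    "invoice-paid",
    "ms-reply-unpaid-2019",
    "credit-ms-reply-unpaid",
    "ms-reply-paid-2019",
    "ms-reply-paid-2020",
]:
    index.add(tag)

print(sorted(index.contains("ms-reply-paid")))
print(sorted(index.contains("ms-reply-unpaid")))
```

The trade-off is the one already mentioned in the thread: every tag is stored once per gram, so the index grows in exchange for cheaper lookups.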

On Tue, Jun 30, 2020 at 6:28 AM Chris Dempsey <cdal...@gmail.com> wrote:

> @Erick,
>
> You've got the idea. Basically the users can attach zero or more tags (*that
> they create*) to a document. So as an example say they've created the
> tags (this example is just a small subset of the total tags):
>
>    - paid
>    - invoice-paid
>    - ms-reply-unpaid-2019
>    - credit-ms-reply-unpaid
>    - ms-reply-paid-2019
>    - ms-reply-paid-2020
>
> and attached them in various combinations to documents. They then want to
> find all documents by tag that don't contain the characters "paid" anywhere
> in the tag, don't contain tags with the characters "ms-reply-unpaid", but
> do include documents tagged with the characters "ms-reply-paid".
>
> The obvious suggestion would be to have the users just use the entire tag
> (i.e. don't let them do a "contains") as a condition to eliminate the
> wildcards - which would work - but unfortunately we have customers with (*not
> joking*) over 100K different tags (*why have a taxonomy like that is yet
> a different issue*). I'm willing to accept that in our scenario n-grams
> might be the Solr-based answer (the other being to change what "contains"
> means within our application) but thought I'd check I hadn't overlooked any
> other options. :)
>
> On Mon, Jun 29, 2020 at 3:54 PM Mikhail Khludnev <m...@apache.org> wrote:
>
>> Hello, Chris.
>> I suppose index time analysis can yield these terms:
>> "paid","ms-reply-unpaid","ms-reply-paid", and thus let you avoid these
>> expensive wildcard queries. Here's why it's worth avoiding them:
>>
>> https://www.slideshare.net/lucidworks/search-like-sql-mikhail-khludnev-epam
>>
>> On Mon, Jun 29, 2020 at 6:17 PM Chris Dempsey <cdal...@gmail.com> wrote:
>>
>> > Hello, all! I'm relatively new to Solr and Lucene (*using Solr 7.7.1*) but
>> > I'm looking into options for optimizing something like this:
>> >
>> > > fq=(tag:* -tag:*paid*) OR (tag:* -tag:*ms-reply-unpaid*) OR tag:*ms-reply-paid*
>> >
>> > It's probably not a surprise that we're seeing performance issues with
>> > something like this. My understanding is that using the wildcard on both
>> > ends forces a full-text index search. Something like the above can't take
>> > advantage of something like the ReverseWordFilter either. I believe
>> > constructing `n-grams` is an option (*at the expense of index size*) but is
>> > there anything I'm overlooking as a possible avenue to look into?
>> >
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>>
>