Re: Processing query clause combinations at indexing time
You can look at IndexSearcher.setQueryCache etc. for more details, especially LRUQueryCache. Maybe we should celebrate a little bit if it's already 80% of the way there for your use-case, but at the same time, perhaps defaults could be better. There is a lot going on here, for example decisions about which data structures to use in the cache (sparse bitsets and so on) that all have tradeoffs. But IMO the out-of-box defaults should be as good as possible, since that has the huge benefit of requiring zero effort from the user.

On Tue, Dec 15, 2020 at 8:05 PM Michael Froh wrote:
>
> We don't handle positional queries in our use-case, but that's just because we don't happen to have many positional queries. But if we identify documents at indexing time that contain a given phrase/slop/etc. query, then we can tag the documents with a term that indicates that (or, more likely, tag documents that contain that positional query AND some other queries). We can identify documents that match a PhraseQuery, for example, by appending a TokenFilter for the relevant field that "listens" for the given phrase.
>
> Our use-case has only needed TermQuery, numeric range queries, and ToParentBlockJoinQuery clauses so far, though. For TermQuery, we can just listen for individual terms (with a TokenFilter). For range queries, we look at the IndexableField itself (typically an IntPoint) before submitting the Document to the IndexWriter. For a ToParentBlockJoinQuery, we can just apply the matching logic to each child document to detect a match before we get to the parent. The downside is that for each Query type that we want to be able to evaluate at indexing time, we need to add explicit support.
>
> We're not scoring at matching time (relying on a static sort instead), which allows us to remove the matched clauses altogether.
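A minimal sketch of the indexing-time detection described in the quoted message, in plain Python for illustration only: the real implementation is a Lucene TokenFilter (for term clauses) plus an inspection of IndexableFields before the Document reaches the IndexWriter (for range clauses). Every name here, such as `opto_terms`, is invented for the sketch.

```python
# Sketch (not Lucene code): tag a document with synthetic "opto" terms for
# every registered clause combination it satisfies at indexing time.

def matches_combo(doc_terms, doc_ints, combo):
    """combo = (required_terms, {field: (lo, hi)}) with inclusive bounds."""
    required_terms, ranges = combo
    if not required_terms <= doc_terms:          # TermQuery clauses: "listen" for terms
        return False
    for field, (lo, hi) in ranges.items():       # range clauses: inspect the int field
        value = doc_ints.get(field)
        if value is None or not (lo <= value <= hi):
            return False
    return True

def opto_terms(doc_terms, doc_ints, combos):
    """Return the synthetic tag terms to add to the document."""
    return {tag for tag, combo in combos.items()
            if matches_combo(doc_terms, doc_ints, combo)}

combos = {
    "opto:ABC": ({"A", "B", "C"}, {}),
    "opto:A_price": ({"A"}, {"price": (0, 100)}),
}
doc = ({"A", "B", "C", "X"}, {"price": 250})
print(opto_terms(*doc, combos))  # matches +A +B +C, but fails the price range
```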
That said, if the match > set of the conjunction of required clauses is small (at least smaller than > the match sets of the individual clauses), adding a "precomputed > intersection" filter should advance scorers more efficiently. > > Does Lucene's filter caching match on subsets of required clauses? So, for > example, if some queries contain (somewhere in a BooleanQuery tree) clauses > that flatten to "+A +B +C", can I cache that and also have it kick in for a > BooleanQuery containing "+A +B +C +D", turning it into something like > "+cached('+A +B +C') +D" without having to explicitly do a cache lookup for > "+A +B +C"? > > I guess another advantage of our approach is that it's effectively a > write-through cache, pushing the filter-matching burden to indexing time. For > read-heavy use-cases, that trade-off is worth it. > > > > > On Tue, Dec 15, 2020 at 3:42 PM Robert Muir wrote: >> >> What are you doing with positional queries though? And how does the >> scoring work (it is unclear from your previous reply to me whether you >> are scoring). >> >> Lucene has filter caching too, so if you are doing this for >> non-scoring cases maybe something is off? >> >> On Tue, Dec 15, 2020 at 3:19 PM Michael Froh wrote: >> > >> > It's conceptually similar to CommonGrams in the single-field case, though >> > it doesn't require terms to appear in any particular positions. >> > >> > It's also able to match across fields, which is where we get a lot of >> > benefit. We have frequently-occurring filters that get added by various >> > front-end layers before they hit us (which vary depending on where the >> > query comes from). In that regard, it's kind of like Solr's filter cache, >> > except that we identify the filters offline by analyzing query logs, find >> > common combinations of filters (especially ones where the intersection is >> > smaller than the smallest term's postings list), and cache the filters in >> > the index the next time we reindex. 
>> > >> > On Tue, Dec 15, 2020 at 9:10 AM Robert Muir wrote: >> >> >> >> See also commongrams which is a very similar concept: >> >> https://github.com/apache/lucene-solr/tree/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/commongrams >> >> >> >> On Tue, Dec 15, 2020 at 12:08 PM Robert Muir wrote: >> >> > >> >> > I wonder if it can be done in a fairly clean way. This sounds similar >> >> > to using a ShingleFilter to do this optimization, but adding some >> >> > conditionals so that the index is smaller? Now that we have >> >> > ConditionalTokenFilter (for branching), can the feature be implemented >> >> > cleanly? >> >> > >> >> > Ideally it wouldn't require a lot of new code, something like checking >> >> > a "set" + conditionaltokenfilter + shinglefilter? >> >> > >> >> > On Mon, Dec 14, 2020 at 2:37 PM Michael Froh wrote: >> >> > > >> >> > > My team at work has a neat feature that we've built on top of Lucene >> >> > > that has provided a substantial (20%+) increase in maximum qps and >> >> > > some reduction in query latency. >> >> > > >> >> > > Basically, we run a training
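To make the subset-matching question from this message concrete, here is a sketch of the rewrite being asked about. The `cached(...)` naming is hypothetical; as far as I know Lucene's LRUQueryCache keys on whole (sub)queries and does not do this subset matching itself.

```python
# Sketch of a "subset cache hit": given cached conjunctions keyed by their
# clause sets, rewrite "+A +B +C +D" into "+cached('+A +B +C') +D".

def rewrite_with_cache(required, cached):
    """required: set of clause names; cached: iterable of frozensets."""
    hits = [c for c in cached if c <= set(required)]
    if not hits:
        return sorted(required)
    best = max(hits, key=len)                    # prefer the largest cached subset
    tag = "cached('" + " ".join("+" + c for c in sorted(best)) + "')"
    return [tag] + sorted(set(required) - best)

cached = [frozenset({"A", "B", "C"}), frozenset({"A", "B"})]
print(rewrite_with_cache({"A", "B", "C", "D"}, cached))
```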
Re: Processing query clause combinations at indexing time
We don't handle positional queries in our use-case, but that's just because we don't happen to have many positional queries. But if we identify documents at indexing time that contain a given phrase/slop/etc. query, then we can tag the documents with a term that indicates that (or, more likely, tag documents that contain that positional query AND some other queries). We can identify documents that match a PhraseQuery, for example, by appending a TokenFilter for the relevant field that "listens" for the given phrase.

Our use-case has only needed TermQuery, numeric range queries, and ToParentBlockJoinQuery clauses so far, though. For TermQuery, we can just listen for individual terms (with a TokenFilter). For range queries, we look at the IndexableField itself (typically an IntPoint) before submitting the Document to the IndexWriter. For a ToParentBlockJoinQuery, we can just apply the matching logic to each child document to detect a match before we get to the parent. The downside is that for each Query type that we want to be able to evaluate at indexing time, we need to add explicit support.

We're not scoring at matching time (relying on a static sort instead), which allows us to remove the matched clauses altogether. That said, if the match set of the conjunction of required clauses is small (at least smaller than the match sets of the individual clauses), adding a "precomputed intersection" filter should advance scorers more efficiently.

Does Lucene's filter caching match on subsets of required clauses? So, for example, if some queries contain (somewhere in a BooleanQuery tree) clauses that flatten to "+A +B +C", can I cache that and also have it kick in for a BooleanQuery containing "+A +B +C +D", turning it into something like "+cached('+A +B +C') +D" without having to explicitly do a cache lookup for "+A +B +C"?

I guess another advantage of our approach is that it's effectively a write-through cache, pushing the filter-matching burden to indexing time.
For read-heavy use-cases, that trade-off is worth it. On Tue, Dec 15, 2020 at 3:42 PM Robert Muir wrote: > What are you doing with positional queries though? And how does the > scoring work (it is unclear from your previous reply to me whether you > are scoring). > > Lucene has filter caching too, so if you are doing this for > non-scoring cases maybe something is off? > > On Tue, Dec 15, 2020 at 3:19 PM Michael Froh wrote: > > > > It's conceptually similar to CommonGrams in the single-field case, > though it doesn't require terms to appear in any particular positions. > > > > It's also able to match across fields, which is where we get a lot of > benefit. We have frequently-occurring filters that get added by various > front-end layers before they hit us (which vary depending on where the > query comes from). In that regard, it's kind of like Solr's filter cache, > except that we identify the filters offline by analyzing query logs, find > common combinations of filters (especially ones where the intersection is > smaller than the smallest term's postings list), and cache the filters in > the index the next time we reindex. > > > > On Tue, Dec 15, 2020 at 9:10 AM Robert Muir wrote: > >> > >> See also commongrams which is a very similar concept: > >> > https://github.com/apache/lucene-solr/tree/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/commongrams > >> > >> On Tue, Dec 15, 2020 at 12:08 PM Robert Muir wrote: > >> > > >> > I wonder if it can be done in a fairly clean way. This sounds similar > >> > to using a ShingleFilter to do this optimization, but adding some > >> > conditionals so that the index is smaller? Now that we have > >> > ConditionalTokenFilter (for branching), can the feature be implemented > >> > cleanly? > >> > > >> > Ideally it wouldn't require a lot of new code, something like checking > >> > a "set" + conditionaltokenfilter + shinglefilter? 
> >> > > >> > On Mon, Dec 14, 2020 at 2:37 PM Michael Froh > wrote: > >> > > > >> > > My team at work has a neat feature that we've built on top of > Lucene that has provided a substantial (20%+) increase in maximum qps and > some reduction in query latency. > >> > > > >> > > Basically, we run a training process that looks at historical > queries to find frequently co-occurring combinations of required clauses, > say "+A +B +C +D". Then at indexing time, if a document satisfies one of > these known combinations, we add a new term to the doc, like "opto:ABCD". > At query time, we can then replace the required clauses with a single > TermQuery for the "optimized" term. > >> > > > >> > > It adds a little bit of extra work at indexing time and requires > the offline training step, but we've found that it yields a significant > boost at query time. > >> > > > >> > > We're interested in open-sourcing this feature. Is it something > worth adding to Lucene? Since it doesn't require any core changes, maybe as > a module? > >> > >>
Re: Processing query clause combinations at indexing time
What are you doing with positional queries though? And how does the scoring work (it is unclear from your previous reply to me whether you are scoring). Lucene has filter caching too, so if you are doing this for non-scoring cases maybe something is off? On Tue, Dec 15, 2020 at 3:19 PM Michael Froh wrote: > > It's conceptually similar to CommonGrams in the single-field case, though it > doesn't require terms to appear in any particular positions. > > It's also able to match across fields, which is where we get a lot of > benefit. We have frequently-occurring filters that get added by various > front-end layers before they hit us (which vary depending on where the query > comes from). In that regard, it's kind of like Solr's filter cache, except > that we identify the filters offline by analyzing query logs, find common > combinations of filters (especially ones where the intersection is smaller > than the smallest term's postings list), and cache the filters in the index > the next time we reindex. > > On Tue, Dec 15, 2020 at 9:10 AM Robert Muir wrote: >> >> See also commongrams which is a very similar concept: >> https://github.com/apache/lucene-solr/tree/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/commongrams >> >> On Tue, Dec 15, 2020 at 12:08 PM Robert Muir wrote: >> > >> > I wonder if it can be done in a fairly clean way. This sounds similar >> > to using a ShingleFilter to do this optimization, but adding some >> > conditionals so that the index is smaller? Now that we have >> > ConditionalTokenFilter (for branching), can the feature be implemented >> > cleanly? >> > >> > Ideally it wouldn't require a lot of new code, something like checking >> > a "set" + conditionaltokenfilter + shinglefilter? 
>> > >> > On Mon, Dec 14, 2020 at 2:37 PM Michael Froh wrote: >> > > >> > > My team at work has a neat feature that we've built on top of Lucene >> > > that has provided a substantial (20%+) increase in maximum qps and some >> > > reduction in query latency. >> > > >> > > Basically, we run a training process that looks at historical queries to >> > > find frequently co-occurring combinations of required clauses, say "+A >> > > +B +C +D". Then at indexing time, if a document satisfies one of these >> > > known combinations, we add a new term to the doc, like "opto:ABCD". At >> > > query time, we can then replace the required clauses with a single >> > > TermQuery for the "optimized" term. >> > > >> > > It adds a little bit of extra work at indexing time and requires the >> > > offline training step, but we've found that it yields a significant >> > > boost at query time. >> > > >> > > We're interested in open-sourcing this feature. Is it something worth >> > > adding to Lucene? Since it doesn't require any core changes, maybe as a >> > > module? >> >> - >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> For additional commands, e-mail: dev-h...@lucene.apache.org >> - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Processing query clause combinations at indexing time
It's conceptually similar to CommonGrams in the single-field case, though it doesn't require terms to appear in any particular positions. It's also able to match across fields, which is where we get a lot of benefit. We have frequently-occurring filters that get added by various front-end layers before they hit us (which vary depending on where the query comes from). In that regard, it's kind of like Solr's filter cache, except that we identify the filters offline by analyzing query logs, find common combinations of filters (especially ones where the intersection is smaller than the smallest term's postings list), and cache the filters in the index the next time we reindex. On Tue, Dec 15, 2020 at 9:10 AM Robert Muir wrote: > See also commongrams which is a very similar concept: > > https://github.com/apache/lucene-solr/tree/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/commongrams > > On Tue, Dec 15, 2020 at 12:08 PM Robert Muir wrote: > > > > I wonder if it can be done in a fairly clean way. This sounds similar > > to using a ShingleFilter to do this optimization, but adding some > > conditionals so that the index is smaller? Now that we have > > ConditionalTokenFilter (for branching), can the feature be implemented > > cleanly? > > > > Ideally it wouldn't require a lot of new code, something like checking > > a "set" + conditionaltokenfilter + shinglefilter? > > > > On Mon, Dec 14, 2020 at 2:37 PM Michael Froh wrote: > > > > > > My team at work has a neat feature that we've built on top of Lucene > that has provided a substantial (20%+) increase in maximum qps and some > reduction in query latency. > > > > > > Basically, we run a training process that looks at historical queries > to find frequently co-occurring combinations of required clauses, say "+A > +B +C +D". Then at indexing time, if a document satisfies one of these > known combinations, we add a new term to the doc, like "opto:ABCD". 
At > query time, we can then replace the required clauses with a single > TermQuery for the "optimized" term. > > > > > > It adds a little bit of extra work at indexing time and requires the > offline training step, but we've found that it yields a significant boost > at query time. > > > > > > We're interested in open-sourcing this feature. Is it something worth > adding to Lucene? Since it doesn't require any core changes, maybe as a > module? > > - > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > >
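The selection criterion described in this message (keep combinations whose intersection is smaller than the smallest term's postings list) could be sketched like this. The sketch materializes postings as Python sets purely for illustration; a real implementation would compare doc frequencies against an estimate of the intersection size.

```python
from collections import Counter
from itertools import combinations

# Sketch: count co-occurring filter pairs in a query log, then keep pairs
# whose intersection is at most max_ratio of the smallest term's postings.

def worthwhile_pairs(postings, query_log, min_count=2, max_ratio=0.5):
    pair_counts = Counter()
    for q in query_log:
        for pair in combinations(sorted(q), 2):
            pair_counts[pair] += 1
    keep = []
    for (a, b), count in pair_counts.items():
        if count < min_count:
            continue                              # not frequent enough to bother
        inter = len(postings[a] & postings[b])
        smallest = min(len(postings[a]), len(postings[b]))
        if smallest and inter / smallest <= max_ratio:
            keep.append((a, b))                   # intersection much smaller: precompute
    return sorted(keep)

postings = {"A": set(range(100)), "B": set(range(50, 150)), "C": set(range(40))}
query_log = [{"A", "B"}, {"A", "B"}, {"A", "C"}, {"A", "C"}]
print(worthwhile_pairs(postings, query_log))
```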
Re: Processing query clause combinations at indexing time
Huh... I didn't know about Luwak / the monitoring module. I spent some time this morning going through it. It takes a very different approach to matching at indexing time versus what we did, and looks more powerful. Given that document-matching is one of the harder steps in the process, I'm quite happy to leverage something that already exists. The feature we built has two other parts -- an offline training piece and a query-optimizing piece. They share a QueryVisitor that collects required clauses. The training step identifies frequently co-occurring combinations of required clauses (using an FP-Growth implementation) and the query optimizer adds a matching TermQuery as a filter clause (and removes the replaced clauses, if they're non-scoring). They're pretty lightweight compared to document-matching, though. On Tue, Dec 15, 2020 at 7:41 AM Michael Sokolov wrote: > I feel like there could be some considerable overlap with features > provided by Luwak, which was contributed to Lucene fairly recently, > and I think does the query inversion work required for this; maybe > more of it already exists here? I don't know if that module handles > the query rewriting, or the term indexing you're talking about though. > > On Mon, Dec 14, 2020 at 11:25 PM Atri Sharma wrote: > > > > +1 > > > > I would suggest that this be an independent project hosted on Github > (there have been similar projects in the past that have seen success that > way) > > > > On Tue, 15 Dec 2020, 09:37 David Smiley, wrote: > >> > >> Great optimization! > >> > >> I'm dubious on it being a good contribution to Lucene itself however, > because what you propose fits cleanly above Lucene. Even at a ES/Solr > layer (which I know you don't use, but hypothetically speaking), I'm > dubious there as well. 
> >> > >> ~ David Smiley > >> Apache Lucene/Solr Search Developer > >> http://www.linkedin.com/in/davidwsmiley > >> > >> > >> On Mon, Dec 14, 2020 at 2:37 PM Michael Froh wrote: > >>> > >>> My team at work has a neat feature that we've built on top of Lucene > that has provided a substantial (20%+) increase in maximum qps and some > reduction in query latency. > >>> > >>> Basically, we run a training process that looks at historical queries > to find frequently co-occurring combinations of required clauses, say "+A > +B +C +D". Then at indexing time, if a document satisfies one of these > known combinations, we add a new term to the doc, like "opto:ABCD". At > query time, we can then replace the required clauses with a single > TermQuery for the "optimized" term. > >>> > >>> It adds a little bit of extra work at indexing time and requires the > offline training step, but we've found that it yields a significant boost > at query time. > >>> > >>> We're interested in open-sourcing this feature. Is it something worth > adding to Lucene? Since it doesn't require any core changes, maybe as a > module? > > - > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > >
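The offline training step described at the top of this message can be sketched with a brute-force frequent-itemset count standing in for a real FP-Growth implementation (illustrative only; FP-Growth exists precisely to avoid enumerating every subset like this does):

```python
from collections import Counter
from itertools import combinations

# Naive stand-in for the FP-Growth training step: count clause combinations
# across a query log and keep those that meet a support threshold.

def frequent_combos(query_log, min_support):
    counts = Counter()
    for required_clauses in query_log:
        clauses = sorted(required_clauses)
        for size in range(2, len(clauses) + 1):
            for combo in combinations(clauses, size):
                counts[combo] += 1
    return {combo for combo, n in counts.items() if n >= min_support}

log = [{"A", "B", "C"}, {"A", "B", "C", "D"}, {"A", "B"}, {"C", "D"}]
print(frequent_combos(log, min_support=3))
```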
Re: Processing query clause combinations at indexing time
I wonder if it can be done in a fairly clean way. This sounds similar to using a ShingleFilter to do this optimization, but adding some conditionals so that the index is smaller? Now that we have ConditionalTokenFilter (for branching), can the feature be implemented cleanly? Ideally it wouldn't require a lot of new code, something like checking a "set" + conditionaltokenfilter + shinglefilter? On Mon, Dec 14, 2020 at 2:37 PM Michael Froh wrote: > > My team at work has a neat feature that we've built on top of Lucene that has > provided a substantial (20%+) increase in maximum qps and some reduction in > query latency. > > Basically, we run a training process that looks at historical queries to find > frequently co-occurring combinations of required clauses, say "+A +B +C +D". > Then at indexing time, if a document satisfies one of these known > combinations, we add a new term to the doc, like "opto:ABCD". At query time, > we can then replace the required clauses with a single TermQuery for the > "optimized" term. > > It adds a little bit of extra work at indexing time and requires the offline > training step, but we've found that it yields a significant boost at query > time. > > We're interested in open-sourcing this feature. Is it something worth adding > to Lucene? Since it doesn't require any core changes, maybe as a module? - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
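A rough sketch of the idea in this message, with plain Python standing in for a ConditionalTokenFilter wrapped around a ShingleFilter: only shingles whose component tokens are all in a configured "interesting" set get emitted, which keeps the index small.

```python
# Sketch only: real Lucene code would compose ConditionalTokenFilter and
# ShingleFilter; this just shows the intended filtering behavior.

def conditional_shingles(tokens, interesting, max_shingle=3):
    out = []
    for size in range(2, max_shingle + 1):
        for i in range(len(tokens) - size + 1):
            window = tokens[i:i + size]
            if all(t in interesting for t in window):  # the "conditional" part
                out.append("_".join(window))           # the "shingle" part
    return out

tokens = ["the", "quick", "brown", "fox"]
print(conditional_shingles(tokens, interesting={"quick", "brown", "fox"}))
```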
Re: Processing query clause combinations at indexing time
See also commongrams which is a very similar concept: https://github.com/apache/lucene-solr/tree/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/commongrams On Tue, Dec 15, 2020 at 12:08 PM Robert Muir wrote: > > I wonder if it can be done in a fairly clean way. This sounds similar > to using a ShingleFilter to do this optimization, but adding some > conditionals so that the index is smaller? Now that we have > ConditionalTokenFilter (for branching), can the feature be implemented > cleanly? > > Ideally it wouldn't require a lot of new code, something like checking > a "set" + conditionaltokenfilter + shinglefilter? > > On Mon, Dec 14, 2020 at 2:37 PM Michael Froh wrote: > > > > My team at work has a neat feature that we've built on top of Lucene that > > has provided a substantial (20%+) increase in maximum qps and some > > reduction in query latency. > > > > Basically, we run a training process that looks at historical queries to > > find frequently co-occurring combinations of required clauses, say "+A +B > > +C +D". Then at indexing time, if a document satisfies one of these known > > combinations, we add a new term to the doc, like "opto:ABCD". At query > > time, we can then replace the required clauses with a single TermQuery for > > the "optimized" term. > > > > It adds a little bit of extra work at indexing time and requires the > > offline training step, but we've found that it yields a significant boost > > at query time. > > > > We're interested in open-sourcing this feature. Is it something worth > > adding to Lucene? Since it doesn't require any core changes, maybe as a > > module? - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
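For readers unfamiliar with CommonGrams, a simplified sketch of its behavior: whenever a token or its neighbor is a common word, a bigram is emitted alongside the unigrams, so very frequent words can be matched as part of a pair instead of through a huge postings list. The real CommonGramsFilter also manages position attributes, which this sketch ignores.

```python
# Simplified illustration of the CommonGrams idea, not the actual filter.

def common_grams(tokens, common):
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)                           # always keep the unigram
        if i + 1 < len(tokens) and (tok in common or tokens[i + 1] in common):
            out.append(tok + "_" + tokens[i + 1])  # add the bigram for common words
    return out

print(common_grams(["the", "quick", "fox"], common={"the"}))
```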
Re: Processing query clause combinations at indexing time
In that case, I would be interested to know if this can be merged into Luwak. On Tue, 15 Dec 2020, 21:50 Adrien Grand, wrote: > I like this idea. I can think of several users who have a priori knowledge > of frequently used filters and would appreciate having Lucene take care of > transparently optimizing the execution of such filters instead of having to > do it manually. > > I'm not sure a separate project is the best option, it makes it more > challenging to keep up-to-date with releases, more challenging for users to > find it, etc. I'd rather add this feature to the Lucene repository, as a > new module or as part of an existing module? > > > On Tue, Dec 15, 2020 at 4:41 PM Michael Sokolov > wrote: > >> I feel like there could be some considerable overlap with features >> provided by Luwak, which was contributed to Lucene fairly recently, >> and I think does the query inversion work required for this; maybe >> more of it already exists here? I don't know if that module handles >> the query rewriting, or the term indexing you're talking about though. >> >> On Mon, Dec 14, 2020 at 11:25 PM Atri Sharma wrote: >> > >> > +1 >> > >> > I would suggest that this be an independent project hosted on Github >> (there have been similar projects in the past that have seen success that >> way) >> > >> > On Tue, 15 Dec 2020, 09:37 David Smiley, wrote: >> >> >> >> Great optimization! >> >> >> >> I'm dubious on it being a good contribution to Lucene itself however, >> because what you propose fits cleanly above Lucene. Even at a ES/Solr >> layer (which I know you don't use, but hypothetically speaking), I'm >> dubious there as well. 
>> >> >> >> ~ David Smiley >> >> Apache Lucene/Solr Search Developer >> >> http://www.linkedin.com/in/davidwsmiley >> >> >> >> >> >> On Mon, Dec 14, 2020 at 2:37 PM Michael Froh wrote: >> >>> >> >>> My team at work has a neat feature that we've built on top of Lucene >> that has provided a substantial (20%+) increase in maximum qps and some >> reduction in query latency. >> >>> >> >>> Basically, we run a training process that looks at historical queries >> to find frequently co-occurring combinations of required clauses, say "+A >> +B +C +D". Then at indexing time, if a document satisfies one of these >> known combinations, we add a new term to the doc, like "opto:ABCD". At >> query time, we can then replace the required clauses with a single >> TermQuery for the "optimized" term. >> >>> >> >>> It adds a little bit of extra work at indexing time and requires the >> offline training step, but we've found that it yields a significant boost >> at query time. >> >>> >> >>> We're interested in open-sourcing this feature. Is it something worth >> adding to Lucene? Since it doesn't require any core changes, maybe as a >> module? >> >> - >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> For additional commands, e-mail: dev-h...@lucene.apache.org >> >> > > -- > Adrien >
Re: Processing query clause combinations at indexing time
I like this idea. I can think of several users who have a priori knowledge of frequently used filters and would appreciate having Lucene take care of transparently optimizing the execution of such filters instead of having to do it manually. I'm not sure a separate project is the best option, it makes it more challenging to keep up-to-date with releases, more challenging for users to find it, etc. I'd rather add this feature to the Lucene repository, as a new module or as part of an existing module? On Tue, Dec 15, 2020 at 4:41 PM Michael Sokolov wrote: > I feel like there could be some considerable overlap with features > provided by Luwak, which was contributed to Lucene fairly recently, > and I think does the query inversion work required for this; maybe > more of it already exists here? I don't know if that module handles > the query rewriting, or the term indexing you're talking about though. > > On Mon, Dec 14, 2020 at 11:25 PM Atri Sharma wrote: > > > > +1 > > > > I would suggest that this be an independent project hosted on Github > (there have been similar projects in the past that have seen success that > way) > > > > On Tue, 15 Dec 2020, 09:37 David Smiley, wrote: > >> > >> Great optimization! > >> > >> I'm dubious on it being a good contribution to Lucene itself however, > because what you propose fits cleanly above Lucene. Even at a ES/Solr > layer (which I know you don't use, but hypothetically speaking), I'm > dubious there as well. > >> > >> ~ David Smiley > >> Apache Lucene/Solr Search Developer > >> http://www.linkedin.com/in/davidwsmiley > >> > >> > >> On Mon, Dec 14, 2020 at 2:37 PM Michael Froh wrote: > >>> > >>> My team at work has a neat feature that we've built on top of Lucene > that has provided a substantial (20%+) increase in maximum qps and some > reduction in query latency. 
> >>> > >>> Basically, we run a training process that looks at historical queries > to find frequently co-occurring combinations of required clauses, say "+A > +B +C +D". Then at indexing time, if a document satisfies one of these > known combinations, we add a new term to the doc, like "opto:ABCD". At > query time, we can then replace the required clauses with a single > TermQuery for the "optimized" term. > >>> > >>> It adds a little bit of extra work at indexing time and requires the > offline training step, but we've found that it yields a significant boost > at query time. > >>> > >>> We're interested in open-sourcing this feature. Is it something worth > adding to Lucene? Since it doesn't require any core changes, maybe as a > module? > > - > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > > -- Adrien
Re: Processing query clause combinations at indexing time
I feel like there could be some considerable overlap with features provided by Luwak, which was contributed to Lucene fairly recently, and I think does the query inversion work required for this; maybe more of it already exists here? I don't know if that module handles the query rewriting, or the term indexing you're talking about though. On Mon, Dec 14, 2020 at 11:25 PM Atri Sharma wrote: > > +1 > > I would suggest that this be an independent project hosted on Github (there > have been similar projects in the past that have seen success that way) > > On Tue, 15 Dec 2020, 09:37 David Smiley, wrote: >> >> Great optimization! >> >> I'm dubious on it being a good contribution to Lucene itself however, >> because what you propose fits cleanly above Lucene. Even at a ES/Solr layer >> (which I know you don't use, but hypothetically speaking), I'm dubious there >> as well. >> >> ~ David Smiley >> Apache Lucene/Solr Search Developer >> http://www.linkedin.com/in/davidwsmiley >> >> >> On Mon, Dec 14, 2020 at 2:37 PM Michael Froh wrote: >>> >>> My team at work has a neat feature that we've built on top of Lucene that >>> has provided a substantial (20%+) increase in maximum qps and some >>> reduction in query latency. >>> >>> Basically, we run a training process that looks at historical queries to >>> find frequently co-occurring combinations of required clauses, say "+A +B >>> +C +D". Then at indexing time, if a document satisfies one of these known >>> combinations, we add a new term to the doc, like "opto:ABCD". At query >>> time, we can then replace the required clauses with a single TermQuery for >>> the "optimized" term. >>> >>> It adds a little bit of extra work at indexing time and requires the >>> offline training step, but we've found that it yields a significant boost >>> at query time. >>> >>> We're interested in open-sourcing this feature. Is it something worth >>> adding to Lucene? Since it doesn't require any core changes, maybe as a >>> module? 
- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
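Putting the pieces of the original proposal together, an end-to-end toy sketch: "train" on a query log, tag matching documents with a synthetic term at indexing time, then rewrite queries to use the tag. Training here just counts whole required-clause sets rather than mining frequent sub-combinations, and all names like "opto:AB" are illustrative.

```python
from collections import Counter

def train(query_log, min_support):
    """Map frequently seen required-clause sets to synthetic terms."""
    counts = Counter(frozenset(q) for q in query_log)
    return {combo: "opto:" + "".join(sorted(combo))
            for combo, n in counts.items() if n >= min_support}

def tag(doc_terms, trained):
    """Indexing time: add a synthetic term for each satisfied combination."""
    return doc_terms | {term for combo, term in trained.items()
                        if combo <= doc_terms}

def rewrite(required, trained):
    """Query time: replace a known combination with its single term."""
    combo = frozenset(required)
    return {trained[combo]} if combo in trained else set(required)

log = [{"A", "B"}, {"A", "B"}, {"A", "C"}]
trained = train(log, min_support=2)
docs = [tag(d, trained) for d in [{"A", "B", "X"}, {"A", "Y"}]]
q = rewrite({"A", "B"}, trained)
matches = [i for i, d in enumerate(docs) if q <= d]
print(matches)  # the tagged doc matches via the single synthetic term
```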
[DISCUSS] SIP-12: Incremental Backup and Restore
Hey all,

This morning I published SIP-12, which proposes an overhaul of Solr's backup and restore functionality. While the "headline" improvement in this SIP is a change to do backups incrementally, it bundles in a number of other improvements as well, including the addition of corruption checks, APIs to list and delete backups, and stronger integration points with popular object storage APIs.

The SIP can be found here: https://cwiki.apache.org/confluence/display/SOLR/SIP-12%3A+Incremental+Backup+and+Restore

Please read the SIP description and come back here for discussion. As the discussion progresses we will update the SIP page with any outcomes and eventually move things to a VOTE.

Looking forward to hearing your feedback.

Best,
Jason
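As I understand the proposal, the core incremental idea relies on Lucene index files being write-once: a new backup only needs to upload files the repository does not already hold. A toy sketch with hypothetical file names (the real SIP also covers corruption checks and listing/deleting backups, which this ignores):

```python
# Sketch: plan an incremental backup by diffing the current index file set
# against what the backup repository already contains.

def plan_backup(index_files, repository_files):
    """Return (files to upload, files already present and reusable)."""
    to_upload = sorted(set(index_files) - set(repository_files))
    reused = sorted(set(index_files) & set(repository_files))
    return to_upload, reused

first_backup = ["_0.cfs", "_0.si", "segments_1"]
current_index = ["_0.cfs", "_0.si", "_1.cfs", "_1.si", "segments_2"]
print(plan_backup(current_index, first_backup))
```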