Logging deprecations
Wouldn't it be nice if Solr had a simple utility to log a warning exactly once, and in a disable-able way, when you use a feature that we want to remove from Solr? Or that we are thinking about removing but want user input?

I'm thinking of adding this. I don't think our users monitor the lists well, and so this would be an additional way to solicit their input. If you know of something similar that already exists, or of a suitable place this should go, please let me know. I didn't look hard but found nothing, so I'll likely write some new little class for this. I'm thinking of a simple static method taking a String loggerName (which also serves as an ID of the feature) and a String warning message.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley
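A minimal sketch of what such a utility might look like, assuming plain java.util.logging rather than whatever logging facade Solr would actually use. The class name DeprecationLog and the method name log are hypothetical; only the (String loggerName, String message) shape comes from the proposal above.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper; not an existing Solr class. */
final class DeprecationLog {
  // Feature IDs (logger names) that have already produced a warning.
  private static final Set<String> ALREADY_LOGGED = ConcurrentHashMap.newKeySet();

  private DeprecationLog() {}

  /**
   * Logs the message at WARNING level the first time it is called for a given
   * loggerName and does nothing on later calls.
   *
   * @return true if this call emitted the warning
   */
  static boolean log(String loggerName, String message) {
    Logger logger = Logger.getLogger(loggerName);
    // Disable-able: raising this logger's level above WARNING silences the feature.
    if (!logger.isLoggable(Level.WARNING)) {
      return false;
    }
    // add() returns false if the ID was already present, so only one caller wins.
    if (!ALREADY_LOGGED.add(loggerName)) {
      return false;
    }
    logger.warning(message);
    return true;
  }
}
```

Because the logger name doubles as the feature ID, an operator could switch off one deprecation warning through ordinary logging configuration without touching the others. A warning that was suppressed is not recorded as logged, so it would still appear once if the logger is re-enabled later.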
Re: Avoiding false-positives in multivalued field search with intervals?
A slightly different but related topic is how to manage lots of fields. I agree that sub-fields are a pain and that mashing everything together in an all-field is a mess, but for best performance with a large number of fields/sub-fields, it is the only workable option I can see? Expanding a query over numerous fields grows combinatorially in the number of fields (if I want my query to match when all terms match in *some* field), doesn't it?

I would like to see a mechanism for defining sub-fields using positions. Together with an absolute positional query this would enable both match-any-field as well as field-specific matching with each token indexed only once (multi-values are possible within this with boundary tokens or big enough position ranges, as Alan suggested).

It does mean that the sub-field boundaries have to be managed somehow. Without index support, you can set an arbitrarily large size for your sub-field and insert position gaps at the boundaries, but maybe we could detect the largest sub-field at flush time and write that metadata somewhere in the index to enable smaller gaps?

Another issue is differing analysis for the sub-fields, and properly updating the positions during analysis at the boundaries (you don't want to insert a gap, but rather advance to a fixed position, and you have to index sub-fields in order). Maybe we could make it less horrible by adding better support for it.

Re: query parsing; wasn't there at one time an interval query parser? It had operators like w() and n(), IIRC.

On Thu, Sep 10, 2020 at 4:20 PM Dawid Weiss wrote:
> > Ok so the more general question is whether we need an interval query parser
>
> Oh, to this I'd say: yes, yes, yes.
> [...]

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
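The "advance to a fixed position" idea can be illustrated with a toy model in plain Java. This is only a sketch of the arithmetic, not Lucene code: the class SubFieldPositions, the constant SLOT_SIZE and the method names are all made up for illustration, and the slot width of 1000 is an arbitrary assumption.

```java
/** Toy model of position-based sub-fields; all names here are hypothetical. */
final class SubFieldPositions {
  // Assumed fixed slot width; every sub-field must hold fewer tokens than this.
  static final int SLOT_SIZE = 1000;

  private SubFieldPositions() {}

  /** Absolute position of the token at the given offset inside the given sub-field. */
  static int position(int subField, int offset) {
    if (subField < 0 || offset < 0 || offset >= SLOT_SIZE) {
      throw new IllegalArgumentException("token does not fit in the sub-field slot");
    }
    // Jump to a fixed start position instead of adding a gap after the previous token.
    return subField * SLOT_SIZE + offset;
  }

  /** Sub-field that owns an absolute position. */
  static int subFieldOf(int position) {
    return position / SLOT_SIZE;
  }

  /** Field-specific match: does the term occur anywhere inside the given sub-field? */
  static boolean occursIn(int[] termPositions, int subField) {
    for (int p : termPositions) {
      if (subFieldOf(p) == subField) {
        return true;
      }
    }
    return false;
  }
}
```

In this model a match-any-field lookup is just "the term has at least one position", while a field-specific lookup restricts positions to one slot. The fixed slot width is exactly the "arbitrarily large size" mentioned above, so a value longer than the slot cannot be represented.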
Re: Avoiding false-positives in multivalued field search with intervals?
> Ok so the more general question is whether we need an interval query parser

Oh, to this I'd say: yes, yes, yes.

I didn't have much prior experience writing frontend apps on top of Solr/Lucene, but once I had to go that route it quickly turned out that several things that are readily available at the code level are so darn difficult to achieve and integrate from the outside. Specifically:

- Field expansion in query parsers is a must (so that unqualified terms are expanded over multiple fields). Any query parser that doesn't support this is, in my opinion, of zero use. The "default" copy-to sink field known from Solr brings more problems than it solves.

- Exact match-region hit highlighting is a strong expectation. I solved this with the matches API (see LUCENE-9461) and the flexible query parser's multifield expansion. Works like a charm.

- Multivalued fields are common and sub-document handling is a pain. The problem I raised here is a result of direct user feedback. In real life multivalued fields are omnipresent and searches over those fields can be complex. Users see hits that just should not be there and are confused.

- People do use complex queries. Maybe not all people, but there are people out there who do... Just recently I extended the flexible query parser with a handcrafted min-should-match operator because it is otherwise not accessible in any Lucene query parser (!). I can make this code available (it's not terribly complex), although, since you asked, I think a query parser that exposes all sorts of "higher level" functionality of intervals would be very, very useful.

It may end up that I'll have to write something for intervals anyway, so we can work on this together if you like. The syntax especially is an open question: should it be operator-based (like the current boost or fuzzy operators) or meta-function-based (so that pseudo-functions would be available)? Or maybe a mix of both? I don't know, really. :)

Dawid
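To make "field expansion" concrete, here is a small sketch that rewrites unqualified terms into a query string where every term must match in some field. It is plain string manipulation for illustration only: the class FieldExpansion and its expand method are hypothetical, and a real parser would build Query objects, escape special characters and analyze each field separately.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative only; not part of any Lucene query parser. */
final class FieldExpansion {
  private FieldExpansion() {}

  /** Expands each unqualified term over all fields and requires every term to match. */
  static String expand(List<String> terms, List<String> fields) {
    return terms.stream()
        // One OR-group per term: the term may match in any of the fields.
        .map(term -> fields.stream()
            .map(field -> field + ":" + term)
            .collect(Collectors.joining(" OR ", "(", ")")))
        // All OR-groups are required.
        .collect(Collectors.joining(" AND "));
  }
}
```

For the terms foo and bar over the fields title and body this yields (title:foo OR body:foo) AND (title:bar OR body:bar).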
Re: Notification of analysis on publicly available project data
Is there any PMC action or support/cooperation needed here on our part?

On Thu, 10 Sep, 2020, 10:19 pm Griselda Cuevas wrote:
> Dear PMC,
>
> I’m contacting you because your project has been selected by the ASF D&I
> committee [...]
Re: Avoiding false-positives in multivalued field search with intervals?
Ok so the more general question is whether we need an interval query parser

On Thu, Sep 10, 2020 at 17:28, Dawid Weiss wrote:
> I am fine with the boundary token suggestion, actually. What I don't
> see at the moment is how I can marry it with an output of a general
> query parser (which returns any Query). I could give an attempt to
> process the query node tree from standard query parser (which we're
> using at the moment anyway) but if the tree becomes complex there is
> no guarantee I can extract subtrees that can be parsed into
> IntervalSources (and then in turn into IntervalQuery).
>
> Dawid
Notification of analysis on publicly available project data
Dear PMC,

I’m contacting you because your project has been selected by the ASF D&I committee, which is leading a research project to evaluate and understand the current state of diversity in our community [1]. As part of this research, we will analyze publicly available data about your project, such as Git logs, Jira boards and mailing lists, to better understand the state of diversity in Apache projects and to complement the findings we obtained from the Community Survey that was run this year [2].

This analysis will be performed by Bitergia [3], a vendor specializing in researching open source projects and foundations. The results will be published in a report similar to the OpenStack Foundation Analysis published in 2018 [4].

The analysis will be done only on aggregated data at the project level during and after processing, ensuring we do not report anything that could identify a single individual. The data we analyze will be deleted right after the research is done and won’t be retained by either the researcher or the ASF.

If you have any concerns or questions, please raise them to the diversity committee (d...@diversity.apache.org) and/or to the data privacy committee (priv...@apache.org).

Regards,
Griselda Cuevas
V.P. of Diversity and Inclusion
Apache Software Foundation

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=127405614
[2] https://youtu.be/4Mr1CRtKqUI
[3] https://bitergia.com/bitergia-analytics/
[4] https://superuser.openstack.org/articles/2018-gender-diversity-report/
Re: Avoiding false-positives in multivalued field search with intervals?
I am fine with the boundary token suggestion, actually. What I don't see at the moment is how I can marry it with an output of a general query parser (which returns any Query). I could give an attempt to process the query node tree from standard query parser (which we're using at the moment anyway) but if the tree becomes complex there is no guarantee I can extract subtrees that can be parsed into IntervalSources (and then in turn into IntervalQuery).

Dawid

On Thu, Sep 10, 2020 at 4:28 PM jim ferenczi wrote:
> Right, I misunderstood Alan's answer. The boundary option is not "impure" in
> my opinion. It solves this issue nicely but maybe it needs something more
> packaged to add the boundaries and build queries easily.
Re: Avoiding false-positives in multivalued field search with intervals?
Right, I misunderstood Alan's answer. The boundary option is not "impure" in my opinion. It solves this issue nicely but maybe it needs something more packaged to add the boundaries and build queries easily.

On Thu, Sep 10, 2020 at 16:16, Dawid Weiss wrote:
> Yup - similar to what Alan suggested. I'd have to rewrite the (general
> text-to-query) query parser to only use intervals though. Still
> thinking about possible approaches to this.
>
> D.
Re: Avoiding false-positives in multivalued field search with intervals?
Yup - similar to what Alan suggested. I'd have to rewrite the (general text-to-query) query parser to only use intervals though. Still thinking about possible approaches to this.

D.

On Thu, Sep 10, 2020 at 3:58 PM jim ferenczi wrote:
> You could set a very high position increment gap for multi-valued fields
> (Analyzer#getPositionIncrementGap) and perform something
> like Intervals.maxWidth(Intervals.unordered(...), pos_gap-1) ?
Re: Avoiding false-positives in multivalued field search with intervals?
You could set a very high position increment gap for multi-valued fields (Analyzer#getPositionIncrementGap) and perform something like Intervals.maxWidth(Intervals.unordered(...), pos_gap-1) ?

On Thu, Sep 10, 2020 at 12:32, Dawid Weiss wrote:
> Yeah... I was thinking about adding synthetic boundaries but this
> seems... impure. :) Another quick reflection is that I'd have to
> somehow translate the original query (which can be arbitrarily
> complex) into an interval query. Tough.
>
> D.
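The gap-plus-width idea can be checked with a toy position model in plain Java. This is a sketch, not Lucene code: PositionGapModel and its methods are invented for illustration, tokens are split on whitespace, and the rule "the next value starts gap extra positions after the previous token" is an assumption about how the gap is applied. As far as I know the corresponding Lucene factory is spelled Intervals.maxwidth(width, source), with the width first; check the Javadoc for your version.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a multi-valued field with a position increment gap; names are hypothetical. */
final class PositionGapModel {
  private PositionGapModel() {}

  /** Maps each token to its positions; every value after the first is shifted by the gap. */
  static Map<String, List<Integer>> index(List<String> values, int gap) {
    Map<String, List<Integer>> postings = new HashMap<>();
    int position = -1;
    boolean first = true;
    for (String value : values) {
      if (!first) {
        position += gap; // assumed gap handling between consecutive values
      }
      first = false;
      for (String token : value.trim().split("\\s+")) {
        if (token.isEmpty()) {
          continue;
        }
        position++;
        postings.computeIfAbsent(token, t -> new ArrayList<>()).add(position);
      }
    }
    return postings;
  }

  /** True if some occurrences of a and b fit in an interval of at most maxWidth positions. */
  static boolean unorderedWithin(
      Map<String, List<Integer>> postings, String a, String b, int maxWidth) {
    for (int pa : postings.getOrDefault(a, Collections.<Integer>emptyList())) {
      for (int pb : postings.getOrDefault(b, Collections.<Integer>emptyList())) {
        // Interval width counts both end positions.
        if (Math.abs(pa - pb) + 1 <= maxWidth) {
          return true;
        }
      }
    }
    return false;
  }
}
```

With a gap of 100 and a width limit of 99, foo and bar stored as separate values end up more than 99 positions apart and no longer match, while foo and bar inside one value still do. The model also shows the weakness of this approach: two terms inside one value that spans more positions than the width limit would be rejected as well.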
Re: Avoiding false-positives in multivalued field search with intervals?
Yeah... I was thinking about adding synthetic boundaries but this seems... impure. :) Another quick reflection is that I'd have to somehow translate the original query (which can be arbitrarily complex) into an interval query. Tough.

D.

On Thu, Sep 10, 2020 at 12:22 PM Alan Woodward wrote:
> I’ve solved this sort of thing in the past by indexing boundary tokens, and
> wrapping the queries with the equivalent of Intervals.notContaining(query,
> boundary-query); you could also put a very large position increment gap and
> use a width filter, but that’s a bit more error prone if you could
> conceivably have lots of text in the individual field entries.
Re: Avoiding false-positives in multivalued field search with intervals?
I’ve solved this sort of thing in the past by indexing boundary tokens, and wrapping the queries with the equivalent of Intervals.notContaining(query, boundary-query); you could also put a very large position increment gap and use a width filter, but that’s a bit more error prone if you could conceivably have lots of text in the individual field entries.

On 10 Sep 2020, at 10:38, Dawid Weiss wrote:
> Hi Alan,
>
> You're the expert here so I thought I'd ask before I jump in deep. Do
> you think it's feasible to solve the following multivalued-field
> problem:
>
> doc: field=["foo", "bar"]
> query: field:(foo AND bar)
> [...]
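The boundary-token approach can be sketched the same way, again as a toy model in plain Java rather than actual Lucene code. BoundaryTokenModel, the marker string and both methods are hypothetical; the check mimics the intent of notContaining by accepting an interval over the two terms only when no boundary marker lies inside it.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of boundary tokens between field values; names are hypothetical. */
final class BoundaryTokenModel {
  // Assumed marker that never occurs in real text.
  static final String BOUNDARY = "#boundary#";

  private BoundaryTokenModel() {}

  /** Flattens all values into one token stream, closing each value with a boundary token. */
  static List<String> index(List<String> values) {
    List<String> stream = new ArrayList<>();
    for (String value : values) {
      for (String token : value.trim().split("\\s+")) {
        if (!token.isEmpty()) {
          stream.add(token);
        }
      }
      stream.add(BOUNDARY);
    }
    return stream;
  }

  /** True if a and b co-occur in an interval that contains no boundary token. */
  static boolean matchesWithinOneValue(List<String> stream, String a, String b) {
    for (int i = 0; i < stream.size(); i++) {
      if (!stream.get(i).equals(a)) {
        continue;
      }
      for (int j = 0; j < stream.size(); j++) {
        if (!stream.get(j).equals(b)) {
          continue;
        }
        int start = Math.min(i, j);
        int end = Math.max(i, j);
        // Reject intervals that cross from one value into the next.
        if (!stream.subList(start, end + 1).contains(BOUNDARY)) {
          return true;
        }
      }
    }
    return false;
  }
}
```

Unlike a width filter, this check does not depend on how long an individual value is, which matches the remark above that the gap approach is more error prone for long field entries.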
Avoiding false-positives in multivalued field search with intervals?
Hi Alan,

You're the expert here so I thought I'd ask before I jump in deep. Do you think it's feasible to solve the following multivalued-field problem:

doc: field=["foo", "bar"]
query: field:(foo AND bar)

I'd like the above to return zero hits (no single value contains both foo and bar), but since multi-valued fields are logically indexed as a single field, it returns doc. I recognize this as a well known problem but subdocuments are not fun to deal with so I'd like to avoid them at all costs.

Would it be possible to solve the above with intervals? Say, something like this:

Intervals.containing(valuePositionRanges(), query).

I assume the containment relationship would get rid of false-positives crossing value boundary here. The problem is in how to construct those value position ranges... Store them at index-construction time somehow? Compute them on the fly for anything that has a chance to match query? Your thoughts would be very appreciated.

Dawid
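The false positive described here can be reproduced with a few lines of plain Java. This is only a model of the two behaviours, not of Lucene itself: the class MultiValuedFalsePositive and its methods are hypothetical, and values are tokenized by whitespace.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy illustration of the multi-valued AND problem; names are hypothetical. */
final class MultiValuedFalsePositive {
  private MultiValuedFalsePositive() {}

  static List<String> tokens(String value) {
    return Arrays.asList(value.trim().split("\\s+"));
  }

  /** Mimics a plain AND over the field: all values behave like one bag of terms. */
  static boolean matchesAcrossValues(List<String> values, String a, String b) {
    Set<String> all = new HashSet<>();
    for (String value : values) {
      all.addAll(tokens(value));
    }
    return all.contains(a) && all.contains(b);
  }

  /** Desired behaviour: both terms must occur inside a single value. */
  static boolean matchesWithinOneValue(List<String> values, String a, String b) {
    for (String value : values) {
      List<String> t = tokens(value);
      if (t.contains(a) && t.contains(b)) {
        return true;
      }
    }
    return false;
  }
}
```

For the document field=["foo", "bar"], the first method reports a hit for foo AND bar while the second does not, which is the difference the question is about.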