Logging deprecations

2020-09-10 Thread David Smiley
Wouldn't it be nice if Solr had a simple utility to log a warning exactly
once, and in a disable-able way, when you use a feature that we want to
remove from Solr?  Or one that we are thinking about removing but want
user input on?  I'm thinking of adding this.  I don't think our users monitor
the lists well, and so this would be an additional way to solicit their
input.  If you know of something similar that already exists, or of a
suitable place this should go, please let me know.  I didn't look hard but
found nothing, so I'll likely write some new little class for this.  I'm
thinking of a simple static method taking a String loggerName (which doubles
as an ID of the feature) and a String warning message.
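
A rough sketch of what such a utility could look like, for discussion only. The class name DeprecationLog, the "deprecation." logger prefix, and the use of java.util.logging (chosen here only to keep the example dependency-free; Solr's own logging setup differs) are all assumptions, not existing Solr code.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Logger;

/** Hypothetical helper: logs a deprecation warning at most once per feature ID. */
final class DeprecationLog {
  // Feature IDs that have already produced a warning in this JVM.
  private static final Set<String> ALREADY_LOGGED = ConcurrentHashMap.newKeySet();

  private DeprecationLog() {}

  /**
   * Logs the message at WARNING level the first time featureId is seen.
   * Returns true if this call logged, false if it was suppressed as a repeat.
   */
  static boolean log(String featureId, String message) {
    if (!ALREADY_LOGGED.add(featureId)) {
      return false; // this feature has already been warned about
    }
    // The feature ID is part of the logger name, so a single feature's
    // warning can be silenced through ordinary logging configuration.
    Logger.getLogger("deprecation." + featureId).warning(message);
    return true;
  }
}
```

Because each feature gets its own logger name, disabling one warning would be a matter of logging configuration rather than code changes.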

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


Re: Avoiding false-positives in multivalued field search with intervals?

2020-09-10 Thread Michael Sokolov
A slightly different but related topic is how to manage lots of fields.

I agree that sub-fields are a pain and that mashing everything
together in an all-field is a mess, but for best performance with a
large number of fields/sub-fields, it is the only workable option I
can see? Expanding a query over numerous fields grows combinatorically
in the number of fields (if I want my query to match when all terms
match in *some* field), doesn't it?

I would like to see a mechanism for defining sub-fields using
positions. Together with an absolute positional query this would
enable both match-any-field as well as field-specific matching with
each token indexed only once (multi-values are possible within this
with boundary tokens or big enough position ranges, as Alan
suggested). It does mean that the sub-field boundaries have to be
managed somehow. Without index support, you can set an arbitrarily large
size for your sub-field and insert position gaps at the boundaries,
but maybe we could detect the largest sub-field at flush time and
write that metadata somewhere in the index to enable smaller gaps?
Another issue is differing analysis for the sub-fields, and properly
updating the positions during analysis: at the boundaries you don't
want to insert a gap but rather advance to a fixed position, and you have
to index sub-fields in order. Maybe we could make it less horrible by
adding better support for it.
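
To make the "advance to a fixed position" idea concrete, here is a small model of the position arithmetic in plain Java. It is only a sketch: SubFieldPositions, slotSize and the method names are hypothetical, and nothing here is Lucene API. It assumes every sub-field owns a fixed-size slot of positions, so sub-field i starts at position i * slotSize.

```java
/** Hypothetical model: computes position increments when each sub-field owns a fixed slot. */
final class SubFieldPositions {
  private final int slotSize;        // assumed upper bound on positions per sub-field
  private int lastPosition = -1;     // position of the most recently emitted token
  private int currentSubField = -1;  // sub-field of the most recently emitted token

  SubFieldPositions(int slotSize) {
    if (slotSize <= 0) {
      throw new IllegalArgumentException("slotSize must be positive");
    }
    this.slotSize = slotSize;
  }

  /** Returns the position increment to use for the next token of the given sub-field. */
  int nextIncrement(int subField) {
    if (subField < currentSubField) {
      throw new IllegalArgumentException("sub-fields must be indexed in order");
    }
    int target;
    if (subField != currentSubField) {
      target = subField * slotSize; // jump straight to the start of the slot
      currentSubField = subField;
    } else {
      target = lastPosition + 1;    // ordinary next token inside the slot
      if (target >= (subField + 1) * slotSize) {
        throw new IllegalStateException("sub-field overflows its slot");
      }
    }
    int increment = target - lastPosition;
    lastPosition = target;
    return increment;
  }

  int lastPosition() {
    return lastPosition;
  }

  /** Maps an absolute position back to the sub-field that owns it. */
  static int subFieldOf(int position, int slotSize) {
    return position / slotSize;
  }
}
```

With this layout a field-specific match is a check that positions fall inside one slot, while match-any-field ignores the slot entirely; the overflow check shows why the slot size has to be at least as large as the largest sub-field.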

Re: query parsing; wasn't there at one time an interval query parser?
It had operators like w() and n() IIRC


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Avoiding false-positives in multivalued field search with intervals?

2020-09-10 Thread Dawid Weiss
> Ok so the more general question is whether we need an interval query parser

Oh, to this I'd say: yes, yes, yes.

I didn't have much prior experience writing frontend apps on top of
Solr/Lucene, but once I had to go that route it quickly turned out that
several things that are readily available at the code level are so darn
difficult to achieve and integrate from the outside. Specifically:

- Field expansion in query parsers is a must (so that unqualified
terms are expanded over multiple fields). Any query parser that doesn't
support this is in my opinion of zero use. The "default" copy-to sink
field known from Solr brings more problems than it solves.

- Exact match-region hit highlighting is a strong expectation. I solved
this with matches API (see LUCENE-9461) and flexible query parser's
multifield expansion. Works like a charm.

- Multivalued fields are common and sub-document handling is a pain.
The problem I raised here is a result of direct user feedback. In real
life multivalued fields are omnipresent and searches over those fields
can be complex. Users see hits that just should not be there and are
confused.

- People do use complex queries. Maybe not all people but there are
people out there who do... Just recently I extended flexible query
parser with a handcrafted min-should-match operator because it is
otherwise not accessible in any Lucene query parser (!). I can make
this code available (it's not terribly complex), although, since you
asked, I think a query parser that exposes all sorts of "higher level"
functionality of intervals would be very, very useful.

It may end up that I'll have to write something for intervals anyway
so we can work on this together if you like.
The syntax especially is an open question - should it be
operator-based (like the current boost or fuzzy operators) or
meta-function-based (so that pseudo-functions would be available)? Or
maybe a mix of both? I don't know, really. :)

Dawid




Re: Notification of analysis on publicly available project data

2020-09-10 Thread Ishan Chattopadhyaya
Is there any PMC action or support/cooperation needed here from our part?



Re: Avoiding false-positives in multivalued field search with intervals?

2020-09-10 Thread jim ferenczi
OK, so the more general question is whether we need an interval query parser.



Notification of analysis on publicly available project data

2020-09-10 Thread Griselda Cuevas
Dear PMC,


I’m contacting you because your project has been selected by the ASF D&I
committee which is leading a research project to evaluate and understand
the current state of diversity in our community [1]. As part of this
research, we will analyze publicly available data about your project such
as Git logs, Jira boards and mailing lists, to better understand the state
of diversity in Apache projects and to complement the findings we obtained
from the Community Survey that was run this year [2].


This analysis will be performed by Bitergia [3], a vendor specializing in
researching open source projects and foundations. The results will be
published in a report similar to the OpenStack Foundation Analysis
published in 2018 [4].


The analysis will be done only on aggregated data at the project level during
and after processing, ensuring we do not report anything that could
identify a single individual. The data we analyze will be deleted right
after the research is done and won’t be retained by either the researcher
or the ASF.


If you have any concerns or questions, please raise them to the diversity
committee (d...@diversity.apache.org) and/or to the data privacy committee
(priv...@apache.org).


Regards,

Griselda Cuevas

V.P. of Diversity and Inclusion

Apache Software Foundation


[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=127405614

[2] https://youtu.be/4Mr1CRtKqUI

[3] https://bitergia.com/bitergia-analytics/

[4] https://superuser.openstack.org/articles/2018-gender-diversity-report/


Re: Avoiding false-positives in multivalued field search with intervals?

2020-09-10 Thread Dawid Weiss
I am fine with the boundary token suggestion, actually. What I don't
see at the moment is how I can marry it with the output of a general
query parser (which returns any Query). I could attempt to
process the query node tree from the standard query parser (which we're
using at the moment anyway), but if the tree becomes complex there is
no guarantee I can extract subtrees that can be parsed into
IntervalsSource instances (and then in turn into an IntervalQuery).

Dawid





Re: Avoiding false-positives in multivalued field search with intervals?

2020-09-10 Thread jim ferenczi
Right, I misunderstood Alan's answer. The boundary option is not "impure"
in my opinion. It solves this issue nicely but maybe it needs something
more packaged to add the boundaries and build queries easily.



Re: Avoiding false-positives in multivalued field search with intervals?

2020-09-10 Thread Dawid Weiss
Yup - similar to what Alan suggested. I'd have to rewrite the (general
text-to-query) query parser to only use intervals though. Still
thinking about possible approaches to this.

D.





Re: Avoiding false-positives in multivalued field search with intervals?

2020-09-10 Thread jim ferenczi
You could set a very high position increment gap for multi-valued fields
(Analyzer#getPositionIncrementGap) and perform something
like Intervals.maxWidth(Intervals.unordered(...), pos_gap-1) ?
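
To illustrate why the gap plus width filter works, here is a plain-Java model of the position arithmetic. It is not Lucene code: GapWidthModel and its methods are hypothetical, tokens are split on whitespace, query terms are assumed to be distinct, and the gap is modelled as being added to the position between consecutive values, which is how I understand the position increment gap to behave.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of "large position gap + max width" over a multivalued field. */
final class GapWidthModel {
  private GapWidthModel() {}

  /** Positions (ascending) at which term occurs, with gap added between values. */
  static List<Integer> positionsOf(String[] values, int gap, String term) {
    List<Integer> result = new ArrayList<>();
    int position = -1;
    for (int v = 0; v < values.length; v++) {
      if (v > 0) {
        position += gap; // modelled position increment gap between values
      }
      for (String token : values[v].trim().split("\\s+")) {
        position++;
        if (token.equals(term)) {
          result.add(position);
        }
      }
    }
    return result;
  }

  /** True if all terms occur, in any order, within an interval of width at most gap - 1. */
  static boolean matches(String[] values, int gap, String... terms) {
    int maxWidth = gap - 1;
    List<List<Integer>> perTerm = new ArrayList<>();
    for (String term : terms) {
      perTerm.add(positionsOf(values, gap, term));
    }
    // Try every term occurrence as the start of a candidate interval.
    for (List<Integer> starts : perTerm) {
      for (int start : starts) {
        int end = start;
        boolean complete = true;
        for (List<Integer> positions : perTerm) {
          int nearest = -1;
          for (int p : positions) {
            if (p >= start) {
              nearest = p; // first occurrence at or after start
              break;
            }
          }
          if (nearest < 0) {
            complete = false;
            break;
          }
          end = Math.max(end, nearest);
        }
        if (complete && end - start + 1 <= maxWidth) {
          return true;
        }
      }
    }
    return false;
  }
}
```

In this model an interval that crosses a value boundary is always wider than the gap, so it is filtered out. The flip side is that a single value holding more tokens than the width limit can produce false negatives, so the gap has to be chosen larger than any plausible value length.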




Re: Avoiding false-positives in multivalued field search with intervals?

2020-09-10 Thread Dawid Weiss
Yeah... I was thinking about adding synthetic boundaries but this
seems... impure. :) Another quick reflection is that I'd have to
somehow translate the original query (which can be arbitrarily
complex) into an interval query. Tough.

D.





Re: Avoiding false-positives in multivalued field search with intervals?

2020-09-10 Thread Alan Woodward
I’ve solved this sort of thing in the past by indexing boundary tokens, and 
wrapping the queries with the equivalent of Intervals.notContaining(query, 
boundary-query); you could also put a very large position increment gap and use 
a width filter, but that’s a bit more error prone if you could conceivably have 
lots of text in the individual field entries.
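
The boundary-token idea can be modelled in a few lines of plain Java. This is an illustration of the logic only, not Lucene code: BoundaryTokenModel and the "#boundary#" marker are hypothetical, tokens are split on whitespace, query terms are assumed to be distinct, and the marker is assumed never to occur in real text. The inner loop plays the role of the notContaining filter by abandoning any candidate interval as soon as it would include a boundary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of boundary tokens between the values of a multivalued field. */
final class BoundaryTokenModel {
  // Assumed marker token; it must never appear in real field text.
  static final String BOUNDARY = "#boundary#";

  private BoundaryTokenModel() {}

  /** Flattens the values into one token stream with a boundary token between values. */
  static List<String> tokenStream(String[] values) {
    List<String> tokens = new ArrayList<>();
    for (int v = 0; v < values.length; v++) {
      if (v > 0) {
        tokens.add(BOUNDARY);
      }
      tokens.addAll(Arrays.asList(values[v].trim().split("\\s+")));
    }
    return tokens;
  }

  /** True if all terms occur in some interval that contains no boundary token. */
  static boolean matches(String[] values, String... terms) {
    List<String> tokens = tokenStream(values);
    for (int start = 0; start < tokens.size(); start++) {
      List<String> missing = new ArrayList<>(Arrays.asList(terms));
      for (int end = start; end < tokens.size(); end++) {
        String token = tokens.get(end);
        if (token.equals(BOUNDARY)) {
          break; // extending further would put a boundary inside the interval
        }
        missing.remove(token);
        if (missing.isEmpty()) {
          return true;
        }
      }
    }
    return false;
  }
}
```

Unlike a width filter, this check does not depend on how long an individual value is, which matches the caveat above about values with lots of text.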






Avoiding false-positives in multivalued field search with intervals?

2020-09-10 Thread Dawid Weiss
Hi Alan,

You're the expert here so I thought I'd ask before I jump in deep. Do
you think it's feasible to solve the following multivalued-field
problem:

doc: field=["foo", "bar"]
query: field:(foo AND bar)

I'd like the above to return zero hits (no single value contains both
foo and bar), but since multi-valued fields are logically indexed as a
single field, it returns the doc. I recognize this as a well-known problem,
but subdocuments are not fun to deal with so I'd like to avoid them at
all costs.

Would it be possible to solve the above with intervals? Say, something
like this:

Intervals.containing(valuePositionRanges(), query).

I assume the containment relationship would get rid of false-positives
crossing value boundary here. The problem is in how to construct those
value position ranges... Store them at index-construction time
somehow? Compute them on the fly for anything that has a chance to
match query? Your thoughts would be much appreciated.

Dawid
