Re: Multi-valued xxValue / xxValueSource implementations?

2021-10-26 Thread Robert Muir
On Tue, Oct 26, 2021 at 8:01 PM Robert Muir  wrote:
>
> Hi Greg, I think the general issue is one of the API, the ValueSource
> seems really geared at returning values from single-valued fields.

I think this really is the core issue. ValueSource was created before
the days of docvalues, and in a lot of cases it will do inefficient
things depending on how you hold it.

I feel that things like the facets APIs should really try to move to
lower-level APIs (DoubleValuesSource, SortedSetDocValues, etc.).

Turn the problem around from a push to a pull: now if you want to
give "computed field" or similar inputs to faceting (e.g. some kind of
filtering-on-the-fly), you have the chance to implement it
efficiently.
The expressions module already switched away from ValueSource to
DoubleValues/DoubleValuesSource, though I didn't follow the specific
reasons why.
Maybe similar approaches apply to all the numerics.
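
To make the push-versus-pull idea concrete, here is a minimal sketch in plain Java of what a pull-style, multi-valued per-document source could look like. MultiDoubleValues, fromArrays, map and countAtLeast are hypothetical names invented for this illustration and are not Lucene APIs; the shape only loosely follows the "position on a document, then read its values" pattern of doc values iterators.

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical pull-style view over zero or more double values per document. */
interface MultiDoubleValues {
  /** Positions on {@code doc}; returns false if the document has no values. */
  boolean advanceExact(int doc);

  /** Number of values for the current document. */
  int valueCount();

  /** Next value of the current document; call at most valueCount() times. */
  double nextValue();
}

final class MultiDoubleValuesSketch {
  /** Backs the view with an in-memory array, one row per document (assumption). */
  static MultiDoubleValues fromArrays(double[][] perDoc) {
    return new MultiDoubleValues() {
      private double[] current = new double[0];
      private int upto;

      @Override public boolean advanceExact(int doc) {
        current = perDoc[doc];
        upto = 0;
        return current.length > 0;
      }
      @Override public int valueCount() { return current.length; }
      @Override public double nextValue() { return current[upto++]; }
    };
  }

  /** A "computed field": wraps another source and transforms each value lazily. */
  static MultiDoubleValues map(MultiDoubleValues in, DoubleUnaryOperator fn) {
    return new MultiDoubleValues() {
      @Override public boolean advanceExact(int doc) { return in.advanceExact(doc); }
      @Override public int valueCount() { return in.valueCount(); }
      @Override public double nextValue() { return fn.applyAsDouble(in.nextValue()); }
    };
  }

  /** Example consumer: counts values >= threshold, pulling only what it needs. */
  static int countAtLeast(MultiDoubleValues values, int maxDoc, double threshold) {
    int count = 0;
    for (int doc = 0; doc < maxDoc; doc++) {
      if (values.advanceExact(doc)) {
        for (int i = 0, n = values.valueCount(); i < n; i++) {
          if (values.nextValue() >= threshold) count++;
        }
      }
    }
    return count;
  }
}
```

Because the consumer pulls values, a filter or transform is just another wrapper around the source, and nothing is computed for documents the consumer never visits.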

As far as the strings, personally, I'm not sure what a ValueSource API
that "filters/transforms" terms should look like. Seems slow no matter
how you do it. But maybe fresh ideas are needed.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Multi-valued xxValue / xxValueSource implementations?

2021-10-26 Thread Robert Muir
A little history may help...

(this is based on my bad memory, so it could all be wrong, nobody get offended):

At the time, lucene could only sort single valued fields. But solr and
elasticsearch would happily sort on multi-valued docs in various hacky
ways. And this typically entailed large amounts of memory to do it.
IMO, it was important to get docvalues working for most use-cases, but
this "sorting on multi-valued field" was a tricky one, because to me
it is MATHEMATICAL NONSENSE.

But it seemed nobody really cared about how the sorting worked (again
it is MATHEMATICALLY INSANE anyway), rather just, that users didn't
have to confess if their fields were single-valued or multi-valued. So
they did stuff like substitute min value for a forward sort, or max
value for a reverse sort. These selectors allow you to implement such
a sort if you want. Hopefully MIN is the default and common case, and
you only need MAX in the rare case someone clicks an arrow to reverse
the sort, as it requires consuming all the ordinals for each doc :)
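
The cost difference between the two selectors can be sketched in plain Java. This is an illustration only: CountingOrds is a hypothetical stand-in for a forward-only, ascending ordinal iterator over one document, not a Lucene class, and it assumes each document has at least one ordinal.

```java
import java.util.PrimitiveIterator;
import java.util.stream.LongStream;

final class SelectorSketch {
  /** Forward-only ordinal iterator that counts reads, to make the cost visible. */
  static final class CountingOrds {
    private final PrimitiveIterator.OfLong ords;
    int reads;

    CountingOrds(long... sortedOrds) {
      this.ords = LongStream.of(sortedOrds).iterator();
    }
    boolean hasNext() { return ords.hasNext(); }
    long next() { reads++; return ords.nextLong(); }
  }

  /** MIN: ordinals arrive in ascending order, so the first one is the answer. */
  static long min(CountingOrds ords) {
    return ords.next();
  }

  /** MAX: with a forward-only iterator, every ordinal is read to reach the last. */
  static long max(CountingOrds ords) {
    long last = ords.next();
    while (ords.hasNext()) {
      last = ords.next();
    }
    return last;
  }
}
```

For a document with ordinals 2, 5 and 9, min stops after one read while max performs three.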

On Tue, Oct 26, 2021 at 8:01 PM Robert Muir  wrote:
>
> Hi Greg, I think the general issue is one of the API, the ValueSource
> seems really geared at returning values from single-valued fields.
>
> IMO, for the way the API is used (e.g. sorting), it makes sense to
> define a selector that works in O(1) time per-document, and use these
> existing valuesources:
>
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MultiValuedIntFieldSource.java
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MultiValuedLongFieldSource.java
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MultiValuedFloatFieldSource.java
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MultiValuedDoubleFieldSource.java
>
> These require that you specify a "selector" as to who will be the
> "stuckee" (designated value) for the doc:
> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/SortedNumericSelector.java
> I strongly recommend "min", as it can just read the first DV for each doc.
>
> For terms (strings), there is a similar thing:
>
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/SortedSetFieldSource.java
>
> And again, it has available selectors:
> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/SortedSetSelector.java
> I would still strongly recommend "min", to just read the first DV for each 
> doc.
>
> On Tue, Oct 26, 2021 at 7:49 PM Greg Miller  wrote:
> >
> > Hi folks-
> >
> > Out of curiosity, is there a reason Lucene doesn't have
> > implementations for concepts like DoubleValues / DoubleValuesSource
> > that support multiple values per document? Or maybe something like
> > this does exist in Lucene that I'm not aware of? I can't believe this
> > hasn't been a topic of discussion at least once, but I couldn't turn
> > up a past Jira issue.
> >
> > I ask because most of the faceting implementations in Lucene allow the
> > user to provide their own xxValuesSource to use instead of assuming
> > the data is in an indexed field, but there's an inherent limitation
> > here forcing documents to have a single value. The faceting
> > implementations have all been updated to operate correctly for
> > multi-valued documents when referencing an indexed field, but there's
> > a bit of a gap here if the user wants to supply their own source.
> >
> > Many thanks!
> >
> > Cheers,
> > -Greg
> >
> >




Re: Multi-valued xxValue / xxValueSource implementations?

2021-10-26 Thread Robert Muir
Hi Greg, I think the general issue is one of the API, the ValueSource
seems really geared at returning values from single-valued fields.

IMO, for the way the API is used (e.g. sorting), it makes sense to
define a selector that works in O(1) time per-document, and use these
existing valuesources:

https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MultiValuedIntFieldSource.java
https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MultiValuedLongFieldSource.java
https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MultiValuedFloatFieldSource.java
https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MultiValuedDoubleFieldSource.java

These require that you specify a "selector" as to who will be the
"stuckee" (designated value) for the doc:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/SortedNumericSelector.java
I strongly recommend "min", as it can just read the first DV for each doc.

For terms (strings), there is a similar thing:

https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/SortedSetFieldSource.java

And again, it has available selectors:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/SortedSetSelector.java
I would still strongly recommend "min", to just read the first DV for each doc.

On Tue, Oct 26, 2021 at 7:49 PM Greg Miller  wrote:
>
> Hi folks-
>
> Out of curiosity, is there a reason Lucene doesn't have
> implementations for concepts like DoubleValues / DoubleValuesSource
> that support multiple values per document? Or maybe something like
> this does exist in Lucene that I'm not aware of? I can't believe this
> hasn't been a topic of discussion at least once, but I couldn't turn
> up a past Jira issue.
>
> I ask because most of the faceting implementations in Lucene allow the
> user to provide their own xxValuesSource to use instead of assuming
> the data is in an indexed field, but there's an inherent limitation
> here forcing documents to have a single value. The faceting
> implementations have all been updated to operate correctly for
> multi-valued documents when referencing an indexed field, but there's
> a bit of a gap here if the user wants to supply their own source.
>
> Many thanks!
>
> Cheers,
> -Greg
>
>




Multi-valued xxValue / xxValueSource implementations?

2021-10-26 Thread Greg Miller
Hi folks-

Out of curiosity, is there a reason Lucene doesn't have
implementations for concepts like DoubleValues / DoubleValuesSource
that support multiple values per document? Or maybe something like
this does exist in Lucene that I'm not aware of? I can't believe this
hasn't been a topic of discussion at least once, but I couldn't turn
up a past Jira issue.

I ask because most of the faceting implementations in Lucene allow the
user to provide their own xxValuesSource to use instead of assuming
the data is in an indexed field, but there's an inherent limitation
here forcing documents to have a single value. The faceting
implementations have all been updated to operate correctly for
multi-valued documents when referencing an indexed field, but there's
a bit of a gap here if the user wants to supply their own source.

Many thanks!

Cheers,
-Greg




Re: Slow DV equivalent of TermInSetQuery

2021-10-26 Thread Robert Muir
Well if, as I suggest, we use MultiTermQuery + DocValuesRewriteMethod
to implement this, then the choice is yours. Just run it against a
"slow IndexReader" and go through the ordinal map if you choose. There's
nothing stopping you from doing that, and it will do what you want
already.

I just personally don't recommend it for this case. As the number of
documents increases, the ordinal map indirection probably costs more
than the construction cost is worth. Better tradeoff to simply work
per-segment with no indirection. The number of lookupOrds is bounded
in a simple way, unlike faceting, where I would recommend the ordinal
map.


On Tue, Oct 26, 2021 at 6:10 PM Joel Bernstein  wrote:
>
> There are times, particularly in ecommerce and access control, where speed 
> really matters. So, you build stuff that's really fast at query time, with a 
> tradeoff at commit time.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Tue, Oct 26, 2021 at 5:31 PM Robert Muir  wrote:
>>
>> Sorry, I don't think there is a need to use any top-level ordinals.
>> none of these docvalues-based query implementations need it.
>>
>> As far as query intersecting an input-stream, that is a big no-go.
>> Lucene Queries need to have correct hashcode/equals/etc.
>>
>> That's why current stuff around this such as TermInSetQuery encode
>> everything into a PrefixCodedTerms.
>>
>> On Tue, Oct 26, 2021 at 4:57 PM Joel Bernstein  wrote:
>> >
>> > One more wrinkle for extremely large lists, is pass the list in as an 
>> > InputStream which is a presorted binary representation of the ASIN's and 
>> > slide a BytesRef across the stream and merge it with the SortedDocValues. 
>> > This saves on all the object creation and String overhead for really long 
>> > lists of id's.
>> >
>> > Joel Bernstein
>> > http://joelsolr.blogspot.com/
>> >
>> >
>> > On Tue, Oct 26, 2021 at 4:50 PM Joel Bernstein  wrote:
>> >>
>> >> If the list of ASIN's is presorted you can quickly merge it with the 
>> >> SortedDocValues and produce a FixedBitSet of the top level ordinals, 
>> >> which can be used as the post filter. This is a nice approach for things 
>> >> like passing in a long list of access control predicates.
>> >>
>> >>
>> >> Joel Bernstein
>> >> http://joelsolr.blogspot.com/
>> >>
>> >>
>> >> On Tue, Oct 26, 2021 at 3:52 PM Adrien Grand  wrote:
>> >>>
>> >>> I opened https://issues.apache.org/jira/browse/LUCENE-10207 about these 
>> >>> ideas.
>> >>>
>> >>> On Tue, Oct 26, 2021 at 7:52 PM Robert Muir  wrote:
>> 
>>  On Tue, Oct 26, 2021 at 1:37 PM Adrien Grand  wrote:
>>  >
>>  > > And then we could make an IndexOrDocValuesQuery with both the 
>>  > > TermInSetQuery and this SDV.newSlowInSetQuery?
>>  >
>>  > Unfortunately IndexOrDocValuesQuery relies on the fact that the 
>>  > "index" query can evaluate its cost (ScorerSupplier#cost) without 
>>  > doing anything costly, which isn't the case for TermInSetQuery.
>>  >
>>  > So we'd need to make some changes. Estimating the cost of a 
>>  > TermInSetQuery in general without seeking the terms is a hard 
>>  > problem, but maybe we could specialize the unique key case to return 
>>  > the number of terms as the cost?
>> 
>>  Yes we know each term in terms dict only has a single document, when
>>  terms.size() == terms.getSumDocFreq(): there's only one posting for
>>  each term.
>>  But we can probably generalize a cost estimation a bit more, just
>>  based on these two stats?
>> 
>> 
>> >>>
>> >>>
>> >>> --
>> >>> Adrien
>>
>>




Re: Slow DV equivalent of TermInSetQuery

2021-10-26 Thread Joel Bernstein
There are times, particularly in ecommerce and access control, where speed
really matters. So, you build stuff that's really fast at query time, with
a tradeoff at commit time.


Joel Bernstein
http://joelsolr.blogspot.com/


On Tue, Oct 26, 2021 at 5:31 PM Robert Muir  wrote:

> Sorry, I don't think there is a need to use any top-level ordinals.
> none of these docvalues-based query implementations need it.
>
> As far as query intersecting an input-stream, that is a big no-go.
> Lucene Queries need to have correct hashcode/equals/etc.
>
> That's why current stuff around this such as TermInSetQuery encode
> everything into a PrefixCodedTerms.
>
> On Tue, Oct 26, 2021 at 4:57 PM Joel Bernstein  wrote:
> >
> > One more wrinkle for extremely large lists, is pass the list in as an
> InputStream which is a presorted binary representation of the ASIN's and
> slide a BytesRef across the stream and merge it with the SortedDocValues.
> This saves on all the object creation and String overhead for really long
> lists of id's.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> >
> > On Tue, Oct 26, 2021 at 4:50 PM Joel Bernstein 
> wrote:
> >>
> >> If the list of ASIN's is presorted you can quickly merge it with the
> SortedDocValues and produce a FixedBitSet of the top level ordinals, which
> can be used as the post filter. This is a nice approach for things like
> passing in a long list of access control predicates.
> >>
> >>
> >> Joel Bernstein
> >> http://joelsolr.blogspot.com/
> >>
> >>
> >> On Tue, Oct 26, 2021 at 3:52 PM Adrien Grand  wrote:
> >>>
> >>> I opened https://issues.apache.org/jira/browse/LUCENE-10207 about
> these ideas.
> >>>
> >>> On Tue, Oct 26, 2021 at 7:52 PM Robert Muir  wrote:
> 
>  On Tue, Oct 26, 2021 at 1:37 PM Adrien Grand 
> wrote:
>  >
>  > > And then we could make an IndexOrDocValuesQuery with both the
> TermInSetQuery and this SDV.newSlowInSetQuery?
>  >
>  > Unfortunately IndexOrDocValuesQuery relies on the fact that the
> "index" query can evaluate its cost (ScorerSupplier#cost) without doing
> anything costly, which isn't the case for TermInSetQuery.
>  >
>  > So we'd need to make some changes. Estimating the cost of a
> TermInSetQuery in general without seeking the terms is a hard problem, but
> maybe we could specialize the unique key case to return the number of terms
> as the cost?
> 
>  Yes we know each term in terms dict only has a single document, when
>  terms.size() == terms.getSumDocFreq(): there's only one posting for
>  each term.
>  But we can probably generalize a cost estimation a bit more, just
>  based on these two stats?
> 
> 
> >>>
> >>>
> >>> --
> >>> Adrien
>
>
>


Re: Slow DV equivalent of TermInSetQuery

2021-10-26 Thread Robert Muir
Sorry, I don't think there is a need to use any top-level ordinals.
None of these docvalues-based query implementations need them.

As far as a query intersecting an input stream goes, that is a big no-go.
Lucene Queries need to have correct hashCode/equals/etc.

That's why current code around this, such as TermInSetQuery, encodes
everything into a PrefixCodedTerms.
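
As a rough illustration of why a one-shot stream does not fit, here is a sketch in plain Java of a query-like key that materializes its terms up front so that equals and hashCode are well defined. TermSetKey is a hypothetical class invented for this example; it is not how TermInSetQuery or PrefixCodedTerms are implemented, only the general idea of holding the terms in a stable, comparable form.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.TreeSet;

/** Hypothetical stand-in for a set-based query; not a Lucene class. */
final class TermSetKey {
  private final String field;
  private final String[] sortedTerms; // materialized, sorted, de-duplicated

  TermSetKey(String field, Collection<String> terms) {
    this.field = field;
    // Sorting and de-duplicating gives one canonical form per logical set.
    this.sortedTerms = new TreeSet<>(terms).toArray(new String[0]);
  }

  @Override public boolean equals(Object other) {
    if (this == other) return true;
    if (!(other instanceof TermSetKey)) return false;
    TermSetKey that = (TermSetKey) other;
    return field.equals(that.field) && Arrays.equals(sortedTerms, that.sortedTerms);
  }

  @Override public int hashCode() {
    return 31 * field.hashCode() + Arrays.hashCode(sortedTerms);
  }
}
```

Two keys built from the same terms in a different order compare equal, which is what lets a query act as a cache key. A stream that can only be consumed once offers nothing to compare or hash.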

On Tue, Oct 26, 2021 at 4:57 PM Joel Bernstein  wrote:
>
> One more wrinkle for extremely large lists, is pass the list in as an 
> InputStream which is a presorted binary representation of the ASIN's and 
> slide a BytesRef across the stream and merge it with the SortedDocValues. 
> This saves on all the object creation and String overhead for really long 
> lists of id's.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Tue, Oct 26, 2021 at 4:50 PM Joel Bernstein  wrote:
>>
>> If the list of ASIN's is presorted you can quickly merge it with the 
>> SortedDocValues and produce a FixedBitSet of the top level ordinals, which 
>> can be used as the post filter. This is a nice approach for things like 
>> passing in a long list of access control predicates.
>>
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>>
>> On Tue, Oct 26, 2021 at 3:52 PM Adrien Grand  wrote:
>>>
>>> I opened https://issues.apache.org/jira/browse/LUCENE-10207 about these 
>>> ideas.
>>>
>>> On Tue, Oct 26, 2021 at 7:52 PM Robert Muir  wrote:

 On Tue, Oct 26, 2021 at 1:37 PM Adrien Grand  wrote:
 >
 > > And then we could make an IndexOrDocValuesQuery with both the 
 > > TermInSetQuery and this SDV.newSlowInSetQuery?
 >
 > Unfortunately IndexOrDocValuesQuery relies on the fact that the "index" 
 > query can evaluate its cost (ScorerSupplier#cost) without doing anything 
 > costly, which isn't the case for TermInSetQuery.
 >
 > So we'd need to make some changes. Estimating the cost of a 
 > TermInSetQuery in general without seeking the terms is a hard problem, 
 > but maybe we could specialize the unique key case to return the number 
 > of terms as the cost?

 Yes we know each term in terms dict only has a single document, when
 terms.size() == terms.getSumDocFreq(): there's only one posting for
 each term.
 But we can probably generalize a cost estimation a bit more, just
 based on these two stats?


>>>
>>>
>>> --
>>> Adrien




Re: Slow DV equivalent of TermInSetQuery

2021-10-26 Thread Joel Bernstein
One more wrinkle for extremely large lists is to pass the list in as an
InputStream holding a presorted binary representation of the ASINs, then
slide a BytesRef across the stream and merge it with the SortedDocValues.
This saves all the object creation and String overhead for really long
lists of IDs.

Joel Bernstein
http://joelsolr.blogspot.com/


On Tue, Oct 26, 2021 at 4:50 PM Joel Bernstein  wrote:

> If the list of ASIN's is presorted you can quickly merge it with the
> SortedDocValues and produce a FixedBitSet of the top level ordinals, which
> can be used as the post filter. This is a nice approach for things like
> passing in a long list of access control predicates.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Tue, Oct 26, 2021 at 3:52 PM Adrien Grand  wrote:
>
>> I opened https://issues.apache.org/jira/browse/LUCENE-10207 about these
>> ideas.
>>
>> On Tue, Oct 26, 2021 at 7:52 PM Robert Muir  wrote:
>>
>>> On Tue, Oct 26, 2021 at 1:37 PM Adrien Grand  wrote:
>>> >
>>> > > And then we could make an IndexOrDocValuesQuery with both the
>>> TermInSetQuery and this SDV.newSlowInSetQuery?
>>> >
>>> > Unfortunately IndexOrDocValuesQuery relies on the fact that the
>>> "index" query can evaluate its cost (ScorerSupplier#cost) without doing
>>> anything costly, which isn't the case for TermInSetQuery.
>>> >
>>> > So we'd need to make some changes. Estimating the cost of a
>>> TermInSetQuery in general without seeking the terms is a hard problem, but
>>> maybe we could specialize the unique key case to return the number of terms
>>> as the cost?
>>>
>>> Yes we know each term in terms dict only has a single document, when
>>> terms.size() == terms.getSumDocFreq(): there's only one posting for
>>> each term.
>>> But we can probably generalize a cost estimation a bit more, just
>>> based on these two stats?
>>>
>>>
>>>
>>
>> --
>> Adrien
>>
>


Re: Slow DV equivalent of TermInSetQuery

2021-10-26 Thread Joel Bernstein
If the list of ASIN's is presorted you can quickly merge it with the
SortedDocValues and produce a FixedBitSet of the top level ordinals, which
can be used as the post filter. This is a nice approach for things like
passing in a long list of access control predicates.
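
The merge described above can be sketched in plain Java as a two-pointer walk over two sorted sequences. This is a simplified illustration under stated assumptions: the String array `dictionary` stands in for the sorted doc-values terms (index = ordinal), java.util.BitSet stands in for FixedBitSet, and String.compareTo is used for ordering, whereas Lucene compares terms as bytes. SortedMergeSketch and matchingOrds are hypothetical names.

```java
import java.util.BitSet;

final class SortedMergeSketch {
  /**
   * Two-pointer merge. Both arrays must be sorted ascending in the same order;
   * dictionary[ord] plays the role of the doc-values term for ordinal ord.
   */
  static BitSet matchingOrds(String[] dictionary, String[] sortedIds) {
    BitSet ords = new BitSet(dictionary.length);
    int ord = 0;
    int i = 0;
    while (ord < dictionary.length && i < sortedIds.length) {
      int cmp = dictionary[ord].compareTo(sortedIds[i]);
      if (cmp == 0) {
        ords.set(ord); // id is present in the dictionary: remember its ordinal
        ord++;
        i++;
      } else if (cmp < 0) {
        ord++;         // dictionary term is smaller: advance the dictionary
      } else {
        i++;           // id is not in the dictionary: skip it
      }
    }
    return ords;
  }
}
```

The walk is linear in the combined length of the two inputs; a real implementation would likely seek within the term dictionary rather than step through it one term at a time.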


Joel Bernstein
http://joelsolr.blogspot.com/


On Tue, Oct 26, 2021 at 3:52 PM Adrien Grand  wrote:

> I opened https://issues.apache.org/jira/browse/LUCENE-10207 about these
> ideas.
>
> On Tue, Oct 26, 2021 at 7:52 PM Robert Muir  wrote:
>
>> On Tue, Oct 26, 2021 at 1:37 PM Adrien Grand  wrote:
>> >
>> > > And then we could make an IndexOrDocValuesQuery with both the
>> TermInSetQuery and this SDV.newSlowInSetQuery?
>> >
>> > Unfortunately IndexOrDocValuesQuery relies on the fact that the "index"
>> query can evaluate its cost (ScorerSupplier#cost) without doing anything
>> costly, which isn't the case for TermInSetQuery.
>> >
>> > So we'd need to make some changes. Estimating the cost of a
>> TermInSetQuery in general without seeking the terms is a hard problem, but
>> maybe we could specialize the unique key case to return the number of terms
>> as the cost?
>>
>> Yes we know each term in terms dict only has a single document, when
>> terms.size() == terms.getSumDocFreq(): there's only one posting for
>> each term.
>> But we can probably generalize a cost estimation a bit more, just
>> based on these two stats?
>>
>>
>>
>
> --
> Adrien
>


Re: Slow DV equivalent of TermInSetQuery

2021-10-26 Thread Adrien Grand
I opened https://issues.apache.org/jira/browse/LUCENE-10207 about these
ideas.

On Tue, Oct 26, 2021 at 7:52 PM Robert Muir  wrote:

> On Tue, Oct 26, 2021 at 1:37 PM Adrien Grand  wrote:
> >
> > > And then we could make an IndexOrDocValuesQuery with both the
> TermInSetQuery and this SDV.newSlowInSetQuery?
> >
> > Unfortunately IndexOrDocValuesQuery relies on the fact that the "index"
> query can evaluate its cost (ScorerSupplier#cost) without doing anything
> costly, which isn't the case for TermInSetQuery.
> >
> > So we'd need to make some changes. Estimating the cost of a
> TermInSetQuery in general without seeking the terms is a hard problem, but
> maybe we could specialize the unique key case to return the number of terms
> as the cost?
>
> Yes we know each term in terms dict only has a single document, when
> terms.size() == terms.getSumDocFreq(): there's only one posting for
> each term.
> But we can probably generalize a cost estimation a bit more, just
> based on these two stats?
>
>
>

-- 
Adrien


Re: Slow DV equivalent of TermInSetQuery

2021-10-26 Thread Robert Muir
On Tue, Oct 26, 2021 at 1:37 PM Adrien Grand  wrote:
>
> > And then we could make an IndexOrDocValuesQuery with both the 
> > TermInSetQuery and this SDV.newSlowInSetQuery?
>
> Unfortunately IndexOrDocValuesQuery relies on the fact that the "index" query 
> can evaluate its cost (ScorerSupplier#cost) without doing anything costly, 
> which isn't the case for TermInSetQuery.
>
> So we'd need to make some changes. Estimating the cost of a TermInSetQuery in 
> general without seeking the terms is a hard problem, but maybe we could 
> specialize the unique key case to return the number of terms as the cost?

Yes we know each term in terms dict only has a single document, when
terms.size() == terms.getSumDocFreq(): there's only one posting for
each term.
But we can probably generalize a cost estimation a bit more, just
based on these two stats?
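
One way such an estimate might look, sketched in plain Java. The unique-key branch follows the observation above; the general branch (each query term matches the average number of postings per term, capped at the number of documents with the field) is an assumed heuristic for illustration, not something the thread settled on. TermSetCostSketch and estimateCost are hypothetical names, and overflow for very large inputs is ignored.

```java
final class TermSetCostSketch {
  /**
   * @param termCount  number of unique terms in the field (terms.size())
   * @param sumDocFreq total number of postings in the field (terms.getSumDocFreq())
   * @param docCount   number of documents with at least one term in the field
   * @param queryTerms number of terms in the set query
   */
  static long estimateCost(long termCount, long sumDocFreq, long docCount, long queryTerms) {
    if (queryTerms <= 0) {
      return 0;
    }
    if (termCount <= 0) {
      // Term count unavailable: fall back to the upper bound.
      return docCount;
    }
    if (termCount == sumDocFreq) {
      // Unique-key case: one posting per term, so at most one doc per query term.
      return Math.min(queryTerms, docCount);
    }
    // Assumed heuristic: each query term matches the average number of postings.
    long avgDocFreq = (sumDocFreq + termCount - 1) / termCount; // ceiling division
    return Math.min(queryTerms * avgDocFreq, docCount);
  }
}
```

Both branches need only field-level statistics, so no term has to be looked up before the cost is known.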




Re: [External] : RE: Thank you! JDK 18 Early Access build 20 is now available

2021-10-26 Thread Rory O'Donnell

Many thanks Uwe, look forward to having a beer with you !

Rgds,Rory

On 26/10/2021 17:50, Uwe Schindler wrote:


Hallo Rory,

huh, that’s good for you and bad for us ! I was wishing to see you 
one more time on FOSDEM, but looks like this will not happen (not only 
because of COVID). Maybe we can still have a beer together in Ireland! 
Unfortunately, I was not able to visit Ireland up to now, so I have a 
new target!


Thanks for taking care of the open source projects, this was a great 
success!


Have a good time with your family!

Uwe

-

Uwe Schindler

uschind...@apache.org

ASF Member, Member of PMC and Committer of Apache Lucene and Apache Solr

Bremen, Germany

https://lucene.apache.org/ 



https://solr.apache.org/ 



*From:*Rory O'Donnell 
*Sent:* Tuesday, October 26, 2021 2:56 PM
*To:* Uwe Schindler 
*Cc:* rory.odonn...@oracle.com; David Delabassee; Deepak Nenmelithara
Damodaran; Dalibor Topic; Balchandra Vaidya; Dawid Weiss;
dev@lucene.apache.org

*Subject:* Thank you! JDK 18 Early Access build 20 is now available

Hi Uwe & Dawid,

*Thank you.*

I'm retiring at the end of November 2021, it's time to spend more time 
with the family.


We started the Quality Outreach back in October 2014. We now have
170+ projects participating.

Thank you for taking the time to provide testing feedback, excellent
bugs and support throughout the last seven years.

It's been a pleasure working with you. I am delighted to say that the
program will continue with the support of the Java DevRel Team, with
David Delabassee as your contact. David has been assisting with
on-boarding new projects for the last couple of years.

All the best, Rory

*OpenJDK 18 Early Access build 20 is now available at
https://jdk.java.net/18/*


  * These early-access, open-source builds are provided under the

  o GNU General Public License, version 2, with the Classpath
Exception.

  * Release Notes are available at
https://jdk.java.net/18/release-notes



  * Features:

  o JEPs integrated to JDK 18, so far:

  + JEP 400: UTF-8 by Default 
  + JEP 408: Simple Web Server 
  + JEP 413: Code Snippets in Java API Documentation


  o JEPs targeted to JDK 18, so far

  + JEP 417: Vector API (Third Incubator)


  o JEPs proposed to target JDK 18:

  + JEP 416: Reimplement Core Reflection with Method Handles


  * Significant changes since the last availability email:

  o Build 20:

  + JDK-8275252: Migrate cacerts from JKS to password-less PKCS12
  + JDK-8275149: (ch) ReadableByteChannel returned by
Channels.newChannel(InputStream) throws
ReadOnlyBufferException
  + JDK-8266936: Add a finalization JFR event
  + JDK-8264849: Add KW and KWP support to PKCS11 provider

  o Build 19:

  + JDK-8274840: Update OS detection code to recognize Windows 11
  + JDK-8274407: (tz) Update Timezone Data to 2021c
  + JDK-8273102: Delete deprecated for removal the empty
finalize() in java.desktop module

  o Build 18:

  + JDK-8274656: Remove default_checksum and
safe_checksum_type from krb5.conf
  + JDK-8274471: Add support for RSASSA-PSS in OCSP Response
  + JDK-8274227: Remove "impl.prefix" jdk system property
usage from InetAddress
  + JDK-8274002: [win11 and winserver2022] JDK executable
installer from network drive starts with huge delay
  + JDK-8273670: Remove weak etypes from default krb5 etype list

  o Build 17:

  + JDK-8273401: Disable JarIndex Support In URLClassPath
  + JDK-8231640: (prop) Canonical property storage

  o Build 16:

  + JDK-8269039: Disable SHA-1 Signed JARs

*Topics of Interest:*

_JDK 17:_

  * Inside Java Podcast “Java 17 is Here!”

  o Part 1: https://inside.java/2021/09/14/podcast-019/


  o Part 2: 

Re: Slow DV equivalent of TermInSetQuery

2021-10-26 Thread Adrien Grand
> And then we could make an IndexOrDocValuesQuery with both the
TermInSetQuery and this SDV.newSlowInSetQuery?

Unfortunately IndexOrDocValuesQuery relies on the fact that the "index"
query can evaluate its cost (ScorerSupplier#cost) without doing anything
costly, which isn't the case for TermInSetQuery.

So we'd need to make some changes. Estimating the cost of a TermInSetQuery
in general without seeking the terms is a hard problem, but maybe we could
specialize the unique key case to return the number of terms as the cost?
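
To show why the cost has to be cheap to compute, here is a simplified sketch in plain Java of the kind of decision IndexOrDocValuesQuery makes. The names and the penalty factor are assumptions chosen for illustration and are not taken from the Lucene source; the point is only that the choice is made from estimates, before any matching work starts.

```java
final class IndexOrDocValuesSketch {
  enum Strategy { INDEX, DOC_VALUES }

  /** Illustrative penalty for per-document doc-values checks (an assumption). */
  static final long DV_PENALTY = 8;

  /**
   * @param indexCost cheap, up-front estimate of docs matched by the index query
   * @param leadCost  estimated docs the rest of the query will ask us to verify
   */
  static Strategy choose(long indexCost, long leadCost) {
    // If the other clauses are much more selective than the index query,
    // verifying their few candidates against doc values is cheaper than
    // materializing all of the index query's matches.
    return indexCost / DV_PENALTY <= leadCost ? Strategy.INDEX : Strategy.DOC_VALUES;
  }
}
```

If producing indexCost itself required seeking every term, the work this decision is meant to avoid would already have been done.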

On Tue, Oct 26, 2021 at 5:37 PM Robert Muir  wrote:

> On Tue, Oct 26, 2021 at 11:24 AM Robert Muir  wrote:
> >
> > On Tue, Oct 26, 2021 at 10:58 AM Alan Woodward 
> wrote:
> > >
> > > We have SortedSetDocValuesField.newSlowRangeQuery() which does
> something close to what you want here, I think.
> > >
> >
> > See also DocValuesRewriteMethod which might be useful, at least as a
> > start. You'd have to express the "SetQuery" as a MultiTermQuery for
> > that to work, but It would be more efficient than a disjunction of
> > slow-exact-queries.
>
> Maybe that's the issue here? If TermInSetQuery extended
> MultiTermQuery, then this would be trivial, you wouldn't have to write
> any code to use the DV ordinals instead of the terms+postings, you'd
> just call .setRewriteMethod().
>
> Could/should TermInSetQuery be refactored to extend multitermquery?
>
>
>

-- 
Adrien


RE: Thank you! JDK 18 Early Access build 20 is now available

2021-10-26 Thread Uwe Schindler
Hallo Rory,

 

huh, that’s good for you and bad for us ! I was wishing to see you one more 
time on FOSDEM, but looks like this will not happen (not only because of 
COVID). Maybe we can still have a beer together in Ireland! Unfortunately, I 
was not able to visit Ireland up to now, so I have a new target!

 

Thanks for taking care of the open source projects, this was a great success!

 

Have a good time with your family!

Uwe

 

-

Uwe Schindler

uschind...@apache.org 

ASF Member, Member of PMC and Committer of Apache Lucene and Apache Solr

Bremen, Germany

https://lucene.apache.org/

https://solr.apache.org/

 

From: Rory O'Donnell  
Sent: Tuesday, October 26, 2021 2:56 PM
To: Uwe Schindler 
Cc: rory.odonn...@oracle.com; David Delabassee; Deepak Nenmelithara
Damodaran; Dalibor Topic; Balchandra Vaidya; Dawid Weiss;
dev@lucene.apache.org
Subject: Thank you! JDK 18 Early Access build 20 is now available

 

Hi Uwe & Dawid,
 

Thank you. 

 

I'm retiring at the end of November 2021, it's time to spend more time with the 
family.

 

We started the Quality Outreach back in October 2014. We now have 170+
projects participating.

Thank you for taking the time to provide testing feedback, excellent bugs
and support throughout the last seven years.

It's been a pleasure working with you. I am delighted to say that the
program will continue with the support of the Java DevRel Team, with David
Delabassee as your contact. David has been assisting with on-boarding new
projects for the last couple of years.

 

All the best, Rory

 

 

OpenJDK 18 Early Access build 20 is now available at  
 https://jdk.java.net/18/ 

*   These early-access , open-source builds are provided under the 

* GNU General Public 
License, version 2, with the Classpath Exception.

*   Release Notes are available at   
https://jdk.java.net/18/release-notes 
*   Features: 

*   JEPs integrated to JDK 18, so far:

*   JEP 400:   UTF-8 by Default  
*   JEP 408:   Simple Web Server  
*   JEP 413:   Code Snippets in Java API 
Documentation  

*   JEPs targeted to JDK 18, so far

*   JEP 417:   Vector API (Third 
Incubator) 

*   JEPs proposed to target JDK 18:  

*   JEP 416:   Reimplement Core 
Reflection with Method Handles  

*   Significant changes since the last availability email:

*   Build 20: 

*   JDK-8275252: Migrate cacerts from JKS to password-less PKCS12 
*   JDK-8275149: (ch) ReadableByteChannel returned by 
Channels.newChannel(InputStream) throws ReadOnlyBufferException 
*   JDK-8266936: Add a finalization JFR event 
*   JDK-8264849: Add KW and KWP support to PKCS11 provider 

*   Build 19: 

*   JDK-8274840: Update OS detection code to recognize Windows 11 
*   JDK-8274407: (tz) Update Timezone Data to 2021c 
*   JDK-8273102: Delete deprecated for removal the empty finalize() in 
java.desktop module 

*   Build 18: 

*   JDK-8274656: Remove default_checksum and safe_checksum_type from 
krb5.conf 
*   JDK-8274471: Add support for RSASSA-PSS in OCSP Response 
*   JDK-8274227: Remove "impl.prefix" jdk system property usage from 
InetAddress 
*   JDK-8274002: [win11 and winserver2022] JDK executable installer from 
network drive starts with huge delay 
*   JDK-8273670: Remove weak etypes from default krb5 etype list 

*   Build 17: 

*   JDK-8273401: Disable JarIndex Support In URLClassPath 
*   JDK-8231640: (prop) Canonical property storage 
*   Build 16: 
*   JDK-8269039: Disable SHA-1 Signed JARs 

Topics of Interest: 

JDK 17:

*   Inside Java Podcast “Java 17 is Here!”

*   Part 1:   
https://inside.java/2021/09/14/podcast-019/
*   Part 2:   
https://inside.java/2021/09/27/podcast-020/

*   G1 GC & Parallel GC Improvements in JDK 17

* 
https://inside.java/2021/09/17/jdk-17-gc-updates/

*   ZGC - What's new in JDK 17

* 
https://inside.java/2021/10/05/zgc-in-jdk17/

*   JDK 17 Security Enhancements

* 
https://inside.java/2021/09/15/jdk-17-security-enhancements/

*   The Vector API in JDK 17 (video)

* 
https://inside.java/2021/09/23/devlive-vector-api/

*   Faster Charset Encoding 

* 
https://inside.java/2021/10/17/faster-charset-encoding/

JDK 18:

*

Re: Slow DV equivalent of TermInSetQuery

2021-10-26 Thread Robert Muir
On Tue, Oct 26, 2021 at 11:24 AM Robert Muir wrote:
>
> On Tue, Oct 26, 2021 at 10:58 AM Alan Woodward wrote:
> >
> > We have SortedSetDocValuesField.newSlowRangeQuery() which does something
> > close to what you want here, I think.
> >
>
> See also DocValuesRewriteMethod which might be useful, at least as a
> start. You'd have to express the "SetQuery" as a MultiTermQuery for
> that to work, but it would be more efficient than a disjunction of
> slow-exact-queries.

Maybe that's the issue here? If TermInSetQuery extended
MultiTermQuery, then this would be trivial, you wouldn't have to write
any code to use the DV ordinals instead of the terms+postings, you'd
just call .setRewriteMethod().

Could/should TermInSetQuery be refactored to extend MultiTermQuery?




Re: Slow DV equivalent of TermInSetQuery

2021-10-26 Thread Robert Muir
On Tue, Oct 26, 2021 at 10:58 AM Alan Woodward wrote:
>
> We have SortedSetDocValuesField.newSlowRangeQuery() which does something
> close to what you want here, I think.
>

See also DocValuesRewriteMethod which might be useful, at least as a
start. You'd have to express the "SetQuery" as a MultiTermQuery for
that to work, but it would be more efficient than a disjunction of
slow-exact-queries.

e.g. for each segment it will first sequentially fill a bitset
corresponding to the ordinals matching the terms in your set.
Then when checking a single doc, it looks at document's ordinals to
see if one is in the bitset.
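
The two steps above can be modeled in plain Java without any Lucene classes. This is only a sketch under stated assumptions: SegmentOrdinalFilter, sortedTermDict and docOrds are hypothetical stand-ins for one segment's sorted doc-values term dictionary and per-document ordinals, and the binary search stands in for however the real rewrite method walks the terms.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Hypothetical model of one segment's multi-valued sorted doc values (not a Lucene class). */
final class SegmentOrdinalFilter {
    // Ordinals of the query terms that actually occur in this segment.
    private final BitSet matchingOrds;
    // docOrds[doc] holds the ordinals of that document's values.
    private final int[][] docOrds;

    SegmentOrdinalFilter(String[] sortedTermDict, int[][] docOrds, List<String> queryTerms) {
        this.docOrds = docOrds;
        this.matchingOrds = new BitSet(sortedTermDict.length);
        // Step 1: once per segment, resolve each query term to its ordinal
        // in the sorted dictionary and mark it in the bitset.
        for (String term : queryTerms) {
            int ord = Arrays.binarySearch(sortedTermDict, term);
            if (ord >= 0) {
                matchingOrds.set(ord);
            }
        }
    }

    /** Step 2: a document matches if any of its ordinals is marked. */
    boolean matches(int doc) {
        for (int ord : docOrds[doc]) {
            if (matchingOrds.get(ord)) {
                return true;
            }
        }
        return false;
    }
}
```

The per-document check never compares term bytes; it only tests ordinals against the bitset, which is why this should beat a disjunction of exact queries once the set is large.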




Re: Slow DV equivalent of TermInSetQuery

2021-10-26 Thread Alan Woodward
We have SortedSetDocValuesField.newSlowRangeQuery() which does something close 
to what you want here, I think.

> On 26 Oct 2021, at 15:23, Michael McCandless wrote:
> 
> Hi Team,
> 
> I was discussing this problem with Greg Miller (also at Amazon Product 
> Search):
> 
> If I want to make a query that filters out a few primary keys (ASIN in our 
> Amazon Product Search world), I can make a TermInSetQuery and add it as a 
> MUST_NOT onto a BooleanQuery that has all the other interesting clauses for 
> my query.
> 
> But if I have many, many ASINs to filter out, at some point it may become 
> more efficient to just use doc values and filter them out like Solr's 
> "post-filter" / during collection, e.g. by loading the BINARY value or SORTED 
> (globalized) ordinal, and checking e.g. a HashSet to see if it should be 
> skipped.  Not using the inverted index at all...
> 
> Do we already have such a "slow DV TermInSet" query?
> 
> It seems like it could belong in SortedDocValuesField, where we already have
> newSlowRangeQuery and newSlowExactQuery. Could we add a newSlowInSetQuery?
> 
> And then we could make an IndexOrDocValuesQuery with both the TermInSetQuery 
> and this SDV.newSlowInSetQuery?
> 
> Or maybe there is already a good way to do this in Lucene?
> 
> Thanks!,
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com 


Slow DV equivalent of TermInSetQuery

2021-10-26 Thread Michael McCandless
Hi Team,

I was discussing this problem with Greg Miller (also at Amazon Product
Search):

If I want to make a query that filters out a few primary keys (ASIN in our
Amazon Product Search world), I can make a TermInSetQuery and add it as a
MUST_NOT onto a BooleanQuery that has all the other interesting clauses for
my query.

But if I have many, many ASINs to filter out, at some point it may become
more efficient to just use doc values and filter them out like Solr's
"post-filter" / during collection, e.g. by loading the BINARY value or
SORTED (globalized) ordinal, and checking e.g. a HashSet to see if it
should be skipped.  Not using the inverted index at all...
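
A rough plain-Java sketch of that post-filter idea, with no Lucene classes involved. Everything here is an assumption for illustration: PostFilterSketch and collect are hypothetical names, candidateDocs stands in for the docs matched by the other clauses, and primaryKeyByDoc stands in for a per-document doc-values lookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical post-filter model: skip docs whose primary key is in an exclusion set. */
final class PostFilterSketch {
    /**
     * @param candidateDocs   doc ids already matched by the rest of the query
     * @param primaryKeyByDoc stand-in for reading each doc's primary-key doc value
     * @param excludedKeys    keys to filter out, e.g. held in a HashSet
     * @return the doc ids that survive the filter, in their original order
     */
    static List<Integer> collect(int[] candidateDocs, String[] primaryKeyByDoc,
                                 Set<String> excludedKeys) {
        List<Integer> collected = new ArrayList<>();
        for (int doc : candidateDocs) {
            // One set lookup per candidate doc; the inverted index is never consulted.
            if (excludedKeys.contains(primaryKeyByDoc[doc])) {
                continue;
            }
            collected.add(doc);
        }
        return collected;
    }
}
```

The cost here scales with the number of candidate docs rather than with the number of excluded keys, which is the trade-off in question.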

Do we already have such a "slow DV TermInSet" query?

It seems like it could belong in SortedDocValuesField, where we already have
newSlowRangeQuery and newSlowExactQuery. Could we add a newSlowInSetQuery?

And then we could make an IndexOrDocValuesQuery with both the
TermInSetQuery and this SDV.newSlowInSetQuery?

Or maybe there is already a good way to do this in Lucene?

Thanks!,

Mike McCandless

http://blog.mikemccandless.com


Thank you! JDK 18 Early Access build 20 is now available

2021-10-26 Thread Rory O'Donnell

Hi Uwe & Dawid,

*Thank you.*

I'm retiring at the end of November 2021; it's time to spend more time
with the family.

We started the Quality Outreach back in October 2014. We now have 170+
projects participating.
Thank you for taking the time to provide testing feedback, excellent
bugs and support throughout the last seven years.

It's been a pleasure working with you. I am delighted to say that the
program will continue with the support of the Java DevRel Team, with
David Delabassee as your contact. David has been assisting with
on-boarding new projects for the last couple of years.

All the best, Rory


OpenJDK 18 Early Access build 20 is now available at https://jdk.java.net/18/

 * These early-access, open-source builds are provided under the
   GNU General Public License, version 2, with the Classpath Exception.
 * Release Notes are available at https://jdk.java.net/18/release-notes
 * Features:
    o JEPs integrated to JDK 18, so far:
       + JEP 400: UTF-8 by Default
       + JEP 408: Simple Web Server
       + JEP 413: Code Snippets in Java API Documentation
    o JEPs targeted to JDK 18, so far:
       + JEP 417: Vector API (Third Incubator)
    o JEPs proposed to target JDK 18:
       + JEP 416: Reimplement Core Reflection with Method Handles

 * Significant changes since the last availability email:
    o Build 20:
       + JDK-8275252: Migrate cacerts from JKS to password-less PKCS12
       + JDK-8275149: (ch) ReadableByteChannel returned by
         Channels.newChannel(InputStream) throws ReadOnlyBufferException
       + JDK-8266936: Add a finalization JFR event
       + JDK-8264849: Add KW and KWP support to PKCS11 provider
    o Build 19:
       + JDK-8274840: Update OS detection code to recognize Windows 11
       + JDK-8274407: (tz) Update Timezone Data to 2021c
       + JDK-8273102: Delete deprecated for removal the empty
         finalize() in java.desktop module
    o Build 18:
       + JDK-8274656: Remove default_checksum and safe_checksum_type
         from krb5.conf
       + JDK-8274471: Add support for RSASSA-PSS in OCSP Response
       + JDK-8274227: Remove "impl.prefix" jdk system property usage
         from InetAddress
       + JDK-8274002: [win11 and winserver2022] JDK executable
         installer from network drive starts with huge delay
       + JDK-8273670: Remove weak etypes from default krb5 etype list
    o Build 17:
       + JDK-8273401: Disable JarIndex Support In URLClassPath
       + JDK-8231640: (prop) Canonical property storage
    o Build 16:
       + JDK-8269039: Disable SHA-1 Signed JARs

*Topics of Interest:*

_JDK 17:_

 * Inside Java Podcast “Java 17 is Here!”
    o Part 1: https://inside.java/2021/09/14/podcast-019/
    o Part 2: https://inside.java/2021/09/27/podcast-020/
 * G1 GC & Parallel GC Improvements in JDK 17
    o https://inside.java/2021/09/17/jdk-17-gc-updates/
 * ZGC - What's new in JDK 17
    o https://inside.java/2021/10/05/zgc-in-jdk17/
 * JDK 17 Security Enhancements
    o https://inside.java/2021/09/15/jdk-17-security-enhancements/
 * The Vector API in JDK 17 (video)
    o https://inside.java/2021/09/23/devlive-vector-api/
 * Faster Charset Encoding
    o https://inside.java/2021/10/17/faster-charset-encoding/

_JDK 18:_

 * JEP 400 and the Default Charset
    o https://inside.java/2021/10/04/the-default-charset-jep400/
 * JDK 18 augmented `javac -Xlint:serial` checks
    o https://inside.java/2021/10/20/augmented-serial-checks

_Project Panama - Foreign Function & Memory API:_

 * Finalizing the Foreign APIs
    o https://inside.java/2021/09/16/finalizing-the-foreign-apis/
 * Resource Scope Dependencies
    o https://inside.java/2021/10/12/panama-scope-dependencies/

*October 2021 Critical Patch Update Released*

 * As part of the October 2021, we released JDK 17.0.1 LTS, JDK 11.0.13
   LTS, JDK 8u311 and JDK