Re: Clarification regarding BlockTree implementation of IntersectTermsEnum

2019-04-02 Thread Stamatis Zampetakis
Thanks Robert and Uwe for your feedback!

I am not sure what you mean when you say that the slowness does not come
from the iteration over the term index.
I did some quick profiling (screenshot attached) while repeatedly sending a
TermRangeQuery that matches almost the whole index, and I observe that 80%
of the time is spent in the IntersectTermsEnum.next() method.
Isn't this the method which iterates over the index?
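To make sure we are talking about the same thing, this is a rough sketch of
the per-segment work I have in mind when I say "iteration": walking every
term in the range and merging its postings. It is only an illustration of my
understanding, not the actual Lucene code.

import java.io.IOException;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.DocIdSetBuilder;

static DocIdSet collectRange(int maxDoc, TermsEnum termsInRange) throws IOException {
  // termsInRange is assumed to enumerate only the terms of the range,
  // e.g. the enum returned by Terms.intersect() (an IntersectTermsEnum).
  DocIdSetBuilder builder = new DocIdSetBuilder(maxDoc);
  PostingsEnum postings = null;
  for (BytesRef term = termsInRange.next(); term != null; term = termsInRange.next()) {
    postings = termsInRange.postings(postings, PostingsEnum.NONE);  // reuse the postings enum
    builder.add(postings);  // merge this term's documents into the result set
  }
  return builder.build();
}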



On Mon, Apr 1, 2019 at 6:57 PM Uwe Schindler  wrote:

> Hi again,
>
> > The problem with TermRangeQueries is actually not the iteration over the
> > term index. The slowness comes from the fact that all terms between start
> > and end have to be iterated and their postings be fetched and those
> postings
> > be merged together. If the "source of terms" for doing this is just a
> simple
> > linear iteration of all terms from/to or the automaton intersection does
> not
> > really matter for the query execution. The change to prefer the automaton
> > instead of a simple term iteration is just to allow further
> optimizations, for
> > more info see https://issues.apache.org/jira/browse/LUCENE-5879
>
> So the above issue brings no further improvements (currently). So as said
> before, a simple enumeration of all terms and fetching their postings
> will be as fast. By using an automaton, the idea here is that the term
> dictionary may have some "precalculated" postings list for prefix terms. So
> instead of iterating over all terms and merging their postings, the terms
> dictionary could return a "virtual term" that contains all documents for a
> whole prefix and store the merged postings list in the index file. This was
> not yet implemented but is the place to hook into. You could create an
> improved BlockTermsDict implementation, that allows to get a PostingsEnum
> for a whole prefix of terms.
>
> Not sure how much of that was already implemented by LUCENE-5879, but it
> allows to do this. So here is where you could step in and improve the terms
> dictionary!
>
> Uwe
>
> > Uwe
> >
> > -
> > Uwe Schindler
> > Achterdiek 19, D-28357 Bremen
> > http://www.thetaphi.de
> > eMail: u...@thetaphi.de
> >
> > > -Original Message-
> > > From: Robert Muir 
> > > Sent: Monday, April 1, 2019 6:30 PM
> > > To: java-user 
> > > Subject: Re: Clarification regarding BlockTree implementation of
> > > IntersectTermsEnum
> > >
> > > The regular TermsEnum is really designed for walking terms in linear
> order.
> > > it does have some ability to seek/leapfrog. But this means paths in a
> query
> > > automaton that match no terms result in a wasted seek and cpu, because
> > the
> > > api is designed to return the next term after regardless.
> > >
> > > On the other hand the intersect() is for intersecting two automata:
> query
> > > and index. Presumably it can also remove more inefficiencies than just
> the
> > > wasted seeks for complex wildcards and fuzzies and stuff, since it can
> > > "see" the whole input as an automaton. so for example it might be able
> to
> > > work on blocks of terms at a time and so on.
> > >
> > > On Mon, Apr 1, 2019, 12:17 PM Stamatis Zampetakis
> > 
> > > wrote:
> > >
> > > > Yes it is used.
> > > >
> > > > I think there are simpler and possibly more efficient ways to
> implement a
> > > > TermRangeQuery and that is why I am looking into this.
> > > > But I am also curious to understand what IntersectTermsEnum is
> > supposed
> > > to
> > > > do.
> > > >
> > > > On Mon, Apr 1, 2019 at 5:34 PM Robert Muir <rcm...@gmail.com> wrote:
> > > >
> > > > > Is this IntersectTermsEnum really being used for term range query?
> > > Seems
> > > > > like using a standard TermsEnum, seeking to the start of the
> range, then
> > > > > calling next until the end would be easier.
> > > > >
> > > > > On Mon, Apr 1, 2019, 10:05 AM Stamatis Zampetakis
> > > 
> > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I am currently working on improving the performance of range
> > queries
> > > on
> > > > > > strings. I've noticed that using TermRangeQuery with
> low-selective
> > > > > queries
> > > > > > is a very bad idea in terms of performance but I cannot clearly
> > > > > > explain why since it seems related with how the
> > > > > > IntersectTermsEnum#next method is implemented.

Re: Clarification regarding BlockTree implementation of IntersectTermsEnum

2019-04-02 Thread Stamatis Zampetakis
Thanks for the explanation regarding IntersectTermsEnum.next(). I
understand better now.

How about indexing a string field containing the first name and last name
of a person, or possibly their address?
Values for these fields have variable length and are usually longer than
16 bytes. If my understanding is correct, they cannot (should not) be
indexed as points.
Range queries on these fields appear mostly when users implement custom
pagination (without using searchAfter) on the result set.

The alternative that I am considering, which seems to work well for the
use cases I mentioned, is using doc values and
SortedDocValuesField.newSlowRangeQuery.
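Roughly, what I have in mind is the sketch below. Field name and bounds are
made up; it assumes the value is indexed as a SortedDocValuesField, and the
IndexOrDocValuesQuery part only applies if the same value is also indexed as
a StringField.

import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermRangeQuery;
import org.apache.lucene.util.BytesRef;

static Query lastnameRange() {
  // Range evaluated against doc values (the "slow" query).
  Query dvRange = SortedDocValuesField.newSlowRangeQuery(
      "lastname", new BytesRef("a"), new BytesRef("m"), true, false);
  // If the same value is also indexed as a StringField, both variants can be
  // combined so that Lucene picks the cheaper strategy per segment.
  Query indexRange = TermRangeQuery.newStringRange("lastname", "a", "m", true, false);
  return new IndexOrDocValuesQuery(indexRange, dvRange);
}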





On Tue, Apr 2, 2019 at 3:01 PM Robert Muir  wrote:

> Can you explain a little more about your use-case? I think that's the
> biggest problem here for term range query. Pretty much all range
> use-cases are converted over to space partitioned data structures
> (Points) so its unclear why this query would even be used for anything
> serious.
>
> To answer your question, IntersectTermsEnum.next() doesn't iterate
> over the whole index, just the term dictionary. But the lines are
> blurred since if there's only one posting for a term, that single
> posting is inlined directly into the term dictionary and often a
> ton of terms fit into this category, so seeing 80% time in the term
> dictionary wouldn't surprise me.
>
> On Tue, Apr 2, 2019 at 8:03 AM Stamatis Zampetakis 
> wrote:
> >
> > Thanks Robert and Uwe for your feedback!
> >
> > I am not sure what you mean that the slowness does not come from the
> iteration over the term index.
> > I did a small profiling (screenshot attached) of sending repeatedly a
> TermRangeQuery that matches almost the whole index and I observe that 80%
> of the time is spent in the IntersectTermsEnum.next() method.
> > Isn't this the method which iterates over the index?
> >
> >
> >
> > On Mon, Apr 1, 2019 at 6:57 PM Uwe Schindler  wrote:
> >>
> >> Hi again,
> >>
> >> > The problem with TermRangeQueries is actually not the iteration over
> the
> >> > term index. The slowness comes from the fact that all terms between
> start
> >> > and end have to be iterated and their postings be fetched and those
> postings
> >> > be merged together. If the "source of terms" for doing this is just a
> simple
> >> > linear iteration of all terms from/to or the automaton intersection
> does not
> >> > really matter for the query execution. The change to prefer the
> automaton
> >> > instead of a simple term iteration is just to allow further
> optimizations, for
> >> > more info see https://issues.apache.org/jira/browse/LUCENE-5879
> >>
> >> So the above issue brings no further improvements (currently). So as
> said before, a simple enumeration of all terms and fetching their
> postings will be as fast. By using an automaton, the idea here is that the
> term dictionary may have some "precalculated" postings list for prefix
> terms. So instead of iterating over all terms and merging their postings,
> the terms dictionary could return a "virtual term" that contains all
> documents for a whole prefix and store the merged postings list in the
> index file. This was not yet implemented but is the place to hook into. You
> could create an improved BlockTermsDict implementation, that allows to get
> a PostingsEnum for a whole prefix of terms.
> >>
> >> Not sure how much of that was already implemented by LUCENE-5879, but
> it allows to do this. So here is where you could step in and improve the
> terms dictionary!
> >>
> >> Uwe
> >>
> >> > Uwe
> >> >
> >> > -
> >> > Uwe Schindler
> >> > Achterdiek 19, D-28357 Bremen
> >> > http://www.thetaphi.de
> >> > eMail: u...@thetaphi.de
> >> >
> >> > > -Original Message-
> >> > > From: Robert Muir 
> >> > > Sent: Monday, April 1, 2019 6:30 PM
> >> > > To: java-user 
> >> > > Subject: Re: Clarification regarding BlockTree implementation of
> >> > > IntersectTermsEnum
> >> > >
> >> > > The regular TermsEnum is really designed for walking terms in
> linear order.
> >> > > it does have some ability to seek/leapfrog. But this means paths in
> a query
> >> > > automaton that match no terms result in a wasted seek and cpu,
> because
> >> > the
> >> > > api is designed to return the next term after regardless.
> >>

Re: Clarification regarding BlockTree implementation of IntersectTermsEnum

2019-04-04 Thread Stamatis Zampetakis
I am mostly referring to keyset pagination, so sorting on the field is
always necessary.
For the same reason, range queries are inevitable.
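To make the use case concrete, a page fetch with this approach could look
like the sketch below. Field and method names are made up; it assumes the
field is indexed as a StringField (for the range filter) and also as a
SortedDocValuesField (so it can be sorted on).

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermRangeQuery;
import org.apache.lucene.search.TopDocs;

static TopDocs nextPage(IndexSearcher searcher, String lastKeyOfPreviousPage,
                        int pageSize) throws IOException {
  // The last key of the previous page becomes an exclusive lower bound: (lastKey, +inf).
  Query pageFilter = TermRangeQuery.newStringRange(
      "lastname", lastKeyOfPreviousPage, null, false, true);
  // Keyset pagination requires sorting on the same field.
  Sort byLastname = new Sort(new SortField("lastname", SortField.Type.STRING));
  return searcher.search(pageFilter, pageSize, byLastname);
}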

On Tue, Apr 2, 2019 at 5:08 PM Robert Muir  wrote:

> I think instead of using range queries for custom pagination, it would
> be best to simply sort by the field normally (with or without
> searchAfter)
>
> On Tue, Apr 2, 2019 at 7:34 AM Stamatis Zampetakis 
> wrote:
> >
> > Thanks for the explanation regarding IntersectTermsEnum.next(). I
> > understand better now.
> >
> > How about indexing a string field containing the firstname and lastname
> of
> > a person or possibly its address.
> > Values for these fields have variable length and usually they have more
> > than 16 bytes. If my understanding is correct then they should (can) not
> be
> > indexed as points.
> > Range queries on this fields appear mostly when users implement custom
> > pagination (without using searchAfter) on the result set.
> >
> > The alternative that I consider and seems to work well for the use-cases
> I
> > mentioned is using doc values and SortedDocValuesField.newSlowRangeQuery.
> >
> >
> >
> >
> >
> > On Tue, Apr 2, 2019 at 3:01 PM Robert Muir  wrote:
> >
> > > Can you explain a little more about your use-case? I think that's the
> > > biggest problem here for term range query. Pretty much all range
> > > use-cases are converted over to space partitioned data structures
> > > (Points) so its unclear why this query would even be used for anything
> > > serious.
> > >
> > > To answer your question, IntersectTermsEnum.next() doesn't iterate
> > > over the whole index, just the term dictionary. But the lines are
> > > blurred since if there's only one posting for a term, that single
> > > posting is inlined directly into the term dictionary and often a
> > > ton of terms fit into this category, so seeing 80% time in the term
> > > dictionary wouldn't surprise me.
> > >
> > > On Tue, Apr 2, 2019 at 8:03 AM Stamatis Zampetakis 
> > > wrote:
> > > >
> > > > Thanks Robert and Uwe for your feedback!
> > > >
> > > > I am not sure what you mean that the slowness does not come from the
> > > iteration over the term index.
> > > > I did a small profiling (screenshot attached) of sending repeatedly a
> > > TermRangeQuery that matches almost the whole index and I observe that
> 80%
> of the time is spent in the IntersectTermsEnum.next() method.
> > > > Isn't this the method which iterates over the index?
> > > >
> > > >
> > > >
> > > > On Mon, Apr 1, 2019 at 6:57 PM Uwe Schindler <u...@thetaphi.de> wrote:
> > > >>
> > > >> Hi again,
> > > >>
> > > >> > The problem with TermRangeQueries is actually not the iteration
> over
> > > the
> > > >> > term index. The slowness comes from the fact that all terms
> between
> > > start
> > > >> > and end have to be iterated and their postings be fetched and
> those
> > > postings
> > > >> > be merged together. If the "source of terms" for doing this is
> just a
> > > simple
> > > >> > linear iteration of all terms from/to or the automaton
> intersection
> > > does not
> > > >> > really matter for the query execution. The change to prefer the
> > > automaton
> > > >> > instead of a simple term iteration is just to allow further
> > > optimizations, for
> > > >> > more info see https://issues.apache.org/jira/browse/LUCENE-5879
> > > >>
> > > >> So the above issue brings no further improvements (currently). So as
> > > said before, a simple enumeration of all terms and fetching their
> > > postings will be as fast. By using an automaton, the idea here is that
> the
> > > term dictionary may have some "precalculated" postings list for prefix
> > > terms. So instead of iterating over all terms and merging their
> postings,
> > > the terms dictionary could return a "virtual term" that contains all
> > > documents for a whole prefix and store the merged postings list in the
> > > index file. This was not yet implemented but is the place to hook
> into. You
> > > could create an improved BlockTermsDict implementation, that allows to
> get
> > >

Re: Clarification regarding BlockTree implementation of IntersectTermsEnum

2019-04-01 Thread Stamatis Zampetakis
Yes it is used.

I think there are simpler and possibly more efficient ways to implement a
TermRangeQuery and that is why I am looking into this.
But I am also curious to understand what IntersectTermsEnum is supposed to
do.

On Mon, Apr 1, 2019 at 5:34 PM Robert Muir  wrote:

> Is this IntersectTermsEnum really being used for term range query? Seems
> like using a standard TermsEnum, seeking to the start of the range, then
> calling next until the end would be easier.
>
> On Mon, Apr 1, 2019, 10:05 AM Stamatis Zampetakis 
> wrote:
>
> > Hi all,
> >
> > I am currently working on improving the performance of range queries on
> > strings. I've noticed that using TermRangeQuery with low-selective
> queries
> > is a very bad idea in terms of performance but I cannot clearly explain
> why
> > since it seems related with how the IntersectTermsEnum#next method is
> > implemented.
> >
> > The Javadoc of the class says that the terms index (the burst-trie
> > datastructure) is not used by this implementation of TermsEnum. However,
> > when I see the implementation of the next method I get the impression
> that
> > this is not accurate. Aren't we using the trie structure to skip parts of
> > the data when  the automaton states do not match?
> >
> > Can somebody provide a high-level intuition of what
> > IntersectTermsEnum#next does? Initially, I thought that it is traversing
> > the whole trie structure (skipping some branches when necessary) but I
> may
> > be wrong.
> >
> > Thanks in advance,
> > Stamatis
> >
>


Clarification regarding BlockTree implementation of IntersectTermsEnum

2019-04-01 Thread Stamatis Zampetakis
Hi all,

I am currently working on improving the performance of range queries on
strings. I've noticed that using TermRangeQuery with low-selectivity queries
is a very bad idea in terms of performance, but I cannot clearly explain why;
it seems related to how the IntersectTermsEnum#next method is
implemented.

The Javadoc of the class says that the terms index (the burst-trie
data structure) is not used by this implementation of TermsEnum. However,
when I look at the implementation of the next method I get the impression
that this is not accurate. Aren't we using the trie structure to skip parts
of the data when the automaton states do not match?

Can somebody provide a high-level intuition of what
IntersectTermsEnum#next does? Initially, I thought that it traverses
the whole trie structure (skipping some branches when necessary), but I may
be wrong.
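For context, this is roughly how I understand the enum is obtained for a
range (a sketch of my understanding, not the actual TermRangeQuery code):
the range is compiled into a binary-interval automaton and the terms
dictionary is asked to intersect itself with it, which with the default
codec hands back the BlockTree IntersectTermsEnum.

import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.automaton.Automata;
import org.apache.lucene.util.automaton.CompiledAutomaton;

static long countTermsInRange(LeafReader reader, String field,
                              BytesRef lower, BytesRef upper) throws IOException {
  Terms terms = reader.terms(field);  // assumes the field exists in this segment
  CompiledAutomaton range =
      new CompiledAutomaton(Automata.makeBinaryInterval(lower, true, upper, true));
  TermsEnum termsEnum = terms.intersect(range, null);  // BlockTree returns an IntersectTermsEnum
  long count = 0;
  for (BytesRef term = termsEnum.next(); term != null; term = termsEnum.next()) {
    count++;  // a real query would also fetch the postings of each term here
  }
  return count;
}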

Thanks in advance,
Stamatis


Re: Use custom score in ConstantScoreQuery

2019-12-09 Thread Stamatis Zampetakis
Thanks for your reply, Adrien!
Can you clarify what the second way is?
At the moment I haven't found a way (apart from creating my own Query
classes) to say that a query will always return a score of 0.5 for each
document.

On Mon, Dec 9, 2019 at 8:16 AM Adrien Grand  wrote:

> Hi Stamatis,
>
> I personally like the current way things work. If we added the ability
> to set a custom score on ConstantScoreQuery, then we'd end up with two
> ways to do the same thing, which I like to avoid whenever possible.
>
> On Sun, Dec 8, 2019 at 10:07 PM Stamatis Zampetakis 
> wrote:
> >
> > Small reminder. Any input on this?
> >
> > Thanks,
> > Stamatis
> >
> > On Mon, Dec 2, 2019 at 12:10 PM Stamatis Zampetakis 
> > wrote:
> >
> > > Hi all,
> > >
> > > Currently ConstantScoreQuery [1] returns a constant score equal to 1
> for
> > > every document that matches the query.
> > >
> > > I would like to use the ConstantScoreQuery but with a different score
> > > value that I can pass explicitly (via the constructor for instance).
> > >
> > > This change may also benefit some other parts of Lucene where a
> > > ConstantScoreQuery is wrapped around a BoostQuery simply for returning
> a
> > > score of zero [2][3].
> > >
> > > Does this change make sense? Shall I create a JIRA for it?
> > >
> > > Best,
> > > Stamatis
> > >
> > > [1]
> > >
> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/ConstantScoreQuery.java
> > > [2]
> > >
> https://github.com/apache/lucene-solr/blob/1d238c844e45f088a942aec14750c186c7a66d92/lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java#L253
> > > [3]
> > >
> https://github.com/apache/lucene-solr/blob/1d238c844e45f088a942aec14750c186c7a66d92/lucene/core/src/java/org/apache/lucene/search/BoostQuery.java#L97
> > >
>
>
>
> --
> Adrien
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Use custom score in ConstantScoreQuery

2019-12-08 Thread Stamatis Zampetakis
Small reminder. Any input on this?

Thanks,
Stamatis

On Mon, Dec 2, 2019 at 12:10 PM Stamatis Zampetakis 
wrote:

> Hi all,
>
> Currently ConstantScoreQuery [1] returns a constant score equal to 1 for
> every document that matches the query.
>
> I would like to use the ConstantScoreQuery but with a different score
> value that I can pass explicitly (via the constructor for instance).
>
> This change may also benefit some other parts of Lucene where a
> ConstantScoreQuery is wrapped around a BoostQuery simply for returning a
> score of zero [2][3].
>
> Does this change make sense? Shall I create a JIRA for it?
>
> Best,
> Stamatis
>
> [1]
> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/ConstantScoreQuery.java
> [2]
> https://github.com/apache/lucene-solr/blob/1d238c844e45f088a942aec14750c186c7a66d92/lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java#L253
> [3]
> https://github.com/apache/lucene-solr/blob/1d238c844e45f088a942aec14750c186c7a66d92/lucene/core/src/java/org/apache/lucene/search/BoostQuery.java#L97
>


Re: Use custom score in ConstantScoreQuery

2019-12-10 Thread Stamatis Zampetakis
Many thanks for the replies.

Intuitively, I was hoping that this could be done with a single query class,
especially since ConstantScoreQuery seems to be made for that. I've seen
the pattern of wrapping a BoostQuery around a ConstantScoreQuery, but it
seemed to me like a workaround. Anyway, I understand that this is a
well-adopted pattern, so I am not going to insist on it further.
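For anyone finding this thread later, the pattern boils down to a single
extra line; the helper method below is just an illustration, for an
arbitrary inner query.

import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Query;

static Query constantScore(Query inner, float score) {
  // Every matching document gets the given score, e.g. 0.5f as in the example above.
  return new BoostQuery(new ConstantScoreQuery(inner), score);
}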

Best,
Stamatis

On Mon, Dec 9, 2019 at 10:50 AM Uwe Schindler  wrote:

> Hi,
>
> Just add a BoostQuery  with a boost factor of 0.5 around the
> ConstantScoreQuery. It's just one line more in your code. I don't
> understand why we would need separate query classes for this.
>
> Uwe
>
> -
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> > -Original Message-
> > From: Stamatis Zampetakis 
> > Sent: Monday, December 9, 2019 10:42 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: Use custom score in ConstantScoreQuery
> >
> > Thanks for your reply, Adrien!
> > Can you clarify what is the second way?
> > At the moment I haven't found a way (apart from creating my own Query
> > classes) to say that a query will always return a score of 0.5 for each
> > document.
> >
> > On Mon, Dec 9, 2019 at 8:16 AM Adrien Grand  wrote:
> >
> > > Hi Stamatis,
> > >
> > > I personally like the current way things work. If we added the ability
> > > to set a custom score on ConstantScoreQuery, then we'd end up with two
> > > ways to do the same thing, which I like to avoid whenever possible.
> > >
> > > On Sun, Dec 8, 2019 at 10:07 PM Stamatis Zampetakis
> > 
> > > wrote:
> > > >
> > > > Small reminder. Any input on this?
> > > >
> > > > Thanks,
> > > > Stamatis
> > > >
> > > > On Mon, Dec 2, 2019 at 12:10 PM Stamatis Zampetakis
> > 
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Currently ConstantScoreQuery [1] returns a constant score equal to
> 1
> > > for
> > > > > every document that matches the query.
> > > > >
> > > > > I would like to use the ConstantScoreQuery but with a different
> score
> > > > > value that I can pass explicitly (via the constructor for
> instance).
> > > > >
> > > > > This change may also benefit some other parts of Lucene where a
> > > > > ConstantScoreQuery is wrapped around a BoostQuery simply for
> > returning
> > > a
> > > > > score of zero [2][3].
> > > > >
> > > > > Does this change make sense? Shall I create a JIRA for it?
> > > > >
> > > > > Best,
> > > > > Stamatis
> > > > >
> > > > > [1]
> > > > >
> > > https://github.com/apache/lucene-
> > solr/blob/master/lucene/core/src/java/org/apache/lucene/search/Constant
> > ScoreQuery.java
> > > > > [2]
> > > > >
> > > https://github.com/apache/lucene-
> > solr/blob/1d238c844e45f088a942aec14750c186c7a66d92/lucene/core/src/ja
> > va/org/apache/lucene/search/BooleanQuery.java#L253
> > > > > [3]
> > > > >
> > > https://github.com/apache/lucene-
> > solr/blob/1d238c844e45f088a942aec14750c186c7a66d92/lucene/core/src/ja
> > va/org/apache/lucene/search/BoostQuery.java#L97
> > > > >
> > >
> > >
> > >
> > > --
> > > Adrien
> > >
> > > -
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> > >
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Parameterized queries in Lucene

2019-10-21 Thread Stamatis Zampetakis
Hi all,

In the world of relational databases and SQL, the existence of
parameterized queries (aka. PreparedStatement) offers many advantages in
terms of security and performance.

I guess everybody is familiar with the idea that you prepare a statement
and then you execute it multiple times by just changing certain parameters.
A simple use case for demonstrating the idea
is shown below:

Query q = ... // An arbitrary complex query with a part that has a single
              // parameter of type int
for (int i = 0; i < 100; i++) {
  int paramValue = i;
  q.visit(new ParameterSetter(paramValue));
  TopDocs docs = searcher.search(q, 10);
}

Note that this is a very simplistic use case and does not correspond to
reality, where construction and execution are not done side by side.

I already implemented something to satisfy use cases like the one shown
above by introducing a new subclass of Query. However, I was wondering
whether there is already a mechanism to compile and execute queries with
parameters in Lucene and I am just reinventing the wheel.

Feedback is much appreciated!

Best,
Stamatis


Re: Parameterized queries in Lucene

2019-10-25 Thread Stamatis Zampetakis
In terms of performance, I am not going to insist a lot before performing
some accurate benchmarks.
(I will do that and share the results here if I find some time.)

The main benefit I see in having such a new query type in Lucene core is
that it guides users in creating parameterized queries, and it also acts
as a documentation point for the things discussed in this thread.

However, I respect your point of view that maybe it is better not to give
users this flexibility, since it does not fundamentally change the way the
query is evaluated in Lucene.

My question is completely answered. Thanks all for your time!

Best,
Stamatis

On Wed, Oct 23, 2019 at 2:29 PM Atri Sharma  wrote:

> I am still not sure of what benefits we see from the new query type. As
> Adrien mentioned,  Lucene queries are dependent on term specific
> statistics.
>
> I can imagine that this approach would save some GC cycles by reusing the
> same object, but you would need to have a very large number of objects for
> that to make a significant difference. In any case, for a user to emulate
> this behaviour, it should be a simple matter of writing a custom
> QueryVisitor.
>
> On Wed, 23 Oct 2019 at 17:52, Stamatis Zampetakis 
> wrote:
>
> > Hi Adrien,
> >
> > Are there optimizations based on statistics that take place during the
> > construction process of the query?
> >
> > When the query is about to be executed (passing it to the searcher) all
> > parameters must be bound.
> > So in terms of optimizations I was thinking that nothing would change.
> >
> > To make sure that we are all in the same page, I created a simple gist
> [1]
> > of how I had in mind to support parameterized queries in Lucene.
> > If you think that it is a bad idea then I will go back to the option of
> > recreating the query every time.
> >
> > Best,
> > Stamatis
> >
> > [1] https://gist.github.com/zabetak/ac2a8320c72779dd646230278662fdd4
> >
> >
> > On Wed, Oct 23, 2019 at 11:41 AM Adrien Grand  wrote:
> >
> > > I agree with Atri that it makes little sense to support parameterized
> > > queries like this. Lucene not only uses field statistics but also term
> > > statistics, so we couldn't make decisions about the right way to
> > > execute this query before knowing what the value of `a.name` is.
> > > Supporting something like that would send the message that it makes
> > > things more efficient while in practice it would only save a couple
> > > object creations.
> > >
> > > On Wed, Oct 23, 2019 at 9:26 AM Stamatis Zampetakis  >
> > > wrote:
> > > >
> > > > Hi Atri,
> > > >
> > > > Let's assume that we have the following simple SQL query over a
> Lucene
> > > > index holding book authors.
> > > >
> > > > SELECT *
> > > > FROM authors a
> > > > WHERE a.name = ? AND a.age > 15
> > > >
> > > > The where clause corresponds to a BooleanQuery combining a TermQuery
> > and
> > > a
> > > > IntPoint query.
> > > > During the prepare phase of the SQL statement we cannot really
> > construct
> > > > the BooleanQuery since the parameter
> > > > is not yet bound. We have to postpone the creation of the query to
> the
> > > > execution time of the statement where all parameters are bound.
> Running
> > > the
> > > > same query many times with a different parameter means recreating the
> > > Query
> > > > every time.
> > > >
> > > > I admit that creation of the Lucene query is not the most expensive
> > part
> > > of
> > > > the planning process still we can gain something by not creating the
> > > whole
> > > > query for every single parameter.
> > > >
> > > > In terms of design it appears more natural to build the Lucene query
> > > during
> > > > the preparation phase and not during the execution phase.
> > > >
> > > > Best,
> > > > Stamatis
> > > >
> > > >
> > > >
> > > >
> > > > On Mon, Oct 21, 2019 at 2:20 PM Atri Sharma  wrote:
> > > >
> > > > > I am curious — what use case are you targeting to solve here?
> > > > >
> > > > > In relational world, this is useful primarily due to the fact that
> > > prepared
> > > > > statements eliminate the need for re planning the query, thus
> saving
> > > the
> > > > > cost of iterating over a potentially large combinatorial space.

Re: Parameterized queries in Lucene

2019-10-23 Thread Stamatis Zampetakis
Hi Adrien,

Are there optimizations based on statistics that take place during the
construction process of the query?

When the query is about to be executed (passing it to the searcher) all
parameters must be bound.
So in terms of optimizations I was thinking that nothing would change.

To make sure that we are all in the same page, I created a simple gist [1]
of how I had in mind to support parameterized queries in Lucene.
If you think that it is a bad idea then I will go back to the option of
recreating the query every time.

Best,
Stamatis

[1] https://gist.github.com/zabetak/ac2a8320c72779dd646230278662fdd4


On Wed, Oct 23, 2019 at 11:41 AM Adrien Grand  wrote:

> I agree with Atri that it makes little sense to support parameterized
> queries like this. Lucene not only uses field statistics but also term
> statistics, so we couldn't make decisions about the right way to
> execute this query before knowing what the value of `a.name` is.
> Supporting something like that would send the message that it makes
> things more efficient while in practice it would only save a couple
> object creations.
>
> On Wed, Oct 23, 2019 at 9:26 AM Stamatis Zampetakis 
> wrote:
> >
> > Hi Atri,
> >
> > Let's assume that we have the following simple SQL query over a Lucene
> > index holding book authors.
> >
> > SELECT *
> > FROM authors a
> > WHERE a.name = ? AND a.age > 15
> >
> > The where clause corresponds to a BooleanQuery combining a TermQuery and
> a
> > IntPoint query.
> > During the prepare phase of the SQL statement we cannot really construct
> > the BooleanQuery since the parameter
> > is not yet bound. We have to postpone the creation of the query to the
> > execution time of the statement where all parameters are bound. Running
> the
> > same query many times with a different parameter means recreating the
> Query
> > every time.
> >
> > I admit that creation of the Lucene query is not the most expensive part
> of
> > the planning process still we can gain something by not creating the
> whole
> > query for every single parameter.
> >
> > In terms of design it appears more natural to build the Lucene query
> during
> > the preparation phase and not during the execution phase.
> >
> > Best,
> > Stamatis
> >
> >
> >
> >
> > On Mon, Oct 21, 2019 at 2:20 PM Atri Sharma  wrote:
> >
> > > I am curious — what use case are you targeting to solve here?
> > >
> > > In relational world, this is useful primarily due to the fact that
> prepared
> > > statements eliminate the need for re planning the query, thus saving
> the
> > > cost of iterating over a potentially large combinatorial space.
> However,
> > > for Lucene, there isn’t so much of a concept of a query plan (yet).
> Some
> > > queries try to achieve that (IndexOrDocValuesQuery for eg), but it is
> a far
> > > cry from what relational databases expose.
> > >
> > > Atri
> > >
> > > On Mon, 21 Oct 2019 at 17:42, Stamatis Zampetakis 
> > > wrote:
> > >
> > > > Hi all,
> > > > In the world of relational databases and SQL, the existence of
> > > > parameterized queries (aka. PreparedStatement) offers many
> advantages in
> > > > terms of security and performance.
> > > >
> > > > I guess everybody is familiar with the idea that you prepare a
> statement
> > > > and then you execute it multiple times by just changing certain
> > > parameters.
> > > > A simple use case for demonstrating the idea
> > > > is shown below:
> > > >
> > > > Query q = ... // An arbitrary complex query with a part that has a
> single
> > > > parameter of type int
> > > > for (int i=0; i<100; i++) {
> > > >   int paramValue = i;
> > > >   q.visit(new ParameterSetter(paramValue));
> > > >   TopDocs docs = searcher.search(q, 10);
> > > > }
> > > >
> > > > Note that this is a very simplistic use case and does not correspond
> to
> > > the
> > > > reality where the construction and execution are not done side by
> side.
> > > >
> > > > I already implemented something to satisfy use-cases like the one
> shown
> > > > above by introducing a new subclass of Query. However, I was
> wondering if
> > > > there is already a mechanism to compile and execute queries with
> > > parameters
> > > > in Lucene and I am just reinventing the wheel.
> > > >
> > > > Feedback is much appreciated!
> > > >
> > > > Best,
> > > > Stamatis
> > > >
> > > --
> > > Regards,
> > >
> > > Atri
> > > Apache Concerted
> > >
>
>
>
> --
> Adrien
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Parameterized queries in Lucene

2019-10-23 Thread Stamatis Zampetakis
Hi Atri,

Let's assume that we have the following simple SQL query over a Lucene
index holding book authors.

SELECT *
FROM authors a
WHERE a.name = ? AND a.age > 15

The WHERE clause corresponds to a BooleanQuery combining a TermQuery and an
IntPoint query.
During the prepare phase of the SQL statement we cannot really construct
the BooleanQuery since the parameter
is not yet bound. We have to postpone the creation of the query to the
execution time of the statement, where all parameters are bound. Running the
same query many times with different parameter values means recreating the
Query every time.
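For illustration, the "recreate every time" baseline for the query above
could look like the sketch below. The field names are taken from the SQL
example; the helper method name is made up.

import org.apache.lucene.document.IntPoint;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

static Query bindNameParameter(String nameValue) {
  // The whole BooleanQuery is rebuilt for every binding of the "name" parameter.
  return new BooleanQuery.Builder()
      .add(new TermQuery(new Term("name", nameValue)), Occur.MUST)
      .add(IntPoint.newRangeQuery("age", 16, Integer.MAX_VALUE), Occur.MUST)  // a.age > 15
      .build();
}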

I admit that the creation of the Lucene query is not the most expensive part
of the planning process; still, we can gain something by not recreating the
whole query for every single parameter binding.

In terms of design it appears more natural to build the Lucene query during
the preparation phase and not during the execution phase.

Best,
Stamatis




On Mon, Oct 21, 2019 at 2:20 PM Atri Sharma  wrote:

> I am curious — what use case are you targeting to solve here?
>
> In relational world, this is useful primarily due to the fact that prepared
> statements eliminate the need for re planning the query, thus saving the
> cost of iterating over a potentially large combinatorial space. However,
> for Lucene, there isn’t so much of a concept of a query plan (yet). Some
> queries try to achieve that (IndexOrDocValuesQuery for eg), but it is a far
> cry from what relational databases expose.
>
> Atri
>
> On Mon, 21 Oct 2019 at 17:42, Stamatis Zampetakis 
> wrote:
>
> > Hi all,
> > In the world of relational databases and SQL, the existence of
> > parameterized queries (aka. PreparedStatement) offers many advantages in
> > terms of security and performance.
> >
> > I guess everybody is familiar with the idea that you prepare a statement
> > and then you execute it multiple times by just changing certain
> parameters.
> > A simple use case for demonstrating the idea
> > is shown below:
> >
> > Query q = ... // An arbitrary complex query with a part that has a single
> > parameter of type int
> > for (int i=0; i<100; i++) {
> >   int paramValue = i;
> >   q.visit(new ParameterSetter(paramValue));
> >   TopDocs docs = searcher.search(q, 10);
> > }
> >
> > Note that this is a very simplistic use case and does not correspond to
> the
> > reality where the construction and execution are not done side by side.
> >
> > I already implemented something to satisfy use-cases like the one shown
> > above by introducing a new subclass of Query. However, I was wondering if
> > there is already a mechanism to compile and execute queries with
> parameters
> > in Lucene and I am just reinventing the wheel.
> >
> > Feedback is much appreciated!
> >
> > Best,
> > Stamatis
> >
> --
> Regards,
>
> Atri
> Apache Concerted
>


Use custom score in ConstantScoreQuery

2019-12-02 Thread Stamatis Zampetakis
Hi all,

Currently ConstantScoreQuery [1] returns a constant score equal to 1 for
every document that matches the query.

I would like to use the ConstantScoreQuery but with a different score value
that I can pass explicitly (via the constructor for instance).

This change may also benefit some other parts of Lucene where a
ConstantScoreQuery is wrapped around a BoostQuery simply for returning a
score of zero [2][3].

Does this change make sense? Shall I create a JIRA for it?

Best,
Stamatis

[1]
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/ConstantScoreQuery.java
[2]
https://github.com/apache/lucene-solr/blob/1d238c844e45f088a942aec14750c186c7a66d92/lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java#L253
[3]
https://github.com/apache/lucene-solr/blob/1d238c844e45f088a942aec14750c186c7a66d92/lucene/core/src/java/org/apache/lucene/search/BoostQuery.java#L97


Re: Semantics and performance regarding min number of the optional BooleanClauses

2020-03-30 Thread Stamatis Zampetakis
Thanks Adrien, indeed I missed the difference in scoring, since in my
context I do not need scores most of the time.

I will try to check if there is something that can be done with respect to
the rewriting, and if something promising comes up I will let you know.

Best,
Stamatis


On Mon, Mar 30, 2020, 7:20 PM Adrien Grand  wrote:

> Hi Stamatis,
>
> One thing that you missed regarding semantics is scoring. While (A B C)~2
> and ((+A +B) (+A +C) (+B +C)) would match the same documents, they would
> produce different scores.
>
> Moreover, many users come to this query because it is exactly what they
> need: matching k out of n clauses. In the example you gave it's pretty
> simple because there are only 3 clauses, but try to see what the generated
> query looks like when matching 3 out of 5 clauses, it's already very
> complex.
>
> It would be nice if we could rewrite the expanded form into the variant
> that sets a minimum number of matching should clauses, which should be more
> efficient. My worry is that it would be quite expensive to do, maybe to the
> point that it would more hurt than help on average. I'd be very happy to be
> proven wrong though, if we can cheaply rewrite the expanded form, this
> would be a good addition.
>
> On Mon, Mar 30, 2020 at 6:06 PM Stamatis Zampetakis 
> wrote:
>
> > Hi all,
> >
> > My question concerns the method setMinimumNumberShouldMatch in
> BooleaQuery
> > class.
> >
> > Lets assume that we have 3 queries (optional clauses), namely A, B, C and
> > we build a BooleanQuery specifying that at least 2 should match.
> >
> > In terms of semantics what I understand so far is that
> >
> > (A B C)~2 is equivalent to ((+A +B) (+A +C) (+B +C)).
> >
> > In other words a single BooleaQuery with a min should match parameter
> could
> > be rewritten as pure disjunctive BooleanQuery comprised from 3
> sub-queries.
> >
> > In terms of performance it seems that the two queries present different
> > behavior so the minMatch property is not only syntactic sugar and
> > apparently there is no rewriting between the two.
> >
> > Coming from the SQL world it is a bit hard for me to justify the addition
> > of a new operator that looks like syntactic sugar and at the same time is
> > more performant than the more primitive equivalents. I looked a bit in
> [1]
> > to understand motivation for adding this API but without much success.
> >
> > Summing up everything to three questions:
> > 1. Did I get right the semantics of this extra property or there are
> things
> > that I am missing?
> > (If my understanding is correct)
> > 2. What's the reason for introducing the minMatch property in the first
> > place? (Avoid creating huge queries?)
> > 3. Should the performance of the two queries shown above differ?
> >
> > Thanks in advance!
> >
> > Best,
> > Stamatis
> >
> > [1] https://issues.apache.org/jira/browse/LUCENE-395
> >
>
>
> --
> Adrien
>


Semantics and performance regarding min number of the optional BooleanClauses

2020-03-30 Thread Stamatis Zampetakis
Hi all,

My question concerns the setMinimumNumberShouldMatch method in the
BooleanQuery class.

Let's assume that we have 3 queries (optional clauses), namely A, B, and C,
and we build a BooleanQuery specifying that at least 2 should match.

In terms of semantics what I understand so far is that

(A B C)~2 is equivalent to ((+A +B) (+A +C) (+B +C)).

In other words, a single BooleanQuery with a minimum-should-match parameter
can be rewritten as a pure disjunctive BooleanQuery composed of 3 sub-queries.
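Concretely, the first form would be built along these lines (A, B, and C
stand for arbitrary sub-queries; the helper method is just an illustration):

import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

static Query atLeastTwoOf(Query a, Query b, Query c) {
  return new BooleanQuery.Builder()
      .add(a, Occur.SHOULD)
      .add(b, Occur.SHOULD)
      .add(c, Occur.SHOULD)
      .setMinimumNumberShouldMatch(2)  // (A B C)~2
      .build();
}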

In terms of performance, the two queries seem to behave differently, so the
minMatch property is not only syntactic sugar, and apparently there is no
rewriting between the two.

Coming from the SQL world, it is a bit hard for me to justify the addition
of a new operator that looks like syntactic sugar and at the same time is
more performant than the more primitive equivalents. I looked a bit at [1]
to understand the motivation for adding this API, but without much success.

Summing everything up in three questions:
1. Did I get the semantics of this extra property right, or are there things
that I am missing?
(Assuming my understanding is correct:)
2. What's the reason for introducing the minMatch property in the first
place? (To avoid creating huge queries?)
3. Should the performance of the two queries shown above differ?

Thanks in advance!

Best,
Stamatis

[1] https://issues.apache.org/jira/browse/LUCENE-395


Re: [VOTE] Lucene logo contest

2020-06-18 Thread Stamatis Zampetakis
Hi all,

Apparently there is an extension of the deadline, so in the meantime, if
somebody has ideas about the logo, feel free to share them.
I can give it another try, although I cannot promise anything about the
final outcome :P

Best,
Stamatis


On Thu, Jun 18, 2020 at 7:00 PM  wrote:

> Hi Ryan,-
>
>   I very much appreciate this opportunity to submit my designs.
>
> Best regards
>
>
> On 6/18/20 1:29 AM, Ryan Ernst wrote:
> > > IMHO this vote is invalid because...
> > > it doesn’t include the red / orange variants submitted by Dustin Haver
> >
> > I considered the latest submission by Dustin Haver to be his
> > submission, but I can see how some might like the other better and it
> > should have been part of the vote.
> >
> > > I propose to restart the VOTE to include all submissions.
> >
> > Given that I omitted the submission above, that seems reasonable. And
> > since we are restarting, I guess we can allow Baris to add in an entry.
> >
> > Baris, please add your entry to the jira issue. I will restart the
> > vote next week.
> >
> > > If we're going to have more options, I suggest we use "ranked voting"
> >
> > I considered rank voting, but tallying a rank vote by hand can be
> > incredibly tedious. I don't think we should use any external tools
> > since that prohibits verification on who is voting from the PMC.
> > However, given the lastingness of this decision, I guess it is fair to
> > do the necessary harder tallying work of rank choice voting over
> > email. When I restart the vote, I will give instructions on making
> > multiple selections.
> >
> > So, consider this vote CLOSED and VOID.
> >
> >
> >
> > On Wed, Jun 17, 2020 at 8:27 AM David Smiley <david.w.smi...@gmail.com> wrote:
> >
> > If we're going to have more options, I suggest we use "ranked
> > voting": https://en.wikipedia.org/wiki/Ranked_voting
> > <
> https://urldefense.com/v3/__https://en.wikipedia.org/wiki/Ranked_voting__;!!GqivPVa7Brio!JgWdGehe0o2fApygEKb6X6IZpJAzG1ZZh0cvE1I1R4xX1eJ9iyu7IqgEYaVxqtj8TA$
> >
> >
> > If you create a Google Form based submission which supports a
> > ranked choice input, then this should make it probably not hard to
> > tally the results correctly. A PMC boolean would be helpful too.
> >
> > ~ David
> >
> >
> > On Wed, Jun 17, 2020 at 11:14 AM Andrzej Białecki <a...@getopt.org> wrote:
> >
> > IMHO this vote is invalid because it doesn’t include all
> > submissions linked to that issue. Specifically, it doesn’t
> > include the red / orange variants submitted by Dustin Haver
> > (which I personally prefer over the sickly green ones … ;) )
> >
> > I propose to restart the VOTE to include all submissions.
> >
> >> On 17 Jun 2020, at 17:04, Adrien Grand <jpou...@gmail.com> wrote:
> >>
> >> A. (PMC) I like that it retains the same idea as our current
> >> logo with a more modern look.
> >>
> >> On Wed, Jun 17, 2020 at 4:58 PM Andi Vajda <o...@ovaltofu.org> wrote:
> >>
> >>
> >> C. (current logo)
> >>
> >> Andi.. (pmc)
> >>
> >>> On Jun 15, 2020, at 15:08, Ryan Ernst <r...@iernst.net> wrote:
> >>>
> >>> 
> >>> Dear Lucene and Solr developers!
> >>>
> >>> In February a contest was started to design a new logo
> >>> for Lucene [1]. That contest concluded, and I am now
> >>> (admittedly a little late!) calling a vote.
> >>>
> >>> The entries are labeled as follows:
> >>>
> >>> A. Submitted by Dustin Haver [2]
> >>>
> >>> B. Submitted by Stamatis Zampetakis [3] Note that this
> >>> has several variants. Within the linked entry there are
> >>> 7 patterns and 7 color palettes. Any vote for B should
> >>> contain the pattern number, like B1 or B3. If a B
> >>> variant wins, we will have a followup vote on the color
> >>> palette.
> >>>
> >>> C. The current Lucene logo [4]
> >>>
> >>> Please vote