Re: ArithmeticException: due to integer overflow during lucene merging

2024-05-07 Thread Michael Sokolov
This is definitely a confusing error condition. If we can add more
information without creating an undue burden for the indexer it would
be nice, but I think this will be very challenging here since the
exception is thrown at a low level in the code where there might not
be a lot of useful info (i.e., the field name) to provide. And I expect
there are other places that make a similar assumption we would have to
track down?
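
For reference, the "integer overflow" itself comes from Math.toIntExact in
the stack trace below; a minimal illustration (plain Java, not Lucene code):

    // throws java.lang.ArithmeticException: integer overflow, as in the trace,
    // once an accumulated long total no longer fits in an int
    long total = Integer.MAX_VALUE + 1L;
    int packed = Math.toIntExact(total);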

On Tue, May 7, 2024 at 9:10 AM Jerven Tjalling Bolleman
 wrote:
>
> Dear Michael,
>
> Looking deeper into this, I think we overflowed a term frequency field.
> Looking at some statistics, in a previous release we had 1,288,526,281
> instances of a certain field, and this would be larger now. Each of these
> would have had a limited set of values. But crucially, nearly all of them
> would have had the term "positional" or "non-positional" added to the
> document.
>
> There is no good reason to do this today; we should just turn this into
> a boolean field and update the UI. I will do this and report back.
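>
> A minimal sketch of what that boolean-style field might look like (field
> name hypothetical, not our actual schema):
>
>     // indexed as a single untokenized term, replacing the "positional" /
>     // "non-positional" term that was added to nearly every document
>     doc.add(new StringField("positional", isPositional ? "true" : "false", Field.Store.NO));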
>
> Do you think a patch adding a try/catch with a more informative log
> message would be appreciated by the community? E.g., mentioning the field
> name in the exception?
>
> Regards,
> Jerven
>
> On 5/7/24 14:52, Jerven Tjalling Bolleman wrote:
> > Dear Michael,
> >
> > Thank you for your help.
> >
> > We don't use custom term frequencies (I just double checked with a code
> > search).
> > We also always merge down to one segment (historical but also we index
> > once and then there are no changes for a week to a month and then we
> > reindex every document from scratch).
> >
> > Your response is very helpful already and I very much appreciate it as
> > it cuts down the search space significantly.
> >
> > Regards,
> > Jerven
> >
> >
> > On 5/7/24 14:03, Michael Sokolov wrote:
> >> It seems as if the term frequency for some term exceeded the maximum.
> >> This can happen if you supplied custom term frequencies, e.g. with
> >> https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/analysis/tokenattributes/TermFrequencyAttribute.html?is-external=true
> >> . The behavior didn't change since 8.x but it's possible that the
> >> merging brought together some very "high frequency" terms that were
> >> previously not in the same segment?
> >>
> >> On Tue, May 7, 2024 at 4:03 AM Jerven Tjalling Bolleman
> >>  wrote:
> >>>
> >>> Dear Lucene community,
> >>>
> >>> This morning I found this exception in our logs. This was the first time
> >>> we indexed this data with Lucene 9.10. Before that, we were still on the
> >>> Lucene 8.x branch. Between the last indexing with 8 and this one with
> >>> 9.10 we have a bit more data, so it could be something else that went
> >>> over a limit.
> >>>
> >>> Unfortunately, from this log message I am at a loss for what is going
> >>> on. And what I could do to prevent this from happening. Does anyone have
> >>> any ideas?
> >>>
> >>> Regards,
> >>> Jerven Bolleman
> >>>
> >>>
> >>> Exception in thread "Lucene Merge Thread #202"
> >>> org.apache.lucene.index.MergePolicy$MergeException:
> >>> java.lang.ArithmeticException: integer overflow
> >>> at
> >>> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:735)
> >>> at
> >>> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:727)
> >>> Caused by: java.lang.ArithmeticException: integer overflow
> >>> at java.base/java.lang.Math.toIntExact(Math.java:1135)
> >>> at
> >>> org.apache.lucene.store.DataOutput.writeGroupVInts(DataOutput.java:354)
> >>> at
> >>> org.apache.lucene.codecs.lucene99.Lucene99PostingsWriter.finishTerm(Lucene99PostingsWriter.java:379)
> >>> at
> >>> org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:173)
> >>> at
> >>> org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.write(Lucene90BlockTreeTermsWriter.java:1097)
> >>> at
> >>> org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter.write(Lucene90BlockTreeTermsWriter.java:398)
> >>> at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:95)
> >>> at
> >>> org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$Fie

Re: ArithmeticException: due to integer overflow during lucene merging

2024-05-07 Thread Michael Sokolov
It seems as if the term frequency for some term exceeded the maximum.
This can happen if you supplied custom term frequencies, e.g. with
https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/analysis/tokenattributes/TermFrequencyAttribute.html?is-external=true
. The behavior didn't change since 8.x but it's possible that the
merging brought together some very "high frequency" terms that were
previously not in the same segment?

On Tue, May 7, 2024 at 4:03 AM Jerven Tjalling Bolleman
 wrote:
>
> Dear Lucene community,
>
> This morning I found this exception in our logs. This was the first time
> we indexed this data with Lucene 9.10. Before that, we were still on the
> Lucene 8.x branch. Between the last indexing with 8 and this one with
> 9.10 we have a bit more data, so it could be something else that went
> over a limit.
>
> Unfortunately, from this log message I am at a loss for what is going
> on. And what I could do to prevent this from happening. Does anyone have
> any ideas?
>
> Regards,
> Jerven Bolleman
>
>
> Exception in thread "Lucene Merge Thread #202"
> org.apache.lucene.index.MergePolicy$MergeException:
> java.lang.ArithmeticException: integer overflow
> at
> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:735)
> at
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:727)
> Caused by: java.lang.ArithmeticException: integer overflow
> at java.base/java.lang.Math.toIntExact(Math.java:1135)
> at org.apache.lucene.store.DataOutput.writeGroupVInts(DataOutput.java:354)
> at
> org.apache.lucene.codecs.lucene99.Lucene99PostingsWriter.finishTerm(Lucene99PostingsWriter.java:379)
> at
> org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:173)
> at
> org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.write(Lucene90BlockTreeTermsWriter.java:1097)
> at
> org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter.write(Lucene90BlockTreeTermsWriter.java:398)
> at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:95)
> at
> org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:205)
> at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:209)
> at
> org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:298)
> at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:137)
> at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5252)
> at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4740)
> at
> org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:6541)
> at
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:639)
> at
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:700)
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Help running the demo program

2024-04-22 Thread Michael Sokolov
I also found this helpful documentation by looking in the source code
of SearchFiles.java: https://lucene.apache.org/core/9_10_0/demo/

On Mon, Apr 22, 2024 at 4:40 AM Stefan Vodita  wrote:
>
> Hi Siddharth,
>
> If you happen to be using IntelliJ, you can run a demo class from the IDE.
> It probably works with other IDEs too, though I haven't tried it.
>
>
> Stefan
>
> On Sun, 21 Apr 2024 at 23:59, Siddharth Jain  wrote:
>
> > Hello,
> >
> > I am a new user to Lucene. I checked out the Lucene repo
> >  and synced
> > to releases/lucene/9.10.0 tag. From there I have run following commands:
> >
> > ./gradlew
> > ./gradlew assemble
> >
> > I would now like to run the demo program. How can I do that? I see some
> > class files under lucene/demo/build/classes/java/main but how do I build
> > the full classpath with all the dependencies needed to run the demo
> > program? Can anyone help me? Thanks,
> >
> > S.
> >

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: hnsw parameters for vector search

2024-02-01 Thread Michael Sokolov
To get the best results it's necessary to tune these parameters for each vector
model. My suggestion is to use a subset of your 100M vectors for parameter
optimization, to save time while iterating through the parameter space, as
you will indeed need to reindex in order to measure.

Generally speaking, increasing maxConn and beam width will lead to higher
recall, but more latency.

You can use the KnnGraphTester tool in the luceneutil package to get started.
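
A hedged sketch of how those parameters are wired in at index time (assumes
Lucene 9.9+; the maxConn/beamWidth values are illustrative only):

    // override the per-field vectors format on the codec
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    iwc.setCodec(new Lucene99Codec() {
      @Override
      public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
        return new Lucene99HnswVectorsFormat(32 /* maxConn */, 200 /* beamWidth */);
      }
    });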

On Tue, Jan 30, 2024, 11:05 AM Michael Wechner 
wrote:

> Re your "second" question about suboptimal results, I think Nils Reimers
> explains quite nicely why this might happen, see for example
>
> https://www.youtube.com/watch?v=Abh3YCahyqU
>
> HTH
>
> Michael
>
>
>
> On 30.01.24 at 15:48, Moll, Dr. Andreas wrote:
> > Hi,
> >
> > the HNSW documentation for the Lucene HnswGraph and the Solr vector
> > search is not very verbose, especially in regards to the parameters
> > hnswMaxConn and hnswBeamWidth.
> > I find it hard to come up with sensible values for these parameters by
> > reading the paper from 2018.
> > Does anyone have experience with the influence of the parameters on the
> > results? As far as I understand the code, the graph is created at indexing
> > time, so it would be time-intensive to come up with the optimal values for a
> > specific use case by trial and error?
> >
> > We have a Solr index with roughly 100 million embeddings, and in
> > synthetic randomized benchmarks around 14% of requests will result
> > in a suboptimal answer (based on the cosine vector similarity).
> > I expected this "error" rate to be much smaller. I would love to hear
> > your experiences.
> >
> > Best regards
> >
> > Andreas Moll
> >
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: DisjunctionMinQuery

2023-11-10 Thread Michael Sokolov
> In Lucene scores should go up for more relevancy.

That is the case for combining child scores with min. min() is monotonic --
if its arguments increase, the result does not decrease, it only stays the
same or increases, so I think it is a valid scoring operation for Lucene.
And it makes some logical sense if you think of the terms as an ensemble:
you want all of them to match, and the score scales according to the number
of times they all occur ... something like that.
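
For contrast, a sketch of how the existing max-based combinator is
constructed (the tie-breaker blends the non-maximum scores back in):

    Query q = new DisjunctionMaxQuery(
        List.of(new TermQuery(new Term("title", "dalmatian")),
                new TermQuery(new Term("body", "dalmatian"))),
        0.1f); // tieBreakerMultiplier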

On Thu, Nov 9, 2023 at 3:09 PM Marc D'Mello  wrote:

> Hi all,
>
> Once again, thanks for the responses! After thinking about this a bit more,
> I think Michael's response makes sense now. I do agree that partial matches
> shouldn't be ranked higher than conjunctive matches, so I think it doesn't
> make sense in my use case to use a DisjunctiveMinQuery (I think I would
> need a AndMinQuery or something like that). This also answers my initial
> question.
>
> I did have a question about this though:
>
> in that case you should use something like 1/x as your scoring function
> > in the sub-clauses
> >
>
> Doesn't using 1/x as a scoring function, even in the subclauses, still
> cause an issue where the output score will be inversely correlated to the
> indexed term score? I think that would break BMW right? Or maybe I am
> misunderstanding the suggestion.
>
> Thanks,
> Marc
>
> On Thu, Nov 9, 2023 at 10:18 AM Uwe Schindler  wrote:
>
> > Hi,
> >
> > in that case you should use something like 1/x as your scoring function
> > in the sub-clauses. In Lucene scores should go up for more relevancy.
> > This must also apply for function scoring.
> >
> > Uwe
> >
> > Am 09.11.2023 um 19:14 schrieb Marc D'Mello:
> > > Hi Michael,
> > >
> > > Thanks for the response! So to answer your first question, yes this
> would
> > > keep the lowest score from the matching sub-scorers. Our use case is
> that
> > > we have a custom term-level score overriding term frequency and we want
> > to
> > > take the min of that as part of our scoring function. Maybe it's a
> niche
> > > use case?
> > >
> > > Thanks,
> > > Marc
> > >
> > > On Wed, Nov 8, 2023 at 3:19 PM Michael Froh  wrote:
> > >
> > >> Hi Marc,
> > >>
> > >> Can you clarify what the semantics of a DisjunctionMinQuery would be?
> > Would
> > >> you keep the score for the *lowest* scoring disjunct (plus some
> > tiebreaker
> > >> applied to the other matching disjuncts)?
> > >>
> > >> I'm trying to imagine how that would work compared to the classic
> DisMax
> > >> use-case. Say I'm searching for "dalmatian" using a DisMax query over
> > term
> > >> queries against title and body. A match on title is probably going to
> > score
> > >> higher than a match against the body, just because the title has a
> > shorter
> > >> length (and the doc frequency of individual terms in the title is
> > likely to
> > >> be lower, since there are fewer terms overall). With DisMax, a match
> on
> > >> title alone will score higher than a match on body, and the tie-break
> > will
> > >> tend to score a match on title and body higher than a match on title
> > alone.
> > >>
> > >> With a DisMin (assuming you keep the lowest score), then a match on
> > title
> > >> and body would probably score lower than a match on title alone. That
> > feels
> > >> weird to me, but I might be missing the use-case.
> > >>
> > >> How would you use a DisMinQuery?
> > >>
> > >> Thanks,
> > >> Froh
> > >>
> > >>
> > >>
> > >> On Wed, Nov 8, 2023 at 10:50 AM Marc D'Mello 
> > wrote:
> > >>
> > >>> Hi all,
> > >>>
> > >>> I noticed we have a DisjunctionMaxQuery
> > >>> <
> > >>>
> > >>
> >
> https://github.com/apache/lucene/blob/branch_9_7/lucene/core/src/java/org/apache/lucene/search/DisjunctionMaxQuery.java
> > >>> but
> > >>> not a corresponding DisjunctionMinQuery. I was just wondering if
> there
> > >> was
> > >>> a specific reason for that? Or is it just that it is not a common
> query
> > >> to
> > >>> use?
> > >>>
> > >>> Thanks!
> > >>> Marc
> > >>>
> > --
> > Uwe Schindler
> > Achterdiek 19, D-28357 Bremen
> > https://www.thetaphi.de
> > eMail: u...@thetaphi.de
> >
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>


Re: Can the BooleanQuery execution be optimized with same term queries?

2023-09-19 Thread Michael Sokolov
Another thing to check, beyond whether the correct documents are
matched, is whether the correct score is returned. I'm not sure
actually how it works but I can imagine that a query for "red red
wine" would produce a higher score for documents having "red red wine"
than it would for documents having "red wine wine"

On Tue, Sep 19, 2023 at 2:37 AM YouPeng Yang  wrote:
>
> Hi All
>
> During my unemployment, the happiest thing has been diving in to study the
> Lucene source code; thanks for all the work.
>
> I have a question about the execution of BooleanQuery: although
> BooleanQuery#rewrite does some work to remove duplicate FILTER and SHOULD
> clauses, the same term query can still be executed several times.
>
> I copied the test code from TestBooleanQuery to verify my assumption.
>
>   Unit Test Code as follows:
>
>
>
> BooleanQuery.Builder qBuilder = new BooleanQuery.Builder();
> qBuilder.add(new TermQuery(new Term("field", "b")), Occur.FILTER);
> qBuilder.add(new TermQuery(new Term("field", "a")), Occur.SHOULD);
> qBuilder.add(new TermQuery(new Term("field", "d")), Occur.SHOULD);
>
> BooleanQuery.Builder nestQuery = new BooleanQuery.Builder();
> nestQuery.add(new TermQuery(new Term("field", "b")), Occur.FILTER);
> nestQuery.add(new TermQuery(new Term("field", "a")), Occur.SHOULD);
> nestQuery.add(new TermQuery(new Term("field", "d")), Occur.SHOULD);
>
> qBuilder.add(nestQuery.build(), Occur.SHOULD);
> qBuilder.setMinimumNumberShouldMatch(1);
>
> BooleanQuery q = qBuilder.build();
> assertSameScoresWithoutFilters(searcher, q);
>
>
> In this test, the top boolean query (qBuilder) contains 4 clauses (3 simple
> term queries, 1 nested boolean query that contains the same 3 term queries).
>
> The underlying execution is that all 6 term queries were executed (see
> TermQuery.TermWeight#getTermsEnum()).
>
> Apparently and theoretically, the executions could be merged to reduce the
> time, right?
>
> So, is it possible or necessary for Lucene to merge the executions to
> optimize query performance, even though I know the optimization may be
> difficult?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Top docs depend on value of K nearest neighbour

2023-08-03 Thread Michael Sokolov
Well, it is "approximate" KNN and can get caught in local minima
(maxima?). Increasing K has, indirectly, the effect of expanding the
search space, because the minimum score in the priority queue (the score of
the Kth item) is used as a threshold for deciding when to terminate
the search.
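
One common mitigation (a sketch, not a cure for the approximation) is to
oversample: ask the query for a larger k than you display, which widens the
portion of the graph explored, then keep only the top hits:

    Query query = new KnnFloatVectorQuery("vector-field-name", queryVector, 100); // larger k
    TopDocs top10 = searcher.search(query, 10); // display only the best 10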

On Wed, Aug 2, 2023 at 5:19 PM Michael Wechner
 wrote:
>
> Hi
>
> I use Lucene 9.7.0 but experienced the same behaviour with Lucene 9.6.0
> when doing vector search as follows:
>
> I have indexed about 200 vectors (dimension 768)
>
> I build the query as follows
>
>   Query query = new KnnFloatVectorQuery("vector-field-name",
> queryVector, k);
>
> and do the search as follows:
>
> TopDocs topDocs = searcher.search(query, k);
>
> When I set k=27 then the top doc has a score of 0.7757
>
> When I set the "k" value a little lower, e.g. k=24 then the top doc has
> a score of 0.7319 and is not the same document as the one with the score
> of 0.7757
>
> Any idea what I might be doing wrong or what I misunderstand?
>
> Why does the value of k have an effect on the returned top doc?
>
> Thanks
>
> Michael
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Proposal to Reimplement Disk Usage API - Request for Feedback and Collaboration

2023-05-26 Thread Michael Sokolov
Hi Deepika, that would be a welcome addition - we had an earlier
discussion about it; see the thread here:
https://markmail.org/message/hq7jvobsnxwp7iat

Please be careful not to copy the code from Elastic as it is not
shared under an open license that permits copying

On Wed, May 24, 2023 at 3:19 PM Deepika Sharma
 wrote:
>
> Dear Community
>
> I am writing to share thoughts on the existing Disk Usage API. I believe
> there is an opportunity to improve its functionality and performance
> through a reimplementation.
> Currently, the best tool we have for this is based on a custom Codec that
> separates storage by field; to get the statistics we read an existing index
> and write it out using AddIndexes and force-merging, using the custom
> codec. This is time-consuming and inefficient and tends not to get done.
> What we could do is similar to the functionality in Elasticsearch. The
> DiskUsage API 
> estimates the storage of each field by iterating its structures (i.e.,
> inverted index, doc-values, stored fields, etc.) and tracking the number of
> read-bytes. Since we will enumerate the index, it wouldn't require us to
> force-merge all the data through addIndexes, and at the same time it
> doesn't invade the codec apis.
>
> Thank you for your time and consideration. I would greatly appreciate any
> input, suggestions, or concerns you might have regarding this proposal and
> eagerly look forward to your response.
>
> Best regards,

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Can I simplify this bit of query boosting?

2023-05-11 Thread Michael Sokolov
You might also want to have a look at FeatureField. This can be used
to associate a score with a particular term.
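
A minimal sketch of FeatureField usage (field and feature names hypothetical):

    // index time: attach a score contribution to the document
    doc.add(new FeatureField("features", "relnotes_version", versionAsFloat));

    // query time: fold it into the score alongside the user query
    Query boost = FeatureField.newSaturationQuery("features", "relnotes_version");
    Query q = new BooleanQuery.Builder()
        .add(userQuery, Occur.MUST)
        .add(boost, Occur.SHOULD)
        .build();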

On Thu, May 11, 2023 at 1:13 PM Hrvoje Lončar  wrote:
>
> I had a situation where I wanted to sort a list of articles based on the
> amount of data entered. For example, an article having a photo, description,
> and ingredients should perform better compared to one having only a name and
> photo.
> For that purpose I created a numeric field that holds a calculated value
> named completeness. Later, when executing a query, this number is used as a
> sort modifier - in my case by using reverse order.
> My project is based on Hibernate Search, so I can't really put a code
> snippet here. This numeric value does not have to be the 1st sort
> modifier. First you put the main sort rule and then you can refine the sort
> with this numeric value.
> I hope it helps - at least to give you an idea which way to go.
> BR,
> Hrvoje
>
> On Thu, 11 May 2023, 15:44 Trevor Nicholls, 
> wrote:
>
> > Hi, I've hit a wall here.
> >
> >
> >
> > In brief, users search a library of documents. Every indexed document has a
> > version number field which is always populated for release notes, sometimes
> > for other docs. Every document also has a category field which is how
> > release notes are identified, among other content types.
> >
> >
> >
> > The requirement is to make sure that release notes are boosted relative to
> > other content, and that release notes with higher versions are boosted more
> > than those with lower versions.
> >
> >
> >
> > I've currently implemented a crude method to achieve this, and the crucial
> > part of the process is here:
> >
> >
> >
> >   // have IndexReader reader, IndexSearcher searcher, Analyzer analyzer,
> > String userQuery
> >
> >   QueryParser parser = new QueryParser( "content", analyzer );
> >
> >   parser.setDefaultOperator( QueryParserBase.AND_OPERATOR );
> >
> >   BooleanQuery query = new BooleanQuery.Builder()
> >
> >  .add( parser.parse( userQuery ), Occur.MUST )
> >
> >  .add( new BoostQuery( parser.parse( "category:relnotes version:9*" ),
> > 90.0f ), Occur.SHOULD )
> >
> >  .add( new BoostQuery( parser.parse( "category:relnotes version:8*" ),
> > 80.0f ), Occur.SHOULD )
> >
> >  .add( new BoostQuery( parser.parse( "category:relnotes version:7*" ),
> > 70.0f ), Occur.SHOULD )
> >
> >  .add( new BoostQuery( parser.parse( "category:relnotes version:6*" ),
> > 60.0f ), Occur.SHOULD )
> >
> >  .add( new BoostQuery( parser.parse( "category:relnotes version:5*" ),
> > 50.0f ), Occur.SHOULD )
> >
> >  .add( new BoostQuery( parser.parse( "category:relnotes version:4*" ),
> > 40.0f ), Occur.SHOULD )
> >
> >  .add( new BoostQuery( parser.parse( "category:relnotes version:3*" ),
> > 30.0f ), Occur.SHOULD )
> >
> >  .add( new BoostQuery( parser.parse( "category:relnotes version:2*" ),
> > 20.0f ), Occur.SHOULD )
> >
> >  .add( new BoostQuery( parser.parse( "category:relnotes version:1*" ),
> > 10.0f ), Occur.SHOULD )
> >
> >  .build();
> >
> >
> >
> > I found through experimentation that the boost factors are not
> > multiplicative (as most of the explanations on the web implied) but are
> > simply added to the score. If I've misunderstood how boosting works, please
> > enlighten me!
> >
> > The versions and boost factors above are arbitrary just to keep the example
> > simple; in reality the versions cover a much wider range and the boost
> > values do too.
> >
> >
> >
> > This is working to a degree. But it's not granular enough, I really want
> > the
> > boost factor to be calculated directly from the version value, if that is
> > possible.
> >
> > I also imagine doing it this way makes searches quite expensive.
> >
> >
> >
> > How could I improve this?
> >
> >
> >
> > cheers
> >
> > T
> >
> >
> >
> >
> >
> >

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Question about index segment search order

2023-05-11 Thread Michael Sokolov
Maybe ask this question on solr-dev then? I'm not familiar with how that
collector works. Does it count hits across all segments, or only within a
single segment?

On Tue, May 9, 2023 at 1:36 PM Wei  wrote:
>
> Hi Michael,
>
> I am applying early termination with Solr's EarlyTerminatingCollector
> https://github.com/apache/solr/blob/d9ddba3ac51ece953d762c796f62730e27629966/solr/core/src/java/org/apache/solr/search/EarlyTerminatingCollector.java
> ,
> which triggers EarlyTerminatingCollectorException in SolrIndexSearcher
> https://github.com/apache/solr/blob/d9ddba3ac51ece953d762c796f62730e27629966/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L281
>
> Thanks,
> Wei
>
>
> On Thu, May 4, 2023 at 11:47 AM Michael Sokolov  wrote:
>
> > Yes, sorry I didn't mean to imply you couldn't control this if you
> > want to. I guess in the typical setup it is not predictable. How are
> > you applying early termination? Are you using a standard Lucene
> > Collector or do you have your own?
> >
> > On Thu, May 4, 2023 at 2:03 PM Patrick Zhai  wrote:
> > >
> > > Hi Mike,
> > > Just want to mention if the user chooses to use single thread to index
> > and
> > > use LogXXMergePolicy then the document order will be preserved as index
> > > order.
> > >
> > >
> > >
> > > On Thu, May 4, 2023 at 10:04 AM Wei  wrote:
> > >
> > > > Hi Michael,
> > > >
> > > > We are interested in the segment sequence for early termination. In our
> > > > case there is always a large dominant segment after index rebuild,
> > then
> > > > many small segments are generated with continuous updates as time goes
> > by.
> > > > When early termination is applied, the limit could be reached just for
> > > > traversing the dominant segment alone and the newer smaller segments
> > > > don't get a chance. If we can control the segment sequence so that
> > the
> > > > newer segments are visited first, the documents with recent updates
> > can be
> > > > retrieved with early termination.  Do you think this makes sense? Any
> > > > suggestion is appreciated.
> > > >
> > > > Thanks,
> > > > Wei
> > > >
> > > > On Thu, May 4, 2023 at 3:33 AM Michael Sokolov 
> > wrote:
> > > >
> > > > > There is no meaning to the sequence. The segments are created
> > > > concurrently
> > > > > by many threads and the merge process will merge them without
> > regards to
> > > > > any ordering.
> > > > >
> > > > >
> > > > >
> > > > > On Wed, May 3, 2023, 1:09 PM Patrick Zhai 
> > wrote:
> > > > >
> > > > > > For that part I'm not entirely sure, if other folks know it please
> > > > chime
> > > > > in
> > > > > > :)
> > > > > >
> > > > > > On Wed, May 3, 2023 at 8:48 AM Wei  wrote:
> > > > > >
> > > > > > > Thanks Patrick! In the default case when no LeafSorter is
> > provided,
> > > > are
> > > > > > the
> > > > > > > segments traversed in the order of creation time, i.e. the oldest
> > > > > segment
> > > > > > > is always visited first?
> > > > > > >
> > > > > > > Wei
> > > > > > >
> > > > > > > On Tue, May 2, 2023 at 7:22 PM Patrick Zhai 
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi Wei,
> > > > > > > > Lucene in general iterate through the index in the order of
> > what is
> > > > > > > > recorded in the SegmentInfos
> > > > > > > > <
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java#L140
> > > > > > > > >
> > > > > > > > And at search time, you can specify the order using LeafSorter
> > > > > > > > <
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/DirectoryReader.java#L75
> > > > > > > > >
> > > > > > > > when you're opening the IndexReader
> > > > > > > >
> > > > > > > > Patrick
> > > > > > > >
> > > > > > > > On Tue, May 2, 2023 at 5:28 PM Wei 
> > wrote:
> > > > > > > >
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > We have a index that has multiple segments generated with
> > > > > continuous
> > > > > > > > > updates. Does Lucene  have a specific order when iterate
> > through
> > > > > the
> > > > > > > > > segments (assuming single query thread) ? Can the order be
> > > > > customized
> > > > > > > > that
> > > > > > > > > the latest generated segments are searched first?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Wei
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Question about index segment search order

2023-05-04 Thread Michael Sokolov
Yes, sorry I didn't mean to imply you couldn't control this if you
want to. I guess in the typical setup it is not predictable. How are
you applying early termination? Are you using a standard Lucene
Collector or do you have your own?

On Thu, May 4, 2023 at 2:03 PM Patrick Zhai  wrote:
>
> Hi Mike,
> Just want to mention if the user chooses to use single thread to index and
> use LogXXMergePolicy then the document order will be preserved as index
> order.
>
>
>
> On Thu, May 4, 2023 at 10:04 AM Wei  wrote:
>
> > Hi Michael,
> >
> > We are interested in the segment sequence for early termination. In our
> > case there is always a large dominant segment after index rebuild,  then
> > many small segments are generated with continuous updates as time goes by.
> > When early termination is applied, the limit could be reached just for
> > traversing the dominant segment alone and the newer smaller segments
> > don't get a chance. If we can control the segment sequence so that the
> > newer segments are visited first, the documents with recent updates can be
> > retrieved with early termination.  Do you think this makes sense? Any
> > suggestion is appreciated.
> >
> > Thanks,
> > Wei
> >
> > On Thu, May 4, 2023 at 3:33 AM Michael Sokolov  wrote:
> >
> > > There is no meaning to the sequence. The segments are created
> > concurrently
> > > by many threads and the merge process will merge them without regards to
> > > any ordering.
> > >
> > >
> > >
> > > On Wed, May 3, 2023, 1:09 PM Patrick Zhai  wrote:
> > >
> > > > For that part I'm not entirely sure, if other folks know it please
> > chime
> > > in
> > > > :)
> > > >
> > > > On Wed, May 3, 2023 at 8:48 AM Wei  wrote:
> > > >
> > > > > Thanks Patrick! In the default case when no LeafSorter is provided,
> > are
> > > > the
> > > > > segments traversed in the order of creation time, i.e. the oldest
> > > segment
> > > > > is always visited first?
> > > > >
> > > > > Wei
> > > > >
> > > > > On Tue, May 2, 2023 at 7:22 PM Patrick Zhai 
> > > wrote:
> > > > >
> > > > > > Hi Wei,
> > > > > > Lucene in general iterate through the index in the order of what is
> > > > > > recorded in the SegmentInfos
> > > > > > <
> > > > > >
> > > > >
> > > >
> > >
> > https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java#L140
> > > > > > >
> > > > > > And at search time, you can specify the order using LeafSorter
> > > > > > <
> > > > > >
> > > > >
> > > >
> > >
> > https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/DirectoryReader.java#L75
> > > > > > >
> > > > > > when you're opening the IndexReader
> > > > > >
> > > > > > Patrick
> > > > > >
> > > > > > On Tue, May 2, 2023 at 5:28 PM Wei  wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > We have a index that has multiple segments generated with
> > > continuous
> > > > > > > updates. Does Lucene  have a specific order when iterate through
> > > the
> > > > > > > segments (assuming single query thread) ? Can the order be
> > > customized
> > > > > > that
> > > > > > > the latest generated segments are searched first?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Wei
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Question about index segment search order

2023-05-04 Thread Michael Sokolov
There is no meaning to the sequence. The segments are created concurrently
by many threads, and the merge process will merge them without regard to
any ordering.
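
If an ordering is needed at search time, the LeafSorter hook Patrick links
below can impose one. A hedged sketch, assuming a hypothetical long point
field recording each document's update time:

    // open a reader whose leaves are sorted newest-first by that field
    Comparator<LeafReader> newestFirst = Comparator.comparingLong((LeafReader r) -> {
      try {
        PointValues points = r.getPointValues("timestamp");
        return points == null ? Long.MIN_VALUE : LongPoint.decodeDimension(points.getMaxPackedValue(), 0);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }).reversed();
    DirectoryReader reader = DirectoryReader.open(directory, newestFirst);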



On Wed, May 3, 2023, 1:09 PM Patrick Zhai  wrote:

> For that part I'm not entirely sure, if other folks know it please chime in
> :)
>
> On Wed, May 3, 2023 at 8:48 AM Wei  wrote:
>
> > Thanks Patrick! In the default case when no LeafSorter is provided, are
> the
> > segments traversed in the order of creation time, i.e. the oldest segment
> > is always visited first?
> >
> > Wei
> >
> > On Tue, May 2, 2023 at 7:22 PM Patrick Zhai  wrote:
> >
> > > Hi Wei,
> > > Lucene in general iterates through the index in the order of what is
> > > recorded in the SegmentInfos
> > > <
> > >
> >
> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java#L140
> > > >
> > > And at search time, you can specify the order using LeafSorter
> > > <
> > >
> >
> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/DirectoryReader.java#L75
> > > >
> > > when you're opening the IndexReader
> > >
> > > Patrick
> > >
> > > On Tue, May 2, 2023 at 5:28 PM Wei  wrote:
> > >
> > > > Hello,
> > > >
> > > > We have an index that has multiple segments generated with continuous
> > > > updates. Does Lucene have a specific order when iterating through the
> > > > segments (assuming a single query thread)? Can the order be customized
> > > > so that
> > > > the latest generated segments are searched first?
> > > >
> > > > Thanks,
> > > > Wei
> > > >
> > >
> >
>


Re: Info required on licensing of Lucene component

2023-03-21 Thread Michael Sokolov
Lucene is licensed under the Apache license, just as it says in the
LICENSE file. junit is used for testing Lucene and is not
redistributed with it. Using Lucene in your code does not mean you are
using junit, except in some extremely philosophical sense. E.g., Lucene
developers may have developed Lucene using Windows on their laptops -
that doesn't mean you need a Windows license to use Lucene. IANAL, so
you should ask yours - I'm sure someone at Cisco can help you sort
this out?

On Tue, Mar 21, 2023 at 10:13 AM external-opensource-requests(mailer
list)  wrote:
>
> Hello Team
>
> I hope you are doing well!!
>
> This is regarding Lucene component licensing.
> The maven repo link  
> https://mvnrepository.com/artifact/org.apache.lucene/lucene-queries/4.10.4 
> for lucene-queries 4.10.4 shows Apache 2.0 license associated with the 
> component.
> Also, the archive (lucene-queries-4.10.4-sources.jar) uploaded has a 
> LICENSE.txt file which has Apache 2.0 license, but it also includes a 
> NOTICE.txt file which shows JUnit (junit-4.10) licensed under the Common 
> Public License v. 1.0. But there is no code associated with Junit included in 
> the source archive (lucene-queries-4.10.4-sources.jar) file.
>
> In this case, since Common Public License 1.0 is more restrictive compared to 
> Apache 2.0, for our better understanding,  can you clarify to us on what is 
> the actual Open Source license associated with the Lucene component?
>
> Mentioning just two of the lucene components in mail as example for your 
> reference "lucene-backward-codecs 9.3.0" 
> https://mvnrepository.com/artifact/org.apache.lucene/lucene-backward-codecs/9.3.0
>
> Looking forward to your reply.
>
>
> Thanks ,
> Open Source Request Team
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to highlight fields that are not stored?

2023-02-16 Thread Michael Sokolov
Sorry, your problem statement makes no sense: you should be able to
store field data in the index without loading all your documents into
RAM while indexing. Maybe there is some constraint you are not telling
us about? Or you may be confused. In any case, highlighting requires
the document in its uninverted form. Otherwise, what text would you
highlight?
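
If the original text can be re-fetched from the source system at query time,
one hedged option is UnifiedHighlighter's searcher-less path (a sketch;
"content" and originalText are placeholders):

    UnifiedHighlighter uh = new UnifiedHighlighter(null, analyzer); // no stored fields needed
    Object snippet = uh.highlightWithoutSearcher("content", query, originalText, 3);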

On Mon, Feb 13, 2023 at 3:46 PM Shifflett, David [USA]
 wrote:
>
> Hi,
> I am converting my application from
> reading documents into memory and then indexing them,
> to streaming the documents to be indexed.
>
> I quickly found out this required that the field NOT be stored.
> I then quickly found out that my highlighting code requires the field to 
> be stored.
>
> I’ve been searching for an existing highlighter that doesn’t require the 
> field to be stored,
> and thought I’d found one in the FastVectorHighlighter,
> but tests revealed this highlighter also requires the field to be stored,
> though this requirement isn’t documented, or reflected in any returned 
> exception.
>
>   I have been investigating using code like
> Terms terms = reader.getTermVector(docID, fieldName);
> TermsEnum termsEnum = terms.iterator();
> BytesRef bytesRef = termsEnum.next();
> PostingsEnum pe = termsEnum.postings(null, PostingsEnum.OFFSETS);
>
> While this gives me the terms from the document, and the positions,
> iterating over this, and matching to the queries I’m running,
> seems cumbersome, and inefficient.
>
> Any suggestions for highlighting query matches without the searched field 
> being stored?
>
> Thanks,
> David Shifflett
> Senior Lead Technologist
> Enterprise Cross Domain Solutions (ECDS)
> Booz Allen Hamilton
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Other vector similarity metric than provided by VectorSimilarityFunction

2023-01-15 Thread Michael Sokolov
I would suggest building Lucene from source and adding your own
similarity function to VectorSimilarityFunction. That is the proper extension
point for similarity functions. If you find there is some substantial
benefit, it wouldn't be a big lift to add something like that. However
I'm dubious about the likely benefit; just because scipy supports lots
of functions doesn't mean you will get substantially better results
with L3 metric vs L2 metric or so. I think you'd probably find this
community receptive to a metric that *doesn't lose* accuracy and
provides a more efficient computation -- maybe L1 would do that?
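
A hedged sketch of what such an addition might compute, mirroring how the
EUCLIDEAN entry maps a distance to a score (schematic, not a drop-in patch):

    // L1 (Manhattan) distance, inverted so that larger scores mean more similar
    static float l1Similarity(float[] v1, float[] v2) {
      float sum = 0f;
      for (int i = 0; i < v1.length; i++) {
        sum += Math.abs(v1[i] - v2[i]);
      }
      return 1f / (1f + sum);
    }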

On Sat, Jan 14, 2023 at 6:04 PM Michael Wechner
 wrote:
>
> Hi Adrien
>
> Thanks for your feedback! Whereas I am not sure I fully understand what
> you mean
>
> At the moment I am using something like:
>
> float[] vector = ...;
> FieldType vectorFieldType = KnnVectorField.createFieldType(vector.length, 
> VectorSimilarityFunction.COSINE);
> KnnVectorField vectorField =new KnnVectorField("vector_field", vector, 
> vectorFieldType);
> doc.add(vectorField);
>
> Could you give me some sample code what you mean with "custom KNN
> vectors format"?
>
> Thanks
>
> Michael
>
> On 14.01.23 at 22:14, Adrien Grand wrote:
> > Hi Michael,
> >
> > You could create a custom KNN vectors format that ignores the vector
> > similarity configured on the field and uses its own.
> >
> > On Sat, Jan 14, 2023 at 21:33, Michael Wechner  wrote:
> >
> >> Hi
> >>
> >> IIUC Lucene currently supports
> >>
> >> VectorSimilarityFunction.COSINE
> >> VectorSimilarityFunction.DOT_PRODUCT
> >> VectorSimilarityFunction.EUCLIDEAN
> >>
> >> whereas some embedding models have been trained with other metrics.
> >> Also see
> >>
> >> https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html
> >>
> >> How can I best implement another metric?
> >>
> >> Thanks
> >>
> >> Michael
> >>
> >>
> >>
> >>
> >>
> >> -
> >> To unsubscribe, e-mail:java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail:java-user-h...@lucene.apache.org
> >>
> >>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Question about current situation of good first issues in GitHub

2023-01-13 Thread Michael Sokolov
That label seems to be something GitHub created automatically?

You might have better luck browsing the full list of labels. I found these:

https://github.com/apache/lucene/labels/legacy-jira-label%3Anewbie
https://github.com/apache/lucene/labels/legacy-jira-label%3Anewdev
https://github.com/apache/lucene/labels/legacy-jira-label%3Anoob
https://github.com/apache/lucene/labels/legacy-jira-label%3Astarter
https://github.com/apache/lucene/labels/legacy-jira-priority%3ATrivial



On Sun, Jan 8, 2023 at 9:26 AM Shunya Ueta  wrote:
>
> Hello Lucene users.
> Some time ago I checked for the `good first issue` label in GitHub issues
> as a way to start contributing to Lucene.
>
> https://github.com/apache/lucene/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22
>
> But currently there are no issues with this label.
> I don't know how this label is currently operated, but will it be
> utilized in the future?
> Good-first-issue labels are a very nice starting point for
> beginner contributors.
>
> Thanks & Regards!

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Is there a way to customize segment names?

2022-12-16 Thread Michael Sokolov
+1 - trying to coordinate multiple writers running independently will
not work. My 2c for availability: you can have a single primary active
writer with a backup one waiting, receiving all the segments from the
primary. Then if the primary goes down, the secondary one has the most
recent commit replicated from the primary (identical commit, same
segments etc) and can pick up from there. You would need a mechanism
to replay the writes the primary never had a chance to commit.

On Fri, Dec 16, 2022 at 5:41 AM Robert Muir  wrote:
>
> You are still talking "Multiple writers". Like i said, going down this
> path (playing tricks with filenames) isn't going to work out well.
>
> On Fri, Dec 16, 2022 at 2:48 AM Patrick Zhai  wrote:
> >
> > Hi Robert,
> >
> > Maybe I didn't explain it clearly but we're not going to constantly switch
> > between writers or share effort between writers, it's purely for
> > availability: the second writer only kicks in when the first writer is not
> > available for some reason.
> > And as far as I know the replicator/nrt module has not provided a solution
> > on when the primary node (main indexer) is down, how would we recover with
> > a back up indexer?
> >
> > Thanks
> > Patrick
> >
> >
> > On Thu, Dec 15, 2022 at 7:16 PM Robert Muir  wrote:
> >
> > > This multiple-writer isn't going to work and customizing names won't
> > > allow it anyway. Each file also contains a unique identifier tied to
> > > its commit so that we know everything is intact.
> > >
> > > I would look at the segment replication in lucene/replicator and not
> > > try to play games with files and mixing multiple writers.
> > >
> > > On Thu, Dec 15, 2022 at 5:45 PM Patrick Zhai  wrote:
> > > >
> > > > Hi Folks,
> > > >
> > > > We're trying to build a search architecture using segment replication
> > > (indexer and searcher are separated and indexer shipping new segments to
> > > searchers) right now and one of the problems we're facing is: for
> > > availability reason we need to have multiple indexers running, and when 
> > > the
> > > searcher is switching from consuming one indexer to another, there are
> > > chances where the segment names collide with each other (because segment
> > > names are count based) and the searcher have to reload the whole index.
> > > > To avoid that we're looking for a way to name the segments so that
> > > Lucene is able to tell the difference and load only the difference (by
> > > calling `openIfChanged`). I've checked the IndexWriter and the
> > > DocumentsWriter and it seems it is controlled by a private final method
> > > `newSegmentName()` so likely not possible there. So I wonder whether
> > > there's any other ways people are aware of that can help control the
> > > segment names?
> > > >
> > > > A example of the situation described above:
> > > > Searcher previously consuming from indexer 1, and have following
> > > segments: _1, _2, _3, _4
> > > > Indexer 2 previously sync'd from indexer 1, sharing the first 3
> > > segments, and produced its own 4th segments (notioned as _4', but it 
> > > shares
> > > the same "_4" name): _1, _2, _3, _4'
> > > > Suddenly Indexer 1 dies and searcher switched from Indexer 1 to Indexer
> > > 2, then when it finished downloading the segments and trying to refresh 
> > > the
> > > reader, it will likely hit the exception here, and seems all we can do
> > > right now is to reload the whole index and that could be potentially a 
> > > high
> > > cost.
> > > >
> > > > Sorry for the long email and thank you in advance for any replies!
> > > >
> > > > Best
> > > > Patrick
> > > >
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: dev-h...@lucene.apache.org
> > >
> > >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene 4.10.4 forward slash syntax error

2022-11-28 Thread Michael Sokolov
Have you tried escaping with a backslash? I have a vague memory that
might work. As for modifying classes in 4.10.4, you are welcome to do
so in a custom fork, but that version is so old that we no longer post
fixes for it on the official Apache release branches. The current
release series is 9.x - you should seriously consider upgrading. It's
a job for sure, but that is what maintenance is all about :)
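
One thing worth trying before forking: QueryParser.escape, which also
escapes the '/' regex delimiter (a sketch against the 4.10-era API):

    String safe = QueryParser.escape("some/path/term"); // -> some\/path\/term
    Query q = new QueryParser(Version.LUCENE_4_10_4, "field", analyzer).parse(safe);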

On Fri, Nov 25, 2022 at 5:06 AM Younes Bahloul  wrote:
>
> hello i m part of the team that maintain exist-db
> https://github.com/eXist-db/exist
> we are using lucene 4.10.4 and we have an issue with using forward slash
> we made our own custom Analyzer that produce tokens with punctuation
> but we are facing some problems when trying to parse the input
> we are using `org.apache.lucene.queryparser.classic.QueryParser`
> and it's treating forward slash as an end of file
> is it possible to modify `queryparser.classic.QueryParser`?
> or a way to escape the forward slash?
> thank you
> --
> Kind regards,
> Younes Bahloul
> Junior Engineer

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Best strategy migrate indexes

2022-11-07 Thread Michael Sokolov
The error you got

BufferedChecksumIndexInput(MMapIndexInput(path="tests_small_index-7.x-migrator\segments_1"))):
9 (needs to be between 6 and 7)

indicates that the index you are reading was written by Lucene 9, so
things are not set up the way you described (writing using Lucene 7)


> Thanks TX for your response.
>
> I would check that the Luke version matches the Lucene version - if
> > the two match, it shouldn't be possible to get issues like this.
> > That is, the precise versions of Lucene each is using.
>
>
> Yes, I am using https://github.com/DmitryKey/luke/releases/tag/luke-7.1.0
>
> It works ok with my new generated indexes, but it does not with the
> "migrated" ones.
>
> El lun, 7 nov 2022 a las 12:18, Trejkaz () escribió:
>
> > The process itself sounds like it should work (it's basically a
> > reindex so it should be safer than trying to migrate directly.)
> >
> > I would check that the Luke version matches the Lucene version - if
> > the two match, it shouldn't be possible to get issues like this.
> > That is, the precise versions of Lucene each is using.
> >
> > TX
> >
> >
> > On Mon, 7 Nov 2022 at 22:09, Pablo Vázquez Blázquez 
> > wrote:
> > >
> > > Hi!
> > >
> > > > I am trying to create a tool to read docs from a lucene5 index and
> > > generate lucene9 documents from them (with docValues). That might work,
> > > right? I am shading both lucene5 and lucene9 to avoid package conflicts.
> > >
> > > I am doing the following steps:
> > >
> > > - create IndexReader with lucene5 package over a lucene5 index
> > > - create IndexWriter with lucene7 package
> > > - iterate over reader.numDocs() to process each Document (lucene5)
> > > - convert each Document (lucene5) to lucene7 Document
> > > - for each IndexableField (lucene5) from Document (lucene5)
> > convert
> > > it to create an IndexableField (lucene7)
> > > - create a SortedDocValuesField (lucene7) and add it to the
> > > Document (lucene7)
> > > - add the field to the Document (lucene7)
> > > - add each converted Document to the writer
> > > - close  IndexReader and IndexWriter
> > >
> > > When I open the resulting migrated lucene7 index with Luke I got an
> > error:
> > > org.apache.lucene.index.IndexFormatTooNewException: Format version is not
> > > supported (resource
> > >
> > BufferedChecksumIndexInput(MMapIndexInput(path="tests_small_index-7.x-migrator\segments_1"))):
> > > 9 (needs to be between 6 and 7)
> > >
> > > When I use the tool "luceneupgrader
> > > ", I got:
> > > java -jar luceneupgrader-0.5.2-SNAPSHOT.jar info
> > > tests_small_index-7.x-migrator
> > > Lucene index version: 7
> > >
> > > What am I doing wrong or misleading?
> > >
> > > Thanks!
> > >
> > > El mié, 2 nov 2022 a las 21:13, Pablo Vázquez Blázquez (<
> > pabl...@gmail.com>)
> > > escribió:
> > >
> > > > Hi,
> > > >
> > > > Luckily we were already using lucenemigrator
> > > >
> > > >
> > > > What do you mean with "lucenemigrator"? Is it a public tool?
> > > >
> > > > I am trying to create a tool to read docs from a lucene5 index and
> > > > generate lucene9 documents from them (with docValues). That might work,
> > > > right? I am shading both lucene5 and lucene9 to avoid package
> > conflicts.
> > > >
> > > > Thanks!
> > > >
> > > > El mar, 1 nov 2022 a las 0:35, Trejkaz ()
> > escribió:
> > > >
> > > >> Well...
> > > >>
> > > >> There's a way, but I wouldn't necessarily recommend it.
> > > >>
> > > >> You can write custom migration code against some version of Lucene
> > > >> which supports doc values, to create doc values fields. It's going to
> > > >> involve writing a FilterCodecReader which wraps your real index and
> > > >> then pretends to also have doc values, which you'll build in a custom
> > > >> class which works similarly to UninvertingReader. Then you pass those
> > > >> CodecReaders to IndexWriter.addIndexes to create a new index which
> > > >> really has those doc values.
> > > >>
> > > >> We did that ourselves when we had the same issue. The only painful
> > > >> thing about it is having to keep around older versions of lucene to do
> > > >> that migration. Forever. Luckily we were already using lucenemigrator,
> > > >> which has the older versions baked into it with package prefixes. So
> > > >> that library will get fatter and fatter over time but at least our own
> > > >> code only gets fatter at the rate migrations are added.
> > > >>
> > > >> The same approach works for any other kind of ad-hoc migration you
> > > >> might want to perform. e.g., you might want to create points. Or
> > > >> remove an index for a field. Or add an index for a field.
> > > >>
> > > >> TX
> > > >>
> > > >>
> > > >> On Tue, 1 Nov 2022 at 02:57, Pablo Vázquez Blázquez <
> > pabl...@gmail.com>
> > > >> wrote:
> > > >> >
> > > >> > Hi all,
> > > >> >
> > > >> > Thank you all for your responses.
> > > >> >
> > > >> > So, when updating to a newer (major) Lucene version that 

Re: Latency and recall re HNSW: Lucene versus Vespa

2022-10-01 Thread Michael Sokolov
I'd agree with the main point re: the need to combine vector-based
matching with term-based matching.

As for the comparison with Lucene, I'd say it's a shallow and biased
take. The main argument is that Vespa's mutable in-memory(?) data
structures are superior to Lucene's immutable on-disk segments. While
it is true that Lucene's approach leads to slower searches when there
are more segments, especially for vector searches, the immutability
property provides other well-understood benefits. TBH I don't know
enough about Vespa to make any meaningful comparison, but every choice
is a compromise. We've known for centuries that "Odyous of olde been
comparisonis, And of comparisonis engendyrd is haterede."

On Sat, Oct 1, 2022 at 7:18 AM Michael Wechner
 wrote:
>
> Hi Together
>
> I just read the following article, where the author compares Lucene and
> Vespa re HSWN
>
> https://bergum.medium.com/will-new-vector-databases-dislodge-traditional-search-engines-b4fdb398fb43
>
> What is your take on "comparing Lucene and Vespa re HNSW latency and
> recall"?
>
> Thanks
>
> Michael
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



[ANNOUNCE] Apache Lucene 9.4.0 released

2022-09-30 Thread Michael Sokolov
The Lucene PMC is pleased to announce the release of Apache Lucene 9.4.0.

Apache Lucene is a high-performance, full-featured search engine
library written entirely in Java. It is a technology suitable for
nearly any application that requires structured search, full-text
search, faceting, nearest-neighbor search across high-dimensionality
vectors, spell correction or query suggestions.

This release contains numerous bug fixes, optimizations, and
improvements, some of which are highlighted below. The release is
available for immediate download at:

https://lucene.apache.org/core/downloads.html

Lucene 9.4.0 Release Highlights:

New features

Added ShapeDocValues/Field, a unified abstraction to represent
existing types: XY and lat/long.
FacetSets can now be filtered using a Query via MatchingFacetSetCounts.
SortField now allows control over whether to apply index-sort optimizations.
Support for Java 19 foreign memory access ("project Panama") was
added. Applications started with command line parameter "java
--enable-preview" will automatically use the new foreign memory API of
Java 19 to access indexes on disk with MMapDirectory. This is an
opt-in feature and requires explicit Java command line flag passed to
your application's Java process (e.g., modify startup parameters of
Solr or Elasticsearch/Opensearch)! When enabled, Lucene logs a notice
using java.util.logging. Please test thoroughly and report
bugs/slowness to Lucene's mailing list. When the new API is used,
MMapDirectory will mmap Lucene indexes in chunks of 16 GiB (instead of
1 GiB) and indexes closed while queries are running can no longer
crash the JVM.

Optimizations

Added support for dynamic pruning to queries sorted by a string field
that is indexed with both terms and SORTED or SORTED_SET doc values.
This can lead to dramatic speedups when applicable.
TermInSetQuery is optimized for the case when one of its terms matches
all docs in a segment, and it now provides cost estimation, making it
usable with IndexOrDocValuesQuery for better query planning.
KnnVector fields can now be stored with reduced (8-bit) precision,
saving storage and yielding a small query latency improvement.

Other

KnnVector fields' HNSW graphs are now created incrementally when new
documents are added, rather than all-at-once when flushing. This
yields more consistent predictable behavior at the cost of an overall
increase in indexing time.
randomizedtesting dependency upgraded to 2.8.1
addIndexes(CodecReader) now respects MergePolicy and MergeScheduler,
enabling it to do its work concurrently.

Please read CHANGES.txt for a full list of new features and changes:

https://lucene.apache.org/core/9_4_0/changes/Changes.html

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Best practice - preparing search term for Lucene

2022-09-23 Thread Michael Sokolov
I think it depends how precise you want to make the search. If you
want to enable diacritic-sensitive search in order to avoid confusions
when users actually are able to enter the diacritics, you can index
both ways (ascii-folded and not folded) and not normalize the query
terms. Or you can just fold everything and not worry about it. In
French I know there are confusable words like "cote" which has at
least a few different meanings depending on the accents. Not sure how
it is in Croatian.
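
For the index-both-ways option, a minimal sketch (the analysis chain is
illustrative): ASCIIFoldingFilter with preserveOriginal=true emits the
folded token and the original at the same position, so both spellings
match at query time:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        // preserveOriginal=true keeps "češnjakom" alongside folded "cesnjakom"
        result = new ASCIIFoldingFilter(result, true);
        return new TokenStreamComponents(source, result);
      }
    };

Use the same analyzer at query time and there is no need to normalize the
user's input yourself.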

On Fri, Sep 23, 2022 at 5:30 AM Hrvoje Lončar  wrote:
>
> Hi Stephane!
>
> Actually, I have exactly that kind of conversion, but I didn't mention it as my 
> mail was long enough without it :)
> My main concern is: should I let Lucene index the original keywords or not.
> Considering what you wrote, I guess your answer would be to store only 
> converted values without exotic characters.
>
> Thanks a lot for your reply!
>
> BR,
> Hrvoje
>
> On Thu, Sep 22, 2022 at 7:53 PM Stephane Passignat  
> wrote:
>>
>> Hello,
>>
>> The way I did it took me some time, and I'm almost sure it's applicable to all 
>> languages.
>>
>> I normalized the words, replacing letters or groups of letters with a close 
>> approximation.
>>
>> In French, e é è ê ai ei sound a bit the same, and for someone who makes 
>> spelling mistakes, having to use the right letters is very frustrating. So I 
>> transformed all of them into e...
>>
>> Hope it helps
>>
>> Download BlueMail for Android
>> On 22 Sept 2022, at 16:37, "Hrvoje Lončar"  wrote:
>>
>> Hi!
>>
>> I'm using Hibernate Search / Lucene to index my entities in Spring Boot
>> aplication.
>>
>> One thing I'm not sure is how to handle Croatian specific letters.
>> Croatian language has few additional letters "*č* *Č* *ć* *Ć* *đ* *Đ* *š*
>> *Š* *ž* *Ž*".
>> Letters "*đ* *Đ*" are commonly replaced with "*dj* *DJ*" when no Croatian
>> letters available.
>>
>> In my custom Hibernate bridge there is a step that replaces all Croatian
>> characters with appropriate ASCII replacements which means "*č*" becomes "
>> *c*", "*š*" becomes "*s*" and so on.
>> Later, when user enters search text, the same process is done to match
>> values from index.
>> There is one more good thing about it - some older users that used
>> computers in early ages when no Croatian letters were available - those
>> users type words without Croatian letters, automatically replacing "*č*" with
>> "*c*" and that fits my logic to get good search results.
>>
>> For example, the title of my entity is: "*juha s češnjakom i đumbirom*".
>> My custom Hibernate String bridge converts it to "*juha cesnjakom dumbirom*
>> ".
>> Then user enters "*juha s češnjakom*".
>> Before issuing a search, the same conversion is made to users' query and
>> text sent to Lucene is "*juha cesnjakom*".
>> This is the way how I implemented it and it's working fine.
>>
>> The other way would be to index original text and then find words with
>> Croatian characters, convert them to ASCII and add to original.
>> The title "*juha s češnjakom i đumbirom*" would become "*juha češnjakom
>> đumbirom cesnjakom dumbirom*".
>> In that case there is no need to convert users' search terms because
>> both "*juha
>> s češnjakom*" and "*juha s cesnjakom*" would return the same result.
>>
>> My question is:
>> Is there any reason to switch to this alternative logic and have original
>> keywords indexed in parallel with those converted to ASCII?
>>
>> Thanks!
>>
>> BR,
>> Hrvoje
>
>
>
> --
> {{ Horvoje.net ~~ VegCook.net ~~ TheVegCat.com ~~ Cuspajz.com ~~ 
> VintageZagreb.net ~~ Sterilizacija.org ~~ SmijSe.com ~~ HTMLutil.net ~~ 
> HTTPinfo.net }}
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Max Field Length

2022-09-23 Thread Michael Sokolov
ooh

On Fri, Sep 23, 2022 at 11:02 AM Adrien Grand  wrote:
>
> We have a TruncateTokenFilter in lucene/analysis/common. :)
>
> On Fri, Sep 23, 2022 at 4:39 PM Michael Sokolov  wrote:
>
> > I wonder if it would make sense to provide a TruncationFilter in
> > addition to the LengthFilter. That way long tokens in source text
> > could be better supported, albeit with some confusion if they share
> > the same very long prefix...
> >
> > On Fri, Sep 23, 2022 at 9:56 AM Scott Guthery  wrote:
> > >
> > > Thanks much, Adrien.  I hadn't realized that the size limit was on one
> > > token in the text as opposed to being a limit on the length of the entire
> > > text field.  I'm loading patents, so I suspect that the very long word
> > is a
> > > DNA sequence.
> > >
> > > Thanks also for your guidance with regard to setting maximums.
> > >
> > > Cheers, Scott
> > >
> > > >
> > > >
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>
> --
> Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Max Field Length

2022-09-23 Thread Michael Sokolov
I wonder if it would make sense to provide a TruncationFilter in
addition to the LengthFilter. That way long tokens in source text
could be better supported, albeit with some confusion if they share
the same very long prefix...
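
(As Adrien notes elsewhere in this thread, lucene/analysis/common already
ships a TruncateTokenFilter.) A minimal sketch of capping token length at
index time instead of dropping long tokens (the 255 limit is illustrative):

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.miscellaneous.TruncateTokenFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    Tokenizer source = new StandardTokenizer();
    // Keep every token, truncating anything longer than 255 chars,
    // rather than discarding it the way LengthFilter would
    TokenStream result = new TruncateTokenFilter(source, 255);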

On Fri, Sep 23, 2022 at 9:56 AM Scott Guthery  wrote:
>
> Thanks much, Adrien.  I hadn't realized that the size limit was on one
> token in the text as opposed to being a limit on the length of the entire
> text field.  I'm loading patents, so I suspect that the very long word is a
> DNA sequence.
>
> Thanks also for your guidance with regard to setting maximums.
>
> Cheers, Scott
>
> >
> >

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Can lucene be used in Android ?

2022-09-09 Thread Michael Sokolov
no, and I think it could be challenging to go the route of using
Dalvik/ART. Maybe you can run an actual JDK on Android? See
https://openjdk.org/projects/mobile/android.html

On Fri, Sep 9, 2022 at 9:27 AM Jie Wang  wrote:
>
> Hey,
>
> Recently, I am trying to compile the Lucene to get a jar that can be used in 
> Android, but failed.
>
> Is there an official version that supports the use of Lucene on Android?
>
>
> Thanks!
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: [ANNOUNCE] Issue migration Jira to GitHub starts on Monday, August 22

2022-08-24 Thread Michael Sokolov
Thanks! It seems to be working nicely.

Question about the fix-version: tagging. I wonder if going forward we
want to maintain that for new issues? I happened to notice there is also
this "milestone" feature in github -- does that seem like a place to
put version information?

On Wed, Aug 24, 2022 at 3:20 PM Tomoko Uchida
 wrote:
>
> 
>
> Issue migration has been completed (except for minor cleanups).
> This is the Jira -> GitHub issue number mapping for possible future usage. 
> https://github.com/apache/lucene-jira-archive/blob/main/migration/mappings-data/issue-map.csv.20220823_final
>
> GitHub issue is now fully available for all issues.
> For issue label management (e.g. "fix-version"), please review this manual.
> https://github.com/apache/lucene/blob/main/dev-docs/github-issues-howto.md
>
> Tomoko
>
>
> 2022年8月22日(月) 19:46 Michael McCandless :
>>
>> Wooot!  Thank you so much Tomoko!!
>>
>> Mike
>>
>> On Mon, Aug 22, 2022 at 6:44 AM Tomoko Uchida  
>> wrote:
>>>
>>> 
>>>
>>> Issue migration has been started. Jira is now read-only.
>>>
>>> GitHub issue is available for new issues.
>>>
>>> - You should open new issues on GitHub. E.g. 
>>> https://github.com/apache/lucene/issues/1078
>>> - Do not touch issues that are in the middle of migration, please. E.g. 
>>> https://github.com/apache/lucene/issues/1072
>>>   - While you cannot break these issues, migration scripts can 
>>> modify/overwrite your comments on the issues.
>>> - Pull requests are not affected. You can open/update PRs as usual. Please 
>>> let me know if you have any trouble with PRs.
>>>
>>>
>>> Tomoko
>>>
>>>
>>> 2022年8月18日(木) 18:23 Tomoko Uchida :

 Hello all,

 The Lucene project decided to move our issue tracking system from Jira to 
 GitHub and migrate all Jira issues to GitHub.

 We start issue migration on Monday, August 22 at 8:00 UTC.
 1) We make Jira read-only before migration. You cannot update existing 
 issues until the migration is completed.
 2) You can use GitHub for opening NEW issues or pull requests during 
 migration.

 Note that issues should be raised in Jira at this moment, although GitHub 
 issue is already enabled in the Lucene repository.
 Please do not raise issues in GitHub until we let you know that GitHub 
 issue is officially available. We immediately close any issues on GitHub 
 until then.

 Here are the detailed plan/migration steps.
 https://github.com/apache/lucene-jira-archive/issues/7

 Tomoko
>>
>> --
>> Mike McCandless
>>
>> http://blog.mikemccandless.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Performance Comparison of Benchmarks by using Lucene 9.1.0 vs 8.5.1

2022-07-26 Thread Michael Sokolov
https://home.apache.org/~mikemccand/lucenebench/ shows how various
benchmarks have evolved over time *on the main branch*. There is no
direct comparison of every version against every other version that I
have seen though.

On Tue, Jul 26, 2022 at 2:12 PM Baris Kazar  wrote:
>
> Dear Folks,-
>  Similar question to my previous post: this time I wonder if there is a Lucene
> web site where benchmarks are run against these two versions of Lucene.
> I see many (44+16) api changes and (48+9) improvements and (16+15) Bug fixes, 
> which sounds great.
> Best regards
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Fuzzy Query Similarity

2022-07-09 Thread Michael Sokolov
Oh good! Thanks for clarifying, Uwe

On Sat, Jul 9, 2022, 12:23 PM Uwe Schindler  wrote:

> Hi
> > FuzzyQuery/MultiTermQuery and I don't see any way to "boost" exact
> > matches, or even to incorporate the edit distance more generally into
> > the per-term score, although it does seem like that would be something
> > people would generally expect.
>
> Actually it does this:
>
>   * By default FuzzyQuery uses a rewrite method that expands all terms
> as should clauses into a boolean query:
> MultiTermQuery.TopTermsBlendedFreqScoringRewrite(maxExpansions)
>   * TopTermsRewrite basically keeps track of a "boost" factor for each
> term and sorts the "best" terms in a PQ:
>
> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/TopTermsRewrite.java#L109-L160
>   * For each collected term the term enumeration sets a boost (1.0 for
> exact match):
>
> https://github.com/apache/lucene/blob/dd4e8b82d711b8f665e91f0d74f159ef1e63939f/lucene/core/src/java/org/apache/lucene/search/FuzzyTermsEnum.java#L248-L256
>
> So in short the exact term gets a boost factor of 1 in the resulting
> term query, all other terms a lower one.
>
> Uwe
>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail:u...@thetaphi.de
>


Re: Fuzzy Query Similarity

2022-07-09 Thread Michael Sokolov
I am no expert with this, but I got curious and looked at
FuzzyQuery/MultiTermQuery and I don't see any way to "boost" exact
matches, or even to incorporate the edit distance more generally into
the per-term score, although it does seem like that would be something
people would generally expect. So maybe FuzzyQuery should somehow do
that? But without changing it, you could also use a query that does it
explicitly; if you get a term "foo", you could maybe search for "foo
OR foo~" ?
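
A minimal sketch of that explicit workaround (the field, term, and boost
are illustrative), favoring the exact term over the fuzzy expansions:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.BoostQuery;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    Term term = new Term("field", "spark");
    Query query = new BooleanQuery.Builder()
        // exact match, boosted so it outweighs the summed fuzzy matches
        .add(new BoostQuery(new TermQuery(term), 2.0f), BooleanClause.Occur.SHOULD)
        .add(new FuzzyQuery(term), BooleanClause.Occur.SHOULD)
        .build();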

On Fri, Jul 8, 2022 at 4:14 PM Mike Drob  wrote:
>
> Hi folks,
>
> I'm working with some fuzzy queries and trying my best to understand what
> is the expected behaviour of the searcher. I'm not sure if this is a
> similarity bug or an incorrect usage on my end.
>
> The problem is when I do a fuzzy search for a term "spark~" then instead of
> matching documents with spark first, it will match other documents that
> have multiple other near terms like "spar" and "spars". I see this same
> thing with both ClassicSimilarity and BM25.
>
> This is from a much smaller (two document) index when I was trying to
> isolate and reproduce the issue, but I see comparable behaviour with more
> varied scoring on a much larger corpus. The two documents are:
>
> addDoc("spark spark", writer); // exact match
>
> addDoc("spar spars", writer); // multiple fuzzy terms
>
> The non-zero edit distance terms get a slight down-boost, but it's not
> enough to overcome their sum exceeding even the TF boost for the desired
> document.
>
> A full reproducible unit test is at
> https://github.com/apache/lucene/commit/dbf8e788cd2c2a5e1852b8cee86cb21a792dc546
>
> What is the recommended approach to get the document with exact term
> matching for me again? I don't see an option to tweak the internal boost
> provided by FuzzyQuery, that's one idea I had. Or is this a different
> change that needs to be fixed at the lucene level rather than application
> level?
>
> Thanks,
> Mike
>
>
>
> More detail:
>
>
> The first document with the field "spark spark" has a score explanation:
>
> 1.4054651 = sum of:
>   1.4054651 = weight(field:spark in 0) [ClassicSimilarity], result of:
> 1.4054651 = score(freq=2.0), product of:
>   1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
> 1 = docFreq, number of documents containing term
> 2 = docCount, total number of documents with field
>   1.4142135 = tf(freq=2.0), with freq of:
> 2.0 = freq, occurrences of term within document
>   0.70710677 = fieldNorm
>
> And a document with the field "spar spars" comes in ever so slightly higher
> at
>
> 1.5404116 = sum of:
>   0.74536043 = weight(field:spar in 1) [ClassicSimilarity], result of:
> 0.74536043 = score(freq=1.0), product of:
>   0.75 = boost
>   1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
> 1 = docFreq, number of documents containing term
> 2 = docCount, total number of documents with field
>   1.0 = tf(freq=1.0), with freq of:
> 1.0 = freq, occurrences of term within document
>   0.70710677 = fieldNorm
>   0.79505116 = weight(field:spars in 1) [ClassicSimilarity], result of:
> 0.79505116 = score(freq=1.0), product of:
>   0.8 = boost
>   1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
> 1 = docFreq, number of documents containing term
> 2 = docCount, total number of documents with field
>   1.0 = tf(freq=1.0), with freq of:
> 1.0 = freq, occurrences of term within document
>   0.70710677 = fieldNorm

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Version of log4j in Lucene 8.11.2

2022-06-23 Thread Michael Sokolov
Lucene core is a no-dependencies library. Some of the other Lucene
modules, and the build and tests, have dependencies, but none of them
includes log4j. So sorry, but we won't be making Lucene use log4j
2.17.2; probably you should get your compliance standards changed to
include *forbidden* versions rather than *required* versions :)

On Thu, Jun 23, 2022 at 9:57 AM Kurz, Fred
 wrote:
>
> Categorization: Unclassified
> Hi:
>
> What version of log4j is included in Lucene version 8.11.2?  The release 
> notes for Solr 8.11.2 explicitly states log4j version is upgraded to 2.17.2 
> to address security vulnerabilities, but there is no such note for Lucene.  I 
> assume the same is true for Lucene 8.11.2 since Solr is a subproject, but I 
> need it confirmed.
>
> I am trying to get Lucene 8.11.2 certified for use in my organization but 
> certification is contingent on Lucene using log4j 2.17.2.  A prompt reply 
> would be greatly appreciated.
>
> Thanks,
> Fred Kurz
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Question about Benchmark

2022-05-17 Thread Michael Sokolov
OK I replied on the issue. This ann-benchmarks is a separate project,
and I think you are asking about how to change it. Probably should
take it up with erikbern or whatever community is supporting that
actively. I just created a "plugin" so we could use it to test
Lucene's KNN implementation, but that issue was never committed.
Please do feel free to improve it and submit to ann-benchmarks if you
find it useful though

On Tue, May 17, 2022 at 3:31 AM balmukund mandal  wrote:
>
> Hi All,
> My apologies for not mentioning the benchmark which I was using. Also,
> I realized that I've not subscribed to this group, hence duplicating this
> mail. The queries below are for the ANN-Benchmark
> https://issues.apache.org/jira/browse/LUCENE-9625
> Indexing takes a long time, so is there a way to configure the benchmark to
> use an already existing index for search? Also, is there a way to configure
> the benchmark to use multiple threads for indexing (looks to me that it’s a
> single-threaded indexing)?
>
>
> On Mon, May 16, 2022 at 11:06 AM balmukund mandal 
> wrote:
>
> > Hi All,
> > I was trying to run the benchmark and had a couple of questions. Indexing
> > takes a long time, so is there a way to configure the benchmark to use an
> > already existing index for search? Also, is there a way to configure the
> > benchmark to use multiple threads for indexing (looks to me that it’s a
> > single-threaded indexing)?
> >
> > --Regards,
> > Balmukund
> >

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: New user questions about demo, downloads, and IRC

2022-04-26 Thread Michael Sokolov
thanks, I fixed the doc!

On Tue, Apr 26, 2022 at 9:13 AM Bridger Dyson-Smith
 wrote:
>
> Hi Michael -
>
> On Mon, Apr 25, 2022 at 5:38 PM Michael Wechner 
> wrote:
>
> > Hi Bridger
> >
> > Inside
> >
> > https://dlcdn.apache.org/lucene/java/9.1.0/lucene-9.1.0.tgz
> >
> > you should find
> >
> > modules/lucene-core-9.1.0.jar
> > modules/lucene-queryparser-9.1.0.jar
> > modules/lucene-analysis-common-9.1.0.jar
> > modules/lucene-demo-9.1.0.jar
> >
> Yes, those are there!  I wasn't sure if a different directory structure
> would be available if building from source, but in any case I'll try
> working through the demo.
>
> > I guess the documentation is not quite right.
> >
> >
> > Re your second question, there are two channels on Slack
> >
> > https://app.slack.com/client/T4S1WH2J3/CE70MDPMF (#lucene-dev)
> > https://app.slack.com/client/T4S1WH2J3/C01E88Y8TQD (#lucene-vector)
> >
> Are these channels appropriate places for new user talk/questions?
>
>
> > HTH
> >
> > Michael
> >
> Very helpful indeed -- thank you very kindly for your time.
> Best,
>
> Bridger
>
> >
> >
> > Am 25.04.22 um 21:27 schrieb Bridger Dyson-Smith:
> > > Hi all -
> > >
> > > I hope these questions are acceptable for this particular list.
> > >
> > > I have a combined question re the 9.1.0 demo[1] and the binary
> > release[2]:
> > >
> > > the demo suggests that there should be a `core/` directory, as well as
> > > others, however, after unpacking the TAR, I'm not seeing any:
> > > ) ls -1
> > > CHANGES.txt
> > > JRE_VERSION_MIGRATION.md
> > > LICENSE.txt
> > > MIGRATE.md
> > > NOTICE.txt
> > > README.md
> > > SYSTEM_REQUIREMENTS.md
> > > bin/
> > > docs/
> > > licenses/
> > > modules/
> > > modules-test-framework/
> > > modules-thirdparty/
> > >
> > > Is downloading the source and building the recommended approach here?
> > >
> > > Also, are the Lucene folks anywhere on liberachat vs freenode? Many
> > > communities seem to have moved away from freenode and I was curious if
> > that
> > > was the case with Lucene's IRC, or if people were still using freenode
> > (no
> > > big deal either way - just curious!).
> > >
> > > Thanks very much for your time!
> > > Best,
> > > Bridger
> > >
> > > [1] https://lucene.apache.org/core/9_1_0/demo/index.html
> > > [2]
> > https://www.apache.org/dyn/closer.lua/lucene/java/9.1.0/lucene-9.1.0.tgz
> > >
> >
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: RangeFacetsCount Question

2022-04-26 Thread Michael Sokolov
Looking at git blame I see the current parameter was added here:
https://issues.apache.org/jira/browse/LUCENE-6648. Previous
implementations supported a BitSet rather than a Query. I'm not really
sure what the use case is for applying additional filtering when
faceting. Perhaps it can support something like drill sideways??
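
For reference, a minimal sketch of the counting side (the field and ranges
are illustrative, assuming an existing IndexSearcher `searcher` and Query
`query`); the hits gathered by the FacetsCollector are all that
RangeFacetCounts strictly needs:

    import org.apache.lucene.facet.FacetsCollector;
    import org.apache.lucene.facet.range.LongRange;
    import org.apache.lucene.facet.range.LongRangeFacetCounts;

    FacetsCollector fc = new FacetsCollector();
    FacetsCollector.search(searcher, query, 10, fc);
    // Counts computed from the collected hits; no fastMatchQuery involved
    LongRangeFacetCounts facets = new LongRangeFacetCounts("price", fc,
        new LongRange("cheap", 0L, true, 100L, false),
        new LongRange("pricey", 100L, true, Long.MAX_VALUE, true));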

On Thu, Apr 21, 2022 at 6:08 PM Marc D'Mello  wrote:
>
> Hi,
>
> I had a quick question about RangeFacetsCounts
> ,
> I'm a bit confused by the fastMatchQuery param. Specifically, I was
> wondering why we need this when we can provide hits from a FacetCollector
> directly without having to run a query? I realize that the fastMatchQuery
> is used for filtering provided hits further, but it seems redundant when we
> can do all the matching we need before providing the FacetCollector object
> to RangeFacetCounts. SortedSetDocValuesFacetCounts only has FacetCollector
> as a param for example
> 
> without
> having the fastMatchQuery param. Maybe I'm misunderstanding something here?
> If anyone has an explanation that would be super helpful!
>
> Thanks!
> Marc D'Mello

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Returning large resultset is slow and resource intensive

2022-03-08 Thread Michael Sokolov
Another approach for retrieving large result sets can work if you have
a unique sort key. and don't mind retrieving your results sorted by
this key. Then you can retrieve the results in batches using a
cursor-style approach; request the top N sorted by the key. Then
request the top N such that the key is greater than the greatest value in
the last batch. Rinse and repeat.
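
A minimal sketch of that loop (assuming a unique "id" field indexed for
sorting, and an existing IndexSearcher `searcher` and Query `query`):

    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TopDocs;

    Sort sort = new Sort(new SortField("id", SortField.Type.STRING));
    TopDocs page = searcher.search(query, 1000, sort);
    while (page.scoreDocs.length > 0) {
      // ... process this batch of hits ...
      ScoreDoc last = page.scoreDocs[page.scoreDocs.length - 1];
      // resumes strictly after the last hit of the previous batch
      page = searcher.searchAfter(last, query, 1000, sort);
    }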

On Tue, Mar 8, 2022 at 4:13 AM Uwe Schindler  wrote:
>
> Hi,
>
> > For our use case, we need to run queries which return the full
> > matched result set. In some cases, this result set can be large (50k+
> > results out of 4 million total documents).
> > Perf test showed that just 4 threads running random queries returning 50k
> > results make Lucene utilize 100% CPU on a 4-core machine (profiler
> > screenshot
> >  > c2e4-45b6-b98d-b7622b6ac801.png>).
>
> This screenshot shows the problem: The search methods returning TopDocs (or 
> TopFieldDocs) should never ever be used to retrieve a large number of results 
> or ALL results. This is called the "deep paging" problem. Lucene cannot return 
> "paged" results easily starting at a specific result page; it has to score all 
> results and insert them into a priority queue - this does not scale well 
> because the priority queue approach is made for quickly getting top-ranking 
> results. So to get all results, don't call: 
> 
>
> If you just want to get all results then you should write your own collector 
> (single threaded as subclass of SimpleCollector, an alternative is 
> CollectorManager for multithreaded search with a separate "reduce" step to 
> merge results of each index segment) that just retrieves document ids and 
> processes them. If you don't need the score, don't call the scoring methods 
> in the Scorable.
>
> For this you have to create a subclass of SimpleCollector (and 
> CollectorManager, if needed) and implement its methods, which are called by the 
> query internals as a kind of "notification" about which index segment you are 
> in and which results *relative* to this index segment were hit. Important things:
> - you get notified about new segments using SimpleCollector#doSetNextReader. 
> Save the context in a local field of the collector for later usage
> - if you need the scores also implement SimpleCollector#setScorer().
> - for each search hit of the reader passed in the previous call you get the 
> SimpleCollector#collect() method called. Use the document id passed and 
> resolve it using the leaf reader to the actual document and its fields/doc 
> values. To get the score ask the Scorable from the previous call.
>
> Another approach is to use searchAfter with smaller windows, but for getting 
> all results this is still slower as a priority queue has to be managed, too 
> (just smaller ones).
>
> > The query is very simple and contains only a single-term filter clause, all
> > unrelated parts of the application are disabled, no stored fields are
> > fetched, GC is doing minimal amount of work
> >  > 41c1-4af1-afcf-37d0c5f86054.png>
>
> Lucene never uses much heap space, so GC should always be low.
>
> Uwe
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
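
A bare-bones version of the collector Uwe describes might look like this
(the class and field names are illustrative, not from the original mail):

    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.search.ScoreMode;
    import org.apache.lucene.search.SimpleCollector;

    final class AllDocIdsCollector extends SimpleCollector {
      final java.util.List<Integer> docs = new java.util.ArrayList<>();
      private int docBase;

      @Override
      protected void doSetNextReader(LeafReaderContext context) {
        docBase = context.docBase; // remember this segment's offset
      }

      @Override
      public void collect(int doc) {
        docs.add(docBase + doc); // doc is relative to the current segment
      }

      @Override
      public ScoreMode scoreMode() {
        return ScoreMode.COMPLETE_NO_SCORES; // we never need scores
      }
    }

Run it with searcher.search(query, new AllDocIdsCollector()) and read the
collected ids back afterwards.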

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Issue with Japanese User Dictionary

2022-01-13 Thread Michael Sokolov
HI Marc, I wonder if there is a workaround for this issue: eg, could
we have entries for both widths? I wonder if there is some interaction
with an analysis chain that is doing half-width -> full-width
conversion (or vice versa)? I think the UserDictionary has to operate
on pre-analyzed tokens ... although maybe *after* char filtering,
which presumably could handle width conversions. A bunch of rambling,
but maybe the point is - can you share some more information -- what
is the full entry in the dictionary that causes the problem?

On Wed, Jan 12, 2022 at 7:04 PM Marc D'Mello  wrote:
>
> Hi,
>
> I had a question about the Japanese user dictionary. We have a user
> dictionary that used to work but after attempting to upgrade Lucene, it
> fails with the following error:
>
> Caused by: java.lang.RuntimeException: Illegal user dictionary entry レコーダー
> - the concatenated segmentation (レコーダー) does not match the surface form
> (レコーダー)
> at
> org.apache.lucene.analysis.ja.dict.UserDictionary.<init>(UserDictionary.java:123)
>
> The specific commit causing this error is here
> .
> The only thing that seems to differ is that the characters are full-width
> vs half-width, so I was wondering if this is intended behavior or a bug/too
> restrictive. Any suggestions for fixing this would be greatly appreciated!
> Thanks!

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Moving from lucene 6.x to 8.x

2022-01-13 Thread Michael Sokolov
I think the "broken offsets" refers to offsets of tokens "going
backwards". Offsets are attributes of tokens that refer back to their
byte position in the original indexed text. Going backwards means -- a
token with a greater position (in the sequence of tokens, or token
graph) should not have a lesser (or maybe it must be strictly
increasing I forget) offset. If you use term vectors, and have these
broken offsets, which should not but do often occur with custom
analysis chains, this could be a problem.

On Wed, Jan 12, 2022 at 12:36 AM Rahul Goswami  wrote:
>
> Thanks Vinay for the link to Erick's talk! I hadn't seen it and I must
> admit it did help put a few things into perspective.
>
> I was able to track down the JIRAs (thank you 'git blame')
> surrounding/leading up to this architectural decision and the linked
> patches:
> https://issues.apache.org/jira/browse/LUCENE-7703  (Record the version that
> was used at index creation time)
> https://issues.apache.org/jira/browse/LUCENE-7730  (Better encode length
> normalization in similarities)
> https://issues.apache.org/jira/browse/LUCENE-7837  (Use
> indexCreatedVersionMajor to fail opening too old indices)
>
> From these JIRAs what I was able to piece together is that if not
> reindexed, relevance scoring might act in unpredictable ways. For my use
> case, I can live with that since we provide an explicit sort on one or more
> fields.
>
> In LUCENE-7703, Adrien says "we will reject broken offsets in term vectors
> as of 7.0". So my questions to the community are
> i) What are these offsets, and what feature/s might break with respect to
> these offsets if not reindexed?
> ii) Do the length normalization changes in  LUCENE-7730 affect only
> relevance scores?
>
> I understand I could be playing with fire here, but reindexing is not a
> practical solution for my situation. At least not in the near future until
> I figure out a more seamless way of reindexing with minimal downtime given
> that there are multiple 1TB+ indexes. Would appreciate inputs from the dev
> community on this.
>
> Thanks,
> Rahul
>
> On Sun, Jan 9, 2022 at 2:41 PM Vinay Rajput 
> wrote:
>
> > Hi Rahul,
> >
> > I am not an expert so someone else might provide a better answer. However,
> > I remember
> > @Erick briefly talked about this restriction in one of his talks here:-
> > https://www.youtube.com/watch?v=eaQBH_H3d3g&t=621s (not sure if you have
> > seen it already).
> >
> > As he explains, earlier it looked like IndexUpgrader tool was doing the job
> > perfectly but it wasn't always the case. There is no guarantee that after
> > using the IndexUpgrader tool, your 8.x index will keep all of the
> > characteristics of lucene 8. There can be some situations (e.g. incorrect
> > offset) where you might get an incorrect relevance score which might be
> > difficult to trace and debug. So, Lucene developers now made it explicit
> > that what people were doing earlier was not ideal, and they should now plan
> > to reindex all the documents during the major upgrade.
> >
> > Having said that, what you have done can just work without any issue as
> > long as you don't encounter any odd sorting behavior. This may/may not be
> > super critical depending on the business use case and that is where you
> > might need to make a decision.
> >
> > Thanks,
> > Vinay
> >
> > On Sat, Jan 8, 2022 at 10:27 PM Rahul Goswami 
> > wrote:
> >
> > > Hello,
> > > Would appreciate any insights on the issue.Are there any backward
> > > incompatible changes in 8.x index because of which the lucene upgrader is
> > > unable to upgrade any index EVER touched by <= 6.x ? Or is the
> > restriction
> > > more of a safety net at this point for possible future incompatibilities
> > ?
> > >
> > > Thanks,
> > > Rahul
> > >
> > > On Thu, Jan 6, 2022 at 11:46 PM Rahul Goswami 
> > > wrote:
> > >
> > > > Hello,
> > > > I am using Apache Solr 7.7.2 with indexes which were originally created
> > > on
> > > > 4.8 and upgraded ever since. I recently tried upgrading to 8.x using
> > the
> > > > lucene IndexUpgrader tool and the upgrade fails. I know that lucene 8.x
> > > > prevents opening any segment which was touched by <= 6.x at any point
> > in
> > > > the past. I also know the general recommendation is to reindex upon
> > > > migration to another major release, however it is not always feasible.
> > > >
> > > > So I tried to remove the check for LATEST-1 in SegmentInfos.java (
> > > >
> > >
> > https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java#L321
> > > )
> > > > and also checked for other references to IndexFormatTooOldException.
> > > Turns
> > > > out that removing this check and rebuilding lucene-core lets the
> > upgrade
> > > go
> > > > through fine. I ran a full sequence of index upgrades from 5.x -> 6.x
> > ->
> > > > 7.x ->8.x. which went through fine. Also search/update operations work
> > > > without any issues in 8.x.
> > > >
> > 

Re: Lucene 9.0.0 inconsistent index options

2021-12-14 Thread Michael Sokolov
Strictly speaking, we could have opened an older index using Lucene 8
(say one that was created using Lucene 7, or 6) that would no longer
be valid in Lucene 9, at least according to the policy? I agree we
should try to fix this, just want to clarify the policy

On Tue, Dec 14, 2021 at 8:54 AM Adrien Grand  wrote:
>
> This looks related to the new changes around schema validation. Lucene
> now requires a field to either be absent from a document or be indexed
> with the exact same options (index options, points dimensions, norms,
> doc values type, etc.) as already indexed documents that also have
> this field.
>
> However it's a bug that Lucene fails to open an index that was legal
> in Lucene 8. Can you file a JIRA issue?
>
> On Mon, Dec 13, 2021 at 4:23 PM Ian Lea  wrote:
> >
> > Hi
> >
> >
> > We have a long-standing index with some mandatory fields and some optional
> > fields that has been through multiple lucene upgrades without a full
> > rebuild and on testing out an upgrade from version 8.11.0 to 9.0.0, when
> > open an IndexWriter we are hitting the exception
> >
> > Exception in thread "main" java.lang.IllegalArgumentException: cannot
> > change field "language" from index options=NONE to inconsistent index
> > options=DOCS
> > at
> > org.apache.lucene.index.FieldInfo.verifySameIndexOptions(FieldInfo.java:245)
> > at
> > org.apache.lucene.index.FieldInfos$FieldNumbers.verifySameSchema(FieldInfos.java:421)
> > at
> > org.apache.lucene.index.FieldInfos$FieldNumbers.addOrGet(FieldInfos.java:357)
> > at
> > org.apache.lucene.index.IndexWriter.getFieldNumberMap(IndexWriter.java:1263)
> > at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1116)
> >
> > Where language is one of our optional fields.
> >
> > Presumably this is at least somewhat related to "Index options can no
> > longer be changed dynamically" as mentioned at
> > https://lucene.apache.org/core/9_0_0/MIGRATE.html although it fails before
> > our code attempts to update the index, and we are not trying to change any
> > index options.
> >
> > Adding some displays to IndexWriter and FieldInfos and logging rather than
> > throwing the exception I see
> >
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=DOCS
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=NONE
> >  language curr=NONE, other=DOCS
> >  language curr=NONE, other=DOCS
> >  language curr=NONE, other=DOCS
> >  language curr=NONE, other=DOCS
> >  language curr=NONE, other=DOCS
> >  language curr=NONE, other=DOCS
> >  language curr=NONE, other=DOCS
> >  language curr=NONE, other=DOCS
> >
> > where there is one line per segment.  It logs the exception whenever
> > other=DOCS.  Subset with segment info:
> >
> > segment _x8(8.2.0):c31753/-1:[diagnostics={timestamp=1565623850605,
> > lucene.version=8.2.0, java.vm.version=11.0.3+7, java.version=11.0.3,
> > mergeMaxNumSegments=-1, os.version=3.1.0-1.2-desktop,
> > java.vendor=AdoptOpenJDK, source=merge, os.arch=amd64, mergeFactor=10,
> > java.runtime.version=11.0.3+7,
> > os=Linux}]:[attributes={Lucene50StoredFieldsFormat.mode=BEST_SPEED}]
> >
> >  language curr=NONE, other=NONE
> >
> > segment _y9(8.7.0):c43531/-1:[diagnostics={timestamp=1604597581562,
> > lucene.version=8.7.0, java.vm.version=11.0.3+7, java.version=11.0.3,
> > mergeMaxNumSegments=-1, os.version=3.1.0-1.2-desktop,
> > java.vendor=AdoptOpenJDK, source=merge, os.arch=amd64, mergeFactor=10,
> > java.runtime.version=11.0.3+7,
> > os=Linux}]:[attributes={Lucene87StoredFieldsFormat.mode=BEST_SPEED}]
> >
> >  language curr=NONE, other=DOCS
> >
> > NOT throwing java.lang.IllegalArgumentException: cannot change field
> > "language" from index options=NONE to inconsistent index options=DOCS
> >
> >
> > Some variation on an old-fashioned not set versus not present bug perhaps?
> >
> >
> > --
> > Ian.
>
>
>
> --
> Adrien
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A question on PhraseQuery and slop

2021-12-13 Thread Michael Sokolov
I wonder if the Analysis chain could be involved. If those stop words
("is") are removed without leaving a hole somehow, then that could
explain?
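
For anyone who wants to reproduce this, a minimal sketch of the query under
discussion (the field name is illustrative):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;

    // "quick fox" with slop 2; whether "the fox is quick" matches depends
    // on the token positions the analysis chain actually indexed
    PhraseQuery query = new PhraseQuery.Builder()
        .add(new Term("body", "quick"))
        .add(new Term("body", "fox"))
        .setSlop(2)
        .build();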

On Mon, Dec 13, 2021 at 9:35 AM Michael McCandless
 wrote:
>
> Hello Claude,
>
> Hmm, that is interesting that you see slop=2 matching query "quick fox"
> against document "the fox is quick".
>
> Edit distance (Levenshtein) is a bit tricky because it might include a
> transposition (just swapping the two words) as edit distance 1 OR 2.
>
> So maybe Lucene's PhraseQuery is counting transposition as edit distance 1,
> in which case, your test makes sense, and the javadocs are wrong?
>
> I am far from an expert on PhraseQuery :)  Does anyone know if we changed
> the behavior?  In any case, we must at least fix the javadocs.  Claude,
> maybe open a Jira issue (
> https://issues.apache.org/jira/projects/LUCENE/summary) and we can
> discuss there?
>
> Thank you for catching this!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Fri, Dec 10, 2021 at 8:47 AM Claude Lepere 
> wrote:
>
> > Hello.
> >
> >
> > The explanation of
> >
> > https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/search/PhraseQuery.html#getSlop
> > says that the edit distance between "quick fox" and "the fox is quick"
> > would be 3; this seems inaccurate to me.
> >
> > I don't know if the edit distance used by Lucene is the Levenshtein
> > distance (insertion, deletion, substitution, all of weight 1) - a standard
> > in information retrieval - but a test of "quick fox" PhraseQuery with a
> > slop of 2 hits the text "the fox is quick" (1 deletion + 1 insertion); the
> > slop does not have to be 3.
> >
> > I wonder if I'm right.
> >
> >
> > Claude Lepère, Belgium
> >
> > claudelep...@gmail.com
> >
> >
> >

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to change sorting *after* getting search results

2021-11-30 Thread Michael Sokolov
I think you are asking how to re-sort a result set returned from
IndexSearcher.search, ie a TopDocs? You can do this with one of the various
Rescorers. Have you looked at those?
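
For example, SortRescorer re-sorts an already-collected TopDocs without
re-running the query (the field name is illustrative, assuming it was
indexed with doc values and that `searcher` is your IndexSearcher):

    import org.apache.lucene.search.Rescorer;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.SortRescorer;
    import org.apache.lucene.search.TopDocs;

    Sort newSort = new Sort(new SortField("price", SortField.Type.LONG));
    Rescorer rescorer = new SortRescorer(newSort);
    // firstPassHits is the TopDocs you already collected
    TopDocs resorted = rescorer.rescore(searcher, firstPassHits,
        firstPassHits.scoreDocs.length);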

On Tue, Nov 30, 2021, 9:15 AM Luís Filipe Nassif 
wrote:

> Hi Lucene community,
>
> Our users could do very heavy searches and they are able to change the
> sorting criteria multiple times after getting the results. We collect all
> of them (this is important for our use case), disabling scoring if the
> result set is too large, to make the search faster.
> own multi-thread sorting code using DocValues (one instance per thread) to
> do this after the results are returned, so we don't have to run the heavy
> searches again.
>
> We are upgrading from Lucene 6.6 to 7.x and DocValues access is not random
> anymore, but our custom sorting code was based on that. So we are
> considering to stop using custom sorting and to use internal Lucene
> sorting, but we need to change the sorting multiple times after getting the
> TopDocs results. Is this possible? I searched the docs, but was just able
> to find out how to sort at the same time the search is done using
> IndexSearcher.search(..., Sort) methods.
>
> Thanks in advance,
> Luís Nassif
>


Re: Java 17 and Lucene

2021-10-26 Thread Michael Sokolov
Uwe, thanks for pointing out that ZGC is associated with all the
pauses you've observed. I'm feeling more confident now (since we are
generally using G1GC anyway, although sometimes experimenting with
other things). Indeed GC pauses have been much less of a problem since
we started using G1 to the point we don't worry about them much now. I
will say that with most of these so-called "pauseless" collectors it
seems they can really only achieve their promise by having a lot of
spare heap available, so if 200ms pauses are unacceptable this could
be a reason to run with a larger heap.

On Tue, Oct 26, 2021 at 1:14 PM Uwe Schindler  wrote:
>
> Hi,
>
> > Is this recommended "-XX:+UseZGC options to enable ZGC." as it claims very
> > low pauses.
>
> You may have seen my previous post that JDK 16, 17 and 18 have hangs on our 
> build server. All of those hanging builds have one thing in common: They are 
> running with ZGC. So my answer in short: Don’t use ZGC, which is anyways not 
> a good idea with Lucene. It reduces pauses, but on the other hand reduces 
> throughput by >10%. So IMHO, better use G1GC and have higher throughput. With 
> G1GC the average pauses are limited, too. But I would say, with common 
> workloads it is better to have 10% faster queries and maybe have some of them 
> wait 200 ms because of a pause!? If you have multiple replicas just 
> distribute your queries and the pause will be not really visible to many 
> people. And: Why is 200 ms response time bad if it happens seldom?
>
> In addition: Lucene does not apply pressure to garbage collector, so use low 
> heap space and use docvalues and other off-heap features of Lucene. Anybody 
> running Lucene/Solr/Elasticsearch with huge heap space does something wrong!
>
> Uwe
>
> > For "*DY* (2021-10-19 08:14:33): Upgrade to JDK17+35" execution for
> > "Indexing
> > throughput
> > <https://home.apache.org/~mikemccand/lucenebench/indexing.html>"
> > is ZGC used for the "Indexing throughput
> > <https://home.apache.org/~mikemccand/lucenebench/indexing.html>" test?
> >
> >
> > On Wed, Oct 20, 2021 at 8:27 AM Michael McCandless <
> > luc...@mikemccandless.com> wrote:
> >
> > > Nightly benchmarks managed to succeed (once, so far) on JDK 17:
> > > https://home.apache.org/~mikemccand/lucenebench/
> > >
> > > No obvious performance changes on quick look.
> > >
> > > Mike McCandless
> > >
> > > http://blog.mikemccandless.com
> > >
> > >
> > > On Tue, Oct 19, 2021 at 8:42 PM Gautam Worah
> > 
> > > wrote:
> > >
> > > > Thanks for the note of caution Uwe.
> > > >
> > > > > On our Jenkins server running with AMD Ryzen CPU it happens quite 
> > > > > often
> > > > that JDK 16, JDK 17 and JDK 18 hang during tests and stay unkillable
> > > (only
> > > > a hard kill with" kill -9")
> > > >
> > > > Scary stuff.
> > > > I'll try to reproduce the hang first and then try to get the JVM logs.
> > > I'll
> > > > respond back here if I find something useful.
> > > >
> > > > > Do you get this error in lucene:core:ecjLintMain and not during
> > > compile?
> > > > Then this is https://issues.apache.org/jira/browse/LUCENE-10185, solved
> > > > already.
> > > >
> > > > Ahh. I should've been clearer with my comment. The error we see is
> > > because
> > > > we have forked the class and have modified it a bit.
> > > > I just assumed that the upstream Lucene package would've also gotten
> > > errors
> > > > on the JDK17 build because it was untouched.
> > > >
> > > > -
> > > > Gautam Worah.
> > > >
> > > >
> > > > On Tue, Oct 19, 2021 at 5:07 AM Michael Sokolov 
> > > > wrote:
> > > >
> > > > > > I would a bit careful: On our Jenkins server running with AMD Ryzen
> > > CPU
> > > > > it happens quite often that JDK 16, JDK 17 and JDK 18 hang during 
> > > > > tests
> > > > and
> > > > > stay unkillable (only a hard kill with" kill -9"). Previous Java
> > > versions
> > > > > don't hang. It happens not all the time (about 1/4th of all builds) 
> > > > > and
> > > > due
> > > > > to the fact that the JVM is unresponsible it is not possible to get a
> > > > stack

Re: Java 17 and Lucene

2021-10-20 Thread Michael Sokolov
The "System Requirements" page for each release lists the JDK it was
built with and tested most extensively with; eg
https://lucene.apache.org/core/8_10_1/SYSTEM_REQUIREMENTS.html (JDK8
there; 9.0 will target JDK11)

That is pretty conservative but safe. Generally speaking we are always
testing cutting-edge JDKs, so you can be pretty confident about say
JDK11, but it's best to run your own tests, of course.

On Tue, Oct 19, 2021 at 8:19 PM Kevin Rosendahl
 wrote:
>
> Thank you all for the information, it's very useful. Seems like it's best
> to hold off on upgrading for now, but great to know that different JDK
> versions are at least being exercised in CI.
>
> I'm wondering, is there a better way to assess the production readiness of
> a Lucene/JDK combination than just emailing the user group, or is this our
> best bet in the future as well?
>
> Thanks again!
> Kevin
>
> On Tue, Oct 19, 2021 at 5:07 AM Michael Sokolov  wrote:
>
> > > I would a bit careful: On our Jenkins server running with AMD Ryzen CPU
> > it happens quite often that JDK 16, JDK 17 and JDK 18 hang during tests and
> > stay unkillable (only a hard kill with" kill -9"). Previous Java versions
> > don't hang. It happens not all the time (about 1/4th of all builds) and due
> > to the fact that the JVM is unresponsible it is not possible to get a stack
> > trace with "jstack". If you know a way to get the stack trace, I'd happy to
> > get help.
> >
> > ooh that sounds scary. I suppose one could maybe get core dumps using
> > the right signal and debug that way? Oh wait you said only 9 works,
> > darn! How about attaching using gdb? Do we maintain GC logs for these
> > Jenkins builds? Maybe something suspicious would show up there.
> >
> > By the way the JDK is absolutely "responsible" in this situation! Not
> > responsive maybe ...
> >
> > On Tue, Oct 19, 2021 at 4:46 AM Uwe Schindler  wrote:
> > >
> > > Hi,
> > >
> > > > Hey,
> > > >
> > > > Our team at Amazon Product Search recently ran our internal benchmarks
> > with
> > > > JDK 17.
> > > > We saw a ~5% increase in throughput and are in the process of
> > > > experimenting/enabling it in production.
> > > > We also plan to test the new Corretto Generational Shenandoah GC.
> > >
> > > I would a bit careful: On our Jenkins server running with AMD Ryzen CPU
> > it happens quite often that JDK 16, JDK 17 and JDK 18 hang during tests and
> > stay unkillable (only a hard kill with" kill -9"). Previous Java versions
> > don't hang. It happens not all the time (about 1/4th of all builds) and due
> > to the fact that the JVM is unresponsible it is not possible to get a stack
> > trace with "jstack". If you know a way to get the stack trace, I'd happy to
> > get help.
> > >
> > > Once I figured out what makes it hang, I will open issues in OpenJDK (I
> > am OpenJDK member/editor). I have now many stuck JVMs running to analyze on
> > the server, so you're invited to help! At the moment, I have no time to
> > take care, so any help is useful.
> > >
> > > > On a side note, the Lucene codebase still uses the deprecated (as of
> > > > JDK17) AccessController
> > > > in the RamUsageEstimator class.
> > > > We suppressed the warning for now (based on recommendations
> > > > <http://mail-archives.apache.org/mod_mbox/db-derby-
> > > > dev/202106.mbox/%3CJIRA.13369440.1617476525000.615331.16239514800
> > > > 5...@atlassian.jira%3E>
> > > > from the Apache Derby mailing list).
> > >
> > > This should not be an issue, because we compile Lucene with javac
> > parameter "--release 11", so it won't show any warning that you need to
> > suppress. Looks like your build system at Amazon is not the original one by
> > Lucene's Gradle, which shows no warnings at all.
> > >
> > > Uwe
> > >
> > > > Gautam Worah.
> > > >
> > > >
> > > > On Mon, Oct 18, 2021 at 3:02 PM Michael McCandless <
> > > > luc...@mikemccandless.com> wrote:
> > > >
> > > > > Also, I try to semi-aggressively upgrade Lucene's nightly benchmarks
> > to new
> > > > > JDK releases and leave an annotation on the nightly charts:
> > > > > https://home.apache.org/~mikemccand/lucenebench/
> > > > >
> > > > > I just now upgraded to JDK 17 and kicked off a new benchmark run ...
> > in a

Re: Java 17 and Lucene

2021-10-19 Thread Michael Sokolov
> I would a bit careful: On our Jenkins server running with AMD Ryzen CPU it 
> happens quite often that JDK 16, JDK 17 and JDK 18 hang during tests and stay 
> unkillable (only a hard kill with" kill -9"). Previous Java versions don't 
> hang. It happens not all the time (about 1/4th of all builds) and due to the 
> fact that the JVM is unresponsible it is not possible to get a stack trace 
> with "jstack". If you know a way to get the stack trace, I'd happy to get 
> help.

ooh that sounds scary. I suppose one could maybe get core dumps using
the right signal and debug that way? Oh wait you said only 9 works,
darn! How about attaching using gdb? Do we maintain GC logs for these
Jenkins builds? Maybe something suspicious would show up there.

By the way the JDK is absolutely "responsible" in this situation! Not
responsive maybe ...

On Tue, Oct 19, 2021 at 4:46 AM Uwe Schindler  wrote:
>
> Hi,
>
> > Hey,
> >
> > Our team at Amazon Product Search recently ran our internal benchmarks with
> > JDK 17.
> > We saw a ~5% increase in throughput and are in the process of
> > experimenting/enabling it in production.
> > We also plan to test the new Corretto Generational Shenandoah GC.
>
> I would a bit careful: On our Jenkins server running with AMD Ryzen CPU it 
> happens quite often that JDK 16, JDK 17 and JDK 18 hang during tests and stay 
> unkillable (only a hard kill with" kill -9"). Previous Java versions don't 
> hang. It happens not all the time (about 1/4th of all builds) and due to the 
> fact that the JVM is unresponsible it is not possible to get a stack trace 
> with "jstack". If you know a way to get the stack trace, I'd happy to get 
> help.
>
> Once I figured out what makes it hang, I will open issues in OpenJDK (I am 
> OpenJDK member/editor). I have now many stuck JVMs running to analyze on the 
> server, so you're invited to help! At the moment, I have no time to take 
> care, so any help is useful.
>
> > On a side note, the Lucene codebase still uses the deprecated (as of
> > JDK17) AccessController
> > in the RamUsageEstimator class.
> > We suppressed the warning for now (based on recommendations
> >  > dev/202106.mbox/%3CJIRA.13369440.1617476525000.615331.16239514800
> > 5...@atlassian.jira%3E>
> > from the Apache Derby mailing list).
>
> This should not be an issue, because we compile Lucene with javac parameter 
> "--release 11", so it won't show any warning that you need to suppress. Looks 
> like your build system at Amazon is not the original one by Lucene's Gradle, 
> which shows no warnings at all.
>
> Uwe
>
> > Gautam Worah.
> >
> >
> > On Mon, Oct 18, 2021 at 3:02 PM Michael McCandless <
> > luc...@mikemccandless.com> wrote:
> >
> > > Also, I try to semi-aggressively upgrade Lucene's nightly benchmarks to 
> > > new
> > > JDK releases and leave an annotation on the nightly charts:
> > > https://home.apache.org/~mikemccand/lucenebench/
> > >
> > > I just now upgraded to JDK 17 and kicked off a new benchmark run ... in a
> > > few hours it should show the new data points and then I'll try to remember
> > > to annotate it tomorrow.
> > >
> > > So let's see whether nightly benchmarks uncover any performance changes
> > > from JDK17 :)
> > >
> > > Mike McCandless
> > >
> > > http://blog.mikemccandless.com
> > >
> > >
> > > On Mon, Oct 18, 2021 at 5:36 PM Robert Muir  wrote:
> > >
> > > > We test different releases on different platforms (e.g. Linux, Windows,
> > > > Mac).
> > > > We also test EA (Early Access) releases of openjdk versions during the
> > > > development process.
> > > > This finds bugs before they get released.
> > > >
> > > > More information about versions/EA testing: https://jenkins.thetaphi.de/
> > > >
> > > > On Mon, Oct 18, 2021 at 5:33 PM Kevin Rosendahl
> > > >  wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > We are using Lucene 8 and planning to upgrade from Java 11 to Java 17.
> > > We
> > > > > are curious:
> > > > >
> > > > >- How lucene is testing against java versions. Are there 
> > > > > correctness
> > > > and
> > > > >performance tests using java 17?
> > > > >   - Additionally, besides Java 17, how are new Java releases
> > > tested?
> > > > >- Are there any other orgs using Java 17 with Lucene?
> > > > >- Any other considerations we should be aware of?
> > > > >
> > > > >
> > > > > Best,
> > > > > Kevin Rosendahl
> > > >
> > > > -
> > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > > >
> > > >
> > >
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Using setIndexSort on a binary field

2021-10-17 Thread Michael Sokolov
Yeah, index sorting doesn't do that -- it sorts *within* each segment
so that when documents are iterated (within that segment) by any of
the many DocIdSetIterators that underlie the Lucene search API, they
are retrieved in the order specified (which is then also docid order).
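
For reference, a minimal sketch of how an index sort is configured (the
field name is illustrative; the sort field must be indexed as doc values,
e.g. a SortedDocValuesField for a STRING sort, and `analyzer` is assumed
to exist):

    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;

    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    // orders documents by "key" *within* each segment, not across segments
    iwc.setIndexSort(new Sort(new SortField("key", SortField.Type.STRING)));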

To achieve what you want you would have to tightly control the
indexing process. For example you could configure a NoMergePolicy to
prevent the segments you manually create from being merged, set a very
large RAM buffer size on the index writer so it doesn't unexpectedly
flush a segment while you're indexing, and then index documents in the
sequence you want to group them by, committing after each block of
documents. But this is a very artificial setup; it wouldn't survive
any normal indexing workflow where merges are allowed, documents may
be updated, etc.

For testing purposes we've recently added the ability to rearrange the
index (IndexRearranger) according to a specific assignment of docids
to segments - you could apply this to an existing index. But again,
this is not really intended for use in a production on-line index that
receives updates.

On Fri, Oct 15, 2021 at 1:27 PM Alex K  wrote:
>
> Thanks Adrien. This makes me think I might not be understanding the use
> case for index sorting correctly. I basically want to make it so that my
> terms are sorted across segments. For example, let's say I have integer
> terms 1 to 100 and 10 segments. I'd like terms 1 to 10 to occur in segment
> 1, terms 11 to 20 in segment 2, terms 21 to 30 in segment 3, and so on.
> With default indexing settings, I see terms duplicated across segments. I
> thought index sorting was the way to achieve this, but the use of doc
> values makes me think it might actually be used for something else? Is
> something like what I described possible? Any clarification would be great.
> Thanks,
> Alex
>
>
> On Fri, Oct 15, 2021 at 12:43 PM Adrien Grand  wrote:
>
> > Hi Alex,
> >
> > You need to use a BinaryDocValuesField so that the field is indexed with
> > doc values.
> >
> > `Field` is not going to work because it only indexes the data while index
> > sorting requires doc values.
> >
> > On Fri, Oct 15, 2021 at 6:40 PM Alex K  wrote:
> >
> > > Hi all,
> > >
> > > Could someone point me to an example of using the
> > > IndexWriterConfig.setIndexSort for a field containing binary values?
> > >
> > > To be specific, the fields are constructed using the Field(String name,
> > > byte[] value, IndexableFieldType type) constructor, and I'd like to try
> > > using the java.util.Arrays.compareUnsigned method to sort the fields.
> > >
> > > Thanks,
> > > Alex
> > >
> >
> >
> > --
> > Adrien
> >

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Search while typing (incremental search)

2021-10-08 Thread Michael Sokolov
Thank you for offering to add to the FAQ! Indeed it should mention the
suggester capability. I think you have permissions to edit that wiki?
Please go ahead, and I think also add a link to the suggest module javadocs.
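
For the FAQ entry, a tiny example of the suggester approach Robert describes
below might be worth including too. Just a sketch, assuming a plain-text file
of logged queries, one per line (the paths are made up):

    Directory dir = FSDirectory.open(Paths.get("suggest-index"));
    AnalyzingInfixSuggester suggester =
        new AnalyzingInfixSuggester(dir, new StandardAnalyzer());
    suggester.build(new PlainTextDictionary(Paths.get("query-log.txt")));
    // top 5 query suggestions for the partial input "tes"
    List<Lookup.LookupResult> top = suggester.lookup("tes", false, 5);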

On Thu, Oct 7, 2021 at 2:30 AM Michael Wechner
 wrote:
>
> Thanks very much for your feedback!
>
> I will try it :-)
>
> As I wrote I would like to add a summary to the Lucene FAQ
> (https://cwiki.apache.org/confluence/display/lucene/lucenefaq)
>
> Would the following questions make sense?
>
>   - "Does Lucene support incremental search?"
>
>   - "Does Lucene support auto completion suggestions?"
>
> Or would other terms or another wording make more sense?
>
> Thanks
>
> Michael
>
>
>
> Am 07.10.21 um 01:14 schrieb Robert Muir:
> > TLDR: use the lucene suggest/ package. Start with building suggester
> > from your query logs (either a file or index them).
> > These have a lot of flexibility about how the matches happen, for
> > example pure prefixes, edit distance typos, infix matching, analysis
> > chain, even now Japanese input-method integration :)
> >
> > Run that suggester on the user input, retrieving say, the top 5-10
> > matches of relevant query suggestions.
> > return those in the UI (typical autosuggest-type field), but also run
> > a search on the first one.
> >
> > The user gets the instant-search experience, but when they type 'tes',
> > you search on 'tesla' (if that's the top-suggested query, the
> > highlighted one in the autocomplete). if they arrow-down to another
> > suggestion such as 'test' or type a 't' or use the mouse or whatever,
> > then the process runs again and they see the results for that.
> >
> > IMO for most cases this leads to a saner experience than trying to
> > rank all documents based on a prefix 'tes': the problem is there is
> > still too much query ambiguity, not really any "keywords" yet, so
> > trying to rank those documents won't be very useful. Instead you try
> > to "interact" with the user to present results in a useful way that
> > they can navigate.
> >
> > On the other hand if you really want to just search on prefixes and
> > jumble up the results (perhaps because you are gonna just sort by some
> > custom document feature instead of relevance), then you can do that if
> > you really want. You can use the n-gram/edge-ngram/shingle filters in
> > the analysis package for that.
> >
> > On Wed, Oct 6, 2021 at 5:37 PM Michael Wechner
> >  wrote:
> >> Hi
> >>
> >> I am trying to implement a search with Lucene similar to what, for
> >> example, various "Note Apps" (e.g. "Google Keep" or "Samsung Notes") are
> >> offering, where with every new letter typed a new search is executed.
> >>
> >> For example, when I type "tes", all documents containing the word "test"
> >> or "tesla" are returned, and when I continue typing, for example "tesö",
> >> and there are no documents containing the string "tesö", the app tells me
> >> that there are no matches.
> >>
> >> I have found a couple of articles related to this kind of search, for
> >> example
> >>
> >> https://stackoverflow.com/questions/10828825/incremental-search-using-lucene
> >>
> >> https://stackoverflow.com/questions/120180/how-to-do-query-auto-completion-suggestions-in-lucene
> >>
> >> but would be great to know whether there exist other possibilities or
> >> what the best practice is?
> >>
> >> I am not even sure what the right term for this kind of search is; is it
> >> really "incremental search" or something else?
> >>
> >> Looking forward to your feedback and will be happy to extend the Lucene
> >> FAQ once I understand better :-)
> >>
> >> Thanks
> >>
> >> Michael
> >>
> >> -
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Querying into a Collector visits documents multiple times

2021-09-24 Thread Michael Sokolov
Ah sorry, never mind; I confused Collector and CollectorManager.
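
For the archives, what I meant: with IndexSearcher's concurrent search you
pass a CollectorManager, which gives each slice its own collector and
combines them at the end, so there is no shared mutable sum. A rough sketch,
where SumCollector stands for Steven's collector below minus the
docBase-reset hack, with its `sum` field exposed:

    class SumCollectorManager implements CollectorManager<SumCollector, Long> {
        @Override
        public SumCollector newCollector() {
            return new SumCollector();        // one private sum per slice
        }

        @Override
        public Long reduce(Collection<SumCollector> collectors) {
            long total = 0;
            for (SumCollector c : collectors) {
                total += c.sum;               // combine per-slice sums at the end
            }
            return total;
        }
    }

    long sum = searcher.search(myQuery, new SumCollectorManager());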

On Fri, Sep 24, 2021, 6:51 AM Michael Sokolov  wrote:

> Separate issue, but this collector is not going to work with concurrent
> search since the sum is not updated in a thread safe manner. Maybe you
> don't care, since you don't use a thread pool to execute your queries, but
> you probably should!
>
> On Wed, Sep 22, 2021, 8:38 AM Adrien Grand  wrote:
>
>> Hi Steven,
>>
>> This collector looks correct to me. Resetting the counter to 0 on the
>> first segment is indeed not necessary.
>>
>> We have plenty of collectors that are very similar to this one and we never
>> observed any double-counting issue. I would suspect an issue in the code
>> that calls this collector. Maybe try to print the stack trace under the `
>> if (context.docBase == 0) {` check to see why your collector is being
>> called twice?
>>
>> On Tue, Sep 21, 2021 at 9:30 PM Steven Schlansker <
>> stevenschlans...@gmail.com> wrote:
>>
>> > Hi Lucene users,
>> >
>> > I am developing a search application that needs to do some basic
>> > summary statistics. We use Lucene 8.9.0.
>> > To improve performance for e.g. summing a value across 10,000
>> > documents, we are using DocValues as columnar storage.
>> >
>> > In order to retrieve the DocValues without collecting all hits into a
>> > TopDocs, which we determined to cause a lot of memory pressure and
>> > consume much time, we are using the expert Collector query interface.
>> >
>> > Here's the code, simplified a bit for the list:
>> >
>> > final var collector = new Collector() {
>> > long sum = 0;
>> >
>> > @Override
>> > public ScoreMode scoreMode() {
>> > return ScoreMode.COMPLETE_NO_SCORES;
>> > }
>> >
>> > @Override
>> > public LeafCollector getLeafCollector(final LeafReaderContext
>> > context) throws IOException {
>> >  if (context.docBase == 0) {
>> > sum = 0; // XXX: this should not be necessary?
>> > }
>> > final var subtotalValue =
>> > context.reader().getNumericDocValues("subtotal");
>> > return new LeafCollector() {
>> > @Override
>> > public void setScorer(final Scorable scorer) throws
>> > IOException {
>> > }
>> >
>> > @Override
>> > public void collect(final int doc) throws IOException {
>> > if (subtotalValue.docID() > doc ||
>> > !subtotalValue.advanceExact(doc) || subtotalValue.longValue() == 0) {
>> > return;
>> > }
>> > sum += subtotalValue.longValue();
>> > }
>> > };
>> > }
>> > }
>> > searcher.search(myQuery, collector);
>> > return collector.sum;
>> >
>> > The query is a moderately complicated Boolean query with some
>> > TermQuery and MultiTermQuery instances combined together.
>> > While first testing, I observed that seemingly the collector is called
>> > twice for each document, and the sum is exactly double what you would
>> > expect.
>> >
>> > It seems that the Collector is observing every matched document twice,
>> > and by printing out the Scorer, I see that it's done with two
>> > different BooleanScorer instances.
>> > You can see my hack that resets the collector every time it starts at
> >> > docBase 0, which I am sure is not the right approach, but seems to
>> > work.
>> > What is the right pattern to ensure my Collector only observes result
>> > documents once, no matter the input query? I see a note in the
>> > documentation that state is supposed to be stored on the Scorer
>> > implementation, but I am not providing a custom Scorer, nor do I
>> > actually want any scoring at all.
>> >
>> > Thank you for any guidance!
>> > Steven
>> >
>> > -
>> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >
>> >
>>
>> --
>> Adrien
>>
>


Re: Querying into a Collector visits documents multiple times

2021-09-24 Thread Michael Sokolov
Separate issue, but this collector is not going to work with concurrent
search since the sum is not updated in a thread safe manner. Maybe you
don't care, since you don't use a thread pool to execute your queries, but
you probably should!

On Wed, Sep 22, 2021, 8:38 AM Adrien Grand  wrote:

> Hi Steven,
>
> This collector looks correct to me. Resetting the counter to 0 on the first
> segment is indeed not necessary.
>
> We have plenty of collectors that are very similar to this one and we never
> observed any double-counting issue. I would suspect an issue in the code
> that calls this collector. Maybe try to print the stack trace under the `
> if (context.docBase == 0) {` check to see why your collector is being
> called twice?
>
> On Tue, Sep 21, 2021 at 9:30 PM Steven Schlansker <
> stevenschlans...@gmail.com> wrote:
>
> > Hi Lucene users,
> >
> > I am developing a search application that needs to do some basic
> > summary statistics. We use Lucene 8.9.0.
> > To improve performance for e.g. summing a value across 10,000
> > documents, we are using DocValues as columnar storage.
> >
> > In order to retrieve the DocValues without collecting all hits into a
> > TopDocs, which we determined to cause a lot of memory pressure and
> > consume much time, we are using the expert Collector query interface.
> >
> > Here's the code, simplified a bit for the list:
> >
> > final var collector = new Collector() {
> > long sum = 0;
> >
> > @Override
> > public ScoreMode scoreMode() {
> > return ScoreMode.COMPLETE_NO_SCORES;
> > }
> >
> > @Override
> > public LeafCollector getLeafCollector(final LeafReaderContext
> > context) throws IOException {
> >  if (context.docBase == 0) {
> > sum = 0; // XXX: this should not be necessary?
> > }
> > final var subtotalValue =
> > context.reader().getNumericDocValues("subtotal");
> > return new LeafCollector() {
> > @Override
> > public void setScorer(final Scorable scorer) throws
> > IOException {
> > }
> >
> > @Override
> > public void collect(final int doc) throws IOException {
> > if (subtotalValue.docID() > doc ||
> > !subtotalValue.advanceExact(doc) || subtotalValue.longValue() == 0) {
> > return;
> > }
> > sum += subtotalValue.longValue();
> > }
> > };
> > }
> > }
> > searcher.search(myQuery, collector);
> > return collector.sum;
> >
> > The query is a moderately complicated Boolean query with some
> > TermQuery and MultiTermQuery instances combined together.
> > While first testing, I observed that seemingly the collector is called
> > twice for each document, and the sum is exactly double what you would
> > expect.
> >
> > It seems that the Collector is observing every matched document twice,
> > and by printing out the Scorer, I see that it's done with two
> > different BooleanScorer instances.
> > You can see my hack that resets the collector every time it starts at
> > > docBase 0, which I am sure is not the right approach, but seems to
> > work.
> > What is the right pattern to ensure my Collector only observes result
> > documents once, no matter the input query? I see a note in the
> > documentation that state is supposed to be stored on the Scorer
> > implementation, but I am not providing a custom Scorer, nor do I
> > actually want any scoring at all.
> >
> > Thank you for any guidance!
> > Steven
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>
> --
> Adrien
>


Re: Adding vs multiplicating scores when implementing "recency"

2021-09-17 Thread Michael Sokolov
ah, thanks for the explanation
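
For anyone finding this later: Adrien's log trick below can be wired up with
the expressions module rather than a hand-written query wrapper. A sketch,
where `bm25Query` is a placeholder for the query being wrapped; note that
scores below 1 go negative under ln(), which WAND disallows:

    SimpleBindings bindings = new SimpleBindings();
    bindings.add("_score", DoubleValuesSource.SCORES);
    Expression ln = JavascriptCompiler.compile("ln(_score)");
    Query logScores = new FunctionScoreQuery(bm25Query, ln.getDoubleValuesSource(bindings));
    // summing several such clauses in a BooleanQuery multiplies the raw scores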

On Fri, Sep 17, 2021 at 10:11 AM Adrien Grand  wrote:
>
> This is one requirement indeed. Since WAND reasons about partially
> evaluated documents, it also requires that matching one more clause makes
> the overall score higher, which is why we introduced the requirement that
> scores must be positive in 8.0. For multiplication, this would require
> scores that are greater than 1.
>
> If someone really wanted to multiply scores, the easiest way might be to
> create a query wrapper that takes the log of the scores of the wrapped
> query, and rely on log(a)+log(b) = log(a * b).
>
> Le ven. 17 sept. 2021 à 14:47, Michael Sokolov  a
> écrit :
>
> > Not advocating any particular approach here, just curious: could BMW
> > also function in the presence of a doc-score (like recency) that is
> > multiplied? My vague understanding is that as long as the scoring
> > formula is monotonic in all of its inputs, and we have block-encoded
> > the inputs, then we could compute a max score for a block?
> >
> > On Thu, Sep 16, 2021 at 12:41 PM Adrien Grand  wrote:
> > >
> > > Hello,
> > >
> > > You are correct that the contribution would be additive in that case. We
> > > don't provide an easy way to make the contribution multiplicative.
> > >
> > > There is some debate about what is the best way to combine BM25 scores
> > with
> > > query-independent features, though in the discussions I've seen
> > > contributions were summed up and the debate was more about whether they
> > > should be normalized or not.
> > >
> > > How much recency impacts ranking indeed depends on the number of terms
> > and
> > > how frequent these terms are. One way that I'm interpreting the fact that
> > > not everyone recommends normalizing scores is that this way the query
> > score
> > > dominates when the query is looking for something very specific, because
> > it
> > > includes many terms or because it uses very specific terms - which may
> > be a
> > > feature. This approach also works well for Lucene since dynamic pruning
> > via
> > > Block-Max WAND keeps working when query-independent features are
> > > incorporated into the final score, which helps figure out the top hits
> > > without having to collect all matches.
> > >
> > > On Thu, Sep 16, 2021 at 5:40 PM Nicolás Lichtmaier
> > >  wrote:
> > >
> > > > In March I asked a question here that got no answers at all. As it is
> > > > still something that I'd very much like to know, I'll ask again.
> > > >
> > > > To implement "recency" in a search you would add a boolean clause with
> > > > a LongPoint.newDistanceFeatureQuery(), right? But that's additive,
> > > > meaning that this recency will impact searches with different numbers
> > > > of terms differently, right? With more terms, the recency component's
> > > > contribution to the score will be more and more "diluted". However... I
> > > > only see examples done this way, and I would need to do something
> > > > weird to implement a multiplicative change of the score... Am I missing
> > > > something?
> > > >
> > > > Thanks!
> > > >
> > > >
> > > > -
> > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > > >
> > > >
> > >
> > > --
> > > Adrien
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Adding vs multiplicating scores when implementing "recency"

2021-09-17 Thread Michael Sokolov
Not advocating any particular approach here, just curious: could BMW
also function in the presence of a doc-score (like recency) that is
multiplied? My vague understanding is that as long as the scoring
formula is monotonic in all of its inputs, and we have block-encoded
the inputs, then we could compute a max score for a block?

On Thu, Sep 16, 2021 at 12:41 PM Adrien Grand  wrote:
>
> Hello,
>
> You are correct that the contribution would be additive in that case. We
> don't provide an easy way to make the contribution multiplicative.
>
> There is some debate about what is the best way to combine BM25 scores with
> query-independent features, though in the discussions I've seen
> contributions were summed up and the debate was more about whether they
> should be normalized or not.
>
> How much recency impacts ranking indeed depends on the number of terms and
> how frequent these terms are. One way that I'm interpreting the fact that
> not everyone recommends normalizing scores is that this way the query score
> dominates when the query is looking for something very specific, because it
> includes many terms or because it uses very specific terms - which may be a
> feature. This approach also works well for Lucene since dynamic pruning via
> Block-Max WAND keeps working when query-independent features are
> incorporated into the final score, which helps figure out the top hits
> without having to collect all matches.
>
> On Thu, Sep 16, 2021 at 5:40 PM Nicolás Lichtmaier
>  wrote:
>
> > In March I asked a question here that got no answers at all. As it is
> > still something that I'd very much like to know, I'll ask again.
> >
> > To implement "recency" in a search you would add a boolean clause with
> > a LongPoint.newDistanceFeatureQuery(), right? But that's additive,
> > meaning that this recency will impact searches with different numbers of
> > terms differently, right? With more terms, the recency component's
> > contribution to the score will be more and more "diluted". However... I only
> > see examples done this way, and I would need to do something
> > weird to implement a multiplicative change of the score... Am I missing
> > something?
> >
> > Thanks!
> >
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>
> --
> Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: currency based search using query time calculated field match with expression

2021-09-03 Thread Michael Sokolov
Sorry I'm not sure I understand what you're trying to do. Maybe you
want to match a document having a computed value? This is going to be
potentially costly, potentially requiring post-filtering of all hits
matching for other reasons. I think there is a
FunctionQuery/FunctionRangeQuery that might help, but I don't have
much experience with this API, so I'm not sure. If you want useful
suggestions, you need to be much more explicit about your use case,
what you've tried, why it didn't work, etc.
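
That said, one concrete shape this could take, if PRICE and EXCHANGE_RATE are
indexed as doc-values fields: compute the USD value per document with an
expression and match on it with FunctionMatchQuery. A sketch only (the field
names come from your mail, the tolerance is made up, and it carries exactly
the cost caveat above, since it inspects every document with values):

    Expression usd = JavascriptCompiler.compile("PRICE / EXCHANGE_RATE");
    SimpleBindings bindings = new SimpleBindings();
    bindings.add("PRICE", DoubleValuesSource.fromDoubleField("PRICE"));
    bindings.add("EXCHANGE_RATE", DoubleValuesSource.fromDoubleField("EXCHANGE_RATE"));
    // match documents whose computed USD price is within half a cent of 2 USD
    Query q = new FunctionMatchQuery(usd.getDoubleValuesSource(bindings),
                                     v -> Math.abs(v - 2.0) < 0.005);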

On Fri, Sep 3, 2021 at 6:08 AM Kumaran Ramasubramanian
 wrote:
>
> Hi Michael, Thanks for the response.
>
> Based on my understanding, we can use the expressions module in Lucene to
> reorder search results using custom score calculations, based on an
> expression over stored fields.
>
> But I am not sure how to do the same for Lucene document hits (doc hits
> matching 2 USD with 150 INR records). Any pointers for learning about this
> in detail?
>
>
> Kumaran R
> Chennai, India
>
>
>
> On Fri, Sep 3, 2021 at 12:08 AM Michael Sokolov  wrote:
>
> > Have you looked at the expressions module? It provides support for
> > user-defined computation using values from the index based on a simple
> > expression language. It might prove useful to you if the exchange rate
> > needs to be tracked very dynamically.
> >
> > On Thu, Sep 2, 2021 at 2:15 PM Kumaran Ramasubramanian
> >  wrote:
> > >
> > > I have a use case regarding currency-based search, and I would like to
> > > get any suggestions or pointers.
> > >
> > > For example,
> > > Assume,
> > > 1USD = 75 INR
> > > 1USD = 42190 IRR
> > > similarly, we have support for 100 currencies as of now.
> > >
> > > Record1 created with PRICE 150 INR & EXCHANGE_RATE 75 for USD
> > > Record2 created with PRICE 84380 IRR & EXCHANGE_RATE 42190 for USD
> > >
> > > If i search 2 ( USD ), I would like to get both Record1 & Record2 as
> > search
> > > results
> > >
> > > PRICE & EXCHANGE_RATE are indexed & stored as separate fields in the
> > search
> > > index
> > > We can have 50 currency fields like PRICE, so we may need to
> > > index an additional 50 fields holding USD values.
> > >
> > > To avoid additional fields, Is it possible to match records in the search
> > > index by applying an expression like (PRICE / EXCHANGE_RATE )
> > >
> > > I am not sure if this is the right use case for Lucene index. But I would
> > > like to know the possibilities. Thanks in advance
> > >
> > >
> > > --
> > > Kumaran R
> > > Chennai, India
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: currency based search using query time calculated field match with expression

2021-09-02 Thread Michael Sokolov
Have you looked at the expressions module? It provides support for
user-defined computation using values from the index based on a simple
expression language. It might prove useful to you if the exchange rate
needs to be tracked very dynamically.

On Thu, Sep 2, 2021 at 2:15 PM Kumaran Ramasubramanian
 wrote:
>
> I have a use case regarding currency-based search, and I would like to get
> any suggestions or pointers.
>
> For example,
> Assume,
> 1USD = 75 INR
> 1USD = 42190 IRR
> similarly, we have support for 100 currencies as of now.
>
> Record1 created with PRICE 150 INR & EXCHANGE_RATE 75 for USD
> Record2 created with PRICE 84380 IRR & EXCHANGE_RATE 42190 for USD
>
> If i search 2 ( USD ), I would like to get both Record1 & Record2 as search
> results
>
> PRICE & EXCHANGE_RATE are indexed & stored as separate fields in the search
> index
> We can have 50 currency fields like PRICE, so we may need to
> index an additional 50 fields holding USD values.
>
> To avoid additional fields, Is it possible to match records in the search
> index by applying an expression like (PRICE / EXCHANGE_RATE )
>
> I am not sure if this is the right use case for Lucene index. But I would
> like to know the possibilities. Thanks in advance
>
>
> --
> Kumaran R
> Chennai, India

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene cpu utilization & scoring

2021-08-20 Thread Michael Sokolov
I think the usual usage pattern is to *refresh* frequently and commit
less frequently. Is there a reason you need to commit often?
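
The usual shape of that, sketched here (the intervals are arbitrary and error
handling is elided):

    SearcherManager mgr = new SearcherManager(writer, new SearcherFactory());
    ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);
    // cheap: make recent updates visible to searches every second
    pool.scheduleWithFixedDelay(() -> {
        try {
            mgr.maybeRefresh();
        } catch (IOException e) {
            // log and continue
        }
    }, 1, 1, TimeUnit.SECONDS);
    // expensive: commit for durability much less often, e.g. every few minutes
    writer.commit();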

You may also have overlooked this newish method: MergePolicy.findFullFlushMerges

If you implement that, you can tell IndexWriter to (for example) merge
multiple small segments on commit, which may be piling up given
frequent commits, and if you are indexing across multiple threads. We
found this can help reduce the number of segments, and the variability
in the number of segments.  I don't know if that is truly a root cause
of your performance problems here though.

Regarding scoring costs -I don't think creating dummy Weight and
Scorer will do what you think - Scorers are doing matching in fact as
well as scoring. You won't get any results if you don't have any real
Scorer.

I *think* that setting needsScores() to false should disable work done
to compute relevance scores - you can confirm by looking at the scores
you get back with your hits - are they all zero? Also, we did
something similar in our system, and then later re-enabled scoring,
and it did not add significant cost for us. YMMV, but are you sure the
costs you are seeing are related to computing scores and not required
for matching?
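
One way to sanity-check that on 6.5 - a sketch of a collector that opts out
of scoring; if the profile still shows time in score() with this, the cost is
coming from matching rather than scoring:

    searcher.search(query, new SimpleCollector() {
        @Override
        public boolean needsScores() {
            return false;   // tells Lucene not to compute relevance scores
        }

        @Override
        public void collect(int doc) {
            // record the hit; never call the scorer here
        }
    });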

-Mike

On Fri, Aug 20, 2021 at 2:02 PM Varun Sharma
 wrote:
>
> Hi,
>
> We have a large index that we divide into X lucene indices - we use lucene
> 6.5.0. Each of our serving machines serves 8 Lucene indices in parallel.
> We are getting realtime updates to each of these 8 indices. We are seeing a
> couple of things:
>
> a) When we turn off realtime updates, performance is significantly better.
> When we turn on realtime updates, due to accumulation of segments - CPU
> utilization by lucene goes up by at least *3X* [based on profiling].
>
> b)  A profile shows that the vast majority of time is being spent in
> scoring methods even though we are setting *needsScores() to false* in our
> collectors.
>
> We do commit our index frequently and we are roughly at ~25 segments per
> index - so a total of 8 * 25 ~ 200 segments across all the 8 indices.
>
> Changing the number of 8 indices per machine to reduce the number of
> segments is a significant effort. So, we would like to know if there are
> ways to improve performance, w.r.t a) & b)
>
> i) We have tried some parameters with the merge policy &
> NRTCachingDirectory and they did not help significantly
> ii) Since we don't care about Lucene-level scores, is there a way to
> completely disable scoring? Should setting needsScores() to false in our
> collectors do the trick? Should we create our own dummy weight/scorer and
> inject it into the Query classes?
>
> Thanks
> Varun

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Index backwards compatibility

2021-05-27 Thread Michael Sokolov
... should *reindex* (not update)

On Thu, May 27, 2021 at 10:39 AM Michael Sokolov  wrote:
>
> LGTM, but perhaps it should also state that if possible you *should*
> update because the 8.x index may not be able to be read by the
> eventual 10 release.
>
> On Thu, May 27, 2021 at 7:52 AM Michael Wechner
>  wrote:
> >
> > I have added a QnA
> >
> > https://cwiki.apache.org/confluence/display/LUCENE/LuceneFAQ#LuceneFAQ-WhenIupradeLucene,forexamplefrom8.8.2to9.0.0,doIhavetoreindex?
> >
> > Hope that makes sense, otherwise let me know and I can correct/update :-)
> >
> >
> >
> > Am 26.05.21 um 23:56 schrieb Michael Wechner:
> > > using lucene-backward-codecs-9.0.0-SNAPSHOT.jar makes it work :-)
> > >
> > > Thank you very much!
> > >
> > > But IIUC it is recommended to reindex when upgrading, right? I guess
> > > similar to what Solr is recommending
> > >
> > > https://solr.apache.org/guide/8_0/reindexing.html
> > >
> > >
> > > Am 26.05.21 um 21:26 schrieb Michael Sokolov:
> > >> I think you need backward-codecs-9.0.0-SNAPSHOT there. It enables 9.0
> > >> to read 8.x indexes.
> > >>
> > >> On Wed, May 26, 2021 at 9:27 AM Michael Wechner
> > >>  wrote:
> > >>> Hi
> > >>>
> > >>> I am using Lucene 8.8.2 in production and I am currently doing some
> > >>> tests using 9.0.0-SNAPSHOT, whereas I have included
> > >>> lucene-backward-codecs, because in the log files it was asking me
> > >>> whether I have forgotten to include lucene-backward-codecs.jar
> > >>>
> > >>>   <dependency>
> > >>>     <groupId>org.apache.lucene</groupId>
> > >>>     <artifactId>lucene-core</artifactId>
> > >>>     <version>9.0.0-SNAPSHOT</version>
> > >>>   </dependency>
> > >>>   <dependency>
> > >>>     <groupId>org.apache.lucene</groupId>
> > >>>     <artifactId>lucene-queryparser</artifactId>
> > >>>     <version>9.0.0-SNAPSHOT</version>
> > >>>   </dependency>
> > >>>   <dependency>
> > >>>     <groupId>org.apache.lucene</groupId>
> > >>>     <artifactId>lucene-backward-codecs</artifactId>
> > >>>     <version>8.8.2</version>
> > >>>   </dependency>
> > >>>
> > >>> But when querying index directories created with Lucene 8.8.2, then I
> > >>> receive the following error
> > >>>
> > >>> java.lang.NoClassDefFoundError: Could not initialize class
> > >>> org.apache.lucene.codecs.Codec$Holder
> > >>>
> > >>> I am not sure whether I understand the backwards compatibility page
> > >>> correctly
> > >>>
> > >>> https://cwiki.apache.org/confluence/display/LUCENE/BackwardsCompatibility
> > >>>
> > >>>
> > >>> but I guess version 9 will not be backwards compatible to version 8? Or
> > >>> should I do something different?
> > >>>
> > >>> Thanks
> > >>>
> > >>> Michael
> > >>>
> > >>> -
> > >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > >>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >>>
> > >> -
> > >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >>
> > >
> > >
> > > -
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> >
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Index backwards compatibility

2021-05-27 Thread Michael Sokolov
LGTM, but perhaps it should also state that if possible you *should*
update because the 8.x index may not be able to be read by the
eventual 10 release.

On Thu, May 27, 2021 at 7:52 AM Michael Wechner
 wrote:
>
> I have added a QnA
>
> https://cwiki.apache.org/confluence/display/LUCENE/LuceneFAQ#LuceneFAQ-WhenIupradeLucene,forexamplefrom8.8.2to9.0.0,doIhavetoreindex?
>
> Hope that makes sense, otherwise let me know and I can correct/update :-)
>
>
>
> Am 26.05.21 um 23:56 schrieb Michael Wechner:
> > using lucene-backward-codecs-9.0.0-SNAPSHOT.jar makes it work :-)
> >
> > Thank you very much!
> >
> > But IIUC it is recommended to reindex when upgrading, right? I guess
> > similar to what Solr is recommending
> >
> > https://solr.apache.org/guide/8_0/reindexing.html
> >
> >
> > Am 26.05.21 um 21:26 schrieb Michael Sokolov:
> >> I think you need backward-codecs-9.0.0-SNAPSHOT there. It enables 9.0
> >> to read 8.x indexes.
> >>
> >> On Wed, May 26, 2021 at 9:27 AM Michael Wechner
> >>  wrote:
> >>> Hi
> >>>
> >>> I am using Lucene 8.8.2 in production and I am currently doing some
> >>> tests using 9.0.0-SNAPSHOT, whereas I have included
> >>> lucene-backward-codecs, because in the log files it was asking me
> >>> whether I have forgotten to include lucene-backward-codecs.jar
> >>>
> >>>   <dependency>
> >>>     <groupId>org.apache.lucene</groupId>
> >>>     <artifactId>lucene-core</artifactId>
> >>>     <version>9.0.0-SNAPSHOT</version>
> >>>   </dependency>
> >>>   <dependency>
> >>>     <groupId>org.apache.lucene</groupId>
> >>>     <artifactId>lucene-queryparser</artifactId>
> >>>     <version>9.0.0-SNAPSHOT</version>
> >>>   </dependency>
> >>>   <dependency>
> >>>     <groupId>org.apache.lucene</groupId>
> >>>     <artifactId>lucene-backward-codecs</artifactId>
> >>>     <version>8.8.2</version>
> >>>   </dependency>
> >>>
> >>> But when querying index directories created with Lucene 8.8.2, then I
> >>> receive the following error
> >>>
> >>> java.lang.NoClassDefFoundError: Could not initialize class
> >>> org.apache.lucene.codecs.Codec$Holder
> >>>
> >>> I am not sure whether I understand the backwards compatibility page
> >>> correctly
> >>>
> >>> https://cwiki.apache.org/confluence/display/LUCENE/BackwardsCompatibility
> >>>
> >>>
> >>> but I guess version 9 will not be backwards compatible to version 8? Or
> >>> should I do something different?
> >>>
> >>> Thanks
> >>>
> >>> Michael
> >>>
> >>> -
> >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>>
> >> -
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene/Solr and BERT

2021-05-26 Thread Michael Sokolov
This java implementation will be slower than the C implementation. I
believe the algorithm is essentially the same, however this is new and
there may be bugs!  I (and I think Julie had similar results IIRC)
measured something like 8x slower than hnswlib (using ann-benchmarks).
It is also surprising (to me) though how this varies with
differently-learned vectors so YMMV. I still think there is value
here, and look forward to improved performance, especially as JDK16
has some improved support for vectorized instructions.

Please also understand that the HNSW algorithm interacts with Lucene's
segmented architecture in a tricky way. Because we built a graph
*per-segment* when flushing/merging, these must be rebuilt whenever
segments are merged. So your indexing performance can be heavily
influenced by how often you flush, as well as by your merge policy
settings. Also, when searching, there is a bigger than usual benefit
for searching across fewer segments, since the cost of searching an
HNSW graph scales more or less with log N (so searching a single large
graph is cheaper than searching the same documents divided among
smaller graphs). So I do recommend using a multithreaded collector in
order to get best latency with HNSW-based search. To get the best
indexing, and searching, performance, you should generally index as
large a number of documents as possible before flushing.
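
For the multithreaded collection piece, that just means constructing the
searcher with an executor - a sketch, with an arbitrary pool size:

    ExecutorService exec = Executors.newFixedThreadPool(8);
    IndexSearcher searcher = new IndexSearcher(reader, exec);
    // segments are now searched concurrently, which matters for HNSW since
    // each segment has its own graph to descend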

-Mike

On Wed, May 26, 2021 at 9:43 AM Michael Wechner
 wrote:
>
> Hi Alex
>
> Thank you very much for your feedback and the various insights!
>
> Am 26.05.21 um 04:41 schrieb Alex K:
> > Hi Michael and others,
> >
> > Sorry just now getting back to you. For your three original questions:
> >
> > - Yes, I was referring to the Lucene90Hnsw* classes. Michael S. had a
> > thorough response.
> > - As far as I know Opendistro is calling out to a C/C++ binary to run the
> > actual HNSW algorithm and store the HNSW part of the index. When they
> > implemented it about a year ago, Lucene did not have this yet. I assume the
> > Lucene HNSW implementation is solid, but would not be surprised if it's
> > slower than the C/C++ based implementation, given the JVM has some
> > disadvantages for these kinds of CPU-bound/number crunching algos.
> > - I just haven't had much time to invest into my benchmark recently. In
> > particular, I got stuck on why indexing was taking extremely long. Just
> > indexing the vectors would have easily exceeded the current time
> > limitations in the ANN-benchmarks project. Maybe I had some naive mistake
> > in my implementation, but I profiled and dug pretty deep to make it fast.
>
> I am trying to get Julie's branch running
>
> https://github.com/jtibshirani/lucene/tree/hnsw-bench
>
> Maybe this will help and is comparable
>
>
> >
> > I'm assuming you want to use Lucene, but not necessarily via Elasticsearch?
>
> Yes, for simpler setups I would like to use Lucene standalone, but
> for setups which have to scale I would use either Elasticsearch or Solr.
>
> Thanks
>
> Michael
>
>
>
> > If so, another option you might try for ANN is the elastiknn-models
> > and elastiknn-lucene packages. elastiknn-models contains the Locality
> > Sensitive Hashing implementations of ANN used by Elastiknn, and
> > elastiknn-lucene contains the Lucene queries used by Elastiknn. The Lucene
> > query is the MatchHashesAndScoreQuery. There are a couple of scala test
> > suites that show how to use it: MatchHashesAndScoreQuerySuite and
> > MatchHashesAndScoreQueryPerformanceSuite. This is all designed to work
> > independently from Elasticsearch and is published on Maven:
> > com.klibisz.elastiknn / lucene and com.klibisz.elastiknn / models.
> > The tests are Scala but all of the implementation is in Java.
> >
> > Thanks,
> > Alex
> >
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Index backwards compatibility

2021-05-26 Thread Michael Sokolov
I think you need backward-codecs-9.0.0-SNAPSHOT there. It enables 9.0
to read 8.x indexes.

On Wed, May 26, 2021 at 9:27 AM Michael Wechner
 wrote:
>
> Hi
>
> I am using Lucene 8.8.2 in production and I am currently doing some
> tests using 9.0.0-SNAPSHOT, whereas I have included
> lucene-backward-codecs, because in the log files it was asking me
> whether I have forgotten to include lucene-backward-codecs.jar
>
> <dependency>
>   <groupId>org.apache.lucene</groupId>
>   <artifactId>lucene-core</artifactId>
>   <version>9.0.0-SNAPSHOT</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.lucene</groupId>
>   <artifactId>lucene-queryparser</artifactId>
>   <version>9.0.0-SNAPSHOT</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.lucene</groupId>
>   <artifactId>lucene-backward-codecs</artifactId>
>   <version>8.8.2</version>
> </dependency>
>
> But when querying index directories created with Lucene 8.8.2, then I
> receive the following error
>
> java.lang.NoClassDefFoundError: Could not initialize class
> org.apache.lucene.codecs.Codec$Holder
>
> I am not sure whether I understand the backwards compatibility page
> correctly
>
> https://cwiki.apache.org/confluence/display/LUCENE/BackwardsCompatibility
>
> but I guess version 9 will not be backwards compatible to version 8? Or
> should I do something different?
>
> Thanks
>
> Michael
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene/Solr and BERT

2021-05-23 Thread Michael Sokolov
Hi Michael, that is fully-functional in the sense that Lucene will
build an HNSW graph for a vector-valued field and you can then use the
VectorReader.search method to do KNN-based search. Next steps may
include some integration with lexical, inverted-index type search so
that you can retrieve N-closest constrained by other constraints.
Today you can approximate that by oversampling and filtering. There is
also interest in pursuing other KNN search algorithms, and we have
been working to make sure the VectorFormat API (might still get
renamed due to confusion with other kinds of vectors existing in
Lucene) can support alternative KNN implementations.
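
To make that concrete, the shape of the API (as it settled in the 9.0
codebase - class names moved around a bit while this was in development) is
roughly this, where "embedding" and the float[] variables are placeholders:

    // index time: one float[] per document; an HNSW graph is built per segment
    Document doc = new Document();
    doc.add(new KnnVectorField("embedding", embeddingForDoc));
    writer.addDocument(doc);

    // query time: approximate 10 nearest neighbors to the query vector
    TopDocs hits = searcher.search(new KnnVectorQuery("embedding", queryEmbedding, 10), 10);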

On Wed, May 19, 2021 at 12:22 PM Michael Wechner
 wrote:
>
> Hi Alex
>
> Just to make sure I understand better what the additions are about
>
> Am 21.04.21 um 17:21 schrieb Alex K:
> > There were a couple additions recently merged into lucene but not yet
> > released:
> > - A first-class vector codec
>
> do you mean the classes inside
>
> https://github.com/apache/lucene/tree/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90
>
> and in particular
>
> Lucene90HnswVectorFormat.java  Lucene90HnswVectorReader.java
> Lucene90HnswVectorWriter.java
>
> ?
>
> > - An implementation of HNSW for approximate nearest neighbor search
>
> the HNSW implementation at
>
> https://github.com/apache/lucene/tree/main/lucene/core/src/java/org/apache/lucene/util/hnsw
>
> is similar to
>
> https://opendistro.github.io/for-elasticsearch/blog/odfe-updates/2020/04/Building-k-Nearest-Neighbor-(k-NN)-Similarity-Search-Engine-with-Elasticsearch/
>
> ?
> >
> > They are however available in the snapshot releases. I started on a small
> > project to get the HNSW implementation into the ann-benchmarks project, but
> > had to set it aside.
>
> Is there still something missing? Or what would be the next steps?
>
> Thanks
>
> Michael
>
>
> >   Here's the code:
> > https://github.com/alexklibisz/ann-benchmarks-lucene. There are some test
> > suites that index and search Glove vectors. My first impression was that
> > indexing seems surprisingly slow, but it's entirely possible I'm doing
> > something wrong.
> >
> > On Wed, Apr 21, 2021 at 9:31 AM Michael Wechner 
> > wrote:
> >
> >> Hi
> >>
> >> I recently found the following articles re Lucene/Solr and BERT
> >>
> >> https://dmitry-kan.medium.com/neural-search-with-bert-and-solr-ea5ead060b28
> >>
> >> https://medium.com/swlh/fun-with-apache-lucene-and-bert-embeddings-c2c496baa559
> >>
> >> and would like to ask whether there might be more recent developments
> >> within the Lucene/Solr community re BERT integration?
> >>
> >> Also how these developments relate to
> >>
> >> https://sbert.net/
> >>
> >> ?
> >>
> >> Thanks very much for your insights!
> >>
> >> Michael
> >>
> >> -
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene Explanation

2021-04-12 Thread Michael Sokolov
You might want to check out
https://issues.apache.org/jira/browse/LUCENE-8019 where I tried to
implement some debugging utilities on top of Explain. It never got
committed, but it does explore some of the challenges around
introducing a more structured explain response.
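
In the meantime, the Explanation tree is at least nested, so you can walk
getDetails() recursively rather than parse one flat string - a sketch:

    void dump(Explanation e, int depth) {
        // getValue() is the numeric contribution, getDescription() the label
        System.out.println("  ".repeat(depth) + e.getValue() + "  " + e.getDescription());
        for (Explanation detail : e.getDetails()) {
            dump(detail, depth + 1);
        }
    }

    dump(searcher.explain(query, docId), 0);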

On Fri, Apr 9, 2021 at 6:40 PM Puneeth Bikkumanla
 wrote:
>
> Hello,
> I am currently working on a project that would like to implement Document
> Explain where we can see how a document was scored internally in lucene
> given a query.
>
> I see that the IndexSearcher has an explain method available that returns
> an Explanation object. An Explanation object only contains a description field (string)
> but there is no way to know what part of a score that Explanation object is
> for without parsing the description field itself. We wanted to implement
> Document Explain in a more safe way where we could know what part of the
> score an Explanation object is associated with and not parse the
> description string field to find out. Here are a few of the options I have
> thought of:
>
> 1. I was thinking about extending the similarity class (BM25Similarity) and
> then overriding the particular methods that dealt with the different
> subcomponents of explain but saw that the explainTF
> 
> method
> is private. Is there a reason why this is? It would be very useful if it
> could be public so that I can override it and store the knowledge that the
> returned Explanation is for the TF component of the document score.
>
> 2. I also thought about extending the IndexSearcher and overriding the
> createWeight method to store the weight structure and then use that to
> understand the resulting Explanation structure from the IndexSearcher's
> explain method.
>
> Please let me know if any of that didn't make sense. Also, if anyone has
> any other ideas on how I could approach this problem suggestions would be
> greatly appreciated. Lastly, I would be happy to submit a PR to modify
> Lucene's Explanation to be more aware of where it is created.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Search results/criteria validation

2021-03-17 Thread Michael Sokolov
See https://issues.apache.org/jira/browse/LUCENE-9640
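
Separately from that issue, the Matches API can already tell you which
sub-queries matched a given hit. A rough sketch, for a docId you already have
in hand (this is relatively slow - meant for debugging and highlighting, not
hot query paths):

    Query rewritten = searcher.rewrite(query);
    Weight weight = searcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1f);
    for (LeafReaderContext ctx : searcher.getIndexReader().leaves()) {
        if (docId >= ctx.docBase && docId < ctx.docBase + ctx.reader().maxDoc()) {
            Matches matches = weight.matches(ctx, docId - ctx.docBase);
            if (matches != null) {
                for (String field : matches) {                 // fields with a match
                    MatchesIterator it = matches.getMatches(field);
                    while (it.next()) {
                        System.out.println(field + " matched " + it.getQuery());
                    }
                }
            }
        }
    }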

On Wed, Mar 17, 2021 at 4:02 PM Paul Libbrecht
 wrote:
>
> Explain is a heavyweight thing. Maybe it helps you, maybe you need
> something high-performance.
>
> I was asking a similar question ~10 years ago and got a very interesting
> answer on this list. If you want I can try to dig this to find it. At
> the end, and with some limitation in the number of queries and in the
> score’s fineness, it was indicating which sub-query was used. This was
> used to attempt highlighting the matching parts of a formula.
>
> Paul
>
> On 17 Mar 2021, at 20:24, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:
>
> > Maybe using explain?
> >
> > https://chrisperks.co/2017/06/06/explaining-lucene-explain/
> >
> >
> > It might slow down the performance..
> >
> > Cheers,
> > Diego
> >
> >
> > From: java-user@lucene.apache.org At: 03/17/21 17:26:14 To:
> > java-user@lucene.apache.org
> > Cc:  shuo@jobdiva.com
> > Subject: Search results/criteria validation
> >
> > Hello there,
> >
> > We are looking to check what part of criteria matched for the document
> > to be
> > included in the results. So, for example our criteria is "(A or B or
> > C) and (D
> > or E)" and documents 1,2,3 came back in results. Can we check for each
> > of the
> > documents, which parts of criteria matched? So, for example, it might
> > be that
> > document 1 was matched because A and B and D were found and for
> > document 2 C
> > and E were found. Is there a way to check that?
> >
> > --
> > Regards
> > -Siraj Haider
> > (212) 306-0154
> >
> >
> > 
> >
> > This electronic mail message and any attachments may contain
> > information which
> > is privileged, sensitive and/or otherwise exempt from disclosure under
> > applicable law. The information is intended only for the use of the
> > individual
> > or entity named as the addressee above. If you are not the intended
> > recipient,
> > you are hereby notified that any disclosure, copying, distribution
> > (electronic
> > or otherwise) or forwarding of, or the taking of any action in
> > reliance on, the
> > contents of this transmission is strictly prohibited. If you have
> > received this
> > electronic transmission in error, please notify us by telephone,
> > facsimile, or
> > e-mail as noted above to arrange for the return of any electronic mail
> > or
> > attachments. Thank You.
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene Migration query

2020-11-20 Thread Michael Sokolov
Ah, sorry for the misdirection, thanks for the correction, Erick. That
does jibe with what I now remember having heard before. I guess we
reserve the right to create index data structures in the future for
which we did not save sufficient data in the past.

On Fri, Nov 20, 2020 at 9:15 AM Erick Erickson  wrote:
>
> The IndexUpgraderTool does a forceMerge(1). If you have a large index,
> that has its own problems, but will work. The threshold for the issues is
> 5G. See: https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
> I should emphasize that if you have a very large single segment as a
> result, it’ll eventually shrink if it accumulated deleted (or updated) 
> documents,
> it’ll just require a bunch of I/O amortized over time.
>
> IndexUpgraderTool will _not_ allow you to take an index originally created 
> with
> 7x to be used in 9x. (Uwe, I’ve been telling people this for a long time, if 
> I’ve
> been lying please let me know!). Starting with Lucene 6, a version is written 
> into
> each segment. Upon merge, the lowest version stamp is preserved. Lucene
> will refuse to open an index where _any_ segment has a version stamp X-2 or
> older.
>
> Best,
> Erick
>
> > On Nov 20, 2020, at 7:57 AM, Michael Sokolov  wrote:
> >
> > I think running the upgrade tool would also be necessary to set you up for
> > the next upgrade, when 9.0 comes along.
> >
> > On Fri, Nov 20, 2020, 4:25 AM Uwe Schindler  wrote:
> >
> >> Hi,
> >>
> > Currently I am using Lucene 7.3, I want to upgrade to Lucene 8.5.1.
> > Should I do reindexing in this case?
> >>
> >> No, you don't need that.
> >>
> >>> Can I make use of backward codec jar without a reindex?
> >>
> >> Yes, just add the JAR file to your classpath and it can read the indexes.
> >> Updates written to the index will use the new codecs. To force a full
> >> upgrade (rewrite all segments), invoke the IndexUpgrader class either from
> >> your code or using the command line. But this is not needed, it just makes
> >> sure that you can get rid of the backwards-codecs jar.
> >>
> >> Uwe
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene Migration query

2020-11-20 Thread Michael Sokolov
I think running the upgrade tool would also be necessary to set you up for
the next upgrade, when 9.0 comes along.
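
Running the upgrader is a one-liner either way - a sketch (the paths and jar
versions are placeholders for your own setup):

    // from code:
    new IndexUpgrader(FSDirectory.open(Paths.get("/path/to/index"))).upgrade();

    // or from the command line:
    java -cp lucene-core-8.5.1.jar:lucene-backward-codecs-8.5.1.jar \
        org.apache.lucene.index.IndexUpgrader -delete-prior-commits /path/to/index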

On Fri, Nov 20, 2020, 4:25 AM Uwe Schindler  wrote:

> Hi,
>
> > Currently I am using Lucene 7.3, I want to upgrade to Lucene 8.5.1.
> > Should I do reindexing in this case?
>
> No, you don't need that.
>
> > Can I make use of backward codec jar without a reindex?
>
> Yes, just add the JAR file to your classpath and it can read the indexes.
> Updates written to the index will use the new codecs. To force a full
> upgrade (rewrite all segments), invoke the IndexUpgrader class either from
> your code or using the command line. But this is not needed, it just makes
> sure that you can get rid of the backwards-codecs jar.
>
> Uwe
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: https://issues.apache.org/jira/browse/LUCENE-8448

2020-11-13 Thread Michael Sokolov
You can't directly compare disk usage across two indexes, even with
the same data. Try re-indexing one of your datasets, and you will see
that the disk size is not the same. Mostly this is due to the way
segments are merged varying with some randomness from one run to
another, although the size of the difference you report is pretty
large, it is not out of the question that it could occur, especially if
you have a large number of deletions or updates to existing documents.
If you want to get a more accurate idea of the amount of space taken
up by your index, you could try calling IndexWriter.forceMerge(1);
this will merge your index to a single segment, eliminating waste. It
is not generally recommended to do this for indexes you use for
querying, but it can be a useful tool for analysis.
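
Something along these lines (a sketch; substitute your own directory path and
analyzer configuration):

    // open the index and rewrite it as a single segment, dropping deleted docs
    try (Directory dir = FSDirectory.open(Paths.get("/path/to/index"));
         IndexWriter w = new IndexWriter(dir, new IndexWriterConfig())) {
        w.forceMerge(1);
        w.commit();
    }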

On Fri, Nov 13, 2020 at 1:01 PM  wrote:
>
> Nothing changed between the two index generations except the data changed a
> bit, as I described.
>
> When Lucene is done generating the index, what I am reporting is the
> size of the directory where all the index files are stored.
>
> I don't know about deleted docs - how do you trace that? Yes, the queries
> run exactly the same way (same number of results); most of the time the
> order is just changed, which is fine, or a few different entries show
> up, and I don't know why, since the lowercase filter should normalize even if
> the original data casing changes.
>
> Yes, I am absolutely sure nothing else changed; I kept all those things the
> same across the two runs.
>
> Actually, does the Lucene repository have these kinds of experiments across
> versions (major or minor versions)?
>
> If I were Lucene I would do these experiments to see the impact on index
> end results; this would help find some potential unidentified bugs.
>
> Methodology:
>
> have a large dataset, like 15 million docs
>
> run indexing each time a new version comes out, with very common settings.
>
>
> I am not using Solr, just pure Lucene 7.7.2. This info was in the other
> email here; let me copy-paste it:
>
>
>
> = previous email 
>
> On a related issue:
>
> With version 7.7.2 I experienced this:
>
> data is all lower case (same amount of docs as next case though)
>
> vs
>
> data is camel case except last word always in capital letters
>
>
> but I used the lowercase filter in the indexer in both cases, so indexing
> is done all in lower case, and I saw the first case's index size
> was about 9.5GB
>
> but same data size for second case was 11GB.
>
>
> What causes such a difference and increase in index size? The number of docs
> is the same in both cases.
>
>
> Best regards
>
>
>
> On 11/13/20 7:39 AM, Erick Erickson wrote:
> > What does “final finished sizes” mean? After optimize or just after
> > finishing all indexing?
> > The former is what counts here.
> >
> > And you provided no information on the number of deleted docs in the two 
> > cases. Is
> > the number of deletedDocs the same (or close)? And does the q=*:* query
> > return the same numFound?
> >
> > Finally, are you absolutely and totally sure that no other options changed?
> > For instance,
> > you specified docValues=true for some field in one but not the other. Or 
> > stored=true
> > etc. If you’re using the same schema.
> >
> > And you also haven’t provided information on what versions of Solr you’re 
> > talking about.
> > You mention 7.7.2, but not the _other_ version of solr. If you’re going 
> > from one major
> > version to another, sometimes defaults change for docValues on primitive 
> > fields
> > especially. I’d consider firing up Luke and examining the field definitions 
> > in
> > detail.
> >
> > Best,
> > Erick
> >
> >> On Nov 13, 2020, at 12:16 AM, baris.ka...@oracle.com wrote:
> >>
> >> Hi,-
> >> Thanks.
> >> These are final finished sizes in both cases.
> >> Best regards
> >>
> >>
> >>> On Nov 12, 2020, at 11:12 PM, Erick Erickson  
> >>> wrote:
> >>>
> >>> Yes, that issue is fixed. The “Resolution” tag is the key, it’s marked 
> >>> “fixed” and the version is 8.0
> >>>
> >>> As for your other question, index size is a very imprecise number. How 
> >>> many deleted documents are there
> >>> in each case? Deleted documents take up disk space until the segments 
> >>> containing them are merged away.
> >>>
> >>> Best,
> >>> Erick
> >>>
>  On Nov 12, 2020, at 5:35 PM, baris.ka...@oracle.com wrote:
> 
>  https://urldefense.com/v3/__https://issues.apache.org/jira/browse/LUCENE-8448__;!!GqivPVa7Brio!I3RsAXIoDcPmpP_sc8C29vn8DcAXSvIgH7pvcxyDaBnfhdJAk24zPpQhqP035V1IJA$
> 
> 
>  Hi,-
> 
>  is this issue fixed please? Could You please help me figure it out?
> 
>  Best regards
> 
> 
> 
>  -
>  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>  For additional commands, e-mail: java-user-h...@lucene.apache.org
> 
> >>>
> >>> 

Re: [VOTE] Lucene logo contest, third time's a charm

2020-09-04 Thread Michael Sokolov
A1, D, A2 (binding)

On Fri, Sep 4, 2020 at 12:46 AM David Smiley  wrote:
>
> (binding)
> vote: D, A1
>
>
> (thanks Ryan for your thorough vote instructions & preparation)

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Simultaneous Indexing and searching

2020-09-01 Thread Michael Sokolov
So ... this is a fairly complex topic I can't really cover it in depth
here; how to architect a distributed search engine service. Most
people opt to use Solr or Elasticsearch since they solve that problem
for you. Those systems work best when the indexes are local to the
service that is accessing them, and build systems to distribute data
internally; distributing via NFS is generally not a *good idea* (tm),
although it may work most of the time. In your case, have you
considered building a search service that runs on the same box as your
indexer and responds to queries from the web server(s)?

On Tue, Sep 1, 2020 at 11:13 AM Richard So
 wrote:
>
> Hi there,
>
> I am a beginner with Lucene, especially in the area of indexing and
> searching simultaneously.
>
> Our environment is that we have several webservers for the search front-end
> that submit search requests, and also a backend server that does the full
> text indexing; the index files are stored in an NFS volume such that both
> the indexing and the searches point to this same NFS volume. Indexing
> may happen whenever new documents come in or get updated.
>
> Our project requires that both indexing and searching can happen at the
> same time (or the blocking should be as short as possible, e.g. under a
> second).
>
> We have searched through the Internet and found some references like these:
> http://blog.mikemccandless.com/2011/09/lucenes-searchermanager-simplifies.html
> http://blog.mikemccandless.com/2011/11/near-real-time-readers-with-lucenes.html
>
> but it seems those only apply to indexing and searching on the same server
> (correct me if I am wrong).
>
> Could somebody tell me how to implement such a system, e.g. which Lucene
> classes to use, the caveats, how to set it up, etc.?
>
> Regards
> Richard
>
>
>
>
>
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: [VOTE] Lucene logo contest, here we go again

2020-09-01 Thread Michael Sokolov
A1, binding

On Mon, Aug 31, 2020 at 8:26 PM Ryan Ernst  wrote:
>
> Dear Lucene and Solr developers!
>
> In February a contest was started to design a new logo for Lucene 
> [jira-issue]. The initial attempt [first-vote] to call a vote resulted in 
> some confusion on the rules, as well the request for one additional 
> submission. I would like to call a new vote, now with more explicit 
> instructions on how to vote.
>
> Please read the following rules carefully before submitting your vote.
>
> Who can vote?
>
> Anyone is welcome to cast a vote in support of their favorite submission(s). 
> Note that only PMC members' votes are binding. If you are a PMC member,
> please indicate with your vote that the vote is binding, to ease collection 
> of votes. In tallying the votes, I will attempt to verify only those marked 
> as binding.
>
> How do I vote?
>
> Votes can be cast simply by replying to this email. It is a ranked-choice 
> vote [rank-choice-voting]. Multiple selections may be made, where the order 
> of preference must be specified. If an entry gets more than half the votes, 
> it is the winner. Otherwise, the entry with the lowest number of votes is 
> removed, and the votes are retallied, taking into account the next preferred 
> entry for those whose first entry was removed. This process repeats until 
> there is a winner.
>
> The entries are broken up by variants, since some entries have multiple color 
> or style variations. The entry identifiers are first a capital letter, 
> followed by a variation id (described with each entry below), if applicable. 
> As an example, if you prefer variant 1 of entry A, followed by variant 2 of 
> entry A, variant 3 of entry C, entry D, and lastly variant 4e of entry B, the 
> following should be in your reply:
>
> (binding)
> vote: A1, A2, C3, D, B4e
>
> Entries
>
> The entries are as follows:
>
> A. Submitted by Dustin Haver. This entry has two variants, A1 and A2.
>
> [A1] 
> https://issues.apache.org/jira/secure/attachment/12999548/Screen%20Shot%202020-04-10%20at%208.29.32%20AM.png
> [A2] https://issues.apache.org/jira/secure/attachment/12997172/LuceneLogo.png
>
> B. Submitted by Stamatis Zampetakis. This has several variants. Within the 
> linked entry there are 7 patterns and 7 color palettes. Any vote for B should 
> contain the pattern number followed by the lowercase letter of the color 
> palette. For example, B3e or B1a.
>
> [B] https://issues.apache.org/jira/secure/attachment/12997768/zabetak-1-7.pdf
>
> C. Submitted by Baris Kazar. This entry has 8 variants.
>
> [C1] 
> https://issues.apache.org/jira/secure/attachment/13006392/lucene_logo1_full.pdf
> [C2] 
> https://issues.apache.org/jira/secure/attachment/13006392/lucene_logo2_full.pdf
> [C3] 
> https://issues.apache.org/jira/secure/attachment/13006392/lucene_logo3_full.pdf
> [C4] 
> https://issues.apache.org/jira/secure/attachment/13006392/lucene_logo4_full.pdf
> [C5] 
> https://issues.apache.org/jira/secure/attachment/13006392/lucene_logo5_full.pdf
> [C6] 
> https://issues.apache.org/jira/secure/attachment/13006392/lucene_logo6_full.pdf
> [C7] 
> https://issues.apache.org/jira/secure/attachment/13006392/lucene_logo7_full.pdf
> [C8] 
> https://issues.apache.org/jira/secure/attachment/13006392/lucene_logo8_full.pdf
>
> D. The current Lucene logo.
>
> [D] https://lucene.apache.org/theme/images/lucene/lucene_logo_green_300.png
>
> Please vote for one of the above choices. This vote will close one week from 
> today, Mon, Sept 7, 2020 at 11:59PM.
>
> Thanks!
>
> [jira-issue] https://issues.apache.org/jira/browse/LUCENE-9221
> [first-vote] 
> http://mail-archives.apache.org/mod_mbox/lucene-dev/202006.mbox/%3cCA+DiXd74Mz4H6o9SmUNLUuHQc6Q1-9mzUR7xfxR03ntGwo=d...@mail.gmail.com%3e
> [rank-choice-voting] https://en.wikipedia.org/wiki/Instant-runoff_voting

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Hierarchical facet select a subtree but one child

2020-08-15 Thread Michael Sokolov
If you are trying to show documents that have facet value V1 excluding
those with facet value V1.1, then you would need to issue a query
like:

+f:V1 -f:V1.1

assuming your facet values are indexed in a field called "f". I don't
think this really has anything to do with faceting; it's just a
filtering problem.
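A hedged sketch of that filter in code, assuming the values really are indexed
as plain terms in "f" (adjust to your schema):

    BooleanQuery.Builder b = new BooleanQuery.Builder();
    b.add(new TermQuery(new Term("f", "V1")), BooleanClause.Occur.MUST);
    b.add(new TermQuery(new Term("f", "V1.1")), BooleanClause.Occur.MUST_NOT);
    Query filter = b.build(); // matches V1 docs, minus those also tagged V1.1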

On Tue, Aug 4, 2020 at 4:47 AM nbuso  wrote:
>
> Hi,
>
> Is there someone who can point me to the right API to negate facet
> values?
> Maybe DrillDownQuery#add(dim, query) is the API that permits this use
> case?
> https://lucene.apache.org/core/8_5_2/facet/org/apache/lucene/facet/DrillDownQuery.html#add-java.lang.String-org.apache.lucene.search.Query-
>
>
> Nicola
>
>
> On 2020-07-29 10:27, nbuso wrote:
> > Hi,
> >
> > I'm a bit rusty with the Lucene facets API, and I have a common use case
> > that I would like to solve.
> > Suppose the following facet values tree:
> >
> > Facet
> >  - V1
> >- V1.1
> >- V1.2
> >- V1.3
> >- V1.4
> >- (not topK values)
> >  - V2
> >- V2.1
> >- V2.2
> >- V2.3
> >- V2.4
> >- (not topK values)
> >
> > By (not topK values) I mean values you are not showing in the UI
> > because of space/visualization constraints. You usually see them behind
> > a "More ..." link.
> >
> > Use case:
> > 1 - select V1 => all V1.x are selected
> > 2 - de-select V1.1
> >
> > How can I achieve this? from the search results I know the values
> > V1.[1-4] but I don't know the values that are not in topK. How can I
> > select all the V1 subtree but V1.1?
> >
> > Please let me know if you need more info.
> >
> >
> > Nicola Buso - EBI
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: ANN search current state

2020-07-16 Thread Michael Sokolov
We have some prototype implementations in the issues you found.  If
you want to try out the approaches in those issues, you could build
Lucene from source and patch it, but there is no release containing
KNN/vector support. We're still working to establish consensus on what
the best way forward is. I think the most fruitful thing we can do at
the moment is establish a format for storing and accessing vectors
that will support different approaches since there is such a rich
variety of algorithms and approaches in this area. The last issue you
pointed to is focused on the format.

On Wed, Jul 15, 2020 at 11:20 AM Alex K  wrote:
>
> Hi Mikhail,
>
> I'm not sure about the state of ANN in lucene proper. Very interested to
> see the response from others.
> I've been doing some work on ANN for an Elasticsearch plugin:
> http://elastiknn.klibisz.com/
> I think it's possible to extract my custom queries and modeling code so
> that it's elasticsearch-agnostic and can be used directly in Lucene apps.
> However I'm much more familiar with Elasticsearch's APIs and usage/testing
> patterns than I am with raw Lucene, so I'd likely need to get some help
> from the Lucene community.
> Please LMK if that sounds interesting to anyone.
>
> - Alex
>
>
>
> On Wed, Jul 15, 2020 at 11:11 AM Mikhail  wrote:
>
> >
> > Hi,
> >
> >I want to incorporate semantic search in my project, which uses
> > Lucene. I want to use sentence embeddings and ANN (approximate nearest
> > neighbor) search. I found the related Lucene issues:
> > https://issues.apache.org/jira/browse/LUCENE-9004 ,
> > https://issues.apache.org/jira/browse/LUCENE-9136 ,
> > https://issues.apache.org/jira/browse/LUCENE-9322 . I see that there
> > are some related work and related PRs. What is the current state of this
> > functionality?
> >
> > --
> > Thanks,
> > Mikhail
> >
> >

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: About custom score using Solr8/Lucene8

2020-07-08 Thread Michael Sokolov
You should think of DoubleValuesSource as a factory for DoubleValues.
Usually a factory will be immutable - you set it up and then it
produces per-leaf DoubleValues. So I don't really understand what
you're saying about state there.  Regarding the DoubleValues, which is
an iterator, yes it definitely has state: it has to keep track of
which document it is positioned on (if any). I think the order of
operations is as you listed them.
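For what it's worth, here is a minimal sketch of a stateless source whose
per-leaf DoubleValues holds all the iteration state -- the numeric doc-values
field "myField" is an assumption for illustration, not your schema:

    public final class MyValuesSource extends DoubleValuesSource {
      private static final String FIELD = "myField"; // assumed field

      @Override
      public DoubleValues getValues(LeafReaderContext ctx, DoubleValues scores)
          throws IOException {
        final NumericDocValues dv = DocValues.getNumeric(ctx.reader(), FIELD);
        return new DoubleValues() {
          @Override
          public double doubleValue() throws IOException {
            // only valid after advanceExact(doc) has returned true
            return dv.longValue();
          }

          @Override
          public boolean advanceExact(int doc) throws IOException {
            return dv.advanceExact(doc); // all per-document state lives here
          }
        };
      }

      @Override
      public boolean needsScores() {
        return false; // we never read the "scores" parameter above
      }

      @Override
      public DoubleValuesSource rewrite(IndexSearcher searcher) {
        return this; // nothing to rewrite in this sketch
      }

      @Override
      public boolean isCacheable(LeafReaderContext ctx) {
        return DocValues.isCacheable(ctx, FIELD);
      }

      @Override
      public int hashCode() { return FIELD.hashCode(); }

      @Override
      public boolean equals(Object other) {
        return other instanceof MyValuesSource;
      }

      @Override
      public String toString() { return "MyValuesSource(" + FIELD + ")"; }
    }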


On Tue, Jul 7, 2020 at 1:11 PM Vincenzo D'Amore  wrote:
>
> Thanks Michael,
> if I understand correctly, the DoubleValuesSource is stateful.
> When getValues is called, if scores is null, an internal true/false state is
> saved.
> This state should then be returned by the needsScores method.
> Is this correct?
>
> As far as I understood, the same thing happens with the DoubleValues object
> returned by the getValues() method.
> The DoubleValues object has an internal state which depends on the
> advanceExact() method.
> In other words, DoubleValues.doubleValue() returns a value produced after
> the call to advanceExact(int docId).
>
> Now the problem I have is understanding whether DoubleValuesSource has to be
> stateful, or has to store a state that should be available later.
> Or, on the other hand, whether there is an order to the method calls
> (first getValues then needsScores, first advanceExact then doubleValue).
> Don't you agree?
>
>
> On Mon, Jul 6, 2020 at 4:58 PM Michael Sokolov  wrote:
>
> > That controls whether  getValues(LeafReaderContext ctx, DoubleValues
> > scores) gets a null scores parameter or not. You should say true only
> > if you need the text relevance scores computed by the Query's Scorer.
> >
> > On Mon, Jul 6, 2020 at 10:22 AM Vincenzo D'Amore 
> > wrote:
> > >
> > > Hi Michael, thanks for answering my questions.
> > > Yes, I read them, but I think that it is not enough.
> > >
> > > To make things clearer, this is taken from the javadocs:
> > >
> > > abstract boolean needsScores() - Return true if document scores are
> > needed
> > > to calculate values
> > >
> > > So, I thought return true, because yes, I need to calculate the scores
> > for
> > > my custom implementations.
> > > Anyway, I should remember that the wrong way always seems more reasonable
> > > :) , and googling around for:
> > >
> > > "boolean needsScores" "DoubleValuesSource" site:github.com
> > >
> > > I found that when there is explicit code, many implementations simply
> > > return false.
> > >
> > > What does this mean? why and when should I return true or false?
> > >
> > >
> > > On Mon, Jul 6, 2020 at 2:50 PM Michael Sokolov 
> > wrote:
> > >
> > > > Did you read the DoubleValuesSource javadocs, and find they weren't
> > enough?
> > > >
> > > > On Sun, Jul 5, 2020 at 7:54 AM Vincenzo D'Amore 
> > > > wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > Finally I have a custom DoubleValuesSource that gives the expected
> > > > results,
> > > > > but I'm a little worried about the lack of documentation.
> > > > >
> > > > > When you extend DoubleValuesSource there are a number of methods to
> > > > write,
> > > > > for some of them it is not clear what they do and why they need to be
> > > > > implemented.
> > > > > Here I've listed the mandatory methods:
> > > > >
> > > > > public abstract DoubleValues getValues(LeafReaderContext var1,
> > > > > DoubleValues var2) throws IOException;
> > > > > public abstract boolean needsScores()
> > > > > public abstract DoubleValuesSource rewrite(IndexSearcher var1)
> > throws
> > > > > IOException;
> > > > > public boolean isCacheable(LeafReaderContext ctx);
> > > > > public abstract int hashCode();
> > > > > public abstract boolean equals(Object var1);
> > > > >
> > > > > for some of them I could imagine why (hashCode() or equals()) but
> > what
> > > > > about the others?
> > > > > As said, I wrote an implementation of getValues that returns the
> > expected
> > > > > results (I've compared the results with the old version), but for
> > many
> > > > > methods I've just mimicked (copied) the code found in other
> > implementations.
> > > > > So why does needsScores() always return false,

Re: About custom score using Solr8/Lucene8

2020-07-06 Thread Michael Sokolov
That controls whether  getValues(LeafReaderContext ctx, DoubleValues
scores) gets a null scores parameter or not. You should say true only
if you need the text relevance scores computed by the Query's Scorer.

On Mon, Jul 6, 2020 at 10:22 AM Vincenzo D'Amore  wrote:
>
> Hi Michael, thanks for answering my questions.
> Yes, I read them, but I think that it is not enough.
>
> To make things clearer, this is taken from the javadocs:
>
> abstract boolean needsScores() - Return true if document scores are needed
> to calculate values
>
> So, I thought return true, because yes, I need to calculate the scores for
> my custom implementations.
> Anyway, I should remember that the wrong way always seems more reasonable
> :) , and googling around for:
>
> "boolean needsScores" "DoubleValuesSource" site:github.com
>
> I found that when there is explicit code, many implementations simply
> return false.
>
> What does this mean? why and when should I return true or false?
>
>
> On Mon, Jul 6, 2020 at 2:50 PM Michael Sokolov  wrote:
>
> > Did you read the DoubleValuesSource javadocs, and find they weren't enough?
> >
> > On Sun, Jul 5, 2020 at 7:54 AM Vincenzo D'Amore 
> > wrote:
> > >
> > > Hi all,
> > >
> > > Finally I have a custom DoubleValuesSource that gives the expected
> > results,
> > > but I'm a little worried about the lack of documentation.
> > >
> > > When you extend DoubleValuesSource there are a number of methods to
> > write,
> > > for some of them it is not clear what they do and why they need to be
> > > implemented.
> > > Here I've listed the mandatory methods:
> > >
> > > public abstract DoubleValues getValues(LeafReaderContext var1,
> > > DoubleValues var2) throws IOException;
> > > public abstract boolean needsScores()
> > > public abstract DoubleValuesSource rewrite(IndexSearcher var1) throws
> > > IOException;
> > > public boolean isCacheable(LeafReaderContext ctx);
> > > public abstract int hashCode();
> > > public abstract boolean equals(Object var1);
> > >
> > > for some of them I could imagine why (hashCode() or equals()) but what
> > > about the others?
> > > As said, I wrote an implementation of getValues that returns the expected
> > > results (I've compared the results with the old version), but for many
> > > methods I've just mimicked (copied) the code found in other implementations.
> > > So why does needsScores() always return false, and how do I implement
> > > isCacheable() correctly?
> > > Could anyone write a short description of these methods and how they
> > > have to be implemented?
> > >
> > > Best regards,
> > > Vincenzo
> > >
> > > On Sat, Jul 4, 2020 at 3:29 AM Vincenzo D'Amore 
> > wrote:
> > >
> > > > Hi all, I made a few steps forward but am still struggling with how to
> > > > read the field value inside my custom DoubleValuesSource
> > > >
> > > > final CustomValuesSource valuesSource = new
> > > > CustomValuesSource(data, req.getSchema().getField(field));
> > > > return FunctionScoreQuery.boostByValue(query,
> > > > valuesSource);
> > > >
> > > > CustomValuesSource extends DoubleValuesSource
> > > >
> > > > But, if I did right, I'm struggling with the getValues code.
> > > >
> > > > public DoubleValues getValues(LeafReaderContext ctx, DoubleValues
> > scores)
> > > > throws IOException {
> > > >
> > > > The field I have to read is a binary field, and I can't find an example
> > > > of how to read a binary field from a LeafReaderContext
> > > >
> > > > Any help appreciated.
> > > >
> > > > Best regards,
> > > > Vincenzo
> > > >
> > > > On Thu, Jul 2, 2020 at 1:19 PM Vincenzo D'Amore 
> > > > wrote:
> > > >
> > > >> Hi Mikhail, I was just trying to understand how to extend
> > > >> DoubleValuesSource class, now I'm looking around to find an inspiring
> > > >> example...
> > > >>
> > > >> On Thu, Jul 2, 2020 at 12:55 PM Mikhail Khludnev 
> > wrote:
> > > >>
> > > >>> Hi, Vincenzo.
> > > >>>
> > > >>> Have you tried to implement DoubleValuesSource ?
> > > >

Re: About custom score using Solr8/Lucene8

2020-07-06 Thread Michael Sokolov
Did you read the DoubleValuesSource javadocs, and find they weren't enough?

On Sun, Jul 5, 2020 at 7:54 AM Vincenzo D'Amore  wrote:
>
> Hi all,
>
> Finally I have a custom DoubleValuesSource that gives the expected results,
> but I'm a little worried about the lack of documentation.
>
> When you extend DoubleValuesSource there are a number of methods to write,
> for some of them it is not clear what they do and why they need to be
> implemented.
> Here I've listed the mandatory methods:
>
> public abstract DoubleValues getValues(LeafReaderContext var1,
> DoubleValues var2) throws IOException;
> public abstract boolean needsScores()
> public abstract DoubleValuesSource rewrite(IndexSearcher var1) throws
> IOException;
> public boolean isCacheable(LeafReaderContext ctx);
> public abstract int hashCode();
> public abstract boolean equals(Object var1);
>
> for some of them I could imagine why (hashCode() or equals()) but what
> about the others?
> As said, I wrote an implementation of getValues that returns the expected
> results (I've compared the results with the old version), but for many
> methods I've just mimicked (copied) the code found in other implementations.
> So why does needsScores() always return false, and how do I implement
> isCacheable() correctly?
> Could anyone write a short description of these methods and how they
> have to be implemented?
>
> Best regards,
> Vincenzo
>
> On Sat, Jul 4, 2020 at 3:29 AM Vincenzo D'Amore  wrote:
>
> > Hi all, I made a few steps forward but am still struggling with how to read
> > the field value inside my custom DoubleValuesSource
> >
> > final CustomValuesSource valuesSource = new
> > CustomValuesSource(data, req.getSchema().getField(field));
> > return FunctionScoreQuery.boostByValue(query,
> > valuesSource);
> >
> > CustomValuesSource extends DoubleValuesSource
> >
> > But, if I did right, I'm struggling with the getValues code.
> >
> > public DoubleValues getValues(LeafReaderContext ctx, DoubleValues scores)
> > throws IOException {
> >
> > The field I have to read is a binary field, and I can't find an example
> > of how to read a binary field from a LeafReaderContext
> >
> > Any help appreciated.
> >
> > Best regards,
> > Vincenzo
> >
> > On Thu, Jul 2, 2020 at 1:19 PM Vincenzo D'Amore 
> > wrote:
> >
> >> Hi Mikhail, I was just trying to understand how to extend
> >> DoubleValuesSource class, now I'm looking around to find an inspiring
> >> example...
> >>
> >> On Thu, Jul 2, 2020 at 12:55 PM Mikhail Khludnev  wrote:
> >>
> >>> Hi, Vincenzo.
> >>>
> >>> Have you tried to implement DoubleValuesSource ?
> >>>
> >>> On Thu, Jul 2, 2020 at 9:58 AM Vincenzo D'Amore 
> >>> wrote:
> >>>
> >>> > Again, @Federico Pici or anybody, did you figure out how to
> >>> > port CustomScoreQuery in Solr8?
> >>> >
> >>> > On Tue, Jul 23, 2019 at 1:05 AM Xiaofei  wrote:
> >>> >
> >>> > > @Federico Pici, did you figure out on how to produce customized
> >>> score in
> >>> > > Solr
> >>> > > 8?
> >>> > >
> >>> > >
> >>> > >
> >>> > > --
> >>> > > Sent from:
> >>> > > http://lucene.472066.n3.nabble.com/Lucene-Java-Users-f532864.html
> >>> > >
> >>> > > -
> >>> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>> > >
> >>> > >
> >>> >
> >>> > --
> >>> > Vincenzo D'Amore
> >>> >
> >>>
> >>>
> >>> --
> >>> Sincerely yours
> >>> Mikhail Khludnev
> >>>
> >>
> >>
> >> --
> >> Vincenzo D'Amore
> >>
> >>
> >
> > --
> > Vincenzo D'Amore
> >
> >
>
> --
> Vincenzo D'Amore

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Optimizing a boolean query for 100s of term clauses

2020-06-24 Thread Michael Sokolov
Yeah that will require some changes since what it does currently is to
maintain a bitset, and or into it repeatedly (once for each term's
docs). To maintain counts, you'd need a counter per doc (rather than a
bit), and you might lose some of the speed...
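As a toy sketch of the counting variant, per segment -- "leafReader" and
"queryHashes" are stand-ins for the plugin's own state, and this leaves out
the skipping machinery that makes the real disjunction fast:

    int[] counts = new int[leafReader.maxDoc()];
    for (BytesRef hash : queryHashes) {
      PostingsEnum postings =
          leafReader.postings(new Term("hashes", hash), PostingsEnum.NONE);
      if (postings == null) {
        continue; // term absent from this segment
      }
      for (int doc = postings.nextDoc();
           doc != DocIdSetIterator.NO_MORE_DOCS;
           doc = postings.nextDoc()) {
        counts[doc]++; // one int per doc instead of one bit
      }
    }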

On Tue, Jun 23, 2020 at 8:52 PM Alex K  wrote:
>
> The TermInSetQuery is definitely faster. Unfortunately it doesn't seem to
> return the number of terms that matched in a given document. Rather it just
> returns the boost value. I'll look into copying/modifying the internals to
> return the number of matched terms.
>
> Thanks
> - AK
>
> On Tue, Jun 23, 2020 at 3:17 PM Alex K  wrote:
>
> > Hi Michael,
> > Thanks for the quick response!
> >
> > I will look into the TermInSetQuery.
> >
> > My usage of "heap" might've been confusing.
> > I'm using a FunctionScoreQuery from Elasticsearch.
> > This gets instantiated with a Lucene query, in this case the boolean query
> > as I described it, as well as a custom ScoreFunction object.
> > The ScoreFunction exposes a single method that takes a doc id and the
> > BooleanQuery score for that doc id, and returns another score.
> > In that method I use a MinMaxPriorityQueue from the Guava library to
> > maintain a fixed-capacity subset of the highest-scoring docs and evaluate
> > exact similarity on them.
> > Once the queue is at capacity, I just return 0 for any docs that had a
> > boolean query score smaller than the min in the queue.
> >
> > But you can actually forget entirely that this ScoreFunction exists. It
> > only contributes ~6% of the runtime.
> > Even if I only use the BooleanQuery by itself, I still see the same
> > behavior and bottlenecks.
> >
> > Thanks
> > - AK
> >
> >
> > On Tue, Jun 23, 2020 at 2:06 PM Michael Sokolov 
> > wrote:
> >
> >> You might consider using a TermInSetQuery in place of a BooleanQuery
> >> for the hashes (since they are all in the same field).
> >>
> >> I don't really understand why you are seeing so much cost in the heap
> >> - it sounds as if you have a single heap with mixed scores - those
> >> generated by the BooleanQuery and those generated by the vector
> >> scoring operation. Maybe you can comment a little more on the interaction
> >> there - are there really two heaps? Do you override the standard
> >> collector?
> >>
> >> On Tue, Jun 23, 2020 at 9:51 AM Alex K  wrote:
> >> >
> >> > Hello all,
> >> >
> >> > I'm working on an Elasticsearch plugin (using Lucene internally) that
> >> > allows users to index numerical vectors and run exact and approximate
> >> > k-nearest-neighbors similarity queries.
> >> > I'd like to get some feedback about my usage of BooleanQueries and
> >> > TermQueries, and see if there are any optimizations or performance
> >> tricks
> >> > for my use case.
> >> >
> >> > An example use case for the plugin is reverse image search. A user can
> >> > store vectors representing images and run a nearest-neighbors query to
> >> > retrieve the 10 vectors with the smallest L2 distance to a query vector.
> >> > More detailed documentation here: http://elastiknn.klibisz.com/
> >> >
> >> > The main method for indexing the vectors is based on Locality Sensitive
> >> > Hashing <https://en.wikipedia.org/wiki/Locality-sensitive_hashing>.
> >> > The general pattern is:
> >> >
> >> >1. When indexing a vector, apply a hash function to it, producing a
> >> set
> >> >of discrete hashes. Usually there are anywhere from 100 to 1000
> >> hashes.
> >> >Similar vectors are more likely to share hashes (i.e., similar
> >> vectors
> >> >produce hash collisions).
> >> >2. Convert each hash to a byte array and store the byte array as a
> >> >Lucene Term at a specific field.
> >> >3. Store the complete vector (i.e. floating point numbers) in a
> >> binary
> >> >doc values field.
> >> >
> >> > In other words, I'm converting each vector into a bag of words, though
> >> the
> >> > words have no semantic meaning.
> >> >
> >> > A query works as follows:
> >> >
> >> >1. Given a query vector, apply the same hash function to produce a
> >> set
> >> >of hashes.
> >> >2. Convert each hash to a byte array and create a Term.
> >

Re: Optimizing a boolean query for 100s of term clauses

2020-06-23 Thread Michael Sokolov
You might consider using a TermInSetQuery in place of a BooleanQuery
for the hashes (since they are all in the same field).

I don't really understand why you are seeing so much cost in the heap
- it sounds as if you have a single heap with mixed scores - those
generated by the BooleanQuery and those generated by the vector
scoring operation. Maybe you can comment a little more on the interaction
there - are there really two heaps? Do you override the standard
collector?
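A sketch of that rewrite, with the field name and hash-to-bytes conversion
assumed to mirror the description below:

    List<BytesRef> hashTerms = new ArrayList<>();
    for (byte[] hash : queryHashBytes) { // stand-in for the query vector's hashes
      hashTerms.add(new BytesRef(hash));
    }
    // One query over all terms in the field, instead of hundreds of SHOULD
    // clauses; it scores every matching doc with a constant value.
    Query q = new TermInSetQuery("hashes", hashTerms);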

On Tue, Jun 23, 2020 at 9:51 AM Alex K  wrote:
>
> Hello all,
>
> I'm working on an Elasticsearch plugin (using Lucene internally) that
> allows users to index numerical vectors and run exact and approximate
> k-nearest-neighbors similarity queries.
> I'd like to get some feedback about my usage of BooleanQueries and
> TermQueries, and see if there are any optimizations or performance tricks
> for my use case.
>
> An example use case for the plugin is reverse image search. A user can
> store vectors representing images and run a nearest-neighbors query to
> retrieve the 10 vectors with the smallest L2 distance to a query vector.
> More detailed documentation here: http://elastiknn.klibisz.com/
>
> The main method for indexing the vectors is based on Locality Sensitive
> Hashing <https://en.wikipedia.org/wiki/Locality-sensitive_hashing>.
> The general pattern is:
>
>1. When indexing a vector, apply a hash function to it, producing a set
>of discrete hashes. Usually there are anywhere from 100 to 1000 hashes.
>Similar vectors are more likely to share hashes (i.e., similar vectors
>produce hash collisions).
>2. Convert each hash to a byte array and store the byte array as a
>Lucene Term at a specific field.
>3. Store the complete vector (i.e. floating point numbers) in a binary
>doc values field.
>
> In other words, I'm converting each vector into a bag of words, though the
> words have no semantic meaning.
>
> A query works as follows:
>
>1. Given a query vector, apply the same hash function to produce a set
>of hashes.
>2. Convert each hash to a byte array and create a Term.
>3. Build and run a BooleanQuery with a clause for each Term. Each clause
>looks like this: `new BooleanClause(new ConstantScoreQuery(new
>TermQuery(new Term(field, new BytesRef(hashValue.toByteArray))),
>BooleanClause.Occur.SHOULD))`.
>4. As the BooleanQuery produces results, maintain a fixed-size heap of
>its scores. For any score exceeding the min in the heap, load its vector
>from the binary doc values, compute the exact similarity, and update the
>heap. Otherwise the vector gets a score of 0.
>
> When profiling my benchmarks with VisualVM, I've found the Elasticsearch
> search threads spend > 50% of the runtime in these two methods:
>
>- org.apache.lucene.search.DisiPriorityQueue.downHeap (~58% of runtime)
>- org.apache.lucene.search.DisjunctionDISIApproximation.nextDoc (~8% of
>runtime)
>
> So the time seems to be dominated by collecting and ordering the results
> produced by the BooleanQuery from step 3 above.
> The exact similarity computation is only about 15% of the runtime. If I
> disable it entirely, I still see the same bottlenecks in VisualVM.
> Reducing the number of hashes yields roughly linear scaling (i.e., 400
> hashes take ~2x longer than 200 hashes).
>
> The use case seems different to text search in that there's no semantic
> meaning to the terms, their length, their ordering, their stems, etc.
> I basically just need the index to be a rudimentary HashMap, and I only
> care about the scores for the top k results.
> With that in mind, I've made the following optimizations:
>
>- Disabled tokenization on the FieldType (setTokenized(false))
>- Disabled norms on the FieldType (setOmitNorms(true))
>- Set similarity to BooleanSimilarity on the elasticsearch
>MappedFieldType
>- Set index options to IndexOptions.Docs.
>- Used the MoreLikeThis heuristic to pick a subset of terms. This
>understandably only yields a speedup proportional to the number of
>discarded terms.
>
> I'm using Elasticsearch version 7.6.2 with Lucene 8.4.0.
> The main query implementation is here.
> The actual query that gets executed by Elasticsearch is instantiated on
> line 98.
> It's in Scala but all of the Java query classes should look familiar.
>
> Maybe there are some settings that I'm not aware of?
> Maybe I could optimize this by implementing a custom query or scorer?
> Maybe there's just 

Re: [VOTE] Lucene logo contest

2020-06-16 Thread Michael Sokolov
A
non-PMC

On Tue, Jun 16, 2020 at 4:52 PM Bruno Roustant  wrote:
>
> C - current logo
> not PMC
>
> Le mar. 16 juin 2020 à 21:38, Erik Hatcher  a écrit :
>>
>> C - current logo
>>
>> On Jun 15, 2020, at 6:08 PM, Ryan Ernst  wrote:
>>
>> Dear Lucene and Solr developers!
>>
>> In February a contest was started to design a new logo for Lucene [1]. That 
>> contest concluded, and I am now (admittedly a little late!) calling a vote.
>>
>> The entries are labeled as follows:
>>
>> A. Submitted by Dustin Haver [2]
>>
>> B. Submitted by Stamatis Zampetakis [3] Note that this has several variants. 
>> Within the linked entry there are 7 patterns and 7 color palettes. Any vote 
>> for B should contain the pattern number, like B1 or B3. If a B variant wins, 
>> we will have a followup vote on the color palette.
>>
>> C. The current Lucene logo [4]
>>
>> Please vote for one of the three (or nine depending on your perspective!) 
>> above choices. Note that anyone in the Lucene+Solr community is invited to 
>> express their opinion, though only Lucene+Solr PMC cast binding votes 
>> (indicate non-binding votes in your reply, please). This vote will close one 
>> week from today, Mon, June 22, 2020.
>>
>> Thanks!
>>
>> [1] https://issues.apache.org/jira/browse/LUCENE-9221
>> [2] 
>> https://issues.apache.org/jira/secure/attachment/12999548/Screen%20Shot%202020-04-10%20at%208.29.32%20AM.png
>> [3] https://issues.apache.org/jira/secure/attachment/12997768/zabetak-1-7.pdf
>> [4] https://lucene.apache.org/theme/images/lucene/lucene_logo_green_300.png
>>
>>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene Approximation

2020-06-02 Thread Michael Sokolov
Sorry, I thought that you wanted to maintain the true value rather
than the approximated value. I am not entirely sure, but I think the
approximation arises due to rounding and low-precision storage of
these values in the index. You might be able to reverse engineer it by
looking at "Norms," which involve the document length. TBH there has
been a fair amount of change there in recent releases, and I'm not
completely up to speed on what we store, so I'll decline to provide
more misinformation at this point!
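One concrete place to look: the lossy length encoding used for BM25 norms
lives in org.apache.lucene.util.SmallFloat, so you may be able to reproduce
the bucketing offline. A sketch (the exact representative values depend on
your Lucene version):

    import org.apache.lucene.util.SmallFloat;

    public class NormBucketDemo {
      public static void main(String[] args) {
        int docLength = 756;
        // the length is stored as a single lossy byte ...
        byte norm = SmallFloat.intToByte4(docLength);
        // ... and decoded back to its bucket's representative value
        int approx = SmallFloat.byte4ToInt(norm);
        System.out.println(docLength + " -> " + approx); // e.g. 756 -> 728
      }
    }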

On Tue, Jun 2, 2020 at 1:20 PM  wrote:
>
> Thank you for your answer, but could you please explain this idea in
> detail, as I cannot see how this would help solve my problem?
>
> For example, I got the indexed Wikipedia Article "Alan Smithee" with a
> document length of 756, which also is used when calculating the average
> document length. But when the BM25 score for this article is calculated, it
> uses the approximated document length of 728, which returns a different
> result from when the score is calculated with the correct document
> length. So I wonder where this value is calculated, and how I might
> change this approximation or at least get the approximated value, so
> that I can use it for my own calculations.
>
> On 2020-06-02 18:48, Michael Sokolov wrote:
> > You could append an EOF token to every indexed text, and then iterate
> > over Terms to get the positions of those tokens?
> >
> > On Tue, Jun 2, 2020 at 11:50 AM Moritz Staudinger
> >  wrote:
> >>
> >> Hello,
> >>
> >> I am not sure if I am at the right place here, but I got a question
> >> about
> >> the approximation my Lucene implementation does.
> >>
> > I am trying to calculate the same scores Lucene's BM25Similarity
> >> calculates,
> >> but I found out that Lucene only approximates the length of documents
> >> for
> >> scoring but uses the correct values for the average document length.
> >> Is there a way to turn off these approximations or to get the values,
> >> so
> >> that I can save it for my own calculations?
> >>
> >> For my Implementation I use Lucene 8.4.1 in Combination with Spring
> >> Boot, if
> >> this is necessary.
> >>
> >> Thank you in advance,
> >> Moritz
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene Approximation

2020-06-02 Thread Michael Sokolov
You could append an EOF token to every indexed text, and then iterate
over Terms to get the positions of those tokens?

On Tue, Jun 2, 2020 at 11:50 AM Moritz Staudinger
 wrote:
>
> Hello,
>
> I am not sure if I am at the right place here, but I got a question about
> the approximation my Lucene implementation does.
>
> I am trying to calculate the same scores Lucene's BM25Similarity calculates,
> but I found out that Lucene only approximates the length of documents for
> scoring but uses the correct values for the average document length.
> Is there a way to turn off these approximations or to get the values, so
> that I can save it for my own calculations?
>
> For my Implementation I use Lucene 8.4.1 in Combination with Spring Boot, if
> this is necessary.
>
> Thank you in advance,
> Moritz
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: issue with Lucene UpdateDocument

2020-03-01 Thread Michael Sokolov
So -- you update a single document and the call to updateDocument
takes 3 minutes? Or you update a single document and call commit() and
that takes 3 minutes? Or -- you update 10 documents and call
commit() and that takes 3 minutes? We can't help you with the level of
detail you've provided. As far as buffered memory vs main memory it's
not clear what you're asking. There is a control for the buffer size
used by IndexWriter to decide when to flush a segment - maybe that's
what you're after? Look in IndexWriterConfig for it.
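For reference, the flush controls look like this (values are illustrative,
not recommendations; note that flushing makes changes visible to NRT readers,
while durability still requires commit()):

    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    // flush a new segment once buffered changes reach ~64 MB ...
    config.setRAMBufferSizeMB(64.0);
    // ... or after a fixed number of buffered docs, whichever triggers first
    config.setMaxBufferedDocs(10_000);
    IndexWriter writer = new IndexWriter(directory, config);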

On Fri, Feb 28, 2020 at 12:15 AM Jyothsna Bavisetti
 wrote:
>
> HI Team,
>
>
>
>
>
> 1. We upgraded Lucene 4.6 to 8+. After upgrading, we are facing an issue with
> the UpdateDocument API. We are using UpdateDocument for editing existing
> records and adding new records.
>
> 2. Adding a new record to the index file is working fine.
>
> 3. When we try to edit one of the records from the list, it deletes that
> specific record from the index file and adds the new value. But the time
> taken for the edited value to reach the segment file is almost three minutes.
>
> 4. Is there any API for controlling when buffered changes are committed from
> memory to the index?
>
>
>
> Please share any suggestions for committing changes before the buffer
> flushes on its own.
>
> Thanks,
>
> Jyothsna
>
>
>
>
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Searching number of tokens in text field

2019-12-28 Thread Michael Sokolov
I don't know of any pre-existing thing that does exactly this, but how
about a token filter that counts tokens (or positions maybe), and then
appends some special token encoding the length?
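A rough sketch of such a filter -- the "__len_" token prefix is made up for
illustration:

    // Counts tokens as they stream by, then appends one synthetic token
    // encoding the total, e.g. "__len_3" for a three-word title.
    final class TokenCountFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private int count;
      private boolean emitted;

      TokenCountFilter(TokenStream in) {
        super(in);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
          count++;
          return true;
        }
        if (!emitted) {
          emitted = true;
          clearAttributes();
          termAtt.setEmpty().append("__len_").append(Integer.toString(count));
          return true;
        }
        return false;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        count = 0;
        emitted = false;
      }
    }

"3 or more words" then becomes a disjunction over __len_3, __len_4, ... up to
some cap; for open-ended ranges it may be simpler to write the count into a
separate numeric field (e.g. an IntPoint) instead.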

On Sat, Dec 28, 2019, 9:36 AM Matt Davis  wrote:

> Hello,
>
> I was wondering if it is possible to search for the number of tokens in a
> text field.  For example find book titles with 3 or more words.  I don't
> mind adding a field that is the number of tokens to the search index but I
> would like to avoid analyzing the text two times.   Can Lucene search for
> the number of tokens in a text field?  Or can I get the number of tokens
> after analysis and add it to the Lucene document before/during indexing?
> Or do I need to analyze the text myself and add the field to the document
> (analyzing the text twice, once myself, once in the IndexWriter)?
>
> Thanks,
> Matt Davis
>


Re: Using Lucene as a Document Comparison Tool

2019-12-13 Thread Michael Sokolov
Have you tried making a BooleanQuery with a TermQuery for every word in the
query document, each added as SHOULD (optional)? You will get a lot of
matches, ranked according to the similarity.
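A sketch of that, assuming a "body" field and an analyzer matching the one
used at index time (mind BooleanQuery's default limit of 1024 clauses for
long query documents):

    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    try (TokenStream ts = analyzer.tokenStream("body", queryDocumentText)) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        // every term is optional; scoring does the ranking
        builder.add(new TermQuery(new Term("body", term.toString())),
                    BooleanClause.Occur.SHOULD);
      }
      ts.end();
    }
    Query similarityQuery = builder.build();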

On Thu, Dec 12, 2019 at 10:47 AM John Brown  wrote:
>
> Hi,
>
>
>
> I have some questions about how to use Lucene for the specific purpose of
> finding document similarities. Lucene seems to have classes that were made
> for this, including: ClassicSimilarity and BM25Similarity. However I’m
> fumbling a bit when it comes to implementing them.
>
>
>
> From what I understand, to use these classes you simply set the similarity
> of your IndexWriter and IndexSearcher, then submit a query. The documents
> returned from your query should be ordered from highest to lowest
> similarity.
>
>
>
> My initial thought was to just use a phrase query to hold the "document" I
> want to find similarities to, but phrase queries are limited in that they
> will only return results that are deemed to fall within a certain slop
> value. Term/Boolean queries are similarly limited in that they allow
> documents to be sorted only if they contain all the terms in the query.
>
>
>
> Ideally, I’d like to submit a query that would essentially be a document
> itself. Each of my queries contain 10 or so phrases, that each contain 5-10
> terms. I would like to compare this query with all the documents in my
> index to see which is the most similar, and which is the least similar. I
> feel as if there is an easy way to do this that I'm missing, after all, I
> essentially just want to remove a step from the process. Any help would be
> much appreciated.
>
>
> Thank  you,
>
> -John B

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Get distinct fields values from lucene index

2019-11-22 Thread Michael Sokolov
In Solr and ES this is done with faceting and aggregations,
respectively, based on Lucene's low-level APIs. Have you looked at
TermsEnum? You can use that to get all distinct terms for a segment,
and then it is up to you to coalesce terms across segments ("leaves").
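A minimal sketch of that -- the field name is an assumption, and the TreeSet
does the cross-segment coalescing here, which is fine for modest cardinalities
(note it also sees terms that only remain in deleted, not-yet-merged docs):

    Set<String> distinct = new TreeSet<>();
    for (LeafReaderContext leaf : reader.leaves()) {
      Terms terms = leaf.reader().terms("myField");
      if (terms == null) {
        continue; // field absent from this segment
      }
      TermsEnum te = terms.iterator();
      for (BytesRef term = te.next(); term != null; term = te.next()) {
        distinct.add(term.utf8ToString());
      }
    }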

On Thu, Nov 21, 2019 at 1:15 AM Amol Suryawanshi
 wrote:
>
> Hello,
>
> I am using Lucene in my organization. I want to know how I can get distinct
> values from a Lucene index. I have tried the “GroupingSearch” API but it
> doesn't serve the purpose. It gives all documents containing the distinct
> values. I have used the code below.
>
>
> final GroupingSearch groupingSearch = new GroupingSearch(groupField);
>
> Sort sort  =  new Sort(new SortField(groupField, SortField.Type.STRING_VAL, 
> false));
> groupingSearch.setSortWithinGroup(sort);
> Query query = new MatchAllDocsQuery();
> TopGroups topGroups = null;
>
> try {
> topGroups = groupingSearch.search(searcher, query, 0, 10);
> } catch (final IOException e) {
> System.out.println("Can't execute group search because of an IOException. 
> "+ e);
> }
>
> Sent from Mail for Windows 10
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Question about the light and minimal French stemmers

2019-07-28 Thread Michael Sokolov
Oh sorry for jumping in with my irrelevant comment, you are right, of
course!

On Sat, Jul 27, 2019, 10:36 PM Tomoko Uchida 
wrote:

> Let me just make things a bit clearer...
> I think the concern here is that FrenchMinimalStemmer would remove the
> last "digit" from a token, because it does not check whether the
> character is a letter or not.
> e.g., "123455" is trimmed to "12345" by FrenchMinimalStemmer.
>
> To me, this behaviour is beyond stemming.
>
> Tomoko
>
> On Sun, Jul 28, 2019 at 4:55 Michael Sokolov wrote:
> >
> > I'm not so sure. I think the whole idea of having both stemmers is that
> the
> > minimal one does less than the light one.
> >
> > Removing the final character of a double letter suffix is going to
> > sacrifice some precision. For example mes/mess, ne/née, I'm sure there
> are
> > others.
> >
> > So having both options is helpful, I don't think it's a bug on the face
> of
> > it. However I didn't look closely at the code, so I'm not sure what the
> > intent is exactly.
> >
> > On Sat, Jul 27, 2019, 7:30 AM Tomoko Uchida <
> tomoko.uchida.1...@gmail.com>
> > wrote:
> >
> > > Hi Adrien,
> > >
> > > To me, it sounds simply a bug. Can you please open a JIRA (with a
> > > patch if possible)?
> > >
> > > Tomoko
> > >
> > > On Tue, Jul 23, 2019 at 22:05 Adrien Gallou wrote:
> > > >
> > > > Hi,
> > > >
> > > > I'm using both light and minimal French stemmers and encountered an
> issue
> > > > when using the minimal stemmer.
> > > >
> > > > The light stemmer removes the last character of a word if the last
> two
> > > > characters are identical.
> > > > We can see that here:
> > > >
> > >
> https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263
> > > > In this light stemmer, there is a check to avoid altering the token
> if
> > > the
> > > > token is a number.
> > > >
> > > > The minimal stemmer also removes the last character of a word if the
> last
> > > > two characters are identical.
> > > > We can see that here:
> > > >
> > >
> https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77
> > > >
> > > > But in this minimal stemmer there is no check to see if the
> character is
> > > a
> > > > letter or not.
> > > > So when we have numeric tokens with the last two characters identical
> > > they
> > > > are altered.
> > > >
> > > > Is there a reason for this?
> > > > Should I file an issue on Jira to add this check?
> > > >
> > > > Thanks,
> > > >
> > > > Adrien Gallou
> > >
> > > -
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> > >
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Question about the light and minimal French stemmers

2019-07-27 Thread Michael Sokolov
I'm not so sure. I think the whole idea of having both stemmers is that the
minimal one does less than the light one.

Removing the final character of a double letter suffix is going to
sacrifice some precision. For example mes/mess, ne/née, I'm sure there are
others.

So having both options is helpful, I don't think it's a bug on the face of
it. However I didn't look closely at the code, so I'm not sure what the
intent is exactly.

On Sat, Jul 27, 2019, 7:30 AM Tomoko Uchida 
wrote:

> Hi Adrien,
>
> To me, it sounds simply a bug. Can you please open a JIRA (with a
> patch if possible)?
>
> Tomoko
>
> On Tue, Jul 23, 2019 at 22:05 Adrien Gallou wrote:
> >
> > Hi,
> >
> > I'm using both light and minimal French stemmers and encountered an issue
> > when using the minimal stemmer.
> >
> > The light stemmer removes the last character of a word if the last two
> > characters are identical.
> > We can see that here:
> >
> https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263
> > In this light stemmer, there is a check to avoid altering the token if
> the
> > token is a number.
> >
> > The minimal stemmer also removes the last character of a word if the last
> > two characters are identical.
> > We can see that here:
> >
> https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77
> >
> > But in this minimal stemmer there is no check to see if the character is
> a
> > letter or not.
> > So when we have numeric tokens with the last two characters identical
> they
> > are altered.
> >
> > Is there a reason for this?
> > Should I file an issue on Jira to add this check?
> >
> > Thanks,
> >
> > Adrien Gallou
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: [External] Re: How to ignore certain words based on query specifics

2019-07-10 Thread Michael Sokolov
I'm not as au courant with highlighters as I used to be. I think some of them
work using postings, and for those, no, you wouldn't be able to highlight
stop words. But maybe you can use the old default highlighter that would
reanalyze the document from a stored field, using an Analyzer that doesn't
remove stop words? Sorry I'm not sure if that exists any more, maybe
someone else will know.

On Tue, Jul 9, 2019, 10:17 AM Shifflett, David [USA] <
shifflett_da...@bah.com> wrote:

> Michael,
> Thanks for your reply.
>
> You are correct, the desired effect is to not match 'freedom ...'.
> I hadn't considered the case where both free* and freedom match.
>
> My solution 'free* and not freedom' would NOT match either of your
> examples.
>
> I think what I really want is
> Get every matching term from a matching document,
> and if the term also matches an ignore word, then ignore the match.
>
> I hadn't considered the stopwords approach, I'll look into that.
> If I add all the ignore words as stop words, will that affect highlighting?
> Are the stopwords still available for highlighting?
>
> Thanks,
> David Shifflett
>
>
> On 7/9/19, 11:58 AM, "Michael Sokolov"  wrote:
>
I think what you're saying in your example is that "free*" should
match anything with a term matching that pattern, but not *only*
freedom. In other words, if a document has "freedom from stupidity"
then it should not match, but if the document has "free freedom from
stupidity" then it should.
>
> Is that correct?
>
> You could apply stopwords, except that it sounds as if this is a
> per-user blacklist, and you want them to share the same index?
>
> On Tue, Jul 9, 2019 at 11:29 AM Shifflett, David [USA]
>  wrote:
> >
> > Sorry for the weird reply path, but I couldn’t find an easy reply
> method via the list archive.
> >
> > Anyway …
> >
> > The use case is as follows:
> > Allow the user to specify queries such as ‘free*’
> > and also include similar words to be ignored, such as freedom.
> > Another example would be ‘secret*’ and secretary.
> >
> > I want to keep the ignore words separate so they apply to all
> queries,
> > but then realized the ignore words should only apply to relevant
> (matching) queries.
> >
> > I don’t want the users to be required to add ‘and not WORD’ many
> times to each of the listed queries.
> >
> > David Shifflett
> >
> > From: Diego Ceccarelli
> >
> > Could you please describe the use case? maybe there is an easier
> solution
> >
> >
> >
> > From: "Shifflett, David [USA]" 
> > Date: Tuesday, July 9, 2019 at 8:02 AM
> > To: "java-user@lucene.apache.org" 
> > Subject: How to ignore certain words based on query specifics
> >
> > Hi all,
> > I have a configuration file that lists multiple queries, of all
> different types,
> > and that lists words to be ignored.
> >
> > Each of these lists is user configured, variable in length and
> content.
> >
> > I know that, in general, unless the ignore word is in the query it
> won’t match,
> > but I need to be able to handle wildcard, fuzzy, and Regex, queries
> which might match.
> >
> > What I need to be able to do is ignore the words in the ignore list,
> > but only when they match terms the query would match.
> >
> > For example: if the query is ‘free*’ and ‘freedom’ should be ignored,
> > I could modify the query to be ‘free*’ and not freedom.
> >
> > But if ‘liberty’ is also to be ignored, I don’t want to add ‘and not
> liberty’ to that query
> > because that could produce false negatives for documents containing
> free and liberty.
> >
> > I think what I need to do is:
> > for each query
> >   for each ignore word
> > if the query would match the ignore word,
> >   add ‘and not ignore word’ to the query
> >
> > How can I test if a query would match an ignore word without putting
> the ignore words into an index
> > and searching the index?
> > This seems like overkill.
> >
> > To make matters worse, for a query like A and B and C,
> > this won’t match an index of ignore words that contains C, but not A
> or B.
> >
> > Thanks in advance, for any suggestions or advice,
> > David Shifflett
> >
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
>
>


Re: How to ignore certain words based on query specifics

2019-07-09 Thread Michael Sokolov
I think what you're saying in your example is that "free*" should
match anything with a term matching that pattern, but not *only*
freedom. In other words, if a document has "freedom from stupidity"
then it should not match, but if the document has "free freedom from
stupidity" then it should.

Is that correct?

You could apply stopwords, except that it sounds as if this is a
per-user blacklist, and you want them to share the same index?
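One lightweight way to run the "would this query match this word?" test asked
about below -- offered purely as an illustration, it wasn't discussed in this
thread -- is Lucene's MemoryIndex, which holds a single transient document:

    MemoryIndex probe = new MemoryIndex();
    probe.addField("content", ignoreWord, analyzer); // a one-word "document"
    boolean queryMatchesWord = probe.search(query) > 0.0f;
    // if true, rewrite the query as: (original) AND NOT ignoreWord

As noted below, a conjunction like 'A and B and C' will never match a one-word
document, so this probe mainly catches the wildcard/fuzzy/regex cases.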

On Tue, Jul 9, 2019 at 11:29 AM Shifflett, David [USA]
 wrote:
>
> Sorry for the weird reply path, but I couldn’t find an easy reply method via 
> the list archive.
>
> Anyway …
>
> The use case is as follows:
> Allow the user to specify queries such as ‘free*’
> and also include similar words to be ignored, such as freedom.
> Another example would be ‘secret*’ and secretary.
>
> I want to keep the ignore words separate so they apply to all queries,
> but then realized the ignore words should only apply to relevant (matching) 
> queries.
>
> I don’t want the users to be required to add ‘and not WORD’ many times to 
> each of the listed queries.
>
> David Shifflett
>
> From: Diego Ceccarelli
>
> Could you please describe the use case? maybe there is an easier solution
>
>
>
> From: "Shifflett, David [USA]" 
> Date: Tuesday, July 9, 2019 at 8:02 AM
> To: "java-user@lucene.apache.org" 
> Subject: How to ignore certain words based on query specifics
>
> Hi all,
> I have a configuration file that lists multiple queries, of all different 
> types,
> and that lists words to be ignored.
>
> Each of these lists is user configured, variable in length and content.
>
> I know that, in general, unless the ignore word is in the query it won’t 
> match,
> but I need to be able to handle wildcard, fuzzy, and Regex, queries which 
> might match.
>
> What I need to be able to do is ignore the words in the ignore list,
> but only when they match terms the query would match.
>
> For example: if the query is ‘free*’ and ‘freedom’ should be ignored,
> I could modify the query to be ‘free*’ and not freedom.
>
> But if ‘liberty’ is also to be ignored, I don’t want to add ‘and not liberty’ 
> to that query
> because that could produce false negatives for documents containing free and 
> liberty.
>
> I think what I need to do is:
> for each query
>   for each ignore word
> if the query would match the ignore word,
>   add ‘and not ignore word’ to the query
>
> How can I test if a query would match an ignore word without putting the 
> ignore words into an index
> and searching the index?
> This seems like overkill.
>
> To make matters worse, for a query like A and B and C,
> this won’t match an index of ignore words that contains C, but not A or B.
>
> Thanks in advance, for any suggestions or advice,
> David Shifflett
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Index-time join ToParentBlockJoinQuery query produces incorrect result

2019-07-03 Thread Michael Sokolov
Well for one thing, you might have other documents in the index that
are neither parents nor children (in this particular relation). Also,
consider a nested hierarchy - how can we automatically figure out
which "generation" or "level" of parent to select?

On Wed, Jul 3, 2019 at 2:50 PM ANDREI SOLODIN  wrote:
>
> After looking through the unit tests, I got it working. The problem was that
> I thought the parent filter in the ToParentBlockJoinQuery could be used to
> select a subset of parents. It appears that the parent filter must select ALL
> parents, not a subset. This is not explained in the javadoc. If you want to
> select a subset of parents (independently of the child query),
> ToParentBlockJoinQuery cannot be used on its own, but rather as a clause in
> another query.
>
> It would be a nice enhancement to just automatically select all parents; I
> mean, the parent is already required to be the last document in the block, so
> why do we need to provide a query for them?
>
> > On July 3, 2019 at 10:52 AM ANDREI SOLODIN  wrote:
> >
> >
> > Thanks Mikhail.
> >
> >
> > I read through the javadoc and thought I was satisfying all the 
> > preconditions. Obviously not :-) Is it this part that am I getting wrong: 
> > "At search time you provide a Filter identifying the parents, however this 
> > Filter must provide an BitSet 
> > https://lucene.apache.org/core/8_1_1/core/org/apache/lucene/util/BitSet.html?is-external=true
> >  per sub-reader."? If so, given the data above how do I properly create a 
> > parent query?
> >
> >
> > > On July 3, 2019 at 10:30 AM Mikhail Khludnev <m...@apache.org> wrote:
> > >
> > > On Wed, Jul 3, 2019 at 6:11 PM ANDREI SOLODIN <asolo...@comcast.net> wrote:
> > >
> > > > This returns "id3", which is unexpected.
> > >
> > > Please check ToPBJQ javadoc. It's absolutely expected.
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
>
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Sampled Queries -- Use Cases and Feedback

2019-06-11 Thread Michael Sokolov
Atri, in the abstract it sounds like a great idea, but in practice it will
only be as good as the data that drives it. I think that to make this work
it would be a good idea to write up a proposal of some sort targeting
different open-source (or commercial, although I doubt you would get much of
those) projects that use Lucene-based search, asking them to contribute
their data.

Also, can we learn anything from the previous attempt? What did they try?
How can this effort avoid the same pitfalls?

Even with document and query data, you still need some kind of relevance
ground truth, and this is notoriously difficult to get. Probably click
through stats are the most generic proxy for that.

So as a thought experiment, maybe contact Wikipedia and ask if they would
be willing to share some sample of queries and logs. Or did you have
another idea how to drive this? Then with one pilot participant, you could
maybe get others to join. I think if you have some commitments, or at least
serious expression of interest, from data providers, then you can start to
think about what to actually do with the data, but I would start there?

On Mon, Jun 10, 2019, 2:54 AM Atri Sharma  wrote:

> Any thoughts on this? I am envisioning applications to machine
> learning systems, where the training dataset might be a small sample
> of the entire dataset, and the user wants scoring to be done only on
> samples of the dataset.
>
> On Fri, Jun 7, 2019 at 5:45 PM Atri Sharma  wrote:
> >
> > Hi All,
> >
> > While working on a new Query type, I was inclined to think of a couple
> > of use cases where the documents being scored need not be all of the
> > data set, but a sample of them. This can be useful for very large
> > datasets, where a query is only interested in getting the "feel" of
> > the data, and other queries where the data is being aggregated over
> > time, so a wide enough sample of the data is good enough for the user
> > at the tradeoff of improved performance. Faceting already has sampling
> > mechanisms, so there are ideas to be borrowed from that part.
> >
> > I have some ideas on introducing a new query type and associated
> > semantics to allow this functionality to be present from ground up.
> > Specifically, a query type which wraps another query and "feeds"
> > offsets to the inner query, along with a limit of collection of hits.
> > I can go in more detail, but wanted to get some thoughts and feedback
> > before delving deeper.
> >
> > Atri
>
>
>
> --
> Regards,
>
> Atri
> Apache Concerted
>
>
>
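A rough sketch of the wrapping idea described above, expressed against Lucene's Collector API; SamplingCollector and its fixed sampling interval are hypothetical names for illustration, not an existing Lucene class:

import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.FilterCollector;
import org.apache.lucene.search.FilterLeafCollector;
import org.apache.lucene.search.LeafCollector;

// Forwards only every interval-th matching doc to the wrapped collector,
// so ranking sees a systematic sample of the matches rather than all of them.
public final class SamplingCollector extends FilterCollector {
  private final int interval;
  private int count = 0;

  public SamplingCollector(Collector in, int interval) {
    super(in);
    this.interval = interval;
  }

  @Override
  public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
    return new FilterLeafCollector(super.getLeafCollector(context)) {
      @Override
      public void collect(int doc) throws IOException {
        if (count++ % interval == 0) {
          super.collect(doc);
        }
      }
    };
  }
}

Note this samples matches, not documents, and any hit counts reported downstream would need to be scaled back up by the interval.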


Re: JapaneseAnalyzer's system vs user dict

2019-05-26 Thread Michael Sokolov
Thanks, Namgyu. I've been able to build a dictionary using
DictionaryBuilder (I guess that is what the "regenerate" task must be
using?) and I can replace the existing one on the classpath with jar
surgery for now. Not a very user-friendly approach, but it will enable
me to run some experiments and see whether this is truly necessary for
my use case.

On Sun, May 26, 2019 at 7:56 AM Namgyu Kim  wrote:
>
> Sorry for the wrong information, Mike.
> Tomoko is right.
> I checked it wrong.
>
> User dictionary is independent from the system dictionary. If you give
> the user entries, JapaneseTokenizer builds two FSTs one for the
> built-in dictionary and one for the user dictionary and they are
> retrieved separately.
>
> Please ignore the following lines in my e-mail.
> =====
> Japanese Analyzer does not load dictionaries by default.
> ...
> Since it is a way to create and pass the UserDictionary object, there is no
> conflict between user dictionary and system dictionary.
> (You may choose only one of them! -> means userFST instance in
> JapaneseTokenizer)
> =====
>
> The System dictionary and the User dictionary are separate, and you can
> have both.
>
> About System dictionary,
> As far as I know, it is not possible to change the System dictionary at the code
> level.
> The part that reads the System dictionary is hard-coded.
> (TokenInfoDictionary, UnknownDictionary, BinaryDictionary)
> If you really need it, can you make a JIRA issue and proceed with me?
>
> But there is a way to build a new kuromoji jar.
> 1. Modify your dictionary file and rebuild.
>   1-1) Install MeCab
>   1-2) Install MeCab Dictionary
>   1-3) Modify your dictionary file
>   1-4) Make it to tar.gz
> 2. Change the dictionary artifact URL in kuromoji/ivy.xml from
> https://jaist.dl.sourceforge.net/project/mecab/mecab-ipadic/2.7.0-20070801/mecab-ipadic-2.7.0-20070801.tar.gz
> to the URL of your modified tar.gz.
> 3. "ant regenerate" in /your/path/lucene-solr/lucene/analysis/kuromoji
> 4. "ant jar"
>
> I hope this helps you.
>
> Warm regards,
> Namgyu Kim
>
> On Sun, May 26, 2019 at 9:03 AM, Michael Sokolov wrote:
>
> > Thank you for the detailed responses! What Tomoko is saying seems
> > consistent with my cursory reading of the code. The reason I asked is
> > I have a customer that thinks they want to replace the system
> > dictionary, and I am trying to see if that is necessary. It seems as
> > if for the most part, we can supply a comprehensive user dictionary
> > and it would pretty much take the place of the system dictionary,
> > assuming it is a superset (covers at least the original system dict
> > tokens), but there is probably no way to "remove" a token that is
> > present in the system dictionary (or maybe it can effectively be
> > removed by adding it to user dictionary with a high penalty?). I'm not
> > sure why one would want to do this removal, just trying to understand
> > the design parameters.
> >
> > On Sat, May 25, 2019 at 7:30 PM Tomoko Uchida
> >  wrote:
> > >
> > > Hi,
> > >
> > > > If I provide entries in the user
> > > dictionary is it just as if I had included them in the system
> > > dictionary? If the same entry occurs in both, do the user dictionary
> > > weights supersede those in the system dictionary? Is there some way to
> > > suppress entries in the system dict?
> > >
> > > User dictionary is independent from the system dictionary. If you give
> > > the user entries, JapaneseTokenizer builds two FSTs one for the
> > > built-in dictionary and one for the user dictionary and they are
> > > retrieved separately.
> > >
> > > First the user dictionary is retrieved, and if there are no entries
> > > matched then the system dictionary is retrieved. So if any entry is
> > > found in the user dictionary, all possible candidates in the system
> > > dictionary are ignored (suppressed).
> > >
> > > (I think this is kuromoji specific behaviour, the original mecab pos
> > > tagger retrieves both of the system dictionary and user dictionary and
> > > compares their weights by performing Viterbi. In fact the behaviour -
> > > always gives priority to the entries in the user dictionary - is a bit
> > > too aggressive from the point of view of the consistency of
> > > tokenization. I do not know why, but there may be some performance
> > > reasons?)
> > >
> > > I think you can easily find the retrieval logic I described here in
> > > JapaneseTokenizer#parse() method. (Let me know if my understanding is
> > > not correct.)

Re: JapaneseAnalyzer's system vs user dict

2019-05-25 Thread Michael Sokolov
Thank you for the detailed responses! What Tomoko is saying seems
consistent with my cursory reading of the code. The reason I asked is
I have a customer that thinks they want to replace the system
dictionary, and I am trying to see if that is necessary. It seems as
if for the most part, we can supply a comprehensive user dictionary
and it would pretty much take the place of the system dictionary,
assuming it is a superset (covers at least the original system dict
tokens), but there is probably no way to "remove" a token that is
present in the system dictionary (or maybe it can effectively be
removed by adding it to user dictionary with a high penalty?). I'm not
sure why one would want to do this removal, just trying to understand
the design parameters.

On Sat, May 25, 2019 at 7:30 PM Tomoko Uchida
 wrote:
>
> Hi,
>
> > If I provide entries in the user
> dictionary is it just as if I had included them in the system
> dictionary? If the same entry occurs in both, do the user dictionary
> weights supersede those in the system dictionary? Is there some way to
> suppress entries in the system dict?
>
> User dictionary is independent from the system dictionary. If you give
> the user entries, JapaneseTokenizer builds two FSTs one for the
> built-in dictionary and one for the user dictionary and they are
> retrieved separately.
>
> First the user dictionary is retrieved, and if there are no entries
> matched then the system dictionary is retrieved. So if any entry is
> found in the user dictionary, all possible candidates in the system
> dictionary are ignored (suppressed).
>
> (I think this is kuromoji specific behaviour, the original mecab pos
> tagger retrieves both of the system dictionary and user dictionary and
> compares their weights by performing Viterbi. In fact the behaviour -
> always gives priority to the entries in the user dictionary - is a bit
> too aggressive from the point of view of the consistency of
> tokenization. I do not know why, but there may be some performance
> reasons?)
>
> I think you can easily find the retrieval logic I described here in
> JapaneseTokenizer#parse() method. (Let me know if my understanding is
> not correct.)
>
> Regards,
> Tomoko
>
> 2019年5月26日(日) 5:08 김남규 :
> >
> > Hi, Mike :D
> >
> > Japanese Analyzer does not load dictionaries by default.
> > If you look at the constructor, you can see that it is created as null if
> > no parameters are set.
> > (check testUserDict3() in TestJapaneseAnalyzer.java)
> >
> > In JapaneseTokenizer,
> > =====
> > if (userDictionary != null) {
> >   userFST = userDictionary.getFST();
> >   userFSTReader = userFST.getBytesReader();
> > } else {
> >   userFST = null;
> >   userFSTReader = null;
> > }
> > =====
> > Since it is a way to create and pass the UserDictionary object, there is no
> > conflict between user dictionary and system dictionary.
> > (You may choose only one of them! -> means userFST instance in
> > JapaneseTokenizer)
> >
> > About dictionary,
> > Lucene has one pre-built dictionary by default since Lucene 3.6.
> > You can check it in org.apache.lucene.analysis.ja.dict.
> > It is called MeCab, and it uses the Viterbi algorithm.
> > Lucene converts the MeCab dictionary (some dat files) to an FST and
> > uses that.
> > But it can't satisfy all users.
> > Depending on the situation, some users may need a custom dictionary.
> > It is also the same for Nori (the Korean Analyzer) since Lucene 7.4. (The
> > basic logic (MeCab + FST) is similar to the Japanese Analyzer.)
> > The original Korean MeCab dictionary size is almost 220MB, but Lucene's
> > dictionary size is 24MB.
> > If the user needs a dictionary of 100MB size, the user must build and use
> > it.
> > (Modify MeCab Dictionary -> Training -> Porting to Lucene)
> >
> > If anyone finds any wrong information in my reply, please send a reply with
> > the correction.
> >
> > Thank you,
> > Namgyu Kim
> >
> >
> > On Sun, May 26, 2019 at 4:03 AM, Michael Sokolov wrote:
> >
> > > I'm trying to understand the relationship between the system and user
> > > dictionaries that JapaneseAnalyzer uses. The API allows a user to
> > > provide a user dictionary; the system one is built in. Are they
> > > otherwise the same kind of thing? If I provide entries in the user
> > > dictionary is it just as if I had included them in the system
> > > dictionary? If the same entry occurs in both, do the user dictionary
> > > weights supersede those in the system dictionary? Is there some way to
> > > suppress entries in the system dict?

JapaneseAnalyzer's system vs user dict

2019-05-25 Thread Michael Sokolov
I'm trying to understand the relationship between the system and user
dictionaries that JapaneseAnalyzer uses. The API allows a user to
provide a user dictionary; the system one is built in. Are they
otherwise the same kind of thing? If I provide entries in the user
dictionary is it just as if I had included them in the system
dictionary? If the same entry occurs in both, do the user dictionary
weights supersede those in the system dictionary? Is there some way to
suppress entries in the system dict?  I hunted for documentation, but
didn't find answers to these questions, and the code is pretty
involved, so any pointers would be greatly appreciated.

-Mike
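For concreteness, a minimal example of supplying a user dictionary; the CSV line is the stock kuromoji example entry, and per the replies above, its entries take precedence over the system dictionary:

import java.io.StringReader;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.dict.UserDictionary;

// Each line: surface form, segmentation, readings, part-of-speech name.
// UserDictionary.open throws IOException.
String entries = "関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞\n";
UserDictionary userDict = UserDictionary.open(new StringReader(entries));

JapaneseAnalyzer analyzer = new JapaneseAnalyzer(
    userDict,
    JapaneseTokenizer.Mode.SEARCH,
    JapaneseAnalyzer.getDefaultStopSet(),
    JapaneseAnalyzer.getDefaultStopTags());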




de/serializing queries

2019-05-16 Thread Michael Sokolov
We have lots of nice QueryParsers, but do we have any Query
serializers that produce strings that can then be reliably parsed back
to the original query?

I thought maybe XML parser would do this since it seems to aim to be
all things, but I couldn't find a Query->XML method in a cursory
glance. If nothing exists, I think we could make one using Alan's new
QueryVisitor.

-Mike
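A sketch of what a QueryVisitor-based traversal looks like; this just dumps leaves for inspection, whereas a real serializer would have to emit a representation that round-trips every Query type:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryVisitor;

// Prints the terms and clause occurs of a query tree; nesting is not
// tracked, so this is only a starting point for a serializer.
static void dump(Query query) {
  query.visit(new QueryVisitor() {
    @Override
    public void consumeTerms(Query q, Term... terms) {
      for (Term t : terms) {
        System.out.println(t.field() + ":" + t.text());
      }
    }

    @Override
    public void visitLeaf(Query q) {
      System.out.println(q); // non-term leaves, e.g. point range queries
    }

    @Override
    public QueryVisitor getSubVisitor(BooleanClause.Occur occur, Query parent) {
      System.out.println(occur);
      return this;
    }
  });
}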




Re: fields contains equals term docs search

2019-04-22 Thread Michael Sokolov
Can you create a scoring scenario that counts the number of fields in
which a term occurs and rank by that (descending) with some kind of
post-filtering?

On Fri, Apr 19, 2019 at 11:24 AM Valentin Popov  wrote:
>
> Hi,
> I am trying to find a way to search for all docs that have equal terms in
> different fields. Like:
>
> doc1 {"foo":"master", "bar":"master"}
> doc2 {"foo":"test", "bar":"master"}
>
> As result should be doc1 only.
>
> Right now, I get all the terms for "foo" and "bar", intersect them to get
> the terms that occur in both fields,
> and then make a huge query with all the intersected items:
>
> BooleanQuery.Builder query = new BooleanQuery.Builder();
> for (String item : intersection) {
>   BooleanQuery.Builder both = new BooleanQuery.Builder();
>   both.add(new TermQuery(new Term("foo", item)), BooleanClause.Occur.MUST);
>   both.add(new TermQuery(new Term("bar", item)), BooleanClause.Occur.MUST);
>   query.add(both.build(), BooleanClause.Occur.SHOULD);
> }
>
> Is there any better way to find all docs that have intersecting terms?
>
> Thanks!
> --
> Regards,
> Valentin.




Re: umlauts / diacritic expansion

2019-04-17 Thread Michael Sokolov
Right, AsciiFoldingFilter seems to map  Ü  [LATIN CAPITAL LETTER U
WITH DIAERESIS] to "U" not "UE".

On Wed, Apr 17, 2019 at 12:26 AM Ralf Heyde  wrote:
>
> Ah sorry, Asciifolding for umlauts will result in ue/ae - ß/ss etc
>
> You could allow a distance of 1 or 2 given you use levenshtein distance - 
> this might be close to what you need.
>
> Sent from my iPhone
>
> > On 16.04.2019 at 20:08, Michael Sokolov wrote:
> >
> > I'm learning how to index/search German today and understanding that
> > vowels with umlauts are conventionally expanded into two ASCII
> > characters, eg  "für" -> "fuer", so people may search for the expanded
> > form "fuer", but they might also search with the diacritic, and
> > finally they might lazily search using the stripped form "fur".
> >
> > My question: is there a standard CharFilter or TokenFilter that
> > expands to both (ASCII) forms, for characters with umlauts and perhaps
> > other diacritics I might be unaware of in other languages having
> > similar multiple renderings in ASCII?
> >
> > -Mike
> >



Re: umlauts / diacritic expansion

2019-04-17 Thread Michael Sokolov
Thanks - GermanNormalizer seems as if it addresses this problem, yes.
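For later readers, a sketch of wiring that in: GermanNormalizationFilter applies the snowball German2 folding ('ü' becomes 'u', and 'ue' becomes 'u' except after a vowel or q), so "für", "fuer" and "fur" should all normalize to the same token when the same chain is used at index and query time:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.de.GermanNormalizationFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    TokenStream result = new LowerCaseFilter(source);
    // Folds umlauts and their two-letter ASCII spellings together.
    result = new GermanNormalizationFilter(result);
    return new TokenStreamComponents(source, result);
  }
};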




umlauts / diacritic expansion

2019-04-16 Thread Michael Sokolov
I'm learning how to index/search German today and understanding that
vowels with umlauts are conventionally expanded into two ASCII
characters, eg  "für" -> "fuer", so people may search for the expanded
form "fuer", but they might also search with the diacritic, and
finally they might lazily search using the stripped form "fur".

My question: is there a standard CharFilter or TokenFilter that
expands to both (ASCII) forms, for characters with umlauts and perhaps
other diacritics I might be unaware of in other languages having
similar multiple renderings in ASCII?

-Mike




Re: How to control missing value when sorting using *ValuesSource?

2019-03-31 Thread Michael Sokolov
I just want to surface this again in case it got missed; it's a small gap,
but it seems buggy that we can support setMissingValue for SortFields based
on concrete numeric fields, but not for abstract values sources that are
numeric and have the same "missing value" semantics - ie when the source
does not advance to a doc, it has no value for that doc.

I think the gap can be addressed by adding fairly trivial overrides for
LongValuesSource.LongValuesSortField.setMissingValue and
DoubleValuesSource.DoubleValuesSortField.setMissingValue, adding the
missing value to the corresponding ComparatorSource classes, and ensuring
it doesn't get dropped in SortField.rewrite(). I'm happy to provide a patch
if it would be welcome.


On Wed, Mar 27, 2019 at 12:26 PM Michael Sokolov  wrote:

> The ValuesSources provide a getSortField method that always supplies 0 as
> the "missing value" - ie the default value for documents that do not
> advance the source.
>
> But if an application wants to provide "missing last" or "missing first"
> semantics, it needs to control the missing value depending on the possible
> range of values and whether the sort is ascending or descending.
>
> Yet if you call setMissingValue() on one of these SortFields, you get an
> Exception:
> throw new IllegalArgumentException("Missing value only
> works for numeric or STRING types");
>
> But these *are* numeric types, or they seem that way. Internally they are
> labeled as type CUSTOM.
>
> First, my question is whether there is a way to achieve the desired
> behavior -- missing last semantics combined with sorting on ValuesSource?
>
> And if there isn't a good way, would it make sense to default to "missing
> last", providing a default value of MIN or MAX depending on whether the
> sort is descending or ascending, or maybe just allow setMissingValue to be
> called for these fields?
>


How to control missing value when sorting using *ValuesSource?

2019-03-27 Thread Michael Sokolov
The ValuesSources provide a getSortField method that always supplies 0 as
the "missing value" - ie the default value for documents that do not
advance the source.

But if an application wants to provide "missing last" or "missing first"
semantics, it needs to control the missing value depending on the possible
range of values and whether the sort is ascending or descending.

Yet if you call setMissingValue() on one of these SortFields, you get an
Exception:
throw new IllegalArgumentException("Missing value only
works for numeric or STRING types");

But these *are* numeric types, or they seem that way. Internally they are
labeled as type CUSTOM.

First, my question is whether there is a way to achieve the desired
behavior -- missing last semantics combined with sorting on ValuesSource?

And if there isn't a good way, would it make sense to default to "missing
last", providing a default value of MIN or MAX depending on whether the
sort is descending or ascending, or maybe just allow setMissingValue to be
called for these fields?
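For contrast, a sketch of the pattern that does work today on a concrete numeric field (the field name here is assumed); it is exactly this that the CUSTOM-typed sort fields from the values sources reject:

import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

// Descending sort with missing-last semantics: docs without a value get
// the smallest possible value, so they sort after every real value.
SortField byPrice = new SortField("price", SortField.Type.DOUBLE, true);
byPrice.setMissingValue(Double.NEGATIVE_INFINITY);
Sort sort = new Sort(byPrice);

// The equivalent on a values source currently throws the exception above:
// DoubleValuesSource.fromDoubleField("price").getSortField(true)
//     .setMissingValue(Double.NEGATIVE_INFINITY);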


Re: position-anchored queries

2019-03-21 Thread Michael Sokolov
Thanks Mikhail; I see SpanPositionRangeQuery, SpanFirstQuery, and thanks
Adrien for confirming there is no max position. I have a legacy system that
indexes anchor tokens at beginning and end, and am checking if there is a
better approach. It seems as if we could get rid of the start token, but we
may need to keep the end token to support that feature.

-Mike
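Roughly what the end-anchor token looks like as a filter; the sentinel string is an arbitrary choice, and this sketch skips the offset and end() bookkeeping a production filter would need:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Emits one extra sentinel token after the last real token, so an
// ends-with query can be phrased as "... lastWord $END$".
public final class EndAnchorFilter extends TokenFilter {
  public static final String END_TOKEN = "$END$"; // must not collide with real terms
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private boolean emitted;

  public EndAnchorFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
      return true;
    }
    if (!emitted) {
      emitted = true;
      clearAttributes();
      termAtt.setEmpty().append(END_TOKEN);
      return true;
    }
    return false;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    emitted = false;
  }
}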


On Thu, Mar 21, 2019 at 9:10 AM Adrien Grand  wrote:

> We don't keep track of the maximum position per doc. One way to do
> ends-with queries would be to write a token filter that emits a
> special token at the end of the wrapped token stream and use it as the
> last token of a phrase query.
>
> On Thu, Mar 21, 2019 at 1:29 PM Michael Sokolov 
> wrote:
> >
> > Hi, if I want to write a begins-with, ends-with, or exact-match style
> > query; like a phrase query, but anchored to the beginning of a field, the
> > end, or both, is there a generally-accepted best way? I can see how to
> > maybe create a positional query that specifies the exact position for
> each
> > term (at least in theory -- I'm not sure if there is a Query that does
> > this?) but do we keep track of max(position, field, doc) so I can know
> that
> > there are not any subsequent terms in the field?
> >
> > -Mike
>
>
>
> --
> Adrien
>


position-anchored queries

2019-03-21 Thread Michael Sokolov
Hi, if I want to write a begins-with, ends-with, or exact-match style
query; like a phrase query, but anchored to the beginning of a field, the
end, or both, is there a generally-accepted best way? I can see how to
maybe create a positional query that specifies the exact position for each
term (at least in theory -- I'm not sure if there is a Query that does
this?) but do we keep track of max(position, field, doc) so I can know that
there are not any subsequent terms in the field?

-Mike


IndexWriter concurrent flushing

2019-02-15 Thread Michael Sokolov
I noticed that commit() was taking an inordinately long time. It turned out
IndexWriter was flushing using only a single thread because it relies on
its caller to supply it with threads (via updateDocument, deleteDocument,
etc), which it then "hijacks" to do flushing. If (as we do) a caller
indexes a lot of documents and then calls commit at the end of a large
batch, when no indexing is ongoing, the commit() takes much longer than
needed since it is unable to make use of multiple cores to do concurrent
I/O.

How can we support this batch-mode use case better? I think we should -
it's not an unreasonable thing to do, since it can lead to the shortest
overall indexing time if you have sufficient RAM and don't need to search
until you're done indexing. I tried adding an IndexWriter.yield() method
that just flushes pending segments and does other queued work; the caller
can invoke this in order to provide resources. A more convenient API would
be to grant IndexWriter an ExecutorService of its own, but this is more
involved since it would be necessary to arbitrate where the work should be
done. Maybe a middle ground would be to offer a commit(ExecutorService)
method. Any other ideas? Any interest in a patch for IndexWriter.yield()?

-Mike


Re: SynonymGraphFilter can't consume an incoming graph

2019-02-15 Thread Michael Sokolov
I wonder what happens if you ensure that none of your synonyms contains a
character that WDGF cares about. Then they would operate on a disjoint set
of tokens, and maybe they would (or could be made to) play nicely together?
Even if they hate each other (maybe they detect token graphs and fail even
though they could safely ignore), you could maybe write something using
ConditionalTokenFilter that passes each token to either one or the other,
thereby keeping them separate.
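For instance, a sketch with CustomAnalyzer's whenTerm, assuming the synonym entries never contain a hyphen and with synonyms.txt as a stand-in (the builder methods declare IOException):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

// Each token passes through at most one of the two graph filters, so
// neither ever has to consume the other's graph output.
Analyzer analyzer = CustomAnalyzer.builder()
    .withTokenizer("whitespace")
    .whenTerm(t -> t.toString().contains("-"))
      .addTokenFilter("wordDelimiterGraph")
    .endwhen()
    .whenTerm(t -> !t.toString().contains("-"))
      .addTokenFilter("synonymGraph", "synonyms", "synonyms.txt")
    .endwhen()
    .build();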

On Thu, Feb 14, 2019 at 10:19 PM lambda.coder lucene <
lambda.coder.luc...@gmail.com> wrote:

> Thanks Erick for your response
>
> So I guess the answer to Shawn Heisey’s question [1] :
>
> "Since multiple Graph filters cannot be used in an analysis chain, what is
> somebody running 8.0 supposed to do if they need both the WordDelimiter
> filter and Synonym filter in their analysis chain? »
>
> is to have an analysis chain for the WordDelimiterGraphFilter and another
> one for the SynonymGraphFilter, and then query the two corresponding
> fields at the same time.
>
> There is currently no better option / alternative
>
> Am I right ?
>
> Kind regards
> Patrick
>
>
> [1]
> https://issues.apache.org/jira/browse/LUCENE-6664?focusedCommentId=16386294=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16386294
>
>
> > On Feb 11, 2019, at 05:46, Erick Erickson wrote:
> >
> > It's, well, undefined. As in nobody knows except that it'll be wrong.
> > And exactly what the results are may change with any given release.
> >
> > Best,
> > Erick
> >
> > On Sun, Feb 10, 2019 at 10:48 AM lambda.coder lucene
> >  wrote:
> >>
> >> Hello,
> >>
> >> The Javadocs of SynonymGraphFilter says that it can’t consume an
> incoming graph and that the result will be undefined
> >>
> >> Is there any example that exhibits the limitations and what is meant by
> undefined ?
> >>
> >>
> >> Regards
> >> Patrick
> >>
> >>
>
>


Re: prorated early termination

2019-02-05 Thread Michael Sokolov
OK - I opened https://issues.apache.org/jira/browse/LUCENE-8681 and talked
about the possible user knobs I think we could provide.

On Tue, Feb 5, 2019 at 9:10 AM Robert Muir  wrote:

> OK. Thanks for uploading the PR. I think you should definitely open an
> issue?
>
> Its still worth looking at the API for this specific proposal, as this
> may not return the exact top-N, correct? I think we need to be careful
> and keep the user involved when it comes to inexact, not everyone's
> data might meet the expected distribution and maybe others would
> rather have the exact results. And of course this part of the api (the
> way the user specifies such things) is way more important than how
> ugly the collector code is behind the scenes. When I was looking at
> this topic years ago, i looked at basically an enum being passed to
> IndexSearcher with 3 possible values: exact top N with full counts
> (slowest), exact top N without full counts (faster), inexact top N
> (fastest). When Adrien did the actual work, the enum seemed overkill
> because only 2 of the possibilities were implemented, but maybe it's
> worth revisiting?
>
> On Tue, Feb 5, 2019 at 7:50 AM Michael Sokolov  wrote:
> >
> > Hi Robert - yeah this is a complex subject. I think there's room for some
> > exciting improvement though. There is some discussion in LUCENE-8675 that
> > is pointing out some potential API-level problems we may have to address
> in
> > order to make the most efficient use of the segment structure in a sorted
> > index. I think generally speaking we're trying to think of ways to search
> > partial segments. I don't have concrete proposals for API changes at this
> > point, but it's clear there are some hurdles to be grappled with. For
> > example, Adrien's point about BKD trees having a high up-front cost
> points
> > out one difficulty. If we want to search a single segment in multiple
> "work
> > units" (whether threaded or not), we want a way to share that up-front
> cost
> > without needing to pay it over again for each work unit. I think a
> similar
> > problem also occurs with some other query types (MultiTerm can produce a
> > bitset I believe?).
> >
> > As far as the specific (prorated early termination) proposal here .. this
> > is something very specific and localized within TopFieldCollector that
> > doesn't require any public-facing API change or refactoring at all. It
> just
> > terminates a little earlier based on the segment distribution. Here's a
> PR
> > so you can see what this is:
> https://github.com/apache/lucene-solr/pull/564
> >
> >
> > On Mon, Feb 4, 2019 at 8:44 AM Robert Muir  wrote:
> >
> > > Regarding adding a threshold to TopFieldCollector, do you have ideas
> > > on what it would take to fix the relevant collector/indexsearcher APIs
> > > to make this kind of thing easier? (i know this is a doozie, but we
> > > should at least try to think about it, maybe make some progress)
> > >
> > > I can see where things become less efficient in this parallel+sorted
> > > case with large top N, but there are also many other "top k
> > > algorithms" that could be better for different use cases. in your
> > > case, if you throw out the parallel and just think about doing your
> > > sorted case segment-by-segment, the current code there may be
> > > inefficient too (not as bad, but still doesn't really take total
> > > advantage of sortedness). Maybe we improve that case by scoring some
> > > initial "range" of docs for each/some segments first, and then handle
> > > any "tail". With a simple google search I easily find many ideas for
> > > how this logic could work: exact and inexact, sorted and unsorted,
> > > distributed (parallel) and sequential.  So I think there are probably
> > > other improvements that could be done here, but worry about what the
> > > code might look like if we don't refactor it.
> > >
> > > On Sun, Feb 3, 2019 at 3:14 PM Michael McCandless
> > >  wrote:
> > > >
> > > > On Sun, Feb 3, 2019 at 10:41 AM Michael Sokolov 
> > > wrote:
> > > >
> > > >  > > In single-threaded mode we can check against
> minCompetitiveScore and
> > > > > terminate collection for each segment appropriately,
> > > > >
> > > > > > Does Lucene do this today by default?  That should be a nice
> > > > > optimization,
> > > > > and it'd be safe/correct.
> > > > >
> > > > > Yes, it does that today (in TopFieldCollector).

Re: prorated early termination

2019-02-05 Thread Michael Sokolov
Hi Robert - yeah this is a complex subject. I think there's room for some
exciting improvement though. There is some discussion in LUCENE-8675 that
is pointing out some potential API-level problems we may have to address in
order to make the most efficient use of the segment structure in a sorted
index. I think generally speaking we're trying to think of ways to search
partial segments. I don't have concrete proposals for API changes at this
point, but it's clear there are some hurdles to be grappled with. For
example, Adrien's point about BKD trees having a high up-front cost points
out one difficulty. If we want to search a single segment in multiple "work
units" (whether threaded or not), we want a way to share that up-front cost
without needing to pay it over again for each work unit. I think a similar
problem also occurs with some other query types (MultiTerm can produce a
bitset I believe?).

As far as the specific (prorated early termination) proposal here .. this
is something very specific and localized within TopFieldCollector that
doesn't require any public-facing API change or refactoring at all. It just
terminates a little earlier based on the segment distribution. Here's a PR
so you can see what this is: https://github.com/apache/lucene-solr/pull/564
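The arithmetic at the heart of the pro-rating is roughly this (schematic, not the actual patch):

// Ask each segment for a share of the global top N proportional to its
// size, plus a small floor, instead of collecting a full N per segment.
static int proratedHits(int n, int segmentMaxDoc, int indexMaxDoc) {
  int share = (int) Math.ceil((double) n * segmentMaxDoc / indexMaxDoc);
  return Math.min(n, Math.max(share, 10)); // floor keeps tiny segments honest
}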


On Mon, Feb 4, 2019 at 8:44 AM Robert Muir  wrote:

> Regarding adding a threshold to TopFieldCollector, do you have ideas
> on what it would take to fix the relevant collector/indexsearcher APIs
> to make this kind of thing easier? (i know this is a doozie, but we
> should at least try to think about it, maybe make some progress)
>
> I can see where things become less efficient in this parallel+sorted
> case with large top N, but there are also many other "top k
> algorithms" that could be better for different use cases. in your
> case, if you throw out the parallel and just think about doing your
> sorted case segment-by-segment, the current code there may be
> inefficient too (not as bad, but still doesn't really take total
> advantage of sortedness). Maybe we improve that case by scoring some
> initial "range" of docs for each/some segments first, and then handle
> any "tail". With a simple google search I easily find many ideas for
> how this logic could work: exact and inexact, sorted and unsorted,
> distributed (parallel) and sequential.  So I think there are probably
> other improvements that could be done here, but worry about what the
> code might look like if we don't refactor it.
>
> On Sun, Feb 3, 2019 at 3:14 PM Michael McCandless
>  wrote:
> >
> > On Sun, Feb 3, 2019 at 10:41 AM Michael Sokolov 
> wrote:
> >
> >  > > In single-threaded mode we can check against minCompetitiveScore and
> > > terminate collection for each segment appropriately,
> > >
> > > > Does Lucene do this today by default?  That should be a nice
> > > optimization,
> > > and it'd be safe/correct.
> > >
> > > Yes, it does that today (in TopFieldCollector -- see
> > >
> > >
> https://github.com/msokolov/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java#L225
> > > )
> > >
> >
> > Ahh -- great, thanks for finding that.
> >
> >
> > > Re: our high cost of collection in static ranking phase -- that is
> true,
> > > Mike, but I do also see a nice improvement on the luceneutil benchmark
> > > (modified to have a sorted index and collect concurrently) using just a
> > > vanilla TopFieldCollector. I looked at some profiler output, and it
> just
> > > seems to be showing more time spent walking postings.
> > >
> >
> > Yeah, understood -- I think pro-rating the N collected per segment makes
> a
> > lot of sense.
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
>

