Re: Indexing time increase moving from Lucene 8 to 9

2024-04-17 Thread Adrien Grand
Hi Marc,

Nothing jumps to mind as a potential cause for this 2x regression. It would
be interesting to look at a profile.

On Wed, Apr 17, 2024 at 9:32 PM Marc Davenport
 wrote:

> Hello,
> I'm finally migrating Lucene from 8.11.2 to 9.10.0 as our overall build can
> now support Java 11. The quick first step of renaming packages and
> importing the new libraries has gone well.  I'm even seeing a nice
> performance bump in our average query time. I am however seeing a dramatic
> increase in our indexing time.  We are indexing ~3.1 million documents each
> with about 100 attributes used for faceting, filtering, and sorting; no lexical
> text search.  Our indexing time has jumped from ~1k seconds to ~2k
> seconds.  I have yet to profile the individual aspects of how we convert
> our data to records vs time for the index writer to accept the documents.
> I'm curious if other users discovered this for their migrations at some
> point.  Or if there are some changes to defaults that I did not see in the
> migration guide that would account for this?  Looking at the logs I can see
> that as we are indexing the documents we commit every 10 minutes.
> Thank you,
> Marc
>


-- 
Adrien


Re: Query Optimization in search/searchAfter

2024-04-12 Thread Adrien Grand
You are correct, query rewriting is not affected by the use of search vs.
searchAfter.

On Fri, Apr 12, 2024 at 3:37 PM Puneeth Bikkumanla 
wrote:

> Hello,
> Sorry I should have clarified what I meant by “optimized”. I am familiar
> with the collector/comparators using the “after” doc to filter out
> documents but I specifically was talking about the query rewriting phase.
> Is the query rewritten differently in search vs searchAfter? Looking at the
> code I think no but would just like to confirm if there are any edge cases
> here.
>
> On Fri, Apr 12, 2024 at 8:46 AM Adrien Grand  wrote:
>
> > Hello Puneeth,
> >
> > When you pass an `after` doc, Lucene will filter out documents that
> compare
> > better than this `after` document if it can. See e.g. what LongComparator
> > does with its `topValue`, which is the value of the `after` doc.
> >
> > On Thu, Apr 11, 2024 at 4:34 PM Puneeth Bikkumanla <
> pbikkuma...@gmail.com>
> > wrote:
> >
> > > Hello,
> > > I was wondering if a user-defined Query is optimized the same way in
> both
> > > search/searchAfter provided the index stays the same (no CRUD takes
> > place).
> > >
> > > In searchAfter we pass in an "after" doc so I was wondering if that
> > changes
> > > how a query is optimized at all. By looking at the code, I'm thinking
> no
> > > but was wondering if there were any other parameters here that I am not
> > > aware of that would influence query optimization differently in
> > > search/searchAfter. Thanks!
> > >
> >
> >
> > --
> > Adrien
> >
>


-- 
Adrien


Re: Query Optimization in search/searchAfter

2024-04-12 Thread Adrien Grand
Hello Puneeth,

When you pass an `after` doc, Lucene will filter out documents that compare
better than this `after` document if it can. See e.g. what LongComparator
does with its `topValue`, which is the value of the `after` doc.
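The filtering described above can be sketched outside of Lucene. Below is a toy Python model (not Lucene's actual API) of how a comparator with its `topValue` set skips hits that sort before the `after` doc when paginating an ascending sort:

```python
def search_after(hits, after, k):
    """Toy model of searchAfter over an ascending numeric sort.

    hits:  list of (doc_id, sort_value) pairs
    after: the last (doc_id, sort_value) hit of the previous page
    Like LongComparator with its topValue set, any hit that compares
    better than (i.e. sorts before) the `after` hit is filtered out.
    """
    key = lambda h: (h[1], h[0])  # sort by value, break ties by doc id
    competitive = [h for h in hits if key(h) > key(after)]
    return sorted(competitive, key=key)[:k]

hits = [(0, 5), (1, 3), (2, 7), (3, 3), (4, 9)]
page1 = sorted(hits, key=lambda h: (h[1], h[0]))[:2]  # [(1, 3), (3, 3)]
page2 = search_after(hits, after=page1[-1], k=2)      # [(0, 5), (2, 7)]
```

The query itself is unchanged between pages; only the collection phase discards non-competitive hits, which matches the "rewriting is not affected" answer above.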

On Thu, Apr 11, 2024 at 4:34 PM Puneeth Bikkumanla 
wrote:

> Hello,
> I was wondering if a user-defined Query is optimized the same way in both
> search/searchAfter provided the index stays the same (no CRUD takes place).
>
> In searchAfter we pass in an "after" doc so I was wondering if that changes
> how a query is optimized at all. By looking at the code, I'm thinking no
> but was wondering if there were any other parameters here that I am not
> aware of that would influence query optimization differently in
> search/searchAfter. Thanks!
>


-- 
Adrien


Re: Support of RRF (Reciprocal Rank Fusion) by Lucene?

2024-03-26 Thread Adrien Grand
GitHub issue or PR directly, whatever works best for you is going to work
for us.

On Tue, Mar 26, 2024 at 3:12 PM Michael Wechner 
wrote:

> Hi Adrien
>
> Cool, thanks for your quick feedback!
>
> Yes, IIUC it should not be too difficult.
>
> Should I create a GitHub issue to discuss in more detail?
>
> https://github.com/apache/lucene/issues
>
> Thanks
>
> Michael
>
> On 26.03.24 at 14:56, Adrien Grand wrote:
> > Hey Michael,
> >
> > I agree that it would be a nice addition. Plus it should be pretty easy
> to
> > implement. This sounds like a good fit for a utility method on the
> TopDocs
> > class?
> >
> > On Tue, Mar 26, 2024 at 2:54 PM Michael Wechner <
> michael.wech...@wyona.com>
> > wrote:
> >
> >> Hi
> >>
> >> IIUC Lucene does not contain a RRF implementation, for example to merge
> >> keyword/BM25 and vector search results, right?
> >>
> >> I think it would be nice to have within Lucene, WDYT?
> >>
> >> Thanks
> >>
> >> Michael
> >>
> >> -
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-- 
Adrien


Re: Support of RRF (Reciprocal Rank Fusion) by Lucene?

2024-03-26 Thread Adrien Grand
Hey Michael,

I agree that it would be a nice addition. Plus it should be pretty easy to
implement. This sounds like a good fit for a utility method on the TopDocs
class?
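The fusion logic such a utility method would need is small. Here is a hedged Python sketch of plain RRF (score = sum of 1/(k + rank), with the customary k = 60), independent of any Lucene API; a TopDocs-based helper would do the same with doc IDs and ranks:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc ids with Reciprocal Rank
    Fusion: each doc scores sum(1 / (k + rank)) over every list it
    appears in, rank being its 1-based position. k=60 is the constant
    suggested in the original RRF paper."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Sort by descending fused score, ties broken by doc id.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25 = ["d1", "d2", "d3"]  # e.g. a lexical/BM25 ranking
knn  = ["d3", "d1", "d4"]  # e.g. a vector-search ranking
print(reciprocal_rank_fusion([bm25, knn]))  # ['d1', 'd3', 'd2', 'd4']
```

Note that RRF only needs ranks, not scores, which is what makes it attractive for merging keyword and vector results whose scores are not comparable.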

On Tue, Mar 26, 2024 at 2:54 PM Michael Wechner 
wrote:

> Hi
>
> IIUC Lucene does not contain a RRF implementation, for example to merge
> keyword/BM25 and vector search results, right?
>
> I think it would be nice to have within Lucene, WDYT?
>
> Thanks
>
> Michael
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-- 
Adrien


[ANNOUNCE] Apache Lucene 9.10.0 released

2024-02-20 Thread Adrien Grand
The Lucene PMC is pleased to announce the release of Apache Lucene 9.10.

Apache Lucene is a high-performance, full-featured search engine library
written entirely in Java. It is a technology suitable for nearly any
application that requires structured search, full-text search, faceting,
nearest-neighbor search on high-dimensionality vectors, spell correction or
query suggestions.

This release contains numerous features, optimizations, and improvements,
some of which are highlighted below. The release is available for immediate
download at:
  https://lucene.apache.org/core/downloads.html

Lucene 9.10 Release Highlights

New Features

 * Support for similarity-based vector searches, i.e. finding all nearest
neighbors whose similarity to a query vector is greater than a configured
threshold. See [Byte|Float]VectorSimilarityQuery.

 * Index sorting is now compatible with block joins. See
IndexWriterConfig#setParentField.

 * MMapDirectory now takes advantage of the now finalized JDK foreign
memory API internally when running on Java 22 (or later). This was only
supported with Java 19 to 21 until now.

 * SIMD vectorization now takes advantage of JDK vector incubator on Java
22. This was only supported with Java 20 or 21 until now.

Optimizations

 * Tail postings are now encoded using group-varint. This yielded speedups
on queries that match lots of terms that have short postings lists in
Lucene's nightly benchmarks.

 * Range queries on points now exit earlier when evaluating a segment that
has no matches. This will improve performance when intersected with other
queries that have a high up-front cost such as multi-term queries.

 * BooleanQueries that mix SHOULD and FILTER clauses now propagate minimum
competitive scores to the SHOULD clauses, yielding significant speedups for
top-k queries sorted by descending score.

 * IndexSearcher#count has been optimized on pure disjunctions of two term
queries.
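To illustrate the group-varint idea from the optimizations above: four integers share a single header byte whose four 2-bit fields give each value's byte length, so the decoder never branches per byte the way classic varint does. This is a hedged Python sketch of the general technique, not Lucene's exact postings format:

```python
def group_varint_encode(values):
    """Encode ints in groups of 4: one header byte holds four 2-bit
    (byte_length - 1) fields, followed by each value's bytes."""
    assert len(values) % 4 == 0, "toy version: multiples of 4 only"
    out = bytearray()
    for i in range(0, len(values), 4):
        group = values[i:i + 4]
        lens = [max(1, (v.bit_length() + 7) // 8) for v in group]
        header = 0
        for n in lens:  # first value ends up in the high bits
            header = (header << 2) | (n - 1)
        out.append(header)
        for v, n in zip(group, lens):
            out += v.to_bytes(n, "little")
    return bytes(out)

def group_varint_decode(data, count):
    values, pos = [], 0
    for _ in range(0, count, 4):
        header = data[pos]; pos += 1
        lens = [((header >> s) & 3) + 1 for s in (6, 4, 2, 0)]
        for n in lens:
            values.append(int.from_bytes(data[pos:pos + n], "little"))
            pos += n
    return values

vals = [3, 200, 70000, 1, 5, 6, 7, 8]
encoded = group_varint_encode(vals)
assert group_varint_decode(encoded, len(vals)) == vals
```

Short tail postings lists benefit because a handful of doc deltas can be decoded with one header read per group instead of per-value length checks.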

... plus a multitude of helpful bug fixes!

Further details of changes are available in the change log available at:
http://lucene.apache.org/core/9_10_0/changes/Changes.html.

Please report any feedback to the mailing lists (
http://lucene.apache.org/core/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases. It is possible that the mirror you are using may
not have replicated the release yet. If that is the case, please try
another mirror. This also applies to Maven access.

-- 
Adrien


Re: Old codecs may only be used for reading

2024-01-11 Thread Adrien Grand
Hey Michael. Your understanding is correct.

On Thu, Jan 11, 2024 at 10:46 AM Michael Wechner 
wrote:

> Hi
>
> I recently upgraded from Lucene 9.8.0 to Lucene 9.9.1 and noticed that
> Lucene95Codec got moved to
>
> org.apache.lucene.backward_codecs.lucene95.Lucene95Codec
>
> When testing my code I received the following error message:
>
> "Old codecs may only be used for reading"
>
> Do I understand correctly, that Lucene95Codec can not be used for writing
> anymore?
>
> And that I should use now Lucene99Codec for writing?
>
> Thanks
>
> Michael
>


-- 
Adrien


Re: Assertion error with NumericDocValues.advanceExact

2024-01-01 Thread Adrien Grand
Hello,

Can you check if you are running advanceExact on decreasing doc IDs or on
doc IDs that are outside of the valid range [0, maxDoc)? If you have
Lucene's test framework on your classpath, these checks can be added
automatically by using AssertingIndexSearcher instead of IndexSearcher to
run queries.
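As a toy illustration of the kind of precondition such asserting wrappers add (hypothetical Python, not Lucene's AssertingIndexSearcher):

```python
class AssertingDocValues:
    """Toy analogue of an asserting doc-values wrapper: verifies that
    advance_exact is called with doc IDs that are inside [0, max_doc)
    and never decrease, which is the contract the Lucene assertions
    in the stack trace below enforce."""

    def __init__(self, values, max_doc):
        self.values = values  # dict: doc id -> value
        self.max_doc = max_doc
        self.doc = -1

    def advance_exact(self, target):
        assert 0 <= target < self.max_doc, "doc ID out of valid range"
        assert target >= self.doc, "doc IDs must not go backwards"
        self.doc = target
        return target in self.values  # True if this doc has a value

dv = AssertingDocValues({2: 7, 5: 1}, max_doc=10)
assert dv.advance_exact(2)        # doc 2 has a value
assert not dv.advance_exact(3)    # doc 3 has none
# dv.advance_exact(1) would now trip the "must not go backwards" assert
```

This also explains why the reported code "works" with assertions disabled: the contract violation is silently tolerated until it eventually corrupts results.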


On Sun, Dec 31, 2023 at 8:51 AM  wrote:

> Stack trace:
>
> java.lang.AssertionError
>
> at
> org.apache.lucene.codecs.lucene90.IndexedDISI$Method$1.advanceExactWithinBlock(IndexedDISI.java:567)
>
> at
> org.apache.lucene.codecs.lucene90.IndexedDISI.advanceExact(IndexedDISI.java:461)
>
> at
> org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$SparseNumericDocValues.advanceExact(Lucene90DocValuesProducer.java:453)
>
>
>
> Lucene 9.8.0 with a 9.8.0 index.
>
>
>
> Everything seems to work fine if I run it with assertions disabled.
>
> This is a large project which I can’t share.
>
> CheckIndex didn’t find any problems.
>
>
>
> Erel
>
>
>
>

-- 
Adrien


Re: migrate index from 6 to 9

2023-12-18 Thread Adrien Grand
Hi Vincent,

Unfortunately, your assumption is incorrect: Lucene 9 is not able to search
Lucene 6 indexes, as Lucene only keeps read access to indexes created by the
current (9) or previous (8) major version. You will need to reindex your
6.x index with Lucene 8 or 9 (preferred) to be able to search it with
Lucene 9.

On Mon, Dec 18, 2023 at 5:11 PM Vincent Sevel  wrote:

> Hello,
> can lucene 6 indexes be reopened with lucene 9 seamlessly?
> or are there breaking changes in the format, or in some other aspects that
> would require a special migration procedure?
> I remember going through this a while back when moving from one major
> version to another.
> I looked at the release notes for the different versions (7, 8 and 9) and
> could not see something that would indicate such a breaking change.
> am I correct to assume that I can reuse my old lucene 6 indexes in lucene
> 9?
> Thanks,
> Vincent
>


-- 
Adrien


Re: When to use StringField and when to use FacetField for categorization?

2023-10-20 Thread Adrien Grand
FYI there is also KeywordField, which combines StringField and
SortedSetDocValuesField. It supports filtering, sorting, faceting and
retrieval. It's my go-to field for string values.

On Fri, Oct 20, 2023 at 12:20 PM Michael McCandless 
wrote:

> There are some differences.
>
> StringField is indexed into the inverted index (postings) so you can do
> efficient filtering.  You can also store in stored fields to retrieve.
>
> FacetField does everything StringField does (filtering, storing (maybe?)),
> but in addition it stores data for faceting.  I.e. you can compute facet
> counts or simple aggregations at search time.
>
> FacetField is also hierarchical: you can filter and facet by different
> points/levels of your hierarchy.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Fri, Oct 20, 2023 at 5:43 AM Michael Wechner  >
> wrote:
>
> > Hi
> >
> > I have found the following simple Facet Example
> >
> >
> >
> https://github.com/apache/lucene/blob/main/lucene/demo/src/java/org/apache/lucene/demo/facet/SimpleFacetsExample.java
> >
> > whereas for a simple categorization of documents I currently use
> > StringField, e.g.
> >
> > doc1.add(new StringField("category", "book", Field.Store.YES));
> > doc1.add(new StringField("category", "quantum_physics", Field.Store.YES));
> > doc1.add(new StringField("category", "Neumann", Field.Store.YES));
> > doc1.add(new StringField("category", "Wheeler", Field.Store.YES));
> >
> > doc2.add(new StringField("category", "magazine", Field.Store.YES));
> > doc2.add(new StringField("category", "astro_physics", Field.Store.YES));
> >
> > which works well, but would it be better to use Facets for this, e.g.
> >
> > doc1.add(new FacetField("media-type", "book"));
> > doc1.add(new FacetField("topic", "physics", "quantum"));
> > doc1.add(new FacetField("author", "Neumann"));
> > doc1.add(new FacetField("author", "Wheeler"));
> >
> > doc2.add(new FacetField("media-type", "magazine"));
> > doc2.add(new FacetField("topic", "physics", "astro"));
> >
> > ?
> >
> > IIUC the StringField approach is more general, whereas the FacetField
> > approach allows to do a more specific categorization / search.
> > Or do I misunderstand this?
> >
> > Thanks
> >
> > Michael
> >
> >
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>


Re: Exception from the codec layer during indexing

2023-09-28 Thread Adrien Grand
Hi Rahul,

This exception complains that IndexingChain did not deduplicate terms
as expected.

I don't recall seeing this exception before (which doesn't mean it's
not a real bug).

What JVM are you running? Does this exception frequently occur or was
it a one-off?

On Thu, Sep 28, 2023 at 4:49 PM Rahul Goswami  wrote:
>
> Hi,
> Following up on my issue...anyone who's seen similar exceptions ? Or any
> insights on what might be going on?
>
> Thanks,
> Rahul
>
> On Wed, Sep 27, 2023 at 1:00 AM Rahul Goswami  wrote:
>
> > Hello,
> > On one of the servers running Solr 7.7.2, during indexing I observe 2
> > different kinds of exceptions coming from the Lucene codec layer. I can't
> > think of an application/data issue that could be causing this.
> >
> > In particular, Exception-2 seems like a potential bug since it complains
> > about "terms out of order" even though both byte arrays are essentially the
> > same. Reason I say this is that the FutureArrays.mismatch() is supposed to
> > behave like Java's Arrays.mismatch which returns -1 if NO mismatch is
> > found. However the check in the below line treats the value -1 as a
> > mismatch causing the exception.
> >
> >
> > https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.7.2/lucene/core/src/java/org/apache/lucene/util/StringHelper.java#L46
> >
> > Happy to submit a PR for this if there is a consensus on this being a bug.
> >
> > Would appreciate any inputs on the exceptions seen below!
> >
> > *Exception-1:*
> >
> > 2023-09-19 10:13:48.901 ERROR (qtp1859039536-1691) [
> > x:fsindex_FileIndexer20234799_shard_1] o.a.s.s.HttpSolrCall
> > null:org.apache.solr.common.SolrException: Server error writing document id
> > 6182!bbdbe92468734899c738f048e6f58245 to the index
> > at
> > org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:240)
> > at
> > org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
> > at
> > org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
> > at
> > org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:1002)
> > at
> > org.apache.solr.update.processor.DistributedUpdateProcessor.doVersionAdd(DistributedUpdateProcessor.java:1233)
> > at
> > org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$versionAdd$2(DistributedUpdateProcessor.java:1082)
> > at org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
> > at
> > org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1082)
> > at
> > org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:694)
> > at
> > org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
> > at
> > org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
> > at
> > org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
> > at
> > org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
> > at
> > org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
> > at
> > org.apache.solr.update.processor.AbstractDefaultValueUpdateProcessorFactory$DefaultValueUpdateProcessor.processAdd(AbstractDefaultValueUpdateProcessorFactory.java:92)
> > at
> > org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
> > at
> > org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
> > at
> > org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:261)
> > at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:188)
> > at
> > org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
> > at
> > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
> > at
> > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:202)
> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2551)
> > at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:711)
> > at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:516)
> > at
> > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:395)
> > at
> > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:341)
> > at
> > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
> > at
> > com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:121)
> > at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:133)
> > at
> > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
> > at
> > 

Re: forceMerge(1) leads to ~10% perf gains

2023-09-22 Thread Adrien Grand
> Was wondering - are there any other techniques which can be used to speed
> up that work well when forceMerge works like this?

Lucene 9.8 (to be released in a few days hopefully) will add support
for recursive graph bisection, which is another thing that can be used
to speed up querying on read-only indices.

https://github.com/apache/lucene/pull/12489

On Fri, Sep 22, 2023 at 12:54 PM Uwe Schindler  wrote:
>
> Hi,
>
> Yes, a force-merged index can be faster, as less work is spent on
> looking up terms in different index segments.
>
> If you are looking for higher speed, non-merged indexes can actually
> perform better, IF you parallelize. You can do this by adding an
> Executor instance to IndexSearcher
> ().
> If you do this each segment of the index is searched in parallel (using
> the thread pool limits of the Executor) and results are merged at end.
>
> If an index is read-only and static, force-merge is a good idea - unless
> you want to parallelize.
>
> Tokenizing and joining with OR is the correct way, but for speed you may
> also use AND. To further improve the speed also take a look at Blockmax
> WAND: If you are not interested in the total number of documents, you
> can get huge speed improvements. By default this is enabled in Lucene
> 9.x with default IndexSearcher, but on Solr/Elasticsearch you may need
> to actively request it. In that case it will only count the exact number of
> hits until 1000 docs are found.
>
> Uwe
>
> On 22.09.2023 at 03:40, qrdl kaggle wrote:
> > After testing on 4800 fairly complex queries, I see a performance gain of
> > 10% after doing indexWriter.forceMerge(1); indexWriter.commit(); from 209
> > ms per query, to 185 ms per query.
> >
> > Queries are quite complex, often about 30 or so words, of the format OR
> > text:
> >
> > It went from 214 to 14 files on the forceMerge.
> >
> > It's a 6GB static/read only index with about 6.4M documents.  Documents are
> > around 1MB or so of text.
> >
> > Was wondering - are there any other techniques which can be used to speed
> > up that work well when forceMerge works like this?
> >
> > Is there a better way to query and still maintain accuracy than simply word
> > tokenizing a sentence and joining with OR text: ?
> >
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>


-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Relative cpu cost of fetching term frequency during scoring

2023-06-26 Thread Adrien Grand
This is a bit surprising, can you share the profiler output (e.g.
screenshot), to see what is slow within the `PostingsEnum#freq` call?

`PostingsEnum#freq` may need to decode a block of freqs, but I would
generally not expect it to be 5x slower than decoding doc IDs for the
same block.

On Thu, Jun 22, 2023 at 6:00 AM Vimal Jain  wrote:
>
> I did profiling of new code and found that below api call is most time
> consuming :-
> org.apache.lucene.index.PostingsEnum#freq
> If i comment out this call and instead use some random integer for testing
> purpose, then perf is at least 5x better compared to old code.
> Is there any thoughts on why term frequency calls on PostingsEnum are that
> slow ?
>
>
>
> *Thanks and Regards,*
> *Vimal Jain*
>
>
> On Wed, Jun 21, 2023 at 1:43 PM Adrien Grand  wrote:
>
> > As far as your performance problem is concerned, I don't know. Can you
> > compare the number of documents that need to be evaluated in both cases,
> > e.g. by running `IndexSearcher#count` on your two queries. If they're
> > similar, can you run your new query under a profiler to figure out what its
> > bottleneck is?
> >
> > Regarding migration to newer major version, there is a MIGRATE.txt that
> > gives some advice:
> >
> > https://github.com/apache/lucene/blob/releases/lucene-solr/8.0.0/lucene/MIGRATE.txt
> > .
> >
> > On Wed, Jun 21, 2023 at 8:54 AM Vimal Jain  wrote:
> >
> > > Thanks Adrien , I had a look at your blog post.  Looks like this
> > > Scorer#getMaxScore was added in lucene 8.0 , i am using 7.7.3.
> > > A side question , is there any resource to help migrate newer major
> > version
> > > , i see lot of api changed from v7 to v8.
> > >
> > > *Thanks and Regards,*
> > > *Vimal Jain*
> > >
> > >
> > > On Wed, Jun 21, 2023 at 1:08 AM Adrien Grand  wrote:
> > >
> > > > Lucene has logic to only evaluate a subset of the matching documents
> > when
> > > > retrieving top-k hits. This leverages the Scorer#getMaxScore API. If
> > you
> > > > never implemented it on your custom query, then you never took
> > advantage
> > > of
> > > > dynamic pruning anyway. I wrote a bit more about it
> > > > <
> > > >
> > >
> > https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand
> > > > >
> > > > a few years ago if you're curious.
> > > >
> > > > On Tue, Jun 20, 2023 at 6:58 PM Vimal Jain  wrote:
> > > >
> > > > > Thanks Adrien for quick response.
> > > > > Yes , i am replacing disjuncts across multiple fields with single
> > > custom
> > > > > term query over merged field.
> > > > > Can you please provide more details on what do you mean by dynamic
> > > > pruning
> > > > > in context of custom term query ?
> > > > >
> > > > > On Tue, 20 Jun, 2023, 9:45 pm Adrien Grand, 
> > wrote:
> > > > >
> > > > > > Intuitively replacing a disjunction across multiple fields with a
> > > > single
> > > > > > term query should always be faster.
> > > > > >
> > > > > > You're saying that you're storing the type of token as part of the
> > > term
> > > > > > frequency. This doesn't sound like something that would play well
> > > with
> > > > > > dynamic pruning, so I wonder if this is the reason why you are
> > seeing
> > > > > > slower queries. But since you mentioned custom term queries, maybe
> > > you
> > > > > > never actually took advantage of dynamic pruning?
> > > > > >
> > > > > > On Tue, Jun 20, 2023 at 10:30 AM Vimal Jain 
> > > wrote:
> > > > > >
> > > > > > > Ok , sorry , I realized that I need to provide more context.
> > > > > > > So we used to create a lucene query which consisted of custom
> > term
> > > > > > queries
> > > > > > > for different fields and based on the type of field , we used to
> > > > > assign a
> > > > > > > boost that would be used in scoring.
> > > > > > > Now we want to get rid of different fields and instead of
> > creating
> > > > > > > multiple term queries , we create only 1 term query for the
> > merged
> > > > > field
> > > > > > > and the 

[ANNOUNCE] Apache Lucene 9.7.0 released

2023-06-26 Thread Adrien Grand
The Lucene PMC is pleased to announce the release of Apache Lucene 9.7.0.

Apache Lucene is a high-performance, full-featured search engine library
written entirely in Java. It is a technology suitable for nearly any
application that requires structured search, full-text search, faceting,
nearest-neighbor search across high-dimensionality vectors, spell
correction or query suggestions.

This release contains numerous bug fixes, optimizations, and improvements,
some of which are highlighted below. The release is available for immediate
download at:

  

### Lucene 9.7.0 Release Highlights:

 New features

 * The new IndexWriter#updateDocuments(Query, Iterable) allows updating
multiple documents that match a query at the same time.

 * Function queries can now compute similarity scores between kNN vectors.

 Optimizations

 * KNN indexing and querying can now take advantage of vectorization for
distance computation between vectors. To enable this, use exactly Java 20
or 21, and pass --add-modules jdk.incubator.vector as a command-line
parameter to the Java program.

 * KNN queries now run concurrently if the IndexSearcher has been created
with an executor.

 * Queries sorted by field are now able to dynamically prune hits only
using the after value. This yields major speedups when paginating deeply.

 * Reduced merge-time overhead of computing the number of soft deletes.

 Changes in runtime behavior

 * KNN vectors are now disallowed to have non-finite values such as NaN or
±Infinity.

 Bug fixes

 * Backward reading is no longer an adversarial case for
BufferedIndexInput, used by NIOFSDirectory and SimpleFSDirectory. This
addresses a performance bug when performing terms dictionary lookups with
either of these directories.

 * GraphTokenStreamFiniteStrings#articulationPointsRecurse may no longer
overflow the stack.

 * ... plus a number of helpful bug fixes!

Please read CHANGES.txt for a full list of new features and changes:

  

-- 
Adrien


Re: Relative cpu cost of fetching term frequency during scoring

2023-06-21 Thread Adrien Grand
As far as your performance problem is concerned, I don't know. Can you
compare the number of documents that need to be evaluated in both cases,
e.g. by running `IndexSearcher#count` on your two queries. If they're
similar, can you run your new query under a profiler to figure out what its
bottleneck is?

Regarding migration to newer major version, there is a MIGRATE.txt that
gives some advice:
https://github.com/apache/lucene/blob/releases/lucene-solr/8.0.0/lucene/MIGRATE.txt
.

On Wed, Jun 21, 2023 at 8:54 AM Vimal Jain  wrote:

> Thanks Adrien , I had a look at your blog post.  Looks like this
> Scorer#getMaxScore was added in lucene 8.0 , i am using 7.7.3.
> A side question , is there any resource to help migrate newer major version
> , i see lot of api changed from v7 to v8.
>
> *Thanks and Regards,*
> *Vimal Jain*
>
>
> On Wed, Jun 21, 2023 at 1:08 AM Adrien Grand  wrote:
>
> > Lucene has logic to only evaluate a subset of the matching documents when
> > retrieving top-k hits. This leverages the Scorer#getMaxScore API. If you
> > never implemented it on your custom query, then you never took advantage
> of
> > dynamic pruning anyway. I wrote a bit more about it
> > <
> >
> https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand
> > >
> > a few years ago if you're curious.
> >
> > On Tue, Jun 20, 2023 at 6:58 PM Vimal Jain  wrote:
> >
> > > Thanks Adrien for quick response.
> > > Yes , i am replacing disjuncts across multiple fields with single
> custom
> > > term query over merged field.
> > > Can you please provide more details on what do you mean by dynamic
> > pruning
> > > in context of custom term query ?
> > >
> > > On Tue, 20 Jun, 2023, 9:45 pm Adrien Grand,  wrote:
> > >
> > > > Intuitively replacing a disjunction across multiple fields with a
> > single
> > > > term query should always be faster.
> > > >
> > > > You're saying that you're storing the type of token as part of the
> term
> > > > frequency. This doesn't sound like something that would play well
> with
> > > > dynamic pruning, so I wonder if this is the reason why you are seeing
> > > > slower queries. But since you mentioned custom term queries, maybe
> you
> > > > never actually took advantage of dynamic pruning?
> > > >
> > > > On Tue, Jun 20, 2023 at 10:30 AM Vimal Jain 
> wrote:
> > > >
> > > > > Ok , sorry , I realized that I need to provide more context.
> > > > > So we used to create a lucene query which consisted of custom term
> > > > queries
> > > > > for different fields and based on the type of field , we used to
> > > assign a
> > > > > boost that would be used in scoring.
> > > > > Now we want to get rid of different fields and instead of creating
> > > > > multiple term queries , we create only 1 term query for the merged
> > > field
> > > > > and the scorer of this term query ( on merged field ) makes use of
> > > custom
> > > > > term frequency info to deduce type of token ( during indexing we
> > store
> > > > this
> > > > > info ) and hence the score that we were using earlier.
> > > > > So perf drop is observed in reference to  earlier implementation (
> > with
> > > > > multiple term queries ).
> > > > >
> > > > >
> > > > > *Thanks and Regards,*
> > > > > *Vimal Jain*
> > > > >
> > > > >
> > > > > On Tue, Jun 20, 2023 at 1:01 PM Adrien Grand 
> > > wrote:
> > > > >
> > > > > > You say you observed a performance drop, what are you comparing
> > > > against?
> > > > > >
> > > > > > On Tue, Jun 20, 2023 at 08:59, Vimal Jain  wrote:
> > écrit :
> > > > > >
> > > > > > > Note - i am using lucene 7.7.3
> > > > > > >
> > > > > > > *Thanks and Regards,*
> > > > > > > *Vimal Jain*
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Jun 20, 2023 at 12:26 PM Vimal Jain 
> > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > > I want to understand if fetching the term frequency of a term
> > > > during
> > > > > > > > scoring is relatively cpu bound operation ?
> > > > > > > > Context - I am storing custom term frequency during indexing
> > and
> > > > > later
> > > > > > > > using it for scoring during query execution time ( in
> Scorer's
> > > > > score()
> > > > > > > > method ). I noticed a performance drop in my application and
> I
> > > > > suspect
> > > > > > > it's
> > > > > > > > because of this change.
> > > > > > > > Any insight or related articles for reference would be
> > > appreciated.
> > > > > > > >
> > > > > > > >
> > > > > > > > *Thanks and Regards,*
> > > > > > > > *Vimal Jain*
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Adrien
> > > >
> > >
> >
> >
> > --
> > Adrien
> >
>


-- 
Adrien


Re: Relative cpu cost of fetching term frequency during scoring

2023-06-20 Thread Adrien Grand
Lucene has logic to only evaluate a subset of the matching documents when
retrieving top-k hits. This leverages the Scorer#getMaxScore API. If you
never implemented it on your custom query, then you never took advantage of
dynamic pruning anyway. I wrote a bit more about it
<https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand>
a few years ago if you're curious.
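To make the pruning idea concrete, here is a small self-contained sketch in plain Java (no Lucene dependency; the Block shape and all names are invented for illustration, not Lucene's actual Scorer API): any block whose score upper bound cannot beat the current k-th best score is skipped without scoring a single document in it, which is the essence of what the Scorer#getMaxScore contract enables.

```java
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class BlockMaxSketch {
  /** A toy postings block: the scores it contains plus an upper bound on them. */
  static final class Block {
    final float[] scores;
    final float maxScore;
    Block(float[] scores, float maxScore) { this.scores = scores; this.maxScore = maxScore; }
  }

  /** Collects the top-k scores; blocks whose maxScore cannot beat the current k-th score are skipped. */
  static float[] topK(List<Block> blocks, int k, int[] scoredCount) {
    PriorityQueue<Float> heap = new PriorityQueue<>(); // min-heap holding the current top-k
    for (Block b : blocks) {
      float kth = heap.size() < k ? Float.NEGATIVE_INFINITY : heap.peek();
      if (b.maxScore <= kth) continue; // dynamic pruning: the whole block is skipped
      for (float s : b.scores) {
        scoredCount[0]++;
        if (heap.size() < k) heap.add(s);
        else if (s > heap.peek()) { heap.poll(); heap.add(s); }
      }
    }
    float[] out = new float[heap.size()];
    for (int i = out.length - 1; i >= 0; i--) out[i] = heap.poll();
    return out; // descending order
  }

  public static void main(String[] args) {
    List<Block> blocks = Arrays.asList(
        new Block(new float[] {5f, 7f, 6f}, 7f),
        new Block(new float[] {1f, 2f, 1f}, 2f),  // can never beat the top-2, so it is skipped
        new Block(new float[] {9f, 3f, 8f}, 9f));
    int[] scored = new int[1];
    float[] top2 = topK(blocks, 2, scored);
    if (!Arrays.equals(top2, new float[] {9f, 8f})) throw new AssertionError(Arrays.toString(top2));
    if (scored[0] != 6) throw new AssertionError("expected the middle block to be skipped, scored=" + scored[0]);
    System.out.println("top2=" + Arrays.toString(top2) + ", docsScored=" + scored[0] + "/9");
  }
}
```

Note how the middle block is never scored: that is why a custom query that never implemented getMaxScore leaves this speedup on the table.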

On Tue, Jun 20, 2023 at 6:58 PM Vimal Jain  wrote:

> Thanks Adrien for quick response.
> Yes , i am replacing disjuncts across multiple fields with single custom
> term query over merged field.
> Can you please provide more details on what do you mean by dynamic pruning
> in context of custom term query ?
>
> On Tue, 20 Jun, 2023, 9:45 pm Adrien Grand,  wrote:
>
> > Intuitively replacing a disjunction across multiple fields with a single
> > term query should always be faster.
> >
> > You're saying that you're storing the type of token as part of the term
> > frequency. This doesn't sound like something that would play well with
> > dynamic pruning, so I wonder if this is the reason why you are seeing
> > slower queries. But since you mentioned custom term queries, maybe you
> > never actually took advantage of dynamic pruning?
> >
> > On Tue, Jun 20, 2023 at 10:30 AM Vimal Jain  wrote:
> >
> > > Ok , sorry , I realized that I need to provide more context.
> > > So we used to create a lucene query which consisted of custom term
> > queries
> > > for different fields and based on the type of field , we used to
> assign a
> > > boost that would be used in scoring.
> > > Now we want to get rid off different fields and instead of creating
> > > multiple term queries , we create only 1 term query for the merged
> field
> > > and the scorer of this term query ( on merged field ) makes use of
> custom
> > > term frequency info to deduce type of token ( during indexing we store
> > this
> > > info ) and hence the score that we were using earlier.
> > > So perf drop is observed in reference to  earlier implementation ( with
> > > multiple term queries ).
> > >
> > >
> > > *Thanks and Regards,*
> > > *Vimal Jain*
> > >
> > >
> > > On Tue, Jun 20, 2023 at 1:01 PM Adrien Grand 
> wrote:
> > >
> > > > You say you observed a performance drop, what are you comparing
> > against?
> > > >
> > > > Le mar. 20 juin 2023, 08:59, Vimal Jain  a écrit :
> > > >
> > > > > Note - i am using lucene 7.7.3
> > > > >
> > > > > *Thanks and Regards,*
> > > > > *Vimal Jain*
> > > > >
> > > > >
> > > > > On Tue, Jun 20, 2023 at 12:26 PM Vimal Jain 
> > wrote:
> > > > >
> > > > > > Hi,
> > > > > > I want to understand if fetching the term frequency of a term
> > during
> > > > > > scoring is relatively cpu bound operation ?
> > > > > > Context - I am storing custom term frequency during indexing and
> > > later
> > > > > > using it for scoring during query execution time ( in Scorer's
> > > score()
> > > > > > method ). I noticed a performance drop in my application and I
> > > suspect
> > > > > it's
> > > > > > because of this change.
> > > > > > Any insight or related articles for reference would be
> appreciated.
> > > > > >
> > > > > >
> > > > > > *Thanks and Regards,*
> > > > > > *Vimal Jain*
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > Adrien
> >
>


-- 
Adrien


Re: Relative cpu cost of fetching term frequency during scoring

2023-06-20 Thread Adrien Grand
Intuitively replacing a disjunction across multiple fields with a single
term query should always be faster.

You're saying that you're storing the type of token as part of the term
frequency. This doesn't sound like something that would play well with
dynamic pruning, so I wonder if this is the reason why you are seeing
slower queries. But since you mentioned custom term queries, maybe you
never actually took advantage of dynamic pruning?

On Tue, Jun 20, 2023 at 10:30 AM Vimal Jain  wrote:

> Ok, sorry, I realized that I need to provide more context.
> So we used to create a Lucene query which consisted of custom term queries
> for different fields, and based on the type of field we used to assign a
> boost that would be used in scoring.
> Now we want to get rid of the different fields: instead of creating
> multiple term queries, we create only one term query for the merged field,
> and the scorer of this term query (on the merged field) makes use of custom
> term frequency info to deduce the type of token (we store this info during
> indexing) and hence the score that we were using earlier.
> So the perf drop is observed relative to the earlier implementation (with
> multiple term queries).
>
>
> *Thanks and Regards,*
> *Vimal Jain*
>
>
> On Tue, Jun 20, 2023 at 1:01 PM Adrien Grand  wrote:
>
> > You say you observed a performance drop, what are you comparing against?
> >
> > Le mar. 20 juin 2023, 08:59, Vimal Jain  a écrit :
> >
> > > Note - i am using lucene 7.7.3
> > >
> > > *Thanks and Regards,*
> > > *Vimal Jain*
> > >
> > >
> > > On Tue, Jun 20, 2023 at 12:26 PM Vimal Jain  wrote:
> > >
> > > > Hi,
> > > > I want to understand if fetching the term frequency of a term during
> > > > scoring is relatively cpu bound operation ?
> > > > Context - I am storing custom term frequency during indexing and
> later
> > > > using it for scoring during query execution time ( in Scorer's
> score()
> > > > method ). I noticed a performance drop in my application and I
> suspect
> > > it's
> > > > because of this change.
> > > > Any insight or related articles for reference would be appreciated.
> > > >
> > > >
> > > > *Thanks and Regards,*
> > > > *Vimal Jain*
> > > >
> > >
> >
>


-- 
Adrien


Re: Relative cpu cost of fetching term frequency during scoring

2023-06-20 Thread Adrien Grand
You say you observed a performance drop, what are you comparing against?

Le mar. 20 juin 2023, 08:59, Vimal Jain  a écrit :

> Note - I am using Lucene 7.7.3
>
> *Thanks and Regards,*
> *Vimal Jain*
>
>
> On Tue, Jun 20, 2023 at 12:26 PM Vimal Jain  wrote:
>
> > Hi,
> > I want to understand whether fetching the term frequency of a term during
> > scoring is a relatively CPU-intensive operation.
> > Context - I am storing a custom term frequency during indexing and later
> > using it for scoring during query execution time (in Scorer's score()
> > method). I noticed a performance drop in my application and I suspect
> > it's because of this change.
> > Any insight or related articles for reference would be appreciated.
> >
> >
> > *Thanks and Regards,*
> > *Vimal Jain*
> >
>


Re: Performance regression in getting doc by id in Lucene 8 vs Lucene 7

2023-06-07 Thread Adrien Grand
I agree it's worth discussing. I opened
https://github.com/apache/lucene/issues/12355 and
https://github.com/apache/lucene/issues/12356.

On Tue, Jun 6, 2023 at 9:17 PM Rahul Goswami  wrote:
>
> Thanks Adrien. I spent some time trying to understand the readByte() in
> ReverseRandomAccessReader (through FST) and compare with 7.x.  Although I
> don't understand ALL of the details and reasoning for always loading the
> FST (and in turn the term index) off-heap (as discussed in
> https://github.com/apache/lucene/issues/10297 ) I understand that this is
> essentially causing disk access for every single byte during readByte().
>
> Does this warrant a JIRA for regression?
>
> As mentioned, I am noticing a 10x slowdown in SegmentTermsEnum.seekExact()
> affecting atomic update performance . For setups like mine that can't use
> mmap due to large indexes this would be a legit regression, no?
>
> - Rahul
>
> On Tue, Jun 6, 2023 at 10:09 AM Adrien Grand  wrote:
>
> > Yes, this changed in 8.x:
> >  - 8.0 moved the terms index off-heap for non-PK fields with
> > MMapDirectory. https://github.com/apache/lucene/issues/9681
> >  - Then in 8.6 the FST was moved off-heap all the time.
> > https://github.com/apache/lucene/issues/10297
> >
> > More generally, there's a few files that are no longer loaded in heap
> > in 8.x. It should be possible to load them back in heap by doing
> > something like that (beware, I did not actually test this code):
> >
> > class MyHeapDirectory extends FilterDirectory {
> >
> >   MyHeapDirectory(Directory in) {
> > super(in);
> >   }
> >
> >   @Override
> >   public IndexInput openInput(String name, IOContext context) throws
> > IOException {
> > if (context.load == false) {
> >   return super.openInput(name, context);
> > } else {
> >   try (IndexInput in = super.openInput(name, context)) {
> > byte[] bytes = new byte[Math.toIntExact(in.length())];
> > in.readBytes(bytes, 0, bytes.length);
> > ByteBuffer bb =
> > ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asReadOnlyBuffer();
> > return new ByteBuffersIndexInput(new
> > ByteBuffersDataInput(Collections.singletonList(bb)),
> > "ByteBuffersIndexInput(" + name + ")");
> >   }
> > }
> >   }
> >
> > }
> >
> > On Tue, Jun 6, 2023 at 3:41 PM Rahul Goswami 
> > wrote:
> > >
> > > Thanks Adrien. Is this behavior of FST something that has changed in
> > Lucene
> > > 8.x (from 7.x)?
> > > Also, is the terms index not loaded into memory anymore in 8.x?
> > >
> > > To your point on MMapDirectoryFactory, it is much faster as you
> > > anticipated, but the indexes commonly being >1 TB makes the Windows
> > machine
> > > freeze to a point I sometimes can't even connect to the VM.
> > > SimpleFSDirectory works well for us from that standpoint.
> > >
> > > To add, both NIOFS and SimpleFS have similar indexing benchmarks on
> > > Windows. I understand it is because of the Java bug which synchronizes
> > > internally in the native call for NIOFS.
> > >
> > > -Rahul
> > >
> > > On Tue, Jun 6, 2023 at 9:32 AM Adrien Grand  wrote:
> > >
> > > > +Alan Woodward helped me better understand what is going on here.
> > > > BufferedIndexInput (used by NIOFSDirectory and SimpleFSDirectory)
> > > > doesn't play well with the fact that the FST reads bytes backwards:
> > > > every call to readByte() triggers a refill of 1kB because it wants to
> > > > read the byte that is just before what the buffer contains.
> > > >
> > > > On Tue, Jun 6, 2023 at 2:07 PM Adrien Grand  wrote:
> > > > >
> > > > > My best guess based on your description of the issue is that
> > > > > SimpleFSDirectory doesn't like the fact that the terms index now
> > reads
> > > > > data directly from the directory instead of loading the terms index
> > in
> > > > > heap. Would you be able to run the same benchmark with MMapDirectory
> > > > > to check if it addresses the regression?
> > > > >
> > > > >
> > > > > On Tue, Jun 6, 2023 at 5:47 AM Rahul Goswami 
> > > > wrote:
> > > > > >
> > > > > > Hello,
> > > > > > We started experiencing slowness with atomic updates in Solr after
> > > > > > upgrading from 7.7.2 to 8.11.1. Running several te

Re: Performance regression in getting doc by id in Lucene 8 vs Lucene 7

2023-06-06 Thread Adrien Grand
Yes, this changed in 8.x:
 - 8.0 moved the terms index off-heap for non-PK fields with
MMapDirectory. https://github.com/apache/lucene/issues/9681
 - Then in 8.6 the FST was moved off-heap all the time.
https://github.com/apache/lucene/issues/10297

More generally, there's a few files that are no longer loaded in heap
in 8.x. It should be possible to load them back in heap by doing
something like that (beware, I did not actually test this code):

class MyHeapDirectory extends FilterDirectory {

  MyHeapDirectory(Directory in) {
    super(in);
  }

  @Override
  public IndexInput openInput(String name, IOContext context) throws IOException {
    if (context.load == false) {
      return super.openInput(name, context);
    }
    try (IndexInput in = super.openInput(name, context)) {
      byte[] bytes = new byte[Math.toIntExact(in.length())];
      in.readBytes(bytes, 0, bytes.length); // readBytes takes (buffer, offset, length)
      ByteBuffer bb = ByteBuffer.wrap(bytes)
          .order(ByteOrder.LITTLE_ENDIAN)
          .asReadOnlyBuffer();
      return new ByteBuffersIndexInput(
          new ByteBuffersDataInput(Collections.singletonList(bb)),
          "ByteBuffersIndexInput(" + name + ")");
    }
  }
}

On Tue, Jun 6, 2023 at 3:41 PM Rahul Goswami  wrote:
>
> Thanks Adrien. Is this behavior of FST something that has changed in Lucene
> 8.x (from 7.x)?
> Also, is the terms index not loaded into memory anymore in 8.x?
>
> To your point on MMapDirectoryFactory, it is much faster as you
> anticipated, but the indexes commonly being >1 TB makes the Windows machine
> freeze to a point I sometimes can't even connect to the VM.
> SimpleFSDirectory works well for us from that standpoint.
>
> To add, both NIOFS and SimpleFS have similar indexing benchmarks on
> Windows. I understand it is because of the Java bug which synchronizes
> internally in the native call for NIOFS.
>
> -Rahul
>
> On Tue, Jun 6, 2023 at 9:32 AM Adrien Grand  wrote:
>
> > +Alan Woodward helped me better understand what is going on here.
> > BufferedIndexInput (used by NIOFSDirectory and SimpleFSDirectory)
> > doesn't play well with the fact that the FST reads bytes backwards:
> > every call to readByte() triggers a refill of 1kB because it wants to
> > read the byte that is just before what the buffer contains.
> >
> > On Tue, Jun 6, 2023 at 2:07 PM Adrien Grand  wrote:
> > >
> > > My best guess based on your description of the issue is that
> > > SimpleFSDirectory doesn't like the fact that the terms index now reads
> > > data directly from the directory instead of loading the terms index in
> > > heap. Would you be able to run the same benchmark with MMapDirectory
> > > to check if it addresses the regression?
> > >
> > >
> > > On Tue, Jun 6, 2023 at 5:47 AM Rahul Goswami 
> > wrote:
> > > >
> > > > Hello,
> > > > We started experiencing slowness with atomic updates in Solr after
> > > > upgrading from 7.7.2 to 8.11.1. Running several tests revealed the
> > > > slowness to be in RealTimeGet's SolrIndexSearcher.getFirstMatch() call
> > > > which eventually calls Lucene's SegmentTermsEnum.seekExact()..
> > > >
> > > > In the benchmarks I ran, 8.11.1 is about 10x slower than 7.7.2. After
> > > > discussion on the Solr mailing list I created the below JIRA:
> > > >
> > > > https://issues.apache.org/jira/browse/SOLR-16838
> > > >
> > > > The thread dumps collected show a lot of threads stuck in the
> > > > FST.findTargetArc()
> > > > method. Testing environment details:
> > > >
> > > > Environment details:
> > > > - Java 11 on Windows server
> > > > - Xms1536m Xmx3072m
> > > > - Indexing client code running 15 parallel threads indexing in batches
> > of
> > > > 1000 on a standalone core.
> > > > - using SimpleFSDirectoryFactory  (since Mmap doesn't  quite work well
> > on
> > > > Windows for our index sizes which commonly run north of 1 TB)
> > > >
> > > >
> > https://drive.google.com/drive/folders/1q2DPNTYQEU6fi3NeXIKJhaoq3KPnms0h?usp=sharing
> > > >
> > > > Is there a known issue with slowness with TermsEnum.seekExact() in
> > Lucene
> > > > 8.x ?
> > > >
> > > > Thanks,
> > > > Rahul
> > >
> > >
> > >
> > > --
> > > Adrien
> >
> >
> >
> > --
> > Adrien
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >



-- 
Adrien




Re: Performance regression in getting doc by id in Lucene 8 vs Lucene 7

2023-06-06 Thread Adrien Grand
+Alan Woodward helped me better understand what is going on here.
BufferedIndexInput (used by NIOFSDirectory and SimpleFSDirectory)
doesn't play well with the fact that the FST reads bytes backwards:
every call to readByte() triggers a refill of 1kB because it wants to
read the byte that is just before what the buffer contains.
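A toy model makes the cost difference visible (plain Java, illustration only; this is not Lucene's actual BufferedIndexInput, just a naive 1 KB read buffer that always buffers forward from a miss): reading N bytes forward costs roughly N/1024 refills, while reading the same N bytes backwards refills on every single byte, because each requested byte sits just before the buffered window.

```java
public class BackwardsReadDemo {
  static final int BUFFER_SIZE = 1024;

  /** Counts buffer refills for a sequence of byte offsets read from a simulated file. */
  static int countRefills(int fileLength, int[] offsets) {
    int bufStart = -1, bufEnd = -1; // current buffered window [bufStart, bufEnd)
    int refills = 0;
    for (int off : offsets) {
      if (off < bufStart || off >= bufEnd) {
        bufStart = off; // naive policy: buffer 1 KB starting at the missed offset
        bufEnd = Math.min(fileLength, off + BUFFER_SIZE);
        refills++;
      }
    }
    return refills;
  }

  public static void main(String[] args) {
    int n = 8192;
    int[] forward = new int[n], backward = new int[n];
    for (int i = 0; i < n; i++) { forward[i] = i; backward[i] = n - 1 - i; }
    int f = countRefills(n, forward);
    int b = countRefills(n, backward);
    if (f != n / BUFFER_SIZE) throw new AssertionError("forward refills: " + f);
    if (b != n) throw new AssertionError("backward refills: " + b);
    System.out.println("forward refills=" + f + ", backward refills=" + b);
  }
}
```

With 8192 bytes, the forward scan refills 8 times while the backward scan refills 8192 times, which matches the pathology described above for FST reads.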

On Tue, Jun 6, 2023 at 2:07 PM Adrien Grand  wrote:
>
> My best guess based on your description of the issue is that
> SimpleFSDirectory doesn't like the fact that the terms index now reads
> data directly from the directory instead of loading the terms index in
> heap. Would you be able to run the same benchmark with MMapDirectory
> to check if it addresses the regression?
>
>
> On Tue, Jun 6, 2023 at 5:47 AM Rahul Goswami  wrote:
> >
> > Hello,
> > We started experiencing slowness with atomic updates in Solr after
> > upgrading from 7.7.2 to 8.11.1. Running several tests revealed the
> > slowness to be in RealTimeGet's SolrIndexSearcher.getFirstMatch() call
> > which eventually calls Lucene's SegmentTermsEnum.seekExact()..
> >
> > In the benchmarks I ran, 8.11.1 is about 10x slower than 7.7.2. After
> > discussion on the Solr mailing list I created the below JIRA:
> >
> > https://issues.apache.org/jira/browse/SOLR-16838
> >
> > The thread dumps collected show a lot of threads stuck in the
> > FST.findTargetArc()
> > method. Testing environment details:
> >
> > Environment details:
> > - Java 11 on Windows server
> > - Xms1536m Xmx3072m
> > - Indexing client code running 15 parallel threads indexing in batches of
> > 1000 on a standalone core.
> > - using SimpleFSDirectoryFactory  (since Mmap doesn't  quite work well on
> > Windows for our index sizes which commonly run north of 1 TB)
> >
> > https://drive.google.com/drive/folders/1q2DPNTYQEU6fi3NeXIKJhaoq3KPnms0h?usp=sharing
> >
> > Is there a known issue with slowness with TermsEnum.seekExact() in Lucene
> > 8.x ?
> >
> > Thanks,
> > Rahul
>
>
>
> --
> Adrien



-- 
Adrien




Re: Performance regression in getting doc by id in Lucene 8 vs Lucene 7

2023-06-06 Thread Adrien Grand
My best guess based on your description of the issue is that
SimpleFSDirectory doesn't like the fact that the terms index now reads
data directly from the directory instead of loading the terms index in
heap. Would you be able to run the same benchmark with MMapDirectory
to check if it addresses the regression?


On Tue, Jun 6, 2023 at 5:47 AM Rahul Goswami  wrote:
>
> Hello,
> We started experiencing slowness with atomic updates in Solr after
> upgrading from 7.7.2 to 8.11.1. Running several tests revealed the
> slowness to be in RealTimeGet's SolrIndexSearcher.getFirstMatch() call
> which eventually calls Lucene's SegmentTermsEnum.seekExact().
>
> In the benchmarks I ran, 8.11.1 is about 10x slower than 7.7.2. After
> discussion on the Solr mailing list I created the below JIRA:
>
> https://issues.apache.org/jira/browse/SOLR-16838
>
> The thread dumps collected show a lot of threads stuck in the
> FST.findTargetArc()
> method. Testing environment details:
>
> Environment details:
> - Java 11 on Windows server
> - Xms1536m Xmx3072m
> - Indexing client code running 15 parallel threads indexing in batches of
> 1000 on a standalone core.
> - using SimpleFSDirectoryFactory  (since Mmap doesn't  quite work well on
> Windows for our index sizes which commonly run north of 1 TB)
>
> https://drive.google.com/drive/folders/1q2DPNTYQEU6fi3NeXIKJhaoq3KPnms0h?usp=sharing
>
> Is there a known issue with slowness with TermsEnum.seekExact() in Lucene
> 8.x ?
>
> Thanks,
> Rahul



--
Adrien




Re: Mix of lucene50 and lucene70 codes

2023-04-08 Thread Adrien Grand
Hi,

This is normal. Lucene usually names codecs and file formats after the
first version that they were introduced in. But not all file formats
change on every version, and the Lucene 7.7.3 default postings format
was called Lucene50.

On Sat, Apr 8, 2023 at 4:17 PM Vimal Jain  wrote:
>
> Hi Guys,
> I am using Lucene v7.7.3
> I see that in my final lucene index ( consisting of multiple segments ) ,
> some files are from lucene50 codec and some from lucene70 codec , is that
> normal behaviour ?
>
> e.g. for a segment, I have
> _1.dii
> _1.dim
> _1.fdt
> _1.fdx
> _1.fnm
> _1.si
>
> _1_Lucene50_0.doc
> _1_Lucene50_0.tim
> _1_Lucene50_0.tip
> _1_Lucene70_0.dvd
> _1_Lucene70_0.dvm
>
> Is this some misconfiguration? How do I fix it?
>
> *Thanks and Regards,*
> *Vimal Jain*



-- 
Adrien




Re: Change score with distance SortField

2023-02-06 Thread Adrien Grand
Hi Michal,

The best way to do this would be to put a
LatLonPoint#newDistanceFeatureQuery in a SHOULD clause. It's not as
flexible as leveraging expressions, but it has the benefit of not
disabling dynamic pruning.
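For intuition, the distance feature contributes a score that decays smoothly with distance; its javadoc describes the contribution as weight * pivotDistance / (pivotDistance + distance). The sketch below is plain Java and purely illustrative (treat the formula as an assumption to verify against your Lucene version's javadoc): it only shows the shape of that contribution, equal to the full weight at distance zero and half the weight at the pivot distance.

```java
public class DistanceBoostDemo {
  /**
   * Assumed score contribution of a distance feature query: `weight` at
   * distance 0, weight/2 at the pivot distance, decaying toward 0 beyond it.
   */
  static double distanceFeature(double weight, double pivotMeters, double distanceMeters) {
    return weight * pivotMeters / (pivotMeters + distanceMeters);
  }

  public static void main(String[] args) {
    double w = 2.0, pivot = 1000.0;
    double near = distanceFeature(w, pivot, 0.0);
    double atPivot = distanceFeature(w, pivot, 1000.0);
    double far = distanceFeature(w, pivot, 9000.0);
    if (near != 2.0 || atPivot != 1.0) throw new AssertionError(near + " " + atPivot);
    if (!(far < atPivot && atPivot < near)) throw new AssertionError("not monotonic");
    System.out.printf("near=%.2f atPivot=%.2f far=%.2f%n", near, atPivot, far);
  }
}
```

Because this contribution sits in a SHOULD clause, more distant documents simply receive a smaller additive boost rather than being filtered out.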

On Mon, Feb 6, 2023 at 10:33 AM Michal Hlavac  wrote:
>
> Hi,
> I would like to influence the score using geographical distance. More distant 
> documents lower the score.
> I have sort field:
> SortField geoSort = LatLonDocValuesField.newDistanceSort("location", 
> pos.getLatitude(), pos.getLongitude());
>
> Then I tried add this sort field to SimpleBindings. I've found some code, 
> where SortField is added to SimpleBindings, but
> lucene 9.4.x API doesn't have this capability.
>
> What is the proper way to do this?
>
> thank you, Michal Hlavac
>
>
>
>


-- 
Adrien




Re: Other vector similarity metric than provided by VectorSimilarityFunction

2023-01-14 Thread Adrien Grand
Hi Michael,

You could create a custom KNN vectors format that ignores the vector
similarity configured on the field and uses its own.

Le sam. 14 janv. 2023, 21:33, Michael Wechner  a
écrit :

> Hi
>
> IIUC Lucene currently supports
>
> VectorSimilarityFunction.COSINE
> VectorSimilarityFunction.DOT_PRODUCT
> VectorSimilarityFunction.EUCLIDEAN
>
> whereas some embedding models have been trained with other metrics.
> Also see
>
> https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html
>
> How can I best implement another metric?
>
> Thanks
>
> Michael
>
>
>
>
>
>
>


Re: The current default similarity implementation of Lucene is BM25, right?

2022-11-23 Thread Adrien Grand
This is correct. See IndexSearcher#getDefaultSimilarity().

On Wed, Nov 23, 2022 at 10:53 AM Michael Wechner
 wrote:
>
> Hi
>
> On the Lucene FAQ there is no mention of TF-IDF or BM25, and I would
> like to add some notes, but to be sure I don't write anything wrong I
> would like to ask
>
> whether the current default similarity implementation of Lucene is
> really BM25, right?
>
> as described at
>
> https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
>
> Thanks
>
> Michael
>
>


-- 
Adrien




[ANNOUNCE] Apache Lucene 9.4.2 released

2022-11-23 Thread Adrien Grand
The Lucene PMC is pleased to announce the release of Apache Lucene 9.4.2

Apache Lucene is a high-performance, full-featured search engine library
written entirely in Java. It is a technology suitable for nearly any
application that requires structured search, full-text search, faceting,
nearest-neighbor search on high-dimensionality vectors, spell correction or
query suggestions.

This patch release contains an important fix for a bug affecting version
9.4.1. The release is available for immediate download at:
  https://lucene.apache.org/core/downloads.html

Lucene 9.4.2 Release Highlights

Bug fixes
 - Fixed integer overflow when opening segments containing more than ~16M
KNN vectors.
 - Fixed cost computation of BitSets created via DocIdSetBuilder, such as
for multi-term queries. This may improve performance of multi-term queries.

Enhancements
 - CheckIndex now verifies the consistency of KNN vectors more thoroughly.

Further details of changes are available in the change log available at:
https://lucene.apache.org/core/9_4_2/changes/Changes.html.

Please report any feedback to the mailing lists (
http://lucene.apache.org/core/discussion.html)

Note: The Apache Software Foundation now uses a content distribution
network (CDN) for distributing releases.

-- 
Adrien


Re: Sort by numeric field, order missing values before anything else

2022-11-21 Thread Adrien Grand
Uwe, I think that Petko's question was about making sure that missing
values would be returned before non-missing values, even though some of
these non-missing values might be equal to Long.MIN_VALUE. Which isn't
possible today.

I agree with your recommendation against going with bytes given the
overhead in case of high cardinality.

On Mon, Nov 21, 2022 at 11:08 AM Uwe Schindler  wrote:

> Hi,
>
> Long.MIN_VALUE and Long.MAX_VALUE are the correct way for longs to sort.
> In fact if you have Long.MIN_VALUE in your collection, empty values are
> treated the same, but still empty value will appear at the wanted place.
> In contrast to the default "0", it is not somewhere in the middle.
> Because there is no long that is smaller than Long.MIN_VALUE, the sort
> order will be OK.
>
> BTW, Apache Solr is using exactly those values to support missing values
> automatically (see sortMissingFirst, sortMissingLast schema options).
>
> In fact, string/bytes sorting has theoretically the same problem,
> because NULL is still different from empty. WARNING: If you really want
> to compare by byte[] as suggested in your last mail, keep in mind: When
> you sort against the raw bytes (using NumericUtils) with SORTED_SET
> docvalues type, there is a large overhead on indexing and sorting
> performance, especially for the case where you have many different
> values in your index (which is likely for numerics).
>
> Uwe
>
> Am 17.11.2022 um 08:47 schrieb Adrien Grand:
> > Hi Petko,
> >
> > Lucene's comparators for numerics have this limitation indeed. We haven't
> > got many questions around that in the past, which I would guess is due to
> > the fact that most numeric fields do not use the entire long range,
> > specifically Long.MIN_VALUE and Long.MAX_VALUE, so using either of these
> > works as a way to sort missing values first or last. If you have a field
> > that may use Long.MIN_VALUE and long.MAX_VALUE, we do not have a
> comparator
> > that can easily sort missing values first or last reliably out of the
> box.
> >
> > The easier option I can think of would consist of using the comparator
> for
> > longs with MIN_VALUE / MAX_VALUE for missing values depending on whether
> > you want missing values sorted first or last, and chain it with another
> > comparator (via a FieldComparatorSource) which would sort missing values
> > before/after existing values. The benefit of this approach is that you
> > would automatically benefit from some not-so-trivial features of Lucene's
> > comparator such as dynamic pruning.
> >
> > On Wed, Nov 16, 2022 at 9:16 PM Petko Minkov  wrote:
> >
> >> Hello,
> >>
> >> When sorting documents by a NumericDocValuesField, how can documents be
> >> ordered such that those with missing values can come before anything
> else
> >> in ascending sorts? SortField allows to set a missing value:
> >>
> >>  var sortField = new SortField("price", SortField.Type.LONG);
> >>  sortField.setMissingValue(null);
> >>
> >> This null is however converted into a long 0 and documents with missing
> >> values are considered equally ordered with documents with an actual 0
> >> value. It's possible to set the missing value to Long.MIN_VALUE, but
> that
> >> will have the same problem, just for a different long value.
> >>
> >> Besides writing a custom comparator, is there any simpler and still
> >> performant way to achieve this sort?
> >>
> >> --Petko
> >>
> >
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>
>

-- 
Adrien


Re: Sort by numeric field, order missing values before anything else

2022-11-16 Thread Adrien Grand
Hi Petko,

Lucene's comparators for numerics have this limitation indeed. We haven't
got many questions around that in the past, which I would guess is due to
the fact that most numeric fields do not use the entire long range,
specifically Long.MIN_VALUE and Long.MAX_VALUE, so using either of these
works as a way to sort missing values first or last. If you have a field
that may use Long.MIN_VALUE and long.MAX_VALUE, we do not have a comparator
that can easily sort missing values first or last reliably out of the box.

The easier option I can think of would consist of using the comparator for
longs with MIN_VALUE / MAX_VALUE for missing values depending on whether
you want missing values sorted first or last, and chain it with another
comparator (via a FieldComparatorSource) which would sort missing values
before/after existing values. The benefit of this approach is that you
would automatically benefit from some not-so-trivial features of Lucene's
comparator such as dynamic pruning.
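The chaining idea can be sketched with plain java.util comparators (illustration only, not Lucene's FieldComparator API): compare on "is the value missing" first, then on the long value, so that a document whose value happens to be exactly Long.MIN_VALUE still sorts after documents that are truly missing a value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class MissingFirstSortDemo {
  public static void main(String[] args) {
    // null models a document with no value for the sort field.
    List<Long> values = new ArrayList<>(Arrays.asList(3L, null, Long.MIN_VALUE, -7L, null));

    // Primary key: missing before present (false sorts before true).
    Comparator<Long> missingFirst = Comparator.comparing((Long v) -> v != null);
    // Secondary key: the numeric value itself (nullsFirst only to tolerate null ties).
    Comparator<Long> byValue = Comparator.nullsFirst(Comparator.naturalOrder());

    values.sort(missingFirst.thenComparing(byValue));

    List<Long> expected = Arrays.asList(null, null, Long.MIN_VALUE, -7L, 3L);
    if (!values.equals(expected)) throw new AssertionError(values);
    System.out.println(values);
  }
}
```

The point of the two-level comparison is exactly the ambiguity discussed above: a single long sentinel cannot distinguish "missing" from "equal to the sentinel", but a separate missing/present key can.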

On Wed, Nov 16, 2022 at 9:16 PM Petko Minkov  wrote:

> Hello,
>
> When sorting documents by a NumericDocValuesField, how can documents be
> ordered such that those with missing values can come before anything else
> in ascending sorts? SortField allows to set a missing value:
>
> var sortField = new SortField("price", SortField.Type.LONG);
> sortField.setMissingValue(null);
>
> This null is however converted into a long 0 and documents with missing
> values are considered equally ordered with documents with an actual 0
> value. It's possible to set the missing value to Long.MIN_VALUE, but that
> will have the same problem, just for a different long value.
>
> Besides writing a custom comparator, is there any simpler and still
> performant way to achieve this sort?
>
> --Petko
>


-- 
Adrien


Re: Learning Lucene from ground up

2022-11-07 Thread Adrien Grand
+1 to MyCoy's suggestion.

To answer your most immediate questions:
 - Lucene mostly loads metadata in memory at the time of opening a segment
(dvm, tmd, fdm, vem, nvm, kdm files), other files are memory-mapped and
Lucene relies on the filesystem cache to have their data efficiently
available. This allows Lucene to have a very small memory footprint for
searching.
 - Finite state machines are mostly used for suggesters and for the terms
index (tip file), which essentially stores all prefixes that are shared by
25-40 terms in a FST.
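As a rough picture of why indexing only one entry per group of terms keeps the terms index tiny, here is a toy block-based terms index in plain Java (purely illustrative; real Lucene stores the block keys in an FST with shared prefixes, not a TreeMap, and the on-disk layout is far more compact): one index entry per block of consecutive sorted terms is enough to route a lookup to the right block, which is then scanned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TermsIndexSketch {
  /** Toy terms index: one entry (block's first term -> block start) per `blockSize` sorted terms. */
  static TreeMap<String, Integer> buildIndex(List<String> sortedTerms, int blockSize) {
    TreeMap<String, Integer> index = new TreeMap<>();
    for (int start = 0; start < sortedTerms.size(); start += blockSize) {
      index.put(sortedTerms.get(start), start);
    }
    return index;
  }

  /** Lookup: the small index narrows the search to a single block, which is then scanned. */
  static int find(List<String> terms, TreeMap<String, Integer> index, int blockSize, String term) {
    Map.Entry<String, Integer> e = index.floorEntry(term);
    if (e == null) return -1;
    int start = e.getValue(), end = Math.min(terms.size(), start + blockSize);
    int i = terms.subList(start, end).indexOf(term);
    return i < 0 ? -1 : start + i;
  }

  public static void main(String[] args) {
    List<String> terms = new ArrayList<>();
    for (char c = 'a'; c <= 'j'; c++)
      for (int i = 0; i < 10; i++) terms.add(c + String.format("%02d", i)); // 100 sorted terms
    TreeMap<String, Integer> index = buildIndex(terms, 25);
    if (index.size() != 4) throw new AssertionError("index entries: " + index.size());
    if (find(terms, index, 25, "c07") != terms.indexOf("c07")) throw new AssertionError();
    if (find(terms, index, 25, "zzz") != -1) throw new AssertionError();
    System.out.println("100 terms indexed with only " + index.size() + " index entries");
  }
}
```

The ratio here (100 terms, 4 index entries) mirrors the 25-40 terms per index entry mentioned above.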

On Sun, Nov 6, 2022 at 2:12 AM MyCoy Z  wrote:

> I just started learning Lucene HNSW source code last months.
>
> I find the most effective way is to start with the testcases, set debugging
> break points in the code you're interested in, and walk through the code
>
> Regards
> MyCoy
>
> On Fri, Nov 4, 2022 at 9:24 PM Rahul Goswami 
> wrote:
>
> > Hello,
> > I have been working with Lucene and Solr for quite some time and have a
> > good understanding of a lot of moving parts at the code level. However I
> > wish to learn Lucene  internals from the ground up and want to
> familiarize
> > myself with all the dirty details. I would like to know what would be the
> > best way to go about it.
> >
> > To kick things off, I have been thinking about picking up “Lucene in
> > Action”, but have been hesitant (and possibly wrongly) since it is based
> on
> > Lucene 3.0 and we have come a long way since then. To give an example of
> > the level of detail I wish to learn (among other things) would be what
> > parts of a segment (.tim, .tip, etc) get loaded in memory at search time,
> > which part uses finite state machines and why, etc
> >
> > I would really appreciate any thoughts/inputs on how I can go about this.
> > Thanks in advance!
> >
> > Regards,
> > Rahul
> >
>


-- 
Adrien


Re: Efficient sort on SortedDocValues

2022-11-07 Thread Adrien Grand
Hi Andrei,

The case that you are describing got optimized in Lucene 9.4.0 in the case
when your field is also indexed with a StringField:
https://github.com/apache/lucene/pull/1023. See annotation ER at
http://people.apache.org/~mikemccand/lucenebench/TermMonthSort.html.

The way it works is that Lucene will automatically leverage the inverted
index in order to only look at documents that compare better than the
current k-th document in the priority queue.

To make it work with your test case, you will need to:
 - index a StringField with the same name and same value
 - change values to be less random if possible, since this optimization
works better on low-cardinality fields than on high-cardinality fields
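A minimal sketch of what the indexing side would look like — the field name "month" and value are made up for illustration; the doc-values field drives the sort while the StringField with the same name/value lets Lucene use the terms index to skip non-competitive documents:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.util.BytesRef;

public class SortOptimizationExample {
  // Hypothetical low-cardinality field "month": the optimization described
  // above works best when many documents share each value.
  static Document makeDoc(String month) {
    Document doc = new Document();
    // Doc-values field consumed by Sort/SortField at search time.
    doc.add(new SortedDocValuesField("month", new BytesRef(month)));
    // Indexed StringField with the same name and value enables the
    // inverted-index-based sort optimization (Lucene >= 9.4).
    doc.add(new StringField("month", month, Field.Store.NO));
    return doc;
  }
}
```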





On Mon, Nov 7, 2022 at 9:45 AM Mikhail Khludnev  wrote:

> Hello, Andrei.
> Docs are scored in-order (see Weight.scoreAll(), scoreRange()), just
> because underneath postings API is in-order. There are a few
> shortcuts/optimizations, but they only omit some iterations/segments like
> checking competitive scores and so one.
>
> On Sun, Nov 6, 2022 at 1:35 AM Solodin, Andrei (TR Technology)
>  wrote:
>
> > One more thing. While the test case passes now, it still iterates in
> index
> > order. Which means that it still collects ~6.4K docs out of 10k matches.
> > This is an improvement, but I am still wondering why it's not possible to
> > iterate in the field order. Seems like that would provide substantial
> > improvement.
> >
> > From: Solodin, Andrei (TR Technology)
> > Sent: Saturday, November 5, 2022 5:18 PM
> > To: java-user@lucene.apache.org
> > Subject: RE: Efficient sort on SortedDocValues
> >
> > I just realized that the problem is that the field needs to be indexed as
> > well. Now it works. But I noticed that this only works in Lucene 9. Does
> > not work in Lucene 8 (specifically 8.11.2). This must be new
> functionality
> > in Lucene 9?
> >
> > Thanks
> >
> >
> > From: Solodin, Andrei (TR Technology)
> > Sent: Saturday, November 5, 2022 1:07 PM
> > To: java-user@lucene.apache.org
> > Subject: Efficient sort on SortedDocValues
> >
> > Hello Lucene community, while looking into how to efficiently sort on a
> > field value, I came across a couple of things that I don't quite
> > understand. My assumption was that if I execute a search and sort on a
> > SortedDocValues field, lucene would only iterate over the docs in the
> order
> > of the field values or at least collect only competitive docs (docs that
> > made it into the topN queue). Neither of those things seems to be
> > happening. Instead, the iteration is happening in index order and all
> > matched docs are collected. Looking at the code, I see that the
> > optimizations are only possible if the index is sorted in the field order
> > to begin with, which is not possible for our use case. We may have dozens
> > of such fields in our index, thus there isn't any one field that can be
> > used to sort the index. So I guess my question if what I am trying to
> > achieve is possible? I tried to look though Solr codebase, but so far
> > couldn't come up with anything. Code example is here
> > https://pastebin.com/i05E2wZy  . I am using 9.4.1. Thanks in advance.
> >
> > Andrei
> >
> >
>
> --
> Sincerely yours
> Mikhail Khludnev
>


-- 
Adrien


Re: Upgrading from 9.1.0. to 9.4.0: Old codecs may only be used for reading Lucene91HnswVectorsFormat.java

2022-10-01 Thread Adrien Grand
The best practice is to not set the codec explicitly, and Lucene will make
sure to always use the right one.

Setting the codec explicitly is considered expert usage. I guess you are doing
this because you want to configure things like stored fields compression or
HNSW parameters? If so, there is no better way than what you are doing.
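If stored-fields compression is indeed the reason, a sketch of the configuration might look like this — assuming Lucene 9.4, where the current codec class is `Lucene94Codec` with a `Mode` enum for the compression trade-off:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.lucene94.Lucene94Codec;
import org.apache.lucene.index.IndexWriterConfig;

public class CodecConfigExample {
  static IndexWriterConfig newConfig() {
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    // Preferred: don't call setCodec at all and let Lucene pick the default.
    // If you must tune stored-fields compression, use the codec class that
    // matches your Lucene version (Lucene94Codec for 9.4.x):
    config.setCodec(new Lucene94Codec(Lucene94Codec.Mode.BEST_COMPRESSION));
    return config;
  }
}
```

On the next upgrade, only this one line needs to change to the new codec class.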


On Sat, Oct 1, 2022, 12:31, Michael Wechner  wrote:

> Hi Adrien
>
> Thank you very much for your help!
>
> That was it :-) I completely forgot that I set this somewhere hidden
> inside my code.
> I made a note in the pom file, such that I should not forget again
> during the next upgrade :-)
>
> Or what is the best practice re setting / handling the codec?
>
> Thanks
>
> Michael
>
> On 01.10.22 at 08:06, Adrien Grand wrote:
> > I would guess that you are configuring your IndexWriterConfig with a
> > "Lucene91Codec" instance. You need to replace it with a "Lucene94Codec"
> > instance.
> >
> > On Sat, Oct 1, 2022, 06:12, Michael Wechner  wrote:
> >
> >> Hi
> >>
> >> I have just upgraded from 9.1.0 to 9.4.0 and compiling works fine, but
> >> when I run and re-index my data using KnnVectorField, then I receive the
> >> following exception:
> >>
> >> java.lang.UnsupportedOperationException: Old codecs may only be used for
> >> reading
> >>   at
> >>
> org.apache.lucene.backward_codecs.lucene91.Lucene91HnswVectorsFormat.fieldsWriter(Lucene91HnswVectorsFormat.java:131)
> >>
> >> ~[lucene-backward-codecs-9.4.0.jar:9.4.0
> >> d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956 - sokolovm - 2022-09-30
> 14:55:13]
> >>   at
> >>
> org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsWriter.getInstance(PerFieldKnnVectorsFormat.java:161)
> >>
> >> ~[lucene-core-9.4.0.jar:9.4.0 d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956 -
> >> sokolovm - 2022-09-30 14:55:13]
> >>   at
> >>
> org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsWriter.addField(PerFieldKnnVectorsFormat.java:105)
> >>
> >> ~[lucene-core-9.4.0.jar:9.4.0 d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956 -
> >> sokolovm - 2022-09-30 14:55:13]
> >>   at
> >>
> org.apache.lucene.index.VectorValuesConsumer.addField(VectorValuesConsumer.java:70)
> >>
> >> ~[lucene-core-9.4.0.jar:9.4.0 d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956 -
> >> sokolovm - 2022-09-30 14:55:13]
> >>   at
> >>
> org.apache.lucene.index.IndexingChain.initializeFieldInfo(IndexingChain.java:665)
> >>
> >> ~[lucene-core-9.4.0.jar:9.4.0 d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956 -
> >> sokolovm - 2022-09-30 14:55:13]
> >>   at
> >>
> org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:556)
> >>
> >> ~[lucene-core-9.4.0.jar:9.4.0 d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956 -
> >> sokolovm - 2022-09-30 14:55:13]
> >>   at
> >>
> org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241)
> >>
> >> ~[lucene-core-9.4.0.jar:9.4.0 d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956 -
> >> sokolovm - 2022-09-30 14:55:13]
> >>   at
> >>
> org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432)
> >>
> >> ~[lucene-core-9.4.0.jar:9.4.0 d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956 -
> >> sokolovm - 2022-09-30 14:55:13]
> >>   at
> >>
> org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1533)
> >> ~[lucene-core-9.4.0.jar:9.4.0 d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956 -
> >> sokolovm - 2022-09-30 14:55:13]
> >>   at
> >>
> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1818)
> >> ~[lucene-core-9.4.0.jar:9.4.0 d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956 -
> >> sokolovm - 2022-09-30 14:55:13]
> >>   at
> >> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1471)
> >> ~[lucene-core-9.4.0.jar:9.4.0 d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956 -
> >> sokolovm - 2022-09-30 14:55:13]
> >>
> >> Any idea what I might be doing wrong?
> >>
> >> Thanks
> >>
> >> Michael
> >>
> >> -
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Upgrading from 9.1.0. to 9.4.0: Old codecs may only be used for reading Lucene91HnswVectorsFormat.java

2022-10-01 Thread Adrien Grand
I would guess that you are configuring your IndexWriterConfig with a
"Lucene91Codec" instance. You need to replace it with a "Lucene94Codec"
instance.
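A minimal sketch of the change, assuming the codec was being set on the `IndexWriterConfig` (class and package names per Lucene 9.4, where old codecs live in the backward-codecs module and are read-only):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.lucene94.Lucene94Codec;
import org.apache.lucene.index.IndexWriterConfig;

public class UpgradeCodecExample {
  static IndexWriterConfig upgraded() {
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    // Before (9.1): config.setCodec(new Lucene91Codec());
    // Lucene91Codec is read-only in 9.4; write with the current codec:
    config.setCodec(new Lucene94Codec());
    return config;
  }
}
```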

On Sat, Oct 1, 2022, 06:12, Michael Wechner  wrote:

> Hi
>
> I have just upgraded from 9.1.0 to 9.4.0 and compiling works fine, but
> when I run and re-index my data using KnnVectorField, then I receive the
> following exception:
>
> java.lang.UnsupportedOperationException: Old codecs may only be used for
> reading
>  at
> org.apache.lucene.backward_codecs.lucene91.Lucene91HnswVectorsFormat.fieldsWriter(Lucene91HnswVectorsFormat.java:131)
>
> ~[lucene-backward-codecs-9.4.0.jar:9.4.0
> d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956 - sokolovm - 2022-09-30 14:55:13]
>  at
> org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsWriter.getInstance(PerFieldKnnVectorsFormat.java:161)
>
> ~[lucene-core-9.4.0.jar:9.4.0 d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956 -
> sokolovm - 2022-09-30 14:55:13]
>  at
> org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsWriter.addField(PerFieldKnnVectorsFormat.java:105)
>
> ~[lucene-core-9.4.0.jar:9.4.0 d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956 -
> sokolovm - 2022-09-30 14:55:13]
>  at
> org.apache.lucene.index.VectorValuesConsumer.addField(VectorValuesConsumer.java:70)
>
> ~[lucene-core-9.4.0.jar:9.4.0 d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956 -
> sokolovm - 2022-09-30 14:55:13]
>  at
> org.apache.lucene.index.IndexingChain.initializeFieldInfo(IndexingChain.java:665)
>
> ~[lucene-core-9.4.0.jar:9.4.0 d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956 -
> sokolovm - 2022-09-30 14:55:13]
>  at
> org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:556)
>
> ~[lucene-core-9.4.0.jar:9.4.0 d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956 -
> sokolovm - 2022-09-30 14:55:13]
>  at
> org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241)
>
> ~[lucene-core-9.4.0.jar:9.4.0 d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956 -
> sokolovm - 2022-09-30 14:55:13]
>  at
> org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432)
>
> ~[lucene-core-9.4.0.jar:9.4.0 d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956 -
> sokolovm - 2022-09-30 14:55:13]
>  at
> org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1533)
> ~[lucene-core-9.4.0.jar:9.4.0 d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956 -
> sokolovm - 2022-09-30 14:55:13]
>  at
> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1818)
> ~[lucene-core-9.4.0.jar:9.4.0 d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956 -
> sokolovm - 2022-09-30 14:55:13]
>  at
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1471)
> ~[lucene-core-9.4.0.jar:9.4.0 d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956 -
> sokolovm - 2022-09-30 14:55:13]
>
> Any idea what I might be doing wrong?
>
> Thanks
>
> Michael
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Max Field Length

2022-09-23 Thread Adrien Grand
We have a TruncateTokenFilter in lucene/analysis/common. :)
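A sketch of how it could be wired into an analysis chain — the 128-character limit is an arbitrary example value:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.miscellaneous.TruncateTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class TruncatingAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    // Truncate any token longer than 128 chars instead of dropping it,
    // so very long source tokens remain (partially) searchable.
    TokenStream result = new TruncateTokenFilter(source, 128);
    return new TokenStreamComponents(source, result);
  }
}
```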

On Fri, Sep 23, 2022 at 4:39 PM Michael Sokolov  wrote:

> I wonder if it would make sense to provide a TruncationFilter in
> addition to the LengthFilter. That way long tokens in source text
> could be better supported, albeit with some confusion if they share
> the same very long prefix...
>
> On Fri, Sep 23, 2022 at 9:56 AM Scott Guthery  wrote:
> >
> > Thanks much, Adrien.  I hadn't realized that the size limit was on one
> > token in the text as opposed to being a limit on the length of the entire
> > text field.  I'm loading patents, so I suspect that the very long word
> is a
> > DNA sequence.
> >
> > Thanks also for your guidance with regard to setting maximums.
> >
> > Cheers, Scott
> >
> > >
> > >
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-- 
Adrien


Re: Questions about Lucene source

2022-09-23 Thread Adrien Grand
On the 2nd question, we do not plan on leveraging this information to
figure out the codec: the codec that should be used to read a segment is
stored separately (also in segment infos).

It is mostly useful for diagnostics purposes. E.g. if we see an interesting
corruption case where checksums match, we can guess that there is a bug
somewhere in Lucene in a version that is between this minimum version and
the version that was used to write the segment.

On Sat, Sep 17, 2022 at 11:07 AM Dawid Weiss  wrote:

> > (so deleted docs == max docs) and call commit. Will/Can this segment
> still
> > exist after commit?
> >
>
> Depends on your index deletion policy.  You can configure
> Lucene to keep older commits (and then you'll preserve all historical
> segments).
>
> I don't know the answer to your second question.
>
> D.
>


-- 
Adrien


Re: Max Field Length

2022-09-23 Thread Adrien Grand
Hi Scott,

There is no way to lift this limit. The assumption is that a user would
never type a 32kB keyword in a search bar, so indexing such long keywords
is wasteful. Some tokenizers like StandardTokenizer can be configured to
limit the length of the tokens that they produce, there is also a
LengthFilter that can be appended to the analysis chain to filter out
tokens that exceed the maximum term length.

I would note that modifying the source code is going to require more than
bumping the hardcoded limit as we rely on this limit in a few places, e.g.
ByteBlockPool.
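A sketch of an analyzer using both mechanisms mentioned above — the 255-character bound is illustrative (it happens to be StandardTokenizer's default), comfortably under the per-term limit:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class BoundedTokenAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    StandardTokenizer source = new StandardTokenizer();
    // StandardTokenizer can cap token length itself.
    source.setMaxTokenLength(255);
    // LengthFilter drops tokens outside [1, 255] — e.g. DNA-sequence-like
    // "words" — keeping every remaining term well under the limit.
    TokenStream result = new LengthFilter(source, 1, 255);
    return new TokenStreamComponents(source, result);
  }
}
```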

On Fri, Sep 23, 2022 at 12:59 AM Scott Guthery  wrote:

> Lucene 9.3 seems to have a (post-Analyzer) maximum field length of 32767.
> Is there a way of increasing this without resorting to the source code?
>
> Thanks for any guidance.
>
> Cheers, Scott
>


-- 
Adrien


Re: Lucene's LRU Query Cache - Deep Dive

2022-07-19 Thread Adrien Grand
1. I believe that this would require pulling a ScorerSupplier twice for the
same segment, which is a costly operation.

2. The cost is computed in order to know whether the top-level query is
likely to consume a large-enough portion of the matches of the query that
we are considering caching so that caching this query wouldn't hurt latency
too much. Making a bad decision here because the cost is unknown would lead
to a worse situation than computing the cost on every query that we are
considering caching.

On both of these questions, I feel like I may be missing the point about
the suggestion you are making so feel free to show a simple code change
that could help me understand the change that you are suggesting.

On Thu, Jul 14, 2022 at 12:26 PM Mohammad Sadiq 
wrote:

> Thanks for the deep-dive Shradha. Thank you Adrien for the additional
> questions and answers.
>
> I had a couple of questions, when looking around the cache code.
>
> 1. The `QueryCachingPolicy` [1] makes decisions based on `Query`. Why not
> use `Weight`?
> The `scorerSupplier` [2] in the `LRUQueryCache` decides whether something
> should be cached by determining the cost [3] using the `Weight`. IIUC, this
> was introduced because “Query caching leads to absurdly slow queries” [4].
> What if the `QueryCachingPolicy` was called with the `Weight` instead?
> Since the `Query` can be obtained from the `Weight`, we can have all such
> caching decisions in the policy, and de-couple that decision from the
> `LRUQueryCache` class. What do you think?
>
> 2. Why do we invoke a possibly costly `cost()` method for every cache
> addition?
> During the above cost computation, we call the `supplier.cost()` method;
> but its documentation [5] states that it “may be a costly operation, so it
> should only be called if necessary".
> This means that we’re including a (possibly) costly operation for every
> cache addition. If we de-couple these, then, for cases where the cache
> addition is expensive, we can use the call to `cost`, but for other cases,
> we can avoid this expensive call.
>
> If you, or the community thinks that this is a good idea, then I can open
> a JIRA, and submit a PR.
>
> References:
> [1]
> https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/core/src/java/org/apache/lucene/search/QueryCachingPolicy.java
> <
> https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/core/src/java/org/apache/lucene/search/QueryCachingPolicy.java
> >
> [2]
> https://github.com/apache/lucene/blob/941df98c3f718371af4702c92bf6537739120064/lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java#L725
> <
> https://github.com/apache/lucene/blob/941df98c3f718371af4702c92bf6537739120064/lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java#L725
> >
> [3]
> https://github.com/apache/lucene/blob/941df98c3f718371af4702c92bf6537739120064/lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java#L767
> <
> https://github.com/apache/lucene/blob/941df98c3f718371af4702c92bf6537739120064/lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java#L767
> >
> [4] https://github.com/apache/lucene-solr/pull/940/files <
> https://github.com/apache/lucene-solr/pull/940/files>
> [5]
> https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/core/src/java/org/apache/lucene/search/ScorerSupplier.java#L39-L40
> <
> https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/core/src/java/org/apache/lucene/search/ScorerSupplier.java#L39-L40
> >
>
>
> Regards,
> Mohammad Sadiq
>
>
> > On 11 Jul 2022, at 10:37, Adrien Grand  wrote:
> >
> > Hey Shradha,
> >
> > This correctly describes the what, but I think it could add more color
> > about why the cache behaves this way to be more useful, e.g.
> > - Why doesn't the cache cache all queries? Lucene is relatively good at
> > evaluating a subset of the matching documents, e.g. queries sorted by
> > numeric field can leverage point index structures to only look at a small
> > subset of the matching docs. Yet caching a query requires consuming all
> its
> > matches, so it could significantly hurt latencies. It's important to not
> > cache all queries to preserve the benefit of Lucene's filtering and
> dynamic
> > pruning capabilities.
> > - A corollary of the above is that the query cache is essentially a way
> to
> > trade latency for throughput. Use-cases that care more about latency than
> > throughput may want to disable the cache entirely.
> > - LRUQueryCache takes a `skipCacheFactor` which aims at limiting the
> > impact of query caching on latency by not caching clauses whose cost is
> > much higher than the overall query.

Re: Lucene Disable scoring

2022-07-11 Thread Adrien Grand
Note that Lucene automatically disables scoring already when scores are not
needed. E.g. queries that compute the top-k hits by score will definitely
compute scores, but if you are just counting the number of matches of a
query or aggregations, then Lucene skips scoring entirely already.

Is there something that leads you to believe that Lucene is computing
scores when it shouldn't?
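For reference, a sketch of the two paths (field and term are made up): asking for top-k hits scores documents, while counting does not, and may not even visit the matches at all.

```java
import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;

public class ScoringExample {
  static void demo(Directory dir) throws IOException {
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      Query q = new TermQuery(new Term("field", "value"));
      // Computes scores: top-10 hits ranked by relevance.
      searcher.search(q, 10);
      // Skips scoring entirely: pure counting.
      int count = searcher.count(q);
    }
  }
}
```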

On Mon, Jul 11, 2022 at 5:27 PM Mikhail Khludnev  wrote:

> I'd rather agree with Uwe, but you can plug BooleanSimilarity just to check
> it out.
>
> On Mon, Jul 11, 2022 at 6:01 PM Mohammad Kasaei <
> mohammadkasae...@gmail.com>
> wrote:
>
> > Hello
> >
> > I have a question. Is it possible to completely disable scoring in
> lucene?
> >
> > Detailed description:
> > I have an index in elasticsearch and it contains big shards (every shard
> > about 500m docs) so a nano second of time spent on scoring every document
> > in any shard causes a few second delay in the query response.
> > I discovered that the most performant way to score documents is constant
> > score but the overhead of function calls can cause delay.
> > As a result I'm looking for a trick to ignore the function call and have
> > all no scoring on my whole query
> >
> > Is it possible to ignore this step?
> >
> > thanks a million
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


-- 
Adrien


Re: Lucene's LRU Query Cache - Deep Dive

2022-07-11 Thread Adrien Grand
Hey Shradha,

This correctly describes the what, but I think it could add more color
about why the cache behaves this way to be more useful, e.g.
 - Why doesn't the cache cache all queries? Lucene is relatively good at
evaluating a subset of the matching documents, e.g. queries sorted by
numeric field can leverage point index structures to only look at a small
subset of the matching docs. Yet caching a query requires consuming all its
matches, so it could significantly hurt latencies. It's important to not
cache all queries to preserve the benefit of Lucene's filtering and dynamic
pruning capabilities.
 - A corollary of the above is that the query cache is essentially a way to
trade latency for throughput. Use-cases that care more about latency than
throughput may want to disable the cache entirely.
 - LRUQueryCache takes a `skipCacheFactor` which aims at limiting the
impact of query caching on latency by not caching clauses whose cost is
much higher than the overall query. It only helps for filters within
conjunctions though, not in the dynamic pruning case when we don't know how
many matches are going to be consumed.
 - Why are small segments never cached? Small segments are likely going to
be merged soon, so it would be wasteful to build cache entries that would
get evicted shortly.
 - The queries that never get cached don't get cached because a cached
entry wouldn't perform faster than their uncached counterpart. An inverted
index is already a cache of the matches for every unique term of the index.
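A sketch of the two knobs mentioned above, per the Lucene 9.x `LRUQueryCache` constructor (the size, RAM, segment, and skip-factor values are arbitrary examples, not defaults):

```java
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.LRUQueryCache;

public class QueryCacheConfigExample {
  static void configure(IndexSearcher searcher) {
    // Latency-sensitive use case: disable query caching entirely.
    searcher.setQueryCache(null);

    // Or: cache up to 1000 queries / 64 MB, only on segments with at
    // least 10k docs (stand-in for "skip small segments"), skipping any
    // clause more than 10x as costly as the overall query (skipCacheFactor).
    LRUQueryCache cache =
        new LRUQueryCache(1000, 64L * 1024 * 1024,
            ctx -> ctx.reader().maxDoc() >= 10_000, 10f);
    searcher.setQueryCache(cache);
  }
}
```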

On Fri, Jul 8, 2022 at 3:20 PM Shradha Shankar 
wrote:

> Hello!
>
> I work at Amazon Product Search and I’ve recently been looking into
> understanding how Lucene’s LRU Query Cache works. I’ve written up a summary
> of my understanding here. (Also attached as a markdown file with this email)
>
> Will really appreciate feedback/improvements/corrections for my
> understanding and if this is worthy of contributing to the documentation
> for LRU QueryCache. :)
>
> =
> A brief overview of Lucene's Query Cache
>
> We first get acquainted with Lucene’s caching at IndexSearcher’s
> createWeight method which is called for every query (and consequently
> sub-queries within that query, eg see BooleanWeight) before we can actually
> find matching documents in our index and score them. Weight is really
> another representation of a query that is specific to the statistics of the
> IndexSearcher being used. This definition makes it easier to see why
> caching logic would start while creating weight for the query - we want to
> make a weight that will be responsible for caching matching docs per
> segment. Since segments are specific to the IndexReader being used by the
> IndexSearcher, they are transitively, specific to the IndexSearcher.
>
> QueryCache in Lucene is an interface that has the signature for just one
> method - doCache. doCache takes in a Weight (weight of the query in
> question eg: TermWeight for a TermQuery) and operates on it based on the
> rules defined by QueryCachingPolicy (yet another interface that defines two
> methods - onUse and shouldCache) to return a Weight. This “new” returned
> Weight possibly wraps the original weight and bestows upon it some caching
> abilities.
>
> As of now, Lucene has one core implementation of the QueryCache and the
> QueryCachingPolicy. All IndexSearcher instances have a default query cache
> - LRUQueryCache and use the default policy -
> UsageTrackingQueryCachingPolicy. In the IndexSearcher’s createWeight method,
> we first create a weight for the incoming query and then subject it to the
> LRUQueryCache’s doCache method. An important thing to note here is that we
> only call doCache when the score mode passed to the search method does not
> need scores. Calling doCache does nothing complex; it just returns the
> input weight wrapped as a CachingWrapperWeight that encapsulates the
> caching policy information. No real caching has happened, yet!
>
> After getting a weight from the createWeight method, the IndexSearcher
> iterates over each leaf and uses the weight to create a BulkScorer. A
> BulkScorer, as the name suggests, is used to score a range of the documents
> - generally the range being all the matches found in that leaf. Given
> context information for a leaf, every weight should know how to create a
> bulk scorer for that leaf. In our case, the CachingWrapperWeight’s
> BulkScorer method does a little bit extra and this is where the actual
> caching happens!
>
> A brief dive into the query caching policy: While LRU says what we want to
> evict from a full cache, using query caching policies we can define other
> rules to use in conjunction with the cache’s design policy. The default
> UsageTrackingQueryCachingPolicy dictates what queries will be put into the
> cache. This policy uses a ring buffer data structure optimised to track and
> retrieve frequencies for a given query. The policy also defines some
> queries that will 

Re: Question about Benchmark

2022-05-16 Thread Adrien Grand
Hi Balmukund,

What benchmark are you talking about?

On Mon, May 16, 2022 at 4:35 PM balmukund mandal  wrote:
>
> Hi All,
> I was trying to run the benchmark and had a couple of questions. Indexing
> takes a long time, so is there a way to configure the benchmark to use an
> already existing index for search? Also, is there a way to configure the
> benchmark to use multiple threads for indexing (looks to me that it’s a
> single-threaded indexing)?
>
> --Regards,
> Balmukund



-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Index corruption and repair

2022-04-28 Thread Adrien Grand
Hi Anthony,

This isn't something that you should try to fix programmatically,
corruptions indicate that something is wrong with the environment,
like a broken disk or corrupt RAM. I would suggest running a memtest
to check your RAM and looking at system logs in case they have
anything to tell about your disks.

Can you also share the full stack trace of the exception?

On Thu, Apr 28, 2022 at 10:26 AM Antony Joseph
 wrote:
>
> Hello,
>
> We are facing a strange situation in our application as described below:
>
> *Using*:
>
>- Python 3.8.10
>- Pylucene 6.5.0
>- Java 8 (1.8.0_181)
>- Runs on Linux and Windows (error seen on Windows)
>
> We suddenly get the following *error*:
>
> 2022-02-10 09:58:09.253215: ERROR : writer | Failed to get index
> (D:\i\202202) writer, Exception:
> org.apache.lucene.index.CorruptIndexException: Unexpected file read error
> while reading index.
> (resource=BufferedChecksumIndexInput(MMapIndexInput(path="D:\i\202202\segments_fo")))
>
>
> After this, no further indexing happens - trying to open the index for
> writing throws the above error - and the index writer does not open.
>
> FYI, our code contains the following *settings*:
>
> index_path = "D:\i\202202"
> index_directory = FSDirectory.open(Paths.get(index_path))
> iconfig = IndexWriterConfig(wrapper_analyzer)
> iconfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND)
> iconfig.setRAMBufferSizeMB(16.0)
> writer = IndexWriter(index_directory, iconfig)
>
>
> *Repairing*
> We tried 'repairing' the index with the following command / tool:
>
> java -cp lucene-core-6.5.0.jar:lucene-backward-codecs-6.5.0.jar
> org.apache.lucene.index.CheckIndex "D:\i\202202" -exorcise
>
> This however returns saying "No problems found with the index."
>
>
> *Work around*
> We have to manually delete the problematic segment file:
> D:\i\202202\segments_fo
> after which the application starts again... until the next corruption. We
> can't spot a specific pattern.
>
>
> *Two questions:*
>
>1. Can we handle this situation programmatically, so that no manual
>intervention is needed?
>2. Any reason why we are facing the corruption issue in the first place?
>
>
> Before this we were using Pylucene 4.10 and we didn't face this problem -
> the application logic is the same.
>
> Also, while the application runs on both Linux and Windows, so far we have
> observed this situation only on various Windows platforms.
>
> Would really appreciate some assistance. Thanks in advance.
>
> Regards,
> Antony



-- 
Adrien




Re: How to propose a new feature

2022-04-01 Thread Adrien Grand
Just send an email with the problem that you want to solve and the
approach that you are suggesting.

On Fri, Apr 1, 2022 at 6:56 PM Baris Kazar  wrote:
>
> Resent due to need for help.
> Thanks
> 
> From: Baris Kazar
> Sent: Wednesday, March 30, 2022 2:30 PM
> To: java-user@lucene.apache.org 
> Cc: Baris Kazar 
> Subject: How to propose a new feature
>
> Hi Everyone,-
> What is the process to propose a new feature for Core Lucene engine?
> Best regards



-- 
Adrien




Re: TF in MoreLikeThis

2022-04-01 Thread Adrien Grand
From a quick look, your suggestion of passing the term frequency to
TFIDFSimilarity#tf makes sense.

Would you like to contribute this change? You can find contributing
guidelines here:
https://github.com/apache/lucene/blob/main/CONTRIBUTING.md.
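A pure-math sketch of the suggested fix (not the actual MoreLikeThis code): route the raw frequency through the similarity's tf() — a square root in ClassicSimilarity — before multiplying by IDF, instead of using the raw frequency directly.

```java
public class TfIdfSketch {
  // ClassicSimilarity's tf: square root of the raw term frequency.
  static float tf(int freq) {
    return (float) Math.sqrt(freq);
  }

  // Proposed: score = tf(freq) * idf rather than freq * idf, so TF no
  // longer gets a disproportionate weight for frequently repeated terms.
  static float termScore(int freq, float idf) {
    return tf(freq) * idf;
  }
}
```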

On Thu, Mar 31, 2022 at 11:46 PM Petko Minkov  wrote:
>
> Hi,
>
> I was looking at Lucene's code for MoreLikeThis, specifically this line:
> https://github.com/apache/lucene/blob/69b040fc6292ac47d7f7fc8bc3b7fd601794e54b/lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThis.java#L640
>
> It looks like in ClassicSimilarity, TF is a square root, but in the code TF
> is used without the ClassicSimilarity::tf() function called. Is that a bug
> - it will make TF have a disproportionately higher weight compared to IDF?
>
> --Petko



-- 
Adrien




Re: Call for Presentations now open, ApacheCon North America 2022

2022-03-31 Thread Adrien Grand
Thanks Michael for helping spread the word about Lucene's new vector
search capabilities!

On Thu, Mar 31, 2022 at 7:36 AM Michael Wechner
 wrote:
>
> ok :-) thanks!
>
> Anyway, if somebody would like to join re a "vector search" proposal,
> please let me know
>
> Michael
>
> > On 30.03.22 at 20:13, Anshum Gupta wrote:
> > Hi Michael,
> >
> > I'd highly recommend submitting a proposal irrespective of what other folks
> > decide. Your submission would be reviewed independently and if there is
> > another proposals that clashes, the abstract would help the program
> > committee pick the one (or both) that's best suited for the audience.
> >
> > Good luck!
> >
> > -Anshum
> >
> > On Wed, Mar 30, 2022 at 5:47 AM Michael Wechner 
> > wrote:
> >
> >> Hi Together
> >>
> >> I would be interested to submit a proposal/presentation re Lucene's
> >> vector search,  but would like to ask first whether somebody else wants
> >> to do this as well or might be interested to do this together?
> >>
> >> Thanks
> >>
> >> Michael
> >>
> >> On 30.03.22 at 14:16, Rich Bowen wrote:
> >>> [You are receiving this because you are subscribed to one or more user
> >>> or dev mailing list of an Apache Software Foundation project.]
> >>>
> >>> ApacheCon draws participants at all levels to explore “Tomorrow’s
> >>> Technology Today” across 300+ Apache projects and their diverse
> >>> communities. ApacheCon showcases the latest developments in ubiquitous
> >>> Apache projects and emerging innovations through hands-on sessions,
> >>> keynotes, real-world case studies, trainings, hackathons, community
> >>> events, and more.
> >>>
> >>> The Apache Software Foundation will be holding ApacheCon North America
> >>> 2022 at the New Orleans Sheraton, October 3rd through 6th, 2022. The
> >>> Call for Presentations is now open, and will close at 00:01 UTC on May
> >>> 23rd, 2022.
> >>>
> >>> We are accepting presentation proposals for any topic that is related
> >>> to the Apache mission of producing free software for the public good.
> >>> This includes, but is not limited to:
> >>>
> >>> Community
> >>> Big Data
> >>> Search
> >>> IoT
> >>> Cloud
> >>> Fintech
> >>> Pulsar
> >>> Tomcat
> >>>
> >>> You can submit your session proposals starting today at
> >>> https://cfp.apachecon.com/
> >>>
> >>> Rich Bowen, on behalf of the ApacheCon Planners
> >>> apachecon.com
> >>> @apachecon
> >>>
> >>>
> >>
> >>
> >>
>
>
>


-- 
Adrien




Re: Re: Custom scores and sort

2022-03-23 Thread Adrien Grand
Sorry Claude, but I have some trouble following what you are doing
with your CustomScoreQuery. It feels like your query is doing
something that breaks some assumptions that Lucene makes.

Have you looked at existing ways that Lucene supports boosting
documents by recency, such as putting a LongDistanceFeatureQuery as a
SHOULD clause in a BooleanQuery?
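As a rough sketch of why such a SHOULD clause boosts recency: a distance-feature score has a saturation shape, contributing at most `weight` for a document at the origin and decaying as the timestamp moves away. The formula and names below are illustrative assumptions, not Lucene's exact implementation:

```java
// Sketch of a saturation-style recency bonus. The contribution is capped at
// `weight` and halves once the document is `pivot` milliseconds away from
// `origin` (typically "now").
public class RecencySketch {
    static double recencyBonus(double weight, long origin, long pivot, long timestamp) {
        long distance = Math.abs(timestamp - origin);
        return weight * pivot / (double) (pivot + distance);
    }

    public static void main(String[] args) {
        System.out.println(recencyBonus(2.0, 1000L, 100L, 1000L)); // 2.0: doc at the origin
        System.out.println(recencyBonus(2.0, 1000L, 100L, 900L));  // 1.0: doc one pivot away
    }
}
```

Wired into a BooleanQuery, the main query would be the MUST clause and the recency query the SHOULD clause, so matching is unchanged and only the score of already-matching documents is lifted.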

On Mon, Mar 14, 2022 at 7:00 PM Claude Lepere  wrote:
>
> Adrien, thank you for your answer and sorry for the lack of clarity.
>
> No, the score of a document does not depend on the score of another
> document, the problem lies within a document.
>
> There are several "only once score" fields; to simplify, I suppose there is
> only one "only once score" field;
> a document can contain several times this "only once score" field with
> different values;
> a query can contain several clauses on the different values of this field
> and these clauses can be SHOULD or MUST.
> But for such a document, the score of this field should only be counted on
> the first pass through my CustomScoreQuery subclass, on subsequent passes,
> the custom score = 0 ;
> to process so, the constructor of the subclass has as argument the map "my
> document id (not Lucene doc!) to the field".
>
>  Then, the score of the first pass is multiplied by a date factor which
> depends on the age of the document (age = maximum date of the query results
> - date of the document):
> the score of a document decreases with its age.
>
> The total score (field + date) is correctly calculated, but the explanation
> log shows that the sort score (the first element of fields[]) is not the
> total score but the total score minus the "only once score" or to put it
> another way, a total score where the "only once score" = 0, and that's why
> a hit with a lower total score happens to be ranked before a hit with a
> higher total score.
>
> The log of my CustomScoreQuery subclass shows that even if the document
> contains only one "only once score" field,
> Lucene passes the CustomScoreProvider's customScore method twice, so the
> score = 0 and it seems to me that this value is retained for the sort score.
>
> I did not find why a TopFieldDocs search (with Sort = SortField.FIELD_SCORE
> and date) uses the "diminished" score and not the total score, as TopDocs
> does.
>
>
> Thanks in advance.
>
>
> Claude Lepère
>
> On 2022/03/14 12:59:45 Adrien Grand wrote:
> > It's a bit hard for me to parse what you are trying to do, but it
> > looks like you are making assumptions about how Lucene works
> > internally that are not correct.
> >
> > Do I understand correctly that your scoring mechanism has dependencies
> > on other documents, ie. the score of a document could depend on the
> > score of other documents? This is something that Lucene doesn't
> > support.
> >
> > On Thu, Mar 10, 2022 at 12:23 PM Claude Lepere  wrote:
> > >
> > > Hi.
> > > The problem is that, although sorting by score, a match with a lower
> > > score is ranked before a match with a greater score.
> > > The origin of the problem lies in a subclass of CustomScoreQuery which
> > > calculates an "only once" score for each document: on the first pass the
> > > document gets its score and, if the document contains several times the
> > > same field, on the subsequent passes it gets 0.
> > > I wonder if it is possible for Lucene to give a score that depends on a
> > > previous pass in the CustomScoreProvider customScore routine for the
> same
> > > document.
> > > I ran 2 searches with IndexSearcher: the first one returns a TopDocs
> which
> > > is sorted by default by relevance, and the second search - with the Sort
> > > array = [SortField.FIELD_SCORE, a date SortField] argument - returns a
> > > TopFieldDocs.
> > > The TopDocs results are sorted by the score with the first pass value of
> > > the only once method while the TopFieldDocs results are sorted by the
> score
> > > with the value (= 0) of the next pass, hence the ranking errors.
> > > I did not find why does the TopFieldDocs search not use to sort the
> score
> > > of the hit, as the TopDocs search?
> > > I did not find how to tell the TopFieldDocs search to use the hit score
> to
> > > sort.
> > >
> > > Claude Lepère
> >
> >
> >
> > --
> > Adrien
> >
> >
> >



-- 
Adrien




Re: LongDistanceFeatureQuery for DoublePoint

2022-03-23 Thread Adrien Grand
Hi Puneeth,

Doubles are always a bit more tricky due to rounding for arithmetic
operations, but this should still be doable.

Out of curiosity, what sort of data do your double fields store? This
query had been added with the idea that it would be useful for
timestamp fields in order to boost hits by recency. What is your
use-case for adding similar functionality to double fields?

On Wed, Mar 23, 2022 at 12:38 AM Puneeth Bikkumanla
 wrote:
>
> Hello,
> I was wondering if there is anything similar to the
> LongDistanceFeatureQuery for DoublePoint. We are currently converting our
> doubles into longs in order to use this feature but would like to switch
> off of that. If nothing exists, are there any immediate challenges that
> people foresee for implementing a "DoubleDistanceFeatureQuery" for
> DoublePoint?



-- 
Adrien




[ANNOUNCE] Apache Lucene 9.1.0 released

2022-03-22 Thread Adrien Grand
The Lucene PMC is pleased to announce the release of Apache Lucene 9.1.0.

Apache Lucene is a high-performance, full-featured search engine
library written entirely in Java. It is a technology suitable for
nearly any application that requires structured search, full-text
search, faceting, nearest-neighbor search across high-dimensionality
vectors, spell correction or query suggestions.

This release contains numerous bug fixes, optimizations, and
improvements, some of which are highlighted below. The release is
available for immediate download at:
  https://lucene.apache.org/core/downloads.html

Lucene 9.1.0 Release Highlights

New features

 - Lucene JARs are now proper Java modules, with module descriptors
and dependency information
 - Support for filtering in nearest-neighbor vector search
 - Support for intervals queries in the standard query syntax
 - A new token filter SpanishPluralStemFilter for precise stemming of
Spanish plurals

Optimizations

 - Up to 30% improvement in index throughput for high-dimensional vectors
 - Up to 10% faster nearest neighbor searches on high-dimensional vectors
 - Faster execution of "count" searches across different query types
 - Faster counting for taxonomy facets
 - Several other search speed-ups, including improvements to
PointRangeQuery, MultiRangeQuery, and CoveringRangeQuery

Other

 - The test framework is now a module, so all classes have been moved
to org.apache.lucene.tests.* to avoid package name conflicts
 - Lucene now faithfully implements the HNSW algorithm for nearest
neighbor search by supporting multiple graph layers

… plus a number of helpful bug fixes!

Further details of changes are available in the change log available at:
  http://lucene.apache.org/core/9_1_0/changes/Changes.html
and the migration guide available at:
  https://lucene.apache.org/core/9_1_0/MIGRATE.html

Please report any feedback to the mailing lists
(http://lucene.apache.org/core/discussion.html)

-- 
Adrien




Re: FacetsCollector ScoreMode

2022-03-21 Thread Adrien Grand
+1 to adjusting the ScoreMode based on keepScores.

On Mon, Mar 21, 2022 at 5:47 PM Mike Drob  wrote:
>
> Hey all,
>
> I was looking into some performance issues and was a little confused about
> one aspect of FacetsCollector - why does it always specify
> ScoreMode.COMPLETE?
>
> Especially for the case where we are counting facets, without collecting
> the documents, it seems like we should be able to get away without scoring.
> I've tested it locally and it seems to work, but I'm wondering what nuance
> I am missing.
>
> The default behaviour is keepScores == false, so I feel like we should be
> able to adjust the score mode used based on that.
>
> Thanks,
> Mike



-- 
Adrien




Re: Custom scores and sort

2022-03-14 Thread Adrien Grand
It's a bit hard for me to parse what you are trying to do, but it
looks like you are making assumptions about how Lucene works
internally that are not correct.

Do I understand correctly that your scoring mechanism has dependencies
on other documents, ie. the score of a document could depend on the
score of other documents? This is something that Lucene doesn't
support.

On Thu, Mar 10, 2022 at 12:23 PM Claude Lepere  wrote:
>
> Hi.
> The problem is that, although sorting by score, a match with a lower score is
> ranked before a match with a greater score.
> The origin of the problem lies in a subclass of CustomScoreQuery which
> calculates an "only once" score for each document: on the first pass the
> document gets its score and, if the document contains several times the
> same field, on the subsequent passes it gets 0.
> I wonder if it is possible for Lucene to give a score that depends on a
> previous pass in the CustomScoreProvider customScore routine for the same
> document.
> I ran 2 searches with IndexSearcher: the first one returns a TopDocs which
> is sorted by default by relevance, and the second search - with the Sort
> array = [SortField.FIELD_SCORE, a date SortField] argument - returns a
> TopFieldDocs.
> The TopDocs results are sorted by the score with the first pass value of
> the only once method while the TopFieldDocs results are sorted by the score
> with the value (= 0) of the next pass, hence the ranking errors.
> I did not find why does the TopFieldDocs search not use to sort the score
> of the hit, as the TopDocs search?
> I did not find how to tell the TopFieldDocs search to use the hit score to
> sort.
>
> Claude Lepère



-- 
Adrien




Re: DocValuesIterator: advance vs advanceExact

2022-02-03 Thread Adrien Grand
Hi Alexander,

In general, advance(target) is best used to implement queries and
advanceExact(target) for collectors.

See javadocs for advanceExact(target), this method may only be called
on doc IDs that are between 0 included and maxDoc excluded.
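A toy iterator makes the two contracts concrete. This is a self-contained illustration of the semantics only, not Lucene's DocIdSetIterator (which, among other things, is strictly forward-only):

```java
import java.util.Arrays;
import java.util.TreeSet;

// advance(target) jumps to the first doc >= target that has a value, which is
// what query execution needs; advanceExact(target) positions on `target`
// itself and reports whether that exact doc has a value, which is what a
// collector needs when it is handed a doc ID to look up.
public class ToyDocValuesIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final TreeSet<Integer> docsWithValue;

    ToyDocValuesIterator(Integer... docs) {
        docsWithValue = new TreeSet<>(Arrays.asList(docs));
    }

    int advance(int target) {
        Integer next = docsWithValue.ceiling(target);
        return next == null ? NO_MORE_DOCS : next;
    }

    boolean advanceExact(int target) {
        return docsWithValue.contains(target);
    }

    public static void main(String[] args) {
        ToyDocValuesIterator it = new ToyDocValuesIterator(3, 7);
        System.out.println(it.advance(4));      // 7: skipped past docs without values
        System.out.println(it.advanceExact(4)); // false: doc 4 has no value
        System.out.println(it.advanceExact(7)); // true
    }
}
```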

On Thu, Feb 3, 2022 at 10:00 AM Alexander Buloichik
 wrote:
>
> Hi.
>
> I'm trying to retrieve field values in my own LeafCollector (Lucene 8.11.1). 
> But I didn't find a good tutorial on how to do it.
> So, I get SortedSetDoc from LeafReader, then Lucene calls my implementation 
> of LeafCollector.collect() method, and I try to get values from SortedSetDoc 
> inside my implementation of LeafCollector.collect() method.
>
> What should I call to move SortedSetDoc's pointer to the required
> document (the 'doc' parameter of LeafCollector.collect(int doc))? advance(doc) or
> advanceExact(doc)?
>
> I'm using SortedSetDoc.advanceExact() and it works good.
>
> But when I'm trying to call BinaryDocValues.advanceExact() for other field, 
> it always returns true (see Lucene80DocValuesProducer.java:685), even in case 
> doc >= maxDoc.
> If I'm using advance(doc) and checking that "advance(doc)==doc" (against case 
> of "document number is greater to target") - it works good.
>
> But SortedSetDoc.advance(doc) always returns 'doc+1', not a 'doc' 
> (IndexedDISI.java:385).
>
> --
> Alexander Buloichik
>


-- 
Adrien




Re: Lucene 6.5.1 source code

2022-02-01 Thread Adrien Grand
You can find the 6.5.1 source code on the old lucene-solr repository:
https://github.com/apache/lucene-solr/tree/releases/lucene-solr%2F6.5.1

On Tue, Feb 1, 2022 at 2:54 PM Omri  wrote:
>
> It seems that the old version branches on GitHub were deleted.
> Is there a way to see the Lucene 6.5.1 source code?



-- 
Adrien




Re: Migration from Lucene 5.5 to 8.11.1

2022-01-12 Thread Adrien Grand
The log says what the problem is: version 8.11.1 cannot read indices
created by Lucene 5.5, you will need to reindex your data.

On Wed, Jan 12, 2022 at 3:41 PM  wrote:
>
>



-- 
Adrien




Re: Want explanation on lucene norms

2022-01-05 Thread Adrien Grand
Hi,

Norms are inputs to the score that are independent from the query. It
is typically computed as a function of the number of terms of a
document: the more terms, the higher the normalization factor and the
lower the score.

Lucene computes and indexes length normalization factors automatically
for Text fields at index-time, and automatically uses them at search
time. There is nothing to do. The way normalization scores are
computed and folded into the final score is controlled by the
org.apache.lucene.search.similarities.Similarity.

Boosts are factors to the score that are independent from the
document. You can set them on a query via Lucene's
org.apache.lucene.search.BoostQuery. (Note that old Lucene versions
had a concept of index-time boosts, which would get combined with the
length normalization factor, but it was not recommended and it has
been removed.)
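As a sketch of how a length norm typically folds into a score, here is the BM25 shape written as plain arithmetic. The constants are BM25's usual defaults; this illustrates the role of the norm, and is not a claim about any specific Lucene version's default Similarity:

```java
// The norm enters through the length ratio docLen / avgLen in the
// denominator: a longer-than-average document inflates the denominator and
// deflates the score, a shorter one does the opposite.
public class Bm25Sketch {
    static double termScore(double idf, double tf, double k1, double b,
                            double docLen, double avgLen) {
        double norm = k1 * (1 - b + b * docLen / avgLen);
        return idf * tf * (k1 + 1) / (tf + norm);
    }

    public static void main(String[] args) {
        double idf = 1.0, tf = 2.0, k1 = 1.2, b = 0.75;
        // Same term frequency, different document lengths:
        System.out.println(termScore(idf, tf, k1, b, 50, 100));  // short doc scores higher
        System.out.println(termScore(idf, tf, k1, b, 200, 100)); // long doc scores lower
    }
}
```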

On Mon, Dec 20, 2021 at 6:08 PM Sowmiya M 4085  wrote:
>
> I just want to know what norms are in Lucene 4.10.4.
> How do I implement norms in a program?
> What are their types?
> What is the difference between boost and norms?
> Sample programs on norms



-- 
Adrien




Re: Lucene 9.0.0 inconsistent index options

2021-12-14 Thread Adrien Grand
This looks related to the new changes around schema validation. Lucene
now requires a field to either be absent from a document or be indexed
with the exact same options (index options, points dimensions, norms,
doc values type, etc.) as already indexed documents that also have
this field.

However it's a bug that Lucene fails to open an index that was legal
in Lucene 8. Can you file a JIRA issue?

On Mon, Dec 13, 2021 at 4:23 PM Ian Lea  wrote:
>
> Hi
>
>
> We have a long-standing index with some mandatory fields and some optional
> fields that has been through multiple lucene upgrades without a full
> rebuild and on testing out an upgrade from version 8.11.0 to 9.0.0, when
> open an IndexWriter we are hitting the exception
>
> Exception in thread "main" java.lang.IllegalArgumentException: cannot
> change field "language" from index options=NONE to inconsistent index
> options=DOCS
> at
> org.apache.lucene.index.FieldInfo.verifySameIndexOptions(FieldInfo.java:245)
> at
> org.apache.lucene.index.FieldInfos$FieldNumbers.verifySameSchema(FieldInfos.java:421)
> at
> org.apache.lucene.index.FieldInfos$FieldNumbers.addOrGet(FieldInfos.java:357)
> at
> org.apache.lucene.index.IndexWriter.getFieldNumberMap(IndexWriter.java:1263)
> at org.apache.lucene.index.IndexWriter.(IndexWriter.java:1116)
>
> Where language is one of our optional fields.
>
> Presumably this is at least somewhat related to "Index options can no
> longer be changed dynamically" as mentioned at
> https://lucene.apache.org/core/9_0_0/MIGRATE.html although it fails before
> our code attempts to update the index, and we are not trying to change any
> index options.
>
> Adding some displays to IndexWriter and FieldInfos and logging rather than
> throwing the exception I see
>
>  language curr=NONE, other=NONE
>  language curr=NONE, other=NONE
>  language curr=NONE, other=NONE
>  language curr=NONE, other=NONE
>  language curr=NONE, other=NONE
>  language curr=NONE, other=NONE
>  language curr=NONE, other=NONE
>  language curr=NONE, other=NONE
>  language curr=NONE, other=DOCS
>  language curr=NONE, other=NONE
>  language curr=NONE, other=NONE
>  language curr=NONE, other=NONE
>  language curr=NONE, other=NONE
>  language curr=NONE, other=NONE
>  language curr=NONE, other=NONE
>  language curr=NONE, other=NONE
>  language curr=NONE, other=NONE
>  language curr=NONE, other=NONE
>  language curr=NONE, other=DOCS
>  language curr=NONE, other=DOCS
>  language curr=NONE, other=DOCS
>  language curr=NONE, other=DOCS
>  language curr=NONE, other=DOCS
>  language curr=NONE, other=DOCS
>  language curr=NONE, other=DOCS
>  language curr=NONE, other=DOCS
>
> where there is one line per segment.  It logs the exception whenever
> other=DOCS.  Subset with segment info:
>
> segment _x8(8.2.0):c31753/-1:[diagnostics={timestamp=1565623850605,
> lucene.version=8.2.0, java.vm.version=11.0.3+7, java.version=11.0.3,
> mergeMaxNumSegments=-1, os.version=3.1.0-1.2-desktop,
> java.vendor=AdoptOpenJDK, source=merge, os.arch=amd64, mergeFactor=10,
> java.runtime.version=11.0.3+7,
> os=Linux}]:[attributes={Lucene50StoredFieldsFormat.mode=BEST_SPEED}]
>
>  language curr=NONE, other=NONE
>
> segment _y9(8.7.0):c43531/-1:[diagnostics={timestamp=1604597581562,
> lucene.version=8.7.0, java.vm.version=11.0.3+7, java.version=11.0.3,
> mergeMaxNumSegments=-1, os.version=3.1.0-1.2-desktop,
> java.vendor=AdoptOpenJDK, source=merge, os.arch=amd64, mergeFactor=10,
> java.runtime.version=11.0.3+7,
> os=Linux}]:[attributes={Lucene87StoredFieldsFormat.mode=BEST_SPEED}]
>
>  language curr=NONE, other=DOCS
>
> NOT throwing java.lang.IllegalArgumentException: cannot change field
> "language" from index options=NONE to inconsistent index options=DOCS
>
>
> Some variation on an old-fashioned not set versus not present bug perhaps?
>
>
> --
> Ian.



-- 
Adrien




[ANNOUNCE] Apache Lucene 9.0.0 released

2021-12-07 Thread Adrien Grand
The Lucene PMC is pleased to announce the release of Apache Lucene 9.0.

Apache Lucene is a high-performance, full-featured search engine
library written entirely in Java. It is a technology suitable for
nearly any application that requires structured search, full-text
search, faceting, nearest-neighbor search across high-dimensionality
vectors, spell correction or query suggestions.

This release contains numerous features, optimizations, and
improvements, some of which are highlighted below. The release is
available for immediate download at:
  https://lucene.apache.org/core/downloads.html

Lucene 9.0 Release Highlights

System requirements
 - Lucene 9.0 requires JDK 11 or newer

New features
 - Support for indexing high-dimensionality numeric vectors to perform
nearest-neighbor search, using the Hierarchical Navigable Small World
graph algorithm
 - New Analyzers for Serbian, Nepali, and Tamil languages
 - IME-friendly autosuggest for Japanese
 - Snowball 2, adding Hindi, Indonesian, Nepali, Serbian, Tamil, and
Yiddish stemmers
 - New normalization/stemming for Swedish and Norwegian

Optimizations
 - Up to 400% faster taxonomy faceting
 - 10-15% faster indexing of multi-dimensional points
 - Several times faster sorting on fields that are indexed with
points. This optimization used to be an opt-in in late 8.x releases
and is now opt-out as of 9.0.
 - ConcurrentMergeScheduler now assumes fast I/O, likely improving
indexing speed in case where heuristics would incorrectly detect
whether the system had modern I/O or not
 - Encoding of postings lists changed from FOR-delta to PFOR-delta to
save further disk space

Other
 - File formats have all been changed from big-endian order to little
endian order
 - Lucene 9 no longer has split packages. This required renaming some
packages outside of the lucene-core JAR, so you will need to adjust
some imports accordingly.
 - Using Lucene 9 with the module system should be considered
experimental. We expect to make progress on this in future 9.x
releases.

Further details of changes are available in the change log available at:
  http://lucene.apache.org/core/9_0_0/changes/Changes.html
and the migration guide available at:
  https://lucene.apache.org/core/9_0_0/MIGRATE.html

Please report any feedback to the mailing lists
(http://lucene.apache.org/core/discussion.html)

-- 
Adrien




Re: index file of lucene8.7 is larger than the 7.7

2021-12-07 Thread Adrien Grand
As a disclaimer, it can be misleading to draw conclusions on space
efficiency based on such a small index.

Can you compare file sizes by extension across 7.7 and 8.7? You might
need to call IndexWriterConfig#setUseCompoundFile(false) to prevent
the flush from wrapping your segment files in a compound file.
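A small helper along these lines can do the per-extension comparison once the file sizes of each index directory are collected (the names and driver data are assumptions for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Totals sizes grouped by file extension, given a map of file name -> size in
// bytes. Run it once per index (7.7 and 8.7) and diff the two outputs to see
// which part of the format grew.
public class SizeByExtension {
    static Map<String, Long> totalByExtension(Map<String, Long> fileSizes) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, Long> e : fileSizes.entrySet()) {
            String name = e.getKey();
            int dot = name.lastIndexOf('.');
            String ext = dot < 0 ? "(none)" : name.substring(dot + 1);
            totals.merge(ext, e.getValue(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("_0.tim", 40L);
        sizes.put("_0.doc", 25L);
        sizes.put("_1.doc", 10L);
        System.out.println(totalByExtension(sizes)); // {doc=35, tim=40}
    }
}
```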

On Wed, Nov 17, 2021 at 6:28 AM xiaoshi  wrote:
>
> Hi, everyone.
> I found that the index files of Lucene 8.7 are larger than those of version 7.7:
> My data source: lucene/demo/src/test/org/apache/lucene/demo/test-files/docs
> The index code is as follows:
>   InputStream stream = Files.newInputStream(file);
>   Document doc = new Document();
>   Field pathField = new StringField("path", file.toString(), Field.Store.YES);
> doc.add(pathField);
> doc.add(new LongPoint("modified", lastModified));
> doc.add(new TextField("contents", new BufferedReader(new
> InputStreamReader(stream, StandardCharsets.UTF_8))));
>
>
> Index size
> 8.7: 136K
> 7.7: 116K
> I guess it is caused by LUCENE-9027?
> Can anyone tell me why?



-- 
Adrien




[ANNOUNCE] Apache Lucene 8.11.0 released

2021-11-16 Thread Adrien Grand
The Lucene PMC is pleased to announce the release of Apache Lucene 8.11.

Apache Lucene is a high-performance, full-featured text search engine
library written entirely in Java. It is a technology suitable for nearly
any application that requires full-text search, especially cross-platform.

This release contains numerous bug fixes, features, optimizations, and
improvements, some of which are highlighted below. The release is available
for immediate download at:
  https://lucene.apache.org/core/downloads.html

Lucene 8.11 Release Highlights
 - Facets now properly ignore deleted documents when accumulating facet
counts for all documents.
 - CheckIndex can run concurrently.

Further details of changes are available in the change log available at:
http://lucene.apache.org/core/8_11/changes/Changes.html

Please report any feedback to the mailing lists (
http://lucene.apache.org/core/discussion.html).

-- 
Adrien


Re: Need help on aggregation of nested documents

2021-11-16 Thread Adrien Grand
Indeed you shouldn't load all hits, you should register a
org.apache.lucene.search.Collector that will aggregate data while matches
are being collected.

Since you are already using a ToChildBlockJoinQuery, you should be able to
use it in conjunction with utility classes from lucene/facets. Have you
looked into it already?
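Stripped of Lucene types, the aggregate-while-collecting idea looks like the sketch below; the real thing would live inside an org.apache.lucene.search.Collector, this is only the pattern:

```java
// Keeps a running statistic per matching document instead of loading stored
// documents for every hit.
public class SumAggregator {
    private long count;
    private double sum;

    void collect(double valueForDoc) { // would be invoked once per matching doc
        count++;
        sum += valueForDoc;
    }

    double average() {
        return count == 0 ? 0 : sum / count;
    }

    public static void main(String[] args) {
        SumAggregator agg = new SumAggregator();
        agg.collect(1.0);
        agg.collect(2.0);
        agg.collect(3.0);
        System.out.println(agg.average()); // 2.0
    }
}
```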

On Tue, Nov 16, 2021 at 7:30 AM Gopal Sharma
 wrote:

> Hi Adrien,
>
> Thanks for the reply.
>
> I am able to retrieve the child docIds using the ToChildBlockJoinQuery.
> Now, to do aggregates, I need to find each document using
> reader.document(int docID), right? If that is the case, wouldn't getting all
> the documents be a costly operation before finally doing the
> aggregates?
>
> Is there any other way around this?
>
> Thanks
> Gopal Sharma
>
>
>
>
>
>
>
> On Mon, Nov 15, 2021 at 10:36 PM Adrien Grand  wrote:
>
> > It's not straightforward as we don't provide high-level tooling to do
> this.
> > You need to use the BitSetProducer that you pass to the
> > ToParentBlockJoinQuery in order to resolve the range of child doc IDs
> for a
> > given parent doc ID (see e.g. how ToChildBlockJoinQuery does it), and
> then
> > aggregate over these child doc IDs.
> >
> > On Mon, Nov 15, 2021 at 6:06 AM Gopal Sharma
> >  wrote:
> >
> > > Hi Team,
> > >
> > > I have a document structure as a customer which itself has few
> attributes
> > > like gender, location etc.
> > >
> > > Each customer will have a list of facts like transaction, product views
> > > etc.
> > >
> > > I want to do an aggregation of the facts. For example find all
> customers
> > > who are from a specific location and have done transactions worth more
> > than
> > > 500$ between two date ranges.
> > >
> > > The queries can go deeper than this.
> > >
> > > Thanks in advance.
> > >
> > > Gopal Sharma
> > >
> >
> >
> > --
> > Adrien
> >
>


-- 
Adrien


Re: Need help on aggregation of nested documents

2021-11-15 Thread Adrien Grand
It's not straightforward as we don't provide high-level tooling to do this.
You need to use the BitSetProducer that you pass to the
ToParentBlockJoinQuery in order to resolve the range of child doc IDs for a
given parent doc ID (see e.g. how ToChildBlockJoinQuery does it), and then
aggregate over these child doc IDs.
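To make the doc-ID arithmetic concrete: under the block-join convention the parent is the last document of its block, so the children of a parent are exactly the docs after the previous parent bit and before the parent itself. A self-contained sketch using java.util.BitSet (illustrating the convention, not the BitSetProducer API):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class ChildRangeSketch {
    // Children of `parentDoc` are the docs strictly between the previous set
    // parent bit and `parentDoc` itself. previousSetBit(-1) returns -1, so the
    // first block falls out naturally.
    static List<Integer> childDocs(BitSet parentBits, int parentDoc) {
        int prevParent = parentBits.previousSetBit(parentDoc - 1);
        List<Integer> children = new ArrayList<>();
        for (int doc = prevParent + 1; doc < parentDoc; doc++) {
            children.add(doc);
        }
        return children;
    }

    public static void main(String[] args) {
        BitSet parents = new BitSet();
        parents.set(2); // docs 0-1 are its children
        parents.set(5); // docs 3-4 are its children
        System.out.println(childDocs(parents, 5)); // [3, 4]
    }
}
```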

On Mon, Nov 15, 2021 at 6:06 AM Gopal Sharma
 wrote:

> Hi Team,
>
> I have a document structure as a customer which itself has few attributes
> like gender, location etc.
>
> Each customer will have a list of facts like transaction, product views
> etc.
>
> I want to do an aggregation of the facts. For example find all customers
> who are from a specific location and have done transactions worth more than
> 500$ between two date ranges.
>
> The queries can go deeper than this.
>
> Thanks in advance.
>
> Gopal Sharma
>


-- 
Adrien


Re: Using setIndexSort on a binary field

2021-10-15 Thread Adrien Grand
Hi Alex,

You need to use a BinaryDocValuesField so that the field is indexed with
doc values.

`Field` is not going to work because it only indexes the data while index
sorting requires doc values.
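As an aside on the java.util.Arrays.compareUnsigned idea from the question: signed and unsigned byte order genuinely differ, which is why unsigned comparison matters for binary sort keys:

```java
import java.util.Arrays;

public class UnsignedCompareDemo {
    public static void main(String[] args) {
        byte[] a = {(byte) 0xFF}; // 255 unsigned, -1 signed
        byte[] b = {(byte) 0x01};
        System.out.println(Arrays.compareUnsigned(a, b) > 0); // true: 255 > 1
        System.out.println(Arrays.compare(a, b) < 0);         // true: -1 < 1
    }
}
```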

On Fri, Oct 15, 2021 at 6:40 PM Alex K  wrote:

> Hi all,
>
> Could someone point me to an example of using the
> IndexWriterConfig.setIndexSort for a field containing binary values?
>
> To be specific, the fields are constructed using the Field(String name,
> byte[] value, IndexableFieldType type) constructor, and I'd like to try
> using the java.util.Arrays.compareUnsigned method to sort the fields.
>
> Thanks,
> Alex
>


-- 
Adrien


Re: org.apache.lucene.search.BooleanWeight.bulkScorer() and BulkScorer.score()

2021-10-05 Thread Adrien Grand
Hmm we should fix these access$ accessors by fixing the visibility of some
fields.

These breakdowns do not necessarily signal that something is wrong. Is the
query executing fast overall?

On Mon, Oct 4, 2021 at 11:57 PM Baris Kazar  wrote:

> Hi, -
> I did more experiments and this time i looked into these methods:
> org.apache.lucene.search.BooleanWeight.bulkScorer() and BulkScorer.score()
>
>
> Lets start with BooleanWeight.bulkScorer() with its call tree and time
> spent:
>
>
> BooleanWeight.bulkScorer()
> -->> Weight.bulkScorer()
> -->>-->> BooleanWeight.scorer()
> -->>-->>-->>BooleanWeight.scorerSupplier()
> -->>-->>-->>-->> Weight.scorerSupplier()
> -->>-->>-->>-->>-->> TermQuery$Termweight.scorer()
> -->>-->>-->>-->>-->>-->>
> org.apache.lucene.codecs.blocktree.SegmentTermsEnum.impacts()
> -->>-->>-->>-->>-->>-->>-->>
> org.apache.lucene.codecs.lucene84.Lucene84PostingsReader.impacts()
> -->>-->>-->>-->>-->>-->>-->>-->>
> org.apache.lucene.codecs.lucene84.Lucene84PostingsReader$BlockImpactsDocEnums.init()
> -->>-->>-->>-->>-->>-->>-->>-->>-->>
> org.apache.lucene.codecs.lucene84.Lucene84SkipReader.init()
> -->>-->>-->>-->>-->>-->>-->>-->>-->>-->>
> org.apache.lucene.codecs.MultiLevelSkipListReader.init()
> -->>-->>-->>-->>-->>-->>-->>-->>-->>-->>-->>
> org.apache.lucene.codecs.MultiLevelSkipListReader.loadSkipLevels()
> -->>-->>-->>-->>-->>-->>-->>-->>-->>-->>-->>-->>
> org.apache.lucene.store.DataInput.readVLong() (constittutes %100 of
> BooleanWeight.bulkScorer() time here)
>
>
>
> Next: BulkScorer.score() with its call tree and time spent:
>
>
>
> BulkScorer.score()
> -->> Weight$DefaultBulkScorer.score()
> -->>-->> Weight$DefaultBulkScorer.scoreAll()
> -->>-->>-->> WANDScorer$1.nextDoc()
> -->>-->>-->>-->> WANDScorer$1.advance()
> -->>-->>-->>-->>-->> WANDScorer.access$300() (constitutes %65 of
> BulkScorer.score() time here)
> -->>-->>-->>-->>-->> WANDScorer.access$100() (constitutes %30 of
> BulkScorer.score() time here)
> -->>-->>-->>-->>-->> WANDScorer.access$400() (constitutes %5 of
> BulkScorer.score() time here)
>
> Best regards
>
> 
> From: Baris Kazar 
> Sent: Saturday, October 2, 2021 3:14 PM
> To: Adrien Grand ; Lucene Users Mailing List <
> java-user@lucene.apache.org>
> Cc: Baris Kazar 
> Subject: Re: org.apache.lucene.search.BooleanWeight.bulkScorer() and
> BulkScorer.score()
>
> Hi Adrien,-
> Thanks. Next week let me look at the components (units, methods) within
> BulkScorer#score to see what takes the most time among its called methods.
>
> Jvisualvm reports for a method whole time including the time spent in the
> called methods and when you go down the execution tree it goes until the
> very last called method.
>
> Regarding the second paragraph above:
> when will there be too many segments in the Lucene index? I have 1 text
> field and 1 stored (non-indexed) field.
>
> I most of the time get a couple of thousands hits and i ask for top 20 of
> them. Could this be leading to
> BooleanWeight#bulkScorer spending time?
>
> Both of these units:
> BooleanWeight#bulkScorer and BulkScorer#score spend equal amounts of time
> and totally make up
> 75% of IndexSearcher#search as i mentioned before.
>
> Thanks for the swift reply
> I appreciate very much
>
>
> Best regards
> 
> From: Adrien Grand 
> Sent: Saturday, October 2, 2021 1:44:40 AM
> To: Lucene Users Mailing List 
> Cc: Baris Kazar 
> Subject: Re: org.apache.lucene.search.BooleanWeight.bulkScorer() and
> BulkScorer.score()
>
> Is your profiler reporting inclusive or exclusive costs for each function?
> Ie. does it exclude time spent in functions that are called within a
> function? I'm asking because it makes total sense for IndexSearcher#search
> to spend most of its time in BulkScorer#score, which coordinates the whole
> matching+scoring process.
>
> Having much time spent in BooleanWeight#bulkScorer is a bit surprising
> however. This suggests that you have too many segments in your index (since
> the bulk scorer needs to be recreated for every segment) or that your
> average query matches a very low number of documents (so that Lucene spends
> more time figuring out how best to find the matches versus actually finding
> these matches).
>
> On Sat, Oct 2, 2021 at 5:57 AM Baris Kazar <baris.ka...@oracle.com> wrote:
> Hi,-
>  I performance profiled my application via jvisualvm on Java
> and saw that 75% of the search process from
> org.apache.lucene.search.IndexSearcher.search() are spent on
> these units:
> org.apache.lucene.search.BooleanWeight.bulkScorer() and BulkScorer.score()
> Is there any study or project to speed up these please?
>
> Best regards
>
>
>
> --
> Adrien
>


-- 
Adrien


Re: org.apache.lucene.search.BooleanWeight.bulkScorer() and BulkScorer.score()

2021-10-01 Thread Adrien Grand
Is your profiler reporting inclusive or exclusive costs for each function?
Ie. does it exclude time spent in functions that are called within a
function? I'm asking because it makes total sense for IndexSearcher#search
to spend most of its time in BulkScorer#score, which coordinates the whole
matching+scoring process.

Having much time spent in BooleanWeight#bulkScorer is a bit surprising
however. This suggests that you have too many segments in your index (since
the bulk scorer needs to be recreated for every segment) or that your
average query matches a very low number of documents (so that Lucene spends
more time figuring out how best to find the matches versus actually finding
these matches).

On Sat, Oct 2, 2021 at 5:57 AM Baris Kazar  wrote:

> Hi,-
>  I performance profiled my application via jvisualvm on Java
> and saw that 75% of the search process from
> org.apache.lucene.search.IndexSearcher.search() are spent on
> these units:
> org.apache.lucene.search.BooleanWeight.bulkScorer() and BulkScorer.score()
> Is there any study or project to speed up these please?
>
> Best regards
>
>

-- 
Adrien


Re: Querying into a Collector visits documents multiple times

2021-09-22 Thread Adrien Grand
Hi Steven,

This collector looks correct to me. Resetting the counter to 0 on the first
segment is indeed not necessary.

We have plenty of collectors that are very similar to this one and we never
observed any double-counting issue. I would suspect an issue in the code
that calls this collector. Maybe try to print the stack trace under the `
if (context.docBase == 0) {` check to see why your collector is being
called twice?
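
To make the suggested trick concrete, here is a standalone sketch of the
`new Throwable().printStackTrace()` debugging pattern. The method below is a
hypothetical stand-in for Collector#getLeafCollector, not Lucene's API; the
point is only to show how to capture the call chain behind the suspicious
docBase-0 calls.

```java
public class LeafCollectorDebug {
    static int callsForDocBaseZero = 0;

    // Hypothetical stand-in for Collector#getLeafCollector: counts calls for
    // docBase 0 and dumps the call chain that reached it.
    static int onGetLeafCollector(int docBase) {
        if (docBase == 0) {
            callsForDocBaseZero++;
            new Throwable("getLeafCollector(docBase=0) call #" + callsForDocBaseZero)
                .printStackTrace();
        }
        return callsForDocBaseZero;
    }

    public static void main(String[] args) {
        onGetLeafCollector(0);   // expected: once per search for the first segment
        onGetLeafCollector(100); // later segments are not counted
        int calls = onGetLeafCollector(0); // a second docBase-0 call is the smoking gun
        System.out.println("getLeafCollector calls for docBase 0: " + calls);
    }
}
```

Seeing two distinct stack traces would reveal which caller runs the
collector twice.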

On Tue, Sep 21, 2021 at 9:30 PM Steven Schlansker <
stevenschlans...@gmail.com> wrote:

> Hi Lucene users,
>
> I am developing a search application that needs to do some basic
> summary statistics. We use Lucene 8.9.0.
> To improve performance for e.g. summing a value across 10,000
> documents, we are using DocValues as columnar storage.
>
> In order to retrieve the DocValues without collecting all hits into a
> TopDocs, which we determined to cause a lot of memory pressure and
> consume much time, we are using the expert Collector query interface.
>
> Here's the code, simplified a bit for the list:
>
> final var collector = new Collector() {
> long sum = 0;
>
> @Override
> public ScoreMode scoreMode() {
> return ScoreMode.COMPLETE_NO_SCORES;
> }
>
> @Override
> public LeafCollector getLeafCollector(final LeafReaderContext
> context) throws IOException {
>  if (context.docBase == 0) {
> sum = 0; // XXX: this should not be necessary?
> }
> final var subtotalValue =
> context.reader().getNumericDocValues("subtotal");
> return new LeafCollector() {
> @Override
> public void setScorer(final Scorable scorer) throws
> IOException {
> }
>
> @Override
> public void collect(final int doc) throws IOException {
> if (subtotalValue.docID() > doc ||
> !subtotalValue.advanceExact(doc) || subtotalValue.longValue() == 0) {
> return;
> }
> sum += subtotalValue.longValue();
> }
> };
> }
> }
> searcher.search(myQuery, collector);
> return collector.sum;
>
> The query is a moderately complicated Boolean query with some
> TermQuery and MultiTermQuery instances combined together.
> While first testing, I observed that seemingly the collector is called
> twice for each document, and the sum is exactly double what you would
> expect.
>
> It seems that the Collector is observing every matched document twice,
> and by printing out the Scorer, I see that it's done with two
> different BooleanScorer instances.
> You can see my hack that resets the collector every time it starts at
> docBase 0. which I am sure is not the right approach, but seems to
> work.
> What is the right pattern to ensure my Collector only observes result
> documents once, no matter the input query? I see a note in the
> documentation that state is supposed to be stored on the Scorer
> implementation, but I am not providing a custom Scorer, nor do I
> actually want any scoring at all.
>
> Thank you for any guidance!
> Steven
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-- 
Adrien


Re: Adding vs multiplicating scores when implementing "recency"

2021-09-17 Thread Adrien Grand
This is one requirement indeed. Since WAND reasons about partially
evaluated documents, it also requires that matching one more clause makes
the overall score higher, which is why we introduced the requirement that
scores must be positive in 8.0. For multiplication, this would require
scores that are greater than 1.

If someone really wanted to multiply scores, the easiest way might be to
create a query wrapper that takes the log of the scores of the wrapped
query, and rely on log(a)+log(b) = log(a * b).
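
The identity is easy to verify in plain Java (illustration only, no Lucene
involved). Note that for the summed logs to stay positive, as WAND requires,
the raw scores being multiplied must be greater than 1, per the previous
paragraph:

```java
public class LogScoreCombination {
    // Summing the logs of two clause scores is equivalent to taking
    // the log of their product.
    static double combinedLog(double a, double b) {
        return Math.log(a) + Math.log(b);
    }

    public static void main(String[] args) {
        double a = 2.5, b = 4.0; // both > 1, so each log stays positive
        System.out.println(combinedLog(a, b)); // equals Math.log(10.0)
    }
}
```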

On Fri, Sep 17, 2021 at 14:47, Michael Sokolov  wrote:

> Not advocating any particular approach here, just curious: could BMW
> also function in the presence of a doc-score (like recency) that is
> multiplied? My vague understanding is that as long as the scoring
> formula is monotonic in all of its inputs, and we have block-encoded
> the inputs, then we could compute a max score for a block?
>
> On Thu, Sep 16, 2021 at 12:41 PM Adrien Grand  wrote:
> >
> > Hello,
> >
> > You are correct that the contribution would be additive in that case. We
> > don't provide an easy way to make the contribution multiplicative.
> >
> > There is some debate about what is the best way to combine BM25 scores
> with
> > query-independent features, though in the discussions I've seen
> > contributions were summed up and the debate was more about whether they
> > should be normalized or not.
> >
> > How much recency impacts ranking indeed depends on the number of terms
> and
> > how frequent these terms are. One way that I'm interpreting the fact that
> > not everyone recommends normalizing scores is that this way the query
> score
> > dominates when the query is looking for something very specific, because
> it
> > includes many terms or because it uses very specific terms - which may
> be a
> > feature. This approach also works well for Lucene since dynamic pruning
> via
> > Block-Max WAND keeps working when query-independent features are
> > incorporated into the final score, which helps figure out the top hits
> > without having to collect all matches.
> >
> > On Thu, Sep 16, 2021 at 5:40 PM Nicolás Lichtmaier
> >  wrote:
> >
> > > On March I asked a question here that got no answers at all. As it is
> > > still something that I'd very much like to know, I'll ask again.
> > >
> > > To implement "recency" into a search you would add a boolean clause
> with
> > > a LongPoint.newDistanceFeatureQuery(), right? But that's additive,
> > > meaning that this recency will impact searches differently depending on
> > > the number of terms, right? With more terms the recency component
> > > contribution to score will be more and more "diluted". However... I
> only
> > > see examples doing it this way, and I would need to do something
> > > weird to implement a multiplicative change of the score... Am I missing
> > > something?
> > >
> > > Thanks!
> > >
> > >
> > > -
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> > >
> >
> > --
> > Adrien
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Adding vs multiplicating scores when implementing "recency"

2021-09-16 Thread Adrien Grand
Hello,

You are correct that the contribution would be additive in that case. We
don't provide an easy way to make the contribution multiplicative.

There is some debate about what is the best way to combine BM25 scores with
query-independent features, though in the discussions I've seen
contributions were summed up and the debate was more about whether they
should be normalized or not.

How much recency impacts ranking indeed depends on the number of terms and
how frequent these terms are. One way that I'm interpreting the fact that
not everyone recommends normalizing scores is that this way the query score
dominates when the query is looking for something very specific, because it
includes many terms or because it uses very specific terms - which may be a
feature. This approach also works well for Lucene since dynamic pruning via
Block-Max WAND keeps working when query-independent features are
incorporated into the final score, which helps figure out the top hits
without having to collect all matches.

On Thu, Sep 16, 2021 at 5:40 PM Nicolás Lichtmaier
 wrote:

> On March I asked a question here that got no answers at all. As it is
> still something that I'd very much like to know, I'll ask again.
>
> To implement "recency" into a search you would add a boolean clause with
> a LongPoint.newDistanceFeatureQuery(), right? But that's additive,
> meaning that this recency will impact searches differently depending on
> the number of terms, right? With more terms the recency component
> contribution to score will be more and more "diluted". However... I only
> see examples doing it this way, and I would need to do something
> weird to implement a multiplicative change of the score... Am I missing
> something?
>
> Thanks!
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-- 
Adrien


Re: How exactly the normalized length of the documents are stored in the index

2021-07-13 Thread Adrien Grand
The BM25 similarity computes the normalized length as the number of tokens,
ignoring synonyms (tokens at the same position).

Then it encodes this length as an 8-bit integer in the index using this
logic:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/SmallFloat.java#L147-L156,
which preserves a bit more than 4 significant bits.
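
As a rough illustration of what such a lossy encoding does — this is a
simplified sketch, not Lucene's actual SmallFloat.intToByte4 code — one can
keep only the top 4 significant bits of a length:

```java
public class LengthNormSketch {
    // Illustration only: keep roughly 4 significant bits of a document
    // length, zeroing the low-order bits of larger values.
    static int lossyLength(int length) {
        if (length < 16) {
            return length; // small lengths survive exactly
        }
        int shift = (32 - Integer.numberOfLeadingZeros(length)) - 4;
        return (length >> shift) << shift;
    }

    public static void main(String[] args) {
        System.out.println(lossyLength(7));    // 7   (exact)
        System.out.println(lossyLength(17));   // 16  (rounded down)
        System.out.println(lossyLength(1000)); // 960 (rounded down)
    }
}
```

The effect is similar to Lucene's encoding: short documents keep their exact
lengths, while long documents are bucketed with a small relative error.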

On Tue, Jul 13, 2021 at 1:22 PM Dwaipayan Roy 
wrote:

> During indexing, an inverted index is made with the term of the documents
> and the term frequency, document frequency etc. are stored. If I know
> correctly, the exact document length is not stored in the index to reduce
> the size. Instead, a normalized length is stored for each document.
> However, for most retrieval functions, document length is a necessary
> component and the normalized doc-length is used in those functions.
>
> I want to ask how exactly the normalization process is performed. The
> question might have been answered already, but I was unable to find the
> proper response. Your help is much appreciated.
>
> Thanks.
>


-- 
Adrien


Re: Need approach to store JSON data in Lucene index

2021-06-17 Thread Adrien Grand
In general, the preferred approach is denormalizing, but your description
suggests that you want to be able to query anything: actions, tasks, test
cases, etc. so I guess that the most natural approach would be to leverage
Lucene's support for index-time joins, see the documentation of the join
package.
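
The invariant the join package relies on is that children are indexed
contiguously before their parent (IndexWriter#addDocuments with the parent
document last), so a child's parent is simply the next "parent" doc ID at or
after the child. A stdlib-only sketch of that lookup, not Lucene's actual
API:

```java
import java.util.BitSet;

public class BlockJoinSketch {
    // A child's parent is the next set parent bit at or after the child's
    // doc ID, because each block ends with its parent document.
    static int parentOf(BitSet parents, int childDoc) {
        return parents.nextSetBit(childDoc);
    }

    public static void main(String[] args) {
        // Doc IDs 0..6: block 1 = children 0,1 + parent 2;
        //               block 2 = children 3,4,5 + parent 6.
        BitSet parents = new BitSet();
        parents.set(2);
        parents.set(6);
        System.out.println(parentOf(parents, 4)); // 6
        System.out.println(parentOf(parents, 0)); // 2
    }
}
```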

On Thu, Jun 17, 2021 at 3:45 PM Amol Suryawanshi <
amol.suryawan...@qualitiasoft.com> wrote:

> Hi Team,
>
> We are using the Lucene Java library in our organization to store JSON file
> data in Lucene indexes.
>
> Our JSON files are structured in the below format.
>
>
>   1.  Testcase has several Testcase steps
>   2.  Testcase has several Tasks
>   3.  Tasks has task step
>   4.  Task step has Actions and objects
>
> Testcase
> TCSteps
>- Actions
>- Objects
> TASK
> TaskSteps
> - Actions
> - Objects
>
>
> How should I store this tree like data where I can get any parent document
> or child document using Lucene query
>
> for eg: I want to get all the Testcases in which particular action is
> mapped.
>
> Thanks & Regards
> Amol A. Suryawanshi
>
> Sent from Mail for
> Windows 10
>
>

-- 
Adrien


Re: Is deleting with IndexReader still possible?

2021-06-17 Thread Adrien Grand
Good catch Michael, deleting via IndexReader was actually removed a
long time ago. I just edited the FAQ to correct this.

On Thu, Jun 17, 2021 at 10:08 AM Michael Wechner 
wrote:

> Hi
>
> According to the FAQ one can delete documents using the IndexReader
>
>
> https://cwiki.apache.org/confluence/display/lucene/lucenefaq#LuceneFAQ-HowdoIdeletedocumentsfromtheindex
> ?
>
> but when I look at the javadoc of Lucene version 8_8_2
>
>
> https://lucene.apache.org/core/8_8_2/core/org/apache/lucene/index/IndexWriter.html
>
> https://lucene.apache.org/core/8_8_2/core/org/apache/lucene/index/IndexReader.html
>
> then it seems to me, that deleting documents is only possible with
> IndexWriter, but not anymore with IndexReader, right?
>
> Thanks
>
> Michael
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-- 
Adrien


Re: Handling Archive Data Using Lucene 7.6

2021-06-14 Thread Adrien Grand
Hi Rashmi,

This upgrade skips 3 major versions, the simplest path will be to reindex
your content.


On Fri, Jun 11, 2021 at 10:40 AM Rashmi Bisanal
 wrote:

> Hi Lucene Support Team ,
>
>
>
> Objective : Upgrade Lucene 3.6 to 7.6
>
>
>
> Description : We have a large amount of data indexed with Lucene 3.6. All of
> this data needs to be upgraded to Lucene 7.6 without any changes.
> Requesting your support on how to proceed with this?
>
>
>
>
>
> Regards
>
> Rashmi
>
>
>
>


-- 
Adrien


Re: Potential bug

2021-06-14 Thread Adrien Grand
>> very expensive.
> > >>> Are you using the default 'fuzziness' parameter? (0.5) - It might end
> > up
> > >> exploring a lot of documents, did you try to play with that parameter?
> > >>> Have you tried to see how the performance change if you do not use
> > fuzzy
> > >> (just to see if is fuzzy the introduce the slow down)?
> > >>> Or what happens to performance if you do fuzzy with 1, 2, 5 terms
> > >> instead of 10?
> > >>>
> > >>> From: java-user@lucene.apache.org At: 06/09/21 18:56:31To:
> > >> java-user@lucene.apache.org,  baris.ka...@oracle.com
> > >>> Subject: Re: Potential bug
> > >>>
> > >>> i cant reveal those details i am very sorry. but it is more than 1
> > >> million.
> > >>> let me tell that i have a lot of code that processes results from
> > lucene
> > >>> but the bottle neck is lucene fuzzy search.
> > >>>
> > >>> Best regards
> > >>>
> > >>>
> > >>> On 6/9/21 1:53 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:
> > >>>> How many documents do you have in the index?
> > >>>> and can you show an example of query?
> > >>>>
> > >>>>
> > >>>> From: java-user@lucene.apache.org At: 06/09/21 18:33:25To:
> > >>> java-user@lucene.apache.org,  baris.ka...@oracle.com
> > >>>> Subject: Re: Potential bug
> > >>>>
> > >>>> i have only two fields one string the other is a number (stored as
> > >>>> string), i guess you cant go simpler than this.
> > >>>>
> > >>>> i retreieve the hits and my major bottleneck is lucene fuzzy search.
> > >>>>
> > >>>>
> > >>>> i take each word from the string which is usually around at most 10
> > >> words
> > >>>> i build a fuzzy boolean query out of them.
> > >>>>
> > >>>>
> > >>>> simple query is like this 10 word query.
> > >>>>
> > >>>>
> > >>>> limit means i want to stop lucene search around 20 hits i dont want
> > >>>> thousands of hits.
> > >>>>
> > >>>>
> > >>>> Best regards
> > >>>>
> > >>>>
> > >>>> On 6/9/21 1:25 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:
> > >>>>
> > >>>>> Hi Baris,
> > >>>>>
> > >>>>>> what if the user needs to limit the search process?
> > >>>>> What do you mean by 'limit'?
> > >>>>>
> > >>>>>> there should be a way to speedup lucene then if this is not
> > possible,
> > >>>>>> since for some simple queries it takes half a second which is too
> > >> long.
> > >>>>> What do you mean by 'simple' query? there might be multiple reasons
> > >> behind
> > >>>> slowness of a query that are unrelated to the search (for example,
> if
> > >> you
> > >>>> retrieve many documents and for each document you are extracting the
> > >> content
> > >>> of
> > >>>> many fields) - would you like to tell us a bit more about your use
> > case?
> > >>>>> Regards,
> > >>>>> Diego
> > >>>>>
> > >>>>> From: java-user@lucene.apache.org At: 06/09/21 18:18:01To:
> > >>>> java-user@lucene.apache.org
> > >>>>> Cc:  baris.ka...@oracle.com
> > >>>>> Subject: Re: Potential bug
> > >>>>>
> > >>>>> Thanks Adrien, but the difference is too far apart.
> > >>>>>
> > >>>>> I think the algorithm needs to be revised.
> > >>>>>
> > >>>>>
> > >>>>> what if the user needs to limit the search process?
> > >>>>>
> > >>>>> that leaves no control.
> > >>>>>
> > >>>>> there should be a way to speedup lucene then if this is not
> possible,
> > >>>>>
> > >>>>> since for some simple queries it takes half a second which is too
> > long.
> > >>>>>
> > >>>>> Best regards

Re: Monitoring decisions taken by IndexOrDocValuesQuery

2021-06-10 Thread Adrien Grand
This sounds useful indeed!

Is there a way we could do it that wouldn't require forking
IndexOrDocValuesQuery? E.g. could we have query wrappers that we would use
on both the index query and the doc-value query in order to be able to
count how many times they have been used? We could add something like that
to lucene/sandbox.
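
The wrapper idea can be sketched generically without touching Lucene at all:
decorate each execution path so that it bumps a counter when it is actually
used. The names below are illustrative, not Lucene APIs:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

public class CountingWrapperSketch {
    // Wrap a delegate so every use increments a shared counter.
    static <T> Supplier<T> counting(AtomicLong counter, Supplier<T> delegate) {
        return () -> {
            counter.incrementAndGet();
            return delegate.get();
        };
    }

    // Simulates an IndexOrDocValuesQuery-style decision that picks the
    // index path twice and the doc-values path once.
    static long[] demo() {
        AtomicLong indexPath = new AtomicLong();
        AtomicLong dvPath = new AtomicLong();
        Supplier<String> indexQuery = counting(indexPath, () -> "points");
        Supplier<String> dvQuery = counting(dvPath, () -> "doc values");
        indexQuery.get();
        indexQuery.get();
        dvQuery.get();
        return new long[] {indexPath.get(), dvPath.get()};
    }

    public static void main(String[] args) {
        long[] counts = demo();
        System.out.println("index: " + counts[0] + ", docValues: " + counts[1]);
    }
}
```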

On Thu, Jun 10, 2021 at 2:51 PM Egor Moraru  wrote:

> Hi Adrien,
>
> In this specific use case our data is encoded as points and also is stored
> as doc values.
>
> We use information about which execution path is chosen to decide
> if we can get away with storing this data only once and using one of
> the queries.
>
> On Wed, Jun 9, 2021 at 10:39 PM Adrien Grand  wrote:
>
> > FWIW a related PR was just merged that allows to introspect query
> > execution: https://issues.apache.org/jira/browse/LUCENE-9965. It's
> > different from your use-case though in that it is debugging information
> for
> > a single query rather than statistical information across lots of user
> > queries (and the approach on that other issue makes things much slower so
> > you wouldn't like to enable it in production).
> >
> > Out of curiosity, what are you doing with this information about which
> > execution path is chosen?
> >
> > On Wed, Jun 9, 2021 at 2:14 PM Egor Moraru 
> wrote:
> >
> > > Hi,
> > >
> > > At my current project we wanted to monitor for a specific field the
> > > fraction of indexed vs doc values queries executed by
> > > IndexOrDocValuesQuery.
> > >
> > > We ended up forking IndexOrDocValuesQuery and passing a listener that
> > > is notified when the query execution path is decided.
> > >
> > > Do you think this is something the community might be interested in?
> > >
> > > Kind regards,
> > > Egor Moraru.
> > >
> >
> >
> > --
> > Adrien
> >
>
>
> --
>
> Kind regards,
> Egor Moraru.
>


-- 
Adrien


Re: Monitoring decisions taken by IndexOrDocValuesQuery

2021-06-09 Thread Adrien Grand
FWIW a related PR was just merged that allows to introspect query
execution: https://issues.apache.org/jira/browse/LUCENE-9965. It's
different from your use-case though in that it is debugging information for
a single query rather than statistical information across lots of user
queries (and the approach on that other issue makes things much slower so
you wouldn't like to enable it in production).

Out of curiosity, what are you doing with this information about which
execution path is chosen?

On Wed, Jun 9, 2021 at 2:14 PM Egor Moraru  wrote:

> Hi,
>
> At my current project we wanted to monitor for a specific field the
> fraction of indexed vs doc values queries executed by
> IndexOrDocValuesQuery.
>
> We ended up forking IndexOrDocValuesQuery and passing a listener that
> is notified when the query execution path is decided.
>
> Do you think this is something the community might be interested in?
>
> Kind regards,
> Egor Moraru.
>


-- 
Adrien


Re: Potential bug

2021-06-09 Thread Adrien Grand
Hi Baris,

totalHitsThreshold is actually a minimum threshold, not a maximum threshold.

The problem is that Lucene cannot directly identify the top matching
documents for a given query. The strategy it adopts is to start collecting
hits naively in doc ID order and to progressively raise the bar about the
minimum score that is required for a hit to be competitive in order to skip
non-competitive documents. So it's expected that Lucene still collects 100s
or 1000s of hits, even though the collector is configured to only compute
the top 10 hits.
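
The "raise the bar" dynamic can be sketched with a plain priority queue.
This is a simplification — real Lucene also uses per-block maximum scores to
skip whole ranges of documents — but it shows why hits keep being collected
until the bar is high enough:

```java
import java.util.PriorityQueue;

public class MinCompetitiveScoreSketch {
    // Returns the score a new hit must beat once the top-n queue is full:
    // the minimum competitive score, raised progressively while collecting.
    static double minCompetitiveAfter(double[] hits, int n) {
        PriorityQueue<Double> top = new PriorityQueue<>(); // min-heap of best n
        double minCompetitive = Double.NEGATIVE_INFINITY;
        for (double score : hits) {
            if (score <= minCompetitive) {
                continue; // non-competitive, may be skipped entirely
            }
            top.offer(score);
            if (top.size() > n) {
                top.poll();
            }
            if (top.size() == n) {
                minCompetitive = top.peek(); // raise the bar
            }
        }
        return minCompetitive;
    }

    public static void main(String[] args) {
        double[] scores = {0.2, 1.5, 0.7, 2.0, 0.1, 0.9, 1.1};
        // After these hits, only scores above 1.1 would still be competitive.
        System.out.println(minCompetitiveAfter(scores, 3)); // 1.1
    }
}
```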

On Wed, Jun 9, 2021 at 7:07 PM  wrote:

> Hi,-
>
>   i think this is a potential bug
>
>
> i set this time totalHitsThreshold to 10 and i get totalhits reported as
> 1655 but i get 10 results in total.
>
> I think this suggests that there might be a bug with
> TopScoreDocCollector algorithm.
>
>
> Best regards
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-- 
Adrien


Re: An interesting case

2021-06-08 Thread Adrien Grand
Yes, for instance if you care about the top 10 hits only, you could call
TopScoreDocCollector.create(10, null, 10). By default, IndexSearcher is
configured to count at least 1,000 hits, and creates its top docs collector
with TopScoreDocCollector.create(10, null, 1000).

On Tue, Jun 8, 2021 at 7:19 PM  wrote:

> Ok i think you meant something else here.
>
> you are not refering to total number of hits calculation or the
> mismatch, right?
>
>
>
> so to make lucene minimum work to reach the matched docs
>
>
> TopScoreDocCollector should be used, right?
>
>
> Let me check this class.
>
> Thanks
>
>
> On 6/8/21 1:16 PM, baris.ka...@oracle.com wrote:
> > Adrien my concern is not actually the number mismatch
> >
> > as i mentioned it is the performance.
> >
> >
> > seeing those numbers mismatch it seems that lucene is still doing same
> >
> > amount of work to get results no matter how many results you need in
> > the indexsearcher search api.
> >
> >
> > i thought i was clear on that.
> >
> >
> > Lucene should not spend any energy for the count as scoredocs already
> > has that.
> >
> > But seeing totalhits high number, that worries me as i explained above.
> >
> >
> > Best regards
> >
> >
> > On 6/8/21 1:12 PM, Adrien Grand wrote:
> >> If you don't need any information about the total hit count, you could
> >> create a TopScoreDocCollector that has the same value for numHits
> >> and totalHitsThreshold. This way Lucene will spend as little energy as
> >> possible computing the number of matches of the query.
> >>
> >> On Tue, Jun 8, 2021 at 6:28 PM  wrote:
> >>
> >>> i am currently happy with Lucene performance but i want to understand
> >>> and speedup further
> >>>
> >>> by limiting the results concretely. So i still donot know why totalHits
> >>> and scoredocs report
> >>>
> >>> different number of hits.
> >>>
> >>>
> >>> Best regards
> >>>
> >>>
> >>> On 6/8/21 2:52 AM, Baris Kazar wrote:
> >>>> my worry is actually about the lucene's performance.
> >>>>
> >>>> if lucene collects thousands of hits instead of actually n (<<< a
> >>>> couple of 1000s) hits, then this creates performance issue.
> >>>>
> >>>> ScoreDoc array is ok as i mentioned ie, it has size n.
> >>>> i will check count api.
> >>>>
> >>>> Best regards
> >>>>
> 
> >>>>
> >>>> *From:* Adrien Grand 
> >>>> *Sent:* Tuesday, June 8, 2021 2:46 AM
> >>>> *To:* Lucene Users Mailing List
> >>>> *Cc:* Baris Kazar
> >>>> *Subject:* Re: An interesting case
> >>>> When you call IndexSearcher#search(Query query, int n), there are two
> >>>> cases:
> >>>>   - either your query matches n hits or more, and the TopDocs object
> >>>> will have a ScoreDoc[] array that contains the n best scoring hits
> >>>> sorted by descending score,
> >>>>   - or your query matches less then n hits and then the TopDocs object
> >>>> will have all matches in the ScoreDoc[] array, sorted by descending
> >>> score.
> >>>> In both cases, TopDocs#totalHits gives information about the total
> >>>> number of matches of the query. On older versions of Lucene (<7.0)
> >>>> this is an integer that is always accurate, while on more recent
> >>>> versions of Lucene (>= 8.0) it is a lower bound of the total number of
> >>>> matches. It typically returns the number of collected documents
> >>>> indeed, though this is an implementation detail that might change in
> >>>> the future.
> >>>>
> >>>> If you want to count the number of matches of a Query precisely, you
> >>>> can use IndexSearcher#count.
> >>>>
> >>>> On Tue, Jun 8, 2021 at 7:51 AM  >>>> <mailto:baris.ka...@oracle.com>> wrote:
> >>>>
> >>>>
> >>>
> https://urldefense.com/v3/__https://stackoverflow.com/questions/50368313/relation-between-topdocs-totalhits-and-parameter-n-of-indexsearcher-search__;!!GqivPVa7Brio!LRsX8rEVxyiW7z_x1SgYFeTYHDh861CsGCbMnMgKAuawz8u5_hiRv52XJ08nfvhVHw$
> >>>

Re: An interesting case

2021-06-08 Thread Adrien Grand
If you don't need any information about the total hit count, you could
create a TopScoreDocCollector that has the same value for numHits
and totalHitsThreshold. This way Lucene will spend as little energy as
possible computing the number of matches of the query.

On Tue, Jun 8, 2021 at 6:28 PM  wrote:

> i am currently happy with Lucene performance but i want to understand
> and speedup further
>
> by limiting the results concretely. So i still donot know why totalHits
> and scoredocs report
>
> different number of hits.
>
>
> Best regards
>
>
> On 6/8/21 2:52 AM, Baris Kazar wrote:
> > my worry is actually about the lucene's performance.
> >
> > if lucene collects thousands of hits instead of actually n (<<< a
> > couple of 1000s) hits, then this creates performance issue.
> >
> > ScoreDoc array is ok as i mentioned ie, it has size n.
> > i will check count api.
> >
> > Best regards
> > 
> > *From:* Adrien Grand 
> > *Sent:* Tuesday, June 8, 2021 2:46 AM
> > *To:* Lucene Users Mailing List
> > *Cc:* Baris Kazar
> > *Subject:* Re: An interesting case
> > When you call IndexSearcher#search(Query query, int n), there are two
> > cases:
> >  - either your query matches n hits or more, and the TopDocs object
> > will have a ScoreDoc[] array that contains the n best scoring hits
> > sorted by descending score,
> >  - or your query matches fewer than n hits, and then the TopDocs object
> > will have all matches in the ScoreDoc[] array, sorted by descending
> score.
> >
> > In both cases, TopDocs#totalHits gives information about the total
> > number of matches of the query. On older versions of Lucene (< 8.0)
> > this is an integer that is always accurate, while on more recent
> > versions of Lucene (>= 8.0) it is a lower bound of the total number of
> > matches. It typically returns the number of collected documents
> > indeed, though this is an implementation detail that might change in
> > the future.
> >
> > If you want to count the number of matches of a Query precisely, you
> > can use IndexSearcher#count.
> >
> > On Tue, Jun 8, 2021 at 7:51 AM <baris.ka...@oracle.com> wrote:
> >
> >
> https://stackoverflow.com/questions/50368313/relation-between-topdocs-totalhits-and-parameter-n-of-indexsearcher-search
> > <
> https://urldefense.com/v3/__https://stackoverflow.com/questions/50368313/relation-between-topdocs-totalhits-and-parameter-n-of-indexsearcher-search__;!!GqivPVa7Brio!JjLGw8TaYQcqSC7BtpPSZl5dl-WqgwwcgGFhOqHSUKIsCaTSNpoDvOJjq0BbkQhfpw$
> >
> >
> > looks like someone else also had this problem, too.
> >
> > Any suggestions please?
> >
> > Best regards
> >
> >
> > On 6/8/21 1:36 AM, baris.ka...@oracle.com
> > <mailto:baris.ka...@oracle.com> wrote:
> > > Hi,-
> > >
> > >  I use IndexSearcher.search API with two parameters like Query
> > and int
> > > number (i set as 20).
> > >
> > > However, when i look at the TopDocs object which is the result
> > of this
> > > above API call
> > >
> > > i see thousands of hits from totalhits. Is this inaccurate or
> > Lucene
> > > is doing actually search based on that many results?
> > >
> > > But when i iterate over result of above API call's scoreDocs
> > object i
> > > get int number of hits (ie, 20 hits).
> > >
> > >
> > > I am trying to find out why
> > org.apache.lucene.search.Topdocs.TotalHits
> > > reports a different number of collected results than
> > >
> > > the actual number of results. I see on the order of couple of
> > > thousands vs 20.
> > >
> > >
> > > Best regards
> > >
> > >
> > >
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > <mailto:java-user-unsubscr...@lucene.apache.org>
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > <mailto:java-user-h...@lucene.apache.org>
> >
> >
> >
> > --
> > Adrien
>


-- 
Adrien


Re: An interesting case

2021-06-08 Thread Adrien Grand
When you call IndexSearcher#search(Query query, int n), there are two cases:
 - either your query matches n hits or more, and the TopDocs object will
have a ScoreDoc[] array that contains the n best scoring hits sorted by
descending score,
 - or your query matches fewer than n hits, and then the TopDocs object will
have all matches in the ScoreDoc[] array, sorted by descending score.

In both cases, TopDocs#totalHits gives information about the total number
of matches of the query. On older versions of Lucene (<7.0) this is an
integer that is always accurate, while on more recent versions of Lucene
(>= 8.0) it is a lower bound of the total number of matches. It typically
returns the number of collected documents indeed, though this is an
implementation detail that might change in the future.

If you want to count the number of matches of a Query precisely, you can
use IndexSearcher#count.

On Tue, Jun 8, 2021 at 7:51 AM  wrote:

>
> https://stackoverflow.com/questions/50368313/relation-between-topdocs-totalhits-and-parameter-n-of-indexsearcher-search
>
> looks like someone else also had this problem, too.
>
> Any suggestions please?
>
> Best regards
>
>
> On 6/8/21 1:36 AM, baris.ka...@oracle.com wrote:
> > Hi,-
> >
> >  I use IndexSearcher.search API with two parameters like Query and int
> > number (i set as 20).
> >
> > However, when i look at the TopDocs object which is the result of this
> > above API call
> >
> > i see thousands of hits from totalHits. Is this inaccurate, or is Lucene
> > actually doing the search based on that many results?
> >
> > But when i iterate over result of above API call's scoreDocs object i
> > get int number of hits (ie, 20 hits).
> >
> >
> > I am trying to find out why org.apache.lucene.search.TopDocs.totalHits
> > reports a different number of collected results than
> >
> > the actual number of results. I see on the order of couple of
> > thousands vs 20.
> >
> >
> > Best regards
> >
> >
> >
>
>
>

-- 
Adrien


Re: Changing Term Vectors for Query

2021-06-07 Thread Adrien Grand
Hi Marcel,

You can make Lucene index custom frequencies using something like
DelimitedTermFrequencyTokenFilter,
which would be easier than writing a custom Query/Weight/Scorer. Would it
work for you?
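To make this concrete, here is a self-contained sketch of the suggestion (assuming Lucene 8.x/9.x plus the analyzers-common module on the classpath; the field name and "term|freq" values are made up). The filter parses a frequency off the end of each token, so "apple|3" is indexed as the term "apple" with a term frequency of 3:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.DelimitedTermFrequencyTokenFilter;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.util.BytesRef;

public class CustomFreqDemo {
  public static int freqOf(String term) throws Exception {
    // Analysis chain that parses "term|freq" tokens ('|' is the default delimiter).
    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream sink = new DelimitedTermFrequencyTokenFilter(source);
        return new TokenStreamComponents(source, sink);
      }
    };

    // Custom term frequencies require freqs without positions.
    FieldType type = new FieldType();
    type.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
    type.setOmitNorms(true);
    type.setTokenized(true);
    type.freeze();

    ByteBuffersDirectory dir = new ByteBuffersDirectory();
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
      Document doc = new Document();
      doc.add(new Field("tags", "apple|3 banana|1", type));
      writer.addDocument(doc);
    }

    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      TermsEnum terms = MultiTerms.getTerms(reader, "tags").iterator();
      terms.seekExact(new BytesRef(term));
      PostingsEnum postings = terms.postings(null, PostingsEnum.FREQS);
      postings.nextDoc();
      return postings.freq(); // the frequency taken from the token, not a count
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(freqOf("apple"));
  }
}
```

The score is then computed from these injected frequencies by the normal similarity, which is why this can be simpler than a custom Query/Weight/Scorer.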

On Sun, Jun 6, 2021 at 10:24 PM Hannes Lohr
 wrote:

> Hello,
> for some Queries i need to calcuate the score mostly like the normal
> score, but for some documents certain terms are assigned a Frequency given
> by me and the score should be calculated with these new term frequencies.
> After some research, it seems i have to implement a custom Query, custom
> Weight and Custom Scorer for this. I wanted to ask if I'm overlooking a
> simpler solution or if this is the way to go.
> Thanks,
> Marcel



-- 
Adrien


Re: Performance decrease with NRT use-case in 8.8.x (coming from 8.3.0)

2021-05-19 Thread Adrien Grand
LUCENE-9115 certainly creates more files in the FSDirectory than in the
ByteBuffersDirectory, e.g. stored fields are now always flushed to the
FSDirectory since their size can't be known in advance, while they were
always written to the ByteBuffersDirectory before (which was a problem since
these files could be arbitrarily large).

On Wed, May 19, 2021 at 1:55 PM Gietzen, Markus
 wrote:

> Hi,
>
> thanks for reaching me that fast!
>
> Your hint that there were changes to NRTCachingDirectory were the right
> point:
>
> I copied the 8.3 NRTCachingDirectory implementation into the project (with
> a different classname, you get the idea ) and used that one.
>
> And believe it or not: everything is fine. Now 8.8 performs as fast as 8.3!
>
> I will check the differences and put them in step by step to find out
> which change causes the slow-down.
> I’ll report here.
>
> Bye,
>
> Markus
>
>
> From: Michael McCandless 
> Sent: Wednesday, 19 May 2021 13:39
> To: Lucene Users ; Gietzen, Markus <
> markus.giet...@softwareag.com>
> Subject: Re: Performance decrease with NRT use-case in 8.8.x (coming from
> 8.3.0)
>
> > The update showed no issues (e.g. compiled without changes) but I
> noticed that our test-suites take a lot longer to finish.
>
> Hmm, that sounds bad.  We need our tests to stay fast but also do a good
> job testing things ;)
>
> Does your production usage also slow down?  Tests do other interesting
> things, e.g. enable assertions, randomly swap in different codecs,
> different Directory impls (if you are using Lucene's randomized test
> infrastructure
> <http://blog.mikemccandless.com/2011/03/your-test-cases-should-sometimes-fail.html>),
> etc.
>
> > While in 8.8 files are opened for reading that do not (yet) exist.
>
> That is incredibly strange!  Lucene should never do that (opening a
> non-existing file for "read", causing that file to become created through
> Windows CreateFile), that I am aware of.  Can you share the full source
> code of the test case?
>
> Try to eliminate parts of the test maybe?  Do you see the same slowdown if
> you don't use NRTCachingDirectory at all? (Just straight to
> MMapDirectory.)  There were some recent fixes to NRTCachineDirectory, e.g.
> https://issues.apache.org/jira/browse/LUCENE-9115 and
> https://issues.apache.org/jira/browse/LUCENE-8663 -- maybe they are
> related?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, May 19, 2021 at 7:23 AM Gietzen, Markus
>  markus.giet...@softwareag.com.invalid>> wrote:
> Hello,
>
> recently I updated the Lucene version in one of our products from 8.3 to
> 8.8.x (8.8.2 as of now).
> The update showed no issues (e.g. compiled without changes) but I noticed
> that our test-suites take a lot longer to finish.
>
> So I took a closer look at one test-case which showed a severe slowdown
> (it’s doing small update, flush, search  cycles in order to stress NRT;
> the purpose is to see performance-changes in an early stage  ):
>
> Lucene 8.3:   ~2.3s
> Lucene 8.8.x:  25s
>
> This is a huge difference. Therefore I used YourKit to profile 8.3 and 8.8
> and do a comparison.
>
> The gap is caused by different amount of calls to
> sun.nio.fs.WindowsNativeDispatcher.CreateFile0(long, int, int, long, int,
> int) WindowsNativeDispatcher.java (native)
> 8.3:  about 150 calls
> 8.8:  about 12500 calls
>
> In order to hunt down what is causing this, I took a look at the open() in
> NRTDirectory.
> Here I could see that the amount of calls to that open is in the same
> ballpark for 8.3 and 8.8
>
> The difference is that in 8.3 nearly all files are available in the
> underlying RAMDirectory. While in 8.8 files are opened for reading that do
> not (yet) exist.
> This leads to a call to the WindowsNativeDispatcher.CreateFile0
>
> Add the end of the mail I added two example-stacktraces that show this
> behavior.
>
> Has someone an idea what change might cause this or if I need to do
> something different in 8.8 compared to 8.3?
>
>
> Thanks for any help,
>
> Markus
>
> Here is an example stacktrace that is causing such a try of a read-access
> to non-existing file:
>
> Filename= _0.fdm(IOContext is READ)   (I checked the directory on
> harddisk: it did not yet contain it nor in RAM-directory of the NRTCacheDir)
>
> openInput:100, FilterDirectory (org.apache.lucene.store)
> openInput:100, FilterDirectory (org.apache.lucene.store)
> openChecksumInput:157, Directory (org.apache.lucene.store)
> finish:140, FieldsIndexWriter (org.apache.lucene.codecs.compressing)
> finish:480, CompressingStoredFieldsWriter
> (org.apache.lucene.codecs.compressing)
> flush:81, StoredFieldsConsumer (org.apache.lucene.index)
> flush:239, DefaultIndexingChain (org.apache.lucene.index)
> flush:350, DocumentsWriterPerThread (org.apache.lucene.index)
> doFlush:476, DocumentsWriter (org.apache.lucene.index)
> flushAllThreads:656, DocumentsWriter (org.apache.lucene.index)
> getReader:605, IndexWriter (org.apache.lucene.index)
> doOpenIfChanged:277, 

Re: How to ignore a match if a given keyword is before/after another given keyword?

2021-04-27 Thread Adrien Grand
Great to hear!

On Tue, Apr 27, 2021 at 10:44 PM, Jean Morissette wrote:

> Using intervals worked, thank you for your help !
>
> On Sun, 25 Apr 2021 at 13:52, Adrien Grand  wrote:
>
> > Hi Jean,
> >
> > You should be able to do this with intervals, see
> >
> >
> https://lucene.apache.org/core/8_8_1/queries/org/apache/lucene/queries/intervals/package-summary.html
> > .
> >
> > On Sun, Apr 25, 2021 at 6:43 PM, Jean Morissette wrote:
> >
> > > Thank you for your answer.
> > >
> > > The problem with this solution is that it excludes documents which
> > contain
> > > both positive and negative positive matches.
> > >
> > > For example, consider those 3 documents with the terms a, b:
> > > - document 1: "a"
> > > - document 2: "a b"
> > > - document 3: "a b a"
> > >
> > > What we want is to find documents with the terms 'a', ignoring matches
> if
> > > 'a' is followed by 'b'.
> > > That is, we don't want to exclude one document if 'a' is followed by
> 'b'.
> > >
> > > The right answer should be documents 1 and 3 but your solution excludes
> > > document 3.
> > >
> > > Is it something achievable with Lucene?
> > >
> > > Thanks,
> > > Jean
> > >
> > >
> > > On Thu, 15 Apr 2021 at 01:33, Aditya Varun Chadha 
> > > wrote:
> > >
> > > > maybe you want (abstractly):
> > > >
> > > > bool(must(term("f", "positive"), mustNot(phrase("f", "negative
> > positive",
> > > > slop=1)))
> > > >
> > > > On Thu, Apr 15, 2021 at 7:27 AM Jean Morissette <
> > > jean.morisse...@gmail.com
> > > > >
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Does someone know if it's possible to search documents containing a
> > > given
> > > > > keyword only if this keyword is not followed or preceded or another
> > > given
> > > > > keyword?
> > > > >
> > > > > Thanks,
> > > > > Jean
> > > > >
> > > >
> > > >
> > > > --
> > > > Aditya
> > > >
> > >
> >
>


Re: NullPointerException in LongComparator.setTopValue

2021-04-26 Thread Adrien Grand
Thanks Michael.

On Wed, Apr 21, 2021 at 5:22 PM Michael Grafl - SKIDATA <
michael.gr...@skidata.com> wrote:

> Hi Adrian,
>
> Thanks for the reply, I have filed
> https://github.com/elastic/elasticsearch/issues/72029.
> Unfortunately, I am not a liberty to share the search request.
>
> Best regards,
> Michael
>
> Michael Grafl
> Software House Klagenfurt / Architect - Software Development
>
> SKIDATA
> We change the world of welcoming people
>
> SKIDATA GmbH
> Lakeside B06 | 9020 Klagenfurt | Austria
> P +43 6246 888-6177
> E michael.gr...@skidata.com | www.skidata.com
>
> -Original Message-
> From: Adrien Grand 
> Sent: Thursday, March 18, 2021 12:12
> To: Lucene Users Mailing List 
> Subject: Re: NullPointerException in LongComparator.setTopValue
>
> Hi Michael,
>
> At first sight, this looks more like an Elasticsearch bug than like a
> Lucene bug to me. Can you file an issue at
> https://github.com/elastic/elasticsearch and share the search request
> than you are running?
>
> On Thu, Mar 18, 2021 at 11:52 AM Michael Grafl - SKIDATA <
> michael.gr...@skidata.com> wrote:
>
> > Hi all,
> >
> > I get a NullPointerException using Elasticsearch 7.9.1 with Lucene
> > Core
> > 8.6.2 CentOS 7:
> >
> > "stacktrace":
> > ["org.elasticsearch.action.search.SearchPhaseExecutionException: all
> > shards failed", "at
> > org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailu
> > re(AbstractSearchAsyncAction.java:551)
> > [elasticsearch-7.9.1.jar:7.9.1]",
> > "at
> > org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextP
> > hase(AbstractSearchAsyncAction.java:309)
> > [elasticsearch-7.9.1.jar:7.9.1]",
> > "at
> > org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(
> > AbstractSearchAsyncAction.java:582)
> > [elasticsearch-7.9.1.jar:7.9.1]",
> > "at
> > org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailu
> > re(AbstractSearchAsyncAction.java:393)
> > [elasticsearch-7.9.1.jar:7.9.1]",
> > "at
> > org.elasticsearch.action.search.AbstractSearchAsyncAction.access$100(A
> > bstractSearchAsyncAction.java:68)
> > [elasticsearch-7.9.1.jar:7.9.1]",
> > "at
> > org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(
> > AbstractSearchAsyncAction.java:245)
> > [elasticsearch-7.9.1.jar:7.9.1]",
> > "at
> > org.elasticsearch.action.search.SearchExecutionStatsCollector.onFailur
> > e(SearchExecutionStatsCollector.java:73)
> > [elasticsearch-7.9.1.jar:7.9.1]",
> > "at
> > org.elasticsearch.action.ActionListenerResponseHandler.handleException
> > (ActionListenerResponseHandler.java:59)
> > [elasticsearch-7.9.1.jar:7.9.1]",
> > "at
> > org.elasticsearch.action.search.SearchTransportService$ConnectionCount
> > ingHandler.handleException(SearchTransportService.java:403)
> > [elasticsearch-7.9.1.jar:7.9.1]",
> > "at
> > org.elasticsearch.transport.TransportService$6.handleException(Transpo
> > rtService.java:638)
> > [elasticsearch-7.9.1.jar:7.9.1]",
> > "at
> > org.elasticsearch.transport.TransportService$ContextRestoreResponseHan
> > dler.handleException(TransportService.java:1172)
> > [elasticsearch-7.9.1.jar:7.9.1]",
> > "at
> > org.elasticsearch.transport.TransportService$DirectResponseChannel.pro
> > cessException(TransportService.java:1281)
> > [elasticsearch-7.9.1.jar:7.9.1]",
> > "at
> > org.elasticsearch.transport.TransportService$DirectResponseChannel.sen
> > dResponse(TransportService.java:1255)
> > [elasticsearch-7.9.1.jar:7.9.1]",
> > "at
> > org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTran
> > sportChannel.java:61)
> > [elasticsearch-7.9.1.jar:7.9.1]",
> > "at
> > org.elasticsearch.transport.TransportChannel.sendErrorResponse(Transpo
> > rtChannel.java:56)
> > [elasticsearch-7.9.1.jar:7.9.1]",
> > "at
> > org.elasticsearch.action.support.ChannelActionListener.onFailure(Chann
> > elActionListener.java:51)
> > [elasticsearch-7.9.1.jar:7.9.1]",
> > "at
> > org.elasticsearch.search.SearchService.lambda$runAsync$0(SearchService
> > .java:414)
> > [elasticsearch-7.9.1.jar:7.9.1]",
> > "at
> > org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunn
> > able.java:44)
> > [elasticsearch-7.9.1.jar:7.9.1]",
>

Re: How to ignore a match if a given keyword is before/after another given keyword?

2021-04-25 Thread Adrien Grand
Hi Jean,

You should be able to do this with intervals, see
https://lucene.apache.org/core/8_8_1/queries/org/apache/lucene/queries/intervals/package-summary.html
.
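A self-contained sketch of how intervals can express "'a' not immediately followed by 'b'", using the three example documents from the question (this assumes Lucene 8.x/9.x plus the lucene-queries module; `Intervals.notContainedBy` filters out the 'a' occurrences that sit inside an "a b" phrase):

```java
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queries.intervals.IntervalQuery;
import org.apache.lucene.queries.intervals.Intervals;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.ByteBuffersDirectory;

public class NotFollowedByDemo {
  public static int matches() throws Exception {
    ByteBuffersDirectory dir = new ByteBuffersDirectory();
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new WhitespaceAnalyzer()))) {
      for (String text : new String[] {"a", "a b", "a b a"}) {
        Document doc = new Document();
        doc.add(new TextField("f", text, Store.NO));
        writer.addDocument(doc);
      }
    }
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      // Keep intervals of "a" that are NOT contained by an "a b" phrase,
      // i.e. occurrences of 'a' that are not immediately followed by 'b'.
      IntervalQuery query = new IntervalQuery("f",
          Intervals.notContainedBy(
              Intervals.term("a"),
              Intervals.phrase("a", "b")));
      return new IndexSearcher(reader).count(query);
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(matches());
  }
}
```

This matches documents 1 ("a") and 3 ("a b a") but not document 2 ("a b"), which is the behavior the question asks for.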

On Sun, Apr 25, 2021 at 6:43 PM, Jean Morissette wrote:

> Thank you for your answer.
>
> The problem with this solution is that it excludes documents which contain
> both positive and negative positive matches.
>
> For example, consider those 3 documents with the terms a, b:
> - document 1: "a"
> - document 2: "a b"
> - document 3: "a b a"
>
> What we want is to find documents with the terms 'a', ignoring matches if
> 'a' is followed by 'b'.
> That is, we don't want to exclude one document if 'a' is followed by 'b'.
>
> The right answer should be documents 1 and 3 but your solution excludes
> document 3.
>
> Is it something achievable with Lucene?
>
> Thanks,
> Jean
>
>
> On Thu, 15 Apr 2021 at 01:33, Aditya Varun Chadha 
> wrote:
>
> > maybe you want (abstractly):
> >
> > bool(must(term("f", "positive"), mustNot(phrase("f", "negative positive",
> > slop=1)))
> >
> > On Thu, Apr 15, 2021 at 7:27 AM Jean Morissette <
> jean.morisse...@gmail.com
> > >
> > wrote:
> >
> > > Hi all,
> > >
> > > Does someone know if it's possible to search documents containing a
> given
> > > keyword only if this keyword is not followed or preceded or another
> given
> > > keyword?
> > >
> > > Thanks,
> > > Jean
> > >
> >
> >
> > --
> > Aditya
> >
>


Re: Backward compatibility of FST50 and UniformSplit formats

2021-04-19 Thread Adrien Grand
Hi Dmitry,

These codecs are indeed not backward compatible. Only the default codec is
guaranteed to be backward compatible.

If you would like to bring your index to a snapshot of the main branch, one
option would be to:
 1. Use Lucene 8.5's IndexWriter#addIndexes in order to create a copy of
your index that uses the default codec.
 2. Upgrade to a main snapshot.
 3. Use main's IndexWriter#addIndexes in order to go back to a custom codec
that uses FST50 or UniformSplit on some fields.

Please also note that we don't guarantee backward compatibility of indices
created by Lucene snapshots, ie. an index created by a snapshot of Lucene
9.0 might not be readable by Lucene 9.0, so you will have to do the same
operation again when moving to Lucene 9.0.
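For reference, here is a minimal sketch of step 1, the codec rewrite via addIndexes. Note that `addIndexes(CodecReader...)` re-encodes every segment with the target writer's codec, whereas `addIndexes(Directory...)` only copies the files and keeps the old codec. This example uses the default codec on both sides so it is runnable as-is; in the real scenario the source index would use your FST50/UniformSplit codec, which must be on the classpath to open the reader:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.CodecReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

public class CodecRewriteDemo {
  public static int rewrite() throws Exception {
    // Source index (stands in for the index written with a custom codec).
    ByteBuffersDirectory source = new ByteBuffersDirectory();
    try (IndexWriter writer = new IndexWriter(source, new IndexWriterConfig())) {
      Document doc = new Document();
      doc.add(new StringField("id", "1", Store.NO));
      writer.addDocument(doc);
    }

    // Target index: the writer's codec (here the default) decides the new format.
    ByteBuffersDirectory target = new ByteBuffersDirectory();
    try (DirectoryReader reader = DirectoryReader.open(source);
         IndexWriter writer = new IndexWriter(target, new IndexWriterConfig())) {
      CodecReader[] leaves = reader.leaves().stream()
          .map(ctx -> (CodecReader) ctx.reader()) // segment readers are CodecReaders
          .toArray(CodecReader[]::new);
      writer.addIndexes(leaves); // re-encodes the segments with the new codec
    }

    try (DirectoryReader reader = DirectoryReader.open(target)) {
      return reader.maxDoc(); // the document survived the rewrite
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(rewrite());
  }
}
```

Step 3 (going back to a custom codec on a main snapshot) is the same operation with `IndexWriterConfig#setCodec` pointing at the custom codec.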

On Sun, Apr 18, 2021 at 4:50 PM Dmitry Emets  wrote:

> Hi!
> I cannot open by lucene master my indexes created by lucene 8.5. I get an
> error
> Exception in thread "main" org.apache.lucene.index.CorruptIndexException:
> codec mismatch: actual codec=Lucene84PostingsWriterDoc vs expected
> codec=Lucene90PostingsWriterDoc
> (resource=MMapIndexInput(path="C:\data\lucene\_1i8ye_FST50_0.doc"))
> at
> org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:204)
> at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:193)
> at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:253)
> at
>
> org.apache.lucene.codecs.lucene90.Lucene90PostingsReader.(Lucene90PostingsReader.java:84)
> at
>
> org.apache.lucene.codecs.memory.FSTPostingsFormat.fieldsProducer(FSTPostingsFormat.java:60)
> ...
> Files with UniformSplit has the same error. Are these codecs not backward
> compatible? Is there any workaround?
> Thank you!
>


-- 
Adrien


Re: How to explain Lucene's ranking algorithm to someone who is not technical?

2021-04-19 Thread Adrien Grand
1. This isn't true. Your query has 10 terms. A document that poorly matches
all 10 terms will rank lower than a document that has great matches for 9
of the 10 terms. However it's true that having more matches usually
correlates with better scores since the final score of a boolean query is
the sum of the scores of all wrapped term queries, so having more matches
helps get higher scores.

2. Assuming that all documents have the same length, this is correct.

3. This is incorrect. Proximity is ignored by default. You can boost based
on proximity data by adding optional phrase clauses, but this is not the
default behavior. And I can't find the paper, but I remember reading one
that said that contrary to intuition, leveraging proximity didn't actually
improve relevancy.

4. This is correct.

5. It would be correct if you replaced "complex / longer" with "rarest".

6. This is incorrect, proximity is not taken into account for boolean
queries.
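A small, self-contained illustration of the point in #1, that a boolean query's score is the sum of the matching term scores (assumes Lucene 8.x/9.x; documents and field names are invented):

```java
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;

public class ScoreSumDemo {
  public static String topDocId() throws Exception {
    ByteBuffersDirectory dir = new ByteBuffersDirectory();
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new WhitespaceAnalyzer()))) {
      addDoc(writer, "A", "to be or not to be that is the question");
      addDoc(writer, "B", "is the question");
      addDoc(writer, "C", "something else entirely");
    }
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      // Each matching SHOULD clause contributes its term score, and the
      // document score is the sum, so matching more query terms usually
      // (but not always) ranks higher.
      BooleanQuery query = new BooleanQuery.Builder()
          .add(new TermQuery(new Term("body", "question")), Occur.SHOULD)
          .add(new TermQuery(new Term("body", "be")), Occur.SHOULD)
          .build();
      TopDocs top = searcher.search(query, 3);
      return searcher.doc(top.scoreDocs[0].doc).get("id");
    }
  }

  private static void addDoc(IndexWriter writer, String id, String body) throws Exception {
    Document doc = new Document();
    doc.add(new StringField("id", id, Store.YES));
    doc.add(new TextField("body", body, Store.NO));
    writer.addDocument(doc);
  }

  public static void main(String[] args) throws Exception {
    System.out.println(topDocId());
  }
}
```

Here document A, which matches both "question" and "be", outranks document B, which matches only "question", even though B is shorter.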

On Mon, Apr 19, 2021 at 1:10 AM Steven White  wrote:

> Hi everyone,
>
> If you are asked to explain how Lucene's algorithm works, to someone who is
> not technical and doesn't understand math, how do you go about doing so?
>
> I'm going to list what I see as key points to use but please correct me
> where correction is needed and do add where addition is needed.  Here are
> the talking points I can think of.
>
> Search terms are: "to be or not to be, that is the question".  The examples
> below are simple term searches (no booleans, no phrases, no fields, etc.)
>
> 1) Documents that contain all or most of the search terms are ranked
> highest.
> hit #1: ... ... to be or not to be, that is the question ... ...
> hit #2: ... ... to be, that is the question ... ...
> hit #3: ... ... is the question  ... ...
>
> 2) Documents that contain all or most of the search terms, more often than
> other documents are ranked higher.
> hit #1: ... ... to be or not to be, that is the question and is still the
> question ... ...
> hit #2: ... ... to be or not to be, that is the question ... ...
> hit #3: ... ... to be, that is the question ... ...
>
> 3) Documents that contain the search terms closer to each other are ranked
> higher
> hit #1: ... ... to be or not to be, that is the question ... ...
> hit #2: ... ... to be or not to be, is what being asked, that is the
> question ... ...
> hit #3: ... ... is the question  ... ...
>
> 4) Documents that contain the exact search terms, including number of times
> search terms occur, the smaller document is ranked higher
> hit #1: to be or not to be, that is the question
> hit #2: ... ... to be or not to be, that is the question ... ...
>
> 5) Documents that contain more of the complex / longer terms are ranked
> higher than those containing more of the lighter terms.
> hit #1: ... ... to be or not to be, that is the question and is still the
> question to question  ... ...
> hit #2: ... ... to be or not to be and to be or not to be, and to be or not
> to be, that is the question ... ...
>
> 6) Documents that contain search terms, match the order, are ranked higher:
> hit #1: ... ... to be or not to be, that is the question ... ...
> hit #2: ... ... question the that is be not to be or be ... ...
>
> I think I get all the above right (I'm not sure about #6).
>
> Thanks
>
> Steven
>


-- 
Adrien


Re: Impact and WAND

2021-04-16 Thread Adrien Grand
Hi,

Indeed BMW is only about disjunctions but the paper (
http://engineering.nyu.edu/~suel/papers/bmw.pdf) shortly describes how
block max indexes can be used to speed up conjunctions as well using a
simple algorithm that they call Block Max And, which we implemented in the
BlockMaxConjunctionScorer class.

On Fri, Apr 16, 2021 at 6:51 PM, Tomás Fernández Löbbe wrote:

> I was looking at the nightly benchmarks[1] and noticed a big jump in
> performance for conjunction queries when LUCENE-8060 was merged. I was
> puzzled because I didn't expect BMW to help in this type of queries, but I
> guess that's the "other optimizations" you were talking about? Do you have
> any pointers to those?
>
>
> [1] https://home.apache.org/~mikemccand/lucenebench
>
> On Thu, Jul 11, 2019 at 6:02 AM Atri Sharma  wrote:
>
> > Note that any other scoring mode (COMPLETE or COMPLETE_NO_SCORES) will
> > mandatorily visit all hits, so there is no scope of skipping and hence
> > no point of using impacts.
> >
> > On Thu, Jul 11, 2019 at 8:51 AM Wu,Yunfeng 
> wrote:
> > >
> > >
> > > @Adrien Grand <jpou...@gmail.com>. Thanks
> for
> > your reply.
> > >
> > > The explanation ` skip low-scoring matches` is great,  I  looked up
> some
> > docs and inspect some related code.
> > >
> > > I noticed the `block-max WAND` mode only works when
> > ScoreMode.TOP_SCORES is used, is that right?  (The basic TermQuery would
> > generate ImpactDISI with scoreMode is TOP_SCORES.)
> > >
> > > Lucene compute max score per block and then cached in `MaxScoreCache` ,
> > this means we can skip low-scoring block( current one block 128 DocIds)
> > and in competitive block  still need to score any docId as seen,   I
> > confused with  `MaxScoreCache#getMaxScoreForLevel(int level)`, what the
> > level mean? Skip level?  (Somewhere invoke this method pass one Integer
> > upTo param)
> > >
> > > Thanks Lucene Team
> > >
> > >
> > > On Jul 10, 2019, at 10:52 PM, Adrien Grand <jpou...@gmail.com> wrote:
> > >
> > > To clarify, the scoring process is not accelerated because we
> > > terminate early but because we can skip low-scoring matches (there
> > > might be competitive hits at the very end of the index).
> > >
> > > CompetitiveImpactAccumulator is indeed related to WAND. It helps store
> > > the maximum score impacts per block of documents in postings lists.
> > > Then this information is leveraged by block-max WAND in order to skip
> > > low-scoring blocks.
> > >
> > > This does indeed help avoid reading norms, but also document IDs and
> > > term frequencies.
> > >
> > > On Wed, Jul 10, 2019 at 4:10 PM Wu,Yunfeng <wuyunfen...@baidu.com> wrote:
> > >
> > > Hi,
> > >
> > > We discuss some topic from
> > https://github.com/apache/lucene-solr/pull/595. As Atri Sharma propose
> > discuss with the java dev list.
> > >
> > >
> > > Impact `frequency ` and `norm ` just to accelerate the `score process`
> > which  `terminate early`.
> > >
> > > In impact mode, `CompetitiveImpactAccumulator` will record (freq, norm)
> > pair , would stored at index level. Also I noted
> > `CompetitiveImpactAccumulator` commented with `This class accumulates the
> > (freq, norm) pairs that may produce competitive scores`,  maybe related
> to
> > `WAND`?
> > >
> > >
> > > The norm value which produced or consumed by `Lucene80NormsFormat`.
> > >
> > > In this ` Impact way`, we can avoid read norms from
> > `Lucene80NormsProducer` that may generate the extra IO?  ( the norm value
> > Lucene stored twice.)and take full advantage of the WAND method?
> > >
> > >
> > >
> > > --
> > > Adrien
> > >
> >
> >
> >
>


Re: Slower fetch document after upgrade >=8.7

2021-04-08 Thread Adrien Grand
I just opened https://issues.apache.org/jira/browse/LUCENE-9917.

On Thu, Apr 8, 2021 at 4:34 PM Никита Михайлов 
wrote:

> Thank you. Understood the policy
>
> > I have some changes to stored fields on my plate, I'll include this
> change as well.
>
> Is there a ticket for this change?
>
> чт, 8 апр. 2021 г. в 15:30, Adrien Grand :
> >
> > Actually, we don't plan to have flexible settings even for advanced
> > developers. Our stance on these discussions is that we should be
> > opinionated about the default codec and not offer any options. Rather
> than
> > exposing advanced settings for advanced users, these advanced users can
> > build their own codec and take care of backward compatibility themselves.
> >
> > On Thu, Apr 8, 2021 at 10:11 AM Никита Михайлов <
> mihaylovniki...@gmail.com>
> > wrote:
> >
> > > Thanks for the reply.
> > >
> > > The problem of understanding. You can make flexible settings for
> > > advanced developers, leaving two facets by default. In tests, check
> > > these facets
> > > Never change them so that the developers themselves explicitly set the
> > > settings. IMHO, I think this will help to avoid such problems
> > >
> > > OK. Have a ticket?
> > >
> > > чт, 8 апр. 2021 г. в 13:52, Adrien Grand :
> > > >
> > > > Thanks for the feedback.
> > > >
> > > > We don't want to offer too many choices, as it complicates backward
> > > > compatibility testing, and want to stick to two options at most.
> > > >
> > > > Since this is the second time I'm seeing this feedback, I'm inclined
> to
> > > > reduce the block size for BEST_SPEED in order to trade a bit of
> > > compression
> > > > ratio for better decompression speed. I have some changes to stored
> > > fields
> > > > on my plate, I'll include this change as well.
> > > >
> > > > On Thu, Apr 8, 2021 at 7:04 AM Никита Михайлов <
> > > mihaylovniki...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi
> > > > > BEST_SPEED has been changed in LUCENE-9447 and LUCENE-9486. For
> this
> > > > > reason, retrieving data from elasticsearch has slowed down by
> 10-20%.
> > > When
> > > > > there is a lot of data, this is critical
> > > > > Can developers leave the choice of which codec to use: LZ4(16kB)
> (old
> > > > > BEST_SPEED) or LZ4 with preset dict(BEST_SPEED_SAVING_DISKSIZE)? Or
> > > make
> > > > > more flexible settings?
> > > > >
> > > > > Otherwise, such changes may be a blocker or will have to spend
> money on
> > > > > buying new hardware
> > > > >
> > > >
> > > >
> > > > --
> > > > Adrien
> > >
> > >
> > >
> >
> > --
> > Adrien
>
>
>

-- 
Adrien


Re: Slower search after 8.5.x to >=8.6

2021-04-08 Thread Adrien Grand
FSDirectory#open is just a utility method that tries to pick the best
Directory implementation based on the platform, it's most likely
MMapDirectory for you, which is the directory implementation we use on all
64-bit platforms. So it's intriguing that you are seeing a slowdown with
MMapDirectory but not with FSDirectory#open. To my knowledge, Elasticsearch
is not doing anything special that could explain why MMapDirectory is slow
with Elasticsearch yet fast with Lucene.

Regardless of the Directory implementation, it's surprising that term
lookups be the bottleneck for query execution. It's usually more a
bottleneck when indexing with IndexWriter#updateDocument, which needs to
perform one ID lookup for every indexed document. I guess that the queries
that you are running match so few hits that very little time is spent
reading postings, is that correct? But then that would also mean that your
queries are running very fast, likely in the order of a few millis? Or
maybe you have misconfigured your merge policy in a way that makes your
indices have so many segments that terms dictionary lookups may be a
bottleneck?
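For anyone unsure what `FSDirectory#open` actually returns, here is a tiny sketch (the temporary directory is just for illustration; the chosen implementation depends on the platform, and is MMapDirectory on typical 64-bit JVMs):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.lucene.store.FSDirectory;

public class DirectoryChoiceDemo {
  public static String chosenImpl() throws Exception {
    Path path = Files.createTempDirectory("lucene-demo");
    try (FSDirectory dir = FSDirectory.open(path)) {
      // FSDirectory.open picks the implementation it considers best for the
      // current platform rather than a fixed one.
      return dir.getClass().getSimpleName();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(chosenImpl());
  }
}
```

If this prints MMapDirectory for you, then forcing NIOFSDirectory (or any other explicit implementation) is the only difference from the default.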

On Thu, Apr 8, 2021 at 1:40 PM Никита Михайлов 
wrote:

> Thanks for the answer
> NIOFSDirectory is like an example. Degradation is also on
> MMapDirectory and SimpleFSDirectory
>
> We are using elasticseach and it has: simplefs (SimpleFsDirectory),
> niofs (NIOFSDirectory), mmapfs (MMapDirectory) and hybridfs
> (NIOFSDirectory + MMapDirectory). And for us, while niofs was a little
> faster than other stores
>
> Yes, FSDirectory works fast (both commits), but now it is difficult to
> test on prod Elasticsearch.
> But why is FSDirectory fast? How to understand this?
>
> чт, 8 апр. 2021 г. в 13:49, Adrien Grand :
> >
> > Hello,
> >
> > Why are you forcing NIOFSDirectory instead of using Lucene's defaults via
> > FSDirectory#open? I wonder if this might contribute to the slowdown you
> are
> > seeing given that access to the terms index tends to be a bit random.
> >
> > It's very unlikely we'll add back a toggle for this as there is no point
> in
> > holding the terms index in JVM heap when it could live in the OS cache
> > instead.
> >
> > On Thu, Apr 8, 2021 at 7:57 AM Никита Михайлов <
> mihaylovniki...@gmail.com>
> > wrote:
> >
> > > Hi. I noticed that after the upgrade from Lucene8.5.x to Lucene >=8.6,
> > >  search became slower(example TopScoreDocCollector became 20-30%
> slower,
> > > from ElasticSearch - 50%).
> > >
> > > While testing, I realized that it happened after LUCENE-9257(commit
> > > e7a61ea). Bug or feature? Can a setting be added for isOffHeap? To make the
> > > developer explicitly make this choice
> > >
> > > Added a file that shows a simple demo that the search is slow
> > > Need to run on commit e7a61ea and 90aced5, you will notice how the
> speed
> > > drops to 30%
> > >
> >
> >
> >
> > --
> > Adrien
>
>
>

-- 
Adrien


Re: Slower fetch document after upgrade >=8.7

2021-04-08 Thread Adrien Grand
Actually, we don't plan to have flexible settings even for advanced
developers. Our stance on these discussions is that we should be
opinionated about the default codec and not offer any options. Rather than
exposing advanced settings for advanced users, these advanced users can
build their own codec and take care of backward compatibility themselves.

On Thu, Apr 8, 2021 at 10:11 AM Никита Михайлов 
wrote:

> Thanks for the reply.
>
> The problem of understanding. You can make flexible settings for
> advanced developers, leaving two facets by default. In tests, check
> these facets
> Never change them so that the developers themselves explicitly set the
> settings. IMHO, I think this will help to avoid such problems
>
> OK. Have a ticket?
>
> чт, 8 апр. 2021 г. в 13:52, Adrien Grand :
> >
> > Thanks for the feedback.
> >
> > We don't want to offer too many choices, as it complicates backward
> > compatibility testing, and want to stick to two options at most.
> >
> > Since this is the second time I'm seeing this feedback, I'm inclined to
> > reduce the block size for BEST_SPEED in order to trade a bit of
> compression
> > ratio for better decompression speed. I have some changes to stored
> fields
> > on my plate, I'll include this change as well.
> >
> > On Thu, Apr 8, 2021 at 7:04 AM Никита Михайлов <
> mihaylovniki...@gmail.com>
> > wrote:
> >
> > > Hi
> > > BEST_SPEED has been changed in LUCENE-9447 and LUCENE-9486. For this
> > > reason, retrieving data from elasticsearch has slowed down by 10-20%.
> When
> > > there is a lot of data, this is critical
> > > Can developers leave the choice of which codec to use: LZ4(16kB) (old
> > > BEST_SPEED) or LZ4 with preset dict(BEST_SPEED_SAVING_DISKSIZE)? Or
> make
> > > more flexible settings?
> > >
> > > Otherwise, such changes may be a blocker or will have to spend money on
> > > buying new hardware
> > >
> >
> >
> > --
> > Adrien
>
>
>

-- 
Adrien


Re: Slower fetch document after upgrade >=8.7

2021-04-08 Thread Adrien Grand
Thanks for the feedback.

We don't want to offer too many choices, as it complicates backward
compatibility testing, and want to stick to two options at most.

Since this is the second time I'm seeing this feedback, I'm inclined to
reduce the block size for BEST_SPEED in order to trade a bit of compression
ratio for better decompression speed. I have some changes to stored fields
on my plate, I'll include this change as well.
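
For context, the BEST_SPEED/BEST_COMPRESSION trade-off being discussed is the one applications select through the codec on IndexWriterConfig. A hedged sketch against the 8.7 API (it assumes the Lucene87Codec(Mode) constructor; the concrete codec class name changes between releases, so check the one shipped with your version):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.lucene87.Lucene87Codec;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

public class StoredFieldsModeDemo {
  /** Indexes one doc with an explicit stored-fields mode and reads it back. */
  public static String roundTrip() throws Exception {
    ByteBuffersDirectory dir = new ByteBuffersDirectory();
    IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
    // BEST_SPEED favors stored-field retrieval latency; swap in
    // Mode.BEST_COMPRESSION to trade retrieval speed for a smaller index.
    cfg.setCodec(new Lucene87Codec(Lucene87Codec.Mode.BEST_SPEED));
    try (IndexWriter w = new IndexWriter(dir, cfg)) {
      Document doc = new Document();
      doc.add(new StoredField("payload", "some stored value"));
      w.addDocument(doc);
    }
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      return reader.document(0).get("payload");
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(roundTrip()); // prints "some stored value"
  }
}
```

Note that the non-default codec choice shifts backward-compatibility responsibility to the application, as mentioned earlier in this thread.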

On Thu, Apr 8, 2021 at 7:04 AM Никита Михайлов 
wrote:

> Hi
> BEST_SPEED has been changed in LUCENE-9447 and LUCENE-9486. For this
> reason, retrieving data from elasticsearch has slowed down by 10-20%. When
> there is a lot of data, this is critical
> Can developers leave the choice of which codec to use: LZ4(16kB) (old
> BEST_SPEED) or LZ4 with preset dict(BEST_SPEED_SAVING_DISKSIZE)? Or make
> more flexible settings?
>
> Otherwise, such changes may be a blocker or will have to spend money on
> buying new hardware
>


-- 
Adrien


Re: Slower search after 8.5.x to >=8.6

2021-04-08 Thread Adrien Grand
Hello,

Why are you forcing NIOFSDirectory instead of using Lucene's defaults via
FSDirectory#open? I wonder if this might contribute to the slowdown you are
seeing given that access to the terms index tends to be a bit random.

It's very unlikely we'll add back a toggle for this as there is no point in
holding the terms index in JVM heap when it could live in the OS cache
instead.
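
The default Adrien refers to looks like this; a small self-contained sketch (the temporary path exists only to make it runnable):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class DirectoryDefaultsDemo {
  /** Opens a directory the recommended way and reports the chosen implementation. */
  public static String defaultDirectoryClass() throws Exception {
    Path indexPath = Files.createTempDirectory("lucene-demo"); // illustrative location
    // FSDirectory.open picks the best implementation for the platform,
    // typically MMapDirectory on 64-bit JVMs, so structures like the terms
    // index can live in the OS page cache instead of on the JVM heap.
    try (Directory dir = FSDirectory.open(indexPath)) {
      return dir.getClass().getSimpleName();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(defaultDirectoryClass()); // e.g. "MMapDirectory" on 64-bit platforms
  }
}
```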

On Thu, Apr 8, 2021 at 7:57 AM Никита Михайлов 
wrote:

> Hi. I noticed that after the upgrade from Lucene8.5.x to Lucene >=8.6,
>  search became slower(example TopScoreDocCollector became 20-30% slower,
> from ElasticSearch - 50%).
>
> While testing, I realized that it happened after LUCENE-9257(commit
> e7a61ea). Bug or feature? Can add settings for isOffHeep? To make the
> developer explicitly make this choice
>
> Added a file that shows a simple demo that the search is slow
> Need to run on commit e7a61ea and 90aced5, you will notice how the speed
> drops to 30%
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org



-- 
Adrien


Re: Interface IndexReader.CacheHelper

2021-03-29 Thread Adrien Grand
Hi Baris,

I created a PR that adds an example to the javadocs at
https://github.com/apache/lucene/pull/50. Could you have a look and let me
know if that is the sort of additional information that you were looking
for?
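
In the meantime, a hedged sketch of the kind of usage such an example could show: keying a per-reader cache by CacheHelper#getKey and evicting through a closed listener. The cache itself is hypothetical; only the CacheHelper calls are the real API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

public class CacheHelperDemo {
  // Hypothetical per-reader cache, keyed by the reader's cache key.
  static final Map<IndexReader.CacheKey, String> CACHE = new ConcurrentHashMap<>();

  public static boolean demo() throws Exception {
    ByteBuffersDirectory dir = new ByteBuffersDirectory();
    try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      w.addDocument(new Document());
    }
    DirectoryReader reader = DirectoryReader.open(dir);
    IndexReader.CacheHelper helper = reader.getReaderCacheHelper();
    if (helper == null) {
      return false; // this reader does not support caching
    }
    // Cache something expensive under this reader's identity...
    CACHE.put(helper.getKey(), "expensive per-reader value");
    // ...and evict it automatically once the reader is closed.
    helper.addClosedListener(key -> CACHE.remove(key));
    reader.close();
    return CACHE.isEmpty(); // the closed listener evicted the entry on close
  }

  public static void main(String[] args) throws Exception {
    System.out.println(demo()); // prints "true"
  }
}
```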

On Fri, Mar 26, 2021 at 10:30 PM  wrote:

> Hi,-
>
>
> https://lucene.apache.org/core/8_5_2/core/org/apache/lucene/index/IndexReader.CacheHelper.html?is-external=true
>
>   it would be nice to have more detailed explanation and maybe an
> example for this interesting interface?
>
> Best regards
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-- 
Adrien


Re: NullPointerException in LongComparator.setTopValue

2021-03-18 Thread Adrien Grand
Hi Michael,

At first sight, this looks more like an Elasticsearch bug than like a
Lucene bug to me. Can you file an issue at
https://github.com/elastic/elasticsearch and share the search request than
you are running?

On Thu, Mar 18, 2021 at 11:52 AM Michael Grafl - SKIDATA <
michael.gr...@skidata.com> wrote:

> Hi all,
>
> I get a NullPointerException using Elasticsearch 7.9.1 with Lucene Core
> 8.6.2 CentOS 7:
>
> "stacktrace":
> ["org.elasticsearch.action.search.SearchPhaseExecutionException: all shards
> failed",
> "at
> org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:551)
> [elasticsearch-7.9.1.jar:7.9.1]",
> "at
> org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:309)
> [elasticsearch-7.9.1.jar:7.9.1]",
> "at
> org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:582)
> [elasticsearch-7.9.1.jar:7.9.1]",
> "at
> org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:393)
> [elasticsearch-7.9.1.jar:7.9.1]",
> "at
> org.elasticsearch.action.search.AbstractSearchAsyncAction.access$100(AbstractSearchAsyncAction.java:68)
> [elasticsearch-7.9.1.jar:7.9.1]",
> "at
> org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:245)
> [elasticsearch-7.9.1.jar:7.9.1]",
> "at
> org.elasticsearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:73)
> [elasticsearch-7.9.1.jar:7.9.1]",
> "at
> org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59)
> [elasticsearch-7.9.1.jar:7.9.1]",
> "at
> org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:403)
> [elasticsearch-7.9.1.jar:7.9.1]",
> "at
> org.elasticsearch.transport.TransportService$6.handleException(TransportService.java:638)
> [elasticsearch-7.9.1.jar:7.9.1]",
> "at
> org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1172)
> [elasticsearch-7.9.1.jar:7.9.1]",
> "at
> org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1281)
> [elasticsearch-7.9.1.jar:7.9.1]",
> "at
> org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1255)
> [elasticsearch-7.9.1.jar:7.9.1]",
> "at
> org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:61)
> [elasticsearch-7.9.1.jar:7.9.1]",
> "at
> org.elasticsearch.transport.TransportChannel.sendErrorResponse(TransportChannel.java:56)
> [elasticsearch-7.9.1.jar:7.9.1]",
> "at
> org.elasticsearch.action.support.ChannelActionListener.onFailure(ChannelActionListener.java:51)
> [elasticsearch-7.9.1.jar:7.9.1]",
> "at
> org.elasticsearch.search.SearchService.lambda$runAsync$0(SearchService.java:414)
> [elasticsearch-7.9.1.jar:7.9.1]",
> "at
> org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44)
> [elasticsearch-7.9.1.jar:7.9.1]",
> "at
> org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:710)
> [elasticsearch-7.9.1.jar:7.9.1]",
> "at
> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
> [elasticsearch-7.9.1.jar:7.9.1]",
> "at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
> [?:?]",
> "at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
> [?:?]",
> "at java.lang.Thread.run(Thread.java:832) [?:?]",
> "Caused by: org.elasticsearch.ElasticsearchException$1: Cannot invoke
> \"java.lang.Long.longValue()\" because \"value\" is null",
> "at
> org.elasticsearch.ElasticsearchException.guessRootCauses(ElasticsearchException.java:644)
> ~[elasticsearch-7.9.1.jar:7.9.1]",
> "at
> org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:307)
> [elasticsearch-7.9.1.jar:7.9.1]",
> "... 21 more",
> "Caused by: java.lang.NullPointerException: Cannot invoke
> \"java.lang.Long.longValue()\" because \"value\" is null",
> "at
> org.apache.lucene.search.FieldComparator$LongComparator.setTopValue(FieldComparator.java:392)
> ~[lucene-core-8.6.2.jar:8.6.2 016993b65e393b58246d54e8ddda9f56a453eb0e -
> ivera - 2020-08-26 10:53:36]",
> "at
> org.apache.lucene.search.FieldComparator$LongComparator.setTopValue(FieldComparator.java:348)
> ~[lucene-core-8.6.2.jar:8.6.2 016993b65e393b58246d54e8ddda9f56a453eb0e -
> ivera - 2020-08-26 10:53:36]",
> "at
> org.apache.lucene.search.TopFieldCollector$PagingFieldCollector.<init>(TopFieldCollector.java:210)
> ~[lucene-core-8.6.2.jar:8.6.2 016993b65e393b58246d54e8ddda9f56a453eb0e -
> ivera - 2020-08-26 10:53:36]",
> "at
> 

Re: BigIntegerPoint

2021-02-27 Thread Adrien Grand
It's indeed working. As Robert suggested, it's in the sandbox more because
it's unclear if it is really needed than because it is unstable.

The few data points I have suggest that among the users for whom LongPoint
is not enough, there are more users who need unsigned 64-bit integers than
true 128-bit integers.
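
For anyone evaluating it, a small usage sketch (requires the lucene-sandbox module; field name and values are invented for the example):

```java
import java.math.BigInteger;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.BigIntegerPoint; // from lucene-sandbox
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.ByteBuffersDirectory;

public class BigIntegerPointDemo {
  /** Indexes one value larger than Long.MAX_VALUE and counts it via a range query. */
  public static int countInRange() throws Exception {
    ByteBuffersDirectory dir = new ByteBuffersDirectory();
    try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      Document doc = new Document();
      // 2^64 - 1: exactly the "unsigned 64-bit" case where LongPoint is not enough.
      doc.add(new BigIntegerPoint("id", new BigInteger("18446744073709551615")));
      w.addDocument(doc);
    }
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      return searcher.count(BigIntegerPoint.newRangeQuery(
          "id", BigInteger.ZERO, BigIntegerPoint.MAX_VALUE));
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(countInRange()); // prints 1
  }
}
```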

On Sat, Feb 27, 2021 at 11:20 AM Michael Kleen  wrote:

> Looking at TestBigIntegerPoint it seems to me that the core functionality
> of BigIntegerPoint
> is working. Is there any reason you would advise me for not using it ? Is
> there anything
> missing ?
>
> > On 26. Feb 2021, at 22:14, Robert Muir  wrote:
> >
> > It was added to the sandbox originally (along with InetAddressPoint for
> ip
> > addresses) and just never graduated from there:
> > https://issues.apache.org/jira/browse/LUCENE-7043
> >
> > The InetAddressPoint was moved to core because it seems pretty common
> that
> > people want to do range queries on IP hosts and so on. So it got love.
> >
> > Not many people need 128-bit range queries I suppose?
> >
> > On Fri, Feb 26, 2021 at 1:25 PM Michael Kleen  wrote:
> >
> >> Hello,
> >>
> >> I am interested in using BigIntegerPoint. What is the reason that it is
> >> part of the sandbox ? Is it ready for use ?
> >>
> >> Many thanks,
> >>
> >> Michael
> >>
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Slower document retrieval in 8.7.0 comparing to 7.5.0

2020-12-03 Thread Adrien Grand
Hello Martynas,

There have indeed been changes related to stored fields in 8.7. What does
your workload look like and how large are your documents on average?

On Thu, Dec 3, 2020 at 3:04 PM Martynas L  wrote:

> Hi,
> We've migrated from 7.5.0 to 8.7.0 and found that index "searching"
> is significantly (4-5 times) slower in the latest version.
> It seems that
> org.apache.lucene.search.IndexSearcher#doc(int)
> is slower.
>
> Is it possible to have similar performance with 8.7.0?
>
> Best regards,
> Martynas
>


-- 
Adrien


Re: Lucene 8.7 error searching an index created with 8.3

2020-11-24 Thread Adrien Grand
This is related to phrase matching indeed. Positions are stored in blocks
of 128 values, where every block is encoded with a different number of bits
per value. And the error you are seeing suggests that one block reports 69
bits per value.

The fact that CheckIndex didn't complain is surprising. Did you only verify
checksums (the -fast option) or did you run the full CheckIndex?

Is your problem reproducible? If yes, does it still reproduce if you move
to a recent JVM?

On Tue, Nov 24, 2020 at 3:22 PM Nicolás Lichtmaier 
wrote:

> Lucene 8.7's CheckIndex says there are no errors in the index.
>
> On closer inspection this seems related to phrase matching...
>
> El 24/11/20 a las 05:18, Adrien Grand escribió:
> > Can you run CheckIndex on your index to make sure it is not corrupt?
> >
> > On Tue, Nov 24, 2020 at 1:01 AM Nicolás Lichtmaier
> >  wrote:
> >
> >> I'm seeing errors like this one (using backwards codecs):
> >>
> >> java.lang.ArrayIndexOutOfBoundsException: Index 69 out of bounds for
> >> length 33
> >>   at
> >> org.apache.lucene.codecs.lucene50.ForUtil.readBlock(ForUtil.java:196)
> >>   at
> >>
> >>
> org.apache.lucene.codecs.lucene50.Lucene50PostingsReader$EverythingEnum.refillPositions(Lucene50PostingsReader.java:721)
> >>   at
> >>
> >>
> org.apache.lucene.codecs.lucene50.Lucene50PostingsReader$EverythingEnum.nextPosition(Lucene50PostingsReader.java:924)
> >>   at
> >>
> >>
> org.apache.lucene.search.PhrasePositions.nextPosition(PhrasePositions.java:57)
> >>   at
> >>
> >>
> org.apache.lucene.search.SloppyPhraseMatcher.advancePP(SloppyPhraseMatcher.java:262)
> >>   at
> >>
> >>
> org.apache.lucene.search.SloppyPhraseMatcher.nextMatch(SloppyPhraseMatcher.java:173)
> >>   at
> >> org.apache.lucene.search.PhraseScorer$1.matches(PhraseScorer.java:58)
> >>   at
> >>
> >>
> org.apache.lucene.search.DoubleValuesSource$WeightDoubleValuesSource$1.advanceExact(DoubleValuesSource.java:631)
> >>   at
> >>
> >>
> org.apache.lucene.queries.function.FunctionScoreQuery$QueryBoostValuesSource$1.advanceExact(FunctionScoreQuery.java:343)
> >>   at
> >>
> org.apache.lucene.search.DoubleValues$1.advanceExact(DoubleValues.java:53)
> >>   at
> >>
> org.apache.lucene.search.DoubleValues$1.advanceExact(DoubleValues.java:53)
> >>   at
> >>
> >>
> org.apache.lucene.queries.function.FunctionScoreQuery$MultiplicativeBoostValuesSource$1.advanceExact(FunctionScoreQuery.java:270)
> >>   at
> >>
> >>
> org.apache.lucene.queries.function.FunctionScoreQuery$FunctionScoreWeight$1.score(FunctionScoreQuery.java:228)
> >>   at
> >>
> >>
> org.apache.lucene.search.DisjunctionMaxScorer.score(DisjunctionMaxScorer.java:67)
> >>   at
> >>
> >>
> org.apache.lucene.search.DisjunctionScorer.score(DisjunctionScorer.java:194)
> >>   at
> >>
> >>
> org.apache.lucene.search.DoubleValuesSource$2.doubleValue(DoubleValuesSource.java:344)
> >>   at
> >>
> >>
> org.apache.lucene.queries.function.FunctionScoreQuery$MultiplicativeBoostValuesSource$1.doubleValue(FunctionScoreQuery.java:265)
> >>   at
> >>
> >>
> org.apache.lucene.queries.function.FunctionScoreQuery$FunctionScoreWeight$1.score(FunctionScoreQuery.java:229)
> >>   at
> >>
> >>
> org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector$1.collect(TopScoreDocCollector.java:76)
> >>   at
> >>
> org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:276)
> >>   at
> >> org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:232)
> >>   at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
> >>   at
> >> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:661)
> >>   at
> >> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445)
> >>   at
> >> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:574)
> >>   at
> >>
> org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:421)
> >>   at
> >> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:432)
> >>
> >> They seem to be connected with double values stored as "docvalues" and
> >> used in formulas to affect the scores.
> >>
> >> Is there any known incompatibility? Is this something that should work?
> >> Must I rebuild the indices with 8.7? (that would be bad for our usecase
> >> here)
> >>
> >> Thanks!
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >>
>


-- 
Adrien


Re: Lucene 8.7 error searching an index created with 8.3

2020-11-24 Thread Adrien Grand
Can you run CheckIndex on your index to make sure it is not corrupt?

On Tue, Nov 24, 2020 at 1:01 AM Nicolás Lichtmaier
 wrote:

> I'm seeing errors like this one (using backwards codecs):
>
> java.lang.ArrayIndexOutOfBoundsException: Index 69 out of bounds for
> length 33
>  at
> org.apache.lucene.codecs.lucene50.ForUtil.readBlock(ForUtil.java:196)
>  at
>
> org.apache.lucene.codecs.lucene50.Lucene50PostingsReader$EverythingEnum.refillPositions(Lucene50PostingsReader.java:721)
>  at
>
> org.apache.lucene.codecs.lucene50.Lucene50PostingsReader$EverythingEnum.nextPosition(Lucene50PostingsReader.java:924)
>  at
>
> org.apache.lucene.search.PhrasePositions.nextPosition(PhrasePositions.java:57)
>  at
>
> org.apache.lucene.search.SloppyPhraseMatcher.advancePP(SloppyPhraseMatcher.java:262)
>  at
>
> org.apache.lucene.search.SloppyPhraseMatcher.nextMatch(SloppyPhraseMatcher.java:173)
>  at
> org.apache.lucene.search.PhraseScorer$1.matches(PhraseScorer.java:58)
>  at
>
> org.apache.lucene.search.DoubleValuesSource$WeightDoubleValuesSource$1.advanceExact(DoubleValuesSource.java:631)
>  at
>
> org.apache.lucene.queries.function.FunctionScoreQuery$QueryBoostValuesSource$1.advanceExact(FunctionScoreQuery.java:343)
>  at
> org.apache.lucene.search.DoubleValues$1.advanceExact(DoubleValues.java:53)
>  at
> org.apache.lucene.search.DoubleValues$1.advanceExact(DoubleValues.java:53)
>  at
>
> org.apache.lucene.queries.function.FunctionScoreQuery$MultiplicativeBoostValuesSource$1.advanceExact(FunctionScoreQuery.java:270)
>  at
>
> org.apache.lucene.queries.function.FunctionScoreQuery$FunctionScoreWeight$1.score(FunctionScoreQuery.java:228)
>  at
>
> org.apache.lucene.search.DisjunctionMaxScorer.score(DisjunctionMaxScorer.java:67)
>  at
>
> org.apache.lucene.search.DisjunctionScorer.score(DisjunctionScorer.java:194)
>  at
>
> org.apache.lucene.search.DoubleValuesSource$2.doubleValue(DoubleValuesSource.java:344)
>  at
>
> org.apache.lucene.queries.function.FunctionScoreQuery$MultiplicativeBoostValuesSource$1.doubleValue(FunctionScoreQuery.java:265)
>  at
>
> org.apache.lucene.queries.function.FunctionScoreQuery$FunctionScoreWeight$1.score(FunctionScoreQuery.java:229)
>  at
>
> org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector$1.collect(TopScoreDocCollector.java:76)
>  at
> org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:276)
>  at
> org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:232)
>  at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
>  at
> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:661)
>  at
> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445)
>  at
> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:574)
>  at
> org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:421)
>  at
> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:432)
>
> They seem to be connected with double values stored as "docvalues" and
> used in formulas to affect the scores.
>
> Is there any known incompatibility? Is this something that should work?
> Must I rebuild the indices with 8.7? (that would be bad for our usecase
> here)
>
> Thanks!
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-- 
Adrien


Re: BooleanQuery: BooleanClause.Occur.MUST_NOT seems to require at least one BooleanClause.Occur.MUST

2020-11-06 Thread Adrien Grand
Hi Nissim,

This is by design: boolean queries that don't have positive clauses like
empty boolean queries or boolean queries that only consist of negative
(MUST_NOT) clauses don't match any hits.
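
A small self-contained sketch of this behavior (class, field, and values are made up for the example): a purely negative BooleanQuery matches nothing, and adding MatchAllDocsQuery as the positive clause is the usual way to express "everything except".

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;

public class MustNotDemo {
  /** Returns {count of MUST_NOT-only query, count of MatchAll + MUST_NOT query}. */
  public static int[] counts() throws Exception {
    ByteBuffersDirectory dir = new ByteBuffersDirectory();
    try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      for (String color : new String[] {"red", "blue"}) {
        Document doc = new Document();
        doc.add(new StringField("color", color, Field.Store.NO));
        w.addDocument(doc);
      }
    }
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);

      // Only a negative clause: there is no positive clause to produce
      // candidate documents, so this matches nothing.
      Query negativeOnly = new BooleanQuery.Builder()
          .add(new TermQuery(new Term("color", "red")), BooleanClause.Occur.MUST_NOT)
          .build();

      // Adding MatchAllDocsQuery as the positive clause gives "all docs except red".
      Query allExceptRed = new BooleanQuery.Builder()
          .add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST)
          .add(new TermQuery(new Term("color", "red")), BooleanClause.Occur.MUST_NOT)
          .build();

      return new int[] {searcher.count(negativeOnly), searcher.count(allExceptRed)};
    }
  }

  public static void main(String[] args) throws Exception {
    int[] c = counts();
    System.out.println(c[0] + " " + c[1]); // prints "0 1"
  }
}
```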

On Thu, Nov 5, 2020 at 9:07 PM Nissim Shiman 
wrote:

> Hello Apache Lucene team members,
> I have found that constructing a BooleanQuery with just
> a BooleanClause.Occur.MUST_NOT will return no results.  It will return
> results if there is also a BooleanClause.Occur.MUST as part of the
> query, though.
>
>
> I don't see this limitation with a BooleanQuery with just
> a BooleanClause.Occur.MUST (i.e. results will return fine if they match).
>
> Is this by design or is this an issue?
>
> Thank you,
> Nissim Shiman



-- 
Adrien


Re: unexpected performance TermsQuery Occur.SHOULD vs TermsInSetQuery?

2020-10-13 Thread Adrien Grand
100,000+ requests per core per second is a lot. :) My initial reaction is
that the query is likely so fast on that index that the bottleneck might be
rewriting or the initialization of weights/scorers (which don't get more
costly as the index gets larger) rather than actual query execution, which
means that we can't really conclude that the boolean query is faster than
the TermInSetQuery.

Also beware that IndexSearcher#count will look at index statistics if your
queries have a single term, which would no longer work if you use this
query as a filter for another query.

On Tue, Oct 13, 2020 at 12:51 PM Rob Audenaerde 
wrote:

> I reduced the benchmark as far as I could, and now got these results,
> TermsInSet being a lot slower compared to the Terms/SHOULD.
>
>
> BenchmarkOrQuery.benchmarkTerms   thrpt5  190820.510 ± 16667.411
> ops/s
> BenchmarkOrQuery.benchmarkTermsInSet  thrpt5  110548.345 ±  7490.169
> ops/s
>
>
> @Fork(1)
> @Measurement(iterations = 5, time = 10)
> @OutputTimeUnit(TimeUnit.SECONDS)
> @Warmup(iterations = 3, time = 1)
> @Benchmark
> public void benchmarkTerms(final MyState myState) {
> try {
> final IndexSearcher searcher =
> myState.matchedReaders.getIndexSearcher();
> final BooleanQuery.Builder b = new BooleanQuery.Builder();
>
> for (final String role : myState.user.getAdditionalRoles()) {
> b.add(new TermQuery(new Term(roles, new BytesRef(role))),
> BooleanClause.Occur.SHOULD);
> }
> searcher.count(b.build());
>
> } catch (final IOException e) {
> e.printStackTrace();
> }
> }
>
> @Fork(1)
> @Measurement(iterations = 5, time = 10)
> @OutputTimeUnit(TimeUnit.SECONDS)
> @Warmup(iterations = 3, time = 1)
> @Benchmark
> public void benchmarkTermsInSet(final MyState myState) {
> try {
> final IndexSearcher searcher =
> myState.matchedReaders.getIndexSearcher();
> final Set<BytesRef> roles =
>
> myState.user.getAdditionalRoles().stream().map(BytesRef::new).collect(Collectors.toSet());
> searcher.count(new TermInSetQuery(BenchmarkOrQuery.roles, roles));
>
> } catch (final IOException e) {
> e.printStackTrace();
> }
> }
>
>
> On Tue, Oct 13, 2020 at 11:56 AM Rob Audenaerde 
> wrote:
>
> > Hello Adrien,
> >
> > Thanks for the swift reply. I'll add the details:
> >
> > Lucene version: 8.6.2
> >
> > The restrictionQuery is indeed a conjunction; it allows for a document
> to
> > be a hit if the 'roles' field is empty as well. It's used within a
> > bigger query builder; so maybe I did something else wrong. I'll rewrite
> the
> > benchmark to just benchmark the TermsInSet and Terms.
> >
> > It never occurred (hah) to me to use Occur.FILTER, that is a good point
> to
> > check as well.
> >
> > As you put it, I would expect the results to be very similar, as I do not
> > reach the 16 terms in the TermInSetQuery. I'll let you know what I find.
> >
> > On Tue, Oct 13, 2020 at 11:48 AM Adrien Grand  wrote:
> >
> >> Can you give us a few more details:
> >>  - What version of Lucene are you testing?
> >>  - Are you benchmarking "restrictionQuery" on its own, or its
> conjunction
> >> with another query?
> >>
> >> You mentioned that you combine your "restrictionQuery" and the user
> query
> >> with Occur.MUST, Occur.FILTER feels more appropriate for
> >> "restrictionQuery"
> >> since it should not contribute to scoring.
> >>
> >> TermsInSetQuery automatically executes like a BooleanQuery when the
> number
> >> of clauses is less than 16, so I would not expect major performance
> >> differences between a TermInSetQuery over less than 16 terms and a
> >> BooleanQuery wrapped in a ConstantScoreQuery.
> >>
> >> On Tue, Oct 13, 2020 at 11:35 AM Rob Audenaerde <
> rob.audenae...@gmail.com
> >> >
> >> wrote:
> >>
> >> > Hello,
> >> >
> >> > I'm benchmarking an application which implements security on lucene by
> >> > adding a multivalue field "roles". If the user has one of these roles,
> >> he
> >> > can find the document.
> >> >
> >> > I implemented this as a Boolean and query, added the original query
> and
> >> the
> >> > restriction with Occur.MUST.
> >> >
> >> > I'm having some performance issues when counting the index (>60M
> docs),
> >> so
> >> > I thought about tweaking this restriction-implementation.
>

Re: unexpected performance TermsQuery Occur.SHOULD vs TermsInSetQuery?

2020-10-13 Thread Adrien Grand
Can you give us a few more details:
 - What version of Lucene are you testing?
 - Are you benchmarking "restrictionQuery" on its own, or its conjunction
with another query?

You mentioned that you combine your "restrictionQuery" and the user query
with Occur.MUST, Occur.FILTER feels more appropriate for "restrictionQuery"
since it should not contribute to scoring.

TermsInSetQuery automatically executes like a BooleanQuery when the number
of clauses is less than 16, so I would not expect major performance
differences between a TermInSetQuery over less than 16 terms and a
BooleanQuery wrapped in a ConstantScoreQuery.
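
As an illustration of the Occur.FILTER suggestion (the index layout and role names are invented for the example), the roles restriction can be attached as a non-scoring FILTER clause next to the scoring user query:

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.util.BytesRef;

public class RolesFilterDemo {
  public static int restrictedCount() throws Exception {
    ByteBuffersDirectory dir = new ByteBuffersDirectory();
    try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      // Doc 0 is visible to "admin"; doc 1 only to "guest".
      Document d0 = new Document();
      d0.add(new TextField("body", "hello world", Field.Store.NO));
      d0.add(new StringField("roles", "admin", Field.Store.NO));
      w.addDocument(d0);
      Document d1 = new Document();
      d1.add(new TextField("body", "hello again", Field.Store.NO));
      d1.add(new StringField("roles", "guest", Field.Store.NO));
      w.addDocument(d1);
    }
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      Query userQuery = new TermQuery(new Term("body", "hello"));
      Set<BytesRef> roles = Arrays.asList("admin", "superuser").stream()
          .map(BytesRef::new).collect(Collectors.toSet());
      // MUST drives scoring; FILTER only restricts matches and skips
      // score computation for the restriction clause.
      Query restricted = new BooleanQuery.Builder()
          .add(userQuery, BooleanClause.Occur.MUST)
          .add(new TermInSetQuery("roles", roles), BooleanClause.Occur.FILTER)
          .build();
      return searcher.count(restricted); // only doc 0 survives the role filter
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(restrictedCount()); // prints 1
  }
}
```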

On Tue, Oct 13, 2020 at 11:35 AM Rob Audenaerde 
wrote:

> Hello,
>
> I'm benchmarking an application which implements security on Lucene by
> adding a multivalue field "roles". If the user has one of these roles, he
> can find the document.
>
> I implemented this as a boolean AND query, adding the original query and the
> restriction with Occur.MUST.
>
> I'm having some performance issues when counting the index (>60M docs), so
> I thought about tweaking this restriction-implementation.
>
> I set-up a benchmark like this:
>
> I generate 2M documents, Each document has a multi-value "roles" field. The
> "roles" field in each document has 4 values, taken from (2,2,1000,100)
> unique values.
> The user has (1,1,2,1) values for roles (so, 1 out of the 2, for the first
> role, 1 out of 2 for the second, 2 out of the 1000 for the third value, and
> 1 / 100 for the fourth).
>
> I got a somewhat unexpected performance difference. At first, I implemented
> the restriction query like this:
>
> for (final String role : roles) {
> restrictionQuery.add(new TermQuery(new Term("roles", new
> BytesRef(role))), Occur.SHOULD);
> }
>
> I then switched to a TermInSetQuery, which I thought would be faster
> as it is using constant-scores.
>
> final Set<BytesRef> rolesSet =
> roles.stream().map(BytesRef::new).collect(Collectors.toSet());
> restrictionQuery.add(new TermInSetQuery("roles", rolesSet), Occur.SHOULD);
>
>
> However, the TermInSetQuery has about 25% slower ops/s. Is that to
> be expected? I did not expect it, as I thought the constant-scoring would
> be faster.
>


-- 
Adrien


Re: Links to classes missing for BMW

2020-10-12 Thread Adrien Grand
It's not the most visible place, but the paper is referenced in the source
code of the class that implements Block-Max WAND
https://github.com/apache/lucene-solr/blob/907d1142fa435451b40c072f1d445ee868044b15/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java#L29-L44
.

On Mon, Oct 12, 2020 at 6:34 PM  wrote:

> Hi Uwe,-
>
>   I see, thanks for the info. I wish the documentation mentioned this new
> algorithm by referencing the papers (I have the papers).
>
> Best regards
>
>
> On 10/12/20 12:27 PM, Uwe Schindler wrote:
> > There's not much new documentation, it works behind scenes, except that
> IndexSearcher.search and TopDocs class no longer return an absolute count
> for totalHits and instead this class:
> https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/search/TotalHits.html
> >
> > Uwe
> >
> > On October 12, 2020 4:22:43 PM UTC, baris.ka...@oracle.com wrote:
> >> Hi Uwe,-
> >>
> >>   Could you please point me to the class documentation?
> >>
> >> Best regards
> >>
> >>
> >> On 10/12/20 12:16 PM, Uwe Schindler wrote:
> >>> BMW support is in Lucene since version 8.0.
> >>>
> >>> Uwe
> >>>
> >>> On October 12, 2020 4:08:42 PM UTC, baris.ka...@oracle.com wrote:
> >>>
> >>>  Hi,-
> >>>
> >>> Is BMW (Block Max Wand) support only for Solr?
> >>>
> >>>
> https://lucene.apache.org/solr/guide/8_6/solr-upgrade-notes.html
> >>>  This page says "also", so it implies support for Lucene, too,
> >> right?
> >>>  Best regards
> >>>
> >> 
> >>>  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>>  For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>>
> >>>
> >>> --
> >>> Uwe Schindler
> >>> Achterdiek 19, 28357 Bremen
> >>>
> >>> https://www.thetaphi.de
> >>
> > --
> > Uwe Schindler
> > Achterdiek 19, 28357 Bremen
> >
> https://www.thetaphi.de
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-- 
Adrien


Re: How to access block-max metadata?

2020-10-12 Thread Adrien Grand
advanceShallow is indeed faster than advance because it does less:
advanceShallow only advances the cursor for block-max metadata, this allows
reasoning about maximum scores without actually advancing the doc ID.
advanceShallow is implicitly called via advance.

If your optimization rarely helps skip entire blocks, then it's expected
that advance doesn't help much over nextDoc. advanceShallow is rarely a
drop-in replacement for advance since it's unable to tell whether a
document matches or not, it can only be used to reason about maximum scores
for a range of doc IDs when combined with ImpactsSource#getImpacts.
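
A minimal sketch of these entry points against the Lucene 8.x API (the field, the term, and the level-0-only inspection are illustrative; real consumers such as ImpactsDISI walk all levels):

```java
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Impact;
import org.apache.lucene.index.Impacts;
import org.apache.lucene.index.ImpactsEnum;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.util.BytesRef;

public class ImpactsDemo {
  /** Returns the max frequency upper bound from level-0 impacts for term "fox". */
  public static int maxFreqUpperBound() throws Exception {
    ByteBuffersDirectory dir = new ByteBuffersDirectory();
    try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      Document doc = new Document();
      doc.add(new TextField("body", "fox fox jumps", Field.Store.NO));
      w.addDocument(doc);
    }
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      LeafReader leaf = reader.leaves().get(0).reader();
      TermsEnum te = leaf.terms("body").iterator();
      if (te.seekExact(new BytesRef("fox")) == false) {
        throw new IllegalStateException("term not found");
      }
      ImpactsEnum impactsEnum = te.impacts(PostingsEnum.FREQS);
      // Move only the block-max cursor; the doc ID cursor is untouched.
      impactsEnum.advanceShallow(0);
      Impacts impacts = impactsEnum.getImpacts();
      // Level 0 covers doc IDs up to impacts.getDocIdUpTo(0); each
      // (freq, norm) pair bounds the best score a doc in that range can get.
      int max = 0;
      List<Impact> level0 = impacts.getImpacts(0);
      for (Impact impact : level0) {
        max = Math.max(max, impact.freq);
      }
      return max;
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(maxFreqUpperBound()); // 2: "fox" appears twice in the only doc
  }
}
```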

On Mon, Oct 12, 2020 at 5:21 PM Alex K  wrote:

> Thanks Adrien. Very helpful.
> The doc for ImpactSource.advanceShallow says it's more efficient than
> DocIDSetIterator.advance.
> Is that because advanceShallow is skipping entire blocks at a time, whereas
> advance is not?
> One possible optimization I've explored involves skipping pruned docIDs. I
> tried this using .advance() instead of .nextDoc(), but found the
> improvement was negligible. I'm thinking maybe advanceShallow() would let
> me get that speedup.
> - AK
>
> On Mon, Oct 12, 2020 at 2:59 AM Adrien Grand  wrote:
>
> > Hi Alex,
> >
> > The entry point for block-max metadata is TermsEnum#impacts (
> >
> >
> https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/index/TermsEnum.html#impacts(int)
> > )
> > which returns a view of the postings lists that includes block-max
> > metadata. In particular, see documentation for
> ImpactsSource#advanceShallow
> > and ImpactsSource#getImpacts (
> >
> >
> https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/index/ImpactsSource.html
> > ).
> >
> > You can look at ImpactsDISI to see how this metadata is leveraged in
> > practice to turn this metadata into score upper bounds, which is in-turn
> > used to skip irrelevant documents.
> >
> > On Mon, Oct 12, 2020 at 2:45 AM Alex K  wrote:
> >
> > > Hi all,
> > > There was some fairly recent work in Lucene to introduce Block-Max WAND
> > > Scoring (
> > >
> > >
> >
> https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf
> > > , https://issues.apache.org/jira/browse/LUCENE-8135).
> > >
> > > I've been working on a use-case where I need very efficient top-k
> scoring
> > > for 100s of query terms (usually between 300 and 600 terms, k between
> 100
> > > and 1, each term contributes a simple TF-IDF score). There's some
> > > discussion here: https://github.com/alexklibisz/elastiknn/issues/160.
> > >
> > > Now that block-based metadata are presumably available in Lucene, how
> > would
> > > I access this metadata?
> > >
> > > I've read the WANDScorer.java code, but I couldn't quite understand how
> > > exactly it is leveraging a block-max codec or block-based statistics.
> In
> > my
> > > own code, I'm exploring some ways to prune low-quality docs, and I
> > figured
> > > there might be some block-max metadata that I can access to improve the
> > > pruning. I'm iterating over the docs matching each term using the
> > > .advance() and .nextDoc() methods on a PostingsEnum. I don't see any
> > > block-related methods on the PostingsEnum interface. I feel like I'm
> > > missing something.. hopefully something simple!
> > >
> > > I appreciate any tips or examples!
> > >
> > > Thanks,
> > > Alex
> > >
> >
> >
> > --
> > Adrien
> >
>


-- 
Adrien

