Re: Slow HNSW creation times.

2024-04-28 Thread Adrien Grand
Hello Kannan, The fact that adding 10k docs to an empty HNSW graph is faster than adding 10k docs to a large HNSW graph sounds expected to me, but the 120x factor that you are reporting sounds high. Maybe your dataset is larger than the size of your page cache, forcing your OS to read vectors

Re: Indexing time increase moving from Lucene 8 to 9

2024-04-17 Thread Adrien Grand
Hi Marc, Nothing jumps to mind as a potential cause for this 2x regression. It would be interesting to look at a profile. On Wed, Apr 17, 2024 at 9:32 PM Marc Davenport wrote: > Hello, > I'm finally migrating Lucene from 8.11.2 to 9.10.0 as our overall build can > now support Java 11. The

Re: Query Optimization in search/searchAfter

2024-04-12 Thread Adrien Grand
to filter out > documents but I specifically was talking about the query rewriting phase. > Is the query rewritten differently in search vs searchAfter? Looking at the > code I think no but would just like to confirm if there are any edge cases > here. > > On Fri, Apr 12, 2024 at

Re: Query Optimization in search/searchAfter

2024-04-12 Thread Adrien Grand
Hello Puneeth, When you pass an `after` doc, Lucene will filter out documents that compare better than this `after` document if it can. See e.g. what LongComparator does with its `topValue`, which is the value of the `after` doc. On Thu, Apr 11, 2024 at 4:34 PM Puneeth Bikkumanla wrote: >

Re: Support of RRF (Reciprocal Rank Fusion) by Lucene?

2024-03-26 Thread Adrien Grand
iscuss in more detail > > https://github.com/apache/lucene/issues > > Thanks > > Michael > > Am 26.03.24 um 14:56 schrieb Adrien Grand: > > Hey Michael, > > > > I agree that it would be a nice addition. Plus it should be pretty easy > to > > impl

Re: Support of RRF (Reciprocal Rank Fusion) by Lucene?

2024-03-26 Thread Adrien Grand
Hey Michael, I agree that it would be a nice addition. Plus it should be pretty easy to implement. This sounds like a good fit for a utility method on the TopDocs class? On Tue, Mar 26, 2024 at 2:54 PM Michael Wechner wrote: > Hi > > IIUC Lucene does not contain a RRF implementation, for
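
To make the RRF idea concrete, here is a minimal plain-Java sketch that does not use any Lucene API. The class name `RrfSketch`, the method `fuse`, and the representation of each ranking as a list of doc IDs are all hypothetical; the constant `k` is a tuning parameter (60 is a commonly quoted value, used here only as an assumption). Each document's fused score is the sum of 1 / (k + rank) over the rankings in which it appears, with ranks starting at 1.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: fuses several ranked lists with RRF.
final class RrfSketch {

  /**
   * Fuses ranked lists of doc IDs (best hit first) into a single ranking.
   * A doc's fused score is the sum of 1 / (k + rank) over every list that
   * contains it, where rank is 1-based.
   */
  static List<Integer> fuse(List<List<Integer>> rankings, int k) {
    Map<Integer, Double> scores = new HashMap<>();
    for (List<Integer> ranking : rankings) {
      for (int i = 0; i < ranking.size(); i++) {
        int rank = i + 1; // ranks start at 1
        scores.merge(ranking.get(i), 1.0 / (k + rank), Double::sum);
      }
    }
    List<Integer> fused = new ArrayList<>(scores.keySet());
    // Highest fused score first; break ties by doc ID for determinism.
    fused.sort((a, b) -> {
      int cmp = Double.compare(scores.get(b), scores.get(a));
      return cmp != 0 ? cmp : Integer.compare(a, b);
    });
    return fused;
  }
}
```

Because only ranks are used, the input lists can come from scorers whose scores are not comparable with each other, for example a BM25 query and a vector query.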

[ANNOUNCE] Apache Lucene 9.10.0 released

2024-02-20 Thread Adrien Grand
The Lucene PMC is pleased to announce the release of Apache Lucene 9.10. Apache Lucene is a high-performance, full-featured search engine library written entirely in Java. It is a technology suitable for nearly any application that requires structured search, full-text search, faceting,

Re: Old codecs may only be used for reading

2024-01-11 Thread Adrien Grand
Hey Michael. Your understanding is correct. On Thu, Jan 11, 2024 at 10:46 AM Michael Wechner wrote: > Hi > > I recently upgraded from Lucene 9.8.0 to Lucene 9.9.1 and noticed that > Lucene95Codec got moved to > > org.apache.lucene.backward_codecs.lucene95.Lucene95Codec > > When testing my code

Re: Assertion error with NumericDocValues.advanceExact

2024-01-01 Thread Adrien Grand
Hello, Can you check if you are running advanceExact on decreasing doc IDs or on doc IDs that are outside of the valid range [0, maxDoc)? If you have Lucene's test framework on your classpath, these checks can be added automatically by using AssertingIndexSearcher instead of IndexSearcher to run

Re: migrate index from 6 to 9

2023-12-18 Thread Adrien Grand
Hi Vincent, Unfortunately, your assumption is incorrect: Lucene 9 is not able to search Lucene 6 indexes, as Lucene only keeps read access to indexes created by the current major version (9) or the previous one (8). You will need to reindex your 6.x index with Lucene 8 or 9 (preferred) to be able to

Re: When to use StringField and when to use FacetField for categorization?

2023-10-20 Thread Adrien Grand
FYI there is also KeywordField, which combines StringField and SortedSetDocValuesField. It supports filtering, sorting, faceting and retrieval. It's my go-to field for string values. Le ven. 20 oct. 2023, 12:20, Michael McCandless a écrit : > There are some differences. > > StringField is

Re: Exception from the codec layer during indexing

2023-09-28 Thread Adrien Grand
Hi Rahul, This exception complains that IndexingChain did not deduplicate terms as expected. I don't recall seeing this exception before (which doesn't mean it's not a real bug). What JVM are you running? Does this exception frequently occur or was it a one-off? On Thu, Sep 28, 2023 at 4:49 PM

Re: forceMerge(1) leads to ~10% perf gains

2023-09-22 Thread Adrien Grand
> Was wondering - are there any other techniques which can be used to speed up that work well when forceMerge works like this? Lucene 9.8 (to be released in a few days hopefully) will add support for recursive graph bisection, which is another thing that can be used to speed up querying on

Re: Relative cpu cost of fetching term frequency during scoring

2023-06-26 Thread Adrien Grand
least 5x compared to old code. > Is there any thoughts on why term frequency calls on PostingsEnum are that > slow ? > > > > *Thanks and Regards,* > *Vimal Jain* > > > On Wed, Jun 21, 2023 at 1:43 PM Adrien Grand wrote: > > > As far as your performance problem i

[ANNOUNCE] Apache Lucene 9.7.0 released

2023-06-26 Thread Adrien Grand
The Lucene PMC is pleased to announce the release of Apache Lucene 9.7.0. Apache Lucene is a high-performance, full-featured search engine library written entirely in Java. It is a technology suitable for nearly any application that requires structured search, full-text search, faceting,

Re: Relative cpu cost of fetching term frequency during scoring

2023-06-21 Thread Adrien Grand
ike this > Scorer#getMaxScore was added in lucene 8.0 , i am using 7.7.3. > A side question , is there any resource to help migrate newer major version > , i see lot of api changed from v7 to v8. > > *Thanks and Regards,* > *Vimal Jain* > > > On Wed, Jun 21, 2023

Re: Relative cpu cost of fetching term frequency during scoring

2023-06-20 Thread Adrien Grand
ery over merged field. > Can you please provide more details on what do you mean by dynamic pruning > in context of custom term query ? > > On Tue, 20 Jun, 2023, 9:45 pm Adrien Grand, wrote: > > > Intuitively replacing a disjunction across multiple fields with a single

Re: Relative cpu cost of fetching term frequency during scoring

2023-06-20 Thread Adrien Grand
ntation ( with > multiple term queries ). > > > *Thanks and Regards,* > *Vimal Jain* > > > On Tue, Jun 20, 2023 at 1:01 PM Adrien Grand wrote: > > > You say you observed a performance drop, what are you comparing against? > > > > Le mar. 20 juin

Re: Relative cpu cost of fetching term frequency during scoring

2023-06-20 Thread Adrien Grand
You say you observed a performance drop, what are you comparing against? Le mar. 20 juin 2023, 08:59, Vimal Jain a écrit : > Note - i am using lucene 7.7.3 > > *Thanks and Regards,* > *Vimal Jain* > > > On Tue, Jun 20, 2023 at 12:26 PM Vimal Jain wrote: > > > Hi, > > I want to understand if

Re: Performance regression in getting doc by id in Lucene 8 vs Lucene 7

2023-06-07 Thread Adrien Grand
ndexes this would be a legit regression, no? > > - Rahul > > On Tue, Jun 6, 2023 at 10:09 AM Adrien Grand wrote: > > > Yes, this changed in 8.x: > > - 8.0 moved the terms index off-heap for non-PK fields with > > MMapDirectory. https://github.com/apache/lucene/issues/96

Re: Performance regression in getting doc by id in Lucene 8 vs Lucene 7

2023-06-06 Thread Adrien Grand
understand it is because of the Java bug which synchronizes > internally in the native call for NIOFs. > > -Rahul > > On Tue, Jun 6, 2023 at 9:32 AM Adrien Grand wrote: > > > +Alan Woodward helped me better understand what is going on here. > > BufferedIndexInput (used

Re: Performance regression in getting doc by id in Lucene 8 vs Lucene 7

2023-06-06 Thread Adrien Grand
before what the buffer contains. On Tue, Jun 6, 2023 at 2:07 PM Adrien Grand wrote: > > My best guess based on your description of the issue is that > SimpleFSDirectory doesn't like the fact that the terms index now reads > data directly from the directory instead of loading the

Re: Performance regression in getting doc by id in Lucene 8 vs Lucene 7

2023-06-06 Thread Adrien Grand
My best guess based on your description of the issue is that SimpleFSDirectory doesn't like the fact that the terms index now reads data directly from the directory instead of loading the terms index in heap. Would you be able to run the same benchmark with MMapDirectory to check if it addresses

Re: Mix of lucene50 and lucene70 codes

2023-04-08 Thread Adrien Grand
Hi, This is normal. Lucene usually names codecs and file formats after the first version that they were introduced in. But not all file formats change on every version, and the Lucene 7.7.3 default postings format was called Lucene50. On Sat, Apr 8, 2023 at 4:17 PM Vimal Jain wrote: > > Hi

Re: Change score with distance SortField

2023-02-06 Thread Adrien Grand
Hi Michal, The best way to do this would be to put a LatLonPoint#newDistanceFeatureQuery in a SHOULD clause. It's not as flexible as leveraging expressions, but it has the benefit of not disabling dynamic pruning. On Mon, Feb 6, 2023 at 10:33 AM Michal Hlavac wrote: > > Hi, > I would like to

Re: Other vector similarity metric than provided by VectorSimilarityFunction

2023-01-14 Thread Adrien Grand
Hi Michael, You could create a custom KNN vectors format that ignores the vector similarity configured on the field and uses its own. Le sam. 14 janv. 2023, 21:33, Michael Wechner a écrit : > Hi > > IIUC Lucene currently supports > > VectorSimilarityFunction.COSINE >

Re: The current default similarity implementation of Lucene is BM25, right?

2022-11-23 Thread Adrien Grand
This is correct. See IndexSearcher#getDefaultSimilarity(). On Wed, Nov 23, 2022 at 10:53 AM Michael Wechner wrote: > > Hi > > On the Lucene FAQ there is no mentioning re tf-idf or bm25 and I would > like to add some notes, but to be sure I don't write anything wrong I > would like to ask > >

[ANNOUNCE] Apache Lucene 9.4.2 released

2022-11-23 Thread Adrien Grand
The Lucene PMC is pleased to announce the release of Apache Lucene 9.4.2. Apache Lucene is a high-performance, full-featured search engine library written entirely in Java. It is a technology suitable for nearly any application that requires structured search, full-text search, faceting,

Re: Sort by numeric field, order missing values before anything else

2022-11-21 Thread Adrien Grand
p in mind: When > you sort against the raw bytes (using NumericUtils) with SORTED_SET > docvalues type, there is a large overhead on indexing and sorting > performance, especially for the case where you have many different > values in your index (which is likely for numerics).

Re: Sort by numeric field, order missing values before anything else

2022-11-16 Thread Adrien Grand
Hi Petko, Lucene's comparators for numerics have this limitation indeed. We haven't got many questions around that in the past, which I would guess is due to the fact that most numeric fields do not use the entire long range, specifically Long.MIN_VALUE and Long.MAX_VALUE, so using either of
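
The limitation can be illustrated without any Lucene API. The sketch below is only a model of a comparator that substitutes a sentinel for missing values; the class `MissingValueSortSketch` and its `compare` method are hypothetical names. A missing value (modelled as `null`) sorts before every real value, except that it ties with a document whose real value is the sentinel itself.

```java
// Hypothetical model of sorting with a substituted "missing value" sentinel.
final class MissingValueSortSketch {

  /**
   * Compares two optional long values, treating null (missing) as
   * missingValue. Returns a negative, zero, or positive number like
   * Long.compare.
   */
  static int compare(Long a, Long b, long missingValue) {
    long left = (a == null) ? missingValue : a;
    long right = (b == null) ? missingValue : b;
    return Long.compare(left, right);
  }
}
```

With `Long.MIN_VALUE` as the sentinel, `compare(null, 5L, Long.MIN_VALUE)` is negative, so missing sorts first, while `compare(null, Long.MIN_VALUE, Long.MIN_VALUE)` is zero: the two documents cannot be told apart by the sort alone.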

Re: Learning Lucene from ground up

2022-11-07 Thread Adrien Grand
+1 to MyCoy's suggestion. To answer your most immediate questions: - Lucene mostly loads metadata in memory at the time of opening a segment (dvm, tmd, fdm, vem, nvm, kdm files), other files are memory-mapped and Lucene relies on the filesystem cache to have their data efficiently available.

Re: Efficient sort on SortedDocValues

2022-11-07 Thread Adrien Grand
Hi Andrei, The case that you are describing got optimized in Lucene 9.4.0 in the case when your field is also indexed with a StringField: https://github.com/apache/lucene/pull/1023. See annotation ER at http://people.apache.org/~mikemccand/lucenebench/TermMonthSort.html. The way it works is that

Re: Upgrading from 9.1.0. to 9.4.0: Old codecs may only be used for reading Lucene91HnswVectorsFormat.java

2022-10-01 Thread Adrien Grand
ould not forget again > during the next upgrade :-) > > Or what is the best practice re setting / handling the codec? > > Thanks > > Michael > > Am 01.10.22 um 08:06 schrieb Adrien Grand: > > I would guess that you are configuring your IndexWriterConfig wi

Re: Upgrading from 9.1.0. to 9.4.0: Old codecs may only be used for reading Lucene91HnswVectorsFormat.java

2022-10-01 Thread Adrien Grand
I would guess that you are configuring your IndexWriterConfig with a "Lucene91Codec" instance. You need to replace it with a "Lucene94Codec" instance. Le sam. 1 oct. 2022, 06:12, Michael Wechner a écrit : > Hi > > I have just upgraded from 9.1.0 to 9.4.0 and compiling works fine, but > when I

Re: Max Field Length

2022-09-23 Thread Adrien Grand
We have a TruncateTokenFilter in lucene/analysis/common. :) On Fri, Sep 23, 2022 at 4:39 PM Michael Sokolov wrote: > I wonder if it would make sense to provide a TruncationFilter in > addition to the LengthFilter. That way long tokens in source text > could be better supported, albeit with some

Re: Questions about Lucene source

2022-09-23 Thread Adrien Grand
On the 2nd question, we do not plan on leveraging this information to figure out the codec: the codec that should be used to read a segment is stored separately (also in segment infos). It is mostly useful for diagnostics purposes. E.g. if we see an interesting corruption case where checksums

Re: Max Field Length

2022-09-23 Thread Adrien Grand
Hi Scott, There is no way to lift this limit. The assumption is that a user would never type a 32kB keyword in a search bar, so indexing such long keywords is wasteful. Some tokenizers like StandardTokenizer can be configured to limit the length of the tokens that they produce, there is also a

Re: Lucene's LRU Query Cache - Deep Dive

2022-07-19 Thread Adrien Grand
> < > https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/core/src/java/org/apache/lucene/search/ScorerSupplier.java#L39-L40 > > > > > Regards, > Mohammad Sadiq > > > > On 11 Jul 2022, at 10:37, Adrien Grand wrote: >

Re: Lucene Disable scoring

2022-07-11 Thread Adrien Grand
Note that Lucene automatically disables scoring already when scores are not needed. E.g. queries that compute the top-k hits by score will definitely compute scores, but if you are just counting the number of matches of a query or aggregations, then Lucene skips scoring entirely already. Is there

Re: Lucene's LRU Query Cache - Deep Dive

2022-07-11 Thread Adrien Grand
Hey Shradha, This correctly describes the what, but I think it could add more color about why the cache behaves this way to be more useful, e.g. - Why doesn't the cache cache all queries? Lucene is relatively good at evaluating a subset of the matching documents, e.g. queries sorted by numeric

Re: Question about Benchmark

2022-05-16 Thread Adrien Grand
Hi Balmukund, What benchmark are you talking about? On Mon, May 16, 2022 at 4:35 PM balmukund mandal wrote: > > Hi All, > I was trying to run the benchmark and had a couple of questions. Indexing > takes a long time, so is there a way to configure the benchmark to use an > already existing

Re: Index corruption and repair

2022-04-28 Thread Adrien Grand
Hi Anthony, This isn't something that you should try to fix programmatically, corruptions indicate that something is wrong with the environment, like a broken disk or corrupt RAM. I would suggest running a memtest to check your RAM and looking at system logs in case they have anything to tell

Re: How to propose a new feature

2022-04-01 Thread Adrien Grand
Just send an email with the problem that you want to solve and the approach that you are suggesting. On Fri, Apr 1, 2022 at 6:56 PM Baris Kazar wrote: > > Resent due to need for help. > Thanks > > From: Baris Kazar > Sent: Wednesday, March 30, 2022 2:30 PM > To:

Re: TF in MoreLikeThis

2022-04-01 Thread Adrien Grand
From a quick look, your suggestion of passing the term frequency to TFIDFSimilarity#tf makes sense. Would you like to contribute this change? You can find contributing guidelines here: https://github.com/apache/lucene/blob/main/CONTRIBUTING.md. On Thu, Mar 31, 2022 at 11:46 PM Petko Minkov

Re: Call for Presentations now open, ApacheCon North America 2022

2022-03-31 Thread Adrien Grand
Thanks Michael for helping spread the word about Lucene's new vector search capabilities! On Thu, Mar 31, 2022 at 7:36 AM Michael Wechner wrote: > > ok :-) thanks! > > Anyway, if somebody would like to join re a "vector search" proposal, > please let me know > > Michael > > Am 30.03.22 um 20:13

Re: Re: Custom scores and sort

2022-03-23 Thread Adrien Grand
nt > contains only one "only once score" field, > Lucene passes the CustomScoreProvider's customScore method twice, so the > score = 0 and it seems to me that this value is retained for the sort score. > > I did not find why a TopFieldDocs search (with Sort = SortField.FIELD_SC

Re: LongDistanceFeatureQuery for DoublePoint

2022-03-23 Thread Adrien Grand
Hi Puneeth, Doubles are always a bit more tricky due to rounding for arithmetic operations, but this should still be doable. Out of curiosity, what sort of data do your double fields store? This query was added with the idea that it would be useful for timestamp fields in order to boost

[ANNOUNCE] Apache Lucene 9.1.0 released

2022-03-22 Thread Adrien Grand
The Lucene PMC is pleased to announce the release of Apache Lucene 9.1.0. Apache Lucene is a high-performance, full-featured search engine library written entirely in Java. It is a technology suitable for nearly any application that requires structured search, full-text search, faceting,

Re: FacetsCollector ScoreMode

2022-03-21 Thread Adrien Grand
+1 to adjusting the ScoreMode based on keepScores. On Mon, Mar 21, 2022 at 5:47 PM Mike Drob wrote: > > Hey all, > > I was looking into some performance issues and was a little confused about > one aspect of FacetsCollector - why does it always specify > ScoreMode.COMPLETE? > > Especially for

Re: Custom scores and sort

2022-03-14 Thread Adrien Grand
It's a bit hard for me to parse what you are trying to do, but it looks like you are making assumptions about how Lucene works internally that are not correct. Do I understand correctly that your scoring mechanism has dependencies on other documents, ie. the score of a document could depend on

Re: DocValuesIterator: advance vs advanceExact

2022-02-03 Thread Adrien Grand
Hi Alexander, In general, advance(target) is best used to implement queries and advanceExact(target) for collectors. See javadocs for advanceExact(target), this method may only be called on doc IDs that are between 0 included and maxDoc excluded. On Thu, Feb 3, 2022 at 10:00 AM Alexander

Re: Lucene 6.5.1 source code

2022-02-01 Thread Adrien Grand
You can find the 6.5.1 source code on the old lucene-solr repository: https://github.com/apache/lucene-solr/tree/releases/lucene-solr%2F6.5.1 On Tue, Feb 1, 2022 at 2:54 PM Omri wrote: > > It seems that the old versions branches in github were deleted. > There is a way to see Lucene 6.5.1 source

Re: Migration from Lucene 5.5 to 8.11.1

2022-01-12 Thread Adrien Grand
The log says what the problem is: version 8.11.1 cannot read indices created by Lucene 5.5, you will need to reindex your data. On Wed, Jan 12, 2022 at 3:41 PM wrote: > > > - > To unsubscribe, e-mail:

Re: Want explanation on lucene norms

2022-01-05 Thread Adrien Grand
Hi, Norms are inputs to the score that are independent from the query. They are typically computed as a function of the number of terms of a document: the more terms, the higher the normalization factor and the lower the score. Lucene computes and indexes length normalization factors automatically
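
As an illustration of how a length norm lowers the score, here is the textbook BM25 formula for a single term. This is a sketch of the general shape only, not a statement of Lucene's exact implementation; all symbols are assumptions introduced here: tf(t, d) is the frequency of term t in document d, |d| is the document length in terms, avgdl is the average document length, and k_1 and b are tuning parameters.

```latex
\mathrm{score}(t, d) = \mathrm{idf}(t) \cdot
  \frac{\mathrm{tf}(t, d)\,(k_1 + 1)}
       {\mathrm{tf}(t, d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}
```

The document length |d| appears only in the denominator, so for a fixed term frequency a longer document gets a larger denominator and therefore a lower score.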

Re: Lucene 9.0.0 inconsistent index options

2021-12-14 Thread Adrien Grand
This looks related to the new changes around schema validation. Lucene now requires a field to either be absent from a document or be indexed with the exact same options (index options, points dimensions, norms, doc values type, etc.) as already indexed documents that also have this field.

[ANNOUNCE] Apache Lucene 9.0.0 released

2021-12-07 Thread Adrien Grand
The Lucene PMC is pleased to announce the release of Apache Lucene 9.0. Apache Lucene is a high-performance, full-featured search engine library written entirely in Java. It is a technology suitable for nearly any application that requires structured search, full-text search, faceting,

Re: index file of lucene8.7 is larger than the 7.7

2021-12-07 Thread Adrien Grand
As a disclaimer, it can be misleading to draw conclusions on space efficiency based on such a small index. Can you compare file sizes by extension across 7.7 and 8.7? You might need to call IndexWriterConfig#setUseCompoundFile(false) to prevent the flush from wrapping your segment files in a

[ANNOUNCE] Apache Lucene 8.11.0 released

2021-11-16 Thread Adrien Grand
The Lucene PMC is pleased to announce the release of Apache Lucene 8.11. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. This

Re: Need help on aggregation of nested documents

2021-11-16 Thread Adrien Grand
ment(int docID) right?. If that is the case won't getting all > the documents would be a costly operation and then finally doing the > aggregates. > > Is there any other way around this? > > Thanks > Gopal Sharma > > > > > > > > On Mon, Nov 15, 2021 at

Re: Need help on aggregation of nested documents

2021-11-15 Thread Adrien Grand
It's not straightforward as we don't provide high-level tooling to do this. You need to use the BitSetProducer that you pass to the ToParentBlockJoinQuery in order to resolve the range of child doc IDs for a given parent doc ID (see e.g. how ToChildBlockJoinQuery does it), and then aggregate over

Re: Using setIndexSort on a binary field

2021-10-15 Thread Adrien Grand
Hi Alex, You need to use a BinaryDocValuesField so that the field is indexed with doc values. `Field` is not going to work because it only indexes the data while index sorting requires doc values. On Fri, Oct 15, 2021 at 6:40 PM Alex K wrote: > Hi all, > > Could someone point me to an example

Re: org.apache.lucene.search.BooleanWeight.bulkScorer() and BulkScorer.score()

2021-10-05 Thread Adrien Grand
> Next: BulkScorer.score() with its call tree and time spent: > > > > BulkScorer.score() > -->> Weight$DefaultBulkScorer.score() > -->>-->> Weight$DefaultBulkScorer.scoreAll() > -->>-->>-->> WANDScorer$1.nextDoc() > -->>-->>-->

Re: org.apache.lucene.search.BooleanWeight.bulkScorer() and BulkScorer.score()

2021-10-01 Thread Adrien Grand
Is your profiler reporting inclusive or exclusive costs for each function? I.e. does it exclude time spent in functions that are called within a function? I'm asking because it makes total sense for IndexSearcher#search to spend most of its time in BulkScorer#score, which coordinates the whole

Re: Querying into a Collector visits documents multiple times

2021-09-22 Thread Adrien Grand
Hi Steven, This collector looks correct to me. Resetting the counter to 0 on the first segment is indeed not necessary. We have plenty of collectors that are very similar to this one and we never observed any double-counting issue. I would suspect an issue in the code that calls this collector.

Re: Adding vs multiplicating scores when implementing "recency"

2021-09-17 Thread Adrien Grand
ute a max score for a block? > > On Thu, Sep 16, 2021 at 12:41 PM Adrien Grand wrote: > > > > Hello, > > > > You are correct that the contribution would be additive in that case. We > > don't provide an easy way to make the contribution multiplicative. >

Re: Adding vs multiplicating scores when implementing "recency"

2021-09-16 Thread Adrien Grand
Hello, You are correct that the contribution would be additive in that case. We don't provide an easy way to make the contribution multiplicative. There is some debate about what is the best way to combine BM25 scores with query-independent features, though in the discussions I've seen

Re: How exactly the normalized length of the documents are stored in the index

2021-07-13 Thread Adrien Grand
The BM25 similarity computes the normalized length as the number of tokens, ignoring synonyms (tokens at the same position). Then it encodes this length as an 8-bit integer in the index using this logic:

Re: Need approach to store JSON data in Lucene index

2021-06-17 Thread Adrien Grand
In general, the preferred approach is denormalizing, but your description suggests that you want to be able to query anything: actions, tasks, test cases, etc. so I guess that the most natural approach would be to leverage Lucene's support for index-time joins, see the documentation of the join

Re: Is deleting with IndexReader still possible?

2021-06-17 Thread Adrien Grand
Good catch Michael, deleting from IndexReader was actually removed a long time ago. I just edited the FAQ to correct this. On Thu, Jun 17, 2021 at 10:08 AM Michael Wechner wrote: > Hi > > According to the FAQ one can delete documents using the IndexReader > > >

Re: Handling Archive Data Using Lucene 7.6

2021-06-14 Thread Adrien Grand
Hi Rashmi, This upgrade skips 3 major versions, the simplest path will be to reindex your content. On Fri, Jun 11, 2021 at 10:40 AM Rashmi Bisanal wrote: > Hi Lucene Support Team , > > > > Objective : Upgrade Lucene 3.6 to 7.6 > > > > Description : We have huge data against version Lucene 3.6

Re: Potential bug

2021-06-14 Thread Adrien Grand
> >>>> thousands of hits. > > >>>> > > >>>> > > >>>> Best regards > > >>>> > > >>>> > > >>>> On 6/9/21 1:25 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote: > > >

Re: Monitoring decisions taken by IndexOrDocValuesQuery

2021-06-10 Thread Adrien Grand
get away with storing this data only once and using one of > the queries. > > On Wed, Jun 9, 2021 at 10:39 PM Adrien Grand wrote: > > > FWIW a related PR was just merged that allows to introspect query > > execution: https://issues.apache.org/jira/browse/LUCENE-9965.

Re: Monitoring decisions taken by IndexOrDocValuesQuery

2021-06-09 Thread Adrien Grand
FWIW a related PR was just merged that makes it possible to introspect query execution: https://issues.apache.org/jira/browse/LUCENE-9965. It's different from your use-case though in that it is debugging information for a single query rather than statistical information across lots of user queries (and the

Re: Potential bug

2021-06-09 Thread Adrien Grand
Hi Baris, totalhitsThreshold is actually a minimum threshold, not a maximum threshold. The problem is that Lucene cannot directly identify the top matching documents for a given query. The strategy it adopts is to start collecting hits naively in doc ID order and to progressively raise the bar
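
The following plain-Java toy model shows why the reported hit count can be a lower bound. It is a simplified sketch and not Lucene's collector: the class `TopKSketch`, its `collect` method, and the `Result` type are hypothetical, and the "skip" here is a plain `continue`, whereas a real engine would avoid visiting such documents at all. Hits are counted accurately until `totalHitsThreshold` is reached; after that, any hit that cannot enter the current top k is skipped and no longer counted.

```java
import java.util.PriorityQueue;

// Toy model of top-k collection with a rising competitive-score bar.
final class TopKSketch {

  static final class Result {
    final int countedHits; // hits that were actually counted
    final boolean exact;   // false if countedHits is only a lower bound

    Result(int countedHits, boolean exact) {
      this.countedHits = countedHits;
      this.exact = exact;
    }
  }

  /** Scores are given in doc ID order, one entry per matching document. */
  static Result collect(float[] scoresInDocIdOrder, int k, int totalHitsThreshold) {
    PriorityQueue<Float> topK = new PriorityQueue<>(); // min-heap: head is the bar
    int counted = 0;
    boolean exact = true;
    for (float score : scoresInDocIdOrder) {
      boolean full = topK.size() == k;
      // Once enough hits are counted, non-competitive hits are skipped.
      if (full && counted >= totalHitsThreshold && score <= topK.peek()) {
        exact = false;
        continue;
      }
      counted++;
      if (!full) {
        topK.add(score);
      } else if (score > topK.peek()) {
        topK.poll();     // evict the weakest of the current top k
        topK.add(score); // the bar for later hits rises
      }
    }
    return new Result(counted, exact);
  }
}
```

In this model a larger threshold gives a more accurate count at the cost of skipping fewer documents, which matches the description of the threshold as a minimum rather than a maximum.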

Re: An interesting case

2021-06-08 Thread Adrien Grand
e count as scoredocs already > > has that. > > > > But seeing totalhits high number, that worries me as i explained above. > > > > > > Best regards > > > > > > On 6/8/21 1:12 PM, Adrien Grand wrote: > >> If you don't need any informati

Re: An interesting case

2021-06-08 Thread Adrien Grand
k as i mentioned ie, it has size n. > > i will check count api. > > > > Best regards > > > > *From:* Adrien Grand > > *Sent:* Tuesday, June 8, 2021 2:46 AM > > *To:* Lucene Users Maili

Re: An interesting case

2021-06-08 Thread Adrien Grand
When you call IndexSearcher#search(Query query, int n), there are two cases: - either your query matches n hits or more, and the TopDocs object will have a ScoreDoc[] array that contains the n best scoring hits sorted by descending score, - or your query matches fewer than n hits and then the

Re: Changing Term Vectors for Query

2021-06-07 Thread Adrien Grand
Hi Marcel, You can make Lucene index custom frequencies using something like DelimitedTermFrequencyTokenFilter , which would be easier than writing a custom

Re: Performance decrease with NRT use-case in 8.8.x (coming from 8.3.0)

2021-05-19 Thread Adrien Grand
LUCENE-9115 certainly creates more files in the FSDirectory than in the ByteBuffersDirectory, e.g. stored fields are now always flushed to the FSDirectory since their size can't be known in advance, while they were always written to the ByteBuffersDirectory before (which was a big since these

Re: How to ignore a match if a given keyword is before/after another given keyword?

2021-04-27 Thread Adrien Grand
Great to hear! Le mar. 27 avr. 2021 à 22:44, Jean Morissette a écrit : > Using intervals worked, thank you for your help ! > > On Sun, 25 Apr 2021 at 13:52, Adrien Grand wrote: > > > Hi Jean, > > > > You should be able to do this with intervals, see > >

Re: NullPointerException in LongComparator.setTopValue

2021-04-26 Thread Adrien Grand
E michael.gr...@skidata.com | www.skidata.com > > -Original Message- > From: Adrien Grand > Sent: Thursday, March 18, 2021 12:12 > To: Lucene Users Mailing List > Subject: Re: NullPointerException in LongComparator.setTopValue > > Hi Michael, > > At first

Re: How to ignore a match if a given keyword is before/after another given keyword?

2021-04-25 Thread Adrien Grand
Hi Jean, You should be able to do this with intervals, see https://lucene.apache.org/core/8_8_1/queries/org/apache/lucene/queries/intervals/package-summary.html . Le dim. 25 avr. 2021 à 18:43, Jean Morissette a écrit : > Thank you for your answer. > > The problem with this solution is that it

Re: Backward compatibility of FST50 and UniformSplit formats

2021-04-19 Thread Adrien Grand
Hi Dmitry, These codecs are indeed not backward compatible. Only the default codec is guaranteed to be backward compatible. If you would like to bring your index to a snapshot of the main branch, one option would be to: 1. Use Lucene 8.5's IndexWriter#addIndexes in order to create a copy of

Re: How to explain Lucene's ranking algorithm to someone who is not technical?

2021-04-19 Thread Adrien Grand
1. This isn't true. Your query has 10 terms. A document that poorly matches all 10 terms will rank lower than a document that has great matches for 9 of the 10 terms. However it's true that having more matches usually correlates with better scores since the final score of a boolean query is the

Re: Impact and WAND

2021-04-16 Thread Adrien Grand
ther scoring mode (COMPLETE or COMPLETE_NO_SCORES) will > > mandatorily visit all hits, so there is no scope of skipping and hence > > no point of using impacts. > > > > On Thu, Jul 11, 2019 at 8:51 AM Wu,Yunfeng > wrote: > > > > > > > > > @Adrien Grand mailto:j

Re: Slower fetch document after upgrade >=8.7

2021-04-08 Thread Adrien Grand
his change? > > чт, 8 апр. 2021 г. в 15:30, Adrien Grand : > > > > Actually, we don't plan to have flexible settings even for advanced > > developers. Our stance on these discussions is that we should be > > opinionated about the default codec and not offer any op

Re: Slower search after 8.5.x to >=8.6

2021-04-08 Thread Adrien Grand
y). And for us, while niofs was a little > faster than other stores > > Yes FSDirectory works fast(both commits), but now it is difficult to > test on prod elasticseach. > But why is FSDirectory fast? How to understand this? > > чт, 8 апр. 2021 г. в 13:49, Adrien Grand :

Re: Slower fetch document after upgrade >=8.7

2021-04-08 Thread Adrien Grand
facets > Never change them so that the developers themselves explicitly set the > settings. IMHO, I think this will help to avoid such problems > > OK. Have a ticket? > > чт, 8 апр. 2021 г. в 13:52, Adrien Grand : > > > > Thanks for the feedback. > > > > W

Re: Slower fetch document after upgrade >=8.7

2021-04-08 Thread Adrien Grand
Thanks for the feedback. We don't want to offer too many choices, as it complicates backward compatibility testing, and want to stick to two options at most. Since this is the second time I'm seeing this feedback, I'm inclined to reduce the block size for BEST_SPEED in order to trade a bit of

Re: Slower search after 8.5.x to >=8.6

2021-04-08 Thread Adrien Grand
Hello, Why are you forcing NIOFSDirectory instead of using Lucene's defaults via FSDirectory#open? I wonder if this might contribute to the slowdown you are seeing given that access to the terms index tends to be a bit random. It's very unlikely we'll add back a toggle for this as there is no

Re: Interface IndexReader.CacheHelper

2021-03-29 Thread Adrien Grand
Hi Baris, I created a PR that adds an example to the javadocs at https://github.com/apache/lucene/pull/50. Could you have a look and let me know if that is the sort of additional information that you were looking for? On Fri, Mar 26, 2021 at 10:30 PM wrote: > Hi,- > > >

Re: NullPointerException in LongComparator.setTopValue

2021-03-18 Thread Adrien Grand
Hi Michael, At first sight, this looks more like an Elasticsearch bug than like a Lucene bug to me. Can you file an issue at https://github.com/elastic/elasticsearch and share the search request that you are running? On Thu, Mar 18, 2021 at 11:52 AM Michael Grafl - SKIDATA <

Re: BigIntegerPoint

2021-02-27 Thread Adrien Grand
It's indeed working. As Robert suggested, it's in the sandbox more because it's unclear if it is really needed than because it is unstable. The few data points I have suggest that among the users for whom LongPoint is not enough, there are more users who need unsigned 64-bit integers than true

Re: Slower document retrieval in 8.7.0 comparing to 7.5.0

2020-12-03 Thread Adrien Grand
Hello Martynas, There have indeed been changes related to stored fields in 8.7. What does your workload look like and how large are your documents on average? On Thu, Dec 3, 2020 at 3:04 PM Martynas L wrote: > Hi, > We've migrated from 7.5.0 to 8.7.0 and find out that the index "searching" >

Re: Lucene 8.7 error searching an index created with 8.3

2020-11-24 Thread Adrien Grand
ors in the index. > > On closer inspection this seems related to phrase matching... > > On 24/11/20 at 05:18, Adrien Grand wrote: > > Can you run CheckIndex on your index to make sure it is not corrupt? > > > > On Tue, Nov 24, 2020 at 1:01 AM Nicolás Lichtmaier > >

Re: Lucene 8.7 error searching an index created with 8.3

2020-11-24 Thread Adrien Grand
Can you run CheckIndex on your index to make sure it is not corrupt? On Tue, Nov 24, 2020 at 1:01 AM Nicolás Lichtmaier wrote: > I'm seeing errors like this one (using backwards codecs): > > java.lang.ArrayIndexOutOfBoundsException: Index 69 out of bounds for > length 33 > at >

Re: BooleanQuery: BooleanClause.Occur.MUST_NOT seems to require at least one BooleanClause.Occur.MUST

2020-11-06 Thread Adrien Grand
Hi Nissim, This is by design: boolean queries that don't have positive clauses like empty boolean queries or boolean queries that only consist of negative (MUST_NOT) clauses don't match any hits. On Thu, Nov 5, 2020 at 9:07 PM Nissim Shiman wrote: > Hello Apache Lucene team members, > I have
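The rule described in this reply can be illustrated with a small toy model. The sketch below is not Lucene code: the class BooleanClauseModel and its match method are hypothetical names invented for illustration, documents are plain integer ids, and only MUST and MUST_NOT clauses are modeled (SHOULD and FILTER are left out for brevity). It only shows why a query made solely of negative clauses has nothing to subtract from, and why adding a positive clause that matches every document gives the "everything except X" behavior.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of the matching rule only; this is NOT Lucene's implementation.
public class BooleanClauseModel {

    // Each clause is represented by the set of doc ids it matches.
    // allDocs is the full set of doc ids in the (hypothetical) index.
    static Set<Integer> match(Set<Integer> allDocs,
                              List<Set<Integer>> must,
                              List<Set<Integer>> mustNot) {
        // No positive clause: there is no candidate set to exclude from,
        // so the query matches nothing.
        if (must.isEmpty()) {
            return new HashSet<>();
        }
        // Start from all docs and keep only those matched by every MUST clause.
        Set<Integer> hits = new HashSet<>(allDocs);
        for (Set<Integer> clause : must) {
            hits.retainAll(clause);
        }
        // Then drop every doc matched by any MUST_NOT clause.
        for (Set<Integer> clause : mustNot) {
            hits.removeAll(clause);
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<Integer> all = Set.of(0, 1, 2, 3);
        Set<Integer> excluded = Set.of(1, 2);

        // Only a negative clause: no hits.
        System.out.println(match(all, List.of(), List.of(excluded)));

        // A match-all positive clause plus the negative clause: docs 0 and 3.
        System.out.println(match(all, List.of(all), List.of(excluded)));
    }
}
```

In Lucene itself, the equivalent of the second call is to add a MatchAllDocsQuery as a positive clause alongside the MUST_NOT clauses.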

Re: unexpected performance TermsQuery Occur.SHOULD vs TermsInSetQuery?

2020-10-13 Thread Adrien Grand
d is empty as well. It's used within a > > bigger query builder; so maybe I did something else wrong. I'll rewrite > the > > benchmark to just benchmark the TermsInSet and Terms. > > > > It never occurred (hah) to me to use Occur.FILTER, that is a good point > to > >

Re: unexpected performance TermsQuery Occur.SHOULD vs TermsInSetQuery?

2020-10-13 Thread Adrien Grand
Can you give us a few more details: - What version of Lucene are you testing? - Are you benchmarking "restrictionQuery" on its own, or its conjunction with another query? You mentioned that you combine your "restrictionQuery" and the user query with Occur.MUST, Occur.FILTER feels more

Re: Links to classes missing for BMW

2020-10-12 Thread Adrien Grand
It's not the most visible place, but the paper is referenced in the source code of the class that implements BM WAND https://github.com/apache/lucene-solr/blob/907d1142fa435451b40c072f1d445ee868044b15/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java#L29-L44 . On Mon, Oct 12, 2020 at
