Re: Right Way to Read vectors from Index

2024-02-12 Thread Uwe Schindler

Hi,

reading information from the inverted index (and also vectors) is always 
slow, because the data is not stored "as is" for easy reconsumption. To 
allow easy reindexing, the input data must be serialized to a "stored" 
field in parallel to the indexed value.


Elasticsearch uses the approach of having a single/separate "stored 
only" binary field in the index that contains the "_source" data of the 
whole document in a machine-readable JSON/CBOR/SMILE format. When a 
document is updated in the index, the updater reads the original source, 
applies the updates to it and then reindexes the document. All other fields 
in Elasticsearch are not stored (unless you explicitly opt in to that).


In Solr it is very similar, but there the stored values are serialized 
to companion fields with the same name. There is currently no separate 
Lucene StoredField implementation to store vectors, but it's easy to 
do: you could use a binary (byte[]) stored field to preserve the vector 
data (e.g., serialized in little/big endian).
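
A rough sketch of that idea (untested; the field names and the 
getEmbedding()/storedFields/docId variables are placeholders, assuming Lucene 9.x):

float[] vector = getEmbedding();   // your embedding, e.g. 1536 floats
ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES)
    .order(ByteOrder.LITTLE_ENDIAN);
for (float v : vector) {
    buf.putFloat(v);
}
Document doc = new Document();
doc.add(new KnnFloatVectorField("vector", vector, VectorSimilarityFunction.COSINE));
doc.add(new StoredField("vector_raw", new BytesRef(buf.array())));   // companion stored field

// later, when applying a partial update, read the vector back instead of resending it:
BytesRef raw = storedFields.document(docId).getBinaryValue("vector_raw");
FloatBuffer floats = ByteBuffer.wrap(raw.bytes, raw.offset, raw.length)
    .order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer();
float[] restored = new float[floats.remaining()];
floats.get(restored);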


I tend to favour the Elasticsearch approach of having a single stored 
field containing the whole document in machine-readable form.


Uwe

Am 11.02.2024 um 13:39 schrieb Uthra:

Hi Michael,
The use case is to handle index updates of documents with a vector field 
without resending the vector in the change data every time. The change data will 
consist only of "updated_field(s):value(s)", so I will read the vector 
value from the index to update the document.

Thanks,
Uthra


On 09-Feb-2024, at 7:13 PM, Michael Wechner  wrote:

Can you describe your use case in more detail (beyond having to read the 
vectors)?

Thanks

Michael

Am 09.02.24 um 12:28 schrieb Uthra:

Hi,
Our project uses Lucene 9.7.0 and we have a requirement of frequent 
vector read operations from the index for a set of documents. We tried two 
approaches:
1. Index the vector as a stored field and retrieve it whenever needed using the 
StoredFields APIs.
2. Use LeafReader's API to read the vector. Here the random accessing of 
documents is very slow.
Which one is the right approach, and can you suggest a better approach? Also, 
why isn't there a straightforward API like the StoredFields API to read vectors?

Regards,
Uthra


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Need suggestion for a Lucene upgrade scenario

2024-01-30 Thread Uwe Schindler

Hi,

please read the documentation. It is explained in detail: 
https://lucene.apache.org/core/8_11_0/core/org/apache/lucene/analysis/package-summary.html#package.description


There are also many blog posts about this change (which was done now 
almost 15 years ago)!
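
For your concrete question: add an OffsetAttribute to the TokenStream and call 
reset()/end()/close() around the loop. A minimal sketch of the 8.11 pattern (untested):

try (TokenStream valueStream = analyzer.tokenStream(field, new StringReader(fieldValue))) {
    CharTermAttribute charValueTermAttribute = valueStream.addAttribute(CharTermAttribute.class);
    OffsetAttribute offsetAttribute = valueStream.addAttribute(OffsetAttribute.class);
    valueStream.reset();
    while (valueStream.incrementToken()) {
        String termValueText = charValueTermAttribute.toString();
        int startOffset = offsetAttribute.startOffset();
        int endOffset = offsetAttribute.endOffset();
        // do some calculation based on startOffset & endOffset
    }
    valueStream.end();
}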


Uwe

Am 30.01.2024 um 11:30 schrieb Saha, Rajib:

Hi,

In our project for the Lucene migration from 2.4.0 to 8.11.2, we need your 
suggestion to address a case.
With Lucene 2.4.0, we were using the kind of code snippet below.
With Lucene 8.11.2 [snippet written below], we need to extract the startOffset & 
endOffset values for some further calculation, similar to Lucene 2.4.0.
Is there any easy way/API to extract the values from the TokenStream?

//Lucene 2.4.0
===
Token token;
TokenStream valueStream = analyzer.tokenStream(new StringReader(fieldValue), 
false,true);
while ((token = valueStream.next()) != null) {
   int startOffset = token.startOffset();
   int endOffset = token.endOffset();

   //Do some calculation based on startOffset & endOffset
}



//Lucene 8.11.2

TokenStream valueStream = analyzer.tokenStream(field, new 
StringReader(fieldValue));
CharTermAttribute charValueTermAttribute = 
valueStream.addAttribute(CharTermAttribute.class);
while (valueStream.incrementToken()) {
   String termValueText = charValueTermAttribute.toString();

   //How to get startOffset & endOffset as like in Lucene 2.4

   //Do some calculation based on startOffset & endOffset
}

Please let me know if any further information is required from my side.

Regards
Rajib


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: NumericRangeQuery in Lucene 5.5.5: replacing the deprecated setBoost while keeping the NumericRange type?

2023-11-26 Thread Uwe Schindler

Hi,

Lucene 5 removed the ability of a query to boost itself; you have to live 
with that. In addition, all queries should be immutable (not yet enforced 
in Lucene 5), which is required for caching purposes*).


When you apply a BoostQuery on top of any other query, the scores the 
inner query returns are just multiplied by the boost. Of course the 
original query won't change. That is the typical "wrapper pattern".
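
For example (sketch; field name and range values are made up):

Query range = NumericRangeQuery.newIntRange("price", 10, 100, true, true);
Query boosted = new BoostQuery(range, 2.0f);   // scores of the inner range query are multiplied by 2
TopDocs hits = searcher.search(boosted, 10);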


I don't understand what your problem is; why do you want to cast those 
queries to something they aren't?


Uwe

*) In Lucene 5 not all queries are really immutable yet, but most of 
them were changed or their setters deprecated. E.g., BooleanQuery has a 
builder in 5.5, with all constructors deprecated. The reason for this is 
also immutability: it should be impossible to change the clauses after 
constructing the query, so a builder is needed.


Am 25.11.2023 um 16:02 schrieb Claude Lepère:

Hello Mikhail.

Sorry if I was not precise enough.
A NumericRangeQuery can be wrapped in a BooleanQuery with a boost but this
boost is applied to the BoostQuery (a Query) not to the wrapped
NumericRangeQuery.
Casting BoostQuery or Query to NumericRangeQuery is impossible and
BoostQuery.getQuery returns the wrapped NumericRangeQuery not modified, not
boosted (by virtue of immutability? The migration guide does not say that
NumericRangeQuery is immutable.).

But my problem is indeed boosting the NumericRangeQuery and I'm staying
with the deprecated NumericRangeQuery.setBoost.

Thanks for your help.

Claude Lepère






On Sat, Nov 25, 2023 at 3:14 PM Mikhail Khludnev  wrote:


Hello Claude,
Wrap it by BoostQuery. see
https://lucene.apache.org/core/6_0_0/MIGRATE.html


On Sat, Nov 25, 2023 at 2:46 PM Claude Lepère 
wrote:


Hi.

We are using Lucene 5.5.5 where setBoost is deprecated for all Query

types.

How to set the boost of a NumericRangeQuery while preserving the
NumericRangeQuery type?
BoostQuery doesn't allow this and I haven't found a way.

Thanks for your help.

Claude Lepère



--
Sincerely yours
Mikhail Khludnev


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: StandardQueryParser and numeric fields

2023-11-14 Thread Uwe Schindler

Hi,

By default the standard query parser has no idea about field types (and 
it cannot, because it does not know the schema of your index). If you 
want to allow searching in non-text fields (anything other than TextField; 
even a plain StringField breaks easily), you need to customize it.


There are 2 query parsers that I would call "standard":

 * For the default / classic QueryParser the QueryParser#newXYQuery
   methods need to be adapted (by subclassing) to generate a query
   based on the text input. This is how Solr/Elasticsearch adapt their
   schemas.
 * For the flexible query parser (I think you mean that one with
   StandardQueryParser), there is an option to use PointsConfig
   instances and set a "per field" mapping of how to parse points. Use
   StandardQueryParser#setPointsConfigMap() for that. Unfortunately
   the Javadocs are missing.

This is what Solr/Elasticsearch/Opensearch do.
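
A minimal sketch for the flexible parser (assuming the field was indexed as an 
integer point, as in your example; PointsConfig lives in the flexible query 
parser's config package):

StandardQueryParser parser = new StandardQueryParser(analyzer);
parser.setDefaultOperator(StandardQueryConfigHandler.Operator.AND);

Map<String, PointsConfig> pointsConfig = new HashMap<>();
pointsConfig.put("eventIdNum",
    new PointsConfig(NumberFormat.getIntegerInstance(Locale.ROOT), Integer.class));
parser.setPointsConfigMap(pointsConfig);

Query q = parser.parse("eventIdNum:3001", "any");   // now produces a point range query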

Uwe

Am 14.11.2023 um 03:01 schrieb Tony Schwartz:

Hello,

  


I'm banging my head at this point, hoping someone can help me.

  


I can't get StandardQueryParser to work on numeric fields.  Luke v9.8.0
finds the records for me.  Example search query string in Luke that works:
eventIdNum:3001

  


Here is my code:

  


Query initQuery() {

StandardQueryParser p = new StandardQueryParser( analyzer );

p.setDefaultOperator( StandardQueryConfigHandler.Operator.AND );

String queryString = "eventIdNum:3001";

return p.parse( queryString, "any" );

}

  


here is how the field was added to the index:

d.add( new IntField( "eventIdNum", 3001, Field.Store.NO ) );

  


I've tried various analyzers.  Example:  new StandardAnalyzer(
CharArraySet.EMPTY_SET );

  

  


I'm sure there is something I'm missing here, but I can't seem to track down
what I'm missing.  The analyzer is the exact same analyzer I'm using during
indexing.  It's a PerFieldAnalyzerWrapper.  The specific analyzer for the
numeric fields is the one I mentioned above (StandardAnalyzer).

  


The query used is:

indexSearcher.search( query, 10 );

  


Thank you,

  


Tony Schwartz

  

  




--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:u...@thetaphi.de


Re: DisjunctionMinQuery

2023-11-09 Thread Uwe Schindler

Hi,

in that case you should use something like 1/x as your scoring function 
in the sub-clauses. In Lucene scores should go up for more relevancy. 
This must also apply for function scoring.


Uwe

Am 09.11.2023 um 19:14 schrieb Marc D'Mello:

Hi Michael,

Thanks for the response! So to answer your first question, yes this would
keep the lowest score from the matching sub-scorers. Our use case is that
we have a custom term-level score overriding term frequency and we want to
take the min of that as part of our scoring function. Maybe it's a niche
use case?

Thanks,
Marc

On Wed, Nov 8, 2023 at 3:19 PM Michael Froh  wrote:


Hi Marc,

Can you clarify what the semantics of a DisjunctionMinQuery would be? Would
you keep the score for the *lowest* scoring disjunct (plus some tiebreaker
applied to the other matching disjuncts)?

I'm trying to imagine how that would work compared to the classic DisMax
use-case. Say I'm searching for "dalmatian" using a DisMax query over term
queries against title and body. A match on title is probably going to score
higher than a match against the body, just because the title has a shorter
length (and the doc frequency of individual terms in the title is likely to
be lower, since there are fewer terms overall). With DisMax, a match on
title alone will score higher than a match on body, and the tie-break will
tend to score a match on title and body higher than a match on title alone.

With a DisMin (assuming you keep the lowest score), then a match on title
and body would probably score lower than a match on title alone. That feels
weird to me, but I might be missing the use-case.

How would you use a DisMinQuery?

Thanks,
Froh



On Wed, Nov 8, 2023 at 10:50 AM Marc D'Mello  wrote:


Hi all,

I noticed we have a DisjunctionMaxQuery
<


https://github.com/apache/lucene/blob/branch_9_7/lucene/core/src/java/org/apache/lucene/search/DisjunctionMaxQuery.java

but
not a corresponding DisjunctionMinQuery. I was just wondering if there

was

a specific reason for that? Or is it just that it is not a common query

to

use?

Thanks!
Marc


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Field[vector]vector's dimensions must be <= [1024]; got 1536

2023-11-08 Thread Uwe Schindler

Hi Michael,

The version below looks correct. Of course the Solr version is able to 
do much more. The code you posted limits it to the bare minimum:


 * subclass default codec
 * implement getKnnVectorsFormatForField() and return the wrapper with
   other max dimension

Reading indexes still works with the unmodified default codec; you only need 
to set it for IndexWriter. When reading, the actual codec is looked up by 
name.


Uwe

Am 07.11.2023 um 17:03 schrieb Michael Wechner:

Hi Uwe

Thanks again for your feedback, I got it working now :-)

I am using a simplified version, which I will post below, such that it 
might help others, at least as long as this implementation makes sense.


Btw, when a new version of Lucene gets released, how do I best find 
out that  "Lucene95Codec" is still the most recent default codec or 
that there is a new default codec?


Thanks

Michael

---

@Autowired
private LuceneCodecFactory luceneCodecFactory;

IndexWriterConfig iwc = new IndexWriterConfig();
iwc.setCodec(luceneCodecFactory.getCodec());



package com.erkigsnek.webapp.services;

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.codecs.lucene95.Lucene95Codec;
import org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsFormat;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;
import org.springframework.stereotype.Component;
import lombok.extern.slf4j.Slf4j;

import java.io.IOException;

@Slf4j
@Component
public class LuceneCodecFactory {

    private final int maxDimensions = 16384;

    public Codec getCodec() {
        //return Lucene95Codec.getDefault();
        log.info("Get codec ...");
        Codec codec = new Lucene95Codec() {
            @Override
            public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
                var delegate = new Lucene95HnswVectorsFormat();
                log.info("Maximum Vector Dimension: " + maxDimensions);
                return new DelegatingKnnVectorsFormat(delegate, maxDimensions);
            }
        };

        return codec;
    }
}

/**
 * This class exists because Lucene95HnswVectorsFormat's getMaxDimensions method is final and we
 * need to work around that constraint to allow more than the default number of dimensions.
 */
@Slf4j
class DelegatingKnnVectorsFormat extends KnnVectorsFormat {

    private final KnnVectorsFormat delegate;
    private final int maxDimensions;

    public DelegatingKnnVectorsFormat(KnnVectorsFormat delegate, int maxDimensions) {
        super(delegate.getName());
        this.delegate = delegate;
        this.maxDimensions = maxDimensions;
    }

    @Override
    public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
        return delegate.fieldsWriter(state);
    }

    @Override
    public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
        return delegate.fieldsReader(state);
    }

    @Override
    public int getMaxDimensions(String fieldName) {
        log.info("Maximum vector dimension: " + maxDimensions);
        return maxDimensions;
    }
}






Am 19.10.23 um 11:23 schrieb Uwe Schindler:

Hi Michael,

The max vector dimension limit is no longer checked in the field type 
as it is responsibility of the codec to enforce it.


You need to build your own codec that returns a different setting so 
it can be enforced by IndexWriter. See Apache Solr's code how to wrap 
the existing KnnVectorsFormat so it returns another limit: 
<https://github.com/apache/solr/blob/6d50c592fb0b7e0ea2e52ecf1cde7e882e1d0d0a/solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java#L159-L183> 



Basically you need to subclass Lucene95Codec like done here: 
<https://github.com/apache/solr/blob/6d50c592fb0b7e0ea2e52ecf1cde7e882e1d0d0a/solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java#L99-L146> 
and return a different vectors format like a delegator as described 
before.


The responsibility was shifted to the codec, because there may be 
better alternatives to HNSW that have different limits especially 
with regard to performance during merging and query response times, 
e.g. BKD trees.


Uwe

Am 19.10.2023 um 10:53 schrieb Michael Wechner:
I forgot to mention that using the custom FieldType and 1536 vector 
dimensions does work with Lucene 9.7.0.


Thanks

Michael



Am 19.10.23 um 10:39 schrieb Michael Wechner:

Hi

I recently upgraded Lucene to 9.8.0 and was running tests with 
OpenAI's embedding model, which has the vector dimension 1536 and 
received the following error


Field[vector]vector's dimensions must be <= [1024]; got 1536

whereas this worked previously with the hack to override the vector 
dimension using a custom


float[] vector = ...
FieldType vectorFieldType = new 
CustomVectorFieldType(vector

Re: Preventing field data from being loaded into page cache

2023-10-21 Thread Uwe Schindler

Hi,

There is a workaround available called DirectIODirectory. You can 
subclass it and override the useDirectIO() method to return true only for 
.fdt files. It wraps another FSDirectory (e.g. MMapDirectory) and 
delegates everything back to it, but for those files where useDirectIO() 
returns true it implements its own IndexInput:


https://github.com/apache/lucene/blob/90f8bac9f75df88fed387d5b9f2b0ee387604387/lucene/misc/src/java/org/apache/lucene/misc/store/DirectIODirectory.java#L160-L164

The default uses DirectIO only for merges, to not pollute the page cache 
while merging index segments.
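
A rough sketch of such a subclass (untested; the method signature is the one from 
the current 9.x DirectIODirectory, and the index path is a placeholder):

import java.io.IOException;
import java.nio.file.Paths;
import java.util.OptionalLong;
import org.apache.lucene.misc.store.DirectIODirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;

public class StoredFieldsDirectIODirectory extends DirectIODirectory {

    public StoredFieldsDirectIODirectory(FSDirectory delegate) throws IOException {
        super(delegate);
    }

    @Override
    protected boolean useDirectIO(String name, IOContext context, OptionalLong fileLength) {
        return name.endsWith(".fdt");   // only stored fields data bypasses the page cache
    }
}

// usage: wrap the directory you would normally use
Directory dir = new StoredFieldsDirectIODirectory(FSDirectory.open(Paths.get("/path/to/index")));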


Uwe

Am 21.10.2023 um 01:54 schrieb Justin Borromeo:

Is there any way to keep field data files out of the operating system's
page cache? We only use fdt for highlighting and don't need to keep it warm
in memory.  From what I understand, the operating system is in control of
what files get loaded into the page cache. Does Lucene have any mechanisms
to explicitly prevent them from being cached?  Is it even possible with
Java?

Thanks,
Justin Borromeo


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Field[vector]vector's dimensions must be <= [1024]; got 1536

2023-10-19 Thread Uwe Schindler

Hi Michael,

The max vector dimension limit is no longer checked in the field type as 
it is responsibility of the codec to enforce it.


You need to build your own codec that returns a different setting so it 
can be enforced by IndexWriter. See Apache Solr's code how to wrap the 
existing KnnVectorsFormat so it returns another limit: 
<https://github.com/apache/solr/blob/6d50c592fb0b7e0ea2e52ecf1cde7e882e1d0d0a/solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java#L159-L183>


Basically you need to subclass Lucene95Codec like done here: 
<https://github.com/apache/solr/blob/6d50c592fb0b7e0ea2e52ecf1cde7e882e1d0d0a/solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java#L99-L146> 
and return a different vectors format like a delegator as described before.


The responsibility was shifted to the codec, because there may be better 
alternatives to HNSW that have different limits especially with regard 
to performance during merging and query response times, e.g. BKD trees.


Uwe

Am 19.10.2023 um 10:53 schrieb Michael Wechner:
I forgot to mention that using the custom FieldType and 1536 vector 
dimensions does work with Lucene 9.7.0.


Thanks

Michael



Am 19.10.23 um 10:39 schrieb Michael Wechner:

Hi

I recently upgraded Lucene to 9.8.0 and was running tests with 
OpenAI's embedding model, which has the vector dimension 1536 and 
received the following error


Field[vector]vector's dimensions must be <= [1024]; got 1536

whereas this worked previously with the hack to override the vector 
dimension using a custom


float[] vector = ...
FieldType vectorFieldType = new CustomVectorFieldType(vector.length, 
VectorSimilarityFunction.COSINE);


and setting

KnnFloatVectorField vectorField = new 
KnnFloatVectorField("VECTOR_FIELD", vector, vectorFieldType);


But this does not seem to work anymore with Lucene 9.8.0

Is this hack now prevented by the Lucene code itself, or any idea how 
to make this work again?


Whatever one thinks of OpenAI, the embedding model 
"text-embedding-ada-002" is really good and it is sad, that one 
cannot use it with Lucene, because of the 1024 dimension restriction.


Thanks

Michael



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to replace deprecated document(i)

2023-09-25 Thread Uwe Schindler

Hi,

yes once per search request is the best to start with.

You can reuse the instance for multiple requests, but you cannot use it 
from multiple threads. So it is up to you to make sure you reuse it at 
best effort.


See also the documentation I posted from MIGRATE.txt.

If the documentation is missing, maybe let's open a pull request that 
gives the missing information in 9.x Javadocs, too.


Uwe

Am 25.09.2023 um 11:02 schrieb Michael Wechner:

you mean once per search request?

I mean for example

GET https://localhost:8080/search?q=Lucene

and the following would be executed

IndexReader reader = DirectoryReader.open(...);
StoredFields  storedfields = reader.storedFields();
IndexSearcher searcher = new IndexSearcher(reader)
TopDocs topDocs = searcher.search(query, k)
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
  Document doc = storedFields.document(scoreDoc.doc);
}

Like this?

Thanks

Michael


Am 25.09.23 um 10:28 schrieb Uwe Schindler:
Background: For performance, it is advisable to get the 
storedFields() *once* to process all documents in the search result. 
The reason for the change was that accessing stored fields would otherwise 
need ThreadLocals to keep state.


Issue: https://github.com/apache/lucene/pull/11998

This was introduced in Lucene 9.5.

It is also listed in MIGRATE.txt:

   ### Removed deprecated IndexSearcher.doc, IndexReader.document,
   IndexReader.getTermVectors (GITHUB#11998)

   The deprecated Stored Fields and Term Vectors apis relied upon
   threadlocal storage and have been removed.

   Instead, call storedFields()/termVectors() to return an instance
   which can fetch data for multiple documents,
   and will be garbage-collected as usual.

   For example:
   ```java
   TopDocs hits = searcher.search(query, 10);
   StoredFields storedFields = reader.storedFields();
   for (ScoreDoc hit : hits.scoreDocs) {
      Document doc = storedFields.document(hit.doc);
   }
   ```

   Note that these StoredFields and TermVectors instances should only
   be consumed in the thread where
   they were acquired. For instance, it is illegal to share them across
   threads.

Uwe

Am 25.09.2023 um 07:53 schrieb Michael Wechner:

Hi Shubham

Great, thank you very much!

Michael

Am 25.09.23 um 02:14 schrieb Shubham Chaudhary:

Hi Michael,

You could replace this with
*indexReader.storedFields().document(scoreDoc.doc)*

Docs -
https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/index/StoredFields.html#document(int) 



- Shubham

On Mon, Sep 25, 2023 at 1:59 AM Michael Wechner 


wrote:


Hi

I recently noctived that

IndexReader.document(int)

is deprecated, whereas my code is currently

TopDocs topDocs = searcher.search(query, k);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
  Document doc = indexReader.document(scoreDoc.doc);
}

How do I best replace document(int)?

Thanks

Michael

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to replace deprecated document(i)

2023-09-25 Thread Uwe Schindler
Background: For performance, it is advisable to get the storedFields() 
*once* to process all documents in the search result. The reason for 
the change was that accessing stored fields would otherwise need 
ThreadLocals to keep state.


Issue: https://github.com/apache/lucene/pull/11998

This was introduced in Lucene 9.5.

It is also listed in MIGRATE.txt:

   ### Removed deprecated IndexSearcher.doc, IndexReader.document,
   IndexReader.getTermVectors (GITHUB#11998)

   The deprecated Stored Fields and Term Vectors apis relied upon
   threadlocal storage and have been removed.

   Instead, call storedFields()/termVectors() to return an instance
   which can fetch data for multiple documents,
   and will be garbage-collected as usual.

   For example:
   ```java
   TopDocs hits = searcher.search(query, 10);
   StoredFields storedFields = reader.storedFields();
   for (ScoreDoc hit : hits.scoreDocs) {
  Document doc = storedFields.document(hit.doc);
   }
   ```

   Note that these StoredFields and TermVectors instances should only
   be consumed in the thread where
   they were acquired. For instance, it is illegal to share them across
   threads.

Uwe

Am 25.09.2023 um 07:53 schrieb Michael Wechner:

Hi Shubham

Great, thank you very much!

Michael

Am 25.09.23 um 02:14 schrieb Shubham Chaudhary:

Hi Michael,

You could replace this with
*indexReader.storedFields().document(scoreDoc.doc)*

Docs -
https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/index/StoredFields.html#document(int) 



- Shubham

On Mon, Sep 25, 2023 at 1:59 AM Michael Wechner 


wrote:


Hi

I recently noctived that

IndexReader.document(int)

is deprecated, whereas my code is currently

TopDocs topDocs = searcher.search(query, k);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
  Document doc = indexReader.document(scoreDoc.doc);
}

How do I best replace document(int)?

Thanks

Michael

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:u...@thetaphi.de


Re: forceMerge(1) leads to ~10% perf gains

2023-09-22 Thread Uwe Schindler

Hi,

Yes, a force-merged index can be faster, as less work is spent on 
looking up terms in different index segments.


If you are looking for higher speed, non-merged indexes can actually 
perform better, IF you parallelize. You can do this by adding an 
Executor instance to IndexSearcher 
(<https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/search/IndexSearcher.html#%3Cinit%3E(org.apache.lucene.index.IndexReader,java.util.concurrent.Executor)>). 
If you do this each segment of the index is searched in parallel (using 
the thread pool limits of the Executor) and results are merged at end.
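
For example (sketch; the index path is a placeholder and the query is built elsewhere):

ExecutorService executor = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
    IndexSearcher searcher = new IndexSearcher(reader, executor);   // segments are searched in parallel
    TopDocs hits = searcher.search(query, 10);
    // ...
} finally {
    executor.shutdown();
}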


If an index is read-only and static, force-merge is a good idea - unless 
you want to parallelize.


Tokenizing and joining with OR is the correct way, but for speed you may 
also use AND. To further improve the speed also take a look at BlockMax 
WAND: if you are not interested in the total number of documents, you 
can get huge speed improvements. By default this is enabled in Lucene 
9.x with the default IndexSearcher, but on Solr/Elasticsearch you may need 
to actively request it. In that case it will only count the exact number of 
hits until 1000 docs are found.


Uwe

Am 22.09.2023 um 03:40 schrieb qrdl kaggle:

After testing on 4800 fairly complex queries, I see a performance gain of
10% after doing indexWriter.forceMerge(1); indexWriter.commit(); from 209
ms per query, to 185 ms per query.

Queries are quite complex, often about 30 or words, of the format OR
text:

It went from 214 to 14 files on the forceMerge.

It's a 6GB static/read only index with about 6.4M documents.  Documents are
around 1MB or so of text.

Was wondering - are there any other techniques which can be used to speed
up that work well when forceMerge works like this?

Is there a better way to query and still maintain accuracy than simply word
tokenizing a sentence and joining with OR text: ?


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to retain % sign next to number during tokenization

2023-09-21 Thread Uwe Schindler
The problem with WhitespaceTokenizer is that it splits only on 
whitespace. If you have text like "This is, was some test." then you get 
tokens like "is," and "test." including the punctuation.


This is the reason why StandardTokenizer is normally used for human 
readable text. WhitespaceTokenizer is normally only used for special 
stuff like token lists (like tags) or unique identifiers,...


As a quick workaround while still keeping the %, you can add a CharFilter 
like MappingCharFilter before the Tokenizer that replaces the "%" char 
with something else which is not stripped off. As this is done for both 
indexing and searching, this does not hurt you. How about a "percent 
emoji"? :-)


Another common "workaround" is also shown in some Solr default 
configurations typically used for product search: Those use 
WhitespaceTokenizer, followed by WordDelimiterFilter. WDF is then able 
to remove accents and handle stuff like product numbers correctly. There 
you can possibly make sure thet "%" survives.


Uwe

Am 20.09.2023 um 22:42 schrieb Amitesh Kumar:

Thanks Mikhail!

I have tried all other tokenizers from Lucene 4.4. In case of 
WhitespaceTokenizer, it loses romanizing of special chars like - etc.


On Wed, Sep 20, 2023 at 16:39 Mikhail Khludnev  wrote:


Hello,
Check the whitespace tokenizer.

On Wed, Sep 20, 2023 at 7:46 PM Amitesh Kumar 
wrote:


Hi,

I am facing a requirement change to get % sign retained in searches. e.g.

Sample search docs:
1. Number of boys 50
2. My score was 50%
3. 40-50% for pass score

Search query: 50%
Expected results: Doc-2, Doc-3 i.e.
My score was
1. 50%
2. 40-50% for pass score

Actual result: All 3 documents (because tokenizer strips off the % both
during indexing as well as searching and hence matches all docs with 50

in

it.

On the implementation front, I am using a set of filters like
lowerCaseFilter, EnglishPossessiveFilter etc in addition to base

tokenizer

StandardTokenizer.

Per my analysis suggests, StandardTokenizer strips off the %  I am

facing a

requirement change to get % sign retained in searches. e.g

Sample search docs:
1. Number of boys 50
2. My score was 50%
3. 40-50% for pass score

Search query: 50%
Expected results: Doc-2, Doc-3 i.e.
My score was 50%
40-50% for pass score

Actual result: All 4 documents

On the implementation front, I am using a set of filters like
lowerCaseFilter, EnglishPossessiveFilter etc in addition to base

tokenizer

StandardTokenizer.

Per my analysis, StandardTOkenizer strips off the %  sign and hence the
behavior.Has someone faced similar requirement? Any help/guidance is

highly

appreciated.



--
Sincerely yours
Mikhail Khludnev


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Reindexing leaving behind 0 live doc segments

2023-09-13 Thread Uwe Schindler
It looks like your code has a leak and does not close all 
IndexReaders/Writers that you use during your custom code in Solr. It is 
impossible to review this from outside.


You should use the Solr-provided SolrIndexWriter and SolrIndexSearcher to 
do your custom stuff and let Solr manage them.


Uwe

Am 10.09.2023 um 04:09 schrieb Rahul Goswami:

Uwe,
Thanks for the response. I have openSearcher=false in autoCommit, but I do
have an autoSoftCommit interval of 5 minutes configured as well which
should open a searcher.
In vanilla Solr, without my code, I see that if I completely reindex all
documents in a segment (via a client call), the segment does get deleted
after the soft commit interval. However if I process the segments as per
Approach-1 in my original email, I see that the 0 doc 7.x segment stays
even after the process finishes, i.e even after I exit the
try-with-resources block.  Note that my index is a mix of 7.x and 8.x
segments and I am only reindexing 7.x segments by preventing them from
participating in merge via a custom MergePolicy.
Additionally as mentioned, Solr provides a handler (/admin/segments)
which does what Luke does and it shows that by the end of the process there
are no more 7.x segments as referenced by the segments_x file. But for some
reason the physical 7.x segment files continue to stay behind until I
restart Solr.

Thanks,
Rahul

On Mon, Sep 4, 2023 at 7:18 AM Uwe Schindler  wrote:


Hi,

in Solr the empty segment keeps open as long as there is a Searcher
still open. At some point the empty segment (100% deletions) will be
deleted, but you have to wait until SolIndexSearcher has restarted.
Maybe check your solrconfig.xml and check if openSearcher is enabled
after autoSoftCommit:

https://solr.apache.org/guide/solr/latest/configuration-guide/commits-transaction-logs.html

Uwe

Am 31.08.2023 um 21:35 schrieb Rahul Goswami:

Stefan, Mike,
Appreciate your responses! I spent some time analyzing your inputs and
going further down the rabbit hole.

Stefan,
I looked at the IndexRearranger code you referenced where it tries to

drop

the segment. I see that it eventually gets handled via
IndexFileDeleter.checkpoint() through file refCounts (=0 for deletion
criteria). The same method also gets called as part of

IndexWrtier.commit()

flow (Inside finishCommit()). So in an ideal scenario a commit should

have

taken care of dropping the segment files. So that tells me the refCounts
for the files are not getting set to 0. I have a fair suspicion the
reindexing process running on the same index inside the same JVM has to

do

something with it.

Mike,
Thanks for the caution on Approach 2 ...good to at least be able to
continue on one train of thought. As mentioned in my response to Stefan,
the reindexing is going on *inside* of the Solr JVM as an asynchronous
thread and not as a separate process. So I believe the open reader you

are

alluding to might be the one I am opening to through

DirectoryReader.open()

(?) . However, looking at the code, I am seeing IndexFileDeleter.incRef()
only on the files in SegmentCommitInfos.

Does an incRef() also happen when an IndexReader is opened ?

Note:The index is a mix of 7.x and 8.x segments (on Solr 8.x). By

extending

TMP and overloading findMerges() I am preventing 7.x segments from
participating in merges, and the code only reindexes these 7.x segments
into the same index, segment-by-segment.
In the current tests I am performing, there are no parallel search or
indexing threads through an external request. The reindexing is the only
process interacting with the index. The goal is to eventually have this
running alongside any parallel indexing/search requests on the index.
Also, as noted earlier, by inspecting the SegmentInfos , I can see the

7.x

segment progressively reducing, but the files never get cleared.

If it is my reader that is throwing off the refCount for Solr, what could
be another way of reading the index without bloating it up with 0 doc
segments?

I will also try floating this in the Solr list to get answers to some of
the questions you pose around Solr's handling of readers..

Thanks,
Rahul




On Thu, Aug 31, 2023 at 6:48 AM Michael McCandless <
luc...@mikemccandless.com> wrote:


Hi Rahul,

Please do not pursue Approach 2 :)  ReadersAndUpdates.release is not
something the application should be calling.  This path can only lead to
pain.

It sounds to me like something in Solr is holding an old reader (maybe

the

last commit point, or reader prior to the refresh after you re-indexed

all

docs in a given now 100% deleted segment) open.

Does Solr keep old readers open, older than the most recent commit?  Do
you have queries in flight that might be holding the old reader open?

Given that your small by-hand test case (3 docs) correctly showed the

100%

deleted segment being reclaimed after the soft commit interval or a

manual

hard commit, something must be different in the larger use case that is
causing Solr t

Re: Reindexing leaving behind 0 live doc segments

2023-09-04 Thread Uwe Schindler
en(dir)) {
 for (LeafReaderContext lrc : reader.leaves()) {

//read live docs from each leaf , create a
SolrInputDocument out of Document and index using Solr api

 }
}catch(Exception e){

}

Approach 2:
==
ReadersAndUpdates rld = null;
SegmentReader segmentReader = null;
RefCounted iwRef =
core.getSolrCoreState().getIndexWriter(core);
  iw = iwRef.get();
try{
   for (SegmentCommitInfo sci : segmentInfos) {
  rld = iw.getPooledInstance(sci, true);
  segmentReader = rld.getReader(IOContext.READ);

 //process all live docs similar to above using the segmentReader.

 rld.release(segmentReader);
 iw.release(rld);
}finally{
if (iwRef != null) {
iwRef.decref();
 }
}

Help would be much appreciated!

Thanks,
Rahul


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Disjunctively scoring non-matching conjunctive clauses

2023-07-21 Thread Uwe Schindler

Hi,

this is the normal way to do this: use a filter or constant score query 
to do the matching and use disjunctive scoring as a long chain of 
"should" clauses.


Uwe

Am 21.07.2023 um 02:35 schrieb Marc D'Mello:

Hi all,

I'm an engineer on Amazon Product Search and I've recently come upon a
situation where I've required conjunctive matching but disjunctive scoring.
As a concrete example, let's say I have a query like this:

(+title:"a" +title:"b" +title:"c") (product_id:1)

This is saying I want to conjunctively match on the title OR I want to
match a specific product document where the product_id is 1.

Let's say the document where product_id = 1 has a title of "a b", so it
doesn't match the title query. In this case, the score for the title clause
will be 0 since to my understanding, Lucene doesn't count scores for
non-matching clauses. However for my use case, I would like to take into
account that several keywords did in fact match, so as I stated earlier,
disjunctive scoring even though I still want to match conjunctively,

My way of working around this right now is to reconstruct the query as the
following (forgive my made-up Lucene query syntax, hopefully it's still
readable):

+(ConstantScoreQuery: 0 ((+title:"a" +title:"b" +title:"c")
(product_id:1))) (title:"a" title:"b" title:"c")

Pretty much, I separate this into a matching query that is wrapped by a
ConstantScore query so it has no score and a scoring query that will
provide a disjunctive score.

My approach feels a bit convoluted, so I was wondering if there were any
cleaner ways to do this? And if not, are there any drawbacks to my
workaround performance wise?

Thanks!
Marc D'Mello


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Getting LinkageError due to Panama APIs

2023-06-30 Thread Uwe Schindler

Hi,

It is not obvious what you have done, but the issue may come from custom 
builds, e.g., if you are not using the original Lucene JAR file but a 
modified one. Another reason may be the Maven Shade plugin or other 
assemblies like Uber-JARs!


Make sure that all class files and the module information are included in the 
final Uber-JAR. The Lucene team does not support repackaging Lucene JAR 
files as this causes problems.


Uwe

Am 26.06.2023 um 21:18 schrieb Shubham Chaudhary:

Hi everyone,

I’m trying to build and run my software using JDK 19 which has a direct
dependency on Apache Lucene 9.6 built with JDK 17 and I’m running into
below exception due to Panama APIs. Is this expected behaviour? Any help
would be highly appreciated.

Exception in thread "main" java.lang.LinkageError:
MemorySegmentIndexInputProvider is missing in Lucene JAR file
 at 
org.apache.lucene.store.MMapDirectory.lookupProvider(MMapDirectory.java:437)
 at 
java.base/java.security.AccessController.doPrivileged(AccessController.java:318)
 at 
org.apache.lucene.store.MMapDirectory.doPrivileged(MMapDirectory.java:395)
 at org.apache.lucene.store.MMapDirectory.(MMapDirectory.java:448)
 :
 :



Thanks,
Shubham


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Question about index segment search order

2023-05-13 Thread Uwe Schindler

Hi,

in reference to previous code references and discussions from other 
Lucene committers I have to clarify:


 * If you run the query multithreaded (per segment), which means when
   you add an Executor to IndexSearcher, the order is not predictable,
   plain and simple
 * If you use Solr, a single query is not multithreaded. Solr works on
   shards and parallelizes them, but it does not parallelize search on
   a single index
 * If you want to have control over the order of segments when searching,
   there's an easy way with pure Lucene; Solr would need to be patched:
 o don't pass an Executor (see above)
 o when constructing the IndexSearcher, don't simply pass the
   IndexReader but instead "customize" it. There are two ways to do
   it: (a) You can take the existing IndexReader and then get all
   leaf segments from it (IndexReader#leaves() call). Sort the
   leaves in the order you would like them to be searched and then create a
   MultiReader on those sorted segments. (b) Alternatively use
   DirectoryReader#open() with a Comparator to sort the segments.
   You could order them in reverse by their segment ID.

Anyways, Solr needs to be patched, there are no API hooks to dig into 
that. You may be able to subclass SolrIndexSearcher, but you still need 
to hook it into the Solr control flow.
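
A sketch of option (a) in pure Lucene (the comparator is whatever ordering you 
want, e.g. newest segments first):

DirectoryReader reader = DirectoryReader.open(dir);
List<LeafReaderContext> leaves = new ArrayList<>(reader.leaves());
leaves.sort(myLeafComparator);                          // your own ordering criterion
IndexReader[] ordered = leaves.stream()
    .map(LeafReaderContext::reader)
    .toArray(IndexReader[]::new);
MultiReader sorted = new MultiReader(ordered, false);   // don't close the subreaders twice
IndexSearcher searcher = new IndexSearcher(sorted);     // no Executor: sequential, in this order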


Uwe

Am 08.05.2023 um 16:47 schrieb Wei:

Hi Michael,

I am applying early termination with Solr's EarlyTerminatingCollector
https://github.com/apache/solr/blob/d9ddba3ac51ece953d762c796f62730e27629966/solr/core/src/java/org/apache/solr/search/EarlyTerminatingCollector.java
,
which triggers EarlyTerminatingCollectorException in SolrIndexSearcher
https://github.com/apache/solr/blob/d9ddba3ac51ece953d762c796f62730e27629966/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L281

Thanks,
Wei


On Thu, May 4, 2023 at 11:47 AM Michael Sokolov  wrote:


Yes, sorry I didn't mean to imply you couldn't control this if you
want to. I guess in the typical setup it is not predictable. How are
you applying early termination? Are you using a standard Lucene
Collector or do you have your own?

On Thu, May 4, 2023 at 2:03 PM Patrick Zhai  wrote:

Hi Mike,
Just want to mention if the user chooses to use single thread to index

and

use LogXXMergePolicy then the document order will be preserved as index
order.



On Thu, May 4, 2023 at 10:04 AM Wei  wrote:


Hi Michael,

We are interested in the segment sequence for early termination. In our
case there is always a large dominant segment after index rebuild,

then

many small segments are generated with continuous updates as time goes

by.

When early termination is applied, the limit could be reached just for
traversing the dominant segment alone and the newer smaller segments
doesn't get a chance.  If we can control the segment sequence so that

the

newer segments are visited first, the documents with recent updates

can be

retrieved with early termination.  Do you think this makes sense? Any
suggestion is appreciated.

Thanks,
Wei

On Thu, May 4, 2023 at 3:33 AM Michael Sokolov

wrote:

There is no meaning to the sequence. The segments are created

concurrently

by many threads and the merge process will merge them without

regards to

any ordering.



On Wed, May 3, 2023, 1:09 PM Patrick Zhai

wrote:

For that part I'm not entirely sure, if other folks know it please

chime

in

:)

On Wed, May 3, 2023 at 8:48 AM Wei  wrote:


Thanks Patrick! In the default case when no LeafSorter is

provided,

are

the

segments traversed in the order of creation time, i.e. the oldest

segment

is always visited first?

Wei

On Tue, May 2, 2023 at 7:22 PM Patrick Zhai

wrote:

Hi Wei,
Lucene in general iterate through the index in the order of

what is

recorded in the SegmentInfos
<


https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java#L140

And at search time, you can specify the order using LeafSorter
<


https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/DirectoryReader.java#L75

when you're opening the IndexReader

Patrick

On Tue, May 2, 2023 at 5:28 PM Wei

wrote:

Hello,

We have a index that has multiple segments generated with

continuous

updates. Does Lucene  have a specific order when iterate

through

the

segments (assuming single query thread) ? Can the order be

customized

that

the latest generated segments are searched first?

Thanks,
Wei


-
To unsubscribe, e-mail:java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail:java-user-h...@lucene.apache.org



--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:u...@thetaphi.de


Re: Need help for conversion code from Lucene 2.4.0 to 8.11.2

2023-02-10 Thread Uwe Schindler

Hi,

the reason for this is that files in Lucene are always write-once. We 
never ever change a file after it was written and committed in the 
2-phase commit. If you write some own index files, e.g. as part of an 
index Codec, you must adhere to this rule. See the DocValues or LiveDocs 
implementation for an example of how "changes" are done in later commits: 
it creates new files with a similar name and a different suffix having some 
delta-like content.


In general I would really avoid dealing with custom index files. Since 
Lucene 2 there have been so many new features that it is never a good idea to 
have your own index file format. Often DocValues is the solution to all 
the problems you had in early Lucene versions. If you add your own 
stuff without knowing how the transactional model of Lucene works, then you 
are possibly causing index corruption. Index file formats need to be 
carefully designed with thoughts on transactional safety and 
performance.


If you just want to deal with temporary files, the Directory API allows 
you to maintain temporary files, too.
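
For example (sketch; prefix and suffix are arbitrary, the directory generates the actual file name):

IndexOutput tmp = directory.createTempOutput("rebuild", "tmp", IOContext.DEFAULT);
tmp.writeInt(223344);
tmp.close();
// the generated name is available via tmp.getName(); remove it later with directory.deleteFile(...)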


Uwe

Am 10.02.2023 um 06:49 schrieb Saha, Rajib:

Hi Uwe,

Thanks for the clarification.
We may have to rewrite the whole logic related to it, as the seek 
functionality no longer exists for IndexOutput.

BTW, I have one more query related to it.
On playing around, I see the directory.createOutput(String name, IOContext 
context) API throwing FileAlreadyExistsException in case the file [say 
output.index] already exists in 8.11.2.
Now I am wondering: if my process is closed and in a new process I want to use 
the same file [output.index] to keep appending writes, how can I achieve it?

My Sample code:

try {
    SimpleFSDirectory directory = new SimpleFSDirectory(new File("E:\\Lucene-index").toPath());
    IndexOutput output = directory.createOutput("output.index", IOContext.DEFAULT);
    output.writeInt(223344);
    output.writeString("Testing Testing");
    output.close();
} catch (Exception e) {
    e.printStackTrace();
}
==



Regards
Rajib

-Original Message-
From: Uwe Schindler 
Sent: 06 February 2023 16:46
To: java-user@lucene.apache.org
Subject: Re: Need help for conversion code from Lucene 2.4.0 to 8.11.2

Hi,

Since around Lucene 4 (maybe already in 3) there is no way to write
index files in random access anymore. All data must be written in
sequence (aka input stream). This is especially important as Lucene
index files use checksums since around Lucene 5.

Uwe

Am 06.02.2023 um 11:57 schrieb Saha, Rajib:

Hi Mikhail,

Thanks for all you’re your suggestions in one shot.
It helped us a lot.
Thank you very much once again. 

Need one more suggestion for below API.
==
IndexOutput.seek(long pos)
==

We have used it extensively in around 40-50 places.
Currently, this API is not there.

Could you please suggest, how we can handle the API in 8.11.2?

Regards
Rajib


-Original Message-
From: Mikhail Khludnev 
Sent: 01 February 2023 12:22
To: java-user@lucene.apache.org
Subject: Re: Need help for conversion code from Lucene 2.4.0 to 8.11.2

Hello, Rajib.

On Mon, Jan 30, 2023 at 4:07 PM Saha, Rajib 
wrote:


Hi Mikhail,

Thanks for your suggestion. It solved lots of cases today in my end. 

I need some more suggestions from your end. I am putting together as below
one by one:

In 2.4, we have used couple of cases with APIs:

Field(String name, String value, Field.Store store, Field.Index index)
Field(String name, String value, Field.Store store, Field.Index index,
Field.TermVector termVector)


Check org.apache.lucene.document.StringField/TextField and its FieldType
constants.



In 8.11, I can see suitable API corresponding to it as :
Field(String name, Reader reader, IndexableFieldType type)

But, I am not clear, how can I use IndexableFieldType for Field.Store,
Field.Index, Field.TermVector.
Can you please suggest here?


check usages for org.apache.lucene.document.Field.Store
org.apache.lucene.document.FieldType#setIndexOptions
org.apache.lucene.document.FieldType#setStoreTermVectors




=

In 2.4, there was an API:
IndexReader.indexExists(File file)
This checks, if index files exists in the path.

In 8.11, any API, which can do the same job?


org.apache.lucene.index.DirectoryReader#indexExists



==
In 2.4, there was an API:
IndexReader.isLocked(FSDirectory fsdir)
IndexReader.unlock(Directory directory)

In 8.11, are IndexReader and IndexWritter synchronized enough internally
for not using the APIs?


org.apache.lucene.store.BaseDirectory#obtainLock
Lock.close()

IndexWriters are mutually exclusive via lock factory.
org.apache.lucene.index.DirectoryReader#open(org.apache.lucene.index.IndexWriter)
opens NRT reader i.e. search what not yet committed

Re: Need help for conversion code from Lucene 2.4.0 to 8.11.2

2023-02-06 Thread Uwe Schindler

Hello, Rajib.
APIs have evolved since 2.4, but it should be clear:



https://lucene.apache.org/core/8_11_2/core/org/apache/lucene/index/package-summary.html#fields


On Wed, Jan 18, 2023 at 1:11 PM Saha, Rajib 
wrote:


Hi All,

We are in a process for conversion of Lucene from  2.4.0 to 8.11.2 for

our

platform code.
We have used extensively Lucene in our code.

We have replaced several of our code to Lucene 8.11.2 APIs.

But, few places, we are stuck of which New Lucene APIs to use, as not
getting any suitable match.

Can somebody help me, how we can convert below code using Lucene 2.4.0

to

8.11.2?


ProcessDocs(IndexReader reader, Term t) {

   final TermDocs termDocs = reader.termDocs();
   termDocs.seek(t);
   while (termDocs.next()) {
 //Some internal function to process the doc.
 forEach.process(termDocs.doc());
   }

}
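
For reference, a rough 8.11 equivalent of that loop using the per-leaf postings API 
could look like this (untested sketch; forEach is the same helper as in the snippet above):

void processDocs(IndexReader reader, Term t) throws IOException {
    for (LeafReaderContext leaf : reader.leaves()) {
        PostingsEnum postings = leaf.reader().postings(t, PostingsEnum.NONE);
        if (postings == null) {
            continue;   // term does not occur in this segment
        }
        int doc;
        while ((doc = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
            forEach.process(leaf.docBase + doc);   // docBase converts to a top-level docID
        }
    }
}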

Regards
Rajib



--
Sincerely yours
Mikhail Khludnev



https://t.me/MUST_SEARCH

A caveat: Cyrillic!

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



--
Sincerely yours
Mikhail Khludnev

https://t.me/MUST_SEARCH
A caveat: Cyrillic!

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Question about current situation of good first issues in GitHub

2023-01-10 Thread Uwe Schindler

Hi,

The old JIRA labels are also in GitHub. See tags named 
"legacy-jira-label:*". The equivalent search would be this: 
https://github.com/apache/lucene/labels/legacy-jira-label%3Anewdev


Uwe

Am 10.01.2023 um 12:41 schrieb Stefan Vodita:

Hello Shunya,

As far as I know, GitHub issues are not marked for new developers yet.
The project migrated a few months ago from Jira to GitHub issues, so
you can still search the old labels in Jira . In particular, there is `newdev`
for good starter issues [1].

Hope this helps,
Stefan

[1] 
https://issues.apache.org/jira/browse/LUCENE-8674?jql=project%20%3D%20LUCENE%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20newdev%20ORDER%20BY%20labels%20ASC%2C%20priority%20DESC%2C%20updated%20DESC



On 08/01/2023, 14:27, "Shunya Ueta"  wrote:

Hello Lucene users.
Last time I checked `good first issue` in GitHub issues to start a
contribution of Lucene.


https://github.com/apache/lucene/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22

But currently there are no issues with this label.
I don't know the current operation of this label, but will this label be 
utilized in the future?
Because issues with the good first issue label are a very nice starting point for 
beginner contributors.

Thanks & Regards!

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Need your perspective on Garbage Collection

2023-01-03 Thread Uwe Schindler

Hi Satnam,

Can you please share some details about which application using Lucene 
you are using? For Solr and Elasticsearch there are recommendations and 
default startup scripts. If it is your own Lucene application we would 
also need more details.


Basically, Lucene itself needs very little heap to execute queries and 
index stuff. With an index of 700 gigabytes you should still be able to 
use a small heap (like a few gigabytes). Problems are mostly located 
outside of Lucene, e.g., code trying to fetch all results of a large 
query result using TopDocs paging ("deep paging problem"). So please 
share more details so we can give you some answers. Maybe also the source code 
where it hangs.


Uwe

Am 03.01.2023 um 13:49 schrieb _ SATNAM:

Hi,
The issue is that my garbage collection runs quite often. I configured my
JVM as recommended (went through several articles and blogs on Lucene) and also
provide enough RAM and memory (not so large as to trigger GC). The main cause of
concern is that GC runs for more than 10 minutes (sometimes even 15 minutes).
This makes the whole server get stuck and search stops responding. To work around
it, what I am doing right now is restarting my server (a very bad approach). Can
you please help me manage it and provide your insight on what steps or
configuration I should prefer, and some useful way to optimize it.
My index size is 700 GB

What configuration do you suggest for it,
like JVM, RAM, CPU cores, heap size, young and old generation?
I hope to hear from you soon

    -


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Recurring index corruption

2023-01-02 Thread Uwe Schindler

Hi,

Yeah, sorry for leading the issue into the wrong direction. I was just 
stumbling on the exception message, and because we do not spend much time 
in improving/supporting the use of NIOFSDirectory, I may have moved this 
mailing list thread off track.


I don't think the directory implementation will change much when using 
Samba. Recent versions of the CIFS specification (the spec behind Samba) 
and the implementation in Windows Server are fine. But I doubt that 
recent implementations in Linux kernel are correct and fully support the 
spec.


The corruption issue you see is just detected by NIOFSDirectory (which 
I stumbled upon), but with MMapDirectory it will happen in the same way, 
as the problem is caused by fsync not working correctly and files 
appearing in the wrong order on network devices (this is very important for 
Lucene). When writing and fsyncing index files, Lucene uses the same 
code (sequential writes to segment files, then writing the commit file and 
finally atomically renaming it to its final location - a 2-phase commit). 
This must be fully supported and implemented by the file system, also 
with making changes visible in the correct order, which is not the case 
for NFS (due to stale caches), but *should* work with the recent CIFS 
spec, but Samba may not have implemented it correctly in the Linux kernel.


This is the reason for the recommendation to not use network file 
systems. Your example just confirmed this recommendation.


Uwe

Am 02.01.2023 um 19:24 schrieb S S:

Hi Uwe,

I will report the bug to ES, as you suggested.

Do you reckon using MMap would have an effect on the index corruption when using 
SMB? I have to report back to my manager in a few days to decide whether to carry 
on with ACIs or find another hosting solution. It is unfortunate there seem to 
be problems with this solution. Microsoft seems not interested in extending the 
volume mapping options for ACIs and K8 is overkill for our use case.

Thank you for your help so far, you have been very kind :)

Cheers,

Seb


On 2 Jan 2023, at 19:09, Uwe Schindler  wrote:

Hi,

Please open a bug report at ES. The setting vm.max_map_count is not needed and 
should not be changed unless really needed, because it uses kernel resources.

This has to do with their support (they try to tell people to overshard and to prevent 
support requests they ask to raise this setting). The default value on Linux is 65530. 
This would allow you to memory map 65530 chunks of 1 GiB (that's the limitation without 
preview enabled in the Java 19 JVM). A shard in Lucene has about 70 files, most of them 
with filesizes < 1 GiB and a few > 1 GiB (but < 5 GiB) chunks (so let's assume 90 
mappings needed for a shard), so you could have approximately 728 shards per node. Sorry, 
raising this setting is not needed!

Uwe

Am 02.01.2023 um 18:24 schrieb S S:

I also tried enabling preview but no joy, same error :(

It looks like it is not possible to start a multinode ES cluster without 
setting vm.max_map_count. I also googled it and this check cannot be disabled.

I guess MMapDirectory is not an option for ES on ACIs, unless you have 
something else I can try?

Many thanks,

Seb


On 2 Jan 2023, at 17:55, S S  wrote:

Thank you Uwe, this is great! I am rebuilding the cluster using MMapDirectory 
and no enable-preview, as you suggested. Let’s see what happens.

Cheers,

Seb


On 2 Jan 2023, at 17:51, Uwe Schindler  wrote:

Hi,

in recent versions it works like that:

https://www.elastic.co/guide/en/elasticsearch/reference/current/advanced-configuration.html#set-jvm-options

So in folder jvm.options.d/ add a new file (like "preview.conf") and put 
"19:--enable-preview" into it. It is basically the same like modifying heap size.

But in general, you can simply use MMapDirectory, the max-map-count setting is 
only relevant in *huge* (huge means hundreds of huge indexes per node). In that 
case Java 19's preview features would be recommended.

Uwe

Am 02.01.2023 um 17:41 schrieb S S:

Hi Uwe,

Sorry for the late reply but upgrading the docker image to use OpenJDK was 
easier said than done.

I am not a Java developer/expert so, sorry for the stupid question but, how do 
I specify the --enable-preview flag? ES has got a quite complex way to start, so 
I cannot specify the flag on the command line. You suggested using a 
jvm.properties file but I cannot find anything useful about it when googling. 
Where should it be placed? And what should I write in it?

I can see ES recognising OpenJDK 19 while bootstrapping and suggesting to 
enable preview, but it does not suggest how, and I cannot find anything on the 
ES website.

Many thanks.

Seb


On 2 Jan 2023, at 11:48, Uwe Schindler  wrote:

Hi,

in general you can still use MMapDirectory. There is no requirement to set 
vm.max_map_count for smaller clusters. The information in Elastic's 
documentation is not mandatory and is misleading.

If you use newest version of Elasticse

Re: Recurring index corruption

2023-01-02 Thread Uwe Schindler

Hi,

Please open a bug report at ES. The setting vm.max_map_count is not 
needed and should not be changed unless really needed, because it uses 
kernel resources.


This has to do with their support (they try to tell people to overshard 
and to prevent support requests they ask to raise this setting). The 
default value on Linux is 65530. This would allow you to memory map 
65530 chunks of 1 GiB (that's the limitation without preview enabled in 
the Java 19 JVM). A shard in Lucene has about 70 files, most of them 
with filesizes < 1 GiB and a few > 1 GiB (but < 5 GiB) chunks (so let's 
assume 90 mappings needed for a shard), so you could have approximately 
728 shards per node. Sorry, raising this setting is not needed!


Uwe

Am 02.01.2023 um 18:24 schrieb S S:

I also tried enabling preview but no joy, same error :(

It looks like it is not possible to start a multinode ES cluster without 
setting vm.max_map_count. I also googled it and this check cannot be disabled.

I guess MMapDirectory is not an option for ES on ACIs, unless you have 
something else I can try?

Many thanks,

Seb


On 2 Jan 2023, at 17:55, S S  wrote:

Thank you Uwe, this is great! I am rebuilding the cluster using MMapDirectory 
and no enable-preview, as you suggested. Let’s see what happens.

Cheers,

Seb


On 2 Jan 2023, at 17:51, Uwe Schindler  wrote:

Hi,

in recent versions it works like that:

https://www.elastic.co/guide/en/elasticsearch/reference/current/advanced-configuration.html#set-jvm-options

So in folder jvm.options.d/ add a new file (like "preview.conf") and put 
"19:--enable-preview" into it. It is basically the same like modifying heap size.

But in general, you can simply use MMapDirectory, the max-map-count setting is 
only relevant in *huge* (huge means hundreds of huge indexes per node). In that 
case Java 19's preview features would be recommended.

Uwe

Am 02.01.2023 um 17:41 schrieb S S:

Hi Uwe,

Sorry for the late reply but upgrading the docker image to use OpenJDK was 
easier said that done.

I am not a Java developer/expert so, sorry for the stupid question but, how do 
I specify the --enable-preview flag? ES has got a quite complex way to start so 
I cannot specify the flag on the command line. You suggested to use a 
jvm.properties file but I cannot find anything useful about it when google-ing. 
Where should it be placed? And what should I write in it?

I can see ES recognising OpenJDK 19 while bootstrapping and suggesting to 
enable preview, but it does not suggest how, and I cannot find anything on the 
ES website.

Many thanks.

Seb


On 2 Jan 2023, at 11:48, Uwe Schindler  wrote:

Hi,

in general you can still use MMapDirectory. There is no requirement to set 
vm.max_map_count for smaller clusters. The information in Elastics 
documentation is not mandatory and misleading.

If you use newest version of Elasticsearch with Java 19 and you use 
`--enable-preview` in you jvm.properties file, you don't even need to change 
that setting even with larger clusters.

Uwe

Am 02.01.2023 um 11:18 schrieb S S:

We are experimenting with Elastic Search deployed in Azure Container Instances 
(Debian + OpenJDK). The ES indexes are stored into an Azure file share mounted 
via SMB (3.0). The Elastic Search cluster is made up of 4 nodes, each one having 
a separate file share to store the indices.

This configuration has been influenced by some ACIs limitations, specifically:

we cannot set the max_map_count value as we do not have access to the 
underlying host 
(https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html).
 Unfortunately, this is required to run an ES cluster, therefore we were forced 
to use NIOF
ACI’s storage is ephemeral, therefore we had to map volumes to persist the 
indexes. ACIs only allow volume mappings using Azure File Shares, which only 
works with NFS or SMB.

We are experiencing recurring index corruption, specifically a "read past EOF" 
exception. I asked on the Elastic Search forum but the answer I got was a bit generic and 
not really helpful other than confirming that, from ES point of view, ES should work on 
an SMB share as long as it behaves as a local drive. As the underlying exception relates 
to an issue with a Lucene index, I was wondering if you could help out? Specifically, can 
Lucene work on SMB? I can only find sparse information on this configuration and, while 
NFS seems a no-no, for SMB is not that clear. Below is the exception we are getting.

java.io.IOException: read past EOF: 
NIOFSIndexInput(path="/bitnami/elasticsearch/data/indices/mS2bUbLtSeG0FSAMuKX7JQ/0/index/_ldsn_1.fnm")
 buffer: java.nio.HeapByteBuffer[pos=0 lim=1024 cap=1024] chunkLen: 1024 end: 2331: 
NIOFSIndexInput(path="/bitnami/elasticsearch/data/indices/mS2bUbLtSeG0FSAMuKX7JQ/0/index/_ldsn_1.fnm")
  at 
org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.rea

Re: Recurring index corruption

2023-01-02 Thread Uwe Schindler

Hi,

in recent versions it works like that:

https://www.elastic.co/guide/en/elasticsearch/reference/current/advanced-configuration.html#set-jvm-options

So in folder jvm.options.d/ add a new file (like "preview.conf") and put 
"19:--enable-preview" into it. It is basically the same as modifying the 
heap size.


But in general, you can simply use MMapDirectory; the max-map-count 
setting is only relevant in *huge* setups (huge means hundreds of huge indexes 
per node). In that case Java 19's preview features would be recommended.


Uwe

Am 02.01.2023 um 17:41 schrieb S S:

Hi Uwe,

Sorry for the late reply but upgrading the docker image to use OpenJDK was 
easier said that done.

I am not a Java developer/expert so, sorry for the stupid question but, how do 
I specify the --enable-preview flag? ES has got a quite complex way to start so 
I cannot specify the flag on the command line. You suggested to use a 
jvm.properties file but I cannot find anything useful about it when google-ing. 
Where should it be placed? And what should I write in it?

I can see ES recognising OpenJDK 19 while bootstrapping and suggesting to 
enable preview, but it does not suggest how, and I cannot find anything on the 
ES website.

Many thanks.

Seb


On 2 Jan 2023, at 11:48, Uwe Schindler  wrote:

Hi,

in general you can still use MMapDirectory. There is no requirement to set 
vm.max_map_count for smaller clusters. The information in Elastics 
documentation is not mandatory and misleading.

If you use newest version of Elasticsearch with Java 19 and you use 
`--enable-preview` in you jvm.properties file, you don't even need to change 
that setting even with larger clusters.

Uwe

Am 02.01.2023 um 11:18 schrieb S S:

We are experimenting with Elastic Search deployed in Azure Container Instances 
(Debian + OpenJDK). The ES indexes are stored into an Azure file share mounted 
via SMB (3.0). The Elastic Search cluster is made up of 4 nodes, each one have 
a separate file share to store the indices.

This configuration has been influenced by some ACIs limitations, specifically:

we cannot set the max_map_count value as we do not have access to the 
underlying host 
(https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html).
 Unfortunately, this is required to run an ES cluster, therefore we were forced 
to use NIOF
ACI’s storage is ephemera, therefore we had to map volumes to persist the 
indexes. ACIs only allow volume mappings using Azure File Shares, which only 
works with NFS or SMB.

We are experiencing recurring index corruption, specifically a "read past EOF" 
exception. I asked on the Elastic Search forum but the answer I got was a bit generic and 
not really helpful other than confirming that, from ES point of view, ES should work on 
an SMB share as long as it behaves as a local drive. As the underlying exception relates 
to an issue with a Lucene index, I was wondering if you could help out? Specifically, can 
Lucene work on SMB? I can only find sparse information on this configuration and, while 
NFS seems a no-no, for SMB is not that clear. Below is the exception we are getting.

java.io.IOException: read past EOF: 
NIOFSIndexInput(path="/bitnami/elasticsearch/data/indices/mS2bUbLtSeG0FSAMuKX7JQ/0/index/_ldsn_1.fnm")
 buffer: java.nio.HeapByteBuffer[pos=0 lim=1024 cap=1024] chunkLen: 1024 end: 2331: 
NIOFSIndexInput(path="/bitnami/elasticsearch/data/indices/mS2bUbLtSeG0FSAMuKX7JQ/0/index/_ldsn_1.fnm")
   at 
org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:200)
 ~[lucene-core-9.3.0.jar:?]
   at 
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:291) 
~[lucene-core-9.3.0.jar:?]
   at 
org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:55) 
~[lucene-core-9.3.0.jar:?]
   at 
org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:39)
 ~[lucene-core-9.3.0.jar:?]
   at 
org.apache.lucene.codecs.CodecUtil.readBEInt(CodecUtil.java:667) 
~[lucene-core-9.3.0.jar:?]
   at 
org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:184) 
~[lucene-core-9.3.0.jar:?]
   at 
org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:253) 
~[lucene-core-9.3.0.jar:?]
   at 
org.apache.lucene.codecs.lucene90.Lucene90FieldInfosFormat.read(Lucene90FieldInfosFormat.java:128)
 ~[lucene-core-9.3.0.jar:?]
   at 
org.apache.lucene.index.SegmentReader.initFieldInfos(SegmentReader.java:205) 
~[lucene-core-9.3.0.jar:?]
   at 
org.apache.lucene.index.SegmentReader.(SegmentReader.java:156) 
~[lucene-core-9.3.0.jar:?]
   at 
org.apache.lucene.index.ReadersAndUpdates.createNewReaderWithLatestLiveDocs(ReadersAndUpdates.java:738)
 ~[lucene-core-9.3.0.jar:?]
   at 
org.apache.lucene.index

Re: Recurring index corruption

2023-01-02 Thread Uwe Schindler
:170) 
~[lucene-core-9.3.0.jar:?]
   at 
org.elasticsearch.index.engine.ElasticsearchReaderManager.refreshIfNeeded(ElasticsearchReaderManager.java:48)
 ~[elasticsearch-8.4.1.jar:?]
   at 
org.elasticsearch.index.engine.ElasticsearchReaderManager.refreshIfNeeded(ElasticsearchReaderManager.java:27)
 ~[elasticsearch-8.4.1.jar:?]
   at 
org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:167)
 ~[lucene-core-9.3.0.jar:?]
   at 
org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240)
 ~[lucene-core-9.3.0.jar:?]
   at 
org.elasticsearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:355)
 ~[elasticsearch-8.4.1.jar:?]
   at 
org.elasticsearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:335)
 ~[elasticsearch-8.4.1.jar:?]
   at 
org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:167)
 ~[lucene-core-9.3.0.jar:?]
Many thanks.

Seb


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: What exactly returns IndexReader.numDeletedDocs()

2022-12-08 Thread Uwe Schindler
If this is a reader with only a few documents, the likelihood that all 
deletes are fully applied (and the affected segments rewritten) while closing is high.
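A minimal sketch of that point (not part of the original mail; index path, field name and uuid are placeholders): a near-real-time reader opened from the still-open IndexWriter will usually still report the pending deletes, while a reader opened after close() may already see them merged away:

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

try (Directory dir = FSDirectory.open(Paths.get("/path/to/index"));
     IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
  writer.deleteDocuments(new Term("path", uuid));
  // NRT reader: deleted documents are still visible as deleted live-doc bits
  try (DirectoryReader nrtReader = DirectoryReader.open(writer)) {
    System.out.println("numDeletedDocs: " + nrtReader.numDeletedDocs());
  }
}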


Uwe

Am 08.12.2022 um 11:44 schrieb Michael Wechner:

My code at the moment is as follows:

Directory dir = FSDirectory.open(Paths.get(vectorIndexPath));

IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(vectorIndexPath)));

int numberOfDocsBeforeDeleting = reader.numDocs();
log.info("Number of documents: " + numberOfDocsBeforeDeleting);
log.info("Number of deleted documents: " + reader.numDeletedDocs());
reader.close();

log.info("Delete document with path '" + uuid + "' from index '" + vectorIndexPath + "' ...");
IndexWriterConfig iwc = new IndexWriterConfig();
IndexWriter writer = new IndexWriter(dir, iwc);

Term term = new Term(PATH_FIELD, uuid);
writer.deleteDocuments(term);
writer.close();

reader = DirectoryReader.open(FSDirectory.open(Paths.get(vectorIndexPath)));

int numberOfDocsAfterDeleting = reader.numDocs();
log.info("Number of documents: " + numberOfDocsAfterDeleting);
log.info("Number of deleted documents: " + (numberOfDocsBeforeDeleting - numberOfDocsAfterDeleting));
// TODO: Not sure whether the method numDeletedDocs() makes sense here
log.info("Number of deleted documents: " + reader.numDeletedDocs());
reader.close();



whereas this code always returns 0, whereas

numberOfDocsBeforeDeleting - numberOfDocsAfterDeleting

produces the correct result.

Should I open the reader before closing the writer?

Thanks

Michael



Am 08.12.22 um 11:36 schrieb Uwe Schindler:

You have to reopen the index reader to see deletes from the indexwriter.

Am 08.12.2022 um 10:32 schrieb Hrvoje Lončar:

Did you call this method before or after commit method?
My wild guess would be that you can count deleted documents inside
transaction only.

On Thu, Dec 8, 2022 at 12:10 AM Michael Wechner 


wrote:


Hi

I am using Lucen 9.4.2 vector search and everything seems to work 
fine,
except that when I delete some documents from the index, then the 
method



https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/index/IndexReader.html#numDeletedDocs() 



always returns 0, whereas I would have expected that it would 
return the

number of documents which I deleted from the index.

IndexReader.numDocs() returns the correct number though.

I guess I misunderstand the javadoc and in particular the note 
"*NOTE*:

This operation may run in O(maxDoc)."

Does somebody explain in more detail what this method is doing?

Thanks

Michael






--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: What exactly returns IndexReader.numDeletedDocs()

2022-12-08 Thread Uwe Schindler

You have to reopen the index reader to see deletes from the indexwriter.

Am 08.12.2022 um 10:32 schrieb Hrvoje Lončar:

Did you call this method before or after commit method?
My wild guess would be that you can count deleted documents inside
transaction only.

On Thu, Dec 8, 2022 at 12:10 AM Michael Wechner 
wrote:


Hi

I am using Lucen 9.4.2 vector search and everything seems to work fine,
except that when I delete some documents from the index, then the method


https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/index/IndexReader.html#numDeletedDocs()

always returns 0, whereas I would have expected that it would return the
number of documents which I deleted from the index.

IndexReader.numDocs() returns the correct number though.

I guess I misunderstand the javadoc and in particular the note "*NOTE*:
This operation may run in O(maxDoc)."

Does somebody explain in more detail what this method is doing?

Thanks

Michael




--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Sort by numeric field, order missing values before anything else

2022-11-21 Thread Uwe Schindler

Hi,

Long.MIN_VALUE and Long.MAX_VALUE are the correct way to sort longs. 
In fact, if you have Long.MIN_VALUE in your collection, empty values are 
treated the same as that value, but the empty value will still appear at the 
wanted place. In contrast to the default "0", it is not somewhere in the middle. 
Because there is no long that is smaller than Long.MIN_VALUE, the sort 
order will be OK.
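For illustration, a minimal sketch of that approach (assuming a NumericDocValuesField named "price" plus an existing query and IndexSearcher):

import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

SortField sortField = new SortField("price", SortField.Type.LONG);
// missing values now sort before every real value in an ascending sort;
// use Long.MAX_VALUE instead to push them to the end
sortField.setMissingValue(Long.MIN_VALUE);
TopDocs hits = searcher.search(query, 10, new Sort(sortField));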


BTW, Apache Solr is using exactly those values to support missing values 
automatically (see sortMissingFirst, sortMissingLast schema options).


In fact, string/bytes sorting theoretically has the same problem, 
because NULL is still different from empty. WARNING: If you really want 
to compare by byte[] as suggested in your last mail, keep in mind: When 
you sort against the raw bytes (using NumericUtils) with the SORTED_SET 
docvalues type, there is a large overhead on indexing and sorting 
performance, especially for the case where you have many different 
values in your index (which is likely for numerics).


Uwe

Am 17.11.2022 um 08:47 schrieb Adrien Grand:

Hi Petko,

Lucene's comparators for numerics have this limitation indeed. We haven't
got many questions around that in the past, which I would guess is due to
the fact that most numeric fields do not use the entire long range,
specifically Long.MIN_VALUE and Long.MAX_VALUE, so using either of these
works as a way to sort missing values first or last. If you have a field
that may use Long.MIN_VALUE and long.MAX_VALUE, we do not have a comparator
that can easily sort missing values first or last reliably out of the box.

The easier option I can think of would consist of using the comparator for
longs with MIN_VALUE / MAX_VALUE for missing values depending on whether
you want missing values sorted first or last, and chain it with another
comparator (via a FieldComparatorSource) which would sort missing values
before/after existing values. The benefit of this approach is that you
would automatically benefit from some not-so-trivial features of Lucene's
comparator such as dynamic pruning.

On Wed, Nov 16, 2022 at 9:16 PM Petko Minkov  wrote:


Hello,

When sorting documents by a NumericDocValuesField, how can documents be
ordered such that those with missing values can come before anything else
in ascending sorts? SortField allows to set a missing value:

 var sortField = new SortField("price", SortField.Type.LONG);
 sortField.setMissingValue(null);

This null is however converted into a long 0 and documents with missing
values are considered equally ordered with documents with an actual 0
value. It's possible to set the missing value to Long.MIN_VALUE, but that
will have the same problem, just for a different long value.

Besides writing a custom comparator, is there any simpler and still
performant way to achieve this sort?

--Petko




--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Migrating WhitespaceTokenizerFactory from 8.2 to 9.4

2022-10-29 Thread Uwe Schindler

Hi,

we can't help you here without the full source code and your build system 
setup. Generally those errors only happen if you are using shading 
or any other tool that creates UBER JARs. E.g., for Maven's UBER JARs you 
need to add some resource transformers, so it includes all necessary 
files. I checked the JAR file of Lucene, it has all services entries.


General recommendation: Please do not repackage Lucene, use the 
*original* JAR files. Also, if you are using the Java 11 module system in 
your project, it is very important to not repackage JARs, otherwise it 
breaks completely! This is why: when the Java 11 module system is 
used, service providers are found via the module-info.class file that is 
part of every JAR (META-INF is no longer used). And this file with the exact 
same name is part of every JAR file. When you merge them it breaks, as only 
one survives (e.g. the one from Lucene Core, as I see in your output; 
Lucene core only has the standard tokenizer).
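As a quick diagnostic, a small sketch (assuming Lucene 9.x; the class name is made up) that lists which tokenizer names the SPI actually sees on the current classpath and then loads the whitespace factory by name:

import java.util.HashMap;
import org.apache.lucene.analysis.TokenizerFactory;

public class SpiCheck {
  public static void main(String[] args) {
    // prints all names registered via META-INF/services (or module-info)
    System.out.println(TokenizerFactory.availableTokenizers());
    // throws IllegalArgumentException if "whitespace" is not registered
    TokenizerFactory factory = TokenizerFactory.forName("whitespace", new HashMap<>());
    System.out.println(factory.getClass().getName());
  }
}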


Uwe

Am 28.10.2022 um 21:46 schrieb Shifflett, David [USA]:

I am migrating my project’s usage of Lucene from 8.2 to 9.4.
The migration documentation has been very helpful,
but doesn’t help me resolve this exception:

‘Caused by: java.lang.IllegalArgumentException: A SPI class of type 
org.apache.lucene.analysis.TokenizerFactory with name 'whitespace' does not 
exist. You need to add the corresponding JAR file supporting this SPI to your 
classpath. The current classpath supports the following names: [standard]’

My project includes the lucene-analysis-common JAR,
and my JAR includes 
org/apache/lucene/analysis/core/WhitespaceTokenizerFactory.class.

I am not familiar with how Java SPI is configured and built.

I tried creating META-INF/services/org.apache.lucene.analysis.TokenizerFactory
containing: org.apache.lucene.analysis.core.WhitespaceTokenizerFactory

What am I missing?

Any help would be appreciated.

Thanks,
David Shifflett


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: java 17 and older lucene (4.x)

2022-09-26 Thread Uwe Schindler

Hi,

Lucene >=5.5.4 should work with Java 11 or 17 out of the box; versions 
before that will not fully work, unless you use a directory 
implementation other than MMapDirectory. So theoretically you can use 
NIOFSDirectory and it should also work in Lucene 4.
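A minimal sketch of that workaround against the Lucene 4.x API (the index path is a placeholder):

import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.NIOFSDirectory;

// explicitly pick NIOFSDirectory instead of the MMapDirectory default
Directory dir = new NIOFSDirectory(new File("/path/to/index"));
DirectoryReader reader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);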


Officially only Lucene 6+ is supported with Java 9 or later, but some 
patches were backported to 5.5.x.


To find the correct issue numbers to backport look for "Java 9" in the 
changelog (e.g. start here 
https://lucene.apache.org/core/5_5_5/changes/Changes.html).


Uwe

Am 26.09.2022 um 10:20 schrieb Thomas Matthijs:

Hello,

Just wondering if anyone has patched lucene 4.x for usage with java 17+ and 
willing to share their work? anything would be appreciated.

No we cannot upgrade lucene, and will likely spend time to try to 
backport/patch it ourselves, but maybe someone already has? if anyone has 
interest in our results let me know as well.

Thanks

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene 9.2.0 build fails on Windows

2022-09-14 Thread Uwe Schindler

Hi,

do you have Microsoft Visual Studio installed? It looks like Gradle 
tries to detect it and fails with some NullPointerException while 
parsing a JSON file from its installation.


The misc module contains some (optional) native code that will get 
compiled (optionally) with Visual C++. It looks like that breaks.


I have no idea how to fix this. Dawid: Maybe we can also make the 
configuration of that native stuff only opt-in? So only detect Visual 
Studio when you actively activate native code compilation?


Uwe

Am 13.09.2022 um 21:00 schrieb Rahul Goswami:

Hi Dawid,
I believe you. Just that for some reason I have never been able to get it
to work on Windows. Also, being a complete newbie to gradle doesn't help
much. So would appreciate some help on this while I find my footing. Here
is the link to the diagnostics that you requested (since attachments/images
won't make it through):

https://drive.google.com/file/d/15pt9Qt1H98gOvA5e0NrtY8YYHao0lgdM/view?usp=sharing


Thanks,
Rahul

On Tue, Sep 13, 2022 at 1:18 PM Dawid Weiss  wrote:


Hi Rahul,

Well, that's weird.


"releases/lucene/9.2.0"  -> Run "gradlew help"

If you need additional stacktrace or other diagnostics I am happy to
provide the same.

Could you do the following:

1) run: git --version so that we're on the same page as to what the
git version is (I don't think this matters),
2) run: gradlew help --stacktrace

Step (2) should provide the exact place that fails. Something is
definitely wrong because I'm on Windows and it works for me like a
charm.

Dawid

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: [External] Re: Can lucene be used in Android ?

2022-09-12 Thread Uwe Schindler
That is exactly as said: the Lucene main branch is our development branch 
and will possibly be branched as Lucene 10; this branch is currently at 
Java 17 minimum.


For current releases of Apache Lucene 9, we use a separate branch 
"branch_9x" where we cherry-pick stuff. This branch runs with Java 11 
minimum (but of course runs also with 17 or later; when Java 19 is out 
we will also have a Multi-Release JAR file with the MMapDirectory IO 
implementation based on Java Project Panama).


branch_8x is in a completely different repository (together with Apache 
Solr) and is rarely updated now (by cross-repo cherry-picking).


Uwe

Am 12.09.2022 um 02:23 schrieb Shifflett, David [USA]:

Hi Uwe,

I am a little confused by your 2 statements.


Lucene 9.x series requires JDK 11 to run
The main branch is already on JDK 17

Will Lucene 9.x run on JDK 17?
Is 9.x 'the main branch'?

Thanks,
David Shifflett
Senior Lead Technologist
Enterprise Cross Domain Solutions (ECDS)
Booz Allen Hamilton

On 9/10/22, 5:30 AM, "Uwe Schindler"  wrote:

 Hi Jie,

 actually the Lucene 9.x series requires JDK 11 to run, previous versions
 also work with Java 8. The main branch is already on JDK 17. From my
 knowledge, you may only use Lucene versions up to 8 to have at least a
 chance to run it. But with older Android version you may even need to go
 back to Lucene builds targetting JDK 7 (Lucene 5 ?, don't know).

 But this is only half of the story: Lucene actually uses many many
 modern JDK and JVM features that are partly not implemented in Dalvik.
 It uses MethodHandles instead of reflection and the Java 8+ version use
 lambdas which were not compatible with older Android SDKs.

 So in short: Use older version and hope, but we offer no support or are
 not keen to apply changes to Lucene so it can be used with Android at
 all - because Android is not really compatible to any Java spec like API
 or memory model.

 Uwe

 Am 09.09.2022 um 09:10 schrieb Jie Wang:
 > Hey,
 >
 > Recently, I am trying to compile the Lucene to get a jar that can be 
used in Android, but failed.
 >
 > Is there an official version that supports the use of Lucene on Android?
 >
 >
 > Thanks!
 > -
 > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 > For additional commands, e-mail: java-user-h...@lucene.apache.org
 >
 --
 Uwe Schindler
 Achterdiek 19, D-28357 Bremen
 
https://urldefense.com/v3/__https://www.thetaphi.de__;!!May37g!I0Gu25Y3BgTV3Vu1HySs6-3CFpW6BoaYKIsxiSeaohtNPkf00opY-hSY8XMqPJz990oyteqdryrf1cToSA$
 eMail: u...@thetaphi.de


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Can lucene be used in Android ?

2022-09-10 Thread Uwe Schindler

Hi Jie,

actually the Lucene 9.x series requires JDK 11 to run; previous versions 
also work with Java 8. The main branch is already on JDK 17. From my 
knowledge, you may only use Lucene versions up to 8 to have at least a 
chance to run it. But with older Android versions you may even need to go 
back to Lucene builds targeting JDK 7 (Lucene 5?, don't know).


But this is only half of the story: Lucene actually uses many many 
modern JDK and JVM features that are partly not implemented in Dalvik. 
It uses MethodHandles instead of reflection, and the Java 8+ versions use 
lambdas which were not compatible with older Android SDKs.


So in short: use an older version and hope, but we offer no support and are 
not keen to apply changes to Lucene so it can be used with Android at 
all - because Android is not really compatible to any Java spec like the API 
or memory model.


Uwe

Am 09.09.2022 um 09:10 schrieb Jie Wang:

Hey,

Recently, I am trying to compile the Lucene to get a jar that can be used in 
Android, but failed.

Is there an official version that supports the use of Lucene on Android?


Thanks!
-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to filter KnnVectorQuery with multiple terms?

2022-09-01 Thread Uwe Schindler

Simply said,

the last parameter of KnnVectorQuery is a Lucene query, so you can pass 
any query type there. TermInSetQuery is a good idea for doing an "IN 
multiple terms" query. But you can also pass a BooleanQuery with 
multiple terms or a combination of other queries, a numeric range, ... or 
a fulltext query out of Lucene's query parsers.
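For example, a minimal sketch of such a combined filter (the field names "classification", "year" and "vector" and the float[] queryVector are made up for illustration):

import org.apache.lucene.document.IntPoint;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.KnnVectorQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// documents must match the term AND the numeric range to be considered for KNN
Query filter = new BooleanQuery.Builder()
    .add(new TermQuery(new Term("classification", "news")), BooleanClause.Occur.FILTER)
    .add(IntPoint.newRangeQuery("year", 2020, 2024), BooleanClause.Occur.FILTER)
    .build();
Query query = new KnnVectorQuery("vector", queryVector, 10, filter);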


Uwe

Am 31.08.2022 um 22:19 schrieb Michael Wechner:

Hi Matt

Thanks very much for your feedback!

According to your links I will try

Collection<BytesRef> terms = new ArrayList<>();
terms.add(new BytesRef(classification1));
terms.add(new BytesRef(classification2));
Query filter = new TermInSetQuery(CLASSIFICATION_FIELD, terms);

query = new KnnVectorQuery(VECTOR_FIELD, queryVector, k, filter);

All the best

Michael



Am 31.08.22 um 20:24 schrieb Matt Davis:
If I understand correctly, I believe you would want to use a 
TermInSetQuery

query.  An example usage can be found here
https://github.com/zuliaio/zuliasearch/blob/main/zulia-server/src/main/java/io/zulia/server/index/ZuliaIndex.java#L398. 




You can also check out the usage of KnnVectorQuery here:
https://github.com/zuliaio/zuliasearch/blob/main/zulia-server/src/main/java/io/zulia/server/index/ZuliaIndex.java#L419 


noting that in this case the getPreFilter method a few lines below uses a
BooleanQuery.Builder.

As noted in TermsInSetQuery (
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java#L62) 


multiple terms could be represented as a boolean query with Occur.SHOULD.

~Matt

On Wed, Aug 31, 2022 at 11:15 AM Michael 
Wechner

wrote:


Hi

I am currently filtering a KnnVectorQuery as follows

Query filter = new TermQuery(new Term(CLASSIFICATION_FIELD, classification));
query = new KnnVectorQuery(VECTOR_FIELD, queryVector, k, filter);

but it is not clear to me how I can filter for multiple terms.

Should I subclass MultiTermQuery and use as filter, just as I use
TermQuery as filter above?

Thanks

Michael




--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: [ANNOUNCE] Issue migration Jira to GitHub starts on Monday, August 22

2022-08-24 Thread Uwe Schindler

Hi,

instead of a fix version I used the Milestone 9.4 to indicate the proposed 
fix version. Why do we need a label for this?


See what I did in an issue I am working on and the corresponding PR: 
https://github.com/apache/lucene/issues/11701 and 
https://github.com/apache/lucene/pull/11718


Uwe

Am 24.08.2022 um 21:26 schrieb Michael Sokolov:

Thanks! It seems to be working nicely.

Question about the fix-version: tagging. I wonder if going forward we
want to maintain that for new issues? I happened to notice there is also
this "milestone" feature in github -- does that seem like a place to
put version information?

On Wed, Aug 24, 2022 at 3:20 PM Tomoko Uchida
 wrote:



Issue migration has been completed (except for minor cleanups).
This is the Jira -> GitHub issue number mapping for possible future usage. 
https://github.com/apache/lucene-jira-archive/blob/main/migration/mappings-data/issue-map.csv.20220823_final

GitHub issue is now fully available for all issues.
For issue label management (e.g. "fix-version"), please review this manual.
https://github.com/apache/lucene/blob/main/dev-docs/github-issues-howto.md

Tomoko


2022年8月22日(月) 19:46 Michael McCandless :

Wooot!  Thank you so much Tomoko!!

Mike

On Mon, Aug 22, 2022 at 6:44 AM Tomoko Uchida  
wrote:



Issue migration has been started. Jira is now read-only.

GitHub issue is available for new issues.

- You should open new issues on GitHub. E.g. 
https://github.com/apache/lucene/issues/1078
- Do not touch issues that are in the middle of migration, please. E.g. 
https://github.com/apache/lucene/issues/1072
   - While you cannot break these issues, migration scripts can 
modify/overwrite your comments on the issues.
- Pull requests are not affected. You can open/update PRs as usual. Please let 
me know if you have any trouble with PRs.


Tomoko


2022年8月18日(木) 18:23 Tomoko Uchida :

Hello all,

The Lucene project decided to move our issue tracking system from Jira to 
GitHub and migrate all Jira issues to GitHub.

We start issue migration on Monday, August 22 at 8:00 UTC.
1) We make Jira read-only before migration. You cannot update existing issues 
until the migration is completed.
2) You can use GitHub for opening NEW issues or pull requests during migration.

Note that issues should be raised in Jira at this moment, although GitHub issue 
is already enabled in the Lucene repository.
Please do not raise issues in GitHub until we let you know that GitHub issue is 
officially available. We immediately close any issues on GitHub until then.

Here are the detailed plan/migration steps.
https://github.com/apache/lucene-jira-archive/issues/7

Tomoko

--
Mike McCandless

http://blog.mikemccandless.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Can I integrate Apache Lucene with Dovecot POP3/IMAP incoming mail server to perform indexing and fast searching of email messages?

2022-08-13 Thread Uwe Schindler

Hi,

there are two possibilities to use Lucene with Dovecot:

 * Use the (official) Dovecot-Solr plugin, which populates and queries
   index managed by Solr (called "fts_solr"):
   https://doc.dovecot.org/configuration_manual/fts/solr/
 * Use the alternative plugin "fts_elastic" (not shipped and officially
   supported by Dovecot), which uses elasticsearch as backend:
   https://github.com/filiphanes/fts-elastic

Both plugins work in combination with another very useful plugin, "fts_tika", 
which allows extracting the text from attachments using Apache Tika.


Uwe

Am 12.08.2022 um 08:43 schrieb Turritopsis Dohrnii Teo En Ming:

Subject: Can I integrate Apache Lucene with Dovecot POP3/IMAP incoming
mail server to perform indexing and fast searching of email messages?

Good day from Singapore,

I have a Virtual Private Server (VPS) in Germany running
Virtualmin/Webmin web hosting control panel. Virtualmin uses Dovecot
incoming mail server by default.

I used to have more than 200,000 email messages in my Inbox. At that
point in time, I cannot even search for any email message at all.
Dovecot will throw an error message. My workaround is to reduce the
number of email messages in my Inbox drastically by deleting USELESS
email messages.

But the number of email messages in my Inbox will continue to grow.
Can I integrate Apache Lucene with Dovecot to perform indexing and
fast searching of email messages? I have to cater for a time when my
Inbox will grow to millions of email messages.

Thank you.

Regards,

Mr. Turritopsis Dohrnii Teo En Ming
Targeted Individual in Singapore
12 Aug 2022 Fri
Blogs:
https://tdtemcerts.blogspot.com
https://tdtemcerts.wordpress.com

-
To unsubscribe, e-mail:java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail:java-user-h...@lucene.apache.org


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:u...@thetaphi.de


Re: Lucene Disable scoring

2022-07-11 Thread Uwe Schindler
No, that's the only way to do it. The function call does not cost 
overhead because it is optimized away by the runtime.
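A minimal sketch of the constant-score approach in plain Lucene (originalQuery and searcher are assumed; in Elasticsearch the rough equivalent is a constant_score query or a filter context):

import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

// every matching document gets the same constant score; no per-document scoring work
Query noScore = new ConstantScoreQuery(originalQuery);
TopDocs hits = searcher.search(noScore, 10);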


Uwe

Am 10.07.2022 um 11:34 schrieb Mohammad Kasaei:

Hello

I have a question. Is it possible to completely disable scoring in lucene?

Detailed description:
I have an index in elasticsearch and it contains big shards (every shard
about 500m docs) so a nano second of time spent on scoring every document
in any shard causes a few second delay in the query response.
I discovered that the most performant way to score documents is constant
score but the overhead of function calls can cause delay.
As a result I'm looking for a trick to ignore the function call and have
no scoring at all on my whole query.

Is it possible to ignore this step?

thanks a million


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Fuzzy Query Similarity

2022-07-09 Thread Uwe Schindler

Hi

FuzzyQuery/MultiTermQuery and I don't see any way to "boost" exact
matches, or even to incorporate the edit distance more generally into
the per-term score, although it does seem like that would be something
people would generally expect.


Actually it does this:

 * By default FuzzyQuery uses a rewrite method that expands all terms
   as should clauses into a boolean query:
   MultiTermQuery.TopTermsBlendedFreqScoringRewrite(maxExpansions)
 * TopTermsReqrite basically keeps track of a "boost" factor for each
   term and sorts the "best" terms in a PQ:
   
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/TopTermsRewrite.java#L109-L160
 * For each collected term the term enumeration sets a boost (1.0 for
   exact match):
   
https://github.com/apache/lucene/blob/dd4e8b82d711b8f665e91f0d74f159ef1e63939f/lucene/core/src/java/org/apache/lucene/search/FuzzyTermsEnum.java#L248-L256

So in short the exact term gets a boost factor of 1 in the resulting 
term query, all other terms a lower one.
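For reference, a minimal sketch (field name "body" and the searcher are assumed) showing where the expansion limit used by that rewrite is configured:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.TopDocs;

// maxEdits=2, prefixLength=0, maxExpansions=50, transpositions=true
FuzzyQuery query = new FuzzyQuery(new Term("body", "spark"), 2, 0, 50, true);
TopDocs hits = searcher.search(query, 10);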


Uwe

--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:u...@thetaphi.de


Re: Fuzzy Query Similarity

2022-07-09 Thread Uwe Schindler
The problem is that the query combines the native term query score (which 
depends on the length of the document and the term's statistics). The edit distance 
is also multiplied in. When the difference in term statistics is too 
large, the edit distance no longer matters. This is perfectly fine and 
also happens with other types of queries. When you have rare terms in 
small documents, those matches will always come up. This is also a 
problem if you for example boost cheaper products to the top.


If you are only interested in the edit distance, you should configure 
IndexSearcher to use BooleanSimilarity - in that case it will ignore the 
term statistics and disable norms on the field (during indexing or with 
a wrapper on the IndexReader): 
https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/similarities/BooleanSimilarity.html


You can do this only for a specific field: 
https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/similarities/PerFieldSimilarityWrapper.html
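A minimal sketch of the per-field variant (the field name "title" is made up; the searcher is assumed):

import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.BooleanSimilarity;
import org.apache.lucene.search.similarities.PerFieldSimilarityWrapper;
import org.apache.lucene.search.similarities.Similarity;

final class TitleBooleanSimilarity extends PerFieldSimilarityWrapper {
  private final Similarity defaultSim = new BM25Similarity();
  private final Similarity booleanSim = new BooleanSimilarity();

  @Override
  public Similarity get(String field) {
    // ignore term statistics only for the "title" field
    return "title".equals(field) ? booleanSim : defaultSim;
  }
}

searcher.setSimilarity(new TitleBooleanSimilarity());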


Uwe

Am 09.07.2022 um 14:08 schrieb Michael Sokolov:

I am no expert with this, but I got curious and looked at
FuzzyQuery/MultiTermQuery and I don't see any way to "boost" exact
matches, or even to incorporate the edit distance more generally into
the per-term score, although it does seem like that would be something
people would generally expect. So maybe FuzzyQuery should somehow do
that? But without changing it, you could also use a query that does it
explicitly; if you get a term "foo", you could maybe search for "foo
OR foo~" ?

On Fri, Jul 8, 2022 at 4:14 PM Mike Drob  wrote:

Hi folks,

I'm working with some fuzzy queries and trying my best to understand what
is the expected behaviour of the searcher. I'm not sure if this is a
similarity bug or an incorrect usage on my end.

The problem is when I do a fuzzy search for a term "spark~" then instead of
matching documents with spark first, it will match other documents that
have multiple other near terms like "spar" and "spars". I see this same
thing with both ClassicSimilarity and BM25.

This is from a much smaller (two document) index when I was trying to
isolate and reproduce the issue, but I see comparable behaviour with more
varied scoring on a much larger corpus. The two documents are:

addDoc("spark spark", writer); // exact match

addDoc("spar spars", writer); // multiple fuzzy terms

The non-zero edit distance terms get a slight down-boost, but it's not
enough to overcome their sum exceeding even the TF boost for the desired
document.

A full reproducible unit test is at
https://github.com/apache/lucene/commit/dbf8e788cd2c2a5e1852b8cee86cb21a792dc546

What is the recommended approach to get the document with exact term
matching for me again? I don't see an option to tweak the internal boost
provided by FuzzyQuery, that's one idea I had. Or is this a different
change that needs to be fixed at the lucene level rather than application
level?

Thanks,
Mike



More detail:


The first document with the field "spark spark" has a score explanation:

1.4054651 = sum of:
   1.4054651 = weight(field:spark in 0) [ClassicSimilarity], result of:
 1.4054651 = score(freq=2.0), product of:
   1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
 1 = docFreq, number of documents containing term
 2 = docCount, total number of documents with field
   1.4142135 = tf(freq=2.0), with freq of:
 2.0 = freq, occurrences of term within document
   0.70710677 = fieldNorm

And a document with the field "spar spars" comes in ever so slightly higher
at

1.5404116 = sum of:
   0.74536043 = weight(field:spar in 1) [ClassicSimilarity], result of:
 0.74536043 = score(freq=1.0), product of:
   0.75 = boost
   1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
 1 = docFreq, number of documents containing term
 2 = docCount, total number of documents with field
   1.0 = tf(freq=1.0), with freq of:
 1.0 = freq, occurrences of term within document
   0.70710677 = fieldNorm
   0.79505116 = weight(field:spars in 1) [ClassicSimilarity], result of:
 0.79505116 = score(freq=1.0), product of:
   0.8 = boost
   1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
 1 = docFreq, number of documents containing term
 2 = docCount, total number of documents with field
   1.0 = tf(freq=1.0), with freq of:
 1.0 = freq, occurrences of term within document
   0.70710677 = fieldNorm

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de



Re: Fwd: Finding out which fields matched the query

2022-06-27 Thread Uwe Schindler

Many of us already answered in the dev mailing list.

Uwe

Am 25.06.2022 um 05:19 schrieb Yichen Sun:

-- 转发的邮件 -
发件人: Yichen Sun 
日期:2022年6月25日 周六11:14
主题:Finding out which fields matched the query
收件人: , , <
java-user@lucene.apache.org>


Hello!

I'm an MSCS student from BU and learning to use Lucene. Recently I have been trying to
output the fields matched by a query. For example, for one document, there
are 10 fields and 2 of them match the query. I want to get the names of
these fields.

I have tried using the explain() method, getting the description and then applying a
regex. However, it costs so much time.

I wonder what is the efficient way to get the matched fields. Would you
please offer some help? Thank you so much!

Best regards,
Yichen Sun


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Regarding field cache

2022-06-08 Thread Uwe Schindler

Hi,

As mentioned before: since Lucene 6 there's no FieldCache in Lucene 
anymore, so this is the wrong mailing list to ask those questions. Apache 
Solr has its own implementation (a deprecated legacy copy of the old 
Lucene implementation). Solr uses a kind of SearcherManager, so the field 
cache gets updated after every soft-commit (which is what refreshing 
in SearcherManager does). After a refresh entries may disappear and 
new ones may appear. Also when you reload cores, all entries disappear.


As said before -- IMPORTANT: FieldCache entries inside the Solr Admin UI 
cache statistics are a sign of a bad index configuration! Watch the 
video from 2012 where this was discussed for the first time. To fix 
this, make a list of all field names appearing there and change those 
fields in your schema to docValues=true (+ reindex or force-merge after 
the change). After that the FieldCache stats will be empty. FieldCache 
is a legacy  mechanism and is no longer supported so you should really 
get rid of it.


Uwe

Am 08.06.2022 um 20:18 schrieb Poorna Murali:

Thanks Uwe for the details.  In our solr (8.4)configuration , we have a
fieldcache that has the fields used for sorting. It can be observed that
the fieldCache is getting cleared sometimes. But, I do not think we have
the below mentioned search manager logic implemented in our setup. We have
not modified any solr/lucene implementation.

So, without opening or refreshing a searcher, I am not able to understand
how the field cache is getting cleared.
Can you please help to clarify this.

On 2022/06/08 17:46:50 Uwe Schindler wrote:

Hi,

You do not necessarily need a commit. If you use SearcherManager in
combination with NRTCachingDirectory you can also refresh your searcher
every few seconds, so in-memory cached segments are searched. But in
short: If you do not explicitly ask for a fresh searcher, there won't be
any automatic refreshes and the cache stays as is.

It is also important for older Lucene versions that still support
FieldCache: Make sure your queries work "per segment" and not globally.
Because wrongly written applications would have a cache entry per index
and not per segment, so on every refresh the whole cache has to be
rebuilt. This is also one reason why DocValues are preferred for sorting.



https://2012.berlinbuzzwords.de/sessions/your-index-reader-really-atomic-or-maybe-slow.html

https://www.youtube.com/watch?v=iZZ1AbJ6dik

Uwe

Am 08.06.2022 um 19:05 schrieb Poorna Murali:

Thanks Uwe! New searcher opens when we do a commit.Apart from this, are
there other scenarios where a searcher would be refreshed?

On 2022/06/08 16:43:07 Uwe Schindler wrote:

Hi,

They get evicted when the segment of that index is closed. After that
theres no reference to them anymore through a
WeakHashMap and thecache object gets freed by GC.
This happens on refresh of searcher where unused segments are closed

and

new ones are openend. There is no way to get rid of entries on a live
searcher.

FieldCache is no longer available since Lucene 6, so which version are
you using? Since Lucene 4 it is better to use DocValues fields for
sorting or facetting/aggregations.

If you are using Solr, theres still a clone of FieldCache as part of
Solr's codebase (and is not supported by Lucene anymore), but thats

only

for legacy indexes where the schema was not updated to use DocValues.

In

an "ideally configured Solr server", the Admin UI shows no entries

below

Core's FieldCache stats. If you see entries there go and replace those
field's config by adding docvalues=true.

Uwe

Am 08.06.2022 um 15:26 schrieb Poorna Murali:

Hi,

I would like to know if there is any automatic eviction policy for the
field cache entries. I understand that it gets invalidated when a new
searcher opens. But, my question is in case if gc runs or if there is

any

other scenario which could evict the unused entries from fieldcache.

Please help to clarify the same.

Thanks
Poorna


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Regarding field cache

2022-06-08 Thread Uwe Schindler

Hi,

You do not necessarily need a commit. If you use SearcherManager in 
combination with NRTCachingDirectory you can also refresh your searcher 
every few seconds, so in-memory cached segments are searched. But in 
short: If you do not explicitly ask for a fresh searcher, there won't be 
any automatic refreshes and the cache stays as is.
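A minimal sketch of that refresh pattern (not from the original mail; paths and cache sizes are placeholders):

import java.nio.file.Paths;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.NRTCachingDirectory;

Directory dir = new NRTCachingDirectory(FSDirectory.open(Paths.get("/path/to/index")), 5.0, 60.0);
IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig());
SearcherManager manager = new SearcherManager(writer, new SearcherFactory());

// call this periodically (e.g. every few seconds) to pick up new segments
manager.maybeRefresh();

IndexSearcher searcher = manager.acquire();
try {
  // run queries with the acquired searcher
} finally {
  manager.release(searcher);
}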


It is also important for older Lucene versions that still support 
FieldCache: Make sure your queries work "per segment" and not globally. 
Because wrongly written applications would have a cache entry per index 
and not per segment, so on every refresh the whole cache has to be 
rebuilt. This is also one reason why DocValues are preferred for sorting.


https://2012.berlinbuzzwords.de/sessions/your-index-reader-really-atomic-or-maybe-slow.html

https://www.youtube.com/watch?v=iZZ1AbJ6dik

Uwe

Am 08.06.2022 um 19:05 schrieb Poorna Murali:

Thanks Uwe! New searcher opens when we do a commit.Apart from this, are
there other scenarios where a searcher would be refreshed?

On 2022/06/08 16:43:07 Uwe Schindler wrote:

Hi,

They get evicted when the segment of that index is closed. After that
theres no reference to them anymore through a
WeakHashMap and thecache object gets freed by GC.
This happens on refresh of searcher where unused segments are closed and
new ones are openend. There is no way to get rid of entries on a live
searcher.

FieldCache is no longer available since Lucene 6, so which version are
you using? Since Lucene 4 it is better to use DocValues fields for
sorting or facetting/aggregations.

If you are using Solr, theres still a clone of FieldCache as part of
Solr's codebase (and is not supported by Lucene anymore), but thats only
for legacy indexes where the schema was not updated to use DocValues. In
an "ideally configured Solr server", the Admin UI shows no entries below
Core's FieldCache stats. If you see entries there go and replace those
field's config by adding docvalues=true.

Uwe

Am 08.06.2022 um 15:26 schrieb Poorna Murali:

Hi,

I would like to know if there is any automatic eviction policy for the
field cache entries. I understand that it gets invalidated when a new
searcher opens. But, my question is in case if gc runs or if there is

any

other scenario which could evict the unused entries from fieldcache.

Please help to clarify the same.

Thanks
Poorna


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Regarding field cache

2022-06-08 Thread Uwe Schindler

Hi,

They get evicted when the segment of that index is closed. After that 
there's no reference to them anymore through a 
WeakHashMap and the cache object gets freed by GC. 
This happens on refresh of the searcher, where unused segments are closed and 
new ones are opened. There is no way to get rid of entries on a live 
searcher.


FieldCache is no longer available since Lucene 6, so which version are 
you using? Since Lucene 4 it is better to use DocValues fields for 
sorting or faceting/aggregations.
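
For illustration, a minimal sketch of what that looks like (field names are 
made up):

// index time: add the sort key as a DocValues field
doc.add(new NumericDocValuesField("price", 42L));
doc.add(new SortedDocValuesField("category", new BytesRef("books")));

// search time: sort on the DocValues field, no FieldCache involved
Sort sort = new Sort(new SortField("price", SortField.Type.LONG));
TopDocs hits = searcher.search(query, 10, sort);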


If you are using Solr, there's still a clone of FieldCache as part of 
Solr's codebase (which is not supported by Lucene anymore), but that's only 
for legacy indexes where the schema was not updated to use DocValues. In 
an "ideally configured Solr server", the Admin UI shows no entries below 
Core's FieldCache stats. If you see entries there, go and replace those 
fields' config by adding docValues=true.


Uwe

On 08.06.2022 at 15:26, Poorna Murali wrote:

Hi,

I would like to know if there is any automatic eviction policy for the
field cache entries. I understand that it gets invalidated when a new
searcher opens. But, my question is in case if gc runs or if there is any
other scenario which could evict the unused entries from fieldcache.

Please help to clarify the same.

Thanks
Poorna


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Index corruption and repair

2022-05-05 Thread Uwe Schindler
) - without issues.
>>>>> > >
>>>>> > > One thing to mention is that our earlier version used Python 2.7.15
>>>>> (with
>>>>> > > pylucene 4.10) and now we are using Python 3.8.10 with Pylucene
>>>>> 6.5.0 -
>>>>> > the
>>>>> > > indexing logic is the same...
>>>>> > >
>>>>> > > One other thing to note is that the issue described has (so far!)
>>>>> only
>>>>> > > occurred on MS Windows - none of our Linux customers have complained
>>>>> > about
>>>>> > > this.
>>>>> > >
>>>>> > > Any ideas?
>>>>> > >
>>>>> > > Regards,
>>>>> > > Antony
>>>>> > >
>>>>> > > On Thu, 28 Apr 2022 at 17:00, Adrien Grand 
>>>>> wrote:
>>>>> > >
>>>>> > > > Hi Anthony,
>>>>> > > >
>>>>> > > > This isn't something that you should try to fix programmatically,
>>>>> > > > corruptions indicate that something is wrong with the environment,
>>>>> > > > like a broken disk or corrupt RAM. I would suggest running a
>>>>> memtest
>>>>> > > > to check your RAM and looking at system logs in case they have
>>>>> > > > anything to tell about your disks.
>>>>> > > >
>>>>> > > > Can you also share the full stack trace of the exception?
>>>>> > > >
>>>>> > > > On Thu, Apr 28, 2022 at 10:26 AM Antony Joseph
>>>>> > > >  wrote:
>>>>> > > > >
>>>>> > > > > Hello,
>>>>> > > > >
>>>>> > > > > We are facing a strange situation in our application as
>>>>> described
>>>>> > below:
>>>>> > > > >
>>>>> > > > > *Using*:
>>>>> > > > >
>>>>> > > > >    - Python 3.8.10
>>>>> > > > >- Pylucene 6.5.0
>>>>> > > > >- Java 8 (1.8.0_181)
>>>>> > > > >- Runs on Linux and Windows (error seen on Windows)
>>>>> > > > >
>>>>> > > > > We suddenly get the following *error*:
>>>>> > > > >
>>>>> > > > > 2022-02-10 09:58:09.253215: ERROR : writer | Failed to get index
>>>>> > > > > (D:\i\202202) writer, Exception:
>>>>> > > > > org.apache.lucene.index.CorruptIndexException: Unexpected file
>>>>> read
>>>>> > error
>>>>> > > > > while reading index.
>>>>> > > > >
>>>>> > > >
>>>>> >
>>>>> (resource=BufferedChecksumIndexInput(MMapIndexInput(path="D:\i\202202\segments_fo")))
>>>>> > > > >
>>>>> > > > >
>>>>> > > > > After this, no further indexing happens - trying to open the
>>>>> index
>>>>> > for
>>>>> > > > > writing throws the above error - and the index writer does not
>>>>> open.
>>>>> > > > >
>>>>> > > > > FYI, our code contains the following *settings*:
>>>>> > > > >
>>>>> > > > > index_path = "D:\i\202202"
>>>>> > > > > index_directory = FSDirectory.open(Paths.get(index_path))
>>>>> > > > > iconfig = IndexWriterConfig(wrapper_analyzer)
>>>>> > > > > iconfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND)
>>>>> > > > > iconfig.setRAMBufferSizeMB(16.0)
>>>>> > > > > writer = IndexWriter(index_directory, iconfig)
>>>>> > > > >
>>>>> > > > >
>>>>> > > > > *Repairing*
>>>>> > > > > We tried 'repairing' the index with the following command /
>>>>> tool:
>>>>> > > > >
>>>>> > > > > java -cp lucene-core-6.5.0.jar:lucene-backward-codecs-6.5.0.jar
>>>>> > > > > org.apache.lucene.index.CheckIndex "D:\i\202202" -exorcise
>>>>> > > > >
>>>>> > > > > This however returns saying "No problems found with the index."
>>>>> > > > >
>>>>> > > > >
>>>>> > > > > *Work around*
>>>>> > > > > We have to manually delete the problematic segment file:
>>>>> > > > > D:\i\202202\segments_fo
>>>>> > > > > after which the application starts again... until the next
>>>>> > corruption. We
>>>>> > > > > can't spot a specific pattern.
>>>>> > > > >
>>>>> > > > >
>>>>> > > > > *Two questions:*
>>>>> > > > >
>>>>> > > > >1. Can we handle this situation programmatically, so that no
>>>>> > manual
>>>>> > > > >intervention is needed?
>>>>> > > > >2. Any reason why we are facing the corruption issue in the
>>>>> first
>>>>> > > > place?
>>>>> > > > >
>>>>> > > > >
>>>>> > > > > Before this we were using Pylucene 4.10 and we didn't face this
>>>>> > problem -
>>>>> > > > > the application logic is the same.
>>>>> > > > >
>>>>> > > > > Also, while the application runs on both Linux and Windows, so
>>>>> far we
>>>>> > > > have
>>>>> > > > > observed this situation only on various Windows platforms.
>>>>> > > > >
>>>>> > > > > Would really appreciate some assistance. Thanks in advance.
>>>>> > > > >
>>>>> > > > > Regards,
>>>>> > > > > Antony
>>>>> > > >
>>>>> > > >
>>>>> > > >
>>>>> > > > --
>>>>> > > > Adrien
>>>>> > > >
>>>>> > > >
>>>>> -
>>>>> > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>> > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>> > > >
>>>>> > > >
>>>>> >
>>>>> > -
>>>>> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>> > For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>> >
>>>>> >
>>>>>
>>>>

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

RE: Returning large resultset is slow and resource intensive

2022-03-08 Thread Uwe Schindler
Hi,

> For our use case, we need to run queries which return the full
> matched result set. In some cases, this result set can be large (50k+
> results out of 4 million total documents).
> Perf test showed that just 4 threads running random queries returning 50k
> results make Lucene utilize 100% CPU on a 4-core machine (profiler
> screenshot
>  c2e4-45b6-b98d-b7622b6ac801.png>).

This screenshot shows the problem: The search methods returning TopDocs (or 
TopFieldDocs) should never ever be used to retrieve a larger amount of results 
or ALL results. This is called the "deep paging" problem. Lucene cannot easily 
return "paged" results starting at a specific result page, it has to score all 
results and insert them into a priority queue - this does not scale well 
because the priority queue approach is made for quickly getting top-ranking 
results. So to get all results, don't call those methods.


If you just want to get all results then you should write your own collector 
(single threaded as subclass of SimpleCollector, an alternative is 
CollectorManager for multithreaded search with a separate "reduce" step to 
merge results of each index segment) that just retrieves document ids and 
processes them. If you don't need the score, don't call the scoring methods in 
the Scorable.

For this you have to create a subclass of SimpleCollector (and 
CollectorManager, if needed) and implement its methods, which are called by the 
query internals as a kind of "notification" about which index segment you are on 
and which result *relative* to this index segment you got. Important things 
(see the sketch after this list):
- you get notified about new segments using SimpleCollector#doSetNextReader. 
Save the context in a local field of the collector for later usage.
- if you need the scores, also implement SimpleCollector#setScorer().
- for each search hit of the reader passed in the previous call you get the 
SimpleCollector#collect() method called. Use the document id passed in and 
resolve it using the leaf reader to the actual document and its fields/doc 
values. To get the score, ask the Scorable from the previous call.
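
A bare-bones sketch of such a collector, which only gathers the matching doc 
ids (resolving stored fields or doc values per segment would go into collect() 
as described above):

List<Integer> allDocs = new ArrayList<>();
searcher.search(query, new SimpleCollector() {
  private int docBase;

  @Override
  protected void doSetNextReader(LeafReaderContext context) {
    docBase = context.docBase; // remember where the current segment starts
  }

  @Override
  public void collect(int doc) {
    allDocs.add(docBase + doc); // doc is relative to the current segment
  }

  @Override
  public ScoreMode scoreMode() {
    return ScoreMode.COMPLETE_NO_SCORES; // no scores needed here
  }
});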

Another approach is to use searchAfter with smaller windows, but for getting 
all results this is still slower as a priority queue has to be managed, too 
(just smaller ones).

> The query is very simple and contains only a single-term filter clause, all
> unrelated parts of the application are disabled, no stored fields are
> fetched, GC is doing minimal amount of work
>  41c1-4af1-afcf-37d0c5f86054.png>

Lucene never uses much heap space, so GC should always be low.

Uwe


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Migration from Lucene 5.5 to 8.11.1

2022-01-17 Thread Uwe Schindler
"*initially* created with 6.x".

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: András Péteri 
> Sent: Thursday, January 13, 2022 9:59 AM
> To: java-user@lucene.apache.org
> Subject: Re: Migration from Lucene 5.5 to 8.11.1
> 
> It looks like Sascha runs IndexUpgrader for all major versions, ie. 6.6.6,
> 7.7.3 and 8.11.1. File "segments_91" is written by the 7.7.3 run
> immediately before the error.
> 
> On Wed, Jan 12, 2022 at 3:44 PM Adrien Grand  wrote:
> 
> > The log says what the problem is: version 8.11.1 cannot read indices
> > created by Lucene 5.5, you will need to reindex your data.
> >
> > On Wed, Jan 12, 2022 at 3:41 PM  wrote:
> > >
> > >
> > > -
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
> >
> > --
> > Adrien
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
> --
> András


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: migration from lucene 5 to 8

2022-01-17 Thread Uwe Schindler
Hi,

no that's expected. See my other post as response to another question a minute 
ago.

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Sascha Janz 
> Sent: Wednesday, January 12, 2022 4:21 PM
> To: java-user@lucene.apache.org
> Subject: migration from lucene 5 to 8
> 
> Hello,
> 
> the body from my previous mail was filtered out...
> 
> 
> we need to migrate our lucene 5.5 indexes to version 8.11.1. fortunately i
> found the IndexUpgrader class which i didn't know yet.
> 
> i tried to migrate from major version to major version.
> 
> so i did
> 
> java -cp lucene-core-6.6.6.jar;lucene-backward-codecs-6.6.6.jar
> org.apache.lucene.index.IndexUpgrader -delete-prior-commits -verbose
> "V:\\LuceneMigration\\5"
> 
> next step
> 
> java -cp lucene-core-7.7.3.jar;lucene-backward-codecs-7.7.3.jar
> org.apache.lucene.index.IndexUpgrader -delete-prior-commits -verbose
> "V:\\LuceneMigration\\5"
> 
> and then
> 
> java -cp lucene-core-8.11.1.jar;lucene-backward-codecs-8.11.1.jar
> org.apache.lucene.index.IndexUpgrader -delete-prior-commits -verbose
> "V:\\LuceneMigration\\5"
> 
> the first two seems to work well.
> 
> but with the last i get
> 
> MS 0 [2022-01-12T14:04:24.248Z; main]: initDynamicDefaults spins=true
> maxThreadCount=1 maxMergeCount=6
> IW 0 [2022-01-12T14:04:24.275Z; main]: init: hit exception on init; releasing
> write lock
> Exception in thread "main"
> org.apache.lucene.index.IndexFormatTooOldException: Format version is not
> supported (resource
> BufferedChecksumIndexInput(MMapIndexInput(path="V:\LuceneMigration\5\s
> egments_91"))): This index was initially created with Lucene 6.x while the
> current version is 8.11.1 and Lucene only supports reading the current and
> previous major versions.. This version of Lucene only supports indexes created
> with release 7.0 and later.
> at
> org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:322)
> at
> org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:291)
> at org.apache.lucene.index.IndexWriter.(IndexWriter.java:1037)
> at
> org.apache.lucene.index.IndexUpgrader.upgrade(IndexUpgrader.java:167)
> at org.apache.lucene.index.IndexUpgrader.main(IndexUpgrader.java:78)
> Suppressed: org.apache.lucene.index.CorruptIndexException: checksum
> passed (7268b2f2). possibly transient resource issue, or a Lucene or JVM bug
> (resource=BufferedChecksumIndexInput(MMapIndexInput(path="V:\LuceneMig
> ration\5\segments_91")))
> at
> org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:466)
> at
> org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:434)
> ... 4 more
> 
> did i anything wrong?
> 
> 
> 
> thanks for help.
> 
> regards
> 
> Sascha
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Moving from lucene 6.x to 8.x

2022-01-17 Thread Uwe Schindler
By the way
> Hi, one thing that always works to "forcefully" upgrade without reindexing. 
> You
> just merge the old index into a completely new index not by coping files, but 
> by
> sending their SegmentReaders to addIndex, stripping all metadata from them
> with some trick:
> https://lucene.apache.org/core/8_11_0/core/org/apache/lucene/index/SlowCo
> decReaderWrapper.html in combination with
>  Writer.html#addIndexes-org.apache.lucene.index.CodecReader...->
> 
> One way to do this is the following:
> - Open old index using DirectoryReader.open(): reader =
> DirectoryReader.open(...old directory...)
> - Create a new Index with IndexWriter writer: writer = new IndedxWriter(...new
> directory...)
> - Call
> writer.addIndexes(reader.leaves().stream().map(IndexReaderContext::reader).
> map(SlowCodecReaderWrapper::wrap).toArray(CodecReader[]::new));

This trick also works if you want to transform indexes. I wrote some code that 
on-the-fly rewrites old NumericField to PointField. The trick is to add another 
FilterLeafReader (before wrapping with SlowCodecReaderWrapper) that detects 
legacy numeric fields, removes them from the metadata and feeds them as a new 
stream of flat BKD points enumerated by the TermsEnum (which works because the 
order is the same and the hierarchy is generated by the receiving IndexWriter) 
to a new field with PointField metadata. This is a bit hacky but works great.

Uwe


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Moving from lucene 6.x to 8.x

2022-01-17 Thread Uwe Schindler
Hi, here is one thing that always works to "forcefully" upgrade without reindexing: You 
just merge the old index into a completely new index, not by copying files, but 
by sending its SegmentReaders to addIndexes, stripping all metadata from them 
with some trick: 
https://lucene.apache.org/core/8_11_0/core/org/apache/lucene/index/SlowCodecReaderWrapper.html
 in combination with 
<https://lucene.apache.org/core/8_11_0/core/org/apache/lucene/index/IndexWriter.html#addIndexes-org.apache.lucene.index.CodecReader...->
 

One way to do this is the following:
- Open old index using DirectoryReader.open(): reader = 
DirectoryReader.open(...old directory...)
- Create a new index with an IndexWriter: writer = new IndexWriter(...new 
directory...)
- Call 
writer.addIndexes(reader.leaves().stream().map(IndexReaderContext::reader).map(SlowCodecReaderWrapper::wrap).toArray(CodecReader[]::new));

This will add all segments from the old index logically (not reading plain 
files but using the logical layers on top) and add them to the current index as 
one large segment. If you want to keep the segment structure, then iterate over 
the leaves and call addIndexes() for each one separately.
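
Putting the steps together, a rough sketch (paths are placeholders; a plain 
loop is used instead of the stream one-liner, so the checked IOException from 
SlowCodecReaderWrapper.wrap() is not an issue inside a lambda):

try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(oldIndexPath));
     IndexWriter writer = new IndexWriter(FSDirectory.open(newIndexPath), new IndexWriterConfig())) {
  List<CodecReader> codecReaders = new ArrayList<>();
  for (LeafReaderContext ctx : reader.leaves()) {
    codecReaders.add(SlowCodecReaderWrapper.wrap(ctx.reader())); // strips the old metadata
  }
  writer.addIndexes(codecReaders.toArray(new CodecReader[0])); // rewrites everything as one new segment
  writer.commit();
}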

This may be a bit slower as the whole index needs to be processed, but it is 
still faster than reindexing. If you have incorrect offsets, the process will 
fail, so there's no risk.

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Rahul Goswami 
> Sent: Wednesday, January 12, 2022 6:36 AM
> To: java-user@lucene.apache.org
> Subject: Re: Moving from lucene 6.x to 8.x
> 
> Thanks Vinay for the link to Erick's talk! I hadn't seen it and I must
> admit it did help put a few things into perspective.
> 
> I was able to track down the JIRAs (thank you 'git blame')
> surrounding/leading up to this architectural decision and the linked
> patches:
> https://issues.apache.org/jira/browse/LUCENE-7703  (Record the version that
> was used at index creation time)
> https://issues.apache.org/jira/browse/LUCENE-7730  (Better encode length
> normalization in similarities)
> https://issues.apache.org/jira/browse/LUCENE-7837  (Use
> indexCreatedVersionMajor to fail opening too old indices)
> 
> From these JIRAs what I was able to piece together is that if not
> reindexed, relevance scoring might act in unpredictable ways. For my use
> case, I can live with that since we provide an explicit sort on one or more
> fields.
> 
> In LUCENE-7703, Adrien says "we will reject broken offsets in term vectors
> as of 7.0". So my questions to the community are
> i) What are these offsets, and what feature/s might break with respect to
> these offsets if not reindexed?
> ii) Do the length normalization changes in  LUCENE-7730 affect only
> relevance scores?
> 
> I understand I could be playing with fire here, but reindexing is not a
> practical solution for my situation. At least not in the near future until
> I figure out a more seamless way of reindexing with minimal downtime given
> that there are multiple 1TB+ indexes. Would appreciate inputs from the dev
> community on this.
> 
> Thanks,
> Rahul
> 
> On Sun, Jan 9, 2022 at 2:41 PM Vinay Rajput 
> wrote:
> 
> > Hi Rahul,
> >
> > I am not an expert so someone else might provide a better answer. However,
> > I remember
> > @Erick briefly talked about this restriction in one of his talks here:-
> > https://www.youtube.com/watch?v=eaQBH_H3d3g=621s (not sure if you
> have
> > seen it already).
> >
> > As he explains, earlier it looked like IndexUpgrader tool was doing the job
> > perfectly but it wasn't always the case. There is no guarantee that after
> > using the IndexUpgrader tool, your 8.x index will keep all of the
> > characteristics of lucene 8. There can be some situations (e.g. incorrect
> > offset) where you might get an incorrect relevance score which might be
> > difficult to trace and debug. So, Lucene developers now made it explicit
> > that what people were doing earlier was not ideal, and they should now plan
> > to reindex all the documents during the major upgrade.
> >
> > Having said that, what you have done can just work without any issue as
> > long as you don't encounter any odd sorting behavior. This may/may not be
> > super critical depending on the business use case and that is where you
> > might need to make a decision.
> >
> > Thanks,
> > Vinay
> >
> > On Sat, Jan 8, 2022 at 10:27 PM Rahul Goswami 
> > wrote:
> >
> > > Hello,
> > > Would appreciate any insights on the issue.Are there any backward
> > > incompatible changes in 8.x index because of which the

Re: Log4j

2021-12-15 Thread Uwe Schindler
Hi,

It only has an abstract logging interface inside IndexWriter to track actions 
done during indexing. But the implementation of that is up to the application. By 
default you can only redirect it to a file or stdout, if needed.

All other APIs log nothing.

Uwe

On 15 December 2021 21:58:59 UTC, Ali Akhtar wrote:
>Does Lucene not have any internal logging at all, e.g for debugging?
>
>On Thu, Dec 16, 2021 at 2:49 AM Uwe Schindler  wrote:
>
>> Hi,
>>
>> Lucene is an API and does not log with log4j.
>>
>> Only the user interface Luke uses log4j, but this one does not do any
>> networking. So unless user of Luke enters jndi expressions nothing can
>> happen. 
>>
>> Uwe
>>
>> On 15 December 2021 21:41:37 UTC, Baris Kazar <
>> baris.ka...@oracle.com> wrote:
>> >Hi Folks,-
>> > Lucene is not affected by the latest bug, right?
>> >I saw on Solr News page there are some fixes already made to Solr.
>> >Best regards
>>
>> --
>> Uwe Schindler
>> Achterdiek 19, 28357 Bremen
>> https://www.thetaphi.de

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

Re: Log4j

2021-12-15 Thread Uwe Schindler
Hi,

Lucene is an API and does not log with log4j.

Only the user interface Luke uses log4j, but this one does not do any 
networking. So unless user of Luke enters jndi expressions nothing can happen. 

Uwe

On 15 December 2021 21:41:37 UTC, Baris Kazar wrote:
>Hi Folks,-
> Lucene is not affected by the latest bug, right?
>I saw on Solr News page there are some fixes already made to Solr.
>Best regards

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

RE: Java 17 and Lucene

2021-10-26 Thread Uwe Schindler
Hi,

> Is this recommended "-XX:+UseZGC options to enable ZGC." as it claims very
> low pauses.

You may have seen my previous post that JDK 16, 17 and 18 have hangs on our 
build server. All of those hanging builds have one thing in common: They are 
running with ZGC. So my answer in short: Don't use ZGC, which is anyway not a 
good idea with Lucene. It reduces pauses, but on the other hand reduces 
throughput by >10%. So IMHO, better use G1GC and get higher throughput. With 
G1GC the average pauses are limited, too. But I would say, with common 
workloads it is better to have 10% faster queries and maybe have some of them 
wait 200 ms because of a pause!? If you have multiple replicas, just distribute 
your queries and the pause will not really be visible to many people. And: Why 
is a 200 ms response time bad if it happens only seldom?

In addition: Lucene does not apply pressure to the garbage collector, so use a 
low heap size and use doc values and other off-heap features of Lucene. Anybody 
running Lucene/Solr/Elasticsearch with a huge heap does something wrong!

Uwe

> For "*DY* (2021-10-19 08:14:33): Upgrade to JDK17+35" execution for
> "Indexing
> throughput
> <https://home.apache.org/~mikemccand/lucenebench/indexing.html>"
> is ZGC used for the "Indexing throughput
> <https://home.apache.org/~mikemccand/lucenebench/indexing.html>" test?
> 
> 
> On Wed, Oct 20, 2021 at 8:27 AM Michael McCandless <
> luc...@mikemccandless.com> wrote:
> 
> > Nightly benchmarks managed to succeed (once, so far) on JDK 17:
> > https://home.apache.org/~mikemccand/lucenebench/
> >
> > No obvious performance changes on quick look.
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Tue, Oct 19, 2021 at 8:42 PM Gautam Worah
> 
> > wrote:
> >
> > > Thanks for the note of caution Uwe.
> > >
> > > > On our Jenkins server running with AMD Ryzen CPU it happens quite often
> > > that JDK 16, JDK 17 and JDK 18 hang during tests and stay unkillable
> > (only
> > > a hard kill with" kill -9")
> > >
> > > Scary stuff.
> > > I'll try to reproduce the hang first and then try to get the JVM logs.
> > I'll
> > > respond back here if I find something useful.
> > >
> > > > Do you get this error in lucene:core:ecjLintMain and not during
> > compile?
> > > Then this is https://issues.apache.org/jira/browse/LUCENE-10185, solved
> > > already.
> > >
> > > Ahh. I should've been clearer with my comment. The error we see is
> > because
> > > we have forked the class and have modified it a bit.
> > > I just assumed that the upstream Lucene package would've also gotten
> > errors
> > > on the JDK17 build because it was untouched.
> > >
> > > -
> > > Gautam Worah.
> > >
> > >
> > > On Tue, Oct 19, 2021 at 5:07 AM Michael Sokolov 
> > > wrote:
> > >
> > > > > I would a bit careful: On our Jenkins server running with AMD Ryzen
> > CPU
> > > > it happens quite often that JDK 16, JDK 17 and JDK 18 hang during tests
> > > and
> > > > stay unkillable (only a hard kill with" kill -9"). Previous Java
> > versions
> > > > don't hang. It happens not all the time (about 1/4th of all builds) and
> > > due
> > > > to the fact that the JVM is unresponsible it is not possible to get a
> > > stack
> > > > trace with "jstack". If you know a way to get the stack trace, I'd
> > happy
> > > to
> > > > get help.
> > > >
> > > > ooh that sounds scary. I suppose one could maybe get core dumps using
> > > > the right signal and debug that way? Oh wait you said only 9 works,
> > > > darn! How about attaching using gdb? Do we maintain GC logs for these
> > > > Jenkins builds? Maybe something suspicious would show up there.
> > > >
> > > > By the way the JDK is absolutely "responsible" in this situation! Not
> > > > responsive maybe ...
> > > >
> > > > On Tue, Oct 19, 2021 at 4:46 AM Uwe Schindler 
> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > > Hey,
> > > > > >
> > > > > > Our team at Amazon Product Search recently ran our internal
> > > benchmarks
> > > > with
> > > > > > JDK 17.
> > > > > > We saw a ~5% increase in throughput and are in the process

RE: Java 17 and Lucene

2021-10-19 Thread Uwe Schindler
Hi,

> > On a side note, the Lucene codebase still uses the deprecated (as of
> > JDK17) AccessController
> > in the RamUsageEstimator class.
> > We suppressed the warning for now (based on recommendations
> >  >
> dev/202106.mbox/%3CJIRA.13369440.1617476525000.615331.16239514800
> > 5...@atlassian.jira%3E>
> > from the Apache Derby mailing list).
> 
> This should not be an issue, because we compile Lucene with javac parameter
> "--release 11", so it won't show any warning that you need to suppress. Looks
> like your build system at Amazon is not the original one by Lucene's Gradle,
> which shows no warnings at all.

Do you get this error in lucene:core:ecjLintMain and not during compile? Then 
this is https://issues.apache.org/jira/browse/LUCENE-10185, solved already. 
This problem did not happen in Lucene builds because we run Gradle with JDK 11 
(JAVA_HOME) and only compile with JDK 17 (passed as RUNTIME_JAVA_HOME).

Uwe

> Uwe
> 
> > Gautam Worah.
> >
> >
> > On Mon, Oct 18, 2021 at 3:02 PM Michael McCandless <
> > luc...@mikemccandless.com> wrote:
> >
> > > Also, I try to semi-aggressively upgrade Lucene's nightly benchmarks to 
> > > new
> > > JDK releases and leave an annotation on the nightly charts:
> > > https://home.apache.org/~mikemccand/lucenebench/
> > >
> > > I just now upgraded to JDK 17 and kicked off a new benchmark run ... in a
> > > few hours it should show the new data points and then I'll try to remember
> > > to annotate it tomorrow.
> > >
> > > So let's see whether nightly benchmarks uncover any performance changes
> > > from JDK17 :)
> > >
> > > Mike McCandless
> > >
> > > http://blog.mikemccandless.com
> > >
> > >
> > > On Mon, Oct 18, 2021 at 5:36 PM Robert Muir  wrote:
> > >
> > > > We test different releases on different platforms (e.g. Linux, Windows,
> > > > Mac).
> > > > We also test EA (Early Access) releases of openjdk versions during the
> > > > development process.
> > > > This finds bugs before they get released.
> > > >
> > > > More information about versions/EA testing: https://jenkins.thetaphi.de/
> > > >
> > > > On Mon, Oct 18, 2021 at 5:33 PM Kevin Rosendahl
> > > >  wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > We are using Lucene 8 and planning to upgrade from Java 11 to Java 17.
> > > We
> > > > > are curious:
> > > > >
> > > > >- How lucene is testing against java versions. Are there 
> > > > > correctness
> > > > and
> > > > >performance tests using java 17?
> > > > >   - Additionally, besides Java 17, how are new Java releases
> > > tested?
> > > > >- Are there any other orgs using Java 17 with Lucene?
> > > > >- Any other considerations we should be aware of?
> > > > >
> > > > >
> > > > > Best,
> > > > > Kevin Rosendahl
> > > >
> > > > -
> > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > > >
> > > >
> > >
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Java 17 and Lucene

2021-10-19 Thread Uwe Schindler
Hi,

> Hey,
> 
> Our team at Amazon Product Search recently ran our internal benchmarks with
> JDK 17.
> We saw a ~5% increase in throughput and are in the process of
> experimenting/enabling it in production.
> We also plan to test the new Corretto Generational Shenandoah GC.

I would be a bit careful: On our Jenkins server running with an AMD Ryzen CPU it 
happens quite often that JDK 16, JDK 17 and JDK 18 hang during tests and stay 
unkillable (only a hard kill with "kill -9" works). Previous Java versions don't 
hang. It does not happen all the time (about 1/4th of all builds), and due to the 
fact that the JVM is unresponsive it is not possible to get a stack trace with 
"jstack". If you know a way to get the stack trace, I'd be happy to get help.

Once I figure out what makes it hang, I will open issues in OpenJDK (I am an 
OpenJDK member/editor). I now have many stuck JVMs running to analyze on the 
server, so you're invited to help! At the moment, I have no time to take care of 
it, so any help is useful.

> On a side note, the Lucene codebase still uses the deprecated (as of
> JDK17) AccessController
> in the RamUsageEstimator class.
> We suppressed the warning for now (based on recommendations
>  dev/202106.mbox/%3CJIRA.13369440.1617476525000.615331.16239514800
> 5...@atlassian.jira%3E>
> from the Apache Derby mailing list).

This should not be an issue, because we compile Lucene with javac parameter 
"--release 11", so it won't show any warning that you need to suppress. Looks 
like your build system at Amazon is not the original one by Lucene's Gradle, 
which shows no warnings at all.

Uwe

> Gautam Worah.
> 
> 
> On Mon, Oct 18, 2021 at 3:02 PM Michael McCandless <
> luc...@mikemccandless.com> wrote:
> 
> > Also, I try to semi-aggressively upgrade Lucene's nightly benchmarks to new
> > JDK releases and leave an annotation on the nightly charts:
> > https://home.apache.org/~mikemccand/lucenebench/
> >
> > I just now upgraded to JDK 17 and kicked off a new benchmark run ... in a
> > few hours it should show the new data points and then I'll try to remember
> > to annotate it tomorrow.
> >
> > So let's see whether nightly benchmarks uncover any performance changes
> > from JDK17 :)
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Mon, Oct 18, 2021 at 5:36 PM Robert Muir  wrote:
> >
> > > We test different releases on different platforms (e.g. Linux, Windows,
> > > Mac).
> > > We also test EA (Early Access) releases of openjdk versions during the
> > > development process.
> > > This finds bugs before they get released.
> > >
> > > More information about versions/EA testing: https://jenkins.thetaphi.de/
> > >
> > > On Mon, Oct 18, 2021 at 5:33 PM Kevin Rosendahl
> > >  wrote:
> > > >
> > > > Hello,
> > > >
> > > > We are using Lucene 8 and planning to upgrade from Java 11 to Java 17.
> > We
> > > > are curious:
> > > >
> > > >- How lucene is testing against java versions. Are there correctness
> > > and
> > > >performance tests using java 17?
> > > >   - Additionally, besides Java 17, how are new Java releases
> > tested?
> > > >- Are there any other orgs using Java 17 with Lucene?
> > > >- Any other considerations we should be aware of?
> > > >
> > > >
> > > > Best,
> > > > Kevin Rosendahl
> > >
> > > -
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> > >
> >


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: IntervalQuery replacement for SpanFirstQuery? Closest replacement for slops?

2021-10-08 Thread Uwe Schindler
Hi Alan,

this was all very helpful. Another thing about the intervals and the transformation 
from SpanQuery to IntervalQuery: IntervalQuery only returns a score between 
0..1 and does not take term statistics into account. To get term-based 
scoring, one should combine it with some term queries (which is perfectly fine 
as it decouples term scoring from term positions and allows more flexibility).

My question now (and maybe this should be documented in some MIGRATE.txt or the 
Javadocs): How to best combine the scores from TermQuery and IntervalQuery to 
get a scoring *similar* (not identical) to the good old SpanQueries? I tried to 
read the SpanQuery scoring mechanisms but gave up because I did not figure out 
where the final score of the terms is combined with the span score.

My first idea was to create a BooleanQuery with the IntervalQuery as a MUST 
clause and all terms appearing somewhere in the (positive) intervals added as 
SHOULD clauses. My problem now is that the number of terms differs from query 
to query, but the IntervalQuery only adds 0..1 to the total score. So should 
one use a BoostQuery around the IntervalQuery that boosts by the number of 
terms added as sibling SHOULD clauses? Other suggestions?
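
For reference, a rough sketch of that first idea (field name and terms are 
made up; the 2.0f boost stands in for the "number of sibling term clauses" 
heuristic):

Query interval = new IntervalQuery("body",
    Intervals.ordered(Intervals.term("foo"), Intervals.term("bar")));

BooleanQuery.Builder bq = new BooleanQuery.Builder();
bq.add(new BoostQuery(interval, 2.0f), BooleanClause.Occur.MUST); // boost by number of term clauses
bq.add(new TermQuery(new Term("body", "foo")), BooleanClause.Occur.SHOULD);
bq.add(new TermQuery(new Term("body", "bar")), BooleanClause.Occur.SHOULD);
Query combined = bq.build();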

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Alan Woodward 
> Sent: Monday, September 21, 2020 7:56 PM
> To: Dawid Weiss 
> Cc: Lucene Users 
> Subject: Re: IntervalQuery replacement for SpanFirstQuery? Closest
> replacement for slops?
> 
> Your filtered query should work the same as a SpanFirst, yes.  I didn’t add a
> shortcut just because you can do it this way, but feel free to add it if you 
> think
> it’s useful!
> 
> Re sloppy phrases, this one is trickier.  The closest you can get at the 
> moment is
> an unordered near, but that’s not the same thing as it doesn’t take
> transpositions into account when calculating the slop.  I think it should be
> possible to write something that works similarly to SloppyPhraseMatcher, but
> as always the tricky part is in dealing with duplicate entries.  I have some 
> ideas
> but they’re not ready to commit yet, unfortunately.
> 
> In terms of your suggested replacements: maxwidth will give you the
> equivalent of a SpanNearUnordered.  Maxgaps gives a restriction on how many
> internal holes there are in the query, so works better if the constituent 
> intervals
> are not necessarily single terms.
> 
> > On 21 Sep 2020, at 18:47, Dawid Weiss  wrote:
> >
> >
> > For what it is worth, I would be also interested in answers to these 
> > questions.
> ;)
> >
> > On Mon, Sep 21, 2020, 19:08 Uwe Schindler  <mailto:u...@thetaphi.de>> wrote:
> > Hi all, hi Alan,
> >
> > I am currently rewriting some SpanQuery code to use IntervalQuery. Most of
> the transformations can be done quite easily and it is also better to read 
> after
> transformation. What I am missing a bit is some document to compare the
> different query types and a guide how to convert those.
> >
> > I did not find a replacement for SpanFirstQuery (or at least any query stat
> takes absolute positions). I know intervals more deal with term intervals, 
> but I
> was successful in replacing a SpanFirstQuery with this:
> > IntervalsSource term = Intervals.term("foo");
> > IntervalsSource filtered = new 
> > FilteredIntervalsSource("FIRST"+distance,
> term) {
> >   @Override
> >   protected boolean accept(IntervalIterator it) {
> > return it.end() < distance; // or should this be <= distance???
> >   }
> > };
> > Query query = new IntervalQuery(field, filtered);
> >
> > I am not fully sure if this works under all circumstances . To me it looks
> fine and also did work with more complex intervals than "term". If this is ok,
> how about adding a "first(int n, IntervalsSource iv)" method to Intervals 
> class?
> >
> > The second question: What's the "closest" replacement for a PhraseQuery
> with slop? Should I use maxwidth(slop + 1) or maxgaps(slop-1) or
> maxgaps(slop). I know SpanQuery slops cannot be fully replaced with intervals,
> but I don't care about those SpanQuery bugs.
> >
> > Uwe
> >
> > -
> > Uwe Schindler
> > Achterdiek 19, D-28357 Bremen
> > https://www.thetaphi.de <https://www.thetaphi.de/>
> > eMail: u...@thetaphi.de <mailto:u...@thetaphi.de>
> >
> >
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> <mailto:java-user-unsubscr...@lucene.apache.org>
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> <mailto:java-user-h...@lucene.apache.org>
> >



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Question about readVint & writeVint from DataOutput and DataInput

2021-09-03 Thread Uwe Schindler
They are fully supported, so you can write and read them.

The problem with negative numbers is that they need a lot of (disk) space, 
because in two's complement they have almost all bits set. The worst case 
in terms of disk space is -1, which always takes 5 bytes as a vInt.

Negative numbers appear in older index formats, so they can't be prevented by a 
pull request as suggested.

Just take the comment as given: all is supported, but if you want to store 
negative numbers use a different encoding, e.g. zigzag.
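
For illustration, a hand-rolled zig-zag mapping looks like this (it maps 0, -1, 
1, -2, ... to 0, 1, 2, 3, ..., so small negative values stay small before the 
vInt encoding):

// writing: interleave the sign bit
int zigzag = (value << 1) ^ (value >> 31);
out.writeVInt(zigzag);

// reading: undo the mapping
int raw = in.readVInt();
int decoded = (raw >>> 1) ^ -(raw & 1);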

Uwe

On 3 September 2021 14:34:35 UTC, Aaron Cohen wrote:
>While reading the Lucene JavaDoc I came across writeVInt 
><https://lucene.apache.org/core/8_9_0/core/org/apache/lucene/store/DataOutput.html#writeVInt-int->
> & readVInt 
><https://lucene.apache.org/core/8_9_0/core/org/apache/lucene/store/DataInput.html#readVInt-->
> from DataOutput and DataInput base classes. It says for writeVint
>
>Parameters:
>i - Smaller values take fewer bytes. Negative numbers are supported, but 
>should be avoided.
>
>And for readVint "Negative numbers are supported, but should be avoided.”
>
>This seems like an odd statement. Why would something be supported but should 
>be avoided? Should I submit a PR to prevent negative integers?
--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

Re: Range query with Lucene7.7.1 on old indexes.

2021-09-01 Thread Uwe Schindler
Hi,
The old trie-based range fields were deprecated in Lucene 6 and removed in 7.

https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/search/LegacyNumericRangeQuery.html

Upgrading the index does not help, because there's no easy way to convert trie 
fields to points and that's not done automatically. I wrote a hack to do this 
for a customer, but it requires a complete rewrite of the index (read the 5.x 
index with Lucene 6 and use addIndexes for merging into an empty index, while 
emulating points). The code is not open source and requires careful usage. 
Basically it enumerates the trie terms during merging and pushes them as a 
flattened BKD tree. Lucene will rebuild the full tree while merging.

Uwe

On 1 September 2021 17:20:24 UTC, Antony Joseph wrote:
>Hi all,
>
>Using: python 2.7.14, pylucene 4.10.0
>
>Index:
>
>xdate = long("20190101183030")
>doc.add(LongField('xdate', xdate, Field.Store.YES)) # stored and not
>analyzed
>
>Query:
>
>query = NumericRangeQuery.newLongRange("xdate", long("2019010100"),
>long("20190101115959"), True, True)
>
>I am getting the results. It works fine.
>
>Now i upgraded the lucene index 4.10.0 to lucene 7.7.1
>
>Using : Python3.8.11, pylucene 7.7.1
>
>I am searching on my old indexes, using the following queries
>
>query = NumericDocValuesField.newSlowRangeQuery("xdate",
>long("2019010100"), long("20190101115959"))
>
>No results.
>
>query = LongPoint.newRangeQuery("xdate", long("2019010100"),
>long("20190101115959"))
>
>No results.
>
>How to get the results on my old indexes using date range query?
>
>Can anyone help?
>
>Thanks

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

RE: lucene 4.10.4 punctuation

2021-08-25 Thread Uwe Schindler
Hi,

you should explain to us what exactly you want to do: How do you want to 
search, and what do your documents look like? Why is it important to match on 
punctuation, and what should this matching look like?

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Younes Bahloul 
> Sent: Wednesday, August 25, 2021 6:34 PM
> To: java-user@lucene.apache.org
> Subject: lucene 4.10.4 punctuation
> 
> Hello
> i m part of a team that maintain
> http://exist-db.org/exist/apps/homepage/index.html
> its an Open Source XML database
> and we use lucene 4.10.4
> i m trying to introduce punctuation in search feature
> is there an analyzer that provides that or a way to  do it in 4.10.4 API
> 
> thanks Younes


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Failed to execute Ant run-task command

2021-08-19 Thread Uwe Schindler
Could you please open an issue?

Can you also check if it still happens on main branch with Lucene 9.0 and
Gradle as build system?

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: xiaoshi 
> Sent: Thursday, August 19, 2021 5:43 AM
> To: java-user@lucene.apache.org
> Subject: Failed to execute Ant run-task command
> 
> Hello:
> 
> When I running ant run-task command, the default thread name is main not
> ParallelTaskThread,
> StringIndexOutOfBoundsException error is thrown.
> 
> Here is the stack error:
> 
> [java] 
> [java] ### D O N E !!! ###
> [java] 
> [java] Error: cannot execute the algorithm! String index out of range: -15
> [java] java.lang.StringIndexOutOfBoundsException: String index out of
range: -
> 15
> [java] at java.lang.String.substring(String.java:1967)
> [java] at
> org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource.getNextDoc
> Data(ReutersContentSource.java:120)
> [java] at
> org.apache.lucene.benchmark.byTask.feeds.DocMaker.makeDocument(DocMak
> er.java:371)
> [java] at
> org.apache.lucene.benchmark.byTask.tasks.AddDocTask.setup(AddDocTask.java
> :52)
> [java] at
> org.apache.lucene.benchmark.byTask.tasks.PerfTask.runAndMaybeStats(PerfTa
> sk.java:134)
> [java] at
> org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doSerialTasks(TaskSe
> quence.java:198)
> [java] at
> org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doLogic(TaskSequenc
> e.java:139)
> [java] at
> org.apache.lucene.benchmark.byTask.tasks.PerfTask.runAndMaybeStats(PerfTa
> sk.java:146)
> [java] at
> org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doSerialTasks(TaskSe
> quence.java:198)
> [java] at
> org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doLogic(TaskSequenc
> e.java:139)
> [java] at
> org.apache.lucene.benchmark.byTask.tasks.PerfTask.runAndMaybeStats(PerfTa
> sk.java:146)
> [java] at
> org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doSerialTasks(TaskSe
> quence.java:198)
> [java] at
> org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doLogic(TaskSequenc
> e.java:139)
> [java] at
> org.apache.lucene.benchmark.byTask.tasks.PerfTask.runAndMaybeStats(PerfTa
> sk.java:146)
> [java] at
> org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doSerialTasks(TaskSe
> quence.java:198)
> [java] at
> org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doLogic(TaskSequenc
> e.java:139)
> [java] at
> org.apache.lucene.benchmark.byTask.tasks.PerfTask.runAndMaybeStats(PerfTa
> sk.java:146)
> [java] at
>
org.apache.lucene.benchmark.byTask.utils.Algorithm.execute(Algorithm.java:33
> 2)
> [java] at
> org.apache.lucene.benchmark.byTask.Benchmark.execute(Benchmark.java:77)
> [java] at
> org.apache.lucene.benchmark.byTask.Benchmark.exec(Benchmark.java:121)
> [java] at
> org.apache.lucene.benchmark.byTask.Benchmark.main(Benchmark.java:85)
> 
> 
> 
> 
> I submitted a pull request (GitHub Pull Request #2556) to fix this issue,
can
> someone please to see if it needs to be fixed?
> I submitted an issues in jira :
https://issues.apache.org/jira/browse/LUCENE-
> 10051
> 
> 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: NRT readers and overall indexing/querying throughput

2021-08-08 Thread Uwe Schindler
Hi,

in general, NRT indexing throughput is always a bit slower than normal 
indexing as it reopens readers and needs to flush segments more often (and 
therefore you should use NRTCachingDirectory). So 10% slower indexing throughput 
is quite normal. You can improve this by parallelizing, but during a refresh 
you still have a small delay on each reopen of readers by SearcherManager.

Searching is mostly the same speed, because while indexing, most of the segments 
don't change and can be reused after a reopen; only new but small segments are 
cold. Merged segments also need warming, so generally you only see small spikes 
in search performance when newly merged and possibly huge "cold" segments go 
live.

Of course, if you use more parallel threads during indexing you will also see a 
slowdown in search performance.

When doing NRT always use NRTCachingDirectory, for "normal bulk indexing", 
MMapDirectory alone is fine.
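
For example, the wrapping could look like this (path, cache sizes and analyzer 
are placeholders):

Directory dir = new NRTCachingDirectory(new MMapDirectory(Paths.get("/path/to/index")), 5.0, 60.0);
IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
SearcherManager manager = new SearcherManager(writer, new SearcherFactory());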

I don't fully understand your expectations, but everything you describe looks 
quite normal. The main reason to use NRT indexing is shorter turnaround times 
by not doing expensive commits. And that's what you see -- while indexing 
performance and also search performance go down depending on the refresh rate.

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Alexander Lukyanchikov 
> Sent: Wednesday, August 4, 2021 4:43 AM
> To: java-user@lucene.apache.org
> Subject: NRT readers and overall indexing/querying throughput
> 
> Hello everyone,
> 
> We are considering switching from regular to NRT readers, hoping it would
> improve overall indexing/querying throughput and also optimize the
> turnaround time.
> I did some benchmarks, mostly to understand how much benefit we can get
> and
> make sure I'm implementing everything correctly.
> 
> To my surprise, no matter how I tweak it, our indexing throughput is 10%
> lower with NRT, and query throughput (goes in parallel with indexing) is
> pretty much the same. I do see almost x5 turnaround time improvement
> though.
> Maybe I have wrong expectations, and less frequent commits with NRT refresh
> were not intended to improve overall performance?
> 
> Some details about the tests -
> Base implementation commits and refreshes a regular reader every second.
> NRT implementation commits every 60 seconds and refreshes NRT reader every
> second.
> The indexing rate is about 23 Mb/sec, query rate ~300 rps (text search with
> avg 50ms latency). Documents size is about 35 Kb.
> 36 core machine is used for the tests, and I don't see a big difference in
> JVM metrics between the tests. Also, there is no obvious bottleneck in
> CPU/memory/disk utilization (everything is way below 100%)
> NRT readers are implemented using the SearchManager, the same as the
> implementation
> in the Lucene benchmark
> <https://github.com/mikemccand/luceneutil/blob/master/src/main/perf/NRTP
> erfTest.java>
>  repository.
> With NRT, commit latency is about 3 sec, average refresh latency is 150ms.
> In the base approach, commit latency is about 500 ms, refresh 300 ms.
> I tried NRTCachingDirectory (with MmapDirectory and NIOFSDirectory), insert
> vs update workload, `applyAllDeletes=false`, single indexing thread -
> nothing helps to match the base version throughput.
> 
> I'd appreciate any advice. Am I missing something obvious, or the
> expectation that NRT with less frequent commits going to be more
> performant/resource-efficient is incorrect?
> 
> --
> Regards,
> Alex


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Does Lucene have anything like a covering index as an alternative to DocValues?

2021-07-05 Thread Uwe Schindler
Hi,

Sorry, I misunderstood your question, you want to look up the UUID in another 
system!
Then the approach you are using is correct. Either store it as a stored field or 
as a docvalue. An inverted index cannot store additional data, because it *is* 
inverted: it is focused around *terms*, not documents. The posting list of each 
term can only store internal, numeric Lucene doc ids. Those then have to be 
used to look up the actual contents from e.g. stored fields (possibility A) or 
DocValues (possibility B). We can't store UUIDs in the highly compressed 
posting list.
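
For illustration, both lookups per hit could look roughly like this (the field 
name "uuid" is made up, and the DocValues variant assumes a SortedDocValuesField 
was indexed):

// possibility A: stored field
String uuid = searcher.doc(scoreDoc.doc).get("uuid");

// possibility B: DocValues (the doc id must be made relative to its segment)
List<LeafReaderContext> leaves = searcher.getIndexReader().leaves();
LeafReaderContext leaf = leaves.get(ReaderUtil.subIndex(scoreDoc.doc, leaves));
SortedDocValues dv = DocValues.getSorted(leaf.reader(), "uuid");
if (dv.advanceExact(scoreDoc.doc - leaf.docBase)) {
  String uuidFromDocValues = dv.lookupOrd(dv.ordValue()).utf8ToString();
}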

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Uwe Schindler 
> Sent: Monday, July 5, 2021 3:10 PM
> To: java-user@lucene.apache.org
> Subject: RE: Does Lucene have anything like a covering index as an alternative
> to DocValues?
> 
> You need to index the UUID as a standard indexed StringField. Then you can do
> a lookup using TermQuery. That's how all systems like Solr or Elasticsearch
> handle document identifiers.
> 
> DocValues are for facetting and sorting, but looking up by ID is a typical use
> case for an inverted index. If you still need to store it as DocValues field, 
> just
> add it with both types.
> 
> Uwe
> 
> -
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
> 
> > -Original Message-
> > From: Alex K 
> > Sent: Monday, July 5, 2021 2:30 AM
> > To: java-user@lucene.apache.org
> > Subject: Does Lucene have anything like a covering index as an alternative 
> > to
> > DocValues?
> >
> > Hi all,
> >
> > I am curious if there is anything in Lucene that resembles a covering index
> > (from the relational database world) as an alternative to DocValues for
> > commonly-accessed values?
> >
> > Consider the following use-case: I'm indexing docs in a Lucene index. Each
> > doc has some terms, which are not stored. Each doc also has a UUID
> > corresponding to some other system, which is stored using DocValues. When I
> > run a query, I get back the TopDocs and use the doc ID to fetch the UUID
> > from DocValues. I know that I will *always* need to go fetch this UUID. Is
> > there any way to have the UUID stored in the actual index, rather than
> > using DocValues?
> >
> > Thanks in advance for any tips
> >
> > Alex Klibisz
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Control the number of segments without using forceMerge.

2021-07-05 Thread Uwe Schindler
If you want an exact number of segments, create 64 indexes, each forceMerged to 
one segment.
After that use MultiReader to create a view over all the separate indexes. 
A MultiReader's contents are always flattened to a list of those 64 indexes.
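
A small sketch of that setup (the directory layout "part-N" and the executor 
are just placeholders):

IndexReader[] parts = new IndexReader[64];
for (int i = 0; i < parts.length; i++) {
  parts[i] = DirectoryReader.open(FSDirectory.open(root.resolve("part-" + i)));
}
MultiReader reader = new MultiReader(parts);
IndexSearcher searcher = new IndexSearcher(reader, executor); // the executor searches the 64 leaves in parallel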

But keep in mind that this should only ever be done with *static* indexes. As 
soon as you have updates, this is a bad idea (as is forceMerge in general), and 
so is splitting indexes like this. Parallelization should normally come from 
multiple queries running in parallel; you shouldn't force Lucene to run a single 
query over so many indexes.

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Alex K 
> Sent: Monday, July 5, 2021 4:04 AM
> To: java-user@lucene.apache.org
> Subject: Control the number of segments without using forceMerge.
> 
> Hi all,
> 
> I'm trying to figure out if there is a way to control the number of
> segments in an index without explicitly calling forceMerge.
> 
> My use-case looks like this: I need to index a static dataset of ~1
> billion documents. I know the exact number of docs before indexing starts.
> I know the VM where this index is searched has 64 threads. I'd like to end
> up with exactly 64 segments, so I can search them in a parallelized fashion.
> 
> I know that I could call forceMerge(64), but this takes an extremely long
> time.
> 
> Is there a straightforward way to ensure that I end up with 64 threads
> without force-merging after adding all of the documents?
> 
> Thanks in advance for any tips
> 
> Alex Klibisz


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Does Lucene have anything like a covering index as an alternative to DocValues?

2021-07-05 Thread Uwe Schindler
You need to index the UUID as a standard indexed StringField. Then you can do a 
lookup using TermQuery. That's how all systems like Solr or Elasticsearch 
handle document identifiers.

DocValues are for faceting and sorting, but looking up by ID is a typical use 
case for an inverted index. If you still need to store it as a DocValues field, 
just add it with both types.
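
For example (the field name is made up; both variants can share the same name):

// index time: keyword-style field for the TermQuery lookup, plus DocValues if needed
doc.add(new StringField("uuid", uuid, Field.Store.YES));
doc.add(new SortedDocValuesField("uuid", new BytesRef(uuid)));

// lookup time:
TopDocs hits = searcher.search(new TermQuery(new Term("uuid", uuid)), 1);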

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Alex K 
> Sent: Monday, July 5, 2021 2:30 AM
> To: java-user@lucene.apache.org
> Subject: Does Lucene have anything like a covering index as an alternative to
> DocValues?
> 
> Hi all,
> 
> I am curious if there is anything in Lucene that resembles a covering index
> (from the relational database world) as an alternative to DocValues for
> commonly-accessed values?
> 
> Consider the following use-case: I'm indexing docs in a Lucene index. Each
> doc has some terms, which are not stored. Each doc also has a UUID
> corresponding to some other system, which is stored using DocValues. When I
> run a query, I get back the TopDocs and use the doc ID to fetch the UUID
> from DocValues. I know that I will *always* need to go fetch this UUID. Is
> there any way to have the UUID stored in the actual index, rather than
> using DocValues?
> 
> Thanks in advance for any tips
> 
> Alex Klibisz


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Changing Term Vectors for Query

2021-06-07 Thread Uwe Schindler
Hi,

the only way to get this working efficiently (performance-wise) would be the 
approach suggested by Adrien.

What you generally do is to index the same information into 2 different fields 
(in Solr or Elasticsearch via "copy_to / copyField") with different analyzers. 
At query time you choose the applicable field.
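
A sketch of that pattern with PerFieldAnalyzerWrapper (field names and the 
custom analyzer are hypothetical; that analyzer could e.g. end in the 
DelimitedTermFrequencyTokenFilter Adrien mentioned):

Map<String, Analyzer> perField = new HashMap<>();
perField.put("body_custom", customFrequencyAnalyzer); // hypothetical analyzer emitting custom frequencies
Analyzer analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);

Document doc = new Document();
doc.add(new TextField("body", text, Field.Store.NO));        // normal frequencies
doc.add(new TextField("body_custom", text, Field.Store.NO)); // custom frequencies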

If you want to have "per document" scoring factors (not per term), you can also 
use additional DocValues fields with per-document factors and you can use a 
function query (e.g. using expressions module) to modify the score.

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Marcel D. 
> Sent: Monday, June 7, 2021 9:53 AM
> To: java-user@lucene.apache.org
> Subject: Re: Changing Term Vectors for Query
> 
> Hi Adrien,
> i forgot to mention that i also need the original frequencies. I have some
> queries i need to perform with the original frequencies and then some with
> custom frequencies, but as im only having a small index and a few queries that
> would work, but a solution where i dont have to change the index for those
> queries would be better for me.
> Marcel
> 
> 
> 
> ‐‐‐ Original Message ‐‐‐
> On Monday, June 7, 2021 9:11 AM, Adrien Grand  wrote:
> 
> > Hi Marcel,
> >
> > You can make Lucene index custom frequencies using something like
> > DelimitedTermFrequencyTokenFilter
> > https://lucene.apache.org/core/8_8_0/analyzers-
> common/org/apache/lucene/analysis/miscellaneous/DelimitedTermFrequencyT
> okenFilter.html,
> > which would be easier than writing a custom Query/Weight/Scorer. Would it
> > work for you?
> >
> > On Sun, Jun 6, 2021 at 10:24 PM Hannes Lohr
> > truebau...@protonmail.com.invalid wrote:
> >
> > > Hello,
> > > for some Queries i need to calcuate the score mostly like the normal
> > > score, but for some documents certain terms are assigned a Frequency
> given
> > > by me and the score should be calculated with these new term frequencies.
> > > After some research, it seems i have to implement a custom Query, custom
> > > Weight and Custom Scorer for this. I wanted to ask if I'm overlooking a
> > > simpler solution or if this is the way to go.
> > > Thanks,
> > > Marcel
> >
> > --
> >
> > Adrien
> 
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2020-12-14 Thread Uwe Schindler
Hi,

 
> Thanks Uwe, i am not insisting on to load everything into memory
> 
> but loading into memory might speed up and i would like to see how much
> speedup.
> 
> 
> but i have one more question and that is still not clear to me:
> 
> "it is much better to open index, with MMAP directory"
> 
> 
> does this mean i should not use the constructor but instead use the open
> api?

No, that means: use MMapDirectory, it should fit your needs. If your operating 
system has enough memory outside of the heap that Lucene can use to keep all 
pages of the mmapped file in memory, then it's the best you can have.

FSDirectory.open() is fine as it will always use MMapDirectory on 64 bit 
platforms.

> in other words: which way should be preferred?

It does not matter. If you want to use setPreload() [beware of slowdowns when 
opening index files for the first time!], use the constructor of MMapDirectory, 
because the FSDirectory.open() factory cannot guarantee which implementation you get.

Calling a static method on a class that does not implement it is generally 
considered bad practice (Eclipse should warn you). The static 
FSDirectory.open() is a factory method and should be used (on FSDirectory, not 
its subclass) if you don't know exactly what you want and need to stay operating 
system independent. If you want MMapDirectory and its features specifically, use the 
constructor.

> The example is from both during indexing and searching:
> 
> 
> /*First way: Using constructor (without setPreload) :*/
> 
> MMapDirectory dir = new MMapDirectory(Paths.get(indexDir)); // Uses
> FSLockFactory.getDefault() and DEFAULT_MAX_CHUNK_SIZE which is 1GB
> if (dir.getPreload() == false)
>   dir.setPreload(Constants.PRELOAD_YES); // In-Memory Lucene Index
> enabled-> *commented out*
> IndexReader reader = DirectoryReader.open(dir);
> 
> ...
> 
> 
> /*Second way: Or using open (without setPreload) :*/
> 
> *Directory* dir = MMapDirectory.open(Paths.get(indexDir)); //open is
> inherited from FSDirectory
> if (dir.getPreload() == false)
>   dir.setPreload(Constants.PRELOAD_YES); // In-Memory Lucene Index
> enabled-> *here setPreload cannot be used*
> IndexReader reader = DirectoryReader.open(dir);
> IndexSearcher is = new IndexSearcher(reader);
> 
> ...
> 
> 
> Best regards
> 
> 
> On 12/14/20 1:51 PM, Uwe Schindler wrote:
> > Hi,
> >
> > as writer of the original bog post, here my comments:
> >
> > Yes, MMapDirectory.setPreload() is the feature mentioned in my blog post is
> > to load everything into memory - but that does not guarantee anything!
> > Still, I would not recommend to use that function, because all it does is to
> > just touch every page of the file, so the linux kernel puts it into OS cache
> > - nothing more; IMHO very ineffective as it slows down openining index for a
> > stupid for-each-page-touch-loop. It will do this with EVERY page, if it is
> > later used or not! So this may take some time until it is done. Lateron,
> > still Lucene needs to open index files, initialize its own data
> > structures,...
> >
> > In general it is much better to open index, with MMAP directory and execute
> > some "sample" queries. This will do exactly the same like the preload
> > function, but it is more "selective". Parts of the index which are not used
> > won't be touched, and on top, it will also load ALL the required index
> > structures to heap.
> >
> > As always and as mentioned in my blog post: there's nothing that can ensure
> > your index will stays in memory. Please trust the kernel to do the right
> > thing. Why do you care at all?
> >
> > If you are curious and want to have everything in memory all the time:
> > - use tmpfs as your filesystem (of course you will loose data when OS shuts
> > down)
> > - disable swap and/or disable swapiness
> > - use only as much heap as needed, keep everything of free memory for your
> > index outside heap.
> >
> > Fake feelings of "everything in RAM" are misconceptions like:
> > - use RAMDirectory (deprecated): this may be a desaster as it described in
> > the blog post
> > - use ByteBuffersDirectory: a little bit better, but this brings nothing, as
> > the operating system kernel may still page out your index pages. They still
> > live in/off heap and are part of usual paging. They are just no longer
> > backed by a file.
> >
> > Lucene does most of the stuff outside heap, live with it!
> >
> > Uwe
> >
> > -
> > Uwe Schindler
> > Achterdiek 19, D-28357 Bremen
> >
> > https://www.thetaphi.de

RE: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2020-12-14 Thread Uwe Schindler
Hi,

as the writer of the original blog post, here are my comments:

Yes, MMapDirectory.setPreload() is the feature mentioned in my blog post to
load everything into memory - but that does not guarantee anything!
Still, I would not recommend using that function, because all it does is
touch every page of the file so that the Linux kernel puts it into the OS cache
- nothing more; IMHO very ineffective, as it slows down opening the index for a
stupid for-each-page-touch loop. It will do this with EVERY page, whether it is
later used or not! So this may take some time until it is done. Later on,
Lucene still needs to open index files, initialize its own data
structures, ...

In general it is much better to open the index with MMapDirectory and execute
some "sample" queries. This will do exactly the same as the preload
function, but it is more "selective". Parts of the index which are not used
won't be touched, and on top of that, it will also load ALL the required index
structures onto the heap.

As always and as mentioned in my blog post: there's nothing that can ensure
your index will stay in memory. Please trust the kernel to do the right
thing. Why do you care at all?

If you are curious and want to have everything in memory all the time:
- use tmpfs as your filesystem (of course you will lose data when the OS shuts
down)
- disable swap and/or reduce swappiness
- use only as much heap as needed, and keep all free memory for your
index outside the heap.

Fake feelings of "everything in RAM" are misconceptions like:
- use RAMDirectory (deprecated): this may be a disaster, as described in
the blog post
- use ByteBuffersDirectory: a little bit better, but it brings nothing, as
the operating system kernel may still page out your index pages. They still
live in/off heap and are part of usual paging. They are just no longer
backed by a file.

Lucene does most of the stuff outside heap, live with it!

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: baris.ka...@oracle.com 
> Sent: Sunday, December 13, 2020 10:18 PM
> To: java-user@lucene.apache.org
> Cc: BARIS KAZAR 
> Subject: MMapDirectory vs In Memory Lucene Index (i.e.,
ByteBuffersDirectory)
> 
> Hi,-
> 
> it would be nice to create a Lucene index in files and then effectively
load it
> into memory once (since i use in read-only mode). I am looking into if
this is
> doable in Lucene.
> 
> i wish there were an option to load whole Lucene index into memory:
> 
> Both of below urls have links to the blog url where i quoted a very nice
section:
> 
> https://lucene.apache.org/core/8_5_0/core/org/apache/lucene/store/MMapDi
> rectory.html
> https://lucene.apache.org/core/8_5_2/core/org/apache/lucene/store/MMapDi
> rectory.html
> 
> This following blog mentions about such option
> to run in the memory: (see the underlined sentence below)
> 
> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-
> 64bit.html?m=1
> 
> MMapDirectory will not load the whole index into physical memory. Why
> should it do this? We just ask the operating system to map the file into
address
> space for easy access, by no means we are requesting more. Java and the
O/S
> optionally provide the option to try loading the whole file into RAM (if
enough
> is available), but Lucene does not use that option (we may add this
possibility
> in a later version).
> 
> My question is: is there such an option?
> is the method setPreLoad for this purpose:
> to load all Lucene lndex into memory?
> 
> I would like to use MMapDirectory and set my
> JVM heap to 16G or a bit less (since my index is
> around this much).
> 
> The Lucene 8.5.2 (8.5.0 as well) javadocs say:
> public void setPreload(boolean preload)
> Set to true to ask mapped pages to be loaded into physical memory on init.
The
> behavior is best-effort and operating system dependent.
> 
> For example Lucene 4.0.0 does not have setPreLoad method.
> 
> https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/store/MMapDi
> rectory.html
> 
> Happy Holidays
> Best regards
> 
> 
> Ps. i know there is also BytesBuffersDirectory class for in memory Lucene
but
> this requires creating Lucene Index on the fly.
> 
> This is great for only such kind of Lucene indexes that can be created
quickly on
> the fly.
> 
> Ekaterina has a nice article on this BytesBuffersDirectory class:
> 
> https://medium.com/@ekaterinamihailova/in-memory-search-and-
> autocomplete-with-lucene-8-5-f2df1bc71c36



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Lucene Migration query

2020-11-20 Thread Uwe Schindler
Hi,

> Currently I am using Lucene 7.3, I want to upgrade to lucene 8.5.1. Should
> I do reindexing in this case ?

No, you don't need that.

> Can I make use of backward codec jar without a reindex?

Yes, just add the JAR file to your classpath and it can read the indexes. 
Updates written to the index will use the new codecs. To force a full upgrade 
(rewrite all segments), invoke the IndexUpgrader class either from your code or 
using the command line. But this is not needed, it just makes sure that you can 
get rid of the backwards-codecs jar.
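
For example, a minimal sketch of forcing the upgrade from code (the index path is hypothetical):

try (Directory dir = FSDirectory.open(Paths.get("/path/to/index"))) {
  new IndexUpgrader(dir).upgrade(); // rewrites all segments with the current codec
}

The same can be done from the command line by running the org.apache.lucene.index.IndexUpgrader main class with lucene-core and lucene-backward-codecs on the classpath.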

Uwe


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: best way (performance wise) to search for field without value?

2020-11-13 Thread Uwe Schindler
Hi,

Solr and Elasticsearch implement the exists query like this, which is fully in 
line with your investigation: if a field has doc values it uses 
DocValuesFieldExistsQuery, if it is a tokenized field it uses 
NormsFieldExistsQuery. The negative variant is a MUST_NOT clause, which is 
perfectly fine performance-wise.
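
For example, a minimal sketch of the negative variant (the field name "groups_allowed" is hypothetical):

// match documents where the field has no value at all
Query fieldIsEmpty = new BooleanQuery.Builder()
    .add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST)
    .add(new DocValuesFieldExistsQuery("groups_allowed"), BooleanClause.Occur.MUST_NOT)
    .build();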

An alternative way to search is to index all field names that have a value into 
a separate StringField. But this needs preprocessing.

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-exists-query.html

https://issues.apache.org/jira/browse/SOLR-11437

Uwe

Am November 13, 2020 2:19:43 PM UTC schrieb Michael McCandless 
:
>That's great Rob!  Thanks for bringing closure.
>
>Mike McCandless
>
>http://blog.mikemccandless.com
>
>
>On Fri, Nov 13, 2020 at 9:13 AM Rob Audenaerde
>
>wrote:
>
>> To follow up, based on a quick JMH-test with 2M docs with some random
>data
>> I see a speedup of 70% :)
>> That is a nice friday-afternoon gift, thanks!
>>
>> For ppl that are interested:
>>
>> I added a BinaryDocValues field like this:
>>
>> doc.add(BinaryDocValuesField("GROUPS_ALLOWED_EMPTY", new
>BytesRef(0x01;
>>
>> And used the finalQuery.add(new DocValuesFieldExistsQuery("
>> GROUPS_ALLOWED_EMPTY", BooleanClause.Occur.SHOULD);
>>
>> On Fri, Nov 13, 2020 at 2:09 PM Michael McCandless <
>> luc...@mikemccandless.com> wrote:
>>
>> > Maybe NormsFieldExistsQuery as a MUST_NOT clause?  Though, you must
>> enable
>> > norms on your field to use that.
>> >
>> > TermRangeQuery is indeed a horribly costly way to execute this, but
>if
>> you
>> > cache the result on each refresh, perhaps it is OK?
>> >
>> > You could also index a dedicated doc values field indicating that
>the
>> > field empty and then use DocValuesFieldExistsQuery.
>> >
>> > Mike McCandless
>> >
>> > http://blog.mikemccandless.com
>> >
>> >
>> > On Fri, Nov 13, 2020 at 7:56 AM Rob Audenaerde
>> >
>> > wrote:
>> >
>> >> Hi all,
>> >>
>> >> We have implemented some security on our index by adding a field
>> >> 'groups_allowed' to documents, and wrap a boolean must query
>around the
>> >> original query, that checks if one of the given user-groups
>matches at
>> >> least one groups_allowed.
>> >>
>> >> We chose to leave the groups_allowed field empty when the document
>> should
>> >> able to be retrieved by all users, so we need to also select a
>document
>> if
>> >> the 'groups_allowed' is empty.
>> >>
>> >> What would be the faster Query construction to do so?
>> >>
>> >>
>> >> Currently I use a TermRangeQuery that basically matches all values
>and
>> put
>> >> that in a MUST_NOT combined with a MatchAllDocumentQuery(), but
>that
>> gets
>> >> rather slow then the number of groups is high.
>> >>
>> >> Thanks!
>> >>
>> >
>>

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

Re: BooleanQuery: BooleanClause.Occur.MUST_NOT seems to require at least one BooleanClause.Occur.MUST

2020-11-06 Thread Uwe Schindler
Hi,

Finally, to "fix" it so it behaves the way you may want: just add a 
MatchAllDocsQuery as a MUST or SHOULD clause. You have full control over how it 
behaves!
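
For example, a minimal sketch of "everything except X" (excludeQuery is a hypothetical query):

Query everythingButX = new BooleanQuery.Builder()
    .add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST)   // positive clause
    .add(excludeQuery, BooleanClause.Occur.MUST_NOT)          // negative clause
    .build();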

Uwe

Am November 6, 2020 6:05:03 PM UTC schrieb Nissim Shiman 
:
> Thank You Erick and Adrien!
>On Friday, November 6, 2020, 08:43:59 AM EST, Erick Erickson
> wrote:  
> 
> Nissim:
>
>Here’s a good explanation of why it was designed this way
>if you’d like details:
>
>https://lucidworks.com/post/why-not-and-or-and-not/
>
>Don’t be put off by the Solr title, it’s really about
>BooleanQuery and BooleanClause
>
>Best,
>Erick
>
>> On Nov 6, 2020, at 8:17 AM, Adrien Grand  wrote:
>> 
>> Hi Nissim,
>> 
>> This is by design: boolean queries that don't have positive clauses
>like
>> empty boolean queries or boolean queries that only consist of
>negative
>> (MUST_NOT) clauses don't match any hits.
>> 
>> On Thu, Nov 5, 2020 at 9:07 PM Nissim Shiman
>
>> wrote:
>> 
>>> Hello Apache Lucene team members,
>>> I have found that constructing a BooleanQuery with just
>>> a BooleanClause.Occur.MUST_NOT will return no results.  It will
>return
>>> results is if there is also a BooleanClause.Occur.MUST as part of
>the query
>>> as well though.
>>> 
>>> 
>>> I don't see this limitation with a BooleanQuery with just
>>> a BooleanClause.Occur.MUST (i.e. results will return fine if they
>match).
>>> 
>>> Is this by design or is this an issue?
>>> 
>>> Thanks You,
>>> Nissim Shiman
>> 
>> 
>> 
>> -- 
>> Adrien
>
>
>-
>To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>For additional commands, e-mail: java-user-h...@lucene.apache.org
>  

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

RE: stucked indexing process

2020-10-14 Thread Uwe Schindler
Hi,

this looks like an issue in Solr or how you use Solr. Could it be that you
are reloading the cores all the time? Because the mentioned
"IOUtils.spins()" should only be called when the index is opened and the
IndexWriter is initialized by Solr. It is unlikely that you have any
concurrency there.

There might be one problem: If you have a stuck mount point in your system
(like another NFS mount) that hangs, it might happen that Lucene's code also
hangs, as it inspects the mount points for SSD / spinning disks when starting
up the IndexWriter. So please make sure that "mount" does not hang and all the
mount points respond (e.g. there are no hanging NFS mounts blocking Lucene
from inspecting mounts).

This is also a different issue than the one mentioned before, because you
don't use NFS, it's a local disk, right?

One workaround may be to explicitly tell ConcurrentMergeScheduler to use the
SSD or spinning-disk default settings in your solrconfig.xml:

<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
  <bool name="spins">true</bool>
</mergeScheduler>

Use "true" for spinning disks and "false" for SSDs. This prevents the
auto-detection from running.

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Sachin909 
> Sent: Wednesday, October 14, 2020 9:43 AM
> To: java-user@lucene.apache.org
> Subject: RE: stucked indexing process
> 
> Hi Uwe,
> 
> I have observed the similer issue with my application.
> 
> Application stack:
> 
> "coreLoadExecutor-4-thread-1" #86 prio=5 os_prio=0 tid=0x7fbb1c364800
> *nid=0x1616* runnable [0x7fbaa96ef000]
>java.lang.Thread.State: RUNNABLE
>   at sun.nio.fs.UnixNativeDispatcher.stat0(Native Method)
>   at
sun.nio.fs.UnixNativeDispatcher.stat(UnixNativeDispatcher.java:286)
>   at sun.nio.fs.UnixFileAttributes.get(UnixFileAttributes.java:70)
>   at sun.nio.fs.UnixFileStore.devFor(UnixFileStore.java:55)
>   at sun.nio.fs.UnixFileStore.(UnixFileStore.java:70)
>   at sun.nio.fs.LinuxFileStore.(LinuxFileStore.java:48)
>   at sun.nio.fs.LinuxFileSystem.getFileStore(LinuxFileSystem.java:112)
>   at
>
sun.nio.fs.UnixFileSystem$FileStoreIterator.readNext(UnixFileSystem.java:213
)
>   at
>
sun.nio.fs.UnixFileSystem$FileStoreIterator.hasNext(UnixFileSystem.java:224)
>   - locked <0x996864f8> (a
> sun.nio.fs.UnixFileSystem$FileStoreIterator)
>   at org.apache.lucene.util.IOUtils.getFileStore(IOUtils.java:543)
>   at org.apache.lucene.util.IOUtils.spinsLinux(IOUtils.java:487)
>   at org.apache.lucene.util.IOUtils.spins(IOUtils.java:476)
>   at org.apache.lucene.util.IOUtils.spins(IOUtils.java:451)
>   at
> *org.apache.lucene.index.ConcurrentMergeScheduler.initDynamicDefaults(Con
> currentMergeScheduler.java:376)*
>   - locked <0x99686598> (a
> org.apache.lucene.index.ConcurrentMergeScheduler)
>   at
> org.apache.lucene.index.ConcurrentMergeScheduler.merge(ConcurrentMergeS
> cheduler.java:464)
>   - locked <0x99686598> (a
> org.apache.lucene.index.ConcurrentMergeScheduler)
>   at
> org.apache.lucene.index.IndexWriter.waitForMerges(IndexWriter.java:2444)
>   at
> org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:1131)
>   at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1175)
>   at
> org.apache.solr.update.SolrIndexWriter.close(SolrIndexWriter.java:291)
>   at org.apache.solr.core.SolrCore.initIndex(SolrCore.java:716)
>   at org.apache.solr.core.SolrCore.(SolrCore.java:899)
>   at org.apache.solr.core.SolrCore.(SolrCore.java:816)
>   at org.apache.solr.core.CoreContainer.create(CoreContainer.java:890)
>   at
> org.apache.solr.core.CoreContainer.lambda$load$3(CoreContainer.java:542)
>   at
> org.apache.solr.core.CoreContainer$$Lambda$34/209767675.call(Unknown
> Source)
>   at
>
com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(I
> nstrumentedExecutorService.java:197)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lam
> bda$execute$0(ExecutorUtil.java:229)
>   at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$La
> mbda$35/1998024988.run(Unknown
> Source)
>   at
>
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1
> 149)
>   at
>
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:
> 624)
>   at java.lang.Thread.run(Thread.java:748)
> 
> 
> Mount:
> /dev/mapper/appvg-lv_apps/apps   index location has sufficient
> (100Gb+)disk free space.
> 
> 
> starce command:
> 
> In the

Re: Links to classes missing for BMW

2020-10-12 Thread Uwe Schindler
There's not much new documentation; it works behind the scenes, except that 
IndexSearcher.search and the TopDocs class no longer return an absolute count for 
totalHits and instead return this class: 
https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/search/TotalHits.html
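
For example, a minimal sketch of reading the possibly lower-bounded hit count in Lucene 8 (query and searcher are hypothetical):

TopDocs topDocs = searcher.search(query, 10);
long count = topDocs.totalHits.value;                // exact count or a lower bound
TotalHits.Relation rel = topDocs.totalHits.relation; // EQUAL_TO or GREATER_THAN_OR_EQUAL_TO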

Uwe

Am October 12, 2020 4:22:43 PM UTC schrieb baris.ka...@oracle.com:
>Hi Uwe,-
>
>  Could You please point me to the class documentation please?
>
>Best regards
>
>
>On 10/12/20 12:16 PM, Uwe Schindler wrote:
>> BMW support is in Lucene since version 8.0.
>>
>> Uwe
>>
>> Am October 12, 2020 4:08:42 PM UTC schrieb baris.ka...@oracle.com:
>>
>> Hi,-
>>
>>    Is BMW (Block Max Wand) support only for Solr?
>>
>> https://lucene.apache.org/solr/guide/8_6/solr-upgrade-notes.html
>>
>> This pages says "also" so it implies support for Lucene, too,
>right?
>>
>> Best regards
>>
>
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>> --
>> Uwe Schindler
>> Achterdiek 19, 28357 Bremen
>> https://www.thetaphi.de
>

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

Re: Links to classes missing for BMW

2020-10-12 Thread Uwe Schindler
BMW support is in Lucene since version 8.0.

Uwe

Am October 12, 2020 4:08:42 PM UTC schrieb baris.ka...@oracle.com:
>Hi,-
>
>  Is BMW (Block Max Wand) support only for Solr?
>
>https://lucene.apache.org/solr/guide/8_6/solr-upgrade-notes.html
>
>This pages says "also" so it implies support for Lucene, too, right?
>
>Best regards
>
>
>
>-
>To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>For additional commands, e-mail: java-user-h...@lucene.apache.org

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

Re: Fuzzy Search Scoring Adjustment

2020-09-23 Thread Uwe Schindler
You can create a different RewriteMethod for MultiTermQueries (see the default 
used by Fuzzy query). This one is used to convert the FuzzyQuery on rewrite to 
a BooleanQuery. To achieve what you want, just create a subclass of 
RewriteMethod that uses a DisjunctionMaxQuery instead of a BooleanQuery to 
collect the clauses:

Subclass this abstract one: 
https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/search/TopTermsRewrite.html

...and set it as the rewrite method on the FuzzyQuery. Use one of the already existing 
subclasses as an example and adapt it to DisjunctionMaxQuery.

Uwe

Am September 23, 2020 5:58:29 PM UTC schrieb "Eastlack, Kainoa" 
:
>When performing a fuzzy search inside a BooleanQuery, it looks like the
>default behavior is to score all fuzzy matches separately and then sum
>them
>up to get an aggregate score. However, I need it to instead score based
>on
>the maximum of each distinct match it might find, rather than the sum
>of
>them, to avoid overly inflated scores in some circumstances.
>
>For example, consider a query for "Bstn~2" and four documents
>containing
>"Boston", "Basin", "Boston Basin", and "Boston Boston Basin". The query
>might respectively score them as 1, 1, 2, and 3 (or something like
>that,
>depending on the scorer used, of course). However, I need it to instead
>score them as 1, 1, 1, and 2, since that's the count of just the most
>frequent unique fuzzy match in each document.
>
>Ideally I'd like to use a built in mechanism for achieving this, but if
>it's not available, a way to extend the BooleanQuery, BooleanWeight,
>and/or
>BooleanScorer classes to have slightly different scoring logic but
>otherwise function exactly the same would also work, but all of those
>are
>either final classes or have no public constructor, effectively making
>it
>impossible to reuse their logic directly, as near as I can tell.
>
>If anyone has any ideas of how to approach this, it would be very
>helpful.
>
>Thanks,
>Kainoa

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

IntervalQuery replacement for SpanFirstQuery? Closest replacement for slops?

2020-09-21 Thread Uwe Schindler
Hi all, hi Alan,

I am currently rewriting some SpanQuery code to use IntervalQuery. Most of the 
transformations can be done quite easily and it is also better to read after 
transformation. What I am missing a bit is some document to compare the 
different query types and a guide how to convert those.

I did not find a replacement for SpanFirstQuery (or at least any query stat 
takes absolute positions). I know intervals more deal with term intervals, but 
I was successful in replacing a SpanFirstQuery with this:
IntervalsSource term = Intervals.term("foo");
IntervalsSource filtered = new FilteredIntervalsSource("FIRST" + distance, term) {
  @Override
  protected boolean accept(IntervalIterator it) {
    return it.end() < distance; // or should this be <= distance???
  }
};
Query query = new IntervalQuery(field, filtered);

I am not fully sure if this works under all circumstances. To me it looks 
fine and it also worked with more complex intervals than "term". If this is OK, 
how about adding a "first(int n, IntervalsSource iv)" method to the Intervals class?

The second question: What's the "closest" replacement for a PhraseQuery with 
slop? Should I use maxwidth(slop + 1), maxgaps(slop - 1), or maxgaps(slop)? I 
know SpanQuery slops cannot be fully replaced with intervals, but I don't care 
about those SpanQuery bugs.

Uwe
 
-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: [VOTE] Lucene logo contest, third time's a charm

2020-09-06 Thread Uwe Schindler
Hi,

 

My votes (binding): A1, D

 

Reason: I want to keep the original Lucene colors, so A1 is the only 
alternative. I still really like the old one, if only it were better vectorized, 
so my second choice is D.

 

Uwe

 

-

Uwe Schindler

Achterdiek 19, D-28357 Bremen

https://www.thetaphi.de

eMail: u...@thetaphi.de

 

From: Ryan Ernst  
Sent: Tuesday, September 1, 2020 10:21 PM
To: java-user@lucene.apache.org; d...@lucene.apache.org
Subject: [VOTE] Lucene logo contest, third time's a charm

 

Dear Lucene and Solr developers!

 

Sorry for the multiple threads. This should be the last one.

 

In February a contest was started to design a new logo for Lucene [jira-issue]. 
The initial attempt [first-vote] to call a vote resulted in some confusion on 
the rules, as well the request for one additional submission. The second 
attempt [second-vote] yesterday had incorrect links for one of the submissions. 
I would like to call a new vote, now with more explicit instructions on how to 
vote, and corrected links.

 

Please read the following rules carefully before submitting your vote.

 

Who can vote?

 

Anyone is welcome to cast a vote in support of their favorite submission(s). 
Note that only PMC member's votes are binding. If you are a PMC member, please 
indicate with your vote that the vote is binding, to ease collection of votes. 
In tallying the votes, I will attempt to verify only those marked as binding.

 

How do I vote?

Votes can be cast simply by replying to this email. It is a ranked-choice vote 
[rank-choice-voting]. Multiple selections may be made, where the order of 
preference must be specified. If an entry gets more than half the votes, it is 
the winner. Otherwise, the entry with the lowest number of votes is removed, 
and the votes are retallied, taking into account the next preferred entry for 
those whose first entry was removed. This process repeats until there is a 
winner.

 

The entries are broken up by variants, since some entries have multiple color 
or style variations. The entry identifiers are first a capital letter, followed 
by a variation id (described with each entry below), if applicable. As an 
example, if you prefer variant 1 of entry A, followed by variant 2 of entry A, 
variant 3 of entry C, entry D, and lastly variant 4e of entry B, the following 
should be in your reply:

 

(binding)

vote: A1, A2, C3, D, B4e

 

Entries

 

The entries are as follows:

 

A. Submitted by Dustin Haver. This entry has two variants, A1 and A2.

 

[A1] 
https://issues.apache.org/jira/secure/attachment/12999548/Screen%20Shot%202020-04-10%20at%208.29.32%20AM.png
[A2] https://issues.apache.org/jira/secure/attachment/12997172/LuceneLogo.png

 

B. Submitted by Stamatis Zampetakis. This has several variants. Within the 
linked entry there are 7 patterns and 7 color palettes. Any vote for B should 
contain the pattern number followed by the lowercase letter of the color 
palette. For example, B3e or B1a.

 

[B] https://issues.apache.org/jira/secure/attachment/12997768/zabetak-1-7.pdf

 

C. Submitted by Baris Kazar. This entry has 8 variants.

 

[C1] 
https://issues.apache.org/jira/secure/attachment/13006392/lucene_logo1_full.pdf
[C2] 
https://issues.apache.org/jira/secure/attachment/13006393/lucene_logo2_full.pdf
[C3] 
https://issues.apache.org/jira/secure/attachment/13006394/lucene_logo3_full.pdf
[C4] 
https://issues.apache.org/jira/secure/attachment/13006395/lucene_logo4_full.pdf
[C5] 
https://issues.apache.org/jira/secure/attachment/13006396/lucene_logo5_full.pdf
[C6] 
https://issues.apache.org/jira/secure/attachment/13006397/lucene_logo6_full.pdf
[C7] 
https://issues.apache.org/jira/secure/attachment/13006398/lucene_logo7_full.pdf
[C8] 
https://issues.apache.org/jira/secure/attachment/13006399/lucene_logo8_full.pdf

 

D. The current Lucene logo.

 

[D] https://lucene.apache.org/theme/images/lucene/lucene_logo_green_300.png

 

Please vote for one of the above choices. This vote will close about one week 
from today, Mon, Sept 7, 2020 at 11:59PM.

 

Thanks!

 

[jira-issue] https://issues.apache.org/jira/browse/LUCENE-9221
[first-vote] 
http://mail-archives.apache.org/mod_mbox/lucene-dev/202006.mbox/%3cCA+DiXd74Mz4H6o9SmUNLUuHQc6Q1-9mzUR7xfxR03ntGwo=d...@mail.gmail.com%3e

[second-vote] 
http://mail-archives.apache.org/mod_mbox/lucene-dev/202009.mbox/%3cCA+DiXd7eBrQu5+aJQ3jKaUtUTJUqaG2U6o+kUZfNe-m=smn...@mail.gmail.com%3e
[rank-choice-voting] https://en.wikipedia.org/wiki/Instant-runoff_voting



Re: Tessellate exception in Elasticsearch

2020-06-04 Thread Uwe Schindler
Ah sorry, I misunderstood your schema and input document.

The CRS in your json is not used by ES, it just assumes WGS84. So the polygon 
is plain invalid, as if you calculate it modulo 360 degrees, the polygon is 
just wrong and has many overlapping lines.

I was thinking you want to test the new CRS features in recent ES development. 
My fault.

Uwe

Am June 4, 2020 1:40:51 PM UTC schrieb Uwe Schindler :
>Hi,
>
>Yes. With different projections there is one issue: Elasticsearch only
>converts the polygon points to wgs84. But depending on the projection,
>the lines between the points may have a different shape in reality (no
>longer lines, but maybe curves), but as only the line endpoints are
>converted they get straight, which may cause a shape of the polygon,
>where in wgs84 projection some lines may overlap other lines.
>
>The projection converters are nice in Elasticsearch, but far from
>useable for real use cases. Because the earth is not rectangular and
>flat. 藍
>
>Uwe
>
>Am June 4, 2020 1:24:28 PM UTC schrieb Claeys Wouter
>:
>>Thanks for the help! This isn't very clear in the Elasticsearch docs.
>>Upon converting to WGS-84 everything seems to index fine.
>>
>>
>>Van: Ignacio Vera 
>>Verzonden: donderdag 4 juni 2020 14:01
>>Aan: java-user@lucene.apache.org 
>>Onderwerp: Re: Tessellate exception in Elasticsearch
>>
>>I think this is not a lucene issue. Elasticsearch geo_shape only
>>supports
>>(and it assumes) polygons on the WGS-84 reference system.
>>
>>On Thu, Jun 4, 2020 at 1:38 PM Claeys Wouter
>>
>>wrote:
>>
>>> Hi,
>>>
>>> This is the original polygon:
>>>
>>> {
>>>"crs":{
>>>   "type":"name",
>>>   "properties":{
>>>  "name":"urn:ogc:def:crs:EPSG::31370"
>>>   }
>>>},
>>>"type":"MultiPolygon",
>>>"coordinates":[
>>>   [
>>>  [
>>> [
>>>171044.231002,
>>>175818.094268
>>> ],
>>> [
>>>170996.799514,
>>>175850.678652
>>> ],
>>> [
>>>170957.441562,
>>>175877.716668
>>> ],
>>> [
>>>170946.243418,
>>>175861.052668
>>> ],
>>> [
>>>170935.531674,
>>>175845.112572
>>> ],
>>> [
>>>170923.57865,
>>>175827.325308
>>> ],
>>> [
>>>170906.675354,
>>>175802.171388
>>> ],
>>> [
>>>170886.642266,
>>>175772.360124
>>> ],
>>> [
>>>170886.478554,
>>>175772.116476
>>> ],
>>> [
>>>170951.311002,
>>>175727.607548
>>> ],
>>> [
>>>171026.378266,
>>>175676.072188
>>> ],
>>> [
>>>171098.875162,
>>>175780.555004
>>> ],
>>> [
>>>171090.729754,
>>>175786.150716
>>> ],
>>> [
>>>171044.231002,
>>>175818.094268
>>> ]
>>>  ]
>>>   ]
>>>]
>>> }
>>>
>>> Thanks!
>>>
>>> 
>>> Van: Ignacio Vera Sequeiros 
>>> Verzonden: donderdag 4 juni 2020 12:24
>>> Aan: java-user@lucene.apache.org 
>>> Onderwerp: Re: Tessellate exception in Elasticsearch
>>>
>>> Hi,
>>>
>>> I think your polygon has intersecting edges but it is difficult to
>>> reproduce with that output. Could you provide the original polygon
>>you are
>>> trying to index?
>>>
>>> Thanks!
>>>
>>> On Thu, Jun 4, 2020 at 11:30 AM Claeys Wouter
>>>> >
>>> wrote:
>>>
>>> > H

Re: Tessellate exception in Elasticsearch

2020-06-04 Thread Uwe Schindler
79.4449960563,
>> > 98.8751621119] [73.8492839626, 90.7297539996]
>> > [41.90573200001381, 44.2310018589] [9.3213479767,
>> > -3.20048586995] ]. Possible malformed shape detected.
>> > at
>> > org.apache.lucene.geo.Tessellator.tessellate(Tessellator.java:114)
>> > ~[lucene-sandbox-7.7.3.jar:7.7.3
>> 1a0d2a901dfec93676b0fe8be425101ceb754b85 -
>> > noble - 2020-04-21 10:31:55]
>> > at
>> >
>>
>org.apache.lucene.document.LatLonShape.createIndexableFields(LatLonShape.java:73)
>> > ~[lucene-sandbox-7.7.3.jar:7.7.3
>> 1a0d2a901dfec93676b0fe8be425101ceb754b85 -
>> > noble - 2020-04-21 10:31:55]
>> > at
>> >
>>
>org.elasticsearch.index.mapper.GeoShapeFieldMapper.indexShape(GeoShapeFieldMapper.java:146)
>> > ~[elasticsearch-6.8.9.jar:6.8.9]
>> >
>> > This is a very basic geometry. Could someone please explain why
>this
>> shape
>> > is invalid?
>> >
>> >
>> >
>> >
>> > Thanks in advance,
>> >
>> > Wouter Claeys
>> >
>>

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

RE: Need suggetion in replacing forcemerge(1) with alternative which consumes less space

2020-04-14 Thread Uwe Schindler
Hi,

From what you are describing it is not clear what you are seeing. Asking the 
question about "forceMerge(1)" seems like an XY problem 
(https://en.wikipedia.org/wiki/XY_problem).

(1) forceMerge(1) should never be used, except in some very special 
circumstances (like indexes that are read-only and never updated again). If 
you forceMerge an index, its "internal structure" gets corrupted and later 
merging never works again like it should. This requires you to forceMerge it 
over and over.

(2) forceMerge does not solve the problem you are asking for! What you see 
might just be a side effect of something else!

(3) you say: 

> Lucene Document is getting corrupted. (data is not getting updated correctly.
> Merging of different row data).

This looks like an issue in your code. Be sure to create new Documents and pass 
them to the IndexWriter. Documents may be indexed asynchronously (depending on how 
you set up everything), so it looks like you are changing already created/existing 
documents while they are being indexed.

> 2. when we are trying to updateDocument method for single record. It is not
> reflecting in IndexReader until the count is 8.  Once the count exceeds, than
> records are visible for IndexReader. (creating 8 segment files.) is there any
> alternative for reducing these segment file creation.

Segments are perfectly fine and required to make incremental updates work 
correctly. What you say about "up to 8" does not make sense. Lucene has no 
mechanism that makes visibility dependent on the number of segments. The issue 
you are seeing is more related to wrong usage of the real-time readers. 
IndexReaders are point-in-time snapshots. When you call getReader on the writer you 
get a reader that does not change anymore (point-in-time snapshot). To get the 
updates, you have to open a new reader. There is SearcherManager to help with 
that. It manages a pool of searchers/IndexReaders and takes care of 
reopening them if the underlying index data changes.
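
For example, a minimal sketch of near-real-time search with SearcherManager (writer is your existing IndexWriter, query is hypothetical):

SearcherManager manager = new SearcherManager(writer, new SearcherFactory());

// after indexing a batch of updates, make them visible:
manager.maybeRefresh();

IndexSearcher searcher = manager.acquire();
try {
  TopDocs hits = searcher.search(query, 10);
} finally {
  manager.release(searcher);
}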

> 3. above two issues are resolved by forcemerge(1). But it is not feasible for 
> our
> use case , because it takes 3X memory. We are creating indexes for huge data.

Don't use forceMerge, especially not to work around some issue that comes from 
wrong multi-threading code and basic misunderstanding on IndexReaders and their 
relationship to IndexWriters.

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Jyothsna Bavisetti 
> Sent: Tuesday, April 14, 2020 7:56 AM
> To: java-user@lucene.apache.org
> Subject: Need suggetion in replacing forcemerge(1) with alternative which
> consumes less space
> 
> Hi,
> 
> 
> 
> 1.We Upgraded Lucene 4.6 to 8+, After upgrading we are facing issue with
> Lucene Index Creation.
> 
> We are indexing in Multi-threading environment. When we create bulk indexes
> , Lucene Document is getting corrupted. (data is not getting updated 
> correctly.
> Merging of different row data).
> 
> 2. when we are trying to updateDocument method for single record. It is not
> reflecting in IndexReader until the count is 8.  Once the count exceeds, than
> records are visible for IndexReader. (creating 8 segment files.) is there any
> alternative for reducing these segment file creation.
> 
> 3. above two issues are resolved by forcemerge(1). But it is not feasible for 
> our
> use case , because it takes 3X memory. We are creating indexes for huge data.
> 
> 
> 
> 4. IndexWriter Config:
> analyzer=com.datanomic.director.casemanagement.indexing.AnalyzerFactory$
> MA
> 
> ramBufferSizeMB=64.0
> 
> maxBufferedDocs=-1
> 
> mergedSegmentWarmer=null
> 
> delPolicy=com.datanomic.director.casemanagement.indexing.engines.TimedDel
> etionPolicy
> 
> commit=null
> 
> openMode=CREATE_OR_APPEND
> 
> similarity=org.apache.lucene.search.similarities.BM25Similarity
> 
> mergeScheduler=ConcurrentMergeScheduler: maxThreadCount=-1,
> maxMergeCount=-1, ioThrottle=true
> 
> codec=Lucene80
> 
> infoStream=org.apache.lucene.util.InfoStream$NoOutput
> 
> mergePolicy=[TieredMergePolicy: maxMergeAtOnce=10,
> maxMergeAtOnceExplicit=30, maxMergedSegmentMB=5120.0,
> floorSegmentMB=2.0, forceMergeDeletesPctAllowed=10.0,
> segmentsPerTier=10.0, maxCFSSegmentSizeMB=8.796093022207999E12,
> noCFSRatio=0.1, deletesPctAllowed=33.0
> 
> indexerThreadPool=org.apache.lucene.index.DocumentsWriterPerThreadPool@
> 24348e05
> 
> readerPooling=true
> 
> perThreadHardLimitMB=1945
> 
> useCompoundFile=false
> 
> commitOnClose=true
> 
> indexSort=null
> 
> checkPendingFlushOnUpdate=true
> 
> softDeletesField=null
> 
> readerAttributes={}
> 
> writer=org.apache.lucene.index.IndexWriter@23a84a99
> 
> 

Re: Lucene 8 early termination

2020-01-23 Thread Uwe Schindler
Hi,

There is no support for it when calculating facets, because the counts can't be 
optimized with WAND or block-max.

The general recommendation is to execute facets/aggregations in separate 
Elasticsearch or Solr requests (e.g. using AJAX on your website). The display 
of search results would be instant, with facets coming later. Doing that in the 
same request or separately does not really matter for performance. So I'd 
always recommend doing it separately if your user interface allows it.

Uwe

Am January 23, 2020 6:13:29 PM UTC schrieb Wei :
>Hi,
>
>I am excited to see Lucene 8 introduced BlockMax WAND as a major speed
>improvement https://issues.apache.org/jira/browse/LUCENE-8135.  My
>question
>is, how does it integrate with facet request,  when the numFound won't
>be
>exact? I did some search but haven't found any documentation on this.
>Any
>pointer is greatly appreciated.
>
>Best,
>Wei

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

Re: Disk Free decrease in a directory containing only live lucene indexes

2020-01-21 Thread Uwe Schindler
Hi,

That's easy to explain: while indexing, Lucene constantly creates new files (new 
segments). Those segments are merged from time to time into larger segments. If 
you have an IndexReader open at the same time for searching while indexing, it 
will see a specific snapshot (point in time) until it is reopened to see the latest 
updates.

At the same time, the IndexWriter merges segments and deletes old segments that were 
merged. The IndexReader opened in parallel still sees an old state of the index, so 
it keeps its files open, including the older segments. Unix has "delete on last 
close" semantics, so disk space is only freed once the last user of a file has 
closed it. Deleting a file just removes the directory entry (the one that "du" 
looks at), but the inode (allocated disk space) is freed later (this is what 
"df" sees).

Uwe

Am January 21, 2020 2:17:33 PM UTC schrieb Riccardo Tasso 
:
>Hi,
> I'm running a lucene based application on a linux system.
>
>The application writes and read many lucene indexes under the same
>directory, which doesn't contain other data.
>
>We are monitoring the indexes directory and we noticed that the disk
>usage
>as calculated by the df util grows more rapidly than that calculated by
>the
>du util.
>
>When we terminate the application the disk usage calculated with the
>two
>utils is the same and it is the one calculated with du when the
>application
>is running.
>
>Can you figure out which is the reason?
>
>Thanks,
> Riccardo

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

Re: Quest about Lucene's IndexSearcher.search(Query query, int n) API's parameter n

2020-01-09 Thread Uwe Schindler
You can sort with custom formulas. All values that are needed for calculation 
must be part of the index as docvalues fields. You can then use expressions 
module to supply a formula for the calculation, which may include the original 
score. The expressions module can override the score (so standard sorting 
works) or provide a SortField.

https://lucene.apache.org/core/8_4_0/expressions/org/apache/lucene/expressions/Expression.html

It is only a bad idea to do this if the calculation is expensive, as it needs 
to be done for every possible hit. One optimization is therefore to do a simple 
calculation using expressions, which brings all documents into an approximate order, 
so manually sorting only the top-n is OK.
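
For example, a minimal sketch (assuming the index has double doc values fields "latitude" and "longitude", a hypothetical starting point, and "query" being your keyword/range query):

Expression distance = JavascriptCompiler.compile("haversin(40.7128, -74.0060, latitude, longitude)");
SimpleBindings bindings = new SimpleBindings();
bindings.add("latitude", DoubleValuesSource.fromDoubleField("latitude"));
bindings.add("longitude", DoubleValuesSource.fromDoubleField("longitude"));
Sort byDistance = new Sort(distance.getSortField(bindings, false)); // ascending distance
TopDocs hits = searcher.search(query, 10, byDistance);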

Uwe

Am January 10, 2020 4:39:58 AM UTC schrieb "小鱼儿" :
>I'm doing a POI(Point-of-interest) search using lucene, each POI has a
>"location" which is a GeoPoint/LonLat type. I need do a keyword-range
>search but the query result POIs need to sort by distance to a starting
>point.
>
>This "distance", in fact, is a dynamic computed property which cannot
>be
>used by the SortField API, i doubt if Lucene can support a
>"DynamicSortField", that would be perfect. Or i had to do:
>use IndexSearcher.search(Query query, int n) API to first filter out
>Top-n
>POIs and then do a manual sort after these n documents' StoredField's
>have
>all be loaded, which seems not efficient.
>
>The problem is, the parameter n in IndexSearcher.search API has a
>usability
>problem, it may be not large enough to cover all the candidates. & the
>low-level search(Query, Collector) API seems to be short of
>documentations.
>If set the n to a very large value, the later sort proc may be very
>inefficient...
>
>My current idea: use more detailed near-to-far sub geo ranges to
>iteratively/incrementally search/filter -> load documents -> manual
>sort ->
>combine.
>
>Any suggestions?

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

RE: Use custom score in ConstantScoreQuery

2019-12-09 Thread Uwe Schindler
Hi,

Just add a BoostQuery with a boost factor of 0.5 around the 
ConstantScoreQuery. It's just one more line in your code. I don't understand 
why we would need separate query classes for this.
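
For example, a minimal sketch (innerQuery is a hypothetical query):

// every matching document gets a constant score of 0.5
Query halfScore = new BoostQuery(new ConstantScoreQuery(innerQuery), 0.5f);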

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Stamatis Zampetakis 
> Sent: Monday, December 9, 2019 10:42 AM
> To: java-user@lucene.apache.org
> Subject: Re: Use custom score in ConstantScoreQuery
> 
> Thanks for you reply Adrien!
> Can you clarify what is the second way?
> At the moment I haven't found a way (apart from creating my own Query
> classes) to say that a query will always return a score of 0.5 for each
> document.
> 
> On Mon, Dec 9, 2019 at 8:16 AM Adrien Grand  wrote:
> 
> > Hi Stamatis,
> >
> > I personally like the current way things work. If we added the ability
> > to set a custom score on ConstantScoreQuery, then we'd end up with two
> > ways to do the same thing, which I like to avoid whenever possible.
> >
> > On Sun, Dec 8, 2019 at 10:07 PM Stamatis Zampetakis
> 
> > wrote:
> > >
> > > Small reminder. Any input on this?
> > >
> > > Thanks,
> > > Stamatis
> > >
> > > On Mon, Dec 2, 2019 at 12:10 PM Stamatis Zampetakis
> 
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > Currently ConstantScoreQuery [1] returns a constant score equal to 1
> > for
> > > > every document that matches the query.
> > > >
> > > > I would like to use the ConstantScoreQuery but with a different score
> > > > value that I can pass explicitly (via the constructor for instance).
> > > >
> > > > This change may also benefit some other parts of Lucene where a
> > > > ConstantScoreQuery is wrapped around a BoostQuery simply for
> returning
> > a
> > > > score of zero [2][3].
> > > >
> > > > Does this change make sense? Shall I create a JIRA for it?
> > > >
> > > > Best,
> > > > Stamatis
> > > >
> > > > [1]
> > > >
> > https://github.com/apache/lucene-
> solr/blob/master/lucene/core/src/java/org/apache/lucene/search/Constant
> ScoreQuery.java
> > > > [2]
> > > >
> > https://github.com/apache/lucene-
> solr/blob/1d238c844e45f088a942aec14750c186c7a66d92/lucene/core/src/ja
> va/org/apache/lucene/search/BooleanQuery.java#L253
> > > > [3]
> > > >
> > https://github.com/apache/lucene-
> solr/blob/1d238c844e45f088a942aec14750c186c7a66d92/lucene/core/src/ja
> va/org/apache/lucene/search/BoostQuery.java#L97
> > > >
> >
> >
> >
> > --
> > Adrien
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Index-time boosting: Deprecated setBoost method

2019-10-21 Thread Uwe Schindler
No. That's how you do it: a BooleanQuery with two SHOULD clauses.

Or use a different query parser that offers this out of the box.
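
For example, a minimal sketch (queryOnField1/queryOnField2 are hypothetical sub-queries, e.g. the parsed queries from your snippet below):

Query boosted = new BooleanQuery.Builder()
    .add(new BoostQuery(queryOnField1, 2.0f), BooleanClause.Occur.SHOULD)
    .add(new BoostQuery(queryOnField2, 1.0f), BooleanClause.Occur.SHOULD)
    .build();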

Uwe

Am October 21, 2019 7:16:01 PM UTC schrieb baris.ka...@oracle.com:
>Hi,-
>
>Thanks.
>
>  lets apply to this case:
>
>QueryParser parser = new QueryParser("field1", analyzer) ;
>parser.setPhraseSlop(2);
>Query query = parser.parse("some string value here"+"*");
>TopDocs hits = indexsearcherObject.search(query, 10);
>
>Now i want to use BoostQuery
>
>QueryParser parser = new QueryParser("field1", analyzerObject) ;
>parser.setPhraseSlop(2);
>Query query = parser.parse("some string value here"+"*");
>
>BoostQuery bq = new BoostQuery(query, "2.0f");
>
>TopDocs hits = indexsearcherObject.search(bq, 10);
>
>
>Now how will i process field2 with boost value 1.0f?
>
>Before, this was being done at index time.
>
>
>i can see the only way here is the BooleanQuery which combines
>
>the first boostquery object bq and another one that i need to define
>for 
>bq2 for field2.
>
>is there any other way?
>
>Best regards
>
>
>
>On 10/21/19 2:33 PM, Uwe Schindler wrote:
>> Hi Boris,
>>
>>> That is ok, and i can see this case would be best with BoostQuery
>and
>>> also i dont have to use lucene expression jar and its dependents.
>>>
>>> However, i am curious how to do this kind of field based boosting at
>>> index time even though i will prefer the query time boosting
>methodology.
>> The reason why it was deprecated is exactly the problem I mentioned
>before: It did never do what the user expected. The boost factor given
>in the document's field was multiplied into the per document norms.
>Unfortunately, at the same time, he query normalization was using query
>statistics and normalized the scores. As Lucene is working per field,
>the same normalization is done per field, resulting in the constant
>factor per field to disappear. There was still some effect of index
>time boosting if different documents had different values, but it your
>case all is the same. I am not sure how your queries worked before, but
>the constant boost factors per field at index time did definitely not
>have the effect you were thinking of. Since the earliest version of
>Lucene, boosting at query time was the way to go to have different
>weights per field.
>>
>> The new feature in Lucene is now that you can change the score per
>document using docvalues and apply that per document at query time.
>Previously this was also possible with Document/Field#setBoost, but the
>flexibility was missing (only multiplying and limited precision). In
>addition the normalization effects made the whole thing not reliable.
>>
>> Uwe
>>
>>> Best regards
>>>
>>>
>>> On 10/21/19 12:54 PM, Uwe Schindler wrote:
>>>> Hi,
>>>>
>>>> As I said, before that is a misuse of index-time boosting. In
>addition in
>>> previous versions it did not even work correctly, because of query
>>> normalization it was normalized away anyways. And on top, to change
>it
>>> your have to reindex.
>>>> What you intend to do is a typical use case for query time boosting
>with
>>> BoostQuery. That is explained in almost every book about search,
>like those
>>> about Solr or Elasticsearch.
>>>> Most query parsers also allow to also add boost factors for fields,
>e.g.
>>> SimpleQueryParser (for humans that need simple syntax without
>fields).
>>> There you give a list of fields and boost factors.
>>>> Uwe
>>>>
>>>> -
>>>> Uwe Schindler
>>>> Achterdiek 19, D-28357 Bremen
>>>> https://www.thetaphi.de
>>>> eMail: u...@thetaphi.de
>>>>
>>>>> -Original Message-
>>>>> From: baris.ka...@oracle.com 
>>>>> Sent: Monday, October 21, 2019 6:45 PM
>>>>> To: java-user@lucene.apache.org
>>>>> Cc: baris.kazar 
>>>>> Subject: Re: Index-time boosting: Deprecated setBoost method
>>>>>
>>>>> Hi,-
>>>>>
>>>>> Thanks and i appreciate the disccussion.
>>>>>
>>>>> Let me please  ask this way, i think i give too much info at

RE: Index-time boosting: Deprecated setBoost method

2019-10-21 Thread Uwe Schindler
Hi Boris,

> That is ok, and i can see this case would be best with BoostQuery and
> also i dont have to use lucene expression jar and its dependents.
> 
> However, i am curious how to do this kind of field based boosting at
> index time even though i will prefer the query time boosting methodology.

The reason why it was deprecated is exactly the problem I mentioned before: it 
never did what the user expected. The boost factor given in the document's 
field was multiplied into the per-document norms. Unfortunately, at the same 
time, the query normalization was using query statistics and normalized the 
scores. As Lucene works per field, the same normalization is done per 
field, causing the constant factor per field to disappear. There was still 
some effect of index-time boosting if different documents had different values, 
but in your case all values are the same. I am not sure how your queries worked before, 
but constant boost factors per field at index time definitely did not have 
the effect you were thinking of. Since the earliest versions of Lucene, boosting 
at query time has been the way to go to have different weights per field.

What is new in Lucene is that you can change the score per document 
using doc values and apply that per document at query time. Previously this was 
also possible with Document/Field#setBoost, but the flexibility was missing 
(only multiplication and limited precision). In addition, the normalization effects 
made the whole thing unreliable.
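
For example, a minimal sketch (the "boost" doc values field and originalQuery are hypothetical; FunctionScoreQuery is in the queries module):

// at index time, store a per-document boost factor:
doc.add(new DoubleDocValuesField("boost", 2.0));

// at query time, multiply the score of any query by that factor:
Query boosted = FunctionScoreQuery.boostByValue(originalQuery, DoubleValuesSource.fromDoubleField("boost"));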

Uwe

> Best regards
> 
> 
> On 10/21/19 12:54 PM, Uwe Schindler wrote:
> > Hi,
> >
> > As I said, before that is a misuse of index-time boosting. In addition in
> previous versions it did not even work correctly, because of query
> normalization it was normalized away anyways. And on top, to change it
> your have to reindex.
> >
> > What you intend to do is a typical use case for query time boosting with
> BoostQuery. That is explained in almost every book about search, like those
> about Solr or Elasticsearch.
> >
> > Most query parsers also allow to also add boost factors for fields, e.g.
> SimpleQueryParser (for humans that need simple syntax without fields).
> There you give a list of fields and boost factors.
> >
> > Uwe
> >
> > -
> > Uwe Schindler
> > Achterdiek 19, D-28357 Bremen
> > https://www.thetaphi.de
> > eMail: u...@thetaphi.de
> >
> >> -Original Message-
> >> From: baris.ka...@oracle.com 
> >> Sent: Monday, October 21, 2019 6:45 PM
> >> To: java-user@lucene.apache.org
> >> Cc: baris.kazar 
> >> Subject: Re: Index-time boosting: Deprecated setBoost method
> >>
> >> Hi,-
> >>
> >> Thanks and i appreciate the disccussion.
> >>
> >> Let me please  ask this way, i think i give too much info at one time:
> >>
> >> Currently i have this:
> >>
> >> 

Field  f1= new TextField("field1", "string1", Field.Store.YES);

> >>
> >> doc.add(f1); 
f1.setBoost(2.0f);


> >>
> >> Field f2 = new TextField("field2", "string2", Field.Store.YES);

> >>
> >> doc.add(f2);

> >>
> >> f2.setBoost(1.0f);


> >>
> >>
> >> But this fails with Lucene 7.7.2.
> >>
> >>
> >> Probably it is more efficient and more flexible to fix this by using
> >> BoostQuery.
> >>
> >> However, what could be the fix with index time boosting? the code in my
> >> previous post was trying to do that.
> >>
> >> Best regards
> >>
> >>
> >> On 10/21/19 12:34 PM, Uwe Schindler wrote:
> >>> Hi,
> >>>
> >>> sorry I don't fully understand what you intend to do? If the boost values
> >> per field are static and used with exactly same value for every document,
> it's
> >> not needed a index time. You can just boost the field on the query side
> (e.g.
> >> using BoostQuery). Boosting every document with the same static values
> is
> >> an anti-pattern, that's something better suited for the query side - as you
> are
> >> more flexible.
> >>> If you need a different boost value per document, you can save that
> boost
> >> value in the index per document using a docvalues field (this consumes
> extra
> >> space, of course). Then you need the Expres

RE: Index-time boosting: Deprecated setBoost method

2019-10-21 Thread Uwe Schindler
Hi,

As I said, before that is a misuse of index-time boosting. In addition in 
previous versions it did not even work correctly, because of query 
normalization it was normalized away anyways. And on top, to change it your 
have to reindex.

What you intend to do is a typical use case for query time boosting with 
BoostQuery. That is explained in almost every book about search, like those 
about Solr or Elasticsearch.

Most query parsers also allow to also add boost factors for fields, e.g. 
SimpleQueryParser (for humans that need simple syntax without fields). There 
you give a list of fields and boost factors.

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: baris.ka...@oracle.com 
> Sent: Monday, October 21, 2019 6:45 PM
> To: java-user@lucene.apache.org
> Cc: baris.kazar 
> Subject: Re: Index-time boosting: Deprecated setBoost method
> 
> Hi,-
> 
> Thanks and i appreciate the disccussion.
> 
> Let me please  ask this way, i think i give too much info at one time:
> 
> Currently i have this:
> 
> Field  f1= new TextField("field1", "string1", Field.Store.YES);
> doc.add(f1);
> f1.setBoost(2.0f);
> 
> Field f2 = new TextField("field2", "string2", Field.Store.YES);
> doc.add(f2);
> f2.setBoost(1.0f);
> 
> But this fails with Lucene 7.7.2.
> 
> 
> Probably it is more efficient and more flexible to fix this by using
> BoostQuery.
> 
> However, what could be the fix with index time boosting? the code in my
> previous post was trying to do that.
> 
> Best regards
> 
> 
> On 10/21/19 12:34 PM, Uwe Schindler wrote:
> > Hi,
> >
> > sorry I don't fully understand what you intend to do? If the boost values
> per field are static and used with exactly same value for every document, it's
> not needed a index time. You can just boost the field on the query side (e.g.
> using BoostQuery). Boosting every document with the same static values is
> an anti-pattern, that's something better suited for the query side - as you 
> are
> more flexible.
> >
> > If you need a different boost value per document, you can save that boost
> value in the index per document using a docvalues field (this consumes extra
> space, of course). Then you need the ExpressionQuery on the query side. But
> just because it looks like Javascript, it's not slow. The syntax is compiled 
> to
> bytecode and directly included into the query execution as a dynamic java
> class, so it's very fast.
> >
> > In short:
> > - If you need to have a different boost factor per field name that's 
> > constant
> for all documents, apply it at query time with BoostQuery.
> > - If you have to boost specific documents (e.g., top selling products), 
> > index
> a numeric docvalues field per document. On the query side you can use
> different query types to modify the score of each result based on the
> docvalues field. That can be done with Expression modules (using compiled
> Javascript) or by another query in Lucene that operates on ValueSource (e.g.,
> FunctionQuery). The first one is easier to use for complex formulas.4
> >
> > Uwe
> >
> > -
> > Uwe Schindler
> > Achterdiek 19, D-28357 Bremen
> > https://urldefense.proofpoint.com/v2/url?u=https-
> 3A__www.thetaphi.de=DwIFaQ=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIr
> MUB65eapI_JnE=nlG5z5NcNdIbQAiX-
> BKNeyLlULCbaezrgocEvPhQkl4=70RoM6loHhMGsp95phVzGQf8w5JxW7gX
> T5XnleMKrOs=td7cUfd22mXljSuvkUPXDunkIs_eO4GxdvHHxD2CTk0=
> > eMail: u...@thetaphi.de
> >
> >> -Original Message-
> >> From: baris.ka...@oracle.com 
> >> Sent: Monday, October 21, 2019 5:17 PM
> >> To: java-user@lucene.apache.org
> >> Cc: baris.kazar 
> >> Subject: Re: Index-time boosting: Deprecated setBoost method
> >>
> >> Hi,-
> >>
> >> Sorry about the missing parts in previous post. please accept my
> >> apologies for that.
> >>
> >> i needed to add a few more questions/corrections/additions to the
> >> previous post:
> >>
> >> Main Question was: if boost is a single constant value, do we need the
> >> Javascript part below?
> >>
> >>
> >>
> >> === Indexing code snippet for Lucene version 6.6.0 and before===
> >>
> >> Document doc = new Document();
> >>
> >>
> >> Field  f1= new TextField("field1", "string1", Field.Store.YES);
> >> doc.add(f1);
> >> f1.setBoost(2.0f);
> >>
> >> Field f2 = new TextField("fi

RE: Index-time boosting: Deprecated setBoost method

2019-10-21 Thread Uwe Schindler
Hi,

sorry, I don't fully understand what you intend to do. If the boost values per 
field are static and used with exactly the same value for every document, they 
are not needed at index time. You can just boost the field on the query side 
(e.g. using BoostQuery). Boosting every document with the same static values is 
an anti-pattern; that's something better handled on the query side, where you 
are more flexible.

If you need a different boost value per document, you can save that boost value 
in the index per document using a docvalues field (this consumes extra space, 
of course). Then you need an expression-based query on the query side. Just 
because it looks like Javascript does not mean it is slow: the expression is 
compiled to bytecode and included directly into the query execution as a 
dynamic Java class, so it's very fast.

In short:
- If you need to have a different boost factor per field name that's constant 
for all documents, apply it at query time with BoostQuery.
- If you have to boost specific documents (e.g., top selling products), index a 
numeric docvalues field per document. On the query side you can use different 
query types to modify the score of each result based on the docvalues field. 
That can be done with the expressions module (using compiled Javascript) or by 
another query in Lucene that operates on ValueSource (e.g., FunctionQuery). The 
first one is easier to use for complex formulas; see the sketch below.
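
For illustration, a minimal sketch of the docvalues + expressions approach 
(field names, the boost formula, and the existing writer/searcher are 
assumptions for the example, not part of the original message):

    // index time: org.apache.lucene.document.NumericDocValuesField
    Document doc = new Document();
    doc.add(new TextField("title", "some product name", Field.Store.YES));
    doc.add(new NumericDocValuesField("clicks", 42L));   // per-document scoring factor
    writer.addDocument(doc);

    // query time: org.apache.lucene.expressions.{Expression,SimpleBindings},
    // org.apache.lucene.expressions.js.JavascriptCompiler,
    // org.apache.lucene.queries.function.FunctionScoreQuery
    Expression expr = JavascriptCompiler.compile("_score * ln(2 + clicks)");
    SimpleBindings bindings = new SimpleBindings();
    bindings.add(new SortField("_score", SortField.Type.SCORE));
    bindings.add(new SortField("clicks", SortField.Type.LONG));
    Query boosted = new FunctionScoreQuery(
        new TermQuery(new Term("title", "product")),
        expr.getDoubleValuesSource(bindings));
    TopDocs hits = searcher.search(boosted, 10);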

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: baris.ka...@oracle.com 
> Sent: Monday, October 21, 2019 5:17 PM
> To: java-user@lucene.apache.org
> Cc: baris.kazar 
> Subject: Re: Index-time boosting: Deprecated setBoost method
> 
> Hi,-
> 
> Sorry about the missing parts in previous post. please accept my
> apologies for that.
> 
> i needed to add a few more questions/corrections/additions to the
> previous post:
> 
> Main Question was: if boost is a single constant value, do we need the
> Javascript part below?
> 
> 
> 
> === Indexing code snippet for Lucene version 6.6.0 and before===
> 
> Document doc = new Document();
> 
> Field  f1= new TextField("field1", "string1", Field.Store.YES);
> doc.add(f1);
> f1.setBoost(2.0f);
> 
> Field f2 = new TextField("field2", "string2", Field.Store.YES);
> doc.add(f2);
> f2.setBoost(1.0f);
> 
> === end of indexing code snippet for Lucene version 6.6.0 and before ===
> 
> 
> This turns into this where _boost1 field is associated with field1 and
> 
> _boost2 field is associated with field2 field:
> 
> 
> In Indexing code:
> 
> === begining of indexing code snippet ===
> Field  f1= new TextField("field1", "string1", Field.Store.YES);

> 
> Field _boost1 = new NumericDocValuesField(“field1”, 2L);
> doc.add(_boost1);
> 
> // If this boost value needs to be stored, a separate storedField
> instance needs to be added as well
> … ( i will post this soon)
> 
> Field _boost2 = new NumericDocValuesField(“field2”, 1L);
> doc.add(_boost2);
> 
> // If this boost value needs to be stored, a separate storedField
> instance needs to be added as well
> … ( i will post this soon)
> 
> === end of indexing code snippet ===
> 
> 
> Now, in the searching code (i.e., at query time) should i need the
> FunctionScoreQuery because in this case
> 
> the boost is just a constant value but not a function? However, constant
> value can be argued to be a function with the same value all the time, too.
> 
> 
> == begining of query time code snippet ===
> Expression expr = JavascriptCompiler.compile(“_boost1 + _boost2");
> 
> // SimpleBindings just maps variables to SortField instances
> SimpleBindings bindings = new SimpleBindings();
> bindings.add(new SortField("_boost1", SortField.Type.LONG));
> // These have to LONG type i think since NumericDocValuesField accepts "long"
> // type only, am i right? Can this be DOUBLE type?
> bindings.add(new SortField("_boost2", SortField.Type.LONG));
> // same question here
> 
> // create a query that matches based on body:contents but
> // scores using expr
> Query query = new FunctionScoreQuery(
>     new TermQuery(new Term("field1", "term_to_look_for")),
>     expr.getDoubleValuesSource(bindings));
> 
> searcher.search(query, 10);
> 
> === end of code snippet ===
> 
> 
> Best regards
> 
> 
> On 10/21/19 11:05 AM, baris.ka...@oracle.com wrote:
> > Hi,-
> >
> >  i would like to ask the following to make it clearer (for me at least):
> >
> > Document doc = new Document();
&

Re: Index-time boosting: Deprecated setBoost method

2019-10-18 Thread Uwe Schindler
Hi,

Read my original email! The index time values are written using 
NumericDocValuesField. The expressions docs also refer to that when the 
bindings are documented.

It's separate from the indexed data (TextField). Think of it like an additional 
numeric field in your database table with a factor in each row.

Uwe

Am October 18, 2019 7:14:03 PM UTC schrieb baris.ka...@oracle.com:
>Uwe,-
>
>Two questions there:
>
>i guess this is applicable to TextField, too.
>
>And i was expecting a index writer object in the example for index time
>
>boosting.
>
>Best regards
>
>
>On 10/18/19 2:57 PM, Uwe Schindler wrote:
>> Sorry I was imprecise. It's a mix of both. The factors are stored per
>document in index (this is why I called it index time). During query
>time the expression use the index time values to fold them into the
>query boost at query time.
>>
>> What's your problem with that approach?
>>
>> Uwe
>>
>> Am October 18, 2019 6:50:40 PM UTC schrieb baris.ka...@oracle.com:
>>> Uwe,-
>>>
>>>   Thanks, if possible i am looking for a pure Java methodology to do
>the
>>>
>>> index time boosting.
>>>
>>> This example looks like a search time boosting example:
>>>
>>>
>https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.apache.org_core_7-5F7-5F2_expressions_org_apache_lucene_expressions_Expression.html=DwIFaQ=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=nlG5z5NcNdIbQAiX-BKNeyLlULCbaezrgocEvPhQkl4=6m6i5zZXPZNP6DyVv_xG4vXnVTPEdfKLeLSvGjEXbyw=B5_kGwRIbAoGqL0-SVR9r3t78E5XUuzLT37TeyV-bv8=
>>>
>>>
>>>
>>> Best regards
>>>
>>> On 10/18/19 2:31 PM, Uwe Schindler wrote:
>>>> Hi,
>>>>
>>>>> Is there a working example for this? Is this mentioned in the
>Lucene
>>>>> Javadocs or any other docs so that i can look it?
>>>> To index the docvalues, see NumericDocValuesField (it can be added
>to
>>> documents like indexed or stored fields). You may have used them for
>>> sorting already.
>>>>> this methodology seems sort of like discouraging using index time
>>> boosting.
>>>> Not really. Many use this all the time. It's one of the killer
>>> features of both Solr and Elasticsearch. The problem was how the
>>> Document.setBoost()worked (it did not work correctly, see below).
>>>>> Previous setBoost method call was fine and easy to use.
>>>>> Did it have some performance issues and then is that why it was
>>> deprecated?
>>>> No the reason for deprecating this was for several reasons:
>setBoost
>>> was not doing what the user had expected. Internally the boost value
>>> was just multiplied into the document norm factor (which is
>internally
>>> also a docvalues field). The norm factors are only very inprecise
>>> floats stored in a byte, so precision is not well. If you put some
>>> values into it and the length norm was already consuming all bits,
>the
>>> boosting was very coarse. It was also only multiplied into and most
>>> users want to do some stuff like record click counts in the index
>and
>>> then boost for example with the logarithm or some other function. If
>>> the boost is just multiplied into the length norm you have no
>>> flexibility at all.
>>>> In addition you can have several docvalues fields and use their
>>> values in a function (e.g. one field with click count and another
>one
>>> with product price). After that you can combine click count and
>price
>>> (which can be modified indipenently during index updates) and change
>>> boost to boost lower price and higher click count up.
>>>> This is what you can do with the expressions module. You just give
>it
>>> a function.
>>>> Here is an example, the second example is using a
>FunctionScoreQuery
>>> that modifies the score based on the function and the given
>docvalues:
>>>
>https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.apache.org_core_7-5F7-5F2_expressions_org_apache_lucene_expressions_Expression.html=DwIFaQ=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=nlG5z5NcNdIbQAiX-BKNeyLlULCbaezrgocEvPhQkl4=6m6i5zZXPZNP6DyVv_xG4vXnVTPEdfKLeLSvGjEXbyw=B5_kGwRIbAoGqL0-SVR9r3t78E5XUuzLT37TeyV-bv8=
>>>>> FunctionScoreQuery usage with MultiFieldQueryParser would also be
>>> nice
>>>>> where
>>>>>
>>>>> MultiFieldQuery already has boosts field to do this in its
>>> constructor.
>>>> The boots in the query parser are applied for f

Re: Index-time boosting: Deprecated setBoost method

2019-10-18 Thread Uwe Schindler
Sorry, I was imprecise. It's a mix of both. The factors are stored per document 
in the index (this is why I called it index time). At query time the expression 
uses those index-time values to fold them into the query boost.

What's your problem with that approach?

Uwe

Am October 18, 2019 6:50:40 PM UTC schrieb baris.ka...@oracle.com:
>Uwe,-
>
> Thanks, if possible i am looking for a pure Java methodology to do the
>
>index time boosting.
>
>This example looks like a search time boosting example:
>
>https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.apache.org_core_7-5F7-5F2_expressions_org_apache_lucene_expressions_Expression.html=DwIFaQ=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=nlG5z5NcNdIbQAiX-BKNeyLlULCbaezrgocEvPhQkl4=6m6i5zZXPZNP6DyVv_xG4vXnVTPEdfKLeLSvGjEXbyw=B5_kGwRIbAoGqL0-SVR9r3t78E5XUuzLT37TeyV-bv8=
>
>
>
>Best regards
>
>On 10/18/19 2:31 PM, Uwe Schindler wrote:
>> Hi,
>>
>>> Is there a working example for this? Is this mentioned in the Lucene
>>> Javadocs or any other docs so that i can look it?
>> To index the docvalues, see NumericDocValuesField (it can be added to
>documents like indexed or stored fields). You may have used them for
>sorting already.
>>
>>> this methodology seems sort of like discouraging using index time
>boosting.
>> Not really. Many use this all the time. It's one of the killer
>features of both Solr and Elasticsearch. The problem was how the
>Document.setBoost()worked (it did not work correctly, see below).
>>
>>> Previous setBoost method call was fine and easy to use.
>>> Did it have some performance issues and then is that why it was
>deprecated?
>> No the reason for deprecating this was for several reasons: setBoost
>was not doing what the user had expected. Internally the boost value
>was just multiplied into the document norm factor (which is internally
>also a docvalues field). The norm factors are only very inprecise
>floats stored in a byte, so precision is not well. If you put some
>values into it and the length norm was already consuming all bits, the
>boosting was very coarse. It was also only multiplied into and most
>users want to do some stuff like record click counts in the index and
>then boost for example with the logarithm or some other function. If
>the boost is just multiplied into the length norm you have no
>flexibility at all.
>>
>> In addition you can have several docvalues fields and use their
>values in a function (e.g. one field with click count and another one
>with product price). After that you can combine click count and price
>(which can be modified indipenently during index updates) and change
>boost to boost lower price and higher click count up.
>>
>> This is what you can do with the expressions module. You just give it
>a function.
>>
>> Here is an example, the second example is using a FunctionScoreQuery
>that modifies the score based on the function and the given docvalues:
>>
>https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.apache.org_core_7-5F7-5F2_expressions_org_apache_lucene_expressions_Expression.html=DwIFaQ=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=nlG5z5NcNdIbQAiX-BKNeyLlULCbaezrgocEvPhQkl4=6m6i5zZXPZNP6DyVv_xG4vXnVTPEdfKLeLSvGjEXbyw=B5_kGwRIbAoGqL0-SVR9r3t78E5XUuzLT37TeyV-bv8=
>>
>>> FunctionScoreQuery usage with MultiFieldQueryParser would also be
>nice
>>> where
>>>
>>> MultiFieldQuery already has boosts field to do this in its
>constructor.
>> The boots in the query parser are applied for fields during query
>time (to have a different weight per field). Index time boosting is per
>document. So you can combine both.
>>
>>> Maybe it is not needed with MultiFieldQueryParser.
>> You use MultiFieldQueryParser to adjust weights of the fields (e.g.
>title versus body). The parsed query is then wrapped with an expression
>that modifies the score per document according to the docvalues.
>>
>> Uwe
>>
>>> On 10/18/19 1:28 PM, Uwe Schindler wrote:
>>>
>>>> Hi,
>>>>
>>>> that's not true. You can do index time boosting, but you need to do
>that
>>> using a separate field. You just index a numeric docvalues field
>(which may
>>> contain a long or float value per document). Later you wrap your
>query with
>>> some FunctionScoreQuery (e.g., use the Javascript function query
>syntax in
>>> the expressions module). This allows you to compile a javascript
>function
>>> that calculated the final score based on the score returned by the
>inner query
>>> and combines them with docvalues that were indexed per document.
>>>> Uwe
&

RE: Index-time boosting: Deprecated setBoost method

2019-10-18 Thread Uwe Schindler
Hi,

> Is there a working example for this? Is this mentioned in the Lucene
> Javadocs or any other docs so that i can look it?

To index the docvalues, see NumericDocValuesField (it can be added to documents 
like indexed or stored fields). You may have used them for sorting already.

> this methodology seems sort of like discouraging using index time boosting.

Not really. Many use this all the time. It's one of the killer features of both 
Solr and Elasticsearch. The problem was how Document.setBoost() worked (it did 
not work correctly, see below).

> Previous setBoost method call was fine and easy to use.
> Did it have some performance issues and then is that why it was deprecated?

No, it was deprecated for several reasons: setBoost was not doing what users 
expected. Internally the boost value was just multiplied into the document norm 
factor (which is internally also a docvalues field). The norm factors are only 
very imprecise floats stored in a single byte, so precision is poor. If you put 
some values into it and the length norm was already consuming all bits, the 
boosting became very coarse. It was also only multiplied in, and most users 
want to do things like record click counts in the index and then boost, for 
example, with the logarithm or some other function. If the boost is just 
multiplied into the length norm you have no flexibility at all.

In addition you can have several docvalues fields and use their values in a 
function (e.g. one field with click count and another one with product price). 
After that you can combine click count and price (which can be modified 
independently during index updates) and adjust the boost so that a lower price 
and a higher click count are boosted up.

This is what you can do with the expressions module. You just give it a 
function.

Here is an example, the second example is using a FunctionScoreQuery that 
modifies the score based on the function and the given docvalues:
https://lucene.apache.org/core/7_7_2/expressions/org/apache/lucene/expressions/Expression.html

> FunctionScoreQuery usage with MultiFieldQueryParser would also be nice
> where
> 
> MultiFieldQuery already has boosts field to do this in its constructor.

The boosts in the query parser are applied to fields at query time (to have a 
different weight per field). Index-time boosting is per document. So you can 
combine both.

> Maybe it is not needed with MultiFieldQueryParser.

You use MultiFieldQueryParser to adjust the weights of the fields (e.g. title 
versus body). The parsed query is then wrapped with an expression that modifies 
the score per document according to the docvalues; see the sketch below.
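
For illustration, a minimal sketch of per-field weights with 
MultiFieldQueryParser (field names, weights, and the analyzer are assumptions 
for the example, not part of the original message):

    // org.apache.lucene.queryparser.classic.MultiFieldQueryParser,
    // org.apache.lucene.analysis.standard.StandardAnalyzer, java.util.*
    Map<String, Float> boosts = new HashMap<>();
    boosts.put("title", 4.0f);
    boosts.put("body", 1.0f);
    MultiFieldQueryParser parser = new MultiFieldQueryParser(
        new String[] {"title", "body"}, new StandardAnalyzer(), boosts);
    Query query = parser.parse("click count boosting");   // throws ParseException
    // wrap 'query' with FunctionScoreQuery afterwards if per-document
    // docvalues factors should be applied as well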

Uwe

> On 10/18/19 1:28 PM, Uwe Schindler wrote:
> 
> > Hi,
> >
> > that's not true. You can do index time boosting, but you need to do that
> using a separate field. You just index a numeric docvalues field (which may
> contain a long or float value per document). Later you wrap your query with
> some FunctionScoreQuery (e.g., use the Javascript function query syntax in
> the expressions module). This allows you to compile a javascript function
> that calculated the final score based on the score returned by the inner query
> and combines them with docvalues that were indexed per document.
> >
> > Uwe
> >
> > -
> > Uwe Schindler
> > Achterdiek 19, D-28357 Bremen
> > https://urldefense.proofpoint.com/v2/url?u=https-
> 3A__www.thetaphi.de=DwIFaQ=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIr
> MUB65eapI_JnE=nlG5z5NcNdIbQAiX-
> BKNeyLlULCbaezrgocEvPhQkl4=6rVk8db2H8dAcjS3WCWmAPd08C7JQCvZ
> 8W80yE9L5xY=zgKmnmP9gLG4DlEnAfDdtBMEzPXtHNVYojxXIKEnQgs=
> > eMail: u...@thetaphi.de
> >
> >> -Original Message-
> >> From: baris.ka...@oracle.com 
> >> Sent: Friday, October 18, 2019 5:28 PM
> >> To: java-user@lucene.apache.org
> >> Cc: baris.ka...@oracle.com
> >> Subject: Re: Index-time boosting: Deprecated setBoost method
> >>
> >> It looks like index-time boosting (field) is not possible since Lucene
> >> version 7.7.2 and
> >>
> >> i was using before for another case the BoostQuery at search time for
> >> boosting and
> >>
> >> this seems to be the only boosting option now in Lucene.
> >>
> >> Best regards
> >>
> >>
> >> On 10/18/19 10:01 AM, baris.ka...@oracle.com wrote:
> >>> Hi,-
> >>>
> >>> i saw this in the Field class docs and i am figuring out the following
> >>> note in the docs:
> >>>
> >>> setBoost(float boost)
> >>> Deprecated.
> >>> Index-time boosts are deprecated, please index index-time scoring
> >>> factors into a doc value field and combine them with the score at
>

RE: Index-time boosting: Deprecated setBoost method

2019-10-18 Thread Uwe Schindler
Hi,

that's not true. You can do index-time boosting, but you need to do it using a 
separate field. You just index a numeric docvalues field (which may contain a 
long or float value per document). Later you wrap your query with a 
FunctionScoreQuery (e.g., use the Javascript function query syntax in the 
expressions module). This allows you to compile a Javascript function that 
calculates the final score based on the score returned by the inner query and 
combines it with the docvalues that were indexed per document.

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: baris.ka...@oracle.com 
> Sent: Friday, October 18, 2019 5:28 PM
> To: java-user@lucene.apache.org
> Cc: baris.ka...@oracle.com
> Subject: Re: Index-time boosting: Deprecated setBoost method
> 
> It looks like index-time boosting (field) is not possible since Lucene
> version 7.7.2 and
> 
> i was using before for another case the BoostQuery at search time for
> boosting and
> 
> this seems to be the only boosting option now in Lucene.
> 
> Best regards
> 
> 
> On 10/18/19 10:01 AM, baris.ka...@oracle.com wrote:
> > Hi,-
> >
> > i saw this in the Field class docs and i am figuring out the following
> > note in the docs:
> >
> > setBoost(float boost)
> > Deprecated.
> > Index-time boosts are deprecated, please index index-time scoring
> > factors into a doc value field and combine them with the score at
> > query time using eg. FunctionScoreQuery.
> >
> > I appreciate this note. Is there an example about this? I wish docs
> > would give a simple example to further help.
> >
> >
> https://lucene.apache.org/core/6_6_0//core/org/apache/lucene/document/
> Field.html
> >
> >
> > vs
> >
> >
> https://lucene.apache.org/core/7_7_2/core/org/apache/lucene/document/F
> ield.html
> >
> >
> > Best regards
> >
> >
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Split package in Lucene 8.2.0

2019-09-05 Thread Uwe Schindler
Hi,

this issue is known and cannot be solved by just patching Lucene; it affects 
the whole Lucene infrastructure. A change here would break almost any app out 
there, so it needs to be done in a major release.

In addition, as Lucene uses SPI / service loaders, automatic modules don't work 
anyway, so you have to pack all of Lucene into one module. The current 
recommendation is to do this at packaging time. Some time in the future we will 
modularize Lucene, but that's not a big priority yet. When this is done, the 
plugin/service loading mechanism will also need module-info.class files.

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Philippe Cadé 
> Sent: Thursday, September 5, 2019 2:11 PM
> To: java-user@lucene.apache.org
> Subject: Split package in Lucene 8.2.0
> 
> Dear all,
> 
> We are working on getting rid of all split packages in our product and our
> dependencies. We have included Lucene 8.2.0 and noticed that both the
> lucene-core-8.2.0.jar and lucene-analyzers-common-8.2.0.jar files use the
> org.apache.lucene.analysis.standard package. This is not allowed anymore
> since Java 9 and it will create trouble with a modularized application.
> 
> We are using the jsplitpkgscan project [1] to find those split packages.
> 
> I was going to open a JIRA issue for this but JIRA tells me that I need to
> ask on the mailing list first, so here you go.
> 
> Correcting this is most likely just moving classes in distinct packages
> which should not be too complex.
> 
> Philippe
> 
> [1] https://github.com/AdoptOpenJDK/jsplitpkgscan


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: AlphaNumeric analyzer/tokenizer

2019-08-19 Thread Uwe Schindler
You already got many responses. Check your inbox.

Uwe

Am August 19, 2019 6:23:20 AM UTC schrieb Abhishek Chauhan 
:
>Hi,
>
>Can someone please check the above mail and provide some feedback?
>
>Thanks and Regards,
>Abhishek
>
>On Fri, Aug 16, 2019 at 2:52 PM Abhishek Chauhan <
>abhishek.chauhan...@gmail.com> wrote:
>
>> Hi,
>>
>> We have been using SimpleAnalyzer which keeps only letters in its
>tokens.
>> This limits us to search in strings that contains both letters and
>numbers.
>> For e.g. "axt1234". SimpleAnalyzer would only enable us to search for
>"axt"
>> successfully, but search strings like "axt1", "axt123" etc would give
>no
>> results because while indexing it ignored the numbers.
>>
>> I can use StandardAnalyzer or WhitespaceAnalyzer but I want to
>tokenize on
>> underscores also
>> which these analyzers don't do. I have also looked at
>WordDelimiterFilter
>> which will split "axt1234" into "axt" and "1234". However, using this
>also,
>> I cannot search for "axt12" etc.
>>
>> Is there something like an Alphanumeric analyzer which would be very
>> similar to SimpleAnalzyer but in addition to letters it would also
>keep
>> digits in its tokens? I am willing contribute such an analyzer if one
>is
>> not available.
>>
>> Thanks and Regards,
>> Abhishek
>>
>>
>>

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

RE: AlphaNumeric analyzer/tokenizer

2019-08-16 Thread Uwe Schindler
Hi,

The easiest way is to use PatternTokenizer as part of your analyzer. It uses a 
regular expression to split words. Just use a regular expression based on the 
Unicode ranges for letters and digits.

To build your Analyzer, use the class CustomAnalyzer and its builder API to 
construct your own analysis chain. Use PatternTokenizerFactory as the tokenizer 
and add stuff like LowerCaseFilterFactory and you are done; see the sketch 
below. No need for any new components in Lucene. It's all there, RTFM 

https://lucene.apache.org/core/8_2_0/analyzers-common/org/apache/lucene/analysis/custom/CustomAnalyzer.html
https://lucene.apache.org/core/8_2_0/analyzers-common/org/apache/lucene/analysis/pattern/PatternTokenizerFactory.html
 (the example there is for Apache Solr, but you can use the same parameter 
names in CustomAnalyzer)

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Abhishek Chauhan 
> Sent: Friday, August 16, 2019 11:23 AM
> To: java-user@lucene.apache.org
> Subject: AlphaNumeric analyzer/tokenizer
> 
> Hi,
> 
> We have been using SimpleAnalyzer which keeps only letters in its tokens.
> This limits us to search in strings that contains both letters and numbers.
> For e.g. "axt1234". SimpleAnalyzer would only enable us to search for "axt"
> successfully, but search strings like "axt1", "axt123" etc would give no
> results because while indexing it ignored the numbers.
> 
> I can use StandardAnalyzer or WhitespaceAnalyzer but I want to tokenize on
> underscores also
> which these analyzers don't do. I have also looked at WordDelimiterFilter
> which will split "axt1234" into "axt" and "1234". However, using this also,
> I cannot search for "axt12" etc.
> 
> Is there something like an Alphanumeric analyzer which would be very
> similar to SimpleAnalzyer but in addition to letters it would also keep
> digits in its tokens? I am willing contribute such an analyzer if one is
> not available.
> 
> Thanks and Regards,
> Abhishek


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene 5.2.1 score for MUST_NOT query

2019-08-04 Thread Uwe Schindler
In short: as a MUST_NOT clause matches nothing itself, it cannot modify the 
score. Scores of documents that are not excluded are left unmodified, so it 
behaves like a zero contribution.

In short: it's the opposite of FILTER clauses.

Uwe

Am August 4, 2019 6:26:29 PM UTC schrieb Atri Sharma :
>MUST_NOT represents a clause which must not match against a document in
>order for it to be qualified as a hit (think of SQL’s NOT IN).
>
>MUST_NOT clauses are used as filters to eliminate candidate documents.
>
>On Sun, 4 Aug 2019 at 23:11, Claude Lepere 
>wrote:
>
>> Hello!
>>
>> What score of a hit in response to a query that begins with the
>clause
>> MUST_NOT?
>> Is it 0 or something else?
>> What does it mean?
>> How is it calculated?
>>
>> Thank you in advance. Claude Lepère
>>
>-- 
>Regards,
>
>Atri
>Apache Concerted

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

RE: Slowness on Java 11 with Lucene 6

2019-07-29 Thread Uwe Schindler
Hi,

We did not notice any slowdown of queries based on the Java version, if the 
configuration is identical.
But Java 11 uses G1GC as the default garbage collector, in contrast to Java 8 
which uses ParallelGC. G1GC no longer has long pause times (so your application 
won't halt for longer periods anymore), but in exchange it reduces throughput.

If you use Solr or Elasticsearch, those applications configure the garbage 
collector for use with search, but if you just run your application with Java 8 
and default command line parameters, slowdowns due to different defaults may be 
possible.

I'd suggest making sure that the garbage collector is configured in a defined 
way and then comparing again; see the example below. We also don't know on the 
mailing list what you have done (is it pure Lucene, is it Solr, or is it 
Elasticsearch?) and how your JVM is configured, so I assumed the worst case 
(default settings with a pure Lucene application).
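
For illustration, one way to pin the collector so the Java 8 and Java 11 runs 
are comparable (heap sizes and the jar name are made up for the example):

    # same collector and heap on both JVMs (ParallelGC is the Java 8 default)
    java -XX:+UseParallelGC -Xms4g -Xmx4g -jar my-search-app.jar

    # explicit G1 run for comparison
    java -XX:+UseG1GC -Xms4g -Xmx4g -jar my-search-app.jar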

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: GASPARD-EXT Joel 
> Sent: Monday, July 29, 2019 5:17 PM
> To: java-user@lucene.apache.org
> Subject: Slowness on Java 11 with Lucene 6
> 
> Hello,
> 
> We have noticed slower response times on search with Lucene 6 when
> upgrading to Java 11.
> 
> We use the version 6.6.5 of Lucene. Our servers are on Windows, with SSD
> devices. Our index contains several millions of documents.
> When upgrading from Java 8 to Java 11, we have noticed slower response
> times in query searches : sometimes 30% slower, sometimes twice as slow,
> depending on the server capacity.
> 
> We measure response times with 20 queries.
> We have made our tests with different JDK providers. We have noticed the
> same deviation.
> 
> Have you encountered this problem ?
> 
> Thanks
> 
> 
> 
> 
> 
> Ce message et toutes les pi?ces jointes qu'il contient sont uniquement
> destin?s aux personnes auxquelles ils sont adress?s et sont strictement
> confidentiels. A moins qu'il en ait ?t? explicitement convenu autrement, son
> contenu ne refl?te que la pens?e personnelle de son auteur et ne saurait
> donc repr?senter la vision officielle de l'Entreprise. Si vous avez re?u ce
> message par erreur, nous vous remercions de bien vouloir en informer
> l'exp?diteur imm?diatement par retour d'email et supprimer d?finitivement
> le message de vos r?pertoires. Toute utilisation de ce message non conforme
> ? sa destination, toute diffusion ou toute publication, totale ou partielle, 
> est
> interdite, sauf autorisation expresse. L'internet ne permettant pas d'assurer
> l'int?grit? de ce message, l'Entreprise d?cline toute responsabilit? au titre 
> de
> ce message, dans l'hypoth?se o? il aurait ?t? modifi?.
> 
> 
> This message including any attachments is confidential and intended solely
> for the addressees. Unless explicitly mentioned, its content reflects only the
> personal thoughts of the author, and therefore cannot represent the official
> view of the Company. If received by error, please inform immediately the
> sender by return e-mail and delete definitely the message from any and all
> directories. Any use, dissemination or disclosure not in conformity with the
> intended purposes is strictly prohibited. The integrity of messages via
> Internet cannot be guaranteed and the Company accepts no liability for any
> changes which may occur.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: field:* vs field:[* TO *]

2019-04-18 Thread Uwe Schindler
Hi,

> I was pointed to Lucene from the Solr list. I am wondering if the
> performance of the below two queries is expected to be quite different and
> would they return the same set of results?
> 
> field:*
> field:[* TO *]

From the Lucene side they are identical, but it depends on the implementation 
in Solr's query parser. They both iterate all terms in the field (if it's a 
string field).

> The use case I am trying to optimize is returning all documents that
> contain any value for a given field, and I've noticed the queries to be
> quite slow especially for fields that have a large number of distinct
> values.

Unfortunately Solr has no optimized support for that. There are 2 issues open:

https://issues.apache.org/jira/browse/SOLR-11437
https://issues.apache.org/jira/browse/SOLR-12488

This is the same way how Elasticsearch is doing this today. I can look into 
implementing this (it's on my TODO list of issues).

In the meantime there is another efficient way to do this, but it requires you 
to index an additional field. The nice thing with that one is that it does not 
require the field properties to be correct (e.g., it does not need to 
differentiate between field types, or whether there are docvalues or norms). 
The idea also comes from Elasticsearch, which has had this since the first day. 
Elasticsearch indexed (until they switched to the above approach using 
DocValues/NormsExistsQuery) a hidden internal field (invisible to the user) 
that was powering the exists query. This field was basically (in Solr speak) a 
"multivalued, non-tokenized, string" field. It just contains the field names of 
all fields that have a value. E.g., if you have a document:

{ "foo": "hello", "bar": 20, "text": "all fine" }

Your indexing code would extend this document to add an additional field (Solr 
won't do this automatically like Elasticsearch):

{ "foo": "hello", "bar": 20, "text": "all fine", "fields ": ["foo", "bar", 
"text"] }

Then you can filter with a query on fields:bar to get all documents that have a 
value in "bar"; a minimal indexing sketch follows below.

Uwe


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Upper limit on Score

2019-04-18 Thread Uwe Schindler
No, there is no limit.

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Kevin Manuel 
> Sent: Wednesday, April 17, 2019 7:38 PM
> To: java-user@lucene.apache.org
> Subject: Upper limit on Score
> 
> Hi,
> 
> I was just wondering is there an upper limit to the score that can be
> generated for a non-constant score query?
> 
> Thanks,
> Kevin


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Noticed performance degrade from lucene-7.5.0 to lucene-8.0.0

2019-04-14 Thread Uwe Schindler
Without further information we can't help here. We would need the type of 
queries (conjunction, disjunction, phrase, ...). There are significant changes 
which may cause some queries to be slower, but others can be up to 50 times 
faster if the exact number of results is not needed; see 
https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand
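
For illustration, a minimal sketch of opting into that optimization in Lucene 8 
(query and numbers are made up; it only helps when an approximate total hit 
count is acceptable):

    // org.apache.lucene.search.TopScoreDocCollector (Lucene 8.0+)
    TopScoreDocCollector collector = TopScoreDocCollector.create(10, 1000);
    searcher.search(query, collector);   // may skip non-competitive blocks (WAND)
    TopDocs top = collector.topDocs();
    // top.totalHits.relation can be GREATER_THAN_OR_EQUAL_TO instead of an exact count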

Uwe

Am April 14, 2019 2:22:59 PM UTC schrieb Khurram Shehzad :
>Hi All,
>
>I have recently updated from lucene-7.5.0 to lucene-8.0.0. But I
>noticed considerable performance degrade. Queries that used to be
>executed in 18 to 24 milliseconds now taking 74 to 110 milliseconds.
>
>Any suggestion please?
>
>Regards,
>Khurram

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

RE: Why does Lucene 7.4.0 commit() Increase Memory Usage x2

2019-04-04 Thread Uwe Schindler
Small correction: It's not fully true that the JVM "never" gives back memory to 
the operating system: the G1 collector has been able to give back memory to the 
OS since the beginning, but it does this only on full GCs, which it tries to 
prevent.

But: The default collector as shipped with Java 8 (ParallelGC) never gives back 
any memory to the OS; the same applies to ConcMarkSweepGC. And I assume you are 
using this one.

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-----
> From: Uwe Schindler 
> Sent: Thursday, April 4, 2019 11:49 AM
> To: java-user@lucene.apache.org
> Subject: RE: Why does Lucene 7.4.0 commit() Increase Memory Usage x2
> 
> Hi,
> 
> Thanks Adrien. With current JVM versions (Java 8 or Java 11), the garbage
> collector never gives back memory to the operating system, once it has
> allocated that. Due to now heavy usage of containers and similar techniques,
> there are efforts on the JVM front to change that: At least the G1 garbage
> collector (also IBM J9's collector and also the brand new Shenandoah, but
> not the good old CMS) gets a new feature to give back memory to the
> operating system to a certain amount if it is idle, starting with Java 12:
> https://openjdk.java.net/jeps/346
> 
> But that's still a tricky issue, because if you limit the size of allocated 
> memory,
> the garbage collection needs to happen more often, which if you have
> enough reserved space, the GC has a more relaxed job.
> 
> Uwe
> 
> -
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
> 
> > -Original Message-
> > From: Adrien Grand 
> > Sent: Thursday, April 4, 2019 10:00 AM
> > To: Lucene Users Mailing List 
> > Subject: Re: Why does Lucene 7.4.0 commit() Increase Memory Usage x2
> >
> > I think what you are experiencing is just due to how the JVM works: it
> > happily reserves memory to the operating system if it thinks it might
> > need it, and then it's reluctant to give it back because it assumes
> > that if it has needed so much memory in the past, it might need it
> > again in the future. If you don't want to JVM to use so much memory,
> > just pass a lower value of the maximum heap size.
> >
> > On Wed, Apr 3, 2019 at 4:11 PM thturk  wrote:
> > >
> > > I have tried Java VisualVM too watch GC status per each commit and
> relase
> > > variables for Reader Writer Searcher.  But as result GC working like in
> > > photo at below
> > >
> >
> <http://lucene.472066.n3.nabble.com/file/t494233/Ekran_Al%C4%B1nt%C4
> > %B1s%C4%B1.png>
> > > After 16.40 I called GC manully but Heap size didnt decrease  is it cos 
> > > its
> > > take while to merge serment for lucene ?  cos after a hour Memory
> Ussage
> > > Decreased around 3G it was 3.5G after add new 15k document.
> > >
> > >
> > >
> > > --
> > > Sent from: http://lucene.472066.n3.nabble.com/Lucene-Java-Users-
> > f532864.html
> > >
> > > -
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> >
> >
> > --
> > Adrien
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Why does Lucene 7.4.0 commit() Increase Memory Usage x2

2019-04-04 Thread Uwe Schindler
Hi,

Thanks Adrien. With current JVM versions (Java 8 or Java 11), the garbage 
collector never gives back memory to the operating system once it has allocated 
it. Due to the now heavy usage of containers and similar techniques, there are 
efforts on the JVM front to change that: at least the G1 garbage collector 
(also IBM J9's collector and the brand-new Shenandoah, but not the good old 
CMS) gets a new feature to give back a certain amount of memory to the 
operating system when it is idle, starting with Java 12 (see the example 
below): https://openjdk.java.net/jeps/346
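
For illustration, the knob that JEP 346 adds (Java 12+ and G1 only; the 
interval and heap sizes are made up, and sizing still depends on your own 
workload):

    # trigger a periodic concurrent cycle when the JVM is idle, so committed
    # but unused heap can be returned to the OS
    java -XX:+UseG1GC -XX:G1PeriodicGCInterval=300000 -Xms512m -Xmx4g -jar my-app.jar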

But that's still a tricky issue: if you limit the size of allocated memory, 
garbage collection needs to happen more often, whereas if you have enough 
reserved space, the GC has a more relaxed job.

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Adrien Grand 
> Sent: Thursday, April 4, 2019 10:00 AM
> To: Lucene Users Mailing List 
> Subject: Re: Why does Lucene 7.4.0 commit() Increase Memory Usage x2
> 
> I think what you are experiencing is just due to how the JVM works: it
> happily reserves memory to the operating system if it thinks it might
> need it, and then it's reluctant to give it back because it assumes
> that if it has needed so much memory in the past, it might need it
> again in the future. If you don't want to JVM to use so much memory,
> just pass a lower value of the maximum heap size.
> 
> On Wed, Apr 3, 2019 at 4:11 PM thturk  wrote:
> >
> > I have tried Java VisualVM too watch GC status per each commit and  relase
> > variables for Reader Writer Searcher.  But as result GC working like in
> > photo at below
> >
> <http://lucene.472066.n3.nabble.com/file/t494233/Ekran_Al%C4%B1nt%C4
> %B1s%C4%B1.png>
> > After 16.40 I called GC manully but Heap size didnt decrease  is it cos its
> > take while to merge serment for lucene ?  cos after a hour Memory Ussage
> > Decreased around 3G it was 3.5G after add new 15k document.
> >
> >
> >
> > --
> > Sent from: http://lucene.472066.n3.nabble.com/Lucene-Java-Users-
> f532864.html
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> 
> 
> --
> Adrien
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


