Re: Exception from the codec layer during indexing

2023-09-28 Thread Rahul Goswami
Thanks Adrien. I am using Java 17. It occurs like clockwork in this particular
customer's environment, even after deleting the index and reconstructing it.
I have never seen this issue myself despite running the same Solr and Java
versions on more than 100 other nodes.
I have no clue about the first exception, but the second one does look like a
code bug to me, for the reasons mentioned in my earlier email.

Also, the storage this customer is running on uses software RAID 5. Not sure if
that could have anything to do with how the bytes are read from disk.

Or does this exception point to issues with term deduplication in memory,
before the terms make it to disk?

Thanks,
Rahul

On Thu, Sep 28, 2023 at 3:49 PM Adrien Grand  wrote:

> Hi Rahul,
>
> This exception complains that IndexingChain did not deduplicate terms
> as expected.
>
> I don't recall seeing this exception before (which doesn't mean it's
> not a real bug).
>
> What JVM are you running? Does this exception frequently occur or was
> it a one-off?
>
> On Thu, Sep 28, 2023 at 4:49 PM Rahul Goswami 
> wrote:
> >
> > Hi,
> > Following up on my issue...anyone who's seen similar exceptions ? Or any
> > insights on what might be going on?
> >
> > Thanks,
> > Rahul
> >
> > On Wed, Sep 27, 2023 at 1:00 AM Rahul Goswami 
> wrote:
> >
> > > Hello,
> > > On one of the servers running Solr 7.7.2, during indexing I observe 2
> > > different kinds of exceptions coming from the Lucene codec layer. I
> can't
> > > think of an application/data issue that could be causing this.
> > >
> > > In particular, Exception-2 seems like a potential bug since it
> complains
> > > about "terms out of order" even though both byte arrays are
> essentially the
> > > same. Reason I say this is that the FutureArrays.mismatch() is
> supposed to
> > > behave like Java's Arrays.mismatch which returns -1 if NO mismatch is
> > > found. However the check in the below line treats the value -1 as a
> > > mismatch causing the exception.
> > >
> > >
> > >
> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.7.2/lucene/core/src/java/org/apache/lucene/util/StringHelper.java#L46
> > >
> > > Happy to submit a PR for this if there is a consensus on this being a
> bug.
> > >
> > > Would appreciate any inputs on the exceptions seen below!
> > >
> > > *Exception-1:*
> > >
> > > 2023-09-19 10:13:48.901 ERROR (qtp1859039536-1691) [
> > > x:fsindex_FileIndexer20234799_shard_1] o.a.s.s.HttpSolrCall
> > > null:org.apache.solr.common.SolrException: Server error writing
> document id
> > > 6182!bbdbe92468734899c738f048e6f58245 to the index
> > > at
> > >
> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:240)
> > > at
> > >
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
> > > at
> > >
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
> > > at
> > >
> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:1002)
> > > at
> > >
> org.apache.solr.update.processor.DistributedUpdateProcessor.doVersionAdd(DistributedUpdateProcessor.java:1233)
> > > at
> > >
> org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$versionAdd$2(DistributedUpdateProcessor.java:1082)
> > > at
> org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
> > > at
> > >
> org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1082)
> > > at
> > >
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:694)
> > > at
> > >
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
> > > at
> > >
> org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
> > > at
> > >
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
> > > at
> > >
> org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
> > > at
> > >
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
> > > at

Re: Exception from the codec layer during indexing

2023-09-28 Thread Rahul Goswami
Hi,
Following up on my issue... has anyone seen similar exceptions? Or any
insights on what might be going on?

Thanks,
Rahul

On Wed, Sep 27, 2023 at 1:00 AM Rahul Goswami  wrote:

> Hello,
> On one of the servers running Solr 7.7.2, during indexing I observe 2
> different kinds of exceptions coming from the Lucene codec layer. I can't
> think of an application/data issue that could be causing this.
>
> In particular, Exception-2 seems like a potential bug since it complains
> about "terms out of order" even though both byte arrays are essentially the
> same. Reason I say this is that the FutureArrays.mismatch() is supposed to
> behave like Java's Arrays.mismatch which returns -1 if NO mismatch is
> found. However the check in the below line treats the value -1 as a
> mismatch causing the exception.
>
>
> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.7.2/lucene/core/src/java/org/apache/lucene/util/StringHelper.java#L46
>
> Happy to submit a PR for this if there is a consensus on this being a bug.
>
> Would appreciate any inputs on the exceptions seen below!
>
> *Exception-1:*
>
> 2023-09-19 10:13:48.901 ERROR (qtp1859039536-1691) [
> x:fsindex_FileIndexer20234799_shard_1] o.a.s.s.HttpSolrCall
> null:org.apache.solr.common.SolrException: Server error writing document id
> 6182!bbdbe92468734899c738f048e6f58245 to the index
> at
> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:240)
> at
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
> at
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
> at
> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:1002)
> at
> org.apache.solr.update.processor.DistributedUpdateProcessor.doVersionAdd(DistributedUpdateProcessor.java:1233)
> at
> org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$versionAdd$2(DistributedUpdateProcessor.java:1082)
> at org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
> at
> org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1082)
> at
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:694)
> at
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
> at
> org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
> at
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
> at
> org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
> at
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
> at
> org.apache.solr.update.processor.AbstractDefaultValueUpdateProcessorFactory$DefaultValueUpdateProcessor.processAdd(AbstractDefaultValueUpdateProcessorFactory.java:92)
> at
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
> at
> org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
> at
> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:261)
> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:188)
> at
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
> at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:202)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2551)
> at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:711)
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:516)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:395)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:341)
> at
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
> at
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:121)
> at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:133)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
> at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
> at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHa

Exception from the codec layer during indexing

2023-09-26 Thread Rahul Goswami
Hello,
On one of the servers running Solr 7.7.2, during indexing I observe 2
different kinds of exceptions coming from the Lucene codec layer. I can't
think of an application/data issue that could be causing this.

In particular, Exception-2 seems like a potential bug since it complains
about "terms out of order" even though both byte arrays are essentially the
same. The reason I say this is that FutureArrays.mismatch() is supposed to
behave like Java's Arrays.mismatch(), which returns -1 if NO mismatch is
found. However, the check on the line below treats the value -1 as a
mismatch, causing the exception.

https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.7.2/lucene/core/src/java/org/apache/lucene/util/StringHelper.java#L46
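
As context for the -1 discussion, here is a tiny standalone sketch (mine, not
from the thread) of the java.util.Arrays.mismatch() contract that
FutureArrays.mismatch() mirrors: -1 is returned only when the compared ranges
are byte-for-byte identical. Read that way, hitting -1 at that point in
StringHelper implies the current term equals the prior term, which matches
Adrien's note elsewhere in this thread that the exception is really flagging a
missed term deduplication.

import java.util.Arrays;

public class MismatchDemo {
    public static void main(String[] args) {
        byte[] a = {1, 2, 3};
        byte[] b = {1, 2, 3};
        byte[] c = {1, 2, 4};
        byte[] d = {1, 2, 3, 5};

        System.out.println(Arrays.mismatch(a, b)); // -1: ranges are identical
        System.out.println(Arrays.mismatch(a, c)); // 2: first differing index
        System.out.println(Arrays.mismatch(a, d)); // 3: 'a' is a proper prefix of 'd'
    }
}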

Happy to submit a PR for this if there is a consensus on this being a bug.

Would appreciate any inputs on the exceptions seen below!

*Exception-1:*

2023-09-19 10:13:48.901 ERROR (qtp1859039536-1691) [
x:fsindex_FileIndexer20234799_shard_1] o.a.s.s.HttpSolrCall
null:org.apache.solr.common.SolrException: Server error writing document id
6182!bbdbe92468734899c738f048e6f58245 to the index
at
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:240)
at
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
at
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:1002)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.doVersionAdd(DistributedUpdateProcessor.java:1233)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$versionAdd$2(DistributedUpdateProcessor.java:1082)
at org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1082)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:694)
at
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at
org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
at
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at
org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
at
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at
org.apache.solr.update.processor.AbstractDefaultValueUpdateProcessorFactory$DefaultValueUpdateProcessor.processAdd(AbstractDefaultValueUpdateProcessorFactory.java:92)
at
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at
org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
at
org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:261)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:188)
at
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:202)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2551)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:711)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:516)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:395)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:341)
at
com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at
com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:121)
at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:133)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1588)
at
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
at
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
at
org.eclipse.jetty.server.session.Sessio

Re: Reindexing leaving behind 0 live doc segments

2023-09-09 Thread Rahul Goswami
Uwe,
Thanks for the response. I have openSearcher=false in autoCommit, but I do
have an autoSoftCommit interval of 5 minutes configured as well which
should open a searcher.
In vanilla Solr, without my code, I see that if I completely reindex all
documents in a segment (via a client call), the segment does get deleted
after the soft commit interval. However, if I process the segments as per
Approach-1 in my original email, I see that the 0-doc 7.x segment stays
even after the process finishes, i.e. even after I exit the
try-with-resources block. Note that my index is a mix of 7.x and 8.x
segments and I am only reindexing 7.x segments, by preventing them from
participating in merges via a custom MergePolicy.
Additionally, as mentioned, Solr provides a handler (/admin/segments)
which does what Luke does, and it shows that by the end of the process there
are no more 7.x segments referenced by the segments_x file. But for some
reason the physical 7.x segment files continue to stay behind until I
restart Solr.
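
For reference, the commit settings being discussed look roughly like this in
solrconfig.xml (an illustrative sketch; the maxTime values are placeholders,
not taken from this setup). openSearcher only applies to the hard autoCommit
block, while a soft commit always opens a new searcher:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>60000</maxTime>           <!-- hard commit every 60s -->
    <openSearcher>false</openSearcher> <!-- hard commits do not open a searcher -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>300000</maxTime>          <!-- soft commit (opens a searcher) every 5 minutes -->
  </autoSoftCommit>
</updateHandler>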

Thanks,
Rahul

On Mon, Sep 4, 2023 at 7:18 AM Uwe Schindler  wrote:

> Hi,
>
> in Solr the empty segment stays open as long as there is a searcher
> still open. At some point the empty segment (100% deletions) will be
> deleted, but you have to wait until the SolrIndexSearcher has been reopened.
> Maybe check your solrconfig.xml and verify whether openSearcher is enabled
> after autoSoftCommit:
>
> https://solr.apache.org/guide/solr/latest/configuration-guide/commits-transaction-logs.html
>
> Uwe
>
> Am 31.08.2023 um 21:35 schrieb Rahul Goswami:
> > Stefan, Mike,
> > Appreciate your responses! I spent some time analyzing your inputs and
> > going further down the rabbit hole.
> >
> > Stefan,
> > I looked at the IndexRearranger code you referenced where it tries to
> drop
> > the segment. I see that it eventually gets handled via
> > IndexFileDeleter.checkpoint() through file refCounts (=0 for deletion
> > criteria). The same method also gets called as part of
> IndexWrtier.commit()
> > flow (Inside finishCommit()). So in an ideal scenario a commit should
> have
> > taken care of dropping the segment files. So that tells me the refCounts
> > for the files are not getting set to 0. I have a fair suspicion the
> > reindexing process running on the same index inside the same JVM has to
> do
> > something with it.
> >
> > Mike,
> > Thanks for the caution on Approach 2 ...good to at least be able to
> > continue on one train of thought. As mentioned in my response to Stefan,
> > the reindexing is going on *inside* of the Solr JVM as an asynchronous
> > thread and not as a separate process. So I believe the open reader you
> are
> > alluding to might be the one I am opening to through
> DirectoryReader.open()
> > (?) . However, looking at the code, I am seeing IndexFileDeleter.incRef()
> > only on the files in SegmentCommitInfos.
> >
> > Does an incRef() also happen when an IndexReader is opened ?
> >
> > Note:The index is a mix of 7.x and 8.x segments (on Solr 8.x). By
> extending
> > TMP and overloading findMerges() I am preventing 7.x segments from
> > participating in merges, and the code only reindexes these 7.x segments
> > into the same index, segment-by-segment.
> > In the current tests I am performing, there are no parallel search or
> > indexing threads through an external request. The reindexing is the only
> > process interacting with the index. The goal is to eventually have this
> > running alongside any parallel indexing/search requests on the index.
> > Also, as noted earlier, by inspecting the SegmentInfos , I can see the
> 7.x
> > segment progressively reducing, but the files never get cleared.
> >
> > If it is my reader that is throwing off the refCount for Solr, what could
> > be another way of reading the index without bloating it up with 0 doc
> > segments?
> >
> > I will also try floating this in the Solr list to get answers to some of
> > the questions you pose around Solr's handling of readers..
> >
> > Thanks,
> > Rahul
> >
> >
> >
> >
> > On Thu, Aug 31, 2023 at 6:48 AM Michael McCandless <
> > luc...@mikemccandless.com> wrote:
> >
> >> Hi Rahul,
> >>
> >> Please do not pursue Approach 2 :)  ReadersAndUpdates.release is not
> >> something the application should be calling.  This path can only lead to
> >> pain.
> >>
> >> It sounds to me like something in Solr is holding an old reader (maybe
> the
> >> last commit point, or reader prior to the refresh after you re-indexed
> all
> >> docs in a given now 100% deleted segment) open.

Re: Reindexing leaving behind 0 live doc segments

2023-08-31 Thread Rahul Goswami
Stefan, Mike,
Appreciate your responses! I spent some time analyzing your inputs and
going further down the rabbit hole.

Stefan,
I looked at the IndexRearranger code you referenced where it tries to drop
the segment. I see that it eventually gets handled via
IndexFileDeleter.checkpoint() through file refCounts (=0 for deletion
criteria). The same method also gets called as part of the IndexWriter.commit()
flow (Inside finishCommit()). So in an ideal scenario a commit should have
taken care of dropping the segment files. So that tells me the refCounts
for the files are not getting set to 0. I have a fair suspicion the
reindexing process running on the same index inside the same JVM has to do
something with it.

Mike,
Thanks for the caution on Approach 2 ...good to at least be able to
continue on one train of thought. As mentioned in my response to Stefan,
the reindexing is going on *inside* of the Solr JVM as an asynchronous
thread and not as a separate process. So I believe the open reader you are
alluding to might be the one I am opening through DirectoryReader.open() (?).
However, looking at the code, I am seeing IndexFileDeleter.incRef()
only on the files in SegmentCommitInfos.

Does an incRef() also happen when an IndexReader is opened ?

Note:The index is a mix of 7.x and 8.x segments (on Solr 8.x). By extending
TMP and overloading findMerges() I am preventing 7.x segments from
participating in merges, and the code only reindexes these 7.x segments
into the same index, segment-by-segment.
In the current tests I am performing, there are no parallel search or
indexing threads through an external request. The reindexing is the only
process interacting with the index. The goal is to eventually have this
running alongside any parallel indexing/search requests on the index.
Also, as noted earlier, by inspecting the SegmentInfos , I can see the 7.x
segment progressively reducing, but the files never get cleared.
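
For readers unfamiliar with this setup, a rough sketch of the kind of
MergePolicy filtering described above (an editorial illustration against
Lucene 8.x APIs, not the author's actual code; the class name and the
major-version cutoff are placeholders). Forced merges would need the same
filtering in findForcedMerges().

import java.io.IOException;
import org.apache.lucene.index.*;

public class SkipOldSegmentsMergePolicy extends TieredMergePolicy {
    private static final int MIN_MAJOR = 8; // only let 8.x segments merge

    @Override
    public MergeSpecification findMerges(MergeTrigger mergeTrigger,
                                         SegmentInfos segmentInfos,
                                         MergeContext mergeContext) throws IOException {
        // Offer only the newer-major segments to TieredMergePolicy, so the
        // older (7.x) segments are never selected for merging.
        SegmentInfos eligible = new SegmentInfos(segmentInfos.getIndexCreatedVersionMajor());
        for (SegmentCommitInfo sci : segmentInfos) {
            if (sci.info.getVersion().major >= MIN_MAJOR) {
                eligible.add(sci);
            }
        }
        if (eligible.size() == 0) {
            return null; // nothing eligible, no merges
        }
        return super.findMerges(mergeTrigger, eligible, mergeContext);
    }
}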

If it is my reader that is throwing off the refCount for Solr, what could
be another way of reading the index without bloating it up with 0 doc
segments?

I will also try floating this in the Solr list to get answers to some of
the questions you pose around Solr's handling of readers..

Thanks,
Rahul




On Thu, Aug 31, 2023 at 6:48 AM Michael McCandless <
luc...@mikemccandless.com> wrote:

> Hi Rahul,
>
> Please do not pursue Approach 2 :)  ReadersAndUpdates.release is not
> something the application should be calling.  This path can only lead to
> pain.
>
> It sounds to me like something in Solr is holding an old reader (maybe the
> last commit point, or reader prior to the refresh after you re-indexed all
> docs in a given now 100% deleted segment) open.
>
> Does Solr keep old readers open, older than the most recent commit?  Do
> you have queries in flight that might be holding the old reader open?
>
> Given that your small by-hand test case (3 docs) correctly showed the 100%
> deleted segment being reclaimed after the soft commit interval or a manual
> hard commit, something must be different in the larger use case that is
> causing Solr to keep a still old reader open.  Is there any logging you can
> enable to understand Solr's handling of its IndexReaders' lifecycle?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Aug 28, 2023 at 10:20 PM Rahul Goswami 
> wrote:
>
>> Hello,
>> I am trying to execute a program to read documents segment-by-segment and
>> reindex to the same index. I am reading using Lucene apis and indexing
>> using solr api (in a core that is currently loaded).
>>
>> What I am observing is that even after a segment has been fully processed
>> and an autoCommit (as well as autoSoftCommit ) has kicked in, the segment
>> with 0 live docs gets left behind. *Upon Solr restart, the segment does
>> get
> >> cleared successfully.*
>>
>> I tried to replicate same thing without the code by indexing 3 docs on an
>> empty test core, and then reindexing the same docs. The older segment gets
>> deleted as soon as softCommit interval hits or an explicit commit=true is
>> called.
>>
>> Here are the two approaches that I have tried. Approach 2 is inspired by
>> the merge logic of accessing segments in case opening a DirectoryReader
>> (Approach 1) externally is causing this issue.
>>
>> But both approaches leave undeleted segments behind until I restart Solr
>> and load the core again. What am I missing? I don't have any more brain
>> cells left to fry on this!
>>
>> Approach 1:
>> =
>> try (FSDirectory dir = FSDirectory.open(Paths.get(core.getIndexDir()));
>> IndexReader reader = DirectoryReader.open(dir)) {
>> for (LeafReaderContext lrc : reader.le

Re: Reindexing leaving behind 0 live doc segments

2023-08-30 Thread Rahul Goswami
Thanks for the response, Mikhail. I don't think I am looking for
forceMergeDeletes() though, since it could be more expensive than I would
like, and I only want the unreferenced segments with 0 live docs to be
deleted, just the way they get deleted with a commit=true option or even a
soft commit.

Another piece of important information that I missed out earlier is that
when I examine the segments referenced by the segments_* files these
segments (with 0 live docs) are no longer part of it, but they are still
not cleared. Would appreciate more lines of thought!
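
One way to confirm this state from the outside is to diff the directory
listing against the files the latest commit references, and to check Lucene's
pending deletions (files it already tried to delete but could not, e.g.
because a handle is still open on Windows). A small diagnostic sketch,
editorial and with a placeholder path argument:

import java.nio.file.Paths;
import java.util.*;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.FSDirectory;

public class LeftoverFilesCheck {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get(args[0]))) {
            // Files referenced by the latest segments_N (including segments_N itself)
            Set<String> referenced = new HashSet<>(SegmentInfos.readLatestCommit(dir).files(true));
            Set<String> onDisk = new TreeSet<>(Arrays.asList(dir.listAll()));
            onDisk.removeAll(referenced);
            System.out.println("On disk but not referenced by the latest commit: " + onDisk);
            // Files Lucene tried to delete but could not (typically open handles on Windows)
            System.out.println("Pending deletions: " + dir.getPendingDeletions());
        }
    }
}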

Thanks,
Rahul

On Tue, Aug 29, 2023 at 2:46 AM Mikhail Khludnev  wrote:

> Hi Rahul.
> Are you looking for
>
> https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/index/IndexWriter.html#forceMergeDeletes()
> ?
>
> On Tue, Aug 29, 2023 at 5:20 AM Rahul Goswami 
> wrote:
>
> > Hello,
> > I am trying to execute a program to read documents segment-by-segment and
> > reindex to the same index. I am reading using Lucene apis and indexing
> > using solr api (in a core that is currently loaded).
> >
> > What I am observing is that even after a segment has been fully processed
> > and an autoCommit (as well as autoSoftCommit ) has kicked in, the segment
> > with 0 live docs gets left behind. *Upon Solr restart, the segment does
> get
> > cleared successfully.*
> >
> > I tried to replicate same thing without the code by indexing 3 docs on an
> > empty test core, and then reindexing the same docs. The older segment
> gets
> > deleted as soon as softCommit interval hits or an explicit commit=true is
> > called.
> >
> > Here are the two approaches that I have tried. Approach 2 is inspired by
> > the merge logic of accessing segments in case opening a DirectoryReader
> > (Approach 1) externally is causing this issue.
> >
> > But both approaches leave undeleted segments behind until I restart Solr
> > and load the core again. What am I missing? I don't have any more brain
> > cells left to fry on this!
> >
> > Approach 1:
> > =
> > try (FSDirectory dir = FSDirectory.open(Paths.get(core.getIndexDir()));
> > IndexReader reader = DirectoryReader.open(dir)) {
> > for (LeafReaderContext lrc : reader.leaves()) {
> >
> >//read live docs from each leaf , create a
> > SolrInputDocument out of Document and index using Solr api
> >
> > }
> > }catch(Exception e){
> >
> > }
> >
> > Approach 2:
> > ==
> > ReadersAndUpdates rld = null;
> > SegmentReader segmentReader = null;
> > RefCounted iwRef =
> > core.getSolrCoreState().getIndexWriter(core);
> >  iw = iwRef.get();
> > try{
> >   for (SegmentCommitInfo sci : segmentInfos) {
> >  rld = iw.getPooledInstance(sci, true);
> >  segmentReader = rld.getReader(IOContext.READ);
> >
> > //process all live docs similar to above using the segmentReader.
> >
> > rld.release(segmentReader);
> > iw.release(rld);
> > }finally{
> >if (iwRef != null) {
> >iwRef.decref();
> > }
> > }
> >
> > Help would be much appreciated!
> >
> > Thanks,
> > Rahul
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


Reindexing leaving behind 0 live doc segments

2023-08-28 Thread Rahul Goswami
Hello,
I am trying to execute a program to read documents segment-by-segment and
reindex to the same index. I am reading using Lucene apis and indexing
using solr api (in a core that is currently loaded).

What I am observing is that even after a segment has been fully processed
and an autoCommit (as well as autoSoftCommit) has kicked in, the segment
with 0 live docs gets left behind. *Upon Solr restart, the segment does get
cleared successfully.*

I tried to replicate the same thing without the code by indexing 3 docs on an
empty test core, and then reindexing the same docs. The older segment gets
deleted as soon as the softCommit interval hits or an explicit commit=true is
called.

Here are the two approaches that I have tried. Approach 2 is inspired by
the merge logic of accessing segments in case opening a DirectoryReader
(Approach 1) externally is causing this issue.

But both approaches leave undeleted segments behind until I restart Solr
and load the core again. What am I missing? I don't have any more brain
cells left to fry on this!

Approach 1:
=
try (FSDirectory dir = FSDirectory.open(Paths.get(core.getIndexDir()));
     IndexReader reader = DirectoryReader.open(dir)) {
    for (LeafReaderContext lrc : reader.leaves()) {
        // read live docs from each leaf, create a SolrInputDocument
        // out of each Document, and index it using the Solr api
    }
} catch (Exception e) {
    // handle/log
}

Approach 2:
==
ReadersAndUpdates rld = null;
SegmentReader segmentReader = null;
RefCounted<IndexWriter> iwRef = core.getSolrCoreState().getIndexWriter(core);
IndexWriter iw = iwRef.get();
try {
    for (SegmentCommitInfo sci : segmentInfos) {
        rld = iw.getPooledInstance(sci, true);
        segmentReader = rld.getReader(IOContext.READ);

        // process all live docs similar to above using the segmentReader

        rld.release(segmentReader);
        iw.release(rld);
    }
} finally {
    if (iwRef != null) {
        iwRef.decref();
    }
}

Help would be much appreciated!

Thanks,
Rahul


Re: Performance regression in getting doc by id in Lucene 8 vs Lucene 7

2023-06-06 Thread Rahul Goswami
Thanks Adrien. I spent some time trying to understand the readByte() in
ReverseRandomAccessReader (through FST) and compare with 7.x.  Although I
don't understand ALL of the details and reasoning for always loading the
FST (and in turn the term index) off-heap (as discussed in
https://github.com/apache/lucene/issues/10297 ) I understand that this is
essentially causing disk access for every single byte during readByte().
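
As a toy illustration of that refill cost (an editorial sketch, not Lucene's
actual BufferedIndexInput): with a read-ahead buffer that always starts at the
requested offset, a forward scan pays one refill per buffer-full, while a
backward scan pays one refill per byte.

// Toy model of a forward-looking 1 KB buffer being read backwards.
public class BackwardReadDemo {
    static final int BUF = 1024;
    static long refills = 0;
    static long bufferStart = -1; // file offset of the first buffered byte

    // Simulates buffered access: on a miss, the buffer is refilled to cover
    // [pos, pos + BUF) starting at the requested position.
    static void readByteAt(long pos) {
        if (bufferStart < 0 || pos < bufferStart || pos >= bufferStart + BUF) {
            bufferStart = pos; // refill starting at the requested offset
            refills++;
        }
    }

    public static void main(String[] args) {
        long fileLen = 1_000_000;
        for (long pos = 0; pos < fileLen; pos++) readByteAt(pos);      // forward scan
        System.out.println("forward refills:  " + refills);            // ~fileLen / 1024
        refills = 0; bufferStart = -1;
        for (long pos = fileLen - 1; pos >= 0; pos--) readByteAt(pos); // backward scan
        System.out.println("backward refills: " + refills);            // == fileLen
    }
}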

Does this warrant a JIRA for regression?

As mentioned, I am noticing a 10x slowdown in SegmentTermsEnum.seekExact()
affecting atomic update performance. For setups like mine that can't use
mmap due to large indexes, this would be a legitimate regression, no?
- Rahul

On Tue, Jun 6, 2023 at 10:09 AM Adrien Grand  wrote:

> Yes, this changed in 8.x:
>  - 8.0 moved the terms index off-heap for non-PK fields with
> MMapDirectory. https://github.com/apache/lucene/issues/9681
>  - Then in 8.6 the FST was moved off-heap all the time.
> https://github.com/apache/lucene/issues/10297
>
> More generally, there's a few files that are no longer loaded in heap
> in 8.x. It should be possible to load them back in heap by doing
> something like that (beware, I did not actually test this code):
>
> class MyHeapDirectory extends FilterDirectory {
>
>   MyHeapDirectory(Directory in) {
>     super(in);
>   }
>
>   @Override
>   public IndexInput openInput(String name, IOContext context) throws IOException {
>     if (context.load == false) {
>       return super.openInput(name, context);
>     } else {
>       try (IndexInput in = super.openInput(name, context)) {
>         byte[] bytes = new byte[Math.toIntExact(in.length())];
>         in.readBytes(bytes, 0, bytes.length);
>         ByteBuffer bb = ByteBuffer.wrap(bytes)
>             .order(ByteOrder.LITTLE_ENDIAN).asReadOnlyBuffer();
>         return new ByteBuffersIndexInput(
>             new ByteBuffersDataInput(Collections.singletonList(bb)),
>             "ByteBuffersIndexInput(" + name + ")");
>       }
>     }
>   }
> }
>
> On Tue, Jun 6, 2023 at 3:41 PM Rahul Goswami 
> wrote:
> >
> > Thanks Adrien. Is this behavior of FST something that has changed in
> Lucene
> > 8.x (from 7.x)?
> > Also, is the terms index not loaded into memory anymore in 8.x?
> >
> > To your point on MMapDirectoryFactory, it is much faster as you
> > anticipated, but the indexes commonly being >1 TB makes the Windows
> machine
> > freeze to a point I sometimes can't even connect to the VM.
> > SimpleFSDirectory works well for us from that standpoint.
> >
> > To add, both NIOFS and SimpleFS have similar indexing benchmarks on
> > Windows. I understand it is because of the Java bug which synchronizes
> > internally in the native call for NIOFs.
> >
> > -Rahul
> >
> > On Tue, Jun 6, 2023 at 9:32 AM Adrien Grand  wrote:
> >
> > > +Alan Woodward helped me better understand what is going on here.
> > > BufferedIndexInput (used by NIOFSDirectory and SimpleFSDirectory)
> > > doesn't play well with the fact that the FST reads bytes backwards:
> > > every call to readByte() triggers a refill of 1kB because it wants to
> > > read the byte that is just before what the buffer contains.
> > >
> > > On Tue, Jun 6, 2023 at 2:07 PM Adrien Grand  wrote:
> > > >
> > > > My best guess based on your description of the issue is that
> > > > SimpleFSDirectory doesn't like the fact that the terms index now
> reads
> > > > data directly from the directory instead of loading the terms index
> in
> > > > heap. Would you be able to run the same benchmark with MMapDirectory
> > > > to check if it addresses the regression?
> > > >
> > > >
> > > > On Tue, Jun 6, 2023 at 5:47 AM Rahul Goswami 
> > > wrote:
> > > > >
> > > > > Hello,
> > > > > We started experiencing slowness with atomic updates in Solr after
> > > > > upgrading from 7.7.2 to 8.11.1. Running several tests revealed the
> > > > > slowness to be in RealTimeGet's SolrIndexSearcher.getFirstMatch()
> call
> > > > > which eventually calls Lucene's SegmentTermsEnum.seekExact()..
> > > > >
> > > > > In the benchmarks I ran, 8.11.1 is about 10x slower than 7.7.2.
> After
> > > > > discussion on the Solr mailing list I created the below JIRA:
> > > > >
> > > > > https://issues.apache.org/jira/browse/SOLR-16838
> > > > >
> > > > > The thread dumps collected show a lot of threads stuck in the
> > > > > FS

Re: Performance regression in getting doc by id in Lucene 8 vs Lucene 7

2023-06-06 Thread Rahul Goswami
Thanks Adrien. Is this behavior of FST something that has changed in Lucene
8.x (from 7.x)?
Also, is the terms index not loaded into memory anymore in 8.x?

To your point on MMapDirectoryFactory, it is much faster as you
anticipated, but the indexes commonly being >1 TB makes the Windows machine
freeze to a point I sometimes can't even connect to the VM.
SimpleFSDirectory works well for us from that standpoint.

To add, both NIOFS and SimpleFS have similar indexing benchmarks on
Windows. I understand it is because of the Java bug which synchronizes
internally in the native call for NIOFs.

-Rahul

On Tue, Jun 6, 2023 at 9:32 AM Adrien Grand  wrote:

> +Alan Woodward helped me better understand what is going on here.
> BufferedIndexInput (used by NIOFSDirectory and SimpleFSDirectory)
> doesn't play well with the fact that the FST reads bytes backwards:
> every call to readByte() triggers a refill of 1kB because it wants to
> read the byte that is just before what the buffer contains.
>
> On Tue, Jun 6, 2023 at 2:07 PM Adrien Grand  wrote:
> >
> > My best guess based on your description of the issue is that
> > SimpleFSDirectory doesn't like the fact that the terms index now reads
> > data directly from the directory instead of loading the terms index in
> > heap. Would you be able to run the same benchmark with MMapDirectory
> > to check if it addresses the regression?
> >
> >
> > On Tue, Jun 6, 2023 at 5:47 AM Rahul Goswami 
> wrote:
> > >
> > > Hello,
> > > We started experiencing slowness with atomic updates in Solr after
> > > upgrading from 7.7.2 to 8.11.1. Running several tests revealed the
> > > slowness to be in RealTimeGet's SolrIndexSearcher.getFirstMatch() call
> > > which eventually calls Lucene's SegmentTermsEnum.seekExact()..
> > >
> > > In the benchmarks I ran, 8.11.1 is about 10x slower than 7.7.2. After
> > > discussion on the Solr mailing list I created the below JIRA:
> > >
> > > https://issues.apache.org/jira/browse/SOLR-16838
> > >
> > > The thread dumps collected show a lot of threads stuck in the
> > > FST.findTargetArc()
> > > method. Testing environment details:
> > >
> > > Environment details:
> > > - Java 11 on Windows server
> > > - Xms1536m Xmx3072m
> > > - Indexing client code running 15 parallel threads indexing in batches
> of
> > > 1000 on a standalone core.
> > > - using SimpleFSDirectoryFactory  (since Mmap doesn't  quite work well
> on
> > > Windows for our index sizes which commonly run north of 1 TB)
> > >
> > >
> https://drive.google.com/drive/folders/1q2DPNTYQEU6fi3NeXIKJhaoq3KPnms0h?usp=sharing
> > >
> > > Is there a known issue with slowness with TermsEnum.seekExact() in
> Lucene
> > > 8.x ?
> > >
> > > Thanks,
> > > Rahul
> >
> >
> >
> > --
> > Adrien
>
>
>
> --
> Adrien
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Performance regression in getting doc by id in Lucene 8 vs Lucene 7

2023-06-05 Thread Rahul Goswami
Hello,
We started experiencing slowness with atomic updates in Solr after
upgrading from 7.7.2 to 8.11.1. Running several tests revealed the
slowness to be in RealTimeGet's SolrIndexSearcher.getFirstMatch() call
which eventually calls Lucene's SegmentTermsEnum.seekExact()..

In the benchmarks I ran, 8.11.1 is about 10x slower than 7.7.2. After
discussion on the Solr mailing list I created the below JIRA:

https://issues.apache.org/jira/browse/SOLR-16838

The thread dumps collected show a lot of threads stuck in the
FST.findTargetArc()
method. Testing environment details:

Environment details:
- Java 11 on Windows server
- Xms1536m Xmx3072m
- Indexing client code running 15 parallel threads indexing in batches of
1000 on a standalone core.
- using SimpleFSDirectoryFactory  (since Mmap doesn't  quite work well on
Windows for our index sizes which commonly run north of 1 TB)

https://drive.google.com/drive/folders/1q2DPNTYQEU6fi3NeXIKJhaoq3KPnms0h?usp=sharing

Is there a known issue with slowness with TermsEnum.seekExact() in Lucene
8.x ?
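
For anyone wanting to reproduce this outside Solr, a rough micro-benchmark
sketch of the call path being measured (my own illustration; the "id" field
name and the list of keys are placeholders for whatever unique-key field the
index uses):

import java.nio.file.Paths;
import java.util.List;
import org.apache.lucene.index.*;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.BytesRef;

public class SeekExactBench {
    // Looks up each key with TermsEnum.seekExact(), the same primitive the
    // RealTimeGet path ends up calling, and reports the total time.
    static void bench(String indexPath, List<String> keys) throws Exception {
        try (SimpleFSDirectory dir = new SimpleFSDirectory(Paths.get(indexPath));
             DirectoryReader reader = DirectoryReader.open(dir)) {
            long start = System.nanoTime();
            for (String key : keys) {
                BytesRef term = new BytesRef(key);
                for (LeafReaderContext ctx : reader.leaves()) {
                    Terms terms = ctx.reader().terms("id"); // placeholder field name
                    if (terms == null) continue;
                    if (terms.iterator().seekExact(term)) break; // found in this segment
                }
            }
            System.out.printf("%d lookups in %d ms%n", keys.size(),
                (System.nanoTime() - start) / 1_000_000);
        }
    }
}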

Thanks,
Rahul


Re: Questions about Lucene source

2022-12-14 Thread Rahul Goswami
David and Adrien, thanks for your responses. Bringing up an old thread
here. Revisiting this question ...
> (so deleted docs == max docs) and call commit. Will/Can this segment still
> exist after commit?

Since I am using Solr (8.11.1), the default deletion policy is
SolrDeletionPolicy, which retains only the latest commit by default and
deletes the rest. In that case, would a segment be automatically
deleted once all of the docs in it have been marked deleted (e.g. via
reindexing)? If yes, at what point (commit or merge)?
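
As background for the deletion-policy remarks in the quoted reply below, the
policy is configured on the writer; a minimal illustrative sketch at the
Lucene level (not Solr's actual SolrDeletionPolicy wiring):

import org.apache.lucene.index.*;

class DeletionPolicyExamples {
    // Default behaviour: only the newest commit point survives, so segments
    // dropped from a newer commit stop being referenced by any commit.
    static IndexWriterConfig keepOnlyLast() {
        return new IndexWriterConfig()
            .setIndexDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
    }

    // Keep every commit (and therefore every historical segment) around.
    static IndexWriterConfig keepAll() {
        return new IndexWriterConfig()
            .setIndexDeletionPolicy(NoDeletionPolicy.INSTANCE);
    }
}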

Thanks,
Rahul

On Fri, Sep 23, 2022 at 9:25 AM Adrien Grand  wrote:

> On the 2nd question, we do not plan on leveraging this information to
> figure out the codec: the codec that should be used to read a segment is
> stored separately (also in segment infos).
>
> It is mostly useful for diagnostics purposes. E.g. if we see an interesting
> corruption case where checksums match, we can guess that there is a bug
> somewhere in Lucene in a version that is between this minimum version and
> the version that was used to write the segment.
>
> On Sat, Sep 17, 2022 at 11:07 AM Dawid Weiss 
> wrote:
>
> > > (so deleted docs == max docs) and call commit. Will/Can this segment
> > still
> > > exist after commit?
> > >
> >
> > Depends on your merge policy and index deletion policy.  You can configure
> > Lucene to keep older commits (and then you'll preserve all historical
> > segments).
> >
> > I don't know the answer to your second question.
> >
> > D.
> >
>
>
> --
> Adrien
>


Learning Lucene from ground up

2022-11-04 Thread Rahul Goswami
Hello,
I have been working with Lucene and Solr for quite some time and have a
good understanding of a lot of moving parts at the code level. However I
wish to learn Lucene  internals from the ground up and want to familiarize
myself with all the dirty details. I would like to know what would be the
best way to go about it.

To kick things off, I have been thinking about picking up “Lucene in
Action”, but have been hesitant (perhaps wrongly) since it is based on
Lucene 3.0 and we have come a long way since then. To give an example of
the level of detail I wish to learn (among other things): what parts of a
segment (.tim, .tip, etc.) get loaded into memory at search time, which
parts use finite state machines and why, etc.

I would really appreciate any thoughts/inputs on how I can go about this.
Thanks in advance!

Regards,
Rahul


Re: Questions about Lucene source

2022-09-16 Thread Rahul Goswami
Following up on my questions since they didn't get much love the first
time. Any inputs are greatly appreciated!

Thanks,
Rahul

On Wed, Sep 14, 2022 at 3:58 PM Rahul Goswami  wrote:

> Hello,
>
> I was going through some parts of the Lucene source and had some questions:
> 1) Can lucene have 0 document segments? Or will they always be purged
> (either by TMP or otherwise) on a commit?
> Eg: A segment has 4 docs, and I make a /update call to overwrite all 4
> docs (so deleted docs == max docs) and call commit. Will/Can this segment
> still exist after commit?
>
> 2) Starting Lucene 7.0, each segment also stores a "minVersion" which
> tracks the min version of the segment that contributed docs to this
> segment.
>
> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/index/SegmentInfo.java#L83
>
> Reading through LUCENE-7756 I see that one reason to have minVersion was
> to have the entire version of the original index stored somewhere since a
> change was made to store only the major version at the index level (in
> SegmentInfos)
>
>
> https://issues.apache.org/jira/browse/LUCENE-7756?focusedCommentId=15945863&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15945863
>
> Checking the code, I found it's being consulted for any signs of index
> corruption but that was pretty much it. Curious if there is any other
> intended/planned use for minVersion? Eg: some choice of codec at read time
> based on this field or anything else?
>
> Thanks,
> Rahul
>
>


Questions about Lucene source

2022-09-14 Thread Rahul Goswami
Hello,

I was going through some parts of the Lucene source and had some questions:
1) Can lucene have 0 document segments? Or will they always be purged
(either by TMP or otherwise) on a commit?
Eg: A segment has 4 docs, and I make a /update call to overwrite all 4 docs
(so deleted docs == max docs) and call commit. Will/Can this segment still
exist after commit?

2) Starting with Lucene 7.0, each segment also stores a "minVersion" which
tracks the minimum Lucene version among the segments that contributed docs
to this segment.
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/index/SegmentInfo.java#L83

Reading through LUCENE-7756 I see that one reason to have minVersion was to
have the entire version of the original index stored somewhere since a
change was made to store only the major version at the index level (in
SegmentInfos)

https://issues.apache.org/jira/browse/LUCENE-7756?focusedCommentId=15945863&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15945863

Checking the code, I found it's being consulted for any signs of index
corruption but that was pretty much it. Curious if there is any other
intended/planned use for minVersion? Eg: some choice of codec at read time
based on this field or anything else?

Thanks,
Rahul


Re: Lucene 9.2.0 build fails on Windows

2022-09-14 Thread Rahul Goswami
Uwe, Dawid, and Robert,
Thank you for the helpful pointers! I do have Visual Studio 2017 on my
machine which I don't use much lately.

https://github.com/microsoft/vswhere
"*vswhere* is included with the installer as of Visual Studio 2017 version
15.2 and later, and can be found at the following location:
%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe."

From the Gradle GitHub source you shared, it looks like it first tries to locate
vswhere on the machine and executes some command on it. If not found, it just
returns an empty list. So I removed vswhere.exe from the path and builds
are successful now! Although this is a stopgap measure, it will work for
me. Thanks a lot!

https://github.com/gradle/gradle/blob/v7.3.3/subprojects/platform-native/src/main/java/org/gradle/nativeplatform/toolchain/internal/msvcpp/version/CommandLineToolVersionLocator.java#L63

-Rahul

On Wed, Sep 14, 2022 at 11:51 AM Dawid Weiss  wrote:

> > I have no idea how to fix this. Dawid: Maybe we can also make the
> > configuration of that native stuff only opt-in? So only detect Visual
> > Studio when you actively activate native code compilation?
>
> It is an opt-in, actually. The problem is: gradle fails on applying the
> plugin - even if the tasks are ignored.
>
> I'm +1 to remove the native thing entirely if nobody is using it and
> there are no benefits to keeping it maintained.
>
> Dawid
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Lucene 9.2.0 build fails on Windows

2022-09-13 Thread Rahul Goswami
Hi Dawid,
I believe you. Just that for some reason I have never been able to get it
to work on Windows. Also, being a complete newbie to gradle doesn't help
much. So would appreciate some help on this while I find my footing. Here
is the link to the diagnostics that you requested (since attachments/images
won't make it through):

https://drive.google.com/file/d/15pt9Qt1H98gOvA5e0NrtY8YYHao0lgdM/view?usp=sharing


Thanks,
Rahul

On Tue, Sep 13, 2022 at 1:18 PM Dawid Weiss  wrote:

> Hi Rahul,
>
> Well, that's weird.
>
> > "releases/lucene/9.2.0"  -> Run "gradlew help"
> >
> > If you need additional stacktrace or other diagnostics I am happy to
> > provide the same.
>
> Could you do the following:
>
> 1) run: git --version so that we're on the same page as to what the
> git version is (I don't think this matters),
> 2) run: gradlew help --stacktrace
>
> Step (2) should provide the exact place that fails. Something is
> definitely wrong because I'm on Windows and it works for me like a
> charm.
>
> Dawid
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Lucene 9.2.0 build fails on Windows

2022-09-13 Thread Rahul Goswami
Hi Dawid,
I tried with Gitbash only after "gradlew help" failed on cmd. Just now
tried Powershell as well and get the exact same error message.
The steps I performed were, clone the repo -> create a branch from tag
"releases/lucene/9.2.0"  -> Run "gradlew help"

If you need additional stacktrace or other diagnostics I am happy to
provide the same.

Thanks,
Rahul

On Tue, Sep 13, 2022 at 11:37 AM Dawid Weiss  wrote:

> It does work just fine. Use cmd or powershell though. I don't think
> things are even tested with cygwin/msys.
>
> Dawid
>
> On Tue, Sep 13, 2022 at 4:55 AM Rahul Goswami 
> wrote:
> >
> > Hello,
> > I am using gitbash to build lucene 9.2.0 on Windows. I checked out the
> > release/lucene/9.2.0 tag and tried running "./gradlew help". But it
> fails.
> > Running Java 11.0.4. Somehow building lucene 9x on Windows has never
> worked
> > for me. Had the same issue with 9.0.0 as well.
> >
> > mypc@mypc MINGW64 /c/work/snap/git/Lucene_9_2/lucene (lucene_9.2_local)
> > $* ./gradlew help*
> > Downloading gradle-wrapper.jar from
> >
> https://github.com/gradle/gradle/raw/v7.3.3/gradle/wrapper/gradle-wrapper.jar
> > To honour the JVM settings for this build a single-use Daemon process
> will
> > be forked. See
> >
> https://docs.gradle.org/7.3.3/userguide/gradle_daemon.html#sec:disabling_the_daemon
> > .
> > Daemon will be stopped at the end of the build
> >
> > FAILURE: Build failed with an exception.
> >
> > * What went wrong:
> > A problem occurred configuring root project 'lucene-root'.
> > > A problem occurred configuring project ':lucene:misc:native'.
> >> java.lang.NullPointerException (no error message)
> >
> > * Try:
> > > Run with --stacktrace option to get the stack trace.
> > > Run with --info or --debug option to get more log output.
> > > Run with --scan to get full insights.
> >
> > * Get more help at https://help.gradle.org
> >
> > BUILD FAILED in 13s
> > 2 actionable tasks: 2 executed
> >
> > mypc@mypc MINGW64 /c/work/snap/git/Lucene_9_2/lucene (lucene_9.2_local)
> > $ java -version
> > java version "11.0.4" 2019-07-16 LTS
> > Java(TM) SE Runtime Environment 18.9 (build 11.0.4+10-LTS)
> > Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.4+10-LTS, mixed mode)
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Lucene 9.2.0 build fails on Windows

2022-09-12 Thread Rahul Goswami
Hello,
I am using gitbash to build lucene 9.2.0 on Windows. I checked out the
release/lucene/9.2.0 tag and tried running "./gradlew help". But it fails.
Running Java 11.0.4. Somehow building lucene 9x on Windows has never worked
for me. Had the same issue with 9.0.0 as well.

mypc@mypc MINGW64 /c/work/snap/git/Lucene_9_2/lucene (lucene_9.2_local)
$* ./gradlew help*
Downloading gradle-wrapper.jar from
https://github.com/gradle/gradle/raw/v7.3.3/gradle/wrapper/gradle-wrapper.jar
To honour the JVM settings for this build a single-use Daemon process will
be forked. See
https://docs.gradle.org/7.3.3/userguide/gradle_daemon.html#sec:disabling_the_daemon
.
Daemon will be stopped at the end of the build

FAILURE: Build failed with an exception.

* What went wrong:
A problem occurred configuring root project 'lucene-root'.
> A problem occurred configuring project ':lucene:misc:native'.
   > java.lang.NullPointerException (no error message)

* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Run with --scan to get full insights.

* Get more help at https://help.gradle.org

BUILD FAILED in 13s
2 actionable tasks: 2 executed

mypc@mypc MINGW64 /c/work/snap/git/Lucene_9_2/lucene (lucene_9.2_local)
$ java -version
java version "11.0.4" 2019-07-16 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.4+10-LTS)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.4+10-LTS, mixed mode)


Re: SI File Missing

2022-08-11 Thread Rahul Goswami
Hi Brian,
This is a case of index corruption and unless you have a backup of your
working index, there is no way to recover the data unfortunately. There is,
however, a way for you to recover the index with partial data loss if that
is something that you can work with.

You can use lucene's CheckIndex utility to recover your index minus the
corrupt segment. You'll basically lose the documents contained in the
corrupt segment and recover the rest. You can invoke it as:
java -cp lucene-core.jar org.apache.lucene.index.CheckIndex pathToIndex
[-exorcise] [-crossCheckTermVectors] [-segment X] [-segment Y] [-dir-impl X]

You didn't specify the Lucene version you are using, but you can check out
the "main" method in the Javadoc for your appropriate version:
https://lucene.apache.org/core/8_11_1/core/org/apache/lucene/index/CheckIndex.html


You'll need to run it with the "-exorcise" option to repair the index with
the loss of documents in the corrupt segment.

-Rahul

On Thu, Aug 11, 2022 at 9:30 AM Brian Byju  wrote:

> I have an index where I am missing .si files for a set of .dim, .fdt, and
> .nvd files. Is there a way I can create the .si file? The shard is thus not
> getting allocated because of this, and I am facing a "no index found"
> exception.
>


Error building lucene 9.0

2022-07-07 Thread Rahul Goswami
Hi,
I cloned the lucene repo from github and checked out branch 9.0. I have JDK
11 installed on my Windows machine and am using GitBash to run the build as
below:

===
$ ./gradlew assemble
To honour the JVM settings for this build a single-use Daemon process will
be forked. See
https://docs.gradle.org/7.2/userguide/gradle_daemon.html#sec:disabling_the_daemon
.
Daemon will be stopped at the end of the build

FAILURE: Build failed with an exception.

* What went wrong:
A problem occurred configuring root project 'lucene-root'.

> A problem occurred configuring project ':lucene:misc:native'.
   > java.lang.NullPointerException (no error message)

==

Even running "./gradlew help" runs into the same exception.

Would appreciate any help on the issue!

Thanks,
Rahul


Re: Moving from lucene 6.x to 8.x

2022-01-26 Thread Rahul Goswami
Uwe,
This is beautiful! Especially the conversion from Trie to Point fields is
going to be extremely handy. I am going to have to check this out further.
Thank you for the tip!

Rahul

On Mon, Jan 17, 2022 at 10:23 AM Uwe Schindler  wrote:

> By the way
> > Hi, one thing that always works to "forcefully" upgrade without
> > reindexing: you just merge the old index into a completely new index, not
> > by copying files, but by sending their SegmentReaders to addIndexes,
> > stripping all metadata from them with some trick:
> >
> > https://lucene.apache.org/core/8_11_0/core/org/apache/lucene/index/SlowCodecReaderWrapper.html
> > in combination with
> > <https://lucene.apache.org/core/8_11_0/core/org/apache/lucene/index/IndexWriter.html#addIndexes-org.apache.lucene.index.CodecReader...->
> >
> > One way to do this is the following:
> > - Open the old index using DirectoryReader.open(): reader =
> > DirectoryReader.open(...old directory...)
> > - Create a new index with an IndexWriter: writer = new
> > IndexWriter(...new directory...)
> > - Call
> > writer.addIndexes(reader.leaves().stream().map(IndexReaderContext::reader).map(SlowCodecReaderWrapper::wrap).toArray(CodecReader[]::new));
>
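
Spelled out as a compilable sketch of those three steps (an editorial
illustration; the directory paths and the IndexWriterConfig are placeholders,
and a plain loop over LeafReaderContext is used so the wrapped readers are
LeafReaders, which is what SlowCodecReaderWrapper.wrap() expects):

import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.*;
import org.apache.lucene.store.FSDirectory;

public class ForcefulUpgrade {
    public static void main(String[] args) throws Exception {
        try (FSDirectory oldDir = FSDirectory.open(Paths.get(args[0])); // existing old index
             FSDirectory newDir = FSDirectory.open(Paths.get(args[1])); // empty target index
             DirectoryReader reader = DirectoryReader.open(oldDir);
             IndexWriter writer = new IndexWriter(newDir, new IndexWriterConfig())) {
            // Wrap every segment's LeafReader so addIndexes re-encodes the
            // documents with the current codec, dropping old per-segment metadata.
            List<CodecReader> wrapped = new ArrayList<>();
            for (LeafReaderContext ctx : reader.leaves()) {
                wrapped.add(SlowCodecReaderWrapper.wrap(ctx.reader()));
            }
            writer.addIndexes(wrapped.toArray(new CodecReader[0]));
            writer.commit();
        }
    }
}
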
> This trick also works if you want to transform indexes. I wrote some code
> that on-the-fly rewrites old NumericField to PointField. The trick is to
> add another FilterLeafReader (before wrapping with SlowCodecReaderWrapper)
> that detects legacy numeric fields, removes them from the metadata and feeds
> them as a new stream of flat BKD points enumerated by the TermsEnum (which
> works because the order is the same and the hierarchy is generated by the
> receiving IndexWriter) to a new field with PointField metadata. This is a bit
> hacky but works great.
>
> Uwe
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Moving from lucene 6.x to 8.x

2022-01-15 Thread Rahul Goswami
Thanks for the explanation, Michael. I read more about term vectors, and that
in combination with your explanation helps put things into perspective.

On Thu, Jan 13, 2022 at 8:53 AM Michael Sokolov  wrote:

> I think the "broken offsets" refers to offsets of tokens "going
> backwards". Offsets are attributes of tokens that refer back to their
> byte position in the original indexed text. Going backwards means -- a
> token with a greater position (in the sequence of tokens, or token
> graph) should not have a lesser (or maybe it must be strictly
> increasing I forget) offset. If you use term vectors, and have these
> broken offsets, which should not but do often occur with custom
> analysis chains, this could be a problem.
>
> On Wed, Jan 12, 2022 at 12:36 AM Rahul Goswami 
> wrote:
> >
> > Thanks Vinay for the link to Erick's talk! I hadn't seen it and I must
> > admit it did help put a few things into perspective.
> >
> > I was able to track down the JIRAs (thank you 'git blame')
> > surrounding/leading up to this architectural decision and the linked
> > patches:
> > https://issues.apache.org/jira/browse/LUCENE-7703  (Record the version
> that
> > was used at index creation time)
> > https://issues.apache.org/jira/browse/LUCENE-7730  (Better encode length
> > normalization in similarities)
> > https://issues.apache.org/jira/browse/LUCENE-7837  (Use
> > indexCreatedVersionMajor to fail opening too old indices)
> >
> > From these JIRAs what I was able to piece together is that if not
> > reindexed, relevance scoring might act in unpredictable ways. For my use
> > case, I can live with that since we provide an explicit sort on one or
> more
> > fields.
> >
> > In LUCENE-7703, Adrien says "we will reject broken offsets in term
> vectors
> > as of 7.0". So my questions to the community are
> > i) What are these offsets, and what feature/s might break with respect to
> > these offsets if not reindexed?
> > ii) Do the length normalization changes in  LUCENE-7730 affect only
> > relevance scores?
> >
> > I understand I could be playing with fire here, but reindexing is not a
> > practical solution for my situation. At least not in the near future
> until
> > I figure out a more seamless way of reindexing with minimal downtime
> given
> > that there are multiple 1TB+ indexes. Would appreciate inputs from the
> dev
> > community on this.
> >
> > Thanks,
> > Rahul
> >
> > On Sun, Jan 9, 2022 at 2:41 PM Vinay Rajput 
> > wrote:
> >
> > > Hi Rahul,
> > >
> > > I am not an expert so someone else might provide a better answer.
> However,
> > > I remember
> > > @Erick briefly talked about this restriction in one of his talks here:-
> > > https://www.youtube.com/watch?v=eaQBH_H3d3g&t=621s (not sure if you
> have
> > > seen it already).
> > >
> > > As he explains, earlier it looked like IndexUpgrader tool was doing
> the job
> > > perfectly but it wasn't always the case. There is no guarantee that
> after
> > > using the IndexUpgrader tool, your 8.x index will keep all of the
> > > characteristics of lucene 8. There can be some situations (e.g.
> incorrect
> > > offset) where you might get an incorrect relevance score which might be
> > > difficult to trace and debug. So, Lucene developers now made it
> explicit
> > > that what people were doing earlier was not ideal, and they should now
> plan
> > > to reindex all the documents during the major upgrade.
> > >
> > > Having said that, what you have done can just work without any issue as
> > > long as you don't encounter any odd sorting behavior. This may/may not
> be
> > > super critical depending on the business use case and that is where you
> > > might need to make a decision.
> > >
> > > Thanks,
> > > Vinay
> > >
> > > On Sat, Jan 8, 2022 at 10:27 PM Rahul Goswami 
> > > wrote:
> > >
> > > > Hello,
> > > > Would appreciate any insights on the issue. Are there any backward
> > > > incompatible changes in 8.x index because of which the lucene
> upgrader is
> > > > unable to upgrade any index EVER touched by <= 6.x ? Or is the
> > > restriction
> > > > more of a safety net at this point for possible future
> incompatibilities
> > > ?
> > > >
> > > > Thanks,
> > > > Rahul
> > > >
> > > > On Thu, Jan 6, 2022 at 11:46 PM Rahul Go

Re: Moving from lucene 6.x to 8.x

2022-01-11 Thread Rahul Goswami
Thanks Vinay for the link to Erick's talk! I hadn't seen it and I must
admit it did help put a few things into perspective.

I was able to track down the JIRAs (thank you 'git blame')
surrounding/leading up to this architectural decision and the linked
patches:
https://issues.apache.org/jira/browse/LUCENE-7703  (Record the version that
was used at index creation time)
https://issues.apache.org/jira/browse/LUCENE-7730  (Better encode length
normalization in similarities)
https://issues.apache.org/jira/browse/LUCENE-7837  (Use
indexCreatedVersionMajor to fail opening too old indices)

From these JIRAs what I was able to piece together is that if not
reindexed, relevance scoring might act in unpredictable ways. For my use
case, I can live with that since we provide an explicit sort on one or more
fields.
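
For anyone checking their own indexes: LUCENE-7703 records the creating major
version in the segments file, and it can be read back with SegmentInfos. A
minimal sketch (assuming the index path is passed as the first argument):

import java.nio.file.Paths;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexCreatedVersion {
  public static void main(String[] args) throws Exception {
    // args[0] is assumed to be the index directory.
    try (Directory dir = FSDirectory.open(Paths.get(args[0]))) {
      SegmentInfos infos = SegmentInfos.readLatestCommit(dir);
      // LUCENE-7703/LUCENE-7837: 8.x refuses to open an index whose
      // recorded creation version is older than 7.
      System.out.println("indexCreatedVersionMajor = "
          + infos.getIndexCreatedVersionMajor());
    }
  }
}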

In LUCENE-7703, Adrien says "we will reject broken offsets in term vectors
as of 7.0". So my questions to the community are
i) What are these offsets, and what feature/s might break with respect to
these offsets if not reindexed?
ii) Do the length normalization changes in LUCENE-7730 affect only
relevance scores?

I understand I could be playing with fire here, but reindexing is not a
practical solution for my situation. At least not in the near future until
I figure out a more seamless way of reindexing with minimal downtime given
that there are multiple 1TB+ indexes. Would appreciate inputs from the dev
community on this.

Thanks,
Rahul

On Sun, Jan 9, 2022 at 2:41 PM Vinay Rajput 
wrote:

> Hi Rahul,
>
> I am not an expert so someone else might provide a better answer. However,
> I remember
> @Erick briefly talked about this restriction in one of his talks here:-
> https://www.youtube.com/watch?v=eaQBH_H3d3g&t=621s (not sure if you have
> seen it already).
>
> As he explains, earlier it looked like IndexUpgrader tool was doing the job
> perfectly but it wasn't always the case. There is no guarantee that after
> using the IndexUpgrader tool, your 8.x index will keep all of the
> characteristics of lucene 8. There can be some situations (e.g. incorrect
> offset) where you might get an incorrect relevance score which might be
> difficult to trace and debug. So, Lucene developers now made it explicit
> that what people were doing earlier was not ideal, and they should now plan
> to reindex all the documents during the major upgrade.
>
> Having said that, what you have done can just work without any issue as
> long as you don't encounter any odd sorting behavior. This may/may not be
> super critical depending on the business use case and that is where you
> might need to make a decision.
>
> Thanks,
> Vinay
>
> On Sat, Jan 8, 2022 at 10:27 PM Rahul Goswami 
> wrote:
>
> > Hello,
> > Would appreciate any insights on the issue. Are there any backward
> > incompatible changes in 8.x index because of which the lucene upgrader is
> > unable to upgrade any index EVER touched by <= 6.x ? Or is the
> restriction
> > more of a safety net at this point for possible future incompatibilities
> ?
> >
> > Thanks,
> > Rahul
> >
> > On Thu, Jan 6, 2022 at 11:46 PM Rahul Goswami 
> > wrote:
> >
> > > Hello,
> > > I am using Apache Solr 7.7.2 with indexes which were originally created
> > on
> > > 4.8 and upgraded ever since. I recently tried upgrading to 8.x using
> the
> > > lucene IndexUpgrader tool and the upgrade fails. I know that lucene 8.x
> > > prevents opening any segment which was touched by <= 6.x at any point
> in
> > > the past. I also know the general recommendation is to reindex upon
> > > migration to another major release, however it is not always feasible.
> > >
> > > So I tried to remove the check for LATEST-1 in SegmentInfos.java (
> > >
> >
> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java#L321
> > )
> > > and also checked for other references to IndexFormatTooOldException.
> > Turns
> > > out that removing this check and rebuilding lucene-core lets the
> upgrade
> > go
> > > through fine. I ran a full sequence of index upgrades from 5.x -> 6.x
> ->
> > > 7.x -> 8.x, which went through fine. Also search/update operations work
> > > without any issues in 8.x.
> > >
> > > I could not find any JIRAs which talk about the technical reason behind
> > > imposing this restriction, and would like to know the nitty-gritties.
> > Also
> > > would like to know about any potential pitfalls that I might be
> > overlooking
> > > with the above hack.
> > >
> > > Thanks,
> > > Rahul
> > >
> > >
> >
>


Re: Moving from lucene 6.x to 8.x

2022-01-08 Thread Rahul Goswami
Hello,
Would appreciate any insights on the issue. Are there any backward
incompatible changes in 8.x index because of which the lucene upgrader is
unable to upgrade any index EVER touched by <= 6.x ? Or is the restriction
more of a safety net at this point for possible future incompatibilities ?

Thanks,
Rahul

On Thu, Jan 6, 2022 at 11:46 PM Rahul Goswami  wrote:

> Hello,
> I am using Apache Solr 7.7.2 with indexes which were originally created on
> 4.8 and upgraded ever since. I recently tried upgrading to 8.x using the
> lucene IndexUpgrader tool and the upgrade fails. I know that lucene 8.x
> prevents opening any segment which was touched by <= 6.x at any point in
> the past. I also know the general recommendation is to reindex upon
> migration to another major release, however it is not always feasible.
>
> So I tried to remove the check for LATEST-1 in SegmentInfos.java (
> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java#L321)
> and also checked for other references to IndexFormatTooOldException. Turns
> out that removing this check and rebuilding lucene-core lets the upgrade go
> through fine. I ran a full sequence of index upgrades from 5.x -> 6.x ->
> 7.x -> 8.x, which went through fine. Also search/update operations work
> without any issues in 8.x.
>
> I could not find any JIRAs which talk about the technical reason behind
> imposing this restriction, and would like to know the nitty-gritties. Also
> would like to know about any potential pitfalls that I might be overlooking
> with the above hack.
>
> Thanks,
> Rahul
>
>


Moving from lucene 6.x to 8.x

2022-01-06 Thread Rahul Goswami
Hello,
I am using Apache Solr 7.7.2 with indexes which were originally created on
4.8 and upgraded ever since. I recently tried upgrading to 8.x using the
lucene IndexUpgrader tool and the upgrade fails. I know that lucene 8.x
prevents opening any segment which was touched by <= 6.x at any point in
the past. I also know the general recommendation is to reindex upon
migration to another major release, however it is not always feasible.
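
For reference, this is roughly how the IndexUpgrader tool can be driven
programmatically (a sketch only; the command-line form via
org.apache.lucene.index.IndexUpgrader's main is equivalent, and it assumes
lucene-core and lucene-backward-codecs are on the classpath with the index
path as the first argument):

import java.nio.file.Paths;
import org.apache.lucene.index.IndexUpgrader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class UpgradeIndex {
  public static void main(String[] args) throws Exception {
    // Rewrites all segments with the current codec. Throws
    // IndexFormatTooOldException if the index was written by a release
    // more than one major version behind.
    try (Directory dir = FSDirectory.open(Paths.get(args[0]))) {
      new IndexUpgrader(dir).upgrade();
    }
  }
}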

So I tried to remove the check for LATEST-1 in SegmentInfos.java (
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java#L321)
and also checked for other references to IndexFormatTooOldException. Turns
out that removing this check and rebuilding lucene-core lets the upgrade go
through fine. I ran a full sequence of index upgrades from 5.x -> 6.x ->
7.x -> 8.x, which went through fine. Also search/update operations work
without any issues in 8.x.

I could not find any JIRAs which talk about the technical reason behind
imposing this restriction, and would like to know the nitty-gritties. Also
would like to know about any potential pitfalls that I might be overlooking
with the above hack.

Thanks,
Rahul


Any downsides to using RAFDirectory instead of SimpleFSDirectory ?

2021-10-21 Thread Rahul Goswami
Hello,
I know RAFDirectory was marked legacy, but can anyone please share any
downsides to using RAFDirectory over SimpleFSDirectory. I am running Solr
on a Windows server and mmap doesn't quite work well there, so I have been
using SimpleFS.

It was working well for the most part, but we recently started seeing
ClosedChannelException with SimpleFS which I have been trying to track. In
the meantime I found this info from SimpleFSDirectory java doc:

https://lucene.apache.org/core/7_7_2/core/org/apache/lucene/store/SimpleFSDirectory.html
*NOTE:* Accessing this class either directly or indirectly from a thread
while it's interrupted can close the underlying file descriptor immediately
if at the same time the thread is blocked on IO. The file descriptor will
remain closed and subsequent access to SimpleFSDirectory will throw a
ClosedChannelException. If your application uses either Thread.interrupt()
or Future.cancel(boolean) you should use the legacy RAFDirectory from the
Lucene misc module in favor of SimpleFSDirectory.
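
For completeness, a minimal sketch of what the switch would look like,
assuming the lucene-misc jar that provides RAFDirectory is on the classpath
and the index path is passed as the first argument:

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAFDirectory;

public class OpenWithRAF {
  public static void main(String[] args) throws Exception {
    // RAFDirectory (lucene-misc) is backed by RandomAccessFile, which is
    // not interruptible, so an interrupted thread does not close the
    // underlying descriptor the way SimpleFSDirectory's FileChannel does.
    try (Directory dir = new RAFDirectory(Paths.get(args[0]));
         DirectoryReader reader = DirectoryReader.open(dir)) {
      System.out.println("numDocs = " + reader.numDocs());
    }
  }
}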

And hence my question:
Are there any downsides to using RAFDirectory instead of SimpleFSDirectory?

Thanks,
Rahul