FileNotFoundException with version 4.10.4

2019-09-10 Thread Stuart Goldberg
We have been using version 4.10.4 for quite some time and ran into the
following issue.

Out of the blue, one of our clients is seeing the exception cited below.
Our log files show no prior evidence of anything going awry; the exception
seems to come out of nowhere.

Is there any known issue with the version we are using that might explain
this?

Is there any way to recover from such a condition short of deleting the
entire index?

java.lang.RuntimeException: java.io.FileNotFoundException: _27b7u.fnm
    at org.apache.lucene.index.TieredMergePolicy$SegmentByteSizeDescending.compare(TieredMergePolicy.java:258)
    at org.apache.lucene.index.TieredMergePolicy$SegmentByteSizeDescending.compare(TieredMergePolicy.java:238)
    at java.util.TimSort.countRunAndMakeAscending(Unknown Source)
    at java.util.TimSort.sort(Unknown Source)
    at java.util.Arrays.sort(Unknown Source)
    at java.util.ArrayList.sort(Unknown Source)
    at java.util.Collections.sort(Unknown Source)
    at org.apache.lucene.index.TieredMergePolicy.findMerges(TieredMergePolicy.java:292)
    at org.apache.lucene.index.IndexWriter.updatePendingMerges(IndexWriter.java:2020)
    at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1984)
    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:441)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:112)
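
The missing file is named after its segment (_27b7u here), so one quick diagnostic is to see which other files of that segment are still on disk. The sketch below is plain Java, not a Lucene API: the class SegmentFiles and the method filesOf are made up for illustration, and it assumes that a segment's files are named either `<segment>.<ext>` or `<segment>_<something>`, which is how 4.x index directories appear to be laid out.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class SegmentFiles {

    /** Returns, sorted, the names in dirListing that belong to the given segment. */
    static List<String> filesOf(String segment, String[] dirListing) {
        List<String> names = new ArrayList<>();
        for (String name : dirListing) {
            // Assumed naming: "_27b7u.fnm" or "_27b7u_1.del"; "_27b7ua.si" is another segment.
            if (name.startsWith(segment + ".") || name.startsWith(segment + "_")) {
                names.add(name);
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        // Usage: java SegmentFiles /path/to/index _27b7u
        String dir = args.length > 0 ? args[0] : ".";
        String segment = args.length > 1 ? args[1] : "_27b7u";
        String[] listing = new File(dir).list();
        if (listing == null) {
            System.err.println("Not a readable directory: " + dir);
            return;
        }
        for (String name : filesOf(segment, listing)) {
            System.out.println(name);
        }
    }
}
```

If the listing shows other files of the segment but no .fnm, only that file went missing; if nothing is listed, the whole segment is gone. Lucene also ships org.apache.lucene.index.CheckIndex, which can report on segments it cannot read and, if asked, drop them. Dropping loses the documents in those segments, so it is safer to try it on a copy of the index first.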

Stuart M Goldberg

Senior Vice President of Software Development
*FIX Flyer LLC*
http://www.FIXFlyer.com/ 

NOTICE TO RECIPIENT: THIS E-MAIL IS MEANT ONLY FOR THE INTENDED
RECIPIENT(S) OF THE TRANSMISSION, AND CONTAINS CONFIDENTIAL INFORMATION
WHICH IS PROPRIETARY TO FIX FLYER LLC. ANY UNAUTHORIZED USE, COPYING,
DISTRIBUTION, OR DISSEMINATION IS STRICTLY PROHIBITED. ALL RIGHTS TO THIS
INFORMATION ARE RESERVED BY FIX FLYER LLC. IF YOU ARE NOT THE INTENDED
RECIPIENT, PLEASE CONTACT THE SENDER BY REPLY EMAIL AND PLEASE DELETE THIS
E-MAIL FROM YOUR SYSTEM AND DESTROY ANY COPIES.

-- 
*Notice to Recipient*: https://www.fixflyer.com/disclaimer 



Re: Deleted documents and NRT Readers

2018-07-20 Thread Stuart Goldberg
Version 4.10.4. Sorry we are woefully behind.



On Fri, Jul 20, 2018 at 1:10 PM Michael McCandless <
luc...@mikemccandless.com> wrote:

> Yeah it is surprising that Lucene applied that one delete when you said it
> didn't have to.
>
> Which Lucene version?
>
> Mike McCandless
>
> http://blog.mikemccandless.com


Re: Deleted documents and NRT Readers

2018-07-19 Thread Stuart Goldberg
Understood. But I would think that in a tiny program where I add one
document and then update it, the load is so small that Lucene surely
would not have applied the delete.

Why am I wrong in thinking this?

On Thu, Jul 19, 2018, 5:50 PM Michael McCandless 
wrote:

> Passing applyDeletes=false means Lucene does not have to apply all of its
> buffered deletes.
>
> But, it still may have already applied some deletes, so there's no
> guarantee that it won't have applied deletes.
>
> Mike McCandless
>
> http://blog.mikemccandless.com


Deleted documents and NRT Readers

2018-07-19 Thread Stuart Goldberg
I use NRT readers all the time. I create them with 'applyDeletes' set to
false for performance reasons and take the javadoc at its word that my code
has to be prepared to deal with deleted documents. I thought I understood
that and I wrote my code to be deleted-document-safe.

But I have recently revisited the issue and tried to understand what
happens using a little test program. I create a document and add it to the
index. I then create a new document that mirrors the first one, but I change
the value of a field. Then I call IndexWriter.updateDocument(), which is a
delete and an add.

I then get an NRT reader with applyDeletes set to false and do a
MatchAllDocsQuery search. I would expect to get 2 documents back: the
original one and the updated one. But I only get back the updated one.

Yet I know that in real code, with thousands of documents flying into the
index, I have gotten deleted documents returned.

Can someone explain to me why my small test program doesn't get the deleted
documents back?
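
For readers wondering what deleted-document-safe code can look like at the application level, here is a minimal sketch. It assumes every document stores an application-level key and a version (or timestamp) that grows with each update. Hit, key, and version are hypothetical names, not Lucene types; the idea is only that when a search returns both the old and the new copy of an updated document, the caller keeps the newest copy per key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical application-side helper, not part of Lucene.
final class LatestHits {

    /** A search hit reduced to the two stored values the application needs. */
    static final class Hit {
        final String key;    // application-level primary key stored in the document
        final long version;  // application-level version or timestamp, larger is newer

        Hit(String key, long version) {
            this.key = key;
            this.version = version;
        }
    }

    /** Keeps only the highest-version hit per key, in first-seen key order. */
    static List<Hit> latestPerKey(List<Hit> hits) {
        Map<String, Hit> best = new LinkedHashMap<>();
        for (Hit h : hits) {
            Hit seen = best.get(h.key);
            if (seen == null || h.version > seen.version) {
                best.put(h.key, h);
            }
        }
        return new ArrayList<>(best.values());
    }

    public static void main(String[] args) {
        // The stale copy of "order-1" (version 1) came back alongside its update (version 2).
        List<Hit> raw = Arrays.asList(
                new Hit("order-1", 1L), new Hit("order-2", 5L), new Hit("order-1", 2L));
        for (Hit h : latestPerKey(raw)) {
            System.out.println(h.key + " v" + h.version);
        }
    }
}
```

The version is compared rather than the Lucene docID because merges may combine segments in an order that does not reflect when documents were added.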




RE: Help! - Max Segment name reached

2018-04-17 Thread Stuart Goldberg
Thanks, I will try that.

Why haven't more people run into this issue? The next segment number is 
persisted, so if an index has a long life it should eventually run into this 
problem.


-----Original Message-----
From: Uwe Schindler <u...@thetaphi.de> 
Sent: Tuesday, April 17, 2018 4:02 PM
To: java-user@lucene.apache.org
Subject: Re: Help! - Max Segment name reached

Hi,

Create a new empty index in a new directory and call addIndexes() on it, passing
the directory with the broken index.

This will copy all segments but renumber them.

Uwe


--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de





Help! - Max Segment name reached

2018-04-17 Thread Stuart Goldberg
We have an index that has run into this bug:
https://issues.apache.org/jira/browse/LUCENE-7999

Although this is reported to be fixed in Lucene 7.2, we are at 4.10.4 and
cannot upgrade.

From looking at the code, it seems that the last segment number counter is
persisted in segments_h. When creating a new segment, Lucene names the segment
based on the persisted counter. If this counter is larger than
Integer.MAX_VALUE, how can we recover this index?

Is there anything we can do?
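
To make the failure concrete: as far as I can tell from the 4.x source, a new segment is named "_" plus the counter rendered in base 36 (Character.MAX_RADIX), and the counter is a Java int. The sketch below is plain Java with hypothetical helper names (SegmentNames, nameFor, counterOf), not Lucene code. It shows how a segment name maps back to its counter and what such a name looks like once an int counter wraps past Integer.MAX_VALUE.

```java
// Hypothetical helpers that mimic the assumed naming scheme; not part of Lucene.
final class SegmentNames {

    /** Builds a segment name from an int counter: "_" + counter in base 36. */
    static String nameFor(int counter) {
        return "_" + Integer.toString(counter, Character.MAX_RADIX);
    }

    /** Recovers the counter value from a segment name such as "_27b7u". */
    static long counterOf(String segmentName) {
        return Long.parseLong(segmentName.substring(1), Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        System.out.println(counterOf("_27b7u"));             // 3700362
        System.out.println(nameFor(Integer.MAX_VALUE));      // _zik0zj
        // One more increment overflows the int to Integer.MIN_VALUE,
        // so the base-36 rendering gains a minus sign.
        System.out.println(nameFor(Integer.MAX_VALUE + 1));  // _-zik0zk
    }
}
```

A name containing "-" is presumably what trips the file-name handling described in LUCENE-7999. The counter only grows, so an index that flushes very small segments very often for years is the kind of index that gets there.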

 


 



Re: Help with huge index

2018-02-28 Thread Stuart Goldberg
Thanks so much. I actually found that my purging routine finished after
about 35 minutes, which is quite acceptable given that this routine is
supposed to run overnight.

On Feb 28, 2018 8:34 PM, "Adrien Grand" <jpou...@gmail.com> wrote:

> Thanks. Deleting lots of documents can indeed trigger a lot of work on the
> Lucene side. First, Lucene likely needs to rewrite the live docs of all your
> segments, and then this might trigger significant merging activity, because
> Lucene tries to keep the number of deleted docs reasonable so that most disk
> space is not spent on deleted docs. I can't think of settings that would
> make it more efficient.
>
> If you call deleteDocuments because you are, e.g., deleting data after a given
> age, it would help to have time-based indices so that you would remove an
> entire index at once rather than large portions of an index.
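
A minimal sketch of the time-based layout suggested above, in plain Java. The directory naming scheme (index-yyyyMMdd), the class name DailyIndexes, and the retention parameter are all assumptions for illustration. The point is that purging becomes a matter of picking whole directories older than a cutoff instead of deleting documents inside one large index.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; the naming scheme is an assumption, not a Lucene convention.
final class DailyIndexes {
    private static final String PREFIX = "index-";
    private static final DateTimeFormatter DAY = DateTimeFormatter.BASIC_ISO_DATE; // yyyyMMdd

    /** Directory name for the index holding the documents of the given day. */
    static String dirNameFor(LocalDate day) {
        return PREFIX + DAY.format(day);
    }

    /** Returns the directory names whose day is more than retentionDays before today. */
    static List<String> expired(List<String> dirNames, LocalDate today, int retentionDays) {
        LocalDate cutoff = today.minusDays(retentionDays);
        List<String> result = new ArrayList<>();
        for (String name : dirNames) {
            if (!name.startsWith(PREFIX)) {
                continue; // not one of our index directories
            }
            try {
                LocalDate day = LocalDate.parse(name.substring(PREFIX.length()), DAY);
                if (day.isBefore(cutoff)) {
                    result.add(name);
                }
            } catch (DateTimeParseException e) {
                // Unexpected suffix: leave the directory alone.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2018, 2, 28);
        List<String> dirs = Arrays.asList("index-20180101", "index-20180227", "notes");
        System.out.println("write to: " + dirNameFor(today));
        System.out.println("safe to delete: " + expired(dirs, today, 30));
    }
}
```

Each day's directory would hold its own index. Searches that must span several days could open one reader per directory and combine them, for example with Lucene's MultiReader.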


Re: Help with huge index

2018-02-28 Thread Stuart Goldberg
I call deleteDocuments

On Feb 28, 2018 8:16 PM, "Adrien Grand" <jpou...@gmail.com> wrote:

> What do you mean by purging? What methods do you call?


Help with huge index

2018-02-28 Thread Stuart Goldberg
I have a huge Lucene index. On disk it's about 24 GB.

I have a purging routine that is supposed to run and purge old docs.

There are about 650 million docs in there, and through testing I have
determined that about 1/3 of these need to be purged.

During the purge, every so often it's apparently doing some flushing and
applying deletes. This makes the process appear to hang. I know it's not
actually hung but doing work, because I have enabled infoStream and I am
getting messages every so often (every 5 minutes).

Is there some trick (index config) I can employ to get this to work faster?

 

Stuart M Goldberg



RE: Problems Refactoring a Lucene Index

2016-08-22 Thread Stuart Goldberg
Understood, but did it use to work?

 


 

From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Monday, August 22, 2016 4:38 PM
To: Stuart Goldberg <sgoldb...@fixflyer.com>
Cc: Lucene Users <java-user@lucene.apache.org>
Subject: Re: Problems Refactoring a Lucene Index

 

The design is indeed trappy, and many users have hit the situation you have, 
and we have tried to fix this before (to change IndexReader.document to return 
a different class than Document), but it didn't "take": 
https://issues.apache.org/jira/browse/LUCENE-6971

 

Have a look at FieldInfo.java to see the metadata it records.

 

The challenge here is Lucene's schema-less-ness.  For example, on a document by 
document basis, you can change how term vectors are indexed, whether a field is 
stored, or omits norms, or indexes only docs and not freqs, etc., for the same 
field across documents, across segments.

 

Lucene only stores in FieldInfo what is necessary for it to read the index 
files, and does not store metadata beyond that.

 

Patches welcome :)  We should fix this trap; it's just that doing so is 
apparently not so easy.




Mike McCandless

http://blog.mikemccandless.com

 


RE: Problems Refactoring a Lucene Index

2016-08-22 Thread Stuart Goldberg
Thanks for the quick response.

I had more or less figured out on my own that I had to recreate the document
from scratch.

But there is something in your response that I don’t understand. You say
“Lucene only preserves the metadata it needs for each field”. What does that
mean? In my posting I gave examples of metadata returned that is clearly the
exact opposite of the metadata that was there when originally indexed.

According to what you are saying, there is metadata that is preserved correctly.
What metadata is that?

Not sure if you are just a Lucene guru (I have your Lucene in Action books!) or
an actual author/contributor to the code, so my observation might not be
appropriately directed at you. But it seems a questionable API design to return
a “Document” from the index whose properties, as described by the Javadoc,
give back bogus data.

And what about the FieldInfo class that purports to give back field
information? Why have such an API if the data it provides is bogus?

 


 

From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Monday, August 22, 2016 10:48 AM
To: Lucene Users <java-user@lucene.apache.org>; sgoldb...@fixflyer.com
Subject: Re: Problems Refactoring a Lucene Index

 

This is unfortunately "by design": Lucene makes no guarantees that the Document 
you retrieve from an IndexReader is precisely the same Document you had indexed.

 

Lucene only preserves the metadata it needs for each field.

 

Your only recourse is to create a new Document using your application level 
information about which fields are tokenized, indexed, etc.
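
One way to keep that application-level information, sketched in plain Java: a small registry that records, per field name, how the application indexes it. The registry is consulted when a document is rebuilt, instead of trusting the FieldType that comes back from the reader. FieldSchema, Spec, and the field names are hypothetical, and turning a Spec into an actual Lucene Field is deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical application-side schema; not a Lucene class.
final class FieldSchema {

    /** How the application wants one field to be indexed. */
    static final class Spec {
        final boolean indexed;
        final boolean tokenized;
        final boolean stored;

        Spec(boolean indexed, boolean tokenized, boolean stored) {
            this.indexed = indexed;
            this.tokenized = tokenized;
            this.stored = stored;
        }
    }

    private final Map<String, Spec> specs = new LinkedHashMap<>();

    /** Registers a field; returns this so definitions can be chained. */
    FieldSchema define(String name, boolean indexed, boolean tokenized, boolean stored) {
        specs.put(name, new Spec(indexed, tokenized, stored));
        return this;
    }

    /** Looks up a field's settings, failing loudly for fields the schema does not know. */
    Spec specFor(String name) {
        Spec spec = specs.get(name);
        if (spec == null) {
            throw new IllegalArgumentException("No schema entry for field: " + name);
        }
        return spec;
    }

    public static void main(String[] args) {
        FieldSchema schema = new FieldSchema()
                .define("orderId", true, false, true)      // exact-match key, never analyzed
                .define("sendingTime", true, false, true);
        // When copying a document, read only the stored value from the old index and
        // build the new field from the schema entry, not from the retrieved FieldType.
        Spec spec = schema.specFor("orderId");
        System.out.println("orderId indexed=" + spec.indexed + " tokenized=" + spec.tokenized);
    }
}
```

Failing on unknown field names is a deliberate choice here: a field that silently falls back to some default is how an index ends up unsearchable on that field without anyone noticing.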




Mike McCandless

http://blog.mikemccandless.com

 


 



Problems Refactoring a Lucene Index

2016-07-08 Thread Stuart Goldberg
As our software goes through its lifecycle, we sometimes have to alter
existing Lucene indexes. The way I have done that in the past is to open the
existing index for reading, read each Document, modify it and write that
Document to a new index. At the end of the process, I delete the old index
and rename the new index to the old name.

I do not do any tokenizing and use no analyzers.

I recently upgraded from Lucene 3.x to 4.10.4. Now I have the following
problem: Suppose the existing document has 10 fields in it and there's one I
have to modify. I remove that field and re-add it with the new settings.
Then I add the Document in its entirety to the new index. I run into the
following problems:

*   I get Exceptions thrown for the fields I don't even touch. That's
because their FieldType has 'tokenized' set to true and it fails because I
am using no analyzers. 'tokenized' is set to true even though when I
originally added the field to the original index I had 'tokenized' set to
false!

*   I have LongFields that come back with 'indexed' set to false even
though in the original index they were indexed! This makes the new index not
searchable on these fields and hence unusable. 

*   I can't even alter 'indexed' for these LongFields because for some
reason the FieldType instance comes back frozen from the IndexReader. Once
frozen, you can't alter it. Even if I create a new FieldType, there is no
way to change the FieldType of a Field.

It seems the returned FieldType contents are kind of random!

I did see in the Javadoc of IndexReader.document() that field metadata is
not returned and that, in fact, a new kind of object such as 'StoredField'
should be returned instead, so there is no pretense of there being any
metadata.

I thought perhaps I could use FieldInfos. But that class returns the same
bogus metadata.  What then is the purpose of FieldInfos if the info is
bogus?

Am I not understanding something here? This is not very usable. What can I
do to work around this? Is this a Lucene bug? Oversight?