Right, but removing a segment out from under a live IW (when you run
CheckIndex with -fix) is deadly, because that other IW doesn't know
you've removed the segment, and will later commit a new segment infos
still referencing that segment.

The nature of this particular exception from CheckIndex is very
strange... I think it can only be a bug in Lucene, a bug in the JRE or
a hardware issue (bits are flipping somewhere).

I don't think an error in the IO system can cause this particular
exception (it would cause others), because the deleted docs are loaded
up front when SegmentReader is init'd...
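(You can see this for yourself: the deletions count is available the moment
the reader opens, before any terms are touched. A minimal read-only sketch
against the 2.9 API; the index path is a placeholder:)

```java
import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class ShowDeletions {
    public static void main(String[] args) throws Exception {
        // Opening the reader loads each segment's .del file up front,
        // so an unreadable deletes file would fail here, not during
        // the later terms/freq/prox check.
        IndexReader reader = IndexReader.open(
                FSDirectory.open(new File("/path/to/index")), true); // read-only
        System.out.println("deleted docs: "
                + (reader.maxDoc() - reader.numDocs()));
        reader.close();
    }
}
```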

This is why I'd really like to see if a given corrupt index always
hits precisely the same exception if you run CheckIndex more than
once.
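
(Something like the following would do it; the jar and index paths are
placeholders, and -fix is deliberately left off so nothing is modified.
The -ea flag enables Lucene's internal assertions, which CheckIndex
recommends:)

```shell
# Run CheckIndex twice, read-only (no -fix), and diff the reports.
java -ea:org.apache.lucene... -cp lucene-core-2.9.3.jar \
  org.apache.lucene.index.CheckIndex /path/to/index > run1.txt
java -ea:org.apache.lucene... -cp lucene-core-2.9.3.jar \
  org.apache.lucene.index.CheckIndex /path/to/index > run2.txt
diff run1.txt run2.txt
```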

Mike

On Thu, Jan 13, 2011 at 10:56 PM, Lance Norskog <goks...@gmail.com> wrote:
> 1) CheckIndex is not supposed to change a corrupt segment, only remove it.
> 2) Are you using local hard disks, or do you run on a common SAN or remote
> file server? I have seen corruption errors on SANs, where existing
> files pick up random changes.
>
> On Thu, Jan 13, 2011 at 11:06 AM, Michael McCandless
> <luc...@mikemccandless.com> wrote:
>> Generally it's not safe to run CheckIndex if a writer is also open on the 
>> index.
>>
>> It's not safe because CheckIndex could hit FNFE's on opening files,
>> or, if you use -fix, CheckIndex will change the index out from under
>> your other IndexWriter (which will then cause other kinds of
>> corruption).
>>
>> That said, I don't think the corruption that CheckIndex is detecting
>> in your index would be caused by having a writer open on the index.
>> Your first CheckIndex has a different deletes file (_phe_p3.del, with
>> 44824 deleted docs) than the 2nd time you ran it (_phe_p4.del, with
>> 44828 deleted docs), so it must somehow have to do with that change.
>>
>> One question: if you have a corrupt index, and run CheckIndex on it
>> several times in a row, does it always fail in the same way?  (I.e.,
>> does the same term hit the exception below?)
>>
>> Is there any way I could get a copy of one of your corrupt cases?  I
>> can then dig...
>>
>> Mike
>>
>> On Thu, Jan 13, 2011 at 10:52 AM, Stéphane Delprat
>> <stephane.delp...@blogspirit.com> wrote:
>>> I understand less and less what is happening to my Solr.
>>>
>>> I did a checkIndex (without -fix) and there was an error...
>>>
>>> So I did another checkIndex with -fix, and then the error was gone. The
>>> segment was alright.
>>>
>>>
>>> During checkIndex I do not shut down the Solr server, I just make sure no
>>> client connects to the server.
>>>
>>> Should I shut down the solr server during checkIndex ?
>>>
>>>
>>>
>>> first checkIndex :
>>>
>>>  4 of 17: name=_phe docCount=264148
>>>    compound=false
>>>    hasProx=true
>>>    numFiles=9
>>>    size (MB)=928.977
>>>    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
>>> os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
>>> 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
>>> java.vendor=Sun Microsystems Inc.}
>>>    has deletions [delFileName=_phe_p3.del]
>>>    test: open reader.........OK [44824 deleted docs]
>>>    test: fields..............OK [51 fields]
>>>    test: field norms.........OK [51 fields]
>>>    test: terms, freq, prox...ERROR [term post_id:562 docFreq=1 != num docs
>>> seen 0 + num docs deleted 0]
>>> java.lang.RuntimeException: term post_id:562 docFreq=1 != num docs seen 0 +
>>> num docs deleted 0
>>>        at
>>> org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
>>>        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
>>>        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>>>    test: stored fields.......OK [7206878 total field count; avg 32.86 fields
>>> per doc]
>>>    test: term vectors........OK [0 total vector count; avg 0 term/freq
>>> vector fields per doc]
>>> FAILED
>>>    WARNING: fixIndex() would remove reference to this segment; full
>>> exception:
>>> java.lang.RuntimeException: Term Index test failed
>>>        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
>>>        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>>>
>>>
>>> a few minutes later:
>>>
>>>  4 of 18: name=_phe docCount=264148
>>>    compound=false
>>>    hasProx=true
>>>    numFiles=9
>>>    size (MB)=928.977
>>>    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
>>> os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
>>> 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
>>> java.vendor=Sun Microsystems Inc.}
>>>    has deletions [delFileName=_phe_p4.del]
>>>    test: open reader.........OK [44828 deleted docs]
>>>    test: fields..............OK [51 fields]
>>>    test: field norms.........OK [51 fields]
>>>    test: terms, freq, prox...OK [3200899 terms; 26804334 terms/docs pairs;
>>> 28919124 tokens]
>>>    test: stored fields.......OK [7206764 total field count; avg 32.86 fields
>>> per doc]
>>>    test: term vectors........OK [0 total vector count; avg 0 term/freq
>>> vector fields per doc]
>>>
>>>
>>> On 12/01/2011 16:50, Michael McCandless wrote:
>>>>
>>>> Curious... is it always a docFreq=1 != num docs seen 0 + num docs deleted
>>>> 0?
>>>>
>>>> It looks like new deletions were flushed against the segment (del file
>>>> changed from _ncc_22s.del to _ncc_24f.del).
>>>>
>>>> Are you hitting any exceptions during indexing?
>>>>
>>>> Mike
>>>>
>>>> On Wed, Jan 12, 2011 at 10:33 AM, Stéphane Delprat
>>>> <stephane.delp...@blogspirit.com>  wrote:
>>>>>
>>>>> I got another corruption.
>>>>>
>>>>> It sure looks like it's the same type of error. (on a different field)
>>>>>
>>>>> It's also not linked to a merge, since the segment size did not change.
>>>>>
>>>>>
>>>>> *** good segment :
>>>>>
>>>>>  1 of 9: name=_ncc docCount=1841685
>>>>>    compound=false
>>>>>    hasProx=true
>>>>>    numFiles=9
>>>>>    size (MB)=6,683.447
>>>>>    diagnostics = {optimize=false, mergeFactor=10,
>>>>> os.version=2.6.26-2-amd64,
>>>>> os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
>>>>> 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
>>>>> java.vendor=Sun Microsystems Inc.}
>>>>>    has deletions [delFileName=_ncc_22s.del]
>>>>>    test: open reader.........OK [275881 deleted docs]
>>>>>    test: fields..............OK [51 fields]
>>>>>    test: field norms.........OK [51 fields]
>>>>>    test: terms, freq, prox...OK [17952652 terms; 174113812 terms/docs
>>>>> pairs;
>>>>> 204561440 tokens]
>>>>>    test: stored fields.......OK [45511958 total field count; avg 29.066
>>>>> fields per doc]
>>>>>    test: term vectors........OK [0 total vector count; avg 0 term/freq
>>>>> vector fields per doc]
>>>>>
>>>>>
>>>>> a few hours later:
>>>>>
>>>>> *** broken segment :
>>>>>
>>>>>  1 of 17: name=_ncc docCount=1841685
>>>>>    compound=false
>>>>>    hasProx=true
>>>>>    numFiles=9
>>>>>    size (MB)=6,683.447
>>>>>    diagnostics = {optimize=false, mergeFactor=10,
>>>>> os.version=2.6.26-2-amd64,
>>>>> os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
>>>>> 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
>>>>> java.vendor=Sun Microsystems Inc.}
>>>>>    has deletions [delFileName=_ncc_24f.del]
>>>>>    test: open reader.........OK [278167 deleted docs]
>>>>>    test: fields..............OK [51 fields]
>>>>>    test: field norms.........OK [51 fields]
>>>>>    test: terms, freq, prox...ERROR [term post_id:1599104 docFreq=1 != num
>>>>> docs seen 0 + num docs deleted 0]
>>>>> java.lang.RuntimeException: term post_id:1599104 docFreq=1 != num docs
>>>>> seen
>>>>> 0 + num docs deleted 0
>>>>>        at
>>>>> org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
>>>>>        at
>>>>> org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
>>>>>        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>>>>>    test: stored fields.......OK [45429565 total field count; avg 29.056
>>>>> fields per doc]
>>>>>    test: term vectors........OK [0 total vector count; avg 0 term/freq
>>>>> vector fields per doc]
>>>>> FAILED
>>>>>    WARNING: fixIndex() would remove reference to this segment; full
>>>>> exception:
>>>>> java.lang.RuntimeException: Term Index test failed
>>>>>        at
>>>>> org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
>>>>>        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>>>>>
>>>>>
>>>>> I'll activate infoStream for next time.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>> On 12/01/2011 00:49, Michael McCandless wrote:
>>>>>>
>>>>>> When you hit corruption is it always this same problem?:
>>>>>>
>>>>>>   java.lang.RuntimeException: term source:margolisphil docFreq=1 !=
>>>>>> num docs seen 0 + num docs deleted 0
>>>>>>
>>>>>> Can you run with Lucene's IndexWriter infoStream turned on, and catch
>>>>>> the output leading to the corruption?  If something is somehow messing
>>>>>> up the bits in the deletes file that could cause this.
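>>>>>>
>>>>>> In Solr that can be switched on from solrconfig.xml; if I remember
>>>>>> right, the 1.4 example config ships with something like this (the
>>>>>> file name is just an example) under <indexDefaults>:
>>>>>>
>>>>>> ```xml
>>>>>> <!-- dumps IndexWriter debug output to the named file when true -->
>>>>>> <infoStream file="INFOSTREAM.txt">true</infoStream>
>>>>>> ```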
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>> On Mon, Jan 10, 2011 at 5:52 AM, Stéphane Delprat
>>>>>> <stephane.delp...@blogspirit.com>    wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We are using :
>>>>>>> Solr Specification Version: 1.4.1
>>>>>>> Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 18:06:42
>>>>>>> Lucene Specification Version: 2.9.3
>>>>>>> Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55
>>>>>>>
>>>>>>> # java -version
>>>>>>> java version "1.6.0_20"
>>>>>>> Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
>>>>>>> Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)
>>>>>>>
>>>>>>> We want to index 4M docs in one core (and once that works fine we will
>>>>>>> add other cores with 2M docs each on the same server) (1 doc ~= 1kB)
>>>>>>>
>>>>>>> We use SOLR replication every 5 minutes to update the slave server
>>>>>>> (queries
>>>>>>> are executed on the slave only)
>>>>>>>
>>>>>>> Documents change very quickly; during a normal day we will have
>>>>>>> approximately:
>>>>>>> * 200 000 updated docs
>>>>>>> * 1000 new docs
>>>>>>> * 200 deleted docs
>>>>>>>
>>>>>>>
>>>>>>> I attached the last good checkIndex : solr20110107.txt
>>>>>>> And the corrupted one : solr20110110.txt
>>>>>>>
>>>>>>>
>>>>>>> This is not the first time a segment has gotten corrupted on this
>>>>>>> server; that's why I ran frequent "checkIndex" passes. (But as you can
>>>>>>> see, the first segment has 1,800,000 docs and it works fine!)
>>>>>>>
>>>>>>>
>>>>>>> I can't find any "SEVERE", "FATAL" or "exception" entries in the Solr logs.
>>>>>>>
>>>>>>>
>>>>>>> I also attached my schema.xml and solrconfig.xml
>>>>>>>
>>>>>>>
>>>>>>> Is there something wrong with what we are doing? Do you need any other
>>>>>>> info?
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>
