OK, thanks for bringing closure!

The "tokens" output is the total number of indexed tokens (i.e., as if
you had a counter counting every token produced by analysis as the
indexer consumes them).
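In other words, the three numbers on the "terms, freq, prox" line relate like this (a toy sketch in plain Java, no Lucene involved; the postings data is invented purely for illustration):

```java
import java.util.*;

public class TokenStats {
    // Summarize a toy postings map (term -> docId -> freq) the way the
    // CheckIndex "terms, freq, prox" line does.
    static String summarize(Map<String, Map<Integer, Integer>> postings) {
        long terms = postings.size();   // unique terms
        long pairs = 0;                 // sum of docFreq over all terms
        long tokens = 0;                // sum of freq over all (term, doc) pairs
        for (Map<Integer, Integer> docs : postings.values()) {
            pairs += docs.size();
            for (int freq : docs.values()) tokens += freq;
        }
        return terms + " terms; " + pairs + " terms/docs pairs; " + tokens + " tokens";
    }

    public static void main(String[] args) {
        // Hypothetical postings: "lucene" occurs 3x in doc 1 and 1x in doc 2,
        // "solr" occurs 2x in doc 2.
        Map<String, Map<Integer, Integer>> postings = new HashMap<>();
        postings.put("lucene", Map.of(1, 3, 2, 1));
        postings.put("solr", Map.of(2, 2));
        System.out.println(summarize(postings));
        // prints: 2 terms; 3 terms/docs pairs; 6 tokens
    }
}
```

Since "tokens" sums the per-document frequencies, it is the count most sensitive to a mis-read freq value, which fits the drifting numbers in your diff.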

My guess is the faulty server's hardware problem also messed up this count?

Mike

On Tue, Jan 18, 2011 at 9:57 AM, Stéphane Delprat
<stephane.delp...@blogspirit.com> wrote:
> I ran more tests: when I execute checkIndex on the master I get random
> errors, but when I scp the index to another server (exactly the same
> software) no error occurs...
>
> We will start using another server.
>
>
> Just one question concerning checkIndex:
>
> What does "tokens" mean?
> How is it possible that the number of tokens changes while the files were
> not modified at all? (This is from the faulty server; on the other server
> the token counts do not change at all.)
> (Solr was stopped during the whole checkIndex process.)
>
>
> #diff 20110118_141257_checkIndex.log 20110118_142356_checkIndex.log
> 15c15
> <     test: terms, freq, prox...OK [5211271 terms; 39824029 terms/docs pairs; 58236510 tokens]
> ---
>>     test: terms, freq, prox...OK [5211271 terms; 39824029 terms/docs pairs; 58236582 tokens]
> 43c43
> <     test: terms, freq, prox...OK [3947589 terms; 34468256 terms/docs pairs; 36740496 tokens]
> ---
>>     test: terms, freq, prox...OK [3947589 terms; 34468256 terms/docs pairs; 36740533 tokens]
> 85c85
> <     test: terms, freq, prox...OK [2600874 terms; 21272098 terms/docs pairs; 10862212 tokens]
> ---
>>     test: terms, freq, prox...OK [2600874 terms; 21272098 terms/docs pairs; 10862221 tokens]
>
>
> Thanks,
>
>
> Le 14/01/2011 12:59, Michael McCandless a écrit :
>>
>> Right, but removing a segment out from under a live IW (when you run
>> CheckIndex with -fix) is deadly, because that other IW doesn't know
>> you've removed the segment, and will later commit a new segment infos
>> still referencing that segment.
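To see why that sequence is deadly, here is a toy model in plain Java (no Lucene; file and segment names are invented) of a live writer whose in-memory segment list outlives a file that CheckIndex -fix removed:

```java
import java.util.*;

public class FixUnderLiveWriter {
    // Which segments in the writer's in-memory list no longer have their
    // files on disk? A commit would still reference them.
    static List<String> missingSegments(Set<String> filesOnDisk, List<String> writerSegments) {
        List<String> missing = new ArrayList<>();
        for (String seg : writerSegments) {
            if (!filesOnDisk.contains(seg + ".cfs")) missing.add(seg);
        }
        return missing;
    }

    public static void main(String[] args) {
        // The index directory, and a live writer's view of its segments.
        Set<String> filesOnDisk = new HashSet<>(List.of("_1.cfs", "_2.cfs", "segments_2"));
        List<String> writerSegments = List.of("_1", "_2");

        // CheckIndex -fix drops the corrupt segment _2 from disk...
        filesOnDisk.remove("_2.cfs");

        // ...but the live writer still lists _2, so its next commit would
        // write a segments file referencing files that no longer exist.
        System.out.println("commit would reference missing segments: "
                + missingSegments(filesOnDisk, writerSegments));
        // prints: commit would reference missing segments: [_2]
    }
}
```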
>>
>> The nature of this particular exception from CheckIndex is very
>> strange... I think it can only be a bug in Lucene, a bug in the JRE or
>> a hardware issue (bits are flipping somewhere).
>>
>> I don't think an error in the IO system can cause this particular
>> exception (it would cause others), because the deleted docs are loaded
>> up front when SegmentReader is init'd...
>>
>> This is why I'd really like to see if a given corrupt index always
>> hits precisely the same exception if you run CheckIndex more than
>> once.
>>
>> Mike
>>
>> On Thu, Jan 13, 2011 at 10:56 PM, Lance Norskog<goks...@gmail.com>  wrote:
>>>
>>> 1) CheckIndex is not supposed to change a corrupt segment, only remove
>>> it.
>>> 2) Are you using local hard disks, or do you run on a common SAN or
>>> remote file server? I have seen corruption errors on SANs, where
>>> existing files have random changes.
>>>
>>> On Thu, Jan 13, 2011 at 11:06 AM, Michael McCandless
>>> <luc...@mikemccandless.com>  wrote:
>>>>
>>>> Generally it's not safe to run CheckIndex if a writer is also open on
>>>> the index.
>>>>
>>>> It's not safe because CheckIndex could hit FNFE's on opening files,
>>>> or, if you use -fix, CheckIndex will change the index out from under
>>>> your other IndexWriter (which will then cause other kinds of
>>>> corruption).
>>>>
>>>> That said, I don't think the corruption that CheckIndex is detecting
>>>> in your index would be caused by having a writer open on the index.
>>>> Your first CheckIndex has a different deletes file (_phe_p3.del, with
>>>> 44824 deleted docs) than the 2nd time you ran it (_phe_p4.del, with
>>>> 44828 deleted docs), so it must somehow have to do with that change.
>>>>
>>>> One question: if you have a corrupt index, and run CheckIndex on it
>>>> several times in a row, does it always fail in the same way? (I.e.,
>>>> does the same term hit the exception below?)
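For reference, the invariant behind that exception can be sketched like this (plain Java, a simplified stand-in for what CheckIndex's term test actually does; the term and postings are invented):

```java
import java.util.*;

public class DocFreqCheck {
    // The consistency rule behind "docFreq=1 != num docs seen 0 + num docs
    // deleted 0": walking a term's postings, live hits plus deleted hits
    // must add up to the docFreq recorded in the term dictionary.
    static void check(String term, int docFreq, List<Integer> postings, Set<Integer> deleted) {
        int seen = 0, del = 0;
        for (int doc : postings) {
            if (deleted.contains(doc)) del++; else seen++;
        }
        if (docFreq != seen + del) {
            throw new RuntimeException("term " + term + " docFreq=" + docFreq
                    + " != num docs seen " + seen + " + num docs deleted " + del);
        }
    }

    public static void main(String[] args) {
        // Consistent: docFreq 1, one live posting.
        check("post_id:562", 1, List.of(42), Set.of());
        // Inconsistent: docFreq says 1 but no postings were found at all.
        try {
            check("post_id:562", 1, List.of(), Set.of());
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
            // prints: term post_id:562 docFreq=1 != num docs seen 0 + num docs deleted 0
        }
    }
}
```

A "docFreq=1 != ... 0 + 0" failure thus means one posting vanished between the term dictionary and the postings walk, which is why flipped bits in the freq or deletes data are a plausible cause.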
>>>>
>>>> Is there any way I could get a copy of one of your corrupt cases?  I
>>>> can then dig...
>>>>
>>>> Mike
>>>>
>>>> On Thu, Jan 13, 2011 at 10:52 AM, Stéphane Delprat
>>>> <stephane.delp...@blogspirit.com>  wrote:
>>>>>
>>>>> I understand less and less what is happening to my Solr.
>>>>>
>>>>> I did a checkIndex (without -fix) and there was an error...
>>>>>
>>>>> So I did another checkIndex with -fix, and then the error was gone.
>>>>> The segment was alright.
>>>>>
>>>>>
>>>>> During checkIndex I do not shut down the Solr server; I just make
>>>>> sure no client connects to the server.
>>>>>
>>>>> Should I shut down the Solr server during checkIndex?
>>>>>
>>>>>
>>>>>
>>>>> first checkIndex :
>>>>>
>>>>>  4 of 17: name=_phe docCount=264148
>>>>>    compound=false
>>>>>    hasProx=true
>>>>>    numFiles=9
>>>>>    size (MB)=928.977
>>>>>    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
>>>>>    has deletions [delFileName=_phe_p3.del]
>>>>>    test: open reader.........OK [44824 deleted docs]
>>>>>    test: fields..............OK [51 fields]
>>>>>    test: field norms.........OK [51 fields]
>>>>>    test: terms, freq, prox...ERROR [term post_id:562 docFreq=1 != num docs seen 0 + num docs deleted 0]
>>>>> java.lang.RuntimeException: term post_id:562 docFreq=1 != num docs seen 0 + num docs deleted 0
>>>>>        at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
>>>>>        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
>>>>>        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>>>>>    test: stored fields.......OK [7206878 total field count; avg 32.86 fields per doc]
>>>>>    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>>>>> FAILED
>>>>>    WARNING: fixIndex() would remove reference to this segment; full exception:
>>>>> java.lang.RuntimeException: Term Index test failed
>>>>>        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
>>>>>        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>>>>>
>>>>>
>>>>> a few minutes later:
>>>>>
>>>>>  4 of 18: name=_phe docCount=264148
>>>>>    compound=false
>>>>>    hasProx=true
>>>>>    numFiles=9
>>>>>    size (MB)=928.977
>>>>>    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
>>>>>    has deletions [delFileName=_phe_p4.del]
>>>>>    test: open reader.........OK [44828 deleted docs]
>>>>>    test: fields..............OK [51 fields]
>>>>>    test: field norms.........OK [51 fields]
>>>>>    test: terms, freq, prox...OK [3200899 terms; 26804334 terms/docs pairs; 28919124 tokens]
>>>>>    test: stored fields.......OK [7206764 total field count; avg 32.86 fields per doc]
>>>>>    test: term vectors........OK [0 total vector count; avg 0 term/freq
>>>>> vector fields per doc]
>>>>>
>>>>>
>>>>> Le 12/01/2011 16:50, Michael McCandless a écrit :
>>>>>>
>>>>>> Curious... is it always a docFreq=1 != num docs seen 0 + num docs deleted 0?
>>>>>>
>>>>>> It looks like new deletions were flushed against the segment (del file
>>>>>> changed from _ncc_22s.del to _ncc_24f.del).
>>>>>>
>>>>>> Are you hitting any exceptions during indexing?
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>> On Wed, Jan 12, 2011 at 10:33 AM, Stéphane Delprat
>>>>>> <stephane.delp...@blogspirit.com>    wrote:
>>>>>>>
>>>>>>> I got another corruption.
>>>>>>>
>>>>>>> It sure looks like it's the same type of error. (on a different
>>>>>>> field)
>>>>>>>
>>>>>>> It's also not linked to a merge, since the segment size did not
>>>>>>> change.
>>>>>>>
>>>>>>>
>>>>>>> *** good segment :
>>>>>>>
>>>>>>>  1 of 9: name=_ncc docCount=1841685
>>>>>>>    compound=false
>>>>>>>    hasProx=true
>>>>>>>    numFiles=9
>>>>>>>    size (MB)=6,683.447
>>>>>>>    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
>>>>>>>    has deletions [delFileName=_ncc_22s.del]
>>>>>>>    test: open reader.........OK [275881 deleted docs]
>>>>>>>    test: fields..............OK [51 fields]
>>>>>>>    test: field norms.........OK [51 fields]
>>>>>>>    test: terms, freq, prox...OK [17952652 terms; 174113812 terms/docs pairs; 204561440 tokens]
>>>>>>>    test: stored fields.......OK [45511958 total field count; avg 29.066 fields per doc]
>>>>>>>    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>>>>>>>
>>>>>>>
>>>>>>> a few hours later:
>>>>>>>
>>>>>>> *** broken segment :
>>>>>>>
>>>>>>>  1 of 17: name=_ncc docCount=1841685
>>>>>>>    compound=false
>>>>>>>    hasProx=true
>>>>>>>    numFiles=9
>>>>>>>    size (MB)=6,683.447
>>>>>>>    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
>>>>>>>    has deletions [delFileName=_ncc_24f.del]
>>>>>>>    test: open reader.........OK [278167 deleted docs]
>>>>>>>    test: fields..............OK [51 fields]
>>>>>>>    test: field norms.........OK [51 fields]
>>>>>>>    test: terms, freq, prox...ERROR [term post_id:1599104 docFreq=1 != num docs seen 0 + num docs deleted 0]
>>>>>>> java.lang.RuntimeException: term post_id:1599104 docFreq=1 != num docs seen 0 + num docs deleted 0
>>>>>>>        at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
>>>>>>>        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
>>>>>>>        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>>>>>>>    test: stored fields.......OK [45429565 total field count; avg 29.056 fields per doc]
>>>>>>>    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>>>>>>> FAILED
>>>>>>>    WARNING: fixIndex() would remove reference to this segment; full exception:
>>>>>>> java.lang.RuntimeException: Term Index test failed
>>>>>>>        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
>>>>>>>        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>>>>>>>
>>>>>>>
>>>>>>> I'll activate infoStream for next time.
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>>
>>>>>>> Le 12/01/2011 00:49, Michael McCandless a écrit :
>>>>>>>>
>>>>>>>> When you hit corruption is it always this same problem?:
>>>>>>>>
>>>>>>>>   java.lang.RuntimeException: term source:margolisphil docFreq=1 !=
>>>>>>>> num docs seen 0 + num docs deleted 0
>>>>>>>>
>>>>>>>> Can you run with Lucene's IndexWriter infoStream turned on, and
>>>>>>>> catch
>>>>>>>> the output leading to the corruption?  If something is somehow
>>>>>>>> messing
>>>>>>>> up the bits in the deletes file that could cause this.
>>>>>>>>
>>>>>>>> Mike
>>>>>>>>
>>>>>>>> On Mon, Jan 10, 2011 at 5:52 AM, Stéphane Delprat
>>>>>>>> <stephane.delp...@blogspirit.com>      wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> We are using :
>>>>>>>>> Solr Specification Version: 1.4.1
>>>>>>>>> Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17
>>>>>>>>> 18:06:42
>>>>>>>>> Lucene Specification Version: 2.9.3
>>>>>>>>> Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55
>>>>>>>>>
>>>>>>>>> # java -version
>>>>>>>>> java version "1.6.0_20"
>>>>>>>>> Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)
>>>>>>>>>
>>>>>>>>> We want to index 4M docs in one core (and once it works fine we
>>>>>>>>> will add other cores with 2M docs each on the same server)
>>>>>>>>> (1 doc ~= 1 kB)
>>>>>>>>>
>>>>>>>>> We use SOLR replication every 5 minutes to update the slave server
>>>>>>>>> (queries
>>>>>>>>> are executed on the slave only)
>>>>>>>>>
>>>>>>>>> Documents change very quickly; during a normal day we will have
>>>>>>>>> approximately:
>>>>>>>>> * 200 000 updated docs
>>>>>>>>> * 1000 new docs
>>>>>>>>> * 200 deleted docs
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I attached the last good checkIndex : solr20110107.txt
>>>>>>>>> And the corrupted one : solr20110110.txt
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is not the first time a segment has gotten corrupted on this
>>>>>>>>> server; that's why I run checkIndex frequently. (But as you can
>>>>>>>>> see, the first segment has 1,800,000 docs and it works fine!)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I can't find any "SEVERE", "FATAL" or "exception" entries in the Solr logs.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I also attached my schema.xml and solrconfig.xml
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Is there something wrong with what we are doing? Do you need any
>>>>>>>>> other info?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>>>
>>
>
