Re: segment gets corrupted (after background merge ?)

2011-01-18 Thread Stéphane Delprat
I ran more tests: when I execute checkIndex on the master I get random 
errors, but when I scp the files to another server (exactly the same 
software) no errors occur...


We will start using another server.


Just one question concerning checkIndex :

What does "tokens" mean?
How is it possible that the number of tokens changes while the files 
were not modified at all? (This is from the faulty server; on the other 
server the token counts do not change at all.)

(solr was stopped during the whole checkIndex process)


#diff 20110118_141257_checkIndex.log 20110118_142356_checkIndex.log
15c15
<     test: terms, freq, prox...OK [5211271 terms; 39824029 terms/docs pairs; 58236510 tokens]
---
>     test: terms, freq, prox...OK [5211271 terms; 39824029 terms/docs pairs; 58236582 tokens]
43c43
<     test: terms, freq, prox...OK [3947589 terms; 34468256 terms/docs pairs; 36740496 tokens]
---
>     test: terms, freq, prox...OK [3947589 terms; 34468256 terms/docs pairs; 36740533 tokens]
85c85
<     test: terms, freq, prox...OK [2600874 terms; 21272098 terms/docs pairs; 10862212 tokens]
---
>     test: terms, freq, prox...OK [2600874 terms; 21272098 terms/docs pairs; 10862221 tokens]
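Comparing the two logs by eye is error-prone; here is a small standalone sketch (nothing to do with CheckIndex itself — the regex is just matched to the log format quoted above) that pulls the counts out of two saved logs and reports which ones moved:

```python
import re

# Pattern for CheckIndex's "terms, freq, prox" summary line.
LINE_RE = re.compile(r"\[(\d+) terms; (\d+) terms/docs pairs; (\d+) tokens\]")

def extract_counts(log_text):
    """Return one (terms, pairs, tokens) tuple per segment in the log."""
    return [tuple(map(int, m)) for m in LINE_RE.findall(log_text)]

# Excerpts from the two runs quoted above.
run1 = """
test: terms, freq, prox...OK [5211271 terms; 39824029 terms/docs pairs; 58236510 tokens]
test: terms, freq, prox...OK [3947589 terms; 34468256 terms/docs pairs; 36740496 tokens]
"""
run2 = """
test: terms, freq, prox...OK [5211271 terms; 39824029 terms/docs pairs; 58236582 tokens]
test: terms, freq, prox...OK [3947589 terms; 34468256 terms/docs pairs; 36740533 tokens]
"""

for i, (a, b) in enumerate(zip(extract_counts(run1), extract_counts(run2)), 1):
    for name, x, y in zip(("terms", "pairs", "tokens"), a, b):
        if x != y:
            print(f"segment {i}: {name} changed {x} -> {y}")
# -> segment 1: tokens changed 58236510 -> 58236582
# -> segment 2: tokens changed 36740496 -> 36740533
```

Only the token totals drift while terms and terms/docs pairs stay identical, which fits the bit-flipping-hardware theory discussed below.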



Thanks,


On 14/01/2011 12:59, Michael McCandless wrote:

Right, but removing a segment out from under a live IW (when you run
CheckIndex with -fix) is deadly, because that other IW doesn't know
you've removed the segment, and will later commit a new segment infos
still referencing that segment.

The nature of this particular exception from CheckIndex is very
strange... I think it can only be a bug in Lucene, a bug in the JRE or
a hardware issue (bits are flipping somewhere).

I don't think an error in the IO system can cause this particular
exception (it would cause others), because the deleted docs are loaded
up front when SegmentReader is init'd...

This is why I'd really like to see if a given corrupt index always
hits precisely the same exception if you run CheckIndex more than
once.

Mike

On Thu, Jan 13, 2011 at 10:56 PM, Lance Norskog goks...@gmail.com  wrote:

1) CheckIndex is not supposed to change a corrupt segment, only remove it.
2) Are you using local hard disks, or do you run on a common SAN or remote
file server? I have seen corruption errors on SANs, where existing
files have random changes.

On Thu, Jan 13, 2011 at 11:06 AM, Michael McCandless
luc...@mikemccandless.com  wrote:

Generally it's not safe to run CheckIndex if a writer is also open on the index.

It's not safe because CheckIndex could hit FNFE's on opening files,
or, if you use -fix, CheckIndex will change the index out from under
your other IndexWriter (which will then cause other kinds of
corruption).

That said, I don't think the corruption that CheckIndex is detecting
in your index would be caused by having a writer open on the index.
Your first CheckIndex has a different deletes file (_phe_p3.del, with
44824 deleted docs) than the 2nd time you ran it (_phe_p4.del, with
44828 deleted docs), so it must somehow have to do with that change.

One question: if you have a corrupt index, and run CheckIndex on it
several times in a row, does it always fail in the same way?  (Ie the
same term hits the below exception).

Is there any way I could get a copy of one of your corrupt cases?  I
can then dig...

Mike

On Thu, Jan 13, 2011 at 10:52 AM, Stéphane Delprat
stephane.delp...@blogspirit.com  wrote:

I understand less and less what is happening to my solr.

I did a checkIndex (without -fix) and there was an error...

So I did another checkIndex with -fix and then the error was gone. The
segment was alright.


During checkIndex I do not shut down the solr server; I just make sure no
clients connect to it.

Should I shut down the solr server during checkIndex ?



first checkIndex :

  4 of 17: name=_phe docCount=264148
compound=false
hasProx=true
numFiles=9
size (MB)=928.977
diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_phe_p3.del]
test: open reader.OK [44824 deleted docs]
test: fields..OK [51 fields]
test: field norms.OK [51 fields]
test: terms, freq, prox...ERROR [term post_id:562 docFreq=1 != num docs
seen 0 + num docs deleted 0]
java.lang.RuntimeException: term post_id:562 docFreq=1 != num docs seen 0 +
num docs deleted 0
at
org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
test: stored fields...OK [7206878 total field count; avg 32.86 fields
per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq
vector fields per doc]
FAILED
WARNING: fixIndex() would remove reference to this segment

Re: segment gets corrupted (after background merge ?)

2011-01-18 Thread Michael McCandless
OK thanks for bringing closure!

The tokens output is the total number of indexed tokens (ie, as if
you had a counter that counted all tokens produced by analysis as the
indexer consumes them).
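A toy sketch of that counter (a stand-in whitespace analyzer, not Lucene's actual analysis chain) shows how the "tokens" total relates to the "terms" count in the same log line:

```python
# Stand-in for an analyzer: lowercase + whitespace split.
def analyze(text):
    return text.lower().split()

docs = ["Hello Lucene world", "segment merge test", "Hello again"]

total_tokens = 0       # what CheckIndex reports as "tokens"
unique_terms = set()   # "terms" is the count of distinct terms

for doc in docs:
    tokens = analyze(doc)
    total_tokens += len(tokens)   # every token produced by analysis counts
    unique_terms.update(tokens)   # duplicates collapse into one term

print(total_tokens)       # 8  ("hello" is counted twice)
print(len(unique_terms))  # 7  ("hello" is one term)
```

So an unchanged index must report the same token total on every run; a drifting total with unchanged files points away from the index data itself.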

My guess is the faulty server's hardware problem also messed up this count?

Mike


Re: segment gets corrupted (after background merge ?)

2011-01-14 Thread Michael McCandless
Right, but removing a segment out from under a live IW (when you run
CheckIndex with -fix) is deadly, because that other IW doesn't know
you've removed the segment, and will later commit a new segment infos
still referencing that segment.

The nature of this particular exception from CheckIndex is very
strange... I think it can only be a bug in Lucene, a bug in the JRE or
a hardware issue (bits are flipping somewhere).

I don't think an error in the IO system can cause this particular
exception (it would cause others), because the deleted docs are loaded
up front when SegmentReader is init'd...

This is why I'd really like to see if a given corrupt index always
hits precisely the same exception if you run CheckIndex more than
once.

Mike


Re: segment gets corrupted (after background merge ?)

2011-01-14 Thread Stéphane Delprat

So I ran checkIndex (without -fix) 5 times in a row:

SOLR was running, but no clients connected to it (just the slave, which 
was synchronizing every 5 minutes).


summary:

1: all good
2: 2 errors: (seg 1 & 2) terms, freq, prox...ERROR [term blog_id:104150: 
doc 324697 <= lastDoc 324697] & terms, freq, prox...ERROR [term 
blog_keywords:SPORT: doc 174808 <= lastDoc 174808]

3: 1 error: (seg 2) terms, freq, prox...ERROR [Index: 105, Size: 51]
4: all good
5: 1 error: (seg 7) terms, freq, prox...ERROR [term blog_comments: %X 
docFreq=1 != num docs seen 0 + num docs deleted 0]


Seems to me that some random things are happening here.

File system is ext3, on a physical server.
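One way to make the (non-)determinism explicit is to collect the ERROR lines from each run and compare them. This is a small helper of my own (not part of Lucene), with the error strings from the summary above pasted in as sample data; in practice you would read them from the saved checkIndex logs:

```python
import re

# CheckIndex error detail sits inside "ERROR [...]".
ERROR_RE = re.compile(r"ERROR \[(.+?)\]")

# One entry per failing run, as reported above (runs 1 and 4 were clean).
runs = {
    "run2": "test: terms, freq, prox...ERROR [term blog_id:104150: doc 324697 <= lastDoc 324697]",
    "run3": "test: terms, freq, prox...ERROR [Index: 105, Size: 51]",
    "run5": "test: terms, freq, prox...ERROR [term blog_comments: %X docFreq=1 != num docs seen 0 + num docs deleted 0]",
}

errors = {name: ERROR_RE.findall(text) for name, text in runs.items()}

# Deterministic corruption would repeat the same error every run;
# a different error each time points at flaky hardware or the JRE,
# not at a bad byte sitting in a file.
distinct = {e for errs in errors.values() for e in errs}
print(f"{len(distinct)} distinct errors across {len(runs)} failing runs")
# -> 3 distinct errors across 3 failing runs
```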


Here are the logs of the interesting segments :

** 1 **

  1 of 17: name=_nqt docCount=431889
compound=false
hasProx=true
numFiles=9
size (MB)=1,671.375
diagnostics = {optimize=false, mergeFactor=10, 
os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, 
lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, 
os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}

has deletions [delFileName=_nqt_1y2.del]
test: open reader.OK [41918 deleted docs]
test: fields..OK [51 fields]
test: field norms.OK [51 fields]
test: terms, freq, prox...OK [5211271 terms; 39824029 terms/docs 
pairs; 59357374 tokens]
test: stored fields...OK [11505678 total field count; avg 
29.504 fields per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq 
vector fields per doc]


  2 of 17: name=_ol7 docCount=913886
compound=false
hasProx=true
numFiles=9
size (MB)=3,567.739
diagnostics = {optimize=false, mergeFactor=10, 
os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, 
lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, 
os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}

has deletions [delFileName=_ol7_1mc.del]
test: open reader.OK [74076 deleted docs]
test: fields..OK [51 fields]
test: field norms.OK [51 fields]
test: terms, freq, prox...OK [9825896 terms; 93954470 terms/docs 
pairs; 132337348 tokens]
test: stored fields...OK [26933113 total field count; avg 32.07 
fields per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq 
vector fields per doc]


** 2 **

  1 of 17: name=_nqt docCount=431889
compound=false
hasProx=true
numFiles=9
size (MB)=1,671.375
diagnostics = {optimize=false, mergeFactor=10, 
os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, 
lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, 
os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_nqt_1y2.del]
test: open reader.OK [41918 deleted docs]
test: fields..OK [51 fields]
test: field norms.OK [51 fields]
test: terms, freq, prox...ERROR [term blog_id:104150: doc 324697 <= 
lastDoc 324697]
java.lang.RuntimeException: term blog_id:104150: doc 324697 <= lastDoc 
324697
at 
org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:644)
at 
org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)

at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
test: stored fields...OK [11505678 total field count; avg 
29.504 fields per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq 
vector fields per doc]

FAILED
WARNING: fixIndex() would remove reference to this segment; full 
exception:

java.lang.RuntimeException: Term Index test failed
at 
org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)

at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)

  2 of 17: name=_ol7 docCount=913886
compound=false
hasProx=true
numFiles=9
size (MB)=3,567.739
diagnostics = {optimize=false, mergeFactor=10, 
os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, 
lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, 
os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}

has deletions [delFileName=_ol7_1mc.del]
test: open reader.OK [74076 deleted docs]
test: fields..OK [51 fields]
test: field norms.OK [51 fields]
test: terms, freq, prox...ERROR [term blog_keywords:SPORT: doc 
174808 <= lastDoc 174808]
java.lang.RuntimeException: term blog_keywords:SPORT: doc 174808 <= 
lastDoc 174808
at 
org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:644)
at 
org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)

at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
test: stored fields...OK [26933113 total field count; avg 32.07 
fields per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq 
vector fields per doc]


Re: segment gets corrupted (after background merge ?)

2011-01-14 Thread Michael McCandless
OK given that you're seeing non-deterministic results on the same
index... I think this is likely a hardware issue or a JRE bug?

If you move that index over to another env and run CheckIndex, is it consistent?

Mike


Re: segment gets corrupted (after background merge ?)

2011-01-13 Thread Stéphane Delprat

I understand less and less what is happening to my solr.

I did a checkIndex (without -fix) and there was an error...

So I did another checkIndex with -fix and then the error was gone. The 
segment was alright.



During checkIndex I do not shut down the solr server; I just make sure 
no clients connect to it.


Should I shut down the solr server during checkIndex ?



first checkIndex :

  4 of 17: name=_phe docCount=264148
compound=false
hasProx=true
numFiles=9
size (MB)=928.977
diagnostics = {optimize=false, mergeFactor=10, 
os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, 
lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, 
os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}

has deletions [delFileName=_phe_p3.del]
test: open reader.OK [44824 deleted docs]
test: fields..OK [51 fields]
test: field norms.OK [51 fields]
test: terms, freq, prox...ERROR [term post_id:562 docFreq=1 != num 
docs seen 0 + num docs deleted 0]
java.lang.RuntimeException: term post_id:562 docFreq=1 != num docs seen 
0 + num docs deleted 0
at 
org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
at 
org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)

at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
test: stored fields...OK [7206878 total field count; avg 32.86 
fields per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq 
vector fields per doc]

FAILED
WARNING: fixIndex() would remove reference to this segment; full 
exception:

java.lang.RuntimeException: Term Index test failed
at 
org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)

at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)


a few minutes later:

  4 of 18: name=_phe docCount=264148
compound=false
hasProx=true
numFiles=9
size (MB)=928.977
diagnostics = {optimize=false, mergeFactor=10, 
os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, 
lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, 
os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_phe_p4.del]
test: open reader.OK [44828 deleted docs]
test: fields..OK [51 fields]
test: field norms.OK [51 fields]
test: terms, freq, prox...OK [3200899 terms; 26804334 terms/docs 
pairs; 28919124 tokens]
test: stored fields...OK [7206764 total field count; avg 32.86 
fields per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq 
vector fields per doc]



On 12/01/2011 16:50, Michael McCandless wrote:

Curious... is it always a docFreq=1 != num docs seen 0 + num docs deleted 0?

It looks like new deletions were flushed against the segment (del file
changed from _ncc_22s.del to _ncc_24f.del).

Are you hitting any exceptions during indexing?

Mike

On Wed, Jan 12, 2011 at 10:33 AM, Stéphane Delprat
stephane.delp...@blogspirit.com  wrote:

I got another corruption.

It sure looks like it's the same type of error. (on a different field)

It's also not linked to a merge, since the segment size did not change.


*** good segment :

  1 of 9: name=_ncc docCount=1841685
compound=false
hasProx=true
numFiles=9
size (MB)=6,683.447
diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_ncc_22s.del]
test: open reader.OK [275881 deleted docs]
test: fields..OK [51 fields]
test: field norms.OK [51 fields]
test: terms, freq, prox...OK [17952652 terms; 174113812 terms/docs pairs;
204561440 tokens]
test: stored fields...OK [45511958 total field count; avg 29.066
fields per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq
vector fields per doc]


a few hours later:

*** broken segment :

  1 of 17: name=_ncc docCount=1841685
compound=false
hasProx=true
numFiles=9
size (MB)=6,683.447
diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_ncc_24f.del]
test: open reader.OK [278167 deleted docs]
test: fields..OK [51 fields]
test: field norms.OK [51 fields]
test: terms, freq, prox...ERROR [term post_id:1599104 docFreq=1 != num
docs seen 0 + num docs deleted 0]
java.lang.RuntimeException: term post_id:1599104 docFreq=1 != num docs seen
0 + num docs deleted 0
at

Re: segment gets corrupted (after background merge ?)

2011-01-13 Thread Michael McCandless
Generally it's not safe to run CheckIndex if a writer is also open on the index.

It's not safe because CheckIndex could hit FNFE's on opening files,
or, if you use -fix, CheckIndex will change the index out from under
your other IndexWriter (which will then cause other kinds of
corruption).

That said, I don't think the corruption that CheckIndex is detecting
in your index would be caused by having a writer open on the index.
Your first CheckIndex has a different deletes file (_phe_p3.del, with
44824 deleted docs) than the 2nd time you ran it (_phe_p4.del, with
44828 deleted docs), so it must somehow have to do with that change.

One question: if you have a corrupt index, and run CheckIndex on it
several times in a row, does it always fail in the same way?  (Ie the
same term hits the below exception).

Is there any way I could get a copy of one of your corrupt cases?  I
can then dig...

Mike

On Thu, Jan 13, 2011 at 10:52 AM, Stéphane Delprat
stephane.delp...@blogspirit.com wrote:
 I understand less and less what is happening to my solr.

 I did a checkIndex (without -fix) and there was an error...

 So a did another checkIndex with -fix and then the error was gone. The
 segment was alright


 During checkIndex I do not shut down the solr server, I just make sure no
 client connect to the server.

 Should I shut down the solr server during checkIndex ?



 first checkIndex :

  4 of 17: name=_phe docCount=264148
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=928.977
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
 os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
 java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_phe_p3.del]
    test: open reader.OK [44824 deleted docs]
    test: fields..OK [51 fields]
    test: field norms.OK [51 fields]
    test: terms, freq, prox...ERROR [term post_id:562 docFreq=1 != num docs
 seen 0 + num docs deleted 0]
 java.lang.RuntimeException: term post_id:562 docFreq=1 != num docs seen 0 +
 num docs deleted 0
        at
 org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
    test: stored fields...OK [7206878 total field count; avg 32.86 fields
 per doc]
    test: term vectorsOK [0 total vector count; avg 0 term/freq
 vector fields per doc]
 FAILED
    WARNING: fixIndex() would remove reference to this segment; full
 exception:
 java.lang.RuntimeException: Term Index test failed
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)


 a few minutes later:

  4 of 18: name=_phe docCount=264148
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=928.977
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
 os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_phe_p4.del]
    test: open reader.OK [44828 deleted docs]
    test: fields..OK [51 fields]
    test: field norms.OK [51 fields]
    test: terms, freq, prox...OK [3200899 terms; 26804334 terms/docs pairs;
 28919124 tokens]
    test: stored fields...OK [7206764 total field count; avg 32.86 fields
 per doc]
    test: term vectorsOK [0 total vector count; avg 0 term/freq
 vector fields per doc]


 Le 12/01/2011 16:50, Michael McCandless a écrit :

 Curious... is it always a "docFreq=1 != num docs seen 0 + num docs deleted 0"?

 It looks like new deletions were flushed against the segment (del file
 changed from _ncc_22s.del to _ncc_24f.del).

 Are you hitting any exceptions during indexing?

 Mike

 On Wed, Jan 12, 2011 at 10:33 AM, Stéphane Delprat
 stephane.delp...@blogspirit.com  wrote:

 I got another corruption.

 It sure looks like it's the same type of error. (on a different field)

 It's also not linked to a merge, since the segment size did not change.


 *** good segment :

  1 of 9: name=_ncc docCount=1841685
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=6,683.447
    diagnostics = {optimize=false, mergeFactor=10,
 os.version=2.6.26-2-amd64,
 os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_ncc_22s.del]
    test: open reader.OK [275881 deleted docs]
    test: fields..OK [51 fields]
    test: field norms.OK [51 fields]
    test: terms, freq, prox...OK [17952652 terms; 174113812 terms/docs
 pairs;
 204561440 tokens]
    test: stored fields...OK 

Re: segment gets corrupted (after background merge ?)

2011-01-13 Thread Lance Norskog
1) CheckIndex is not supposed to change a corrupt segment, only remove it.
2) Are you using local hard disks, or do you run on a common SAN or remote
file server? I have seen corruption errors on SANs, where existing
files have random changes.

On Thu, Jan 13, 2011 at 11:06 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 Generally it's not safe to run CheckIndex if a writer is also open on the 
 index.

 It's not safe because CheckIndex could hit FNFEs (FileNotFoundExceptions) on opening files,
 or, if you use -fix, CheckIndex will change the index out from under
 your other IndexWriter (which will then cause other kinds of
 corruption).

 That said, I don't think the corruption that CheckIndex is detecting
 in your index would be caused by having a writer open on the index.
 Your first CheckIndex has a different deletes file (_phe_p3.del, with
 44824 deleted docs) than the 2nd time you ran it (_phe_p4.del, with
 44828 deleted docs), so it must somehow have to do with that change.
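In practice that means guaranteeing no IndexWriter is open before checking. A hedged sketch of a safe, read-only invocation (the jar path, index directory, and init script are assumptions about this installation; CheckIndex itself ships in lucene-core):

```shell
# Sketch only -- jar location, index path, and init script are assumptions.
# 1. Stop Solr so nothing holds an IndexWriter on the index:
/etc/init.d/solr stop

# 2. Read-only check (no -fix): safe, changes nothing.
#    -ea enables assertions for the Lucene packages, as CheckIndex recommends.
java -ea:org.apache.lucene... \
     -cp /opt/solr/lib/lucene-core-2.9.3.jar \
     org.apache.lucene.index.CheckIndex /var/solr/data/index

# 3. Only if you accept losing the bad segment's documents, and ONLY while
#    Solr is still stopped, re-run with -fix:
# java -ea:org.apache.lucene... -cp /opt/solr/lib/lucene-core-2.9.3.jar \
#      org.apache.lucene.index.CheckIndex /var/solr/data/index -fix

/etc/init.d/solr start
```

Running the read-only form while Solr is up risks spurious FNFEs; running -fix while Solr is up risks the real corruption Mike describes.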

 [...]

Re: segment gets corrupted (after background merge ?)

2011-01-12 Thread Stéphane Delprat

I got another corruption.

It sure looks like it's the same type of error. (on a different field)

It's also not linked to a merge, since the segment size did not change.


*** good segment :

  1 of 9: name=_ncc docCount=1841685
compound=false
hasProx=true
numFiles=9
size (MB)=6,683.447
diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_ncc_22s.del]
test: open reader.OK [275881 deleted docs]
test: fields..OK [51 fields]
test: field norms.OK [51 fields]
test: terms, freq, prox...OK [17952652 terms; 174113812 terms/docs 
pairs; 204561440 tokens]
test: stored fields...OK [45511958 total field count; avg 
29.066 fields per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq 
vector fields per doc]



a few hours later:

*** broken segment :

  1 of 17: name=_ncc docCount=1841685
compound=false
hasProx=true
numFiles=9
size (MB)=6,683.447
diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_ncc_24f.del]
test: open reader.OK [278167 deleted docs]
test: fields..OK [51 fields]
test: field norms.OK [51 fields]
test: terms, freq, prox...ERROR [term post_id:1599104 docFreq=1 != 
num docs seen 0 + num docs deleted 0]
java.lang.RuntimeException: term post_id:1599104 docFreq=1 != num docs 
seen 0 + num docs deleted 0
at 
org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
at 
org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)

at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
test: stored fields...OK [45429565 total field count; avg 
29.056 fields per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq 
vector fields per doc]

FAILED
WARNING: fixIndex() would remove reference to this segment; full 
exception:

java.lang.RuntimeException: Term Index test failed
at 
org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)

at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)


I'll activate infoStream for next time.
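For anyone following along: if I recall the Solr 1.4 example config correctly, infoStream is a one-line switch in solrconfig.xml under indexDefaults (the file name below is just the example default):

```xml
<!-- solrconfig.xml, inside <indexDefaults>: routes Lucene's IndexWriter
     debug output to a file. It is verbose; set back to false when done. -->
<infoStream file="INFOSTREAM.txt">true</infoStream>
```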


Thanks,


Le 12/01/2011 00:49, Michael McCandless a écrit :

When you hit corruption is it always this same problem?:

   java.lang.RuntimeException: term source:margolisphil docFreq=1 !=
num docs seen 0 + num docs deleted 0

Can you run with Lucene's IndexWriter infoStream turned on, and catch
the output leading to the corruption?  If something is somehow messing
up the bits in the deletes file that could cause this.

Mike

On Mon, Jan 10, 2011 at 5:52 AM, Stéphane Delprat
stephane.delp...@blogspirit.com  wrote:

Hi,

We are using :
Solr Specification Version: 1.4.1
Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 18:06:42
Lucene Specification Version: 2.9.3
Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55

# java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)

We want to index 4M docs in one core (and when it works fine we will add
other cores with 2M on the same server) (1 doc ~= 1kB)

We use SOLR replication every 5 minutes to update the slave server (queries
are executed on the slave only)

Documents change very quickly; during a normal day we will have approximately:
* 200 000 updated docs
* 1000 new docs
* 200 deleted docs


I attached the last good checkIndex : solr20110107.txt
And the corrupted one : solr20110110.txt


This is not the first time a segment has gotten corrupted on this server; that's 
why I run frequent checkIndex passes. (But as you can see, the first segment has 
1,800,000 docs and it checks out fine!)


I can't find any SEVERE or FATAL messages, or any exceptions, in the Solr logs.


I also attached my schema.xml and solrconfig.xml


Is there something wrong with what we are doing? Do you need other info?


Thanks,





Re: segment gets corrupted (after background merge ?)

2011-01-12 Thread Michael McCandless
Curious... is it always a "docFreq=1 != num docs seen 0 + num docs deleted 0"?

It looks like new deletions were flushed against the segment (del file
changed from _ncc_22s.del to _ncc_24f.del).

Are you hitting any exceptions during indexing?

Mike

On Wed, Jan 12, 2011 at 10:33 AM, Stéphane Delprat
stephane.delp...@blogspirit.com wrote:
 [...]





Re: segment gets corrupted (after background merge ?)

2011-01-11 Thread Michael McCandless
When you hit corruption is it always this same problem?:

  java.lang.RuntimeException: term source:margolisphil docFreq=1 !=
num docs seen 0 + num docs deleted 0

Can you run with Lucene's IndexWriter infoStream turned on, and catch
the output leading to the corruption?  If something is somehow messing
up the bits in the deletes file that could cause this.

Mike
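For readers of the thread: the invariant CheckIndex is enforcing here is simply that a term's stored docFreq equals the number of live documents its postings reach plus the number that are marked deleted. A toy sketch of that check (not Lucene's actual code):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model of the per-term consistency check CheckIndex performs:
// stored docFreq must equal (live docs seen) + (docs marked deleted).
public class DocFreqCheck {

    static boolean termConsistent(int storedDocFreq, int[] postings, Set<Integer> deletedDocs) {
        int seen = 0;      // live documents the postings point at
        int deleted = 0;   // postings that land on deleted documents
        for (int doc : postings) {
            if (deletedDocs.contains(doc)) deleted++;
            else seen++;
        }
        return storedDocFreq == seen + deleted;
    }

    public static void main(String[] args) {
        // Healthy term: docFreq=2, postings {5, 9}, doc 9 deleted -> 1 + 1 == 2.
        Set<Integer> dels = new HashSet<>(Arrays.asList(9));
        System.out.println(termConsistent(2, new int[]{5, 9}, dels));        // true
        // The corrupt case from the logs: docFreq claims 1, but 0 seen + 0 deleted.
        System.out.println(termConsistent(1, new int[]{}, new HashSet<>())); // false
    }
}
```

The corrupt segments in this thread fail exactly this comparison, which is why a flipped bit in a .del file (or in the term dictionary's docFreq) is enough to trigger it.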

On Mon, Jan 10, 2011 at 5:52 AM, Stéphane Delprat
stephane.delp...@blogspirit.com wrote:
 [...]