Re: Luke bugs (Re: [jira] Commented: (LUCENE-1454) Corrupted index produced by lucene 2.4)

2008-11-18 Thread Michael McCandless
errorMsg("Invalid or corrupted index."); return; } {code} So, what Luke really means is there are 0 fields found in the index, ie it's an empty index. You're lucky that I spotted this message ... ;) I'll fix it in the next minor release of Luke, pretty soon. Howev

Luke bugs (Re: [jira] Commented: (LUCENE-1454) Corrupted index produced by lucene 2.4)

2008-11-16 Thread Andrzej Bialecki
1454: You're right, Luke is saying that! But, that's a misleading error -- here are the sources in Luke for that error: {code} fn = ir.getFieldNames(IndexReader.FieldOption.ALL); if (fn.size() == 0) { errorMsg("Invalid or corrupted index."); return; } {code} So, what Lu

[jira] Closed: (LUCENE-1454) Corrupted index produced by lucene 2.4

2008-11-16 Thread Andrew Zhang (JIRA)
when opening an empty index. > Corrupted index produced by lucene 2.4 > -- > > Key: LUCENE-1454 > URL: https://issues.apache.org/jira/browse/LUCENE-1454 > Project: Lucene - Java >

[jira] Commented: (LUCENE-1454) Corrupted index produced by lucene 2.4

2008-11-16 Thread Andrew Zhang (JIRA)
2 AM Andrew Zhang - 16/Nov/08 05:02 AM :) I'll close the jira. Thanks! > Corrupted index produced by lucene 2.4 > -- > > Key: LUCENE-1454 > URL: https://issues.apache.org/jira/browse/LUCENE-1454 >

[jira] Commented: (LUCENE-1454) Corrupted index produced by lucene 2.4

2008-11-16 Thread Andrew Zhang (JIRA)
e of Luke.java, ~Ln 800: fn = ir.getFieldNames(IndexReader.FieldOption.ALL); if (fn.size() == 0) { errorMsg("Invalid or corrupted index."); return; } It seems that a normal empty index will be reported as "Invalid or corrupted index". I tried to create

[jira] Commented: (LUCENE-1454) Corrupted index produced by lucene 2.4

2008-11-16 Thread Michael McCandless (JIRA)
is saying that! But, that's a misleading error -- here are the sources in Luke for that error: {code} fn = ir.getFieldNames(IndexReader.FieldOption.ALL); if (fn.size() == 0) { errorMsg("Invalid or corrupted index."); return; } {code} So, what Luke really means is there are 0 f
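
A minimal sketch of the distinction drawn in this message, assuming a hypothetical `FieldNameSource` interface as a stand-in for the reader (it is not Lucene's or Luke's API, and this is not the fix that shipped in Luke). It shows one way to report an index with zero fields as empty and to reserve the error for a lookup that actually fails:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;

public class OpenIndexCheck {
    /** Hypothetical stand-in for the index reader; only the field-name lookup matters here. */
    interface FieldNameSource {
        Collection<String> getFieldNames() throws Exception;
    }

    /** Hypothetical result type for the sketch. */
    enum Status { OK, EMPTY, UNREADABLE }

    static Status classify(FieldNameSource reader) {
        try {
            Collection<String> fn = reader.getFieldNames();
            // Zero fields means nothing has been indexed yet, not corruption.
            return fn.isEmpty() ? Status.EMPTY : Status.OK;
        } catch (Exception e) {
            // Only a failed lookup is treated as a problem with the index.
            return Status.UNREADABLE;
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(() -> Collections.<String>emptyList()));
        System.out.println(classify(() -> Arrays.asList("title", "body")));
    }
}
```

With this shape, the "Invalid or corrupted index." message would only be shown for the `UNREADABLE` case.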

[jira] Commented: (LUCENE-1454) Corrupted index produced by lucene 2.4

2008-11-16 Thread Andrew Zhang (JIRA)
ndex shows the index is OK, while Luke (both 0.8.1 and 0.9) shows "Invalid or corrupted index". Luke "Tools -> Check index tool" also shows no problem with the index. Looks like a bug in Luke "Open Index". I'll take a closer look soon. Thanks again!

[jira] Commented: (LUCENE-1454) Corrupted index produced by lucene 2.4

2008-11-16 Thread Michael McCandless (JIRA)
and ran CheckIndex on it, and it did not report any exception. There is a leftover write.lock, which you'll need to remove before opening another writer. Can you post the full exception you're hitting? > Corrupted index produ

[jira] Updated: (LUCENE-1454) Corrupted index produced by lucene 2.4

2008-11-15 Thread Andrew Zhang (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Zhang updated LUCENE-1454: - Description: Hi, I found corrupted index produced by lucene-2.4. I can't find a w

[jira] Updated: (LUCENE-1454) Corrupted index produced by lucene 2.4

2008-11-15 Thread Andrew Zhang (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Zhang updated LUCENE-1454: - Attachment: index.zip > Corrupted index produced by lucene

[jira] Created: (LUCENE-1454) Corrupted index produced by lucene 2.4

2008-11-15 Thread Andrew Zhang (JIRA)
Corrupted index produced by lucene 2.4 -- Key: LUCENE-1454 URL: https://issues.apache.org/jira/browse/LUCENE-1454 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions

Re: detected corrupted index / performance improvement

2008-02-08 Thread Doug Cutting
Doug Cutting wrote: The linux kernel dynamically increases the readahead window based on the access pattern: the more you read sequentially, the larger the readahead window. Sorry, it appears that's in 2.6.23, which isn't yet broadly used. http://kernelnewbies.org/Linux_2_6_23#head-102af26593

Re: detected corrupted index / performance improvement

2008-02-08 Thread Doug Cutting
robert engels wrote: But that would mean we should be using at least 250k buffers for the IndexInput ? Not the 16k or so that is the default. Is the OS smart enough to figure out that the file is being sequentially read, and adjust its physical read size to 256k, based on the other concurrent

Re: detected corrupted index / performance improvement

2008-02-08 Thread robert engels
But that would mean we should be using at least 250k buffers for the IndexInput ? Not the 16k or so that is the default. Is the OS smart enough to figure out that the file is being sequentially read, and adjust its physical read size to 256k, based on the other concurrent IO operations. See

Re: detected corrupted index / performance improvement

2008-02-08 Thread Doug Cutting
Michael McCandless wrote: Merging is far more IO intensive. With mergeFactor=10, we read from 40 input streams and write to 4 output streams when merging the tii/tis/frq/prx files. If your disk can transfer at 50MB/s, and takes 5ms/seek, then 250kB reads and writes are the break-even point, w
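
The 250kB figure can be reproduced with a short calculation. This is a sketch under one assumption that the message does not spell out: "break-even" is read here as the chunk size at which the time spent transferring equals the time spent on one seek, with 50MB/s taken as 50,000,000 bytes per second. The class and method names are illustrative only.

```java
public class BreakEven {
    /** Bytes the disk can transfer in the time one seek takes. */
    static long breakEvenBytes(long bytesPerSecond, double seekSeconds) {
        return Math.round(bytesPerSecond * seekSeconds);
    }

    /** Fraction of total time spent transferring when every chunk costs one seek. */
    static double transferFraction(long chunkBytes, long bytesPerSecond, double seekSeconds) {
        double transferSeconds = (double) chunkBytes / bytesPerSecond;
        return transferSeconds / (transferSeconds + seekSeconds);
    }

    public static void main(String[] args) {
        long rate = 50_000_000L;   // 50MB/s, decimal megabytes assumed
        double seek = 0.005;       // 5ms per seek
        System.out.println(breakEvenBytes(rate, seek));                 // 250000
        System.out.println(transferFraction(16 * 1024, rate, seek));    // share of time transferring with 16k chunks
        System.out.println(transferFraction(250_000, rate, seek));      // share of time transferring at break-even
    }
}
```

Under these assumptions, chunks smaller than the break-even size spend most of their time seeking rather than transferring.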

Re: detected corrupted index / performance improvement

2008-02-08 Thread Michael McCandless
Mike, you're right: all lucene files are written sequentially (flushing or merging). It's just a matter of how many are open at once, and whether we are also reading from source(s) files, which affects IO throughput far less than truly random access writes. Plus, as of LUCENE-843, bytes are wri

Re: detected corrupted index / performance improvement

2008-02-07 Thread Mike Klaas
Oh, it certainly causes some random access--I don't deny that. I just want to emphasize that this isn't at all the same as all "random writes", which would be expected to perform an order of magnitude slower. Just did a test where I wrote out a 1gig file in 1K chunks. Then wrote it out in 2 files,

Re: detected corrupted index / performance improvement

2008-02-07 Thread robert engels
I don't think that is true - but I'm probably wrong though :). My understanding is that several files are written in parallel (during the merge), causing random access. After the files are written, then they are all reread and written as a CFS file (essentially sequential - although the read

Re: detected corrupted index / performance improvement

2008-02-07 Thread Mike Klaas
On 7-Feb-08, at 2:00 PM, robert engels wrote: My point is that commit needs to be used in most applications, and the commit in Lucene is very slow. You don't have 2x the IO cost, mainly because only the log file needs to be sync'd. The index only has to be sync'd eventually, in order to

Re: detected corrupted index / performance improvement

2008-02-07 Thread robert engels
My point is that commit needs to be used in most applications, and the commit in Lucene is very slow. You don't have 2x the IO cost, mainly because only the log file needs to be sync'd. The index only has to be sync'd eventually, in order to prune the logfile - this can be done in the back

Re: detected corrupted index / performance improvement

2008-02-07 Thread Michael McCandless
robert engels wrote: I might be misunderstanding 1044. There were several approaches, and I am not certain what was the final??? The final approach (take 7) is to make the index consistent (sync the files) after finishing a merge. Also, a new method ("commit") is added which will force

Re: detected corrupted index / performance improvement

2008-02-07 Thread robert engels
I might be misunderstanding 1044. There were several approaches, and I am not certain what was the final??? I reread the bug and am still a bit unclear. If the segments are sync'd as part of the commit, then yes, that would suffice. The merges don't need to commit, you just can't delete t

Re: detected corrupted index / performance improvement

2008-02-07 Thread robert engels
This is simply not true. Two different issues are at play. You cannot have a true 'commit' unless it is synchronous! Lucene-1044 might allow the index to be brought back to a consistent state, but not one that is consistent with a synchronization point. For example, I write three documents

Re: detected corrupted index / performance improvement

2008-02-07 Thread Michael McCandless
Good idea; I'll call this ("if your hardware ignores the sync() call then you're in trouble") out in the javadocs with LUCENE-1044. Mike Mark Miller wrote: We should really probably mention it in the JavaDoc when the issue is done. I think both yonik and robert pointed it out, and ever

Re: detected corrupted index / performance improvement

2008-02-07 Thread Mark Miller
We should really probably mention it in the JavaDoc when the issue is done. I think both yonik and robert pointed it out, and ever since then I have seen issues regarding it everywhere. http://hardware.slashdot.org/article.pl?sid=05/05/13/0529252 Apparently, you're just not ACID unless you have

Re: detected corrupted index / performance improvement

2008-02-07 Thread Michael McCandless
In fact this is exactly the approach in the final patch on LUCENE-1044 and it gives far better performance than the simply synchronous (original) approach of syncing every segment file on close. Using a transaction log would also require periodic syncing. LUCENE-1044 syncs files after ever

Re: detected corrupted index / performance improvement

2008-02-07 Thread Michael McCandless
DM Smith wrote: On Feb 6, 2008, at 6:42 PM, Mark Miller wrote: Hey DM, Just to recap an earlier thread, you need the sync and you need hardware that doesn't lie to you about the result of the sync. Here is an excerpt about Digg running into that issue: "They had problems with their sto

Re: detected corrupted index / performance improvement

2008-02-07 Thread Michael McCandless
But then you're back to syncing in a BG thread, right? We've come full circle. Asynchronously syncing gives the best performance we've seen so far, and so that's the current patch on LUCENE-1044 (using CMS's threads). Using a transaction log would also require async. syncing, but then would also

Re: detected corrupted index / performance improvement

2008-02-06 Thread robert engels
That is the problem, waiting for the full sync (of all of the segment files) takes quite a while... syncing a single log file is much more efficient. On Feb 6, 2008, at 9:41 PM, Andrew Zhang wrote: On Feb 7, 2008 7:22 AM, robert engels <[EMAIL PROTECTED]> wrote: That doesn't help, with la

Re: detected corrupted index / performance improvement

2008-02-06 Thread Andrew Zhang
On Feb 7, 2008 7:22 AM, robert engels <[EMAIL PROTECTED]> wrote: > That doesn't help, with lazy writing/buffering by the OS, there is no > guarantee that if the last written block is ok, that earlier blocks > in the file are > > The OS/drive is going to physically write them in the most effici

Re: detected corrupted index / performance improvement

2008-02-06 Thread DM Smith
On Feb 6, 2008, at 6:42 PM, Mark Miller wrote: Hey DM, Just to recap an earlier thread, you need the sync and you need hardware that doesn't lie to you about the result of the sync. Here is an excerpt about Digg running into that issue: "They had problems with their storage system telling

Re: detected corrupted index / performance improvement

2008-02-06 Thread Mark Miller
Hey DM, Just to recap an earlier thread, you need the sync and you need hardware that doesn't lie to you about the result of the sync. Here is an excerpt about Digg running into that issue: "They had problems with their storage system telling them writes were on disk when they really weren't

Re: detected corrupted index / performance improvement

2008-02-06 Thread robert engels
That doesn't help, with lazy writing/buffering by the OS, there is no guarantee that if the last written block is ok, that earlier blocks in the file are. The OS/drive is going to physically write them in the most efficient manner. Only after a sync would this hold true (which is what we

Re: detected corrupted index / performance improvement

2008-02-06 Thread robert engels
Yes, but this pruning could be more efficient. On a background thread, get current segment from segments file, call the system wide sync (e.g. System.exec("fsync")), then you can purge the transaction logs for all segments up to that one. Since it is a background operation, you are not bloc

Re: detected corrupted index / performance improvement

2008-02-06 Thread DM Smith
On Feb 6, 2008, at 5:42 PM, Michael McCandless wrote: robert engels wrote: Do we have any way of determining if a segment is definitely OK/ VALID ? The only way I know is the CheckIndex tool, and it's rather slow (and it's not clear that it always catches all corruption). Just a thought.

Re: detected corrupted index / performance improvement

2008-02-06 Thread Michael McCandless
robert engels wrote: Do we have any way of determining if a segment is definitely OK/ VALID ? The only way I know is the CheckIndex tool, and it's rather slow (and it's not clear that it always catches all corruption). If so, a much more efficient transactional system could be developed. S

detected corrupted index / performance improvement

2008-02-05 Thread robert engels
I had a recent sidebar with another user, and it got me to thinking. Do we have any way of determining if a segment is definitely OK/VALID ? If so, a much more efficient transactional system could be developed. Serialize the updates to a log file. Sync the log. Update the lucene index WITHOUT
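
The first two steps proposed here (serialize the update to a log file, then sync the log) might look roughly like the following. This is a sketch only: the `UpdateLog` class and the length-prefixed record format are assumptions made for illustration, not anything from Lucene or from this thread, and replaying or pruning the log is not shown.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical append-only update log; each record is a 4-byte length followed by UTF-8 bytes. */
public class UpdateLog implements AutoCloseable {
    private final FileChannel channel;

    UpdateLog(Path file) throws IOException {
        channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    /** Appends one record and returns only after the OS has been asked to sync it. */
    void append(String update) throws IOException {
        byte[] payload = update.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length);
        buf.putInt(payload.length).put(payload);
        buf.flip();
        while (buf.hasRemaining()) {
            channel.write(buf);
        }
        // Force data and metadata to the device. As noted elsewhere in this thread,
        // this only helps if the hardware honours the sync.
        channel.force(true);
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("updates", ".log");
        try (UpdateLog log = new UpdateLog(file)) {
            log.append("add doc 1");
            log.append("delete doc 7");
        }
        System.out.println(Files.size(file) + " bytes logged");
    }
}
```

Only the single log file is synced on each update in this sketch, which is the cost the proposal is trying to keep small.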

Corrupted Index

2005-09-12 Thread Shane O'Sullivan
last time -- -- -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Monday, April 02, 2002 11:51:42 GMT To: lucene-dev@jakarta.apache.org Cc: [EMAIL PROTECTED] Subject: RE: corrupted index Doug, Yep, I think waiting until after 1.2 would be a good idea.

RE: corrupted index

2005-09-12 Thread Shane O'Sullivan
ic [mailto:[EMAIL PROTECTED] Sent: Monday, April 02, 2002 11:51:42 GMT To: lucene-dev@jakarta.apache.org Cc: [EMAIL PROTECTED] Subject: RE: corrupted index Doug, Yep, I think waiting until after 1.2 would be a good idea. As I find time over the next couple of weeks, I'll try to start put