[jira] Updated: (LUCENE-743) IndexReader.reopen()
[ https://issues.apache.org/jira/browse/LUCENE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-743:
---------------------------------
    Attachment: lucene-743-take7.patch

Changes:
- Updated patch to current trunk (I just realized that the latest one didn't apply cleanly anymore)
- MultiSegmentReader now decRefs the subReaders correctly in case an exception is thrown during reopen()
- Small changes in TestIndexReaderReopen.java

The thread-safety test still sometimes fails. The weird thing is that the test verifies that the re-opened readers always return correct results. The only problem is that the refCount value is not always 0 at the end of the test. I'm starting to think that the test case itself has a problem. Maybe someone else can take a look - it's probably something really obvious, but I'm already starting to feel dizzy from pondering thread-safety.

> IndexReader.reopen()
>
> Key: LUCENE-743
> URL: https://issues.apache.org/jira/browse/LUCENE-743
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Otis Gospodnetic
> Assignee: Michael Busch
> Priority: Minor
> Fix For: 2.3
> Attachments: IndexReaderUtils.java, lucene-743-take2.patch, lucene-743-take3.patch, lucene-743-take4.patch, lucene-743-take5.patch, lucene-743-take6.patch, lucene-743-take7.patch, lucene-743.patch, lucene-743.patch, lucene-743.patch, MyMultiReader.java, MySegmentReader.java, varient-no-isCloneSupported.BROKEN.patch
>
> This is Robert Engels' implementation of IndexReader.reopen() functionality, as a set of 3 new classes (this was easier for him to implement, but should probably be folded into the core, if this looks good).
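For context, a minimal sketch of the incRef/decRef idiom the test is checking. The class and names are illustrative, not the actual patch code; the invariant the test asserts is simply that every reader's count returns to 0 once all owners have released it:

    // Illustrative sketch of reference counting, not the actual patch code.
    class RefCountedReader {
      private int refCount = 1; // the initial open() holds one reference

      synchronized void incRef() {
        if (refCount <= 0) {
          throw new IllegalStateException("already closed");
        }
        refCount++;
      }

      synchronized void decRef() throws java.io.IOException {
        if (--refCount == 0) {
          doClose(); // release files, norms, etc. exactly once
        }
      }

      protected void doClose() throws java.io.IOException {}
    }

If any code path incRefs without a matching decRef (e.g. on an exception), the count ends above 0, which is exactly the symptom described above.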
How to solve the issue "Unable to read entire block; 72 bytes read; expected 512 bytes"
Dear All,

Using Lucene I'm indexing my documents. While indexing some Word documents I got the following exception:

Unable to read entire block; 72 bytes read; expected 512 bytes

While indexing RTF documents I get the following exception:

Unable to read entire block; 72 bytes read; expected 512 bytes

Why does it occur? How can I solve it? Thanks in advance.
Re: How to solve the issue "Unable to read entire block; 72 bytes read; expected 512 bytes"
Sorry, for RTF it throws the following exception:

Unable to read entire header; 100 bytes read; expected 512 bytes

Is it an issue with POI or Lucene? If so, which build of POI contains a fix for this problem, and where can I get it? Please tell me asap. Thanks.
Re: setSimilarity on Query
: The problem is that I want to use QueryParser to construct the
: query for me. I am having to override the logic in QueryParser to
: construct my own derived class, which seems to me like a convoluted
: way to just set the Similarity.

that's the basic design of the QueryParser class - you override to get custom behavior.

independent of the QueryParser aspects of your question, adding a setSimilarity method to the Query class would be a complete 180 from how it currently works. Query classes have to have a getSimilarity method so that their Weight/Scorer have a way to access the similarity functions ... but every core type of query gets that similarity from the searcher being used when the query is executed. if the Query class defined a "setSimilarity" then the similarity used by one query in a BooleanQuery might not be the same as another query in the same query structure ... queryNorms, idfs, tfs ... could all be completely nonsensical.

A more logical extension point is probably along the lines of past discussion towards making all of the Similarity methods take in a field name (so you could have a "PerFieldSimilarityWrapper" type implementation) and/or changing Searchable.getSimilarity to take in a fieldname param. i don't think anyone ever submitted a patch for either of those ideas though ... if you check the mailing list archives you'll see there were performance concerns about one of them (i think it was the first one, because some of those methods are in tight loops, which is unfortunate because it's the one that can be done in a backwards compatible way)

-Hoss
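To make the extension point concrete, here is a rough sketch of the "PerFieldSimilarityWrapper" idea. This is hypothetical: Similarity's tf()/idf() do not take a field name today, so the field-aware variants below are the proposed addition, and the map lookup in forField() is precisely the tight-loop performance concern mentioned above:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.search.Similarity;

    // Hypothetical sketch of a per-field Similarity dispatcher; the
    // field-aware methods do not exist in the current Similarity API.
    public class PerFieldSimilarityWrapper {
      private final Similarity defaultSim;
      private final Map<String, Similarity> perField = new HashMap<String, Similarity>();

      public PerFieldSimilarityWrapper(Similarity defaultSim) {
        this.defaultSim = defaultSim;
      }

      public void set(String field, Similarity sim) {
        perField.put(field, sim);
      }

      private Similarity forField(String field) {
        Similarity s = perField.get(field); // called in tight loops - the perf concern
        return s != null ? s : defaultSim;
      }

      // proposed field-aware variants, delegating to the real methods:
      public float tf(String field, float freq) {
        return forField(field).tf(freq);
      }

      public float idf(String field, int docFreq, int numDocs) {
        return forField(field).idf(docFreq, numDocs);
      }
    }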
Re: How to solve the issue "Unable to read entire block; 72 bytes read; expected 512 bytes"
: Sorry, for RTF it throws the following exception:
: Unable to read entire header; 100 bytes read; expected 512 bytes
: Is it an issue with POI or Lucene? If so, which build of POI contains a
: fix for this problem, and where can I get it? Please tell me asap.

1) java-dev is for discussing development of the Lucene Java API; questions about errors when using the Java API should be sent to the java-user list.

2) that's just a one line error string, it may be the message of an exception -- but it may just be something logged by your application. if it is an exception message, the only way to make sense of it is to see the entire exception stack trace.

3) i can't think of anywhere in the Lucene code base that might write out a string like that (or throw an exception with that message). i suspect it is coming from POI (i'd know for sure if you'd sent the full stack trace) so you should consider contacting the POI user list ... before you do, you might try a simple test of a micro app using POI to parse the same document without Lucene involved at all -- if you get the same error, then you know it's POI and not Lucene related at all.

-Hoss
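A minimal sketch of the kind of micro app suggested in 3), assuming POI 3.x's WordExtractor and a problem .doc file passed on the command line; if this fails with the same message, the error is coming from POI (or the file), not Lucene:

    import java.io.FileInputStream;

    import org.apache.poi.hwpf.extractor.WordExtractor;

    // Parse a Word document with POI alone, no Lucene involved.
    public class PoiSmokeTest {
      public static void main(String[] args) throws Exception {
        FileInputStream in = new FileInputStream(args[0]); // e.g. problem.doc
        try {
          // the "Unable to read entire block/header" errors, if POI's,
          // would surface right here
          WordExtractor extractor = new WordExtractor(in);
          System.out.println(extractor.getText());
        } finally {
          in.close();
        }
      }
    }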
[jira] Commented: (LUCENE-1044) Behavior on hard power shutdown
[ https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541874 ]

Doug Cutting commented on LUCENE-1044:
--------------------------------------

> Is a sync before every file close really needed [...] ?

It might be nice if we could use the Linux sync() system call, instead of fsync(). Then we could call that only when the new segments file is moved into place rather than as each file is closed. We could exec the sync shell command when running on Unix, but I don't know whether there's an equivalent command for Windows, and it wouldn't be Java...

> Behavior on hard power shutdown
>
> Key: LUCENE-1044
> URL: https://issues.apache.org/jira/browse/LUCENE-1044
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Environment: Windows Server 2003, Standard Edition, Sun Hotspot Java 1.5
> Reporter: venkat rangan
> Assignee: Michael McCandless
> Fix For: 2.3
> Attachments: LUCENE-1044.patch, LUCENE-1044.take2.patch, LUCENE-1044.take3.patch
>
> When indexing a large number of documents, upon a hard power failure (e.g. pull the power cord), the index seems to get corrupted. We start a Java application as a Windows Service, and feed it documents. In some cases (after an index size of 1.7GB, with 30-40 index segment .cfs files), the following is observed.
> The 'segments' file contains only zeros. Its size is 265 bytes - all bytes are zeros.
> The 'deleted' file also contains only zeros. Its size is 85 bytes - all bytes are zeros.
> Before corruption, the segments file and deleted file appear to be correct. After this corruption, the index is corrupted and lost.
> This is a problem observed in Lucene 1.4.3. We are not able to upgrade our customer deployments to 1.9 or later version, but would be happy to back-port a patch, if the patch is small enough and if this problem is already solved.
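A sketch of the contrast Doug is drawing, using plain Java where possible. FileDescriptor.sync() maps to a per-file fsync(); flushing the whole filesystem means shelling out to sync(1), which is the Unix-only, non-Java part:

    import java.io.RandomAccessFile;

    public class SyncDemo {
      public static void main(String[] args) throws Exception {
        // per-file fsync(): flushes only this descriptor's dirty blocks
        RandomAccessFile f = new RandomAccessFile(args[0], "rw");
        f.getFD().sync();
        f.close();

        // whole-system sync(): one call when segments_N is committed,
        // but Unix-only and not pure Java
        Runtime.getRuntime().exec("sync").waitFor();
      }
    }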
Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown
I don't think this would make any difference performance-wise, and might actually be slower. When you call FD.sync() it only needs to ensure that the dirty blocks associated with that descriptor are saved.

On Nov 12, 2007, at 12:15 PM, Doug Cutting (JIRA) wrote:

> It might be nice if we could use the Linux sync() system call, instead
> of fsync(). Then we could call that only when the new segments file is
> moved into place rather than as each file is closed. We could exec the
> sync shell command when running on Unix, but I don't know whether
> there's an equivalent command for Windows, and it wouldn't be Java...
Web-based Luke
I'm putting together a Google Web Toolkit-based version of Luke: http://www.inperspective.com/lucene/Luke.war

(Just add your version of the Lucene core jar to the WEB-INF/lib subdirectory and you should have the basis of a web-enabled Luke.)

The intention behind this is to port Luke to a wholly Apache-licensed codebase so it can be managed in Lucene's subversion repository (and for me to learn GWT!). Early results are encouraging so I would like to consider how to handle this moving forward. The considerations are:

1) Are folks interested in bringing this into the Lucene project?
2) Where to manage it (in contrib?)
3) What needs to change in the build process to take GWT source (Java code) and feed it through the GWT compiler to produce Javascript/html etc?
4) How to package it in the distribution (bundle Jetty?)

In MVC terms, having separated the Model code from the (thinlet-based) View code, I now also have the basis for building a Swing-based UI too on the same backend.

Cheers,
Mark
Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown
robert engels wrote:
> I don't think this would make any difference performance-wise, and might
> actually be slower. When you call FD.sync() it only needs to ensure that
> the dirty blocks associated with that descriptor are saved.

The potential benefit is that you wouldn't have to wait for things to be written as you close files. So, with write-behind, data could be written while the CPU moves on to other tasks, only blocking at commit. With log-based filesystems, only the log need be flushed, and batching that is a performance win.

However, if there are lots of other applications writing at the same time, and the Lucene update is small, it could in theory slow things down, but my hunch is that in practice it would frequently nearly eliminate the cost of syncing.

Doug
Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown
Would it not be simpler in pure Java...

Add the descriptor that needs to be sync'd (and closed) to a queue. Start a thread to sync/close descriptors. In commit(), wait for all sync threads to terminate using join().

On Nov 12, 2007, at 12:34 PM, Doug Cutting wrote:

> The potential benefit is that you wouldn't have to wait for things to be
> written as you close files. So, with write-behind, data could be written
> while the CPU moves on to other tasks, only blocking at commit.
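A minimal sketch of this proposal with illustrative names (nothing here is actual Lucene code); one thread per descriptor keeps it close to the join()-based wording above, though a real implementation would likely bound the thread count:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch: sync/close finished files in the background,
    // then block in commit() until all of them have hit stable storage.
    class BackgroundSyncer {
      private final List<Thread> pending = new ArrayList<Thread>();

      synchronized void syncAndCloseAsync(final RandomAccessFile file) {
        Thread t = new Thread(new Runnable() {
          public void run() {
            try {
              file.getFD().sync(); // flush this descriptor's dirty blocks
              file.close();
            } catch (IOException e) {
              // a real implementation would record this and rethrow at commit()
            }
          }
        });
        pending.add(t);
        t.start();
      }

      // called from commit(), before the new segments_N is written
      synchronized void waitForSyncs() throws InterruptedException {
        for (int i = 0; i < pending.size(); i++) {
          pending.get(i).join();
        }
        pending.clear();
      }
    }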
Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown
robert engels wrote:
> Would it not be simpler in pure Java...
>
> Add the descriptor that needs to be sync'd (and closed) to a queue.
> Start a thread to sync/close descriptors.
>
> In commit(), wait for all sync threads to terminate using join().

+1

Doug
Re: How to solve the issue "Unable to read entire block; 72 bytes read; expected 512 bytes"
Hoss wrote:
> 3) i can't think of anywhere in the Lucene code base that might write
> out a string like that (or throw an exception with that message). i
> suspect it is coming from POI ...

It's there in POI:

http://www.krugle.org/kse/files/svn/svn.apache.org/poi/src/java/org/apache/poi/poifs/storage/HeaderBlockReader.java

On line 83.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown
On Nov 12, 2007 1:41 PM, robert engels <[EMAIL PROTECTED]> wrote:
> Would it not be simpler in pure Java...
>
> Add the descriptor that needs to be sync'd (and closed) to a queue.
> Start a thread to sync/close descriptors.
>
> In commit(), wait for all sync threads to terminate using join().

This would also need to be hooked in with file deletion (since a file could be created and deleted before commit()).

-Yonik
Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown
I'll look into this approach.

We must also sync/close the file before we can open it for reading, e.g. for creating a compound file or if a merge kicks off. Though if we are willing to not commit a new segments_N after saving a segment and before creating its compound file, then we don't need to sync the segment files in that case.

I think I would put all this logic (to manage background sync'ing) under FSDirectory.

Mike

"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> This would also need to be hooked in with file deletion (since a file
> could be created and deleted before commit()).
Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown
I would be wary of the additional complexity of doing this.

My vote would be to make 'sync' an option, and if set, all files are sync'd before close. With a proper hardware setup, this should be a minimal performance penalty.

What about writing a marker at the end of each file? I am not sure it is guaranteed, but if the segments file is sync'd and the segment files have the correct marker, then the segment file is OK. Otherwise the "bad" segments/versions can be removed (on start up).

On Nov 12, 2007, at 2:06 PM, Michael McCandless wrote:
> I'll look into this approach. [...] I think I would put all this logic
> (to manage background sync'ing) under FSDirectory.
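A sketch of the end-of-file marker idea; the magic value and names are made up for illustration:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Illustrative sketch: append a known marker on commit, verify it on
    // startup, and discard files (or whole commits) that lack it.
    class FileMarker {
      static final long MAGIC = 0x4C75634D61726B21L; // arbitrary magic value

      static void writeMarker(RandomAccessFile f) throws IOException {
        f.seek(f.length());
        f.writeLong(MAGIC);
        f.getFD().sync();
      }

      // a file without the marker belongs to an unfinished commit
      static boolean isComplete(RandomAccessFile f) throws IOException {
        if (f.length() < 8) {
          return false;
        }
        f.seek(f.length() - 8);
        return f.readLong() == MAGIC;
      }
    }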
Re: Web-based Luke
On Nov 12, 2007, at 1:21 PM, mark harwood wrote:
> I'm putting together a Google Web Toolkit-based version of Luke:
> http://www.inperspective.com/lucene/Luke.war
> (Just add your version of the Lucene core jar to the WEB-INF/lib
> subdirectory and you should have the basis of a web-enabled Luke.)

Mark: +1 Wow! Very nice.

> The intention behind this is to port Luke to a wholly Apache-licensed
> codebase so it can be managed in Lucene's subversion repository (and
> for me to learn GWT!).

RDD (Resume Driven Development) at its finest!

> 1) Are folks interested in bringing this into the Lucene project?

Absolutely.

> 2) Where to manage it (in contrib?)

Seems like a fine place to put it for now. But it really deserves a better home than that. What about a new "client/luke" directory? (following on Solr's structure)

> 3) What needs to change in the build process to take GWT source (Java
> code) and feed it through the GWT compiler to produce Javascript/html etc?

Can't be much.

> 4) How to package it in the distribution (bundle Jetty?)

Yeah, that'd be nice. Exactly how Solr does it.

> In MVC terms, having separated the Model code from the (thinlet-based)
> View code I now also have the basis for building a Swing-based UI too
> on the same backend.

This is very nice, Mark. This would surely plug into Solr's admin UI very well also.

	Erik
Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown
"robert engels" <[EMAIL PROTECTED]> wrote: > I would be wary of the additional complexity of doing this. > > It would be my vote to making 'sync' an option, and if set, all files > are sync'd before close. This is the way it is now: doSync is an option to FSDirectory, which defaults to true. I agree sync() before close() is by far the simplest approach here. On a good IO it seems to have minimal performance impact. On poor hardware (laptop hard drive) I'm seeing a rather sizable impact (~30-40% slowdown on indexing Wikipedia). But I think given this I would still leave the default at true: I think keeping index consistent, even on the somewhat rare event of machine/OS crash, trumps indexing performance, as a default? People who care about performance are happy to change the defaults. > With proper hardware setup, this should be a minimal performance > penalty. Right. > What about writing a marker at the end of each file? I am not sure it > is guarenteed but the segments is syncd, and the segment files have > the correct marker, then the segment file is ok. Otherwise the "bad" > segments/versions can be removed (on start up). Well ... if we took this approach we would also have to forcefully keep around the "last known good" commit point, vs what we do now (delete all but the last commit point). But, creating such a deletion policy is not really possible because we can't "query" the IO system (OS) to find out what's really on stable storage. Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
small improvement when no payloads?
The else clause in SegmentTermPositions.readDeltaPosition() is redundant and could be removed, yes? It's a pretty minor improvement, but this is very inner-loop stuff.

-Yonik

  private final int readDeltaPosition() throws IOException {
    int delta = proxStream.readVInt();
    if (currentFieldStoresPayloads) {
      // if the current field stores payloads then
      // the position delta is shifted one bit to the left.
      // if the LSB is set, then we have to read the current
      // payload length
      if ((delta & 1) != 0) {
        payloadLength = proxStream.readVInt();
      }
      delta >>>= 1;
      needToLoadPayload = true;
    } else {
      payloadLength = 0;
      needToLoadPayload = false;
    }
    return delta;
  }
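For comparison, the method with the else clause removed. This is a sketch only, and it is safe only on the assumption that payloadLength and needToLoadPayload are reset elsewhere (e.g. when seeking to a new term); otherwise stale values from a payload-bearing field could leak into a non-payload field:

  // Sketch of the simplified method, assuming the two fields are reset
  // elsewhere as the removal presumes.
  private final int readDeltaPosition() throws IOException {
    int delta = proxStream.readVInt();
    if (currentFieldStoresPayloads) {
      // the position delta is shifted one bit to the left; if the LSB
      // is set, a new payload length follows
      if ((delta & 1) != 0) {
        payloadLength = proxStream.readVInt();
      }
      delta >>>= 1;
      needToLoadPayload = true;
    }
    return delta;
  }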
[jira] Commented: (LUCENE-743) IndexReader.reopen()
[ https://issues.apache.org/jira/browse/LUCENE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541955 ]

Michael McCandless commented on LUCENE-743:
-------------------------------------------

I think the cause of the intermittent failure in the test is a missing try/finally in doReopen to properly close/decRef everything on exception.

Because of lockless commits, a commit could be in progress while you are re-opening, in which case you could hit an IOException, and you must therefore decRef those norms you had incRef'd (and close, e.g., the newly opened FieldsReader).
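The shape of the missing cleanup, as a sketch (the names are illustrative, not the actual doReopen code):

    // Illustrative sketch: undo incRefs if anything fails mid-reopen.
    int refsTaken = 0;
    boolean success = false;
    try {
      for (int i = 0; i < norms.length; i++) {
        norms[i].incRef(); // share norms with the re-opened reader
        refsTaken++;
      }
      // ... open the new FieldsReader, term infos, etc.; any of these can
      // hit an IOException if a commit is in progress ...
      success = true;
    } finally {
      if (!success) {
        for (int i = 0; i < refsTaken; i++) {
          norms[i].decRef(); // so refCount returns to 0
        }
        // also close anything (e.g. the new FieldsReader) already opened
      }
    }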
Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()
Why doesn't reopen get the 'read' lock? Since commit has the write lock, it should wait...

On Nov 12, 2007, at 3:35 PM, Michael McCandless (JIRA) wrote:

> I think the cause of the intermittent failure in the test is a missing
> try/finally in doReopen to properly close/decRef everything on
> exception. Because of lockless commits, a commit could be in progress
> while you are re-opening, in which case you could hit an IOException,
> and you must therefore decRef those norms you had incRef'd (and close,
> e.g., the newly opened FieldsReader).
Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()
On Nov 12, 2007 4:43 PM, robert engels <[EMAIL PROTECTED]> wrote:
> Why doesn't reopen get the 'read' lock? Since commit has the write
> lock, it should wait...

After lockless commits, there is no read lock!

-Yonik
Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()
Then how can the commit during reopen be an issue?

I am not very familiar with this new code, but it seems that you need to write segments.XXX.new and then rename it to segments.XXX. As long as the files are sync'd, even on NFS the reopen should not see segments.XXX until it is ready.

Although lockless commits are beneficial in their own right, I still think that people's understanding of NFS limitations is flawed. Read the section below on "close to open" consistency. There should be no problem using Lucene across NFS - even the old version.

The write-once nature of Lucene makes this trivial. The only problem was the segments file, which, if Lucene had used the read/write lock and close() correctly, never would have been a problem.

According to the NFS docs:

NFS Version 2 requires that a server must save all the data in a write operation to disk before it replies to a client that the write operation has completed. This can be expensive because it breaks write requests into small chunks (8KB or less) that must each be written to disk before the next chunk can be written. Disks work best when they can write large amounts of data all at once.

NFS Version 3 introduces the concept of "safe asynchronous writes." A Version 3 client can specify that the server is allowed to reply before it has saved the requested data to disk, permitting the server to gather small NFS write operations into a single efficient disk write operation. A Version 3 client can also specify that the data must be written to disk before the server replies, just like a Version 2 write. The client specifies the type of write by setting the stable_how field in the arguments of each write operation to UNSTABLE to request a safe asynchronous write, and FILE_SYNC for an NFS Version 2 style write.

Servers indicate whether the requested data is permanently stored by setting a corresponding field in the response to each NFS write operation. A server can respond to an UNSTABLE write request with an UNSTABLE reply or a FILE_SYNC reply, depending on whether or not the requested data resides on permanent storage yet. An NFS protocol-compliant server must respond to a FILE_SYNC request only with a FILE_SYNC reply.

Clients ensure that data that was written using a safe asynchronous write has been written onto permanent storage using a new operation available in Version 3 called a COMMIT. Servers do not send a response to a COMMIT operation until all data specified in the request has been written to permanent storage. NFS Version 3 clients must protect buffered data that has been written using a safe asynchronous write but not yet committed. If a server reboots before a client has sent an appropriate COMMIT, the server can reply to the eventual COMMIT request in a way that forces the client to resend the original write operation. Version 3 clients use COMMIT operations when flushing safe asynchronous writes to the server during a close(2) or fsync(2) system call, or when encountering memory pressure.

A8. What is close-to-open cache consistency?

A. Perfect cache coherency among disparate NFS clients is very expensive to achieve, so NFS settles for something weaker that satisfies the requirements of most everyday types of file sharing. Everyday file sharing is most often completely sequential: first client A opens a file, writes something to it, then closes it; then client B opens the same file, and reads the changes.

So, when an application opens a file stored in NFS, the NFS client checks that it still exists on the server, and is permitted to the opener, by sending a GETATTR or ACCESS operation. When the application closes the file, the NFS client writes back any pending changes to the file so that the next opener can view the changes. This also gives the NFS client an opportunity to report any server write errors to the application via the return code from close(). This behavior is referred to as close-to-open cache consistency.

Linux implements close-to-open cache consistency by comparing the results of a GETATTR operation done just after the file is closed to the results of a GETATTR operation done when the file is next opened. If the results are the same, the client will assume its data cache is still valid; otherwise, the cache is purged. Close-to-open cache consistency was introduced to the Linux NFS client in 2.4.20. If for some reason you have applications that depend on the old behavior, you can disable close-to-open support by using the "nocto" mount option.

There are still opportunities for a client's data cache to contain stale data. The NFS version 3 protocol introduced "weak cache consistency" (also known as WCC) which provides a way of checking a file's attributes before and after an operation to allow a client to identify changes that could have been made by other clients. Unfortunately when a clien
Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()
On Nov 12, 2007 5:08 PM, robert engels <[EMAIL PROTECTED]> wrote:
> As long as the files are sync'd, even on NFS the reopen should not
> see segments.XXX until it is ready.

Right, but then there is a race on the other side... a reader may open the segments.XXX file and then start opening all the referenced segments files, but some of them may have already been deleted because a segment merge happened. There's a retry mechanism in this case.
http://issues.apache.org/jira/browse/LUCENE-701

I guess the test with 150 threads is very atypical and could actually cause a reader to not be successfully opened, and hence an exception thrown.

-Yonik
Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()
robert engels <[EMAIL PROTECTED]> wrote:

> Then how can the commit during reopen be an issue?

This is what happens:

* Reader opens latest segments_N & reads all SegmentInfos successfully.

* Writer writes new segments_N+1, and then deletes now un-referenced files.

* Reader tries to open files referenced by segments_N and hits FNFE when it tries to open a file the writer just removed.

Lucene handles this fine (it just retries on the new segments_N+1), but the patch in LUCENE-743 is now failing to decRef the Norm instances when this retry happens.

> I am not very familiar with this new code, but it seems that you need
> to write segments.XXX.new and then rename it to segments.XXX.

We don't rename anymore (it's not reliable on Windows). We write straight to segments_N.

> As long as the files are sync'd, even on NFS the reopen should not
> see segments.XXX until it is ready.
>
> Although lockless commits are beneficial in their own right, I still
> think that people's understanding of NFS limitations is flawed. Read
> the section below on "close to open" consistency. There should be no
> problem using Lucene across NFS - even the old version.
>
> The write-once nature of Lucene makes this trivial. The only problem
> was the segments file, which, if Lucene had used the read/write lock
> and close() correctly, never would have been a problem.

Yes, in an ideal world, NFS server+clients are supposed to implement close-to-open semantics, but in my experience they do not always succeed. Previous versions of Lucene do in fact have problems over NFS. NFS also does not give you "delete on last close", which Lucene normally relies on (unless you create a custom deletion policy).

Mike
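The retry Mike describes lives in SegmentInfos.FindSegmentsFile; as a sketch of its shape, with hypothetical helper names:

    import java.io.FileNotFoundException;
    import java.io.IOException;

    // Illustrative sketch of the open/retry dance; helpers are hypothetical.
    abstract class RetryingOpener {
      abstract long findLatestGeneration() throws IOException; // scan for segments_N
      abstract Object openAllSegments(long gen) throws IOException;

      Object openWithRetry() throws IOException {
        while (true) {
          long gen = findLatestGeneration();
          try {
            return openAllSegments(gen); // may hit FNFE if a commit races us
          } catch (FileNotFoundException fnfe) {
            // the writer committed a newer segments_N and deleted files we
            // were about to read; loop and retry on the newer generation
          }
        }
      }
    }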
Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()
But merging segments doesn't delete the old ones, it only creates new ones, unless the segments meet the "purge old" criteria.

A reopen() is supposed to open the latest version in the directory by definition, so this seems a rather remote possibility. If it occurs due to low system resources (meaning that during a reopen some expected segments were already deleted), throw a StaleIndexException, and the client can reissue the reopen() call (similar to when it cannot get the write lock).

On Nov 12, 2007, at 4:47 PM, Yonik Seeley wrote:
> Right, but then there is a race on the other side... a reader may open
> the segments.XXX file and then start opening all the referenced
> segments files, but some of them may have already been deleted because
> a segment merge happened. There's a retry mechanism in this case.
> http://issues.apache.org/jira/browse/LUCENE-701
Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()
"Yonik Seeley" <[EMAIL PROTECTED]> wrote: > On Nov 12, 2007 5:08 PM, robert engels <[EMAIL PROTECTED]> wrote: > > As long as the files are sync'd, even on nfs the reopen should not > > see segments.XXX until is is ready. > > Right, but then there is a race on the other side... a reader may open > the segments .XXX file and then start opening all the referenced > segments files, but some of them may have already been deleted because > a segment merge happened. There's a retry mechanism in this case. > http://issues.apache.org/jira/browse/LUCENE-701 > > I guess the test with 150 threads is very atypical and could actually > cause a reader to not be successfully opened and hence an exception > thrown. The test is just hitting the normal retry exception, and then the retry succeeds, but the patch fails to decRef those incRef's it had done on the first attempt. Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()
What are you basing the claim that "rename" is not reliable on Windows on? That a virus scanner has the file open? If that is the case, that should either be considered an incorrect setup, or the operation retried until it completes.

Writing directly to a file that someone else can open for reading is bound to be a problem. If the file is opened exclusively for write, then others will be prohibited from opening it for read, so there should not be a problem.

All of the "delete on last close" stuff is a poor design. The database can be resynced on startup.

The basic design flaw is one I have pointed out many times - you either use Lucene in a local environment, or a server environment. Using NFS to "share" a Lucene database (normally due to performance, but there are other problems - e.g. resource and user monitoring, etc.) is a poor choice!

People have written reliable database systems without very advanced semantics for years. There is no reason for all of this esoteric code in Lucene.

Those that claim Lucene had problems with NFS in the past did not perform reliable testing, or their OS was out of date. If Lucene was failing because an OS needed an update, would you change Lucene, or fix/update the OS? Obviously the latter.

Some very loud voices complained about the NFS problems without doing the due diligence and test cases to prove the problem. Instead they just mucked up the Lucene code.

On Nov 12, 2007, at 4:54 PM, Michael McCandless wrote:
> We don't rename anymore (it's not reliable on Windows). We write
> straight to segments_N. [...] NFS also does not give you "delete on
> last close", which Lucene normally relies on (unless you create a
> custom deletion policy).
Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()
"robert engels" <[EMAIL PROTECTED]> wrote: > But merging segments doesn't delete the old, it only creates new, > unless the segments meet the "purge old criteria". What's the "purge old criteria"? Normally a segment merge once committed immediately deletes the segments it had just merged. > A reopen() is supposed to open the latest version in the directory > by definition, so this seems rather a remote possibility. Well, if a commit is in-flight then likely the reopen will hit an exception and then retry. This is the same as a normal open. > If it occurs due to low system resources (meaning that during a > reopen some expected segments were already deleted, throw an > StaleIndexException) and the client can reissue the reopen() call > (similar to if it could not get the write lock). I'm not sure what you mean by "low system resources". Missing some files because they were deleted by a commit in process isn't a low system resources sort of situation. Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()
Not just virus scanners: any program that uses the Microsoft API for being notified of file changes. I think TortoiseSVN was one such example. People who embed Lucene can't control what their users install on their desktops. Virus scanners are naturally very common on desktops. I think we want Lucene to work in these cases.

NFS (and other shared filesystems) is a convenient, if not performant, way to share an index. I think Lucene should work in such cases as well.

Mike

"robert engels" <[EMAIL PROTECTED]> wrote:
> What are you basing the claim that "rename" is not reliable on Windows
> on? That a virus scanner has the file open? If that is the case, that
> should either be considered an incorrect setup, or the operation
> retried until it completes. [...]
Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()
Horse poo poo. If you are working in a local environment, the files should be opened with exclusive access. This guarantees that the operations will succeed for the calling process.

That NFS is a viable solution is highly debatable, and IMO shows a lack of understanding of NFS and the unix/linux filesystem design principles. Read about why unix never offered file locking, and never really needed it... Still, if the proper exclusive access controls are used, Lucene (and Java) have no problems working in an NFS/shared filesystem environment.

Sorry, but that some only recently became aware of FD.sync() shows that they don't really know enough to be designing/testing systems like this.

Sorry if the tone of this is harsh, but I hate seeing lots of complex code because the designers fail to understand the basic operating principles of what they are working with...

On Nov 12, 2007, at 5:18 PM, Michael McCandless wrote:
> Not just virus scanners: any program that uses the Microsoft API for
> being notified of file changes. I think TortoiseSVN was one such
> example. People who embed Lucene can't control what their users
> install on their desktops. Virus scanners are naturally very common on
> desktops. I think we want Lucene to work in these cases.
>
> NFS (and other shared filesystems) is a convenient, if not performant,
> way to share an index. I think Lucene should work in such cases as
> well.
Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()
robert engels wrote:
> The commit "in flight" cannot (SHOULD NOT) be deleting segments if they
> are in use. That a caller could issue a reopen call means there are
> segments in use by definition (or they would have nothing to reopen).

Reopen still works correctly, even if there are no segments left that the old reader used. It will simply behave as an "open" then. An example is an index that was optimized. In that case all old segments are gone, and if you reopen your reader you will get a new SegmentReader that opens the new segment.

The old reader can still access the old segments because of the OS's "delete on last close". Or, on Windows, the IndexWriter will re-try to delete the old segments until the delete is successful (i.e. after the last reader accessing them was closed).

-Michael
Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()
I am not debating that reopen works (since that is supposed to get the latest version). I am stating that commit cannot be deleting segments if they are in use - which they must be at that time in order to issue a reopen(), since to issue reopen() you must have an instance of IndexReader open, which means you will have segments open...

I was talking about Windows in particular - as stated, unix/linux does not have the problem - under Windows the delete will (should) fail.

On Nov 12, 2007, at 5:42 PM, Michael Busch wrote:
> Reopen still works correctly, even if there are no segments left that
> the old reader used. It will simply behave as an "open" then. [...] Or,
> on Windows, the IndexWriter will re-try to delete the old segments
> until the delete is successful (i.e. after the last reader accessing
> them was closed).
Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()
robert engels wrote:
> I was talking about Windows in particular - as stated, unix/linux does
> not have the problem - under Windows the delete will (should) fail.

As I said, delete does fail on Windows in that case, and the IndexFileDeleter (called by the IndexWriter) catches the IOException and tries again (and again...).

-Michael
Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()
That is not true - at least it didn't use to be. If there were readers open, the files/segments would not be deleted; they would be deleted at the next open. The "purge criteria" was based on the next "commit" sets. To make this work, and be able to roll back or open a previous "version", you need to keep the segments around.

The commit "in flight" cannot (SHOULD NOT) be deleting segments if they are in use. That a caller could issue a reopen call means there are segments in use by definition (or they would have nothing to reopen).

On Nov 12, 2007, at 5:14 PM, Michael McCandless wrote:
> What's the "purge old" criteria? Normally a segment merge, once
> committed, immediately deletes the segments it had just merged.
[jira] Commented: (LUCENE-743) IndexReader.reopen()
[ https://issues.apache.org/jira/browse/LUCENE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541998 ]

Michael Busch commented on LUCENE-743:
--------------------------------------

> I think the cause of the intermittent failure in the test is a missing
> try/finally in doReopen to properly close/decRef everything on
> exception.

Awesome! Thanks so much for pointing me there, Mike! I was getting a little suicidal here already ... ;)

I should have read the comment in SegmentReader#initialize more carefully:

{code:java}
    } finally {
      // With lock-less commits, it's entirely possible (and
      // fine) to hit a FileNotFound exception above. In
      // this case, we want to explicitly close any subset
      // of things that were opened so that we don't have to
      // wait for a GC to do so.
      if (!success) {
        doClose();
      }
    }
{code}

While debugging, it's easy to miss such an exception, because SegmentInfos.FindSegmentsFile#run() ignores it. But it's good that it logs such an exception; I just have to remember to print out the infoStream next time.

So it seems that this was indeed the cause of the failing test case. I made the change, and so far the tests haven't failed anymore (I ran them about 10 times so far). I'll run them another few times on a different JVM and submit an updated patch in a short while if they don't fail again.
Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()
I would still argue that it is an incorrect setup - almost as bad as "not plugging the computer in". If a user runs a virus scanner or file-system indexer on the Lucene index directory, their system is going to slow to a crawl and indexing will be abominably slow. The installation guide should simply make excluding the index directory a requirement. An installer can easily use the available APIs to exclude the Lucene data directory from virus scanning / indexing.

On Nov 12, 2007, at 6:01 PM, Michael Busch wrote:

robert engels wrote:
> I was talking about Windows in particular - as stated, unix/linux does
> not have the problem - under Windows the delete will (should) fail.

As I said, delete does fail on Windows in that case, and the IndexFileDeleter (called by the IndexWriter) catches the IOException and tries again (and again...).

-Michael
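The retry behavior Michael mentions can be pictured with a small sketch; the names here (RetryingDeleter, pending) are illustrative, not the actual IndexFileDeleter implementation:

{code:java}
import java.io.File;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class RetryingDeleter {
    // Files whose delete failed (e.g. an open handle on Windows from a
    // reader, virus scanner, or backup tool); retried later.
    private final List<File> pending = new ArrayList<File>();

    void delete(File f) {
        if (f.exists() && !f.delete()) {
            pending.add(f);
        }
    }

    // Called again on later operations (and again...), so files held
    // open by another process eventually get removed.
    void retryPending() {
        for (Iterator<File> it = pending.iterator(); it.hasNext();) {
            if (it.next().delete()) {
                it.remove();
            }
        }
    }
}
{code}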
Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()
On Nov 12, 2007 7:19 PM, robert engels <[EMAIL PROTECTED]> wrote:
> I would still argue that it is an incorrect setup - almost as bad as
> "not plugging the computer in".

A user themselves could even go in and look at the index files (I've done so myself)... as could a backup program or whatever. It's a fact of life on Windows that a move or delete can fail.

-Yonik
[jira] Updated: (LUCENE-743) IndexReader.reopen()
[ https://issues.apache.org/jira/browse/LUCENE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-743:
---------------------------------

Attachment: lucene-743-take8.patch

OK, all tests pass now, including the thread-safety test. I ran them several times on different JVMs.

Changes:
- As Mike suggested, I added a try ... finally block to SegmentReader#reopenSegment() which cleans up if an exception is hit.
- Added some additional comments.
- Minor improvements to TestIndexReaderReopen.
Re: Term pollution from binary data
Doug Cutting wrote on 11/07/2007 09:26 AM:

Hadoop's MapFile is similar to Lucene's term index, and supports a feature where only a subset of the index entries are loaded (determined by io.map.index.skip). It would not be difficult to add such a feature to Lucene by changing TermInfosReader#ensureIndexIsRead(). Here's a (totally untested) patch.

Doug, thanks for this suggestion and your quick patch. I fleshed this out in the version of Lucene we are using, a bit after 2.1. There was an off-by-1 bug plus a few missing pieces. The attached patch is for 2.1+, but might be useful as it at least contains the corrections and missing elements. It also contains extensions to the tests to exercise the patch.

I tried integrating this into 2.3, but enough has changed that it was not straightforward (primarily for the test case extensions -- the implementation seems like it will apply with just a bit of manual merging). Unfortunately, I have so many local changes that it has become difficult to track the latest Lucene. The task of syncing up will come soon. I'll post a proper patch against the trunk in Jira at a future date if the issue is not already resolved before then.

Michael McCandless wrote on 11/08/2007 12:43 AM:

I'll open an issue and work through this patch.

Michael, I did not see the issue, or else I would have posted this there. Unfortunately, I'm pretty far behind on Lucene mail these days.

One thing is: I'd prefer to not use a system property for this, since it's so global, but I'm not sure how to better do it.

Agree strongly that this should not be global. Whether it's ctors or an index-specific properties object or whatever, it is important to be able to set this on some indexes and not others in a single application.

Thanks for picking this up!

Chuck

Index: src/test/org/apache/lucene/index/DocHelper.java
===================================================================
--- src/test/org/apache/lucene/index/DocHelper.java	(revision 2247)
+++ src/test/org/apache/lucene/index/DocHelper.java	(working copy)
@@ -254,10 +254,25 @@
    */
   public static void writeDoc(Directory dir, Analyzer analyzer, Similarity similarity, String segment, Document doc) throws IOException {
-    DocumentWriter writer = new DocumentWriter(dir, analyzer, similarity, 50);
-    writer.addDocument(segment, doc);
+    writeDoc(dir, analyzer, similarity, segment, doc, IndexWriter.DEFAULT_TERM_INDEX_INTERVAL);
   }
 
+  /**
+   * Writes the document to the directory segment using the analyzer and the similarity score
+   * @param dir
+   * @param analyzer
+   * @param similarity
+   * @param segment
+   * @param doc
+   * @param termIndexInterval
+   * @throws IOException
+   */
+  public static void writeDoc(Directory dir, Analyzer analyzer, Similarity similarity, String segment, Document doc, int termIndexInterval) throws IOException
+  {
+    DocumentWriter writer = new DocumentWriter(dir, analyzer, similarity, 50, termIndexInterval);
+    writer.addDocument(segment, doc);
+  }
+
   public static int numFields(Document doc) {
     return doc.getFields().size();
   }

Index: src/test/org/apache/lucene/index/TestSegmentTermDocs.java
===================================================================
--- src/test/org/apache/lucene/index/TestSegmentTermDocs.java	(revision 2247)
+++ src/test/org/apache/lucene/index/TestSegmentTermDocs.java	(working copy)
@@ -25,6 +25,7 @@
 import org.apache.lucene.document.Field;
 
 import java.io.IOException;
+import org.apache.lucene.search.Similarity;
 
 public class TestSegmentTermDocs extends TestCase {
   private Document testDoc = new Document();
@@ -212,6 +213,23 @@
     dir.close();
   }
 
+  public void testIndexDivisor() throws IOException {
+    dir = new RAMDirectory();
+    testDoc = new Document();
+    DocHelper.setupDoc(testDoc);
+    DocHelper.writeDoc(dir, new WhitespaceAnalyzer(), Similarity.getDefault(), "test", testDoc, 3);
+
+    assertNull(System.getProperty("lucene.term.index.divisor"));
+    System.setProperty("lucene.term.index.divisor", "2");
+    try {
+      testTermDocs();
+      testBadSeek();
+      testSkipTo();
+    } finally {
+      System.clearProperty("lucene.term.index.divisor");
+    }
+  }
+
   private void addDoc(IndexWriter writer, String value) throws IOException {
     Document doc = new Document();

Index: src/test/org/apache/lucene/index/TestSegmentReader.java
===================================================================
--- src/test/org/apache/lucene/index/TestSegmentReader.java	(revision 2247)
+++ src/test/org/apache/lucene/index/TestSegmentReader.java	(working copy)
@@ -23,10 +23,12 @@
 import java.util.List;
 
 import junit.framework.TestCase;
+import org.apache.lucene.analysis.WhitespaceAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Fieldable;
 import org.apache.lucene.search.DefaultSimilarity;
+import org.apache.lucene.search.Similarity;
 import org.apa
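The patch above is cut off in the archive, but the underlying idea is simple: keep only every Nth entry of the in-memory term index, trading heap for a slightly longer scan per lookup. A hedged sketch of that idea, with illustrative names (not the actual TermInfosReader code):

{code:java}
class SampledTermIndexSketch {
    final String[] sampled;  // every divisor-th index term is kept
    final int divisor;

    SampledTermIndexSketch(String[] indexTerms, int divisor) {
        this.divisor = divisor;
        int n = (indexTerms.length + divisor - 1) / divisor;
        sampled = new String[n];
        for (int i = 0; i < n; i++) {
            // Memory drops by roughly a factor of divisor; a term seek
            // may now scan up to divisor times more terms sequentially.
            sampled[i] = indexTerms[i * divisor];
        }
    }

    /** Returns the position of the greatest sampled term <= target. */
    int indexOffset(String target) {
        int lo = 0, hi = sampled.length - 1, result = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (sampled[mid].compareTo(target) <= 0) {
                result = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return result;
    }
}
{code}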
Re: small improvement when no payloads?
Yonik Seeley wrote:
> The else clause in SegmentTermPositions.readDeltaPosition() is
> redundant and could be removed, yes?
> It's a pretty minor improvement, but this is very inner-loop stuff.
>
> -Yonik

Thanks, Yonik, you're right. We can safely remove those two lines: TermPositions#seek() resets the two values, and currentFieldStoresPayloads doesn't change unless seek() is called. All test cases still pass after removing the else clause. I'll commit this small change (I don't think we need to open a Jira issue).

-Michael
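For context, the shape of the reasoning as an illustrative sketch. This is not the actual SegmentTermPositions source (the VInt decoding is simplified to a plain int read), just a demonstration of why the else branch is dead:

{code:java}
import java.io.DataInput;
import java.io.IOException;

class DeltaPositionSketch {
    private boolean currentFieldStoresPayloads; // only changes in seek()
    private int payloadLength;                  // reset in seek()
    private boolean needToLoadPayload;          // reset in seek()

    int readDeltaPosition(DataInput proxStream) throws IOException {
        int delta = proxStream.readInt();  // stand-in for readVInt()
        if (currentFieldStoresPayloads) {
            // payload-aware decoding would set payloadLength and
            // needToLoadPayload here
            needToLoadPayload = true;
        }
        // The removed else clause reset payloadLength and
        // needToLoadPayload at this point. seek() already resets both,
        // and currentFieldStoresPayloads only flips inside seek(), so
        // the else branch could never observe stale values.
        return delta;
    }

    void seek(boolean fieldStoresPayloads) {
        currentFieldStoresPayloads = fieldStoresPayloads;
        payloadLength = 0;
        needToLoadPayload = false;
    }
}
{code}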
Re: setSimilarity on Query
Chris Hostetter wrote:

Independent of the QueryParser aspects of your question, adding a setSimilarity method to the Query class would be a complete 180 from how it currently works. Query classes have to have a getSimilarity method so that their Weight/Scorer have a way to access the similarity functions ... but every core type of query gets that similarity from the searcher being used when the query is executed. If the Query class defined a "setSimilarity", then the similarity used by one query in a BooleanQuery might not be the same as another query in the same query structure ... queryNorms, idfs, tfs ... could all be completely nonsensical.

The getSimilarity() implementation in Query actually invokes Searcher.getSimilarity(), which in turn returns the value of Similarity.getDefault(). IndexSearcher has a corresponding setSimilarity() method which overrides that default, which makes it convenient for what you're trying to accomplish.

There is, however, another point of discord -- the Weight associated with the Query (which is relevant if you want a different implementation of term weighting). Here the locus of control is inverted -- it is the Searcher which delegates to the Query in order to create the Weight. In order to change the scoring implementation one needs to implement a new Query class, a new Weight class, a new Similarity class and a new QueryParser.

A friendlier alternative I'd like to propose is a sort of Weight and Similarity factory which is provided either to the top-level Query object that is returned from parsing, or to the Searcher object that processes the query. The factory can then return Similarity and Weight implementations that are identical for all parts of the query and which are mutually consistent. This would allow field-specific Similarity and Weight implementations and would also be backwards compatible.

A more logical extension point is probably along the lines of past discussion towards making all of the Similarity methods take in a field name (so you could have a "PerFieldSimilarityWrapper" type implementation, sketched below) and/or changing Searchable.getSimilarity to take in a field name param. I don't think anyone ever submitted a patch for either of those ideas though ... if you check the mailing list archives you'll see there were performance concerns about one of them (I think it was the first one, because some of those methods are in tight loops, which is unfortunate because it's the one that can be done in a backwards-compatible way).
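A hedged sketch of the "PerFieldSimilarityWrapper" idea mentioned above; no such class exists in this version of Lucene, and the field-name-taking method signatures shown here are exactly the API change under discussion, not the current API:

{code:java}
import java.util.HashMap;
import java.util.Map;

class PerFieldSimilaritySketch {
    /** Minimal stand-in for the Similarity methods under discussion. */
    interface FieldSimilarity {
        float tf(String field, float freq);
        float lengthNorm(String field, int numTokens);
    }

    static class Wrapper implements FieldSimilarity {
        private final Map<String, FieldSimilarity> perField =
            new HashMap<String, FieldSimilarity>();
        private final FieldSimilarity fallback;

        Wrapper(FieldSimilarity fallback) {
            this.fallback = fallback;
        }

        void set(String field, FieldSimilarity sim) {
            perField.put(field, sim);
        }

        private FieldSimilarity pick(String field) {
            FieldSimilarity s = perField.get(field);
            return s != null ? s : fallback;
        }

        // Every clause of a BooleanQuery sees the same wrapper, so norms
        // and idfs stay mutually consistent while still being allowed to
        // vary per field.
        public float tf(String field, float freq) {
            return pick(field).tf(field, freq);
        }

        public float lengthNorm(String field, int numTokens) {
            return pick(field).lengthNorm(field, numTokens);
        }
    }
}
{code}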
Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()
True. It seems the Lucene code might be a bit more resilient here, though, using the following:

1. open the segments file exclusively (if this fails, updates are prohibited, and an exception is thrown)
2. write the new segments
3. write segments.new including the segments hash & sync
4. update the segments file including the hash
5. delete what segments you can

(A rough sketch of steps 2-4 appears below.)

Then if it crashes in step 4, it is easy to know segments is bad (out of date) and use segments.new. If it crashes in step 3, then segments.new is easily detected as being corrupt (hash does not match), so you know segments is valid. If there are segments that cannot be deleted in step 5, every open can check whether it can delete them...

A similar technique can be used with lockless commits; you just need to make it segments.XXX.new, etc.

On Nov 12, 2007, at 7:21 PM, Yonik Seeley wrote:

On Nov 12, 2007 7:19 PM, robert engels <[EMAIL PROTECTED]> wrote:
> I would still argue that it is an incorrect setup - almost as bad as
> "not plugging the computer in".

A user themselves could even go in and look at the index files (I've done so myself)... as could a backup program or whatever. It's a fact of life on Windows that a move or delete can fail.

-Yonik
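The sketch referenced above: a rough rendering of steps 2-4 of the proposal. The file handling and the CRC32 checksum scheme are illustrative assumptions, not Lucene's actual commit code:

{code:java}
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;

class TwoPhaseSegmentsCommitSketch {
    // Write segments.new first, then segments itself. A crash between
    // the two writes leaves a valid segments.new to recover from; a
    // crash during the first write leaves a checksum mismatch in
    // segments.new, so the old segments file is known to be valid.
    static void commit(File dir, byte[] newSegments) throws IOException {
        writeWithChecksum(new File(dir, "segments.new"), newSegments);
        writeWithChecksum(new File(dir, "segments"), newSegments);
        // Step 5 (deleting now-unreferenced segment files) is
        // best-effort; failures can be retried by any later open.
    }

    // Append a CRC32 of the payload so a torn write is detectable.
    static void writeWithChecksum(File f, byte[] data) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(data);
        FileOutputStream out = new FileOutputStream(f);
        try {
            out.write(data);
            long v = crc.getValue();
            for (int i = 7; i >= 0; i--) {
                out.write((int) (v >>> (8 * i)));
            }
            out.getFD().sync();  // the "& sync" in step 3
        } finally {
            out.close();
        }
    }
}
{code}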