[jira] Updated: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Michael Busch (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch updated LUCENE-743: - Attachment: lucene-743-take7.patch Changes: - Updated patch to current trunk (I just realized th

How to solve the issue "Unable to read entire block; 72 bytes read; expected 512 bytes"

2007-11-12 Thread Durai murugan
Dear all, I'm using Lucene to index my documents. While indexing some Word documents I got the following exception: Unable to read entire block; 72 bytes read; expected 512 bytes. While indexing RTF documents I get the following exception: Unable to read entire block; 72 bytes read; expect

Re: How to solve the issue "Unable to read entire block; 72 bytes read; expected 512 bytes"

2007-11-12 Thread Durai murugan
Sorry, for RTF it throws the following exception: Unable to read entire header; 100 bytes read; expected 512 bytes. Is it an issue with POI or Lucene? If so, which build of POI contains a fix for this problem, and where can I get it? Please tell me asap. Thanks. - Original Message From: Dura

Re: setSimilarity on Query

2007-11-12 Thread Chris Hostetter
: The problem is that I want to use QueryParser to construct the : query for me. I am having to override the logic in QueryParser to : construct my own derived class, which seems to me like a convoluted : way of just setting the Similarity. That's the basic design of the QueryParser class -

Re: How to solve the issue "Unable to read entire block; 72 bytes read; expected 512 bytes"

2007-11-12 Thread Chris Hostetter
: Sorry, for RTF it throws the following exception: : Unable to read entire header; 100 bytes read; expected 512 bytes : Is it an issue with POI or Lucene? If so, which build of POI contains a fix : for this problem, and where can I get it? Please tell me asap. 1) java-dev is for discussing developm

[jira] Commented: (LUCENE-1044) Behavior on hard power shutdown

2007-11-12 Thread Doug Cutting (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541874 ] Doug Cutting commented on LUCENE-1044: -- > Is a sync before every file close really needed [...] ? It might be

Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown

2007-11-12 Thread robert engels
I don't think this would make any difference performance-wise, and it might actually be slower. When you call FD.sync(), it only needs to ensure that the dirty blocks associated with that descriptor are saved. On Nov 12, 2007, at 12:15 PM, Doug Cutting (JIRA) wrote: [ https://issues.a

Web-based Luke

2007-11-12 Thread mark harwood
I'm putting together a Google Web Toolkit-based version of Luke: http://www.inperspective.com/lucene/Luke.war ( Just add your version of lucene core jar to WEB-INF/lib subdirectory and you should have the basis of a web-enabled Luke.) The intention behind this is to port Luke to a wholly Apach

Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown

2007-11-12 Thread Doug Cutting
robert engels wrote: I don't think this would make any difference performance-wise, and it might actually be slower. When you call FD.sync(), it only needs to ensure that the dirty blocks associated with that descriptor are saved. The potential benefit is that you wouldn't have to wait for thing

Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown

2007-11-12 Thread robert engels
Would it not be simpler to do this in pure Java... Add the descriptor that needs to be sync'd (and closed) to a Queue. Start a Thread to sync/close descriptors. In commit(), wait for all sync threads to terminate using join(). On Nov 12, 2007, at 12:34 PM, Doug Cutting wrote: robert engels wrote: I don

Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown

2007-11-12 Thread Doug Cutting
robert engels wrote: Would it not be simpler to do this in pure Java... Add the descriptor that needs to be sync'd (and closed) to a Queue. Start a Thread to sync/close descriptors. In commit(), wait for all sync threads to terminate using join(). +1 Doug
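The queue-and-join idea proposed in this exchange can be sketched in a few lines of Java. This is only an illustration under stated assumptions: `BackgroundSyncer`, `syncAndCloseLater`, and this `commit()` are hypothetical names, not Lucene classes or methods, and the sketch starts one thread per descriptor rather than draining a shared queue. The only library calls it relies on are `FileOutputStream.getFD()`, `FileDescriptor.sync()`, and `Thread.join()`.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Lucene): syncs and closes files on background threads. */
class BackgroundSyncer {
    private final List<Thread> workers = new ArrayList<>();
    private final List<IOException> failures = new ArrayList<>();

    /** Hand off a fully written stream; a worker thread syncs it and then closes it. */
    synchronized void syncAndCloseLater(final FileOutputStream out) {
        Thread t = new Thread(() -> {
            // try-with-resources closes the stream after the sync, even if the sync fails
            try (FileOutputStream o = out) {
                o.getFD().sync(); // force this descriptor's dirty blocks to the device
            } catch (IOException e) {
                synchronized (failures) {
                    failures.add(e); // remember the failure so commit() can report it
                }
            }
        });
        workers.add(t);
        t.start();
    }

    /** Blocks until every pending sync has finished, then rethrows the first failure, if any. */
    void commit() throws IOException, InterruptedException {
        List<Thread> pending;
        synchronized (this) {
            pending = new ArrayList<>(workers);
            workers.clear();
        }
        for (Thread t : pending) {
            t.join(); // wait for the sync thread to terminate
        }
        synchronized (failures) {
            if (!failures.isEmpty()) {
                throw failures.remove(0);
            }
        }
    }
}
```

A caller would pass each finished output stream to `syncAndCloseLater` and call `commit()` before publishing a new commit point. Whether this beats syncing inline depends on the hardware and file system, which is exactly what the thread is debating.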

Re: How to solve the issue "Unable to read entire block; 72 bytes read; expected 512 bytes"

2007-11-12 Thread Ken Krugler
: Sorry, for RTF it throws the following exception: : Unable to read entire header; 100 bytes read; expected 512 bytes : Is it an issue with POI or Lucene? If so, which build of POI contains a fix : for this problem, and where can I get it? Please tell me asap. 1) java-dev is for discussing developme

Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown

2007-11-12 Thread Yonik Seeley
On Nov 12, 2007 1:41 PM, robert engels <[EMAIL PROTECTED]> wrote: > Would it not be simpler to do this in pure Java... > > Add the descriptor that needs to be sync'd (and closed) to a Queue. > Start a Thread to sync/close descriptors. > > In commit(), wait for all sync threads to terminate using join(). This

Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown

2007-11-12 Thread Michael McCandless
I'll look into this approach. We must also sync/close the file before we can open it for reading, e.g., for creating a compound file or if a merge kicks off. Though if we are willing to not commit a new segments_N after saving a segment and before creating its compound file then we don't need to syn

Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown

2007-11-12 Thread robert engels
I would be wary of the additional complexity of doing this. My vote would be to make 'sync' an option, and if set, all files are sync'd before close. With proper hardware setup, this should be a minimal performance penalty. What about writing a marker at the end of each file? I am no

Re: Web-based Luke

2007-11-12 Thread Erik Hatcher
On Nov 12, 2007, at 1:21 PM, mark harwood wrote: I'm putting together a Google Web Toolkit-based version of Luke: http://www.inperspective.com/lucene/Luke.war ( Just add your version of lucene core jar to WEB-INF/lib subdirectory and you should have the basis of a web-enabled Luke.) Mark:

Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown

2007-11-12 Thread Michael McCandless
"robert engels" <[EMAIL PROTECTED]> wrote: > I would be wary of the additional complexity of doing this. > > My vote would be to make 'sync' an option, and if set, all files > are sync'd before close. This is the way it is now: doSync is an option to FSDirectory, which defaults to true. I

small improvement when no payloads?

2007-11-12 Thread Yonik Seeley
The else clause in SegmentTermPositions.readDeltaPosition() is redundant and could be removed, yes? It's a pretty minor improvement, but this is very inner-loop stuff. -Yonik private final int readDeltaPosition() throws IOException { int delta = proxStream.readVInt(); if (currentFieldSt

[jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541955 ] Michael McCandless commented on LUCENE-743: --- I think the cause of the intermittent failure in the test is a

Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread robert engels
Why doesn't reopen get the 'read' lock? Since commit has the write lock, it should wait... On Nov 12, 2007, at 3:35 PM, Michael McCandless (JIRA) wrote: [ https://issues.apache.org/jira/browse/LUCENE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_125

Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Yonik Seeley
On Nov 12, 2007 4:43 PM, robert engels <[EMAIL PROTECTED]> wrote: > Why doesn't reopen get the 'read' lock, since commit has the write > lock, it should wait... After lockless commits, there is no read lock! -Yonik

Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread robert engels
Then how can the commit during reopen be an issue? I am not very familiar with this new code, but it seems that you need to write segments.XXX.new and then rename to segments.XXX. As long as the files are sync'd, even on NFS the reopen should not see segments.XXX until it is ready. Although
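The write-then-rename pattern described in this message (write segments.XXX.new, sync it, then rename it to segments.XXX) looks roughly like the following in Java. This is a minimal sketch, not how Lucene writes its segments files: `AtomicPublish` and `publish` are hypothetical names, and it uses the `java.nio.file` API, which postdates this thread. Note also that other messages in the thread question whether rename is reliable on Windows, so treat the atomic move as an assumption about the underlying file system.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper (not part of Lucene): publishes a file via write, sync, rename. */
class AtomicPublish {
    static void publish(Path dir, String name, byte[] contents) throws IOException {
        Path tmp = dir.resolve(name + ".new"); // readers never look for this temporary name
        Path dst = dir.resolve(name);          // the name readers open

        try (FileChannel ch = FileChannel.open(tmp,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(contents);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(true); // flush data and metadata before the file becomes visible
        }

        // Throws AtomicMoveNotSupportedException if the file system cannot do this atomically.
        Files.move(tmp, dst, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

With this approach a reader either sees no file under the final name or sees the complete, synced contents. It does not address the other side of the race discussed later in the thread, where files referenced by an already opened segments file are deleted.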

Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Yonik Seeley
On Nov 12, 2007 5:08 PM, robert engels <[EMAIL PROTECTED]> wrote: > As long as the files are sync'd, even on NFS the reopen should not > see segments.XXX until it is ready. Right, but then there is a race on the other side... a reader may open the segments.XXX file and then start opening all the

Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Michael McCandless
robert engels <[EMAIL PROTECTED]> wrote: > Then how can the commit during reopen be an issue? This is what happens: * Reader opens latest segments_N & reads all SegmentInfos successfully. * Writer writes new segments_N+1, and then deletes now un-referenced files. * Reader tries

Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread robert engels
But merging segments doesn't delete the old, it only creates new, unless the segments meet the "purge old criteria". A reopen() is supposed to open the latest version in the directory by definition, so this seems rather a remote possibility. If it occurs due to low system resources (meaning

Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Michael McCandless
"Yonik Seeley" <[EMAIL PROTECTED]> wrote: > On Nov 12, 2007 5:08 PM, robert engels <[EMAIL PROTECTED]> wrote: > > As long as the files are sync'd, even on NFS the reopen should not > > see segments.XXX until it is ready. > > Right, but then there is a race on the other side... a reader may open >

Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread robert engels
On what are you basing the claim that "rename" is not reliable on Windows? That a virus scanner has the file open? If that is the case, that should either be treated as an incorrect setup, or the operation should be retried until it completes. Writing directly to a file that someone else can open for reading is bound to

Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Michael McCandless
"robert engels" <[EMAIL PROTECTED]> wrote: > But merging segments doesn't delete the old, it only creates new, > unless the segments meet the "purge old criteria". What's the "purge old criteria"? Normally a segment merge once committed immediately deletes the segments it had just merged. > A

Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Michael McCandless
Not just virus scanners: any program that uses the Microsoft API for being notified of file changes. I think TortoiseSVN was one such example. People who embed Lucene can't control what their users install on their desktops. Virus scanners are naturally very common on desktops. I think we want

Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread robert engels
Horse poo poo. If you are working in a local environment, the files should be opened with exclusive access. This guarantees that the operations will succeed for the calling process. That NFS is a viable solution is highly debatable, and IMO shows a lack of understanding of NFS and the unix

Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Michael Busch
robert engels wrote: > > The commit "in flight" cannot (SHOULD NOT) be deleting segments if they > are in use. That a caller could issue a reopen call means there are > segments in use by definition (or they would have nothing to reopen). > Reopen still works correctly, even if there are no seg

Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread robert engels
I am not debating that reopen works (since that is supposed to get the latest version). I am stating that commit cannot be deleting segments if they are in use, which they must be at that time in order to issue a reopen(), since to issue reopen() you must have an instance of IndexReader ope

Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Michael Busch
robert engels wrote: > > I was talking about Windows in particular - as stated, unix/linux does > not have the problem - under Windows the delete will (should) fail. > As I said, delete does fail on Windows in that case, and the IndexFileDeleter (called by the IndexWriter) catches the IOExceptio

Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread robert engels
That is not true - at least it didn't use to be. If there were readers open, the files/segments would not be deleted. They would be deleted at next open. The "purge criteria" was based on the next "commit" sets. To make this work, and be able to roll back or open a previous "version", you

[jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Michael Busch (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541998 ] Michael Busch commented on LUCENE-743: -- > I think the cause of the intermittent failure in the test is a missing

Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread robert engels
I would still argue that it is an incorrect setup - almost as bad as "not plugging the computer in". If a user runs a virus scanner or file system indexer on the lucene index directory, their system is going to slow to a crawl and indexing will be abominably slow. The installation guide s

Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Yonik Seeley
On Nov 12, 2007 7:19 PM, robert engels <[EMAIL PROTECTED]> wrote: > I would still argue that it is an incorrect setup - almost as bad as > "not plugging the computer in". A user themselves could even go in and look at the index files (I've done so myself)... as could a backup program or whatever.

[jira] Updated: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Michael Busch (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch updated LUCENE-743: - Attachment: lucene-743-take8.patch OK, all tests pass now, including the thread-safety test. I ra

Re: Term pollution from binary data

2007-11-12 Thread Chuck Williams
Doug Cutting wrote on 11/07/2007 09:26 AM: Hadoop's MapFile is similar to Lucene's term index, and supports a feature where only a subset of the index entries are loaded (determined by io.map.index.skip). It would not be difficult to add such a feature to Lucene by changing TermInfosReader#ens

Re: small improvement when no payloads?

2007-11-12 Thread Michael Busch
Yonik Seeley wrote: > The else clause in SegmentTermPositions.readDeltaPosition() is > redundant and could be removed, yes? > It's a pretty minor improvement, but this is very inner-loop stuff. > > -Yonik > Thanks, Yonik, you're right. We can safely remove those two lines. TermPositions#seek() r

Re: setSimilarity on Query

2007-11-12 Thread Shailesh Kochhar
Chris Hostetter wrote: independent of the QueryParser aspects of your question, adding a setSimilarity method to the Query class would be a complete 180 of how it currently works right now. Query classes have to have a getSimilarity method so that their Weight/Scorer have a way to access the

Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread robert engels
True. It seems that the Lucene code might be a bit more resilient here though, using the following: 1. open the segments file exclusively (if this fails, updates are prohibited, and an exception is thrown) 2. write new segments 3. write segments.new including segments hash & sync 4. update s