[jira] Updated: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-743:
-

Attachment: lucene-743-take7.patch

Changes:

- Updated patch to current trunk (I just realized that the 
  latest didn't apply cleanly anymore)
- MultiSegmentReader now decRefs the subReaders correctly
  in case an exception is thrown during reopen()
- Small changes in TestIndexReaderReopen.java

The thread-safety test still sometimes fails. The weird thing is that
the test verifies that the re-opened readers always return correct
results. The only problem is that the refCount value is not always 0
at the end of the test. I'm starting to think that the test case
itself has a problem. Maybe someone else can take a look - it's
probably something really obvious, but I'm already starting to feel
dizzy from pondering thread-safety.

> IndexReader.reopen()
> 
>
> Key: LUCENE-743
> URL: https://issues.apache.org/jira/browse/LUCENE-743
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Otis Gospodnetic
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.3
>
> Attachments: IndexReaderUtils.java, lucene-743-take2.patch, 
> lucene-743-take3.patch, lucene-743-take4.patch, lucene-743-take5.patch, 
> lucene-743-take6.patch, lucene-743-take7.patch, lucene-743.patch, 
> lucene-743.patch, lucene-743.patch, MyMultiReader.java, MySegmentReader.java, 
> varient-no-isCloneSupported.BROKEN.patch
>
>
> This is Robert Engels' implementation of IndexReader.reopen() functionality, 
> as a set of 3 new classes (this was easier for him to implement, but should 
> probably be folded into the core, if this looks good).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



How to solve the issue "Unable to read entire block; 72 bytes read; expected 512 bytes"

2007-11-12 Thread Durai murugan
Dear All,

Using Lucene I'm indexing my documents. While indexing some Word
documents I got the following exception:
Unable to read entire block; 72 bytes read; expected 512 bytes

While indexing RTF documents I get the following exception:
Unable to read entire block; 72 bytes read; expected 512 bytes

Why does this occur? How can I solve it?

Thanks in Advance.







Re: How to solve the issue "Unable to read entire block; 72 bytes read; expected 512 bytes"

2007-11-12 Thread Durai murugan
Sorry, for RTF it throws the following exception:

Unable to read entire header; 100 bytes read; expected 512 bytes

Is it an issue with POI or Lucene? If so, which build of POI contains
a fix for this problem, and where can I get it? Please tell me ASAP.

Thanks.


Re: setSimilarity on Query

2007-11-12 Thread Chris Hostetter

:  The problem is that I want to use QueryParser to construct the
: query for me. I am having to override the logic in QueryParser to
: construct my own derived class, which seems to me like a convoluted
: way of just setting the Similarity.

that's the basic design of the QueryParser class - you override to get 
custom behavior.

Independent of the QueryParser aspects of your question, adding a
setSimilarity method to the Query class would be a complete 180 from
how it currently works.

Query classes have to have a getSimilarity method so that their
Weight/Scorer have a way to access the similarity functions ... but every
core type of query gets that similarity from the searcher being used when
the query is executed.
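For the common case, then, the way to plug in a custom Similarity without
subclassing QueryParser is to set it on the searcher. A minimal sketch
(the index path, the query field, and the MySimilarity subclass are all
hypothetical):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;

public class SearchWithCustomSimilarity {
  public static void main(String[] args) throws Exception {
    // The searcher, not the Query, owns the Similarity: every core query
    // executed through this searcher picks it up via getSimilarity(),
    // so nothing in QueryParser needs to change.
    IndexSearcher searcher = new IndexSearcher("/path/to/index");
    searcher.setSimilarity(new MySimilarity());  // MySimilarity extends Similarity
    QueryParser parser = new QueryParser("body", new StandardAnalyzer());
    Hits hits = searcher.search(parser.parse("some query"));
    System.out.println(hits.length() + " hits");
  }
}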

If the Query class defined a "setSimilarity" then the similarity used by
one query in a BooleanQuery might not be the same as another query in the
same query structure ... queryNorms, idfs, tfs ... could all be completely
nonsensical.

A more logical extension point is probably along the lines of past
discussion about making all of the Similarity methods take in a field
name (so you could have a "PerFieldSimilarityWrapper" type implementation)
and/or changing Searchable.getSimilarity to take in a fieldname param.
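To make the tradeoff concrete: under the current API only lengthNorm()
receives a field name, so a wrapper along the lines discussed can dispatch
per field for that one method alone. A rough sketch (class name and map
contents are hypothetical):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.Similarity;

// Sketch of the "PerFieldSimilarityWrapper" idea under today's API.
// Only lengthNorm() gets a fieldName today, so it is the only method
// that can actually be dispatched per field; tf/idf/etc. cannot be,
// which is exactly what the proposed API change would fix.
public class PerFieldLengthNormSimilarity extends DefaultSimilarity {
  private final Map perField = new HashMap();   // fieldName -> Similarity

  public void addFieldSimilarity(String field, Similarity sim) {
    perField.put(field, sim);
  }

  public float lengthNorm(String fieldName, int numTokens) {
    Similarity sim = (Similarity) perField.get(fieldName);
    return sim != null ? sim.lengthNorm(fieldName, numTokens)
                       : super.lengthNorm(fieldName, numTokens);
  }
}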

I don't think anyone ever submitted a patch for either of those ideas
though ... if you check the mailing list archives you'll see there were
performance concerns about one of them (I think it was the first one,
because some of those methods are in tight loops, which is unfortunate
because it's the one that can be done in a backwards-compatible way).




-Hoss





Re: How to solve the issue "Unable to read entire block; 72 bytes read; expected 512 bytes"

2007-11-12 Thread Chris Hostetter

: Sorry, for RTF it throws the following exception:

: Unable to read entire header; 100 bytes read; expected 512 bytes

: Is it an issue with POI or Lucene? If so, which build of POI contains
: a fix for this problem, and where can I get it? Please tell me ASAP.

1) java-dev is for discussing development of the Lucene Java API;
questions about errors when using the Java API should be sent to the
java-user list.

2) that's just a one-line error string. It may be the message of an
exception -- but it may just be something logged by your application.  If
it is an exception message, the only way to make sense of it is to see the
entire exception stack trace.

3) I can't think of anywhere in the Lucene code base that might write out
a string like that (or throw an exception with that message). I suspect it
is coming from POI (I'd know for sure if you'd sent the full stack trace),
so you should consider contacting the POI user list ... before you do, you
might try a simple test of a micro app using POI to parse the same
document without Lucene involved at all -- if you get the same error, then
you know it's POI and not Lucene related at all.
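A minimal sketch of such a micro app, assuming the documents go through
POI's POIFS layer (the class and usage here are illustrative, not taken
from the original poster's code):

import java.io.FileInputStream;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public class PoiMicroTest {
  public static void main(String[] args) throws Exception {
    // Parse the document through POI alone, with no Lucene involved.
    // If the same "Unable to read entire header/block" message shows up
    // here, the problem is in POI (or in the file itself), not in Lucene.
    FileInputStream in = new FileInputStream(args[0]);
    try {
      new POIFSFileSystem(in);  // reads the OLE2 header; throws on bad input
      System.out.println("POI opened the file without error");
    } finally {
      in.close();
    }
  }
}

Worth noting: RTF is plain text, not an OLE2 container, so feeding an RTF
file to the OLE2-based POIFS layer would be expected to fail with exactly
this kind of header error.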



-Hoss





[jira] Commented: (LUCENE-1044) Behavior on hard power shutdown

2007-11-12 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541874
 ] 

Doug Cutting commented on LUCENE-1044:
--

> Is a sync before every file close really needed [...] ?

It might be nice if we could use the Linux sync() system call, instead of 
fsync().  Then we could call that only when the new segments file is moved into 
place rather than as each file is closed.  We could exec the sync shell command 
when running on Unix, but I don't know whether there's an equivalent command 
for Windows, and it wouldn't be Java...
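A minimal sketch of what exec'ing the sync shell command could look like
(the helper is hypothetical; sync(1) is the Unix command mentioned above,
so this is Unix-only):

// Sketch: flush all dirty OS buffers by exec'ing the Unix sync(1) command.
static void syncFileSystem() throws java.io.IOException, InterruptedException {
  Process p = Runtime.getRuntime().exec("sync");
  if (p.waitFor() != 0) {                 // returns once buffers are flushed
    throw new java.io.IOException("sync command failed");
  }
}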

> Behavior on hard power shutdown
> ---
>
> Key: LUCENE-1044
> URL: https://issues.apache.org/jira/browse/LUCENE-1044
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
> Environment: Windows Server 2003, Standard Edition, Sun Hotspot Java 
> 1.5
>Reporter: venkat rangan
>Assignee: Michael McCandless
> Fix For: 2.3
>
> Attachments: LUCENE-1044.patch, LUCENE-1044.take2.patch, 
> LUCENE-1044.take3.patch
>
>
> When indexing a large number of documents, upon a hard power failure  (e.g. 
> pull the power cord), the index seems to get corrupted. We start a Java 
> application as an Windows Service, and feed it documents. In some cases 
> (after an index size of 1.7GB, with 30-40 index segment .cfs files) , the 
> following is observed.
> The 'segments' file contains only zeros. Its size is 265 bytes - all bytes 
> are zeros.
> The 'deleted' file also contains only zeros. Its size is 85 bytes - all bytes 
> are zeros.
> Before corruption, the segments file and deleted file appear to be correct. 
> After this corruption, the index is corrupted and lost.
> This is a problem observed in Lucene 1.4.3. We are not able to upgrade our 
> customer deployments to 1.9 or later version, but would be happy to back-port 
> a patch, if the patch is small enough and if this problem is already solved.




Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown

2007-11-12 Thread robert engels
I don't think this would make any difference performance-wise, and
might actually be slower.

When you call FD.sync() it only needs to ensure that the dirty blocks
associated with that descriptor are saved.
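For reference, a minimal sketch of the per-descriptor FD.sync() pattern
under discussion (standard java.io; the helper itself is illustrative):

// Sketch: write a file and force its blocks to stable storage before close.
// Unlike a whole-filesystem sync, FileDescriptor.sync() only flushes blocks
// dirtied through this one descriptor.
static void writeDurably(java.io.File f, byte[] data) throws java.io.IOException {
  java.io.RandomAccessFile raf = new java.io.RandomAccessFile(f, "rw");
  try {
    raf.write(data);
    raf.getFD().sync();   // blocks until the OS reports the data is on disk
  } finally {
    raf.close();
  }
}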








Web-based Luke

2007-11-12 Thread mark harwood
I'm putting together a Google Web Toolkit-based version of Luke:
   http://www.inperspective.com/lucene/Luke.war
( Just add your version of lucene core jar to WEB-INF/lib subdirectory and you 
should have the basis of a web-enabled Luke.)

The intention behind this is to port Luke to a wholly Apache-licensed codebase 
so it can be managed in Lucene's subversion repository  (and for me to learn 
GWT!).

Early results are encouraging so I would like to consider how to handle this 
moving forward.

The considerations are:
1) Are folks interested in bringing this into the Lucene project?
2) Where to manage it (in contrib?)
3) What needs to change in the build process to take GWT source (Java code) and 
feed it through the GWT compiler to produce Javascript/html etc?
4) How to package it in the distribution (bundle Jetty?)

In MVC terms, having separated the Model code from the (thinlet-based)
View code, I now also have the basis for building a Swing-based UI on
the same backend.

Cheers,
Mark










Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown

2007-11-12 Thread Doug Cutting

robert engels wrote:
I don't think this would make any difference performance-wise, and
might actually be slower.

When you call FD.sync() it only needs to ensure that the dirty blocks
associated with that descriptor are saved.


The potential benefit is that you wouldn't have to wait for things to be 
written as you close files.  So, with write-behind, data could be 
written while the CPU moves on to other tasks, only blocking at commit. 
 With log-based filesystems, only the log need be flushed, and batching 
that is a performance win.  However, if there are lots of other 
applications writing at the same time, and the Lucene update is small, 
it could in theory slow things, but my hunch is that it would in 
practice frequently nearly eliminate the cost of syncing.


Doug




Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown

2007-11-12 Thread robert engels

Would it not be simpler to do this in pure Java...

Add the descriptor that needs to be sync'd (and closed) to a Queue.
Start a Thread to sync/close descriptors.

In commit(), wait for all sync threads to terminate using join().
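A minimal sketch of that scheme (all names hypothetical; this illustrates
the proposal, not Lucene's actual code):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.LinkedList;

// Sketch: a background thread drains a queue of descriptors, syncing and
// closing each; commit() then join()s the syncer so nothing is left dirty.
// (Simplified: assumes no new files are queued while commit() waits.)
class BackgroundSyncer {
  private final LinkedList pending = new LinkedList();
  private Thread syncer;

  synchronized void queueForSync(RandomAccessFile raf) {
    pending.add(raf);
    if (syncer == null) {
      syncer = new Thread() {
        public void run() { drain(); }
      };
      syncer.start();
    }
  }

  private void drain() {
    while (true) {
      RandomAccessFile f;
      synchronized (this) {
        if (pending.isEmpty()) { syncer = null; return; }
        f = (RandomAccessFile) pending.removeFirst();
      }
      try {
        f.getFD().sync();   // force to stable storage
        f.close();
      } catch (IOException e) {
        // a real implementation would record this and rethrow at commit
      }
    }
  }

  // commit(): block until everything queued so far is on disk
  void waitForSyncs() throws InterruptedException {
    Thread t;
    synchronized (this) { t = syncer; }
    if (t != null) t.join();
  }
}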





Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown

2007-11-12 Thread Doug Cutting

robert engels wrote:

Would it not be simpler to do this in pure Java...

Add the descriptor that needs to be sync'd (and closed) to a Queue.
Start a Thread to sync/close descriptors.

In commit(), wait for all sync threads to terminate using join().


+1

Doug





Re: How to solve the issue "Unable to read entire block; 72 bytes read; expected 512 bytes"

2007-11-12 Thread Ken Krugler

: 3) I can't think of anywhere in the Lucene code base that might write out
: a string like that (or throw an exception with that message) [...] I
: suspect it is coming from POI [...]


It's there in POI:

http://www.krugle.org/kse/files/svn/svn.apache.org/poi/src/java/org/apache/poi/poifs/storage/HeaderBlockReader.java

On line 83.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"




Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown

2007-11-12 Thread Yonik Seeley
On Nov 12, 2007 1:41 PM, robert engels <[EMAIL PROTECTED]> wrote:
> Would it not be simpler to do this in pure Java...
>
> Add the descriptor that needs to be sync'd (and closed) to a Queue.
> Start a Thread to sync/close descriptors.
>
> In commit(), wait for all sync threads to terminate using join().

This would also need to be hooked in with file deletion (since a file
could be created and deleted before commit()).

-Yonik




Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown

2007-11-12 Thread Michael McCandless

I'll look into this approach.

We must also sync/close a file before we can open it for reading, e.g.
for creating the compound file or if a merge kicks off.

Though if we are willing to not commit a new segments_N after saving a
segment and before creating its compound file, then we don't need to
sync the segment files in that case.

I think I would put all this logic (to manage background sync'ing)
under FSDirectory.

Mike

"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> On Nov 12, 2007 1:41 PM, robert engels <[EMAIL PROTECTED]> wrote:
> > Would it not be simpler to do this in pure Java...
> >
> > Add the descriptor that needs to be sync'd (and closed) to a Queue.
> > Start a Thread to sync/close descriptors.
> >
> > In commit(), wait for all sync threads to terminate using join().
> 
> This would also need to be hooked in with file deletion (since a file
> could be created and deleted before commit()).
> 
> -Yonik



Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown

2007-11-12 Thread robert engels

I would be wary of the additional complexity of doing this.

It would be my vote to make 'sync' an option, and if set, all files
are sync'd before close.

With proper hardware setup, this should be a minimal performance
penalty.

What about writing a marker at the end of each file? I am not sure it
is guaranteed, but if the segments file is sync'd and the segment files
have the correct marker, then the segment file is OK. Otherwise the
"bad" segments/versions can be removed (on start up).
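A minimal sketch of such an end-of-file marker (the magic value and
helpers are hypothetical, purely to illustrate the idea):

import java.io.File;
import java.io.RandomAccessFile;

// Hypothetical marker scheme: every file ends with a fixed magic long.
// On startup, a file whose tail doesn't match is treated as incomplete.
class FileMarker {
  static final long MAGIC = 0x4C75634D61726BL;   // arbitrary marker value

  static void writeMarker(RandomAccessFile raf) throws java.io.IOException {
    raf.seek(raf.length());
    raf.writeLong(MAGIC);
    raf.getFD().sync();
  }

  static boolean isComplete(File f) throws java.io.IOException {
    RandomAccessFile raf = new RandomAccessFile(f, "r");
    try {
      if (raf.length() < 8) return false;
      raf.seek(raf.length() - 8);
      return raf.readLong() == MAGIC;   // marker present => write finished
    } finally {
      raf.close();
    }
  }
}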





Re: Web-based Luke

2007-11-12 Thread Erik Hatcher


On Nov 12, 2007, at 1:21 PM, mark harwood wrote:

I'm putting together a Google Web Toolkit-based version of Luke:
   http://www.inperspective.com/lucene/Luke.war
( Just add your version of lucene core jar to WEB-INF/lib  
subdirectory and you should have the basis of a web-enabled Luke.)


Mark: +1   Wow!  Very nice.

The intention behind this is to port Luke to a wholly Apache-licensed
codebase so it can be managed in Lucene's subversion repository (and
for me to learn GWT!).


RDD (Resume Driven Development) at its finest!

Early results are encouraging so I would like to consider how to  
handle this moving forward.


The considerations are:
1) Are folks interested in bringing this into the Lucene project?


Absolutely.


2) Where to manage it (in contrib?)


Seems like a fine place to put it for now.  But it really deserves a  
better home than that.  What about a new "client/luke" directory?   
(following on Solr's structure).


3) What needs to change in the build process to take GWT source  
(Java code) and feed it through the GWT compiler to produce  
Javascript/html etc?


Can't be much.


4) How to package it in the distribution (bundle Jetty?)


Yeah, that'd be nice.  Exactly how Solr does it.

In MVC terms, having separated the Model code from the (thinlet-based)
View code, I now also have the basis for building a Swing-based UI on
the same backend.


This is very nice, Mark.  This would surely plug into Solr's admin UI  
very well also.


Erik





Re: [jira] Commented: (LUCENE-1044) Behavior on hard power shutdown

2007-11-12 Thread Michael McCandless
"robert engels" <[EMAIL PROTECTED]> wrote:
> I would be wary of the additional complexity of doing this.
>
> It would be my vote to make 'sync' an option, and if set, all files
> are sync'd before close.

This is the way it is now: doSync is an option to FSDirectory,
which defaults to true.

I agree sync() before close() is by far the simplest approach here.

On a good IO system it seems to have minimal performance impact.  On
poor hardware (laptop hard drive) I'm seeing a rather sizable impact
(~30-40% slowdown on indexing Wikipedia).

But I think given this I would still leave the default at true: I
think keeping the index consistent, even in the somewhat rare event of
a machine/OS crash, trumps indexing performance, as a default?  People
who care about performance are happy to change the defaults.

> With proper hardware setup, this should be a minimal performance  
> penalty.

Right.

> What about writing a marker at the end of each file? I am not sure it
> is guaranteed, but if the segments file is sync'd and the segment files
> have the correct marker, then the segment file is OK. Otherwise the
> "bad" segments/versions can be removed (on start up).

Well ... if we took this approach we would also have to forcefully
keep around the "last known good" commit point, vs what we do now
(delete all but the last commit point).  But, creating such a deletion
policy is not really possible because we can't "query" the IO system
(OS) to find out what's really on stable storage.
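For context, keeping extra commit points around is what a custom deletion
policy can already express; a rough sketch assuming the IndexDeletionPolicy
API (this keeps the last two commits as candidates only, since, as noted
above, we cannot verify which one actually reached stable storage):

import java.util.List;
import org.apache.lucene.index.IndexCommitPoint;
import org.apache.lucene.index.IndexDeletionPolicy;

// Sketch: keep the two most recent commit points instead of only the last.
// Note this keeps a candidate fallback commit, not a verified-good one.
public class KeepLastTwoCommitsPolicy implements IndexDeletionPolicy {
  public void onInit(List commits) { deleteAllButLastTwo(commits); }
  public void onCommit(List commits) { deleteAllButLastTwo(commits); }

  private void deleteAllButLastTwo(List commits) {
    for (int i = 0; i < commits.size() - 2; i++) {   // oldest commits first
      ((IndexCommitPoint) commits.get(i)).delete();
    }
  }
}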

Mike




small improvement when no payloads?

2007-11-12 Thread Yonik Seeley
The else clause in SegmentTermPositions.readDeltaPosition() is
redundant and could be removed, yes?
It's a pretty minor improvement, but this is very inner-loop stuff.

-Yonik

  private final int readDeltaPosition() throws IOException {
    int delta = proxStream.readVInt();
    if (currentFieldStoresPayloads) {
      // if the current field stores payloads then
      // the position delta is shifted one bit to the left.
      // if the LSB is set, then we have to read the current
      // payload length
      if ((delta & 1) != 0) {
        payloadLength = proxStream.readVInt();
      }
      delta >>>= 1;
      needToLoadPayload = true;
    } else {
      payloadLength = 0;
      needToLoadPayload = false;
    }
    return delta;
  }




[jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541955
 ] 

Michael McCandless commented on LUCENE-743:
---

I think the cause of the intermittent failure in the test is a missing
try/finally in doReopen to properly close/decRef everything on
exception.

Because of lockless commits, a commit could be in progress while you
are re-opening, in which case you could hit an IOException and you
must therefore decRef those norms you had incRef'd (and close, e.g.,
the newly opened FieldsReader).
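Schematically, the missing piece is the usual success-flag pattern (a
sketch; the exact bookkeeping in the patch differs):

boolean success = false;
try {
  // ... incRef norms, open the new FieldsReader, etc. ...
  success = true;
} finally {
  if (!success) {
    // A commit raced with the reopen and a file was missing: undo every
    // incRef/open done so far, so the retry starts from a clean state
    // (otherwise refCount never returns to 0, as the test shows).
  }
}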




Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread robert engels
Why doesn't reopen get the 'read' lock? Since commit has the write
lock, it should wait...





Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Yonik Seeley
On Nov 12, 2007 4:43 PM, robert engels <[EMAIL PROTECTED]> wrote:
> Why doesn't reopen get the 'read' lock, since commit has the write
> lock, it should wait...

After lockless commits, there is no read lock!

-Yonik




Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread robert engels

Then how can the commit during reopen be an issue?

I am not very familiar with this new code, but it seems that you need
to write segments.XXX.new and then rename it to segments.XXX.

As long as the files are sync'd, even on NFS the reopen should not
see segments.XXX until it is ready.

Although lockless commits are beneficial in their own right, I still
think that people's understanding of NFS limitations is flawed. Read
the section below on "close to open" consistency. There should be no
problem using Lucene across NFS - even the old version.

The write-once nature of Lucene makes this trivial.  The only problem
was the segments file, which, if Lucene had used the read/write lock
and close() correctly, never would have been a problem.


According to the NFS docs:

NFS Version 2 requires that a server must save all the data in a  
write operation to disk before it replies to a client that the write  
operation has completed. This can be expensive because it breaks  
write requests into small chunks (8KB or less) that must each be  
written to disk before the next chunk can be written. Disks work best  
when they can write large amounts of data all at once.


NFS Version 3 introduces the concept of "safe asynchronous writes." A  
Version 3 client can specify that the server is allowed to reply  
before it has saved the requested data to disk, permitting the server  
to gather small NFS write operations into a single efficient disk  
write operation. A Version 3 client can also specify that the data  
must be written to disk before the server replies, just like a  
Version 2 write. The client specifies the type of write by setting  
the stable_how field in the arguments of each write operation to  
UNSTABLE to request a safe asynchronous write, and FILE_SYNC for an  
NFS Version 2 style write.


Servers indicate whether the requested data is permanently stored by  
setting a corresponding field in the response to each NFS write  
operation. A server can respond to an UNSTABLE write request with an  
UNSTABLE reply or a FILE_SYNC reply, depending on whether or not the  
requested data resides on permanent storage yet. An NFS protocol-compliant
server must respond to a FILE_SYNC request only with a
FILE_SYNC reply.


Clients ensure that data that was written using a safe asynchronous  
write has been written onto permanent storage using a new operation  
available in Version 3 called a COMMIT. Servers do not send a  
response to a COMMIT operation until all data specified in the  
request has been written to permanent storage. NFS Version 3 clients  
must protect buffered data that has been written using a safe  
asynchronous write but not yet committed. If a server reboots before  
a client has sent an appropriate COMMIT, the server can reply to the  
eventual COMMIT request in a way that forces the client to resend the  
original write operation. Version 3 clients use COMMIT operations  
when flushing safe asynchronous writes to the server during a close(2)
or fsync(2) system call, or when encountering memory pressure.




A8. What is close-to-open cache consistency?
A. Perfect cache coherency among disparate NFS clients is very  
expensive to achieve, so NFS settles for something weaker that  
satisfies the requirements of most everyday types of file sharing.  
Everyday file sharing is most often completely sequential: first  
client A opens a file, writes something to it, then closes it; then  
client B opens the same file, and reads the changes.


So, when an application opens a file stored in NFS, the NFS client  
checks that it still exists on the server, and is permitted to the  
opener, by sending a GETATTR or ACCESS operation. When the  
application closes the file, the NFS client writes back any pending  
changes to the file so that the next opener can view the changes.  
This also gives the NFS client an opportunity to report any server  
write errors to the application via the return code from close().  
This behavior is referred to as close-to-open cache consistency.


Linux implements close-to-open cache consistency by comparing the  
results of a GETATTR operation done just after the file is closed to  
the results of a GETATTR operation done when the file is next opened.  
If the results are the same, the client will assume its data cache is  
still valid; otherwise, the cache is purged.


Close-to-open cache consistency was introduced to the Linux NFS  
client in 2.4.20. If for some reason you have applications that  
depend on the old behavior, you can disable close-to-open support by  
using the "nocto" mount option.


There are still opportunities for a client's data cache to contain  
stale data. The NFS version 3 protocol introduced "weak cache  
consistency" (also known as WCC) which provides a way of checking a  
file's attributes before and after an operation to allow a client to  
identify changes that could have been made by other clients.  
Unfortunately when a clien

Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Yonik Seeley
On Nov 12, 2007 5:08 PM, robert engels <[EMAIL PROTECTED]> wrote:
> As long as the files are sync'd, even on nfs the reopen should not
> see segments.XXX until is is ready.

Right, but then there is a race on the other side... a reader may open
the segments.XXX file and then start opening all the referenced
segments files, but some of them may have already been deleted because
a segment merge happened.  There's a retry mechanism in this case.
http://issues.apache.org/jira/browse/LUCENE-701

I guess the test with 150 threads is very atypical and could actually
cause a reader to not be successfully opened and hence an exception
thrown.

-Yonik




Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Michael McCandless

robert engels <[EMAIL PROTECTED]> wrote:

> Then how can the commit during reopen be an issue?

This is what happens:

  * Reader opens latest segments_N & reads all SegmentInfos
successfully.

  * Writer writes new segments_N+1, and then deletes now un-referenced
files.

  * Reader tries to open files referenced by segments_N and hits FNFE
when it tries to open a file the writer just removed.

Lucene handles this fine (it just retries on the new segments_N+1),
but the patch in LUCENE-743 is now failing to decRef the Norm
instances when this retry happens.
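Schematically, the open/retry dance looks like this sketch (helper names
are hypothetical; the real logic lives in SegmentInfos.FindSegmentsFile,
see LUCENE-701):

while (true) {
  String segmentsFile = findLatestSegmentsFile(dir);   // e.g. segments_N
  try {
    return openAllReferencedFiles(segmentsFile);
  } catch (java.io.FileNotFoundException fnfe) {
    // A writer committed segments_N+1 and deleted files we were about to
    // open; loop and retry against the newer generation.  LUCENE-743's
    // bug: the Norms incRef'd before the exception were never decRef'd.
  }
}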

> I am not very familiar with this new code, but it seems that you need
> to write segments.XXX.new and then rename it to segments.XXX.

We don't rename anymore (it's not reliable on Windows).  We write
straight to segments_N.

> As long as the files are sync'd, even on NFS the reopen should not
> see segments.XXX until it is ready.
>
> Although lockless commits are beneficial in their own right, I still
> think that people's understanding of NFS limitations is flawed. Read
> the section below on "close to open" consistency. There should be no
> problem using Lucene across NFS - even the old version.
>
> The write-once nature of Lucene makes this trivial.  The only problem
> was the segments file, which, if Lucene had used the read/write lock
> and close() correctly, never would have been a problem.

Yes, in an ideal world, NFS servers+clients are supposed to implement
close-to-open semantics, but in my experience they do not always
succeed.  Previous versions of Lucene do in fact have problems over
NFS.  NFS also does not give you "delete on last close", which Lucene
normally relies on (unless you create a custom deletion policy).

Mike




Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread robert engels
But merging segments doesn't delete the old ones, it only creates new
ones, unless the segments meet the "purge old criteria".

A reopen() is supposed to open the latest version in the directory by
definition, so this seems a rather remote possibility.

If it occurs due to low system resources (meaning that during a
reopen some expected segments were already deleted), throw a
StaleIndexException and the client can reissue the reopen() call
(similar to when it cannot get the write lock).





Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Michael McCandless

"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> On Nov 12, 2007 5:08 PM, robert engels <[EMAIL PROTECTED]> wrote:
> > As long as the files are sync'd, even on nfs the reopen should not
> > see segments.XXX until is is ready.
> 
> Right, but then there is a race on the other side... a reader may open
> the segments .XXX file and then start opening all the referenced
> segments files, but some of them may have already been deleted because
> a segment merge happened.  There's a retry mechanism in this case.
> http://issues.apache.org/jira/browse/LUCENE-701
> 
> I guess the test with 150 threads is very atypical and could actually
> cause a reader to not be successfully opened and hence an exception
> thrown.

The test is just hitting the normal retry exception, and then the
retry succeeds, but the patch fails to decRef those incRef's it had
done on the first attempt.

Mike




Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread robert engels
What are you basing "rename is not reliable on Windows" on? That a
virus scanner has the file open? If that is the case, that is either
an incorrect setup, or the operation should be retried until it
completes.

Writing directly to a file that someone else can open for reading is
bound to be a problem. If the file is opened exclusively for write,
then others will be prohibited from opening it for read, so there
should not be a problem.

All of the "delete on last close" stuff is a poor design. The
database can be resynced on startup.

The basic design flaw is one I have pointed out many times - you
either use Lucene in a local environment, or a server environment.
Using NFS to "share" a Lucene database is a poor choice (normally due
to performance, but there are other problems - e.g. resource and user
monitoring, etc.)!

People have written reliable database systems without very advanced
semantics for years. There is no reason for all of this esoteric code
in Lucene.

Those that claim Lucene had problems with NFS in the past did not
perform reliable testing, or their OS was out of date.  If Lucene was
failing because an OS needed an update, would you change Lucene, or
fix/update the OS??? Obviously the former.

Some very loud voices complained about the NFS problems without doing
the due diligence and test cases to prove the problem. Instead they
just mucked up the Lucene code.






Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Michael McCandless

"robert engels" <[EMAIL PROTECTED]> wrote:

> But merging segments doesn't delete the old ones, it only creates new
> ones, unless the segments meet the "purge old criteria".

What's the "purge old criteria"?

Normally a segment merge once committed immediately deletes the
segments it had just merged.

> A reopen() is supposed to open the latest version in the directory
> by definition, so this seems a rather remote possibility.

Well, if a commit is in-flight then likely the reopen will hit an
exception and then retry.  This is the same as a normal open.

> If it occurs due to low system resources (meaning that during a
> reopen some expected segments were already deleted), throw a
> StaleIndexException and the client can reissue the reopen() call
> (similar to when it cannot get the write lock).

I'm not sure what you mean by "low system resources".  Missing some
files because they were deleted by a commit in process isn't a low
system resources sort of situation.

Mike




Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Michael McCandless
Not just virus scanners: any program that uses the Microsoft API for
being notified of file changes.  I think TortoiseSVN was one such
example.

People who embed Lucene can't control what their users install on
their desktops.  Virus scanners are naturally very common on
desktops.  I think we want Lucene to work in these cases.

NFS (and other shared filesystems) is a convenient, if not performant,
way to share an index.  I think Lucene should work in such cases
as well.

Mike

"robert engels" <[EMAIL PROTECTED]> wrote:
> What are you basing the "rename" is not reliable on windows on? That  
> a virus scanner has the file open. If that is the case, that should  
> either be an incorrect setup, or the operation retried until it  
> completes.
> 
> Writing directly to a file that someone else can open for reading is  
> bound to be a problem. If the file is opened exclusive for write,  
> then the others will be prohibited from opening for read, so there  
> should not be a problem.
> 
> All of the "delete on last close" stuff is a poor design. The  
> database can be resync on startup.
> 
> The basic design flaw is one I have pointed out many times - you  
> either use Lucene in a local environment, or a server environment.  
> Using NFS to "share" a Lucene database is a poor choice (normally due  
> to performance, but there are other problems - e.g. resource and user  
> monitoring, etc.) is a poor choice !.
> 
> People have written reliable database systems without very advanced  
> semantics for years. There is no reason for all of this esoteric code  
> in Lucene.
> 
> Those that claim, Lucene had problems with NFS in the past, did not  
> perform reliable testing, or their OS was out of date.  What is  
> Lucene was failing for an OS needed an update, would you change  
> Lucene, or fix/update the OS??? Obviously the former.
> 
> Some very loud voices complained about the NFS problems without doing  
> the due diligence and test cases to prove the problem. Instead they  
> just mucked up the Lucene code.
> 
> 
> On Nov 12, 2007, at 4:54 PM, Michael McCandless wrote:
> 
> >
> > robert engels <[EMAIL PROTECTED]> wrote:
> >
> >> Then how can the commit during reopen be an issue?
> >
> > This is what happens:
> >
> >   * Reader opens latest segments_N & reads all SegmentInfos
> > successfully.
> >
> >   * Writer writes new segments_N+1, and then deletes now un-referenced
> > files.
> >
> >   * Reader tries to open files referenced by segments_N and hits FNFE
> > when it tries to open a file writer just removed.
> >
> > Lucene handles this fine (it just retries on the new segments_N+1),
> > but the patch in LUCENE-743 is now failing to decRef the Norm
> > instances when this retry happens.
> >
> >> I am not very family with this new code, but it seems that you need
> >> to write segments.XXX.new and then rename to segments.XXX.
> >
> > We don't rename anymore (it's not reliable on windows).  We write
> > straight to segments_N.
> >
> >> As long as the files are sync'd, even on nfs the reopen should not
> >> see segments.XXX until is is ready.
> >>
> >> Although lockless commits are beneficial in their own rite, I still
> >> think that people's understanding of NFS limitations are
> >> flawed. Read the section below on "close to open" consistency. There
> >> should be no problem using Lucene across NFS - even the old version.
> >>
> >> The write-once nature of Lucene makes this trivial.  The only
> >> problem was the segments file, which is lucene used the read/write
> >> lock and close(0 correctly never would have been a problem.
> >
> > Yes, in an ideal world, NFS server+clients are supposed to implement
> > close-to-open semantics but in my experience they do not always
> > succeed.  Previous version of Lucene do in fact have problems over
> > NFS.  NFS also does not give you "delete on last close" which Lucene
> > normally relies on (unless you create a custom deletion policy).
> >
> > Mike
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread robert engels
Horse poo poo.  If you are working in a local environment, the files
should be opened with exclusive access. This guarantees that the
operations will succeed for the calling process.

That NFS is a viable solution is highly debatable, and IMO shows a
lack of understanding of NFS and the unix/linux filesystem design
principles.  Read about why unix never offered file locking, and never
really needed it...

Still, if proper use of exclusive access controls is made, Lucene (and
Java) have no problems working in an NFS/shared filesystem
environment.

Sorry, but that some only recently became aware of FD.sync() shows
that they don't really know enough to be designing/testing systems
like this.

Sorry if the tone of this is harsh, but I hate seeing lots of complex
code because the designers fail to understand the basic operating
principles of what they are working with...



Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Michael Busch
robert engels wrote:
> 
> The commit "in flight" cannot (SHOULD NOT) be deleting segments if they
> are in use.  That a caller could issue a reopen call means there are
> segments in use by definition (or they would have nothing to reopen).
> 

Reopen still works correctly, even if there are no segments left that
the old reader used. It will simply behave as an "open" then.

An example is an index that was optimized. In that case all old segments
are gone and if you reopen your reader you will get a new SegmentReader
that opens the new segment.

The old reader can still access the old segments because of the OS'
"delete on last close" semantics. Or, on Windows, the IndexWriter will
retry deleting the old segments until the delete succeeds (i.e. after
the last reader accessing them has been closed).
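A minimal sketch of that retry-on-failure deletion pattern (schematic
only; the real logic lives in IndexFileDeleter):

import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Schematic: deletes that fail (e.g. file still open on Windows) are
// remembered and retried on a later pass instead of aborting the commit.
class RetryingDeleter {
  private final List deferred = new ArrayList();   // Files to retry

  void delete(File f) {
    if (!f.delete()) {        // fails on Windows while a reader has it open
      deferred.add(f);
    }
  }

  // called again at the next commit / writer event
  void retryPending() {
    for (int i = deferred.size() - 1; i >= 0; i--) {
      if (((File) deferred.get(i)).delete()) {
        deferred.remove(i);
      }
    }
  }
}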

-Michael




Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread robert engels
I am not debating that reopen works (since that is supposed to get the
latest version). I am stating that commit cannot be deleting segments
if they are in use, which they must be at that time in order to issue
a reopen(), since to issue reopen() you must have an instance of
IndexReader open, which means you will have segments open...

I was talking about Windows in particular - as stated, unix/linux does
not have the problem - under Windows the delete will (should) fail.





Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Michael Busch
robert engels wrote:
> 
> I was talking about Windows in particular - as stated, unix/linux does
> not have the problem - under Windows the delete will (should) fail.
> 

As I said, delete does fail on Windows in that case, and the
IndexFileDeleter (called by the IndexWriter) catches the IOException and
tries again (and again...).
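
Roughly (sketch of the idea, not the actual IndexFileDeleter code):

try {
  directory.deleteFile(fileName);
} catch (IOException e) {
  // on Windows the file may still be held open by a reader;
  // remember it and retry on a later delete pass
  deletable.add(fileName);
}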

-Michael

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread robert engels
That is not true - at least it didn't use to be.  If there were
readers open, the files/segments would not be deleted; they would be
deleted at the next open.

The "purge criteria" was based on the next "commit" sets. To make
this work, and be able to roll back or open a previous "version", you
need to keep the segments around.


The commit "in flight" cannot (SHOULD NOT) be deleting segments if  
they are in use.  That a caller could issue a reopen call means there  
are segments in use by definition (or they would have nothing to  
reopen).



On Nov 12, 2007, at 5:14 PM, Michael McCandless wrote:



"robert engels" <[EMAIL PROTECTED]> wrote:


But merging segments doesn't delete the old, it only creates new,
unless the segments meet the "purge old criteria".


What's the "purge old criteria"?

Normally a segment merge once committed immediately deletes the
segments it had just merged.


A reopen() is supposed to open the latest version in the directory
by definition, so this seems rather a remote possibility.


Well, if a commit is in-flight then likely the reopen will hit an
exception and then retry.  This is the same as a normal open.


If it occurs due to low system resources (meaning that during a
reopen some expected segments were already deleted, throw an
StaleIndexException) and the client can reissue the reopen() call
(similar to if it could not get the write lock).


I'm not sure what you mean by "low system resources".  Missing some
files because they were deleted by a commit in process isn't a low
system resources sort of situation.

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541998
 ] 

Michael Busch commented on LUCENE-743:
--

> I think the cause of the intermittant failure in the test is a missing
> try/finally in doReopen to properly close/decRef everything on
> exception.

Awesome! Thanks so much for pointing me there, Mike! I was getting a 
little suicidal here already ... ;)

I should have read the comment in SegmentReader#initialize more 
carefully:
{code:java}
} finally {

  // With lock-less commits, it's entirely possible (and
  // fine) to hit a FileNotFound exception above.  In
  // this case, we want to explicitly close any subset
  // of things that were opened so that we don't have to
  // wait for a GC to do so.
  if (!success) {
doClose();
  }
}
{code}

While debugging, it's easy to miss such an exception, because
SegmentInfos.FindSegmentsFile#run() ignores it. But it's good that it
logs such exceptions; I just have to remember to print out the
infoStream next time.

So it seems that this was indeed the cause of the failing test case.
I made the change and so far the tests haven't failed anymore (I've
run them about 10 times so far). I'll run them another few times on a
different JVM and submit an updated patch in a short while if they
don't fail again.
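
The fix in doReopen() follows the same pattern (rough sketch, not the
exact code from the patch):
{code:java}
boolean success = false;
try {
  // ... (re)open the subReaders ...
  success = true;
} finally {
  if (!success) {
    // on an exception, decRef all subReaders that were already
    // incRef'd, so the refCounts end up correct
    for (int i = 0; i < subReaders.length; i++) {
      try {
        subReaders[i].decRef();
      } catch (IOException ignore) {
        // keep releasing the remaining readers
      }
    }
  }
}
{code}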


> IndexReader.reopen()
> 
>
> Key: LUCENE-743
> URL: https://issues.apache.org/jira/browse/LUCENE-743
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Otis Gospodnetic
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.3
>
> Attachments: IndexReaderUtils.java, lucene-743-take2.patch, 
> lucene-743-take3.patch, lucene-743-take4.patch, lucene-743-take5.patch, 
> lucene-743-take6.patch, lucene-743-take7.patch, lucene-743.patch, 
> lucene-743.patch, lucene-743.patch, MyMultiReader.java, MySegmentReader.java, 
> varient-no-isCloneSupported.BROKEN.patch
>
>
> This is Robert Engels' implementation of IndexReader.reopen() functionality, 
> as a set of 3 new classes (this was easier for him to implement, but should 
> probably be folded into the core, if this looks good).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread robert engels
I would still argue that it is an incorrect setup - almost as bad as  
"not plugging the computer in".


If a user runs a virus scanner or file system indexer on the lucene  
index directory, their system is going to slow to a crawl and  
indexing will be abominably slow.


The installation guide should just make excluding the index directory required.

An installer can easily use the available APIs to remove the lucene  
data directory from virus scanning / indexing.


On Nov 12, 2007, at 6:01 PM, Michael Busch wrote:


robert engels wrote:

> I was talking about Windows in particular - as stated, unix/linux does
> not have the problem - under Windows the delete will (should) fail.

As I said, delete does fail on Windows in that case, and the
IndexFileDeleter (called by the IndexWriter) catches the IOException and
tries again (and again...).

-Michael

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Yonik Seeley
On Nov 12, 2007 7:19 PM, robert engels <[EMAIL PROTECTED]> wrote:
> I would still argue that it is an incorrect setup - almost as bad as
> "not plugging the computer in".

A user themselves could even go in and look at the index files (I've
done so myself)... as could a backup program or whatever.  It's a fact
of life on windows that a move or delete can fail.

-Yonik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-743:
-

Attachment: lucene-743-take8.patch

OK, all tests pass now, including the thread-safety test.
I ran it several times on different JVMs.

Changes:
- As Mike suggested, I added a try ... finally block to
  SegmentReader#reopenSegment() that cleans up after an
  exception is hit.
- Added some additional comments.
- Minor improvements to TestIndexReaderReopen

> IndexReader.reopen()
> 
>
> Key: LUCENE-743
> URL: https://issues.apache.org/jira/browse/LUCENE-743
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Otis Gospodnetic
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.3
>
> Attachments: IndexReaderUtils.java, lucene-743-take2.patch, 
> lucene-743-take3.patch, lucene-743-take4.patch, lucene-743-take5.patch, 
> lucene-743-take6.patch, lucene-743-take7.patch, lucene-743-take8.patch, 
> lucene-743.patch, lucene-743.patch, lucene-743.patch, MyMultiReader.java, 
> MySegmentReader.java, varient-no-isCloneSupported.BROKEN.patch
>
>
> This is Robert Engels' implementation of IndexReader.reopen() functionality, 
> as a set of 3 new classes (this was easier for him to implement, but should 
> probably be folded into the core, if this looks good).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Term pollution from binary data

2007-11-12 Thread Chuck Williams

Doug Cutting wrote on 11/07/2007 09:26 AM:
Hadoop's MapFile is similar to Lucene's term index, and supports a 
feature where only a subset of the index entries are loaded 
(determined by io.map.index.skip).  It would not be difficult to add 
such a feature to Lucene by changing TermInfosReader#ensureIndexIsRead().


Here's a (totally untested) patch.


Doug, thanks for this suggestion and your quick patch.

I fleshed this out in the version of Lucene we are using, a bit after 
2.1.  There was an off-by-1 bug plus a few missing pieces.  The attached 
patch is for 2.1+, but might be useful as it at least contains the 
corrections and missing elements.  It also contains extensions to the 
tests to exercise the patch.


I tried integrating this into 2.3, but enough has changed that it was
not straightforward (primarily for the test case extensions -- the
implementation seems like it will apply with just a bit of manual
merging).  Unfortunately, I have so many local changes that it has
become difficult to track the latest Lucene.  The task of syncing up
will come soon.
I'll post a proper patch against the trunk in jira at a future date if 
the issue is not already resolved before then.


Michael McCandless wrote on 11/08/2007 12:43 AM:

I'll open an issue and work through this patch.
  
Michael, I did not see the issue, or else I would have posted this there.
Unfortunately, I'm pretty far behind on lucene mail these days.

One thing is: I'd prefer to not use system property for this, since
it's so global, but I'm not sure how to better do it.
  


Agree strongly that this should not be global.  Whether via ctors or an
index-specific properties object or whatever, it is important to be able
to set this on some indexes and not others in a single application.
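
E.g., something along these lines (purely hypothetical API, just to
illustrate the point; not what the patch does):

// hypothetical per-index setting, not an existing constructor:
IndexReader big   = IndexReader.open(bigIndexDir, 4);   // load every 4th index term
IndexReader small = IndexReader.open(smallIndexDir, 1); // full term index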


Thanks for picking this up!

Chuck

Index: src/test/org/apache/lucene/index/DocHelper.java
===
--- src/test/org/apache/lucene/index/DocHelper.java	(revision 2247)
+++ src/test/org/apache/lucene/index/DocHelper.java	(working copy)
@@ -254,10 +254,25 @@
*/ 
   public static void writeDoc(Directory dir, Analyzer analyzer, Similarity similarity, String segment, Document doc) throws IOException
   {
-DocumentWriter writer = new DocumentWriter(dir, analyzer, similarity, 50);
-writer.addDocument(segment, doc);
+writeDoc(dir, analyzer, similarity, segment, doc, IndexWriter.DEFAULT_TERM_INDEX_INTERVAL);
   }
 
+  /**
+   * Writes the document to the directory segment using the analyzer and the similarity score
+   * @param dir
+   * @param analyzer
+   * @param similarity
+   * @param segment
+   * @param doc
+   * @param termIndexInterval
+   * @throws IOException
+   */ 
+  public static void writeDoc(Directory dir, Analyzer analyzer, Similarity similarity, String segment, Document doc, int termIndexInterval) throws IOException
+  {
+DocumentWriter writer = new DocumentWriter(dir, analyzer, similarity, 50, termIndexInterval);
+writer.addDocument(segment, doc);
+  }
+  
   public static int numFields(Document doc) {
 return doc.getFields().size();
   }
Index: src/test/org/apache/lucene/index/TestSegmentTermDocs.java
===
--- src/test/org/apache/lucene/index/TestSegmentTermDocs.java	(revision 2247)
+++ src/test/org/apache/lucene/index/TestSegmentTermDocs.java	(working copy)
@@ -25,6 +25,7 @@
 import org.apache.lucene.document.Field;
 
 import java.io.IOException;
+import org.apache.lucene.search.Similarity;
 
 public class TestSegmentTermDocs extends TestCase {
   private Document testDoc = new Document();
@@ -212,6 +213,23 @@
 dir.close();
   }
   
+  public void testIndexDivisor() throws IOException {
+dir = new RAMDirectory();
+testDoc = new Document();
+DocHelper.setupDoc(testDoc);
+DocHelper.writeDoc(dir, new WhitespaceAnalyzer(), Similarity.getDefault(), "test", testDoc, 3);
+
+assertNull(System.getProperty("lucene.term.index.divisor"));
+System.setProperty("lucene.term.index.divisor", "2");
+try {
+  testTermDocs();
+  testBadSeek();
+  testSkipTo();
+} finally {
+  System.clearProperty("lucene.term.index.divisor");
+}
+  }
+  
   private void addDoc(IndexWriter writer, String value) throws IOException
   {
   Document doc = new Document();
Index: src/test/org/apache/lucene/index/TestSegmentReader.java
===
--- src/test/org/apache/lucene/index/TestSegmentReader.java	(revision 2247)
+++ src/test/org/apache/lucene/index/TestSegmentReader.java	(working copy)
@@ -23,10 +23,12 @@
 import java.util.List;
 
 import junit.framework.TestCase;
+import org.apache.lucene.analysis.WhitespaceAnalyzer;
 
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Fieldable;
 import org.apache.lucene.search.DefaultSimilarity;
+import org.apache.lucene.search.Similarity;
 import org.apa

Re: small improvement when no payloads?

2007-11-12 Thread Michael Busch
Yonik Seeley wrote:
> The else clause in SegmentTermPositions.readDeltaPosition() is
> redundant and could be removed, yes?
> It's a pretty minor improvement, but this is very inner-loop stuff.
> 
> -Yonik
> 

Thanks, Yonik, you're right. We can safely remove those two lines.
TermPositions#seek() resets the two values. And
"currentFieldStoresPayloads" doesn't change unless seek() is called.

All test cases still pass after removing the else clause. I'll commit
this small change (I don't think we need to open a Jira issue).

-Michael

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: setSimilarity on Query

2007-11-12 Thread Shailesh Kochhar

Chris Hostetter wrote:
independent of the QueryParser aspects of your question, adding a 
setSimilarity method to the Query class would be a complete 180 from how 
it currently works.


Query classes have to have a getSimilarity method so that their 
Weight/Scorer have a way to access the similarity functions ... but every 
core type of query gets that similarity from the searcher being used when 
the query is executed.


if the Query class defined a "setSimilarity" then the similarity used by 
one query in a BooleanQuery might not be the same as another query in the 
same query structure ... queryNorms, idfs, tfs ... could all be completely 
nonsensical.


The getSimilarity() implementation in Query actually invokes 
Searcher.getSimilarity(), which in turn returns the value of 
Similarity.getDefault().

IndexSearcher has a corresponding setSimilarity() method which will 
override the returned value, which makes it convenient for what 
you're trying to accomplish.
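
For example (sketch using the existing API):

IndexSearcher searcher = new IndexSearcher(directory);
searcher.setSimilarity(new DefaultSimilarity() {
  public float lengthNorm(String fieldName, int numTokens) {
    return 1.0f; // e.g. switch off length normalization
  }
});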


There is, however, another point of discord -- which is the Weight 
associated with the Query (which is relevant if you want a different 
implementation of term weighting). Here the locus of control is inverted 
-- it is the Searcher which delegates to the Query in order to create 
the Weight. In order to change the scoring implementation one needs to 
implement a new Query class, a new Weight class, a new Similarity class 
and a new QueryParser.


A friendlier alternative I'd like to propose is a sort of Weight and 
Similarity factory which is provided either to the top level Query 
object that is returned from parsing -- or to the Searcher object that 
processes the query. The factory can then return Similarity and Weight 
implementations that are identical for all parts of the query and which 
are mutually consistent.


This would allow field specific Similarity and Weight implementations 
and would also be backwards compatible.
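
Roughly (hypothetical interface, just to sketch the idea; not an 
existing Lucene API):

public interface ScoringFactory {
  /** Similarity to use for the given field. */
  Similarity getSimilarity(String fieldName);
  /** Weight for the given query, consistent with the similarities above. */
  Weight createWeight(Query query, Searcher searcher) throws IOException;
}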


A more logical extension point is probably along the lines of past 
discussion towards making all of the Similarity methods take in a field 
name (so you could have a "PerFieldSimilarityWrapper" type implementation) 
and/or changing Searchable.getSimilarity to take in a fieldname param.


i don't think anyone ever submitted a patch for either of those ideas 
though ... if you check the mailing list archives you'll see there were 
performance concerns about one of them (i think it was the first one, 
because some of those methods are in tight loops - which is unfortunate, 
because it's the one that can be done in a backwards compatible way)





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-11-12 Thread robert engels
True. It seems that the Lucene code might be a bit more resilient
here though, using the following:

1. open the segments file exclusively (if this fails, updates are
   prohibited, and an exception is thrown)
2. write new segments
3. write segments.new including the segments hash & sync
4. update the segments file including the hash
5. delete the segments that you can

Then if it crashes in step 4, it is easy to know segments is bad (out
of date) and use segments.new.
If it crashes in step 3, then segments.new is easily detected as
being corrupt (the hash does not match), so you know segments is valid.

If there are segments that cannot be deleted in step 5, every open
can check whether it can delete them...

A similar technique can be used with lockless commits; you just need
to make it segments.XXX.new, etc.
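
Step 3 could be as simple as this (untested sketch; the trailing-CRC
layout is only an illustration, not Lucene's actual file format):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.zip.CRC32;

public class SegmentsWriter {
  // untested sketch of step 3: write segments.new with a trailing
  // CRC32 so a partial write is detectable on the next open
  static void writeWithHash(File f, byte[] segmentsData) throws IOException {
    CRC32 crc = new CRC32();
    crc.update(segmentsData);
    RandomAccessFile out = new RandomAccessFile(f, "rw");
    try {
      out.write(segmentsData);
      out.writeLong(crc.getValue()); // the "segments hash"
      out.getFD().sync();            // & sync
    } finally {
      out.close();
    }
  }
}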


On Nov 12, 2007, at 7:21 PM, Yonik Seeley wrote:


On Nov 12, 2007 7:19 PM, robert engels <[EMAIL PROTECTED]> wrote:
> I would still argue that it is an incorrect setup - almost as bad as
> "not plugging the computer in".

A user themselves could even go in and look at the index files (I've
done so myself)... as could a backup program or whatever.  It's a fact
of life on windows that a move or delete can fail.

-Yonik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]