[jira] Updated: (LUCENE-1290) Deprecate Hits

2008-05-20 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-1290:
--

Attachment: lucene-1290.patch

New version of the patch:
- added TopDocCollector example to deprecated-section in the javadocs of Hits
- added more comments to new demo code
- updated scoring.html and removed references to Hits
- got rid of tabs (patch only uses whitespaces now)

Hoss, could you check whether this patch now applies cleanly and all tests 
pass?
After I committed eol-style=native on all files, changed the tabs to 
whitespaces, and recreated the patch file, patching and running the tests 
worked fine for me on Linux.

Wow, I didn't imagine before that a patch that simply deprecates one class 
would have more than 5000 lines! :)

> Deprecate Hits
> --
>
> Key: LUCENE-1290
> URL: https://issues.apache.org/jira/browse/LUCENE-1290
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Search
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.4
>
> Attachments: lucene-1290.patch, lucene-1290.patch
>
>
> The Hits class has several drawbacks as pointed out in LUCENE-954.
> The other search APIs that use TopDocCollector and TopDocs should be used 
> instead.
> This patch:
> - deprecates org/apache/lucene/search/Hits, Hit, and HitIterator, as well as
>   the Searcher.search( * ) methods which return a Hits Object.
> - removes all references to Hits from the core and uses TopDocs and ScoreDoc
>   instead
> - Changes the demo SearchFiles: adds the two modes 'paging search' and 
> 'streaming search',
>   each of which demonstrates a different way of using the search APIs. The 
> former
>   uses TopDocs and a TopDocCollector, the latter a custom HitCollector 
> implementation.
> - Updates the online tutorial that describes the demo.
> All tests pass.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1288) Add getVersion method to IndexCommit

2008-05-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598226#action_12598226
 ] 

Michael McCandless commented on LUCENE-1288:


Are you suggesting instead or in addition to getVersion?

> Add getVersion method to IndexCommit
> 
>
> Key: LUCENE-1288
> URL: https://issues.apache.org/jira/browse/LUCENE-1288
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.3.1
>Reporter: Jason Rutherglen
>Assignee: Michael McCandless
>Priority: Minor
>
> Returns the equivalent of IndexReader.getVersion for IndexCommit
> {code}
> public abstract long getVersion();
> {code}




Re: Token implementation

2008-05-20 Thread Michael McCandless


Hiroaki Kawai wrote:



"Michael McCandless" <[EMAIL PROTECTED]> wrote:

I agree the situation is not ideal, and it's confusing.

This comes back to LUCENE-969.

At the time, we decided to keep both String & char[] only to avoid
performance cost for those analyzer chains that use String tokens
exclusively.

The idea was to allow Token to keep both text or char[] and sometimes
both (if they are storing the same characters, as happens if
termBuffer() is called when it's a String being stored)

Then, in 3.0, we would make the change you are proposing (to only
store char[] internally).  That was the plan, anyway.  Accelerating
this plan (to store only char[] today) is compelling, but I worry
about the performance hit to legacy analyzer chains...


I'd like to suggest another implementation which uses
StringBuilder or CharBuffer instead of char[].


StringBuilder has to wait until we are on Java 1.5.


Because then we don't need to maintain the length separately from the
character sequence itself.
If we use char[], we have to handle the char[], the offset, and the
sequence length, so the methods we implement become complex.
I think those should be packed into one object.

I did not test whether using StringBuilder or CharBuffer hurts
performance, but I think it might not be so bad.


I'm somewhat less optimistic here.  These classes are targeting use  
cases with much larger sequences of characters than a typical Token  
in a Document.  We should test the performance impact to see.





More responses below:
DM Smith <[EMAIL PROTECTED]> wrote:

-snip-
I was looking at this in light of TokenFilter's next(Token) method and how 
it was being used. In looking at the contrib filters, they have not been 
modified. Further, most of them, where they do content analysis and 
generation, do their work in Strings. Some of these appear to be good 
candidates for using char[] rather than Strings, such as the NGram filter. 
But others look like they'd just as well remain with String manipulation.


It would be great to upgrade all contrib filters to use the re-use  
APIs.


I'll contribute, too. :-)


Fantastic!



I'd like to suggest that internally, that Token be changed to  
only use

char[] termBuffer and eliminate termText.


The question is what performance cost we are incurring eg on the
contrib (& other) sources/filters?  Every time setTermText is called,
we copy out the chars (instead of holding a reference to the String).
Every time getText() is called we create a new String(...) from the
char[].  I think it's potentially a high cost, and so maybe we should
wait until 3.0 when we drop the deprecated APIs?


And also, that termText be restored as not deprecated.


It made me nervous keeping this method because it looks like it  
should

be cheap to call, and in the past it was very cheap to call.  But,
maybe we could keep it, if we mark very very clearly in the javadocs
the performance cost you incur by using this method (it makes a new
String() every time)?


I'd like to suggest changing the method definition to:
 public void setTermText(CharSequence text)


This seems like a good idea.




But, in TokenFilter, next() should be deprecated, IMHO.


I think this is a good idea.  After all if people don't want to  
bother

using the passed in Token, they are still allowed to return a new
one.


I could not see what you meant. Can I ask you to let me know the  
reason

why it should be deprecated?


Deprecated in favor of next(Token result) API.  Ie, token sources/ 
filters should migrate to this re-use API.  It's a straightforward  
migration because the method next(Token result) is allowed to ignore  
result (and return its own Token) if it wants to.


Mike




Re: Token implementation

2008-05-20 Thread DM Smith


On May 20, 2008, at 12:50 AM, Hiroaki Kawai wrote:




"Michael McCandless" <[EMAIL PROTECTED]> wrote:




More responses below:
DM Smith <[EMAIL PROTECTED]> wrote:







But, in TokenFilter, next() should be deprecated, IMHO.


I think this is a good idea.  After all if people don't want to  
bother

using the passed in Token, they are still allowed to return a new
one.


I could not see what you meant. Can I ask you to let me know the  
reason

why it should be deprecated?


The purpose of the char[] rather than a String is to promote reuse of  
a mutable buffer. Reusing a Token minimizes the number of  
constructions. Each TokenFilter has both

Token next()
and
Token next(Token)

If next() is implemented, the TokenFilter does not have access to a
shared Token, while next(Token) is supplied a shared Token by the caller.


The proper way to use next(Token) is as follows

t = input.next(t);

next(Token)
can be implemented as
Token next(Token result) {
  Token t = new Token();
  ...
  return t;
}

This would be the trivial equivalent to next(). Just ignore the  
argument and do as before.
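The re-use contract described above can be illustrated end to end with a self-contained sketch. MiniToken and DigitStream are hypothetical stand-ins for Token and a token source, not Lucene classes; the point is only the caller pattern "t = input.next(t)" and the mutable char[] buffer that avoids per-token allocation.

```java
// Stand-in for Token: a mutable char[] buffer plus a length.
final class MiniToken {
    private char[] buffer = new char[8];
    private int length;

    // Copy new text into the existing buffer, growing it only when needed.
    void setTermBuffer(char[] src, int off, int len) {
        if (buffer.length < len) buffer = new char[len];
        System.arraycopy(src, off, buffer, 0, len);
        length = len;
    }
    String term() { return new String(buffer, 0, length); }
}

// A token source honoring the re-use contract: it writes into the
// caller-supplied token instead of allocating a fresh one per call.
final class DigitStream {
    private int cursor = 0;
    private final int max;
    DigitStream(int max) { this.max = max; }

    MiniToken next(MiniToken result) {
        if (cursor >= max) return null;        // end of stream
        char[] text = Integer.toString(cursor++).toCharArray();
        result.setTermBuffer(text, 0, text.length);
        return result;                         // same instance, re-used
    }
}

public class ReuseDemo {
    public static void main(String[] args) {
        DigitStream stream = new DigitStream(3);
        MiniToken token = new MiniToken();
        StringBuilder out = new StringBuilder();
        // Caller pattern from the thread: t = input.next(t);
        for (MiniToken t = stream.next(token); t != null; t = stream.next(t)) {
            out.append(t.term()).append(' ');
        }
        System.out.println(out.toString().trim()); // prints "0 1 2"
    }
}
```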


-- DM




Re: Token implementation

2008-05-20 Thread DM Smith


On May 20, 2008, at 5:01 AM, Michael McCandless wrote:



DM Smith wrote:


On May 19, 2008, at 4:33 PM, Michael McCandless wrote:


DM Smith <[EMAIL PROTECTED]> wrote:

Michael McCandless wrote:


I agree the situation is not ideal, and it's confusing.


My problem as a user is that I have to read the code to figure  
out how to

optimally use the class. The JavaDoc is a bit wanting.


Yeah we should fix the javadocs.


This comes back to LUCENE-969.

At the time, we decided to keep both String & char[] only to avoid
performance cost for those analyzer chains that use String tokens
exclusively.

The idea was to allow Token to keep both text or char[] and  
sometimes

both (if they are storing the same characters, as happens if
termBuffer() is called when it's a String being stored)


When termBuffer() is called termText is always null on return.  
This is the
invariant of initTermBuffer() which is called directly or  
indirectly by

every method that touches termBuffer.


Sorry I meant termText().

It is only after calling termText() that one could have both. The only
advantage I see here is that calling it twice without any intervening call
to a termBuffer method would return the same physical String.


Right.


After calling setTermText(newText), termBuffer is set to null.

I presume the purpose of a filter is to get the token content,  
and if
modified, set the new content. If so, the result of the setXXX  
will be that

either termText or termBuffer will be null.


Right, though, if there's no change, and the next filter in the  
chain

calls termText(), we save constructing a new String by caching it.


Then, in 3.0, we would make the change you are proposing (to only
store char[] internally).  That was the plan, anyway.   
Accelerating

this plan (to store only char[] today) is compelling, but I worry
about the performance hit to legacy analyzer chains...


I wonder whether it is all that significant an issue. Today, some  
of the
Lucene analyzers have been migrated to using char[], while  
others, notably

contrib, continue to use text.

IMHO: Prior to char[], the text was replaceable, but not  
modifiable. From a
practical perspective, Token reuse minimized the cost of  
construction, but
not much else. The performance of a Token was predictable, but  
the filter
was potentially sub-optimal. With char[] and supporting methods,  
the text

became modifiable, too.

When a filter calls setTermText or setTermBuffer, it does not  
know how the

consumer of the token will work. It could be that it stores it with
setTermText and the next filter calls termBuffer().

I may not understand this correctly, but it seems to me that the  
following
is plausible given a filter chain of Lucene provided filters  
(including contrib)
If we have a token filter chain of A -> B -> C, which uses next()  
in any
part of the chain, the flow of a reusable Token is stopped. A  
given filter

may cache a Token and reuse it. So consider the following scenario:
A overrides next(Token) and reuses the token via char[]
B overrides next() and has a cached Token and updates text.
C overrides next(Token) and reuses the token via char[].

First run:
After A is done, the termText in the supplied Token is null and  
termBuffer

has a value.

B provides its own token so it is not influenced by A.

C is given the token that B returns, because the caller is  
effectively using
"token = input.next(token)", but because the token has text, a  
performance
hit is taken to put it into char[]. Both text and char[] start  
out the same,

but because char[] is changed, termText is set to null.

Second run:
A starts with a Token with a char[] because it is reusing the  
token from the
last run or because it is using a localToken from the caller. If  
it is a
localToken, then the scenario is as above. But if it is the end  
result of
the first run, then A is re-using the token that is cached in B.  
Since C

last modified it, it is char[].

B uses its cached Token, but it was modified by A to be char[]  
with null

text. Now B takes a performance hit as it creates a new String.

C is as in the first run.

Another scenario:
A, B and C are all legacy. This would only be filter chains that  
are not
provided by core Lucene as the core filter chains have been  
migrated. This

would be a performance hit.


Why is this a performance hit?  If they are all legacy they would all
implement the non-reuse next() API, and would use setTermText, and
no conversion to char[] would happen (except inside DocumentsWriter)?


I was thinking faster than I was typing. I meant to say "not be a  
performance hit".


Ahh, ok.





A last scenario:
A, B and C are all char[]. This would not take a performance hit.

It seems to me that in a mixed chain, there will always be a
performance hit.


Right, though since we cache the String we should save on multiple
calls to termText().  And in a non-mixed chain (all new or all  
old) I

think there wouldn't

Re: Token implementation

2008-05-20 Thread Michael McCandless


DM Smith wrote:


On May 19, 2008, at 4:33 PM, Michael McCandless wrote:


DM Smith <[EMAIL PROTECTED]> wrote:

Michael McCandless wrote:


I agree the situation is not ideal, and it's confusing.


My problem as a user is that I have to read the code to figure  
out how to

optimally use the class. The JavaDoc is a bit wanting.


Yeah we should fix the javadocs.


This comes back to LUCENE-969.

At the time, we decided to keep both String & char[] only to avoid
performance cost for those analyzer chains that use String tokens
exclusively.

The idea was to allow Token to keep both text or char[] and  
sometimes

both (if they are storing the same characters, as happens if
termBuffer() is called when it's a String being stored)


When termBuffer() is called termText is always null on return.  
This is the
invariant of initTermBuffer() which is called directly or  
indirectly by

every method that touches termBuffer.


Sorry I meant termText().

It is only after calling termText() that one could have both. The only
advantage I see here is that calling it twice without any intervening call
to a termBuffer method would return the same physical String.


Right.


After calling setTermText(newText), termBuffer is set to null.

I presume the purpose of a filter is to get the token content,  
and if
modified, set the new content. If so, the result of the setXXX  
will be that

either termText or termBuffer will be null.


Right, though, if there's no change, and the next filter in the chain
calls termText(), we save constructing a new String by caching it.


Then, in 3.0, we would make the change you are proposing (to only
store char[] internally).  That was the plan, anyway.  Accelerating
this plan (to store only char[] today) is compelling, but I worry
about the performance hit to legacy analyzer chains...


I wonder whether it is all that significant an issue. Today, some  
of the
Lucene analyzers have been migrated to using char[], while  
others, notably

contrib, continue to use text.

IMHO: Prior to char[], the text was replaceable, but not  
modifiable. From a
practical perspective, Token reuse minimized the cost of  
construction, but
not much else. The performance of a Token was predictable, but  
the filter
was potentially sub-optimal. With char[] and supporting methods,  
the text

became modifiable, too.

When a filter calls setTermText or setTermBuffer, it does not  
know how the

consumer of the token will work. It could be that it stores it with
setTermText and the next filter calls termBuffer().

I may not understand this correctly, but it seems to me that the  
following
is plausible given a filter chain of Lucene provided filters  
(including contrib)
If we have a token filter chain of A -> B -> C, which uses next()  
in any
part of the chain, the flow of a reusable Token is stopped. A  
given filter

may cache a Token and reuse it. So consider the following scenario:
A overrides next(Token) and reuses the token via char[]
B overrides next() and has a cached Token and updates text.
C overrides next(Token) and reuses the token via char[].

First run:
After A is done, the termText in the supplied Token is null and  
termBuffer

has a value.

B provides its own token so it is not influenced by A.

C is given the token that B returns, because the caller is  
effectively using
"token = input.next(token)", but because the token has text, a  
performance
hit is taken to put it into char[]. Both text and char[] start  
out the same,

but because char[] is changed, termText is set to null.

Second run:
A starts with a Token with a char[] because it is reusing the  
token from the
last run or because it is using a localToken from the caller. If  
it is a
localToken, then the scenario is as above. But if it is the end  
result of
the first run, then A is re-using the token that is cached in B.  
Since C

last modified it, it is char[].

B uses its cached Token, but it was modified by A to be char[]  
with null

text. Now B takes a performance hit as it creates a new String.

C is as in the first run.

Another scenario:
A, B and C are all legacy. This would only be filter chains that  
are not
provided by core Lucene as the core filter chains have been  
migrated. This

would be a performance hit.


Why is this a performance hit?  If they are all legacy they would all
implement the non-reuse next() API, and would use setTermText, and
no conversion to char[] would happen (except inside DocumentsWriter)?


I was thinking faster than I was typing. I meant to say "not be a  
performance hit".


Ahh, ok.





A last scenario:
A, B and C are all char[]. This would not take a performance hit.

It seems to me that in a mixed chain, there will always be a
performance hit.


Right, though since we cache the String we should save on multiple
calls to termText().  And in a non-mixed chain (all new or all old) I
think there wouldn't be a hit.


But my guess is that maintaining termBuffer once used would

[jira] Created: (LUCENE-1291) Allow leading wildcard in table searcher

2008-05-20 Thread Peter Backlund (JIRA)
Allow leading wildcard in table searcher


 Key: LUCENE-1291
 URL: https://issues.apache.org/jira/browse/LUCENE-1291
 Project: Lucene - Java
  Issue Type: Wish
Affects Versions: 2.3.1
Reporter: Peter Backlund
Priority: Minor


It would be nice to have a boolean property on TableSearcher for allowing a 
leading wildcard in the query, which could be off by default.

MultiFieldQueryParser parser = new MultiFieldQueryParser(fields, 
analyzer);
parser.setAllowLeadingWildcard(this.allowLeadingWildcard);
Query query = parser.parse(searchString);

+ setter and field for "allowLeadingWildcard"

Snippet is from 
http://www.koders.com/java/fid94A4BBC5CC6609930A88583480AA66B32EBB08E3.aspx?s=TableSearcher#L53,
 lines 245-246.



Another approach would be to have a protected factory-method for creating the 
parser, which can be overridden:

protected QueryParser createParser(String[] fields, Analyzer analyzer) {
  return new MultiFieldQueryParser(fields, analyzer);
} 

and 

  Query query = createParser(fields, analyzer).parse(searchString); 






[jira] Commented: (LUCENE-1290) Deprecate Hits

2008-05-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598252#action_12598252
 ] 

Christian Kohlschütter commented on LUCENE-1290:


-1 from me for the current solution.

Deprecating Hits necessarily means deprecating HitIterator. With 
Hits/HitIterator we have two really simple ways to iterate over a long list of 
search results. The TopDocs/HitCollector-based approach is basically one level 
below Hits, and thus Hits can clearly be regarded as a convenience class. It 
is not as flexible as HitCollector, but serves its purpose very well. 

What could make sense is to deprecate the Searcher#search() methods which 
return a Hits instance, to reduce API clutter. Hits could have a public 
constructor that takes a Searcher instance instead.

> Deprecate Hits
> --
>
> Key: LUCENE-1290
> URL: https://issues.apache.org/jira/browse/LUCENE-1290
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Search
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.4
>
> Attachments: lucene-1290.patch, lucene-1290.patch
>
>
> The Hits class has several drawbacks as pointed out in LUCENE-954.
> The other search APIs that use TopDocCollector and TopDocs should be used 
> instead.
> This patch:
> - deprecates org/apache/lucene/search/Hits, Hit, and HitIterator, as well as
>   the Searcher.search( * ) methods which return a Hits Object.
> - removes all references to Hits from the core and uses TopDocs and ScoreDoc
>   instead
> - Changes the demo SearchFiles: adds the two modes 'paging search' and 
> 'streaming search',
>   each of which demonstrates a different way of using the search APIs. The 
> former
>   uses TopDocs and a TopDocCollector, the latter a custom HitCollector 
> implementation.
> - Updates the online tutorial that describes the demo.
> All tests pass.




[jira] Commented: (LUCENE-1288) Add getVersion method to IndexCommit

2008-05-20 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598263#action_12598263
 ] 

Jason Rutherglen commented on LUCENE-1288:
--

getGeneration in addition.  

Will IndexCommit.getVersion return the same value as the IndexReader that 
created it?  I'm using this in conjunction with IndexReader to close an object 
associated with the IndexReader upon deletion of the snapshot.  

> Add getVersion method to IndexCommit
> 
>
> Key: LUCENE-1288
> URL: https://issues.apache.org/jira/browse/LUCENE-1288
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.3.1
>Reporter: Jason Rutherglen
>Assignee: Michael McCandless
>Priority: Minor
>
> Returns the equivalent of IndexReader.getVersion for IndexCommit
> {code}
> public abstract long getVersion();
> {code}




[jira] Commented: (LUCENE-1288) Add getVersion method to IndexCommit

2008-05-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598265#action_12598265
 ] 

Michael McCandless commented on LUCENE-1288:


OK I'll do both.  Yes, getVersion() will be the same as 
IndexReader.getVersion() if that reader was opened on the same commit point.

> Add getVersion method to IndexCommit
> 
>
> Key: LUCENE-1288
> URL: https://issues.apache.org/jira/browse/LUCENE-1288
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.3.1
>Reporter: Jason Rutherglen
>Assignee: Michael McCandless
>Priority: Minor
>
> Returns the equivalent of IndexReader.getVersion for IndexCommit
> {code}
> public abstract long getVersion();
> {code}




RE: Open Source Relevance

2008-05-20 Thread Steven A Rowe
On 05/19/2008 at 3:58 PM, Grant Ingersoll wrote:
> I think it is time the open source search community (and
> I don’t mean just Lucene) develop and publish a set of
> TREC-style relevance judgments for freely available data
> that is easily obtained from the Internet.

Stephen Green, Minion developer at Sun, whose posts comparing Minion and Lucene 
were recently mentioned on the solr-user mailing list[1], has similar ideas.  
From :

   I think it would be a good idea for all of the open
   source engines to get together, find a nice open document
   collection (the Apache mailing list archives and their
   associated searches?) and build a nice set of regression
   tests and some pooled relevance sets so that we can test
   retrieval performance without having to rely on the TREC
   data.

Steve

[1] Solr += Minion? on solr-user: 





[jira] Commented: (LUCENE-1290) Deprecate Hits

2008-05-20 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598396#action_12598396
 ] 

Michael Busch commented on LUCENE-1290:
---

{quote}
With Hits/HitIterator we have two really simple ways to iterate over a long 
list of search results.
{quote}

I think this is exactly the problem with Hits. If you use a HitIterator to 
iterate over, let's say, 2000 results, then Hits will run the same query 5 
times under the covers with 100, 200, 400, 800, 1600 as values for the heap 
size used in TopDocCollector.

IMO Hits only makes sense if you want to use it for paging or, as Doug pointed 
out, for 
prefetching of hits in a scrollable pane. But then it's just as easy to 
implement this using
TopDocCollector/TopDocs as shown in the SearchFiles demo (in this patch). The 
latter approach
is also much more flexible, as it allows you to control the parameters.
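A toy, self-contained sketch of that paging idea, with a stand-in topDocs(n) in place of a real TopDocCollector (none of the names below are Lucene classes): the caller asks for enough top hits to cover the requested page and slices, controlling the collected size directly instead of relying on Hits' internal doubling.

```java
public class PagingDemo {
    // Pretend index: doc ids 0..24, already in score order. A real
    // implementation would run the query through a TopDocCollector-like
    // collector sized to n.
    static int[] topDocs(int n) {
        int[] hits = new int[Math.min(n, 25)];
        for (int i = 0; i < hits.length; i++) hits[i] = i;
        return hits;
    }

    // Return one page of results; the caller controls the heap size
    // directly: collect (pageNum + 1) * pageSize top docs, then slice.
    static int[] page(int pageNum, int pageSize) {
        int[] hits = topDocs((pageNum + 1) * pageSize);
        int start = pageNum * pageSize;
        int end = Math.min(hits.length, start + pageSize);
        int[] page = new int[Math.max(0, end - start)];
        for (int i = start; i < end; i++) page[i - start] = hits[i];
        return page;
    }

    public static void main(String[] args) {
        int[] p2 = page(2, 10); // third page of 10 hits
        System.out.println(p2[0] + ".." + p2[p2.length - 1]); // prints "20..24"
    }
}
```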

{quote}
The TopDocs/HitCollector-based approach is basically one level below Hits, and 
thus, Hits can
clearly be regarded a convenience class then.
{quote}

What are in your opinion the advantages of using an Iterator interface instead 
of looping over 
a ScoreDoc[] array?


> Deprecate Hits
> --
>
> Key: LUCENE-1290
> URL: https://issues.apache.org/jira/browse/LUCENE-1290
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Search
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.4
>
> Attachments: lucene-1290.patch, lucene-1290.patch
>
>
> The Hits class has several drawbacks as pointed out in LUCENE-954.
> The other search APIs that use TopDocCollector and TopDocs should be used 
> instead.
> This patch:
> - deprecates org/apache/lucene/search/Hits, Hit, and HitIterator, as well as
>   the Searcher.search( * ) methods which return a Hits Object.
> - removes all references to Hits from the core and uses TopDocs and ScoreDoc
>   instead
> - Changes the demo SearchFiles: adds the two modes 'paging search' and 
> 'streaming search',
>   each of which demonstrates a different way of using the search APIs. The 
> former
>   uses TopDocs and a TopDocCollector, the latter a custom HitCollector 
> implementation.
> - Updates the online tutorial that describes the demo.
> All tests pass.




[jira] Commented: (LUCENE-1285) WeightedSpanTermExtractor incorrectly treats the same terms occurring in different query types

2008-05-20 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598419#action_12598419
 ] 

Otis Gospodnetic commented on LUCENE-1285:
--

Mark, are you done with this/would you like to commit this?  Or should I?  
(Asking because of SOLR-553)

> WeightedSpanTermExtractor incorrectly treats the same terms occurring in 
> different query types
> --
>
> Key: LUCENE-1285
> URL: https://issues.apache.org/jira/browse/LUCENE-1285
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/highlighter
>Affects Versions: 2.4
>Reporter: Andrzej Bialecki 
> Fix For: 2.4
>
> Attachments: highlighter-test.patch, highlighter.patch
>
>
> Given a BooleanQuery with multiple clauses, if a term occurs both in a Span / 
> Phrase query, and in a TermQuery, the results of term extraction are 
> unpredictable and depend on the order of clauses. Consequently, the results of 
> highlighting are incorrect.
> Example text: t1 t2 t3 t4 t2
> Example query: t2 t3 "t1 t2"
> Current highlighting: [t1 t2] [t3] t4 t2
> Correct highlighting: [t1 t2] [t3] t4 [t2]
> The problem comes from the fact that we keep a Map<String, WeightedSpanTerm>, and if the same term occurs in a Phrase or Span query the 
> resulting WeightedSpanTerm will have a positionSensitive=true, whereas terms 
> added from TermQuery have positionSensitive=false. The end result for this 
> particular term will depend on the order in which the clauses are processed.
> My fix is to use a subclass of Map, which on put() always sets the result to 
> the most lax setting, i.e. if we already have a term with 
> positionSensitive=true, and we try to put() a term with 
> positionSensitive=false, we set the result positionSensitive=false, as it 
> will match both cases.
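The proposed Map subclass might look roughly like this self-contained sketch; Term and LaxTermMap are hypothetical stand-ins for WeightedSpanTerm and the actual fix, assuming "most lax" means position-insensitive wins whenever two occurrences of a term disagree.

```java
import java.util.HashMap;

// Stand-in for WeightedSpanTerm: only the field relevant to the fix.
final class Term {
    final String text;
    boolean positionSensitive;
    Term(String text, boolean positionSensitive) {
        this.text = text;
        this.positionSensitive = positionSensitive;
    }
}

// HashMap subclass whose put() keeps the laxest setting: once any
// occurrence of a term is position-insensitive, the stored entry is too,
// regardless of clause order.
final class LaxTermMap extends HashMap<String, Term> {
    @Override
    public Term put(String key, Term value) {
        Term existing = get(key);
        if (existing != null) {
            value.positionSensitive =
                value.positionSensitive && existing.positionSensitive;
        }
        return super.put(key, value);
    }
}

public class LaxDemo {
    public static void main(String[] args) {
        LaxTermMap map = new LaxTermMap();
        map.put("t2", new Term("t2", true));  // same term from a phrase query
        map.put("t2", new Term("t2", false)); // from a plain TermQuery
        System.out.println(map.get("t2").positionSensitive); // prints "false"
    }
}
```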




[jira] Updated: (LUCENE-112) [PATCH] Add an IndexReader implementation that frees resources when idle and refreshes itself when stale

2008-05-20 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated LUCENE-112:


Assignee: (was: Eric Isakson)

> [PATCH] Add an IndexReader implementation that frees resources when idle and 
> refreshes itself when stale
> 
>
> Key: LUCENE-112
> URL: https://issues.apache.org/jira/browse/LUCENE-112
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: CVS Nightly - Specify date in submission
> Environment: Operating System: All
> Platform: All
>Reporter: Eric Isakson
>Priority: Minor
> Attachments: IdleTimeoutRefreshingIndexReader.html, 
> IdleTimeoutRefreshingIndexReader.java
>
>
> Here is a little something I worked on this weekend that I wanted to 
> contribute 
> back as I think others might find it very useful.
> I extended IndexReader and added support for configuring an idle timeout and 
> refresh interval.
> It uses a monitoring thread to watch for the reader going idle. When the 
> reader 
> goes idle it is closed. When the index is read again it is re-opened.
> It uses another thread to periodically check when the reader needs to be 
> refreshed due to a change to index. When the reader is stale, it closes the 
> reader and reopens the index.
> It is actually delegating all the work to another IndexReader implementation 
> and 
> just handling the threading and synchronization. When it closes a reader, it 
> delegates the close to another thread that waits a bit (configurable how 
> long) 
> before actually closing the reader it was delegating to. This gives any 
> consumers of the original reader a chance to finish up their last action on 
> the 
> reader.
> This implementation sacrifices a little bit of speed since there is a bit 
> more 
> synchroniztion to deal with and the delegation model puts extra calls on the 
> stack, but it should provide long running applications that have idle periods 
> or frequently changing indices from having to open and close readers all the 
> time or hold open unused resources.
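
The delegation-plus-idle-timeout design described above can be sketched independently of Lucene. Everything below is a hypothetical stand-in (the Reopenable interface plays the role of the wrapped IndexReader), and for testability the idle check runs lazily on each access instead of in the patch's monitor thread:

```java
import java.util.function.Supplier;

public class IdleClosingReaderDemo {

    // Hypothetical stand-in for the delegated-to IndexReader.
    interface Reopenable {
        String read();
        void close();
    }

    static class IdleClosingReader {
        private final Supplier<Reopenable> factory;
        private final long idleTimeoutMillis;
        private Reopenable delegate;
        private long lastAccess;

        IdleClosingReader(Supplier<Reopenable> factory, long idleTimeoutMillis) {
            this.factory = factory;
            this.idleTimeoutMillis = idleTimeoutMillis;
        }

        // The patch uses a monitoring thread for the idle check; this sketch
        // checks lazily on each access to stay single-threaded and testable.
        synchronized String read() {
            closeIfIdle();
            if (delegate == null) {
                delegate = factory.get();    // re-open on demand
            }
            lastAccess = System.currentTimeMillis();
            return delegate.read();
        }

        synchronized void closeIfIdle() {
            if (delegate != null
                    && System.currentTimeMillis() - lastAccess >= idleTimeoutMillis) {
                delegate.close();            // idle too long: free the resources
                delegate = null;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] opens = {0};
        IdleClosingReader reader = new IdleClosingReader(() -> {
            opens[0]++;                      // count how often we (re)open
            return new Reopenable() {
                public String read() { return "doc"; }
                public void close() { }
            };
        }, 100);

        reader.read();
        reader.read();                       // still fresh: no re-open
        if (opens[0] != 1) throw new AssertionError("expected a single open");

        Thread.sleep(250);                   // exceed the idle timeout
        reader.read();                       // reader was closed and re-opened
        if (opens[0] != 2) throw new AssertionError("expected a re-open");
        System.out.println("opens=" + opens[0]);
    }
}
```

The real patch additionally delays the delegated close so in-flight consumers can finish; that grace period is omitted here for brevity.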

-- 
This message is automatically generated by JIRA.



[jira] Updated: (LUCENE-1291) Allow leading wildcard in table searcher

2008-05-20 Thread Hoss Man (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated LUCENE-1291:
-

Component/s: contrib/*

For anyone else who may be confused: this seems to relate to the contrib/swing
package.

> Allow leading wildcard in table searcher
> 
>
> Key: LUCENE-1291
> URL: https://issues.apache.org/jira/browse/LUCENE-1291
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: contrib/*
>Affects Versions: 2.3.1
>Reporter: Peter Backlund
>Priority: Minor
>
> It would be nice to have a boolean property on TableSearcher for allowing a
> leading wildcard in the query, which could be off by default:
> MultiFieldQueryParser parser = new MultiFieldQueryParser(fields, analyzer);
> parser.setAllowLeadingWildcard(this.allowLeadingWildcard);
> Query query = parser.parse(searchString);
> plus a setter and field for "allowLeadingWildcard".
> Snippet is from
> http://www.koders.com/java/fid94A4BBC5CC6609930A88583480AA66B32EBB08E3.aspx?s=TableSearcher#L53,
> lines 245-246.
> Another approach would be to have a protected factory method for creating
> the parser, which can be overridden:
> protected QueryParser createParser(String[] fields, Analyzer analyzer) {
>   return new MultiFieldQueryParser(fields, analyzer);
> }
> and
> Query query = createParser(fields, analyzer).parse(searchString);




[jira] Commented: (LUCENE-1282) Sun hotspot compiler bug in 1.6.0_04/05 affects Lucene

2008-05-20 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598457#action_12598457
 ] 

Ismael Juma commented on LUCENE-1282:
-

It's worth noting that JDK 6u10 beta b24 (released today) and OpenJDK 6 in
Fedora 9 are also affected by the problem shown in the test case.

> Sun hotspot compiler bug in 1.6.0_04/05 affects Lucene
> --
>
> Key: LUCENE-1282
> URL: https://issues.apache.org/jira/browse/LUCENE-1282
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3, 2.3.1
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: corrupt_merge_out15.txt
>
>
> This is not a Lucene bug.  It's an as-yet not fully characterized Sun
> JRE bug, as best I can tell.  I'm opening this to gather all things we
> know, and to work around it in Lucene if possible, and maybe open an
> issue with Sun if we can reduce it to a compact test case.
> It's hit at least 3 users:
>   
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200803.mbox/[EMAIL 
> PROTECTED]
>   
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200804.mbox/[EMAIL 
> PROTECTED]
>   
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200805.mbox/[EMAIL 
> PROTECTED]
> It's specific to at least JRE 1.6.0_04 and 1.6.0_05; 1.6.0_03 works OK,
> and it's unknown whether 1.6.0_06 shows it.
> The bug affects bulk merging of stored fields.  When it strikes, the
> segment produced by a merge is corrupt because its fdx file (stored
> fields index file) is missing one document.  After iterating many
> times with the first user that hit this, adding diagnostics &
> assertions, it seems that a call to fieldsWriter.addDocument somehow
> either fails to run entirely, or fails to invoke its call to
> indexStream.writeLong.  It's as if when hotspot compiles a method,
> there's some sort of race condition in cutting over to the compiled
> code whereby a single method call fails to be invoked (speculation).
> Unfortunately, this corruption is silent when it occurs and only later
> detected when a merge tries to merge the bad segment, or an
> IndexReader tries to open it.  Here's a typical merge exception:
> {code}
> Exception in thread "Thread-10" 
> org.apache.lucene.index.MergePolicy$MergeException: 
> org.apache.lucene.index.CorruptIndexException:
> doc counts differ for segment _3gh: fieldsReader shows 15999 but 
> segmentInfo shows 16000
> at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:271)
> Caused by: org.apache.lucene.index.CorruptIndexException: doc counts differ 
> for segment _3gh: fieldsReader shows 15999 but segmentInfo shows 16000
> at 
> org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:313)
> at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262)
> at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:221)
> at 
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3099)
> at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2834)
> at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240)
> {code}
> and here's a typical exception hit when opening a searcher:
> {code}
> org.apache.lucene.index.CorruptIndexException: doc counts differ for segment 
> _kk: fieldsReader shows 72670 but segmentInfo shows 72671
> at 
> org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:313)
> at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262)
> at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:230)
> at 
> org.apache.lucene.index.DirectoryIndexReader$1.doBody(DirectoryIndexReader.java:73)
> at 
> org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:636)
> at 
> org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63)
> at org.apache.lucene.index.IndexReader.open(IndexReader.java:209)
> at org.apache.lucene.index.IndexReader.open(IndexReader.java:173)
> at 
> org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:48)
> {code}
> Sometimes, adding -Xbatch (forces up front compilation) or -Xint
> (disables compilation) to the java command line works around the
> issue.
> Here are some of the OS's we've seen the failure on:
> {code}
> SuSE 10.0
> Linux phoebe 2.6.13-15-smp #1 SMP Tue Sep 13 14:56:15 UTC 2005 x86_64 
> x86_64 x86_64 GNU/Linux 
> SuSE 8.2
> Linux phobos 2.4.20-64GB-SMP #1 SMP Mon Mar 17 17:56:03 UTC 2003 i686 
> unknown unknown GNU/Linux 
> Red Hat Enterpri

Re: [jira] Commented: (LUCENE-1285) WeightedSpanTermExtractor incorrectly treats the same terms occurring in different query types

2008-05-20 Thread Mark Miller

Otis Gospodnetic (JIRA) wrote:
[ https://issues.apache.org/jira/browse/LUCENE-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598419#action_12598419 ] 


Otis Gospodnetic commented on LUCENE-1285:
--

Mark, are you done with this/would you like to commit this?  Or should I?  
(Asking because of SOLR-553)

  

WeightedSpanTermExtractor incorrectly treats the same terms occurring in 
different query types
--

Key: LUCENE-1285
URL: https://issues.apache.org/jira/browse/LUCENE-1285
Project: Lucene - Java
 Issue Type: Bug
 Components: contrib/highlighter
   Affects Versions: 2.4
   Reporter: Andrzej Bialecki 
Fix For: 2.4


Attachments: highlighter-test.patch, highlighter.patch


Given a BooleanQuery with multiple clauses, if a term occurs both in a Span /
Phrase query and in a TermQuery, the results of term extraction are
unpredictable and depend on the order of clauses. Consequently, the results
of highlighting are incorrect.
Example text: t1 t2 t3 t4 t2
Example query: t2 t3 "t1 t2"
Current highlighting: [t1 t2] [t3] t4 t2
Correct highlighting: [t1 t2] [t3] t4 [t2]
The problem comes from the fact that we keep a Map<String, WeightedSpanTerm>,
and if the same term occurs in a Phrase or Span query the resulting
WeightedSpanTerm will have positionSensitive=true, whereas terms added from a
TermQuery have positionSensitive=false. The end result for this particular
term will depend on the order in which the clauses are processed.
My fix is to use a subclass of Map which on put() always sets the result to
the most lax setting, i.e. if we already have a term with
positionSensitive=true, and we try to put() a term with
positionSensitive=false, we set the result to positionSensitive=false, as it
will match both cases.
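
The map-subclass fix described above can be sketched without the highlighter classes. `Term` below is a simplified, hypothetical stand-in for WeightedSpanTerm, keeping only the positionSensitive flag that matters here:

```java
import java.util.HashMap;

public class LaxPutMapDemo {

    // Simplified stand-in for the highlighter's WeightedSpanTerm.
    static class Term {
        final String text;
        boolean positionSensitive;
        Term(String text, boolean ps) { this.text = text; this.positionSensitive = ps; }
    }

    // On put(), if either the existing or the incoming entry is position-
    // insensitive, the stored entry becomes position-insensitive too -- the
    // "most lax" setting, since it matches both cases.
    static class LaxTermMap extends HashMap<String, Term> {
        @Override
        public Term put(String key, Term value) {
            Term prev = get(key);
            if (prev != null && (!prev.positionSensitive || !value.positionSensitive)) {
                value.positionSensitive = false;
            }
            return super.put(key, value);
        }
    }

    public static void main(String[] args) {
        LaxTermMap map = new LaxTermMap();
        // t2 from the TermQuery clause first, then from the phrase "t1 t2":
        map.put("t2", new Term("t2", false));
        map.put("t2", new Term("t2", true));
        if (map.get("t2").positionSensitive) throw new AssertionError();
        // ...and with the opposite clause order the result is the same:
        map.clear();
        map.put("t2", new Term("t2", true));
        map.put("t2", new Term("t2", false));
        if (map.get("t2").positionSensitive) throw new AssertionError();
        System.out.println("t2 positionSensitive=" + map.get("t2").positionSensitive);
    }
}
```

Either way the clauses are processed, t2 ends up position-insensitive, so the standalone occurrence of t2 is highlighted as well.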



  
I don't have any access yet, so commit away Otis. All old tests pass and 
the patch looks good.





[jira] Commented: (LUCENE-1290) Deprecate Hits

2008-05-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598471#action_12598471
 ] 

Christian Kohlschütter commented on LUCENE-1290:


Michael,

the current implementation of Hits certainly has its deficiencies, but it
represents a very simple way to retrieve documents from Lucene. As long as
there is no real replacement, I simply do not see a reason to deprecate it.

A replacement could be an API which allows something like:

for (Iterator it = searcher.iterator(query); it.hasNext(); ) {
  (...)
  if (...) break;
}




> Deprecate Hits
> --
>
> Key: LUCENE-1290
> URL: https://issues.apache.org/jira/browse/LUCENE-1290
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Search
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.4
>
> Attachments: lucene-1290.patch, lucene-1290.patch
>
>
> The Hits class has several drawbacks as pointed out in LUCENE-954.
> The other search APIs that use TopDocCollector and TopDocs should be used 
> instead.
> This patch:
> - deprecates org/apache/lucene/search/Hits, Hit, and HitIterator, as well as
>   the Searcher.search( * ) methods which return a Hits Object.
> - removes all references to Hits from the core and uses TopDocs and ScoreDoc
>   instead
> - Changes the demo SearchFiles: adds the two modes 'paging search' and
>   'streaming search', each of which demonstrates a different way of using
>   the search APIs. The former uses TopDocs and a TopDocCollector, the
>   latter a custom HitCollector implementation.
> - Updates the online tutorial that describes the demo.
> All tests pass.




[jira] Commented: (LUCENE-1290) Deprecate Hits

2008-05-20 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598518#action_12598518
 ] 

Michael Busch commented on LUCENE-1290:
---

{quote}
A replacement could be an API which allows something like:

for (Iterator it = searcher.iterator(query); it.hasNext(); ) {
  (...)
  if (...) break;
}
{quote}

That would duplicate the search methods that use a HitCollector.
I still don't understand why an iterator approach is better/easier
than Lucene's callback (HitCollector) approach.


> Deprecate Hits
> --
>
> Key: LUCENE-1290
> URL: https://issues.apache.org/jira/browse/LUCENE-1290
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Search
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.4
>
> Attachments: lucene-1290.patch, lucene-1290.patch
>
>
> The Hits class has several drawbacks as pointed out in LUCENE-954.
> The other search APIs that use TopDocCollector and TopDocs should be used 
> instead.
> This patch:
> - deprecates org/apache/lucene/search/Hits, Hit, and HitIterator, as well as
>   the Searcher.search( * ) methods which return a Hits Object.
> - removes all references to Hits from the core and uses TopDocs and ScoreDoc
>   instead
> - Changes the demo SearchFiles: adds the two modes 'paging search' and
>   'streaming search', each of which demonstrates a different way of using
>   the search APIs. The former uses TopDocs and a TopDocCollector, the
>   latter a custom HitCollector implementation.
> - Updates the online tutorial that describes the demo.
> All tests pass.




Re: [jira] Commented: (LUCENE-1290) Deprecate Hits

2008-05-20 Thread Mark Miller

Michael Busch (JIRA) wrote:
[ https://issues.apache.org/jira/browse/LUCENE-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598518#action_12598518 ] 


Michael Busch commented on LUCENE-1290:
---

{quote}
A replacement could be an API which allows something like:

for (Iterator it = searcher.iterator(query); it.hasNext(); ) {
  (...)
  if (...) break;
}
{quote}

That would duplicate the search methods that use a HitCollector.
I still don't understand why an iterator approach is better/easier
than Lucene's callback (HitCollector) approach.

  
I think it's a lot harder to misuse things when using what I think used
to be labeled as the *expert* API (HitCollector). Hits attempts to make
things easier for the newcomer, but it's so easy to misuse the class
that I think newcomers often don't have the knowledge to use it well.
It does not make a great default.


> FWIW, the Hits API was originally designed to support desktop
> applications, with a scrollable pane of hits. I wonder if anyone ever
> actually used it that way, and, if so, whether it worked well...


And that's the limited thing that Hits is good for... a single-user
experience. Lucene is so heavily used in multi-threaded, multi-user
environments that Hits' caching and pre-fetching are often pretty
worthless at the Hits level. It's not a good class for a new user who
doesn't understand its limitations, and it's not a good class for the
general search case.


I don't know if I necessarily agree the whole class has to go (that will
annoy plenty who use it, and we will probably force a lot of individuals to
maintain it themselves), but I think it sure should lose its emphasis as
the go-to search class for new Lucene users.




Re: Open Source Relevance

2008-05-20 Thread Grant Ingersoll

Cool, hadn't seen that.

-Grant

On May 20, 2008, at 1:01 PM, Steven A Rowe wrote:


On 05/19/2008 at 3:58 PM, Grant Ingersoll wrote:

I think it is time the open source search community (and
I don’t mean just Lucene) develop and publish a set of
TREC-style relevance judgments for freely available data
that is easily obtained from the Internet.


Stephen Green, Minion developer at Sun, whose posts comparing Minion  
and Lucene were recently mentioned on the solr-user mailing list[1],  
has similar ideas.  From :


  I think it would be a good idea for all of the open
  source engines to get together, find a nice open document
  collection (the Apache mailing list archives and their
  associated searches?) and build a nice set of regression
  tests and some pooled relevance sets so that we can test
  retrieval performance without having to rely on the TREC
  data.

Steve

[1] Solr += Minion? on solr-user: 





[jira] Assigned: (LUCENE-1285) WeightedSpanTermExtractor incorrectly treats the same terms occurring in different query types

2008-05-20 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic reassigned LUCENE-1285:


Assignee: Otis Gospodnetic

> WeightedSpanTermExtractor incorrectly treats the same terms occurring in 
> different query types
> --
>
> Key: LUCENE-1285
> URL: https://issues.apache.org/jira/browse/LUCENE-1285
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/highlighter
>Affects Versions: 2.4
>Reporter: Andrzej Bialecki 
>Assignee: Otis Gospodnetic
> Fix For: 2.4
>
> Attachments: highlighter-test.patch, highlighter.patch
>
>
> Given a BooleanQuery with multiple clauses, if a term occurs both in a Span /
> Phrase query and in a TermQuery, the results of term extraction are 
> unpredictable and depend on the order of clauses. Consequently, the results
> of highlighting are incorrect.
> Example text: t1 t2 t3 t4 t2
> Example query: t2 t3 "t1 t2"
> Current highlighting: [t1 t2] [t3] t4 t2
> Correct highlighting: [t1 t2] [t3] t4 [t2]
> The problem comes from the fact that we keep a Map<String,
> WeightedSpanTerm>, and if the same term occurs in a Phrase or Span query the
> resulting WeightedSpanTerm will have positionSensitive=true, whereas terms
> added from a TermQuery have positionSensitive=false. The end result for this
> particular term will depend on the order in which the clauses are processed.
> My fix is to use a subclass of Map which on put() always sets the result to
> the most lax setting, i.e. if we already have a term with
> positionSensitive=true, and we try to put() a term with
> positionSensitive=false, we set the result to positionSensitive=false, as it
> will match both cases.




[jira] Commented: (LUCENE-1290) Deprecate Hits

2008-05-20 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598557#action_12598557
 ] 

Uwe Schindler commented on LUCENE-1290:
---

The HitCollector and Iterator approaches only support *forward* traversal of 
results. On a typical Google-like website, where the user can jump to page 
X, display some results, then jump back to page Y and display results from 
there too, Hits works really well. The problem with Hits is that it returns 
and caches whole Documents. If it just returned ScoreDocs and implemented the 
Java Collections "List" interface (using AbstractList), it would be a good 
replacement for "navigable" result sets.
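
The AbstractList idea above can be sketched without Lucene. `ScoreDoc` below is a simplified stand-in for Lucene's class, and the `fetch` function is a hypothetical hook where a real implementation would consult the searcher:

```java
import java.util.AbstractList;
import java.util.List;
import java.util.function.IntFunction;

public class ScoreDocListDemo {

    // Simplified stand-in for Lucene's ScoreDoc: a document id plus a score.
    static class ScoreDoc {
        final int doc;
        final float score;
        ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Random-access List view over a result set via AbstractList: get(i)
    // produces the i-th ScoreDoc, so callers can jump to any page with
    // subList() instead of iterating forward only.
    static class ScoreDocList extends AbstractList<ScoreDoc> {
        private final int totalHits;
        private final IntFunction<ScoreDoc> fetch;

        ScoreDocList(int totalHits, IntFunction<ScoreDoc> fetch) {
            this.totalHits = totalHits;
            this.fetch = fetch;
        }
        public int size() { return totalHits; }
        public ScoreDoc get(int i) { return fetch.apply(i); }
    }

    public static void main(String[] args) {
        // Fake result set: 100 hits with descending scores. A real
        // implementation would fetch ScoreDocs from the searcher on demand.
        ScoreDocList hits = new ScoreDocList(100, i -> new ScoreDoc(i, 1.0f - i / 100f));

        // "Page 3" at 10 hits per page: an arbitrary jump, no re-iteration.
        List<ScoreDoc> page3 = hits.subList(20, 30);
        if (page3.size() != 10 || page3.get(0).doc != 20) throw new AssertionError();
        System.out.println("first doc on page 3 = " + page3.get(0).doc);
    }
}
```

Because only ScoreDocs (ids and scores) are handed out, nothing forces whole Documents to be loaded and cached the way Hits does; callers fetch a stored Document only for the rows they actually display.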

> Deprecate Hits
> --
>
> Key: LUCENE-1290
> URL: https://issues.apache.org/jira/browse/LUCENE-1290
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Search
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.4
>
> Attachments: lucene-1290.patch, lucene-1290.patch
>
>
> The Hits class has several drawbacks as pointed out in LUCENE-954.
> The other search APIs that use TopDocCollector and TopDocs should be used 
> instead.
> This patch:
> - deprecates org/apache/lucene/search/Hits, Hit, and HitIterator, as well as
>   the Searcher.search( * ) methods which return a Hits Object.
> - removes all references to Hits from the core and uses TopDocs and ScoreDoc
>   instead
> - Changes the demo SearchFiles: adds the two modes 'paging search' and
>   'streaming search', each of which demonstrates a different way of using
>   the search APIs. The former uses TopDocs and a TopDocCollector, the
>   latter a custom HitCollector implementation.
> - Updates the online tutorial that describes the demo.
> All tests pass.




[jira] Commented: (LUCENE-1290) Deprecate Hits

2008-05-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598565#action_12598565
 ] 

Christian Kohlschütter commented on LUCENE-1290:


Michael:
The HitCollector callback is called in index order (or in some other
non-deterministic order), whereas the results in Hits are sorted (by relevance
or any given Sort order).

Uwe:
Good idea; this would be even better than the plain iterator class.
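
The index-order callback and sorted results are not actually in conflict: a collector can gather (doc, score) pairs into a bounded min-heap as they arrive and sort only the survivors afterwards, which is the idea behind TopDocCollector/TopDocs. A minimal standalone sketch (no Lucene classes; all names here are hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class CollectThenSortDemo {

    // Callback in the style of HitCollector: invoked once per matching doc,
    // in index (doc id) order.
    interface Collector { void collect(int doc, float score); }

    // Keeps the top n (doc, score) pairs in a min-heap while docs arrive in
    // index order, then hands them back sorted by descending score.
    static class TopN implements Collector {
        private final int n;
        private final PriorityQueue<float[]> heap =
            new PriorityQueue<>(Comparator.comparingDouble(e -> e[1]));
        TopN(int n) { this.n = n; }

        public void collect(int doc, float score) {
            heap.add(new float[] { doc, score });
            if (heap.size() > n) heap.poll();    // drop the current lowest score
        }

        int[] topDocs() {                        // doc ids, best score first
            List<float[]> hits = new ArrayList<>(heap);
            hits.sort((a, b) -> Float.compare(b[1], a[1]));
            int[] docs = new int[hits.size()];
            for (int i = 0; i < docs.length; i++) docs[i] = (int) hits.get(i)[0];
            return docs;
        }
    }

    public static void main(String[] args) {
        TopN top = new TopN(2);
        // Simulated index-order callbacks: docs 0..3 with arbitrary scores.
        top.collect(0, 0.1f);
        top.collect(1, 0.9f);
        top.collect(2, 0.5f);
        top.collect(3, 0.7f);
        int[] docs = top.topDocs();
        if (docs.length != 2 || docs[0] != 1 || docs[1] != 3)
            throw new AssertionError(Arrays.toString(docs));
        System.out.println(Arrays.toString(docs)); // best-scoring docs first
    }
}
```

So the forward, index-order callback is only the collection phase; relevance (or any Sort) ordering is imposed once collection finishes, as TopDocs does.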


> Deprecate Hits
> --
>
> Key: LUCENE-1290
> URL: https://issues.apache.org/jira/browse/LUCENE-1290
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Search
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.4
>
> Attachments: lucene-1290.patch, lucene-1290.patch
>
>
> The Hits class has several drawbacks as pointed out in LUCENE-954.
> The other search APIs that use TopDocCollector and TopDocs should be used 
> instead.
> This patch:
> - deprecates org/apache/lucene/search/Hits, Hit, and HitIterator, as well as
>   the Searcher.search( * ) methods which return a Hits Object.
> - removes all references to Hits from the core and uses TopDocs and ScoreDoc
>   instead
> - Changes the demo SearchFiles: adds the two modes 'paging search' and
>   'streaming search', each of which demonstrates a different way of using
>   the search APIs. The former uses TopDocs and a TopDocCollector, the
>   latter a custom HitCollector implementation.
> - Updates the online tutorial that describes the demo.
> All tests pass.
