Re: To Sort or not to Sort

2004-12-16 Thread Erik Hatcher
To squeeze the best performance out of your first option, be sure to 
index your date field as a numeric (i.e. literally a Long.toString or 
Integer.toString value).  Then when you sort, be sure to specify the 
field type.  Sorting by Strings is the most expensive; numerics 
use far fewer resources.
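
A minimal sketch of the numeric encoding, assuming a day-granularity
date packed as a yyyyMMdd integer.  The class and method names here
(DateSortKey, toKey, compareNumeric) are hypothetical and not part of
Lucene; the Lucene side (indexing the value as an untokenized field and
declaring an integer field type on the sort) is left out and should be
checked against the API docs for your version.

```java
import java.time.LocalDate;

public class DateSortKey {
    // Pack a date into a yyyyMMdd integer and render it with Integer.toString,
    // so the stored field value is a plain numeric string.
    static String toKey(LocalDate d) {
        int packed = d.getYear() * 10000 + d.getMonthValue() * 100 + d.getDayOfMonth();
        return Integer.toString(packed);
    }

    // Compare two keys numerically: parse to int, then compare the ints.
    // This stands in for what a numeric sort does instead of comparing Strings.
    static int compareNumeric(String a, String b) {
        return Integer.compare(Integer.parseInt(a), Integer.parseInt(b));
    }

    public static void main(String[] args) {
        String older = toKey(LocalDate.of(2004, 12, 16));
        String newer = toKey(LocalDate.of(2004, 12, 17));
        System.out.println(older + " sorts before " + newer + ": "
                + (compareNumeric(older, newer) < 0));
    }
}
```

Because the packed value grows with the date, numeric order and
chronological order agree, which is what makes the cheaper numeric
sort usable for date ordering.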

I always recommend the simplest approach first - using the Sort 
facility in this case.  If that is good enough then move on to other 
things :)

Erik

On Dec 16, 2004, at 7:07 PM, Scott Smith wrote:
I'm hoping someone has an opinion (based on some experience) as to how I
might approach a design I'm doing with Lucene.


In my application, users search for messages with Lucene.  Typically,
they are more interested in seeing their hits in date-order than in
relevance-order.  In reading my ebook copy of Lucene in Action (wish
I'd had that a year ago), I find that one of the features added in 1.4
was the ability to ask for hits in an order based on a field.  It also
looks like adding the field necessary to get things by date order is
straightforward.

But, for my browser-based application I think there is another
consideration.  Users will typically page through the messages 20-50 at
a time and often they will only look through the first few pages of
messages and then be done.  So, I think there are two possible designs.

1.  Simply use the built-in Lucene sort functionality, cache the hit
list and then page through the list.  Adv: looks pretty
straightforward, I write less code.  Dis: for searches that return a large
number of hits (having a search return several hundred to a few thousand
hits is not uncommon), Lucene is sorting a lot of entries that don't
really need to be sorted (because the user will never look at them) and
sorting tends to be expensive.
2.  The other solution uses a priority heap to collect the top N (or
next N) entries.  I still have to walk the entire hit list, but keeping
entries in a priority heap means I can determine the N entries I need
with a few comparisons and minimal sorting.  I don't have to sort a
bunch of entries whose order I don't care about.  Additionally, I don't
have to have all of the entries in memory at one time.  The big
disadvantage with this is that I have to write more code.  However, it
may be worth it if the performance difference is large enough.


This may be one of those questions where the only answer is code it
both ways and do speed trials.  I was just wondering if anyone had
enough experience with either method to offer an opinion.  Are there
things the Lucene sort is doing under the covers that will make its
ability to sort much faster than what I can do with the hit lists, since
I still have to force the IndexSearcher object to retrieve all of the
Documents in the hits list?

 Opinions?

Scott



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: To Sort or not to Sort

2004-12-16 Thread Chris Hostetter
: In my application, users search for messages with Lucene.  Typically,
: they are more interested in seeing their hits in date-order than in
: relevance-order.  In reading my ebook copy of Lucene in Action (wish
: I'd had that a year ago), I find that one of the features added in 1.4
: was the ability to ask for hits in an order based on a field.  It also
: looks like adding the field necessary to get things by date order is
: straightforward.

When considering issues like this, it's important to consider what is
really important to your users: Do they really want to see items strictly
ordered by date, or do they want to see results sorted by relevancy --
where the recentness of an item influences how relevant it is?

For example, when I search the Lucene users mailing list for RangeQuery I
want more recent messages to appear first, but I'd still prefer that a
slightly older message bubble up in the list if the Subject includes
RangeQuery and the body mentions RangeQuery dozens of times -- because
it's likely to be more relevant than more recent messages which only
mention RangeQuery once or twice.  But I don't want results that are
strictly sorted by term frequency, because then messages from 3 years ago
(and several Lucene revs ago) might be at the top of the list.

Depending on how you maintain your index, there are a couple of different
ways of achieving a goal like this.  If you rebuild regularly, then just
giving your more recent documents a higher boost is one way to go.
Another would be to use several FilteredQuery(RangeFilter) with several
increasing intervals of dates (i.e. today OR the past week OR the past
month OR the past year) so that more recent documents match all of the
clauses, and older documents match fewer (or none).
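
The nested-interval idea can be illustrated with a small sketch that
only counts how many of the date clauses a document would match.  It
does not model how Lucene turns matched clauses into a score, and the
names (RecencyClauses, matchingClauses) and the window lengths of 1, 7,
30 and 365 days are assumptions chosen for illustration.

```java
public class RecencyClauses {
    // Upper bounds in days for "today", "past week", "past month", "past year".
    // These exact lengths are an assumption made for this sketch.
    static final long[] WINDOW_DAYS = {1, 7, 30, 365};

    // Count how many of the nested windows contain a document of the given age.
    // A document from today falls inside every window; an older one inside fewer.
    static int matchingClauses(long ageInDays) {
        int count = 0;
        for (long window : WINDOW_DAYS) {
            if (ageInDays >= 0 && ageInDays < window) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        long[] ages = {0, 3, 20, 200, 1000};
        for (long age : ages) {
            System.out.println(age + " days old matches "
                    + matchingClauses(age) + " clause(s)");
        }
    }
}
```

Since the windows are nested, the count falls off in steps as a
document ages, which is the graded recency effect described above.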



-Hoss





Re: To Sort or not to Sort

2004-12-16 Thread Doug Cutting
Scott Smith wrote:
1.  Simply use the built-in Lucene sort functionality, cache the hit
list and then page through the list.  Adv: looks pretty
straightforward, I write less code.  Dis: for searches that return a large
number of hits (having a search return several hundred to a few thousand
hits is not uncommon), Lucene is sorting a lot of entries that don't
really need to be sorted (because the user will never look at them) and
sorting tends to be expensive.
2.  The other solution uses a priority heap to collect the top N (or
next N) entries.  I still have to walk the entire hit list, but keeping
entries in a priority heap means I can determine the N entries I need
with a few comparisons and minimal sorting.  I don't have to sort a
bunch of entries whose order I don't care about.  Additionally, I don't
have to have all of the entries in memory at one time.  The big
disadvantage with this is that I have to write more code.  However, it
may be worth it if the performance difference is large enough.

Lucene's built-in sorting code already performs the optimization you 
describe as (2).  So don't bother re-inventing it!
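
For readers unfamiliar with the technique described as (2), here is a
minimal sketch of top-N selection with a bounded min-heap.  It uses
java.util.PriorityQueue purely for illustration and is not Lucene's
actual implementation; the class TopNByDate, the method newest, and the
yyyyMMdd-style keys are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopNByDate {
    // Return the n largest date keys, newest first, while holding at most
    // n entries in memory at any time.
    static List<Long> newest(long[] dateKeys, int n) {
        List<Long> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Min-heap: the head is the oldest of the current top n.
        PriorityQueue<Long> heap = new PriorityQueue<>(n);
        for (long key : dateKeys) {
            if (heap.size() < n) {
                heap.offer(key);
            } else if (key > heap.peek()) {
                // Newer than the oldest kept entry: replace it.
                heap.poll();
                heap.offer(key);
            }
        }
        // Draining the min-heap yields ascending order; reverse for newest first.
        while (!heap.isEmpty()) {
            result.add(heap.poll());
        }
        Collections.reverse(result);
        return result;
    }

    public static void main(String[] args) {
        long[] keys = {20041201L, 20041216L, 20031105L, 20041210L, 20040630L};
        System.out.println(newest(keys, 3));
    }
}
```

Every hit is still visited once, but each costs at most one heap update
of size n, so only the entries that will be displayed are ever ordered.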

Doug


RE: To Sort or not to Sort

2004-12-16 Thread Scott Smith
I think we have a winner.  Number 1 it is.  Thanks for the information.


To Sort or not to Sort

2004-12-16 Thread Scott Smith
I'm hoping someone has an opinion (based on some experience) as to how I
might approach a design I'm doing with Lucene.

 

In my application, users search for messages with Lucene.  Typically,
they are more interested in seeing their hits in date-order than in
relevance-order.  In reading my ebook copy of Lucene in Action (wish
I'd had that a year ago), I find that one of the features added in 1.4
was the ability to ask for hits in an order based on a field.  It also
looks like adding the field necessary to get things by date order is
straightforward.

 

But, for my browser-based application I think there is another
consideration.  Users will typically page through the messages 20-50 at
a time and often they will only look through the first few pages of
messages and then be done.  So, I think there are two possible designs.

 

1.  Simply use the built-in Lucene sort functionality, cache the hit
list and then page through the list.  Adv: looks pretty
straightforward, I write less code.  Dis: for searches that return a large
number of hits (having a search return several hundred to a few thousand
hits is not uncommon), Lucene is sorting a lot of entries that don't
really need to be sorted (because the user will never look at them) and
sorting tends to be expensive.
2.  The other solution uses a priority heap to collect the top N (or
next N) entries.  I still have to walk the entire hit list, but keeping
entries in a priority heap means I can determine the N entries I need
with a few comparisons and minimal sorting.  I don't have to sort a
bunch of entries whose order I don't care about.  Additionally, I don't
have to have all of the entries in memory at one time.  The big
disadvantage with this is that I have to write more code.  However, it
may be worth it if the performance difference is large enough.

 

This may be one of those questions where the only answer is code it
both ways and do speed trials.  I was just wondering if anyone had
enough experience with either method to offer an opinion.  Are there
things the Lucene sort is doing under the covers that will make its
ability to sort much faster than what I can do with the hit lists, since
I still have to force the IndexSearcher object to retrieve all of the
Documents in the hits list?

 

 Opinions?

 

Scott