Re: [lucy-user] Hits offset and search performance

Thomas den Braber Tue, 13 Nov 2012 08:47:57 -0800

Peter Karman wrote on 11/12/2012 5:46 PM

> Thomas den Braber wrote on 11/12/12 3:53 AM:
> > On Sun, Nov 11, 2012 at 04:19 AM, Marvin Humphrey wrote:
> > 
> >> I don't know how Swish-e implements sorting of hits, but this is expected
> >> behavior in Lucy.
> > 
> > Swish-e can use presorting of attributes during indexing:
> > 'By default Swish-e generates presorted tables while indexing for each
> > property name. This allows faster sorting when generating results. On
> > large document collections this presorting may add to the indexing time,
> > and also adds to the total size of the index. This directive can be used
> > to customize exactly which properties will be presorted.'
> > 
> > Maybe this does the trick?
> 
> 
> Swish-e does presort attributes, but rank/score is not one of them. That is
> always a per-search attribute.
> 
> ISTR an email exchange about this back when I was first using KinoSearch
> (pre-Lucy), but I can't find it now.
>


I never understood what the presorting in Swish-e was actually doing. It may
not be that interesting for Lucy users, but could you explain it in a couple
of lines?

> 
> > 
> >>> I would expect that using the offset, performance should be higher because
> >>> no processing needs to be done to the hits before the offset (no score
> >>> calculation).
> > 
> >> How do you know that the hit number 5000 actually ranks 5000th in sort
> >> order unless you calculate scores for all documents and perform sorting?
> 
> 
> Swish-e calculates the score for all documents before sorting them. Just
> like Lucy.
> 
> 
> > 
> >> There are certain times when Lucy can avoid calculating scores -- when
> >> SortSpecs do not require scores, or when documents match pure negative
> >> clauses (docs matching "bar" in the query `foo AND NOT bar`).  But when
> >> you are ranking documents based on score, we have to calculate a score
> >> for **every** document.
> > 
> > Sorry I didn't mention this, but I really meant sorting by attributes
> > other than the score, like modification date or file size. Is calculating
> > the score also needed here?
> 
> 
> No. If you look at the source for SWISH::Prog::Lucy::Searcher->search() you
> will see that I always add a SortRule for 'score', but that is only so that
> I can show the result, not to sort by it.
> 

Doesn't this cost performance?

> 
> 
> > 
> >> I would assume that Swish-e and Lucy are implemented differently.  I don't
> >> know what seek() does in the context of Swish-e.
> > 
> > Seek will fast forward through the search result without first
> > specifying the total hits you want to collect and without reading the
> > results that exist before the seek pointer. In Swish-e you also do not
> > have to say in advance how many hits you want.
> 
> 
> $hits->seek(10); # skip the first 9 hits
> 
> This is similar to the Lucy::Index::Lexicon->seek() method.
> 
> It would be useful to have it for Lucy::Search::Hits too, imo.
> 

That would be nice. It would also be useful to be able to change num_wanted
and offset after the '$searcher->hits()' call has been made, without
performing the search again with a different offset.
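
To illustrate what I mean, here is a minimal sketch. It is written in Python
rather than Perl and it is not Lucy code: Hit, CachedHits, seek and next_page
are names I made up for the illustration. It ranks the matches once, keeps
the best max_wanted of them, and then lets the caller move the offset and
change the page size without searching again. Note that seek here is
zero-based.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    doc_id: int
    sort_key: float  # e.g. score, modification time, or file size


class CachedHits:
    """Hypothetical helper: rank once, then page through the cached ranking."""

    def __init__(self, hits, max_wanted):
        # Ranking still has to look at every matching hit; only the
        # best `max_wanted` are kept, highest sort_key first.
        self._ranked = heapq.nlargest(max_wanted, hits, key=lambda h: h.sort_key)
        self._pos = 0

    def seek(self, offset):
        """Move the read position to a zero-based offset (clamped to range)."""
        self._pos = max(0, min(offset, len(self._ranked)))

    def next_page(self, num_wanted):
        """Return up to num_wanted hits starting at the current position."""
        page = self._ranked[self._pos:self._pos + num_wanted]
        self._pos += len(page)
        return page


if __name__ == "__main__":
    matches = [Hit(doc_id=i, sort_key=float(i)) for i in range(100)]
    cached = CachedHits(matches, max_wanted=50)
    cached.seek(10)                 # skip the ten best hits
    print([h.doc_id for h in cached.next_page(5)])
    cached.seek(0)                  # go back without a new search
    print([h.doc_id for h in cached.next_page(3)])
```

The cost of ranking is paid once in the constructor; seek and next_page only
slice the cached list. The trade-off is that max_wanted still has to be
chosen up front and the cached hits stay in memory.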


> > 
> > I can overcome the absence of such a command in Lucy by tweaking my
> > program and moving some of my logic to an earlier stage.
> > 
> > I will continue my migration and will let you know if there are 'more
> > bumps on the road'.
> > 
> > I can also make a more detailed performance comparison if you like.
> 
> 
> I, for one, would be interested in hearing your thoughts, Thomas. As you might
> expect, I have some experience with both. :)

I know, I'm glad that you joined the Lucy project.

//Thomas
