Peter Karman wrote on 11/12/2012 5:46 PM
> Thomas den Braber wrote on 11/12/12 3:53 AM:
> > On Sun, Nov 11, 2012 at 04:19 AM, Marvin Humphrey <[email protected]> wrote:
> >
> >> I don't know how Swish-e implements sorting of hits, but this is expected
> >> behavior in Lucy.
> >
> > Swish-e can use presorting of attributes during indexing:
> > 'By default Swish-e generates presorted tables while indexing for each
> > property name. This allows faster sorting when generating results. On large
> > document collections this presorting may add to the indexing time, and also
> > adds to the total size of the index. This directive can be used to customize
> > exactly which properties will be presorted.'
> >
> > Maybe this does the trick?
>
> Swish-e does presort attributes, but rank/score is not one of them. That is
> always a per-search attribute.
>
> ISTR an email exchange about this back when I was first using KinoSearch
> (pre-Lucy), but I can't find it now.
I never understood what the presorting in Swish-e was actually doing. Maybe not
that interesting for Lucy users, but could you explain in a couple of lines?

> >>> I would expect that using the offset, performance should be higher because
> >>> no processing needs to be done to the hits before the offset (no score
> >>> calculation).
> >
> >> How do you know that the hit number 5000 actually ranks 5000th in sort
> >> order unless you calculate scores for all documents and perform sorting?
>
> Swish-e calculates the score for all documents before sorting them. Just like
> Lucy.
>
> >> There are certain times when Lucy can avoid calculating scores -- when
> >> SortSpecs do not require scores, or when documents match pure negative
> >> clauses (docs matching "bar" in the query `foo AND NOT bar`). But when you
> >> are ranking documents based on score, we have to calculate a score for
> >> **every** document.
> >
> > Sorry I didn't mention this, but I really meant sorting by attributes other
> > than the score, like modification date or file size. Is calculating the
> > score also needed here?
>
> No. If you look at the source for SWISH::Prog::Lucy::Searcher->search() you
> will see that I always add a SortRule for 'score' but that is only so that I
> can show the result, not to sort by it.

Doesn't this cost performance?

> >> I would assume that Swish-e and Lucy are implemented differently. I don't
> >> know what seek() does in the context of Swish-e.
> >
> > Seek will fast forward through the search result without first specifying
> > the total hits you want to collect and without reading the results that
> > exist before the seek pointer. In Swish-e you also do not have to say in
> > advance how many hits you want.
>
> $hits->seek(10); # skip the first 9 hits
>
> This is similar to the Lucy::Index::Lexicon->seek() method.
>
> It would be useful to have it for Lucy::Search::Hits too, imo.

That would be nice.
It would also be useful to be able to change num_wanted and offset after the
'$searcher->hits()' call has been made, without performing the search again
with a different offset.

> > I can overcome the absence of such a command in Lucy by tweaking my program
> > and moving some of my logic to an earlier stage.
> >
> > I will continue my migration and will let you know if there are 'more bumps
> > on the road'.
> >
> > I can also make a more detailed performance comparison if you like.
>
> I, for one, would be interested in hearing your thoughts, Thomas. As you might
> expect, I have some experience with both. :)

I know, I'm glad that you joined the Lucy project.

//Thomas
