Thomas den Braber wrote on 11/12/12 3:53 AM:
> On Sun, Nov 11, 2012 at 04:19 AM, Marvin Humphrey <[email protected]>
> wrote:
>
>> I don't know how Swish-e implements sorting of hits, but this is expected
>> behavior in Lucy.
>
> Swish-e can use presorting of attributes during indexing:
> 'By default Swish-e generates presorted tables while indexing for each
> property name. This allows faster sorting when generating results. On large
> document collections this presorting may add to the indexing time, and also
> adds to the total size of the index. This directive can be used to customize
> exactly which properties will be presorted.'
>
> Maybe this does the trick ?

Swish-e does presort attributes, but rank/score is not one of them. That is
always a per-search attribute. ISTR an email exchange about this back when I
was first using KinoSearch (pre-Lucy), but I can't find it now.

>
>>> I would expect that using the offset, performance should be higher because
>>> no processing needs to be done to the hits before the offset (no score
>>> calculation).
>
>> How do you know that the hit number 5000 actually ranks 5000th in sort order
>> unless you calculate scores for all documents and perform sorting?

Swish-e calculates the score for all documents before sorting them. Just like
Lucy.

>
>> There are certain times when Lucy can avoid calculating scores -- when
>> SortSpecs do not require scores, or when documents match pure negative
>> clauses (docs matching "bar" in the query `foo AND NOT bar`). But when you
>> are ranking documents based on score, we have to calculate a score for
>> **every** document.
>
> Sorry I didn't mention this but I really meant sorting by attributes other
> the score, like modification date or file size. Is calculating of the score
> also needed here?

No. If you look at the source for SWISH::Prog::Lucy::Searcher->search() you
will see that I always add a SortRule for 'score' but that is only so that I
can show the result, not to sort by it.

>
>> I would assume that Swish-e and Lucy are implemented differently. I don't
>> know what seek() does in the context of Swish-e.
>
> Seek will fast forward through the search result without first specifying the
> total hits you want to collect and not reading the results that exists before
> the seek pointer. In swish you also do not have to say in advance how many
> hits you want.

  $hits->seek(10); # skip the first 9 hits

This is similar to the Lucy::Index::Lexicon->seek() method. It would be useful
to have it for Lucy::Search::Hits too, imo.
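
To make the offset discussion concrete, here is a minimal, language-neutral
sketch in Python. It is not Lucy or Swish-e code: `page_of_hits` and the sample
`docs` are hypothetical names made up for illustration, and the sketch assumes
matches are sorted by a plain attribute (a modification time here) with no
presorted table to consult. It only shows that a window starting at some offset
still requires looking at every match before the earlier hits can be dropped.

```python
import heapq


def page_of_hits(matches, sort_key, offset, num_wanted):
    """Return hits [offset, offset + num_wanted) in ascending sort order.

    Hypothetical helper for illustration; not part of Lucy or Swish-e.
    """
    # Every match is examined: any of them could fall inside the requested
    # window, so a large offset does not reduce the work done here.
    top = heapq.nsmallest(offset + num_wanted, matches, key=sort_key)
    # Only after ranking can the hits before the offset be discarded.
    return top[offset:]


# Made-up sample data: five matching documents with a modification time.
docs = [
    {"path": "a.txt", "mtime": 300},
    {"path": "b.txt", "mtime": 100},
    {"path": "c.txt", "mtime": 500},
    {"path": "d.txt", "mtime": 200},
    {"path": "e.txt", "mtime": 400},
]

page = page_of_hits(docs, sort_key=lambda d: d["mtime"], offset=2, num_wanted=2)
print([d["path"] for d in page])  # ['a.txt', 'e.txt']
```

What a seek on the hits object would save in this picture is the cost of
reading back and fetching the earlier hits, not the cost of ranking them.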
>
> I can overcome the absence of such a command in Lucy by tweaking my program
> and moving some of my logic to an earlier stage.
>
> I will continue my migration and will let you know if there are 'more bumps
> on the road'.
>
> I can also make a more detailed performance comparison if you like.

I, for one, would be interested in hearing your thoughts, Thomas. As you might
expect, I have some experience with both. :)

--
Peter Karman . http://peknet.com/ . [email protected]
