Hi Marvin,

Thank you for following up on this!

Marvin Humphrey <[email protected]> writes:

[...]

> >> * Lucy::Search::IndexSearcher::top_docs (used by SearchServer) is
> >>   about twice as slow Lucy::Search::Searcher::hits (used by
> >>   IndexSearcher).
>
> Well, this may come as a surprise in light of your benchmarks, but
> Searcher#hits() calls top_docs() internally. :)
>
>     http://s.apache.org/vH (link to git-wip-us.apache.org)
>
> For the record, Searcher is IndexSearcher's parent class; IndexSearcher
> inherits hits() but provides its own implementation of top_docs(). A fair
> benchmark would involve comparing the results of top_docs() and hits() on a
> single IndexSearcher -- and it would be very surprising if hits() was faster.
>
> I suspect the at least some of the discrepancies you are seeing arise because:
>
> * IndexSearcher is a mature class implemented primarily in C.
> * ClusterSearcher is a comparatively young class implemented in Perl.

Even though ClusterSearcher is implemented in Perl, I don't see that
the C function top_docs would be calling back into Perl space here,
and thus I still don't understand the (big) discrepancy. Either
Devel::DProf is wildly inaccurate in this case, or there is something
rather strange going on.

Could it be that top_docs is called with a different set of parameters
in the ClusterSearcher case? Do you have any other ideas?

In any case, just to rule out any *really* crazy stuff, I did the test
you suggested above. Here, top_docs() was a tiny bit faster than
hits(), as should be expected. I have pasted the test program for this
at the end of this email. I peeked at Searcher.c and Lucy.xs to work
out the equivalent Perl code for hits(); I hope I got it right.

> Should we port ClusterSearcher to C, I expect that we'll see some of the
> performance anomalies smooth out. However, I don't think we should focus on
> that yet, because ClusterSearcher's architecture is not yet optimal -- and it
> will be easier to refactor if we keep it in Perl for now.

I wholeheartedly agree.
First solve the fundamental issues in order to make the Perl
proof-of-concept run as fast as possible, *then* optimize the
low level stuff.

--
Best regards,

Dag Lem

#!/usr/bin/perl

# Test performance of top_docs() vs. hits()

use strict;
use warnings;

use Lucy::Search::IndexSearcher;
use Lucy::Search::QueryParser;
use Lucy::Search::Hits;

# Run with no argument or 1 to test top_docs();
# any other argument (e.g. 0) tests hits().
my $use_top_docs = !@ARGV || $ARGV[0] eq '1';

my $searcher = Lucy::Search::IndexSearcher->new(index => "/db/disk1/lucy/full");
my $schema = $searcher->get_schema();
my $query_parser = Lucy::Search::QueryParser->new(schema => $schema);
$query_parser->set_heed_colons(1);

my $offset = 0;
my $num_wanted = 100;

for (1..10000) {
    my $query = $query_parser->parse("fornavn:(lem dag) AND etternavn:(lem dag)");
    # my $query = $query_parser->parse("fornavn:(dag) AND etternavn:(lem)");

    my $hits;
    if ($use_top_docs) {
        # Perl equivalent of what hits() does internally
        # (worked out from Searcher.c and Lucy.xs).
        my $real_query = $searcher->glean_query($query);
        my $doc_max = $searcher->doc_max();
        my $wanted = $offset + $num_wanted > $doc_max
                   ? $doc_max
                   : $offset + $num_wanted;
        my $top_docs = $searcher->top_docs(query      => $real_query,
                                           num_wanted => $wanted);
        $hits = Lucy::Search::Hits->new(searcher => $searcher,
                                        top_docs => $top_docs,
                                        offset   => $offset);
    }
    else {
        $hits = $searcher->hits(
            query      => $query,
            offset     => $offset,
            num_wanted => $num_wanted,
        );
    }

    # while (my $hit = $hits->next) {
    #     print "$hit->{fodselsdato}\t$hit->{navn}\n";
    # }
}
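
As a cross-check on the Devel::DProf numbers, the two code paths could also be timed by plain wall clock. The following is only a minimal sketch using the core Time::HiRes module. The subs run_top_docs and run_hits are hypothetical stand-ins (they are not part of Lucy); their bodies would be replaced with the top_docs() and hits() branches of the test program above.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use Time::HiRes qw(gettimeofday tv_interval);

# Call $code $iterations times and return the elapsed wall-clock seconds.
sub time_loop {
    my ($code, $iterations) = @_;
    my $start = [gettimeofday];
    $code->() for 1 .. $iterations;
    return tv_interval($start);
}

# Hypothetical stand-ins (not Lucy API): replace the bodies with the
# top_docs() and hits() branches of the test program.
my $calls = 0;
sub run_top_docs { $calls++ }
sub run_hits     { $calls++ }

my $iterations = 10_000;
my %elapsed = (
    top_docs => time_loop(\&run_top_docs, $iterations),
    hits     => time_loop(\&run_hits,     $iterations),
);

# Report the total and the per-call time for each variant.
printf "%-8s %.3f s total, %.1f us/call\n",
    $_, $elapsed{$_}, 1e6 * $elapsed{$_} / $iterations
    for sort keys %elapsed;
```

Running both variants in the same process should keep caching effects comparable between them, although the variant that runs first may still pay for a cold cache, so swapping the order is a cheap sanity check.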
