Hi Marvin,

Thank you for following up on this!

Marvin Humphrey <[email protected]> writes:

[...]

> >> * Lucy::Search::IndexSearcher::top_docs (used by SearchServer) is
> >>   about twice as slow as Lucy::Search::Searcher::hits (used by
> >>   IndexSearcher).
> 
> Well, this may come as a surprise in light of your benchmarks, but
> Searcher#hits() calls top_docs() internally. :)
> 
>     http://s.apache.org/vH  (link to git-wip-us.apache.org)
> 
> For the record, Searcher is IndexSearcher's parent class; IndexSearcher
> inherits hits() but provides its own implementation of top_docs().  A fair
> benchmark would involve comparing the results of top_docs() and hits() on a
> single IndexSearcher -- and it would be very surprising if hits() was faster.
> 
> I suspect that at least some of the discrepancies you are seeing arise because:
> 
> *   IndexSearcher is a mature class implemented primarily in C.
> *   ClusterSearcher is a comparatively young class implemented in Perl.

Even though ClusterSearcher is implemented in Perl, I don't see that
the C function top_docs would be calling back into Perl space here,
and thus I still don't understand the (big) discrepancy. Either
Devel::DProf is wildly inaccurate in this case, or there is something
rather strange going on. Could it be that top_docs is called with a
different set of parameters in the ClusterSearcher case? Do you have
any other ideas?
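
One way to cross-check the Devel::DProf numbers would be to time the
two calls directly with wall-clock timing from Time::HiRes. Below is a
minimal sketch of that idea. Note that time_calls() and the two subs
in %candidates are hypothetical stand-ins that I made up for
illustration; they do not call into Lucy and would have to be replaced
by the actual calls under test.

```perl
#!/usr/bin/perl

# Sketch: wall-clock timing of two code paths, as a sanity check on
# profiler output. The two subs below are placeholders, NOT Lucy calls.

use strict;
use warnings;

use Time::HiRes qw(gettimeofday tv_interval);

# Run $code $iterations times and return the elapsed wall-clock seconds.
sub time_calls {
    my ($code, $iterations) = @_;
    my $start = [gettimeofday];
    $code->() for 1 .. $iterations;
    return tv_interval($start);    # elapsed time from $start until now
}

# Hypothetical stand-ins; replace the bodies with the calls to compare,
# e.g. $searcher->top_docs(...) and $searcher->hits(...).
my %candidates = (
    top_docs => sub { my $sum = 0; $sum += $_ for 1 .. 100; return $sum },
    hits     => sub { my $sum = 0; $sum += $_ for 1 .. 100; return $sum },
);

for my $name (sort keys %candidates) {
    my $elapsed = time_calls($candidates{$name}, 10_000);
    printf "%-10s %.3f s\n", $name, $elapsed;
}
```

If the wall-clock ratio between the two differs a lot from what the
profiler reports, that would point towards profiler overhead rather
than a real difference in the code being called.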

In any case, just to rule out any *really* crazy stuff, I did the test
you suggested above. Here, top_docs() was a tiny bit faster than
hits(), as should be expected. I have pasted the test program for
this at the end of this email. I peeked at Searcher.c and Lucy.xs to
work out the equivalent Perl code for hits(); I hope I got it right.

> Should we port ClusterSearcher to C, I expect that we'll see some of the
> performance anomalies smooth out.  However, I don't think we should focus on
> that yet, because ClusterSearcher's architecture is not yet optimal -- and it
> will be easier to refactor if we keep it in Perl for now.

I wholeheartedly agree. First solve the fundamental issues in order to
make the Perl proof-of-concept run as fast as possible, *then*
optimize the low level stuff.

-- 
Best regards,

Dag Lem


#!/usr/bin/perl

# Test performance of top_docs() vs. hits()

use strict;
use warnings;

use Lucy::Search::IndexSearcher;
use Lucy::Search::QueryParser;
use Lucy::Search::Hits;

# Pass 0 to test hits().
my $top_docs = !@ARGV || $ARGV[0] eq '1';

my $searcher = Lucy::Search::IndexSearcher->new(index => "/db/disk1/lucy/full");
my $schema = $searcher->get_schema();

my $query_parser = Lucy::Search::QueryParser->new(schema => $schema);
$query_parser->set_heed_colons(1);

my $offset = 0;
my $num_wanted = 100;

for (1..10000) {
    my $query = $query_parser->parse("fornavn:(lem dag) AND etternavn:(lem dag)");
#    my $query = $query_parser->parse("fornavn:(dag) AND etternavn:(lem)");

    my $hits;

    if ($top_docs) {
        my $real_query = $searcher->glean_query($query);
        my $doc_max = $searcher->doc_max();
        my $wanted  = $offset + $num_wanted > $doc_max
            ? $doc_max
            : $offset + $num_wanted;
        my $top_docs = $searcher->top_docs(query      => $real_query,
                                           num_wanted => $wanted);
        $hits = Lucy::Search::Hits->new(searcher => $searcher,
                                        top_docs => $top_docs,
                                        offset   => $offset);
    }
    else {
        $hits = $searcher->hits(
                                query      => $query,
                                offset     => $offset,
                                num_wanted => $num_wanted,
                                );
    }

#     while (my $hit = $hits->next) {
#       print "$hit->{fodselsdato}\t$hit->{navn}\n";
#     }
}
