[lucy-user] ClusterSearcher statistics

Marvin Humphrey Wed, 24 Oct 2012 18:54:15 -0700

On Wed, Oct 24, 2012 at 5:08 AM, Dag Lem <[email protected]> wrote:

> Here, doc_freq and top_docs should be replaced with something like
> docs_freq_and_top_docs, i.e. only one request / response per query.


It is true that calls to `doc_freq()` are responsible for a disproportionate
amount of network traffic.  However, it is not currently feasible to
consolidate the calls to `doc_freq()` and `top_docs()` into a single round
trip.

The `doc_freq()` invocations are part of query weighting -- they tell us how
many documents a given term occurs in, allowing us to increase the weight of
rare terms and decrease the weight of common ones.  It is important that the
query weighting be exactly the same for each shard, because otherwise hits
from different shards will have scores which are not comparable to each other.

To know how common a term is across the entire collection we need to survey
all shards and sum the results.  All these calls must be completed before we
can finish weighting the query, allowing us to call `top_docs()`.
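
To make the dependency concrete, here is a minimal sketch of the arithmetic, written in plain Python rather than against Lucy's API. The shard contents, the function names, and the particular IDF formula are all assumptions chosen for illustration; Lucy's actual weighting may differ in detail.

```python
import math

# Hypothetical per-shard statistics: (doc_max, {term: doc_freq}).
shards = [
    (1000, {"lucy": 10, "the": 900}),
    (1000, {"lucy": 200, "the": 950}),
]

def idf(doc_freq, doc_max):
    # One common IDF formulation (an assumption, not necessarily Lucy's).
    return 1.0 + math.log(doc_max / (1.0 + doc_freq))

def global_idf(term, shards):
    # Survey every shard and sum the results before weighting.
    total_docs = sum(doc_max for doc_max, _ in shards)
    total_freq = sum(freqs.get(term, 0) for _, freqs in shards)
    return idf(total_freq, total_docs)

def local_idfs(term, shards):
    # What each shard would compute if it only looked at its own stats.
    return [idf(freqs.get(term, 0), doc_max) for doc_max, freqs in shards]
```

With these made-up numbers, `local_idfs("lucy", shards)` gives each shard a different weight for the same term, so scores from the two shards would not be comparable, while `global_idf("lucy", shards)` yields a single weight that every shard can share.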

The calls to `doc_freq()` also cannot be consolidated together easily, because
they are invoked by nested weighting methods within an arbitrarily complex
compound query object.

As an alternative, how about adding this new method to ClusterSearcher?

    =head2 set_stat_source

        my $local_searcher = Lucy::Search::IndexSearcher->new(
            index => '/path/to/index',
        );
        $cluster_searcher->set_stat_source($local_searcher);

    Set the Searcher which will be used to find index statistics.

    By default, ClusterSearcher gathers index statistics such as doc_freq()
    from all shards when performing certain calculations.  This is accurate,
    but slow because it involves numerous network round-trips.

    If a local IndexSearcher is consulted instead, network costs are
    eliminated, speeding up processes such as query weighting considerably.

    NB: Use with caution -- relevancy may be degraded to the extent that the
    content of the stat source Searcher differs from the content of the
    collection across all shards.

So long as the ClusterSearcher runs on the same machine as a large,
representative shard, using a local IndexSearcher should be a decent
workaround.  Scoring will be messed up if e.g. the local shard is completely
missing a term which is common on other shards, but at least it will be messed
up in the same way for all hits across all shards.
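
A rough sketch of the delegation idea, again in plain Python and not Lucy code: the `ToyShard`, `ToyLocalSearcher`, and `ToyClusterSearcher` classes are hypothetical stand-ins, and the `requests` counter merely simulates network round trips.

```python
class ToyShard:
    """Stand-in for a remote shard; each doc_freq() call is one round trip."""
    def __init__(self, doc_freqs):
        self.doc_freqs = doc_freqs
        self.requests = 0

    def doc_freq(self, term):
        self.requests += 1
        return self.doc_freqs.get(term, 0)

class ToyLocalSearcher:
    """Stand-in for a local IndexSearcher; no network involved."""
    def __init__(self, doc_freqs):
        self.doc_freqs = doc_freqs

    def doc_freq(self, term):
        return self.doc_freqs.get(term, 0)

class ToyClusterSearcher:
    def __init__(self, shards):
        self.shards = shards
        self.stat_source = None

    def set_stat_source(self, searcher):
        self.stat_source = searcher

    def doc_freq(self, term):
        # With a stat source set, answer locally and skip the shards.
        if self.stat_source is not None:
            return self.stat_source.doc_freq(term)
        # Default: survey all shards and sum the results.
        return sum(shard.doc_freq(term) for shard in self.shards)
```

Once a stat source is set, every `doc_freq()` is answered from the one local object, so the same (possibly skewed) number feeds the weighting for all shards. A term missing from the local shard simply comes back as 0 everywhere.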

In the future, the best way to handle this problem is to provide a local cache
of doc_freq stats and create a specialized Searcher subclass which wraps the
cache and knows how to respond to `doc_freq()`.  It's not important that the
stats be updated in real time, nor is it important that the cache contain rare
terms; for decent query weighting, term stats only have to be in the ballpark,
not exact.
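
Such a cache-backed stat source might look roughly like the following. This is a plain-Python sketch under stated assumptions, not a proposed Lucy class: the name `CachedStatSource`, the `refresh()` method, and the choice to report a small default `doc_freq` for uncached terms are all hypothetical.

```python
class CachedStatSource:
    """Answers doc_freq() from an in-memory snapshot of term statistics."""

    def __init__(self, doc_max, doc_freqs, rare_doc_freq=1):
        self._doc_max = doc_max
        self._doc_freqs = dict(doc_freqs)
        # Terms absent from the cache are assumed to be rare.
        self._rare_doc_freq = rare_doc_freq

    def doc_max(self):
        return self._doc_max

    def doc_freq(self, term):
        return self._doc_freqs.get(term, self._rare_doc_freq)

    def refresh(self, doc_max, doc_freqs):
        # Swap in a newer snapshot; this can happen occasionally,
        # since the stats only need to be in the ballpark.
        self._doc_max = doc_max
        self._doc_freqs = dict(doc_freqs)
```

Only the common terms need to be stored, which keeps the cache small; anything else falls through to the rare-term default.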

Marvin Humphrey
