Marvin Humphrey <[email protected]> writes:
[...]
> First, regarding term extraction, it does not suffice to walk the query tree
> looking for TermQueries -- PhraseQueries also have terms, but more crucially,
> so do arbitrary user-defined Query subclasses. In order to get at terms
> within arbitrary Query objects, Query needs an `extract_terms()` method which
> subclasses may have to override.
>
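If it helps other readers: the way I picture it, a compound query's
override would simply gather terms from its children. A throwaway,
untested sketch -- MyOrQuery and MyTermQuery are invented stand-ins,
and extract_terms() is the proposed method, not something in today's
API:

```perl
package MyOrQuery;

sub new {
    my ( $class, @children ) = @_;
    return bless { children => [@children] }, $class;
}

# A compound query has no terms of its own -- it collects
# the terms of all its child queries.
sub extract_terms {
    my $self = shift;
    return map { $_->extract_terms } @{ $self->{children} };
}

package MyTermQuery;

sub new {
    my ( $class, %args ) = @_;
    return bless { field => $args{field}, term => $args{term} }, $class;
}

# A leaf query contributes exactly one field:term pair.
sub extract_terms {
    my $self = shift;
    return "$self->{field}:$self->{term}";
}

package main;

my $query = MyOrQuery->new(
    MyTermQuery->new( field => 'content', term => 'foo' ),
    MyTermQuery->new( field => 'title',   term => 'bar' ),
);
my @terms = $query->extract_terms;
print "@terms\n";    # content:foo title:bar
```

A user-defined subclass would only have to supply its own
extract_terms(), and the walk takes care of itself.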
> Second, once you obtain an array of terms via `$query->extract_terms()` and
> bulk-fetch their stats from the remote shards, you need to cache the stats in
> a hash and override `doc_freq()`. That way, when nested query weighting
> routines invoke `$searcher->doc_freq`, they get the stat which was bulk
> fetched moments before.
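That recipe seems straightforward enough. Just to check my
understanding, an untested sketch -- CachingSearcher, remote_freqs and
prefetch_stats are all made up here, only doc_freq() is a real Lucy
method name, and remote_freqs fakes the round trip to the shards:

```perl
package CachingSearcher;

sub new {
    my ( $class, %args ) = @_;
    # remote_freqs stands in for the bulk fetch from the shards.
    return bless { remote_freqs => $args{remote_freqs}, cache => {} }, $class;
}

# Bulk-fetch the stats for all of a query's terms up front.
sub prefetch_stats {
    my ( $self, @terms ) = @_;
    $self->{cache}{$_} = $self->{remote_freqs}{$_} // 0 for @terms;
}

# Overridden doc_freq(): nested weighting routines calling
# $searcher->doc_freq now get the stat fetched moments before.
sub doc_freq {
    my ( $self, %args ) = @_;
    my $key = "$args{field}:$args{term}";
    # A real subclass would fall back to a remote call here
    # instead of returning 0.
    return exists $self->{cache}{$key} ? $self->{cache}{$key} : 0;
}

package main;

my $searcher = CachingSearcher->new(
    remote_freqs => { 'content:foo' => 42, 'content:bar' => 7 },
);
$searcher->prefetch_stats( 'content:foo', 'content:bar' );
print $searcher->doc_freq( field => 'content', term => 'foo' ), "\n";    # 42
```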
[...]
> There's a lot of dissatisfaction in Lucy-land with our labyrinthine
> search-time Query weighting mechanism. The Lucene architecture we inherited
> is ridiculously convoluted and we've already been through a couple rounds of
> refactoring trying to simplify it. The last thing we want to do is make it
> harder to write a custom query subclass when our users already struggle with
> the complexity of that task.
OK, so how about this poor man's solution?
1. Add a private function to Searcher to switch between three
   different behaviors of doc_freq(): normal operation, storing the
   field/term in a cache, or retrieving the freq from the cache.
2. For ClusterSearcher, insert an extra call to QueryParser::Parse to
   store the fields/terms in the cache (discarding the returned
   query), then call the new function doc_freqs() to add the freqs to
   the cache. Finally, let the existing call to QueryParser::Parse
   retrieve from the cache and build the actual query.
Sure, it's a hack, but as far as I can tell it would not be very
intrusive, nor would it change the public API.
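In code, the idea would look roughly like this -- untested, and
ModalSearcher, set_mode() and fill_cache() are invented names standing
in for the private switch and the doc_freqs() plumbing:

```perl
package ModalSearcher;

use constant { NORMAL => 0, RECORD => 1, REPLAY => 2 };

sub new {
    my $class = shift;
    return bless { mode => NORMAL, cache => {} }, $class;
}

# The "private function" that switches doc_freq() behavior.
sub set_mode {
    my ( $self, $mode ) = @_;
    $self->{mode} = $mode;
}

sub doc_freq {
    my ( $self, %args ) = @_;
    my $key = "$args{field}:$args{term}";
    if ( $self->{mode} == RECORD ) {
        # Parse #1: just note which stats will be needed.
        $self->{cache}{$key} = undef;
        return 1;    # dummy -- the returned query is discarded anyway
    }
    if ( $self->{mode} == REPLAY ) {
        # Parse #2: serve the stat bulk-fetched in between.
        return $self->{cache}{$key};
    }
    return $self->_local_doc_freq($key);    # normal operation
}

# Fill the cache with the result of the bulk doc_freqs() call.
sub fill_cache {
    my ( $self, $freqs ) = @_;
    $self->{cache}{$_} = $freqs->{$_} for keys %{ $self->{cache} };
}

# Stand-in for a plain single-node doc_freq lookup.
sub _local_doc_freq { 42 }

package main;

my $searcher = ModalSearcher->new;
$searcher->set_mode( ModalSearcher::RECORD() );
$searcher->doc_freq( field => 'content', term => 'foo' );    # parse #1
$searcher->fill_cache( { 'content:foo' => 17 } );            # bulk fetch
$searcher->set_mode( ModalSearcher::REPLAY() );
print $searcher->doc_freq( field => 'content', term => 'foo' ), "\n";    # 17
```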
> Besides, bulk-fetching of term stats is only an optimization to begin with,
> and it's a sub-optimal optimization in comparison to the approach of obtaining
> term stats locally.
That depends. IMHO the advantages of a fully distributed solution can
in many cases handily trump the theoretical (and in practice far from
achievable) 2x performance win of a local statistics database.
E.g. if I envision, some time in the future, *several* clients
querying the same massive sharded Lucy index, having to maintain a
local statistics index for each client smells like unwanted
complexity.
Sure, if you have a single client where performance is paramount, and
adding more shards is not practical, then local statistics would be
very nice.
I'd say, like Winnie-the-Pooh: Both! :-)
[...]
> > I guess this would be nice to have for applications which are
> > extremely performance sensitive.
>
> Doesn't that include your use case?
Not at all, really :) I've only been doing some tests on Lucy to see
whether it could be used in a possible future project. That would
cover a batch-oriented system without any hard performance
requirements. I simply wanted to see just how fast things could run
(faster is always better), tested SearchServer / ClusterSearcher, and
you know the rest :-)
> I was hoping that this approach would meet your immediate needs. :\
Rest assured that Lucy would without a doubt cover my needs, if the
project should materialize! :-)
> No problem! :)
>
> package EmptyStatSource;
> use base qw( Lucy::Search::Searcher );
>
> sub doc_freq {1}
Nice :-)
--
Best regards,
Dag Lem