On 9/19/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:

Quick Question: did you say you are faceting on the first name field
separately from the last name field? ... why?

You'll probably see a sharp increase in performance if you have a single
untokenized author field containing the full name and you facet on that --
there will be far fewer unique terms to use when computing DocSets and
intersections.

Second: you mentioned increasing the size of your filterCache
significantly, but we don't really know how heterogeneous your index is ...
once you made that change did your filterCache hit rate increase? .. do you
have any evictions (you can check on the "Statistics" page)?

: > Also, I was under the impression
: > that it was only searching / sorting for authors that it knows are in
: > the result set...
:
: That's the problem... it's not necessarily easy to know *what* authors
: are in the result set.  If we could quickly determine that, we could
: just count them and not do any intersections or anything at all.

another way to look at it is that by looking at all the authors, the work
done for generating the facet counts for query A can be completely reused
for the next query B -- presuming your filterCache is large enough to hold
all of the author filters.
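A minimal sketch of that reuse, with plain Java HashSets standing in for Solr's DocSets and a HashMap standing in for the filterCache (the names here are illustrative, not Solr's actual API):

```java
import java.util.*;

public class FacetReuse {
    // Stand-in for Solr's filterCache: author -> set of doc ids with that author.
    static Map<String, Set<Integer>> filterCache = new HashMap<>();

    // Facet counts for one query: intersect each cached author set with the
    // query's doc set.  The cached sets are built once and reused across queries.
    static Map<String, Integer> facetCounts(Set<Integer> queryDocs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Set<Integer>> e : filterCache.entrySet()) {
            Set<Integer> intersection = new HashSet<>(e.getValue());
            intersection.retainAll(queryDocs);            // DocSet intersection
            counts.put(e.getKey(), intersection.size());
        }
        return counts;
    }

    public static void main(String[] args) {
        filterCache.put("smith", new HashSet<>(List.of(1, 2, 3)));
        filterCache.put("jones", new HashSet<>(List.of(2, 4)));
        // Query A and query B both reuse the same cached author filters;
        // each new query only pays for the intersections.
        System.out.println(facetCounts(new HashSet<>(List.of(1, 2))));
        System.out.println(facetCounts(new HashSet<>(List.of(2, 3, 4))));
    }
}
```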

: There could be optimizations when docs_matching_query.size() is small,
: so we start somehow with terms in the documents rather than terms in
: the index.  That requires termvectors to be stored (medium speed), or
: requires that the field be stored and that we re-analyze it (very
: slow).
:
: More optimization of special cases hasn't been done simply because no
: one has done it yet... (as you note, faceting is a new feature).

the optimization I anticipated from the beginning would probably be
useful in the situation Michael is describing ... if there is a
"long tail" of authors (and in my experience, there typically is), we
can cache an ordered list of the top N most prolific authors, along with
the count of how many documents they have in the index (this info is easy
to get from TermEnum.docFreq).

Yeah, I've thought about a fieldInfoCache too.  It could also cache
the total number of terms in order to make decisions about what
faceting strategy to follow.

when we facet on the authors, we start with
that list and go in order, generating their facet constraint count using
the DocSet intersection just like we currently do ... if we reach our
facet.limit before we reach the end of the list and the lowest constraint
count is higher than the total doc count of the last author in the list,
then we know we don't need to bother testing any other author, because no
other author can possibly have a higher facet constraint count than the
ones on our list.
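That early-termination idea could be sketched roughly like this (Author, IntersectionCounter, and the counts are all hypothetical stand-ins; the real thing would work against TermEnum.docFreq and DocSet intersections). Since the cached list is ordered by docFreq descending, and docFreq is an upper bound on any author's constraint count, once we hold facet.limit counts that are all at least the next docFreq, nothing later in (or beyond) the list can displace them:

```java
import java.util.*;

public class TopAuthorsFacet {
    // Cached (author, docFreq) pairs, sorted by docFreq descending -- the kind
    // of list that could be built once from TermEnum.docFreq.
    record Author(String name, int docFreq) {}

    // Hypothetical hook for the real DocSet intersection count.
    interface IntersectionCounter { int count(String author); }

    // Walk the docFreq-ordered list; stop as soon as we already hold `limit`
    // counts and the next author's docFreq (an upper bound on its possible
    // constraint count) cannot beat the lowest count we have.
    static Map<String, Integer> facet(List<Author> topAuthors, int limit,
                                      IntersectionCounter counter) {
        PriorityQueue<Map.Entry<String, Integer>> best =      // min-heap by count
            new PriorityQueue<>((a, b) -> Integer.compare(a.getValue(), b.getValue()));
        for (Author a : topAuthors) {
            if (best.size() >= limit && best.peek().getValue() >= a.docFreq())
                break;                                        // early termination
            best.add(Map.entry(a.name(), counter.count(a.name())));
            if (best.size() > limit) best.poll();             // drop the lowest
        }
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : best) result.put(e.getKey(), e.getValue());
        return result;
    }

    public static void main(String[] args) {
        List<Author> cached = List.of(new Author("smith", 10),
                                      new Author("jones", 5),
                                      new Author("zed", 1));
        // Hypothetical per-author intersection counts for some query.
        Map<String, Integer> fakeCounts = Map.of("smith", 4, "jones", 3, "zed", 1);
        // "zed" is never tested: 2 counts are held and the lowest (3) >= zed's docFreq (1).
        System.out.println(facet(cached, 2, fakeCounts::get));
    }
}
```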

This works OK if the intersection counts are high (as a percentage of
the facet sets).  I'm not sure how often this will be the case though.

Another tradeoff is to allow getting inexact counts with multi-token fields by:
- simply faceting on the most popular values
  OR
- doing some sort of statistical sampling by reading term vectors for a
fraction of the matching docs.
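The sampling option might look something like this sketch, with a map of doc id to author list standing in for stored term vectors (all names here are hypothetical):

```java
import java.util.*;

public class SampledFacets {
    // Estimate facet counts by reading author lists (standing in for stored
    // term vectors) for only a random fraction of the matching docs, then
    // scaling the sample counts back up to the full result-set size.
    static Map<String, Double> estimate(Map<Integer, List<String>> docAuthors,
                                        List<Integer> matchingDocs,
                                        double fraction, long seed) {
        Random rnd = new Random(seed);
        Map<String, Integer> sampleCounts = new HashMap<>();
        int sampled = 0;
        for (int doc : matchingDocs) {
            if (rnd.nextDouble() >= fraction) continue;   // doc not sampled
            sampled++;
            for (String author : docAuthors.get(doc))
                sampleCounts.merge(author, 1, Integer::sum);
        }
        double scale = sampled == 0 ? 0.0 : (double) matchingDocs.size() / sampled;
        Map<String, Double> estimates = new HashMap<>();
        sampleCounts.forEach((author, c) -> estimates.put(author, c * scale));
        return estimates;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> docAuthors = Map.of(
            1, List.of("smith"),
            2, List.of("smith", "jones"),
            3, List.of("jones"));
        // fraction = 1.0 samples every matching doc, so the estimate is exact;
        // smaller fractions trade accuracy for fewer term-vector reads.
        System.out.println(estimate(docAuthors, List.of(1, 2, 3), 1.0, 42L));
    }
}
```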

-Yonik
