Yonik Seeley wrote:
On 4/28/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
I have a few things I'd like to check with the Luke handler; if you
could check some of the assumptions, that would be great.

* I want to print out the document frequency for a term in a given
document.  Since that term shows up in the given document, I would think
its document frequency must be >= 1.  I am using reader.docFreq( t ) [line
236].  The results seem reasonable, but *sometimes* it returns zero... is
that possible?

Is the field indexed?
Did you run the field through the analyzer to get the terms (to match
what's in the index)?
If both of those are true, it seems like the docFreq should always be
greater than 0.
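The analyzer point is the key one: the index stores *analyzed* terms, so looking up a raw field value can come back with df=0 even though the document contains it. A toy Python model of that mismatch (illustrative names only, not the Lucene/Solr API):

```python
# Toy model of why docFreq can return 0: the index stores analyzed
# (here: lowercased) tokens, so a raw, un-analyzed lookup misses.

def analyze(text):
    """Stand-in for a simple lowercasing tokenizer."""
    return text.lower().split()

# a tiny "index": term -> set of doc ids containing it
index = {}
docs = {1: "Apache Solr", 2: "Apache Lucene"}
for doc_id, body in docs.items():
    for term in analyze(body):
        index.setdefault(term, set()).add(doc_id)

def doc_freq(term):
    return len(index.get(term, set()))

print(doc_freq("Solr"))              # 0 -- raw value isn't in the index
print(doc_freq(analyze("Solr")[0]))  # 1 -- analyzed term matches
print(doc_freq("apache"))            # 2
```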


Aah, that makes sense - now that you mention it, I only see df=0 for non-indexed, stored fields.



In an inverted index, terms point to documents.   So you have to
traverse *all* of the terms of a field across all documents, and keep
track of when you run across the document you are interested in.  When
you do, then get the positions that the term appeared at, and keep
track of them.  After you have covered all the terms, you can put
everything in order.  There could be gaps (positionIncrement, stop
word removal, etc) and it's also possible for multiple tokens to
appear at the same position.
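The traversal Yonik describes can be sketched on a toy inverted index: walk every term of the field, note where it hits the target document, then sort by position. Plain Python model, not the Lucene API; the postings data here is made up:

```python
# postings: term -> {doc_id: [positions]}
postings = {
    "quick":  {7: [1]},
    "brown":  {7: [2], 9: [0]},
    "fox":    {7: [3]},
    "speedy": {7: [1]},   # synonym token at the same position as "quick"
    "the":    {9: [4]},   # kept in doc 9; stop-word-removed in doc 7
}

def reconstruct(doc_id):
    hits = []
    for term, docs in postings.items():   # must traverse *all* terms...
        for pos in docs.get(doc_id, []):  # ...keeping only the target doc
            hits.append((pos, term))
    hits.sort()                           # put everything back in order
    return hits

print(reconstruct(7))
# [(1, 'quick'), (1, 'speedy'), (2, 'brown'), (3, 'fox')]
# note doc 7 starts at position 1 (a gap) and has two tokens at position 1
```

This is why the cost scales with the total number of terms in the field, not the size of one document.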

For a full-text field with many terms, and a large index, this could
take a *long* time.
It's probably very useful for debugging though.


That must be why Luke starts a new thread for 'reconstruct and edit'.  For now, I will leave this out of the handler, and leave it open to someone with the need/time in the future.


* Each field gets a boolean attribute "cacheableFaceting" -- this is true
if the number of distinct terms is smaller than the filterCacheSize.  I
get the filterCacheSize from solrconfig.xml ("query/filterCache/@size")
and get the distinct term count by counting up the termEnum.  Is this
logic solid?  I know the cacheability changes if you are faceting
multiple fields at once, but it's still nice to have a ballpark estimate
without needing to know the internals.
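The proposed check boils down to a one-line comparison once you have the two numbers. A minimal sketch (function and parameter names are mine, not from the patch):

```python
# Ballpark model of the proposed "cacheableFaceting" attribute:
# faceting a field builds (roughly) one cached filter per distinct term,
# so the field is "cacheable" if its term count fits in the filterCache.

def cacheable_faceting(distinct_terms, filter_cache_size):
    return distinct_terms < filter_cache_size

# e.g. a 300-value category field against a 512-entry filterCache
print(cacheable_faceting(300, 512))      # True
print(cacheable_faceting(100_000, 512))  # False -- a full-text field won't fit
```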

It could get trickier... I'm about to hack up a quick patch now that
will reduce memory usage by only using the filterCache above a
certain df threshold.  It may increase or
decrease the faceting speed - TBD.

Also, other alternate faceting schemes are in the works (a month or two out).
I'd leave this attribute out and just report on the number of unique terms.

ok, that seems reasonable.


Some kind of histogram might be really nice though (how many terms
under varying df values):
 1=>412  (412 terms have a df of 1)
 2=>516  (516 terms have a df of 2)
 4=>600
 8=>650
16=>670
32=>680
64=>683
128=>685
256=>686
11325=>690  (the maxDf found)
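The doubling-bucket histogram above is cheap to compute from a single pass over the df values. A toy sketch (the sample data is constructed so the first two buckets match the example; everything else is made up):

```python
# Cumulative histogram over power-of-two df bounds, ending with the
# maxDf found -- the shape Yonik sketches above.

def df_histogram(dfs):
    max_df = max(dfs)
    buckets = []
    bound = 1
    while bound < max_df:
        buckets.append((bound, sum(1 for df in dfs if df <= bound)))
        bound *= 2
    buckets.append((max_df, len(dfs)))  # final bucket: the maxDf found
    return buckets

dfs = [1] * 412 + [2] * 104 + [3] * 50 + [7, 20, 11325]
for bound, count in df_histogram(dfs):
    print(f"{bound}=>{count}")
# first lines: 1=>412, 2=>516, ... last line: 11325=>569
```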


I'll take a look at that


Remember that df is not updated when a document is marked for deletion
in Lucene.
So you can have a df of 2, do a search, and only come up with one document.
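A tiny model of that caveat, mirroring (but not using) Lucene's behavior before deleted docs are merged away:

```python
# Deletions are a mark, not a postings rewrite: df still counts the
# deleted doc, so docFreq can exceed the number of live search hits.

term_docs = {"id42": [1, 2]}  # df counts both docs
deleted = {2}                 # doc 2 is marked deleted

def raw_df(term):
    return len(term_docs[term])  # deletions not reflected here

def live_docs(term):
    return [d for d in term_docs[term] if d not in deleted]

print(raw_df("id42"))          # 2
print(len(live_docs("id42")))  # 1 -- df > number of hits
```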


That would explain why I'm seeing df > 1 for the uniqueKey!
