Luke handler questions

2008-09-04 Thread Otis Gospodnetic
Hi,

I'm looking at an index with the Luke handler and see something that makes no 
sense to me:

<lst name="itemid">
  <str name="type">string</str>
  <str name="schema">I-SOl</str>
  <str name="index">I-SO-</str>
  <int name="docs">1138826</int>
  <int name="distinct">1138826</int>
  <lst name="topTerms">
    <int name="INBMA00134320080901">2</int>

Note how docs # == distinct #.  That looks good and makes sense - each document 
has a unique itemid.  But then look at topTerms.  What does number 2 
represent there?  I thought it was the term frequency.  If so, then the above 
says there are 2 documents with itemid=INBMA00134320080901 and that conflicts 
with docs # == distinct #.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



Re: Luke handler questions

2008-09-04 Thread Yonik Seeley
On Thu, Sep 4, 2008 at 1:26 PM, Otis Gospodnetic
[EMAIL PROTECTED] wrote:
> Note how docs # == distinct #.  That looks good and makes sense - each
> document has a unique itemid.  But then look at topTerms.  What does number
> 2 represent there?  I thought it was the term frequency.  If so, then the
> above says there are 2 documents with itemid=INBMA00134320080901 and that
> conflicts with docs # == distinct #.

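Remember that the Lucene term frequency reported here (the document
frequency) does not take into account deleted documents: a deleted
document is still counted until its segment is merged away.  So in this
case, INBMA00134320080901 was probably overwritten, and the count of 2
includes the old, deleted copy.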

-Yonik
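
To illustrate the point about deleted documents, here is a minimal toy
model in Python.  It is only a sketch and not Lucene's actual data
structures: the ToyIndex class, its method names, and the "OTHER-ID"
value are all hypothetical.  It assumes that an overwrite marks the old
copy as deleted and appends a new one, and that postings are only
cleaned up on a merge.

```python
class ToyIndex:
    """Toy append-only index (hypothetical; not Lucene's real structures)."""

    def __init__(self):
        self.docs = []        # itemid stored in each internal doc slot
        self.deleted = set()  # internal doc numbers marked as deleted
        self.postings = {}    # term -> list of internal doc numbers

    def add_or_overwrite(self, itemid):
        # An overwrite marks any existing copy deleted, then appends a new doc.
        for docnum in self.postings.get(itemid, []):
            self.deleted.add(docnum)
        docnum = len(self.docs)
        self.docs.append(itemid)
        self.postings.setdefault(itemid, []).append(docnum)

    def num_docs(self):
        # Live documents only, like the "docs" value in the report.
        return len(self.docs) - len(self.deleted)

    def distinct(self):
        # Number of distinct terms in the field.
        return len(self.postings)

    def doc_freq(self, term):
        # Counts every posting, including those of deleted documents.
        return len(self.postings.get(term, []))

    def merge(self):
        # Rewriting the index drops deleted documents and their postings.
        live = [d for n, d in enumerate(self.docs) if n not in self.deleted]
        self.docs, self.deleted, self.postings = [], set(), {}
        for itemid in live:
            self.add_or_overwrite(itemid)


index = ToyIndex()
index.add_or_overwrite("INBMA00134320080901")
index.add_or_overwrite("INBMA00134320080901")  # overwrite of the same itemid
index.add_or_overwrite("OTHER-ID")             # hypothetical second item

before = (index.num_docs(), index.distinct(),
          index.doc_freq("INBMA00134320080901"))
index.merge()
after = (index.num_docs(), index.distinct(),
         index.doc_freq("INBMA00134320080901"))
print(before, after)
```

In this model docs == distinct both before and after the merge, while
the count for the overwritten itemid is 2 before the merge and 1 after
it, which matches the pattern in the report above.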