Luke / get doc count for each term

2009-06-16 Thread Ryan McKinley
Hi-

I'm trying to use the LukeRequestHandler with an index of ~9 million
docs.  I know that counting the top / distinct terms for each field is
expensive and can take a LONG time to return.

Is there a faster way to check the number of documents for each field?
 Currently this gets the doc count for each term:

  if( sfield != null && sfield.indexed() ) {
    Query q = qp.parse( fieldName+":[* TO *]" );
    int docCount = searcher.numDocs( q, matchAllDocs );
    ...

Looking at it again, that could be replaced with:

  if( sfield != null && sfield.indexed() ) {
    Query q = qp.parse( fieldName+":[* TO *]" );
    int docCount = searcher.getDocSet( q ).size();
    ...

Is there any faster option than running a query for each field?

thanks
ryan


Re: Luke / get doc count for each term

2009-06-16 Thread Yonik Seeley
The doc count for each term is stored directly in the index, with the big
caveat that it doesn't take deleted docs into account.  That addresses
the "get doc count for each term" part.
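
To illustrate the caveat, here is a toy model in plain Java. It is not
Lucene code: ToyIndex and its methods are hypothetical names, and the
data structures are simplified assumptions. In Lucene itself the stored
per-term count is what IndexReader.docFreq(Term) returns. The sketch
shows why a stored count can be read cheaply yet still include documents
that have since been deleted.

```java
import java.util.*;

// Toy model only: a term dictionary that stores a per-term document
// frequency at write time, plus a separate set of deleted doc ids.
// Class and method names are hypothetical, not Lucene's API.
class ToyIndex {
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private final Map<String, Integer> storedDocFreq = new HashMap<>();
    private final Set<Integer> deleted = new HashSet<>();

    // Record that docId contains term (call once per doc/term pair).
    void add(int docId, String term) {
        postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
        storedDocFreq.merge(term, 1, Integer::sum);
    }

    // Deleting only marks the document; the stored frequency is untouched.
    void delete(int docId) {
        deleted.add(docId);
    }

    // Cheap lookup of the stored count; may include deleted documents.
    int docFreq(String term) {
        return storedDocFreq.getOrDefault(term, 0);
    }

    // Exact count: walks the postings and skips deleted documents.
    int liveDocCount(String term) {
        int n = 0;
        for (int docId : postings.getOrDefault(term, Collections.<Integer>emptyList())) {
            if (!deleted.contains(docId)) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        ToyIndex idx = new ToyIndex();
        idx.add(0, "solr");
        idx.add(1, "solr");
        idx.add(2, "solr");
        idx.delete(1);
        System.out.println(idx.docFreq("solr"));      // prints 3 (stored count)
        System.out.println(idx.liveDocCount("solr")); // prints 2 (deleted doc skipped)
    }
}
```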

"get doc count for each field" is a different question... see below.

On Tue, Jun 16, 2009 at 1:57 PM, Ryan McKinley <ryan...@gmail.com> wrote:
> Hi-
>
> I'm trying to use the LukeRequestHandler with an index of ~9 million
> docs.  I know that counting the top / distinct terms for each field is
> expensive and can take a LONG time to return.
>
> Is there a faster way to check the number of documents for each field?
>  Currently this gets the doc count for each term:
>
>      if( sfield != null && sfield.indexed() ) {
>        Query q = qp.parse( fieldName+":[* TO *]" );
>        int docCount = searcher.numDocs( q, matchAllDocs );

That looks like it gets the doc count for each field, as opposed to each term.

> Looking at it again, that could be replaced with:
>
>      if( sfield != null && sfield.indexed() ) {
>        Query q = qp.parse( fieldName+":[* TO *]" );
>        int docCount = searcher.getDocSet( q ).size();

Correct.  Unfortunately it probably won't save you much (one set intersection).
I don't (currently) know of a way to get this info quicker.

In a specific application, the fastest way would be to index a boolean
or another single token for each document that had the field you were
interested in, then count the number of docs for that single token
rather than for all tokens in the field.
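
A minimal sketch of that idea, using a toy in-memory inverted index in
plain Java rather than Lucene or Solr. MarkerTokenSketch, the "has_"
prefix, and the "true" token are all assumptions made for illustration.
The point is only that the range-style count has to union the postings
of every term in the field, while the marker count reads one postings
list.

```java
import java.util.*;

// Toy inverted index: field name -> term -> doc ids containing the term.
// All names here are hypothetical; this is not Lucene or Solr code.
class MarkerTokenSketch {
    private final Map<String, Map<String, BitSet>> index = new HashMap<>();

    private void addTerm(int docId, String field, String term) {
        index.computeIfAbsent(field, f -> new TreeMap<>())
             .computeIfAbsent(term, t -> new BitSet())
             .set(docId);
    }

    // Index a field's terms and add one marker token for the populated field.
    void addField(int docId, String field, String... terms) {
        if (terms.length == 0) return; // empty field: no marker token
        for (String term : terms) addTerm(docId, field, term);
        addTerm(docId, "has_" + field, "true");
    }

    // Analogue of field:[* TO *]: union the postings of every term in the field.
    int countByAllTerms(String field) {
        BitSet docs = new BitSet();
        Map<String, BitSet> terms =
            index.getOrDefault(field, Collections.<String, BitSet>emptyMap());
        for (BitSet postings : terms.values()) {
            docs.or(postings);
        }
        return docs.cardinality();
    }

    // Marker approach: a single postings list answers the question.
    int countByMarker(String field) {
        BitSet postings = index
            .getOrDefault("has_" + field, Collections.<String, BitSet>emptyMap())
            .get("true");
        return postings == null ? 0 : postings.cardinality();
    }

    public static void main(String[] args) {
        MarkerTokenSketch idx = new MarkerTokenSketch();
        idx.addField(0, "title", "luke", "handler");
        idx.addField(0, "body", "index");
        idx.addField(1, "title", "solr");
        idx.addField(2, "body", "facet", "index");
        System.out.println(idx.countByAllTerms("title")); // prints 2
        System.out.println(idx.countByMarker("title"));   // prints 2
    }
}
```

Both methods return the same number; the difference is how many
postings lists have to be visited to get it.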

-Yonik
http://www.lucidimagination.com

> Is there any faster option than running a query for each field?
>
> thanks
> ryan



Re: Luke / get doc count for each term

2009-06-16 Thread Grant Ingersoll


On Jun 16, 2009, at 1:57 PM, Ryan McKinley wrote:

> Is there a faster way to check the number of documents for each field?
> Currently this gets the doc count for each term:

In the past, I've created a field that contains the names of the  
Fields present on the document.  Then, simply facet on the new Field.   
I think that gets you what you want and the mechanism is all built in  
to Solr and is quite speedy.
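
A rough sketch of this approach in plain Java, without Solr:
FieldNamesFacetSketch and its methods are hypothetical helpers that
model what the extra field would hold and what faceting on it would
count. Against a real index, the counts would presumably come from a
facet request along the lines of
q=*:*&rows=0&facet=true&facet.field=field_names, where field_names is
an assumed name for the new field.

```java
import java.util.*;

// Models the "field of field names" idea with in-memory documents.
// Hypothetical helper, not part of Solr.
class FieldNamesFacetSketch {

    // Value of the extra multi-valued field: names of the populated fields.
    static List<String> fieldNames(Map<String, Object> doc) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            if (e.getValue() != null) names.add(e.getKey());
        }
        Collections.sort(names);
        return names;
    }

    // What faceting on the extra field would report: docs per field name.
    static Map<String, Integer> facetCounts(List<Map<String, Object>> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, Object> doc : docs) {
            for (String name : fieldNames(doc)) {
                counts.merge(name, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Object> d1 = new HashMap<>();
        d1.put("title", "Luke");
        d1.put("body", "handler notes");
        Map<String, Object> d2 = new HashMap<>();
        d2.put("title", "Solr");
        Map<String, Object> d3 = new HashMap<>();
        d3.put("title", "Facets");
        d3.put("author", "someone");
        System.out.println(facetCounts(Arrays.asList(d1, d2, d3)));
        // prints {author=1, body=1, title=3}
    }
}
```

The field names are computed once at index time, so one facet pass
yields the per-field document counts instead of one query per field.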


Re: Luke / get doc count for each term

2009-06-16 Thread Ryan McKinley


On Jun 16, 2009, at 5:21 PM, Grant Ingersoll wrote:

> On Jun 16, 2009, at 1:57 PM, Ryan McKinley wrote:
>
>> Is there a faster way to check the number of documents for each field?
>> Currently this gets the doc count for each term:
>
> In the past, I've created a field that contains the names of the
> Fields present on the document.  Then, simply facet on the new
> Field.  I think that gets you what you want and the mechanism is all
> built in to Solr and is quite speedy.



makes sense -- i like this idea.

ryan