Luke / get doc count for each term
Hi,

I'm trying to use the LukeRequestHandler with an index of ~9 million docs. I know that counting the top / distinct terms for each field is expensive and can take a LONG time to return.

Is there a faster way to check the number of documents for each field? Currently this gets the doc count for each term:

    if( sfield != null && sfield.indexed() ) {
      Query q = qp.parse( fieldName+":[* TO *]" );
      int docCount = searcher.numDocs( q, matchAllDocs );
      ...

Looking at it again, that could be replaced with:

    if( sfield != null && sfield.indexed() ) {
      Query q = qp.parse( fieldName+":[* TO *]" );
      int docCount = searcher.getDocSet( q ).size();
      ...

Is there any faster option than running a query for each field?

thanks
ryan
Re: Luke / get doc count for each term
The doc count for each term is stored directly in the index, with the big caveat that it doesn't take deleted docs into account. That addresses the "get doc count for each term" part. Getting the doc count for each field is a different question... see below.

On Tue, Jun 16, 2009 at 1:57 PM, Ryan McKinley <ryan...@gmail.com> wrote:
> Hi- I'm trying to use the LukeRequestHandler with an index of ~9
> million docs. I know that counting the top / distinct terms for each
> field is expensive and can take a LONG time to return.
>
> Is there a faster way to check the number of documents for each field?
> Currently this gets the doc count for each term:
>
>     if( sfield != null && sfield.indexed() ) {
>       Query q = qp.parse( fieldName+":[* TO *]" );
>       int docCount = searcher.numDocs( q, matchAllDocs );

That looks like it gets the doc count for each field, as opposed to each term.

> Looking at it again, that could be replaced with:
>
>     if( sfield != null && sfield.indexed() ) {
>       Query q = qp.parse( fieldName+":[* TO *]" );
>       int docCount = searcher.getDocSet( q ).size();

Correct. Unfortunately it probably won't save you much (one set intersection). I don't (currently) know of a way to get this info quicker.

In a specific application, the fastest way would be to index a boolean or another single token for each document that has the field you are interested in, then count the number of docs for that single token rather than for all tokens in the field.

-Yonik
http://www.lucidimagination.com

> Is there any faster option than running a query for each field?
>
> thanks
> ryan
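To make the distinction in this reply concrete, here is a minimal sketch of a toy in-memory inverted index. It is not the Lucene or Solr API: the class `ToyIndex` and its methods are hypothetical names invented for illustration. It only tries to show why a per-term count can be read off cheaply while ignoring deletions, whereas a per-field count has to combine the postings of every term in the field.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy in-memory inverted index for illustration only; not the Lucene or Solr API.
class ToyIndex {
    // field name -> term -> ids of the documents that contain the term
    private final Map<String, Map<String, Set<Integer>>> postings = new HashMap<>();
    // ids marked as deleted; their postings are left in place
    private final Set<Integer> deleted = new HashSet<>();

    void add(int docId, String field, String... terms) {
        Map<String, Set<Integer>> byTerm = postings.computeIfAbsent(field, f -> new HashMap<>());
        for (String term : terms) {
            byTerm.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    void delete(int docId) {
        deleted.add(docId);
    }

    // Per-term count read straight from the postings: cheap, but still counts deleted docs.
    int docFreq(String field, String term) {
        return postings.getOrDefault(field, Map.of()).getOrDefault(term, Set.of()).size();
    }

    // Per-field count: union the postings of every term in the field, then drop deleted docs.
    // The work grows with the number of distinct terms in the field.
    int liveDocsWithField(String field) {
        Set<Integer> docs = new HashSet<>();
        for (Set<Integer> ids : postings.getOrDefault(field, Map.of()).values()) {
            docs.addAll(ids);
        }
        docs.removeAll(deleted);
        return docs.size();
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add(1, "title", "solr", "luke");
        index.add(2, "title", "solr");
        index.add(3, "body", "solr");
        index.delete(2);
        System.out.println(index.docFreq("title", "solr"));   // prints 2: deleted doc 2 is still counted
        System.out.println(index.liveDocsWithField("title")); // prints 1: only doc 1 is live
    }
}
```

In this model, the single-token suggestion amounts to giving each field exactly one term, so the per-field union collapses to a single postings list.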
Re: Luke / get doc count for each term
On Jun 16, 2009, at 1:57 PM, Ryan McKinley wrote:
> Is there a faster way to check the number of documents for each field?
> Currently this gets the doc count for each term:

In the past, I've created a field that contains the names of the Fields present on the document. Then, simply facet on the new Field. I think that gets you what you want, and the mechanism is all built in to Solr and is quite speedy.
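As a rough illustration of the approach described here, the plain-Java sketch below derives the list of field names present on each document and then tallies them the way a facet on that field would. The field name `field_names`, the class `FieldPresence`, and both helpers are assumptions made up for this example; none of them come from Solr, and documents are modelled as simple string maps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch; FIELD_NAMES and both helpers are hypothetical, not part of Solr.
class FieldPresence {
    static final String FIELD_NAMES = "field_names";

    // Index-time step: collect the names of the fields that actually carry a value.
    static List<String> presentFields(Map<String, String> doc) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, String> e : doc.entrySet()) {
            String value = e.getValue();
            if (value != null && !value.isEmpty() && !FIELD_NAMES.equals(e.getKey())) {
                names.add(e.getKey());
            }
        }
        return names;
    }

    // Query-time step: the tally a facet on the field-names field would report,
    // one document count per field name.
    static Map<String, Integer> facetCounts(List<Map<String, String>> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> doc : docs) {
            for (String name : presentFields(doc)) {
                counts.merge(name, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = List.of(
            Map.of("id", "1", "title", "Luke", "body", "text"),
            Map.of("id", "2", "title", "Solr"),
            Map.of("id", "3", "body", ""));
        System.out.println(facetCounts(docs)); // prints {body=1, id=3, title=2}
    }
}
```

In Solr itself the tally would not be computed by hand: once the extra field is indexed, a request along the lines of `q=*:*&rows=0&facet=true&facet.field=field_names` should return the per-field counts, with `field_names` standing in for whatever the field is actually called.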
Re: Luke / get doc count for each term
On Jun 16, 2009, at 5:21 PM, Grant Ingersoll wrote:
> On Jun 16, 2009, at 1:57 PM, Ryan McKinley wrote:
>> Is there a faster way to check the number of documents for each field?
>> Currently this gets the doc count for each term:
>
> In the past, I've created a field that contains the names of the
> Fields present on the document. Then, simply facet on the new Field.
> I think that gets you what you want and the mechanism is all built in
> to Solr and is quite speedy.

makes sense -- i like this idea.

ryan