Re: Highest frequency terms for a subset of documents

Ofer Fort Wed, 20 Apr 2011 16:33:16 -0700

It seems that facet search is not well suited to a full-text field. (
http://search.lucidimagination.com/search/document/178f1a82ff19070c/solr_severe_error_when_doing_a_faceted_search#16562790cda76197
)


Maybe I should go in another direction. I think the HighFreqTerms
approach is worth trying; I'm just not sure how to start.
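
As a rough starting point, the idea behind a HighFreqTerms-style pass restricted to a subset can be sketched outside Solr: walk the documents in the subset, count each term once per document (which is what a facet count reports), and keep the top entries. This is only an illustration in plain Python, not the Lucene class and not a Solr request handler; the function name top_terms_for_subset, the whitespace tokenizer and the sample documents are all made up for the example.

```python
from collections import Counter
import heapq


def top_terms_for_subset(docs, in_subset, limit=500, mincount=3):
    """Return up to `limit` (term, doc_count) pairs for docs selected by `in_subset`."""
    counts = Counter()
    for doc in docs:
        if not in_subset(doc):
            continue
        # Naive whitespace tokenizer (an assumption, not Solr's analyzer).
        # The set ensures each term is counted once per document.
        counts.update(set(doc["text"].lower().split()))
    eligible = ((term, n) for term, n in counts.items() if n >= mincount)
    # Highest count first; ties are broken alphabetically.
    return heapq.nsmallest(limit, eligible, key=lambda tc: (-tc[1], tc[0]))


# Hypothetical sample data: three documents in the subset, one outside it.
docs = [
    {"in_subset": 1, "text": "solr facet search"},
    {"in_subset": 1, "text": "solr facet counts"},
    {"in_subset": 1, "text": "solr index"},
    {"in_subset": 0, "text": "lucene index index"},
]
top = top_terms_for_subset(docs, lambda d: d["in_subset"] == 1, limit=2, mincount=2)
print(top)
```

Inside Solr the same loop would have to read terms from the index for each matching document rather than re-tokenizing stored text, so the cost depends on how those per-document terms are obtained.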

On Thu, Apr 21, 2011 at 2:23 AM, Ofer Fort <o...@tra.cx> wrote:

> Thanks, but that's what I started with; it took even longer and
> threw this:
> Approaching too many values for UnInvertedField faceting on field 'text' :
> bucket size=15560140
> Approaching too many values for UnInvertedField faceting on field 'text' :
> bucket size=15619075
> Exception during facet counts:org.apache.solr.common.SolrException: Too
> many values for UnInvertedField faceting on field text
>
>
>
> On Thu, Apr 21, 2011 at 2:11 AM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
>
>> I think faceting is probably the best way to do that, indeed. It might be
>> slow, but it's kind of set up for exactly that case; I can't imagine any
>> other technique being faster -- there's stuff that has to be done to look up
>> the info you want.
>>
>> BUT, I see your problem:  don't use facet.method=enum. Use
>> facet.method=fc.  Works a LOT better for very high arity fields (lots and
>> lots of unique values) like you have. I bet you'll see significant speed-up
>> if you use facet.method=fc instead, hopefully fast enough to be workable.
>>
>> With facet.method=enum, I would indeed have predicted it would be horribly
>> slow. Before Solr 1.4, when facet.method=fc became available, it was nearly
>> impossible to facet on very high arity fields; facet.method=fc is the magic.
>> I think facet.method=fc is even the default in Solr 1.4+, if you hadn't
>> explicitly set it to enum instead!
>>
>> Jonathan
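
Spelled out as a request, the suggestion above amounts to the original parameters with facet.method changed from enum to fc. The sketch below only builds the query string; the base URL (host, port and path) is an assumption and has to be adapted to the actual installation, and nothing here is sent to a server.

```python
from urllib.parse import urlencode

# Assumed location of the select handler; not taken from the thread.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

# Same parameters as in the original request, except for facet.method.
params = {
    "q": "in_subset:1",
    "rows": 0,
    "facet": "true",
    "facet.field": "text",
    "facet.method": "fc",  # was "enum" in the original request
    "facet.sort": "count",
    "facet.mincount": 3,
    "facet.limit": 500,
    "facet.offset": 0,
    "wt": "xml",
    "indent": "on",
}

request_url = SOLR_SELECT_URL + "?" + urlencode(params)
print(request_url)
```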
>> ________________________________________
>> From: Ofer Fort [ofer...@gmail.com]
>> Sent: Wednesday, April 20, 2011 6:49 PM
>> To: solr-user@lucene.apache.org
>> Subject: Highest frequency terms for a subset of documents
>> Hi,
>> I am looking for the best way to find the terms with the highest frequency
>> for a given subset of documents (terms in the text field).
>> My first thought was to do a count facet search, where the query defines
>> the subset of documents and the facet.field is the text field. This gives me
>> the result, but it is very, very slow.
>> These are my params:
>> <lst name="params">
>>   <str name="facet">true</str>
>>   <str name="facet.offset">0</str>
>>   <str name="facet.mincount">3</str>
>>   <str name="indent">on</str>
>>   <str name="facet.limit">500</str>
>>   <str name="facet.method">enum</str>
>>   <str name="wt">xml</str>
>>   <str name="rows">0</str>
>>   <str name="version">2.2</str>
>>   <str name="facet.sort">count</str>
>>   <str name="q">in_subset:1</str>
>>   <str name="facet.field">text</str>
>> </lst>
>>
>> The index contains 7M documents, the subset is about 200K. A simple query
>> for the subset takes around 100ms, but the facet search takes 40s.
>>
>> Am I doing something wrong?
>>
>> If facet search is not the correct approach, I thought about using something
>> like org.apache.lucene.misc.HighFreqTerms, but I'm not sure how to do this
>> in Solr. Should I implement a request handler that executes this kind of
>> code?
>>
>> Thanks for any help.
>>
>
>
