Re: What are the options for obtaining IDF at interactive speeds?

Roman Chyla Wed, 03 Jul 2013 20:36:08 -0700

Hi Kathryn,
I wonder if you could index all your terms as separate documents and then
construct a new query (a second pass):


q=term:term1 OR term:term2 OR term:term3

and use a function query to score them:

idf(other_field,field(term))

The 'term' field cannot be multi-valued, obviously.
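
A minimal sketch of building that second-pass query string on the client, in
plain Java with no Solr dependency. The class and method names
(SecondPassQuery, buildQuery, buildIdfFunction) are made up for illustration,
and the quoting and escaping of each term is an extra precaution that the
one-liner above does not show. Whether idf() accepts field(term) as its second
argument may depend on the Solr version, so treat the function string as
something to verify rather than a known-good call.

```java
import java.util.List;
import java.util.StringJoiner;

final class SecondPassQuery {
    // Escape backslashes and double quotes so a term can sit inside "...".
    static String escape(String term) {
        return term.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds: term:"t1" OR term:"t2" OR term:"t3"
    static String buildQuery(String termField, List<String> terms) {
        StringJoiner joiner = new StringJoiner(" OR ");
        for (String term : terms) {
            joiner.add(termField + ":\"" + escape(term) + "\"");
        }
        return joiner.toString();
    }

    // Builds the scoring function suggested above: idf(other_field,field(term))
    static String buildIdfFunction(String textField, String termField) {
        return "idf(" + textField + ",field(" + termField + "))";
    }
}
```

The two strings would then go into the q parameter and the function-query
part of a single request, so all terms are scored in one round trip.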

Other than that, if you could do it on the server side, that would be the
fastest - the code is ready inside IDFValueSource:
http://lucene.apache.org/core/4_3_0/queries/org/apache/lucene/queries/function/valuesource/IDFValueSource.html
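
For reference, a sketch of the arithmetic involved, assuming the default
TF-IDF similarity of Lucene 4.x, where idf = 1 + ln(numDocs / (docFreq + 1)).
If your schema configures a different similarity, the numbers will differ.
With this formula the client can turn document frequencies (for example the
ones returned by tv.df=true) into IDF values once per term. The class and
method names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class ClassicIdf {
    // Assumed formula: 1 + ln(numDocs / (docFreq + 1)), as in Lucene 4.x
    // DefaultSimilarity. Not valid for BM25 or custom similarities.
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Computes IDF once per distinct term from a term -> docFreq map.
    static Map<String, Float> idfForTerms(Map<String, Long> docFreqs, long numDocs) {
        Map<String, Float> result = new HashMap<>();
        for (Map.Entry<String, Long> entry : docFreqs.entrySet()) {
            result.put(entry.getKey(), idf(entry.getValue(), numDocs));
        }
        return result;
    }
}
```

Keying the map by term means each IDF is computed once, no matter how many
retrieved documents the term appears in.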

roman


On Tue, Jul 2, 2013 at 5:06 PM, Kathryn Mazaitis
<kathryn.riv...@gmail.com> wrote:

> Hi,
>
> I'm using SOLRJ to run a query, with the goal of obtaining:
>
> (1) the retrieved documents,
> (2) the TF of each term in each document,
> (3) the IDF of each term in the set of retrieved documents (TF/IDF would be
> fine too)
>
> ...all at interactive speeds, or <10s per query. This is a demo, so if all
> else fails I can adjust the corpus, but I'd rather, y'know, actually do it.
>
> (1) and (2) are working; I completed the patch posted in the following
> issue:
> https://issues.apache.org/jira/browse/SOLR-949
> and am just setting tv=true&tv.tf=true for my query. This way I get the
> documents and the tf information all in one go.
>
> With (3) I'm running into trouble. I have found 2 ways to do it so far:
>
> Option A: set tv.df=true or tv.tf_idf=true for my query, and get the idf
> information along with the documents and tf information. Since each term
> may appear in multiple documents, this means retrieving idf information for
> each term about 20 times, and takes over a minute to do.
>
> Option B: After I've gathered the tf information, run through the list of
> terms used across the set of retrieved documents, and for each term, run a
> query like:
> {!func}idf(text,'the_term')&defType=func&fl=score&rows=1
> ...while this retrieves idf information only once for each term, the added
> latency for doing that many queries piles up to almost two minutes on my
> current corpus.
>
> Is there anything I didn't think of -- a way to construct a query to get
> idf information for a set of terms all in one go, outside the bounds of
> what terms happen to be in a document?
>
> Failing that, does anyone have a sense for how far I'd have to scale down a
> corpus to approach interactive speeds, if I want this sort of data?
>
> Katie
>
