Hi,

I'm using SolrJ to run a query, with the goal of obtaining:

(1) the retrieved documents,
(2) the TF of each term in each document,
(3) the IDF of each term in the set of retrieved documents (TF/IDF would be
fine too)

...all at interactive speeds, or <10s per query. This is a demo, so if all
else fails I can adjust the corpus, but I'd rather, y'know, actually do it.

(1) and (2) are working; I completed the patch posted in the following
issue:
https://issues.apache.org/jira/browse/SOLR-949
and am just setting tv=true&tv.tf=true for my query. This way I get the
documents and the tf information all in one go.

With (3) I'm running into trouble. I have found 2 ways to do it so far:

Option A: set tv.df=true or tv.tf_idf=true for my query, and get the idf
information along with the documents and tf information. Since each term
may appear in multiple documents, the idf information for each term is
retrieved about 20 times, and the query takes over a minute.
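
To make the redundancy in Option A concrete, here is a minimal sketch of
collapsing the per-document df values into one idf per distinct term on the
client side. The class and method names are hypothetical (they are not part
of SolrJ), the input maps stand in for whatever is parsed out of the term
vector response, and the formula assumes the classic Lucene idf,
1 + ln(numDocs / (df + 1)), which may not match the similarity in use:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of SolrJ.
class IdfOnce {
    // Assumed formula: classic Lucene idf, 1 + ln(numDocs / (df + 1)).
    static double idf(int df, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (df + 1));
    }

    // perDocDf holds one map per retrieved document: term -> document frequency.
    // df belongs to the term, not the document, so each term is computed once
    // and repeated occurrences in later documents are skipped.
    static Map<String, Double> idfPerTerm(List<Map<String, Integer>> perDocDf,
                                          int numDocs) {
        Map<String, Double> result = new HashMap<>();
        for (Map<String, Integer> doc : perDocDf) {
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                result.computeIfAbsent(e.getKey(),
                        term -> idf(e.getValue(), numDocs));
            }
        }
        return result;
    }
}
```

This only removes the duplicate work on the client; the response itself
still carries the df for every term in every document.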

Option B: After I've gathered the tf information, run through the list of
terms used across the set of retrieved documents, and for each term, run a
query like:
q={!func}idf(text,'the_term')&defType=func&fl=score&rows=1
...while this retrieves idf information only once for each term, the added
latency for doing that many queries piles up to almost two minutes on my
current corpus.
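
For reference, a minimal sketch of how the Option B requests are built, one
URL-encoded parameter string per distinct term. The class and method names
are hypothetical, the field name "text" is the one from the query above, and
terms containing a single quote are not handled:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper, not part of SolrJ.
class IdfQueries {
    // Builds the parameter string for one idf lookup.
    // Assumption: the term contains no single quote.
    static String idfQuery(String field, String term) {
        String fn = "{!func}idf(" + field + ",'" + term + "')";
        try {
            return "q=" + URLEncoder.encode(fn, "UTF-8")
                    + "&defType=func&fl=score&rows=1";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a JVM; this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // One query per distinct term, in first-seen order.
    static List<String> buildAll(String field, List<String> terms) {
        List<String> queries = new ArrayList<>();
        for (String term : new LinkedHashSet<>(terms)) {
            queries.add(idfQuery(field, term));
        }
        return queries;
    }
}
```

The size of the returned list is the number of round trips, which is where
the latency comes from.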

Is there anything I didn't think of -- a way to construct a query to get
idf information for a set of terms all in one go, outside the bounds of
what terms happen to be in a document?

Failing that, does anyone have a sense for how far I'd have to scale down a
corpus to approach interactive speeds, if I want this sort of data?

Katie
