Hi Kathryn,

I wonder if you could index all your terms as separate documents and then construct a new query (2nd pass):

q=term:term1 OR term:term2 OR term:term3

and use func to score them:

idf(other_field,field(term))

The 'term' field cannot be multi-valued, obviously.

Other than that, if you could do it on the server side, that would be the fastest - the code is ready inside IDFValueSource:
http://lucene.apache.org/core/4_3_0/queries/org/apache/lucene/queries/function/valuesource/IDFValueSource.html

roman

On Tue, Jul 2, 2013 at 5:06 PM, Kathryn Mazaitis <kathryn.riv...@gmail.com> wrote:

> Hi,
>
> I'm using SOLRJ to run a query, with the goal of obtaining:
>
> (1) the retrieved documents,
> (2) the TF of each term in each document,
> (3) the IDF of each term in the set of retrieved documents (TF/IDF would be
> fine too)
>
> ...all at interactive speeds, or <10s per query. This is a demo, so if all
> else fails I can adjust the corpus, but I'd rather, y'know, actually do it.
>
> (1) and (2) are working; I completed the patch posted in the following
> issue:
> https://issues.apache.org/jira/browse/SOLR-949
> and am just setting tv=true&tv.tf=true for my query. This way I get the
> documents and the tf information all in one go.
>
> With (3) I'm running into trouble. I have found 2 ways to do it so far:
>
> Option A: set tv.df=true or tv.tf_idf for my query, and get the idf
> information along with the documents and tf information. Since each term
> may appear in multiple documents, this means retrieving idf information for
> each term about 20 times, and takes over a minute to do.
>
> Option B: After I've gathered the tf information, run through the list of
> terms used across the set of retrieved documents, and for each term, run a
> query like:
> {!func}idf(text,'the_term')&deftype=func&fl=score&rows=1
> ...while this retrieves idf information only once for each term, the added
> latency for doing that many queries piles up to almost two minutes on my
> current corpus.
>
> Is there anything I didn't think of -- a way to construct a query to get
> idf information for a set of terms all in one go, outside the bounds of
> what terms happen to be in a document?
>
> Failing that, does anyone have a sense for how far I'd have to scale down a
> corpus to approach interactive speeds, if I want this sort of data?
>
> Katie
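
For what it's worth, here is a rough, untested sketch of how the parameters for the 2nd-pass request suggested at the top of this message might be assembled. It assumes a hypothetical side index with one document per term, stored in a single-valued field called `term`, and it assumes that idf() will accept field(term) as its second argument, which I have not verified. Returning the value as an `idf` pseudo-field in fl is just one way to "use func" here; `second_pass_params`, `stats_field` and `term_field` are names made up for illustration. Python is used only to keep it short; the same three parameters could be set on a SolrJ query object.

```python
from urllib.parse import urlencode


def second_pass_params(terms, stats_field="text", term_field="term"):
    """Build request parameters for one IDF lookup covering many terms.

    Assumption: a side index holds one document per term, with the term
    stored in the single-valued field `term_field`. `stats_field` is the
    field of the main corpus whose document frequencies are wanted.
    """
    if not terms:
        raise ValueError("need at least one term")
    # Quote each term so whitespace or query syntax inside it stays literal.
    quoted = ['"%s"' % t.replace("\\", "\\\\").replace('"', '\\"') for t in terms]
    return {
        # One OR clause per term gathered in the first pass.
        "q": " OR ".join("%s:%s" % (term_field, q) for q in quoted),
        # Function value requested as a pseudo-field (unverified assumption
        # that idf() accepts field(...) as its second argument).
        "fl": "%s,idf:idf(%s,field(%s))" % (term_field, stats_field, term_field),
        # Ask for one row per term so nothing is cut off by the default page size.
        "rows": str(len(terms)),
    }


params = second_pass_params(["solr", "lucene", "index"])
print(urlencode(params))
```

If the function form turns out not to be accepted, the query-building part still applies; only the `fl` expression would need to change.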