I didn't try indexing each term as a separate document (and if I had, I probably would've just used tv.tf_idf instead of a functional query -- why not?). The regular functional query, which required sending a separate request for each of thousands of terms, was waaaay dominated by the per-query overhead, and was far too slow.

On Mon, Jul 8, 2013 at 4:45 PM, Roman Chyla <roman.ch...@gmail.com> wrote:

> Hi,
> I am curious about the functional query, did you try it and it didn't work?
> or was it too slow?
>
> idf(other_field,field(term))
>
> Thanks!
>
> roman
>
>
> On Mon, Jul 8, 2013 at 4:34 PM, Kathryn Mazaitis <ka...@rivard.org> wrote:
>
> > Hi All,
> >
> > Resolution: I ended up cheating. :P Though now that I look at it, I think
> > this was Roman's second suggestion. Thanks!
> >
> > Since the application that will be processing the IDF figures is located on
> > the same machine as SOLR, I opened a second IndexReader on the lucene index
> > and used
> >
> > reader.numDocs()
> > reader.docFreq(field,term)
> >
> > to generate IDF by hand, ref: http://en.wikipedia.org/wiki/Tf%E2%80%93idf
> >
> > As it turns out, using this method to get IDF on all the terms mentioned in
> > the set of relevant documents runs in time comparable to retrieving the
> > documents in the first place (so, .1-1s). This makes it fast enough that
> > it's no longer the slowest part of my algorithm by far. Problem solved! It
> > is possible that IDFValueSource would be faster; I may swap that in at a
> > later date.
> >
> > I will keep Mikhail's debugQuery=true in my pocket, too; that technique
> > would never have occurred to me. Thank you too!
> >
> > Best,
> > Katie
> >
> >
> > On Wed, Jul 3, 2013 at 11:35 PM, Roman Chyla <roman.ch...@gmail.com>
> > wrote:
> >
> > > Hi Kathryn,
> > > I wonder if you could index all your terms as separate documents and then
> > > construct a new query (2nd pass)
> > >
> > > q=term:term1 OR term:term2 OR term:term3
> > >
> > > and use func to score them
> > >
> > > idf(other_field,field(term))
> > >
> > > the 'term' index cannot be multi-valued, obviously.
> > >
> > > Other than that, if you could do it on server side, that would be the
> > > fastest - the code is ready inside IDFValueSource:
> > >
> > > http://lucene.apache.org/core/4_3_0/queries/org/apache/lucene/queries/function/valuesource/IDFValueSource.html
> > >
> > > roman
> > >
> > >
> > > On Tue, Jul 2, 2013 at 5:06 PM, Kathryn Mazaitis
> > > <kathryn.riv...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm using SOLRJ to run a query, with the goal of obtaining:
> > > >
> > > > (1) the retrieved documents,
> > > > (2) the TF of each term in each document,
> > > > (3) the IDF of each term in the set of retrieved documents (TF/IDF would
> > > > be fine too)
> > > >
> > > > ...all at interactive speeds, or <10s per query. This is a demo, so if all
> > > > else fails I can adjust the corpus, but I'd rather, y'know, actually do it.
> > > >
> > > > (1) and (2) are working; I completed the patch posted in the following
> > > > issue:
> > > > https://issues.apache.org/jira/browse/SOLR-949
> > > > and am just setting tv=true&tv.tf=true for my query. This way I get the
> > > > documents and the tf information all in one go.
> > > >
> > > > With (3) I'm running into trouble. I have found 2 ways to do it so far:
> > > >
> > > > Option A: set tv.df=true or tv.tf_idf=true for my query, and get the idf
> > > > information along with the documents and tf information. Since each term
> > > > may appear in multiple documents, this means retrieving idf information for
> > > > each term about 20 times, and takes over a minute to do.
> > > >
> > > > Option B: After I've gathered the tf information, run through the list of
> > > > terms used across the set of retrieved documents, and for each term, run a
> > > > query like:
> > > > {!func}idf(text,'the_term')&deftype=func&fl=score&rows=1
> > > > ...while this retrieves idf information only once for each term, the added
> > > > latency for doing that many queries piles up to almost two minutes on my
> > > > current corpus.
> > > >
> > > > Is there anything I didn't think of -- a way to construct a query to get
> > > > idf information for a set of terms all in one go, outside the bounds of
> > > > what terms happen to be in a document?
> > > >
> > > > Failing that, does anyone have a sense for how far I'd have to scale down a
> > > > corpus to approach interactive speeds, if I want this sort of data?
> > > >
> > > > Katie
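
For anyone landing on this thread later: the "IDF by hand" resolution above boils down to one division and one logarithm per term. Below is a minimal sketch of just that arithmetic, assuming the plain ln(numDocs / docFreq) form from the Wikipedia article linked in the thread (Lucene's own similarity classes use a smoothed variant, so scores will not match Solr's exactly). The class name IdfSketch, the method names, and the sample counts are all hypothetical. In real use the two integers would come from reader.numDocs() and reader.docFreq(...) as described above, and the exact docFreq signature depends on the Lucene version.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: not part of Lucene or Solr.
public class IdfSketch {

    // Plain IDF: natural log of (total documents / documents containing the term).
    // Returns 0.0 for a term with no postings; that convention is this sketch's
    // assumption, not something stated in the thread.
    static double idf(int numDocs, int docFreq) {
        if (docFreq <= 0) {
            return 0.0;
        }
        return Math.log((double) numDocs / docFreq);
    }

    // Computes IDF once per distinct term, given document frequencies that
    // were already looked up (e.g. from an IndexReader on the same machine).
    static Map<String, Double> idfForTerms(int numDocs, Map<String, Integer> docFreqs) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : docFreqs.entrySet()) {
            result.put(entry.getKey(), idf(numDocs, entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up counts, purely for illustration.
        Map<String, Integer> docFreqs = new LinkedHashMap<>();
        docFreqs.put("solr", 10);
        docFreqs.put("the", 1000);
        System.out.println(idfForTerms(1000, docFreqs));
    }
}
```

Because each distinct term is looked up once and no HTTP round trip is involved, this avoids both the repeated per-document IDF of Option A and the per-term request overhead of Option B.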