Re: What are the options for obtaining IDF at interactive speeds?

Kathryn Mazaitis Wed, 10 Jul 2013 09:05:00 -0700

I didn't try indexing each term as a separate document (and if I had I
probably would've just used tv.tf_idf instead of a functional query -- why
not?). The regular functional query, which required sending a separate
request for each of thousands of terms, was waaaay dominated by the overhead
of each query, and far too slow.
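
A minimal sketch of the by-hand IDF computation described in the quoted message below, assuming the textbook formula ln(numDocs / docFreq) from the Wikipedia reference. The counts here are made-up stand-ins for what reader.numDocs() and reader.docFreq(field,term) would return; the class and method names are hypothetical, and the actual code used in the application may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IdfByHand {

    /**
     * Textbook IDF: ln(numDocs / docFreq).
     * Returns 0 when either count is non-positive, so unseen terms do not
     * cause a division by zero. (Assumption: the thread does not say how
     * such terms were handled.)
     */
    public static double idf(int numDocs, int docFreq) {
        if (numDocs <= 0 || docFreq <= 0) {
            return 0.0;
        }
        return Math.log((double) numDocs / docFreq);
    }

    /**
     * Computes IDF once per distinct term from already-collected document
     * frequencies, instead of issuing one query per term.
     */
    public static Map<String, Double> idfForTerms(int numDocs, Map<String, Integer> docFreqs) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : docFreqs.entrySet()) {
            result.put(entry.getKey(), idf(numDocs, entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical counts standing in for values read from the index.
        int numDocs = 1000;
        Map<String, Integer> docFreqs = new LinkedHashMap<>();
        docFreqs.put("solr", 10);
        docFreqs.put("the", 1000);

        for (Map.Entry<String, Double> entry : idfForTerms(numDocs, docFreqs).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

Because the counts come straight from the index, the cost per term is a lookup rather than a full request, which is consistent with the .1-1s timing reported below.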



On Mon, Jul 8, 2013 at 4:45 PM, Roman Chyla <roman.ch...@gmail.com> wrote:

> Hi,
> I am curious about the functional query: did you try it and it didn't work,
> or was it too slow?
>
> idf(other_field,field(term))
>
> Thanks!
>
>   roman
>
>
> On Mon, Jul 8, 2013 at 4:34 PM, Kathryn Mazaitis <ka...@rivard.org> wrote:
>
> > Hi All,
> >
> > Resolution: I ended up cheating. :P Though now that I look at it, I think
> > this was Roman's second suggestion. Thanks!
> >
> > Since the application that will be processing the IDF figures is located on
> > the same machine as SOLR, I opened a second IndexReader on the lucene index
> > and used
> >
> > reader.numDocs()
> > reader.docFreq(field,term)
> >
> > to generate IDF by hand, ref: http://en.wikipedia.org/wiki/Tf%E2%80%93idf
> >
> > As it turns out, using this method to get IDF on all the terms mentioned in
> > the set of relevant documents runs in time comparable to retrieving the
> > documents in the first place (so, .1-1s). This makes it fast enough that
> > it's no longer the slowest part of my algorithm by far. Problem solved! It
> > is possible that IDFValueSource would be faster; I may swap that in at a
> > later date.
> >
> > I will keep Mikhail's debugQuery=true in my pocket, too; that technique
> > would never have occurred to me. Thank you too!
> >
> > Best,
> > Katie
> >
> >
> > On Wed, Jul 3, 2013 at 11:35 PM, Roman Chyla <roman.ch...@gmail.com>
> > wrote:
> >
> > > Hi Kathryn,
> > > I wonder if you could index all your terms as separate documents and then
> > > construct a new query (2nd pass)
> > >
> > > q=term:term1 OR term:term2 OR term:term3
> > >
> > > and use func to score them
> > >
> > > idf(other_field,field(term))
> > >
> > > the 'term' index cannot be multi-valued, obviously.
> > >
> > > Other than that, if you could do it on server side, that would be the
> > > fastest - the code is ready inside IDFValueSource:
> > >
> > >
> > > http://lucene.apache.org/core/4_3_0/queries/org/apache/lucene/queries/function/valuesource/IDFValueSource.html
> > >
> > > roman
> > >
> > >
> > > On Tue, Jul 2, 2013 at 5:06 PM, Kathryn Mazaitis
> > > <kathryn.riv...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm using SOLRJ to run a query, with the goal of obtaining:
> > > >
> > > > (1) the retrieved documents,
> > > > (2) the TF of each term in each document,
> > > > (3) the IDF of each term in the set of retrieved documents (TF/IDF would be
> > > > fine too)
> > > >
> > > > ...all at interactive speeds, or <10s per query. This is a demo, so if all
> > > > else fails I can adjust the corpus, but I'd rather, y'know, actually do it.
> > > >
> > > > (1) and (2) are working; I completed the patch posted in the following
> > > > issue:
> > > > https://issues.apache.org/jira/browse/SOLR-949
> > > > and am just setting tv=true&tv.tf=true for my query. This way I get the
> > > > documents and the tf information all in one go.
> > > >
> > > > With (3) I'm running into trouble. I have found 2 ways to do it so far:
> > > >
> > > > Option A: set tv.df=true or tv.tf_idf=true for my query, and get the idf
> > > > information along with the documents and tf information. Since each term
> > > > may appear in multiple documents, this means retrieving idf information for
> > > > each term about 20 times, and takes over a minute to do.
> > > >
> > > > Option B: After I've gathered the tf information, run through the list of
> > > > terms used across the set of retrieved documents, and for each term, run a
> > > > query like:
> > > > {!func}idf(text,'the_term')&deftype=func&fl=score&rows=1
> > > > ...while this retrieves idf information only once for each term, the added
> > > > latency for doing that many queries piles up to almost two minutes on my
> > > > current corpus.
> > > >
> > > > Is there anything I didn't think of -- a way to construct a query to get
> > > > idf information for a set of terms all in one go, outside the bounds of
> > > > what terms happen to be in a document?
> > > >
> > > > Failing that, does anyone have a sense for how far I'd have to scale down a
> > > > corpus to approach interactive speeds, if I want this sort of data?
> > > >
> > > > Katie
> > > >
> > >
> >
>
