Re: What are the options for obtaining IDF at interactive speeds?

2013-07-10 Thread Kathryn Mazaitis
I didn't try indexing each term as a separate document (and if I had, I
probably would've just used tv.tf_idf instead of a functional query -- why
not?). The regular functional query, which required sending a separate
request for each of thousands of terms, was way dominated by the overhead
of each query, and far too slow.


On Mon, Jul 8, 2013 at 4:45 PM, Roman Chyla roman.ch...@gmail.com wrote:

 Hi,
 I am curious about the functional query: did you try it and it didn't work,
 or was it too slow?

 idf(other_field,field(term))

 Thanks!

   roman





Re: What are the options for obtaining IDF at interactive speeds?

2013-07-08 Thread Kathryn Mazaitis
Hi All,

Resolution: I ended up cheating. :P Though now that I look at it, I think
this was Roman's second suggestion. Thanks!

Since the application that will be processing the IDF figures is located on
the same machine as Solr, I opened a second IndexReader on the Lucene index
and used

reader.numDocs()
reader.docFreq(field,term)

to generate IDF by hand, ref: http://en.wikipedia.org/wiki/Tf%E2%80%93idf
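
A minimal sketch of that by-hand calculation, assuming the textbook form idf = ln(N / df) from the Wikipedia page. It is plain Java with no Lucene dependency: the two integers stand in for the values returned by reader.numDocs() and reader.docFreq(field,term). The class and method names are made up for illustration, and returning 0.0 for an unseen term is a choice made for this sketch, not something stated in the thread.

```java
/** Hand-rolled IDF from two index statistics (illustrative, no Lucene dependency). */
final class IdfByHand {
    private IdfByHand() {}

    /**
     * Textbook IDF: ln(numDocs / docFreq).
     * numDocs stands in for reader.numDocs(), docFreq for reader.docFreq(field,term).
     * Returns 0.0 when the term appears in no document (assumption of this sketch).
     */
    static double idf(int numDocs, int docFreq) {
        if (numDocs <= 0 || docFreq <= 0) {
            return 0.0;
        }
        return Math.log((double) numDocs / docFreq);
    }

    /** Smoothed variant, 1 + ln(numDocs / (docFreq + 1)), which tolerates docFreq == 0. */
    static double smoothedIdf(int numDocs, int docFreq) {
        if (numDocs <= 0) {
            return 0.0;
        }
        return 1.0 + Math.log((double) numDocs / (docFreq + 1.0));
    }
}
```

Which variant to use depends on whether the numbers need to line up with the scores Solr itself reports; the thread does not say which one was used.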

As it turns out, using this method to get IDF on all the terms mentioned in
the set of relevant documents runs in time comparable to retrieving the
documents in the first place (so, 0.1-1 s). This makes it fast enough that
it's no longer the slowest part of my algorithm by far. Problem solved! It
is possible that IDFValueSource would be faster; I may swap that in at a
later date.
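
One way the lookup can stay this cheap is to ask for each distinct term only once, however many retrieved documents contain it. The sketch below assumes the TF data has already been collected into a map of docId -> (term -> tf); that shape, the class name, and the docFreqLookup callback (a stand-in for reader.docFreq(field,term)) are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

/** Builds one IDF entry per distinct term across a set of retrieved documents. */
final class IdfTable {
    private IdfTable() {}

    /**
     * termFreqsByDoc: docId -> (term -> tf), an assumed shape for the TF data.
     * numDocs: stand-in for reader.numDocs().
     * docFreqLookup: stand-in for reader.docFreq(field,term); called once per distinct term.
     */
    static Map<String, Double> build(Map<String, Map<String, Integer>> termFreqsByDoc,
                                     int numDocs,
                                     ToIntFunction<String> docFreqLookup) {
        Map<String, Double> idfByTerm = new LinkedHashMap<>();
        for (Map<String, Integer> termFreqs : termFreqsByDoc.values()) {
            for (String term : termFreqs.keySet()) {
                if (idfByTerm.containsKey(term)) {
                    continue; // already looked up for an earlier document
                }
                int df = docFreqLookup.applyAsInt(term);
                double idf = df > 0 ? Math.log((double) numDocs / df) : 0.0;
                idfByTerm.put(term, idf);
            }
        }
        return idfByTerm;
    }
}
```

TF-IDF for a document is then just tf * idfByTerm.get(term) for each of its terms, with no further index access.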

I will keep Mikhail's debugQuery=true in my pocket, too; that technique
would never have occurred to me. Thank you too!

Best,
Katie


On Wed, Jul 3, 2013 at 11:35 PM, Roman Chyla roman.ch...@gmail.com wrote:

 Hi Kathryn,
 I wonder if you could index all your terms as separate documents and then
 construct a new query (2nd pass)

 q=term:term1 OR term:term2 OR term:term3

 and use func to score them

 idf(other_field,field(term))

 the 'term' index cannot be multi-valued, obviously.
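
 A small sketch of building that second-pass q string from a list of terms. The class name is hypothetical, and quoting each term (with backslash and double-quote escaped) is an assumption added here so that terms containing query-syntax characters do not break the query; the suggestion above uses bare terms.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Joins terms into a single boolean query of the form field:"t1" OR field:"t2" ... */
final class TermQueryBuilder {
    private TermQueryBuilder() {}

    static String build(String field, List<String> terms) {
        return terms.stream()
                // Escape backslashes first, then double quotes, then wrap in quotes.
                .map(t -> field + ":\"" + t.replace("\\", "\\\\").replace("\"", "\\\"") + "\"")
                .collect(Collectors.joining(" OR "));
    }
}
```

 With thousands of terms the query would probably have to be sent in batches, since Solr caps the number of boolean clauses (maxBooleanClauses in solrconfig.xml).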

 Other than that, if you could do it on the server side, that would be the
 fastest - the code is ready inside IDFValueSource:

 http://lucene.apache.org/core/4_3_0/queries/org/apache/lucene/queries/function/valuesource/IDFValueSource.html

 roman





What are the options for obtaining IDF at interactive speeds?

2013-07-02 Thread Kathryn Mazaitis
Hi,

I'm using SOLRJ to run a query, with the goal of obtaining:

(1) the retrieved documents,
(2) the TF of each term in each document,
(3) the IDF of each term in the set of retrieved documents (TF/IDF would be
fine too)

...all at interactive speeds, or 10s per query. This is a demo, so if all
else fails I can adjust the corpus, but I'd rather, y'know, actually do it.

(1) and (2) are working; I completed the patch posted in the following
issue:
https://issues.apache.org/jira/browse/SOLR-949
and am just setting tv=true&tv.tf=true for my query. This way I get the
documents and the tf information all in one go.

With (3) I'm running into trouble. I have found 2 ways to do it so far:

Option A: set tv.df=true or tv.tf_idf=true for my query, and get the idf
information along with the documents and tf information. Since each term
may appear in multiple documents, this means retrieving idf information for
each term about 20 times, and takes over a minute to do.

Option B: After I've gathered the tf information, run through the list of
terms used across the set of retrieved documents, and for each term, run a
query like:
{!func}idf(text,'the_term')&deftype=func&fl=score&rows=1
...while this retrieves idf information only once for each term, the added
latency for doing that many queries piles up to almost two minutes on my
current corpus.
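
For reference, a rough sketch of how the parameter string for one Option B request might be assembled. The class name and the leading q= are assumptions. The deftype parameter is left out on the assumption that the {!func} prefix already selects the function parser. Terms containing a single quote are not handled.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Builds the URL parameters for a single per-term IDF request (Option B). */
final class IdfRequest {
    private IdfRequest() {}

    static String params(String field, String term) {
        // Function query as written in the message; assumes term has no single quote.
        String q = "{!func}idf(" + field + ",'" + term + "')";
        // URL-encode the query value; fl=score and rows=1 are taken from the message.
        return "q=" + URLEncoder.encode(q, StandardCharsets.UTF_8) + "&fl=score&rows=1";
    }
}
```

Each distinct term needs its own string like this and its own round trip to Solr, which is where the latency piles up.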

Is there anything I didn't think of -- a way to construct a query to get
idf information for a set of terms all in one go, outside the bounds of
what terms happen to be in a document?

Failing that, does anyone have a sense for how far I'd have to scale down a
corpus to approach interactive speeds, if I want this sort of data?

Katie