Re: What are the options for obtaining IDF at interactive speeds?

2013-07-10 Thread Kathryn Mazaitis
I didn't try indexing each term as a separate document (and if I had, I
probably would've just used tv.tf_idf instead of a functional query -- why
not?). The regular functional query, which required sending a separate
request for each of thousands of terms, was way dominated by the overhead
of each query, and far too slow.


On Mon, Jul 8, 2013 at 4:45 PM, Roman Chyla roman.ch...@gmail.com wrote:

 Hi,
 I am curious about the functional query: did you try it and it didn't work,
 or was it too slow?

 idf(other_field,field(term))

 Thanks!

   roman





Re: What are the options for obtaining IDF at interactive speeds?

2013-07-08 Thread Kathryn Mazaitis
Hi All,

Resolution: I ended up cheating. :P Though now that I look at it, I think
this was Roman's second suggestion. Thanks!

Since the application that will be processing the IDF figures is located on
the same machine as SOLR, I opened a second IndexReader on the lucene index
and used

reader.numDocs()
reader.docFreq(field,term)

to generate IDF by hand, ref: http://en.wikipedia.org/wiki/Tf%E2%80%93idf

As it turns out, using this method to get IDF on all the terms mentioned in
the set of relevant documents runs in time comparable to retrieving the
documents in the first place (so, .1-1s). This makes it fast enough that
it's no longer the slowest part of my algorithm by far. Problem solved! It
is possible that IDFValueSource would be faster; I may swap that in at a
later date.
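
A minimal sketch of the by-hand calculation described above, written in plain Java so that it runs without Lucene on the classpath. The counts are passed in as numbers; in the setup described here they would come from reader.numDocs() and reader.docFreq(...). The class and method names (IdfByHand, idf, idfForTerms) are made up for illustration, and the formula is the plain log(N/df) variant from the Wikipedia page, which is not necessarily what Lucene's own similarity classes compute.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IdfByHand {
    /** Plain IDF: log(numDocs / docFreq). Returns 0 when either count is not positive. */
    public static double idf(long numDocs, long docFreq) {
        if (docFreq <= 0 || numDocs <= 0) {
            return 0.0;
        }
        return Math.log((double) numDocs / (double) docFreq);
    }

    /** Computes IDF once per distinct term from a map of term -> document frequency. */
    public static Map<String, Double> idfForTerms(long numDocs, Map<String, Long> docFreqs) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : docFreqs.entrySet()) {
            result.put(e.getKey(), idf(numDocs, e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical counts standing in for reader.numDocs() and reader.docFreq(...).
        long numDocs = 1000;
        Map<String, Long> docFreqs = new LinkedHashMap<>();
        docFreqs.put("solr", 10L);
        docFreqs.put("the", 1000L);
        for (Map.Entry<String, Double> e : idfForTerms(numDocs, docFreqs).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Because the map holds each term once, every IDF value is computed a single time no matter how many retrieved documents contain the term.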

I will keep Mikhail's debugQuery=true in my pocket, too; that technique
would never have occurred to me. Thank you too!

Best,
Katie





Re: What are the options for obtaining IDF at interactive speeds?

2013-07-08 Thread Roman Chyla
Hi,
I am curious about the functional query: did you try it and it didn't work,
or was it too slow?

idf(other_field,field(term))

Thanks!

  roman





Re: What are the options for obtaining IDF at interactive speeds?

2013-07-03 Thread Mikhail Khludnev
Katie,

This case is actually really hard to understand. Let me offer a
counterexample, so that you can explain the problem better by spotting the
gap. What if I say that debugQuery=true provides tf and idf for the terms and
documents on the requested page of results? Why can't you use explain to
solve the problem?
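
For illustration, a small sketch of the request parameters such a call would carry. It only builds the URL query string; q, start, rows and debugQuery are standard Solr request parameters, while the class name, method names and the example query text:lucene are hypothetical. Parsing the explain output that comes back is not shown.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class DebugQueryParams {
    /** Percent-encodes one name or value as UTF-8. */
    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
    }

    /** Joins parameters into a URL query string, in insertion order. */
    public static String toQueryString(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
        }
        return sb.toString();
    }

    /** Parameters for one page of results with score explanations switched on. */
    public static Map<String, String> debugParams(String query, int start, int rows) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("q", query);
        p.put("start", Integer.toString(start));
        p.put("rows", Integer.toString(rows));
        p.put("debugQuery", "true");
        return p;
    }

    public static void main(String[] args) {
        System.out.println(toQueryString(debugParams("text:lucene", 0, 10)));
    }
}
```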






-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: What are the options for obtaining IDF at interactive speeds?

2013-07-03 Thread Roman Chyla
Hi Kathryn,
I wonder if you could index all your terms as separate documents and then
construct a new query (2nd pass)

q=term:term1 OR term:term2 OR term:term3

and use func to score them

idf(other_field,field(term))
the 'term' index cannot be multi-valued, obviously.
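
One way to read this suggestion is as a single second-pass request whose q parameter ORs together every term of interest. The sketch below only builds that query string. The field name term comes from the example above; the class and method names are hypothetical, and the escaping is deliberately minimal (each term is quoted, and embedded quotes and backslashes are escaped).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SecondPassQuery {
    /** Wraps a term in double quotes, escaping backslashes and quotes inside it. */
    public static String quote(String term) {
        return "\"" + term.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Builds field:"t1" OR field:"t2" ... for all the given terms. */
    public static String buildOrQuery(String field, List<String> terms) {
        List<String> clauses = new ArrayList<>();
        for (String t : terms) {
            clauses.add(field + ":" + quote(t));
        }
        return String.join(" OR ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(buildOrQuery("term", Arrays.asList("term1", "term2", "term3")));
    }
}
```

With thousands of terms the list would probably need to be split into batches, since Solr caps the number of boolean clauses per query (maxBooleanClauses in solrconfig.xml).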

Other than that, if you could do it on the server side, that would be the
fastest - the code is ready inside IDFValueSource:
http://lucene.apache.org/core/4_3_0/queries/org/apache/lucene/queries/function/valuesource/IDFValueSource.html

roman


On Tue, Jul 2, 2013 at 5:06 PM, Kathryn Mazaitis
kathryn.riv...@gmail.com wrote:

 Hi,

 I'm using SOLRJ to run a query, with the goal of obtaining:

 (1) the retrieved documents,
 (2) the TF of each term in each document,
 (3) the IDF of each term in the set of retrieved documents (TF/IDF would be
 fine too)

 ...all at interactive speeds, or 10s per query. This is a demo, so if all
 else fails I can adjust the corpus, but I'd rather, y'know, actually do it.

 (1) and (2) are working; I completed the patch posted in the following
 issue:
 https://issues.apache.org/jira/browse/SOLR-949
 and am just setting tv=true&tv.tf=true for my query. This way I get the
 documents and the tf information all in one go.

 With (3) I'm running into trouble. I have found 2 ways to do it so far:

 Option A: set tv.df=true or tv.tf_idf for my query, and get the idf
 information along with the documents and tf information. Since each term
 may appear in multiple documents, this means retrieving idf information for
 each term about 20 times, and takes over a minute to do.

 Option B: After I've gathered the tf information, run through the list of
 terms used across the set of retrieved documents, and for each term, run a
 query like:
 {!func}idf(text,'the_term')&defType=func&fl=score&rows=1
 ...while this retrieves idf information only once for each term, the added
 latency for doing that many queries piles up to almost two minutes on my
 current corpus.
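
The difference between the two options comes down to how many IDF lookups are made. As a rough sketch, assuming the per-document term lists have already been pulled out of the term vector response, the distinct terms can be collected once before any lookup happens. The class name, method names and sample data here are all hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DistinctTerms {
    /** Returns every term that occurs in at least one document, in first-seen order. */
    public static Set<String> distinctTerms(List<List<String>> termsPerDocument) {
        Set<String> distinct = new LinkedHashSet<>();
        for (List<String> docTerms : termsPerDocument) {
            distinct.addAll(docTerms);
        }
        return distinct;
    }

    /** Total number of (document, term) entries, i.e. lookups made without deduplication. */
    public static int totalEntries(List<List<String>> termsPerDocument) {
        int n = 0;
        for (List<String> docTerms : termsPerDocument) {
            n += docTerms.size();
        }
        return n;
    }

    public static void main(String[] args) {
        // Made-up term lists for three retrieved documents.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("solr", "index", "term"),
                Arrays.asList("solr", "query"),
                Arrays.asList("term", "query", "idf"));
        System.out.println(totalEntries(docs) + " entries, "
                + distinctTerms(docs).size() + " distinct terms");
    }
}
```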

 Is there anything I didn't think of -- a way to construct a query to get
 idf information for a set of terms all in one go, outside the bounds of
 what terms happen to be in a document?

 Failing that, does anyone have a sense for how far I'd have to scale down a
 corpus to approach interactive speeds, if I want this sort of data?

 Katie