Re: What are the options for obtaining IDF at interactive speeds?
I didn't try indexing each term as a separate document (and if I had, I probably would've just used tv.tf_idf instead of a functional query -- why not?). The regular functional query, which required sending a separate request for each of thousands of terms, was way dominated by the overhead of each query, and far too slow.

On Mon, Jul 8, 2013 at 4:45 PM, Roman Chyla roman.ch...@gmail.com wrote:

Hi, I am curious about the functional query, did you try it and it didn't work? or was it too slow?

idf(other_field,field(term))

Thanks!

roman

On Mon, Jul 8, 2013 at 4:34 PM, Kathryn Mazaitis ka...@rivard.org wrote:

Hi All,

Resolution: I ended up cheating. :P Though now that I look at it, I think this was Roman's second suggestion. Thanks!

Since the application that will be processing the IDF figures is located on the same machine as SOLR, I opened a second IndexReader on the lucene index and used reader.numDocs() and reader.docFreq(field,term) to generate IDF by hand, ref: http://en.wikipedia.org/wiki/Tf%E2%80%93idf

As it turns out, using this method to get IDF on all the terms mentioned in the set of relevant documents runs in time comparable to retrieving the documents in the first place (so, .1-1s). This makes it fast enough that it's no longer the slowest part of my algorithm by far. Problem solved! It is possible that IDFValueSource would be faster; I may swap that in at a later date.

I will keep Mikhail's debugQuery=true in my pocket, too; that technique would never have occurred to me. Thank you too!

Best,
Katie

On Wed, Jul 3, 2013 at 11:35 PM, Roman Chyla roman.ch...@gmail.com wrote:

Hi Kathryn,

I wonder if you could index all your terms as separate documents and then construct a new query (2nd pass)

q=term:term1 OR term:term2 OR term:term3

and use func to score them

idf(other_field,field(term))

The 'term' index cannot be multi-valued, obviously.

Other than that, if you could do it on server side, that would be the fastest - the code is ready inside IDFValueSource: http://lucene.apache.org/core/4_3_0/queries/org/apache/lucene/queries/function/valuesource/IDFValueSource.html

roman

On Tue, Jul 2, 2013 at 5:06 PM, Kathryn Mazaitis kathryn.riv...@gmail.com wrote:

Hi,

I'm using SOLRJ to run a query, with the goal of obtaining:

(1) the retrieved documents,
(2) the TF of each term in each document,
(3) the IDF of each term in the set of retrieved documents (TF/IDF would be fine too)

...all at interactive speeds, or 10s per query. This is a demo, so if all else fails I can adjust the corpus, but I'd rather, y'know, actually do it.

(1) and (2) are working; I completed the patch posted in the following issue: https://issues.apache.org/jira/browse/SOLR-949 and am just setting tv=true&tv.tf=true for my query. This way I get the documents and the tf information all in one go.

With (3) I'm running into trouble. I have found 2 ways to do it so far:

Option A: set tv.df=true or tv.tf_idf for my query, and get the idf information along with the documents and tf information. Since each term may appear in multiple documents, this means retrieving idf information for each term about 20 times, and takes over a minute to do.

Option B: After I've gathered the tf information, run through the list of terms used across the set of retrieved documents, and for each term, run a query like:

{!func}idf(text,'the_term')&deftype=func&fl=score&rows=1

...while this retrieves idf information only once for each term, the added latency for doing that many queries piles up to almost two minutes on my current corpus.

Is there anything I didn't think of -- a way to construct a query to get idf information for a set of terms all in one go, outside the bounds of what terms happen to be in a document?

Failing that, does anyone have a sense for how far I'd have to scale down a corpus to approach interactive speeds, if I want this sort of data?

Katie
Re: What are the options for obtaining IDF at interactive speeds?
Hi All,

Resolution: I ended up cheating. :P Though now that I look at it, I think this was Roman's second suggestion. Thanks!

Since the application that will be processing the IDF figures is located on the same machine as SOLR, I opened a second IndexReader on the lucene index and used reader.numDocs() and reader.docFreq(field,term) to generate IDF by hand, ref: http://en.wikipedia.org/wiki/Tf%E2%80%93idf

As it turns out, using this method to get IDF on all the terms mentioned in the set of relevant documents runs in time comparable to retrieving the documents in the first place (so, .1-1s). This makes it fast enough that it's no longer the slowest part of my algorithm by far. Problem solved! It is possible that IDFValueSource would be faster; I may swap that in at a later date.

I will keep Mikhail's debugQuery=true in my pocket, too; that technique would never have occurred to me. Thank you too!

Best,
Katie
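The by-hand IDF computation described in the resolution above can be sketched roughly as follows. This is only an illustration, not Katie's actual code: the counts that reader.numDocs() and reader.docFreq(...) would return are passed in as plain integers so the example runs without Lucene, and the class and method names (IdfByHand, textbookIdf, smoothedIdf, idfForTerms) are made up for this sketch. The thread does not say which IDF variant was used; the first method is the plain form from the referenced Wikipedia article, the second is the smoothed form that, as far as I recall, Lucene's classic DefaultSimilarity uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: IDF computed from raw counts, without any Lucene dependency.
class IdfByHand {

    // Plain IDF: ln(N / df). The caller must ensure docFreq > 0.
    static double textbookIdf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    // Smoothed IDF: 1 + ln(N / (df + 1)); safe for docFreq == 0.
    static double smoothedIdf(int numDocs, int docFreq) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // One pass over the distinct terms: each term's IDF is computed exactly once.
    static Map<String, Double> idfForTerms(int numDocs, Map<String, Integer> docFreqs) {
        Map<String, Double> idf = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : docFreqs.entrySet()) {
            idf.put(entry.getKey(), smoothedIdf(numDocs, entry.getValue()));
        }
        return idf;
    }

    public static void main(String[] args) {
        // Made-up counts standing in for reader.numDocs() and reader.docFreq(...).
        int numDocs = 1000;
        Map<String, Integer> docFreqs = new LinkedHashMap<>();
        docFreqs.put("the", 999);   // very common term, low IDF
        docFreqs.put("solr", 9);    // rare term, high IDF
        for (Map.Entry<String, Double> entry : idfForTerms(numDocs, docFreqs).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

Because the counts come straight from the index, the cost is one lookup per distinct term rather than one HTTP request per term, which is consistent with the timings reported above.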
Re: What are the options for obtaining IDF at interactive speeds?
Hi, I am curious about the functional query, did you try it and it didn't work? or was it too slow?

idf(other_field,field(term))

Thanks!

roman
Re: What are the options for obtaining IDF at interactive speeds?
Katie,

This case is actually really hard to follow. Let me offer a counter-example, so that you can explain the problem better by pointing out the gap. What if I say that debugQuery=true provides tf and idf for the terms and documents from the requested page of results? Why can't you use explain to solve the problem?

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: What are the options for obtaining IDF at interactive speeds?
Hi Kathryn,

I wonder if you could index all your terms as separate documents and then construct a new query (2nd pass)

q=term:term1 OR term:term2 OR term:term3

and use func to score them

idf(other_field,field(term))

The 'term' index cannot be multi-valued, obviously.

Other than that, if you could do it on server side, that would be the fastest - the code is ready inside IDFValueSource: http://lucene.apache.org/core/4_3_0/queries/org/apache/lucene/queries/function/valuesource/IDFValueSource.html

roman
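Roman's second-pass query could be assembled along these lines. This is a sketch under assumptions rather than anything posted in the thread: the class and method names (TermQuerySketch, distinctTerms, buildTermQuery) are hypothetical, the per-document term lists stand in for whatever the term vector response returns, and no query-syntax escaping is done. The point it illustrates is that terms are de-duplicated first, so each term is asked about once instead of once per document it appears in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: build one OR query over the distinct terms of all retrieved documents.
class TermQuerySketch {

    // Collects every term once, preserving first-seen order.
    static Set<String> distinctTerms(List<List<String>> termsPerDocument) {
        Set<String> distinct = new LinkedHashSet<>();
        for (List<String> terms : termsPerDocument) {
            distinct.addAll(terms);
        }
        return distinct;
    }

    // Produces "field:t1 OR field:t2 OR ..." (terms are NOT escaped in this sketch).
    static String buildTermQuery(String field, Set<String> terms) {
        List<String> clauses = new ArrayList<>();
        for (String term : terms) {
            clauses.add(field + ":" + term);
        }
        return String.join(" OR ", clauses);
    }

    public static void main(String[] args) {
        // Made-up term lists for two retrieved documents; "lucene" occurs in both.
        List<List<String>> termsPerDocument = Arrays.asList(
                Arrays.asList("solr", "lucene"),
                Arrays.asList("lucene", "idf"));
        Set<String> terms = distinctTerms(termsPerDocument);
        // prints: term:solr OR term:lucene OR term:idf
        System.out.println(buildTermQuery("term", terms));
    }
}
```

Real terms would need escaping for the query parser, and a set of thousands of terms would probably have to be sent in batches to stay under the boolean clause limit.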