Re: Highest frequency terms for a subset of documents
OK, so I copied my index and ran Solr 3.1 against it. Qtime dropped from about 40s to 17s! This is good news, but still longer than I hoped for. I tried to run the same test with 4.0, but I'm getting IndexFormatTooOldException since my index was created using 1.4.1. Is my only option for testing this to reindex using 3.1 or 4.0?

Another strange behavior is that the Qtime seems pretty stable no matter how many objects match my query: 200K and 20K both take about 17s. I would have guessed that, since the time goes to iterating over all the terms of all the subset documents, more documents would mean more time.

Thanks for any insights,
ofer

On Thu, Apr 21, 2011 at 3:07 AM, Ofer Fort o...@tra.cx wrote:
> my documents are user entries, so i'm guessing they vary a lot. Tomorrow
> i'll try 3.1 and also 4.0, and see if they bring an improvement. thanks guys!
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 9:24 AM, Ofer Fort o...@tra.cx wrote:
> Another strange behavior is that the Qtime seems pretty stable, no matter
> how many objects match my query. 200K and 20K both take about 17s. I would
> have guessed that, since the time goes to iterating over all the terms of
> all the subset documents, more documents would mean more time.

facet.method=enum steps over all terms in the index for that field... that takes time regardless of how many documents are in the base set. There are also short-circuit methods that avoid looking at the docs for a term if its docfreq is low enough that it couldn't possibly make it into the priority queue. Because of this, it can actually be faster to facet on a larger base set (try *:* as the base query).

Actually, it might be interesting to see the query time if you set facet.mincount equal to the number of docs in the base set - that will test pretty much just the time to enumerate over the terms, without doing any set intersections at all. Be careful not to set mincount greater than the number of docs in the base set though - solr will short-circuit that too and skip enumeration altogether.

The work on the bulkpostings branch should definitely speed up your case even more - but I have no idea when it will land on trunk.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
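(For illustration, that diagnostic boils down to a single request, reusing the params from the original post. This is just a sketch - the host is made up, and 200000 stands in for the exact size of the base set, which must not be exceeded:

  http://localhost:8983/solr/select?q=in_subset:1&rows=0&facet=true&facet.field=text&facet.method=enum&facet.limit=500&facet.mincount=200000

If this comes back in roughly the same 17s, nearly all the time is going to term enumeration rather than to set intersections.)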
Re: Highest frequency terms for a subset of documents
Not sure I fully understand. If facet.method=enum steps over all terms in the index for that field, then what does setting q=field:subset do? And if I set q=*:*, then how do I get the frequency only on my subset?

Ofer

On Thu, Apr 21, 2011 at 4:40 PM, Yonik Seeley yo...@lucidimagination.com wrote:
> facet.method=enum steps over all terms in the index for that field... that
> takes time regardless of how many documents are in the base set.
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 9:44 AM, Ofer Fort o...@tra.cx wrote:
> Not sure I fully understand. If facet.method=enum steps over all terms in
> the index for that field, then what does setting q=field:subset do? And if
> I set q=*:*, then how do I get the frequency only on my subset?

It's an implementation detail. Faceting *does* give you counts only for documents that match q=field:subset. How it computes them is a different matter (i.e. for facet.method=enum, it must step over all terms in the field), so it's closer to O(nterms in field) than O(ndocs in base set).

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
Re: Highest frequency terms for a subset of documents
I see, thanks. So if I wanted to implement something that fits my needs, would going through the subset of documents and counting all the terms in each one be faster? And easier to implement?

On Thu, Apr 21, 2011 at 5:36 PM, Yonik Seeley yo...@lucidimagination.com wrote:
> It's an implementation detail. Faceting *does* give you counts only for
> documents that match q=field:subset. How it computes them is a different
> matter (i.e. for facet.method=enum, it must step over all terms in the
> field), so it's closer to O(nterms in field) than O(ndocs in base set).
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 10:41 AM, Ofer Fort o...@tra.cx wrote:
> I see, thanks. So if I wanted to implement something that fits my needs,
> would going through the subset of documents and counting all the terms in
> each one be faster? And easier to implement?

That's not just your needs, that's everyone's needs (it's the definition of field faceting). There's no way to do what you're asking with a term enumerator (i.e. facet.method=enum). Going through documents and counting all the terms in each is what facet.method=fc does. But it's also not great when the number of unique terms per document is high. If you can think of a better way, go for it!

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
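(For reference, a minimal sketch of doing that per-document counting by hand against raw Lucene, using the 3.x API. It assumes the text field was indexed with termVectors="true" - which is not the default - and that docIds holds the internal Lucene ids of the subset:

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.TermFreqVector;

  // Sum term frequencies over a subset of docs; the caller then keeps
  // the top 500 entries by count, e.g. with a priority queue.
  static Map<String, Integer> countSubsetTerms(IndexReader reader, int[] docIds)
      throws IOException {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (int docId : docIds) {
      TermFreqVector tfv = reader.getTermFreqVector(docId, "text");
      if (tfv == null) continue;  // no term vector stored for this doc
      String[] terms = tfv.getTerms();
      int[] freqs = tfv.getTermFrequencies();
      for (int i = 0; i < terms.length; i++) {
        Integer c = counts.get(terms[i]);
        counts.put(terms[i], c == null ? freqs[i] : c + freqs[i]);
      }
    }
    return counts;
  }

Like fc, this is O(ndocs in subset), but it pays a per-document disk read for each term vector, and storing the vectors grows the index.)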
Re: Highest frequency terms for a subset of documents
So if I want to use facet.method=fc, is there a way to speed it up? And remove the bucket size limitation?

On Thu, Apr 21, 2011 at 5:58 PM, Yonik Seeley yo...@lucidimagination.com wrote:
> Going through documents and counting all the terms in each is what
> facet.method=fc does. But it's also not great when the number of unique
> terms per document is high.
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort o...@tra.cx wrote:
> So if I want to use facet.method=fc, is there a way to speed it up? And
> remove the bucket size limitation?

Not really - else we would have done it already ;-) We don't really have great methods for faceting on full-text fields (as opposed to shorter meta-data fields) today.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
Re: Highest frequency terms for a subset of documents
Well, it was worth a try ;-) But when using facet.method=fc, will reducing the subset size reduce the time and memory? Meaning, is it O(ndocs of the set)?

Thanks

On Thursday, April 21, 2011, Yonik Seeley yo...@lucidimagination.com wrote:
> Not really - else we would have done it already ;-) We don't really have
> great methods for faceting on full-text fields (as opposed to shorter
> meta-data fields) today.
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 6:25 PM, Ofer Fort o...@tra.cx wrote:
> Well, it was worth a try ;-) But when using facet.method=fc, will reducing
> the subset size reduce the time and memory? Meaning, is it O(ndocs of the
> set)?

facet.method=fc builds a multi-valued FieldCache-like structure (UnInvertedField) the first time, which is used for counting facets for all subsequent requests. So the faceting time (after the first time) is O(ndocs of the set), but the UnInvertedField singleton uses a large amount of memory unrelated to any particular base docset.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
Re: Highest frequency terms for a subset of documents
So I'm guessing my best approach now would be to test trunk, and hope that, as 3.1 cut the query time in half, trunk will do the same.

Thanks for the info,
Ofer

On Friday, April 22, 2011, Yonik Seeley yo...@lucidimagination.com wrote:
> facet.method=fc builds a multi-valued FieldCache-like structure
> (UnInvertedField) the first time, which is used for counting facets for
> all subsequent requests.
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 6:34 PM, Ofer Fort o...@tra.cx wrote:
> So I'm guessing my best approach now would be to test trunk, and hope
> that, as 3.1 cut the query time in half, trunk will do the same.

Trunk probably won't be much better... but the bulkpostings branch possibly could be.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
Re: Highest frequency terms for a subset of documents
Ok, I'll give it a try, as this is a server I am willing to risk. How is the compatibility between the SolrJ versions of bulkpostings, trunk, 3.1 and 1.4.1?

On Friday, April 22, 2011, Yonik Seeley yo...@lucidimagination.com wrote:
> Trunk probably won't be much better... but the bulkpostings branch
> possibly could be.
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 6:50 PM, Ofer Fort o...@tra.cx wrote:
> Ok, I'll give it a try, as this is a server I am willing to risk. How is
> the compatibility between the SolrJ versions of bulkpostings, trunk, 3.1
> and 1.4.1?

bulkpostings, trunk, and 3.1 should all be relatively SolrJ compatible. But the SolrJ javabin format (used by default for queries) changed for strings between 1.4.1 and 3.1 (SOLR-2034).

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
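(A sketch of one way around the javabin mismatch while client and server versions differ: have the SolrJ client request and parse XML responses instead of javabin. Slower, but the XML response format should be stable across these versions; the URL below is illustrative:

  import java.net.MalformedURLException;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.impl.XMLResponseParser;

  // Build a client that uses XML rather than javabin, sidestepping
  // the SOLR-2034 string-format change.
  static SolrServer xmlClient() throws MalformedURLException {
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    server.setParser(new XMLResponseParser());
    return server;
  }
)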
Re: Highest frequency terms for a subset of documents
Ok, thanks.

On Friday, April 22, 2011, Yonik Seeley yo...@lucidimagination.com wrote:
> bulkpostings, trunk, and 3.1 should all be relatively SolrJ compatible.
> But the SolrJ javabin format (used by default for queries) changed for
> strings between 1.4.1 and 3.1 (SOLR-2034).
RE: Highest frequency terms for a subset of documents
I think faceting is probably the best way to do that, indeed. It might be slow, but it's kind of set up for exactly that case; I can't imagine any other technique being faster -- there's stuff that has to be done to look up the info you want.

BUT, I see your problem: don't use facet.method=enum. Use facet.method=fc. It works a LOT better for very high arity fields (lots and lots of unique values) like you have. I bet you'll see a significant speed-up if you use facet.method=fc instead, hopefully fast enough to be workable. With facet.method=enum I would indeed have predicted it would be horribly slow; before Solr 1.4, when facet.method=fc became available, it was nearly impossible to facet on very high arity fields. facet.method=fc is the magic. I think facet.method=fc is even the default in Solr 1.4+, if you hadn't explicitly set it to enum instead!

Jonathan

From: Ofer Fort [ofer...@gmail.com]
Sent: Wednesday, April 20, 2011 6:49 PM
To: solr-user@lucene.apache.org
Subject: Highest frequency terms for a subset of documents

Hi,
I am looking for the best way to find the terms with the highest frequency for a given subset of documents (terms in the text field). My first thought was to do a count facet search, where the query defines the subset of documents and the facet.field is the text field. This gives me the result, but it is very very slow. These are my params:

<lst name="params">
  <str name="facet">true</str>
  <str name="facet.offset">0</str>
  <str name="facet.mincount">3</str>
  <str name="indent">on</str>
  <str name="facet.limit">500</str>
  <str name="facet.method">enum</str>
  <str name="wt">xml</str>
  <str name="rows">0</str>
  <str name="version">2.2</str>
  <str name="facet.sort">count</str>
  <str name="q">in_subset:1</str>
  <str name="facet.field">text</str>
</lst>

The index contains 7M documents, the subset is about 200K. A simple query for the subset takes around 100ms, but the facet search takes 40s. Am I doing something wrong? If facet search is not the correct approach, I thought about using something like org.apache.lucene.misc.HighFreqTerms, but I'm not sure how to do this in solr. Should I implement a request handler that executes this kind of code?

Thanks for any help
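(In request terms, that suggestion is a one-parameter change to the query above - facet.method=fc in place of facet.method=enum; the host here is illustrative:

  http://localhost:8983/solr/select?q=in_subset:1&rows=0&facet=true&facet.field=text&facet.limit=500&facet.mincount=3&facet.method=fc
)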
Re: Highest frequency terms for a subset of documents
Thanks, but that's what I started with; it took an even longer time and threw this:

  Approaching too many values for UnInvertedField faceting on field 'text' : bucket size=15560140
  Approaching too many values for UnInvertedField faceting on field 'text' : bucket size=15619075
  Exception during facet counts:org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field text

On Thu, Apr 21, 2011 at 2:11 AM, Jonathan Rochkind rochk...@jhu.edu wrote:
> BUT, I see your problem: don't use facet.method=enum. Use facet.method=fc.
> It works a LOT better for very high arity fields (lots and lots of unique
> values) like you have.
Re: Highest frequency terms for a subset of documents
Seems like the facet search is not all that suited for a full-text field (http://search.lucidimagination.com/search/document/178f1a82ff19070c/solr_severe_error_when_doing_a_faceted_search#16562790cda76197). Maybe I should go another direction. I'm thinking of the HighFreqTerms approach, just not sure how to start.

On Thu, Apr 21, 2011 at 2:23 AM, Ofer Fort o...@tra.cx wrote:
> Thanks, but that's what I started with; it took an even longer time and
> threw this:
> Exception during facet counts:org.apache.solr.common.SolrException: Too
> many values for UnInvertedField faceting on field text
Re: Highest frequency terms for a subset of documents
: thanks, but that's what i started with, but it took an even longer time and
: threw this:
: Approaching too many values for UnInvertedField faceting on field 'text' :
: bucket size=15560140
: Approaching too many values for UnInvertedField faceting on field 'text' :
: bucket size=15619075
: Exception during facet counts:org.apache.solr.common.SolrException: Too many
: values for UnInvertedField faceting on field text

right ... facet.method=fc is a good default, but cases like full text faceting can cause it to seriously blow up the memory ... i didn't even realize it was possible to get it to fail this way, i would have just expected an OutOfMemoryError.

facet.method=enum is probably your best bet in this situation precisely because it does a linear scan over the terms ... it's slower because it's safer. the one speed up you might be able to get is to ensure you don't use the filterCache -- that way you don't waste time constantly caching/overwriting DocSets

and FWIW...

: If facet search is not the correct approach, i thought about using
: something like org.apache.lucene.misc.HighFreqTerms, but i'm not sure how
: to do this in solr. Should i implement a request handler that executes
: this kind of code?

HighFreqTerms just looks at the raw docfreq for the terms, nearly identical to the TermsComponent -- there is no way to deal with your subset of documents requirement using an approach like that.

If the number of subsets you have to deal with is fixed, finite, and non-overlapping, using distinct cores for each subset (which you can aggregate using distributed search when you don't want this type of query) can also be a wise choice in many situations (ie: if you have a books core and a movies core you can search both using distributed search, or hit the terms component on just one of them to get the top terms for that core)

-Hoss
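(A sketch of that multi-core layout, with core names and host purely illustrative; the /terms handler assumes the TermsComponent is wired up in solrconfig.xml:

  # aggregated search across both subset cores via distributed search
  http://localhost:8983/solr/books/select?q=solr&shards=localhost:8983/solr/books,localhost:8983/solr/movies

  # top terms for just one subset, straight from the TermsComponent
  http://localhost:8983/solr/books/terms?terms.fl=text&terms.limit=500
)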
Re: Highest frequency terms for a subset of documents
On Wed, Apr 20, 2011 at 7:34 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
> the one speed up you might be able to get is to ensure you don't use the
> filterCache -- that way you don't waste time constantly
> caching/overwriting DocSets

Right - or only use the filterCache for high-df terms via http://wiki.apache.org/solr/SimpleFacetParameters#facet.enum.cache.minDf

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
Re: Highest frequency terms for a subset of documents
Thanks, but I've disabled the cache already, since my concern is speed and I'm willing to pay the price (memory), and my subsets are not fixed. Does the facet search do any extra work that I don't need, that I might be able to disable (either by a flag or by a code change)? Somehow I feel, or rather hope, that counting the terms of 200K documents and finding the top 500 should take less than 30 seconds.

On Thu, Apr 21, 2011 at 2:41 AM, Yonik Seeley yo...@lucidimagination.com wrote:
> Right - or only use the filterCache for high-df terms via
> http://wiki.apache.org/solr/SimpleFacetParameters#facet.enum.cache.minDf
Re: Highest frequency terms for a subset of documents
BTW, I'm using Solr 1.4.1 - does 3.1 or 4.0 contain any performance improvements that would make a difference for facet search?

Thanks again,
Ofer

On Thu, Apr 21, 2011 at 2:45 AM, Ofer Fort o...@tra.cx wrote:
> Thanks, but I've disabled the cache already, since my concern is speed and
> I'm willing to pay the price (memory), and my subsets are not fixed.
Re: Highest frequency terms for a subset of documents
On Wed, Apr 20, 2011 at 7:45 PM, Ofer Fort o...@tra.cx wrote:
> Thanks, but I've disabled the cache already, since my concern is speed and
> I'm willing to pay the price (memory)

Then you should not disable the cache.

> , and my subsets are not fixed. Does the facet search do any extra work
> that I don't need, that I might be able to disable (either by a flag or by
> a code change)? Somehow I feel, or rather hope, that counting the terms of
> 200K documents and finding the top 500 should take less than 30 seconds.

Using facet.enum.cache.minDf should be a little faster than just disabling the cache - it's a different code path. Using the cache selectively will speed things up, so try setting that minDf to 1000 or so, for example.

How many unique terms do you have in the index? Is this Solr 3.1? There were some optimizations when there are many terms to iterate over. You could also try trunk, which has even more optimizations, or the bulkpostings branch if you really want to experiment.

-Yonik
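(As a sketch, the suggested request via SolrJ, reusing the params from the original post; the 1000 is just the suggested starting point to tune from:

  import org.apache.solr.client.solrj.SolrQuery;

  // enum faceting, with the filterCache used only for terms whose df >= 1000
  static SolrQuery minDfFacetQuery() {
    SolrQuery q = new SolrQuery("in_subset:1");
    q.setRows(0);
    q.setFacet(true);
    q.addFacetField("text");
    q.setFacetLimit(500);
    q.setFacetMinCount(3);
    q.set("facet.method", "enum");
    q.set("facet.enum.cache.minDf", 1000);
    return q;
  }
)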
Re: Highest frequency terms for a subset of documents
My documents are user entries, so I'm guessing they vary a lot. Tomorrow I'll try 3.1 and also 4.0, and see if they bring an improvement. Thanks guys!

On Thu, Apr 21, 2011 at 3:02 AM, Yonik Seeley yo...@lucidimagination.com wrote:
> Using facet.enum.cache.minDf should be a little faster than just disabling
> the cache - it's a different code path. Using the cache selectively will
> speed things up, so try setting that minDf to 1000 or so, for example.