Re: Highest frequency terms for a subset of documents
OK, so I copied my index and ran Solr 3.1 against it. Qtime dropped from about 40s to 17s! This is good news, but still longer than I hoped for. I tried to run the same test with 4.0, but I'm getting IndexFormatTooOldException since my index was created using 1.4.1. Is my only option for testing this to reindex using 3.1 or 4.0?

Another strange behavior is that the Qtime seems pretty stable no matter how many objects match my query: 200K and 20K both take about 17s. I would have guessed that, since the time goes to iterating over all the terms of all the subset documents, more documents would mean more time.

Thanks for any insights,
ofer

On Thu, Apr 21, 2011 at 3:07 AM, Ofer Fort o...@tra.cx wrote:
> my documents are user entries, so i'm guessing they vary a lot. Tomorrow
> i'll try 3.1 and also 4.0, and see if they bring an improvement. thanks guys!
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 9:24 AM, Ofer Fort o...@tra.cx wrote:
> Another strange behavior is that the Qtime seems pretty stable, no matter
> how many objects match my query. 200K and 20K both take about 17s. I would
> have guessed that, since the time goes to iterating over all the terms of
> all the subset documents, more documents would mean more time.

facet.method=enum steps over all terms in the index for that field... that takes time regardless of how many documents are in the base set. There are also short-circuit methods that avoid looking at the docs for a term if its docfreq is low enough that it couldn't possibly make it into the priority queue. Because of this, it can actually be faster to facet on a larger base set (try *:* as the base query).

Actually, it might be interesting to see the query time if you set facet.mincount equal to the number of docs in the base set - that will test pretty much just the time to enumerate over the terms, without doing any set intersections at all. Be careful not to set mincount greater than the number of docs in the base set though - solr will short-circuit that too and skip enumeration altogether.

The work on the bulkpostings branch should definitely speed up your case even more - but I have no idea when it will land on trunk.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
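(For illustration, that diagnostic boils down to a single request, reusing the params from the original post. This is just a sketch - the host is made up, and 200000 stands in for the exact size of the base set, which must not be exceeded:

  http://localhost:8983/solr/select?q=in_subset:1&rows=0&facet=true&facet.field=text&facet.method=enum&facet.limit=500&facet.mincount=200000

If this comes back in roughly the same 17s, nearly all the time is going to term enumeration rather than to set intersections.)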
Re: Highest frequency terms for a subset of documents
Not sure I fully understand. If facet.method=enum steps over all terms in the index for that field, then what does setting q=field:subset do? And if I set q=*:*, then how do I get the frequency only on my subset?

Ofer

On Thu, Apr 21, 2011 at 4:40 PM, Yonik Seeley yo...@lucidimagination.com wrote:
> facet.method=enum steps over all terms in the index for that field... that
> takes time regardless of how many documents are in the base set.
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 9:44 AM, Ofer Fort o...@tra.cx wrote:
> Not sure I fully understand. If facet.method=enum steps over all terms in
> the index for that field, then what does setting q=field:subset do? And if
> I set q=*:*, then how do I get the frequency only on my subset?

It's an implementation detail. Faceting *does* give you counts only for documents that match q=field:subset. How it computes them is a different matter (i.e. for facet.method=enum, it must step over all terms in the field), so it's closer to O(nterms in field) than O(ndocs in base set).

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
Re: Highest frequency terms for a subset of documents
I see, thanks. So if I wanted to implement something that fits my needs, would going through the subset of documents and counting all the terms in each one be faster? And easier to implement?

On Thu, Apr 21, 2011 at 5:36 PM, Yonik Seeley yo...@lucidimagination.com wrote:
> It's an implementation detail. Faceting *does* give you counts only for
> documents that match q=field:subset. How it computes them is a different
> matter (i.e. for facet.method=enum, it must step over all terms in the
> field), so it's closer to O(nterms in field) than O(ndocs in base set).
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 10:41 AM, Ofer Fort o...@tra.cx wrote:
> I see, thanks. So if I wanted to implement something that fits my needs,
> would going through the subset of documents and counting all the terms in
> each one be faster? And easier to implement?

That's not just your needs, that's everyone's needs (it's the definition of field faceting). There's no way to do what you're asking with a term enumerator (i.e. facet.method=enum). Going through documents and counting all the terms in each is what facet.method=fc does. But it's also not great when the number of unique terms per document is high. If you can think of a better way, go for it!

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
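(For reference, a minimal sketch of doing that per-document counting by hand against raw Lucene, using the 3.x API. It assumes the text field was indexed with termVectors="true" - which is not the default - and that docIds holds the internal Lucene ids of the subset:

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.TermFreqVector;

  // Sum term frequencies over a subset of docs; the caller then keeps
  // the top 500 entries by count, e.g. with a priority queue.
  static Map<String, Integer> countSubsetTerms(IndexReader reader, int[] docIds)
      throws IOException {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (int docId : docIds) {
      TermFreqVector tfv = reader.getTermFreqVector(docId, "text");
      if (tfv == null) continue;  // no term vector stored for this doc
      String[] terms = tfv.getTerms();
      int[] freqs = tfv.getTermFrequencies();
      for (int i = 0; i < terms.length; i++) {
        Integer c = counts.get(terms[i]);
        counts.put(terms[i], c == null ? freqs[i] : c + freqs[i]);
      }
    }
    return counts;
  }

Like fc, this is O(ndocs in subset), but it pays a per-document disk read for each term vector, and storing the vectors grows the index.)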
Re: Highest frequency terms for a subset of documents
So if I want to use facet.method=fc, is there a way to speed it up? And remove the bucket size limitation?

On Thu, Apr 21, 2011 at 5:58 PM, Yonik Seeley yo...@lucidimagination.com wrote:
> Going through documents and counting all the terms in each is what
> facet.method=fc does. But it's also not great when the number of unique
> terms per document is high.
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort o...@tra.cx wrote:
> So if I want to use facet.method=fc, is there a way to speed it up? And
> remove the bucket size limitation?

Not really - else we would have done it already ;-) We don't really have great methods for faceting on full-text fields (as opposed to shorter meta-data fields) today.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
Re: Highest frequency terms for a subset of documents
Well, it was worth a try ;-) But when using facet.method=fc, will reducing the subset size reduce the time and memory? Meaning, is it O(ndocs of the set)?

Thanks

On Thursday, April 21, 2011, Yonik Seeley yo...@lucidimagination.com wrote:
> Not really - else we would have done it already ;-) We don't really have
> great methods for faceting on full-text fields (as opposed to shorter
> meta-data fields) today.
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 6:25 PM, Ofer Fort o...@tra.cx wrote:
> Well, it was worth a try ;-) But when using facet.method=fc, will reducing
> the subset size reduce the time and memory? Meaning, is it O(ndocs of the
> set)?

facet.method=fc builds a multi-valued FieldCache-like structure (UnInvertedField) the first time, which is used for counting facets for all subsequent requests. So the faceting time (after the first time) is O(ndocs of the set), but the UnInvertedField singleton uses a large amount of memory unrelated to any particular base docset.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
Re: Highest frequency terms for a subset of documents
So I'm guessing my best approach now would be to test trunk, and hope that, as 3.1 cut the query time in half, trunk will do the same.

Thanks for the info,
Ofer

On Friday, April 22, 2011, Yonik Seeley yo...@lucidimagination.com wrote:
> facet.method=fc builds a multi-valued FieldCache-like structure
> (UnInvertedField) the first time, which is used for counting facets for
> all subsequent requests.
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 6:34 PM, Ofer Fort o...@tra.cx wrote:
> So I'm guessing my best approach now would be to test trunk, and hope
> that, as 3.1 cut the query time in half, trunk will do the same.

Trunk probably won't be much better... but the bulkpostings branch possibly could be.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
Re: Highest frequency terms for a subset of documents
Ok, I'll give it a try, as this is a server I am willing to risk. How is the compatibility between the SolrJ versions of bulkpostings, trunk, 3.1 and 1.4.1?

On Friday, April 22, 2011, Yonik Seeley yo...@lucidimagination.com wrote:
> Trunk probably won't be much better... but the bulkpostings branch
> possibly could be.
Re: Highest frequency terms for a subset of documents
On Thu, Apr 21, 2011 at 6:50 PM, Ofer Fort o...@tra.cx wrote:
> Ok, I'll give it a try, as this is a server I am willing to risk. How is
> the compatibility between the SolrJ versions of bulkpostings, trunk, 3.1
> and 1.4.1?

bulkpostings, trunk, and 3.1 should all be relatively SolrJ compatible. But the SolrJ javabin format (used by default for queries) changed for strings between 1.4.1 and 3.1 (SOLR-2034).

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
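(A sketch of one way around the javabin mismatch while client and server versions differ: have the SolrJ client request and parse XML responses instead of javabin. Slower, but the XML response format should be stable across these versions; the URL below is illustrative:

  import java.net.MalformedURLException;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.impl.XMLResponseParser;

  // Build a client that uses XML rather than javabin, sidestepping
  // the SOLR-2034 string-format change.
  static SolrServer xmlClient() throws MalformedURLException {
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    server.setParser(new XMLResponseParser());
    return server;
  }
)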
Re: Highest frequency terms for a subset of documents
Ok, thanks.

On Friday, April 22, 2011, Yonik Seeley yo...@lucidimagination.com wrote:
> bulkpostings, trunk, and 3.1 should all be relatively SolrJ compatible.
> But the SolrJ javabin format (used by default for queries) changed for
> strings between 1.4.1 and 3.1 (SOLR-2034).
RE: Highest frequency terms for a subset of documents
I think faceting is probably the best way to do that, indeed. It might be slow, but it's kind of set up for exactly that case; I can't imagine any other technique being faster -- there's stuff that has to be done to look up the info you want.

BUT, I see your problem: don't use facet.method=enum. Use facet.method=fc. It works a LOT better for very high arity fields (lots and lots of unique values) like you have. I bet you'll see a significant speed-up if you use facet.method=fc instead, hopefully fast enough to be workable. With facet.method=enum I would indeed have predicted it would be horribly slow; before Solr 1.4, when facet.method=fc became available, it was nearly impossible to facet on very high arity fields. facet.method=fc is the magic. I think facet.method=fc is even the default in Solr 1.4+, if you hadn't explicitly set it to enum instead!

Jonathan

From: Ofer Fort [ofer...@gmail.com]
Sent: Wednesday, April 20, 2011 6:49 PM
To: solr-user@lucene.apache.org
Subject: Highest frequency terms for a subset of documents

Hi,
I am looking for the best way to find the terms with the highest frequency for a given subset of documents (terms in the text field). My first thought was to do a count facet search, where the query defines the subset of documents and the facet.field is the text field. This gives me the result, but it is very very slow. These are my params:

<lst name="params">
  <str name="facet">true</str>
  <str name="facet.offset">0</str>
  <str name="facet.mincount">3</str>
  <str name="indent">on</str>
  <str name="facet.limit">500</str>
  <str name="facet.method">enum</str>
  <str name="wt">xml</str>
  <str name="rows">0</str>
  <str name="version">2.2</str>
  <str name="facet.sort">count</str>
  <str name="q">in_subset:1</str>
  <str name="facet.field">text</str>
</lst>

The index contains 7M documents, the subset is about 200K. A simple query for the subset takes around 100ms, but the facet search takes 40s. Am I doing something wrong? If facet search is not the correct approach, I thought about using something like org.apache.lucene.misc.HighFreqTerms, but I'm not sure how to do this in solr. Should I implement a request handler that executes this kind of code?

Thanks for any help
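(In request terms, that suggestion is a one-parameter change to the query above - facet.method=fc in place of facet.method=enum; the host here is illustrative:

  http://localhost:8983/solr/select?q=in_subset:1&rows=0&facet=true&facet.field=text&facet.limit=500&facet.mincount=3&facet.method=fc
)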
Re: Highest frequency terms for a subset of documents
Thanks, but that's what I started with; it took an even longer time and threw this:

  Approaching too many values for UnInvertedField faceting on field 'text' : bucket size=15560140
  Approaching too many values for UnInvertedField faceting on field 'text' : bucket size=15619075
  Exception during facet counts:org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field text

On Thu, Apr 21, 2011 at 2:11 AM, Jonathan Rochkind rochk...@jhu.edu wrote:
> BUT, I see your problem: don't use facet.method=enum. Use facet.method=fc.
> It works a LOT better for very high arity fields (lots and lots of unique
> values) like you have.
Re: Highest frequency terms for a subset of documents
Seems like the facet search is not all that suited for a full-text field (http://search.lucidimagination.com/search/document/178f1a82ff19070c/solr_severe_error_when_doing_a_faceted_search#16562790cda76197). Maybe I should go another direction. I'm thinking of the HighFreqTerms approach, just not sure how to start.

On Thu, Apr 21, 2011 at 2:23 AM, Ofer Fort o...@tra.cx wrote:
> Thanks, but that's what I started with; it took an even longer time and
> threw this:
> Exception during facet counts:org.apache.solr.common.SolrException: Too
> many values for UnInvertedField faceting on field text
Re: Highest frequency terms for a subset of documents
: thanks, but that's what i started with, but it took an even longer time and
: threw this:
: Approaching too many values for UnInvertedField faceting on field 'text' :
: bucket size=15560140
: Approaching too many values for UnInvertedField faceting on field 'text' :
: bucket size=15619075
: Exception during facet counts:org.apache.solr.common.SolrException: Too many
: values for UnInvertedField faceting on field text

right ... facet.method=fc is a good default, but cases like full text faceting can cause it to seriously blow up the memory ... i didn't even realize it was possible to get it to fail this way, i would have just expected an OutOfMemoryError.

facet.method=enum is probably your best bet in this situation precisely because it does a linear scan over the terms ... it's slower because it's safer. the one speed up you might be able to get is to ensure you don't use the filterCache -- that way you don't waste time constantly caching/overwriting DocSets

and FWIW...

: If facet search is not the correct approach, i thought about using
: something like org.apache.lucene.misc.HighFreqTerms, but i'm not sure how
: to do this in solr. Should i implement a request handler that executes
: this kind of code?

HighFreqTerms just looks at the raw docfreq for the terms, nearly identical to the TermsComponent -- there is no way to deal with your subset of documents requirement using an approach like that.

If the number of subsets you have to deal with is fixed, finite, and non-overlapping, using distinct cores for each subset (which you can aggregate using distributed search when you don't want this type of query) can also be a wise choice in many situations (ie: if you have a books core and a movies core you can search both using distributed search, or hit the terms component on just one of them to get the top terms for that core)

-Hoss
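(A sketch of that multi-core layout, with core names and host purely illustrative; the /terms handler assumes the TermsComponent is wired up in solrconfig.xml:

  # aggregated search across both subset cores via distributed search
  http://localhost:8983/solr/books/select?q=solr&shards=localhost:8983/solr/books,localhost:8983/solr/movies

  # top terms for just one subset, straight from the TermsComponent
  http://localhost:8983/solr/books/terms?terms.fl=text&terms.limit=500
)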
Re: Highest frequency terms for a subset of documents
On Wed, Apr 20, 2011 at 7:34 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
> the one speed up you might be able to get is to ensure you don't use the
> filterCache -- that way you don't waste time constantly
> caching/overwriting DocSets

Right - or only use the filterCache for high-df terms via http://wiki.apache.org/solr/SimpleFacetParameters#facet.enum.cache.minDf

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
Re: Highest frequency terms for a subset of documents
Thanks, but I've disabled the cache already, since my concern is speed and I'm willing to pay the price (memory), and my subsets are not fixed. Does the facet search do any extra work that I don't need, that I might be able to disable (either by a flag or by a code change)? Somehow I feel, or rather hope, that counting the terms of 200K documents and finding the top 500 should take less than 30 seconds.

On Thu, Apr 21, 2011 at 2:41 AM, Yonik Seeley yo...@lucidimagination.com wrote:
> Right - or only use the filterCache for high-df terms via
> http://wiki.apache.org/solr/SimpleFacetParameters#facet.enum.cache.minDf
Re: Highest frequency terms for a subset of documents
BTW, I'm using Solr 1.4.1 - does 3.1 or 4.0 contain any performance improvements that would make a difference for facet search?

Thanks again,
Ofer

On Thu, Apr 21, 2011 at 2:45 AM, Ofer Fort o...@tra.cx wrote:
> Thanks, but I've disabled the cache already, since my concern is speed and
> I'm willing to pay the price (memory), and my subsets are not fixed.
Re: Highest frequency terms for a subset of documents
On Wed, Apr 20, 2011 at 7:45 PM, Ofer Fort o...@tra.cx wrote:
> Thanks, but I've disabled the cache already, since my concern is speed and
> I'm willing to pay the price (memory)

Then you should not disable the cache.

> , and my subsets are not fixed. Does the facet search do any extra work
> that I don't need, that I might be able to disable (either by a flag or by
> a code change)? Somehow I feel, or rather hope, that counting the terms of
> 200K documents and finding the top 500 should take less than 30 seconds.

Using facet.enum.cache.minDf should be a little faster than just disabling the cache - it's a different code path. Using the cache selectively will speed things up, so try setting that minDf to 1000 or so, for example.

How many unique terms do you have in the index? Is this Solr 3.1? There were some optimizations when there are many terms to iterate over. You could also try trunk, which has even more optimizations, or the bulkpostings branch if you really want to experiment.

-Yonik
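(As a sketch, the suggested request via SolrJ, reusing the params from the original post; the 1000 is just the suggested starting point to tune from:

  import org.apache.solr.client.solrj.SolrQuery;

  // enum faceting, with the filterCache used only for terms whose df >= 1000
  static SolrQuery minDfFacetQuery() {
    SolrQuery q = new SolrQuery("in_subset:1");
    q.setRows(0);
    q.setFacet(true);
    q.addFacetField("text");
    q.setFacetLimit(500);
    q.setFacetMinCount(3);
    q.set("facet.method", "enum");
    q.set("facet.enum.cache.minDf", 1000);
    return q;
  }
)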
Re: Highest frequency terms for a subset of documents
My documents are user entries, so I'm guessing they vary a lot. Tomorrow I'll try 3.1 and also 4.0, and see if they bring an improvement. Thanks guys!

On Thu, Apr 21, 2011 at 3:02 AM, Yonik Seeley yo...@lucidimagination.com wrote:
> Using facet.enum.cache.minDf should be a little faster than just disabling
> the cache - it's a different code path. Using the cache selectively will
> speed things up, so try setting that minDf to 1000 or so, for example.