RE: Facets and running out of Heap Space

2007-10-10 Thread David Whalen
It looks now like I can't use facets the way I was hoping
to because the memory requirements are impractical.

So, as an alternative I was thinking I could get counts
by doing rows=0 and using filter queries.  

Is there a reason to think that this might perform better?
Or, am I simply moving the problem to another step in the
process?
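
Something like this is the kind of request I have in mind (host, port
and the example value are just placeholders), one request per facet
value, reading the count off numFound:

  http://localhost:8983/solr/select?q=*:*&fq=media_type:blog&rows=0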

DW

  

 


Re: Facets and running out of Heap Space

2007-10-10 Thread Mike Klaas

On 10-Oct-07, at 12:19 PM, David Whalen wrote:


It looks now like I can't use facets the way I was hoping
to because the memory requirements are impractical.


I can't remember if this has been mentioned, but upping the  
HashDocSet size is one way to reduce memory consumption.  Whether  
this will work well depends greatly on the cardinality of your facet  
sets.  facet.enum.cache.minDf set high is another option (it will not  
generate a bitset for any value whose facet set is smaller than this  
value).


Both options have performance implications.
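
For reference, the HashDocSet cutoff is set in solrconfig.xml -- the
values below are just the ones from the stock example config, not a
recommendation:

  <HashDocSet maxSize="3000" loadFactor="0.75"/>

facet.enum.cache.minDf, by contrast, is an ordinary request parameter.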


So, as an alternative I was thinking I could get counts
by doing rows=0 and using filter queries.

Is there a reason to think that this might perform better?
Or, am I simply moving the problem to another step in the
process?


Running one query per unique facet value seems impractical, if that  
is what you are suggesting.  Setting minDf to a very high value  
should always outperform such an approach.


-Mike



RE: Facets and running out of Heap Space

2007-10-10 Thread David Whalen
According to Yonik I can't use minDf because I'm faceting
on a string field.  I'm thinking of changing it to a tokenized
type so that I can utilize this setting, but then I'll have to
rebuild my entire index.

Unless there's some way around that?
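
The sort of thing I mean is a TextField type with a keyword tokenizer,
so each value stays intact as a single token but the field is no longer
a plain string type (just a sketch -- the type name is made up and the
element names may need adjusting for my schema version):

  <fieldType name="string_facet" class="solr.TextField" omitNorms="true">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
  </fieldType>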


  



Re: Facets and running out of Heap Space

2007-10-10 Thread Mike Klaas

On 10-Oct-07, at 2:40 PM, David Whalen wrote:


According to Yonik I can't use minDf because I'm faceting
on a string field.  I'm thinking of changing it to a tokenized
type so that I can utilize this setting, but then I'll have to
rebuild my entire index.

Unless there's some way around that?


For the fields that matter (many unique values), this is likely to  
result in a performance regression.


It might be better to try storing less unique data.  For instance,  
faceting on the blog_url or create_date fields in your schema would  
cause problems (they probably have millions of unique values).


It would be helpful to know which field is causing the problem.  One  
way would be to do a sorted query on a quiescent index for each  
field, and see if there are any suspiciously large jumps in memory  
usage.
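
For example, something like this for each candidate field, against an
otherwise idle instance (host/port are placeholders, and the exact sort
syntax depends on your handler/version); sorting on a field forces its
FieldCache entry to be built, so the jump in heap after each query is
roughly that field's cost:

  http://localhost:8983/solr/select?q=*:*&rows=1&sort=journalist_id+asc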


-Mike







RE: Facets and running out of Heap Space

2007-10-10 Thread David Whalen
I'll see what I can do about that.

Truthfully, the most important facet we need is the one on
media_type, which has only 4 unique values.  The second
most important one to us is location, which has about 30
unique values.

So, it would seem like we actually need a counter-intuitive
solution.  That's why I thought Field Queries might be the
solution.

Is there some reason to avoid setting multiValued to true
here?  It sounds like it would be the true cure-all

Thanks again!

dave


  



Re: Facets and running out of Heap Space

2007-10-10 Thread Mike Klaas

On 10-Oct-07, at 3:46 PM, David Whalen wrote:


I'll see what I can do about that.

Truthfully, the most important facet we need is the one on
media_type, which has only 4 unique values.  The second
most important one to us is location, which has about 30
unique values.

So, it would seem like we actually need a counter-intuitive
solution.  That's why I thought Field Queries might be the
solution.

Is there some reason to avoid setting multiValued to true
here?  It sounds like it would be the true cure-all


Should work.  It would cost about 100 MB on a 25m corpus for those  
two fields.


Have you tried setting multivalued=true without reindexing?  I'm not  
sure, but I think it will work.
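
That is, just adding the attribute to the existing declarations from
the schema you posted, e.g.:

  <field name="media_type" type="string" indexed="true" stored="true" multiValued="true" />
  <field name="location" type="string" indexed="true" stored="true" multiValued="true" />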


-Mike





Re: Facets and running out of Heap Space

2007-10-10 Thread Yonik Seeley
On 10/10/07, Mike Klaas [EMAIL PROTECTED] wrote:
 Have you tried setting multivalued=true without reindexing?  I'm not
 sure, but I think it will work.

Yes, that will work fine.
One thing that will change is the response format for stored fields:
  <arr name="foo"><str>val1</str></arr>
instead of
  <str name="foo">val1</str>

Hopefully in the future we can specify a faceting method w/o having to
change the schema.

-Yonik


Facets and running out of Heap Space

2007-10-09 Thread David Whalen
Hi All.

I run a faceted query against a very large index on a 
regular schedule.  Every now and then the query throws
an out of heap space error, and we're sunk.

So, naturally we increased the heap size and things worked
well for a while and then the errors would happen again.
We've increased the initial heap size to 2.5GB and it's
still happening.

Is there anything we can do about this?

Thanks in advance,

Dave W


Re: Facets and running out of Heap Space

2007-10-09 Thread Yonik Seeley
On 10/9/07, David Whalen [EMAIL PROTECTED] wrote:
 I run a faceted query against a very large index on a
 regular schedule.  Every now and then the query throws
 an out of heap space error, and we're sunk.

 So, naturally we increased the heap size and things worked
 well for a while and then the errors would happen again.
 We've increased the initial heap size to 2.5GB and it's
 still happening.

 Is there anything we can do about this?

Try facet.enum.cache.minDf param:
http://wiki.apache.org/solr/SimpleFacetParameters
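
It's a plain request parameter, e.g. (the cutoff of 100 is only an
illustration; the parameter applies when the term-enumeration faceting
method is used):

  ...&facet=true&facet.field=site_id&facet.enum.cache.minDf=100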

-Yonik


RE: Facets and running out of Heap Space

2007-10-09 Thread David Whalen
Hi Yonik.

According to the doc:


 This is only used during the term enumeration method of
 faceting (facet.field type faceting on multi-valued or
 full-text fields). 

What if I'm faceting on just a plain String field?  It's
not full-text, and I don't have multiValued set for it


Dave




Re: Facets and running out of Heap Space

2007-10-09 Thread Yonik Seeley
On 10/9/07, David Whalen [EMAIL PROTECTED] wrote:
  This is only used during the term enumeration method of
  faceting (facet.field type faceting on multi-valued or
  full-text fields).

 What if I'm faceting on just a plain String field?  It's
 not full-text, and I don't have multiValued set for it

Then you will be using the FieldCache counting method, and this param
is not applicable :-)
Are all the fields that you facet on like this?

The FieldCache entry might be taking up too much room, esp if the
number of entries is high, and the entries are big.  The requests
themselves can take up a good chunk of memory temporarily (4 bytes *
nValuesInField).

You could try a memory profiling tool and see where all the memory is
being taken up too.

-Yonik


Re: Facets and running out of Heap Space

2007-10-09 Thread Chris Hostetter

: So, naturally we increased the heap size and things worked
: well for a while and then the errors would happen again.
: We've increased the initial heap size to 2.5GB and it's
: still happening.

is this the same 25,000,000 document index you mentioned before?

2.5GB of heap doesn't seem like much if you are also doing faceting ... 
even if you are faceting on an int field, there's going to be 95MB of 
FieldCache for that field.  You said this was a string field, so it's going 
to be 95MB + however much space is needed for all the terms 
(presumably if you are faceting on this field every doc doesn't have a 
unique value, but even assuming a conservative 10% unique values of 10 
characters each, that's another ~50MB).  So we're up to about 150MB of 
FieldCache to facet that one field -- and we haven't even started talking 
about how big the index itself is (or how big the filterCache gets, or 
how many other fields you are faceting on).
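
Spelling out that arithmetic for one 25M-doc string facet field (same
assumptions: ~10% unique values of ~10 characters each, counting Java's
2 bytes per char and ignoring per-String overhead):

  ord array:   25,000,000 docs  * 4 bytes            ~=  95MB
  term values:  2,500,000 terms * 10 chars * 2 bytes ~=  50MB
                                                         -----
  FieldCache for that one field                      ~= 150MB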

how big is your index on disk? are you faceting or sorting on other fields 
as well?

what does the LukeRequest Handler tell you about the # of distinct terms 
in each field that you facet on?




-Hoss



RE: Facets and running out of Heap Space

2007-10-09 Thread David Whalen
 Make sure you have:
   <requestHandler name="/admin/luke"
     class="org.apache.solr.handler.admin.LukeRequestHandler" />
 defined in solrconfig.xml

What's the consequence of me changing the solrconfig.xml file?
Doesn't that cause a restart of solr?

 for a large index, this can be very slow but the results are valuable.

In what way?  I'm still not clear on what this does for me


 -Original Message-
 From: Ryan McKinley [mailto:[EMAIL PROTECTED] 
 Sent: Tuesday, October 09, 2007 4:01 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Facets and running out of Heap Space
 
  
  what does the LukeRequest Handler tell you about the # of distinct 
  terms in each field that you facet on?
  
  Where would I find that?  
 
 check:
 http://wiki.apache.org/solr/LukeRequestHandler
 
  Make sure you have:
    <requestHandler name="/admin/luke"
      class="org.apache.solr.handler.admin.LukeRequestHandler" />
  defined in solrconfig.xml
 
 for a large index, this can be very slow but the results are valuable.
 
 ryan
 
 


Re: Facets and running out of Heap Space

2007-10-09 Thread Ryan McKinley

David Whalen wrote:

Make sure you have:
  <requestHandler name="/admin/luke"
    class="org.apache.solr.handler.admin.LukeRequestHandler" />
defined in solrconfig.xml


What's the consequence of me changing the solrconfig.xml file?
Doesn't that cause a restart of solr?



editing solrconfig.xml does *not* restart solr.

But you need to restart solr to see any changes to solrconfig.



for a large index, this can be very slow but the results are valuable.


In what way?  I'm still not clear on what this does for me



It gives you all kinds of index statistics - that may or may not be 
useful in figuring out how big field caches will need to be.


It is just a diagnostics tool, not a fix.
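
For example, with the handler registered as above, hitting

  http://localhost:8983/solr/admin/luke

(host and port are placeholders) returns per-field statistics, including
the distinct-term counts Hoss was asking about; the wiki page above
documents the optional parameters.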

ryan



Re: Facets and running out of Heap Space

2007-10-09 Thread Mike Klaas

On 9-Oct-07, at 12:36 PM, David Whalen wrote:


<field name="id" type="string" indexed="true" stored="true" />
<field name="content_date" type="date" indexed="true" stored="true" />
<field name="media_type" type="string" indexed="true" stored="true" />
<field name="location" type="string" indexed="true" stored="true" />
<field name="country_code" type="string" indexed="true" stored="true" />
<field name="text" type="text" indexed="true" stored="true" multiValued="true" />
<field name="content_source" type="string" indexed="true" stored="true" />

<field name="title" type="string" indexed="true" stored="true" />
<field name="site_id" type="string" indexed="true" stored="true" />
<field name="journalist_id" type="string" indexed="true" stored="true" />

<field name="blog_url" type="string" indexed="true" stored="true" />
<field name="created_date" type="date" indexed="true" stored="true" />

I'm sure we could stop storing many of these columns, especially
if someone told me that would make a big difference.


I don't think that it would make a difference in memory consumption,  
but storage is certainly not necessary for faceting.  Extra stored  
fields can slow down search if they are large (in terms of bytes),  
but don't really occupy extra memory, unless they are polluting the  
doc cache.  Does 'text' need to be stored?
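
If it doesn't, that's just a matter of flipping the flag in the schema
you posted (a sketch; existing documents keep their stored copy until
they are reindexed):

  <field name="text" type="text" indexed="true" stored="false" multiValued="true" />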



what does the LukeRequest Handler tell you about the # of
distinct terms in each field that you facet on?


Where would I find that?  I could probably estimate that myself
on a per-column basis.  it ranges from 4 distinct values for
media_type to 30-ish for location to 200-ish for country_code
to almost 10,000 for site_id to almost 100,000 for journalist_id.


Using the filter cache method on the things like media type and  
location; this will occupy ~2.3MB of memory _per unique value_, so it  
should be a net win for those (although quite close in space  
requirements for a 30-ary field on your index size).


-Mike


Re: Facets and running out of Heap Space

2007-10-09 Thread Stu Hood
 Using the filter cache method on the things like media type and
 location; this will occupy ~2.3MB of memory _per unique value_

Mike, how did you calculate that value? I'm trying to tune my caches, and any 
equations that could be used to determine some balanced settings would be 
extremely helpful. I'm in a memory limited environment, so I can't afford to 
throw a ton of cache at the problem.

(I don't want to thread-jack, but I'm also wondering whether anyone has any 
notes on how to tune cache sizes for the filterCache, queryResultCache and 
documentCache).

Thanks,
Stu




Re: Facets and running out of Heap Space

2007-10-09 Thread Mike Klaas

On 9-Oct-07, at 7:53 PM, Stu Hood wrote:


Using the filter cache method on the things like media type and
location; this will occupy ~2.3MB of memory _per unique value_


Mike, how did you calculate that value? I'm trying to tune my  
caches, and any equations that could be used to determine some  
balanced settings would be extremely helpful. I'm in a memory  
limited environment, so I can't afford to throw a ton of cache at  
the problem.


8bits * 25m docs.  Note that HashDocSet filters will be smaller  
(cardinality < 3000).


(I don't want to thread-jack, but I'm also wondering whether anyone  
has any notes on how to tune cache sizes for the filterCache,  
queryResultCache and documentCache).


I'll give the usual Solr answer: it depends <g>.  For me:

The filterCache is the most important.  I want my faceting filters to  
be there at all times, as well as the common fq's I throw at Solr.  I  
have this bumped up to 4096 or so.


The queryResultCache isn't too important.  I'm mostly interested in  
keeping around a few recent queries since they tend to be  
reexecuted.  There is generally not a whole lot of overlap, though,  
and I never page very far into the results (10 results over 100  
slaves is more than I typically would ever need).  Memory usage is  
quite low, though, so you might have success going nuts with this cache.


docCache? Make sure this is set to at least maxResults * max  
concurrent queries, since the query processing sometimes assumes  
fetching a document earlier in the request will let us retrieve it  
for free later in the request from the cache.  Other than that, it  
depends on your document usage overlap.  If you have a set of  
documents needed for meta-data storage, it behooves you to make sure  
these are always cached.
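
Concretely, all three live in the <query> section of solrconfig.xml.
A sketch using the 4096 figure above -- the other numbers are
placeholders to tune for your own query mix (documentCache can't be
autowarmed, since internal doc ids change between index versions):

  <filterCache      class="solr.LRUCache" size="4096" initialSize="1024" autowarmCount="512"/>
  <queryResultCache class="solr.LRUCache" size="256"  initialSize="64"   autowarmCount="32"/>
  <documentCache    class="solr.LRUCache" size="1024" initialSize="256"  autowarmCount="0"/>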


cheers,
-Mike


Cache Memory Usage (was: Facets and running out of Heap Space)

2007-10-09 Thread Stu Hood
Sorry... where do the unique values come into the equation?



Also, you say that the queryResultCache memory usage is very low... how
could this be when it is storing the same information as the
filterCache, but with the addition of sorting?



Your answers are very helpful, thanks!

Stu Hood
Webmail.us
You manage your business. We'll manage your email.®

Re: Cache Memory Usage (was: Facets and running out of Heap Space)

2007-10-09 Thread Mike Klaas

On 9-Oct-07, at 8:28 PM, Stu Hood wrote:


Sorry... where do the unique values come into the equation?


Faceting.  You should have a filterCache >= # unique values in all  
fields faceted on using the filterCache method.


Also, you say that the queryResultCache memory usage is very low... how
could this be when it is storing the same information as the
filterCache, but with the addition of sorting?


Solr caches only the top N documents in the queryResultCache (boosted  
by queryResultWindowSize), which amounts to 40-odd ints, 40-odd  
floats, and change.


-Mike