It's actually, as I understand it, expected JVM behavior to see the heap rise close to its limit before it gets GC'd; that's how Java garbage collection works. Whether it should happen every 20 seconds, I don't know.

Another option is setting better JVM garbage collection arguments, so GC doesn't "stop the world" so often. I have had good luck with my Solr using this: -XX:+UseParallelGC
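For example (just a sketch; heap sizes and paths are whatever fits your box), when starting the Jetty that ships with Solr you can pass it alongside the heap settings, plus GC logging so you can see how often collections run and how long they pause:

  java -Xms2g -Xmx4g -XX:+UseParallelGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -jar start.jar

The -verbose:gc / -XX:+PrintGC* flags are optional, but the GC log is usually more telling than watching spikes in jconsole.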





On 3/14/2011 4:15 PM, Doğacan Güney wrote:
Hello again,

2011/3/14 Markus Jelsma <markus.jel...@openindex.io>

Hello,

2011/3/14 Markus Jelsma <markus.jel...@openindex.io>

Hi Doğacan,

Are you, at some point, running out of heap space? In my experience, that's the common cause of increased load and excessively high response times (or timeouts).
How much heap would be enough? Our index size is growing slowly, but we did not have this problem a couple of weeks ago, when the index was maybe 100MB smaller.
How much heap space is needed isn't easy to say. It usually needs to be increased when you run out of memory and get those nasty OOM errors; are you getting them? Replication events will increase heap usage due to cache warming queries and autowarming.
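(By warming queries I mean the ones configured under the newSearcher/firstSearcher listeners in solrconfig.xml; the stock example config has something along these lines, plus whatever autowarmCount is set on the caches:

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">solr</str><str name="start">0</str><str name="rows">10</str></lst>
    </arr>
  </listener>

Every one of those runs against the new searcher right after replication, so they do show up as a heap and CPU bump.)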


Nope, no OOM errors.


We left most of the caches in solrconfig as default and only increased filterCache to 1024. We only ask for "id"s (which are unique) and no other fields during queries (though we do faceting). Btw, 1.6GB of our index is stored fields (we store everything for now, even though we do not fetch them during queries), and about 1GB is the rest of the index.
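The filterCache line in our solrconfig.xml is basically the stock one with the size bumped; roughly this (I'm quoting the attributes other than size from memory):

  <filterCache class="solr.FastLRUCache" size="1024" initialSize="1024" autowarmCount="0"/>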
Hmm, it seems 4000m would be enough indeed. What about the fieldCache: are there a lot of entries? Is there an insanity count? Do you use boost functions?


Insanity count is 0 and fieldCache has 12 entries. We do use some boosting functions.

Btw, I am monitoring via jconsole with 8GB of heap and it still climbs to 8GB every 20 seconds or so; GC runs and it falls back down to 1GB.
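(For anyone who wants to see the same thing without jconsole, plain jstat against the Solr pid works too, e.g.

  jstat -gcutil <solr-pid> 5000

which prints eden/old/perm occupancy percentages and GC counts every 5 seconds.)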

Btw, our current revision was just a random choice, but up until two weeks ago it had been rock-solid, so we have been reluctant to update to another version. Would you recommend upgrading to the latest trunk?


It might not have anything to do with memory at all, but I'm just asking. There may be a bug in your revision causing this.

Anyway, Xmx was 4000m; we tried increasing it to 8000m but did not get any improvement in load. I can try monitoring with jconsole with 8 gigs of heap to see if it helps.

Cheers,

Hello everyone,

First of all, here is our Solr setup:

- Solr nightly build 986158
- Running Solr inside the default Jetty that comes with the Solr build
- 1 write-only master, 4 read-only slaves (quad core 5640 with 24GB of RAM)
- Index replicated (on optimize) to slaves via Solr replication (config sketched below)
- Size of index is around 2.5GB
- No incremental writes; index is created from scratch (delete old documents -> commit new documents -> optimize) every 6 hours
- Avg # of requests per second is around 60 (for a single slave)
- Avg time per request is around 25ms (before having problems)
- Load on each slave is around 2
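The replication part is the standard ReplicationHandler setup, roughly like this (host name and poll interval here are placeholders, not our exact values). On the master:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="replicateAfter">optimize</str>
    </lst>
  </requestHandler>

and on each slave:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://master-host:8983/solr/replication</str>
      <str name="pollInterval">00:01:00</str>
    </lst>
  </requestHandler>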

We have been using this setup for months without any problem. However, last week we started to experience very weird performance problems like:

- Avg time per request increased from 25ms to 200-300ms (even higher if we don't restart the slaves)
- Load on each slave increased from 2 to 15-20 (Solr uses 400%-600% CPU)

When we profile Solr, we see two very strange things:

1 - This is the jconsole output:

https://skitch.com/meralan/rwwcf/mail-886x691

As you see, GC runs every 10-15 seconds and collects more than 1GB of memory. (Actually, if you wait more than 10 minutes you consistently see spikes up to 4GB.)

2 - This is the New Relic output:

https://skitch.com/meralan/rwwci/solr-requests-solr-new-relic-rpm

As you see, Solr spends a ridiculously long time in the SolrDispatchFilter.doFilter() method.


Apart from these, when we clean the index directory, re-replicate, and restart each slave one by one, the system gets some relief, but after some time the servers start to melt down again. Although deleting the index and re-replicating doesn't solve the problem, we think these problems are somehow related to replication, because the symptoms started after a replication and, at one point, the system healed itself after a replication. I also see lucene-write.lock files on the slaves (we don't have write.lock files on the master), which I think we shouldn't see.


If anyone has any sort of ideas, we would appreciate it.

Regards,
Dogacan Guney

