RE: entire farm fails at the same time with OOM issues

2010-12-09 Thread Chris Hostetter

I'm not sure if you resolved this issue, but...

: It has typically been when query traffic was lowest!  We are at 12 GB 

...that doesn't mean it couldn't have been query load related.  it's 
possible that some unusual query (ie: trying to sort on many fields at 
the same time?) could have forced the memory usage to spike (because of 
the field cache).  depending on how your load balancer is set up, the OOM 
on one box could have caused it to fail over to the next box, which also 
OOMed, etc...

the really annoying part is how hard this sort of thing is to detect, 
because your servlet container's request log usually won't log a request 
until after it's finished and all the data has been written back to the 
client -- it may have never been logged because of the OOM.

If your load balancer keeps a request log, you could try checking it.  this 
could be something as simple as a bot doing a slow crawl of some very 
badly constructed URLs.
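One low-tech way to check for that kind of query in a request log, assuming a common access-log format (the log path, URL shape, and sort syntax below are illustrative, not from this thread):

```shell
# Illustrative only: flag requests that sort on 3+ fields at once.
# In practice, point LOG at your balancer's or container's real request log.
LOG=./access.log.sample

# Stand-in sample lines for demonstration:
cat > "$LOG" <<'EOF'
GET /solr/select?q=tv&sort=price+asc HTTP/1.1
GET /solr/select?q=*:*&sort=price+asc,popularity+desc,name+asc,date+desc HTTP/1.1
EOF

# Extract each sort= parameter; keep only those with 3+ comma-separated fields.
grep -o 'sort=[^& ]*' "$LOG" | awk -F',' 'NF >= 3'
```

A burst of multi-field sorts right before the OOM timestamps would be a strong suspect, since each new sort field populates another field cache entry.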


-Hoss


RE: entire farm fails at the same time with OOM issues

2010-12-01 Thread Robert Petersen
Good idea.  Our farm is behind Akamai so that should be ok to do.

-Original Message-
From: Peter Karich [mailto:peat...@yahoo.de] 
Sent: Wednesday, December 01, 2010 12:21 PM
To: solr-user@lucene.apache.org
Subject: Re: entire farm fails at the same time with OOM issues


also try to reduce maxWarmingSearchers to 1(?) or 2.
And decrease cache usage (especially autowarming) if possible at all. 
But again: only if it doesn't affect performance ...

Regards,
Peter.

> On Tue, Nov 30, 2010 at 6:04 PM, Robert Petersen wrote:
>> My question is this.  Why in the world would all of my slaves, after
>> running fine for some days, suddenly all at the exact same minute
>> experience OOM heap errors and go dead?
> If there is no change in query traffic when this happens, then it's
> due to what the index looks like.
>
> My guess is a large index merge happened, which means that when the
> searchers re-open on the new index, it requires more memory than
> normal (much less can be shared with the previous index).
>
> I'd try bumping the heap a little bit, and then optimizing once a day
> during off-peak hours.
> If you still get OOM errors, bump the heap a little more.
>
> -Yonik
> http://www.lucidimagination.com



Re: entire farm fails at the same time with OOM issues

2010-12-01 Thread Peter Karich

also try to reduce maxWarmingSearchers to 1(?) or 2.
And decrease cache usage (especially autowarming) if possible at all. 
But again: only if it doesn't affect performance ...
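For reference, those knobs live in solrconfig.xml and look roughly like this (the cache sizes and autowarmCount values here are placeholders, not a recommendation for this particular index):

```xml
<!-- Illustrative solrconfig.xml fragment only; tune sizes to your hit rates. -->
<query>
  <!-- cap the number of searchers warming concurrently -->
  <maxWarmingSearchers>2</maxWarmingSearchers>

  <!-- autowarmCount="0" disables autowarming for a cache -->
  <filterCache      class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache"     size="512" initialSize="512" autowarmCount="0"/>
</query>
```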


Regards,
Peter.


On Tue, Nov 30, 2010 at 6:04 PM, Robert Petersen  wrote:

My question is this.  Why in the world would all of my slaves, after
running fine for some days, suddenly all at the exact same minute
experience OOM heap errors and go dead?

If there is no change in query traffic when this happens, then it's
due to what the index looks like.

My guess is a large index merge happened, which means that when the
searchers re-open on the new index, it requires more memory than
normal (much less can be shared with the previous index).

I'd try bumping the heap a little bit, and then optimizing once a day
during off-peak hours.
If you still get OOM errors, bump the heap a little more.

-Yonik
http://www.lucidimagination.com




Re: entire farm fails at the same time with OOM issues

2010-12-01 Thread Ken Krugler


On Nov 30, 2010, at 5:16pm, Robert Petersen wrote:


> What would I do with the heap dump though?  Run one of those java heap
> analyzers looking for memory leaks or something?  I have no experience
> with those.  I saw there was a bug fix in solr 1.4.1 for a 100 byte
> memory leak occurring on each commit, but it would take thousands of
> commits to make that add up to anything right?

Typically when I run out of memory in Solr, it's during an index
update, when the new index searcher is getting warmed up.

Looking at the heap often shows ways to reduce memory requirements,
e.g. you'll see a really big chunk used for a sorted field.

See http://wiki.apache.org/solr/SolrCaching and
http://wiki.apache.org/solr/SolrPerformanceFactors for more details.


-- Ken




--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g


RE: entire farm fails at the same time with OOM issues

2010-12-01 Thread Robert Petersen
It has typically been when query traffic was lowest!  We are at a 12 GB heap, 
so I will try to bump it to 14 GB.  We have 64GB main memory installed now.  
Here are our settings; do they look OK?

export JAVA_OPTS="-Xmx12228m -Xms12228m -XX:+UseConcMarkSweepGC 
-XX:+CMSIncrementalMode"
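(Aside: 12 GB is 12288m, so the -Xmx12228m above is probably a typo.)  A 14 GB variant that also folds in Ken's dump-on-OOM flags might look like the sketch below; the dump path is an example only, and the dump file will be roughly heap-sized, so point it at a disk with room:

```shell
# Sketch only -- 14 GB = 14336m; the dump path is illustrative.
export JAVA_OPTS="-Xms14336m -Xmx14336m \
 -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode \
 -XX:+HeapDumpOnOutOfMemoryError \
 -XX:HeapDumpPath=/var/tmp/solr-oom.hprof"
```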



-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Tuesday, November 30, 2010 6:44 PM
To: solr-user@lucene.apache.org
Subject: Re: entire farm fails at the same time with OOM issues

On Tue, Nov 30, 2010 at 6:04 PM, Robert Petersen  wrote:
> My question is this.  Why in the world would all of my slaves, after
> running fine for some days, suddenly all at the exact same minute
> experience OOM heap errors and go dead?

If there is no change in query traffic when this happens, then it's
due to what the index looks like.

My guess is a large index merge happened, which means that when the
searchers re-open on the new index, it requires more memory than
normal (much less can be shared with the previous index).

I'd try bumping the heap a little bit, and then optimizing once a day
during off-peak hours.
If you still get OOM errors, bump the heap a little more.

-Yonik
http://www.lucidimagination.com


Re: entire farm fails at the same time with OOM issues

2010-11-30 Thread Yonik Seeley
On Tue, Nov 30, 2010 at 6:04 PM, Robert Petersen  wrote:
> My question is this.  Why in the world would all of my slaves, after
> running fine for some days, suddenly all at the exact same minute
> experience OOM heap errors and go dead?

If there is no change in query traffic when this happens, then it's
due to what the index looks like.

My guess is a large index merge happened, which means that when the
searchers re-open on the new index, it requires more memory than
normal (much less can be shared with the previous index).

I'd try bumping the heap a little bit, and then optimizing once a day
during off-peak hours.
If you still get OOM errors, bump the heap a little more.
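The once-a-day optimize can be wired up as a cron job on the master (the time, host, and port below are examples; this uses the update handler's standard optimize=true parameter):

```shell
# crontab fragment -- master only, off-peak; host/port are examples
30 3 * * * curl -s 'http://localhost:8983/solr/update?optimize=true' > /dev/null
```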

-Yonik
http://www.lucidimagination.com


RE: entire farm fails at the same time with OOM issues

2010-11-30 Thread Robert Petersen
What would I do with the heap dump though?  Run one of those java heap
analyzers looking for memory leaks or something?  I have no experience
with those.  I saw there was a bug fix in solr 1.4.1 for a 100 byte memory
leak occurring on each commit, but it would take thousands of commits to
make that add up to anything, right?
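For reference (not advice given in this thread): the usual options are loading the .hprof into a GUI analyzer such as Eclipse MAT, or using the JDK's own tools.  The commands below assume a running Tomcat and a dump at a hypothetical path:

```shell
# Quick live histogram of the biggest heap consumers (no dump file needed);
# the pgrep pattern is an assumption about how Tomcat appears in ps:
jmap -histo:live "$(pgrep -f catalina)" | head -20

# Or browse an actual OOM dump with the JDK 6 built-in jhat
# (serves a UI on http://localhost:7000 once loaded):
jhat -J-mx16g /var/tmp/solr-oom.hprof
```

In either view, a handful of entries dominating the heap (e.g. field cache arrays for sorted fields) points at what to tune.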

-Original Message-
From: Ken Krugler [mailto:kkrugler_li...@transpac.com] 
Sent: Tuesday, November 30, 2010 3:12 PM
To: solr-user@lucene.apache.org
Subject: Re: entire farm fails at the same time with OOM issues

Hi Robert,

I'd recommend launching Tomcat with -XX:+HeapDumpOnOutOfMemoryError  
and -XX:HeapDumpPath=, so then  
you have something to look at versus a Gedankenexperiment :)

-- Ken

On Nov 30, 2010, at 3:04pm, Robert Petersen wrote:

> Greetings, we are running one master and four slaves of our multicore
> solr setup.  We just served searches for our catalog of 8 million
> products with this farm during black Friday and cyber Monday, our
> busiest days of the year, and the servers did not break a sweat!   
> Index
> size is about 28GB.
>
> However, twice now recently during a time of low load we have had a  
> fire
> drill where I have seen tomcat/solr fail and become unresponsive after
> some OOM heap errors.  Solr wouldn't even serve up its admin pages.
> I've had to go in and manually knock tomcat out of memory and then
> restart it.  These solr slaves are load balanced and the load  
> balancers
> always probe the solr slaves so if they stop serving up searches they
> are automatically removed from the load balancer.  When all four  
> fail at
> the same time we have an issue!
>
> My question is this.  Why in the world would all of my slaves, after
> running fine for some days, suddenly all at the exact same minute
> experience OOM heap errors and go dead?  The load balancer kicks them
> all out at the same time each time.  Each slave only talks to the  
> master
> and not to each other, but the master show no errors in the logs at  
> all.
> Something must be triggering this though.  The only other odd thing I
> saw in the logs was after the first OOM errors were recorded, the  
> slaves
> started occasionally not being able to get to the master.
>
> This behavior makes me a little nervous...=:-o  eek!
>
>
>
>
>
> Environment:  Lucid Imagination distro of Solr 1.4 on Tomcat
>
>
>
> Platform: RHEL with Sun JRE 1.6.0_18 on dual quad xeon machines with
> 64GB memory etc etc
>
>
>
>
>
>
>




Re: entire farm fails at the same time with OOM issues

2010-11-30 Thread Ken Krugler

Hi Robert,

I'd recommend launching Tomcat with -XX:+HeapDumpOnOutOfMemoryError  
and -XX:HeapDumpPath=, so then  
you have something to look at versus a Gedankenexperiment :)
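For Tomcat specifically, the conventional place for such flags is a setenv.sh next to catalina.sh (the file name is Tomcat convention, not stated in this thread).  HeapDumpPath may also point at a directory, in which case the JVM writes java_pid<pid>.hprof there:

```shell
# e.g. in $CATALINA_HOME/bin/setenv.sh -- sketch only;
# pick a dump directory with heap-sized free space
CATALINA_OPTS="$CATALINA_OPTS -XX:+HeapDumpOnOutOfMemoryError \
 -XX:HeapDumpPath=/var/tmp"
export CATALINA_OPTS
```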


-- Ken







--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g


entire farm fails at the same time with OOM issues

2010-11-30 Thread Robert Petersen
Greetings, we are running one master and four slaves of our multicore
solr setup.  We just served searches for our catalog of 8 million
products with this farm during Black Friday and Cyber Monday, our
busiest days of the year, and the servers did not break a sweat!  Index
size is about 28GB.

 

However, twice now recently during a time of low load we have had a fire
drill where I have seen tomcat/solr fail and become unresponsive after
some OOM heap errors.  Solr wouldn't even serve up its admin pages.
I've had to go in and manually knock tomcat out of memory and then
restart it.  These solr slaves are load balanced and the load balancers
always probe the solr slaves so if they stop serving up searches they
are automatically removed from the load balancer.  When all four fail at
the same time we have an issue!

 

My question is this.  Why in the world would all of my slaves, after
running fine for some days, suddenly all at the exact same minute
experience OOM heap errors and go dead?  The load balancer kicks them
all out at the same time each time.  Each slave only talks to the master
and not to each other, but the master shows no errors in the logs at all.
Something must be triggering this though.  The only other odd thing I
saw in the logs was after the first OOM errors were recorded, the slaves
started occasionally not being able to get to the master.

 

This behavior makes me a little nervous...=:-o  eek!

 

 

Environment:  Lucid Imagination distro of Solr 1.4 on Tomcat  

 

Platform: RHEL with Sun JRE 1.6.0_18 on dual quad xeon machines with
64GB memory etc etc