Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load

2013-07-27 Thread Erick Erickson
What is your autocommit limit? Is it possible that your transaction
logs are simply getting too large? tlogs are truncated whenever
you do a hard commit (autocommit), with openSearcher either
true or false; it doesn't matter which.

FWIW,
Erick

On Fri, Jul 26, 2013 at 12:56 AM, Tim Vaillancourt t...@elementspace.com wrote:

 snip


Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load

2013-07-27 Thread Tim Vaillancourt

Thanks for the reply Erick,

Hard Commit - 15000ms, openSearcher=false
Soft Commit - 1000ms, openSearcher=true
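
In solrconfig.xml terms that should be roughly (a sketch of the relevant
updateHandler settings, not a paste of my actual config):

    <autoCommit>
      <maxTime>15000</maxTime>            <!-- hard commit: flushes the index and rolls the tlog -->
      <openSearcher>false</openSearcher>  <!-- no new searcher on hard commit -->
    </autoCommit>
    <autoSoftCommit>
      <maxTime>1000</maxTime>             <!-- soft commit: visibility only -->
    </autoSoftCommit>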

15sec hard commit was sort of a guess; I could try a smaller number. 
When you say "getting too large", what limit do you think it would be 
hitting: a ulimit (nofiles), disk space, number of changes, or a limit in 
Solr itself?


By my math there would be 15 tlogs max per core, but I don't really know 
how it all works, so if someone could fill me in or point me somewhere, 
that would be great.


Cheers,

Tim

On 27/07/13 07:57 AM, Erick Erickson wrote:

snip


Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load

2013-07-27 Thread Jack Krupansky
No hard numbers, but the general guidance is that you should set your hard 
commit interval to match your expectations for how quickly nodes should come 
up if they need to be restarted. Specifically, a hard commit assures that 
all changes have been committed to disk and are ready for immediate access 
on restart, but any and all soft-commit changes since the last hard commit 
must be replayed (re-executed) on restart of a node.


How long does it take to replay the changes in the update log? No firm 
numbers, but treat it as if all of those uncommitted updates had to be 
resent and reprocessed by Solr. It's probably faster than that, but you get 
the picture.


I would suggest thinking in terms of minutes rather than seconds for hard 
commits: 5, 10, 15, 20, or 30 minutes.


Hard commits may result in kicking off segment merges, so too rapid a rate 
of segment creation might cause problems or at least be counterproductive.


So, instead of 15 seconds, try 15 minutes.
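
In solrconfig.xml terms, something like this (a sketch):

    <autoCommit>
      <maxTime>900000</maxTime>  <!-- 15 minutes -->
      <openSearcher>false</openSearcher>
    </autoCommit>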

OTOH, if you really need to handle 4,000 updates a second... you are clearly 
in uncharted territory and should expect to do some heavy-duty trial-and-error 
tuning on your own.


-- Jack Krupansky

-Original Message- 
From: Tim Vaillancourt

Sent: Saturday, July 27, 2013 4:21 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud 4.3.1 - Failure to open existing log file (non 
fatal) errors under high load


snip



Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load

2013-07-27 Thread Erick Erickson
Tim:

15 seconds isn't unreasonable, I was mostly wondering if it was hours.

Take a look at the size of the tlogs as you're indexing; you should see them
truncate every 15 seconds or so. There'll be a varying number of tlogs kept
around, although under heavy indexing I'd only expect 1 or 2 inactive ones;
the internal rule is that enough tlogs are kept around to hold the last
100 docs.

There should only be 1 open tlog/core as I understand it. When a commit
happens (hard, openSearcher = true or false doesn't matter) the current
tlog is closed and a new one opened. Then some cleanup happens so there
are only enough tlogs kept around to hold 100 docs.
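
A quick way to watch that while indexing is something like this (path
illustrative; point it at each core's data directory):

    watch -n 5 'ls -lh /path/to/core/data/tlog'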

Strange; I'm kind of out of ideas.
Erick

On Sat, Jul 27, 2013 at 4:41 PM, Jack Krupansky j...@basetechnology.com wrote:
 snip

Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load

2013-07-27 Thread Tim Vaillancourt

Thanks Jack/Erick,

I don't know if this is true or not, but I've read there is a tlog per 
soft commit, which is then truncated by the hard commit. If that were 
true, a 15sec hard commit with a 1sec soft commit could generate around 
15 tlogs, but I've never checked. I like Erick's scenario more if it is 
1 tlog/core, though. I'll try to find out some more.



A few other tests/things I really should try for sanity:
- Java 1.6 and Jetty 8: just to rule things out (wouldn't actually 
launch this way).
- ulimit for 'nofiles': the default is pretty high, but why not? (A 
quick check is sketched below.)
- Monitor the size and number of tlogs.
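
For the nofiles check, something like this should show where things stand
(assumes Jetty was started via start.jar; commands illustrative):

    SOLR_PID=$(pgrep -f start.jar)
    grep -i 'open files' /proc/$SOLR_PID/limits   # current limit
    ls /proc/$SOLR_PID/fd | wc -l                 # handles in use right now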


I'll be sure to share findings and really appreciate the help guys!


PS: This is asking a lot, but if anyone can take a look at that thread 
dump, or give me some pointers on what to look for in a stall/thread-pileup 
thread dump like this, I would really appreciate it. I'm quite weak at 
deciphering those (I use Thread Dump Analyzer), but I'm sure it would tell 
a lot.
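
The most I know to do is a rough count of threads by state, something like
(file name illustrative):

    grep 'java.lang.Thread.State' solrcloud.nodeC.jstack | sort | uniq -c | sort -rn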



Cheers,


Tim


On 27/07/13 02:24 PM, Erick Erickson wrote:

snip

SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load

2013-07-25 Thread Tim Vaillancourt
Hey guys,

I am reaching out to the Solr list with a very vague issue: under high load
against a SolrCloud 4.3.1 cluster of 3 instances, 3 shards, 2 replicas (2
cores per instance), I eventually see failure messages related to
transaction logs, and shortly after these stacktraces occur the cluster
starts to fall apart.

To explain my setup:
- SolrCloud 4.3.1.
- Jetty 9.x.
- Oracle/Sun JDK 1.7.25 w/CMS.
- RHEL 6.x 64-bit.
- 3 instances, 1 per server.
- 3 shards.
- 2 replicas per shard.

The transaction log error I receive after about 10-30 minutes of load
testing is:

ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.SolrException]
Failure to open existing log file (non fatal)
/opt/easw/easw_apps/easo_solr_cloud/solr/xmshd_shard3_replica2/data/tlog/tlog.078:org.apache.solr.common.SolrException:
java.io.EOFException
        at org.apache.solr.update.TransactionLog.<init>(TransactionLog.java:182)
        at org.apache.solr.update.UpdateLog.init(UpdateLog.java:233)
        at org.apache.solr.update.UpdateHandler.initLog(UpdateHandler.java:83)
        at org.apache.solr.update.UpdateHandler.<init>(UpdateHandler.java:138)
        at org.apache.solr.update.UpdateHandler.<init>(UpdateHandler.java:125)
        at org.apache.solr.update.DirectUpdateHandler2.<init>(DirectUpdateHandler2.java:95)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
        at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:525)
        at org.apache.solr.core.SolrCore.createUpdateHandler(SolrCore.java:596)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:805)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:618)
        at org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:894)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:982)
        at org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:597)
        at org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:592)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.EOFException
        at org.apache.solr.common.util.FastInputStream.readUnsignedByte(FastInputStream.java:73)
        at org.apache.solr.common.util.FastInputStream.readInt(FastInputStream.java:216)
        at org.apache.solr.update.TransactionLog.readHeader(TransactionLog.java:266)
        at org.apache.solr.update.TransactionLog.<init>(TransactionLog.java:160)
        ... 25 more


Eventually, after a few of these stack traces, the cluster starts to lose
shards and replicas fail. Jetty then piles up hung threads until it hits
OutOfMemory on native threads due to the maximum process ulimit.

I know this is quite a vague issue, so I'm not expecting a silver-bullet
answer, but I was wondering if anyone has suggestions on where to look
next? Does this sound Solr-related at all, or possibly system-related? Has
anyone seen this issue before, or has any hypothesis on how to find out more?

I will reply shortly with a thread dump, taken from 1 locked-up node.

Thanks for any suggestions!

Tim


Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load

2013-07-25 Thread Tim Vaillancourt
Stack trace:

http://timvaillancourt.com.s3.amazonaws.com/tmp/solrcloud.nodeC.2013-07-25-16.jstack.gz

Cheers!

Tim


On 25 July 2013 16:44, Tim Vaillancourt t...@elementspace.com wrote:

 snip



Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load

2013-07-25 Thread Shawn Heisey

On 7/25/2013 5:44 PM, Tim Vaillancourt wrote:

The transaction log error I receive after about 10-30 minutes of load
testing is:

ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.SolrException]
Failure to open existing log file (non fatal)
/opt/easw/easw_apps/easo_solr_cloud/solr/xmshd_shard3_replica2/data/tlog/tlog.078:org.apache.solr.common.SolrException:
java.io.EOFException


snip


Caused by: java.io.EOFException
        at org.apache.solr.common.util.FastInputStream.readUnsignedByte(FastInputStream.java:73)
        at org.apache.solr.common.util.FastInputStream.readInt(FastInputStream.java:216)
        at org.apache.solr.update.TransactionLog.readHeader(TransactionLog.java:266)
        at org.apache.solr.update.TransactionLog.<init>(TransactionLog.java:160)
        ... 25 more



This looks to me like a system problem.  RHEL should be pretty solid, I 
use CentOS without any trouble.  My initial guesses are a corrupt 
filesystem, failing hardware, or possibly a kernel problem with your 
specific hardware.


I'm running Jetty 8, which is the version that the example uses.  Could 
Jetty 9 be a problem here?  I couldn't really say, though my initial 
guess is that it's not a problem.


I'm running Oracle Java 1.7.0_13.  Normally later releases are better, 
but Java bugs do exist and do get introduced in later releases.  Because 
you're on the absolute latest, I'm guessing that you had the problem 
with an earlier release and upgraded to see if it went away.  If that's 
what happened, it is less likely that it's Java.


My first instinct would be to do a 'yum distro-sync' followed by 'touch 
/forcefsck' and reboot with console access to the server, so that you 
can deal with any fsck problems.  Perhaps you've already tried that. 
I'm aware that this could be very very hard to get pushed through strict 
change management procedures.
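
That is (as root, with console access handy in case fsck needs attention):

    yum distro-sync
    touch /forcefsck
    reboot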


I did some searching.  SOLR-4519 is a different problem, but it looks 
like it has a similar underlying exception, with no resolution.  It was 
filed when Solr 4.1.0 was current.


Could there be a resource problem - heap too small, not enough OS disk 
cache, etc?


Thanks,
Shawn



Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load

2013-07-25 Thread Tim Vaillancourt
Thanks for the reply Shawn, I can always count on you :).

To answer the JVM question: we are using 10GB heaps and have over 100GB of
OS cache free; young gen is about 50% of the heap, all CMS. Our max number of
processes for the JVM user is 10k, which is where Solr dies when it blows
up with 'cannot create native thread'.
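
Roughly, in JVM-flag terms (illustrative, not our exact command line):

    java -Xms10g -Xmx10g -Xmn5g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -jar start.jar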

I also want to say this is system related, but I am seeing this occur on
all 3 servers, which are brand-new Dell R720s. I'm not saying this is
impossible, but I don't see much to suggest that, and it would need to be
one hell of a coincidence.

To add more confusion to the mix, we actually run a 2nd SolrCloud cluster
on the same Solr, Jetty and JVM versions that does not exhibit this issue,
although it uses a completely different schema, servers and access patterns
(it is also high-TPS). That is some evidence that the current software stack
is OK; or maybe this only occurs under an extreme load the 2nd cluster does
not see, or only with a certain schema.

Lastly, to add a bit more detail to my original description, so far I have
tried:

- Entirely rebuilding my cluster from scratch, reinstalling all deps,
configs, reindexing the data (in case I screwed up somewhere). The EXACT
same issue occurs under load about 20-45 minutes in.
- Moving to Java 1.7.0_21 from _25 due to some known bugs. Same issue
occurs after some load.
- Restarting SolrCloud / forcing rebuilds of cores. Same issue occurs after
some load.

Cheers,

Tim


On 25 July 2013 17:13, Shawn Heisey s...@elyograg.org wrote:

 snip




Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load

2013-07-25 Thread Yonik Seeley
On Thu, Jul 25, 2013 at 7:44 PM, Tim Vaillancourt t...@elementspace.com wrote:
 ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.SolrException]
 Failure to open existing log file (non fatal)


That itself isn't necessarily a problem (and why it says "non fatal")
- it just means that most likely a transaction log file was
truncated from a previous crash.  It may be unrelated to the other
issues you are seeing.

-Yonik
http://lucidworks.com


Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load

2013-07-25 Thread Shawn Heisey
On 7/25/2013 6:53 PM, Tim Vaillancourt wrote:
 Thanks for the reply Shawn, I can always count on you :).
 
 We are using 10GB heaps and have over 100GB of OS cache free to answer the
 JVM question, Young has about 50% of the heap, all CMS. Our max number of
 processes for the JVM user is 10k, which is where Solr dies when it blows
 up with 'cannot create native thread'.
 
 I also want to say this is system related, but I am seeing this occur on
 all 3 servers, which are brand-new Dell R720s. I'm not saying this is
 impossible, but I don't see much to suggest that, and it would need to be
 one hell of a coincidence.

Nice hardware.  I have some R720xd servers for another project unrelated
to Solr, love them.

I know a little about Dell servers.  If you haven't done so already, I
would install the OpenManage repo and get the firmware fully updated -
BIOS, RAID, and LAN in particular.  Instructions that are pretty easy to
follow:

http://linux.dell.com/repo/hardware/latest/

For process/file limits, I have the following in
/etc/security/limits.conf on systems that aren't using Cloud:

ncindex hard nproc  6144
ncindex soft nproc  4096

ncindex hard nofile 65535
ncindex soft nofile 49151

 To add more confusion to the mix, we actually run a 2nd SolrCloud cluster
 on the same Solr, Jetty and JVM versions that do not exhibit this issue,
 although using a completely different schema, servers and access-patterns,
 although it is also at high-TPS. That is some evidence to say the current
 software stack is OK, or maybe this only occurs under an extreme load that
 2nd cluster does not see, or lastly only with a certain schema.

This is a big reason why I think you should make sure you're fully up to
date on your firmware, as the hardware seems to be one strong
difference.  As much as I love Dell server hardware, firmware issues are
relatively common, especially on early versions of the latest
generation, which includes the R720.

 Lastly, to add a bit more detail to my original description, so far I have
 tried:
 
 - Entirely rebuilding my cluster from scratch, reinstalling all deps,
 configs, reindexing the data (in case I screwed up somewhere). The EXACT
 same issue occurs under load about 20-45 minutes in.
 - Moving to Java 1.7.0_21 from _25 due to some known bugs. Same issue
 occurs after some load.
 - Restarting SolrCloud / forcing rebuilds or cores. Same issue occurs after
 some load.

The only other thing I can think of is increasing your zkClientTimeout
to 30 seconds or so and trying Solr 4.4 so you have SOLR-4899 and
SOLR-4805.  That's very definitely a shot in the dark.
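
For reference, zkClientTimeout is set on the cores element of solr.xml in
4.x-style config, roughly like this (other attributes omitted):

    <cores adminPath="/admin/cores" zkClientTimeout="30000" ...>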

Thanks,
Shawn



Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load

2013-07-25 Thread Tim Vaillancourt

Thanks Shawn and Yonik!

Yonik: I noticed this error appears to be fairly trivial, but it is not 
appearing after a previous crash. Every time I run this high-volume test 
that produced my stack trace, I zero out the logs, Solr data and 
Zookeeper data and start over from scratch with a brand new collection 
and zeroed-out logs.


The test is mostly high volume (2000-4000 updates/sec); at the start 
the SolrCloud runs decently for a good 20-60 minutes, with no errors in 
the logs at all. Then that stack trace occurs on all 3 nodes (staggered), 
I immediately get some "replica down" messages and then some "cannot 
connect" errors to all other cluster nodes, which have all crashed the 
same way. The tlog error could be a symptom of the problem, perhaps 
running out of threads.


Shawn: thanks so much for sharing those details! Yes, they seem to be 
nice servers, for sure; I don't get to touch/see them but they're fast! 
I'll look into firmwares for sure and will try again after updating 
them. These Solr instances are not bare metal; they're actually KVM VMs, 
so that's another layer to look into, although it is consistent between 
the two clusters.


I am not currently increasing the 'nofiles' ulimit above the default like 
you are, but does Solr use 10,000+ file handles? It won't hurt to try it, 
I guess :). To rule out Java 7, I'll probably also try Jetty 8 and Java 
1.6 as an experiment.


Thanks!

Tim

On 25/07/13 05:55 PM, Yonik Seeley wrote:

snip