Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load
What is your autocommit limit? Is it possible that your transaction logs are simply getting too large? tlogs are truncated whenever you do a hard commit (autocommit), with openSearcher either true or false; it doesn't matter.

FWIW,
Erick

On Fri, Jul 26, 2013 at 12:56 AM, Tim Vaillancourt t...@elementspace.com wrote:

Thanks Shawn and Yonik!

Yonik: I noticed this error appears to be fairly trivial, but it is not appearing after a previous crash. Every time I run this high-volume test that produced my stack trace, I zero out the logs, Solr data and Zookeeper data and start over from scratch with a brand new collection and zeroed-out logs. The test is mostly high volume (2000-4000 updates/sec), and at the start the SolrCloud runs decently for a good 20-60 minutes with no errors in the logs at all. Then that stack trace occurs on all 3 nodes (staggered); I immediately get some "replica down" messages and then some "cannot connect" errors to all other cluster nodes, which have all crashed the same way. The tlog error could be a symptom of the problem of running out of threads, perhaps.

Shawn: thanks so much for sharing those details! Yes, they seem to be nice servers, for sure - I don't get to touch/see them, but they're fast! I'll look into firmware for sure and will try again after updating it. These Solr instances are not bare metal; they are actually KVM VMs, so that's another layer to look into, although it is consistent between the two clusters. I am not currently increasing the 'nofiles' ulimit above the default like you are, but does Solr use 10,000+ file handles? It won't hurt to try it, I guess :). To rule out Java 7, I'll probably also try Jetty 8 and Java 1.6 as an experiment. Thanks!

Tim

On 25/07/13 05:55 PM, Yonik Seeley wrote: <snip>
Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load
Thanks for the reply Erick,

Hard Commit - 15000ms, openSearcher=false
Soft Commit - 1000ms, openSearcher=true

15sec hard commit was sort of a guess; I could try a smaller number. When you say "getting too large", what limit do you think it would be hitting: a ulimit (nofiles), disk space, number of changes, a limit in Solr itself? By my math there would be 15 tlogs max per core, but I don't really know how it all works, if someone could fill me in/point me somewhere.

Cheers,
Tim

On 27/07/13 07:57 AM, Erick Erickson wrote: <snip>
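For reference, intervals like the ones above correspond to solrconfig.xml settings along these lines (a sketch of the configuration being discussed, not taken from Tim's actual config):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: flushes to disk and lets old tlogs be cleaned up,
       but does not open a new searcher (openSearcher=false) -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit: makes recent documents visible to searchers -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>
```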
Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load
No hard numbers, but the general guidance is that you should set your hard commit interval to match your expectations for how quickly nodes should come up if they need to be restarted. Specifically, a hard commit assures that all changes have been committed to disk and are ready for immediate access on restart, but any and all soft commit changes since the last hard commit must be replayed (re-executed) on restart of a node.

How long does it take to replay the changes in the update log? No firm numbers, but treat it as if all of those uncommitted updates had to be resent and reprocessed by Solr. It's probably faster than that, but you get the picture.

I would suggest thinking in terms of minutes rather than seconds for hard commits: 5, 10, 15, 20, 30 minutes. Hard commits may result in kicking off segment merges, so too rapid a rate of segment creation might cause problems or at least be counterproductive. So, instead of 15 seconds, try 15 minutes.

OTOH, if you really need to handle 4,000 updates a second... you are clearly in uncharted territory and need to expect to do some heavy-duty trial-and-error tuning on your own.

-- Jack Krupansky

-----Original Message----- From: Tim Vaillancourt Sent: Saturday, July 27, 2013 4:21 PM To: solr-user@lucene.apache.org Subject: Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load <snip>
Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load
Tim:

15 seconds isn't unreasonable; I was mostly wondering if it was hours. Take a look at the size of the tlogs as you're indexing; you should see them truncate every 15 seconds or so. There'll be a varying number of tlogs kept around, although under heavy indexing I'd only expect 1 or 2 inactive ones; the internal number is that there'll be enough tlogs kept around to hold 100 docs.

There should only be 1 open tlog/core as I understand it. When a commit happens (hard, openSearcher = true or false doesn't matter), the current tlog is closed and a new one opened. Then some cleanup happens so there are only enough tlogs kept around to hold 100 docs.

Strange, I'm kind of out of ideas.

Erick

On Sat, Jul 27, 2013 at 4:41 PM, Jack Krupansky j...@basetechnology.com wrote: <snip>
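One way to check this behavior is to watch the tlog directory while indexing. The sketch below fabricates a directory with dummy files purely so the monitoring commands run anywhere; in practice you would point TLOG_DIR at a real core's tlog directory (the exact path under the core's data directory is an assumption about your layout):

```shell
# TLOG_DIR should be a core's tlog directory (e.g. <core>/data/tlog).
# If it is unset, create a throwaway directory with fake tlog files so
# the commands below are demonstrable without a running Solr.
TLOG_DIR="${TLOG_DIR:-$(mktemp -d)}"
if [ -z "$(ls -A "$TLOG_DIR")" ]; then
  for i in 1 2 3; do head -c 2048 /dev/zero > "$TLOG_DIR/tlog.$i"; done
fi
# File count and total size; re-run (or wrap in `watch`) during indexing
# to see tlogs get closed and cleaned up after each hard commit.
echo "tlog files: $(ls "$TLOG_DIR" | wc -l | tr -d ' ')"
du -sh "$TLOG_DIR"
```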
Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load
Thanks Jack/Erick,

I don't know if this is true or not, but I've read there is a tlog per soft commit, which is then truncated by the hard commit. If this were true, a 15sec hard commit with a 1sec soft commit could generate around 15 tlogs, but I've never checked. I like Erick's scenario more if it is 1 tlog/core, though. I'll try to find out some more.

A few more tests/things I really should try for sanity are:
- Java 1.6 and Jetty 8: just to rule things out (wouldn't actually launch this way).
- ulimit for 'nofiles': the default is pretty high, but why not?
- Monitor size + # of tlogs.

I'll be sure to share findings, and really appreciate the help guys!

PS: This is asking a lot, but if anyone can take a look at that thread dump, or give me some pointers on what to look for in a stall/thread-pile-up thread dump like this, I would really appreciate it. I'm quite weak at deciphering those (I use Thread Dump Analyzer), but I'm sure it would tell a lot.

Cheers,
Tim

On 27/07/13 02:24 PM, Erick Erickson wrote: <snip>
SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load
Hey guys,

I am reaching out to the Solr list with a very vague issue: under high load against a SolrCloud 4.3.1 cluster of 3 instances, 3 shards, 2 replicas (2 cores per instance), I eventually see failure messages related to transaction logs, and shortly after these stack traces occur the cluster starts to fall apart.

To explain my setup:
- SolrCloud 4.3.1.
- Jetty 9.x.
- Oracle/Sun JDK 1.7.25 w/CMS.
- RHEL 6.x 64-bit.
- 3 instances, 1 per server.
- 3 shards.
- 2 replicas per shard.

The transaction log error I receive after about 10-30 minutes of load testing is:

ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.SolrException] Failure to open existing log file (non fatal) /opt/easw/easw_apps/easo_solr_cloud/solr/xmshd_shard3_replica2/data/tlog/tlog.078:org.apache.solr.common.SolrException: java.io.EOFException
    at org.apache.solr.update.TransactionLog.init(TransactionLog.java:182)
    at org.apache.solr.update.UpdateLog.init(UpdateLog.java:233)
    at org.apache.solr.update.UpdateHandler.initLog(UpdateHandler.java:83)
    at org.apache.solr.update.UpdateHandler.init(UpdateHandler.java:138)
    at org.apache.solr.update.UpdateHandler.init(UpdateHandler.java:125)
    at org.apache.solr.update.DirectUpdateHandler2.init(DirectUpdateHandler2.java:95)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
    at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:525)
    at org.apache.solr.core.SolrCore.createUpdateHandler(SolrCore.java:596)
    at org.apache.solr.core.SolrCore.init(SolrCore.java:805)
    at org.apache.solr.core.SolrCore.init(SolrCore.java:618)
    at org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:894)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:982)
    at org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:597)
    at org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:592)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.EOFException
    at org.apache.solr.common.util.FastInputStream.readUnsignedByte(FastInputStream.java:73)
    at org.apache.solr.common.util.FastInputStream.readInt(FastInputStream.java:216)
    at org.apache.solr.update.TransactionLog.readHeader(TransactionLog.java:266)
    at org.apache.solr.update.TransactionLog.init(TransactionLog.java:160)
    ... 25 more

Eventually, after a few of these stack traces, the cluster starts to lose shards and replicas fail. Jetty then creates hung threads until hitting OutOfMemory on native threads due to the maximum process ulimit.

I know this is quite a vague issue, so I'm not expecting a silver-bullet answer, but I was wondering if anyone has suggestions on where to look next? Does this sound Solr-related at all, or possibly system? Has anyone seen this issue before, or has any hypothesis how to find out more?

I will reply shortly with a thread dump, taken from 1 locked-up node.

Thanks for any suggestions!

Tim
Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load
Stack trace: http://timvaillancourt.com.s3.amazonaws.com/tmp/solrcloud.nodeC.2013-07-25-16.jstack.gz

Cheers!

Tim

On 25 July 2013 16:44, Tim Vaillancourt t...@elementspace.com wrote: <snip>
Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load
On 7/25/2013 5:44 PM, Tim Vaillancourt wrote:

"The transaction log error I receive after about 10-30 minutes of load testing is: ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.SolrException] Failure to open existing log file (non fatal) /opt/easw/easw_apps/easo_solr_cloud/solr/xmshd_shard3_replica2/data/tlog/tlog.078:org.apache.solr.common.SolrException: java.io.EOFException <snip> Caused by: java.io.EOFException at org.apache.solr.common.util.FastInputStream.readUnsignedByte(FastInputStream.java:73) at org.apache.solr.common.util.FastInputStream.readInt(FastInputStream.java:216) at org.apache.solr.update.TransactionLog.readHeader(TransactionLog.java:266) at org.apache.solr.update.TransactionLog.init(TransactionLog.java:160) ... 25 more"

This looks to me like a system problem. RHEL should be pretty solid; I use CentOS without any trouble. My initial guesses are a corrupt filesystem, failing hardware, or possibly a kernel problem with your specific hardware.

I'm running Jetty 8, which is the version that the example uses. Could Jetty 9 be a problem here? I couldn't really say, though my initial guess is that it's not a problem.

I'm running Oracle Java 1.7.0_13. Normally later releases are better, but Java bugs do exist and do get introduced in later releases. Because you're on the absolute latest, I'm guessing that you had the problem with an earlier release and upgraded to see if it went away. If that's what happened, it is less likely that it's Java.

My first instinct would be to do a 'yum distro-sync' followed by 'touch /forcefsck' and reboot with console access to the server, so that you can deal with any fsck problems. Perhaps you've already tried that. I'm aware that this could be very, very hard to get pushed through strict change-management procedures.

I did some searching. SOLR-4519 is a different problem, but it looks like it has a similar underlying exception, with no resolution. It was filed when Solr 4.1.0 was current.

Could there be a resource problem - heap too small, not enough OS disk cache, etc?

Thanks,
Shawn
Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load
Thanks for the reply Shawn, I can always count on you :). We are using 10GB heaps and have over 100GB of OS cache free to answer the JVM question, Young has about 50% of the heap, all CMS. Our max number of processes for the JVM user is 10k, which is where Solr dies when it blows up with 'cannot create native thread'. I also want to say this is system related, but I am seeing this occur on all 3 servers, which are brand-new Dell R720s. I'm not saying this is impossible, but I don't see much to suggest that, and it would need to be one hell of a coincidence. To add more confusion to the mix, we actually run a 2nd SolrCloud cluster on the same Solr, Jetty and JVM versions that do not exhibit this issue, although using a completely different schema, servers and access-patterns, although it is also at high-TPS. That is some evidence to say the current software stack is OK, or maybe this only occurs under an extreme load that 2nd cluster does not see, or lastly only with a certain schema. Lastly, to add a bit more detail to my original description, so far I have tried: - Entirely rebuilding my cluster from scratch, reinstalling all deps, configs, reindexing the data (in case I screwed up somewhere). The EXACT same issue occurs under load about 20-45 minutes in. - Moving to Java 1.7.0_21 from _25 due to some known bugs. Same issue occurs after some load. - Restarting SolrCloud / forcing rebuilds or cores. Same issue occurs after some load. 
Cheers, Tim On 25 July 2013 17:13, Shawn Heisey s...@elyograg.org wrote: On 7/25/2013 5:44 PM, Tim Vaillancourt wrote: The transaction log error I receive after about 10-30 minutes of load testing is: ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.**SolrException] Failure to open existing log file (non fatal) /opt/easw/easw_apps/easo_solr_**cloud/solr/xmshd_shard3_** replica2/data/tlog/tlog.**078:org.**apache.solr.common.** SolrException: java.io.EOFException snip Caused by: java.io.EOFException at org.apache.solr.common.util.**FastInputStream.**readUnsignedByte(** FastInputStream.java:73) at org.apache.solr.common.util.**FastInputStream.readInt(** FastInputStream.java:216) at org.apache.solr.update.**TransactionLog.readHeader(** TransactionLog.java:266) at org.apache.solr.update.**TransactionLog.init(**TransactionLog.java:160) ... 25 more This looks to me like a system problem. RHEL should be pretty solid, I use CentOS without any trouble. My initial guesses are a corrupt filesystem, failing hardware, or possibly a kernel problem with your specific hardware. I'm running Jetty 8, which is the version that the example uses. Could Jetty 9 be a problem here? I couldn't really say, though my initial guess is that it's not a problem. I'm running Oracle Java 1.7.0_13. Normally later releases are better, but Java bugs do exist and do get introduced in later releases. Because you're on the absolute latest, I'm guessing that you had the problem with an earlier release and upgraded to see if it went away. If that's what happened, it is less likely that it's Java. My first instinct would be to do a 'yum distro-sync' followed by 'touch /forcefsck' and reboot with console access to the server, so that you can deal with any fsck problems. Perhaps you've already tried that. I'm aware that this could be very very hard to get pushed through strict change management procedures. I did some searching. 
SOLR-4519 is a different problem, but it looks like it has a similar underlying exception, with no resolution. It was filed when Solr 4.1.0 was current.

Could there be a resource problem - heap too small, not enough OS disk cache, etc?

Thanks,
Shawn
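The 'cannot create native thread' failure and the resource questions above can be probed directly from a shell. A minimal Linux sketch, not from the thread itself: it uses /proc/self so it runs standalone, but in practice you would substitute the Solr JVM's pid (e.g. via pgrep).

```shell
# Per-user process cap that OutOfMemoryError "unable to create new native
# thread" runs into; on Linux, every JVM thread counts against this limit
ulimit -u

# Live thread count of a process via /proc (shown for the current shell;
# substitute the Solr JVM's pid for 'self' to watch Solr's thread growth)
ls /proc/self/task | wc -l
```

Watching the second number climb toward the first under load would support the thread-exhaustion theory.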
Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load
On Thu, Jul 25, 2013 at 7:44 PM, Tim Vaillancourt t...@elementspace.com wrote:
ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.SolrException] Failure to open existing log file (non fatal)

That itself isn't necessarily a problem (and why it says non fatal) - it just means that most likely a transaction log file was truncated from a previous crash. It may be unrelated to the other issues you are seeing.

-Yonik
http://lucidworks.com
Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load
On 7/25/2013 6:53 PM, Tim Vaillancourt wrote:
Thanks for the reply Shawn, I can always count on you :). We are using 10GB heaps and have over 100GB of OS cache free to answer the JVM question, Young has about 50% of the heap, all CMS. Our max number of processes for the JVM user is 10k, which is where Solr dies when it blows up with 'cannot create native thread'. I also want to say this is system related, but I am seeing this occur on all 3 servers, which are brand-new Dell R720s. I'm not saying this is impossible, but I don't see much to suggest that, and it would need to be one hell of a coincidence.

Nice hardware. I have some R720xd servers for another project unrelated to Solr, love them. I know a little about Dell servers. If you haven't done so already, I would install the OpenManage repo and get the firmware fully updated - BIOS, RAID, and LAN in particular. Instructions that are pretty easy to follow:

http://linux.dell.com/repo/hardware/latest/

For process/file limits, I have the following in /etc/security/limits.conf on systems that aren't using Cloud:

ncindex hard nproc 6144
ncindex soft nproc 4096
ncindex hard nofile 65535
ncindex soft nofile 49151

To add more confusion to the mix, we actually run a 2nd SolrCloud cluster on the same Solr, Jetty and JVM versions that do not exhibit this issue, although using a completely different schema, servers and access-patterns, although it is also at high-TPS. That is some evidence to say the current software stack is OK, or maybe this only occurs under an extreme load that 2nd cluster does not see, or lastly only with a certain schema.

This is a big reason why I think you should make sure you're fully up to date on your firmware, as the hardware seems to be one strong difference. As much as I love Dell server hardware, firmware issues are relatively common, especially on early versions of the latest generation, which includes the R720.
Lastly, to add a bit more detail to my original description, so far I have tried:

- Entirely rebuilding my cluster from scratch, reinstalling all deps, configs, reindexing the data (in case I screwed up somewhere). The EXACT same issue occurs under load about 20-45 minutes in.
- Moving to Java 1.7.0_21 from _25 due to some known bugs. Same issue occurs after some load.
- Restarting SolrCloud / forcing rebuilds of cores. Same issue occurs after some load.

The only other thing I can think of is increasing your zkClientTimeout to 30 seconds or so and trying Solr 4.4 so you have SOLR-4899 and SOLR-4805. That's very definitely a shot in the dark.

Thanks,
Shawn
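The limits.conf entries Shawn shows only take effect for new login sessions, so it's worth confirming what a running process actually got. A hedged sketch, not from the thread: it reads /proc/self so it runs as-is, but you would substitute the Solr JVM's pid in practice.

```shell
# Effective process/file limits of a live process on Linux; replace 'self'
# with the Solr JVM's pid to verify that limits.conf changes actually applied
grep -E 'Max (processes|open files)' /proc/self/limits
```

This prints the soft and hard values the kernel is actually enforcing, which can differ from what limits.conf says if the service was started outside a PAM session.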
Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load
Thanks Shawn and Yonik!

Yonik: I noticed this error appears to be fairly trivial, but it is not appearing after a previous crash. Every time I run this high-volume test that produced my stack trace, I zero out the logs, Solr data and Zookeeper data and start over from scratch with a brand-new collection and zeroed-out logs. The test is mostly high volume (2000-4000 updates/sec) and at the start the SolrCloud runs decently for a good 20-60 minutes, no errors in the logs at all. Then that stack trace occurs on all 3 nodes (staggered), I immediately get some replica-down messages, and then some cannot-connect errors to all other cluster nodes, which have all crashed the same way. The tlog error could be a symptom of the problem of running out of threads, perhaps.

Shawn: thanks so much for sharing those details! Yes, they seem to be nice servers, for sure - I don't get to touch/see them but they're fast! I'll look into firmwares for sure and will try again after updating them. These Solr instances are not bare metal and are actually KVM VMs, so that's another layer to look into, although it is consistent between the two clusters. I am not currently increasing the 'nofiles' ulimit above the default like you are, but does Solr use 10,000+ file handles? It won't hurt to try it I guess :). To rule out Java 7, I'll probably also try Jetty 8 and Java 1.6 as an experiment as well.

Thanks!

Tim

On 25/07/13 05:55 PM, Yonik Seeley wrote:
On Thu, Jul 25, 2013 at 7:44 PM, Tim Vaillancourt t...@elementspace.com wrote:
ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.SolrException] Failure to open existing log file (non fatal)

That itself isn't necessarily a problem (and why it says non fatal) - it just means that most likely a transaction log file was truncated from a previous crash. It may be unrelated to the other issues you are seeing.

-Yonik
http://lucidworks.com
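Tim's question about whether Solr really uses 10,000+ file handles can be answered empirically by counting a process's open descriptors. An illustrative Linux sketch, not from the thread: it runs against /proc/self so it works standalone; point it at the Solr JVM's pid to get the real number.

```shell
# Number of open file descriptors for a process (current shell shown here;
# substitute the Solr pid to see how close it gets to the nofile ulimit)
ls /proc/self/fd | wc -l
```

A busy Solr node holds a descriptor per open index segment file, tlog, and socket, so under heavy indexing with many cores the count can be far higher than the common default nofile limit of 1024.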