This is my first experience with SolrCloud, so please bear with me. I've inherited a setup with 5 servers: 2 run Zookeeper only and the other 3 run SolrCloud + Zookeeper. Solr is version 5.4.0 and Zookeeper is 3.4.7. There's around 80 GB of index data; some collections are rather big (20 GB) and some are very small. All of them have only one shard. The bigger ones are almost constantly being updated (and of course queried at the same time).
I've had a huge number of errors, many different ones. At some point the system seemed rather stable, but I've tried to add a few new collections and things went wrong again. The usual symptom is that some cores stop synchronizing; sometimes an entire server is shown as "gone" (although it's still alive and well). When I add a core on a server, another (or several others) often goes down on that server. Even when the system is rather stable some cores are shown as recovering. When restarting a server it takes a very long time (30 min at least) to fully recover.

Some of the many errors I've got (I've skipped the warnings):

- org.apache.solr.common.SolrException: Error trying to proxy request for url
- org.apache.solr.update.processor.DistributedUpdateProcessor; Setting up to try to start recovery on replica
- org.apache.solr.common.SolrException; Error while trying to recover. core=[...]:org.apache.solr.common.SolrException: No registered leader was found after waiting
- update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING, tlog=null}
- org.apache.solr.cloud.RecoveryStrategy; Could not publish as ACTIVE after succesful recovery
- org.apache.solr.common.SolrException; Could not find core to call recovery
- org.apache.solr.common.SolrException: Error CREATEing SolrCore '...': Unable to create core
- org.apache.solr.request.SolrRequestInfo; prev == info : false
- org.apache.solr.request.SolrRequestInfo; Previous SolrRequestInfo was not closed!
- org.apache.solr.update.SolrIndexWriter; Error closing IndexWriter
- org.apache.solr.update.SolrIndexWriter; SolrIndexWriter was not closed prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
- org.apache.solr.cloud.OverseerCollectionMessageHandler; Error from shard
- org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
- and so on...

Any advice on where I should start?
I've checked disk space, memory usage and the maximum number of open files; everything seems fine there. My guess is that the configuration is largely unchanged from the defaults. I've already extended the timeouts in Zookeeper.

Thanks,
John
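
In case it helps to make the symptoms concrete: the replica states described above (cores shown as recovering, a server shown as "gone") can be listed from the Collections API's CLUSTERSTATUS action. What follows is only a minimal sketch in Python, not something taken from the actual setup. The function names are hypothetical, and the base URL passed to `fetch_cluster_status` would be whatever address one of the Solr nodes really listens on.

```python
import json
from urllib.request import urlopen


def fetch_cluster_status(base_url):
    """Fetch the CLUSTERSTATUS response from one Solr node as a dict.

    base_url is an assumption, e.g. "http://localhost:8983/solr".
    """
    url = base_url.rstrip("/") + "/admin/collections?action=CLUSTERSTATUS&wt=json"
    with urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def non_active_replicas(status):
    """List replicas that are not 'active' or sit on a node missing from live_nodes.

    Returns tuples of (collection, shard, core, node_name, state).
    """
    cluster = status.get("cluster", {})
    live_nodes = set(cluster.get("live_nodes", []))
    problems = []
    for coll_name, coll in sorted(cluster.get("collections", {}).items()):
        for shard_name, shard in sorted(coll.get("shards", {}).items()):
            for replica in shard.get("replicas", {}).values():
                state = replica.get("state", "unknown")
                node = replica.get("node_name", "unknown")
                # A replica can still be recorded as "active" in the cluster
                # state while its node has dropped out of live_nodes.
                if state != "active" or node not in live_nodes:
                    problems.append(
                        (coll_name, shard_name, replica.get("core"), node, state)
                    )
    return problems
```

Calling `non_active_replicas(fetch_cluster_status("http://localhost:8983/solr"))` before and after adding a core would show which replicas change state, which may help narrow down whether the problem follows a particular server or a particular collection.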