This is my first experience with SolrCloud, so please bear with me.

I've inherited a setup with 5 servers: 2 run ZooKeeper only and the
other 3 run SolrCloud + ZooKeeper. Solr is at version 5.4.0 and
ZooKeeper at 3.4.7. There's around 80 GB of index in total; some
collections are rather big (20 GB) and some are very small. All of them
have only one shard. The bigger ones are almost constantly being updated
(and, of course, queried at the same time).

I've had a huge number of errors of many different kinds. At some point
the system seemed rather stable, but then I tried to add a few new
collections and things went wrong again. The usual symptom is that some
cores stop synchronizing; sometimes an entire server is shown as "gone"
(although it's still alive and well). When I add a core on a server, one
or more other cores on that server often go down. Even when the system
is rather stable, some cores are shown as recovering. After a restart, a
server takes a very long time (30 min at least) to fully recover.
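
To make the symptoms above concrete, here is a minimal sketch of how the
affected replicas could be listed from the Collections API CLUSTERSTATUS
response. It is not something taken from the cluster itself: the base
URL is a placeholder, the helper names are made up for illustration, and
the JSON layout (cluster -> collections -> shards -> replicas, plus
live_nodes) is assumed to match what Solr 5.x returns.

```python
import json
import urllib.request


def fetch_cluster_status(base_url):
    """Fetch CLUSTERSTATUS as a dict.

    base_url is a placeholder such as 'http://localhost:8983/solr';
    substitute the address of any live Solr node.
    """
    url = base_url.rstrip("/") + "/admin/collections?action=CLUSTERSTATUS&wt=json"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def non_active_replicas(status):
    """Return (collection, shard, replica, node, state) tuples for every
    replica that is not 'active' or whose node is missing from live_nodes.
    """
    cluster = status.get("cluster", {})
    live = set(cluster.get("live_nodes", []))
    problems = []
    for coll_name, coll in cluster.get("collections", {}).items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                state = replica.get("state")
                node = replica.get("node_name")
                # A replica can still be recorded as 'active' while its
                # node has dropped out of live_nodes, so check both.
                if state != "active" or node not in live:
                    problems.append(
                        (coll_name, shard_name, replica_name, node, state)
                    )
    return sorted(problems)
```

Calling non_active_replicas(fetch_cluster_status(...)) before and after
adding a core would show which replicas change state, and whether a
server reported as "gone" has actually left live_nodes.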

Some of the many errors I've got (I've skipped the warnings):
- org.apache.solr.common.SolrException: Error trying to proxy request
for url
- org.apache.solr.update.processor.DistributedUpdateProcessor; Setting
up to try to start recovery on replica
- org.apache.solr.common.SolrException; Error while trying to recover.
core=[...]:org.apache.solr.common.SolrException: No registered leader
was found after waiting
- update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING,
tlog=null}
- org.apache.solr.cloud.RecoveryStrategy; Could not publish as ACTIVE
after succesful recovery
- org.apache.solr.common.SolrException; Could not find core to call recovery
- org.apache.solr.common.SolrException: Error CREATEing SolrCore '...':
Unable to create core
- org.apache.solr.request.SolrRequestInfo; prev == info : false
- org.apache.solr.request.SolrRequestInfo; Previous SolrRequestInfo was
not closed!
- org.apache.solr.update.SolrIndexWriter; Error closing IndexWriter
- org.apache.solr.update.SolrIndexWriter; SolrIndexWriter was not closed
prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
- org.apache.solr.cloud.OverseerCollectionMessageHandler; Error from shard
- org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting
for connection from pool
- and so on...

Any advice on where I should start? I've checked disk space, memory
usage and the max number of open files; everything seems fine there. My
guess is that the configuration is largely unchanged from the defaults.
I've already extended the timeouts in ZooKeeper.

Thanks,
John
