Hi,
I'm experiencing the same problem, using version 4.4.0.
In my case:
2 Solr nodes running 4 collections, each with 1 shard and 2 replicas.
3 ZooKeeper nodes.
Replicas can end up with state=down when the connection to ZooKeeper is lost.
However, there are 2 more ZooKeeper servers in the ensemble, so this shouldn't
be a problem, right?
The only errors in the log are like the following:
Error inspecting tlog
tlog{file=/opt/solr/server/blabla/replica1/data/tlog/tlog.0000000000000001106
refcount=2}
The funny thing is that the replicas with the error work just fine, while the
ones without errors are the ones causing problems. Maybe that is because the
replicas with this error go through the recovery process and the others do
not?
There seems to be absolutely no problem with the replicas that are marked
down. The only dirty hack I have found to fix things is editing
clusterstate.json and changing the state from down to active.
It doesn't seem right, but it does work.
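For reference, that edit can be done directly in ZooKeeper with the zkCli.sh
shell that ships with it. A rough sketch only -- the server address, install
path, and file names below are assumptions for a default setup, and you should
back things up first:

```shell
# Sketch only -- host, port, and paths are assumptions.
# 1. Dump the current cluster state from ZooKeeper to a local file
#    (zkCli.sh also prints its own log/stat lines, so strip those
#    from the dump before editing):
/opt/zookeeper/bin/zkCli.sh -server localhost:2181 get /clusterstate.json \
    > clusterstate.json

# 2. Edit the file and flip the affected replica's "state":"down"
#    to "state":"active".

# 3. Write the modified JSON back:
/opt/zookeeper/bin/zkCli.sh -server localhost:2181 \
    set /clusterstate.json "$(cat clusterstate.json)"
```

Solr nodes watch /clusterstate.json, so they should pick the change up right
away; as said above, this is a workaround, not a fix.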
Jeroen
On 18-9-2013 5:50, Mark Miller wrote:
SOLR-5243 and SOLR-5240 will likely improve the situation. Both fixes are in
4.5 - the first RC for 4.5 will likely come tomorrow.
Thanks to yonik for sussing these out.
- Mark
On Sep 17, 2013, at 2:43 PM, Mark Miller <markrmil...@gmail.com> wrote:
On Sep 17, 2013, at 12:00 PM, Vladimir Veljkovic
<vladimir.veljko...@boxalino.com> wrote:
Hello there,
we have the following setup:
SolrCloud 4.4.0 (3 nodes, physical machines)
Zookeeper 3.4.5 (3 nodes, physical machines)
We have a number of rather small collections (~10K to ~100K documents) that we
would like to load onto all Solr instances (numShards=1,
replication_factor=3) and access through the local network interface, as load
balancing is done in the layers above.
We can live with (and in the test phase actually do) rebuilding entire
collections whenever we need to, switching collection aliases and removing the
old collections.
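For anyone following the same approach, that rebuild-and-swap cycle can be
driven through the Collections API. A rough sketch with curl -- the host,
port, collection names, and alias name are made up for illustration:

```shell
# Illustrative only -- hosts and names are assumptions.
# 1. Build a fresh copy of the collection alongside the live one:
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=products_v2&numShards=1&replicationFactor=3"

# 2. Index the documents into products_v2, then point the alias at it:
curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=products&collections=products_v2"

# 3. Once nothing references the old copy any more, drop it:
curl "http://localhost:8983/solr/admin/collections?action=DELETE&name=products_v1"
```

Clients always query the alias (products), so the switch in step 2 is atomic
from their point of view.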
We stumbled across the following problem: as soon as all three Solr nodes
become the leader of at least one collection, restarting any node makes it
completely unresponsive (timeouts), both through the admin interface and for
replication. If we restart all Solr nodes, the cluster ends up in some kind of
deadlock, and the only remedy we have found is a clean Solr installation,
removing the ZooKeeper data and re-posting the collections.
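The clean-slate remedy described above amounts to wiping Solr's znodes from
the ensemble before restarting. A rough sketch with zkCli.sh -- the znode list
assumes a default SolrCloud layout with no chroot, and this deletes all
cluster state, so only run it with every Solr node stopped:

```shell
# Destructive: removes all SolrCloud state from ZooKeeper.
# Stop every Solr node first. Paths assume no chroot.
/opt/zookeeper/bin/zkCli.sh -server localhost:2181 <<'EOF'
rmr /collections
rmr /clusterstate.json
rmr /aliases.json
rmr /overseer
rmr /overseer_elect
rmr /live_nodes
rmr /configs
quit
EOF
```

After that, the configs and collections have to be re-uploaded and re-created
from scratch, as described above.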
Apparently the leader waits for the replicas to come up, and they try to
synchronize but time out on HTTP requests, so everything ends up in some kind
of deadlock, maybe related to:
https://issues.apache.org/jira/browse/SOLR-5240
Yup, that sounds exactly like what you would expect with SOLR-5240. A fix for
that is coming in 4.5, which is probably a week or so away.
Eventually (after a few minutes) the leader takes over and marks the
collections "active", but it remains blocked on the HTTP interface, so the
other nodes cannot synchronize.
In further tests, we loaded 4 collections with numShards=1 and
replication_factor=2. By chance, one node became the leader for all 4
collections. Restarting a node that was not the leader worked without
problems, but when we restarted the leader itself:
- the leader shut down, and the other nodes became leaders of 2 collections each
- the leader started up again, 3 collections on it became "active", but one
collection remained down and the node became unresponsive, timing out on HTTP
requests.
Hard to say - I'll experiment with 4.5 and see if I can duplicate this.
- Mark
As this behavior is completely unexpected for a cluster solution, I wonder
whether somebody else has experienced the same problems or whether we are
doing something entirely wrong.
Best regards
--
Vladimir Veljkovic
Senior Java Entwickler
Boxalino AG
vladimir.veljko...@boxalino.com
www.boxalino.com
Tuning Kit for your Online Shop
Product Search - Recommendations - Landing Pages - Data intelligence - Mobile
Commerce