Any chance you still have the logs from the servers hosting instances 1 & 2? I would open a JIRA ticket for this one, as it sounds like something went terribly wrong on restart.
You can update the /clusterstate.json to fix this situation (a zkcli.sh sketch follows at the end of this thread). Lastly, it's recommended to use an OOM killer script with SolrCloud so that you don't end up with zombie nodes hanging around in your cluster. I use something like:

    -XX:OnOutOfMemoryError="$SCRIPT_DIR/oom_solr.sh $x %p"

$x in the start script is the port # and %p is the process ID ... My oom_solr.sh script is something like this:

    #!/bin/bash
    # Invoked by -XX:OnOutOfMemoryError with the port suffix and PID as arguments.
    SOLR_PORT=$1
    SOLR_PID=$2
    NOW=$(date +"%F%T")
    # Kill the wedged JVM and log the action.
    (
      echo "Running OOM killer script for process $SOLR_PID for Solr on port 89$SOLR_PORT"
      kill -9 $SOLR_PID
      echo "Killed process $SOLR_PID"
    ) | tee oom_killer-89$SOLR_PORT-$NOW.log

I use supervisord to handle the restart after the process gets killed by the OOM killer, which is why you don't see the restart in this script ;-) (a rough supervisord config is sketched at the end of the thread as well).

Timothy Potter
Sr. Software Engineer, LucidWorks
www.lucidworks.com

________________________________________
From: youknow...@heroicefforts.net <youknow...@heroicefforts.net>
Sent: Tuesday, December 17, 2013 10:31 PM
To: solr-user@lucene.apache.org
Subject: Solr failure results in misreplication?

My client has a Solr 4.6 test cluster with three instances 1, 2, and 3 hosting shards 1, 2, and 3, respectively. There is no replication in this cluster. We started receiving OOMEs during indexing; the batches were likely too large. The cluster was rebooted to restore the system. However, upon reboot, instance 2 now shows as a replica of shard 1, and its shard2 is down with a null range. Instance 2 is queryable with shards.tolerant=true&distrib=false (see the example query below) and returns a different set of records than instance 1, as would be expected during normal operations. Clusterstate.json is similar to the following:

    mycollection: {
      shard1: {
        range: 80000000-d554ffff,
        state: active,
        replicas: {
          instance1: {... state: active ...},
          instance2: {... state: active ...}
        }
      },
      shard3: {... state: active ...},
      shard2: {
        range: null,
        state: active,
        replicas: {
          instance2: {... state: down ...}
        }
      },
      maxShardsPerNode: 1,
      replicationFactor: 1
    }

Any ideas on how this would come to pass? Would manually correcting the clusterstate.json in Zk correct this situation?
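A sketch of the clusterstate.json round-trip described above, using the zkcli.sh utility that ships with Solr 4.x; the Solr install path and the ZooKeeper address are placeholders for your own:

    # Stop the Solr nodes first so a live node doesn't overwrite your edit.
    cd /opt/solr/example/scripts/cloud-scripts

    # Pull the current cluster state out of ZooKeeper.
    ./zkcli.sh -zkhost zk1:2181 -cmd getfile /clusterstate.json /tmp/clusterstate.json

    # Edit /tmp/clusterstate.json by hand: drop instance2 from shard1's replica
    # list, restore shard2's hash range, and mark its replica active.

    # Push the corrected file back into ZooKeeper, then restart the nodes.
    ./zkcli.sh -zkhost zk1:2181 -cmd putfile /clusterstate.json /tmp/clusterstate.json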
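And a rough idea of the supervisord side, since the restart happens outside oom_solr.sh; the program name and start command are placeholders, and the command must keep Solr in the foreground for supervisord to manage it:

    [program:solr-8983]
    ; placeholder start script -- must run Solr in the foreground
    command=/opt/solr/start-solr-foreground.sh 8983
    autostart=true
    ; bring the node back automatically after oom_solr.sh kills it
    autorestart=true
    startsecs=10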
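For the single-instance query mentioned in the original message, something along these lines; host, port, and core name are placeholders:

    # Ask instance 2 for only its locally held documents; comparing numFound
    # across instances 1 and 2 shows whether they hold different record sets.
    curl "http://instance2:8983/solr/mycollection/select?q=*:*&distrib=false&shards.tolerant=true&rows=0&wt=json"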