Any chance you still have the logs from the servers hosting shards 1 & 2? I would
open a JIRA ticket for this one, as it sounds like something went terribly wrong
on restart.

You can update /clusterstate.json in ZooKeeper to fix this situation.
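
If it helps, here's a rough sketch of how I'd pull the file down, edit it, and
push it back with the zkcli.sh tool that ships with Solr -- the ZooKeeper address
and the zkcli.sh location below are just assumptions for your install, and I'd do
it with the Solr nodes stopped:

# Rough sketch only -- ZK address and zkcli.sh path are assumptions.
ZKCLI=/opt/solr/example/scripts/cloud-scripts/zkcli.sh
ZKHOST=zk1:2181

# pull the current state down and fix the shard/replica entries by hand
$ZKCLI -zkhost $ZKHOST -cmd getfile /clusterstate.json clusterstate.json
vi clusterstate.json

# push the corrected file back, then restart the Solr nodes
$ZKCLI -zkhost $ZKHOST -cmd putfile /clusterstate.json clusterstate.json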

Lastly, it's recommended to use an OOM killer script with SolrCloud so that you 
don't end up with zombie nodes hanging around in your cluster. I use something 
like: -XX:OnOutOfMemoryError="$SCRIPT_DIR/oom_solr.sh $x %p"

$x in the start script is the port # and %p is the process ID ... My oom_solr.sh
script is something like this:

#!/bin/bash
# Kills a Solr process that has hit an OutOfMemoryError so it can't hang
# around as a zombie node. $1 is the last two digits of the port (e.g. 83
# for 8983) and $2 is the PID the JVM passes in via %p.
SOLR_PORT=$1
SOLR_PID=$2
NOW=$(date +"%F%T")
(
  echo "Running OOM killer script for process $SOLR_PID for Solr on port 89$SOLR_PORT"
  kill -9 "$SOLR_PID"
  echo "Killed process $SOLR_PID"
) | tee "oom_killer-89$SOLR_PORT-$NOW.log"
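
For context, the flag gets wired in from my start script roughly like this --
SCRIPT_DIR, the heap size, and the port digits below are just placeholders for
illustration:

# Hypothetical start-script fragment -- adjust paths/heap for your install.
SCRIPT_DIR=/opt/solr/scripts
x=83   # Solr listens on 89$x, i.e. 8983
java -Xmx4g \
  -Djetty.port=89$x \
  -XX:OnOutOfMemoryError="$SCRIPT_DIR/oom_solr.sh $x %p" \
  -jar start.jar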

I use supervisord to handle the restart after the process gets killed by the
OOM killer, which is why you don't see the restart in this script ;-)
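
In case it helps, the supervisord stanza for one node looks roughly like this --
the command, paths, user, and port are assumptions for your environment, and the
start script has to run Solr in the foreground for supervisord to manage it:

; Hypothetical supervisord config -- command, directory, and user are placeholders.
[program:solr8983]
directory=/opt/solr/example
command=/opt/solr/scripts/start_solr.sh 83
user=solr
autostart=true
autorestart=true                ; bring Solr back after the OOM killer kills it
stdout_logfile=/var/log/solr/solr-8983.log
redirect_stderr=true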

Timothy Potter
Sr. Software Engineer, LucidWorks
www.lucidworks.com

________________________________________
From: youknow...@heroicefforts.net <youknow...@heroicefforts.net>
Sent: Tuesday, December 17, 2013 10:31 PM
To: solr-user@lucene.apache.org
Subject: Solr failure results in misreplication?

My client has a Solr 4.6 test cluster with three instances 1, 2, and 3 hosting
shards 1, 2, and 3, respectively.  There is no replication in this cluster.  We
started receiving OOMEs during indexing; likely the batches were too large.  The
cluster was rebooted to restore the system.  However, upon reboot, instance 2
now shows as a replica of shard 1, and its shard2 is down with a null range.
Instance 2 is queryable with shards.tolerant=true&distrib=false and returns a
different set of records than instance 1 (as would be expected during normal
operations).  Clusterstate.json is similar to the following:

mycollection:{
  shard1:{
    range:8000000-d554ffff,
    state:active,
    replicas:{
      instance1....state:active...,
      instance2....state:active...
    }
  },
  shard3:{....state:active.....},
  shard2:{
    range:null,
    state:active,
    replicas:{
      instance2:{....state:down....}
    }
  },
  maxShardsPerNode:1,
  replicationFactor:1
}

Any ideas on how this would come to pass?  Would manually correcting the 
clusterstate.json in Zk correct this situation?
