Hi, this is with SolrCloud 6.5.1 on Ubuntu 16.04 LTS and OpenJDK 8: 4 Solr nodes in cloud mode, with an external ZooKeeper.
I tried to split my collection's shard1 (500 GB) with SPLITSHARD, and it kind of worked. After more than 8 hours the new shards left the "construction" state and entered "recovery" :( About 12 hours after that, Out of Memory errors with "could not create thread" happened.

Node 10.10.10.162 took leadership of shard1, but since we still saw errors on searches, I stopped Solr on 10.10.10.161, changed the heap from 24G to 31G and rebooted the system, just in case (a good time to install the latest patches). 10.10.10.161 came back, and shard1, shard1_0 and shard1_1 started recovery.

Unfortunately 10.10.10.162, the leader for shard2, which was being split as well, hit "something": solr.log was no longer updated and the UI stopped working, so in the end I stopped Solr there as well (it finished instantly) and rebooted.

Now both are running with a 31G Java heap, shard1 and shard2 are synced, and I am trying to clean up before retrying. Of shard2, only a shard2_0 without any replicas was left over, and DELETESHARD cleaned it up. But shard1 has shard1_0 and shard1_1, each with two replicas. DELETESHARD errored out, so I ran DELETEREPLICA on all of them. This worked, but "parts of" shard1_0 and shard1_1 are still there and I cannot delete them:

$ wget -q -O - 'http://10.10.10.162:8983/solr/admin/collections?wt=json&action=CLUSTERSTATUS' | jq
[…]
  "shard1_0": {
    "range": "80000000-bfffffff",
    "state": "recovery_failed",
    "replicas": {}
  },
  "shard1_1": {
    "parent": "shard1",
    "shard_parent_node": "10.10.10.161:8983_solr",
    "range": "c0000000-ffffffff",
    "state": "recovery_failed",
    "shard_parent_zk_session": "98682039611162624",
    "replicas": {}
  }
[…]

$ wget -O - 'http://10.10.10.161:8983/solr/admin/collections?action=DELETESHARD&shard=shard1_1&collection=collection'
--2017-09-29 01:01:16--  http://10.10.10.161:8983/solr/admin/collections?action=DELETESHARD&shard=shard1_1&collection=collection
Connecting to 10.10.10.161:8983... connected.
HTTP request sent, awaiting response... 400 Bad Request
2017-09-29 01:01:16 ERROR 400: Bad Request.

Any hint on how to fix this would be appreciated ;)

Regards,
-kai
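
For reference, the leftover sub-shards shown above can be picked out of the CLUSTERSTATUS response with a few lines of Python. This is only a sketch under assumptions: it expects the usual JSON layout (cluster, then collections, then the collection name, then shards), and the function name find_leftover_shards is made up for illustration, not part of Solr.

```python
import json


def find_leftover_shards(cluster_status, collection):
    """Return the names of shards that are not active and have no replicas.

    cluster_status is the parsed JSON of a CLUSTERSTATUS call (wt=json).
    Assumes the layout cluster -> collections -> <collection> -> shards.
    """
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    return sorted(
        name
        for name, info in shards.items()
        # A shard counts as a leftover if it is in any non-active state
        # (e.g. "recovery_failed") and its replica map is empty.
        if info.get("state") != "active" and not info.get("replicas")
    )


def find_leftover_shards_from_text(text, collection):
    """Same as above, but takes the raw JSON text returned by Solr."""
    return find_leftover_shards(json.loads(text), collection)
```

Fed with the saved output of the CLUSTERSTATUS call and the collection name "collection", this should list shard1_0 and shard1_1 in the state described above.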