Hi,

This is SolrCloud 6.5.1 on Ubuntu 16.04 LTS with OpenJDK 8: 4 Solr nodes in 
Cloud mode, external ZooKeeper.

I tried to split my collection's shard1 (500 GB) with SPLITSHARD, and it only 
kind of worked. After more than 8 hours the new shards left the "construction" 
state and entered "recovery" :( About 12 hours after that, Out of Memory 
errors with "could not create thread" occurred. Node 10.10.10.162 took over 
leadership of shard1, but since we still saw errors on searches, I stopped 
Solr on 10.10.10.161, raised the heap from 24G to 31G and rebooted the system, 
just in case (it was also a good time to install the latest patches). 
10.10.10.161 came back, and shard1, shard1_0 and shard1_1 started recovery. 
Unfortunately, 10.10.10.162, the leader for shard2, which was being split as 
well, then hit "something": solr.log was no longer being updated and the UI 
stopped working, so in the end I stopped Solr there as well (it finished 
instantly) and rebooted. Now both nodes are running with a 31G Java heap, 
shard1 and shard2 are in sync, and I am trying to clean up before retrying.

Of shard2, only a shard2_0 without any replicas was left over, and DELETESHARD 
cleaned it up.

But shard1 has shard1_0 and shard1_1, each with two replicas. DELETESHARD 
errored out, so I removed all of the replicas with DELETEREPLICA. That worked, 
but "parts of" shard1_0 and shard1_1 are still there and I cannot delete them:

$ wget -q -O - 'http://10.10.10.162:8983/solr/admin/collections?wt=json&action=CLUSTERSTATUS' | jq
[…]
          "shard1_0": {
            "range": "80000000-bfffffff",
            "state": "recovery_failed",
            "replicas": {}
          },
          "shard1_1": {
            "parent": "shard1",
            "shard_parent_node": "10.10.10.161:8983_solr",
            "range": "c0000000-ffffffff",
            "state": "recovery_failed",
            "shard_parent_zk_session": "98682039611162624",
            "replicas": {}
          }
[…]
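
To spot such leftovers without reading the whole CLUSTERSTATUS output, the 
response could be filtered for shards in state "recovery_failed" that have no 
replicas. A minimal Python sketch, assuming the usual 
cluster -> collections -> <name> -> shards layout of the response and the 
collection name "collection" from the DELETESHARD call; the shard1 entry and 
the helper name stuck_shards are illustrative and not taken from the cluster:

```python
# Shard entries for shard1_0 and shard1_1 are copied from the excerpt above.
# The "shard1" entry is a hypothetical placeholder for a healthy parent shard.
STATUS = {
    "cluster": {
        "collections": {
            "collection": {
                "shards": {
                    "shard1": {  # illustrative only
                        "state": "active",
                        "replicas": {"core_node1": {}},
                    },
                    "shard1_0": {
                        "range": "80000000-bfffffff",
                        "state": "recovery_failed",
                        "replicas": {},
                    },
                    "shard1_1": {
                        "parent": "shard1",
                        "shard_parent_node": "10.10.10.161:8983_solr",
                        "range": "c0000000-ffffffff",
                        "state": "recovery_failed",
                        "shard_parent_zk_session": "98682039611162624",
                        "replicas": {},
                    },
                }
            }
        }
    }
}


def stuck_shards(status, collection):
    """Return names of shards that failed recovery and have no replicas left."""
    shards = status["cluster"]["collections"][collection]["shards"]
    return sorted(
        name
        for name, info in shards.items()
        if info.get("state") == "recovery_failed" and not info.get("replicas")
    )


if __name__ == "__main__":
    for name in stuck_shards(STATUS, "collection"):
        print(name)
```

In practice the dictionary would come from json.loads() on the wget output 
instead of being written out by hand.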


$ wget -O - 'http://10.10.10.161:8983/solr/admin/collections?action=DELETESHARD&shard=shard1_1&collection=collection'
--2017-09-29 01:01:16--  http://10.10.10.161:8983/solr/admin/collections?action=DELETESHARD&shard=shard1_1&collection=collection
Connecting to 10.10.10.161:8983... connected.
HTTP request sent, awaiting response... 400 Bad Request
2017-09-29 01:01:16 ERROR 400: Bad Request.
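
By default wget drops the response body on a 400, so whatever message Solr 
sent along with the error is not visible above. A minimal Python sketch that 
returns the body even for error responses, assuming only the standard library; 
the helper name fetch_with_error_body is made up for illustration, and what 
Solr actually puts into the body is not known from the output above:

```python
import urllib.error
import urllib.request


def fetch_with_error_body(url, timeout=30):
    """Return (status, body); for HTTP errors, return the error body too."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # HTTPError is file-like, so the error response body can still be read.
        return err.code, err.read().decode("utf-8", "replace")


if __name__ == "__main__":
    # Same DELETESHARD call as above, only with the response body printed.
    url = (
        "http://10.10.10.161:8983/solr/admin/collections"
        "?action=DELETESHARD&shard=shard1_1&collection=collection"
    )
    status, body = fetch_with_error_body(url)
    print(status)
    print(body)
```

wget's --content-on-error option should give the same information without 
leaving the shell.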

Any hint on how to fix this would be appreciated ;)

Regards,
-kai


