Hello,
I am new to this, but I was hoping to raise a bug or ask for some advice.
We upgraded our Solr cluster to version 10 within the past two months and have
noticed that when we restart our ZooKeeper instances, in particular the leader
instance, Solr has trouble managing its collections.
Collections go into the degraded state and stay there until they recover. How
long recovery takes depends on the number of collections: for example, 10
collections with 2 replicas each can take up to 5 minutes, and the time
increases exponentially with the number of replicas. Sometimes a node also goes
into the down state and stays there.
We had a brief look through the code and noticed that ZooKeeper disconnection
is now handled differently, and we believe this change is causing the
behaviour.
We run our ZooKeeper and Solr instances in Kubernetes and hit this quite
regularly, because Kubernetes sometimes rebalances workloads of its own accord,
which moves or recycles ZooKeeper more than once a week. To make sure this
wasn't a quirk of our ZooKeeper deployment in Kubernetes, we also reproduced the
issue with Docker Compose and saw the same behaviour.
We went back to v9 and did not experience any issues at all.
Below are a Docker Compose template and a script to reproduce the issue.
docker-compose.yml
services:
  # ─── ZooKeeper ensemble (3 nodes for quorum) ───────────────────────────────
  zookeeper-0:
    image: zookeeper:3.9
    hostname: zookeeper-0
    restart: unless-stopped
    environment:
      ZOO_MY_ID: "1"
      ZOO_SERVERS: "server.1=zookeeper-0:2888:3888;2181 server.2=zookeeper-1:2888:3888;2181 server.3=zookeeper-2:2888:3888;2181"
    volumes:
      - zk0-data:/data
      - zk0-datalog:/datalog
    healthcheck:
      # AdminServer HTTP endpoint: works regardless of election state;
      # 4LW/zkServer.sh unreliable in 3.9
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:8080/commands/ruok"]
      interval: 5s
      timeout: 5s
      retries: 20
      start_period: 20s
  zookeeper-1:
    image: zookeeper:3.9
    hostname: zookeeper-1
    restart: unless-stopped
    environment:
      ZOO_MY_ID: "2"
      ZOO_SERVERS: "server.1=zookeeper-0:2888:3888;2181 server.2=zookeeper-1:2888:3888;2181 server.3=zookeeper-2:2888:3888;2181"
    volumes:
      - zk1-data:/data
      - zk1-datalog:/datalog
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:8080/commands/ruok"]
      interval: 5s
      timeout: 5s
      retries: 20
      start_period: 20s
  zookeeper-2:
    image: zookeeper:3.9
    hostname: zookeeper-2
    restart: unless-stopped
    environment:
      ZOO_MY_ID: "3"
      ZOO_SERVERS: "server.1=zookeeper-0:2888:3888;2181 server.2=zookeeper-1:2888:3888;2181 server.3=zookeeper-2:2888:3888;2181"
    volumes:
      - zk2-data:/data
      - zk2-datalog:/datalog
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:8080/commands/ruok"]
      interval: 5s
      timeout: 5s
      retries: 20
      start_period: 20s
  # ─── ZK chroot init (creates /solrcloud-test znode, then exits) ─────────────
  zk-init:
    image: zookeeper:3.9
    restart: "no"
    depends_on:
      zookeeper-0:
        condition: service_healthy
    command: >
      sh -c "zkCli.sh -server zookeeper-0:2181 create /solrcloud-test '' 2>&1 | tail -1; echo 'ZK chroot ready'"
    healthcheck:
      test: ["CMD-SHELL", "exit 0"]
      interval: 5s
      retries: 1
  # ─── SolrCloud nodes ────────────────────────────────────────────────────────
  solrcloud-0:
    image: solr:10
    hostname: solrcloud-0
    restart: unless-stopped
    environment:
      SOLR_SKIP_ROOT_CHECK: "true"
      SOLR_PORT: "8983"
      SOLR_JAVA_MEM: "-XX:+UseContainerSupport -XX:MaxRAMPercentage=60.0"
      # /solrcloud-test chroot isolates this cluster within the ZK ensemble
      ZK_HOST: "zookeeper-0:2181,zookeeper-1:2181,zookeeper-2:2181/solrcloud-test"
      SOLR_HOST: "solrcloud-0"
      SOLR_LOG_LEVEL: "WARN"
      LOG4J_FORMAT_MSG_NO_LOOKUPS: "true"
      SOLR_OPTS: "-Dhost=solrcloud-0"
    ports:
      - "8983:8983"
    depends_on:
      zookeeper-0:
        condition: service_healthy
      zookeeper-1:
        condition: service_healthy
      zookeeper-2:
        condition: service_healthy
      zk-init:
        condition: service_completed_successfully
    volumes:
      - solr0-data:/var/solr
      - ./config/solr-log.xml:/opt/solr/server/resources/log4j2.xml:ro
      - ./config/solr-log.xml:/var/solr/log4j2.xml:ro
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8983/solr/admin/info/system"]
      interval: 10s
      timeout: 5s
      retries: 10
      start_period: 60s
  solrcloud-1:
    image: solr:10
    hostname: solrcloud-1
    restart: unless-stopped
    environment:
      SOLR_SKIP_ROOT_CHECK: "true"
      SOLR_PORT: "8983"
      SOLR_JAVA_MEM: "-XX:+UseContainerSupport -XX:MaxRAMPercentage=60.0"
      ZK_HOST: "zookeeper-0:2181,zookeeper-1:2181,zookeeper-2:2181/solrcloud-test"
      SOLR_HOST: "solrcloud-1"
      SOLR_LOG_LEVEL: "WARN"
      LOG4J_FORMAT_MSG_NO_LOOKUPS: "true"
      SOLR_OPTS: "-Dhost=solrcloud-1"
    ports:
      - "8984:8983"
    depends_on:
      zookeeper-0:
        condition: service_healthy
      zookeeper-1:
        condition: service_healthy
      zookeeper-2:
        condition: service_healthy
      zk-init:
        condition: service_completed_successfully
    volumes:
      - solr1-data:/var/solr
      - ./config/solr-log.xml:/opt/solr/server/resources/log4j2.xml:ro
      - ./config/solr-log.xml:/var/solr/log4j2.xml:ro
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8983/solr/admin/info/system"]
      interval: 10s
      timeout: 5s
      retries: 10
      start_period: 60s
volumes:
  zk0-data:
  zk0-datalog:
  zk1-data:
  zk1-datalog:
  zk2-data:
  zk2-datalog:
  solr0-data:
  solr1-data:
Zk-failover-test.ps1
$ErrorActionPreference = "Stop"
Set-Location $PSScriptRoot

# Container names follow Compose's <project>-<service>-<index> pattern, so they
# assume the project (directory) name is "solrcloud-docker-compose".
$ZkContainers = @(
    "solrcloud-docker-compose-zookeeper-0-1",
    "solrcloud-docker-compose-zookeeper-1-1",
    "solrcloud-docker-compose-zookeeper-2-1"
)
$SolrContainers = @(
    "solrcloud-docker-compose-solrcloud-0-1",
    "solrcloud-docker-compose-solrcloud-1-1"
)
$AllHealthChecked = $ZkContainers + $SolrContainers

# ── 1. Start the stack ────────────────────────────────────────────────────────
Write-Host "`n[1/5] Starting Docker Compose stack..." -ForegroundColor Cyan
docker compose up -d
if ($LASTEXITCODE -ne 0) { throw "docker compose up failed" }

# ── 2. Wait for all containers to be healthy ──────────────────────────────────
Write-Host "`n[2/5] Waiting for all containers to be healthy..." -ForegroundColor Cyan
$timeout = 300
$elapsed = 0
while ($elapsed -lt $timeout) {
    Start-Sleep -Seconds 5
    $elapsed += 5
    $statuses = $AllHealthChecked | ForEach-Object {
        docker inspect $_ --format "{{.State.Health.Status}}" 2>$null
    }
    $unhealthy = ($statuses | Where-Object { $_ -ne "healthy" }).Count
    Write-Host "  ${elapsed}s - $($statuses.Count - $unhealthy)/$($statuses.Count) healthy"
    if ($unhealthy -eq 0) { break }
}
# Check the last observed state rather than the elapsed time, so a stack that
# turns healthy on the final poll is not reported as a failure.
if ($unhealthy -ne 0) { throw "Containers did not become healthy within ${timeout}s" }
Write-Host "  All containers healthy." -ForegroundColor Green

# ── 3. Create 10 Solr collections ─────────────────────────────────────────────
Write-Host "`n[3/5] Creating 10 Solr collections..." -ForegroundColor Cyan
for ($i = 1; $i -le 10; $i++) {
    $name = "test-collection-$i"
    $uri = "http://localhost:8983/solr/admin/collections?action=CREATE&name=$name&numShards=1&replicationFactor=2&wt=json"
    try {
        $resp = Invoke-RestMethod -Uri $uri -Method Get
        $status = $resp.responseHeader.status
        Write-Host "  Created $name (status: $status)"
    } catch {
        Write-Warning "  Failed to create $name`: $_"
    }
}

# ── 4. Find the ZooKeeper leader ──────────────────────────────────────────────
Write-Host "`n[4/5] Finding ZooKeeper leader..." -ForegroundColor Cyan
$leaderContainer = $null
foreach ($container in $ZkContainers) {
    # Ask each node's AdminServer for its role in the ensemble
    $stat = docker exec $container wget -qO- "http://localhost:8080/commands/stat" 2>&1
    if ($stat -match '"server_state"\s*:\s*"leader"') {
        $leaderContainer = $container
        Write-Host "  Leader: $container" -ForegroundColor Yellow
        break
    }
}
if (-not $leaderContainer) { throw "Could not find ZooKeeper leader" }

# ── 5. Restart the leader ─────────────────────────────────────────────────────
Write-Host "`n[5/5] Restarting ZK leader ($leaderContainer)..." -ForegroundColor Cyan
docker restart $leaderContainer
Write-Host "  Restarted. New leader election underway." -ForegroundColor Green
Write-Host "`nDone." -ForegroundColor Green
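To put numbers on the recovery times described above, a polling loop along
these lines could be run once the leader has been restarted. This is a rough
sketch rather than part of our original test: the function names
Get-ReplicaStateCounts and Measure-SolrRecovery are made up for illustration,
and it assumes the CLUSTERSTATUS response in Solr 10 still nests replica
states under cluster.collections.<name>.shards.<name>.replicas.

```powershell
# Sketch only: both function names are hypothetical, not part of the test script.

# Counts replicas by state from a parsed CLUSTERSTATUS response.
# Assumes the cluster.collections.<name>.shards.<name>.replicas.<name>.state layout.
function Get-ReplicaStateCounts {
    param([Parameter(Mandatory)] $ClusterStatus)
    $counts = @{}
    foreach ($coll in $ClusterStatus.cluster.collections.PSObject.Properties) {
        foreach ($shard in $coll.Value.shards.PSObject.Properties) {
            foreach ($replica in $shard.Value.replicas.PSObject.Properties) {
                $state = [string]$replica.Value.state
                # A missing key yields $null, which casts to 0
                $counts[$state] = 1 + [int]$counts[$state]
            }
        }
    }
    return $counts
}

# Polls Solr until every replica reports "active" or the timeout expires.
# Returns the elapsed seconds on recovery, or $null on timeout.
function Measure-SolrRecovery {
    param(
        [string]$SolrBaseUrl = "http://localhost:8983/solr",
        [int]$TimeoutSeconds = 600,
        [int]$PollSeconds = 5
    )
    $uri = "$SolrBaseUrl/admin/collections?action=CLUSTERSTATUS&wt=json"
    $timer = [System.Diagnostics.Stopwatch]::StartNew()
    while ($timer.Elapsed.TotalSeconds -lt $TimeoutSeconds) {
        try {
            $response = Invoke-RestMethod -Uri $uri -Method Get
            $counts = Get-ReplicaStateCounts -ClusterStatus $response
            $summary = ($counts.GetEnumerator() | Sort-Object Name |
                ForEach-Object { "$($_.Name)=$($_.Value)" }) -join ", "
            Write-Host ("  {0,4:N0}s  {1}" -f $timer.Elapsed.TotalSeconds, $summary)
            # Recovered when "active" is the only state present
            if ($counts.Count -eq 1 -and $counts.ContainsKey("active")) {
                return [math]::Round($timer.Elapsed.TotalSeconds)
            }
        } catch {
            Write-Warning "  CLUSTERSTATUS request failed: $_"
        }
        Start-Sleep -Seconds $PollSeconds
    }
    return $null
}
```

Calling Measure-SolrRecovery straight after the restart in step 5 would print
one line of replica state counts per poll and return the number of seconds
until all replicas are active again, or $null if they never get there within
the timeout. Running it against both the v9 and v10 images should make the
difference easy to compare.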
We looked through the Solr Jira for existing bugs but could not find any that
match these symptoms.
If this could be raised as a bug, or if someone could advise on a solution,
that would be much appreciated.
Thanks,
Liam Newton
Email: [email protected]
Platform Engineer