Hi,

How do you handle ZK in k8s? Apparently not through the Solr Operator. Do you use a StatefulSet with a Service in front? A normal Service or a headless one? Have you configured the STS so that k8s is not allowed to take down more than one of the pod replicas at a time, and have you spread the pods across unique k8s nodes?

Many questions... More details, config, and logs would be helpful. Solr does not like the entire ZK ensemble being unavailable.
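As a rough sketch of the setup those questions are getting at: a PodDisruptionBudget that allows at most one ZK pod to be evicted at a time, a headless Service, and pod anti-affinity that keeps the ZK pods on separate nodes. Every name and label here (`zookeeper-pdb`, `zookeeper-headless`, `app: zookeeper`) is an assumption for illustration and not taken from the reported cluster. The StatefulSet is trimmed to the scheduling-related fields, so the ZooKeeper configuration itself (myid, server list, volumes) is omitted.

```yaml
# Sketch only: names and labels are hypothetical, not from the reported setup.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zookeeper-pdb            # hypothetical name
spec:
  maxUnavailable: 1              # voluntary evictions may remove at most one ZK pod at a time
  selector:
    matchLabels:
      app: zookeeper             # must match the labels on the StatefulSet pods
---
apiVersion: v1
kind: Service
metadata:
  name: zookeeper-headless       # hypothetical name
spec:
  clusterIP: None                # headless: each pod gets its own stable DNS name
  selector:
    app: zookeeper
  ports:
    - name: client
      port: 2181
    - name: follower
      port: 2888
    - name: election
      port: 3888
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zookeeper
spec:
  serviceName: zookeeper-headless
  replicas: 3
  selector:
    matchLabels:
      app: zookeeper
  template:
    metadata:
      labels:
        app: zookeeper
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: zookeeper
              topologyKey: kubernetes.io/hostname   # never two ZK pods on the same node
      containers:
        - name: zookeeper
          image: zookeeper:3.9   # ZooKeeper env, volumes and probes omitted
          ports:
            - containerPort: 2181
```

Note that a PodDisruptionBudget only limits voluntary disruptions such as node drains and evictions. It does not protect against node failures. With a headless Service, each pod is reachable under a stable name of the form `zookeeper-0.zookeeper-headless`, which can be listed individually in `ZK_HOST`.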
Jan

> On 13 May 2026 at 19:37, Liam Newton <[email protected]> wrote:
>
> Hello,
>
> I am new to this, but I was hoping to report a bug or ask for some advice.
>
> We upgraded our Solr cluster to version 10 in the past two months and have
> noticed that when we restart our ZooKeeper instances, in particular the
> leader instance, Solr has trouble managing its collections.
>
> Collections go into the degraded state and stay there until they recover.
> However, recovery time depends on the number of collections. For example,
> 10 collections with 2 replicas each can take up to 5 minutes, and the time
> increases exponentially with the number of replicas. Sometimes a node will
> also go into the down state and stay there.
>
> We had a brief look through the code and noticed that ZooKeeper
> disconnection is now handled differently, and we believe that this change
> is causing the problem.
>
> We run our ZooKeeper and Solr instances in Kubernetes and experience this
> quite regularly, because Kubernetes sometimes rebalances workloads of its
> own accord, causing ZooKeeper to be moved or recycled more than once a
> week. To ensure that this wasn't a quirk of ZooKeeper, we also reproduced
> the issue in a Docker Compose setup and saw the same behaviour.
>
> We went back to v9 and did not experience any issues at all.
>
> Below are a Docker Compose template and a script to reproduce this.
>
> Docker-compose.yml
>
> services:
>
>   # ─── ZooKeeper ensemble (3 nodes for quorum) ───────────────────────────────
>
>   zookeeper-0:
>     image: zookeeper:3.9
>     hostname: zookeeper-0
>     restart: unless-stopped
>     environment:
>       ZOO_MY_ID: "1"
>       ZOO_SERVERS: "server.1=zookeeper-0:2888:3888;2181 server.2=zookeeper-1:2888:3888;2181 server.3=zookeeper-2:2888:3888;2181"
>     volumes:
>       - zk0-data:/data
>       - zk0-datalog:/datalog
>     healthcheck:
>       # AdminServer HTTP endpoint: works regardless of election state;
>       # 4LW/zkServer.sh unreliable in 3.9
>       test: ["CMD", "wget", "-q", "--spider", "http://localhost:8080/commands/ruok"]
>       interval: 5s
>       timeout: 5s
>       retries: 20
>       start_period: 20s
>
>   zookeeper-1:
>     image: zookeeper:3.9
>     hostname: zookeeper-1
>     restart: unless-stopped
>     environment:
>       ZOO_MY_ID: "2"
>       ZOO_SERVERS: "server.1=zookeeper-0:2888:3888;2181 server.2=zookeeper-1:2888:3888;2181 server.3=zookeeper-2:2888:3888;2181"
>     volumes:
>       - zk1-data:/data
>       - zk1-datalog:/datalog
>     healthcheck:
>       test: ["CMD", "wget", "-q", "--spider", "http://localhost:8080/commands/ruok"]
>       interval: 5s
>       timeout: 5s
>       retries: 20
>       start_period: 20s
>
>   zookeeper-2:
>     image: zookeeper:3.9
>     hostname: zookeeper-2
>     restart: unless-stopped
>     environment:
>       ZOO_MY_ID: "3"
>       ZOO_SERVERS: "server.1=zookeeper-0:2888:3888;2181 server.2=zookeeper-1:2888:3888;2181 server.3=zookeeper-2:2888:3888;2181"
>     volumes:
>       - zk2-data:/data
>       - zk2-datalog:/datalog
>     healthcheck:
>       test: ["CMD", "wget", "-q", "--spider", "http://localhost:8080/commands/ruok"]
>       interval: 5s
>       timeout: 5s
>       retries: 20
>       start_period: 20s
>
>   # ─── ZK chroot init (creates /solrcloud-test znode, then exits) ─────────────
>
>   zk-init:
>     image: zookeeper:3.9
>     restart: "no"
>     depends_on:
>       zookeeper-0:
>         condition: service_healthy
>     command: >
>       sh -c "zkCli.sh -server zookeeper-0:2181 create /solrcloud-test '' 2>&1 | tail -1; echo 'ZK chroot ready'"
>     healthcheck:
>       test: ["CMD-SHELL", "exit 0"]
>       interval: 5s
>       retries: 1
>
>   # ─── SolrCloud nodes ────────────────────────────────────────────────────────
>
>   solrcloud-0:
>     image: solr:10
>     hostname: solrcloud-0
>     restart: unless-stopped
>     environment:
>       SOLR_SKIP_ROOT_CHECK: "true"
>       SOLR_PORT: "8983"
>       SOLR_JAVA_MEM: "-XX:+UseContainerSupport -XX:MaxRAMPercentage=60.0"
>       # /solrcloud-test chroot isolates this cluster within the ZK ensemble
>       ZK_HOST: "zookeeper-0:2181,zookeeper-1:2181,zookeeper-2:2181/solrcloud-test"
>       SOLR_HOST: "solrcloud-0"
>       SOLR_LOG_LEVEL: "WARN"
>       LOG4J_FORMAT_MSG_NO_LOOKUPS: "true"
>       SOLR_OPTS: "-Dhost=solrcloud-0"
>     ports:
>       - "8983:8983"
>     depends_on:
>       zookeeper-0:
>         condition: service_healthy
>       zookeeper-1:
>         condition: service_healthy
>       zookeeper-2:
>         condition: service_healthy
>       zk-init:
>         condition: service_completed_successfully
>     volumes:
>       - solr0-data:/var/solr
>       - ./config/solr-log.xml:/opt/solr/server/resources/log4j2.xml:ro
>       - ./config/solr-log.xml:/var/solr/log4j2.xml:ro
>     healthcheck:
>       test: ["CMD", "curl", "-f", "http://localhost:8983/solr/admin/info/system"]
>       interval: 10s
>       timeout: 5s
>       retries: 10
>       start_period: 60s
>
>   solrcloud-1:
>     image: solr:10
>     hostname: solrcloud-1
>     restart: unless-stopped
>     environment:
>       SOLR_SKIP_ROOT_CHECK: "true"
>       SOLR_PORT: "8983"
>       SOLR_JAVA_MEM: "-XX:+UseContainerSupport -XX:MaxRAMPercentage=60.0"
>       ZK_HOST: "zookeeper-0:2181,zookeeper-1:2181,zookeeper-2:2181/solrcloud-test"
>       SOLR_HOST: "solrcloud-1"
>       SOLR_LOG_LEVEL: "WARN"
>       LOG4J_FORMAT_MSG_NO_LOOKUPS: "true"
>       SOLR_OPTS: "-Dhost=solrcloud-1"
>     ports:
>       - "8984:8983"
>     depends_on:
>       zookeeper-0:
>         condition: service_healthy
>       zookeeper-1:
>         condition: service_healthy
>       zookeeper-2:
>         condition: service_healthy
>       zk-init:
>         condition: service_completed_successfully
>     volumes:
>       - solr1-data:/var/solr
>       - ./config/solr-log.xml:/opt/solr/server/resources/log4j2.xml:ro
>       - ./config/solr-log.xml:/var/solr/log4j2.xml:ro
>     healthcheck:
>       test: ["CMD", "curl", "-f", "http://localhost:8983/solr/admin/info/system"]
>       interval: 10s
>       timeout: 5s
>       retries: 10
>       start_period: 60s
>
> volumes:
>   zk0-data:
>   zk0-datalog:
>   zk1-data:
>   zk1-datalog:
>   zk2-data:
>   zk2-datalog:
>   solr0-data:
>   solr1-data:
>
> Zk-failover-test.ps1
>
> $ErrorActionPreference = "Stop"
> Set-Location $PSScriptRoot
>
> $ZkContainers = @(
>     "solrcloud-docker-compose-zookeeper-0-1",
>     "solrcloud-docker-compose-zookeeper-1-1",
>     "solrcloud-docker-compose-zookeeper-2-1"
> )
> $SolrContainers = @(
>     "solrcloud-docker-compose-solrcloud-0-1",
>     "solrcloud-docker-compose-solrcloud-1-1"
> )
> $AllHealthChecked = $ZkContainers + $SolrContainers
>
> # ── 1. Start the stack ────────────────────────────────────────────────────────
>
> Write-Host "`n[1/5] Starting Docker Compose stack..." -ForegroundColor Cyan
> docker compose up -d
> if ($LASTEXITCODE -ne 0) { throw "docker compose up failed" }
>
> # ── 2. Wait for all containers to be healthy ──────────────────────────────────
>
> Write-Host "`n[2/5] Waiting for all containers to be healthy..." -ForegroundColor Cyan
> $timeout = 300
> $elapsed = 0
>
> while ($elapsed -lt $timeout) {
>     Start-Sleep -Seconds 5
>     $elapsed += 5
>
>     $statuses = $AllHealthChecked | ForEach-Object {
>         docker inspect $_ --format "{{.State.Health.Status}}" 2>$null
>     }
>
>     $unhealthy = ($statuses | Where-Object { $_ -ne "healthy" }).Count
>     Write-Host "  ${elapsed}s - $($statuses.Count - $unhealthy)/$($statuses.Count) healthy"
>
>     if ($unhealthy -eq 0) { break }
> }
>
> if ($elapsed -ge $timeout) { throw "Containers did not become healthy within ${timeout}s" }
> Write-Host "  All containers healthy." -ForegroundColor Green
>
> # ── 3. Create 10 Solr collections ─────────────────────────────────────────────
>
> Write-Host "`n[3/5] Creating 10 Solr collections..." `
>     -ForegroundColor Cyan
> for ($i = 1; $i -le 10; $i++) {
>     $name = "test-collection-$i"
>     $uri = "http://localhost:8983/solr/admin/collections?action=CREATE&name=$name&numShards=1&replicationFactor=2&wt=json"
>     try {
>         $resp = Invoke-RestMethod -Uri $uri -Method Get
>         $status = $resp.responseHeader.status
>         Write-Host "  Created $name (status: $status)"
>     } catch {
>         Write-Warning "  Failed to create $name`: $_"
>     }
> }
>
> # ── 4. Find the ZooKeeper leader ──────────────────────────────────────────────
>
> Write-Host "`n[4/5] Finding ZooKeeper leader..." -ForegroundColor Cyan
> $leaderContainer = $null
>
> foreach ($container in $ZkContainers) {
>     $stat = docker exec $container wget -qO- "http://localhost:8080/commands/stat" 2>&1
>     if ($stat -match '"server_state"\s*:\s*"leader"') {
>         $leaderContainer = $container
>         Write-Host "  Leader: $container" -ForegroundColor Yellow
>         break
>     }
> }
>
> if (-not $leaderContainer) { throw "Could not find ZooKeeper leader" }
>
> # ── 5. Restart the leader ─────────────────────────────────────────────────────
>
> Write-Host "`n[5/5] Restarting ZK leader ($leaderContainer)..." -ForegroundColor Cyan
> docker restart $leaderContainer
> Write-Host "  Restarted. New leader election underway." -ForegroundColor Green
>
> Write-Host "`nDone." -ForegroundColor Green
>
> We had a look through the Solr Jira for existing bugs but could not find
> any that match these symptoms.
>
> If this could be raised as a bug, or if someone could advise on a solution,
> that would be much appreciated.
>
> Thanks,
>
> Liam Newton
> Email: [email protected]
> Platform Engineer
