Hello,

I am new to this, but I was hoping to raise a bug or ask for some advice.

We upgraded our Solr cluster to version 10 within the past two months and have 
noticed that when we restart our ZooKeeper instances, in particular the leader 
instance, Solr has trouble managing its collections.

Collections go into the degraded state and stay there until they recover. 
Recovery time depends on the number of collections: for example, 10 collections 
with 2 replicas each can take up to 5 minutes, and the time increases 
exponentially with the number of replicas. Sometimes a node also goes into the 
down state and stays there.

We had a brief look through the code and noticed that ZooKeeper disconnections 
are handled differently in version 10. We believe this change is causing the 
behaviour described above.

We run our ZooKeeper and Solr instances in Kubernetes and hit this quite 
regularly, because Kubernetes sometimes rebalances workloads of its own accord, 
which moves or recycles ZooKeeper more than once a week. To make sure this was 
not a quirk of ZooKeeper, we also reproduced the issue with Docker Compose and 
saw the same behaviour.

We went back to v9 and did not experience any issues at all.

Below are a Docker Compose template and a script to reproduce the issue.

docker-compose.yml
services:

  # ─── ZooKeeper ensemble (3 nodes for quorum) ───────────────────────────────

  zookeeper-0:
    image: zookeeper:3.9
    hostname: zookeeper-0
    restart: unless-stopped
    environment:
      ZOO_MY_ID: "1"
      ZOO_SERVERS: "server.1=zookeeper-0:2888:3888;2181 server.2=zookeeper-1:2888:3888;2181 server.3=zookeeper-2:2888:3888;2181"
    volumes:
      - zk0-data:/data
      - zk0-datalog:/datalog
    healthcheck:
      # AdminServer HTTP endpoint: works regardless of election state; 4LW/zkServer.sh unreliable in 3.9
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:8080/commands/ruok"]
      interval: 5s
      timeout: 5s
      retries: 20
      start_period: 20s

  zookeeper-1:
    image: zookeeper:3.9
    hostname: zookeeper-1
    restart: unless-stopped
    environment:
      ZOO_MY_ID: "2"
      ZOO_SERVERS: "server.1=zookeeper-0:2888:3888;2181 server.2=zookeeper-1:2888:3888;2181 server.3=zookeeper-2:2888:3888;2181"
    volumes:
      - zk1-data:/data
      - zk1-datalog:/datalog
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:8080/commands/ruok"]
      interval: 5s
      timeout: 5s
      retries: 20
      start_period: 20s

  zookeeper-2:
    image: zookeeper:3.9
    hostname: zookeeper-2
    restart: unless-stopped
    environment:
      ZOO_MY_ID: "3"
      ZOO_SERVERS: "server.1=zookeeper-0:2888:3888;2181 server.2=zookeeper-1:2888:3888;2181 server.3=zookeeper-2:2888:3888;2181"
    volumes:
      - zk2-data:/data
      - zk2-datalog:/datalog
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:8080/commands/ruok"]
      interval: 5s
      timeout: 5s
      retries: 20
      start_period: 20s

  # ─── ZK chroot init (creates /solrcloud-test znode, then exits) ─────────────

  zk-init:
    image: zookeeper:3.9
    restart: "no"
    depends_on:
      zookeeper-0:
        condition: service_healthy
    command: >
      sh -c "zkCli.sh -server zookeeper-0:2181 create /solrcloud-test '' 2>&1 | tail -1; echo 'ZK chroot ready'"
    healthcheck:
      test: ["CMD-SHELL", "exit 0"]
      interval: 5s
      retries: 1

  # ─── SolrCloud nodes ────────────────────────────────────────────────────────

  solrcloud-0:
    image: solr:10
    hostname: solrcloud-0
    restart: unless-stopped
    environment:
      SOLR_SKIP_ROOT_CHECK: "true"
      SOLR_PORT: "8983"
      SOLR_JAVA_MEM: "-XX:+UseContainerSupport -XX:MaxRAMPercentage=60.0"
      # /solrcloud-test chroot isolates this cluster within the ZK ensemble
      ZK_HOST: "zookeeper-0:2181,zookeeper-1:2181,zookeeper-2:2181/solrcloud-test"
      SOLR_HOST: "solrcloud-0"
      SOLR_LOG_LEVEL: "WARN"
      LOG4J_FORMAT_MSG_NO_LOOKUPS: "true"
      SOLR_OPTS: "-Dhost=solrcloud-0"
    ports:
      - "8983:8983"
    depends_on:
      zookeeper-0:
        condition: service_healthy
      zookeeper-1:
        condition: service_healthy
      zookeeper-2:
        condition: service_healthy
      zk-init:
        condition: service_completed_successfully
    volumes:
      - solr0-data:/var/solr
      - ./config/solr-log.xml:/opt/solr/server/resources/log4j2.xml:ro
      - ./config/solr-log.xml:/var/solr/log4j2.xml:ro
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8983/solr/admin/info/system"]
      interval: 10s
      timeout: 5s
      retries: 10
      start_period: 60s

  solrcloud-1:
    image: solr:10
    hostname: solrcloud-1
    restart: unless-stopped
    environment:
      SOLR_SKIP_ROOT_CHECK: "true"
      SOLR_PORT: "8983"
      SOLR_JAVA_MEM: "-XX:+UseContainerSupport -XX:MaxRAMPercentage=60.0"
      ZK_HOST: "zookeeper-0:2181,zookeeper-1:2181,zookeeper-2:2181/solrcloud-test"
      SOLR_HOST: "solrcloud-1"
      SOLR_LOG_LEVEL: "WARN"
      LOG4J_FORMAT_MSG_NO_LOOKUPS: "true"
      SOLR_OPTS: "-Dhost=solrcloud-1"
    ports:
      - "8984:8983"
    depends_on:
      zookeeper-0:
        condition: service_healthy
      zookeeper-1:
        condition: service_healthy
      zookeeper-2:
        condition: service_healthy
      zk-init:
        condition: service_completed_successfully
    volumes:
      - solr1-data:/var/solr
      - ./config/solr-log.xml:/opt/solr/server/resources/log4j2.xml:ro
      - ./config/solr-log.xml:/var/solr/log4j2.xml:ro
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8983/solr/admin/info/system"]
      interval: 10s
      timeout: 5s
      retries: 10
      start_period: 60s

volumes:
  zk0-data:
  zk0-datalog:
  zk1-data:
  zk1-datalog:
  zk2-data:
  zk2-datalog:
  solr0-data:
  solr1-data:

Zk-failover-test.ps1

$ErrorActionPreference = "Stop"
Set-Location $PSScriptRoot

$ZkContainers = @(
    "solrcloud-docker-compose-zookeeper-0-1",
    "solrcloud-docker-compose-zookeeper-1-1",
    "solrcloud-docker-compose-zookeeper-2-1"
)
$SolrContainers = @(
    "solrcloud-docker-compose-solrcloud-0-1",
    "solrcloud-docker-compose-solrcloud-1-1"
)
$AllHealthChecked = $ZkContainers + $SolrContainers

# ── 1. Start the stack ────────────────────────────────────────────────────────

Write-Host "`n[1/5] Starting Docker Compose stack..." -ForegroundColor Cyan
docker compose up -d
if ($LASTEXITCODE -ne 0) { throw "docker compose up failed" }

# ── 2. Wait for all containers to be healthy ──────────────────────────────────

Write-Host "`n[2/5] Waiting for all containers to be healthy..." -ForegroundColor Cyan
$timeout = 300
$elapsed = 0

while ($elapsed -lt $timeout) {
    Start-Sleep -Seconds 5
    $elapsed += 5

    $statuses = $AllHealthChecked | ForEach-Object {
        docker inspect $_ --format "{{.State.Health.Status}}" 2>$null
    }

    $unhealthy = ($statuses | Where-Object { $_ -ne "healthy" }).Count
    Write-Host "  ${elapsed}s - $($statuses.Count - $unhealthy)/$($statuses.Count) healthy"

    if ($unhealthy -eq 0) { break }
}

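# Check the last observed health state, so a stack that turns healthy on the final poll is not reported as a timeout
if ($unhealthy -ne 0) { throw "Containers did not become healthy within ${timeout}s" }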
Write-Host "  All containers healthy." -ForegroundColor Green

# ── 3. Create 10 Solr collections ─────────────────────────────────────────────

Write-Host "`n[3/5] Creating 10 Solr collections..." -ForegroundColor Cyan
for ($i = 1; $i -le 10; $i++) {
    $name = "test-collection-$i"
    $uri  = "http://localhost:8983/solr/admin/collections?action=CREATE&name=$name&numShards=1&replicationFactor=2&wt=json"
    try {
        $resp = Invoke-RestMethod -Uri $uri -Method Get
        $status = $resp.responseHeader.status
        Write-Host "  Created $name (status: $status)"
    } catch {
        Write-Warning "  Failed to create $name`: $_"
    }
}

# ── 4. Find the ZooKeeper leader ──────────────────────────────────────────────

Write-Host "`n[4/5] Finding ZooKeeper leader..." -ForegroundColor Cyan
$leaderContainer = $null

foreach ($container in $ZkContainers) {
    $stat = docker exec $container wget -qO- "http://localhost:8080/commands/stat" 2>&1
    if ($stat -match '"server_state"\s*:\s*"leader"') {
        $leaderContainer = $container
        Write-Host "  Leader: $container" -ForegroundColor Yellow
        break
    }
}

if (-not $leaderContainer) { throw "Could not find ZooKeeper leader" }

# ── 5. Restart the leader ─────────────────────────────────────────────────────

Write-Host "`n[5/5] Restarting ZK leader ($leaderContainer)..." -ForegroundColor Cyan
docker restart $leaderContainer
Write-Host "  Restarted. New leader election underway." -ForegroundColor Green

Write-Host "`nDone." -ForegroundColor Green
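
To put a number on the recovery time, something like the following Python 
sketch could be run right after the restart. It is an illustration rather than 
part of the reproduction above: it assumes the Collections API CLUSTERSTATUS 
action is reachable on the 8983 port mapped in the compose file, and the 
function names, timeout and poll interval are hypothetical choices.

```python
import json
import time
import urllib.request
from collections import Counter

# Assumption: same host/port mapping as solrcloud-0 in the compose file.
STATUS_URL = (
    "http://localhost:8983/solr/admin/collections"
    "?action=CLUSTERSTATUS&wt=json"
)


def count_replica_states(cluster_status):
    """Count replicas per state across all collections in a CLUSTERSTATUS response."""
    counts = Counter()
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for collection in collections.values():
        for shard in collection.get("shards", {}).values():
            for replica in shard.get("replicas", {}).values():
                counts[replica.get("state", "unknown")] += 1
    return counts


def wait_for_recovery(url=STATUS_URL, timeout_s=900, poll_s=5):
    """Poll until every replica is active.

    Returns the elapsed seconds, or None if timeout_s passes first.
    The defaults are arbitrary illustration values.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                counts = count_replica_states(json.load(resp))
        except (OSError, ValueError) as exc:
            # Solr may refuse connections or return non-JSON while it reconnects.
            print(f"  status request failed: {exc}")
        else:
            elapsed = time.monotonic() - start
            print(f"  {elapsed:6.0f}s  {dict(counts)}")
            if counts and set(counts) == {"active"}:
                return elapsed
        time.sleep(poll_s)
    return None
```

Calling wait_for_recovery() immediately after the leader restart prints the 
replica state counts every few seconds and returns the number of seconds until 
all replicas report active, which makes the v9 and v10 runs directly comparable.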


We looked through the Solr Jira for existing bugs but could not find any that 
match these symptoms.

If this could be raised as a bug, or if someone could suggest a solution, that 
would be much appreciated.

Thanks,


Liam Newton

Email: [email protected]

Platform Engineer
