Hello, it's not entirely clear from your message: did the exchange eventually finish, or were you getting this WARN message the whole time?
Fri, 1 May 2020 at 12:32, Ilya Kasnacheev <ilya.kasnach...@gmail.com>:

> Hello!
>
> This description sounds like a typical hanging Partition Map Exchange, but
> you should be able to see that in the logs. If you don't, you can collect
> thread dumps from all nodes with jstack and check them for any stalled
> operations (or share them with us).
>
> Regards,
> --
> Ilya Kasnacheev
>
> Fri, 1 May 2020 at 11:53, userx <gagan...@gmail.com>:
>
>> Hi Pavel,
>>
>> I am using 2.8 and still getting the same issue. Here is the ecosystem:
>>
>> 19 Ignite servers (S1 to S19), each running with 16 GB of max JVM heap
>> and in persistent mode.
>>
>> 96 clients (C1 to C96).
>>
>> There are 19 machines; one Ignite server is started on each machine. The
>> clients are evenly distributed across the machines.
>>
>> C19 tries to create a cache and gets a timeout exception, as I have a
>> 5-minute timeout. When I looked into the coordinator logs, within that
>> 5-minute span it logged the following:
>>
>> 2020-04-24 15:37:09,434 WARN [exchange-worker-#45%S1%] {}
>> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture
>> - Unable to await partitions release latch within timeout. Some nodes have
>> not sent acknowledgement for latch completion. It's possible due to
>> unfinished atomic updates, transactions or not released explicit locks on
>> that nodes. Please check logs for errors on nodes with ids reported in
>> latch `pendingAcks` collection [latch=ServerLatch [permits=4,
>> pendingAcks=HashSet [84b8416c-fa06-4544-9ce0-e3dfba41038a,
>> 19bd7744-0ced-4123-a35f-ddf0cf9f55c4,
>> 533af8f9-c0f6-44b6-92d4-658f86ffaca0,
>> 1b31cb25-abbc-4864-88a3-5a4df37a0cf4],
>> super=CompletableLatch [id=CompletableLatchUid [id=exchange,
>> topVer=AffinityTopologyVersion [topVer=174, minorTopVer=1]]]]]
>>
>> The 4 nodes that have not acknowledged latch completion are S14, S7, S18,
>> and S4.
>>
>> I went to look at the logs of S4; it just records the addition of C19 to
>> the topology and then C19 leaving it after 5 minutes. The only thing I see
>> consistently in the GC log is "Total time for which application threads
>> were stopped: 0.0006225 seconds, Stopping threads took: 0.0000887 seconds".
>>
>> I understand that until all the atomic updates and transactions are
>> finished, clients are not able to create caches by communicating with the
>> coordinator, but is there a way around this?
>>
>> So the question is: is this issue still present in 2.8?
>>
>> --
>> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
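For reference, a minimal sketch of the diagnostics suggested above, assuming a JDK (`jps`/`jstack`) on PATH on each server node; the process-matching pattern and file names are assumptions to adapt to your setup:

```shell
#!/bin/sh
# Sketch: capture a few spaced thread dumps from the local Ignite server JVM,
# to be run on each of the nodes listed in `pendingAcks` (here S14, S7, S18, S4).
PID=$(jps -l 2>/dev/null | grep -i ignite | awk '{print $1}' | head -n 1)
if [ -n "$PID" ]; then
  for i in 1 2 3; do
    jstack "$PID" > "threads-$(hostname)-$i.txt"
    sleep 10   # spacing the dumps helps spot threads that never progress
  done
fi

# Long-running transactions can also hold the partitions release latch;
# control.sh ships in IGNITE_HOME/bin and its --tx command can list them, e.g.:
# $IGNITE_HOME/bin/control.sh --tx --min-duration 300
```

Comparing the dumps from the four non-acknowledging servers should show whether any thread is stuck in an atomic update, transaction, or explicit lock, as the WARN message suggests.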