Hi,

I have run into the following scenario.

Three Ignite data grid servers, say S1, S2 and S3, are started in a cluster
with native persistence enabled.
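
For context, all three nodes enable persistence in the usual way through the
data storage configuration. A minimal sketch of that setup (not my exact
config, just the relevant part):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.DataRegionConfiguration;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class PersistentServerStartup {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // Enable native persistence on the default data region.
            DataStorageConfiguration storageCfg = new DataStorageConfiguration();
            storageCfg.setDefaultDataRegionConfiguration(
                new DataRegionConfiguration().setPersistenceEnabled(true));

            cfg.setDataStorageConfiguration(storageCfg);

            Ignite ignite = Ignition.start(cfg);

            // A persistent cluster starts inactive; activate it once all nodes are up.
            ignite.cluster().active(true);
        }
    }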

On S1, rebalancing started for a previous topology version, say 24.
The following log line shows this:
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander
- Starting rebalancing [grp=PCache, mode=ASYNC,
fromNode=a8b0f10f-8ad9-45a4-aab3-a0562fd0d202, partitionsCount=311,
topology=AffinityTopologyVersion [topVer=24, minorTopVer=0], rebalanceId=65]

During the process, S1 hit an exception while archiving a WAL segment:
Caused by: class org.apache.ignite.IgniteCheckedException: Failed to archive
WAL segment
-> No space left on device
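
As far as I understand, the WAL archive location and its size budget are
controlled from the same data storage configuration, e.g. something like the
sketch below (the path is just an example, and setMaxWalArchiveSize assumes an
Ignite version that has it):

    import org.apache.ignite.configuration.DataStorageConfiguration;

    public class WalArchiveConfig {
        /** Builds a storage config with an explicit WAL archive path and size cap. */
        public static DataStorageConfiguration storageConfig() {
            DataStorageConfiguration storageCfg = new DataStorageConfiguration();

            // Put the WAL archive on a volume with enough free space
            // and cap how much the archive is allowed to grow.
            storageCfg.setWalArchivePath("/data/wal/archive"); // example path
            storageCfg.setMaxWalArchiveSize(10L * 1024 * 1024 * 1024); // ~10 GB

            return storageCfg;
        }
    }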

Since S2 and S3 are in the same cluster, the corresponding exchange threads on
them keep looping with the message:

"Unable to await partitions release latch within timeout: ServerLatch
[permits=1, pendingAcks=[327b3756-da19-4dd6-90da-7e0797f269d7],
super=CompletableLatch [id=exchange, topVer=AffinityTopologyVersion
[topVer=24, minorTopVer=1]]]"

The pendingAcks entry contains the node id of S1.

I looked into the source, and the exchange thread keeps spinning in the
following loop:

        if (!localJoinExchange()) {
            try {
                while (true) {
                    try {
                        releaseLatch.await(waitTimeout, TimeUnit.MILLISECONDS);

                        if (log.isInfoEnabled())
                            log.info("Finished waiting for partitions release latch: " + releaseLatch);

                        break;
                    }
                    catch (IgniteFutureTimeoutCheckedException ignored) {
                        U.warn(log, "Unable to await partitions release latch within timeout: " + releaseLatch);

                        // Try to resend ack.
                        releaseLatch.countDown();
                    }
                }
            }
            catch (IgniteCheckedException e) {
                U.warn(log, "Stop waiting for partitions release latch: " + e.getMessage());
            }
        }

Although the cluster is able to add a new node and reflect that in the
topology, the rebalancing for the previous topVer 24 never finishes.

Is there a timeout to come out of this loop, or any other configuration that
would let the cluster move forward with S2 and S3?
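
For example, I was wondering whether configuring a failure handler so that S1
stops itself on a critical error like the WAL archiving failure would let the
latch be released once S1 leaves the topology. A sketch of what I mean
(assuming the failure handler actually covers WAL archiving errors; I have not
verified that):

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

    public class FailureHandlerStartup {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // Stop (or halt) the local node when a critical failure is reported,
            // instead of leaving it half-alive and blocking the exchange.
            cfg.setFailureHandler(new StopNodeOrHaltFailureHandler());

            Ignition.start(cfg);
        }
    }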
