If backup partitions are available when a node is lost, we should not
expect lost partitions.

There is a lot more to this story than this thread explains, so for the
community: please don't follow this procedure.

https://ignite.apache.org/docs/latest/configuring-caches/partition-loss-policy
"A partition is lost when both the primary copy and all backup copies of
the partition are not available to the cluster, i.e. when the primary and
backup nodes for the partition become unavailable."

If you attempt to access a cache and receive a lost partitions error, this
means there IS DATA LOSS. Partition loss means there are no primary or
backup copies of a particular cache partition available. Have multiple
server nodes experienced trouble? Can we be certain that the affected
caches were created with backups>=1?
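
To make the expectation concrete: for a partitioned cache to survive a
single node failure it needs at least one backup configured, and a safe
partition loss policy makes any loss visible instead of silent. A minimal
sketch (the cache name and key/value types here are invented for
illustration):

    import org.apache.ignite.cache.CacheMode;
    import org.apache.ignite.cache.PartitionLossPolicy;
    import org.apache.ignite.configuration.CacheConfiguration;

    // One backup copy per partition: tolerates the loss of any single node.
    CacheConfiguration<Integer, String> ccfg = new CacheConfiguration<>("myCache");
    ccfg.setCacheMode(CacheMode.PARTITIONED);
    ccfg.setBackups(1);

    // Fail operations on lost partitions instead of silently serving partial data.
    ccfg.setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);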

If a node fails to start up and complains about maintenance tasks, we
should be very suspicious that this node's persistent data is corrupted. If
the cluster is activated with that node missing and caches report lost
partitions, then we know those caches have lost some data. If there are no
lost partitions, we can safely remove the corrupted node from the baseline,
bring up a fresh node, and add it to the baseline in its place, restoring
redundancy. If there are lost partitions and we need to reset them to bring
a cache back online, we should expect that cache to be missing data and to
possibly need reloading.
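
If in doubt, the state can be inspected from code before deciding anything.
A rough sketch against the public Ignite API (assuming "ignite" is an
already-started node or client instance):

    import java.util.Collection;

    // Report which caches, if any, actually have lost partitions.
    for (String name : ignite.cacheNames()) {
        Collection<Integer> lost = ignite.cache(name).lostPartitions();
        if (!lost.isEmpty())
            System.out.println(name + " lost partitions: " + lost);
    }

    // Only once the data loss is accepted (and a reload is planned) reset them,
    // e.g. via ignite.resetLostPartitions(ignite.cacheNames()) or the
    // control.sh --cache reset_lost_partitions command.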

Cache configuration backups=2 is excessive except in edge cases. For
backups=n, the memory and persistence footprint is n+1 times the nominal
data footprint, so the cost scales linearly. The marginal utility we derive
from each additional backup copy, however, diminishes rapidly: if the
probability of any single node failing is p, data is lost only when all n+1
copies of a partition become unavailable at once, which (assuming roughly
independent failures) happens with probability on the order of p^(n+1) for
n backup copies.
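
For example, 100 GB of primary data occupies roughly 200 GB of memory and
disk across the cluster with backups=1 and roughly 300 GB with backups=2,
yet the second backup only pays off when two nodes holding copies of the
same partition fail at essentially the same time.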

Think of backup partitions like multiple coats of paint. After the second
coat, nobody can tell whether you applied a third or a fourth, yet each
additional coat still takes the same effort and materials to apply.

If you NEED fault tolerance, then it should be mandatory to conduct testing
to make sure the configuration you have chosen is working as expected. If
backups=1 isn't effective for single node failures, then backups=2 will
make no beneficial difference. With backups=1 we should expect a cache to
work without complaining about lost partitions when one server node is
offline.
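
As a rough illustration of such a test (a sketch only: the instance names,
cache name, and data volume are invented, and a real test would pin the
discovery configuration and run in a proper harness):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.PartitionLossPolicy;
    import org.apache.ignite.configuration.CacheConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class BackupFailoverCheck {
        public static void main(String[] args) {
            // Two server nodes in one JVM, purely for illustration.
            Ignite node1 = Ignition.start(new IgniteConfiguration().setIgniteInstanceName("node1"));
            Ignite node2 = Ignition.start(new IgniteConfiguration().setIgniteInstanceName("node2"));

            CacheConfiguration<Integer, String> ccfg = new CacheConfiguration<Integer, String>("testCache")
                .setBackups(1)
                .setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);

            IgniteCache<Integer, String> cache = node1.getOrCreateCache(ccfg);

            for (int i = 0; i < 10_000; i++)
                cache.put(i, "value-" + i);

            // Simulate an unclean failure of a single server node.
            Ignition.stop("node2", true);

            // With backups=1 every partition should still have a live copy:
            // lostPartitions() should be empty and reads should still succeed.
            System.out.println("Lost partitions: " + cache.lostPartitions());
            System.out.println("Sample read: " + cache.get(42));

            Ignition.stopAll(true);
        }
    }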

On Wed, May 29, 2024 at 12:15 PM Naveen Kumar <[email protected]>
wrote:

> Thanks very much for your prompt response Gianluca
>
> just for the community, I could solve this by running control.sh with
> reset_lost_partitions for the individual caches.
> Looks like it worked, the partition issue is resolved. I suppose there
> wouldn't be any data loss as we have set all our caches with 2 replicas
>
> coming to the node which was not getting added to the cluster earlier,
> removed from baseline --> cleared all persistence store --> brought up the
> node --> added the node to baseline, this also seems to have worked fine.
>
> Thanks
>
>
> On Wed, May 29, 2024 at 5:13 PM Gianluca Bonetti <
> [email protected]> wrote:
>
>> Hello Naveen
>>
>> Apache Ignite 2.13 is more than 2 years old, 25 months old in actual fact.
>> Three bugfix releases have been rolled out since, up to the 2.16 release.
>>
>> It seems you are restarting your cluster on a regular basis, so you'd
>> better upgrade to 2.16 as soon as possible.
>> Otherwise it will also be very difficult for people on a community-based
>> mailing list, on volunteer time, to work out a solution with a 2-year-old
>> version running.
>>
>> Besides that, you are not providing very much information about your
>> cluster setup.
>> How many nodes, what infrastructure, how many caches, overall data size.
>> One could only guess you have more than 1 node running, with at least 1
>> cache, and non-empty dataset. :)
>>
>> This document from GridGain may be helpful but I don't see the same for
>> Ignite, it may still be worth checking it out.
>>
>> https://www.gridgain.com/docs/latest/perf-troubleshooting-guide/maintenance-mode
>>
>> On the other hand you should also check your failing node.
>> If it is always the same node failing, then there should be some root
>> cause apart from Ignite.
>> Indeed if the nodes configuration is the same across all nodes, and just
>> this one fails, you should also consider some network issues (check
>> connectivity and network latency between nodes) and hardware related issues
>> (faulty disks, faulty memory)
>> In the end, one option might be to replace the faulty machine with a
>> brand new one.
>> In cloud environments this is actually quite cheap and easy to do.
>>
>> Cheers
>> Gianluca
>>
>> On Wed, 29 May 2024 at 08:43, Naveen Kumar <[email protected]>
>> wrote:
>>
>>> Hello All
>>>
>>> We are using Ignite 2.13.0
>>>
>>> After a cluster restart, one of the nodes is not coming up, and in the node
>>> logs we are seeing this error - Node requires maintenance, non-empty set of
>>> maintenance tasks is found - the node is not coming up
>>>
>>> we are getting errors like 'timeout is reached before computation is
>>> completed' on other nodes as well.
>>>
>>> I could see that we have the control.sh script to back up and clean up the
>>> corrupted files, but when I run the command, it fails.
>>>
>>> I have removed the node from the baseline and tried to run it as well, but
>>> it is still failing.
>>>
>>> What could be the solution for this? The cluster is functioning,
>>> however there are requests failing.
>>>
>>> Is there any way we can start the Ignite node in maintenance mode and try
>>> running the clean corrupted files commands?
>>>
>>> Thanks
>>> Naveen
>>>
>>>
>>>
>
> --
> Thanks & Regards,
> Naveen Bandaru
>
