Re: Nodes failed to join the cluster after restarting

Cong Guo Mon, 16 Nov 2020 20:35:51 -0800

Hi,

Please find the attached log for a complete but failed reboot. You can see
the exceptions.


On Mon, Nov 16, 2020 at 4:00 AM Ivan Bessonov <[email protected]> wrote:

> Hello,
>
> there must be a bug somewhere during node start: the node updates its
> distributed metastorage content and then tries to join an already activated
> cluster, thus creating a conflict. It's hard to tell exactly which data
> caused the conflict, especially without any logs.
>
> The topic that you mentioned (
> http://apache-ignite-users.70518.x6.nabble.com/Question-about-baseline-topology-and-cluster-activation-td34336.html)
> seems to be about the same problem, but the issue
> https://issues.apache.org/jira/browse/IGNITE-12850 is not related to it.
>
> If you have logs from those unsuccessful restart attempts, it would be
> very helpful.
>
> Sadly, distributed metastorage is an internal component for storing
> settings and has no public documentation. The developer documentation is
> probably outdated and incomplete. But just in case: the "version id" that
> the message refers to is stored in the field
> "org.apache.ignite.internal.processors.metastorage.persistence.DistributedMetaStorageImpl#ver",
> and it is incremented on every distributed metastorage setting update. You
> can find your error message in the same class.
>
> Please follow up with more questions and logs if possible, I hope we'll
> figure it out.
>
> Thank you!
>
> On Fri, Nov 13, 2020 at 02:23, Cong Guo <[email protected]> wrote:
>
>> Hi,
>>
>> I have a 3-node cluster with persistence enabled. All three nodes are
>> in the baseline topology. The Ignite version is 2.8.1.
>>
>> When I restart the first node, it encounters an error and fails to join
>> the cluster. The error message is "Caused by:
>> org.apache.ignite.spi.IgniteSpiException: Attempting to join node with
>> larger distributed metastorage version id. The node is most likely in
>> invalid state and can't be joined." I tried several times but got the same
>> error.
>>
>> Then I restart the second node, and it encounters the same error. After I
>> restart the third node, the other two nodes can start successfully and join
>> the cluster. When I restart the nodes, I do not change the baseline
>> topology. I cannot reproduce this error now.
>>
>> I find someone else has the same problem.
>> http://apache-ignite-users.70518.x6.nabble.com/Question-about-baseline-topology-and-cluster-activation-td34336.html
>>
>> The answer there was corruption in the metastorage. I do not see any
>> issue with the metastorage files, and it is very unlikely that files on
>> two different machines were corrupted at the same time. Is it possible
>> that this is another bug like
>> https://issues.apache.org/jira/browse/IGNITE-12850?
>>
>> Do you have any documentation about how the version id is updated and read?
>> Could you please show me in the source code where the version id is read
>> when a node starts and where the version id is updated when a node stops?
>> Thank you!
>>
>>
>>
>
> --
> Sincerely yours,
> Ivan Bessonov
>

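For readers following the thread, the version-id check described in the
quoted reply can be pictured with a toy model. This is only a sketch and not
Ignite's code: the class and method names (ToyMetaStorage, write,
validateJoin) are hypothetical. It assumes only what the thread states, namely
that the version id is incremented on every distributed metastorage update and
that a joining node with a larger id than the cluster's is rejected.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a versioned settings store. NOT Ignite's DistributedMetaStorageImpl. */
public class ToyMetaStorage {
    private final Map<String, String> data = new HashMap<>();

    /** Hypothetical stand-in for the "version id": bumped on every update. */
    private long verId;

    /** Stores a setting and increments the version id. */
    public void write(String key, String val) {
        data.put(key, val);
        verId++;
    }

    public long versionId() {
        return verId;
    }

    /**
     * Models only the rejection case from the thread.
     * Returns null if the join is accepted, otherwise an error message.
     */
    public String validateJoin(long joiningVerId) {
        if (joiningVerId > verId)
            return "Attempting to join node with larger distributed metastorage version id.";

        return null;
    }

    public static void main(String[] args) {
        ToyMetaStorage cluster = new ToyMetaStorage();
        cluster.write("setting.a", "1");
        cluster.write("setting.b", "2");

        // A node that was in sync with the cluster before it went down...
        long restartedNodeVerId = cluster.versionId();

        // ...and, per the hypothesis in the thread, applied one more update
        // locally during start, before joining.
        restartedNodeVerId++;

        System.out.println(cluster.validateJoin(restartedNodeVerId));
    }
}
```

In this model, a restarting node that bumps its local version before joining
is always ahead of the running cluster, which would match the symptom of every
restarted node being rejected until the cluster itself changes.
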
errorlog
Description: Binary data
