Hi,

The parameter values on the two other nodes are the same. Actually, I do not configure these values at all; when you enable native persistence, you see these log entries by default. Nothing is special (the relevant part of my configuration is pasted below). When this error occurs on the restarting node, nothing happens on the two other nodes. When I restart the second node, it also fails with the same error.
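For reference, the persistence part of my configuration is essentially the standard setup. A minimal sketch of it (Ignite 2.8.1 Java API; everything unrelated is omitted and simplified):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    IgniteConfiguration cfg = new IgniteConfiguration();

    // Enable native persistence for the default data region; nothing is set
    // explicitly for the distributed metastorage or the baseline properties.
    DataStorageConfiguration storageCfg = new DataStorageConfiguration();
    storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
    cfg.setDataStorageConfiguration(storageCfg);

    Ignite ignite = Ignition.start(cfg);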
I will still need to restart the nodes in the future, one by one and without stopping the service, so this issue may happen again. The workaround requires deactivating the cluster and stopping the service, which does not work in a production environment (I sketched my understanding of it at the very end of this mail). I think we need to fix this bug, or at least understand the cause so that we can avoid it. Could you please tell me where this version value could be modified when a node starts? Do you have any guess about this bug now? I can help analyze the code. Thank you.

On Tue, Nov 17, 2020 at 4:09 AM Ivan Bessonov <[email protected]> wrote:

> Thank you for the reply!
>
> Right now the only existing distributed properties I see are these:
> - Baseline parameter 'baselineAutoAdjustEnabled' was changed from 'null' to 'false'
> - Baseline parameter 'baselineAutoAdjustTimeout' was changed from 'null' to '300000'
> - SQL parameter 'sql.disabledFunctions' was changed from 'null' to '[FILE_WRITE, CANCEL_SESSION, MEMORY_USED, CSVREAD, LINK_SCHEMA, MEMORY_FREE, FILE_READ, CSVWRITE, SESSION_ID, LOCK_MODE]'
>
> I wonder what values they have on the nodes that rejected the new node. I suggest sending the logs of those nodes as well.
> Right now I believe that this bug won't happen again on your installation, but that only makes it more elusive...
>
> The most probable reason is that the node (somehow) initialized some properties with defaults before joining the cluster, while the cluster didn't have those values at all.
> The rule is that an activated cluster can't accept changed properties from a joining node. So the workaround would be to deactivate the cluster, join the node and activate it again. But as I said, I don't think you'll see this bug ever again.
>
> On Tue, Nov 17, 2020 at 7:34 AM Cong Guo <[email protected]> wrote:
>
>> Hi,
>>
>> Please find the attached log for a complete but failed reboot. You can see the exceptions.
>>
>> On Mon, Nov 16, 2020 at 4:00 AM Ivan Bessonov <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> There must be a bug somewhere during node start: the node updates its distributed metastorage content and then tries to join an already activated cluster, thus creating a conflict. It's hard to tell the exact data that caused the conflict, especially without any logs.
>>>
>>> The topic that you mentioned (http://apache-ignite-users.70518.x6.nabble.com/Question-about-baseline-topology-and-cluster-activation-td34336.html) seems to be about the same problem, but the issue https://issues.apache.org/jira/browse/IGNITE-12850 is not related to it.
>>>
>>> If you have logs from those unsuccessful restart attempts, they would be very helpful.
>>>
>>> Sadly, the distributed metastorage is an internal component for storing settings and has no public documentation, and the developer documentation is probably outdated and incomplete. But just in case: the "version id" that the message refers to is stored in the field "org.apache.ignite.internal.processors.metastorage.persistence.DistributedMetaStorageImpl#ver", and it is incremented on every distributed metastorage setting update. You can find your error message in the same class.
>>>
>>> Please follow up with more questions and logs if possible, I hope we'll figure it out.
>>>
>>> Thank you!
>>>
>>> On Fri, Nov 13, 2020 at 2:23 AM Cong Guo <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a 3-node cluster with persistence enabled. All three nodes are in the baseline topology. The Ignite version is 2.8.1.
>>>>
>>>> When I restart the first node, it encounters an error and fails to join the cluster. The error message is "Caused by: org.apache.ignite.spi.IgniteSpiException: Attempting to join node with larger distributed metastorage version id. The node is most likely in invalid state and can't be joined." I try several times but get the same error.
>>>>
>>>> Then I restart the second node, and it encounters the same error. After I restart the third node, the other two nodes can start successfully and join the cluster. When I restart the nodes, I do not change the baseline topology. I cannot reproduce this error now.
>>>>
>>>> I find that someone else has the same problem:
>>>> http://apache-ignite-users.70518.x6.nabble.com/Question-about-baseline-topology-and-cluster-activation-td34336.html
>>>>
>>>> The answer there is corruption of the metastorage. I do not see any issue with the metastorage files, and it would be a very unlikely event to have files on two different machines corrupted at the same time. Is it possible that this is another bug like https://issues.apache.org/jira/browse/IGNITE-12850?
>>>>
>>>> Do you have any documentation about how the version id is updated and read? Could you please show me in the source code where the version id is read when a node starts and where the version id is updated when a node stops? Thank you!
>>>>
>>>
>>> --
>>> Sincerely yours,
>>> Ivan Bessonov
>>>
>>
>
> --
> Sincerely yours,
> Ivan Bessonov
>
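P.S. Just to make sure I understand the suggested workaround correctly: is it essentially the steps below? This is only a rough sketch against the 2.8.1 Java API (assuming the default, unnamed Ignite instance), not our actual code.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;

    // Run on any node that is still alive in the cluster.
    Ignite alive = Ignition.ignite();   // assumes the default (unnamed) instance

    // 1. Deactivate the whole cluster before the problematic node joins.
    alive.cluster().active(false);

    // 2. Restart the failing node now, so that it joins while the cluster is inactive.

    // 3. Activate the cluster again once the node has joined the topology.
    alive.cluster().active(true);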
