Re: Nodes failed to join the cluster after restarting

Ivan Bessonov Tue, 17 Nov 2020 01:09:46 -0800

Thank you for the reply!

Right now the only existing distributed properties I see are these:
- Baseline parameter 'baselineAutoAdjustEnabled' was changed from 'null' to
'false'
- Baseline parameter 'baselineAutoAdjustTimeout' was changed from 'null' to
'300000'
- SQL parameter 'sql.disabledFunctions' was changed from 'null' to
'[FILE_WRITE, CANCEL_SESSION, MEMORY_USED, CSVREAD, LINK_SCHEMA,
MEMORY_FREE, FILE_READ, CSVWRITE, SESSION_ID, LOCK_MODE]'


I wonder what values they have on nodes that rejected the new node. I
suggest sending logs of those nodes as well.
Right now I believe that this bug won't happen again on your installation,
but it only makes it more elusive...

The most probable reason is that the node (somehow) initialized some
properties with defaults before joining the cluster, while the cluster didn't
have those values at all.
The rule is that an activated cluster can't accept changed properties from a
joining node. So, the workaround would be deactivating the cluster, joining
the node and activating the cluster again. But as I said, I don't think that
you'll see this bug ever again.
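
To make the rule and the workaround concrete, here is a toy Java model. It is
not Ignite's code: the class and method names are hypothetical, and the real
check in DistributedMetaStorageImpl is more involved. It assumes only what is
described in this thread: every property update increments a version id, an
active cluster rejects a joining node whose version id is larger than its own,
and (my assumption, implied by the workaround) an inactive cluster accepts the
joining node's data.

```java
import java.util.HashMap;
import java.util.Map;

public class MetastorageJoinSketch {
    /** Hypothetical stand-in for one node's distributed metastorage state. */
    static final class Metastorage {
        long ver; // incremented on every property update
        final Map<String, String> props = new HashMap<>();

        void write(String key, String val) {
            props.put(key, val);
            ver++;
        }
    }

    /** Hypothetical stand-in for the cluster-side join check. */
    static final class Cluster {
        boolean active = true;
        final Metastorage state = new Metastorage();

        /** Returns true if the joining node is accepted. */
        boolean tryJoin(Metastorage joining) {
            if (joining.ver > state.ver) {
                // An active cluster can't accept changed properties.
                if (active)
                    return false;

                // Assumption: an inactive cluster adopts the newer data.
                state.props.putAll(joining.props);
                state.ver = joining.ver;
            }
            return true;
        }
    }

    public static void main(String[] args) {
        Cluster cluster = new Cluster();

        // The restarting node wrote a default before joining.
        Metastorage node = new Metastorage();
        node.write("baselineAutoAdjustEnabled", "false");

        System.out.println(cluster.tryJoin(node)); // false: rejected while active

        cluster.active = false;                    // workaround: deactivate,
        System.out.println(cluster.tryJoin(node)); // true: join accepted,
        cluster.active = true;                     // then activate again
    }
}
```

On a real 2.8.1 cluster the deactivate and activate steps would be done with
control.sh (--deactivate / --activate) or IgniteCluster#active(boolean), not
with anything from this sketch.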

On Tue, Nov 17, 2020 at 07:34, Cong Guo <[email protected]> wrote:

> Hi,
>
> Please find the attached log for a complete but failed reboot. You can see
> the exceptions.
>
> On Mon, Nov 16, 2020 at 4:00 AM Ivan Bessonov <[email protected]>
> wrote:
>
>> Hello,
>>
>> There must be a bug somewhere during node start: the node updates its
>> distributed metastorage content and then tries to join an already activated
>> cluster, thus creating a conflict. It's hard to tell exactly which data
>> caused the conflict, especially without any logs.
>>
>> Topic that you mentioned (
>> http://apache-ignite-users.70518.x6.nabble.com/Question-about-baseline-topology-and-cluster-activation-td34336.html)
>> seems to be about the same problem, but the issue
>> https://issues.apache.org/jira/browse/IGNITE-12850 is not related to it.
>>
>> If you have logs from those unsuccessful restart attempts, it would be
>> very helpful.
>>
>> Sadly, distributed metastorage is an internal component for storing settings
>> and has no public documentation. The developer documentation is probably
>> outdated and incomplete. But just in case: the "version id" that the message
>> refers to is located in the field
>> "org.apache.ignite.internal.processors.metastorage.persistence.DistributedMetaStorageImpl#ver";
>> it's incremented on every distributed metastorage setting update. You can
>> find your error message in the same class.
>>
>> Please follow up with more questions and logs if possible. I hope we'll
>> figure it out.
>>
>> Thank you!
>>
>> On Fri, Nov 13, 2020 at 02:23, Cong Guo <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have a 3-node cluster with persistence enabled. All the three nodes
>>> are in the baseline topology. The ignite version is 2.8.1.
>>>
>>> When I restart the first node, it encounters an error and fails to join
>>> the cluster. The error message is "Caused by:
>>> org.apache.ignite.spi.IgniteSpiException: Attempting to join node with larger
>>> distributed metastorage version id. The node is most likely in invalid
>>> state and can't be joined." I try several times but get the same error.
>>>
>>> Then I restart the second node, it encounters the same error. After I
>>> restart the third node, the other two nodes can start successfully and join
>>> the cluster. When I restart the nodes, I do not change the baseline
>>> topology. I cannot reproduce this error now.
>>>
>>> I find someone else has the same problem.
>>> http://apache-ignite-users.70518.x6.nabble.com/Question-about-baseline-topology-and-cluster-activation-td34336.html
>>>
>>> The answer is corruption in the metastorage. I do not see any issue with
>>> the metastorage files. However, it is a small probability event to have
>>> files on two different machines corrupted at the same time. Is it possible
>>> that this is another bug like
>>> https://issues.apache.org/jira/browse/IGNITE-12850?
>>>
>>> Do you have any document about how the version id is updated and read?
>>> Could you please show me in the source code where the version id is read
>>> when a node starts and where the version id is updated when a node stops?
>>> Thank you!
>>>
>>>
>>>
>>
>> --
>> Sincerely yours,
>> Ivan Bessonov
>>
>

-- 
Sincerely yours,
Ivan Bessonov
