Hi, I have attached the log from the only working node, captured while the other two nodes were being restarted. There is no error message other than the "failed to join" message, and I do not see any clue in the log. I cannot reproduce this issue either; that's why I am asking about the code. Maybe you know of certain suspicious places. Thank you.
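[Editorial sketch: the instrumentation Ivan suggests further down amounts to printing the metastorage version id and the keys of each cached history entry at join time. Below is a minimal, standalone illustration of that output. The HistoryItem class is a hypothetical stand-in for Ignite's internal DistributedMetaStorageHistoryItem, so the formatting expression from Ivan's mail can be run as-is; it is not a drop-in patch for Ignite.]

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class MetastorageDebugSketch {
    /** Hypothetical stand-in for one distributed metastorage update. */
    static class HistoryItem {
        private final String[] keys;

        HistoryItem(String... keys) { this.keys = keys; }

        String[] keys() { return keys; }
    }

    public static void main(String[] args) {
        long verId = 3L; // stand-in for ver.id()

        // Example keys taken from the distributed properties listed
        // later in this thread.
        HistoryItem[] histCache = {
            new HistoryItem("baselineAutoAdjustEnabled"),
            new HistoryItem("baselineAutoAdjustTimeout"),
            new HistoryItem("sql.disabledFunctions")
        };

        // The exact expression from Ivan's mail, applied to the
        // stand-ins: one comma-separated entry per history item.
        String hist = Arrays.stream(histCache)
            .map(item -> Arrays.toString(item.keys()))
            .collect(Collectors.joining(","));

        System.out.println("ver.id()=" + verId + " histCache=" + hist);
    }
}
```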
On Wed, Nov 18, 2020 at 2:45 AM Ivan Bessonov <[email protected]> wrote:

> Sorry, I see that you use TcpDiscoverySpi.
>
> On Wed, Nov 18, 2020 at 10:44, Ivan Bessonov <[email protected]> wrote:
>
>> Hello,
>>
>> these parameters are configured automatically; I know that you don't
>> configure them. And since all of the "automatic" configuration is
>> complete, the chances of seeing the same bug again are low.
>>
>> Understanding the reason is tricky; we would need to debug the starting
>> node or at least add more logging. Is this possible? I see that you're
>> asking me about the code.
>>
>> Knowing the content of "ver" and "histCache.toArray()" in
>> "org.apache.ignite.internal.processors.metastorage.persistence.DistributedMetaStorageImpl#collectJoiningNodeData"
>> would certainly help. More specifically: ver.id() and
>> Arrays.stream(histCache.toArray()).map(item ->
>> Arrays.toString(item.keys())).collect(Collectors.joining(","))
>>
>> Honestly, I have no idea how your situation is even possible; otherwise
>> we would have found the solution rather quickly. Needless to say, I
>> can't reproduce it. The error message that you see was created for the
>> case when you join your node to the wrong cluster.
>>
>> Do you have any custom code that runs during node start? And one more
>> question: which discovery SPI are you using, TCP or ZooKeeper?
>>
>> On Wed, Nov 18, 2020 at 02:29, Cong Guo <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> The parameter values on the two other nodes are the same. Actually, I
>>> do not configure these values; when you enable native persistence, you
>>> see these log lines by default. Nothing is special. When this error
>>> occurs on the restarting node, nothing happens on the two other nodes.
>>> When I restart the second node, it also fails with the same error.
>>>
>>> I will still need to restart the nodes in the future, one by one,
>>> without stopping the service, so this issue may happen again. The
>>> workaround requires deactivating the cluster and stopping the service,
>>> which does not work in a production environment.
>>>
>>> I think we need to fix this bug, or at least understand the reason so
>>> that we can avoid it. Could you please tell me where this version value
>>> could be modified when a node has just started? Do you have any guess
>>> about this bug now? I can help analyze the code. Thank you.
>>>
>>> On Tue, Nov 17, 2020 at 4:09 AM Ivan Bessonov <[email protected]>
>>> wrote:
>>>
>>>> Thank you for the reply!
>>>>
>>>> Right now the only existing distributed properties I see are these:
>>>> - Baseline parameter 'baselineAutoAdjustEnabled' was changed from
>>>> 'null' to 'false'
>>>> - Baseline parameter 'baselineAutoAdjustTimeout' was changed from
>>>> 'null' to '300000'
>>>> - SQL parameter 'sql.disabledFunctions' was changed from 'null' to
>>>> '[FILE_WRITE, CANCEL_SESSION, MEMORY_USED, CSVREAD, LINK_SCHEMA,
>>>> MEMORY_FREE, FILE_READ, CSVWRITE, SESSION_ID, LOCK_MODE]'
>>>>
>>>> I wonder what values they have on the nodes that rejected the new
>>>> node; I suggest sending the logs of those nodes as well. Right now I
>>>> believe that this bug won't happen again on your installation, but
>>>> that only makes it more elusive...
>>>>
>>>> The most probable reason is that the node (somehow) initialized some
>>>> properties with defaults before joining the cluster, while the cluster
>>>> didn't have those values at all. The rule is that an activated cluster
>>>> can't accept changed properties from a joining node.
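[Editorial sketch: to make the rule just quoted concrete, here is a simplified illustration of the join-time check behind the error discussed in this thread. This is NOT the actual Ignite source (which lives in the internal class DistributedMetaStorageImpl); it only sketches the logic the thread describes: an active cluster rejects a joining node whose metastorage version id is larger than the cluster's, while a deactivated cluster accepts it, which is why Ivan's workaround below works.]

```java
public class JoinCheckSketch {

    static void validateJoiningNode(long clusterVerId,
                                    long joiningVerId,
                                    boolean clusterActive) {
        // On an active cluster the joining node must not bring
        // metastorage updates the cluster has never seen, i.e. its
        // version id may not exceed the cluster-wide one.
        if (clusterActive && joiningVerId > clusterVerId) {
            throw new IllegalStateException(
                "Attempting to join node with larger distributed " +
                "metastorage version id. The node is most likely in " +
                "invalid state and can't be joined.");
        }
    }

    public static void main(String[] args) {
        validateJoiningNode(5L, 5L, true);      // passes: versions match
        try {
            validateJoiningNode(5L, 7L, true);  // rejected: node is "ahead"
        } catch (IllegalStateException e) {
            System.out.println("Join rejected: " + e.getMessage());
        }
        validateJoiningNode(5L, 7L, false);     // passes: cluster inactive
    }
}
```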
>>>> So, the workaround would be deactivating the cluster, joining the
>>>> node, and activating it again. But as I said, I don't think that
>>>> you'll ever see this bug again.
>>>>
>>>> On Tue, Nov 17, 2020 at 07:34, Cong Guo <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Please find the attached log for a complete but failed reboot. You
>>>>> can see the exceptions.
>>>>>
>>>>> On Mon, Nov 16, 2020 at 4:00 AM Ivan Bessonov <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> there must be a bug somewhere during node start: the node updates
>>>>>> its distributed metastorage content and then tries to join an
>>>>>> already activated cluster, thus creating a conflict. It's hard to
>>>>>> tell exactly which data caused the conflict, especially without
>>>>>> any logs.
>>>>>>
>>>>>> The topic that you mentioned
>>>>>> (http://apache-ignite-users.70518.x6.nabble.com/Question-about-baseline-topology-and-cluster-activation-td34336.html)
>>>>>> seems to be about the same problem, but the issue
>>>>>> https://issues.apache.org/jira/browse/IGNITE-12850 is not related
>>>>>> to it.
>>>>>>
>>>>>> If you have logs from those unsuccessful restart attempts, they
>>>>>> would be very helpful.
>>>>>>
>>>>>> Sadly, the distributed metastorage is an internal component for
>>>>>> storing settings and has no public documentation, and the developer
>>>>>> documentation is probably outdated and incomplete. But just in
>>>>>> case: the "version id" that the message refers to is located in the
>>>>>> field
>>>>>> "org.apache.ignite.internal.processors.metastorage.persistence.DistributedMetaStorageImpl#ver"
>>>>>> and is incremented on every distributed metastorage setting update.
>>>>>> You can find your error message in the same class.
>>>>>>
>>>>>> Please follow up with more questions and logs if possible; I hope
>>>>>> we'll figure it out.
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> On Fri, Nov 13, 2020 at 02:23, Cong Guo <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have a 3-node cluster with persistence enabled. All three nodes
>>>>>>> are in the baseline topology. The Ignite version is 2.8.1.
>>>>>>>
>>>>>>> When I restart the first node, it encounters an error and fails to
>>>>>>> join the cluster. The error message is "Caused by:
>>>>>>> org.apache.ignite.spi.IgniteSpiException: Attempting to join node
>>>>>>> with larger distributed metastorage version id. The node is most
>>>>>>> likely in invalid state and can't be joined." I have tried several
>>>>>>> times but get the same error.
>>>>>>>
>>>>>>> Then I restart the second node, and it encounters the same error.
>>>>>>> After I restart the third node, the other two nodes can start
>>>>>>> successfully and join the cluster. I do not change the baseline
>>>>>>> topology when I restart the nodes. I cannot reproduce this error
>>>>>>> now.
>>>>>>>
>>>>>>> I found that someone else has had the same problem:
>>>>>>> http://apache-ignite-users.70518.x6.nabble.com/Question-about-baseline-topology-and-cluster-activation-td34336.html
>>>>>>>
>>>>>>> The answer there was corruption in the metastorage. I do not see
>>>>>>> any issue with the metastorage files, and it is very unlikely that
>>>>>>> files on two different machines would be corrupted at the same
>>>>>>> time. Is it possible that this is another bug like
>>>>>>> https://issues.apache.org/jira/browse/IGNITE-12850?
>>>>>>>
>>>>>>> Do you have any documentation about how the version id is updated
>>>>>>> and read?
>>>>>>> Could you please show me in the source code where the version id
>>>>>>> is read when a node starts and where it is updated when a node
>>>>>>> stops? Thank you!
>>>>>>
>>>>>> --
>>>>>> Sincerely yours,
>>>>>> Ivan Bessonov
>>>>
>>>> --
>>>> Sincerely yours,
>>>> Ivan Bessonov
>>
>> --
>> Sincerely yours,
>> Ivan Bessonov
>
> --
> Sincerely yours,
> Ivan Bessonov
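[Editorial sketch: the workaround Ivan describes in his Nov 17 mail (deactivate the cluster, join the node, reactivate) can be driven through the public cluster API available in Ignite 2.8.x. A minimal sketch, assuming `ignite` is a handle obtained on one of the live server nodes and that the stuck node itself is started out-of-band:]

```java
import org.apache.ignite.Ignite;

public class RejoinWorkaroundSketch {

    static void rejoinStuckNode(Ignite ignite) throws InterruptedException {
        // 1. Deactivate the cluster so the metastorage version check no
        //    longer rejects the joining node.
        ignite.cluster().active(false);

        // 2. Start the previously rejected node externally (systemctl,
        //    ignite.sh, etc.) and wait for it to appear in the topology.
        //    The sleep is a placeholder for real topology monitoring.
        Thread.sleep(60_000);

        // 3. Reactivate the cluster once the node is back.
        ignite.cluster().active(true);
    }
}
```

[As Cong points out above, deactivation blocks cache operations, so this is only practical inside a maintenance window, not as a rolling-restart procedure.]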
[Attachment: othernode.log]
