Hi, I have attached the log from the only working node, captured while the other two nodes were being restarted. There is no error message other than the "failed to join" message, and I do not see any clue in the log. I cannot reproduce this issue either; that's why I am asking about the code. Maybe you know of certain suspicious places. Thank you.
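[Editorial sketch: the instrumentation Ivan suggests further down amounts to printing the metastorage version id and the keys of each cached history entry at join time. Below is a minimal, standalone illustration of that output. The HistoryItem class is a hypothetical stand-in for Ignite's internal DistributedMetaStorageHistoryItem, so the formatting expression from Ivan's mail can be run as-is; it is not a drop-in patch for Ignite.]

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class MetastorageDebugSketch {
    /** Hypothetical stand-in for one distributed metastorage update. */
    static class HistoryItem {
        private final String[] keys;

        HistoryItem(String... keys) { this.keys = keys; }

        String[] keys() { return keys; }
    }

    public static void main(String[] args) {
        long verId = 3L; // stand-in for ver.id()

        // Example keys taken from the distributed properties listed
        // later in this thread.
        HistoryItem[] histCache = {
            new HistoryItem("baselineAutoAdjustEnabled"),
            new HistoryItem("baselineAutoAdjustTimeout"),
            new HistoryItem("sql.disabledFunctions")
        };

        // The exact expression from Ivan's mail, applied to the
        // stand-ins: one comma-separated entry per history item.
        String hist = Arrays.stream(histCache)
            .map(item -> Arrays.toString(item.keys()))
            .collect(Collectors.joining(","));

        System.out.println("ver.id()=" + verId + " histCache=" + hist);
    }
}
```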
On Wed, Nov 18, 2020 at 2:45 AM Ivan Bessonov <[email protected]> wrote:

> Sorry, I see that you use TcpDiscoverySpi.
>
> On Wed, Nov 18, 2020 at 10:44, Ivan Bessonov <[email protected]> wrote:
>
>> Hello,
>>
>> these parameters are configured automatically; I know that you don't
>> configure them. And since all of the "automatic" configuration is
>> complete, the chances of seeing the same bug again are low.
>>
>> Understanding the reason is tricky; we would need to debug the starting
>> node or at least add more logging. Is this possible? I see that you're
>> asking me about the code.
>>
>> Knowing the content of "ver" and "histCache.toArray()" in
>> "org.apache.ignite.internal.processors.metastorage.persistence.DistributedMetaStorageImpl#collectJoiningNodeData"
>> would certainly help. More specifically: ver.id() and
>> Arrays.stream(histCache.toArray()).map(item ->
>> Arrays.toString(item.keys())).collect(Collectors.joining(","))
>>
>> Honestly, I have no idea how your situation is even possible; otherwise
>> we would have found the solution rather quickly. Needless to say, I
>> can't reproduce it. The error message that you see was created for the
>> case when you join your node to the wrong cluster.
>>
>> Do you have any custom code that runs during node start? And one more
>> question: which discovery SPI are you using, TCP or ZooKeeper?
>>
>> On Wed, Nov 18, 2020 at 02:29, Cong Guo <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> The parameter values on the two other nodes are the same. Actually, I
>>> do not configure these values; when you enable native persistence, you
>>> see these log lines by default. Nothing is special. When this error
>>> occurs on the restarting node, nothing happens on the two other nodes.
>>> When I restart the second node, it also fails with the same error.
>>>
>>> I will still need to restart the nodes in the future, one by one,
>>> without stopping the service, so this issue may happen again. The
>>> workaround requires deactivating the cluster and stopping the service,
>>> which does not work in a production environment.
>>>
>>> I think we need to fix this bug, or at least understand the reason so
>>> that we can avoid it. Could you please tell me where this version value
>>> could be modified when a node has just started? Do you have any guess
>>> about this bug now? I can help analyze the code. Thank you.
>>>
>>> On Tue, Nov 17, 2020 at 4:09 AM Ivan Bessonov <[email protected]>
>>> wrote:
>>>
>>>> Thank you for the reply!
>>>>
>>>> Right now the only existing distributed properties I see are these:
>>>> - Baseline parameter 'baselineAutoAdjustEnabled' was changed from
>>>> 'null' to 'false'
>>>> - Baseline parameter 'baselineAutoAdjustTimeout' was changed from
>>>> 'null' to '300000'
>>>> - SQL parameter 'sql.disabledFunctions' was changed from 'null' to
>>>> '[FILE_WRITE, CANCEL_SESSION, MEMORY_USED, CSVREAD, LINK_SCHEMA,
>>>> MEMORY_FREE, FILE_READ, CSVWRITE, SESSION_ID, LOCK_MODE]'
>>>>
>>>> I wonder what values they have on the nodes that rejected the new
>>>> node; I suggest sending the logs of those nodes as well. Right now I
>>>> believe that this bug won't happen again on your installation, but
>>>> that only makes it more elusive...
>>>>
>>>> The most probable reason is that the node (somehow) initialized some
>>>> properties with defaults before joining the cluster, while the cluster
>>>> didn't have those values at all. The rule is that an activated cluster
>>>> can't accept changed properties from a joining node.
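[Editorial sketch: to make the rule just quoted concrete, here is a simplified illustration of the join-time check behind the error discussed in this thread. This is NOT the actual Ignite source (which lives in the internal class DistributedMetaStorageImpl); it only sketches the logic the thread describes: an active cluster rejects a joining node whose metastorage version id is larger than the cluster's, while a deactivated cluster accepts it, which is why Ivan's workaround below works.]

```java
public class JoinCheckSketch {

    static void validateJoiningNode(long clusterVerId,
                                    long joiningVerId,
                                    boolean clusterActive) {
        // On an active cluster the joining node must not bring
        // metastorage updates the cluster has never seen, i.e. its
        // version id may not exceed the cluster-wide one.
        if (clusterActive && joiningVerId > clusterVerId) {
            throw new IllegalStateException(
                "Attempting to join node with larger distributed " +
                "metastorage version id. The node is most likely in " +
                "invalid state and can't be joined.");
        }
    }

    public static void main(String[] args) {
        validateJoiningNode(5L, 5L, true);      // passes: versions match
        try {
            validateJoiningNode(5L, 7L, true);  // rejected: node is "ahead"
        } catch (IllegalStateException e) {
            System.out.println("Join rejected: " + e.getMessage());
        }
        validateJoiningNode(5L, 7L, false);     // passes: cluster inactive
    }
}
```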
>>>> So, the workaround would be deactivating the cluster, joining the
>>>> node, and activating it again. But as I said, I don't think that
>>>> you'll ever see this bug again.
>>>>
>>>> On Tue, Nov 17, 2020 at 07:34, Cong Guo <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Please find the attached log for a complete but failed reboot. You
>>>>> can see the exceptions.
>>>>>
>>>>> On Mon, Nov 16, 2020 at 4:00 AM Ivan Bessonov <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> there must be a bug somewhere during node start: the node updates
>>>>>> its distributed metastorage content and then tries to join an
>>>>>> already activated cluster, thus creating a conflict. It's hard to
>>>>>> tell exactly which data caused the conflict, especially without
>>>>>> any logs.
>>>>>>
>>>>>> The topic that you mentioned
>>>>>> (http://apache-ignite-users.70518.x6.nabble.com/Question-about-baseline-topology-and-cluster-activation-td34336.html)
>>>>>> seems to be about the same problem, but the issue
>>>>>> https://issues.apache.org/jira/browse/IGNITE-12850 is not related
>>>>>> to it.
>>>>>>
>>>>>> If you have logs from those unsuccessful restart attempts, they
>>>>>> would be very helpful.
>>>>>>
>>>>>> Sadly, the distributed metastorage is an internal component for
>>>>>> storing settings and has no public documentation, and the developer
>>>>>> documentation is probably outdated and incomplete. But just in
>>>>>> case: the "version id" that the message refers to is located in the
>>>>>> field
>>>>>> "org.apache.ignite.internal.processors.metastorage.persistence.DistributedMetaStorageImpl#ver"
>>>>>> and is incremented on every distributed metastorage setting update.
>>>>>> You can find your error message in the same class.
>>>>>>
>>>>>> Please follow up with more questions and logs if possible; I hope
>>>>>> we'll figure it out.
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> On Fri, Nov 13, 2020 at 02:23, Cong Guo <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have a 3-node cluster with persistence enabled. All three nodes
>>>>>>> are in the baseline topology. The Ignite version is 2.8.1.
>>>>>>>
>>>>>>> When I restart the first node, it encounters an error and fails to
>>>>>>> join the cluster. The error message is "Caused by:
>>>>>>> org.apache.ignite.spi.IgniteSpiException: Attempting to join node
>>>>>>> with larger distributed metastorage version id. The node is most
>>>>>>> likely in invalid state and can't be joined." I have tried several
>>>>>>> times but get the same error.
>>>>>>>
>>>>>>> Then I restart the second node, and it encounters the same error.
>>>>>>> After I restart the third node, the other two nodes can start
>>>>>>> successfully and join the cluster. I do not change the baseline
>>>>>>> topology when I restart the nodes. I cannot reproduce this error
>>>>>>> now.
>>>>>>>
>>>>>>> I found that someone else has had the same problem:
>>>>>>> http://apache-ignite-users.70518.x6.nabble.com/Question-about-baseline-topology-and-cluster-activation-td34336.html
>>>>>>>
>>>>>>> The answer there was corruption in the metastorage. I do not see
>>>>>>> any issue with the metastorage files, and it is very unlikely that
>>>>>>> files on two different machines would be corrupted at the same
>>>>>>> time. Is it possible that this is another bug like
>>>>>>> https://issues.apache.org/jira/browse/IGNITE-12850?
>>>>>>>
>>>>>>> Do you have any documentation about how the version id is updated
>>>>>>> and read?
>>>>>>> Could you please show me in the source code where the version id
>>>>>>> is read when a node starts and where it is updated when a node
>>>>>>> stops? Thank you!
>>>>>>
>>>>>> --
>>>>>> Sincerely yours,
>>>>>> Ivan Bessonov
>>>>
>>>> --
>>>> Sincerely yours,
>>>> Ivan Bessonov
>>
>> --
>> Sincerely yours,
>> Ivan Bessonov
>
> --
> Sincerely yours,
> Ivan Bessonov
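[Editorial sketch: the workaround Ivan describes in his Nov 17 mail (deactivate the cluster, join the node, reactivate) can be driven through the public cluster API available in Ignite 2.8.x. A minimal sketch, assuming `ignite` is a handle obtained on one of the live server nodes and that the stuck node itself is started out-of-band:]

```java
import org.apache.ignite.Ignite;

public class RejoinWorkaroundSketch {

    static void rejoinStuckNode(Ignite ignite) throws InterruptedException {
        // 1. Deactivate the cluster so the metastorage version check no
        //    longer rejects the joining node.
        ignite.cluster().active(false);

        // 2. Start the previously rejected node externally (systemctl,
        //    ignite.sh, etc.) and wait for it to appear in the topology.
        //    The sleep is a placeholder for real topology monitoring.
        Thread.sleep(60_000);

        // 3. Reactivate the cluster once the node is back.
        ignite.cluster().active(true);
    }
}
```

[As Cong points out above, deactivation blocks cache operations, so this is only practical inside a maintenance window, not as a rolling-restart procedure.]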
[Attachment: othernode.log]
