[tickets] [opensaf:tickets] #2030 dtm: "Node already exit in the cluster with smiler configuration"
- **status**: review --> fixed
- **Comment**:

commit 4ca20e2caf15e22754af01ddecd01c1ea7413ccf
Author: Alex Jones
Date: Thu Aug 10 12:45:24 2017 -0400

---

** [tickets:#2030] dtm: "Node already exit in the cluster with smiler configuration"**

**Status:** fixed
**Milestone:** 5.17.10
**Created:** Tue Sep 13, 2016 12:10 PM UTC by Anders Widell
**Last Updated:** Tue Aug 08, 2017 06:13 PM UTC
**Owner:** Alex Jones

osafdtm does not handle rapid consecutive node reboots properly. I got the following errors in syslog:

~~~
Sep 13 14:00:52 SC-2 local0.err osafdtmd[378]: ER DTM: Node already exit in the cluster with smiler configuration , correct the other joining Node configuration
Sep 13 14:01:02 SC-2 local0.err osafdtmd[378]: ER DTM: dtm_node_add failed .node_ip: 192.168.0.1, node_id: 0
Sep 13 14:01:06 SC-2 local0.err osafdtmd[378]: ER DTM: dtm_node_add failed .node_ip: 192.168.0.1, node_id: 0
~~~

Here are the steps to reproduce this problem in UML:

./opensaf start
(wait until the cluster comes up)
./opensaf nodestop 2
(wait a few seconds)
./opensaf nodestart 2
./opensaf nodestart 2

The last two commands should be executed in quick succession, with perhaps a one-second delay between them.

It seems that osafdtmd asserts and dies when this happens. Here is the result from a second run of the above test:

~~~
Sep 13 14:25:58 SC-2 local0.err osafdtmd[378]: ER DTM: Node already exit in the cluster with smiler configuration , correct the other joining Node configuration
Sep 13 14:25:58 SC-2 local0.err osafdtmd[378]: dtm_node.c:109: dtm_process_node_info: Assertion '0' failed.
Sep 13 14:25:58 SC-2 local0.err osafamfd[478]: MDTM:SOCKET recd_bytes :0, conn lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.err osafclmna[468]: MDTM:SOCKET recd_bytes :0, conn lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.err osafclmd[458]: MDTM:SOCKET recd_bytes :0, conn lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.err osafntfd[448]: MDTM:SOCKET recd_bytes :0, conn lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.err osaflogd[437]: MDTM:SOCKET recd_bytes :0, conn lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.err osafimmnd[426]: MDTM:SOCKET recd_bytes :0, conn lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.err osafimmd[415]: MDTM:SOCKET recd_bytes :0, conn lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.err osaffmd[405]: MDTM:SOCKET recd_bytes :0, conn lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.err osafrded[392]: MDTM:SOCKET recd_bytes :0, conn lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.notice osafdtmd[378]: NO Established contact with 'SC-1'
Sep 13 14:25:58 SC-2 local0.notice osafdtmd[378]: NO Established contact with 'PL-4'
Sep 13 14:25:58 SC-2 local0.notice osafdtmd[378]: NO Established contact with 'PL-5'
Sep 13 14:25:58 SC-2 local0.notice osafdtmd[378]: NO Established contact with 'PL-3'
Sep 13 14:25:59 SC-2 user.notice osafdtmd: osafdtmd Process down, Rebooting the node
Sep 13 14:25:59 SC-2 user.notice opensaf_reboot: Rebooting local node; timeout=60
~~~

Update: it seems I forgot to do "./opensaf nodestop" between the two "./opensaf nodestart" above. Thus, there are probably two SC-2 nodes at the same time, and the error message "Node already exit in the cluster with smiler configuration" should be interpreted as "duplicate node detected in the network". Reducing the priority of this defect to "minor". Two problems still ought to be fixed: the error message should be changed so that its meaning is clear, and osafdtmd should not assert (it could call opensaf_reboot() if there is a configuration problem, but asserting indicates a software problem).

---

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at https://sourceforge.net/p/opensaf/admin/tickets/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.

___
Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets
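
The request in the ticket description (treat a duplicate node as a configuration error with a clear message, rather than asserting) could be sketched roughly as below. This is only an illustration under stated assumptions: `node_table_t`, `node_entry_t`, `node_add_result_t`, `node_table_add` and the log wording are hypothetical names invented for this sketch and are not the actual DTM code or the change made in the commit above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified node record; NOT the real DTM data structure. */
#define MAX_NODES 8

typedef struct {
    uint32_t node_id;
    char node_ip[46];   /* large enough for a textual IPv6 address */
    bool in_use;
} node_entry_t;

typedef struct {
    node_entry_t entries[MAX_NODES];
} node_table_t;

/* Result codes let the caller choose a recovery action instead of aborting. */
typedef enum {
    NODE_ADD_OK = 0,
    NODE_ADD_DUPLICATE,   /* configuration problem: same id or IP already present */
    NODE_ADD_TABLE_FULL
} node_add_result_t;

/* Add a node, reporting a duplicate as an error return rather than assert(0). */
node_add_result_t node_table_add(node_table_t *table, uint32_t node_id,
                                 const char *node_ip)
{
    node_entry_t *free_slot = NULL;

    for (size_t i = 0; i < MAX_NODES; i++) {
        node_entry_t *entry = &table->entries[i];

        if (!entry->in_use) {
            if (free_slot == NULL)
                free_slot = entry;   /* remember the first free slot */
            continue;
        }
        if (entry->node_id == node_id ||
            strcmp(entry->node_ip, node_ip) == 0) {
            /* Hypothetical wording; says plainly what was detected. */
            fprintf(stderr,
                    "ER duplicate node detected in the network: "
                    "node_id=%u node_ip=%s; check the configuration "
                    "of the joining node\n",
                    (unsigned)node_id, node_ip);
            return NODE_ADD_DUPLICATE;
        }
    }

    if (free_slot == NULL)
        return NODE_ADD_TABLE_FULL;

    free_slot->node_id = node_id;
    snprintf(free_slot->node_ip, sizeof(free_slot->node_ip), "%s", node_ip);
    free_slot->in_use = true;
    return NODE_ADD_OK;
}
```

With a return code like this, the caller can decide how to react to `NODE_ADD_DUPLICATE`, for example by rejecting the connection or requesting an orderly reboot, so that a configuration mistake is not reported as a failed assertion.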
[tickets] [opensaf:tickets] #2030 dtm: "Node already exit in the cluster with smiler configuration"
- **status**: assigned --> review
- **assigned_to**: A V Mahesh (AVM) --> Alex Jones
- **Part**: - --> d
- **Blocker**: --> False
- **Milestone**: future --> 5.17.10

---

** [tickets:#2030] dtm: "Node already exit in the cluster with smiler configuration"**

**Status:** review
**Milestone:** 5.17.10
**Created:** Tue Sep 13, 2016 12:10 PM UTC by Anders Widell
**Last Updated:** Tue Aug 08, 2017 04:18 PM UTC
**Owner:** Alex Jones
[tickets] [opensaf:tickets] #2030 dtm: "Node already exit in the cluster with smiler configuration"
I can reproduce this assertion as outlined in ticket 2545.

---

** [tickets:#2030] dtm: "Node already exit in the cluster with smiler configuration"**

**Status:** assigned
**Milestone:** future
**Created:** Tue Sep 13, 2016 12:10 PM UTC by Anders Widell
**Last Updated:** Thu Mar 16, 2017 04:23 AM UTC
**Owner:** A V Mahesh (AVM)
[tickets] [opensaf:tickets] #2030 dtm: "Node already exit in the cluster with smiler configuration"
Under normal conditions we are not able to reproduce the problem by running `/etc/init.d/opensafd restart`, so could you please provide the following information to help reproduce it:

1) Can you please share or elaborate on what the "./opensaf nodestop" and "./opensaf nodestart" scripts do apart from `/etc/init.d/opensafd stop` and `/etc/init.d/opensafd restart`?
2) Is there any other non-OpenSAF application using the MDS/TCP library? If so, is it stopped cleanly before `/etc/init.d/opensafd stop`?

---

** [tickets:#2030] dtm: "Node already exit in the cluster with smiler configuration"**

**Status:** assigned
**Milestone:** 5.0.2
**Created:** Tue Sep 13, 2016 12:10 PM UTC by Anders Widell
**Last Updated:** Mon Sep 26, 2016 02:26 PM UTC
**Owner:** A V Mahesh (AVM)
[tickets] [opensaf:tickets] #2030 dtm: "Node already exit in the cluster with smiler configuration"
- **status**: unassigned --> assigned
- **assigned_to**: A V Mahesh (AVM)

---

** [tickets:#2030] dtm: "Node already exit in the cluster with smiler configuration"**

**Status:** assigned
**Milestone:** 4.7.2
**Created:** Tue Sep 13, 2016 12:10 PM UTC by Anders Widell
**Last Updated:** Tue Sep 13, 2016 01:01 PM UTC
**Owner:** A V Mahesh (AVM)
[tickets] [opensaf:tickets] #2030 dtm: "Node already exit in the cluster with smiler configuration"
- Description has changed:

Diff:

--- old
+++ new
@@ -39,3 +39,6 @@
 Sep 13 14:25:59 SC-2 user.notice opensaf_reboot: Rebooting local node; timeout=60
 
 ~~~
+
+Update: it seems I forgot to do "./opensaf nodestop" between the two "./opensaf nodestart" above. Thus, there are probably two SC-2 nodes at the same time, and the error message "Node already exit in the cluster with smiler configuration" should be interpreted as "duplicate node detected in the network". Reducing the priority of this defect to "minor". Still two problems ought to be fixed: the error message should be changed so that it is clear what it means, and osafdtmd should not assert (it could call opensaf_reboot() if a there is a configuration problem, but asserting idicates a software problem).
+

- **Priority**: major --> minor

---

** [tickets:#2030] dtm: "Node already exit in the cluster with smiler configuration"**

**Status:** unassigned
**Milestone:** 4.7.2
**Created:** Tue Sep 13, 2016 12:10 PM UTC by Anders Widell
**Last Updated:** Tue Sep 13, 2016 12:30 PM UTC
**Owner:** nobody
[tickets] [opensaf:tickets] #2030 dtm: "Node already exit in the cluster with smiler configuration"
- Description has changed:

Diff:

--- old
+++ new
@@ -1,9 +1,9 @@
 osafdtm does not handle rapid consecutive node reboots properly. I got the following errors in syslog:
 
 ~~~
-var/SC-2/log/messages:Sep 13 14:00:52 SC-2 local0.err osafdtmd[378]: ER DTM: Node already exit in the cluster with smiler configuration , correct the other joining Node configuration
-var/SC-2/log/messages:Sep 13 14:01:02 SC-2 local0.err osafdtmd[378]: ER DTM: dtm_node_add failed .node_ip: 192.168.0.1, node_id: 0
-var/SC-2/log/messages:Sep 13 14:01:06 SC-2 local0.err osafdtmd[378]: ER DTM: dtm_node_add failed .node_ip: 192.168.0.1, node_id: 0
+Sep 13 14:00:52 SC-2 local0.err osafdtmd[378]: ER DTM: Node already exit in the cluster with smiler configuration , correct the other joining Node configuration
+Sep 13 14:01:02 SC-2 local0.err osafdtmd[378]: ER DTM: dtm_node_add failed .node_ip: 192.168.0.1, node_id: 0
+Sep 13 14:01:06 SC-2 local0.err osafdtmd[378]: ER DTM: dtm_node_add failed .node_ip: 192.168.0.1, node_id: 0
 ~~~
 
 Here are the steps to reproduce this problem in UML:
@@ -16,3 +16,26 @@
 ./opensaf nodestart 2
 
 The last two commands should be execute quickly after each other, maybe with one second delay in between them.
+
+It seems that osafdtmd asserts and dies when this happens. Here is the result from a second run of the above test:
+
+~~~
+Sep 13 14:25:58 SC-2 local0.err osafdtmd[378]: ER DTM: Node already exit in the cluster with smiler configuration , correct the other joining Node configuration
+Sep 13 14:25:58 SC-2 local0.err osafdtmd[378]: dtm_node.c:109: dtm_process_node_info: Assertion '0' failed.
+Sep 13 14:25:58 SC-2 local0.err osafamfd[478]: MDTM:SOCKET recd_bytes :0, conn lost with dh server, exiting library err :Success
+Sep 13 14:25:58 SC-2 local0.err osafclmna[468]: MDTM:SOCKET recd_bytes :0, conn lost with dh server, exiting library err :Success
+Sep 13 14:25:58 SC-2 local0.err osafclmd[458]: MDTM:SOCKET recd_bytes :0, conn lost with dh server, exiting library err :Success
+Sep 13 14:25:58 SC-2 local0.err osafntfd[448]: MDTM:SOCKET recd_bytes :0, conn lost with dh server, exiting library err :Success
+Sep 13 14:25:58 SC-2 local0.err osaflogd[437]: MDTM:SOCKET recd_bytes :0, conn lost with dh server, exiting library err :Success
+Sep 13 14:25:58 SC-2 local0.err osafimmnd[426]: MDTM:SOCKET recd_bytes :0, conn lost with dh server, exiting library err :Success
+Sep 13 14:25:58 SC-2 local0.err osafimmd[415]: MDTM:SOCKET recd_bytes :0, conn lost with dh server, exiting library err :Success
+Sep 13 14:25:58 SC-2 local0.err osaffmd[405]: MDTM:SOCKET recd_bytes :0, conn lost with dh server, exiting library err :Success
+Sep 13 14:25:58 SC-2 local0.err osafrded[392]: MDTM:SOCKET recd_bytes :0, conn lost with dh server, exiting library err :Success
+Sep 13 14:25:58 SC-2 local0.notice osafdtmd[378]: NO Established contact with 'SC-1'
+Sep 13 14:25:58 SC-2 local0.notice osafdtmd[378]: NO Established contact with 'PL-4'
+Sep 13 14:25:58 SC-2 local0.notice osafdtmd[378]: NO Established contact with 'PL-5'
+Sep 13 14:25:58 SC-2 local0.notice osafdtmd[378]: NO Established contact with 'PL-3'
+Sep 13 14:25:59 SC-2 user.notice osafdtmd: osafdtmd Process down, Rebooting the node
+Sep 13 14:25:59 SC-2 user.notice opensaf_reboot: Rebooting local node; timeout=60
+
+~~~

---

** [tickets:#2030] dtm: "Node already exit in the cluster with smiler configuration"**

**Status:** unassigned
**Milestone:** 4.7.2
**Created:** Tue Sep 13, 2016 12:10 PM UTC by Anders Widell
**Last Updated:** Tue Sep 13, 2016 12:17 PM UTC
**Owner:** nobody
[tickets] [opensaf:tickets] #2030 dtm: "Node already exit in the cluster with smiler configuration"
Needless to say, the error message itself is also faulty here. I suppose "exit" should be "exists", and "smiler" should be "similar"? I am just guessing... :-)

---

** [tickets:#2030] dtm: "Node already exit in the cluster with smiler configuration"**

**Status:** unassigned
**Milestone:** 4.7.2
**Created:** Tue Sep 13, 2016 12:10 PM UTC by Anders Widell
**Last Updated:** Tue Sep 13, 2016 12:10 PM UTC
**Owner:** nobody

osafdtm does not handle rapid consecutive node reboots properly. I got the following errors in syslog:

~~~
var/SC-2/log/messages:Sep 13 14:00:52 SC-2 local0.err osafdtmd[378]: ER DTM: Node already exit in the cluster with smiler configuration , correct the other joining Node configuration
var/SC-2/log/messages:Sep 13 14:01:02 SC-2 local0.err osafdtmd[378]: ER DTM: dtm_node_add failed .node_ip: 192.168.0.1, node_id: 0
var/SC-2/log/messages:Sep 13 14:01:06 SC-2 local0.err osafdtmd[378]: ER DTM: dtm_node_add failed .node_ip: 192.168.0.1, node_id: 0
~~~

Here are the steps to reproduce this problem in UML:

./opensaf start
(wait until the cluster comes up)
./opensaf nodestop 2
(wait a few seconds)
./opensaf nodestart 2
./opensaf nodestart 2

The last two commands should be executed in quick succession, with perhaps a one-second delay between them.

---

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at https://sourceforge.net/p/opensaf/admin/tickets/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.

--
___
Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets
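The reproduction steps quoted in the ticket can be sketched as a small shell script. This is only a sketch and not part of the report: the `run` helper, the `DRY_RUN` and `OPENSAF_CMD` variables, and the concrete wait durations are assumptions made for illustration. Only the `./opensaf` subcommands and their order are taken from the ticket. By default the script prints the commands without executing them.

```shell
#!/bin/sh
# Sketch of the ticket's reproduction steps (to be run from the UML directory).
# Assumptions, not from the ticket: OPENSAF_CMD, DRY_RUN, and the sleep lengths.
OPENSAF_CMD="${OPENSAF_CMD:-./opensaf}"
DRY_RUN="${DRY_RUN:-1}"

# Print a command; execute it only when DRY_RUN is not 1.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

reproduce() {
    run "$OPENSAF_CMD" start
    run sleep 60                    # "wait until the cluster comes up" (duration assumed)
    run "$OPENSAF_CMD" nodestop 2
    run sleep 5                     # "wait a few seconds" (duration assumed)
    run "$OPENSAF_CMD" nodestart 2
    run sleep 1                     # roughly one second between the two starts
    run "$OPENSAF_CMD" nodestart 2  # second start, no nodestop in between
}

reproduce
```

Setting `DRY_RUN=0` would execute the commands for real. The second `nodestart` without an intervening `nodestop` follows the steps exactly as listed in the ticket.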