[tickets] [opensaf:tickets] #2419 smf: when fixing ticket #2145 a NBC problem was introduced
> This means that it was for example possible to complete a campaign even if a
> component failed to start and fix this problem after committing. No action to
> resume the campaign was needed.

Just playing devil's advocate here... If the admin was prepared to restart a component after the campaign was committed, why is it a big deal to do this during the campaign?

SMF already supports the "suspended by error detected" state for other problems (e.g. a node failing to come back after reboot in a rolling upgrade w/reboot campaign). So it is already possible for the admin to restart a campaign before it has been committed.

---

** [tickets:#2419] smf: when fixing ticket #2145 a NBC problem was introduced**

**Status:** unassigned
**Milestone:** 5.2.0
**Created:** Mon Apr 10, 2017 11:11 AM UTC by elunlen
**Last Updated:** Mon Apr 10, 2017 11:18 AM UTC
**Owner:** nobody

Previous behavior:
The behavior was to ignore a failure to activate a component unless a secondary fault occurred. This means that it was, for example, possible to complete a campaign even if a component failed to start, and to fix this problem after committing. No action to resume the campaign was needed.

After [#2145]:
The campaign will always suspend in case of a component failure, and a resume must be requested for the campaign to continue.

NBC:
The behavior has changed in such a way that it must be seen as NBC. The #2145 ticket corrects SMF behavior with regard to AIS, but it is still NBC since the previous behavior is the legacy behavior in previous releases.

Proposal 1: Fix if the setting does not need to be changed at runtime, e.g. during an upgrade

Add a new configuration attribute to the SMF configuration class that makes it possible to select whether the behavior after #2145 shall be used or not. The default setting must be the previous behavior.
The setting must have the following properties:

- If the attribute does not exist (old model) -> legacy behavior
- If the attribute value is not changed from default -> legacy behavior
- If the attribute value is empty or invalid -> legacy behavior
- If the attribute value is a valid “ON” setting -> new behavior
- A request to change the attribute at runtime shall always be rejected

Proposal 2: Fix if the setting has to be changed during an upgrade

Add a new configuration attribute to the SMF configuration class that makes it possible to select whether the behavior after #2145 shall be used or not. The default setting must be the previous behavior.
The setting must have the following properties:

- If the attribute does not exist (old model) -> legacy behavior
- If the attribute value is not changed from default -> legacy behavior
- If the attribute value is empty or invalid -> legacy behavior
- If the attribute value is a valid “ON” setting -> new behavior
- It must be possible to change the attribute value at runtime in the “idle” state (no campaign is executing)
- It must be possible to change the attribute value at runtime in the campaign init state. Note that if it is changed here, the new setting must be used for the rest of the campaign

---

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at https://sourceforge.net/p/opensaf/admin/tickets/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.

--
Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets
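The attribute rules listed in the two #2419 proposals can be sketched as a small decision function. This is only an illustration of the rules as stated in the ticket, not SMF code: the attribute name `exampleSuspendOnCompFail`, the integer "ON" value 1, and the state names `"idle"`, `"init"` and `"executing"` are all assumptions made for the example.

```python
from enum import Enum


class Behavior(Enum):
    LEGACY = "legacy"        # pre-#2145: a component activation failure is ignored
    SUSPEND_ON_FAIL = "new"  # post-#2145: campaign suspends and must be resumed


# Hypothetical attribute name and "ON" value; the ticket does not define them.
ATTR_NAME = "exampleSuspendOnCompFail"
ON_VALUE = 1


def resolve_behavior(smf_config: dict) -> Behavior:
    """Map the SMF configuration object to a behavior, defaulting to legacy."""
    value = smf_config.get(ATTR_NAME)  # None when the attribute is missing (old model)
    # Only an exact, valid ON value selects the new behavior. A missing,
    # empty, default, or invalid value falls back to the legacy behavior.
    if type(value) is int and value == ON_VALUE:
        return Behavior.SUSPEND_ON_FAIL
    return Behavior.LEGACY


def runtime_change_allowed(proposal: int, campaign_state: str) -> bool:
    """Decide whether a runtime change of the attribute is accepted."""
    if proposal == 1:
        # Proposal 1: a runtime change request is always rejected.
        return False
    # Proposal 2: allowed when no campaign executes, or in the campaign init state.
    return campaign_state in ("idle", "init")
```

With this sketch, `resolve_behavior({})` and `resolve_behavior({ATTR_NAME: "bogus"})` both give `Behavior.LEGACY`, which matches the requirement that an old model or an invalid value never changes existing behavior.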
[tickets] [opensaf:tickets] #2100 Standby should not be rebooted, for SC absence configuration mismatch
IMMSV_SC_ABSENCE_ALLOWED cannot be moved to imm.xml or PBE. IMMSV_SC_ABSENCE_ALLOWED, as well as IMMSV_SC_ABSENCE_VETERAN_MAX_WAIT, is required by IMMD, which starts before IMMND.

NID should decide whether or not a node should be restarted after a failed IMMD start.

---

** [tickets:#2100] Standby should not be rebooted, for SC absence configuration mismatch**

**Status:** unassigned
**Milestone:** future
**Created:** Fri Oct 07, 2016 07:11 AM UTC by Srikanth R
**Last Updated:** Thu Mar 30, 2017 04:51 AM UTC
**Owner:** nobody

Changeset : 8190 5.1.GA

-> Initially brought up opensaf on SC-1 with the "SC ABSENCE" feature enabled in immd.conf.
-> On SC-2, the "SC ABSENCE" feature is not enabled in immd.conf. When opensafd was started on SC-2, the node rebooted.

Oct 7 17:58:27 SLES-SLOT2 osafimmd[3615]: ER SC absence allowed in not the same as on active IMMD. Active: 900, Standby: 0. Exiting.
Oct 7 17:58:27 SLES-SLOT2 osafamfnd[3676]: NO 'safComp=IMMD,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
Oct 7 17:58:27 SLES-SLOT2 osafamfnd[3676]: ER safComp=IMMD,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
Oct 7 17:58:27 SLES-SLOT2 osafamfnd[3676]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131599, SupervisionTime = 60

Here the user had configured the two controllers inconsistently, which caused the standby to reboot. Because opensafd is enabled in the runlevel as part of installation, the standby will reboot continuously until opensafd is stopped on SC-1.

Suggested behavior: opensafd should refuse to start on the standby instead of rebooting immediately.

Also, cluster-level attributes like IMMSV_SC_ABSENCE_ALLOWED can be moved to imm.xml. Node-level attributes, such as trace enabling, can be retained in the configuration files.
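The suggested behavior in #2100 amounts to comparing the SC absence value on both controllers and refusing to start on a mismatch rather than letting the node fail fast. A rough sketch of that comparison follows. It assumes immd.conf uses shell-style `KEY=VALUE` lines (optionally prefixed with `export`) and that a missing setting counts as 0, as the "Standby: 0" in the log above suggests. The function names and return strings are hypothetical and not part of OpenSAF.

```python
def read_sc_absence(conf_text: str) -> int:
    """Return IMMSV_SC_ABSENCE_ALLOWED from config text, or 0 if unset."""
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip comments and anything that is not an assignment.
        if line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        if key.strip() == "IMMSV_SC_ABSENCE_ALLOWED":
            try:
                return int(value.strip())
            except ValueError:
                return 0  # treat an unparsable value as "feature disabled"
    return 0


def standby_start_decision(active_value: int, standby_value: int) -> str:
    """Hypothetical policy: refuse to start instead of rebooting on mismatch."""
    if active_value == standby_value:
        return "start"
    return "refuse_start"
```

For the values in the log (`Active: 900, Standby: 0`), `standby_start_decision(900, 0)` yields `"refuse_start"`, so the standby would stay down until the configuration is corrected instead of entering a reboot loop.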
[tickets] [opensaf:tickets] #2419 smf: when fixing ticket #2145 a NBC problem was introduced
Reference to earlier mail conversation about this:
http://sourceforge.net/p/opensaf/mailman/message/35490515/

---

** [tickets:#2419] smf: when fixing ticket #2145 a NBC problem was introduced**

**Status:** unassigned
**Milestone:** 5.2.0
**Created:** Mon Apr 10, 2017 11:11 AM UTC by elunlen
**Last Updated:** Mon Apr 10, 2017 11:11 AM UTC
**Owner:** nobody
[tickets] [opensaf:tickets] #2418 imm: Info of dead IMMND remains in standby IMMD
---

** [tickets:#2418] imm: Info of dead IMMND remains in standby IMMD**

**Status:** accepted
**Milestone:** 5.0.2
**Created:** Mon Apr 10, 2017 10:23 AM UTC by Hung Nguyen
**Last Updated:** Mon Apr 10, 2017 10:23 AM UTC
**Owner:** Hung Nguyen
**Attachments:**

- [log.tgz](https://sourceforge.net/p/opensaf/tickets/2418/attachment/log.tgz) (149.4 kB; application/x-compressed)

When the Standby IMMD comes up at the same time as an IMMND is exiting, the info of that IMMND might not be removed from the **immnd_tree** of the Standby IMMD.

Details of the problem are explained in the sequence diagram below:
[sequence diagram](http://sequencediagram.org/index.html?initialData=A4QwTgLglgxloDsIAICCBhAKgWgJIFl8ARAKFElnhCWQGVMAhPQ0kkAIwHsAPZTgNwCmYOo2bFkAYjCCAJgC5kRAPIB1AHLJBQmgDMwnALbIC+dUT4JkCTrMHIAGiRJdeA4aKamii3AigoxLQAOgh+Acj4DOjI1LLIAM6CgdHIBgA29hCcdBBx7ACezvReLMjYAHxoWOIW8uic6bIJBQgwaYIAjgCuggkQziQYON7lVSW1ig1NLW0dCcCcCEmhEAAW9qbmyOlQ-chQbenddgnI65uE20vWtvZOzhw8fEIiw7VSMgpKapragnoDMYthYbjY7I5nK4Xh53t4ADQTbyKTAbExXCx7DqGdzxfRGaojMrsbooGSGECHM6HTy1IA)

SC-5 was Active, SC-2 was Standby, and the IMMND on SC-1 was exiting.

~~~
18:35:03 SC-1 osafimmnd[441]: exiting for shutdown
18:35:03 SC-2 osafrded[413]: NO RDE role set to STANDBY
18:35:03 SC-2 osafimmd[430]: NO MDS event from svc_id 25 (change:3, dest:568511936070075)
18:35:03 SC-2 osafimmd[430]: NO MDS event from svc_id 25 (change:3, dest:567412424442298)
18:35:03 SC-2 osafimmd[430]: NO MDS event from svc_id 25 (change:3, dest:566312912814523)
18:35:03 SC-2 osafimmd[430]: NO MDS event from svc_id 25 (change:3, dest:565213401186744)
18:35:03 SC-5 osafimmd[433]: NO MDS event from svc_id 25 (change:4, dest:564113889558969)
~~~

The down event for IMMND@SC-1 was received on SC-5 but not on SC-2.

**The symptoms:**

1. If the IMMND that went down is the coordinator, then when that Standby IMMD becomes Active it fails to elect a new coordinator, because there is already a coordinator in the **immnd_tree**.

~~~
18:35:11 SC-2 osafimmd[430]: WA IMMND coordinator at 2050f apparently crashed => electing new coord
~~~

No more logs about a newly elected coordinator were printed out.

2. When IMMND@SC-1 is up again, it fails to introduce itself to the IMMD because the IMMD already has IMMND@SC-1 in the **immnd_tree** with a wrong epoch.

~~~
18:35:29 SC-1 osafimmnd[441]: NO SERVER STATE: IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
18:35:29 SC-1 osafimmnd[441]: NO This IMMND is now the NEW Coord
18:35:29 SC-1 osafimmnd[441]: ER 3 > 0, exiting
~~~
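Symptom 1 can be illustrated with a toy model of the standby's bookkeeping. This is not the IMMD implementation: the class, the dictionary layout, and the "lowest node name wins" election rule are assumptions chosen only to show how a stale coordinator entry blocks a new election.

```python
class ToyStandbyImmd:
    """Toy model of a standby IMMD that tracks IMMNDs in a dictionary."""

    def __init__(self):
        # node name -> {"coord": bool, "epoch": int}; stands in for immnd_tree
        self.immnd_tree = {}

    def immnd_up(self, node, epoch, coord=False):
        self.immnd_tree[node] = {"coord": coord, "epoch": epoch}

    def immnd_down(self, node):
        # If this event is never delivered, the entry for `node` stays behind.
        self.immnd_tree.pop(node, None)

    def elect_coordinator(self):
        # An existing coordinator entry short-circuits the election.
        for node, info in self.immnd_tree.items():
            if info["coord"]:
                return node
        if not self.immnd_tree:
            return None
        new_coord = min(self.immnd_tree)  # arbitrary rule for the sketch
        self.immnd_tree[new_coord]["coord"] = True
        return new_coord


def build_tree():
    immd = ToyStandbyImmd()
    immd.immnd_up("SC-1", epoch=3, coord=True)
    immd.immnd_up("SC-2", epoch=3)
    immd.immnd_up("SC-5", epoch=3)
    return immd


# Down event missed: the dead SC-1 entry still looks like the coordinator.
missed = build_tree()
stale_result = missed.elect_coordinator()

# Down event processed: a live node is elected instead.
handled = build_tree()
handled.immnd_down("SC-1")
fresh_result = handled.elect_coordinator()
```

In the first scenario the election returns the dead node and nothing new is elected. Once the down event is processed, the same call picks a live node.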
[tickets] [opensaf:tickets] #2416 amfnd: su_si assignment message could be processed during SC absence stages
- **status**: unassigned --> accepted
- **assigned_to**: Minh Hon Chau

---

** [tickets:#2416] amfnd: su_si assignment message could be processed during SC absence stages**

**Status:** accepted
**Milestone:** 5.1.1
**Created:** Mon Apr 10, 2017 04:39 AM UTC by Minh Hon Chau
**Last Updated:** Mon Apr 10, 2017 04:39 AM UTC
**Owner:** Minh Hon Chau

In a 2N application configuration where the active SU is hosted on a controller and the standby SU is hosted on a payload, stopping both SCs can generate a su_si assignment message towards the standby SU to change its HA state to active.

- If this su_si assignment message is buffered and arrives before MDSNCS_DOWN, the node is rebooted
- In other cases, where MDSNCS_DOWN arrives before the su_si assignment, amfnd currently does not ignore this su_si assignment.

amfnd should ignore this su_si assignment message, similar to other messages like su_pres and su_reg.
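The second case in #2416 (MDSNCS_DOWN arrives first) could be handled with a simple guard, sketched below. The class, the method names, and the message-type strings are hypothetical stand-ins for amfnd's real event handling. The sketch does not address the first case, where a buffered assignment arrives before the down event.

```python
# Message types dropped once both controllers are known to be down.
# su_pres and su_reg are named in the ticket as already ignored;
# adding su_si is the change the ticket asks for.
IGNORED_WHILE_SC_ABSENT = {"su_pres", "su_reg", "su_si"}


class ToyNodeDirector:
    """Toy model of a node director that filters director messages."""

    def __init__(self):
        self.sc_absent = False
        self.processed = []

    def on_mds_ncs_down(self):
        # Stands in for receiving MDSNCS_DOWN for the last controller.
        self.sc_absent = True

    def on_message(self, msg_type: str) -> bool:
        """Return True if the message was processed, False if ignored."""
        if self.sc_absent and msg_type in IGNORED_WHILE_SC_ABSENT:
            return False
        self.processed.append(msg_type)
        return True


nd = ToyNodeDirector()
before_down = nd.on_message("su_si")  # controllers still up: processed
nd.on_mds_ncs_down()
after_down = nd.on_message("su_si")   # controllers gone: ignored
```

Keeping the ignored types in one set makes it visible that su_si is treated the same way as su_pres and su_reg during SC absence.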
[tickets] [opensaf:tickets] #2415 CKPT node director failed to execute ckpt create request
- **status**: review --> fixed
- **Comment**:

changeset: 8754:aa1ba7700cf2
user: Anders Widell
date: Mon Apr 10 08:32:59 2017 +0200
summary: ckpt: Increase limit for number of file desciptors in CKPTND [#2415]

[staging:aa1ba7]

---

** [tickets:#2415] CKPT node director failed to execute ckpt create request**

**Status:** fixed
**Milestone:** 5.2.0
**Created:** Fri Apr 07, 2017 01:30 AM UTC by David Byrne
**Last Updated:** Sat Apr 08, 2017 09:10 AM UTC
**Owner:** A V Mahesh (AVM)

After the following two patches were removed, based on OpenSAF CS8701, the CKPT node director failed to execute ckpt create requests (Collocated Checkpoints, Asynchronous Update).

- ph4_01_headless_escalation_for_osaftest.diff
- mds_log_level.diff

CPND_MAX_REPLICAS = 1000
retention_time is set to 30s

Test procedure

1. Send 34 ckpt requests per second: 34*30 = 1020, which is > CPND_MAX_REPLICAS. Failed, which is expected.
2. Send 32 ckpt requests per second: 32*30 = 960, which is < CPND_MAX_REPLICAS. It used to pass, but now fails since the above two patches were removed.

syslog:

Apr 5 01:42:46 SC-2-1 osafckptnd[4958]: ncs_sel_obj_create: socketpair failed - Too many open files
Apr 5 01:42:46 SC-2-1 osafckptnd[4958]: ER cpnd has exceeded the maximum number of allowed replicas (CPND_MAX_REPLICAS)

Test debug info:

Apr 5, 2017 1:46:08 AM INFO ANSWER type: report start-time: 1491349366.360 stop-time: 1491349567.269 total: send=6428 recv=6407 fail=6407

Changed test procedure for investigation purposes

1. Start the test from 32 ckpt/s: 32*30 = 960, which is < CPND_MAX_REPLICAS. Passed.
   Apr 6, 2017 2:56:27 AM INFO ANSWER type: report start-time: 1491439975.068 stop-time: 1491440187.347 total: send=6792 send-failed=0 recv=6780
2. Then test 34 ckpt/s. Failed.
3. Then test 33 ckpt/s. Failed.
4. Then back to 32 ckpt/s again. Failed.

From this experiment we can see that once CPND_MAX_REPLICAS has been exceeded, the ckpt service can’t recover.

Note: the problem only occurs for Collocated Checkpoints, Asynchronous Update. The same test run for Non-Collocated Checkpoints, Synchronous Update is OK.

Test Contact: Li Suo
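The arithmetic behind the #2415 test rates can be written out as a quick check. It assumes, as the ticket's own calculation does, that each create request leaves one replica alive for the full retention time, so the steady-state replica count is rate times retention. The function names are made up for the example.

```python
CPND_MAX_REPLICAS = 1000  # limit quoted in the ticket
RETENTION_TIME_S = 30     # retention_time used in the test


def steady_state_replicas(requests_per_second: int) -> int:
    """Replicas alive at once if each one lives for the retention time."""
    return requests_per_second * RETENTION_TIME_S


def exceeds_limit(requests_per_second: int) -> bool:
    return steady_state_replicas(requests_per_second) > CPND_MAX_REPLICAS


for rate in (32, 33, 34):
    print(rate, steady_state_replicas(rate), exceeds_limit(rate))
# 32 960 False
# 33 990 False
# 34 1020 True
```

By this estimate both 32 and 33 ckpt/s stay under the limit, so the failures in steps 3 and 4 of the changed procedure are consistent with the ticket's conclusion that the service did not recover after the limit was exceeded at 34 ckpt/s.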
[tickets] [opensaf:tickets] #2083 amf: error in syslog when initiating SI SWAP
- **status**: unassigned --> duplicate
- **Milestone**: future --> never
- **Comment**:

Fixed as part of #1897

---

** [tickets:#2083] amf: error in syslog when initiating SI SWAP**

**Status:** duplicate
**Milestone:** never
**Created:** Thu Sep 29, 2016 12:01 PM UTC by Rafael
**Last Updated:** Thu Sep 29, 2016 12:12 PM UTC
**Owner:** nobody

SMF initiates an SI SWAP which fails, but then the retry is successful. The error result should not be logged as an error in syslog.

osafamfd[473]: ER safSi=SC-2N,safApp=OpenSAF SWAP failed - only one assignment
osafrded[407]: NO Peer up on node 0x2020f
osafrded[407]: NO Got peer info request from node 0x2020f with role STANDBY
osafrded[407]: NO Got peer info response from node 0x2020f with role STANDBY
osafimmd[426]: NO MDS event from svc_id 24 (change:5, dest:13)
osafimmnd[436]: NO Implementer (applier) connected: 49 (@safAmfService2020f) <0, 2020f>
osafimmnd[436]: NO Implementer (applier) connected: 50 (@OpenSafImmReplicatorB) <0, 2020f>
osafamfd[473]: ER safSi=SC-2N,safApp=OpenSAF SWAP failed - Cold sync in progress
osafamfd[473]: NO safSi=SC-2N,safApp=OpenSAF Swap initiated
[tickets] [opensaf:tickets] #2417 amf: support for si-swap in N+M model when Standbys are in different SUs.
---

** [tickets:#2417] amf: support for si-swap in N+M model when Standbys are in different SUs.**

**Status:** accepted
**Milestone:** next
**Created:** Mon Apr 10, 2017 06:17 AM UTC by Praveen
**Last Updated:** Mon Apr 10, 2017 06:17 AM UTC
**Owner:** Praveen

This is a continuation of ticket #2259.
This new ticket will consider a more general case:
"When SIs in designated SUs have their standby distributed in different SUs,"

Will update with an example configuration.