** Description changed:

  [impact]
  
  programs using libqb logging abort at startup due to a failed assertion
  during libqb log initialization
  
  [test case]
  
  test program:
  
- 
  #include <qb/qblog.h>
  
  QB_LOG_INIT_DATA(test);
  
  int main(int argc, char* argv[])
  {
-   return 0;
+   return 0;
  }
- 
  
  compile and run:
  
  $ gcc -flto -D_GNU_SOURCE -o test test.c -lqb -ldl
  /usr/bin/ld: warning: /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libqb.so contains output sections; did you forget -T?
  
- $ ./test 
+ $ ./test
  test: test.c:4: test: Assertion `"implicit callsite section is observable, otherwise target's and/or libqb's build is at fault, preventing reliable logging" && work_s1 != NULL && work_s2 != NULL' failed.
  Aborted (core dumped)
- 
  
  Note the error is slightly different when compiling without lto:
  
  $ gcc -D_GNU_SOURCE -o test test.c -lqb -ldl
  /usr/bin/ld: warning: /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libqb.so contains output sections; did you forget -T?
  
- $ ./test 
+ $ ./test
  test: test.c:4: test: Assertion `"implicit callsite section is populated, otherwise target's build is at fault, preventing reliable logging" && QB_ATTR_SECTION_START != QB_ATTR_SECTION_STOP' failed.
  Aborted (core dumped)
- 
  
  [regression potential]
  
  any regression would likely involve problems during logging using the
  libqb logging functions, which could include failure to log or even
  program exit and/or crash.
  
  [scope]
  
  this appears to be needed only for focal; the issue seems to be an
  interaction between the focal version of binutils and some linker
  "magic" that libqb used in the focal version.
  
  The upstream libqb removed/replaced that linker "magic" after the version in
  focal, so this should not affect groovy or later. However, the fix changes the
  ABI and thus isn't appropriate for SRUing.
  https://github.com/ClusterLabs/libqb/pull/322
  
- The binutils in bionic and earlier does not appear to cause the
- problematic behavior with the libqb linker "magic", so no change is
- needed there.
+ The libqb code in bionic does not include the linker "magic" and so does
+ not have this problem.
  
  [other info]
  
  related debian binutils bug report:
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=923246
  
  related upstream binutils bug report:
  https://sourceware.org/bugzilla/show_bug.cgi?id=24276
  
  however, those appear to only have changed binutils to ignore the issue
  to allow the build to stop failing.
  
  The libqb docs do contain two suggestions to possibly work around this
  bug, specifically using either -l:libqb.so.0 or
  -DQB_KILL_ATTRIBUTE_SECTION, or both. Either or both approaches do help
  with the simple test case, but more testing is needed that actually
  exercises the log functionality to make sure nothing else breaks.
  
  $ gcc -flto -D_GNU_SOURCE -o test test.c -lqb -ldl
  /usr/bin/ld: warning: /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libqb.so contains output sections; did you forget -T?
- $ ./test 
+ $ ./test
  test: test.c:4: test: Assertion `"implicit callsite section is observable, otherwise target's and/or libqb's build is at fault, preventing reliable logging" && work_s1 != NULL && work_s2 != NULL' failed.
  Aborted (core dumped)
  
  $ gcc -flto -D_GNU_SOURCE -o test test.c -l:libqb.so.0 -ldl
- $ ./test 
+ $ ./test
  
  $ gcc -flto -DQB_KILL_ATTRIBUTE_SECTION -D_GNU_SOURCE -o test test.c -lqb -ldl
  /usr/bin/ld: warning: /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/libqb.so contains output sections; did you forget -T?
- $ ./test 
- 
+ $ ./test
  
  [original description]
  
- 
- When a clustered node is detected as failed the remaining node tries to fence the resources. When using pacemaker with gfs2 on an lvm2 logical volume dlm_controld calls out to dlm_stonith to release any locks held.
+ When a clustered node is detected as failed, the remaining node tries to
+ fence the resources. When using pacemaker with gfs2 on an lvm2 logical
+ volume, dlm_controld calls out to dlm_stonith to release any locks held.
  
  Due to a build issue with the version of libqb that pacemaker is
  compiled against, the call to CRM_TRACE_INIT_DATA, which is #defined to
  QB_LOG_INIT_DATA, fails with an assertion. This prevents the lock
  manager from releasing any held locks on the failed node.
  
  At this point the gfs2 filesystem cannot be accessed and after any
  resource timeouts are met, the resource is marked as failed.
  
  Calling dlm_stonith by hand with the data that is passed to it by
  dlm_controld shows the assertion.
  
  root@u2004-1:~# /usr/sbin/dlm_stonith -n 2 -t 1612361398
  dlm_stonith: utils.c:57: common: Assertion `"implicit callsite section is observable, otherwise target's and/or libqb's build is at fault, preventing reliable logging" && work_s1 != NULL && work_s2 != NULL' failed.
  
  It would appear that the code in libqb is overly aggressive in its sanity
  checking, or assumes that QB_LOG_INIT_DATA will only be called by the
  library. External programs such as pacemaker that end up calling
  CRM_TRACE_INIT_DATA will suffer the same assertion.
  
  This patch from clusterlabs is an attempt to resolve the assertion, but
  is still not sufficient.
  https://lists.clusterlabs.org/pipermail/users/2018-February/023614.html
  
  Taking out the assertion in <qb/qblog.h> and recompiling pacemaker
  appears to be the only way to allow dlm_stonith to work.
  
  journalctl shows that dlm_controld keeps trying to get a successful
  response from dlm_stonith:
  
  Feb 16 13:11:57 u2004-1 dlm_controld[9344]: 4389 fence result 2 pid 26568 result -1 term signal 6
  Feb 16 13:11:57 u2004-1 dlm_controld[9344]: 4389 fence status 2 receive -1 from 1 walltime 1613481117 local 4389
  Feb 16 13:11:57 u2004-1 dlm_controld[9344]: 4389 fence request 2 pid 26607 nodedown time 1613481102 fence_all dlm_stonith
  Feb 16 13:11:58 u2004-1 dlm_controld[9344]: 4391 fence result 2 pid 26607 result -1 term signal 6
  Feb 16 13:11:58 u2004-1 dlm_controld[9344]: 4391 fence status 2 receive -1 from 1 walltime 1613481118 local 4391
  Feb 16 13:11:58 u2004-1 dlm_controld[9344]: 4391 fence request 2 pid 26637 nodedown time 1613481102 fence_all dlm_stonith
  Feb 16 13:12:00 u2004-1 dlm_controld[9344]: 4392 fence result 2 pid 26637 result -1 term signal 6
  Feb 16 13:12:00 u2004-1 dlm_controld[9344]: 4392 fence status 2 receive -1 from 1 walltime 1613481120 local 4392
  Feb 16 13:12:00 u2004-1 dlm_controld[9344]: 4392 fence request 2 pid 26693 nodedown time 1613481102 fence_all dlm_stonith
  ....
  
  Calling 'dlm_tool fence_ack 2' by hand immediately releases the dlm
  resource locks.
  
  root@u2004-1:~# lsb_release -rd
  Description:    Ubuntu 20.04 LTS
  Release:        20.04
  
  root@u2004-1:~# apt-cache policy pacemaker
  pacemaker:
    Installed: 2.0.3-3ubuntu4.1
    Candidate: 2.0.3-3ubuntu4.1
    Version table:
   *** 2.0.3-3ubuntu4.1 500
          500 http://gb.archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages
          500 http://gb.archive.ubuntu.com/ubuntu focal-security/main amd64 Packages
          100 /var/lib/dpkg/status
       2.0.3-3ubuntu3 500
          500 http://gb.archive.ubuntu.com/ubuntu focal/main amd64 Packages

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1915828

Title:
  pacemaker fails to release clustered filesystem dlm locks on failover

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1915828/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
