Hi! Our customers were hit by a rather strange issue with resources populating attributes in attrd. The most puzzling fact is that they see the issue only on a small subset of nodes (two nodes in an 8-node cluster). The symptom is sporadic timeouts of resources whose RAs call attrd_updater to manage node attributes. To debug the issue, we modified the resource agents to run attrd_updater under strace, and also to collect an attrd blackbox on stop (after the timed-out monitor).
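For reference, the instrumentation was roughly along the following lines. This is only a sketch: the post does not include the actual RA change, so the strace flags, the log path, and the wrapper name are assumptions.

```shell
# Hypothetical sketch of the RA instrumentation described above; the real
# change is not shown here, so flags, paths and names are illustrative.
run_attrd_updater_traced() {
    # One trace file per invocation (keyed by our PID), so concurrent
    # monitor operations do not overwrite each other's logs.
    strace -f -tt -o "/var/log/attrd_updater.$$.strace" attrd_updater "$@"
}

# e.g. inside the monitor action:
# run_attrd_updater_traced -n ifspeed-o2ib0 -U 56000
```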
That way we managed to get a little bit further. Below is the analysis of a failure registered at Dec 17 08:08:19. The resource has a monitor timeout of 30 sec and an interval of 10 sec, so the monitor operation was started around Dec 17 08:07:49:

Dec 17 08:08:19 [38018] pfs2n5 lrmd: warning: child_timeout_callback: ifspeed-o2ib0_monitor_10000 process (PID 40819) timed out
Dec 17 08:08:24 [38018] pfs2n5 lrmd: crit: child_timeout_callback: ifspeed-o2ib0_monitor_10000 process (PID 40819) will not die!
^^ This is because of strace
Dec 17 08:08:48 [38018] pfs2n5 lrmd: warning: operation_finished: ifspeed-o2ib0_monitor_10000:40819 - timed out after 30000ms

The strace log for attrd_updater shows the following:
=======
[...]
connect(3, {sa_family=AF_LOCAL, sun_path=@"attrd"}, 110) = 0
setsockopt(3, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
sendto(3, "\377\377\377\377\0\0\0\0\30\0\0\0\0\0\0\0\0\0\20\0\0\0\0\0", 24, MSG_NOSIGNAL, NULL, 0) = 24
setsockopt(3, SOL_SOCKET, SO_PASSCRED, [0], 4) = 0
recvfrom(3, 0x7ffdb55ade40, 12328, 16640, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "\377\377\377\377\0\0\0\0(0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\20\0"..., 12328, MSG_WAITALL|MSG_NOSIGNAL, NULL, NULL) = 12328
mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6981ee6000
open("/dev/shm/qb-attrd-request-38019-40694-10-header", O_RDWR) = 4
ftruncate(4, 8252) = 0
mmap(NULL, 8252, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0) = 0x7f6982100000
open("/dev/shm/qb-attrd-request-38019-40694-10-data", O_RDWR) = 5
ftruncate(5, 1052672) = 0
mmap(NULL, 2105344, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6981ce4000
mmap(0x7f6981ce4000, 1052672, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 5, 0) = 0x7f6981ce4000
mmap(0x7f6981de5000, 1052672, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 5, 0) = 0x7f6981de5000
close(5) = 0
close(4) = 0
open("/dev/shm/qb-attrd-response-38019-40694-10-header", O_RDWR) = 4
ftruncate(4, 8248) = 0
mmap(NULL, 8248, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0) = 0x7f69820fd000
open("/dev/shm/qb-attrd-response-38019-40694-10-data", O_RDWR) = 5
ftruncate(5, 1052672) = 0
mmap(NULL, 2105344, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6981ae2000
mmap(0x7f6981ae2000, 1052672, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 5, 0) = 0x7f6981ae2000
mmap(0x7f6981be3000, 1052672, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 5, 0) = 0x7f6981be3000
close(5) = 0
close(4) = 0
open("/dev/shm/qb-attrd-event-38019-40694-10-header", O_RDWR) = 4
ftruncate(4, 8248) = 0
mmap(NULL, 8248, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0) = 0x7f69820fa000
open("/dev/shm/qb-attrd-event-38019-40694-10-data", O_RDWR) = 5
ftruncate(5, 1052672) = 0
mmap(NULL, 2105344, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f69818e0000
mmap(0x7f69818e0000, 1052672, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 5, 0) = 0x7f69818e0000
mmap(0x7f69819e1000, 1052672, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 5, 0) = 0x7f69819e1000
close(5) = 0
close(4) = 0
poll([{fd=3, events=POLLIN}], 1, 0) = 0 (Timeout)
sendto(3, "\355", 1, MSG_NOSIGNAL, NULL, 0) = 1
futex(0x7f69820ff010, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {1481958460, 283442042}, ffffffff) = 0
poll([{fd=3, events=POLLIN}], 1, 0) = 0 (Timeout)
exit_group(0) = ?
+++ exited with 0 +++
=======

At the same time, the attrd blackbox traces do not show a connect/update attempt at all:
[...]
debug Dec 17 08:07:19 attrd_client_update(320):0: Broadcasting ifspeed-o2ib0[pfs2n5] = 56000
debug Dec 17 08:07:29 attrd_client_update(320):0: Broadcasting ifspeed-o2ib0[pfs2n5] = 56000
debug Dec 17 08:07:39 attrd_client_update(320):0: Broadcasting ifspeed-o2ib0[pfs2n5] = 56000
<<< Lines about ifspeed-o2ib0[pfs2n5] at 08:07:49 should appear here, but there are no lines with that timestamp >>>
debug Dec 17 08:08:48 attrd_client_update(320):0: Broadcasting fail-count-ifspeed-o2ib0[pfs2n5] = 2
debug Dec 17 08:08:48 attrd_client_update(320):0: Broadcasting last-failure-ifspeed-o2ib0[pfs2n5] = 1481958528 r_name="last-failure-ifspeed-o2ib0" task="update" attr_value="1481958528" attr_section="status" attr_host="pfs2n5" attr_is_remote="0" attr_is_private="0" attr_is_utilization="0" acl_target="hacluster" attr_user="hacluster"/>

(attr_is_utilization comes from a local patch: http://bugs.clusterlabs.org/show_bug.cgi?id=5165)

As I understand it, the issue may be either in the mainloop implementation or inside libqb. What should we look at next? Of course, we have a workaround in place now (running attrd_updater several times in a loop under the 'timeout' utility), but we would like to get to the root of the issue.

Something else bothers me: there is garbage in some messages in the blackbox log:

trace Dec 17 07:54:36 pcmk_message_common_cs(281):0: Got new^MO^A message (size=85760, 0, 822083584)

The corresponding code is:

crm_trace("Got new%s message (size=%d, %d, %d)",
          msg->is_compressed ? " compressed" : "",
          ais_data_len(msg), msg->size, msg->compressed_size);

I cannot see how that binary "^MO^A" data could appear there, except through heap corruption or some unusual issue with the macros. Those strings (" compressed" and "") live in the read-only data section... The numeric data (size=85760, 0, 822083584) is broken too. The nodes are equipped with ECC RAM, so single-bit errors are corrected and multi-bit ones result in an NMI. The OS is EL6, and the issue is seen with both pacemaker 1.1.13 and 1.1.15 (e174ec8).
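For completeness, the workaround is along the following lines. This is a sketch only: the retry count and the 5-second per-attempt bound are illustrative, and the ATTRD_UPDATER variable exists purely to keep the snippet self-contained (the real RA calls attrd_updater directly).

```shell
# Sketch of the workaround: bound each attrd_updater attempt with the
# 'timeout' utility and retry a few times, so a single attempt that
# hangs in the futex wait cannot consume the whole monitor timeout.
ATTRD_UPDATER="${ATTRD_UPDATER:-attrd_updater}"

update_attr_with_retry() {
    name=$1 value=$2
    tries=0
    while [ "$tries" -lt 3 ]; do
        # 'timeout' kills the updater if it blocks; a clean exit ends the loop
        if timeout 5 $ATTRD_UPDATER -n "$name" -U "$value"; then
            return 0
        fi
        tries=$((tries + 1))
    done
    return 1
}
```

In the RA, `update_attr_with_retry ifspeed-o2ib0 56000` then replaces the bare attrd_updater call.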
libqb is quite recent (1908e6c1).

Any pointers are appreciated.

Best regards,
Vladislav

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org