Hi! Digging further in ldirectord, I found that the utility functions do not distinguish between a name that is not known and a name that is (probably) known but cannot be resolved at the moment.
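To illustrate the distinction, here is a minimal sketch that is not taken from ldirectord. It assumes Perl's core Socket module, whose getaddrinfo reports "name does not exist" (EAI_NONAME) separately from "temporary resolver failure" (EAI_AGAIN). The helper names classify_resolver_error and resolve_ipv4 are hypothetical.

```perl
use strict;
use warnings;
use Socket qw(:addrinfo AF_INET SOCK_STREAM);

# Hypothetical helper: map a getaddrinfo error value to a coarse status.
# getaddrinfo returns a dualvar that compares numerically to the EAI_* constants.
sub classify_resolver_error {
    my ($err) = @_;
    return 'ok'        if !$err;               # zero / empty string: success
    return 'unknown'   if $err == EAI_NONAME;  # the name does not exist
    return 'temporary' if $err == EAI_AGAIN;   # resolver unavailable, retry later
    return 'error';                            # any other failure
}

# Hypothetical helper: resolve a host name to a numeric IPv4 address.
# Returns (status, address); address is undef unless status is 'ok'.
sub resolve_ipv4 {
    my ($name) = @_;
    my ($err, @res) = getaddrinfo($name, "",
        { family => AF_INET, socktype => SOCK_STREAM });
    my $status = classify_resolver_error($err);
    return ($status, undef) if $status ne 'ok';

    # Convert the packed socket address back to a printable IP string.
    my ($nerr, $ip) = getnameinfo($res[0]{addr}, NI_NUMERICHOST, NIx_NOSERV);
    return $nerr ? ('error', undef) : ('ok', $ip);
}

my ($status, $ip) = resolve_ipv4('localhost');
print "localhost: $status", (defined $ip ? " ($ip)" : ''), "\n";
```

With such a status, a caller could treat 'unknown' as a configuration error and 'temporary' as a condition to retry; whether that fits ldirectord is an open question.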
I hacked the corresponding functions to capture the error code (errno) and return it as a negative number. Short demo:

  DB<2> x ld_gethostbyname('x',AF_INET)
0  '-2'   # ENOENT
  DB<3> x ld_gethostbyname('localhost',AF_INET)
0  '127.0.0.1'
  DB<4> x ld_gethostbyname('localhost',AF_INET6)
0  '[::1]'
### hacked /etc/resolv.conf to make the nameservers unreachable (added IPs that are not nameservers or don't exist), but the host exists
  DB<7> x ld_gethostbyname('mail-1',AF_INET)
0  '-3'   # ESRCH

Returning the error message string is a bit trickier, so I just used the error code.
However, it's not clear what to do when the resolver fails (i.e. the name would be known if the resolver worked).
In any case it takes quite a while until an error result is returned.

For example (using the hacked functions):

    if (($fallback->{port} = &ld_getservbyname($fallback->{port}, $protocol)) =~ /^-/) {
        &config_error($line, "invalid port for fallback server");
    }

One could check for "== '-2'" instead, but in the other case there is still no valid port value.

Ideas?

Regards,
Ulrich

>>> Ulrich Windl wrote on 08.08.2022 at 11:19 in message <62F0D518.3F8 : 161 : 60728>:
> Hi!
>
> The bug is still under investigation, but digging in the ldirectord code I
> found this part, which is called when stopping:
>
>     } elsif ($CMD eq "stop") {
>         kill 15, $oldpid;
>         ld_exit(0, "Exiting from ldirectord $CMD");
>
> ldirectord uses a SIGTERM handler that only sets a flag; the termination
> code is then started at some later time.
> Doesn't that mean the cluster will see a bad exit code (success while parts
> of ldirectord are still running)?
>
> Regards,
> Ulrich
>
> >>> Ulrich Windl wrote on 03.08.2022 at 11:13 in message <62EA3C2C.E8D : 161 : 60728>:
> > Hi!
> >
> > I wanted to inform you of an unpleasant bug in ldirectord of SLES12 SP5:
> > We had a short network problem while some redundancy paths were reconfigured
> > in the infrastructure, with the effect that some network services could not
> > be reached.
> > Unfortunately ldirectord, controlled by the cluster, reported a failure (the
> > director, not the services being directed to):
> >
> > h11 crmd[28930]: notice: h11-prm_lvs_mail_monitor_300000:369 [ Use of uninitialized value $ip_port in pattern match (m//) at /usr/sbin/ldirectord line 1830, <CFGFILE> line 21. Error [33159] reading file /etc/ldirectord/mail.conf at line 10: invalid address for virtual service\n ]
> > h11 ldirectord[33266]: Exiting with exit_status 2: config_error: Configuration Error
> >
> > You can guess what happened:
> > Pacemaker tried to recover (stop, then start), but the stop failed, too:
> > h11 lrmd[28927]: notice: prm_lvs_mail_stop_0:35047:stderr [ Use of uninitialized value $ip_port in pattern match (m//) at /usr/sbin/ldirectord line 1830, <CFGFILE> line 21. ]
> > h11 lrmd[28927]: notice: prm_lvs_mail_stop_0:35047:stderr [ Error [36293] reading file /etc/ldirectord/mail.conf at line 10: invalid address for virtual service ]
> > h11 crmd[28930]: notice: Result of stop operation for prm_lvs_mail on h11: 1 (unknown error)
> >
> > The stop failure meant that the node was fenced, interrupting all the other
> > services.
> >
> > Examining the logs I also found this interesting type of error:
> > h11 attrd[28928]: notice: Cannot update fail-count-prm_lvs_rksapds5#monitor_300000[monitor]=(null) because peer UUID not known (will retry if learned)
> >
> > Finally, here's the code that caused the error:
> >
> > sub _ld_read_config_virtual_resolve
> > {
> >     my($line, $vsrv, $ip_port, $af)=(@_);
> >
> >     if($ip_port){
> >         $ip_port=&ld_gethostservbyname($ip_port, $vsrv->{protocol}, $af);
> >         if ($ip_port =~ /(\[[0-9A-Fa-f:]+\]):(\d+)/) {
> >             $vsrv->{server} = $1;
> >             $vsrv->{port} = $2;
> >         } elsif($ip_port){
> >             ($vsrv->{server}, $vsrv->{port}) = split /:/, $ip_port;
> >         }
> >         else {
> >             &config_error($line,
> >                 "invalid address for virtual service");
> >         }
> > ...
> >
> > The value returned by ld_gethostservbyname is undefined. I also wonder what
> > the program logic is:
> > If the host looks like a hex address in square brackets, host and port are
> > split at the colon; otherwise host and port are split at the colon, too.
> > Why not simply split at the last colon if the value is defined, AND THEN
> > check whether the components look OK?
> >
> > So the address is reported as "invalid address for virtual service" only
> > because the resolver service (e.g. via LDAP) is unavailable.
> > I used host and service names for readability.
> >
> > (I reported the issue to SLES support)
> >
> > Regards,
> > Ulrich
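
The "split at the last colon, then check the components" idea from the first report could look roughly like the sketch below. It is not a patch for ldirectord: split_host_port is a hypothetical helper, and it assumes the port has already been resolved to a number and that IPv6 addresses arrive in square brackets.

```perl
use strict;
use warnings;

# Hypothetical helper: split "host:port" at the LAST colon, then validate
# both parts. Returns (host, port) on success and an empty list otherwise.
sub split_host_port {
    my ($ip_port) = @_;

    # An undefined or empty value (e.g. after a failed lookup) is rejected
    # up front, so it never reaches a pattern match.
    return unless defined $ip_port && length $ip_port;

    # Greedy (.+) plus a colon-free port makes the match use the last colon.
    my ($host, $port) = $ip_port =~ /^(.+):([^:]+)$/ or return;

    # Assumption: the port is numeric at this point.
    return unless $port =~ /^\d+$/ && $port >= 1 && $port <= 65535;

    # The host is either a bracketed IPv6 address or contains no colons
    # or brackets at all (host name or IPv4 address).
    return unless $host =~ /^\[[0-9A-Fa-f:.]+\]$/ || $host !~ /[:\[\]]/;

    return ($host, $port);
}

for my $input ('[::1]:80', 'mail-1:25', '192.0.2.1:443', '::1', undef) {
    my ($host, $port) = split_host_port($input);
    my $shown = defined $input ? $input : '(undef)';
    print defined $host ? "$shown -> host=$host port=$port\n"
                        : "$shown -> rejected\n";
}
```

A caller can then report a configuration error only when the list comes back empty, and handle a failed lookup separately before splitting.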