I've finally solved this. Solution inline.
On Fri, Jul 13, 2018 at 9:55 AM Jan Friesse <jfrie...@redhat.com> wrote:
>
> Jason,
>
> > On Thu, Jun 21, 2018 at 10:47 AM Jason Gauthier <jagauth...@gmail.com>
> > wrote:
> >>
> >> On Thu, Jun 21, 2018 at 9:49 AM Jan Pokorný <jpoko...@redhat.com> wrote:
> >>>
> >>> On 21/06/18 07:05 -0400, Jason Gauthier wrote:
> >>>> On Thu, Jun 21, 2018 at 5:11 AM Christine Caulfield
> >>>> <ccaul...@redhat.com> wrote:
> >>>>> On 19/06/18 18:47, Jason Gauthier wrote:
> >>>>>> Attached!
> >>>>>
> >>>>> That's very odd. I can see communication with the server and corosync in
> >>>>> there (do it's doing something) but no logging at all. When I start
> >>>>> qdevice on my systems it logs loads of messages even if it doesn't
> >>>>> manage to contact the server. Do you have any logging entries in
> >>>>> corosync.conf that might be stopping it?
> >>>>
> >>>> I haven't checked the corosync logs for any entries before, but I just
> >>>> did. There isn't anything logged.
> >>>
> >>> What about syslog entries (may boil down to /var/log/messages,
> >>> journald log, or whatever sink is configured)?
> >>
> >> I took a look, since both you and Chrissie mentioned that.
> >>
> >> There aren't any new entries added to any of the /var/log files.
> >>
> >> # corosync-qdevice -f -d
> >> # date
> >> Thu Jun 21 10:36:06 EDT 2018
> >>
> >> # ls -lt|head
> >> total 152072
> >> -rw-r----- 1 root adm 68018 Jun 21 10:34 auth.log
> >> -rw-rw-r-- 1 root utmp 18704352 Jun 21 10:34 lastlog
> >> -rw-rw-r-- 1 root utmp 107136 Jun 21 10:34 wtmp
> >> -rw-r----- 1 root adm 248444 Jun 21 10:34 daemon.log
> >> -rw-r----- 1 root adm 160899 Jun 21 10:34 syslog
> >> -rw-r----- 1 root adm 1119856 Jun 21 09:46 kern.log
> >>
> >> I did look through daemon, messages, and syslog just to be sure.
> >>
> >>>>> Where did the binary come from? did you build it yourself or is it from
> >>>>> a package? I wonder if it's got corrupted or is a bad version. Possibly
> >>>>> linked against a 'dodgy' libqb - there have been some things going on
> >>>>> there that could cause logging to go missing in some circumstances.
> >>>>>
> >>>>> Honza (the qdevice expert) is away at the moment, so I'm guessing a bit
> >>>>> here anyway!
>
> Corosync-qdevice is using same config as corosync, so to get messages on
> stderr, please configure
>
> logging.to_stderr: on

Yes! I added a logging subsection with QDEVICE and enabled stderr.
Then, and only then did corosync-qdevice -f -d behave the way I expected it to.

>
> >>>>
> >>>> Hmm. Interesting. I installed the debian package. When it didn't
> >>>> work, I grabbed the source from github. They both act the same way,
> >>>> but if there is an underlying library issue then that will continue to
> >>>> be a problem.
> >>>>
> >>>> It doesn't say much:
> >>>> /usr/lib/x86_64-linux-gnu/libqb.so.0.18.1
> >>>
> >>> You are likely using libqb v1.0.1.
> >>
> >> Correct. I didn't even think to look at the output of dpkg -l for the
> >> package version.
> >> Debian 9 also packages binutils-2.28
> >>
> >>> Ability to figure out the proper package version is one of the most
> >>> basic skills to provide useful diagnostics about the issues with
> >>> distro-provided packages.
> >>>
> >>> With Debian, the proper incantation seems to be
> >>>
> >>> dpkg -s libqb-dev | grep -i version
> >>>
> >>> or
> >>>
> >>> apt list libqb-dev
> >>>
> >>> (or substitute libqb0 for libqb-dev).
> >>>
> >>> As Chrissie mentioned, there is some fishiness possible if you happen
> >>> to use ld linker from binutils 2.29+ for the building with this old
> >>> libqb in the mix, so if the issues persist and logging seems to be
> >>> missing, try recompiling with the downgraded binutils package below
> >>> said breakage point.
> >>
> >> Since the system already has a lower numbered binutils (2.28) I wonder
> >> if I should attempt to build a newer version of the libqb library.
> >>
> >> As Chrissie mentioned, I will open a bug with Debian in the Interim.
> >> But I don't believe I will see resolution to that any time soon. :)
> >
> > I was finally able to look at this problem again, and found that qnetd
> > is giving me some messaging, but I don't know what to do with it.
> >
> > Jun 29 16:34:35 debug New client connected
> > Jun 29 16:34:35 debug cluster name = zeta
> > Jun 29 16:34:35 debug tls started = 1
> > Jun 29 16:34:35 debug tls peer certificate verified = 1
> > Jun 29 16:34:35 debug node_id = 1084772368
> > Jun 29 16:34:35 debug pointer = 0x563afd609d70
> > Jun 29 16:34:35 debug addr_str = ::ffff:192.168.80.16:38010
> > Jun 29 16:34:35 debug ring id = (40a85010.89ec)
> > Jun 29 16:34:35 debug cluster dump:
> > Jun 29 16:34:35 debug client = ::ffff:192.168.80.16:38010,
> > node_id = 1084772368
> > Jun 29 16:34:35 debug Client ::ffff:192.168.80.16:38010 (cluster
> > zeta, node_id 1084772368) sent initial node list.
> > Jun 29 16:34:35 debug msg seq num 4
> > Jun 29 16:34:35 debug node list:
> > Jun 29 16:34:35 error ffsplit: Received empty config node list for
> > client ::ffff:192.168.80.16:38010
>
> Yes, this is interesting. Could you please share your config?

Yes, see below.

> > Jun 29 16:34:35 error Algorithm returned error code. Sending error reply.
> > Jun 29 16:34:35 debug Client ::ffff:192.168.80.16:38010 (cluster
> > zeta, node_id 1084772368) sent membership node list.
> > Jun 29 16:34:35 debug msg seq num 5
> > Jun 29 16:34:35 debug ring id = (40a85010.89ec)
> > Jun 29 16:34:35 debug node list:
> > Jun 29 16:34:35 debug node_id = 1084772368, data_center_id = 0,
> > node_state = not set
> > Jun 29 16:34:35 debug node_id = 1084772369, data_center_id = 0,
> > node_state = not set
> > Jun 29 16:34:35 debug Algorithm result vote is Ask later
> > Jun 29 16:34:35 debug Client ::ffff:192.168.80.16:38010 (cluster
> > zeta, node_id 1084772368) sent quorum node list.
> > Jun 29 16:34:35 debug msg seq num 6
> > Jun 29 16:34:35 debug quorate = 1
> > Jun 29 16:34:35 debug node list:
> > Jun 29 16:34:35 debug node_id = 1084772368, data_center_id = 0,
> > node_state = member
> > Jun 29 16:34:35 debug node_id = 1084772369, data_center_id = 0,
> > node_state = member
> >
> > It looks like "config node list" is empty, but the other lists are
> > not. I'm not sure where it's getting that node list from. For fun, I
> > added
> > nodelist {
> >     node {
> >         alpha: 192.168.80.16
> >     }
> >     node {
> >         beta: 192.168.80.17
> >     }
> > }
> > }
>
> This is how nodelist doesn't look like. It should look like:
> nodelist {
>     node {
>         ring0_addr: 192.168.80.16
>         nodeid: 1
>     }
>     node {
>         ring0_addr: 192.168.80.17
>         nodeid: 2
>     }
> }
>

You are correct. I figured this out as well, after some
experimentation and finding examples online. However, this alone did
not resolve the issue. When I started it like this, corosync-qdevice
would not send a config nodelist, and it would exit with an error code
of 18. It wasn't until after I moved the nodelist above the quorum
section did corosync-qdevice actually start successfully.

> But it's really weird corosync-qdevice started without proper nodelist
> (it shouldn't).

It wasn't. It was exiting with an error code of 18. But that was
because it never saw the nodelist. The incorrect nodelist above, was
never interpreted. When I figured out the nodelist syntax was wrong,
and I still received the error 18, I moved the nodelist above the
quorum section and it started.

> Honza
>
> > to corosync.conf, and restarted both nodes. But that didn't help.
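
For anyone who finds this thread later, the three changes described above (stderr logging for QDEVICE, the corrected nodelist syntax, and the nodelist placed before the quorum section) might look roughly like the fragment below. This is a sketch reconstructed from the thread, not a copy of the actual corosync.conf. The qnetd host name is a hypothetical placeholder, and the debug setting and the device options other than the ffsplit algorithm seen in the qnetd log are assumptions.

```
# Sketch only: reconstructed from the thread, not the real file.

logging {
    # Honza's suggestion, so that "corosync-qdevice -f -d" prints messages.
    to_stderr: on
    logger_subsys {
        subsys: QDEVICE
        # Assumption: debug output enabled for this subsystem.
        debug: on
    }
}

# The nodelist comes BEFORE the quorum section; with it placed after,
# corosync-qdevice exited with error code 18 in the case described above.
nodelist {
    node {
        ring0_addr: 192.168.80.16
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.80.17
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    device {
        model: net
        net {
            # Hypothetical placeholder; use the real qnetd server address.
            host: qnetd.example.com
            # Algorithm name as it appears in the qnetd log above.
            algorithm: ffsplit
        }
    }
}
```

Whether the section order matters on other corosync-qdevice versions is not something this thread establishes; it is only what was observed here.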
_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org