On Wed, Apr 20, 2016 at 09:43:05AM -0400, Simo Sorce wrote:
> On Wed, 2016-04-20 at 11:12 +0200, Jakub Hrozek wrote:
> > On Wed, Apr 20, 2016 at 10:32:59AM +0200, Jakub Hrozek wrote:
> > > > > From 0dff46755af6063ed4b0339020ae5bb686692de1 Mon Sep 17 00:00:00 2001
> > > > > From: Simo Sorce <s...@redhat.com>
> > > > > Date: Tue, 12 Jan 2016 20:13:28 -0500
> > > > > Subject: [PATCH 02/15] Server: Enable Watchdog in all daemons
> > > > >
> > > > > This allows the services to self monitor.
> > > > >
> > > > > Related:
> > > > > https://fedorahosted.org/sssd/ticket/2921
> > > >
> > > > Is it intentional that we also enable the watchdog in monitor? I haven't
> > > > seen the sssd process being stuck and if it does, we probably have
> > > > bigger issues, so it's probably fine, I just need to remember to not
> > > > SIGSTOP sssd when testing anymore :)
> > > >
> > > > Otherwise ack.
> > >
> > > Actually, more questions...
> > >
> > > Can you help me test this patch? I tried to inject sleep() into sssd_be
> > > code and the sleep was just interrupted by the SIGRT delivery. With SSSD,
> > > most of the time the process was stuck was because it was writing a lot of
> > > data with fsync()/fdatasync(). I can't find any information in the Linux
> > > fsync manpage on how fsync behaves wrt signals. openpub manpages indicate
> > > that fsync would return EINTR, which worries me a bit..
> >
> > Hmm, sorry, I was not being careful enough. man 7 signal also says:
> > """
> > The sleep(3) function is also never restarted if interrupted by a
> > handler, but gives a success return: the number of seconds remaining to
> > sleep.
> > """
> >
> > so the sleep testcase was wrong even though CatchSignal uses SA_RESTART.
> > But do you know how would write() or fsync() behave here? The signal
> > manpage is a bit unclear to me as it talks about "slow" devices..
> >
> > Or can you think of some easy way to test this?
> >
> The fsync manpage here says:
> "The call blocks until the device reports that the transfer has
> completed."
>
> And does not report EINTR as a possible error.
>
> That said I am a bit unclear what you want to test actually?
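
For reference, the sleep(3) behaviour quoted from man 7 signal can be reproduced outside of SSSD with a small standalone helper. This is only a sketch under assumptions: the names interrupted_sleep() and on_alarm() are made up for illustration, it uses SIGALRM from setitimer() rather than the realtime signal the watchdog uses, and it relies on glibc on Linux implementing sleep() on top of nanosleep(), since POSIX leaves mixing sleep() with SIGALRM unspecified.

```c
#define _XOPEN_SOURCE 700   /* for sigaction() and setitimer() */

#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

/* Set by the handler so a caller can see that the signal arrived. */
static volatile sig_atomic_t got_signal = 0;

static void on_alarm(int signo)
{
    (void)signo;
    got_signal = 1;
}

/* Hypothetical test helper, not SSSD code.
 * Installs a SIGALRM handler with SA_RESTART, arms a one-shot timer that
 * fires after signal_after_ms milliseconds, then calls sleep(sleep_secs).
 * Returns the value returned by sleep(): the seconds left unslept. */
unsigned int interrupted_sleep(unsigned int sleep_secs, long signal_after_ms)
{
    struct sigaction sa, old;
    struct itimerval tv;
    unsigned int left;

    assert(sleep_secs > 0 && signal_after_ms > 0);

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sa.sa_flags = SA_RESTART;   /* does not make sleep() restart */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, &old) != 0) {
        return 0;
    }

    memset(&tv, 0, sizeof(tv));
    tv.it_value.tv_sec = signal_after_ms / 1000;
    tv.it_value.tv_usec = (signal_after_ms % 1000) * 1000;
    got_signal = 0;
    if (setitimer(ITIMER_REAL, &tv, NULL) != 0) {
        sigaction(SIGALRM, &old, NULL);
        return 0;
    }

    left = sleep(sleep_secs);   /* cut short when the handler runs */

    sigaction(SIGALRM, &old, NULL);
    return left;
}
```

Calling interrupted_sleep(2, 100) should come back well before two seconds have passed, with a non-zero return value, which is why an injected sleep() is not a useful stand-in for a process that is genuinely stuck in a syscall.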

I want to actually test that the service can be restarted if stuck and
reconnects fine. So far I haven't been lucky: SIGSTOP-ing the service
stopped delivery of the signals, and so did attaching gdb and waiting.

But most importantly I want to make sure that if tdb is writing a
transaction and the signal is delivered, then the fsync() in tdb doesn't
get interrupted and doesn't corrupt the database. In all the cases where
users complained about a service being restarted, it was always about tdb
writing stuff to disk...

>
> Yes interruptible calls can be interrupted by a signal, that's always
> the case, if we have code that misbehaves when a syscall is interrupted
> we need to fix that code.
>
> Afaik when we write() we always check the return and retry on EINTR.
>
> Simo.
>
> --
> Simo Sorce * Red Hat, Inc * New York
>
_______________________________________________
sssd-devel mailing list
sssd-devel@lists.fedorahosted.org
https://lists.fedorahosted.org/admin/lists/sssd-devel@lists.fedorahosted.org
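
The "check the return and retry on EINTR" pattern mentioned in the thread can be sketched roughly as follows. This is a generic illustration, not the actual SSSD or tdb code: write_all() and fsync_retry() are hypothetical names. Whether fsync() can fail with EINTR at all depends on the platform; POSIX lists it as a possible error while the Linux manpage quoted above does not, so the second helper is purely defensive.

```c
#define _XOPEN_SOURCE 700

#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write the whole buffer, retrying when write() is
 * interrupted by a signal before transferring any data (EINTR) and
 * continuing after short writes.
 * Returns len on success, -1 on any other error (errno is left set). */
ssize_t write_all(int fd, const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t done = 0;

    assert(buf != NULL || len == 0);

    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n < 0) {
            if (errno == EINTR) {
                continue;       /* interrupted by a handler, try again */
            }
            return -1;          /* real error */
        }
        done += (size_t)n;      /* short write: loop for the rest */
    }
    return (ssize_t)done;
}

/* Hypothetical helper: call fsync() again if it reports EINTR.
 * Returns 0 on success, -1 on any other error. */
int fsync_retry(int fd)
{
    int ret;

    do {
        ret = fsync(fd);
    } while (ret != 0 && errno == EINTR);

    return ret;
}
```

With wrappers of this shape, a watchdog signal arriving in the middle of a write only causes the call to be reissued. It does not by itself answer the open question in this thread, which is what the signal does to a long-running fsync() inside tdb.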