On Wed, Apr 20, 2016 at 09:43:05AM -0400, Simo Sorce wrote:
> On Wed, 2016-04-20 at 11:12 +0200, Jakub Hrozek wrote:
> > On Wed, Apr 20, 2016 at 10:32:59AM +0200, Jakub Hrozek wrote:
> > > > > From 0dff46755af6063ed4b0339020ae5bb686692de1 Mon Sep 17 00:00:00 2001
> > > > > From: Simo Sorce <s...@redhat.com>
> > > > > Date: Tue, 12 Jan 2016 20:13:28 -0500
> > > > > Subject: [PATCH 02/15] Server: Enable Watchdog in all daemons
> > > > > 
> > > > > This allows the services to self monitor.
> > > > > 
> > > > > Related:
> > > > > https://fedorahosted.org/sssd/ticket/2921
> > > > 
> > > > Is it intentional that we also enable the watchdog in monitor? I haven't
> > > > seen the sssd process being stuck and if it does, we probably have
> > > > bigger issues, so it's probably fine, I just need to remember to not
> > > > SIGSTOP sssd when testing anymore :)
> > > > 
> > > > Otherwise ack.
> > > 
> > > Actually, more questions...
> > > 
> > > Can you help me test this patch? I tried to inject sleep() into sssd_be
> > > code and the sleep was just interrupted by the SIGRT delivery. With SSSD,
> > > most of the time the process was stuck was because it was writing a lot of
> > > data with fsync()/fdatasync(). I can't find any information in the Linux
> > > fsync manpage on how fsync behaves wrt signals. openpub manpages indicate
> > > that fsync would return EINTR, which worries me a bit..
> > 
> > Hmm, sorry, I was not being careful enough. man 7 signal also says:
> > """
> > The sleep(3) function is also never restarted if interrupted by a
> > handler, but gives a success return: the number of seconds remaining to
> > sleep.
> > """
> > 
> > so the sleep testcase was wrong even though CatchSignal uses SA_RESTART.
> > But do you know how would write() or fsync() behave here? The signal
> > manpage is a bit unclear to me as it talks about "slow" devices..
> > 
> > Or can you think of some easy way to test this?
> 
> The fsync manpage here says:
>         "The call blocks until the device reports that the transfer has
>         completed."
>         
> And does not report EINTR as a possible error.
> 
> That said I am a bit unclear what you want to test actually ?

I want to actually test that the service can be restarted if stuck and
reconnects fine. So far I haven't been lucky - SIGSTOP-ing the service
stopped delivery of the signals, so did attaching gdb and waiting.
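
For the "stuck service" test itself, a plain sleep() returns early once a
handler runs, so something that resumes the wait is needed. A minimal sketch,
assuming a hypothetical helper called stall_for() (it is not part of SSSD)
that restarts nanosleep() with the remaining time after every EINTR:

```c
/* Ask for POSIX declarations even when built in strict ISO C mode. */
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <time.h>

/*
 * Hypothetical test helper, not SSSD code: keep the calling process
 * blocked for the full `seconds`, even if signal handlers run meanwhile.
 * Returns 0 once the whole interval has elapsed, -1 on any other error.
 */
int stall_for(unsigned int seconds)
{
    struct timespec req = { .tv_sec = (time_t)seconds, .tv_nsec = 0 };
    struct timespec rem = { 0, 0 };

    /* nanosleep() stores the unslept time in `rem` when interrupted. */
    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR) {
            return -1;
        }
        req = rem; /* resume with whatever time is left */
    }
    return 0;
}
```

Calling something like stall_for(60) from a request handler should keep the
process away from its event loop across signal deliveries, which is the
situation I am trying to reproduce.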

But most importantly I want to make sure that if tdb is writing a
transaction and the signal is delivered, then the fsync() in tdb
doesn't get interrupted and doesn't corrupt the database. From all the
cases where users complained about a service being restarted, it was
always about tdb writing stuff to disk...
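
One rough way to probe the fsync() question outside of SSSD might be a
standalone loop like the one below. This is only a sketch under assumptions:
it uses SIGALRM from setitimer() as a stand-in for the watchdog's real-time
signal, installs the handler with SA_RESTART, and count_fsync_eintr() is a
hypothetical name. It says nothing about what tdb does internally.

```c
/* Ask for XSI declarations (setitimer) even in strict ISO C mode. */
#define _XOPEN_SOURCE 700

#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static volatile sig_atomic_t ticks;

static void on_tick(int sig)
{
    (void)sig;
    ticks++;
}

/*
 * Hypothetical probe: write and fsync() `rounds` blocks to `path` while a
 * 1 ms periodic SIGALRM is firing. Returns how many fsync() calls failed
 * with EINTR, or -1 on any other error. If `ticks_seen` is not NULL it
 * receives the number of signals the handler observed.
 */
int count_fsync_eintr(const char *path, int rounds, int *ticks_seen)
{
    struct sigaction sa;
    struct itimerval fast = { .it_interval = { 0, 1000 }, .it_value = { 0, 1000 } };
    struct itimerval off = { .it_interval = { 0, 0 }, .it_value = { 0, 0 } };
    char block[4096];
    int fd, i, interrupted = 0;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_tick;
    sa.sa_flags = SA_RESTART; /* assumption: mirrors an SA_RESTART handler */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) == -1) return -1;

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd == -1) return -1;

    memset(block, 'x', sizeof(block));
    ticks = 0;
    if (setitimer(ITIMER_REAL, &fast, NULL) == -1) {
        close(fd);
        return -1;
    }

    for (i = 0; i < rounds; i++) {
        /* An interrupted write is tolerated; any other failure aborts. */
        if (write(fd, block, sizeof(block)) == -1 && errno != EINTR) {
            interrupted = -1;
            break;
        }
        if (fsync(fd) == -1) {
            if (errno != EINTR) {
                interrupted = -1;
                break;
            }
            interrupted++;
        }
    }

    setitimer(ITIMER_REAL, &off, NULL); /* disarm the timer */
    close(fd);
    if (ticks_seen != NULL) *ticks_seen = (int)ticks;
    return interrupted;
}
```

A result of 0 on one filesystem would not prove much about others, so this
is at best a quick sanity check and not a guarantee.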

> 
> Yes interruptible calls can be interrupted by a signal, that's always
> the case, if we have code that misbehaves when a syscall is interrupted
> we need to fix that code.
> 
> Afaik when we write() we always check the return and retry on EINTR.
> 
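For reference, the retry-on-EINTR pattern described here usually looks
roughly like the sketch below. write_all() is a hypothetical helper used for
illustration only; I am not claiming it matches the actual SSSD code.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper, not SSSD's implementation: write all `len` bytes
 * of `buf` to `fd`, retrying when write() is interrupted by a signal and
 * continuing after short writes. Returns `len` on success, -1 on error.
 */
ssize_t write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n == -1) {
            if (errno == EINTR) {
                continue; /* interrupted before any data was written */
            }
            return -1;    /* real error, errno is left for the caller */
        }
        done += (size_t)n; /* short write: loop for the remainder */
    }
    return (ssize_t)done;
}
```

The important part is that both EINTR and short writes are handled, since a
signal can also cut a write() short after part of the data went out.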
> Simo.
> 
> -- 
> Simo Sorce * Red Hat, Inc * New York
> 
