On 22.11.2013 12:49, David Timothy Strauss wrote:
> It is the responsibility of whatever sends the watchdog to ensure
> everything's healthy, however necessary. It would be silly to spawn a
> thread and have it blindly report health to watchdog. The point is for
> that thread to do proper checks but ensure reports go in at the right
> intervals.

I know that, but the *how* is the question.

You can check whatever you like internally, but that does not mean
the service actually responds correctly to a client connection over
the network. You only know that if your check goes through the same
stack, meaning it makes a real network connection.
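
To make that concrete, here is a minimal sketch in Python. It is my assumption of how it could look, not an existing tool. A watchdog thread only sends WATCHDOG=1 when a real TCP connection to the service succeeds and the server sends its greeting, as IMAP, POP3 and SMTP servers do on connect. It speaks the $NOTIFY_SOCKET datagram protocol described in sd_notify(3) directly and pings at half of WATCHDOG_USEC. The helper names sd_notify, service_answers and watchdog_loop are hypothetical.

```python
import os
import socket
import threading


def sd_notify(message: bytes) -> bool:
    """Hand-rolled stand-in for libsystemd's sd_notify(): send one datagram
    to the socket named in $NOTIFY_SOCKET. Returns False if it is unset."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # '@' marks a Linux abstract-namespace socket; the real name starts with NUL.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.connect(path)
        sock.send(message)
    return True


def service_answers(host: str, port: int, timeout: float) -> bool:
    """End-to-end probe: connect like a real client and wait for the
    server's greeting. Any error or timeout counts as 'not healthy'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            return len(conn.recv(512)) > 0
    except OSError:  # refused, reset, or socket timeout
        return False


def watchdog_loop(host: str, port: int, stop: threading.Event) -> None:
    """Send WATCHDOG=1 only while the probe succeeds; runs until `stop` is set."""
    usec = int(os.environ.get("WATCHDOG_USEC", "0"))
    pid = os.environ.get("WATCHDOG_PID")
    if usec <= 0 or (pid and int(pid) != os.getpid()):
        return  # watchdog not enabled for this process
    interval = usec / 1_000_000 / 2  # ping at half of WatchdogSec
    while not stop.is_set():
        if service_answers(host, port, timeout=interval):
            sd_notify(b"WATCHDOG=1")
        # On failure we deliberately skip the ping: once WatchdogSec passes
        # without one, systemd kills the service (and restarts it if the
        # unit's Restart= setting allows).
        stop.wait(interval)
```

It could be started from the service itself, e.g. threading.Thread(target=watchdog_loop, args=("127.0.0.1", 143, threading.Event()), daemon=True).start() with a hypothetical IMAP port 143. The pings then stop as soon as accept() or the greeting path hangs, so the watchdog covers at least that part of the client code path.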

I spent hundreds of hours on upstream debugging of dbmail to find
spinlocks and other problems that only happened in rare situations.
One of them took 16 hours of stress testing before it happened with
debug logging enabled, while on the real server a random client
action triggered it within a few minutes.

That's the difference between theory and a real workload.

Your internal checks are mostly theory, because in case of a bug you
have undefined behavior, and what you want to achieve with the
watchdog is to catch exactly this undefined behavior and restart the
service. When in doubt, an internal check will not work in the rare
cases where the watchdog should restart the service, unless it goes
through the complete code path of a client. In the case of an IMAP
server you can enter the spin loop anywhere from accepting the
connection to listing folders or receiving a message, and it may
depend on a buffer overflow under high concurrency, with different
threads touching each other in unexpected ways.
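
For the IMAP case, here is a sketch of a somewhat deeper probe, assuming a plaintext listener on 127.0.0.1:143. The function name imap_noop_ok is hypothetical. Besides reading the greeting it sends a tagged NOOP, which RFC 3501 allows in any state, and waits for the tagged OK, so command parsing and response writing run too. It still does not reach folder listing or message delivery. That would need a test account and LOGIN/LIST/APPEND, which are deliberately left out here.

```python
import socket


def imap_noop_ok(host: str, port: int = 143, timeout: float = 5.0) -> bool:
    """Probe an IMAP server the way a client would: read the greeting,
    send a tagged NOOP and wait for the tagged OK. Any error, timeout or
    unexpected reply counts as unhealthy."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            with conn.makefile("rb") as reader:
                greeting = reader.readline()
                # RFC 3501 greetings: "* OK", "* PREAUTH" or "* BYE" (refused).
                if not greeting.startswith((b"* OK", b"* PREAUTH")):
                    return False
                conn.sendall(b"a1 NOOP\r\n")
                while True:
                    line = reader.readline()
                    if not line:
                        return False  # server closed the connection
                    if line.startswith(b"a1 "):
                        # Tagged completion: OK means the command path works.
                        return line.startswith(b"a1 OK")
                    # Untagged "* ..." lines (e.g. EXISTS updates) are skipped.
    except OSError:  # refused, reset, or socket timeout
        return False
```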

Been there, and nearly died debugging it and collecting data for upstream.

> On Nov 22, 2013 7:50 PM, "Reindl Harald" <h.rei...@thelounge.net> wrote:
> 
> 
>     On 22.11.2013 03:04, salil GK wrote:
>     > Thanks a lot David
>     >
>     > On 22 November 2013 06:44, David Timothy Strauss <da...@davidstrauss.net> wrote:
>     >
>     >     On Thu, Nov 21, 2013 at 4:57 PM, salil GK <gksa...@gmail.com> wrote:
>     >     > What happens is that my process may be busy with some other
>     >     > activity, during which time it fails to send the periodic
>     >     > message to systemd. After a while it comes out of its loop and
>     >     > is ready to serve, but by then systemd has already marked the
>     >     > process as failed.
>     >
>     >     Then you need to either use another thread, refactor to make a
>     >     tighter event loop, or increase the watchdog time. Drifting in
>     >     and out of tolerance with watchdog is not a safe strategy.
> 
>     The problem I see with "use another thread" is that this thread can
>     happily keep working and sending its keep-alive, but that does not
>     mean the service itself is working OK and responsive, because the two
>     run in isolation.
> 
>     For network services it would be pretty cool if the systemd watchdog
>     could be configured to connect to the service every n seconds and
>     restart it if there is no response, because this would monitor the
>     real service without the need for external tools.

_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel