On Wed, 2007-02-28 at 08:59 +0000, Daniel Schnell wrote:
> Philippe Gerum wrote:
> > 
> > It's not a magic value, it means that up to 32 memory contexts may be
> > stalled in the drop queue at any point in time, waiting for a normal
> > Linux task to be switched in to flush them, while a large number of
> > Xenomai threads is going through frequent domain migrations. Hitting
> > this limit _is_ where the problem lies, not the static limit itself,
> > and it tells us that something needs to be fixed in the application, so
> > that it does not starve the regular Linux activities this way.
> > 
> > IOW, it's ok to raise this compile-time value to account for
> > applications living on the edge, but we should not make a feature out
> > of an application design issue, by adding such a configuration knob.
> 
> 
> I do not understand. In my opinion, no application behaviour should ever
> cause the Kernel to oops. Playing around with Kernel configuration
> values to work around such oopses is where the problem lies, not
> the application itself. I have not fully understood what exactly
> causes the Kernel to oops in this specific case, but a killall of an
> application is similar to e.g. a caught SIGSEGV, which might happen in
> normal development, and it looks very weird if it leads to a Kernel
> crash at all. In my opinion the BUG_ON() call inside sched.c is a bug
> and should be replaced by something that handles the situation more
> gracefully.
> 

In theory, you might be right. In practice, there is no way to be 100%
sure that, in such a situation, the wipe-out would succeed; _that_ is
the problem. We could devise something with a watermark, starting to
spam the syslog with warning messages once some pressure level is
reached, but even then the wipe-out operation would have to follow the
OOM "strategy", i.e. not necessarily killing the real offending
application but any random thread which happens to be registered in the
drop queue, which is terminally useless when debugging.
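
To make the watermark idea concrete, here is a minimal user-space
sketch. It is not the Xenomai code: the names (dropq, dropq_push,
dropq_flush, DROPQ_WATERMARK) and the threshold of 24 are assumptions
made up for illustration; only the capacity of 32 comes from the
discussion above. It merely shows a bounded queue that starts warning
once a pressure level is reached and reports when it is full, which is
the point where the BUG_ON() discussed in this thread fires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define DROPQ_CAPACITY  32  /* the "up to 32" limit discussed above */
#define DROPQ_WATERMARK 24  /* hypothetical warning threshold */

/* Hypothetical stand-in for the queue of stalled memory contexts. */
struct dropq {
    void *slots[DROPQ_CAPACITY];
    size_t count;            /* contexts currently waiting for a flush */
    unsigned long warnings;  /* number of pressure warnings emitted */
};

enum dropq_status { DROPQ_OK, DROPQ_PRESSURE, DROPQ_FULL };

/* Queue one context; warn once the watermark is reached or exceeded. */
static enum dropq_status dropq_push(struct dropq *q, void *ctx)
{
    assert(q->count <= DROPQ_CAPACITY);

    if (q->count == DROPQ_CAPACITY)
        return DROPQ_FULL;  /* nothing safe left to do at this level */

    q->slots[q->count++] = ctx;

    if (q->count >= DROPQ_WATERMARK) {
        q->warnings++;
        fprintf(stderr, "dropq: %zu/%d contexts pending\n",
                q->count, DROPQ_CAPACITY);
        return DROPQ_PRESSURE;
    }
    return DROPQ_OK;
}

/* What a regular Linux task would do when it finally gets switched in. */
static size_t dropq_flush(struct dropq *q)
{
    size_t flushed = q->count;

    q->count = 0;
    return flushed;
}
```

Note that the sketch only reports DROPQ_FULL to the caller; it does not
answer the real question, namely what could be safely done once that
status is returned.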

Again, the situation you encounter is a sign of a patent malfunction
due to a design issue in your application. We could raise the limit so
that, statistically, the issue would not even trigger in your case, just
as it has never triggered for anyone else in the last two years. But for
that particular issue occurring in sched.c, there is nothing 100% safe
and sane we could do to recover from such a situation, so let's not
pretend that a system works just because it does not run into any
BUG_ON(). As it is, your application is living on the edge, and this is
what you need to solve. For that purpose, a BUG_ON() which points you at
the problem immediately during the testing phase is much better than
killing any thread around randomly while in production, just for the
purpose of "being graceful". This would not be graceful at all.

> 
> 
> Best regards,
> 
> Daniel Schnell.
-- 
Philippe.



_______________________________________________
Xenomai-help mailing list
https://mail.gna.org/listinfo/xenomai-help