Re: CVS commit: src/sys/kern

2018-10-04 Thread Emmanuel Dreyfus
On Thu, Oct 04, 2018 at 01:32:33PM +0200, Manuel Bouyer wrote:
> borneo:/home/bouyer#umount /mnt
> umount: /mnt: Resource temporarily unavailable

Oh right, you should have said earlier that you actually observed
a problem. I will back that out.

-- 
Emmanuel Dreyfus
m...@netbsd.org


Re: CVS commit: src/sys/kern

2018-10-04 Thread Manuel Bouyer
On Thu, Oct 04, 2018 at 04:00:12AM +0200, Emmanuel Dreyfus wrote:
> Manuel Bouyer  wrote:
> 
> > You're the only one seeing this problem AFAIK
> 
> Indeed, but I reproduced it on multiple machines. It appears 100% of the
> time on a busy server, and on a test machine I have a script that
> reliabily triggers it within a few minutes. 
> 
> Anyway, I am not sure we should disregard bugs on the basis on how often
> they appear. After all you raised very valid concerns about my changes,
> but we have no case of that problems acctually occuring yeet.

borneo:/home/bouyer#umount /mnt
umount: /mnt: Resource temporarily unavailable

this should have been EBUSY.

> I have
> been running three servers with the patch for a week and nothing wrong
> occured. Perhaps this is just because I do not tried shutting down a
> lot. 
>  
> > Returning EAGAIN without reason is harmfull.
> 
> There is a reason: not doing so causes a deadlock. I agree the timeout

You found a deadlock, this is clear. But returning an error instead of
fixing the deadlock is not a proper way to fix it.

> value may be too short, but there should be an upper limit on how long
> we block any process from trying to enter fstrans_start(). Blocking any
> filesystem activity for seconds does not seems acceptable to me on a
> multiuser machine

Actually is seems this happens on umount, if there is lots of data in
the cache to be flushed to a slow media (like a USB key).

> 
> Perhaps we could timeout, unblock the processes that are waiting for us
> and that we are waiting for, and retry? Something like that?

The best way would be to actually avoid the deadlock. If this is not possible
(I didn't look closely at this specific code path), then sleeping and retrying
is the best option. There are other parts of the kernel doing this already.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--