Re: CVS commit: src/sys/dev/scsipi
On Thu, Apr 19, 2012 at 06:39:29PM -0600, Warner Losh wrote: FreeBSD started out with MPSAFE and then went to NEEDS_GIANT as flags for the drivers. I'm with Matthew: patch the drivers that are broken (or not known to be safe) and then you have a convenient thing to grep for when you want to expand the drivers that are safe. Much better that way, and it turned out to be a big win in FreeBSD when we went from opt-in MPSAFE to out-out... But we're still in the MPSAFE world, not yet at NEEDS_GIANT. Very few drivers are MPSAFE in NetBSD, so almost all of them need to be patched. -- Manuel Bouyer bou...@antioche.eu.org NetBSD: 26 ans d'experience feront toujours la difference --
Re: CVS commit: src/sys/dev/scsipi
On Fri, Apr 20, 2012 at 09:37:00AM +0200, Manuel Bouyer wrote: we've never had autoconfig run with the kernel lock AFAICT, so this assumption has never been true. So this is a bug. The contract was really that spl-locked drivers would continue to work as is when fine-grained locking was introduced. I don't think it's reasonable to require every (but the few already MP-safe) drivers to take KERNEL_LOCK in their attach routine. So autoconf should check and take the kernel lock before calling the attach routine of a non-MPSAFE driver. If it doesn't already do this, it's wrong. Meanwhile, since scsipi is not MPSAFE, whoever is calling into scsipi without taking the kernel lock first is wrong, so the change at the beginning of this thread should be reverted. (Furthermore, scsipi is itself wrong. It is a core component; it should be MPSAFE.) -- David A. Holland dholl...@netbsd.org
Re: CVS commit: src/sys/dev/scsipi
On Fri, Apr 20, 2012 at 07:46:35AM +, David Holland wrote: On Fri, Apr 20, 2012 at 09:37:00AM +0200, Manuel Bouyer wrote: we've never had autoconfig run with the kernel lock AFAICT, so this assumption has never been true. So this is a bug. The contract was really that spl-locked drivers would continue to work as is when fine-grained locking was introduced. I don't think it's reasonable to require every (but the few already MP-safe) drivers to take KERNEL_LOCK in their attach routine. So autoconf should check and take the kernel lock before calling the attach routine of a non-MPSAFE driver. If it doesn't already do this, it's wrong. I agree :) Meanwhile, since scsipi is not MPSAFE, whoever is calling into scsipi without taking the kernel lock first is wrong, so the change at the beginning of this thread should be reverted. I also agree, in light of this discussion (I though that KERNEL_LOCK was not held while cold on purpose, but it seems that the issue is at another level, and more annoying). But I find it acceptable as an interim solution, it at last allows attaching scsi drivers that are probed at boot. (Furthermore, scsipi is itself wrong. It is a core component; it should be MPSAFE.) I also agree. So should be ata, if_ethersubr, etc ... that's a lot of work. -- Manuel Bouyer bou...@antioche.eu.org NetBSD: 26 ans d'experience feront toujours la difference --
Re: CVS commit: src/sys/dev/scsipi
On Fri, Apr 20, 2012 at 06:06:35PM +1000, matthew green wrote: we've never had autoconfig run with the kernel lock AFAICT, so this assumption has never been true. So this is a bug. The contract was really that spl-locked drivers would continue to work as is when fine-grained locking was introduced. there are no actual bugs or races here. this is about cleanup and ensuring code runs how it expects. And that's the point: all code that use spl-style locking expects to be called with KERNEL_LOCK held. And this includes the attach routine. I don't think it's reasonable to require every (but the few already MP-safe) drivers to take KERNEL_LOCK in their attach routine. i'm not saying that they should. i'm saying that scsipi drivers, regardless of whether they are fully MPSAFE or not, should now arrange to have the kernel lock held when calling into scsipi at all times. this really has nothing to do with attach, except that attach is problem to fix currently in some drivers. Almost all scsipi drivers ... It's not only IPL_VM drivers, it'a all driver wich don't register and interrupt handler explicitely marked MPSAFE (for example, with PCI_INTR_MPSAFE for PCI drivers). interesting; i hadn't seen PCI_INTR_MPSAFE before. however, it seems ignored / unused. the x86 pci_intr_setattr() does do something but it's never called. none of the rest do anything. this just means there's no PCI MP-SAFE driver yet ? my point is that drivers get interrupt protection from asking for an IPL_VM interrupt. that's what tells the MD code to make sure the kernel lock is held for these interrupt handlers. Or explicitely declared MPSAFE. at last x86 allows this. these KASSERT()s have been in place for a long time now and, until your change, we've fixed it at the caller level not worked around it at the scsipi level. AFAIK the only drivers that has been changed are ncr53c9x and USB, and they are MP-safe (they already use mutex instead of spl locking). usb is not MP safe in -current, only on the usbmp branch. all those drivers take care of taking the kernel lock when calling into the scsipi code since they almost *always* run when not cold anyway. So things were fixed at the wrong place. Any non-MPSAFE driver is at risk right now i don't understand. eg, the usbmp code never calls into scsipi during autoconfig. usb is special in this regard. these asserts aren't specific to autoconfig, and IIRC, most drivers do not call into scsipi during attach. Most if not all regular scsi drivers do: they do a config_found() in their attach routine to attach the scsibus. ATA does it from a kernel thread. your change can't work with the drivers that are broken without the major regression of holding kernel lock over autoconfig, or, never ever having them be modules or detached/reattached. Maybe non-MPSAFE drivers can't be modules or detached/reattached, indeed. that's not true. eg, radeondrm works just fine being MP safe or not. (it actually is MP safe, though it doesn't announce this to any of the rest of the system currently.) Yes, of course, unless there's a KASSERT, things will run fine most of the time. That's the problem with race conditions: it doesn't reliably fail. that's not what i said. what i said is that it is very trivial to have a non MPSAFE-driver handle being a module, being detached, and being reattached. The discussion is about KERNEL_LOCK, and possible race of running non-MPSAFE code without it. No, it's not specific to scsipi. It happens that scsipi asserts that KERNEL_LOCK is held, but any non-MPSAFE drivers needs to be called with KERNEL_LOCK held. why? we've never done this for several years now and there's no reason why we think there is a problem with it currently. why did you add this KASSERT to scsipi then ? we've actually run the interrupt part of autoconfig with multiple threads since sometime between netbsd 4.0 and 5.0. this doesn't mean the races don't exists, just that we don't run into them most of the time. Maybe that in some common case there's no possible race, but that would be by accident, not by design. as in aside, you should make drvctl -d work for any drivers you maintain. everyone (tm) should. irrespective of modules or otherwise, the end result of it working tends to be bug fixes. It's hard to drvctl -d the root device ... eh, pxe is your friend. :-) I can pxe boot the kernel; having root on NFS is another things. If we want this to work for non-MPSAFE driver, then this has to be fixed in autoconf (if the problem really exists, which I've not looked at yet). i don't believe there is a problem in autoconfig. there is a problem with some (probably most) scsi drivers here, but that is about it. it is true that the contracts have changed, but
Re: CVS commit: src/tests/lib/libc/gen (address alignment)
XXX I would appreciate if someone could test it under a real amd64 host with an up-to-date kernel, so I can reasonably assume that the culprit is Virtual Box and not our amd64 port (my test machine being off line I cannot do it myself). Results from other arches would be a plus too. I just updated one of my machines (running a bulldozer FX-8150) to -current from an hour ago (with the most up-to-date rev of psl.h) and ran atf-run on the libc/gen tests. The machine hung solid at this point: ... t_siginfo (28/32): 8 test cases sigalarm: [1.006950s] Passed. sigbus_adraln: and I am no longer able to telnet into the box from elsewhere. I am not physically at the machine, so I cannot tell if it has dropped into ddb. - | Paul Goyette | PGP Key fingerprint: | E-mail addresses: | | Customer Service | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com| | Network Engineer | 0786 F758 55DE 53BA 7731 | pgoyette at juniper.net | | Kernel Developer | | pgoyette at netbsd.org | -
Re: CVS commit: src/tests/lib/libc/gen (address alignment)
On Fri, 20 Apr 2012, Paul Goyette wrote: The machine did not drop into ddb, it simply rebooted. Unfortunately it did not leave a core dump behind, so I don't have much to look at just yet. When I get home later today, I will try to get more info. BTW, this occurred while running the ATF test from a non-privileged user, so if there's a bug lurking in these recent changes, it could be considered to be a security vulnerability - non-priv user should not be able to crash the box... The machine does not drop into ddb, and it does not panic. It simply reboots. This is not good... :) - | Paul Goyette | PGP Key fingerprint: | E-mail addresses: | | Customer Service | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com| | Network Engineer | 0786 F758 55DE 53BA 7731 | pgoyette at juniper.net | | Kernel Developer | | pgoyette at netbsd.org | -