> > we've never had autoconfig run with the kernel lock AFAICT, so this
> > assumption has never been true.
> 
> So this is a bug. The contract was really that spl-locked drivers would
> continue to work as is when fine-grained locking was introduced.

there are no actual bugs or races here.  this is about cleanup
and ensuring code runs how it expects.

> I don't think it's reasonable to require every (but the few
> already MP-safe) drivers to take KERNEL_LOCK in their attach
> routine.

i'm not saying that they should.  i'm saying that scsipi drivers,
regardless of whether they are fully MPSAFE or not, should now
arrange to have the kernel lock held when calling into scsipi at
all times.  this really has nothing to do with attach, except that
attach is problem to fix currently in some drivers.

> It's not only IPL_VM drivers, it'a all driver wich don't register
> and interrupt handler explicitely marked MPSAFE (for example,
> with PCI_INTR_MPSAFE for PCI drivers).

interesting; i hadn't seen PCI_INTR_MPSAFE before.  however, it
seems ignored / unused.  the x86 pci_intr_setattr() does do
something but it's never called.  none of the rest do anything.

my point is that drivers get interrupt protection from asking for
an IPL_VM interrupt.  that's what tells the MD code to make sure
the kernel lock is held for these interrupt handlers.

> > > > these KASSERT()s have been in place for a long time now and, until
> > > > your change, we've fixed it at the caller level not worked around it
> > > > at the scsipi level.
> > > 
> > > AFAIK the only drivers that has been changed are ncr53c9x and USB, and 
> > > they
> > > are MP-safe (they already use mutex instead of spl locking).
> > 
> > usb is not MP safe in -current, only on the usbmp branch.  all those
> > drivers take care of taking the kernel lock when calling into the
> > scsipi code since they almost *always* run when not cold anyway.
> 
> So things were fixed at the wrong place. Any non-MPSAFE driver is at
> risk right now

i don't understand.  eg, the usbmp code never calls into scsipi
during autoconfig.  these asserts aren't specific to autoconfig,
and IIRC, most drivers do not call into scsipi during attach.

> > > > your change can't work with the drivers that are broken without the
> > > > major regression of holding kernel lock over autoconfig, or, never
> > > > ever having them be modules or detached/reattached.
> > > 
> > > Maybe non-MPSAFE drivers can't be modules or detached/reattached, indeed.
> > 
> > that's not true.  eg, radeondrm works just fine being MP safe or not.
> > (it actually is MP safe, though it doesn't announce this to any of the
> > rest of the system currently.)
> 
> Yes, of course, unless there's a KASSERT, things will run fine most of
> the time. That's the problem with race conditions: it doesn't reliably
> fail. 

that's not what i said.  what i said is that it is very trivial to have
a non MPSAFE-driver handle being a module, being detached, and being
reattached.

> > > But that's a problem with autoconf not dealing with non-MPSAFE drivers,
> > > not the driver themselve (they were working before the MP changes, they
> > > should continue to work as-is). If you want autoconf to not run
> > > under the KERNEL_LOCK when not needed, then is has to be extended so that
> > > MPSAFE drivers can register themselves as MPSAFE, so that non-MPSAFE
> > > drivers are still called with KERNEL_LOCK.
> > 
> > it seems the wrong answer to me.  instead of fixing the one or
> > two drivers you've found problems in, you want to add more
> > infrastructure (IMO clutter) to the system and patch all the
> > drivers that *are* ok.  considering that this problem is
> > very specific to scsipi, i don't think that is a viable
> > solution.
> 
> No, it's not specific to scsipi. It happens that scsipi asserts that
> KERNEL_LOCK is held, but any non-MPSAFE drivers needs to be called with
> KERNEL_LOCK held.

why?  we've never done this for several years now and there's no reason
why we think there is a problem with it currently.  we've actually run
the interrupt part of autoconfig with multiple threads since sometime
between netbsd 4.0 and 5.0.

> > as in aside, you should make "drvctl -d" work for any drivers
> > you maintain.  everyone (tm) should.  irrespective of modules
> > or otherwise, the end result of it working tends to be bug
> > fixes.
> 
> It's hard to drvctl -d the root device ...

eh, pxe is your friend.  :-)  

> > > If we want this to work for non-MPSAFE driver, then this has to be fixed
> > > in autoconf (if the problem really exists, which I've not looked at yet).
> > 
> > i don't believe there is a problem in autoconfig.  there is a
> > problem with some (probably most) scsi drivers here, but that
> > is about it.  it is true that the contracts have changed, but
> > this is progress after all.  the assert exists solely to find
> > places that are wrong and fix them.  your change hides this.
> 
> the contract has changed in a way that exposes all non-MPSAFE driver
> to a race condition. The problem shows up in scsipi, but it's not the
> only place where the problem exists. ata is not MP-safe, so the
> problem is also in IDE and SATA drivers (including some USB drivers).
> mii(4) and ifmedia is not MP-safe, so ethernet drivers have this problem
> too.
> 
> I've not audited the whole code but there's probably other examples.

could be.  we should fix them.  the only way we can find them is
if we add things like KASSERT()s to the relevant places and then go
fix the callers.  that's what i'm asking you to do here.

> > this isn't about SMPifing a whole driver, but adding a call to
> > taken/release the kernel lock around a few (often just one)
> > calls in a driver.  it has never been a difficult change to
> > make.  if you have drivers that needs it, i am happy to send
> > patches for testing.
> 
> *ALL* non-MPSAFE drivers needs it. 

how can all drivers need scsipi specific code?  *most* of the
scsipi drivers do not actually call into scsipi in the attach
routine, from what i recall looking a few months ago.

what i'm asking you for is a list of actual problems you've seen
with the KASSERT() you changed so i can fix those, and revert
your change.  please help me advance the state of netbsd smp.


.mrg.

Reply via email to