When thinking about concurrency, it's important to assume that everything that can go wrong will. If you can't point out how this scenario is _prevented_, then there's not much point trying to duplicate it; we already know it's _possible_.

I've been neglecting upstart, but surely there's a way to force X startup to depend on a DKMS build? You could perhaps define a small task that executes rapidly in the common case (no build required), but blocks X startup until the build is complete.
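To make the idea concrete, here is a minimal Python sketch of the check such a task might run. It is not a tested fix. It assumes the DKMS-built module ends up as a file named nvidia.ko somewhere under /lib/modules/<kernel release>/; the function names, the timeout, and the polling interval are all made up for illustration, and how the task gets wired into upstart so that X actually waits on it is left open.

```python
import time
from pathlib import Path


def module_built(release, name="nvidia.ko", root="/lib/modules"):
    """Return True if a file called `name` exists anywhere under root/release.

    Assumption: the built module is a plain file with this exact name.
    """
    tree = Path(root) / release
    return tree.is_dir() and any(tree.rglob(name))


def wait_for_module(release, name="nvidia.ko", root="/lib/modules",
                    timeout=300.0, interval=1.0):
    """Block until the module exists or `timeout` seconds have passed.

    Returns True right away in the common case (module already built),
    and False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if module_built(release, name, root):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A boot-time task could call wait_for_module(os.uname().release) and let X start only once it returns. The timeout matters: if the build fails outright, the task should give up so that the failsafe path can still kick in rather than hanging the boot.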
Justin

On Fri, Nov 6, 2009 at 6:21 PM, Bryce Harrington <[email protected]> wrote:
> The two worst bugs are fixed, and the other two are at least understood
> now but I could use a bit more advice. It seems there is a weird race
> condition with DKMS/upstart/nvidia which has cropped up due to
> faster boot and looks tricky to get sorted, so feedback from people
> with experience in DKMS/upstart matters would be helpful.
>
> From what I understand, when doing an upgrade it installs both nvidia
> and a new kernel (2.6.31). At that point nvidia.ko is built against the
> *old* kernel (2.6.28). Fine, a nvidia.ko was successfully built so
> installation completes without error. xorg.conf is updated and the
> system is ready to run nvidia. Or so it thinks.
>
> Now the user reboots.
>
> During boot, dpkg notes that it needs to build a new nvidia.ko for
> 2.6.31 and dutifully gets to work. Meanwhile, since X is being started
> early on in the boot cycle, it in fact starts up before dkms has
> finished building the new nvidia.ko. X starts booting nvidia but since
> there is not yet an nvidia.ko for the current kernel it exits with an
> error.
>
> I'm going to see if I can reproduce this synthetically, but meanwhile
> does this theory make sense? If so, is there a dkms/upstart trick we
> could do to work around the issue in Karmic? And for Lucid what would
> the "right" solution be?
>
>
> Further notes on the other nvidia issues below...
>
> On Wed, Nov 04, 2009 at 02:26:56PM -0800, Bryce Harrington wrote:
>> I've been looking into some problems people have been reporting
>> upgrading to Karmic with -nvidia installed.
>>
>> One thing I've noticed is aside from whatever issue is occurring with
>> nvidia, there are bugs elsewhere which are compounding the problems and
>> leading to some poor user experiences.
>> A common scenario occurs if for
>> whatever reason the -nvidia kernel module fails to build in DKMS:
>>
>> 438398 - If DKMS fails to build the kernel module, the package upgrade
>> does not kick out. It shows package upgrade as successful. So this
>> leads directly to...
>
> In reviewing instances of nvidia failures, this particular scenario
> appears to pop up less frequently in practice than I had initially
> assumed, and mostly due to unusual corner cases like not having patch
> installed, upgrading to Karmic directly from Hardy, etc. It seems most
> of these specific issues got fixed during development, just that the
> bug reports didn't get closed. The important point though is that these
> failures ended up worse than they should have been, due to the following
> bugs...
>
>> 451305 - Jockey misses that the driver failed to build, and so is not
>> letting users know about the potential problem. It goes ahead and
>> updates xorg.conf as if the driver was there. X tries to obey the
>> configuration settings, but of course they won't work, so it exits on
>> startup with an error message. *Normally* bulletproof-X would kick in
>> at this point, display the error to the user, and give them some tools
>> to diagnose and/or debug the situation. Unfortunately...
>
> Elsewhere in this thread several fixes/workarounds to this issue were
> identified, which should greatly lessen the severity of these kinds of
> error situations.
>
>> 474806 - The new gdm no longer supports the FailsafeXServer option, so
>> the diagnostic session no longer can be triggered to come up. Instead,
>> gdm tries several times, then gives up, but then...
>
> This is fixed; we now no longer rely on gdm for doing the failsafe but
> instead catch it with a simple upstart job and kick into failsafe-x mode.
> Thanks Steve!
>
>> 441638 - The gdm upstart job notices gdm has failed and so restarts it.
>> X of course continues to fail, gdm tries a few times and continues to
>> fail, repeat ad infinitum, and the user is just left looking at a
>> flashing screen. Ick.
>
> Now that we have an upstart job handling this case, the blinking
> situation will no longer happen. This fix is SRU'd and uploaded to
> ubuntu-proposed, and will go live before long.
>
> Since this particular situation crops up right now mostly with nvidia,
> people installing via the release livecd should be okay - that boots
> with open source drivers, and when they choose to install nvidia it will
> download that and (I assume) also update xorg to the version that
> contains this fix. So by the time they reboot they'll have the fix.
> Steve, can you confirm?
>
>> The above appears to be a pretty common scenario that we're getting a
>> rash of bug reports about. It's hard to be certain because many of the
>> bug reports are only including information about the failed boot, not on
>> the failed build. So I'm not sure if it is just one reason why the
>> build fails, or several. However if we can solve the above bugs it
>> should give much better visibility into things.
>>
>>
>> Btw, workaround for anyone experiencing this issue is to purge your
>> nvidia (and fglrx) packages, remove /etc/X11/xorg.conf, and reinstall
>> nvidia (or fglrx). It appears that in most of the bug reports this gets
>> the system functioning again. Doing a full reinstall of Ubuntu rather
>> than an upgrade also appears to work around the issues.
>
> It looks like simply doing a dpkg-reconfigure on the nvidia package is
> sufficient to work around the issue, no need for reinstalling it
> (although that'll work too).
>
> Bryce

--
Ubuntu-x mailing list
[email protected]
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-x
