Bug#889669: nvidia-graphics-drivers: solve the upgrade problem
I've got an answer from NVIDIA:

"Our driver design, based on earlier assumptions according to use/deployment cases at the time, packages all components together to ensure integrity is retained as components evolve over the course of driver development. We are investigating the ability to enable modest compatibility across versions, but the time horizon and breadth of that compatibility are not known at this time. We are also looking at how to improve the interoperability of CUDA calls between driver versions - but again, this is a long-term effort. One suggestion for the near-term was to install in such a way that updated driver files are latched on next boot so that kernel- and user- components can be changed on the file system in lock-step."

From my point of view this is pretty much the answer I expected. They are committed to investigating a solution, but IMHO this doesn't necessarily mean that there will be one. Even if there is, we don't know how long it will take NVIDIA to implement it or whether their solution will be feasible for us. For instance, if they only promise compatibility across minor driver version updates, then major driver version updates would still be problematic for us.

That brings me to the question of what is feasible on the Debian side without making it even more of a nightmare than it already is...

If the discussion involves a lot of back and forth over options, though, this bug may not be the best place for it, and an online document (Google Docs or similar) might work better.

Thoughts?
Bug#889669: nvidia-graphics-drivers: solve the upgrade problem
On 2018-03-30 20:02, Luca Boccassi wrote:
> On Mon, 2018-03-26 at 18:45 +0200, Philipp Kern wrote:
> > I would like to understand better what the current set of packages helps with, though. It is true that I hadn't considered that you are shipping so many packages right now. However, you seem to also hardcode the dependencies between them with a lot of substvars in the packaging, which is understandable given the non-free nature of them. But at the same time it makes it more muddy as to what problem that solves.
>
> Well that's the Debian policy - one shared library per package, that's what we follow.

While this is technically true, they are also far from regular shared library packages. People generally don't link against these shared libraries. Files are not installed into the regular directories. Most of the time newer libraries are not actually co-installable. The installed file doesn't necessarily follow the SONAME. (I only spot checked as I have spotty connectivity right now.)

This is not about "you're doing it wrong" or anything. Instead these are just awkward binary blobs that I think can be treated differently from usual shared libraries if needed, especially if the binaries provided by NVidia don't give you the advantages of split packaging.

I'll try to come up with a longer answer to the remaining bits. I suppose we should play this through as an example with the current packaging and then check what's acceptable and what's not acceptable.

> Yes, the legacy drivers (340xx and 304xx at the moment, although the latter is out of support so I guess we'll drop it in buster) are co-installable. There are update-alternatives for those too. We have a script to make it easier to manage those and the glx provider (mesa, fglrx, nvidia), it's update-glx from the update-glx package.
>
> You can find the scripts and configs in the git repo:
> https://salsa.debian.org/nvidia-team/glx-alternatives

This means that users are expected to call update-glx on bootup if the driver in the installation doesn't match the installed hardware, right?

My hope would be that if we get it to work consistently for minor revisions, we can support legacy drivers with the same mechanism: when a legacy module is loadable, we make sure that the GLX bits point to the correct library version for the card installed. I know that in regular desktop systems card architecture changes are rare and users expect to tend to the machine manually in this case. However, in the case of bigger pool setups and imaging, modern Linux and X.org just work, except for the NVidia bits.

Kind regards and thanks a lot for your responses!
Philipp Kern
Bug#889669: nvidia-graphics-drivers: solve the upgrade problem
On Mon, 2018-03-26 at 18:45 +0200, Philipp Kern wrote:
> Hi Luca,
>
> On 3/21/18 2:01 PM, Luca Boccassi wrote:
> > Isn't this sort-of-like what Ubuntu does? IIRC they lump together everything into a single package unlike we do, and they are named after the major revision.
>
> let's say that even Ubuntu have not solved this problem because they don't consider the minor revision either, as suggested here.
>
> I would like to understand better what the current set of packages helps with, though. It is true that I hadn't considered that you are shipping so many packages right now. However, you seem to also hardcode the dependencies between them with a lot of substvars in the packaging, which is understandable given the non-free nature of them. But at the same time it makes it more muddy as to what problem that solves.

Well that's the Debian policy - one shared library per package, that's what we follow.

> At the same time I also did not consider libglvnd - I unfortunately was not aware of it. That at least in theory seems to be a nice way forward to just co-install multiple implementations. Is anyone other than NVidia supporting it at this point? But anyhow we'll live with the two options here if one of them is a regression, either in bugs or features, which seems to be the case here. Given that the two are not co-installable today anyway, collating the two options into two separate packages would work. But for that suggestion to make any sense I'd like to understand the current packaging first - as per the above.

Mesa does support glvnd, and ships with it in sid/buster. One day I'd like to drop the non-glvnd one, but it would need a solution for switchable graphics first (hopefully server-side glvnd in Xorg 1.20 will help with that, but I can't say I have looked into it yet).

> The key idea is that the packages install their binaries into paths versioned with both major and minor revision and do not change while the machine is booted. Then we would need to juggle around some symlinks based off the module version exposed in sysfs on boot. The constraints here are doing that after the module is loaded and /usr is made available and before X(/Wayland?) starts. It does seem a little messy with systemd, that's true. We'd likely end up needing this to be included in basic.target. With sysvinit rcS would work. If the nvidia module is included in the initrd for KMS - which I think is the case? - udev wouldn't work as easily, just in addition. So I suppose it'd need one script that puts the symlink farm into the right state and then we need to sprinkle some hooks into the right places depending on when the module is loaded.

The modules are not in the initrd (weirdly, I thought they would be), at least on my desktop:

    $ lsinitramfs /boot/initrd.img-4.9.0-6-amd64 | grep nvidia
    lib/modules/4.9.0-6-amd64/kernel/drivers/net/ethernet/nvidia
    lib/modules/4.9.0-6-amd64/kernel/drivers/net/ethernet/nvidia/forcedeth.ko
    etc/modprobe.d/nvidia-kernel-common.conf
    etc/modprobe.d/nvidia.conf
    etc/modprobe.d/nvidia-blacklists-nouveau.conf
    etc/nvidia
    etc/nvidia/current
    etc/nvidia/current/nvidia-modprobe.conf
    etc/nvidia/current/nvidia-blacklists-nouveau.conf

> What kind of alternatives do we need to offer at this point? Mesa and NVidia? Can legacy drivers be co-installable? I'd intuitively prefer to have glvnd/non-glvnd be two non-co-installable packages. It'd be great if legacy drivers could be co-installable and then the right driver would be loaded, which is theoretically feasible. And Mesa needs to be co-installable. So it'd be nice if this would really boil down to just Mesa vs. NVidia on an alternatives level, unless I miss something.
>
> Kind regards and thanks for all your replies so far!
> Philipp Kern

Yes, the legacy drivers (340xx and 304xx at the moment, although the latter is out of support so I guess we'll drop it in buster) are co-installable. There are update-alternatives for those too. We have a script to make it easier to manage those and the glx provider (mesa, fglrx, nvidia), it's update-glx from the update-glx package.

You can find the scripts and configs in the git repo:
https://salsa.debian.org/nvidia-team/glx-alternatives

-- 
Kind regards,
Luca Boccassi
Bug#889669: nvidia-graphics-drivers: solve the upgrade problem
On Fri, 2018-03-23 at 14:54 +, Michael Schaller wrote:
> > I see. Perhaps a systemd unit with the appropriate precedences set so that it runs before X starts? And $something-$something for Sys-V I guess :-)
>
> The more I think about it maybe this shouldn't be handled by a service at boot but rather by udev. What do you think?

That would make it independent from the init system, so I don't see why not. I haven't really played with udev so I can't propose a solution, but I can help test one.

> > Was worth a try :-) I must admit that lately, looking at the AMD camp with their in-tree kernel drivers and first-class support for Mesa for userspace, I am green with envy (ha!)
>
> :-D

-- 
Kind regards,
Luca Boccassi
Bug#889669: nvidia-graphics-drivers: solve the upgrade problem
Hi Luca,

On 3/21/18 2:01 PM, Luca Boccassi wrote:
> Isn't this sort-of-like what Ubuntu does? IIRC they lump together everything into a single package unlike we do, and they are named after the major revision.

let's say that even Ubuntu have not solved this problem because they don't consider the minor revision either, as suggested here.

I would like to understand better what the current set of packages helps with, though. It is true that I hadn't considered that you are shipping so many packages right now. However, you seem to also hardcode the dependencies between them with a lot of substvars in the packaging, which is understandable given the non-free nature of them. But at the same time it makes it more muddy as to what problem that solves.

At the same time I also did not consider libglvnd - I unfortunately was not aware of it. That at least in theory seems to be a nice way forward to just co-install multiple implementations. Is anyone other than NVidia supporting it at this point? But anyhow we'll live with the two options here if one of them is a regression, either in bugs or features, which seems to be the case here. Given that the two are not co-installable today anyway, collating the two options into two separate packages would work. But for that suggestion to make any sense I'd like to understand the current packaging first - as per the above.

The key idea is that the packages install their binaries into paths versioned with both major and minor revision and do not change while the machine is booted. Then we would need to juggle around some symlinks based off the module version exposed in sysfs on boot. The constraints here are doing that after the module is loaded and /usr is made available and before X(/Wayland?) starts. It does seem a little messy with systemd, that's true. We'd likely end up needing this to be included in basic.target. With sysvinit rcS would work. If the nvidia module is included in the initrd for KMS - which I think is the case? - udev wouldn't work as easily, just in addition. So I suppose it'd need one script that puts the symlink farm into the right state and then we need to sprinkle some hooks into the right places depending on when the module is loaded.

What kind of alternatives do we need to offer at this point? Mesa and NVidia? Can legacy drivers be co-installable? I'd intuitively prefer to have glvnd/non-glvnd be two non-co-installable packages. It'd be great if legacy drivers could be co-installable and then the right driver would be loaded, which is theoretically feasible. And Mesa needs to be co-installable. So it'd be nice if this would really boil down to just Mesa vs. NVidia on an alternatives level, unless I miss something.

Kind regards and thanks for all your replies so far!
Philipp Kern
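The "one script that puts the symlink farm into the right state" could look roughly like the sketch below. It is only an illustration under assumptions that are not part of the thread: the user-space files for each driver version are assumed to live in one directory per version under a common root, and a single symlink (here called `current`, a hypothetical name) selects the active one. The real packaging uses update-alternatives and many more paths. The point shown is that the link is swapped atomically, so no process ever observes a missing link.

```python
import os
from pathlib import Path


def point_current_at(root: Path, version: str, link_name: str = "current") -> Path:
    """Atomically make root/link_name a symlink to the directory root/version.

    The layout (one directory per driver version under `root`) and the link
    name are hypothetical, chosen only for this sketch.
    """
    target = root / version
    if not target.is_dir():
        raise FileNotFoundError(f"no user-space files installed for {version}")

    link = root / link_name
    tmp = root / f".{link_name}.tmp"
    # Clear a stale temporary link left behind by an interrupted run.
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()

    # Create the new link under a temporary name, then rename it over the old
    # one. rename(2) replaces the destination in a single step.
    os.symlink(version, tmp)  # relative target, so the tree stays relocatable
    os.replace(tmp, link)
    return link
```

Called once per boot with the version of the loaded module, this would leave already running processes untouched and only affect libraries resolved afterwards.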
Bug#889669: nvidia-graphics-drivers: solve the upgrade problem
On Thu, 2018-03-22 at 14:36 +, Michael Schaller wrote:
> > How would the switch-at-boot mechanism work?
>
> The basic idea for the switch-at-boot mechanism is that it would check the version of the loaded NVIDIA kernel module (/sys/module/nvidia/version) on boot and then select the matching user space version (via update-alternatives) before anything attempts to use it.

I see. Perhaps a systemd unit with the appropriate precedences set so that it runs before X starts? And $something-$something for Sys-V I guess :-)

> > Seeing your email address domain - any chance your company could use its gargantuan soft-power to get Nvidia to publish the specs for the missing parts of Nouveau (reclocking, power management, etc)? That would solve all our problems once and for all :-P
>
> I wish, but that sounds like deep lawyer cat territory and I very much prefer to work on a technical solution. ;-)

Was worth a try :-) I must admit that lately, looking at the AMD camp with their in-tree kernel drivers and first-class support for Mesa for userspace, I am green with envy (ha!)

> I've just asked though if the version lock between the NVIDIA kernel modules and user-space components really needs to be so strict. Let's see how that goes...

It would be good to know, but we'd need strong guarantees - otherwise it's nasty regressions waiting to happen, given the very minimal debug-ability.

-- 
Kind regards,
Luca Boccassi
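As a rough illustration of the check described above, the following sketch reads the version of the loaded module and looks for matching user-space files. Only the sysfs path /sys/module/nvidia/version comes from the thread. The function names and the one-directory-per-version layout are assumptions made for the example, and the actual selection would go through update-alternatives rather than a directory lookup.

```python
from pathlib import Path
from typing import Optional


def read_loaded_version(version_file: Path) -> Optional[str]:
    """Return the version string of the loaded module, or None if not loaded.

    On a real system `version_file` would be /sys/module/nvidia/version.
    """
    try:
        return version_file.read_text().strip() or None
    except FileNotFoundError:
        # The file only exists while the module is loaded.
        return None


def matching_userspace_dir(version: str, root: Path) -> Optional[Path]:
    """Return root/<version> if user-space files for that version exist.

    The per-version directory layout is hypothetical.
    """
    candidate = root / version
    return candidate if candidate.is_dir() else None
```

A boot-time hook would call `read_loaded_version(Path("/sys/module/nvidia/version"))` and, if a matching directory is found, switch the alternatives to it before the display server starts.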
Bug#889669: nvidia-graphics-drivers: solve the upgrade problem
On 22.03.2018 15:43, Andreas Beckmann wrote:
> On 2018-03-22 15:36, Michael Schaller wrote:
>>> We should probably postpone this to post-390.xx if nvidia sticks to their plan to drop i386 driver support ...
>> That's the first time I've heard about that. Do you have further information about that (link is fine). I also wonder how that will impact projects that depend on i386 support like for an instance Wine.
> http://nvidia.custhelp.com/app/answers/detail/a_id/4604/

Apart from the page not being really informative (not your fault!), I'd expect that they at least ship the 32-bit libGL libraries. That they get rid of the i386 driver support isn't really surprising as long as they at least keep compatibility with 32-bit binaries. Now of course that page does not say that, but alas it's also a topic different from the one described in the bug. =)

Kind regards
Philipp Kern
Bug#889669: nvidia-graphics-drivers: solve the upgrade problem
On 2018-03-22 15:36, Michael Schaller wrote:
>> We should probably postpone this to post-390.xx if nvidia sticks to their plan to drop i386 driver support ...
> That's the first time I've heard about that. Do you have further information about that (a link is fine)? I also wonder how that will impact projects that depend on i386 support, like for instance Wine.

http://nvidia.custhelp.com/app/answers/detail/a_id/4604/

Andreas
Bug#889669: nvidia-graphics-drivers: solve the upgrade problem
We should probably postpone this to post-390.xx if nvidia sticks to their plan to drop i386 driver support ... that would remove a lot of complexity, since we probably don't need proper multiarch support any more ...

Andreas
Bug#889669: nvidia-graphics-drivers: solve the upgrade problem
Control: severity -1 normal

On Wed, 2018-03-21 at 08:17 +, Michael Schaller wrote:
> Please reconsider that this is merely an annoyance and that this is a wishlist item.
> If a NVIDIA driver security update is pushed and security updates are installed unattendedly then all NVIDIA user space components will stop working immediately after the respective package updates as the loaded kernel module and the user space components have a version mismatch.
> The consequences are not immediately visible to the user as NVIDIA components in memory are still properly matched and hence still work. The real issue is with new processes, as for instance no OpenGL applications or CUDA workloads can be launched anymore. This is especially severe for CUDA server farms as they currently can't enable unattended security updates unless they specifically exclude NVIDIA driver updates.

That's fine, I didn't grok that you had large installations where this was causing issues already. Personally I'm fine with talking about possible solutions.

Seeing your email address domain - any chance your company could use its gargantuan soft-power to get Nvidia to publish the specs for the missing parts of Nouveau (reclocking, power management, etc)? That would solve all our problems once and for all :-P
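The failure mode described in this message (loaded kernel module and on-disk user-space libraries at different versions) can be detected with a small check along the following lines. This is a sketch under assumptions: it relies on the driver libraries carrying the full driver version as their file name suffix (for example `libnvidia-glcore.so.390.48`), and the function names are made up for the example.

```python
import re
from typing import Iterable, Optional, Set

# Matches a trailing dotted version with at least two components, so plain
# SONAME links such as "libGL.so.1" are ignored.
_VERSIONED_LIB = re.compile(r"\.so\.(\d+(?:\.\d+)+)$")


def userspace_versions(filenames: Iterable[str]) -> Set[str]:
    """Collect the driver versions encoded in library file names."""
    versions = set()
    for name in filenames:
        match = _VERSIONED_LIB.search(name)
        if match:
            versions.add(match.group(1))
    return versions


def has_version_mismatch(loaded: Optional[str], filenames: Iterable[str]) -> bool:
    """True if a module is loaded and no installed library matches its version."""
    if loaded is None:
        return False  # nothing loaded, so nothing to mismatch
    installed = userspace_versions(filenames)
    return bool(installed) and loaded not in installed
```

With co-installable versioned packages, as proposed in this bug, the libraries for the running module would stay on disk after an upgrade and the check would keep returning False until the next reboot.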
Bug#889669: nvidia-graphics-drivers: solve the upgrade problem
On Wed, 2018-03-21 at 08:56 +0100, Philipp Kern wrote:
> On 03/20/2018 10:59 PM, Luca Boccassi wrote:
> > The problems I see are that it would make an already quite complex packaging system, over which we have very little control (most of it is binary blobs) even more complicated. We already have 2 layers of update-alternatives (mesa vs nvidia and then current vs legacy).
> >
> > It would also mean we have to start maintaining multiple versions at the same time - again being all binary blobs, which will multiply the source of problems. Basically, it would mean that instead of having current vs legacy340xx (up until a few months ago also legacy304xx), every single driver update would have to be maintained separately.
>
> I don't propose this as the solution, though. I think that'd indeed be infeasible. What I'm saying is that the *binary* packages are versioned like this, not the source packages. It's like the kernel in a way, where every ABI version gets its own binary package name. Although in Debian the hesitance to change the ABI is much higher than in Ubuntu, for reasons that I assume have to do with the NEW queue. Cleaning up older versions is something we'd find a solution for, just like people clean up their old kernels.
>
> So please separate out maintenance from the proposal. ;-)

Ah I see - one issue I can foresee is that it's binary blobs all the way down - so there's really no way to know that libnvidia-foo from version 1.1 can work with libnvidia-bar from version 2.2. So all the packages would have to be versioned.

Isn't this sort-of-like what Ubuntu does? IIRC they lump together everything into a single package unlike we do, and they are named after the major revision.

How would the switch-at-boot mechanism work?

> I get it with the two layers of alternatives. Is the reason for mesa vs. nvidia because we don't put Nvidia into the library search path first and need to deal with the corresponding file conflicts in a sane way? Or because we want to keep co-installability between mesa and nvidia?

co-installability - it used to be that each vendor had its own version of libGL, and they were all incompatible with each other. With libglvnd this is changing - but sadly we need to keep shipping the non-glvnd versions as there are often regressions (and some use cases don't work with the glvnd versions yet, like switchable graphics on laptops). So in reality what glvnd is doing for us right now is multiplying the maintenance effort rather than reducing it. But I digress...

> > In the end the problem is an annoyance but not a deal breaker - updates can be scheduled and delayed (unlike some other OSes...), and on top of that, version bumps are not that common - at most once a month, and only for those running unstable or testing - in stable we just ship LTS versions.
>
> Actually it's a real deal breaker in mass deployments. If your users are hesitant to do reboots because it resets their work environment, you really need to detach nvidia updates from the rest of the package updates, which means having a custom-built solution to do that. That has turned out to be brittle, as it turns out that you end up installing pre-downloaded modules at boot, blocking it for about ten minutes. (It has gotten better with SSDs, but still.)
>
> Even if you just ship LTS versions there are sometimes updates needed, be it for Meltdown/Spectre or new hardware. In our case we actually do use testing, but even then we had the need to push updates to drivers. I think a setup that separates out binaries for every version that allows for consistent rollbacks[1] and rollforwards would be beneficial not just for us but also for the whole userbase of Debian.
>
> We'd be willing to invest some time into a solution - as our own to work around the flaws in the packaging has turned out to be a maintenance headache. But that only works if we at least agree on a plan. I'm also happy to clarify more that I probably missed in the proposal. :)
>
> Kind regards and thanks
> Philipp Kern
>
> [1] We had a bunch of regressions with newer drivers in the past that made them dead on arrival, like missing repaints in terminals for a fraction of the cards.

Ok so now I understand you have some large deployments where this is an actual issue - I didn't get it immediately, sorry. I'm up for talking about proposals - Andreas, what do you think?

-- 
Kind regards,
Luca Boccassi
Bug#889669: nvidia-graphics-drivers: solve the upgrade problem
I cannot remember any bug reports regarding this upgrade problem before yours ...

-ENOTMUCHTIMETHISWEEK :-(

Andreas
Bug#889669: nvidia-graphics-drivers: solve the upgrade problem
On 03/20/2018 10:59 PM, Luca Boccassi wrote:
> The problems I see are that it would make an already quite complex packaging system, over which we have very little control (most of it is binary blobs) even more complicated. We already have 2 layers of update-alternatives (mesa vs nvidia and then current vs legacy).
>
> It would also mean we have to start maintaining multiple versions at the same time - again being all binary blobs, which will multiply the source of problems. Basically, it would mean that instead of having current vs legacy340xx (up until a few months ago also legacy304xx), every single driver update would have to be maintained separately.

I don't propose this as the solution, though. I think that'd indeed be infeasible. What I'm saying is that the *binary* packages are versioned like this, not the source packages. It's like the kernel in a way, where every ABI version gets its own binary package name. Although in Debian the hesitance to change the ABI is much higher than in Ubuntu, for reasons that I assume have to do with the NEW queue. Cleaning up older versions is something we'd find a solution for, just like people clean up their old kernels.

So please separate out maintenance from the proposal. ;-)

I get it with the two layers of alternatives. Is the reason for mesa vs. nvidia because we don't put Nvidia into the library search path first and need to deal with the corresponding file conflicts in a sane way? Or because we want to keep co-installability between mesa and nvidia?

> In the end the problem is an annoyance but not a deal breaker - updates can be scheduled and delayed (unlike some other OSes...), and on top of that, version bumps are not that common - at most once a month, and only for those running unstable or testing - in stable we just ship LTS versions.

Actually it's a real deal breaker in mass deployments. If your users are hesitant to do reboots because it resets their work environment, you really need to detach nvidia updates from the rest of the package updates, which means having a custom-built solution to do that. That has turned out to be brittle, as you end up installing pre-downloaded modules at boot, blocking it for about ten minutes. (It has gotten better with SSDs, but still.)

Even if you just ship LTS versions there are sometimes updates needed, be it for Meltdown/Spectre or new hardware. In our case we actually do use testing, but even then we had the need to push updates to drivers. I think a setup that separates out binaries for every version and allows for consistent rollbacks[1] and rollforwards would be beneficial not just for us but also for the whole userbase of Debian.

We'd be willing to invest some time into a solution - as our own workaround for the flaws in the packaging has turned out to be a maintenance headache. But that only works if we at least agree on a plan. I'm also happy to clarify anything that I probably missed in the proposal. :)

Kind regards and thanks
Philipp Kern

[1] We had a bunch of regressions with newer drivers in the past that made them dead on arrival, like missing repaints in terminals for a fraction of the cards.
Bug#889669: nvidia-graphics-drivers: solve the upgrade problem
Control: severity -1 wishlist

On Tue, 2018-03-20 at 21:22 +0100, Philipp Kern wrote:
> Hi,
>
> On 2/5/18 4:26 PM, Philipp Kern wrote:
> > Since forever users of NVIDIA on Debian accepted that package upgrades
> > break newly spawned binaries because the interface between the client
> > library and the kernel driver is strictly versioned. The kernel module
> > will emit an API mismatch error into the kernel log and GLX requests
> > will fail. A reboot is required to remediate this situation.
> >
> > I would propose the following model:
> >
> > * All binary packages that require strict versioning with NVRM are
> > shipped in versioned packages. This means that the library package names
> > reflect both major and minor version (= the version on which the driver
> > checks) of the driver. The resulting packages should be co-installable
> > with each other.
> > * An script modifies the symlink for the currently active libraries to
> > point to the version of the currently loaded nvidia module (as fetched
> > from sysfs's /sys/module/nvidia/version). This script is called on
> > installation but more crucially on every boot. This will tie the
> > libraries to the module loaded at boot-up.
> > * The kernel module itself does not have to be versioned. The kernel
> > module can be upgraded and it will end up in the initrd automatically.
> >
> > Assuming that we have a metapackage that pulls in the most recent driver
> > (like linux-image does), this model would allow to upgrade the driver at
> > any point in time and only make it live with the next reboot. This
> > allows applications to continue to function.
> >
> > This approach has the drawback that every update from NVIDIA needs to go
> > through NEW. However I think this is just a theoretical disadvantage at
> > this point as NEW latency for ABI version changes has decreased a lot.
> >
> > The thing I'm not sure about is how this proposal interacts with the
> > legacy modules. I suppose they can all use the same mechanism but the
> > script would need to be aware what library stack needs to be chosen. The
> > NVIDIA kernel shim already checks using rm_is_supported_device if the
> > currently installed device is supported. That together with modalias
> > should supposedly already load the correct module and then the script
> > could just check which of the modules (if legacy or the normal one) is
> > loaded and act accordingly.
> >
> > Do you think this would be workable? The NVIDIA packaging is quite a
> > beast to handle, I know (and I'm very grateful for your work!). So we
> > should have some consensus if this is something you'd be interested in. :)
>
> is there something I could help with to get to a consensus here?
> Anything? :)
>
> (After just having had this again that I needed to reboot when all I
> wanted was getting the i386 driver.)
>
> Kind regards and thanks
> Philipp Kern

Hi,

Thanks for your proposal, I understand the need to reboot is an
annoyance.

The problems I see are that it would make an already quite complex
packaging system, over which we have very little control (most of it is
binary blobs), even more complicated. We already have 2 layers of
update-alternatives (mesa vs nvidia and then current vs legacy).

It would also mean we have to start maintaining multiple versions at
the same time - again being all binary blobs, which will multiply the
sources of problems. Basically, it would mean that instead of having
current vs legacy340xx (up until a few months ago also legacy304xx),
every single driver update would have to be maintained separately.

In the end the problem is an annoyance but not a deal breaker - updates
can be scheduled and delayed (unlike some other OSes...), and on top of
that, version bumps are not that common - at most once a month, and
only for those running unstable or testing - in stable we just ship LTS
versions.

Sorry, my personal opinion is that I'm just not sure it would be really
worth the additional time and hassle :-/

--
Kind regards,
Luca Boccassi
Bug#889669: nvidia-graphics-drivers: solve the upgrade problem
Hi,

On 2/5/18 4:26 PM, Philipp Kern wrote:
> Since forever users of NVIDIA on Debian accepted that package upgrades
> break newly spawned binaries because the interface between the client
> library and the kernel driver is strictly versioned. The kernel module
> will emit an API mismatch error into the kernel log and GLX requests
> will fail. A reboot is required to remediate this situation.
>
> I would propose the following model:
>
> * All binary packages that require strict versioning with NVRM are
> shipped in versioned packages. This means that the library package names
> reflect both major and minor version (= the version on which the driver
> checks) of the driver. The resulting packages should be co-installable
> with each other.
> * An script modifies the symlink for the currently active libraries to
> point to the version of the currently loaded nvidia module (as fetched
> from sysfs's /sys/module/nvidia/version). This script is called on
> installation but more crucially on every boot. This will tie the
> libraries to the module loaded at boot-up.
> * The kernel module itself does not have to be versioned. The kernel
> module can be upgraded and it will end up in the initrd automatically.
>
> Assuming that we have a metapackage that pulls in the most recent driver
> (like linux-image does), this model would allow to upgrade the driver at
> any point in time and only make it live with the next reboot. This
> allows applications to continue to function.
>
> This approach has the drawback that every update from NVIDIA needs to go
> through NEW. However I think this is just a theoretical disadvantage at
> this point as NEW latency for ABI version changes has decreased a lot.
>
> The thing I'm not sure about is how this proposal interacts with the
> legacy modules. I suppose they can all use the same mechanism but the
> script would need to be aware what library stack needs to be chosen. The
> NVIDIA kernel shim already checks using rm_is_supported_device if the
> currently installed device is supported. That together with modalias
> should supposedly already load the correct module and then the script
> could just check which of the modules (if legacy or the normal one) is
> loaded and act accordingly.
>
> Do you think this would be workable? The NVIDIA packaging is quite a
> beast to handle, I know (and I'm very grateful for your work!). So we
> should have some consensus if this is something you'd be interested in. :)

is there something I could help with to get to a consensus here?
Anything? :)

(After just having had this again that I needed to reboot when all I
wanted was getting the i386 driver.)

Kind regards and thanks
Philipp Kern
Bug#889669: nvidia-graphics-drivers: solve the upgrade problem
Source: nvidia-graphics-drivers

Since forever, users of NVIDIA on Debian have accepted that package
upgrades break newly spawned binaries because the interface between the
client library and the kernel driver is strictly versioned. The kernel
module will emit an API mismatch error into the kernel log and GLX
requests will fail. A reboot is required to remediate this situation.

I would propose the following model:

* All binary packages that require strict versioning with NVRM are
shipped in versioned packages. This means that the library package names
reflect both major and minor version (= the version on which the driver
checks) of the driver. The resulting packages should be co-installable
with each other.
* A script modifies the symlink for the currently active libraries to
point to the version of the currently loaded nvidia module (as fetched
from sysfs's /sys/module/nvidia/version). This script is called on
installation but more crucially on every boot. This will tie the
libraries to the module loaded at boot-up.
* The kernel module itself does not have to be versioned. The kernel
module can be upgraded and it will end up in the initrd automatically.

Assuming that we have a metapackage that pulls in the most recent driver
(like linux-image does), this model would allow upgrading the driver at
any point in time and only make it live with the next reboot. This
allows applications to continue to function.

This approach has the drawback that every update from NVIDIA needs to go
through NEW. However, I think this is just a theoretical disadvantage at
this point as NEW latency for ABI version changes has decreased a lot.

The thing I'm not sure about is how this proposal interacts with the
legacy modules. I suppose they can all use the same mechanism but the
script would need to be aware of which library stack needs to be chosen.
The NVIDIA kernel shim already checks using rm_is_supported_device if
the currently installed device is supported. That together with modalias
should supposedly already load the correct module and then the script
could just check which of the modules (if legacy or the normal one) is
loaded and act accordingly.

Do you think this would be workable? The NVIDIA packaging is quite a
beast to handle, I know (and I'm very grateful for your work!). So we
should have some consensus if this is something you'd be interested in. :)

Kind regards and thanks
Philipp Kern
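
For illustration only, the symlink-switching script from the proposal
could look roughly like the sketch below. It is not part of the existing
packaging. The one-directory-per-driver-version layout, the "current"
link name and the legacy module name nvidia_legacy_340xx are assumptions
made up for this example; only the sysfs version file for the nvidia
module comes from the proposal itself.

```python
import os
from pathlib import Path

# Assumed sysfs module names. "nvidia" is from the proposal; the legacy
# name is a guess used only for illustration.
CANDIDATE_MODULES = ("nvidia", "nvidia_legacy_340xx")


def loaded_driver_version(sysfs_root="/sys/module", modules=CANDIDATE_MODULES):
    """Return (module, version) for the first loaded candidate, else None."""
    for name in modules:
        version_file = Path(sysfs_root) / name / "version"
        if version_file.is_file():
            return name, version_file.read_text().strip()
    return None


def activate(version, lib_root, link_name="current"):
    """Point lib_root/link_name at lib_root/version.

    lib_root is a hypothetical directory holding one subdirectory per
    co-installed driver version.
    """
    lib_root = Path(lib_root)
    if not (lib_root / version).is_dir():
        raise FileNotFoundError(f"no libraries installed for driver {version}")
    link = lib_root / link_name
    tmp = lib_root / f".{link_name}.tmp"
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    # Create the new link under a temporary name, then rename it over the
    # old one so there is no moment at which the link is missing.
    os.symlink(version, tmp)
    os.replace(tmp, link)
    return link
```

Run at boot and from the maintainer scripts, this would call
loaded_driver_version() and, if a module is loaded, pass the returned
version to activate() together with whatever library directory the
packaging settles on. Newly installed library versions then stay inactive
until a reboot loads the matching module.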