Bug#889669: nvidia-graphics-drivers: solve the upgrade problem

2018-05-29 Thread Michael Schaller
I've got an answer from NVIDIA:
"Our driver design, based on earlier assumptions according to
use/deployment cases at the time, packages all components together to
ensure integrity is retained as components evolve over the course of
driver development.
We are investigating the ability to enable modest compatibility across
versions, but the time horizon and breadth of that compatibility are
not known at this time.
We are also looking at how to improve the interoperability of CUDA
calls between driver versions—but again, this is a long-term effort.
One suggestion for the near-term was to install in such a way that
updated driver files are latched on next boot so that kernel- and
user- components can be changed on the file system in lock-step."

From my point of view this is pretty much the answer I expected.
They are committed to investigating a solution, but IMHO this doesn't
necessarily mean that there will be one. Even if there is a solution,
we don't know how long it will take NVIDIA to implement it or whether
it will be feasible for us. For instance, if they only promise
compatibility across minor driver version updates, then major driver
version updates would still be problematic for us.

That brings me to the question of what is feasible on the Debian side
without making it even more of a nightmare than it already is...
This bug may not be the best place for that if the discussion involves
a lot of back and forth over several options, so maybe it should
happen in an online document (Google Docs or similar).

Thoughts?
On Sat, Mar 31, 2018 at 3:21 PM Philipp Kern wrote:
>
> On 2018-03-30 20:02, Luca Boccassi wrote:
> > On Mon, 2018-03-26 at 18:45 +0200, Philipp Kern wrote:
> >> I would like to understand better what the current set of packages
> >> helps with, though. It is true that I hadn't considered that you are
> >> shipping so many packages right now. However, you seem to also
> >> hardcode the dependencies between them with a lot of substvars in the
> >> packaging, which is understandable given the non-free nature of them.
> >> But at the same time it makes it more muddy as to what problem that
> >> solves.
> > Well that's the Debian policy - one shared library per package, that's
> > what we follow.
>
> While this is technically true, they are also far from the regular
> shared library packages, too. People generally don't link against these
> shared libraries. Files are installed not into the regular directories.
> Most of the time newer libraries are not actually co-installable. The
> installed file doesn't necessarily follow the SONAME. (I only spot
> checked as I have spotty connectivity right now.)
>
> This is not about "you're doing it wrong or anything". Instead these are
> just awkward binary blobs that I think can be treated differently than
> usual shared libraries if needed. Especially in case you don't get the
> advantages of the split packaging with the binaries you are provided by
> NVidia.
>
> I'll try to come up with a longer answer to the remaining bits. I
> suppose we should play this through as an example with the current
> packaging and then check what's acceptable and what's not acceptable.
>
> > Yes, the legacy drivers (340xx and 304xx at the moment, although the
> > latter is out of support so I guess we'll drop it in buster) are co-
> > installable. There are update-alternatives for those too. We have a
> > script to make it easier to manage those and the glx provider (mesa,
> > fglrx, nvidia), it's update-glx from the update-glx package.
> >
> > You can find the scripts and configs in the git repo:
> > https://salsa.debian.org/nvidia-team/glx-alternatives
>
> This means that users are expected to call update-glx on bootup if the
> driver in the installation doesn't match the installed hardware, right?
> My hope would be that if we get it to work consistently for minor
> revisions that we can support legacy drivers with the same mechanism:
> When a legacy module is loadable, we make sure that the GLX bits point
> to the correct library version for the card installed. I know that in
> regular desktop systems card architecture changes are rare and users
> expect to tend to the machine manually in this case. However in the case
> of bigger pool setups and imaging, modern Linux and X.org just works,
> except the NVidia bits.
>
> Kind regards and thanks a lot for your responses!
> Philipp Kern



Bug#889669: nvidia-graphics-drivers: solve the upgrade problem

2018-03-31 Thread Philipp Kern

On 2018-03-30 20:02, Luca Boccassi wrote:
> On Mon, 2018-03-26 at 18:45 +0200, Philipp Kern wrote:
>> I would like to understand better what the current set of packages
>> helps with, though. It is true that I hadn't considered that you are
>> shipping so many packages right now. However, you seem to also
>> hardcode the dependencies between them with a lot of substvars in the
>> packaging, which is understandable given the non-free nature of them.
>> But at the same time it makes it more muddy as to what problem that
>> solves.
> Well that's the Debian policy - one shared library per package, that's
> what we follow.


While this is technically true, they are also far from the regular 
shared library packages, too. People generally don't link against these 
shared libraries. Files are installed not into the regular directories. 
Most of the time newer libraries are not actually co-installable. The 
installed file doesn't necessarily follow the SONAME. (I only spot 
checked as I have spotty connectivity right now.)


This is not about "you're doing it wrong or anything". Instead these are 
just awkward binary blobs that I think can be treated differently than 
usual shared libraries if needed. Especially in case you don't get the 
advantages of the split packaging with the binaries you are provided by 
NVidia.


I'll try to come up with a longer answer to the remaining bits. I 
suppose we should play this through as an example with the current 
packaging and then check what's acceptable and what's not acceptable.



> Yes, the legacy drivers (340xx and 304xx at the moment, although the
> latter is out of support so I guess we'll drop it in buster) are
> co-installable. There are update-alternatives for those too. We have a
> script to make it easier to manage those and the glx provider (mesa,
> fglrx, nvidia), it's update-glx from the update-glx package.
>
> You can find the scripts and configs in the git repo:
> https://salsa.debian.org/nvidia-team/glx-alternatives


This means that users are expected to call update-glx on bootup if the 
driver in the installation doesn't match the installed hardware, right? 
My hope would be that if we get it to work consistently for minor 
revisions that we can support legacy drivers with the same mechanism: 
When a legacy module is loadable, we make sure that the GLX bits point 
to the correct library version for the card installed. I know that in 
regular desktop systems card architecture changes are rare and users 
expect to tend to the machine manually in this case. However in the case 
of bigger pool setups and imaging, modern Linux and X.org just works, 
except the NVidia bits.


Kind regards and thanks a lot for your responses!
Philipp Kern



Bug#889669: nvidia-graphics-drivers: solve the upgrade problem

2018-03-30 Thread Luca Boccassi
On Mon, 2018-03-26 at 18:45 +0200, Philipp Kern wrote:
> Hi Luca,
> 
> On 3/21/18 2:01 PM, Luca Boccassi wrote:
> > Isn't this sort-of-like what Ubuntu does? IIRC they lump together
> > everything into a single package unlike we do, and they are named
> > after the major revision.
>
> let's say that even Ubuntu have not solved this problem because they
> don't consider the minor revision either, as suggested here.
>
> I would like to understand better what the current set of packages
> helps with, though. It is true that I hadn't considered that you are
> shipping so many packages right now. However, you seem to also hardcode
> the dependencies between them with a lot of substvars in the packaging,
> which is understandable given the non-free nature of them. But at the
> same time it makes it more muddy as to what problem that solves.

Well that's the Debian policy - one shared library per package, that's
what we follow.

> At the same time I also did not consider libglvnd - I unfortunately was
> not aware of it. That at least in theory seems to be a nice way forward
> to just co-install multiple implementations. Is anyone other than NVidia
> supporting it at this point? But anyhow we'll live with the two options
> here if one of them is a regression, either in bugs or features, which
> seems to be the case here. Given that the two are not co-installable
> today anyway, collating the two options into two separate packages would
> work. But for that suggestion to make any sense I'd like to understand
> the current packaging first - as per the above.

Mesa does support glvnd, and ships with it in sid/buster. One day I'd
like to drop the non-glvnd one, but it would need a solution for
switchable graphics first (hopefully server-side glvnd in Xorg 1.20
will help with that, but can't say I have looked into it yet).

> The key idea is that the packages install their binaries into paths
> versioned with both major and minor revision and do not change while the
> machine is booted. Then we would need to juggle around some symlinks
> based off the module version exposed in sysfs on boot. The constraints
> here are doing that after the module is loaded and /usr is made
> available and before X(/Wayland?) starts. It does seem a little messy
> with systemd, that's true. We'd likely end up needing this to be
> included in basic.target. With sysvinit rcS would work. If the nvidia
> module is included in the initrd for KMS - which I think is the case? -
> udev wouldn't work as easily, just in addition. So I suppose it'd need
> one script that puts the symlink farm into the right state and then we
> need to sprinkle some hooks into the right places depending on when the
> module is loaded.

The modules are not in the initrd (weirdly, I thought they would be), at
least on my desktop:

$ lsinitramfs /boot/initrd.img-4.9.0-6-amd64  | grep nvidia
lib/modules/4.9.0-6-amd64/kernel/drivers/net/ethernet/nvidia
lib/modules/4.9.0-6-amd64/kernel/drivers/net/ethernet/nvidia/forcedeth.ko
etc/modprobe.d/nvidia-kernel-common.conf
etc/modprobe.d/nvidia.conf
etc/modprobe.d/nvidia-blacklists-nouveau.conf
etc/nvidia
etc/nvidia/current
etc/nvidia/current/nvidia-modprobe.conf
etc/nvidia/current/nvidia-blacklists-nouveau.conf

> What kind of alternatives do we need to offer at this point? Mesa and
> NVidia? Can legacy drivers be co-installable? I'd intuitively prefer to
> have glvnd/non-glvnd be two non-co-installable packages. It'd be great
> if legacy drivers could be co-installable and then the right driver
> would be loaded, which is theoretically feasible. And Mesa needs to be
> co-installable. So it'd be nice if this would really boil down to just
> Mesa vs. NVidia on an alternatives level, unless I miss something.
> 
> Kind regards and thanks for all your replies so far!
> Philipp Kern

Yes, the legacy drivers (340xx and 304xx at the moment, although the
latter is out of support so I guess we'll drop it in buster) are co-
installable. There are update-alternatives for those too. We have a
script to make it easier to manage those and the glx provider (mesa,
fglrx, nvidia), it's update-glx from the update-glx package.

You can find the scripts and configs in the git repo:
https://salsa.debian.org/nvidia-team/glx-alternatives

-- 
Kind regards,
Luca Boccassi



Bug#889669: nvidia-graphics-drivers: solve the upgrade problem

2018-03-30 Thread Luca Boccassi
On Fri, 2018-03-23 at 14:54 +, Michael Schaller wrote:
> > I see. Perhaps a systemd unit with the appropriate precedences set so
> > that it runs before X starts? And $something-$something for Sys-V I
> > guess :-)
>
> The more I think about it maybe this shouldn't be handled by a service
> at boot but rather by udev. What do you think?

That would make it independent from the init system, so I don't see why
not. I haven't really played with udev, so I can't propose a solution,
but I can help test one.

> > Was worth a try :-) I must admit that lately, looking at the AMD camp
> > with their in-tree kernel drivers and first-class support for Mesa for
> > userspace, I am green with envy (ha!)
> 
> :-D
-- 
Kind regards,
Luca Boccassi



Bug#889669: nvidia-graphics-drivers: solve the upgrade problem

2018-03-26 Thread Philipp Kern
Hi Luca,

On 3/21/18 2:01 PM, Luca Boccassi wrote:
> Isn't this sort-of-like what Ubuntu does? IIRC they lump together
> everything into a single package unlike we do, and they are named after
> the major revision.

let's say that even Ubuntu have not solved this problem because they
don't consider the minor revision either, as suggested here.

I would like to understand better what the current set of packages helps
with, though. It is true that I hadn't considered that you are shipping
so many packages right now. However, you seem to also hardcode the
dependencies between them with a lot of substvars in the packaging,
which is understandable given the non-free nature of them. But at the
same time it makes it more muddy as to what problem that solves.

At the same time I also did not consider libglvnd - I unfortunately was
not aware of it. That at least in theory seems to be a nice way forward
to just co-install multiple implementations. Is anyone other than NVidia
supporting it at this point? But anyhow we'll live with the two options
here if one of them is a regression, either in bugs or features, which
seems to be the case here. Given that the two are not co-installable
today anyway, collating the two options into two separate packages would
work. But for that suggestion to make any sense I'd like to understand
the current packaging first - as per the above.

The key idea is that the packages install their binaries into paths
versioned with both major and minor revision and do not change while the
machine is booted. Then we would need to juggle around some symlinks
based off the module version exposed in sysfs on boot. The constraints
here are doing that after the module is loaded and /usr is made
available and before X(/Wayland?) starts. It does seem a little messy
with systemd, that's true. We'd likely end up needing this to be
included in basic.target. With sysvinit rcS would work. If the nvidia
module is included in the initrd for KMS - which I think is the case? -
udev wouldn't work as easily, just in addition. So I suppose it'd need
one script that puts the symlink farm into the right state and then we
need to sprinkle some hooks into the right places depending on when the
module is loaded.
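
A minimal sketch of the "one script that puts the symlink farm into the
right state", under stated assumptions: each driver version is unpacked
into its own directory below a common base directory, and a single
"current" symlink selects the active one. The directory layout, the link
name and the function names are hypothetical, not existing packaging.

```python
import os
from pathlib import Path


def read_loaded_version(sysfs_file="/sys/module/nvidia/version"):
    """Return the version of the loaded nvidia module, or None if not loaded."""
    try:
        return Path(sysfs_file).read_text().strip()
    except OSError:
        return None


def point_current_at(base_dir, version, link_name="current"):
    """Atomically repoint base_dir/link_name at base_dir/version.

    Returns True if the link now points at the requested version, False if
    no directory for that version is installed (the link is left untouched).
    """
    base = Path(base_dir)
    if not version or not (base / version).is_dir():
        return False
    tmp = base / (link_name + ".tmp")
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()                       # clear a leftover from an earlier run
    os.symlink(version, tmp)               # relative link, e.g. current.tmp -> 390.48
    os.replace(tmp, base / link_name)      # rename swaps the link in one step
    return True
```

A boot hook could then call something like
point_current_at("/usr/lib/nvidia", read_loaded_version()), where the
base path is again only an example. Building the new link under a
temporary name and renaming it over the old one means there is no moment
at which "current" is missing.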

What kind of alternatives do we need to offer at this point? Mesa and
NVidia? Can legacy drivers be co-installable? I'd intuitively prefer to
have glvnd/non-glvnd be two non-co-installable packages. It'd be great
if legacy drivers could be co-installable and then the right driver
would be loaded, which is theoretically feasible. And Mesa needs to be
co-installable. So it'd be nice if this would really boil down to just
Mesa vs. NVidia on an alternatives level, unless I miss something.

Kind regards and thanks for all your replies so far!
Philipp Kern



Bug#889669: nvidia-graphics-drivers: solve the upgrade problem

2018-03-22 Thread Luca Boccassi
On Thu, 2018-03-22 at 14:36 +, Michael Schaller wrote:
> > How would the switch-at-boot mechanism work?
> 
> The basic idea for the switch-at-boot mechanism is that it would check
> the version of the loaded NVIDIA kernel module
> (/sys/module/nvidia/version) on boot and then select the matching user
> space version (via update-alternatives) before anything attempts to
> use it.

I see. Perhaps a systemd unit with the appropriate precedences set so
that it runs before X starts? And $something-$something for Sys-V I
guess :-)
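
Whatever ends up triggering it, the step itself could look roughly like
the sketch below. It assumes an alternatives group called "nvidia" whose
choices are per-version directories under /usr/lib/nvidia; the versioned
paths and the function names are assumptions for illustration and do not
describe the current packaging. Only the sysfs path and the
"update-alternatives --set" call are taken as given.

```python
import subprocess
from pathlib import Path


def loaded_module_version(sysfs_file="/sys/module/nvidia/version"):
    """Return the loaded nvidia module version, or None if it is not loaded."""
    try:
        return Path(sysfs_file).read_text().strip()
    except OSError:
        return None


def select_command(version, alt_name="nvidia", base_dir="/usr/lib/nvidia"):
    """Build the update-alternatives call that selects `version`.

    alt_name and base_dir are hypothetical defaults.
    """
    return ["update-alternatives", "--set", alt_name, f"{base_dir}/{version}"]


def switch_userspace(dry_run=True):
    """Select the user space matching the loaded module; return the command."""
    version = loaded_module_version()
    if version is None:
        return None                      # module not loaded: leave things alone
    cmd = select_command(version)
    if not dry_run:
        subprocess.run(cmd, check=True)  # needs root; fails loudly on error
    return cmd
```

With dry_run left at its default the function only reports what it would
run, which makes it easy to test on a machine without the driver.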

> > Seeing your email address domain - any chance your company could
> > use its gargantuan soft-power to get Nvidia to publish the specs for
> > the missing parts of Nouveau (reclocking, power management, etc)?
> > That would solve all our problems once and for all :-P
> 
> I wish, but that sounds like deep lawyer cat territory and I very much
> prefer to work on a technical solution. ;-)

Was worth a try :-) I must admit that lately, looking at the AMD camp
with their in-tree kernel drivers and first-class support for Mesa for
userspace, I am green with envy (ha!)

> I've just asked though if the version lock between the NVIDIA kernel
> modules and user-space components really needs to be so strict. Let's
> see how that goes...

It would be good to know, but we'd need strong guarantees - otherwise
it's nasty regressions waiting to happen, given the very minimal debug-
ability.

-- 
Kind regards,
Luca Boccassi



Bug#889669: nvidia-graphics-drivers: solve the upgrade problem

2018-03-22 Thread Philipp Kern
On 22.03.2018 15:43, Andreas Beckmann wrote:
> On 2018-03-22 15:36, Michael Schaller wrote:
>>> We should probably postpone this to post-390.xx if nvidia sticks to
>>> their plan to drop i386 driver support ...
>> That's the first time I've heard about that. Do you have further
>> information about it (a link is fine)? I also wonder how that will impact
>> projects that depend on i386 support, for instance Wine.
> http://nvidia.custhelp.com/app/answers/detail/a_id/4604/

Apart from the page not being really informative (not your fault!), I'd
expect that they at least ship the 32-bit libGL libraries. That they get
rid of the i386 driver support isn't really surprising as long as they
at least keep compatibility with 32-bit binaries. Now of course that
page does not say that, but alas it's also a topic different from the
one described in the bug. =)

Kind regards
Philipp Kern



Bug#889669: nvidia-graphics-drivers: solve the upgrade problem

2018-03-22 Thread Andreas Beckmann
On 2018-03-22 15:36, Michael Schaller wrote:
>> We should probably postpone this to post-390.xx if nvidia sticks to
>> their plan to drop i386 driver support ...
> That's the first time I've heard about that. Do you have further
> information about it (a link is fine)? I also wonder how that will impact
> projects that depend on i386 support, for instance Wine.

http://nvidia.custhelp.com/app/answers/detail/a_id/4604/


Andreas



Bug#889669: nvidia-graphics-drivers: solve the upgrade problem

2018-03-21 Thread Andreas Beckmann
We should probably postpone this to post-390.xx if nvidia sticks to
their plan to drop i386 driver support ... that would remove a lot of
complexity, since we probably don't need proper multiarch support any
more ...


Andreas



Bug#889669: nvidia-graphics-drivers: solve the upgrade problem

2018-03-21 Thread Luca Boccassi
Control: severity -1 normal

On Wed, 2018-03-21 at 08:17 +, Michael Schaller wrote:
> Please reconsider the assessment that this is merely an annoyance and
> a wishlist item.
> If an NVIDIA driver security update is pushed and security updates are
> installed unattended, then all NVIDIA user space components will stop
> working immediately after the respective package updates, as the loaded
> kernel module and the user space components have a version mismatch.
> The consequences are not immediately visible to the user, as NVIDIA
> components in memory are still properly matched and hence still work.
> The real issue is with new processes: for instance, no OpenGL
> applications or CUDA workloads can be launched anymore. This is
> especially severe for CUDA server farms, as they currently can't enable
> unattended security updates unless they specifically exclude NVIDIA
> driver updates.

That's fine. I didn't grok that you had large installations where this
was already causing issues. Personally I'm fine with talking about
possible solutions.
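
For such installations, the mismatch described in the quoted report could
be detected ahead of time with something like the sketch below. It
assumes the user-space libraries carry the full driver version as a file
name suffix (as in libcuda.so.390.48); the library directory and the
function names are placeholders.

```python
import re
from pathlib import Path


def userspace_versions(lib_dir, stem="libcuda.so"):
    """Return the driver versions found as suffixes of <stem>.<version> files."""
    # Require at least major.minor so that SONAME links such as
    # libcuda.so.1 are not mistaken for a driver version.
    pattern = re.compile(re.escape(stem) + r"\.(\d+\.\d+(?:\.\d+)?)$")
    found = set()
    for path in Path(lib_dir).iterdir():
        match = pattern.match(path.name)
        if match:
            found.add(match.group(1))
    return found


def has_mismatch(kernel_version, lib_dir):
    """True if no installed user-space library matches the loaded module."""
    return kernel_version not in userspace_versions(lib_dir)
```

The kernel side of the comparison would come from
/sys/module/nvidia/version. A monitoring job could use a check like this
to flag machines where newly started OpenGL or CUDA processes are
expected to fail until the next reboot.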

Seeing your email address domain - any chance your company could use
its gargantuan soft-power to get Nvidia to publish the specs for the
missing parts of Nouveau (reclocking, power management, etc)? That
would solve all our problems once and for all :-P

> On Wed, Mar 21, 2018 at 9:00 AM Philipp Kern wrote:
> 
> > On 03/20/2018 10:59 PM, Luca Boccassi wrote:
> > > The problems I see are that it would make an already quite complex
> > > packaging system, over which we have very little control (most of it
> > > is binary blobs) even more complicated. We already have 2 layers of
> > > update-alternatives (mesa vs nvidia and then current vs legacy).
> > >
> > > It would also mean we have to start maintaining multiple versions at
> > > the same time - again being all binary blobs, which will multiply the
> > > source of problems. Basically, it would mean that instead of having
> > > current vs legacy340xx (up until a few months ago also legacy304xx),
> > > every single driver update would have to be maintained separately.
> >
> > I don't propose this as the solution, though. I think that'd indeed be
> > infeasible. What I'm saying is that the *binary* packages are versioned
> > like this, not the source packages. It's like the kernel in a way, where
> > every ABI version gets its own binary package name. Although in Debian
> > the hesitance to change the ABI is much higher than in Ubuntu, for
> > reasons that I assume have to do with the NEW queue. Cleaning up older
> > versions is something we'd find a solution for, just like people clean
> > up their old kernels.
> >
> > So please separate out maintenance from the proposal. ;-)
> >
> > I get it with the two layers of alternatives. Is the reason for mesa vs.
> > nvidia because we don't put Nvidia into the library search path first
> > and need to deal with the corresponding file conflicts in a sane way? Or
> > because we want to keep co-installability between mesa and nvidia?
> >
> > > In the end the problem is an annoyance but not a deal breaker - updates
> > > can be scheduled and delayed (unlike some other OSes...), and on top of
> > > that, version bumps are not that common - at most once a month, and
> > > only for those running unstable or testing - in stable we just ship LTS
> > > versions.
> >
> > Actually it's a real deal breaker in mass deployments. If your users are
> > hesitant to do reboots because it resets their work environment, you
> > really need to detach nvidia updates from the rest of the package
> > updates, which means having a custom-built solution to do that. That has
> > turned out to be brittle, as it turns out that you end up installing
> > pre-downloaded modules at boot, blocking it for about ten minutes. (It
> > has gotten better with SSDs, but still.)
> >
> > Even if you just ship LTS versions there are sometimes updates needed,
> > be it for Meltdown/Spectre or new hardware. In our case we actually do
> > use testing, but even then we had the need to push updates to drivers. I
> > think a setup that separates out binaries for every version that allows
> > for consistent rollbacks[1] and rollforwards would be beneficial not
> > just for us but also for the whole userbase of Debian.
> >
> > We'd be willing to invest some time into a solution - as our own to work
> > around the flaws in the packaging has turned out to be a maintenance
> > headache. But that only works if we at least agree on a plan. I'm also
> > happy to clarify more that I probably missed in the proposal. :)
> >
> > Kind regards and thanks
> > Philipp Kern
> >
> > [1] We had a bunch of regressions with newer drivers in the past that
> > made them dead on arrival, like missing repaints in terminals for a
> > fraction of the cards.

Bug#889669: nvidia-graphics-drivers: solve the upgrade problem

2018-03-21 Thread Luca Boccassi
On Wed, 2018-03-21 at 08:56 +0100, Philipp Kern wrote:
> On 03/20/2018 10:59 PM, Luca Boccassi wrote:
> > The problems I see are that it would make an already quite complex
> > packaging system, over which we have very little control (most of it
> > is binary blobs) even more complicated. We already have 2 layers of
> > update-alternatives (mesa vs nvidia and then current vs legacy).
> >
> > It would also mean we have to start maintaining multiple versions at
> > the same time - again being all binary blobs, which will multiply the
> > source of problems. Basically, it would mean that instead of having
> > current vs legacy340xx (up until a few months ago also legacy304xx),
> > every single driver update would have to be maintained separately.
>
> I don't propose this as the solution, though. I think that'd indeed be
> infeasible. What I'm saying is that the *binary* packages are versioned
> like this, not the source packages. It's like the kernel in a way, where
> every ABI version gets its own binary package name. Although in Debian
> the hesitance to change the ABI is much higher than in Ubuntu, for
> reasons that I assume have to do with the NEW queue. Cleaning up older
> versions is something we'd find a solution for, just like people clean
> up their old kernels.
> 
> So please separate out maintenance from the proposal. ;-)

Ah I see - one issue I can foresee is that it's binary blobs all the
way down - so there's really no way to know that libnvidia-foo from
version 1.1 can work with libnvidia-bar from version 2.2. So all the
packages would have to be versioned.

Isn't this sort-of-like what Ubuntu does? IIRC they lump together
everything into a single package unlike we do, and they are named after
the major revision.

How would the switch-at-boot mechanism work?

> I get it with the two layers of alternatives. Is the reason for mesa vs.
> nvidia because we don't put Nvidia into the library search path first
> and need to deal with the corresponding file conflicts in a sane way? Or
> because we want to keep co-installability between mesa and nvidia?

co-installability - it used to be that each vendor had its own version
of libGL, and they were all incompatible with each other. With libglvnd
this is changing - but sadly we need to keep shipping the non-glvnd
versions as there are often regressions (and some use cases don't work
with the glvnd versions yet, like switchable graphics on laptops).
So in reality what glvnd is doing for us right now is multiplying the
maintenance effort rather than reducing it. But I digress...

> > In the end the problem is an annoyance but not a deal breaker - updates
> > can be scheduled and delayed (unlike some other OSes...), and on top of
> > that, version bumps are not that common - at most once a month, and
> > only for those running unstable or testing - in stable we just ship LTS
> > versions.
>
> Actually it's a real deal breaker in mass deployments. If your users are
> hesitant to do reboots because it resets their work environment, you
> really need to detach nvidia updates from the rest of the package
> updates, which means having a custom-built solution to do that. That has
> turned out to be brittle, as it turns out that you end up installing
> pre-downloaded modules at boot, blocking it for about ten minutes. (It
> has gotten better with SSDs, but still.)
>
> Even if you just ship LTS versions there are sometimes updates needed,
> be it for Meltdown/Spectre or new hardware. In our case we actually do
> use testing, but even then we had the need to push updates to drivers. I
> think a setup that separates out binaries for every version that allows
> for consistent rollbacks[1] and rollforwards would be beneficial not
> just for us but also for the whole userbase of Debian.
>
> We'd be willing to invest some time into a solution - as our own to work
> around the flaws in the packaging has turned out to be a maintenance
> headache. But that only works if we at least agree on a plan. I'm also
> happy to clarify more that I probably missed in the proposal. :)
>
> Kind regards and thanks
> Philipp Kern
>
> [1] We had a bunch of regressions with newer drivers in the past that
> made them dead on arrival, like missing repaints in terminals for a
> fraction of the cards.

Ok so now I understand you have some large deployments where this is an
actual issue - I didn't get it immediately, sorry.

I'm up for talking about proposals - Andreas, what do you think?

-- 
Kind regards,
Luca Boccassi



Bug#889669: nvidia-graphics-drivers: solve the upgrade problem

2018-03-21 Thread Andreas Beckmann
I cannot remember any bug reports regarding this upgrade problem before
yours ...

-ENOTMUCHTIMETHISWEEK :-(

Andreas



Bug#889669: nvidia-graphics-drivers: solve the upgrade problem

2018-03-21 Thread Philipp Kern
On 03/20/2018 10:59 PM, Luca Boccassi wrote:
> The problems I see are that it would make an already quite complex
> packaging system, over which we have very little control (most of it
> is binary blobs) even more complicated. We already have 2 layers of
> update-alternatives (mesa vs nvidia and then current vs legacy).
> 
> It would also mean we have to start maintaining multiple versions at
> the same time - again being all binary blobs, which will multiply the
> source of problems. Basically, it would mean that instead of having
> current vs legacy340xx (up until a few months ago also legacy304xx),
> every single driver update would have to be maintained separately.

I don't propose this as the solution, though. I think that'd indeed be
infeasible. What I'm saying is that the *binary* packages are versioned
like this, not the source packages. It's like the kernel in a way, where
every ABI version gets its own binary package name. Although in Debian
the hesitance to change the ABI is much higher than in Ubuntu, for
reasons that I assume have to do with the NEW queue. Cleaning up older
versions is something we'd find a solution for, just like people clean
up their old kernels.
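
To illustrate the clean-up part: a rough sketch of how a helper might
decide which co-installed driver versions can go, by analogy with old
kernels. The policy (keep the version matching the loaded module plus
the newest installed one) and all names here are assumptions, not an
agreed design.

```python
def version_key(version):
    """Turn '390.48' into (390, 48) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def removable_versions(installed, running, keep_newest=1):
    """Return installed driver versions that are candidates for removal.

    The version of the currently loaded module (`running`) and the
    `keep_newest` most recent versions are always kept.
    """
    ordered = sorted(installed, key=version_key)
    keep = set(ordered[-keep_newest:]) if keep_newest > 0 else set()
    if running is not None:
        keep.add(running)
    return [v for v in ordered if v not in keep]


print(removable_versions(["384.111", "390.25", "390.48"], running="390.25"))
# prints ['384.111']
```

Keeping the running version is what makes the upgrade safe without a
reboot: new processes continue to find libraries that match the loaded
module until the machine is restarted.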

So please separate out maintenance from the proposal. ;-)

I get it with the two layers of alternatives. Is the reason for mesa vs.
nvidia because we don't put Nvidia into the library search path first
and need to deal with the corresponding file conflicts in a sane way? Or
because we want to keep co-installability between mesa and nvidia?

> In the end the problem is an annoyance but not a deal breaker - updates
> can be scheduled and delayed (unlike some other OSes...), and on top of
> that, version bumps are not that common - at most once a month, and
> only for those running unstable or testing - in stable we just ship LTS
> versions.

Actually it's a real deal breaker in mass deployments. If your users are
hesitant to do reboots because it resets their work environment, you
really need to detach nvidia updates from the rest of the package
updates, which means having a custom-built solution to do that. That has
turned out to be brittle, as it turns out that you end up installing
pre-downloaded modules at boot, blocking it for about ten minutes. (It
has gotten better with SSDs, but still.)

Even if you just ship LTS versions there are sometimes updates needed,
be it for Meltdown/Spectre or new hardware. In our case we actually do
use testing, but even then we had the need to push updates to drivers. I
think a setup that separates out binaries for every version, and so
allows for consistent rollbacks[1] and rollforwards, would be beneficial
not just for us but also for the whole userbase of Debian.

We'd be willing to invest some time into a solution - as our own attempt
to work around the flaws in the packaging has turned out to be a
maintenance headache. But that only works if we at least agree on a plan.
I'm also happy to clarify anything that I probably missed in the
proposal. :)

Kind regards and thanks
Philipp Kern

[1] We had a bunch of regressions with newer drivers in the past that
made them dead on arrival, like missing repaints in terminals for a
fraction of the cards.



Bug#889669: nvidia-graphics-drivers: solve the upgrade problem

2018-03-20 Thread Luca Boccassi
Control: severity -1 wishlist

On Tue, 2018-03-20 at 21:22 +0100, Philipp Kern wrote:
> Hi,
> 
> On 2/5/18 4:26 PM, Philipp Kern wrote:
> > Since forever users of NVIDIA on Debian accepted that package upgrades
> > break newly spawned binaries because the interface between the client
> > library and the kernel driver is strictly versioned. The kernel module
> > will emit an API mismatch error into the kernel log and GLX requests
> > will fail. A reboot is required to remediate this situation.
> > 
> > I would propose the following model:
> > 
> > * All binary packages that require strict versioning with NVRM are
> > shipped in versioned packages. This means that the library package
> > names reflect both major and minor version (= the version on which the
> > driver checks) of the driver. The resulting packages should be
> > co-installable with each other.
> > * A script modifies the symlink for the currently active libraries to
> > point to the version of the currently loaded nvidia module (as fetched
> > from sysfs's /sys/module/nvidia/version). This script is called on
> > installation but more crucially on every boot. This will tie the
> > libraries to the module loaded at boot-up.
> > * The kernel module itself does not have to be versioned. The kernel
> > module can be upgraded and it will end up in the initrd automatically.
> > 
> > Assuming that we have a metapackage that pulls in the most recent
> > driver (like linux-image does), this model would allow upgrading the
> > driver at any point in time and only make it live with the next
> > reboot. This allows applications to continue to function.
> > 
> > This approach has the drawback that every update from NVIDIA needs to
> > go through NEW. However I think this is just a theoretical
> > disadvantage at this point as NEW latency for ABI version changes has
> > decreased a lot.
> > 
> > The thing I'm not sure about is how this proposal interacts with the
> > legacy modules. I suppose they can all use the same mechanism but the
> > script would need to be aware of which library stack needs to be
> > chosen. The NVIDIA kernel shim already checks using
> > rm_is_supported_device if the currently installed device is supported.
> > That together with modalias should supposedly already load the correct
> > module and then the script could just check which of the modules (if
> > legacy or the normal one) is loaded and act accordingly.
> > 
> > Do you think this would be workable? The NVIDIA packaging is quite a
> > beast to handle, I know (and I'm very grateful for your work!). So we
> > should have some consensus if this is something you'd be interested
> > in. :)
> 
> is there something I could help with to get to a consensus here?
> Anything? :)
> 
> (I just ran into this again: I needed to reboot when all I wanted was
> to get the i386 driver.)
> 
> Kind regards and thanks
> Philipp Kern

Hi,

Thanks for your proposal, I understand the need to reboot is an
annoyance.

The problems I see are that it would make an already quite complex
packaging system, over which we have very little control (most of it
is binary blobs), even more complicated. We already have 2 layers of
update-alternatives (mesa vs nvidia and then current vs legacy).

It would also mean we have to start maintaining multiple versions at
the same time - again, all binary blobs - which will multiply the
sources of problems. Basically, it would mean that instead of having
current vs legacy340xx (up until a few months ago also legacy304xx),
every single driver update would have to be maintained separately.

In the end the problem is an annoyance but not a deal breaker - updates
can be scheduled and delayed (unlike some other OSes...), and on top of
that, version bumps are not that common - at most once a month, and
only for those running unstable or testing - in stable we just ship LTS
versions.

Sorry, my personal opinion is that I'm just not sure it would be really
worth the additional time and hassle :-/

-- 
Kind regards,
Luca Boccassi



Bug#889669: nvidia-graphics-drivers: solve the upgrade problem

2018-03-20 Thread Philipp Kern
Hi,

On 2/5/18 4:26 PM, Philipp Kern wrote:
> Since forever users of NVIDIA on Debian accepted that package upgrades
> break newly spawned binaries because the interface between the client
> library and the kernel driver is strictly versioned. The kernel module
> will emit an API mismatch error into the kernel log and GLX requests
> will fail. A reboot is required to remediate this situation.
> 
> I would propose the following model:
> 
> * All binary packages that require strict versioning with NVRM are
> shipped in versioned packages. This means that the library package names
> reflect both major and minor version (= the version on which the driver
> checks) of the driver. The resulting packages should be co-installable
> with each other.
> * A script modifies the symlink for the currently active libraries to
> point to the version of the currently loaded nvidia module (as fetched
> from sysfs's /sys/module/nvidia/version). This script is called on
> installation but more crucially on every boot. This will tie the
> libraries to the module loaded at boot-up.
> * The kernel module itself does not have to be versioned. The kernel
> module can be upgraded and it will end up in the initrd automatically.
> 
> Assuming that we have a metapackage that pulls in the most recent driver
> (like linux-image does), this model would allow upgrading the driver at
> any point in time and only make it live with the next reboot. This
> allows applications to continue to function.
> 
> This approach has the drawback that every update from NVIDIA needs to go
> through NEW. However I think this is just a theoretical disadvantage at
> this point as NEW latency for ABI version changes has decreased a lot.
> 
> The thing I'm not sure about is how this proposal interacts with the
> legacy modules. I suppose they can all use the same mechanism but the
> script would need to be aware what library stack needs to be chosen. The
> NVIDIA kernel shim already checks using rm_is_supported_device if the
> currently installed device is supported. That together with modalias
> should supposedly already load the correct module and then the script
> could just check which of the modules (if legacy or the normal one) is
> loaded and act accordingly.
> 
> Do you think this would be workable? The NVIDIA packaging is quite a
> beast to handle, I know (and I'm very grateful for your work!). So we
> should have some consensus if this is something you'd be interested in. :)

is there something I could help with to get to a consensus here?
Anything? :)

(I just ran into this again: I needed to reboot when all I wanted was
to get the i386 driver.)

Kind regards and thanks
Philipp Kern



Bug#889669: nvidia-graphics-drivers: solve the upgrade problem

2018-02-05 Thread Philipp Kern
Source: nvidia-graphics-drivers

Since forever users of NVIDIA on Debian accepted that package upgrades
break newly spawned binaries because the interface between the client
library and the kernel driver is strictly versioned. The kernel module
will emit an API mismatch error into the kernel log and GLX requests
will fail. A reboot is required to remediate this situation.

I would propose the following model:

* All binary packages that require strict versioning with NVRM are
shipped in versioned packages. This means that the library package names
reflect both major and minor version (= the version on which the driver
checks) of the driver. The resulting packages should be co-installable
with each other.
* A script modifies the symlink for the currently active libraries to
point to the version of the currently loaded nvidia module (as fetched
from sysfs's /sys/module/nvidia/version). This script is called on
installation but more crucially on every boot. This will tie the
libraries to the module loaded at boot-up.
* The kernel module itself does not have to be versioned. The kernel
module can be upgraded and it will end up in the initrd automatically.
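
A minimal sketch of the script from the second bullet, in Python. Only
/sys/module/nvidia/version comes from the proposal itself. The directory
layout (one directory per driver version under a hypothetical
/usr/lib/nvidia/versions, with a "current" symlink next to it) and all
function names are assumptions for illustration, not what the packages
ship today.

```python
import os
from pathlib import Path

# From the proposal: sysfs exposes the version of the loaded module here.
SYSFS_VERSION = Path("/sys/module/nvidia/version")
# Hypothetical layout: one directory per co-installed driver version,
# plus a "current" symlink that the library search path would use.
VERSIONS_DIR = Path("/usr/lib/nvidia/versions")
CURRENT_LINK = Path("/usr/lib/nvidia/current")


def loaded_version(sysfs_file=SYSFS_VERSION):
    """Return the loaded module's version, or None if it is not loaded."""
    try:
        return Path(sysfs_file).read_text().strip()
    except FileNotFoundError:
        return None


def activate(version, versions_dir=VERSIONS_DIR, link=CURRENT_LINK):
    """Atomically point `link` at the library directory for `version`."""
    target = Path(versions_dir) / version
    if not target.is_dir():
        raise FileNotFoundError(f"no libraries installed for driver {version}")
    tmp = Path(f"{link}.tmp")
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()  # leftover from an interrupted earlier run
    os.symlink(target, tmp)
    os.replace(tmp, link)  # rename swaps the symlink in a single step
    return target


def main():
    """Tie the active libraries to whatever module is loaded right now."""
    version = loaded_version()
    if version is not None:
        activate(version)
```

Something like main() would run from the package's postinst and from a
unit early at boot. Creating the new symlink under a temporary name and
renaming it over the old one means a process starting at that moment sees
either the old or the new library set, never a missing link.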

Assuming that we have a metapackage that pulls in the most recent driver
(like linux-image does), this model would allow upgrading the driver at
any point in time and only make it live with the next reboot. This
allows applications to continue to function.

This approach has the drawback that every update from NVIDIA needs to go
through NEW. However I think this is just a theoretical disadvantage at
this point as NEW latency for ABI version changes has decreased a lot.

The thing I'm not sure about is how this proposal interacts with the
legacy modules. I suppose they can all use the same mechanism but the
script would need to be aware of which library stack needs to be chosen. The
NVIDIA kernel shim already checks using rm_is_supported_device if the
currently installed device is supported. That together with modalias
should supposedly already load the correct module and then the script
could just check which of the modules (if legacy or the normal one) is
loaded and act accordingly.
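
The last step could look roughly like the sketch below. It relies on the
kernel creating a /sys/module/<name> directory for each loaded module.
The module names and library stack labels in the table are placeholders
made up for illustration and would have to match what the packages
actually install.

```python
from pathlib import Path

# Hypothetical mapping from kernel module name (as it would appear under
# /sys/module) to the library stack belonging to it. Illustrative only.
MODULE_TO_STACK = {
    "nvidia": "current",
    "nvidia_legacy_340xx": "legacy-340xx",
}


def loaded_stack(sys_module=Path("/sys/module"), table=MODULE_TO_STACK):
    """Return (stack, version) for the first known loaded module, else None.

    `version` is None when the module directory has no "version" file.
    """
    for module, stack in table.items():
        module_dir = Path(sys_module) / module
        if not module_dir.is_dir():
            continue  # this flavour of the module is not loaded
        version_file = module_dir / "version"
        version = None
        if version_file.is_file():
            version = version_file.read_text().strip()
        return stack, version
    return None
```

The symlink script would then pick the library directory from the
returned stack and version instead of assuming the normal driver.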

Do you think this would be workable? The NVIDIA packaging is quite a
beast to handle, I know (and I'm very grateful for your work!). So we
should have some consensus if this is something you'd be interested in. :)

Kind regards and thanks
Philipp Kern


