Re: mlx5 broken affinity
On 11/08/2017 12:33 PM, Thomas Gleixner wrote:
> On Wed, 8 Nov 2017, Jes Sorensen wrote:
>> On 11/07/2017 10:07 AM, Thomas Gleixner wrote:
>>> Depending on the machine and the number of queues this might even result in
>>> completely losing the ability to suspend/hibernate because the number of
>>> available vectors on CPU0 is not sufficient to accommodate all queue
>>> interrupts.
>>
>> Depending on the system, suspend/resume is really the less interesting
>> issue to the user. Pretty much any system with a 10/25Gbps mlx5 NIC in
>> it will be in some sort of rack and is unlikely to ever want to
>> suspend/resume.
>
> The discussions with Intel about that tell a different story and cpu
> online/offline for power management purposes is - while debatable - widely
> used.

I certainly do not want to dispute that, it just underlines that different users have different priorities.

>>> A lot of things are possible, the question is whether it makes sense. The
>>> whole point is to have resources (queues, interrupts etc.) per CPU and have
>>> them strictly associated.
>>>
>>> Why would you give the user a knob to destroy what you carefully optimized?
>>>
>>> Just because we can and just because users love those knobs or is there any
>>> real technical reason?
>>
>> Because the user sometimes knows better based on statically assigned
>> loads, or the user wants consistency across kernels. It's great that the
>> system is better at allocating this now, but we also need to allow for a
>> user to change it. Like anything on Linux, a user wanting to blow off
>> his/her own foot, should be allowed to do so.
>
> That's fine, but that's not what the managed affinity facility provides. If
> you want to leverage the spread mechanism, but avoid the managed part, then
> this is a different story and we need to figure out how to provide that
> without breaking the managed side of it.
> As I said it's possible, but I vehemently disagree, that this is a bug in
> the core code, as it was claimed several times in this thread.

It may be that my original message was confusing on this. Currently IRQ-affinity.txt describes how a user can change that, but it no longer works if an IRQ is marked managed. That is what I qualified as a bug in my original posting. If an IRQ is not meant to have its affinity modified because it is managed, then I think the permissions on the /proc file should also change to read-only (r--r--r--) rather than returning EIO when trying to write to it, but that is a minor detail IMHO.

None of this is a major showstopper for the next kernel release, as long as we can work on a longer term fix that satisfies everyone. What I would ideally like to see is the option for drivers to benefit from the new allocation scheme without being locked in, and also have the option to go managed if that is right for the given devices. Basically a best of both worlds situation.

> The real issue is that the driver was converted to something which was
> expected to behave differently. That's hardly a bug in the core code, at
> most it's a documentation problem.

When I hit this issue I read the driver commit stating it was taking advantage of the new allocation. Not having been part of the discussion and design of the new core code, I just caught what was described in the commit message for mlx5. For now I have just reverted the offending mlx5 patch in our kernel while we figure out how to resolve it long term.

Cheers,
Jes
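For reference, the failure mode described here is visible from user space as an errno: on a managed IRQ, a write to /proc/irq/&lt;n&gt;/smp_affinity fails with EIO even though the file permissions still show it as writable. A minimal probe of that behavior could look like the following sketch (the errno classification is drawn from this thread, not from a documented ABI, and the IRQ path is just an example):

```python
import errno
import os

def try_set_affinity(affinity_path, mask):
    """Attempt to write a hex CPU mask to an smp_affinity-style file.

    Returns "ok" on success, "managed" if the kernel refuses with EIO
    (the symptom discussed in this thread), "denied" on a permission
    error, or "error" for anything else.
    """
    try:
        fd = os.open(affinity_path, os.O_WRONLY)
        try:
            os.write(fd, mask.encode())
        finally:
            os.close(fd)
        return "ok"
    except OSError as e:
        if e.errno == errno.EIO:
            return "managed"  # managed IRQ: affinity is kernel-controlled
        if e.errno in (errno.EPERM, errno.EACCES):
            return "denied"
        return "error"
```

A tool like irqbalance could use such a probe to skip managed IRQs instead of logging write failures, e.g. `try_set_affinity("/proc/irq/42/smp_affinity", "f")`.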
Re: mlx5 broken affinity
On 11/09/2017 02:23 PM, Thomas Gleixner wrote:
> On Thu, 9 Nov 2017, Jens Axboe wrote:
>> On 11/09/2017 10:03 AM, Thomas Gleixner wrote:
>>> On Thu, 9 Nov 2017, Jens Axboe wrote:
>>>> On 11/09/2017 07:19 AM, Thomas Gleixner wrote:
>>>> If that's the attitude at your end, then I do suggest we just revert the
>>>> driver changes. Clearly this isn't going to be productive going forward.
>>>>
>>>> The better solution was to make the managed setup more flexible, but
>>>> it doesn't sound like that is going to be viable at all.
>>>
>>> That's not true. I indicated several times, that we can do that, but not
>>> just by breaking the managed facility.
>>>
>>> What I'm arguing against is that the blame is focused on those who
>>> implemented the managed facility with the existing semantics.
>>>
>>> I'm still waiting for a proper description of what needs to be changed in
>>> order to make these drivers work again. All I have seen so far is to break
>>> managed interrupts completely and that's not going to happen.
>>
>> There's no blame as far as I'm concerned, just frustration that we broke
>> this and folks apparently not acknowledging that it's a concern.
>>
>> What used to work was that you could move IRQs around as you wanted to.
>> That was very useful for custom setups, for tuning, or for isolation
>> purposes. None of that works now, which is unfortunate.
>
> Well, it's unfortunate and I can understand your frustration, but I really
> have a hard time understanding that these concerns were not brought up
> when the whole thing was discussed and in the early stages of
> implementation, way before it got merged.
>
> It was not my decision to make it the way it is and I merely try to prevent
> hasty and damaging hackery right now.

There's no rush, we can take our time to get it right. We won't get this fixed for 4.14 anyway. On my end, I don't think the limitation of removing user tweakability was made clear enough.
I'm all for moving common code into helpers and frameworks, but this currently has limitations that aren't acceptable, IMHO. This is partly my fault, I should have realized this was the case.

As long as we can move forward in a productive fashion, then that's fine with me.

--
Jens Axboe
Re: mlx5 broken affinity
On Thu, 9 Nov 2017, Jens Axboe wrote:
> On 11/09/2017 10:07 AM, Thomas Gleixner wrote:
>> I say it one last time: It can be done and I'm willing to help.
>
> It didn't sound like it earlier, but that's good news.

Well, I'm equally frustrated by this whole thing, but I certainly never said that I'm not willing to change anything.

Thanks,
tglx
Re: mlx5 broken affinity
On Thu, 9 Nov 2017, Jens Axboe wrote:
> On 11/09/2017 10:03 AM, Thomas Gleixner wrote:
>> On Thu, 9 Nov 2017, Jens Axboe wrote:
>>> On 11/09/2017 07:19 AM, Thomas Gleixner wrote:
>>> If that's the attitude at your end, then I do suggest we just revert the
>>> driver changes. Clearly this isn't going to be productive going forward.
>>>
>>> The better solution was to make the managed setup more flexible, but
>>> it doesn't sound like that is going to be viable at all.
>>
>> That's not true. I indicated several times, that we can do that, but not
>> just by breaking the managed facility.
>>
>> What I'm arguing against is that the blame is focused on those who
>> implemented the managed facility with the existing semantics.
>>
>> I'm still waiting for a proper description of what needs to be changed in
>> order to make these drivers work again. All I have seen so far is to break
>> managed interrupts completely and that's not going to happen.
>
> There's no blame as far as I'm concerned, just frustration that we broke
> this and folks apparently not acknowledging that it's a concern.
>
> What used to work was that you could move IRQs around as you wanted to.
> That was very useful for custom setups, for tuning, or for isolation
> purposes. None of that works now, which is unfortunate.

Well, it's unfortunate and I can understand your frustration, but I really have a hard time understanding that these concerns were not brought up when the whole thing was discussed and in the early stages of implementation, way before it got merged.

It was not my decision to make it the way it is and I merely try to prevent hasty and damaging hackery right now.

I'll answer to the technical details in a separate mail.

Thanks,
tglx
Re: mlx5 broken affinity
On 11/09/2017 10:07 AM, Thomas Gleixner wrote:
> On Thu, 9 Nov 2017, Jens Axboe wrote:
>> On 11/09/2017 09:01 AM, Sagi Grimberg wrote:
>>>> Now you try to blame the people who implemented the managed affinity
>>>> stuff for the wreckage, which was created by people who changed drivers
>>>> to use it. Nice try.
>>>
>>> I'm not trying to blame anyone, really. I was just trying to understand
>>> how to move forward with making users happy and still enjoy subsystem
>>> services instead of doing lots of similar things inside mlx5 driver.
>>
>> Exactly. The key here is how we make it work for both cases. But there
>> has to be a willingness to make the infrastructure work for that.
>
> I say it one last time: It can be done and I'm willing to help.

It didn't sound like it earlier, but that's good news.

> Please tell me what exactly you expect and I can have a look what needs to
> be done w/o creating an utter mess.

See the previous email, should have all the details.

--
Jens Axboe
Re: mlx5 broken affinity
On 11/09/2017 10:03 AM, Thomas Gleixner wrote:
> On Thu, 9 Nov 2017, Jens Axboe wrote:
>> On 11/09/2017 07:19 AM, Thomas Gleixner wrote:
>> If that's the attitude at your end, then I do suggest we just revert the
>> driver changes. Clearly this isn't going to be productive going forward.
>>
>> The better solution was to make the managed setup more flexible, but
>> it doesn't sound like that is going to be viable at all.
>
> That's not true. I indicated several times, that we can do that, but not
> just by breaking the managed facility.
>
> What I'm arguing against is that the blame is focused on those who
> implemented the managed facility with the existing semantics.
>
> I'm still waiting for a proper description of what needs to be changed in
> order to make these drivers work again. All I have seen so far is to break
> managed interrupts completely and that's not going to happen.

There's no blame as far as I'm concerned, just frustration that we broke this and folks apparently not acknowledging that it's a concern.

What used to work was that you could move IRQs around as you wanted to. That was very useful for custom setups, for tuning, or for isolation purposes. None of that works now, which is unfortunate.

As a bare minimum, being able to specify a static mask at load time would be useful. It's still not nearly as useful as it was before where you could move them at runtime, but at least it's better than what we have right now.

--
Jens Axboe
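A load-time mask of the kind proposed above would presumably be given in the usual kernel cpulist syntax (e.g. "0-3,8", as used by parameters like isolcpus=). As an illustration of what such a parameter would have to parse — the driver parameter itself is hypothetical, only the cpulist format is real — a sketch:

```python
def parse_cpu_list(spec):
    """Parse a kernel-style CPU list such as "0-3,8" into a set of CPU ids.

    Mirrors the cpulist format used by many kernel command-line
    parameters; the idea of a driver load-time affinity parameter
    taking this format is hypothetical here.
    """
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus
```

For example, `parse_cpu_list("0-3,8")` yields the set {0, 1, 2, 3, 8}.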
Re: mlx5 broken affinity
On Thu, 9 Nov 2017, Jens Axboe wrote:
> On 11/09/2017 09:01 AM, Sagi Grimberg wrote:
>>> Now you try to blame the people who implemented the managed affinity stuff
>>> for the wreckage, which was created by people who changed drivers to use
>>> it. Nice try.
>>
>> I'm not trying to blame anyone, really. I was just trying to understand
>> how to move forward with making users happy and still enjoy subsystem
>> services instead of doing lots of similar things inside mlx5 driver.
>
> Exactly. The key here is how we make it work for both cases. But there
> has to be a willingness to make the infrastructure work for that.

I say it one last time: It can be done and I'm willing to help.

Please tell me what exactly you expect and I can have a look what needs to be done w/o creating an utter mess.

Thanks,
tglx
Re: mlx5 broken affinity
On Thu, 9 Nov 2017, Jens Axboe wrote:
> On 11/09/2017 07:19 AM, Thomas Gleixner wrote:
> If that's the attitude at your end, then I do suggest we just revert the
> driver changes. Clearly this isn't going to be productive going forward.
>
> The better solution was to make the managed setup more flexible, but
> it doesn't sound like that is going to be viable at all.

That's not true. I indicated several times, that we can do that, but not just by breaking the managed facility.

What I'm arguing against is that the blame is focused on those who implemented the managed facility with the existing semantics.

I'm still waiting for a proper description of what needs to be changed in order to make these drivers work again. All I have seen so far is to break managed interrupts completely and that's not going to happen.

Thanks,
tglx
Re: mlx5 broken affinity
On 11/09/2017 09:01 AM, Sagi Grimberg wrote:
>> Now you try to blame the people who implemented the managed affinity stuff
>> for the wreckage, which was created by people who changed drivers to use
>> it. Nice try.
>
> I'm not trying to blame anyone, really. I was just trying to understand
> how to move forward with making users happy and still enjoy subsystem
> services instead of doing lots of similar things inside mlx5 driver.

Exactly. The key here is how we make it work for both cases. But there has to be a willingness to make the infrastructure work for that.

--
Jens Axboe
Re: mlx5 broken affinity
> The early discussion of the managed facility came to the conclusion that it
> will manage this stuff completely to allow fixed association of 'queue /
> interrupt / corresponding memory' to a single CPU or a set of CPUs. That
> removes a lot of 'affinity' handling magic from the driver and utilizes the
> hardware in a sensible way. That was not my decision, really. It surely
> made sense to me and I helped Christoph to implement it.
>
> The real question is whether you want to have the fixed 'queue / interrupt /
> corresponding memory association' and get rid of all the 'affinity' dance
> in your driver or not.
>
> If you answer that question with 'yes' then the consequence is that there
> is no knob.
>
> If you answer that question with 'no' then you should not use the managed
> facility in the first place and if you need parts of that functionality
> then this needs to be added to the core code _before_ a driver gets
> converted and not afterwards.

Point taken.

> It's not my problem if people decide to use this and then trip over the
> limitations after the changes hit the tree. This could have been figured
> out before even a single patch was posted.

That's correct, I could have known that, but I didn't, and from your reply, I understand there is really only a single way forward...

> Now you try to blame the people who implemented the managed affinity stuff
> for the wreckage, which was created by people who changed drivers to use
> it. Nice try.

I'm not trying to blame anyone, really. I was just trying to understand how to move forward with making users happy and still enjoy subsystem services instead of doing lots of similar things inside mlx5 driver.
Re: mlx5 broken affinity
>> Again, I think Jes or others can provide more information.
>
> Sagi, I believe Jes is not trying to argue about what initial affinity
> values you give to the driver. We have a very critical regression that is
> afflicting live systems today and common tools that already exist in
> various distros, such as irqbalance, which solely depends on the
> smp_affinity sysfs entry, which is now not writable due to this regression.
> Please see https://github.com/Irqbalance/irqbalance/blob/master/activate.c
>
> Some users would like to have their network traffic handled on some cores
> and free up other cores for other purposes; you just can't take that away
> from them.
>
> If reverting the mlx5 patches would solve the issue, I am afraid that is
> the solution I am going to go with, until the regression is fixed.

Well, I completely agree with you Saeed. If this is causing a regression for mlx5 users and we can't find a way for the managed interface to expose this knob to userspace, I don't see any other choice as well.
Re: mlx5 broken affinity
On 11/09/2017 07:19 AM, Thomas Gleixner wrote:
> On Thu, 9 Nov 2017, Sagi Grimberg wrote:
>> Thomas,
>>
>>>> Because the user sometimes knows better based on statically assigned
>>>> loads, or the user wants consistency across kernels. It's great that the
>>>> system is better at allocating this now, but we also need to allow for a
>>>> user to change it. Like anything on Linux, a user wanting to blow off
>>>> his/her own foot, should be allowed to do so.
>>>
>>> That's fine, but that's not what the managed affinity facility provides. If
>>> you want to leverage the spread mechanism, but avoid the managed part, then
>>> this is a different story and we need to figure out how to provide that
>>> without breaking the managed side of it.
>>>
>>> As I said it's possible, but I vehemently disagree, that this is a bug in
>>> the core code, as it was claimed several times in this thread.
>>>
>>> The real issue is that the driver was converted to something which was
>>> expected to behave differently. That's hardly a bug in the core code, at
>>> most it's a documentation problem.
>>
>> I disagree here, this is not a device specific discussion. The question
>> of exposing an IRQ affinity assignment knob to user-space holds for every
>> device (NICs, HBAs and alike). The same issue could have been raised on
>> any other device using managed affinitization (and we have quite a few
>> of those now). I can't see any reason why it's OK for device X to use it
>> while it's not OK for device Y to use it.
>>
>> If the resolution is "YES we must expose a knob to user-space" then the
>> managed facility is unusable in its current form for *any* device. If the
>> answer is "NO, user-space can't handle all the stuff the kernel can" then
>> we should document it. This is really device independent.
>
> The early discussion of the managed facility came to the conclusion that it
> will manage this stuff completely to allow fixed association of 'queue /
> interrupt / corresponding memory' to a single CPU or a set of CPUs. That
> removes a lot of 'affinity' handling magic from the driver and utilizes the
> hardware in a sensible way. That was not my decision, really. It surely
> made sense to me and I helped Christoph to implement it.
>
> The real question is whether you want to have the fixed 'queue / interrupt /
> corresponding memory association' and get rid of all the 'affinity' dance
> in your driver or not.
>
> If you answer that question with 'yes' then the consequence is that there
> is no knob.
>
> If you answer that question with 'no' then you should not use the managed
> facility in the first place and if you need parts of that functionality
> then this needs to be added to the core code _before_ a driver gets
> converted and not afterwards.
>
> It's not my problem if people decide to use this and then trip over the
> limitations after the changes hit the tree. This could have been figured
> out before even a single patch was posted.
>
> Now you try to blame the people who implemented the managed affinity stuff
> for the wreckage, which was created by people who changed drivers to use
> it. Nice try.

If that's the attitude at your end, then I do suggest we just revert the driver changes. Clearly this isn't going to be productive going forward.

The better solution was to make the managed setup more flexible, but it doesn't sound like that is going to be viable at all.

--
Jens Axboe
Re: mlx5 broken affinity
On 11/09/2017 03:50 AM, Sagi Grimberg wrote:
> Thomas,
>
>>> Because the user sometimes knows better based on statically assigned
>>> loads, or the user wants consistency across kernels. It's great that the
>>> system is better at allocating this now, but we also need to allow for a
>>> user to change it. Like anything on Linux, a user wanting to blow off
>>> his/her own foot, should be allowed to do so.
>>
>> That's fine, but that's not what the managed affinity facility provides. If
>> you want to leverage the spread mechanism, but avoid the managed part, then
>> this is a different story and we need to figure out how to provide that
>> without breaking the managed side of it.
>>
>> As I said it's possible, but I vehemently disagree, that this is a bug in
>> the core code, as it was claimed several times in this thread.
>>
>> The real issue is that the driver was converted to something which was
>> expected to behave differently. That's hardly a bug in the core code, at
>> most it's a documentation problem.
>
> I disagree here, this is not a device specific discussion. The question
> of exposing an IRQ affinity assignment knob to user-space holds for every
> device (NICs, HBAs and alike). The same issue could have been raised on
> any other device using managed affinitization (and we have quite a few
> of those now). I can't see any reason why it's OK for device X to use it
> while it's not OK for device Y to use it.
>
> If the resolution is "YES we must expose a knob to user-space" then the
> managed facility is unusable in its current form for *any* device. If
> the answer is "NO, user-space can't handle all the stuff the kernel can"
> then we should document it. This is really device independent.

Completely agree, and honestly I'm pretty baffled we're even having to argue this point.

--
Jens Axboe
Re: mlx5 broken affinity
On 11/09/2017 03:09 AM, Christoph Hellwig wrote:
> On Wed, Nov 08, 2017 at 09:13:59AM -0700, Jens Axboe wrote:
>> There are numerous valid reasons to be able to set the affinity, for
>> both nics and block drivers. It's great that the kernel has a predefined
>> layout that works well, but users do need the flexibility to be able to
>> reconfigure affinities, to suit their needs.
>
> Managed interrupts are about more than just setting affinities - they
> bind a vector (which means a queue) to a specific cpu or set of cpus.
> This is extremely useful to deal with hotplug issues and also makes
> life a lot easier in general.

I know why it was done.

>> But that particular case is completely orthogonal to whether or not we
>> should allow the user to reconfigure. The answer to that is clearly YES,
>> and we should ensure that this is possible.
>
> And why is the answer yes? If the answer is YES it means we need explicit
> boilerplate code to deal with cpu hotplug in every driver not using
> managed interrupts (even if optionally), so it is a non-trivial tradeoff.

The answer is yes because as with any other kernel level policy, there are situations where it isn't the optimal solution. I ran into this myself with nvme recently, and I started cursing the managed setup for that very reason. You can even argue this is a regression, since we used to be able to move things around as we saw fit.

If the answer is boilerplate code to make it feasible, then we need that boilerplate code.

--
Jens Axboe
Re: mlx5 broken affinity
On Wed, 2017-11-08 at 09:27 +0200, Sagi Grimberg wrote:
>> Depending on the machine and the number of queues this might even result in
>> completely losing the ability to suspend/hibernate because the number of
>> available vectors on CPU0 is not sufficient to accommodate all queue
>> interrupts.
>>
>>> Would it be possible to keep the managed facility until a user overrides
>>> an affinity assignment? This way if the user didn't touch it, we keep
>>> all the perks, and in case the user overrides it, we log the implication
>>> so the user is aware?
>>
>> A lot of things are possible, the question is whether it makes sense. The
>> whole point is to have resources (queues, interrupts etc.) per CPU and have
>> them strictly associated.
>
> Not arguing here.
>
>> Why would you give the user a knob to destroy what you carefully optimized?
>
> Well, looks like someone relies on this knob, the question is if he is
> doing something better for his workload. I don't know, it's really up to
> the user to say.
>
>> Just because we can and just because users love those knobs or is there any
>> real technical reason?
>
> Again, I think Jes or others can provide more information.

Sagi, I believe Jes is not trying to argue about what initial affinity values you give to the driver. We have a very critical regression that is afflicting live systems today and common tools that already exist in various distros, such as irqbalance, which solely depends on the smp_affinity sysfs entry, which is now not writable due to this regression. Please see https://github.com/Irqbalance/irqbalance/blob/master/activate.c

Some users would like to have their network traffic handled on some cores and free up other cores for other purposes; you just can't take that away from them.

If reverting the mlx5 patches would solve the issue, I am afraid that is the solution I am going to go with, until the regression is fixed.
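For context, the /proc/irq/&lt;n&gt;/smp_affinity file that irqbalance writes to takes a CPU bitmask encoded as comma-separated 32-bit hex words. A sketch of how a tool could build such a mask from a set of CPU ids (a simplified model of what irqbalance does internally, not its actual code):

```python
def cpus_to_smp_affinity(cpus, nr_cpus=64):
    """Encode a set of CPU ids as the hex bitmask string accepted by
    /proc/irq/<n>/smp_affinity: 32-bit words, comma-separated,
    most significant word first.
    """
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    nwords = max(1, (nr_cpus + 31) // 32)
    words = []
    for i in range(nwords - 1, -1, -1):
        # Extract each 32-bit word, highest first.
        words.append("{:08x}".format((mask >> (32 * i)) & 0xFFFFFFFF))
    return ",".join(words)
```

For a 64-CPU machine, CPUs 0-3 encode as "00000000,0000000f", which is the kind of string a write to smp_affinity expects (and which currently fails with EIO on managed IRQs).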
Re: mlx5 broken affinity
On Thu, 9 Nov 2017, Sagi Grimberg wrote:
> Thomas,
>
>>> Because the user sometimes knows better based on statically assigned
>>> loads, or the user wants consistency across kernels. It's great that the
>>> system is better at allocating this now, but we also need to allow for a
>>> user to change it. Like anything on Linux, a user wanting to blow off
>>> his/her own foot, should be allowed to do so.
>>
>> That's fine, but that's not what the managed affinity facility provides. If
>> you want to leverage the spread mechanism, but avoid the managed part, then
>> this is a different story and we need to figure out how to provide that
>> without breaking the managed side of it.
>>
>> As I said it's possible, but I vehemently disagree, that this is a bug in
>> the core code, as it was claimed several times in this thread.
>>
>> The real issue is that the driver was converted to something which was
>> expected to behave differently. That's hardly a bug in the core code, at
>> most it's a documentation problem.
>
> I disagree here, this is not a device specific discussion. The question
> of exposing an IRQ affinity assignment knob to user-space holds for every
> device (NICs, HBAs and alike). The same issue could have been raised on
> any other device using managed affinitization (and we have quite a few
> of those now). I can't see any reason why it's OK for device X to use it
> while it's not OK for device Y to use it.
>
> If the resolution is "YES we must expose a knob to user-space" then the
> managed facility is unusable in its current form for *any* device. If the
> answer is "NO, user-space can't handle all the stuff the kernel can" then
> we should document it. This is really device independent.

The early discussion of the managed facility came to the conclusion that it will manage this stuff completely to allow fixed association of 'queue / interrupt / corresponding memory' to a single CPU or a set of CPUs. That removes a lot of 'affinity' handling magic from the driver and utilizes the hardware in a sensible way. That was not my decision, really. It surely made sense to me and I helped Christoph to implement it.

The real question is whether you want to have the fixed 'queue / interrupt / corresponding memory association' and get rid of all the 'affinity' dance in your driver or not.

If you answer that question with 'yes' then the consequence is that there is no knob.

If you answer that question with 'no' then you should not use the managed facility in the first place and if you need parts of that functionality then this needs to be added to the core code _before_ a driver gets converted and not afterwards.

It's not my problem if people decide to use this and then trip over the limitations after the changes hit the tree. This could have been figured out before even a single patch was posted.

Now you try to blame the people who implemented the managed affinity stuff for the wreckage, which was created by people who changed drivers to use it. Nice try.

Thanks,
tglx
Re: mlx5 broken affinity
Thomas,

>> Because the user sometimes knows better based on statically assigned
>> loads, or the user wants consistency across kernels. It's great that the
>> system is better at allocating this now, but we also need to allow for a
>> user to change it. Like anything on Linux, a user wanting to blow off
>> his/her own foot, should be allowed to do so.
>
> That's fine, but that's not what the managed affinity facility provides. If
> you want to leverage the spread mechanism, but avoid the managed part, then
> this is a different story and we need to figure out how to provide that
> without breaking the managed side of it.
>
> As I said it's possible, but I vehemently disagree, that this is a bug in
> the core code, as it was claimed several times in this thread.
>
> The real issue is that the driver was converted to something which was
> expected to behave differently. That's hardly a bug in the core code, at
> most it's a documentation problem.

I disagree here, this is not a device specific discussion. The question of exposing an IRQ affinity assignment knob to user-space holds for every device (NICs, HBAs and alike). The same issue could have been raised on any other device using managed affinitization (and we have quite a few of those now). I can't see any reason why it's OK for device X to use it while it's not OK for device Y to use it.

If the resolution is "YES we must expose a knob to user-space" then the managed facility is unusable in its current form for *any* device. If the answer is "NO, user-space can't handle all the stuff the kernel can" then we should document it. This is really device independent.
Re: mlx5 broken affinity
On Wed, Nov 08, 2017 at 09:13:59AM -0700, Jens Axboe wrote:
> There are numerous valid reasons to be able to set the affinity, for
> both nics and block drivers. It's great that the kernel has a predefined
> layout that works well, but users do need the flexibility to be able to
> reconfigure affinities, to suit their needs.

Managed interrupts are about more than just setting affinities - they bind a vector (which means a queue) to a specific cpu or set of cpus. This is extremely useful to deal with hotplug issues and also makes life a lot easier in general.

> But that particular case is completely orthogonal to whether or not we
> should allow the user to reconfigure. The answer to that is clearly YES,
> and we should ensure that this is possible.

And why is the answer yes? If the answer is YES it means we need explicit boilerplate code to deal with cpu hotplug in every driver not using managed interrupts (even if optionally), so it is a non-trivial tradeoff.
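The vector-to-CPU binding described above comes from the kernel spreading the vectors over the available CPUs at allocation time. A deliberately simplified sketch of that spreading — the real irq_create_affinity_masks() also spreads across NUMA nodes first and handles reserved pre/post vectors, so this is only an illustration of the chunking idea:

```python
def spread_vectors(nvec, cpus):
    """Distribute nvec interrupt vectors (queues) over a list of CPUs,
    giving each vector a contiguous chunk whose sizes differ by at
    most one. A simplification of the kernel's affinity spreading.
    """
    ncpus = len(cpus)
    masks = []
    start = 0
    for v in range(nvec):
        # Ceiling-divide the remaining CPUs over the remaining vectors
        # so leftover CPUs are spread one per vector.
        chunk = (ncpus - start + (nvec - v) - 1) // (nvec - v)
        masks.append(cpus[start:start + chunk])
        start += chunk
    return masks
```

With 3 vectors on 8 CPUs this yields per-vector CPU sets of sizes 3, 3 and 2, which is the fixed queue/CPU association the managed facility then enforces.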
Re: mlx5 broken affinity
On Wed, 8 Nov 2017, Jes Sorensen wrote:
> On 11/07/2017 10:07 AM, Thomas Gleixner wrote:
>> On Sun, 5 Nov 2017, Sagi Grimberg wrote:
>>> I do agree that the user would lose better cpu online/offline behavior,
>>> but it seems that users want to still have some control over the IRQ
>>> affinity assignments even if they lose this functionality.
>>
>> Depending on the machine and the number of queues this might even result in
>> completely losing the ability to suspend/hibernate because the number of
>> available vectors on CPU0 is not sufficient to accommodate all queue
>> interrupts.
>
> Depending on the system, suspend/resume is really the less interesting
> issue to the user. Pretty much any system with a 10/25Gbps mlx5 NIC in
> it will be in some sort of rack and is unlikely to ever want to
> suspend/resume.

The discussions with Intel about that tell a different story and cpu online/offline for power management purposes is - while debatable - widely used. That's where the whole idea for managed affinities originated from, along with avoiding the affinity hint and affinity notifier machinery which creates more problems than it solves.

>>> Would it be possible to keep the managed facility until a user overrides
>>> an affinity assignment? This way if the user didn't touch it, we keep
>>> all the perks, and in case the user overrides it, we log the implication
>>> so the user is aware?
>>
>> A lot of things are possible, the question is whether it makes sense. The
>> whole point is to have resources (queues, interrupts etc.) per CPU and have
>> them strictly associated.
>>
>> Why would you give the user a knob to destroy what you carefully optimized?
>>
>> Just because we can and just because users love those knobs or is there any
>> real technical reason?
>
> Because the user sometimes knows better based on statically assigned
> loads, or the user wants consistency across kernels. It's great that the
> system is better at allocating this now, but we also need to allow for a
> user to change it. Like anything on Linux, a user wanting to blow off
> his/her own foot, should be allowed to do so.

That's fine, but that's not what the managed affinity facility provides. If you want to leverage the spread mechanism, but avoid the managed part, then this is a different story and we need to figure out how to provide that without breaking the managed side of it.

As I said it's possible, but I vehemently disagree, that this is a bug in the core code, as it was claimed several times in this thread.

The real issue is that the driver was converted to something which was expected to behave differently. That's hardly a bug in the core code, at most it's a documentation problem.

Thanks,
tglx
Re: mlx5 broken affinity
On 11/07/2017 10:07 AM, Thomas Gleixner wrote:
> On Sun, 5 Nov 2017, Sagi Grimberg wrote:
>> I do agree that the user would lose better cpu online/offline behavior,
>> but it seems that users want to still have some control over the IRQ
>> affinity assignments even if they lose this functionality.
>
> Depending on the machine and the number of queues this might even result in
> completely losing the ability to suspend/hibernate because the number of
> available vectors on CPU0 is not sufficient to accommodate all queue
> interrupts.

Depending on the system, suspend/resume is really the less interesting
issue to the user. Pretty much any system with a 10/25Gbps mlx5 NIC in
it will be in some sort of rack and is unlikely to ever want to
suspend/resume.

>> Would it be possible to keep the managed facility until a user overrides
>> an affinity assignment? This way if the user didn't touch it, we keep
>> all the perks, and in case the user overrides it, we log the implication
>> so the user is aware?
>
> A lot of things are possible, the question is whether it makes sense. The
> whole point is to have resources (queues, interrupts etc.) per CPU and have
> them strictly associated.
>
> Why would you give the user a knob to destroy what you carefully optimized?
>
> Just because we can and just because users love those knobs or is there any
> real technical reason?

Because the user sometimes knows better based on statically assigned
loads, or the user wants consistency across kernels. It's great that the
system is better at allocating this now, but we also need to allow for a
user to change it. Like anything on Linux, a user wanting to blow off
his/her own foot should be allowed to do so.

Cheers,
Jes
Re: mlx5 broken affinity
On 11/08/2017 05:21 AM, David Laight wrote: > From: Sagi Grimberg >> Sent: 08 November 2017 07:28 > ... >>> Why would you give the user a knob to destroy what you carefully optimized? >> >> Well, looks like someone relies on this knob, the question is if he is >> doing something better for his workload. I don't know, its really up to >> the user to say. > > Maybe the user wants to ensure that nothing except some very specific > processing happens on some (or most) of the cpu cores. > > If the expected overall ethernet data rate isn't exceptionally large > is there any reason to allocate a queue (etc) for every cpu. There are numerous valid reasons to be able to set the affinity, for both nics and block drivers. It's great that the kernel has a predefined layout that works well, but users do need the flexibility to be able to reconfigure affinities, to suit their needs. For the particular mlx5 case, I'm actually not sure how the FB configuration differs from the in-kernel stuff. I'll take a look at that. It may just be that the configuration exists because the code used to be driver private and frequently changed, setting it at bootup to a known good configuration helped eliminate problems when upgrading kernels. I also remember some cases of removing CPU0 from the mask. But that particular case is completely orthogonal to whether or not we should allow the user to reconfigure. The answer to that is clearly YES, and we should ensure that this is possible. -- Jens Axboe
RE: mlx5 broken affinity
From: Sagi Grimberg > Sent: 08 November 2017 07:28 ... > > Why would you give the user a knob to destroy what you carefully optimized? > > Well, looks like someone relies on this knob, the question is if he is > doing something better for his workload. I don't know, its really up to > the user to say. Maybe the user wants to ensure that nothing except some very specific processing happens on some (or most) of the cpu cores. If the expected overall ethernet data rate isn't exceptionally large is there any reason to allocate a queue (etc) for every cpu. David
Re: mlx5 broken affinity
> Depending on the machine and the number of queues this might even result in
> completely losing the ability to suspend/hibernate because the number of
> available vectors on CPU0 is not sufficient to accommodate all queue
> interrupts.
>
>> Would it be possible to keep the managed facility until a user overrides
>> an affinity assignment? This way if the user didn't touch it, we keep
>> all the perks, and in case the user overrides it, we log the implication
>> so the user is aware?
>
> A lot of things are possible, the question is whether it makes sense. The
> whole point is to have resources (queues, interrupts etc.) per CPU and have
> them strictly associated.

Not arguing here.

> Why would you give the user a knob to destroy what you carefully optimized?

Well, looks like someone relies on this knob, the question is if he is
doing something better for his workload. I don't know, it's really up to
the user to say.

> Just because we can and just because users love those knobs or is there any
> real technical reason?

Again, I think Jes or others can provide more information.
Re: mlx5 broken affinity
On Sun, 5 Nov 2017, Sagi Grimberg wrote:
> > > > This wasn't to start a debate about which allocation method is the
> > > > perfect solution. I am perfectly happy with the new default, the part
> > > > that is broken is to take away the user's option to reassign the
> > > > affinity. That is a bug and it needs to be fixed!
> > >
> > > Well,
> > >
> > > I would really want to wait for Thomas/Christoph to reply, but this
> > > simple change fixed it for me:
> > > --
> > > diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> > > index 573dc52b0806..eccd06be5e44 100644
> > > --- a/kernel/irq/manage.c
> > > +++ b/kernel/irq/manage.c
> > > @@ -146,8 +146,7 @@ bool irq_can_set_affinity_usr(unsigned int irq)
> > >  {
> > >     struct irq_desc *desc = irq_to_desc(irq);
> > >
> > > -   return __irq_can_set_affinity(desc) &&
> > > -          !irqd_affinity_is_managed(&desc->irq_data);
> > > +   return __irq_can_set_affinity(desc);
> >
> > Which defeats the whole purpose of the managed facility, which is _not_ to
> > break the affinities on cpu offline and bring the interrupt back on the CPU
> > when it comes online again.
> >
> > What I can do is to have a separate flag, which only uses the initial
> > distribution mechanism, but I really want to have Christoph's opinion on
> > that.
>
> I do agree that the user would lose better cpu online/offline behavior,
> but it seems that users want to still have some control over the IRQ
> affinity assignments even if they lose this functionality.

Depending on the machine and the number of queues this might even result in
completely losing the ability to suspend/hibernate because the number of
available vectors on CPU0 is not sufficient to accommodate all queue
interrupts.

> Would it be possible to keep the managed facility until a user overrides
> an affinity assignment? This way if the user didn't touch it, we keep
> all the perks, and in case the user overrides it, we log the implication
> so the user is aware?

A lot of things are possible, the question is whether it makes sense. The
whole point is to have resources (queues, interrupts etc.) per CPU and have
them strictly associated.

Why would you give the user a knob to destroy what you carefully optimized?

Just because we can and just because users love those knobs or is there any
real technical reason?

Thanks,

tglx
Re: mlx5 broken affinity
>>> This wasn't to start a debate about which allocation method is the
>>> perfect solution. I am perfectly happy with the new default, the part
>>> that is broken is to take away the user's option to reassign the
>>> affinity. That is a bug and it needs to be fixed!
>>
>> Well,
>>
>> I would really want to wait for Thomas/Christoph to reply, but this
>> simple change fixed it for me:
>> --
>> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
>> index 573dc52b0806..eccd06be5e44 100644
>> --- a/kernel/irq/manage.c
>> +++ b/kernel/irq/manage.c
>> @@ -146,8 +146,7 @@ bool irq_can_set_affinity_usr(unsigned int irq)
>>  {
>>     struct irq_desc *desc = irq_to_desc(irq);
>>
>> -   return __irq_can_set_affinity(desc) &&
>> -          !irqd_affinity_is_managed(&desc->irq_data);
>> +   return __irq_can_set_affinity(desc);
>
> Which defeats the whole purpose of the managed facility, which is _not_ to
> break the affinities on cpu offline and bring the interrupt back on the CPU
> when it comes online again.
>
> What I can do is to have a separate flag, which only uses the initial
> distribution mechanism, but I really want to have Christoph's opinion on
> that.

I do agree that the user would lose better cpu online/offline behavior,
but it seems that users want to still have some control over the IRQ
affinity assignments even if they lose this functionality.

Would it be possible to keep the managed facility until a user overrides
an affinity assignment? This way if the user didn't touch it, we keep
all the perks, and in case the user overrides it, we log the implication
so the user is aware?
Re: mlx5 broken affinity
On Thu, 2 Nov 2017, Sagi Grimberg wrote:
> > This wasn't to start a debate about which allocation method is the
> > perfect solution. I am perfectly happy with the new default, the part
> > that is broken is to take away the user's option to reassign the
> > affinity. That is a bug and it needs to be fixed!
>
> Well,
>
> I would really want to wait for Thomas/Christoph to reply, but this
> simple change fixed it for me:
> --
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 573dc52b0806..eccd06be5e44 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -146,8 +146,7 @@ bool irq_can_set_affinity_usr(unsigned int irq)
>  {
>     struct irq_desc *desc = irq_to_desc(irq);
>
> -   return __irq_can_set_affinity(desc) &&
> -          !irqd_affinity_is_managed(&desc->irq_data);
> +   return __irq_can_set_affinity(desc);

Which defeats the whole purpose of the managed facility, which is _not_ to
break the affinities on cpu offline and bring the interrupt back on the CPU
when it comes online again.

What I can do is to have a separate flag, which only uses the initial
distribution mechanism, but I really want to have Christoph's opinion on
that.

Thanks,

tglx
Re: mlx5 broken affinity
On 11/02/2017 12:14 PM, Sagi Grimberg wrote:
>> This wasn't to start a debate about which allocation method is the
>> perfect solution. I am perfectly happy with the new default, the part
>> that is broken is to take away the user's option to reassign the
>> affinity. That is a bug and it needs to be fixed!
>
> Well,
>
> I would really want to wait for Thomas/Christoph to reply, but this
> simple change fixed it for me:
> --
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 573dc52b0806..eccd06be5e44 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -146,8 +146,7 @@ bool irq_can_set_affinity_usr(unsigned int irq)
>  {
>     struct irq_desc *desc = irq_to_desc(irq);
>
> -   return __irq_can_set_affinity(desc) &&
> -          !irqd_affinity_is_managed(&desc->irq_data);
> +   return __irq_can_set_affinity(desc);
> }

The part I don't get here is why the mlx5 driver change was allowed to
set the irqs to 'managed'? One thing is to use the generic system for
initial allocation, which is all great, but the change shouldn't mark
the irq affinity mapping as managed.

Jes
Re: mlx5 broken affinity
> This wasn't to start a debate about which allocation method is the
> perfect solution. I am perfectly happy with the new default, the part
> that is broken is to take away the user's option to reassign the
> affinity. That is a bug and it needs to be fixed!

Well,

I would really want to wait for Thomas/Christoph to reply, but this
simple change fixed it for me:
--
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 573dc52b0806..eccd06be5e44 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -146,8 +146,7 @@ bool irq_can_set_affinity_usr(unsigned int irq)
 {
    struct irq_desc *desc = irq_to_desc(irq);

-   return __irq_can_set_affinity(desc) &&
-          !irqd_affinity_is_managed(&desc->irq_data);
+   return __irq_can_set_affinity(desc);
 }

 /**
--
Re: mlx5 broken affinity
On 11/02/2017 06:08 AM, Sagi Grimberg wrote:
> I vaguely remember Nacking Sagi's patch as we knew it would break
> mlx5e netdev affinity assumptions.
>> I remember that argument. Still the series found its way in.
>
> Of course it made its way in, it was acked by three different
> maintainers, and I addressed all of Saeed's comments.
>
>> That series moves affinity decisions to kernel's responsibility.
>> AFAI see, what kernel does is assign IRQs to the NUMA's one by one in
>> increasing indexing (starting with cores of NUMA #0), no matter what
>> NUMA is closer to the NIC.
>
> Well, as we said before, if there is a good argument to do the home node
> first we can change the generic code (as it should be given that this is
> absolutely not device specific).
>
>> This means that if your NIC is on NUMA #1, and you reduce the number
>> of channels, you might end up working only with the cores on the far
>> NUMA. Not good!
>
> We deliberated on this before, and concluded that application affinity
> and device affinity are equally important. If you have a real use case
> that shows otherwise, it's perfectly doable to start from the device home
> node.

This wasn't to start a debate about which allocation method is the
perfect solution. I am perfectly happy with the new default, the part
that is broken is to take away the user's option to reassign the
affinity. That is a bug and it needs to be fixed!

Jes
Re: mlx5 broken affinity
> >This means that if your NIC is on NUMA #1, and you reduce the number of > >channels, you might end up working only with the cores on the far NUMA. > >Not good! > We deliberated on this before, and concluded that application affinity > and device affinity are equally important. If you have a real use case > that shows otherwise, its perfectly doable to start from the device home > node. Hi Sagi You might find this talk at NetDev interesting: https://www.netdevconf.org/2.2/session.html?shochat-devicemgmt-talk Andrew
Re: mlx5 broken affinity
>> I vaguely remember Nacking Sagi's patch as we knew it would break
>> mlx5e netdev affinity assumptions.
>
> I remember that argument. Still the series found its way in.

Of course it made its way in, it was acked by three different
maintainers, and I addressed all of Saeed's comments.

> That series moves affinity decisions to kernel's responsibility.
> AFAI see, what kernel does is assign IRQs to the NUMA's one by one in
> increasing indexing (starting with cores of NUMA #0), no matter what
> NUMA is closer to the NIC.

Well, as we said before, if there is a good argument to do the home node
first we can change the generic code (as it should be given that this is
absolutely not device specific).

> This means that if your NIC is on NUMA #1, and you reduce the number
> of channels, you might end up working only with the cores on the far
> NUMA. Not good!

We deliberated on this before, and concluded that application affinity
and device affinity are equally important. If you have a real use case
that shows otherwise, it's perfectly doable to start from the device home
node.

>> And I agree here that user should be able to read
>> /proc/irq/x/smp_affinity and even modify it if required.
>
> Totally agree. We should fix that ASAP. User must have write access.

I'll let Thomas reply here, I do not fully understand the reason why
pci_alloc_irq_vectors() makes the affinity assignments immutable..
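To make the NUMA concern above concrete, here is a toy sketch. It is purely illustrative (the function name and the 2-node/4-CPU layout are invented for the example; this is not kernel code): a spread that walks CPUs in plain increasing index order, ignoring the NIC's home node, gives all vectors to node 0 when the channel count is reduced.

```python
# Toy model (not kernel code): vectors are assigned to CPUs in plain
# increasing index order, regardless of which NUMA node the NIC sits on.
def linear_spread(num_vectors, num_nodes, cpus_per_node):
    # Enumerate CPUs as (node, cpu) pairs in increasing index order.
    cpus = [(node, cpu)
            for node in range(num_nodes)
            for cpu in range(cpus_per_node)]
    # The first num_vectors CPUs get a vector each.
    return cpus[:num_vectors]

# 2 nodes x 4 CPUs, NIC on node 1, only 4 channels configured:
# every vector lands on node 0 - the far node from the NIC.
print(linear_spread(4, 2, 4))
```

With four vectors on this hypothetical two-node box, all of them land on node 0 even though the NIC sits on node 1, which is exactly the "far NUMA" situation described in the thread.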
Re: mlx5 broken affinity
On 02/11/2017 1:02 AM, Jes Sorensen wrote:
> On 11/01/2017 06:41 PM, Saeed Mahameed wrote:
>> On Wed, Nov 1, 2017 at 11:20 AM, Jes Sorensen wrote:
>>> On 11/01/2017 01:21 PM, Sagi Grimberg wrote:
>>> I am all in favor of making the automatic setup better, but assuming an
>>> automatic setup is always right seems problematic. There could be
>>> workloads where you may want to assign affinity explicitly.
>>>
>>> Jes
>>
>> I vaguely remember Nacking Sagi's patch as we knew it would break
>> mlx5e netdev affinity assumptions.

I remember that argument. Still the series found its way in.

That series moves affinity decisions to kernel's responsibility.
AFAI see, what kernel does is assign IRQs to the NUMA's one by one in
increasing indexing (starting with cores of NUMA #0), no matter what
NUMA is closer to the NIC.

This means that if your NIC is on NUMA #1, and you reduce the number
of channels, you might end up working only with the cores on the far
NUMA. Not good!

>> Anyway we already submitted the mlx5e patch that removed such assumption
>>
>> https://patchwork.ozlabs.org/patch/809196/
>> ("net/mlx5e: Distribute RSS table among all RX rings")
>> Jes please confirm you have it.
>>
>> And I agree here that user should be able to read
>> /proc/irq/x/smp_affinity and even modify it if required.

Totally agree. We should fix that ASAP. User must have write access.

> Hi Saeed,
>
> I can confirm I have that patch - the problem is also present in Linus'
> current tree. I can read smp_affinity, but I cannot write to it.

Indeed. Should be fixed.

> Cheers,
> Jes

Thanks,
Tariq
Re: mlx5 broken affinity
Jes,

> I am all in favor of making the automatic setup better, but assuming an
> automatic setup is always right seems problematic. There could be
> workloads where you may want to assign affinity explicitly.

Adding Thomas to the thread. My understanding is that the thought is to
prevent user-space from messing up the nice linear assignment, but maybe
this might be too conservative?

The patch that prevented user-space from touching managed irq affinity
assignments:
--
commit 45ddcecbfa947f1dd8e8019bad9e90d6c9f2665c
Author: Thomas Gleixner
Date:   Mon Jul 4 17:39:25 2016 +0900

    genirq: Use affinity hint in irqdesc allocation

    Use the affinity hint in the irqdesc allocator. The hint is used to
    determine the node for the allocation and to set the affinity of the
    interrupt.

    If multiple interrupts are allocated (multi-MSI) then the allocator
    iterates over the cpumask and for each set cpu it allocates on their
    node and sets the initial affinity to that cpu. If a single interrupt
    is allocated (MSI-X) then the allocator uses the first cpu in the
    mask to compute the allocation node and uses the mask for the initial
    affinity setting.

    Interrupts set up this way are marked with the AFFINITY_MANAGED flag
    to prevent userspace from messing with their affinity settings.

    Signed-off-by: Thomas Gleixner
    Cc: Christoph Hellwig
    Cc: linux-bl...@vger.kernel.org
    Cc: linux-...@vger.kernel.org
    Cc: linux-n...@lists.infradead.org
    Cc: ax...@fb.com
    Cc: agord...@redhat.com
    Link: http://lkml.kernel.org/r/1467621574-8277-5-git-send-email-...@lst.de
    Signed-off-by: Thomas Gleixner
--
Re: mlx5 broken affinity
On 11/01/2017 06:41 PM, Saeed Mahameed wrote:
> On Wed, Nov 1, 2017 at 11:20 AM, Jes Sorensen wrote:
>> On 11/01/2017 01:21 PM, Sagi Grimberg wrote:
>> I am all in favor of making the automatic setup better, but assuming an
>> automatic setup is always right seems problematic. There could be
>> workloads where you may want to assign affinity explicitly.
>>
>> Jes
>
> I vaguely remember Nacking Sagi's patch as we knew it would break
> mlx5e netdev affinity assumptions.
> Anyway we already submitted the mlx5e patch that removed such assumption
>
> https://patchwork.ozlabs.org/patch/809196/
> ("net/mlx5e: Distribute RSS table among all RX rings")
> Jes please confirm you have it.
>
> And I agree here that user should be able to read
> /proc/irq/x/smp_affinity and even modify it if required.

Hi Saeed,

I can confirm I have that patch - the problem is also present in Linus'
current tree. I can read smp_affinity, but I cannot write to it.

Cheers,
Jes
Re: mlx5 broken affinity
On Wed, Nov 1, 2017 at 11:20 AM, Jes Sorensen wrote:
> On 11/01/2017 01:21 PM, Sagi Grimberg wrote:
>>> Hi,
>>
>> Hi Jes,
>>
>>> The below patch seems to have broken PCI IRQ affinity assignments for
>>> mlx5.
>>
>> I wouldn't call it breaking IRQ affinity assignments. It just makes
>> them automatic.
>
> Well it explicitly breaks the option for an admin to assign them
> manually, which contradicts what is written in
> Documentation/IRQ-affinity.txt
>
>>> Prior to this patch I could echo a value to /proc/irq/<N>/smp_affinity
>>> and it would get assigned. With this patch applied I get -EIO
>>
>> Adding Christoph,
>>
>> Ideally the affinity assignments would be better left alone, but
>> I wasn't aware that they are now immutable. Christoph?
>>
>>> The actual affinity assignments seem to have changed too, but I assume
>>> this is a result of the generic allocation?
>>
>> The affinity assignments should be giving you perfect linear assignment
>> of the rx/tx rings completion vectors to CPU cores:
>> first -> 0
>> second -> 1
>> third -> 2
>> ...
>>
>> Before, mlx5 spread affinity starting from the local numa node as it
>> relied on that when constructing RSS indirection table only to the local
>> numa node (which as a side effect hurt applications running on the far
>> node as the traffic was guaranteed to arrive rx rings located in the
>> "wrong" node).
>>
>> Now the RSS indirection table is linear across both numa nodes just like
>> the irq affinity settings. Another thing that was improved was the
>> pre/post vectors which blacklisted any non rx/tx completion vectors from
>> the affinity assignments (like port events completion vectors, for
>> example).
>
> I am all in favor of making the automatic setup better, but assuming an
> automatic setup is always right seems problematic. There could be
> workloads where you may want to assign affinity explicitly.
>
> Jes

I vaguely remember Nacking Sagi's patch as we knew it would break
mlx5e netdev affinity assumptions.
Anyway we already submitted the mlx5e patch that removed such assumption https://patchwork.ozlabs.org/patch/809196/ ("net/mlx5e: Distribute RSS table among all RX rings") Jes please confirm you have it. And I agree here that user should be able to read /proc/irq/x/smp_affinity and even modify it if required.
Re: mlx5 broken affinity
On 11/01/2017 01:21 PM, Sagi Grimberg wrote:
>> Hi,
>
> Hi Jes,
>
>> The below patch seems to have broken PCI IRQ affinity assignments for
>> mlx5.
>
> I wouldn't call it breaking IRQ affinity assignments. It just makes
> them automatic.

Well it explicitly breaks the option for an admin to assign them
manually, which contradicts what is written in
Documentation/IRQ-affinity.txt

>> Prior to this patch I could echo a value to /proc/irq/<N>/smp_affinity
>> and it would get assigned. With this patch applied I get -EIO
>
> Adding Christoph,
>
> Ideally the affinity assignments would be better left alone, but
> I wasn't aware that they are now immutable. Christoph?
>
>> The actual affinity assignments seem to have changed too, but I assume
>> this is a result of the generic allocation?
>
> The affinity assignments should be giving you perfect linear assignment
> of the rx/tx rings completion vectors to CPU cores:
> first -> 0
> second -> 1
> third -> 2
> ...
>
> Before, mlx5 spread affinity starting from the local numa node as it
> relied on that when constructing RSS indirection table only to the local
> numa node (which as a side effect hurt applications running on the far
> node as the traffic was guaranteed to arrive rx rings located in the
> "wrong" node).
>
> Now the RSS indirection table is linear across both numa nodes just like
> the irq affinity settings. Another thing that was improved was the
> pre/post vectors which blacklisted any non rx/tx completion vectors from
> the affinity assignments (like port events completion vectors, for
> example).

I am all in favor of making the automatic setup better, but assuming an
automatic setup is always right seems problematic. There could be
workloads where you may want to assign affinity explicitly.

Jes
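As background on the interface being discussed: the value echoed into smp_affinity is a hexadecimal CPU bitmask. A minimal sketch of building one follows (the IRQ number 50 is only an example; the actual write requires root, and on a managed IRQ it now fails with EIO as reported in this thread):

```shell
# Build the hex bitmask for a set of CPU ids, in the format that
# /proc/irq/<N>/smp_affinity expects (bit i set => CPU i allowed).
cpus_to_mask() {
    local mask=0 cpu
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

cpus_to_mask 2 4    # prints "14": bits 2 and 4 set
# echo "$(cpus_to_mask 2 4)" > /proc/irq/50/smp_affinity   # needs root
```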
Re: mlx5 broken affinity
> Hi,

Hi Jes,

> The below patch seems to have broken PCI IRQ affinity assignments for
> mlx5.

I wouldn't call it breaking IRQ affinity assignments. It just makes
them automatic.

> Prior to this patch I could echo a value to /proc/irq/<N>/smp_affinity
> and it would get assigned. With this patch applied I get -EIO

Adding Christoph,

Ideally the affinity assignments would be better left alone, but
I wasn't aware that they are now immutable. Christoph?

> The actual affinity assignments seem to have changed too, but I assume
> this is a result of the generic allocation?

The affinity assignments should be giving you perfect linear assignment
of the rx/tx rings completion vectors to CPU cores:
first -> 0
second -> 1
third -> 2
...

Before, mlx5 spread affinity starting from the local numa node as it
relied on that when constructing RSS indirection table only to the local
numa node (which as a side effect hurt applications running on the far
node as the traffic was guaranteed to arrive rx rings located in the
"wrong" node).

Now the RSS indirection table is linear across both numa nodes just like
the irq affinity settings. Another thing that was improved was the
pre/post vectors which blacklisted any non rx/tx completion vectors from
the affinity assignments (like port events completion vectors, for
example).

> With this applied I get:
> [root@testbox /proc/irq]# cat 50/smp_affinity
> 00,0010
> [root@testbox /proc/irq]# cat 100/smp_affinity
> 10,
>
> Without I get:
> [root@testbox /proc/irq]# cat 50/smp_affinity
> 00,0200
> [root@testbox /proc/irq]# cat 100/smp_affinity
> 000100,
>
> I am not wildly familiar with the affinity assignment code. Is there
> something obvious I am missing here?

The affinity assignment is indeed changed, should be better though.
If you do not see a linear (diagonal) affinity assignment then it's a bug
that needs to be fixed (probably in msi.c code).
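The masks quoted above use the kernel's bitmask format: comma-separated 32-bit hex groups, least significant group last. A small illustrative decoder (not part of the thread or the kernel) shows which CPU each mask selects:

```python
def decode_smp_affinity(mask):
    """Turn a /proc/irq/<N>/smp_affinity string into a list of CPU ids.

    Groups of hex digits are comma-separated, least significant last;
    an empty (truncated) group counts as zero.
    """
    value = 0
    for group in mask.strip().split(","):
        value = (value << 32) | int(group or "0", 16)
    # Bit i set means CPU i is in the allowed set.
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]

print(decode_smp_affinity("00,0010"))  # [4] - IRQ 50 with the patch
print(decode_smp_affinity("00,0200"))  # [9] - IRQ 50 without it
```

So with the patch applied IRQ 50 is steered to CPU 4, where before it went to CPU 9, consistent with the linear spread Sagi describes.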