Re: mlx5 broken affinity

2017-11-09 Thread Jes Sorensen
On 11/08/2017 12:33 PM, Thomas Gleixner wrote:
> On Wed, 8 Nov 2017, Jes Sorensen wrote:
>> On 11/07/2017 10:07 AM, Thomas Gleixner wrote:
>>> Depending on the machine and the number of queues this might even result in
>>> completely losing the ability to suspend/hibernate because the number of
>>> available vectors on CPU0 is not sufficient to accommodate all queue
>>> interrupts.
>>
>> Depending on the system, suspend/resume is really the less interesting
>> issue to the user. Pretty much any system with a 10/25Gbps mlx5 NIC in
>> it will be in some sort of rack and is unlikely to ever want to
>> suspend/resume.
> 
> The discussions with Intel about that tell a different story and cpu
> online/offline for power management purposes is - while debatable - widely
> used.

I certainly do not want to dispute that, it just underlines that
different users have different priorities.

>>> A lot of things are possible, the question is whether it makes sense. The
>>> whole point is to have resources (queues, interrupts etc.) per CPU and have
>>> them strictly associated.
>>>
>>> Why would you give the user a knob to destroy what you carefully optimized?
>>>
>>> Just because we can and just because users love those knobs or is there any
>>> real technical reason?
>>
>> Because the user sometimes knows better based on statically assigned
>> loads, or the user wants consistency across kernels. It's great that the
>> system is better at allocating this now, but we also need to allow for a
>> user to change it. Like anything on Linux, a user wanting to blow off
>> his/her own foot, should be allowed to do so.
> 
> That's fine, but that's not what the managed affinity facility provides. If
> you want to leverage the spread mechanism, but avoid the managed part, then
> this is a different story and we need to figure out how to provide that
> without breaking the managed side of it.
> 
> As I said, it's possible, but I vehemently disagree that this is a bug in
> the core code, as was claimed several times in this thread.

So it may be that my original message was confusing on this. Currently
IRQ-affinity.txt describes how a user can change the affinity, but that
no longer works if an IRQ is marked managed. That is what I qualified as
a bug in my original posting. If an IRQ is not meant to have its
affinity modified because it is managed, then I think that should also
be reflected by the permissions on the /proc file changing to read-only
(r--r--r--), rather than the write failing with -EIO, but that is a
minor detail IMHO.
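
For reference, the -EIO comes from the proc write handler refusing
managed IRQs up front; roughly this, paraphrasing kernel/irq/proc.c
rather than quoting it verbatim:

--
/* Paraphrased sketch: the write handler bails out before even
 * parsing the mask when the IRQ is managed, hence -EIO instead
 * of -EPERM or a read-only file mode. */
static ssize_t write_irq_affinity(int type, struct file *file,
                const char __user *buffer, size_t count, loff_t *pos)
{
        unsigned int irq = (unsigned int)(long)PDE_DATA(file_inode(file));

        if (!irq_can_set_affinity_usr(irq) || no_irq_affinity)
                return -EIO;

        /* ... parse the user mask and call irq_set_affinity() ... */
        return count;
}
--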

None of this is a major showstopper for the next kernel release, as long
as we can work on a longer term fix that satisfies everyone.

What I would ideally like to see is the option for drivers to benefit
from the new allocation scheme without being locked in, and also have the
option to go managed if that is right for the given device. Basically a
best-of-both-worlds situation.

> The real issue is that the driver was converted to something which was
> expected to behave differently. That's hardly a bug in the core code, at
> most it's a documentation problem.

When I hit this issue I read the driver commit stating it was taking
advantage of the new allocation. Not having been part of the discussion
and design of the new core code, I just caught what was described in the
commit message for mlx5.

For now I have just reverted the offending mlx5 patch in our kernel
while we figure out how to resolve it long term.

Cheers,
Jes


Re: mlx5 broken affinity

2017-11-09 Thread Jens Axboe
On 11/09/2017 02:23 PM, Thomas Gleixner wrote:
> On Thu, 9 Nov 2017, Jens Axboe wrote:
>> On 11/09/2017 10:03 AM, Thomas Gleixner wrote:
>>> On Thu, 9 Nov 2017, Jens Axboe wrote:
>>>> On 11/09/2017 07:19 AM, Thomas Gleixner wrote:
>>>> If that's the attitude at your end, then I do suggest we just revert the
>>>> driver changes. Clearly this isn't going to be productive going forward.
>>>>
>>>> The better solution was to make the managed setup more flexible, but
>>>> it doesn't sound like that is going to be viable at all.
>>>
>>> That's not true. I indicated several times, that we can do that, but not
>>> just by breaking the managed facility.
>>>
>>> What I'm arguing against is that the blame is focused on those who
>>> implemented the managed facility with the existing semantics.
>>>
>>> I'm still waiting for a proper description of what needs to be changed in
>>> order to make these drivers work again. All I have seen so far is to break
>>> managed interrupts completely and that's not going to happen.
>>
>> There's no blame as far as I'm concerned, just frustration that we broke
>> this and folks apparently not acknowledging that it's a concern.
>>
>> What used to work was that you could move IRQs around as you wanted to.
>> That was very useful for custom setups, for tuning, or for isolation
>> purposes. None of that works now, which is unfortunate.
> 
> Well, it's unfortunate and I can understand your frustration, but I really
> have a hard time understanding that these concerns were not brought up
> when the whole thing was discussed and in the early stages of
> implementation, way before it got merged.
> 
> It was not my decision to make it the way it is and I merely try to prevent
> hasty and damaging hackery right now.

There's no rush, we can take our time to get it right. We won't get this
fixed for 4.14 anyway.

On my end, I don't think the limitation of removing user tweakability was
made clear enough. I'm all for moving common code into helpers and
frameworks, but this currently has limitations that aren't acceptable,
imho. This is partly my fault; I should have realized this was the case.

As long as we can move forward in a productive fashion, then that's fine
with me.

-- 
Jens Axboe



Re: mlx5 broken affinity

2017-11-09 Thread Thomas Gleixner
On Thu, 9 Nov 2017, Jens Axboe wrote:
> On 11/09/2017 10:07 AM, Thomas Gleixner wrote:
> > I say it one last time: It can be done and I'm willing to help.
> 
> It didn't sound like it earlier, but that's good news.

Well, I'm equally frustrated by this whole thing, but I certainly never
said that I'm not willing to change anything.

Thanks,

tglx




Re: mlx5 broken affinity

2017-11-09 Thread Thomas Gleixner
On Thu, 9 Nov 2017, Jens Axboe wrote:
> On 11/09/2017 10:03 AM, Thomas Gleixner wrote:
> > On Thu, 9 Nov 2017, Jens Axboe wrote:
> >> On 11/09/2017 07:19 AM, Thomas Gleixner wrote:
> >> If that's the attitude at your end, then I do suggest we just revert the
> >> driver changes. Clearly this isn't going to be productive going forward.
> >>
> >> The better solution was to make the managed setup more flexible, but
> >> it doesn't sound like that is going to be viable at all.
> > 
> > That's not true. I indicated several times, that we can do that, but not
> > just by breaking the managed facility.
> > 
> > What I'm arguing against is that the blame is focused on those who
> > implemented the managed facility with the existing semantics.
> > 
> > I'm still waiting for a proper description of what needs to be changed in
> > order to make these drivers work again. All I have seen so far is to break
> > managed interrupts completely and that's not going to happen.
> 
> There's no blame as far as I'm concerned, just frustration that we broke
> this and folks apparently not acknowledging that it's a concern.
> 
> What used to work was that you could move IRQs around as you wanted to.
> That was very useful for custom setups, for tuning, or for isolation
> purposes. None of that works now, which is unfortunate.

Well, it's unfortunate and I can understand your frustration, but I really
have a hard time understanding that these concerns were not brought up
when the whole thing was discussed and in the early stages of
implementation, way before it got merged.

It was not my decision to make it the way it is and I merely try to prevent
hasty and damaging hackery right now.

I'll answer to the technical details in a separate mail.

Thanks,

tglx


Re: mlx5 broken affinity

2017-11-09 Thread Jens Axboe
On 11/09/2017 10:07 AM, Thomas Gleixner wrote:
> On Thu, 9 Nov 2017, Jens Axboe wrote:
> 
>> On 11/09/2017 09:01 AM, Sagi Grimberg wrote:
>>>> Now you try to blame the people who implemented the managed affinity stuff
>>>> for the wreckage, which was created by people who changed drivers to use
>>>> it. Nice try.
>>>
>>> I'm not trying to blame anyone, really. I was just trying to understand
>>> how to move forward with making users happy and still enjoy subsystem
>>> services instead of doing lots of similar things inside mlx5 driver.
>>
>> Exactly. The key here is how we make it work for both cases. But there
>> has to be a willingness to make the infrastructure work for that.
> 
> I say it one last time: It can be done and I'm willing to help.

It didn't sound like it earlier, but that's good news.

> Please tell me what exactly you expect and I can have a look what needs to
> be done w/o creating an utter mess.

See the previous email, should have all the details.

-- 
Jens Axboe



Re: mlx5 broken affinity

2017-11-09 Thread Jens Axboe
On 11/09/2017 10:03 AM, Thomas Gleixner wrote:
> On Thu, 9 Nov 2017, Jens Axboe wrote:
>> On 11/09/2017 07:19 AM, Thomas Gleixner wrote:
>> If that's the attitude at your end, then I do suggest we just revert the
>> driver changes. Clearly this isn't going to be productive going forward.
>>
>> The better solution was to make the managed setup more flexible, but
>> it doesn't sound like that is going to be viable at all.
> 
> That's not true. I indicated several times, that we can do that, but not
> just by breaking the managed facility.
> 
> What I'm arguing against is that the blame is focused on those who
> implemented the managed facility with the existing semantics.
> 
> I'm still waiting for a proper description of what needs to be changed in
> order to make these drivers work again. All I have seen so far is to break
> managed interrupts completely and that's not going to happen.

There's no blame as far as I'm concerned, just frustration that we broke
this and folks apparently not acknowledging that it's a concern.

What used to work was that you could move IRQs around as you wanted to.
That was very useful for custom setups, for tuning, or for isolation
purposes. None of that works now, which is unfortunate.

As a bare minimum, being able to specify a static mask at load time
would be useful. It's still not nearly as useful as it was before
where you could move them at runtime, but at least it's better than
what we have right now.
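
Something as dumb as a module parameter would already cover a lot of
the static cases. Purely a sketch, nothing like this exists in any
driver today and the names are made up:

--
/* Hypothetical load-time knob: a static CPU list for the queue
 * IRQs, parsed into a cpumask and used instead of the default
 * spread. Parameter name and helper are invented. */
static char *irq_cpu_list;
module_param(irq_cpu_list, charp, 0444);
MODULE_PARM_DESC(irq_cpu_list, "Static CPU list for queue IRQs, e.g. 2-7,9");

static int drv_static_affinity(struct cpumask *mask)
{
        if (!irq_cpu_list)
                return -ENOENT; /* fall back to the default spread */
        return cpulist_parse(irq_cpu_list, mask);
}
--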

-- 
Jens Axboe



Re: mlx5 broken affinity

2017-11-09 Thread Thomas Gleixner
On Thu, 9 Nov 2017, Jens Axboe wrote:

> On 11/09/2017 09:01 AM, Sagi Grimberg wrote:
> >> Now you try to blame the people who implemented the managed affinity stuff
> >> for the wreckage, which was created by people who changed drivers to use
> >> it. Nice try.
> > 
> > I'm not trying to blame anyone, really. I was just trying to understand
> > how to move forward with making users happy and still enjoy subsystem
> > services instead of doing lots of similar things inside mlx5 driver.
> 
> Exactly. The key here is how we make it work for both cases. But there
> has to be a willingness to make the infrastructure work for that.

I say it one last time: It can be done and I'm willing to help.

Please tell me what exactly you expect and I can have a look what needs to
be done w/o creating an utter mess.

Thanks,

tglx


Re: mlx5 broken affinity

2017-11-09 Thread Thomas Gleixner
On Thu, 9 Nov 2017, Jens Axboe wrote:
> On 11/09/2017 07:19 AM, Thomas Gleixner wrote:
> If that's the attitude at your end, then I do suggest we just revert the
> driver changes. Clearly this isn't going to be productive going forward.
> 
> The better solution was to make the managed setup more flexible, but
> it doesn't sound like that is going to be viable at all.

That's not true. I indicated several times, that we can do that, but not
just by breaking the managed facility.

What I'm arguing against is that the blame is focused on those who
implemented the managed facility with the existing semantics.

I'm still waiting for a proper description of what needs to be changed in
order to make these drivers work again. All I have seen so far is to break
managed interrupts completely and that's not going to happen.

Thanks,

tglx






Re: mlx5 broken affinity

2017-11-09 Thread Jens Axboe
On 11/09/2017 09:01 AM, Sagi Grimberg wrote:
>> Now you try to blame the people who implemented the managed affinity stuff
>> for the wreckage, which was created by people who changed drivers to use
>> it. Nice try.
> 
> I'm not trying to blame anyone, really. I was just trying to understand
> how to move forward with making users happy and still enjoy subsystem
> services instead of doing lots of similar things inside mlx5 driver.

Exactly. The key here is how we make it work for both cases. But there
has to be a willingness to make the infrastructure work for that.

-- 
Jens Axboe



Re: mlx5 broken affinity

2017-11-09 Thread Sagi Grimberg



The early discussion of the managed facility came to the conclusion that it
will manage this stuff completely to allow fixed association of 'queue /
interrupt / corresponding memory' to a single CPU or a set of CPUs. That
removes a lot of 'affinity' handling magic from the driver and utilizes the
hardware in a sensible way. That was not my decision, really. It surely
made sense to me and I helped Christoph to implement it.

The real question is whether you want to have the fixed 'queue / interrupt/
corresponding memory association' and get rid of all the 'affinity' dance
in your driver or not.

If you answer that question with 'yes' then the consequence is that there
is no knob.

If you answer that question with 'no' then you should not use
the managed facility in the first place and if you need parts of that
functionality then this needs to be added to the core code _before_ a
driver gets converted and not afterwards.


point taken.


It's not my problem if people decide to use this and then trip over the
limitations after the changes hit the tree. This could have been figured
out before even a single patch was posted.


That's correct, I could have known that, but I didn't, and from your
reply, I understand there is really only a single way forward...


Now you try to blame the people who implemented the managed affinity stuff
for the wreckage, which was created by people who changed drivers to use
it. Nice try.


I'm not trying to blame anyone, really. I was just trying to understand
how to move forward with making users happy and still enjoy subsystem
services instead of doing lots of similar things inside mlx5 driver.


Re: mlx5 broken affinity

2017-11-09 Thread Sagi Grimberg



Again, I think Jes or others can provide more information.


Sagi, I believe Jes is not trying to argue about what initial affinity
values you give to the driver. We have a very critical regression that
is afflicting live systems today and common tools that already exist
in various distros, such as irqbalance, which solely depends on the
smp_affinity proc entry that is now not writable due to this
regression. Please see
https://github.com/Irqbalance/irqbalance/blob/master/activate.c

Some users would like to have their network traffic handled on some
cores and free up other cores for other purposes; you just can't take
that away from them.

If reverting the mlx5 patches would solve the issue, I am afraid that
is the solution I am going to go with, until the regression is fixed.


Well, I completely agree with you Saeed. If this is causing a regression
for mlx5 users and we can't find a way for the managed interface to expose
this knob to userspace, I don't see any other choice either.


Re: mlx5 broken affinity

2017-11-09 Thread Jens Axboe
On 11/09/2017 07:19 AM, Thomas Gleixner wrote:
> On Thu, 9 Nov 2017, Sagi Grimberg wrote:
>> Thomas,
>>
>>>> Because the user sometimes knows better based on statically assigned
>>>> loads, or the user wants consistency across kernels. It's great that the
>>>> system is better at allocating this now, but we also need to allow for a
>>>> user to change it. Like anything on Linux, a user wanting to blow off
>>>> his/her own foot, should be allowed to do so.
>>>
>>> That's fine, but that's not what the managed affinity facility provides. If
>>> you want to leverage the spread mechanism, but avoid the managed part, then
>>> this is a different story and we need to figure out how to provide that
>>> without breaking the managed side of it.
>>>
>>> As I said, it's possible, but I vehemently disagree that this is a bug in
>>> the core code, as was claimed several times in this thread.
>>>
>>> The real issue is that the driver was converted to something which was
>>> expected to behave differently. That's hardly a bug in the core code, at
>>> most it's a documentation problem.
>>
>> I disagree here, this is not a device specific discussion. The question
>> of exposing an IRQ affinity assignment knob to user-space holds for every
>> device (NICs, HBAs and the like). The same issue could have been raised on
>> any other device using managed affinitization (and we have quite a few
>> of those now). I can't see any reason why it's OK for device X to use it
>> while it's not OK for device Y to use it.
>>
>> If the resolution is "YES we must expose a knob to user-space" then the
>> managed facility is unusable in its current form for *any* device. If the
>> answer is "NO, user-space can't handle all the stuff the kernel can" then
>> we should document it. This is really device independent.
> 
> The early discussion of the managed facility came to the conclusion that it
> will manage this stuff completely to allow fixed association of 'queue /
> interrupt / corresponding memory' to a single CPU or a set of CPUs. That
> removes a lot of 'affinity' handling magic from the driver and utilizes the
> hardware in a sensible way. That was not my decision, really. It surely
> made sense to me and I helped Christoph to implement it.
> 
> The real question is whether you want to have the fixed 'queue / interrupt/
> corresponding memory association' and get rid of all the 'affinity' dance
> in your driver or not.
> 
> If you answer that question with 'yes' then the consequence is that there
> is no knob.
> 
> If you answer that question with 'no' then you should not use
> the managed facility in the first place and if you need parts of that
> functionality then this needs to be added to the core code _before_ a
> driver gets converted and not afterwards.
> 
> It's not my problem if people decide to use this and then trip over the
> limitations after the changes hit the tree. This could have been figured
> out before even a single patch was posted.
> 
> Now you try to blame the people who implemented the managed affinity stuff
> for the wreckage, which was created by people who changed drivers to use
> it. Nice try.

If that's the attitude at your end, then I do suggest we just revert the
driver changes. Clearly this isn't going to be productive going forward.

The better solution was to make the managed setup more flexible, but
it doesn't sound like that is going to be viable at all.

-- 
Jens Axboe



Re: mlx5 broken affinity

2017-11-09 Thread Jens Axboe
On 11/09/2017 03:50 AM, Sagi Grimberg wrote:
> Thomas,
> 
>>> Because the user sometimes knows better based on statically assigned
>>> loads, or the user wants consistency across kernels. It's great that the
>>> system is better at allocating this now, but we also need to allow for a
>>> user to change it. Like anything on Linux, a user wanting to blow off
>>> his/her own foot, should be allowed to do so.
>>
>> That's fine, but that's not what the managed affinity facility provides. If
>> you want to leverage the spread mechanism, but avoid the managed part, then
>> this is a different story and we need to figure out how to provide that
>> without breaking the managed side of it.
>>
>> As I said, it's possible, but I vehemently disagree that this is a bug in
>> the core code, as was claimed several times in this thread.
>>
>> The real issue is that the driver was converted to something which was
>> expected to behave differently. That's hardly a bug in the core code, at
>> most it's a documentation problem.
> 
> I disagree here, this is not a device specific discussion. The question
> of exposing an IRQ affinity assignment knob to user-space holds for every
> device (NICs, HBAs and the like). The same issue could have been raised on
> any other device using managed affinitization (and we have quite a few
> of those now). I can't see any reason why it's OK for device X to use it
> while it's not OK for device Y to use it.
> 
> If the resolution is "YES we must expose a knob to user-space" then the
> managed facility is unusable in its current form for *any* device. If
> the answer is "NO, user-space can't handle all the stuff the kernel can"
> then we should document it. This is really device independent.

Completely agree, and honestly I'm pretty baffled we're even having
to argue this point.

-- 
Jens Axboe



Re: mlx5 broken affinity

2017-11-09 Thread Jens Axboe
On 11/09/2017 03:09 AM, Christoph Hellwig wrote:
> On Wed, Nov 08, 2017 at 09:13:59AM -0700, Jens Axboe wrote:
>> There are numerous valid reasons to be able to set the affinity, for
>> both nics and block drivers. It's great that the kernel has a predefined
>> layout that works well, but users do need the flexibility to be able to
>> reconfigure affinities, to suit their needs.
> 
> Managed interrupts are about more than just setting affinities - they
> bind a vector (which means a queue) to a specific cpu or set of cpus.
> This is extremely useful to deal with hotplug issues and also makes
> life a lot easier in general.

I know why it was done.

>> But that particular case is completely orthogonal to whether or not we
>> should allow the user to reconfigure. The answer to that is clearly YES,
>> and we should ensure that this is possible.
> 
> And why is the answer yes? If the answer is YES it means we need explicit
> boilerplate code to deal with cpu hotplug in every driver not using
> managed interrupts (even if optionally), so it is a non-trivial tradeoff.

The answer is yes because as with any other kernel level policy, there
are situations where it isn't the optimal solution. I ran into this myself
with nvme recently, and I started cursing the managed setup for that
very reason. You can even argue this is a regression, since we used
to be able to move things around as we saw fit. If the answer is
boilerplate code to make it feasible, then we need that boilerplate
code.

-- 
Jens Axboe



Re: mlx5 broken affinity

2017-11-09 Thread Saeed Mahameed
On Wed, 2017-11-08 at 09:27 +0200, Sagi Grimberg wrote:
> > Depending on the machine and the number of queues this might even
> > result in completely losing the ability to suspend/hibernate because
> > the number of available vectors on CPU0 is not sufficient to
> > accommodate all queue interrupts.
> > 
> > > Would it be possible to keep the managed facility until a user
> > > overrides an affinity assignment? This way if the user didn't touch
> > > it, we keep all the perks, and in case the user overrides it, we log
> > > the implication so the user is aware?
> > 
> > A lot of things are possible, the question is whether it makes sense.
> > The whole point is to have resources (queues, interrupts etc.) per CPU
> > and have them strictly associated.
> 
> Not arguing here.
> 
> > Why would you give the user a knob to destroy what you carefully
> > optimized?
> 
> Well, looks like someone relies on this knob, the question is if he is
> doing something better for his workload. I don't know, it's really up
> to the user to say.
> 
> > Just because we can and just because users love those knobs or is
> > there any real technical reason?
> 
> Again, I think Jes or others can provide more information.

Sagi, I believe Jes is not trying to argue about what initial affinity
values you give to the driver. We have a very critical regression that
is afflicting live systems today and common tools that already exist
in various distros, such as irqbalance, which solely depends on the
smp_affinity proc entry that is now not writable due to this
regression. Please see
https://github.com/Irqbalance/irqbalance/blob/master/activate.c
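
To make it concrete, this is more or less all irqbalance does per IRQ,
condensed into a sketch (not the actual activate.c code):

--
/* Condensed userspace sketch: write the chosen mask to
 * /proc/irq/<nr>/smp_affinity. With managed IRQs this write
 * now fails with EIO instead of taking effect. */
#include <errno.h>
#include <stdio.h>

static int set_irq_affinity(int irq, const char *mask_hex)
{
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        f = fopen(path, "w");
        if (!f)
                return -errno;
        /* with a managed IRQ the kernel rejects this write with EIO */
        if (fprintf(f, "%s", mask_hex) < 0 || fclose(f) == EOF)
                return -EIO;
        return 0;
}
--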

Some users would like to have their network traffic handled on some
cores and free up other cores for other purposes; you just can't take
that away from them.

If reverting the mlx5 patches would solve the issue, I am afraid that
is the solution I am going to go with, until the regression is fixed.





Re: mlx5 broken affinity

2017-11-09 Thread Thomas Gleixner
On Thu, 9 Nov 2017, Sagi Grimberg wrote:
> Thomas,
> 
> > > Because the user sometimes knows better based on statically assigned
> > > loads, or the user wants consistency across kernels. It's great that the
> > > system is better at allocating this now, but we also need to allow for a
> > > user to change it. Like anything on Linux, a user wanting to blow off
> > > his/her own foot, should be allowed to do so.
> > 
> > That's fine, but that's not what the managed affinity facility provides. If
> > you want to leverage the spread mechanism, but avoid the managed part, then
> > this is a different story and we need to figure out how to provide that
> > without breaking the managed side of it.
> > 
> > As I said, it's possible, but I vehemently disagree that this is a bug in
> > the core code, as was claimed several times in this thread.
> > 
> > The real issue is that the driver was converted to something which was
> > expected to behave differently. That's hardly a bug in the core code, at
> > most it's a documentation problem.
> 
> I disagree here, this is not a device specific discussion. The question
> of exposing an IRQ affinity assignment knob to user-space holds for every
> device (NICs, HBAs and the like). The same issue could have been raised on
> any other device using managed affinitization (and we have quite a few
> of those now). I can't see any reason why it's OK for device X to use it
> while it's not OK for device Y to use it.
>
> If the resolution is "YES we must expose a knob to user-space" then the
> managed facility is unusable in its current form for *any* device. If the
> answer is "NO, user-space can't handle all the stuff the kernel can" then
> we should document it. This is really device independent.

The early discussion of the managed facility came to the conclusion that it
will manage this stuff completely to allow fixed association of 'queue /
interrupt / corresponding memory' to a single CPU or a set of CPUs. That
removes a lot of 'affinity' handling magic from the driver and utilizes the
hardware in a sensible way. That was not my decision, really. It surely
made sense to me and I helped Christoph to implement it.

The real question is whether you want to have the fixed 'queue / interrupt/
corresponding memory association' and get rid of all the 'affinity' dance
in your driver or not.

If you answer that question with 'yes' then the consequence is that there
is no knob.

If you answer that question with 'no' then you should not use
the managed facility in the first place and if you need parts of that
functionality then this needs to be added to the core code _before_ a
driver gets converted and not afterwards.
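
For illustration, the 'yes' side looks roughly like the sketch below
from the driver's point of view, assuming vectors allocated with
PCI_IRQ_AFFINITY; the queue structure and function names are made up:

--
struct drv_queue { void *rings; /* etc. */ };

/* Sketch: per-queue memory allocated on the node of the CPUs the
 * managed vector is bound to. pci_irq_get_affinity() returns the
 * mask the core computed, and it stays stable across cpu
 * offline/online, which is the whole point of the managed facility. */
static int drv_alloc_queues(struct pci_dev *pdev, struct drv_queue **queues)
{
        int i, nvec;

        nvec = pci_alloc_irq_vectors(pdev, 1, num_online_cpus(),
                                     PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
        if (nvec < 0)
                return nvec;

        for (i = 0; i < nvec; i++) {
                const struct cpumask *mask = pci_irq_get_affinity(pdev, i);
                int node = cpu_to_node(cpumask_first(mask));

                /* queue memory lives on the node of its CPUs */
                queues[i] = kzalloc_node(sizeof(**queues), GFP_KERNEL, node);
                if (!queues[i])
                        return -ENOMEM;
        }
        return nvec;
}
--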

It's not my problem if people decide to use this and then trip over the
limitations after the changes hit the tree. This could have been figured
out before even a single patch was posted.

Now you try to blame the people who implemented the managed affinity stuff
for the wreckage, which was created by people who changed drivers to use
it. Nice try.

Thanks,

tglx


Re: mlx5 broken affinity

2017-11-09 Thread Sagi Grimberg

Thomas,


Because the user sometimes knows better based on statically assigned
loads, or the user wants consistency across kernels. It's great that the
system is better at allocating this now, but we also need to allow for a
user to change it. Like anything on Linux, a user wanting to blow off
his/her own foot, should be allowed to do so.


That's fine, but that's not what the managed affinity facility provides. If
you want to leverage the spread mechanism, but avoid the managed part, then
this is a different story and we need to figure out how to provide that
without breaking the managed side of it.

As I said, it's possible, but I vehemently disagree that this is a bug in
the core code, as was claimed several times in this thread.

The real issue is that the driver was converted to something which was
expected to behave differently. That's hardly a bug in the core code, at
most it's a documentation problem.


I disagree here, this is not a device specific discussion. The question
of exposing an IRQ affinity assignment knob to user-space holds for every
device (NICs, HBAs and the like). The same issue could have been raised on
any other device using managed affinitization (and we have quite a few
of those now). I can't see any reason why it's OK for device X to use it
while it's not OK for device Y to use it.

If the resolution is "YES we must expose a knob to user-space" then the
managed facility is unusable in its current form for *any* device. If
the answer is "NO, user-space can't handle all the stuff the kernel can"
then we should document it. This is really device independent.


Re: mlx5 broken affinity

2017-11-09 Thread Christoph Hellwig
On Wed, Nov 08, 2017 at 09:13:59AM -0700, Jens Axboe wrote:
> There are numerous valid reasons to be able to set the affinity, for
> both nics and block drivers. It's great that the kernel has a predefined
> layout that works well, but users do need the flexibility to be able to
> reconfigure affinities, to suit their needs.

Managed interrupts are about more than just setting affinities - they
bind a vector (which means a queue) to a specific cpu or set of cpus.
This is extremely useful to deal with hotplug issues and also makes
life a lot easier in general.

> But that particular case is completely orthogonal to whether or not we
> should allow the user to reconfigure. The answer to that is clearly YES,
> and we should ensure that this is possible.

And why is the answer yes? If the answer is YES it means we need explicit
boilerplate code to deal with cpu hotplug in every driver not using
managed interrupts (even if optionally), so it is a non-trivial tradeoff.
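
Just to spell out what that boilerplate looks like: every such driver
ends up carrying roughly the below, with the callback bodies doing the
driver specific queue and affinity fixups (a sketch only, the names
are invented):

--
/* Sketch of the per-driver hotplug glue a non-managed driver needs:
 * register online/offline callbacks and re-steer queues and IRQ
 * affinity by hand on every cpu hotplug event. */
static int drv_cpu_online(unsigned int cpu)
{
        /* re-spread queues, extend IRQ affinity to include cpu */
        return 0;
}

static int drv_cpu_offline(unsigned int cpu)
{
        /* drain the queue mapped to cpu, move its IRQ elsewhere */
        return 0;
}

static int drv_register_hotplug(void)
{
        return cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "drv:online",
                                 drv_cpu_online, drv_cpu_offline);
}
--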


Re: mlx5 broken affinity

2017-11-08 Thread Thomas Gleixner
On Wed, 8 Nov 2017, Jes Sorensen wrote:
> On 11/07/2017 10:07 AM, Thomas Gleixner wrote:
> > On Sun, 5 Nov 2017, Sagi Grimberg wrote:
> >> I do agree that the user would lose better cpu online/offline behavior,
> >> but it seems that users want to still have some control over the IRQ
> >> affinity assignments even if they lose this functionality.
> > 
> > Depending on the machine and the number of queues this might even result in
> > completely losing the ability to suspend/hibernate because the number of
> > available vectors on CPU0 is not sufficient to accommodate all queue
> > interrupts.
> 
> Depending on the system, suspend/resume is really the less interesting
> issue to the user. Pretty much any system with a 10/25Gbps mlx5 NIC in
> it will be in some sort of rack and is unlikely to ever want to
> suspend/resume.

The discussions with Intel about that tell a different story and cpu
online/offline for power management purposes is - while debatable - widely
used.

That's where the whole idea for managed affinities originated from along
with avoiding the affinity hint and affinity notifier machinery which
creates more problems than it solves.

> >> Would it be possible to keep the managed facility until a user overrides
> >> an affinity assignment? This way if the user didn't touch it, we keep
> >> all the perks, and in case the user overrides it, we log the implication
> >> so the user is aware?
> > 
> > A lot of things are possible, the question is whether it makes sense. The
> > whole point is to have resources (queues, interrupts etc.) per CPU and have
> > them strictly associated.
> > 
> > Why would you give the user a knob to destroy what you carefully optimized?
> > 
> > Just because we can and just because users love those knobs or is there any
> > real technical reason?
> 
> Because the user sometimes knows better based on statically assigned
> loads, or the user wants consistency across kernels. It's great that the
> system is better at allocating this now, but we also need to allow for a
> user to change it. Like anything on Linux, a user wanting to blow off
> his/her own foot, should be allowed to do so.

That's fine, but that's not what the managed affinity facility provides. If
you want to leverage the spread mechanism, but avoid the managed part, then
this is a different story and we need to figure out how to provide that
without breaking the managed side of it.

As I said, it's possible, but I vehemently disagree that this is a bug in
the core code, as was claimed several times in this thread.

The real issue is that the driver was converted to something which was
expected to behave differently. That's hardly a bug in the core code, at
most it's a documentation problem.

Thanks,

tglx





Re: mlx5 broken affinity

2017-11-08 Thread Jes Sorensen
On 11/07/2017 10:07 AM, Thomas Gleixner wrote:
> On Sun, 5 Nov 2017, Sagi Grimberg wrote:
>> I do agree that the user would lose better cpu online/offline behavior,
>> but it seems that users want to still have some control over the IRQ
>> affinity assignments even if they lose this functionality.
> 
> Depending on the machine and the number of queues this might even result in
> completely losing the ability to suspend/hibernate because the number of
> available vectors on CPU0 is not sufficient to accommodate all queue
> interrupts.

Depending on the system, suspend/resume is really the less interesting
issue to the user. Pretty much any system with a 10/25Gbps mlx5 NIC in
it will be in some sort of rack and is unlikely to ever want to
suspend/resume.

>> Would it be possible to keep the managed facility until a user overrides
>> an affinity assignment? This way if the user didn't touch it, we keep
>> all the perks, and in case the user overrides it, we log the implication
>> so the user is aware?
> 
> A lot of things are possible, the question is whether it makes sense. The
> whole point is to have resources (queues, interrupts etc.) per CPU and have
> them strictly associated.
> 
> Why would you give the user a knob to destroy what you carefully optimized?
> 
> Just because we can and just because users love those knobs or is there any
> real technical reason?

Because the user sometimes knows better based on statically assigned
loads, or the user wants consistency across kernels. It's great that the
system is better at allocating this now, but we also need to allow for a
user to change it. Like anything on Linux, a user wanting to blow off
his/her own foot, should be allowed to do so.

Cheers,
Jes


Re: mlx5 broken affinity

2017-11-08 Thread Jens Axboe
On 11/08/2017 05:21 AM, David Laight wrote:
> From: Sagi Grimberg
>> Sent: 08 November 2017 07:28
> ...
>>> Why would you give the user a knob to destroy what you carefully optimized?
>>
>> Well, looks like someone relies on this knob, the question is if he is
>> doing something better for his workload. I don't know, it's really up to
>> the user to say.
> 
> Maybe the user wants to ensure that nothing except some very specific
> processing happens on some (or most) of the cpu cores.
> 
> If the expected overall ethernet data rate isn't exceptionally large,
> is there any reason to allocate a queue (etc.) for every cpu?

There are numerous valid reasons to be able to set the affinity, for
both nics and block drivers. It's great that the kernel has a predefined
layout that works well, but users do need the flexibility to be able to
reconfigure affinities, to suit their needs.

For the particular mlx5 case, I'm actually not sure how the FB
configuration differs from the in-kernel stuff. I'll take a look at
that. It may just be that the configuration exists because the code used
to be driver private and frequently changed, setting it at bootup to a
known good configuration helped eliminate problems when upgrading
kernels. I also remember some cases of removing CPU0 from the mask.

But that particular case is completely orthogonal to whether or not we
should allow the user to reconfigure. The answer to that is clearly YES,
and we should ensure that this is possible.

-- 
Jens Axboe



RE: mlx5 broken affinity

2017-11-08 Thread David Laight
From: Sagi Grimberg
> Sent: 08 November 2017 07:28
...
> > Why would you give the user a knob to destroy what you carefully optimized?
> 
> Well, looks like someone relies on this knob, the question is if he is
> doing something better for his workload. I don't know, it's really up to
> the user to say.

Maybe the user wants to ensure that nothing except some very specific
processing happens on some (or most) of the cpu cores.

If the expected overall ethernet data rate isn't exceptionally large,
is there any reason to allocate a queue (etc.) for every cpu?

David



Re: mlx5 broken affinity

2017-11-07 Thread Sagi Grimberg



Depending on the machine and the number of queues this might even result in
completely losing the ability to suspend/hibernate because the number of
available vectors on CPU0 is not sufficient to accommodate all queue
interrupts.


Would it be possible to keep the managed facility until a user overrides
an affinity assignment? This way if the user didn't touch it, we keep
all the perks, and in case the user overrides it, we log the implication
so the user is aware?


A lot of things are possible, the question is whether it makes sense. The
whole point is to have resources (queues, interrupts etc.) per CPU and have
them strictly associated.


Not arguing here.


Why would you give the user a knob to destroy what you carefully optimized?


Well, looks like someone relies on this knob, the question is if he is
doing something better for his workload. I don't know, it's really up to
the user to say.


Just because we can and just because users love those knobs or is there any
real technical reason?


Again, I think Jes or others can provide more information.


Re: mlx5 broken affinity

2017-11-07 Thread Thomas Gleixner
On Sun, 5 Nov 2017, Sagi Grimberg wrote:
> > > > This wasn't to start a debate about which allocation method is the
> > > > perfect solution. I am perfectly happy with the new default, the part
> > > > that is broken is to take away the user's option to reassign the
> > > > affinity. That is a bug and it needs to be fixed!
> > > 
> > > Well,
> > > 
> > > I would really want to wait for Thomas/Christoph to reply, but this
> > > simple change fixed it for me:
> > > --
> > > diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> > > index 573dc52b0806..eccd06be5e44 100644
> > > --- a/kernel/irq/manage.c
> > > +++ b/kernel/irq/manage.c
> > > @@ -146,8 +146,7 @@ bool irq_can_set_affinity_usr(unsigned int irq)
> > >   {
> > >  struct irq_desc *desc = irq_to_desc(irq);
> > > 
> > > -   return __irq_can_set_affinity(desc) &&
> > > -   !irqd_affinity_is_managed(&desc->irq_data);
> > > +   return __irq_can_set_affinity(desc);
> > 
> > Which defeats the whole purpose of the managed facility, which is _not_ to
> > break the affinities on cpu offline and bring the interrupt back on the CPU
> > when it comes online again.
> > 
> > What I can do is to have a separate flag, which only uses the initial
> > distribution mechanism, but I really want to have Christoph's opinion on
> > that.
> 
> I do agree that the user would lose better cpu online/offline behavior,
> but it seems that users want to still have some control over the IRQ
> affinity assignments even if they lose this functionality.

Depending on the machine and the number of queues this might even result in
completely losing the ability to suspend/hibernate because the number of
available vectors on CPU0 is not sufficient to accommodate all queue
interrupts.

> Would it be possible to keep the managed facility until a user overrides
> an affinity assignment? This way if the user didn't touch it, we keep
> all the perks, and in case the user overrides it, we log the implication
> so the user is aware?

A lot of things are possible, the question is whether it makes sense. The
whole point is to have resources (queues, interrupts etc.) per CPU and have
them strictly associated.

Why would you give the user a knob to destroy what you carefully optimized?

Just because we can and just because users love those knobs or is there any
real technical reason?

Thanks,

tglx






Re: mlx5 broken affinity

2017-11-05 Thread Sagi Grimberg



This wasn't to start a debate about which allocation method is the
perfect solution. I am perfectly happy with the new default, the part
that is broken is to take away the user's option to reassign the
affinity. That is a bug and it needs to be fixed!


Well,

I would really want to wait for Thomas/Christoph to reply, but this
simple change fixed it for me:
--
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 573dc52b0806..eccd06be5e44 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -146,8 +146,7 @@ bool irq_can_set_affinity_usr(unsigned int irq)
  {
 struct irq_desc *desc = irq_to_desc(irq);

-   return __irq_can_set_affinity(desc) &&
-   !irqd_affinity_is_managed(&desc->irq_data);
+   return __irq_can_set_affinity(desc);


Which defeats the whole purpose of the managed facility, which is _not_ to
break the affinities on cpu offline and bring the interrupt back on the CPU
when it comes online again.

What I can do is to have a separate flag, which only uses the initial
distribution mechanism, but I really want to have Christoph's opinion on
that.


I do agree that the user would lose better cpu online/offline behavior,
but it seems that users want to still have some control over the IRQ
affinity assignments even if they lose this functionality.

Would it be possible to keep the managed facility until a user overrides
an affinity assignment? This way if the user didn't touch it, we keep
all the perks, and in case the user overrides it, we log the implication
so the user is aware?
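
Something along these lines, as a purely hypothetical sketch (this is
not actual kernel code, and the function name and placement are
illustrative): demote the IRQ from managed to ordinary on the first
user write and log it, instead of failing with -EIO.

--
/* Hypothetical: the first user-space write clears the managed state
 * (losing the hotplug perks from then on) rather than failing. */
int irq_set_affinity_usr(unsigned int irq, const struct cpumask *mask)
{
        struct irq_desc *desc = irq_to_desc(irq);

        if (irqd_affinity_is_managed(&desc->irq_data)) {
                pr_warn("irq %u: user override, managed affinity disabled\n",
                        irq);
                irqd_clear(&desc->irq_data, IRQD_AFFINITY_MANAGED);
        }
        return __irq_set_affinity(irq, mask, false);
}
--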


Re: mlx5 broken affinity

2017-11-02 Thread Thomas Gleixner
On Thu, 2 Nov 2017, Sagi Grimberg wrote:

> 
> > This wasn't to start a debate about which allocation method is the
> > perfect solution. I am perfectly happy with the new default, the part
> > that is broken is to take away the user's option to reassign the
> > affinity. That is a bug and it needs to be fixed!
> 
> Well,
> 
> I would really want to wait for Thomas/Christoph to reply, but this
> simple change fixed it for me:
> --
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 573dc52b0806..eccd06be5e44 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -146,8 +146,7 @@ bool irq_can_set_affinity_usr(unsigned int irq)
>  {
> struct irq_desc *desc = irq_to_desc(irq);
> 
> -   return __irq_can_set_affinity(desc) &&
> -   !irqd_affinity_is_managed(&desc->irq_data);
> +   return __irq_can_set_affinity(desc);

Which defeats the whole purpose of the managed facility, which is _not_ to
break the affinities on cpu offline and bring the interrupt back on the CPU
when it comes online again.

What I can do is to have a separate flag, which only uses the initial
distribution mechanism, but I really want to have Christoph's opinion on
that.
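
Conceptually something like this, with an invented flag name just to
illustrate the idea:

--
/* Illustrative only: a second allocation mode that runs the same
 * spreading code for the initial masks but never marks the
 * descriptors IRQD_AFFINITY_MANAGED, so the proc files stay
 * writable. The flag name is invented. */
#define PCI_IRQ_SPREAD_ONLY   (1U << 5)

nvec = pci_alloc_irq_vectors(pdev, 1, max_vecs,
                             PCI_IRQ_MSIX | PCI_IRQ_SPREAD_ONLY);
/* core would compute the same linear spread and set it as the initial
 * affinity, but skip the managed marking on each descriptor. */
--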

Thanks,

tglx


Re: mlx5 broken affinity

2017-11-02 Thread Jes Sorensen
On 11/02/2017 12:14 PM, Sagi Grimberg wrote:
> 
>> This wasn't to start a debate about which allocation method is the
>> perfect solution. I am perfectly happy with the new default, the part
>> that is broken is to take away the user's option to reassign the
>> affinity. That is a bug and it needs to be fixed!
> 
> Well,
> 
> I would really want to wait for Thomas/Christoph to reply, but this
> simple change fixed it for me:
> -- 
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 573dc52b0806..eccd06be5e44 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -146,8 +146,7 @@ bool irq_can_set_affinity_usr(unsigned int irq)
>  {
>     struct irq_desc *desc = irq_to_desc(irq);
> 
> -   return __irq_can_set_affinity(desc) &&
> -   !irqd_affinity_is_managed(&desc->irq_data);
> +   return __irq_can_set_affinity(desc);
>  }

The part I don't get here is why the mlx5 driver change was allowed to
set the irqs to 'managed'. It's one thing to use the generic system for
the initial allocation, which is all great, but the change shouldn't mark
the irq affinity mapping as managed.

Jes


Re: mlx5 broken affinity

2017-11-02 Thread Sagi Grimberg



This wasn't to start a debate about which allocation method is the
perfect solution. I am perfectly happy with the new default, the part
that is broken is to take away the user's option to reassign the
affinity. That is a bug and it needs to be fixed!


Well,

I would really want to wait for Thomas/Christoph to reply, but this
simple change fixed it for me:
--
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 573dc52b0806..eccd06be5e44 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -146,8 +146,7 @@ bool irq_can_set_affinity_usr(unsigned int irq)
 {
struct irq_desc *desc = irq_to_desc(irq);

-   return __irq_can_set_affinity(desc) &&
-   !irqd_affinity_is_managed(&desc->irq_data);
+   return __irq_can_set_affinity(desc);
 }

 /**
--


Re: mlx5 broken affinity

2017-11-02 Thread Jes Sorensen
On 11/02/2017 06:08 AM, Sagi Grimberg wrote:
> 
>>> I vaguely remember Nacking Sagi's patch as we knew it would break
>>> mlx5e netdev affinity assumptions.
>> I remember that argument. Still the series found its way in.
> 
> Of course it made its way in; it was acked by three different
> maintainers, and I addressed all of Saeed's comments.
> 
>> That series moves affinity decisions to the kernel's responsibility.
>> AFAI see, what the kernel does is assign IRQs to the NUMA nodes one by
>> one in increasing index order (starting with the cores of NUMA #0), no
>> matter which NUMA node is closer to the NIC.
> 
> Well, as we said before, if there is a good argument to do the home node
> first we can change the generic code (as it should be given that this is
> absolutely not device specific).
> 
>> This means that if your NIC is on NUMA #1, and you reduce the number
>> of channels, you might end up working only with the cores on the far
>> NUMA. Not good!
> We deliberated on this before, and concluded that application affinity
> and device affinity are equally important. If you have a real use case
> that shows otherwise, it's perfectly doable to start from the device home
> node.

This wasn't to start a debate about which allocation method is the
perfect solution. I am perfectly happy with the new default, the part
that is broken is to take away the user's option to reassign the
affinity. That is a bug and it needs to be fixed!

Jes



Re: mlx5 broken affinity

2017-11-02 Thread Andrew Lunn
> >This means that if your NIC is on NUMA #1, and you reduce the number of
> >channels, you might end up working only with the cores on the far NUMA.
> >Not good!

> We deliberated on this before, and concluded that application affinity
> and device affinity are equally important. If you have a real use case
> that shows otherwise, it's perfectly doable to start from the device home
> node.

Hi Sagi

You might find this talk at NetDev interesting:

https://www.netdevconf.org/2.2/session.html?shochat-devicemgmt-talk

Andrew


Re: mlx5 broken affinity

2017-11-02 Thread Sagi Grimberg



I vaguely remember Nacking Sagi's patch as we knew it would break
mlx5e netdev affinity assumptions.

I remember that argument. Still the series found its way in.


Of course it made its way in; it was acked by three different
maintainers, and I addressed all of Saeed's comments.


That series moves affinity decisions to the kernel's responsibility.
AFAI see, what the kernel does is assign IRQs to the NUMA nodes one by
one in increasing index order (starting with the cores of NUMA #0), no
matter which NUMA node is closer to the NIC.


Well, as we said before, if there is a good argument to do the home node
first we can change the generic code (as it should be given that this is
absolutely not device specific).

This means that if your NIC is on NUMA #1, and you reduce the number of 
channels, you might end up working only with the cores on the far NUMA. 
Not good!

We deliberated on this before, and concluded that application affinity
and device affinity are equally important. If you have a real use case
that shows otherwise, it's perfectly doable to start from the device home
node.


And I agree here that the user should be able to read
/proc/irq/x/smp_affinity and even modify it if required.

Totally agree. We should fix that ASAP.
The user must have write access.


I'll let Thomas reply here; I do not fully understand the reason why
pci_alloc_irq_vectors() makes the affinity assignments immutable...


Re: mlx5 broken affinity

2017-11-02 Thread Tariq Toukan



On 02/11/2017 1:02 AM, Jes Sorensen wrote:

On 11/01/2017 06:41 PM, Saeed Mahameed wrote:

On Wed, Nov 1, 2017 at 11:20 AM, Jes Sorensen  wrote:

On 11/01/2017 01:21 PM, Sagi Grimberg wrote:
I am all in favor of making the automatic setup better, but assuming an
automatic setup is always right seems problematic. There could be
workloads where you may want to assign affinity explicitly.

Jes


I vaguely remember Nacking Sagi's patch as we knew it would break
mlx5e netdev affinity assumptions.

I remember that argument. Still the series found its way in.
That series moves affinity decisions to the kernel's responsibility.
AFAI see, what the kernel does is assign IRQs to the NUMA nodes one by
one in increasing index order (starting with the cores of NUMA #0), no
matter which NUMA node is closer to the NIC.
This means that if your NIC is on NUMA #1, and you reduce the number of 
channels, you might end up working only with the cores on the far NUMA. 
Not good!

Anyway we already submitted the mlx5e patch that removed such assumption

https://patchwork.ozlabs.org/patch/809196/ ("net/mlx5e: Distribute RSS
table among all RX rings")
Jes please confirm you have it.

And I agree here that the user should be able to read
/proc/irq/x/smp_affinity and even modify it if required.

Totally agree. We should fix that ASAP.
The user must have write access.


Hi Saeed,

I can confirm I have that patch - the problem is also present in Linus'
current tree.

I can read smp_affinity, but I cannot write to it.

Indeed. Should be fixed.


Cheers,
Jes


Thanks,
Tariq


Re: mlx5 broken affinity

2017-11-02 Thread Sagi Grimberg

Jes,


I am all in favor of making the automatic setup better, but assuming an
automatic setup is always right seems problematic. There could be
workloads where you may want to assign affinity explicitly.


Adding Thomas to the thread.

My understanding is that the thought is to prevent user-space from
messing up the nice linear assignment, but maybe this is too
conservative?

The patch that prevented user-space from touching managed irq affinity
assignments:

--
commit 45ddcecbfa947f1dd8e8019bad9e90d6c9f2665c
Author: Thomas Gleixner 
Date:   Mon Jul 4 17:39:25 2016 +0900

genirq: Use affinity hint in irqdesc allocation

Use the affinity hint in the irqdesc allocator. The hint is used to
determine the node for the allocation and to set the affinity of the
interrupt.

If multiple interrupts are allocated (multi-MSI) then the allocator
iterates over the cpumask and for each set cpu it allocates on their
node and sets the initial affinity to that cpu.

If a single interrupt is allocated (MSI-X) then the allocator uses the
first cpu in the mask to compute the allocation node and uses the mask
for the initial affinity setting.

Interrupts set up this way are marked with the AFFINITY_MANAGED flag to
prevent userspace from messing with their affinity settings.

Signed-off-by: Thomas Gleixner 
Cc: Christoph Hellwig 
Cc: linux-bl...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-n...@lists.infradead.org
Cc: ax...@fb.com
Cc: agord...@redhat.com
Link: http://lkml.kernel.org/r/1467621574-8277-5-git-send-email-...@lst.de

Signed-off-by: Thomas Gleixner 
--


Re: mlx5 broken affinity

2017-11-01 Thread Jes Sorensen
On 11/01/2017 06:41 PM, Saeed Mahameed wrote:
> On Wed, Nov 1, 2017 at 11:20 AM, Jes Sorensen  wrote:
>> On 11/01/2017 01:21 PM, Sagi Grimberg wrote:
>> I am all in favor of making the automatic setup better, but assuming an
>> automatic setup is always right seems problematic. There could be
>> workloads where you may want to assign affinity explicitly.
>>
>> Jes
> 
> I vaguely remember Nacking Sagi's patch as we knew it would break
> mlx5e netdev affinity assumptions.
> Anyway we already submitted the mlx5e patch that removed such assumption
> 
> https://patchwork.ozlabs.org/patch/809196/ ("net/mlx5e: Distribute RSS
> table among all RX rings")
> Jes please confirm you have it.
> 
> And I agree here that the user should be able to read
> /proc/irq/x/smp_affinity and even modify it if required.

Hi Saeed,

I can confirm I have that patch - the problem is also present in Linus'
current tree.

I can read smp_affinity, but I cannot write to it.

Cheers,
Jes


Re: mlx5 broken affinity

2017-11-01 Thread Saeed Mahameed
On Wed, Nov 1, 2017 at 11:20 AM, Jes Sorensen  wrote:
> On 11/01/2017 01:21 PM, Sagi Grimberg wrote:
>>> Hi,
>>
>> Hi Jes,
>>
>>> The below patch seems to have broken PCI IRQ affinity assignments for
>>> mlx5.
>>
>> I wouldn't call it breaking IRQ affinity assignments. It just makes
>> them automatic.
>
> Well it explicitly breaks the option for an admin to assign them
> manually, which contradicts what is written in
> Documentation/IRQ-affinity.txt
>
>>> Prior to this patch I could echo a value to /proc/irq/<nr>/smp_affinity
>>> and it would get assigned. With this patch applied I get -EIO
>>
>> Adding Christoph,
>>
>> Ideally the affinity assignments would be better left alone, but
>> I wasn't aware that they are now immutable. Christoph?
>>
>>> The actual affinity assignments seem to have changed too, but I assume
>>> this is a result of the generic allocation?
>>
>> The affinity assignments should be giving you perfect linear assignment
>> of the rx/tx rings completion vectors to CPU cores:
>> first  -> 0
>> second -> 1
>> third  -> 2
>> ...
>>
>> Before, mlx5 spread affinity starting from the local numa node, as it
>> relied on that when constructing the RSS indirection table only over
>> the local numa node (which as a side effect hurt applications running
>> on the far node, as the traffic was guaranteed to arrive at rx rings
>> located in the "wrong" node).
>>
>> Now the RSS indirection table is linear across both numa nodes just like
>> the irq affinity settings. Another thing that was improved was the
>> pre/post vectors, which blacklisted any non rx/tx completion vectors
>> from the affinity assignments (like port events completion vectors, for
>> example).
>
> I am all in favor of making the automatic setup better, but assuming an
> automatic setup is always right seems problematic. There could be
> workloads where you may want to assign affinity explicitly.
>
> Jes

I vaguely remember Nacking Sagi's patch as we knew it would break
mlx5e netdev affinity assumptions.
Anyway we already submitted the mlx5e patch that removed such assumption

https://patchwork.ozlabs.org/patch/809196/ ("net/mlx5e: Distribute RSS
table among all RX rings")
Jes please confirm you have it.

And I agree here that the user should be able to read
/proc/irq/x/smp_affinity and even modify it if required.


Re: mlx5 broken affinity

2017-11-01 Thread Jes Sorensen
On 11/01/2017 01:21 PM, Sagi Grimberg wrote:
>> Hi,
> 
> Hi Jes,
> 
>> The below patch seems to have broken PCI IRQ affinity assignments for
>> mlx5.
> 
> I wouldn't call it breaking IRQ affinity assignments. It just makes
> them automatic.

Well it explicitly breaks the option for an admin to assign them
manually, which contradicts what is written in
Documentation/IRQ-affinity.txt

>> Prior to this patch I could echo a value to /proc/irq/<nr>/smp_affinity
>> and it would get assigned. With this patch applied I get -EIO
> 
> Adding Christoph,
> 
> Ideally the affinity assignments would be better left alone, but
> I wasn't aware that they are now immutable. Christoph?
> 
>> The actual affinity assignments seem to have changed too, but I assume
>> this is a result of the generic allocation?
> 
> The affinity assignments should be giving you perfect linear assignment
> of the rx/tx rings completion vectors to CPU cores:
> first  -> 0
> second -> 1
> third  -> 2
> ...
> 
> Before, mlx5 spread affinity starting from the local numa node, as it
> relied on that when constructing the RSS indirection table only over
> the local numa node (which as a side effect hurt applications running
> on the far node, as the traffic was guaranteed to arrive at rx rings
> located in the "wrong" node).
> 
> Now the RSS indirection table is linear across both numa nodes just like
> the irq affinity settings. Another thing that was improved was the
> pre/post vectors, which blacklisted any non rx/tx completion vectors
> from the affinity assignments (like port events completion vectors, for
> example).

I am all in favor of making the automatic setup better, but assuming an
automatic setup is always right seems problematic. There could be
workloads where you may want to assign affinity explicitly.

Jes


Re: mlx5 broken affinity

2017-11-01 Thread Sagi Grimberg

Hi,


Hi Jes,


The below patch seems to have broken PCI IRQ affinity assignments for mlx5.


I wouldn't call it breaking IRQ affinity assignments. It just makes
them automatic.


Prior to this patch I could echo a value to /proc/irq/<nr>/smp_affinity
and it would get assigned. With this patch applied I get -EIO


Adding Christoph,

Ideally the affinity assignments would be better left alone, but
I wasn't aware that they are now immutable. Christoph?


The actual affinity assignments seem to have changed too, but I assume
this is a result of the generic allocation?


The affinity assignments should be giving you perfect linear assignment
of the rx/tx rings completion vectors to CPU cores:
first  -> 0
second -> 1
third  -> 2
...

Before, mlx5 spread affinity starting from the local numa node, as it
relied on that when constructing the RSS indirection table only over
the local numa node (which as a side effect hurt applications running
on the far node, as the traffic was guaranteed to arrive at rx rings
located in the "wrong" node).

Now the RSS indirection table is linear across both numa nodes just like
the irq affinity settings. Another thing that was improved was the
pre/post vectors, which blacklisted any non rx/tx completion vectors
from the affinity assignments (like port events completion vectors, for
example).
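
The pre/post exclusion is the struct irq_affinity argument to
pci_alloc_irq_vectors_affinity(); roughly how a driver reserves the
non-queue vectors (the vector counts here are just examples, not
mlx5's actual layout):

--
/* Sketch: reserve one pre vector (e.g. async/port events) so only
 * the rx/tx completion vectors take part in the spreading. */
struct irq_affinity affd = {
        .pre_vectors  = 1,      /* excluded from the spread */
        .post_vectors = 0,
};
int nvec = pci_alloc_irq_vectors_affinity(pdev, 2, max_vecs,
                                          PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
                                          &affd);
--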


With this applied I get:
[root@testbox /proc/irq]# cat 50/smp_affinity
00,0010
[root@textbox /proc/irq]# cat 100/smp_affinity
10,

Without I get:
[root@testbox /proc/irq]# cat 50/smp_affinity
00,0200
[root@testbox /proc/irq]# cat 100/smp_affinity
000100,

I am not wildly familiar with the affinity assignment code. Is there
something obvious I am missing here?


The affinity assignment is indeed changed; it should be better though.
If you do not see a linear (diagonal) affinity assignment then
it's a bug that needs to be fixed (probably in the msi.c code).