RE: [tip:irq/core] genirq/matrix: Improve target CPU selection for managed interrupts.

2018-11-07 Thread Michael Kelley
From: Thomas Gleixner   Sent: Wednesday, November 7, 2018 12:23 PM
> 
> There is another interesting property of managed interrupts vs. CPU
> hotplug. When the last CPU in the affinity mask goes offline, then the core
> code shuts down the interrupt and the device driver and related layers
> exclude the associated device queue from I/O. The same applies for CPUs
> which are not online when the device is initialized, i.e. if none of the
> CPUs is online then the interrupt is not started and the I/O queue stays
> disabled.
> 
> When the first CPU in the mask comes online (again), then the interrupt is
> reenabled and the device driver and related layers reenable I/O on the
> associated device queue.
> 

Thanks!  The transition into and out of the situation when none of the CPUs
in the affinity mask are online is what I wasn't aware of.  With that piece of
the puzzle, it all makes sense.

Michael


RE: [tip:irq/core] genirq/matrix: Improve target CPU selection for managed interrupts.

2018-11-07 Thread Thomas Gleixner
Michael,

On Wed, 7 Nov 2018, Michael Kelley wrote:
> >  2) Managed interrupts:
> > 
> > Managed interrupts guarantee vector reservation when the MSI/MSI-X
> > functionality of a device is enabled, which is achieved by reserving
> > vectors in the bitmaps of the possible target CPUs. This reservation
> > decrements the available count on each possible target CPU.
> >
> 
> For the curious, could you elaborate on the reservation guarantee for
> managed interrupts?  What exactly is guaranteed?  I'm trying to
> understand the benefit of reserving a vector on all possible target CPUs.
> I can imagine this may be related to hot-remove of CPUs, but I'm not
> seeing the scenario where reserving on all possible target CPUs solves
> any fundamental problem.  irq_build_affinity_masks() spreads
> target CPUs across each IRQ in the batch, so you might get a small handful
> of possible target CPUs for each IRQ.  But if those small handful of CPUs
> were to be hot-removed, then all the reserved vectors disappear anyway.
> So maybe there's another scenario I'm missing.

When managed interrupts are allocated (MSI[-X] enable) then each allocated
Linux interrupt (virtual irq number) is given an affinity mask in the
spreading algorithm. The mask contains 1 or more CPUs depending on the
ratio of queues and possible CPUs.

When the virtual irq and the corresponding data structures are allocated,
then a vector is reserved on each CPU in the affinity mask.
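
[ Editor's illustration: a minimal standalone sketch of this reservation
step, not the actual irq_matrix code; every name in it is invented for the
example. It models reserving one vector on each CPU in the affinity mask by
decrementing a per-CPU available count, all-or-nothing, and handing the
vectors back only when MSI/MSI-X is disabled again. ]

#include <stdio.h>

#define NR_CPUS		4
#define VECTORS_PER_CPU	10

/* Per-CPU count of vectors still free for reservation (toy model). */
static int available[NR_CPUS];

/*
 * Managed reservation: take one vector on every CPU in the affinity mask.
 * The check runs first so the reservation is all-or-nothing; a guaranteed
 * reservation on all CPUs in the mask is what lets a later migration
 * inside the mask succeed unconditionally.
 */
static int reserve_managed(unsigned long mask)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if ((mask & (1UL << cpu)) && available[cpu] == 0)
			return -1;
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (mask & (1UL << cpu))
			available[cpu]--;
	return 0;
}

/* Reverse operation, performed only when MSI/MSI-X is disabled again. */
static void release_managed(unsigned long mask)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (mask & (1UL << cpu))
			available[cpu]++;
}

int main(void)
{
	unsigned long mask = 0x3;	/* spread mask contains CPUs 0 and 1 */

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		available[cpu] = VECTORS_PER_CPU;

	if (!reserve_managed(mask))
		printf("after reserve: CPU0=%d CPU1=%d vectors available\n",
		       available[0], available[1]);
	release_managed(mask);
	return 0;
}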

The device driver and other layers like block-mq rely on the associated
affinity mask of each interrupt, i.e. they associate a device queue to the
exact same affinity mask. All I/O on the CPUs in the mask goes through that
associated device queue.

So if the allocation were not guaranteed and could fail, then the
I/O association would not work as expected.

Sure, we could move the interrupt to a random CPU, but that would cause
performance problems especially when the interrupt affinity moves to a
different node.

Now you might argue that reserving one vector on one CPU in the mask would
be sufficient. That's true, if CPU hotplug is disabled and all CPUs are
online when the device driver is initialized.

But it would break assumptions in the CPU hotplug case. The guaranteed
reservation on all CPUs in the associated CPU mask guarantees that the
interrupt can be moved from the outgoing CPU to a still online CPU in the
mask without violating the affinity association.

There is another interesting property of managed interrupts vs. CPU
hotplug. When the last CPU in the affinity mask goes offline, then the core
code shuts down the interrupt and the device driver and related layers
exclude the associated device queue from I/O. The same applies for CPUs
which are not online when the device is initialized, i.e. if none of the
CPUs is online then the interrupt is not started and the I/O queue stays
disabled.

When the first CPU in the mask comes online (again), then the interrupt is
reenabled and the device driver and related layers reenable I/O on the
associated device queue.
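
[ Editor's illustration: a small standalone model of the offline/online
behaviour described above; the struct, function and variable names are made
up for the example and do not correspond to the kernel internals. The
interrupt and its queue are usable only while at least one CPU in the fixed
affinity mask is online. ]

#include <stdio.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Toy online state per CPU. */
static bool cpu_online[NR_CPUS];

struct managed_irq {
	unsigned long mask;	/* affinity mask fixed at spreading time */
	bool started;		/* interrupt (and its queue) active? */
};

static bool mask_has_online_cpu(unsigned long mask)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if ((mask & (1UL << cpu)) && cpu_online[cpu])
			return true;
	return false;
}

/* Called from the (toy) hotplug path after a CPU changed state. */
static void update_irq(struct managed_irq *irq)
{
	bool usable = mask_has_online_cpu(irq->mask);

	if (irq->started && !usable) {
		irq->started = false;	/* last CPU in the mask went offline */
		printf("irq shut down, queue excluded from I/O\n");
	} else if (!irq->started && usable) {
		irq->started = true;	/* first CPU in the mask came online */
		printf("irq restarted, queue enabled again\n");
	}
}

int main(void)
{
	struct managed_irq irq = { .mask = 0x2, .started = false };

	cpu_online[1] = true;	/* CPU1 comes online -> queue enabled  */
	update_irq(&irq);
	cpu_online[1] = false;	/* CPU1 goes offline -> queue disabled */
	update_irq(&irq);
	return 0;
}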

If the reservation were not guaranteed even across offline/online
cycles, then again the assumptions of the drivers and the related layers
would no longer hold.

Note that the affinity of managed interrupts cannot be changed from
userspace via /proc/irq/$N/affinity for the same reasons.

That was a design decision to simplify the block multi-queue logic in the
device drivers and the related layers. It removed the whole requirement to
track affinity changes, reallocate data structures and reroute I/O. Some of
the early multi-queue device drivers implemented horrible hacks to handle
all those horrors.

Hope that answers your question.

Thanks,

tglx


RE: [tip:irq/core] genirq/matrix: Improve target CPU selection for managed interrupts.

2018-11-07 Thread Michael Kelley
From: tip tree robot   Sent: Tuesday, November 6, 2018 2:28 PM
>
> Committer:  Thomas Gleixner 
> CommitDate: Tue, 6 Nov 2018 23:20:13 +0100
> 
>  2) Managed interrupts:
> 
> Managed interrupts guarantee vector reservation when the MSI/MSI-X
> functionality of a device is enabled, which is achieved by reserving
> vectors in the bitmaps of the possible target CPUs. This reservation
> decrements the available count on each possible target CPU.
>

Thomas,

For the curious, could you elaborate on the reservation guarantee for
managed interrupts?  What exactly is guaranteed?  I'm trying to
understand the benefit of reserving a vector on all possible target CPUs.
I can imagine this may be related to hot-remove of CPUs, but I'm not
seeing the scenario where reserving on all possible target CPUs solves
any fundamental problem.  irq_build_affinity_masks() spreads
target CPUs across each IRQ in the batch, so you might get a small handful
of possible target CPUs for each IRQ.  But if those small handful of CPUs
were to be hot-removed, then all the reserved vectors disappear anyway.
So maybe there's another scenario I'm missing.

Thanks,

Michael


[tip:irq/core] genirq/matrix: Improve target CPU selection for managed interrupts.

2018-11-06 Thread tip-bot for Long Li
Commit-ID:  e8da8794a7fd9eef1ec9a07f0d4897c68581c72b
Gitweb: https://git.kernel.org/tip/e8da8794a7fd9eef1ec9a07f0d4897c68581c72b
Author: Long Li 
AuthorDate: Tue, 6 Nov 2018 04:00:00 +
Committer:  Thomas Gleixner 
CommitDate: Tue, 6 Nov 2018 23:20:13 +0100

genirq/matrix: Improve target CPU selection for managed interrupts.

On large systems with multiple devices of the same class (e.g. NVMe disks,
using managed interrupts), the kernel can affinitize these interrupts to a
small subset of CPUs instead of spreading them out evenly.

irq_matrix_alloc_managed() tries to select the CPU in the supplied cpumask
of possible target CPUs which has the lowest number of interrupt vectors
allocated.

This is done by searching for the CPU with the highest number of available
vectors. While this is correct for non-managed interrupts, it can select the wrong
CPU for managed interrupts. Under certain constellations this results in
affinitizing the managed interrupts of several devices to a single CPU in
a set.

The bookkeeping of available vectors works the following way (a short
standalone sketch follows the list):

 1) Non-managed interrupts:

available is decremented when the interrupt is actually requested by
the device driver and a vector is assigned. It's incremented when the
interrupt and the vector are freed.

 2) Managed interrupts:

Managed interrupts guarantee vector reservation when the MSI/MSI-X
functionality of a device is enabled, which is achieved by reserving
vectors in the bitmaps of the possible target CPUs. This reservation
decrements the available count on each possible target CPU.

When the interrupt is requested by the device driver then a vector is
allocated from the reserved region. The operation is reversed when the
interrupt is freed by the device driver. Neither of these operations
affect the available count.

The reservation persists up to the point where the MSI/MSI-X
functionality is disabled and only this operation increments the
available count again.
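
[ Editor's illustration: a minimal standalone sketch of the two accounting
rules above. The counter names follow the ones used in this changelog
(available, allocated, managed reserved, managed_allocated); everything else
is invented for the example and is not the kernel's struct cpumap. ]

#include <stdio.h>

/* Per-CPU counters as described above (simplified, single CPU). */
struct vector_counters {
	unsigned int available;		/* free for future reservations      */
	unsigned int allocated;		/* vectors handed out to drivers     */
	unsigned int managed;		/* vectors reserved for managed irqs */
	unsigned int managed_allocated;	/* managed vectors actually in use   */
};

/* 1) Non-managed: available changes when the driver requests the irq. */
static void alloc_nonmanaged(struct vector_counters *cm)
{
	cm->available--;
	cm->allocated++;
}

/* 2) Managed: available changes only at MSI/MSI-X enable time ... */
static void reserve_managed(struct vector_counters *cm)
{
	cm->available--;
	cm->managed++;
}

/* ... while the driver's request is served from the reserved region and
 * leaves the available count untouched. */
static void alloc_managed(struct vector_counters *cm)
{
	cm->allocated++;
	cm->managed_allocated++;
}

int main(void)
{
	struct vector_counters cm = { .available = 10 };

	reserve_managed(&cm);	/* MSI/MSI-X enable time */
	alloc_managed(&cm);	/* driver requests the interrupt */
	alloc_nonmanaged(&cm);	/* an unrelated non-managed interrupt */
	printf("available=%u allocated=%u managed_allocated=%u\n",
	       cm.available, cm.allocated, cm.managed_allocated);
	return 0;
}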

For non-managed interrupts the available count is the correct selection
criterion because the guaranteed reservations need to be taken into
account. Using the allocated counter could lead to a failing allocation in
the following situation (total vector space of 10 assumed):

 CPU0   CPU1
 available: 2  0
 allocated: 5  3   <--- CPU1 is selected, but available space = 0
 managed reserved:  3  7

 while available yields the correct result.

For managed interrupts the available count is not the appropriate
selection criterion because as explained above the available count is not
affected by the actual vector allocation.

The following example illustrates that. Total vector space of 10
assumed. The starting point is:

 CPU0   CPU1
 available: 5  4
 allocated: 2  3
 managed reserved:  3  3

 Allocating vectors for three non-managed interrupts will result in
 affinitizing the first two to CPU0 and the third one to CPU1 because the
 available count is adjusted with each allocation:

  CPU0  CPU1
 available:  5 4<- Select CPU0 for 1st allocation
 --> allocated:  3 3

 available:  4 4<- Select CPU0 for 2nd allocation
 --> allocated:  4 3

 available:  3 4<- Select CPU1 for 3rd allocation
 --> allocated:  4 4

 But the allocation of three managed interrupts starting from the same
 point will affinitize all of them to CPU0 because the available count is
 not affected by the allocation (see above). So the end result is:

  CPU0  CPU1
 available:  5 4
 allocated:  5 3

Introduce a "managed_allocated" field in struct cpumap to track the vector
allocation for managed interrupts separately. Use this information to
select the target CPU when a vector is allocated for a managed interrupt,
which results in more evenly distributed vector assignments. The above
example results in the following allocations:

 CPU0   CPU1
 managed_allocated: 0  0<- Select CPU0 for 1st allocation
 --> allocated: 3  3

 managed_allocated: 1  0<- Select CPU1 for 2nd allocation
 --> allocated: 3  4

 managed_allocated: 1  1<- Select CPU0 for 3rd allocation
 --> allocated: 4  4
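
[ Editor's illustration: a standalone sketch of the selection rule introduced
here, picking the CPU in the mask with the lowest managed_allocated count.
It reproduces the alternation shown in the example above; the function and
variable names are invented and this is not the actual kernel code. ]

#include <stdio.h>

#define NR_CPUS 2

/* Counts of managed vectors already handed out per CPU. */
static unsigned int managed_allocated[NR_CPUS];

/* Pick the CPU in the affinity mask with the fewest managed vectors in use. */
static int find_best_cpu_managed(unsigned long mask)
{
	unsigned int best_count = ~0U;
	int best_cpu = -1;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!(mask & (1UL << cpu)))
			continue;
		if (managed_allocated[cpu] < best_count) {
			best_count = managed_allocated[cpu];
			best_cpu = cpu;
		}
	}
	return best_cpu;
}

int main(void)
{
	unsigned long mask = 0x3;	/* both CPUs are possible targets */

	/* Three managed allocations now alternate CPU0, CPU1, CPU0 instead
	 * of all landing on CPU0 as with the old available-based selection. */
	for (int i = 0; i < 3; i++) {
		int cpu = find_best_cpu_managed(mask);

		managed_allocated[cpu]++;
		printf("allocation %d -> CPU%d\n", i + 1, cpu);
	}
	return 0;
}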

The allocation of non-managed interrupts is not affected by this change and
still evaluates the available count.

The overall distribution of interrupt vectors for both types of interrupts
might still not be perfectly even depending on the number of non-managed
and managed interrupts in a system, but due to the reservation guarantee
for managed interrupts this cannot be avoided.

Expose the new field in debugfs as well.

[ tglx: Clarified the background of the problem in the changelog and
described it independent of NVME ]

Signed-off-by: Long Li 
Signed-off-by: Thomas Gleixner 
Cc: Michael Kelley 
Link: 
