Re: [PATCH v3 0/9] powerpc/xive: Map one IPI interrupt per node

2021-04-18 Thread Michael Ellerman
On Wed, 31 Mar 2021 16:45:05 +0200, Cédric Le Goater wrote:
> ipistorm [*] can be used to benchmark the raw interrupt rate of an
> interrupt controller by measuring the number of IPIs a system can
> sustain. When applied to the XIVE interrupt controller of POWER9 and
> POWER10 systems, a significant drop in the interrupt rate can be
> observed when crossing the second node boundary.
> 
> This is due to the fact that a single IPI interrupt is used for all
> CPUs of the system. The structure is shared, and the cache line updates
> greatly impact the traffic between nodes and the overall IPI
> performance.
> 
> [...]

Patches 2-9 applied to powerpc/next.

[2/9] powerpc/xive: Introduce an IPI interrupt domain
  https://git.kernel.org/powerpc/c/7d348494136c8b47c39d1f7ccba28c47d5094a54
[3/9] powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ
  https://git.kernel.org/powerpc/c/1835e72942b5aa779c8ada62aaeba03ab66d92c9
[4/9] powerpc/xive: Simplify xive_core_debug_show()
  https://git.kernel.org/powerpc/c/5159d9872823230669b7949ba3caf18c4c314846
[5/9] powerpc/xive: Drop check on irq_data in xive_core_debug_show()
  https://git.kernel.org/powerpc/c/a74ce5926b20cd0e6d624a9b2527073a96dfed7f
[6/9] powerpc/xive: Simplify the dump of XIVE interrupts under xmon
  https://git.kernel.org/powerpc/c/6bf66eb8f404050030805c65cf39a810892f5f8e
[7/9] powerpc/xive: Fix xmon command "dxi"
  https://git.kernel.org/powerpc/c/33e4bc5946432a4ac173fd08e8e30a13ab94d06d
[8/9] powerpc/xive: Map one IPI interrupt per node
  https://git.kernel.org/powerpc/c/7dcc37b3eff97379b194adb17eb9a8270512dd1d
[9/9] powerpc/xive: Modernize XIVE-IPI domain with an 'alloc' handler
  https://git.kernel.org/powerpc/c/fd6db2892ebaa1383a93b4a609c65b96e615510a

cheers


Re: [PATCH v3 0/9] powerpc/xive: Map one IPI interrupt per node

2021-04-01 Thread Cédric Le Goater
On 4/1/21 2:45 PM, Greg Kurz wrote:
> On Thu, 1 Apr 2021 11:18:10 +0200
> Cédric Le Goater  wrote:
> 
>> Hello,
>>
>> On 4/1/21 10:04 AM, Greg Kurz wrote:
>>> On Wed, 31 Mar 2021 16:45:05 +0200
>>> Cédric Le Goater  wrote:
>>>

 Hello,

 ipistorm [*] can be used to benchmark the raw interrupt rate of an
 interrupt controller by measuring the number of IPIs a system can
 sustain. When applied to the XIVE interrupt controller of POWER9 and
 POWER10 systems, a significant drop in the interrupt rate can be
 observed when crossing the second node boundary.

 This is due to the fact that a single IPI interrupt is used for all
 CPUs of the system. The structure is shared, and the cache line updates
 greatly impact the traffic between nodes and the overall IPI
 performance.

 As a workaround, the impact can be reduced by deactivating the IRQ
 lockup detector ("noirqdebug"), which does a lot of accounting in the
 Linux IRQ descriptor structure and is responsible for most of the
 performance penalty.

 As a fix, this proposal allocates an IPI interrupt per node, to be
 shared by all CPUs of that node. It solves the scaling issue: the IRQ
 lockup detector still has an impact, but the XIVE interrupt rate scales
 linearly. It also improves the "noirqdebug" case, as shown in the
 tables below.

>>>
>>> As explained by David and others, NUMA nodes happen to match sockets
>>> with current POWER CPUs, but these are really different concepts. NUMA
>>> is about CPU memory access latency, 
>>
>> This is exactly our problem. We have cache issues because HW threads 
>> on different chips are trying to access the same structure in memory.
>> It happens on virtual platforms and bare-metal platforms; this is not
>> restricted to pseries.
>>
> 
> Ok, I get it... the XIVE HW accesses structures in RAM, just like HW threads
> do, so the closer, the better. 

No. That's another problem, related to the XIVE internal tables, which
should be allocated on the chip where they are "mostly" used. 
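As an aside, node-local placement of such per-chip data is what the NUMA-aware
allocators are for. A tiny illustrative sketch (made-up structure and function
names, not the actual XIVE table allocation code):

#include <linux/slab.h>

/*
 * Illustrative only: per-chip data can be placed on its home node with
 * the *_node allocator variants, so that the XIVE HW and the CPUs of
 * that chip access node-local memory.
 */
struct example_chip_table {
	unsigned long entries[64];
};

static struct example_chip_table *example_alloc_chip_table(int node)
{
	/* Memory is taken from the requested NUMA node when possible. */
	return kzalloc_node(sizeof(struct example_chip_table),
			    GFP_KERNEL, node);
}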

The problem is much simpler. As the commit log says:

 This is due to the fact that a single IPI interrupt is used for all
 CPUs of the system. The structure is shared, and the cache line updates
 greatly impact the traffic between nodes and the overall IPI
 performance.

So, we have multiple threads competing for the same IRQ descriptor and 
overloading the PowerBUS with cache update synchronization. 
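To make the contention point concrete, here is a small userspace analogy
(illustration only; it is not kernel code and not part of the series). All
threads updating a counter in one shared cache line stand in for all CPUs
updating the single shared IRQ descriptor, while per-thread cache lines stand
in for one IPI descriptor per node:

/* Build with: gcc -O2 -pthread ipi-analogy.c */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 8
#define LOOPS    (1L << 24)

/* Pad each slot to a cache line so private slots never share one. */
struct slot {
	volatile long count;
	char pad[64 - sizeof(long)];
};

static struct slot shared_slot;			/* "single IPI" case       */
static struct slot private_slot[NTHREADS];	/* "one IPI per node" case */

static void *hammer(void *arg)
{
	struct slot *s = arg;

	for (long i = 0; i < LOOPS; i++)
		__sync_fetch_and_add(&s->count, 1);
	return NULL;
}

static double run(int use_shared)
{
	pthread_t tid[NTHREADS];
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, hammer,
			       use_shared ? &shared_slot : &private_slot[i]);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
	printf("shared cache line : %.2fs\n", run(1));
	printf("private cache line: %.2fs\n", run(0));
	return 0;
}

On a multi-node machine the shared variant is expected to be noticeably
slower, which is the same effect the ipistorm numbers show once a second
node gets involved.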


> This definitely looks NUMA related indeed. So
> yes, the idea of having the XIVE HW only access local in-RAM data when
> handling IPIs between vCPUs in the same NUMA node makes sense.

Yes. That's the goal.
 
> What is less clear is the exact role of ibm,chip-id actually. This is
> currently used on PowerNV only to pick up a default target on the same
> "chip" as the source if possible. What is the detailed motivation behind
> this?

The "ibm,chip-id" issue is extra noise and not a requirement for this 
patchset.

>>> while in the case of XIVE you
>>> really need to identify a XIVE chip localized in a given socket.
>>>
>>> PAPR doesn't know about sockets, only cores. In other words, a PAPR
>>> compliant guest sees all vCPUs like they all sit in a single socket.
>>
>> There are also NUMA nodes on PAPR.
>>
> 
> Yes, but nothing prevents a NUMA node from spanning multiple sockets,
> or several NUMA nodes from residing within the same socket, even if this
> isn't the case in practice with current POWER hardware.

Yes. A NUMA node could even be a PCI adapter attached to storage. 
I don't know what to say. Maybe we are missing a concept.

>>> Same for the XIVE. Trying to introduce a concept of socket, either
>>> by hijacking OPAL's ibm,chip-id or NUMA node ids, is a kind of
>>> spec violation in this context. If the user cares for locality of
>>> the vCPUs and XIVE on the same socket, then it should bind vCPU
>>> threads to host CPUs from the same socket in the first place.
>>
>> Yes, that's a must-have of course. You need to reflect the real HW
>> topology in the guest or LPAR if you are after performance, or 
>> restrict the virtual machine to be on a single socket/chip/node.  
>>
>> And this is not only a XIVE problem. XICS has the same problem with
>> a shared single IPI interrupt descriptor but XICS doesn't scale well 
>> by design, so it doesn't show.
>>
>>
>>> Isn't this enough to solve the performance issues this series
>>> wants to fix, without the need for virtual socket ids?
>> What are virtual socket ids? A new concept? 
>>
> 
> For now, we have virtual CPUs identified by a virtual CPU id.
> It thus seems natural to speak of a virtual socket id, but
> anyway, the wording isn't really important here and you
> don't answer the question ;-)

If, on the hypervisor, you restrict the virtual machine's vCPUs to
a single POWER processor/chip, there is no problem. But large 
KVM guests or PowerVM LPARs do exist on 16s systems.

C.
 


Re: [PATCH v3 0/9] powerpc/xive: Map one IPI interrupt per node

2021-04-01 Thread Greg Kurz
On Thu, 1 Apr 2021 11:18:10 +0200
Cédric Le Goater  wrote:

> Hello,
> 
> On 4/1/21 10:04 AM, Greg Kurz wrote:
> > On Wed, 31 Mar 2021 16:45:05 +0200
> > Cédric Le Goater  wrote:
> > 
> >>
> >> Hello,
> >>
> >> ipistorm [*] can be used to benchmark the raw interrupt rate of an
> >> interrupt controller by measuring the number of IPIs a system can
> >> sustain. When applied to the XIVE interrupt controller of POWER9 and
> >> POWER10 systems, a significant drop in the interrupt rate can be
> >> observed when crossing the second node boundary.
> >>
> >> This is due to the fact that a single IPI interrupt is used for all
> >> CPUs of the system. The structure is shared, and the cache line updates
> >> greatly impact the traffic between nodes and the overall IPI
> >> performance.
> >>
> >> As a workaround, the impact can be reduced by deactivating the IRQ
> >> lockup detector ("noirqdebug"), which does a lot of accounting in the
> >> Linux IRQ descriptor structure and is responsible for most of the
> >> performance penalty.
> >>
> >> As a fix, this proposal allocates an IPI interrupt per node, to be
> >> shared by all CPUs of that node. It solves the scaling issue: the IRQ
> >> lockup detector still has an impact, but the XIVE interrupt rate scales
> >> linearly. It also improves the "noirqdebug" case, as shown in the
> >> tables below. 
> >>
> > 
> > As explained by David and others, NUMA nodes happen to match sockets
> > with current POWER CPUs, but these are really different concepts. NUMA
> > is about CPU memory access latency, 
> 
> This is exactly our problem. We have cache issues because HW threads 
> on different chips are trying to access the same structure in memory.
> It happens on virtual platforms and bare-metal platforms; this is not
> restricted to pseries.
> 

Ok, I get it... the XIVE HW accesses structures in RAM, just like HW threads
do, so the closer, the better. This definitely looks NUMA related indeed. So
yes, the idea of having the XIVE HW only access local in-RAM data when
handling IPIs between vCPUs in the same NUMA node makes sense.

What is less clear is the exact role of ibm,chip-id actually. This is
currently used on PowerNV only to pick up a default target on the same
"chip" as the source if possible. What is the detailed motivation behind
this?
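For readers following along, a minimal sketch of what an "ibm,chip-id" lookup
looks like with the generic OF helpers (the function name is made up for the
example; this is not the XIVE or PowerNV code itself):

#include <linux/of.h>

static int example_cpu_to_chip_id(int cpu)
{
	struct device_node *np;
	u32 chip_id;
	int ret = -1;

	np = of_get_cpu_node(cpu, NULL);
	if (!np)
		return -1;

	/* The property may be absent, e.g. under PAPR. */
	if (!of_property_read_u32(np, "ibm,chip-id", &chip_id))
		ret = chip_id;

	of_node_put(np);
	return ret;
}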

> > while in the case of XIVE you
> > really need to identify a XIVE chip localized in a given socket.
> > 
> > PAPR doesn't know about sockets, only cores. In other words, a PAPR
> > compliant guest sees all vCPUs like they all sit in a single socket.
> 
> There are also NUMA nodes on PAPR.
> 

Yes, but nothing prevents a NUMA node from spanning multiple sockets,
or several NUMA nodes from residing within the same socket, even if this
isn't the case in practice with current POWER hardware.

> > Same for the XIVE. Trying to introduce a concept of socket, either
> > by hijacking OPAL's ibm,chip-id or NUMA node ids, is a kind of
> > spec violation in this context. If the user cares for locality of
> > the vCPUs and XIVE on the same socket, then it should bind vCPU
> > threads to host CPUs from the same socket in the first place.
> 
> Yes, that's a must-have of course. You need to reflect the real HW
> topology in the guest or LPAR if you are after performance, or 
> restrict the virtual machine to be on a single socket/chip/node.  
> 
> And this is not only a XIVE problem. XICS has the same problem with
> a shared single IPI interrupt descriptor but XICS doesn't scale well 
> by design, so it doesn't show.
> 
> 
> > Isn't this enough to solve the performance issues this series
> > wants to fix, without the need for virtual socket ids?
> What are virtual socket ids? A new concept? 
> 

For now, we have virtual CPUs identified by a virtual CPU id.
It thus seems natural to speak of a virtual socket id, but
anyway, the wording isn't really important here and you
don't answer the question ;-)

> Thanks,
> 
> C.
> 
> > 
> >>  * P9 DD2.2 - 2s * 64 threads
> >>
> >>                                              "noirqdebug"
> >>                        Mint/s                   Mint/s
> >>  chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys
> >>  --------------------------------------------------------------
> >>  1      0-15      4.984023   4.875405      4.996536    5.048892
> >>         0-31     10.879164  10.544040     10.757632   11.037859
> >>         0-47     15.345301  14.688764     14.926520   15.310053
> >>         0-63     17.064907  17.066812     17.613416   17.874511
> >>  2      0-79     11.768764  21.650749     22.689120   22.566508
> >>         0-95     10.616812  26.878789     28.434703   28.320324
> >>         0-111    10.151693  31.397803     31.771773   32.388122
> >>         0-127     9.948502  33.139336     34.875716   35.224548
> >>
> >>
> >>  * P10 DD1 - 4s (not homogeneous) 352 threads
> >>
> >>                                              "noirqdebug"
> >>                        Mint/s                   Mint/s
> 

Re: [PATCH v3 0/9] powerpc/xive: Map one IPI interrupt per node

2021-04-01 Thread Cédric Le Goater
Hello,

On 4/1/21 10:04 AM, Greg Kurz wrote:
> On Wed, 31 Mar 2021 16:45:05 +0200
> Cédric Le Goater  wrote:
> 
>>
>> Hello,
>>
>> ipistorm [*] can be used to benchmark the raw interrupt rate of an
>> interrupt controller by measuring the number of IPIs a system can
>> sustain. When applied to the XIVE interrupt controller of POWER9 and
>> POWER10 systems, a significant drop in the interrupt rate can be
>> observed when crossing the second node boundary.
>>
>> This is due to the fact that a single IPI interrupt is used for all
>> CPUs of the system. The structure is shared, and the cache line updates
>> greatly impact the traffic between nodes and the overall IPI
>> performance.
>>
>> As a workaround, the impact can be reduced by deactivating the IRQ
>> lockup detector ("noirqdebug"), which does a lot of accounting in the
>> Linux IRQ descriptor structure and is responsible for most of the
>> performance penalty.
>>
>> As a fix, this proposal allocates an IPI interrupt per node, to be
>> shared by all CPUs of that node. It solves the scaling issue: the IRQ
>> lockup detector still has an impact, but the XIVE interrupt rate scales
>> linearly. It also improves the "noirqdebug" case, as shown in the
>> tables below. 
>>
> 
> As explained by David and others, NUMA nodes happen to match sockets
> with current POWER CPUs, but these are really different concepts. NUMA
> is about CPU memory access latency, 

This is exactly our problem. We have cache issues because HW threads 
on different chips are trying to access the same structure in memory.
It happens on virtual platforms and bare-metal platforms; this is not
restricted to pseries.

> while in the case of XIVE you
> really need to identify a XIVE chip localized in a given socket.
> 
> PAPR doesn't know about sockets, only cores. In other words, a PAPR
> compliant guest sees all vCPUs like they all sit in a single socket.

There are also NUMA nodes on PAPR.

> Same for the XIVE. Trying to introduce a concept of socket, either
> by hijacking OPAL's ibm,chip-id or NUMA node ids, is a kind of
> spec violation in this context. If the user cares for locality of
> the vCPUs and XIVE on the same socket, then it should bind vCPU
> threads to host CPUs from the same socket in the first place.

Yes, that's a must-have of course. You need to reflect the real HW
topology in the guest or LPAR if you are after performance, or 
restrict the virtual machine to be on a single socket/chip/node.  

And this is not only a XIVE problem. XICS has the same problem with
a shared single IPI interrupt descriptor but XICS doesn't scale well 
by design, so it doesn't show.


> Isn't this enough to solve the performance issues this series
> wants to fix, without the need for virtual socket ids?
What are virtual socket ids? A new concept? 

Thanks,

C.

> 
>>  * P9 DD2.2 - 2s * 64 threads
>>
>>                                              "noirqdebug"
>>                        Mint/s                   Mint/s
>>  chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys
>>  --------------------------------------------------------------
>>  1      0-15      4.984023   4.875405      4.996536    5.048892
>>         0-31     10.879164  10.544040     10.757632   11.037859
>>         0-47     15.345301  14.688764     14.926520   15.310053
>>         0-63     17.064907  17.066812     17.613416   17.874511
>>  2      0-79     11.768764  21.650749     22.689120   22.566508
>>         0-95     10.616812  26.878789     28.434703   28.320324
>>         0-111    10.151693  31.397803     31.771773   32.388122
>>         0-127     9.948502  33.139336     34.875716   35.224548
>>
>>
>>  * P10 DD1 - 4s (not homogeneous) 352 threads
>>
>>                                              "noirqdebug"
>>                        Mint/s                   Mint/s
>>  chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys
>>  --------------------------------------------------------------
>>  1      0-15      2.409402   2.364108      2.383303    2.395091
>>         0-31      6.028325   6.046075      6.08        6.073750
>>         0-47      8.655178   8.644531      8.712830    8.724702
>>         0-63     11.629652  11.735953     12.088203   12.055979
>>         0-79     14.392321  14.729959     14.986701   14.973073
>>         0-95     12.604158  13.004034     17.528748   17.568095
>>  2      0-111     9.767753  13.719831     19.968606   20.024218
>>         0-127     6.744566  16.418854     22.898066   22.995110
>>         0-143     6.005699  19.174421     25.425622   25.417541
>>         0-159     5.649719  21.938836     27.952662   28.059603
>>         0-175     5.441410  24.109484     31.133915   31.127996
>>  3      0-191     5.318341  24.405322     33.999221   33.775354
>>         0-207     5.191382  26.449769     36.050161   35.867307
>>         0-223     5.102790  29.356943     39.544135   39.508169
>>         0-239     5.035295  31.933051     42.135075   42.071975
>>         0-255     4.969209  

Re: [PATCH v3 0/9] powerpc/xive: Map one IPI interrupt per node

2021-04-01 Thread Cédric Le Goater
Hello,


On 3/31/21 4:45 PM, Cédric Le Goater wrote:
> 
> Hello,
> 
> ipistorm [*] can be used to benchmark the raw interrupt rate of an
> interrupt controller by measuring the number of IPIs a system can
> sustain. When applied to the XIVE interrupt controller of POWER9 and
> POWER10 systems, a significant drop in the interrupt rate can be
> observed when crossing the second node boundary.
> 
> This is due to the fact that a single IPI interrupt is used for all
> CPUs of the system. The structure is shared, and the cache line updates
> greatly impact the traffic between nodes and the overall IPI
> performance.
> 
> As a workaround, the impact can be reduced by deactivating the IRQ
> lockup detector ("noirqdebug"), which does a lot of accounting in the
> Linux IRQ descriptor structure and is responsible for most of the
> performance penalty.
> 
> As a fix, this proposal allocates an IPI interrupt per node, to be
> shared by all CPUs of that node. It solves the scaling issue: the IRQ
> lockup detector still has an impact, but the XIVE interrupt rate scales
> linearly. It also improves the "noirqdebug" case, as shown in the
> tables below.

Hello,

From the comments I received on different email threads, it seems 
I am making some wrong assumptions about the code and concepts. We can 
postpone this patchset. It's an optimization, and there are some 
more cleanups that can be done first. 

Thanks for the time and the shared expertise,

C.

> 
>  * P9 DD2.2 - 2s * 64 threads
> 
>                                              "noirqdebug"
>                        Mint/s                   Mint/s
>  chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys
>  --------------------------------------------------------------
>  1      0-15      4.984023   4.875405      4.996536    5.048892
>         0-31     10.879164  10.544040     10.757632   11.037859
>         0-47     15.345301  14.688764     14.926520   15.310053
>         0-63     17.064907  17.066812     17.613416   17.874511
>  2      0-79     11.768764  21.650749     22.689120   22.566508
>         0-95     10.616812  26.878789     28.434703   28.320324
>         0-111    10.151693  31.397803     31.771773   32.388122
>         0-127     9.948502  33.139336     34.875716   35.224548
> 
> 
>  * P10 DD1 - 4s (not homogeneous) 352 threads
> 
>                                              "noirqdebug"
>                        Mint/s                   Mint/s
>  chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys
>  --------------------------------------------------------------
>  1      0-15      2.409402   2.364108      2.383303    2.395091
>         0-31      6.028325   6.046075      6.08        6.073750
>         0-47      8.655178   8.644531      8.712830    8.724702
>         0-63     11.629652  11.735953     12.088203   12.055979
>         0-79     14.392321  14.729959     14.986701   14.973073
>         0-95     12.604158  13.004034     17.528748   17.568095
>  2      0-111     9.767753  13.719831     19.968606   20.024218
>         0-127     6.744566  16.418854     22.898066   22.995110
>         0-143     6.005699  19.174421     25.425622   25.417541
>         0-159     5.649719  21.938836     27.952662   28.059603
>         0-175     5.441410  24.109484     31.133915   31.127996
>  3      0-191     5.318341  24.405322     33.999221   33.775354
>         0-207     5.191382  26.449769     36.050161   35.867307
>         0-223     5.102790  29.356943     39.544135   39.508169
>         0-239     5.035295  31.933051     42.135075   42.071975
>         0-255     4.969209  34.477367     44.655395   44.757074
>  4      0-271     4.907652  35.887016     47.080545   47.318537
>         0-287     4.839581  38.076137     50.464307   50.636219
>         0-303     4.786031  40.881319     53.478684   53.310759
>         0-319     4.743750  43.448424     56.388102   55.973969
>         0-335     4.709936  45.623532     59.400930   58.926857
>         0-351     4.681413  45.646151     62.035804   61.830057
> 
> [*] https://github.com/antonblanchard/ipistorm
> 
> Thanks,
> 
> C.
> 
> Changes in v3:
> 
>   - improved commit log for the misuse of "ibm,chip-id"
>   - better error handling of xive_request_ipi()
>   - use of a fwnode_handle to name the new domain 
>   - increased IPI name length
>   - use of early_cpu_to_node() for hotplugged CPUs
>   - filter CPU-less nodes
> 
> Changes in v2:
> 
>   - extra simplification on xmon
>   - fixes on issues reported by the kernel test robot
> 
> Cédric Le Goater (9):
>   powerpc/xive: Use cpu_to_node() instead of "ibm,chip-id" property
>   powerpc/xive: Introduce an IPI interrupt domain
>   powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ
>   powerpc/xive: Simplify xive_core_debug_show()
>   powerpc/xive: Drop check on irq_data in xive_core_debug_show()
>   powerpc/xive: Simplify the dump of XIVE interrupts under xmon
>   powerpc/xive: Fix xmon command "dxi"
>   powerpc/xive: Map one IPI interrupt per node
>   

Re: [PATCH v3 0/9] powerpc/xive: Map one IPI interrupt per node

2021-04-01 Thread Greg Kurz
On Wed, 31 Mar 2021 16:45:05 +0200
Cédric Le Goater  wrote:

> 
> Hello,
> 
> ipistorm [*] can be used to benchmark the raw interrupt rate of an
> interrupt controller by measuring the number of IPIs a system can
> sustain. When applied to the XIVE interrupt controller of POWER9 and
> POWER10 systems, a significant drop in the interrupt rate can be
> observed when crossing the second node boundary.
> 
> This is due to the fact that a single IPI interrupt is used for all
> CPUs of the system. The structure is shared, and the cache line updates
> greatly impact the traffic between nodes and the overall IPI
> performance.
> 
> As a workaround, the impact can be reduced by deactivating the IRQ
> lockup detector ("noirqdebug"), which does a lot of accounting in the
> Linux IRQ descriptor structure and is responsible for most of the
> performance penalty.
> 
> As a fix, this proposal allocates an IPI interrupt per node, to be
> shared by all CPUs of that node. It solves the scaling issue: the IRQ
> lockup detector still has an impact, but the XIVE interrupt rate scales
> linearly. It also improves the "noirqdebug" case, as shown in the
> tables below. 
> 

As explained by David and others, NUMA nodes happen to match sockets
with current POWER CPUs, but these are really different concepts. NUMA
is about CPU memory access latency, while in the case of XIVE you
really need to identify a XIVE chip localized in a given socket.

PAPR doesn't know about sockets, only cores. In other words, a PAPR
compliant guest sees all vCPUs like they all sit in a single socket.
Same for the XIVE. Trying to introduce a concept of socket, either
by hijacking OPAL's ibm,chip-id or NUMA node ids, is a kind of
spec violation in this context. If the user cares for locality of
the vCPUs and XIVE on the same socket, then it should bind vCPU
threads to host CPUs from the same socket in the first place.
Isn't this enough to solve the performance issues this series
wants to fix, without the need for virtual socket ids?

>  * P9 DD2.2 - 2s * 64 threads
> 
>                                              "noirqdebug"
>                        Mint/s                   Mint/s
>  chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys
>  --------------------------------------------------------------
>  1      0-15      4.984023   4.875405      4.996536    5.048892
>         0-31     10.879164  10.544040     10.757632   11.037859
>         0-47     15.345301  14.688764     14.926520   15.310053
>         0-63     17.064907  17.066812     17.613416   17.874511
>  2      0-79     11.768764  21.650749     22.689120   22.566508
>         0-95     10.616812  26.878789     28.434703   28.320324
>         0-111    10.151693  31.397803     31.771773   32.388122
>         0-127     9.948502  33.139336     34.875716   35.224548
> 
> 
>  * P10 DD1 - 4s (not homogeneous) 352 threads
> 
>                                              "noirqdebug"
>                        Mint/s                   Mint/s
>  chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys
>  --------------------------------------------------------------
>  1      0-15      2.409402   2.364108      2.383303    2.395091
>         0-31      6.028325   6.046075      6.08        6.073750
>         0-47      8.655178   8.644531      8.712830    8.724702
>         0-63     11.629652  11.735953     12.088203   12.055979
>         0-79     14.392321  14.729959     14.986701   14.973073
>         0-95     12.604158  13.004034     17.528748   17.568095
>  2      0-111     9.767753  13.719831     19.968606   20.024218
>         0-127     6.744566  16.418854     22.898066   22.995110
>         0-143     6.005699  19.174421     25.425622   25.417541
>         0-159     5.649719  21.938836     27.952662   28.059603
>         0-175     5.441410  24.109484     31.133915   31.127996
>  3      0-191     5.318341  24.405322     33.999221   33.775354
>         0-207     5.191382  26.449769     36.050161   35.867307
>         0-223     5.102790  29.356943     39.544135   39.508169
>         0-239     5.035295  31.933051     42.135075   42.071975
>         0-255     4.969209  34.477367     44.655395   44.757074
>  4      0-271     4.907652  35.887016     47.080545   47.318537
>         0-287     4.839581  38.076137     50.464307   50.636219
>         0-303     4.786031  40.881319     53.478684   53.310759
>         0-319     4.743750  43.448424     56.388102   55.973969
>         0-335     4.709936  45.623532     59.400930   58.926857
>         0-351     4.681413  45.646151     62.035804   61.830057
> 
> [*] https://github.com/antonblanchard/ipistorm
> 
> Thanks,
> 
> C.
> 
> Changes in v3:
> 
>   - improved commit log for the misuse of "ibm,chip-id"
>   - better error handling of xive_request_ipi()
>   - use of a fwnode_handle to name the new domain 
>   - increased IPI name length
>   - use of early_cpu_to_node() for hotplugged CPUs
>   - filter CPU-less nodes
> 
> Changes in v2:
> 
>   - extra simplification 

[PATCH v3 0/9] powerpc/xive: Map one IPI interrupt per node

2021-03-31 Thread Cédric Le Goater


Hello,

ipistorm [*] can be used to benchmark the raw interrupt rate of an
interrupt controller by measuring the number of IPIs a system can
sustain. When applied to the XIVE interrupt controller of POWER9 and
POWER10 systems, a significant drop in the interrupt rate can be
observed when crossing the second node boundary.

This is due to the fact that a single IPI interrupt is used for all
CPUs of the system. The structure is shared, and the cache line updates
greatly impact the traffic between nodes and the overall IPI
performance.

As a workaround, the impact can be reduced by deactivating the IRQ
lockup detector ("noirqdebug"), which does a lot of accounting in the
Linux IRQ descriptor structure and is responsible for most of the
performance penalty.
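For reference, a simplified sketch of where that accounting is wired in
(conceptual fragment, not verbatim kernel code; the exact checks and
declarations live in the generic IRQ core):

/*
 * Unless the kernel is booted with "noirqdebug", the spurious/lockup
 * bookkeeping runs after each handled interrupt.  With a single
 * system-wide IPI, every CPU's bookkeeping lands on the same irq_desc.
 */
static void example_account_ipi(struct irq_desc *desc, irqreturn_t ret)
{
	if (!noirqdebug)
		note_interrupt(desc, ret);	/* updates counters in *desc */
}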

As a fix, this proposal allocates an IPI interrupt per node, to be
shared by all CPUs of that node. It solves the scaling issue: the IRQ
lockup detector still has an impact, but the XIVE interrupt rate scales
linearly. It also improves the "noirqdebug" case, as shown in the
tables below.
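A minimal sketch of the idea, assuming a hypothetical helper name and an
already-created IPI interrupt domain; it mirrors the approach of patch 8/9
but is not the patch itself:

#include <linux/init.h>
#include <linux/interrupt.h>
#include <linux/irqdomain.h>
#include <linux/nodemask.h>

/* One IPI per node instead of one for the whole system (illustrative). */
struct example_node_ipi {
	unsigned int irq;	/* Linux IRQ number of this node's IPI */
	char name[16];
};

static struct example_node_ipi example_ipis[MAX_NUMNODES];

static int __init example_request_node_ipis(struct irq_domain *ipi_domain,
					    irq_handler_t handler)
{
	int node, ret;

	for_each_node(node) {
		struct example_node_ipi *xid = &example_ipis[node];

		/* CPU-less nodes do not need an IPI. */
		if (!node_state(node, N_CPU))
			continue;

		/* Use the node id as the hwirq number of the node's IPI. */
		xid->irq = irq_create_mapping(ipi_domain, node);
		if (!xid->irq)
			return -EINVAL;

		snprintf(xid->name, sizeof(xid->name), "IPI-%d", node);
		ret = request_irq(xid->irq, handler,
				  IRQF_PERCPU | IRQF_NO_THREAD, xid->name, NULL);
		if (ret)
			return ret;
	}
	return 0;
}

The real patch additionally deals with IPI naming, error handling in
xive_request_ipi() and hotplugged CPUs, as listed in the v3 changes further
down.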

 * P9 DD2.2 - 2s * 64 threads

                                              "noirqdebug"
                        Mint/s                   Mint/s
 chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys
 --------------------------------------------------------------
 1      0-15      4.984023   4.875405      4.996536    5.048892
        0-31     10.879164  10.544040     10.757632   11.037859
        0-47     15.345301  14.688764     14.926520   15.310053
        0-63     17.064907  17.066812     17.613416   17.874511
 2      0-79     11.768764  21.650749     22.689120   22.566508
        0-95     10.616812  26.878789     28.434703   28.320324
        0-111    10.151693  31.397803     31.771773   32.388122
        0-127     9.948502  33.139336     34.875716   35.224548


 * P10 DD1 - 4s (not homogeneous) 352 threads

                                              "noirqdebug"
                        Mint/s                   Mint/s
 chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys
 --------------------------------------------------------------
 1      0-15      2.409402   2.364108      2.383303    2.395091
        0-31      6.028325   6.046075      6.08        6.073750
        0-47      8.655178   8.644531      8.712830    8.724702
        0-63     11.629652  11.735953     12.088203   12.055979
        0-79     14.392321  14.729959     14.986701   14.973073
        0-95     12.604158  13.004034     17.528748   17.568095
 2      0-111     9.767753  13.719831     19.968606   20.024218
        0-127     6.744566  16.418854     22.898066   22.995110
        0-143     6.005699  19.174421     25.425622   25.417541
        0-159     5.649719  21.938836     27.952662   28.059603
        0-175     5.441410  24.109484     31.133915   31.127996
 3      0-191     5.318341  24.405322     33.999221   33.775354
        0-207     5.191382  26.449769     36.050161   35.867307
        0-223     5.102790  29.356943     39.544135   39.508169
        0-239     5.035295  31.933051     42.135075   42.071975
        0-255     4.969209  34.477367     44.655395   44.757074
 4      0-271     4.907652  35.887016     47.080545   47.318537
        0-287     4.839581  38.076137     50.464307   50.636219
        0-303     4.786031  40.881319     53.478684   53.310759
        0-319     4.743750  43.448424     56.388102   55.973969
        0-335     4.709936  45.623532     59.400930   58.926857
        0-351     4.681413  45.646151     62.035804   61.830057

[*] https://github.com/antonblanchard/ipistorm

Thanks,

C.

Changes in v3:

  - improved commit log for the misuse of "ibm,chip-id"
  - better error handling of xive_request_ipi()
  - use of a fwnode_handle to name the new domain 
  - increased IPI name length
  - use of early_cpu_to_node() for hotplugged CPUs
  - filter CPU-less nodes

Changes in v2:

  - extra simplification on xmon
  - fixes on issues reported by the kernel test robot

Cédric Le Goater (9):
  powerpc/xive: Use cpu_to_node() instead of "ibm,chip-id" property
  powerpc/xive: Introduce an IPI interrupt domain
  powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ
  powerpc/xive: Simplify xive_core_debug_show()
  powerpc/xive: Drop check on irq_data in xive_core_debug_show()
  powerpc/xive: Simplify the dump of XIVE interrupts under xmon
  powerpc/xive: Fix xmon command "dxi"
  powerpc/xive: Map one IPI interrupt per node
  powerpc/xive: Modernize XIVE-IPI domain with an 'alloc' handler

 arch/powerpc/include/asm/xive.h  |   1 +
 arch/powerpc/sysdev/xive/xive-internal.h |   2 -
 arch/powerpc/sysdev/xive/common.c| 211 +++
 arch/powerpc/xmon/xmon.c |  28 +--
 4 files changed, 139 insertions(+), 103 deletions(-)

-- 
2.26.3