Re: MSI-X & Interrupting CPU > 0

2020-01-26 Thread David Gwynne



> On 25 Jan 2020, at 10:57 am, Mark Kettenis  wrote:
> 
> David Gwynne schreef op 2020-01-25 01:28:
>>> On 23 Jan 2020, at 10:38 pm, Mark Kettenis  wrote:
>>> Martin Pieuchot schreef op 2020-01-23 11:28:
 I'd like to make progress towards interrupting multiple CPUs in order to
 one day make use of multiple queues in some network drivers.  The road
 towards that goal is substantial and I'd like to proceed in steps to make
 it easier to squash bugs.  I'm currently thinking of the following steps:
 1. Is my interrupt handler safe to be executed on CPU != CPU0?
>>> Except for things that are inherently tied to a specific CPU (clock
>>> interrupts, performance counters, etc) I think the answer here should
>>> always be "yes".
>> Agreed.
>>> It probably only makes sense for mpsafe handlers to run on secondary CPUs 
>>> though.
>> Only because keeping !mpsafe handlers on one CPU means they're less
>> likely to need to spin against other !mpsafe interrupts on other CPUs
>> waiting for the kernel lock before they can execute. Otherwise this
>> shouldn't matter.
 2. Is it safe to execute this handler on two or more CPUs at the same
   time?
>>> I think that is never safe.  Unless you execute the handler on 
>>> different "data".
>>> Running multiple rx interrupt handlers on different CPUs should be fine.
>> Agreed.
 3. How does interrupting multiple CPUs influence packet processing in
   the softnet thread?  Is any knowledge required (CPU affinity?) to
   have optimal processing when multiple softnet threads are used?
>> I think this is my question to answer.
>> Packet sources (ie, rx rings) are supposed to be tied to a specific
>> nettq. Part of this is to avoid packet reordering where multiple
>> nettqs for one ring could overlap processing of packets for a single
>> TCP stream. The other part is so a busy nettq can apply backpressure
>> when it is overloaded to the rings that are feeding it.
>> Experience from other systems is that affinity does matter, but
>> running stuff in parallel matters more. Affinity between rings and
>> nettqs is something that can be worked on later.
 4. How to split traffic in one incoming NIC between multiple processing
   units?
>>> You'll need to have some sort of hardware filter that uses a hash of the
>>> packet header to assign an rx queue such that all packets from a single 
>>> "flow"
>>> end up on the same queue and therefore will be processed by the same 
>>> interrupt
>>> handler.
>> Yep.
 This new journey comes with the requirement of being able to interrupt
 an arbitrary CPU.  For that we need a new API.  Patrick gave me the
 diff below during u2k20 and I'd like to use it to start a discussion.
 We currently have 6 drivers using pci_intr_map_msix().  Since we want to
 be able to specify a CPU should we introduce a new function like in the
 diff below or do we prefer to add a new argument (cpuid_t?) to this one?
 This change in itself should already allow us to proceed with the first
 item of the above list.
>>> I'm not sure you want to have the driver pick the CPU to which to assign the
>>> interrupt.  In fact I think that doesn't make sense at all.  The CPU
>>> should be picked by more generic code instead.  But perhaps we do need to
>>> pass a hint from the driver to that code.
>> Letting the driver pick the CPU is Good Enough(tm) today. It may limit
>> us to 70 or 80 percent of some theoretical maximum, but we don't have
>> the machinery to make a better decision on behalf of the driver at
>> this point. It is much better to start with something simple today
>> (ie, letting the driver pick the CPU) and improve on it after we hit
>> the limits with the simple thing.
>> I also looked at how far dfly has got, and from what I can tell their
>> MSI-X stuff lets the driver pick the CPU. So it can't be too bad.
 Then we need a way to read the MSI-X control table size using the define
 PCI_MSIX_CTL_TBLSIZE() below.  This can be done in MI, we might also
 want to print that information in dmesg, or maybe cache it in pci(4)?
>>> There are already defines for MSIX in pcireg.h, some of which are duplicated
>>> by the defines in this diff.  Don't think caching makes all that much sense.
>>> Don't think we need to print the table size in dmesg; pcidump(8) already
>>> prints it.  Might make sense to print the vector number though.
>> I'm ok with using pcidump(8) to see what a particular device
>> offers rather than having it in dmesg. I'd avoid putting vectors in
>> dmesg output, because if you have a lot of rings there's going to be a
>> lot of dmesg output. Probably better to make vmstat -i more useful, or
>> systat mb.
 Does somebody have a better/stronger/magic way to achieve this goal?
>>> I played a little bit with assigning interrupts to different CPUs in the
>>> past, but at that point this didn't really result in a performance boost.
>>> That was quite a while ago though.  

Re: MSI-X & Interrupting CPU > 0

2020-01-24 Thread Mark Kettenis

David Gwynne schreef op 2020-01-25 01:28:
On 23 Jan 2020, at 10:38 pm, Mark Kettenis  wrote:


Martin Pieuchot schreef op 2020-01-23 11:28:
I'd like to make progress towards interrupting multiple CPUs in order to
one day make use of multiple queues in some network drivers.  The road
towards that goal is substantial and I'd like to proceed in steps to make
it easier to squash bugs.  I'm currently thinking of the following steps:

1. Is my interrupt handler safe to be executed on CPU != CPU0?


Except for things that are inherently tied to a specific CPU (clock
interrupts, performance counters, etc) I think the answer here should
always be "yes".


Agreed.

It probably only makes sense for mpsafe handlers to run on secondary CPUs
though.


Only because keeping !mpsafe handlers on one CPU means they're less
likely to need to spin against other !mpsafe interrupts on other CPUs
waiting for the kernel lock before they can execute. Otherwise this
shouldn't matter.




2. Is it safe to execute this handler on two or more CPUs at the same
   time?


I think that is never safe.  Unless you execute the handler on
different "data".
Running multiple rx interrupt handlers on different CPUs should be fine.


Agreed.




3. How does interrupting multiple CPUs influence packet processing in
   the softnet thread?  Is any knowledge required (CPU affinity?) to
   have optimal processing when multiple softnet threads are used?


I think this is my question to answer.

Packet sources (ie, rx rings) are supposed to be tied to a specific
nettq. Part of this is to avoid packet reordering where multiple
nettqs for one ring could overlap processing of packets for a single
TCP stream. The other part is so a busy nettq can apply backpressure
when it is overloaded to the rings that are feeding it.

Experience from other systems is that affinity does matter, but
running stuff in parallel matters more. Affinity between rings and
nettqs is something that can be worked on later.

4. How to split traffic in one incoming NIC between multiple processing
   units?


You'll need to have some sort of hardware filter that uses a hash of the
packet header to assign an rx queue such that all packets from a single
"flow" end up on the same queue and therefore will be processed by the
same interrupt handler.


Yep.



This new journey comes with the requirement of being able to interrupt
an arbitrary CPU.  For that we need a new API.  Patrick gave me the
diff below during u2k20 and I'd like to use it to start a discussion.
We currently have 6 drivers using pci_intr_map_msix().  Since we want to
be able to specify a CPU should we introduce a new function like in the
diff below or do we prefer to add a new argument (cpuid_t?) to this one?
This change in itself should already allow us to proceed with the first
item of the above list.


I'm not sure you want to have the driver pick the CPU to which to assign
the interrupt.  In fact I think that doesn't make sense at all.  The CPU
should be picked by more generic code instead.  But perhaps we do need to
pass a hint from the driver to that code.


Letting the driver pick the CPU is Good Enough(tm) today. It may limit
us to 70 or 80 percent of some theoretical maximum, but we don't have
the machinery to make a better decision on behalf of the driver at
this point. It is much better to start with something simple today
(ie, letting the driver pick the CPU) and improve on it after we hit
the limits with the simple thing.

I also looked at how far dfly has got, and from what I can tell their
MSI-X stuff lets the driver pick the CPU. So it can't be too bad.



Then we need a way to read the MSI-X control table size using the define
PCI_MSIX_CTL_TBLSIZE() below.  This can be done in MI, we might also
want to print that information in dmesg, or maybe cache it in pci(4)?


There are already defines for MSIX in pcireg.h, some of which are
duplicated by the defines in this diff.  Don't think caching makes all
that much sense.
Don't think we need to print the table size in dmesg; pcidump(8) already
prints it.  Might make sense to print the vector number though.


I'm ok with using pcidump(8) to see what a particular device
offers rather than having it in dmesg. I'd avoid putting vectors in
dmesg output, because if you have a lot of rings there's going to be a
lot of dmesg output. Probably better to make vmstat -i more useful, or
systat mb.




Does somebody have a better/stronger/magic way to achieve this goal?


I played a little bit with assigning interrupts to different CPUs in the
past, but at that point this didn't really result in a performance boost.
That was quite a while ago though.  I don't think there are fundamental
problems in getting this going.


Well, packet processing still goes through a single nettq, and that's
the limit I hit on my firewalls. I have a lot of CARP, LACP and VLAN
stuff though, so my cost per packet is probably higher than most.
However, 

Re: MSI-X & Interrupting CPU > 0

2020-01-24 Thread David Gwynne



> On 23 Jan 2020, at 10:38 pm, Mark Kettenis  wrote:
> 
> Martin Pieuchot schreef op 2020-01-23 11:28:
>> I'd like to make progress towards interrupting multiple CPUs in order to
>> one day make use of multiple queues in some network drivers.  The road
>> towards that goal is substantial and I'd like to proceed in steps to make
>> it easier to squash bugs.  I'm currently thinking of the following steps:
>> 1. Is my interrupt handler safe to be executed on CPU != CPU0?
> 
> Except for things that are inherently tied to a specific CPU (clock 
> interrupts,
> performance counters, etc) I think the answer here should always be "yes".

Agreed.

> It probably only makes sense for mpsafe handlers to run on secondary CPUs 
> though.

Only because keeping !mpsafe handlers on one CPU means they're less likely to 
need to spin against other !mpsafe interrupts on other CPUs waiting for the 
kernel lock before they can execute. Otherwise this shouldn't matter.

> 
>> 2. Is it safe to execute this handler on two or more CPUs at the same
>>time?
> 
> I think that is never safe.  Unless you execute the handler on different 
> "data".
> Running multiple rx interrupt handlers on different CPUs should be fine.

Agreed.

> 
>> 3. How does interrupting multiple CPUs influence packet processing in
>>the softnet thread?  Is any knowledge required (CPU affinity?) to
>>have optimal processing when multiple softnet threads are used?

I think this is my question to answer.

Packet sources (ie, rx rings) are supposed to be tied to a specific nettq. Part 
of this is to avoid packet reordering where multiple nettqs for one ring could 
overlap processing of packets for a single TCP stream. The other part is so a 
busy nettq can apply backpressure when it is overloaded to the rings that are 
feeding it.
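
To sketch that constraint in code (purely illustrative; there is no nettqs
array with this shape in the tree, the names just show what a static
ring-to-nettq mapping looks like):

struct nettq;				/* stand-in for a softnet taskq */
extern struct nettq *nettqs[];		/* hypothetical per-thread queues */
extern unsigned int nnettq;		/* hypothetical number of nettqs */

static inline struct nettq *
ring_to_nettq(unsigned int ring_index)
{
	/* a ring always feeds the same nettq, so flows stay in order */
	return (nettqs[ring_index % nnettq]);
}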

Experience from other systems is that affinity does matter, but running stuff 
in parallel matters more. Affinity between rings and nettqs is something that 
can be worked on later.

>> 4. How to split traffic in one incoming NIC between multiple processing
>>units?
> 
> You'll need to have some sort of hardware filter that uses a hash of the
> packet header to assign an rx queue such that all packets from a single "flow"
> end up on the same queue and therefore will be processed by the same interrupt
> handler.

Yep.
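
For illustration only, the mechanism looks roughly like this (the hash
function, table size and names are device specific and made up here):

#include <sys/types.h>

#define RSS_INDIR_SIZE	128		/* indirection table size, device specific */

uint8_t rss_indir[RSS_INDIR_SIZE];	/* filled with rx ring numbers */

/* the NIC hashes the flow's address/port tuple into flowhash */
static inline unsigned int
rss_ring(uint32_t flowhash)
{
	return (rss_indir[flowhash % RSS_INDIR_SIZE]);
}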

> 
>> This new journey comes with the requirement of being able to interrupt
>> an arbitrary CPU.  For that we need a new API.  Patrick gave me the
>> diff below during u2k20 and I'd like to use it to start a discussion.
>> We currently have 6 drivers using pci_intr_map_msix().  Since we want to
>> be able to specify a CPU should we introduce a new function like in the
>> diff below or do we prefer to add a new argument (cpuid_t?) to this one?
>> This change in itself should already allow us to proceed with the first
>> item of the above list.
> 
> I'm not sure you want to have the driver pick the CPU to which to assign the
> interrupt.  In fact I think that doesn't make sense at all.  The CPU
> should be picked by more generic code instead.  But perhaps we do need to
> pass a hint from the driver to that code.

Letting the driver pick the CPU is Good Enough(tm) today. It may limit us to 70 
or 80 percent of some theoretical maximum, but we don't have the machinery to 
make a better decision on behalf of the driver at this point. It is much better 
to start with something simple today (ie, letting the driver pick the CPU) and 
improve on it after we hit the limits with the simple thing.

I also looked at how far dfly has got, and from what I can tell their MSI-X stuff 
lets the driver pick the CPU. So it can't be too bad.
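
As an illustration of what "the driver picks the CPU" would look like with
the pci_intr_map_msix_cpuid() from mpi's diff (struct drv_softc, drv_rxintr()
and DEVNAME() are invented driver names, and spreading over ncpusfound is
just one possible policy):

int	drv_rxintr(void *);		/* per-ring interrupt handler */

int
drv_setup_intrs(struct drv_softc *sc, struct pci_attach_args *pa)
{
	pci_intr_handle_t ih;
	int i;

	for (i = 0; i < sc->sc_nqueues; i++) {
		/* map vector i and ask for it to be delivered to cpu i % N */
		if (pci_intr_map_msix_cpuid(pa, i, &ih, i % ncpusfound) != 0)
			break;
		sc->sc_queues[i].q_ihc = pci_intr_establish(pa->pa_pc, ih,
		    IPL_NET | IPL_MPSAFE, drv_rxintr, &sc->sc_queues[i],
		    DEVNAME(sc));
	}
	return (i);			/* number of vectors established */
}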

> 
>> Then we need a way to read the MSI-X control table size using the define
>> PCI_MSIX_CTL_TBLSIZE() below.  This can be done in MI, we might also
>> want to print that information in dmesg, or maybe cache it in pci(4)?
> 
> There are already defines for MSIX in pcireg.h, some of which are duplicated
> by the defines in this diff.  Don't think caching makes all that much sense.
> Don't think we need to print the table size in dmesg; pcidump(8) already
> prints it.  Might make sense to print the vector number though.

I'm ok with using pcidump(8) to see what a particular device offers rather 
than having it in dmesg. I'd avoid putting vectors in dmesg output, because 
if you have a lot of rings there's going to be a lot of dmesg output. 
Probably better to make vmstat -i more useful, or systat mb.

> 
>> Does somebody have a better/stronger/magic way to achieve this goal?
> 
> I played a little bit with assigning interrupts to different CPUs in the
> past, but at that point this didn't really result in a performance boost.
> That was quite a while ago though.  I don't think there are fundamental 
> problems
> in getting this going.

Well, packet processing still goes through a single nettq, and that's the limit 
I hit on my firewalls. I have a lot of CARP, LACP and VLAN stuff 

Re: MSI-X & Interrupting CPU > 0

2020-01-23 Thread Jonathan Matthew
On Thu, Jan 23, 2020 at 11:28:50AM +0100, Martin Pieuchot wrote:
> I'd like to make progress towards interrupting multiple CPUs in order to
> one day make use of multiple queues in some network drivers.  The road
> towards that goal is substantial and I'd like to proceed in steps to make
> it easier to squash bugs.  I'm currently thinking of the following steps:
> 
>  1. Is my interrupt handler safe to be executed on CPU != CPU0?
> 
>  2. Is it safe to execute this handler on two or more CPUs at the same
> time?
> 
>  3. How does interrupting multiple CPUs influence packet processing in
> the softnet thread?  Is any knowledge required (CPU affinity?) to
> have optimal processing when multiple softnet threads are used?
> 
>  4. How to split traffic in one incoming NIC between multiple processing
> units?
> 
> This new journey comes with the requirement of being able to interrupt
> an arbitrary CPU.  For that we need a new API.  Patrick gave me the
> diff below during u2k20 and I'd like to use it to start a discussion.
> 
> We currently have 6 drivers using pci_intr_map_msix().  Since we want to
> be able to specify a CPU should we introduce a new function like in the
> diff below or do we prefer to add a new argument (cpuid_t?) to this one?
> This change in itself should already allow us to proceed with the first
> item of the above list.
> 
> Then we need a way to read the MSI-X control table size using the define
> PCI_MSIX_CTL_TBLSIZE() below.  This can be done in MI, we might also
> want to print that information in dmesg, or maybe cache it in pci(4)?
> 
> Does somebody have a better/stronger/magic way to achieve this goal?

I worked on this in mcx(4) a while ago, but I didn't manage to get the nic's
RSS hashing working at the time so all I really achieved was moving the one
rx interrupt to some other cpu.  I think I got tx interrupts across multiple
cpus without any problems, but that's not the interesting bit.

The approach I took (diff below) was to, for msi-x vectors marked MPSAFE,
try to put vector n on the nth online cpu.  This means the driver doesn't
have to think about the details, it just asks for a bunch of interrupts and they
end up spread across the system.  I'd probably just want a function that
tells me the most vectors I can usefully use for a given device, which would
combine the table size for the device, the number of cpus online, and for nic
drivers, the number of softnet threads.
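
Something like this is the shape I mean, as a sketch only (pci_intr_msix_count()
is the function from mpi's diff, and the softnet thread count is a parameter
here because there's no global for it yet):

int
pci_intr_nvectors(struct pci_attach_args *pa, int nsoftnet)
{
	int nvec;

	nvec = pci_intr_msix_count(pa->pa_pc, pa->pa_tag);
	if (nvec <= 0)
		return (0);
	/* no point in more vectors than online cpus or softnet threads */
	if (nvec > sysctl_hwncpuonline())
		nvec = sysctl_hwncpuonline();
	if (nvec > nsoftnet)
		nvec = nsoftnet;
	return (nvec);
}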

The driver would then either set up rings to match the number of vectors, or
map vectors to however many rings it has accordingly.  For the drivers I've
worked on (mcx, ixl, bnxt, iavf), there's no coordination required between the
vectors, so running them concurrently on multiple cpus is safe.  All we need
is to set up multiple rings and RSS hashing.


Index: amd64/acpi_machdep.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/acpi_machdep.c,v
retrieving revision 1.89
diff -u -p -r1.89 acpi_machdep.c
--- amd64/acpi_machdep.c20 Dec 2019 07:49:31 -  1.89
+++ amd64/acpi_machdep.c23 Jan 2020 22:45:17 -
@@ -194,7 +194,7 @@ acpi_intr_establish(int irq, int flags, 
 
type = (flags & LR_EXTIRQ_MODE) ? IST_EDGE : IST_LEVEL;
return (intr_establish(-1, (struct pic *)apic, map->ioapic_pin,
-   type, level, handler, arg, what));
+   type, level, handler, arg, what, 0));
 #else
return NULL;
 #endif
Index: amd64/intr.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/intr.c,v
retrieving revision 1.52
diff -u -p -r1.52 intr.c
--- amd64/intr.c25 Mar 2019 18:45:27 -  1.52
+++ amd64/intr.c23 Jan 2020 22:45:17 -
@@ -220,7 +220,7 @@ intr_allocate_slot_cpu(struct cpu_info *
  */
 int
 intr_allocate_slot(struct pic *pic, int legacy_irq, int pin, int level,
-struct cpu_info **cip, int *index, int *idt_slot)
+struct cpu_info **cip, int *index, int *idt_slot, int cpuhint)
 {
CPU_INFO_ITERATOR cii;
struct cpu_info *ci;
@@ -282,24 +282,33 @@ duplicate:
} else {
 other:
/*
-* Otherwise, look for a free slot elsewhere. Do the primary
-* CPU first.
-*/
-   ci = &cpu_info_primary;
-   error = intr_allocate_slot_cpu(ci, pic, pin, &slot);
-   if (error == 0)
-   goto found;
-
-   /*
-* ..now try the others.
+* Find an online cpu to allocate a slot on.
+* Skip 'cpuhint' candidates to spread interrupts across cpus,
+* but keep checking following candidates after that if we didn't
+* find a slot.
 */
+   cpuhint %= sysctl_hwncpuonline();
CPU_INFO_FOREACH(cii, ci) {
-   if (CPU_IS_PRIMARY(ci))
+   if 

Re: MSI-X & Interrupting CPU > 0

2020-01-23 Thread Alexandr Nedvedicky
Hello,


> 
>  3. How does interrupting multiple CPUs influence packet processing in
> the softnet thread?  Is any knowledge required (CPU affinity?) to
> have optimal processing when multiple softnet threads are used?
> 

I think it's hard to tell in advance. IMO we should try to make RX
run in parallel to create some pressure on the code we already have.
We can get back to CPU affinity once we have experience with the
code running on multiple cores. Perhaps we should also slowly move from
thinking in terms of RX/forward/local/TX to a more general I/O
scheduling pattern. I think we are not there yet.

I think we need experiments to gain the knowledge.

regards
sashan



Re: MSI-X & Interrupting CPU > 0

2020-01-23 Thread Martin Pieuchot
On 23/01/20(Thu) 13:38, Mark Kettenis wrote:
> Martin Pieuchot schreef op 2020-01-23 11:28:
> > [...] 
> > We currently have 6 drivers using pci_intr_map_msix().  Since we want to
> > be able to specify a CPU should we introduce a new function like in the
> > diff below or do we prefer to add a new argument (cpuid_t?) to this one?
> > This change in itself should already allow us to proceed with the first
> > item of the above list.
> 
> I'm not sure you want to have the driver pick the CPU to which to assign the
> interrupt.  In fact I think that doesn't make sense at all.  The CPU
> should be picked by more generic code instead.  But perhaps we do need to
> pass a hint from the driver to that code.

We are heading towards running multiple softnet threads.  What should
be the relation, if any, between an RX interrupt handler, the CPU
executing it and the softnet thread being scheduled to process the
packets?  If such a relation makes sense, what kind of hint do we need
to represent it?



Re: MSI-X & Interrupting CPU > 0

2020-01-23 Thread Mark Kettenis

Martin Pieuchot schreef op 2020-01-23 11:28:
I'd like to make progress towards interrupting multiple CPUs in order to
one day make use of multiple queues in some network drivers.  The road
towards that goal is substantial and I'd like to proceed in steps to make
it easier to squash bugs.  I'm currently thinking of the following steps:


 1. Is my interrupt handler safe to be executed on CPU != CPU0?


Except for things that are inherently tied to a specific CPU (clock
interrupts, performance counters, etc) I think the answer here should
always be "yes".
It probably only makes sense for mpsafe handlers to run on secondary CPUs
though.



 2. Is it safe to execute this handler on two or more CPUs at the same
time?


I think that is never safe.  Unless you execute the handler on
different "data".

Running multiple rx interrupt handlers on different CPUs should be fine.


 3. How does interrupting multiple CPUs influence packet processing in
the softnet thread?  Is any knowledge required (CPU affinity?) to
have optimal processing when multiple softnet threads are used?

 4. How to split traffic in one incoming NIC between multiple processing
    units?


You'll need to have some sort of hardware filter that uses a hash of the
packet header to assign an rx queue such that all packets from a single
"flow" end up on the same queue and therefore will be processed by the
same interrupt handler.


This new journey comes with the requirement of being able to interrupt
an arbitrary CPU.  For that we need a new API.  Patrick gave me the
diff below during u2k20 and I'd like to use it to start a discussion.

We currently have 6 drivers using pci_intr_map_msix().  Since we want to
be able to specify a CPU should we introduce a new function like in the
diff below or do we prefer to add a new argument (cpuid_t?) to this one?
This change in itself should already allow us to proceed with the first
item of the above list.


I'm not sure you want to have the driver pick the CPU to which to assign
the interrupt.  In fact I think that doesn't make sense at all.  The CPU
should be picked by more generic code instead.  But perhaps we do need to
pass a hint from the driver to that code.

Then we need a way to read the MSI-X control table size using the define
PCI_MSIX_CTL_TBLSIZE() below.  This can be done in MI, we might also
want to print that information in dmesg, or maybe cache it in pci(4)?


There are already defines for MSIX in pcireg.h, some of which are
duplicated by the defines in this diff.  Don't think caching makes all
that much sense.
Don't think we need to print the table size in dmesg; pcidump(8) already
prints it.  Might make sense to print the vector number though.


Does somebody have a better/stronger/magic way to achieve this goal?


I played a little bit with assigning interrupts to different CPUs in the
past, but at that point this didn't really result in a performance boost.
That was quite a while ago though.  I don't think there are fundamental
problems in getting this going.


What do you think?

Index: arch/alpha/pci/pci_machdep.h
===
RCS file: /cvs/src/sys/arch/alpha/pci/pci_machdep.h,v
retrieving revision 1.30
diff -u -p -r1.30 pci_machdep.h
--- arch/alpha/pci/pci_machdep.h4 May 2016 14:30:00 -   1.30
+++ arch/alpha/pci/pci_machdep.h23 Jan 2020 09:54:50 -
@@ -105,6 +105,8 @@ int alpha_sysctl_chipset(int *, u_int, c
 (*(c)->pc_conf_write)((c)->pc_conf_v, (t), (r), (v))
 #definepci_intr_map_msi(pa, ihp)   (-1)
 #definepci_intr_map_msix(pa, vec, ihp) (-1)
+#definepci_intr_map_msix_cpuid(pa, vec, ihp, cpu) (-1)
+#definepci_intr_msix_count(c, t)   (0)
 #definepci_intr_string(c, ih)  
\
 (*(c)->pc_intr_string)((c)->pc_intr_v, (ih))
 #definepci_intr_line(c, ih)
\
Index: arch/amd64/amd64/acpi_machdep.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/acpi_machdep.c,v
retrieving revision 1.89
diff -u -p -r1.89 acpi_machdep.c
--- arch/amd64/amd64/acpi_machdep.c 20 Dec 2019 07:49:31 -  1.89
+++ arch/amd64/amd64/acpi_machdep.c 23 Jan 2020 09:54:50 -
@@ -194,7 +194,7 @@ acpi_intr_establish(int irq, int flags,

type = (flags & LR_EXTIRQ_MODE) ? IST_EDGE : IST_LEVEL;
return (intr_establish(-1, (struct pic *)apic, map->ioapic_pin,
-   type, level, handler, arg, what));
+   type, level, NULL, handler, arg, what));
 #else
return NULL;
 #endif
Index: arch/amd64/amd64/intr.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/intr.c,v
retrieving revision 1.52
diff -u -p -r1.52 intr.c
--- arch/amd64/amd64/intr.c 25 Mar 2019 18:45:27 -  1.52
+++ 

MSI-X & Interrupting CPU > 0

2020-01-23 Thread Martin Pieuchot
I'd like to make progress towards interrupting multiple CPUs in order to
one day make use of multiple queues in some network drivers.  The road
towards that goal is substantial and I'd like to proceed in steps to make
it easier to squash bugs.  I'm currently thinking of the following steps:

 1. Is my interrupt handler safe to be executed on CPU != CPU0?

 2. Is it safe to execute this handler on two or more CPUs at the same
time?

 3. How does interrupting multiple CPUs influence packet processing in
the softnet thread?  Is any knowledge required (CPU affinity?) to
have optimal processing when multiple softnet threads are used?

 4. How to split traffic in one incoming NIC between multiple processing
units?

This new journey comes with the requirement of being able to interrupt
an arbitrary CPU.  For that we need a new API.  Patrick gave me the
diff below during u2k20 and I'd like to use it to start a discussion.

We currently have 6 drivers using pci_intr_map_msix().  Since we want to
be able to specify a CPU should we introduce a new function like in the
diff below or do we prefer to add a new argument (cpuid_t?) to this one?
This change in itself should already allow us to proceed with the first
item of the above list.

Then we need a way to read the MSI-X control table size using the define
PCI_MSIX_CTL_TBLSIZE() below.  This can be done in MI, we might also
want to print that information in dmesg, or maybe cache it in pci(4)?
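
A minimal MI sketch of reading it, assuming only pci_get_capability(9) and
the standard MSI-X capability layout (Message Control sits in the upper half
of the capability dword and carries the table size minus one in its low 11
bits):

int
pci_msix_table_size(pci_chipset_tag_t pc, pcitag_t tag)
{
	pcireg_t reg;

	if (pci_get_capability(pc, tag, PCI_CAP_MSIX, NULL, &reg) == 0)
		return (0);

	return (((reg >> 16) & 0x7ff) + 1);
}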

Does somebody have a better/stronger/magic way to achieve this goal?

What do you think?

Index: arch/alpha/pci/pci_machdep.h
===
RCS file: /cvs/src/sys/arch/alpha/pci/pci_machdep.h,v
retrieving revision 1.30
diff -u -p -r1.30 pci_machdep.h
--- arch/alpha/pci/pci_machdep.h4 May 2016 14:30:00 -   1.30
+++ arch/alpha/pci/pci_machdep.h23 Jan 2020 09:54:50 -
@@ -105,6 +105,8 @@ int alpha_sysctl_chipset(int *, u_int, c
 (*(c)->pc_conf_write)((c)->pc_conf_v, (t), (r), (v))
 #definepci_intr_map_msi(pa, ihp)   (-1)
 #definepci_intr_map_msix(pa, vec, ihp) (-1)
+#definepci_intr_map_msix_cpuid(pa, vec, ihp, cpu) (-1)
+#definepci_intr_msix_count(c, t)   (0)
 #definepci_intr_string(c, ih)  
\
 (*(c)->pc_intr_string)((c)->pc_intr_v, (ih))
 #definepci_intr_line(c, ih)
\
Index: arch/amd64/amd64/acpi_machdep.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/acpi_machdep.c,v
retrieving revision 1.89
diff -u -p -r1.89 acpi_machdep.c
--- arch/amd64/amd64/acpi_machdep.c 20 Dec 2019 07:49:31 -  1.89
+++ arch/amd64/amd64/acpi_machdep.c 23 Jan 2020 09:54:50 -
@@ -194,7 +194,7 @@ acpi_intr_establish(int irq, int flags, 
 
type = (flags & LR_EXTIRQ_MODE) ? IST_EDGE : IST_LEVEL;
return (intr_establish(-1, (struct pic *)apic, map->ioapic_pin,
-   type, level, handler, arg, what));
+   type, level, NULL, handler, arg, what));
 #else
return NULL;
 #endif
Index: arch/amd64/amd64/intr.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/intr.c,v
retrieving revision 1.52
diff -u -p -r1.52 intr.c
--- arch/amd64/amd64/intr.c 25 Mar 2019 18:45:27 -  1.52
+++ arch/amd64/amd64/intr.c 23 Jan 2020 09:54:50 -
@@ -282,13 +282,20 @@ duplicate:
} else {
 other:
/*
-* Otherwise, look for a free slot elsewhere. Do the primary
-* CPU first.
+* Otherwise, look for a free slot elsewhere. If cip is null, it
+* means try primary cpu but accept secondary, otherwise we need
+* a slot on the requested cpu.
 */
-   ci = &cpu_info_primary;
+   if (*cip == NULL)
+   ci = &cpu_info_primary;
+   else
+   ci = *cip;
error = intr_allocate_slot_cpu(ci, pic, pin, &slot);
if (error == 0)
goto found;
+   /* Can't alloc on the requested cpu, fail. */
+   if (*cip != NULL)
+   return EBUSY;
 
/*
 * ..now try the others.
@@ -323,10 +330,9 @@ int intr_shared_edge;
 
 void *
 intr_establish(int legacy_irq, struct pic *pic, int pin, int type, int level,
-int (*handler)(void *), void *arg, const char *what)
+struct cpu_info *ci, int (*handler)(void *), void *arg, const char *what)
 {
struct intrhand **p, *q, *ih;
-   struct cpu_info *ci;
int slot, error, idt_vec;
struct intrsource *source;
struct intrstub *stubp;
Index: arch/amd64/include/intr.h
===
RCS file: