Re: workqueue for pr_input

2017-01-30 Thread Joerg Sonnenberger
On Mon, Jan 30, 2017 at 11:38:11AM +0900, Ryota Ozaki wrote:
> That said, I'm still not inclined to introduce selective queuing. I think
> it's an optimization and we can do it later if we find we really need it.

As long as the data path for CARP is not getting an additional queue, it
shouldn't matter.

Joerg


Re: workqueue for pr_input

2017-01-29 Thread Ryota Ozaki
On Sat, Jan 28, 2017 at 9:11 PM, Darren Reed  wrote:
> On 27/01/2017 11:51 PM, Ryota Ozaki wrote:
>>
>> On Thu, Jan 26, 2017 at 10:29 AM, Darren Reed  wrote:
>>>
>>> Apologies all for the blob of text :(
>>>
>>> I resent with better formatting but that's likely in a spam queue
>>> somewhere...
>>>
>>> So I'll try and summarise in a readable fashion:
>>>
>>> (1) patch introduces new locks for per-packet processing when the
>>> architectural goal is lock-less. What would this patch look like if done
>>> to be lock-less?
>>> (2) queuing has a memory cost. There are DoS attacks introduced that
>>> need to be mitigated.
>>> (4) does workqueue and this patch suggest that  needs
>>> enhancing for optimal processing of lists?
>>
>> http://www.netbsd.org/~ozaki-r/workqueue-pr_input-percpu.diff
>>
>> As per your comment, I changed the patch: (i) introduced percpu queues
>> and removed the lock and (ii) introduced the hard limit to each queue
>> to mitigate DoS, which is the same as other input queues such as
>> pktqueue and if_input percpuq.
>
>
> Looking better.
>
> In wqinit_work(), the mutex_enter/exit is inside a loop.
> What's being protected here?

Whole processing of pr_input protocols that aren't MP-safe yet.

>
> In wqinput_input(), there's a "wwl->wwl_dropped++".
> How do I see these when I do "netstat -s"?

I added sysctl nodes, instead of netstat -s, which are commonly used
for statistics of input queue such as pktqueue and percpuq.

>
>>> (3) do all packets need this workqueue approach or only some? Can
>>> selective queuing help with (2)?
>>
>> Workqueue is used only for ICMP/ICMPv6/CARP and packets on fast paths
>> aren't affected by the change. Even for the affected packets, the
>> influence by the change is small now by the above improvement.
>>
>> We might be able to use workqueue selectively per packet types say
>> ICMP message types, but there is a concern; it will introduce packet
>> reordering between packets (e.g., ICMP messages). I think we should
>> avoid the behavior, though I'm not sure the impact of the reordering.
>> And I don't so much like to introduce such an exceptional behavior
>> in the middle of the packet handling (pr_input is modestly a feasible
>> place to do, I think).
>
>
> I'm confused as to why the ordering of ICMP messages might be a problem.
> I don't think I've ever seen anyone suggest that this is necessary but maybe
> I just haven't read the correct section in host requirements.
> Can you please explain in more detail?

Hm, I think I overestimated the impact of the reordering; it can even happen
on the network and so fearing reordering in the host doesn't make sense.
Please ignore my claim about it.

That said, I'm still not inclined to introduce selective queuing. I think
it's an optimization and we can do it later if we find we really need it.

Thanks,
  ozaki-r


Re: workqueue for pr_input

2017-01-28 Thread Darren Reed

On 27/01/2017 11:51 PM, Ryota Ozaki wrote:

On Thu, Jan 26, 2017 at 10:29 AM, Darren Reed  wrote:

Apologies all for the blob of text :(

I resent with better formatting but that's likely in a spam queue
somewhere...

So I'll try and summarise in a readable fashion:

(1) patch introduces new locks for per-packet processing when the
architectural goal is lock-less. What would this patch look like if done
to be lock-less?
(2) queuing has a memory cost. There are DoS attacks introduced that
need to be mitigated.
(4) does workqueue and this patch suggest that  needs
enhancing for optimal processing of lists?

http://www.netbsd.org/~ozaki-r/workqueue-pr_input-percpu.diff

As per your comment, I changed the patch: (i) introduced percpu queues
and removed the lock and (ii) introduced the hard limit to each queue
to mitigate DoS, which is the same as other input queues such as
pktqueue and if_input percpuq.


Looking better.

In wqinit_work(), the mutex_enter/exit is inside a loop.
What's being protected here?

In wqinput_input(), there's a "wwl->wwl_dropped++".
How do I see these when I do "netstat -s"?


(3) do all packets need this workqueue approach or only some? Can
selective queuing help with (2)?

Workqueue is used only for ICMP/ICMPv6/CARP and packets on fast paths
aren't affected by the change. Even for the affected packets, the
influence by the change is small now by the above improvement.

We might be able to use workqueue selectively per packet types say
ICMP message types, but there is a concern; it will introduce packet
reordering between packets (e.g., ICMP messages). I think we should
avoid the behavior, though I'm not sure the impact of the reordering.
And I don't so much like to introduce such an exceptional behavior
in the middle of the packet handling (pr_input is modestly a feasible
place to do, I think).


I'm confused as to why the ordering of ICMP messages might be a problem.
I don't think I've ever seen anyone suggest that this is necessary but maybe
I just haven't read the correct section in host requirements.
Can you please explain in more detail?

Cheers,
Darren



Re: workqueue for pr_input

2017-01-27 Thread Ryota Ozaki
On Thu, Jan 26, 2017 at 10:29 AM, Darren Reed  wrote:
> Apologies all for the blob of text :(
>
> I resent with better formatting but that's likely in a spam queue
> somewhere...
>
> So I'll try and summarise in a readable fashion:
>
> (1) patch introduces new locks for per-packet processing when the
> architectural goal is lock-less. What would this patch look like if done
> to be lock-less?
> (2) queuing has a memory cost. There are DoS attacks introduced that
> need to be mitigated.
> (4) does workqueue and this patch suggest that  needs
> enhancing for optimal processing of lists?

http://www.netbsd.org/~ozaki-r/workqueue-pr_input-percpu.diff

As per your comment, I changed the patch: (i) introduced percpu queues
and removed the lock and (ii) introduced the hard limit to each queue
to mitigate DoS, which is the same as other input queues such as
pktqueue and if_input percpuq.

As a side effect of the change, the overhead reduced to 10 usec
for a single packet pingpong and for flooding the overhead is now
negligible, I guess, thanks to bulk packet processing per one
workqueue activation.

> (3) do all packets need this workqueue approach or only some? Can
> selective queuing help with (2)?

Workqueue is used only for ICMP/ICMPv6/CARP and packets on fast paths
aren't affected by the change. Even for the affected packets, the
influence by the change is small now by the above improvement.

We might be able to use workqueue selectively per packet types say
ICMP message types, but there is a concern; it will introduce packet
reordering between packets (e.g., ICMP messages). I think we should
avoid the behavior, though I'm not sure the impact of the reordering.
And I don't so much like to introduce such an exceptional behavior
in the middle of the packet handling (pr_input is modestly a feasible
place to do, I think).

Thanks,
  ozaki-r


Re: workqueue for pr_input

2017-01-25 Thread Darren Reed
Apologies for the earlier email that was one big text blob...

On 23/01/2017 7:58 PM, Ryota Ozaki wrote:

> Hi,
>
> I propose a controversial change that makes some
> pr_input workqueue: icmp_input, icmp6_input and
> carp_input.
>
> ...

Since you are redirecting icmp6_input()...

In response to (say) a ping flood, how large can or will the work queue
grow?Or unreachable flood?

How does this change prevent a DoS attack against NetBSD kernel memory
through ICMP message flooding of ICMP6 and growing the workqueue?

Granted there are certain ICMP6 messages that are more complex to deal
with that will benefit from this change but there are also a number of
ICMP6 messages for which this change means "nothing" (the 30usec increase
in time to respond to ping is not what I'd call meaningful.) Do they all
need to be handled in this fashion when the above attack scenario is
considered? Is it worth being able to distinguish using this input function
for handling ICMP6 messages that are local to the directly attached subnet
only vs those that are not
local to the subnet?

And a comment for workqueue, If I declare a work queue like this:
+ error = workqueue_create(&icmp6_input_wq, "icmp6_input",
+ icmp6_input_work, NULL, PRI_SOFTNET, IPL_SOFTNET,WQ_MPSAFE);

then the work queue function like this:
+icmp6_input_work(struct work *wk, void *arg)
+{
...
+ s = splsoftnet();

should not need a second explicit reference to an already nominated SPL.
This should be something like:

s = workqueue_spl(w);

"But the IPL_SOFTNET is for something else." Sure, some workqueue instances
may use a different IPL inside the work to that used to that in the process
of doing work.

(Why does workqueue_init() not use the "ipl" arg? subr_workqueue.c v1.33)

On a similar note this:
+ s = splsoftnet();
+ mutex_enter(softnet_lock);

might be better done as this:

s = splsoftnet_lock()

but then how does this work:
+ mutex_exit(softnet_lock);
+ splx(s);
(for which I don't have a better answer than splxsoftnet_unlock() and
 I don't know that this is even a good answer. Anyway, back to the task at 
hand...)

In terms of flooding, this change introduces at least four new memory bus
synchronisation events per input packet. In a push for lock-less networking,
is this desirable (even if the packets are not processed in-line)?

And that's not including SPL changes or the locking inside workqueue.

A way to halve that would be for icmp6_input_work to simply take the entire
chain of work at the start of icmp6_input_work() and work on that privately,
rather than do each item individually. Nothing else is going to look at what's
on the work queue so there should be no problem. Look at workqueue_worker()
for an example of how to save on locking and if the abstraction violation
comment is anything to go by then maybe a new set of interfaces is required
for  to support queues, etc, being copied privately and then
reset. Or maybe there are other ways to approach this problem with changes
to workqueue but the way the patch is not the best.

How would you implement this change without adding any new locks to icmp6.c?

Cheers, Darren



Re: workqueue for pr_input

2017-01-25 Thread Darren Reed
Apologies all for the blob of text :(

I resent with better formatting but that's likely in a spam queue
somewhere...

So I'll try and summarise in a readable fashion:

(1) patch introduces new locks for per-packet processing when the
architectural goal is lock-less. What would this patch look like if done
to be lock-less?
(2) queuing has a memory cost. There are DoS attacks introduced that
need to be mitigated.
(3) do all packets need this workqueue approach or only some? Can
selective queuing help with (2)?
(4) does workqueue and this patch suggest that  needs
enhancing for optimal processing of lists?

Cheers,
Darren

On 26/01/2017 12:02 PM, Darren Reed wrote:
> On 23/01/2017 7:58 PM, Ryota Ozaki wrote:
>
>> Hi,
>>
>> I propose a controversial change that makes some
>> pr_input workqueue: icmp_input, icmp6_input and
>> carp_input.
>>
>> ...
> Since you are redirecting icmp6_input()...
>
> In response to (say) a ping flood, how large can or will the work queue
> grow?Or unreachable flood? How does this change prevent a DoS attack
> against NetBSD kernel memory through ICMP message flooding of ICMP6 and
> growing the workqueue? Granted there are certain ICMP6 messages that are
> more complex to deal with that will benefit from this change but there
> are also a number of ICMP6 messages for which this change means
> "nothing" (the 30usec increase in time to respond to ping is not what
> I'd call meaningful.) Do they all need to be handled in this fashion
> when the above attack scenario is considered? Is it worth being able to
> distinguish using this input function for handling ICMP6 messages that
> are local to the directly attached subnet only vs those that are not
> local to the subnet? And a comment for workqueue, If I declare a work
> queue like this: + error = workqueue_create(&icmp6_input_wq,
> "icmp6_input", + icmp6_input_work, NULL, PRI_SOFTNET, IPL_SOFTNET,
> WQ_MPSAFE); then the work queue function like this:
> +icmp6_input_work(struct work *wk, void *arg) +{ ... + s = splsoftnet();
> should not need a second explicit reference to an already nominated SPL.
> This should be something like: s = workqueue_spl(w); "But the
> IPL_SOFTNET is for something else." Sure, some workqueue instances may
> use a different IPL inside the work to that used to that in the process
> of doing work. (Why does workqueue_init() not use the "ipl" arg?
> subr_workqueue.c v1.33) On a similar note this: + s = splsoftnet(); +
> mutex_enter(softnet_lock); might be better done as this: s =
> splsoftnet_lock() but then how does this work: +
> mutex_exit(softnet_lock); + splx(s); (for which I don't have a better
> answer than splxsoftnet_unlock() and I don't know that this is even a
> good answer. Anyway, back to the task at hand...) In terms of flooding,
> this change introduces at least four new memory bus synchronisation
> events per input packet. In a push for lock-less networking, is this
> desirable (even if the packets are not processed in-line)? Not including
> SPL changes. A way to halve that would be for icmp6_input_work to simply
> take the entire chain of work at the start of icmp6_input_work() and
> work on that privately, rather than do each item individually. Nothing
> else is going to look at what's on the work queue so there should be no
> problem. Look at workqueue_worker() for an example of how to save on
> locking and if the abstraction violation comment is anything to go by
> then maybe a new set of interfaces is required for  to
> support queues, etc, being copied privately and then reset. Or maybe
> there are other ways to approach this problem with changes to workqueue
> but the way the patch is not the best. How would you implement this
> change without adding any new locks to icmp6.c? Cheers, Darren
>



Re: workqueue for pr_input

2017-01-25 Thread Darren Reed
On 23/01/2017 7:58 PM, Ryota Ozaki wrote:

> Hi,
>
> I propose a controversial change that makes some
> pr_input workqueue: icmp_input, icmp6_input and
> carp_input.
>
> ...

Since you are redirecting icmp6_input()...

In response to (say) a ping flood, how large can or will the work queue
grow?Or unreachable flood? How does this change prevent a DoS attack
against NetBSD kernel memory through ICMP message flooding of ICMP6 and
growing the workqueue? Granted there are certain ICMP6 messages that are
more complex to deal with that will benefit from this change but there
are also a number of ICMP6 messages for which this change means
"nothing" (the 30usec increase in time to respond to ping is not what
I'd call meaningful.) Do they all need to be handled in this fashion
when the above attack scenario is considered? Is it worth being able to
distinguish using this input function for handling ICMP6 messages that
are local to the directly attached subnet only vs those that are not
local to the subnet? And a comment for workqueue, If I declare a work
queue like this: + error = workqueue_create(&icmp6_input_wq,
"icmp6_input", + icmp6_input_work, NULL, PRI_SOFTNET, IPL_SOFTNET,
WQ_MPSAFE); then the work queue function like this:
+icmp6_input_work(struct work *wk, void *arg) +{ ... + s = splsoftnet();
should not need a second explicit reference to an already nominated SPL.
This should be something like: s = workqueue_spl(w); "But the
IPL_SOFTNET is for something else." Sure, some workqueue instances may
use a different IPL inside the work to that used to that in the process
of doing work. (Why does workqueue_init() not use the "ipl" arg?
subr_workqueue.c v1.33) On a similar note this: + s = splsoftnet(); +
mutex_enter(softnet_lock); might be better done as this: s =
splsoftnet_lock() but then how does this work: +
mutex_exit(softnet_lock); + splx(s); (for which I don't have a better
answer than splxsoftnet_unlock() and I don't know that this is even a
good answer. Anyway, back to the task at hand...) In terms of flooding,
this change introduces at least four new memory bus synchronisation
events per input packet. In a push for lock-less networking, is this
desirable (even if the packets are not processed in-line)? Not including
SPL changes. A way to halve that would be for icmp6_input_work to simply
take the entire chain of work at the start of icmp6_input_work() and
work on that privately, rather than do each item individually. Nothing
else is going to look at what's on the work queue so there should be no
problem. Look at workqueue_worker() for an example of how to save on
locking and if the abstraction violation comment is anything to go by
then maybe a new set of interfaces is required for  to
support queues, etc, being copied privately and then reset. Or maybe
there are other ways to approach this problem with changes to workqueue
but the way the patch is not the best. How would you implement this
change without adding any new locks to icmp6.c? Cheers, Darren



Re: workqueue for pr_input

2017-01-25 Thread Ryota Ozaki
On Tue, Jan 24, 2017 at 3:47 PM, Ryota Ozaki  wrote:
> On Mon, Jan 23, 2017 at 10:01 PM, Thor Lancelot Simon  wrote:
>> On Mon, Jan 23, 2017 at 05:58:20PM +0900, Ryota Ozaki wrote:
>>>
>>> The demerit is that that mechanism adds non-trivial
>>> overhead; RTT of ping increases by 30 usec.
>>
>> I don't see why overhead for control protocols like ICMP matters.  I
>> think if you look at longstanding commercial designs for networking
>> equipment -- software and hardware alike -- for literally decades they
>> have taken the approach of optimizing the fast path (forwarding, data
>> protocol performance) while allowing the exception/error paths (ICMP etc.)
>> to be very slow.
>
> Yes, so I proposed this approach regardless of the overhead.
>
> If none objects, I'm going to implement a complete version of my proposal
> in protosw.

Here is a patch:
  http://www.netbsd.org/~ozaki-r/workqueue-pr_input.diff

I eventually gave up implementing the mechanism in protosw; it just made
protosw complex. Instead I implemented it as helper functions (see wqinput.c).

Any comments?

  ozaki-r


Re: workqueue for pr_input

2017-01-23 Thread Ryota Ozaki
On Mon, Jan 23, 2017 at 10:01 PM, Thor Lancelot Simon  wrote:
> On Mon, Jan 23, 2017 at 05:58:20PM +0900, Ryota Ozaki wrote:
>>
>> The demerit is that that mechanism adds non-trivial
>> overhead; RTT of ping increases by 30 usec.
>
> I don't see why overhead for control protocols like ICMP matters.  I
> think if you look at longstanding commercial designs for networking
> equipment -- software and hardware alike -- for literally decades they
> have taken the approach of optimizing the fast path (forwarding, data
> protocol performance) while allowing the exception/error paths (ICMP etc.)
> to be very slow.

Yes, so I proposed this approach regardless of the overhead.

If none objects, I'm going to implement a complete version of my proposal
in protosw.

  ozaki-r


Re: workqueue for pr_input

2017-01-23 Thread Thor Lancelot Simon
On Mon, Jan 23, 2017 at 05:58:20PM +0900, Ryota Ozaki wrote:
> 
> The demerit is that that mechanism adds non-trivial
> overhead; RTT of ping increases by 30 usec.

I don't see why overhead for control protocols like ICMP matters.  I
think if you look at longstanding commercial designs for networking
equipment -- software and hardware alike -- for literally decades they
have taken the approach of optimizing the fast path (forwarding, data
protocol performance) while allowing the exception/error paths (ICMP etc.)
to be very slow.

Thor


workqueue for pr_input

2017-01-23 Thread Ryota Ozaki
Hi,

I propose a controversial change that makes some
pr_input workqueue: icmp_input, icmp6_input and
carp_input.

The reason to do so is that they may add, delete
and update IP addresses and rtentries. If NET_MPSAFE
is enabled, deleting and updating such data require
psz/psref waits (pserialize_perform and
psref_target_destroy) that sleep and cannot be used
in softint.

Dealing with deleting data may have another option
that defers only a destruction into workqueue, not
making whole pr_input workqueue, as rt_free does
for now. However, for updating data, we cannot use
the same approach.

Merits of this approach are:
- Every additions/deletions/updates on IP addresses
  and rtentries will be done in normal LWP contexts,
  neither in softint nor in hardware interrupt.
  - It also pulls all malloc/free out of softint and
we can make them M_WAITOK or KM_SLEEP.
  - In reality, tcp_timer_rexmt is a remaining callout
that can delete rtentries, but we can fix in the
future.
- The logic of destructions/updates of addresses/rtentries
  don't need to take care of execution contexts.
  - We can get rid of complex deferred rt_free.
- We would be able to replace rwlocks for L2 caches
  (lltable/llentry) with psz/psref.
  - Though I'm not sure psz/psref suit for them.

The demerit is that that mechanism adds non-trivial
overhead; RTT of ping increases by 30 usec.

Here is a patch of the approach applied to icmp6_input:
  http://www.netbsd.org/~ozaki-r/workqueue-icmp6_input.diff

This is icmp6_input specific, but if we apply the approach
to icmp_input and carp_input as well, we would be able to
implement a common framework on struct protosw.

Note that with the patch all ATF tests for networking pass
even if NET_MPSAFE is enabled in rump kernels.

Any comments, suggestions, or objections?

Thanks,
  ozaki-r