Re: workqueue for pr_input
On Mon, Jan 30, 2017 at 11:38:11AM +0900, Ryota Ozaki wrote: > That said, I'm still not inclined to introduce selective queuing. I think > it's an optimization and we can do it later if we find we really need it. As long as the data path for CARP is not getting an additional queue, it shouldn't matter. Joerg
Re: workqueue for pr_input
On Sat, Jan 28, 2017 at 9:11 PM, Darren Reed wrote: > On 27/01/2017 11:51 PM, Ryota Ozaki wrote: >> >> On Thu, Jan 26, 2017 at 10:29 AM, Darren Reed wrote: >>> >>> Apologies all for the blob of text :( >>> >>> I resent with better formatting but that's likely in a spam queue >>> somewhere... >>> >>> So I'll try and summarise in a readable fashion: >>> >>> (1) patch introduces new locks for per-packet processing when the >>> architectural goal is lock-less. What would this patch look like if done >>> to be lock-less? >>> (2) queuing has a memory cost. There are DoS attacks introduced that >>> need to be mitigated. >>> (4) does workqueue and this patch suggest that needs >>> enhancing for optimal processing of lists? >> >> http://www.netbsd.org/~ozaki-r/workqueue-pr_input-percpu.diff >> >> As per your comment, I changed the patch: (i) introduced percpu queues >> and removed the lock and (ii) introduced the hard limit to each queue >> to mitigate DoS, which is the same as other input queues such as >> pktqueue and if_input percpuq. > > > Looking better. > > In wqinit_work(), the mutex_enter/exit is inside a loop. > What's being protected here? Whole processing of pr_input protocols that aren't MP-safe yet. > > In wqinput_input(), there's a "wwl->wwl_dropped++". > How do I see these when I do "netstat -s"? I added sysctl nodes, instead of netstat -s, which are commonly used for statistics of input queue such as pktqueue and percpuq. > >>> (3) do all packets need this workqueue approach or only some? Can >>> selective queuing help with (2)? >> >> Workqueue is used only for ICMP/ICMPv6/CARP and packets on fast paths >> aren't affected by the change. Even for the affected packets, the >> influence by the change is small now by the above improvement. >> >> We might be able to use workqueue selectively per packet types say >> ICMP message types, but there is a concern; it will introduce packet >> reordering between packets (e.g., ICMP messages). I think we should >> avoid the behavior, though I'm not sure the impact of the reordering. >> And I don't so much like to introduce such an exceptional behavior >> in the middle of the packet handling (pr_input is modestly a feasible >> place to do, I think). > > > I'm confused as to why the ordering of ICMP messages might be a problem. > I don't think I've ever seen anyone suggest that this is necessary but maybe > I just haven't read the correct section in host requirements. > Can you please explain in more detail? Hm, I think I overestimated the impact of the reordering; it can even happen on the network and so fearing reordering in the host doesn't make sense. Please ignore my claim about it. That said, I'm still not inclined to introduce selective queuing. I think it's an optimization and we can do it later if we find we really need it. Thanks, ozaki-r
Re: workqueue for pr_input
On 27/01/2017 11:51 PM, Ryota Ozaki wrote: On Thu, Jan 26, 2017 at 10:29 AM, Darren Reed wrote: Apologies all for the blob of text :( I resent with better formatting but that's likely in a spam queue somewhere... So I'll try and summarise in a readable fashion: (1) patch introduces new locks for per-packet processing when the architectural goal is lock-less. What would this patch look like if done to be lock-less? (2) queuing has a memory cost. There are DoS attacks introduced that need to be mitigated. (4) does workqueue and this patch suggest that needs enhancing for optimal processing of lists? http://www.netbsd.org/~ozaki-r/workqueue-pr_input-percpu.diff As per your comment, I changed the patch: (i) introduced percpu queues and removed the lock and (ii) introduced the hard limit to each queue to mitigate DoS, which is the same as other input queues such as pktqueue and if_input percpuq. Looking better. In wqinit_work(), the mutex_enter/exit is inside a loop. What's being protected here? In wqinput_input(), there's a "wwl->wwl_dropped++". How do I see these when I do "netstat -s"? (3) do all packets need this workqueue approach or only some? Can selective queuing help with (2)? Workqueue is used only for ICMP/ICMPv6/CARP and packets on fast paths aren't affected by the change. Even for the affected packets, the influence by the change is small now by the above improvement. We might be able to use workqueue selectively per packet types say ICMP message types, but there is a concern; it will introduce packet reordering between packets (e.g., ICMP messages). I think we should avoid the behavior, though I'm not sure the impact of the reordering. And I don't so much like to introduce such an exceptional behavior in the middle of the packet handling (pr_input is modestly a feasible place to do, I think). I'm confused as to why the ordering of ICMP messages might be a problem. I don't think I've ever seen anyone suggest that this is necessary but maybe I just haven't read the correct section in host requirements. Can you please explain in more detail? Cheers, Darren
Re: workqueue for pr_input
On Thu, Jan 26, 2017 at 10:29 AM, Darren Reed wrote: > Apologies all for the blob of text :( > > I resent with better formatting but that's likely in a spam queue > somewhere... > > So I'll try and summarise in a readable fashion: > > (1) patch introduces new locks for per-packet processing when the > architectural goal is lock-less. What would this patch look like if done > to be lock-less? > (2) queuing has a memory cost. There are DoS attacks introduced that > need to be mitigated. > (4) does workqueue and this patch suggest that needs > enhancing for optimal processing of lists? http://www.netbsd.org/~ozaki-r/workqueue-pr_input-percpu.diff As per your comment, I changed the patch: (i) introduced percpu queues and removed the lock and (ii) introduced the hard limit to each queue to mitigate DoS, which is the same as other input queues such as pktqueue and if_input percpuq. As a side effect of the change, the overhead reduced to 10 usec for a single packet pingpong and for flooding the overhead is now negligible, I guess, thanks to bulk packet processing per one workqueue activation. > (3) do all packets need this workqueue approach or only some? Can > selective queuing help with (2)? Workqueue is used only for ICMP/ICMPv6/CARP and packets on fast paths aren't affected by the change. Even for the affected packets, the influence by the change is small now by the above improvement. We might be able to use workqueue selectively per packet types say ICMP message types, but there is a concern; it will introduce packet reordering between packets (e.g., ICMP messages). I think we should avoid the behavior, though I'm not sure the impact of the reordering. And I don't so much like to introduce such an exceptional behavior in the middle of the packet handling (pr_input is modestly a feasible place to do, I think). Thanks, ozaki-r
Re: workqueue for pr_input
Apologies for the earlier email that was one big text blob... On 23/01/2017 7:58 PM, Ryota Ozaki wrote: > Hi, > > I propose a controversial change that makes some > pr_input workqueue: icmp_input, icmp6_input and > carp_input. > > ... Since you are redirecting icmp6_input()... In response to (say) a ping flood, how large can or will the work queue grow?Or unreachable flood? How does this change prevent a DoS attack against NetBSD kernel memory through ICMP message flooding of ICMP6 and growing the workqueue? Granted there are certain ICMP6 messages that are more complex to deal with that will benefit from this change but there are also a number of ICMP6 messages for which this change means "nothing" (the 30usec increase in time to respond to ping is not what I'd call meaningful.) Do they all need to be handled in this fashion when the above attack scenario is considered? Is it worth being able to distinguish using this input function for handling ICMP6 messages that are local to the directly attached subnet only vs those that are not local to the subnet? And a comment for workqueue, If I declare a work queue like this: + error = workqueue_create(&icmp6_input_wq, "icmp6_input", + icmp6_input_work, NULL, PRI_SOFTNET, IPL_SOFTNET,WQ_MPSAFE); then the work queue function like this: +icmp6_input_work(struct work *wk, void *arg) +{ ... + s = splsoftnet(); should not need a second explicit reference to an already nominated SPL. This should be something like: s = workqueue_spl(w); "But the IPL_SOFTNET is for something else." Sure, some workqueue instances may use a different IPL inside the work to that used to that in the process of doing work. (Why does workqueue_init() not use the "ipl" arg? subr_workqueue.c v1.33) On a similar note this: + s = splsoftnet(); + mutex_enter(softnet_lock); might be better done as this: s = splsoftnet_lock() but then how does this work: + mutex_exit(softnet_lock); + splx(s); (for which I don't have a better answer than splxsoftnet_unlock() and I don't know that this is even a good answer. Anyway, back to the task at hand...) In terms of flooding, this change introduces at least four new memory bus synchronisation events per input packet. In a push for lock-less networking, is this desirable (even if the packets are not processed in-line)? And that's not including SPL changes or the locking inside workqueue. A way to halve that would be for icmp6_input_work to simply take the entire chain of work at the start of icmp6_input_work() and work on that privately, rather than do each item individually. Nothing else is going to look at what's on the work queue so there should be no problem. Look at workqueue_worker() for an example of how to save on locking and if the abstraction violation comment is anything to go by then maybe a new set of interfaces is required for to support queues, etc, being copied privately and then reset. Or maybe there are other ways to approach this problem with changes to workqueue but the way the patch is not the best. How would you implement this change without adding any new locks to icmp6.c? Cheers, Darren
Re: workqueue for pr_input
Apologies all for the blob of text :( I resent with better formatting but that's likely in a spam queue somewhere... So I'll try and summarise in a readable fashion: (1) patch introduces new locks for per-packet processing when the architectural goal is lock-less. What would this patch look like if done to be lock-less? (2) queuing has a memory cost. There are DoS attacks introduced that need to be mitigated. (3) do all packets need this workqueue approach or only some? Can selective queuing help with (2)? (4) does workqueue and this patch suggest that needs enhancing for optimal processing of lists? Cheers, Darren On 26/01/2017 12:02 PM, Darren Reed wrote: > On 23/01/2017 7:58 PM, Ryota Ozaki wrote: > >> Hi, >> >> I propose a controversial change that makes some >> pr_input workqueue: icmp_input, icmp6_input and >> carp_input. >> >> ... > Since you are redirecting icmp6_input()... > > In response to (say) a ping flood, how large can or will the work queue > grow?Or unreachable flood? How does this change prevent a DoS attack > against NetBSD kernel memory through ICMP message flooding of ICMP6 and > growing the workqueue? Granted there are certain ICMP6 messages that are > more complex to deal with that will benefit from this change but there > are also a number of ICMP6 messages for which this change means > "nothing" (the 30usec increase in time to respond to ping is not what > I'd call meaningful.) Do they all need to be handled in this fashion > when the above attack scenario is considered? Is it worth being able to > distinguish using this input function for handling ICMP6 messages that > are local to the directly attached subnet only vs those that are not > local to the subnet? And a comment for workqueue, If I declare a work > queue like this: + error = workqueue_create(&icmp6_input_wq, > "icmp6_input", + icmp6_input_work, NULL, PRI_SOFTNET, IPL_SOFTNET, > WQ_MPSAFE); then the work queue function like this: > +icmp6_input_work(struct work *wk, void *arg) +{ ... + s = splsoftnet(); > should not need a second explicit reference to an already nominated SPL. > This should be something like: s = workqueue_spl(w); "But the > IPL_SOFTNET is for something else." Sure, some workqueue instances may > use a different IPL inside the work to that used to that in the process > of doing work. (Why does workqueue_init() not use the "ipl" arg? > subr_workqueue.c v1.33) On a similar note this: + s = splsoftnet(); + > mutex_enter(softnet_lock); might be better done as this: s = > splsoftnet_lock() but then how does this work: + > mutex_exit(softnet_lock); + splx(s); (for which I don't have a better > answer than splxsoftnet_unlock() and I don't know that this is even a > good answer. Anyway, back to the task at hand...) In terms of flooding, > this change introduces at least four new memory bus synchronisation > events per input packet. In a push for lock-less networking, is this > desirable (even if the packets are not processed in-line)? Not including > SPL changes. A way to halve that would be for icmp6_input_work to simply > take the entire chain of work at the start of icmp6_input_work() and > work on that privately, rather than do each item individually. Nothing > else is going to look at what's on the work queue so there should be no > problem. Look at workqueue_worker() for an example of how to save on > locking and if the abstraction violation comment is anything to go by > then maybe a new set of interfaces is required for to > support queues, etc, being copied privately and then reset. Or maybe > there are other ways to approach this problem with changes to workqueue > but the way the patch is not the best. How would you implement this > change without adding any new locks to icmp6.c? Cheers, Darren >
Re: workqueue for pr_input
On 23/01/2017 7:58 PM, Ryota Ozaki wrote: > Hi, > > I propose a controversial change that makes some > pr_input workqueue: icmp_input, icmp6_input and > carp_input. > > ... Since you are redirecting icmp6_input()... In response to (say) a ping flood, how large can or will the work queue grow?Or unreachable flood? How does this change prevent a DoS attack against NetBSD kernel memory through ICMP message flooding of ICMP6 and growing the workqueue? Granted there are certain ICMP6 messages that are more complex to deal with that will benefit from this change but there are also a number of ICMP6 messages for which this change means "nothing" (the 30usec increase in time to respond to ping is not what I'd call meaningful.) Do they all need to be handled in this fashion when the above attack scenario is considered? Is it worth being able to distinguish using this input function for handling ICMP6 messages that are local to the directly attached subnet only vs those that are not local to the subnet? And a comment for workqueue, If I declare a work queue like this: + error = workqueue_create(&icmp6_input_wq, "icmp6_input", + icmp6_input_work, NULL, PRI_SOFTNET, IPL_SOFTNET, WQ_MPSAFE); then the work queue function like this: +icmp6_input_work(struct work *wk, void *arg) +{ ... + s = splsoftnet(); should not need a second explicit reference to an already nominated SPL. This should be something like: s = workqueue_spl(w); "But the IPL_SOFTNET is for something else." Sure, some workqueue instances may use a different IPL inside the work to that used to that in the process of doing work. (Why does workqueue_init() not use the "ipl" arg? subr_workqueue.c v1.33) On a similar note this: + s = splsoftnet(); + mutex_enter(softnet_lock); might be better done as this: s = splsoftnet_lock() but then how does this work: + mutex_exit(softnet_lock); + splx(s); (for which I don't have a better answer than splxsoftnet_unlock() and I don't know that this is even a good answer. Anyway, back to the task at hand...) In terms of flooding, this change introduces at least four new memory bus synchronisation events per input packet. In a push for lock-less networking, is this desirable (even if the packets are not processed in-line)? Not including SPL changes. A way to halve that would be for icmp6_input_work to simply take the entire chain of work at the start of icmp6_input_work() and work on that privately, rather than do each item individually. Nothing else is going to look at what's on the work queue so there should be no problem. Look at workqueue_worker() for an example of how to save on locking and if the abstraction violation comment is anything to go by then maybe a new set of interfaces is required for to support queues, etc, being copied privately and then reset. Or maybe there are other ways to approach this problem with changes to workqueue but the way the patch is not the best. How would you implement this change without adding any new locks to icmp6.c? Cheers, Darren
Re: workqueue for pr_input
On Tue, Jan 24, 2017 at 3:47 PM, Ryota Ozaki wrote: > On Mon, Jan 23, 2017 at 10:01 PM, Thor Lancelot Simon wrote: >> On Mon, Jan 23, 2017 at 05:58:20PM +0900, Ryota Ozaki wrote: >>> >>> The demerit is that that mechanism adds non-trivial >>> overhead; RTT of ping increases by 30 usec. >> >> I don't see why overhead for control protocols like ICMP matters. I >> think if you look at longstanding commercial designs for networking >> equipment -- software and hardware alike -- for literally decades they >> have taken the approach of optimizing the fast path (forwarding, data >> protocol performance) while allowing the exception/error paths (ICMP etc.) >> to be very slow. > > Yes, so I proposed this approach regardless of the overhead. > > If none objects, I'm going to implement a complete version of my proposal > in protosw. Here is a patch: http://www.netbsd.org/~ozaki-r/workqueue-pr_input.diff I eventually gave up implementing the mechanism in protosw; it just made protosw complex. Instead I implemented it as helper functions (see wqinput.c). Any comments? ozaki-r
Re: workqueue for pr_input
On Mon, Jan 23, 2017 at 10:01 PM, Thor Lancelot Simon wrote: > On Mon, Jan 23, 2017 at 05:58:20PM +0900, Ryota Ozaki wrote: >> >> The demerit is that that mechanism adds non-trivial >> overhead; RTT of ping increases by 30 usec. > > I don't see why overhead for control protocols like ICMP matters. I > think if you look at longstanding commercial designs for networking > equipment -- software and hardware alike -- for literally decades they > have taken the approach of optimizing the fast path (forwarding, data > protocol performance) while allowing the exception/error paths (ICMP etc.) > to be very slow. Yes, so I proposed this approach regardless of the overhead. If none objects, I'm going to implement a complete version of my proposal in protosw. ozaki-r
Re: workqueue for pr_input
On Mon, Jan 23, 2017 at 05:58:20PM +0900, Ryota Ozaki wrote: > > The demerit is that that mechanism adds non-trivial > overhead; RTT of ping increases by 30 usec. I don't see why overhead for control protocols like ICMP matters. I think if you look at longstanding commercial designs for networking equipment -- software and hardware alike -- for literally decades they have taken the approach of optimizing the fast path (forwarding, data protocol performance) while allowing the exception/error paths (ICMP etc.) to be very slow. Thor
workqueue for pr_input
Hi, I propose a controversial change that makes some pr_input workqueue: icmp_input, icmp6_input and carp_input. The reason to do so is that they may add, delete and update IP addresses and rtentries. If NET_MPSAFE is enabled, deleting and updating such data require psz/psref waits (pserialize_perform and psref_target_destroy) that sleep and cannot be used in softint. Dealing with deleting data may have another option that defers only a destruction into workqueue, not making whole pr_input workqueue, as rt_free does for now. However, for updating data, we cannot use the same approach. Merits of this approach are: - Every additions/deletions/updates on IP addresses and rtentries will be done in normal LWP contexts, neither in softint nor in hardware interrupt. - It also pulls all malloc/free out of softint and we can make them M_WAITOK or KM_SLEEP. - In reality, tcp_timer_rexmt is a remaining callout that can delete rtentries, but we can fix in the future. - The logic of destructions/updates of addresses/rtentries don't need to take care of execution contexts. - We can get rid of complex deferred rt_free. - We would be able to replace rwlocks for L2 caches (lltable/llentry) with psz/psref. - Though I'm not sure psz/psref suit for them. The demerit is that that mechanism adds non-trivial overhead; RTT of ping increases by 30 usec. Here is a patch of the approach applied to icmp6_input: http://www.netbsd.org/~ozaki-r/workqueue-icmp6_input.diff This is icmp6_input specific, but if we apply the approach to icmp_input and carp_input as well, we would be able to implement a common framework on struct protosw. Note that with the patch all ATF tests for networking pass even if NET_MPSAFE is enabled in rump kernels. Any comments, suggestions, or objections? Thanks, ozaki-r