Re: [RFC 2/7] ath10k: Add support to process rx packet in thread

2021-03-25 Thread Felix Fietkau


On 2021-03-25 10:45, Rakesh Pillai wrote:
> Hi Felix / Ben,
> 
> In case of ath10k (snoc based targets), we have a lot of processing in the 
> NAPI context.
> Even moving this to threaded NAPI is not helping much due to the load.
> 
> Breaking the tasks into multiple contexts (with the patch series I posted) is 
> helping in improving the throughput.
> With the current rx_thread based approach, the rx processing is broken into 
> two parallel contexts:
> 1) reaping the packets from the HW
> 2) processing this packet list and handing it over to mac80211 (and later 
> to the network stack)
> 
> This is the primary reason for choosing the rx thread approach.
Have you considered the possibility that the problem is the driver doing
too much work?
One example is that you could take advantage of the new 802.3 decap
offload to simplify rx processing. Worked for me on mt76 where a
dual-core 1.3 GHz A64 can easily handle >1.8 Gbps local TCP rx on a
single card, without the rx NAPI thread being the biggest consumer of
CPU cycles.
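
For reference, opting in to the decap offload on the driver side is
roughly this (illustrative sketch only, not ath10k code; it assumes
mac80211's SUPPORTS_RX_DECAP_OFFLOAD hw flag and the RX_FLAG_8023 rx
status flag):

	/* at hw registration time */
	ieee80211_hw_set(hw, SUPPORTS_RX_DECAP_OFFLOAD);

	/* on rx, tag frames the hardware already delivered in 802.3 form
	 * so mac80211 skips the 802.11 -> 802.3 conversion path:
	 */
	rx_status->flag |= RX_FLAG_8023;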

And if you can't do that and still consider all of the metric tons of
processing work necessary, you could still do this:
On interrupts, spawn a processing thread that traverses the ring and
does the preparation work (instead of NAPI).
From that thread you schedule the threaded NAPI handler that processes
these packets further and hands them to mac80211.
To keep the load somewhat balanced, you can limit the number of
pre-processed packets in the ring.
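
Very roughly, the split could look like this (sketch only, all names
hypothetical rather than actual ath10k symbols):

#include <linux/kthread.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/wait.h>

#define MAX_PREPARED_PKTS	256

struct rx_ctx {
	struct napi_struct	napi;
	struct sk_buff_head	prepared;	/* stage 1 -> stage 2 handoff */
	wait_queue_head_t	wq;
	bool			irq_pending;
};

/* driver specific, hypothetical: reap one descriptor from the HW ring
 * and do the cheap preparation work on it
 */
struct sk_buff *reap_and_prepare_one(struct rx_ctx *ctx);

/* Stage 1: woken from the IRQ handler instead of scheduling NAPI directly */
static int rx_prep_thread(void *data)
{
	struct rx_ctx *ctx = data;
	struct sk_buff *skb;

	while (!kthread_should_stop()) {
		wait_event_interruptible(ctx->wq, ctx->irq_pending ||
					 kthread_should_stop());
		ctx->irq_pending = false;

		/* cap the backlog so stage 1 cannot run arbitrarily far
		 * ahead of stage 2
		 */
		while (skb_queue_len(&ctx->prepared) < MAX_PREPARED_PKTS &&
		       (skb = reap_and_prepare_one(ctx)) != NULL)
			skb_queue_tail(&ctx->prepared, skb);

		/* Stage 2: the threaded NAPI poll function dequeues from
		 * ctx->prepared and hands the packets to mac80211.
		 */
		napi_schedule(&ctx->napi);
	}
	return 0;
}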

- Felix


Re: [RFC 2/7] ath10k: Add support to process rx packet in thread

2021-03-23 Thread Felix Fietkau


On 2021-03-23 04:01, Ben Greear wrote:
> On 3/22/21 6:20 PM, Brian Norris wrote:
>> On Mon, Mar 22, 2021 at 4:58 PM Ben Greear  wrote:
>>> On 7/22/20 6:00 AM, Felix Fietkau wrote:
>>>> On 2020-07-22 14:55, Johannes Berg wrote:
>>>>> On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:
>>>>>
>>>>>> I'm considering testing a different approach (with mt76 initially):
>>>>>> - Add a mac80211 rx function that puts processed skbs into a list
>>>>>> instead of handing them to the network stack directly.
>>>>>
>>>>> Would this be *after* all the mac80211 processing, i.e. in place of the
>>>>> rx-up-to-stack?
>>>> Yes, it would run all the rx handlers normally and then put the
>>>> resulting skbs into a list instead of calling netif_receive_skb or
>>>> napi_gro_frags.
>>>
>>> Whatever came of this?  I realized I'm running Felix's patch since his mt76
>>> driver needs it.  Any chance it will go upstream?
>> 
>> If you're asking about $subject (moving NAPI/RX to a thread), this
>> landed upstream recently:
>> http://git.kernel.org/linus/adbb4fb028452b1b0488a1a7b66ab856cdf20715
>> 
>> It needs a bit of coaxing to work on a WiFi driver (including: WiFi
>> drivers tend to have a different netdev for NAPI than they expose to
>> /sys/class/net/), but it's there.
>> 
>> I'm not sure if people had something else in mind in the stuff you're
>> quoting though.
> 
> No, I got it confused with something Felix did:
> 
> https://github.com/greearb/mt76/blob/master/patches/0001-net-add-support-for-threaded-NAPI-polling.patch
> 
> Maybe the NAPI/RX to a thread thing superseded Felix's patch?
Yes, it did and it's in linux-next already.
I sent the following change to make mt76 use it:
https://github.com/nbd168/wireless/commit/1d4ff31437e5aaa999bd7a

- Felix


Re: [PATCH net-next 23/23] net: ethernet: mtk_eth_soc: fix parsing packets in GDM

2021-03-11 Thread Felix Fietkau


> On 11. Mar 2021, at 01:36, Pablo Neira Ayuso  wrote:
> 
> From: Felix Fietkau 
> 
> When using DSA, set the special tag in GDM ingress control to allow the MAC
> to parse packets properly earlier. This affects rx DMA source port reporting.
> 
> Signed-off-by: Felix Fietkau 
> Signed-off-by: Pablo Neira Ayuso 
> ---
In order to avoid regressions within the series, this patch needs to come
before patch 21.

- Felix


Re: [PATCH 2/2] ath9k: fix ath_tx_process_buffer() potential null ptr dereference

2021-02-17 Thread Felix Fietkau


> On 17. Feb 2021, at 21:28, Shuah Khan  wrote:
> 
> On 2/17/21 7:56 AM, Shuah Khan wrote:
>>> On 2/17/21 12:30 AM, Kalle Valo wrote:
>>> Shuah Khan  writes:
>>> 
>>>> On 2/16/21 12:53 AM, Felix Fietkau wrote:
>>>>> 
>>>>> On 2021-02-16 08:03, Kalle Valo wrote:
>>>>>> Shuah Khan  wrote:
>>>>>> 
>>>>>>> ath_tx_process_buffer() references ieee80211_find_sta_by_ifaddr()
>>>>>>> return pointer (sta) outside null check. Fix it by moving the code
>>>>>>> block under the null check.
>>>>>>> 
>>>>>>> This problem was found while reviewing code to debug RCU warn from
>>>>>>> ath10k_wmi_tlv_parse_peer_stats_info() and a subsequent manual audit
>>>>>>> of other callers of ieee80211_find_sta_by_ifaddr() that don't hold
>>>>>>> RCU read lock.
>>>>>>> 
>>>>>>> Signed-off-by: Shuah Khan 
>>>>>>> Signed-off-by: Kalle Valo 
>>>>>> 
>>>>>> Patch applied to ath-next branch of ath.git, thanks.
>>>>>> 
>>>>>> a56c14bb21b2 ath9k: fix ath_tx_process_buffer() potential null ptr 
>>>>>> dereference
>>>>> I just took another look at this patch, and it is completely bogus.
>>>>> Not only does the stated reason not make any sense (sta is simply passed
>>>>> to other functions, not dereferenced without checks), but this also
>>>>> introduces a horrible memory leak by skipping buffer completion if sta
>>>>> is NULL.
>>>>> Please drop it, the code is fine as-is.
>>>> 
> 
> Felix,
> 
> I looked at the code path again and found the following path that
> could lead to a potential dereference downstream.
> 
> First path: ath_tx_complete_buf()
> 
> 1. ath_tx_process_buffer() passes sta to ath_tx_complete_buf()
> 2. ath_tx_complete_buf() doesn't check or dereference sta
>   Passes it on to ath_tx_complete()
> 3. ath_tx_complete() doesn't check or dereference sta, but assigns
>   it to tx_info->status.status_driver_data[0]
>   tx_info->status.status_driver_data[0] = sta;
> 
> ath_tx_complete_buf() should be fixed to check sta perhaps?
> 
> This assignment without checking could lead to dereference at some
> point in the future.
The assignment is fine, no check needed here. If there was any invalid 
dereference here, we would see reports of crashes with NULL pointer 
dereference. Sending packets with sta==NULL is quite common, especially in AP 
mode.

> Second path: ath_tx_complete_aggr()
> 
> 1. ath_tx_process_buffer() passes sta to ath_tx_complete_aggr()
> 2. No problems in this path as ath_tx_complete_aggr() checks
>   sta before use.
> 
> I can send the revert as it moves more code than necessary under
> the null check. As you pointed out, it could lead to a memory leak.
> Not knowing this code well, I can't really tell where. However,
> my original concern is valid for ath_tx_complete_buf().
I still don’t see anything to be concerned about. I don’t think passing a
pointer that could be NULL to another function even deserves a comment in the
code. Stuff like that is done in many other places as well.

- Felix


Re: [PATCH 2/2] ath9k: fix ath_tx_process_buffer() potential null ptr dereference

2021-02-15 Thread Felix Fietkau


On 2021-02-16 08:03, Kalle Valo wrote:
> Shuah Khan  wrote:
> 
>> ath_tx_process_buffer() references ieee80211_find_sta_by_ifaddr()
>> return pointer (sta) outside null check. Fix it by moving the code
>> block under the null check.
>> 
>> This problem was found while reviewing code to debug RCU warn from
>> ath10k_wmi_tlv_parse_peer_stats_info() and a subsequent manual audit
>> of other callers of ieee80211_find_sta_by_ifaddr() that don't hold
>> RCU read lock.
>> 
>> Signed-off-by: Shuah Khan 
>> Signed-off-by: Kalle Valo 
> 
> Patch applied to ath-next branch of ath.git, thanks.
> 
> a56c14bb21b2 ath9k: fix ath_tx_process_buffer() potential null ptr dereference
I just took another look at this patch, and it is completely bogus.
Not only does the stated reason not make any sense (sta is simply passed
to other functions, not dereferenced without checks), but this also
introduces a horrible memory leak by skipping buffer completion if sta
is NULL.
Please drop it, the code is fine as-is.

- Felix


Re: [PATCH] rtw88: hold RCU lock when calling ieee80211_find_sta_by_ifaddr()

2021-02-11 Thread Felix Fietkau
On 2021-02-12 03:13, Shuah Khan wrote:
> ieee80211_find_sta_by_ifaddr() must be called under the RCU lock and
> the resulting pointer is only valid under RCU lock as well.
> 
> Fix rtw_rx_addr_match_iter() to hold RCU read lock before it calls
> ieee80211_find_sta_by_ifaddr() and release it when the resulting
> pointer is no longer needed.
> 
> This problem was found while reviewing code to debug RCU warn from
> ath10k_wmi_tlv_parse_peer_stats_info() and a subsequent manual audit
> of other callers of ieee80211_find_sta_by_ifaddr() that don't hold
> RCU read lock.
> 
> Signed-off-by: Shuah Khan 
This one also seems unnecessary. rtw_rx_addr_match_iter is called by
ieee80211_iterate_active_interfaces_atomic, which acquires the RCU read
lock before calling it.

- Felix


Re: [PATCH 1/2] ath9k: hold RCU lock when calling ieee80211_find_sta_by_ifaddr()

2021-02-11 Thread Felix Fietkau
On 2021-02-12 03:13, Shuah Khan wrote:
> ieee80211_find_sta_by_ifaddr() must be called under the RCU lock and
> the resulting pointer is only valid under RCU lock as well.
> 
> Fix ath_tx_process_buffer() to hold RCU read lock before it calls
> ieee80211_find_sta_by_ifaddr() and release it when the resulting
> pointer is no longer needed.
> 
> This problem was found while reviewing code to debug RCU warn from
> ath10k_wmi_tlv_parse_peer_stats_info() and a subsequent manual audit
> of other callers of ieee80211_find_sta_by_ifaddr() that don't hold
> RCU read lock.
> 
> Signed-off-by: Shuah Khan 
This patch seems unnecessary as well. All callers of
ath_tx_process_buffer seem to hold the RCU read lock already.

- Felix


Re: [PATCH] mt76: hold RCU lock when calling ieee80211_find_sta_by_ifaddr()

2021-02-11 Thread Felix Fietkau


On 2021-02-12 03:13, Shuah Khan wrote:
> ieee80211_find_sta_by_ifaddr() must be called under the RCU lock and
> the resulting pointer is only valid under RCU lock as well.
> 
> Fix mt76_check_sta() to hold RCU read lock before it calls
> ieee80211_find_sta_by_ifaddr() and release it when the resulting
> pointer is no longer needed.
> 
> This problem was found while reviewing code to debug RCU warn from
> ath10k_wmi_tlv_parse_peer_stats_info() and a subsequent manual audit
> of other callers of ieee80211_find_sta_by_ifaddr() that don't hold
> RCU read lock.
> 
> Signed-off-by: Shuah Khan 
If I'm not mistaken, this patch is unnecessary. mt76_check_sta is only
called from mt76_rx_poll_complete, which itself is only called under RCU
lock.

- Felix


Re: [PATCH] mt76: Fix queue ID variable types after mcu queue split

2021-01-14 Thread Felix Fietkau
On 2021-01-11 09:06, Kalle Valo wrote:
> Lorenzo Bianconi  writes:
> 
>>> Clang warns in both mt7615 and mt7915:
>>> 
>>> drivers/net/wireless/mediatek/mt76/mt7915/mcu.c:271:9: warning: implicit
>>> conversion from enumeration type 'enum mt76_mcuq_id' to different
>>> enumeration type 'enum mt76_txq_id' [-Wenum-conversion]
>>> txq = MT_MCUQ_FWDL;
>>> ~ ^~~~
>>> drivers/net/wireless/mediatek/mt76/mt7915/mcu.c:278:9: warning: implicit
>>> conversion from enumeration type 'enum mt76_mcuq_id' to different
>>> enumeration type 'enum mt76_txq_id' [-Wenum-conversion]
>>> txq = MT_MCUQ_WA;
>>> ~ ^~
>>> drivers/net/wireless/mediatek/mt76/mt7915/mcu.c:282:9: warning: implicit
>>> conversion from enumeration type 'enum mt76_mcuq_id' to different
>>> enumeration type 'enum mt76_txq_id' [-Wenum-conversion]
>>> txq = MT_MCUQ_WM;
>>> ~ ^~
>>> 3 warnings generated.
>>> 
>>> drivers/net/wireless/mediatek/mt76/mt7615/mcu.c:238:9: warning: implicit
>>> conversion from enumeration type 'enum mt76_mcuq_id' to different
>>> enumeration type 'enum mt76_txq_id' [-Wenum-conversion]
>>> qid = MT_MCUQ_WM;
>>> ~ ^~
>>> drivers/net/wireless/mediatek/mt76/mt7615/mcu.c:240:9: warning: implicit
>>> conversion from enumeration type 'enum mt76_mcuq_id' to different
>>> enumeration type 'enum mt76_txq_id' [-Wenum-conversion]
>>> qid = MT_MCUQ_FWDL;
>>> ~ ^~~~
>>> 2 warnings generated.
>>> 
>>> Use the proper type for the queue ID variables to fix these warnings.
>>> Additionally, rename the txq variable in mt7915_mcu_send_message to be
>>> more neutral like mt7615_mcu_send_message.
>>> 
>>> Fixes: e637763b606b ("mt76: move mcu queues to mt76_dev q_mcu array")
>>> Link: https://github.com/ClangBuiltLinux/linux/issues/1229
>>> Signed-off-by: Nathan Chancellor 
>>
>> Acked-by: Lorenzo Bianconi 
> 
> I see that Felix already applied this, but as this is a regression
> starting from v5.11-rc1 I think it should be applied to
> wireless-drivers. Felix, can you drop this from your tree so that I
> could apply it to wireless-drivers?
Sure, will do.

- Felix



Re: [PATCH net 6/9] MAINTAINERS: mtk-eth: remove Felix

2021-01-11 Thread Felix Fietkau


On 2021-01-11 18:41, Jakub Kicinski wrote:
> On Sun, 10 Jan 2021 21:45:46 -0800 Joe Perches wrote:
>> On Sun, 2021-01-10 at 21:27 -0800, Jakub Kicinski wrote:
>> > Drop Felix from Mediatek Ethernet driver maintainers.
>> > We haven't seen any tags since the initial submission.  
>> []
>> > diff --git a/MAINTAINERS b/MAINTAINERS  
>> []
>> > @@ -11165,7 +11165,6 @@ F: Documentation/devicetree/bindings/dma/mtk-*
>> >  F:	drivers/dma/mediatek/
>> >  
>> > 
>> >  MEDIATEK ETHERNET DRIVER
>> > -M:	Felix Fietkau 
>> >  M:	John Crispin 
>> >  M:	Sean Wang 
>> >  M:	Mark Lee   
>> 
>> I think your script is broken as there are multiple subdirectories
>> for this entry and 
> 
> Quite the opposite, the script intentionally counts contributions
> only to the code under the MAINTAINERS entry. People lose interest and
> move on to working on other parts of the kernel (or maybe were never
> that interested in maintaining something in the first place?). 
> 
> We want to list folks who are likely to give us reviews.
> 
>> Felix is active.
> 
> Which I tried to state in the commit message as well :)
Sorry for the delayed response. I'm going to submit a bunch of work on
this driver in the near future.
The patches have already been written, just need a bit more time for
testing/review.

- Felix


Re: [PATCH 2/3] mac80211: Add support to trigger sta disconnect on hardware restart

2020-12-15 Thread Felix Fietkau


On 2020-12-15 18:23, Youghandhar Chintala wrote:
> Currently in case of target hardware restart, we just reconfig and
> re-enable the security keys and enable the network queues to start
> data traffic back from where it was interrupted.
> 
> Many ath10k wifi chipsets have sequence numbers for the data
> packets assigned by firmware and the mac sequence number will
> restart from zero after target hardware restart leading to mismatch
> in the sequence number expected by the remote peer vs the sequence
> number of the frame sent by the target firmware.
> 
> This mismatch in sequence number will cause out-of-order packets
> on the remote peer and all the frames sent by the device are dropped
> until we reach the sequence number which was sent before we restarted
> the target hardware.
> 
> In order to fix this, we trigger a sta disconnect, for the targets
> which expose this corresponding wiphy flag, in case of target hw
> restart. After this there will be a fresh connection and thereby
> avoiding the dropping of frames by remote peer.
> 
> The right fix would be to pull the entire data path into the host,
> but that is not feasible: it would need lots of complex changes and
> would still be inefficient.
How about simply tracking which tids have aggregation enabled and sending
DELBA frames for those after the restart?
It would mean less disruption for affected stations and less ugly hacks
in the stack for unreliable hardware.
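
Rough sketch of what that could look like (the per-sta tid bookkeeping
and all names are made up for illustration; tearing down the tx BA
session via mac80211 is what ends up emitting the DELBA):

	#include <linux/bitops.h>
	#include <net/mac80211.h>

	/* hypothetical per-station driver state: which tids had a tx BA
	 * session established before the hw restart
	 */
	struct my_sta {
		struct ieee80211_sta *sta;
		unsigned long ampdu_tx_active;	/* bitmap of tids */
	};

	/* after hw restart: tear down the previously active tx BA sessions,
	 * so the peer drops its old reorder state instead of waiting for
	 * the pre-restart sequence numbers
	 */
	static void restart_teardown_ba_sessions(struct my_sta *msta)
	{
		int tid;

		for (tid = 0; tid < IEEE80211_NUM_TIDS; tid++) {
			if (!test_bit(tid, &msta->ampdu_tx_active))
				continue;
			ieee80211_stop_tx_ba_session(msta->sta, tid);
		}
	}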

- Felix


Re: [PATCH net-next 5/5] net: improve napi threaded config

2020-10-01 Thread Felix Fietkau


On 2020-10-01 21:24, Wei Wang wrote:
> On Thu, Oct 1, 2020 at 11:38 AM Felix Fietkau  wrote:
>>
>>
>> On 2020-10-01 20:03, Eric Dumazet wrote:
>> > On Thu, Oct 1, 2020 at 7:12 PM Felix Fietkau  wrote:
>> >>
>> >> On 2020-10-01 19:01, Wei Wang wrote:
>> >> > On Thu, Oct 1, 2020 at 3:01 AM Felix Fietkau  wrote:
>> >> >>
>> >> >>
>> >> >> On 2020-09-30 21:21, Wei Wang wrote:
>> >> >> > This commit mainly addresses the threaded config to make the switch
>> >> >> > between softirq based and kthread based NAPI processing not require
>> >> >> > a device down/up.
>> >> >> > It also moves the kthread_create() call to the sysfs handler when 
>> >> >> > user
>> >> >> > tries to enable "threaded" on napi, and properly handles the
>> >> >> > kthread_create() failure. This is because certain drivers do not have
>> >> >> > the napi created and linked to the dev when dev_open() is called. So
>> >> >> > the previous implementation does not work properly there.
>> >> >> >
>> >> >> > Signed-off-by: Wei Wang 
>> >> >> > ---
>> >> >> > Changes since RFC:
>> >> >> > changed the thread name to napi/-
>> >> >> >
>> >> >> >  net/core/dev.c   | 49 
>> >> >> > +---
>> >> >> >  net/core/net-sysfs.c |  9 +++-
>> >> >> >  2 files changed, 31 insertions(+), 27 deletions(-)
>> >> >> >
>> >> >> > diff --git a/net/core/dev.c b/net/core/dev.c
>> >> >> > index b4f33e442b5e..bf878d3a9d89 100644
>> >> >> > --- a/net/core/dev.c
>> >> >> > +++ b/net/core/dev.c
>> >> >> > @@ -1490,17 +1490,24 @@ EXPORT_SYMBOL(netdev_notify_peers);
>> >> >> >
>> >> >> >  static int napi_threaded_poll(void *data);
>> >> >> >
>> >> >> > -static void napi_thread_start(struct napi_struct *n)
>> >> >> > +static int napi_kthread_create(struct napi_struct *n)
>> >> >> >  {
>> >> >> > - if (test_bit(NAPI_STATE_THREADED, &n->state) && !n->thread)
>> >> >> > - n->thread = kthread_create(napi_threaded_poll, n, 
>> >> >> > "%s-%d",
>> >> >> > -n->dev->name, n->napi_id);
>> >> >> > + int err = 0;
>> >> >> > +
>> >> >> > + n->thread = kthread_create(napi_threaded_poll, n, "napi/%s-%d",
>> >> >> > +n->dev->name, n->napi_id);
>> >> >> > + if (IS_ERR(n->thread)) {
>> >> >> > + err = PTR_ERR(n->thread);
>> >> >> > + pr_err("kthread_create failed with err %d\n", err);
>> >> >> > + n->thread = NULL;
>> >> >> > + }
>> >> >> > +
>> >> >> > + return err;
>> >> >> If I remember correctly, using kthread_create with no explicit first
>> >> >> wakeup means the task will sit there and contribute to system loadavg
>> >> >> until it is woken up the first time.
>> >> >> Shouldn't we use kthread_run here instead?
>> >> >>
>> >> >
>> >> > Right. kthread_create() basically creates the thread and leaves it in
>> >> > sleep mode. I think that is what we want. We rely on the next
>> >> > ___napi_schedule() call to wake up this thread when there is work to
>> >> > do.
>> >> But what if you have a device that's basically idle and napi isn't
>> >> scheduled until much later? It will get a confusing loadavg until then.
>> >> I'd prefer waking up the thread immediately and filtering going back to
>> >> sleep once in the thread function before running the loop if
>> >> NAPI_STATE_SCHED wasn't set.
>> >>
>> >
>> > I was not aware of this kthread_create() impact on loadavg.
>> > This seems like a bug to me. (although I do not care about loadavg)
>> >
>> > Do you have pointers on some documentation ?
> 
> I found this link:
> http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
> It has a section called "Linux Uninterruptible Tasks" which explains
> this behavior specifically. But I don't see a good conclusion on why.
> Seems to be a convention.
> IMHO, this is actually the problem/decision of the loadavg. It should
> not impact how the kernel code is implemented. I think it makes more
> sense to only wake up the thread when there is work to do.
There were other users of kthread where the same issue was fixed.
With a quick search, I found these commits:
e890591413819eeb604207ad3261ba617b2ec0bb
3f776e8a25a9d281125490562e1cc5bd7c14cf7c

Please note that one of these describes that a kthread that was created
but not woken was triggering a blocked task warning - so it's not just
the loadavg that matters here.

All the other users of kthread that I looked at also do an initial
wakeup of the thread. Not doing it seems like wrong use of the API to me.
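
For comparison, the usual pattern looks something like this (sketch on
top of the code quoted above, not a tested patch):

	n->thread = kthread_create(napi_threaded_poll, n, "napi/%s-%d",
				   n->dev->name, n->napi_id);
	if (IS_ERR(n->thread)) {
		err = PTR_ERR(n->thread);
		n->thread = NULL;
		return err;
	}
	/* Initial wakeup: the thread goes straight back to sleep in the
	 * thread function if NAPI has not been scheduled yet, but it no
	 * longer sits in TASK_UNINTERRUPTIBLE inflating loadavg or
	 * tripping the blocked task detector.
	 * kthread_run() would combine create + first wakeup in one call.
	 */
	wake_up_process(n->thread);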

- Felix


Re: [PATCH net-next 5/5] net: improve napi threaded config

2020-10-01 Thread Felix Fietkau


On 2020-10-01 20:03, Eric Dumazet wrote:
> On Thu, Oct 1, 2020 at 7:12 PM Felix Fietkau  wrote:
>>
>> On 2020-10-01 19:01, Wei Wang wrote:
>> > On Thu, Oct 1, 2020 at 3:01 AM Felix Fietkau  wrote:
>> >>
>> >>
>> >> On 2020-09-30 21:21, Wei Wang wrote:
>> >> > This commit mainly addresses the threaded config to make the switch
>> >> > between softirq based and kthread based NAPI processing not require
>> >> > a device down/up.
>> >> > It also moves the kthread_create() call to the sysfs handler when user
>> >> > tries to enable "threaded" on napi, and properly handles the
>> >> > kthread_create() failure. This is because certain drivers do not have
>> >> > the napi created and linked to the dev when dev_open() is called. So
>> >> > the previous implementation does not work properly there.
>> >> >
>> >> > Signed-off-by: Wei Wang 
>> >> > ---
>> >> > Changes since RFC:
>> >> > changed the thread name to napi/-
>> >> >
>> >> >  net/core/dev.c   | 49 +---
>> >> >  net/core/net-sysfs.c |  9 +++-
>> >> >  2 files changed, 31 insertions(+), 27 deletions(-)
>> >> >
>> >> > diff --git a/net/core/dev.c b/net/core/dev.c
>> >> > index b4f33e442b5e..bf878d3a9d89 100644
>> >> > --- a/net/core/dev.c
>> >> > +++ b/net/core/dev.c
>> >> > @@ -1490,17 +1490,24 @@ EXPORT_SYMBOL(netdev_notify_peers);
>> >> >
>> >> >  static int napi_threaded_poll(void *data);
>> >> >
>> >> > -static void napi_thread_start(struct napi_struct *n)
>> >> > +static int napi_kthread_create(struct napi_struct *n)
>> >> >  {
>> >> > - if (test_bit(NAPI_STATE_THREADED, &n->state) && !n->thread)
>> >> > - n->thread = kthread_create(napi_threaded_poll, n, "%s-%d",
>> >> > -n->dev->name, n->napi_id);
>> >> > + int err = 0;
>> >> > +
>> >> > + n->thread = kthread_create(napi_threaded_poll, n, "napi/%s-%d",
>> >> > +n->dev->name, n->napi_id);
>> >> > + if (IS_ERR(n->thread)) {
>> >> > + err = PTR_ERR(n->thread);
>> >> > + pr_err("kthread_create failed with err %d\n", err);
>> >> > + n->thread = NULL;
>> >> > + }
>> >> > +
>> >> > + return err;
>> >> If I remember correctly, using kthread_create with no explicit first
>> >> wakeup means the task will sit there and contribute to system loadavg
>> >> until it is woken up the first time.
>> >> Shouldn't we use kthread_run here instead?
>> >>
>> >
>> > Right. kthread_create() basically creates the thread and leaves it in
>> > sleep mode. I think that is what we want. We rely on the next
>> > ___napi_schedule() call to wake up this thread when there is work to
>> > do.
>> But what if you have a device that's basically idle and napi isn't
>> scheduled until much later? It will get a confusing loadavg until then.
>> I'd prefer waking up the thread immediately and filtering going back to
>> sleep once in the thread function before running the loop if
>> NAPI_STATE_SCHED wasn't set.
>>
> 
> I was not aware of this kthread_create() impact on loadavg.
> This seems like a bug to me. (although I do not care about loadavg)
> 
> Do you have pointers on some documentation ?
I don't have any specific documentation pointers, but this is something
I observed on several occasions when playing with kthreads.

From what I can find in the loadavg code it seems that tasks in
TASK_UNINTERRUPTIBLE state are counted for loadavg alongside actually
runnable tasks. This seems intentional to me, but I don't know why it
was made like this.

A kthread does not start the thread function until it has been woken up
at least once, most likely to give the creating code a chance to perform
some initializations after successfully creating the thread, before the
thread function starts doing something. Until that first wakeup, kthread()
sets TASK_UNINTERRUPTIBLE and calls schedule() once.

> Probably not a big deal, but this seems quite odd to me.
I've run into enough users that look at loadavg as a measure of system
load and would likely start reporting bugs if they observe such
behavior. I'd like to avoid that.

- Felix


Re: [PATCH net-next 5/5] net: improve napi threaded config

2020-10-01 Thread Felix Fietkau
On 2020-10-01 19:01, Wei Wang wrote:
> On Thu, Oct 1, 2020 at 3:01 AM Felix Fietkau  wrote:
>>
>>
>> On 2020-09-30 21:21, Wei Wang wrote:
>> > This commit mainly addresses the threaded config to make the switch
>> > between softirq based and kthread based NAPI processing not require
>> > a device down/up.
>> > It also moves the kthread_create() call to the sysfs handler when user
>> > tries to enable "threaded" on napi, and properly handles the
>> > kthread_create() failure. This is because certain drivers do not have
>> > the napi created and linked to the dev when dev_open() is called. So
>> > the previous implementation does not work properly there.
>> >
>> > Signed-off-by: Wei Wang 
>> > ---
>> > Changes since RFC:
>> > changed the thread name to napi/-
>> >
>> >  net/core/dev.c   | 49 +---
>> >  net/core/net-sysfs.c |  9 +++-
>> >  2 files changed, 31 insertions(+), 27 deletions(-)
>> >
>> > diff --git a/net/core/dev.c b/net/core/dev.c
>> > index b4f33e442b5e..bf878d3a9d89 100644
>> > --- a/net/core/dev.c
>> > +++ b/net/core/dev.c
>> > @@ -1490,17 +1490,24 @@ EXPORT_SYMBOL(netdev_notify_peers);
>> >
>> >  static int napi_threaded_poll(void *data);
>> >
>> > -static void napi_thread_start(struct napi_struct *n)
>> > +static int napi_kthread_create(struct napi_struct *n)
>> >  {
>> > - if (test_bit(NAPI_STATE_THREADED, &n->state) && !n->thread)
>> > - n->thread = kthread_create(napi_threaded_poll, n, "%s-%d",
>> > -n->dev->name, n->napi_id);
>> > + int err = 0;
>> > +
>> > + n->thread = kthread_create(napi_threaded_poll, n, "napi/%s-%d",
>> > +n->dev->name, n->napi_id);
>> > + if (IS_ERR(n->thread)) {
>> > + err = PTR_ERR(n->thread);
>> > + pr_err("kthread_create failed with err %d\n", err);
>> > + n->thread = NULL;
>> > + }
>> > +
>> > + return err;
>> If I remember correctly, using kthread_create with no explicit first
>> wakeup means the task will sit there and contribute to system loadavg
>> until it is woken up the first time.
>> Shouldn't we use kthread_run here instead?
>>
> 
> Right. kthread_create() basically creates the thread and leaves it in
> sleep mode. I think that is what we want. We rely on the next
> ___napi_schedule() call to wake up this thread when there is work to
> do.
But what if you have a device that's basically idle and napi isn't
scheduled until much later? It will get a confusing loadavg until then.
I'd prefer waking up the thread immediately and filtering going back to
sleep once in the thread function before running the loop if
NAPI_STATE_SCHED wasn't set.
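
Something along these lines (rough sketch only; it assumes a
__napi_poll() helper that runs one poll round and reports via *repoll
whether more work is pending):

	static int napi_threaded_poll(void *data)
	{
		struct napi_struct *napi = data;
		bool repoll;

		while (!kthread_should_stop()) {
			/* The initial wakeup (or any spurious one) lands
			 * here: if NAPI was not actually scheduled, just go
			 * back to sleep without touching the device.
			 */
			set_current_state(TASK_INTERRUPTIBLE);
			if (!test_bit(NAPI_STATE_SCHED, &napi->state)) {
				schedule();
				continue;
			}
			__set_current_state(TASK_RUNNING);

			repoll = false;
			local_bh_disable();
			__napi_poll(napi, &repoll);
			local_bh_enable();

			if (repoll)
				cond_resched();	/* more work pending */
		}
		return 0;
	}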

- Felix


Re: [PATCH net-next 5/5] net: improve napi threaded config

2020-10-01 Thread Felix Fietkau


On 2020-09-30 21:21, Wei Wang wrote:
> This commit mainly addresses the threaded config to make the switch
> between softirq based and kthread based NAPI processing not require
> a device down/up.
> It also moves the kthread_create() call to the sysfs handler when user
> tries to enable "threaded" on napi, and properly handles the
> kthread_create() failure. This is because certain drivers do not have
> the napi created and linked to the dev when dev_open() is called. So
> the previous implementation does not work properly there.
> 
> Signed-off-by: Wei Wang 
> ---
> Changes since RFC:
> changed the thread name to napi/-
> 
>  net/core/dev.c   | 49 +---
>  net/core/net-sysfs.c |  9 +++-
>  2 files changed, 31 insertions(+), 27 deletions(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index b4f33e442b5e..bf878d3a9d89 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1490,17 +1490,24 @@ EXPORT_SYMBOL(netdev_notify_peers);
>  
>  static int napi_threaded_poll(void *data);
>  
> -static void napi_thread_start(struct napi_struct *n)
> +static int napi_kthread_create(struct napi_struct *n)
>  {
> - if (test_bit(NAPI_STATE_THREADED, &n->state) && !n->thread)
> - n->thread = kthread_create(napi_threaded_poll, n, "%s-%d",
> -n->dev->name, n->napi_id);
> + int err = 0;
> +
> + n->thread = kthread_create(napi_threaded_poll, n, "napi/%s-%d",
> +n->dev->name, n->napi_id);
> + if (IS_ERR(n->thread)) {
> + err = PTR_ERR(n->thread);
> + pr_err("kthread_create failed with err %d\n", err);
> + n->thread = NULL;
> + }
> +
> + return err;
If I remember correctly, using kthread_create with no explicit first
wakeup means the task will sit there and contribute to system loadavg
until it is woken up the first time.
Shouldn't we use kthread_run here instead?

- Felix


Re: [Bridge] [RFC PATCH net-next] bridge: Implement MLD Querier wake-up calls / Android bug workaround

2020-08-23 Thread Felix Fietkau
On 2020-08-23 17:42, Linus Lüssing wrote:
> On Sun, Aug 16, 2020 at 03:08:13PM -0700, Stephen Hemminger wrote:
>> Rather than adding yet another feature to the bridge, could this hack be 
>> done by
>> having a BPF hook? or netfilter module?
> 
> Hi Stephen,
> 
> Thanks for the constructive feedback and suggestions!
> 
> The netfilter approach sounds tempting. However as far as I know
> OpenWrt's firewall has no easy layer 2 netfilter integration yet.
> So it has default layer 3 netfilter rules, but not for layer 2.
> 
> Ideally I'd want to work towards a solution where things "just
> work as expected" when a user enables "IGMP Snooping" in the UI.
> I could hack the netfilter rules into netifd, the OpenWrt network
> manager, when it configures the bridge. But not sure if the
> OpenWrt maintainers would like that...
> 
> Any preferences from the OpenWrt maintainers side?
Enabling bridge netfilter comes with a very significant performance
cost, so it's not something that should be done in a default configuration.

- Felix


[PATCH v3 1/2] net: add support for threaded NAPI polling

2020-08-21 Thread Felix Fietkau
For some drivers (especially 802.11 drivers), doing a lot of work in the NAPI
poll function does not perform well. Since NAPI poll is bound to the CPU it
was scheduled from, we can easily end up with a few very busy CPUs spending
most of their time in softirq/ksoftirqd and some idle ones.

Introduce threaded NAPI for such drivers based on a workqueue. The API is the
same except for using netif_threaded_napi_add instead of netif_napi_add.
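
For a driver, opting in is a one-line change at NAPI registration time
(illustrative only, hypothetical driver names):

	netif_threaded_napi_add(netdev, &priv->napi, mydrv_napi_poll,
				NAPI_POLL_WEIGHT);

The second patch in this series additionally adds a sysfs attribute so
threaded NAPI can be toggled at runtime on drivers that did not opt in.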

In my tests with mt76 on MT7621 using threaded NAPI + a thread for tx scheduling
improves LAN->WLAN bridging throughput by 10-50%. Throughput without threaded
NAPI is wildly inconsistent, depending on the CPU that runs the tx scheduling
thread.

With threaded NAPI, throughput seems stable and consistent (and higher than
the best results I got without it).

Based on a patch by Hillf Danton

Cc: Hillf Danton 
Signed-off-by: Felix Fietkau 
---
Changes since PATCH v2:
- Split sysfs attribute into a separate patch
- Take RTNL on attribute show
- make napi_threaded attribute static

Changes since PATCH v1:
- use WQ_SYSFS to make workqueue configurable from user space
- cancel work in netif_napi_del
- add a sysfs file to enable/disable threaded NAPI for a netdev

Changes since RFC v2:
- fix unused but set variable reported by kbuild test robot

Changes since RFC:
- disable softirq around threaded poll functions
- reuse most parts of napi_poll()
- fix re-schedule condition

 include/linux/netdevice.h |  23 +++
 net/core/dev.c| 139 +++---
 2 files changed, 124 insertions(+), 38 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b0e303f6603f..69507e6d4dc8 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -347,6 +347,7 @@ struct napi_struct {
 	struct list_head	dev_list;
 	struct hlist_node	napi_hash_node;
 	unsigned int		napi_id;
+   struct work_struct  work;
 };
 
 enum {
@@ -357,6 +358,7 @@ enum {
NAPI_STATE_HASHED,  /* In NAPI hash (busy polling possible) */
NAPI_STATE_NO_BUSY_POLL,/* Do not add in napi_hash, no busy polling */
NAPI_STATE_IN_BUSY_POLL,/* sk_busy_loop() owns this NAPI */
+   NAPI_STATE_THREADED,/* Use threaded NAPI */
 };
 
 enum {
@@ -367,6 +369,7 @@ enum {
NAPIF_STATE_HASHED   = BIT(NAPI_STATE_HASHED),
NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
+   NAPIF_STATE_THREADED = BIT(NAPI_STATE_THREADED),
 };
 
 enum gro_result {
@@ -2327,6 +2330,26 @@ static inline void *netdev_priv(const struct net_device *dev)
 void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
int (*poll)(struct napi_struct *, int), int weight);
 
+/**
+ * netif_threaded_napi_add - initialize a NAPI context
+ * @dev:  network device
+ * @napi: NAPI context
+ * @poll: polling function
+ * @weight: default weight
+ *
+ * This variant of netif_napi_add() should be used from drivers using NAPI
+ * with CPU intensive poll functions.
+ * This will schedule polling from a high priority workqueue
+ */
+static inline void netif_threaded_napi_add(struct net_device *dev,
+					    struct napi_struct *napi,
+					    int (*poll)(struct napi_struct *, int),
+					    int weight)
+{
+   set_bit(NAPI_STATE_THREADED, &napi->state);
+   netif_napi_add(dev, napi, poll, weight);
+}
+
 /**
  * netif_tx_napi_add - initialize a NAPI context
  * @dev:  network device
diff --git a/net/core/dev.c b/net/core/dev.c
index b5d1129d8310..b6165309617c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -157,6 +157,7 @@ static DEFINE_SPINLOCK(offload_lock);
 struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
 struct list_head ptype_all __read_mostly;  /* Taps */
 static struct list_head offload_base __read_mostly;
+static struct workqueue_struct *napi_workq __read_mostly;
 
 static int netif_rx_internal(struct sk_buff *skb);
 static int call_netdevice_notifiers_info(unsigned long val,
@@ -6282,6 +6283,11 @@ void __napi_schedule(struct napi_struct *n)
 {
unsigned long flags;
 
+   if (test_bit(NAPI_STATE_THREADED, &n->state)) {
+   queue_work(napi_workq, &n->work);
+   return;
+   }
+
local_irq_save(flags);
napi_schedule(this_cpu_ptr(&softnet_data), n);
local_irq_restore(flags);
@@ -6329,6 +6335,11 @@ EXPORT_SYMBOL(napi_schedule_prep);
  */
 void __napi_schedule_irqoff(struct napi_struct *n)
 {
+   if (test_bit(NAPI_STATE_THREADED, &n->state)) {
+   queue_work(napi_workq, &n->work);
+   return;
+   }
+
napi_schedule(this_cpu_ptr(&softnet_data), n);
 }
 EXPORT_SYMBOL(__napi_schedule_irqoff);
@@

[PATCH v3 2/2] net: add sysfs attribute for enabling threaded NAPI

2020-08-21 Thread Felix Fietkau
This can be used to enable threaded NAPI on drivers that did not explicitly
request it.

Suggested-by: Eric Dumazet 
Signed-off-by: Felix Fietkau 
---
 net/core/net-sysfs.c | 47 
 1 file changed, 47 insertions(+)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index efec66fa78b7..d7f7df715b60 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -472,6 +472,52 @@ static ssize_t proto_down_store(struct device *dev,
 }
 NETDEVICE_SHOW_RW(proto_down, fmt_dec);
 
+static int change_napi_threaded(struct net_device *dev, unsigned long val)
+{
+   struct napi_struct *napi;
+
+   if (list_empty(&dev->napi_list))
+   return -EOPNOTSUPP;
+
+   list_for_each_entry(napi, &dev->napi_list, dev_list) {
+   if (val)
+   set_bit(NAPI_STATE_THREADED, &napi->state);
+   else
+   clear_bit(NAPI_STATE_THREADED, &napi->state);
+   }
+
+   return 0;
+}
+
+static ssize_t napi_threaded_store(struct device *dev,
+   struct device_attribute *attr,
+   const char *buf, size_t len)
+{
+   return netdev_store(dev, attr, buf, len, change_napi_threaded);
+}
+
+static ssize_t napi_threaded_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+   struct net_device *netdev = to_net_dev(dev);
+   struct napi_struct *napi;
+   bool enabled = false;
+
+   if (!rtnl_trylock())
+   return restart_syscall();
+
+   list_for_each_entry(napi, &netdev->napi_list, dev_list) {
+   if (test_bit(NAPI_STATE_THREADED, &napi->state))
+   enabled = true;
+   }
+
+   rtnl_unlock();
+
+   return sprintf(buf, fmt_dec, enabled);
+}
+static DEVICE_ATTR_RW(napi_threaded);
+
 static ssize_t phys_port_id_show(struct device *dev,
 struct device_attribute *attr, char *buf)
 {
@@ -564,6 +610,7 @@ static struct attribute *net_class_attrs[] __ro_after_init = {
&dev_attr_tx_queue_len.attr,
&dev_attr_gro_flush_timeout.attr,
&dev_attr_napi_defer_hard_irqs.attr,
+   &dev_attr_napi_threaded.attr,
&dev_attr_phys_port_id.attr,
&dev_attr_phys_port_name.attr,
&dev_attr_phys_switch_id.attr,
-- 
2.28.0



[PATCH v2] net: add support for threaded NAPI polling

2020-08-06 Thread Felix Fietkau
For some drivers (especially 802.11 drivers), doing a lot of work in the NAPI
poll function does not perform well. Since NAPI poll is bound to the CPU it
was scheduled from, we can easily end up with a few very busy CPUs spending
most of their time in softirq/ksoftirqd and some idle ones.

Introduce threaded NAPI for such drivers based on a workqueue. The API is the
same except for using netif_threaded_napi_add instead of netif_napi_add.

In my tests with mt76 on MT7621 using threaded NAPI + a thread for tx scheduling
improves LAN->WLAN bridging throughput by 10-50%. Throughput without threaded
NAPI is wildly inconsistent, depending on the CPU that runs the tx scheduling
thread.

With threaded NAPI, throughput seems stable and consistent (and higher than
the best results I got without it).

Based on a patch by Hillf Danton

Cc: Hillf Danton 
Signed-off-by: Felix Fietkau 
---
Changes since PATCH v1:
- use WQ_SYSFS to make workqueue configurable from user space
- cancel work in netif_napi_del
- add a sysfs file to enable/disable threaded NAPI for a netdev

Changes since RFC v2:
- fix unused but set variable reported by kbuild test robot

Changes since RFC:
- disable softirq around threaded poll functions
- reuse most parts of napi_poll()
- fix re-schedule condition

 include/linux/netdevice.h |  23 ++
 net/core/dev.c| 163 ++
 net/core/net-sysfs.c  |  42 ++
 3 files changed, 176 insertions(+), 52 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ac2cd3f49aba..3a39211c7598 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -347,6 +347,7 @@ struct napi_struct {
 	struct list_head	dev_list;
 	struct hlist_node	napi_hash_node;
 	unsigned int		napi_id;
+   struct work_struct  work;
 };
 
 enum {
@@ -357,6 +358,7 @@ enum {
NAPI_STATE_HASHED,  /* In NAPI hash (busy polling possible) */
NAPI_STATE_NO_BUSY_POLL,/* Do not add in napi_hash, no busy polling */
NAPI_STATE_IN_BUSY_POLL,/* sk_busy_loop() owns this NAPI */
+   NAPI_STATE_THREADED,/* Use threaded NAPI */
 };
 
 enum {
@@ -367,6 +369,7 @@ enum {
NAPIF_STATE_HASHED   = BIT(NAPI_STATE_HASHED),
NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
+   NAPIF_STATE_THREADED = BIT(NAPI_STATE_THREADED),
 };
 
 enum gro_result {
@@ -2315,6 +2318,26 @@ static inline void *netdev_priv(const struct net_device *dev)
 void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
int (*poll)(struct napi_struct *, int), int weight);
 
+/**
+ * netif_threaded_napi_add - initialize a NAPI context
+ * @dev:  network device
+ * @napi: NAPI context
+ * @poll: polling function
+ * @weight: default weight
+ *
+ * This variant of netif_napi_add() should be used from drivers using NAPI
+ * with CPU intensive poll functions.
+ * This will schedule polling from a high priority workqueue that
+ */
+static inline void netif_threaded_napi_add(struct net_device *dev,
+					    struct napi_struct *napi,
+					    int (*poll)(struct napi_struct *, int),
+					    int weight)
+{
+   set_bit(NAPI_STATE_THREADED, &napi->state);
+   netif_napi_add(dev, napi, poll, weight);
+}
+
 /**
  * netif_tx_napi_add - initialize a NAPI context
  * @dev:  network device
diff --git a/net/core/dev.c b/net/core/dev.c
index 19f1abc26fcd..4b0dbea68a09 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -158,6 +158,7 @@ static DEFINE_SPINLOCK(offload_lock);
 struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
 struct list_head ptype_all __read_mostly;  /* Taps */
 static struct list_head offload_base __read_mostly;
+static struct workqueue_struct *napi_workq __read_mostly;
 
 static int netif_rx_internal(struct sk_buff *skb);
 static int call_netdevice_notifiers_info(unsigned long val,
@@ -6286,6 +6287,11 @@ void __napi_schedule(struct napi_struct *n)
 {
unsigned long flags;
 
+   if (test_bit(NAPI_STATE_THREADED, &n->state)) {
+   queue_work(napi_workq, &n->work);
+   return;
+   }
+
local_irq_save(flags);
napi_schedule(this_cpu_ptr(&softnet_data), n);
local_irq_restore(flags);
@@ -6333,6 +6339,11 @@ EXPORT_SYMBOL(napi_schedule_prep);
  */
 void __napi_schedule_irqoff(struct napi_struct *n)
 {
+   if (test_bit(NAPI_STATE_THREADED, &n->state)) {
+   queue_work(napi_workq, &n->work);
+   return;
+   }
+
napi_schedule(this_cpu_ptr(&softnet_data), n);
 }
 EXPORT_SYMBOL(__napi_schedule_irqoff);
@@ -6601,6 +6612,95 @@ static void init_gro_hash(struct napi_struct *napi)
napi->gro

[PATCH] net: add support for threaded NAPI polling

2020-07-29 Thread Felix Fietkau
For some drivers (especially 802.11 drivers), doing a lot of work in the NAPI
poll function does not perform well. Since NAPI poll is bound to the CPU it
was scheduled from, we can easily end up with a few very busy CPUs spending
most of their time in softirq/ksoftirqd and some idle ones.

Introduce threaded NAPI for such drivers based on a workqueue. The API is the
same except for using netif_threaded_napi_add instead of netif_napi_add.

In my tests with mt76 on MT7621 using threaded NAPI + a thread for tx scheduling
improves LAN->WLAN bridging throughput by 10-50%. Throughput without threaded
NAPI is wildly inconsistent, depending on the CPU that runs the tx scheduling
thread.

With threaded NAPI, throughput seems stable and consistent (and higher than
the best results I got without it).

Based on a patch by Hillf Danton

Cc: Hillf Danton 
Signed-off-by: Felix Fietkau 
---
Changes since RFC v2:
- fix unused but set variable reported by kbuild test robot

Changes since RFC:
- disable softirq around threaded poll functions
- reuse most parts of napi_poll()
- fix re-schedule condition

 include/linux/netdevice.h |  23 ++
 net/core/dev.c| 162 ++
 2 files changed, 133 insertions(+), 52 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ac2cd3f49aba..3a39211c7598 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -347,6 +347,7 @@ struct napi_struct {
 	struct list_head	dev_list;
 	struct hlist_node	napi_hash_node;
 	unsigned int		napi_id;
+   struct work_struct  work;
 };
 
 enum {
@@ -357,6 +358,7 @@ enum {
NAPI_STATE_HASHED,  /* In NAPI hash (busy polling possible) */
NAPI_STATE_NO_BUSY_POLL,/* Do not add in napi_hash, no busy polling */
NAPI_STATE_IN_BUSY_POLL,/* sk_busy_loop() owns this NAPI */
+   NAPI_STATE_THREADED,/* Use threaded NAPI */
 };
 
 enum {
@@ -367,6 +369,7 @@ enum {
NAPIF_STATE_HASHED   = BIT(NAPI_STATE_HASHED),
NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
+   NAPIF_STATE_THREADED = BIT(NAPI_STATE_THREADED),
 };
 
 enum gro_result {
@@ -2315,6 +2318,26 @@ static inline void *netdev_priv(const struct net_device *dev)
 void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
int (*poll)(struct napi_struct *, int), int weight);
 
+/**
+ * netif_threaded_napi_add - initialize a NAPI context
+ * @dev:  network device
+ * @napi: NAPI context
+ * @poll: polling function
+ * @weight: default weight
+ *
+ * This variant of netif_napi_add() should be used from drivers using NAPI
+ * with CPU intensive poll functions.
+ * This will schedule polling from a high priority workqueue that
+ */
+static inline void netif_threaded_napi_add(struct net_device *dev,
+					    struct napi_struct *napi,
+					    int (*poll)(struct napi_struct *, int),
+					    int weight)
+{
+   set_bit(NAPI_STATE_THREADED, &napi->state);
+   netif_napi_add(dev, napi, poll, weight);
+}
+
 /**
  * netif_tx_napi_add - initialize a NAPI context
  * @dev:  network device
diff --git a/net/core/dev.c b/net/core/dev.c
index 19f1abc26fcd..11b027f3a2b9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -158,6 +158,7 @@ static DEFINE_SPINLOCK(offload_lock);
 struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
 struct list_head ptype_all __read_mostly;  /* Taps */
 static struct list_head offload_base __read_mostly;
+static struct workqueue_struct *napi_workq __read_mostly;
 
 static int netif_rx_internal(struct sk_buff *skb);
 static int call_netdevice_notifiers_info(unsigned long val,
@@ -6286,6 +6287,11 @@ void __napi_schedule(struct napi_struct *n)
 {
unsigned long flags;
 
+   if (test_bit(NAPI_STATE_THREADED, &n->state)) {
+   queue_work(napi_workq, &n->work);
+   return;
+   }
+
local_irq_save(flags);
napi_schedule(this_cpu_ptr(&softnet_data), n);
local_irq_restore(flags);
@@ -6333,6 +6339,11 @@ EXPORT_SYMBOL(napi_schedule_prep);
  */
 void __napi_schedule_irqoff(struct napi_struct *n)
 {
+   if (test_bit(NAPI_STATE_THREADED, &n->state)) {
+   queue_work(napi_workq, &n->work);
+   return;
+   }
+
napi_schedule(this_cpu_ptr(&softnet_data), n);
 }
 EXPORT_SYMBOL(__napi_schedule_irqoff);
@@ -6601,6 +6612,95 @@ static void init_gro_hash(struct napi_struct *napi)
napi->gro_bitmask = 0;
 }
 
+static int __napi_poll(struct napi_struct *n, bool *repoll)
+{
+   int work, weight;
+
+   weight = n->weight;
+
+   /* This NAPI_STATE_SCHED test is for avoiding a race
+

[RFC v2] net: add support for threaded NAPI polling

2020-07-27 Thread Felix Fietkau
For some drivers (especially 802.11 drivers), doing a lot of work in the NAPI
poll function does not perform well. Since NAPI poll is bound to the CPU it
was scheduled from, we can easily end up with a few very busy CPUs spending
most of their time in softirq/ksoftirqd and some idle ones.

Introduce threaded NAPI for such drivers based on a workqueue. The API is the
same except for using netif_threaded_napi_add instead of netif_napi_add.

In my tests with mt76 on MT7621 using threaded NAPI + a thread for tx scheduling
improves LAN->WLAN bridging throughput by 10-50%. Throughput without threaded
NAPI is wildly inconsistent, depending on the CPU that runs the tx scheduling
thread.

With threaded NAPI, throughput seems stable and consistent (and higher than
the best results I got without it).

Based on a patch by Hillf Danton

Cc: Hillf Danton 
Signed-off-by: Felix Fietkau 
---
Changes since RFC:
- disable softirq around threaded poll functions
- reuse most parts of napi_poll()
- fix re-schedule condition

 include/linux/netdevice.h |  23 ++
 net/core/dev.c| 163 ++
 2 files changed, 134 insertions(+), 52 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ac2cd3f49aba..3a39211c7598 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -347,6 +347,7 @@ struct napi_struct {
 	struct list_head	dev_list;
 	struct hlist_node	napi_hash_node;
 	unsigned int		napi_id;
+   struct work_struct  work;
 };
 
 enum {
@@ -357,6 +358,7 @@ enum {
NAPI_STATE_HASHED,  /* In NAPI hash (busy polling possible) */
NAPI_STATE_NO_BUSY_POLL,/* Do not add in napi_hash, no busy polling */
NAPI_STATE_IN_BUSY_POLL,/* sk_busy_loop() owns this NAPI */
+   NAPI_STATE_THREADED,/* Use threaded NAPI */
 };
 
 enum {
@@ -367,6 +369,7 @@ enum {
NAPIF_STATE_HASHED   = BIT(NAPI_STATE_HASHED),
NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
+   NAPIF_STATE_THREADED = BIT(NAPI_STATE_THREADED),
 };
 
 enum gro_result {
@@ -2315,6 +2318,26 @@ static inline void *netdev_priv(const struct net_device *dev)
 void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
int (*poll)(struct napi_struct *, int), int weight);
 
+/**
+ * netif_threaded_napi_add - initialize a NAPI context
+ * @dev:  network device
+ * @napi: NAPI context
+ * @poll: polling function
+ * @weight: default weight
+ *
+ * This variant of netif_napi_add() should be used from drivers using NAPI
+ * with CPU intensive poll functions.
+ * This will schedule polling from a high priority workqueue that
+ */
+static inline void netif_threaded_napi_add(struct net_device *dev,
+					    struct napi_struct *napi,
+					    int (*poll)(struct napi_struct *, int),
+					    int weight)
+{
+   set_bit(NAPI_STATE_THREADED, &napi->state);
+   netif_napi_add(dev, napi, poll, weight);
+}
+
 /**
  * netif_tx_napi_add - initialize a NAPI context
  * @dev:  network device
diff --git a/net/core/dev.c b/net/core/dev.c
index 19f1abc26fcd..cdb599135592 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -158,6 +158,7 @@ static DEFINE_SPINLOCK(offload_lock);
 struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
 struct list_head ptype_all __read_mostly;  /* Taps */
 static struct list_head offload_base __read_mostly;
+static struct workqueue_struct *napi_workq __read_mostly;
 
 static int netif_rx_internal(struct sk_buff *skb);
 static int call_netdevice_notifiers_info(unsigned long val,
@@ -6286,6 +6287,11 @@ void __napi_schedule(struct napi_struct *n)
 {
unsigned long flags;
 
+   if (test_bit(NAPI_STATE_THREADED, &n->state)) {
+   queue_work(napi_workq, &n->work);
+   return;
+   }
+
local_irq_save(flags);
napi_schedule(this_cpu_ptr(&softnet_data), n);
local_irq_restore(flags);
@@ -6333,6 +6339,11 @@ EXPORT_SYMBOL(napi_schedule_prep);
  */
 void __napi_schedule_irqoff(struct napi_struct *n)
 {
+   if (test_bit(NAPI_STATE_THREADED, &n->state)) {
+   queue_work(napi_workq, &n->work);
+   return;
+   }
+
napi_schedule(this_cpu_ptr(&softnet_data), n);
 }
 EXPORT_SYMBOL(__napi_schedule_irqoff);
@@ -6601,6 +6612,96 @@ static void init_gro_hash(struct napi_struct *napi)
napi->gro_bitmask = 0;
 }
 
+static int __napi_poll(struct napi_struct *n, bool *repoll)
+{
+   int work, weight;
+
+   weight = n->weight;
+
+   /* This NAPI_STATE_SCHED test is for avoiding a race
+* with netpoll's poll_napi().  Only the entity which
+* obtains the lock and se

Re: [RFC] net: add support for threaded NAPI polling

2020-07-26 Thread Felix Fietkau
On 2020-07-26 19:58, Eric Dumazet wrote:
> 
> 
> On 7/26/20 10:19 AM, Felix Fietkau wrote:
>> On 2020-07-26 18:49, Eric Dumazet wrote:
>>> On 7/26/20 9:31 AM, Felix Fietkau wrote:
>>>> For some drivers (especially 802.11 drivers), doing a lot of work in the 
>>>> NAPI
>>>> poll function does not perform well. Since NAPI poll is bound to the CPU it
>>>> was scheduled from, we can easily end up with a few very busy CPUs spending
>>>> most of their time in softirq/ksoftirqd and some idle ones.
>>>>
>>>> Introduce threaded NAPI for such drivers based on a workqueue. The API is 
>>>> the
>>>> same except for using netif_threaded_napi_add instead of netif_napi_add.
>>>>
>>>> In my tests with mt76 on MT7621 using threaded NAPI + a thread for tx 
>>>> scheduling
>>>> improves LAN->WLAN bridging throughput by 10-50%. Throughput without 
>>>> threaded
>>>> NAPI is wildly inconsistent, depending on the CPU that runs the tx 
>>>> scheduling
>>>> thread.
>>>>
>>>> With threaded NAPI, throughput seems stable and consistent (and higher than
>>>> the best results I got without it).
>>>
>>> Note that even with a threaded NAPI, you will not be able to use more than 
>>> one cpu
>>> to process the traffic.
>> For a single threaded NAPI user that's correct. The main difference here
>> is that the CPU running the poll function does not have to be the same
>> as the CPU that scheduled it, and it can change based on the load.
>> That makes a huge difference in my tests.
> 
> This really looks like there is a problem in the driver itself.
> 
> Have you first checked that this patch was not hurting your use case ?
> 
> commit 4cd13c21b207e80ddb1144c576500098f2d5f882 ("softirq: Let ksoftirqd do 
> its job")
> 
> If yes, your proposal would again possibly hurt user space threads competing
> with a high priority workqueue, and packets would not be consumed by user 
> applications.
> Having cpus burning 100% of cycles in kernel space is useless.
I already checked that a while back, and this patch is not hurting my
use case. I think it is actually helping, since I'm putting on enough
load to keep most softirq activity in ksoftirqd.

One thing to consider about my use case is that I'm bridging traffic
between an Ethernet interface and an 802.11 interface. Those packets do
not go through user space. If I push enough traffic, the ksoftirqd
instance running NAPI of the 802.11 device keeps the CPU 100% busy. I do
not have any significant user space activity during my tests.

Since tx and rx NAPI are scheduled from the same IRQ which only fires on
one CPU, they end up in the same ksoftirqd instance.

Considering that one CPU is not enough to handle entire NAPI workload
for the device, threaded NAPI helps by letting other (otherwise mostly
idle) CPUs pick up some of the workload.

> It seems you need two cpus per queue, I guess this might be because
> you use a single NAPI for both tx and rx ?
> 
> Have you simply tried to use two NAPI, as some Ethernet drivers do ?
I'm already doing that.

> Do not get me wrong, but scheduling a thread only to process one packet at a 
> time
> will hurt common cases.
> 
> Really I do not mind if you add a threaded NAPI, but it seems you missed
> a lot of NAPI requirements in the proposed patch.
> 
> For instance, many ->poll() handlers assume BH are disabled.
> 
> Also part of RPS logic depends on net_rx_action() calling 
> net_rps_action_and_irq_enable()
I will look into that, thanks.

- Felix


Re: [RFC] net: add support for threaded NAPI polling

2020-07-26 Thread Felix Fietkau
On 2020-07-26 18:49, Eric Dumazet wrote:
> On 7/26/20 9:31 AM, Felix Fietkau wrote:
>> For some drivers (especially 802.11 drivers), doing a lot of work in the NAPI
>> poll function does not perform well. Since NAPI poll is bound to the CPU it
>> was scheduled from, we can easily end up with a few very busy CPUs spending
>> most of their time in softirq/ksoftirqd and some idle ones.
>> 
>> Introduce threaded NAPI for such drivers based on a workqueue. The API is the
>> same except for using netif_threaded_napi_add instead of netif_napi_add.
>> 
>> In my tests with mt76 on MT7621 using threaded NAPI + a thread for tx 
>> scheduling
>> improves LAN->WLAN bridging throughput by 10-50%. Throughput without threaded
>> NAPI is wildly inconsistent, depending on the CPU that runs the tx scheduling
>> thread.
>> 
>> With threaded NAPI, throughput seems stable and consistent (and higher than
>> the best results I got without it).
> 
> Note that even with a threaded NAPI, you will not be able to use more than 
> one cpu
> to process the traffic.
For a single threaded NAPI user that's correct. The main difference here
is that the CPU running the poll function does not have to be the same
as the CPU that scheduled it, and it can change based on the load.
That makes a huge difference in my tests.

> Also I wonder how this will scale to more than one device using this ?
The workqueue creates multiple workers that pick up poll work, so it
should scale nicely.

> Say we need 4 NAPI, how the different work queues will mix together ?
> 
> We invented years ago RPS and RFS, to be able to spread incoming traffic
> to more cpus, for devices having one hardware queue.
Unfortunately that does not work well at all for my use case (802.11
drivers). A really large chunk of the work (e.g. 802.11 -> 802.3 header
conversion, state checks, etc.) is being done inside the poll function,
before it even goes anywhere near the network stack and RPS/RFS.

I did a lot of experiments trying to parallelize the work by tuning RFS,
IRQ affinity, etc. on MT7621. I didn't get anything close to the
consistent performance I get by adding threaded NAPI to mt76 along with
moving some other CPU intensive work from tasklets to threads.

- Felix


[RFC] net: add support for threaded NAPI polling

2020-07-26 Thread Felix Fietkau
For some drivers (especially 802.11 drivers), doing a lot of work in the NAPI
poll function does not perform well. Since NAPI poll is bound to the CPU it
was scheduled from, we can easily end up with a few very busy CPUs spending
most of their time in softirq/ksoftirqd and some idle ones.

Introduce threaded NAPI for such drivers based on a workqueue. The API is the
same except for using netif_threaded_napi_add instead of netif_napi_add.

In my tests with mt76 on MT7621 using threaded NAPI + a thread for tx scheduling
improves LAN->WLAN bridging throughput by 10-50%. Throughput without threaded
NAPI is wildly inconsistent, depending on the CPU that runs the tx scheduling
thread.

With threaded NAPI, throughput seems stable and consistent (and higher than
the best results I got without it).

Based on a patch by Hillf Danton

Cc: Hillf Danton 
Signed-off-by: Felix Fietkau 
---
 include/linux/netdevice.h | 23 ++
 net/core/dev.c| 40 +++
 2 files changed, 63 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ac2cd3f49aba..3a39211c7598 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -347,6 +347,7 @@ struct napi_struct {
struct list_headdev_list;
struct hlist_node   napi_hash_node;
unsigned intnapi_id;
+   struct work_struct  work;
 };
 
 enum {
@@ -357,6 +358,7 @@ enum {
NAPI_STATE_HASHED,  /* In NAPI hash (busy polling possible) */
NAPI_STATE_NO_BUSY_POLL,/* Do not add in napi_hash, no busy polling */
NAPI_STATE_IN_BUSY_POLL,/* sk_busy_loop() owns this NAPI */
+   NAPI_STATE_THREADED,/* Use threaded NAPI */
 };
 
 enum {
@@ -367,6 +369,7 @@ enum {
NAPIF_STATE_HASHED   = BIT(NAPI_STATE_HASHED),
NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
+   NAPIF_STATE_THREADED = BIT(NAPI_STATE_THREADED),
 };
 
 enum gro_result {
@@ -2315,6 +2318,26 @@ static inline void *netdev_priv(const struct net_device *dev)
 void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
int (*poll)(struct napi_struct *, int), int weight);
 
+/**
+ * netif_threaded_napi_add - initialize a NAPI context
+ * @dev:  network device
+ * @napi: NAPI context
+ * @poll: polling function
+ * @weight: default weight
+ *
+ * This variant of netif_napi_add() should be used from drivers using NAPI
+ * with CPU intensive poll functions.
+ * This will schedule polling from a high priority workqueue that
+ */
+static inline void netif_threaded_napi_add(struct net_device *dev,
+  struct napi_struct *napi,
+  int (*poll)(struct napi_struct *, int),
+  int weight)
+{
+   set_bit(NAPI_STATE_THREADED, &napi->state);
+   netif_napi_add(dev, napi, poll, weight);
+}
+
 /**
  * netif_tx_napi_add - initialize a NAPI context
  * @dev:  network device
diff --git a/net/core/dev.c b/net/core/dev.c
index 19f1abc26fcd..e140b6a9d5eb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -158,6 +158,7 @@ static DEFINE_SPINLOCK(offload_lock);
 struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
 struct list_head ptype_all __read_mostly;  /* Taps */
 static struct list_head offload_base __read_mostly;
+static struct workqueue_struct *napi_workq __read_mostly;
 
 static int netif_rx_internal(struct sk_buff *skb);
 static int call_netdevice_notifiers_info(unsigned long val,
@@ -6286,6 +6287,11 @@ void __napi_schedule(struct napi_struct *n)
 {
unsigned long flags;
 
+   if (test_bit(NAPI_STATE_THREADED, &n->state)) {
+   queue_work(napi_workq, &n->work);
+   return;
+   }
+
local_irq_save(flags);
napi_schedule(this_cpu_ptr(&softnet_data), n);
local_irq_restore(flags);
@@ -6333,6 +6339,11 @@ EXPORT_SYMBOL(napi_schedule_prep);
  */
 void __napi_schedule_irqoff(struct napi_struct *n)
 {
+   if (test_bit(NAPI_STATE_THREADED, &n->state)) {
+   queue_work(napi_workq, &n->work);
+   return;
+   }
+
napi_schedule(this_cpu_ptr(&softnet_data), n);
 }
 EXPORT_SYMBOL(__napi_schedule_irqoff);
@@ -6601,6 +6612,30 @@ static void init_gro_hash(struct napi_struct *napi)
napi->gro_bitmask = 0;
 }
 
+static void napi_workfn(struct work_struct *work)
+{
+   struct napi_struct *n = container_of(work, struct napi_struct, work);
+
+   for (;;) {
+   if (!test_bit(NAPI_STATE_SCHED, &n->state))
+   return;
+
+   if (n->poll(n, n->weight) < n->weight)
+   return;
+
+   if (!need_resched())
+   continue;

Re: [RFC 0/7] Add support to process rx packets in thread

2020-07-26 Thread Felix Fietkau
On 2020-07-26 10:32, Hillf Danton wrote:
> 
> On Sun, 26 Jul 2020 10:10:15 +0200 Felix Fietkau wrote:
>> On 2020-07-26 03:22, Hillf Danton wrote:
>> > 
>> > Feel free to do that. Is it likely for me to select a Cc?
>> > 
>> Shall I use Signed-off-by: Hillf Danton ?
> 
> s/Signed-off-by/Cc/
> 
>> What Cc do you want me to add?
> 
> I prefer Cc over other tags.
Ah, okay. I was planning on adding you as the author of the patch, since
you did most of the work on it. If you want to be attributed as author
(or in Co-developed-by), I'd need your Signed-off-by.

If you don't want that, I can submit it under my name and leave you in
as Cc only.

- Felix


Re: [RFC 0/7] Add support to process rx packets in thread

2020-07-26 Thread Felix Fietkau
On 2020-07-26 03:22, Hillf Danton wrote:
>> - add a state bit for threaded NAPI
>> - make netif_threaded_napi_add inline
>> - run queue_work outside of local_irq_save/restore (it does that
>> internally already)
>>
>> If you don't mind, I'd like to propose this to netdev soon. Can I have
>> your Signed-off-by for that?
> 
> Feel free to do that. Is it likely for me to select a Cc?
Shall I use Signed-off-by: Hillf Danton ?
What Cc do you want me to add?

>> ---
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -347,6 +347,7 @@ struct napi_struct {
>>  struct list_headdev_list;
>>  struct hlist_node   napi_hash_node;
>>  unsigned intnapi_id;
>> +struct work_struct  work;
>>  };
>>  
>>  enum {
>> @@ -357,6 +358,7 @@ enum {
>>  NAPI_STATE_HASHED,  /* In NAPI hash (busy polling possible) */
>>  NAPI_STATE_NO_BUSY_POLL,/* Do not add in napi_hash, no busy polling */
>>  NAPI_STATE_IN_BUSY_POLL,/* sk_busy_loop() owns this NAPI */
>> +NAPI_STATE_THREADED,/* Use threaded NAPI */
>>  };
>>  
>>  enum {
>> @@ -367,6 +369,7 @@ enum {
>>  NAPIF_STATE_HASHED   = BIT(NAPI_STATE_HASHED),
>>  NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
>>  NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
>> +NAPIF_STATE_THREADED = BIT(NAPI_STATE_THREADED),
>>  };
>>  
>>  enum gro_result {
>> @@ -2315,6 +2318,26 @@ static inline void *netdev_priv(const struct 
>> net_device *dev)
>>  void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
>>  int (*poll)(struct napi_struct *, int), int weight);
>>  
>> +/**
>> + *  netif_threaded_napi_add - initialize a NAPI context
>> + *  @dev:  network device
>> + *  @napi: NAPI context
>> + *  @poll: polling function
>> + *  @weight: default weight
>> + *
>> + * This variant of netif_napi_add() should be used from drivers using NAPI
>> + * with CPU intensive poll functions.
>> + * This will schedule polling from a high priority workqueue that
>> + */
>> +static inline void netif_threaded_napi_add(struct net_device *dev,
>> +   struct napi_struct *napi,
>> +   int (*poll)(struct napi_struct *, 
>> int),
>> +   int weight)
>> +{
>> +set_bit(NAPI_STATE_THREADED, &napi->state);
>> +netif_napi_add(dev, napi, poll, weight);
>> +}
>> +
>>  /**
>>   *  netif_tx_napi_add - initialize a NAPI context
>>   *  @dev:  network device
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -158,6 +158,7 @@ static DEFINE_SPINLOCK(offload_lock);
>>  struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
>>  struct list_head ptype_all __read_mostly;   /* Taps */
>>  static struct list_head offload_base __read_mostly;
>> +static struct workqueue_struct *napi_workq;
> 
> Is __read_mostly missing?
Yes, thanks. I will add that before sending the patch.

- Felix


Re: [RFC 0/7] Add support to process rx packets in thread

2020-07-25 Thread Felix Fietkau
On 2020-07-25 10:16, Hillf Danton wrote:
> Hi folks
> 
> Below is a minimunm poc implementation I can imagine on top of workqueue
> to make napi threaded. Thoughts are appreciated.
Hi Hillf,

For some reason I don't see your mails on linux-wireless/netdev.
I've cleaned up your implementation a bit and I ran some tests with mt76
on an mt7621 embedded board. The results look pretty nice, performance
is a lot more consistent in my tests now.

Here are the changes that I've made compared to your version:

- remove the #ifdef, I think it's unnecessary
- add a state bit for threaded NAPI
- make netif_threaded_napi_add inline
- run queue_work outside of local_irq_save/restore (it does that
internally already)

If you don't mind, I'd like to propose this to netdev soon. Can I have
your Signed-off-by for that?

Thanks,

- Felix

---
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -347,6 +347,7 @@ struct napi_struct {
struct list_headdev_list;
struct hlist_node   napi_hash_node;
unsigned intnapi_id;
+   struct work_struct  work;
 };
 
 enum {
@@ -357,6 +358,7 @@ enum {
NAPI_STATE_HASHED,  /* In NAPI hash (busy polling possible) */
NAPI_STATE_NO_BUSY_POLL,/* Do not add in napi_hash, no busy polling */
NAPI_STATE_IN_BUSY_POLL,/* sk_busy_loop() owns this NAPI */
+   NAPI_STATE_THREADED,/* Use threaded NAPI */
 };
 
 enum {
@@ -367,6 +369,7 @@ enum {
NAPIF_STATE_HASHED   = BIT(NAPI_STATE_HASHED),
NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
+   NAPIF_STATE_THREADED = BIT(NAPI_STATE_THREADED),
 };
 
 enum gro_result {
@@ -2315,6 +2318,26 @@ static inline void *netdev_priv(const struct net_device *dev)
 void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
int (*poll)(struct napi_struct *, int), int weight);
 
+/**
+ * netif_threaded_napi_add - initialize a NAPI context
+ * @dev:  network device
+ * @napi: NAPI context
+ * @poll: polling function
+ * @weight: default weight
+ *
+ * This variant of netif_napi_add() should be used from drivers using NAPI
+ * with CPU intensive poll functions.
+ * This will schedule polling from a high priority workqueue that
+ */
+static inline void netif_threaded_napi_add(struct net_device *dev,
+  struct napi_struct *napi,
+  int (*poll)(struct napi_struct *, int),
+  int weight)
+{
+   set_bit(NAPI_STATE_THREADED, &napi->state);
+   netif_napi_add(dev, napi, poll, weight);
+}
+
 /**
  * netif_tx_napi_add - initialize a NAPI context
  * @dev:  network device
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -158,6 +158,7 @@ static DEFINE_SPINLOCK(offload_lock);
 struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
 struct list_head ptype_all __read_mostly;  /* Taps */
 static struct list_head offload_base __read_mostly;
+static struct workqueue_struct *napi_workq;
 
 static int netif_rx_internal(struct sk_buff *skb);
 static int call_netdevice_notifiers_info(unsigned long val,
@@ -6286,6 +6287,11 @@ void __napi_schedule(struct napi_struct *n)
 {
unsigned long flags;
 
+   if (test_bit(NAPI_STATE_THREADED, &n->state)) {
+   queue_work(napi_workq, &n->work);
+   return;
+   }
+
local_irq_save(flags);
napi_schedule(this_cpu_ptr(&softnet_data), n);
local_irq_restore(flags);
@@ -6333,6 +6339,11 @@ EXPORT_SYMBOL(napi_schedule_prep);
  */
 void __napi_schedule_irqoff(struct napi_struct *n)
 {
+   if (test_bit(NAPI_STATE_THREADED, &n->state)) {
+   queue_work(napi_workq, &n->work);
+   return;
+   }
+
napi_schedule(this_cpu_ptr(&softnet_data), n);
 }
 EXPORT_SYMBOL(__napi_schedule_irqoff);
@@ -6601,6 +6612,29 @@ static void init_gro_hash(struct napi_struct *napi)
napi->gro_bitmask = 0;
 }
 
+static void napi_workfn(struct work_struct *work)
+{
+   struct napi_struct *n = container_of(work, struct napi_struct, work);
+
+   for (;;) {
+   if (!test_bit(NAPI_STATE_SCHED, &n->state))
+   return;
+
+   if (n->poll(n, n->weight) < n->weight)
+   return;
+
+   if (need_resched()) {
+   /*
+* have to pay for the latency of task switch even if
+* napi is scheduled
+*/
+   if (test_bit(NAPI_STATE_SCHED, &n->state))
+   queue_work(napi_workq, work);
+   return;
+   }
+   }
+}
+
 void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
int (*poll)(struct napi_struct *, int), int weight)
 {
@@ -6
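
(The final hunk is truncated by the archive. What it presumably adds, wiring
the work item up when the NAPI context is registered, could look roughly like
this; a sketch under that assumption, not the missing hunk itself.)

/* Sketch; assumes the napi_workfn and struct napi_struct::work from above. */
static void napi_threaded_init(struct napi_struct *napi)
{
        /* would be called from netif_napi_add() */
        INIT_WORK(&napi->work, napi_workfn);
}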

Re: [RFC 2/7] ath10k: Add support to process rx packet in thread

2020-07-22 Thread Felix Fietkau
On 2020-07-22 14:55, Johannes Berg wrote:
> On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:
> 
>> I'm considering testing a different approach (with mt76 initially):
>> - Add a mac80211 rx function that puts processed skbs into a list
>> instead of handing them to the network stack directly.
> 
> Would this be *after* all the mac80211 processing, i.e. in place of the
> rx-up-to-stack?
Yes, it would run all the rx handlers normally and then put the
resulting skbs into a list instead of calling netif_receive_skb or
napi_gro_frags.

- Felix


Re: [RFC 2/7] ath10k: Add support to process rx packet in thread

2020-07-22 Thread Felix Fietkau
On 2020-07-21 23:53, Rajkumar Manoharan wrote:
> On 2020-07-21 10:14, Rakesh Pillai wrote:
>> NAPI instance gets scheduled on a CPU core on which
>> the IRQ was triggered. The processing of rx packets
>> can be CPU intensive and since NAPI cannot be moved
>> to a different CPU core, to get better performance,
>> its better to move the gist of rx packet processing
>> in a high priority thread.
>> 
>> Add the init/deinit part for a thread to process the
>> receive packets.
>> 
> IMHO this defeat the whole purpose of NAPI. Originally in ath10k
> irq processing happened in tasklet (high priority) context which in
> turn push more data to net core even though net is unable to process
> driver data as both happen in different context (fast producer - slow
> consumer) issue. Why can't CPU governor schedule the interrupts in less
> loaded CPU core? Otherwise you can play with different RPS and affinity
> settings to meet your requirement.
> 
> IMO introducing high priority tasklets/threads is not viable solution.
I'm beginning to think that the main problem with NAPI here is that the
work done by poll functions on 802.11 drivers is significantly more CPU
intensive compared to ethernet drivers, possibly more than what NAPI was
designed for.

I'm considering testing a different approach (with mt76 initially):
- Add a mac80211 rx function that puts processed skbs into a list
instead of handing them to the network stack directly.
- Move all rx processing to a high priority thread, keep a driver
internal queue for fully processed packets.
- Schedule NAPI poll on completion.
- NAPI poll function pulls from the internal queue and passes to the
network stack.

With this approach, the network stack retains some control over the
processing rate of rx packets, while the scheduler can move the CPU
intensive processing around to where it fits best.

What do you think?

- Felix
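
(A rough sketch of the split described above; all names are made up for
illustration and this is not mt76 code. The heavy per-frame work happens in a
thread that queues finished skbs, while the NAPI poll function only drains
that queue, so the network stack still paces delivery.)

/* Sketch; relies on <linux/netdevice.h> and <linux/skbuff.h>.
 * rx->processed must have been set up with skb_queue_head_init(). */
struct drv_rx {
        struct sk_buff_head processed;  /* fully processed frames */
        struct napi_struct napi;
};

/* runs in the high priority rx thread, per fully processed frame */
static void drv_rx_thread_deliver(struct drv_rx *rx, struct sk_buff *skb)
{
        skb_queue_tail(&rx->processed, skb);
        napi_schedule(&rx->napi);
}

/* NAPI poll: hand packets to the stack at the stack's pace */
static int drv_rx_poll(struct napi_struct *napi, int budget)
{
        struct drv_rx *rx = container_of(napi, struct drv_rx, napi);
        struct sk_buff *skb;
        int done = 0;

        while (done < budget && (skb = skb_dequeue(&rx->processed)) != NULL) {
                napi_gro_receive(napi, skb);
                done++;
        }

        if (done < budget)
                napi_complete_done(napi, done);

        return done;
}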


Re: [PATCH] MAINTAINERS: remove Felix Fietkau for the Mediatek ethernet driver

2020-06-21 Thread Felix Fietkau
On 2020-06-21 16:44, Russell King - ARM Linux admin wrote:
> On Thu, Feb 20, 2020 at 09:54:44AM +0100, Felix Fietkau wrote:
>> On 2020-02-18 21:00, Jakub Kicinski wrote:
>> > On Tue, 18 Feb 2020 10:40:01 + Russell King wrote:
>> >> Felix's address has been failing for a while now with the following
>> >> non-delivery report:
>> >> 
>> >> This message was created automatically by mail delivery software.
>> >> 
>> >> A message that you sent could not be delivered to one or more of its
>> >> recipients. This is a permanent error. The following address(es) failed:
>> >> 
>> >>   n...@openwrt.org
>> >> host util-01.infra.openwrt.org [2a03:b0c0:3:d0::175a:2001]
>> >> SMTP error from remote mail server after RCPT TO::
>> >> 550 Unrouteable address
>> >> 
>> >> Let's remove his address from MAINTAINERS.  If a different resolution
>> >> is desired, please submit an alternative patch.
>> >> 
>> >> Signed-off-by: Russell King 
>> >> ---
>> >>  MAINTAINERS | 1 -
>> >>  1 file changed, 1 deletion(-)
>> >> 
>> >> diff --git a/MAINTAINERS b/MAINTAINERS
>> >> index a0d86490c2c6..82dccd29b24f 100644
>> >> --- a/MAINTAINERS
>> >> +++ b/MAINTAINERS
>> >> @@ -10528,7 +10528,6 @@ F:drivers/leds/leds-mt6323.c
>> >>  F:   Documentation/devicetree/bindings/leds/leds-mt6323.txt
>> >>  
>> >>  MEDIATEK ETHERNET DRIVER
>> >> -M:   Felix Fietkau 
>> >>  M:   John Crispin 
>> >>  M:   Sean Wang 
>> >>  M:   Mark Lee 
>> > 
>> > Let's CC Felix, I think he's using n...@nbd.name these days.
>> Yes, my address should simply be changed to n...@nbd.name.
>> 
>> Thanks,
> 
> And _still_, in next-next, four months on, your MAINTAINERS entry is
> still incorrect.
> 
> Can someone please merge my patch so I'm not confronted by bounces
> due to this incorrect entry?  I really don't see why I should be
> the one to provide a patch to correct Felix's address - that's for
> Felix himself to do, especially as he has already been made aware
> of the issue.
I forgot to take care of this, sorry about that. I just sent a patch to
netdev to update the address.

- Felix


[PATCH] MAINTAINERS: update email address for Felix Fietkau

2020-06-21 Thread Felix Fietkau
The old address has been bouncing for a while now

Signed-off-by: Felix Fietkau 
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 68f21d46614c..fc0c1bc24ba0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10808,7 +10808,7 @@ F:  Documentation/devicetree/bindings/dma/mtk-*
 F: drivers/dma/mediatek/
 
 MEDIATEK ETHERNET DRIVER
-M: Felix Fietkau 
+M: Felix Fietkau 
 M: John Crispin 
 M: Sean Wang 
 M: Mark Lee 
-- 
2.24.0



Re: [PATCH][next] mac80211: minstrel_ht: fix infinite loop because supported is not being shifted

2019-08-23 Thread Felix Fietkau
On 2019-08-22 14:20, Colin King wrote:
> From: Colin Ian King 
> 
> Currently the for-loop will spin forever if variable supported is
> non-zero because supported is never changed.  Fix this by adding in
> the missing right shift of supported.
> 
> Addresses-Coverity: ("Infinite loop")
> Fixes: 48cb39522a9d ("mac80211: minstrel_ht: improve rate probing for devices 
> with static fallback")
> Signed-off-by: Colin Ian King 
Acked-by: Felix Fietkau 

Thanks,

- Felix


Re: [PATCH] mt76: mt7615: Make mt7615_irq_handler static

2019-05-11 Thread Felix Fietkau
On 2019-05-04 12:22, YueHaibing wrote:
> Fix sparse warning:
> 
> drivers/net/wireless/mediatek/mt76/mt7615/pci.c:37:13:
>  warning: symbol 'mt7615_irq_handler' was not declared. Should it be static?
> 
> Signed-off-by: YueHaibing 
Applied, thanks.

- Felix


Re: [PATCH][next] mt76: fix less than zero check on a u8 variable

2019-05-11 Thread Felix Fietkau
On 2019-05-05 23:31, Colin King wrote:
> From: Colin Ian King 
> 
> The signed return from the call to get_omac_idx is being assigned to the
> u8 variable mac_idx and then checked for a negative error condition
> which is always going to be false. Fix this by assigning the return to
> the int variable ret and checking this instead.
> 
> Addresses-Coverity: ("Unsigned compared against 0")
> Fixes: 04b8e65922f6 ("mt76: add mac80211 driver for MT7615 PCIe-based 
> chipsets")
> Signed-off-by: Colin Ian King 
Already fixed by a similar patch by Dan Carpenter.

- Felix


Request for net merge into net-next

2019-04-12 Thread Felix Fietkau
Hi Dave,

I have a whole bunch of pending mt76 changes for -next which depend on
these two commits:

commit 2b4a66980217332d91ab1785e1750857d6d52bc8
mac80211: make ieee80211_schedule_txq schedule empty TXQs

commit 5b989c18dab2e82bac8a5564a174794bf84b20e6
mac80211: rework locking for txq scheduling / airtime fairness

Could you please merge net into net-next so that Kalle can take them
into wireless-drivers-next?

Thanks,

- Felix


Re: NAT performance regression caused by vlan GRO support

2019-04-05 Thread Felix Fietkau
On 2019-04-05 09:11, Rafał Miłecki wrote:
> On 05.04.2019 07:48, Rafał Miłecki wrote:
>> On 05.04.2019 06:26, Toshiaki Makita wrote:
>>> My test results:
>>>
>>> Receiving packets from eth0.10, forwarding them to eth0.20 and applying
>>> MASQUERADE on eth0.20, using i40e 25G NIC on kernel 4.20.13.
>>> Disabled rxvlan by ethtool -K to exercise vlan_gro_receive().
>>> Measured TCP throughput by netperf.
>>>
>>> GRO on : 17 Gbps
>>> GRO off:  5 Gbps
>>>
>>> So I failed to reproduce your problem.
>> 
>> :( Thanks for trying & checking that!
>> 
>> 
>>> Would you check the CPU usage by "mpstat -P ALL" or similar (like "sar
>>> -u ALL -P ALL") to check if the traffic is able to consume 100% CPU on
>>> your machine?
>> 
>> 1) ethtool -K eth0 gro on + iperf running (577 Mb/s)
>> root@OpenWrt:/# mpstat -P ALL 10 3
>> Linux 5.1.0-rc3+ (OpenWrt)  03/27/19    _armv7l_    (2 CPU)
>> 
>> 16:33:40 CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  
>> %guest   %idle
>> 16:33:50 all    0.00    0.00    0.00    0.00    0.00   58.79    0.00    
>> 0.00   41.21
>> 16:33:50   0    0.00    0.00    0.00    0.00    0.00  100.00    0.00    
>> 0.00    0.00
>> 16:33:50   1    0.00    0.00    0.00    0.00    0.00   17.58    0.00    
>> 0.00   82.42
>> 
>> 16:33:50 CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  
>> %guest   %idle
>> 16:34:00 all    0.00    0.00    0.05    0.00    0.00   59.44    0.00    
>> 0.00   40.51
>> 16:34:00   0    0.00    0.00    0.10    0.00    0.00   99.90    0.00    
>> 0.00    0.00
>> 16:34:00   1    0.00    0.00    0.00    0.00    0.00   18.98    0.00    
>> 0.00   81.02
>> 
>> 16:34:00 CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  
>> %guest   %idle
>> 16:34:10 all    0.00    0.00    0.00    0.00    0.00   59.59    0.00    
>> 0.00   40.41
>> 16:34:10   0    0.00    0.00    0.00    0.00    0.00  100.00    0.00    
>> 0.00    0.00
>> 16:34:10   1    0.00    0.00    0.00    0.00    0.00   19.18    0.00    
>> 0.00   80.82
>> 
>> Average: CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  
>> %guest   %idle
>> Average: all    0.00    0.00    0.02    0.00    0.00   59.27    0.00    
>> 0.00   40.71
>> Average:   0    0.00    0.00    0.03    0.00    0.00   99.97    0.00    
>> 0.00    0.00
>> Average:   1    0.00    0.00    0.00    0.00    0.00   18.58    0.00    
>> 0.00   81.42
>> 
>> 
>> 2) ethtool -K eth0 gro off + iperf running (941 Mb/s)
>> root@OpenWrt:/# mpstat -P ALL 10 3
>> Linux 5.1.0-rc3+ (OpenWrt)  03/27/19    _armv7l_    (2 CPU)
>> 
>> 16:34:39 CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  
>> %guest   %idle
>> 16:34:49 all    0.00    0.00    0.05    0.00    0.00   86.91    0.00    
>> 0.00   13.04
>> 16:34:49   0    0.00    0.00    0.10    0.00    0.00   78.22    0.00    
>> 0.00   21.68
>> 16:34:49   1    0.00    0.00    0.00    0.00    0.00   95.60    0.00    
>> 0.00    4.40
>> 
>> 16:34:49 CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  
>> %guest   %idle
>> 16:34:59 all    0.00    0.00    0.10    0.00    0.00   87.06    0.00    
>> 0.00   12.84
>> 16:34:59   0    0.00    0.00    0.20    0.00    0.00   79.72    0.00    
>> 0.00   20.08
>> 16:34:59   1    0.00    0.00    0.00    0.00    0.00   94.41    0.00    
>> 0.00    5.59
>> 
>> 16:34:59 CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  
>> %guest   %idle
>> 16:35:09 all    0.00    0.00    0.05    0.00    0.00   85.71    0.00    
>> 0.00   14.24
>> 16:35:09   0    0.00    0.00    0.10    0.00    0.00   79.42    0.00    
>> 0.00   20.48
>> 16:35:09   1    0.00    0.00    0.00    0.00    0.00   92.01    0.00    
>> 0.00    7.99
>> 
>> Average: CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  
>> %guest   %idle
>> Average: all    0.00    0.00    0.07    0.00    0.00   86.56    0.00    
>> 0.00   13.37
>> Average:   0    0.00    0.00    0.13    0.00    0.00   79.12    0.00    
>> 0.00   20.75
>> Average:   1    0.00    0.00    0.00    0.00    0.00   94.01    0.00    
>> 0.00    5.99
>> 
>> 
>> 3) System idle (no iperf)
>> root@OpenWrt:/# mpstat -P ALL 10 1
>> Linux 5.1.0-rc3+ (OpenWrt)  03/27/19    _armv7l_    (2 CPU)
>> 
>> 16:35:31 CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  
>> %guest   %idle
>> 16:35:41 all    0.00    0.00    0.00    0.00    0.00    0.00    0.00    
>> 0.00  100.00
>> 16:35:41   0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    
>> 0.00  100.00
>> 16:35:41   1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    
>> 0.00  100.00
>> 
>> Average: CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  
>> %guest   %idle
>> Average: all    0.00    0.00    0.00    0.00    0.00    0.00    0.00    
>> 0.00  100.00
>> Average:   0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    
>> 0.00  10

Re: [PATCH v2] net: use bulk free in kfree_skb_list

2019-03-25 Thread Felix Fietkau
On 2019-03-25 10:31, Eric Dumazet wrote:
> 
> 
> On 03/25/2019 02:09 AM, Felix Fietkau wrote:
>> On 2019-03-25 09:51, Eric Dumazet wrote:
>>>
>>>
>>> On 03/24/2019 09:56 AM, Felix Fietkau wrote:
>>>> Since we're freeing multiple skbs, we might as well use bulk free to save a
>>>> few cycles. Use the same conditions for bulk free as in napi_consume_skb.
>>>>
>>>
>>> I do not believe kfree_skb_list() is used in the fast path, so do we really
>>> need to make it so complex ?
>> mac80211 uses it to free the fraglist from A-MSDU aggregated packets in
>> the tx status path. That's one fast path where it gets used right now
>> and the reason it was showing up in my perf traces.
> 
> This is not drop monitor friendly then
> 
> TX completion should use consume_skb() or dev_kfree_skb()
> 
> BTW, I wonder what drop-monitor signal bulk free is sending ?
Good point about the drop monitor. Would you prefer that I replace this
patch with one that adds a consume_skb_list function and another one
that makes mac80211 use it?

- Felix


[PATCH v3] net: use bulk free in kfree_skb_list

2019-03-25 Thread Felix Fietkau
Since we're freeing multiple skbs, we might as well use bulk free to save a
few cycles. Use the same conditions for bulk free as in napi_consume_skb.

Signed-off-by: Felix Fietkau 
---
v3: reorder checks to prevent skb double unref
v2: call kmem_cache_free_bulk once the skb array is full instead of
falling back to kfree_skb
 net/core/skbuff.c | 40 
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2415d9cb9b89..ca0308485669 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -666,12 +666,44 @@ EXPORT_SYMBOL(kfree_skb);
 
 void kfree_skb_list(struct sk_buff *segs)
 {
-   while (segs) {
-   struct sk_buff *next = segs->next;
+   struct sk_buff *next = segs;
+   void *skbs[16];
+   int n_skbs = 0;
 
-   kfree_skb(segs);
-   segs = next;
+   while ((segs = next) != NULL) {
+   next = segs->next;
+
+   if (segs->fclone != SKB_FCLONE_UNAVAILABLE) {
+   kfree_skb(segs);
+   continue;
+   }
+
+   if (!skb_unref(segs))
+   continue;
+
+   trace_kfree_skb(segs, __builtin_return_address(0));
+
+   /* drop skb->head and call any destructors for packet */
+   skb_release_all(segs);
+
+#ifdef CONFIG_SLUB
+   /* SLUB writes into objects when freeing */
+   prefetchw(segs);
+#endif
+
+   skbs[n_skbs++] = segs;
+
+   if (n_skbs < ARRAY_SIZE(skbs))
+   continue;
+
+   kmem_cache_free_bulk(skbuff_head_cache, n_skbs, skbs);
+   n_skbs = 0;
}
+
+   if (!n_skbs)
+   return;
+
+   kmem_cache_free_bulk(skbuff_head_cache, n_skbs, skbs);
 }
 EXPORT_SYMBOL(kfree_skb_list);
 
-- 
2.17.0



Re: [PATCH v2] net: use bulk free in kfree_skb_list

2019-03-25 Thread Felix Fietkau
On 2019-03-25 09:51, Eric Dumazet wrote:
> 
> 
> On 03/24/2019 09:56 AM, Felix Fietkau wrote:
>> Since we're freeing multiple skbs, we might as well use bulk free to save a
>> few cycles. Use the same conditions for bulk free as in napi_consume_skb.
>> 
> 
> I do not believe kfree_skb_list() is used in the fast path, so do we really
> need to make it so complex ?
mac80211 uses it to free the fraglist from A-MSDU aggregated packets in
the tx status path. That's one fast path where it gets used right now
and the reason it was showing up in my perf traces.

- Felix


[PATCH v2] net: use bulk free in kfree_skb_list

2019-03-24 Thread Felix Fietkau
Since we're freeing multiple skbs, we might as well use bulk free to save a
few cycles. Use the same conditions for bulk free as in napi_consume_skb.

Signed-off-by: Felix Fietkau 
---
v2: call kmem_cache_free_bulk once the skb array is full instead of
falling back to kfree_skb
 net/core/skbuff.c | 40 
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2415d9cb9b89..1eeaa264d2a4 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -666,12 +666,44 @@ EXPORT_SYMBOL(kfree_skb);
 
 void kfree_skb_list(struct sk_buff *segs)
 {
-   while (segs) {
-   struct sk_buff *next = segs->next;
+   struct sk_buff *next = segs;
+   void *skbs[16];
+   int n_skbs = 0;
 
-   kfree_skb(segs);
-   segs = next;
+   while ((segs = next) != NULL) {
+   next = segs->next;
+
+   if (!skb_unref(segs))
+   continue;
+
+   if (segs->fclone != SKB_FCLONE_UNAVAILABLE) {
+   kfree_skb(segs);
+   continue;
+   }
+
+   trace_kfree_skb(segs, __builtin_return_address(0));
+
+   /* drop skb->head and call any destructors for packet */
+   skb_release_all(segs);
+
+#ifdef CONFIG_SLUB
+   /* SLUB writes into objects when freeing */
+   prefetchw(segs);
+#endif
+
+   skbs[n_skbs++] = segs;
+
+   if (n_skbs < ARRAY_SIZE(skbs))
+   continue;
+
+   kmem_cache_free_bulk(skbuff_head_cache, n_skbs, skbs);
+   n_skbs = 0;
}
+
+   if (!n_skbs)
+   return;
+
+   kmem_cache_free_bulk(skbuff_head_cache, n_skbs, skbs);
 }
 EXPORT_SYMBOL(kfree_skb_list);
 
-- 
2.17.0



Re: [PATCH net-next] net: use bulk free in kfree_skb_list

2019-03-24 Thread Felix Fietkau
On 2019-03-24 12:31, Jesper Dangaard Brouer wrote:
> On Sun, 24 Mar 2019 07:58:34 +0100
> Felix Fietkau  wrote:
> 
>> Since we're freeing multiple skbs, we might as well use bulk free to save a
>> few cycles. Use the same conditions for bulk free as in napi_consume_skb.
>> 
>> Signed-off-by: Felix Fietkau 
> 
> Thanks for working on this, it's been on my todo list for a very long
> time. I just discussed this with Florian at NetDevconf.
No problem. It was showing up on my perf traces while improving
mac80211/mt76 performance, so I decided to deal with it now :)

>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 2415d9cb9b89..ec030ab7f1e7 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -666,12 +666,39 @@ EXPORT_SYMBOL(kfree_skb);
>>  
>>  void kfree_skb_list(struct sk_buff *segs)
>>  {
>> -while (segs) {
>> -struct sk_buff *next = segs->next;
>> +struct sk_buff *next = segs;
>> +void *skbs[16];
>> +int n_skbs = 0;
>>  
>> -kfree_skb(segs);
>> -segs = next;
>> +while ((segs = next) != NULL) {
>> +next = segs->next;
>> +
>> +if (!skb_unref(segs))
>> +continue;
>> +
>> +if (segs->fclone != SKB_FCLONE_UNAVAILABLE ||
>> +n_skbs >= ARRAY_SIZE(skbs)) {
> 
> You could call kmem_cache_free_bulk() here and reset n_skbs=0.
Sure, good idea. I'll send v2 shortly.

- Felix


[PATCH net-next] net: use bulk free in kfree_skb_list

2019-03-23 Thread Felix Fietkau
Since we're freeing multiple skbs, we might as well use bulk free to save a
few cycles. Use the same conditions for bulk free as in napi_consume_skb.

Signed-off-by: Felix Fietkau 
---
 net/core/skbuff.c | 35 +++
 1 file changed, 31 insertions(+), 4 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2415d9cb9b89..ec030ab7f1e7 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -666,12 +666,39 @@ EXPORT_SYMBOL(kfree_skb);
 
 void kfree_skb_list(struct sk_buff *segs)
 {
-   while (segs) {
-   struct sk_buff *next = segs->next;
+   struct sk_buff *next = segs;
+   void *skbs[16];
+   int n_skbs = 0;
 
-   kfree_skb(segs);
-   segs = next;
+   while ((segs = next) != NULL) {
+   next = segs->next;
+
+   if (!skb_unref(segs))
+   continue;
+
+   if (segs->fclone != SKB_FCLONE_UNAVAILABLE ||
+   n_skbs >= ARRAY_SIZE(skbs)) {
+   kfree_skb(segs);
+   continue;
+   }
+
+   trace_kfree_skb(segs, __builtin_return_address(0));
+
+   /* drop skb->head and call any destructors for packet */
+   skb_release_all(segs);
+
+#ifdef CONFIG_SLUB
+   /* SLUB writes into objects when freeing */
+   prefetchw(segs);
+#endif
+
+   skbs[n_skbs++] = segs;
}
+
+   if (!n_skbs)
+   return;
+
+   kmem_cache_free_bulk(skbuff_head_cache, n_skbs, skbs);
 }
 EXPORT_SYMBOL(kfree_skb_list);
 
-- 
2.17.0



Re: [mt76/mt7603/mac] Question about missing variable assignment

2019-03-03 Thread Felix Fietkau
On 2019-03-02 22:10, Gustavo A. R. Silva wrote:
> Hi all,
> 
> The following piece of code in drivers/net/wireless/mediatek/mt76/mt7603/mac.c
> is missing a variable assignment before line 1058.  Notice that there
> is a potential execution path in which variable *i* is compared against
> magic number 15 at line 1075 without being initialized previously
> (this was reported by Coverity):
> 
> 1055 out:
> 1056 final_rate_flags = info->status.rates[final_idx].flags;
> 1057 
> 1058 switch (FIELD_GET(MT_TX_RATE_MODE, final_rate)) {
> 1059 case MT_PHY_TYPE_CCK:
> 1060 cck = true;
> 1061 /* fall through */
> 1062 case MT_PHY_TYPE_OFDM:
> 1063 if (dev->mt76.chandef.chan->band == NL80211_BAND_5GHZ)
> 1064 sband = &dev->mt76.sband_5g.sband;
> 1065 else
> 1066 sband = &dev->mt76.sband_2g.sband;
> 1067 final_rate &= GENMASK(5, 0);
> 1068 final_rate = mt7603_get_rate(dev, sband, final_rate, 
> cck);
> 1069 final_rate_flags = 0;
> 1070 break;
> 1071 case MT_PHY_TYPE_HT_GF:
> 1072 case MT_PHY_TYPE_HT:
> 1073 final_rate_flags |= IEEE80211_TX_RC_MCS;
> 1074 final_rate &= GENMASK(5, 0);
> 1075 if (i > 15)
> 1076 return false;
> 1077 break;
> 1078 default:
> 1079 return false;
> 1080 }
> 
> My guess is that such missing assignment should be something similar
> to the one at line 566:
> 
>   i = FIELD_GET(MT_RXV1_TX_RATE, rxdg0);
> 
> but I'm not sure what the proper arguments for macro FIELD_GET should
> be.
> 
> This code was introduced by commit c8846e1015022d2531ac4c895783e400b3e5babe
> 
> What do you think?
Thanks for reporting this. The fix is simpler than that, the check
should be: if (final_rate > 15)
I will send a fix.

- Felix


Re: [PATCH] mt76: change the retun type of mt76_dma_attach()

2019-02-11 Thread Felix Fietkau
On 2019-02-11 03:13, Ryder Lee wrote:
> There is no need to retun 0 in mt76_dma_attach(), so switch it to void.
> 
> Signed-off-by: Ryder Lee 
Applied, thanks.

- Felix


Re: [PATCH] mt76: convert to DEFINE_SHOW_ATTRIBUTE

2019-01-11 Thread Felix Fietkau
On 2018-12-03 14:40, Yangtao Li wrote:
> Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code.
> 
> Signed-off-by: Yangtao Li 
Please rebase this patch onto the mt76 branch of
https://github.com/nbd168/wireless

Thanks,

- Felix


Re: [PATCH] mt76: make const array 'data' static, shrinks object size

2018-12-31 Thread Felix Fietkau
On 2018-12-30 14:26, Colin King wrote:
> From: Colin Ian King 
> 
> Don't populate the const array 'data' on the stack but instead
> make it static. Makes the object code smaller by 78 bytes:
> 
> Before:
>    text    data     bss     dec     hex filename
>    5438    1080       0    6518    1976 mediatek/mt76/mt76x2/usb_mcu.o
> 
> After:
>    text    data     bss     dec     hex filename
>    5296    1144       0    6440    1928 mediatek/mt76/mt76x2/usb_mcu.o
> 
> (gcc version 8.2.0 x86_64)
> 
> Signed-off-by: Colin Ian King 
Applied, thanks.

- Felix


Re: [PATCH 00/12] Ethernet: Add and use ether__addr globals

2018-04-05 Thread Felix Fietkau
On 2018-04-05 15:51, Joe Perches wrote:
>>  You have to factor in
>> not just the .text size, but the fact that referencing an exported
>> symbol needs a .reloc entry as well, which also eats up some space (at
>> least when the code is being built as module).
> 
> Thanks, the modules I built got smaller.
Please post some numbers to show this. By the way, on other
architectures the numbers will probably be different, especially on
ARM/MIPS.
>> In my opinion, your series probably causes more bloat in common
>> configurations instead of reducing it.
>> 
>> You're also touching several places that could easily use
>> eth_broadcast_addr and eth_zero_addr. I think making those changes would
>> be more productive than what you did in this series.
> 
> Doubtful. AFAIK: possible unaligned addresses.
Those two are just memset calls, alignment does not matter.

- Felix
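
(For reference, the two helpers mentioned are plain memset wrappers,
essentially the following from include/linux/etherdevice.h:)

static inline void eth_broadcast_addr(u8 *addr)
{
        memset(addr, 0xff, ETH_ALEN);
}

static inline void eth_zero_addr(u8 *addr)
{
        memset(addr, 0x00, ETH_ALEN);
}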


Re: [PATCH 00/12] Ethernet: Add and use ether__addr globals

2018-04-05 Thread Felix Fietkau
On 2018-03-31 09:05, Joe Perches wrote:
> There are many local static and non-static arrays that are used for
> Ethernet broadcast address output or comparison.
> 
> Centralize the array into a single separate file and remove the local
> arrays.
I suspect that for many targets and configurations, the local arrays
might actually be smaller than exporting a global. You have to factor in
not just the .text size, but the fact that referencing an exported
symbol needs a .reloc entry as well, which also eats up some space (at
least when the code is being built as module).

In my opinion, your series probably causes more bloat in common
configurations instead of reducing it.

You're also touching several places that could easily use
eth_broadcast_addr and eth_zero_addr. I think making those changes would
be more productive than what you did in this series.

- Felix


Re: Problem with bridge (mcast-to-ucast + hairpin) and Broadcom's 802.11f in their FullMAC fw

2018-03-13 Thread Felix Fietkau
[resent with fixed typo in linux-wireless address]

On 2018-02-27 11:08, Rafał Miłecki wrote:
> I've problem when using OpenWrt/LEDE on a home router with Broadcom's
> FullMAC WiFi chipset.
> 
> 
> First of all OpenWrt/LEDE uses bridge interface for LAN network with:
> 1) IFLA_BRPORT_MCAST_TO_UCAST
> 2) Clients isolation in hostapd
> 3) Hairpin mode enabled
> 
> For more details please see Linus's patch description:
> https://patchwork.kernel.org/patch/9530669/
> and maybe hairpin mode patch:
> https://lwn.net/Articles/347344/
> 
> Short version: in that setup packets received from a bridged wireless
> interface can be handled back to it for transmission.
> 
> 
> Now, Broadcom's firmware for their FullMAC chipsets in AP mode
> supports an obsoleted 802.11f AKA IAPP standard. It's a roaming
> standard that was replaced by 802.11r.
> 
> Whenever a new station associates, firmware generates a packet like:
> ff ff ff ff  ff ff ec 10  7b 5f ?? ??  00 06 00 01  af 81 01 00
> (just masked 2 bytes of my MAC)
> 
> For mode details you can see discussion in my brcmfmac patch thread:
> https://patchwork.kernel.org/patch/10191451/
> 
> 
> The problem is that bridge (in setup as above) handles such a packet
> back to the device.
> 
> That makes Broadcom's FullMAC firmware believe that a given station
> just connected to another AP in a network (which doesn't even exist).
> As a result firmware immediately disassociates that station. It's
> simply impossible to connect to the router. Every association is
> followed by immediate disassociation.
> 
> 
> Can you see any solution for this problem? Is that an option to stop
> multicast-to-unicast from touching 802.11f packets? Some other ideas?
> Obviously I can't modify Broadcom's firmware and drop that obsoleted
> standard.
Let's look at it from a different angle: Since these packets are
forwarded as normal packets by the bridge, and the Broadcom firmware
reacts to them in this nasty way, that's basically a local DoS security
issue. In my opinion that matters a lot more than having support for an
obsolete feature that almost nobody will ever want to use.

I think the right approach to deal with this issue is to drop these
garbage packets in both the receive and transmit path of brcmfmac.

- Felix
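
(A sketch of how such a filter could recognize the frame, based only on the
packet dump quoted above; this is illustrative and not the actual brcmfmac
change. The ADD-notify frame is a broadcast 802.3 frame with a length field
of 6 and a fixed 6-byte payload.)

/* Sketch; relies on <linux/etherdevice.h>. Assumes skb->data points at the
 * Ethernet header, as it does in the transmit path. */
static bool skb_is_iapp_add_notify(const struct sk_buff *skb)
{
        static const u8 iapp_payload[6] = { 0x00, 0x01, 0xaf, 0x81, 0x01, 0x00 };
        const struct ethhdr *eth;

        if (skb->len < ETH_HLEN + sizeof(iapp_payload))
                return false;

        eth = (const struct ethhdr *)skb->data;
        if (!is_broadcast_ether_addr(eth->h_dest) ||
            eth->h_proto != htons(sizeof(iapp_payload)))        /* 802.3 length: 6 */
                return false;

        return !memcmp(skb->data + ETH_HLEN, iapp_payload, sizeof(iapp_payload));
}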


Re: Problem with bridge (mcast-to-ucast + hairpin) and Broadcom's 802.11f in their FullMAC fw

2018-03-13 Thread Felix Fietkau
On 2018-02-27 11:08, Rafał Miłecki wrote:
> I've problem when using OpenWrt/LEDE on a home router with Broadcom's
> FullMAC WiFi chipset.
> 
> 
> First of all OpenWrt/LEDE uses bridge interface for LAN network with:
> 1) IFLA_BRPORT_MCAST_TO_UCAST
> 2) Clients isolation in hostapd
> 3) Hairpin mode enabled
> 
> For more details please see Linus's patch description:
> https://patchwork.kernel.org/patch/9530669/
> and maybe hairpin mode patch:
> https://lwn.net/Articles/347344/
> 
> Short version: in that setup packets received from a bridged wireless
> interface can be handled back to it for transmission.
> 
> 
> Now, Broadcom's firmware for their FullMAC chipsets in AP mode
> supports an obsoleted 802.11f AKA IAPP standard. It's a roaming
> standard that was replaced by 802.11r.
> 
> Whenever a new station associates, firmware generates a packet like:
> ff ff ff ff  ff ff ec 10  7b 5f ?? ??  00 06 00 01  af 81 01 00
> (just masked 2 bytes of my MAC)
> 
> For mode details you can see discussion in my brcmfmac patch thread:
> https://patchwork.kernel.org/patch/10191451/
> 
> 
> The problem is that bridge (in setup as above) handles such a packet
> back to the device.
> 
> That makes Broadcom's FullMAC firmware believe that a given station
> just connected to another AP in a network (which doesn't even exist).
> As a result firmware immediately disassociates that station. It's
> simply impossible to connect to the router. Every association is
> followed by immediate disassociation.
> 
> 
> Can you see any solution for this problem? Is that an option to stop
> multicast-to-unicast from touching 802.11f packets? Some other ideas?
> Obviously I can't modify Broadcom's firmware and drop that obsoleted
> standard.
Let's look at it from a different angle: Since these packets are
forwarded as normal packets by the bridge, and the Broadcom firmware
reacts to them in this nasty way, that's basically a local DoS security
issue. In my opinion that matters a lot more than having support for an
obsolete feature that almost nobody will ever want to use.

I think the right approach to deal with this issue is to drop these
garbage packets in both the receive and transmit path of brcmfmac.

- Felix


Re: [PATCH 00/30] Netfilter/IPVS updates for net-next

2018-03-12 Thread Felix Fietkau
On 2018-03-12 21:01, David Miller wrote:
> From: Felix Fietkau 
> Date: Mon, 12 Mar 2018 20:30:01 +0100
> 
>> It's not dead and useless. In its current state, it has a software fast
>> path that significantly improves nftables routing/NAT throughput,
>> especially on embedded devices.
>> On some devices, I've seen "only" 20% throughput improvement (along with
>> CPU usage reduction), on others it's quite a bit lot more. This is
>> without any extra drivers or patches aside from what's posted.
> 
> I wonder if this software fast path has the exploitability problems that
> things like the ipv4 routing cache and the per-cpu flow cache both had.
> And the reason for which both were removed.
> 
> I don't see how you can avoid this problem.
> 
> I'm willing to be shown otherwise :-)
I don't think it suffers from the same issues, and if it does, it's a
lot easier to mitigate. The ruleset can easily be configured to only
offload connections that transferred a certain amount of data, handling
only bulk flows.

It's easy to put an upper limit on the number of offloaded connections,
and there's nothing in the code that just creates an offload entry per
packet or per lookup or something like that.

If you have other concerns, I'm sure we can address them with follow-up
patches, but as it stands, I think the code is already quite useful.

- Felix


Re: [PATCH 00/30] Netfilter/IPVS updates for net-next

2018-03-12 Thread Felix Fietkau
On 2018-03-12 19:58, David Miller wrote:
> From: Pablo Neira Ayuso 
> Date: Mon, 12 Mar 2018 18:58:50 +0100
> 
>> The following patchset contains Netfilter/IPVS updates for your net-next
>> tree. This batch comes with more input sanitization for xtables to
>> address bug reports from fuzzers, preparation works to the flowtable
>> infrastructure and assorted updates. In no particular order, they are:
> 
> Sorry, I've seen enough.  I'm not pulling this.
> 
> What is the story with this flow table stuff?  I tried to ask you
> about this before, but the response I was given was extremely vague
> and did not answer my question at all.
> 
> This is a lot of code, and a lot of infrastructure, yet I see
> no device using the infrastructure to offload conntack.
> 
> Nor can I see how this can possibly be even useful for such an
> application.  What conntrack offload needs are things completely
> outside of what the flow table stuff provides.  Mainly, they
> require that the SKB is completely abstracted away from all of
> the contrack code paths, and that the conntrack infrastructure
> operates on an abstract packet metadata concept.
> 
> If you are targetting one specific piece of hardware with TCAMs
> that you are familiar with.  I'd like you to stop right there.
> Because if that is all that this infrastructure can actually
> be used for, it is definitely designed wrong.
> 
> This, as has been the case in the past, is what is wrong with
> netfilter approach to supporting offloading.  We see all of this
> infrastructure before an actual working use case is provided for a
> specific piece of hardware for a specific driver in the tree.
> 
> Nobody can evaluate whether the approach is good or not without
> a clear driver change implementing support for it.
> 
> No other area of networking puts the cart before the horse like this.
> 
> I do not agree at all with the flow table infrastructure and I
> therefore do not want to pull any more flow table changes into my tree
> until there is an actual user of this stuff in that pull request which
> actually works in a way which is useful for people.  It is completely
> dead and useless code currently.
It's not dead and useless. In its current state, it has a software fast
path that significantly improves nftables routing/NAT throughput,
especially on embedded devices.
On some devices, I've seen "only" 20% throughput improvement (along with
CPU usage reduction), on others it's quite a bit more. This is
without any extra drivers or patches aside from what's posted.

Within OpenWrt, I'm working on a patch that makes the same available to
legacy netfilter as well. This is the reason for a lot of the core
refactoring that I did.

Hardware offload is still being worked on, not sure when we will have
the first driver ready. But as it stands now, the code is already very
useful and backported to OpenWrt for testing.

I think that in a couple of weeks this code will be ready to be enabled
by default in OpenWrt, which means that a lot of users' setups will get
a lot faster with no configuration change at all.

- Felix


Re: [PATCH RFC] net_sched/codel: do not defer queue length update

2018-03-10 Thread Felix Fietkau
On 2017-08-21 10:14, Konstantin Khlebnikov wrote:
> When codel wants to drop last packet in ->dequeue() it cannot call
> qdisc_tree_reduce_backlog() right away - it will notify parent qdisc
> about zero qlen and HTB/HFSC will deactivate class. The same class will
> be deactivated second time by caller of ->dequeue(). Currently codel and
> fq_codel defer update. This triggers warning in HFSC when it's qlen != 0
> but there is no active classes.
> 
> This patch update parent queue length immediately: just temporary increase
> qlen around qdisc_tree_reduce_backlog() to prevent first class deactivation
> if we have skb to return.
> 
> This might open another problem in HFSC - now operation peek could fail and
> deactivate parent class.
> 
> Signed-off-by: Konstantin Khlebnikov 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=109581
This has been tested by several people and it seems to be working well.
Please make sure this gets merged soon.

Thanks,

- Felix


Re: [PATCH] ath9k: turn on btcoex_enable as default

2018-02-10 Thread Felix Fietkau
On 2018-02-10 14:56, Kai Heng Feng wrote:
> 
>> On 9 Feb 2018, at 3:16 PM, Kalle Valo  wrote:
>> Sure, but we have to make sure that we don't create regressions on
>> existing systems. For example, did you test this with any system which
>> don't support btcoex? (just asking, haven't tested this myself)
> 
> No not really, but I will definitely test it.
> The only module I have that uses ath9k is Dell’s DW1707.
> How do I check if it support btcoex or not?
I just reviewed the code again, and I am sure that we cannot merge this
patch. Enabling the btcoex parameter makes the driver enable a whole
bunch of code that starts timers, listens to some GPIOs, etc.

On non-btcoex systems, some of those GPIOs might be floating or even
connected to different things, which could cause a lot of undefined
behavior.

This is simply too big a risk, so there absolutely needs to be a
whitelist for systems that need this, otherwise it has to remain
disabled by default.

- Felix
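
(One possible shape for such a whitelist, purely illustrative; the board entry
below is a made-up placeholder. Platforms known to need btcoex would be
matched, e.g. via DMI on x86 systems, and only there would the default change.)

/* Sketch; relies on <linux/dmi.h>. */
static const struct dmi_system_id ath9k_btcoex_whitelist[] = {
        {
                .matches = {
                        DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
                        DMI_MATCH(DMI_PRODUCT_NAME, "Example-Platform"),       /* placeholder */
                },
        },
        { }
};

static bool ath9k_btcoex_default_on(void)
{
        return dmi_check_system(ath9k_btcoex_whitelist);
}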


Re: [PATCH] ath9k: turn on btcoex_enable as default

2018-02-08 Thread Felix Fietkau
On 2018-02-08 06:28, Kai-Heng Feng wrote:
> Without btcoex_enable, WiFi activies make both WiFi and Bluetooth
> unstable if there's a bluetooth connection.
> 
> Enable this option when bt_ant_diversity is disabled.
> 
> BugLink: https://bugs.launchpad.net/bugs/1746164
> Signed-off-by: Kai-Heng Feng 
I think this might cause regressions on devices that don't have
bluetooth. This probably either needs more EEPROM checks, or something
to selectively enable it only on affected platforms.

- Felix


[PATCH 2/2] netfilter: nf_tables: fix flowtable resource leak

2018-02-05 Thread Felix Fietkau
When cleaning up flowtable entries, associated dst and ct entries need
to be released as well

Signed-off-by: Felix Fietkau 
---
 net/netfilter/nf_flow_table.c | 19 +--
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/net/netfilter/nf_flow_table.c b/net/netfilter/nf_flow_table.c
index 20f86091ab98..9925bd3f93c4 100644
--- a/net/netfilter/nf_flow_table.c
+++ b/net/netfilter/nf_flow_table.c
@@ -124,7 +124,7 @@ void flow_offload_free(struct flow_offload *flow)
dst_release(flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.dst_cache);
dst_release(flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.dst_cache);
e = container_of(flow, struct flow_offload_entry, flow);
-   kfree(e);
+   kfree_rcu(e, rcu_head);
 }
 EXPORT_SYMBOL_GPL(flow_offload_free);
 
@@ -161,7 +161,9 @@ void flow_offload_del(struct nf_flowtable *flow_table,
   *flow_table->type->params);
 
e = container_of(flow, struct flow_offload_entry, flow);
-   kfree_rcu(e, rcu_head);
+   nf_ct_delete(e->ct, 0, 0);
+   nf_ct_put(e->ct);
+   flow_offload_free(flow);
 }
 EXPORT_SYMBOL_GPL(flow_offload_del);
 
@@ -174,15 +176,6 @@ flow_offload_lookup(struct nf_flowtable *flow_table,
 }
 EXPORT_SYMBOL_GPL(flow_offload_lookup);
 
-static void nf_flow_release_ct(const struct flow_offload *flow)
-{
-   struct flow_offload_entry *e;
-
-   e = container_of(flow, struct flow_offload_entry, flow);
-   nf_ct_delete(e->ct, 0, 0);
-   nf_ct_put(e->ct);
-}
-
 int nf_flow_table_iterate(struct nf_flowtable *flow_table,
  void (*iter)(struct flow_offload *flow, void *data),
  void *data)
@@ -276,10 +269,8 @@ void nf_flow_offload_work_gc(struct work_struct *work)
flow = container_of(tuplehash, struct flow_offload, tuplehash[0]);
 
if (nf_flow_has_expired(flow) ||
-   nf_flow_is_dying(flow)) {
+   nf_flow_is_dying(flow))
flow_offload_del(flow_table, flow);
-   nf_flow_release_ct(flow);
-   }
}
 out:
rhashtable_walk_stop(&hti);
-- 
2.14.2



[PATCH 1/2] netfilter: nf_tables: fix flowtable free

2018-02-05 Thread Felix Fietkau
Every flow_offload entry is added into the table twice. Because of this,
rhashtable_free_and_destroy can't be used, since it would call kfree for
each flow_offload object twice.

Signed-off-by: Felix Fietkau 
---
 include/net/netfilter/nf_flow_table.h   |  2 ++
 net/ipv4/netfilter/nf_flow_table_ipv4.c |  1 +
 net/ipv6/netfilter/nf_flow_table_ipv6.c |  1 +
 net/netfilter/nf_flow_table.c   | 15 +++
 net/netfilter/nf_flow_table_inet.c  |  1 +
 net/netfilter/nf_tables_api.c   |  8 +---
 6 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
index b22b22082733..67c61bda6a46 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -14,6 +14,7 @@ struct nf_flowtable_type {
struct list_headlist;
int family;
void(*gc)(struct work_struct *work);
+   void(*free)(struct nf_flowtable *table);
const struct rhashtable_params  *params;
nf_hookfn   *hook;
struct module   *owner;
@@ -95,6 +96,7 @@ struct flow_offload_tuple_rhash *flow_offload_lookup(struct nf_flowtable *flow_t
 int nf_flow_table_iterate(struct nf_flowtable *flow_table,
  void (*iter)(struct flow_offload *flow, void *data),
  void *data);
+void nf_flow_table_free(struct nf_flowtable *flow_table);
 void nf_flow_offload_work_gc(struct work_struct *work);
 extern const struct rhashtable_params nf_flow_offload_rhash_params;
 
diff --git a/net/ipv4/netfilter/nf_flow_table_ipv4.c b/net/ipv4/netfilter/nf_flow_table_ipv4.c
index b2d01eb25f2c..25d2975da156 100644
--- a/net/ipv4/netfilter/nf_flow_table_ipv4.c
+++ b/net/ipv4/netfilter/nf_flow_table_ipv4.c
@@ -260,6 +260,7 @@ static struct nf_flowtable_type flowtable_ipv4 = {
.family = NFPROTO_IPV4,
.params = &nf_flow_offload_rhash_params,
.gc = nf_flow_offload_work_gc,
+   .free   = nf_flow_table_free,
.hook   = nf_flow_offload_ip_hook,
.owner  = THIS_MODULE,
 };
diff --git a/net/ipv6/netfilter/nf_flow_table_ipv6.c b/net/ipv6/netfilter/nf_flow_table_ipv6.c
index fff21602875a..d346705d6ee6 100644
--- a/net/ipv6/netfilter/nf_flow_table_ipv6.c
+++ b/net/ipv6/netfilter/nf_flow_table_ipv6.c
@@ -253,6 +253,7 @@ static struct nf_flowtable_type flowtable_ipv6 = {
.family = NFPROTO_IPV6,
.params = &nf_flow_offload_rhash_params,
.gc = nf_flow_offload_work_gc,
+   .free   = nf_flow_table_free,
.hook   = nf_flow_offload_ipv6_hook,
.owner  = THIS_MODULE,
 };
diff --git a/net/netfilter/nf_flow_table.c b/net/netfilter/nf_flow_table.c
index 2f5099cb85b8..20f86091ab98 100644
--- a/net/netfilter/nf_flow_table.c
+++ b/net/netfilter/nf_flow_table.c
@@ -221,6 +221,21 @@ int nf_flow_table_iterate(struct nf_flowtable *flow_table,
 }
 EXPORT_SYMBOL_GPL(nf_flow_table_iterate);
 
+static void
+__nf_flow_offload_free(struct flow_offload *flow, void *data)
+{
+   struct nf_flowtable *flow_table = data;
+
+   flow_offload_del(flow_table, flow);
+}
+
+void nf_flow_table_free(struct nf_flowtable *flow_table)
+{
+   nf_flow_table_iterate(flow_table, __nf_flow_offload_free, flow_table);
+   rhashtable_destroy(&flow_table->rhashtable);
+}
+EXPORT_SYMBOL_GPL(nf_flow_table_free);
+
 static inline bool nf_flow_has_expired(const struct flow_offload *flow)
 {
return (__s32)(flow->timeout - (u32)jiffies) <= 0;
diff --git a/net/netfilter/nf_flow_table_inet.c b/net/netfilter/nf_flow_table_inet.c
index 281209aeba8f..375a1881d93d 100644
--- a/net/netfilter/nf_flow_table_inet.c
+++ b/net/netfilter/nf_flow_table_inet.c
@@ -24,6 +24,7 @@ static struct nf_flowtable_type flowtable_inet = {
.family = NFPROTO_INET,
.params = &nf_flow_offload_rhash_params,
.gc = nf_flow_offload_work_gc,
+   .free   = nf_flow_table_free,
.hook   = nf_flow_offload_inet_hook,
.owner  = THIS_MODULE,
 };
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 0791813a1e7d..a6c4747c7d5d 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -5399,17 +5399,11 @@ static void nf_tables_flowtable_notify(struct nft_ctx *ctx,
nfnetlink_set_err(ctx->net, ctx->portid, NFNLGRP_NFTABLES, -ENOBUFS);
 }
 
-static void nft_flowtable_destroy(void *ptr, void *arg)
-{
-   kfree(ptr);
-}
-
 static void nf_tables_flowtable_destroy(struct nft_flowtable *flowtable)
 {
cancel_delayed_work_sync(&flowtable->data.gc_work);
kfree(flowtable->name);
-   rhashtabl

Re: [PATCH] net: igmp: fix source address check for IGMPv3 reports

2018-01-23 Thread Felix Fietkau
On 2018-01-22 22:17, David Miller wrote:
> From: Felix Fietkau 
> Date: Fri, 19 Jan 2018 11:50:46 +0100
> 
>> Commit "net: igmp: Use correct source address on IGMPv3 reports"
>> introduced a check to validate the source address of locally generated
>> IGMPv3 packets.
>> Instead of checking the local interface address directly, it uses
>> inet_ifa_match(fl4->saddr, ifa), which checks if the address is on the
>> local subnet (or equal to the point-to-point address if used).
>> 
>> This breaks for point-to-point interfaces, so check against
>> ifa->ifa_local directly.
>> 
>> Cc: Kevin Cernekee 
>> Fixes: a46182b00290 ("net: igmp: Use correct source address on IGMPv3 
>> reports")
>> Reported-by: Sebastian Gottschall 
>> Signed-off-by: Felix Fietkau 
> 
> Applied, thanks.
Thanks. Please queue it up for stable as well, since the commit fixed by
this patch has already made it to stable kernels.

- Felix


[PATCH] net: igmp: fix source address check for IGMPv3 reports

2018-01-19 Thread Felix Fietkau
Commit "net: igmp: Use correct source address on IGMPv3 reports"
introduced a check to validate the source address of locally generated
IGMPv3 packets.
Instead of checking the local interface address directly, it uses
inet_ifa_match(fl4->saddr, ifa), which checks if the address is on the
local subnet (or equal to the point-to-point address if used).

This breaks for point-to-point interfaces, so check against
ifa->ifa_local directly.

Cc: Kevin Cernekee 
Fixes: a46182b00290 ("net: igmp: Use correct source address on IGMPv3 reports")
Reported-by: Sebastian Gottschall 
Signed-off-by: Felix Fietkau 
---
 net/ipv4/igmp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 726f6b608274..2d49717a7421 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -332,7 +332,7 @@ static __be32 igmpv3_get_srcaddr(struct net_device *dev,
return htonl(INADDR_ANY);
 
for_ifa(in_dev) {
-   if (inet_ifa_match(fl4->saddr, ifa))
+   if (fl4->saddr == ifa->ifa_local)
return fl4->saddr;
} endfor_ifa(in_dev);
 
-- 
2.14.2



Re: [PATCH] [wireless-next] mt76: fix building without CONFIG_LEDS_CLASS

2018-01-18 Thread Felix Fietkau
On 2018-01-18 14:14, Arnd Bergmann wrote:
> When CONFIG_LEDS_CLASS is disabled, or it is a loadable module while
> mt76 is built-in, we run into a link error:
> 
> drivers/net/wireless/mediatek/mt76/mac80211.o: In function `mt76_register_device':
> mac80211.c:(.text+0xb78): relocation truncated to fit: R_AARCH64_CALL26 against undefined symbol `devm_of_led_classdev_register'
> 
> We don't really need a hard dependency here as the driver can presumably
> work just fine without LEDs, so this follows the iwlwifi example and
> adds a separate Kconfig option for the LED support, this will be available
> whenever it will link, and otherwise the respective code gets left out from
> the driver object.
> 
> Fixes: 17f1de56df05 ("mt76: add common code shared between multiple chipsets")
> Signed-off-by: Arnd Bergmann 
> ---
>  drivers/net/wireless/mediatek/mt76/Kconfig   | 5 +
>  drivers/net/wireless/mediatek/mt76/mac80211.c| 8 +---
>  drivers/net/wireless/mediatek/mt76/mt76x2_init.c | 6 --
>  3 files changed, 14 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/wireless/mediatek/mt76/Kconfig 
> b/drivers/net/wireless/mediatek/mt76/Kconfig
> index fc05d79c80d0..9d12c1f5284e 100644
> --- a/drivers/net/wireless/mediatek/mt76/Kconfig
> +++ b/drivers/net/wireless/mediatek/mt76/Kconfig
> @@ -8,3 +8,8 @@ config MT76x2E
>   depends on PCI
>   ---help---
> This adds support for MT7612/MT7602/MT7662-based wireless PCIe 
> devices.
> +
> +config MT76_LEDS
> + bool "LED support for MT76"
> + depends on MT76_CORE
> + depends on LEDS_CLASS=y || MT76_CORE=LEDS_CLASS
I think this should have a 'default y'

- Felix


Re: [PATCH 1/3] kallsyms: don't leak address when symbol not found

2017-12-18 Thread Felix Fietkau
On 2017-12-18 00:53, Tobin C. Harding wrote:
> Currently if kallsyms_lookup() fails to find the symbol then the address
> is printed. This potentially leaks sensitive information. Instead of
> printing the address we can return an error, giving the calling code the
> option to print the address or print some sanitized message.
> 
> Return error instead of printing address to argument buffer. Leave
> buffer in a sane state.
> 
> Signed-off-by: Tobin C. Harding 
I think there should be a way to keep the old behavior for debugging.

- Felix


Re: [PATCH] mt76: fix memcpy to potential null pointer on failed allocation

2017-12-14 Thread Felix Fietkau
On 2017-12-14 11:13, Colin King wrote:
> From: Colin Ian King 
> 
> Currently if the allocation of skb fails and returns NULL then the
> call to skb_put will cause a null pointer dereference. Fix this by
> checking for a null skb and returning NULL.  Note that calls to
> function mt76x2_mcu_msg_alloc don't directly check the null return
> but instead pass the NULL pointer to mt76x2_mcu_msg_send which
> checks for the NULL and returns ENOMEM in this case.
> 
> Detected by CoverityScan, CID#1462624 ("Dereference null return value")
> 
> Fixes: 7bc04215a66b ("mt76: add driver code for MT76x2e")
> Signed-off-by: Colin Ian King 
Acked-by: Felix Fietkau 


Re: [PATCH net-next 1/3] net: dsa: mediatek: add VLAN support for MT7530

2017-12-12 Thread Felix Fietkau
On 2017-12-07 07:06, sean.w...@mediatek.com wrote:
> From: Sean Wang 
> 
> MT7530 can treat each port as either VLAN-unware port or VLAN-ware port
Shouldn't that be VLAN-unaware/VLAN-aware (in the code as well)?

- Felix


Re: [PATCH v2 net-next] tcp: allow drivers to tweak TSQ logic

2017-11-28 Thread Felix Fietkau
On 2017-11-12 00:54, Eric Dumazet wrote:
> From: Eric Dumazet 
> 
> I had many reports that TSQ logic breaks wifi aggregation.
> 
> Current logic is to allow up to 1 ms of bytes to be queued into qdisc
> and drivers queues.
> 
> But Wifi aggregation needs a bigger budget to allow bigger rates to
> be discovered by various TCP Congestion Controls algorithms.
> 
> This patch adds an extra socket field, allowing wifi drivers to select
> another log scale to derive TCP Small Queue credit from current pacing
> rate.
> 
> Initial value is 10, meaning that this patch does not change current
> behavior.
> 
> We expect wifi drivers to set this field to smaller values (tests have
> been done with values from 6 to 9)
> 
> They would have to use following template :
> 
> if (skb->sk && skb->sk->sk_pacing_shift != MY_PACING_SHIFT)
>  skb->sk->sk_pacing_shift = MY_PACING_SHIFT;
I did some experiments with this approach (with your patch backported to
a 4.9 kernel), and I got some crashes.
After looking at the crashes and code some more, it seems that this
would need some extra checks to ensure that skb->sk is a full struct
sock, instead of just a struct request_sock.
Should this be done by checking for skb->sk->sk_state ==
TCP_ESTABLISHED? It seems to me that this might introduce some extra
overhead.
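
Something along these lines is what I have in mind (rough sketch only:
it uses sk_fullsock() instead of checking sk_state directly, and
MY_PACING_SHIFT is just a placeholder for the driver-specific value from
the template above):

#include <linux/skbuff.h>
#include <net/sock.h>

#define MY_PACING_SHIFT 8	/* placeholder, driver-specific */

/* Illustrative guard: only full sockets carry sk_pacing_shift, so skip
 * request/timewait mini-sockets entirely before touching the field.
 */
static inline void wifi_adjust_pacing_shift(struct sk_buff *skb)
{
	struct sock *sk = skb->sk;

	if (sk && sk_fullsock(sk) && sk->sk_pacing_shift != MY_PACING_SHIFT)
		sk->sk_pacing_shift = MY_PACING_SHIFT;
}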

- Felix


Re: [PATCH RFC,WIP 5/5] netfilter: nft_flow_offload: add ndo hooks for hardware offload

2017-11-11 Thread Felix Fietkau
On 2017-11-03 16:26, Pablo Neira Ayuso wrote:
> This patch adds the infrastructure to offload flows to hardware, in case
> the nic/switch comes with built-in flow tables capabilities.
> 
> If the hardware comes with not hardware flow tables or they have
> limitations in terms of features, this falls back to the software
> generic flow table implementation.
> 
> The software flow table aging thread skips entries that resides in the
> hardware, so the hardware will be responsible for releasing this flow
> table entry too.
> 
> Signed-off-by: Pablo Neira Ayuso 
Hi Pablo,

I'd like to start playing with those patches in OpenWrt/LEDE soon. I'm
also considering making a patch that adds iptables support.
For that to work, I think it would be a good idea to keep the code that
tries to offload flows to hardware in nf_flow_offload.c instead, so that
it can be shared with iptables integration.

By the way, do you have a git tree where you keep the current version of
your patch set?

Thanks,

- Felix


Re: [PATCH] cfg80211: Fix array-bounds warning in fragment copy

2017-03-27 Thread Felix Fietkau
On 2017-03-27 12:47, Johannes Berg wrote:
> On Fri, 2017-03-24 at 18:06 -0700, Matthias Kaehlcke wrote:
>> __ieee80211_amsdu_copy_frag intentionally initializes a pointer to
>> array[-1] to increment it later to valid values. clang rightfully
>> generates an array-bounds warning on the initialization statement.
>> Work around this by initializing the pointer to array[0] and
>> decrementing it later, which allows to leave the rest of the
>> algorithm untouched.
>> 
>> Signed-off-by: Matthias Kaehlcke 
>> ---
>>  net/wireless/util.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>> 
>> diff --git a/net/wireless/util.c b/net/wireless/util.c
>> index 68e5f2ecee1a..d3d459e4a070 100644
>> --- a/net/wireless/util.c
>> +++ b/net/wireless/util.c
>> @@ -659,7 +659,7 @@ __ieee80211_amsdu_copy_frag(struct sk_buff *skb,
>> struct sk_buff *frame,
>>  int offset, int len)
>>  {
>>  struct skb_shared_info *sh = skb_shinfo(skb);
>> -const skb_frag_t *frag = &sh->frags[-1];
>> +const skb_frag_t *frag = &sh->frags[0];
>>  struct page *frag_page;
>>  void *frag_ptr;
>>  int frag_len, frag_size;
>> @@ -669,6 +669,7 @@ __ieee80211_amsdu_copy_frag(struct sk_buff *skb,
>> struct sk_buff *frame,
>>  frag_page = virt_to_head_page(skb->head);
>>  frag_ptr = skb->data;
>>  frag_size = head_size;
>> +frag--;
> 
> Isn't it just a question of time until the compiler will see through
> this trick and warn about it?
Frag is incremented again before being accessed, so there is nothing for
the compiler to see through here.

Acked-by: Felix Fietkau 

- Felix


Re: [PATCH net-next v2 5/5] net-next: dsa: add dsa support for Mediatek MT7530 switch

2017-03-23 Thread Felix Fietkau
On 2017-03-23 15:25, John Crispin wrote:
> 
> 
> On 23/03/17 15:09, Felix Fietkau wrote:
>> On 2017-03-23 09:06, Sean Wang wrote:
>>> Hi Andrew,
>>>
>>> The purpose for the regmap table registered is to
>>>
>>> provide a way which helps us to look up a specific
>>>
>>> register on the switch through regmap-debugfs.
>>>
>>>
>>> And not all ranges of register is defined
>>>
>>> so I only include the meaningful ones in a sparse way
>>>
>>> for the table.
>> I think in that case it might be nice to make regmap support optional in
>> order to avoid pulling in bloat on platforms that don't need it.
>>
>> - Felix
>>
> The 2 relevant platforms are mips/ralink and arm/mediatek. both require 
> regmap for the eth_sysctl syscon if they want to utilize the mtk_soc_eth 
> driver which is a prereq for mt7530. so regmap cannot be optional here.
Makes sense, thanks.

- Felix



Re: [PATCH net-next v2 5/5] net-next: dsa: add dsa support for Mediatek MT7530 switch

2017-03-23 Thread Felix Fietkau
On 2017-03-23 09:06, Sean Wang wrote:
> Hi Andrew,
> 
> The purpose for the regmap table registered is to 
> 
> provide a way which helps us to look up a specific 
> 
> register on the switch through regmap-debugfs.
> 
> 
> And not all ranges of register is defined
> 
> so I only include the meaningful ones in a sparse way 
> 
> for the table.
I think in that case it might be nice to make regmap support optional in
order to avoid pulling in bloat on platforms that don't need it.

- Felix




Re: [PATCH 1/3] net: bgmac: allocate struct bgmac just once & don't copy it

2017-01-27 Thread Felix Fietkau
On 2017-01-27 10:20, Rafał Miłecki wrote:
> From: Rafał Miłecki 
> 
> To share as much code as possible in bgmac we call alloc_etherdev from
> bgmac.c which is used by both: platform and bcma code. The easiest
> solution was to use it for allocating whole struct bgmac but it doesn't
> work well as we already get early-filled struct bgmac as an argument.
> 
> So far we were solving this by copying received struct into newly
> allocated one. The problem is it means storing 2 allocated structs,
> using only 1 of them and non-shared code not having access to it.
> 
> This patch solves it by using alloc_etherdev to allocate *pointer* for
> the already allocated struct. The only downside of this is we have to be
> careful when using netdev_priv.
> 
> Another solution was to call alloc_etherdev in platform/bcma specific
> code but Jon advised against it due to sharing less code that way.
How does that lead to sharing less code?
I find this pointer indirection rather ugly and uncommon, and I think it
would be much cleaner to split the probe into bgmac_enet_alloc and
bgmac_enet_probe (with bgmac_enet_alloc calling alloc_etherdev and doing
basic setup).

- Felix


Re: [net, 6/6] net: korina: version bump

2017-01-22 Thread Felix Fietkau
On 2017-01-22 13:10, Roman Yeryomin wrote:
> On 17 January 2017 at 21:19, Roman Yeryomin  wrote:
>> On 17 January 2017 at 20:55, Felix Fietkau  wrote:
>>> On 2017-01-17 18:33, Roman Yeryomin wrote:
>>>> Signed-off-by: Roman Yeryomin 
>>>> ---
>>>>  drivers/net/ethernet/korina.c | 4 ++--
>>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/net/ethernet/korina.c b/drivers/net/ethernet/korina.c
>>>> index 83c994f..c8fed01 100644
>>>> --- a/drivers/net/ethernet/korina.c
>>>> +++ b/drivers/net/ethernet/korina.c
>>>> @@ -66,8 +66,8 @@
>>>>  #include 
>>>>
>>>>  #define DRV_NAME "korina"
>>>> -#define DRV_VERSION  "0.10"
>>>> -#define DRV_RELDATE  "04Mar2008"
>>>> +#define DRV_VERSION  "0.20"
>>>> +#define DRV_RELDATE  "15Jan2017"
>>> I think it would make more sense to remove this version instead of
>>> bumping it. Individual driver versions are rather pointless, the kernel
>>> version is more meaningful anyway.
>>
>> OK, makes sense
> 
> Actually, after thinking a bit more about this, not really...
> How about ethtool, which uses driver name and version?
> I see most ethernet drivers define some version. And it's pretty
> useful, when using backports.
> IMO, it should be kept and bumped.
I don't really care, I just wanted to point out that the exact kernel
version is a much more useful indicator, especially since not all patch
submitters do the useless version bump dance.

- Felix


Re: [net, 6/6] net: korina: version bump

2017-01-17 Thread Felix Fietkau
On 2017-01-17 18:33, Roman Yeryomin wrote:
> Signed-off-by: Roman Yeryomin 
> ---
>  drivers/net/ethernet/korina.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/korina.c b/drivers/net/ethernet/korina.c
> index 83c994f..c8fed01 100644
> --- a/drivers/net/ethernet/korina.c
> +++ b/drivers/net/ethernet/korina.c
> @@ -66,8 +66,8 @@
>  #include 
>  
>  #define DRV_NAME "korina"
> -#define DRV_VERSION  "0.10"
> -#define DRV_RELDATE  "04Mar2008"
> +#define DRV_VERSION  "0.20"
> +#define DRV_RELDATE  "15Jan2017"
I think it would make more sense to remove this version instead of
bumping it. Individual driver versions are rather pointless, the kernel
version is more meaningful anyway.

- Felix



Re: [PATCH net-next] bridge: multicast to unicast

2017-01-11 Thread Felix Fietkau
On 2017-01-11 13:15, IgorMitsyanko wrote:
> On 01/11/2017 02:30 PM, Felix Fietkau wrote:
>> On 2017-01-11 12:26, IgorMitsyanko wrote:
>>> On 01/11/2017 12:27 AM, Felix Fietkau wrote:
>>>> On 2017-01-10 11:56, Johannes Berg wrote:
>>>>> On Tue, 2017-01-10 at 05:18 +0100, Linus Lüssing wrote:
>>>>>> On Mon, Jan 09, 2017 at 01:30:32PM -0800, Stephen Hemminger wrote:
>>>>>>> I wonder if MAC80211 should be doing IGMP snooping and not bridge
>>>>>>> in this environment.
>>>>>> In the long term, yes. For now, not quite sure.
>>>>> There's no "for now" in the kernel. Code added now will have to be
>>>>> maintained essentially forever.
>>>> I'm not sure that putting the IGMP snooping code in mac80211 is a good
>>>> idea, that would be quite a bit of code duplication.
>>>> This implementation works, it's very simple, and it's quite flexible for
>>>> a number of use cases.
>>>>
>>>> Is there any remaining objection to merging this in principle (aside
>>>> from potential issues with the code)?
>>>>
>>>> - Felix
>>>>
>>>
>>> Hi Felix, can we consider two examples configurations with multicast
>>> traffic:
>>>
>>> 1. AP is a source of multicast traffic itself, no bridge on AP. For
>>> example, wireless video server streaming to several clients.
>>> In this situation, we can not make use of possible advantages given by
>>> mc-to-uc conversion?
>> You could simply put the AP interface in a bridge, no need to have any
>> other bridge members present.
>>
>>> 2. A configuration with AP + STA + 3 client devices behind STA.
>>>   |client 1|
>>>  |
>>> |  mc  ||AP||STA|---|---|client 2|
>>> |server||
>>>   |client 3|
>>>
>>> Multicast server behind AP streams MC video traffic. All 3 clients
>>> behind the STA have joined the multicast group.
>>> I'm not sure if this case will be handled correctly with mc-to-uc
>>> conversion in bridge on AP?
>> What do you mean by "3 client devices behind STA"? Are you using a
>> 4-addr STA, multicast routing, or some kind of vendor specific "client
>> bridge" hackery?
> 
> 3 client devices connected by backbone Ethernet network. Generic
> case is probably STA/AP operating in 4-addr mode (more or less standard
> solution as far as I know).
If the AP is running in 4-addr mode, it will need to have a bridge
interface anyway, because the link to the STA will be split out into a
separate virtual interface (AP_VLAN iftype).

In this case you don't actually need any multicast-to-unicast
conversion, because the multicast traffic will be unicast on 802.11
already (due to use of 4-addr mode).

- Felix


Re: [PATCH net-next] bridge: multicast to unicast

2017-01-11 Thread Felix Fietkau
On 2017-01-11 12:26, IgorMitsyanko wrote:
> On 01/11/2017 12:27 AM, Felix Fietkau wrote:
>> On 2017-01-10 11:56, Johannes Berg wrote:
>>> On Tue, 2017-01-10 at 05:18 +0100, Linus Lüssing wrote:
>>>> On Mon, Jan 09, 2017 at 01:30:32PM -0800, Stephen Hemminger wrote:
>>>>> I wonder if MAC80211 should be doing IGMP snooping and not bridge
>>>>> in this environment.
>>>>
>>>> In the long term, yes. For now, not quite sure.
>>>
>>> There's no "for now" in the kernel. Code added now will have to be
>>> maintained essentially forever.
>> I'm not sure that putting the IGMP snooping code in mac80211 is a good
>> idea, that would be quite a bit of code duplication.
>> This implementation works, it's very simple, and it's quite flexible for
>> a number of use cases.
>>
>> Is there any remaining objection to merging this in principle (aside
>> from potential issues with the code)?
>>
>> - Felix
>>
> 
> 
> Hi Felix, can we consider two examples configurations with multicast 
> traffic:
> 
> 1. AP is a source of multicast traffic itself, no bridge on AP. For 
> example, wireless video server streaming to several clients.
> In this situation, we can not make use of possible advantages given by 
> mc-to-uc conversion?
You could simply put the AP interface in a bridge, no need to have any
other bridge members present.

> 2. A configuration with AP + STA + 3 client devices behind STA.
>  |client 1|
>  |
> |  mc  ||AP||STA|---|---|client 2|
> |server||
>  |client 3|
> 
> Multicast server behind AP streams MC video traffic. All 3 clients 
> behind the STA have joined the multicast group.
> I'm not sure if this case will be handled correctly with mc-to-uc 
> conversion in bridge on AP?
What do you mean by "3 client devices behind STA"? Are you using a
4-addr STA, multicast routing, or some kind of vendor specific "client
bridge" hackery?

- Felix


Re: [PATCH net-next] bridge: multicast to unicast

2017-01-10 Thread Felix Fietkau
On 2017-01-10 11:56, Johannes Berg wrote:
> On Tue, 2017-01-10 at 05:18 +0100, Linus Lüssing wrote:
>> On Mon, Jan 09, 2017 at 01:30:32PM -0800, Stephen Hemminger wrote:
>> > I wonder if MAC80211 should be doing IGMP snooping and not bridge
>> > in this environment.
>> 
>> In the long term, yes. For now, not quite sure.
> 
> There's no "for now" in the kernel. Code added now will have to be
> maintained essentially forever.
I'm not sure that putting the IGMP snooping code in mac80211 is a good
idea, that would be quite a bit of code duplication.
This implementation works, it's very simple, and it's quite flexible for
a number of use cases.

Is there any remaining objection to merging this in principle (aside
from potential issues with the code)?

- Felix


Re: [PATCH net-next] bridge: multicast to unicast

2017-01-10 Thread Felix Fietkau
On 2017-01-10 18:17, Dave Taht wrote:
> In the case of wifi I have 3 issues with this line of thought.
> 
> multicast in wifi has generally supposed to be unreliable. This makes
> it reliable. reliability comes at a cost -
> 
> multicast is typically set at a fixed low rate today. unicast is
> retried at different rates until it succeeds - for every station
> listening. If one station is already at the lowest rate, the total
> cost of the transmit increases, rather than decreases.
> 
> unicast gets block acks until it succeeds. Again, more delay.
> 
> I think there is something like 31 soft-retries in the ath9k driver
If I remember correctly, hardware retries are counted here as well.

> what happens to diffserv markings here? for unicast CS1 goes into the
> BE queue, CS6, the VO queue. Do we go from one flat queue for all of
> multicast to punching it through one of the hardware queues based on
> the diffserv mark now with this patch?
> 
> I would like it if there was a way to preserve the unreliability
> (which multiple mesh protocols depend on), send stuff with QoSNoack,
> etc - or dynamically choose (based on the rates of the stations)
> between conventional multicast and unicast.
> 
> Or - better, IMHO, keep sending multicast as is but pick the best of
> the rates available to all the listening stations for it.
The advantage of the multicast-to-unicast conversion goes beyond simply
selecting a better rate - aggregation matters a lot as well, and that is
simply incompatible with normal multicast.

Some multicast streams use lots of small-ish packets, the airtime impact
of those is vastly reduced, even if the transmission has to be
duplicated for a few stations.

- Felix


Re: [PATCH net-next] bridge: multicast to unicast

2017-01-06 Thread Felix Fietkau
On 2017-01-06 14:54, Johannes Berg wrote:
> 
>> The bridge layer can use IGMP snooping to ensure that the multicast
>> stream is only transmitted to clients that are actually a member of
>> the group. Can the mac80211 feature do the same?
> 
> No, it'll convert the packet for all clients that are behind that
> netdev. But that's an argument for dropping the mac80211 feature, which
> hasn't been merged upstream yet, no?
Right.

- Felix



Re: [PATCH net-next] bridge: multicast to unicast

2017-01-06 Thread Felix Fietkau
On 2017-01-06 13:47, Johannes Berg wrote:
> On Mon, 2017-01-02 at 20:32 +0100, Linus Lüssing wrote:
>> Implements an optional, per bridge port flag and feature to deliver
>> multicast packets to any host on the according port via unicast
>> individually. This is done by copying the packet per host and
>> changing the multicast destination MAC to a unicast one accordingly.
> 
> How does this compare and/or relate to the multicast-to-unicast feature
> we were going to add to the wifi stack, particularly mac80211? Do we
> perhaps not need that feature at all, if bridging will have it?
> 
> I suppose that the feature there could apply also to locally generated
> traffic when the AP interface isn't in a bridge, but I think I could
> live with requiring the AP to be put into a bridge to achieve a similar
> configuration?
> 
> Additionally, on an unrelated note, this seems to apply generically to
> all kinds of frames, losing information by replacing the address.
> Shouldn't it have similar limitations as the wifi stack feature has
> then, like only applying to ARP, IPv4, IPv6 and not general protocols?
> 
> Also, it should probably come with the same caveat as we documented for
> the wifi feature:
> 
> Note that this may break certain expectations of the receiver,
> such as the ability to drop unicast IP packets received within
> multicast L2 frames, or the ability to not send ICMP destination
> unreachable messages for packets received in L2 multicast (which
> is required, but the receiver can't tell the difference if this
> new option is enabled.)
> 
> 
> I'll hold off sending my tree in until we see that we really need both
> features, or decide that we want the wifi feature *instead* of the
> bridge feature.
The bridge layer can use IGMP snooping to ensure that the multicast
stream is only transmitted to clients that are actually a member of the
group. Can the mac80211 feature do the same?

- Felix


Re: [PATCH net-next] bridge: multicast to unicast

2017-01-03 Thread Felix Fietkau
On 2017-01-02 20:32, Linus Lüssing wrote:
> Implements an optional, per bridge port flag and feature to deliver
> multicast packets to any host on the according port via unicast
> individually. This is done by copying the packet per host and
> changing the multicast destination MAC to a unicast one accordingly.
> 
> multicast-to-unicast works on top of the multicast snooping feature of
> the bridge. Which means unicast copies are only delivered to hosts which
> are interested in it and signalized this via IGMP/MLD reports
> previously.
> 
> This feature is intended for interface types which have a more reliable
> and/or efficient way to deliver unicast packets than broadcast ones
> (e.g. wifi).
> 
> However, it should only be enabled on interfaces where no IGMPv2/MLDv1
> report suppression takes place. This feature is disabled by default.
> 
> The initial patch and idea is from Felix Fietkau.
> 
> Cc: Felix Fietkau 
> Signed-off-by: Linus Lüssing 
Please add Signed-off-by: Felix Fietkau 
in the next version, and maybe also From:

Thanks,

- Felix


Re: Misalignment, MIPS, and ip_hdr(skb)->version

2016-12-10 Thread Felix Fietkau
On 2016-12-10 21:32, Måns Rullgård wrote:
> Felix Fietkau  writes:
> 
>> On 2016-12-10 14:25, Måns Rullgård wrote:
>>> Felix Fietkau  writes:
>>> 
>>>> On 2016-12-07 19:54, Jason A. Donenfeld wrote:
>>>>> On Wed, Dec 7, 2016 at 7:51 PM, David Miller  wrote:
>>>>>> It's so much better to analyze properly where the misalignment comes from
>>>>>> and address it at the source, as we have for various cases that trip up
>>>>>> Sparc too.
>>>>> 
>>>>> That's sort of my attitude too, hence starting this thread. Any
>>>>> pointers you have about this would be most welcome, so as not to
>>>>> perpetuate what already seems like an issue in other parts of the
>>>>> stack.
>>>> Hi Jason,
>>>>
>>>> I'm the author of that hackish LEDE/OpenWrt patch that works around the
>>>> misalignment issues. Here's some context regarding that patch:
>>>>
>>>> I intentionally put it in the target specific patches for only one of
>>>> our MIPS targets. There are a few ar71xx devices where the misalignment
>>>> cannot be fixed, because the Ethernet MAC has a 4-byte DMA alignment
>>>> requirement, and does not support inserting 2 bytes of padding to
>>>> correct the IP header misalignment.
>>>>
>>>> With these limitations the choice was between this ugly network stack
>>>> patch or inserting a very expensive memmove in the data path (which is
>>>> better than taking the mis-alignment traps, but still hurts routing
>>>> performance significantly).
>>> 
>>> I solved this problem in an Ethernet driver by copying the initial part
>>> of the packet to an aligned skb and appending the remainder using
>>> skb_add_rx_frag().  The kernel network stack only cares about the
>>> headers, so the alignment of the packet payload doesn't matter.
>>
>> I considered that as well, but it's bad for routing performance if the
>> ethernet MAC does not support scatter/gather for xmit.
>> Unfortunately that limitation is quite common on embedded hardware.
> 
> Yes, I can see that being an issue.  However, if you're doing zero-copy
> routing, the header part of the original buffer should still be there,
> unused, so you could presumably copy the header of the outgoing packet
> there and then do dma as usual.  Maybe there's something in the network
> stack that makes this impossible though.
That still puts more pressure on the ridiculously small dcache sizes
that are typical for embedded MIPS routers.

- Felix



Re: Misalignment, MIPS, and ip_hdr(skb)->version

2016-12-10 Thread Felix Fietkau
On 2016-12-10 14:25, Måns Rullgård wrote:
> Felix Fietkau  writes:
> 
>> On 2016-12-07 19:54, Jason A. Donenfeld wrote:
>>> On Wed, Dec 7, 2016 at 7:51 PM, David Miller  wrote:
>>>> It's so much better to analyze properly where the misalignment comes from
>>>> and address it at the source, as we have for various cases that trip up
>>>> Sparc too.
>>> 
>>> That's sort of my attitude too, hence starting this thread. Any
>>> pointers you have about this would be most welcome, so as not to
>>> perpetuate what already seems like an issue in other parts of the
>>> stack.
>> Hi Jason,
>>
>> I'm the author of that hackish LEDE/OpenWrt patch that works around the
>> misalignment issues. Here's some context regarding that patch:
>>
>> I intentionally put it in the target specific patches for only one of
>> our MIPS targets. There are a few ar71xx devices where the misalignment
>> cannot be fixed, because the Ethernet MAC has a 4-byte DMA alignment
>> requirement, and does not support inserting 2 bytes of padding to
>> correct the IP header misalignment.
>>
>> With these limitations the choice was between this ugly network stack
>> patch or inserting a very expensive memmove in the data path (which is
>> better than taking the mis-alignment traps, but still hurts routing
>> performance significantly).
> 
> I solved this problem in an Ethernet driver by copying the initial part
> of the packet to an aligned skb and appending the remainder using
> skb_add_rx_frag().  The kernel network stack only cares about the
> headers, so the alignment of the packet payload doesn't matter.
I considered that as well, but it's bad for routing performance if the
ethernet MAC does not support scatter/gather for xmit.
Unfortunately that limitation is quite common on embedded hardware.
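
For reference, a rough sketch of the copy-plus-frag scheme being
described (the function name, the 128-byte copy length and the
page-based rx buffer parameters are assumptions, not taken from any
particular driver):

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

#define RX_COPYBREAK 128	/* bytes of headers copied into the aligned skb */

static struct sk_buff *rx_build_aligned_skb(struct net_device *dev,
					    struct page *page,
					    unsigned int offset,
					    unsigned int len)
{
	unsigned int copy = min_t(unsigned int, len, RX_COPYBREAK);
	struct sk_buff *skb;

	/* NET_IP_ALIGN padding keeps the copied headers 4-byte aligned */
	skb = netdev_alloc_skb_ip_align(dev, copy);
	if (!skb)
		return NULL;

	memcpy(skb_put(skb, copy), page_address(page) + offset, copy);

	if (len > copy)
		/* append the (possibly misaligned) payload as a page frag */
		skb_add_rx_frag(skb, 0, page, offset + copy,
				len - copy, PAGE_SIZE);

	return skb;
}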

- Felix



Re: Misalignment, MIPS, and ip_hdr(skb)->version

2016-12-10 Thread Felix Fietkau
On 2016-12-07 19:54, Jason A. Donenfeld wrote:
> On Wed, Dec 7, 2016 at 7:51 PM, David Miller  wrote:
>> It's so much better to analyze properly where the misalignment comes from
>> and address it at the source, as we have for various cases that trip up
>> Sparc too.
> 
> That's sort of my attitude too, hence starting this thread. Any
> pointers you have about this would be most welcome, so as not to
> perpetuate what already seems like an issue in other parts of the
> stack.
Hi Jason,

I'm the author of that hackish LEDE/OpenWrt patch that works around the
misalignment issues. Here's some context regarding that patch:

I intentionally put it in the target specific patches for only one of
our MIPS targets. There are a few ar71xx devices where the misalignment
cannot be fixed, because the Ethernet MAC has a 4-byte DMA alignment
requirement, and does not support inserting 2 bytes of padding to
correct the IP header misalignment.

With these limitations the choice was between this ugly network stack
patch or inserting a very expensive memmove in the data path (which is
better than taking the mis-alignment traps, but still hurts routing
performance significantly).

There are a lot of places in the network stack that assume full 32 bit
alignment, and you only get to see those once you start using more of
netfilter, play with various tunnel encapsulations, etc.

I think you have 3 options to deal with this properly:
1. add 3 bytes of padding
2. allocate a separate skb for decryption (might be more expensive)
3. save the header and decrypt to the start of the packet data
(overwriting the misaligned header).

I'm not sure what the performance impact of 2 and 3 is, so it's probably
best to stick with the padding.

I've taken a quick look at the wireguard message headers, and my
recommendation would be to insert the 3-byte padding in struct
message_header and remove __packed from your structs.
This will also remove misalignment of your own protocol fields.
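
Something like this is what I mean (hypothetical field names, not your
actual structs):

#include <linux/types.h>

/* Hypothetical layout: a one-byte type plus explicit reserved padding
 * keeps everything after it naturally aligned without __packed.
 */
struct message_header_example {
	u8 type;
	u8 reserved[3];
	__le32 sender_index;	/* illustrative field, stays 4-byte aligned */
};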

- Felix


Re: [PATCH 1/2] add basic register-field manipulation macros

2016-06-13 Thread Felix Fietkau
On 2016-06-13 15:29, Jakub Kicinski wrote:
> C bitfields are problematic and best avoided.  Developers
> interacting with hardware registers find themselves searching
> for easy-to-use alternatives.  Common approach is to define
> structures or sets of macros containing mask and shift pair.
> Operations on the register are then performed as follows:
> 
>  field = (reg >> shift) & mask;
> 
>  reg &= ~(mask << shift);
>  reg |= (field & mask) << shift;
> 
> Defining shift and mask separately is tedious.  Ivo van Doorn
> came up with an idea of computing them at compilation time
> based on a single shifted mask (later refined by Felix) which
> can be used like this:
> 
>  #define X_REG_FIELD 0x000ff000
> 
>  field = FIELD_GET(X_REG_FIELD, reg);
> 
>  reg &= ~X_REG_FIELD;
>  reg |= FIELD_PUT(X_REG_FIELD, field);
> 
> FIELD_{GET,PUT} macros take care of finding out what the
> appropriate shift is based on compilation time ffs operation.
> 
> GENMASK can be used to define registers (which is usually
> less error-prone and easier to match with datasheets).
> 
> This approach is the most convenient I've seen so to limit code
> multiplication let's move the macros to a global header file.
> 
> Compared to Felix Fietkau's implementation from mt76 this one
> uses standard Linux and GCC functions such as is_power_of_2()
> and __builtin_ffsll().
> 
> Signed-off-by: Jakub Kicinski 
> ---
>  include/linux/bitfield.h | 58 
>  include/linux/log2.h |  6 +
>  2 files changed, 64 insertions(+)
>  create mode 100644 include/linux/bitfield.h
> 
> diff --git a/include/linux/bitfield.h b/include/linux/bitfield.h
> new file mode 100644
> index ..ae2224464523
> --- /dev/null
> +++ b/include/linux/bitfield.h
> @@ -0,0 +1,58 @@
> +/*
> + * Copyright (C) 2014 Felix Fietkau 
Please change my email address to n...@nbd.name here

- Felix


Re: [RFC/RFT] mac80211: implement fq_codel for software queuing

2016-03-07 Thread Felix Fietkau
On 2016-03-07 15:05, Avery Pennarun wrote:
> On Fri, Mar 4, 2016 at 1:32 AM, Michal Kazior  wrote:
>> On 4 March 2016 at 03:48, Tim Shepard  wrote:
>> [...]
>>> (I am interested in knowing what other mac80211 drivers have been
>>>  modified to use the mac80211 intermediate software queues.   I know
>>>  Michal mentioned he has patches for ath10k that are not yet released,
>>>  and I know Felix is finishing up the mt76 driver which uses them.)
>>
>> Patches for ath10k are under review since quite some time now (but are
>> not merged yet). The latest re-spin is:
>>
>>   http://lists.infradead.org/pipermail/ath10k/2016-March/006923.html
> 
> Hi all, on Friday I had a chance to experiment with some of these
> patches, specifically Tim's ath9k patch (to use intermediate queues),
> plus MIchal's patch to use fq_codel with the intermediate queues.  I
> didn't attempt any fine tuning; I just slapped them together to see
> what happens.  (I tried applying Michal's ath10k patches too, but got
> stuck since they seem to be applied against the upstream v4.4 kernel
> and didn't merge cleanly with the latest mac80211 branch.  Maybe I was
> doing something wrong.)
> 
> Test setup:
>AP (ath9k) -> 2x2 strong signal -> STA1 (mwifiex)
> -> attenuator (-40 dB) -> 1x1 weak signal  -> STA2 (mwifiex)
> 
> STA2 generally gets modulation levels around MCS0-2 and STA1 usually
> gets something like MCS12-15.
> 
> With or without this patch, results with TCP iperf were fishy - I
> think packet loss patterns were particularly bad and caused 2-second
> TCP retry timeouts occasionally - so I removed TCP from the test and
> switched the UDP iperf instead.
> 
> I ran isoping 
> (https://gfiber.googlesource.com/vendor/google/platform/+/master/cmds/isoping.c)
> from the AP to both stations to measure two-way latency during all
> tests.  (I used -r2 for two packets/sec in each direction in order not
> to affect the test results too much.)
> 
> Overall results:
> 
> - Running one iperf at a time, I saw ~45 Mbps to STA1 and ~7 Mbps to STA2.
> 
> - Running both iperfs at once, without the patches, latencies got
> extremely high (~600ms sometimes) and results were closer to
> byte-fairness than airtime-fairness (ie. ~7 Mbps each).
> 
> - Running both iperfs at once, with the patches, latencies were still
> high (usually high 2-digit, sometimes low 3-digit latencies) but we
> got closer to airtime-fairness than byte-fairness (~17 Mbps and ~2
> Mbps).
> 
> - With only one iperf running, without the patches, latencies were
> high to both stations.  With the patches, latency was
> mid-double-digits to the non-iperf station (pretty good!) while being
> low-mid triple-digits to the busy iperf station.  This suggests that
> we are getting per-station queuing (yay!) but does make me question
> whether the fq_ in fq_codel was working.
Please change the 'if (flow->txqi)' check in ieee80211_txq_enqueue to:
if (flow->txqi && flow->txqi != txqi)
This should hopefully fix the fq_ part ;)

- Felix


Re: [RFC/RFT] mac80211: implement fq_codel for software queuing

2016-02-26 Thread Felix Fietkau
On 2016-02-26 14:09, Michal Kazior wrote:
> Since 11n aggregation become important to get the
> best out of txops. However aggregation inherently
> requires buffering and queuing. Once variable
> medium conditions to different associated stations
> is considered it became apparent that bufferbloat
> can't be simply fought with qdiscs for wireless
> drivers. 11ac with MU-MIMO makes the problem
> worse because the bandwidth-delay product becomes
> even greater.
> 
> This bases on codel5 and sch_fq_codel.c. It may
> not be the Right Thing yet but it should at least
> provide a framework for more improvements.
Nice work!

> I guess dropping rate could factor in per-station
> rate control info but I don't know how this should
> exactly be done. HW rate control drivers would
> need extra work to take advantage of this.
> 
> This obviously works only with drivers that use
> wake_tx_queue op.
> 
> Note: This uses IFF_NO_QUEUE to get rid of qdiscs
> for wireless drivers that use mac80211 and
> implement wake_tx_queue op.
> 
> Moreover the current txq_limit and latency setting
> might need tweaking. Either from userspace or be
> dynamically scaled with regard to, e.g. number of
> associated stations.
> 
> FWIW This already works nicely with ath10k's (not
> yey merged) pull-push congestion control for
> MU-MIMO as far as throughput is concerned.
> 
> Evaluating latency improvements is a little tricky
> at this point if a driver is using more queue
> layering and/or its firmware controls tx
> scheduling - hence I don't have any solid data on
> this. I'm open for suggestions though.
> 
> It might also be a good idea to do the following
> in the future:
> 
>  - make generic tx scheduling which does some RR
>over per-sta-tid queues and dequeues bursts of
>packets to form a PPDU to fit into designated
>txop timeframe and bytelimit
> 
>This could in theory be shared and used by
>ath9k and (future) mt76.
> 
>Moreover tx scheduling could factor in rate
>control info and keep per-station number of
>queued packets at a sufficient low threshold to
>avoid queue buildup for slow stations. Emmanuel
>already did similar experiment for iwlwifi's
>station mode and got promising results.
> 
>  - make software queueing default internally in
>mac80211. This could help other drivers to get
>at least some benefit from mac80211 smarter
>queueing.
> 
> Signed-off-by: Michal Kazior 
> ---
>  include/net/mac80211.h |  36 -
>  net/mac80211/agg-tx.c  |   8 +-
>  net/mac80211/codel.h   | 260 +++
>  net/mac80211/codel_i.h |  89 +++
>  net/mac80211/ieee80211_i.h |  27 +++-
>  net/mac80211/iface.c   |  25 ++-
>  net/mac80211/main.c|   9 +-
>  net/mac80211/rx.c  |   2 +-
>  net/mac80211/sta_info.c|  10 +-
>  net/mac80211/sta_info.h|  27 
>  net/mac80211/tx.c  | 370 -
>  net/mac80211/util.c|  20 ++-
>  12 files changed, 805 insertions(+), 78 deletions(-)
>  create mode 100644 net/mac80211/codel.h
>  create mode 100644 net/mac80211/codel_i.h
> 

> diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
> index af584f7cdd63..f42f898cb8b5 100644
> --- a/net/mac80211/tx.c
> +++ b/net/mac80211/tx.c
> + [...]
> +static void ieee80211_txq_enqueue(struct ieee80211_local *local,
> +   struct txq_info *txqi,
> +   struct sk_buff *skb)
> +{
> + struct ieee80211_fq *fq = &local->fq;
> + struct ieee80211_hw *hw = &local->hw;
> + struct txq_flow *flow;
> + struct txq_flow *i;
> + size_t idx = fq_hash(fq, skb);
> +
> + flow = &fq->flows[idx];
> +
> + if (flow->txqi)
> + flow = &txqi->flow;
I'm not sure I understand this part correctly, but shouldn't that be:
if (flow->txqi && flow->txqi != txqi)

> +
> + /* The following overwrites `vif` pointer effectively. It is later
> +  * restored using txq structure.
> +  */
> + IEEE80211_SKB_CB(skb)->control.enqueue_time = codel_get_time();
> +
> + flow->txqi = txqi;
> + flow->backlog += skb->len;
> + txqi->backlog_bytes += skb->len;
> + txqi->backlog_packets++;
> + fq->backlog++;
> +
> + if (list_empty(&flow->backlogchain))
> + i = list_last_entry(&fq->backlogs, struct txq_flow, 
> backlogchain);
> + else
> + i = flow;
> +
> + list_for_each_entry_continue_reverse(i, &fq->backlogs, backlogchain)
> + if (i->backlog > flow->backlog)
> + break;
> +
> + list_move(&flow->backlogchain, &i->backlogchain);
> +
> + if (list_empty(&flow->flowchain)) {
> + flow->deficit = fq->quantum;
> + list_add_tail(&flow->flowchain, &txqi->new_flows);
> + }
> +
> + __skb_queue_tail(&flow->queue, skb);
> +
> + if (fq->backlog > hw->txq_limit)
> + fq_drop(local);
> +}


Re: [PATCH V2 03/12] net-next: mediatek: add embedded switch driver (ESW)

2016-02-26 Thread Felix Fietkau
On 2016-02-26 16:18, Andrew Lunn wrote:
> On Fri, Feb 26, 2016 at 03:21:35PM +0100, John Crispin wrote:
>> The ESW is found in many of the old 100mbit MIPS based SoCs. it has 5
>> external ports, 1 cpu port and 1 further port that the internal HW
>> offloading engine connects to.
>> 
>> This driver is very basic and only provides basic init and irq support.
>> The SoC and switch core both have support for a special tag making DSA
>> support possible.
> 
> Hi Crispin
> 
> There was recently a discussion about adding switches without using
> DSA or switchdev. It was pretty much decided we would not accept such
> drivers.
For exactly this reason, the code does not provide any non-standard API
for allowing the user to configure the switch. The hardware needs to be
programmed with some defaults for the driver to be functional (since the
switch logic is built into the SoC).

In my opinion, leaving this part out does not make much sense and
neither does deferring the entire patch series until we have a
switchdev/DSA capable driver. This is just a starting point, which will
be turned into a proper driver with the right APIs later.

- Felix


Re: [PATCH] mac80211: minstrel_ht: fix out-of-bound in minstrel_ht_set_best_prob_rate

2016-01-29 Thread Felix Fietkau
On 2016-01-29 09:35, Konstantin Khlebnikov wrote:
> Patch fixes this splat
> 
> BUG: KASAN: slab-out-of-bounds in minstrel_ht_update_stats.isra.7+0x6e1/0x9e0
> [mac80211] at addr 8800cee640f4 Read of size 4 by task swapper/3/0
> 
> Signed-off-by: Konstantin Khlebnikov 
> Link: 
> http://lkml.kernel.org/r/calygninyjhsavne35qs6ucgasb2dx1_i5hcravuox14otz2...@mail.gmail.com
Acked-by: Felix Fietkau 


Re: KASAN splal in minstrel_ht_update_stats()

2016-01-28 Thread Felix Fietkau
On 2016-01-15 06:23, Konstantin Khlebnikov wrote:
> Jan 10 17:56:25 hamm kernel: [184374.378842]
> ==
> Jan 10 17:56:25 hamm kernel: [184374.379001] BUG: KASAN:
> slab-out-of-bounds in minstrel_ht_update_stats.isra.7+0x6e1/0x9e0
[...]
> ==
> 
> out-of-bound in
> 
> if (mrs->prob_ewma > mg->rates[mg->max_group_prob_rate].prob_ewma)
>  mg->max_group_prob_rate = index;
> 
> 
> 
> Fix should be something like this:
> 
> --- a/net/mac80211/rc80211_minstrel_ht.c
> +++ b/net/mac80211/rc80211_minstrel_ht.c
> @@ -414,15 +414,16 @@ minstrel_ht_set_best_prob_rate(struct
> minstrel_ht_sta *mi, u16 index)
> (max_tp_group != MINSTREL_CCK_GROUP))
> return;
> 
> +   max_gpr_group = mg->max_group_prob_rate / MCS_GROUP_RATES;
> +   max_gpr_idx = mg->max_group_prob_rate % MCS_GROUP_RATES;
> +   max_gpr_prob = mi->groups[max_gpr_group].rates[max_gpr_idx].prob_ewma;
> +
> if (mrs->prob_ewma > MINSTREL_FRAC(75, 100)) {
> cur_tp_avg = minstrel_ht_get_tp_avg(mi, cur_group, cur_idx,
> mrs->prob_ewma);
> if (cur_tp_avg > tmp_tp_avg)
> mi->max_prob_rate = index;
> 
> -   max_gpr_group = mg->max_group_prob_rate / MCS_GROUP_RATES;
> -   max_gpr_idx = mg->max_group_prob_rate % MCS_GROUP_RATES;
> -   max_gpr_prob =
> mi->groups[max_gpr_group].rates[max_gpr_idx].prob_ewma;
> max_gpr_tp_avg = minstrel_ht_get_tp_avg(mi, max_gpr_group,
> max_gpr_idx,
> max_gpr_prob);
> @@ -431,7 +432,7 @@ minstrel_ht_set_best_prob_rate(struct
> minstrel_ht_sta *mi, u16 index)
> } else {
> if (mrs->prob_ewma > tmp_prob)
> mi->max_prob_rate = index;
> -   if (mrs->prob_ewma >
> mg->rates[mg->max_group_prob_rate].prob_ewma)
> +   if (mrs->prob_ewma > max_gpr_prob)
> mg->max_group_prob_rate = index;
> }
>  }
Fix looks correct, but does not apply (line wrapped). Please resubmit
with a proper description and your Signed-off-by.

Thanks,

- Felix


Re: [PATCH] net: emac: emac gigabit ethernet controller driver

2015-12-07 Thread Felix Fietkau
On 2015-12-07 23:58, Gilad Avidov wrote:
> +/* RRD (Receive Return Descriptor) */
> +union emac_rrd {
> + struct {
> + /* 32bit word 0 */
> + u32  xsum:16;
> + u32  nor:4;   /* number of RFD */
> + u32  si:12;   /* start index of rfd-ring */
> + /* 32bit word 1 */
> + u32  hash;
> + /* 32bit word 2 */
You should never use bitfields for hardware structs.
I think in general, kernel code should be made endian safe, even if you
only care about one particular endian type for your platform.

- Felix


[PATCH] net: remove unnecessary semicolon in netdev_alloc_pcpu_stats()

2015-12-05 Thread Felix Fietkau
This semicolon causes a build error if the function call is wrapped in
parentheses.
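
For illustration (a hypothetical caller, not the code from the original
report), a use like this only compiles once the semicolon is gone:

#include <linux/netdevice.h>

static struct pcpu_sw_netstats __percpu *example_alloc(void)
{
	/* with the old macro this expands to "(...;)" -- a stray
	 * semicolon inside the parentheses, which fails to build
	 */
	return (netdev_alloc_pcpu_stats(struct pcpu_sw_netstats));
}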

Fixes: aabc92bbe3cf ("net: add __netdev_alloc_pcpu_stats() to indicate gfp flags")
Reported-by: Imre Kaloz 
Signed-off-by: Felix Fietkau 
---
 include/linux/netdevice.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7d2d1d7..f2f91d9 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2103,7 +2103,7 @@ struct pcpu_sw_netstats {
 })
 
 #define netdev_alloc_pcpu_stats(type)  \
-   __netdev_alloc_pcpu_stats(type, GFP_KERNEL);
+   __netdev_alloc_pcpu_stats(type, GFP_KERNEL)
 
 #include 
 
-- 
2.2.2



Re: [PATCH] packet: Allow packets with only a header (but no payload)

2015-11-09 Thread Felix Fietkau
On 2015-11-09 18:53, Willem de Bruijn wrote:
> On Sat, Nov 7, 2015 at 8:11 AM, Felix Fietkau  wrote:
>> On 2015-07-31 00:15, Martin Blumenstingl wrote:
>>> On Wed, Jul 29, 2015 at 8:05 AM, Willem de Bruijn  
>>> wrote:
>>>> Martin, to return to your initial statement that PPPoE PADI packets can
>>>> have a zero payload: the PPPoE RFC states that PADI packets "MUST
>>>> contain exactly one TAG of TAG_TYPE Service-Name, indicating the
>>>> service the Host is requesting, and any number of other TAG types."
>>>> (RFC 2516, 5.1). Is the observed behavior (no payload) perhaps
>>>> incorrect?
>>> As far as I can see you are right, but the real world seems to be different.
>>> My ISP for example lists the PPPoE connection settings, but they are
>>> nowhere mentioning the "service name".
>>>
>>> I have also re-read pppd's source code again and that seems to confirm
>>> what you are reading in the RFC: Leaving the service name away makes
>>> seems to violate the RFC, but pppd still accepts those configurations.
>>>
>>>> Even if it is, if this is breaking established userspace expectations,
>>>> we should look into it. Ethernet specifies a minimum payload size of
>>>> 46 on the wire, but perhaps that is handled with padding, so that
>>>> 0 length should be valid within the stack. Also, there may be other
>>>> valid uses of 0 length payload on top of link layers that are not Ethernet.
>>> Good catch. I would also like to note that the documentation for
>>> "hard_header_len" describes it as "Hardware header length". When the
>>> purpose of this field we should check whether the documentation should
>>> be updated to "Minimum hardware header length" -> that would mean the
>>> condition has to be a "len < hard_header_len" instead of a "len <=
>>> hard_header_len" (as it is now).
>>>
>>> PS: I have also added the pppd maintainer (Paul Mackerras) to this
>>> thread because I think he should know about this issue (and he can
>>> probably provide more details if required).
>>> As a quick summary for him: linux  >= 3.19 rejects PADI packets when
>>> no service name is configured.
>> Any news on this? Users are complaining about this regression:
>> https://dev.openwrt.org/ticket/20707
> 
> I took another look. This hinges on the question what the contract with
> device drivers is on skb network data and length. Is passing an skb with
> skb->len == 0 to ndo_start_xmit allowed?
> 
> From what I gather from the ethernet spec [1], sending frames with an
> empty head is allowed on that medium, at least.
> 
> A quick scan of a few drivers and the loopback path also does not show
> anything that would break. In some cases, skb_network_header points
> beyond the end of the buffer (ETH_HLEN), but the length is correctly
> reported as 0.
> 
> The tap device can also generate packets consisting of only a link layer
> header: compares len < ETH_HLEN in tun_get_user.
> 
> So, I think that this change should be correct:
> 
>  static bool ll_header_truncated(const struct net_device *dev, int len)
>  {
> -   /* net device doesn't like empty head */
> -   if (unlikely(len <= dev->hard_header_len)) {
> +   if (unlikely(len < dev->hard_header_len)) {
> 
> but a definitive answer would require an audit of all device drivers
> (including bonding, ..) or at least the certainty that it has always
> been correct to send a packet of only link layer header to
> ndo_start_xmit.
> 
> [1] IEEE 802.3™-2012 – Section One, {3.2.8, 4.2.3.3}
Yeah, I agree that such an audit is required. However, I think it's
*much* more important to add this change as soon as possible to fix the
regression. The old code may have had theoretical driver issues, but the
current code breaks real-world user setups.

- Felix


Re: [PATCH] packet: Allow packets with only a header (but no payload)

2015-11-07 Thread Felix Fietkau
On 2015-07-31 00:15, Martin Blumenstingl wrote:
> On Wed, Jul 29, 2015 at 8:05 AM, Willem de Bruijn  wrote:
>> Martin, to return to your initial statement that PPPoE PADI packets can
>> have a zero payload: the PPPoE RFC states that PADI packets "MUST
>> contain exactly one TAG of TAG_TYPE Service-Name, indicating the
>> service the Host is requesting, and any number of other TAG types."
>> (RFC 2516, 5.1). Is the observed behavior (no payload) perhaps
>> incorrect?
> As far as I can see you are right, but the real world seems to be different.
> My ISP for example lists the PPPoE connection settings, but they are
> nowhere mentioning the "service name".
> 
> I have also re-read pppd's source code again and that seems to confirm
> what you are reading in the RFC: Leaving the service name away makes
> seems to violate the RFC, but pppd still accepts those configurations.
> 
>> Even if it is, if this is breaking established userspace expectations,
>> we should look into it. Ethernet specifies a minimum payload size of
>> 46 on the wire, but perhaps that is handled with padding, so that
>> 0 length should be valid within the stack. Also, there may be other
>> valid uses of 0 length payload on top of link layers that are not Ethernet.
> Good catch. I would also like to note that the documentation for
> "hard_header_len" describes it as "Hardware header length". When the
> purpose of this field we should check whether the documentation should
> be updated to "Minimum hardware header length" -> that would mean the
> condition has to be a "len < hard_header_len" instead of a "len <=
> hard_header_len" (as it is now).
> 
> PS: I have also added the pppd maintainer (Paul Mackerras) to this
> thread because I think he should know about this issue (and he can
> probably provide more details if required).
> As a quick summary for him: linux  >= 3.19 rejects PADI packets when
> no service name is configured.
Any news on this? Users are complaining about this regression:
https://dev.openwrt.org/ticket/20707

- Felix


Re: [PATCH 1/3] net: dsa: Use devm_ prefixed allocations

2015-10-03 Thread Felix Fietkau
On 2015-10-02 15:30, Neil Armstrong wrote:
> On 10/02/2015 03:29 PM, Sergei Shtylyov wrote:
>> On 10/2/2015 1:48 PM, Neil Armstrong wrote:
>> 
>>> To simplify and prevent memory leakage when unbinding, use
>>> the devm_ memory allocation calls.
>>>
>>> Tested-by: Andrew Lunn 
>>> Tested-by: Florian Fainelli 
>>> Signed-off-by: Neil Armstrong 
>>> ---
>>>   net/dsa/dsa.c | 6 +++---
>>>   1 file changed, 3 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
>>> index c59fa5d..98f94c2 100644
>>> --- a/net/dsa/dsa.c
>>> +++ b/net/dsa/dsa.c
>>> @@ -305,7 +305,7 @@ static int dsa_switch_setup_one(struct dsa_switch *ds, 
>>> struct device *parent)
>>>   if (ret < 0)
>>>   goto out;
>>>
>>> -ds->slave_mii_bus = mdiobus_alloc();
>>> +ds->slave_mii_bus = devm_mdiobus_alloc(parent);
>>>   if (ds->slave_mii_bus == NULL) {
>>>   ret = -ENOMEM;
>>>   goto out;
>>> @@ -400,7 +400,7 @@ dsa_switch_setup(struct dsa_switch_tree *dst, int index,
>>>   /*
>>>* Allocate and initialise switch state.
>>>*/
>>> -ds = kzalloc(sizeof(*ds) + drv->priv_size, GFP_KERNEL);
>>> +ds = devm_kzalloc(parent, sizeof(*ds) + drv->priv_size, GFP_KERNEL);
>>>   if (ds == NULL)
>>>   return ERR_PTR(-ENOMEM);
>>>
>>> @@ -883,7 +883,7 @@ static int dsa_probe(struct platform_device *pdev)
>>>   goto out;
>>>   }
>>>
>>> -dst = kzalloc(sizeof(*dst), GFP_KERNEL);
>>> +dst = devm_kzalloc(&pdev->dev, sizeof(*dst), GFP_KERNEL);
>>>   if (dst == NULL) {
>>>   dev_put(dev);
>>>   ret = -ENOMEM;
>>>
>> 
>>Shouldn't you remove the correspoding kfree(), etc. calls?
>> 
>> MBR, Sergei
>> 
> The corresponding kfree() calls were all missing.
Not in the error handling path. mdiobus_alloc has a corresponding
mdiobus_free in the same function.
The ds kzalloc in dsa_switch_setup has a kfree in dsa_switch_setup_one.
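
A stripped-down illustration of the problem (the function and the forced
error value are made up):

#include <linux/device.h>
#include <linux/slab.h>

static int example_probe(struct device *dev)
{
	void *priv;
	int err;

	priv = devm_kzalloc(dev, 128, GFP_KERNEL);
	if (!priv)
		return -ENOMEM;

	err = -EIO;	/* pretend a later setup step failed */
	if (err) {
		/* kfree(priv);  <-- must be dropped after the devm_
		 * conversion, devres already frees it on probe failure
		 */
		return err;
	}

	return 0;
}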

- Felix


Re: [PATCH 1/3] net: dsa: Use devm_ prefixed allocations

2015-10-02 Thread Felix Fietkau
On 2015-10-02 12:48, Neil Armstrong wrote:
> To simplify and prevent memory leakage when unbinding, use
> the devm_ memory allocation calls.
> 
> Tested-by: Andrew Lunn 
> Tested-by: Florian Fainelli 
> Signed-off-by: Neil Armstrong 
I think you also need to get rid of the corresponding free calls in the
error path, otherwise it will probably crash at some point.

- Felix


[PATCH] pppoe: drop pppoe device in pppoe_unbind_sock_work

2015-05-09 Thread Felix Fietkau
After receiving a PADT and the socket is closed, user space will no
longer drop the reference to the pppoe device.
This leads to errors like this:

[  488.57] unregister_netdevice: waiting for eth0.2 to become free. Usage count = 2

Fixes: 287f3a943fe ("pppoe: Use workqueue to die properly when a PADT is received")
Signed-off-by: Felix Fietkau 
---
 drivers/net/ppp/pppoe.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ppp/pppoe.c b/drivers/net/ppp/pppoe.c
index ff059e1..6a544f2 100644
--- a/drivers/net/ppp/pppoe.c
+++ b/drivers/net/ppp/pppoe.c
@@ -462,6 +462,10 @@ static void pppoe_unbind_sock_work(struct work_struct 
*work)
struct sock *sk = sk_pppox(po);
 
lock_sock(sk);
+   if (po->pppoe_dev) {
+   dev_put(po->pppoe_dev);
+   po->pppoe_dev = NULL;
+   }
pppox_unbind_sock(sk);
release_sock(sk);
sock_put(sk);
-- 
2.2.2



Re: [PATCH][FIX 4.1] bgmac: fix requests for extra polling calls from NAPI

2015-04-23 Thread Felix Fietkau
On 2015-04-23 20:56, Rafał Miłecki wrote:
> After d75b1ade567f ("net: less interrupt masking in NAPI") polling
> function has to return whole budget when it wants NAPI to call it again.
> 
> Signed-off-by: Rafał Miłecki 
> Cc: Felix Fietkau 
Fixes: eb64e2923a886 ("bgmac: leave interrupts disabled as long as there is work to do")
Acked-by: Felix Fietkau 

- Felix