Re: [PATCH v3] scripts/link-vmlinux.sh: Add alias to duplicate symbols for kallsyms

2023-09-11 Thread Alexander Lobakin
From: Alessandro Carminati (Red Hat) 
Date: Mon, 28 Aug 2023 08:04:23 +

> From: Alessandro Carminati 
> 
> It is not uncommon for drivers or modules related to similar peripherals
> to have symbols with the exact same name.

[...]

> Changes from v2:
> - Alias tags are created by querying DWARF information from the vmlinux.
> - The filename + line number is normalized and appended to the original name.
> - The tag begins with '@' to indicate the symbol source.
> - Not a change, but worth mentioning, since the alias is added to the existing
>   list, the old duplicated name is preserved, and the livepatch way of dealing
>   with duplicates is maintained.
> - Acknowledging the existence of scenarios where inlined functions declared in
>   header files may result in multiple copies due to compiler behavior, though
>   it is not actionable as it does not pose an operational issue.
> - Highlighting a single exception where the same name refers to different
>   functions: the case of "compat_binfmt_elf.c," which directly includes
>   "binfmt_elf.c" producing identical function copies in two separate
>   modules.

Oh, I thought you managed to handle this in v3 since you didn't reply in
the previous thread...

> 
> sample from new v3
> 
>  ~ # cat /proc/kallsyms | grep gic_mask_irq
>  d0b03c04dae4 t gic_mask_irq
>  d0b03c04dae4 t gic_mask_irq@_drivers_irqchip_irq-gic_c_167
>  d0b03c050960 t gic_mask_irq
>  d0b03c050960 t gic_mask_irq@_drivers_irqchip_irq-gic-v3_c_404

BTW, why normalize them? Why not just

gic_mask_irq@drivers/irqchip/...

And why a line number? Line numbers break reproducible builds and also
make it harder to refer to a particular symbol by its path and name,
since we'd also have to pass its line number, which may change once
you add a debug print there, for example.
OTOH, there can't be two symbols with the same name within one file, so
just path + name would be enough. Or not?
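
Something like this, purely to illustrate the format I have in mind
(made-up output, not from an actual build):

 ~ # cat /proc/kallsyms | grep gic_mask_irq
 d0b03c04dae4 t gic_mask_irq
 d0b03c04dae4 t gic_mask_irq@drivers/irqchip/irq-gic.c
 d0b03c050960 t gic_mask_irq
 d0b03c050960 t gic_mask_irq@drivers/irqchip/irq-gic-v3.c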

(sorry if some of this was already discussed previously)

[...]

Thanks,
Olek


Re: [PATCH v1 1/1] bitops: Share BYTES_TO_BITS() for everyone

2023-09-11 Thread Alexander Lobakin
From: Yury Norov 
Date: Sun, 10 Sep 2023 07:07:16 -0700

> On Wed, Sep 06, 2023 at 05:54:26PM +0300, Andy Shevchenko wrote:
>> On Wed, Sep 06, 2023 at 04:40:39PM +0200, Alexander Lobakin wrote:
>>> From: Andy Shevchenko 
>>> Date: Thu, 31 Aug 2023 16:21:30 +0300
>>>> On Fri, Aug 25, 2023 at 04:49:07PM +0200, Alexander Lobakin wrote:
>>>>> From: Andy Shevchenko 
>>>>> Date: Thu, 24 Aug 2023 15:37:28 +0300
>>>>>
>>>>>> It may be new callers for the same macro, share it.
>>>>>>
>>>>>> Note, it's unknown why it's represented in the current form instead of
>>>>>> simple multiplication and commit 1ff511e35ed8 ("tracing/kprobes: Add
>>>>>> bitfield type") doesn't explain that neither. Let leave it as is and
>>>>>> we may improve it in the future.
>>>>>
>>>>> Maybe symmetrical change in tools/ like I did[0] an aeon ago?
>>>>
>>>> Hmm... Why can't you simply upstream your version? It seems better than 
>>>> mine.
>>>
>>> It was a part of the Netlink bigint API which is a bit on hold for now
>>> (I needed this macro available treewide).
>>> But I can send it as standalone if you're fine with that.
>>
>> I'm fine. Yury?
> 
> Do we have opencoded BYTES_TO_BITS() somewhere else? If so, it should be
> fixed in the same series.

Treewide -- a ton.
We could add it so that devs could start using it and stop open-coding :D

> 
> Regarding implementation, the current:
> 
> #define BYTES_TO_BITS(nb)  ((BITS_PER_LONG * (nb)) / sizeof(long))
> 
> looks weird. Maybe there are some special considerations in a tracing
> subsystem to make it like this, but as per Masami's email - there's
> not. 
> 
> For a general purpose I'd suggest a simpler:
> #define BYTES_TO_BITS(nb)  ((nb) * BITS_PER_BYTE)

I also didn't notice anything that would require logic more complex
than this one. It would probably make more sense to define it that
way when moving it.
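
Just to illustrate the point, here's a tiny userspace sketch (not
kernel code; it only assumes BITS_PER_LONG == CHAR_BIT * sizeof(long),
which holds on all supported platforms) showing that the two
definitions agree:

	#include <assert.h>
	#include <limits.h>

	#define BITS_PER_LONG		(CHAR_BIT * sizeof(long))
	#define BITS_PER_BYTE		CHAR_BIT

	/* current variant from tracing/kprobes */
	#define BYTES_TO_BITS_OLD(nb)	((BITS_PER_LONG * (nb)) / sizeof(long))
	/* proposed simpler variant */
	#define BYTES_TO_BITS_NEW(nb)	((nb) * BITS_PER_BYTE)

	int main(void)
	{
		for (unsigned int nb = 0; nb < 4096; nb++)
			assert(BYTES_TO_BITS_OLD(nb) == BYTES_TO_BITS_NEW(nb));

		return 0;
	}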

> 
> Thanks,
> Yury

Thanks,
Olek


[PATCH v2 net] gro: fix napi_gro_frags() Fast GRO breakage due to IP alignment check

2021-04-19 Thread Alexander Lobakin
Commit 38ec4944b593 ("gro: ensure frag0 meets IP header alignment")
did the right thing, but missed the fact that napi_gro_frags() logic
calls skb_gro_reset_offset() *before* pulling the Ethernet header
into the skb linear space.
As a result, the introduced check for the frag0 address being aligned
to 4 always fails for it, as the Ethernet header is obviously 14 bytes
long, and with NET_IP_ALIGN its start is not aligned to 4.

Fix this by adding @nhoff argument to skb_gro_reset_offset() which
tells if an IP header is placed right at the start of frag0 or not.
This restores Fast GRO for napi_gro_frags() that became very slow
after the mentioned commit, and preserves the introduced check to
avoid silent unaligned accesses.
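
To spell out the alignment math (an illustration only, assuming
NET_IP_ALIGN == 2 and a plain 14-byte Ethernet header, i.e.
nhoff == ETH_HLEN):

	skb_frag_off(frag0) == 2	/* NET_IP_ALIGN */
	 2       & 3 == 2		/* old check fails -> no Fast GRO */
	(2 + 14) & 3 == 0		/* new check: IP header is 4-aligned */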

From v1 [0]:
 - inline tiny skb_gro_reset_offset() to let the code be optimized
   more efficiently (esp. for the !NET_IP_ALIGN case) (Eric);
 - pull in Reviewed-by from Eric.

[0] https://lore.kernel.org/netdev/20210418114200.5839-1-aloba...@pm.me

Fixes: 38ec4944b593 ("gro: ensure frag0 meets IP header alignment")
Reviewed-by: Eric Dumazet 
Signed-off-by: Alexander Lobakin 
---
 net/core/dev.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 1f79b9aa9a3f..15fe36332fb8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5914,7 +5914,7 @@ static struct list_head *gro_list_prepare(struct 
napi_struct *napi,
return head;
 }

-static void skb_gro_reset_offset(struct sk_buff *skb)
+static inline void skb_gro_reset_offset(struct sk_buff *skb, u32 nhoff)
 {
const struct skb_shared_info *pinfo = skb_shinfo(skb);
const struct skb_shared_info *pinfo = skb_shinfo(skb);
const skb_frag_t *frag0 = &pinfo->frags[0];
@@ -5925,7 +5925,7 @@ static void skb_gro_reset_offset(struct sk_buff *skb)

if (!skb_headlen(skb) && pinfo->nr_frags &&
!PageHighMem(skb_frag_page(frag0)) &&
-   (!NET_IP_ALIGN || !(skb_frag_off(frag0) & 3))) {
+   (!NET_IP_ALIGN || !((skb_frag_off(frag0) + nhoff) & 3))) {
NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
NAPI_GRO_CB(skb)->frag0_len = min_t(unsigned int,
skb_frag_size(frag0),
@@ -6143,7 +6143,7 @@ gro_result_t napi_gro_receive(struct napi_struct *napi, 
struct sk_buff *skb)
skb_mark_napi_id(skb, napi);
trace_napi_gro_receive_entry(skb);

-   skb_gro_reset_offset(skb);
+   skb_gro_reset_offset(skb, 0);

ret = napi_skb_finish(napi, skb, dev_gro_receive(napi, skb));
trace_napi_gro_receive_exit(ret);
@@ -6232,7 +6232,7 @@ static struct sk_buff *napi_frags_skb(struct napi_struct 
*napi)
napi->skb = NULL;

skb_reset_mac_header(skb);
-   skb_gro_reset_offset(skb);
+   skb_gro_reset_offset(skb, hlen);

if (unlikely(skb_gro_header_hard(skb, hlen))) {
eth = skb_gro_header_slow(skb, hlen, 0);
--
2.31.1




Re: [PATCH net] gro: fix napi_gro_frags() Fast GRO breakage due to IP alignment check

2021-04-19 Thread Alexander Lobakin
From: Eric Dumazet 
Date: Mon, 19 Apr 2021 13:05:16 +0200

> On Sun, Apr 18, 2021 at 1:43 PM Alexander Lobakin  wrote:
> >
> > Commit 38ec4944b593 ("gro: ensure frag0 meets IP header alignment")
> > did the right thing, but missed the fact that napi_gro_frags() logics
> > calls for skb_gro_reset_offset() *before* pulling Ethernet header
> > to the skb linear space.
> > That said, the introduced check for frag0 address being aligned to 4
> > always fails for it as Ethernet header is obviously 14 bytes long,
> > and in case with NET_IP_ALIGN its start is not aligned to 4.
> >
> > Fix this by adding @nhoff argument to skb_gro_reset_offset() which
> > tells if an IP header is placed right at the start of frag0 or not.
> > This restores Fast GRO for napi_gro_frags() that became very slow
> > after the mentioned commit, and preserves the introduced check to
> > avoid silent unaligned accesses.
> >
> > Fixes: 38ec4944b593 ("gro: ensure frag0 meets IP header alignment")
> > Signed-off-by: Alexander Lobakin 
> > ---
> >  net/core/dev.c | 8 
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 1f79b9aa9a3f..965d5f9b6fee 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -5914,7 +5914,7 @@ static struct list_head *gro_list_prepare(struct 
> > napi_struct *napi,
> > return head;
> >  }
> >
> > -static void skb_gro_reset_offset(struct sk_buff *skb)
> > +static void skb_gro_reset_offset(struct sk_buff *skb, u32 nhoff)
> >  {
> > const struct skb_shared_info *pinfo = skb_shinfo(skb);
> > const skb_frag_t *frag0 = &pinfo->frags[0];
> > @@ -5925,7 +5925,7 @@ static void skb_gro_reset_offset(struct sk_buff *skb)
> >
> > if (!skb_headlen(skb) && pinfo->nr_frags &&
> > !PageHighMem(skb_frag_page(frag0)) &&
> > -   (!NET_IP_ALIGN || !(skb_frag_off(frag0) & 3))) {
> > +   (!NET_IP_ALIGN || !((skb_frag_off(frag0) + nhoff) & 3))) {
> > NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
> > NAPI_GRO_CB(skb)->frag0_len = min_t(unsigned int,
> > skb_frag_size(frag0),
> > @@ -6143,7 +6143,7 @@ gro_result_t napi_gro_receive(struct napi_struct 
> > *napi, struct sk_buff *skb)
> > skb_mark_napi_id(skb, napi);
> > trace_napi_gro_receive_entry(skb);
> >
> > -   skb_gro_reset_offset(skb);
> > +   skb_gro_reset_offset(skb, 0);
> >
> > ret = napi_skb_finish(napi, skb, dev_gro_receive(napi, skb));
> > trace_napi_gro_receive_exit(ret);
> > @@ -6232,7 +6232,7 @@ static struct sk_buff *napi_frags_skb(struct 
> > napi_struct *napi)
> > napi->skb = NULL;
> >
> > skb_reset_mac_header(skb);
> > -   skb_gro_reset_offset(skb);
> > +   skb_gro_reset_offset(skb, hlen);
> >
> > if (unlikely(skb_gro_header_hard(skb, hlen))) {
> > eth = skb_gro_header_slow(skb, hlen, 0);
> > --
> > 2.31.1
> >
> >
>
> Good catch, thanks.
>
> This has the unfortunate effect of increasing code length on x86,
> maybe we should have an exception to
> normal rules so that skb_gro_reset_offset() is inlined.

Agree. To mitigate this code growth, we can either go with NET_IP_ALIGN
ifdeffery or just inline skb_gro_reset_offset(). The function is
tiny itself, so I don't see a reason not to do this.
Will send v2 in a moment.

> Reviewed-by: Eric Dumazet 

Thanks,
Al



Re: [PATCH v2 bpf-next 0/2] xsk: introduce generic almost-zerocopy xmit

2021-04-18 Thread Alexander Lobakin
From: Magnus Karlsson 
Date: Tue, 13 Apr 2021 09:14:02 +0200

Hi!

I've finally settled into a kinda comfy setup after moving to another
country and can finally continue working on patches and stuff.

> On Tue, Apr 13, 2021 at 3:49 AM Xuan Zhuo  wrote:
> >
> > On Mon, 12 Apr 2021 16:13:12 +0200, Magnus Karlsson 
> >  wrote:
> > > On Wed, Mar 31, 2021 at 2:27 PM Alexander Lobakin  wrote:
> > > >
> > > > This series is based on the exceptional generic zerocopy xmit logics
> > > > initially introduced by Xuan Zhuo. It extends it the way that it
> > > > could cover all the sane drivers, not only the ones that are capable
> > > > of xmitting skbs with no linear space.
> > > >
> > > > The first patch is a random while-we-are-here improvement over
> > > > full-copy path, and the second is the main course. See the individual
> > > > commit messages for the details.
> > > >
> > > > The original (full-zerocopy) path is still here and still generally
> > > > faster, but for now it seems like virtio_net will remain the only
> > > > user of it, at least for a considerable period of time.
> > > >
> > > > From v1 [0]:
> > > >  - don't add a whole SMP_CACHE_BYTES because of only two bytes
> > > >(NET_IP_ALIGN);
> > > >  - switch to zerocopy if the frame is 129 bytes or longer, not 128.
> > > >128 still fit to kmalloc-512, while a zerocopy skb is always
> > > >kmalloc-1024 -> can potentially be slower on this frame size.
> > > >
> > > > [0] 
> > > > https://lore.kernel.org/netdev/20210330231528.546284-1-aloba...@pm.me
> > > >
> > > > Alexander Lobakin (2):
> > > >   xsk: speed-up generic full-copy xmit
> > >
> > > I took both your patches for a spin on my machine and for the first
> > > one I do see a small but consistent drop in performance. I thought it
> > > would go the other way, but it does not so let us put this one on the
> > > shelf for now.

This is kinda strange, as the solution is pretty straightforward.
But sure, if the performance dropped with this one, it should not
be taken.
I might have a look at it later.

> > > >   xsk: introduce generic almost-zerocopy xmit
> > >
> > > This one wreaked havoc on my machine ;-). The performance dropped with
> > > 75% for packets larger than 128 bytes when the new scheme kicks in.
> > > Checking with perf top, it seems that we spend much more time
> > > executing the sendmsg syscall. Analyzing some more:
> > >
> > > $ sudo bpftrace -e 'kprobe:__sys_sendto { @calls = @calls + 1; }
> > > interval:s:1 {printf("calls/sec: %d\n", @calls); @calls = 0;}'
> > > Attaching 2 probes...
> > > calls/sec: 1539509 with your patch compared to
> > >
> > > calls/sec: 105796 without your patch
> > >
> > > The application spends a lot of more time trying to get the kernel to
> > > send new packets, but the kernel replies with "have not completed the
> > > outstanding ones, so come back later" = EAGAIN. Seems like the
> > > transmission takes longer when the skbs have fragments, but I have not
> > > examined this any further. Did you get a speed-up?
> >
> > Regarding this solution, I actually tested it on my mlx5 network card, but 
> > the
> > performance was severely degraded, so I did not continue this solution 
> > later. I
> > guess it might have something to do with the physical network card. We can 
> > try
> > other network cards.
>
> I tried it on a third card and got a 40% degradation, so let us scrap
> this idea. It should stay optional as it is today as the (software)
> drivers that benefit from this can turn it on explicitly.

Thank you guys a lot for the testing!

I think the main reason is the DMA mapping of one additional frag
(14 bytes of MAC header, which is excessive). It can take a lot of
CPU cycles, especially when the device is behind an IOMMU, so it
seems like memcpying is faster here.

Moreover, if Xuan tested it as one of the steps towards his
full-zerocopy and found it to be a bad idea, this should not
go further.
So I'm burying this.

> > links: https://www.spinics.net/lists/netdev/msg710918.html
> >
> > Thanks.
> >
> > >
> > > >  net/xdp/xsk.c | 32 ++--
> > > >  1 file changed, 22 insertions(+), 10 deletions(-)
> > > >
> > > > --
> > > > Well, this is untested. I currently don't have an access to my setup
> > > > and is bound by moving to another country, but as I don't know for
> > > > sure at the moment when I'll get back to work on the kernel next time,
> > > > I found it worthy to publish this now -- if any further changes will
> > > > be required when I already will be out-of-sight, maybe someone could
> > > > carry on to make a another revision and so on (I'm still here for any
> > > > questions, comments, reviews and improvements till the end of this
> > > > week).
> > > > But this *should* work with all the sane drivers. If a particular
> > > > one won't handle this, it's likely ill. Any tests are highly
> > > > appreciated. Thanks!
> > > > --
> > > > 2.31.1

Thanks,
Al



[PATCH net] gro: fix napi_gro_frags() Fast GRO breakage due to IP alignment check

2021-04-18 Thread Alexander Lobakin
Commit 38ec4944b593 ("gro: ensure frag0 meets IP header alignment")
did the right thing, but missed the fact that napi_gro_frags() logic
calls skb_gro_reset_offset() *before* pulling the Ethernet header
into the skb linear space.
As a result, the introduced check for the frag0 address being aligned
to 4 always fails for it, as the Ethernet header is obviously 14 bytes
long, and with NET_IP_ALIGN its start is not aligned to 4.

Fix this by adding @nhoff argument to skb_gro_reset_offset() which
tells if an IP header is placed right at the start of frag0 or not.
This restores Fast GRO for napi_gro_frags() that became very slow
after the mentioned commit, and preserves the introduced check to
avoid silent unaligned accesses.

Fixes: 38ec4944b593 ("gro: ensure frag0 meets IP header alignment")
Signed-off-by: Alexander Lobakin 
---
 net/core/dev.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 1f79b9aa9a3f..965d5f9b6fee 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5914,7 +5914,7 @@ static struct list_head *gro_list_prepare(struct 
napi_struct *napi,
return head;
 }

-static void skb_gro_reset_offset(struct sk_buff *skb)
+static void skb_gro_reset_offset(struct sk_buff *skb, u32 nhoff)
 {
const struct skb_shared_info *pinfo = skb_shinfo(skb);
const struct skb_shared_info *pinfo = skb_shinfo(skb);
const skb_frag_t *frag0 = &pinfo->frags[0];
@@ -5925,7 +5925,7 @@ static void skb_gro_reset_offset(struct sk_buff *skb)

if (!skb_headlen(skb) && pinfo->nr_frags &&
!PageHighMem(skb_frag_page(frag0)) &&
-   (!NET_IP_ALIGN || !(skb_frag_off(frag0) & 3))) {
+   (!NET_IP_ALIGN || !((skb_frag_off(frag0) + nhoff) & 3))) {
NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
NAPI_GRO_CB(skb)->frag0_len = min_t(unsigned int,
skb_frag_size(frag0),
@@ -6143,7 +6143,7 @@ gro_result_t napi_gro_receive(struct napi_struct *napi, 
struct sk_buff *skb)
skb_mark_napi_id(skb, napi);
trace_napi_gro_receive_entry(skb);

-   skb_gro_reset_offset(skb);
+   skb_gro_reset_offset(skb, 0);

ret = napi_skb_finish(napi, skb, dev_gro_receive(napi, skb));
trace_napi_gro_receive_exit(ret);
@@ -6232,7 +6232,7 @@ static struct sk_buff *napi_frags_skb(struct napi_struct 
*napi)
napi->skb = NULL;

skb_reset_mac_header(skb);
-   skb_gro_reset_offset(skb);
+   skb_gro_reset_offset(skb, hlen);

if (unlikely(skb_gro_header_hard(skb, hlen))) {
eth = skb_gro_header_slow(skb, hlen, 0);
--
2.31.1




Re: [PATCH] kbuild: merge module sections under CONFIG_LD_DEAD_CODE_DATA_ELIMINATION too

2021-04-06 Thread Alexander Lobakin
On Friday, 2 April 2021, 18:09, Sami Tolvanen wrote:

> On Fri, Apr 2, 2021 at 5:40 AM Alexander Lobakin aloba...@pm.me wrote:
>
> > When building with CONFIG_LD_DEAD_CODE_DATA_ELIMINATION,
> > -fdata-sections and -ffunction-sections are being enabled by the
> > top-level Makefile, and module section merging is also needed.
> > Expand the ifdef (and the comment block) to cover that case too.
> > Fixes: 6a3193cdd5e5 ("kbuild: lto: Merge module sections if and only if 
> > CONFIG_LTO_CLANG is enabled")
>
> Wouldn't this trigger the ld.bfd bug described in the commit message
> when LD_DEAD_CODE_DATA_ELIMINATION is enabled? LTO_CLANG always uses
> LLD, so it won't have this issue.

LD_DEAD_CODE_DATA_ELIMINATION is marked "EXPERIMENTAL" in the config
prompt, and arches have to opt in to HAS_LD_DEAD_CODE_DATA_ELIMINATION
to get access to it (only a few do). This should be relatively safe.
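
For reference, the top-level Makefile logic in question looks roughly
like this (paraphrased, not a verbatim quote of the current tree):

	ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
	KBUILD_CFLAGS_KERNEL += -ffunction-sections -fdata-sections
	LDFLAGS_vmlinux += --gc-sections
	endif

so modules built with this option end up with split sections just like
with CONFIG_LTO_CLANG, which is why the same merging is needed.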

> Sami

Thanks,
Al


[PATCH] kbuild: merge module sections under CONFIG_LD_DEAD_CODE_DATA_ELIMINATION too

2021-04-02 Thread Alexander Lobakin
When building with CONFIG_LD_DEAD_CODE_DATA_ELIMINATION,
-fdata-sections and -ffunction-sections are being enabled by the
top-level Makefile, and module section merging is also needed.
Expand the ifdef (and the comment block) to cover that case too.

Fixes: 6a3193cdd5e5 ("kbuild: lto: Merge module sections if and only if CONFIG_LTO_CLANG is enabled")
Signed-off-by: Alexander Lobakin 
---
 scripts/module.lds.S | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/scripts/module.lds.S b/scripts/module.lds.S
index 2c52535f9b56..d6bbdfc55e08 100644
--- a/scripts/module.lds.S
+++ b/scripts/module.lds.S
@@ -20,11 +20,14 @@ SECTIONS {

__patchable_function_entries : { *(__patchable_function_entries) }

-#ifdef CONFIG_LTO_CLANG
+#if defined(CONFIG_LD_DEAD_CODE_DATA_ELIMINATION) || defined(CONFIG_LTO_CLANG)
/*
 * With CONFIG_LTO_CLANG, LLD always enables -fdata-sections and
-* -ffunction-sections, which increases the size of the final module.
-* Merge the split sections in the final binary.
+* -ffunction-sections. With CONFIG_LD_DEAD_CODE_DATA_ELIMINATION,
+* -fdata-sections and -ffunction-sections are being enabled by
+* the top-level Makefile.
+* This increases the size of the final module. Merge the split
+* sections in the final binary.
 */
.bss : {
*(.bss .bss.[0-9a-zA-Z_]*)
--
2.31.1




Re: [PATCH bpf-next 0/2] xsk: introduce generic almost-zerocopy xmit

2021-03-31 Thread Alexander Lobakin
From: Magnus Karlsson 
Date: Wed, 31 Mar 2021 11:44:45 +0200

> On Wed, Mar 31, 2021 at 1:17 AM Alexander Lobakin  wrote:
> >
> > This series is based on the exceptional generic zerocopy xmit logics
> > initially introduced by Xuan Zhuo. It extends it the way that it
> > could cover all the sane drivers, not only the ones that are capable
> > of xmitting skbs with no linear space.
> >
> > The first patch is a random while-we-are-here improvement over
> > full-copy path, and the second is the main course. See the individual
> > commit messages for the details.
> >
> > The original (full-zerocopy) path is still here and still generally
> > faster, but for now it seems like virtio_net will remain the only
> > user of it, at least for a considerable period of time.
> >
> > Alexander Lobakin (2):
> >   xsk: speed-up generic full-copy xmit
> >   xsk: introduce generic almost-zerocopy xmit
> >
> >  net/xdp/xsk.c | 33 +++--
> >  1 file changed, 23 insertions(+), 10 deletions(-)
> >
> > --
> > Well, this is untested. I currently don't have an access to my setup
> > and is bound by moving to another country, but as I don't know for
> > sure at the moment when I'll get back to work on the kernel next time,
> > I found it worthy to publish this now -- if any further changes will
> > be required when I already will be out-of-sight, maybe someone could
> > carry on to make a another revision and so on (I'm still here for any
> > questions, comments, reviews and improvements till the end of this
> > week).
> > But this *should* work with all the sane drivers. If a particular
> > one won't handle this, it's likely ill.
>
> Thanks Alexander. I will take your patches for a spin on a couple of
> NICs and get back to you, though it will be next week due to holidays
> where I am based.

Thanks a lot! Any tests will be much appreciated.
I'll publish v2 in a moment though, want to drop a couple of
micro-optimizations.

> > --
> > 2.31.1

Al



[PATCH v2 bpf-next 2/2] xsk: introduce generic almost-zerocopy xmit

2021-03-31 Thread Alexander Lobakin
The reasons behind IFF_TX_SKB_NO_LINEAR are:
 - most drivers expect skb with the linear space;
 - most drivers expect hard header in the linear space;
 - many drivers need some headroom to insert custom headers
   and/or pull headers from frags (pskb_may_pull() etc.).

With some bits of overhead, we can satisfy all of this without
inducing full buffer data copy.

Now frames that are bigger than 128 bytes (to mitigate allocation
overhead) are also built using the zerocopy path (if the device and
driver support S/G xmit, which is almost always true).
We allocate 256* additional bytes for skb linear space and pull hard
header there (aligning its end by 16 bytes for platforms with
NET_IP_ALIGN). The rest of the buffer data is just pinned as frags.
A room of at least 240 bytes is left for any driver needs.

We could just pass the buffer to eth_get_headlen() to minimize
allocation overhead and be able to copy all the headers into the
linear space, but the flow dissection procedure tends to be more
expensive than the current approach.

The IFF_TX_SKB_NO_LINEAR path remains unchanged and is still relevant
and generally faster.

* The value of 256 bytes is kinda "magic": it can be found in lots
  of drivers and core code, and it is believed that 256 bytes are
  enough to store the headers of any frame.

Cc: Xuan Zhuo 
Signed-off-by: Alexander Lobakin 
---
 net/xdp/xsk.c | 25 +
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 41f8f21b3348..1d241f87422c 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -445,6 +445,9 @@ static void xsk_destruct_skb(struct sk_buff *skb)
sock_wfree(skb);
 }

+#define XSK_SKB_HEADLEN 256
+#define XSK_COPY_THRESHOLD (XSK_SKB_HEADLEN / 2)
+
 static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
  struct xdp_desc *desc)
 {
@@ -452,13 +455,21 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct 
xdp_sock *xs,
u32 hr, len, ts, offset, copy, copied;
struct sk_buff *skb;
struct page *page;
+   bool need_pull;
void *buffer;
int err, i;
u64 addr;

hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom));
+   len = hr;
+
+   need_pull = !(xs->dev->priv_flags & IFF_TX_SKB_NO_LINEAR);
+   if (need_pull) {
+   len += XSK_SKB_HEADLEN;
+   hr += NET_IP_ALIGN;
+   }

-   skb = sock_alloc_send_skb(&xs->sk, hr, 1, &err);
+   skb = sock_alloc_send_skb(&xs->sk, len, 1, &err);
if (unlikely(!skb))
return ERR_PTR(err);

@@ -488,6 +499,11 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct 
xdp_sock *xs,
skb->data_len += len;
skb->truesize += ts;

+   if (need_pull && unlikely(!__pskb_pull_tail(skb, ETH_HLEN))) {
+   kfree_skb(skb);
+   return ERR_PTR(-ENOMEM);
+   }
+
refcount_add(ts, &xs->sk.sk_wmem_alloc);

return skb;
@@ -498,19 +514,20 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 {
struct net_device *dev = xs->dev;
struct sk_buff *skb;
+   u32 len = desc->len;

-   if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
+   if ((dev->priv_flags & IFF_TX_SKB_NO_LINEAR) ||
+   (len > XSK_COPY_THRESHOLD && likely(dev->features & NETIF_F_SG))) {
skb = xsk_build_skb_zerocopy(xs, desc);
if (IS_ERR(skb))
return skb;
} else {
-   u32 hr, tr, len;
void *buffer;
+   u32 hr, tr;
int err;

hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom));
tr = dev->needed_tailroom;
-   len = desc->len;

skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err);
if (unlikely(!skb))
--
2.31.1




[PATCH v2 bpf-next 1/2] xsk: speed-up generic full-copy xmit

2021-03-31 Thread Alexander Lobakin
There are a few things that are known for sure at the time of
copying:
 - allocated skb is fully linear;
 - its linear space is long enough to hold the full buffer data.

So, the out-of-line skb_put(), skb_store_bits() and the check for
a retcode can be replaced with plain memcpy(__skb_put()) with
no loss.
Also align memcpy()'s len to sizeof(long) to improve its performance.

Signed-off-by: Alexander Lobakin 
---
 net/xdp/xsk.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index a71ed664da0a..41f8f21b3348 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -517,14 +517,9 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
return ERR_PTR(err);

skb_reserve(skb, hr);
-   skb_put(skb, len);

buffer = xsk_buff_raw_get_data(xs->pool, desc->addr);
-   err = skb_store_bits(skb, 0, buffer, len);
-   if (unlikely(err)) {
-   kfree_skb(skb);
-   return ERR_PTR(err);
-   }
+   memcpy(__skb_put(skb, len), buffer, ALIGN(len, sizeof(long)));
}

skb->dev = dev;
--
2.31.1




[PATCH v2 bpf-next 0/2] xsk: introduce generic almost-zerocopy xmit

2021-03-31 Thread Alexander Lobakin
This series is based on the exceptional generic zerocopy xmit logic
initially introduced by Xuan Zhuo. It extends it in such a way that it
can cover all the sane drivers, not only the ones that are capable
of xmitting skbs with no linear space.

The first patch is a random while-we-are-here improvement over
full-copy path, and the second is the main course. See the individual
commit messages for the details.

The original (full-zerocopy) path is still here and still generally
faster, but for now it seems like virtio_net will remain the only
user of it, at least for a considerable period of time.

From v1 [0]:
 - don't add a whole SMP_CACHE_BYTES because of only two bytes
   (NET_IP_ALIGN);
 - switch to zerocopy if the frame is 129 bytes or longer, not 128.
   128 still fits into kmalloc-512, while a zerocopy skb is always
   kmalloc-1024 -> can potentially be slower on this frame size
   (see the size-class sketch below).

[0] https://lore.kernel.org/netdev/20210330231528.546284-1-aloba...@pm.me
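
A rough back-of-the-envelope version of the size-class argument above
(a sketch only, assuming a 64-bit platform with 64-byte cachelines,
NET_SKB_PAD == 64, needed_headroom and needed_tailroom of 0, and
sizeof(struct skb_shared_info) of ~320 bytes; exact numbers depend on
the config):

	/* full-copy path, 128-byte frame */
	SKB_DATA_ALIGN(64 + 128) + SKB_DATA_ALIGN(320) == 512      /* kmalloc-512  */

	/* almost-zerocopy path, any frame */
	SKB_DATA_ALIGN(64 + 2 + 256) + SKB_DATA_ALIGN(320) == 704  /* kmalloc-1024 */

so for frames of 128 bytes or shorter, the full copy can end up in a
smaller slab object and thus be the cheaper option.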

Alexander Lobakin (2):
  xsk: speed-up generic full-copy xmit
  xsk: introduce generic almost-zerocopy xmit

 net/xdp/xsk.c | 32 ++--
 1 file changed, 22 insertions(+), 10 deletions(-)

--
Well, this is untested. I currently don't have access to my setup
and am tied up with moving to another country, and as I don't know
for sure when I'll get back to working on the kernel, I found it
worth publishing this now -- if any further changes are required
while I'm out of sight, maybe someone could carry on and make
another revision and so on (I'm still here for any questions,
comments, reviews and improvements till the end of this week).
But this *should* work with all the sane drivers. If a particular
one won't handle this, it's likely ill. Any tests are highly
appreciated. Thanks!
--
2.31.1




[PATCH bpf-next 2/2] xsk: introduce generic almost-zerocopy xmit

2021-03-30 Thread Alexander Lobakin
The reasons behind IFF_TX_SKB_NO_LINEAR are:
 - most drivers expect skb with the linear space;
 - most drivers expect hard header in the linear space;
 - many drivers need some headroom to insert custom headers
   and/or pull headers from frags (pskb_may_pull() etc.).

With some bits of overhead, we can satisfy all of this without
inducing full buffer data copy.

Now frames that are no smaller than 128 bytes (to mitigate allocation
overhead) are also built using the zerocopy path (if the device and
driver support S/G xmit, which is almost always true).
We allocate 256* additional bytes for skb linear space and pull hard
header there (aligning its end by 16 bytes for platforms with
NET_IP_ALIGN). The rest of the buffer data is just pinned as frags.
A room of at least 242 bytes is left for any driver needs.

We could just pass the buffer to eth_get_headlen() to minimize
allocation overhead and be able to copy all the headers into the
linear space, but the flow dissection procedure tends to be more
expensive than the current approach.

The IFF_TX_SKB_NO_LINEAR path remains unchanged and is still relevant
and generally faster.

* The value of 256 bytes is kinda "magic": it can be found in lots
  of drivers and core code, and it is believed that 256 bytes are
  enough to store the headers of any frame.

Cc: Xuan Zhuo 
Signed-off-by: Alexander Lobakin 
---
 net/xdp/xsk.c | 26 ++
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 41f8f21b3348..090ff9c096a3 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -445,6 +445,9 @@ static void xsk_destruct_skb(struct sk_buff *skb)
sock_wfree(skb);
 }

+#define XSK_SKB_HEADLEN 256
+#define XSK_COPY_THRESHOLD (XSK_SKB_HEADLEN / 2)
+
 static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
  struct xdp_desc *desc)
 {
@@ -452,13 +455,22 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct 
xdp_sock *xs,
u32 hr, len, ts, offset, copy, copied;
struct sk_buff *skb;
struct page *page;
+   bool need_pull;
void *buffer;
int err, i;
u64 addr;

hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom));
+   len = hr;
+
+   need_pull = !(xs->dev->priv_flags & IFF_TX_SKB_NO_LINEAR);
+   if (need_pull) {
+   len += XSK_SKB_HEADLEN;
+   len += NET_IP_ALIGN;
+   hr += NET_IP_ALIGN;
+   }

-   skb = sock_alloc_send_skb(&xs->sk, hr, 1, &err);
+   skb = sock_alloc_send_skb(&xs->sk, len, 1, &err);
if (unlikely(!skb))
return ERR_PTR(err);

@@ -488,6 +500,11 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct 
xdp_sock *xs,
skb->data_len += len;
skb->truesize += ts;

+   if (need_pull && unlikely(!__pskb_pull_tail(skb, ETH_HLEN))) {
+   kfree_skb(skb);
+   return ERR_PTR(-ENOMEM);
+   }
+
refcount_add(ts, &xs->sk.sk_wmem_alloc);

return skb;
@@ -498,19 +515,20 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 {
struct net_device *dev = xs->dev;
struct sk_buff *skb;
+   u32 len = desc->len;

-   if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
+   if ((dev->priv_flags & IFF_TX_SKB_NO_LINEAR) ||
+   (len >= XSK_COPY_THRESHOLD && likely(dev->features & NETIF_F_SG))) {
skb = xsk_build_skb_zerocopy(xs, desc);
if (IS_ERR(skb))
return skb;
} else {
-   u32 hr, tr, len;
void *buffer;
+   u32 hr, tr;
int err;

hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom));
tr = dev->needed_tailroom;
-   len = desc->len;

skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err);
if (unlikely(!skb))
--
2.31.1




[PATCH bpf-next 0/2] xsk: introduce generic almost-zerocopy xmit

2021-03-30 Thread Alexander Lobakin
This series is based on the exceptional generic zerocopy xmit logic
initially introduced by Xuan Zhuo. It extends it in such a way that it
can cover all the sane drivers, not only the ones that are capable
of xmitting skbs with no linear space.

The first patch is a random while-we-are-here improvement over
full-copy path, and the second is the main course. See the individual
commit messages for the details.

The original (full-zerocopy) path is still here and still generally
faster, but for now it seems like virtio_net will remain the only
user of it, at least for a considerable period of time.

Alexander Lobakin (2):
  xsk: speed-up generic full-copy xmit
  xsk: introduce generic almost-zerocopy xmit

 net/xdp/xsk.c | 33 +++--
 1 file changed, 23 insertions(+), 10 deletions(-)

--
Well, this is untested. I currently don't have access to my setup
and am tied up with moving to another country, and as I don't know
for sure when I'll get back to working on the kernel, I found it
worth publishing this now -- if any further changes are required
while I'm out of sight, maybe someone could carry on and make
another revision and so on (I'm still here for any questions,
comments, reviews and improvements till the end of this week).
But this *should* work with all the sane drivers. If a particular
one won't handle this, it's likely ill.
--
2.31.1




[PATCH bpf-next 1/2] xsk: speed-up generic full-copy xmit

2021-03-30 Thread Alexander Lobakin
There are a few things that are known for sure at the time of
copying:
 - allocated skb is fully linear;
 - its linear space is long enough to hold the full buffer data.

So, the out-of-line skb_put(), skb_store_bits() and the check for
a retcode can be replaced with plain memcpy(__skb_put()) with
no loss.
Also align memcpy()'s len to sizeof(long) to improve its performance.

Signed-off-by: Alexander Lobakin 
---
 net/xdp/xsk.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index a71ed664da0a..41f8f21b3348 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -517,14 +517,9 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
return ERR_PTR(err);

skb_reserve(skb, hr);
-   skb_put(skb, len);

buffer = xsk_buff_raw_get_data(xs->pool, desc->addr);
-   err = skb_store_bits(skb, 0, buffer, len);
-   if (unlikely(err)) {
-   kfree_skb(skb);
-   return ERR_PTR(err);
-   }
+   memcpy(__skb_put(skb, len), buffer, ALIGN(len, sizeof(long)));
}

skb->dev = dev;
--
2.31.1




Re: [PATCH 9/9] net: page_pool: use alloc_pages_bulk in refill code path

2021-03-25 Thread Alexander Lobakin
From: Mel Gorman 
Date: Thu, 25 Mar 2021 11:42:28 +

> From: Jesper Dangaard Brouer 
>
> There are cases where the page_pool need to refill with pages from the
> page allocator. Some workloads cause the page_pool to release pages
> instead of recycling these pages.
>
> For these workload it can improve performance to bulk alloc pages from
> the page-allocator to refill the alloc cache.
>
> For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
> redirecting xdp_frame packets into a veth, that does XDP_PASS to create
> an SKB from the xdp_frame, which then cannot return the page to the
> page_pool.
>
> Performance results under GitHub xdp-project[1]:
>  [1] 
> https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org
>
> Mel: The patch "net: page_pool: convert to use alloc_pages_bulk_array
> variant" was squashed with this patch. From the test page, the array
> variant was superior with one of the test results as follows.
>
>   Kernel  XDP stats   CPU pps   Delta
>   BaselineXDP-RX CPU  total   3,771,046   n/a
>   ListXDP-RX CPU  total   3,940,242+4.49%
>   Array   XDP-RX CPU  total   4,249,224   +12.68%
>
> Signed-off-by: Jesper Dangaard Brouer 
> Signed-off-by: Mel Gorman 

I tested it a lot for past two weeks and I'm very satisfied with
the results, especially the new array-based version.
Haven't had a chance to test this particular set yet, but still.

Reviewed-by: Alexander Lobakin 

Great work, thank you all guys!

> ---
>  include/net/page_pool.h |  2 +-
>  net/core/page_pool.c| 82 -
>  2 files changed, 57 insertions(+), 27 deletions(-)
>
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index b5b195305346..6d517a37c18b 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -65,7 +65,7 @@
>  #define PP_ALLOC_CACHE_REFILL 64
>  struct pp_alloc_cache {
>   u32 count;
> - void *cache[PP_ALLOC_CACHE_SIZE];
> + struct page *cache[PP_ALLOC_CACHE_SIZE];
>  };
>
>  struct page_pool_params {
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 40e1b2beaa6c..9ec1aa9640ad 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -203,38 +203,17 @@ static bool page_pool_dma_map(struct page_pool *pool, 
> struct page *page)
>   return true;
>  }
>
> -/* slow path */
> -noinline
> -static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
> -  gfp_t _gfp)
> +static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
> +  gfp_t gfp)
>  {
> - unsigned int pp_flags = pool->p.flags;
>   struct page *page;
> - gfp_t gfp = _gfp;
> -
> - /* We could always set __GFP_COMP, and avoid this branch, as
> -  * prep_new_page() can handle order-0 with __GFP_COMP.
> -  */
> - if (pool->p.order)
> - gfp |= __GFP_COMP;
> -
> - /* FUTURE development:
> -  *
> -  * Current slow-path essentially falls back to single page
> -  * allocations, which doesn't improve performance.  This code
> -  * need bulk allocation support from the page allocator code.
> -  */
>
> - /* Cache was empty, do real allocation */
> -#ifdef CONFIG_NUMA
> + gfp |= __GFP_COMP;
>   page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
> -#else
> - page = alloc_pages(gfp, pool->p.order);
> -#endif
> - if (!page)
> + if (unlikely(!page))
>   return NULL;
>
> - if ((pp_flags & PP_FLAG_DMA_MAP) &&
> + if ((pool->p.flags & PP_FLAG_DMA_MAP) &&
>   unlikely(!page_pool_dma_map(pool, page))) {
>   put_page(page);
>   return NULL;
> @@ -243,6 +222,57 @@ static struct page *__page_pool_alloc_pages_slow(struct 
> page_pool *pool,
>   /* Track how many pages are held 'in-flight' */
>   pool->pages_state_hold_cnt++;
>   trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt);
> + return page;
> +}
> +
> +/* slow path */
> +noinline
> +static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
> +  gfp_t gfp)
> +{
> + const int bulk = PP_ALLOC_CACHE_REFILL;
> + unsigned int pp_flags = pool->p.flags;
> + unsigned int pp_order = pool->p.order;
> + struct page *page;
> + int i, nr_pages;
> +
> + /* Don't support bulk alloc for high-order pa

Re: [PATCH net-next 0/6] page_pool: recycle buffers

2021-03-24 Thread Alexander Lobakin
From: Ilias Apalodimas 
Date: Wed, 24 Mar 2021 09:50:38 +0200

> Hi Alexander,

Hi!

> On Tue, Mar 23, 2021 at 08:03:46PM +, Alexander Lobakin wrote:
> > From: Ilias Apalodimas 
> > Date: Tue, 23 Mar 2021 19:01:52 +0200
> >
> > > On Tue, Mar 23, 2021 at 04:55:31PM +, Alexander Lobakin wrote:
> > > > > > > > >
> > >
> > > [...]
> > >
> > > > > > > >
> > > > > > > > Thanks for the testing!
> > > > > > > > Any chance you can get a perf measurement on this?
> > > > > > >
> > > > > > > I guess you mean perf-report (--stdio) output, right?
> > > > > > >
> > > > > >
> > > > > > Yea,
> > > > > > As hinted below, I am just trying to figure out if on Alexander's 
> > > > > > platform the
> > > > > > cost of syncing, is bigger that free-allocate. I remember one armv7 
> > > > > > were that
> > > > > > was the case.
> > > > > >
> > > > > > > > Is DMA syncing taking a substantial amount of your cpu usage?
> > > > > > >
> > > > > > > (+1 this is an important question)
> > > >
> > > > Sure, I'll drop perf tools to my test env and share the results,
> > > > maybe tomorrow or in a few days.
> >
> > Oh we-e-e-ell...
> > Looks like I've been fooled by I-cache misses or smth like that.
> > That happens sometimes, not only on my machines, and not only on
> > MIPS if I'm not mistaken.
> > Sorry for confusing you guys.
> >
> > I got drastically different numbers after I enabled CONFIG_KALLSYMS +
> > CONFIG_PERF_EVENTS for perf tools.
> > The only difference in code is that I rebased onto Mel's
> > mm-bulk-rebase-v6r4.
> >
> > (lunar is my WIP NIC driver)
> >
> > 1. 5.12-rc3 baseline:
> >
> > TCP: 566 Mbps
> > UDP: 615 Mbps
> >
> > perf top:
> >  4.44%  [lunar]  [k] lunar_rx_poll_page_pool
> >  3.56%  [kernel] [k] r4k_wait_irqoff
> >  2.89%  [kernel] [k] free_unref_page
> >  2.57%  [kernel] [k] dma_map_page_attrs
> >  2.32%  [kernel] [k] get_page_from_freelist
> >  2.28%  [lunar]  [k] lunar_start_xmit
> >  1.82%  [kernel] [k] __copy_user
> >  1.75%  [kernel] [k] dev_gro_receive
> >  1.52%  [kernel] [k] cpuidle_enter_state_coupled
> >  1.46%  [kernel] [k] tcp_gro_receive
> >  1.35%  [kernel] [k] __rmemcpy
> >  1.33%  [nf_conntrack]   [k] nf_conntrack_tcp_packet
> >  1.30%  [kernel] [k] __dev_queue_xmit
> >  1.22%  [kernel] [k] pfifo_fast_dequeue
> >  1.17%  [kernel] [k] skb_release_data
> >  1.17%  [kernel] [k] skb_segment
> >
> > free_unref_page() and get_page_from_freelist() consume a lot.
> >
> > 2. 5.12-rc3 + Page Pool recycling by Matteo:
> > TCP: 589 Mbps
> > UDP: 633 Mbps
> >
> > perf top:
> >  4.27%  [lunar]  [k] lunar_rx_poll_page_pool
> >  2.68%  [lunar]  [k] lunar_start_xmit
> >  2.41%  [kernel] [k] dma_map_page_attrs
> >  1.92%  [kernel] [k] r4k_wait_irqoff
> >  1.89%  [kernel] [k] __copy_user
> >  1.62%  [kernel] [k] dev_gro_receive
> >  1.51%  [kernel] [k] cpuidle_enter_state_coupled
> >  1.44%  [kernel] [k] tcp_gro_receive
> >  1.40%  [kernel] [k] __rmemcpy
> >  1.38%  [nf_conntrack]   [k] nf_conntrack_tcp_packet
> >  1.37%  [kernel] [k] free_unref_page
> >  1.35%  [kernel] [k] __dev_queue_xmit
> >  1.30%  [kernel] [k] skb_segment
> >  1.28%  [kernel] [k] get_page_from_freelist
> >  1.27%  [kernel] [k] r4k_dma_cache_inv
> >
> > +20 Mbps increase on both TCP and UDP. free_unref_page() and
> > get_page_from_freelist() dropped down the list significantly.
> >
> > 3. 5.12-rc3 + Page Pool recycling + PP bulk allocator (Mel & Jesper):
> > TCP: 596
> > UDP: 641
> >
> > perf top:
> >  4.38%  [lunar]  [k] lunar_rx_poll_page_pool
> >  3.34%  [kernel] [k] r4k_wait_irqoff
> >  3.14%  [kernel] [k] dma_map_page_attrs
> 

Re: [PATCH RESEND] PCI: dwc: put struct dw_pcie::{ep,pp} into a union to reduce its size

2021-03-24 Thread Alexander Lobakin
From: Krzysztof Wilczyński 
Date: Wed, 24 Mar 2021 02:31:42 +0100

> Hi Alexander,

Hi!

> Thank you for sending the patch over!
>
> > A single dw_pcie entity can't be a root complex and an endpoint at
> > the same time.
>
> Nice catch!
>
> A small nitpick: this would be Root Complex and Endpoint, as it's
> customary to capitalise these.
>
> Also, if you could capitalise the subject line - it could also perhaps
> be simplified to something like, for example:
>
>   Optimize struct dw_pcie to reduce its size
>
> Feel free to ignore both suggestions, as these are just nitpicks.

They are both correct, so I can send a v2 if this one isn't picked
up for the tree within, let's say, this week.

> > We can use this to reduce the size of dw_pcie by 80, from 280 to 200
> > bytes (on x32, guess more on x64), by putting the related embedded
> > structures (struct pcie_port and struct dw_pcie_ep) into a union.
>
> [...]
> > -   struct pcie_port    pp;
> > -   struct dw_pcie_ep   ep;
> > +   union {
> > +   struct pcie_port    pp;
> > +   struct dw_pcie_ep   ep;
> > +   };
> [...]
>
> How did you measure the difference?  Often, people include pahole output
> for the "before" and "after", so to speak, to showcase the difference
> and/or improvement.  Do you have something like that handy?

I didn't use pahole to measure the difference, I just printed sizeofs
of the structures "before" and "after". But I can get pahole's
output and include it in v2 to make the commit message more useful.
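
Something along these lines, for instance (a hypothetical invocation;
the exact object path may differ):

 $ pahole -C dw_pcie drivers/pci/controller/dwc/pcie-designware.o

which prints the struct layout with holes and the total size, so the
"before" and "after" numbers can be compared directly.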

> Krzysztof

Thanks!
Al



Re: [PATCH net-next 0/6] page_pool: recycle buffers

2021-03-23 Thread Alexander Lobakin
From: Ilias Apalodimas 
Date: Tue, 23 Mar 2021 19:01:52 +0200

> On Tue, Mar 23, 2021 at 04:55:31PM +0000, Alexander Lobakin wrote:
> > > > > > >
>
> [...]
>
> > > > > >
> > > > > > Thanks for the testing!
> > > > > > Any chance you can get a perf measurement on this?
> > > > >
> > > > > I guess you mean perf-report (--stdio) output, right?
> > > > >
> > > >
> > > > Yea,
> > > > As hinted below, I am just trying to figure out if on Alexander's 
> > > > platform the
> > > > cost of syncing, is bigger that free-allocate. I remember one armv7 
> > > > were that
> > > > was the case.
> > > >
> > > > > > Is DMA syncing taking a substantial amount of your cpu usage?
> > > > >
> > > > > (+1 this is an important question)
> >
> > Sure, I'll drop perf tools to my test env and share the results,
> > maybe tomorrow or in a few days.

Oh we-e-e-ell...
Looks like I've been fooled by I-cache misses or smth like that.
That happens sometimes, not only on my machines, and not only on
MIPS if I'm not mistaken.
Sorry for confusing you guys.

I got drastically different numbers after I enabled CONFIG_KALLSYMS +
CONFIG_PERF_EVENTS for perf tools.
The only difference in code is that I rebased onto Mel's
mm-bulk-rebase-v6r4.

(lunar is my WIP NIC driver)

1. 5.12-rc3 baseline:

TCP: 566 Mbps
UDP: 615 Mbps

perf top:
 4.44%  [lunar]  [k] lunar_rx_poll_page_pool
 3.56%  [kernel] [k] r4k_wait_irqoff
 2.89%  [kernel] [k] free_unref_page
 2.57%  [kernel] [k] dma_map_page_attrs
 2.32%  [kernel] [k] get_page_from_freelist
 2.28%  [lunar]  [k] lunar_start_xmit
 1.82%  [kernel] [k] __copy_user
 1.75%  [kernel] [k] dev_gro_receive
 1.52%  [kernel] [k] cpuidle_enter_state_coupled
 1.46%  [kernel] [k] tcp_gro_receive
 1.35%  [kernel] [k] __rmemcpy
 1.33%  [nf_conntrack]   [k] nf_conntrack_tcp_packet
 1.30%  [kernel] [k] __dev_queue_xmit
 1.22%  [kernel] [k] pfifo_fast_dequeue
 1.17%  [kernel] [k] skb_release_data
 1.17%  [kernel] [k] skb_segment

free_unref_page() and get_page_from_freelist() consume a lot.

2. 5.12-rc3 + Page Pool recycling by Matteo:
TCP: 589 Mbps
UDP: 633 Mbps

perf top:
 4.27%  [lunar]  [k] lunar_rx_poll_page_pool
 2.68%  [lunar]  [k] lunar_start_xmit
 2.41%  [kernel] [k] dma_map_page_attrs
 1.92%  [kernel] [k] r4k_wait_irqoff
 1.89%  [kernel] [k] __copy_user
 1.62%  [kernel] [k] dev_gro_receive
 1.51%  [kernel] [k] cpuidle_enter_state_coupled
 1.44%  [kernel] [k] tcp_gro_receive
 1.40%  [kernel] [k] __rmemcpy
 1.38%  [nf_conntrack]   [k] nf_conntrack_tcp_packet
 1.37%  [kernel] [k] free_unref_page
 1.35%  [kernel] [k] __dev_queue_xmit
 1.30%  [kernel] [k] skb_segment
 1.28%  [kernel] [k] get_page_from_freelist
 1.27%  [kernel] [k] r4k_dma_cache_inv

+20 Mbps increase on both TCP and UDP. free_unref_page() and
get_page_from_freelist() dropped down the list significantly.

3. 5.12-rc3 + Page Pool recycling + PP bulk allocator (Mel & Jesper):
TCP: 596
UDP: 641

perf top:
 4.38%  [lunar]  [k] lunar_rx_poll_page_pool
 3.34%  [kernel] [k] r4k_wait_irqoff
 3.14%  [kernel] [k] dma_map_page_attrs
 2.49%  [lunar]  [k] lunar_start_xmit
 1.85%  [kernel] [k] dev_gro_receive
 1.76%  [kernel] [k] free_unref_page
 1.76%  [kernel] [k] __copy_user
 1.65%  [kernel] [k] inet_gro_receive
 1.57%  [kernel] [k] tcp_gro_receive
 1.48%  [kernel] [k] cpuidle_enter_state_coupled
 1.43%  [nf_conntrack]   [k] nf_conntrack_tcp_packet
 1.42%  [kernel] [k] __rmemcpy
 1.25%  [kernel] [k] skb_segment
 1.21%  [kernel] [k] r4k_dma_cache_inv

+10 Mbps on top of recycling.
get_page_from_freelist() is gone.
NAPI polling, CPU idle cycle (r4k_wait_irqoff) and DMA mapping
routine became the top consumers.

4-5. __always_inline for rmqueue_bulk() and __rmqueue_pcplist(),
removing 'noinline' from net/core/page_pool.c etc.

...makes absolutely no sense anymore.
I see Mel took Jesper's patch to make __rmqueue_pcplist() inline into
mm-bulk-rebase-v6r5, not sure if it's really needed now.

So I'm really glad we sorted out the things and I can see the real
performance improvements fr

[PATCH v2 mtd/fixes] mtd: spinand: core: add missing MODULE_DEVICE_TABLE()

2021-03-23 Thread Alexander Lobakin
The module is missing MODULE_DEVICE_TABLE() for both the SPI and OF ID
tables and thus never autoloads on ID matches.
Add the missing declarations.
The issue has been present since day 0 of the SPI NAND framework's
introduction.

Fixes: 7529df465248 ("mtd: nand: Add core infrastructure to support SPI NANDs")
Cc: sta...@vger.kernel.org # 4.19+
Signed-off-by: Alexander Lobakin 
---
 drivers/mtd/nand/spi/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/mtd/nand/spi/core.c b/drivers/mtd/nand/spi/core.c
index 61d932c1b718..17f63f95f4a2 100644
--- a/drivers/mtd/nand/spi/core.c
+++ b/drivers/mtd/nand/spi/core.c
@@ -1263,12 +1263,14 @@ static const struct spi_device_id spinand_ids[] = {
{ .name = "spi-nand" },
{ /* sentinel */ },
 };
+MODULE_DEVICE_TABLE(spi, spinand_ids);

 #ifdef CONFIG_OF
 static const struct of_device_id spinand_of_ids[] = {
{ .compatible = "spi-nand" },
{ /* sentinel */ },
 };
+MODULE_DEVICE_TABLE(of, spinand_of_ids);
 #endif

 static struct spi_mem_driver spinand_drv = {
--
2.31.0




[PATCH mtd/fixes] mtd: spinand: core: add missing MODULE_DEVICE_TABLE()

2021-03-23 Thread Alexander Lobakin
The module is missing MODULE_DEVICE_TABLE() for both the SPI and OF ID
tables and thus never autoloads on ID matches.
Add the missing declarations.
The issue has been present since day 0 of the SPI NAND framework's
introduction.

Cc: sta...@vger.kernel.org # 4.19+
Signed-off-by: Alexander Lobakin 
---
 drivers/mtd/nand/spi/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/mtd/nand/spi/core.c b/drivers/mtd/nand/spi/core.c
index 61d932c1b718..17f63f95f4a2 100644
--- a/drivers/mtd/nand/spi/core.c
+++ b/drivers/mtd/nand/spi/core.c
@@ -1263,12 +1263,14 @@ static const struct spi_device_id spinand_ids[] = {
{ .name = "spi-nand" },
{ /* sentinel */ },
 };
+MODULE_DEVICE_TABLE(spi, spinand_ids);

 #ifdef CONFIG_OF
 static const struct of_device_id spinand_of_ids[] = {
{ .compatible = "spi-nand" },
{ /* sentinel */ },
 };
+MODULE_DEVICE_TABLE(of, spinand_of_ids);
 #endif

 static struct spi_mem_driver spinand_drv = {
--
2.31.0




Re: [PATCH net-next 0/6] page_pool: recycle buffers

2021-03-23 Thread Alexander Lobakin
From: Matteo Croce 
Date: Tue, 23 Mar 2021 17:28:32 +0100

> On Tue, Mar 23, 2021 at 5:10 PM Ilias Apalodimas
>  wrote:
> >
> > On Tue, Mar 23, 2021 at 05:04:47PM +0100, Jesper Dangaard Brouer wrote:
> > > On Tue, 23 Mar 2021 17:47:46 +0200
> > > Ilias Apalodimas  wrote:
> > >
> > > > On Tue, Mar 23, 2021 at 03:41:23PM +, Alexander Lobakin wrote:
> > > > > From: Matteo Croce 
> > > > > Date: Mon, 22 Mar 2021 18:02:55 +0100
> > > > >
> > > > > > From: Matteo Croce 
> > > > > >
> > > > > > This series enables recycling of the buffers allocated with the 
> > > > > > page_pool API.
> > > > > > The first two patches are just prerequisite to save space in a 
> > > > > > struct and
> > > > > > avoid recycling pages allocated with other API.
> > > > > > Patch 2 was based on a previous idea from Jonathan Lemon.
> > > > > >
> > > > > > The third one is the real recycling, 4 fixes the compilation of 
> > > > > > __skb_frag_unref
> > > > > > users, and 5,6 enable the recycling on two drivers.
> > > > > >
> > > > > > In the last two patches I reported the improvement I have with the 
> > > > > > series.
> > > > > >
> > > > > > The recycling as is can't be used with drivers like mlx5 which do 
> > > > > > page split,
> > > > > > but this is documented in a comment.
> > > > > > In the future, a refcount can be used so to support mlx5 with no 
> > > > > > changes.
> > > > > >
> > > > > > Ilias Apalodimas (2):
> > > > > >   page_pool: DMA handling and allow to recycles frames via SKB
> > > > > >   net: change users of __skb_frag_unref() and add an extra argument
> > > > > >
> > > > > > Jesper Dangaard Brouer (1):
> > > > > >   xdp: reduce size of struct xdp_mem_info
> > > > > >
> > > > > > Matteo Croce (3):
> > > > > >   mm: add a signature in struct page
> > > > > >   mvpp2: recycle buffers
> > > > > >   mvneta: recycle buffers
> > > > > >
> > > > > >  .../chelsio/inline_crypto/ch_ktls/chcr_ktls.c |  2 +-
> > > > > >  drivers/net/ethernet/marvell/mvneta.c |  4 +-
> > > > > >  .../net/ethernet/marvell/mvpp2/mvpp2_main.c   | 17 +++
> > > > > >  drivers/net/ethernet/marvell/sky2.c   |  2 +-
> > > > > >  drivers/net/ethernet/mellanox/mlx4/en_rx.c|  2 +-
> > > > > >  include/linux/mm_types.h  |  1 +
> > > > > >  include/linux/skbuff.h| 33 +++--
> > > > > >  include/net/page_pool.h   | 15 ++
> > > > > >  include/net/xdp.h |  5 +-
> > > > > >  net/core/page_pool.c  | 47 
> > > > > > +++
> > > > > >  net/core/skbuff.c | 20 +++-
> > > > > >  net/core/xdp.c| 14 --
> > > > > >  net/tls/tls_device.c  |  2 +-
> > > > > >  13 files changed, 138 insertions(+), 26 deletions(-)
> > > > >
> > > > > Just for the reference, I've performed some tests on 1G SoC NIC with
> > > > > this patchset on, here's direct link: [0]
> > > > >
> > > >
> > > > Thanks for the testing!
> > > > Any chance you can get a perf measurement on this?
> > >
> > > I guess you mean perf-report (--stdio) output, right?
> > >
> >
> > Yea,
> > As hinted below, I am just trying to figure out if on Alexander's platform 
> > the
> > cost of syncing, is bigger that free-allocate. I remember one armv7 were 
> > that
> > was the case.
> >
> > > > Is DMA syncing taking a substantial amount of your cpu usage?
> > >
> > > (+1 this is an important question)

Sure, I'll drop perf tools to my test env and share the results,
maybe tomorrow or in a few days.
From what I know for sure about MIPS and my platform,
post-Rx syncing (dma_sync_single_for_cpu()) is a no-op, and
pre-Rx (dma_sync_single_for_device() etc.) is a bit expensive.
I always have a sane page_pool->pp.max_len value (something around
1668 for an MTU of 1500) to minimize the overhead.

By the way, IIRC, all machines shipped with mvpp2 have hardware
cache coherency units and don't suffer from sync routines at all.
That may be the reason why mvpp2 gains the most from this series.

> > > > >
> > > > > [0] 
> > > > > https://lore.kernel.org/netdev/20210323153550.130385-1-aloba...@pm.me
> > > > >
> > >
>
> That would be the same as for mvneta:
>
> Overhead  Shared Object Symbol
>   24.10%  [kernel]  [k] __pi___inval_dcache_area
>   23.02%  [mvneta]  [k] mvneta_rx_swbm
>7.19%  [kernel]  [k] kmem_cache_alloc
>
> Anyway, I tried to use the recycling *and* napi_build_skb on mvpp2,
> and I get lower packet rate than recycling alone.
> I don't know why, we should investigate it.

The mvpp2 driver doesn't use napi_consume_skb() on its Tx completion
path. As a result, the per-CPU NAPI caches get refilled only through
kmem_cache_alloc_bulk(), and most of the skbuff_head recycling
doesn't work.
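
For the record, a rough sketch of what I mean (hypothetical driver
code, all foo_* names are made up):

	static void foo_tx_complete(struct foo_txq *txq, int budget)
	{
		while (foo_tx_desc_done(txq)) {
			struct sk_buff *skb = txq->skbs[txq->next_to_clean];

			/* unlike dev_consume_skb_any(), this refills the
			 * per-CPU NAPI skbuff_head cache on completion
			 */
			napi_consume_skb(skb, budget);
			txq->next_to_clean = foo_next_idx(txq);
		}
	}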

> Regards,
> --
> per aspera ad upstream

Oh, I love that one!

Al



Re: [PATCH net-next] page_pool: let the compiler optimize and inline core functions

2021-03-23 Thread Alexander Lobakin
From: Jesper Dangaard Brouer 
Date: Tue, 23 Mar 2021 10:01:38 +0100

> On Mon, 22 Mar 2021 18:30:55 +
> Alexander Lobakin  wrote:
>
> > As per the discussion in the Page Pool bulk allocator thread [0],
> > there are two functions in Page Pool core code that are marked as
> > 'noinline'. The reason for this is not so clear, and even if it
> > was made to reduce hotpath overhead, in fact it only makes things
> > worse.
> > As both of these functions are called only once through the
> > code, they could be inlined/folded into the non-static entry point.
> > However, 'noinline' marks effectively prevent from doing that and
> > induce totally unneeded fragmentation (baseline -> after removal):
> >
> > add/remove: 0/3 grow/shrink: 1/0 up/down: 1024/-1096 (-72)
> > Function old new   delta
> > page_pool_alloc_pages1001124   +1024
> > page_pool_dma_map164   --164
> > page_pool_refill_alloc_cache 332   --332
> > __page_pool_alloc_pages_slow 600   --600
> >
> > (taken from Mel's branch, hence factored-out page_pool_dma_map())
>
> I see that the refactor of page_pool_dma_map() caused it to be
> uninlined, that were a mistake.  Thanks for high-lighting that again
> as I forgot about this (even-though I think Alex Duyck did point this
> out earlier).
>
> I am considering if we should allow compiler to inline
> page_pool_refill_alloc_cache + __page_pool_alloc_pages_slow, for the
> sake of performance and I lose the ability to diagnose the behavior
> from perf-report.  Mind that page_pool avoids stat for the sake of
> performance, but these noinline makes it possible to diagnose the
> behavior anyway.
>
> >
> > 1124 is a normal hotpath frame size, but these jumps between tiny
> > page_pool_alloc_pages(), page_pool_refill_alloc_cache() and
> > __page_pool_alloc_pages_slow() are really redundant and harmful
> > for performance.
>
> Well, I disagree. (this is a NACK)
>
> If pages were recycled then the code never had to visit
> __page_pool_alloc_pages_slow().  And today without the bulk page-alloc
> (that we are working on adding together with Mel) we have to visit
> __page_pool_alloc_pages_slow() every time, which is a bad design, but
> I'm trying to fix that.
>
> Matteo is working on recycling here[1]:
>  [1] 
> https://lore.kernel.org/netdev/20210322170301.26017-1-mcr...@linux.microsoft.com/
>
> It would be really great if you could try out his patchset, as it will
> help your driver avoid the slow path of the page_pool.  Given you are
> very detailed oriented, I do want to point out that Matteo's patchset
> is only the first step, as to really improve performance for page_pool,
> we need to bulk return these page_pool pages (it is requires some
> restructure of the core code, that will be confusing at this point).

I tested it out when I saw the first RFC. Its code seemed fine to me
and I was wondering what it could bring to my workloads.
The reason why I didn't post the results is because they're actually
poor on my system.

I retested it again, this time v1 instead of RFC and also tried
the combined with bulk allocation variant.

VLAN NAT, GRO + TSO/USO, Page size 16 Kb.
XDP_PASS -> napi_build_skb() -> napi_gro_receive().

I disable fraglist offload and nftables Flow offload to drop
the performance below link speed.

1.
 - 5.12-rc3:

TCP 572 Mbps
UDP 616 Mbps

2.
 - 5.12-rc3;
 - Page Pool recycling by Matteo (with replacing
   page_pool_release_page() with skb_mark_for_recycle()
   in my driver):

TCP 540 Mbps
UDP 572 Mbps

When I first saw the results, I didn't believe that everything worked
as the code suggested and that pages were actually being recycled.
But then I traced skb and pages' paths and made sure that recycling
actually happens (on every frame).
The reason for such a heavy drop, at least as far as I can guess, is
that page_frag_free(), which is normally called on skb->head and its
frags, is very lightweight and straightforward. When recycling is on,
the following chain is called for the skb head and every frag instead:

page_pool_return_skb_page()
 xdp_return_skb_frame()
  __xdp_return()
   page_pool_put_full_page()

Also, as allow_direct is false (which is fine -- for context safety
reasons), recycled pages are returned into the ptr_ring (taking and
releasing its spinlock) instead of the direct cache. So the next
Page Pool allocations will inevitably hit the (noinline)
page_pool_refill_alloc_cache(), take the spinlock again and so on.
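
In pseudo-C, the decision taken for every such page (heavily simplified,
not the actual page_pool code):

	static void recycle_page(struct page_pool *pool, struct page *page,
				 bool allow_direct)
	{
		if (allow_direct) {
			/* lockless percpu cache, cheap */
			pool->alloc.cache[pool->alloc.count++] = page;
			return;
		}

		/* skb recycling path: allow_direct == false, so we always
		 * end up on the ptr_ring producer side -> spinlock
		 */
		if (ptr_ring_produce(&pool->ring, page))
			page_pool_return_page(pool, page); /* ring full */
	}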

3.
 - 5.12-rc3;
 - Page Pool recycling;
 - bulk allocations:

TCP 545 Mbps
UDP 610 Mbps

As I wrote earlier, the bulk allocator suffers from the compiler's
inlining decisions.

Re: [PATCH net-next 0/6] page_pool: recycle buffers

2021-03-23 Thread Alexander Lobakin
From: Matteo Croce 
Date: Mon, 22 Mar 2021 18:02:55 +0100

> From: Matteo Croce 
>
> This series enables recycling of the buffers allocated with the page_pool API.
> The first two patches are just prerequisite to save space in a struct and
> avoid recycling pages allocated with other API.
> Patch 2 was based on a previous idea from Jonathan Lemon.
>
> The third one is the real recycling, 4 fixes the compilation of 
> __skb_frag_unref
> users, and 5,6 enable the recycling on two drivers.
>
> In the last two patches I reported the improvement I have with the series.
>
> The recycling as is can't be used with drivers like mlx5 which do page split,
> but this is documented in a comment.
> In the future, a refcount can be used so to support mlx5 with no changes.
>
> Ilias Apalodimas (2):
>   page_pool: DMA handling and allow to recycles frames via SKB
>   net: change users of __skb_frag_unref() and add an extra argument
>
> Jesper Dangaard Brouer (1):
>   xdp: reduce size of struct xdp_mem_info
>
> Matteo Croce (3):
>   mm: add a signature in struct page
>   mvpp2: recycle buffers
>   mvneta: recycle buffers
>
>  .../chelsio/inline_crypto/ch_ktls/chcr_ktls.c |  2 +-
>  drivers/net/ethernet/marvell/mvneta.c |  4 +-
>  .../net/ethernet/marvell/mvpp2/mvpp2_main.c   | 17 +++
>  drivers/net/ethernet/marvell/sky2.c   |  2 +-
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c|  2 +-
>  include/linux/mm_types.h  |  1 +
>  include/linux/skbuff.h| 33 +++--
>  include/net/page_pool.h   | 15 ++
>  include/net/xdp.h |  5 +-
>  net/core/page_pool.c  | 47 +++
>  net/core/skbuff.c | 20 +++-
>  net/core/xdp.c| 14 --
>  net/tls/tls_device.c  |  2 +-
>  13 files changed, 138 insertions(+), 26 deletions(-)

Just for the reference, I've performed some tests on 1G SoC NIC with
this patchset on, here's direct link: [0]

> --
> 2.30.2

[0] https://lore.kernel.org/netdev/20210323153550.130385-1-aloba...@pm.me

Thanks,
Al



[PATCH net-next] page_pool: let the compiler optimize and inline core functions

2021-03-22 Thread Alexander Lobakin
As per discussion in the Page Pool bulk allocator thread [0],
there are two functions in Page Pool core code that are marked as
'noinline'. The reason for this is not so clear, and even if it
was made to reduce hotpath overhead, in fact it only makes things
worse.
As both of these functions are called only once throughout the
code, they could be inlined/folded into the non-static entry point.
However, the 'noinline' marks effectively prevent that and
induce totally unneeded fragmentation (baseline -> after removal):

add/remove: 0/3 grow/shrink: 1/0 up/down: 1024/-1096 (-72)
Function                      old     new   delta
page_pool_alloc_pages         100    1124   +1024
page_pool_dma_map             164       -    -164
page_pool_refill_alloc_cache  332       -    -332
__page_pool_alloc_pages_slow  600       -    -600

(taken from Mel's branch, hence factored-out page_pool_dma_map())

1124 is a normal hotpath frame size, but these jumps between tiny
page_pool_alloc_pages(), page_pool_refill_alloc_cache() and
__page_pool_alloc_pages_slow() are really redundant and harmful
for performance.

This simple removal of 'noinline' keywords bumps the throughput
on XDP_PASS + napi_build_skb() + napi_gro_receive() on 25+ Mbps
for 1G embedded NIC.

[0] https://lore.kernel.org/netdev/20210317222506.1266004-1-aloba...@pm.me

Signed-off-by: Alexander Lobakin 
---
 net/core/page_pool.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index ad8b0707af04..589e4df6ef2b 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -102,7 +102,6 @@ EXPORT_SYMBOL(page_pool_create);

 static void page_pool_return_page(struct page_pool *pool, struct page *page);

-noinline
 static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
 {
	struct ptr_ring *r = &pool->ring;
@@ -181,7 +180,6 @@ static void page_pool_dma_sync_for_device(struct page_pool 
*pool,
 }

 /* slow path */
-noinline
 static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 gfp_t _gfp)
 {
--
2.31.0




[PATCH] MIPS: generic: Support linking with LLVM ld.lld

2021-03-22 Thread Alexander Lobakin
From: Paul Cercueil 
Date: Sun, 21 Mar 2021 13:18:05 +

> LLVM's ld.lld chokes on the 64-bit sign-extended load addresses. Use
> 32-bit addresses if the linker is LLVM's ld.lld.
>
> Signed-off-by: Paul Cercueil 
> ---
>  arch/mips/generic/Platform | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/mips/generic/Platform b/arch/mips/generic/Platform
> index b871af16b5b6..19b7d92a4ca7 100644
> --- a/arch/mips/generic/Platform
> +++ b/arch/mips/generic/Platform
> @@ -12,8 +12,8 @@
>  cflags-$(CONFIG_MACH_INGENIC_SOC)+= 
> -I$(srctree)/arch/mips/include/asm/mach-ingenic
>  cflags-$(CONFIG_MIPS_GENERIC)+= 
> -I$(srctree)/arch/mips/include/asm/mach-generic
>
> -load-$(CONFIG_MIPS_GENERIC)  += 0x8010
> -zload-$(CONFIG_MIPS_GENERIC) += 0x8100
> +load-$(CONFIG_MIPS_GENERIC)  += $(if 
> $(CONFIG_LD_IS_LLD),0x8010,0x8010)
> +zload-$(CONFIG_MIPS_GENERIC) += $(if 
> $(CONFIG_LD_IS_LLD),0x8100,0x8100)
>  all-$(CONFIG_MIPS_GENERIC)   := vmlinux.gz.itb

For load-y, it's handled in arch/mips/Makefile:289 arch-wide.
For zload-y, it's not handled at all, but the proper way to do this
is to add logic similar to the load-ld handling in
arch/mips/boot/compressed/Makefile.

>  its-y:= vmlinux.its.S
> --
> 2.30.2

Thanks,
Al



[PATCH RESEND 0/2] kconfig: fence choices and menuconfigs with comments in .config too

2021-03-19 Thread Alexander Lobakin
Comment blocks for menus are great for dotconfig readability, but
currently menus are the only type of submenu (besides plain comments)
for which these blocks are generated.
This series expands the code to also generate such blocks for choices
and menuconfigs, and additionally extends the comments themselves a
bit to indicate whether they belong to a choice or a menu{,config} block.

Alexander Lobakin (2):
  kconfig: fence choices and menuconfigs with comments in .config too
  kconfig: mention submenu type in comment blocks in .config

 scripts/kconfig/confdata.c | 43 +-
 1 file changed, 29 insertions(+), 14 deletions(-)

--
2.31.0




[PATCH RESEND 1/2] kconfig: fence choices and menuconfigs with comments in .config too

2021-03-19 Thread Alexander Lobakin
Comment blocks are now generated in .config only for menus. Provide
them for choices and menuconfigs too to greatly improve dotconfig
readability.

Choices before:

# CONFIG_BOOT_CONFIG is not set
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_HAVE_LD_DEAD_CODE_DATA_ELIMINATION=y

Choices after:

# CONFIG_BOOT_CONFIG is not set

#
# Compiler optimization level
#
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
# end of Compiler optimization level

CONFIG_HAVE_LD_DEAD_CODE_DATA_ELIMINATION=y

Menuconfigs before:

# CONFIG_IPACK_BUS is not set
CONFIG_RESET_CONTROLLER=y
# CONFIG_RESET_BRCMSTB_RESCAL is not set
# CONFIG_RESET_INTEL_GW is not set
# CONFIG_RESET_TI_SYSCON is not set

Menuconfigs after:

#
# IndustryPack bus support
#
# CONFIG_IPACK_BUS is not set
# end of IndustryPack bus support

#
# Reset Controller Support
#
CONFIG_RESET_CONTROLLER=y
# CONFIG_RESET_BRCMSTB_RESCAL is not set
# CONFIG_RESET_INTEL_GW is not set
# CONFIG_RESET_TI_SYSCON is not set
# end of Reset Controller Support

Signed-off-by: Alexander Lobakin 
---
 scripts/kconfig/confdata.c | 25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/scripts/kconfig/confdata.c b/scripts/kconfig/confdata.c
index 2568dbe16ed6..e4f0a21fd469 100644
--- a/scripts/kconfig/confdata.c
+++ b/scripts/kconfig/confdata.c
@@ -869,17 +869,20 @@ int conf_write(const char *name)
menu = rootmenu.list;
while (menu) {
sym = menu->sym;
-   if (!sym) {
-   if (!menu_is_visible(menu))
-   goto next;
+
+   if ((!sym || (sym->flags & SYMBOL_CHOICE) ||
+(menu->prompt && menu->prompt->type == P_MENU)) &&
+   menu_is_visible(menu)) {
str = menu_get_prompt(menu);
fprintf(out, "\n"
 "#\n"
 "# %s\n"
 "#\n", str);
need_newline = false;
-   } else if (!(sym->flags & SYMBOL_CHOICE) &&
-  !(sym->flags & SYMBOL_WRITTEN)) {
+   }
+
+   if (sym && !(sym->flags & SYMBOL_CHOICE) &&
+   !(sym->flags & SYMBOL_WRITTEN)) {
sym_calc_value(sym);
if (!(sym->flags & SYMBOL_WRITE))
goto next;
@@ -896,11 +899,11 @@ int conf_write(const char *name)
menu = menu->list;
continue;
}
-   if (menu->next)
-   menu = menu->next;
-   else while ((menu = menu->parent)) {
-   if (!menu->sym && menu_is_visible(menu) &&
-   menu != &rootmenu) {
+
+   do {
+   if (((menu->sym && menu->sym->flags & SYMBOL_CHOICE) ||
+(menu->prompt && menu->prompt->type == P_MENU)) &&
+   menu_is_visible(menu) && menu != &rootmenu) {
str = menu_get_prompt(menu);
fprintf(out, "# end of %s\n", str);
need_newline = true;
@@ -909,7 +912,7 @@ int conf_write(const char *name)
menu = menu->next;
break;
}
-   }
+   } while ((menu = menu->parent));
}
fclose(out);

--
2.31.0




[PATCH RESEND 2/2] kconfig: mention submenu type in comment blocks in .config

2021-03-19 Thread Alexander Lobakin
To have a better understanding of the dotconfig blocks, mention if
a particular block-commented section is a choice or a menu{,config}.

Before:

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y

#
# Timer tick handling
#
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# end of Timer tick handling

# CONFIG_NO_HZ is not set
CONFIG_HIGH_RES_TIMERS=y
# end of Timers subsystem

After:

#
# Timers subsystem menu
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y

#
# Timer tick handling choice
#
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# end of Timer tick handling choice

# CONFIG_NO_HZ is not set
CONFIG_HIGH_RES_TIMERS=y
# end of Timers subsystem menu

Signed-off-by: Alexander Lobakin 
---
 scripts/kconfig/confdata.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/scripts/kconfig/confdata.c b/scripts/kconfig/confdata.c
index e4f0a21fd469..3f50d8b82a54 100644
--- a/scripts/kconfig/confdata.c
+++ b/scripts/kconfig/confdata.c
@@ -822,6 +822,17 @@ int conf_write_defconfig(const char *filename)
return 0;
 }

+static const char *menu_type_string(const struct menu *menu)
+{
+   if (menu->sym && (menu->sym->flags & SYMBOL_CHOICE))
+   return " choice";
+
+   if (menu->prompt && menu->prompt->type == P_MENU)
+   return " menu";
+
+   return "";
+}
+
 int conf_write(const char *name)
 {
FILE *out;
@@ -876,8 +887,8 @@ int conf_write(const char *name)
str = menu_get_prompt(menu);
fprintf(out, "\n"
 "#\n"
-"# %s\n"
-"#\n", str);
+"# %s%s\n"
+"#\n", str, menu_type_string(menu));
need_newline = false;
}

@@ -905,7 +916,8 @@ int conf_write(const char *name)
 (menu->prompt && menu->prompt->type == P_MENU)) &&
menu_is_visible(menu) && menu != &rootmenu) {
str = menu_get_prompt(menu);
-   fprintf(out, "# end of %s\n", str);
+   fprintf(out, "# end of %s%s\n", str,
+   menu_type_string(menu));
need_newline = true;
}
if (menu->next) {
--
2.31.0




[PATCH 0/2] kconfig: fence choices and menuconfigs with comments in .config too

2021-03-19 Thread Alexander Lobakin
Comment blocks for menus are great for dotconfig readability, but
currently menus are the only type of submenu (besides plain comments)
for which these blocks are generated.
This series expands the code to also generate such blocks for choices
and menuconfigs, and additionally extends the comments themselves a
bit to indicate whether they belong to a choice or a menu{,config} block.

Alexander Lobakin (2):
  kconfig: fence choices and menuconfigs with comments in .config too
  kconfig: mention submenu type in comment blocks in .config

 scripts/kconfig/confdata.c | 43 +-
 1 file changed, 29 insertions(+), 14 deletions(-)

--
2.31.0




[PATCH 1/2] kconfig: fence choices and menuconfigs with comments in .config too

2021-03-19 Thread Alexander Lobakin
Comment blocks are now generated in .config only for menus. Provide
them for choices and menuconfigs too to greatly improve dotconfig
readability.

Choices before:

# CONFIG_BOOT_CONFIG is not set
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_HAVE_LD_DEAD_CODE_DATA_ELIMINATION=y

Choices after:

# CONFIG_BOOT_CONFIG is not set

#
# Compiler optimization level
#
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
# end of Compiler optimization level

CONFIG_HAVE_LD_DEAD_CODE_DATA_ELIMINATION=y

Menuconfigs before:

# CONFIG_IPACK_BUS is not set
CONFIG_RESET_CONTROLLER=y
# CONFIG_RESET_BRCMSTB_RESCAL is not set
# CONFIG_RESET_INTEL_GW is not set
# CONFIG_RESET_TI_SYSCON is not set

Menuconfigs after:

#
# IndustryPack bus support
#
# CONFIG_IPACK_BUS is not set
# end of IndustryPack bus support

#
# Reset Controller Support
#
CONFIG_RESET_CONTROLLER=y
# CONFIG_RESET_BRCMSTB_RESCAL is not set
# CONFIG_RESET_INTEL_GW is not set
# CONFIG_RESET_TI_SYSCON is not set
# end of Reset Controller Support

Signed-off-by: Alexander Lobakin 
---
 scripts/kconfig/confdata.c | 25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/scripts/kconfig/confdata.c b/scripts/kconfig/confdata.c
index 2568dbe16ed6..e4f0a21fd469 100644
--- a/scripts/kconfig/confdata.c
+++ b/scripts/kconfig/confdata.c
@@ -869,17 +869,20 @@ int conf_write(const char *name)
menu = rootmenu.list;
while (menu) {
sym = menu->sym;
-   if (!sym) {
-   if (!menu_is_visible(menu))
-   goto next;
+
+   if ((!sym || (sym->flags & SYMBOL_CHOICE) ||
+(menu->prompt && menu->prompt->type == P_MENU)) &&
+   menu_is_visible(menu)) {
str = menu_get_prompt(menu);
fprintf(out, "\n"
 "#\n"
 "# %s\n"
 "#\n", str);
need_newline = false;
-   } else if (!(sym->flags & SYMBOL_CHOICE) &&
-  !(sym->flags & SYMBOL_WRITTEN)) {
+   }
+
+   if (sym && !(sym->flags & SYMBOL_CHOICE) &&
+   !(sym->flags & SYMBOL_WRITTEN)) {
sym_calc_value(sym);
if (!(sym->flags & SYMBOL_WRITE))
goto next;
@@ -896,11 +899,11 @@ int conf_write(const char *name)
menu = menu->list;
continue;
}
-   if (menu->next)
-   menu = menu->next;
-   else while ((menu = menu->parent)) {
-   if (!menu->sym && menu_is_visible(menu) &&
-   menu != &rootmenu) {
+
+   do {
+   if (((menu->sym && menu->sym->flags & SYMBOL_CHOICE) ||
+(menu->prompt && menu->prompt->type == P_MENU)) &&
+   menu_is_visible(menu) && menu != &rootmenu) {
str = menu_get_prompt(menu);
fprintf(out, "# end of %s\n", str);
need_newline = true;
@@ -909,7 +912,7 @@ int conf_write(const char *name)
menu = menu->next;
break;
}
-   }
+   } while ((menu = menu->parent));
}
fclose(out);

--
2.31.0




[PATCH 2/2] kconfig: mention submenu type in comment blocks in .config

2021-03-19 Thread Alexander Lobakin
To have a better understanding of the dotconfig blocks, mention if
a particular block-commented section is a choice or a menu{,config}.

Before:

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y

#
# Timer tick handling
#
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# end of Timer tick handling

# CONFIG_NO_HZ is not set
CONFIG_HIGH_RES_TIMERS=y
# end of Timers subsystem

After:

#
# Timers subsystem menu
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y

#
# Timer tick handling choice
#
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# end of Timer tick handling choice

# CONFIG_NO_HZ is not set
CONFIG_HIGH_RES_TIMERS=y
# end of Timers subsystem menu

Signed-off-by: Alexander Lobakin 
---
 scripts/kconfig/confdata.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/scripts/kconfig/confdata.c b/scripts/kconfig/confdata.c
index e4f0a21fd469..3f50d8b82a54 100644
--- a/scripts/kconfig/confdata.c
+++ b/scripts/kconfig/confdata.c
@@ -822,6 +822,17 @@ int conf_write_defconfig(const char *filename)
return 0;
 }

+static const char *menu_type_string(const struct menu *menu)
+{
+   if (menu->sym && (menu->sym->flags & SYMBOL_CHOICE))
+   return " choice";
+
+   if (menu->prompt && menu->prompt->type == P_MENU)
+   return " menu";
+
+   return "";
+}
+
 int conf_write(const char *name)
 {
FILE *out;
@@ -876,8 +887,8 @@ int conf_write(const char *name)
str = menu_get_prompt(menu);
fprintf(out, "\n"
 "#\n"
-"# %s\n"
-"#\n", str);
+"# %s%s\n"
+"#\n", str, menu_type_string(menu));
need_newline = false;
}

@@ -905,7 +916,8 @@ int conf_write(const char *name)
 (menu->prompt && menu->prompt->type == P_MENU)) &&
menu_is_visible(menu) && menu != &rootmenu) {
str = menu_get_prompt(menu);
-   fprintf(out, "# end of %s\n", str);
+   fprintf(out, "# end of %s%s\n", str,
+   menu_type_string(menu));
need_newline = true;
}
if (menu->next) {
--
2.31.0




[PATCH net-next] dsa: simplify Kconfig symbols and dependencies

2021-03-19 Thread Alexander Lobakin
1. Remove CONFIG_HAVE_NET_DSA.

CONFIG_HAVE_NET_DSA is a legacy leftover from the times when drivers
should have selected CONFIG_NET_DSA manually.
Currently, all drivers have an explicit 'depends on NET_DSA', so this is
no longer needed.

2. CONFIG_HAVE_NET_DSA dependencies became CONFIG_NET_DSA's ones.

 - dropped !S390 dependency which was introduced to be sure NET_DSA
   can select CONFIG_PHYLIB. DSA migrated to Phylink almost 3 years
   ago and the PHY library itself doesn't depend on !S390 since
   commit 870a2b5e4fcd ("phylib: remove !S390 dependeny from Kconfig");
 - INET dependency is kept to be sure we can select NET_SWITCHDEV;
 - NETDEVICES dependency is kept to be sure we can select PHYLINK.

3. DSA drivers menu now depends on NET_DSA.

Instead of a 'depends on NET_DSA' in every single driver, the entire
menu now depends on it. This eliminates a lot of duplicated lines
from Kconfig with no loss (when CONFIG_NET_DSA=m, drivers also can
be only m or n).
This also has a nice side effect that there's no more empty menu on
configurations without DSA.

4. Kbuild will now descend into 'drivers/net/dsa' only when
   CONFIG_NET_DSA is y or m.

This is safe since no objects inside this folder can be built without
DSA core, as well as when CONFIG_NET_DSA=m, no objects can be
built-in.

Signed-off-by: Alexander Lobakin 
---
 drivers/net/Makefile|  2 +-
 drivers/net/dsa/Kconfig | 17 -
 net/dsa/Kconfig | 10 +++---
 3 files changed, 8 insertions(+), 21 deletions(-)

diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index f4990ff32fa4..040e20b81317 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -45,7 +45,7 @@ obj-$(CONFIG_ARCNET) += arcnet/
 obj-$(CONFIG_DEV_APPLETALK) += appletalk/
 obj-$(CONFIG_CAIF) += caif/
 obj-$(CONFIG_CAN) += can/
-obj-y += dsa/
+obj-$(CONFIG_NET_DSA) += dsa/
 obj-$(CONFIG_ETHERNET) += ethernet/
 obj-$(CONFIG_FDDI) += fddi/
 obj-$(CONFIG_HIPPI) += hippi/
diff --git a/drivers/net/dsa/Kconfig b/drivers/net/dsa/Kconfig
index 3af373e90806..a5f1aa911fe2 100644
--- a/drivers/net/dsa/Kconfig
+++ b/drivers/net/dsa/Kconfig
@@ -1,12 +1,12 @@
 # SPDX-License-Identifier: GPL-2.0-only
 menu "Distributed Switch Architecture drivers"
-   depends on HAVE_NET_DSA
+   depends on NET_DSA

 source "drivers/net/dsa/b53/Kconfig"

 config NET_DSA_BCM_SF2
tristate "Broadcom Starfighter 2 Ethernet switch support"
-   depends on HAS_IOMEM && NET_DSA
+   depends on HAS_IOMEM
select NET_DSA_TAG_BRCM
select FIXED_PHY
select BCM7XXX_PHY
@@ -18,7 +18,6 @@ config NET_DSA_BCM_SF2

 config NET_DSA_LOOP
tristate "DSA mock-up Ethernet switch chip support"
-   depends on NET_DSA
select FIXED_PHY
help
  This enables support for a fake mock-up switch chip which
@@ -28,7 +27,7 @@ source "drivers/net/dsa/hirschmann/Kconfig"

 config NET_DSA_LANTIQ_GSWIP
tristate "Lantiq / Intel GSWIP"
-   depends on HAS_IOMEM && NET_DSA
+   depends on HAS_IOMEM
select NET_DSA_TAG_GSWIP
help
  This enables support for the Lantiq / Intel GSWIP 2.1 found in
@@ -36,7 +35,6 @@ config NET_DSA_LANTIQ_GSWIP

 config NET_DSA_MT7530
tristate "MediaTek MT753x and MT7621 Ethernet switch support"
-   depends on NET_DSA
select NET_DSA_TAG_MTK
help
  This enables support for the MediaTek MT7530, MT7531, and MT7621
@@ -44,7 +42,6 @@ config NET_DSA_MT7530

 config NET_DSA_MV88E6060
tristate "Marvell 88E6060 ethernet switch chip support"
-   depends on NET_DSA
select NET_DSA_TAG_TRAILER
help
  This enables support for the Marvell 88E6060 ethernet switch
@@ -64,7 +61,6 @@ source "drivers/net/dsa/xrs700x/Kconfig"

 config NET_DSA_QCA8K
tristate "Qualcomm Atheros QCA8K Ethernet switch family support"
-   depends on NET_DSA
select NET_DSA_TAG_QCA
select REGMAP
help
@@ -73,7 +69,6 @@ config NET_DSA_QCA8K

 config NET_DSA_REALTEK_SMI
tristate "Realtek SMI Ethernet switch family support"
-   depends on NET_DSA
select NET_DSA_TAG_RTL4_A
select FIXED_PHY
select IRQ_DOMAIN
@@ -93,7 +88,7 @@ config NET_DSA_SMSC_LAN9303

 config NET_DSA_SMSC_LAN9303_I2C
tristate "SMSC/Microchip LAN9303 3-ports 10/100 ethernet switch in I2C 
managed mode"
-   depends on NET_DSA && I2C
+   depends on I2C
select NET_DSA_SMSC_LAN9303
select REGMAP_I2C
help
@@ -102,7 +97,6 @@ config NET_DSA_SMSC_LAN9303_I2C

 config NET_DSA_SMSC_LAN9303_MDIO
tristate "SMSC/Microchip LAN9303 3-ports 10/100 ethernet switch in MDIO 
managed mode"
-   depends on NET_DSA
select NET_DSA_SMSC_LAN9303
help
  Enable access functions if the SMSC/Microchip LAN9303 is c

Re: [PATCH net-next 2/4] gro: add combined call_gro_receive() + INDIRECT_CALL_INET() helper

2021-03-19 Thread Alexander Lobakin
From: Paolo Abeni 
Date: Fri, 19 Mar 2021 13:35:41 +0100

> On Fri, 2021-03-19 at 11:43 +0000, Alexander Lobakin wrote:
> > I'm not sure if you did it on purpose in commit aaa5d90b395a7
> > ("net: use indirect call wrappers at GRO network layer").
> > Was that intentional
>
> I must admit that 2y+ later my own intentions are not so clear to me
> too;)

Heh, know that feel (=

> > for the sake of more optimized path for the
> > kernels with moduled IPv6,
>
> Uhm... no I guess that was more an underlook on my side.
>
> > or I can replace INDIRECT_CALL_INET()
> > with INDIRECT_CALL_2() here too?
>
> If that build with IPV6=nmy, I would say yes.

I think you used INDIRECT_CALL_INET() to protect from CONFIG_INET=n.
But this also hurts with retpoline when CONFIG_IPV6=m. Not so common
case, but still.

Plain INDIRECT_CALL_2() won't build without CONFIG_INET, so we either
introduce a new one (e.g. _INET_2() similarly to _INET_1()), or leave
it as it is for now (Dave's already picked this series to net-next).
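
Something like this is what I have in mind (naming and the exact guard
are my guess, untested; ip6_offload.o is built-in even with
CONFIG_IPV6=m, so two targets are fine as long as CONFIG_INET is
enabled):

#if IS_ENABLED(CONFIG_INET)
#define INDIRECT_CALL_INET_2(f, f2, f1, ...) \
	INDIRECT_CALL_2(f, f2, f1, __VA_ARGS__)
#else
#define INDIRECT_CALL_INET_2(f, f2, f1, ...) f(__VA_ARGS__)
#endif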

> > I want to keep GRO callbacks that
> > make use of indirect call wrappers unified.
>
> L4 will still need some special handling as ipv6 udp gro callbacks are
> not builtin with CONFIG_IPV6=m :(

Yep, I remember. I meant {inet,ipv6}_gro_{complete,receive}()
callers, but didn't mention that for some reason.

> Cheers,
>
> Paolo

Thanks,
Al



Re: [PATCH net-next 2/4] gro: add combined call_gro_receive() + INDIRECT_CALL_INET() helper

2021-03-19 Thread Alexander Lobakin
From: Alexander Lobakin 
Date: Fri, 19 Mar 2021 11:13:25 +

> From: Paolo Abeni 
> Date: Fri, 19 Mar 2021 11:53:42 +0100
>
> > Hello,
>
> Hi!
>
> > On Thu, 2021-03-18 at 18:42 +, Alexander Lobakin wrote:
> > > call_gro_receive() is used to limit GRO recursion, but it works only
> > > with callback pointers.
> > > There's a combined version of call_gro_receive() + INDIRECT_CALL_2()
> > > in , but it doesn't check for IPv6 modularity.
> >
> > AFAICS, ip6_offload is builtin even when IPv6 is a module, so the above
> > should not be needed.
>
> Aww, you are right. I overlooked that since dev_gro_receive() still
> use INDIRECT_CALL_INET(), though all GRO callbacks were made
> built-in.

I'm not sure if you did it on purpose in commit aaa5d90b395a7
("net: use indirect call wrappers at GRO network layer").
Was that intentional for the sake of more optimized path for the
kernels with moduled IPv6, or I can replace INDIRECT_CALL_INET()
with INDIRECT_CALL_2() here too? I want to keep GRO callbacks that
make use of indirect call wrappers unified.

> Seems like more code can be optimized, thanks!
>
> > Cheers,
> >
> > Paolo
>
> Al

Thanks,
Al



Re: [PATCH net-next 2/4] gro: add combined call_gro_receive() + INDIRECT_CALL_INET() helper

2021-03-19 Thread Alexander Lobakin
From: Paolo Abeni 
Date: Fri, 19 Mar 2021 11:53:42 +0100

> Hello,

Hi!

> On Thu, 2021-03-18 at 18:42 +, Alexander Lobakin wrote:
> > call_gro_receive() is used to limit GRO recursion, but it works only
> > with callback pointers.
> > There's a combined version of call_gro_receive() + INDIRECT_CALL_2()
> > in , but it doesn't check for IPv6 modularity.
>
> AFAICS, ip6_offload is builtin even when IPv6 is a module, so the above
> should not be needed.

Aww, you are right. I overlooked that since dev_gro_receive() still
use INDIRECT_CALL_INET(), though all GRO callbacks were made
built-in.

Seems like more code can be optimized, thanks!

> Cheers,
>
> Paolo

Al



[PATCH net-next 3/4] vlan/8021q: avoid retpoline overhead on GRO

2021-03-18 Thread Alexander Lobakin
The two most popular headers going after VLAN are IPv4 and IPv6.
Retpoline overhead for them is addressed only in dev_gro_receive(),
when they lie right after the outermost Ethernet header.
Use the indirect call wrappers in VLAN GRO receive code to reduce
the penalty on receiving tagged frames (when hardware stripping is
off or not available).

Signed-off-by: Alexander Lobakin 
---
 net/8021q/vlan_core.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
index 78ec2e1b14d1..59bc13b5f14f 100644
--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -4,6 +4,7 @@
 #include 
 #include 
 #include 
+#include <net/gro.h>
 #include "vlan.h"

 bool vlan_do_receive(struct sk_buff **skbp)
@@ -495,7 +496,10 @@ static struct sk_buff *vlan_gro_receive(struct list_head 
*head,

skb_gro_pull(skb, sizeof(*vhdr));
skb_gro_postpull_rcsum(skb, vhdr, sizeof(*vhdr));
-   pp = call_gro_receive(ptype->callbacks.gro_receive, head, skb);
+
+   pp = indirect_call_gro_receive_inet(ptype->callbacks.gro_receive,
+   ipv6_gro_receive, inet_gro_receive,
+   head, skb);

 out_unlock:
rcu_read_unlock();
@@ -515,7 +519,9 @@ static int vlan_gro_complete(struct sk_buff *skb, int nhoff)
rcu_read_lock();
ptype = gro_find_complete_by_type(type);
if (ptype)
-   err = ptype->callbacks.gro_complete(skb, nhoff + sizeof(*vhdr));
+   err = INDIRECT_CALL_INET(ptype->callbacks.gro_complete,
+ipv6_gro_complete, inet_gro_complete,
+skb, nhoff + sizeof(*vhdr));

rcu_read_unlock();
return err;
--
2.31.0




[PATCH net-next 4/4] ethernet: avoid retpoline overhead on TEB (GENEVE, NvGRE, VxLAN) GRO

2021-03-18 Thread Alexander Lobakin
The two most popular headers going after Ethernet are IPv4 and IPv6.
Retpoline overhead for them is addressed only in dev_gro_receive(),
when they lie right after the outermost Ethernet header.
Use the indirect call wrappers in TEB (Transparent Ethernet Bridging,
such as GENEVE, NvGRE, VxLAN etc.) GRO receive code to reduce the
penalty when processing the inner headers.

Signed-off-by: Alexander Lobakin 
---
 net/ethernet/eth.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index e01cf766d2c5..933b427122be 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -58,6 +58,7 @@
 #include 
 #include 
 #include 
+#include <net/gro.h>
 #include 
 #include 

@@ -449,7 +450,10 @@ struct sk_buff *eth_gro_receive(struct list_head *head, 
struct sk_buff *skb)

skb_gro_pull(skb, sizeof(*eh));
skb_gro_postpull_rcsum(skb, eh, sizeof(*eh));
-   pp = call_gro_receive(ptype->callbacks.gro_receive, head, skb);
+
+   pp = indirect_call_gro_receive_inet(ptype->callbacks.gro_receive,
+   ipv6_gro_receive, inet_gro_receive,
+   head, skb);

 out_unlock:
rcu_read_unlock();
@@ -473,8 +477,9 @@ int eth_gro_complete(struct sk_buff *skb, int nhoff)
rcu_read_lock();
ptype = gro_find_complete_by_type(type);
if (ptype != NULL)
-   err = ptype->callbacks.gro_complete(skb, nhoff +
-   sizeof(struct ethhdr));
+   err = INDIRECT_CALL_INET(ptype->callbacks.gro_complete,
+ipv6_gro_complete, inet_gro_complete,
+skb, nhoff + sizeof(*eh));

rcu_read_unlock();
return err;
--
2.31.0




[PATCH net-next 2/4] gro: add combined call_gro_receive() + INDIRECT_CALL_INET() helper

2021-03-18 Thread Alexander Lobakin
call_gro_receive() is used to limit GRO recursion, but it works only
with callback pointers.
There's a combined version of call_gro_receive() + INDIRECT_CALL_2()
in , but it doesn't check for IPv6 modularity.
Add a similar new helper to cover both of these. It can and will be
used to avoid retpoline overhead when IP header lies behind another
offloaded proto.

Signed-off-by: Alexander Lobakin 
---
 include/net/gro.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/include/net/gro.h b/include/net/gro.h
index 27c38b36df16..01edaf3fdda0 100644
--- a/include/net/gro.h
+++ b/include/net/gro.h
@@ -14,4 +14,12 @@ INDIRECT_CALLABLE_DECLARE(int ipv6_gro_complete(struct 
sk_buff *, int));
 INDIRECT_CALLABLE_DECLARE(struct sk_buff *inet_gro_receive(struct list_head *,
   struct sk_buff *));
 INDIRECT_CALLABLE_DECLARE(int inet_gro_complete(struct sk_buff *, int));
+
+#define indirect_call_gro_receive_inet(cb, f2, f1, head, skb)  \
+({ \
+   unlikely(gro_recursion_inc_test(skb)) ? \
+   NAPI_GRO_CB(skb)->flush |= 1, NULL :\
+   INDIRECT_CALL_INET(cb, f2, f1, head, skb);  \
+})
+
 #endif /* _NET_IPV6_GRO_H */
--
2.31.0




[PATCH net-next 1/4] gro: make net/gro.h self-contained

2021-03-18 Thread Alexander Lobakin
If some source file includes <net/gro.h>, but doesn't include
<linux/indirect_call_wrapper.h>:

In file included from net/8021q/vlan_core.c:7:
./include/net/gro.h:6:1: warning: data definition has no type or storage class
6 | INDIRECT_CALLABLE_DECLARE(struct sk_buff *ipv6_gro_receive(struct 
list_head *,
  | ^
./include/net/gro.h:6:1: error: type defaults to ‘int’ in declaration of 
‘INDIRECT_CALLABLE_DECLARE’ [-Werror=implicit-int]

[...]

Include <linux/indirect_call_wrapper.h> directly. It's small and
won't pull lots of dependencies.
Also add a couple of incomplete (forward) struct declarations so the
header is fully self-contained.

Fixes: 04f00ab2275f ("net/core: move gro function declarations to separate 
header ")
Signed-off-by: Alexander Lobakin 
---
 include/net/gro.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/include/net/gro.h b/include/net/gro.h
index 8a6eb5303cc4..27c38b36df16 100644
--- a/include/net/gro.h
+++ b/include/net/gro.h
@@ -3,6 +3,11 @@
 #ifndef _NET_IPV6_GRO_H
 #define _NET_IPV6_GRO_H

+#include <linux/indirect_call_wrapper.h>
+
+struct list_head;
+struct sk_buff;
+
 INDIRECT_CALLABLE_DECLARE(struct sk_buff *ipv6_gro_receive(struct list_head *,
   struct sk_buff *));
 INDIRECT_CALLABLE_DECLARE(int ipv6_gro_complete(struct sk_buff *, int));
--
2.31.0




[PATCH net-next 0/4] net: avoid retpoline overhead on VLAN and TEB GRO

2021-03-18 Thread Alexander Lobakin
dev_gro_receive() uses indirect calls for IP GRO functions, but
it works only for the outermost headers and untagged frames.
Simple VLAN tag before an IP header restores the performance hit.
This simple series straightens the GRO calls for IP headers going
after VLAN tag or inner Ethernet header (GENEVE, NvGRE, VxLAN)
for retpolined kernels.

Alexander Lobakin (4):
  gro: make net/gro.h self-contained
  gro: add combined call_gro_receive() + INDIRECT_CALL_INET() helper
  vlan/8021q: avoid retpoline overhead on GRO
  ethernet: avoid retpoline overhead on TEB (GENEVE, NvGRE, VxLAN) GRO

 include/net/gro.h | 13 +
 net/8021q/vlan_core.c | 10 --
 net/ethernet/eth.c| 11 ---
 3 files changed, 29 insertions(+), 5 deletions(-)

--
2.31.0




Re: [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users

2021-03-17 Thread Alexander Lobakin
From: Jesper Dangaard Brouer 
Date: Wed, 17 Mar 2021 18:19:43 +0100

> On Wed, 17 Mar 2021 16:52:32 +
> Alexander Lobakin  wrote:
>
> > From: Jesper Dangaard Brouer 
> > Date: Wed, 17 Mar 2021 17:38:44 +0100
> >
> > > On Wed, 17 Mar 2021 16:31:07 +
> > > Alexander Lobakin  wrote:
> > >
> > > > From: Mel Gorman 
> > > > Date: Fri, 12 Mar 2021 15:43:24 +
> > > >
> > > > Hi there,
> > > >
> > > > > This series is based on top of Matthew Wilcox's series "Rationalise
> > > > > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > > > > test and are not using Andrew's tree as a baseline, I suggest using 
> > > > > the
> > > > > following git tree
> > > > >
> > > > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git 
> > > > > mm-bulk-rebase-v4r2
> > > >
> > > > I gave this series a go on my setup, it showed a bump of 10 Mbps on
> > > > UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.
> > > >
> > > > (4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
> > > > allocations with MTU of 1508 bytes, linear frames via build_skb(),
> > > > GRO + TSO/USO)
> > >
> > > What NIC driver is this?
> >
> > Ah, forgot to mention. It's a WIP driver, not yet mainlined.
> > The NIC itself is basically on-SoC 1G chip.
>
> Hmm, then it is really hard to check if your driver is doing something
> else that could cause this.
>
> Well, can you try to lower the page_pool bulking size, to test the
> theory from Wilcox that we should do smaller bulking to avoid pushing
> cachelines into L2 when walking the LRU list.  You might have to go as
> low as bulk=8 (for N-way associative level of L1 cache).

Turned out it suffered from GCC's decisions.
All of the following was taken on GCC 10.2.0 with -O2 in dotconfig.

vmlinux differences between baseline and this series:

(I used your followup instead of the last patch from the tree)

Function                      old     new   delta
__rmqueue_pcplist               -    2024   +2024
__alloc_pages_bulk              -    1456   +1456
__page_pool_alloc_pages_slow  284     600    +316
page_pool_dma_map               -     164    +164
get_page_from_freelist       5676    3760   -1916

The uninlining of __rmqueue_pcplist() hurts a lot. It slightly slows
down the "regular" page allocator, but makes __alloc_pages_bulk()
much slower than the per-page path (in my case at least) due to the
now out-of-line call to this function inside the loop.

One possible solution is to mark __rmqueue_pcplist() and
rmqueue_bulk() as __always_inline. It has to be both of them, and only
with __always_inline, otherwise GCC will emit rmqueue_bulk.constprop and
make the numbers even poorer.
This nearly doubles the size of bulk allocator, but eliminates
all performance hits.
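
I.e. something like this on top of mm/page_alloc.c (argument lists
trimmed here, just to show the annotation):

-static struct page *__rmqueue_pcplist(struct zone *zone, ...)
+static __always_inline
+struct page *__rmqueue_pcplist(struct zone *zone, ...)

-static int rmqueue_bulk(struct zone *zone, unsigned int order, ...)
+static __always_inline
+int rmqueue_bulk(struct zone *zone, unsigned int order, ...)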

Function                         old     new   delta
__alloc_pages_bulk              1456    3512   +2056
get_page_from_freelist          3760    5744   +1984
find_suitable_fallback.part        -     160    +160
min_free_kbytes_sysctl_handler    96     128     +32
find_suitable_fallback           164      28    -136
__rmqueue_pcplist               2024       -   -2024

Between baseline and this series with __always_inline hints:

Function                         old     new   delta
__alloc_pages_bulk                 -    3512   +3512
find_suitable_fallback.part        -     160    +160
get_page_from_freelist          5676    5744     +68
min_free_kbytes_sysctl_handler    96     128     +32
find_suitable_fallback           164      28    -136

Another suboptimal place I've found is two functions in Page Pool
code which are marked as 'noinline'.
Maybe there's a reason behind this, but removing the annotations
and additionally marking page_pool_dma_map() as inline simplifies
the object code and in fact improves the performance (+15 Mbps on
my setup):

add/remove: 0/3 grow/shrink: 1/0 up/down: 1024/-1096 (-72)
Function                      old     new   delta
page_pool_alloc_pages         100    1124   +1024
page_pool_dma_map             164       -    -164
page_pool_refill_alloc_cache  332       -    -332
__page_pool_alloc_pages_slow  600       -    -600

1124 is a normal size for a hotpath function.
These fragmentation and jumps between page_pool_alloc_pages(),
__page_pool_alloc_pages_slow() and page_pool_refill_alloc_cache()
a

Re: [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users

2021-03-17 Thread Alexander Lobakin
From: Jesper Dangaard Brouer 
Date: Wed, 17 Mar 2021 17:38:44 +0100

> On Wed, 17 Mar 2021 16:31:07 +
> Alexander Lobakin  wrote:
>
> > From: Mel Gorman 
> > Date: Fri, 12 Mar 2021 15:43:24 +
> >
> > Hi there,
> >
> > > This series is based on top of Matthew Wilcox's series "Rationalise
> > > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > > test and are not using Andrew's tree as a baseline, I suggest using the
> > > following git tree
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git 
> > > mm-bulk-rebase-v4r2
> >
> > I gave this series a go on my setup, it showed a bump of 10 Mbps on
> > UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.
> >
> > (4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
> > allocations with MTU of 1508 bytes, linear frames via build_skb(),
> > GRO + TSO/USO)
>
> What NIC driver is this?

Ah, forgot to mention. It's a WIP driver, not yet mainlined.
The NIC itself is basically on-SoC 1G chip.

> > I didn't have time to drill into the code, so for now can't provide
> > any additional details. You can request anything you need though and
> > I'll try to find a window to collect it.
> >
> > > Note to Chuck and Jesper -- as this is a cross-subsystem series, you may
> > > want to send the sunrpc and page_pool pre-requisites (patches 4 and 6)
> > > directly to the subsystem maintainers. While sunrpc is low-risk, I'm
> > > vaguely aware that there are other prototype series on netdev that affect
> > > page_pool. The conflict should be obvious in linux-next.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

Al



Re: [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users

2021-03-17 Thread Alexander Lobakin
From: Mel Gorman 
Date: Fri, 12 Mar 2021 15:43:24 +

Hi there,

> This series is based on top of Matthew Wilcox's series "Rationalise
> __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> test and are not using Andrew's tree as a baseline, I suggest using the
> following git tree
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git 
> mm-bulk-rebase-v4r2

I gave this series a go on my setup, it showed a bump of 10 Mbps on
UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.

(4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
allocations with MTU of 1508 bytes, linear frames via build_skb(),
GRO + TSO/USO)

I didn't have time to drill into the code, so for now can't provide
any additional details. You can request anything you need though and
I'll try to find a window to collect it.

> Note to Chuck and Jesper -- as this is a cross-subsystem series, you may
> want to send the sunrpc and page_pool pre-requisites (patches 4 and 6)
> directly to the subsystem maintainers. While sunrpc is low-risk, I'm
> vaguely aware that there are other prototype series on netdev that affect
> page_pool. The conflict should be obvious in linux-next.
>
> Changelog since v3
> o Rebase on top of Matthew's series consolidating the alloc_pages API
> o Rename alloced to allocated
> o Split out preparation patch for prepare_alloc_pages
> o Defensive check for bulk allocation or <= 0 pages
> o Call single page allocation path only if no pages were allocated
> o Minor cosmetic cleanups
> o Reorder patch dependencies by subsystem. As this is a cross-subsystem
>   series, the mm patches have to be merged before the sunrpc and net
>   users.
>
> Changelog since v2
> o Prep new pages with IRQs enabled
> o Minor documentation update
>
> Changelog since v1
> o Parenthesise binary and boolean comparisons
> o Add reviewed-bys
> o Rebase to 5.12-rc2
>
> This series introduces a bulk order-0 page allocator with sunrpc and
> the network page pool being the first users. The implementation is not
> particularly efficient and the intention is to iron out what the semantics
> of the API should have for users. Once the semantics are ironed out, it
> can be made more efficient. Despite that, this is a performance-related
> for users that require multiple pages for an operation without multiple
> round-trips to the page allocator. Quoting the last patch for the
> high-speed networking use-case.
>
> For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
> redirecting xdp_frame packets into a veth, that does XDP_PASS to
> create an SKB from the xdp_frame, which then cannot return the page
> to the page_pool. In this case, we saw[1] an improvement of 18.8%
> from using the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).
>
> Both users in this series are corner cases (NFS and high-speed networks)
> so it is unlikely that most users will see any benefit in the short
> term. Potential other users are batch allocations for page cache
> readahead, fault around and SLUB allocations when high-order pages are
> unavailable. It's unknown how much benefit would be seen by converting
> multiple page allocation calls to a single batch or what difference it may
> make to headline performance. It's a chicken and egg problem given that
> the potential benefit cannot be investigated without an implementation
> to test against.
>
> Light testing passed, I'm relying on Chuck and Jesper to test the target
> users more aggressively but both report performance improvements with
> the initial RFC.
>
> Patch 1 moves GFP flag initialision to prepare_alloc_pages
>
> Patch 2 renames a variable name that is particularly unpopular
>
> Patch 3 adds a bulk page allocator
>
> Patch 4 is a sunrpc cleanup that is a pre-requisite.
>
> Patch 5 is the sunrpc user. Chuck also has a patch which further caches
>   pages but is not included in this series. It's not directly
>   related to the bulk allocator and as it caches pages, it might
>   have other concerns (e.g. does it need a shrinker?)
>
> Patch 6 is a preparation patch only for the network user
>
> Patch 7 converts the net page pool to the bulk allocator for order-0 pages.
>
>  include/linux/gfp.h   |  12 
>  mm/page_alloc.c   | 149 +-
>  net/core/page_pool.c  | 101 +---
>  net/sunrpc/svc_xprt.c |  47 +
>  4 files changed, 240 insertions(+), 69 deletions(-)
>
> --
> 2.26.2

Thanks,
Al



Re: [PATCH v3 net-next 4/6] linux/etherdevice.h: misc trailing whitespace cleanup

2021-03-17 Thread Alexander Lobakin
From: Jakub Kicinski 
Date: Mon, 15 Mar 2021 12:00:39 -0700

> On Mon, 15 Mar 2021 09:38:57 +0000 Alexander Lobakin wrote:
> > From: Vladimir Oltean 
> > Date: Sun, 14 Mar 2021 23:04:53 +0200
> >
> > > On Sun, Mar 14, 2021 at 11:11:32AM +, Alexander Lobakin wrote:
> > > > Caught by the text editor. Fix it separately from the actual changes.
> > > >
> > > > Signed-off-by: Alexander Lobakin 
> > > > ---
> > > >  include/linux/etherdevice.h | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
> > > > index 2e5debc0373c..bcb2f81baafb 100644
> > > > --- a/include/linux/etherdevice.h
> > > > +++ b/include/linux/etherdevice.h
> > > > @@ -11,7 +11,7 @@
> > > >   * Authors:Ross Biro
> > > >   * Fred N. van Kempen, 
> > > >   *
> > > > - * Relocated to include/linux where it belongs by Alan Cox
> > > > + * Relocated to include/linux where it belongs by Alan Cox
> > > >   * 
> > > > 
> > > >   */
> > > >  #ifndef _LINUX_ETHERDEVICE_H
> > > > --
> > > > 2.30.2
> > > >
> > > >
> > >
> > > Your mailer did something weird here, it trimmed the trailing whitespace
> > > from the "-" line. The patch doesn't apply.
> >
> > It's git-send-email + ProtonMail Bridge... Seems like that's the
> > reason why only this series of mine was failing Patchwork
every time.
>
> Yup. Sorry for the lack of logs in NIPA, you can run
>
>   git pw series apply $id
>
> and look at the output there. That's what I do, anyway, to double check
> that the patch doesn't apply so the logs are of limited use.

Got it, thanks!

Al



Re: [PATCH v2] MIPS: Check __clang__ to avoid performance influence with GCC in csum_tcpudp_nofold()

2021-03-15 Thread Alexander Lobakin
From: Tiezhu Yang 
Date: Tue, 9 Mar 2021 12:18:13 +0800

> The asm code in csum_tcpudp_nofold() is performance-critical, I am sorry
> for the poorly considered implementation about the performance influence
> with GCC in the commit 198688edbf77 ("MIPS: Fix inline asm input/output
> type mismatch in checksum.h used with Clang").
>
> With this patch, we can build successfully by both GCC and Clang,
> at the same time, we can avoid the potential performance influence
> with GCC.
>
> Signed-off-by: Tiezhu Yang 
> ---
>  arch/mips/include/asm/checksum.h | 12 
>  1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/arch/mips/include/asm/checksum.h 
> b/arch/mips/include/asm/checksum.h
> index 1e6c135..80eddd4 100644
> --- a/arch/mips/include/asm/checksum.h
> +++ b/arch/mips/include/asm/checksum.h
> @@ -128,9 +128,13 @@ static inline __sum16 ip_fast_csum(const void *iph, 
> unsigned int ihl)
>
>  static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
>   __u32 len, __u8 proto,
> - __wsum sum)
> + __wsum sum_in)
>  {
> - unsigned long tmp = (__force unsigned long)sum;
> +#ifdef __clang__

Why not rely on CONFIG_CC_IS_CLANG here?
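
I.e. (untested, same logic as below, only the guard changes):

#ifdef CONFIG_CC_IS_CLANG
	unsigned long sum = (__force unsigned long)sum_in;
#else
	__wsum sum = sum_in;
#endif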

> + unsigned long sum = (__force unsigned long)sum_in;
> +#else
> + __wsum sum = sum_in;
> +#endif
>
>   __asm__(
>   "   .setpush# csum_tcpudp_nofold\n"
> @@ -159,7 +163,7 @@ static inline __wsum csum_tcpudp_nofold(__be32 saddr, 
> __be32 daddr,
>   "   addu%0, $1  \n"
>  #endif
>   "   .setpop"
> - : "=r" (tmp)
> + : "=r" (sum)
>   : "0" ((__force unsigned long)daddr),
> "r" ((__force unsigned long)saddr),
>  #ifdef __MIPSEL__
> @@ -169,7 +173,7 @@ static inline __wsum csum_tcpudp_nofold(__be32 saddr, 
> __be32 daddr,
>  #endif
> "r" ((__force unsigned long)sum));
>
> - return (__force __wsum)tmp;
> + return (__force __wsum)sum;
>  }
>  #define csum_tcpudp_nofold csum_tcpudp_nofold
>
> --
> 2.1.0

Al



Re: [PATCH v3 net-next 4/6] linux/etherdevice.h: misc trailing whitespace cleanup

2021-03-15 Thread Alexander Lobakin
From: Vladimir Oltean 
Date: Sun, 14 Mar 2021 23:04:53 +0200

> On Sun, Mar 14, 2021 at 11:11:32AM +0000, Alexander Lobakin wrote:
> > Caught by the text editor. Fix it separately from the actual changes.
> >
> > Signed-off-by: Alexander Lobakin 
> > ---
> >  include/linux/etherdevice.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
> > index 2e5debc0373c..bcb2f81baafb 100644
> > --- a/include/linux/etherdevice.h
> > +++ b/include/linux/etherdevice.h
> > @@ -11,7 +11,7 @@
> >   * Authors:Ross Biro
> >   * Fred N. van Kempen, 
> >   *
> > - * Relocated to include/linux where it belongs by Alan Cox
> > + * Relocated to include/linux where it belongs by Alan Cox
> >   * 
> >   */
> >  #ifndef _LINUX_ETHERDEVICE_H
> > --
> > 2.30.2
> >
> >
>
> Your mailer did something weird here, it trimmed the trailing whitespace
> from the "-" line. The patch doesn't apply.

It's git-send-email + ProtonMail Bridge... Seems like that's the
reason why only this series of mine was failing Patchwork
every time.

Much thanks for finding this out!
Al



[PATCH v3 net-next 5/6] ethernet: constify eth_get_headlen()'s data argument

2021-03-14 Thread Alexander Lobakin
It's used only for flow dissection, which now takes constant data
pointers.

Signed-off-by: Alexander Lobakin 
---
 include/linux/etherdevice.h | 2 +-
 net/ethernet/eth.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index bcb2f81baafb..330345b1be54 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -29,7 +29,7 @@ struct device;
 int eth_platform_get_mac_address(struct device *dev, u8 *mac_addr);
 unsigned char *arch_get_platform_mac_address(void);
 int nvmem_get_mac_address(struct device *dev, void *addrbuf);
-u32 eth_get_headlen(const struct net_device *dev, void *data, unsigned int 
len);
+u32 eth_get_headlen(const struct net_device *dev, const void *data, u32 len);
 __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev);
 extern const struct header_ops eth_header_ops;

diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 4106373180c6..e01cf766d2c5 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -122,7 +122,7 @@ EXPORT_SYMBOL(eth_header);
  * Make a best effort attempt to pull the length for all of the headers for
  * a given frame in a linear buffer.
  */
-u32 eth_get_headlen(const struct net_device *dev, void *data, unsigned int len)
+u32 eth_get_headlen(const struct net_device *dev, const void *data, u32 len)
 {
const unsigned int flags = FLOW_DISSECTOR_F_PARSE_1ST_FRAG;
const struct ethhdr *eth = (const struct ethhdr *)data;
--
2.30.2




[PATCH v3 net-next 4/6] linux/etherdevice.h: misc trailing whitespace cleanup

2021-03-14 Thread Alexander Lobakin
Caught by the text editor. Fix it separately from the actual changes.

Signed-off-by: Alexander Lobakin 
---
 include/linux/etherdevice.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index 2e5debc0373c..bcb2f81baafb 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -11,7 +11,7 @@
  * Authors:Ross Biro
  * Fred N. van Kempen, 
  *
- * Relocated to include/linux where it belongs by Alan Cox
+ * Relocated to include/linux where it belongs by Alan Cox
  * 
  */
 #ifndef _LINUX_ETHERDEVICE_H
--
2.30.2




[PATCH v3 net-next 6/6] skbuff: micro-optimize {,__}skb_header_pointer()

2021-03-14 Thread Alexander Lobakin
{,__}skb_header_pointer() helpers exist mainly for preventing
accesses-beyond-end of the linear data.
In the vast majorify of cases, they bail out on the first condition.
All code going after is mostly a fallback.
Mark the most common branch as 'likely' one to move it in-line.
Also, skb_copy_bits() can return negative values only when the input
arguments are invalid, e.g. offset is greater than skb->len. It can
be safely marked as 'unlikely' branch, assuming that hotpath code
provides sane input to not fail here.

These two bump the throughput by 20 Mbps per flow/core with a single
Flow Dissector pass on every packet (e.g. with RPS or a driver that
uses eth_get_headlen()).

Signed-off-by: Alexander Lobakin 
---
 include/linux/skbuff.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 46c61e127e9f..ecc029674ae4 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3680,11 +3680,10 @@ static inline void * __must_check
 __skb_header_pointer(const struct sk_buff *skb, int offset, int len,
 const void *data, int hlen, void *buffer)
 {
-   if (hlen - offset >= len)
+   if (likely(hlen - offset >= len))
return (void *)data + offset;

-   if (!skb ||
-   skb_copy_bits(skb, offset, buffer, len) < 0)
+   if (!skb || unlikely(skb_copy_bits(skb, offset, buffer, len) < 0))
return NULL;

return buffer;
--
2.30.2




[PATCH v3 net-next 3/6] flow_dissector: constify raw input data argument

2021-03-14 Thread Alexander Lobakin
Flow Dissector code never modifies the input buffer, neither skb nor
raw data.
Make 'data' argument const for all of the Flow dissector's functions.

Signed-off-by: Alexander Lobakin 
---
 include/linux/skbuff.h   | 15 ++---
 include/net/flow_dissector.h |  2 +-
 net/core/flow_dissector.c| 41 +++-
 3 files changed, 30 insertions(+), 28 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d6ea3dc3eddb..46c61e127e9f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1292,10 +1292,10 @@ __skb_set_sw_hash(struct sk_buff *skb, __u32 hash, bool 
is_l4)
 void __skb_get_hash(struct sk_buff *skb);
 u32 __skb_get_hash_symmetric(const struct sk_buff *skb);
 u32 skb_get_poff(const struct sk_buff *skb);
-u32 __skb_get_poff(const struct sk_buff *skb, void *data,
+u32 __skb_get_poff(const struct sk_buff *skb, const void *data,
   const struct flow_keys_basic *keys, int hlen);
 __be32 __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto,
-   void *data, int hlen_proto);
+   const void *data, int hlen_proto);

 static inline __be32 skb_flow_get_ports(const struct sk_buff *skb,
int thoff, u8 ip_proto)
@@ -1314,9 +1314,8 @@ bool bpf_flow_dissect(struct bpf_prog *prog, struct 
bpf_flow_dissector *ctx,
 bool __skb_flow_dissect(const struct net *net,
const struct sk_buff *skb,
struct flow_dissector *flow_dissector,
-   void *target_container,
-   void *data, __be16 proto, int nhoff, int hlen,
-   unsigned int flags);
+   void *target_container, const void *data,
+   __be16 proto, int nhoff, int hlen, unsigned int flags);

 static inline bool skb_flow_dissect(const struct sk_buff *skb,
struct flow_dissector *flow_dissector,
@@ -1338,9 +1337,9 @@ static inline bool skb_flow_dissect_flow_keys(const 
struct sk_buff *skb,
 static inline bool
 skb_flow_dissect_flow_keys_basic(const struct net *net,
 const struct sk_buff *skb,
-struct flow_keys_basic *flow, void *data,
-__be16 proto, int nhoff, int hlen,
-unsigned int flags)
+struct flow_keys_basic *flow,
+const void *data, __be16 proto,
+int nhoff, int hlen, unsigned int flags)
 {
memset(flow, 0, sizeof(*flow));
	return __skb_flow_dissect(net, skb, &flow_keys_basic_dissector, flow,
diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index bf00e71816ed..ffd386ea0dbb 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -350,7 +350,7 @@ static inline bool flow_keys_have_l4(const struct flow_keys 
*keys)
 u32 flow_hash_from_keys(struct flow_keys *keys);
 void skb_flow_get_icmp_tci(const struct sk_buff *skb,
   struct flow_dissector_key_icmp *key_icmp,
-  void *data, int thoff, int hlen);
+  const void *data, int thoff, int hlen);

 static inline bool dissector_uses_key(const struct flow_dissector 
*flow_dissector,
  enum flow_dissector_key_id key_id)
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 2ef2224b3bff..2ed380d096ce 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -114,7 +114,7 @@ int flow_dissector_bpf_prog_attach_check(struct net *net,
  * is the protocol port offset returned from proto_ports_offset
  */
 __be32 __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto,
-   void *data, int hlen)
+   const void *data, int hlen)
 {
int poff = proto_ports_offset(ip_proto);

@@ -161,7 +161,7 @@ static bool icmp_has_id(u8 type)
  */
 void skb_flow_get_icmp_tci(const struct sk_buff *skb,
   struct flow_dissector_key_icmp *key_icmp,
-  void *data, int thoff, int hlen)
+  const void *data, int thoff, int hlen)
 {
struct icmphdr *ih, _ih;

@@ -187,8 +187,8 @@ EXPORT_SYMBOL(skb_flow_get_icmp_tci);
  */
 static void __skb_flow_dissect_icmp(const struct sk_buff *skb,
struct flow_dissector *flow_dissector,
-   void *target_container,
-   void *data, int thoff, int hlen)
+   void *target_container, const void *data,
+   int thoff, int hlen)
 {
struct flow_dissector_key_icmp *key_icmp;

@@ -409,8 +409,8 @@ EXPORT_SYMBOL(skb_flow_dissect_hash);
 static

[PATCH v3 net-next 2/6] skbuff: make __skb_header_pointer()'s data argument const

2021-03-14 Thread Alexander Lobakin
The function never modifies the input buffer, so 'data' argument
can be marked as const.
This implies one harmless cast-away.

Signed-off-by: Alexander Lobakin 
---
 include/linux/skbuff.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 483e89348f78..d6ea3dc3eddb 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3678,11 +3678,11 @@ __wsum skb_checksum(const struct sk_buff *skb, int 
offset, int len,
__wsum csum);

 static inline void * __must_check
-__skb_header_pointer(const struct sk_buff *skb, int offset,
-int len, void *data, int hlen, void *buffer)
+__skb_header_pointer(const struct sk_buff *skb, int offset, int len,
+const void *data, int hlen, void *buffer)
 {
if (hlen - offset >= len)
-   return data + offset;
+   return (void *)data + offset;

if (!skb ||
skb_copy_bits(skb, offset, buffer, len) < 0)
--
2.30.2




[PATCH v3 net-next 0/6] skbuff: micro-optimize flow dissection

2021-03-14 Thread Alexander Lobakin
This little number makes all of the flow dissection functions take
raw input data pointer as const (1-5) and shuffles the branches in
__skb_header_pointer() according to their hit probability.

The result is +20 Mbps per flow/core with one Flow Dissector pass
per packet. This affects RPS (with software hashing), drivers that
use eth_get_headlen() on their Rx path and so on.

From v2 [1]:
 - reword some commit messages as a potential fix for NIPA;
 - no functional changes.

From v1 [0]:
 - rebase on top of the latest net-next. This was super-weird, but
   I double-checked that the series applies with no conflicts, and
   then on Patchwork it didn't;
 - no other changes.

[0] https://lore.kernel.org/netdev/20210312194538.337504-1-aloba...@pm.me
[1] https://lore.kernel.org/netdev/20210313113645.5949-1-aloba...@pm.me

Alexander Lobakin (6):
  flow_dissector: constify bpf_flow_dissector's data pointers
  skbuff: make __skb_header_pointer()'s data argument const
  flow_dissector: constify raw input data argument
  linux/etherdevice.h: misc trailing whitespace cleanup
  ethernet: constify eth_get_headlen()'s data argument
  skbuff: micro-optimize {,__}skb_header_pointer()

 include/linux/etherdevice.h  |  4 ++--
 include/linux/skbuff.h   | 26 +++
 include/net/flow_dissector.h |  6 +++---
 net/core/flow_dissector.c| 41 +++-
 net/ethernet/eth.c   |  2 +-
 5 files changed, 40 insertions(+), 39 deletions(-)

--
2.30.2




[PATCH v3 net-next 1/6] flow_dissector: constify bpf_flow_dissector's data pointers

2021-03-14 Thread Alexander Lobakin
BPF Flow dissection programs are read-only and don't touch input
buffers.
Mark 'data' and 'data_end' in struct bpf_flow_dissector as const
in preparation for global input constifying.

Signed-off-by: Alexander Lobakin 
---
 include/net/flow_dissector.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index cc10b10dc3a1..bf00e71816ed 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -368,8 +368,8 @@ static inline void *skb_flow_dissector_target(struct 
flow_dissector *flow_dissec
 struct bpf_flow_dissector {
struct bpf_flow_keys    *flow_keys;
const struct sk_buff    *skb;
-   void                    *data;
-   void                    *data_end;
+   const void              *data;
+   const void              *data_end;
 };

 static inline void
--
2.30.2




Re: [PATCH v2 net-next 0/6] skbuff: micro-optimize flow dissection

2021-03-14 Thread Alexander Lobakin
From: David Miller 
Date: Sat, 13 Mar 2021 18:10:00 -0800 (PST)

> None of these apply to net-next as per the patchwork automated checks.  Any 
> idea why?

No, unfortunately. That's why I sent a follow-up mail. All of them
successfully apply to pure net-next on my machine.
Can it be a Git version conflict? I use 2.30.2, but also tried 2.30.1
and the latest development snapshot, and in either case they got
applied with no problems.

I could have more of a clue if NIPA provided a more detailed log, but I
didn't find any. And the apply_patches stage doesn't seem to be present
in the NIPA GitHub repo, so I couldn't reproduce it 1:1.

I can try again on Monday, not sure if it will help.
I also sent another series yesterday, and it has 15 green lights on
Patchwork, so this problem seems to take place only with this
particular series.

> Thanks.

Thanks,
Al



[PATCH v2 net-next 2/3] gro: consistentify napi->gro_hash[x] access in dev_gro_receive()

2021-03-13 Thread Alexander Lobakin
GRO bucket index doesn't change through the entire function.
Store a pointer to the corresponding bucket instead of its member
and use it consistently through the function.
It is performance-safe since &gro_list->list == gro_list.

Misc: remove superfluous braces around single-line branches.

Signed-off-by: Alexander Lobakin 
---
 net/core/dev.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 1317e6b6758a..b635467087f3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5953,7 +5953,7 @@ static void gro_flush_oldest(struct napi_struct *napi, 
struct list_head *head)
 static enum gro_result dev_gro_receive(struct napi_struct *napi, struct 
sk_buff *skb)
 {
u32 hash = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
-   struct list_head *gro_head = &napi->gro_hash[hash].list;
+   struct gro_list *gro_list = &napi->gro_hash[hash];
struct list_head *head = &offload_base;
struct packet_offload *ptype;
__be16 type = skb->protocol;
@@ -5965,7 +5965,7 @@ static enum gro_result dev_gro_receive(struct napi_struct 
*napi, struct sk_buff
if (netif_elide_gro(skb->dev))
goto normal;

-   gro_list_prepare(gro_head, skb);
+   gro_list_prepare(&gro_list->list, skb);

rcu_read_lock();
list_for_each_entry_rcu(ptype, head, list) {
@@ -6001,7 +6001,7 @@ static enum gro_result dev_gro_receive(struct napi_struct 
*napi, struct sk_buff

pp = INDIRECT_CALL_INET(ptype->callbacks.gro_receive,
ipv6_gro_receive, inet_gro_receive,
-   gro_head, skb);
+   &gro_list->list, skb);
break;
}
rcu_read_unlock();
@@ -6020,7 +6020,7 @@ static enum gro_result dev_gro_receive(struct napi_struct 
*napi, struct sk_buff
if (pp) {
skb_list_del_init(pp);
napi_gro_complete(napi, pp);
-   napi->gro_hash[hash].count--;
+   gro_list->count--;
}

if (same_flow)
@@ -6029,16 +6029,16 @@ static enum gro_result dev_gro_receive(struct 
napi_struct *napi, struct sk_buff
if (NAPI_GRO_CB(skb)->flush)
goto normal;

-   if (unlikely(napi->gro_hash[hash].count >= MAX_GRO_SKBS)) {
-   gro_flush_oldest(napi, gro_head);
-   } else {
-   napi->gro_hash[hash].count++;
-   }
+   if (unlikely(gro_list->count >= MAX_GRO_SKBS))
+   gro_flush_oldest(napi, &gro_list->list);
+   else
+   gro_list->count++;
+
NAPI_GRO_CB(skb)->count = 1;
NAPI_GRO_CB(skb)->age = jiffies;
NAPI_GRO_CB(skb)->last = skb;
skb_shinfo(skb)->gso_size = skb_gro_len(skb);
-   list_add(&skb->list, gro_head);
+   list_add(&skb->list, &gro_list->list);
ret = GRO_HELD;

 pull:
@@ -6046,7 +6046,7 @@ static enum gro_result dev_gro_receive(struct napi_struct 
*napi, struct sk_buff
if (grow > 0)
gro_pull_from_frag0(skb, grow);
 ok:
-   if (napi->gro_hash[hash].count) {
+   if (gro_list->count) {
if (!test_bit(hash, &napi->gro_bitmask))
__set_bit(hash, &napi->gro_bitmask);
} else if (test_bit(hash, &napi->gro_bitmask)) {
--
2.30.2




[PATCH v2 net-next 3/3] gro: give 'hash' variable in dev_gro_receive() a less confusing name

2021-03-13 Thread Alexander Lobakin
'hash' stores not the flow hash, but the index of the GRO bucket
corresponding to it.
Change its name to 'bucket' to avoid confusion while reading lines
like '__set_bit(hash, >gro_bitmask)'.

Signed-off-by: Alexander Lobakin 
---
 net/core/dev.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index b635467087f3..5a2847a19cf2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5952,8 +5952,8 @@ static void gro_flush_oldest(struct napi_struct *napi, 
struct list_head *head)

 static enum gro_result dev_gro_receive(struct napi_struct *napi, struct 
sk_buff *skb)
 {
-   u32 hash = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
-   struct gro_list *gro_list = &napi->gro_hash[hash];
+   u32 bucket = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
+   struct gro_list *gro_list = &napi->gro_hash[bucket];
struct list_head *head = &offload_base;
struct packet_offload *ptype;
__be16 type = skb->protocol;
@@ -6047,10 +6047,10 @@ static enum gro_result dev_gro_receive(struct 
napi_struct *napi, struct sk_buff
gro_pull_from_frag0(skb, grow);
 ok:
if (gro_list->count) {
-   if (!test_bit(hash, &napi->gro_bitmask))
-   __set_bit(hash, &napi->gro_bitmask);
-   } else if (test_bit(hash, &napi->gro_bitmask)) {
-   __clear_bit(hash, &napi->gro_bitmask);
+   if (!test_bit(bucket, &napi->gro_bitmask))
+   __set_bit(bucket, &napi->gro_bitmask);
+   } else if (test_bit(bucket, &napi->gro_bitmask)) {
+   __clear_bit(bucket, &napi->gro_bitmask);
}

return ret;
--
2.30.2




[PATCH v2 net-next 0/3] gro: micro-optimize dev_gro_receive()

2021-03-13 Thread Alexander Lobakin
This random series addresses some of the suboptimal constructions used
in the main GRO entry point.
The main body is gro_list_prepare() simplification and pointer usage
optimization in dev_gro_receive() itself. Being mostly cosmetic, it
gives like +10 Mbps on my setup to both TCP and UDP (both single- and
multi-flow).

Since v1 [0]:
 - drop the replacement of bucket index calculation with
   reciprocal_scale() since it makes absolutely no sense (Eric);
 - improve stack usage in dev_gro_receive() (Eric);
 - reverse the order of patches to avoid changes superseding.

[0] https://lore.kernel.org/netdev/20210312162127.239795-1-aloba...@pm.me

Alexander Lobakin (3):
  gro: simplify gro_list_prepare()
  gro: consistentify napi->gro_hash[x] access in dev_gro_receive()
  gro: give 'hash' variable in dev_gro_receive() a less confusing name

 net/core/dev.c | 40 ++--
 1 file changed, 18 insertions(+), 22 deletions(-)

--
2.30.2




[PATCH v2 net-next 1/3] gro: simplify gro_list_prepare()

2021-03-13 Thread Alexander Lobakin
gro_list_prepare() always returns &napi->gro_hash[bucket].list,
without any variations. Moreover, it uses 'napi' argument only to
have access to this list, and calculates the bucket index for the
second time (firstly it happens at the beginning of
dev_gro_receive()) to do that.
Given that dev_gro_receive() already has an index to the needed
list, just pass it as the first argument to eliminate redundant
calculations, and make gro_list_prepare() return void.
Also, both arguments of gro_list_prepare() can be constified since
this function can only modify the skbs from the bucket list.

Signed-off-by: Alexander Lobakin 
---
 net/core/dev.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 2bfdd528c7c3..1317e6b6758a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5858,15 +5858,13 @@ void napi_gro_flush(struct napi_struct *napi, bool 
flush_old)
 }
 EXPORT_SYMBOL(napi_gro_flush);

-static struct list_head *gro_list_prepare(struct napi_struct *napi,
- struct sk_buff *skb)
+static void gro_list_prepare(const struct list_head *head,
+const struct sk_buff *skb)
 {
unsigned int maclen = skb->dev->hard_header_len;
u32 hash = skb_get_hash_raw(skb);
-   struct list_head *head;
struct sk_buff *p;

-   head = &napi->gro_hash[hash & (GRO_HASH_BUCKETS - 1)].list;
list_for_each_entry(p, head, list) {
unsigned long diffs;

@@ -5892,8 +5890,6 @@ static struct list_head *gro_list_prepare(struct 
napi_struct *napi,
   maclen);
NAPI_GRO_CB(p)->same_flow = !diffs;
}
-
-   return head;
 }

 static void skb_gro_reset_offset(struct sk_buff *skb)
@@ -5957,10 +5953,10 @@ static void gro_flush_oldest(struct napi_struct *napi, 
struct list_head *head)
 static enum gro_result dev_gro_receive(struct napi_struct *napi, struct 
sk_buff *skb)
 {
u32 hash = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
+   struct list_head *gro_head = &napi->gro_hash[hash].list;
struct list_head *head = &offload_base;
struct packet_offload *ptype;
__be16 type = skb->protocol;
-   struct list_head *gro_head;
struct sk_buff *pp = NULL;
enum gro_result ret;
int same_flow;
@@ -5969,7 +5965,7 @@ static enum gro_result dev_gro_receive(struct napi_struct 
*napi, struct sk_buff
if (netif_elide_gro(skb->dev))
goto normal;

-   gro_head = gro_list_prepare(napi, skb);
+   gro_list_prepare(gro_head, skb);

rcu_read_lock();
list_for_each_entry_rcu(ptype, head, list) {
--
2.30.2




Re: [PATCH v2 net-next 0/6] skbuff: micro-optimize flow dissection

2021-03-13 Thread Alexander Lobakin
From: Alexander Lobakin 
Date: Sat, 13 Mar 2021 11:37:03 +

> This little number makes all of the flow dissection functions take
> raw input data pointer as const (1-5) and shuffles the branches in
> __skb_header_pointer() according to their hit probability.
>
> The result is +20 Mbps per flow/core with one Flow Dissector pass
> per packet. This affects RPS (with software hashing), drivers that
> use eth_get_headlen() on their Rx path and so on.
>
> Since v1 [0]:
>  - rebase on top of the latest net-next. This was super-weird, but
>I double-checked that the series applies with no conflicts, and
>then on Patchwork it didn't;

Still failing on Patchwork. I rebased it ten thousand times, rebuilt
the patches manually, tried the previous stable Git version and the
latest CVS snapshot, and always got the same series that successfully
applies to net-next.
Can you please take a look?

>  - no other changes.
>
> [0] https://lore.kernel.org/netdev/20210312194538.337504-1-aloba...@pm.me
>
> Alexander Lobakin (6):
>   flow_dissector: constify bpf_flow_dissector's data pointers
>   skbuff: make __skb_header_pointer()'s data argument const
>   flow_dissector: constify raw input @data argument
>   linux/etherdevice.h: misc trailing whitespace cleanup
>   ethernet: constify eth_get_headlen()'s @data argument
>   skbuff: micro-optimize {,__}skb_header_pointer()
>
>  include/linux/etherdevice.h  |  4 ++--
>  include/linux/skbuff.h   | 26 +++
>  include/net/flow_dissector.h |  6 +++---
>  net/core/flow_dissector.c| 41 +++-
>  net/ethernet/eth.c   |  2 +-
>  5 files changed, 40 insertions(+), 39 deletions(-)
>
> --
> 2.30.2

Thanks,
Al



[PATCH v2 net-next 6/6] skbuff: micro-optimize {,__}skb_header_pointer()

2021-03-13 Thread Alexander Lobakin
{,__}skb_header_pointer() helpers exist mainly for preventing
accesses beyond the end of the linear data.
In the vast majority of cases, they bail out on the first condition.
All the code after that is mostly a fallback.
Mark the most common branch as the 'likely' one to move it in-line.
Also, skb_copy_bits() can return negative values only when the input
arguments are invalid, e.g. offset is greater than skb->len. It can
be safely marked as an 'unlikely' branch, assuming that hotpath code
provides sane input to not fail here.

These two bump the throughput with a single Flow Dissector pass on
every packet (e.g. with RPS or a driver that uses eth_get_headlen())
by 20 Mbps per flow/core.

Signed-off-by: Alexander Lobakin 
---
 include/linux/skbuff.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 46c61e127e9f..ecc029674ae4 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3680,11 +3680,10 @@ static inline void * __must_check
 __skb_header_pointer(const struct sk_buff *skb, int offset, int len,
 const void *data, int hlen, void *buffer)
 {
-   if (hlen - offset >= len)
+   if (likely(hlen - offset >= len))
return (void *)data + offset;

-   if (!skb ||
-   skb_copy_bits(skb, offset, buffer, len) < 0)
+   if (!skb || unlikely(skb_copy_bits(skb, offset, buffer, len) < 0))
return NULL;

return buffer;
--
2.30.2




[PATCH v2 net-next 5/6] ethernet: constify eth_get_headlen()'s @data argument

2021-03-13 Thread Alexander Lobakin
It's used only for flow dissection, which now takes constant data
pointers.

Signed-off-by: Alexander Lobakin 
---
 include/linux/etherdevice.h | 2 +-
 net/ethernet/eth.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index bcb2f81baafb..330345b1be54 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -29,7 +29,7 @@ struct device;
 int eth_platform_get_mac_address(struct device *dev, u8 *mac_addr);
 unsigned char *arch_get_platform_mac_address(void);
 int nvmem_get_mac_address(struct device *dev, void *addrbuf);
-u32 eth_get_headlen(const struct net_device *dev, void *data, unsigned int 
len);
+u32 eth_get_headlen(const struct net_device *dev, const void *data, u32 len);
 __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev);
 extern const struct header_ops eth_header_ops;

diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 4106373180c6..e01cf766d2c5 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -122,7 +122,7 @@ EXPORT_SYMBOL(eth_header);
  * Make a best effort attempt to pull the length for all of the headers for
  * a given frame in a linear buffer.
  */
-u32 eth_get_headlen(const struct net_device *dev, void *data, unsigned int len)
+u32 eth_get_headlen(const struct net_device *dev, const void *data, u32 len)
 {
const unsigned int flags = FLOW_DISSECTOR_F_PARSE_1ST_FRAG;
const struct ethhdr *eth = (const struct ethhdr *)data;
--
2.30.2




[PATCH v2 net-next 4/6] linux/etherdevice.h: misc trailing whitespace cleanup

2021-03-13 Thread Alexander Lobakin
Caught by the text editor. Fix it separately from the actual changes.

Signed-off-by: Alexander Lobakin 
---
 include/linux/etherdevice.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index 2e5debc0373c..bcb2f81baafb 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -11,7 +11,7 @@
  * Authors:Ross Biro
  * Fred N. van Kempen, 
  *
- * Relocated to include/linux where it belongs by Alan Cox
+ * Relocated to include/linux where it belongs by Alan Cox
  * 
  */
 #ifndef _LINUX_ETHERDEVICE_H
--
2.30.2




[PATCH v2 net-next 2/6] skbuff: make __skb_header_pointer()'s data argument const

2021-03-13 Thread Alexander Lobakin
The function never modifies the input buffer, so @data argument
can be marked as const.
This implies one harmless cast-away.

Signed-off-by: Alexander Lobakin 
---
 include/linux/skbuff.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 483e89348f78..d6ea3dc3eddb 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3678,11 +3678,11 @@ __wsum skb_checksum(const struct sk_buff *skb, int 
offset, int len,
__wsum csum);

 static inline void * __must_check
-__skb_header_pointer(const struct sk_buff *skb, int offset,
-int len, void *data, int hlen, void *buffer)
+__skb_header_pointer(const struct sk_buff *skb, int offset, int len,
+const void *data, int hlen, void *buffer)
 {
if (hlen - offset >= len)
-   return data + offset;
+   return (void *)data + offset;

if (!skb ||
skb_copy_bits(skb, offset, buffer, len) < 0)
--
2.30.2




[PATCH v2 net-next 3/6] flow_dissector: constify raw input @data argument

2021-03-13 Thread Alexander Lobakin
Flow Dissector code never modifies the input buffer, neither skb nor
raw data.
Make @data argument const for all of the Flow dissector's functions.

Signed-off-by: Alexander Lobakin 
---
 include/linux/skbuff.h   | 15 ++---
 include/net/flow_dissector.h |  2 +-
 net/core/flow_dissector.c| 41 +++-
 3 files changed, 30 insertions(+), 28 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d6ea3dc3eddb..46c61e127e9f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1292,10 +1292,10 @@ __skb_set_sw_hash(struct sk_buff *skb, __u32 hash, bool 
is_l4)
 void __skb_get_hash(struct sk_buff *skb);
 u32 __skb_get_hash_symmetric(const struct sk_buff *skb);
 u32 skb_get_poff(const struct sk_buff *skb);
-u32 __skb_get_poff(const struct sk_buff *skb, void *data,
+u32 __skb_get_poff(const struct sk_buff *skb, const void *data,
   const struct flow_keys_basic *keys, int hlen);
 __be32 __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto,
-   void *data, int hlen_proto);
+   const void *data, int hlen_proto);

 static inline __be32 skb_flow_get_ports(const struct sk_buff *skb,
int thoff, u8 ip_proto)
@@ -1314,9 +1314,8 @@ bool bpf_flow_dissect(struct bpf_prog *prog, struct 
bpf_flow_dissector *ctx,
 bool __skb_flow_dissect(const struct net *net,
const struct sk_buff *skb,
struct flow_dissector *flow_dissector,
-   void *target_container,
-   void *data, __be16 proto, int nhoff, int hlen,
-   unsigned int flags);
+   void *target_container, const void *data,
+   __be16 proto, int nhoff, int hlen, unsigned int flags);

 static inline bool skb_flow_dissect(const struct sk_buff *skb,
struct flow_dissector *flow_dissector,
@@ -1338,9 +1337,9 @@ static inline bool skb_flow_dissect_flow_keys(const 
struct sk_buff *skb,
 static inline bool
 skb_flow_dissect_flow_keys_basic(const struct net *net,
 const struct sk_buff *skb,
-struct flow_keys_basic *flow, void *data,
-__be16 proto, int nhoff, int hlen,
-unsigned int flags)
+struct flow_keys_basic *flow,
+const void *data, __be16 proto,
+int nhoff, int hlen, unsigned int flags)
 {
memset(flow, 0, sizeof(*flow));
return __skb_flow_dissect(net, skb, &flow_keys_basic_dissector, flow,
diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index bf00e71816ed..ffd386ea0dbb 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -350,7 +350,7 @@ static inline bool flow_keys_have_l4(const struct flow_keys 
*keys)
 u32 flow_hash_from_keys(struct flow_keys *keys);
 void skb_flow_get_icmp_tci(const struct sk_buff *skb,
   struct flow_dissector_key_icmp *key_icmp,
-  void *data, int thoff, int hlen);
+  const void *data, int thoff, int hlen);

 static inline bool dissector_uses_key(const struct flow_dissector 
*flow_dissector,
  enum flow_dissector_key_id key_id)
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 2ef2224b3bff..2ed380d096ce 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -114,7 +114,7 @@ int flow_dissector_bpf_prog_attach_check(struct net *net,
  * is the protocol port offset returned from proto_ports_offset
  */
 __be32 __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto,
-   void *data, int hlen)
+   const void *data, int hlen)
 {
int poff = proto_ports_offset(ip_proto);

@@ -161,7 +161,7 @@ static bool icmp_has_id(u8 type)
  */
 void skb_flow_get_icmp_tci(const struct sk_buff *skb,
   struct flow_dissector_key_icmp *key_icmp,
-  void *data, int thoff, int hlen)
+  const void *data, int thoff, int hlen)
 {
struct icmphdr *ih, _ih;

@@ -187,8 +187,8 @@ EXPORT_SYMBOL(skb_flow_get_icmp_tci);
  */
 static void __skb_flow_dissect_icmp(const struct sk_buff *skb,
struct flow_dissector *flow_dissector,
-   void *target_container,
-   void *data, int thoff, int hlen)
+   void *target_container, const void *data,
+   int thoff, int hlen)
 {
struct flow_dissector_key_icmp *key_icmp;

@@ -409,8 +409,8 @@ EXPORT_SYMBOL(skb_flow_dissect_hash);
 static

[PATCH v2 net-next 1/6] flow_dissector: constify bpf_flow_dissector's data pointers

2021-03-13 Thread Alexander Lobakin
BPF Flow dissection programs are read-only and don't touch input
buffers.
Mark @data and @data_end in struct bpf_flow_dissector as const in
preparation for global input constifying.

Signed-off-by: Alexander Lobakin 
---
 include/net/flow_dissector.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index cc10b10dc3a1..bf00e71816ed 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -368,8 +368,8 @@ static inline void *skb_flow_dissector_target(struct 
flow_dissector *flow_dissec
 struct bpf_flow_dissector {
struct bpf_flow_keys    *flow_keys;
const struct sk_buff    *skb;
-   void                    *data;
-   void                    *data_end;
+   const void              *data;
+   const void              *data_end;
 };

 static inline void
--
2.30.2




[PATCH v2 net-next 0/6] skbuff: micro-optimize flow dissection

2021-03-13 Thread Alexander Lobakin
This little number makes all of the flow dissection functions take
raw input data pointer as const (1-5) and shuffles the branches in
__skb_header_pointer() according to their hit probability.

The result is +20 Mbps per flow/core with one Flow Dissector pass
per packet. This affects RPS (with software hashing), drivers that
use eth_get_headlen() on their Rx path and so on.

Since v1 [0]:
 - rebase on top of the latest net-next. This was super-weird, but
   I double-checked that the series applies with no conflicts, and
   then on Patchwork it didn't;
 - no other changes.

[0] https://lore.kernel.org/netdev/20210312194538.337504-1-aloba...@pm.me

Alexander Lobakin (6):
  flow_dissector: constify bpf_flow_dissector's data pointers
  skbuff: make __skb_header_pointer()'s data argument const
  flow_dissector: constify raw input @data argument
  linux/etherdevice.h: misc trailing whitespace cleanup
  ethernet: constify eth_get_headlen()'s @data argument
  skbuff: micro-optimize {,__}skb_header_pointer()

 include/linux/etherdevice.h  |  4 ++--
 include/linux/skbuff.h   | 26 +++
 include/net/flow_dissector.h |  6 +++---
 net/core/flow_dissector.c| 41 +++-
 net/ethernet/eth.c   |  2 +-
 5 files changed, 40 insertions(+), 39 deletions(-)

--
2.30.2




[PATCH net] flow_dissector: fix byteorder of dissected ICMP ID

2021-03-12 Thread Alexander Lobakin
flow_dissector_key_icmp::id is of type u16 (CPU byteorder), while the
ICMP header obviously has its ID field in network byteorder.
Sparse says:

net/core/flow_dissector.c:178:43: warning: restricted __be16 degrades to integer

Convert ID value to CPU byteorder when storing it into
flow_dissector_key_icmp.

Fixes: 5dec597e5cd0 ("flow_dissector: extract more ICMP information")
Signed-off-by: Alexander Lobakin 
---
 net/core/flow_dissector.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 2ef2224b3bff..a96a4f5de0ce 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -176,7 +176,7 @@ void skb_flow_get_icmp_tci(const struct sk_buff *skb,
 * avoid confusion with packets without such field
 */
if (icmp_has_id(ih->type))
-   key_icmp->id = ih->un.echo.id ? : 1;
+   key_icmp->id = ih->un.echo.id ? ntohs(ih->un.echo.id) : 1;
else
key_icmp->id = 0;
 }
--
2.30.2




[PATCH net-next 5/6] ethernet: constify eth_get_headlen()'s @data argument

2021-03-12 Thread Alexander Lobakin
It's used only for flow dissection, which now takes constant data
pointers.

Signed-off-by: Alexander Lobakin 
---
 include/linux/etherdevice.h | 2 +-
 net/ethernet/eth.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index bcb2f81baafb..330345b1be54 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -29,7 +29,7 @@ struct device;
 int eth_platform_get_mac_address(struct device *dev, u8 *mac_addr);
 unsigned char *arch_get_platform_mac_address(void);
 int nvmem_get_mac_address(struct device *dev, void *addrbuf);
-u32 eth_get_headlen(const struct net_device *dev, void *data, unsigned int 
len);
+u32 eth_get_headlen(const struct net_device *dev, const void *data, u32 len);
 __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev);
 extern const struct header_ops eth_header_ops;

diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 4106373180c6..e01cf766d2c5 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -122,7 +122,7 @@ EXPORT_SYMBOL(eth_header);
  * Make a best effort attempt to pull the length for all of the headers for
  * a given frame in a linear buffer.
  */
-u32 eth_get_headlen(const struct net_device *dev, void *data, unsigned int len)
+u32 eth_get_headlen(const struct net_device *dev, const void *data, u32 len)
 {
const unsigned int flags = FLOW_DISSECTOR_F_PARSE_1ST_FRAG;
const struct ethhdr *eth = (const struct ethhdr *)data;
--
2.30.2




[PATCH net-next 6/6] skbuff: micro-optimize {,__}skb_header_pointer()

2021-03-12 Thread Alexander Lobakin
{,__}skb_header_pointer() helpers exist mainly for preventing
accesses beyond the end of the linear data.
In the vast majority of cases, they bail out on the first condition.
All the code after that is mostly a fallback.
Mark the most common branch as the 'likely' one to move it in-line.
Also, skb_copy_bits() can return negative values only when the input
arguments are invalid, e.g. offset is greater than skb->len. It can
be safely marked as an 'unlikely' branch, assuming that hotpath code
provides sane input to not fail here.

These two bump the throughput with a single Flow Dissector pass on
every packet (e.g. with RPS or a driver that uses eth_get_headlen())
by 20 Mbps per flow/core.

Signed-off-by: Alexander Lobakin 
---
 include/linux/skbuff.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 7873f24c0ae5..71f4d609819e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3680,11 +3680,10 @@ static inline void * __must_check
 __skb_header_pointer(const struct sk_buff *skb, int offset, int len,
 const void *data, int hlen, void *buffer)
 {
-   if (hlen - offset >= len)
+   if (likely(hlen - offset >= len))
return (void *)data + offset;

-   if (!skb ||
-   skb_copy_bits(skb, offset, buffer, len) < 0)
+   if (!skb || unlikely(skb_copy_bits(skb, offset, buffer, len) < 0))
return NULL;

return buffer;
--
2.30.2




[PATCH net-next 4/6] linux/etherdevice.h: misc trailing whitespace cleanup

2021-03-12 Thread Alexander Lobakin
Caught by the text editor. Fix it separately from the actual changes.

Signed-off-by: Alexander Lobakin 
---
 include/linux/etherdevice.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index 2e5debc0373c..bcb2f81baafb 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -11,7 +11,7 @@
  * Authors:Ross Biro
  * Fred N. van Kempen, 
  *
- * Relocated to include/linux where it belongs by Alan Cox
+ * Relocated to include/linux where it belongs by Alan Cox
  * 
  */
 #ifndef _LINUX_ETHERDEVICE_H
--
2.30.2




[PATCH net-next 3/6] flow_dissector: constify raw input @data argument

2021-03-12 Thread Alexander Lobakin
Flow Dissector code never modifies the input buffer, neither skb nor
raw data.
Make @data argument const for all of the Flow dissector's functions.

Signed-off-by: Alexander Lobakin 
---
 include/linux/skbuff.h   | 15 ++---
 include/net/flow_dissector.h |  2 +-
 net/core/flow_dissector.c| 41 +++-
 3 files changed, 30 insertions(+), 28 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d93ab74063e5..7873f24c0ae5 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1292,10 +1292,10 @@ __skb_set_sw_hash(struct sk_buff *skb, __u32 hash, bool 
is_l4)
 void __skb_get_hash(struct sk_buff *skb);
 u32 __skb_get_hash_symmetric(const struct sk_buff *skb);
 u32 skb_get_poff(const struct sk_buff *skb);
-u32 __skb_get_poff(const struct sk_buff *skb, void *data,
+u32 __skb_get_poff(const struct sk_buff *skb, const void *data,
   const struct flow_keys_basic *keys, int hlen);
 __be32 __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto,
-   void *data, int hlen_proto);
+   const void *data, int hlen_proto);

 static inline __be32 skb_flow_get_ports(const struct sk_buff *skb,
int thoff, u8 ip_proto)
@@ -1314,9 +1314,8 @@ bool bpf_flow_dissect(struct bpf_prog *prog, struct 
bpf_flow_dissector *ctx,
 bool __skb_flow_dissect(const struct net *net,
const struct sk_buff *skb,
struct flow_dissector *flow_dissector,
-   void *target_container,
-   void *data, __be16 proto, int nhoff, int hlen,
-   unsigned int flags);
+   void *target_container, const void *data,
+   __be16 proto, int nhoff, int hlen, unsigned int flags);

 static inline bool skb_flow_dissect(const struct sk_buff *skb,
struct flow_dissector *flow_dissector,
@@ -1338,9 +1337,9 @@ static inline bool skb_flow_dissect_flow_keys(const 
struct sk_buff *skb,
 static inline bool
 skb_flow_dissect_flow_keys_basic(const struct net *net,
 const struct sk_buff *skb,
-struct flow_keys_basic *flow, void *data,
-__be16 proto, int nhoff, int hlen,
-unsigned int flags)
+struct flow_keys_basic *flow,
+const void *data, __be16 proto,
+int nhoff, int hlen, unsigned int flags)
 {
memset(flow, 0, sizeof(*flow));
return __skb_flow_dissect(net, skb, &flow_keys_basic_dissector, flow,
diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index bf00e71816ed..ffd386ea0dbb 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -350,7 +350,7 @@ static inline bool flow_keys_have_l4(const struct flow_keys 
*keys)
 u32 flow_hash_from_keys(struct flow_keys *keys);
 void skb_flow_get_icmp_tci(const struct sk_buff *skb,
   struct flow_dissector_key_icmp *key_icmp,
-  void *data, int thoff, int hlen);
+  const void *data, int thoff, int hlen);

 static inline bool dissector_uses_key(const struct flow_dissector 
*flow_dissector,
  enum flow_dissector_key_id key_id)
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 2ef2224b3bff..2ed380d096ce 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -114,7 +114,7 @@ int flow_dissector_bpf_prog_attach_check(struct net *net,
  * is the protocol port offset returned from proto_ports_offset
  */
 __be32 __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto,
-   void *data, int hlen)
+   const void *data, int hlen)
 {
int poff = proto_ports_offset(ip_proto);

@@ -161,7 +161,7 @@ static bool icmp_has_id(u8 type)
  */
 void skb_flow_get_icmp_tci(const struct sk_buff *skb,
   struct flow_dissector_key_icmp *key_icmp,
-  void *data, int thoff, int hlen)
+  const void *data, int thoff, int hlen)
 {
struct icmphdr *ih, _ih;

@@ -187,8 +187,8 @@ EXPORT_SYMBOL(skb_flow_get_icmp_tci);
  */
 static void __skb_flow_dissect_icmp(const struct sk_buff *skb,
struct flow_dissector *flow_dissector,
-   void *target_container,
-   void *data, int thoff, int hlen)
+   void *target_container, const void *data,
+   int thoff, int hlen)
 {
struct flow_dissector_key_icmp *key_icmp;

@@ -409,8 +409,8 @@ EXPORT_SYMBOL(skb_flow_dissect_hash);
 static

[PATCH net-next 2/6] skbuff: make __skb_header_pointer()'s data argument const

2021-03-12 Thread Alexander Lobakin
The function never modifies the input buffer, so @data argument
can be marked as const.
This implies one harmless cast-away.

Signed-off-by: Alexander Lobakin 
---
 include/linux/skbuff.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0503c917d773..d93ab74063e5 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3678,11 +3678,11 @@ __wsum skb_checksum(const struct sk_buff *skb, int 
offset, int len,
__wsum csum);

 static inline void * __must_check
-__skb_header_pointer(const struct sk_buff *skb, int offset,
-int len, void *data, int hlen, void *buffer)
+__skb_header_pointer(const struct sk_buff *skb, int offset, int len,
+const void *data, int hlen, void *buffer)
 {
if (hlen - offset >= len)
-   return data + offset;
+   return (void *)data + offset;

if (!skb ||
skb_copy_bits(skb, offset, buffer, len) < 0)
--
2.30.2




[PATCH net-next 0/6] skbuff: micro-optimize flow dissection

2021-03-12 Thread Alexander Lobakin
This little number makes all of the flow dissection functions take
raw input data pointer as const (1-5) and shuffles the branches in
__skb_header_pointer() according to their hit probability.

The result is +20 Mbps per flow/core with one Flow Dissector pass
per packet. This affects RPS (with software hashing), drivers that
use eth_get_headlen() on their Rx path and so on.

Alexander Lobakin (6):
  flow_dissector: constify bpf_flow_dissector's data pointers
  skbuff: make __skb_header_pointer()'s data argument const
  flow_dissector: constify raw input @data argument
  linux/etherdevice.h: misc trailing whitespace cleanup
  ethernet: constify eth_get_headlen()'s @data argument
  skbuff: micro-optimize {,__}skb_header_pointer()

 include/linux/etherdevice.h  |  4 ++--
 include/linux/skbuff.h   | 26 +++
 include/net/flow_dissector.h |  6 +++---
 net/core/flow_dissector.c| 41 +++-
 net/ethernet/eth.c   |  2 +-
 5 files changed, 40 insertions(+), 39 deletions(-)

--
2.30.2




[PATCH net-next 1/6] flow_dissector: constify bpf_flow_dissector's data pointers

2021-03-12 Thread Alexander Lobakin
BPF Flow dissection programs are read-only and don't touch input
buffers.
Mark @data and @data_end in struct bpf_flow_dissector as const in
preparation for global input constifying.

Signed-off-by: Alexander Lobakin 
---
 include/net/flow_dissector.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index cc10b10dc3a1..bf00e71816ed 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -368,8 +368,8 @@ static inline void *skb_flow_dissector_target(struct 
flow_dissector *flow_dissec
 struct bpf_flow_dissector {
struct bpf_flow_keys    *flow_keys;
const struct sk_buff    *skb;
-   void                    *data;
-   void                    *data_end;
+   const void              *data;
+   const void              *data_end;
 };

 static inline void
--
2.30.2




Re: [PATCH net-next 2/4] gro: don't dereference napi->gro_hash[x] multiple times in dev_gro_receive()

2021-03-12 Thread Alexander Lobakin
From: Eric Dumazet 
Date: Fri, 12 Mar 2021 17:47:04 +0100

> On Fri, Mar 12, 2021 at 5:22 PM Alexander Lobakin  wrote:
> >
> > GRO bucket index doesn't change through the entire function.
> > Store a pointer to the corresponding bucket on stack once and use
> > it later instead of dereferencing again and again.
> >
> > Signed-off-by: Alexander Lobakin 
> > ---
> >  net/core/dev.c | 9 +
> >  1 file changed, 5 insertions(+), 4 deletions(-)
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index adc42ba7ffd8..ee124aecb8a2 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -5957,6 +5957,7 @@ static void gro_flush_oldest(struct napi_struct 
> > *napi, struct list_head *head)
> >  static enum gro_result dev_gro_receive(struct napi_struct *napi, struct 
> > sk_buff *skb)
> >  {
> > u32 bucket = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
> > +   struct gro_list *gro_list = &napi->gro_hash[bucket];
> > struct list_head *head = &offload_base;
> > struct packet_offload *ptype;
> > __be16 type = skb->protocol;
> > @@ -6024,7 +6025,7 @@ static enum gro_result dev_gro_receive(struct 
> > napi_struct *napi, struct sk_buff
> > if (pp) {
> > skb_list_del_init(pp);
> > napi_gro_complete(napi, pp);
> > -   napi->gro_hash[bucket].count--;
> > +   gro_list->count--;
> > }
> >
> > if (same_flow)
> > @@ -6033,10 +6034,10 @@ static enum gro_result dev_gro_receive(struct 
> > napi_struct *napi, struct sk_buff
> > if (NAPI_GRO_CB(skb)->flush)
> > goto normal;
> >
> > -   if (unlikely(napi->gro_hash[bucket].count >= MAX_GRO_SKBS)) {
> > +   if (unlikely(gro_list->count >= MAX_GRO_SKBS)) {
> > gro_flush_oldest(napi, gro_head);
> > } else {
> > -   napi->gro_hash[bucket].count++;
> > +   gro_list->count++;
> > }
> > NAPI_GRO_CB(skb)->count = 1;
> > NAPI_GRO_CB(skb)->age = jiffies;
> > @@ -6050,7 +6051,7 @@ static enum gro_result dev_gro_receive(struct 
> > napi_struct *napi, struct sk_buff
> > if (grow > 0)
> > gro_pull_from_frag0(skb, grow);
> >  ok:
> > -   if (napi->gro_hash[bucket].count) {
> > +   if (gro_list->count) {
> > if (!test_bit(bucket, &napi->gro_bitmask))
> > __set_bit(bucket, &napi->gro_bitmask);
> > } else if (test_bit(bucket, &napi->gro_bitmask)) {
> > --
> > 2.30.2
> >
> >
>
> This adds more register pressure, do you have precise measures to
> confirm this change is a win ?
>
> Presumably the compiler should be able to optimize the code just fine,
> it can see @bucket does not change.

This is mostly (if not purely) cosmetic, I don't think it changes
anything at all for most sane compilers.

Regarding registers, since @gro_list and @gro_head are pretty much the
same, we could drop @gro_head in favour of @gro_list and just use
@gro_list->list instead.
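
Roughly what I have in mind, as an untested sketch (not a full patch,
and it assumes gro_list_prepare() is switched to take the list itself,
as patch 3/4 does):

	struct gro_list *gro_list = &napi->gro_hash[bucket];

	/* drop the gro_head local and use the list embedded in the
	 * bucket directly wherever gro_head was used, e.g.: */
	gro_list_prepare(&gro_list->list, skb);
	...
	gro_flush_oldest(napi, &gro_list->list);
	...
	list_add(&skb->list, &gro_list->list);

&gro_list->list is exactly the address gro_head would hold, so this is
purely about dropping one local, not about changing the generated code.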

Al



Re: [PATCH net-next 4/4] gro: improve flow distribution across GRO buckets in dev_gro_receive()

2021-03-12 Thread Alexander Lobakin
From: Eric Dumazet 
Date: Fri, 12 Mar 2021 17:33:53 +0100

> On Fri, Mar 12, 2021 at 5:22 PM Alexander Lobakin  wrote:
> >
> > Most of the functions that "convert" hash value into an index
> > (when RPS is configured / XPS is not configured / etc.) set
> > reciprocal_scale() on it. Its logics is simple, but fair enough and
> > accounts the entire input value.
> > On the opposite side, 'hash & (GRO_HASH_BUCKETS - 1)' expression uses
> > only 3 least significant bits of the value, which is far from
> > optimal (especially for XOR RSS hashers, where the hashes of two
> > different flows may differ only by 1 bit somewhere in the middle).
> >
> > Use reciprocal_scale() here too to take the entire hash value into
> > account and improve flow dispersion between GRO hash buckets.
> >
> > Signed-off-by: Alexander Lobakin 
> > ---
> >  net/core/dev.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 65d9e7d9d1e8..bd7c9ba54623 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -5952,7 +5952,7 @@ static void gro_flush_oldest(struct napi_struct 
> > *napi, struct list_head *head)
> >
> >  static enum gro_result dev_gro_receive(struct napi_struct *napi, struct 
> > sk_buff *skb)
> >  {
> > -   u32 bucket = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
> > +   u32 bucket = reciprocal_scale(skb_get_hash_raw(skb), 
> > GRO_HASH_BUCKETS);
>
> This is going to use 3 high order bits instead of 3 low-order bits.

We-e-ell, seems like it.

> Now, had you use hash_32(skb_get_hash_raw(skb), 3), you could have
> claimed to use "more bits"

Nice suggestion, I'll try it. If there are no visible improvements,
I'll just drop this one.
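
For reference, the three bucket computations under discussion boil down
to this (condensed sketch only, not taken from any of the patches):

	u32 hash = skb_get_hash_raw(skb);
	u32 bucket;

	/* current code: only the 3 low-order bits of the hash */
	bucket = hash & (GRO_HASH_BUCKETS - 1);

	/* this patch: ((u64)hash * GRO_HASH_BUCKETS) >> 32, i.e.
	 * effectively the high-order bits */
	bucket = reciprocal_scale(hash, GRO_HASH_BUCKETS);

	/* your suggestion: multiply by the 32-bit golden ratio and
	 * take the top 3 bits, so all input bits get mixed in */
	bucket = hash_32(hash, 3);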

> Toeplitz already shuffles stuff.

As well as CRC and others, but I feel like we shouldn't rely only on
the hardware.

> Adding a multiply here seems not needed.
>
> Please provide experimental results, because this looks unnecessary to me.

Thanks,
Al



[PATCH net-next 4/4] gro: improve flow distribution across GRO buckets in dev_gro_receive()

2021-03-12 Thread Alexander Lobakin
Most of the functions that "convert" a hash value into an index
(when RPS is configured / XPS is not configured / etc.) apply
reciprocal_scale() to it. Its logic is simple but fair enough, and it
accounts for the entire input value.
On the opposite side, the 'hash & (GRO_HASH_BUCKETS - 1)' expression uses
only the 3 least significant bits of the value, which is far from
optimal (especially for XOR RSS hashers, where the hashes of two
different flows may differ only by 1 bit somewhere in the middle).

Use reciprocal_scale() here too to take the entire hash value into
account and improve flow dispersion between GRO hash buckets.

Signed-off-by: Alexander Lobakin 
---
 net/core/dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 65d9e7d9d1e8..bd7c9ba54623 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5952,7 +5952,7 @@ static void gro_flush_oldest(struct napi_struct *napi, 
struct list_head *head)

 static enum gro_result dev_gro_receive(struct napi_struct *napi, struct 
sk_buff *skb)
 {
-   u32 bucket = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
+   u32 bucket = reciprocal_scale(skb_get_hash_raw(skb), GRO_HASH_BUCKETS);
struct gro_list *gro_list = &napi->gro_hash[bucket];
struct list_head *gro_head = &gro_list->list;
struct list_head *head = &offload_base;
--
2.30.2




[PATCH net-next 2/4] gro: don't dereference napi->gro_hash[x] multiple times in dev_gro_receive()

2021-03-12 Thread Alexander Lobakin
GRO bucket index doesn't change through the entire function.
Store a pointer to the corresponding bucket on stack once and use
it later instead of dereferencing again and again.

Signed-off-by: Alexander Lobakin 
---
 net/core/dev.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index adc42ba7ffd8..ee124aecb8a2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5957,6 +5957,7 @@ static void gro_flush_oldest(struct napi_struct *napi, 
struct list_head *head)
 static enum gro_result dev_gro_receive(struct napi_struct *napi, struct 
sk_buff *skb)
 {
u32 bucket = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
+   struct gro_list *gro_list = &napi->gro_hash[bucket];
struct list_head *head = &offload_base;
struct packet_offload *ptype;
__be16 type = skb->protocol;
@@ -6024,7 +6025,7 @@ static enum gro_result dev_gro_receive(struct napi_struct 
*napi, struct sk_buff
if (pp) {
skb_list_del_init(pp);
napi_gro_complete(napi, pp);
-   napi->gro_hash[bucket].count--;
+   gro_list->count--;
}

if (same_flow)
@@ -6033,10 +6034,10 @@ static enum gro_result dev_gro_receive(struct 
napi_struct *napi, struct sk_buff
if (NAPI_GRO_CB(skb)->flush)
goto normal;

-   if (unlikely(napi->gro_hash[bucket].count >= MAX_GRO_SKBS)) {
+   if (unlikely(gro_list->count >= MAX_GRO_SKBS)) {
gro_flush_oldest(napi, gro_head);
} else {
-   napi->gro_hash[bucket].count++;
+   gro_list->count++;
}
NAPI_GRO_CB(skb)->count = 1;
NAPI_GRO_CB(skb)->age = jiffies;
@@ -6050,7 +6051,7 @@ static enum gro_result dev_gro_receive(struct napi_struct 
*napi, struct sk_buff
if (grow > 0)
gro_pull_from_frag0(skb, grow);
 ok:
-   if (napi->gro_hash[bucket].count) {
+   if (gro_list->count) {
if (!test_bit(bucket, &napi->gro_bitmask))
__set_bit(bucket, &napi->gro_bitmask);
} else if (test_bit(bucket, &napi->gro_bitmask)) {
--
2.30.2




[PATCH net-next 3/4] gro: simplify gro_list_prepare()

2021-03-12 Thread Alexander Lobakin
gro_list_prepare() always returns &napi->gro_hash[bucket].list,
without any variations. Moreover, it uses 'napi' argument only to
have access to this list, and calculates the bucket index for the
second time (firstly it happens at the beginning of
dev_gro_receive()) to do that.
Given that dev_gro_receive() already has a pointer to the needed
list, just pass it as the first argument to eliminate redundant
calculations, and make gro_list_prepare() return void.
Also, both arguments of gro_list_prepare() can be constified since
this function can only modify the skbs from the bucket list.

Signed-off-by: Alexander Lobakin 
---
 net/core/dev.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index ee124aecb8a2..65d9e7d9d1e8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5858,15 +5858,13 @@ void napi_gro_flush(struct napi_struct *napi, bool 
flush_old)
 }
 EXPORT_SYMBOL(napi_gro_flush);

-static struct list_head *gro_list_prepare(struct napi_struct *napi,
- struct sk_buff *skb)
+static void gro_list_prepare(const struct list_head *head,
+const struct sk_buff *skb)
 {
unsigned int maclen = skb->dev->hard_header_len;
u32 hash = skb_get_hash_raw(skb);
-   struct list_head *head;
struct sk_buff *p;

-   head = &napi->gro_hash[hash & (GRO_HASH_BUCKETS - 1)].list;
list_for_each_entry(p, head, list) {
unsigned long diffs;

@@ -5892,8 +5890,6 @@ static struct list_head *gro_list_prepare(struct 
napi_struct *napi,
   maclen);
NAPI_GRO_CB(p)->same_flow = !diffs;
}
-
-   return head;
 }

 static void skb_gro_reset_offset(struct sk_buff *skb)
@@ -5958,10 +5954,10 @@ static enum gro_result dev_gro_receive(struct 
napi_struct *napi, struct sk_buff
 {
u32 bucket = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
struct gro_list *gro_list = &napi->gro_hash[bucket];
+   struct list_head *gro_head = &gro_list->list;
struct list_head *head = &offload_base;
struct packet_offload *ptype;
__be16 type = skb->protocol;
-   struct list_head *gro_head;
struct sk_buff *pp = NULL;
enum gro_result ret;
int same_flow;
@@ -5970,7 +5966,7 @@ static enum gro_result dev_gro_receive(struct napi_struct 
*napi, struct sk_buff
if (netif_elide_gro(skb->dev))
goto normal;

-   gro_head = gro_list_prepare(napi, skb);
+   gro_list_prepare(gro_head, skb);

rcu_read_lock();
list_for_each_entry_rcu(ptype, head, list) {
--
2.30.2




[PATCH net-next 1/4] gro: give 'hash' variable in dev_gro_receive() a less confusing name

2021-03-12 Thread Alexander Lobakin
'hash' stores not the flow hash, but the index of the GRO bucket
corresponding to it.
Change its name to 'bucket' to avoid confusion while reading lines
like '__set_bit(hash, >gro_bitmask)'.

Signed-off-by: Alexander Lobakin 
---
 net/core/dev.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 2bfdd528c7c3..adc42ba7ffd8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5956,7 +5956,7 @@ static void gro_flush_oldest(struct napi_struct *napi, 
struct list_head *head)

 static enum gro_result dev_gro_receive(struct napi_struct *napi, struct 
sk_buff *skb)
 {
-   u32 hash = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
+   u32 bucket = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
struct list_head *head = &offload_base;
struct packet_offload *ptype;
__be16 type = skb->protocol;
@@ -6024,7 +6024,7 @@ static enum gro_result dev_gro_receive(struct napi_struct 
*napi, struct sk_buff
if (pp) {
skb_list_del_init(pp);
napi_gro_complete(napi, pp);
-   napi->gro_hash[hash].count--;
+   napi->gro_hash[bucket].count--;
}

if (same_flow)
@@ -6033,10 +6033,10 @@ static enum gro_result dev_gro_receive(struct 
napi_struct *napi, struct sk_buff
if (NAPI_GRO_CB(skb)->flush)
goto normal;

-   if (unlikely(napi->gro_hash[hash].count >= MAX_GRO_SKBS)) {
+   if (unlikely(napi->gro_hash[bucket].count >= MAX_GRO_SKBS)) {
gro_flush_oldest(napi, gro_head);
} else {
-   napi->gro_hash[hash].count++;
+   napi->gro_hash[bucket].count++;
}
NAPI_GRO_CB(skb)->count = 1;
NAPI_GRO_CB(skb)->age = jiffies;
@@ -6050,11 +6050,11 @@ static enum gro_result dev_gro_receive(struct 
napi_struct *napi, struct sk_buff
if (grow > 0)
gro_pull_from_frag0(skb, grow);
 ok:
-   if (napi->gro_hash[hash].count) {
-   if (!test_bit(hash, &napi->gro_bitmask))
-   __set_bit(hash, &napi->gro_bitmask);
-   } else if (test_bit(hash, &napi->gro_bitmask)) {
-   __clear_bit(hash, &napi->gro_bitmask);
+   if (napi->gro_hash[bucket].count) {
+   if (!test_bit(bucket, &napi->gro_bitmask))
+   __set_bit(bucket, &napi->gro_bitmask);
+   } else if (test_bit(bucket, &napi->gro_bitmask)) {
+   __clear_bit(bucket, &napi->gro_bitmask);
}

return ret;
--
2.30.2




[PATCH net-next 0/4] gro: micro-optimize dev_gro_receive()

2021-03-12 Thread Alexander Lobakin
This random series addresses some of the suboptimal paths used in
the main GRO entry point.
The main body is patches 3-4 which simplify the code and improve
flow distribution. Two others are mostly cosmetic to make code
more readable.

The benefits are not huge and mostly depend on the NIC RSS hash
function and the number of Rx flows per single NAPI instance. I got
something like +10-15 Mbps on 4-8 flows NATing.

Alexander Lobakin (4):
  gro: give 'hash' variable in dev_gro_receive() a less confusing name
  gro: don't dereference napi->gro_hash[x] multiple times in
dev_gro_receive()
  gro: simplify gro_list_prepare()
  gro: improve flow distribution across GRO buckets in dev_gro_receive()

 net/core/dev.c | 31 ++-
 1 file changed, 14 insertions(+), 17 deletions(-)

--
2.30.2




[PATCH RESEND] PCI: dwc: put struct dw_pcie::{ep,pp} into a union to reduce its size

2021-03-12 Thread Alexander Lobakin
A single dw_pcie entity can't be a root complex and an endpoint at
the same time.
We can use this to reduce the size of dw_pcie by 80 bytes, from 280 to
200 bytes (on x32, likely more on x64), by putting the related embedded
structures (struct pcie_port and struct dw_pcie_ep) into a union.

Signed-off-by: Alexander Lobakin 
---
 drivers/pci/controller/dwc/pcie-designware.h | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/controller/dwc/pcie-designware.h 
b/drivers/pci/controller/dwc/pcie-designware.h
index 7247c8b01f04..ca8aeba548ab 100644
--- a/drivers/pci/controller/dwc/pcie-designware.h
+++ b/drivers/pci/controller/dwc/pcie-designware.h
@@ -266,8 +266,10 @@ struct dw_pcie {
size_t  atu_size;
u32 num_ib_windows;
u32 num_ob_windows;
-   struct pcie_port        pp;
-   struct dw_pcie_ep       ep;
+   union {
+           struct pcie_port        pp;
+           struct dw_pcie_ep       ep;
+   };
    const struct dw_pcie_ops *ops;
    unsigned int            version;
int num_lanes;
--
2.30.2




Re: [PATCH 5.4 00/24] 5.4.105-rc1 review

2021-03-12 Thread Alexander Lobakin
From: Florian Fainelli 
Date: Thu, 11 Mar 2021 09:41:27 -0800

Hi Florian,

> On 3/11/21 9:40 AM, Greg KH wrote:
> > On Thu, Mar 11, 2021 at 09:23:56AM -0800, Florian Fainelli wrote:
> >> On 3/11/21 5:08 AM, Greg KH wrote:
> >>> On Wed, Mar 10, 2021 at 08:19:45PM -0800, Florian Fainelli wrote:
>  +Alex,
> 
>  On 3/10/2021 5:24 AM, gre...@linuxfoundation.org wrote:
> > From: Greg Kroah-Hartman 
> >
> > This is the start of the stable review cycle for the 5.4.105 release.
> > There are 24 patches in this series, all will be posted as a response
> > to this one.  If anyone has any issues with these being applied, please
> > let me know.
> >
> > Responses should be made by Fri, 12 Mar 2021 13:23:09 +.
> > Anything received after that time might be too late.
> >
> > The whole patch series can be found in one patch at:
> > 
> > https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.4.105-rc1.gz
> > or in the git tree and branch at:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
> >  linux-5.4.y
> > and the diffstat can be found below.
> >
> > thanks,
> >
> > greg k-h
> 
>  I believe you need to drop "net: dsa: add GRO support via gro_cells" as
>  it causes the following kernel panic on a DSA-enabled platform:
> 
>  Configuring rgmii_2 interface
>  [   10.170527] brcm-sf2 f0b0.ethernet_switch rgmii_2: configuring
>  for fixed/rgmii-txid link mode
>  [   10.179597] 8021q: adding VLAN 0 to HW filter on device rgmii_2
>  [   10.185608] brcm-sf2 f0b0.ethernet_switch rgmii_2: Link is Up -
>  1Gbps/Full - flow control off
>  [   10.198631] IPv6: ADDRCONF(NETDEV_CHANGE): rgmii_2: link becomes ready
>  Configuring sit0 interface
>  [   10.254346] 8<--- cut here ---
>  [   10.257438] Unable to handle kernel paging request at virtual address
>  d6df6190
>  [   10.264685] pgd = (ptrval)
>  [   10.267411] [d6df6190] *pgd=807003, *pmd=
>  [   10.272846] Internal error: Oops: 206 [#1] SMP ARM
>  [   10.277661] Modules linked in:
>  [   10.280739] CPU: 0 PID: 1886 Comm: sed Not tainted
>  5.4.105-1.0pre-geff642e2af2b #4
>  [   10.288337] Hardware name: Broadcom STB (Flattened Device Tree)
>  [   10.294292] PC is at gro_cells_receive+0x90/0x11c
>  [   10.299020] LR is at dsa_switch_rcv+0x120/0x1d4
>  [   10.303562] pc : []lr : []psr: 600f0113
>  [   10.309841] sp : c1d33cd0  ip : 03e8  fp : c1d33ce4
>  [   10.315078] r10: c8901000  r9 : c8901000  r8 : c0b4a53c
>  [   10.320314] r7 : c2208920  r6 :   r5 :   r4 : 4000
>  [   10.326855] r3 : d6df6188  r2 : c4927000  r1 : c8adc300  r0 : c22069dc
>  [   10.98] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM
>  Segment user
>  [   10.340547] Control: 30c5387d  Table: 04ac4c80  DAC: fffd
>  [   10.346307] Process sed (pid: 1886, stack limit = 0x(ptrval))
>  [   10.352066] Stack: (0xc1d33cd0 to 0xc1d34000)
>  [   10.356434] 3cc0: c8adc300
>  c4927000 c1d33d04 c1d33ce8
>  [   10.364631] 3ce0: c0b4a65c c0a579a4 c1d33d24 c2208920 c1d33d24
>   c1d33d5c c1d33d08
>  [   10.372827] 3d00: c0a0b38c c0b4a548 c021e070 c2204cc8 
>  c89015c0 04b87700 c89015c0
>  [   10.381023] 3d20: c2208920 c1d33d24 c1d33d24 00976ec2 04b87700
>  c8adc300 c89015c0 
>  [   10.389218] 3d40: c1d33d74 c1d32000  c230742c c1d33dac
>  c1d33d60 c0a0b5c0 c0a0b180
>  [   10.397414] 3d60:  c2204cc8  c1d33d6c c1d33d6c
>  c1d33d80 c029daf8 00976ec2
>  [   10.405610] 3d80: 0800 c8901540 c89015c0 c8901540 
>  0001 016c 0162
>  [   10.413805] 3da0: c1d33dc4 c1d33db0 c0a0b7fc c0a0b3b8 
>  c8adc300 c1d33dfc c1d33dc8
>  [   10.422001] 3dc0: c0a0c660 c0a0b7e4 c8901540 c8adc300 c1d33dfc
>  c1d33de0 c8901540 c8adc300
>  [   10.430196] 3de0: 015e c8901000 0001 016c c1d33e74
>  c1d33e00 c083df00 c0a0c4fc
>  [   10.438391] 3e00: 012c c22b0f14 c1d33e4c c1d33e18 c0fbd9b8
>  c0fbd9cc c0fbd9e0 c0fbd98c
>  [   10.446586] 3e20: 0001 0040 c8901500 0001 
>    
>  [   10.454780] 3e40:   c02f65a0 c8901540 0001
>  0040 c22b07e4 012c
>  [   10.462975] 3e60: d1003000 fffb942f c1d33edc c1d33e78 c0a0c94c
>  c083dafc d051ad80 c2204cc8
>  [   10.471170] 3e80: c2204cf0 c1d32000 c22b40b0 0e4a4000 c2076d80
>  c2203d00 c022bc70 c1d33e9c
>  [   10.479365] 3ea0: c1d33e9c c1d33ea4 c1d33ea4 00976ec2 c02f65a0
>  c220308c 0003 c1d32000
>  [   10.487561] 3ec0: c22b07e4 0100 d1003000 0008 c1d33f44
>  c1d33ee0 c020238c c0a0c6cc
>  [   10.495755] 3ee0: 

Re: [PATCH v4] net/qrtr: fix __netdev_alloc_skb call

2021-02-28 Thread Alexander Lobakin
From: Pavel Skripkin 
Date: Mon, 1 Mar 2021 02:22:40 +0300

> syzbot found WARNING in __alloc_pages_nodemask()[1] when order >= MAX_ORDER.
> It was caused by a huge length value passed from userspace to 
> qrtr_tun_write_iter(),
> which tries to allocate skb. Since the value comes from the untrusted source
> there is no need to raise a warning in __alloc_pages_nodemask().
>
> [1] WARNING in __alloc_pages_nodemask+0x5f8/0x730 mm/page_alloc.c:5014
> Call Trace:
>  __alloc_pages include/linux/gfp.h:511 [inline]
>  __alloc_pages_node include/linux/gfp.h:524 [inline]
>  alloc_pages_node include/linux/gfp.h:538 [inline]
>  kmalloc_large_node+0x60/0x110 mm/slub.c:3999
>  __kmalloc_node_track_caller+0x319/0x3f0 mm/slub.c:4496
>  __kmalloc_reserve net/core/skbuff.c:150 [inline]
>  __alloc_skb+0x4e4/0x5a0 net/core/skbuff.c:210
>  __netdev_alloc_skb+0x70/0x400 net/core/skbuff.c:446
>  netdev_alloc_skb include/linux/skbuff.h:2832 [inline]
>  qrtr_endpoint_post+0x84/0x11b0 net/qrtr/qrtr.c:442
>  qrtr_tun_write_iter+0x11f/0x1a0 net/qrtr/tun.c:98
>  call_write_iter include/linux/fs.h:1901 [inline]
>  new_sync_write+0x426/0x650 fs/read_write.c:518
>  vfs_write+0x791/0xa30 fs/read_write.c:605
>  ksys_write+0x12d/0x250 fs/read_write.c:658
>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Reported-by: syzbot+80dccaee7c6630fa9...@syzkaller.appspotmail.com
> Signed-off-by: Pavel Skripkin 

Acked-by: Alexander Lobakin 

Thanks!

> ---
>  net/qrtr/qrtr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/qrtr/qrtr.c b/net/qrtr/qrtr.c
> index b34358282f37..82d2eb8c21d1 100644
> --- a/net/qrtr/qrtr.c
> +++ b/net/qrtr/qrtr.c
> @@ -439,7 +439,7 @@ int qrtr_endpoint_post(struct qrtr_endpoint *ep, const 
> void *data, size_t len)
>   if (len == 0 || len & 3)
>   return -EINVAL;
>
> - skb = netdev_alloc_skb(NULL, len);
> + skb = __netdev_alloc_skb(NULL, len, GFP_ATOMIC | __GFP_NOWARN);
>   if (!skb)
>   return -ENOMEM;
>
> --
> 2.25.1

Al



Re: [PATCH v3] net/core/skbuff: fix passing wrong size to __alloc_skb

2021-02-28 Thread Alexander Lobakin
From: Pavel Skripkin 
Date: Sun, 28 Feb 2021 22:28:13 +0300

> Hi, thanks for reply!
>
> > From: Pavel Skripkin 
> > Date: Sat, 27 Feb 2021 20:51:14 +0300
> >
> > Hi,
> >
> > > syzbot found WARNING in __alloc_pages_nodemask()[1] when order >=
> > > MAX_ORDER.
> > > It was caused by __netdev_alloc_skb(), which doesn't check len
> > > value after adding NET_SKB_PAD.
> > > Order will be >= MAX_ORDER and passed to __alloc_pages_nodemask()
> > > if size > KMALLOC_MAX_SIZE.
> > > Same happens in __napi_alloc_skb.
> > >
> > > static void *kmalloc_large_node(size_t size, gfp_t flags, int node)
> > > {
> > >   struct page *page;
> > >   void *ptr = NULL;
> > >   unsigned int order = get_order(size);
> > > ...
> > >   page = alloc_pages_node(node, flags, order);
> > > ...
> > >
> > > [1] WARNING in __alloc_pages_nodemask+0x5f8/0x730
> > > mm/page_alloc.c:5014
> > > Call Trace:
> > >  __alloc_pages include/linux/gfp.h:511 [inline]
> > >  __alloc_pages_node include/linux/gfp.h:524 [inline]
> > >  alloc_pages_node include/linux/gfp.h:538 [inline]
> > >  kmalloc_large_node+0x60/0x110 mm/slub.c:3999
> > >  __kmalloc_node_track_caller+0x319/0x3f0 mm/slub.c:4496
> > >  __kmalloc_reserve net/core/skbuff.c:150 [inline]
> > >  __alloc_skb+0x4e4/0x5a0 net/core/skbuff.c:210
> > >  __netdev_alloc_skb+0x70/0x400 net/core/skbuff.c:446
> > >  netdev_alloc_skb include/linux/skbuff.h:2832 [inline]
> > >  qrtr_endpoint_post+0x84/0x11b0 net/qrtr/qrtr.c:442
> > >  qrtr_tun_write_iter+0x11f/0x1a0 net/qrtr/tun.c:98
> > >  call_write_iter include/linux/fs.h:1901 [inline]
> > >  new_sync_write+0x426/0x650 fs/read_write.c:518
> > >  vfs_write+0x791/0xa30 fs/read_write.c:605
> > >  ksys_write+0x12d/0x250 fs/read_write.c:658
> > >  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> > >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >
> > Ah, by the way. Have you tried to seek for the root cause, why
> > a request for such insanely large (at least 4 Mib) skb happens
> > in QRTR? I don't believe it's intended to be like this.
> > Now I feel that silencing this error with early return isn't
> > really correct approach for this.
>
> Yeah, I figured it out. Syzbot provides a reproducer for this bug:
>
> void loop(void)
> {
>   intptr_t res = 0;
>   memcpy((void*)0x2000, "/dev/qrtr-tun\000", 14);
>   res = syscall(__NR_openat, 0xff9cul, 0x2000ul, 1ul,
> 0);
>   if (res != -1)
> r[0] = res;
>   memcpy((void*)0x2040, "\x02", 1);
>   syscall(__NR_write, r[0], 0x2040ul, 0x40ul);
> }
>
> So, simply it writes to /dev/qrtr-tun 0x40 bytes.
> In qrtr_tun_write_iter there is a check, that the length is smaller
> than KMALLOC_MAX_VSIZE. Then the length is passed to
> __netdev_alloc_skb, where it becomes more than KMALLOC_MAX_VSIZE.

I've checked the logic in qrtr_tun_write_iter(). Seems like it's
only trying to prevent kzallocs of sizes larger than the maximum
and doesn't care about netdev_alloc_skb() at all, as it ignores
the fact that, besides NET_SKB_PAD and NET_IP_ALIGN, skbs always
have SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) on top of
the "usable" size.

On the other hand, skb memory overheads, kmalloc bounds etc. are
an internal thing and all related corner cases should be handled
inside the implementations, not the users. From this point of view, even
this check for (len < KMALLOC_MAX_SIZE) is a bit bogus.
I think in that particular case with the size coming from userspace
(i.e. untrusted source), the allocations (both kzalloc() and
__netdev_alloc_skb()) should be performed with __GFP_NOWARN, so
insane values won't provoke any splats.

So maybe use it as a fix for this particular warning (seems like
it's the sole place in the entire kernel that can potentially
request such skb allocations) and don't add any branches into
hot *alloc_skb() at all?
We might add a cap for the max skb length later, as Jakub pointed out.
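
In code, the __GFP_NOWARN suggestion would look roughly like this
(a sketch only; the function and variable names are illustrative,
not the actual net/qrtr/tun.c code):

```
/* The length comes straight from userspace, so suppress the
 * allocation-failure splats instead of trying to mirror the
 * allocators' internal size bounds in the caller.
 */
static int qrtr_post_from_user(struct qrtr_endpoint *ep,
			       const void __user *ubuf, size_t len)
{
	void *kbuf;
	int rc;

	kbuf = kzalloc(len, GFP_KERNEL | __GFP_NOWARN);
	if (!kbuf)
		return -ENOMEM;

	if (copy_from_user(kbuf, ubuf, len)) {
		kfree(kbuf);
		return -EFAULT;
	}

	/* qrtr_endpoint_post() allocates the skb internally; with the
	 * v4 qrtr patch above it already does so with
	 * GFP_ATOMIC | __GFP_NOWARN.
	 */
	rc = qrtr_endpoint_post(ep, kbuf, len);

	kfree(kbuf);
	return rc;
}
```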

> >
> > > Reported-by: syzbot+80dccaee7c6630fa9...@syzkaller.appspotmail.com
> > > Signed-off-by: Pavel Skripkin 
> > >
> > > ---
> > > Changes from v3:
> > > * Removed Change-Id and extra tabs in net/core/skbuff.c
> > >
> > > Changes from v2:
> > > * Added length check to __napi_alloc_skb
> > > * Added unlikely() in checks
> > >
> > > Change from v1:
> > > * Added length check to __netdev_alloc_skb
> > > ---
> > >  net/core/skbuff.c | 6 ++
> > >  1 file changed, 6 insertions(+)
> > >
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index 785daff48030..ec7ba8728b61 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -443,6 +443,9 @@ struct sk_buff *__netdev_alloc_skb(struct
> > > net_device *dev, unsigned int len,
> > >   if (len <= SKB_WITH_OVERHEAD(1024) ||
> > >   len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
> > >   (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
> > > + if (unlikely(len > KMALLOC_MAX_SIZE))
> > > + return NULL;
> > > +
> > >   skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX,
> > > NUMA_NO_NODE);
> > >   if 

Re: [PATCH v3] net/core/skbuff: fix passing wrong size to __alloc_skb

2021-02-28 Thread Alexander Lobakin
From: Jakub Kicinski 
Date: Sun, 28 Feb 2021 10:55:52 -0800

> On Sun, 28 Feb 2021 18:14:46 +0000 Alexander Lobakin wrote:
> > > [1] WARNING in __alloc_pages_nodemask+0x5f8/0x730 mm/page_alloc.c:5014
> > > Call Trace:
> > >  __alloc_pages include/linux/gfp.h:511 [inline]
> > >  __alloc_pages_node include/linux/gfp.h:524 [inline]
> > >  alloc_pages_node include/linux/gfp.h:538 [inline]
> > >  kmalloc_large_node+0x60/0x110 mm/slub.c:3999
> > >  __kmalloc_node_track_caller+0x319/0x3f0 mm/slub.c:4496
> > >  __kmalloc_reserve net/core/skbuff.c:150 [inline]
> > >  __alloc_skb+0x4e4/0x5a0 net/core/skbuff.c:210
> > >  __netdev_alloc_skb+0x70/0x400 net/core/skbuff.c:446
> > >  netdev_alloc_skb include/linux/skbuff.h:2832 [inline]
> > >  qrtr_endpoint_post+0x84/0x11b0 net/qrtr/qrtr.c:442
> > >  qrtr_tun_write_iter+0x11f/0x1a0 net/qrtr/tun.c:98
> > >  call_write_iter include/linux/fs.h:1901 [inline]
> > >  new_sync_write+0x426/0x650 fs/read_write.c:518
> > >  vfs_write+0x791/0xa30 fs/read_write.c:605
> > >  ksys_write+0x12d/0x250 fs/read_write.c:658
> > >  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> > >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >
> > Ah, by the way. Have you tried to seek for the root cause, why
> > a request for such insanely large (at least 4 Mib) skb happens
> > in QRTR? I don't believe it's intended to be like this.
> > Now I feel that silencing this error with early return isn't
> > really correct approach for this.
>
> Right, IIUC Eric suggested we limit the length of the allocation
> to 64KB because that's the max reasonable skb length, and QRTR tun write
> results in generating a single skb. That seems like a good approach.

+1 for explicitly capping the max skb length to 64K.
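
I.e. something along these lines on the qrtr tun write path (just a
sketch; the exact bound and error code are placeholders):

```
	/* Reject anything that can't reasonably fit into a single skb. */
	if (!len || len > SZ_64K)
		return -EINVAL;
```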

Al



Re: [PATCH v3] net/core/skbuff: fix passing wrong size to __alloc_skb

2021-02-28 Thread Alexander Lobakin
From: Pavel Skripkin 
Date: Sat, 27 Feb 2021 20:51:14 +0300

Hi,

> syzbot found WARNING in __alloc_pages_nodemask()[1] when order >= MAX_ORDER.
> It was caused by __netdev_alloc_skb(), which doesn't check len value after 
> adding NET_SKB_PAD.
> Order will be >= MAX_ORDER and passed to __alloc_pages_nodemask() if size > 
> KMALLOC_MAX_SIZE.
> Same happens in __napi_alloc_skb.
>
> static void *kmalloc_large_node(size_t size, gfp_t flags, int node)
> {
>   struct page *page;
>   void *ptr = NULL;
>   unsigned int order = get_order(size);
> ...
>   page = alloc_pages_node(node, flags, order);
> ...
>
> [1] WARNING in __alloc_pages_nodemask+0x5f8/0x730 mm/page_alloc.c:5014
> Call Trace:
>  __alloc_pages include/linux/gfp.h:511 [inline]
>  __alloc_pages_node include/linux/gfp.h:524 [inline]
>  alloc_pages_node include/linux/gfp.h:538 [inline]
>  kmalloc_large_node+0x60/0x110 mm/slub.c:3999
>  __kmalloc_node_track_caller+0x319/0x3f0 mm/slub.c:4496
>  __kmalloc_reserve net/core/skbuff.c:150 [inline]
>  __alloc_skb+0x4e4/0x5a0 net/core/skbuff.c:210
>  __netdev_alloc_skb+0x70/0x400 net/core/skbuff.c:446
>  netdev_alloc_skb include/linux/skbuff.h:2832 [inline]
>  qrtr_endpoint_post+0x84/0x11b0 net/qrtr/qrtr.c:442
>  qrtr_tun_write_iter+0x11f/0x1a0 net/qrtr/tun.c:98
>  call_write_iter include/linux/fs.h:1901 [inline]
>  new_sync_write+0x426/0x650 fs/read_write.c:518
>  vfs_write+0x791/0xa30 fs/read_write.c:605
>  ksys_write+0x12d/0x250 fs/read_write.c:658
>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Ah, by the way. Have you tried to find the root cause of why a request
for such an insanely large (at least 4 MiB) skb happens in QRTR?
I don't believe it's intended to be like this.
Now I feel that silencing this error with an early return isn't
really the correct approach here.

> Reported-by: syzbot+80dccaee7c6630fa9...@syzkaller.appspotmail.com
> Signed-off-by: Pavel Skripkin 
>
> ---
> Changes from v3:
> * Removed Change-Id and extra tabs in net/core/skbuff.c
>
> Changes from v2:
> * Added length check to __napi_alloc_skb
> * Added unlikely() in checks
>
> Change from v1:
> * Added length check to __netdev_alloc_skb
> ---
>  net/core/skbuff.c | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 785daff48030..ec7ba8728b61 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -443,6 +443,9 @@ struct sk_buff *__netdev_alloc_skb(struct net_device 
> *dev, unsigned int len,
>   if (len <= SKB_WITH_OVERHEAD(1024) ||
>   len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
>   (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
> + if (unlikely(len > KMALLOC_MAX_SIZE))
> + return NULL;
> +
>   skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
>   if (!skb)
>   goto skb_fail;
> @@ -517,6 +520,9 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct 
> *napi, unsigned int len,
>   if (len <= SKB_WITH_OVERHEAD(1024) ||
>   len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
>   (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
> + if (unlikely(len > KMALLOC_MAX_SIZE))
> + return NULL;
> +
>   skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
>   if (!skb)
>   goto skb_fail;
> --
> 2.25.1

Thanks,
Al



Re: [PATCH] net/core/skbuff.c: __netdev_alloc_skb fix when len is greater than KMALLOC_MAX_SIZE

2021-02-27 Thread Alexander Lobakin
From: Pavel Skripkin 
Date: Fri, 26 Feb 2021 22:11:06 +0300

Hi,

> syzbot found WARNING in __alloc_pages_nodemask()[1] when order >= MAX_ORDER.
> It was caused by __netdev_alloc_skb(), which doesn't check len value after 
> adding NET_SKB_PAD.
> Order will be >= MAX_ORDER and passed to __alloc_pages_nodemask() if size > 
> KMALLOC_MAX_SIZE.
>
> static void *kmalloc_large_node(size_t size, gfp_t flags, int node)
> {
>   struct page *page;
>   void *ptr = NULL;
>   unsigned int order = get_order(size);
> ...
>   page = alloc_pages_node(node, flags, order);
> ...
>
> [1] WARNING in __alloc_pages_nodemask+0x5f8/0x730 mm/page_alloc.c:5014
> Call Trace:
>  __alloc_pages include/linux/gfp.h:511 [inline]
>  __alloc_pages_node include/linux/gfp.h:524 [inline]
>  alloc_pages_node include/linux/gfp.h:538 [inline]
>  kmalloc_large_node+0x60/0x110 mm/slub.c:3999
>  __kmalloc_node_track_caller+0x319/0x3f0 mm/slub.c:4496
>  __kmalloc_reserve net/core/skbuff.c:150 [inline]
>  __alloc_skb+0x4e4/0x5a0 net/core/skbuff.c:210
>  __netdev_alloc_skb+0x70/0x400 net/core/skbuff.c:446
>  netdev_alloc_skb include/linux/skbuff.h:2832 [inline]
>  qrtr_endpoint_post+0x84/0x11b0 net/qrtr/qrtr.c:442
>  qrtr_tun_write_iter+0x11f/0x1a0 net/qrtr/tun.c:98
>  call_write_iter include/linux/fs.h:1901 [inline]
>  new_sync_write+0x426/0x650 fs/read_write.c:518
>  vfs_write+0x791/0xa30 fs/read_write.c:605
>  ksys_write+0x12d/0x250 fs/read_write.c:658
>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Reported-by: syzbot+80dccaee7c6630fa9...@syzkaller.appspotmail.com
> Signed-off-by: Pavel Skripkin 
> Change-Id: I480a6d6f818a4c0a387db0cd3f230b68a7daeb16
> ---
>  net/core/skbuff.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 785daff48030..dc28c8f7bf5f 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -443,6 +443,9 @@ struct sk_buff *__netdev_alloc_skb(struct net_device 
> *dev, unsigned int len,
>   if (len <= SKB_WITH_OVERHEAD(1024) ||
>   len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
>   (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
> + if (len > KMALLOC_MAX_SIZE)
> + return NULL;

I'd use unlikely() for this as it's a very, very rare condition on the
very hot path.

Also, I'd add the same check below into __napi_alloc_skb() as it has
the same fallback.
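
I.e. something like this in both helpers (a sketch of the suggestion;
the later v3 of this patch does exactly that):

```
	if (unlikely(len > KMALLOC_MAX_SIZE))
		return NULL;
```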

>   skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
>   if (!skb)
>   goto skb_fail;
> --
> 2.25.1

Thanks,
Al



Re: [PATCH] arm64: enable GENERIC_FIND_FIRST_BIT

2021-02-25 Thread Alexander Lobakin
From: Yury Norov 
Date: Wed, 24 Feb 2021 07:44:16 -0800

> On Wed, Feb 24, 2021 at 11:52:55AM +0000, Alexander Lobakin wrote:
> > From: Yury Norov 
> > Date: Sat, 5 Dec 2020 08:54:06 -0800
> >
> > Hi,
> >
> > > ARM64 doesn't implement find_first_{zero}_bit in arch code and doesn't
> > > enable it in config. It leads to using find_next_bit() which is less
> > > efficient:
> > >
> > >  :
> > >0: aa0003e4mov x4, x0
> > >4: aa0103e0mov x0, x1
> > >8: b4000181cbz x1, 38 
> > >c: f9400083ldr x3, [x4]
> > >   10: d2800802mov x2, #0x40   // #64
> > >   14: 91002084add x4, x4, #0x8
> > >   18: b4c3cbz x3, 30 
> > >   1c: 1408b   3c 
> > >   20: f8408483ldr x3, [x4], #8
> > >   24: 91010045add x5, x2, #0x40
> > >   28: b5c3cbnzx3, 40 
> > >   2c: aa0503e2mov x2, x5
> > >   30: eb02001fcmp x0, x2
> > >   34: 5468b.hi20   // b.pmore
> > >   38: d65f03c0ret
> > >   3c: d282mov x2, #0x0// #0
> > >   40: dac00063rbitx3, x3
> > >   44: dac01063clz x3, x3
> > >   48: 8b020062add x2, x3, x2
> > >   4c: eb02001fcmp x0, x2
> > >   50: 9a829000cselx0, x0, x2, ls  // ls = plast
> > >   54: d65f03c0ret
> > >
> > >   ...
> > >
> > > 0118 <_find_next_bit.constprop.1>:
> > >  118: eb02007fcmp x3, x2
> > >  11c: 540002e2b.cs178 <_find_next_bit.constprop.1+0x60>  
> > > // b.hs, b.nlast
> > >  120: d346fc66lsr x6, x3, #6
> > >  124: f8667805ldr x5, [x0, x6, lsl #3]
> > >  128: b461cbz x1, 134 
> > > <_find_next_bit.constprop.1+0x1c>
> > >  12c: f8667826ldr x6, [x1, x6, lsl #3]
> > >  130: 8a0600a5and x5, x5, x6
> > >  134: ca0400a6eor x6, x5, x4
> > >  138: 9285mov x5, #0x // #-1
> > >  13c: 9ac320a5lsl x5, x5, x3
> > >  140: 927ae463and x3, x3, #0xffc0
> > >  144: ea0600a5andsx5, x5, x6
> > >  148: 54000120b.eq16c <_find_next_bit.constprop.1+0x54>  
> > > // b.none
> > >  14c: 140eb   184 <_find_next_bit.constprop.1+0x6c>
> > >  150: d346fc66lsr x6, x3, #6
> > >  154: f8667805ldr x5, [x0, x6, lsl #3]
> > >  158: b461cbz x1, 164 
> > > <_find_next_bit.constprop.1+0x4c>
> > >  15c: f8667826ldr x6, [x1, x6, lsl #3]
> > >  160: 8a0600a5and x5, x5, x6
> > >  164: eb05009fcmp x4, x5
> > >  168: 54c1b.ne180 <_find_next_bit.constprop.1+0x68>  
> > > // b.any
> > >  16c: 91010063add x3, x3, #0x40
> > >  170: eb03005fcmp x2, x3
> > >  174: 54fffee8b.hi150 <_find_next_bit.constprop.1+0x38>  
> > > // b.pmore
> > >  178: aa0203e0mov x0, x2
> > >  17c: d65f03c0ret
> > >  180: ca050085eor x5, x4, x5
> > >  184: dac000a5rbitx5, x5
> > >  188: dac010a5clz x5, x5
> > >  18c: 8b0300a3add x3, x5, x3
> > >  190: eb03005fcmp x2, x3
> > >  194: 9a839042cselx2, x2, x3, ls  // ls = plast
> > >  198: aa0203e0mov x0, x2
> > >  19c: d65f03c0ret
> > >
> > >  ...
> > >
> > > 0238 :
> > >  238: a9bf7bfdstp x29, x30, [sp, #-16]!
> > >  23c: aa0203e3mov x3, x2
> > >  240: d284mov x4, #0x0// #0
> > >  244: aa0103e2mov x2, x1
> > >  248: 910003fdmov x29, sp
> > >  24c: d281mov x1, #0x0// #0
> > >  250: 97b2bl  118 <_find_next_bit.constprop.1>
> &

Re: [PATCH] arm64: enable GENERIC_FIND_FIRST_BIT

2021-02-24 Thread Alexander Lobakin
From: Yury Norov 
Date: Sat, 5 Dec 2020 08:54:06 -0800

Hi,

> ARM64 doesn't implement find_first_{zero}_bit in arch code and doesn't
> enable it in config. It leads to using find_next_bit() which is less
> efficient:
>
>  :
>0: aa0003e4mov x4, x0
>4: aa0103e0mov x0, x1
>8: b4000181cbz x1, 38 
>c: f9400083ldr x3, [x4]
>   10: d2800802mov x2, #0x40   // #64
>   14: 91002084add x4, x4, #0x8
>   18: b4c3cbz x3, 30 
>   1c: 1408b   3c 
>   20: f8408483ldr x3, [x4], #8
>   24: 91010045add x5, x2, #0x40
>   28: b5c3cbnzx3, 40 
>   2c: aa0503e2mov x2, x5
>   30: eb02001fcmp x0, x2
>   34: 5468b.hi20   // b.pmore
>   38: d65f03c0ret
>   3c: d282mov x2, #0x0// #0
>   40: dac00063rbitx3, x3
>   44: dac01063clz x3, x3
>   48: 8b020062add x2, x3, x2
>   4c: eb02001fcmp x0, x2
>   50: 9a829000cselx0, x0, x2, ls  // ls = plast
>   54: d65f03c0ret
>
>   ...
>
> 0118 <_find_next_bit.constprop.1>:
>  118: eb02007fcmp x3, x2
>  11c: 540002e2b.cs178 <_find_next_bit.constprop.1+0x60>  // b.hs, 
> b.nlast
>  120: d346fc66lsr x6, x3, #6
>  124: f8667805ldr x5, [x0, x6, lsl #3]
>  128: b461cbz x1, 134 <_find_next_bit.constprop.1+0x1c>
>  12c: f8667826ldr x6, [x1, x6, lsl #3]
>  130: 8a0600a5and x5, x5, x6
>  134: ca0400a6eor x6, x5, x4
>  138: 9285mov x5, #0x // #-1
>  13c: 9ac320a5lsl x5, x5, x3
>  140: 927ae463and x3, x3, #0xffc0
>  144: ea0600a5andsx5, x5, x6
>  148: 54000120b.eq16c <_find_next_bit.constprop.1+0x54>  // b.none
>  14c: 140eb   184 <_find_next_bit.constprop.1+0x6c>
>  150: d346fc66lsr x6, x3, #6
>  154: f8667805ldr x5, [x0, x6, lsl #3]
>  158: b461cbz x1, 164 <_find_next_bit.constprop.1+0x4c>
>  15c: f8667826ldr x6, [x1, x6, lsl #3]
>  160: 8a0600a5and x5, x5, x6
>  164: eb05009fcmp x4, x5
>  168: 54c1b.ne180 <_find_next_bit.constprop.1+0x68>  // b.any
>  16c: 91010063add x3, x3, #0x40
>  170: eb03005fcmp x2, x3
>  174: 54fffee8b.hi150 <_find_next_bit.constprop.1+0x38>  // 
> b.pmore
>  178: aa0203e0mov x0, x2
>  17c: d65f03c0ret
>  180: ca050085eor x5, x4, x5
>  184: dac000a5rbitx5, x5
>  188: dac010a5clz x5, x5
>  18c: 8b0300a3add x3, x5, x3
>  190: eb03005fcmp x2, x3
>  194: 9a839042cselx2, x2, x3, ls  // ls = plast
>  198: aa0203e0mov x0, x2
>  19c: d65f03c0ret
>
>  ...
>
> 0238 :
>  238: a9bf7bfdstp x29, x30, [sp, #-16]!
>  23c: aa0203e3mov x3, x2
>  240: d284mov x4, #0x0// #0
>  244: aa0103e2mov x2, x1
>  248: 910003fdmov x29, sp
>  24c: d281mov x1, #0x0// #0
>  250: 97b2bl  118 <_find_next_bit.constprop.1>
>  254: a8c17bfdldp x29, x30, [sp], #16
>  258: d65f03c0ret
>
> Enabling this functions would also benefit for_each_{set,clear}_bit().
> Would it make sense to enable this config for all such architectures by
> default?

I confirm that GENERIC_FIND_FIRST_BIT also produces more optimized and
faster code on MIPS (32 R2), where there are likewise no
architecture-specific bit-searching routines.
So, if it's okay for other folks, I'd suggest going for it and enabling
it for all similar arches.

(otherwise, I'll publish a separate entry for mips-next after the
 5.12-rc1 release and mention you in "Suggested-by:")
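
For reference, the MIPS side would presumably be just one more select
(a sketch, not a tested patch; the exact place in the select list of
arch/mips/Kconfig may differ):

```
config MIPS
	bool
	default y
	select GENERIC_FIND_FIRST_BIT
```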

> Signed-off-by: Yury Norov 
>
> ---
>  arch/arm64/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 1515f6f153a0..2b90ef1f548e 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -106,6 +106,7 @@ config ARM64
>   select GENERIC_CPU_AUTOPROBE
>   select GENERIC_CPU_VULNERABILITIES
>   select GENERIC_EARLY_IOREMAP
> + select GENERIC_FIND_FIRST_BIT
>   select GENERIC_IDLE_POLL_SETUP
>   select GENERIC_IRQ_IPI
>   select GENERIC_IRQ_MULTI_HANDLER
> --
> 2.25.1

Thanks,
Al



Re: [GIT PULL v2] clang-lto for v5.12-rc1

2021-02-23 Thread Alexander Lobakin
From: Linus Torvalds 
Date: Tue, 23 Feb 2021 12:33:05 -0800

> On Tue, Feb 23, 2021 at 9:49 AM Linus Torvalds
>  wrote:
> >
> > On Mon, Feb 22, 2021 at 3:11 PM Kees Cook  wrote:
> > >
> > > While x86 LTO enablement is done[1], it depends on some objtool
> > > clean-ups[2], though it appears those actually have been in linux-next
> > > (via tip/objtool/core), so it's possible that if that tree lands [..]
> >
> > That tree is actually next on my list of things to merge after this
> > one, so it should be out soonish.
>
> "soonish" turned out to be later than I thought, because my "build
> changes" set of pulls included the module change that I then wasted a
> lot of time on trying to figure out why it slowed down my build so
> much.

I guess it's about CONFIG_TRIM_UNUSED_KSYMS, which you disabled in your
tree.
Well, it's actually widely used, mostly in the embedded world, where
there are often no out-of-tree modules but a need to save as much
space as possible.
For full-blown systems and distributions it's almost useless, right.

> But it's out now, as pr-tracker-bot already noted.
>
>   Linus

Thanks,
Al



Re: [PATCH mips-fixes] vmlinux.lds.h: catch even more instrumentation symbols into .data

2021-02-23 Thread Alexander Lobakin
From: Thomas Bogendoerfer 
Date: Tue, 23 Feb 2021 13:21:44 +0100

> On Tue, Feb 23, 2021 at 11:36:41AM +0000, Alexander Lobakin wrote:
> > > LKP caught another bunch of orphaned instrumentation symbols [0]:
> > >
> > > mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from
> > > `init/main.o' being placed in section `.data.$LPBX1'
> > > mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from
> > > `init/main.o' being placed in section `.data.$LPBX0'
> > > mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from
> > > `init/do_mounts.o' being placed in section `.data.$LPBX1'
> > > mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from
> > > `init/do_mounts.o' being placed in section `.data.$LPBX0'
> > > mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from
> > > `init/do_mounts_initrd.o' being placed in section `.data.$LPBX1'
> > > mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from
> > > `init/do_mounts_initrd.o' being placed in section `.data.$LPBX0'
> > > mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from
> > > `init/initramfs.o' being placed in section `.data.$LPBX1'
> > > mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from
> > > `init/initramfs.o' being placed in section `.data.$LPBX0'
> > > mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from
> > > `init/calibrate.o' being placed in section `.data.$LPBX1'
> > > mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from
> > > `init/calibrate.o' being placed in section `.data.$LPBX0'
> > >
> > > [...]
> > >
> > > Soften the wildcard to .data.$L* to grab these ones into .data too.
> > >
> > > [0] https://lore.kernel.org/lkml/202102231519.lwplpvev-...@intel.com
> > >
> > > Reported-by: kernel test robot 
> > > Signed-off-by: Alexander Lobakin 
> > > ---
> > >  include/asm-generic/vmlinux.lds.h | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > Hi Thomas,
> >
> > This applies on top of mips-next or Linus' tree, so you may need to
> > rebase mips-fixes before taking it.
> > It's not for mips-next as it should go into this cycle as a [hot]fix.
> > I haven't added any "Fixes:" tag since these warnings is a result
> > of merging several sets and of certain build configurations that
> > almost couldn't be tested separately.
>
> no worries, mips-fixes is defunct during merge windows. I'll send another
> pull request to Linus and will add this patch to it.

Ah, thank you!

> Thomas.

Al

> --
> Crap can work. Given enough thrust pigs will fly, but it's not necessarily a
> good idea.[ RFC1925, 2.3 ]



Re: [PATCH mips-fixes] vmlinux.lds.h: catch even more instrumentation symbols into .data

2021-02-23 Thread Alexander Lobakin
> LKP caught another bunch of orphaned instrumentation symbols [0]:
>
> mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from
> `init/main.o' being placed in section `.data.$LPBX1'
> mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from
> `init/main.o' being placed in section `.data.$LPBX0'
> mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from
> `init/do_mounts.o' being placed in section `.data.$LPBX1'
> mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from
> `init/do_mounts.o' being placed in section `.data.$LPBX0'
> mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from
> `init/do_mounts_initrd.o' being placed in section `.data.$LPBX1'
> mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from
> `init/do_mounts_initrd.o' being placed in section `.data.$LPBX0'
> mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from
> `init/initramfs.o' being placed in section `.data.$LPBX1'
> mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from
> `init/initramfs.o' being placed in section `.data.$LPBX0'
> mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from
> `init/calibrate.o' being placed in section `.data.$LPBX1'
> mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from
> `init/calibrate.o' being placed in section `.data.$LPBX0'
>
> [...]
>
> Soften the wildcard to .data.$L* to grab these ones into .data too.
>
> [0] https://lore.kernel.org/lkml/202102231519.lwplpvev-...@intel.com
>
> Reported-by: kernel test robot 
> Signed-off-by: Alexander Lobakin 
> ---
>  include/asm-generic/vmlinux.lds.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Hi Thomas,

This applies on top of mips-next or Linus' tree, so you may need to
rebase mips-fixes before taking it.
It's not for mips-next as it should go into this cycle as a [hot]fix.
I haven't added any "Fixes:" tag since these warnings are a result
of merging several sets and of certain build configurations that
could hardly be tested separately.

> diff --git a/include/asm-generic/vmlinux.lds.h 
> b/include/asm-generic/vmlinux.lds.h
> index 01a3fd6a64d2..c887ac36c1b4 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -95,7 +95,7 @@
>   */
>  #ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
>  #define TEXT_MAIN .text .text.[0-9a-zA-Z_]*
> -#define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..L* 
> .data..compoundliteral* .data.$__unnamed_* .data.$Lubsan_*
> +#define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..L* 
> .data..compoundliteral* .data.$__unnamed_* .data.$L*
>  #define SDATA_MAIN .sdata .sdata.[0-9a-zA-Z_]*
>  #define RODATA_MAIN .rodata .rodata.[0-9a-zA-Z_]* .rodata..L*
>  #define BSS_MAIN .bss .bss.[0-9a-zA-Z_]* .bss..compoundliteral*
> --
> 2.30.1

Thanks,
Al



[PATCH mips-fixes] vmlinux.lds.h: catch even more instrumentation symbols into .data

2021-02-23 Thread Alexander Lobakin
LKP caught another bunch of orphaned instrumentation symbols [0]:

mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from
`init/main.o' being placed in section `.data.$LPBX1'
mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from
`init/main.o' being placed in section `.data.$LPBX0'
mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from
`init/do_mounts.o' being placed in section `.data.$LPBX1'
mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from
`init/do_mounts.o' being placed in section `.data.$LPBX0'
mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from
`init/do_mounts_initrd.o' being placed in section `.data.$LPBX1'
mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from
`init/do_mounts_initrd.o' being placed in section `.data.$LPBX0'
mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from
`init/initramfs.o' being placed in section `.data.$LPBX1'
mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from
`init/initramfs.o' being placed in section `.data.$LPBX0'
mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from
`init/calibrate.o' being placed in section `.data.$LPBX1'
mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from
`init/calibrate.o' being placed in section `.data.$LPBX0'

[...]

Soften the wildcard to .data.$L* to grab these ones into .data too.

[0] https://lore.kernel.org/lkml/202102231519.lwplpvev-...@intel.com

Reported-by: kernel test robot 
Signed-off-by: Alexander Lobakin 
---
 include/asm-generic/vmlinux.lds.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/asm-generic/vmlinux.lds.h 
b/include/asm-generic/vmlinux.lds.h
index 01a3fd6a64d2..c887ac36c1b4 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -95,7 +95,7 @@
  */
 #ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
 #define TEXT_MAIN .text .text.[0-9a-zA-Z_]*
-#define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..L* .data..compoundliteral* 
.data.$__unnamed_* .data.$Lubsan_*
+#define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..L* .data..compoundliteral* 
.data.$__unnamed_* .data.$L*
 #define SDATA_MAIN .sdata .sdata.[0-9a-zA-Z_]*
 #define RODATA_MAIN .rodata .rodata.[0-9a-zA-Z_]* .rodata..L*
 #define BSS_MAIN .bss .bss.[0-9a-zA-Z_]* .bss..compoundliteral*
--
2.30.1




Re: linux-next: manual merge of the kspp tree with the mips tree

2021-02-23 Thread Alexander Lobakin
From: Stephen Rothwell 
Date: Tue, 23 Feb 2021 10:49:50 +1100

> Hi all,

Hi,

> On Mon, 15 Feb 2021 07:47:26 +1100 Stephen Rothwell  
> wrote:
> >
> > On Mon, 18 Jan 2021 15:08:04 +1100 Stephen Rothwell  
> > wrote:
> > >
> > > Today's linux-next merge of the kspp tree got a conflict in:
> > >
> > >   include/asm-generic/vmlinux.lds.h
> > >
> > > between commits:
> > >
> > >   9a427556fb8e ("vmlinux.lds.h: catch compound literals into
> > > data and BSS")
> > >   f41b233de0ae ("vmlinux.lds.h: catch UBSAN's "unnamed data" into data")
> > >
> > > from the mips tree and commit:
> > >
> > >   dc5723b02e52 ("kbuild: add support for Clang LTO")
> > >
> > > from the kspp tree.
> > >
> > > I fixed it up (9a427556fb8e and dc5723b02e52 made the same change to
> > > DATA_MAIN, which conflicted with the change in f41b233de0ae) and can
> > > carry the fix as necessary. This is now fixed as far as linux-next is
> > > concerned, but any non trivial conflicts should be mentioned to your
> > > upstream maintainer when your tree is submitted for merging. You may
> > > also want to consider cooperating with the maintainer of the
> > > conflicting tree to minimise any particularly complex conflicts.
> >
> > With the merge window about to open, this is a reminder that this
> > conflict still exists.
>
> This is now a conflict between the kspp tree and Linus' tree.

Kees prepared a Git pull of kspp tree for Linus, this will be resolved
soon.

> --
> Cheers,
> Stephen Rothwell

Al



[PATCH v8 bpf-next 5/5] xsk: build skb by page (aka generic zerocopy xmit)

2021-02-18 Thread Alexander Lobakin
From: Xuan Zhuo 

This patch constructs the skb directly from pages to save the memory
copy overhead.

The function is implemented on top of IFF_TX_SKB_NO_LINEAR: only
network cards that advertise IFF_TX_SKB_NO_LINEAR in their priv_flags
will have the skb built directly from pages. If this feature is not
supported, the data is still copied into a freshly built skb.

 Performance Testing 

The test environment is Aliyun ECS server.
Test cmd:
```
xdpsock -i eth0 -t  -S -s 
```

Test result data:

size    64      512     1024    1500
copy    1916747 1775988 1600203 1440054
page    1974058 1953655 1945463 1904478
percent 3.0%    10.0%   21.58%  32.3%

Signed-off-by: Xuan Zhuo 
Reviewed-by: Dust Li 
[ alobakin:
 - expand subject to make it clearer;
 - improve skb->truesize calculation;
 - reserve some headroom in skb for drivers;
 - tailroom is not needed as skb is non-linear ]
Signed-off-by: Alexander Lobakin 
Acked-by: Magnus Karlsson 
Acked-by: John Fastabend 
---
 net/xdp/xsk.c | 120 --
 1 file changed, 96 insertions(+), 24 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 143979ea4165..a71ed664da0a 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -445,6 +445,97 @@ static void xsk_destruct_skb(struct sk_buff *skb)
sock_wfree(skb);
 }

+static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
+ struct xdp_desc *desc)
+{
+   struct xsk_buff_pool *pool = xs->pool;
+   u32 hr, len, ts, offset, copy, copied;
+   struct sk_buff *skb;
+   struct page *page;
+   void *buffer;
+   int err, i;
+   u64 addr;
+
+   hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom));
+
+   skb = sock_alloc_send_skb(&xs->sk, hr, 1, &err);
+   if (unlikely(!skb))
+   return ERR_PTR(err);
+
+   skb_reserve(skb, hr);
+
+   addr = desc->addr;
+   len = desc->len;
+   ts = pool->unaligned ? len : pool->chunk_size;
+
+   buffer = xsk_buff_raw_get_data(pool, addr);
+   offset = offset_in_page(buffer);
+   addr = buffer - pool->addrs;
+
+   for (copied = 0, i = 0; copied < len; i++) {
+   page = pool->umem->pgs[addr >> PAGE_SHIFT];
+   get_page(page);
+
+   copy = min_t(u32, PAGE_SIZE - offset, len - copied);
+   skb_fill_page_desc(skb, i, page, offset, copy);
+
+   copied += copy;
+   addr += copy;
+   offset = 0;
+   }
+
+   skb->len += len;
+   skb->data_len += len;
+   skb->truesize += ts;
+
+   refcount_add(ts, &xs->sk.sk_wmem_alloc);
+
+   return skb;
+}
+
+static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
+struct xdp_desc *desc)
+{
+   struct net_device *dev = xs->dev;
+   struct sk_buff *skb;
+
+   if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
+   skb = xsk_build_skb_zerocopy(xs, desc);
+   if (IS_ERR(skb))
+   return skb;
+   } else {
+   u32 hr, tr, len;
+   void *buffer;
+   int err;
+
+   hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom));
+   tr = dev->needed_tailroom;
+   len = desc->len;
+
+   skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err);
+   if (unlikely(!skb))
+   return ERR_PTR(err);
+
+   skb_reserve(skb, hr);
+   skb_put(skb, len);
+
+   buffer = xsk_buff_raw_get_data(xs->pool, desc->addr);
+   err = skb_store_bits(skb, 0, buffer, len);
+   if (unlikely(err)) {
+   kfree_skb(skb);
+   return ERR_PTR(err);
+   }
+   }
+
+   skb->dev = dev;
+   skb->priority = xs->sk.sk_priority;
+   skb->mark = xs->sk.sk_mark;
+   skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr;
+   skb->destructor = xsk_destruct_skb;
+
+   return skb;
+}
+
 static int xsk_generic_xmit(struct sock *sk)
 {
struct xdp_sock *xs = xdp_sk(sk);
@@ -454,56 +545,37 @@ static int xsk_generic_xmit(struct sock *sk)
struct sk_buff *skb;
unsigned long flags;
int err = 0;
-   u32 hr, tr;

mutex_lock(&xs->mutex);

if (xs->queue_id >= xs->dev->real_num_tx_queues)
goto out;

-   hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom));
-   tr = xs->dev->needed_tailroom;
-
while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
-   char *buffer;
-   u64 addr;
-   u32 len;
-
if (max_batch-- == 0) {
err = -EAGAIN;
goto out;
}

-   

[PATCH v8 bpf-next 4/5] xsk: respect device's headroom and tailroom on generic xmit path

2021-02-18 Thread Alexander Lobakin
xsk_generic_xmit() allocates a new skb and then queues it for
xmitting. The size of the new skb is desc->len, so it comes
to the driver/device with no reserved headroom and/or tailroom.
Lots of drivers need some headroom (and sometimes tailroom) to
prepend (and/or append) some headers or data, e.g. CPU tags,
device-specific headers/descriptors (LSO, TLS etc.), and in case
of no available space skb_cow_head() will reallocate the skb.
Reallocations are unwanted on the fast path, especially when it
comes to XDP, so generic XSK xmit should reserve the space declared
in dev->needed_headroom and dev->needed_tailroom to avoid them.

Note on max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)):

Usually, output functions reserve LL_RESERVED_SPACE(dev), which
consists of dev->hard_header_len + dev->needed_headroom, aligned
by 16.
However, on XSK xmit the hard header is already there in the chunk,
so hard_header_len is not needed. But it'd still be better to align
the data up to a cacheline, while reserving no less than the driver
requests for headroom. NET_SKB_PAD here is to doubly ensure there
will be no reallocations even when the driver advertises no
needed_headroom but in fact needs it (not such a rare case).
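
As a quick worked example (numbers assumed, not taken from any
particular driver):

```
/* With 64-byte cachelines, NET_SKB_PAD is 64; a driver advertising
 * needed_headroom = 70 gives L1_CACHE_ALIGN(70) = 128, so
 * max(NET_SKB_PAD, L1_CACHE_ALIGN(needed_headroom)) reserves 128
 * bytes of headroom and skb_cow_head() has nothing left to do in
 * the driver later.
 */
hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom));
```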

Fixes: 35fcde7f8deb ("xsk: support for Tx")
Signed-off-by: Alexander Lobakin 
Acked-by: Magnus Karlsson 
Acked-by: John Fastabend 
---
 net/xdp/xsk.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 4faabd1ecfd1..143979ea4165 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -454,12 +454,16 @@ static int xsk_generic_xmit(struct sock *sk)
struct sk_buff *skb;
unsigned long flags;
int err = 0;
+   u32 hr, tr;

mutex_lock(&xs->mutex);

if (xs->queue_id >= xs->dev->real_num_tx_queues)
goto out;

+   hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom));
+   tr = xs->dev->needed_tailroom;
+
while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
char *buffer;
u64 addr;
@@ -471,11 +475,13 @@ static int xsk_generic_xmit(struct sock *sk)
}

len = desc.len;
-   skb = sock_alloc_send_skb(sk, len, 1, &err);
+   skb = sock_alloc_send_skb(sk, hr + len + tr, 1, &err);
if (unlikely(!skb))
goto out;

+   skb_reserve(skb, hr);
skb_put(skb, len);
+
addr = desc.addr;
buffer = xsk_buff_raw_get_data(xs->pool, addr);
err = skb_store_bits(skb, 0, buffer, len);
--
2.30.1




[PATCH v8 bpf-next 3/5] virtio-net: support IFF_TX_SKB_NO_LINEAR

2021-02-18 Thread Alexander Lobakin
From: Xuan Zhuo 

Virtio net supports the case where the skb linear space is empty, so
advertise IFF_TX_SKB_NO_LINEAR in its priv_flags.

Signed-off-by: Xuan Zhuo 
Acked-by: Michael S. Tsirkin 
Signed-off-by: Alexander Lobakin 
Acked-by: John Fastabend 
---
 drivers/net/virtio_net.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index ba8e63792549..f2ff6c3906c1 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2972,7 +2972,8 @@ static int virtnet_probe(struct virtio_device *vdev)
return -ENOMEM;

/* Set up network device as normal. */
-   dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE;
+   dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE |
+  IFF_TX_SKB_NO_LINEAR;
dev->netdev_ops = &virtnet_netdev;
dev->features = NETIF_F_HIGHDMA;

--
2.30.1




[PATCH v8 bpf-next 2/5] net: add priv_flags for allow tx skb without linear

2021-02-18 Thread Alexander Lobakin
From: Xuan Zhuo 

In some cases, we want to construct the skb directly from existing
memory without copying the data. In this case, the pages are placed
directly in the skb, and the skb's linear space is empty. But,
unfortunately, many network cards do not support this operation.
For example, Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
produces the following error message:

mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8,
qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68
: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2
WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64
: 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00
0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00
0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb

So a priv_flag is added here to indicate whether the network card
supports this feature.

Signed-off-by: Xuan Zhuo 
Suggested-by: Alexander Lobakin 
[ alobakin: give a new flag more detailed description ]
Signed-off-by: Alexander Lobakin 
Acked-by: John Fastabend 
---
 include/linux/netdevice.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3b6f82c2c271..6cef47b76cc6 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1518,6 +1518,8 @@ struct net_device_ops {
  * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
  * @IFF_L3MDEV_RX_HANDLER: only invoke the rx handler of L3 master device
  * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running
+ * @IFF_TX_SKB_NO_LINEAR: device/driver is capable of xmitting frames with
+ * skb_headlen(skb) == 0 (data starts from frag0)
  */
 enum netdev_priv_flags {
IFF_802_1Q_VLAN = 1<<0,
@@ -1551,6 +1553,7 @@ enum netdev_priv_flags {
IFF_FAILOVER_SLAVE  = 1<<28,
IFF_L3MDEV_RX_HANDLER   = 1<<29,
IFF_LIVE_RENAME_OK  = 1<<30,
+   IFF_TX_SKB_NO_LINEAR= 1<<31,
 };

 #define IFF_802_1Q_VLANIFF_802_1Q_VLAN
@@ -1584,6 +1587,7 @@ enum netdev_priv_flags {
 #define IFF_FAILOVER_SLAVE IFF_FAILOVER_SLAVE
 #define IFF_L3MDEV_RX_HANDLER  IFF_L3MDEV_RX_HANDLER
 #define IFF_LIVE_RENAME_OK IFF_LIVE_RENAME_OK
+#define IFF_TX_SKB_NO_LINEAR   IFF_TX_SKB_NO_LINEAR

 /**
  * struct net_device - The DEVICE structure.
--
2.30.1



