Re: [PATCH net-next v7] net/core: Introduce netdev_core_stats_inc()

2023-10-09 Thread Yajun Deng



On 2023/10/9 22:28, Steven Rostedt wrote:

On Mon, 9 Oct 2023 18:58:27 +0800
Yajun Deng  wrote:


The C compiler decides to inline or not, depending on various factors.

The most efficient (and small) code is generated by the this_cpu_inc()
version, allowing the compiler to inline it.

If you copy/paste this_cpu_inc() twenty times, then the compiler
would not inline the function anymore.

Yes, if you want something to be visible to ftrace, it must not be inlined
(as inlined functions are not function calls by definition). And as Eric
stated, the compiler is perfectly allowed to inline something if it
believes it will be more efficient; i.e., there may be code around the
function call that could be more efficient if it weren't changed into
parameter passing. If you want to make sure a function stays out of line,
you must explicitly tell the compiler that the function is never to be
inlined (hence the "noinline" attribute).



Thanks for the details.



Got it. Thank you.

Great.

-- Steve




Re: [PATCH net-next v7] net/core: Introduce netdev_core_stats_inc()

2023-10-09 Thread Steven Rostedt
On Mon, 9 Oct 2023 18:58:27 +0800
Yajun Deng  wrote:

> > The C compiler decides to inline or not, depending on various factors.
> >
> > The most efficient (and small) code is generated by the this_cpu_inc()
> > version, allowing the compiler to inline it.
> >
> > If you copy/paste this_cpu_inc() twenty times, then the compiler
> > would not inline the function anymore.

Yes, if you want something to be visible to ftrace, it must not be inlined
(as inlined functions are not function calls by definition). And as Eric
stated, the compiler is perfectly allowed to inline something if it
believes it will be more efficient; i.e., there may be code around the
function call that could be more efficient if it weren't changed into
parameter passing. If you want to make sure a function stays out of line,
you must explicitly tell the compiler that the function is never to be
inlined (hence the "noinline" attribute).

> 
> 
> Got it. Thank you.

Great.

-- Steve
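
A minimal user-space sketch of the point above (not from the patch; file and
function names are invented). Compile with "gcc -O2 noinline_demo.c" and look
at "objdump -d a.out": the plain helper is normally folded into main() and
leaves no call instruction to hook, while the noinline variant keeps a real
call site at a real symbol, which is what ftrace needs.

#include <stdio.h>

static unsigned long counter;

/* Small enough that -O2 will normally inline it, leaving no call
 * instruction behind for a tracer to attach to. */
static void bump_inlined(void)
{
	counter++;
}

/* The attribute forces an out-of-line copy with a real call site. */
__attribute__((noinline)) static void bump_out_of_line(void)
{
	counter++;
}

int main(void)
{
	for (int i = 0; i < 20; i++) {
		bump_inlined();
		bump_out_of_line();
	}
	printf("%lu\n", counter);
	return 0;
}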



Re: [PATCH net-next v7] net/core: Introduce netdev_core_stats_inc()

2023-10-09 Thread Yajun Deng



On 2023/10/9 18:16, Eric Dumazet wrote:

On Mon, Oct 9, 2023 at 11:43 AM Yajun Deng  wrote:


On 2023/10/9 17:30, Eric Dumazet wrote:

On Mon, Oct 9, 2023 at 10:36 AM Yajun Deng  wrote:

On 2023/10/9 16:20, Eric Dumazet wrote:

On Mon, Oct 9, 2023 at 10:14 AM Yajun Deng  wrote:

On 2023/10/9 15:53, Eric Dumazet wrote:

On Mon, Oct 9, 2023 at 5:07 AM Yajun Deng  wrote:


'this_cpu_read + this_cpu_write' and 'pr_info + this_cpu_inc' will make
the trace work well.

They all have 'pop' instructions in them. This may be the key to making
the trace work well.

Hi all,

I need your help on percpu and ftrace.


I do not think you made sure netdev_core_stats_inc() was never inlined.

Adding more code in it is simply changing how the compiler decides to
inline or not.

Yes, you are right. It needs the 'noinline' prefix added; the
disassembly code will then have a 'pop' instruction.


The function was fine, you do not need anything like push or pop.

The only needed stuff was the call __fentry__.

The fact that the function was inlined for some invocations was the
issue, because the trace point
is only planted in the out of line function.

But somehow the following code isn't inlined? They didn't need the
'noinline' prefix.

+   field = (unsigned long *)((void *)this_cpu_ptr(p) + offset);
+   WRITE_ONCE(*field, READ_ONCE(*field) + 1);

Or
+   (*(unsigned long *)((void *)this_cpu_ptr(p) + offset))++;


I think you are very confused.

You only want to trace the netdev_core_stats_inc() entry point, not
arbitrary pieces of it.


Yes, I will trace the netdev_core_stats_inc() entry point. I mean to replace

+   field = (__force unsigned long __percpu *)((__force void *)p + offset);
+   this_cpu_inc(*field);

with

+   field = (unsigned long *)((void *)this_cpu_ptr(p) + offset);
+   WRITE_ONCE(*field, READ_ONCE(*field) + 1);

Or
+   (*(unsigned long *)((void *)this_cpu_ptr(p) + offset))++;

The netdev_core_stats_inc() entry point will work fine even if it doesn't
have the 'noinline' prefix.

I don't know why this code needs the 'noinline' prefix.
+   field = (__force unsigned long __percpu *)((__force void *)p + offset);
+   this_cpu_inc(*field);


The C compiler decides to inline or not, depending on various factors.

The most efficient (and small) code is generated by the this_cpu_inc()
version, allowing the compiler to inline it.

If you copy/paste this_cpu_inc() twenty times, then the compiler
would not inline the function anymore.



Got it. Thank you.




Re: [PATCH net-next v7] net/core: Introduce netdev_core_stats_inc()

2023-10-09 Thread Eric Dumazet
On Mon, Oct 9, 2023 at 11:43 AM Yajun Deng  wrote:
>
>
> On 2023/10/9 17:30, Eric Dumazet wrote:
> > On Mon, Oct 9, 2023 at 10:36 AM Yajun Deng  wrote:
> >>
> >> On 2023/10/9 16:20, Eric Dumazet wrote:
> >>> On Mon, Oct 9, 2023 at 10:14 AM Yajun Deng  wrote:
>  On 2023/10/9 15:53, Eric Dumazet wrote:
> > On Mon, Oct 9, 2023 at 5:07 AM Yajun Deng  wrote:
> >
> >> 'this_cpu_read + this_cpu_write' and 'pr_info + this_cpu_inc' will make
> >> the trace work well.
> >>
> >> They all have 'pop' instructions in them. This may be the key to making
> >> the trace work well.
> >>
> >> Hi all,
> >>
> >> I need your help on percpu and ftrace.
> >>
> > I do not think you made sure netdev_core_stats_inc() was never inlined.
> >
> > Adding more code in it is simply changing how the compiler decides to
> > inline or not.
>  Yes, you are right. It needs the 'noinline' prefix added; the
>  disassembly code will then have a 'pop' instruction.
> 
> >>> The function was fine, you do not need anything like push or pop.
> >>>
> >>> The only needed stuff was the call __fentry__.
> >>>
> >>> The fact that the function was inlined for some invocations was the
> >>> issue, because the trace point
> >>> is only planted in the out of line function.
> >>
> >> But somehow the following code isn't inlined? They didn't need the
> >> 'noinline' prefix.
> >>
> >> +   field = (unsigned long *)((void *)this_cpu_ptr(p) + offset);
> >> +   WRITE_ONCE(*field, READ_ONCE(*field) + 1);
> >>
> >> Or
> >> +   (*(unsigned long *)((void *)this_cpu_ptr(p) + offset))++;
> >>
> > I think you are very confused.
> >
> > You only want to trace the netdev_core_stats_inc() entry point, not
> > arbitrary pieces of it.
>
>
> Yes, I will trace the netdev_core_stats_inc() entry point. I mean to replace
>
> +   field = (__force unsigned long __percpu *)((__force void *)p + offset);
> +   this_cpu_inc(*field);
>
> with
>
> +   field = (unsigned long *)((void *)this_cpu_ptr(p) + offset);
> +   WRITE_ONCE(*field, READ_ONCE(*field) + 1);
>
> Or
> +   (*(unsigned long *)((void *)this_cpu_ptr(p) + offset))++;
>
> The netdev_core_stats_inc() entry point will work fine even if it doesn't
> have the 'noinline' prefix.
>
> I don't know why this code needs the 'noinline' prefix.
> +   field = (__force unsigned long __percpu *)((__force void *)p + offset);
> +   this_cpu_inc(*field);
>

The C compiler decides to inline or not, depending on various factors.

The most efficient (and small) code is generated by the this_cpu_inc()
version, allowing the compiler to inline it.

If you copy/paste this_cpu_inc() twenty times, then the compiler
would not inline the function anymore.
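
A hedged user-space illustration of that size heuristic (not kernel code;
names are invented, and the exact outcome depends on the compiler's cost
model). The volatile sink only stops the optimizer from collapsing the
repeated increments, standing in for twenty copy/pasted this_cpu_inc() calls.

/* With "gcc -O2 -c inline_size.c", bump_small() is normally inlined into
 * both callers, while bump_large(), whose body is roughly twenty times
 * bigger, usually stays out of line ("objdump -d inline_size.o" shows the
 * remaining call instructions). */
static volatile unsigned long sink;

static void bump_small(void)
{
	sink++;
}

static void bump_large(void)
{
	sink++; sink++; sink++; sink++; sink++;
	sink++; sink++; sink++; sink++; sink++;
	sink++; sink++; sink++; sink++; sink++;
	sink++; sink++; sink++; sink++; sink++;
}

unsigned long caller_a(void)
{
	bump_small();
	bump_large();
	return sink;
}

unsigned long caller_b(void)
{
	bump_small();
	bump_large();
	return sink;
}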



Re: [PATCH net-next v7] net/core: Introduce netdev_core_stats_inc()

2023-10-09 Thread Yajun Deng



On 2023/10/9 17:30, Eric Dumazet wrote:

On Mon, Oct 9, 2023 at 10:36 AM Yajun Deng  wrote:


On 2023/10/9 16:20, Eric Dumazet wrote:

On Mon, Oct 9, 2023 at 10:14 AM Yajun Deng  wrote:

On 2023/10/9 15:53, Eric Dumazet wrote:

On Mon, Oct 9, 2023 at 5:07 AM Yajun Deng  wrote:


'this_cpu_read + this_cpu_write' and 'pr_info + this_cpu_inc' will make
the trace work well.

They all have 'pop' instructions in them. This may be the key to making
the trace work well.

Hi all,

I need your help on percpu and ftrace.


I do not think you made sure netdev_core_stats_inc() was never inlined.

Adding more code in it is simply changing how the compiler decides to
inline or not.

Yes, you are right. It needs the 'noinline' prefix added; the
disassembly code will then have a 'pop' instruction.


The function was fine, you do not need anything like push or pop.

The only needed stuff was the call __fentry__.

The fact that the function was inlined for some invocations was the
issue, because the trace point
is only planted in the out of line function.


But somehow the following code isn't inlined? They didn't need the
'noinline' prefix.

+   field = (unsigned long *)((void *)this_cpu_ptr(p) + offset);
+   WRITE_ONCE(*field, READ_ONCE(*field) + 1);

Or
+   (*(unsigned long *)((void *)this_cpu_ptr(p) + offset))++;


I think you are very confused.

You only want to trace the netdev_core_stats_inc() entry point, not
arbitrary pieces of it.



Yes, I will trace the netdev_core_stats_inc() entry point. I mean to replace

+   field = (__force unsigned long __percpu *)((__force void *)p + offset);
+   this_cpu_inc(*field);

with

+   field = (unsigned long *)((void *)this_cpu_ptr(p) + offset);
+   WRITE_ONCE(*field, READ_ONCE(*field) + 1);

Or
+   (*(unsigned long *)((void *)this_cpu_ptr(p) + offset))++;

The netdev_core_stats_inc() entry point will work fine even if it doesn't
have the 'noinline' prefix.

I don't know why this code needs the 'noinline' prefix.
+   field = (__force unsigned long __percpu *)((__force void *)p + offset);
+   this_cpu_inc(*field);




Re: [PATCH net-next v7] net/core: Introduce netdev_core_stats_inc()

2023-10-09 Thread Eric Dumazet
On Mon, Oct 9, 2023 at 10:36 AM Yajun Deng  wrote:
>
>
> On 2023/10/9 16:20, Eric Dumazet wrote:
> > On Mon, Oct 9, 2023 at 10:14 AM Yajun Deng  wrote:
> >>
> >> On 2023/10/9 15:53, Eric Dumazet wrote:
> >>> On Mon, Oct 9, 2023 at 5:07 AM Yajun Deng  wrote:
> >>>
>  'this_cpu_read + this_cpu_write' and 'pr_info + this_cpu_inc' will make
>  the trace work well.
> 
>  They all have 'pop' instructions in them. This may be the key to making
>  the trace work well.
> 
>  Hi all,
> 
>  I need your help on percpu and ftrace.
> 
> >>> I do not think you made sure netdev_core_stats_inc() was never inlined.
> >>>
> >>> Adding more code in it is simply changing how the compiler decides to
> >>> inline or not.
> >>
> >> Yes, you are right. It needs the 'noinline' prefix added; the
> >> disassembly code will then have a 'pop' instruction.
> >>
> > The function was fine, you do not need anything like push or pop.
> >
> > The only needed stuff was the call __fentry__.
> >
> > The fact that the function was inlined for some invocations was the
> > issue, because the trace point
> > is only planted in the out of line function.
>
>
> But somehow the following code isn't inlined? They didn't need the
> 'noinline' prefix.
>
> +   field = (unsigned long *)((void *)this_cpu_ptr(p) + offset);
> +   WRITE_ONCE(*field, READ_ONCE(*field) + 1);
>
> Or
> +   (*(unsigned long *)((void *)this_cpu_ptr(p) + offset))++;
>

I think you are very confused.

You only want to trace the netdev_core_stats_inc() entry point, not
arbitrary pieces of it.



Re: [PATCH net-next v7] net/core: Introduce netdev_core_stats_inc()

2023-10-09 Thread Yajun Deng



On 2023/10/9 16:20, Eric Dumazet wrote:

On Mon, Oct 9, 2023 at 10:14 AM Yajun Deng  wrote:


On 2023/10/9 15:53, Eric Dumazet wrote:

On Mon, Oct 9, 2023 at 5:07 AM Yajun Deng  wrote:


'this_cpu_read + this_cpu_write' and 'pr_info + this_cpu_inc' will make
the trace work well.

They all have 'pop' instructions in them. This may be the key to making
the trace work well.

Hi all,

I need your help on percpu and ftrace.


I do not think you made sure netdev_core_stats_inc() was never inlined.

Adding more code in it is simply changing how the compiler decides to
inline or not.


Yes, you are right. It needs the 'noinline' prefix added; the
disassembly code will then have a 'pop' instruction.


The function was fine, you do not need anything like push or pop.

The only needed stuff was the call __fentry__.

The fact that the function was inlined for some invocations was the
issue, because the trace point
is only planted in the out of line function.



But somehow the following code isn't inlined? They didn't need the
'noinline' prefix.


+   field = (unsigned long *)((void *)this_cpu_ptr(p) + offset);
+   WRITE_ONCE(*field, READ_ONCE(*field) + 1);

Or
+   (*(unsigned long *)((void *)this_cpu_ptr(p) + offset))++;




Re: [PATCH net-next v7] net/core: Introduce netdev_core_stats_inc()

2023-10-09 Thread Eric Dumazet
On Mon, Oct 9, 2023 at 10:14 AM Yajun Deng  wrote:
>
>
> On 2023/10/9 15:53, Eric Dumazet wrote:
> > On Mon, Oct 9, 2023 at 5:07 AM Yajun Deng  wrote:
> >
> >> 'this_cpu_read + this_cpu_write' and 'pr_info + this_cpu_inc' will make
> >> the trace work well.
> >>
> >> They all have 'pop' instructions in them. This may be the key to making
> >> the trace work well.
> >>
> >> Hi all,
> >>
> >> I need your help on percpu and ftrace.
> >>
> > I do not think you made sure netdev_core_stats_inc() was never inlined.
> >
> > Adding more code in it is simply changing how the compiler decides to
> > inline or not.
>
>
> Yes, you are right. It needs the 'noinline' prefix added; the
> disassembly code will then have a 'pop' instruction.
>

The function was fine, you do not need anything like push or pop.

The only needed stuff was the call __fentry__.

The fact that the function was inlined for some invocations was the
issue, because the trace point
is only planted in the out of line function.
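
A hedged user-space sketch of what "the call __fentry__" refers to (not kernel
code; names are invented). Building with "gcc -O2 -pg -mfentry -S fentry_demo.c"
on x86-64 emits "call __fentry__" at the top of each out-of-line function; that
is the hook ftrace patches, and an inlined copy of the function simply has no
such hook at the call site.

/* consume() is assumed to be defined elsewhere; it just keeps the
 * increment observable so the helper is not optimized away. */
extern void consume(unsigned long v);

void helper(unsigned long *counter)
{
	/* with -pg -mfentry, the generated code for this function starts
	 * with "call __fentry__" before anything else */
	(*counter)++;
	consume(*counter);
}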



Re: [PATCH net-next v7] net/core: Introduce netdev_core_stats_inc()

2023-10-09 Thread Yajun Deng



On 2023/10/9 15:53, Eric Dumazet wrote:

On Mon, Oct 9, 2023 at 5:07 AM Yajun Deng  wrote:


'this_cpu_read + this_cpu_write' and 'pr_info + this_cpu_inc' will make
the trace work well.

They all have 'pop' instructions in them. This may be the key to making
the trace work well.

Hi all,

I need your help on percpu and ftrace.


I do not think you made sure netdev_core_stats_inc() was never inlined.

Adding more code in it is simply changing how the compiler decides to
inline or not.



Yes, you are right. It needs the 'noinline' prefix added; the
disassembly code will then have a 'pop' instruction.

Thanks.




Re: [PATCH net-next v7] net/core: Introduce netdev_core_stats_inc()

2023-10-09 Thread Eric Dumazet
On Mon, Oct 9, 2023 at 5:07 AM Yajun Deng  wrote:

> 'this_cpu_read + this_cpu_write' and 'pr_info + this_cpu_inc' will make
> the trace work well.
>
> They all have 'pop' instructions in them. This may be the key to making
> the trace work well.
>
> Hi all,
>
> I need your help on percpu and ftrace.
>

I do not think you made sure netdev_core_stats_inc() was never inlined.

Adding more code in it is simply changing how the compiler decides to
inline or not.



Re: [PATCH net-next v7] net/core: Introduce netdev_core_stats_inc()

2023-10-08 Thread Yajun Deng



On 2023/10/8 17:12, Yajun Deng wrote:


On 2023/10/8 16:53, Eric Dumazet wrote:

On Sun, Oct 8, 2023 at 10:44 AM Yajun Deng  wrote:


On 2023/10/8 15:18, Eric Dumazet wrote:
On Sun, Oct 8, 2023 at 9:00 AM Yajun Deng  wrote:

On 2023/10/8 14:45, Eric Dumazet wrote:
On Sat, Oct 7, 2023 at 8:34 AM Yajun Deng  wrote:

On 2023/10/7 13:29, Eric Dumazet wrote:
On Sat, Oct 7, 2023 at 7:06 AM Yajun Deng  wrote:
Although there is a kfree_skb_reason() helper function that can be used to
find the reason why this skb is dropped, most callers didn't increase
one of rx_dropped, tx_dropped, rx_nohandler and rx_otherhost_dropped.



...


+
+void netdev_core_stats_inc(struct net_device *dev, u32 offset)
+{
+	/* This READ_ONCE() pairs with the write in netdev_core_stats_alloc() */
+	struct net_device_core_stats __percpu *p = READ_ONCE(dev->core_stats);
+	unsigned long *field;
+
+	if (unlikely(!p))
+		p = netdev_core_stats_alloc(dev);
+
+	if (p) {
+		field = (unsigned long *)((void *)this_cpu_ptr(p) + offset);
+		WRITE_ONCE(*field, READ_ONCE(*field) + 1);

This is broken...

As I explained earlier, dev_core_stats_(dev) can be called from
many different contexts:

1) process contexts, where preemption and migration are allowed.
2) interrupt contexts.

Adding WRITE_ONCE()/READ_ONCE() is not solving potential races.

I _think_ I already gave you how to deal with this ?
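
A sketch of the race being described, reusing the names from the patch
(illustration only, not proposed code):

/* Open-coded read-modify-write: an interrupt on the same CPU that bumps
 * the same counter between the read and the write has its update lost. */
static void racy_inc(struct net_device_core_stats __percpu *p, u32 offset)
{
	unsigned long *field = (unsigned long *)((void *)this_cpu_ptr(p) + offset);
	unsigned long old = READ_ONCE(*field);	/* reads N */
	/* an IRQ handler can run here and store N + 1 */
	WRITE_ONCE(*field, old + 1);		/* stores N + 1: the IRQ's update is gone */
}

/* this_cpu_inc() does the whole update as one per-CPU-safe operation; on
 * x86 it becomes a single %gs-relative incq, as the objdump below shows. */
static void safe_inc(struct net_device_core_stats __percpu *p, u32 offset)
{
	unsigned long __percpu *field =
		(__force unsigned long __percpu *)((__force void *)p + offset);

	this_cpu_inc(*field);
}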

Yes, I replied in v6.

https://lore.kernel.org/all/e25b5f3c-bd97-56f0-de86-b93a31728...@linux.dev/ 




Please try instead:

+void netdev_core_stats_inc(struct net_device *dev, u32 offset)
+{
+	/* This READ_ONCE() pairs with the write in netdev_core_stats_alloc() */
+	struct net_device_core_stats __percpu *p = READ_ONCE(dev->core_stats);
+	unsigned long __percpu *field;
+
+	if (unlikely(!p)) {
+		p = netdev_core_stats_alloc(dev);
+		if (!p)
+			return;
+	}
+	field = (__force unsigned long __percpu *)((__force void *)p + offset);
+	this_cpu_inc(*field);
+}
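
A hedged sketch of how callers might reach this helper, assuming the byte
offset is taken with offsetof(); the exact wrapper form in the patch may
differ:

/* Hypothetical per-field wrappers (names assumed from the counters the
 * commit message lists), so a drop path can simply call
 * dev_core_stats_rx_dropped_inc(dev) before freeing the skb. */
#define DEV_CORE_STATS_INC(FIELD)					\
static inline void dev_core_stats_##FIELD##_inc(struct net_device *dev) \
{									\
	netdev_core_stats_inc(dev,					\
			      offsetof(struct net_device_core_stats, FIELD)); \
}

DEV_CORE_STATS_INC(rx_dropped)
DEV_CORE_STATS_INC(tx_dropped)
DEV_CORE_STATS_INC(rx_nohandler)
DEV_CORE_STATS_INC(rx_otherhost_dropped)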
This wouldn't trace anything even if rx_dropped is increasing. It
needs an extra operation added, such as:

I honestly do not know what you are talking about.

Have you even tried to change your patch to use

field = (__force unsigned long __percpu *)((__force void *)p + offset);

this_cpu_inc(*field);
Yes, I tested this code. But the following couldn't show anything even
if rx_dropped is increasing.

'sudo python3 /usr/share/bcc/tools/trace netdev_core_stats_inc'

Well, I am not sure about this; "bpftrace" worked for me.

Make sure your toolchain generates something that looks like what I got:


ef20 <netdev_core_stats_inc>:
  ef20: f3 0f 1e fa           endbr64
  ef24: e8 00 00 00 00        call   ef29
            ef25: R_X86_64_PLT32 __fentry__-0x4
  ef29: 55                    push   %rbp
  ef2a: 48 89 e5              mov    %rsp,%rbp
  ef2d: 53                    push   %rbx
  ef2e: 89 f3                 mov    %esi,%ebx
  ef30: 48 8b 87 f0 01 00 00  mov    0x1f0(%rdi),%rax
  ef37: 48 85 c0              test   %rax,%rax
  ef3a: 74 0b                 je     ef47
  ef3c: 89 d9                 mov    %ebx,%ecx
  ef3e: 65 48 ff 04 08        incq   %gs:(%rax,%rcx,1)
  ef43: 5b                    pop    %rbx
  ef44: 5d                    pop    %rbp
  ef45: c3                    ret
  ef46: cc                    int3
  ef47: e8 00 00 00 00        call   ef4c
            ef48: R_X86_64_PLT32 .text.unlikely.+0x13c
  ef4c: 48 85 c0              test   %rax,%rax
  ef4f: 75 eb                 jne    ef3c
  ef51: eb f0                 jmp    ef43
  ef53: 66 66 66 66 2e 0f 1f  data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
  ef5a: 84 00 00 00 00 00


I'll share some of what I can see.

1. objdump -D vmlinux

81b2f170 <netdev_core_stats_inc>:
81b2f170:    e8 8b ea 55 ff          callq  8108dc00 <__fentry__>
81b2f175:    55                      push   %rbp
81b2f176:    48 89 e5                mov    %rsp,%rbp
81b2f179:    48 83 ec 08             sub    $0x8,%rsp
81b2f17d:    48 8b 87 e8 01 00 00    mov    0x1e8(%rdi),%rax
81b2f184:    48 85 c0                test   %rax,%rax
81b2f187:    74 0d                   je     81b2f196
81b2f189:    89 f6                   mov    %esi,%esi
81b2f18b:    65 48 ff 04 30          incq   %gs:(%rax,%rsi,1)
81b2f190:    c9                      leaveq
81b2f191:    e9 aa 31 6d 00          jmpq   82202340 <__x86_return_thunk>
81b2f196:    89 75 fc                mov    %esi,-0x4(%rbp)
81b2f199:    e8 82 ff ff ff          callq  81b2f120
81b2f19e:    8b 75 fc                mov    -0x4(%rbp),%esi
81b2f1a1:    48 85 c0                test   %rax,%rax
81b2f1a4:    75 e3                   jne