Re: [PATCH net-next v7] net/ipv4: add tracepoint for icmp_send

2024-04-26 Thread Eric Dumazet
On Fri, Apr 26, 2024 at 10:47 AM  wrote:
>
> From: Peilin He 
>
> Introduce a tracepoint for icmp_send, which can help users get more
> detailed information conveniently when abnormal ICMP events happen.
>
> 1. Giving a use-case example:
> =
> When an application experiences packet loss due to an unreachable UDP
> destination port, the kernel will send an exception message through the
> icmp_send function. By adding a trace point for icmp_send, developers or
> system administrators can obtain detailed information about the UDP
> packet loss, including the type, code, source address, destination address,
> source port, and destination port. This facilitates the trouble-shooting
> of UDP packet loss issues especially for those network-service
> applications.
>
> 2. Operation Instructions:
> ==
> Switch to the tracing directory.
> cd /sys/kernel/tracing
> Filter for destination port unreachable.
> echo "type==3 && code==3" > events/icmp/icmp_send/filter
> Enable trace event.
> echo 1 > events/icmp/icmp_send/enable
>
> 3. Result View:
> 
>  udp_client_erro-11370   [002] ...s.12   124.728002:
>  icmp_send: icmp_send: type=3, code=3.
>  From 127.0.0.1:41895 to 127.0.0.1: ulen=23
>  skbaddr=589b167a
>
> Change log:
> ===
> v6->v7:
> Some fixes according to
> https://lore.kernel.org/all/20240425081210.720a4...@kernel.org/
> 1. Fix patch format issues.
>
> v5->v6:
> Some fixes according to
> https://lore.kernel.org/all/20240413161319.ga853...@kernel.org/
> 1. Resubmit patches based on the latest net-next code.
>
> v4->v5:
> Some fixes according to
> https://lore.kernel.org/all/CAL+tcoDeXXh+zcRk4PHnUk8ELnx=CE2pccqs7sfm0y9ak-e...@mail.gmail.com/
> 1. Adjust the position of trace_icmp_send() to before icmp_push_reply().
>
> v3->v4:
> Some fixes according to
> https://lore.kernel.org/all/CANn89i+EFEr7VHXNdOi59Ba_R1nFKSBJzBzkJFVgCTdXBx=y...@mail.gmail.com/
> 1. Add a legality check for the UDP header in the SKB.
> 2. Target this patch at net-next.
>
> v2->v3:
> Some fixes according to
> https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
> 1. Change the tracing directory to /sys/kernel/tracing.
> 2. Adjust the layout of the TP_STRUCT__entry parameter structure.
>
> v1->v2:
> Some fixes according to
> https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sZtRnKRu_tnUwqHufqtjvjsv-nz1x...@mail.gmail.com/
> 1. Adjust trace_icmp_send() to cover more protocols than UDP.
> 2. Move the call to trace_icmp_send() after the sanity checks
> in __icmp_send().
>
> Signed-off-by: Peilin He 
> Signed-off-by: xu xin 
> Reviewed-by: Yunkai Zhang 
> Cc: Yang Yang 
> Cc: Liu Chun 
> Cc: Xuexin Jiang 
> ---
>  include/trace/events/icmp.h | 68 +
>  net/ipv4/icmp.c |  4 +++
>  2 files changed, 72 insertions(+)
>  create mode 100644 include/trace/events/icmp.h
>
> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
> new file mode 100644
> index ..cff33aaee44e
> --- /dev/null
> +++ b/include/trace/events/icmp.h
> @@ -0,0 +1,68 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM icmp
> +
> +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_ICMP_H
> +
> +#include 
> +#include 
> +
> +TRACE_EVENT(icmp_send,
> +
> +   TP_PROTO(const struct sk_buff *skb, int type, int code),
> +
> +   TP_ARGS(skb, type, code),
> +
> +   TP_STRUCT__entry(
> +   __field(const void *, skbaddr)
> +   __field(int, type)
> +   __field(int, code)
> +   __array(__u8, saddr, 4)
> +   __array(__u8, daddr, 4)
> +   __field(__u16, sport)
> +   __field(__u16, dport)
> +   __field(unsigned short, ulen)
> +   ),
> +
> +   TP_fast_assign(
> +   struct iphdr *iph = ip_hdr(skb);
> +   int proto_4 = iph->protocol;
> +   __be32 *p32;
> +
> +   __entry->skbaddr = skb;
> +   __entry->type = type;
> +   __entry->code = code;
> +
> +   struct udphdr *uh = udp_hdr(skb);

Please move this line up, perhaps after the "struct iphdr *iph =
ip_hdr(skb);" one.

We group all variable definitions together; we do not mix code and variables.



> +
> +   if (proto_4 != IPPROTO_UDP || (u8 *)uh < skb->head ||
> +   (u8 *)uh + sizeof(struct udphdr)
> +   > skb_tail_pointer(skb)) {
> +   __entry->sport = 0;
> +   __entry->dport = 0;
> +   __entry->ulen = 0;
> +   } else {
> +   
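The quoted hunk is cut short by the archive. For reference, below is a sketch of the rearrangement the review asks for (the udp_hdr() local grouped with the other declarations), with the else branch filled in from the v3 posting quoted later in this digest; an illustration, not the submitted follow-up:

	TP_fast_assign(
		struct iphdr *iph = ip_hdr(skb);
		struct udphdr *uh = udp_hdr(skb);	/* grouped with the other locals */
		int proto_4 = iph->protocol;
		__be32 *p32;

		__entry->skbaddr = skb;
		__entry->type = type;
		__entry->code = code;

		if (proto_4 != IPPROTO_UDP || (u8 *)uh < skb->head ||
		    (u8 *)uh + sizeof(struct udphdr) > skb_tail_pointer(skb)) {
			__entry->sport = 0;
			__entry->dport = 0;
			__entry->ulen = 0;
		} else {
			__entry->sport = ntohs(uh->source);
			__entry->dport = ntohs(uh->dest);
			__entry->ulen = ntohs(uh->len);
		}

		p32 = (__be32 *) __entry->saddr;
		*p32 = iph->saddr;

		p32 = (__be32 *) __entry->daddr;
		*p32 = iph->daddr;
	),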

Re: [PATCH net-next v9 0/7] Implement reset reason mechanism to detect

2024-04-26 Thread Eric Dumazet
On Thu, Apr 25, 2024 at 5:13 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> In production, there are so many cases about why the RST skb is sent but
> we don't have a very convenient/fast method to detect the exact underlying
> reasons.
>
> RST is implemented in two kinds: passive kind (like tcp_v4_send_reset())
> and active kind (like tcp_send_active_reset()). The former can be traced
> carefully 1) in TCP, with the help of drop reasons, which is based on
> Eric's idea[1], 2) in MPTCP, with the help of reset options defined in
> RFC 8684. The latter is relatively independent and should be
> implemented on our own, such as active reset reasons which cannot be
> replaced by skb drop reasons or anything similar.
>
> In this series, I focus on the fundamental implementation, mostly on how
> the rstreason mechanism works, and give the detailed passive part as an
> example, not including the active reset part. In the future, we can go
> further and refine those NOT_SPECIFIED reasons.
>
> Here are some examples when tracing:
> -0   [002] ..s1.  1830.262425: tcp_send_reset: skbaddr=x
> skaddr=x src=x dest=x state=x reason=NOT_SPECIFIED
> -0   [002] ..s1.  1830.262425: tcp_send_reset: skbaddr=x
> skaddr=x src=x dest=x state=x reason=NO_SOCKET
>
> [1]
> Link: 
> https://lore.kernel.org/all/CANn89iJw8x-LqgsWOeJQQvgVg6DnL5aBRLi10QN2WBdr+X4k=w...@mail.gmail.com/

Reviewed-by: Eric Dumazet 

Thanks !



Re: [PATCH net-next v9 7/7] rstreason: make it work in trace world

2024-04-26 Thread Eric Dumazet
On Thu, Apr 25, 2024 at 5:14 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> At last, we should let it work by introducing this reset reason in
> trace world.
>
> One of the possible expected outputs is:
> ... tcp_send_reset: skbaddr=xxx skaddr=xxx src=xxx dest=xxx
> state=TCP_ESTABLISHED reason=NOT_SPECIFIED
>
> Signed-off-by: Jason Xing 
> Reviewed-by: Steven Rostedt (Google) 
> ---

Reviewed-by: Eric Dumazet 



Re: [PATCH net-next v9 6/7] mptcp: introducing a helper into active reset logic

2024-04-26 Thread Eric Dumazet
On Thu, Apr 25, 2024 at 5:14 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> Since we have mapped every mptcp reset reason definition in enum
> sk_rst_reason, introducing a new helper can cover some missing places
> where we have already set the subflow->reset_reason.
>
> Note: using SK_RST_REASON_NOT_SPECIFIED is the same as
> SK_RST_REASON_MPTCP_RST_EUNSPEC. They are both unknown. So we can convert
> it directly.
>
> Suggested-by: Paolo Abeni 
> Signed-off-by: Jason Xing 
> Reviewed-by: Matthieu Baerts (NGI0) 

Reviewed-by: Eric Dumazet 



Re: [PATCH net-next v9 5/7] mptcp: support rstreason for passive reset

2024-04-26 Thread Eric Dumazet
On Thu, Apr 25, 2024 at 5:14 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> It relies on the reset options carried in the skb, as RFC 8684 specifies.
> Reusing this logic saves us much effort. This patch replaces most of the prior
> NOT_SPECIFIED reasons.
>
> Signed-off-by: Jason Xing 
> Reviewed-by: Matthieu Baerts (NGI0) 

Reviewed-by: Eric Dumazet 



Re: [PATCH net-next v9 3/7] rstreason: prepare for active reset

2024-04-26 Thread Eric Dumazet
On Thu, Apr 25, 2024 at 5:14 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> Like what we did for passive reset:
> only pass the possible reset reason in each active reset path.
>
> No functional changes.
>
> Signed-off-by: Jason Xing 
> Acked-by: Matthieu Baerts (NGI0) 

Reviewed-by: Eric Dumazet 



Re: [PATCH net-next v9 2/7] rstreason: prepare for passive reset

2024-04-26 Thread Eric Dumazet
On Thu, Apr 25, 2024 at 5:14 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> Adjust the parameter and support passing reason of reset which
> is for now NOT_SPECIFIED. No functional changes.
>
> Signed-off-by: Jason Xing 
> Acked-by: Matthieu Baerts (NGI0) 

Reviewed-by: Eric Dumazet 



Re: [PATCH net-next v9 1/7] net: introduce rstreason to detect why the RST is sent

2024-04-26 Thread Eric Dumazet
On Thu, Apr 25, 2024 at 5:13 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> Add a new standalone file for easy future extension to support
> both active reset and passive reset in the TCP/DCCP/MPTCP protocols.
>
> This patch only does the preparations for reset reason mechanism,
> nothing else changes.
>
> The reset reasons are divided into three parts:
> 1) reuse drop reasons for passive reset in TCP
> 2) our own independent reasons which aren't relying on other reasons at all
> 3) reuse MP_TCPRST option for MPTCP
>
> The benefits of a standalone reset reason are listed here:
> 1) it can cover more than one case, such as reset reasons in MPTCP,
> active reset reasons.
> 2) people can easily/quickly understand and maintain this mechanism.
> 3) we get unified format of output with prefix stripped.
> 4) more new reset reasons are on the way
> ...
>
> I will implement the basic code of active/passive reset reasons in
> those three protocols, which is not complete at this moment. For the
> passive reset part in TCP, I only introduce the common NO_SOCKET case,
> which can serve as an example.
>
> After this series is applied, it will have the ability to open a new
> gate to let other people contribute more reasons into it :)
>
> Signed-off-by: Jason Xing 
> Acked-by: Matthieu Baerts (NGI0) 

Reviewed-by: Eric Dumazet 
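For orientation, here is a sketch of the three-part layout described in the patch description above, using only reason names that appear in this digest (not the complete merged enum):

enum sk_rst_reason {
	/* part 1: reasons mirrored from skb drop reasons (passive reset in TCP) */
	SK_RST_REASON_NOT_SPECIFIED,
	SK_RST_REASON_NO_SOCKET,
	/* ... */

	/* part 2: independent reasons that rely on nothing else */
	/* ... */

	/* part 3: reasons mirrored from the MPTCP MP_TCPRST option (RFC 8684) */
	SK_RST_REASON_MPTCP_RST_EUNSPEC,
	/* ... */

	SK_RST_REASON_MAX,
};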



Re: [PATCH net-next v6 0/7] Implement reset reason mechanism to detect

2024-04-19 Thread Eric Dumazet
On Fri, Apr 19, 2024 at 9:29 AM Jason Xing  wrote:
>
> On Fri, Apr 19, 2024 at 3:02 PM Eric Dumazet  wrote:
> >
> > On Fri, Apr 19, 2024 at 4:31 AM Jason Xing  
> > wrote:
> > >
> > > On Fri, Apr 19, 2024 at 7:26 AM Jason Xing  
> > > wrote:
> > > >
> > > > > When I said "If you feel the need to put them in a special group, this
> > > > > is fine by me.",
> > > > > this was really about partitioning the existing enum into groups, if
> > > > > you prefer having a group of 'RES reasons'
> > > >
> > > > Are you suggesting copying what we need from enum skb_drop_reason{} to
> > > > enum sk_rst_reason{}? Why not reuse them directly? I have no idea
> > > > what the side effects of the cast conversion itself are.
> > >
> > > Sorry for writing this email. I'm worried my statement was not
> > > that clear, so I wrote one simple snippet to help me explain
> > > well :)
> > >
> > > Allow me to give NO_SOCKET as an example:
> > > diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> > > index e63a3bf99617..2c9f7364de45 100644
> > > --- a/net/ipv4/icmp.c
> > > +++ b/net/ipv4/icmp.c
> > > @@ -767,6 +767,7 @@ void __icmp_send(struct sk_buff *skb_in, int type,
> > > int code, __be32 info,
> > > if (!fl4.saddr)
> > > fl4.saddr = htonl(INADDR_DUMMY);
> > >
> > > +   trace_icmp_send(skb_in, type, code);
> > > icmp_push_reply(sk, &icmp_param, &fl4, &ipc, &rt);
> > >  ende:
> > > ip_rt_put(rt);
> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > index 1e650ec71d2f..d5963831280f 100644
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -2160,6 +2160,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
> > >  {
> > > struct net *net = dev_net(skb->dev);
> > > enum skb_drop_reason drop_reason;
> > > +   enum sk_rst_reason rst_reason;
> > > int sdif = inet_sdif(skb);
> > > int dif = inet_iif(skb);
> > > const struct iphdr *iph;
> > > @@ -2355,7 +2356,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
> > >  bad_packet:
> > > __TCP_INC_STATS(net, TCP_MIB_INERRS);
> > > } else {
> > > -   tcp_v4_send_reset(NULL, skb);
> > > +   rst_reason = RST_REASON_NO_SOCKET;
> > > +   tcp_v4_send_reset(NULL, skb, rst_reason);
> > > }
> > >
> > >  discard_it:
> > >
> > > As you can see, we need to add a new 'rst_reason' variable which
> > > actually is the same as drop reason. They are the same except for the
> > > enum type... Such rst_reasons/drop_reasons are all over the place.
> > >
> > > Eric, if you have a strong preference, I can do it as you said.
> > >
> > > Well, how about explicitly casting them like this, based on the current
> > > series? It looks better and clearer, and is more helpful to people who are
> > > reading the code:
> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > index 461b4d2b7cfe..eb125163d819 100644
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -1936,7 +1936,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff 
> > > *skb)
> > > return 0;
> > >
> > >  reset:
> > > -   tcp_v4_send_reset(rsk, skb, (u32)reason);
> > > +   tcp_v4_send_reset(rsk, skb, (enum sk_rst_reason)reason);
> > >  discard:
> > > kfree_skb_reason(skb, reason);
> > > /* Be careful here. If this function gets more complicated and
> >
> > It makes no sense to declare an enum sk_rst_reason and then convert it
> > to drop_reason
> > or vice versa.
> >
> > Next thing you know, compiler guys will add a new -Woption that will
> > forbid such conversions.
> >
> > Please add to tcp_v4_send_reset() an skb_drop_reason, not a new enum.
>
> Ah... It looks like I didn't make myself clear again. Sorry...
> Actually I wrote this part many times. My conclusion is that it's not
> feasible to do this.
>
> REASONS:
> If we __only__ need to deal with this passive reset in TCP, it's fine.
> We pass a skb_drop_reason which should also be converted to u32 type
> in tcp_v4_send_reset() as you said; it can work. People who use the
> trace will see the reason with the 'SKB_DROP_REASON' prefix stripped.

Re: [PATCH net-next v6 0/7] Implement reset reason mechanism to detect

2024-04-19 Thread Eric Dumazet
On Fri, Apr 19, 2024 at 4:31 AM Jason Xing  wrote:
>
> On Fri, Apr 19, 2024 at 7:26 AM Jason Xing  wrote:
> >
> > > When I said "If you feel the need to put them in a special group, this
> > > is fine by me.",
> > > this was really about partitioning the existing enum into groups, if
> > you prefer having a group of 'RST reasons'
> >
> > Are you suggesting copying what we need from enum skb_drop_reason{} to
> > enum sk_rst_reason{}? Why not reuse them directly? I have no idea
> > what the side effects of the cast conversion itself are.
>
> Sorry for writing this email. I'm worried my statement was not
> that clear, so I wrote one simple snippet to help me explain
> well :)
>
> Allow me to give NO_SOCKET as an example:
> diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> index e63a3bf99617..2c9f7364de45 100644
> --- a/net/ipv4/icmp.c
> +++ b/net/ipv4/icmp.c
> @@ -767,6 +767,7 @@ void __icmp_send(struct sk_buff *skb_in, int type,
> int code, __be32 info,
> if (!fl4.saddr)
> fl4.saddr = htonl(INADDR_DUMMY);
>
> +   trace_icmp_send(skb_in, type, code);
> icmp_push_reply(sk, &icmp_param, &fl4, &ipc, &rt);
>  ende:
> ip_rt_put(rt);
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 1e650ec71d2f..d5963831280f 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -2160,6 +2160,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
>  {
> struct net *net = dev_net(skb->dev);
> enum skb_drop_reason drop_reason;
> +   enum sk_rst_reason rst_reason;
> int sdif = inet_sdif(skb);
> int dif = inet_iif(skb);
> const struct iphdr *iph;
> @@ -2355,7 +2356,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
>  bad_packet:
> __TCP_INC_STATS(net, TCP_MIB_INERRS);
> } else {
> -   tcp_v4_send_reset(NULL, skb);
> +   rst_reason = RST_REASON_NO_SOCKET;
> +   tcp_v4_send_reset(NULL, skb, rst_reason);
> }
>
>  discard_it:
>
> As you can see, we need to add a new 'rst_reason' variable which
> actually is the same as drop reason. They are the same except for the
> enum type... Such rst_reasons/drop_reasons are all over the place.
>
> Eric, if you have a strong preference, I can do it as you said.
>
> Well, how about explicitly casting them like this, based on the current
> series? It looks better and clearer, and is more helpful to people who are
> reading the code:
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 461b4d2b7cfe..eb125163d819 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1936,7 +1936,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
> return 0;
>
>  reset:
> -   tcp_v4_send_reset(rsk, skb, (u32)reason);
> +   tcp_v4_send_reset(rsk, skb, (enum sk_rst_reason)reason);
>  discard:
> kfree_skb_reason(skb, reason);
> /* Be careful here. If this function gets more complicated and

It makes no sense to declare an enum sk_rst_reason and then convert it
to drop_reason
or vice versa.

Next thing you know, compiler guys will add a new -Woption that will
forbid such conversions.

Please add to tcp_v4_send_reset() an skb_drop_reason, not a new enum.

skb_drop_reason are simply values that are later converted to strings...

So: do not declare a new enum.
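A minimal sketch of what that suggestion amounts to (a hypothetical signature change, not a merged patch): thread the existing enum through untouched, so no cast is ever needed:

/* pass the existing drop reason straight through: no new enum, no casts */
static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb,
			      enum skb_drop_reason reason);

/* caller side, e.g. in tcp_v4_rcv(): */
	} else {
		tcp_v4_send_reset(NULL, skb, SKB_DROP_REASON_NO_SOCKET);
	}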



Re: [PATCH net-next v6 0/7] Implement reset reason mechanism to detect

2024-04-18 Thread Eric Dumazet
On Thu, Apr 18, 2024 at 6:24 PM Jason Xing  wrote:
>
> On Thu, Apr 18, 2024 at 11:46 PM Jakub Kicinski  wrote:
> >
> > On Thu, 18 Apr 2024 11:30:02 +0800 Jason Xing wrote:
> > > I'm not sure why the patch series has been changed to 'Changes
> > > Requested', until now I don't think I need to change something.
> > >
> > > Should I repost this series (keeping the v6 tag) and then wait for
> > > more comments?
> >
> > If Eric doesn't like it - it's not getting merged.
>
> I'm not a native English speaker. If I understand correctly, it seems
> that Eric doesn't object to the patch series. Here is the quotation
> [1]:
> "If you feel the need to put them in a special group, this is fine by me."
>
> This rst reason mechanism can cover all the possible reasons for both
> TCP and MPTCP. We don't need to reinvent some definitions of reset
> reasons which are totally the same as drop reasons. Also, we don't
> need to reinvent something to cover MPTCP. If people are willing to
> contribute more rst reasons, they can find a good place.
>
> Reset is one big complicated 'issue' in production. I spent a lot of
> time handling all kinds of reset reasons daily. I'm apparently not the
> only one. I just want to make admins' lives easier, mine included. This
> special/separate reason group is important because we can extend it in
> the future without causing confusion.
>
> I hope it can have a chance to get merged :) Thank you.
>
> [1]: 
> https://lore.kernel.org/all/cann89i+alo_agyc8dr8dkfyi+6wpzcgrogysvgr8frfrvaa...@mail.gmail.com/
>
> Thanks,
> Jason

My objection was these casts between enums. Especially if hiding with (u32)

I see no reason for adding these casts in TCP stack.

When I said "If you feel the need to put them in a special group, this
is fine by me.",
this was really about partitioning the existing enum into groups, if
you prefer having a group of 'RST reasons'
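One way to read "partitioning the existing enum into groups" is sketched below, with hypothetical group-marker names shown only to illustrate the idea:

enum skb_drop_reason {
	/* ... existing drop reasons ... */
	SKB_DROP_REASON_NO_SOCKET,
	/* ... */

	/* a contiguous group of reasons that may also annotate an outgoing RST */
	SKB_DROP_REASON_RST_GROUP_BEGIN,
	SKB_DROP_REASON_RST_MPTCP_EUNSPEC = SKB_DROP_REASON_RST_GROUP_BEGIN,
	/* ... */
	SKB_DROP_REASON_RST_GROUP_END,

	SKB_DROP_REASON_MAX,
};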



Re: [PATCH net-next v6 1/7] net: introduce rstreason to detect why the RST is sent

2024-04-17 Thread Eric Dumazet
On Wed, Apr 17, 2024 at 10:51 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> Add a new standalone file for easy future extension to support
> both active reset and passive reset in the TCP/DCCP/MPTCP protocols.
>
> This patch only does the preparations for reset reason mechanism,
> nothing else changes.
>
> The reset reasons are divided into three parts:
> 1) reuse drop reasons for passive reset in TCP
> 2) reuse MP_TCPRST option for MPTCP
> 3) our own reasons
>
> I will implement the basic code of active/passive reset reasons in
> those three protocols, which is not complete at this moment. But
> it provides a new chance to let other people add more reasons to
> it :)
>
> Signed-off-by: Jason Xing 

My original suggestion was to use normal values in  'enum
skb_drop_reason', even if there was not necessarily a 'drop'
in the common sense.

https://lore.kernel.org/all/CANn89iJw8x-LqgsWOeJQQvgVg6DnL5aBRLi10QN2WBdr+X4k=w...@mail.gmail.com/

This would avoid these ugly casts later; even casting an enum to another
one is not very logical.
Going through a u32 pivot is quite a hack.

If you feel the need to put them in a special group, this is fine by me.



Re: [PATCH net-next v3 2/3] trace: tcp: fully support trace_tcp_send_reset

2024-03-29 Thread Eric Dumazet
On Fri, Mar 29, 2024 at 11:23 AM Jason Xing  wrote:
>
> On Fri, Mar 29, 2024 at 5:07 PM Eric Dumazet  wrote:
> >
> > On Fri, Mar 29, 2024 at 4:43 AM Jason Xing  
> > wrote:
> > >
> > > From: Jason Xing 
> > >
> > > Prior to this patch, what we can see by enabling trace_tcp_send
> > > only happens under two circumstances:
> > > 1) active rst mode
> > > 2) non-active rst mode and based on the full socket
> > >
> > > That means the inconsistency occurs if we use tcpdump and trace
> > > simultaneously to see how rst happens.
> > >
> > > It's necessary that we take other cases into consideration,
> > > say:
> > > 1) time-wait socket
> > > 2) no socket
> > > ...
> > >
> > > By parsing the incoming skb and reversing its 4-tuple, we can
> > > know the exact 'flow', which might not exist.
> > >
> > > Samples after applying this patch:
> > > 1. tcp_send_reset: skbaddr=XXX skaddr=XXX src=ip:port dest=ip:port
> > > state=TCP_ESTABLISHED
> > > 2. tcp_send_reset: skbaddr=000...000 skaddr=XXX src=ip:port dest=ip:port
> > > state=UNKNOWN
> > > Note:
> > > 1) UNKNOWN means we cannot extract the right information from skb.
> > > 2) skbaddr/skaddr could be 0
> > >
> > > Signed-off-by: Jason Xing 
> > > ---
> > >  include/trace/events/tcp.h | 39 --
> > >  net/ipv4/tcp_ipv4.c|  4 ++--
> > >  net/ipv6/tcp_ipv6.c|  3 ++-
> > >  3 files changed, 41 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
> > > index 194425f69642..289438c54227 100644
> > > --- a/include/trace/events/tcp.h
> > > +++ b/include/trace/events/tcp.h
> > > @@ -78,11 +78,46 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb,
> > >   * skb of trace_tcp_send_reset is the skb that caused RST. In case of
> > >   * active reset, skb should be NULL
> > >   */
> > > -DEFINE_EVENT(tcp_event_sk_skb, tcp_send_reset,
> > > +TRACE_EVENT(tcp_send_reset,
> > >
> > > TP_PROTO(const struct sock *sk, const struct sk_buff *skb),
> > >
> > > -   TP_ARGS(sk, skb)
> > > +   TP_ARGS(sk, skb),
> > > +
> > > +   TP_STRUCT__entry(
> > > +   __field(const void *, skbaddr)
> > > +   __field(const void *, skaddr)
> > > +   __field(int, state)
> > > +   __array(__u8, saddr, sizeof(struct sockaddr_in6))
> > > +   __array(__u8, daddr, sizeof(struct sockaddr_in6))
> > > +   ),
> > > +
> > > +   TP_fast_assign(
> > > +   __entry->skbaddr = skb;
> > > +   __entry->skaddr = sk;
> > > +   /* Zero means unknown state. */
> > > +   __entry->state = sk ? sk->sk_state : 0;
> > > +
> > > +   memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
> > > +   memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));
> > > +
> > > +   if (sk && sk_fullsock(sk)) {
> > > +   const struct inet_sock *inet = inet_sk(sk);
> > > +
> > > +   TP_STORE_ADDR_PORTS(__entry, inet, sk);
> > > +   } else {
> >
> > To be on the safe side, I would test if (skb) here.
> > We have one caller with skb == NULL, we might have more in the future.
>
> Thanks for the review.
>
> How about changing '} else {' to '} else if (skb) {'? Then if we don't go
> into this else-if branch, we will print nothing, right? I'll test it
> in this case.

Right, the fields are cleared before this else

+   memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
+   memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));

>
> >
> > > +   /*
> > > +* We should reverse the 4-tuple of skb, so later
> > > +* it can print the right flow direction of rst.
> > > +*/
> > > +   TP_STORE_ADDR_PORTS_SKB(skb, entry->daddr, entry->saddr);
> > > +   }
> > > +   ),
> > > +
> > > +   TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s",
> > > + __entry->skbaddr, __entry->skaddr,
> > > + __entry->saddr, __entry->daddr,
> > > + __entry->state ? show_tcp_state_name(__entry->state) : "UNKNOWN")
> > >  );
> > >
> > >  /*
> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > index a22ee5838751..d5c4a969c066 100644
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -868,10 +868,10 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
> > >  */
> > > if (sk) {
> > > arg.bound_dev_if = sk->sk_bound_dev_if;
> > > -   if (sk_fullsock(sk))
> > > -   trace_tcp_send_reset(sk, skb);
> > > }
> >
> > Remove the { } ?
>
> Yes, I forgot to remove them.

No problem.
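Putting the two remarks together, the agreed shape of the assignment block is roughly the following (a sketch based on the quoted v3 diff, relying on the fields having been zeroed by the memsets above):

	if (sk && sk_fullsock(sk)) {
		const struct inet_sock *inet = inet_sk(sk);

		TP_STORE_ADDR_PORTS(__entry, inet, sk);
	} else if (skb) {
		/* reverse the skb 4-tuple so the printed RST flow direction
		 * is right; with skb == NULL (active reset callers) the
		 * zeroed fields are simply kept.
		 */
		TP_STORE_ADDR_PORTS_SKB(skb, entry->daddr, entry->saddr);
	}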



Re: [PATCH net-next v3 3/3] tcp: add location into reset trace process

2024-03-29 Thread Eric Dumazet
On Fri, Mar 29, 2024 at 4:43 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> In addition to knowing the 4-tuple of the flow which generates the RST,
> the reason why it does so is very important, because we have some
> cases where the RST should be sent but we have no clue which one
> exactly.
>
> Adding the location of the reset process can help us more, like what
> trace_kfree_skb does.

Well, I would prefer a drop_reason here, even if there is no 'dropped' packet.

This would be more stable than something based on function names that
could be changed.

tracepoints do not have to get ugly, we can easily get stack traces if needed.

perf record -a -g  -e tcp:tcp_send_reset ...



Re: [PATCH net-next v3 2/3] trace: tcp: fully support trace_tcp_send_reset

2024-03-29 Thread Eric Dumazet
On Fri, Mar 29, 2024 at 4:43 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> Prior to this patch, what we can see by enabling trace_tcp_send
> only happens under two circumstances:
> 1) active rst mode
> 2) non-active rst mode and based on the full socket
>
> That means the inconsistency occurs if we use tcpdump and trace
> simultaneously to see how rst happens.
>
> It's necessary that we take other cases into consideration,
> say:
> 1) time-wait socket
> 2) no socket
> ...
>
> By parsing the incoming skb and reversing its 4-tuple, we can
> know the exact 'flow', which might not exist.
>
> Samples after applying this patch:
> 1. tcp_send_reset: skbaddr=XXX skaddr=XXX src=ip:port dest=ip:port
> state=TCP_ESTABLISHED
> 2. tcp_send_reset: skbaddr=000...000 skaddr=XXX src=ip:port dest=ip:port
> state=UNKNOWN
> Note:
> 1) UNKNOWN means we cannot extract the right information from skb.
> 2) skbaddr/skaddr could be 0
>
> Signed-off-by: Jason Xing 
> ---
>  include/trace/events/tcp.h | 39 --
>  net/ipv4/tcp_ipv4.c|  4 ++--
>  net/ipv6/tcp_ipv6.c|  3 ++-
>  3 files changed, 41 insertions(+), 5 deletions(-)
>
> diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
> index 194425f69642..289438c54227 100644
> --- a/include/trace/events/tcp.h
> +++ b/include/trace/events/tcp.h
> @@ -78,11 +78,46 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb,
>   * skb of trace_tcp_send_reset is the skb that caused RST. In case of
>   * active reset, skb should be NULL
>   */
> -DEFINE_EVENT(tcp_event_sk_skb, tcp_send_reset,
> +TRACE_EVENT(tcp_send_reset,
>
> TP_PROTO(const struct sock *sk, const struct sk_buff *skb),
>
> -   TP_ARGS(sk, skb)
> +   TP_ARGS(sk, skb),
> +
> +   TP_STRUCT__entry(
> +   __field(const void *, skbaddr)
> +   __field(const void *, skaddr)
> +   __field(int, state)
> +   __array(__u8, saddr, sizeof(struct sockaddr_in6))
> +   __array(__u8, daddr, sizeof(struct sockaddr_in6))
> +   ),
> +
> +   TP_fast_assign(
> +   __entry->skbaddr = skb;
> +   __entry->skaddr = sk;
> +   /* Zero means unknown state. */
> +   __entry->state = sk ? sk->sk_state : 0;
> +
> +   memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
> +   memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));
> +
> +   if (sk && sk_fullsock(sk)) {
> +   const struct inet_sock *inet = inet_sk(sk);
> +
> +   TP_STORE_ADDR_PORTS(__entry, inet, sk);
> +   } else {

To be on the safe side, I would test if (skb) here.
We have one caller with skb == NULL, we might have more in the future.

> +   /*
> +* We should reverse the 4-tuple of skb, so later
> +* it can print the right flow direction of rst.
> +*/
> +   TP_STORE_ADDR_PORTS_SKB(skb, entry->daddr, entry->saddr);
> +   }
> +   ),
> +
> +   TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s",
> + __entry->skbaddr, __entry->skaddr,
> + __entry->saddr, __entry->daddr,
> + __entry->state ? show_tcp_state_name(__entry->state) : "UNKNOWN")
>  );
>
>  /*
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index a22ee5838751..d5c4a969c066 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -868,10 +868,10 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
>  */
> if (sk) {
> arg.bound_dev_if = sk->sk_bound_dev_if;
> -   if (sk_fullsock(sk))
> -   trace_tcp_send_reset(sk, skb);
> }

Remove the { } ?


>
> +   trace_tcp_send_reset(sk, skb);
> +
> BUILD_BUG_ON(offsetof(struct sock, sk_bound_dev_if) !=
>  offsetof(struct inet_timewait_sock, tw_bound_dev_if));
>
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 3f4cba49e9ee..8e9c59b6c00c 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -1113,7 +1113,6 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb)
> if (sk) {
> oif = sk->sk_bound_dev_if;
> if (sk_fullsock(sk)) {
> -   trace_tcp_send_reset(sk, skb);
> if (inet6_test_bit(REPFLOW, sk))
> label = ip6_flowlabel(ipv6h);
> priority = READ_ONCE(sk->sk_priority);
> @@ -1129,6 +1128,8 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb)
> label = ip6_flowlabel(ipv6h);
> }
>
> +   trace_tcp_send_reset(sk, skb);
> +
> tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, 1,
>

Re: [PATCH net-next v3 1/3] trace: adjust TP_STORE_ADDR_PORTS_SKB() parameters

2024-03-29 Thread Eric Dumazet
On Fri, Mar 29, 2024 at 4:43 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> Introducing entry_saddr and entry_daddr parameters in this macro
> for later use can help us record the reverse 4-tuple by analyzing
> the 4-tuple of the incoming skb when receiving.
>
> Signed-off-by: Jason Xing 

Reviewed-by: Eric Dumazet 



Re: [PATCH net-next 0/3] trace: use TP_STORE_ADDRS macro

2024-03-26 Thread Eric Dumazet
On Tue, Mar 26, 2024 at 11:44 AM Jason Xing  wrote:

> Well, it's a pity that it seems that we are about to abandon this
> method but it's not that friendly to the users who are unable to
> deploy BPF...

It is a pity these tracepoint patches are consuming a lot of reviewer
time, just because
some people 'can not deploy BPF'

> Well, I came up with more ideas about how to improve the
> trace function in recent days. The motivation for doing this is that I
> encountered some issues which could be traced/diagnosed by using trace
> effortlessly, without writing bpftrace code again and again. The
> status of trace seems inactive, but many people are still using it, I
> believe.

'Writing bpftrace code again and again' is not a good reason to add
maintenance costs
to the Linux networking stack.



Re: [PATCH v3 resend] net/ipv4: add tracepoint for icmp_send

2024-03-21 Thread Eric Dumazet
On Thu, Mar 21, 2024 at 4:09 AM  wrote:
>
> From: he peilin 
>
> Introduce a tracepoint for icmp_send, which can help users get more
> detailed information conveniently when abnormal ICMP events happen.
>
> 1. Giving a use-case example:
> =
> When an application experiences packet loss due to an unreachable UDP
> destination port, the kernel will send an exception message through the
> icmp_send function. By adding a trace point for icmp_send, developers or
> system administrators can obtain detailed information about the UDP
> packet loss, including the type, code, source address, destination address,
> source port, and destination port. This facilitates the trouble-shooting
> of UDP packet loss issues especially for those network-service
> applications.
>
> 2. Operation Instructions:
> ==
> Switch to the tracing directory.
> cd /sys/kernel/tracing
> Filter for destination port unreachable.
> echo "type==3 && code==3" > events/icmp/icmp_send/filter
> Enable trace event.
> echo 1 > events/icmp/icmp_send/enable
>
> 3. Result View:
> 
>  udp_client_erro-11370   [002] ...s.12   124.728002:
>  icmp_send: icmp_send: type=3, code=3.
>  From 127.0.0.1:41895 to 127.0.0.1: ulen=23
>  skbaddr=589b167a
>
> Changelog
> -
> v2->v3:
> Some fixes according to
> https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
> 1. Change the tracing directory to /sys/kernel/tracing.
> 2. Adjust the layout of the TP_STRUCT__entry parameter structure.
>
> v1->v2:
> Some fixes according to
> https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sztrnkru_tnuwqhufqtjvjsv-nz1x...@mail.gmail.com/
> 1. Adjust trace_icmp_send() to cover more protocols than UDP.
> 2. Move the call to trace_icmp_send() after the sanity checks
> in __icmp_send().
>
> Signed-off-by: Peilin He
> Reviewed-by: xu xin 
> Reviewed-by: Yunkai Zhang 
> Cc: Yang Yang 
> Cc: Liu Chun 
> Cc: Xuexin Jiang 
> ---
>  include/trace/events/icmp.h | 64 +
>  net/ipv4/icmp.c |  4 +++
>  2 files changed, 68 insertions(+)
>  create mode 100644 include/trace/events/icmp.h
>
> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
> new file mode 100644
> index ..2098d4b1b12e
> --- /dev/null
> +++ b/include/trace/events/icmp.h
> @@ -0,0 +1,64 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM icmp
> +
> +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_ICMP_H
> +
> +#include 
> +#include 
> +
> +TRACE_EVENT(icmp_send,
> +
> +   TP_PROTO(const struct sk_buff *skb, int type, int code),
> +
> +   TP_ARGS(skb, type, code),
> +
> +   TP_STRUCT__entry(
> +   __field(const void *, skbaddr)
> +   __field(int, type)
> +   __field(int, code)
> +   __array(__u8, saddr, 4)
> +   __array(__u8, daddr, 4)
> +   __field(__u16, sport)
> +   __field(__u16, dport)
> +   __field(unsigned short, ulen)
> +   ),
> +
> +   TP_fast_assign(
> +   struct iphdr *iph = ip_hdr(skb);
> +   int proto_4 = iph->protocol;
> +   __be32 *p32;
> +
> +   __entry->skbaddr = skb;
> +   __entry->type = type;
> +   __entry->code = code;
> +
> +   if (proto_4 == IPPROTO_UDP) {
> +   struct udphdr *uh = udp_hdr(skb);
> +   __entry->sport = ntohs(uh->source);
> +   __entry->dport = ntohs(uh->dest);
> +   __entry->ulen = ntohs(uh->len);

This is completely bogus.

Adding tracepoints is ok if there are no side effects like bugs :/

At this point there is no guarantee the UDP header is complete/present
in skb->head

Look at the existing checks between lines 619 and 623

Then audit all icmp_send() callers, and ask yourself whether UDP packets
cannot be malicious (like with a truncated UDP header).
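For reference, the existing check in __icmp_send() that the review points at looks like this (it validates that the IP header actually sits inside skb->head); a UDP-aware tracepoint needs the equivalent test on the UDP header before dereferencing it:

	/* from net/ipv4/icmp.c, __icmp_send() */
	if ((u8 *)iph < skb_in->head ||
	    (u8 *)(iph + 1) > skb_tail_pointer(skb_in))
		goto out;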



Re: [PATCH net-next v2 2/2] tcp: add tracing of skbaddr in tcp_event_skb class

2024-03-07 Thread Eric Dumazet
On Mon, Mar 4, 2024 at 10:29 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> Use the existing parameter and print the address of skbaddr
> as other trace functions do.
>
> Signed-off-by: Jason Xing 

Reviewed-by: Eric Dumazet 



Re: [PATCH net-next v2 1/2] tcp: add tracing of skb/skaddr in tcp_event_sk_skb class

2024-03-07 Thread Eric Dumazet
On Mon, Mar 4, 2024 at 10:29 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> Printing the addresses can help us identify the exact skb/sk
> for those systems in which it's not that easy to run a BPF program.
> As we can see, it already fetches those, so use them directly,
> and it will print like below:
>
> ...tcp_retransmit_skb: skbaddr=XXX skaddr=XXX family=AF_INET...
>
> Signed-off-by: Jason Xing 

Reviewed-by: Eric Dumazet 



Re: [PATCH net-next 1/2] tcp: add tracing of skb/skaddr in tcp_event_sk_skb class

2024-03-04 Thread Eric Dumazet
On Thu, Feb 29, 2024 at 6:10 PM Jason Xing  wrote:
>
> From: Jason Xing 
>
> Prior to this patch, the trace function doesn't print the addresses,
> which might have been forgotten. As we can see, it already fetches
> those, so use them directly, and it will print like below:
>
> ...tcp_retransmit_skb: skbaddr=XXX skaddr=XXX family=AF_INET...

It was not forgotten, but a recommendation from Alexei

https://yhbt.net/lore/all/20171010173821.6djxyvrggvaiv...@ast-mbp.dhcp.thefacebook.com/



Re: [PATCH net-next 0/2] add two missing addresses when using trace

2024-03-04 Thread Eric Dumazet
On Mon, Mar 4, 2024 at 8:01 AM Jason Xing  wrote:
>
> On Fri, Mar 1, 2024 at 1:10 AM Jason Xing  wrote:
> >
> > From: Jason Xing 
> >
> > When I reviewed other people's patch [1], I noticed that a similar thing
> > also happens in the tcp_event_skb and tcp_event_sk_skb classes. They
> > don't print the two addrs of skb/sk which are already available.
> >
> > They were probably forgotten by the original authors, so this time I
> > finished the work. Also, adding more tracing of the socket/skb addr can
> > help us at times, even though the chance is minor.
> >
> > I don't consider it is a bug, thus I chose to target net-next tree.
>
> Gentle ping...No rush. Just in case this simple patchset was lost for
> some reason.

This was a conscious choice I think.

https://yhbt.net/lore/all/20171010173821.6djxyvrggvaiv...@ast-mbp.dhcp.thefacebook.com/



Re: [PATCH net-next v2] tcp: Add skb addr and sock addr to arguments of tracepoint tcp_probe.

2024-03-04 Thread Eric Dumazet
On Mon, Mar 4, 2024 at 4:46 AM fuyuanli  wrote:
>
> It is useful to expose the skb addr and sock addr to users in tracepoint
> tcp_probe, so that we can get more information while monitoring the
> receiving of tcp data, by ebpf or other ways.
>
> For example, we need to identify a packet by seq and end_seq when
> calculate transmit latency between lay 2 and lay 4 by ebpf, but which is

Please use "layer 2 and layer 4".

> not available in tcp_probe, so we can only use kprobe hooking
> tcp_rcv_esatblised to get them. But we can use tcp_probe directly if skb
> addr and sock addr are available, which is more efficient.

Okay, but please fix the typo. Correct function name is tcp_rcv_established

>
> Signed-off-by: fuyuanli 
> Link: 
> https://lore.kernel.org/netdev/20240229052813.GA23899@didi-ThinkCentre-M920t-N000/
>



Re: [PATCH] net/ipv4: add tracepoint for icmp_send

2024-02-26 Thread Eric Dumazet
On Tue, Feb 27, 2024 at 3:50 AM  wrote:
>
> From: xu xin 
>
> Introduce a tracepoint for icmp_send, which can help users get more
> detailed information conveniently when abnormal ICMP events happen.
>
> 1. Giving a use-case example:
> =
> When an application experiences packet loss due to an unreachable UDP
> destination port, the kernel will send an exception message through the
> icmp_send function. By adding a trace point for icmp_send, developers or
> system administrators can obtain the detailed information easily about the
> UDP packet loss, including the type, code, source address, destination
> address, source port, and destination port. This facilitates the
> trouble-shooting of packet loss issues especially for those complicated
> network-service applications.
>
> 2. Operation Instructions:
> ==
> Switch to the tracing directory.
> cd /sys/kernel/debug/tracing
> Filter for destination port unreachable.
> echo "type==3 && code==3" > events/icmp/icmp_send/filter
> Enable trace event.
> echo 1 > events/icmp/icmp_send/enable
>
> 3. Result View:
> 
>  udp_client_erro-11370   [002] ...s.12   124.728002:  icmp_send:
> icmp_send: type=3, code=3.From 127.0.0.1:41895 to 127.0.0.1: ulen=23
> skbaddr=589b167a
>
> Signed-off-by: He Peilin 
> Reviewed-by: xu xin 
> Reviewed-by: Yunkai Zhang 
> Cc: Yang Yang 
> Cc: Liu Chun 
> Cc: Xuexin Jiang 
> ---
>  include/trace/events/icmp.h | 57 +
>  net/ipv4/icmp.c |  4 
>  2 files changed, 61 insertions(+)
>  create mode 100644 include/trace/events/icmp.h
>
> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
> new file mode 100644
> index ..3d9af5769bc3
> --- /dev/null
> +++ b/include/trace/events/icmp.h
> @@ -0,0 +1,57 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM icmp
> +
> +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_ICMP_H
> +
> +#include 
> +#include 
> +
> +TRACE_EVENT(icmp_send,
> +
> +   TP_PROTO(const struct sk_buff *skb, int type, int code),
> +
> +   TP_ARGS(skb, type, code),
> +
> +   TP_STRUCT__entry(
> +   __field(__u16, sport)
> +   __field(__u16, dport)
> +   __field(unsigned short, ulen)
> +   __field(const void *, skbaddr)
> +   __field(int, type)
> +   __field(int, code)
> +   __array(__u8, saddr, 4)
> +   __array(__u8, daddr, 4)
> +   ),
> +
> +   TP_fast_assign(
> +   // Get UDP header
> +   struct udphdr *uh = udp_hdr(skb);
> +   struct iphdr *iph = ip_hdr(skb);
> +   __be32 *p32;
> +
> +   __entry->sport = ntohs(uh->source);
> +   __entry->dport = ntohs(uh->dest);
> +   __entry->ulen = ntohs(uh->len);
> +   __entry->skbaddr = skb;
> +   __entry->type = type;
> +   __entry->code = code;
> +
> +   p32 = (__be32 *) __entry->saddr;
> +   *p32 = iph->saddr;
> +
> +   p32 = (__be32 *) __entry->daddr;
> +   *p32 =  iph->daddr;
> +   ),
> +

FYI, ICMP can be generated for many other protocols than UDP.

> +   TP_printk("icmp_send: type=%d, code=%d. From %pI4:%u to 
> %pI4:%u ulen=%d skbaddr=%p",
> +   __entry->type, __entry->code,
> +   __entry->saddr, __entry->sport, __entry->daddr,
> +   __entry->dport, __entry->ulen, __entry->skbaddr)
> +);
> +
> +#endif /* _TRACE_ICMP_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
> diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> index e63a3bf99617..437bdb7e2650 100644
> --- a/net/ipv4/icmp.c
> +++ b/net/ipv4/icmp.c
> @@ -92,6 +92,8 @@
>  #include 
>  #include 
>  #include 
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/icmp.h>
>
>  /*
>   * Build xmit assembly blocks
> @@ -599,6 +601,8 @@ void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
> struct net *net;
> struct sock *sk;
>
> +   trace_icmp_send(skb_in, type, code);

I think you missed many sanity checks between lines 622 and 676

Honestly, a kprobe BPF based solution would be less risky, and less
maintenance for us.
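A minimal libbpf-style sketch of the kprobe alternative mentioned here, assuming __icmp_send() is visible to kprobes (illustrative only; the attach point and the printed fields are an assumption, not merged code):

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("kprobe/__icmp_send")
int BPF_KPROBE(icmp_send_probe, struct sk_buff *skb_in, int type, int code)
{
	/* same information the proposed tracepoint would expose */
	bpf_printk("icmp_send: type=%d code=%d", type, code);
	return 0;
}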



Re: [PATCH net-next 1/2] net/vsockmon: Leverage core stats allocator

2024-02-23 Thread Eric Dumazet
On Fri, Feb 23, 2024 at 12:58 PM Breno Leitao  wrote:
>
> With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
> convert veth & vrf"), stats allocation could be done on net core
> instead of this driver.
>
> With this new approach, the driver doesn't have to bother with error
> handling (allocation failure checking, making sure free happens in the
> right spot, etc). This is core responsibility now.
>
> Remove the allocation in the vsockmon driver and leverage the network
> core allocation instead.
>
> Signed-off-by: Breno Leitao 

Reviewed-by: Eric Dumazet 



Re: [PATCH net-next 2/2] net/vsockmon: Do not set zeroed statistics

2024-02-23 Thread Eric Dumazet
On Fri, Feb 23, 2024 at 12:58 PM Breno Leitao  wrote:
>
> Do not set rtnl_link_stats64 fields to zero, since they are zeroed
> before ops->ndo_get_stats64 is called in core dev_get_stats() function.
>
> Signed-off-by: Breno Leitao 

Reviewed-by: Eric Dumazet 



Re: [PATCH net-next v3] net: dqs: add NIC stall detector based on BQL

2024-02-14 Thread Eric Dumazet
On Wed, Feb 14, 2024 at 5:49 PM Breno Leitao  wrote:
>
> On Wed, Feb 14, 2024 at 04:41:36PM +0100, Eric Dumazet wrote:
> > On Wed, Feb 14, 2024 at 3:45 PM Breno Leitao  wrote:
> > >
> > > On Tue, Feb 13, 2024 at 10:04:57AM -0800, Jakub Kicinski wrote:
> > > > On Tue, 13 Feb 2024 14:57:49 +0100 Eric Dumazet wrote:
> > > > > Please note that adding other sysfs entries is expensive for workloads
> > > > > creating/deleting netdev and netns often.
> > > > >
> > > > > I _think_ we should find a way for not creating
> > > > > /sys/class/net//queues/tx-{Q}/byte_queue_limits  directory
> > > > > and files
> > > > > for non BQL enabled devices (like loopback !)
> > > >
> > > > We should try, see if anyone screams. We could use IFF_NO_QUEUE, and
> > > > NETIF_F_LLTX as a proxy for "device doesn't have a real queue so BQL
> > > > would be pointless"? Obviously better to annotate the drivers which
> > > > do have BQL support, but there's >50 of them on a quick count..
> > >
> > > Let me make sure I understand the suggestion above. We want to disable
> > > BQL completely for devices that has dev->features & NETIF_F_LLTX or
> > > dev->priv_flags & IFF_NO_QUEUE, right?
> > >
> > > Maybe we can add a ->enabled field in struct dql, and set it according
> > > to the features above. Then we can created the sysfs and process the dql
> > > operations based on that field. This should avoid some unnecessary calls
> > > also, if we are not display sysfs.
> > >
> > > Here is a very simple PoC to represent what I had in mind. Am I in the
> > > right direction?
> >
> > No, this was really about sysfs entries (aka dql_group)
> >
> > Partial patch would be:
>
> That is simpler than what I imagined. Thanks!
>

>
> for netdev_uses_bql(), would it be similar to what I proposed in the
> previous message? Let me copy it here.
>
> static bool netdev_uses_bql(struct net_device *dev)
> {
>if (dev->features & NETIF_F_LLTX ||
>dev->priv_flags & IFF_NO_QUEUE)
>return false;
>
>return true;
> }

I think this should be fine, yes.



Re: [PATCH net-next v3] net: dqs: add NIC stall detector based on BQL

2024-02-14 Thread Eric Dumazet
On Wed, Feb 14, 2024 at 3:45 PM Breno Leitao  wrote:
>
> On Tue, Feb 13, 2024 at 10:04:57AM -0800, Jakub Kicinski wrote:
> > On Tue, 13 Feb 2024 14:57:49 +0100 Eric Dumazet wrote:
> > > Please note that adding other sysfs entries is expensive for workloads
> > > creating/deleting netdev and netns often.
> > >
> > > I _think_ we should find a way for not creating
> > > /sys/class/net//queues/tx-{Q}/byte_queue_limits  directory
> > > and files
> > > for non BQL enabled devices (like loopback !)
> >
> > We should try, see if anyone screams. We could use IFF_NO_QUEUE, and
> > NETIF_F_LLTX as a proxy for "device doesn't have a real queue so BQL
> > would be pointless"? Obviously better to annotate the drivers which
> > do have BQL support, but there's >50 of them on a quick count..
>
> Let me make sure I understand the suggestion above. We want to disable
> BQL completely for devices that has dev->features & NETIF_F_LLTX or
> dev->priv_flags & IFF_NO_QUEUE, right?
>
> Maybe we can add a ->enabled field in struct dql, and set it according
> to the features above. Then we can created the sysfs and process the dql
> operations based on that field. This should avoid some unnecessary calls
> also, if we are not display sysfs.
>
> Here is a very simple PoC to represent what I had in mind. Am I in the
> right direction?

No, this was really about sysfs entries (aka dql_group)

Partial patch would be:

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index a09d507c5b03d24a829bf7af0b7cf1e6a0bdb65a..094e3b2d78cca40d810b2fa3bd4393d22b30e6ad 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1709,9 +1709,11 @@ static int netdev_queue_add_kobject(struct net_device *dev, int index)
 		goto err;
 
 #ifdef CONFIG_BQL
-	error = sysfs_create_group(kobj, &dql_group);
-	if (error)
-		goto err;
+	if (netdev_uses_bql(dev)) {
+		error = sysfs_create_group(kobj, &dql_group);
+		if (error)
+			goto err;
+	}
 #endif
 
 	kobject_uevent(kobj, KOBJ_ADD);
@@ -1734,7 +1736,8 @@ static int tx_queue_change_owner(struct net_device *ndev, int index,
 		return error;
 
 #ifdef CONFIG_BQL
-	error = sysfs_group_change_owner(kobj, &dql_group, kuid, kgid);
+	if (netdev_uses_bql(ndev))
+		error = sysfs_group_change_owner(kobj, &dql_group, kuid, kgid);
 #endif
 	return error;
 }
@@ -1768,7 +1771,8 @@ netdev_queue_update_kobjects(struct net_device *dev, int old_num, int new_num)
 		if (!refcount_read(&dev_net(dev)->ns.count))
 			queue->kobj.uevent_suppress = 1;
 #ifdef CONFIG_BQL
-		sysfs_remove_group(&queue->kobj, &dql_group);
+		if (netdev_uses_bql(dev))
+			sysfs_remove_group(&queue->kobj, &dql_group);
 #endif
 		kobject_put(&queue->kobj);
 	}



Re: [PATCH net-next v3] net: dqs: add NIC stall detector based on BQL

2024-02-13 Thread Eric Dumazet
On Fri, Feb 2, 2024 at 5:55 PM Breno Leitao  wrote:
>
> From: Jakub Kicinski 
>
> softnet_data->time_squeeze is sometimes used as a proxy for
> host overload or indication of scheduling problems. In practice
> this statistic is very noisy and has hard to grasp units -
> e.g. is 10 squeezes a second to be expected, or high?
>
> Delaying network (NAPI) processing leads to drops on NIC queues
> but also RTT bloat, impacting pacing and CA decisions.
> Stalls are a little hard to detect on the Rx side, because
> there may simply have not been any packets received in given
> period of time. Packet timestamps help a little bit, but
> again we don't know if packets are stale because we're
> not keeping up or because someone (*cough* cgroups)
> disabled IRQs for a long time.

Please note that adding other sysfs entries is expensive for workloads
creating/deleting netdev and netns often.

I _think_ we should find a way for not creating
/sys/class/net//queues/tx-{Q}/byte_queue_limits  directory
and files
for non BQL enabled devices (like loopback !)



Re: [PATCH net-next] tcp: add tracepoints for data send/recv/acked

2023-12-05 Thread Eric Dumazet
On Tue, Dec 5, 2023 at 3:11 AM Xuan Zhuo  wrote:
>
> On Mon, 4 Dec 2023 13:28:21 +0100, Eric Dumazet  wrote:
> > On Mon, Dec 4, 2023 at 12:43 PM Philo Lu  wrote:
> > >
> > > Add 3 tracepoints, namely tcp_data_send/tcp_data_recv/tcp_data_acked,
> > > which will be called every time a tcp data packet is sent, received, and
> > > acked.
> > > tcp_data_send: called after a data packet is sent.
> > > tcp_data_recv: called after a data packet is received.
> > > tcp_data_acked: called after a valid ack packet is processed (some sent
> > > data are acknowledged).
> > >
> > > We use these callbacks for fine-grained tcp monitoring, which collects
> > > and analyses every tcp request/response event information. The whole
> > > system has been described in SIGMOD'18 (see
> > > https://dl.acm.org/doi/pdf/10.1145/3183713.3190659 for details). To
> > > achieve this with bpf, we require hooks for data events that call bpf
> > > prog (1) when any data packet is sent/received/acked, and (2) after
> > > critical tcp state variables have been updated (e.g., snd_una, snd_nxt,
> > > rcv_nxt). However, existing bpf hooks cannot meet our requirements.
> > > Besides, these tracepoints help to debug tcp when data send/recv/acked.
> >
> > This I do not understand.
> >
> > >
> > > Though kretprobe/fexit can also be used to collect these information,
> > > they will not work if the kernel functions get inlined. Considering the
> > > stability, we prefer tracepoint as the solution.
> >
> > I dunno, this seems quite weak to me. I see many patches coming to add
> > tracing in the stack, but no patches fixing any issues.
>
>
> We have implemented a mechanism to split the request and response from the TCP
> connection using these "hookers", which can handle various protocols such as
> HTTP, HTTPS, Redis, and MySQL. This mechanism allows us to record important
> information about each request and response, including the amount of data
> uploaded, the time taken by the server to handle the request, and the time 
> taken
> for the client to receive the response. This mechanism has been running
> internally for many years and has proven to be very useful.
>
> One of the main benefits of this mechanism is that it helps in locating the
> source of any issues or problems that may arise. For example, if there is a
> problem with the network, the application, or the machine, we can use this
> mechanism to identify and isolate the issue.
>
> TCP has long been a challenge when it comes to tracking the transmission of 
> data
> on the network. The application can only confirm that it has sent a certain
> amount of data to the kernel, but it has limited visibility into whether the
> client has actually received this data. Our mechanism addresses this issue by
> providing insights into the amount of data received by the client and the time
> it was received. Furthermore, we can also detect any packet loss or delays
> caused by the server.
>
> https://help-static-aliyun-doc.aliyuncs.com/assets/img/zh-CN/7912288961/9732df025beny.svg
>
> So, we do not want to add some tracepoints for unknown debugging.
> We have a clear goal; debugging is just an incidental capability.
>

We have powerful mechanisms in the stack already that ordinary (no
privilege requested) applications can readily use.

We have been using them for a while.

If existing mechanisms are missing something you need, please expand them.

For reference, start looking at tcp_get_timestamping_opt_stats() history.

Sender side can for instance get precise timestamps.

Combinations of these timestamps reveal different parts of the overall
network latency,

T0: sendmsg() enters TCP
T1: first byte enters qdisc
T2: first byte sent to the NIC
T3: first byte ACKed in TCP
T4: last byte sent to the NIC
T5: last byte ACKed
T1 - T0: how long the first byte was blocked in the TCP layer ("Head
of Line Blocking" latency).
T2 - T1: how long the first byte was blocked in the Linux traffic
shaping layer (known as QDisc).
T3 - T2: the network ‘distance’ (propagation delay + current queuing
delay along the network path and at the receiver).
T5 - T2: how fast the sent chunk was delivered.
Message Size / (T5 - T0): goodput (from application’s perspective)



Re: [PATCH net-next] tcp: add tracepoints for data send/recv/acked

2023-12-04 Thread Eric Dumazet
On Mon, Dec 4, 2023 at 12:43 PM Philo Lu  wrote:
>
> Add 3 tracepoints, namely tcp_data_send/tcp_data_recv/tcp_data_acked,
> which will be called every time a tcp data packet is sent, received, and
> acked.
> tcp_data_send: called after a data packet is sent.
> tcp_data_recv: called after a data packet is received.
> tcp_data_acked: called after a valid ack packet is processed (some sent
> data are acknowledged).
>
> We use these callbacks for fine-grained tcp monitoring, which collects
> and analyses every tcp request/response event information. The whole
> system has been described in SIGMOD'18 (see
> https://dl.acm.org/doi/pdf/10.1145/3183713.3190659 for details). To
> achieve this with bpf, we require hooks for data events that call bpf
> prog (1) when any data packet is sent/received/acked, and (2) after
> critical tcp state variables have been updated (e.g., snd_una, snd_nxt,
> rcv_nxt). However, existing bpf hooks cannot meet our requirements.
> Besides, these tracepoints help to debug tcp when data send/recv/acked.

This I do not understand.

>
> Though kretprobe/fexit can also be used to collect these information,
> they will not work if the kernel functions get inlined. Considering the
> stability, we prefer tracepoint as the solution.

I dunno, this seems quite weak to me. I see many patches coming to add
tracing in the stack, but no patches fixing any issues.

It really looks like: we do not know how the TCP stack works, we do not
know if there is any issue,
so let us add trace points to help us make forward progress in our analysis.

These tracepoints will not tell how many segments/bytes were
sent/acked/received; I really do not see
how we will avoid adding more stuff in the future, forcing the
compiler to save more state
just in case the tracepoint needs the info.

The argument of "add minimal info", so that we can silently add more
stuff in the future "for free" is not something I buy.

I very much prefer that you make sure the stuff you need is not
inlined, so that standard kprobe/kretprobe facility can be used.



Re: [PATCH net-next v7] net/core: Introduce netdev_core_stats_inc()

2023-10-09 Thread Eric Dumazet
On Mon, Oct 9, 2023 at 11:43 AM Yajun Deng  wrote:
>
>
> On 2023/10/9 17:30, Eric Dumazet wrote:
> > On Mon, Oct 9, 2023 at 10:36 AM Yajun Deng  wrote:
> >>
> >> On 2023/10/9 16:20, Eric Dumazet wrote:
> >>> On Mon, Oct 9, 2023 at 10:14 AM Yajun Deng  wrote:
> >>>> On 2023/10/9 15:53, Eric Dumazet wrote:
> >>>>> On Mon, Oct 9, 2023 at 5:07 AM Yajun Deng  wrote:
> >>>>>
> >>>>>> 'this_cpu_read + this_cpu_write' and 'pr_info + this_cpu_inc' will make
> >>>>>> the trace work well.
> >>>>>>
> >>>>>> They all have 'pop' instructions in them. This may be the key to making
> >>>>>> the trace work well.
> >>>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> I need your help on percpu and ftrace.
> >>>>>>
> >>>>> I do not think you made sure netdev_core_stats_inc() was never inlined.
> >>>>>
> >>>>> Adding more code in it is simply changing how the compiler decides to
> >>>>> inline or not.
> >>>> Yes, you are right. It needs to add the 'noinline' prefix. The
> >>>> disassembly code will have 'pop'
> >>>>
> >>>> instruction.
> >>>>
> >>> The function was fine, you do not need anything like push or pop.
> >>>
> >>> The only needed stuff was the call __fentry__.
> >>>
> >>> The fact that the function was inlined for some invocations was the
> >>> issue, because the trace point
> >>> is only planted in the out of line function.
> >>
> >> But somehow the following code isn't inline? They didn't need to add the
> >> 'noinline' prefix.
> >>
> >> +   field = (unsigned long *)((void *)this_cpu_ptr(p) + offset);
> >> +   WRITE_ONCE(*field, READ_ONCE(*field) + 1);
> >>
> >> Or
> >> +   (*(unsigned long *)((void *)this_cpu_ptr(p) + offset))++;
> >>
> > I think you are very confused.
> >
> > You only want to trace netdev_core_stats_inc() entry point, not
> > arbitrary pieces of it.
>
>
> Yes, I will trace netdev_core_stats_inc() entry point. I mean to replace
>
> +   field = (__force unsigned long __percpu *)((__force void *)p + offset);
> +   this_cpu_inc(*field);
>
> with
>
> +   field = (unsigned long *)((void *)this_cpu_ptr(p) + offset);
> +   WRITE_ONCE(*field, READ_ONCE(*field) + 1);
>
> Or
> +   (*(unsigned long *)((void *)this_cpu_ptr(p) + offset))++;
>
> The netdev_core_stats_inc() entry point will work fine even if it doesn't
> have 'noinline' prefix.
>
> I don't know why this code needs to add 'noinline' prefix.
> +   field = (__force unsigned long __percpu *)((__force void *)p + offset);
> +   this_cpu_inc(*field);
>

C compiler decides to inline or not, depending on various factors.

The most efficient (and small) code is generated by this_cpu_inc()
version, allowing the compiler to inline it.

If you copy/paste this_cpu_inc() twenty times, then the compiler
would not inline the function anymore.
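For reference, a minimal sketch of the pattern being discussed,
assuming a per-cpu stats structure addressed by a byte offset (the
helper name is illustrative; the cast and increment mirror the snippet
quoted above):

#include <linux/percpu.h>

static void core_stats_inc(void __percpu *p, unsigned long offset)
{
	unsigned long __percpu *field;

	/* this_cpu_inc() is a single instruction on x86, which is why
	 * the compiler happily inlines this everywhere unless told
	 * otherwise. */
	field = (__force unsigned long __percpu *)((__force void *)p + offset);
	this_cpu_inc(*field);
}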



Re: [PATCH net-next v7] net/core: Introduce netdev_core_stats_inc()

2023-10-09 Thread Eric Dumazet
On Mon, Oct 9, 2023 at 10:36 AM Yajun Deng  wrote:
>
>
> On 2023/10/9 16:20, Eric Dumazet wrote:
> > On Mon, Oct 9, 2023 at 10:14 AM Yajun Deng  wrote:
> >>
> >> On 2023/10/9 15:53, Eric Dumazet wrote:
> >>> On Mon, Oct 9, 2023 at 5:07 AM Yajun Deng  wrote:
> >>>
> >>>> 'this_cpu_read + this_cpu_write' and 'pr_info + this_cpu_inc' will make
> >>>> the trace work well.
> >>>>
> >>>> They all have 'pop' instructions in them. This may be the key to making
> >>>> the trace work well.
> >>>>
> >>>> Hi all,
> >>>>
> >>>> I need your help on percpu and ftrace.
> >>>>
> >>> I do not think you made sure netdev_core_stats_inc() was never inlined.
> >>>
> >>> Adding more code in it is simply changing how the compiler decides to
> >>> inline or not.
> >>
> >> Yes, you are right. It needs to add the 'noinline' prefix. The
> >> disassembly code will have 'pop'
> >>
> >> instruction.
> >>
> > The function was fine, you do not need anything like push or pop.
> >
> > The only needed stuff was the call __fentry__.
> >
> > The fact that the function was inlined for some invocations was the
> > issue, because the trace point
> > is only planted in the out of line function.
>
>
> But somehow the following code isn't inline? They didn't need to add the
> 'noinline' prefix.
>
> +   field = (unsigned long *)((void *)this_cpu_ptr(p) + offset);
> +   WRITE_ONCE(*field, READ_ONCE(*field) + 1);
>
> Or
> +   (*(unsigned long *)((void *)this_cpu_ptr(p) + offset))++;
>

I think you are very confused.

You only want to trace netdev_core_stats_inc() entry point, not
arbitrary pieces of it.



Re: [PATCH net-next v7] net/core: Introduce netdev_core_stats_inc()

2023-10-09 Thread Eric Dumazet
On Mon, Oct 9, 2023 at 10:14 AM Yajun Deng  wrote:
>
>
> On 2023/10/9 15:53, Eric Dumazet wrote:
> > On Mon, Oct 9, 2023 at 5:07 AM Yajun Deng  wrote:
> >
> >> 'this_cpu_read + this_cpu_write' and 'pr_info + this_cpu_inc' will make
> >> the trace work well.
> >>
> >> They all have 'pop' instructions in them. This may be the key to making
> >> the trace work well.
> >>
> >> Hi all,
> >>
> >> I need your help on percpu and ftrace.
> >>
> > I do not think you made sure netdev_core_stats_inc() was never inlined.
> >
> > Adding more code in it is simply changing how the compiler decides to
> > inline or not.
>
>
> Yes, you are right. It needs to add the 'noinline' prefix. The
> disassembly code will have 'pop'
>
> instruction.
>

The function was fine, you do not need anything like push or pop.

The only needed stuff was the call __fentry__.

The fact that the function was inlined for some invocations was the
issue, because the trace point
is only planted in the out of line function.



Re: [PATCH net-next v7] net/core: Introduce netdev_core_stats_inc()

2023-10-09 Thread Eric Dumazet
On Mon, Oct 9, 2023 at 5:07 AM Yajun Deng  wrote:

> 'this_cpu_read + this_cpu_write' and 'pr_info + this_cpu_inc' will make
> the trace work well.
>
> They all have 'pop' instructions in them. This may be the key to making
> the trace work well.
>
> Hi all,
>
> I need your help on percpu and ftrace.
>

I do not think you made sure netdev_core_stats_inc() was never inlined.

Adding more code in it is simply changing how the compiler decides to
inline or not.



Re: [PATCH v3 bpf-next 00/11] Socket migration for SO_REUSEPORT.

2021-04-20 Thread Eric Dumazet



On 4/20/21 5:41 PM, Kuniyuki Iwashima wrote:
> The SO_REUSEPORT option allows sockets to listen on the same port and to
> accept connections evenly. However, there is a defect in the current
> implementation [1]. When a SYN packet is received, the connection is tied
> to a listening socket. Accordingly, when the listener is closed, in-flight
> requests during the three-way handshake and child sockets in the accept
> queue are dropped even if other listeners on the same port could accept
> such connections.
> 
> This situation can happen when various server management tools restart
> server (such as nginx) processes. For instance, when we change nginx
> configurations and restart it, it spins up new workers that respect the new
> configuration and closes all listeners on the old workers, resulting in
> the in-flight ACKs of the 3WHS being answered with RSTs.
> 
> The SO_REUSEPORT option is excellent to improve scalability.

This was before the SYN processing was made lockless.

I really wonder if we still need SO_REUSEPORT for TCP?

Eventually a new accept() system call, where different threads
can express how they want to choose the child sockets, would
be less invasive.

Instead of having many listeners, have one listener and eventually multiple
accept queues to improve scalability of accept() phase.
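For context, the per-worker listener pattern under discussion looks
roughly like this in user space (minimal sketch, error handling
omitted):

#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Each worker opens its own listener on the same port; the kernel
 * load-balances incoming connections across the group. */
static int make_reuseport_listener(uint16_t port)
{
	struct sockaddr_in addr;
	int one = 1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(port);
	bind(fd, (struct sockaddr *)&addr, sizeof(addr));
	listen(fd, 128);
	return fd;
}

Closing one of these listeners while SYNs are in flight is exactly the
window Kuniyuki's series addresses.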



Re: BUG: KASAN: use-after-free in page_to_skb.isra.0+0x300/0x418

2021-04-20 Thread Eric Dumazet
On Tue, Apr 20, 2021 at 3:45 PM Naresh Kamboju
 wrote:
>
> Following kernel BUG reported on qemu-arm64 running linux next 20210420
> the config is enabled with KASAN.
>
> steps to reproduce:
> 
> - Build the arm64 kernel with KASAN enabled.
> - boot it with the command below and you will notice the crash
>  /usr/bin/qemu-system-aarch64 -cpu host -machine virt,accel=kvm
> -nographic -net nic,model=virtio,macaddr=BA:DD:AD:CC:09:10 -net tap -m
> 1024 -monitor none -kernel kernel/Image.gz --append "console=ttyAMA0
> root=/dev/vda rw" -hda
> rootfs/rpb-console-image-lkft-juno-20210414125244-133.rootfs.ext4 -m
> 4096 -smp 4 -nographic
>
>
> crash log:
> -
>

This is the fifth report; have you tried the proposed fix?

https://patchwork.kernel.org/project/netdevbpf/patch/20210420094341.3259328-1-eric.duma...@gmail.com/


Re: [PATCH] tcp: prevent loss of bytes when using syncookie

2021-04-20 Thread Eric Dumazet
On Tue, Apr 20, 2021 at 2:00 PM zhaoya  wrote:
>
> When using syncookies, since $MSSID is spliced into the cookie and
> the legal indexes of msstab are 0, 1, 2 and 3, this gives the client 3
> bytes of freedom, resulting in at most 3 bytes of silent loss.
>
> C ---seq=12345---------------------> S
> C <--seq=cookie/ack=12346----------- S    S generated the cookie
>                                           [RFC 4987 Appendix A]
> C ---seq=123456/ack=cookie+1---->X        The first byte was lost.
> C ---seq=123457/ack=cookie+1-------> S    The second byte was received,
>                                           the cookie check still passed and
>                                           the handshake was finished.
> C <--seq=.../ack=12348-------------- S    S acknowledges the second byte.
> Here is a POC:
>
> $ cat poc.c
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <pthread.h>
> #include <sys/types.h>
> #include <sys/socket.h>
> #include <netinet/in.h>
> #include <arpa/inet.h>
>
> void *serverfunc(void *arg)
> {
> int sd = -1;
> int csd = -1;
> struct sockaddr_in servaddr, cliaddr;
> socklen_t len = sizeof(cliaddr);
>
> sd = socket(AF_INET, SOCK_STREAM, 0);
> servaddr.sin_family = AF_INET;
> servaddr.sin_addr.s_addr = htonl(INADDR_ANY);
> servaddr.sin_port = htons(1234);
> bind(sd, (struct sockaddr *)&servaddr, sizeof(servaddr));
> listen(sd, 1);
>
> while (1) {
> char buf[2];
> int ret;
> csd = accept(sd, (struct sockaddr *)&cliaddr, &len);
> memset(buf, 0, 2);
> ret = recv(csd, buf, 1, 0);
> // but unexpected char is 'b'
> if (ret && strncmp(buf, "a", 1)) {
> printf("unexpected:%s\n", buf);
> close(csd);
> exit(0);
> }
> close(csd);
> }
> }
>
> void *connectfunc(void *arg)
> {
> struct sockaddr_in addr;
> int sd;
> int i;
>
> for (i = 0; i < 500; i++) {
> sd = socket(AF_INET, SOCK_STREAM, 0);
> addr.sin_family = AF_INET;
> addr.sin_addr.s_addr = inet_addr("127.0.0.1");
> addr.sin_port = htons(1234);
>
> connect(sd, (struct sockaddr *)&addr, sizeof(addr));
>
> send(sd, "a", 1, 0); // expected char is 'a'
> send(sd, "b", 1, 0);
> close(sd);
> }
> return NULL;
> }
>
> int main(int argc, char *argv[])
> {
> int i;
> pthread_t id;
>
> pthread_create(&id, NULL, serverfunc, NULL);
> sleep(1);
> for (i = 0; i < 500; i++) {
> pthread_create(&id, NULL, connectfunc, NULL);
> }
> sleep(5);
> }
>
> $ sudo gcc poc.c -lpthread
> $ sudo sysctl -w net.ipv4.tcp_syncookies=1
> $ sudo ./a.out # please try as many times.
>

The POC is distracting; you could instead give a link to
https://kognitio.com/blog/syn-cookies-ate-my-dog-breaking-tcp-on-linux/
where all the details can be found.

That also gives credit to past work.

If you feel a program needs to be written and recorded for posterity,
add a selftest. (in tools/testing/selftests/net )

> This problem can be fixed by having $SSEQ itself participate in the
> hash operation.
>
> The feature is protected by a sysctl so that admins can choose
> the preferred behavior.
>
> Signed-off-by: zhaoya 
> ---
>  Documentation/networking/ip-sysctl.rst |  9 
>  include/linux/netfilter_ipv6.h | 24 +---
>  include/net/netns/ipv4.h   |  1 +
>  include/net/tcp.h  | 20 ++---
>  net/core/filter.c  |  6 +++--
>  net/ipv4/syncookies.c  | 31 --
>  net/ipv4/sysctl_net_ipv4.c |  7 ++
>  net/ipv4/tcp_ipv4.c|  4 +++-
>  net/ipv6/syncookies.c  | 29 +++-
>  net/ipv6/tcp_ipv6.c|  3 ++-
>  net/netfilter/nf_synproxy_core.c   | 10 +
>  11 files changed, 97 insertions(+), 47 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> index c7952ac5bd2f..a430e8736c2b 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -732,6 +732,15 @@ tcp_syncookies - INTEGER
> network connections you can set this knob to 2 to enable
> unconditionally generation of syncookies.
>
> +tcp_syncookies_enorder - INTEGER

enorder? What does it mean?

> +   Only valid when the kernel was compiled with CONFIG_SYN_COOKIES
> +   Prevents the silent loss of at most 3 bytes sent by the client when
> +   syncookies are generated to complete the TCP handshake.
> +   Default: 0
> +
> +   Note that if it is enabled, any out-of-order bytes sent by the client
> +   to complete the TCP handshake can get the session killed.

Technically 

Re: [PATCH] net: called rtnl_unlock() before runpm resumes devices

2021-04-20 Thread Eric Dumazet
On Tue, Apr 20, 2021 at 9:54 AM AceLan Kao  wrote:
>
> From: "Chia-Lin Kao (AceLan)" 
>
> rtnl_lock() has already been taken in rtnetlink_rcv_msg(), and then in
> __dev_open() it calls pm_runtime_resume() to resume devices; in some
> devices' resume functions (igb_resume, igc_resume) rtnl_lock() is called
> again. That leads to a recursive lock.
>
> It should be left to each device's resume function to decide whether it
> needs to call rtnl_lock()/rtnl_unlock(), so call rtnl_unlock() before
> calling pm_runtime_resume() and call rtnl_lock() after it in __dev_open().
>
>

Hi Acelan

When was the bug added?
Please add a Fixes: tag.

By doing so, you give reviewers a better chance to understand why the
fix is not risky, and you help the stable teams' work.

Thanks.

> Signed-off-by: Chia-Lin Kao (AceLan) 
> ---
>  net/core/dev.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 1f79b9aa9a3f..427cbc80d1e5 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1537,8 +1537,11 @@ static int __dev_open(struct net_device *dev, struct netlink_ext_ack *extack)
>
> if (!netif_device_present(dev)) {
> /* may be detached because parent is runtime-suspended */
> -   if (dev->dev.parent)
> +   if (dev->dev.parent) {
> +   rtnl_unlock();
> pm_runtime_resume(dev->dev.parent);
> +   rtnl_lock();
> +   }
> if (!netif_device_present(dev))
> return -ENODEV;
> }
> --
> 2.25.1
>


Re: [PATCH net] gro: fix napi_gro_frags() Fast GRO breakage due to IP alignment check

2021-04-19 Thread Eric Dumazet
On Sun, Apr 18, 2021 at 1:43 PM Alexander Lobakin  wrote:
>
> Commit 38ec4944b593 ("gro: ensure frag0 meets IP header alignment")
> did the right thing, but missed the fact that napi_gro_frags() logic
> calls skb_gro_reset_offset() *before* pulling the Ethernet header
> into the skb linear space.
> That said, the introduced check for the frag0 address being aligned to 4
> always fails there, as the Ethernet header is 14 bytes long, so
> with NET_IP_ALIGN its start is not aligned to 4.
>
> Fix this by adding @nhoff argument to skb_gro_reset_offset() which
> tells if an IP header is placed right at the start of frag0 or not.
> This restores Fast GRO for napi_gro_frags() that became very slow
> after the mentioned commit, and preserves the introduced check to
> avoid silent unaligned accesses.
>
> Fixes: 38ec4944b593 ("gro: ensure frag0 meets IP header alignment")
> Signed-off-by: Alexander Lobakin 
> ---
>  net/core/dev.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 1f79b9aa9a3f..965d5f9b6fee 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5914,7 +5914,7 @@ static struct list_head *gro_list_prepare(struct napi_struct *napi,
> return head;
>  }
>
> -static void skb_gro_reset_offset(struct sk_buff *skb)
> +static void skb_gro_reset_offset(struct sk_buff *skb, u32 nhoff)
>  {
> const struct skb_shared_info *pinfo = skb_shinfo(skb);
> const skb_frag_t *frag0 = >frags[0];
> @@ -5925,7 +5925,7 @@ static void skb_gro_reset_offset(struct sk_buff *skb)
>
> if (!skb_headlen(skb) && pinfo->nr_frags &&
> !PageHighMem(skb_frag_page(frag0)) &&
> -   (!NET_IP_ALIGN || !(skb_frag_off(frag0) & 3))) {
> +   (!NET_IP_ALIGN || !((skb_frag_off(frag0) + nhoff) & 3))) {
> NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
> NAPI_GRO_CB(skb)->frag0_len = min_t(unsigned int,
> skb_frag_size(frag0),
> @@ -6143,7 +6143,7 @@ gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
> skb_mark_napi_id(skb, napi);
> trace_napi_gro_receive_entry(skb);
>
> -   skb_gro_reset_offset(skb);
> +   skb_gro_reset_offset(skb, 0);
>
> ret = napi_skb_finish(napi, skb, dev_gro_receive(napi, skb));
> trace_napi_gro_receive_exit(ret);
> @@ -6232,7 +6232,7 @@ static struct sk_buff *napi_frags_skb(struct napi_struct *napi)
> napi->skb = NULL;
>
> skb_reset_mac_header(skb);
> -   skb_gro_reset_offset(skb);
> +   skb_gro_reset_offset(skb, hlen);
>
> if (unlikely(skb_gro_header_hard(skb, hlen))) {
> eth = skb_gro_header_slow(skb, hlen, 0);
> --
> 2.31.1
>
>

Good catch, thanks.

This has the unfortunate effect of increasing code length on x86;
maybe we should make an exception to the normal rules so that
skb_gro_reset_offset() is inlined.

Reviewed-by: Eric Dumazet 
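For illustration, the exception could be as simple as forcing the
helper inline; a sketch only, not the committed fix:

/* __always_inline overrides the compiler's size heuristics; kernel
 * style reserves it for hot paths like this one. */
static __always_inline void skb_gro_reset_offset(struct sk_buff *skb, u32 nhoff)
{
	/* body unchanged from the patch above */
}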


Re: PROBLEM: DoS Attack on Fragment Cache

2021-04-19 Thread Eric Dumazet
On Sun, Apr 18, 2021 at 4:31 PM Matt Corallo
 wrote:
>
> Should the default, though, be so low? If someone is still using an old
> modem they can crank up the sysctl; it does seem like such things are
> pretty rare these days :). It's rather trivial to, without any kind of
> attack, hit 1Mbps of lost fragments in today's networks, at which point
> all fragments are dropped. After all, I submitted the patch to "scratch
> my own itch" :).

Again, even if you increase the values by 1000x, it is trivial for an
attacker to use all the memory you allowed.

And allowing a significant portion of memory to be eaten like that
might cause OOM on hosts where jobs are consuming all physical memory.

It is a sysctl; I changed things so that one can really reserve/use
16GB of memory if they are desperate about frags.

>
> Matt
>
> On 4/18/21 00:39, Willy Tarreau wrote:
> > I do agree that we shouldn't keep them that long nowadays, we can't go
> > too low without risking to break some slow transmission stacks (SLIP/PPP
> > over modems for example).
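For admins who do want to trade memory for reassembly robustness, the
thresholds Eric refers to can be raised at runtime; a small user-space
sketch via procfs (the 256 MiB value is purely illustrative):

#include <stdio.h>

int main(void)
{
	/* IPv6 has the same knob: /proc/sys/net/ipv6/ip6frag_high_thresh */
	FILE *f = fopen("/proc/sys/net/ipv4/ipfrag_high_thresh", "w");

	if (!f)
		return 1;
	fprintf(f, "%ld\n", 256L << 20);	/* 256 MiB ceiling */
	return fclose(f) ? 1 : 0;
}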


Re: [PATCH] x86/uaccess: small optimization in unsafe_copy_to_user()

2021-04-17 Thread Eric Dumazet
On Sat, Apr 17, 2021 at 6:03 PM Linus Torvalds
 wrote:
>
> On Fri, Apr 16, 2021 at 12:24 PM Eric Dumazet  wrote:
> >
> > From: Eric Dumazet 
> >
> > We have to loop only to copy u64 values.
> > After this first loop, we copy at most one u32, one u16 and one byte.
>
> As Al mentioned, at least in trivial cases the compiler understands
> that the subsequent loops can only be executed once, because earlier
> loops made sure that 'len' is always smaller than 2*size.
>
> But apparently the problem is the slightly more complex cases where
> the compiler just messes up and loses sight of that. Oh well.
>
> So the patch looks fine to me.
>
> HOWEVER.
>
> Looking at the put_cmsg() case in net-next, I'm very *VERY* unhappy
> about how you use those "unsafe" user access functions.
>
> Why? Because the point of the "unsafe" is to be a big red flag and
> make sure you are very VERY careful with it.
>
> And that code isn't.
>
> In particular, what if the "int len" argument is negative? Maybe it
> cannot happen, but when it comes to things like those unsafe user
> accessors, I really really want to see that all the arguments are
> *checked*.
>
> And you don't.
>
> You do
>
> if (!user_write_access_begin(cm, cmlen))
>
> ahead of time, and that will do basic range checking, but "cmlen" is
>
> sizeof(struct cmsghdr) + (len))
>
> so it's entirely possible that "cmlen" has a valid value, but "len"
> (and thus "cmlen - sizeof(*cm)", which is the length argument to the
> unsafe user copy) might be negative and that is never checked.
>
> End result: I want people to be a LOT more careful when they use those
> unsafe user space accessors. You need to make it REALLY REALLY obvious
> that everything you do is safe. And it's not at all obvious in the
> context of put_cmsg() - the safety currently relies on every single
> caller getting it right.

I thought put_cmsg() callers were from the kernel, with no possibility
for the user to abuse this interface by trying to push GBs of data.

Which would be terrible even if we 'fix' possible overflows.

Maybe I was wrong.


>
> So either fix "len" to be some restricted type (ie "unsigned short"),
> or make really really sure that "len" is valid (ie never negative, and
> the cmlen addition didn't overflow.
>
> Really. The "unsafe" user accesses are named that way very explicitly,
> and for a very very good reason: the safety needs to be guaranteed and
> obvious within the context of those accesses. Not within some "oh,
> nobody will ever call this with a negative argument" garbage bullshit.
>
>   Linus
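A sketch of the kind of explicit validation Linus is asking for, with
hypothetical names (this is not the actual put_cmsg() fix):

#include <linux/overflow.h>
#include <linux/socket.h>
#include <linux/uaccess.h>

/* Reject negative lengths and wrapping additions before any unsafe
 * user access is attempted. */
static int checked_put(struct cmsghdr __user *cm, const void *data, int len)
{
	size_t cmlen;

	if (len < 0 || check_add_overflow(sizeof(*cm), (size_t)len, &cmlen))
		return -EINVAL;
	if (!user_write_access_begin(cm, cmlen))
		return -EFAULT;
	unsafe_copy_to_user((void __user *)(cm + 1), data, len, efault_end);
	user_write_access_end();
	return 0;
efault_end:
	user_write_access_end();
	return -EFAULT;
}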


Re: [External] Re: [PATCH] tcp: fix silent loss when syncookie is triggered

2021-04-16 Thread Eric Dumazet
On Sat, Apr 17, 2021 at 12:45 AM 赵亚  wrote:
>
> On Fri, Apr 16, 2021 at 7:52 PM Eric Dumazet  wrote:
> >
> > On Fri, Apr 16, 2021 at 12:52 PM zhaoya  wrote:
> > >
> > > When syncookies are triggered, since $MSSID is spliced into the cookie
> > > and the legal indexes of msstab are 0, 1, 2 and 3, this gives the client
> > > 3 bytes of freedom, resulting in at most 3 bytes of silent loss.
> > >
> > > C ---seq=12345---------------------> S
> > > C <--seq=cookie/ack=12346----------- S    S generated the cookie
> > >                                           [RFC 4987 Appendix A]
> > > C ---seq=123456/ack=cookie+1---->X        The first byte was lost.
> > > C ---seq=123457/ack=cookie+1-------> S    The second byte was received,
> > >                                           the cookie check still passed and
> > >                                           the handshake was finished.
> > > C <--seq=.../ack=12348-------------- S    S acknowledges the second byte.
> >
> >
> > I think this has been discussed in the past :
> > https://kognitio.com/blog/syn-cookies-ate-my-dog-breaking-tcp-on-linux/
> >
> > If I remember well, this can not be fixed "easily"
> >
> > I suspect you are trading one minor issue with another (which is
> > considered more practical these days)
> > Have you tried what happens if the server receives an out-of-order
> > packet after the SYN & SYN-ACK ?
> > The answer is : RST packet is sent, killing the session.
> >
> > That is the reason why sseq is not part of the hash key.
>
> Yes, I've tested this scenario. More sessions do get reset.
>
> If a client got an RST, it knew the session failed, which was clear.
> However, if the client sends a character and it is acknowledged, but the
> server did not receive it, this causes confusion.
> >
> > In practice, secure connexions are using a setup phase where more than
> > 3 bytes are sent in the first packet.
> > We recommend using secure protocols over TCP. (prefer HTTPS over HTTP,
> > SSL over plaintext)
>
> Yes, I agree with you. But practice rests on principle, and
> syncookies break the semantics of TCP.
> >
> > Your change would severely impair servers under DDOS ability to really
> > establish flows.
>
> Could you give me more details?
> >
> > Now, if your patch is protected by a sysctl so that admins can choose
> > the preferred behavior, then why not...
>
> The sysctl in the POC is just for triggering problems easily.
>
> So the question is, when syncookie is triggered, which is more important,
> the practice or the principle?

SYNCOOKIES have lots of known limitations.

You can disable them if you need.

Or you can add a sysctl or socket options so that each listener can
decide what they want.

I gave feedback on why your initial patch was _not_ good.

I think it can render a server under DDOS absolutely unusable:
exactly the same situation as _without_ syncookies being used.
We do not want to go back to the situation we had before SYNCOOKIES
were invented.

I think you should have put a big warning in the changelog to explain
that you fully understood
the risks.

We prefer having servers that can still be useful, especially ones
serving 100% HTTPS traffic.

Thank you.


Re: [PATCH net] net/core/dev.c: Ensure pfmemalloc skbs are correctly handled when receiving

2021-04-16 Thread Eric Dumazet
On Sat, Apr 17, 2021 at 2:08 AM Xie He  wrote:
>
> When an skb is allocated by "__netdev_alloc_skb" in "net/core/skbuff.c",
> if "sk_memalloc_socks()" is true, and if there's not sufficient memory,
> the skb would be allocated using emergency memory reserves. This kind of
> skbs are called pfmemalloc skbs.
>
> pfmemalloc skbs must be specially handled in "net/core/dev.c" when
> receiving. They must NOT be delivered to the target protocol if
> "skb_pfmemalloc_protocol(skb)" is false.
>
> However, if, after a pfmemalloc skb is allocated and before it reaches
> the code in "__netif_receive_skb", "sk_memalloc_socks()" becomes false,
> then the skb will be handled by "__netif_receive_skb" as a normal skb.
> This causes the skb to be delivered to the target protocol even if
> "skb_pfmemalloc_protocol(skb)" is false.
>
> This patch fixes this problem by ensuring all pfmemalloc skbs are handled
> by "__netif_receive_skb" as pfmemalloc skbs.
>
> "__netif_receive_skb_list" has the same problem as "__netif_receive_skb".
> This patch also fixes it.
>
> Fixes: b4b9e3558508 ("netvm: set PF_MEMALLOC as appropriate during SKB processing")
> Cc: Mel Gorman 
> Cc: David S. Miller 
> Cc: Neil Brown 
> Cc: Peter Zijlstra 
> Cc: Jiri Slaby 
> Cc: Mike Christie 
> Cc: Eric B Munson 
> Cc: Eric Dumazet 
> Cc: Sebastian Andrzej Siewior 
> Cc: Christoph Lameter 
> Cc: Andrew Morton 
> Signed-off-by: Xie He 
> ---
>  net/core/dev.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 1f79b9aa9a3f..3e6b7879daef 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5479,7 +5479,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>  {
> int ret;
>
> -   if (sk_memalloc_socks() && skb_pfmemalloc(skb)) {
> +   if (skb_pfmemalloc(skb)) {
> unsigned int noreclaim_flag;
>
> /*
> @@ -5507,7 +5507,7 @@ static void __netif_receive_skb_list(struct list_head *head)
> bool pfmemalloc = false; /* Is current sublist PF_MEMALLOC? */
>
> list_for_each_entry_safe(skb, next, head, list) {
> -   if ((sk_memalloc_socks() && skb_pfmemalloc(skb)) != pfmemalloc) {
> +   if (skb_pfmemalloc(skb) != pfmemalloc) {
> struct list_head sublist;
>
> /* Handle the previous sublist */
> --
> 2.27.0
>

The race window is considered small enough that we prefer the
code as it is.

The reason we prefer the current code is that sk_memalloc_socks()
is implemented with a static key.

Trading a minor race for extra cycles on every received packet is a
serious concern.

What matters is a persistent condition that would _deplete_ memory,
not for a dozen packets, but thousands. Can you demonstrate such an
issue?
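For reference, the static-key construction is roughly this (a
simplified sketch of how sk_memalloc_socks() is built; see
include/net/sock.h for the real thing):

#include <linux/jump_label.h>

DEFINE_STATIC_KEY_FALSE(memalloc_socks_key);

/* Compiles to a patched no-op jump while no SOCK_MEMALLOC socket
 * exists, so the common receive path pays nothing for the check. */
static inline bool sk_memalloc_socks(void)
{
	return static_branch_unlikely(&memalloc_socks_key);
}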


Re: PROBLEM: DoS Attack on Fragment Cache

2021-04-16 Thread Eric Dumazet
On Sat, Apr 17, 2021 at 2:31 AM David Ahern  wrote:
>
> [ cc author of 648700f76b03b7e8149d13cc2bdb3355035258a9 ]



I think this has been discussed already. There is no strategy that
makes IP reassembly units immune to DDOS attacks.

We added an rb-tree and sysctls to let admins choose to use GBs of RAM
if they really care.



>
> On 4/16/21 3:58 PM, Keyu Man wrote:
> > Hi,
> >
> >
> >
> > My name is Keyu Man. We are a group of researchers from University
> > of California, Riverside. Zhiyun Qian is my advisor. We found the code
> > in processing IPv4/IPv6 fragments will potentially lead to DoS Attacks.
> > Specifically, after the latest kernel receives an IPv4 fragment, it will
> > try to fit it into a queue by calling function
> >
> >
> >
> > struct inet_frag_queue *inet_frag_find(struct fqdir *fqdir, void
> > *key) in net/ipv4/inet_fragment.c.
> >
> >
> >
> > However, this function will first check whether the memory used by
> > existing fragments exceeds fqdir->high_thresh. If it does, the fragment
> > is dropped regardless of whether it belongs to a new or an existing queue.
> >
> > Chances are that an attacker can fill the cache with fragments that
> > will never be assembled (i.e., only sends the first fragment with new
> > IPIDs every time) to exceed the threshold so that all future incoming
> > fragmented IPv4 traffic would be blocked and dropped. Since there is no
> > GC mechanism, the victim host has to wait 30s, until the fragments have
> > expired, to continue receiving incoming fragments normally.
> >
> > In practice, given the 4MB fragment cache, the attacker only needs
> > to send 1766 fragments to exhaust the cache and DoS the victim for 30s,
> > whose cost is pretty low. Besides, IPv6 would also be affected since the
> > issue resides in inet part.
> >
> > This issue is introduced in commit
> > 648700f76b03b7e8149d13cc2bdb3355035258a9 (inet: frags: use rhashtables
> > for reassembly units) which removes fqdir->low_thresh, and GC worker as
> > well. We would gently request to bring GC worker back to the kernel to
> > prevent the DoS attacks.
> >
> > Looking forward to hearing from you.
> >
> >
> >
> > Thanks,
> >
> > Keyu Man
> >
>


Re: [PATCH] x86/uaccess: small optimization in unsafe_copy_to_user()

2021-04-16 Thread Eric Dumazet
On Fri, Apr 16, 2021 at 10:11 PM Eric Dumazet  wrote:
>
> On Fri, Apr 16, 2021 at 9:44 PM Al Viro  wrote:
> >
> > On Fri, Apr 16, 2021 at 12:24:13PM -0700, Eric Dumazet wrote:
> > > From: Eric Dumazet 
> > >
> > > We have to loop only to copy u64 values.
> > > After this first loop, we copy at most one u32, one u16 and one byte.
> >
> > Does it actually yield a better code?
> >
>
> Yes, my patch gives better code on an actual kernel use-case
>
> (net-next tree, look at put_cmsg())
>
> 5ca: 48 89 0f              mov    %rcx,(%rdi)
> 5cd: 89 77 08              mov    %esi,0x8(%rdi)
> 5d0: 89 57 0c              mov    %edx,0xc(%rdi)
> 5d3: 48 83 c7 10           add    $0x10,%rdi
> 5d7: 48 83 c1 f0           add    $0xfff0,%rcx
> 5db: 48 83 f9 07           cmp    $0x7,%rcx
> 5df: 76 40                 jbe    621
> 5e1: 66 66 66 66 66 66 2e  data16 data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
> 5e8: 0f 1f 84 00 00 00 00
> 5ef: 00
> 5f0: 49 8b 10              mov    (%r8),%rdx
> 5f3: 48 89 17              mov    %rdx,(%rdi)
> 5f6: 48 83 c7 08           add    $0x8,%rdi
> 5fa: 49 83 c0 08           add    $0x8,%r8
> 5fe: 48 83 c1 f8           add    $0xfff8,%rcx
> 602: 48 83 f9 07           cmp    $0x7,%rcx
> 606: 77 e8                 ja     5f0
> 608: eb 17                 jmp    621
> 60a: 66 0f 1f 44 00 00     nopw   0x0(%rax,%rax,1)
> 610: 41 8b 10              mov    (%r8),%edx
> 613: 89 17                 mov    %edx,(%rdi)
> 615: 48 83 c7 04           add    $0x4,%rdi
> 619: 49 83 c0 04           add    $0x4,%r8
> 61d: 48 83 c1 fc           add    $0xfffc,%rcx
> 621: 48 83 f9 03           cmp    $0x3,%rcx
> 625: 77 e9                 ja     610
> 627: eb 1a                 jmp    643
> 629: 0f 1f 80 00 00 00 00  nopl   0x0(%rax)
> 630: 41 0f b7 10           movzwl (%r8),%edx
> 634: 66 89 17              mov    %dx,(%rdi)
> 637: 48 83 c7 02           add    $0x2,%rdi
> 63b: 49 83 c0 02           add    $0x2,%r8
> 63f: 48 83 c1 fe           add    $0xfffe,%rcx
> 643: 48 83 f9 01           cmp    $0x1,%rcx
> 647: 77 e7                 ja     630
> 649: eb 15                 jmp    660
> 64b: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
> 650: 41 0f b6 08           movzbl (%r8),%ecx
> 654: 88 0f                 mov    %cl,(%rdi)
> 656: 48 83 c7 01           add    $0x1,%rdi
> 65a: 49 83 c0 01           add    $0x1,%r8
> 65e: 31 c9                 xor    %ecx,%ecx
> 660: 48 85 c9              test   %rcx,%rcx
> 663: 75 eb                 jne    650

After the change, the code is now what we would expect (no jmp around):
 5db: 48 83 f9 08           cmp    $0x8,%rcx
 5df: 72 27                 jb     608
 5e1: 66 66 66 66 66 66 2e  data16 data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
 5e8: 0f 1f 84 00 00 00 00
 5ef: 00
 5f0: 49 8b 10              mov    (%r8),%rdx
 5f3: 48 89 17              mov    %rdx,(%rdi)
 5f6: 48 83 c7 08           add    $0x8,%rdi
 5fa: 49 83 c0 08           add    $0x8,%r8
 5fe: 48 83 c1 f8           add    $0xfff8,%rcx
 602: 48 83 f9 08           cmp    $0x8,%rcx
 606: 73 e8                 jae    5f0
 608: 48 83 f9 04           cmp    $0x4,%rcx
 60c: 72 11                 jb     61f
 60e: 41 8b 10              mov    (%r8),%edx
 611: 89 17                 mov    %edx,(%rdi)
 613: 48 83 c7 04           add    $0x4,%rdi
 617: 49 83 c0 04           add    $0x4,%r8
 61b: 48 83 c1 fc           add    $0xfffc,%rcx
 61f: 48 83 f9 02           cmp    $0x2,%rcx
 623: 72 13                 jb     638
 625: 41 0f b7 10           movzwl (%r8),%edx
 629: 66 89 17              mov    %dx,(%rdi)
 62c: 48 83 c7 02           add    $0x2,%rdi
 630: 49 83 c0 02           add    $0x2,%r8
 634: 48 83 c1 fe           add    $0xfffe,%rcx
 638: 48 85 c9              test   %rcx,%rcx
 63b: 74 05                 je     642
 63d: 41 8a 08              mov    (%r8),%cl
 640: 88 0f                 mov    %cl,(%rdi)

As I said, it's minor; I am sure you can come up with something much better!

Thanks !

>

>
> > FWIW, this
> > void bar(unsigned);
> > void foo(unsigned n)
> > {
> > while (n >= 8) {
> > bar(n);
> > n -= 8;
> > }
> > while (n >= 4) {
> > bar(n);
> > n -= 4;
> > }
> > while (n >= 2) {
> > bar(n);
> > n -= 2;
> > }
> > while (n >= 1) {
> > bar(n);
> > n -= 1;
> > }
> > }
> >
> > will compile (with -O2) to
> > pushq   %rbp

Re: [PATCH] x86/uaccess: small optimization in unsafe_copy_to_user()

2021-04-16 Thread Eric Dumazet
On Fri, Apr 16, 2021 at 9:44 PM Al Viro  wrote:
>
> On Fri, Apr 16, 2021 at 12:24:13PM -0700, Eric Dumazet wrote:
> > From: Eric Dumazet 
> >
> > We have to loop only to copy u64 values.
> > After this first loop, we copy at most one u32, one u16 and one byte.
>
> Does it actually yield a better code?
>

Yes, my patch gives better code on an actual kernel use-case

(net-next tree, look at put_cmsg())

5ca: 48 89 0f              mov    %rcx,(%rdi)
 5cd: 89 77 08              mov    %esi,0x8(%rdi)
 5d0: 89 57 0c              mov    %edx,0xc(%rdi)
 5d3: 48 83 c7 10           add    $0x10,%rdi
 5d7: 48 83 c1 f0           add    $0xfff0,%rcx
 5db: 48 83 f9 07           cmp    $0x7,%rcx
 5df: 76 40                 jbe    621
 5e1: 66 66 66 66 66 66 2e  data16 data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
 5e8: 0f 1f 84 00 00 00 00
 5ef: 00
 5f0: 49 8b 10              mov    (%r8),%rdx
 5f3: 48 89 17              mov    %rdx,(%rdi)
 5f6: 48 83 c7 08           add    $0x8,%rdi
 5fa: 49 83 c0 08           add    $0x8,%r8
 5fe: 48 83 c1 f8           add    $0xfff8,%rcx
 602: 48 83 f9 07           cmp    $0x7,%rcx
 606: 77 e8                 ja     5f0
 608: eb 17                 jmp    621
 60a: 66 0f 1f 44 00 00     nopw   0x0(%rax,%rax,1)
 610: 41 8b 10              mov    (%r8),%edx
 613: 89 17                 mov    %edx,(%rdi)
 615: 48 83 c7 04           add    $0x4,%rdi
 619: 49 83 c0 04           add    $0x4,%r8
 61d: 48 83 c1 fc           add    $0xfffc,%rcx
 621: 48 83 f9 03           cmp    $0x3,%rcx
 625: 77 e9                 ja     610
 627: eb 1a                 jmp    643
 629: 0f 1f 80 00 00 00 00  nopl   0x0(%rax)
 630: 41 0f b7 10           movzwl (%r8),%edx
 634: 66 89 17              mov    %dx,(%rdi)
 637: 48 83 c7 02           add    $0x2,%rdi
 63b: 49 83 c0 02           add    $0x2,%r8
 63f: 48 83 c1 fe           add    $0xfffe,%rcx
 643: 48 83 f9 01           cmp    $0x1,%rcx
 647: 77 e7                 ja     630
 649: eb 15                 jmp    660
 64b: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
 650: 41 0f b6 08           movzbl (%r8),%ecx
 654: 88 0f                 mov    %cl,(%rdi)
 656: 48 83 c7 01           add    $0x1,%rdi
 65a: 49 83 c0 01           add    $0x1,%r8
 65e: 31 c9                 xor    %ecx,%ecx
 660: 48 85 c9              test   %rcx,%rcx
 663: 75 eb                 jne    650


> FWIW, this
> void bar(unsigned);
> void foo(unsigned n)
> {
> while (n >= 8) {
> bar(n);
> n -= 8;
> }
> while (n >= 4) {
> bar(n);
> n -= 4;
> }
> while (n >= 2) {
> bar(n);
> n -= 2;
> }
> while (n >= 1) {
> bar(n);
> n -= 1;
> }
> }
>
> will compile (with -O2) to
>         pushq   %rbp
>         pushq   %rbx
>         movl    %edi, %ebx
>         subq    $8, %rsp
>         cmpl    $7, %edi
>         jbe     .L2
>         movl    %edi, %ebp
> .L3:
>         movl    %ebp, %edi
>         subl    $8, %ebp
>         call    bar@PLT
>         cmpl    $7, %ebp
>         ja      .L3
>         andl    $7, %ebx
> .L2:
>         cmpl    $3, %ebx
>         jbe     .L4
>         movl    %ebx, %edi
>         andl    $3, %ebx
>         call    bar@PLT
> .L4:
>         cmpl    $1, %ebx
>         jbe     .L5
>         movl    %ebx, %edi
>         andl    $1, %ebx
>         call    bar@PLT
> .L5:
>         testl   %ebx, %ebx
>         je      .L1
>         addq    $8, %rsp
>         movl    $1, %edi
>         popq    %rbx
>         popq    %rbp
>         jmp     bar@PLT
> .L1:
>         addq    $8, %rsp
>         popq    %rbx
>         popq    %rbp
>         ret
>
> i.e. loop + if + if + if...


[PATCH] x86/uaccess: small optimization in unsafe_copy_to_user()

2021-04-16 Thread Eric Dumazet
From: Eric Dumazet 

We have to loop only to copy u64 values.
After this first loop, we copy at most one u32, one u16 and one byte.

Signed-off-by: Eric Dumazet 
---
 arch/x86/include/asm/uaccess.h | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index c9fa7be3df82ddb9495961b3e2f22b1ac07edafa..ddb19bb8c86786d78407dcfb59623943ccbce8a8 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -517,15 +517,23 @@ do {						\
 		len -= sizeof(type);				\
 	}
 
+#define unsafe_copy_elem(dst, src, len, type, label)		\
+	if (len >= sizeof(type)) {				\
+		unsafe_put_user(*(type *)(src),(type __user *)(dst),label); \
+		dst += sizeof(type);				\
+		src += sizeof(type);				\
+		len -= sizeof(type);				\
+	}
+
 #define unsafe_copy_to_user(_dst,_src,_len,label)  \
 do {   \
char __user *__ucu_dst = (_dst);\
const char *__ucu_src = (_src); \
size_t __ucu_len = (_len);  \
unsafe_copy_loop(__ucu_dst, __ucu_src, __ucu_len, u64, label);  \
-   unsafe_copy_loop(__ucu_dst, __ucu_src, __ucu_len, u32, label);  \
-   unsafe_copy_loop(__ucu_dst, __ucu_src, __ucu_len, u16, label);  \
-   unsafe_copy_loop(__ucu_dst, __ucu_src, __ucu_len, u8, label);   \
+   unsafe_copy_elem(__ucu_dst, __ucu_src, __ucu_len, u32, label);  \
+   unsafe_copy_elem(__ucu_dst, __ucu_src, __ucu_len, u16, label);  \
+   unsafe_copy_elem(__ucu_dst, __ucu_src, __ucu_len, u8, label);   \
 } while (0)
 
 #define HAVE_GET_KERNEL_NOFAULT
-- 
2.31.1.368.gbe11c130af-goog



Re: [PATCH] tcp: fix silent loss when syncookie is triggered

2021-04-16 Thread Eric Dumazet
On Fri, Apr 16, 2021 at 12:52 PM zhaoya  wrote:
>
> When syncookies are triggered, since $MSSID is spliced into the cookie
> and the legal indexes of msstab are 0, 1, 2 and 3, this gives the client
> 3 bytes of freedom, resulting in at most 3 bytes of silent loss.
>
> C ---seq=12345---------------------> S
> C <--seq=cookie/ack=12346----------- S    S generated the cookie
>                                           [RFC 4987 Appendix A]
> C ---seq=123456/ack=cookie+1---->X        The first byte was lost.
> C ---seq=123457/ack=cookie+1-------> S    The second byte was received,
>                                           the cookie check still passed and
>                                           the handshake was finished.
> C <--seq=.../ack=12348-------------- S    S acknowledges the second byte.

Re: [PATCH 4.14 16/68] net: ensure mac header is set in virtio_net_hdr_to_skb()

2021-04-16 Thread Eric Dumazet
On Fri, Apr 16, 2021 at 10:49 AM Balazs Nemeth  wrote:
>
> On Thu, 2021-04-15 at 16:46 +0200, Greg Kroah-Hartman wrote:
> > From: Eric Dumazet 
> >
> > commit 61431a5907fc36d0738e9a547c7e1556349a03e9 upstream.
> >
> > Commit 924a9bc362a5 ("net: check if protocol extracted by
> > virtio_net_hdr_set_proto is correct")
> > added a call to dev_parse_header_protocol() but mac_header is not yet
> > set.
> >
> > This means that eth_hdr() reads complete garbage, and syzbot
> > complained about it [1]
> >
> > This patch resets mac_header earlier, to get more coverage about this
> > change.
> >
> > Audit of virtio_net_hdr_to_skb() callers shows that this change
> > should be safe.
> >
> > [1]
> >
> > BUG: KASAN: use-after-free in eth_header_parse_protocol+0xdc/0xe0
> > net/ethernet/eth.c:282
> > Read of size 2 at addr 888017a6200b by task syz-executor313/8409
> >
> > CPU: 1 PID: 8409 Comm: syz-executor313 Not tainted 5.12.0-rc2-
> > syzkaller #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine,
> > BIOS Google 01/01/2011
> > Call Trace:
> >  __dump_stack lib/dump_stack.c:79 [inline]
> >  dump_stack+0x141/0x1d7 lib/dump_stack.c:120
> >  print_address_description.constprop.0.cold+0x5b/0x2f8
> > mm/kasan/report.c:232
> >  __kasan_report mm/kasan/report.c:399 [inline]
> >  kasan_report.cold+0x7c/0xd8 mm/kasan/report.c:416
> >  eth_header_parse_protocol+0xdc/0xe0 net/ethernet/eth.c:282
> >  dev_parse_header_protocol include/linux/netdevice.h:3177 [inline]
> >  virtio_net_hdr_to_skb.constprop.0+0x99d/0xcd0
> > include/linux/virtio_net.h:83
> >  packet_snd net/packet/af_packet.c:2994 [inline]
> >  packet_sendmsg+0x2325/0x52b0 net/packet/af_packet.c:3031
> >  sock_sendmsg_nosec net/socket.c:654 [inline]
> >  sock_sendmsg+0xcf/0x120 net/socket.c:674
> >  sock_no_sendpage+0xf3/0x130 net/core/sock.c:2860
> >  kernel_sendpage.part.0+0x1ab/0x350 net/socket.c:3631
> >  kernel_sendpage net/socket.c:3628 [inline]
> >  sock_sendpage+0xe5/0x140 net/socket.c:947
> >  pipe_to_sendpage+0x2ad/0x380 fs/splice.c:364
> >  splice_from_pipe_feed fs/splice.c:418 [inline]
> >  __splice_from_pipe+0x43e/0x8a0 fs/splice.c:562
> >  splice_from_pipe fs/splice.c:597 [inline]
> >  generic_splice_sendpage+0xd4/0x140 fs/splice.c:746
> >  do_splice_from fs/splice.c:767 [inline]
> >  do_splice+0xb7e/0x1940 fs/splice.c:1079
> >  __do_splice+0x134/0x250 fs/splice.c:1144
> >  __do_sys_splice fs/splice.c:1350 [inline]
> >  __se_sys_splice fs/splice.c:1332 [inline]
> >  __x64_sys_splice+0x198/0x250 fs/splice.c:1332
> >  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> >
> > Fixes: 924a9bc362a5 ("net: check if protocol extracted by
> > virtio_net_hdr_set_proto is correct")
> > Signed-off-by: Eric Dumazet 
> > Cc: Balazs Nemeth 
> > Cc: Willem de Bruijn 
> > Reported-by: syzbot 
> > Signed-off-by: David S. Miller 
> > Signed-off-by: Greg Kroah-Hartman 
> > ---
> >  include/linux/virtio_net.h |2 ++
> >  1 file changed, 2 insertions(+)
> >
> > --- a/include/linux/virtio_net.h
> > +++ b/include/linux/virtio_net.h
> > @@ -62,6 +62,8 @@ static inline int virtio_net_hdr_to_skb(
> > return -EINVAL;
> > }
> >
> > +   skb_reset_mac_header(skb);
> > +
> > if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
> > u16 start = __virtio16_to_cpu(little_endian, hdr->csum_start);
> > u16 off = __virtio16_to_cpu(little_endian, hdr->csum_offset);
> >
> >
>
> Hi,
>
> Since the call to dev_parse_header_protocol is only made for gso
> packets where skb->protocol is not set, we could move
> skb_reset_mac_header down closer to that call. Is there another reason
> to reset mac_header earlier (and affect handling of other packets as
> well)? In any case, thanks for spotting this!
>

The answer to your question was in the changelog

"This patch resets mac_header earlier, to get more coverage about this change."

We want to detect if such a reset is going to hurt in general, not
only for GSO packets.


Re: [PROBLEM] a data race between tcp_set_default_congestion_control() and tcp_set_congestion_control()

2021-04-15 Thread Eric Dumazet
On Thu, Apr 15, 2021 at 5:47 PM Gong, Sishuai  wrote:
>
> Hi,
>
> We found a data race between tcp_set_default_congestion_control() and 
> tcp_set_congestion_control() in linux-5.12-rc3.
> In general, when tcp_set_congestion_control() is reading ca->flags with a 
> lock grabbed, tcp_set_default_congestion_control()
> may be updating ca->flags at the same time, as shown below.
>
> When the writer and reader run in parallel, tcp_set_congestion_control()'s
> control flow might be non-deterministic, either returning -EPERM or calling
> tcp_reinit_congestion_control().
>
> We also notice that in tcp_set_allowed_congestion_control() the write to
> ca->flags is protected by tcp_cong_list_lock, so we want to point this out
> in case the data race is unexpected.
>
> Thread 1                                 Thread 2
> //tcp_set_default_congestion_control()   //tcp_set_congestion_control()
>                                          // lock_sock() grabbed
>                                          if (!((ca->flags & TCP_CONG_NON_RESTRICTED) || cap_net_admin))
>                                                  err = -EPERM;
>                                          else if (!bpf_try_module_get(ca, ca->owner))
>                                                  err = -EBUSY;
>                                          else
>                                                  tcp_reinit_congestion_control(sk, ca);
> ca->flags |= TCP_CONG_NON_RESTRICTED;
>
>
>
> Thanks,
> Sishuai
>

Yes, obviously reading ca->flags while another thread might set the bit
is racy.

This is of no consequence; if you want to silence KCSAN, please send a
patch.
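Such a patch would typically just annotate the racy pair; a minimal
sketch of the idea (fragments, not a tested patch):

/* Reader side, in tcp_set_congestion_control(): tolerate a concurrent
 * flags update without a KCSAN report. */
if (!((READ_ONCE(ca->flags) & TCP_CONG_NON_RESTRICTED) || cap_net_admin))
	err = -EPERM;

/* Writer side, in tcp_set_default_congestion_control(), under
 * tcp_cong_list_lock: */
WRITE_ONCE(ca->flags, ca->flags | TCP_CONG_NON_RESTRICTED);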


[tip: sched/core] rseq: Optimize rseq_update_cpu_id()

2021-04-15 Thread tip-bot2 for Eric Dumazet
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 60af388d23889636011488c42763876bcdda3eab
Gitweb:
https://git.kernel.org/tip/60af388d23889636011488c42763876bcdda3eab
Author:Eric Dumazet 
AuthorDate:Tue, 13 Apr 2021 13:33:50 -07:00
Committer: Peter Zijlstra 
CommitterDate: Wed, 14 Apr 2021 18:04:09 +02:00

rseq: Optimize rseq_update_cpu_id()

Two put_user() in rseq_update_cpu_id() are replaced
by a pair of unsafe_put_user() with appropriate surroundings.

This removes one stac/clac pair on x86 in fast path.

Signed-off-by: Eric Dumazet 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Mathieu Desnoyers 
Link: https://lkml.kernel.org/r/20210413203352.71350-2-eric.duma...@gmail.com
---
 kernel/rseq.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index a4f86a9..f020f18 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -84,13 +84,20 @@
 static int rseq_update_cpu_id(struct task_struct *t)
 {
u32 cpu_id = raw_smp_processor_id();
+   struct rseq __user *rseq = t->rseq;
 
-   if (put_user(cpu_id, &t->rseq->cpu_id_start))
-   return -EFAULT;
-   if (put_user(cpu_id, &t->rseq->cpu_id))
-   return -EFAULT;
+   if (!user_write_access_begin(rseq, sizeof(*rseq)))
+   goto efault;
+   unsafe_put_user(cpu_id, &rseq->cpu_id_start, efault_end);
+   unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
+   user_write_access_end();
trace_rseq_update(t);
return 0;
+
+efault_end:
+   user_write_access_end();
+efault:
+   return -EFAULT;
 }
 
 static int rseq_reset_rseq_cpu_id(struct task_struct *t)


[tip: sched/core] rseq: Remove redundant access_ok()

2021-04-15 Thread tip-bot2 for Eric Dumazet
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 0ed96051531ecc6965f6456d25b19b9b6bdb5c28
Gitweb:
https://git.kernel.org/tip/0ed96051531ecc6965f6456d25b19b9b6bdb5c28
Author:Eric Dumazet 
AuthorDate:Tue, 13 Apr 2021 13:33:51 -07:00
Committer: Peter Zijlstra 
CommitterDate: Wed, 14 Apr 2021 18:04:09 +02:00

rseq: Remove redundant access_ok()

After commit 8f2817701492 ("rseq: Use get_user/put_user rather
than __get_user/__put_user") we no longer need
an access_ok() call from __rseq_handle_notify_resume()

Mathieu pointed out the same cleanup can be done
in rseq_syscall().

Signed-off-by: Eric Dumazet 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Mathieu Desnoyers 
Link: https://lkml.kernel.org/r/20210413203352.71350-3-eric.duma...@gmail.com
---
 kernel/rseq.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index f020f18..cfe01ab 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -273,8 +273,6 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 
if (unlikely(t->flags & PF_EXITING))
return;
-   if (unlikely(!access_ok(t->rseq, sizeof(*t->rseq))))
-   goto error;
ret = rseq_ip_fixup(regs);
if (unlikely(ret < 0))
goto error;
@@ -301,8 +299,7 @@ void rseq_syscall(struct pt_regs *regs)
 
if (!t->rseq)
return;
-   if (!access_ok(t->rseq, sizeof(*t->rseq)) ||
-   rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
+   if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
force_sig(SIGSEGV);
 }
 


[tip: sched/core] rseq: Optimise rseq_get_rseq_cs() and clear_rseq_cs()

2021-04-15 Thread tip-bot2 for Eric Dumazet
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 5e0ccd4a3b01c5a71732a13186ca110a138516ea
Gitweb:
https://git.kernel.org/tip/5e0ccd4a3b01c5a71732a13186ca110a138516ea
Author:Eric Dumazet 
AuthorDate:Tue, 13 Apr 2021 13:33:52 -07:00
Committer: Peter Zijlstra 
CommitterDate: Wed, 14 Apr 2021 18:04:09 +02:00

rseq: Optimise rseq_get_rseq_cs() and clear_rseq_cs()

Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
update includes") added regressions for our servers.

Using copy_from_user() and clear_user() for 64bit values
is suboptimal.

We can use faster put_user() and get_user() on 64bit arches.

Signed-off-by: Eric Dumazet 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Mathieu Desnoyers 
Link: https://lkml.kernel.org/r/20210413203352.71350-4-eric.duma...@gmail.com
---
 kernel/rseq.c |  9 +
 1 file changed, 9 insertions(+)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index cfe01ab..35f7bd0 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -127,8 +127,13 @@ static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
u32 sig;
int ret;
 
+#ifdef CONFIG_64BIT
+   if (get_user(ptr, &t->rseq->rseq_cs.ptr64))
+   return -EFAULT;
+#else
if (copy_from_user(&ptr, &t->rseq->rseq_cs.ptr64, sizeof(ptr)))
return -EFAULT;
+#endif
if (!ptr) {
memset(rseq_cs, 0, sizeof(*rseq_cs));
return 0;
@@ -211,9 +216,13 @@ static int clear_rseq_cs(struct task_struct *t)
 *
 * Set rseq_cs to NULL.
 */
+#ifdef CONFIG_64BIT
+   return put_user(0UL, &t->rseq->rseq_cs.ptr64);
+#else
if (clear_user(&t->rseq->rseq_cs.ptr64, sizeof(t->rseq->rseq_cs.ptr64)))
return -EFAULT;
return 0;
+#endif
 }
 
 /*


Re: [PATCH net-next v3 2/5] mm: add a signature in struct page

2021-04-14 Thread Eric Dumazet
On Wed, Apr 14, 2021 at 10:09 PM Shakeel Butt  wrote:

>
> I will let TCP RX zerocopy experts respond to this but from my high
> level code inspection, I didn't see page->private usage.

Indeed, we do not use page->private, since we do not own the page(s).


Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

2021-04-14 Thread Eric Dumazet
On Wed, Apr 14, 2021 at 10:15 PM Arjun Roy  wrote:
>
> On Wed, Apr 14, 2021 at 10:35 AM Eric Dumazet  wrote:
> >
> > On Wed, Apr 14, 2021 at 7:15 PM Arjun Roy  wrote:
> > >
> > > On Wed, Apr 14, 2021 at 9:10 AM Eric Dumazet  wrote:
> > > >
> > > > On Wed, Apr 14, 2021 at 6:08 PM David Laight  
> > > > wrote:
> > > > >
> > > > > From: Eric Dumazet
> > > > > > Sent: 14 April 2021 17:00
> > > > > ...
> > > > > > > Repeated unsafe_get_user() calls are crying out for an 
> > > > > > > optimisation.
> > > > > > > You get something like:
> > > > > > > failed = 0;
> > > > > > > copy();
> > > > > > > if (failed) goto error;
> > > > > > > copy();
> > > > > > > if (failed) goto error;
> > > > > > > Where 'failed' is set by the fault handler.
> > > > > > >
> > > > > > > This could be optimised to:
> > > > > > > failed = 0;
> > > > > > > copy();
> > > > > > > copy();
> > > > > > > if (failed) goto error;
> > > > > > > Even if it faults on every invalid address it probably
> > > > > > > doesn't matter - no one cares about that path.
> > > > > >
> > > > > >
> > > > > > On which arch are you looking at ?
> > > > > >
> > > > > > On x86_64 at least, code generation is just perfect.
> > > > > > Not even a conditional jmp, it is all handled by exceptions (if any)
> > > > > >
> > > > > > stac
> > > > > > copy();
> > > > > > copy();
> > > > > > clac
> > > > > >
> > > > > >
> > > > > > 
> > > > > > efault_end: do error recovery.
> > > > >
> > > > > It will be x86_64.
> > > > > I'm definitely seeing repeated tests of (IIRC) %rdx.
> > > > >
> > > > > It may well be because the compiler isn't very new.
> > > > > Will be an Ubuntu build of 9.3.0.
> > > > > Does that support 'asm goto with outputs' - which
> > > > > may be the difference.
> > > > >
> > > >
> > > > Yep, probably. I am using some recent clang version.
> > > >
> > >
> > > On x86-64 I can confirm, for me it (4 x unsafe_get_user()) compiles
> > > down to stac + lfence + 8 x mov + clac as straight line code. And
> > > results in roughly a 5%-10% speedup over copy_from_user().
> > >
> >
> > But rseq_get_rseq_cs() would still need three different copies,
> > with 3 stac+lfence+clac sequences.
> >
> > Maybe we need to enclose all __rseq_handle_notify_resume() operations
> > in a single section.
> >
> >
>
> To provide a bit of further exposition on this point, if you do 4x
> unsafe_get_user() recall I mentioned a 5-10% improvement. On the other
> hand, 4x normal get_user() I saw something like a 100% (ie. doubling
> of sys time measured) regression.
>
> I assume that's the fault of multiple stac+clac.


I was suggesting only using unsafe_get_user() and unsafe_put_user(),
and one surrounding stac/clac

Basically what we had (partially) in our old Google kernels, before
commit 8f2817701492 ("rseq: Use get_user/put_user rather than
__get_user/__put_user")
but with all the needed modern stuff.
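A sketch of what such a single section could look like, using a
hypothetical helper that pulls the fields needed on the fast path under
one stac/clac pair (64-bit only, not the merged code):

static int rseq_fetch_fast(struct rseq __user *rseq, u64 *cs_ptr, u32 *cpu)
{
	/* One user access section covers both reads: a single stac/clac. */
	if (!user_read_access_begin(rseq, sizeof(*rseq)))
		return -EFAULT;
	unsafe_get_user(*cs_ptr, &rseq->rseq_cs.ptr64, efault_end);
	unsafe_get_user(*cpu, &rseq->cpu_id, efault_end);
	user_read_access_end();
	return 0;

efault_end:
	user_read_access_end();
	return -EFAULT;
}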


Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

2021-04-14 Thread Eric Dumazet
On Wed, Apr 14, 2021 at 7:15 PM Arjun Roy  wrote:
>
> On Wed, Apr 14, 2021 at 9:10 AM Eric Dumazet  wrote:
> >
> > On Wed, Apr 14, 2021 at 6:08 PM David Laight  
> > wrote:
> > >
> > > From: Eric Dumazet
> > > > Sent: 14 April 2021 17:00
> > > ...
> > > > > Repeated unsafe_get_user() calls are crying out for an optimisation.
> > > > > You get something like:
> > > > > failed = 0;
> > > > > copy();
> > > > > if (failed) goto error;
> > > > > copy();
> > > > > if (failed) goto error;
> > > > > Where 'failed' is set by the fault handler.
> > > > >
> > > > > This could be optimised to:
> > > > > failed = 0;
> > > > > copy();
> > > > > copy();
> > > > > if (failed) goto error;
> > > > > Even if it faults on every invalid address it probably
> > > > > doesn't matter - no one cares about that path.
> > > >
> > > >
> > > > On which arch are you looking at ?
> > > >
> > > > On x86_64 at least, code generation is just perfect.
> > > > Not even a conditional jmp, it is all handled by exceptions (if any)
> > > >
> > > > stac
> > > > copy();
> > > > copy();
> > > > clac
> > > >
> > > >
> > > > 
> > > > efault_end: do error recovery.
> > >
> > > It will be x86_64.
> > > I'm definitely seeing repeated tests of (IIRC) %rdx.
> > >
> > > It may well be because the compiler isn't very new.
> > > Will be an Ubuntu build of 9.3.0.
> > > Does that support 'asm goto with outputs' - which
> > > may be the difference.
> > >
> >
> > Yep, probably. I am using some recent clang version.
> >
>
> On x86-64 I can confirm, for me it (4 x unsafe_get_user()) compiles
> down to stac + lfence + 8 x mov + clac as straight line code. And
> results in roughly a 5%-10% speedup over copy_from_user().
>

But rseq_get_rseq_cs() would still need three different copies,
with 3 stac+lfence+clac sequences.

Maybe we need to enclose all __rseq_handle_notify_resume() operations
in a single section.







> -Arjun
>
>
> > > David
> > >
> > > -
> > > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 
> > > 1PT, UK
> > > Registration No: 1397386 (Wales)


[PATCH] sh: remove unused variable

2021-04-14 Thread Eric Dumazet
From: Eric Dumazet 

Removes this annoying warning:

arch/sh/kernel/traps.c: In function ‘nmi_trap_handler’:
arch/sh/kernel/traps.c:183:15: warning: unused variable ‘cpu’ 
[-Wunused-variable]
  183 |  unsigned int cpu = smp_processor_id();

Fixes: fe3f1d5d7cd3 ("sh: Get rid of nmi_count()")
Signed-off-by: Eric Dumazet 
Cc: Thomas Gleixner 
Cc: Frederic Weisbecker 
Cc: Yoshinori Sato 
Cc: Rich Felker 
---
 arch/sh/kernel/traps.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/sh/kernel/traps.c b/arch/sh/kernel/traps.c
index f5beecdac69382f2d719fa33d50b9d58e22f6ff8..e76b221570999776e3bc9276d6b2fd60b9132e94 100644
--- a/arch/sh/kernel/traps.c
+++ b/arch/sh/kernel/traps.c
@@ -180,7 +180,6 @@ static inline void arch_ftrace_nmi_exit(void) { }
 
 BUILD_TRAP_HANDLER(nmi)
 {
-   unsigned int cpu = smp_processor_id();
TRAP_HANDLER_DECL;
 
arch_ftrace_nmi_enter();
-- 
2.31.1.295.g9ea45b61b8-goog



Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

2021-04-14 Thread Eric Dumazet
On Wed, Apr 14, 2021 at 6:08 PM David Laight  wrote:
>
> From: Eric Dumazet
> > Sent: 14 April 2021 17:00
> ...
> > > Repeated unsafe_get_user() calls are crying out for an optimisation.
> > > You get something like:
> > > failed = 0;
> > > copy();
> > > if (failed) goto error;
> > > copy();
> > > if (failed) goto error;
> > > Where 'failed' is set by the fault handler.
> > >
> > > This could be optimised to:
> > > failed = 0;
> > > copy();
> > > copy();
> > > if (failed) goto error;
> > > Even if it faults on every invalid address it probably
> > > doesn't matter - no one cares about that path.
> >
> >
> > On which arch are you looking at ?
> >
> > On x86_64 at least, code generation is just perfect.
> > Not even a conditional jmp, it is all handled by exceptions (if any)
> >
> > stac
> > copy();
> > copy();
> > clac
> >
> >
> > 
> > efault_end: do error recovery.
>
> It will be x86_64.
> I'm definitely seeing repeated tests of (IIRC) %rdx.
>
> It may well be because the compiler isn't very new.
> Will be an Ubuntu build of 9.3.0.
> Does that support 'asm goto with outputs' - which
> may be the difference.
>

Yep, probably. I am using some recent clang version.

> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 
> 1PT, UK
> Registration No: 1397386 (Wales)


Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

2021-04-14 Thread Eric Dumazet
On Wed, Apr 14, 2021 at 9:55 AM David Laight  wrote:
>
> From: Arjun Roy
> > Sent: 13 April 2021 23:04
> >
> > On Tue, Apr 13, 2021 at 2:19 PM David Laight  
> > wrote:
> > >
> > > > If we're special-casing 64-bit architectures anyways - unrolling the
> > > > 32B copy_from_user() for struct rseq_cs appears to be roughly 5-10%
> > > > savings on x86-64 when I measured it (well, in a microbenchmark, not
> > > > in rseq_get_rseq_cs() directly). Perhaps that could be an additional
> > > > avenue for improvement here.
> > >
> > > The killer is usually 'user copy hardening'.
> > > It significantly slows down sendmsg() and recvmsg().
> > > I've got measurable performance improvements by
> > > using __copy_from_user() when the buffer size has
> > > already been checked - but isn't a compile-time constant.
> > >
> > > There is also scope for using _get_user() when reading
> > > iovec[] (instead of copy_from_user()) and doing all the
> > > bound checks (etc) in the loop.
> > > That gives a measurable improvement for writev("/dev/null").
> > > I must sort those patches out again.
> > >
> > > David
> > >
> >
> > In this case I mean replacing copy_from_user(rseq_cs, urseq_cs,
> > sizeof(*rseq_cs)) with  4 (x8B=32 total) unsafe_get_user() (wrapped in
> > user_read_access_begin/end()) which I think would just bypass user
> > copy hardening (as far as I can tell).
>
> Yes that is one advantage over any of the get_user() calls.
> You also lose all the 'how shall we optimise this' checks
> in copy_from_user().
>
> Repeated unsafe_get_user() calls are crying out for an optimisation.
> You get something like:
> failed = 0;
> copy();
> if (failed) goto error;
> copy();
> if (failed) goto error;
> Where 'failed' is set by the fault handler.
>
> This could be optimised to:
> failed = 0;
> copy();
> copy();
> if (failed) goto error;
> Even if it faults on every invalid address it probably
> doesn't matter - no one cares about that path.


On which arch are you looking at ?

On x86_64 at least, code generation is just perfect.
Not even a conditional jmp, it is all handled by exceptions (if any)

stac
copy();
copy();
clac



efault_end: do error recovery.



>
> I've not really looked at how it could be achieved though.
>
> It might be that the 'asm goto with outputs' variant
> manages to avoid the compare and jump.
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 
> 1PT, UK
> Registration No: 1397386 (Wales)
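
A sketch of the unrolled read Arjun describes: fetching struct rseq_cs
field by field inside one user_read_access_begin() section, which skips
the copy_from_user() hardening checks. Field names follow
include/uapi/linux/rseq.h; the function name and labels are illustrative:

static int rseq_cs_read_unrolled(struct rseq_cs *dst,
				 const struct rseq_cs __user *src)
{
	if (!user_read_access_begin(src, sizeof(*src)))
		return -EFAULT;
	unsafe_get_user(dst->version, &src->version, efault_end);
	unsafe_get_user(dst->flags, &src->flags, efault_end);
	unsafe_get_user(dst->start_ip, &src->start_ip, efault_end);
	unsafe_get_user(dst->post_commit_offset, &src->post_commit_offset,
			efault_end);
	unsafe_get_user(dst->abort_ip, &src->abort_ip, efault_end);
	user_read_access_end();
	return 0;

efault_end:
	user_read_access_end();
	return -EFAULT;
}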


[PATCH v3 2/3] rseq: remove redundant access_ok()

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

After commit 8f2817701492 ("rseq: Use get_user/put_user rather
than __get_user/__put_user") we no longer need
an access_ok() call from __rseq_handle_notify_resume()

Mathieu pointed out the same cleanup can be done
in rseq_syscall().

Signed-off-by: Eric Dumazet 
Cc: Mathieu Desnoyers 
Cc: Peter Zijlstra 
Cc: "Paul E. McKenney" 
Cc: Boqun Feng 
Cc: Arjun Roy 
Cc: Ingo Molnar 
---
 kernel/rseq.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 
f020f18f512a3f6241c3c9b104ce50e4d2c6188c..cfe01ab5253c1c424c0e8b25acbb6a8e1b41a5b6
 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -273,8 +273,6 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, 
struct pt_regs *regs)
 
if (unlikely(t->flags & PF_EXITING))
return;
-   if (unlikely(!access_ok(t->rseq, sizeof(*t->rseq))))
-   goto error;
ret = rseq_ip_fixup(regs);
if (unlikely(ret < 0))
goto error;
@@ -301,8 +299,7 @@ void rseq_syscall(struct pt_regs *regs)
 
if (!t->rseq)
return;
-   if (!access_ok(t->rseq, sizeof(*t->rseq)) ||
-   rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
+   if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
force_sig(SIGSEGV);
 }
 
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH v3 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
update includes") added regressions for our servers.

Using copy_from_user() and clear_user() for 64bit values
is suboptimal.

We can use faster put_user() and get_user() on 64bit arches.

Signed-off-by: Eric Dumazet 
Cc: Mathieu Desnoyers 
Cc: Peter Zijlstra 
Cc: "Paul E. McKenney" 
Cc: Boqun Feng 
Cc: Arjun Roy 
Cc: Ingo Molnar 
---
 kernel/rseq.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 
cfe01ab5253c1c424c0e8b25acbb6a8e1b41a5b6..35f7bd0fced0e2dd8aed819e054dac03f024388a
 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -127,8 +127,13 @@ static int rseq_get_rseq_cs(struct task_struct *t, struct 
rseq_cs *rseq_cs)
u32 sig;
int ret;
 
+#ifdef CONFIG_64BIT
+   if (get_user(ptr, &t->rseq->rseq_cs.ptr64))
+   return -EFAULT;
+#else
if (copy_from_user(&ptr, &t->rseq->rseq_cs.ptr64, sizeof(ptr)))
return -EFAULT;
+#endif
if (!ptr) {
memset(rseq_cs, 0, sizeof(*rseq_cs));
return 0;
@@ -211,9 +216,13 @@ static int clear_rseq_cs(struct task_struct *t)
 *
 * Set rseq_cs to NULL.
 */
+#ifdef CONFIG_64BIT
+   return put_user(0UL, &t->rseq->rseq_cs.ptr64);
+#else
if (clear_user(&t->rseq->rseq_cs.ptr64, sizeof(t->rseq->rseq_cs.ptr64)))
return -EFAULT;
return 0;
+#endif
 }
 
 /*
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH v3 0/3] rseq: minor optimizations

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

rseq is a heavy user of copy to/from user data in fast paths.
This series tries to reduce the cost.

v3: Third patch going back to v1 (only deal with 64bit arches)
v2: Addressed Peter and Mathieu feedbacks, thanks !

Eric Dumazet (3):
  rseq: optimize rseq_update_cpu_id()
  rseq: remove redundant access_ok()
  rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

 kernel/rseq.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH v3 1/3] rseq: optimize rseq_update_cpu_id()

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

Two put_user() in rseq_update_cpu_id() are replaced
by a pair of unsafe_put_user() with appropriate surroundings.

This removes one stac/clac pair on x86 in fast path.

Signed-off-by: Eric Dumazet 
Cc: Mathieu Desnoyers 
Cc: Peter Zijlstra 
Cc: "Paul E. McKenney" 
Cc: Boqun Feng 
Cc: Arjun Roy 
Cc: Ingo Molnar 
---
 kernel/rseq.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 
a4f86a9d6937cdfa2f13d1dcc9be863c1943d06f..f020f18f512a3f6241c3c9b104ce50e4d2c6188c
 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -84,13 +84,20 @@
 static int rseq_update_cpu_id(struct task_struct *t)
 {
u32 cpu_id = raw_smp_processor_id();
+   struct rseq __user *rseq = t->rseq;
 
-   if (put_user(cpu_id, &t->rseq->cpu_id_start))
-   return -EFAULT;
-   if (put_user(cpu_id, &t->rseq->cpu_id))
-   return -EFAULT;
+   if (!user_write_access_begin(rseq, sizeof(*rseq)))
+   goto efault;
+   unsafe_put_user(cpu_id, &rseq->cpu_id_start, efault_end);
+   unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
+   user_write_access_end();
trace_rseq_update(t);
return 0;
+
+efault_end:
+   user_write_access_end();
+efault:
+   return -EFAULT;
 }
 
 static int rseq_reset_rseq_cpu_id(struct task_struct *t)
-- 
2.31.1.295.g9ea45b61b8-goog



Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

2021-04-13 Thread Eric Dumazet
On Tue, Apr 13, 2021 at 8:00 PM Mathieu Desnoyers
 wrote:
>

> As long as the ifdefs are localized within clearly identified wrappers in the
> rseq code I don't mind doing the special-casing there.
>
> The point which remains is that I don't think we want to optimize for speed
> on 32-bit architectures when it adds special-casing and complexity to the 
> 32-bit
> build. I suspect there is less and less testing performed on 32-bit 
> architectures
> nowadays, and it's good that as much code as possible is shared between 
> 32-bit and
> 64-bit builds to share the test coverage.
>

Quite frankly V1 was fine, I can't really make it look better.

> Thanks,
>
> Mathieu
>
> >
> >
> >>
> >> Thanks,
> >>
> >> Mathieu
> >>
> >> >
> >> > diff --git a/kernel/rseq.c b/kernel/rseq.c
> >> > index
> >> > f2eee3f7f5d330688c81cb2e57d47ca6b843873e..537b1f684efa11069990018ffa3642c209993011
> >> > 100644
> >> > --- a/kernel/rseq.c
> >> > +++ b/kernel/rseq.c
> >> > @@ -136,6 +136,10 @@ static int rseq_get_cs_ptr(struct rseq_cs __user 
> >> > **uptrp,
> >> > {
> >> >u32 ptr;
> >> >
> >> > +   if (get_user(ptr, &rseq->rseq_cs.ptr.padding))
> >> > +   return -EFAULT;
> >> > +   if (ptr)
> >> > +   return -EINVAL;
> >> >if (get_user(ptr, &rseq->rseq_cs.ptr.ptr32))
> >> >return -EFAULT;
> >> >*uptrp = (struct rseq_cs __user *)ptr;
> >> > @@ -150,8 +154,9 @@ static int rseq_get_rseq_cs(struct task_struct *t,
> >> > struct rseq_cs *rseq_cs)
> >> >u32 sig;
> >> >int ret;
> >> >
> >> > -   if (rseq_get_cs_ptr(&urseq_cs, t->rseq))
> >> > -   return -EFAULT;
> >> > +   ret = rseq_get_cs_ptr(&urseq_cs, t->rseq);
> >> > +   if (ret)
> >> > +   return ret;
> >> >if (!urseq_cs) {
> >> >memset(rseq_cs, 0, sizeof(*rseq_cs));
> >> >return 0;
> >> > @@ -237,7 +242,8 @@ static int clear_rseq_cs(struct task_struct *t)
> >> > #ifdef CONFIG_64BIT
> >> >return put_user(0UL, &t->rseq->rseq_cs.ptr64);
> >> > #else
> >> > -   return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32);
> >> > +   return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32) |
> >> > +  put_user(0UL, &t->rseq->rseq_cs.ptr.padding);
> >> > #endif
> >> >  }
> >>
> >> --
> >> Mathieu Desnoyers
> >> EfficiOS Inc.
> > > http://www.efficios.com
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
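
For context, the field under discussion is declared in
include/uapi/linux/rseq.h roughly as follows (a simplified sketch of the
union added by commit ec9c82e03a74; the real header spells the
endianness test out more defensively). The padding member is what a
32-bit kernel must keep clearing so user space cannot tell a native
32-bit kernel apart from 32-bit compat mode on a 64-bit kernel:

struct rseq {
	/* ... */
	union {
		__u64 ptr64;
#ifdef __LP64__
		__u64 ptr;
#else
		struct {
#if __BYTE_ORDER == __BIG_ENDIAN
			__u32 padding;	/* Initialized to zero. */
			__u32 ptr32;
#else
			__u32 ptr32;
			__u32 padding;	/* Initialized to zero. */
#endif
		} ptr;
#endif
	} rseq_cs;
	/* ... */
};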


Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

2021-04-13 Thread Eric Dumazet
On Tue, Apr 13, 2021 at 7:20 PM Mathieu Desnoyers
 wrote:
>
> - On Apr 13, 2021, at 1:07 PM, Eric Dumazet eduma...@google.com wrote:
>
> > On Tue, Apr 13, 2021 at 7:01 PM Eric Dumazet  wrote:
> >>
> >> On Tue, Apr 13, 2021 at 6:57 PM Eric Dumazet  wrote:
> >> >
> >> > On Tue, Apr 13, 2021 at 6:54 PM Mathieu Desnoyers
> >> >  wrote:
> >> > >
> >> > > - On Apr 13, 2021, at 12:22 PM, Eric Dumazet 
> >> > > eric.duma...@gmail.com wrote:
> >> > >
> >> > > > From: Eric Dumazet 
> >> > > >
> >> > > > Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
> >> > > > update includes") added regressions for our servers.
> >> > > >
> >> > > > Using copy_from_user() and clear_user() for 64bit values
> >> > > > is suboptimal.
> >> > > >
> >> > > > We can use faster put_user() and get_user().
> >> > > >
> >> > > > 32bit arches can be changed to use the ptr32 field,
> >> > > > since the padding field must always be zero.
> >> > > >
> >> > > > v2: added ideas from Peter and Mathieu about making this
> >> > > >generic, since my initial patch was only dealing with
> >> > > >64bit arches.
> >> > >
> >> > > Ah, now I remember the reason why reading and clearing the entire 
> >> > > 64-bit
> >> > > is important: it's because we don't want to allow user-space processes 
> >> > > to
> >> > > use this change in behavior to figure out whether they are running on a
> >> > > 32-bit or in a 32-bit compat mode on a 64-bit kernel.
> >> > >
> >> > > So although I'm fine with making 64-bit kernels faster, we'll want to 
> >> > > keep
> >> > > updating the entire 64-bit ptr field on 32-bit kernels as well.
> >> > >
> >> > > Thanks,
> >> > >
> >> >
> >> > So... back to V1 then ?
> >>
> >> Or add more stuff as in :
> >
> > diff against v2, WDYT ?
>
> I like this approach slightly better, because it moves the preprocessor 
> ifdefs into
> rseq_get_rseq_cs and clear_rseq_cs, while keeping the same behavior for a 
> 32-bit
> process running on native 32-bit kernel and as compat task on a 64-bit kernel.
>
> That being said, I don't expect anyone to care much about performance of 
> 32-bit
> kernels, so we could use copy_from_user() on 32-bit kernels to remove 
> special-cases
> in 32-bit specific code. This would eliminate the 32-bit specific "padding" 
> read, and
> let the TASK_SIZE comparison handle the check for both 32-bit and 64-bit 
> kernels.
>
> As for clear_user(), I wonder whether we could simply keep using it, but 
> change the
> clear_user() macro to figure out that it can use a faster 8-byte put_user ? I 
> find it
> odd that performance optimizations which would be relevant elsewhere creep 
> into the
> rseq code.


clear_user() is a maze of arch-dependent macros/functions/assembly

I guess the same could be said of copy_in_user(), but apparently we removed
special-casing, like in commit a41e0d754240fe8ca9c4f2070bf67e3b0228aa22

Definitely it seems odd having to carefully choose between multiple methods.


>
> Thanks,
>
> Mathieu
>
> >
> > diff --git a/kernel/rseq.c b/kernel/rseq.c
> > index
> > f2eee3f7f5d330688c81cb2e57d47ca6b843873e..537b1f684efa11069990018ffa3642c209993011
> > 100644
> > --- a/kernel/rseq.c
> > +++ b/kernel/rseq.c
> > @@ -136,6 +136,10 @@ static int rseq_get_cs_ptr(struct rseq_cs __user 
> > **uptrp,
> > {
> >u32 ptr;
> >
> > +   if (get_user(ptr, &rseq->rseq_cs.ptr.padding))
> > +   return -EFAULT;
> > +   if (ptr)
> > +   return -EINVAL;
> >if (get_user(ptr, &rseq->rseq_cs.ptr.ptr32))
> >return -EFAULT;
> >*uptrp = (struct rseq_cs __user *)ptr;
> > @@ -150,8 +154,9 @@ static int rseq_get_rseq_cs(struct task_struct *t,
> > struct rseq_cs *rseq_cs)
> >u32 sig;
> >int ret;
> >
> > -   if (rseq_get_cs_ptr(&urseq_cs, t->rseq))
> > -   return -EFAULT;
> > +   ret = rseq_get_cs_ptr(&urseq_cs, t->rseq);
> > +   if (ret)
> > +   return ret;
> >if (!urseq_cs) {
> >memset(rseq_cs, 0, sizeof(*rseq_cs));
> >return 0;
> > @@ -237,7 +242,8 @@ static int clear_rseq_cs(struct task_struct *t)
> > #ifdef CONFIG_64BIT
> >return put_user(0UL, &t->rseq->rseq_cs.ptr64);
> > #else
> > -   return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32);
> > +   return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32) |
> > +  put_user(0UL, &t->rseq->rseq_cs.ptr.padding);
> > #endif
> >  }
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com


Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

2021-04-13 Thread Eric Dumazet
On Tue, Apr 13, 2021 at 7:01 PM Eric Dumazet  wrote:
>
> On Tue, Apr 13, 2021 at 6:57 PM Eric Dumazet  wrote:
> >
> > On Tue, Apr 13, 2021 at 6:54 PM Mathieu Desnoyers
> >  wrote:
> > >
> > > - On Apr 13, 2021, at 12:22 PM, Eric Dumazet eric.duma...@gmail.com 
> > > wrote:
> > >
> > > > From: Eric Dumazet 
> > > >
> > > > Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
> > > > update includes") added regressions for our servers.
> > > >
> > > > Using copy_from_user() and clear_user() for 64bit values
> > > > is suboptimal.
> > > >
> > > > We can use faster put_user() and get_user().
> > > >
> > > > 32bit arches can be changed to use the ptr32 field,
> > > > since the padding field must always be zero.
> > > >
> > > > v2: added ideas from Peter and Mathieu about making this
> > > >generic, since my initial patch was only dealing with
> > > >64bit arches.
> > >
> > > Ah, now I remember the reason why reading and clearing the entire 64-bit
> > > is important: it's because we don't want to allow user-space processes to
> > > use this change in behavior to figure out whether they are running on a
> > > 32-bit or in a 32-bit compat mode on a 64-bit kernel.
> > >
> > > So although I'm fine with making 64-bit kernels faster, we'll want to keep
> > > updating the entire 64-bit ptr field on 32-bit kernels as well.
> > >
> > > Thanks,
> > >
> >
> > So... back to V1 then ?
>
> Or add more stuff as in :

diff against v2, WDYT ?

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 
f2eee3f7f5d330688c81cb2e57d47ca6b843873e..537b1f684efa11069990018ffa3642c209993011
100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -136,6 +136,10 @@ static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp,
 {
u32 ptr;

+   if (get_user(ptr, &rseq->rseq_cs.ptr.padding))
+   return -EFAULT;
+   if (ptr)
+   return -EINVAL;
if (get_user(ptr, &rseq->rseq_cs.ptr.ptr32))
return -EFAULT;
*uptrp = (struct rseq_cs __user *)ptr;
@@ -150,8 +154,9 @@ static int rseq_get_rseq_cs(struct task_struct *t,
struct rseq_cs *rseq_cs)
u32 sig;
int ret;

-   if (rseq_get_cs_ptr(&urseq_cs, t->rseq))
-   return -EFAULT;
+   ret = rseq_get_cs_ptr(&urseq_cs, t->rseq);
+   if (ret)
+   return ret;
if (!urseq_cs) {
memset(rseq_cs, 0, sizeof(*rseq_cs));
return 0;
@@ -237,7 +242,8 @@ static int clear_rseq_cs(struct task_struct *t)
 #ifdef CONFIG_64BIT
return put_user(0UL, &t->rseq->rseq_cs.ptr64);
 #else
-   return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32);
+   return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32) |
+  put_user(0UL, &t->rseq->rseq_cs.ptr.padding);
 #endif
 }


Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

2021-04-13 Thread Eric Dumazet
On Tue, Apr 13, 2021 at 6:57 PM Eric Dumazet  wrote:
>
> On Tue, Apr 13, 2021 at 6:54 PM Mathieu Desnoyers
>  wrote:
> >
> > - On Apr 13, 2021, at 12:22 PM, Eric Dumazet eric.duma...@gmail.com 
> > wrote:
> >
> > > From: Eric Dumazet 
> > >
> > > Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
> > > update includes") added regressions for our servers.
> > >
> > > Using copy_from_user() and clear_user() for 64bit values
> > > is suboptimal.
> > >
> > > We can use faster put_user() and get_user().
> > >
> > > 32bit arches can be changed to use the ptr32 field,
> > > since the padding field must always be zero.
> > >
> > > v2: added ideas from Peter and Mathieu about making this
> > >generic, since my initial patch was only dealing with
> > >64bit arches.
> >
> > Ah, now I remember the reason why reading and clearing the entire 64-bit
> > is important: it's because we don't want to allow user-space processes to
> > use this change in behavior to figure out whether they are running on a
> > 32-bit or in a 32-bit compat mode on a 64-bit kernel.
> >
> > So although I'm fine with making 64-bit kernels faster, we'll want to keep
> > updating the entire 64-bit ptr field on 32-bit kernels as well.
> >
> > Thanks,
> >
>
> So... back to V1 then ?

Or add more stuff as in :

#ifdef CONFIG_64BIT
+   return put_user(0UL, &t->rseq->rseq_cs.ptr64);
+#else
+   return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32) |
+  put_user(0UL, &t->rseq->rseq_cs.ptr.padding);
+#endif


Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

2021-04-13 Thread Eric Dumazet
On Tue, Apr 13, 2021 at 6:54 PM Mathieu Desnoyers
 wrote:
>
> - On Apr 13, 2021, at 12:22 PM, Eric Dumazet eric.duma...@gmail.com wrote:
>
> > From: Eric Dumazet 
> >
> > Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
> > update includes") added regressions for our servers.
> >
> > Using copy_from_user() and clear_user() for 64bit values
> > is suboptimal.
> >
> > We can use faster put_user() and get_user().
> >
> > 32bit arches can be changed to use the ptr32 field,
> > since the padding field must always be zero.
> >
> > v2: added ideas from Peter and Mathieu about making this
> >generic, since my initial patch was only dealing with
> >64bit arches.
>
> Ah, now I remember the reason why reading and clearing the entire 64-bit
> is important: it's because we don't want to allow user-space processes to
> use this change in behavior to figure out whether they are running on a
> 32-bit or in a 32-bit compat mode on a 64-bit kernel.
>
> So although I'm fine with making 64-bit kernels faster, we'll want to keep
> updating the entire 64-bit ptr field on 32-bit kernels as well.
>
> Thanks,
>

So... back to V1 then ?


[PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
update includes") added regressions for our servers.

Using copy_from_user() and clear_user() for 64bit values
is suboptimal.

We can use faster put_user() and get_user().

32bit arches can be changed to use the ptr32 field,
since the padding field must always be zero.

v2: added ideas from Peter and Mathieu about making this
generic, since my initial patch was only dealing with
64bit arches.

Signed-off-by: Eric Dumazet 
Cc: Mathieu Desnoyers 
Cc: Peter Zijlstra 
Cc: "Paul E. McKenney" 
Cc: Boqun Feng 
Cc: Arjun Roy 
Cc: Ingo Molnar 
---
 kernel/rseq.c | 41 +++++++++++++++++++++++++++++++++--------
 1 file changed, 33 insertions(+), 8 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 
cfe01ab5253c1c424c0e8b25acbb6a8e1b41a5b6..f2eee3f7f5d330688c81cb2e57d47ca6b843873e
 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -119,23 +119,46 @@ static int rseq_reset_rseq_cpu_id(struct task_struct *t)
return 0;
 }
 
+#ifdef CONFIG_64BIT
+static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp,
+  const struct rseq __user *rseq)
+{
+   u64 ptr;
+
+   if (get_user(ptr, &rseq->rseq_cs.ptr64))
+   return -EFAULT;
+   *uptrp = (struct rseq_cs __user *)ptr;
+   return 0;
+}
+#else
+static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp,
+  const struct rseq __user *rseq)
+{
+   u32 ptr;
+
+   if (get_user(ptr, &rseq->rseq_cs.ptr.ptr32))
+   return -EFAULT;
+   *uptrp = (struct rseq_cs __user *)ptr;
+   return 0;
+}
+#endif
+
 static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
 {
struct rseq_cs __user *urseq_cs;
-   u64 ptr;
u32 __user *usig;
u32 sig;
int ret;
 
-   if (copy_from_user(&ptr, &t->rseq->rseq_cs.ptr64, sizeof(ptr)))
+   if (rseq_get_cs_ptr(&urseq_cs, t->rseq))
return -EFAULT;
-   if (!ptr) {
+   if (!urseq_cs) {
memset(rseq_cs, 0, sizeof(*rseq_cs));
return 0;
}
-   if (ptr >= TASK_SIZE)
+   if ((unsigned long)urseq_cs >= TASK_SIZE)
return -EINVAL;
-   urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr;
+
if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs)))
return -EFAULT;
 
@@ -211,9 +234,11 @@ static int clear_rseq_cs(struct task_struct *t)
 *
 * Set rseq_cs to NULL.
 */
-   if (clear_user(&t->rseq->rseq_cs.ptr64, sizeof(t->rseq->rseq_cs.ptr64)))
-   return -EFAULT;
-   return 0;
+#ifdef CONFIG_64BIT
+   return put_user(0UL, &t->rseq->rseq_cs.ptr64);
+#else
+   return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32);
+#endif
 }
 
 /*
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH v2 2/3] rseq: remove redundant access_ok()

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

After commit 8f2817701492 ("rseq: Use get_user/put_user rather
than __get_user/__put_user") we no longer need
an access_ok() call from __rseq_handle_notify_resume()

Mathieu pointed out the same cleanup can be done
in rseq_syscall().

Signed-off-by: Eric Dumazet 
Cc: Mathieu Desnoyers 
Cc: Peter Zijlstra 
Cc: "Paul E. McKenney" 
Cc: Boqun Feng 
Cc: Arjun Roy 
Cc: Ingo Molnar 
---
 kernel/rseq.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 
f020f18f512a3f6241c3c9b104ce50e4d2c6188c..cfe01ab5253c1c424c0e8b25acbb6a8e1b41a5b6
 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -273,8 +273,6 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, 
struct pt_regs *regs)
 
if (unlikely(t->flags & PF_EXITING))
return;
-   if (unlikely(!access_ok(t->rseq, sizeof(*t->rseq))))
-   goto error;
ret = rseq_ip_fixup(regs);
if (unlikely(ret < 0))
goto error;
@@ -301,8 +299,7 @@ void rseq_syscall(struct pt_regs *regs)
 
if (!t->rseq)
return;
-   if (!access_ok(t->rseq, sizeof(*t->rseq)) ||
-   rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
+   if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
force_sig(SIGSEGV);
 }
 
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH v2 1/3] rseq: optimize rseq_update_cpu_id()

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

Two put_user() in rseq_update_cpu_id() are replaced
by a pair of unsafe_put_user() with appropriate surroundings.

This removes one stac/clac pair on x86 in fast path.

Signed-off-by: Eric Dumazet 
Cc: Mathieu Desnoyers 
Cc: Peter Zijlstra 
Cc: "Paul E. McKenney" 
Cc: Boqun Feng 
Cc: Arjun Roy 
Cc: Ingo Molnar 
---
 kernel/rseq.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 
a4f86a9d6937cdfa2f13d1dcc9be863c1943d06f..f020f18f512a3f6241c3c9b104ce50e4d2c6188c
 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -84,13 +84,20 @@
 static int rseq_update_cpu_id(struct task_struct *t)
 {
u32 cpu_id = raw_smp_processor_id();
+   struct rseq __user *rseq = t->rseq;
 
-   if (put_user(cpu_id, &t->rseq->cpu_id_start))
-   return -EFAULT;
-   if (put_user(cpu_id, &t->rseq->cpu_id))
-   return -EFAULT;
+   if (!user_write_access_begin(rseq, sizeof(*rseq)))
+   goto efault;
+   unsafe_put_user(cpu_id, &rseq->cpu_id_start, efault_end);
+   unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
+   user_write_access_end();
trace_rseq_update(t);
return 0;
+
+efault_end:
+   user_write_access_end();
+efault:
+   return -EFAULT;
 }
 
 static int rseq_reset_rseq_cpu_id(struct task_struct *t)
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH v2 0/3] rseq: minor optimizations

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

rseq is a heavy user of copy to/from user data in fast paths.
This series tries to reduce the cost.

v2: Addressed Peter and Mathieu feedbacks, thanks !

Eric Dumazet (3):
  rseq: optimize rseq_update_cpu_id()
  rseq: remove redundant access_ok()
  rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

 kernel/rseq.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 45 insertions(+), 16 deletions(-)

-- 
2.31.1.295.g9ea45b61b8-goog



Re: [PATCH 1/3] rseq: optimize rseq_update_cpu_id()

2021-04-13 Thread Eric Dumazet
On Tue, Apr 13, 2021 at 4:29 PM Mathieu Desnoyers
 wrote:
>
> - On Apr 13, 2021, at 3:36 AM, Eric Dumazet eric.duma...@gmail.com wrote:
>
> > From: Eric Dumazet 
> >
> > Two put_user() in rseq_update_cpu_id() are replaced
> > by a pair of unsafe_put_user() with appropriate surroundings.
> >
> > This removes one stac/clac pair on x86 in fast path.
> >
> > Signed-off-by: Eric Dumazet 
> > Cc: Mathieu Desnoyers 
> > Cc: Peter Zijlstra 
> > Cc: "Paul E. McKenney" 
> > Cc: Boqun Feng 
> > Cc: Arjun Roy 
> > Cc: Ingo Molnar 
> > ---
> > kernel/rseq.c | 15 +++++++++++----
> > 1 file changed, 11 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/rseq.c b/kernel/rseq.c
> > index
> > a4f86a9d6937cdfa2f13d1dcc9be863c1943d06f..d2689ccbb132c0fc8ec0924008771e5ee1ca855e
> > 100644
> > --- a/kernel/rseq.c
> > +++ b/kernel/rseq.c
> > @@ -84,13 +84,20 @@
> > static int rseq_update_cpu_id(struct task_struct *t)
> > {
> >   u32 cpu_id = raw_smp_processor_id();
> > + struct rseq *r = t->rseq;
>
> AFAIU the variable above should be a struct rseq __user *.
>
> Elsewhere in the file we use "rseq" rather than "r" for struct rseq __user *
> variable name, it would be better to keep the naming consistent across the 
> file
> if possible.

Absolutely, thanks for the feedback.


Re: [PATCH 2/3] rseq: remove redundant access_ok()

2021-04-13 Thread Eric Dumazet
On Tue, Apr 13, 2021 at 4:34 PM Mathieu Desnoyers
 wrote:
>
> - On Apr 13, 2021, at 3:36 AM, Eric Dumazet eric.duma...@gmail.com wrote:
>
> > From: Eric Dumazet 
> >
> > After commit 8f2817701492 ("rseq: Use get_user/put_user rather
> > than __get_user/__put_user") we no longer need
> > an access_ok() call from __rseq_handle_notify_resume()
>
> While we are doing that, should we also remove the access_ok() check in
> rseq_syscall() ? It look like it can also be removed for the exact same
> reason outlined here.

Yes, good idea.

 I was focusing on __rseq_handle_notify_resume() paths but
rseq_syscall() can get the same.

> Thanks,
>
> Mathieu
>
> >
> > Signed-off-by: Eric Dumazet 
> > Cc: Mathieu Desnoyers 
> > Cc: Peter Zijlstra 
> > Cc: "Paul E. McKenney" 
> > Cc: Boqun Feng 
> > Cc: Arjun Roy 
> > Cc: Ingo Molnar 
> > ---
> > kernel/rseq.c | 2 --
> > 1 file changed, 2 deletions(-)
> >
> > diff --git a/kernel/rseq.c b/kernel/rseq.c
> > index
> > d2689ccbb132c0fc8ec0924008771e5ee1ca855e..57344f9abb43905c7dd2b6081205ff508d963e1e
> > 100644
> > --- a/kernel/rseq.c
> > +++ b/kernel/rseq.c
> > @@ -273,8 +273,6 @@ void __rseq_handle_notify_resume(struct ksignal *ksig,
> > struct pt_regs *regs)
> >
> >   if (unlikely(t->flags & PF_EXITING))
> >   return;
> > - if (unlikely(!access_ok(t->rseq, sizeof(*t->rseq))))
> > - goto error;
> >   ret = rseq_ip_fixup(regs);
> >   if (unlikely(ret < 0))
> >   goto error;
> > --
> > 2.31.1.295.g9ea45b61b8-goog
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
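
Why the access_ok() calls are redundant: after commit 8f2817701492 rseq
uses get_user()/put_user(), and those already perform the equivalent
range check before touching user memory. Conceptually (a simplification;
the real definitions are arch-specific):

/* Conceptual model only, not any architecture's actual macro. */
#define get_user(x, ptr)					\
({								\
	access_ok(ptr, sizeof(*(ptr))) ?			\
		__get_user(x, ptr) : -EFAULT;			\
})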


Re: Linux 5.12-rc7

2021-04-13 Thread Eric Dumazet
On Tue, Apr 13, 2021 at 3:38 PM Michael S. Tsirkin  wrote:
>
> On Tue, Apr 13, 2021 at 03:33:40PM +0200, Eric Dumazet wrote:
> > On Tue, Apr 13, 2021 at 3:27 PM Eric Dumazet  wrote:
> > >
> > > On Tue, Apr 13, 2021 at 2:57 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Mon, Apr 12, 2021 at 06:47:07PM +0200, Eric Dumazet wrote:
> > > > > On Mon, Apr 12, 2021 at 6:31 PM Eric Dumazet  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, Apr 12, 2021 at 6:28 PM Linus Torvalds
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Sun, Apr 11, 2021 at 10:14 PM Guenter Roeck 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > Qemu test results:
> > > > > > > > total: 460 pass: 459 fail: 1
> > > > > > > > Failed tests:
> > > > > > > > sh:rts7751r2dplus_defconfig:ata:net,virtio-net:rootfs
> > > > > > > >
> > > > > > > > The failure bisects to commit 0f6925b3e8da ("virtio_net: Do not 
> > > > > > > > pull payload in
> > > > > > > > skb->head"). It is a spurious problem - the test passes roughly 
> > > > > > > > every other
> > > > > > > > time. When the failure is seen, udhcpc fails to get an IP 
> > > > > > > > address and aborts
> > > > > > > > with SIGTERM. So far I have only seen this with the "sh" 
> > > > > > > > architecture.
> > > > > > >
> > > > > > > Hmm. Let's add in some more of the people involved in that 
> > > > > > > commit, and
> > > > > > > also netdev.
> > > > > > >
> > > > > > > Nothing in there looks like it should have any interaction with
> > > > > > > architecture, so that "it happens on sh" sounds odd, but maybe 
> > > > > > > it's
> > > > > > > some particular interaction with the qemu environment.
> > > > > >
> > > > > > Yes, maybe.
> > > > > >
> > > > > > I spent few hours on this, and suspect a buggy memcpy() 
> > > > > > implementation
> > > > > > on SH, but this was not conclusive.
> > > > > >
> > > > > > By pulling one extra byte, the problem goes away.
> > > > > >
> > > > > > Strange thing is that the udhcpc process does not go past sendto().
> > > > >
> > > > > This is the patch working around the issue. Unfortunately I was not
> > > > > able to root-cause it (I really suspect something on SH)
> > > > >
> > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > > index 
> > > > > 0824e6999e49957f7aaf7c990f6259792d42f32b..fd890a951beea03bdf24406809042666eb972655
> > > > > 100644
> > > > > --- a/drivers/net/virtio_net.c
> > > > > +++ b/drivers/net/virtio_net.c
> > > > > @@ -408,11 +408,17 @@ static struct sk_buff *page_to_skb(struct
> > > > > virtnet_info *vi,
> > > > >
> > > > > /* Copy all frame if it fits skb->head, otherwise
> > > > >  * we let virtio_net_hdr_to_skb() and GRO pull headers as 
> > > > > needed.
> > > > > +*
> > > > > +* Apparently, pulling only the Ethernet Header triggers a bug
> > > > > on qemu-system-sh4.
> > > > > +* Since GRO aggregation really cares of IPv4/IPv6, pull 20 
> > > > > bytes
> > > > > +* more to work around this bug : These 20 bytes can not 
> > > > > belong
> > > > > +* to UDP/TCP payload.
> > > > > +* As a bonus, this makes GRO slightly faster for IPv4 (one 
> > > > > less copy).
> > > > >  */
> > > >
> > > > Question: do we still want to do this for performance reasons?
> > > > We also have the hdr_len coming from the device which is
> > > > just skb_headlen on the host.
> > >
> > > Well, putting 20 bytes in skb->head will disable frag0 optimization.
> > >
> > > The change would only benefit to sh architecture :)

Re: Linux 5.12-rc7

2021-04-13 Thread Eric Dumazet
On Tue, Apr 13, 2021 at 3:27 PM Eric Dumazet  wrote:
>
> On Tue, Apr 13, 2021 at 2:57 PM Michael S. Tsirkin  wrote:
> >
> > On Mon, Apr 12, 2021 at 06:47:07PM +0200, Eric Dumazet wrote:
> > > On Mon, Apr 12, 2021 at 6:31 PM Eric Dumazet  wrote:
> > > >
> > > > On Mon, Apr 12, 2021 at 6:28 PM Linus Torvalds
> > > >  wrote:
> > > > >
> > > > > On Sun, Apr 11, 2021 at 10:14 PM Guenter Roeck  
> > > > > wrote:
> > > > > >
> > > > > > Qemu test results:
> > > > > > total: 460 pass: 459 fail: 1
> > > > > > Failed tests:
> > > > > > sh:rts7751r2dplus_defconfig:ata:net,virtio-net:rootfs
> > > > > >
> > > > > > The failure bisects to commit 0f6925b3e8da ("virtio_net: Do not 
> > > > > > pull payload in
> > > > > > skb->head"). It is a spurious problem - the test passes roughly 
> > > > > > every other
> > > > > > time. When the failure is seen, udhcpc fails to get an IP address 
> > > > > > and aborts
> > > > > > with SIGTERM. So far I have only seen this with the "sh" 
> > > > > > architecture.
> > > > >
> > > > > Hmm. Let's add in some more of the people involved in that commit, and
> > > > > also netdev.
> > > > >
> > > > > Nothing in there looks like it should have any interaction with
> > > > > architecture, so that "it happens on sh" sounds odd, but maybe it's
> > > > > some particular interaction with the qemu environment.
> > > >
> > > > Yes, maybe.
> > > >
> > > > I spent few hours on this, and suspect a buggy memcpy() implementation
> > > > on SH, but this was not conclusive.
> > > >
> > > > By pulling one extra byte, the problem goes away.
> > > >
> > > > Strange thing is that the udhcpc process does not go past sendto().
> > >
> > > This is the patch working around the issue. Unfortunately I was not
> > > able to root-cause it (I really suspect something on SH)
> > >
> > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > index 
> > > 0824e6999e49957f7aaf7c990f6259792d42f32b..fd890a951beea03bdf24406809042666eb972655
> > > 100644
> > > --- a/drivers/net/virtio_net.c
> > > +++ b/drivers/net/virtio_net.c
> > > @@ -408,11 +408,17 @@ static struct sk_buff *page_to_skb(struct
> > > virtnet_info *vi,
> > >
> > > /* Copy all frame if it fits skb->head, otherwise
> > >  * we let virtio_net_hdr_to_skb() and GRO pull headers as needed.
> > > +*
> > > +* Apparently, pulling only the Ethernet Header triggers a bug
> > > on qemu-system-sh4.
> > > +* Since GRO aggregation really cares of IPv4/IPv6, pull 20 bytes
> > > +* more to work around this bug : These 20 bytes can not belong
> > > +* to UDP/TCP payload.
> > > +* As a bonus, this makes GRO slightly faster for IPv4 (one less 
> > > copy).
> > >  */
> >
> > Question: do we still want to do this for performance reasons?
> > We also have the hdr_len coming from the device which is
> > just skb_headlen on the host.
>
> Well, putting 20 bytes in skb->head will disable frag0 optimization.
>
> The change would only benefit to sh architecture :)
>
> About hdr_len, I suppose we could try it, with appropriate safety checks.

I have added traces; hdr_len seems to be 0 with the qemu-system-sh4 I am using.

Have I understood you correctly ?

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 
0824e6999e49957f7aaf7c990f6259792d42f32b..f024860f7dc260d4efbc35a3b8ffd358bd0da894
100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -399,9 +399,10 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
hdr_padded_len = sizeof(struct padded_vnet_hdr);

/* hdr_valid means no XDP, so we can copy the vnet header */
-   if (hdr_valid)
+   if (hdr_valid) {
memcpy(hdr, p, hdr_len);
-
+   pr_err("hdr->hdr_len=%u\n", hdr->hdr.hdr_len);
+   }
len -= hdr_len;
offset += hdr_padded_len;
p += hdr_padded_len;
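
A sketch of what using the device-provided hdr_len with safety checks
might look like in page_to_skb() (hypothetical; the bounds are chosen so
a buggy or hostile device cannot make the kernel copy past the received
data or the skb tailroom):

	/* Hypothetical: trust the device's hdr_len only within hard bounds. */
	unsigned int dev_hdr_len = hdr_valid ?
		virtio16_to_cpu(vi->vdev, hdr->hdr.hdr_len) : 0;

	if (dev_hdr_len && dev_hdr_len <= len &&
	    dev_hdr_len <= skb_tailroom(skb))
		copy = dev_hdr_len;
	else if (len <= skb_tailroom(skb))
		copy = len;
	else
		copy = ETH_HLEN + metasize;
	skb_put_data(skb, p, copy);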


Re: Linux 5.12-rc7

2021-04-13 Thread Eric Dumazet
On Tue, Apr 13, 2021 at 2:57 PM Michael S. Tsirkin  wrote:
>
> On Mon, Apr 12, 2021 at 06:47:07PM +0200, Eric Dumazet wrote:
> > On Mon, Apr 12, 2021 at 6:31 PM Eric Dumazet  wrote:
> > >
> > > On Mon, Apr 12, 2021 at 6:28 PM Linus Torvalds
> > >  wrote:
> > > >
> > > > On Sun, Apr 11, 2021 at 10:14 PM Guenter Roeck  
> > > > wrote:
> > > > >
> > > > > Qemu test results:
> > > > > total: 460 pass: 459 fail: 1
> > > > > Failed tests:
> > > > > sh:rts7751r2dplus_defconfig:ata:net,virtio-net:rootfs
> > > > >
> > > > > The failure bisects to commit 0f6925b3e8da ("virtio_net: Do not pull 
> > > > > payload in
> > > > > skb->head"). It is a spurious problem - the test passes roughly every 
> > > > > other
> > > > > time. When the failure is seen, udhcpc fails to get an IP address and 
> > > > > aborts
> > > > > with SIGTERM. So far I have only seen this with the "sh" architecture.
> > > >
> > > > Hmm. Let's add in some more of the people involved in that commit, and
> > > > also netdev.
> > > >
> > > > Nothing in there looks like it should have any interaction with
> > > > architecture, so that "it happens on sh" sounds odd, but maybe it's
> > > > some particular interaction with the qemu environment.
> > >
> > > Yes, maybe.
> > >
> > > I spent few hours on this, and suspect a buggy memcpy() implementation
> > > on SH, but this was not conclusive.
> > >
> > > By pulling one extra byte, the problem goes away.
> > >
> > > Strange thing is that the udhcpc process does not go past sendto().
> >
> > This is the patch working around the issue. Unfortunately I was not
> > able to root-cause it (I really suspect something on SH)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 
> > 0824e6999e49957f7aaf7c990f6259792d42f32b..fd890a951beea03bdf24406809042666eb972655
> > 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -408,11 +408,17 @@ static struct sk_buff *page_to_skb(struct
> > virtnet_info *vi,
> >
> > /* Copy all frame if it fits skb->head, otherwise
> >  * we let virtio_net_hdr_to_skb() and GRO pull headers as needed.
> > +*
> > +* Apparently, pulling only the Ethernet Header triggers a bug
> > on qemu-system-sh4.
> > +* Since GRO aggregation really cares of IPv4/IPv6, pull 20 bytes
> > +* more to work around this bug : These 20 bytes can not belong
> > +* to UDP/TCP payload.
> > +* As a bonus, this makes GRO slightly faster for IPv4 (one less 
> > copy).
> >  */
>
> Question: do we still want to do this for performance reasons?
> We also have the hdr_len coming from the device which is
> just skb_headlen on the host.

Well, putting 20 bytes in skb->head will disable frag0 optimization.

The change would only benefit the sh architecture :)

About hdr_len, I suppose we could try it, with appropriate safety checks.

>
> > if (len <= skb_tailroom(skb))
> > copy = len;
> > else
> > -   copy = ETH_HLEN + metasize;
> > +   copy = ETH_HLEN + sizeof(struct iphdr) + metasize;
> > skb_put_data(skb, p, copy);
> >
> > if (metasize) {
>


Re: Linux 5.12-rc7

2021-04-13 Thread Eric Dumazet
On Tue, Apr 13, 2021 at 12:43 PM Eric Dumazet  wrote:
>
> On Tue, Apr 13, 2021 at 11:24 AM Eric Dumazet  wrote:
> >
> > On Mon, Apr 12, 2021 at 10:05 PM Guenter Roeck  wrote:
> > >
> > > On 4/12/21 10:38 AM, Eric Dumazet wrote:
> > > [ ... ]
> > >
> > > > Yes, I think this is the real issue here. This smells like some memory
> > > > corruption.
> > > >
> > > > In my traces, packet is correctly received in AF_PACKET queue.
> > > >
> > > > I have checked the skb is well formed.
> > > >
> > > > But the user space seems to never call poll() and recvmsg() on this
> > > > af_packet socket.
> > > >
> > >
> > > After sprinkling the kernel with debug messages:
> > >
> > > 424   00:01:33.674181 sendto(6, 
> > > "E\0\1H\0\0\0\0@\21y\246\0\0\0\0\377\377\377\377\0D\0C\00148\346\1\1\6\0\246\336\333\v\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0RT\0\
> > > 424   00:01:33.693873 close(6)  = 0
> > > 424   00:01:33.694652 fcntl64(5, F_SETFD, FD_CLOEXEC) = 0
> > > 424   00:01:33.695213 clock_gettime64(CLOCK_MONOTONIC, 0x7be18a18) = -1 
> > > EFAULT (Bad address)
> > > 424   00:01:33.695889 write(2, "udhcpc: clock_gettime(MONOTONIC) 
> > > failed\n", 40) = -1 EFAULT (Bad address)
> > > 424   00:01:33.697311 exit_group(1) = ?
> > > 424   00:01:33.698346 +++ exited with 1 +++
> > >
> > > I only see that after adding debug messages in the kernel, so I guess 
> > > there must be
> > > a heisenbug somewhere.
> > >
> > > Anyway, indeed, I see (another kernel debug message):
> > >
> > > __do_sys_clock_gettime: Returning -EFAULT on address 0x7bacc9a8
> > >
> > > So udhcpc doesn't even try to read the reply because it crashes after 
> > > sendto()
> > > when trying to read the current time. Unless I am missing something, that 
> > > means
> > > that the problem happens somewhere on the send side.
> > >
> > > To make things even more interesting, it looks like the failing system 
> > > call
> > > isn't always clock_gettime().
> > >
> > > Guenter
> >
> >
> > I think GRO fast path has never worked on SUPERH. Probably SUPERH has
> > never used a fast NIC (10Gbit+)
> >
> > The following hack fixes the issue.
> >
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 
> > af8c1ea040b9364b076e2d72f04dc3de2d7e2f11..91ba89a645ff91d4cd4f3d8dc8a009bcb67da344
> > 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -5916,13 +5916,16 @@ static struct list_head
> > *gro_list_prepare(struct napi_struct *napi,
> >
> >  static void skb_gro_reset_offset(struct sk_buff *skb)
> >  {
> > +#if !defined(CONFIG_SUPERH)
> > const struct skb_shared_info *pinfo = skb_shinfo(skb);
> > const skb_frag_t *frag0 = &pinfo->frags[0];
> > +#endif
> >
> > NAPI_GRO_CB(skb)->data_offset = 0;
> > NAPI_GRO_CB(skb)->frag0 = NULL;
> > NAPI_GRO_CB(skb)->frag0_len = 0;
> >
> > +#if !defined(CONFIG_SUPERH)
> > if (!skb_headlen(skb) && pinfo->nr_frags &&
> > !PageHighMem(skb_frag_page(frag0))) {
> > NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
> > @@ -5930,6 +5933,7 @@ static void skb_gro_reset_offset(struct sk_buff *skb)
> > skb_frag_size(frag0),
> > skb->end - skb->tail);
> > }
> > +#endif
> >  }
> >
> >  static void gro_pull_from_frag0(struct sk_buff *skb, int grow)
>
> OK ... more sh debugging :
>
> diff --git a/arch/sh/mm/alignment.c b/arch/sh/mm/alignment.c
> index 
> fb517b82a87b1065cf38c06cb3c178ce86587b00..5d18f9f792991105a8aa05cc6231b7d4532d72c9
> 100644
> --- a/arch/sh/mm/alignment.c
> +++ b/arch/sh/mm/alignment.c
> @@ -27,7 +27,7 @@ static unsigned long se_multi;
> valid! */
>  static int se_usermode = UM_WARN | UM_FIXUP;
>  /* 0: no warning 1: print a warning message, disabled by default */
> -static int se_kernmode_warn;
> +static int se_kernmode_warn = 1;
>
>  core_param(alignment, se_usermode, int, 0600);
>
> @@ -103,7 +103,7 @@ void unaligned_fixups_notify(struct task_struct
> *tsk, insn_size_t insn,
>   (void *)instruction_pointer(regs), insn);
> else if (se_kernmode_warn)
> 

Re: Linux 5.12-rc7

2021-04-13 Thread Eric Dumazet
On Tue, Apr 13, 2021 at 11:24 AM Eric Dumazet  wrote:
>
> On Mon, Apr 12, 2021 at 10:05 PM Guenter Roeck  wrote:
> >
> > On 4/12/21 10:38 AM, Eric Dumazet wrote:
> > [ ... ]
> >
> > > Yes, I think this is the real issue here. This smells like some memory
> > > corruption.
> > >
> > > In my traces, packet is correctly received in AF_PACKET queue.
> > >
> > > I have checked the skb is well formed.
> > >
> > > But the user space seems to never call poll() and recvmsg() on this
> > > af_packet socket.
> > >
> >
> > After sprinkling the kernel with debug messages:
> >
> > 424   00:01:33.674181 sendto(6, 
> > "E\0\1H\0\0\0\0@\21y\246\0\0\0\0\377\377\377\377\0D\0C\00148\346\1\1\6\0\246\336\333\v\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0RT\0\
> > 424   00:01:33.693873 close(6)  = 0
> > 424   00:01:33.694652 fcntl64(5, F_SETFD, FD_CLOEXEC) = 0
> > 424   00:01:33.695213 clock_gettime64(CLOCK_MONOTONIC, 0x7be18a18) = -1 
> > EFAULT (Bad address)
> > 424   00:01:33.695889 write(2, "udhcpc: clock_gettime(MONOTONIC) failed\n", 
> > 40) = -1 EFAULT (Bad address)
> > 424   00:01:33.697311 exit_group(1) = ?
> > 424   00:01:33.698346 +++ exited with 1 +++
> >
> > I only see that after adding debug messages in the kernel, so I guess there 
> > must be
> > a heisenbug somewhere.
> >
> > Anyway, indeed, I see (another kernel debug message):
> >
> > __do_sys_clock_gettime: Returning -EFAULT on address 0x7bacc9a8
> >
> > So udhcpc doesn't even try to read the reply because it crashes after 
> > sendto()
> > when trying to read the current time. Unless I am missing something, that 
> > means
> > that the problem happens somewhere on the send side.
> >
> > To make things even more interesting, it looks like the failing system call
> > isn't always clock_gettime().
> >
> > Guenter
>
>
> I think GRO fast path has never worked on SUPERH. Probably SUPERH has
> never used a fast NIC (10Gbit+)
>
> The following hack fixes the issue.
>
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 
> af8c1ea040b9364b076e2d72f04dc3de2d7e2f11..91ba89a645ff91d4cd4f3d8dc8a009bcb67da344
> 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5916,13 +5916,16 @@ static struct list_head
> *gro_list_prepare(struct napi_struct *napi,
>
>  static void skb_gro_reset_offset(struct sk_buff *skb)
>  {
> +#if !defined(CONFIG_SUPERH)
> const struct skb_shared_info *pinfo = skb_shinfo(skb);
> const skb_frag_t *frag0 = &pinfo->frags[0];
> +#endif
>
> NAPI_GRO_CB(skb)->data_offset = 0;
> NAPI_GRO_CB(skb)->frag0 = NULL;
> NAPI_GRO_CB(skb)->frag0_len = 0;
>
> +#if !defined(CONFIG_SUPERH)
> if (!skb_headlen(skb) && pinfo->nr_frags &&
> !PageHighMem(skb_frag_page(frag0))) {
> NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
> @@ -5930,6 +5933,7 @@ static void skb_gro_reset_offset(struct sk_buff *skb)
> skb_frag_size(frag0),
> skb->end - skb->tail);
> }
> +#endif
>  }
>
>  static void gro_pull_from_frag0(struct sk_buff *skb, int grow)

OK ... more sh debugging :

diff --git a/arch/sh/mm/alignment.c b/arch/sh/mm/alignment.c
index 
fb517b82a87b1065cf38c06cb3c178ce86587b00..5d18f9f792991105a8aa05cc6231b7d4532d72c9
100644
--- a/arch/sh/mm/alignment.c
+++ b/arch/sh/mm/alignment.c
@@ -27,7 +27,7 @@ static unsigned long se_multi;
valid! */
 static int se_usermode = UM_WARN | UM_FIXUP;
 /* 0: no warning 1: print a warning message, disabled by default */
-static int se_kernmode_warn;
+static int se_kernmode_warn = 1;

 core_param(alignment, se_usermode, int, 0600);

@@ -103,7 +103,7 @@ void unaligned_fixups_notify(struct task_struct
*tsk, insn_size_t insn,
  (void *)instruction_pointer(regs), insn);
else if (se_kernmode_warn)
pr_notice_ratelimited("Fixing up unaligned kernel access "
- "in \"%s\" pid=%d pc=0x%p ins=0x%04hx\n",
+ "in \"%s\" pid=%d pc=%px ins=0x%04hx\n",
  tsk->comm, task_pid_nr(tsk),
  (void *)instruction_pointer(regs), insn);
 }

I now see something of interest :
Fixing up unaligned kernel access in "udhcpc" pid=91 pc=8c43fc2e ins=0x6236
Fixing up unaligned kernel access in "udhcpc" pid=91 pc=8c43fc2e ins=0x6236
Fixing up unaligned kernel access in "udhcpc"

Re: Linux 5.12-rc7

2021-04-13 Thread Eric Dumazet
On Mon, Apr 12, 2021 at 10:05 PM Guenter Roeck  wrote:
>
> On 4/12/21 10:38 AM, Eric Dumazet wrote:
> [ ... ]
>
> > Yes, I think this is the real issue here. This smells like some memory
> > corruption.
> >
> > In my traces, packet is correctly received in AF_PACKET queue.
> >
> > I have checked the skb is well formed.
> >
> > But the user space seems to never call poll() and recvmsg() on this
> > af_packet socket.
> >
>
> After sprinkling the kernel with debug messages:
>
> 424   00:01:33.674181 sendto(6, 
> "E\0\1H\0\0\0\0@\21y\246\0\0\0\0\377\377\377\377\0D\0C\00148\346\1\1\6\0\246\336\333\v\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0RT\0\
> 424   00:01:33.693873 close(6)  = 0
> 424   00:01:33.694652 fcntl64(5, F_SETFD, FD_CLOEXEC) = 0
> 424   00:01:33.695213 clock_gettime64(CLOCK_MONOTONIC, 0x7be18a18) = -1 
> EFAULT (Bad address)
> 424   00:01:33.695889 write(2, "udhcpc: clock_gettime(MONOTONIC) failed\n", 
> 40) = -1 EFAULT (Bad address)
> 424   00:01:33.697311 exit_group(1) = ?
> 424   00:01:33.698346 +++ exited with 1 +++
>
> I only see that after adding debug messages in the kernel, so I guess there 
> must be
> a heisenbug somewhere.
>
> Anyway, indeed, I see (another kernel debug message):
>
> __do_sys_clock_gettime: Returning -EFAULT on address 0x7bacc9a8
>
> So udhcpc doesn't even try to read the reply because it crashes after sendto()
> when trying to read the current time. Unless I am missing something, that 
> means
> that the problem happens somewhere on the send side.
>
> To make things even more interesting, it looks like the failing system call
> isn't always clock_gettime().
>
> Guenter


I think GRO fast path has never worked on SUPERH. Probably SUPERH has
never used a fast NIC (10Gbit+)

The following hack fixes the issue.


diff --git a/net/core/dev.c b/net/core/dev.c
index 
af8c1ea040b9364b076e2d72f04dc3de2d7e2f11..91ba89a645ff91d4cd4f3d8dc8a009bcb67da344
100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5916,13 +5916,16 @@ static struct list_head
*gro_list_prepare(struct napi_struct *napi,

 static void skb_gro_reset_offset(struct sk_buff *skb)
 {
+#if !defined(CONFIG_SUPERH)
const struct skb_shared_info *pinfo = skb_shinfo(skb);
const skb_frag_t *frag0 = &pinfo->frags[0];
+#endif

NAPI_GRO_CB(skb)->data_offset = 0;
NAPI_GRO_CB(skb)->frag0 = NULL;
NAPI_GRO_CB(skb)->frag0_len = 0;

+#if !defined(CONFIG_SUPERH)
if (!skb_headlen(skb) && pinfo->nr_frags &&
!PageHighMem(skb_frag_page(frag0))) {
NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
@@ -5930,6 +5933,7 @@ static void skb_gro_reset_offset(struct sk_buff *skb)
skb_frag_size(frag0),
skb->end - skb->tail);
}
+#endif
 }

 static void gro_pull_from_frag0(struct sk_buff *skb, int grow)
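
Given the unaligned-access fixups seen above, a less invasive option
than compiling the fast path out per-arch would be to gate frag0 on the
fragment's alignment, so arches with a nonzero NET_IP_ALIGN (like sh)
only use it when the IP header ends up 4-byte aligned. A sketch, not a
tested fix:

static void skb_gro_reset_offset(struct sk_buff *skb)
{
	const struct skb_shared_info *pinfo = skb_shinfo(skb);
	const skb_frag_t *frag0 = &pinfo->frags[0];

	NAPI_GRO_CB(skb)->data_offset = 0;
	NAPI_GRO_CB(skb)->frag0 = NULL;
	NAPI_GRO_CB(skb)->frag0_len = 0;

	if (!skb_headlen(skb) && pinfo->nr_frags &&
	    !PageHighMem(skb_frag_page(frag0)) &&
	    (!NET_IP_ALIGN || !(skb_frag_off(frag0) & 3))) {
		NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
		NAPI_GRO_CB(skb)->frag0_len = min_t(unsigned int,
						    skb_frag_size(frag0),
						    skb->end - skb->tail);
	}
}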


[PATCH 3/3] rseq: optimise for 64bit arches

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
update includes") added regressions for our servers.

Using copy_from_user() and clear_user() for 64bit values
on 64bit arches is suboptimal.

We might revisit this patch once all 32bit arches support
get_user() and/or put_user() for 8 bytes values.

Signed-off-by: Eric Dumazet 
Cc: Mathieu Desnoyers 
Cc: Peter Zijlstra 
Cc: "Paul E. McKenney" 
Cc: Boqun Feng 
Cc: Arjun Roy 
Cc: Ingo Molnar 
---
 kernel/rseq.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 
57344f9abb43905c7dd2b6081205ff508d963e1e..18a75a804008d2f564d1f7789f09216f1a8760bd
 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -127,8 +127,13 @@ static int rseq_get_rseq_cs(struct task_struct *t, struct 
rseq_cs *rseq_cs)
u32 sig;
int ret;
 
+#ifdef CONFIG_64BIT
+   if (get_user(ptr, &t->rseq->rseq_cs.ptr64))
+   return -EFAULT;
+#else
if (copy_from_user(&ptr, &t->rseq->rseq_cs.ptr64, sizeof(ptr)))
return -EFAULT;
+#endif
if (!ptr) {
memset(rseq_cs, 0, sizeof(*rseq_cs));
return 0;
@@ -211,9 +216,13 @@ static int clear_rseq_cs(struct task_struct *t)
 *
 * Set rseq_cs to NULL.
 */
+#ifdef CONFIG_64BIT
+   return put_user(0ULL, &t->rseq->rseq_cs.ptr64);
+#else
if (clear_user(&t->rseq->rseq_cs.ptr64, sizeof(t->rseq->rseq_cs.ptr64)))
return -EFAULT;
return 0;
+#endif
 }
 
 /*
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH 2/3] rseq: remove redundant access_ok()

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

After commit 8f2817701492 ("rseq: Use get_user/put_user rather
than __get_user/__put_user") we no longer need
an access_ok() call from __rseq_handle_notify_resume()

Signed-off-by: Eric Dumazet 
Cc: Mathieu Desnoyers 
Cc: Peter Zijlstra 
Cc: "Paul E. McKenney" 
Cc: Boqun Feng 
Cc: Arjun Roy 
Cc: Ingo Molnar 
---
 kernel/rseq.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 
d2689ccbb132c0fc8ec0924008771e5ee1ca855e..57344f9abb43905c7dd2b6081205ff508d963e1e
 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -273,8 +273,6 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, 
struct pt_regs *regs)
 
if (unlikely(t->flags & PF_EXITING))
return;
-   if (unlikely(!access_ok(t->rseq, sizeof(*t->rseq))))
-   goto error;
ret = rseq_ip_fixup(regs);
if (unlikely(ret < 0))
goto error;
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH 1/3] rseq: optimize rseq_update_cpu_id()

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

Two put_user() in rseq_update_cpu_id() are replaced
by a pair of unsafe_put_user() with appropriate surroundings.

This removes one stac/clac pair on x86 in fast path.

Signed-off-by: Eric Dumazet 
Cc: Mathieu Desnoyers 
Cc: Peter Zijlstra 
Cc: "Paul E. McKenney" 
Cc: Boqun Feng 
Cc: Arjun Roy 
Cc: Ingo Molnar 
---
 kernel/rseq.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 
a4f86a9d6937cdfa2f13d1dcc9be863c1943d06f..d2689ccbb132c0fc8ec0924008771e5ee1ca855e
 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -84,13 +84,20 @@
 static int rseq_update_cpu_id(struct task_struct *t)
 {
u32 cpu_id = raw_smp_processor_id();
+   struct rseq *r = t->rseq;
 
-   if (put_user(cpu_id, &t->rseq->cpu_id_start))
-   return -EFAULT;
-   if (put_user(cpu_id, &t->rseq->cpu_id))
-   return -EFAULT;
+   if (!user_write_access_begin(r, sizeof(*r)))
+   goto efault;
+   unsafe_put_user(cpu_id, &r->cpu_id_start, efault_end);
+   unsafe_put_user(cpu_id, &r->cpu_id, efault_end);
+   user_write_access_end();
trace_rseq_update(t);
return 0;
+
+efault_end:
+   user_write_access_end();
+efault:
+   return -EFAULT;
 }
 
 static int rseq_reset_rseq_cpu_id(struct task_struct *t)
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH 0/3] rseq: minor optimizations

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

rseq is a heavy user of copy to/from user data in fast paths.
This series tries to reduce the cost.

Eric Dumazet (3):
  rseq: optimize rseq_update_cpu_id()
  rseq: remove redundant access_ok()
  rseq: optimise for 64bit arches

 kernel/rseq.c | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

-- 
2.31.1.295.g9ea45b61b8-goog



Re: Linux 5.12-rc7

2021-04-12 Thread Eric Dumazet
On Mon, Apr 12, 2021 at 7:31 PM Guenter Roeck  wrote:
>
> On 4/12/21 9:31 AM, Eric Dumazet wrote:
> > On Mon, Apr 12, 2021 at 6:28 PM Linus Torvalds
> >  wrote:
> >>
> >> On Sun, Apr 11, 2021 at 10:14 PM Guenter Roeck  wrote:
> >>>
> >>> Qemu test results:
> >>> total: 460 pass: 459 fail: 1
> >>> Failed tests:
> >>> sh:rts7751r2dplus_defconfig:ata:net,virtio-net:rootfs
> >>>
> >>> The failure bisects to commit 0f6925b3e8da ("virtio_net: Do not pull 
> >>> payload in
> >>> skb->head"). It is a spurious problem - the test passes roughly every 
> >>> other
> >>> time. When the failure is seen, udhcpc fails to get an IP address and 
> >>> aborts
> >>> with SIGTERM. So far I have only seen this with the "sh" architecture.
> >>
> >> Hmm. Let's add in some more of the people involved in that commit, and
> >> also netdev.
> >>
> >> Nothing in there looks like it should have any interaction with
> >> architecture, so that "it happens on sh" sounds odd, but maybe it's
> >> some particular interaction with the qemu environment.
> >
> > Yes, maybe.
> >
> > I spent few hours on this, and suspect a buggy memcpy() implementation
> > on SH, but this was not conclusive.
> >
>
> I replaced all memcpy() calls in skbuff.h with calls to
>
> static inline void __my_memcpy(unsigned char *to, const unsigned char *from,
>unsigned int len)
> {
>while (len--)
>*to++ = *from++;
> }
>
> That made no difference, so unless you have some other memcpy() in mind that
> seems to be unlikely.


Sure, note I also had :

diff --git a/net/core/dev.c b/net/core/dev.c
index 
af8c1ea040b9364b076e2d72f04dc3de2d7e2f11..4e05a32542dd606ee8038017fea949939c0e
100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5938,7 +5938,7 @@ static void gro_pull_from_frag0(struct sk_buff
*skb, int grow)

BUG_ON(skb->end - skb->tail < grow);

-   memcpy(skb_tail_pointer(skb), NAPI_GRO_CB(skb)->frag0, grow);
+   memmove(skb_tail_pointer(skb), NAPI_GRO_CB(skb)->frag0, grow);

skb->data_len -= grow;
skb->tail += grow;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 
c421c8f809256f7b13a8b5a1331108449353ee2d..41796dedfa9034f2333cf249a0d81b7250e14d1f
100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2278,7 +2278,7 @@ int skb_copy_bits(const struct sk_buff *skb, int
offset, void *to, int len)
  skb_frag_off(f) + offset - start,
  copy, p, p_off, p_len, copied) {
vaddr = kmap_atomic(p);
-   memcpy(to + copied, vaddr + p_off, p_len);
+   memmove(to + copied, vaddr + p_off, p_len);
kunmap_atomic(vaddr);
}


>
> > By pulling one extra byte, the problem goes away.
> >
> > Strange thing is that the udhcpc process does not go past sendto().
> >
>
> I have been trying to debug that one. Unfortunately gdb doesn't work with sh,
> so I can't use it to debug the problem. I'll spend some more time on this 
> today.

Yes, I think this is the real issue here. This smells like some memory
corruption.

In my traces, packet is correctly received in AF_PACKET queue.

I have checked the skb is well formed.

But the user space seems to never call poll() and recvmsg() on this
af_packet socket.


Re: Linux 5.12-rc7

2021-04-12 Thread Eric Dumazet
On Mon, Apr 12, 2021 at 6:31 PM Eric Dumazet  wrote:
>
> On Mon, Apr 12, 2021 at 6:28 PM Linus Torvalds
>  wrote:
> >
> > On Sun, Apr 11, 2021 at 10:14 PM Guenter Roeck  wrote:
> > >
> > > Qemu test results:
> > > total: 460 pass: 459 fail: 1
> > > Failed tests:
> > > sh:rts7751r2dplus_defconfig:ata:net,virtio-net:rootfs
> > >
> > > The failure bisects to commit 0f6925b3e8da ("virtio_net: Do not pull 
> > > payload in
> > > skb->head"). It is a spurious problem - the test passes roughly every 
> > > other
> > > time. When the failure is seen, udhcpc fails to get an IP address and 
> > > aborts
> > > with SIGTERM. So far I have only seen this with the "sh" architecture.
> >
> > Hmm. Let's add in some more of the people involved in that commit, and
> > also netdev.
> >
> > Nothing in there looks like it should have any interaction with
> > architecture, so that "it happens on sh" sounds odd, but maybe it's
> > some particular interaction with the qemu environment.
>
> Yes, maybe.
>
> I spent few hours on this, and suspect a buggy memcpy() implementation
> on SH, but this was not conclusive.
>
> By pulling one extra byte, the problem goes away.
>
> Strange thing is that the udhcpc process does not go past sendto().

This is the patch working around the issue. Unfortunately I was not
able to root-cause it (I really suspect something on SH)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 
0824e6999e49957f7aaf7c990f6259792d42f32b..fd890a951beea03bdf24406809042666eb972655
100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -408,11 +408,17 @@ static struct sk_buff *page_to_skb(struct
virtnet_info *vi,

/* Copy all frame if it fits skb->head, otherwise
 * we let virtio_net_hdr_to_skb() and GRO pull headers as needed.
+*
+* Apparently, pulling only the Ethernet Header triggers a bug
+* on qemu-system-sh4.
+* Since GRO aggregation really cares about IPv4/IPv6, pull 20 bytes
+* more to work around this bug: these 20 bytes cannot belong
+* to UDP/TCP payload.
+* As a bonus, this makes GRO slightly faster for IPv4 (one less copy).
 */
if (len <= skb_tailroom(skb))
copy = len;
else
-   copy = ETH_HLEN + metasize;
+   copy = ETH_HLEN + sizeof(struct iphdr) + metasize;
skb_put_data(skb, p, copy);

if (metasize) {


Re: Linux 5.12-rc7

2021-04-12 Thread Eric Dumazet
On Mon, Apr 12, 2021 at 6:28 PM Linus Torvalds
 wrote:
>
> On Sun, Apr 11, 2021 at 10:14 PM Guenter Roeck  wrote:
> >
> > Qemu test results:
> > total: 460 pass: 459 fail: 1
> > Failed tests:
> > sh:rts7751r2dplus_defconfig:ata:net,virtio-net:rootfs
> >
> > The failure bisects to commit 0f6925b3e8da ("virtio_net: Do not pull 
> > payload in
> > skb->head"). It is a spurious problem - the test passes roughly every other
> > time. When the failure is seen, udhcpc fails to get an IP address and aborts
> > with SIGTERM. So far I have only seen this with the "sh" architecture.
>
> Hmm. Let's add in some more of the people involved in that commit, and
> also netdev.
>
> Nothing in there looks like it should have any interaction with
> architecture, so that "it happens on sh" sounds odd, but maybe it's
> some particular interaction with the qemu environment.

Yes, maybe.

I spent a few hours on this, and suspect a buggy memcpy() implementation
on SH, but this was not conclusive.

By pulling one extra byte, the problem goes away.

Strange thing is that the udhcpc process does not go past sendto().


Re: [PATCH net-next v2 2/3] net: use skb_for_each_frag() helper where possible

2021-04-12 Thread Eric Dumazet



On 4/12/21 2:38 AM, Matteo Croce wrote:
> From: Matteo Croce 
> 
> Use the new helper macro skb_for_each_frag(), which allows iterating
> through all the SKB fragments.
> 
> The patch was created with Coccinelle, this was the semantic patch:
> 
> @@
> struct sk_buff *skb;
> identifier i;
> statement S;
> iterator name skb_for_each_frag;
> @@
> -for (i = 0; i < skb_shinfo(skb)->nr_frags; \(++i\|i++\))
> +skb_for_each_frag(skb, i)
>  S
> @@
> struct skb_shared_info *sinfo;
> struct sk_buff *skb;
> identifier i;
> statement S;
> iterator name skb_for_each_frag;
> @@


I disagree with this part :

>  sinfo = skb_shinfo(skb)
>  ...
> -for (i = 0; i < sinfo->nr_frags; \(++i\|i++\))
> +skb_for_each_frag(skb, i)
>  S
>


> index bde781f46b41..5de00477eaf9 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1644,7 +1644,7 @@ static int __pskb_trim_head(struct sk_buff *skb, int len)
>   eat = len;
>   k = 0;
>   shinfo = skb_shinfo(skb);
> - for (i = 0; i < shinfo->nr_frags; i++) {
> + skb_for_each_frag(skb, i) {
>   int size = skb_frag_size(&shinfo->frags[i]);
>  
>   if (size <= eat) {

This will force the compiler to re-evaluate skb_shinfo(skb)->nr_frags on
every loop iteration, since atomic operations like skb_frag_unref() have a
memory clobber.

Evaluating skb_shinfo(skb)->nr_frags means reloading three variables each
time: skb->head, the end offset used to derive the shared-info pointer, and
nr_frags itself.

The macro should only be used when the original code was:

for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
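
To make the objection concrete, a minimal sketch, assuming the macro expands
to the open-coded loop it replaces (the actual definition is in an earlier
patch of the series, not quoted here):

#define skb_for_each_frag(skb, i) \
        for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)

/* With the macro, every iteration re-derives the bound from skb:
 * load skb->head, load the end offset, load shinfo->nr_frags.
 * Hoisting the invariant keeps the bound in a register across the
 * atomics in the loop body:
 */
static void trim_frags_sketch(struct sk_buff *skb)
{
        struct skb_shared_info *shinfo = skb_shinfo(skb);
        int i;

        for (i = 0; i < shinfo->nr_frags; i++)
                skb_frag_unref(skb, i);         /* has a memory clobber */
}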




Re: [syzbot] KMSAN: uninit-value in INET_ECN_decapsulate (2)

2021-04-12 Thread Eric Dumazet



On 3/30/21 3:26 PM, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:    29ad81a1 arch/x86: add missing include to sparsemem.h
> git tree:   https://github.com/google/kmsan.git master
> console output: https://syzkaller.appspot.com/x/log.txt?x=166fe481d0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=b573c14b733efb1c
> dashboard link: https://syzkaller.appspot.com/bug?extid=5e9c61d74e52a82b8ace
> compiler:   Debian clang version 11.0.1-2
> userspace arch: i386
> 
> Unfortunately, I don't have any reproducer for this issue yet.
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+5e9c61d74e52a82b8...@syzkaller.appspotmail.com
> 
> =
> BUG: KMSAN: uninit-value in __INET_ECN_decapsulate include/net/inet_ecn.h:236 
> [inline]
> BUG: KMSAN: uninit-value in INET_ECN_decapsulate+0x329/0x1db0 
> include/net/inet_ecn.h:258
> CPU: 1 PID: 9058 Comm: syz-executor.1 Not tainted 5.11.0-rc7-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  
>  __dump_stack lib/dump_stack.c:79 [inline]
>  dump_stack+0x21c/0x280 lib/dump_stack.c:120
>  kmsan_report+0xfb/0x1e0 mm/kmsan/kmsan_report.c:118
>  __msan_warning+0x5f/0xa0 mm/kmsan/kmsan_instr.c:197
>  __INET_ECN_decapsulate include/net/inet_ecn.h:236 [inline]
>  INET_ECN_decapsulate+0x329/0x1db0 include/net/inet_ecn.h:258
>  geneve_rx+0x216e/0x29d0 include/net/inet_ecn.h:304
>  geneve_udp_encap_recv+0x1055/0x1340 drivers/net/geneve.c:377
>  udp_queue_rcv_one_skb+0x1943/0x1af0 net/ipv4/udp.c:2095
>  udp_queue_rcv_skb+0x286/0x1040 net/ipv4/udp.c:2169
>  udp_unicast_rcv_skb net/ipv4/udp.c:2327 [inline]
>  __udp4_lib_rcv+0x3a1f/0x58f0 net/ipv4/udp.c:2396
>  udp_rcv+0x5c/0x70 net/ipv4/udp.c:2567
>  ip_protocol_deliver_rcu+0x572/0xc50 net/ipv4/ip_input.c:204
>  ip_local_deliver_finish net/ipv4/ip_input.c:231 [inline]
>  NF_HOOK include/linux/netfilter.h:301 [inline]
>  ip_local_deliver+0x585/0x8d0 net/ipv4/ip_input.c:252
>  dst_input include/net/dst.h:447 [inline]
>  ip_rcv_finish net/ipv4/ip_input.c:428 [inline]
>  NF_HOOK include/linux/netfilter.h:301 [inline]
>  ip_rcv+0x599/0x820 net/ipv4/ip_input.c:539
>  __netif_receive_skb_one_core net/core/dev.c:5323 [inline]
>  __netif_receive_skb+0x1ec/0x640 net/core/dev.c:5437
>  process_backlog+0x517/0xbd0 net/core/dev.c:6328
>  napi_poll+0x428/0x15c0 net/core/dev.c:6806
>  net_rx_action+0x34c/0xd30 net/core/dev.c:6889
>  __do_softirq+0x1b9/0x715 kernel/softirq.c:343
>  asm_call_irq_on_stack+0xf/0x20
>  
>  __run_on_irqstack arch/x86/include/asm/irq_stack.h:26 [inline]
>  run_on_irqstack_cond arch/x86/include/asm/irq_stack.h:77 [inline]
>  do_softirq_own_stack+0x6e/0x90 arch/x86/kernel/irq_64.c:77
>  do_softirq kernel/softirq.c:246 [inline]
>  __local_bh_enable_ip+0x184/0x1d0 kernel/softirq.c:196
>  local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32
>  rcu_read_unlock_bh include/linux/rcupdate.h:737 [inline]
>  __dev_queue_xmit+0x3b3e/0x45c0 net/core/dev.c:4178
>  dev_queue_xmit+0x4b/0x60 net/core/dev.c:4184
>  packet_snd net/packet/af_packet.c:3006 [inline]
>  packet_sendmsg+0x8778/0x9a60 net/packet/af_packet.c:3031
>  sock_sendmsg_nosec net/socket.c:652 [inline]
>  sock_sendmsg net/socket.c:672 [inline]
>  __sys_sendto+0x9ea/0xc60 net/socket.c:1975
>  __do_sys_sendto net/socket.c:1987 [inline]
>  __se_sys_sendto+0x107/0x130 net/socket.c:1983
>  __ia32_sys_sendto+0x6e/0x90 net/socket.c:1983
>  do_syscall_32_irqs_on arch/x86/entry/common.c:79 [inline]
>  __do_fast_syscall_32+0x102/0x160 arch/x86/entry/common.c:141
>  do_fast_syscall_32+0x6a/0xc0 arch/x86/entry/common.c:166
>  do_SYSENTER_32+0x73/0x90 arch/x86/entry/common.c:209
>  entry_SYSENTER_compat_after_hwframe+0x4d/0x5c
> RIP: 0023:0xf7f84549
> Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 
> 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 
> 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00
> RSP: 002b:f557e5fc EFLAGS: 0296 ORIG_RAX: 0171
> RAX: ffda RBX: 0003 RCX: 2980
> RDX: 000e RSI:  RDI: 22c0
> RBP: 0014 R08:  R09: 
> R10:  R11:  R12: 
> R13:  R14:  R15: 
> 
> Uninit was created at:
>  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:121 [inline]
>  kmsan_internal_poison_shadow+0x5c/0xf0 mm/kmsan/kmsan.c:104
>  kmsan_slab_alloc+0x8d/0xe0 mm/kmsan/kmsan_hooks.c:76
>  slab_alloc_node mm/slub.c:2907 [inline]
>  __kmalloc_node_track_caller+0xa37/0x1430 mm/slub.c:4527
>  __kmalloc_reserve net/core/skbuff.c:142 [inline]
>  __alloc_skb+0x2f8/0xb30 net/core/skbuff.c:210
>  alloc_skb include/linux/skbuff.h:1099 [inline]
>  alloc_skb_with_frags+0x1f3/0xc10 

Re: [PATCH] net: geneve: check skb is large enough for IPv4/IPv6 header

2021-04-11 Thread Eric Dumazet
On Sun, Apr 11, 2021 at 1:28 PM Phillip Potter  wrote:
>
> Check within geneve_xmit_skb/geneve6_xmit_skb that the sk_buff structure
> is large enough to include an IPv4 or IPv6 header, and reject if not. The
> geneve_xmit_skb portion and overall idea were contributed by Eric Dumazet.
> Fixes a KMSAN-found uninit-value bug reported by syzbot at:
> https://syzkaller.appspot.com/bug?id=abe95dc3e3e9667fc23b8d81f29ecad95c6f106f
>
> Suggested-by: Eric Dumazet 
> Reported-by: syzbot+2e406a9ac75bb71d4...@syzkaller.appspotmail.com
> Signed-off-by: Phillip Potter 

Signed-off-by: Eric Dumazet 

Thanks !


Re: [PATCH] net: core: sk_buff: zero-fill skb->data in __alloc_skb function

2021-04-10 Thread Eric Dumazet
On Sat, Apr 10, 2021 at 12:12 PM Eric Dumazet  wrote:
>
> On Sat, Apr 10, 2021 at 11:51 AM Phillip Potter  wrote:
> >
> > Zero-fill skb->data in the __alloc_skb function of net/core/skbuff.c,
> > up to the start of the struct skb_shared_info bytes. Fixes a KMSAN-found
> > uninit-value bug reported by syzbot at:
> > https://syzkaller.appspot.com/bug?id=abe95dc3e3e9667fc23b8d81f29ecad95c6f106f
> >
> > Reported-by: syzbot+2e406a9ac75bb71d4...@syzkaller.appspotmail.com
> > Signed-off-by: Phillip Potter 
> > ---
> >  net/core/skbuff.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 785daff48030..9ac26cdb5417 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -215,6 +215,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t 
> > gfp_mask,
> >  * to allow max possible filling before reallocation.
> >  */
> > size = SKB_WITH_OVERHEAD(ksize(data));
> > +   memset(data, 0, size);
> > prefetchw(data + size);
>
>
> Certainly not.
>
> There is a difference between kmalloc() and kzalloc()
>
> Here you are basically silencing KMSAN and making it useless.
>
> Please fix the real issue, or stop using KMSAN if it bothers you.

My understanding of the KMSAN bug (when I released it months ago) was
that it was triggered by some invalid assumptions in geneve_xmit().

The syzbot repro sends a packet with a very small size (Ethernet
header only) and no IP/IPv6 header.

Fix for ipv4 part (sorry, not much time during week end to test all this)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index e3b2375ac5eb55f544bbc1f309886cc9be189fd1..0a72779bc74bc50c20c34c05b2c525cca829f33c 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -892,6 +892,9 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
__be16 sport;
int err;

+   if (!pskb_network_may_pull(skb, sizeof(struct iphdr)))
+   return -EINVAL;
+
sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
rt = geneve_get_v4_rt(skb, dev, gs4, &fl4, info,
  geneve->cfg.info.key.tp_dst, sport);
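
The hunk above guards only the IPv4 path. The IPv6 path would need the
analogous check before reading the IPv6 header; a hedged sketch, assuming
the same early-return shape inside geneve6_xmit_skb():

        if (!pskb_network_may_pull(skb, sizeof(struct ipv6hdr)))
                return -EINVAL;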


Re: [PATCH] net: core: sk_buff: zero-fill skb->data in __alloc_skb function

2021-04-10 Thread Eric Dumazet
On Sat, Apr 10, 2021 at 11:51 AM Phillip Potter  wrote:
>
> Zero-fill skb->data in the __alloc_skb function of net/core/skbuff.c,
> up to the start of the struct skb_shared_info bytes. Fixes a KMSAN-found
> uninit-value bug reported by syzbot at:
> https://syzkaller.appspot.com/bug?id=abe95dc3e3e9667fc23b8d81f29ecad95c6f106f
>
> Reported-by: syzbot+2e406a9ac75bb71d4...@syzkaller.appspotmail.com
> Signed-off-by: Phillip Potter 
> ---
>  net/core/skbuff.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 785daff48030..9ac26cdb5417 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -215,6 +215,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t 
> gfp_mask,
>  * to allow max possible filling before reallocation.
>  */
> size = SKB_WITH_OVERHEAD(ksize(data));
> +   memset(data, 0, size);
> prefetchw(data + size);


Certainly not.

There is a difference between kmalloc() and kzalloc()

Here you are basically silencing KMSAN and making it useless.

Please fix the real issue, or stop using KMSAN if it bothers you.


Re: Problem in pfmemalloc skb handling in net/core/dev.c

2021-04-09 Thread Eric Dumazet



On 4/9/21 12:14 PM, Xie He wrote:
> On Fri, Apr 9, 2021 at 3:04 AM Eric Dumazet  wrote:
>>
>> Note that pfmemalloc skbs are normally dropped in sk_filter_trim_cap()
>>
>> Simply make sure your protocol uses it.
> 
> It seems "sk_filter_trim_cap" needs an "struct sock" argument. Some of
> my protocols act like a middle layer to another protocol and don't
> have any "struct sock".

Then simply copy the needed logic.

> 
> Also, I think this is a problem in net/core/dev.c, there are a lot of
> old protocols that are not aware of pfmemalloc skbs. I don't think
> it's a good idea to fix them one by one.
> 

I think you are mistaken.

There is no problem in net/core/dev.c really; it uses
skb_pfmemalloc_protocol().

pfmemalloc is best effort really.

If a layer stores packets in many long-lived queues, it has to drop
pfmemalloc packets, unless these packets are used for swapping.
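
A hedged sketch of what "copy the needed logic" can look like for a
sockless protocol: the pfmemalloc part of sk_filter_trim_cap() reduces to a
test on the skb alone (the handler name myproto_rcv is hypothetical):

static int myproto_rcv(struct sk_buff *skb)
{
        /* Drop memory-reserve skbs: this receiver is not a SOCK_MEMALLOC
         * socket helping a swap device, so it must not consume reserves.
         */
        if (skb_pfmemalloc(skb)) {
                kfree_skb(skb);
                return NET_RX_DROP;
        }
        /* ... normal receive processing ... */
        return NET_RX_SUCCESS;
}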






Re: [PATCH] net/rds: Avoid potential use after free in rds_send_remove_from_sock

2021-04-09 Thread Eric Dumazet



On 4/7/21 2:09 AM, Aditya Pakki wrote:
> In case of rs failure in rds_send_remove_from_sock(), the 'rm' resource
> is freed and later used under a spinlock, causing a potential
> use-after-free. Set the freed pointer to NULL to avoid undefined behavior.
> 
> Signed-off-by: Aditya Pakki 
> ---
>  net/rds/message.c | 1 +
>  net/rds/send.c| 2 +-
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/rds/message.c b/net/rds/message.c
> index 071a261fdaab..90ebcfe5fe3b 100644
> --- a/net/rds/message.c
> +++ b/net/rds/message.c
> @@ -180,6 +180,7 @@ void rds_message_put(struct rds_message *rm)
>   rds_message_purge(rm);
>  
>   kfree(rm);
> + rm = NULL;

This is a nop, really.

This does not clear the @rm variable in the caller.



>   }
>  }
>  EXPORT_SYMBOL_GPL(rds_message_put);
> diff --git a/net/rds/send.c b/net/rds/send.c
> index 985d0b7713ac..fe5264b9d4b3 100644
> --- a/net/rds/send.c
> +++ b/net/rds/send.c
> @@ -665,7 +665,7 @@ static void rds_send_remove_from_sock(struct list_head *messages, int status)
>  unlock_and_drop:
>   spin_unlock_irqrestore(&rm->m_rs_lock, flags);
>   rds_message_put(rm);
> - if (was_on_sock)
> + if (was_on_sock && rm)
>   rds_message_put(rm);

Maybe the bug is that the refcount has not been elevated when was_on_sock
has been set.

>   }
>  
> 
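
To spell out why the assignment cannot work: C passes the pointer by value,
so the callee clears only its own copy. A minimal user-space illustration:

#include <stdlib.h>

struct msg { int refcount; };

static void msg_put(struct msg *m)
{
        free(m);
        m = NULL;       /* clears only the local copy of the pointer */
}

int main(void)
{
        struct msg *rm = malloc(sizeof(*rm));

        msg_put(rm);
        /* 'rm' still holds the freed address here, so a later
         * "if (rm)" test cannot detect that the object is gone.
         */
        return 0;
}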


Re: Problem in pfmemalloc skb handling in net/core/dev.c

2021-04-09 Thread Eric Dumazet



On 4/9/21 11:14 AM, Xie He wrote:
> On Fri, Apr 9, 2021 at 1:44 AM Mel Gorman  wrote:
>>
>> That would imply that the tap was communicating with a swap device to
>> allocate a pfmemalloc skb which shouldn't happen. Furthermore, it would
>> require the swap device to be deactivated while pfmemalloc skbs still
>> existed. Have you encountered this problem?
> 
> I'm not a user of swap devices or pfmemalloc skbs. I just want to make
> sure the protocols that I'm developing (not IP or IPv6) won't get
> pfmemalloc skbs when receiving, because those protocols cannot handle
> them.
> 
> According to the code, it seems always possible to get a pfmemalloc
> skb when a network driver calls "__netdev_alloc_skb". The skb will
> then be queued in per-CPU backlog queues when the driver calls
> "netif_rx". There seems to be nothing preventing "sk_memalloc_socks()"
> from becoming "false" after the skb is allocated and before it is
> handled by "__netif_receive_skb".
> 
> Do you mean that at the time "sk_memalloc_socks()" changes from "true"
> to "false", there would be no in-flight skbs currently being received,
> and all network communications have been paused?
> 


Note that pfmemalloc skbs are normally dropped in sk_filter_trim_cap()

Simply make sure your protocol uses it.


Re: [PATCH] net: sched: sch_teql: fix null-pointer dereference

2021-04-08 Thread Eric Dumazet



On 4/8/21 5:14 PM, Pavel Tikhomirov wrote:
> Reproduce:
> 
>   modprobe sch_teql
>   tc qdisc add dev teql0 root teql0
> 
> This leads to (for instance in Centos 7 VM) OOPS:
> 
>
> 
> Null pointer dereference happens on master->slaves dereference in
> teql_destroy() as master is null-pointer.
> 
> When qdisc_create() calls teql_qdisc_init(), it immediately fails after
> the check "if (m->dev == dev)" because both devices are teql0, and it does
> not set qdisc_priv(sch)->m, leaving it zero on the error path; then
> qdisc_create() immediately calls teql_destroy(), which does not expect a
> zero master pointer, and we get the OOPS.
> 
> Signed-off-by: Pavel Tikhomirov 
> ---

This makes sense, thanks !

Reviewed-by: Eric Dumazet 

I would think the bug origin is:

Fixes: 87b60cfacf9f ("net_sched: fix error recovery at qdisc creation")

Can you confirm you have this backported to 3.10.0-1062.7.1.el7.x86_64 ?
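
The defensive fix the analysis implies is small; a hedged sketch of its
likely shape (the actual patch body is elided above):

static void teql_destroy(struct Qdisc *sch)
{
        struct teql_sched_data *dat = qdisc_priv(sch);
        struct teql_master *master = dat->m;

        /* qdisc_create() can call ->destroy() after a failed ->init(),
         * which never set dat->m; bail out instead of dereferencing NULL.
         */
        if (!master)
                return;

        /* ... original walk of master->slaves and teardown ... */
}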




Re: [PATCH net v4] atl1c: move tx cleanup processing out of interrupt

2021-04-07 Thread Eric Dumazet



On 4/6/21 4:49 PM, Gatis Peisenieks wrote:
> Tx queue cleanup happens in the interrupt handler on the same core as rx
> queue processing. Both can take a considerable amount of processing in
> high packet-per-second scenarios.
> 
> Sending large amounts of packets can stall the rx processing, which is
> unfair and can also lead to an out-of-memory condition, since
> __dev_kfree_skb_irq queues the skbs for later kfree in softirq, which is
> not allowed to happen under heavy load in the interrupt handler.
> 

[ ... ]

> diff --git a/net/core/dev.c b/net/core/dev.c
> index 0f72ff5d34ba..489ac60b530c 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6789,6 +6789,7 @@ int dev_set_threaded(struct net_device *dev, bool 
> threaded)
> 
>  return err;
>  }
> +EXPORT_SYMBOL(dev_set_threaded);
> 
>  void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
>  int (*poll)(struct napi_struct *, int), int weight)

This has already been done in net-next

Please base your patch on top of net-next; this cannot be backported to old
versions anyway without some amount of pain.
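
A hedged sketch of the pattern the commit message argues for: complete tx
skbs from a budgeted NAPI poll rather than from the hardirq handler
(function names are illustrative, not the driver's actual symbols):

static int mydrv_poll(struct napi_struct *napi, int budget)
{
        int work = mydrv_clean_rx(napi, budget);        /* hypothetical */

        /* Tx completions now run in softirq context, so skbs can be
         * freed directly with napi_consume_skb() instead of being
         * deferred via __dev_kfree_skb_irq() from the hardirq handler.
         */
        mydrv_clean_tx(napi, budget);                   /* hypothetical */

        if (work < budget)
                napi_complete_done(napi, work);
        return work;
}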





Re: [PATCH v3] net: tun: set tun->dev->addr_len during TUNSETLINK processing

2021-04-06 Thread Eric Dumazet



On 4/6/21 7:45 PM, Phillip Potter wrote:
> When changing the type with the TUNSETLINK ioctl command, set
> tun->dev->addr_len to match the appropriate type, using the new
> tun_get_addr_len utility function, which returns the appropriate
> address length for a given type. Fixes a
> KMSAN-found uninit-value bug reported by syzbot at:
> https://syzkaller.appspot.com/bug?id=0766d38c656abeace60621896d705743aeefed51
> 
> Reported-by: syzbot+001516d86dbe88862...@syzkaller.appspotmail.com
> Diagnosed-by: Eric Dumazet 
> Signed-off-by: Phillip Potter 
> ---

SGTM, thanks a lot.

Reviewed-by: Eric Dumazet 
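
For reference, a hedged sketch of the described helper (the merged version
covers more ARPHRD_* types; the mapping below is illustrative):

static unsigned short tun_get_addr_len(unsigned short type)
{
        switch (type) {
        case ARPHRD_TUNNEL:             /* IPv4 tunnel endpoints */
        case ARPHRD_IPGRE:
                return 4;               /* an IPv4 address */
        case ARPHRD_TUNNEL6:
                return sizeof(struct in6_addr);
        case ARPHRD_ETHER:
                return ETH_ALEN;
        default:
                return 0;
        }
}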



Re: [PATCH] net: tun: set tun->dev->addr_len during TUNSETLINK processing

2021-04-06 Thread Eric Dumazet



On 4/5/21 1:35 PM, Phillip Potter wrote:
> When changing the type with the TUNSETLINK ioctl command, set
> tun->dev->addr_len to match the appropriate type, using the new
> tun_get_addr_len utility function, which returns the appropriate
> address length for a given type. Fixes a
> KMSAN-found uninit-value bug reported by syzbot at:
> https://syzkaller.appspot.com/bug?id=0766d38c656abeace60621896d705743aeefed51
> 
> Reported-by: syzbot+001516d86dbe88862...@syzkaller.appspotmail.com
> Signed-off-by: Phillip Potter 
> ---

Please give credit to the people who helped.

You could have :

Suggested-by: Eric Dumazet 

Or
Diagnosed-by: Eric Dumazet 

Or at least CCed me :/


