Re: [PATCH net-next 0/3] trace: use TP_STORE_ADDRS macro

2024-03-26 Thread Paolo Abeni
On Tue, 2024-03-26 at 18:43 +0800, Jason Xing wrote:
> On Tue, Mar 26, 2024 at 6:29 PM Paolo Abeni  wrote:
> > 
> > On Tue, 2024-03-26 at 12:14 +0800, Jason Xing wrote:
> > > On Mon, Mar 25, 2024 at 11:43 AM Jason Xing  
> > > wrote:
> > > > 
> > > > From: Jason Xing 
> > > > 
> > > > Using the macro for the other tracepoints makes the code more concise.
> > > > No functional change.
> > > > 
> > > > Jason Xing (3):
> > > >   trace: move to TP_STORE_ADDRS related macro to net_probe_common.h
> > > >   trace: use TP_STORE_ADDRS() macro in inet_sk_error_report()
> > > >   trace: use TP_STORE_ADDRS() macro in inet_sock_set_state()
> > > > 
> > > >  include/trace/events/net_probe_common.h | 29 
> > > >  include/trace/events/sock.h | 35 -
> > > 
> > > I just noticed that some trace files in include/trace directory (like
> > > net_probe_common.h, sock.h, skb.h, net.h, sock.h, udp.h, sctp.h,
> > > qdisc.h, neigh.h, napi.h, icmp.h, ...) are not owned by networking
> > > folks while some files (like tcp.h) have been maintained by specific
> > > maintainers/experts (like Eric) because they belong to one specific
> > > area. I wonder if we can get more networking guys involved in net
> > > tracing.
> > > 
> > > I'm not sure if 1) we can put those files into the "NETWORKING
> > > [GENERAL]" category, or 2) we can create a new category to include
> > > them all.
> > 
> > I think all the files you mentioned are not under networking because of
> > a MAINTAINERS file inaccuracy, and we could move them there accordingly.
> 
> Yes, they are not under the networking category currently. So how
> could we move them? The MAINTAINERS file doesn't have all the specific
> categories which are suitable for each of the trace files.

I think there is no need for other categories: adding explicit 'F:'
entries for such files in the NETWORKING [GENERAL] section should fit.
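
E.g., a hypothetical sketch of what such entries could look like (the
file list below is illustrative, not exhaustive):

	NETWORKING [GENERAL]
	...
	F:	include/trace/events/net_probe_common.h
	F:	include/trace/events/skb.h
	F:	include/trace/events/sock.h
	F:	include/trace/events/udp.h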

> > > I know people have started using BPF to trace them all instead, but I can
> > > see some good advantages of those hooks implemented in the kernel, say:
> > > 1) help those machines on which it is not easy to use BPF tools.
> > > 2) insert the tracepoint in the middle of a function, which cannot
> > > be replaced by a bpf kprobe.
> > > 3) if we have enough tracepoints, we can generate a timeline to
> > > know/detect which flow/skb spends unexpected time at which point.
> > > ...
> > > We can do many things in this area, I think :)
> > > 
> > > What do you think about this, Jakub, Paolo, Eric ?
> > 
> > I agree tracepoints are useful, but I think the general agreement is
> > that they are the 'old way', and we should try to avoid their
> > proliferation.
> 
> Well, it's a pity that we seem to be about to abandon this method,
> because it's not that friendly to users who are unable to deploy
> BPF... Well, I came up with more ideas about how to improve the
> trace function in recent days. The motivation for doing this is that I
> encountered some issues which could be traced/diagnosed with trace
> effortlessly, without writing bpftrace code again and again. The
> trace subsystem seems not very active, but many people are still using
> it, I believe.

I don't think we should abandon it completely. My understanding is that
we should think carefully before adding new tracepoints, and generally
speaking, avoid adding 'too many' of them.

Cheers,

Paolo





Re: [PATCH net-next v2 3/3] tcp: add location into reset trace process

2024-03-26 Thread Paolo Abeni
On Mon, 2024-03-25 at 14:28 +0800, Jason Xing wrote:
> From: Jason Xing 
> 
> In addition to knowing the 4-tuple of the flow which generates an RST,
> the reason why it does so is very important, because we have some
> cases where an RST should be sent but no clue as to which code path
> exactly sent it.
> 
> Adding the location of the reset process can help us more, like
> trace_kfree_skb does.
> 
> Signed-off-by: Jason Xing 
> ---
>  include/trace/events/tcp.h | 14 ++
>  net/ipv4/tcp_ipv4.c|  2 +-
>  net/ipv4/tcp_output.c  |  2 +-
>  net/ipv6/tcp_ipv6.c|  2 +-
>  4 files changed, 13 insertions(+), 7 deletions(-)
> 
> diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
> index a13eb2147a02..8f6c1a07503c 100644
> --- a/include/trace/events/tcp.h
> +++ b/include/trace/events/tcp.h
> @@ -109,13 +109,17 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb,
>   */
>  TRACE_EVENT(tcp_send_reset,
>  
> - TP_PROTO(const struct sock *sk, const struct sk_buff *skb),
> + TP_PROTO(
> + const struct sock *sk,
> + const struct sk_buff *skb,
> + void *location),

Very minor nit: the above lines should be aligned with the open
bracket.
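
I.e., something like:

	TP_PROTO(const struct sock *sk,
		 const struct sk_buff *skb,
		 void *location),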

No need to repost just for this, but let's wait for Eric's feedback.

Cheers,

Paolo




Re: [PATCH net-next 0/3] trace: use TP_STORE_ADDRS macro

2024-03-26 Thread Paolo Abeni
On Tue, 2024-03-26 at 12:14 +0800, Jason Xing wrote:
> On Mon, Mar 25, 2024 at 11:43 AM Jason Xing  wrote:
> > 
> > From: Jason Xing 
> > 
> > Using the macro for the other tracepoints makes the code more concise.
> > No functional change.
> > 
> > Jason Xing (3):
> >   trace: move to TP_STORE_ADDRS related macro to net_probe_common.h
> >   trace: use TP_STORE_ADDRS() macro in inet_sk_error_report()
> >   trace: use TP_STORE_ADDRS() macro in inet_sock_set_state()
> > 
> >  include/trace/events/net_probe_common.h | 29 
> >  include/trace/events/sock.h | 35 -
> 
> I just noticed that some trace files in include/trace directory (like
> net_probe_common.h, sock.h, skb.h, net.h, sock.h, udp.h, sctp.h,
> qdisc.h, neigh.h, napi.h, icmp.h, ...) are not owned by networking
> folks while some files (like tcp.h) have been maintained by specific
> maintainers/experts (like Eric) because they belong to one specific
> area. I wonder if we can get more networking guys involved in net
> tracing.
> 
> I'm not sure if 1) we can put those files into the "NETWORKING
> [GENERAL]" category, or 2) we can create a new category to include
> them all.

I think all the files you mentioned are not under networking because of
a MAINTAINERS file inaccuracy, and we could move them there accordingly.
> 
> I know people have started using BPF to trace them all instead, but I can
> see some good advantages of those hooks implemented in the kernel, say:
> 1) help those machines on which it is not easy to use BPF tools.
> 2) insert the tracepoint in the middle of a function, which cannot
> be replaced by a bpf kprobe.
> 3) if we have enough tracepoints, we can generate a timeline to
> know/detect which flow/skb spends unexpected time at which point.
> ...
> We can do many things in this area, I think :)
> 
> What do you think about this, Jakub, Paolo, Eric ?

I agree tracepoints are useful, but I think the general agreement is
that they are the 'old way', and we should try to avoid their
proliferation.

Cheers,

Paolo




Re: [PATCH] net: hns3: tracing: fix hclgevf trace event strings

2024-03-14 Thread Paolo Abeni
On Wed, 2024-03-13 at 09:34 -0400, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" 
> 
> [
>    Note, I need to take this patch through my tree, so I'm looking for acks.

Note that this device driver is changing quite rapidly, so I expect
some conflicts here later. I guess Linus will have to handle them ;)

>    This causes the build to fail when I add the __assign_str() check, which
>    I was about to push to Linus, but it breaks allmodconfig due to this error.
> ]
> 
> The __string() and __assign_str() helper macros of the TRACE_EVENT() macro
> are going through some optimizations where only the source string of
> __string() will be used and the __assign_str() source will be ignored and
> later removed.
> 
> To make sure that there's no issues, a new check is added between the
> __string() src argument and the __assign_str() src argument that does a
> strcmp() to make sure they are the same string.
> 
> The hclgevf trace events have:
> 
>   __assign_str(devname, &hdev->nic.kinfo.netdev->name);
> 
> Which triggers the warning:
> 
> hclgevf_trace.h:34:39: error: passing argument 1 of ‘strcmp’ from 
> incompatible pointer type [-Werror=incompatible-pointer-types]
>    34 | __assign_str(devname, &hdev->nic.kinfo.netdev->name);
>  [..]
> arch/x86/include/asm/string_64.h:75:24: note: expected ‘const char *’ but 
> argument is of type ‘char (*)[16]’
>75 | int strcmp(const char *cs, const char *ct);
>   |^~
> 
> 
> Because __assign_str() now has:
> 
>   WARN_ON_ONCE(__builtin_constant_p(src) ?\
>strcmp((src), __data_offsets.dst##_ptr_) : \
>(src) != __data_offsets.dst##_ptr_);   \
> 
> The problem is the '&' on hdev->nic.kinfo.netdev->name. That's because
> that name is:
> 
>   char name[IFNAMSIZ]
> 
> Where passing an address '&' of a char array is not compatible with strcmp().
> 
> The '&' is not necessary, remove it.
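
As a minimal illustration of the type mismatch (assuming IFNAMSIZ == 16):

	char name[IFNAMSIZ];

	strcmp(name, "eth0");	/* OK: the array decays to 'char *' */
	strcmp(&name, "eth0");	/* error: '&name' is 'char (*)[16]' */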
> 
> Fixes: d8355240cf8fb ("net: hns3: add trace event support for PF/VF mailbox")

checkpatch in strict mode complains that the hash is not 12 chars long.

> Signed-off-by: Steven Rostedt (Google) 

FWIW

Acked-by: Paolo Abeni 




Re: [PATCH net-next v6 5/5] tools: virtio: introduce vhost_net_test

2024-03-05 Thread Paolo Abeni
On Wed, 2024-02-28 at 17:30 +0800, Yunsheng Lin wrote:
> Introduce vhost_net_test for both vhost_net tx and rx, based
> on virtio_test, to test vhost_net changes in the kernel.
> 
> Steps for vhost_net tx testing:
> 1. Prepare an out buf.
> 2. Kick the vhost_net to do tx processing.
> 3. Do the receiving on the tun side.
> 4. Verify the data received by tun is correct.
> 
> Steps for vhost_net rx testing:
> 1. Prepare an in buf.
> 2. Do the sending on the tun side.
> 3. Kick the vhost_net to do rx processing.
> 4. Verify the data received by vhost_net is correct.
> 
> Signed-off-by: Yunsheng Lin 

@Jason: AFAICS this addresses the points you raised on v5, could you
please have a look?

Thanks!

Paolo




Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support

2024-02-29 Thread Paolo Abeni
On Wed, 2024-02-28 at 19:05 +0800, Yunjian Wang wrote:
> @@ -2661,6 +2776,54 @@ static int tun_ptr_peek_len(void *ptr)
> 	}
>  }
>  
> +static void tun_peek_xsk(struct tun_file *tfile)
> +{
> +	struct xsk_buff_pool *pool;
> +	u32 i, batch, budget;
> +	void *frame;
> +
> +	if (!ptr_ring_empty(&tfile->tx_ring))
> +		return;
> +
> +	spin_lock(&tfile->pool_lock);
> +	pool = tfile->xsk_pool;
> +	if (!pool) {
> +		spin_unlock(&tfile->pool_lock);
> +		return;
> +	}
> +
> +	if (tfile->nb_descs) {
> +		xsk_tx_completed(pool, tfile->nb_descs);
> +		if (xsk_uses_need_wakeup(pool))
> +			xsk_set_tx_need_wakeup(pool);
> +	}
> +
> +	spin_lock(&tfile->tx_ring.producer_lock);
> +	budget = min_t(u32, tfile->tx_ring.size, TUN_XDP_BATCH);
> +
> +	batch = xsk_tx_peek_release_desc_batch(pool, budget);
> +	if (!batch) {

This branch looks like an unneeded "optimization". The generic loop
below should have the same effect with no measurable perf delta - and
smaller code. Just remove this.
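
With the branch removed, a zero batch simply falls through the loop;
roughly (an untested sketch, not the actual patch):

	batch = xsk_tx_peek_release_desc_batch(pool, budget);
	tfile->nb_descs = batch;
	for (i = 0; i < batch; i++) {
		frame = tun_xdp_desc_to_ptr(&pool->tx_descs[i]);
		__ptr_ring_produce(&tfile->tx_ring, frame);
	}
	spin_unlock(&tfile->tx_ring.producer_lock);
	spin_unlock(&tfile->pool_lock);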

> +		tfile->nb_descs = 0;
> +		spin_unlock(&tfile->tx_ring.producer_lock);
> +		spin_unlock(&tfile->pool_lock);
> +		return;
> +	}
> +
> +	tfile->nb_descs = batch;
> +	for (i = 0; i < batch; i++) {
> +		/* Encode the XDP DESC flag into lowest bit for consumer to differ
> +		 * XDP desc from XDP buffer and sk_buff.
> +		 */
> +		frame = tun_xdp_desc_to_ptr(&pool->tx_descs[i]);
> +		/* The budget must be less than or equal to tx_ring.size,
> +		 * so enqueuing will not fail.
> +		 */
> +		__ptr_ring_produce(&tfile->tx_ring, frame);
> +	}
> +	spin_unlock(&tfile->tx_ring.producer_lock);
> +	spin_unlock(&tfile->pool_lock);

More related to the general design: it looks wrong. What if
get_rx_bufs() fails (ENOBUF) after a successful peek? With no more
incoming packets, a later peek will return 0, and it looks like the
half-processed packets will stay in the ring forever???

I think the 'ring produce' part should be moved into tun_do_read().

Cheers,

Paolo




Re: [PATCH net-next v2 2/3] vhost_net: Call peek_len when using xdp

2024-02-29 Thread Paolo Abeni
On Wed, 2024-02-28 at 19:05 +0800, Yunjian Wang wrote:
> If TUN supports AF_XDP TX zero-copy, the XDP program will enqueue
> packets to the XDP ring and wake up the vhost worker. This requires
> the vhost worker to call peek_len(), which can be used to consume
> XDP descriptors.
> 
> Signed-off-by: Yunjian Wang 
> ---
>  drivers/vhost/net.c | 17 -
>  1 file changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index f2ed7167c848..077e74421558 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -207,6 +207,11 @@ static int vhost_net_buf_peek_len(void *ptr)
>   return __skb_array_len_with_tag(ptr);
>  }
>  
> +static bool vhost_sock_xdp(struct socket *sock)
> +{
> + return sock_flag(sock->sk, SOCK_XDP);
> +}
> +
>  static int vhost_net_buf_peek(struct vhost_net_virtqueue *nvq)
>  {
> 	struct vhost_net_buf *rxq = &nvq->rxq;
> @@ -214,6 +219,13 @@ static int vhost_net_buf_peek(struct vhost_net_virtqueue *nvq)
>   if (!vhost_net_buf_is_empty(rxq))
>   goto out;
>  
> + if (ptr_ring_empty(nvq->rx_ring)) {
> +		struct socket *sock = vhost_vq_get_backend(&nvq->vq);
> + /* Call peek_len to consume XSK descriptors, when using xdp */
> + if (vhost_sock_xdp(sock) && sock->ops->peek_len)
> + sock->ops->peek_len(sock);

This really looks like a socket API misuse. Why can't you use ptr_ring
primitives to consume the XSK descriptors? peek_len could be constified
some day; this code would prevent such a (good) thing.

Cheers,

Paolo




Re: [PATCH net-next v2 1/3] xsk: Remove non-zero 'dma_page' check in xp_assign_dev

2024-02-29 Thread Paolo Abeni
On Wed, 2024-02-28 at 19:05 +0800, Yunjian Wang wrote:
> Now DMA mappings are used by physical NICs. However, a vNIC may not
> need them. So remove the non-zero 'dma_page' check in
> xp_assign_dev.
> 
> Signed-off-by: Yunjian Wang 
> ---
>  net/xdp/xsk_buff_pool.c | 7 ---
>  1 file changed, 7 deletions(-)
> 
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index ce60ecd48a4d..a5af75b1f43c 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -219,16 +219,9 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
>   if (err)
>   goto err_unreg_pool;
>  
> - if (!pool->dma_pages) {
> - WARN(1, "Driver did not DMA map zero-copy buffers");
> - err = -EINVAL;
> - goto err_unreg_xsk;
> - }

This would unconditionally remove an otherwise valid check for most
NICs. What about letting the driver declare it won't need DMA mapping
with a (pool?) flag?
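
E.g., something like the following sketch - the flag name is invented
here just for illustration:

	/* hypothetical opt-in flag for drivers backing S/W-only devices */
	if (!pool->dma_pages && !(pool->umem->flags & XDP_UMEM_SW_DEV)) {
		WARN(1, "Driver did not DMA map zero-copy buffers");
		err = -EINVAL;
		goto err_unreg_xsk;
	}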

Cheers,

Paolo




Re: [PATCH net-next v5] virtio_net: Support RX hash XDP hint

2024-02-09 Thread Paolo Abeni
On Fri, 2024-02-09 at 18:39 +0800, Liang Chen wrote:
> On Wed, Feb 7, 2024 at 10:27 PM Paolo Abeni  wrote:
> > 
> > On Wed, 2024-02-07 at 10:54 +0800, Liang Chen wrote:
> > > On Tue, Feb 6, 2024 at 6:44 PM Paolo Abeni  wrote:
> > > > 
> > > > On Sat, 2024-02-03 at 10:56 +0800, Liang Chen wrote:
> > > > > On Sat, Feb 3, 2024 at 12:20 AM Jesper Dangaard Brouer 
> > > > >  wrote:
> > > > > > On 02/02/2024 13.11, Liang Chen wrote:
> > > > [...]
> > > > > > > @@ -1033,6 +1039,16 @@ static void put_xdp_frags(struct xdp_buff 
> > > > > > > *xdp)
> > > > > > >   }
> > > > > > >   }
> > > > > > > 
> > > > > > > +static void virtnet_xdp_save_rx_hash(struct virtnet_xdp_buff 
> > > > > > > *virtnet_xdp,
> > > > > > > +  struct net_device *dev,
> > > > > > > +  struct virtio_net_hdr_v1_hash 
> > > > > > > *hdr_hash)
> > > > > > > +{
> > > > > > > + if (dev->features & NETIF_F_RXHASH) {
> > > > > > > + virtnet_xdp->hash_value = hdr_hash->hash_value;
> > > > > > > + virtnet_xdp->hash_report = hdr_hash->hash_report;
> > > > > > > + }
> > > > > > > +}
> > > > > > > +
> > > > > > 
> > > > > > Would it be possible to store a pointer to hdr_hash in 
> > > > > > virtnet_xdp_buff,
> > > > > > with the purpose of delaying extracting this, until and only if XDP
> > > > > > bpf_prog calls the kfunc?
> > > > > > 
> > > > > 
> > > > > That seems to be the way v1 works,
> > > > > https://lore.kernel.org/all/20240122102256.261374-1-liangchen.li...@gmail.com/
> > > > > . But it was pointed out that the inline header may be overwritten by
> > > > > the xdp prog, so the hash is copied out to maintain its integrity.
> > > > 
> > > > Why? isn't XDP supposed to get write access only to the pkt
> > > > contents/buffer?
> > > > 
> > > 
> > > Normally, an XDP program accesses only the packet data. However,
> > > there's also an XDP RX Metadata area, referenced by the data_meta
> > > pointer. This pointer can be adjusted with bpf_xdp_adjust_meta to
> > > point somewhere ahead of the data buffer, thereby granting the XDP
> > > program access to the virtio header located immediately before the
> > 
> > AFAICS bpf_xdp_adjust_meta() does not allow moving the meta_data before
> > xdp->data_hard_start:
> > 
> > https://elixir.bootlin.com/linux/latest/source/net/core/filter.c#L4210
> > 
> > and virtio net set such field after the virtio_net_hdr:
> > 
> > https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1218
> > https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1420
> > 
> > I don't see how the virtio hdr could be touched. Possibly even more
> > important: if such a thing is possible, I think it should be somehow
> > prevented (for the same reason a H/W NIC should prevent XDP from
> > modifying its own buffer descriptors).
> 
> Thank you for highlighting this concern. The header layout differs
> slightly between small and mergeable mode. Taking 'mergeable mode' as
> an example, after calling xdp_prepare_buff the layout of xdp_buff
> would be as depicted in the diagram below,
> 
> buf
>  |
>  v
> +--------------+--------------+---------+
> | xdp headroom | virtio header| packet  |
> | (256 bytes)  | (20 bytes)   | content |
> +--------------+--------------+---------+
> ^                             ^
> |                             |
> data_hard_start              data
>                              data_meta
> 
> If 'bpf_xdp_adjust_meta' repositions the 'data_meta' pointer a little
> towards 'data_hard_start', it would point to the inline header, thus
> potentially allowing the XDP program to access the inline header.

I see. That layout was completely unexpected to me.

AFAICS the virtio_net driver tries to avoid accessing/using the
virtio_net_hdr after the XDP program execution, so nothing tragic
should happen.
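
For reference, a rough sketch of how an XDP program could reach that
area via the meta pointer (assuming the 20-byte header from the diagram
above; illustrative only):

	SEC("xdp")
	int peek_vnet_hdr(struct xdp_md *ctx)
	{
		__u8 *meta, *data;

		/* grow the meta area 20 bytes towards data_hard_start */
		if (bpf_xdp_adjust_meta(ctx, -20))
			return XDP_PASS;

		meta = (void *)(long)ctx->data_meta;
		data = (void *)(long)ctx->data;
		if (meta + 20 > data)	/* bounds check for the verifier */
			return XDP_PASS;

		/* 'meta' now points at the inline virtio header, which the
		 * program can read - or overwrite */
		return XDP_PASS;
	}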

@Michael, @Jason, I guess the above is like that by design? Isn't it a
bit fragile?

Thanks!

Paolo




Re: [PATCH net-next v5] virtio_net: Support RX hash XDP hint

2024-02-07 Thread Paolo Abeni
On Wed, 2024-02-07 at 10:54 +0800, Liang Chen wrote:
> On Tue, Feb 6, 2024 at 6:44 PM Paolo Abeni  wrote:
> > 
> > On Sat, 2024-02-03 at 10:56 +0800, Liang Chen wrote:
> > > On Sat, Feb 3, 2024 at 12:20 AM Jesper Dangaard Brouer  
> > > wrote:
> > > > On 02/02/2024 13.11, Liang Chen wrote:
> > [...]
> > > > > @@ -1033,6 +1039,16 @@ static void put_xdp_frags(struct xdp_buff *xdp)
> > > > >   }
> > > > >   }
> > > > > 
> > > > > +static void virtnet_xdp_save_rx_hash(struct virtnet_xdp_buff 
> > > > > *virtnet_xdp,
> > > > > +  struct net_device *dev,
> > > > > +  struct virtio_net_hdr_v1_hash 
> > > > > *hdr_hash)
> > > > > +{
> > > > > + if (dev->features & NETIF_F_RXHASH) {
> > > > > + virtnet_xdp->hash_value = hdr_hash->hash_value;
> > > > > + virtnet_xdp->hash_report = hdr_hash->hash_report;
> > > > > + }
> > > > > +}
> > > > > +
> > > > 
> > > > Would it be possible to store a pointer to hdr_hash in virtnet_xdp_buff,
> > > > with the purpose of delaying extracting this, until and only if XDP
> > > > bpf_prog calls the kfunc?
> > > > 
> > > 
> > > That seems to be the way v1 works,
> > > https://lore.kernel.org/all/20240122102256.261374-1-liangchen.li...@gmail.com/
> > > . But it was pointed out that the inline header may be overwritten by
> > > the xdp prog, so the hash is copied out to maintain its integrity.
> > 
> > Why? isn't XDP supposed to get write access only to the pkt
> > contents/buffer?
> > 
> 
> Normally, an XDP program accesses only the packet data. However,
> there's also an XDP RX Metadata area, referenced by the data_meta
> pointer. This pointer can be adjusted with bpf_xdp_adjust_meta to
> point somewhere ahead of the data buffer, thereby granting the XDP
> program access to the virtio header located immediately before the

AFAICS bpf_xdp_adjust_meta() does not allow moving the meta_data before
xdp->data_hard_start:

https://elixir.bootlin.com/linux/latest/source/net/core/filter.c#L4210

and virtio net set such field after the virtio_net_hdr:

https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1218
https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1420

I don't see how the virtio hdr could be touched. Possibly even more
important: if such a thing is possible, I think it should be somehow
prevented (for the same reason a H/W NIC should prevent XDP from
modifying its own buffer descriptors).

Cheers,

Paolo




Re: [PATCH net-next v3] net: dqs: add NIC stall detector based on BQL

2024-02-06 Thread Paolo Abeni
On Fri, 2024-02-02 at 08:53 -0800, Breno Leitao wrote:
> From: Jakub Kicinski 
> 
> softnet_data->time_squeeze is sometimes used as a proxy for
> host overload or indication of scheduling problems. In practice
> this statistic is very noisy and has hard to grasp units -
> e.g. is 10 squeezes a second to be expected, or high?
> 
> Delaying network (NAPI) processing leads to drops on NIC queues
> but also RTT bloat, impacting pacing and CA decisions.
> Stalls are a little hard to detect on the Rx side, because
> there may simply have not been any packets received in a given
> period of time. Packet timestamps help a little bit, but
> again we don't know if packets are stale because we're
> not keeping up or because someone (*cough* cgroups)
> disabled IRQs for a long time.
> 
> We can, however, use Tx as a proxy for Rx stalls. Most drivers
> use combined Rx+Tx NAPIs so if Tx gets starved so will Rx.
> On the Tx side we know exactly when packets get queued,
> and completed, so there is no uncertainty.
> 
> This patch adds stall checks to BQL. Why BQL? Because
> it's a convenient place to add such checks, already
> called by most drivers, and it has copious free space
> in its structures (this patch adds no extra cache
> references or dirtying to the fast path).
> 
> The algorithm takes one parameter - max delay AKA stall
> threshold and increments a counter whenever NAPI got delayed
> for at least that amount of time. It also records the length
> of the longest stall.
> 
> To be precise: every time NAPI has not polled for at least
> stall_thrs, we check if there were any Tx packets queued
> between the last NAPI run and now - stall_thrs/2.
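
A rough sketch of that check, with invented variable names (the actual
implementation lives in lib/dynamic_queue_limits.c):

	/* NAPI has not completed Tx for at least stall_thrs, and something
	 * was queued no later than now - stall_thrs/2: count a stall
	 */
	if (time_after_eq(now, last_reap + stall_thrs) &&
	    time_in_range(last_queue_event, last_reap, now - stall_thrs / 2))
		stall_cnt++;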
> 
> Unlike the classic Tx watchdog this mechanism does not
> ignore stalls caused by Tx being disabled, or loss of link.
> I don't think the check is worth the complexity, and
> stall is a stall, whether due to host overload, flow
> control, link down... doesn't matter much to the application.
> 
> We have been running this detector in production at Meta
> for 2 years, with the threshold of 8ms. It's the lowest
> value where false positives become rare. There's still
> a constant stream of reported stalls (especially without
> the ksoftirqd deferral patches reverted), those who like
> their stall metrics to be 0 may prefer higher value.
> 
> Signed-off-by: Jakub Kicinski 
> Signed-off-by: Breno Leitao 
> ---
> Changelog:
> 
> v1:
>   * https://lore.kernel.org/netdev/202306172057.jx7yhliu-...@intel.com/T/
> 
> v2:
>   * Fix the documentation file in patch 0001, since patch 0002 will
> touch it later.
>   * Fix the kernel test robot issues, marking functions as statics.
>   * Use #include  instead of .
>   * Added some new comments around, mainly around barriers.
>   * Format struct `netdev_queue_attribute` assignments to make
> checkpatch happy.
>   * Updated and fixed the path in sysfs-class-net-queues
> documentation.
> 
> v3:
>   * Sending patch 0002 against net-next.
>   - The first patch was accepted against 'net'
> 
> ---
>  .../ABI/testing/sysfs-class-net-queues| 23 +++
>  include/linux/dynamic_queue_limits.h  | 35 +++
>  include/trace/events/napi.h   | 33 ++
>  lib/dynamic_queue_limits.c| 58 +
>  net/core/net-sysfs.c  | 62 +++
>  5 files changed, 211 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-class-net-queues 
> b/Documentation/ABI/testing/sysfs-class-net-queues
> index 5bff64d256c2..45bab9b6e3ae 100644
> --- a/Documentation/ABI/testing/sysfs-class-net-queues
> +++ b/Documentation/ABI/testing/sysfs-class-net-queues
> @@ -96,3 +96,26 @@ Description:
>   Indicates the absolute minimum limit of bytes allowed to be
>   queued on this network device transmit queue. Default value is
>   0.
> +
> +What:		/sys/class/net/<dev>/queues/tx-<queue>/byte_queue_limits/stall_thrs
> +Date:		Jan 2024
> +KernelVersion:	6.9
> +Contact:	net...@vger.kernel.org
> +Description:
> +		Tx completion stall detection threshold.

Possibly worth mentioning it's in milliseconds

> +		Kernel will guarantee
> + to detect all stalls longer than this threshold but may also
> + detect stalls longer than half of the threshold.
> +
> +What:		/sys/class/net/<dev>/queues/tx-<queue>/byte_queue_limits/stall_cnt
> +Date:		Jan 2024
> +KernelVersion:	6.9
> +Contact:	net...@vger.kernel.org
> +Description:
> +		Number of detected Tx completion stalls.
> +
> +What:		/sys/class/net/<dev>/queues/tx-<queue>/byte_queue_limits/stall_max
> +Date:		Jan 2024
> +KernelVersion:	6.9
> +Contact:	net...@vger.kernel.org
> +Description:
> +		Longest detected Tx completion stall. Write 0 to clear.
> diff --git a/include/linux/dynamic_queue_limits.h 
> b/include/linux/dynamic_queue_limits.h
> index 

Re: [PATCH net-next v5] virtio_net: Support RX hash XDP hint

2024-02-06 Thread Paolo Abeni
On Sat, 2024-02-03 at 10:56 +0800, Liang Chen wrote:
> On Sat, Feb 3, 2024 at 12:20 AM Jesper Dangaard Brouer  
> wrote:
> > On 02/02/2024 13.11, Liang Chen wrote:
[...]
> > > @@ -1033,6 +1039,16 @@ static void put_xdp_frags(struct xdp_buff *xdp)
> > >   }
> > >   }
> > > 
> > > +static void virtnet_xdp_save_rx_hash(struct virtnet_xdp_buff 
> > > *virtnet_xdp,
> > > +  struct net_device *dev,
> > > +  struct virtio_net_hdr_v1_hash 
> > > *hdr_hash)
> > > +{
> > > + if (dev->features & NETIF_F_RXHASH) {
> > > + virtnet_xdp->hash_value = hdr_hash->hash_value;
> > > + virtnet_xdp->hash_report = hdr_hash->hash_report;
> > > + }
> > > +}
> > > +
> > 
> > Would it be possible to store a pointer to hdr_hash in virtnet_xdp_buff,
> > with the purpose of delaying extracting this, until and only if XDP
> > bpf_prog calls the kfunc?
> > 
> 
> That seems to be the way v1 works,
> https://lore.kernel.org/all/20240122102256.261374-1-liangchen.li...@gmail.com/
> . But it was pointed out that the inline header may be overwritten by
> the xdp prog, so the hash is copied out to maintain its integrity.

Why? isn't XDP supposed to get write access only to the pkt
contents/buffer?

if the XDP program can really change the virtio_net_hdr, that looks
potentially dangerous/bug prone regardless of this patch.

Thanks,

Paolo




Re: [PATCH net-next v4 2/5] page_frag: unify gfp bits for order 3 page allocation

2024-02-02 Thread Paolo Abeni
On Fri, 2024-02-02 at 10:10 +0800, Yunsheng Lin wrote:
> On 2024/2/1 21:16, Paolo Abeni wrote:
> 
> > from the __page_frag_cache_refill() allocator - which never accesses
> > the memory reserves.
> 
> I am not really sure I understand the above comment.
> The semantics are the same as skb_page_frag_refill(), as explained
> above, as far as I understand. Note that __page_frag_cache_refill()
> uses 'gfp_mask' for allocating order-3 pages and the original 'gfp'
> for allocating order-0 pages.

You are right! I got fooled misreading 'gfp' as 'gfp_mask' in there.

> > I'm unsure we want to propagate the __page_frag_cache_refill behavior
> > here; the current behavior could be required by some systems.
> > 
> > It looks like this series still leaves the skb_page_frag_refill()
> > allocator alone; what about dropping this chunk, too?
> 
> As explained above, I would prefer to keep it as it is, since it seems
> quite obvious that we can avoid possible pressure on mm by not using
> memory reserves for order-3 pages, as we have the fallback to order-0
> pages.
> 
> Please let me know if there is anything obvious I missed.
> 

I still think/fear that behaviour changes here could have
subtle/negative side effects - even if I agree the change looks safe.

I think the series without this patch would still achieve its goals and
would be much less controversial. What about moving this patch to a
standalone follow-up?

Thanks!

Paolo




Re: [PATCH net-next v4 2/5] page_frag: unify gfp bits for order 3 page allocation

2024-02-01 Thread Paolo Abeni
On Tue, 2024-01-30 at 19:37 +0800, Yunsheng Lin wrote:
> Currently there seem to be three page frag implementations
> which all try to allocate an order-3 page; if that fails, they
> then fall back to allocating an order-0 page, and each of them
> allows the order-3 page allocation to fail under certain
> conditions by using specific gfp bits.
> 
> The gfp bits for order-3 page allocation are different
> between the different implementations: __GFP_NOMEMALLOC is
> or'd in to forbid access to emergency reserve memory for
> __page_frag_cache_refill(), but it is not or'd in the other
> implementations; __GFP_DIRECT_RECLAIM is masked off to avoid
> direct reclaim in skb_page_frag_refill(), but it is not
> masked off in __page_frag_cache_refill().
> 
> This patch unifies the gfp bits used between the different
> implementations by or'ing in __GFP_NOMEMALLOC and masking off
> __GFP_DIRECT_RECLAIM for order-3 page allocation, to avoid
> possible pressure on mm.
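
I.e., after the patch all three implementations build the same order-3
mask; schematically:

	gfp_t order3_gfp = (gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
			   __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;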
> 
> Signed-off-by: Yunsheng Lin 
> Reviewed-by: Alexander Duyck 
> CC: Alexander Duyck 
> ---
>  drivers/vhost/net.c | 2 +-
>  mm/page_alloc.c | 4 ++--
>  net/core/sock.c | 2 +-
>  3 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index f2ed7167c848..e574e21cc0ca 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -670,7 +670,7 @@ static bool vhost_net_page_frag_refill(struct vhost_net 
> *net, unsigned int sz,
>   /* Avoid direct reclaim but allow kswapd to wake */
>   pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
> __GFP_COMP | __GFP_NOWARN |
> -   __GFP_NORETRY,
> +   __GFP_NORETRY | __GFP_NOMEMALLOC,
> SKB_FRAG_PAGE_ORDER);

>   if (likely(pfrag->page)) {
>   pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c0f7e67c4250..636145c29f70 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4685,8 +4685,8 @@ static struct page *__page_frag_cache_refill(struct 
> page_frag_cache *nc,
>   gfp_t gfp = gfp_mask;
>  
>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> - gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
> - __GFP_NOMEMALLOC;
> + gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
> +__GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
>   page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
>   PAGE_FRAG_CACHE_MAX_ORDER);
>   nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 88bf810394a5..8289a3d8c375 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2919,7 +2919,7 @@ bool skb_page_frag_refill(unsigned int sz, struct 
> page_frag *pfrag, gfp_t gfp)
>   /* Avoid direct reclaim but allow kswapd to wake */
>   pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
> __GFP_COMP | __GFP_NOWARN |
> -   __GFP_NORETRY,
> +   __GFP_NORETRY | __GFP_NOMEMALLOC,
> SKB_FRAG_PAGE_ORDER);

This will prevent memory reserve usage when allocating order 3 pages,
but not when allocating a single page as a fallback. Still different
from the __page_frag_cache_refill() allocator - which never accesses
the memory reserves.
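
For reference, the order-0 fallback just below the quoted hunk keeps
using the caller's gfp (quoting from memory, not the exact source):

	/* fallback: order-0 allocation still uses the caller's gfp and
	 * so may still access the memory reserves
	 */
	pfrag->page = alloc_page(gfp);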

I'm unsure we want to propagate the __page_frag_cache_refill behavior
here; the current behavior could be required by some systems.

It looks like this series still leaves the skb_page_frag_refill()
allocator alone; what about dropping this chunk, too?

Thanks!

Paolo




Re: [PATCH] net: make config lines follow common pattern

2023-11-27 Thread Paolo Abeni
On Thu, 2023-11-23 at 12:12 +0100, Lukas Bulwahn wrote:
> The Kconfig parser is quite relaxed when parsing config definition lines.
> However, there are just a few config definition lines that do not follow
> the common regular expression 'config [0-9A-Z]', i.e., there are only a few
> cases where config is not followed by exactly one whitespace.
> 
> To simplify life for kernel developers that use basic regular expressions
> to find and extract kernel configs, make all config lines follow this
> common pattern.
> 
> No functional change, just helpful stylistic clean-up.
> 
> Signed-off-by: Lukas Bulwahn 

IMHO this is a bit too much noise for a small gain: simple REs can
match all the existing patterns with 100% accuracy.

I think this should be dropped.

Cheers,

Paolo




Re: [PATCH] udp: Add support for getsockopt(..., ..., UDP_GRO, ..., ...)

2021-03-31 Thread Paolo Abeni
On Wed, 2021-03-31 at 15:10 +0200, Norman Maurer wrote:
> As this missing change was most likely an oversight in the original
> commit I do think it should go into 5.12 and subsequently stable as
> well. That’s also the reason why I didn’t send a v2 and changed the
> commit message / subject for the patch. For me it clearly is a bug
> and not a new feature.

I have no strong opinion against that (sorry, I hoped that was clear in
my reply).

Please go ahead.

Thanks,

Paolo



Re: [PATCH] udp: Add support for getsockopt(..., ..., UDP_GRO, ..., ...)

2021-03-26 Thread Paolo Abeni
On Fri, 2021-03-26 at 11:22 +0100, Norman Maurer wrote:
> On 26. Mar 2021, at 10:36, Paolo Abeni  wrote:
> > One thing you can do to simplify the maintainer's life would be to post
> > a v2 with the correct tag (and eventually obsolete this patch in patchwork).
> 
> I am quite new to contribute patches to the kernel so I am not sure
> how I would “obsolete” this patch and make a v2. If you can give me
> some pointers I am happy to do so.

Well, I actually gave you bad advice about fiddling with patchwork.

The authoritative documentation:

Documentation/networking/netdev-FAQ.rst

(inside the kernel tree) suggests to avoid it.

Just posting a v2 will suffice.

Thanks!

Paolo



Re: [PATCH] udp: Add support for getsockopt(..., ..., UDP_GRO, ..., ...)

2021-03-26 Thread Paolo Abeni
Hello,

On Thu, 2021-03-25 at 20:56 +0100, Norman Maurer wrote:
> From: Norman Maurer 
> 
> Support for UDP_GRO was added in the past but the implementation for
> getsockopt was missed which did lead to an error when we tried to
> retrieve the setting for UDP_GRO. This patch adds the missing switch
> case for UDP_GRO
> 
> Fixes: e20cf8d3f1f7 ("udp: implement GRO for plain UDP sockets.")
> Signed-off-by: Norman Maurer 

The patch LGTM, but please cc the blamed commit author when you add
a 'Fixes' tag (me in this case ;)

Also please specify a target tree, either 'net' or 'net-next', in the
patch subj. Being declared as a fix, this should target 'net'.

One thing you can do to simplify the maintainer's life would be to post
a v2 with the correct tag (and eventually obsolete this patch in patchwork).

Side note: I personally think this is more a new feature (it adds
getsockopt support for UDP_GRO) than a fix, so I would not have added
the 'Fixes' tag and I would have targeted net-next, but it's just my
opinion.
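
I.e., the subject prefix distinguishes the target tree (illustrative
subjects):

	[PATCH net] udp: fix getsockopt(UDP_GRO)
	[PATCH net-next] udp: add getsockopt support for UDP_GRO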

Cheers,

Paolo



Re: [PATCH net-next 2/4] gro: add combined call_gro_receive() + INDIRECT_CALL_INET() helper

2021-03-19 Thread Paolo Abeni
On Fri, 2021-03-19 at 11:43 +0000, Alexander Lobakin wrote:
> I'm not sure if you did it on purpose in commit aaa5d90b395a7
> ("net: use indirect call wrappers at GRO network layer").
> Was that intentional 

I must admit that, 2y+ later, my own intentions are not so clear to me
either ;)

> for the sake of more optimized path for the
> kernels with moduled IPv6, 

Uhm... no, I guess that was more an oversight on my side.

> or I can replace INDIRECT_CALL_INET()
> with INDIRECT_CALL_2() here too? 

If that builds with IPV6={n,m,y}, I would say yes.

> I want to keep GRO callbacks that
> make use of indirect call wrappers unified.

L4 will still need some special handling as ipv6 udp gro callbacks are
not builtin with CONFIG_IPV6=m :(
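
For reference, the wrappers expand roughly as follows (paraphrased from
include/linux/indirect_call_wrapper.h):

	#define INDIRECT_CALL_1(f, f1, ...)				\
		(likely(f == f1) ? f1(__VA_ARGS__) : f(__VA_ARGS__))
	#define INDIRECT_CALL_2(f, f2, f1, ...)				\
		(likely(f == f2) ? f2(__VA_ARGS__) :			\
				   INDIRECT_CALL_1(f, f1, __VA_ARGS__))

INDIRECT_CALL_INET() maps to INDIRECT_CALL_2() only when CONFIG_IPV6 is
builtin, and degrades to a single comparison otherwise - hence the
different behavior with IPV6=m.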

Cheers,

Paolo



Re: [PATCH net-next 2/4] gro: add combined call_gro_receive() + INDIRECT_CALL_INET() helper

2021-03-19 Thread Paolo Abeni
Hello,

On Thu, 2021-03-18 at 18:42 +0000, Alexander Lobakin wrote:
> call_gro_receive() is used to limit GRO recursion, but it works only
> with callback pointers.
> There's a combined version of call_gro_receive() + INDIRECT_CALL_2()
> in , but it doesn't check for IPv6 modularity.

AFAICS, ip6_offload is builtin even when IPv6 is a module, so the above
should not be needed.

Cheers,

Paolo



Re: possible deadlock in ipv6_sock_mc_close

2021-03-01 Thread Paolo Abeni
Hello,

On Mon, 2021-03-01 at 14:52 +0000, Chuck Lever wrote:
> > On Mar 1, 2021, at 8:49 AM, syzbot 
> >  wrote:
> > 
> > Hello,
> > 
> > syzbot found the following issue on:
> > 
> > HEAD commit:    eee7ede6 Merge branch 'bnxt_en-error-recovery-bug-fixes'
> > git tree:   net
> > console output: https://syzkaller.appspot.com/x/log.txt?x=123ad632d0
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=e2d5ba72abae4f14
> > dashboard link: https://syzkaller.appspot.com/bug?extid=e2fa57709a385e6db10f
> > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=109d89b6d0
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=12e9e0dad0
> > 
> > The issue was bisected to:
> > 
> > commit c8e88e3aa73889421461f878cd569ef84f231ceb
> > Author: Chuck Lever 
> > Date:   Tue Nov 3 20:06:04 2020 +
> > 
> >NFSD: Replace READ* macros in nfsd4_decode_layoutget()
> > 
> > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=13bef9ccd0
> > final oops: https://syzkaller.appspot.com/x/report.txt?x=107ef9ccd0
> > console output: https://syzkaller.appspot.com/x/log.txt?x=17bef9ccd0
> > 
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+e2fa57709a385e6db...@syzkaller.appspotmail.com
> > Fixes: c8e88e3aa738 ("NFSD: Replace READ* macros in 
> > nfsd4_decode_layoutget()")
> > 
> > ==
> > WARNING: possible circular locking dependency detected
> > 5.11.0-syzkaller #0 Not tainted
> > --
> > syz-executor905/8822 is trying to acquire lock:
> > 8d678fe8 (rtnl_mutex){+.+.}-{3:3}, at: 
> > ipv6_sock_mc_close+0xd7/0x110 net/ipv6/mcast.c:323
> > 
> > but task is already holding lock:
> > 888024390120 (sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock 
> > include/net/sock.h:1600 [inline]
> > 888024390120 (sk_lock-AF_INET6){+.+.}-{0:0}, at: 
> > mptcp6_release+0x57/0x130 net/mptcp/protocol.c:3507
> > 
> > which lock already depends on the new lock.
> 
> Hi, thanks for the report.
> 
> Initial analysis:
> 
> c8e88e3aa738 ("NFSD: Replace READ* macros in nfsd4_decode_layoutget()"
> changes code several layers above the network layer. In addition,
> neither of the stack traces contain NFSD functions. And, repro.c does
> not appear to exercise any filesystem code.
> 
> Therefore the bisect result looks implausible to me. I don't see any
> obvious connection between the lockdep splat and c8e88e3aa738. (If
> someone else does, please let me know where to look).

I agree the bisect result is unexpected.

This really looks like an MPTCP-specific issue, likely introduced by:

32fcc880e0a9 ("mptcp: provide subflow aware release function")

and should be fixed inside MPTCP.

Cheers,

Paolo



Re: possible deadlock in mptcp_push_pending

2021-02-18 Thread Paolo Abeni
On Wed, 2021-02-17 at 09:31 -0800, syzbot wrote:
> syzbot found the following issue on:
> 
> HEAD commit:    c48f8607 Merge branch 'PTP-for-DSA-tag_ocelot_8021q'
> git tree:   net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=16525cb0d0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=dbc1ca9e55dc1f9f
> dashboard link: https://syzkaller.appspot.com/bug?extid=d1b1723faccb7a43f6d1
> 
> Unfortunately, I don't have any reproducer for this issue yet.
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+d1b1723faccb7a43f...@syzkaller.appspotmail.com
> 
> 
> WARNING: possible recursive locking detected
> 5.11.0-rc7-syzkaller #0 Not tainted
> 
> syz-executor.1/15600 is trying to acquire lock:
> 888057303220 (sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock 
> include/net/sock.h:1598 [inline]
> 888057303220 (sk_lock-AF_INET6){+.+.}-{0:0}, at: 
> mptcp_push_pending+0x28b/0x650 net/mptcp/protocol.c:1466

Even this one is suspected to be a dup of 'WARNING in dst_release': the
subflow socket lock family is reported to be 'sk_lock-AF_INET6', but
subflows are created in the kernel and get 'k-sk_lock-AF_INET6'. This
looks like [re]use after free, likely via msk->first, as in the
suspected dup issue. Lacking a repro, I'm not 110% sure.

@Dmitry, I'm wondering which is the preferred course of action here:
tentatively marking this one as a dup, or leaving it alone till we get
a reproducer?

Thanks!

Paolo



Re: KASAN: use-after-free Read in tcp_current_mss

2021-02-17 Thread Paolo Abeni
On Wed, 2021-02-17 at 07:36 -0800, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:    773dc50d Merge branch 'Xilinx-axienet-updates'
> git tree:   net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=13460822d0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=dbc1ca9e55dc1f9f
> dashboard link: https://syzkaller.appspot.com/bug?extid=ea948c9d0dedf6ff57b1
> 
> Unfortunately, I don't have any reproducer for this issue yet.
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+ea948c9d0dedf6ff5...@syzkaller.appspotmail.com
> 
> ==
> BUG: KASAN: use-after-free in dst_mtu include/net/dst.h:201 [inline]
> BUG: KASAN: use-after-free in tcp_current_mss+0x358/0x360 
> net/ipv4/tcp_output.c:1835
> Read of size 8 at addr 88802943db08 by task syz-executor.2/11568
> 
> CPU: 0 PID: 11568 Comm: syz-executor.2 Not tainted 5.11.0-rc7-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:79 [inline]
>  dump_stack+0x107/0x163 lib/dump_stack.c:120
>  print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:230
>  __kasan_report mm/kasan/report.c:396 [inline]
>  kasan_report.cold+0x79/0xd5 mm/kasan/report.c:413
>  dst_mtu include/net/dst.h:201 [inline]
>  tcp_current_mss+0x358/0x360 net/ipv4/tcp_output.c:1835
>  tcp_send_mss+0x28/0x2b0 net/ipv4/tcp.c:943
>  mptcp_sendmsg_frag+0x13b/0x1220 net/mptcp/protocol.c:1266
>  mptcp_push_pending+0x2cc/0x650 net/mptcp/protocol.c:1477
>  mptcp_sendmsg+0x1ffb/0x2830 net/mptcp/protocol.c:1692
>  inet6_sendmsg+0x99/0xe0 net/ipv6/af_inet6.c:638
>  sock_sendmsg_nosec net/socket.c:652 [inline]
>  sock_sendmsg+0xcf/0x120 net/socket.c:672
>  sock_write_iter+0x289/0x3c0 net/socket.c:999
>  call_write_iter include/linux/fs.h:1901 [inline]
>  new_sync_write+0x426/0x650 fs/read_write.c:518
>  vfs_write+0x791/0xa30 fs/read_write.c:605
>  ksys_write+0x1ee/0x250 fs/read_write.c:658
>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x465d99
> Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 
> 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 
> 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
> RSP: 002b:7fc692d89188 EFLAGS: 0246 ORIG_RAX: 0001
> RAX: ffda RBX: 0056c0b0 RCX: 00465d99
> RDX: 0001 RSI: 2000 RDI: 0003
> RBP: 004bcf27 R08:  R09: 
> R10:  R11: 0246 R12: 0056c0b0
> R13: 7ffc258487df R14: 7fc692d89300 R15: 00022000
> 
> Allocated by task 11558:
>  kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
>  kasan_set_track mm/kasan/common.c:46 [inline]
>  set_alloc_info mm/kasan/common.c:401 [inline]
>  kasan_kmalloc.constprop.0+0x82/0xa0 mm/kasan/common.c:429
>  kasan_slab_alloc include/linux/kasan.h:209 [inline]
>  slab_post_alloc_hook mm/slab.h:512 [inline]
>  slab_alloc_node mm/slub.c:2892 [inline]
>  slab_alloc mm/slub.c:2900 [inline]
>  kmem_cache_alloc+0x1c6/0x440 mm/slub.c:2905
>  dst_alloc+0x9e/0x650 net/core/dst.c:93
>  rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1642
>  __mkroute_output net/ipv4/route.c:2457 [inline]
>  ip_route_output_key_hash_rcu+0x955/0x2ce0 net/ipv4/route.c:2684
>  ip_route_output_key_hash+0x1a4/0x2f0 net/ipv4/route.c:2512
>  __ip_route_output_key include/net/route.h:126 [inline]
>  ip_route_output_flow+0x23/0x150 net/ipv4/route.c:2773
>  ip_route_newports include/net/route.h:342 [inline]
>  tcp_v4_connect+0x12d7/0x1c40 net/ipv4/tcp_ipv4.c:281
>  tcp_v6_connect+0x733/0x1df0 net/ipv6/tcp_ipv6.c:248
>  __inet_stream_connect+0x8c5/0xee0 net/ipv4/af_inet.c:661
>  inet_stream_connect+0x53/0xa0 net/ipv4/af_inet.c:725
>  mptcp_stream_connect+0x156/0x800 net/mptcp/protocol.c:3200
>  __sys_connect_file+0x155/0x1a0 net/socket.c:1835
>  __sys_connect+0x161/0x190 net/socket.c:1852
>  __do_sys_connect net/socket.c:1862 [inline]
>  __se_sys_connect net/socket.c:1859 [inline]
>  __x64_sys_connect+0x6f/0xb0 net/socket.c:1859
>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> Freed by task 11559:
>  kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
>  kasan_set_track+0x1c/0x30 mm/kasan/common.c:46
>  kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:356
>  kasan_slab_free+0xe1/0x110 mm/kasan/common.c:362
>  kasan_slab_free include/linux/kasan.h:192 [inline]
>  slab_free_hook mm/slub.c:1547 [inline]
>  slab_free_freelist_hook+0x5d/0x150 mm/slub.c:1580
>  slab_free mm/slub.c:3143 [inline]
>  kmem_cache_free+0x82/0x350 mm/slub.c:3159
>  dst_destroy+0x2bc/0x3c0 net/core/dst.c:129
>  rcu_do_batch kernel/rcu/tree.c:2489 [inline]
>  rcu_core+0x5eb/0xf00 

Re: [MPTCP] KASAN: use-after-free Read in mptcp_established_options

2021-02-17 Thread Paolo Abeni
On Wed, 2021-02-17 at 09:30 -0800, syzbot wrote:
> syzbot found the following issue on:
> 
> HEAD commit:    966df6de lan743x: sync only the received area of an rx rin..
> git tree:   net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=11afe082d0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=dbc1ca9e55dc1f9f
> dashboard link: https://syzkaller.appspot.com/bug?extid=3c1e5ab4997849b69807
> 
> Unfortunately, I don't have any reproducer for this issue yet.
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+3c1e5ab4997849b69...@syzkaller.appspotmail.com
> 
> ==
> BUG: KASAN: use-after-free in mptcp_check_fallback net/mptcp/protocol.h:745 
> [inline]
> BUG: KASAN: use-after-free in mptcp_established_options+0x22cf/0x2780 
> net/mptcp/options.c:724
> Read of size 8 at addr 88802bea10a0 by task syz-executor.1/11042
> 
> CPU: 1 PID: 11042 Comm: syz-executor.1 Not tainted 5.11.0-rc7-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:79 [inline]
>  dump_stack+0x107/0x163 lib/dump_stack.c:120
>  print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:230
>  __kasan_report mm/kasan/report.c:396 [inline]
>  kasan_report.cold+0x79/0xd5 mm/kasan/report.c:413
>  mptcp_check_fallback net/mptcp/protocol.h:745 [inline]
>  mptcp_established_options+0x22cf/0x2780 net/mptcp/options.c:724
>  tcp_established_options+0x4ed/0x700 net/ipv4/tcp_output.c:953
>  tcp_current_mss+0x1d2/0x360 net/ipv4/tcp_output.c:1840
>  tcp_send_mss+0x28/0x2b0 net/ipv4/tcp.c:943
>  mptcp_sendmsg_frag+0x13b/0x1220 net/mptcp/protocol.c:1266
>  mptcp_push_pending+0x2cc/0x650 net/mptcp/protocol.c:1477
>  mptcp_sendmsg+0xde4/0x2830 net/mptcp/protocol.c:1685
>  inet6_sendmsg+0x99/0xe0 net/ipv6/af_inet6.c:642
>  sock_sendmsg_nosec net/socket.c:652 [inline]
>  sock_sendmsg+0xcf/0x120 net/socket.c:672
>  sock_write_iter+0x289/0x3c0 net/socket.c:999
>  call_write_iter include/linux/fs.h:1901 [inline]
>  new_sync_write+0x426/0x650 fs/read_write.c:518
>  vfs_write+0x791/0xa30 fs/read_write.c:605
>  ksys_write+0x1ee/0x250 fs/read_write.c:658
>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x465d99
> Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 
> 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 
> 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
> RSP: 002b:7ff231ccc188 EFLAGS: 0246 ORIG_RAX: 0001
> RAX: ffda RBX: 0056c008 RCX: 00465d99
> RDX: 0003f9b4 RSI: 2000 RDI: 0004
> RBP: 004bcf27 R08:  R09: 
> R10:  R11: 0246 R12: 0056c008
> R13: 7ffeaa2da27f R14: 7ff231ccc300 R15: 00022000
> 
> Allocated by task 11017:
>  kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
>  kasan_set_track mm/kasan/common.c:46 [inline]
>  set_alloc_info mm/kasan/common.c:401 [inline]
>  kasan_kmalloc.constprop.0+0x82/0xa0 mm/kasan/common.c:429
>  kmalloc include/linux/slab.h:552 [inline]
>  kzalloc include/linux/slab.h:682 [inline]
>  subflow_create_ctx+0x82/0x230 net/mptcp/subflow.c:1378
>  subflow_ulp_init+0x62/0x370 net/mptcp/subflow.c:1459
>  __tcp_set_ulp net/ipv4/tcp_ulp.c:139 [inline]
>  tcp_set_ulp+0x27c/0x610 net/ipv4/tcp_ulp.c:160
>  mptcp_subflow_create_socket+0x5bf/0xe20 net/mptcp/subflow.c:1343
>  __mptcp_socket_create net/mptcp/protocol.c:110 [inline]
>  mptcp_init_sock net/mptcp/protocol.c:2365 [inline]
>  mptcp_init_sock+0x140/0x830 net/mptcp/protocol.c:2350
>  inet6_create net/ipv6/af_inet6.c:256 [inline]
>  inet6_create+0xa15/0x1010 net/ipv6/af_inet6.c:110
>  __sock_create+0x3de/0x780 net/socket.c:1406
>  sock_create net/socket.c:1457 [inline]
>  __sys_socket+0xef/0x200 net/socket.c:1499
>  __do_sys_socket net/socket.c:1508 [inline]
>  __se_sys_socket net/socket.c:1506 [inline]
>  __x64_sys_socket+0x6f/0xb0 net/socket.c:1506
>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> Freed by task 10650:
>  kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
>  kasan_set_track+0x1c/0x30 mm/kasan/common.c:46
>  kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:356
>  kasan_slab_free+0xe1/0x110 mm/kasan/common.c:362
>  kasan_slab_free include/linux/kasan.h:192 [inline]
>  slab_free_hook mm/slub.c:1547 [inline]
>  slab_free_freelist_hook+0x5d/0x150 mm/slub.c:1580
>  slab_free mm/slub.c:3143 [inline]
>  kmem_cache_free_bulk mm/slub.c:3269 [inline]
>  kmem_cache_free_bulk+0x253/0xc80 mm/slub.c:3256
>  kfree_bulk include/linux/slab.h:409 [inline]
>  kfree_rcu_work+0x4cd/0x860 kernel/rcu/tree.c:3226
>  process_one_work+0x98d/0x15f0 kernel/workqueue.c:2275
>  

Re: [PATCH v4 net-next 09/11] skbuff: allow to optionally use NAPI cache from __alloc_skb()

2021-02-11 Thread Paolo Abeni
On Thu, 2021-02-11 at 14:28 +0000, Alexander Lobakin wrote:
> From: Paolo Abeni  on Thu, 11 Feb 2021 11:16:40 +0100 
> wrote:
> > What about changing __napi_alloc_skb() to always use
> > the __napi_build_skb(), for both kmalloc and page backed skbs? That is,
> > always doing the 'data' allocation in __napi_alloc_skb() - either via
> > page_frag or via kmalloc() - and than call __napi_build_skb().
> > 
> > I think that should avoid adding more checks in __alloc_skb() and
> > should probably reduce the number of conditional used
> > by __napi_alloc_skb().
> 
> I thought of this too. But this will introduce a conditional branch
> to set or not set skb->head_frag. So one branch less in __alloc_skb(),
> one branch more here, and we also lose the ability to __alloc_skb()
> with a decached head.

Just to try to be clear, I mean something like the following (not even
build tested). In the fast path it has fewer branches than the current
code - for both kmalloc and page_frag allocations.

---
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 785daff48030..a242fbe4730e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -506,23 +506,12 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 				 gfp_t gfp_mask)
 {
 	struct napi_alloc_cache *nc;
+	bool head_frag, pfmemalloc;
 	struct sk_buff *skb;
 	void *data;
 
 	len += NET_SKB_PAD + NET_IP_ALIGN;
 
-	/* If requested length is either too small or too big,
-	 * we use kmalloc() for skb->head allocation.
-	 */
-	if (len <= SKB_WITH_OVERHEAD(1024) ||
-	    len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
-	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
-		skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
-		if (!skb)
-			goto skb_fail;
-		goto skb_success;
-	}
-
 	nc = this_cpu_ptr(&napi_alloc_cache);
 	len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 	len = SKB_DATA_ALIGN(len);
@@ -530,25 +519,34 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	if (sk_memalloc_socks())
 		gfp_mask |= __GFP_MEMALLOC;
 
-	data = page_frag_alloc(&nc->page, len, gfp_mask);
+	if (len <= SKB_WITH_OVERHEAD(1024) ||
+	    len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
+	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
+		data = kmalloc_reserve(len, gfp_mask, NUMA_NO_NODE,
+				       &pfmemalloc);
+		head_frag = 0;
+		len = 0;
+	} else {
+		data = page_frag_alloc(&nc->page, len, gfp_mask);
+		pfmemalloc = nc->page.pfmemalloc;
+		head_frag = 1;
+	}
 	if (unlikely(!data))
 		return NULL;
 
 	skb = __build_skb(data, len);
 	if (unlikely(!skb)) {
-		skb_free_frag(data);
+		if (head_frag)
+			skb_free_frag(data);
+		else
+			kfree(data);
 		return NULL;
 	}
 
-	if (nc->page.pfmemalloc)
-		skb->pfmemalloc = 1;
-	skb->head_frag = 1;
+	skb->pfmemalloc = pfmemalloc;
+	skb->head_frag = head_frag;
 
-skb_success:
 	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
 	skb->dev = napi->dev;
-
-skb_fail:
 	return skb;
 }
 EXPORT_SYMBOL(__napi_alloc_skb);



Re: [PATCH v4 net-next 09/11] skbuff: allow to optionally use NAPI cache from __alloc_skb()

2021-02-11 Thread Paolo Abeni
On Wed, 2021-02-10 at 16:30 +0000, Alexander Lobakin wrote:
> Reuse the old and forgotten SKB_ALLOC_NAPI to add an option to get
> an skbuff_head from the NAPI cache instead of inplace allocation
> inside __alloc_skb().
> This implies that the function is called from softirq or BH-off
> context, not for allocating a clone or from a distant node.
> 
> Signed-off-by: Alexander Lobakin 
> ---
>  net/core/skbuff.c | 13 +
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 9e1a8ded4acc..750fa1825b28 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -397,15 +397,20 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t 
> gfp_mask,
>   struct sk_buff *skb;
>   u8 *data;
>   bool pfmemalloc;
> + bool clone;
>  
> - cache = (flags & SKB_ALLOC_FCLONE)
> - ? skbuff_fclone_cache : skbuff_head_cache;
> + clone = !!(flags & SKB_ALLOC_FCLONE);
> + cache = clone ? skbuff_fclone_cache : skbuff_head_cache;
>  
>   if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX))
>   gfp_mask |= __GFP_MEMALLOC;
>  
>   /* Get the HEAD */
> - skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
> + if (!clone && (flags & SKB_ALLOC_NAPI) &&
> + likely(node == NUMA_NO_NODE || node == numa_mem_id()))
> + skb = napi_skb_cache_get();
> + else
> + skb = kmem_cache_alloc_node(cache, gfp_mask & ~GFP_DMA, node);
>   if (unlikely(!skb))
>   return NULL;
>   prefetchw(skb);

I hoped the opt-in thing would have allowed leaving this code unchanged;
I see it's not trivial to avoid touching this code path.
Still, I think it would be nice if you were able to let the device
driver use the cache without touching the above, which is also used
e.g. by the TCP xmit path, which in turn will not leverage the cache
(as it requires FCLONE skbs).

If I read correctly, the above chunk is needed to allow
__napi_alloc_skb() to access the cache even for small skb
allocations. Good device drivers should not call alloc_skb() in the
fast path.

What about changing __napi_alloc_skb() to always use
the __napi_build_skb(), for both kmalloc and page backed skbs? That is,
always doing the 'data' allocation in __napi_alloc_skb() - either via
page_frag or via kmalloc() - and than call __napi_build_skb().

I think that should avoid adding more checks in __alloc_skb() and
should probably reduce the number of conditional used
by __napi_alloc_skb().

Thanks!

Paolo



Re: [v3 net-next 08/10] skbuff: reuse NAPI skb cache on allocation path (__build_skb())

2021-02-10 Thread Paolo Abeni
On Wed, 2021-02-10 at 12:25 +0000, Alexander Lobakin wrote:
> Paolo Abeni  on Wed, 10 Feb 2021 11:21:06 +0100 wrote:
> > Perhaps giving the device drivers the ability to opt-in on this infra
> > via a new helper - as done back then with napi_consume_skb() - would
> > make this change safer?
> 
> That's actually a very nice idea. There's only a little in the code
> to change to introduce an ability to take heads from the cache
> optionally. This way developers could switch to it when needed.
> 
> Thanks for the suggestions! I'll definitely absorb them into the code
> and give it a test.

Quick reply before it is too late. I suggest waiting a bit for others'
opinions before coding - if others dislike this, I would regret wasting
your time.

Cheers,

Paolo



Re: [v3 net-next 08/10] skbuff: reuse NAPI skb cache on allocation path (__build_skb())

2021-02-10 Thread Paolo Abeni
Hello,

I'm sorry for the late feedback, I could not step-in before.

Also adding Jesper for awareness, as he introduced the bulk free
infrastructure.

On Tue, 2021-02-09 at 20:48 +, Alexander Lobakin wrote:
> @@ -231,7 +256,7 @@ struct sk_buff *__build_skb(void *data, unsigned int 
> frag_size)
>   */
>  struct sk_buff *build_skb(void *data, unsigned int frag_size)
>  {
> - struct sk_buff *skb = __build_skb(data, frag_size);
> + struct sk_buff *skb = __build_skb(data, frag_size, true);

I must admit I'm a bit scared of this. There are several high speed
device drivers that will move to bulk allocation, and we don't have any
performance figure for them.

In my experience with (low end) MIPS board, cache misses cost tend to
be much less visible there compared to reasonably recent server H/W,
because the CPU/memory access time difference is much lower.

When moving to higher end H/W the performance gain you measured could
be completely countered by less optimal cache usage.

I fear also latency spikes - I'm unsure if a 32 skbs allocation vs a
single skb would be visible e.g. in a round-robin test. Generally
speaking bulk allocating 32 skbs looks a bit too much. IIRC, when
Edward added listification to GRO, he did several measures with
different list size and found 8 to be the optimal value (for the tested
workload). Above such number the list become too big and the pressure
on the cache outweighted the bulking benefits.

Perhaps giving the device drivers the ability to opt-in on this infra
via a new helper - as done back then with napi_consume_skb() - would
make this change safer?

> @@ -838,31 +863,31 @@ void __consume_stateless_skb(struct sk_buff *skb)
>   kfree_skbmem(skb);
>  }
>  
> -static inline void _kfree_skb_defer(struct sk_buff *skb)
> +static void napi_skb_cache_put(struct sk_buff *skb)
>  {
>   struct napi_alloc_cache *nc = this_cpu_ptr(_alloc_cache);
> + u32 i;
>  
>   /* drop skb->head and call any destructors for packet */
>   skb_release_all(skb);
>  
> - /* record skb to CPU local list */
> + kasan_poison_object_data(skbuff_head_cache, skb);
>   nc->skb_cache[nc->skb_count++] = skb;
>  
> -#ifdef CONFIG_SLUB
> - /* SLUB writes into objects when freeing */
> - prefetchw(skb);
> -#endif

It looks like this chunk has been lost. Is that intentional?

Thanks!

Paolo



Re: [PATCH v4 11/13] task_isolation: net: don't flush backlog on CPUs running isolated tasks

2021-01-22 Thread Paolo Abeni
On Fri, 2021-01-22 at 11:13 -0300, Marcelo Tosatti wrote:
> On Thu, Oct 01, 2020 at 04:47:31PM +0200, Frederic Weisbecker wrote:
> > On Wed, Jul 22, 2020 at 02:58:24PM +, Alex Belits wrote:
> > > From: Yuri Norov 
> > > 
> > > so we don't need to flush it.
> > 
> > What guarantees that we have no backlog on it?
> 
> From Paolo's work to use lockless reading of 
> per-CPU skb lists
> 
> https://www.spinics.net/lists/netdev/msg682693.html
> 
> It also exposed skb queue length to userspace
> 
> https://www.spinics.net/lists/netdev/msg684939.html
> 
> But if i remember correctly waiting for a RCU grace
> period was also necessary to ensure no backlog !?! 
> 
> Paolo would you please remind us what was the sequence of steps?
> (and then also, for the userspace isolation interface, where 
> the application informs the kernel that its entering isolated
> mode, is just confirming the queues have zero length is
> sufficient?).

After commit 2de79ee27fdb52626ac4ac48ec6d8d52ba6f9047, for CONFIG_RPS
enabled build, with no RFS in place to ensure backlog will be empty on
CPU X, the user must:
- configure the RPS map on each device before the device goes up to
explicitly exclude CPU X.

If CPU X is isolated after some network device already went up, to
ensure that the backlog will be empty on CPU X the user must:
- configure RPS on all the network device to exclude CPU X (as in the
previous scenario)
- wait a RCU grace period
- wait untill the backlog len on CPU X reported by procfs is 0

Cheers,

Paolo



Re: [PATCH v2 net-next] udp: allow forwarding of plain (non-fraglisted) UDP GRO packets

2021-01-13 Thread Paolo Abeni
On Wed, 2021-01-13 at 10:32 +, Alexander Lobakin wrote:
> Commit 9fd1ff5d2ac7 ("udp: Support UDP fraglist GRO/GSO.") actually
> not only added a support for fraglisted UDP GRO, but also tweaked
> some logics the way that non-fraglisted UDP GRO started to work for
> forwarding too.
> Commit 2e4ef10f5850 ("net: add GSO UDP L4 and GSO fraglists to the
> list of software-backed types") added GSO UDP L4 to the list of
> software GSO to allow virtual netdevs to forward them as is up to
> the real drivers.
> 
> Tests showed that currently forwarding and NATing of plain UDP GRO
> packets are performed fully correctly, regardless if the target
> netdevice has a support for hardware/driver GSO UDP L4 or not.
> Add the last element and allow to form plain UDP GRO packets if
> there is no socket -> we are on forwarding path.

If I read correctly, the above will make UDP GRO in the forwarding path
always enabled (admin can't disable that, if forwarding is enabled).

UDP GRO can introduce measurable latency for UDP packets staging in the
napi GRO hash (no push flag for UDP ;).

Currently the admin (for fraglist) or the application (for socket-based 
"plain" GRO) have to explicitly enable the feature, but this change
will impact every user.

I think we need at lest an explict switch for this.

Cheers,

Paolo



Re: [PATCH] net/core: use xx_zalloc instead xx_alloc and memset

2020-11-18 Thread Paolo Abeni
On Wed, 2020-11-18 at 16:15 +0800, Tian Tao wrote:
> use kmem_cache_zalloc instead kmem_cache_alloc and memset.
> 
> Signed-off-by: Tian Tao 
> ---
>  net/core/skbuff.c | 10 +++---
>  1 file changed, 3 insertions(+), 7 deletions(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index c9a5a3c..3449c1c 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -313,12 +313,10 @@ struct sk_buff *__build_skb(void *data, unsigned int 
> frag_size)
>  {
>   struct sk_buff *skb;
>  
> - skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
> + skb = kmem_cache_zalloc(skbuff_head_cache, GFP_ATOMIC);

This will zeroed a slighly larger amount of data compared to the
existing code: offsetof(struct sk_buff, tail) == 182, sizeof(struct
sk_buff) == 224.

>   if (unlikely(!skb))
>   return NULL;
>  
> - memset(skb, 0, offsetof(struct sk_buff, tail));

Additionally this leverages constant argument optimizations.

Possibly overall not noticeable, but this code path is quite critical
performance wise.

I would avoid the above.
> -
>   return __build_skb_around(skb, data, frag_size);
>  }
>  
> @@ -6170,12 +6168,10 @@ static void *skb_ext_get_ptr(struct skb_ext *ext, 
> enum skb_ext_id id)
>   */
>  struct skb_ext *__skb_ext_alloc(gfp_t flags)
>  {
> - struct skb_ext *new = kmem_cache_alloc(skbuff_ext_cache, flags);
> + struct skb_ext *new = kmem_cache_zalloc(skbuff_ext_cache, flags);
>  
> - if (new) {
> - memset(new->offset, 0, sizeof(new->offset));

Similar to the above, but additionally here the number of zeroed bytes
changes a lot and a few additional cachelines will be touched. The
performance impact is likely relevant.

Overall I think we should not do this.

Thanks,

Paolo



Re: [PATCH] net: udp: increase UDP_MIB_RCVBUFERRORS when ENOBUFS

2020-10-26 Thread Paolo Abeni
Hello,

On Mon, 2020-10-26 at 17:39 +0800, Menglong Dong wrote:
> The error returned from __udp_enqueue_schedule_skb is ENOMEM or ENOBUFS.
> For now, only ENOMEM is counted into UDP_MIB_RCVBUFERRORS in
> __udp_queue_rcv_skb. UDP_MIB_RCVBUFERRORS should count all of the
> failed skb because of memory errors during udp receiving, not just because of 
> the limit of sock receive queue. We can see this
> in __udp4_lib_mcast_deliver:
> 
>   nskb = skb_clone(skb, GFP_ATOMIC);
> 
>   if (unlikely(!nskb)) {
>   atomic_inc(>sk_drops);
>   __UDP_INC_STATS(net, UDP_MIB_RCVBUFERRORS,
>   IS_UDPLITE(sk));
>   __UDP_INC_STATS(net, UDP_MIB_INERRORS,
>   IS_UDPLITE(sk));
>   continue;
>   }
> 
> See, UDP_MIB_RCVBUFERRORS is increased when skb clone failed. From this
> point, ENOBUFS from __udp_enqueue_schedule_skb should be counted, too.
> It means that the buffer used by all of the UDP sock is to the limit, and
> it ought to be counted.
> 
> Signed-off-by: Menglong Dong 
> ---
>  net/ipv4/udp.c | 4 +---
>  net/ipv6/udp.c | 4 +---
>  2 files changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 09f0a23d1a01..49a69d8d55b3 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -2035,9 +2035,7 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct 
> sk_buff *skb)
>   int is_udplite = IS_UDPLITE(sk);
>  
>   /* Note that an ENOMEM error is charged twice */
> - if (rc == -ENOMEM)
> - UDP_INC_STATS(sock_net(sk), UDP_MIB_RCVBUFERRORS,
> - is_udplite);
> + UDP_INC_STATS(sock_net(sk), UDP_MIB_RCVBUFERRORS, is_udplite);
>   UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
>   kfree_skb(skb);
>   trace_udp_fail_queue_rcv_skb(rc, sk);
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 29d9691359b9..d5e23b150fd9 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -634,9 +634,7 @@ static int __udpv6_queue_rcv_skb(struct sock *sk, struct 
> sk_buff *skb)
>   int is_udplite = IS_UDPLITE(sk);
>  
>   /* Note that an ENOMEM error is charged twice */
> - if (rc == -ENOMEM)
> - UDP6_INC_STATS(sock_net(sk),
> -  UDP_MIB_RCVBUFERRORS, is_udplite);
> + UDP6_INC_STATS(sock_net(sk), UDP_MIB_RCVBUFERRORS, is_udplite);
>   UDP6_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
>   kfree_skb(skb);
>   return -1;

The diffstat is nice, but I'm unsure we can do this kind of change
(well, I really think we should not do it): it will fool any kind of
existing users (application, scripts, admin) currently reading the
above counters and expecting UDP_MIB_RCVBUFERRORS being increased with
the existing schema.

Cheers,

Paolo



Re: [PATCH net-next] selftests: mptcp: interpret \n as a new line

2020-10-06 Thread Paolo Abeni
On Tue, 2020-10-06 at 18:06 +0200, Matthieu Baerts wrote:
> In case of errors, this message was printed:
> 
>   (...)
>   balanced bwidth with unbalanced delay   5233 max 5005  [ fail ]
>   client exit code 0, server 0
>   \nnetns ns3-0-EwnkPH socket stat for 10003:
>   (...)
> 
> Obviously, the idea was to add a new line before the socket stat and not
> print "\nnetns".
> 
> The commit 8b974778f998 ("selftests: mptcp: interpret \n as a new line")
> is very similar to this one. But the modification in simult_flows.sh was
> missed because this commit above was done in parallel to one here below.
> 
> Fixes: 1a418cb8e888 ("mptcp: simult flow self-tests")
> Signed-off-by: Matthieu Baerts 

Acked-by: Paolo Abeni 



Re: [PATCH net-next] selftests: mptcp: interpret \n as a new line

2020-09-17 Thread Paolo Abeni
On Wed, 2020-09-16 at 15:13 +0200, Matthieu Baerts wrote:
> In case of errors, this message was printed:
> 
>   (...)
>   # read: Resource temporarily unavailable
>   #  client exit code 0, server 3
>   # \nnetns ns1-0-BJlt5D socket stat for 10003:
>   (...)
> 
> Obviously, the idea was to add a new line before the socket stat and not
> print "\nnetns".
> 
> Fixes: b08fbf241064 ("selftests: add test-cases for MPTCP MP_JOIN")
> Fixes: 048d19d444be ("mptcp: add basic kselftest for mptcp")
> Signed-off-by: Matthieu Baerts 

Acked-by: Paolo Abeni 



Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-09-10 Thread Paolo Abeni
On Thu, 2020-09-10 at 14:07 -0700, John Fastabend wrote:
> Cong Wang wrote:
> > On Thu, Sep 3, 2020 at 10:08 PM John Fastabend  
> > wrote:
> > > Maybe this would unlock us,
> > > 
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index 7df6c9617321..9b09429103f1 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -3749,7 +3749,7 @@ static inline int __dev_xmit_skb(struct sk_buff 
> > > *skb, struct Qdisc *q,
> > > 
> > > if (q->flags & TCQ_F_NOLOCK) {
> > > rc = q->enqueue(skb, q, _free) & NET_XMIT_MASK;
> > > -   qdisc_run(q);
> > > +   __qdisc_run(q);
> > > 
> > > if (unlikely(to_free))
> > > kfree_skb_list(to_free);
> > > 
> > > 
> > > Per other thread we also need the state deactivated check added
> > > back.
> > 
> > I guess no, because pfifo_dequeue() seems to require q->seqlock,
> > according to comments in qdisc_run(), so we can not just get rid of
> > qdisc_run_begin()/qdisc_run_end() here.
> > 
> > Thanks.
> 
> Seems we would have to revert this as well then,
> 
>  commit 021a17ed796b62383f7623f4fea73787abddad77
>  Author: Paolo Abeni 
>  Date:   Tue May 15 16:24:37 2018 +0200
> 
> pfifo_fast: drop unneeded additional lock on dequeue
> 
> After the previous patch, for NOLOCK qdiscs, q->seqlock is
> always held when the dequeue() is invoked, we can drop
> any additional locking to protect such operation.
> 
> Then I think it should be safe. Back when I was working on the ptr
> ring implementation I opted not to do a case without the spinlock
> because the performance benefit was minimal in the benchmarks I
> was looking at.

The main point behind all that changes was try to close the gap vs the
locked implementation in the uncontended scenario. In our benchmark,
after commit eb82a994479245a79647d302f9b4eb8e7c9d7ca6, was more near to
10%.

Anyway I agree reverting back to the bitlock should be safe.

Cheers,

Paolo 




Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-09-03 Thread Paolo Abeni
On Wed, 2020-09-02 at 22:01 -0700, Cong Wang wrote:
> Can you test the attached one-line fix? I think we are overthinking,
> probably all
> we need here is a busy wait.

I think that will solve, but I also think that will kill NOLOCK
performances due to really increased contention.

At this point I fear we could consider reverting the NOLOCK stuff.
I personally would hate doing so, but it looks like NOLOCK benefits are
outweighed by its issues.

Any other opinion more than welcome!

Cheers,

Paolo



Re: [PATCH] net: openvswitch: silence suspicious RCU usage warning

2020-08-05 Thread Paolo Abeni
On Wed, 2020-08-05 at 15:19 +0800, xiangxia.m@gmail.com wrote:
> From: Tonghao Zhang 
> 
> ovs_flow_tbl_destroy always is called from RCU callback
> or error path. It is no need to check if rcu_read_lock
> or lockdep_ovsl_is_held was held.
> 
> ovs_dp_cmd_fill_info always is called with ovs_mutex,
> So use the rcu_dereference_ovsl instead of rcu_dereference
> in ovs_flow_tbl_masks_cache_size.
> 
> Fixes: 9bf24f594c6a ("net: openvswitch: make masks cache size configurable")
> Cc: Eelco Chaudron 
> Reported-by: syzbot+c0eb9e7cdde04e4eb...@syzkaller.appspotmail.com
> Reported-by: syzbot+f612c02823acb02ff...@syzkaller.appspotmail.com
> Signed-off-by: Tonghao Zhang 

Acked-by: Paolo Abeni 



Re: [PATCH net-next] mptcp: use mptcp_for_each_subflow in mptcp_stream_accept

2020-08-03 Thread Paolo Abeni
On Mon, 2020-08-03 at 21:00 +0800, Geliang Tang wrote:
> Use mptcp_for_each_subflow in mptcp_stream_accept instead of
> open-coding.
> 
> Signed-off-by: Geliang Tang 
> ---
>  net/mptcp/protocol.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index d3fe7296e1c9..400824eabf73 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -2249,7 +2249,7 @@ static int mptcp_stream_accept(struct socket *sock, 
> struct socket *newsock,
>* This is needed so NOSPACE flag can be set from tcp stack.
>*/
>   __mptcp_flush_join_list(msk);
> - list_for_each_entry(subflow, >conn_list, node) {
> + mptcp_for_each_subflow(msk, subflow) {
>   struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
>  
>   if (!ssk->sk_socket)

Acked-by: Paolo Abeni 



Re: [PATCH net] mptcp: fix joined subflows with unblocking sk

2020-07-27 Thread Paolo Abeni
On Mon, 2020-07-27 at 12:24 +0200, Matthieu Baerts wrote:
> Unblocking sockets used for outgoing connections were not containing
> inet info about the initial connection due to a typo there: the value of
> "err" variable is negative in the kernelspace.
> 
> This fixes the creation of additional subflows where the remote port has
> to be reused if the other host didn't announce another one. This also
> fixes inet_diag showing blank info about MPTCP sockets from unblocking
> sockets doing a connect().
> 
> Fixes: 41be81a8d3d0 ("mptcp: fix unblocking connect()")
> Signed-off-by: Matthieu Baerts 

Acked-by: Paolo Abeni 



Re: [PATCH net-next] mptcp: Remove unused inline function mptcp_rcv_synsent()

2020-07-15 Thread Paolo Abeni
On Wed, 2020-07-15 at 10:36 +0800, YueHaibing wrote:
> commit 263e1201a2c3 ("mptcp: consolidate synack processing.")
> left behind this, remove it.
> 
> Signed-off-by: YueHaibing 

Acked-by: Paolo Abeni 

Thank you for the clean-up!

/P



Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-07-09 Thread Paolo Abeni
On Wed, 2020-07-08 at 13:16 -0700, Cong Wang wrote:
> On Tue, Jul 7, 2020 at 7:18 AM Paolo Abeni  wrote:
> > So the regression with 2 pktgen threads is still relevant. 'perf' shows
> > relevant time spent into net_tx_action() and __netif_schedule().
> 
> So, touching the __QDISC_STATE_SCHED bit in __dev_xmit_skb() is
> not a good idea.
> 
> Let me see if there is any other way to fix this.

Thank you very much for the effort! I'm personally out of ideas for a
real fix that would avoid regressions. 

To be more exaustive this are the sources of overhead, as far as I can
observe them with perf:

- contention on q->state, in __netif_schedule()
- execution of net_tx_action() when there are no packet to be served

Cheers,

Paolo



Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-07-07 Thread Paolo Abeni
On Thu, 2020-07-02 at 11:08 -0700, Josh Hunt wrote:
> On 7/2/20 2:45 AM, Paolo Abeni wrote:
> > Hi all,
> > 
> > On Thu, 2020-07-02 at 08:14 +0200, Jonas Bonn wrote:
> > > Hi Cong,
> > > 
> > > On 01/07/2020 21:58, Cong Wang wrote:
> > > > On Wed, Jul 1, 2020 at 9:05 AM Cong Wang  
> > > > wrote:
> > > > > On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt  wrote:
> > > > > > Do either of you know if there's been any development on a fix for 
> > > > > > this
> > > > > > issue? If not we can propose something.
> > > > > 
> > > > > If you have a reproducer, I can look into this.
> > > > 
> > > > Does the attached patch fix this bug completely?
> > > 
> > > It's easier to comment if you inline the patch, but after taking a quick
> > > look it seems too simplistic.
> > > 
> > > i)  Are you sure you haven't got the return values on qdisc_run reversed?
> > 
> > qdisc_run() returns true if it was able to acquire the seq lock. We
> > need to take special action in the opposite case, so Cong's patch LGTM
> > from a functional PoV.
> > 
> > > ii) There's a "bypass" path that skips the enqueue/dequeue operation if
> > > the queue is empty; that needs a similar treatment:  after releasing
> > > seqlock it needs to ensure that another packet hasn't been enqueued
> > > since it last checked.
> > 
> > That has been reverted with
> > commit 379349e9bc3b42b8b2f8f7a03f64a97623fff323
> > 
> > ---
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index 90b59fc50dc9..c7e48356132a 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -3744,7 +3744,8 @@ static inline int __dev_xmit_skb(struct sk_buff 
> > > *skb, struct Qdisc *q,
> > > 
> > >   if (q->flags & TCQ_F_NOLOCK) {
> > >   rc = q->enqueue(skb, q, _free) & NET_XMIT_MASK;
> > > - qdisc_run(q);
> > > + if (!qdisc_run(q) && rc == NET_XMIT_SUCCESS)
> > > + __netif_schedule(q);
> > 
> > I fear the __netif_schedule() call may cause performance regression to
> > the point of making a revert of TCQ_F_NOLOCK preferable. I'll try to
> > collect some data.
> 
> Initial results with Cong's patch look promising, so far no stalls. We 
> will let it run over the long weekend and report back on Tuesday.
> 
> Paolo - I have concerns about possible performance regression with the 
> change as well. If you can gather some data that would be great. 

I finally had the time to run some performance tests vs the above with
mixed results.

Using several netperf threadsover a single pfifo_fast queue with small
UDP packets, perf differences vs vanilla are just above noise range (1-
1,5%)

Using pktgen in 'queue_xmit' mode on a dummy device (this should
maximise the pkt-rate and thus the contention) I see:

pktgen threads  vanilla patched delta
nr  kppskpps%

1   324032400
2   39102710-30.5
4   51404920-4

A relevant source of the measured overhead is due to the contention on
q->state in __netif_schedule, so the following helps a bit:

---
diff --git a/net/core/dev.c b/net/core/dev.c
index b8e8286a0a34..3cad6e086fac 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3750,7 +3750,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, 
struct Qdisc *q,
 
if (q->flags & TCQ_F_NOLOCK) {
rc = q->enqueue(skb, q, NULL, _free) & NET_XMIT_MASK;
-   if (!qdisc_run(q) && rc == NET_XMIT_SUCCESS)
+   if (!qdisc_run(q) && rc == NET_XMIT_SUCCESS &&
+   !test_bit(__QDISC_STATE_SCHED, >state))
__netif_schedule(q);
 
if (unlikely(to_free))
---

With the above incremental patch applied I see:
pktgen threads  vanilla patched[II] delta
nr  kppskpps%
1   324032400
2   39102830-27%
4   514051400

So the regression with 2 pktgen threads is still relevant. 'perf' shows
relevant time spent into net_tx_action() and __netif_schedule().

Cheers,

Paolo.



Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-07-02 Thread Paolo Abeni
Hi all,

On Thu, 2020-07-02 at 08:14 +0200, Jonas Bonn wrote:
> Hi Cong,
> 
> On 01/07/2020 21:58, Cong Wang wrote:
> > On Wed, Jul 1, 2020 at 9:05 AM Cong Wang  wrote:
> > > On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt  wrote:
> > > > Do either of you know if there's been any development on a fix for this
> > > > issue? If not we can propose something.
> > > 
> > > If you have a reproducer, I can look into this.
> > 
> > Does the attached patch fix this bug completely?
> 
> It's easier to comment if you inline the patch, but after taking a quick 
> look it seems too simplistic.
> 
> i)  Are you sure you haven't got the return values on qdisc_run reversed?

qdisc_run() returns true if it was able to acquire the seq lock. We
need to take special action in the opposite case, so Cong's patch LGTM
from a functional PoV.

> ii) There's a "bypass" path that skips the enqueue/dequeue operation if 
> the queue is empty; that needs a similar treatment:  after releasing 
> seqlock it needs to ensure that another packet hasn't been enqueued 
> since it last checked.

That has been reverted with
commit 379349e9bc3b42b8b2f8f7a03f64a97623fff323

---
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 90b59fc50dc9..c7e48356132a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3744,7 +3744,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, 
> struct Qdisc *q,
> 
>   if (q->flags & TCQ_F_NOLOCK) {
>   rc = q->enqueue(skb, q, _free) & NET_XMIT_MASK;
> - qdisc_run(q);
> + if (!qdisc_run(q) && rc == NET_XMIT_SUCCESS)
> + __netif_schedule(q);

I fear the __netif_schedule() call may cause performance regression to
the point of making a revert of TCQ_F_NOLOCK preferable. I'll try to
collect some data.

Thanks!

Paolo



Re: [PATCH v2 net-next 2/2] ipv6: fib6: avoid indirect calls from fib6_rule_lookup

2020-06-23 Thread Paolo Abeni
On Tue, 2020-06-23 at 09:42 -0700, Brian Vazquez wrote:
> It was reported that a considerable amount of cycles were spent on the
> expensive indirect calls on fib6_rule_lookup. This patch introduces an
> inline helper called pol_route_func that uses the indirect_call_wrappers
> to avoid the indirect calls.
> 
> This patch saves around 50ns per call.
> 
> Performance was measured on the receiver by checking the amount of
> syncookies that server was able to generate under a synflood load.
> 
> Traffic was generated using trafgen[1] which was pushing around 1Mpps on
> a single queue. Receiver was using only one rx queue which help to
> create a bottle neck and make the experiment rx-bounded.
> 
> These are the syncookies generated over 10s from the different runs:
> 
> Whithout the patch:
> TcpExtSyncookiesSent35537490.0
> TcpExtSyncookiesSent35508950.0
> TcpExtSyncookiesSent35538450.0
> TcpExtSyncookiesSent35410500.0
> TcpExtSyncookiesSent35399210.0
> TcpExtSyncookiesSent35576590.0
> TcpExtSyncookiesSent35268120.0
> TcpExtSyncookiesSent35361210.0
> TcpExtSyncookiesSent35299630.0
> TcpExtSyncookiesSent35363190.0
> 
> With the patch:
> TcpExtSyncookiesSent36117860.0
> TcpExtSyncookiesSent35966820.0
> TcpExtSyncookiesSent36068780.0
> TcpExtSyncookiesSent35995640.0
> TcpExtSyncookiesSent36013040.0
> TcpExtSyncookiesSent36092490.0
> TcpExtSyncookiesSent36174370.0
> TcpExtSyncookiesSent36087650.0
> TcpExtSyncookiesSent36202050.0
> TcpExtSyncookiesSent36018950.0
> 
> Without the patch the average is 354263 pkt/s or 2822 ns/pkt and with
> the patch the average is 360738 pkt/s or 2772 ns/pkt which gives an
> estimate of 50 ns per packet.
> 
> [1] http://netsniff-ng.org/
> 
> Changelog since v1:
>  - Change ordering in the ICW (Paolo Abeni)
> 
> Cc: Luigi Rizzo 
> Cc: Paolo Abeni 
> Reported-by: Eric Dumazet 
> Signed-off-by: Brian Vazquez 
> ---
>  include/net/ip6_fib.h | 36 
>  net/ipv6/fib6_rules.c |  9 ++---
>  net/ipv6/ip6_fib.c|  3 ++-
>  net/ipv6/route.c  |  8 
>  4 files changed, 48 insertions(+), 8 deletions(-)
> 
> diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
> index 3f615a29766e..cc8356fd927f 100644
> --- a/include/net/ip6_fib.h
> +++ b/include/net/ip6_fib.h
> @@ -19,6 +19,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #ifdef CONFIG_IPV6_MULTIPLE_TABLES
>  #define FIB6_TABLE_HASHSZ 256
> @@ -552,6 +553,41 @@ struct bpf_iter__ipv6_route {
>  };
>  #endif
>  
> +INDIRECT_CALLABLE_DECLARE(struct rt6_info *ip6_pol_route_output(struct net 
> *net,
> +  struct fib6_table *table,
> +  struct flowi6 *fl6,
> +  const struct sk_buff *skb,
> +  int flags));
> +INDIRECT_CALLABLE_DECLARE(struct rt6_info *ip6_pol_route_input(struct net 
> *net,
> +  struct fib6_table *table,
> +  struct flowi6 *fl6,
> +  const struct sk_buff *skb,
> +  int flags));
> +INDIRECT_CALLABLE_DECLARE(struct rt6_info *__ip6_route_redirect(struct net 
> *net,
> +  struct fib6_table *table,
> +  struct flowi6 *fl6,
> +  const struct sk_buff *skb,
> +  int flags));
> +INDIRECT_CALLABLE_DECLARE(struct rt6_info *ip6_pol_route_lookup(struct net 
> *net,
> +  struct fib6_table *table,
> +  struct flowi6 *fl6,
> +  const struct sk_buff *skb,
> +  int flags));
> +static inline struct rt6_info *pol_lookup_func(pol_lookup_t lookup,
> + struct net *net,
> + struct fib6_table *table,
> + struct flowi6 *fl6,
> +

Re: [PATCH v2 net-next 1/2] indirect_call_wrapper: extend indirect wrapper to support up to 4 calls

2020-06-23 Thread Paolo Abeni
On Tue, 2020-06-23 at 09:42 -0700, Brian Vazquez wrote:
> There are many places where 2 annotations are not enough. This patch
> adds INDIRECT_CALL_3 and INDIRECT_CALL_4 to cover such cases.
> 
> Signed-off-by: Brian Vazquez 
> ---
>  include/linux/indirect_call_wrapper.h | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/include/linux/indirect_call_wrapper.h 
> b/include/linux/indirect_call_wrapper.h
> index 00d7e8e919c6..54c02c84906a 100644
> --- a/include/linux/indirect_call_wrapper.h
> +++ b/include/linux/indirect_call_wrapper.h
> @@ -23,6 +23,16 @@
>   likely(f == f2) ? f2(__VA_ARGS__) : \
> INDIRECT_CALL_1(f, f1, __VA_ARGS__);  \
>   })
> +#define INDIRECT_CALL_3(f, f3, f2, f1, ...)  
> \
> + ({  
> \
> + likely(f == f3) ? f3(__VA_ARGS__) : 
> \
> +   INDIRECT_CALL_2(f, f2, f1, __VA_ARGS__);  
> \
> + })
> +#define INDIRECT_CALL_4(f, f4, f3, f2, f1, ...)  
> \
> + ({  
> \
> + likely(f == f4) ? f4(__VA_ARGS__) : 
> \
> +   INDIRECT_CALL_3(f, f3, f2, f1, __VA_ARGS__);  
> \
> + })
>  
>  #define INDIRECT_CALLABLE_DECLARE(f) f
>  #define INDIRECT_CALLABLE_SCOPE
> @@ -30,6 +40,8 @@
>  #else
>  #define INDIRECT_CALL_1(f, f1, ...) f(__VA_ARGS__)
>  #define INDIRECT_CALL_2(f, f2, f1, ...) f(__VA_ARGS__)
> +#define INDIRECT_CALL_3(f, f3, f2, f1, ...) f(__VA_ARGS__)
> +#define INDIRECT_CALL_4(f, f4, f3, f2, f1, ...) f(__VA_ARGS__)
>  #define INDIRECT_CALLABLE_DECLARE(f)
>  #define INDIRECT_CALLABLE_SCOPE  static
>  #endif

Acked-by: Paolo Abeni 



Re: [PATCH net-next 2/2] ipv6: fib6: avoid indirect calls from fib6_rule_lookup

2020-06-23 Thread Paolo Abeni
On Mon, 2020-06-22 at 12:26 -0700, Brian Vazquez wrote:
> 
> 
> On Mon, Jun 22, 2020 at 11:00 AM Paolo Abeni  wrote:
> > On Mon, 2020-06-22 at 09:25 -0700, Brian Vazquez wrote:
> > > 
> > > Hi Paolo
> > > On Mon, Jun 22, 2020 at 3:13 AM Paolo Abeni  wrote:
> > > > Hi,
> > > > 
> > > > On Fri, 2020-06-19 at 20:14 -0700, Brian Vazquez wrote:
> > > > > @@ -111,11 +111,13 @@ struct dst_entry *fib6_rule_lookup(struct net 
> > > > > *net, struct flowi6 *fl6,
> > > > >   } else {
> > > > >   struct rt6_info *rt;
> > > > >  
> > > > > - rt = lookup(net, net->ipv6.fib6_local_tbl, fl6, skb, 
> > > > > flags);
> > > > > + rt = pol_lookup_func(lookup,
> > > > > +  net, net->ipv6.fib6_local_tbl, fl6, skb, 
> > > > > flags);
> > > > >   if (rt != net->ipv6.ip6_null_entry && rt->dst.error != 
> > > > > -EAGAIN)
> > > > >   return >dst;
> > > > >   ip6_rt_put_flags(rt, flags);
> > > > > - rt = lookup(net, net->ipv6.fib6_main_tbl, fl6, skb, 
> > > > > flags);
> > > > > + rt = pol_lookup_func(lookup,
> > > > > +  net, net->ipv6.fib6_main_tbl, fl6, skb, 
> > > > > flags);
> > > > >   if (rt->dst.error != -EAGAIN)
> > > > >   return >dst;
> > > > >   ip6_rt_put_flags(rt, flags);
> > > > 
> > > > Have you considered instead factoring out the slice of
> > > > fib6_rule_lookup() using indirect calls to an header file? it looks
> > > > like here (gcc 10.1.1) it sufficent let the compiler use direct calls
> > > > and will avoid the additional branches.
> > > > 
> > > 
> > > Sorry I couldn't get your point. Could you elaborate more, please? Or 
> > > provide an example?
> > 
> > I mean: I think we can avoid the indirect calls even without using the
> > ICW, just moving the relevant code around - in a inline function in the
> > header files.
> 
>  
> Oh I see your point now, yeah this could work but only for the path where 
> there's no custom_rules, right?? So we still need the ICW for the other case. 
> 
> > e.g. with the following code - rough, build-tested only experiment -
> > the gcc 10.1.1 is able to use direct calls
> > for ip6_pol_route_lookup(), ip6_pol_route_output(), etc.
> > 
> > Cheers,
> > 
> > Paolo
> > ---
> > diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
> > index 3f615a29766e..c1b5ac890cd2 100644
> > --- a/include/net/ip6_fib.h
> > +++ b/include/net/ip6_fib.h
> > @@ -430,9 +430,6 @@ struct fib6_entry_notifier_info {
> > 
> >  struct fib6_table *fib6_get_table(struct net *net, u32 id);
> >  struct fib6_table *fib6_new_table(struct net *net, u32 id);
> > -struct dst_entry *fib6_rule_lookup(struct net *net, struct flowi6 *fl6,
> > -  const struct sk_buff *skb,
> > -  int flags, pol_lookup_t lookup);
> > 
> >  /* called with rcu lock held; can return error pointer
> >   * caller needs to select path
> > diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
> > index 2a5277758379..fe1c2102ffe8 100644
> > --- a/include/net/ip6_route.h
> > +++ b/include/net/ip6_route.h
> > @@ -336,4 +336,50 @@ u32 ip6_mtu_from_fib6(const struct fib6_result *res,
> >  struct neighbour *ip6_neigh_lookup(const struct in6_addr *gw,
> >struct net_device *dev, struct sk_buff 
> > *skb,
> >const void *daddr);
> > +
> > +#ifdef CONFIG_IPV6_MULTIPLE_TABLES
> > +struct rt6_info *__fib6_rule_lookup(struct net *net, struct flowi6 *fl6,
> > +   const struct sk_buff *skb,
> > +   int flags, pol_lookup_t lookup);
> > +static inline struct dst_entry *
> > +fib6_rule_lookup(struct net *net, struct flowi6 *fl6, const struct sk_buff 
> > *skb,
> > +int flags, pol_lookup_t lookup)
> > +{
> > +   struct rt6_info *rt;
> > +
> > +   if (!net->ipv6.fib6_has_custom_rules) {
> > +   rt = lookup(net, net->ipv6.fib6_local_tbl, fl6, skb, flags);
> > +   

Re: [PATCH net-next 2/2] ipv6: fib6: avoid indirect calls from fib6_rule_lookup

2020-06-22 Thread Paolo Abeni
On Mon, 2020-06-22 at 09:25 -0700, Brian Vazquez wrote:
> 
> Hi Paolo
> On Mon, Jun 22, 2020 at 3:13 AM Paolo Abeni  wrote:
> > Hi,
> > 
> > On Fri, 2020-06-19 at 20:14 -0700, Brian Vazquez wrote:
> > > @@ -111,11 +111,13 @@ struct dst_entry *fib6_rule_lookup(struct net *net, 
> > > struct flowi6 *fl6,
> > >   } else {
> > >   struct rt6_info *rt;
> > >  
> > > - rt = lookup(net, net->ipv6.fib6_local_tbl, fl6, skb, flags);
> > > + rt = pol_lookup_func(lookup,
> > > +  net, net->ipv6.fib6_local_tbl, fl6, skb, 
> > > flags);
> > >   if (rt != net->ipv6.ip6_null_entry && rt->dst.error != 
> > > -EAGAIN)
> > >   return >dst;
> > >   ip6_rt_put_flags(rt, flags);
> > > - rt = lookup(net, net->ipv6.fib6_main_tbl, fl6, skb, flags);
> > > + rt = pol_lookup_func(lookup,
> > > +  net, net->ipv6.fib6_main_tbl, fl6, skb, flags);
> > >   if (rt->dst.error != -EAGAIN)
> > >   return >dst;
> > >   ip6_rt_put_flags(rt, flags);
> > 
> > Have you considered instead factoring out the slice of
> > fib6_rule_lookup() using indirect calls to an header file? it looks
> > like here (gcc 10.1.1) it sufficent let the compiler use direct calls
> > and will avoid the additional branches.
> > 
> 
> Sorry I couldn't get your point. Could you elaborate more, please? Or provide 
> an example?

I mean: I think we can avoid the indirect calls even without using the
ICW, just moving the relevant code around - in a inline function in the
header files.

e.g. with the following code - rough, build-tested only experiment -
the gcc 10.1.1 is able to use direct calls
for ip6_pol_route_lookup(), ip6_pol_route_output(), etc.

Cheers,

Paolo
---
diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 3f615a29766e..c1b5ac890cd2 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -430,9 +430,6 @@ struct fib6_entry_notifier_info {
 
 struct fib6_table *fib6_get_table(struct net *net, u32 id);
 struct fib6_table *fib6_new_table(struct net *net, u32 id);
-struct dst_entry *fib6_rule_lookup(struct net *net, struct flowi6 *fl6,
-  const struct sk_buff *skb,
-  int flags, pol_lookup_t lookup);
 
 /* called with rcu lock held; can return error pointer
  * caller needs to select path
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 2a5277758379..fe1c2102ffe8 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -336,4 +336,50 @@ u32 ip6_mtu_from_fib6(const struct fib6_result *res,
 struct neighbour *ip6_neigh_lookup(const struct in6_addr *gw,
   struct net_device *dev, struct sk_buff *skb,
   const void *daddr);
+
+#ifdef CONFIG_IPV6_MULTIPLE_TABLES
+struct rt6_info *__fib6_rule_lookup(struct net *net, struct flowi6 *fl6,
+   const struct sk_buff *skb,
+   int flags, pol_lookup_t lookup);
+static inline struct dst_entry *
+fib6_rule_lookup(struct net *net, struct flowi6 *fl6, const struct sk_buff 
*skb,
+int flags, pol_lookup_t lookup)
+{
+   struct rt6_info *rt;
+
+   if (!net->ipv6.fib6_has_custom_rules) {
+   rt = lookup(net, net->ipv6.fib6_local_tbl, fl6, skb, flags);
+   if (rt != net->ipv6.ip6_null_entry && rt->dst.error != -EAGAIN)
+   return >dst;
+   ip6_rt_put_flags(rt, flags);
+   rt = lookup(net, net->ipv6.fib6_main_tbl, fl6, skb, flags);
+   if (rt->dst.error != -EAGAIN)
+   return >dst;
+   ip6_rt_put_flags(rt, flags);
+   } else {
+   rt = __fib6_rule_lookup(net, fl6, skb, flags, lookup);
+   }
+
+   if (!(flags & RT6_LOOKUP_F_DST_NOREF))
+   dst_hold(>ipv6.ip6_null_entry->dst);
+   return >ipv6.ip6_null_entry->dst;
+}
+#else
+static inline struct dst_entry *
+fib6_rule_lookup(struct net *net, struct flowi6 *fl6, const struct sk_buff 
*skb,
+int flags, pol_lookup_t lookup)
+{
+   struct rt6_info *rt;
+
+   rt = lookup(net, net->ipv6.fib6_main_tbl, fl6, skb, flags);
+   if (rt->dst.error == -EAGAIN) {
+   ip6_rt_put_flags(rt, flags);
+   rt = net->ipv6.ip6_null_entry;
+   if (!(flags & RT6_LOOKUP_F_DST_NOREF))
+   dst_hold(>dst);
+   }
+
+   return >dst;
+}
+#end

Re: [PATCH net-next 2/2] ipv6: fib6: avoid indirect calls from fib6_rule_lookup

2020-06-22 Thread Paolo Abeni
Hi,

On Fri, 2020-06-19 at 20:14 -0700, Brian Vazquez wrote:
> @@ -111,11 +111,13 @@ struct dst_entry *fib6_rule_lookup(struct net *net, 
> struct flowi6 *fl6,
>   } else {
>   struct rt6_info *rt;
>  
> - rt = lookup(net, net->ipv6.fib6_local_tbl, fl6, skb, flags);
> + rt = pol_lookup_func(lookup,
> +  net, net->ipv6.fib6_local_tbl, fl6, skb, flags);
>   if (rt != net->ipv6.ip6_null_entry && rt->dst.error != -EAGAIN)
>   return >dst;
>   ip6_rt_put_flags(rt, flags);
> - rt = lookup(net, net->ipv6.fib6_main_tbl, fl6, skb, flags);
> + rt = pol_lookup_func(lookup,
> +  net, net->ipv6.fib6_main_tbl, fl6, skb, flags);
>   if (rt->dst.error != -EAGAIN)
>   return >dst;
>   ip6_rt_put_flags(rt, flags);

Have you considered instead factoring out the slice of
fib6_rule_lookup() using indirect calls to an header file? it looks
like here (gcc 10.1.1) it sufficent let the compiler use direct calls
and will avoid the additional branches.

Thanks!

Paolo




Re: general protection fault in sock_recvmsg

2020-05-25 Thread Paolo Abeni
On Sun, 2020-05-24 at 20:14 -0700, syzbot wrote:
> syzbot found the following crash on:
> 
> HEAD commit:caffb99b Merge git://git.kernel.org/pub/scm/linux/kernel/g..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=15a7444110
> kernel config:  https://syzkaller.appspot.com/x/.config?x=c33c7f7c5471fd39
> dashboard link: https://syzkaller.appspot.com/bug?extid=d7cface3f90b13edf5b0
> compiler:   clang version 10.0.0 (https://github.com/llvm/llvm-project/ 
> c2443155a0fb245c8f17f2c1c72b6ea391e86e81)
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=141034ba10
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=119cd01610
> 
> The bug was bisected to:
> 
> commit 263e1201a2c324b60b15ecda5de9ebf1e7293e31
> Author: Paolo Abeni 
> Date:   Thu Apr 30 13:01:51 2020 +
> 
> mptcp: consolidate synack processing.
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=119d862610
> final crash:https://syzkaller.appspot.com/x/report.txt?x=139d862610
> console output: https://syzkaller.appspot.com/x/log.txt?x=159d862610
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+d7cface3f90b13edf...@syzkaller.appspotmail.com
> Fixes: 263e1201a2c3 ("mptcp: consolidate synack processing.")
> 
> general protection fault, probably for non-canonical address 
> 0xdc04:  [#1] PREEMPT SMP KASAN
> KASAN: null-ptr-deref in range [0x0020-0x0027]
> CPU: 1 PID: 7226 Comm: syz-executor523 Not tainted 5.7.0-rc6-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> RIP: 0010:sock_recvmsg_nosec net/socket.c:886 [inline]
> RIP: 0010:sock_recvmsg+0x92/0x110 net/socket.c:904
> Code: 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 6c 24 04 e8 53 18 1d fb 4d 8d 6f 
> 20 4c 89 e8 48 c1 e8 03 48 b9 00 00 00 00 00 fc ff df <80> 3c 08 00 74 08 4c 
> 89 ef e8 20 12 5b fb bd a0 00 00 00 49 03 6d
> RSP: 0018:c90001077b98 EFLAGS: 00010202
> RAX: 0004 RBX: c90001077dc0 RCX: dc00
> RDX:  RSI:  RDI: 
> RBP:  R08: 86565e59 R09: ed10115afeaa
> R10: ed10115afeaa R11:  R12: 19200020efbc
> R13: 0020 R14: c90001077de0 R15: 
> FS:  7fc6a3abe700() GS:8880ae90() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 004d0050 CR3: 969f CR4: 001406e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  mptcp_recvmsg+0x18d5/0x19b0 net/mptcp/protocol.c:891
>  inet_recvmsg+0xf6/0x1d0 net/ipv4/af_inet.c:838
>  sock_recvmsg_nosec net/socket.c:886 [inline]
>  sock_recvmsg net/socket.c:904 [inline]
>  __sys_recvfrom+0x2f3/0x470 net/socket.c:2057
>  __do_sys_recvfrom net/socket.c:2075 [inline]
>  __se_sys_recvfrom net/socket.c:2071 [inline]
>  __x64_sys_recvfrom+0xda/0xf0 net/socket.c:2071
>  do_syscall_64+0xf3/0x1b0 arch/x86/entry/common.c:295
>  entry_SYSCALL_64_after_hwframe+0x49/0xb3
> RIP: 0033:0x448ef9
> Code: e8 cc 14 03 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 
> 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 
> 83 9b 0c fc ff c3 66 2e 0f 1f 84 00 00 00 00
> RSP: 002b:7fc6a3abdda8 EFLAGS: 0246 ORIG_RAX: 002d
> RAX: ffda RBX: 006dec28 RCX: 00448ef9
> RDX: 1000 RSI: 24c0 RDI: 0003
> RBP: 006dec20 R08:  R09: 
> R10: 4000 R11: 0246 R12: 006dec2c
> R13: 7ffe4730174f R14: 7fc6a3abe9c0 R15: 006dec2c
> Modules linked in:
> ---[ end trace 097bdf143c3a60db ]---
> RIP: 0010:sock_recvmsg_nosec net/socket.c:886 [inline]
> RIP: 0010:sock_recvmsg+0x92/0x110 net/socket.c:904
> Code: 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 6c 24 04 e8 53 18 1d fb 4d 8d 6f 
> 20 4c 89 e8 48 c1 e8 03 48 b9 00 00 00 00 00 fc ff df <80> 3c 08 00 74 08 4c 
> 89 ef e8 20 12 5b fb bd a0 00 00 00 49 03 6d
> RSP: 0018:c90001077b98 EFLAGS: 00010202
> RAX: 0004 RBX: c90001077dc0 RCX: dc00
> RDX:  RSI:  RDI: 
> RBP:  R08: 86565e59 R09: ed10115afeaa
> R10: ed10115afeaa R11:  R12: 19200020efbc
> R13: 0020 R14: c90001077de0 R15: 
> FS:  7fc6a3abe700() GS:8880ae90(0

Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2019-10-09 Thread Paolo Abeni
On Wed, 2019-10-09 at 08:46 +0200, Jonas Bonn wrote:
> Hi,
> 
> The lockless pfifo_fast qdisc has an issue with packets getting stuck in 
> the queue.  What appears to happen is:
> 
> i)  Thread 1 holds the 'seqlock' on the qdisc and dequeues packets.
> ii)  Thread 1 dequeues the last packet in the queue.
> iii)  Thread 1 iterates through the qdisc->dequeue function again and 
> determines that the queue is empty.
> 
> iv)  Thread 2 queues up a packet.  Since 'seqlock' is busy, it just 
> assumes the packet will be dequeued by whoever is holding the lock.
> 
> v)  Thread 1 releases 'seqlock'.
> 
> After v), nobody will check if there are packets in the queue until a 
> new packet is enqueued.  Thereby, the packet enqueued by Thread 2 may be 
> delayed indefinitely.

I think you are right.

It looks like this possible race is present since the initial lockless
implementation - commit 6b3ba9146fe6 ("net: sched: allow qdiscs to
handle locking")

Anyhow the racing windows looks quite tiny - I never observed that
issue in my tests. Do you have a working reproducer?

Something alike the following code - completely untested - can possibly
address the issue, but it's a bit rough and I would prefer not adding
additonal complexity to the lockless qdiscs, can you please have a spin
a it?

Thanks,

Paolo
---
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 6a70845bd9ab..65a1c03330d6 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -113,18 +113,23 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 struct net_device *dev, struct netdev_queue *txq,
 spinlock_t *root_lock, bool validate);
 
-void __qdisc_run(struct Qdisc *q);
+int __qdisc_run(struct Qdisc *q);
 
 static inline void qdisc_run(struct Qdisc *q)
 {
+   int quota = 0;
+
if (qdisc_run_begin(q)) {
/* NOLOCK qdisc must check 'state' under the qdisc seqlock
 * to avoid racing with dev_qdisc_reset()
 */
if (!(q->flags & TCQ_F_NOLOCK) ||
likely(!test_bit(__QDISC_STATE_DEACTIVATED, >state)))
-   __qdisc_run(q);
+   quota = __qdisc_run(q);
qdisc_run_end(q);
+
+   if (quota > 0 && q->flags & TCQ_F_NOLOCK && q->ops->peek(q))
+   __netif_schedule(q);
}
 }
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 17bd8f539bc7..013480f6a794 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -376,7 +376,7 @@ static inline bool qdisc_restart(struct Qdisc *q, int 
*packets)
return sch_direct_xmit(skb, q, dev, txq, root_lock, validate);
 }
 
-void __qdisc_run(struct Qdisc *q)
+int __qdisc_run(struct Qdisc *q)
 {
int quota = dev_tx_weight;
int packets;
@@ -390,9 +390,10 @@ void __qdisc_run(struct Qdisc *q)
quota -= packets;
if (quota <= 0 || need_resched()) {
__netif_schedule(q);
-   break;
+   return 0;
}
}
+   return quota;
 }
 
 unsigned long dev_trans_start(struct net_device *dev)



Re: [PATCH net-next] net: ipvlan: forward ingress packet to slave's l2 in l3s mode

2019-06-26 Thread Paolo Abeni
Hi,

On Tue, 2019-06-25 at 14:42 +0800, Zhiyuan Hou wrote:
> In ipvlan l3s mode,  ingress packet is switched to slave interface and
> delivers to l4 stack. This may cause two problems:
> 
>   1. When slave is in an ns different from master, the behavior of stack
>   in slave ns may cause confusion for users. For example, iptables, tc,
>   and other l2/l3 functions are not available for ingress packet.
> 
>   2. l3s mode is not used for tap device, and cannot support ipvtap. But
>   in VM or container based VM cases, tap device is a very common device.
> 
> In l3s mode's input nf_hook, this patch calles the skb_forward_dev() to
> forward ingress packet to slave and uses nf_conntrack_confirm() to make
> conntrack work with new mode.
> 
> Signed-off-by: Zha Bin 
> Signed-off-by: Zhiyuan Hou 
> ---
>  drivers/net/ipvlan/ipvlan.h |  9 -
>  drivers/net/ipvlan/ipvlan_l3s.c | 16 ++--
>  2 files changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ipvlan/ipvlan.h b/drivers/net/ipvlan/ipvlan.h
> index 3837c897832e..48c814e24c3f 100644
> --- a/drivers/net/ipvlan/ipvlan.h
> +++ b/drivers/net/ipvlan/ipvlan.h
> @@ -172,6 +172,14 @@ void ipvlan_link_delete(struct net_device *dev, struct 
> list_head *head);
>  void ipvlan_link_setup(struct net_device *dev);
>  int ipvlan_link_register(struct rtnl_link_ops *ops);
>  #ifdef CONFIG_IPVLAN_L3S
> +
> +#include 
> +
> +static inline int ipvlan_confirm_conntrack(struct sk_buff *skb)
> +{
> + return nf_conntrack_confirm(skb);
> +}
> +
>  int ipvlan_l3s_register(struct ipvl_port *port);
>  void ipvlan_l3s_unregister(struct ipvl_port *port);
>  void ipvlan_migrate_l3s_hook(struct net *oldnet, struct net *newnet);
> @@ -206,5 +214,4 @@ static inline bool netif_is_ipvlan_port(const struct 
> net_device *dev)
>  {
>   return rcu_access_pointer(dev->rx_handler) == ipvlan_handle_frame;
>  }
> -
>  #endif /* __IPVLAN_H */
> diff --git a/drivers/net/ipvlan/ipvlan_l3s.c b/drivers/net/ipvlan/ipvlan_l3s.c
> index 943d26cbf39f..ed210002f593 100644
> --- a/drivers/net/ipvlan/ipvlan_l3s.c
> +++ b/drivers/net/ipvlan/ipvlan_l3s.c
> @@ -95,14 +95,26 @@ static unsigned int ipvlan_nf_input(void *priv, struct 
> sk_buff *skb,
>  {
>   struct ipvl_addr *addr;
>   unsigned int len;
> + int ret = NF_ACCEPT;
> + bool success;
>  
>   addr = ipvlan_skb_to_addr(skb, skb->dev);
>   if (!addr)
>   goto out;
>  
> - skb->dev = addr->master->dev;
>   len = skb->len + ETH_HLEN;
> - ipvlan_count_rx(addr->master, len, true, false);
> +
> + ret = ipvlan_confirm_conntrack(skb);
> + if (ret != NF_ACCEPT) {
> + ipvlan_count_rx(addr->master, len, false, false);
> + goto out;
> + }
> +
> + skb_push_rcsum(skb, ETH_HLEN);
> + success = dev_forward_skb(addr->master->dev, skb) == NET_RX_SUCCESS;

This looks weird to me: if I read the code correctly, the skb will
traverse twice NF_INET_LOCAL_IN, once due to the l3s hooking and
another one due to dev_forward_skb().

Also, tc ingreess, etc will run after the first traversing of
NF_INET_LOCAL_IN.

All in all I think that if full l2 processing is required, a different
mode or a different virtual device should be used.

Cheers,

Paolo



Re: [BUG] moving fq back to clock monotonic breaks my setup

2019-01-10 Thread Paolo Abeni
On Thu, 2019-01-10 at 09:25 +0100, Ian Kumlien wrote:
> On Thu, Jan 10, 2019 at 6:53 AM Eric Dumazet  wrote:
> > On Wed, Jan 9, 2019 at 4:48 PM Ian Kumlien  wrote:
> > > Hi,
> > > 
> > > Just been trough ~5+ hours of bisecting and eventually actually found
> > > the culprit =)
> > > 
> > > commit fb420d5d91c1274d5966917725e71f27ed092a85 (refs/bisect/bad)
> > > Author: Eric Dumazet 
> > > Date:   Fri Sep 28 10:28:44 2018 -0700
> > > 
> > > tcp/fq: move back to CLOCK_MONOTONIC
> > > 
> > > [--8<--]
> > > 
> > > So this might be because my setup might be "odd".
> > > 
> > > Basically I have a firewall with four nics that uses two of those nics
> > > to handle my normal
> > > internet connection (firewall/MASQ/NAT) and the other two are assigned
> > > to one bridge each.
> > > 
> > > The firewall is also my local caching DNS server and DHCP server,
> > > which is also used by the VM:s...
> > > But with 4.20 DHCP replies disappeared before entering the bridge - i
> > > couldn't even see them in
> > > tcpdump! (all nics are ixgbe on a atom soc)
> > > 
> > > I'm currently running a kernel with that patch reversed but I'm also
> > > wondering about possible ways
> > > forward since I'm reverting a fix from someone else...
> > 
> > I suggest you use netdev@ mailing list instead of lkml
> > 
> > Then, we probably need to clear skb->tstamp in more paths (you are
> > mentioning bridge ...)
> > 
> > See commit 8203e2d844d34af247a151d8ebd68553a6e91785 for reference.
> > 
> > Can you try :
> > 
> > diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
> > index 
> > 5372e2042adfe20d3cd039c29057535b2413be61..bd4fa141420c92a44716bd93fcd8aa3d3310203a
> > 100644
> > --- a/net/bridge/br_forward.c
> > +++ b/net/bridge/br_forward.c
> > @@ -53,6 +53,7 @@ int br_dev_queue_push_xmit(struct net *net, struct
> > sock *sk, struct sk_buff *skb
> > skb_set_network_header(skb, depth);
> > }
> > 
> > +   skb->tstamp = 0;
> > dev_queue_xmit(skb);
> > 
> > return 0;
> 
> This works, and so does: 
> https://marc.info/?l=linux-netdev=154696956604748=2
> 
> Pointed out by Paolo (tested both separately)

Note: I cleared the tstamp in br_forward_finish() instead of
br_dev_queue_push_xmit() because I think the latter could be called
also in the local xmit path, via br_nf_post_routing.

We must preserve the tstamp in output path, right?

Thanks,

Paolo






Re: [BUG] v4.20 - bridge not getting DHCP responses? (works in 4.19.13)

2019-01-10 Thread Paolo Abeni
On Thu, 2019-01-10 at 01:38 +0100, Ian Kumlien wrote:
> Confirmed, sending a new mail with summary etc
> 
> On Thu, Jan 10, 2019 at 1:16 AM Ian Kumlien  wrote:
> > On Wed, Jan 9, 2019 at 12:17 AM Ian Kumlien  wrote:
> > > On Wed, Jan 9, 2019, 00:09 Florian Fainelli  > 
> >  [--8<---]
> > 
> > > > > when looking at "git log v4.19...v4.20
> > > > > drivers/net/ethernet/intel/ixgbe/" nothing else really stands out...
> > > > > The machine is also running NAT for my home network and all of that
> > > > > works just fine...
> > > > > 
> > > > > I started with tcpdump, prooving that packets reached all the way
> > > > > outside but replies never made it, reboorting
> > > > > with 4.19.13 resulted in replies appearing in the tcpdump.
> > > > > 
> > > > > I don't quite know where to look - and what can i do to test - i tried
> > > > > disabling all offloading (due to the UDP
> > > > > offloading changes) but nothing helped...
> > > > > 
> > > > > Ideas? Patches? ;)
> > > > 
> > > > Running a bisection would certainly help find the offending commit if
> > > > that is something that you can do?
> > > 
> > > I was hoping for a likely suspect but this was on my "Todo" for Friday 
> > > night anyway... (And I already started testing with some patches reversed)
> > 
> > So after lengthy git bisect sections, both from the latest stable i
> > was using (not the best of ideas)
> > and from 4.19.
> > 
> > The latest stable yielded 72b0094f918294e6cb8cf5c3b4520d928fbb1a57 -
> > which is incorrect...
> > 
> > However, the proper bisect gave me this:
> > fb420d5d91c1274d5966917725e71f27ed092a85 is the first bad commit
> > commit fb420d5d91c1274d5966917725e71f27ed092a85
> > Author: Eric Dumazet 
> > Date:   Fri Sep 28 10:28:44 2018 -0700
> > 
> > tcp/fq: move back to CLOCK_MONOTONIC

Thank you for bisecting. 

Should be solve by:

https://marc.info/?l=linux-netdev=154696956604748=2

Can you test with the above applied?

Thanks,

Paolo



Re: linux-next: build failure after merge of the net-next tree

2018-12-17 Thread Paolo Abeni
Hi,

On Mon, 2018-12-17 at 12:36 +1100, Stephen Rothwell wrote:
> After merging the net-next tree, today's linux-next build (arm
> multi_v7_defconfig) failed like this:
> 
> In file included from net/core/dev.c:148:
> net/core/dev.c: In function 'napi_gro_complete':
> net/core/dev.c:5364:26: error: 'inet_gro_complete' undeclared (first use in 
> this function); did you mean 'eth_gro_complete'?
>ipv6_gro_complete, inet_gro_complete,
>   ^
> include/linux/indirect_call_wrapper.h:32:41: note: in definition of macro 
> 'INDIRECT_CALL_2'
>  #define INDIRECT_CALL_2(f, name, ...) f(__VA_ARGS__)
>  ^~~
> 
> and on 
> 
> Caused by commit
> 
>   aaa5d90b395a ("net: use indirect call wrappers at GRO network layer")
> 
> inet_gro_complete() is declared in include/net/inet_comon.h which is
> not directly included in net/core/dev.c.
> 
> I have used the net-next tree from next-20181214 for today.

My fault, I should have tested more kernel configs. This fails with
CONFIG_RETPOLINE=n

A fix has been posted here:

https://marc.info/?l=linux-netdev=154504685327698=2

Thanks,

Paolo



Re: [PATCH net-next v3 0/4] net: mitigate retpoline overhead

2018-12-16 Thread Paolo Abeni
On Sat, 2018-12-15 at 13:23 -0800, David Miller wrote:
> From: Paolo Abeni 
> Date: Fri, 14 Dec 2018 11:51:56 +0100
> 
> > The spectre v2 counter-measures, aka retpolines, are a source of measurable
> > overhead[1]. We can partially address that when the function pointer refers 
> > to
> > a builtin symbol resorting to a list of tests vs well-known builtin 
> > function and
> > direct calls.
> > 
> > Experimental results show that replacing a single indirect call via
> > retpoline with several branches and a direct call gives performance gains
> > even when multiple branches are added - 5 or more, as reported in [2].
> > 
> > This may lead to some uglification around the indirect calls. In netconf 
> > 2018
> > Eric Dumazet described a technique to hide the most relevant part of the 
> > needed
> > boilerplate with some macro help.
> > 
> > This series is a [re-]implementation of such idea, exposing the introduced
> > helpers in a new header file. They are later leveraged to avoid the indirect
> > call overhead in the GRO path, when possible.
> > 
> > Overall this gives > 10% performance improvement for UDP GRO benchmark and
> > smaller but measurable for TCP syn flood.
> > 
> > The added infra can be used in follow-up patches to cope with retpoline 
> > overhead
> > in other points of the networking stack (e.g. at the qdisc layer) and 
> > possibly
> > even in other subsystems.
>  ...
> 
> Series applied, I'll push this out after a build check completes.

Again, I messed it! I'm really sorry to waste everybody's time.
I was unable to give proper coverage with different configs. I tested
vs.:

CONFIG_IPV6=ymn
CONFIG_INET=yn

but

# CONFIG_RETPOLINE is not set

fooled me. The following patch should fix. I'll try more
configurations. Is there any way to try/tests all kbuild robot configs?

Please feel free to give me an hard stop if this sounds too much a
trial and error thing.

Paolo

--
diff --git a/include/linux/indirect_call_wrapper.h 
b/include/linux/indirect_call_wrapper.h
index 7c8b7f4..00d7e8e9 100644
--- a/include/linux/indirect_call_wrapper.h
+++ b/include/linux/indirect_call_wrapper.h
@@ -28,8 +28,8 @@
 #define INDIRECT_CALLABLE_SCOPE
 
 #else
-#define INDIRECT_CALL_1(f, name, ...) f(__VA_ARGS__)
-#define INDIRECT_CALL_2(f, name, ...) f(__VA_ARGS__)
+#define INDIRECT_CALL_1(f, f1, ...) f(__VA_ARGS__)
+#define INDIRECT_CALL_2(f, f2, f1, ...) f(__VA_ARGS__)
 #define INDIRECT_CALLABLE_DECLARE(f)
 #define INDIRECT_CALLABLE_SCOPEstatic
 #endif



[PATCH net-next v3 3/4] net: use indirect call wrappers at GRO transport layer

2018-12-14 Thread Paolo Abeni
This avoids an indirect call in the receive path for TCP and UDP
packets. TCP takes precedence on UDP, so that we have a single
additional conditional in the common case.

When IPV6 is build as module, all gro symbols except UDPv6 are
builtin, while the latter belong to the ipv6 module, so we
need some special care.

v1 -> v2:
 - adapted to INDIRECT_CALL_ changes
v2 -> v3:
 - fix build issue with CONFIG_IPV6=m

Signed-off-by: Paolo Abeni 
---
 include/net/inet_common.h |  7 +++
 net/ipv4/af_inet.c| 13 +++--
 net/ipv4/tcp_offload.c|  6 --
 net/ipv4/udp_offload.c|  7 ---
 net/ipv6/ip6_offload.c| 29 +++--
 net/ipv6/tcpv6_offload.c  |  7 ---
 net/ipv6/udp_offload.c|  7 ---
 7 files changed, 61 insertions(+), 15 deletions(-)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 56e7592811ea..975901a95c0f 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -56,4 +56,11 @@ static inline void inet_ctl_sock_destroy(struct sock *sk)
sock_release(sk->sk_socket);
 }
 
+#define indirect_call_gro_receive(f2, f1, cb, head, skb)	\
+({								\
+	unlikely(gro_recursion_inc_test(skb)) ?			\
+		NAPI_GRO_CB(skb)->flush |= 1, NULL :		\
+		INDIRECT_CALL_2(cb, f2, f1, head, skb);		\
+})
+
 #endif
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 326c422c22f8..0dfb72c46671 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1385,6 +1385,10 @@ struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 }
 EXPORT_SYMBOL(inet_gso_segment);
 
+INDIRECT_CALLABLE_DECLARE(struct sk_buff *tcp4_gro_receive(struct list_head *,
+  struct sk_buff *));
+INDIRECT_CALLABLE_DECLARE(struct sk_buff *udp4_gro_receive(struct list_head *,
+  struct sk_buff *));
 struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
 {
const struct net_offload *ops;
@@ -1494,7 +1498,8 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
skb_gro_pull(skb, sizeof(*iph));
skb_set_transport_header(skb, skb_gro_offset(skb));
 
-   pp = call_gro_receive(ops->callbacks.gro_receive, head, skb);
+   pp = indirect_call_gro_receive(tcp4_gro_receive, udp4_gro_receive,
+  ops->callbacks.gro_receive, head, skb);
 
 out_unlock:
rcu_read_unlock();
@@ -1556,6 +1561,8 @@ int inet_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
return -EINVAL;
 }
 
+INDIRECT_CALLABLE_DECLARE(int tcp4_gro_complete(struct sk_buff *, int));
+INDIRECT_CALLABLE_DECLARE(int udp4_gro_complete(struct sk_buff *, int));
 int inet_gro_complete(struct sk_buff *skb, int nhoff)
 {
__be16 newlen = htons(skb->len - nhoff);
@@ -1581,7 +1588,9 @@ int inet_gro_complete(struct sk_buff *skb, int nhoff)
 * because any hdr with option will have been flushed in
 * inet_gro_receive().
 */
-   err = ops->callbacks.gro_complete(skb, nhoff + sizeof(*iph));
+   err = INDIRECT_CALL_2(ops->callbacks.gro_complete,
+ tcp4_gro_complete, udp4_gro_complete,
+ skb, nhoff + sizeof(*iph));
 
 out_unlock:
rcu_read_unlock();
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 870b0a335061..0fbf7d4df9da 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -10,6 +10,7 @@
  * TCPv4 GSO/GRO support
  */
 
+#include <linux/indirect_call_wrapper.h>
 #include <linux/skbuff.h>
 #include <net/tcp.h>
 #include <net/protocol.h>
@@ -305,7 +306,8 @@ int tcp_gro_complete(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(tcp_gro_complete);
 
-static struct sk_buff *tcp4_gro_receive(struct list_head *head, struct sk_buff *skb)
+INDIRECT_CALLABLE_SCOPE
+struct sk_buff *tcp4_gro_receive(struct list_head *head, struct sk_buff *skb)
 {
/* Don't bother verifying checksum if we're going to flush anyway. */
if (!NAPI_GRO_CB(skb)->flush &&
@@ -318,7 +320,7 @@ static struct sk_buff *tcp4_gro_receive(struct list_head *head, struct sk_buff *skb)
return tcp_gro_receive(head, skb);
 }
 
-static int tcp4_gro_complete(struct sk_buff *skb, int thoff)
+INDIRECT_CALLABLE_SCOPE int tcp4_gro_complete(struct sk_buff *skb, int thoff)
 {
const struct iphdr *iph = ip_hdr(skb);
struct tcphdr *th = tcp_hdr(skb);
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 0646d61f4fa8..9a141a6cf1a0 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -13,6 +13,7 @@
 #include <linux/skbuff.h>
 #include <net/udp.h>
 #include <net/protocol.h>
+#include <linux/indirect_call_wrapper.h>
 
 static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb,
netdev_features_t features,
@@ -451,8 +452,8 @@ struct sk_buff *udp_gro_receive(struct list_head *head, 

[PATCH net-next v3 0/4] net: mitigate retpoline overhead

2018-12-14 Thread Paolo Abeni
The spectre v2 counter-measures, aka retpolines, are a source of measurable
overhead[1]. We can partially address that when the function pointer refers to
a builtin symbol, resorting to a list of tests vs. well-known builtin functions
and direct calls.

Experimental results show that replacing a single indirect call via
retpoline with several branches and a direct call gives performance gains
even when multiple branches are added - 5 or more, as reported in [2].

This may lead to some uglification around the indirect calls. In netconf 2018
Eric Dumazet described a technique to hide the most relevant part of the needed
boilerplate with some macro help.

This series is a [re-]implementation of such an idea, exposing the introduced
helpers in a new header file. They are later leveraged to avoid the indirect
call overhead in the GRO path, when possible.

Overall this gives a >10% performance improvement in the UDP GRO benchmark and
a smaller but measurable one for TCP syn flood.

The added infra can be used in follow-up patches to cope with retpoline
overhead at other points of the networking stack (e.g. at the qdisc layer) and
possibly even in other subsystems.
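
In a nutshell, the trick looks like the following stand-alone user-space
rendition (all names here are made up):

#include <stdio.h>

/* compare the pointer against a known builtin and, on a match, emit a
 * direct call; fall back to the (retpolined) indirect call otherwise
 */
#define CALL_KNOWN_1(f, f1, ...) \
	(__builtin_expect((f) == (f1), 1) ? (f1)(__VA_ARGS__) : (f)(__VA_ARGS__))

static int common_cb(int x) { return x + 1; }
static int rare_cb(int x)   { return x * 2; }

int main(void)
{
	int (*cb)(int) = common_cb;

	printf("%d\n", CALL_KNOWN_1(cb, common_cb, 41)); /* direct call */
	cb = rare_cb;
	printf("%d\n", CALL_KNOWN_1(cb, common_cb, 41)); /* indirect call */
	return 0;
}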

v2  -> v3:
 - fix build error with CONFIG_IPV6=m

v1  -> v2:
 - list explicitly the builtin function names in INDIRECT_CALL_*(),
   as suggested by Ed Cree
 - expand the recipients list

rfc -> v1:
 - use branch prediction hints, as suggested by Eric

[1] http://vger.kernel.org/netconf2018_files/PaoloAbeni_netconf2018.pdf
[2] 
https://linuxplumbersconf.org/event/2/contributions/99/attachments/98/117/lpc18_paper_af_xdp_perf-v2.pdf

Paolo Abeni (4):
  indirect call wrappers: helpers to speed-up indirect calls of builtin
  net: use indirect call wrappers at GRO network layer
  net: use indirect call wrappers at GRO transport layer
  udp: use indirect call wrappers for GRO socket lookup

 include/linux/indirect_call_wrapper.h | 51 +++
 include/net/inet_common.h             |  9 +
 net/core/dev.c                        | 15 ++--
 net/ipv4/af_inet.c                    | 13 +--
 net/ipv4/tcp_offload.c                |  6 ++--
 net/ipv4/udp_offload.c                | 15 +---
 net/ipv6/ip6_offload.c                | 35 +++---
 net/ipv6/tcpv6_offload.c              |  7 ++--
 net/ipv6/udp_offload.c                |  7 ++--
 9 files changed, 136 insertions(+), 22 deletions(-)
 create mode 100644 include/linux/indirect_call_wrapper.h

-- 
2.19.2



[PATCH net-next v3 2/4] net: use indirect call wrappers at GRO network layer

2018-12-14 Thread Paolo Abeni
This avoids an indirect call in the L3 GRO receive path, both
for ipv4 and ipv6, if the latter is not compiled as a module.

Note that when IPv6 is compiled as builtin, it will be checked first,
so we have a single additional compare for the more common path.
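
Depending on the configuration, the INDIRECT_CALL_INET() used below boils
down to roughly the following (hand-expanded sketch; 'cb' stands for
ptype->callbacks.gro_receive):

#if IS_BUILTIN(CONFIG_IPV6)
	/* both candidates are builtin: check the ipv6 one first */
	pp = INDIRECT_CALL_2(cb, ipv6_gro_receive, inet_gro_receive, head, skb);
#elif IS_ENABLED(CONFIG_INET)
	/* ipv6 is a module: only the ipv4 symbol can be referenced */
	pp = INDIRECT_CALL_1(cb, inet_gro_receive, head, skb);
#else
	pp = cb(head, skb);
#endif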

v1 -> v2:
 - adapted to INDIRECT_CALL_ changes

Signed-off-by: Paolo Abeni 
---
 include/net/inet_common.h |  2 ++
 net/core/dev.c            | 15 +--
 net/ipv6/ip6_offload.c    |  6 +++---
 3 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 3ca969cbd161..56e7592811ea 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -2,6 +2,8 @@
 #ifndef _INET_COMMON_H
 #define _INET_COMMON_H
 
+#include <linux/indirect_call_wrapper.h>
+
 extern const struct proto_ops inet_stream_ops;
 extern const struct proto_ops inet_dgram_ops;
 
diff --git a/net/core/dev.c b/net/core/dev.c
index ed9aa4a91f1f..1b5a4410be0e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -145,6 +145,7 @@
 #include 
 #include 
 #include 
+#include <linux/indirect_call_wrapper.h>
 
 #include "net-sysfs.h"
 
@@ -5338,6 +5339,8 @@ static void flush_all_backlogs(void)
put_online_cpus();
 }
 
+INDIRECT_CALLABLE_DECLARE(int inet_gro_complete(struct sk_buff *, int));
+INDIRECT_CALLABLE_DECLARE(int ipv6_gro_complete(struct sk_buff *, int));
 static int napi_gro_complete(struct sk_buff *skb)
 {
struct packet_offload *ptype;
@@ -5357,7 +5360,9 @@ static int napi_gro_complete(struct sk_buff *skb)
if (ptype->type != type || !ptype->callbacks.gro_complete)
continue;
 
-   err = ptype->callbacks.gro_complete(skb, 0);
+   err = INDIRECT_CALL_INET(ptype->callbacks.gro_complete,
+ipv6_gro_complete, inet_gro_complete,
+skb, 0);
break;
}
rcu_read_unlock();
@@ -5504,6 +5509,10 @@ static void gro_flush_oldest(struct list_head *head)
napi_gro_complete(oldest);
 }
 
+INDIRECT_CALLABLE_DECLARE(struct sk_buff *inet_gro_receive(struct list_head *,
+  struct sk_buff *));
+INDIRECT_CALLABLE_DECLARE(struct sk_buff *ipv6_gro_receive(struct list_head *,
+  struct sk_buff *));
 static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
 {
u32 hash = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
@@ -5553,7 +5562,9 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
NAPI_GRO_CB(skb)->csum_valid = 0;
}
 
-   pp = ptype->callbacks.gro_receive(gro_head, skb);
+   pp = INDIRECT_CALL_INET(ptype->callbacks.gro_receive,
+   ipv6_gro_receive, inet_gro_receive,
+   gro_head, skb);
break;
}
rcu_read_unlock();
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 70f525c33cb6..ff8b484d2258 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -164,8 +164,8 @@ static int ipv6_exthdrs_len(struct ipv6hdr *iph,
return len;
 }
 
-static struct sk_buff *ipv6_gro_receive(struct list_head *head,
-					struct sk_buff *skb)
+INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
+							 struct sk_buff *skb)
 {
const struct net_offload *ops;
struct sk_buff *pp = NULL;
@@ -301,7 +301,7 @@ static struct sk_buff *ip4ip6_gro_receive(struct list_head *head,
return inet_gro_receive(head, skb);
 }
 
-static int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
+INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
 {
const struct net_offload *ops;
struct ipv6hdr *iph = (struct ipv6hdr *)(skb->data + nhoff);
-- 
2.19.2



[PATCH net-next v3 4/4] udp: use indirect call wrappers for GRO socket lookup

2018-12-14 Thread Paolo Abeni
This avoids another indirect call for UDP GRO. Again, the test
for the IPv6 variant is performed first.

v1 -> v2:
 - adapted to INDIRECT_CALL_ changes

Signed-off-by: Paolo Abeni 
---
 net/ipv4/udp_offload.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 9a141a6cf1a0..64f9715173ac 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -392,6 +392,8 @@ static struct sk_buff *udp_gro_receive_segment(struct list_head *head,
return NULL;
 }
 
+INDIRECT_CALLABLE_DECLARE(struct sock *udp6_lib_lookup_skb(struct sk_buff *skb,
+  __be16 sport, __be16 dport));
 struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb,
struct udphdr *uh, udp_lookup_t lookup)
 {
@@ -403,7 +405,8 @@ struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb,
struct sock *sk;
 
rcu_read_lock();
-   sk = (*lookup)(skb, uh->source, uh->dest);
+   sk = INDIRECT_CALL_INET(lookup, udp6_lib_lookup_skb,
+   udp4_lib_lookup_skb, skb, uh->source, uh->dest);
if (!sk)
goto out_unlock;
 
@@ -503,7 +506,8 @@ int udp_gro_complete(struct sk_buff *skb, int nhoff,
uh->len = newlen;
 
rcu_read_lock();
-   sk = (*lookup)(skb, uh->source, uh->dest);
+   sk = INDIRECT_CALL_INET(lookup, udp6_lib_lookup_skb,
+   udp4_lib_lookup_skb, skb, uh->source, uh->dest);
if (sk && udp_sk(sk)->gro_enabled) {
err = udp_gro_complete_segment(skb);
} else if (sk && udp_sk(sk)->gro_complete) {
-- 
2.19.2



[PATCH net-next v3 1/4] indirect call wrappers: helpers to speed-up indirect calls of builtin

2018-12-14 Thread Paolo Abeni
This header defines a bunch of helpers that allow avoiding the
retpoline overhead when calling builtin functions via function pointers.
It boils down to explicitly comparing the function pointer to
known builtin functions and, on a match, invoking the latter directly.

The macros defined here implement the boilerplate for the above schema
and will be used by the next patches.
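
By way of example, modulo the likely() hints, INDIRECT_CALL_2(f, g2, g1, skb)
expands to roughly:

({
	f == g2 ? g2(skb) :
		  ({ f == g1 ? g1(skb) : f(skb); });
})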

rfc -> v1:
 - use branch prediction hint, as suggested by Eric
v1  -> v2:
 - list explicitly the builtin function names in INDIRECT_CALL_*(),
   as suggested by Ed Cree

Suggested-by: Eric Dumazet 
Signed-off-by: Paolo Abeni 
---
 include/linux/indirect_call_wrapper.h | 51 +++
 1 file changed, 51 insertions(+)
 create mode 100644 include/linux/indirect_call_wrapper.h

diff --git a/include/linux/indirect_call_wrapper.h b/include/linux/indirect_call_wrapper.h
new file mode 100644
index ..7c8b7f4948af
--- /dev/null
+++ b/include/linux/indirect_call_wrapper.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_INDIRECT_CALL_WRAPPER_H
+#define _LINUX_INDIRECT_CALL_WRAPPER_H
+
+#ifdef CONFIG_RETPOLINE
+
+/*
+ * INDIRECT_CALL_$NR - wrapper for indirect calls with $NR known builtins
+ *  @f: function pointer
+ *  @f$NR: builtin function names, up to $NR of them
+ *  @__VA_ARGS__: arguments for @f
+ *
+ * Avoid retpoline overhead for known builtins, checking @f vs. each of them
+ * and eventually invoking the builtin function directly. The functions are
+ * checked in the given order. Falls back to the indirect call.
+ */
+#define INDIRECT_CALL_1(f, f1, ...)					\
+	({								\
+		likely(f == f1) ? f1(__VA_ARGS__) : f(__VA_ARGS__);	\
+	})
+#define INDIRECT_CALL_2(f, f2, f1, ...)					\
+	({								\
+		likely(f == f2) ? f2(__VA_ARGS__) :			\
+				  INDIRECT_CALL_1(f, f1, __VA_ARGS__);	\
+	})
+
+#define INDIRECT_CALLABLE_DECLARE(f)   f
+#define INDIRECT_CALLABLE_SCOPE
+
+#else
+#define INDIRECT_CALL_1(f, name, ...) f(__VA_ARGS__)
+#define INDIRECT_CALL_2(f, name, ...) f(__VA_ARGS__)
+#define INDIRECT_CALLABLE_DECLARE(f)
+#define INDIRECT_CALLABLE_SCOPE	static
+#endif
+
+/*
+ * We can use INDIRECT_CALL_$NR for ipv6-related functions only if ipv6 is
+ * builtin; this macro simplifies dealing with indirect calls that have only
+ * ipv4/ipv6 alternatives
+ */
+#if IS_BUILTIN(CONFIG_IPV6)
+#define INDIRECT_CALL_INET(f, f2, f1, ...) \
+   INDIRECT_CALL_2(f, f2, f1, __VA_ARGS__)
+#elif IS_ENABLED(CONFIG_INET)
+#define INDIRECT_CALL_INET(f, f2, f1, ...) INDIRECT_CALL_1(f, f1, __VA_ARGS__)
+#else
+#define INDIRECT_CALL_INET(f, f2, f1, ...) f(__VA_ARGS__)
+#endif
+
+#endif
-- 
2.19.2



Re: [PATCH net-next v2 1/4] indirect call wrappers: helpers to speed-up indirect calls of builtin

2018-12-11 Thread Paolo Abeni
Hi,

I'm sorry for the long delay; I was (and still am) diverted by some
other duties.

On Fri, 2018-12-07 at 21:46 +, David Woodhouse wrote:
> On Fri, 2018-12-07 at 21:46 +0100, Paolo Abeni wrote:
> > > I wonder if we can declare the common case functions as 'weak' so that
> > > the link failures don't happen when they're absent.
> > 
> > I experimented a previous version with alias. I avoided weak alias
> > usage, because I [mis?]understood not all compilers have a complete
> > support for them (e.g. clang).
> > Also, with weak ref, a coding error that is now discovered at build
> > time will result in worse performance at runtime, likely with some
> > uncommon configuration, possibly not as easily detected. I'm unsure
> > that would be better ?!?
> 
> I think everything supports weak linkage; we've been using it for
> years.

OK, I was likely confused by some old, non-first-hand info.

Anyway, a weak alias would turn a compile-time issue into a possible
(small) run-time regression. I think the first option would be preferable.

> > I'm sorry, I don't follow here. I think static keys can't be used for
> > the reported network case: we have different list elements each
> > containing a different function pointer and we access/use a
> > different ptr on a per-packet basis.
> 
> Yes, the alternatives would be used to change the "likely" case.
> 
> We still do the "if (fn == default_fn) default_fn(); else (*fn)();"
> part; or even the variant with two (or more) common cases. 
> 
> It's just that the value of 'default_fn' can be changed at runtime
> (with patching like alternatives/static keys, since of course it has to
> be a direct call).

Thanks for clarifying. If I understood correctly, you would like some
helper for:

if (static_branch_likely(&key_default_fn_a))
	INDIRECT_CALL_1(f, default_fn_a, ...)
else if (static_branch_likely(&key_default_fn_b))
	INDIRECT_CALL_1(f, default_fn_b, ...)
// ...

if so, I think we can eventually add support for this kind of stuff on
top of the proposed macros. WDYT?
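
A fleshed-out sketch of such a helper could look like the following
(hypothetical code, never submitted; the key and function names are made up):

#include <linux/jump_label.h>
#include <linux/indirect_call_wrapper.h>

int default_fn_a(int arg);	/* hypothetical builtins */
int default_fn_b(int arg);
DEFINE_STATIC_KEY_TRUE(key_default_fn_a);

/* flip the static key at run time to re-point the 'likely' direct-call
 * target, keeping INDIRECT_CALL_1() as the building block
 */
static inline int dispatch(int (*f)(int), int arg)
{
	if (static_branch_likely(&key_default_fn_a))
		return INDIRECT_CALL_1(f, default_fn_a, arg);
	return INDIRECT_CALL_1(f, default_fn_b, arg);
}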

Thanks,

Paolo




Re: [PATCH net-next v2 1/4] indirect call wrappers: helpers to speed-up indirect calls of builtin

2018-12-07 Thread Paolo Abeni
On Fri, 2018-12-07 at 09:46 +, David Woodhouse wrote:
> On Wed, 2018-12-05 at 19:13 +0100, Paolo Abeni wrote:
> > +/*
> > + * We can use INDIRECT_CALL_$NR for ipv6 related functions only if ipv6 is
> > + * builtin, this macro simplify dealing with indirect calls with only 
> > ipv4/ipv6
> > + * alternatives
> > + */
> > +#if IS_BUILTIN(CONFIG_IPV6)
> > +#define INDIRECT_CALL_INET(f, f2, f1, ...) \
> > +   INDIRECT_CALL_2(f, f2, f1, __VA_ARGS__)
> > +#elif IS_ENABLED(CONFIG_INET)
> > +#define INDIRECT_CALL_INET(f, f2, f1, ...) INDIRECT_CALL_1(f, f1, 
> > __VA_ARGS__)
> > +#else
> > +#define INDIRECT_CALL_INET(f, f2, f1, ...) f(__VA_ARGS__)
> > +#endif
> > +
> > +#endif
> 
> Thanks for working on this.
> 
> I'm not stunningly keen on the part cited above. And it doesn't seem to
> be working either, given Dave's later error and reversion.

My bad, I did not test vs a relevant cfg. Hopefully that can be fixed.

> I wonder if we can declare the common case functions as 'weak' so that
> the link failures don't happen when they're absent.

I experimented with aliases in a previous version. I avoided weak
aliases because I [mis?]understood that not all compilers have complete
support for them (e.g. clang).
Also, with a weak ref, a coding error that is now discovered at build
time would result in worse performance at runtime, likely with some
uncommon configuration, and possibly not as easily detected. I'm unsure
that would be better ?!?

> Once we extend this past the network code, especially to file systems'
> f_ops, I suspect we're going to want to use something like static keys
> to patch the common cases at runtime — perhaps changing the f_ops
> default according to what the root file system is, etc.

I'm sorry, I don't follow here. I think static keys can't be used for the
reported network case: we have different list elements each containing a
different function pointer and we access/use a different ptr on a
per-packet basis.
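
i.e. the call sites in question look like the following (abridged from the
dev.c GRO path), where each list entry carries its own pointer, so no single
patched branch can cover them all:

list_for_each_entry_rcu(ptype, &offload_base, list) {
	if (ptype->type != type || !ptype->callbacks.gro_receive)
		continue;

	/* 'gro_receive' differs per entry: a static key can patch one
	 * global branch, not a per-entry target
	 */
	pp = ptype->callbacks.gro_receive(gro_head, skb);
	break;
}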

> I'd quite like to see the API for this taking that into account even if
> it's left to be a future development.

Again, I'm lost here. Can you please give more hints?

Thanks,

Paolo



Re: [PATCH net-next v2 0/4] net: mitigate retpoline overhead

2018-12-07 Thread Paolo Abeni
Hi,

On Thu, 2018-12-06 at 22:28 -0800, David Miller wrote:
> From: David Miller 
> Date: Thu, 06 Dec 2018 22:24:09 -0800 (PST)
> 
> > Series applied, thanks!
> 
> Erm... actually reverted.  Please fix these build failures:

oops ...
I'm sorry for the late reply. I'm travelling and I will not be able to
re-post soon.

> ld: net/ipv6/ip6_offload.o: in function `ipv6_gro_receive':
> ip6_offload.c:(.text+0xda2): undefined reference to `udp6_gro_receive'
> ld: ip6_offload.c:(.text+0xdb6): undefined reference to `udp6_gro_receive'
> ld: net/ipv6/ip6_offload.o: in function `ipv6_gro_complete':
> ip6_offload.c:(.text+0x1953): undefined reference to `udp6_gro_complete'
> ld: ip6_offload.c:(.text+0x1966): undefined reference to `udp6_gro_complete'
> make: *** [Makefile:1036: vmlinux] Error 1

Are you building with CONFIG_IPV6=m? I tested vs. some common configs,
but I omitted that one in my last iteration (my bad). With such a config
the ip6 offloads are builtin while the udp6 offloads end up in the ipv6
module, so I can't use them with the given config.

I'll try to fix the above in v3.

I'm sorry for this mess,

Paolo



Re: [PATCH] x86: only use ERMS for user copies for larger sizes

2018-11-21 Thread Paolo Abeni
On Wed, 2018-11-21 at 06:32 -0700, Jens Axboe wrote:
> I did some more investigation yesterday, and found this:
> 
> commit 236222d39347e0e486010f10c1493e83dbbdfba8
> Author: Paolo Abeni 
> Date:   Thu Jun 29 15:55:58 2017 +0200
> 
> x86/uaccess: Optimize copy_user_enhanced_fast_string() for short strings
> 
> which does attempt to rectify it, but only using ERMS for >= 64 byte copies.
> At least for me, looks like the break even point is higher than that, which
> would mean that something like the below would be more appropriate. 

Back then I used a custom kernel module and the TSC to micro-benchmark
the patched function and the original one with different buffer sizes.
I'm sorry, the relevant code has been lost.

In my experiments 64 bytes was the break-even point for all the CPUs I
had handy, but I guess that may change with other models.
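
The lost module was presumably something along these lines (a from-scratch
sketch, not the original code; memcpy() stands in for the copy_user routine
under test, which is not directly callable from a module):

#include <linux/module.h>
#include <linux/string.h>
#include <asm/msr.h>	/* rdtsc(), x86 only */

#define ITERS	1000000

static char dst[256], src[256];

static int __init copybench_init(void)
{
	unsigned long long t0, t1;
	int i, len = 64;

	t0 = rdtsc();
	for (i = 0; i < ITERS; i++)
		memcpy(dst, src, len);	/* stand-in for the routine under test */
	t1 = rdtsc();
	pr_info("len %d: %llu cycles/copy\n", len, (t1 - t0) / ITERS);
	return -ENODEV;	/* nothing to keep loaded */
}
module_init(copybench_init);
MODULE_LICENSE("GPL");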

Cheers,

Paolo



Re: WARNING in static_key_disable_cpuslocked

2018-11-17 Thread Paolo Abeni
Hi,

On Sat, 2018-11-17 at 06:52 -0800, Ard Biesheuvel wrote:
> (+ Paolo, Dave)
> 
> On Sat, 17 Nov 2018 at 01:59, syzbot
>  wrote:
> > 
> > Hello,
> > 
> > syzbot found the following crash on:
> > 
> > HEAD commit:442b8cea2477 Add linux-next specific files for 20181109
> > git tree:   linux-next
> > console output: https://syzkaller.appspot.com/x/log.txt?x=11f1dc2540
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=2f72bdb11df9fbe8
> > dashboard link: https://syzkaller.appspot.com/bug?extid=6cffdef928852907751c
> > compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1012653340
> > 
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+6cffdef9288529077...@syzkaller.appspotmail.com

This should be addressed in current net-next, by commit:

commit 9c48060141bd937497774546e4bb89b8992be383
Author: Paolo Abeni 
Date:   Thu Nov 15 02:34:50 2018 +0100

udp: fix jump label misuse

Cheers,

Paolo



Re: [PATCH] add param that allows bootline control of hardened usercopy

2018-06-26 Thread Paolo Abeni
[hopefully fixed the 'mm' recipient]

On Tue, 2018-06-26 at 09:54 -0700, Kees Cook wrote:
> On Tue, Jun 26, 2018 at 2:48 AM, Paolo Abeni  wrote:
> > With CONFIG_HARDENED_USERCOPY=y, perf shows ~6% of CPU time spent
> > cumulatively in __check_object_size (~4%) and __virt_addr_valid (~2%).
> 
> Are you able to see which network functions are making the
> __check_object_size() calls?

The call-chain is:

__GI___libc_recvfrom   
entry_SYSCALL_64_after_hwframe 
do_syscall_64  
__x64_sys_recvfrom 
__sys_recvfrom 
inet_recvmsg   
udp_recvmsg
__check_object_size

udp_recvmsg() actually calls copy_to_iter() (inlined) and the latter
calls check_copy_size() (again, inlined).
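
For reference, the inlining chain works out roughly like this (simplified
from include/linux/uio.h and include/linux/thread_info.h of that era):

/* include/linux/uio.h, simplified */
static __always_inline size_t
copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
{
	if (unlikely(!check_copy_size(addr, bytes, true)))
		return 0;
	return _copy_to_iter(addr, bytes, i);
}

/* include/linux/thread_info.h, simplified: with HARDENED_USERCOPY this
 * is where __check_object_size() enters the per-packet receive path
 */
static __always_inline bool
check_copy_size(const void *addr, size_t bytes, bool is_source)
{
	check_object_size(addr, bytes, is_source); /* -> __check_object_size() */
	return true;
}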

Cheers,

Paolo



Re: [PATCH] add param that allows bootline control of hardened usercopy

2018-06-26 Thread Paolo Abeni
Hi,

On Mon, 25 Jun 2018 11:21:38 -0700 Kees Cook  wrote:
> On Mon, Jun 25, 2018 at 8:08 AM, Chris von Recklinghausen
>  wrote:
> > Enabling HARDENED_USER_COPY causes measurable regressions in the
> > networking performances, up to 8% under UDP flood.
> 
> Which function is "hot"? i.e. which copy*user() is taking up the time?
> Do you have a workload that at can be used to reproduce the problem?

I'm running a small-packet UDP flood using pktgen vs. a host connected
back-to-back. On the receiver side the UDP packets are processed by a
simple user-space process that just reads and drops them:

https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c

Not very useful from a functional PoV; it mostly helps in pinpointing
bottlenecks in the networking stack.
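
The sink is essentially just the following (a minimal re-sketch, not the
actual udp_sink.c; the port number is made up):

#include <stdio.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
	struct sockaddr_in addr = {
		.sin_family      = AF_INET,
		.sin_port        = htons(6666),	/* made-up port */
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};
	char buf[2048];
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("socket/bind");
		return 1;
	}
	for (;;)			/* read and drop */
		recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
}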

When running a kernel with CONFIG_HARDENED_USERCOPY=y, I see a 5-8%
regression in the receive tput, compared to the same kernel without
that option.

With CONFIG_HARDENED_USERCOPY=y, perf shows ~6% of CPU time spent
cumulatively in __check_object_size (~4%) and __virt_addr_valid (~2%).

Cheers,

Paolo



Re: WARNING in ip_recv_error

2018-05-24 Thread Paolo Abeni
On Wed, 2018-05-23 at 11:40 -0400, Willem de Bruijn wrote:
> On Sun, May 20, 2018 at 7:13 PM, Willem de Bruijn
>  wrote:
> > On Fri, May 18, 2018 at 2:59 PM, Willem de Bruijn
> >  wrote:
> > > On Fri, May 18, 2018 at 2:46 PM, Willem de Bruijn
> > >  wrote:
> > > > On Fri, May 18, 2018 at 2:44 PM, Willem de Bruijn
> > > >  wrote:
> > > > > On Fri, May 18, 2018 at 1:09 PM, Willem de Bruijn
> > > > >  wrote:
> > > > > > On Fri, May 18, 2018 at 11:44 AM, David Miller 
> > > > > >  wrote:
> > > > > > > From: Eric Dumazet 
> > > > > > > Date: Fri, 18 May 2018 08:30:43 -0700
> > > > > > > 
> > > > > > > > We probably need to revert Willem patch 
> > > > > > > > (7ce875e5ecb8562fd44040f69bda96c999e38bbc)
> > > > > > > 
> > > > > > > Is it really valid to reach ip_recv_err with an ipv6 socket?
> > > > > > 
> > > > > > I guess the issue is that setsockopt IPV6_ADDRFORM is not an
> > > > > > atomic operation, so that the socket is neither fully ipv4 nor fully
> > > > > > ipv6 by the time it reaches ip_recv_error.
> > > > > > 
> > > > > >   sk->sk_socket->ops = &inet_dgram_ops;
> > > > > >   < HERE >
> > > > > >   sk->sk_family = PF_INET;
> > > > > > 
> > > > > > Even calling inet_recv_error to demux would not necessarily help.
> > > > > > 
> > > > > > Safest would be to look up by skb->protocol, similar to what
> > > > > > ipv6_recv_error does to handle v4-mapped-v6.
> > > > > > 
> > > > > > Or to make that function safe with PF_INET and swap the order
> > > > > > of the above two operations.
> > > > > > 
> > > > > > All sound needlessly complicated for this rare socket option, but
> > > > > > I don't have a better idea yet. Dropping on the floor is not nice,
> > > > > > either.
> > > > > 
> > > > > Ensuring that ip_recv_error correctly handles packets from either
> > > > > socket and removing the warning should indeed be good.
> > > > > 
> > > > > It is robust against v4-mapped packets from an AF_INET6 socket,
> > > > > but see caveat on reconnect below.
> > > > > 
> > > > > The code between ipv6_recv_error for v4-mapped addresses and
> > > > > ip_recv_error is essentially the same, the main difference being
> > > > > whether to return network headers as sockaddr_in with SOL_IP
> > > > > or sockaddr_in6 with SOL_IPV6.
> > > > > 
> > > > > There are very few other locations in the stack that explicitly test
> > > > > sk_family in this way and thus would be vulnerable to races with
> > > > > IPV6_ADDRFORM.
> > > > > 
> > > > > I'm not sure whether it is possible for a udpv6 socket to queue a
> > > > > real ipv6 packet on the error queue, disconnect, connect to an
> > > > > ipv4 address, call IPV6_ADDRFORM and then call ip_recv_error
> > > > > on a true ipv6 packet. That would return buggy data, e.g., in
> > > > > msg_name.
> > > > 
> > > > In do_ipv6_setsockopt IPV6_ADDRFORM we can test that the
> > > > error queue is empty, and then take its lock for the duration of the
> > > > operation.
> > > 
> > > Actually, no reason to hold the lock. This setsockopt holds the socket
> > > lock, which connect would need, too. So testing that the queue
> > > is empty after testing that it is connected to a v4 address is
> > > sufficient to ensure that no ipv6 packets are queued for reception.
> > > 
> > > diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
> > > index 4d780c7f0130..a975d6311341 100644
> > > --- a/net/ipv6/ipv6_sockglue.c
> > > +++ b/net/ipv6/ipv6_sockglue.c
> > > @@ -199,6 +199,11 @@ static int do_ipv6_setsockopt(struct sock *sk,
> > > int level, int optname,
> > > 
> > > if (ipv6_only_sock(sk) ||
> > > !ipv6_addr_v4mapped(&sk->sk_v6_daddr)) {
> > > retv = -EADDRNOTAVAIL;
> > > break;
> > > }
> > > 
> > > +   if (!skb_queue_empty(&sk->sk_error_queue)) {
> > > +   retv = -EBUSY;
> > > +   break;
> > > +   }
> > > +
> > > fl6_free_socklist(sk);
> > > __ipv6_sock_mc_close(sk);
> > > 
> > > After this it should be safe to remove the warning in ip_recv_error.
> > 
> > Hmm.. nope.
> > 
> > This ensures that the socket cannot produce any new true v6 packets.
> > But it does not guarantee that they are not already in the system, e.g.
> > queued in tc, and will find their way to the error queue later.
> > 
> > We'll have to just be able to handle ipv6 packets in ip_recv_error.
> > Since IPV6_ADDRFORM is used to pass to legacy v4-only
> > processes and those likely are only confused by SOL_IPV6
> > error messages, it is probably best to just drop them and perhaps
> > WARN_ONCE.
> 
> Even more fun, this is not limited to the error queue.
> 
> I can queue a v6 packet for reception on a socket, connect to a v4
> address, call IPV6_ADDRFORM and then a regular recvfrom will
> return a partial v6 address as AF_INET.
> 
> We definitely do not want to 

Re: [PATCH] ipvlan: flush arp table when mac address changed

2018-05-14 Thread Paolo Abeni
Hi,

On Sat, 2018-05-12 at 19:00 +0800, liuq...@huawei.com wrote:
> From: Keefe Liu 
> 
> When the master device's mac has been changed, commit
> 32c10bbfe914 ("ipvlan: always use the current L2 addr of the
> master") makes the IPVlan devices' mac change as well, but it
> doesn't flush the IPVlan's arp table.
> 
> Signed-off-by: Keefe Liu 
> ---
>  drivers/net/ipvlan/ipvlan_main.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
> index 450eec2..a1edfe1 100644
> --- a/drivers/net/ipvlan/ipvlan_main.c
> +++ b/drivers/net/ipvlan/ipvlan_main.c
> @@ -7,6 +7,8 @@
>   *
>   */
>  
> +#include 
> +#include 
>  #include "ipvlan.h"
>  
>  static unsigned int ipvlan_netid __read_mostly;
> @@ -792,8 +794,10 @@ static int ipvlan_device_event(struct notifier_block *unused,
>   break;
>  
>   case NETDEV_CHANGEADDR:
> - list_for_each_entry(ipvlan, &port->ipvlans, pnode)
> + list_for_each_entry(ipvlan, &port->ipvlans, pnode) {
>   ether_addr_copy(ipvlan->dev->dev_addr, dev->dev_addr);
> + neigh_changeaddr(&arp_tbl, ipvlan->dev);
> + }

Why not use:

 call_netdevice_notifiers(NETDEV_CHANGEADDR, ipvlan->dev);

instead?

That is what other stacked devices (bridge and vlan) are currently
doing in the same scenario.

Thanks,

Paolo


Re: [PATCH] net: improve ipv4 performances

2018-04-04 Thread Paolo Abeni
On Sun, 2018-04-01 at 20:31 +0200, Anton Gary Ceph wrote:
> After a few profiling and analysis, turned out that the ethertype field
> of the packets has the following distribution:
[...]
>  0.6% don't know/no opinion

Am I the only one finding the submission date and the above info
suspicious ?!?

/P


Re: [bug, bisected] pfifo_fast causes packet reordering

2018-03-19 Thread Paolo Abeni
Hi,

On Fri, 2018-03-16 at 11:26 +0100, Jakob Unterwurzacher wrote:
> On 15.03.18 23:30, John Fastabend wrote:
> > > I have reproduced it using two USB network cards connected to each other. 
> > > The test tool sends UDP packets containing a counter and listens on the 
> > > other interface, it is available at
> > > https://github.com/jakob-tsd/pfifo_stress/blob/master/pfifo_stress.py
> > > 
> > 
> > Great thanks, can you also run this with taskset to bind to
> > a single CPU,
> > 
> >   # taskset 0x1 ./pifof_stress.py
> > 
> > And let me know if you still see the OOO.
> 
> Interesting. Looks like it depends on which core it runs on. CPU0 is 
> clean, CPU1 is not.
> 
> Clean:taskset --cpu-list 0 ./pfifo_stress.py
> 
> Broken:   taskset --cpu-list 1 ./pfifo_stress.py
> 
> Maybe related: CPU0 is where USB interrupts are handled:
> 
> > root@rk3399-q7:~# cat /proc/interrupts   
> >CPU0   CPU1   CPU2   CPU3   CPU4   CPU5
> > 217:2175353  0  0  0  0  0 
> > GICv3 142 Level xhci-hcd:usb5


It's not clear to me whether you can reproduce the bug with the vanilla
kernel, or if you need some out-of-tree NIC driver. Can you please
clarify which NIC/driver you are using?

Thanks,

Paolo


Re: BUG: unable to handle kernel paging request in compat_copy_entries

2018-03-05 Thread Paolo Abeni
On Mon, 2018-03-05 at 00:21 -0800, syzbot wrote:
> Hello,
> 
> syzbot hit the following crash on upstream commit
> 5fbdefcf685defd8bc5a8f37b17538d25c58d77a (Fri Mar 2 21:05:20 2018 +)
> Merge branch 'parisc-4.16-1' of  
> git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
> 
> So far this crash happened 5 times on upstream.
> syzkaller reproducer is attached.
> Raw console output is attached.
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached.
> user-space arch: i386
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+5705ba91388d7bc30...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for  
> details.
> If you forward the report, please keep this part and the footer.
> 
> audit: type=1400 audit(1520098078.492:8): avc:  denied  { map } for   
> pid=4239 comm="syz-execprog" path="/root/syzkaller-shm255959590" dev="sda1"  
> ino=16482 scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023  
> tcontext=unconfined_u:object_r:file_t:s0 tclass=file permissive=1
> IPVS: ftp: loaded support on port[0] = 21
> BUG: unable to handle kernel paging request at c90001819e4f
> IP: ebt_size_mwt net/bridge/netfilter/ebtables.c:2037 [inline]
> IP: size_entry_mwt net/bridge/netfilter/ebtables.c:2122 [inline]
> IP: compat_copy_entries+0x49f/0x1050 net/bridge/netfilter/ebtables.c:2160
> PGD 1dad2f067 P4D 1dad2f067 PUD 1dad30067 PMD 1b2408067 PTE 0
> Oops:  [#1] SMP KASAN
> Dumping ftrace buffer:
> (ftrace buffer empty)
> Modules linked in:
> CPU: 1 PID: 4249 Comm: syz-executor0 Not tainted 4.16.0-rc3+ #248
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
> Google 01/01/2011
> RIP: 0010:ebt_size_mwt net/bridge/netfilter/ebtables.c:2037 [inline]
> RIP: 0010:size_entry_mwt net/bridge/netfilter/ebtables.c:2122 [inline]
> RIP: 0010:compat_copy_entries+0x49f/0x1050  
> net/bridge/netfilter/ebtables.c:2160
> RSP: 0018:8801b34bf7e8 EFLAGS: 00010246
> RAX: 000a RBX: 8801b34bf9d4 RCX: c90001819e4f
> RDX:  RSI:  RDI: 8801b34bf9d8
> RBP: 8801b34bf968 R08:  R09: 
> R10: 88613340 R11: 0001 R12: ee5f
> R13: dc00 R14: 8801b34bf9c8 R15: c90001819e2f
> FS:  () GS:8801db30(0063) knlGS:085b9900
> CS:  0010 DS: 002b ES: 002b CR0: 80050033
> CR2: c90001819e4f CR3: 0001b2bd7003 CR4: 001606e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>   compat_do_replace+0x398/0x7c0 net/bridge/netfilter/ebtables.c:2249
>   compat_do_ebt_set_ctl+0x22a/0x2d0 net/bridge/netfilter/ebtables.c:2330
>   compat_nf_sockopt net/netfilter/nf_sockopt.c:144 [inline]
>   compat_nf_setsockopt+0x88/0x130 net/netfilter/nf_sockopt.c:156
>   compat_ip_setsockopt+0x8b/0xd0 net/ipv4/ip_sockglue.c:1285
>   inet_csk_compat_setsockopt+0x95/0x120 net/ipv4/inet_connection_sock.c:1041
>   compat_tcp_setsockopt+0x3d/0x70 net/ipv4/tcp.c:2916
>   compat_sock_common_setsockopt+0xb2/0x140 net/core/sock.c:2986
>   C_SYSC_setsockopt net/compat.c:403 [inline]
>   compat_SyS_setsockopt+0x17c/0x410 net/compat.c:386
>   do_syscall_32_irqs_on arch/x86/entry/common.c:330 [inline]
>   do_fast_syscall_32+0x3ec/0xf9f arch/x86/entry/common.c:392
>   entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
> RIP: 0023:0xf7fbbc99
> RSP: 002b:ffd5ab8c EFLAGS: 0286 ORIG_RAX: 016e
> RAX: ffda RBX: 0003 RCX: 
> RDX: 0080 RSI: 2280 RDI: 0208
> RBP:  R08:  R09: 
> R10:  R11:  R12: 
> R13:  R14:  R15: 
> Code: 8d 4f 20 48 89 c8 48 89 8d c8 fe ff ff 48 c1 e8 03 42 0f b6 14 28 48  
> 89 c8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 b2 0a 00 00 <45> 8b 67 20  
> 44 39 a5 04 ff ff ff 0f 82 bd 08 00 00 e8 cb 52 56
> RIP: ebt_size_mwt net/bridge/netfilter/ebtables.c:2037 [inline] RSP:  
> 8801b34bf7e8
> RIP: size_entry_mwt net/bridge/netfilter/ebtables.c:2122 [inline] RSP:  
> 8801b34bf7e8
> RIP: compat_copy_entries+0x49f/0x1050 net/bridge/netfilter/ebtables.c:2160  
> RSP: 8801b34bf7e8
> CR2: c90001819e4f
> ---[ end trace cf111332eb971f16 ]---
> 
> 
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkal...@googlegroups.com.
> 
> syzbot will keep track of this bug report.
> If you forgot to add the Reported-by tag, once the fix for this bug is  
> merged
> into any tree, please reply to this email with:
> #syz fix: exact-commit-title
> If you want to test a patch for this bug, please reply with:
> #syz 

Re: BUG: sleeping function called from invalid context at net/core/sock.c:LINE (3)

2018-02-19 Thread Paolo Abeni
On Mon, 2018-02-19 at 13:23 +, Jon Maloy wrote:
> I don't understand this one. tipc_topsrv_stop() can only be triggered
> from a user doing rmmod(), and I double checked that this is running
> in user mode.
> How does the call chain you are reporting occur?

tipc_topsrv_stop() is called also at net namespace destruction time:

static void __net_exit tipc_exit_net(struct net *net)
{
	tipc_topsrv_stop(net);
	#...

I *think* the following should fix the issue, but I'm unsure if it's
safe.

#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 
master

---
diff --git a/net/tipc/topsrv.c b/net/tipc/topsrv.c
index 02013e00f287..63f35eae7236 100644
--- a/net/tipc/topsrv.c
+++ b/net/tipc/topsrv.c
@@ -693,9 +693,9 @@ void tipc_topsrv_stop(struct net *net)
 	}
 	__module_get(lsock->ops->owner);
 	__module_get(lsock->sk->sk_prot_creator->owner);
-	sock_release(lsock);
 	srv->listener = NULL;
 	spin_unlock_bh(&srv->idr_lock);
+	sock_release(lsock);
 	tipc_topsrv_work_stop(srv);
 	idr_destroy(&srv->conn_idr);
 	kfree(srv);


Re: WARNING: proc registration bug in clusterip_tg_check

2018-02-07 Thread Paolo Abeni
On Wed, 2018-02-07 at 09:43 +0100, Paolo Abeni wrote:
> On Tue, 2018-02-06 at 22:42 -0800, Cong Wang wrote:
> > On Tue, Feb 6, 2018 at 6:27 AM, syzbot
> > <syzbot+03218bcdba6aa7644...@syzkaller.appspotmail.com> wrote:
> > > Hello,
> > > 
> > > syzbot hit the following crash on net-next commit
> > > 617aebe6a97efa539cc4b8a52adccd89596e6be0 (Sun Feb 4 00:25:42 2018 +)
> > > Merge tag 'usercopy-v4.16-rc1' of
> > > git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
> > > 
> > > So far this crash happened 5 times on net-next, upstream.
> > > C reproducer is attached.
> > > syzkaller reproducer is attached.
> > > Raw console output is attached.
> > > compiler: gcc (GCC) 7.1.1 20170620
> > > .config is attached.
> > > 
> > > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > > Reported-by: syzbot+03218bcdba6aa7644...@syzkaller.appspotmail.com
> > > It will help syzbot understand when the bug is fixed. See footer for
> > > details.
> > > If you forward the report, please keep this part and the footer.
> > > 
> > > x_tables: ip_tables: osf match: only valid for protocol 6
> > > x_tables: ip_tables: osf match: only valid for protocol 6
> > > x_tables: ip_tables: osf match: only valid for protocol 6
> > > [ cut here ]
> > > proc_dir_entry 'ipt_CLUSTERIP/172.20.0.170' already registered
> > > WARNING: CPU: 1 PID: 4152 at fs/proc/generic.c:330 
> > > proc_register+0x2a4/0x370
> > > fs/proc/generic.c:329
> > > Kernel panic - not syncing: panic_on_warn set ...
> > > 
> > > CPU: 1 PID: 4152 Comm: syzkaller851476 Not tainted 4.15.0+ #221
> > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > > Google 01/01/2011
> > > Call Trace:
> > >  __dump_stack lib/dump_stack.c:17 [inline]
> > >  dump_stack+0x194/0x257 lib/dump_stack.c:53
> > >  panic+0x1e4/0x41c kernel/panic.c:183
> > >  __warn+0x1dc/0x200 kernel/panic.c:547
> > >  report_bug+0x211/0x2d0 lib/bug.c:184
> > >  fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
> > >  fixup_bug arch/x86/kernel/traps.c:247 [inline]
> > >  do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
> > >  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
> > >  invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:1097
> > > RIP: 0010:proc_register+0x2a4/0x370 fs/proc/generic.c:329
> > > RSP: 0018:8801cbd6ee20 EFLAGS: 00010286
> > > RAX: dc08 RBX: 8801d2181038 RCX: 815a57ae
> > > RDX:  RSI: 1100397add74 RDI: 1100397add49
> > > RBP: 8801cbd6ee70 R08: 1100397add0b R09: 
> > > R10: 8801cbd6ecd8 R11:  R12: 8801b2bb1cc0
> > > R13: dc00 R14: 8801b0d8dbc8 R15: 8801b2bb1d81
> > >  proc_create_data+0xf8/0x180 fs/proc/generic.c:494
> > >  clusterip_config_init net/ipv4/netfilter/ipt_CLUSTERIP.c:250 [inline]
> > 
> > I think there is probably a race condition between 
> > clusterip_config_entry_put()
> > and clusterip_config_init(), after we release the spinlock, a new proc
> > with the same IP could be created therefore triggers this warning
> > 
> > I am not sure if it is enough to just move the proc_remove() under
> > spinlock...
> 
> I *think* we should change the order of the proc fs entry creation,
> because clusterip_config_init() can race with itself: the config
> lookup returns NULL if the config has no pde, and currently the new
> entry is inserted into the list (and the list lock released) with a
> NULL pde, before the pde itself is created.

I was wrong. My suggested fix does not work at all.

I tried your code and it fixes the issue here.

Feel free to submit with:

Tested-by: Paolo Abeni <pab...@redhat.com> 

Thank you,

Paolo


Re: WARNING: proc registration bug in clusterip_tg_check

2018-02-07 Thread Paolo Abeni
#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git master

I can't reproduce the issue locally, so asking the syzbot to test the
tentative fix for me (and hoping I did not mess with the tag/format)

---
 net/ipv4/netfilter/ipt_CLUSTERIP.c | 30 +++++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c b/net/ipv4/netfilter/ipt_CLUSTERIP.c
index 3a84a60f6b39..db103cd971a9 100644
--- a/net/ipv4/netfilter/ipt_CLUSTERIP.c
+++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c
@@ -230,17 +230,6 @@ clusterip_config_init(struct net *net, const struct ipt_clusterip_tgt_info *i,
 	refcount_set(&c->refcount, 1);
 	refcount_set(&c->entries, 1);
 
-	spin_lock_bh(&cn->lock);
-	if (__clusterip_config_find(net, ip)) {
-		spin_unlock_bh(&cn->lock);
-		kfree(c);
-
-		return ERR_PTR(-EBUSY);
-	}
-
-	list_add_rcu(&c->list, &cn->configs);
-	spin_unlock_bh(&cn->lock);
-
 #ifdef CONFIG_PROC_FS
 	{
 		char buffer[16];
@@ -257,20 +246,31 @@ clusterip_config_init(struct net *net, const struct ipt_clusterip_tgt_info *i,
 	}
 #endif
 
+	spin_lock_bh(&cn->lock);
+	if (__clusterip_config_find(net, ip)) {
+		spin_unlock_bh(&cn->lock);
+		err = -EBUSY;
+		goto err_remove_pte;
+	}
+
+	list_add_rcu(&c->list, &cn->configs);
+	spin_unlock_bh(&cn->lock);
+
 	c->notifier.notifier_call = clusterip_netdev_event;
 	err = register_netdevice_notifier(&c->notifier);
 	if (!err)
 		return c;
 
+	spin_lock_bh(&cn->lock);
+	list_del_rcu(&c->list);
+	spin_unlock_bh(&cn->lock);
+
+err_remove_pte:
 #ifdef CONFIG_PROC_FS
 	proc_remove(c->pde);
 err:
 #endif
-	spin_lock_bh(&cn->lock);
-	list_del_rcu(&c->list);
-	spin_unlock_bh(&cn->lock);
 	kfree(c);
-
 	return ERR_PTR(err);
 }
 
-- 
2.14.3



Re: possible deadlock in rtnl_lock (3)

2018-02-07 Thread Paolo Abeni
On Tue, 2018-02-06 at 19:00 +0100, Dmitry Vyukov wrote:
> On Tue, Feb 6, 2018 at 6:58 PM, syzbot
>  wrote:
> > Hello,
> > 
> > syzbot hit the following crash on net-next commit
> > 617aebe6a97efa539cc4b8a52adccd89596e6be0 (Sun Feb 4 00:25:42 2018 +)
> > Merge tag 'usercopy-v4.16-rc1' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
> > 
> > So far this crash happened 2510 times on net-next, upstream.
> > C reproducer is attached.
> > syzkaller reproducer is attached.
> > Raw console output is attached.
> > compiler: gcc (GCC) 7.1.1 20170620
> > .config is attached.
> > 
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+63682ce11532e0da2...@syzkaller.appspotmail.com
> > It will help syzbot understand when the bug is fixed. See footer for
> > details.
> > If you forward the report, please keep this part and the footer.
> 
> 
> Paolo, was this also fixed by "netfilter: on sockopt() acquire sock
> lock only in the required scope"?

I *think* this is fixed by the above commit; anyway, I'll probably be
unable to verify that statement soon.

Thanks,

Paolo


Re: WARNING: proc registration bug in clusterip_tg_check

2018-02-07 Thread Paolo Abeni
On Tue, 2018-02-06 at 22:42 -0800, Cong Wang wrote:
> On Tue, Feb 6, 2018 at 6:27 AM, syzbot
>  wrote:
> > Hello,
> > 
> > syzbot hit the following crash on net-next commit
> > 617aebe6a97efa539cc4b8a52adccd89596e6be0 (Sun Feb 4 00:25:42 2018 +)
> > Merge tag 'usercopy-v4.16-rc1' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
> > 
> > So far this crash happened 5 times on net-next, upstream.
> > C reproducer is attached.
> > syzkaller reproducer is attached.
> > Raw console output is attached.
> > compiler: gcc (GCC) 7.1.1 20170620
> > .config is attached.
> > 
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+03218bcdba6aa7644...@syzkaller.appspotmail.com
> > It will help syzbot understand when the bug is fixed. See footer for
> > details.
> > If you forward the report, please keep this part and the footer.
> > 
> > x_tables: ip_tables: osf match: only valid for protocol 6
> > x_tables: ip_tables: osf match: only valid for protocol 6
> > x_tables: ip_tables: osf match: only valid for protocol 6
> > [ cut here ]
> > proc_dir_entry 'ipt_CLUSTERIP/172.20.0.170' already registered
> > WARNING: CPU: 1 PID: 4152 at fs/proc/generic.c:330 proc_register+0x2a4/0x370
> > fs/proc/generic.c:329
> > Kernel panic - not syncing: panic_on_warn set ...
> > 
> > CPU: 1 PID: 4152 Comm: syzkaller851476 Not tainted 4.15.0+ #221
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > Call Trace:
> >  __dump_stack lib/dump_stack.c:17 [inline]
> >  dump_stack+0x194/0x257 lib/dump_stack.c:53
> >  panic+0x1e4/0x41c kernel/panic.c:183
> >  __warn+0x1dc/0x200 kernel/panic.c:547
> >  report_bug+0x211/0x2d0 lib/bug.c:184
> >  fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
> >  fixup_bug arch/x86/kernel/traps.c:247 [inline]
> >  do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
> >  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
> >  invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:1097
> > RIP: 0010:proc_register+0x2a4/0x370 fs/proc/generic.c:329
> > RSP: 0018:8801cbd6ee20 EFLAGS: 00010286
> > RAX: dc08 RBX: 8801d2181038 RCX: 815a57ae
> > RDX:  RSI: 1100397add74 RDI: 1100397add49
> > RBP: 8801cbd6ee70 R08: 1100397add0b R09: 
> > R10: 8801cbd6ecd8 R11:  R12: 8801b2bb1cc0
> > R13: dc00 R14: 8801b0d8dbc8 R15: 8801b2bb1d81
> >  proc_create_data+0xf8/0x180 fs/proc/generic.c:494
> >  clusterip_config_init net/ipv4/netfilter/ipt_CLUSTERIP.c:250 [inline]
> 
> I think there is probably a race condition between 
> clusterip_config_entry_put()
> and clusterip_config_init(), after we release the spinlock, a new proc
> with the same IP could be created therefore triggers this warning
> 
> I am not sure if it is enough to just move the proc_remove() under
> spinlock...

I *think* we should change the order of the proc fs entry creation,
because clusterip_config_init() can race with itself: the config
lookup returns NULL if the config has no pde, and currently the new
entry is inserted into the list (and the list lock released) with a
NULL pde, before the pde itself is created.

I'll try to test something like the following:
---
diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c b/net/ipv4/netfilter/ipt_CLUSTERIP.c
index 3a84a60f6b39..d8807c44cc61 100644
--- a/net/ipv4/netfilter/ipt_CLUSTERIP.c
+++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c
@@ -230,17 +230,6 @@ clusterip_config_init(struct net *net, const struct ipt_clusterip_tgt_info *i,
 	refcount_set(&c->refcount, 1);
 	refcount_set(&c->entries, 1);
 
-	spin_lock_bh(&cn->lock);
-	if (__clusterip_config_find(net, ip)) {
-		spin_unlock_bh(&cn->lock);
-		kfree(c);
-
-		return ERR_PTR(-EBUSY);
-	}
-
-	list_add_rcu(&c->list, &cn->configs);
-	spin_unlock_bh(&cn->lock);
-
 #ifdef CONFIG_PROC_FS
 	{
 		char buffer[16];
@@ -257,6 +246,18 @@ clusterip_config_init(struct net *net, const struct ipt_clusterip_tgt_info *i,
 	}
 #endif
 
+	spin_lock_bh(&cn->lock);
+	if (__clusterip_config_find(net, ip)) {
+		spin_unlock_bh(&cn->lock);
+		proc_remove(c->pde);
+		kfree(c);
+
+		return ERR_PTR(-EBUSY);
+	}
+
+	list_add_rcu(&c->list, &cn->configs);
+	spin_unlock_bh(&cn->lock);
+
 	c->notifier.notifier_call = clusterip_netdev_event;
 	err = register_netdevice_notifier(&c->notifier);
 	if (!err)
---

Cheers,

Paolo
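
To make the window concrete, a compressed userspace sketch of the
ordering the patches in this thread move to (proc_create_stub() and
proc_remove_stub() are hypothetical stand-ins for the proc fs calls):
because the lookup treats a NULL-pde entry as absent, publishing the
entry before creating its pde lets a second creator slip through;
creating the pde first and publishing last closes the window.

#include <errno.h>
#include <pthread.h>
#include <stddef.h>

void *proc_create_stub(unsigned int ip);	/* hypothetical */
void proc_remove_stub(void *pde);		/* hypothetical */

struct cfg {
	struct cfg *next;
	unsigned int ip;
	void *pde;
};

static pthread_mutex_t cfg_lock = PTHREAD_MUTEX_INITIALIZER;
static struct cfg *cfg_list;

/* like the clusterip lookup, an entry without a pde is invisible */
static struct cfg *cfg_find(unsigned int ip)
{
	struct cfg *c;

	for (c = cfg_list; c; c = c->next)
		if (c->ip == ip && c->pde)
			return c;
	return NULL;
}

/* fixed ordering: create the named resource first, publish last, and
 * on a lost race tear the private copy down in reverse order */
int cfg_register(struct cfg *c)
{
	c->pde = proc_create_stub(c->ip);
	if (!c->pde)
		return -ENOMEM;

	pthread_mutex_lock(&cfg_lock);
	if (cfg_find(c->ip)) {
		pthread_mutex_unlock(&cfg_lock);
		proc_remove_stub(c->pde);
		return -EBUSY;
	}
	c->next = cfg_list;
	cfg_list = c;
	pthread_mutex_unlock(&cfg_lock);
	return 0;
}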


Re: [RFC PATCH 0/4] softirq: Per vector threading v3

2018-01-24 Thread Paolo Abeni
On Wed, 2018-01-24 at 10:05 -0500, David Miller wrote:
> From: Paolo Abeni <pab...@redhat.com>
> Date: Wed, 24 Jan 2018 15:54:05 +0100
> 
> > Niklas suggested a possible relation with CONFIG_IRQ_TIME_ACCOUNTING=y
> > and indeed he was right.
> > 
> > The patched kernel under test had CONFIG_IRQ_TIME_ACCOUNTING set, and
> > very little CPU time was accounted to the kworker:
> > 
> > [2125 is the relevant kworker's pid]
> > grep sum_exec_runtime /proc/2125/sched; sleep 10; grep sum_exec_runtime 
> > /proc/2125/sched
> > se.sum_exec_runtime  : 13408.239286
> > se.sum_exec_runtime  : 13456.907197
> > 
> > even though that process was processing a lot of packets and
> > basically burning a CPU.
> 
> So IRQ_TIME_ACCOUNTING makes the scheduler think that the worker
> threads are using nearly no task time at all.

Yes, this is the behavior I observe in the test. But a quick look at
the scheduler code - I'm not very familiar with it - leads me to think
this is not the intended/expected behaviour for ksoftirqd (and, after
this series, for the kworker serving the softirq).

> The existing ksoftirqd code should hit the same problem, right?

I just tried the vanilla kernel with CONFIG_IRQ_TIME_ACCOUNTING=y, and
in the same test scenario, I observe the 'good' behavior: the user
space process and ksoftirqd share the relevant CPU almost fairly, and
the CPU time spent in softirq processing is accounted to ksoftirqd.

Cheers,

Paolo
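
A toy model of the mechanism suspected here (an assumption from a quick
reading, not the actual scheduler code): under
CONFIG_IRQ_TIME_ACCOUNTING the clock that feeds se.sum_exec_runtime
advances by the wall delta minus the time spent in hard/soft IRQ
context, so a kworker burning its CPU in softirq context looks nearly
idle to CFS:

struct rq_clock {
	unsigned long long clock;	/* wall clock, ns */
	unsigned long long clock_task;	/* wall clock minus irq time */
};

static void rq_clock_update(struct rq_clock *rq,
			    unsigned long long delta,
			    unsigned long long irq_delta)
{
	rq->clock += delta;
	/* a worker doing ~all its work in softirq context contributes
	 * almost nothing here, and se.sum_exec_runtime follows
	 * clock_task, hence the nearly flat runtime observed above */
	rq->clock_task += delta - irq_delta;
}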



Re: [RFC PATCH 0/4] softirq: Per vector threading v3

2018-01-24 Thread Paolo Abeni
On Tue, 2018-01-23 at 09:42 -0800, Linus Torvalds wrote:
> On Tue, Jan 23, 2018 at 8:57 AM, Paolo Abeni <pab...@redhat.com> wrote:
> > 
> > > Or is it that the workqueue execution is simply not yielding for some
> > > reason?
> > 
> > It's like that.
> > 
> > I spent little time on it, so I don't have many data points. I'll try to
> > investigate the scenario later this week.
> 
> Hmm. workqueues seem to use cond_resched_rcu_qs(), which does a
> cond_resched() (and a RCU quiescent note).
> 
> But I wonder if the test triggers the "lets run lots of workqueue
> threads", and then the single-threaded user space just gets blown out
> of the water by many kernel threads. Each thread gets its own "fair"
> amount of CPU, but..

Niklas suggested a possible relation with CONFIG_IRQ_TIME_ACCOUNTING=y
and indeed he was right.

The patched kernel under test had CONFIG_IRQ_TIME_ACCOUNTING set, and
very little CPU time was accounted to the kworker:

[2125 is the relevant kworker's pid]
grep sum_exec_runtime /proc/2125/sched; sleep 10; grep sum_exec_runtime 
/proc/2125/sched
se.sum_exec_runtime  : 13408.239286
se.sum_exec_runtime  : 13456.907197

even though that process was processing a lot of packets and basically
burning a CPU.

Switching CONFIG_IRQ_TIME_ACCOUNTING off I see the expected behaviour:
top reports that the user space process and kworker share the CPU
almost fairly and the user space process is able to receive a
reasonable amount of packets.

Paolo



Re: [RFC PATCH 0/4] softirq: Per vector threading v3

2018-01-23 Thread Paolo Abeni
On Tue, 2018-01-23 at 11:22 -0500, David Miller wrote:
> From: Paolo Abeni <pab...@redhat.com>
> Date: Tue, 23 Jan 2018 11:13:52 +0100
> 
> > Hi,
> > 
> > On Fri, 2018-01-19 at 16:46 +0100, Frederic Weisbecker wrote:
> >> As per Linus suggestion, this take doesn't limit the number of occurrences
> >> per jiffy anymore but instead defers a vector to workqueues as soon as
> >> it gets re-enqueued on IRQ tail.
> >> 
> >> No tunable here, so testing should be easier.
> >> 
> >> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> >>  softirq/thread-v3
> >> 
> >> HEAD: 6835e92cbd70ef4a056987d2e1ed383b294429d4
> > 
> > I tested this series in the UDP flood scenario, binding the user space
> > process receiving the packets to the same CPU that processes the
> > related IRQ, and the tput sinks nearly to 0, like before Eric's patch.
> > 
> > The perf tool says that almost all the softirq processing is done
> > inside the workqueue, but the user space process is scheduled very
> > rarely, while before this series, in this scenario, ksoftirqd and the
> > user space process got a fair share of the CPU time.
> 
> Do workqueue threads get a higher scheduling priority than user
> processes?

As far as I can see, no: the workqueue thread has the same priority and
nice level as the user space process.

> Or is it that the workqueue execution is simply not yielding for some
> reason?

It's like that.

I spent little time on it, so I don't have many data points. I'll try to
investigate the scenario later this week.

Cheers,

Paolo



Re: [RFC PATCH 0/4] softirq: Per vector threading v3

2018-01-23 Thread Paolo Abeni
Hi,

On Fri, 2018-01-19 at 16:46 +0100, Frederic Weisbecker wrote:
> As per Linus suggestion, this take doesn't limit the number of occurrences
> per jiffy anymore but instead defers a vector to workqueues as soon as
> it gets re-enqueued on IRQ tail.
> 
> No tunable here, so testing should be easier.
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
>   softirq/thread-v3
> 
> HEAD: 6835e92cbd70ef4a056987d2e1ed383b294429d4

I tested this series in the UDP flood scenario, binding the user space
process receiving the packets to the same CPU that processes the
related IRQ, and the tput sinks nearly to 0, like before Eric's patch.

The perf tool says that almost all the softirq processing is done
inside the workqueue, but the user space process is scheduled very
rarely, while before this series, in this scenario, ksoftirqd and the
user space process got a fair share of the CPU time.

Cheers,

Paolo


Re: [RFC PATCH 2/2] softirq: Per vector thread deferment

2018-01-12 Thread Paolo Abeni
On Fri, 2018-01-12 at 06:35 +0100, Frederic Weisbecker wrote:
> Some softirq vectors can be more CPU hungry than others. Especially
> networking may sometimes deal with packet storm and need more CPU than
> IRQ tail can offer without inducing scheduler latencies. In this case
> the current code defers to ksoftirqd that behaves nicer. Now this nice
> behaviour can be bad for other IRQ vectors that usually need quick
> processing.
> 
> To solve this we only defer to threading the vectors that outreached the
> time limit on IRQ tail processing and leave the others inline on real
> Soft-IRQs service. This is achieved using workqueues with
> per-CPU/per-vector worklets.
> 
> Note ksoftirqd is not removed as it is still needed for threaded IRQs
> mode.
> 
> Suggested-by: Linus Torvalds <torva...@linux-foundation.org>
> Signed-off-by: Frederic Weisbecker <frede...@kernel.org>
> Cc: Dmitry Safonov <d...@arista.com>
> Cc: Eric Dumazet <eduma...@google.com>
> Cc: Linus Torvalds <torva...@linux-foundation.org>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Cc: David Miller <da...@davemloft.net>
> Cc: Hannes Frederic Sowa <han...@stressinduktion.org>
> Cc: Ingo Molnar <mi...@kernel.org>
> Cc: Levin Alexander <alexander.le...@verizon.com>
> Cc: Paolo Abeni <pab...@redhat.com>
> Cc: Paul E. McKenney <paul...@linux.vnet.ibm.com>
> Cc: Radu Rendec <rren...@arista.com>
> Cc: Rik van Riel <r...@redhat.com>
> Cc: Stanislaw Gruszka <sgrus...@redhat.com>
> Cc: Thomas Gleixner <t...@linutronix.de>
> Cc: Wanpeng Li <wanpeng...@hotmail.com>
> ---
>  kernel/softirq.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 87 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index fa267f7..0c817ec6 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -74,6 +74,13 @@ struct softirq_stat {
>  
>  static DEFINE_PER_CPU(struct softirq_stat, softirq_stat_cpu);
>  
> +struct vector_work {
> + int vec;
> + struct work_struct work;
> +};
> +
> +static DEFINE_PER_CPU(struct vector_work[NR_SOFTIRQS], vector_work_cpu);
> +
>  /*
>   * we cannot loop indefinitely here to avoid userspace starvation,
>   * but we also don't want to introduce a worst case 1/HZ latency
> @@ -251,6 +258,70 @@ static inline bool lockdep_softirq_start(void) { return false; }
>  static inline void lockdep_softirq_end(bool in_hardirq) { }
>  #endif
>  
> +static void vector_work_func(struct work_struct *work)
> +{
> + struct vector_work *vector_work;
> + u32 pending;
> + int vec;
> +
> + vector_work = container_of(work, struct vector_work, work);
> + vec = vector_work->vec;
> +
> + local_irq_disable();
> + pending = local_softirq_pending();
> + account_irq_enter_time(current);
> + __local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
> + lockdep_softirq_enter();
> + set_softirq_pending(pending & ~(1 << vec));
> + local_irq_enable();
> +
> + if (pending & (1 << vec)) {
> + struct softirq_action *sa = &softirq_vec[vec];
> +
> + kstat_incr_softirqs_this_cpu(vec);
> + trace_softirq_entry(vec);
> + sa->action(sa);
> + trace_softirq_exit(vec);
> + }
> +
> + local_irq_disable();
> +
> + pending = local_softirq_pending();
> + if (pending & (1 << vec))
> + schedule_work_on(smp_processor_id(), work);

If we check for the overrun condition here, as done in the
__do_softirq() main loop, we could avoid ksoftirqd completely and
probably have less code duplication.

> +
> + lockdep_softirq_exit();
> + account_irq_exit_time(current);
> + __local_bh_enable(SOFTIRQ_OFFSET);
> + local_irq_enable();
> +}
> +
> +static int do_softirq_overrun(u32 overrun, u32 pending)
> +{
> + struct softirq_action *h = softirq_vec;
> + int softirq_bit;
> +
> + if (!overrun)
> + return pending;
> +
> + overrun &= pending;
> + pending &= ~overrun;
> +
> + while ((softirq_bit = ffs(overrun))) {
> + struct vector_work *work;
> + unsigned int vec_nr;
> +
> + h += softirq_bit - 1;
> + vec_nr = h - softirq_vec;
> + work = this_cpu_ptr(&vector_work_cpu[vec_nr]);
> + schedule_work_on(smp_processor_id(), &work->work);
> + h++;
> + overrun >>= softirq_bit;
> + }
> +
> + return pending;
> +}
>
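
To make that last remark concrete, something like the following at the
re-queue point (only a sketch on top of the quoted patch: softirq_overrun()
is an assumed helper applying the same time/need_resched() limits as the
__do_softirq() main loop, and "restart" would label the top of the
vector-serving block in vector_work_func()):

	pending = local_softirq_pending();
	if (pending & (1 << vec)) {
		if (!softirq_overrun())
			goto restart;	/* budget left: keep serving inline */
		schedule_work_on(smp_processor_id(), work);
	}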
