Re: [PATCH net-next] virtio_net: ethtool tx napi configuration

2018-09-13 Thread Willem de Bruijn
On Thu, Sep 13, 2018 at 11:53 PM Jason Wang  wrote:
>
>
>
> On 2018-09-14 11:40, Willem de Bruijn wrote:
> > On Thu, Sep 13, 2018 at 11:27 PM Jason Wang  wrote:
> >>
> >>
> >> On 2018-09-13 22:58, Willem de Bruijn wrote:
> >>> On Thu, Sep 13, 2018 at 5:02 AM Jason Wang  wrote:
> >>>>
> >>>> On 2018-09-13 07:27, Willem de Bruijn wrote:
> >>>>> On Wed, Sep 12, 2018 at 3:11 PM Willem de Bruijn wrote:
> >>>>>> On Wed, Sep 12, 2018 at 2:16 PM Florian Fainelli wrote:
> >>>>>>> On 9/12/2018 11:07 AM, Willem de Bruijn wrote:
> >>>>>>>> On Wed, Sep 12, 2018 at 1:42 PM Florian Fainelli wrote:
> >>>>>>>>> On 9/9/2018 3:44 PM, Willem de Bruijn wrote:
> >>>>>>>>>> From: Willem de Bruijn
> >>>>>>>>>>
> >>>>>>>>>> Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
> >>>>>>>>>> Interrupt moderation is currently not supported, so these accept and
> >>>>>>>>>> display the default settings of 0 usec and 1 frame.
> >>>>>>>>>>
> >>>>>>>>>> Toggle tx napi through a bit in tx-frames. So as to not interfere
> >>>>>>>>>> with possible future interrupt moderation, use bit 10, well outside
> >>>>>>>>>> the reasonable range of real interrupt moderation values.
> >>>>>>>>>>
> >>>>>>>>>> Changes are not atomic. The tx IRQ, napi BH and transmit path must
> >>>>>>>>>> be quiesced when switching modes. Only allow changing this setting
> >>>>>>>>>> when the device is down.
> >>>>>>>>> Humm, would not a private ethtool flag to switch TX NAPI on/off be more
> >>>>>>>>> appropriate rather than use the coalescing configuration API here?
> >>>>>>>> What do you mean by private ethtool flag? A new field in ethtool
> >>>>>>>> --features (-k)?
> >>>>>>> I meant using ethtool_drvinfo::n_priv_flags, ETH_SS_PRIV_FLAGS and then
> >>>>>>> ETHTOOL_GFPFLAGS and ETHTOOL_SPFLAGS to control the toggling of that
> >>>>>>> private flag. mlx5 has a number of private flags for instance.
> >>>>>> Interesting, thanks! I was not at all aware of those ethtool flags.
> >>>>>> Am having a look. It definitely looks promising.
> >>>>> Okay, I made that change. That is indeed much cleaner, thanks.
> >>>>> Let me send the patch, initially as RFC.
> >>>>>
> >>>>> I've observed one issue where if we toggle the flag before bringing
> >>>>> up the device, it hits a kernel BUG at include/linux/netdevice.h:515
> >>>>>
> >>>>>    BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
> >>>> This reminds me that we need to check netif_running() before trying to
> >>>> enable and disable tx napi in ethtool_set_coalesce().
> >>> The first iteration of my patch checked IFF_UP and effectively
> >>> only allowed the change when not running. What do you mean
> >>> by need to check?
> >> I mean if device is not up, there's no need to toggle napi state and tx
> >> lock.
> >>
> >>> And to respond to the other follow-up notes at once:
> >>>
> >>>> Consider we may have interrupt moderation in the future, I tend to use
> >>>> set_coalesce. Otherwise we may need two steps to enable moderation:
> >>>>
> >>>> - tx-napi on
> >>>> - set_coalesce
> >>> FWIW, I don't care strongly whether we do this through coalesce or priv_flags.
> >> Ok.
> > Since you prefer coalesce, let's go with that (and a revision of your
> > latest patch).
>
> Good to know this.
>
> >>>>> + if (!napi_weight)
> >>>>> + virtqueue_enable_cb(vi->sq[i].vq);
> >>>> I don't get why we need to disable enable cb here.
> >>> To avoid entering no-napi mode with too few descriptors to
> >>> make progress and no way to get out of that state. This is a
> >>> pretty crude attempt at handling that, admittedly.
> >> But in this case, we will call enable_cb_delayed() and we will finally
> >> get an interrupt?
> > Right. It's a bit of a roundabout way to ensure that
> > netif_tx_wake_queue and thus eventually free_old_xmit_skbs are called.
> > It might make more sense to just wake the device without going through
> > an interrupt.
>
> I'm not sure I get this. If we don't enable tx napi, we tend to delay the TX
> interrupt if we find the ring is about to be full, to avoid an interrupt
> storm, so we're probably OK in this case.

I'm only concerned about the transition state when converting from
napi to no-napi when the queue is stopped and tx interrupt disabled.

With napi mode the interrupt is only disabled if napi is scheduled,
in which case it will eventually reenable the interrupt. But when
switching to no-napi mode in this state no progress will be made.

But it seems this cannot happen. When converting to no-napi
mode, set_coalesce waits for napi to complete in napi_disable.
So the interrupt should always start enabled when transitioning
into no-napi mode.
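
The reasoning above can be checked against a toy model. This is a minimal
userspace sketch, not driver code: struct txq_model and the model_* helpers
are hypothetical names that only encode the two facts stated in this mail,
namely that in napi mode the interrupt is disabled only while a poll is
scheduled, and that napi_disable waits for that poll to complete.

```c
#include <stdbool.h>

/* Toy model of one tx queue; every name here is illustrative only. */
struct txq_model {
	bool napi_mode;      /* true: tx napi, false: no-napi */
	bool napi_scheduled; /* a poll is pending or running */
	bool irq_enabled;    /* tx interrupt (virtqueue callback) state */
};

/* Interrupt fires in napi mode: schedule a poll and mask the interrupt. */
static void model_tx_irq(struct txq_model *q)
{
	if (q->napi_mode && q->irq_enabled) {
		q->napi_scheduled = true;
		q->irq_enabled = false;
	}
}

/* The poll finishes: completion re-enables the interrupt. */
static void model_napi_complete(struct txq_model *q)
{
	if (q->napi_scheduled) {
		q->napi_scheduled = false;
		q->irq_enabled = true;
	}
}

/* Switch to no-napi: first wait for a scheduled poll, as napi_disable does. */
static void model_switch_to_no_napi(struct txq_model *q)
{
	model_napi_complete(q); /* stands in for the wait inside napi_disable */
	q->napi_mode = false;
}
```

In this model there is no path that reaches no-napi mode with irq_enabled
still false, which is the property argued for above.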


Re: [RFC PATCH net-next v1 00/14] rename and shrink i40evf

2018-09-13 Thread Benjamin Poirier
On 2018/09/13 15:31, Jesse Brandeburg wrote:
[...]
> 
> ---
> v1: initial RFC
> 
> Jesse Brandeburg (14):
>   intel-ethernet: rename i40evf to iavf

Seems like patch 1 didn't make it to netdev
https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20180910/014025.html

>   iavf: diet and reformat
>   iavf: rename functions and structs to new name
>   iavf: rename i40e_status to iavf_status
>   iavf: move i40evf files to new name
>   iavf: remove references to old names
>   iavf: rename device ID defines
>   iavf: rename I40E_ADMINQ_DESC
>   iavf: rename i40e_hw to iavf_hw
>   iavf: replace i40e_debug with iavf version
>   iavf: tracing infrastructure rename
>   iavf: rename most of i40e strings
>   iavf: finish renaming files to iavf
>   intel-ethernet: use correct module license


[PATCH net] veth: Orphan skb before GRO

2018-09-13 Thread Toshiaki Makita
GRO expects skbs not to be owned by sockets, but when XDP is enabled veth
passed skbs owned by sockets, which corrupted sk_wmem_alloc.

Paolo Abeni reported the following splat:

[  362.098904] refcount_t overflow at skb_set_owner_w+0x5e/0xa0 in 
iperf3[1644], uid/euid: 0/0
[  362.108239] WARNING: CPU: 0 PID: 1644 at kernel/panic.c:648 
refcount_error_report+0xa0/0xa4
[  362.117547] Modules linked in: tcp_diag inet_diag veth intel_rapl sb_edac 
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass 
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore 
intel_rapl_perf ipmi_ssif iTCO_wdt sg ipmi_si iTCO_vendor_support ipmi_devintf 
mxm_wmi ipmi_msghandler pcspkr dcdbas mei_me wmi mei lpc_ich acpi_power_meter 
pcc_cpufreq xfs libcrc32c sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect 
sysimgblt fb_sys_fops ixgbe igb ttm ahci mdio libahci ptp crc32c_intel drm 
pps_core libata i2c_algo_bit dca dm_mirror dm_region_hash dm_log dm_mod
[  362.176622] CPU: 0 PID: 1644 Comm: iperf3 Not tainted 4.19.0-rc2.vanilla+ 
#2025
[  362.184777] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 
06/16/2016
[  362.193124] RIP: 0010:refcount_error_report+0xa0/0xa4
[  362.198758] Code: 08 00 00 48 8b 95 80 00 00 00 49 8d 8c 24 80 0a 00 00 41 
89 c1 44 89 2c 24 48 89 de 48 c7 c7 18 4d e7 9d 31 c0 e8 30 fa ff ff <0f> 0b eb 
88 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 49 89 fc
[  362.219711] RSP: 0018:9ee6ff603c20 EFLAGS: 00010282
[  362.225538] RAX:  RBX: 9de83e10 RCX: 
[  362.233497] RDX: 0001 RSI: 9ee6ff6167d8 RDI: 9ee6ff6167d8
[  362.241457] RBP: 9ee6ff603d78 R08: 0490 R09: 0004
[  362.249416] R10:  R11: 9ee6ff603990 R12: 9ee664b94500
[  362.257377] R13:  R14: 0004 R15: 9de615f9
[  362.265337] FS:  7f1d22d28740() GS:9ee6ff60() 
knlGS:
[  362.274363] CS:  0010 DS:  ES:  CR0: 80050033
[  362.280773] CR2: 7f1d222f35d0 CR3: 001fddfec003 CR4: 001606f0
[  362.288733] Call Trace:
[  362.291459]  
[  362.293702]  ex_handler_refcount+0x4e/0x80
[  362.298269]  fixup_exception+0x35/0x40
[  362.302451]  do_trap+0x109/0x150
[  362.306048]  do_error_trap+0xd5/0x130
[  362.315766]  invalid_op+0x14/0x20
[  362.319460] RIP: 0010:skb_set_owner_w+0x5e/0xa0
[  362.324512] Code: ef ff ff 74 49 48 c7 43 60 20 7b 4a 9d 8b 85 f4 01 00 00 
85 c0 75 16 8b 83 e0 00 00 00 f0 01 85 44 01 00 00 0f 88 d8 23 16 00 <5b> 5d c3 
80 8b 91 00 00 00 01 8b 85 f4 01 00 00 89 83 a4 00 00 00
[  362.345465] RSP: 0018:9ee6ff603e20 EFLAGS: 00010a86
[  362.351291] RAX: 1100 RBX: 9ee65deec700 RCX: 9ee65e829244
[  362.359250] RDX: 0100 RSI: 9ee65e829100 RDI: 9ee65deec700
[  362.367210] RBP: 9ee65e829100 R08: 0002a380 R09: 
[  362.375169] R10: 0002 R11: f1a4bf77bb00 R12: c0754661d000
[  362.383130] R13: 9ee65deec200 R14: 9ee65f597000 R15: 00aa
[  362.391092]  veth_xdp_rcv+0x4e4/0x890 [veth]
[  362.399357]  veth_poll+0x4d/0x17a [veth]
[  362.403731]  net_rx_action+0x2af/0x3f0
[  362.407912]  __do_softirq+0xdd/0x29e
[  362.411897]  do_softirq_own_stack+0x2a/0x40
[  362.416561]  
[  362.418899]  do_softirq+0x4b/0x70
[  362.422594]  __local_bh_enable_ip+0x50/0x60
[  362.427258]  ip_finish_output2+0x16a/0x390
[  362.431824]  ip_output+0x71/0xe0
[  362.440670]  __tcp_transmit_skb+0x583/0xab0
[  362.445333]  tcp_write_xmit+0x247/0xfb0
[  362.449609]  __tcp_push_pending_frames+0x2d/0xd0
[  362.454760]  tcp_sendmsg_locked+0x857/0xd30
[  362.459424]  tcp_sendmsg+0x27/0x40
[  362.463216]  sock_sendmsg+0x36/0x50
[  362.467104]  sock_write_iter+0x87/0x100
[  362.471382]  __vfs_write+0x112/0x1a0
[  362.475369]  vfs_write+0xad/0x1a0
[  362.479062]  ksys_write+0x52/0xc0
[  362.482759]  do_syscall_64+0x5b/0x180
[  362.486841]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  362.492473] RIP: 0033:0x7f1d22293238
[  362.496458] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 
0f 1e fa 48 8d 05 c5 54 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 
f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[  362.517409] RSP: 002b:7ffebaef8008 EFLAGS: 0246 ORIG_RAX: 
0001
[  362.525855] RAX: ffda RBX: 2800 RCX: 7f1d22293238
[  362.533816] RDX: 2800 RSI: 7f1d22d36000 RDI: 0005
[  362.541775] RBP: 7f1d22d36000 R08: 0002db777a30 R09: 562b70712b20
[  362.549734] R10:  R11: 0246 R12: 0005
[  362.557693] R13: 2800 R14: 7ffebaef8060 R15: 562b70712260

In order to avoid this, orphan the skb before entering GRO.

Fixes: 948d4f214fde ("veth: Add driver XDP")
Reported-by: Paolo Abeni 
Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 4 ++--
 1 file 

Re: [PATCH net-next 0/8] bnxt_en: devlink param updates

2018-09-13 Thread Vasundhara Volam
On Wed, Sep 12, 2018 at 3:20 PM Jakub Kicinski
 wrote:
>
> On Wed, 12 Sep 2018 12:09:37 +0530, Vasundhara Volam wrote:
> > On Tue, Sep 11, 2018 at 5:04 PM Jakub Kicinski wrote:
> > > On Tue, 11 Sep 2018 14:14:57 +0530, Vasundhara Volam wrote:
> > > > This patchset adds support for 4 generic and 1 driver-specific devlink
> > > > parameters.
> > > >
> > > > Also, this patchset adds support to return proper error code if
> > > > HWRM_NVM_GET/SET_VARIABLE commands return error code
> > > > HWRM_ERR_CODE_RESOURCE_ACCESS_DENIED.
> > > >
> > > > Vasundhara Volam (8):
> > > >   devlink: Add generic parameter hw_tc_offload
> > >
> > > Much like Jiri, I can't help but wonder why do you need this?
> >
> > There is a request from our customer for a way to toggle tc_offload
> > feature in our adapter.
>
> Vasundhara, again, we don't need to know who asked you to do this, but
> _why_.  What problem are you solving?  What is the customer trying to
> achieve?
For brand-new, big features like TC_offload, a few customers are not willing
to have it enabled by default in the adapter (firmware). Disabling TC_offload
by default in the adapter was a subjective decision.
>
> > > >   devlink: Add generic parameter ignore_ari
> > > >   devlink: Add generic parameter msix_vec_per_pf_max
> > > >   devlink: Add generic parameter msix_vec_per_pf_min
> > >
> > > IMHO more structured API would be preferable if possible.  The string
> > > keys won't scale if you want to set the parameters per PF, and
> > > creating more structured API for PCIe which is a relatively slow
> > > moving HW spec seems tractable.
> >
> > Sorry, could you please suggest an example? We will try to adapt.
>
> My thinking was that the same way devlink device has ports, it should
> have PCIe functions as objects which then have attributes.  Instead of
> making everything a string-identified device attribute.  But I'm not
> dead set on this if others don't think it's a good idea.
Actually these parameters are for the port, but the value given to each param
applies to an individual PF. That's the reason I added the "per_pf" string.
If you think this is not a good idea, I can move these params to driver-specific.


Re: [PATCH net] net/ipv6: do not copy DST_NOCOUNT flag on rt init

2018-09-13 Thread David Ahern
On 9/13/18 1:38 PM, Peter Oskolkov wrote:

> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 3eed045c65a5..a3902f805305 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -946,7 +946,7 @@ static void ip6_rt_init_dst_reject(struct rt6_info *rt, struct fib6_info *ort)
>  
>  static void ip6_rt_init_dst(struct rt6_info *rt, struct fib6_info *ort)
>  {
> - rt->dst.flags |= fib6_info_dst_flags(ort);
> + rt->dst.flags |= fib6_info_dst_flags(ort) & ~DST_NOCOUNT;

I think my mistake is setting dst.flags in ip6_rt_init_dst. The flags
argument is passed to ip6_dst_alloc, which is always invoked before
ip6_rt_copy_init, the only caller of ip6_rt_init_dst.

>  
>   if (ort->fib6_flags & RTF_REJECT) {
>   ip6_rt_init_dst_reject(rt, ort);
> 
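
For reference, the masking in the quoted hunk amounts to the following. This
is only a standalone sketch of the bit operation: the EX_* values and the
helper name are made up for illustration, and the real flag definitions live
in include/net/dst.h.

```c
#include <stdint.h>

/* Illustrative flag values only; not copied from the kernel headers. */
#define EX_DST_HOST    0x0001u
#define EX_DST_NOCOUNT 0x0008u

/* OR the source flags into dst, but never propagate EX_DST_NOCOUNT. */
static uint16_t ex_copy_flags_without_nocount(uint16_t dst, uint16_t src)
{
	return (uint16_t)(dst | (src & (uint16_t)~EX_DST_NOCOUNT));
}
```

A NOCOUNT bit already present in dst survives, since only the copied-in bits
are masked. That is why where the flags get set in the first place matters.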



Re: [PATCH net-next] virtio_net: ethtool tx napi configuration

2018-09-13 Thread Jason Wang




On 2018-09-14 11:40, Willem de Bruijn wrote:

On Thu, Sep 13, 2018 at 11:27 PM Jason Wang  wrote:



On 2018-09-13 22:58, Willem de Bruijn wrote:

On Thu, Sep 13, 2018 at 5:02 AM Jason Wang  wrote:


On 2018-09-13 07:27, Willem de Bruijn wrote:

On Wed, Sep 12, 2018 at 3:11 PM Willem de Bruijn
 wrote:

On Wed, Sep 12, 2018 at 2:16 PM Florian Fainelli  wrote:

On 9/12/2018 11:07 AM, Willem de Bruijn wrote:

On Wed, Sep 12, 2018 at 1:42 PM Florian Fainelli  wrote:

On 9/9/2018 3:44 PM, Willem de Bruijn wrote:

From: Willem de Bruijn 

Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
Interrupt moderation is currently not supported, so these accept and
display the default settings of 0 usec and 1 frame.

Toggle tx napi through a bit in tx-frames. So as to not interfere
with possible future interrupt moderation, use bit 10, well outside
the reasonable range of real interrupt moderation values.

Changes are not atomic. The tx IRQ, napi BH and transmit path must
be quiesced when switching modes. Only allow changing this setting
when the device is down.

Humm, would not a private ethtool flag to switch TX NAPI on/off be more
appropriate rather than use the coalescing configuration API here?

What do you mean by private ethtool flag? A new field in ethtool
--features (-k)?

I meant using ethtool_drvinfo::n_priv_flags, ETH_SS_PRIV_FLAGS and then
ETHTOOL_GFPFLAGS and ETHTOOL_SPFLAGS to control the toggling of that
private flag. mlx5 has a number of private flags for instance.

Interesting, thanks! I was not at all aware of those ethtool flags.
Am having a look. It definitely looks promising.

Okay, I made that change. That is indeed much cleaner, thanks.
Let me send the patch, initially as RFC.

I've observed one issue where if we toggle the flag before bringing
up the device, it hits a kernel BUG at include/linux/netdevice.h:515

   BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));

This reminds me that we need to check netif_running() before trying to
enable and disable tx napi in ethtool_set_coalesce().

The first iteration of my patch checked IFF_UP and effectively
only allowed the change when not running. What do you mean
by need to check?

I mean if device is not up, there's no need to toggle napi state and tx
lock.


And to respond to the other follow-up notes at once:


Consider we may have interrupt moderation in the future, I tend to use
set_coalesce. Otherwise we may need two steps to enable moderation:

- tx-napi on
- set_coalesce

FWIW, I don't care strongly whether we do this through coalesce or priv_flags.

Ok.

Since you prefer coalesce, let's go with that (and a revision of your
latest patch).


Good to know this.


+ if (!napi_weight)
+ virtqueue_enable_cb(vi->sq[i].vq);

I don't get why we need to disable enable cb here.

To avoid entering no-napi mode with too few descriptors to
make progress and no way to get out of that state. This is a
pretty crude attempt at handling that, admittedly.

But in this case, we will call enable_cb_delayed() and we will finally
get an interrupt?

Right. It's a bit of a roundabout way to ensure that
netif_tx_wake_queue and thus eventually free_old_xmit_skbs are called.
It might make more sense to just wake the device without going through
an interrupt.


I'm not sure I get this. If we don't enable tx napi, we tend to delay the TX
interrupt if we find the ring is about to be full, to avoid an interrupt
storm, so we're probably OK in this case.


Thanks
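
The no-napi behaviour described above can be pictured with a small userspace
model. This is a hedged sketch, not the driver's transmit path:
struct ring_model, model_after_xmit and the threshold constant are assumptions
made for illustration.

```c
#include <stdbool.h>

/* Assumed threshold: room for one more worst-case packet. */
#define EX_MAX_SKB_FRAGS  17
#define EX_STOP_THRESHOLD (2 + EX_MAX_SKB_FRAGS)

struct ring_model {
	unsigned int num_free;  /* free descriptors left in the tx ring */
	bool queue_stopped;     /* stack may no longer hand us packets */
	bool delayed_irq_armed; /* a delayed tx interrupt has been requested */
};

/*
 * Called after a packet was queued. When the ring is nearly full the queue
 * is stopped; without tx napi a delayed interrupt is armed so that the
 * queue can be woken once descriptors have been reclaimed.
 */
static void model_after_xmit(struct ring_model *r, bool use_napi)
{
	if (r->num_free < EX_STOP_THRESHOLD) {
		r->queue_stopped = true;
		if (!use_napi)
			r->delayed_irq_armed = true;
	}
}
```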


Re: [PATCH net-next] virtio_net: ethtool tx napi configuration

2018-09-13 Thread Willem de Bruijn
On Thu, Sep 13, 2018 at 11:27 PM Jason Wang  wrote:
>
>
>
> On 2018-09-13 22:58, Willem de Bruijn wrote:
> > On Thu, Sep 13, 2018 at 5:02 AM Jason Wang  wrote:
> >>
> >>
> >> On 2018-09-13 07:27, Willem de Bruijn wrote:
> >>> On Wed, Sep 12, 2018 at 3:11 PM Willem de Bruijn wrote:
> >>>> On Wed, Sep 12, 2018 at 2:16 PM Florian Fainelli wrote:
> >>>>>
> >>>>> On 9/12/2018 11:07 AM, Willem de Bruijn wrote:
> >>>>>> On Wed, Sep 12, 2018 at 1:42 PM Florian Fainelli wrote:
> >>>>>>>
> >>>>>>> On 9/9/2018 3:44 PM, Willem de Bruijn wrote:
> >>>>>>>> From: Willem de Bruijn
> >>>>>>>>
> >>>>>>>> Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
> >>>>>>>> Interrupt moderation is currently not supported, so these accept and
> >>>>>>>> display the default settings of 0 usec and 1 frame.
> >>>>>>>>
> >>>>>>>> Toggle tx napi through a bit in tx-frames. So as to not interfere
> >>>>>>>> with possible future interrupt moderation, use bit 10, well outside
> >>>>>>>> the reasonable range of real interrupt moderation values.
> >>>>>>>>
> >>>>>>>> Changes are not atomic. The tx IRQ, napi BH and transmit path must
> >>>>>>>> be quiesced when switching modes. Only allow changing this setting
> >>>>>>>> when the device is down.
> >>>>>>> Humm, would not a private ethtool flag to switch TX NAPI on/off be more
> >>>>>>> appropriate rather than use the coalescing configuration API here?
> >>>>>> What do you mean by private ethtool flag? A new field in ethtool
> >>>>>> --features (-k)?
> >>>>> I meant using ethtool_drvinfo::n_priv_flags, ETH_SS_PRIV_FLAGS and then
> >>>>> ETHTOOL_GFPFLAGS and ETHTOOL_SPFLAGS to control the toggling of that
> >>>>> private flag. mlx5 has a number of private flags for instance.
> >>>> Interesting, thanks! I was not at all aware of those ethtool flags.
> >>>> Am having a look. It definitely looks promising.
> >>> Okay, I made that change. That is indeed much cleaner, thanks.
> >>> Let me send the patch, initially as RFC.
> >>>
> >>> I've observed one issue where if we toggle the flag before bringing
> >>> up the device, it hits a kernel BUG at include/linux/netdevice.h:515
> >>>
> >>>   BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
> >> This reminds me that we need to check netif_running() before trying to
> >> enable and disable tx napi in ethtool_set_coalesce().
> > The first iteration of my patch checked IFF_UP and effectively
> > only allowed the change when not running. What do you mean
> > by need to check?
>
> I mean if device is not up, there's no need to toggle napi state and tx
> lock.
>
> >
> > And to respond to the other follow-up notes at once:
> >
> >> Consider we may have interrupt moderation in the future, I tend to use
> >> set_coalesce. Otherwise we may need two steps to enable moderation:
> >>
> >> - tx-napi on
> >> - set_coalesce
> > FWIW, I don't care strongly whether we do this through coalesce or priv_flags.
>
> Ok.

Since you prefer coalesce, let's go with that (and a revision of your
latest patch).

>
> >>> + if (!napi_weight)
> >>> + virtqueue_enable_cb(vi->sq[i].vq);
> >> I don't get why we need to disable enable cb here.
> > To avoid entering no-napi mode with too few descriptors to
> > make progress and no way to get out of that state. This is a
> > pretty crude attempt at handling that, admittedly.
>
> But in this case, we will call enable_cb_delayed() and we will finally
> get an interrupt?

Right. It's a bit of a roundabout way to ensure that
netif_tx_wake_queue and thus eventually free_old_xmit_skbs are called.
It might make more sense to just wake the device without going through
an interrupt.
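
For the coalesce route, the tx-frames encoding from the patch description
(default of 1 frame, tx napi signalled in bit 10) could look roughly like
this. It is a sketch under those stated assumptions, and the EX_*/ex_* names
are hypothetical, not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define EX_TX_NAPI_BIT       (1u << 10) /* bit 10 of tx-frames */
#define EX_TX_FRAMES_DEFAULT 1u         /* only supported moderation value */

/* Value to report for tx-frames, given the current tx napi mode. */
static uint32_t ex_encode_tx_frames(bool tx_napi)
{
	return EX_TX_FRAMES_DEFAULT | (tx_napi ? EX_TX_NAPI_BIT : 0u);
}

/* Parse a requested tx-frames value; false means "not supported". */
static bool ex_decode_tx_frames(uint32_t tx_frames, bool *tx_napi)
{
	/* Apart from the napi bit, only the default of 1 frame is accepted. */
	if ((tx_frames & ~EX_TX_NAPI_BIT) != EX_TX_FRAMES_DEFAULT)
		return false;
	*tx_napi = (tx_frames & EX_TX_NAPI_BIT) != 0;
	return true;
}
```

With this encoding a tx-frames value of 1 means no-napi and 1025 means napi;
any other value is rejected rather than silently rounded.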


Re: [PATCH net-next] virtio_net: ethtool tx napi configuration

2018-09-13 Thread Jason Wang




On 2018-09-13 22:58, Willem de Bruijn wrote:

On Thu, Sep 13, 2018 at 5:02 AM Jason Wang  wrote:



On 2018-09-13 07:27, Willem de Bruijn wrote:

On Wed, Sep 12, 2018 at 3:11 PM Willem de Bruijn
 wrote:

On Wed, Sep 12, 2018 at 2:16 PM Florian Fainelli  wrote:


On 9/12/2018 11:07 AM, Willem de Bruijn wrote:

On Wed, Sep 12, 2018 at 1:42 PM Florian Fainelli  wrote:


On 9/9/2018 3:44 PM, Willem de Bruijn wrote:

From: Willem de Bruijn 

Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
Interrupt moderation is currently not supported, so these accept and
display the default settings of 0 usec and 1 frame.

Toggle tx napi through a bit in tx-frames. So as to not interfere
with possible future interrupt moderation, use bit 10, well outside
the reasonable range of real interrupt moderation values.

Changes are not atomic. The tx IRQ, napi BH and transmit path must
be quiesced when switching modes. Only allow changing this setting
when the device is down.

Humm, would not a private ethtool flag to switch TX NAPI on/off be more
appropriate rather than use the coalescing configuration API here?

What do you mean by private ethtool flag? A new field in ethtool
--features (-k)?

I meant using ethtool_drvinfo::n_priv_flags, ETH_SS_PRIV_FLAGS and then
ETHTOOL_GFPFLAGS and ETHTOOL_SPFLAGS to control the toggling of that
private flag. mlx5 has a number of private flags for instance.

Interesting, thanks! I was not at all aware of those ethtool flags.
Am having a look. It definitely looks promising.

Okay, I made that change. That is indeed much cleaner, thanks.
Let me send the patch, initially as RFC.

I've observed one issue where if we toggle the flag before bringing
up the device, it hits a kernel BUG at include/linux/netdevice.h:515

  BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));

This reminds me that we need to check netif_running() before trying to
enable and disable tx napi in ethtool_set_coalesce().

The first iteration of my patch checked IFF_UP and effectively
only allowed the change when not running. What do you mean
by need to check?


I mean if device is not up, there's no need to toggle napi state and tx 
lock.




And to respond to the other follow-up notes at once:


Consider we may have interrupt moderation in the future, I tend to use
set_coalesce. Otherwise we may need two steps to enable moderation:

- tx-napi on
- set_coalesce

FWIW, I don't care strongly whether we do this through coalesce or priv_flags.


Ok.


+ if (!napi_weight)
+ virtqueue_enable_cb(vi->sq[i].vq);

I don't get why we need to disable enable cb here.

To avoid entering no-napi mode with too few descriptors to
make progress and no way to get out of that state. This is a
pretty crude attempt at handling that, admittedly.


But in this case, we will call enable_cb_delayed() and we will finally
get an interrupt?


Thanks
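
The netif_running() point can be sketched as follows. This is a userspace toy
with hypothetical names (struct dev_model, model_set_tx_napi_weight); the
counter only marks where the napi disable/enable and tx lock would sit in a
real driver.

```c
#include <stdbool.h>

struct dev_model {
	bool running;         /* stands in for netif_running(dev) */
	int napi_weight;      /* 0 means no-napi */
	unsigned int toggles; /* counts quiesce/restart cycles performed */
};

/* Change the tx napi weight; only a running device needs quiescing. */
static void model_set_tx_napi_weight(struct dev_model *d, int weight)
{
	if (d->running)
		d->toggles++; /* napi disable, tx lock, napi enable would go here */
	d->napi_weight = weight;
}
```

When the device is down the new weight is simply recorded and takes effect
the next time the device is brought up.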


Re: unexpected GRO/veth behavior

2018-09-13 Thread Toshiaki Makita
On 2018/09/11 20:07, Toshiaki Makita wrote:
> On 2018/09/11 19:27, Eric Dumazet wrote:
> ...
>> Fix would probably be :
>>
>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index 8d679c8b7f25c753d77cfb8821d9d2528c9c9048..96bd94480942b469403abf017f9f9d5be1e23ef5 100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -602,9 +602,10 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget, unsigned int *xdp_xmit)
>>                         skb = veth_xdp_rcv_skb(rq, ptr, xdp_xmit);
>>                 }
>>  
>> -               if (skb)
>> +               if (skb) {
>> +                       skb_orphan(skb);
>>                         napi_gro_receive(&rq->xdp_napi, skb);
>> -
>> +               }
>>                 done++;
>>         }
> 
> Considering commit 9c4c3252 ("skbuff: preserve sock reference when
> scrubbing the skb.") I'm not sure if we should unconditionally orphan
> the skb here.
> I was thinking I should call netif_receive_skb() for such packets
> instead of napi_gro_receive().

I tested TCP throughput within localhost with XDP enabled (with
skb_orphan() fix).

GRO off: 4.7 Gbps
GRO on : 6.7 Gbps

Since the difference is not small, I'm making a patch which orphans
the skb as Eric suggested (but in veth_xdp_rcv_skb() instead).

Thanks!

-- 
Toshiaki Makita
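
To illustrate why orphaning first keeps the socket accounting sane, here is a
toy userspace model. It is only a sketch: the ex_* types and helpers are
hypothetical, loosely named after skb_set_owner_w and skb_orphan, and the
assumption that the GRO-like step grows truesize is mine, made only to show
that the socket counter is no longer affected.

```c
#include <stddef.h>

struct ex_sock {
	long wmem_alloc; /* bytes charged to the socket's send side */
};

struct ex_skb {
	struct ex_sock *sk; /* owning socket, or NULL once orphaned */
	long truesize;
};

/* Charge the skb to a socket. */
static void ex_set_owner_w(struct ex_skb *skb, struct ex_sock *sk)
{
	skb->sk = sk;
	sk->wmem_alloc += skb->truesize;
}

/* Drop socket ownership and give the charge back. */
static void ex_orphan(struct ex_skb *skb)
{
	if (skb->sk) {
		skb->sk->wmem_alloc -= skb->truesize;
		skb->sk = NULL;
	}
}

/* Receive path: orphan first, then let a GRO-like step resize the skb. */
static void ex_gro_like_receive(struct ex_skb *skb, long merged_truesize)
{
	ex_orphan(skb);
	skb->truesize += merged_truesize; /* no socket is charged for this */
}
```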



Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)

2018-09-13 Thread maowenan



On 2018/9/13 20:44, Eric Dumazet wrote:
> On Thu, Sep 13, 2018 at 5:32 AM Greg KH  wrote:
>>
>> On Thu, Aug 16, 2018 at 05:24:09PM +0200, Greg KH wrote:
>>> On Thu, Aug 16, 2018 at 02:33:56PM +0200, Michal Kubecek wrote:
>>>> On Thu, Aug 16, 2018 at 08:05:50PM +0800, maowenan wrote:
>>>>> On 2018/8/16 19:39, Michal Kubecek wrote:
>>>>>>
>>>>>> I suspect you may be doing something wrong with your tests. I checked
>>>>>> the segmentsmack testcase and the CPU utilization on receiving side
>>>>>> (with sending 10 times as many packets as default) went down from ~100%
>>>>>> to ~3% even when comparing what is in stable 4.4 now against older 4.4
>>>>>> kernel.
>>>>>
>>>>> There seems to be no obvious problem when you send packets with the
>>>>> default parameters in the SegmentSmack PoC, which also depends heavily
>>>>> on your server's hardware configuration. Please try the parameters
>>>>> below to form OFO packets
>>>>
>>>> I did and even with these (questionable, see below) changes, I did not
>>>> get more than 10% (of one core) by receiving ksoftirqd.
>>>>
>>>>>   for (i = 0; i < 1024; i++)  // 128->1024
>>>> ...
>>>>>   usleep(10*1000); // Adjust this and packet count to match the target!, sleep 100ms->10ms
>>>>
>>>> The comment in the testcase source suggests to do _one_ of these two
>>>> changes so that you generate 10 times as many packets as the original
>>>> testcase. You did both so that you end up sending 102400 packets per
>>>> second. With 55 byte long packets, this kind of attack requires at least
>>>> 5.5 MB/s (44 Mb/s) of throughput. This is no longer a "low packet rate
>>>> DoS", I'm afraid.
>>>>
>>>> Anyway, even at this rate, I only get ~10% of one core (Intel E5-2697).
>>>>
>>>> What I can see, though, is that with current stable 4.4 code, modified
>>>> testcase which sends something like
>>>>
>>>>   2:3, 3:4, ..., 3001:3002, 3003:3004, 3004:3005, ... 6001:6002, ...
>>>>
>>>> I quickly eat 6 MB of memory for receive queue of one socket while
>>>> earlier 4.4 kernels only take 200-300 KB. I didn't test latest 4.4 with
>>>> Takashi's follow-up yet but I'm pretty sure it will help while
>>>> preserving nice performance when using the original segmentsmack
>>>> testcase (with increased packet ratio).
>>>
>>> Ok, for now I've applied Takashi's fix to the 4.4 stable queue and will
>>> push out a new 4.4-rc later tonight.  Can everyone standardize on that
>>> and test and let me know if it does, or does not, fix the reported
>>> issues?
>>>
>>> If not, we can go from there and evaluate this much larger patch series.
>>> But let's try the simple thing first.
>>
>> So, is the issue still present on the latest 4.4 release?  Has anyone
>> tested it?  If not, I'm more than willing to look at backported patches,
>> but I want to ensure that they really are needed here.
>>
>> thanks,
> 
> Honestly, TCP stack without rb-tree for the OOO queue is vulnerable,
> even with non malicious sender,
> but with big enough TCP receive window and a not favorable network.
> 
> So a malicious peer can definitely send packets needed to make TCP
> stack behave in O(N), which is pretty bad if N is big...
> 
> 9f5afeae51526b3ad7b7cb21ee8b145ce6ea7a7a ("tcp: use an RB tree for ooo
> receive queue")
> was proven to be almost bug free [1], and should be backported if possible.
> 
> [1] bug fixed :
> 76f0dcbb5ae1a7c3dbeec13dd98233b8e6b0b32a tcp: fix a stale ooo_last_skb
> after a replace

Thank you, Eric, for the suggestion; I will do some work to backport them.
> 
> .
> 
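
For reference, the packet-rate figures quoted above follow from simple
arithmetic on the two modified parameters (1024 packets per loop, 10 ms
sleep) and the 55-byte packet length. A small sketch, with helper names
chosen here for illustration and the sleep assumed to dominate the loop time:

```c
/* Packets per second for `count` packets sent every `sleep_us` microseconds. */
static unsigned long ex_packets_per_sec(unsigned long count,
					unsigned long sleep_us)
{
	return count * 1000000UL / sleep_us;
}

/* Throughput in bytes per second for a given rate and packet length. */
static unsigned long ex_bytes_per_sec(unsigned long pps, unsigned long pkt_len)
{
	return pps * pkt_len;
}
```

ex_packets_per_sec(1024, 10000) gives 102400 packets per second, and at 55
bytes each that is 5632000 bytes per second, i.e. about 5.6 MB/s or 45 Mb/s,
in line with the "at least 5.5 MB/s (44 Mb/s)" estimate. The unmodified
values (128 packets, 100 ms) give 1280 packets per second.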



[iproute2,RFC PATCH] tc: range: Introduce TC range classifier

2018-09-13 Thread Amritha Nambiar
Range classifier is introduced to support filters based
on ranges. Only port-range filters are supported currently.
This can be combined with flower classifier to support a
combination of port-ranges and other parameters based
on existing fields supported by cls_flower.
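
As an illustration of the MIN-MAX argument form used in the examples below,
parsing could look roughly like this. It is a standalone sketch, not the code
in tc/f_range.c; ex_parse_port_range is a hypothetical helper, and rejecting
MIN > MAX is an assumption.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Parse "MIN-MAX" into host-order ports.
 * Returns 0 on success, -1 on malformed or out-of-range input.
 */
static int ex_parse_port_range(const char *str, uint16_t *min, uint16_t *max)
{
	unsigned int lo, hi;
	char extra;

	/* Exactly two numbers separated by '-', with no trailing characters. */
	if (sscanf(str, "%u-%u%c", &lo, &hi, &extra) != 2)
		return -1;
	if (lo > 65535 || hi > 65535 || lo > hi)
		return -1;
	*min = (uint16_t)lo;
	*max = (uint16_t)hi;
	return 0;
}
```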

Example:
1. Match on a port range:
---
$ tc filter add dev enp4s0 protocol ip parent ffff: prio 2 range\
ip_proto tcp dst_port 1-15 skip_hw action drop

$ tc -s filter show dev enp4s0 parent ffff:
filter protocol ip pref 2 range chain 0
filter protocol ip pref 2 range chain 0 handle 0x1
  eth_type ipv4
  ip_proto tcp
  dst_port_min 1
  dst_port_max 15
  skip_hw
  not_in_hw
action order 1: gact action drop
 random type none pass val 0
 index 1 ref 1 bind 1 installed 34 sec used 2 sec
Action statistics:
Sent 1380 bytes 30 pkt (dropped 30, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

2. Match on IP address and port range:
--
$ tc filter add dev enp4s0 protocol ip parent ffff: prio 2 flower\
  dst_ip 192.168.1.1 skip_hw action goto chain 11

$ tc filter add dev enp4s0 protocol ip parent ffff: prio 2 chain 11\
  range ip_proto tcp dst_port 1-15 action drop

$ tc -s filter show dev enp4s0 parent ffff:
filter protocol ip pref 2 flower chain 0
filter protocol ip pref 2 flower chain 0 handle 0x1
  eth_type ipv4
  dst_ip 192.168.1.1
  skip_hw
  not_in_hw
action order 1: gact action goto chain 11
 random type none pass val 0
 index 1 ref 1 bind 1 installed 1426 sec used 2 sec
Action statistics:
Sent 460 bytes 10 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

filter protocol ip pref 2 range chain 11
filter protocol ip pref 2 range chain 11 handle 0x1
  eth_type ipv4
  ip_proto tcp
  dst_port_min 1
  dst_port_max 15
  not_in_hw
action order 1: gact action drop
 random type none pass val 0
 index 2 ref 1 bind 1 installed 1310 sec used 2 sec
Action statistics:
Sent 460 bytes 10 pkt (dropped 10, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

Signed-off-by: Amritha Nambiar 
---
 include/uapi/linux/pkt_cls.h |   19 ++
 tc/Makefile  |1 
 tc/f_range.c |  369 ++
 3 files changed, 389 insertions(+)
 create mode 100644 tc/f_range.c

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index be382fb..8ef3a5a 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -379,6 +379,25 @@ enum {
 
 #define TCA_BPF_MAX (__TCA_BPF_MAX - 1)
 
+/* RANGE classifier */
+
+enum {
+   TCA_RANGE_UNSPEC,
+   TCA_RANGE_CLASSID,  /* u32 */
+   TCA_RANGE_INDEV,
+   TCA_RANGE_ACT,
+   TCA_RANGE_KEY_ETH_TYPE, /* be16 */
+   TCA_RANGE_KEY_IP_PROTO, /* u8 */
+   TCA_RANGE_KEY_PORT_SRC_MIN, /* be16 */
+   TCA_RANGE_KEY_PORT_SRC_MAX, /* be16 */
+   TCA_RANGE_KEY_PORT_DST_MIN, /* be16 */
+   TCA_RANGE_KEY_PORT_DST_MAX, /* be16 */
+   TCA_RANGE_FLAGS,/* u32 */
+   __TCA_RANGE_MAX,
+};
+
+#define TCA_RANGE_MAX (__TCA_RANGE_MAX - 1)
+
 /* Flower classifier */
 
 enum {
diff --git a/tc/Makefile b/tc/Makefile
index 5a1a7ff..155cabe 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -29,6 +29,7 @@ TCMODULES += f_bpf.o
 TCMODULES += f_flow.o
 TCMODULES += f_cgroup.o
 TCMODULES += f_flower.o
+TCMODULES += f_range.o
 TCMODULES += q_dsmark.o
 TCMODULES += q_gred.o
 TCMODULES += f_tcindex.o
diff --git a/tc/f_range.c b/tc/f_range.c
new file mode 100644
index 0000000..388b275
--- /dev/null
+++ b/tc/f_range.c
@@ -0,0 +1,369 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * f_range.c   Range Classifier
+ *
+ * This program is free software; you can distribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:Amritha Nambiar 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "utils.h"
+#include "tc_util.h"
+
+enum range_type {
+   RANGE_PORT_SRC,
+   RANGE_PORT_DST
+};
+
+struct range_values {
+   __be16 min_port_type;
+   __be16 max_port_type;
+};
+
+static void explain(void)
+{
+   fprintf(stderr, "Usage: ... range [ MATCH-LIST ]\n");
+   fprintf(stderr, " [skip_sw | skip_hw]\n");
+   fprintf(stderr, " [ action ACTION_SPEC ] [ classid CLASSID ]\n");
+   fprintf(stderr, "\n");
+   fprintf(stderr, "Where: SELECTOR := SAMPLE SAMPLE ...\n");
+   fprintf(stderr, "   FILTERID := X:Y:Z\n");
+   fprintf(stderr, "   ACTION_SPEC := ... look at individual actions\n");
+   fprintf(stderr, "\nNOTE: CLASSID is parsed as hexadecimal 

[net-next, RFC PATCH] net: sched: cls_range: Introduce Range classifier

2018-09-13 Thread Amritha Nambiar
This patch introduces a range classifier to support filtering based
on ranges. Only port-range filters are supported currently. This can
be combined with flower classifier to support filters that are a
combination of port-ranges and other parameters based on existing
fields supported by cls_flower.

Example:
1. Match on a port range:
---
$ tc filter add dev enp4s0 protocol ip parent : prio 2 range\
ip_proto tcp dst_port 1-15 skip_hw action drop

$ tc -s filter show dev enp4s0 parent :
filter protocol ip pref 2 range chain 0
filter protocol ip pref 2 range chain 0 handle 0x1
  eth_type ipv4
  ip_proto tcp
  dst_port_min 1
  dst_port_max 15
  skip_hw
  not_in_hw
action order 1: gact action drop
 random type none pass val 0
 index 1 ref 1 bind 1 installed 34 sec used 2 sec
Action statistics:
Sent 1380 bytes 30 pkt (dropped 30, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

2. Match on IP address and port range:
--
$ tc filter add dev enp4s0 protocol ip parent : prio 2 flower\
  dst_ip 192.168.1.1 skip_hw action goto chain 11

$ tc filter add dev enp4s0 protocol ip parent : prio 2 chain 11\
  range ip_proto tcp dst_port 1-15 action drop

$ tc -s filter show dev enp4s0 parent :
filter protocol ip pref 2 flower chain 0
filter protocol ip pref 2 flower chain 0 handle 0x1
  eth_type ipv4
  dst_ip 192.168.1.1
  skip_hw
  not_in_hw
action order 1: gact action goto chain 11
 random type none pass val 0
 index 1 ref 1 bind 1 installed 1426 sec used 2 sec
Action statistics:
Sent 460 bytes 10 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

filter protocol ip pref 2 range chain 11
filter protocol ip pref 2 range chain 11 handle 0x1
  eth_type ipv4
  ip_proto tcp
  dst_port_min 1
  dst_port_max 15
  not_in_hw
action order 1: gact action drop
 random type none pass val 0
 index 2 ref 1 bind 1 installed 1310 sec used 2 sec
Action statistics:
Sent 460 bytes 10 pkt (dropped 10, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
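
The examples above pass the range as `dst_port 1-15`. The user-space parser is not visible in this thread, so the following is only a minimal sketch of how such a `MIN-MAX` argument could be split and validated. The name `parse_port_range` and the decision to reject `MIN > MAX` are assumptions made for illustration, not taken from the patch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse "MIN-MAX" (for example "1-15") into
 * host-order bounds. Returns 0 on success and -1 on malformed input.
 */
static int parse_port_range(const char *str, uint16_t *min, uint16_t *max)
{
	unsigned long lo, hi;
	char *end;

	assert(str && min && max);

	/* Insist on a leading digit so that signs and blanks are rejected. */
	if (!isdigit((unsigned char)str[0]))
		return -1;
	lo = strtoul(str, &end, 10);

	/* Exactly one '-' must separate the two numbers. */
	if (*end != '-' || !isdigit((unsigned char)end[1]))
		return -1;
	hi = strtoul(end + 1, &end, 10);

	/* No trailing junk, both values fit in 16 bits, and MIN <= MAX. */
	if (*end != '\0' || lo > 65535 || hi > 65535 || lo > hi)
		return -1;

	*min = (uint16_t)lo;
	*max = (uint16_t)hi;
	return 0;
}
```

A caller would convert both bounds with htons() before placing them in the be16 netlink attributes.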

Signed-off-by: Amritha Nambiar 
---
 include/uapi/linux/pkt_cls.h |   19 +
 net/sched/Kconfig            |   10 +
 net/sched/Makefile           |    1 
 net/sched/cls_range.c        |  725 ++
 4 files changed, 755 insertions(+)
 create mode 100644 net/sched/cls_range.c

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 401d0c1..b2b68e6 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -379,6 +379,25 @@ enum {
 
 #define TCA_BPF_MAX (__TCA_BPF_MAX - 1)
 
+/* RANGE classifier */
+
+enum {
+   TCA_RANGE_UNSPEC,
+   TCA_RANGE_CLASSID,  /* u32 */
+   TCA_RANGE_INDEV,
+   TCA_RANGE_ACT,
+   TCA_RANGE_KEY_ETH_TYPE, /* be16 */
+   TCA_RANGE_KEY_IP_PROTO, /* u8 */
+   TCA_RANGE_KEY_PORT_SRC_MIN, /* be16 */
+   TCA_RANGE_KEY_PORT_SRC_MAX, /* be16 */
+   TCA_RANGE_KEY_PORT_DST_MIN, /* be16 */
+   TCA_RANGE_KEY_PORT_DST_MAX, /* be16 */
+   TCA_RANGE_FLAGS,/* u32 */
+   __TCA_RANGE_MAX,
+};
+
+#define TCA_RANGE_MAX (__TCA_RANGE_MAX - 1)
+
 /* Flower classifier */
 
 enum {
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index e957413..f68770d 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -585,6 +585,16 @@ config NET_CLS_FLOWER
  To compile this code as a module, choose M here: the module will
  be called cls_flower.
 
+config NET_CLS_RANGE
+   tristate "Range classifier"
+   select NET_CLS
+   help
+ If you say Y here, you will be able to classify packets based on
+ ranges with minimum and maximum values.
+
+ To compile this code as a module, choose M here: the module will
+ be called cls_range.
+
 config NET_CLS_MATCHALL
tristate "Match-all classifier"
select NET_CLS
diff --git a/net/sched/Makefile b/net/sched/Makefile
index f0403f4..d1f57a8 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -69,6 +69,7 @@ obj-$(CONFIG_NET_CLS_FLOW)+= cls_flow.o
 obj-$(CONFIG_NET_CLS_CGROUP)   += cls_cgroup.o
 obj-$(CONFIG_NET_CLS_BPF)  += cls_bpf.o
 obj-$(CONFIG_NET_CLS_FLOWER)   += cls_flower.o
+obj-$(CONFIG_NET_CLS_RANGE)+= cls_range.o
 obj-$(CONFIG_NET_CLS_MATCHALL) += cls_matchall.o
 obj-$(CONFIG_NET_EMATCH)   += ematch.o
 obj-$(CONFIG_NET_EMATCH_CMP)   += em_cmp.o
diff --git a/net/sched/cls_range.c b/net/sched/cls_range.c
new file mode 100644
index 000..2ed53c7
--- /dev/null
+++ b/net/sched/cls_range.c
@@ -0,0 +1,725 @@
+// SPDX-License-Identifier: GPL-2.0
+/* net/sched/cls_range.c   Range classifier
+ *
+ * Copyright (c) 2018 Amritha Nambiar 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU 

[net-next,RFC PATCH] Introduce TC Range classifier

2018-09-13 Thread Amritha Nambiar
This patch introduces a TC range classifier to support filtering based
on ranges. Only port-range filters are supported currently. This can
be combined with flower classifier to support filters that are a
combination of port-ranges and other parameters based on existing
fields supported by cls_flower. The 'goto chain' action can be used to
combine the flower and range filter.
The filter precedence is decided based on the 'prio' value.
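
As a rough illustration of the comparison a port-range match has to perform (this is not the code from cls_range.c, which is not reproduced in full here): the bounds travel as be16 values, so they need converting to host order before they are compared numerically. `struct port_range` and `port_in_range` are hypothetical names used only for this sketch.

```c
#include <arpa/inet.h>	/* htons(), ntohs() */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical key: both bounds are kept in network byte order (be16). */
struct port_range {
	uint16_t min_be;
	uint16_t max_be;
};

/* True when the packet's port, given in network order, lies in [min, max]. */
static bool port_in_range(const struct port_range *r, uint16_t port_be)
{
	uint16_t port;

	assert(r);
	port = ntohs(port_be);

	/* Compare in host order; raw be16 values do not order numerically. */
	return port >= ntohs(r->min_be) && port <= ntohs(r->max_be);
}
```

With bounds of 1 and 15, port 256 is a useful check: on a little-endian host its raw be16 representation reads as 1, so a comparison that skipped ntohs() would wrongly accept it.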

Example:
1. Match on a port range:
---
$ tc filter add dev enp4s0 protocol ip parent : prio 2 range\
ip_proto tcp dst_port 1-15 skip_hw action drop

$ tc -s filter show dev enp4s0 parent :
filter protocol ip pref 2 range chain 0
filter protocol ip pref 2 range chain 0 handle 0x1
  eth_type ipv4
  ip_proto tcp
  dst_port_min 1
  dst_port_max 15
  skip_hw
  not_in_hw
action order 1: gact action drop
 random type none pass val 0
 index 1 ref 1 bind 1 installed 34 sec used 2 sec
Action statistics:
Sent 1380 bytes 30 pkt (dropped 30, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

2. Match on IP address and port range:
--
$ tc filter add dev enp4s0 protocol ip parent : prio 2 flower\
  dst_ip 192.168.1.1 skip_hw action goto chain 11

$ tc filter add dev enp4s0 protocol ip parent : prio 2 chain 11\
  range ip_proto tcp dst_port 1-15 action drop

$ tc -s filter show dev enp4s0 parent :
filter protocol ip pref 2 flower chain 0
filter protocol ip pref 2 flower chain 0 handle 0x1
  eth_type ipv4
  dst_ip 192.168.1.1
  skip_hw
  not_in_hw
action order 1: gact action goto chain 11
 random type none pass val 0
 index 1 ref 1 bind 1 installed 1426 sec used 2 sec
Action statistics:
Sent 460 bytes 10 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

filter protocol ip pref 2 range chain 11
filter protocol ip pref 2 range chain 11 handle 0x1
  eth_type ipv4
  ip_proto tcp
  dst_port_min 1
  dst_port_max 15
  not_in_hw
action order 1: gact action drop
 random type none pass val 0
 index 2 ref 1 bind 1 installed 1310 sec used 2 sec
Action statistics:
Sent 460 bytes 10 pkt (dropped 10, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
---

Amritha Nambiar (1):
  net: sched: cls_range: Introduce Range classifier


 include/uapi/linux/pkt_cls.h |   19 +
 net/sched/Kconfig            |   10 +
 net/sched/Makefile           |    1 
 net/sched/cls_range.c        |  725 ++
 4 files changed, 755 insertions(+)
 create mode 100644 net/sched/cls_range.c

--


Re: [PATCH net v2] bonding: pass link-local packets to bonding master also.

2018-09-13 Thread महेश बंडेवार
On Thu, Sep 13, 2018 at 4:00 PM, Michal Soltys  wrote:
> On 2018-07-19 18:20, Michal Soltys wrote:
>> On 07/19/2018 01:41 AM, Mahesh Bandewar wrote:
>>> From: Mahesh Bandewar 
>>>
>>> Commit b89f04c61efe ("bonding: deliver link-local packets with
>>> skb->dev set to link that packets arrived on") changed the behavior
>>> of how link-local-multicast packets are processed. The change in
>>> the behavior broke some legacy use cases where these packets are
>>> expected to arrive on bonding master device also.
>>>
>>> This patch passes the packet to the stack with the link it arrived
>>> on as well as passes to the bonding-master device to preserve the
>>> legacy use case.
>>>
>>> Fixes: b89f04c61efe ("bonding: deliver link-local packets with
>>> skb->dev set to link that packets arrived on")
>>> Reported-by: Michal Soltys 
>>> Signed-off-by: Mahesh Bandewar 
>>> ---
>>> v2: Added Fixes tag.
>>> v1: Initial patch.
>>>   drivers/net/bonding/bond_main.c | 17 +++--
>>>   1 file changed, 15 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/bonding/bond_main.c
>>> b/drivers/net/bonding/bond_main.c
>>> index 9a2ea3c1f949..1d3b7d8448f2 100644
>>> --- a/drivers/net/bonding/bond_main.c
>>> +++ b/drivers/net/bonding/bond_main.c
>>> @@ -1177,9 +1177,22 @@ static rx_handler_result_t
>>> bond_handle_frame(struct sk_buff **pskb)
>>>   }
>>>   }
>>>   -/* don't change skb->dev for link-local packets */
>>> -if (is_link_local_ether_addr(eth_hdr(skb)->h_dest))
>>> +/* Link-local multicast packets should be passed to the
>>> + * stack on the link they arrive as well as pass them to the
>>> + * bond-master device. These packets are mostly usable when
>>> + * stack receives it with the link on which they arrive
>>> + * (e.g. LLDP) but there may be some legacy behavior that
>>> + * expects these packets to appear on bonding master too.
>>
>> I'd really change the comment from:
>>
>> "These packets are mostly usable when stack receives it with the link on
>> which they arrive (e.g. LLDP) but there may be some legacy behavior that
>> expects these packets to appear on bonding master too."
>>
>> to something like:
>>
>> "These packets are mostly usable when stack receives it with the link on
>> which they arrive, but they also must be available on aggregations. Some
>> of the use cases include (but are not limited to): LLDP agents that must
>> be able to operate both on enslaved interfaces as well as on bonds
>> themselves; linux bridges that must be able to process/pass BPDUs from
>> attached bonds when any kind of stp version is enabled on the network."
>>
>> It's a bit longer, but clarifies the reasons more precisely (without
>> going too deep into features like group_fwd_mask).
>>
>
> Anyway, any chance for that patch to get merged ? It would be great to
> get the correct functionality back asap. As for the comment, I'll submit
> a trivial patch expanding/clarifying it later (or I can resubmit
> adjusted v3 if it's ok with Mahesh).
Hmm, didn't notice that it wasn't merged but somehow it fell through
the cracks as it needed my attention earlier. I'll resubmit.


Re: [PATCH net v2] bonding: pass link-local packets to bonding master also.

2018-09-13 Thread Michal Soltys
On 2018-07-19 18:20, Michal Soltys wrote:
> On 07/19/2018 01:41 AM, Mahesh Bandewar wrote:
>> From: Mahesh Bandewar 
>>
>> Commit b89f04c61efe ("bonding: deliver link-local packets with
>> skb->dev set to link that packets arrived on") changed the behavior
>> of how link-local-multicast packets are processed. The change in
>> the behavior broke some legacy use cases where these packets are
>> expected to arrive on bonding master device also.
>>
>> This patch passes the packet to the stack with the link it arrived
>> on as well as passes to the bonding-master device to preserve the
>> legacy use case.
>>
>> Fixes: b89f04c61efe ("bonding: deliver link-local packets with
>> skb->dev set to link that packets arrived on")
>> Reported-by: Michal Soltys 
>> Signed-off-by: Mahesh Bandewar 
>> ---
>> v2: Added Fixes tag.
>> v1: Initial patch.
>>   drivers/net/bonding/bond_main.c | 17 +++--
>>   1 file changed, 15 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/bonding/bond_main.c
>> b/drivers/net/bonding/bond_main.c
>> index 9a2ea3c1f949..1d3b7d8448f2 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -1177,9 +1177,22 @@ static rx_handler_result_t
>> bond_handle_frame(struct sk_buff **pskb)
>>   }
>>   }
>>   -    /* don't change skb->dev for link-local packets */
>> -    if (is_link_local_ether_addr(eth_hdr(skb)->h_dest))
>> +    /* Link-local multicast packets should be passed to the
>> + * stack on the link they arrive as well as pass them to the
>> + * bond-master device. These packets are mostly usable when
>> + * stack receives it with the link on which they arrive
>> + * (e.g. LLDP) but there may be some legacy behavior that
>> + * expects these packets to appear on bonding master too.
> 
> I'd really change the comment from:
> 
> "These packets are mostly usable when stack receives it with the link on
> which they arrive (e.g. LLDP) but there may be some legacy behavior that
> expects these packets to appear on bonding master too."
> 
> to something like:
> 
> "These packets are mostly usable when stack receives it with the link on
> which they arrive, but they also must be available on aggregations. Some
> of the use cases include (but are not limited to): LLDP agents that must
> be able to operate both on enslaved interfaces as well as on bonds
> themselves; linux bridges that must be able to process/pass BPDUs from
> attached bonds when any kind of stp version is enabled on the network."
> 
> It's a bit longer, but clarifies the reasons more precisely (without
> going too deep into features like group_fwd_mask).
> 

Anyway, any chance for that patch to get merged ? It would be great to
get the correct functionality back asap. As for the comment, I'll submit
a trivial patch expanding/clarifying it later (or I can resubmit
adjusted v3 if it's ok with Mahesh).
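
For readers unfamiliar with the check the patch hinges on: link-local here means the reserved 01:80:C2:00:00:0X group of destination MAC addresses, which covers both STP BPDUs and LLDP. The following is a user-space sketch of that test, assuming the same address range the kernel's is_link_local_ether_addr() covers. `link_local_dest` is a hypothetical name and this is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* True for destination addresses in 01:80:C2:00:00:00 .. 01:80:C2:00:00:0F. */
static bool link_local_dest(const uint8_t addr[6])
{
	static const uint8_t base[5] = { 0x01, 0x80, 0xc2, 0x00, 0x00 };

	assert(addr);

	/* The first five octets must match; only the low nibble of the last varies. */
	return memcmp(addr, base, sizeof(base)) == 0 && (addr[5] & 0xf0) == 0;
}
```

Under the patch as described, a frame for which this test is true would be delivered on the slave it arrived on and also handed to the bonding master.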


Re: [PATCH v2] socket: fix struct ifreq size in compat ioctl

2018-09-13 Thread David Miller
From: Johannes Berg 
Date: Thu, 13 Sep 2018 14:40:55 +0200

> From: Johannes Berg 
> 
> As reported by Robert O'Callahan, since Viro's commit to kill
> dev_ifsioc() we attempt to copy too much data in compat mode,
> which may lead to EFAULT when the 32-bit version of struct ifreq
> sits at/near the end of a page boundary, and the next page isn't
> mapped.
> 
> Fix this by passing the appropriate compat/non-compat size to copy
> and using that, as before the dev_ifsioc() removal. This works
> because only the embedded "struct ifmap" has different size, and
> this is only used in SIOCGIFMAP/SIOCSIFMAP which has a different
> handler. All other parts of the union are naturally compatible.
> 
> This fixes https://bugzilla.kernel.org/show_bug.cgi?id=199469.
> 
> Fixes: bf4405737f9f ("kill dev_ifsioc()")
> Reported-by: Robert O'Callahan 
> Signed-off-by: Johannes Berg 

Applied and queued up for -stable, thanks Johannes.
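
To make the size mismatch concrete, here is a small sketch that models the two layouts with fixed-width types. The struct names are hypothetical, and `addr[16]` merely stands in for the sockaddr-sized members of the union; only the embedded map structure differs between the two, as the commit message says. On common ABIs the two requests come out at 32 and 40 bytes, so a copy sized for the 64-bit layout reads 8 bytes past the end of a 32-bit request.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NAME_LEN 16	/* mirrors IFNAMSIZ */

/* Hypothetical models of struct ifmap as seen by 32-bit and 64-bit code. */
struct ifmap32 { uint32_t mem_start, mem_end; uint16_t base_addr; uint8_t irq, dma, port; };
struct ifmap64 { uint64_t mem_start, mem_end; uint16_t base_addr; uint8_t irq, dma, port; };

/* Hypothetical models of struct ifreq: a name followed by a union. */
struct ifreq32 { char name[NAME_LEN]; union { char addr[16]; struct ifmap32 map; } u; };
struct ifreq64 { char name[NAME_LEN]; union { char addr[16]; struct ifmap64 map; } u; };

static_assert(sizeof(struct ifreq32) < sizeof(struct ifreq64),
	      "the compat layout is expected to be the smaller one");

/* Bytes that a native-sized copy would read beyond a compat-sized request. */
static size_t ifreq_overread(void)
{
	return sizeof(struct ifreq64) - sizeof(struct ifreq32);
}
```

If the 32-bit request ends right at a page boundary and the next page is unmapped, those extra bytes are what would trigger the EFAULT described in the report.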


Re: mlx5 driver loading failing on v4.19 / net-next / bpf-next

2018-09-13 Thread Alexei Starovoitov
On Thu, Aug 30, 2018 at 1:35 AM, Tariq Toukan  wrote:
>
>
> On 29/08/2018 6:05 PM, Jesper Dangaard Brouer wrote:
>>
>> Hi Saeed,
>>
>> I'm having issues loading mlx5 driver on v4.19 kernels (tested both
>> net-next and bpf-next), while kernel v4.18 seems to work.  It happens
>> with a Mellanox ConnectX-5 NIC (and also a CX4-Lx but I removed that
>> from the system now).
>>
>
> Hi Jesper,
>
> Thanks for your report!
>
> We are working to analyze and debug the issue.

looks like serious issue to me... while no news in 2 weeks.
any update?


Re: [PATCH] socket: fix struct ifreq size in compat ioctl

2018-09-13 Thread kbuild test robot
Hi Johannes,

I love your patch! Yet something to improve:

[auto build test ERROR on net/master]
[also build test ERROR on v4.19-rc3 next-20180913]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:
https://github.com/0day-ci/linux/commits/Johannes-Berg/socket-fix-struct-ifreq-size-in-compat-ioctl/20180914-061826
config: x86_64-randconfig-x013-201836 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   net/socket.c: In function 'sock_do_ioctl':
>> net/socket.c:972:24: error: invalid application of 'sizeof' to incomplete type 'struct compat_ifreq'
   compat ? sizeof(struct compat_ifreq) :
   ^~
   net/socket.c:978:23: error: invalid application of 'sizeof' to incomplete type 'struct compat_ifreq'
  compat ? sizeof(struct compat_ifreq) :
  ^~

vim +972 net/socket.c

   942  
   943  static long sock_do_ioctl(struct net *net, struct socket *sock,
   944unsigned int cmd, unsigned long arg,
   945bool compat)
   946  {
   947  int err;
   948  void __user *argp = (void __user *)arg;
   949  
   950  err = sock->ops->ioctl(sock, cmd, arg);
   951  
   952  /*
   953   * If this ioctl is unknown try to hand it down
   954   * to the NIC driver.
   955   */
   956  if (err != -ENOIOCTLCMD)
   957  return err;
   958  
   959  if (cmd == SIOCGIFCONF) {
   960  struct ifconf ifc;
   961  if (copy_from_user(&ifc, argp, sizeof(struct ifconf)))
   962  return -EFAULT;
   963  rtnl_lock();
   964  err = dev_ifconf(net, &ifc, sizeof(struct ifreq));
   965  rtnl_unlock();
   966  if (!err && copy_to_user(argp, &ifc, sizeof(struct ifconf)))
   967  err = -EFAULT;
   968  } else {
   969  struct ifreq ifr;
   970  bool need_copyout;
   971  if (copy_from_user(&ifr, argp,
 > 972  compat ? sizeof(struct compat_ifreq) :
   973  sizeof(struct ifreq)))
   974  return -EFAULT;
   975  err = dev_ioctl(net, cmd, &ifr, &need_copyout);
   976  if (!err && need_copyout)
   977  if (copy_to_user(argp, &ifr,
   978  compat ? sizeof(struct compat_ifreq) :
   979  sizeof(struct ifreq)))
   980  return -EFAULT;
   981  }
   982  return err;
   983  }
   984  
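
The error arises because the shared function applies sizeof() to a type that, presumably, this randconfig build never defines. One way around that, in line with the v2 description of passing the size in, is to let each caller compute the size where its own type is visible. The sketch below is a user-space illustration only: `HAVE_COMPAT`, `req_native`, `req_compat`, `copy_request` and the two entry points are all hypothetical names, and memcpy() stands in for copy_from_user().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HAVE_COMPAT 1	/* hypothetical stand-in for CONFIG_COMPAT */

struct req_native { char name[16]; char data[24]; };

/* Shared helper: it is handed a byte count and never names the compat type. */
static int copy_request(struct req_native *dst, const void *src, size_t req_size)
{
	assert(dst && src);
	if (req_size > sizeof(*dst))
		return -1;		/* would overflow the native buffer */
	memset(dst, 0, sizeof(*dst));
	memcpy(dst, src, req_size);	/* stand-in for copy_from_user() */
	return 0;
}

static int native_entry(struct req_native *dst, const void *src)
{
	return copy_request(dst, src, sizeof(struct req_native));
}

#if HAVE_COMPAT
/* Only this block sees the smaller layout, so sizeof() is always valid here. */
struct req_compat { char name[16]; char data[16]; };

static int compat_entry(struct req_native *dst, const void *src)
{
	return copy_request(dst, src, sizeof(struct req_compat));
}
#endif
```

With the switch turned off, the compat type and its entry point disappear together, and the shared helper still compiles.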

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: [RFC PATCH net-next v1 00/14] rename and shrink i40evf

2018-09-13 Thread David Miller
From: Jesse Brandeburg 
Date: Thu, 13 Sep 2018 15:31:30 -0700

> This series contains changes to i40evf so that it becomes a more
> generic virtual function driver for current and future silicon.
> 
> While doing the rename of i40evf to a more generic name of iavf,
> we also put the driver on a severe diet due to how much of the
> code was unneeded or was unused.  The outcome is a lean and mean
> virtual function driver that continues to work on existing 40GbE
> (i40e) virtual devices and prepped for future supported devices,
> like the 100GbE (ice) virtual devices.
> 
> This solves 2 issues we saw coming or were already present, the
> first was constant code duplication happening with i40e/i40evf,
> when much of the duplicate code in the i40evf was not used or was
> not needed.  The second was to remove the future confusion of why
> future VF devices that were not considered "40GbE" only devices
> were supported by i40evf.
> 
> The thought is that iavf will be the virtual function driver for
> all future devices, so it should have a "generic" name to properly
> represent that it is the VF driver for multiple generations of
> devices.

Having a common vf driver for current and future devices is a major
accomplishment and I fully support these changes.

Nice work!

> Known Caveats:
> This may cause some user confusion, especially for Kconfig not
> migrating cleanly to the new CONFIG_IAVF from CONFIG_I40EVF.
> 
> Existing user configurations may have to change, but the module
> alias in patch 1 helps a bit here.

You can deal with this by retaining the existing I40EVF Kconfig
knob and just let it 'select' IAVF.
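
A minimal sketch of what that suggestion might look like in Kconfig. This is an illustration only, not taken from the series: the prompt text is an assumption, and any dependencies the real entries need are left out.

```
# New symbol without a prompt; it is only enabled through 'select'.
config IAVF
	tristate

# Old symbol kept so that existing .config files continue to work.
config I40EVF
	tristate "Intel(R) Ethernet Adaptive Virtual Function support"
	select IAVF
```

With this arrangement an existing CONFIG_I40EVF=m entry would pull in IAVF without the user having to answer a new question.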


[RFC PATCH net-next v1 14/14] intel-ethernet: use correct module license

2018-09-13 Thread Jesse Brandeburg
We recently updated all our SPDX identifiers to correctly
indicate our net/ethernet/intel/* drivers were always released
and intended to be released under GPL v2, but the MODULE_LICENSE
declaration was never updated.

Fix the MODULE_LICENSE to be GPL v2, for all our drivers.

Signed-off-by: Jesse Brandeburg 
---
 drivers/net/ethernet/intel/e100.c                 | 2 +-
 drivers/net/ethernet/intel/e1000/e1000_main.c     | 2 +-
 drivers/net/ethernet/intel/e1000e/netdev.c        | 2 +-
 drivers/net/ethernet/intel/fm10k/fm10k_main.c     | 2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c       | 2 +-
 drivers/net/ethernet/intel/iavf/iavf_main.c       | 4 ++--
 drivers/net/ethernet/intel/ice/ice_main.c         | 2 +-
 drivers/net/ethernet/intel/igb/igb_main.c         | 2 +-
 drivers/net/ethernet/intel/igbvf/netdev.c         | 2 +-
 drivers/net/ethernet/intel/ixgb/ixgb_main.c       | 2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     | 2 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +-
 12 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/intel/e100.c b/drivers/net/ethernet/intel/e100.c
index 27d5f27163d2..7c4b55482f72 100644
--- a/drivers/net/ethernet/intel/e100.c
+++ b/drivers/net/ethernet/intel/e100.c
@@ -164,7 +164,7 @@
 
 MODULE_DESCRIPTION(DRV_DESCRIPTION);
 MODULE_AUTHOR(DRV_COPYRIGHT);
-MODULE_LICENSE("GPL");
+MODULE_LICENSE("GPL v2");
 MODULE_VERSION(DRV_VERSION);
 MODULE_FIRMWARE(FIRMWARE_D101M);
 MODULE_FIRMWARE(FIRMWARE_D101S);
diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index 2110d5f2da19..7e0f1f96a8a1 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -195,7 +195,7 @@ static struct pci_driver e1000_driver = {
 
 MODULE_AUTHOR("Intel Corporation, ");
 MODULE_DESCRIPTION("Intel(R) PRO/1000 Network Driver");
-MODULE_LICENSE("GPL");
+MODULE_LICENSE("GPL v2");
 MODULE_VERSION(DRV_VERSION);
 
 #define DEFAULT_MSG_ENABLE (NETIF_MSG_DRV|NETIF_MSG_PROBE|NETIF_MSG_LINK)
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 3ba0c90e7055..c0f9faca70c4 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -7592,7 +7592,7 @@ module_exit(e1000_exit_module);
 
 MODULE_AUTHOR("Intel Corporation, ");
 MODULE_DESCRIPTION("Intel(R) PRO/1000 Network Driver");
-MODULE_LICENSE("GPL");
+MODULE_LICENSE("GPL v2");
 MODULE_VERSION(DRV_VERSION);
 
 /* netdev.c */
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index 3f536541f45f..503bbc017792 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -21,7 +21,7 @@ static const char fm10k_copyright[] =
 
 MODULE_AUTHOR("Intel Corporation, ");
 MODULE_DESCRIPTION(DRV_SUMMARY);
-MODULE_LICENSE("GPL");
+MODULE_LICENSE("GPL v2");
 MODULE_VERSION(DRV_VERSION);
 
 /* single workqueue for entire fm10k driver */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 5d209d8fe9b8..c7d2c9010fdf 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -91,7 +91,7 @@ MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all), Debug mask (0x8XXX
 
 MODULE_AUTHOR("Intel Corporation, ");
 MODULE_DESCRIPTION("Intel(R) Ethernet Connection XL710 Network Driver");
-MODULE_LICENSE("GPL");
+MODULE_LICENSE("GPL v2");
 MODULE_VERSION(DRV_VERSION);
 
 static struct workqueue_struct *i40e_wq;
diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
index 54d8a1ed05ac..0e2f78175f0e 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_main.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
@@ -53,8 +53,8 @@ MODULE_DEVICE_TABLE(pci, iavf_pci_tbl);
 
 MODULE_ALIAS("i40evf");
 MODULE_AUTHOR("Intel Corporation, ");
-MODULE_DESCRIPTION("Intel(R) XL710 X710 Virtual Function Network Driver");
-MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Intel(R) Ethernet Adaptive Virtual Function Network Driver");
+MODULE_LICENSE("GPL v2");
 MODULE_VERSION(DRV_VERSION);
 
 static struct workqueue_struct *iavf_wq;
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 1b49a605d094..d54e63785ff0 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -15,7 +15,7 @@ static const char ice_copyright[] = "Copyright (c) 2018, Intel Corporation.";
 
 MODULE_AUTHOR("Intel Corporation, ");
 MODULE_DESCRIPTION(DRV_SUMMARY);
-MODULE_LICENSE("GPL");
+MODULE_LICENSE("GPL v2");
 MODULE_VERSION(DRV_VERSION);
 
 static int debug = -1;
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index a32c576c1e65..c18e79112cad 100644
--- 

[RFC PATCH net-next v1 10/14] iavf: replace i40e_debug with iavf version

2018-09-13 Thread Jesse Brandeburg
Change another string (i40e_debug)

Signed-off-by: Jesse Brandeburg 
---
 drivers/net/ethernet/intel/iavf/i40e_adminq.c| 28 
 drivers/net/ethernet/intel/iavf/i40e_common.c| 12 +-
 drivers/net/ethernet/intel/iavf/i40e_osdep.h |  2 +-
 drivers/net/ethernet/intel/iavf/i40e_prototype.h |  2 +-
 drivers/net/ethernet/intel/iavf/i40e_type.h  |  2 +-
 5 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/intel/iavf/i40e_adminq.c b/drivers/net/ethernet/intel/iavf/i40e_adminq.c
index 480c3e8c38c8..d614722fbb3d 100644
--- a/drivers/net/ethernet/intel/iavf/i40e_adminq.c
+++ b/drivers/net/ethernet/intel/iavf/i40e_adminq.c
@@ -577,7 +577,7 @@ static u16 i40e_clean_asq(struct iavf_hw *hw)
desc = IAVF_ADMINQ_DESC(*asq, ntc);
details = I40E_ADMINQ_DETAILS(*asq, ntc);
while (rd32(hw, hw->aq.asq.head) != ntc) {
-   i40e_debug(hw, I40E_DEBUG_AQ_MESSAGE,
+   iavf_debug(hw, I40E_DEBUG_AQ_MESSAGE,
   "ntc %d head %d.\n", ntc, rd32(hw, hw->aq.asq.head));
 
if (details->callback) {
@@ -643,7 +643,7 @@ iavf_status iavf_asq_send_command(struct iavf_hw *hw, struct i40e_aq_desc *desc,
 mutex_lock(&hw->aq.asq_mutex);
 
if (hw->aq.asq.count == 0) {
-   i40e_debug(hw, I40E_DEBUG_AQ_MESSAGE,
+   iavf_debug(hw, I40E_DEBUG_AQ_MESSAGE,
   "AQTX: Admin queue not initialized.\n");
status = I40E_ERR_QUEUE_EMPTY;
goto asq_send_command_error;
@@ -653,7 +653,7 @@ iavf_status iavf_asq_send_command(struct iavf_hw *hw, struct i40e_aq_desc *desc,
 
val = rd32(hw, hw->aq.asq.head);
if (val >= hw->aq.num_asq_entries) {
-   i40e_debug(hw, I40E_DEBUG_AQ_MESSAGE,
+   iavf_debug(hw, I40E_DEBUG_AQ_MESSAGE,
   "AQTX: head overrun at %d\n", val);
status = I40E_ERR_QUEUE_EMPTY;
goto asq_send_command_error;
@@ -682,7 +682,7 @@ iavf_status iavf_asq_send_command(struct iavf_hw *hw, struct i40e_aq_desc *desc,
desc->flags |= cpu_to_le16(details->flags_ena);
 
if (buff_size > hw->aq.asq_buf_size) {
-   i40e_debug(hw,
+   iavf_debug(hw,
   I40E_DEBUG_AQ_MESSAGE,
   "AQTX: Invalid buffer size: %d.\n",
   buff_size);
@@ -691,7 +691,7 @@ iavf_status iavf_asq_send_command(struct iavf_hw *hw, struct i40e_aq_desc *desc,
}
 
if (details->postpone && !details->async) {
-   i40e_debug(hw,
+   iavf_debug(hw,
   I40E_DEBUG_AQ_MESSAGE,
   "AQTX: Async flag not set along with postpone flag");
status = I40E_ERR_PARAM;
@@ -706,7 +706,7 @@ iavf_status iavf_asq_send_command(struct iavf_hw *hw, struct i40e_aq_desc *desc,
 * in case of asynchronous completions
 */
if (i40e_clean_asq(hw) == 0) {
-   i40e_debug(hw,
+   iavf_debug(hw,
   I40E_DEBUG_AQ_MESSAGE,
   "AQTX: Error queue is full.\n");
status = I40E_ERR_ADMIN_QUEUE_FULL;
@@ -736,7 +736,7 @@ iavf_status iavf_asq_send_command(struct iavf_hw *hw, struct i40e_aq_desc *desc,
}
 
/* bump the tail */
-   i40e_debug(hw, I40E_DEBUG_AQ_MESSAGE, "AQTX: desc and buffer:\n");
+   iavf_debug(hw, I40E_DEBUG_AQ_MESSAGE, "AQTX: desc and buffer:\n");
iavf_debug_aq(hw, I40E_DEBUG_AQ_COMMAND, (void *)desc_on_ring,
  buff, buff_size);
(hw->aq.asq.next_to_use)++;
@@ -769,7 +769,7 @@ iavf_status iavf_asq_send_command(struct iavf_hw *hw, struct i40e_aq_desc *desc,
memcpy(buff, dma_buff->va, buff_size);
retval = le16_to_cpu(desc->retval);
if (retval != 0) {
-   i40e_debug(hw,
+   iavf_debug(hw,
   I40E_DEBUG_AQ_MESSAGE,
   "AQTX: Command completed with error 0x%X.\n",
   retval);
@@ -787,7 +787,7 @@ iavf_status iavf_asq_send_command(struct iavf_hw *hw, struct i40e_aq_desc *desc,
hw->aq.asq_last_status = (enum i40e_admin_queue_err)retval;
}
 
-   i40e_debug(hw, I40E_DEBUG_AQ_MESSAGE,
+   iavf_debug(hw, I40E_DEBUG_AQ_MESSAGE,
   "AQTX: desc and buffer writeback:\n");
iavf_debug_aq(hw, I40E_DEBUG_AQ_COMMAND, (void *)desc, buff, buff_size);
 
@@ -799,11 +799,11 @@ iavf_status iavf_asq_send_command(struct iavf_hw *hw, struct i40e_aq_desc *desc,
if ((!cmd_completed) &&
(!details->async && !details->postpone)) {
if (rd32(hw, hw->aq.asq.len) & IAVF_VF_ATQLEN1_ATQCRIT_MASK) {
-   i40e_debug(hw, I40E_DEBUG_AQ_MESSAGE,
+

[RFC PATCH net-next v1 13/14] iavf: finish renaming files to iavf

2018-09-13 Thread Jesse Brandeburg
This finishes the process of renaming the files that
make sense to rename (skipping adminq related files that
talk to i40e) and fixes up the build and the #includes
so that everything builds nicely.

Signed-off-by: Jesse Brandeburg 
---
 drivers/net/ethernet/intel/iavf/Makefile  | 2 +-
 drivers/net/ethernet/intel/iavf/i40e_adminq.c | 8 
 drivers/net/ethernet/intel/iavf/i40e_adminq.h | 4 ++--
 drivers/net/ethernet/intel/iavf/iavf.h| 4 ++--
 drivers/net/ethernet/intel/iavf/{i40e_alloc.h => iavf_alloc.h}| 0
 drivers/net/ethernet/intel/iavf/iavf_client.c | 2 +-
 drivers/net/ethernet/intel/iavf/{i40e_common.c => iavf_common.c}  | 4 ++--
 drivers/net/ethernet/intel/iavf/{i40e_devids.h => iavf_devids.h}  | 0
 drivers/net/ethernet/intel/iavf/iavf_main.c   | 2 +-
 drivers/net/ethernet/intel/iavf/{i40e_osdep.h => iavf_osdep.h}| 0
 .../ethernet/intel/iavf/{i40e_prototype.h => iavf_prototype.h}| 4 ++--
 .../net/ethernet/intel/iavf/{i40e_register.h => iavf_register.h}  | 0
 drivers/net/ethernet/intel/iavf/{i40e_status.h => iavf_status.h}  | 0
 drivers/net/ethernet/intel/iavf/iavf_txrx.c   | 2 +-
 drivers/net/ethernet/intel/iavf/{i40e_type.h => iavf_type.h}  | 8 
 drivers/net/ethernet/intel/iavf/iavf_virtchnl.c   | 2 +-
 16 files changed, 21 insertions(+), 21 deletions(-)
 rename drivers/net/ethernet/intel/iavf/{i40e_alloc.h => iavf_alloc.h} (100%)
 rename drivers/net/ethernet/intel/iavf/{i40e_common.c => iavf_common.c} (99%)
 rename drivers/net/ethernet/intel/iavf/{i40e_devids.h => iavf_devids.h} (100%)
 rename drivers/net/ethernet/intel/iavf/{i40e_osdep.h => iavf_osdep.h} (100%)
 rename drivers/net/ethernet/intel/iavf/{i40e_prototype.h => iavf_prototype.h} (98%)
 rename drivers/net/ethernet/intel/iavf/{i40e_register.h => iavf_register.h} (100%)
 rename drivers/net/ethernet/intel/iavf/{i40e_status.h => iavf_status.h} (100%)
 rename drivers/net/ethernet/intel/iavf/{i40e_type.h => iavf_type.h} (99%)

diff --git a/drivers/net/ethernet/intel/iavf/Makefile b/drivers/net/ethernet/intel/iavf/Makefile
index fa4c43be2266..87ddfbac2f17 100644
--- a/drivers/net/ethernet/intel/iavf/Makefile
+++ b/drivers/net/ethernet/intel/iavf/Makefile
@@ -12,4 +12,4 @@ subdir-ccflags-y += -I$(src)
 obj-$(CONFIG_IAVF) += iavf.o
 
 iavf-objs := iavf_main.o iavf_ethtool.o iavf_virtchnl.o \
-iavf_txrx.o i40e_common.o i40e_adminq.o iavf_client.o
+iavf_txrx.o iavf_common.o i40e_adminq.o iavf_client.o
diff --git a/drivers/net/ethernet/intel/iavf/i40e_adminq.c b/drivers/net/ethernet/intel/iavf/i40e_adminq.c
index 8aa817808cd5..d2b165b610fa 100644
--- a/drivers/net/ethernet/intel/iavf/i40e_adminq.c
+++ b/drivers/net/ethernet/intel/iavf/i40e_adminq.c
@@ -1,11 +1,11 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright(c) 2013 - 2018 Intel Corporation. */
 
-#include "i40e_status.h"
-#include "i40e_type.h"
-#include "i40e_register.h"
+#include "iavf_status.h"
+#include "iavf_type.h"
+#include "iavf_register.h"
 #include "i40e_adminq.h"
-#include "i40e_prototype.h"
+#include "iavf_prototype.h"
 
 /**
  *  i40e_adminq_init_regs - Initialize AdminQ registers
diff --git a/drivers/net/ethernet/intel/iavf/i40e_adminq.h b/drivers/net/ethernet/intel/iavf/i40e_adminq.h
index e34625e25589..ee983889eab0 100644
--- a/drivers/net/ethernet/intel/iavf/i40e_adminq.h
+++ b/drivers/net/ethernet/intel/iavf/i40e_adminq.h
@@ -4,8 +4,8 @@
 #ifndef _IAVF_ADMINQ_H_
 #define _IAVF_ADMINQ_H_
 
-#include "i40e_osdep.h"
-#include "i40e_status.h"
+#include "iavf_osdep.h"
+#include "iavf_status.h"
 #include "i40e_adminq_cmd.h"
 
 #define IAVF_ADMINQ_DESC(R, i)   \
diff --git a/drivers/net/ethernet/intel/iavf/iavf.h b/drivers/net/ethernet/intel/iavf/iavf.h
index 1d973b4cd973..961c1a71b671 100644
--- a/drivers/net/ethernet/intel/iavf/iavf.h
+++ b/drivers/net/ethernet/intel/iavf/iavf.h
@@ -34,7 +34,7 @@
 #include 
 #include 
 
-#include "i40e_type.h"
+#include "iavf_type.h"
 #include 
 #include "iavf_txrx.h"
 
@@ -298,7 +298,7 @@ struct iavf_adapter {
struct net_device *netdev;
struct pci_dev *pdev;
 
-   struct iavf_hw hw; /* defined in i40e_type.h */
+   struct iavf_hw hw; /* defined in iavf_type.h */
 
enum iavf_state_t state;
unsigned long crit_section;
diff --git a/drivers/net/ethernet/intel/iavf/i40e_alloc.h b/drivers/net/ethernet/intel/iavf/iavf_alloc.h
similarity index 100%
rename from drivers/net/ethernet/intel/iavf/i40e_alloc.h
rename to drivers/net/ethernet/intel/iavf/iavf_alloc.h
diff --git a/drivers/net/ethernet/intel/iavf/iavf_client.c b/drivers/net/ethernet/intel/iavf/iavf_client.c
index f4c195a4167a..a0bfa6b9555e 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_client.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_client.c
@@ -5,7 +5,7 @@
 #include 
 
 #include "iavf.h"
-#include "i40e_prototype.h"

[RFC PATCH net-next v1 09/14] iavf: rename i40e_hw to iavf_hw

2018-09-13 Thread Jesse Brandeburg
Fix up the i40e_hw names to the new name, including occurrences
inside other strings.

Signed-off-by: Jesse Brandeburg 
---
 drivers/net/ethernet/intel/iavf/i40e_adminq.c| 42 +++
 drivers/net/ethernet/intel/iavf/i40e_alloc.h | 21 +++-
 drivers/net/ethernet/intel/iavf/i40e_common.c| 30 +--
 drivers/net/ethernet/intel/iavf/i40e_prototype.h | 65 +++-
 drivers/net/ethernet/intel/iavf/i40e_type.h  | 10 ++--
 drivers/net/ethernet/intel/iavf/iavf.h   |  2 +-
 drivers/net/ethernet/intel/iavf/iavf_main.c  | 52 +--
 drivers/net/ethernet/intel/iavf/iavf_txrx.c  |  2 +-
 drivers/net/ethernet/intel/iavf/iavf_virtchnl.c  |  6 +--
 9 files changed, 111 insertions(+), 119 deletions(-)

diff --git a/drivers/net/ethernet/intel/iavf/i40e_adminq.c b/drivers/net/ethernet/intel/iavf/i40e_adminq.c
index 69dfdfd69796..480c3e8c38c8 100644
--- a/drivers/net/ethernet/intel/iavf/i40e_adminq.c
+++ b/drivers/net/ethernet/intel/iavf/i40e_adminq.c
@@ -13,7 +13,7 @@
  *
  *  This assumes the alloc_asq and alloc_arq functions have already been called
  **/
-static void i40e_adminq_init_regs(struct i40e_hw *hw)
+static void i40e_adminq_init_regs(struct iavf_hw *hw)
 {
/* set head and tail registers in our local struct */
hw->aq.asq.tail = IAVF_VF_ATQT1;
@@ -32,7 +32,7 @@ static void i40e_adminq_init_regs(struct i40e_hw *hw)
  *  i40e_alloc_adminq_asq_ring - Allocate Admin Queue send rings
  *  @hw: pointer to the hardware structure
  **/
-static iavf_status i40e_alloc_adminq_asq_ring(struct i40e_hw *hw)
+static iavf_status i40e_alloc_adminq_asq_ring(struct iavf_hw *hw)
 {
iavf_status ret_code;
 
@@ -59,7 +59,7 @@ static iavf_status i40e_alloc_adminq_asq_ring(struct i40e_hw *hw)
  *  i40e_alloc_adminq_arq_ring - Allocate Admin Queue receive rings
  *  @hw: pointer to the hardware structure
  **/
-static iavf_status i40e_alloc_adminq_arq_ring(struct i40e_hw *hw)
+static iavf_status i40e_alloc_adminq_arq_ring(struct iavf_hw *hw)
 {
iavf_status ret_code;
 
@@ -79,7 +79,7 @@ static iavf_status i40e_alloc_adminq_arq_ring(struct i40e_hw *hw)
  *  This assumes the posted send buffers have already been cleaned
  *  and de-allocated
  **/
-static void i40e_free_adminq_asq(struct i40e_hw *hw)
+static void i40e_free_adminq_asq(struct iavf_hw *hw)
 {
i40e_free_dma_mem(hw, &hw->aq.asq.desc_buf);
 }
@@ -91,7 +91,7 @@ static void i40e_free_adminq_asq(struct i40e_hw *hw)
  *  This assumes the posted receive buffers have already been cleaned
  *  and de-allocated
  **/
-static void i40e_free_adminq_arq(struct i40e_hw *hw)
+static void i40e_free_adminq_arq(struct iavf_hw *hw)
 {
i40e_free_dma_mem(hw, &hw->aq.arq.desc_buf);
 }
@@ -100,7 +100,7 @@ static void i40e_free_adminq_arq(struct i40e_hw *hw)
  *  i40e_alloc_arq_bufs - Allocate pre-posted buffers for the receive queue
  *  @hw: pointer to the hardware structure
  **/
-static iavf_status i40e_alloc_arq_bufs(struct i40e_hw *hw)
+static iavf_status i40e_alloc_arq_bufs(struct iavf_hw *hw)
 {
iavf_status ret_code;
struct i40e_aq_desc *desc;
@@ -167,7 +167,7 @@ static iavf_status i40e_alloc_arq_bufs(struct i40e_hw *hw)
  *  i40e_alloc_asq_bufs - Allocate empty buffer structs for the send queue
  *  @hw: pointer to the hardware structure
  **/
-static iavf_status i40e_alloc_asq_bufs(struct i40e_hw *hw)
+static iavf_status i40e_alloc_asq_bufs(struct iavf_hw *hw)
 {
iavf_status ret_code;
struct i40e_dma_mem *bi;
@@ -207,7 +207,7 @@ static iavf_status i40e_alloc_asq_bufs(struct i40e_hw *hw)
  *  i40e_free_arq_bufs - Free receive queue buffer info elements
  *  @hw: pointer to the hardware structure
  **/
-static void i40e_free_arq_bufs(struct i40e_hw *hw)
+static void i40e_free_arq_bufs(struct iavf_hw *hw)
 {
int i;
 
@@ -226,7 +226,7 @@ static void i40e_free_arq_bufs(struct i40e_hw *hw)
  *  i40e_free_asq_bufs - Free send queue buffer info elements
  *  @hw: pointer to the hardware structure
  **/
-static void i40e_free_asq_bufs(struct i40e_hw *hw)
+static void i40e_free_asq_bufs(struct iavf_hw *hw)
 {
int i;
 
@@ -251,7 +251,7 @@ static void i40e_free_asq_bufs(struct i40e_hw *hw)
  *
  *  Configure base address and length registers for the transmit queue
  **/
-static iavf_status i40e_config_asq_regs(struct i40e_hw *hw)
+static iavf_status i40e_config_asq_regs(struct iavf_hw *hw)
 {
iavf_status ret_code = 0;
u32 reg = 0;
@@ -280,7 +280,7 @@ static iavf_status i40e_config_asq_regs(struct i40e_hw *hw)
  *
  * Configure base address and length registers for the receive (event queue)
  **/
-static iavf_status i40e_config_arq_regs(struct i40e_hw *hw)
+static iavf_status i40e_config_arq_regs(struct iavf_hw *hw)
 {
iavf_status ret_code = 0;
u32 reg = 0;
@@ -319,7 +319,7 @@ static iavf_status i40e_config_arq_regs(struct i40e_hw *hw)
  *  Do *NOT* hold the lock when calling this as the memory allocation routines
  * 

Re: [PATCH net-next] pktgen: Fix fall-through annotation

2018-09-13 Thread David Miller
From: "Gustavo A. R. Silva" 
Date: Thu, 13 Sep 2018 14:03:20 -0500

> Replace "fallthru" with a proper "fall through" annotation.
> 
> This fix is part of the ongoing efforts to enable
> -Wimplicit-fallthrough
> 
> Signed-off-by: Gustavo A. R. Silva 

Applied.
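
For context, a minimal sketch of the annotation style the quoted patch refers to. The function and case values below are hypothetical and are not taken from pktgen; the sketch only shows a deliberate fall-through marked with a "fall through" comment, a form the compiler can recognize when -Wimplicit-fallthrough is enabled.

```c
/* Hypothetical helper, not pktgen code: counts how many teardown
 * steps run when starting from a given level. Each case deliberately
 * continues into the next one, and says so with a comment.
 */
int demo_teardown_steps(int level)
{
	int steps = 0;

	switch (level) {
	case 2:
		steps++;	/* undo step 2 */
		/* fall through */
	case 1:
		steps++;	/* undo step 1 */
		/* fall through */
	case 0:
		steps++;	/* undo step 0 */
		break;
	default:
		break;		/* unknown level: nothing to undo */
	}

	return steps;
}
```

Starting at level 2 runs all three steps, while level 0 runs only the last one.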


[RFC PATCH net-next v1 04/14] iavf: rename i40e_status to iavf_status

2018-09-13 Thread Jesse Brandeburg
This is just a rename of an internal variable i40e_status, but
it was a pretty big change and so deserved its own patch.

Signed-off-by: Jesse Brandeburg 
---
 drivers/net/ethernet/intel/iavf/i40e_adminq.c |  94 +-
 drivers/net/ethernet/intel/iavf/i40e_alloc.h  |   8 +-
 drivers/net/ethernet/intel/iavf/i40e_common.c |  72 +++---
 drivers/net/ethernet/intel/iavf/i40e_osdep.h  |   2 +-
 drivers/net/ethernet/intel/iavf/i40e_prototype.h  |  28 +++---
 drivers/net/ethernet/intel/iavf/i40evf.h  |   2 +-
 drivers/net/ethernet/intel/iavf/i40evf_client.c   |   6 +-
 drivers/net/ethernet/intel/iavf/i40evf_ethtool.c  |  52 --
 drivers/net/ethernet/intel/iavf/i40evf_main.c |  14 +--
 drivers/net/ethernet/intel/iavf/i40evf_virtchnl.c | 115 +-
 10 files changed, 179 insertions(+), 214 deletions(-)

diff --git a/drivers/net/ethernet/intel/iavf/i40e_adminq.c b/drivers/net/ethernet/intel/iavf/i40e_adminq.c
index 3d1c874f5f85..f0e6f9bbb819 100644
--- a/drivers/net/ethernet/intel/iavf/i40e_adminq.c
+++ b/drivers/net/ethernet/intel/iavf/i40e_adminq.c
@@ -34,9 +34,9 @@ static void i40e_adminq_init_regs(struct i40e_hw *hw)
  *  i40e_alloc_adminq_asq_ring - Allocate Admin Queue send rings
  *  @hw: pointer to the hardware structure
  **/
-static i40e_status i40e_alloc_adminq_asq_ring(struct i40e_hw *hw)
+static iavf_status i40e_alloc_adminq_asq_ring(struct i40e_hw *hw)
 {
-   i40e_status ret_code;
+   iavf_status ret_code;
 
ret_code = i40e_allocate_dma_mem(hw, &hw->aq.asq.desc_buf,
 i40e_mem_atq_ring,
@@ -61,9 +61,9 @@ static i40e_status i40e_alloc_adminq_asq_ring(struct i40e_hw *hw)
  *  i40e_alloc_adminq_arq_ring - Allocate Admin Queue receive rings
  *  @hw: pointer to the hardware structure
  **/
-static i40e_status i40e_alloc_adminq_arq_ring(struct i40e_hw *hw)
+static iavf_status i40e_alloc_adminq_arq_ring(struct i40e_hw *hw)
 {
-   i40e_status ret_code;
+   iavf_status ret_code;
 
ret_code = i40e_allocate_dma_mem(hw, &hw->aq.arq.desc_buf,
 i40e_mem_arq_ring,
@@ -102,9 +102,9 @@ static void i40e_free_adminq_arq(struct i40e_hw *hw)
  *  i40e_alloc_arq_bufs - Allocate pre-posted buffers for the receive queue
  *  @hw: pointer to the hardware structure
  **/
-static i40e_status i40e_alloc_arq_bufs(struct i40e_hw *hw)
+static iavf_status i40e_alloc_arq_bufs(struct i40e_hw *hw)
 {
-   i40e_status ret_code;
+   iavf_status ret_code;
struct i40e_aq_desc *desc;
struct i40e_dma_mem *bi;
int i;
@@ -115,7 +115,7 @@ static i40e_status i40e_alloc_arq_bufs(struct i40e_hw *hw)
 
/* buffer_info structures do not need alignment */
ret_code = i40e_allocate_virt_mem(hw, &hw->aq.arq.dma_head,
-   (hw->aq.num_arq_entries * sizeof(struct i40e_dma_mem)));
+ (hw->aq.num_arq_entries * sizeof(struct i40e_dma_mem)));
if (ret_code)
goto alloc_arq_bufs;
hw->aq.arq.r.arq_bi = (struct i40e_dma_mem *)hw->aq.arq.dma_head.va;
@@ -169,15 +169,15 @@ static i40e_status i40e_alloc_arq_bufs(struct i40e_hw *hw)
  *  i40e_alloc_asq_bufs - Allocate empty buffer structs for the send queue
  *  @hw: pointer to the hardware structure
  **/
-static i40e_status i40e_alloc_asq_bufs(struct i40e_hw *hw)
+static iavf_status i40e_alloc_asq_bufs(struct i40e_hw *hw)
 {
-   i40e_status ret_code;
+   iavf_status ret_code;
struct i40e_dma_mem *bi;
int i;
 
/* No mapped memory needed yet, just the buffer info structures */
ret_code = i40e_allocate_virt_mem(hw, &hw->aq.asq.dma_head,
-   (hw->aq.num_asq_entries * sizeof(struct i40e_dma_mem)));
+ (hw->aq.num_asq_entries * sizeof(struct i40e_dma_mem)));
if (ret_code)
goto alloc_asq_bufs;
hw->aq.asq.r.asq_bi = (struct i40e_dma_mem *)hw->aq.asq.dma_head.va;
@@ -253,9 +253,9 @@ static void i40e_free_asq_bufs(struct i40e_hw *hw)
  *
  *  Configure base address and length registers for the transmit queue
  **/
-static i40e_status i40e_config_asq_regs(struct i40e_hw *hw)
+static iavf_status i40e_config_asq_regs(struct i40e_hw *hw)
 {
-   i40e_status ret_code = 0;
+   iavf_status ret_code = 0;
u32 reg = 0;
 
/* Clear Head and Tail */
@@ -282,9 +282,9 @@ static i40e_status i40e_config_asq_regs(struct i40e_hw *hw)
  *
  * Configure base address and length registers for the receive (event queue)
  **/
-static i40e_status i40e_config_arq_regs(struct i40e_hw *hw)
+static iavf_status i40e_config_arq_regs(struct i40e_hw *hw)
 {
-   i40e_status ret_code = 0;
+   iavf_status ret_code = 0;
u32 reg = 0;
 
/* Clear Head and Tail */
@@ -321,9 +321,9 @@ static i40e_status i40e_config_arq_regs(struct i40e_hw *hw)
  *  Do *NOT* hold the lock when calling this as the memory 

[RFC PATCH net-next v1 11/14] iavf: tracing infrastructure rename

2018-09-13 Thread Jesse Brandeburg
Rename the i40e_trace file and fix up all the callers
to the new names inside the iavf_trace.h file.
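
For readers unfamiliar with the wrapper being renamed, here is a minimal sketch of how a two-level token-pasting macro like iavf_trace() resolves to a trace_iavf_* function. The tracepoint stand-in and demo function are hypothetical; real tracepoints are generated by the kernel's TRACE_EVENT machinery. The sketch also uses the standard __VA_ARGS__ form rather than the GNU named-variadic "args..." form used in the patch.

```c
/* Hypothetical stand-in for a generated tracepoint function. */
static int demo_trace_hits;

static void trace_iavf_clean_tx_irq(int ring_id)
{
	demo_trace_hits += ring_id;
}

/* Same shape as the macros in the patch: paste "trace_", the driver
 * name and the event name into a single identifier.
 */
#define _IAVF_TRACE_NAME(trace_name) (trace_ ## iavf ## _ ## trace_name)
#define IAVF_TRACE_NAME(trace_name) _IAVF_TRACE_NAME(trace_name)

/* Assumption: standard variadic form instead of GNU "args...". */
#define iavf_trace(trace_name, ...) IAVF_TRACE_NAME(trace_name)(__VA_ARGS__)

/* Expands to (trace_iavf_clean_tx_irq)(ring_id). */
int demo_trace_once(int ring_id)
{
	demo_trace_hits = 0;
	iavf_trace(clean_tx_irq, ring_id);
	return demo_trace_hits;
}
```

Callers name only the event (clean_tx_irq), so renaming the driver prefix touches the macro definitions and the macro name itself, not the event names.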

Signed-off-by: Jesse Brandeburg 
---
 drivers/net/ethernet/intel/iavf/iavf_main.c|  2 +-
 .../intel/iavf/{i40e_trace.h => iavf_trace.h}  | 28 +++---
 drivers/net/ethernet/intel/iavf/iavf_txrx.c| 14 +--
 3 files changed, 22 insertions(+), 22 deletions(-)
 rename drivers/net/ethernet/intel/iavf/{i40e_trace.h => iavf_trace.h} (85%)

diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
index 63c5d97b1658..b8edf43e36f1 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_main.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
@@ -9,7 +9,7 @@
  * CREATE_TRACE_POINTS defined
  */
 #define CREATE_TRACE_POINTS
-#include "i40e_trace.h"
+#include "iavf_trace.h"
 
 static int iavf_setup_all_tx_resources(struct iavf_adapter *adapter);
 static int iavf_setup_all_rx_resources(struct iavf_adapter *adapter);
diff --git a/drivers/net/ethernet/intel/iavf/i40e_trace.h b/drivers/net/ethernet/intel/iavf/iavf_trace.h
similarity index 85%
rename from drivers/net/ethernet/intel/iavf/i40e_trace.h
rename to drivers/net/ethernet/intel/iavf/iavf_trace.h
index 552cfbfcce71..24f34d79f20a 100644
--- a/drivers/net/ethernet/intel/iavf/i40e_trace.h
+++ b/drivers/net/ethernet/intel/iavf/iavf_trace.h
@@ -5,7 +5,7 @@
 
 /* The trace subsystem name for iavf will be "iavf".
  *
- * This file is named i40e_trace.h.
+ * This file is named iavf_trace.h.
  *
  * Since this include file's name is different from the trace
  * subsystem name, we'll have to define TRACE_INCLUDE_FILE at the end
@@ -23,14 +23,14 @@
 #include 
 
 /**
- * i40e_trace() macro enables shared code to refer to trace points
+ * iavf_trace() macro enables shared code to refer to trace points
  * like:
  *
- * trace_i40e{,vf}_example(args...)
+ * trace_iavf{,vf}_example(args...)
  *
  * ... as:
  *
- * i40e_trace(example, args...)
+ * iavf_trace(example, args...)
  *
  * ... to resolve to the PF or VF version of the tracepoint without
  * ifdefs, and to allow tracepoints to be disabled entirely at build
@@ -39,18 +39,18 @@
  * Trace point should always be referred to in the driver via this
  * macro.
  *
- * Similarly, i40e_trace_enabled(trace_name) wraps references to
- * trace_i40e{,vf}_<trace_name>_enabled() functions.
+ * Similarly, iavf_trace_enabled(trace_name) wraps references to
+ * trace_iavf{,vf}_<trace_name>_enabled() functions.
  */
-#define _I40E_TRACE_NAME(trace_name) (trace_ ## iavf ## _ ## trace_name)
-#define I40E_TRACE_NAME(trace_name) _I40E_TRACE_NAME(trace_name)
+#define _IAVF_TRACE_NAME(trace_name) (trace_ ## iavf ## _ ## trace_name)
+#define IAVF_TRACE_NAME(trace_name) _IAVF_TRACE_NAME(trace_name)
 
-#define i40e_trace(trace_name, args...) I40E_TRACE_NAME(trace_name)(args)
+#define iavf_trace(trace_name, args...) IAVF_TRACE_NAME(trace_name)(args)
 
-#define i40e_trace_enabled(trace_name) I40E_TRACE_NAME(trace_name##_enabled)()
+#define iavf_trace_enabled(trace_name) IAVF_TRACE_NAME(trace_name##_enabled)()
 
 /* Events common to PF and VF. Corresponding versions will be defined
- * for both, named trace_i40e_* and trace_iavf_*. The i40e_trace()
+ * for both, named trace_iavf_* and trace_iavf_*. The iavf_trace()
  * macro above will select the right trace point name for the driver
  * being built from shared code.
  */
@@ -195,8 +195,8 @@ DEFINE_EVENT(
 
 /* Events unique to the VF. */
 
-#endif /* _I40E_TRACE_H_ */
-/* This must be outside ifdef _I40E_TRACE_H */
+#endif /* _IAVF_TRACE_H_ */
+/* This must be outside ifdef _IAVF_TRACE_H */
 
 /* This trace include file is not located in the .../include/trace
  * with the kernel tracepoint definitions, because we're a loadable
@@ -205,5 +205,5 @@ DEFINE_EVENT(
 #undef TRACE_INCLUDE_PATH
 #define TRACE_INCLUDE_PATH .
 #undef TRACE_INCLUDE_FILE
-#define TRACE_INCLUDE_FILE i40e_trace
+#define TRACE_INCLUDE_FILE iavf_trace
 #include 
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
index 66d9f1bf9467..5164e812f009 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
@@ -5,7 +5,7 @@
 #include 
 
 #include "iavf.h"
-#include "i40e_trace.h"
+#include "iavf_trace.h"
 #include "i40e_prototype.h"
 
 static inline __le64 build_ctob(u32 td_cmd, u32 td_offset, unsigned int size,
@@ -211,7 +211,7 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
/* prevent any other reads prior to eop_desc */
smp_rmb();
 
-   i40e_trace(clean_tx_irq, tx_ring, tx_desc, tx_buf);
+   iavf_trace(clean_tx_irq, tx_ring, tx_desc, tx_buf);
/* if the descriptor isn't done, no work yet to do */
if (!(eop_desc->cmd_type_offset_bsz &
  cpu_to_le64(IAVF_TX_DESC_DTYPE_DESC_DONE)))
@@ -239,7 +239,7 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
 

[RFC PATCH net-next v1 00/14] rename and shrink i40evf

2018-09-13 Thread Jesse Brandeburg
This series contains changes to i40evf so that it becomes a more
generic virtual function driver for current and future silicon.

While doing the rename of i40evf to a more generic name of iavf,
we also put the driver on a severe diet due to how much of the
code was unneeded or was unused.  The outcome is a lean and mean
virtual function driver that continues to work on existing 40GbE
(i40e) virtual devices and is prepped for future supported devices,
like the 100GbE (ice) virtual devices.

This solves two issues that we either saw coming or that were
already present.  The first was constant code duplication between
i40e and i40evf, when much of the duplicated code in i40evf was
unused or unneeded.  The second was the likely future confusion
over why VF devices that are not "40GbE"-only devices were
supported by i40evf.

The thought is that iavf will be the virtual function driver for
all future devices, so it should have a "generic" name to properly
represent that it is the VF driver for multiple generations of
devices.

Known Caveats:
This may cause some user confusion, especially for Kconfig not
migrating cleanly to the new CONFIG_IAVF from CONFIG_I40EVF.

Existing user configurations may have to change, but the module
alias in patch 1 helps a bit here.

---
v1: initial RFC

Jesse Brandeburg (14):
  intel-ethernet: rename i40evf to iavf
  iavf: diet and reformat
  iavf: rename functions and structs to new name
  iavf: rename i40e_status to iavf_status
  iavf: move i40evf files to new name
  iavf: remove references to old names
  iavf: rename device ID defines
  iavf: rename I40E_ADMINQ_DESC
  iavf: rename i40e_hw to iavf_hw
  iavf: replace i40e_debug with iavf version
  iavf: tracing infrastructure rename
  iavf: rename most of i40e strings
  iavf: finish renaming files to iavf
  intel-ethernet: use correct module license

 Documentation/networking/00-INDEX  |4 +-
 Documentation/networking/{i40evf.txt => iavf.txt}  |   16 +-
 MAINTAINERS|2 +-
 drivers/net/ethernet/intel/Kconfig |   12 +-
 drivers/net/ethernet/intel/Makefile|2 +-
 drivers/net/ethernet/intel/e100.c  |2 +-
 drivers/net/ethernet/intel/e1000/e1000_main.c  |2 +-
 drivers/net/ethernet/intel/e1000e/netdev.c |2 +-
 drivers/net/ethernet/intel/fm10k/fm10k_main.c  |2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c|2 +-
 drivers/net/ethernet/intel/i40evf/i40e_devids.h|   34 -
 drivers/net/ethernet/intel/i40evf/i40e_hmc.h   |  215 --
 drivers/net/ethernet/intel/i40evf/i40e_lan_hmc.h   |  158 --
 drivers/net/ethernet/intel/i40evf/i40e_register.h  |  313 ---
 .../net/ethernet/intel/{i40evf => iavf}/Makefile   |   11 +-
 .../ethernet/intel/{i40evf => iavf}/i40e_adminq.c  |  309 ++-
 .../ethernet/intel/{i40evf => iavf}/i40e_adminq.h  |   35 +-
 .../intel/{i40evf => iavf}/i40e_adminq_cmd.h   | 2280 +---
 .../intel/{i40evf/i40evf.h => iavf/iavf.h} |  407 ++--
 .../{i40evf/i40e_alloc.h => iavf/iavf_alloc.h} |   47 +-
 .../{i40evf/i40evf_client.c => iavf/iavf_client.c} |  200 +-
 .../{i40evf/i40evf_client.h => iavf/iavf_client.h} |   30 +-
 .../{i40evf/i40e_common.c => iavf/iavf_common.c}   | 1105 --
 drivers/net/ethernet/intel/iavf/iavf_devids.h  |   12 +
 .../i40evf_ethtool.c => iavf/iavf_ethtool.c}   |  510 +++--
 .../{i40evf/i40evf_main.c => iavf/iavf_main.c} | 1688 ---
 .../{i40evf/i40e_osdep.h => iavf/iavf_osdep.h} |   28 +-
 .../i40e_prototype.h => iavf/iavf_prototype.h} |  147 +-
 drivers/net/ethernet/intel/iavf/iavf_register.h|   68 +
 .../{i40evf/i40e_status.h => iavf/iavf_status.h}   |8 +-
 .../{i40evf/i40e_trace.h => iavf/iavf_trace.h} |   86 +-
 .../intel/{i40evf/i40e_txrx.c => iavf/iavf_txrx.c} |  804 +++
 .../intel/{i40evf/i40e_txrx.h => iavf/iavf_txrx.h} |  359 ++-
 .../intel/{i40evf/i40e_type.h => iavf/iavf_type.h} | 1604 --
 .../i40evf_virtchnl.c => iavf/iavf_virtchnl.c} |  501 +++--
 drivers/net/ethernet/intel/ice/ice_main.c  |2 +-
 drivers/net/ethernet/intel/igb/igb_main.c  |2 +-
 drivers/net/ethernet/intel/igbvf/netdev.c  |2 +-
 drivers/net/ethernet/intel/ixgb/ixgb_main.c|2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |2 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  |2 +-
 41 files changed, 3436 insertions(+), 7581 deletions(-)
 rename Documentation/networking/{i40evf.txt => iavf.txt} (72%)
 delete mode 100644 drivers/net/ethernet/intel/i40evf/i40e_devids.h
 delete mode 100644 drivers/net/ethernet/intel/i40evf/i40e_hmc.h
 delete mode 100644 drivers/net/ethernet/intel/i40evf/i40e_lan_hmc.h
 delete mode 100644 drivers/net/ethernet/intel/i40evf/i40e_register.h
 rename drivers/net/ethernet/intel/{i40evf => iavf}/Makefile (38%)
 rename drivers/net/ethernet/intel/{i40evf => 

[RFC PATCH net-next v1 06/14] iavf: remove references to old names

2018-09-13 Thread Jesse Brandeburg
Remove the register name references to I40E_VF* and change to
IAVF_VF. Update the descriptor names and defines to the IAVF
name.
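
As an illustration of how the renamed register masks are used in the hunks below, here is a minimal sketch of the three patterns: enabling a queue by OR-ing the enable bit into the length register value, testing that bit to see whether the queue is alive, and masking the head register to get a ring index. The DEMO_* names, bit positions and mask widths are assumptions made for this example, not values copied from iavf_register.h.

```c
#include <stdint.h>

/* Assumed layout for illustration only: enable flag in bit 31,
 * head index in the low 10 bits.
 */
#define DEMO_ATQLEN_ENABLE_MASK	(1u << 31)
#define DEMO_ARQH_MASK		0x3FFu

/* Value written to the length register: entry count plus enable bit. */
uint32_t demo_len_reg_value(uint32_t num_entries)
{
	return num_entries | DEMO_ATQLEN_ENABLE_MASK;
}

/* Mirrors the check_asq_alive pattern: non-zero if the enable bit is set. */
int demo_queue_alive(uint32_t len_reg)
{
	return !!(len_reg & DEMO_ATQLEN_ENABLE_MASK);
}

/* Mirrors "ntu = rd32(...) & MASK": keep only the head index bits. */
uint32_t demo_head_index(uint32_t head_reg)
{
	return head_reg & DEMO_ARQH_MASK;
}
```

Because the patch changes only the macro names, the values written to and read from the hardware stay the same.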

Signed-off-by: Jesse Brandeburg 
---
 drivers/net/ethernet/intel/iavf/i40e_adminq.c   |  28 ++--
 drivers/net/ethernet/intel/iavf/i40e_common.c   |   2 +-
 drivers/net/ethernet/intel/iavf/i40e_osdep.h|   2 +-
 drivers/net/ethernet/intel/iavf/i40e_register.h | 128 +-
 drivers/net/ethernet/intel/iavf/i40e_type.h | 170 
 drivers/net/ethernet/intel/iavf/iavf.h  |  10 +-
 drivers/net/ethernet/intel/iavf/iavf_main.c |  92 ++---
 drivers/net/ethernet/intel/iavf/iavf_txrx.c | 104 +++
 drivers/net/ethernet/intel/iavf/iavf_txrx.h |   2 +-
 9 files changed, 267 insertions(+), 271 deletions(-)

diff --git a/drivers/net/ethernet/intel/iavf/i40e_adminq.c b/drivers/net/ethernet/intel/iavf/i40e_adminq.c
index f0e6f9bbb819..50e0f1225298 100644
--- a/drivers/net/ethernet/intel/iavf/i40e_adminq.c
+++ b/drivers/net/ethernet/intel/iavf/i40e_adminq.c
@@ -17,16 +17,16 @@ static void i40e_adminq_init_regs(struct i40e_hw *hw)
 {
/* set head and tail registers in our local struct */
if (i40e_is_vf(hw)) {
-   hw->aq.asq.tail = I40E_VF_ATQT1;
-   hw->aq.asq.head = I40E_VF_ATQH1;
-   hw->aq.asq.len  = I40E_VF_ATQLEN1;
-   hw->aq.asq.bal  = I40E_VF_ATQBAL1;
-   hw->aq.asq.bah  = I40E_VF_ATQBAH1;
-   hw->aq.arq.tail = I40E_VF_ARQT1;
-   hw->aq.arq.head = I40E_VF_ARQH1;
-   hw->aq.arq.len  = I40E_VF_ARQLEN1;
-   hw->aq.arq.bal  = I40E_VF_ARQBAL1;
-   hw->aq.arq.bah  = I40E_VF_ARQBAH1;
+   hw->aq.asq.tail = IAVF_VF_ATQT1;
+   hw->aq.asq.head = IAVF_VF_ATQH1;
+   hw->aq.asq.len  = IAVF_VF_ATQLEN1;
+   hw->aq.asq.bal  = IAVF_VF_ATQBAL1;
+   hw->aq.asq.bah  = IAVF_VF_ATQBAH1;
+   hw->aq.arq.tail = IAVF_VF_ARQT1;
+   hw->aq.arq.head = IAVF_VF_ARQH1;
+   hw->aq.arq.len  = IAVF_VF_ARQLEN1;
+   hw->aq.arq.bal  = IAVF_VF_ARQBAL1;
+   hw->aq.arq.bah  = IAVF_VF_ARQBAH1;
}
 }
 
@@ -264,7 +264,7 @@ static iavf_status i40e_config_asq_regs(struct i40e_hw *hw)
 
/* set starting point */
wr32(hw, hw->aq.asq.len, (hw->aq.num_asq_entries |
- I40E_VF_ATQLEN1_ATQENABLE_MASK));
+ IAVF_VF_ATQLEN1_ATQENABLE_MASK));
wr32(hw, hw->aq.asq.bal, lower_32_bits(hw->aq.asq.desc_buf.pa));
wr32(hw, hw->aq.asq.bah, upper_32_bits(hw->aq.asq.desc_buf.pa));
 
@@ -293,7 +293,7 @@ static iavf_status i40e_config_arq_regs(struct i40e_hw *hw)
 
/* set starting point */
wr32(hw, hw->aq.arq.len, (hw->aq.num_arq_entries |
- I40E_VF_ARQLEN1_ARQENABLE_MASK));
+ IAVF_VF_ARQLEN1_ARQENABLE_MASK));
wr32(hw, hw->aq.arq.bal, lower_32_bits(hw->aq.arq.desc_buf.pa));
wr32(hw, hw->aq.arq.bah, upper_32_bits(hw->aq.arq.desc_buf.pa));
 
@@ -800,7 +800,7 @@ iavf_status iavf_asq_send_command(struct i40e_hw *hw, struct i40e_aq_desc *desc,
/* update the error if time out occurred */
if ((!cmd_completed) &&
(!details->async && !details->postpone)) {
-   if (rd32(hw, hw->aq.asq.len) & I40E_VF_ATQLEN1_ATQCRIT_MASK) {
+   if (rd32(hw, hw->aq.asq.len) & IAVF_VF_ATQLEN1_ATQCRIT_MASK) {
i40e_debug(hw, I40E_DEBUG_AQ_MESSAGE,
   "AQTX: AQ Critical error.\n");
status = I40E_ERR_ADMIN_QUEUE_CRITICAL_ERROR;
@@ -868,7 +868,7 @@ iavf_status iavf_clean_arq_element(struct i40e_hw *hw,
}
 
/* set next_to_use to head */
-   ntu = rd32(hw, hw->aq.arq.head) & I40E_VF_ARQH1_ARQH_MASK;
+   ntu = rd32(hw, hw->aq.arq.head) & IAVF_VF_ARQH1_ARQH_MASK;
if (ntu == ntc) {
/* nothing to do - shouldn't need to update ring's values */
ret_code = I40E_ERR_ADMIN_QUEUE_NO_WORK;
diff --git a/drivers/net/ethernet/intel/iavf/i40e_common.c b/drivers/net/ethernet/intel/iavf/i40e_common.c
index 96133efddf72..733e5cfeaf71 100644
--- a/drivers/net/ethernet/intel/iavf/i40e_common.c
+++ b/drivers/net/ethernet/intel/iavf/i40e_common.c
@@ -335,7 +335,7 @@ bool iavf_check_asq_alive(struct i40e_hw *hw)
 {
if (hw->aq.asq.len)
return !!(rd32(hw, hw->aq.asq.len) &
- I40E_VF_ATQLEN1_ATQENABLE_MASK);
+ IAVF_VF_ATQLEN1_ATQENABLE_MASK);
else
return false;
 }
diff --git a/drivers/net/ethernet/intel/iavf/i40e_osdep.h b/drivers/net/ethernet/intel/iavf/i40e_osdep.h
index 788a599dc26b..0fceb284e54a 100644
--- a/drivers/net/ethernet/intel/iavf/i40e_osdep.h
+++ b/drivers/net/ethernet/intel/iavf/i40e_osdep.h
@@ 

[RFC PATCH net-next v1 08/14] iavf: rename I40E_ADMINQ_DESC

2018-09-13 Thread Jesse Brandeburg
Take care of some renames containing I40E_ADMINQ_DESC.
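
For reference, a minimal sketch of what a descriptor-lookup macro of this shape does: it casts the ring's descriptor buffer to an array of descriptors and returns a pointer to element i. The demo_* structures are simplified, hypothetical stand-ins for the driver's i40e_aq_desc and i40e_adminq_ring, and demo_next_index() only illustrates the wrap-around step seen in the clean loop.

```c
/* Simplified, hypothetical stand-ins for the driver structures. */
struct demo_aq_desc {
	unsigned short flags;
	unsigned short opcode;
};

struct demo_dma_mem {
	void *va;		/* virtual address of the descriptor array */
};

struct demo_ring {
	struct demo_dma_mem desc_buf;
	unsigned short count;	/* number of descriptors in the ring */
};

/* Same shape as IAVF_ADMINQ_DESC(R, i): pointer to descriptor i. */
#define DEMO_ADMINQ_DESC(R, i) \
	(&(((struct demo_aq_desc *)((R).desc_buf.va))[i]))

/* Advance a ring index, wrapping to 0 at the end of the ring. */
unsigned short demo_next_index(const struct demo_ring *ring,
			       unsigned short ntc)
{
	ntc++;
	if (ntc == ring->count)
		ntc = 0;
	return ntc;
}
```

The macro takes the ring by value expression (R), which is why callers in the diff pass hw->aq.arq or *asq rather than a pointer.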

Signed-off-by: Jesse Brandeburg 
---
 drivers/net/ethernet/intel/iavf/i40e_adminq.c | 18 +-
 drivers/net/ethernet/intel/iavf/i40e_adminq.h |  4 ++--
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/iavf/i40e_adminq.c b/drivers/net/ethernet/intel/iavf/i40e_adminq.c
index 8110b92fa2b0..69dfdfd69796 100644
--- a/drivers/net/ethernet/intel/iavf/i40e_adminq.c
+++ b/drivers/net/ethernet/intel/iavf/i40e_adminq.c
@@ -40,7 +40,7 @@ static iavf_status i40e_alloc_adminq_asq_ring(struct i40e_hw *hw)
 i40e_mem_atq_ring,
 (hw->aq.num_asq_entries *
 sizeof(struct i40e_aq_desc)),
-I40E_ADMINQ_DESC_ALIGNMENT);
+IAVF_ADMINQ_DESC_ALIGNMENT);
if (ret_code)
return ret_code;
 
@@ -67,7 +67,7 @@ static iavf_status i40e_alloc_adminq_arq_ring(struct i40e_hw *hw)
 i40e_mem_arq_ring,
 (hw->aq.num_arq_entries *
 sizeof(struct i40e_aq_desc)),
-I40E_ADMINQ_DESC_ALIGNMENT);
+IAVF_ADMINQ_DESC_ALIGNMENT);
 
return ret_code;
 }
@@ -124,12 +124,12 @@ static iavf_status i40e_alloc_arq_bufs(struct i40e_hw *hw)
ret_code = i40e_allocate_dma_mem(hw, bi,
 i40e_mem_arq_buf,
 hw->aq.arq_buf_size,
-I40E_ADMINQ_DESC_ALIGNMENT);
+IAVF_ADMINQ_DESC_ALIGNMENT);
if (ret_code)
goto unwind_alloc_arq_bufs;
 
/* now configure the descriptors for use */
-   desc = I40E_ADMINQ_DESC(hw->aq.arq, i);
+   desc = IAVF_ADMINQ_DESC(hw->aq.arq, i);
 
desc->flags = cpu_to_le16(I40E_AQ_FLAG_BUF);
if (hw->aq.arq_buf_size > I40E_AQ_LARGE_BUF)
@@ -186,7 +186,7 @@ static iavf_status i40e_alloc_asq_bufs(struct i40e_hw *hw)
ret_code = i40e_allocate_dma_mem(hw, bi,
 i40e_mem_asq_buf,
 hw->aq.asq_buf_size,
-I40E_ADMINQ_DESC_ALIGNMENT);
+IAVF_ADMINQ_DESC_ALIGNMENT);
if (ret_code)
goto unwind_alloc_asq_bufs;
}
@@ -574,7 +574,7 @@ static u16 i40e_clean_asq(struct i40e_hw *hw)
struct i40e_aq_desc desc_cb;
struct i40e_aq_desc *desc;
 
-   desc = I40E_ADMINQ_DESC(*asq, ntc);
+   desc = IAVF_ADMINQ_DESC(*asq, ntc);
details = I40E_ADMINQ_DETAILS(*asq, ntc);
while (rd32(hw, hw->aq.asq.head) != ntc) {
i40e_debug(hw, I40E_DEBUG_AQ_MESSAGE,
@@ -592,7 +592,7 @@ static u16 i40e_clean_asq(struct i40e_hw *hw)
ntc++;
if (ntc == asq->count)
ntc = 0;
-   desc = I40E_ADMINQ_DESC(*asq, ntc);
+   desc = IAVF_ADMINQ_DESC(*asq, ntc);
details = I40E_ADMINQ_DETAILS(*asq, ntc);
}
 
@@ -714,7 +714,7 @@ iavf_status iavf_asq_send_command(struct i40e_hw *hw, struct i40e_aq_desc *desc,
}
 
/* initialize the temp desc pointer with the right desc */
-   desc_on_ring = I40E_ADMINQ_DESC(hw->aq.asq, hw->aq.asq.next_to_use);
+   desc_on_ring = IAVF_ADMINQ_DESC(hw->aq.asq, hw->aq.asq.next_to_use);
 
/* if the desc is available copy the temp desc to the right place */
*desc_on_ring = *desc;
@@ -874,7 +874,7 @@ iavf_status iavf_clean_arq_element(struct i40e_hw *hw,
}
 
/* now clean the next descriptor */
-   desc = I40E_ADMINQ_DESC(hw->aq.arq, ntc);
+   desc = IAVF_ADMINQ_DESC(hw->aq.arq, ntc);
desc_idx = ntc;
 
hw->aq.arq_last_status =
diff --git a/drivers/net/ethernet/intel/iavf/i40e_adminq.h b/drivers/net/ethernet/intel/iavf/i40e_adminq.h
index 80b70a65028f..fd162a293c38 100644
--- a/drivers/net/ethernet/intel/iavf/i40e_adminq.h
+++ b/drivers/net/ethernet/intel/iavf/i40e_adminq.h
@@ -8,10 +8,10 @@
 #include "i40e_status.h"
 #include "i40e_adminq_cmd.h"
 
-#define I40E_ADMINQ_DESC(R, i)   \
+#define IAVF_ADMINQ_DESC(R, i)   \
(&(((struct i40e_aq_desc *)((R).desc_buf.va))[i]))
 
-#define I40E_ADMINQ_DESC_ALIGNMENT 4096
+#define IAVF_ADMINQ_DESC_ALIGNMENT 4096
 
 struct i40e_adminq_ring {
struct i40e_virt_mem dma_head;  /* space for dma structures */
-- 
2.14.4



[RFC PATCH net-next v1 02/14] iavf: diet and reformat

2018-09-13 Thread Jesse Brandeburg
Remove a bunch of unused code and reformat a few lines. Also
remove some now-unnecessary files.

Signed-off-by: Jesse Brandeburg 
---
 drivers/net/ethernet/intel/iavf/i40e_adminq.c |   27 -
 drivers/net/ethernet/intel/iavf/i40e_adminq_cmd.h | 2276 +
 drivers/net/ethernet/intel/iavf/i40e_common.c |  337 ---
 drivers/net/ethernet/intel/iavf/i40e_hmc.h|  215 --
 drivers/net/ethernet/intel/iavf/i40e_lan_hmc.h|  158 --
 drivers/net/ethernet/intel/iavf/i40e_prototype.h  |   65 +-
 drivers/net/ethernet/intel/iavf/i40e_register.h   |  245 ---
 drivers/net/ethernet/intel/iavf/i40e_type.h   |  783 +--
 8 files changed, 50 insertions(+), 4056 deletions(-)
 delete mode 100644 drivers/net/ethernet/intel/iavf/i40e_hmc.h
 delete mode 100644 drivers/net/ethernet/intel/iavf/i40e_lan_hmc.h

diff --git a/drivers/net/ethernet/intel/iavf/i40e_adminq.c b/drivers/net/ethernet/intel/iavf/i40e_adminq.c
index 21a0dbf6ccf6..32e0e2d9cdc5 100644
--- a/drivers/net/ethernet/intel/iavf/i40e_adminq.c
+++ b/drivers/net/ethernet/intel/iavf/i40e_adminq.c
@@ -7,16 +7,6 @@
 #include "i40e_adminq.h"
 #include "i40e_prototype.h"
 
-/**
- * i40e_is_nvm_update_op - return true if this is an NVM update operation
- * @desc: API request descriptor
- **/
-static inline bool i40e_is_nvm_update_op(struct i40e_aq_desc *desc)
-{
-   return (desc->opcode == i40e_aqc_opc_nvm_erase) ||
-  (desc->opcode == i40e_aqc_opc_nvm_update);
-}
-
 /**
  *  i40e_adminq_init_regs - Initialize AdminQ registers
  *  @hw: pointer to the hardware structure
@@ -569,9 +559,6 @@ i40e_status i40evf_shutdown_adminq(struct i40e_hw *hw)
i40e_shutdown_asq(hw);
i40e_shutdown_arq(hw);
 
-   if (hw->nvm_buff.va)
-   i40e_free_virt_mem(hw, &hw->nvm_buff);
-
return ret_code;
 }
 
@@ -951,17 +938,3 @@ i40e_status i40evf_clean_arq_element(struct i40e_hw *hw,
 
return ret_code;
 }
-
-void i40evf_resume_aq(struct i40e_hw *hw)
-{
-   /* Registers are reset after PF reset */
-   hw->aq.asq.next_to_use = 0;
-   hw->aq.asq.next_to_clean = 0;
-
-   i40e_config_asq_regs(hw);
-
-   hw->aq.arq.next_to_use = 0;
-   hw->aq.arq.next_to_clean = 0;
-
-   i40e_config_arq_regs(hw);
-}
diff --git a/drivers/net/ethernet/intel/iavf/i40e_adminq_cmd.h b/drivers/net/ethernet/intel/iavf/i40e_adminq_cmd.h
index 5fd8529465d4..e7224ff9496f 100644
--- a/drivers/net/ethernet/intel/iavf/i40e_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/iavf/i40e_adminq_cmd.h
@@ -307,33 +307,6 @@ enum i40e_admin_queue_opc {
  */
 #define I40E_CHECK_CMD_LENGTH(X)   I40E_CHECK_STRUCT_LEN(16, X)
 
-/* internal (0x00XX) commands */
-
-/* Get version (direct 0x0001) */
-struct i40e_aqc_get_version {
-   __le32 rom_ver;
-   __le32 fw_build;
-   __le16 fw_major;
-   __le16 fw_minor;
-   __le16 api_major;
-   __le16 api_minor;
-};
-
-I40E_CHECK_CMD_LENGTH(i40e_aqc_get_version);
-
-/* Send driver version (indirect 0x0002) */
-struct i40e_aqc_driver_version {
-   u8  driver_major_ver;
-   u8  driver_minor_ver;
-   u8  driver_build_ver;
-   u8  driver_subbuild_ver;
-   u8  reserved[4];
-   __le32  address_high;
-   __le32  address_low;
-};
-
-I40E_CHECK_CMD_LENGTH(i40e_aqc_driver_version);
-
 /* Queue Shutdown (direct 0x0003) */
 struct i40e_aqc_queue_shutdown {
__le32  driver_unloading;
@@ -343,490 +316,6 @@ struct i40e_aqc_queue_shutdown {
 
 I40E_CHECK_CMD_LENGTH(i40e_aqc_queue_shutdown);
 
-/* Set PF context (0x0004, direct) */
-struct i40e_aqc_set_pf_context {
-   u8  pf_id;
-   u8  reserved[15];
-};
-
-I40E_CHECK_CMD_LENGTH(i40e_aqc_set_pf_context);
-
-/* Request resource ownership (direct 0x0008)
- * Release resource ownership (direct 0x0009)
- */
-#define I40E_AQ_RESOURCE_NVM   1
-#define I40E_AQ_RESOURCE_SDP   2
-#define I40E_AQ_RESOURCE_ACCESS_READ   1
-#define I40E_AQ_RESOURCE_ACCESS_WRITE  2
-#define I40E_AQ_RESOURCE_NVM_READ_TIMEOUT  3000
-#define I40E_AQ_RESOURCE_NVM_WRITE_TIMEOUT 18
-
-struct i40e_aqc_request_resource {
-   __le16  resource_id;
-   __le16  access_type;
-   __le32  timeout;
-   __le32  resource_number;
-   u8  reserved[4];
-};
-
-I40E_CHECK_CMD_LENGTH(i40e_aqc_request_resource);
-
-/* Get function capabilities (indirect 0x000A)
- * Get device capabilities (indirect 0x000B)
- */
-struct i40e_aqc_list_capabilites {
-   u8 command_flags;
-#define I40E_AQ_LIST_CAP_PF_INDEX_EN   1
-   u8 pf_index;
-   u8 reserved[2];
-   __le32 count;
-   __le32 addr_high;
-   __le32 addr_low;
-};
-
-I40E_CHECK_CMD_LENGTH(i40e_aqc_list_capabilites);
-
-struct i40e_aqc_list_capabilities_element_resp {
-   __le16  id;
-   u8  major_rev;
-   u8  minor_rev;
-   __le32  number;
-   __le32  logical_id;
-   __le32  phys_id;
-   u8  reserved[16];
-};
-
-/* list of 

[RFC PATCH net-next v1 05/14] iavf: move i40evf files to new name

2018-09-13 Thread Jesse Brandeburg
Simply move the i40evf files to the new name, updating the #includes
to track the new names, and updating the Makefile as well.

A future patch will remove the i40e references (after the code
removal patches later in this series).

Signed-off-by: Jesse Brandeburg 
---
v3: renamed more files after review comments
---
 drivers/net/ethernet/intel/iavf/Makefile  | 4 ++--
 drivers/net/ethernet/intel/iavf/{i40evf.h => iavf.h}  | 2 +-
 drivers/net/ethernet/intel/iavf/{i40evf_client.c => iavf_client.c}| 4 ++--
 drivers/net/ethernet/intel/iavf/{i40evf_client.h => iavf_client.h}| 0
 drivers/net/ethernet/intel/iavf/{i40evf_ethtool.c => iavf_ethtool.c}  | 2 +-
 drivers/net/ethernet/intel/iavf/{i40evf_main.c => iavf_main.c}| 4 ++--
 drivers/net/ethernet/intel/iavf/{i40e_txrx.c => iavf_txrx.c}  | 2 +-
 drivers/net/ethernet/intel/iavf/{i40e_txrx.h => iavf_txrx.h}  | 0
 .../net/ethernet/intel/iavf/{i40evf_virtchnl.c => iavf_virtchnl.c}| 4 ++--
 9 files changed, 11 insertions(+), 11 deletions(-)
 rename drivers/net/ethernet/intel/iavf/{i40evf.h => iavf.h} (99%)
 rename drivers/net/ethernet/intel/iavf/{i40evf_client.c => iavf_client.c} (99%)
 rename drivers/net/ethernet/intel/iavf/{i40evf_client.h => iavf_client.h} (100%)
 rename drivers/net/ethernet/intel/iavf/{i40evf_ethtool.c => iavf_ethtool.c} (99%)
 rename drivers/net/ethernet/intel/iavf/{i40evf_main.c => iavf_main.c} (99%)
 rename drivers/net/ethernet/intel/iavf/{i40e_txrx.c => iavf_txrx.c} (99%)
 rename drivers/net/ethernet/intel/iavf/{i40e_txrx.h => iavf_txrx.h} (100%)
 rename drivers/net/ethernet/intel/iavf/{i40evf_virtchnl.c => iavf_virtchnl.c} (99%)

diff --git a/drivers/net/ethernet/intel/iavf/Makefile b/drivers/net/ethernet/intel/iavf/Makefile
index ce2dce1e1ebf..fa4c43be2266 100644
--- a/drivers/net/ethernet/intel/iavf/Makefile
+++ b/drivers/net/ethernet/intel/iavf/Makefile
@@ -11,5 +11,5 @@ subdir-ccflags-y += -I$(src)
 
 obj-$(CONFIG_IAVF) += iavf.o
 
-iavf-objs := i40evf_main.o i40evf_ethtool.o i40evf_virtchnl.o \
-i40e_txrx.o i40e_common.o i40e_adminq.o i40evf_client.o
+iavf-objs := iavf_main.o iavf_ethtool.o iavf_virtchnl.o \
+iavf_txrx.o i40e_common.o i40e_adminq.o iavf_client.o
diff --git a/drivers/net/ethernet/intel/iavf/i40evf.h b/drivers/net/ethernet/intel/iavf/iavf.h
similarity index 99%
rename from drivers/net/ethernet/intel/iavf/i40evf.h
rename to drivers/net/ethernet/intel/iavf/iavf.h
index 19a93bfdb65c..c7ce2db958b0 100644
--- a/drivers/net/ethernet/intel/iavf/i40evf.h
+++ b/drivers/net/ethernet/intel/iavf/iavf.h
@@ -36,7 +36,7 @@
 
 #include "i40e_type.h"
 #include 
-#include "i40e_txrx.h"
+#include "iavf_txrx.h"
 
 #define DEFAULT_DEBUG_LEVEL_SHIFT 3
 #define PFX "iavf: "
diff --git a/drivers/net/ethernet/intel/iavf/i40evf_client.c b/drivers/net/ethernet/intel/iavf/iavf_client.c
similarity index 99%
rename from drivers/net/ethernet/intel/iavf/i40evf_client.c
rename to drivers/net/ethernet/intel/iavf/iavf_client.c
index d2660659174d..16971bfc5e43 100644
--- a/drivers/net/ethernet/intel/iavf/i40evf_client.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_client.c
@@ -4,9 +4,9 @@
 #include 
 #include 
 
-#include "i40evf.h"
+#include "iavf.h"
 #include "i40e_prototype.h"
-#include "i40evf_client.h"
+#include "iavf_client.h"
 
 static
 const char iavf_client_interface_version_str[] = IAVF_CLIENT_VERSION_STR;
diff --git a/drivers/net/ethernet/intel/iavf/i40evf_client.h b/drivers/net/ethernet/intel/iavf/iavf_client.h
similarity index 100%
rename from drivers/net/ethernet/intel/iavf/i40evf_client.h
rename to drivers/net/ethernet/intel/iavf/iavf_client.h
diff --git a/drivers/net/ethernet/intel/iavf/i40evf_ethtool.c b/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
similarity index 99%
rename from drivers/net/ethernet/intel/iavf/i40evf_ethtool.c
rename to drivers/net/ethernet/intel/iavf/iavf_ethtool.c
index 0277df40e53f..74a142802074 100644
--- a/drivers/net/ethernet/intel/iavf/i40evf_ethtool.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
@@ -2,7 +2,7 @@
 /* Copyright(c) 2013 - 2018 Intel Corporation. */
 
 /* ethtool support for iavf */
-#include "i40evf.h"
+#include "iavf.h"
 
 #include 
 
diff --git a/drivers/net/ethernet/intel/iavf/i40evf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
similarity index 99%
rename from drivers/net/ethernet/intel/iavf/i40evf_main.c
rename to drivers/net/ethernet/intel/iavf/iavf_main.c
index 600ea4040af2..7d815ace2d98 100644
--- a/drivers/net/ethernet/intel/iavf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
@@ -1,9 +1,9 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright(c) 2013 - 2018 Intel Corporation. */
 
-#include "i40evf.h"
+#include "iavf.h"
 #include "i40e_prototype.h"
-#include "i40evf_client.h"
+#include "iavf_client.h"
 /* All iavf tracepoints are defined by the include below, which must
  * be included exactly once across the whole kernel with
  * 

[RFC PATCH net-next v1 07/14] iavf: rename device ID defines

2018-09-13 Thread Jesse Brandeburg
Rename the device ID defines to have IAVF in them
and remove all the unused defines.

Signed-off-by: Jesse Brandeburg 
---
 drivers/net/ethernet/intel/iavf/i40e_adminq.c | 22 +++
 drivers/net/ethernet/intel/iavf/i40e_common.c | 29 +++
 drivers/net/ethernet/intel/iavf/i40e_devids.h | 40 ++-
 drivers/net/ethernet/intel/iavf/i40e_type.h   |  6 
 drivers/net/ethernet/intel/iavf/iavf_main.c   |  8 +++---
 5 files changed, 27 insertions(+), 78 deletions(-)

diff --git a/drivers/net/ethernet/intel/iavf/i40e_adminq.c b/drivers/net/ethernet/intel/iavf/i40e_adminq.c
index 50e0f1225298..8110b92fa2b0 100644
--- a/drivers/net/ethernet/intel/iavf/i40e_adminq.c
+++ b/drivers/net/ethernet/intel/iavf/i40e_adminq.c
@@ -16,18 +16,16 @@
 static void i40e_adminq_init_regs(struct i40e_hw *hw)
 {
/* set head and tail registers in our local struct */
-   if (i40e_is_vf(hw)) {
-   hw->aq.asq.tail = IAVF_VF_ATQT1;
-   hw->aq.asq.head = IAVF_VF_ATQH1;
-   hw->aq.asq.len  = IAVF_VF_ATQLEN1;
-   hw->aq.asq.bal  = IAVF_VF_ATQBAL1;
-   hw->aq.asq.bah  = IAVF_VF_ATQBAH1;
-   hw->aq.arq.tail = IAVF_VF_ARQT1;
-   hw->aq.arq.head = IAVF_VF_ARQH1;
-   hw->aq.arq.len  = IAVF_VF_ARQLEN1;
-   hw->aq.arq.bal  = IAVF_VF_ARQBAL1;
-   hw->aq.arq.bah  = IAVF_VF_ARQBAH1;
-   }
+   hw->aq.asq.tail = IAVF_VF_ATQT1;
+   hw->aq.asq.head = IAVF_VF_ATQH1;
+   hw->aq.asq.len  = IAVF_VF_ATQLEN1;
+   hw->aq.asq.bal  = IAVF_VF_ATQBAL1;
+   hw->aq.asq.bah  = IAVF_VF_ATQBAH1;
+   hw->aq.arq.tail = IAVF_VF_ARQT1;
+   hw->aq.arq.head = IAVF_VF_ARQH1;
+   hw->aq.arq.len  = IAVF_VF_ARQLEN1;
+   hw->aq.arq.bal  = IAVF_VF_ARQBAL1;
+   hw->aq.arq.bah  = IAVF_VF_ARQBAH1;
 }
 
 /**
diff --git a/drivers/net/ethernet/intel/iavf/i40e_common.c b/drivers/net/ethernet/intel/iavf/i40e_common.c
index 733e5cfeaf71..b97e8925d20e 100644
--- a/drivers/net/ethernet/intel/iavf/i40e_common.c
+++ b/drivers/net/ethernet/intel/iavf/i40e_common.c
@@ -19,33 +19,12 @@ iavf_status i40e_set_mac_type(struct i40e_hw *hw)
 
if (hw->vendor_id == PCI_VENDOR_ID_INTEL) {
switch (hw->device_id) {
-   case I40E_DEV_ID_SFP_XL710:
-   case I40E_DEV_ID_QEMU:
-   case I40E_DEV_ID_KX_B:
-   case I40E_DEV_ID_KX_C:
-   case I40E_DEV_ID_QSFP_A:
-   case I40E_DEV_ID_QSFP_B:
-   case I40E_DEV_ID_QSFP_C:
-   case I40E_DEV_ID_10G_BASE_T:
-   case I40E_DEV_ID_10G_BASE_T4:
-   case I40E_DEV_ID_20G_KR2:
-   case I40E_DEV_ID_20G_KR2_A:
-   case I40E_DEV_ID_25G_B:
-   case I40E_DEV_ID_25G_SFP28:
-   hw->mac.type = I40E_MAC_XL710;
-   break;
-   case I40E_DEV_ID_SFP_X722:
-   case I40E_DEV_ID_1G_BASE_T_X722:
-   case I40E_DEV_ID_10G_BASE_T_X722:
-   case I40E_DEV_ID_SFP_I_X722:
-   hw->mac.type = I40E_MAC_X722;
-   break;
-   case I40E_DEV_ID_X722_VF:
+   case IAVF_DEV_ID_X722_VF:
hw->mac.type = I40E_MAC_X722_VF;
break;
-   case I40E_DEV_ID_VF:
-   case I40E_DEV_ID_VF_HV:
-   case I40E_DEV_ID_ADAPTIVE_VF:
+   case IAVF_DEV_ID_VF:
+   case IAVF_DEV_ID_VF_HV:
+   case IAVF_DEV_ID_ADAPTIVE_VF:
hw->mac.type = I40E_MAC_VF;
break;
default:
diff --git a/drivers/net/ethernet/intel/iavf/i40e_devids.h b/drivers/net/ethernet/intel/iavf/i40e_devids.h
index f300bf271824..8eb7b697e96c 100644
--- a/drivers/net/ethernet/intel/iavf/i40e_devids.h
+++ b/drivers/net/ethernet/intel/iavf/i40e_devids.h
@@ -1,34 +1,12 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /* Copyright(c) 2013 - 2018 Intel Corporation. */
 
-#ifndef _I40E_DEVIDS_H_
-#define _I40E_DEVIDS_H_
-
-/* Device IDs */
-#define I40E_DEV_ID_SFP_XL710  0x1572
-#define I40E_DEV_ID_QEMU   0x1574
-#define I40E_DEV_ID_KX_B   0x1580
-#define I40E_DEV_ID_KX_C   0x1581
-#define I40E_DEV_ID_QSFP_A 0x1583
-#define I40E_DEV_ID_QSFP_B 0x1584
-#define I40E_DEV_ID_QSFP_C 0x1585
-#define I40E_DEV_ID_10G_BASE_T 0x1586
-#define I40E_DEV_ID_20G_KR2  0x1587
-#define I40E_DEV_ID_20G_KR2_A  0x1588
-#define I40E_DEV_ID_10G_BASE_T4  0x1589
-#define I40E_DEV_ID_25G_B  0x158A
-#define I40E_DEV_ID_25G_SFP28  0x158B
-#define I40E_DEV_ID_VF 0x154C
-#define I40E_DEV_ID_VF_HV  0x1571
-#define I40E_DEV_ID_ADAPTIVE_VF  0x1889
-#define I40E_DEV_ID_SFP_X722   0x37D0
-#define 

Re: [PATCH bpf-next 07/11] bpf: Add helper to retrieve socket in BPF

2018-09-13 Thread Alexei Starovoitov
On Thu, Sep 13, 2018 at 02:24:03PM -0700, Joe Stringer wrote:
> On Thu, 13 Sep 2018 at 14:22, Alexei Starovoitov
>  wrote:
> >
> > On Thu, Sep 13, 2018 at 02:17:17PM -0700, Joe Stringer wrote:
> > > On Thu, 13 Sep 2018 at 14:02, Alexei Starovoitov
> > >  wrote:
> > > >
> > > > On Thu, Sep 13, 2018 at 01:55:01PM -0700, Joe Stringer wrote:
> > > > > On Thu, 13 Sep 2018 at 12:06, Alexei Starovoitov
> > > > >  wrote:
> > > > > >
> > > > > > On Wed, Sep 12, 2018 at 5:06 PM, Alexei Starovoitov
> > > > > >  wrote:
> > > > > > > On Tue, Sep 11, 2018 at 05:36:36PM -0700, Joe Stringer wrote:
> > > > > > >> This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
> > > > > > >> bpf_sk_lookup_udp() which allows BPF programs to find out if 
> > > > > > >> there is a
> > > > > > >> socket listening on this host, and returns a socket pointer 
> > > > > > >> which the
> > > > > > >> BPF program can then access to determine, for instance, whether 
> > > > > > >> to
> > > > > > >> forward or drop traffic. bpf_sk_lookup_xxx() may take a 
> > > > > > >> reference on the
> > > > > > >> socket, so when a BPF program makes use of this function, it must
> > > > > > >> subsequently pass the returned pointer into the newly added 
> > > > > > >> sk_release()
> > > > > > >> to return the reference.
> > > > > > >>
> > > > > > >> By way of example, the following pseudocode would filter inbound
> > > > > > >> connections at XDP if there is no corresponding service 
> > > > > > >> listening for
> > > > > > >> the traffic:
> > > > > > >>
> > > > > > >>   struct bpf_sock_tuple tuple;
> > > > > > >>   struct bpf_sock_ops *sk;
> > > > > > >>
> > > > > > > >>   populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
> > > > > > > >>   sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
> > > > > > > ...
> > > > > > >> +struct bpf_sock_tuple {
> > > > > > >> + union {
> > > > > > >> + __be32 ipv6[4];
> > > > > > >> + __be32 ipv4;
> > > > > > >> + } saddr;
> > > > > > >> + union {
> > > > > > >> + __be32 ipv6[4];
> > > > > > >> + __be32 ipv4;
> > > > > > >> + } daddr;
> > > > > > >> + __be16 sport;
> > > > > > >> + __be16 dport;
> > > > > > >> + __u8 family;
> > > > > > >> +};
> > > > > > >
> > > > > > > since we can pass ptr_to_packet into map lookup and other helpers 
> > > > > > > now,
> > > > > > > can you move 'family' out of bpf_sock_tuple and combine with 
> > > > > > > netns_id arg?
> > > > > > > then progs wouldn't need to copy bytes from the packet into tuple
> > > > > > > to do a lookup.
> > > > >
> > > > > If I follow, you're proposing that users should be able to pass a
> > > > > pointer to the source address field of the L3 header, and assuming
> > > > > that the L3 header ends with saddr+daddr (no options/extheaders), and
> > > > > is immediately followed by the sport/dport then a packet pointer
> > > > > should work for performing socket lookup. Then it is up to the BPF
> > > > > program writer to ensure that this is the case, or otherwise fall back
> > > > > to populating a copy of the sock tuple on the stack.
> > > >
> > > > yep.
> > > >
> > > > > > have been thinking more about it.
> > > > > > since only ipv4 and ipv6 supported may be use size of bpf_sock_tuple
> > > > > > to infer family inside the helper, so it doesn't need to be passed 
> > > > > > explicitly?
> > > > >
> > > > > Let me make sure I understand the proposal here.
> > > > >
> > > > > The current structure and function prototypes are:
> > > > >
> > > > > struct bpf_sock_tuple {
> > > > >   union {
> > > > >   __be32 ipv6[4];
> > > > >   __be32 ipv4;
> > > > >   } saddr;
> > > > >   union {
> > > > >   __be32 ipv6[4];
> > > > >   __be32 ipv4;
> > > > >   } daddr;
> > > > >   __be16 sport;
> > > > >   __be16 dport;
> > > > >   __u8 family;
> > > > > };
> > > > ...
> > > > > You're proposing something like:
> > > > >
> > > > > struct bpf_sock_tuple4 {
> > > > >   __be32 saddr;
> > > > >   __be32 daddr;
> > > > >   __be16 sport;
> > > > >   __be16 dport;
> > > > >   __u8 family;
> > > > > };
> > > > >
> > > > > struct bpf_sock_tuple6 {
> > > > >   __be32 saddr[4];
> > > > >   __be32 daddr[4];
> > > > >   __be16 sport;
> > > > >   __be16 dport;
> > > > >   __u8 family;
> > > > > };
> > > >
> > > > I think the split is unnecessary.
> > > > I'm proposing:
> > > > struct bpf_sock_tuple {
> > > >   union {
> > > >   __be32 ipv6[4];
> > > >   __be32 ipv4;
> > > >   } saddr;
> > > >   union {
> > > >   __be32 ipv6[4];
> > > >   __be32 ipv4;
> > > >   } daddr;
> > > >   __be16 sport;
> > > >   __be16 dport;
> > > > };
> > > >
> > > > that points directly into the packet (when ipv4 options are not there)
> > > > and bpf_sk_lookup_tcp() uses 'size' argument to figure out ipv4/ipv6 
> > 

Re: [RFC PATCH iproute2-next] man: Add devlink health man page

2018-09-13 Thread Tobin C. Harding
On Thu, Sep 13, 2018 at 02:58:52PM +0300, Eran Ben Elisha wrote:
> 
> 
> On 9/13/2018 1:27 PM, Tobin C. Harding wrote:
> > On Thu, Sep 13, 2018 at 11:18:16AM +0300, Eran Ben Elisha wrote:
> > > Add devlink-health man page. Devlink-health tool will control device
> > > health attributes, sensors, actions and logging.
> > > 
> > > Signed-off-by: Eran Ben Elisha 
> > > 
> > > ---
> > > Copy paste man output to here for easier review process of the RFC.
> > > 
> > > DEVLINK-HEALTH(8)              Linux              DEVLINK-HEALTH(8)
> > > 
> > > NAME
> > > devlink-health - devlink health configuration
> > > 
> > > SYNOPSIS
> > > devlink [ OPTIONS ] health  { COMMAND | help }
> > > 
> > > OPTIONS := { -V[ersion] | -n[no-nice-names] }
> > > 
> > > devlink health show [ DEV ] [ sensor NAME ]
> > > 
> > > devlink health sensor set DEV name NAME [ action NAME { active | inactive } ]
> > > 
> > > devlink health action set DEV name NAME period PERIOD count COUNT fail { ignore | down }
> > > 
> > > devlink health action reinit DEV name NAME
> > > 
> > > devlink health help
> > > 
> > > DESCRIPTION
> > > devlink-health tool allows user to configure the way driver treats
> > > unexpected status. The tool allows configuration of the sensors that
> > > can trigger health activity. Set for each sensor the follow up
> > > operations, such as, reset and dump of info. In addition, set the
> > > health activity termination action.
> > > 
> > > devlink health show - Display devlink health sensors and actions 
> > > attributes
> > > DEV - Specifies the devlink device to show.  If this argument is 
> > > omitted, all devices are listed.
> > > 
> > > Format is:
> > >   BUS_NAME/BUS_ADDRESS
> > > 
> > > sensor NAME - Specifies the devlink sensor to show.
> > > 
> > 
> > Perhaps the commands should include the optional arguments so when
> > reading the description one doesn't have to scroll to the top of the
> > page all the time
> > 
> > e.g
> >   devlink health show [ DEV ] [ sensor NAME ] - Display devlink health 
> > sensors and actions attributes
> > 
> 
> I followed the scheme presented in all other devlink man pages.
> see devlink-region, devlink-port, etc.

Oh ok, my mistake.  I'd stick with what you have then.  Thanks for
pointing this out.

> From my perspective, I am fine with adding it to devlink-health, need ack
> from the devlink maintainer to see if he likes it...
> 
> > > devlink health sensor set - sets devlink health sensor attributes
> > > > DEV    Specifies the devlink device to show.
> > 
> > set
> > 
> > > name NAME
> > >Name of the sensor to set.
> > > 
> > > action NAME { active | inactive }
> > >Specify which actions to activate and which to 
> > > deactivate once a sensor was triggered. actions can be dump, reset, etc.
> > > 
> > > devlink health action set - sets devlink action attributes
> > > > DEV    Specifies the devlink device to set.
> > > 
> > > name NAME
> > >Specifies the devlink action to set.
> > 
> > This is a little unclear to me?
> 
> what is not clear? the term 'action' or the naming? can you elaborate?

It wasn't immediately clear what 'name' referred to.  But following on
from discussion above this may be because I have not read any of the
other devlink man pages.

thanks,
Tobin.


Re: [PATCH bpf-next 07/11] bpf: Add helper to retrieve socket in BPF

2018-09-13 Thread Joe Stringer
On Thu, 13 Sep 2018 at 14:22, Alexei Starovoitov
 wrote:
>
> On Thu, Sep 13, 2018 at 02:17:17PM -0700, Joe Stringer wrote:
> > On Thu, 13 Sep 2018 at 14:02, Alexei Starovoitov
> >  wrote:
> > >
> > > On Thu, Sep 13, 2018 at 01:55:01PM -0700, Joe Stringer wrote:
> > > > On Thu, 13 Sep 2018 at 12:06, Alexei Starovoitov
> > > >  wrote:
> > > > >
> > > > > On Wed, Sep 12, 2018 at 5:06 PM, Alexei Starovoitov
> > > > >  wrote:
> > > > > > On Tue, Sep 11, 2018 at 05:36:36PM -0700, Joe Stringer wrote:
> > > > > >> This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
> > > > > >> bpf_sk_lookup_udp() which allows BPF programs to find out if there 
> > > > > >> is a
> > > > > >> socket listening on this host, and returns a socket pointer which 
> > > > > >> the
> > > > > >> BPF program can then access to determine, for instance, whether to
> > > > > >> forward or drop traffic. bpf_sk_lookup_xxx() may take a reference 
> > > > > >> on the
> > > > > >> socket, so when a BPF program makes use of this function, it must
> > > > > >> subsequently pass the returned pointer into the newly added 
> > > > > >> sk_release()
> > > > > >> to return the reference.
> > > > > >>
> > > > > >> By way of example, the following pseudocode would filter inbound
> > > > > >> connections at XDP if there is no corresponding service listening 
> > > > > >> for
> > > > > >> the traffic:
> > > > > >>
> > > > > >>   struct bpf_sock_tuple tuple;
> > > > > >>   struct bpf_sock_ops *sk;
> > > > > >>
> > > > > > >>   populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
> > > > > > >>   sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
> > > > > > ...
> > > > > >> +struct bpf_sock_tuple {
> > > > > >> + union {
> > > > > >> + __be32 ipv6[4];
> > > > > >> + __be32 ipv4;
> > > > > >> + } saddr;
> > > > > >> + union {
> > > > > >> + __be32 ipv6[4];
> > > > > >> + __be32 ipv4;
> > > > > >> + } daddr;
> > > > > >> + __be16 sport;
> > > > > >> + __be16 dport;
> > > > > >> + __u8 family;
> > > > > >> +};
> > > > > >
> > > > > > since we can pass ptr_to_packet into map lookup and other helpers 
> > > > > > now,
> > > > > > can you move 'family' out of bpf_sock_tuple and combine with 
> > > > > > netns_id arg?
> > > > > > then progs wouldn't need to copy bytes from the packet into tuple
> > > > > > to do a lookup.
> > > >
> > > > If I follow, you're proposing that users should be able to pass a
> > > > pointer to the source address field of the L3 header, and assuming
> > > > that the L3 header ends with saddr+daddr (no options/extheaders), and
> > > > is immediately followed by the sport/dport then a packet pointer
> > > > should work for performing socket lookup. Then it is up to the BPF
> > > > program writer to ensure that this is the case, or otherwise fall back
> > > > to populating a copy of the sock tuple on the stack.
> > >
> > > yep.
> > >
> > > > > have been thinking more about it.
> > > > > since only ipv4 and ipv6 supported may be use size of bpf_sock_tuple
> > > > > to infer family inside the helper, so it doesn't need to be passed 
> > > > > explicitly?
> > > >
> > > > Let me make sure I understand the proposal here.
> > > >
> > > > The current structure and function prototypes are:
> > > >
> > > > struct bpf_sock_tuple {
> > > >   union {
> > > >   __be32 ipv6[4];
> > > >   __be32 ipv4;
> > > >   } saddr;
> > > >   union {
> > > >   __be32 ipv6[4];
> > > >   __be32 ipv4;
> > > >   } daddr;
> > > >   __be16 sport;
> > > >   __be16 dport;
> > > >   __u8 family;
> > > > };
> > > ...
> > > > You're proposing something like:
> > > >
> > > > struct bpf_sock_tuple4 {
> > > >   __be32 saddr;
> > > >   __be32 daddr;
> > > >   __be16 sport;
> > > >   __be16 dport;
> > > >   __u8 family;
> > > > };
> > > >
> > > > struct bpf_sock_tuple6 {
> > > >   __be32 saddr[4];
> > > >   __be32 daddr[4];
> > > >   __be16 sport;
> > > >   __be16 dport;
> > > >   __u8 family;
> > > > };
> > >
> > > I think the split is unnecessary.
> > > I'm proposing:
> > > struct bpf_sock_tuple {
> > >   union {
> > >   __be32 ipv6[4];
> > >   __be32 ipv4;
> > >   } saddr;
> > >   union {
> > >   __be32 ipv6[4];
> > >   __be32 ipv4;
> > >   } daddr;
> > >   __be16 sport;
> > >   __be16 dport;
> > > };
> > >
> > > that points directly into the packet (when ipv4 options are not there)
> > > and bpf_sk_lookup_tcp() uses 'size' argument to figure out ipv4/ipv6 
> > > family.
> >
> > Needs to be subtly different, the 'sport'/'dport' offset would be
> > wrong in the IPv4 case otherwise:
>
> ahh. right.
>
> >
> > We could take my definitions above and do the following if we want to
> > try to type the helper definition:
> >
> > union bpf_sock_tuple {
> >struct bpf_sock_tuple4 t4;
> > 

Re: [PATCH bpf-next 07/11] bpf: Add helper to retrieve socket in BPF

2018-09-13 Thread Alexei Starovoitov
On Thu, Sep 13, 2018 at 02:17:17PM -0700, Joe Stringer wrote:
> On Thu, 13 Sep 2018 at 14:02, Alexei Starovoitov
>  wrote:
> >
> > On Thu, Sep 13, 2018 at 01:55:01PM -0700, Joe Stringer wrote:
> > > On Thu, 13 Sep 2018 at 12:06, Alexei Starovoitov
> > >  wrote:
> > > >
> > > > On Wed, Sep 12, 2018 at 5:06 PM, Alexei Starovoitov
> > > >  wrote:
> > > > > On Tue, Sep 11, 2018 at 05:36:36PM -0700, Joe Stringer wrote:
> > > > >> This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
> > > > >> bpf_sk_lookup_udp() which allows BPF programs to find out if there 
> > > > >> is a
> > > > >> socket listening on this host, and returns a socket pointer which the
> > > > >> BPF program can then access to determine, for instance, whether to
> > > > >> forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on 
> > > > >> the
> > > > >> socket, so when a BPF program makes use of this function, it must
> > > > >> subsequently pass the returned pointer into the newly added 
> > > > >> sk_release()
> > > > >> to return the reference.
> > > > >>
> > > > >> By way of example, the following pseudocode would filter inbound
> > > > >> connections at XDP if there is no corresponding service listening for
> > > > >> the traffic:
> > > > >>
> > > > >>   struct bpf_sock_tuple tuple;
> > > > >>   struct bpf_sock_ops *sk;
> > > > >>
> > > > > >>   populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
> > > > > >>   sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
> > > > > ...
> > > > >> +struct bpf_sock_tuple {
> > > > >> + union {
> > > > >> + __be32 ipv6[4];
> > > > >> + __be32 ipv4;
> > > > >> + } saddr;
> > > > >> + union {
> > > > >> + __be32 ipv6[4];
> > > > >> + __be32 ipv4;
> > > > >> + } daddr;
> > > > >> + __be16 sport;
> > > > >> + __be16 dport;
> > > > >> + __u8 family;
> > > > >> +};
> > > > >
> > > > > since we can pass ptr_to_packet into map lookup and other helpers now,
> > > > > can you move 'family' out of bpf_sock_tuple and combine with netns_id 
> > > > > arg?
> > > > > then progs wouldn't need to copy bytes from the packet into tuple
> > > > > to do a lookup.
> > >
> > > If I follow, you're proposing that users should be able to pass a
> > > pointer to the source address field of the L3 header, and assuming
> > > that the L3 header ends with saddr+daddr (no options/extheaders), and
> > > is immediately followed by the sport/dport then a packet pointer
> > > should work for performing socket lookup. Then it is up to the BPF
> > > program writer to ensure that this is the case, or otherwise fall back
> > > to populating a copy of the sock tuple on the stack.
> >
> > yep.
> >
> > > > have been thinking more about it.
> > > > since only ipv4 and ipv6 supported may be use size of bpf_sock_tuple
> > > > to infer family inside the helper, so it doesn't need to be passed 
> > > > explicitly?
> > >
> > > Let me make sure I understand the proposal here.
> > >
> > > The current structure and function prototypes are:
> > >
> > > struct bpf_sock_tuple {
> > >   union {
> > >   __be32 ipv6[4];
> > >   __be32 ipv4;
> > >   } saddr;
> > >   union {
> > >   __be32 ipv6[4];
> > >   __be32 ipv4;
> > >   } daddr;
> > >   __be16 sport;
> > >   __be16 dport;
> > >   __u8 family;
> > > };
> > ...
> > > You're proposing something like:
> > >
> > > struct bpf_sock_tuple4 {
> > >   __be32 saddr;
> > >   __be32 daddr;
> > >   __be16 sport;
> > >   __be16 dport;
> > >   __u8 family;
> > > };
> > >
> > > struct bpf_sock_tuple6 {
> > >   __be32 saddr[4];
> > >   __be32 daddr[4];
> > >   __be16 sport;
> > >   __be16 dport;
> > >   __u8 family;
> > > };
> >
> > I think the split is unnecessary.
> > I'm proposing:
> > struct bpf_sock_tuple {
> >   union {
> >   __be32 ipv6[4];
> >   __be32 ipv4;
> >   } saddr;
> >   union {
> >   __be32 ipv6[4];
> >   __be32 ipv4;
> >   } daddr;
> >   __be16 sport;
> >   __be16 dport;
> > };
> >
> > that points directly into the packet (when ipv4 options are not there)
> > and bpf_sk_lookup_tcp() uses 'size' argument to figure out ipv4/ipv6 family.
> 
> Needs to be subtly different, the 'sport'/'dport' offset would be
> wrong in the IPv4 case otherwise:

ahh. right.

> 
> We could take my definitions above and do the following if we want to
> try to type the helper definition:
> 
> union bpf_sock_tuple {
>struct bpf_sock_tuple4 t4;
>struct bpf_sock_tuple6 t6;
> };

yes. sounds great to me. Much better than 'void *' in the helper.
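
For readers following the layout argument, here is a minimal userspace sketch of the two options discussed above. It is an illustration only, not the code from the patch series: the `be32`/`be16` typedefs stand in for the kernel's `__be32`/`__be16`, the struct names are shortened, the per-family structs omit `family` on the assumption that the size argument conveys it, and `tuple_family_from_size()` is a hypothetical helper name. The static assertions show why the single struct cannot point directly into an IPv4 packet: its ports sit at offset 32, whereas in an IPv4 header without options they follow 8 bytes after the source address.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof, size_t */
#include <stdint.h>

/* Stand-ins for the kernel's __be32/__be16 so this builds in userspace. */
typedef uint32_t be32;
typedef uint16_t be16;

/* Single-struct layout: ports always follow two 16-byte unions. */
struct sock_tuple_single {
	union { be32 ipv6[4]; be32 ipv4; } saddr;
	union { be32 ipv6[4]; be32 ipv4; } daddr;
	be16 sport;
	be16 dport;
};

/* Per-family layouts (assumption: 'family' dropped, size implies it). */
struct sock_tuple4 {
	be32 saddr;
	be32 daddr;
	be16 sport;
	be16 dport;
};

struct sock_tuple6 {
	be32 saddr[4];
	be32 daddr[4];
	be16 sport;
	be16 dport;
};

/* Typed alternative to passing 'void *' into the helper. */
union sock_tuple {
	struct sock_tuple4 t4;
	struct sock_tuple6 t6;
};

/* The single struct only matches the wire layout for IPv6. */
static_assert(offsetof(struct sock_tuple_single, sport) == 32, "single: sport at 32");
static_assert(offsetof(struct sock_tuple4, sport) == 8, "ipv4: sport at 8");
static_assert(offsetof(struct sock_tuple6, sport) == 32, "ipv6: sport at 32");

enum tuple_family { TUPLE_UNKNOWN = 0, TUPLE_IPV4 = 4, TUPLE_IPV6 = 6 };

/* Hypothetical helper: infer the address family from the caller's size. */
static enum tuple_family tuple_family_from_size(size_t size)
{
	if (size == sizeof(struct sock_tuple4))
		return TUPLE_IPV4;
	if (size == sizeof(struct sock_tuple6))
		return TUPLE_IPV6;
	return TUPLE_UNKNOWN;
}
```

With this shape a caller would pass `sizeof(tuple.t4)` or `sizeof(tuple.t6)` as the size argument, and any other size could be rejected.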



Re: [PATCH bpf-next 07/11] bpf: Add helper to retrieve socket in BPF

2018-09-13 Thread Joe Stringer
On Thu, 13 Sep 2018 at 14:02, Alexei Starovoitov
 wrote:
>
> On Thu, Sep 13, 2018 at 01:55:01PM -0700, Joe Stringer wrote:
> > On Thu, 13 Sep 2018 at 12:06, Alexei Starovoitov
> >  wrote:
> > >
> > > On Wed, Sep 12, 2018 at 5:06 PM, Alexei Starovoitov
> > >  wrote:
> > > > On Tue, Sep 11, 2018 at 05:36:36PM -0700, Joe Stringer wrote:
> > > >> This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
> > > >> bpf_sk_lookup_udp() which allows BPF programs to find out if there is a
> > > >> socket listening on this host, and returns a socket pointer which the
> > > >> BPF program can then access to determine, for instance, whether to
> > > >> forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on 
> > > >> the
> > > >> socket, so when a BPF program makes use of this function, it must
> > > >> subsequently pass the returned pointer into the newly added 
> > > >> sk_release()
> > > >> to return the reference.
> > > >>
> > > >> By way of example, the following pseudocode would filter inbound
> > > >> connections at XDP if there is no corresponding service listening for
> > > >> the traffic:
> > > >>
> > > >>   struct bpf_sock_tuple tuple;
> > > >>   struct bpf_sock_ops *sk;
> > > >>
> > > > >>   populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
> > > > >>   sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
> > > > ...
> > > >> +struct bpf_sock_tuple {
> > > >> + union {
> > > >> + __be32 ipv6[4];
> > > >> + __be32 ipv4;
> > > >> + } saddr;
> > > >> + union {
> > > >> + __be32 ipv6[4];
> > > >> + __be32 ipv4;
> > > >> + } daddr;
> > > >> + __be16 sport;
> > > >> + __be16 dport;
> > > >> + __u8 family;
> > > >> +};
> > > >
> > > > since we can pass ptr_to_packet into map lookup and other helpers now,
> > > > can you move 'family' out of bpf_sock_tuple and combine with netns_id 
> > > > arg?
> > > > then progs wouldn't need to copy bytes from the packet into tuple
> > > > to do a lookup.
> >
> > If I follow, you're proposing that users should be able to pass a
> > pointer to the source address field of the L3 header, and assuming
> > that the L3 header ends with saddr+daddr (no options/extheaders), and
> > is immediately followed by the sport/dport then a packet pointer
> > should work for performing socket lookup. Then it is up to the BPF
> > program writer to ensure that this is the case, or otherwise fall back
> > to populating a copy of the sock tuple on the stack.
>
> yep.
>
> > > have been thinking more about it.
> > > since only ipv4 and ipv6 supported may be use size of bpf_sock_tuple
> > > to infer family inside the helper, so it doesn't need to be passed 
> > > explicitly?
> >
> > Let me make sure I understand the proposal here.
> >
> > The current structure and function prototypes are:
> >
> > struct bpf_sock_tuple {
> >   union {
> >   __be32 ipv6[4];
> >   __be32 ipv4;
> >   } saddr;
> >   union {
> >   __be32 ipv6[4];
> >   __be32 ipv4;
> >   } daddr;
> >   __be16 sport;
> >   __be16 dport;
> >   __u8 family;
> > };
> ...
> > You're proposing something like:
> >
> > struct bpf_sock_tuple4 {
> >   __be32 saddr;
> >   __be32 daddr;
> >   __be16 sport;
> >   __be16 dport;
> >   __u8 family;
> > };
> >
> > struct bpf_sock_tuple6 {
> >   __be32 saddr[4];
> >   __be32 daddr[4];
> >   __be16 sport;
> >   __be16 dport;
> >   __u8 family;
> > };
>
> I think the split is unnecessary.
> I'm proposing:
> struct bpf_sock_tuple {
>   union {
>   __be32 ipv6[4];
>   __be32 ipv4;
>   } saddr;
>   union {
>   __be32 ipv6[4];
>   __be32 ipv4;
>   } daddr;
>   __be16 sport;
>   __be16 dport;
> };
>
> that points directly into the packet (when ipv4 options are not there)
> and bpf_sk_lookup_tcp() uses 'size' argument to figure out ipv4/ipv6 family.

Needs to be subtly different, the 'sport'/'dport' offset would be
wrong in the IPv4 case otherwise:

$ cat foo.c
#include 

struct bpf_sock_tuple {
 union {
 __be32 ipv6[4];
 __be32 ipv4;
 } saddr;
 union {
 __be32 ipv6[4];
 __be32 ipv4;
 } daddr;
 __be16 sport;
 __be16 dport;
};

int main(int argc, char *argv[]) {
   struct bpf_sock_tuple tuple;

   return 0;
}
$ gcc -g ./foo.c -o foo.o
$ pahole foo.o
struct bpf_sock_tuple {
   union {
   __be32 ipv6[4];  /*  16 */
   __be32 ipv4; /*   4 */
   } saddr; /*  0 16 */
   union {
   __be32 ipv6[4];  /*  16 */
   __be32 ipv4; /*   4 */
   } daddr; /* 16 16 */
   __be16 

Re: [PATCH bpf-next 07/11] bpf: Add helper to retrieve socket in BPF

2018-09-13 Thread Alexei Starovoitov
On Thu, Sep 13, 2018 at 01:55:01PM -0700, Joe Stringer wrote:
> On Thu, 13 Sep 2018 at 12:06, Alexei Starovoitov
>  wrote:
> >
> > On Wed, Sep 12, 2018 at 5:06 PM, Alexei Starovoitov
> >  wrote:
> > > On Tue, Sep 11, 2018 at 05:36:36PM -0700, Joe Stringer wrote:
> > >> This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
> > >> bpf_sk_lookup_udp() which allows BPF programs to find out if there is a
> > >> socket listening on this host, and returns a socket pointer which the
> > >> BPF program can then access to determine, for instance, whether to
> > >> forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the
> > >> socket, so when a BPF program makes use of this function, it must
> > >> subsequently pass the returned pointer into the newly added sk_release()
> > >> to return the reference.
> > >>
> > >> By way of example, the following pseudocode would filter inbound
> > >> connections at XDP if there is no corresponding service listening for
> > >> the traffic:
> > >>
> > >>   struct bpf_sock_tuple tuple;
> > >>   struct bpf_sock_ops *sk;
> > >>
> > > >>   populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
> > > >>   sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
> > > ...
> > >> +struct bpf_sock_tuple {
> > >> + union {
> > >> + __be32 ipv6[4];
> > >> + __be32 ipv4;
> > >> + } saddr;
> > >> + union {
> > >> + __be32 ipv6[4];
> > >> + __be32 ipv4;
> > >> + } daddr;
> > >> + __be16 sport;
> > >> + __be16 dport;
> > >> + __u8 family;
> > >> +};
> > >
> > > since we can pass ptr_to_packet into map lookup and other helpers now,
> > > can you move 'family' out of bpf_sock_tuple and combine with netns_id arg?
> > > then progs wouldn't need to copy bytes from the packet into tuple
> > > to do a lookup.
> 
> If I follow, you're proposing that users should be able to pass a
> pointer to the source address field of the L3 header, and assuming
> that the L3 header ends with saddr+daddr (no options/extheaders), and
> is immediately followed by the sport/dport then a packet pointer
> should work for performing socket lookup. Then it is up to the BPF
> program writer to ensure that this is the case, or otherwise fall back
> to populating a copy of the sock tuple on the stack.

yep.

> > have been thinking more about it.
> > since only ipv4 and ipv6 supported maybe use size of bpf_sock_tuple
> > to infer family inside the helper, so it doesn't need to be passed
> > explicitly?
> 
> Let me make sure I understand the proposal here.
> 
> The current structure and function prototypes are:
> 
> struct bpf_sock_tuple {
>   union {
>   __be32 ipv6[4];
>   __be32 ipv4;
>   } saddr;
>   union {
>   __be32 ipv6[4];
>   __be32 ipv4;
>   } daddr;
>   __be16 sport;
>   __be16 dport;
>   __u8 family;
> };
...
> You're proposing something like:
> 
> struct bpf_sock_tuple4 {
>   __be32 saddr;
>   __be32 daddr;
>   __be16 sport;
>   __be16 dport;
>   __u8 family;
> };
> 
> struct bpf_sock_tuple6 {
>   __be32 saddr[4];
>   __be32 daddr[4];
>   __be16 sport;
>   __be16 dport;
>   __u8 family;
> };

I think the split is unnecessary.
I'm proposing:
struct bpf_sock_tuple {
  union {
  __be32 ipv6[4];
  __be32 ipv4;
  } saddr;
  union {
  __be32 ipv6[4];
  __be32 ipv4;
  } daddr;
  __be16 sport;
  __be16 dport;
};

that points directly into the packet (when ipv4 options are not there)
and bpf_sk_lookup_tcp() uses 'size' argument to figure out ipv4/ipv6 family.



Re: [bpf-next, v3 1/5] flow_dissector: implements flow dissector BPF hook

2018-09-13 Thread Willem de Bruijn
On Thu, Sep 13, 2018 at 4:57 PM Alexei Starovoitov
 wrote:
>
> On Thu, Sep 13, 2018 at 04:51:49PM -0400, Willem de Bruijn wrote:
> > On Thu, Sep 13, 2018 at 3:45 PM Alexei Starovoitov
> >  wrote:
> > >
> > > On Thu, Sep 13, 2018 at 10:45:53AM -0700, Petar Penkov wrote:
> > > > From: Petar Penkov 
> > > >
> > > > Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> > > > attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> > > > path. The BPF program is per-network namespace.
> > > >
> > > > Signed-off-by: Petar Penkov 
> > > > Signed-off-by: Willem de Bruijn 
> > > ...
> > > > @@ -2333,6 +2335,7 @@ struct __sk_buff {
> > > >   /* ... here. */
> > > >
> > > >   __u32 data_meta;
> > > > + struct bpf_flow_keys *flow_keys;
> > >
> > > > the bpf prog from patch 4 looks much better now. Thanks!
> > >
> > > >  };
> > > >
> > > >  struct bpf_tunnel_key {
> > > > @@ -2778,4 +2781,27 @@ enum bpf_task_fd_type {
> > > >   BPF_FD_TYPE_URETPROBE,  /* filename + offset */
> > > >  };
> > > >
> > > > +struct bpf_flow_keys {
> > > > + __u16   nhoff;
> > > > + __u16   thoff;
> > > > > + __u16   addr_proto; /* ETH_P_* of valid addrs */
> > > > > + __u8    is_frag;
> > > > > + __u8    is_first_frag;
> > > > > + __u8    is_encap;
> > > > > + __be16  n_proto;
> > > > > + __u8    ip_proto;
> > > > > + union {
> > > > > + struct {
> > > > > + __be32  ipv4_src;
> > > > > + __be32  ipv4_dst;
> > > > > + };
> > > > > + struct {
> > > > > + __u32   ipv6_src[4]; /* in6_addr; network order */
> > > > > + __u32   ipv6_dst[4]; /* in6_addr; network order */
> > > > + };
> > > > + };
> > > > + __be16  sport;
> > > > + __be16  dport;
> > > > +};
> > >
> > > can you please pack it?
> > > struct bpf_flow_keys {
> > > > __u16  nhoff;         /*  0  2 */
> > > > __u16  thoff;         /*  2  2 */
> > > > __u16  addr_proto;    /*  4  2 */
> > > > __u8   is_frag;       /*  6  1 */
> > > > __u8   is_first_frag; /*  7  1 */
> > > > __u8   is_encap;      /*  8  1 */
> > > >
> > > > /* XXX 1 byte hole, try to pack */
> > > >
> > > > __be16 n_proto;       /* 10  2 */
> > > > __u8   ip_proto;      /* 12  1 */
> > >
> > > /* XXX 3 bytes hole, try to pack */
> > >
> > > union {
> > >
> > > also is_frag and other fields are not used by the kernel and
> > > only used by the prog to pass data between tail_calls ?
> >
> > No, these are mapped directly onto fields in struct flow_keys
> > on return from the BPF program in __skb_flow_bpf_to_target.
> > For is_frag, for instance:
> >
> > >    if (flow_keys->is_frag)
> > >            key_control->flags |= FLOW_DIS_IS_FRAGMENT;
>
> right. my search-fu failed me. only packing is needed then.
>
> > This is true for all fields in the struct except nhoff.
>
> > > In such case reserve some space in bpf_flow_keys similar to skb->cb
> > > so it can contain any fields and accommodate for inevitable changes
> > > to bpf flow dissector prog in the future.
> >
> > Do you mean a second scratch space akin to cb[], just a few
> > reserved padding bytes?
>
> it looks to me it's possible to rearrange the fields to avoid all holes,
> so no extra padding bytes necessary.

Absolutely. We forgot to run pahole earlier. Updating now.
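
As a rough illustration of the "rearrange the fields to avoid all holes"
suggestion, here is a userspace sketch of one possible ordering. The struct
name flow_keys_sketch and the exact field order are assumptions for
illustration, not the layout from any posted revision, and uintN_t stands in
for the kernel's __u8/__u16/__be16/__be32 types:

```c
#include <assert.h>	/* static_assert */
#include <stddef.h>	/* offsetof */
#include <stdint.h>

/* Hypothetical reordering of the quoted fields; not taken from the patch. */
struct flow_keys_sketch {
	uint16_t nhoff;
	uint16_t thoff;
	uint16_t addr_proto;	/* ETH_P_* of valid addrs */
	uint8_t  is_frag;
	uint8_t  is_first_frag;
	uint8_t  is_encap;
	uint8_t  ip_proto;	/* moved up: fills the 1-byte hole */
	uint16_t n_proto;
	uint16_t sport;		/* moved ahead of the union */
	uint16_t dport;
	union {
		struct {
			uint32_t ipv4_src;
			uint32_t ipv4_dst;
		};
		struct {
			uint32_t ipv6_src[4];	/* network order */
			uint32_t ipv6_dst[4];	/* network order */
		};
	};
};

/* The scalar fields sum to 16 bytes, so the 4-byte aligned union starts
 * at offset 16 without padding, and the 32-byte union ends at 48.
 */
static_assert(offsetof(struct flow_keys_sketch, ip_proto) == 9, "no hole before ip_proto");
static_assert(offsetof(struct flow_keys_sketch, n_proto) == 10, "n_proto is 2-byte aligned");
static_assert(offsetof(struct flow_keys_sketch, ipv4_src) == 16, "no hole before the union");
static_assert(sizeof(struct flow_keys_sketch) == 48, "no tail padding");
```

Running pahole on an object built with -g from this sketch should report no
holes; the static_asserts encode the same check at compile time.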

> > We have given some thought to forward compatibility. The existing
> > fields cannot be changed, but it should be fine if we need to expand
> > the struct later.
>
> let's keep cb-like idea for later. It seems to me we can add it to
> the end of bpf_flow_keys any time later.


Re: [PATCH bpf-next 07/11] bpf: Add helper to retrieve socket in BPF

2018-09-13 Thread Joe Stringer
On Thu, 13 Sep 2018 at 13:55, Joe Stringer  wrote:
> struct bpf_sock_tuple4 {
>   __be32 saddr;
>   __be32 daddr;
>   __be16 sport;
>   __be16 dport;
>   __u8 family;
> };
>
> struct bpf_sock_tuple6 {
>   __be32 saddr[4];
>   __be32 daddr[4];
>   __be16 sport;
>   __be16 dport;
>   __u8 family;
> };

(ignore the family bit here, I forgot to remove it..)


Re: [bpf-next, v3 1/5] flow_dissector: implements flow dissector BPF hook

2018-09-13 Thread Alexei Starovoitov
On Thu, Sep 13, 2018 at 04:51:49PM -0400, Willem de Bruijn wrote:
> On Thu, Sep 13, 2018 at 3:45 PM Alexei Starovoitov
>  wrote:
> >
> > On Thu, Sep 13, 2018 at 10:45:53AM -0700, Petar Penkov wrote:
> > > From: Petar Penkov 
> > >
> > > Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> > > attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> > > path. The BPF program is per-network namespace.
> > >
> > > Signed-off-by: Petar Penkov 
> > > Signed-off-by: Willem de Bruijn 
> > ...
> > > @@ -2333,6 +2335,7 @@ struct __sk_buff {
> > >   /* ... here. */
> > >
> > >   __u32 data_meta;
> > > + struct bpf_flow_keys *flow_keys;
> >
> > > the bpf prog from patch 4 looks much better now. Thanks!
> >
> > >  };
> > >
> > >  struct bpf_tunnel_key {
> > > @@ -2778,4 +2781,27 @@ enum bpf_task_fd_type {
> > >   BPF_FD_TYPE_URETPROBE,  /* filename + offset */
> > >  };
> > >
> > > +struct bpf_flow_keys {
> > > + __u16   nhoff;
> > > + __u16   thoff;
> > > + __u16   addr_proto; /* ETH_P_* of valid addrs */
> > > > + __u8    is_frag;
> > > > + __u8    is_first_frag;
> > > > + __u8    is_encap;
> > > > + __be16  n_proto;
> > > > + __u8    ip_proto;
> > > > + union {
> > > > + struct {
> > > > + __be32  ipv4_src;
> > > > + __be32  ipv4_dst;
> > > > + };
> > > > + struct {
> > > > + __u32   ipv6_src[4]; /* in6_addr; network order */
> > > > + __u32   ipv6_dst[4]; /* in6_addr; network order */
> > > + };
> > > + };
> > > + __be16  sport;
> > > + __be16  dport;
> > > +};
> >
> > can you please pack it?
> > struct bpf_flow_keys {
> > > __u16  nhoff;         /*  0  2 */
> > > __u16  thoff;         /*  2  2 */
> > > __u16  addr_proto;    /*  4  2 */
> > > __u8   is_frag;       /*  6  1 */
> > > __u8   is_first_frag; /*  7  1 */
> > > __u8   is_encap;      /*  8  1 */
> > >
> > > /* XXX 1 byte hole, try to pack */
> > >
> > > __be16 n_proto;       /* 10  2 */
> > > __u8   ip_proto;      /* 12  1 */
> >
> > /* XXX 3 bytes hole, try to pack */
> >
> > union {
> >
> > also is_frag and other fields are not used by the kernel and
> > only used by the prog to pass data between tail_calls ?
> 
> No, these are mapped directly onto fields in struct flow_keys
> on return from the BPF program in __skb_flow_bpf_to_target.
> For is_frag, for instance:
> 
>    if (flow_keys->is_frag)
>            key_control->flags |= FLOW_DIS_IS_FRAGMENT;

right. my search-fu failed me. only packing is needed then.

> This is true for all fields in the struct except nhoff.

> > In such case reserve some space in bpf_flow_keys similar to skb->cb
> > so it can contain any fields and accommodate for inevitable changes
> > to bpf flow dissector prog in the future.
> 
> Do you mean a second scratch space akin to cb[], just a few
> reserved padding bytes?

it looks to me it's possible to rearrange the fields to avoid all holes,
so no extra padding bytes necessary.

> We have given some thought to forward compatibility. The existing
> fields cannot be changed, but it should be fine if we need to expand
> the struct later.

let's keep cb-like idea for later. It seems to me we can add it to
the end of bpf_flow_keys any time later.



Re: [PATCH bpf-next 11/11] Documentation: Describe bpf reference tracking

2018-09-13 Thread Joe Stringer
On Wed, 12 Sep 2018 at 17:13, Alexei Starovoitov
 wrote:
>
> On Tue, Sep 11, 2018 at 05:36:40PM -0700, Joe Stringer wrote:
> > Signed-off-by: Joe Stringer 
>
> just few words in commit log would be better than nothing.
>
> Acked-by: Alexei Starovoitov 

Ack, thanks for the review!


Re: [PATCH bpf-next 10/11] selftests/bpf: Add C tests for reference tracking

2018-09-13 Thread Joe Stringer
On Wed, 12 Sep 2018 at 17:11, Alexei Starovoitov
 wrote:
>
> On Tue, Sep 11, 2018 at 05:36:39PM -0700, Joe Stringer wrote:
> > Signed-off-by: Joe Stringer 
>
> really nice set of tests.
> please describe them briefly in commit log.
>
> Acked-by: Alexei Starovoitov 

Ack, will do.


Re: [PATCH bpf-next 07/11] bpf: Add helper to retrieve socket in BPF

2018-09-13 Thread Joe Stringer
On Thu, 13 Sep 2018 at 12:06, Alexei Starovoitov
 wrote:
>
> On Wed, Sep 12, 2018 at 5:06 PM, Alexei Starovoitov
>  wrote:
> > On Tue, Sep 11, 2018 at 05:36:36PM -0700, Joe Stringer wrote:
> >> This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
> >> bpf_sk_lookup_udp() which allows BPF programs to find out if there is a
> >> socket listening on this host, and returns a socket pointer which the
> >> BPF program can then access to determine, for instance, whether to
> >> forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the
> >> socket, so when a BPF program makes use of this function, it must
> >> subsequently pass the returned pointer into the newly added sk_release()
> >> to return the reference.
> >>
> >> By way of example, the following pseudocode would filter inbound
> >> connections at XDP if there is no corresponding service listening for
> >> the traffic:
> >>
> >>   struct bpf_sock_tuple tuple;
> >>   struct bpf_sock_ops *sk;
> >>
> >>   populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
> >>   sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
> > ...
> >> +struct bpf_sock_tuple {
> >> + union {
> >> + __be32 ipv6[4];
> >> + __be32 ipv4;
> >> + } saddr;
> >> + union {
> >> + __be32 ipv6[4];
> >> + __be32 ipv4;
> >> + } daddr;
> >> + __be16 sport;
> >> + __be16 dport;
> >> + __u8 family;
> >> +};
> >
> > since we can pass ptr_to_packet into map lookup and other helpers now,
> > can you move 'family' out of bpf_sock_tuple and combine with netns_id arg?
> > then progs wouldn't need to copy bytes from the packet into tuple
> > to do a lookup.

If I follow, you're proposing that users should be able to pass a
pointer to the source address field of the L3 header, and assuming
that the L3 header ends with saddr+daddr (no options/extheaders), and
is immediately followed by the sport/dport then a packet pointer
should work for performing socket lookup. Then it is up to the BPF
program writer to ensure that this is the case, or otherwise fall back
to populating a copy of the sock tuple on the stack.

> have been thinking more about it.
> since only ipv4 and ipv6 supported maybe use size of bpf_sock_tuple
> to infer family inside the helper, so it doesn't need to be passed explicitly?

Let me make sure I understand the proposal here.

The current structure and function prototypes are:

struct bpf_sock_tuple {
  union {
  __be32 ipv6[4];
  __be32 ipv4;
  } saddr;
  union {
  __be32 ipv6[4];
  __be32 ipv4;
  } daddr;
  __be16 sport;
  __be16 dport;
  __u8 family;
};

static struct bpf_sock *(*bpf_sk_lookup_tcp)(void *ctx,
   struct bpf_sock_tuple *tuple,
   int size, unsigned int netns_id,
   unsigned long long flags);
static struct bpf_sock *(*bpf_sk_lookup_udp)(void *ctx,
   struct bpf_sock_tuple *tuple,
   int size, unsigned int netns_id,
   unsigned long long flags);
static int (*bpf_sk_release)(struct bpf_sock *sk, unsigned long long flags);

You're proposing something like:

struct bpf_sock_tuple4 {
  __be32 saddr;
  __be32 daddr;
  __be16 sport;
  __be16 dport;
  __u8 family;
};

struct bpf_sock_tuple6 {
  __be32 saddr[4];
  __be32 daddr[4];
  __be16 sport;
  __be16 dport;
  __u8 family;
};

static struct bpf_sock *(*bpf_sk_lookup_tcp)(void *ctx,
   void *tuple,
   int size, unsigned int netns_id,
   unsigned long long flags);
static struct bpf_sock *(*bpf_sk_lookup_udp)(void *ctx,
   void *tuple,
   int size, unsigned int netns_id,
   unsigned long long flags);
static int (*bpf_sk_release)(struct bpf_sock *sk, unsigned long long flags);

Then the implementation will check the size against either
"sizeof(struct bpf_sock_tuple4)" or "sizeof(struct bpf_sock_tuple6)"
and interpret as the v4 or v6 handler from this.

Sure, I can try this out.
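
For reference, a minimal userspace sketch of that size check, assuming the
two tuple structs without the family field (per the follow-up note elsewhere
in this thread). The names sock_tuple4_sketch, sock_tuple6_sketch and
tuple_family_from_size() are made up for illustration and are not part of
the posted series:

```c
#include <assert.h>	/* static_assert */
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>	/* AF_INET, AF_INET6 */

/* Hypothetical layouts: addresses immediately followed by ports, matching
 * an IPv4 header without options or an IPv6 header without extension
 * headers followed by the TCP/UDP ports.
 */
struct sock_tuple4_sketch {
	uint32_t saddr;
	uint32_t daddr;
	uint16_t sport;
	uint16_t dport;
};

struct sock_tuple6_sketch {
	uint32_t saddr[4];
	uint32_t daddr[4];
	uint16_t sport;
	uint16_t dport;
};

static_assert(sizeof(struct sock_tuple4_sketch) == 12, "v4 tuple has no padding");
static_assert(sizeof(struct sock_tuple6_sketch) == 36, "v6 tuple has no padding");

/* Infer the address family from the caller-supplied size.
 * Returns AF_INET or AF_INET6, or -EINVAL for any other size.
 */
static int tuple_family_from_size(size_t size)
{
	if (size == sizeof(struct sock_tuple4_sketch))
		return AF_INET;
	if (size == sizeof(struct sock_tuple6_sketch))
		return AF_INET6;
	return -EINVAL;
}
```

Because the two sizes differ and neither struct has tail padding, the size
alone is enough to pick the v4 or v6 handler in this sketch.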


Re: [bpf-next, v3 1/5] flow_dissector: implements flow dissector BPF hook

2018-09-13 Thread Willem de Bruijn
On Thu, Sep 13, 2018 at 3:45 PM Alexei Starovoitov
 wrote:
>
> On Thu, Sep 13, 2018 at 10:45:53AM -0700, Petar Penkov wrote:
> > From: Petar Penkov 
> >
> > Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> > attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> > path. The BPF program is per-network namespace.
> >
> > Signed-off-by: Petar Penkov 
> > Signed-off-by: Willem de Bruijn 
> ...
> > @@ -2333,6 +2335,7 @@ struct __sk_buff {
> >   /* ... here. */
> >
> >   __u32 data_meta;
> > + struct bpf_flow_keys *flow_keys;
>
> the bpf prog from patch 4 looks much better now. Thanks!
>
> >  };
> >
> >  struct bpf_tunnel_key {
> > @@ -2778,4 +2781,27 @@ enum bpf_task_fd_type {
> >   BPF_FD_TYPE_URETPROBE,  /* filename + offset */
> >  };
> >
> > +struct bpf_flow_keys {
> > + __u16   nhoff;
> > + __u16   thoff;
> > + __u16   addr_proto; /* ETH_P_* of valid addrs */
> > + __u8    is_frag;
> > + __u8    is_first_frag;
> > + __u8    is_encap;
> > + __be16  n_proto;
> > + __u8    ip_proto;
> > + union {
> > + struct {
> > + __be32  ipv4_src;
> > + __be32  ipv4_dst;
> > + };
> > + struct {
> > + __u32   ipv6_src[4];/* in6_addr; network order */
> > + __u32   ipv6_dst[4];/* in6_addr; network order */
> > + };
> > + };
> > + __be16  sport;
> > + __be16  dport;
> > +};
>
> can you please pack it?
> struct bpf_flow_keys {
> __u16  nhoff;         /*  0  2 */
> __u16  thoff;         /*  2  2 */
> __u16  addr_proto;    /*  4  2 */
> __u8   is_frag;       /*  6  1 */
> __u8   is_first_frag; /*  7  1 */
> __u8   is_encap;      /*  8  1 */
>
> /* XXX 1 byte hole, try to pack */
>
> __be16 n_proto;       /* 10  2 */
> __u8   ip_proto;      /* 12  1 */
>
> /* XXX 3 bytes hole, try to pack */
>
> union {
>
> also is_frag and other fields are not used by the kernel and
> only used by the prog to pass data between tail_calls ?

No, these are mapped directly onto fields in struct flow_keys
on return from the BPF program in __skb_flow_bpf_to_target.
For is_frag, for instance:

   if (flow_keys->is_frag)
           key_control->flags |= FLOW_DIS_IS_FRAGMENT;

This is true for all fields in the struct except nhoff.

> In such case reserve some space in bpf_flow_keys similar to skb->cb
> so it can contain any fields and accommodate for inevitable changes
> to bpf flow dissector prog in the future.

Do you mean a second scratch space akin to cb[], just a few
reserved padding bytes?

We have given some thought to forward compatibility. The existing
fields cannot be changed, but it should be fine if we need to expand
the struct later.


[PATCH net] net/ipv6: do not copy DST_NOCOUNT flag on rt init

2018-09-13 Thread Peter Oskolkov
DST_NOCOUNT in dst_entry::flags tracks whether the entry counts
toward route cache size (net->ipv6.sysctl.ip6_rt_max_size).

If the flag is NOT set, dst_ops::pcpuc_entries counter is incremented
in dst_init() and decremented in dst_destroy().

This flag is tied to allocation/deallocation of dst_entry and
should not be copied from another dst/route. Otherwise it can happen
that dst_ops::pcpuc_entries counter grows until no new routes can
be allocated because the counter reached ip6_rt_max_size due to
DST_NOCOUNT not set and thus no counter decrements on gc-ed routes.

Fixes: 3b6761d18bc1 ("net/ipv6: Move dst flags to booleans in fib entries")
Cc: David Ahern 
Acked-by: Wei Wang 
Signed-off-by: Peter Oskolkov 
---
 net/ipv6/route.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 3eed045c65a5..a3902f805305 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -946,7 +946,7 @@ static void ip6_rt_init_dst_reject(struct rt6_info *rt, struct fib6_info *ort)
 
 static void ip6_rt_init_dst(struct rt6_info *rt, struct fib6_info *ort)
 {
-   rt->dst.flags |= fib6_info_dst_flags(ort);
+   rt->dst.flags |= fib6_info_dst_flags(ort) & ~DST_NOCOUNT;
 
    if (ort->fib6_flags & RTF_REJECT) {
            ip6_rt_init_dst_reject(rt, ort);
-- 
2.19.0.397.gdd90340f6a-goog
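
The masking in the one-line change above can be sketched in userspace as
follows. This is only an illustration: the SKETCH_* constants use made-up
values and init_dst_flags() is a hypothetical helper, not a kernel function.

```c
#include <stdint.h>

/* Illustrative flag values; assumptions, not copied from the kernel headers. */
#define SKETCH_DST_NOPOLICY	0x0004u
#define SKETCH_DST_NOCOUNT	0x0008u

/* Combine the flags a new dst got at allocation time with flags inherited
 * from the fib entry, but never inherit NOCOUNT: that bit describes how
 * this particular entry was allocated and accounted.
 */
static uint16_t init_dst_flags(uint16_t own_flags, uint16_t inherited)
{
	return (uint16_t)(own_flags | (inherited & ~SKETCH_DST_NOCOUNT));
}
```

With this shape, an entry allocated as counted stays counted even if the
source carries NOCOUNT, so the increment at init and the decrement at
destroy stay paired.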



Re: [bpf-next, v3 1/5] flow_dissector: implements flow dissector BPF hook

2018-09-13 Thread Alexei Starovoitov
On Thu, Sep 13, 2018 at 10:45:53AM -0700, Petar Penkov wrote:
> From: Petar Penkov 
> 
> Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> path. The BPF program is per-network namespace.
> 
> Signed-off-by: Petar Penkov 
> Signed-off-by: Willem de Bruijn 
...
> @@ -2333,6 +2335,7 @@ struct __sk_buff {
>   /* ... here. */
>  
>   __u32 data_meta;
> + struct bpf_flow_keys *flow_keys;

the bpf prog from patch 4 looks much better now. Thanks!

>  };
>  
>  struct bpf_tunnel_key {
> @@ -2778,4 +2781,27 @@ enum bpf_task_fd_type {
>   BPF_FD_TYPE_URETPROBE,  /* filename + offset */
>  };
>  
> +struct bpf_flow_keys {
> + __u16   nhoff;
> + __u16   thoff;
> + __u16   addr_proto; /* ETH_P_* of valid addrs */
> + __u8    is_frag;
> + __u8    is_first_frag;
> + __u8    is_encap;
> + __be16  n_proto;
> + __u8    ip_proto;
> + union {
> + struct {
> + __be32  ipv4_src;
> + __be32  ipv4_dst;
> + };
> + struct {
> + __u32   ipv6_src[4];/* in6_addr; network order */
> + __u32   ipv6_dst[4];/* in6_addr; network order */
> + };
> + };
> + __be16  sport;
> + __be16  dport;
> +};

can you please pack it?
struct bpf_flow_keys {
__u16  nhoff;         /*  0  2 */
__u16  thoff;         /*  2  2 */
__u16  addr_proto;    /*  4  2 */
__u8   is_frag;       /*  6  1 */
__u8   is_first_frag; /*  7  1 */
__u8   is_encap;      /*  8  1 */

/* XXX 1 byte hole, try to pack */

__be16 n_proto;       /* 10  2 */
__u8   ip_proto;      /* 12  1 */

/* XXX 3 bytes hole, try to pack */

union {

also is_frag and other fields are not used by the kernel and
only used by the prog to pass data between tail_calls ?
In such case reserve some space in bpf_flow_keys similar to skb->cb
so it can contain any fields and accommodate for inevitable changes
to bpf flow dissector prog in the future.



[PATCH iproute2] libnetlink: fix leak and using unused memory on error

2018-09-13 Thread Stephen Hemminger
If an error happens in multi-segment message (tc only)
then report the error and stop processing further responses.
This also fixes referring to the buffer after free.

The sequence check is not necessary here because the
response message has already been validated to be in
the window of the sequence number of the iov.

Reported-by: Mahesh Bandewar 
Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
Signed-off-by: Stephen Hemminger 
---
 lib/libnetlink.c | 23 +--
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/lib/libnetlink.c b/lib/libnetlink.c
index 928de1dd16d8..586809292594 100644
--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -617,7 +617,6 @@ static int __rtnl_talk_iov(struct rtnl_handle *rtnl, struct iovec *iov,
msg.msg_iovlen = 1;
i = 0;
while (1) {
-next:
status = rtnl_recvmsg(rtnl->fd, &msg, &buf);
++i;
 
@@ -660,27 +659,23 @@ next:
 
if (l < sizeof(struct nlmsgerr)) {
fprintf(stderr, "ERROR truncated\n");
-   } else if (!err->error) {
+   free(buf);
+   return -1;
+   }
+
+   if (!err->error)
/* check messages from kernel */
nl_dump_ext_ack(h, errfn);
 
-   if (answer)
-   *answer = (struct nlmsghdr *)buf;
-   else
-   free(buf);
-   if (h->nlmsg_seq == seq)
-   return 0;
-   else if (i < iovlen)
-   goto next;
-   return 0;
-   }
-
if (rtnl->proto != NETLINK_SOCK_DIAG &&
show_rtnl_err)
rtnl_talk_error(h, err, errfn);
 
errno = -err->error;
-   free(buf);
+   if (answer)
+   *answer = (struct nlmsghdr *)buf;
+   else
+   free(buf);
return -i;
}
 
-- 
2.18.0



Re: pull request: bluetooth 2018-09-13

2018-09-13 Thread David Miller
From: Johan Hedberg 
Date: Thu, 13 Sep 2018 12:45:51 +0300

> A few Bluetooth fixes for the 4.19-rc series:
> 
>  - Fixed rw_semaphore leak in hci_ldisc
>  - Fixed local Out-of-Band pairing data handling
> 
> Let me know if there are any issues pulling. Thanks.

Pulled, thanks Johan.


Re: [PATCH bpf-next 07/11] bpf: Add helper to retrieve socket in BPF

2018-09-13 Thread Alexei Starovoitov
On Wed, Sep 12, 2018 at 5:06 PM, Alexei Starovoitov
 wrote:
> On Tue, Sep 11, 2018 at 05:36:36PM -0700, Joe Stringer wrote:
>> This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
>> bpf_sk_lookup_udp() which allows BPF programs to find out if there is a
>> socket listening on this host, and returns a socket pointer which the
>> BPF program can then access to determine, for instance, whether to
>> forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the
>> socket, so when a BPF program makes use of this function, it must
>> subsequently pass the returned pointer into the newly added sk_release()
>> to return the reference.
>>
>> By way of example, the following pseudocode would filter inbound
>> connections at XDP if there is no corresponding service listening for
>> the traffic:
>>
>>   struct bpf_sock_tuple tuple;
>>   struct bpf_sock_ops *sk;
>>
>>   populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
>>   sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
> ...
>> +struct bpf_sock_tuple {
>> + union {
>> + __be32 ipv6[4];
>> + __be32 ipv4;
>> + } saddr;
>> + union {
>> + __be32 ipv6[4];
>> + __be32 ipv4;
>> + } daddr;
>> + __be16 sport;
>> + __be16 dport;
>> + __u8 family;
>> +};
>
> since we can pass ptr_to_packet into map lookup and other helpers now,
> can you move 'family' out of bpf_sock_tuple and combine with netns_id arg?
> then progs wouldn't need to copy bytes from the packet into tuple
> to do a lookup.

have been thinking more about it.
since only ipv4 and ipv6 supported maybe use size of bpf_sock_tuple
to infer family inside the helper, so it doesn't need to be passed explicitly?


Re: [PATCH net v2 0/3] tls: don't leave keys in kernel memory

2018-09-13 Thread David Miller
From: Sabrina Dubroca 
Date: Wed, 12 Sep 2018 17:44:40 +0200

> There are a few places where the RX/TX key for a TLS socket is copied
> to kernel memory. This series clears those memory areas when they're no
> longer needed.
> 
> v2: add union tls_crypto_context, following Vakul Garg's comment
> swap patch 2 and 3, using new union in patch 3

Series applied and queued up for -stable.

Thanks.


Re: [PATCH bpf-next 06/11] bpf: Add reference tracking to verifier

2018-09-13 Thread Joe Stringer
On Wed, 12 Sep 2018 at 16:17, Alexei Starovoitov
 wrote:
>
> On Tue, Sep 11, 2018 at 05:36:35PM -0700, Joe Stringer wrote:
> > ...
> > +
> > +/* release function corresponding to acquire_reference_state(). Idempotent. */
> > +static int __release_reference_state(struct bpf_func_state *state, int ptr_id)
> > +{
> > + int i, last_idx;
> > +
> > + if (!ptr_id)
> > + return 0;
>
> Is this defensive programming or this condition can actually happen?
> As far as I can see all callers suppose to pass valid ptr_id into it.
>
> Acked-by: Alexei Starovoitov 
>

Looks like defensive programming to me. That said, if it's being
defensive, why not return `-EFAULT`? I'll try this out locally.
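
To make the question concrete, here is a toy userspace sketch of what the
-EFAULT variant could look like. The ref_state_sketch type, its fixed-size
array and release_ref_sketch() are all hypothetical stand-ins; the real
verifier state is not shown in the quoted hunk, so this only mirrors the
shape of the check:

```c
#include <errno.h>

/* Hypothetical stand-in for the verifier's per-function reference list. */
struct ref_state_sketch {
	int ids[8];	/* acquired reference ids; 0 means unused */
	int count;	/* number of live entries at the front of ids[] */
};

/* Release the reference with the given id.
 * A zero id is treated as a caller bug and reported with -EFAULT.
 * An id that is not present is a no-op, keeping the function idempotent.
 */
static int release_ref_sketch(struct ref_state_sketch *st, int ptr_id)
{
	int i, last;

	if (!ptr_id)
		return -EFAULT;

	last = st->count - 1;
	for (i = 0; i < st->count; i++) {
		if (st->ids[i] != ptr_id)
			continue;
		if (i != last)
			st->ids[i] = st->ids[last];	/* fill the gap */
		st->ids[last] = 0;
		st->count--;
		return 0;
	}
	return 0;
}
```

Whether callers can tolerate an error for a zero id is exactly what would
need checking locally.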


Re: [PATCH bpf-next 04/11] bpf: Add PTR_TO_SOCKET verifier type

2018-09-13 Thread Joe Stringer
On Wed, 12 Sep 2018 at 15:50, Alexei Starovoitov
 wrote:
>
> On Tue, Sep 11, 2018 at 05:36:33PM -0700, Joe Stringer wrote:
> > ...
> > +static bool reg_type_mismatch(enum bpf_reg_type src, enum bpf_reg_type prev)
> > +{
> > + return src != prev && (!reg_type_mismatch_ok(src) ||
> > +                        !reg_type_mismatch_ok(prev));
> > +}
> > +
> >  static int do_check(struct bpf_verifier_env *env)
> >  {
> >   struct bpf_verifier_state *state;
> > @@ -4778,9 +4862,7 @@ static int do_check(struct bpf_verifier_env *env)
> >*/
> >   *prev_src_type = src_reg_type;
> >
> > - } else if (src_reg_type != *prev_src_type &&
> > -(src_reg_type == PTR_TO_CTX ||
> > - *prev_src_type == PTR_TO_CTX)) {
> > + } else if (reg_type_mismatch(src_reg_type, *prev_src_type)) {
> >   /* ABuser program is trying to use the same insn
> >* dst_reg = *(u32*) (src_reg + off)
> >* with different pointer types:
> > @@ -4826,8 +4908,8 @@ static int do_check(struct bpf_verifier_env *env)
> >   if (*prev_dst_type == NOT_INIT) {
> >   *prev_dst_type = dst_reg_type;
> >   } else if (dst_reg_type != *prev_dst_type &&
> > -(dst_reg_type == PTR_TO_CTX ||
> > - *prev_dst_type == PTR_TO_CTX)) {
> > +(!reg_type_mismatch_ok(dst_reg_type) ||
> > + !reg_type_mismatch_ok(*prev_dst_type))) {
>
> reg_type_mismatch() could have been used here as well ?

Missed that before, will fix.

> > >   verbose(env, "same insn cannot be used with different pointers\n");
> >   return -EINVAL;
> >   }
> > @@ -5244,10 +5326,14 @@ static void sanitize_dead_code(struct 
> > bpf_verifier_env *env)
> >   }
> >  }
> >
> > -/* convert load instructions that access fields of 'struct __sk_buff'
> > - * into sequence of instructions that access fields of 'struct sk_buff'
> > +/* convert load instructions that access fields of a context type into a
> > + * sequence of instructions that access fields of the underlying structure:
> > + * struct __sk_buff-> struct sk_buff
> > + * struct bpf_sock_ops -> struct sock
> >   */
> > -static int convert_ctx_accesses(struct bpf_verifier_env *env)
> > +static int convert_ctx_accesses(struct bpf_verifier_env *env,
> > + bpf_convert_ctx_access_t convert_ctx_access,
> > + enum bpf_reg_type ctx_type)
> >  {
> >   const struct bpf_verifier_ops *ops = env->ops;
> >   int i, cnt, size, ctx_field_size, delta = 0;
> > @@ -5274,12 +5360,14 @@ static int convert_ctx_accesses(struct 
> > bpf_verifier_env *env)
> >   }
> >   }
> >
> > - if (!ops->convert_ctx_access || bpf_prog_is_dev_bound(env->prog->aux))
> > + if (!convert_ctx_access || bpf_prog_is_dev_bound(env->prog->aux))
> >   return 0;
> >
> >   insn = env->prog->insnsi + delta;
> >
> >   for (i = 0; i < insn_cnt; i++, insn++) {
> > + enum bpf_reg_type ptr_type;
> > +
> >   if (insn->code == (BPF_LDX | BPF_MEM | BPF_B) ||
> >   insn->code == (BPF_LDX | BPF_MEM | BPF_H) ||
> >   insn->code == (BPF_LDX | BPF_MEM | BPF_W) ||
> > @@ -5321,7 +5409,8 @@ static int convert_ctx_accesses(struct 
> > bpf_verifier_env *env)
> >   continue;
> >   }
> >
> > - if (env->insn_aux_data[i + delta].ptr_type != PTR_TO_CTX)
> > + ptr_type = env->insn_aux_data[i + delta].ptr_type;
> > + if (ptr_type != ctx_type)
> >   continue;
> >
> >   ctx_field_size = env->insn_aux_data[i + delta].ctx_field_size;
> > @@ -5354,8 +5443,8 @@ static int convert_ctx_accesses(struct 
> > bpf_verifier_env *env)
> >   }
> >
> >   target_size = 0;
> > > - cnt = ops->convert_ctx_access(type, insn, insn_buf, env->prog,
> > > -                               &target_size);
> > > + cnt = convert_ctx_access(type, insn, insn_buf, env->prog,
> > > +                          &target_size);
> >   if (cnt == 0 || cnt >= ARRAY_SIZE(insn_buf) ||
> >   (ctx_field_size && !target_size)) {
> >   verbose(env, "bpf verifier is misconfigured\n");
> > @@ -5899,7 +5988,13 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr 
> > *attr)
> >
> >   if (ret == 0)
> >   /* program is valid, convert *(u32*)(ctx + off) accesses */
> > - ret = convert_ctx_accesses(env);
> > + ret = 

Re: [PATCH net] net: rtnl_configure_link: fix dev flags changes arg to __dev_notify_flags

2018-09-13 Thread David Miller
From: Roopa Prabhu 
Date: Wed, 12 Sep 2018 13:21:48 -0700

> From: Roopa Prabhu 
> 
> This fix addresses https://bugzilla.kernel.org/show_bug.cgi?id=201071
> 
> Commit 5025f7f7d506 wrongly relied on __dev_change_flags to notify users of
> dev flag changes in the case when dev->rtnl_link_state = RTNL_LINK_INITIALIZED.
> Fix it by indicating flag changes explicitly to __dev_notify_flags.
> 
> Fixes: 5025f7f7d506 ("rtnetlink: add rtnl_link_state check in rtnl_configure_link")
> Reported-By: Liam mcbirnie 
> Signed-off-by: Roopa Prabhu 

Applied.

> ---
> Dave, if 5025f7f7d506 made it to stable, request you to pls queue this one up
> too. Thanks.

I will, thanks.


Re: [PATCH iproute2] iproute2: fix use-after-free

2018-09-13 Thread महेश बंडेवार
On Thu, Sep 13, 2018 at 8:19 AM, Stephen Hemminger
 wrote:
>
> On Wed, 12 Sep 2018 23:07:20 -0700
> Mahesh Bandewar (महेश बंडेवार)  wrote:
>
> > On Wed, Sep 12, 2018 at 5:33 PM, Stephen Hemminger
> >  wrote:
> > >
> > > On Wed, 12 Sep 2018 16:29:28 -0700
> > > Mahesh Bandewar  wrote:
> > >
> > > > From: Mahesh Bandewar 
> > > >
> > > > A local program using iproute2 lib pointed out the issue and looking
> > > > at the code it is pretty obvious -
> > > >
> > > > a = (struct nlmsghdr *)b;
> > > > ...
> > > > free(b);
> > > > if (a->nlmsg_seq == seq)
> > > > ...
> > > >
> > > > Fixes: 86bf43c7c2fd ("lib/libnetlink: update rtnl_talk to support 
> > > > malloc buff at run time")
> > > > Signed-off-by: Mahesh Bandewar 
> > >
> > > Yes, this is a real problem.
> > >
> > > Maybe a minimal patch like this would be enough:
> > actually that will leave the memory leak at the 'goto next' line (just
> > few lines below) since that jump will overwrite the buf.
>
> It looks like a new buffer is allocated every time through the while loop.
> So yes, it looks like the old code, my patch, and your patch would all leak
> if there was an error followed by other data in the response.
> Though I doubt kernel would ever do that.
>
I started fixing the issue that I reported and then found the memory
leak, so my first attempt at a simple fix ended up addressing the
use-after-free as well as the memory leak (in my patch). I'm not going
to claim that I know how and where this gets used; my attempt was
simply to fix those two issues. I don't mind which fix you apply as long
as these issues get addressed.

> The only user of iov style messages to the kernel is in tc batching.
> My gut feeling is that if one message in batch has error, then
> the netlink code should return that error and stop processing more.
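
Editorial sketch of the pattern discussed in this thread. This is not the
iproute2 code: struct msg_hdr and seq_matches() are hypothetical stand-ins
for struct nlmsghdr and the comparison in rtnl_talk. It only illustrates the
general remedy of copying the needed field out of the buffer before the
buffer is freed.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct nlmsghdr; only the fields used here. */
struct msg_hdr {
	uint32_t len;
	uint32_t seq;
};

/*
 * Returns 1 if the message copied into a freshly allocated buffer carries
 * the expected sequence number, 0 if it does not, -1 if allocation fails.
 * The field is read before free(), so nothing dereferences the buffer
 * after it has been released.
 */
static int seq_matches(const struct msg_hdr *src, uint32_t expected)
{
	char *buf = malloc(sizeof(*src));
	uint32_t seq;

	if (!buf)
		return -1;
	memcpy(buf, src, sizeof(*src));

	/* Read everything that is needed while buf is still valid ... */
	seq = ((struct msg_hdr *)buf)->seq;
	free(buf);

	/* ... and use only the local copy afterwards. */
	return seq == expected;
}
```

The same idea applies to the leak raised later in the thread: each exit
from the loop body has to either free the buffer or hand it to the caller,
never both and never neither.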


Re: [PATCH net-next v3 1/2] netlink: ipv4 igmp join notifications

2018-09-13 Thread Patrick Ruddy
On Thu, 2018-09-13 at 10:03 -0700, Roopa Prabhu wrote:
> On Thu, Sep 6, 2018 at 8:40 PM, Roopa Prabhu  
> wrote:
> > On Thu, Sep 6, 2018 at 2:10 AM, Patrick Ruddy
> >  wrote:
> > > Some userspace applications need to know about IGMP joins from the
> > > kernel for 2 reasons:
> > > 1. To allow the programming of multicast MAC filters in hardware
> > > 2. To form a multicast FORUS list for non link-local multicast
> > >groups to be sent to the kernel and from there to the interested
> > >party.
> > > (1) can be fulfilled by simply sending the hardware multicast MAC
> > > address to be programmed but (2) requires the L3 address to be sent
> > > since this cannot be constructed from the MAC address whereas the
> > > reverse translation is a standard library function.
> > > 
> > > This commit provides addition and deletion of multicast addresses
> > > using the RTM_NEWMDB and RTM_DELMDB messages with AF_INET. It also
> > > provides the RTM_GETMDB extension to allow multicast join state to
> > > be read from the kernel.
> > > 
> > > Signed-off-by: Patrick Ruddy 
> > > ---
> > > v3 rework to use RTM_***MDB messages as per review comments.
> > 
> > Patrick, this version seems to be using RTM_***MDB msgs with the
> > RTM_*ADDR format.
> > We can't do that...because existing RTM_MDB users will be confused.
> > 
> > My request was to evaluate RTM_***MDB msg format. see
> > nlmsg_populate_mdb_fill for details.
> > 
> > If you can wait a day or two I can share some experimental code that
> > moves high level RTM_*MDB msg handling into net/core/rtnetlink.c
> > similar to RTM_*FDB
> > 
> 
> I was trying to get a default per interface (non bridge) RTM_*MDB
> working, but realized that the dev->mc
> entries are already getting dumped as part of RTM_*FDB msgs instead of
> RTM_*MDB. (see net/core/rtnetlink.c:ndo_dflt_fdb_dump).
> This adds another wrench.
> 
> so, that puts us back to your use of RTM_NEWADDR.
> Instead of using IFA_ADDRESS, you could introduce a new one
> IFA_IGMP_MULTICAST  (since IFA_MULTICAST is already taken).
> 
> 
> To keep existing users of RTM_NEWADDR unaffected, I think you can use
> the IPMR family with RTM_NEWADDR.
> We can introduce new notification group. (We can choose to add a new
> family too, but that seems unnecessary)
> 
> since you only need dumps:
> rtnl_register(RTNL_FAMILY_IPMR, RTM_GETADDR, NULL, igmp_rtm_dumpaddrs, 0);
> 
> For notifications, since we already have many variants for routes, I
> don't see a problem adding similar addr variants
> RTNLGRP_IPV4_MCADDR
> 
> (Others on the list may have more feedback).
Thanks for looking at this Roopa - I'll rehash as suggested.

-pr


[bpf-next, v3 0/5] Introduce eBPF flow dissector

2018-09-13 Thread Petar Penkov
From: Petar Penkov 

This patch series hardens the RX stack by allowing flow dissection in BPF,
as previously discussed [1]. Because of the rigorous checks of the BPF
verifier, this provides significant security guarantees. In particular, the
BPF flow dissector cannot get inside of an infinite loop, as with
CVE-2013-4348, because BPF programs are guaranteed to terminate. It cannot
read outside of packet bounds, because all memory accesses are checked.
Also, with BPF the administrator can decide which protocols to support,
reducing potential attack surface. Rarely encountered protocols can be
excluded from dissection and the program can be updated without kernel
recompile or reboot if a bug is discovered.
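
Editorial sketch of the kind of bounds check the paragraph above refers to.
This is a plain userspace analogue, not verifier or kernel code, and
get_header() is a hypothetical name. In a BPF program the verifier rejects
any packet access that is not guarded by a check of this shape; here the
check is simply written out by hand.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Returns a pointer to hdr_size bytes at offset off inside the packet
 * [data, data_end), or NULL if the header would extend past the end of
 * the packet.
 */
static const uint8_t *get_header(const uint8_t *data, const uint8_t *data_end,
				 uint16_t off, uint16_t hdr_size)
{
	size_t len = (size_t)(data_end - data);

	/* Compare lengths rather than pointers so nothing can wrap around. */
	if ((size_t)off + hdr_size > len)
		return NULL;

	return data + off;
}
```

A caller treats NULL as "stop dissecting", which is how an out-of-bounds
read is turned into a clean failure instead of undefined behaviour.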

Patch 1 adds infrastructure to execute a BPF program in __skb_flow_dissect.
This includes a new BPF program and attach type.

Patch 2 adds the new BPF flow dissector definitions to tools/uapi.

Patch 3 adds support for the new BPF program type to libbpf and bpftool.

Patch 4 adds a flow dissector program in BPF. This parses most protocols in
__skb_flow_dissect in BPF for a subset of flow keys (basic, control, ports,
and address types).

Patch 5 adds a selftest that attaches the BPF program to the flow dissector
and sends traffic with different levels of encapsulation.

Performance Evaluation:
The in-kernel implementation was compared against the demo program from
patch 4 using the test in patch 5 with IPv4/UDP traffic over 10 seconds.
$perf record -a -C 4 taskset -c 4 ./test_flow_dissector -i 4 -f 8 \
-t 10

In-kernel Dissector:
__skb_flow_dissect overhead: 2.12%
Total Packets: 3,272,597 (from output of ./test_flow_dissector)

BPF Dissector:
__skb_flow_dissect overhead: 1.63% 
Total Packets: 3,232,356 (from output of ./test_flow_dissector)

No-op BPF Dissector:
__skb_flow_dissect overhead: 1.52% 
Total Packets: 3,330,635 (from output of ./test_flow_dissector)
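
Editorial note: the packet totals above can be put on a common scale by
expressing each as a relative change against the in-kernel dissector.
Using only the figures quoted above, the BPF dissector handled roughly 1.2%
fewer packets and the no-op BPF dissector roughly 1.8% more. The helper
below is a hypothetical name and only shows the arithmetic; it says nothing
about run-to-run variance, which the figures above do not report.

```c
/* Relative change of value against baseline, in percent. */
static double rel_change_pct(double baseline, double value)
{
	return (value - baseline) / baseline * 100.0;
}
```

For example, rel_change_pct(3272597, 3232356) gives the BPF dissector's
change and rel_change_pct(3272597, 3330635) the no-op dissector's.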

Changes since v2:
1/ Changes to tools/include/uapi pulled into a separate patch 2
2/ Changes to tools/lib and tools/bpftool pulled into a separate patch 3
3/ Changed flow_keys in __sk_buff from __u32 to struct bpf_flow_keys *
4/ Added nhoff field in struct bpf_flow_keys to pass initial offset
5/ Saving all of the modified control block, rather than just the qdisc
6/ Sample BPF program in patch 4 modified to use the changes above

Changes since v1:
1/ LD_ABS instructions now disallowed for the new BPF prog type 
2/ now checks if skb is NULL in __skb_flow_dissect()
3/ fixed incorrect accesses in flow_dissector_is_valid_access()
- writes to the flow_keys field now disallowed
- reads/writes to tc_classid and data_meta now disallowed 
4/ headers pulled with bpf_skb_load_data if direct access fails 

Changes since RFC:
1/ Flow dissector hook changed from global to per-netns
2/ Defined struct bpf_flow_keys to be used in BPF flow dissector
programs instead of exposing the internal flow keys layout. Added a
function to translate from bpf_flow_keys to the internal layout after BPF
dissection is complete. The pointer to this struct is stored in
qdisc_skb_cb rather than inside of the 20 byte control block which
simplifies verification and allows access to all 20 bytes of the cb.
3/ Removed GUE parsing as it relied on a hardcoded port
4/ MPLS parsing now stops at the first label which is consistent
with the in-kernel flow dissector
5/ Refactored to use direct packet access and to write out to
struct bpf_flow_keys
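
Editorial sketch for item 2/ above. The layout mirrors the qdisc_skb_cb
hunk in patch 1/5, but the type names here (cb_sketch, flow_keys_sketch)
and the helper are hypothetical and use userspace types. The point it
illustrates: because the pointer shares storage with the leading fields
through a union, storing it does not touch the 20-byte private area.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct flow_keys_sketch;	/* stands in for struct bpf_flow_keys */

#define CB_PRIV_LEN 20

struct cb_sketch {
	union {
		struct {
			unsigned int pkt_len;
			uint16_t slave_dev_queue_mapping;
			uint16_t tc_classid;
		};
		struct flow_keys_sketch *flow_keys;
	};
	unsigned char data[CB_PRIV_LEN];
};

/* Returns 1 if storing the pointer leaves all of data[] intact. */
static int pointer_store_preserves_data(void)
{
	struct cb_sketch cb;
	size_t i;

	memset(&cb, 0xAB, sizeof(cb));
	cb.flow_keys = NULL;	/* overwrites only the union */

	for (i = 0; i < CB_PRIV_LEN; i++)
		if (cb.data[i] != 0xAB)
			return 0;
	return 1;
}
```

The trade-off is that pkt_len, slave_dev_queue_mapping and tc_classid are
clobbered while the pointer is stored, which is presumably why item 5/ of
the v2 changes saves the whole modified control block.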

[1] http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf

Petar Penkov (5):
  flow_dissector: implements flow dissector BPF hook
  bpf: sync bpf.h uapi with tools/
  bpf: support flow dissector in libbpf and bpftool
  flow_dissector: implements eBPF parser
  selftests/bpf: test bpf flow dissection

 include/linux/bpf.h   |   1 +
 include/linux/bpf_types.h |   1 +
 include/linux/skbuff.h|   7 +
 include/net/net_namespace.h   |   3 +
 include/net/sch_generic.h |  12 +-
 include/uapi/linux/bpf.h  |  26 +
 kernel/bpf/syscall.c  |   8 +
 kernel/bpf/verifier.c |  32 +
 net/core/filter.c |  70 ++
 net/core/flow_dissector.c | 134 +++
 tools/bpf/bpftool/prog.c  |   1 +
 tools/include/uapi/linux/bpf.h|  26 +
 tools/lib/bpf/libbpf.c|   2 +
 tools/testing/selftests/bpf/.gitignore|   2 +
 tools/testing/selftests/bpf/Makefile  |   8 +-
 tools/testing/selftests/bpf/bpf_flow.c| 373 +
 tools/testing/selftests/bpf/config|   1 +
 .../selftests/bpf/flow_dissector_load.c   | 140 
 .../selftests/bpf/test_flow_dissector.c   | 782 ++
 .../selftests/bpf/test_flow_dissector.sh  | 115 +++
 

[bpf-next, v3 2/5] bpf: sync bpf.h uapi with tools/

2018-09-13 Thread Petar Penkov
From: Petar Penkov 

This patch syncs tools/include/uapi/linux/bpf.h with the flow dissector
definitions from include/uapi/linux/bpf.h

Signed-off-by: Petar Penkov 
Signed-off-by: Willem de Bruijn 
---
 tools/include/uapi/linux/bpf.h | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 66917a4eba27..d1baf20cd329 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -152,6 +152,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_LWT_SEG6LOCAL,
BPF_PROG_TYPE_LIRC_MODE2,
BPF_PROG_TYPE_SK_REUSEPORT,
+   BPF_PROG_TYPE_FLOW_DISSECTOR,
 };
 
 enum bpf_attach_type {
@@ -172,6 +173,7 @@ enum bpf_attach_type {
BPF_CGROUP_UDP4_SENDMSG,
BPF_CGROUP_UDP6_SENDMSG,
BPF_LIRC_MODE2,
+   BPF_FLOW_DISSECTOR,
__MAX_BPF_ATTACH_TYPE
 };
 
@@ -2333,6 +2335,7 @@ struct __sk_buff {
/* ... here. */
 
__u32 data_meta;
+   struct bpf_flow_keys *flow_keys;
 };
 
 struct bpf_tunnel_key {
@@ -2778,4 +2781,27 @@ enum bpf_task_fd_type {
BPF_FD_TYPE_URETPROBE,  /* filename + offset */
 };
 
+struct bpf_flow_keys {
+   __u16   nhoff;
+   __u16   thoff;
+   __u16   addr_proto; /* ETH_P_* of valid addrs */
+   __u8is_frag;
+   __u8is_first_frag;
+   __u8is_encap;
+   __be16  n_proto;
+   __u8ip_proto;
+   union {
+   struct {
+   __be32  ipv4_src;
+   __be32  ipv4_dst;
+   };
+   struct {
+   __u32   ipv6_src[4];/* in6_addr; network order */
+   __u32   ipv6_dst[4];/* in6_addr; network order */
+   };
+   };
+   __be16  sport;
+   __be16  dport;
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
-- 
2.19.0.397.gdd90340f6a-goog



[bpf-next, v3 3/5] bpf: support flow dissector in libbpf and bpftool

2018-09-13 Thread Petar Penkov
From: Petar Penkov 

This patch extends libbpf and bpftool to work with programs of type
BPF_PROG_TYPE_FLOW_DISSECTOR.

Signed-off-by: Petar Penkov 
Signed-off-by: Willem de Bruijn 
---
 tools/bpf/bpftool/prog.c | 1 +
 tools/lib/bpf/libbpf.c   | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index dce960d22106..b1cd3bc8db70 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -74,6 +74,7 @@ static const char * const prog_type_name[] = {
[BPF_PROG_TYPE_RAW_TRACEPOINT]  = "raw_tracepoint",
[BPF_PROG_TYPE_CGROUP_SOCK_ADDR] = "cgroup_sock_addr",
[BPF_PROG_TYPE_LIRC_MODE2]  = "lirc_mode2",
+   [BPF_PROG_TYPE_FLOW_DISSECTOR]  = "flow_dissector",
 };
 
 static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 8476da7f2720..9ca8e0e624d8 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1502,6 +1502,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type 
type)
case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
case BPF_PROG_TYPE_LIRC_MODE2:
case BPF_PROG_TYPE_SK_REUSEPORT:
+   case BPF_PROG_TYPE_FLOW_DISSECTOR:
return false;
case BPF_PROG_TYPE_UNSPEC:
case BPF_PROG_TYPE_KPROBE:
@@ -2121,6 +2122,7 @@ static const struct {
BPF_PROG_SEC("sk_skb",  BPF_PROG_TYPE_SK_SKB),
BPF_PROG_SEC("sk_msg",  BPF_PROG_TYPE_SK_MSG),
BPF_PROG_SEC("lirc_mode2",  BPF_PROG_TYPE_LIRC_MODE2),
+   BPF_PROG_SEC("flow_dissector",  BPF_PROG_TYPE_FLOW_DISSECTOR),
BPF_SA_PROG_SEC("cgroup/bind4", BPF_CGROUP_INET4_BIND),
BPF_SA_PROG_SEC("cgroup/bind6", BPF_CGROUP_INET6_BIND),
BPF_SA_PROG_SEC("cgroup/connect4", BPF_CGROUP_INET4_CONNECT),
-- 
2.19.0.397.gdd90340f6a-goog



[bpf-next, v3 5/5] selftests/bpf: test bpf flow dissection

2018-09-13 Thread Petar Penkov
From: Petar Penkov 

Adds a test that sends different types of packets over multiple
tunnels and verifies that valid packets are dissected correctly.  To do
so, a tc-flower rule is added to drop packets on UDP src port 9, and
packets are sent from ports 8, 9, and 10. Only the packets on port 9
should be dropped. Because tc-flower relies on the flow dissector to
match flows, correct classification demonstrates correct dissection.

Also add support logic to load the BPF program and to inject the test
packets.

Signed-off-by: Petar Penkov 
Signed-off-by: Willem de Bruijn 
---
 tools/testing/selftests/bpf/.gitignore|   2 +
 tools/testing/selftests/bpf/Makefile  |   6 +-
 tools/testing/selftests/bpf/config|   1 +
 .../selftests/bpf/flow_dissector_load.c   | 140 
 .../selftests/bpf/test_flow_dissector.c   | 782 ++
 .../selftests/bpf/test_flow_dissector.sh  | 115 +++
 tools/testing/selftests/bpf/with_addr.sh  |  54 ++
 tools/testing/selftests/bpf/with_tunnels.sh   |  36 +
 8 files changed, 1134 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/flow_dissector_load.c
 create mode 100644 tools/testing/selftests/bpf/test_flow_dissector.c
 create mode 100755 tools/testing/selftests/bpf/test_flow_dissector.sh
 create mode 100755 tools/testing/selftests/bpf/with_addr.sh
 create mode 100755 tools/testing/selftests/bpf/with_tunnels.sh

diff --git a/tools/testing/selftests/bpf/.gitignore 
b/tools/testing/selftests/bpf/.gitignore
index 4d789c1e5167..8a60c9b9892d 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -23,3 +23,5 @@ test_skb_cgroup_id_user
 test_socket_cookie
 test_cgroup_storage
 test_select_reuseport
+test_flow_dissector
+flow_dissector_load
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index e65f50f9185e..fd3851d5c079 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -47,10 +47,12 @@ TEST_PROGS := test_kmod.sh \
test_tunnel.sh \
test_lwt_seg6local.sh \
test_lirc_mode2.sh \
-   test_skb_cgroup_id.sh
+   test_skb_cgroup_id.sh \
+   test_flow_dissector.sh
 
 # Compile but not part of 'make run_tests'
-TEST_GEN_PROGS_EXTENDED = test_libbpf_open test_sock_addr test_skb_cgroup_id_user
+TEST_GEN_PROGS_EXTENDED = test_libbpf_open test_sock_addr test_skb_cgroup_id_user \
+   flow_dissector_load test_flow_dissector
 
 include ../lib.mk
 
diff --git a/tools/testing/selftests/bpf/config 
b/tools/testing/selftests/bpf/config
index b4994a94968b..3655508f95fd 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -18,3 +18,4 @@ CONFIG_CRYPTO_HMAC=m
 CONFIG_CRYPTO_SHA256=m
 CONFIG_VXLAN=y
 CONFIG_GENEVE=y
+CONFIG_NET_CLS_FLOWER=m
diff --git a/tools/testing/selftests/bpf/flow_dissector_load.c 
b/tools/testing/selftests/bpf/flow_dissector_load.c
new file mode 100644
index ..d3273b5b3173
--- /dev/null
+++ b/tools/testing/selftests/bpf/flow_dissector_load.c
@@ -0,0 +1,140 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+const char *cfg_pin_path = "/sys/fs/bpf/flow_dissector";
+const char *cfg_map_name = "jmp_table";
+bool cfg_attach = true;
+char *cfg_section_name;
+char *cfg_path_name;
+
+static void load_and_attach_program(void)
+{
+   struct bpf_program *prog, *main_prog;
+   struct bpf_map *prog_array;
+   int i, fd, prog_fd, ret;
+   struct bpf_object *obj;
+   int prog_array_fd;
+
+   ret = bpf_prog_load(cfg_path_name, BPF_PROG_TYPE_FLOW_DISSECTOR, &obj,
+   &prog_fd);
+   if (ret)
+   error(1, 0, "bpf_prog_load %s", cfg_path_name);
+
+   main_prog = bpf_object__find_program_by_title(obj, cfg_section_name);
+   if (!main_prog)
+   error(1, 0, "bpf_object__find_program_by_title %s",
+ cfg_section_name);
+
+   prog_fd = bpf_program__fd(main_prog);
+   if (prog_fd < 0)
+   error(1, 0, "bpf_program__fd");
+
+   prog_array = bpf_object__find_map_by_name(obj, cfg_map_name);
+   if (!prog_array)
+   error(1, 0, "bpf_object__find_map_by_name %s", cfg_map_name);
+
+   prog_array_fd = bpf_map__fd(prog_array);
+   if (prog_array_fd < 0)
+   error(1, 0, "bpf_map__fd %s", cfg_map_name);
+
+   i = 0;
+   bpf_object__for_each_program(prog, obj) {
+   fd = bpf_program__fd(prog);
+   if (fd < 0)
+   error(1, 0, "bpf_program__fd");
+
+   if (fd != prog_fd) {
+   printf("%d: %s\n", i, bpf_program__title(prog, false));
+   bpf_map_update_elem(prog_array_fd, &i, &fd, BPF_ANY);
+   ++i;
+   }
+   }
+
+   ret = 

[bpf-next, v3 4/5] flow_dissector: implements eBPF parser

2018-09-13 Thread Petar Penkov
From: Petar Penkov 

This eBPF program extracts basic/control/ip address/ports keys from
incoming packets. It supports recursive parsing for IP encapsulation,
and VLAN, along with IPv4/IPv6 and extension headers.  This program is
meant to show how flow dissection and key extraction can be done in
eBPF.

Link: http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf
Signed-off-by: Petar Penkov 
Signed-off-by: Willem de Bruijn 
---
 tools/testing/selftests/bpf/Makefile   |   2 +-
 tools/testing/selftests/bpf/bpf_flow.c | 373 +
 2 files changed, 374 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/bpf_flow.c

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index fff7fb1285fc..e65f50f9185e 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -35,7 +35,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o 
test_tcp_estats.o test
test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \
test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o 
test_lirc_mode2_kern.o \
get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
-   test_skb_cgroup_id_kern.o
+   test_skb_cgroup_id_kern.o bpf_flow.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/bpf_flow.c 
b/tools/testing/selftests/bpf/bpf_flow.c
new file mode 100644
index ..5fb809d95867
--- /dev/null
+++ b/tools/testing/selftests/bpf/bpf_flow.c
@@ -0,0 +1,373 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+int _version SEC("version") = 1;
+#define PROG(F) SEC(#F) int bpf_func_##F
+
+/* These are the identifiers of the BPF programs that will be used in tail
+ * calls. Name is limited to 16 characters, with the terminating character and
+ * bpf_func_ above, we have only 6 to work with, anything after will be cropped.
+ */
+enum {
+   IP,
+   IPV6,
+   IPV6OP, /* Destination/Hop-by-Hop Options IPv6 Extension header */
+   IPV6FR, /* Fragmentation IPv6 Extension Header */
+   MPLS,
+   VLAN,
+};
+
+#define IP_MF  0x2000
+#define IP_OFFSET  0x1FFF
+#define IP6_MF 0x0001
+#define IP6_OFFSET 0xFFF8
+
+struct vlan_hdr {
+   __be16 h_vlan_TCI;
+   __be16 h_vlan_encapsulated_proto;
+};
+
+struct gre_hdr {
+   __be16 flags;
+   __be16 proto;
+};
+
+struct frag_hdr {
+   __u8 nexthdr;
+   __u8 reserved;
+   __be16 frag_off;
+   __be32 identification;
+};
+
+struct bpf_map_def SEC("maps") jmp_table = {
+   .type = BPF_MAP_TYPE_PROG_ARRAY,
+   .key_size = sizeof(__u32),
+   .value_size = sizeof(__u32),
+   .max_entries = 8
+};
+
+static __always_inline void *bpf_flow_dissect_get_header(struct __sk_buff *skb,
+__u16 hdr_size,
+void *buffer)
+{
+   void *data_end = (void *)(long)skb->data_end;
+   void *data = (void *)(long)skb->data;
+   __u16 nhoff = skb->flow_keys->nhoff;
+   __u8 *hdr;
+
+   /* Verifies this variable offset does not overflow */
+   if (nhoff > (USHRT_MAX - hdr_size))
+   return NULL;
+
+   hdr = data + nhoff;
+   if (hdr + hdr_size <= data_end)
+   return hdr;
+
+   if (bpf_skb_load_bytes(skb, nhoff, buffer, hdr_size))
+   return NULL;
+
+   return buffer;
+}
+
+/* Dispatches on ETHERTYPE */
+static __always_inline int parse_eth_proto(struct __sk_buff *skb, __be16 proto)
+{
+   struct bpf_flow_keys *keys = skb->flow_keys;
+
+   keys->n_proto = proto;
+   switch (proto) {
+   case bpf_htons(ETH_P_IP):
+   bpf_tail_call(skb, &jmp_table, IP);
+   break;
+   case bpf_htons(ETH_P_IPV6):
+   bpf_tail_call(skb, &jmp_table, IPV6);
+   break;
+   case bpf_htons(ETH_P_MPLS_MC):
+   case bpf_htons(ETH_P_MPLS_UC):
+   bpf_tail_call(skb, &jmp_table, MPLS);
+   break;
+   case bpf_htons(ETH_P_8021Q):
+   case bpf_htons(ETH_P_8021AD):
+   bpf_tail_call(skb, &jmp_table, VLAN);
+   break;
+   default:
+   /* Protocol not supported */
+   return BPF_DROP;
+   }
+
+   return BPF_DROP;
+}
+
+SEC("dissect")
+int dissect(struct __sk_buff *skb)
+{
+   if (!skb->vlan_present)
+   return parse_eth_proto(skb, skb->protocol);
+   else
+   return parse_eth_proto(skb, skb->vlan_proto);
+}
+
+/* Parses on IPPROTO_* */
+static __always_inline int parse_ip_proto(struct __sk_buff *skb, __u8 proto)
+{
+   struct 

[bpf-next, v3 1/5] flow_dissector: implements flow dissector BPF hook

2018-09-13 Thread Petar Penkov
From: Petar Penkov 

Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
path. The BPF program is per-network namespace.

Signed-off-by: Petar Penkov 
Signed-off-by: Willem de Bruijn 
---
 include/linux/bpf.h |   1 +
 include/linux/bpf_types.h   |   1 +
 include/linux/skbuff.h  |   7 ++
 include/net/net_namespace.h |   3 +
 include/net/sch_generic.h   |  12 +++-
 include/uapi/linux/bpf.h|  26 +++
 kernel/bpf/syscall.c|   8 +++
 kernel/bpf/verifier.c   |  32 +
 net/core/filter.c   |  70 +++
 net/core/flow_dissector.c   | 134 
 10 files changed, 291 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 523481a3471b..988a00797bcd 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -212,6 +212,7 @@ enum bpf_reg_type {
PTR_TO_PACKET_META,  /* skb->data - meta_len */
PTR_TO_PACKET,   /* reg points to skb->data */
PTR_TO_PACKET_END,   /* skb->data + headlen */
+   PTR_TO_FLOW_KEYS,/* reg points to bpf_flow_keys */
 };
 
 /* The information passed from prog-specific *_is_valid_access
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index cd26c090e7c0..22083712dd18 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -32,6 +32,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
 #ifdef CONFIG_INET
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport)
 #endif
+BPF_PROG_TYPE(BPF_PROG_TYPE_FLOW_DISSECTOR, flow_dissector)
 
 BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 17a13e4785fc..ce0e863f02a2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -243,6 +243,8 @@ struct scatterlist;
 struct pipe_inode_info;
 struct iov_iter;
 struct napi_struct;
+struct bpf_prog;
+union bpf_attr;
 
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 struct nf_conntrack {
@@ -1192,6 +1194,11 @@ void skb_flow_dissector_init(struct flow_dissector 
*flow_dissector,
 const struct flow_dissector_key *key,
 unsigned int key_count);
 
+int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
+  struct bpf_prog *prog);
+
+int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr);
+
 bool __skb_flow_dissect(const struct sk_buff *skb,
struct flow_dissector *flow_dissector,
void *target_container,
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 9b5fdc50519a..99d4148e0f90 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -43,6 +43,7 @@ struct ctl_table_header;
 struct net_generic;
 struct uevent_sock;
 struct netns_ipvs;
+struct bpf_prog;
 
 
 #define NETDEV_HASHBITS8
@@ -145,6 +146,8 @@ struct net {
 #endif
struct net_generic __rcu*gen;
 
+   struct bpf_prog __rcu   *flow_dissector_prog;
+
/* Note : following structs are cache line aligned */
 #ifdef CONFIG_XFRM
struct netns_xfrm   xfrm;
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index a6d00093f35e..1b81ba85fd2d 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -19,6 +19,7 @@ struct Qdisc_ops;
 struct qdisc_walker;
 struct tcf_walker;
 struct module;
+struct bpf_flow_keys;
 
 typedef int tc_setup_cb_t(enum tc_setup_type type,
  void *type_data, void *cb_priv);
@@ -307,9 +308,14 @@ struct tcf_proto {
 };
 
 struct qdisc_skb_cb {
-   unsigned intpkt_len;
-   u16 slave_dev_queue_mapping;
-   u16 tc_classid;
+   union {
+   struct {
+   unsigned intpkt_len;
+   u16 slave_dev_queue_mapping;
+   u16 tc_classid;
+   };
+   struct bpf_flow_keys *flow_keys;
+   };
 #define QDISC_CB_PRIV_LEN 20
unsigned char   data[QDISC_CB_PRIV_LEN];
 };
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 66917a4eba27..d1baf20cd329 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -152,6 +152,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_LWT_SEG6LOCAL,
BPF_PROG_TYPE_LIRC_MODE2,
BPF_PROG_TYPE_SK_REUSEPORT,
+   BPF_PROG_TYPE_FLOW_DISSECTOR,
 };
 
 enum bpf_attach_type {
@@ -172,6 +173,7 @@ enum bpf_attach_type {
BPF_CGROUP_UDP4_SENDMSG,
BPF_CGROUP_UDP6_SENDMSG,
BPF_LIRC_MODE2,
+   BPF_FLOW_DISSECTOR,
__MAX_BPF_ATTACH_TYPE
 };
 
@@ -2333,6 +2335,7 @@ struct __sk_buff {

Re: [RFC PATCH iproute2-next] System specification health API

2018-09-13 Thread Jakub Kicinski
On Thu, 13 Sep 2018 11:18:15 +0300, Eran Ben Elisha wrote:
> The health spec is targeted for Real Time Alerting, in order to know when
> something bad had happened to a PCI device

By spec you mean some standards body spec you implement or this
proposal is a spec?

> - Provide alert debug information
> - Self healing
> - If problem needs vendor support, provide a way to gather all needed 
> debugging
>   information.
> 
> The health contains sensors which sense for malfunction. Once sensor 
> triggered,
> actions such as logs and correction can be taken.
> Sensors are sensing the health state and can trigger correction action.
> 
> The sensors are divided into the following groups
> - Hardware sensor - a sensor which is triggered by the device due to
>   malfunction.
> - Software sensor - a sensor which is triggered by the software due to
>   malfunction.
> Both group of sensors can be triggered due to error event or due to a 
> periodic check.
> 
> Actions are the way to handle sensor events. Action can be in one of the
> following groups:
> - Dump -  SW trace, SW dump, HW trace, HW dump
> - Reset - Surgical correction (e.g. modify Q, flush Q, reset of device, etc)
> Actions can be performed by SW or HW.
> 
> User is allowed to enable or disable sensors and sensor2action mapping.
> 
> This RFC man page patch describes the suggested API of devlink-health in order
> to control sensors and actions.

I like the idea of configuring response to events like this, although
I'm not sure the name sensor is appropriate here - perhaps exception or
error would be better?  Are there going to be values reported?

I'm not so sure about HW sensors in relation to existing HWMON
infrastructure...  I assume you're targeting things like say some HW
engine/block reporting it encountered an error?  Sounds good, too.

Are the actions all envisioned to be performed by the driver?
Firmware?  Hardware?  I guess that distinction can be added later.
For FW/HW actions we would go back to the problem of persistence of 
the setting since it was only implemented for params :S

Is the dump option going to tie back into region snapshots?


Re: [PATCH] hv_netvsc: fix schedule in RCU context

2018-09-13 Thread David Miller
From: Stephen Hemminger 
Date: Thu, 13 Sep 2018 08:03:43 -0700

> When netvsc device is removed it can call reschedule in RCU context.
> This happens because canceling the subchannel setup work could (in theory)
> cause a reschedule when manipulating the timer.
> 
> To reproduce, run with lockdep enabled kernel and unbind
> a network device from hv_netvsc (via sysfs).
 ...
> Resolve this by getting RTNL earlier. This is safe because the subchannel
> work queue does trylock on RTNL and will detect the race.
> 
> Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
> Signed-off-by: Stephen Hemminger 

Applied and queued up for -stable.


[PATCH net-next 1/2] net/sched: act_police: use per-cpu counters

2018-09-13 Thread Davide Caratti
Use per-CPU counters instead of sharing a single set of stats with all
cores. This removes the need for a spinlock when statistics are read
or updated.

Signed-off-by: Davide Caratti 
---
 net/sched/act_police.c | 46 --
 1 file changed, 22 insertions(+), 24 deletions(-)

diff --git a/net/sched/act_police.c b/net/sched/act_police.c
index 393c7a670300..965a48d3ec35 100644
--- a/net/sched/act_police.c
+++ b/net/sched/act_police.c
@@ -110,7 +110,7 @@ static int tcf_police_init(struct net *net, struct nlattr 
*nla,
 
if (!exists) {
ret = tcf_idr_create(tn, parm->index, NULL, a,
-&act_police_ops, bind, false);
+&act_police_ops, bind, true);
if (ret) {
tcf_idr_cleanup(tn, parm->index);
return ret;
@@ -137,7 +137,8 @@ static int tcf_police_init(struct net *net, struct nlattr 
*nla,
}
 
if (est) {
-   err = gen_replace_estimator(&police->tcf_bstats, NULL,
+   err = gen_replace_estimator(&police->tcf_bstats,
+   police->common.cpu_bstats,
&police->tcf_rate_est,
&police->tcf_lock,
NULL, est);
@@ -207,32 +208,27 @@ static int tcf_police_act(struct sk_buff *skb, const 
struct tc_action *a,
  struct tcf_result *res)
 {
struct tcf_police *police = to_police(a);
-   s64 now;
-   s64 toks;
-   s64 ptoks = 0;
+   s64 now, toks, ptoks = 0;
+   int ret;
 
-   spin_lock(&police->tcf_lock);
-
-   bstats_update(&police->tcf_bstats, skb);
tcf_lastuse_update(&police->tcf_tm);
+   bstats_cpu_update(this_cpu_ptr(police->common.cpu_bstats), skb);
 
+   spin_lock(&police->tcf_lock);
if (police->tcfp_ewma_rate) {
struct gnet_stats_rate_est64 sample;
 
if (!gen_estimator_read(&police->tcf_rate_est, &sample) ||
sample.bps >= police->tcfp_ewma_rate) {
-   police->tcf_qstats.overlimits++;
-   if (police->tcf_action == TC_ACT_SHOT)
-   police->tcf_qstats.drops++;
-   spin_unlock(&police->tcf_lock);
-   return police->tcf_action;
+   ret = police->tcf_action;
+   goto inc_overlimits;
}
}
 
if (qdisc_pkt_len(skb) <= police->tcfp_mtu) {
if (!police->rate_present) {
-   spin_unlock(&police->tcf_lock);
-   return police->tcfp_result;
+   ret = police->tcfp_result;
+   goto unlock;
}
 
now = ktime_get_ns();
@@ -253,18 +249,20 @@ static int tcf_police_act(struct sk_buff *skb, const 
struct tc_action *a,
police->tcfp_t_c = now;
police->tcfp_toks = toks;
police->tcfp_ptoks = ptoks;
-   if (police->tcfp_result == TC_ACT_SHOT)
-   police->tcf_qstats.drops++;
-   spin_unlock(&police->tcf_lock);
-   return police->tcfp_result;
+   ret = police->tcfp_result;
+   goto inc_drops;
}
}
-
-   police->tcf_qstats.overlimits++;
-   if (police->tcf_action == TC_ACT_SHOT)
-   police->tcf_qstats.drops++;
+   ret = police->tcf_action;
+
+inc_overlimits:
+   qstats_overlimit_inc(this_cpu_ptr(police->common.cpu_qstats));
+inc_drops:
+   if (ret == TC_ACT_SHOT)
+   qstats_drop_inc(this_cpu_ptr(police->common.cpu_qstats));
+unlock:
spin_unlock(&police->tcf_lock);
-   return police->tcf_action;
+   return ret;
 }
 
 static int tcf_police_dump(struct sk_buff *skb, struct tc_action *a,
-- 
2.17.1



[PATCH net-next 2/2] net/sched: act_police: don't use spinlock in the data path

2018-09-13 Thread Davide Caratti
use RCU instead of spinlocks, to protect concurrent read/write on
act_police configuration. This reduces the effects of contention in the
data path, in case multiple readers are present.
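
A rough userspace sketch of the publish-by-pointer-swap pattern this
patch relies on (not the kernel code). struct params, cur_params,
reader_mtu() and publish_params() are hypothetical names, and the RCU
grace period is not modelled: the old snapshot is simply handed back to
the caller, where the kernel would use kfree_rcu().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down stand-in for struct tcf_police_params. */
struct params {
	uint32_t mtu;
	int result;
};

/* The live configuration; readers only ever load this pointer. */
static _Atomic(struct params *) cur_params;

/* Reader side: one atomic load, then plain reads of an immutable snapshot. */
static uint32_t reader_mtu(void)
{
	struct params *p = atomic_load_explicit(&cur_params,
						memory_order_acquire);

	return p ? p->mtu : 0;
}

/*
 * Writer side: build the new snapshot off to the side, publish it with one
 * atomic exchange, and return the old snapshot. Real RCU would free it only
 * after a grace period; this sketch assumes no concurrent readers, so the
 * caller may free it right away.
 */
static struct params *publish_params(uint32_t mtu, int result)
{
	struct params *next = malloc(sizeof(*next));

	assert(next != NULL);
	next->mtu = mtu;
	next->result = result;
	return atomic_exchange_explicit(&cur_params, next,
					memory_order_acq_rel);
}
```

The point of the design is that readers never see a half-updated
configuration: they either get the old snapshot or the new one.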

Signed-off-by: Davide Caratti 
---
 net/sched/act_police.c | 156 -
 1 file changed, 92 insertions(+), 64 deletions(-)

diff --git a/net/sched/act_police.c b/net/sched/act_police.c
index 965a48d3ec35..92649d2667ed 100644
--- a/net/sched/act_police.c
+++ b/net/sched/act_police.c
@@ -22,8 +22,7 @@
 #include 
 #include 
 
-struct tcf_police {
-   struct tc_action common;
+struct tcf_police_params {
int tcfp_result;
u32 tcfp_ewma_rate;
s64 tcfp_burst;
@@ -36,6 +35,12 @@ struct tcf_police {
bool rate_present;
struct psched_ratecfg   peak;
bool peak_present;
+   struct rcu_head rcu;
+};
+
+struct tcf_police {
+   struct tc_action common;
+   struct tcf_police_params __rcu *params;
 };
 
 #define to_police(pc) ((struct tcf_police *)pc)
@@ -84,6 +89,7 @@ static int tcf_police_init(struct net *net, struct nlattr 
*nla,
struct tcf_police *police;
struct qdisc_rate_table *R_tab = NULL, *P_tab = NULL;
struct tc_action_net *tn = net_generic(net, police_net_id);
+   struct tcf_police_params *new;
bool exists = false;
int size;
 
@@ -151,50 +157,60 @@ static int tcf_police_init(struct net *net, struct nlattr 
*nla,
goto failure;
}
 
-   spin_lock_bh(&police->tcf_lock);
+   new = kzalloc(sizeof(*new), GFP_KERNEL);
+   if (unlikely(!new)) {
+   err = -ENOMEM;
+   goto failure;
+   }
+
/* No failure allowed after this point */
-   police->tcfp_mtu = parm->mtu;
-   if (police->tcfp_mtu == 0) {
-   police->tcfp_mtu = ~0;
+   new->tcfp_mtu = parm->mtu;
+   if (!new->tcfp_mtu) {
+   new->tcfp_mtu = ~0;
if (R_tab)
-   police->tcfp_mtu = 255 << R_tab->rate.cell_log;
+   new->tcfp_mtu = 255 << R_tab->rate.cell_log;
}
if (R_tab) {
-   police->rate_present = true;
-   psched_ratecfg_precompute(&police->rate, &R_tab->rate, 0);
+   new->rate_present = true;
+   psched_ratecfg_precompute(&new->rate, &R_tab->rate, 0);
qdisc_put_rtab(R_tab);
} else {
-   police->rate_present = false;
+   new->rate_present = false;
}
if (P_tab) {
-   police->peak_present = true;
-   psched_ratecfg_precompute(&police->peak, &P_tab->rate, 0);
+   new->peak_present = true;
+   psched_ratecfg_precompute(&new->peak, &P_tab->rate, 0);
qdisc_put_rtab(P_tab);
} else {
-   police->peak_present = false;
+   new->peak_present = false;
}
 
if (tb[TCA_POLICE_RESULT])
-   police->tcfp_result = nla_get_u32(tb[TCA_POLICE_RESULT]);
-   police->tcfp_burst = PSCHED_TICKS2NS(parm->burst);
-   police->tcfp_toks = police->tcfp_burst;
-   if (police->peak_present) {
-   police->tcfp_mtu_ptoks = (s64) psched_l2t_ns(&police->peak,
-police->tcfp_mtu);
-   police->tcfp_ptoks = police->tcfp_mtu_ptoks;
+   new->tcfp_result = nla_get_u32(tb[TCA_POLICE_RESULT]);
+   new->tcfp_burst = PSCHED_TICKS2NS(parm->burst);
+   new->tcfp_toks = new->tcfp_burst;
+   if (new->peak_present) {
+   new->tcfp_mtu_ptoks = (s64)psched_l2t_ns(&new->peak,
+new->tcfp_mtu);
+   new->tcfp_ptoks = new->tcfp_mtu_ptoks;
}
-   police->tcf_action = parm->action;
 
if (tb[TCA_POLICE_AVRATE])
-   police->tcfp_ewma_rate = nla_get_u32(tb[TCA_POLICE_AVRATE]);
+   new->tcfp_ewma_rate = nla_get_u32(tb[TCA_POLICE_AVRATE]);
 
+   spin_lock_bh(&police->tcf_lock);
+   new->tcfp_t_c = ktime_get_ns();
+   police->tcf_action = parm->action;
+   rcu_swap_protected(police->params,
+  new,
+  lockdep_is_held(&police->tcf_lock));
spin_unlock_bh(&police->tcf_lock);
-   if (ret != ACT_P_CREATED)
-   return ret;
 
-   police->tcfp_t_c = ktime_get_ns();
-   tcf_idr_insert(tn, *a);
+   if (new)
+   kfree_rcu(new, rcu);
 
+   if (ret == ACT_P_CREATED)
+   tcf_idr_insert(tn, *a);
return ret;
 
 failure:
@@ -208,68 +224,77 @@ static int tcf_police_act(struct sk_buff *skb, const 
struct tc_action *a,
  struct tcf_result *res)
 {
struct tcf_police *police = to_police(a);
+   struct tcf_police_params *p;
s64 now, toks, ptoks = 0;
int ret;
 
 

[PATCH net-next 0/2] net/sched: act_police: lockless data path

2018-09-13 Thread Davide Caratti
the data path of 'police' action can be faster if we avoid using spinlocks:
 - patch 1 converts act_police to use per-cpu counters
 - patch 2 lets act_police use RCU to access its configuration data.
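
As a rough userspace illustration of the idea behind patch 1 (not the
kernel code itself): each writer updates only its own slot, so the hot
path takes no lock, and a reader folds the slots together on demand.
NR_SLOTS, struct pkt_stats, stats_update() and stats_sum() are
hypothetical names; the kernel uses this_cpu_ptr() and its own helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_SLOTS 4	/* hypothetical stand-in for the number of CPUs */

/* One counter block per slot; the kernel allocates these per CPU. */
struct pkt_stats {
	uint64_t bytes;
	uint64_t packets;
};

static struct pkt_stats slot_stats[NR_SLOTS];

/* Hot path: touch only the caller's own slot, no lock needed. */
static void stats_update(unsigned int slot, uint64_t len)
{
	assert(slot < NR_SLOTS);
	slot_stats[slot].bytes += len;
	slot_stats[slot].packets++;
}

/* Slow path (for example a dump request): fold all slots into one total. */
static struct pkt_stats stats_sum(void)
{
	struct pkt_stats total = { 0, 0 };
	size_t i;

	for (i = 0; i < NR_SLOTS; i++) {
		total.bytes += slot_stats[i].bytes;
		total.packets += slot_stats[i].packets;
	}
	return total;
}
```

The trade-off is that reading a total costs a loop over all slots,
which is acceptable because dumps are far rarer than packet updates.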

test procedure (using pktgen from https://github.com/netoptimizer):
 # ip link add name eth1 type dummy
 # ip link set dev eth1 up
 # tc qdisc add dev eth1 clsact
 # tc filter add dev eth1 egress matchall action police \
 > rate 2gbit burst 100k conform-exceed pass/pass index 100
 # for c in 1 2 4; do
 > ./pktgen_bench_xmit_mode_queue_xmit.sh -v -s 64 -t $c -n 500 -i eth1
 > done

test results (avg. pps/thread):

  $c | before patch | after patch | improvement
 ----+--------------+-------------+-------------
   1 |      3518448 |     3591240 |  irrelevant
   2 |      3070065 |     3383393 |         10%
   4 |      1540969 |     3238385 |        110%
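
The improvement column appears to be the relative gain, rounded to a
whole percent. A small sketch of that arithmetic, assuming non-negative
gains (improvement_pct() is a hypothetical helper, not part of the
series):

```c
#include <assert.h>

/*
 * Relative gain in percent, rounded to the nearest integer:
 * (after - before) / before * 100. Only meant for after >= before.
 */
static long improvement_pct(long before, long after)
{
	assert(before > 0);
	assert(after >= before);
	return ((after - before) * 100 + before / 2) / before;
}
```

With the numbers above this gives 10 for two threads and 110 for four
threads; the single-thread case works out to about 2, which the table
labels irrelevant.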


Davide Caratti (2):
  net/sched: act_police: use per-cpu counters
  net/sched: act_police: don't use spinlock in the data path

 net/sched/act_police.c | 186 +++--
 1 file changed, 106 insertions(+), 80 deletions(-)

-- 
2.17.1



Re: [PATCH net-next 08/13] net: sched: rename tcf_block_get{_ext}() and tcf_block_put{_ext}()

2018-09-13 Thread Cong Wang
On Wed, Sep 12, 2018 at 1:24 AM Vlad Buslov  wrote:
>
>
> On Fri 07 Sep 2018 at 20:09, Cong Wang  wrote:
> > On Thu, Sep 6, 2018 at 12:59 AM Vlad Buslov  wrote:
> >>
> >> Functions tcf_block_get{_ext}() and tcf_block_put{_ext}() actually
> >> attach/detach block to specific Qdisc besides just taking/putting
> >> reference. Rename them according to their purpose.
> >
> > Where exactly does it attach to?
> >
> > Each qdisc provides a pointer to a pointer of a block, like
> > >block. It is where the result is saved to. It takes a parameter
> > of Qdisc* merely for read-only purpose.
>
> tcf_block_attach_ext() passes qdisc parameter to tcf_block_owner_add()
> which saves qdisc to new tcf_block_owner_item and adds the item to
> block's owner list. I proposed several naming options for these
> functions to Jiri on internal review and he suggested "attach" as better
> option.

But that is merely item->q = q, this is why I said it is read-only,
hard to claim this is attaching.


>
> >
> > So, renaming it to *attach() is even confusing, at least not
> > any better. Please find other names or leave them as they are.
>
> What would you recommend?

I don't know, perhaps "acquire"?

Or, leaving tcf_block_get() as it is but rename your refcnt
increment function to be something like tcf_block_refcnt_get()?


Re: [PATCH net-next v2] net: sched: change tcf_del_walker() to take idrinfo->lock

2018-09-13 Thread Cong Wang
On Wed, Sep 12, 2018 at 1:51 AM Vlad Buslov  wrote:
>
>
> On Fri 07 Sep 2018 at 19:12, Cong Wang  wrote:
> > On Fri, Sep 7, 2018 at 6:52 AM Vlad Buslov  wrote:
> >>
> >> Action API was changed to work with actions and action_idr in concurrency
> >> safe manner, however tcf_del_walker() still uses actions without taking a
> >> reference or idrinfo->lock first, and deletes them directly, disregarding
> >> possible concurrent delete.
> >>
> >> Add tc_action_wq workqueue to action API. Implement
> >> tcf_idr_release_unsafe() that assumes external synchronization by caller
> >> and delays blocking action cleanup part to tc_action_wq workqueue. Extend
> >> tcf_action_cleanup() with 'async' argument to indicate that function should
> >> free action asynchronously.
> >
> > Where exactly is blocking in tcf_action_cleanup()?
> >
> > From your code, it looks like free_tcf(), but from my observation,
> > the only blocking function inside is tcf_action_goto_chain_fini()
> > which calls __tcf_chain_put(). But, __tcf_chain_put() is blocking
> > _ONLY_ when tc_chain_notify() is called, for tc action it is never
> > called.
> >
> > So, what else is blocking?
>
> __tcf_chain_put() calls tc_chain_tmplt_del(), which calls
> ops->tmplt_destroy(). This last function uses hw offload API, which is
> blocking.

Good to know.

Can we just make ops->tmplt_destroy() use a workqueue?
Moving tc action cleanup to a workqueue seems like overkill to me.


Re: [PATCH net-next v3 1/2] netlink: ipv4 igmp join notifications

2018-09-13 Thread Roopa Prabhu
On Thu, Sep 6, 2018 at 8:40 PM, Roopa Prabhu  wrote:
> On Thu, Sep 6, 2018 at 2:10 AM, Patrick Ruddy
>  wrote:
>> Some userspace applications need to know about IGMP joins from the
>> kernel for 2 reasons:
>> 1. To allow the programming of multicast MAC filters in hardware
>> 2. To form a multicast FORUS list for non link-local multicast
>>groups to be sent to the kernel and from there to the interested
>>party.
>> (1) can be fulfilled by simply sending the hardware multicast MAC
>> address to be programmed but (2) requires the L3 address to be sent
>> since this cannot be constructed from the MAC address whereas the
>> reverse translation is a standard library function.
>>
>> This commit provides addition and deletion of multicast addresses
>> using the RTM_NEWMDB and RTM_DELMDB messages with AF_INET. It also
>> provides the RTM_GETMDB extension to allow multicast join state to
>> be read from the kernel.
>>
>> Signed-off-by: Patrick Ruddy 
>> ---
>> v3 rework to use RTM_***MDB messages as per review comments.
>
> Patrick, this version seems to be using RTM_***MDB msgs with the
> RTM_*ADDR format.
> We can't do that, because existing RTM_MDB users will be confused.
>
> My request was to evaluate RTM_***MDB msg format. see
> nlmsg_populate_mdb_fill for details.
>
> If you can wait a day or two I can share some experimental code that
> moves high level RTM_*MDB msg handling into net/core/rtnetlink.c
> similar to RTM_*FDB
>

I was trying to get a default per-interface (non-bridge) RTM_*MDB
working, but realized that the dev->mc entries are already getting
dumped as part of RTM_*FDB msgs instead of RTM_*MDB (see
net/core/rtnetlink.c:ndo_dflt_fdb_dump).
This adds another wrench.

So that puts us back to your use of RTM_NEWADDR.
Instead of using IFA_ADDRESS, you could introduce a new attribute,
IFA_IGMP_MULTICAST (since IFA_MULTICAST is already taken).


To keep existing users of RTM_NEWADDR unaffected, I think you can use
the IPMR family with RTM_NEWADDR.
We can introduce a new notification group. (We could choose to add a new
family too, but that seems unnecessary.)

since you only need dumps:
rtnl_register(RTNL_FAMILY_IPMR, RTM_GETADDR, NULL, igmp_rtm_dumpaddrs, 0);

For notifications, since we already have many variants for routes, I
don't see a problem adding a similar addr variant, e.g.
RTNLGRP_IPV4_MCADDR.

(Others on the list may have more feedback).
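
The quoted point that the MAC address can be derived from the L3 group
but not the other way around follows from the standard IPv4
multicast-to-Ethernet mapping, which keeps only the low 23 bits of the
group address. A minimal sketch (ipv4_mcast_to_mac() is a hypothetical
helper name, and the group is assumed to be in host byte order):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map an IPv4 multicast group (host byte order) to its Ethernet MAC:
 * fixed prefix 01:00:5e, then the low 23 bits of the group address.
 */
static void ipv4_mcast_to_mac(uint32_t group, uint8_t mac[6])
{
	assert((group >> 28) == 0xe);	/* must be in 224.0.0.0/4 */
	mac[0] = 0x01;
	mac[1] = 0x00;
	mac[2] = 0x5e;
	mac[3] = (group >> 16) & 0x7f;	/* top bit of this octet is dropped */
	mac[4] = (group >> 8) & 0xff;
	mac[5] = group & 0xff;
}
```

Because 5 bits of the group are discarded, 32 different groups share
each MAC address (224.1.1.1 and 225.129.1.1 both map to
01:00:5e:01:01:01), so the L3 address has to be sent explicitly for
use case (2).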


Re: What is the best forum (mailing list, irc etc) to ask questions about the usage of AF_XDP sockets.

2018-09-13 Thread Jakub Kicinski
On Thu, 13 Sep 2018 18:31:55 +0200, Konrad Djimeli wrote:
> Hello,
> 
> I have been working on trying to make use of AF_XDP sockets as part of a
> project I am working on, and I have been facing some issues but I am not
> sure where to ask questions related to the usage of AF_XDP, since this
> is a development mailing list.

IMHO AF_XDP is quite fresh so it should be okay to ask questions on
netdev.  There is also xdp-newbies mailing list which seems very
appropriate for less advanced questions!


Re: [PATCH v4 0/3] IB/ipoib: Use dev_port to disambiguate port numbers

2018-09-13 Thread Doug Ledford
On Thu, 2018-09-06 at 17:51 +0300, Arseny Maslennikov wrote:
> Pre-3.15 userspace had trouble distinguishing different ports
> of a NIC on a single PCI bus/device/function. To solve this,
> a sysfs field `dev_port' was introduced quite a while ago
> (commit v3.14-rc3-739-g3f85944fe207), and some relevant device
> drivers were fixed to use it, but not in case of IPoIB.
> 
> The convention for some reason never got documented in the kernel, but
> was immediately adopted by userspace (notably udev[1][2], biosdevname[3])
> 
> 1/3 documents the sysfs field — that's why I'm CC-ing netdev.
> 
> This series was tested on and applies to 4.19-rc2.
> 
> [1] https://lists.freedesktop.org/archives/systemd-devel/2014-June/020788.html
> [2] https://lists.freedesktop.org/archives/systemd-devel/2014-July/020804.html
> [3] 
> https://github.com/CloudAutomationNTools/biosdevname/blob/c795d51dd93a5309652f0d635f12a3ecfabfaa72/src/eths.c#L38
> 
> v1->v2: replace a line instead of inserting and then removing.
> v2->v3: restore both attributes, output a notice of deprecation to kmsg.
> v3->v4: style adjustments, join the deprecation notice to single line.
> 
> Arseny Maslennikov (3):
>   Documentation/ABI: document /sys/class/net/*/dev_port
>   IB/ipoib: Use dev_port to expose network interface port numbers
>   IB/ipoib: Log sysfs 'dev_id' accesses from userspace
> 
>  Documentation/ABI/testing/sysfs-class-net | 18 +
>  drivers/infiniband/ulp/ipoib/ipoib_main.c | 33 +++
>  2 files changed, 51 insertions(+)
> 

Series applied to for-next.  But I think we should watch feedback from
people, and if people think the notification about using the wrong
variable is too noisy, then we might want to revert it or modify it to
only print out once per specific executable instead of once per run of
each executable.

-- 
Doug Ledford 
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD




What is the best forum (mailing list, irc etc) to ask questions about the usage of AF_XDP sockets.

2018-09-13 Thread Konrad Djimeli
Hello,

I have been working on trying to make use of AF_XDP sockets as part of a
project I am working on, and I have been facing some issues but I am not
sure where to ask questions related to the usage of AF_XDP, since this
is a development mailing list.

Thanks
Konrad
www.djimeli.me


Re: [Patch net] net_sched: notify filter deletion when deleting a chain

2018-09-13 Thread David Miller
From: Cong Wang 
Date: Tue, 11 Sep 2018 14:22:23 -0700

> When we delete a chain of filters, we need to notify
> user-space we are deleting each filter in this chain
> too.
> 
> Fixes: 32a4f5ecd738 ("net: sched: introduce chain object to uapi")
> Cc: Jiri Pirko 
> Signed-off-by: Cong Wang 

Applied, thanks Cong.


Re: [Patch net-next] llc: avoid blocking in llc_sap_close()

2018-09-13 Thread David Miller
From: Cong Wang 
Date: Tue, 11 Sep 2018 11:42:06 -0700

> llc_sap_close() is called by llc_sap_put() which
> could be called in BH context in llc_rcv(). We can't
> block in BH.
> 
> There is no reason to block it here, kfree_rcu() should
> be sufficient.
> 
> Signed-off-by: Cong Wang 

Applied, thanks Cong.


Re: [PATCH v4 3/3] IB/ipoib: Log sysfs 'dev_id' accesses from userspace

2018-09-13 Thread Doug Ledford
On Sun, 2018-09-09 at 23:55 +0300, Arseny Maslennikov wrote:
> On Sun, Sep 09, 2018 at 09:11:46PM +0300, Arseny Maslennikov wrote:
> > On Fri, Sep 07, 2018 at 09:43:59AM -0600, Jason Gunthorpe wrote:
> > > On Thu, Sep 06, 2018 at 05:51:12PM +0300, Arseny Maslennikov wrote:
> > > > Some tools may currently be using only the deprecated attribute;
> > > > let's print an elaborate and clear deprecation notice to kmsg.
> > > > 
> > > > To do that, we have to replace the whole sysfs file, since we inherit
> > > > the original one from netdev.
> > > > 
> > > > Signed-off-by: Arseny Maslennikov 
> > > >  drivers/infiniband/ulp/ipoib/ipoib_main.c | 31 +++
> > > >  1 file changed, 31 insertions(+)
> > > > 
> > > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
> > > > b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > > > index 30f840f874b3..74732726ec6f 100644
> > > > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > > > @@ -2386,6 +2386,35 @@ int ipoib_add_pkey_attr(struct net_device *dev)
> > > > return device_create_file(>dev, _attr_pkey);
> > > >  }
> > > >  
> > > > +/*
> > > > + * We erroneously exposed the iface's port number in the dev_id
> > > > + * sysfs field long after dev_port was introduced for that purpose[1],
> > > > + * and we need to stop everyone from relying on that.
> > > > + * Let's overload the shower routine for the dev_id file here
> > > > + * to gently bring the issue up.
> > > > + *
> > > > + * [1] https://www.spinics.net/lists/netdev/msg272123.html
> > > > + */
> > > > +static ssize_t dev_id_show(struct device *dev,
> > > > +  struct device_attribute *attr, char *buf)
> > > > +{
> > > > +   struct net_device *ndev = to_net_dev(dev);
> > > > +
> > > > +   if (ndev->dev_id == ndev->dev_port)
> > > > +   netdev_info_once(ndev,
> > > > +   "\"%s\" wants to know my dev_id. Should it look 
> > > > at dev_port instead? See Documentation/ABI/testing/sysfs-class-net for 
> > > > more info.\n",
> > > > +   current->comm);
> > > > +
> > > > +   return sprintf(buf, "%#x\n", ndev->dev_id);
> > > > +}
> > > > +static DEVICE_ATTR_RO(dev_id);
> > > > +
> > > > +int ipoib_intercept_dev_id_attr(struct net_device *dev)
> > > > +{
> > > > +   device_remove_file(>dev, _attr_dev_id);
> > > > +   return device_create_file(>dev, _attr_dev_id);
> > > > +}
> > > 
> > > Isn't this racey with userspace? Ie what happens if udev is querying
> > > the dev_id right here?
> > 
> > udev in particular does not use dev_id at all since 2014, because "why
> > would we keep using dev_id if it is not the right thing to use?".
> > 
> > > 
> > > Do we know there is no userspace doing this?
> > > 
> > 
> > Not for sure.
> > 
> > If we move all the sysfs handling stuff we introduce in _add_port():
> >  - pkey
> >  - umcast
> >  - {create,delete}_child
> >  - connected/datagram mode
> > to _ndo_init(), which is called by register_netdev before it sends
> > the netlink message, would that suffice to eliminate the race?
> > (Sysfs files for {create,delete}_child go to _parent_init() then).
> > 
> 
> No, we can't, sorry for the noise. ndo_init() runs before the kobject
> becomes available.
> 
> Anyway, our sysfs attributes being racy is unrelated to the patch series
> subject, and I can't come up with any other ideas what to do with them
> that do not involve adjustments to register_netdev.

Agreed (that fixing the race issues is a different patch series).

-- 
Doug Ledford 
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD




Re: [net-next, PATCH 2/2, v2] net: socionext: add XDP support

2018-09-13 Thread Ilias Apalodimas
On Thu, Sep 13, 2018 at 04:32:06PM +0200, Jesper Dangaard Brouer wrote:
> On Wed, 12 Sep 2018 12:29:15 +0300
> Ilias Apalodimas  wrote:
> 
> > On Wed, Sep 12, 2018 at 11:25:24AM +0200, Jesper Dangaard Brouer wrote:
> > > On Wed, 12 Sep 2018 12:02:38 +0300
> > > Ilias Apalodimas  wrote:
> > >   
> > > >  static const struct net_device_ops netsec_netdev_ops = {
> > > > .ndo_init   = netsec_netdev_init,
> > > > .ndo_uninit = netsec_netdev_uninit,
> > > > @@ -1430,6 +1627,7 @@ static const struct net_device_ops 
> > > > netsec_netdev_ops = {
> > > > .ndo_set_mac_address= eth_mac_addr,
> > > > .ndo_validate_addr  = eth_validate_addr,
> > > > .ndo_do_ioctl   = netsec_netdev_ioctl,
> > > > +   .ndo_bpf= netsec_xdp,
> > > >  };
> > > >
> > > 
> > > You have not implemented ndo_xdp_xmit.
> > > 
> > > Thus, you have "only" implemented the RX side of XDP_REDIRECT.  Which
> > > allows you to do, cpumap and AF_XDP redirects, but not allowing other
> > > drivers to XDP send out this device.  
> >
> > Correct, that was the planning, is ndo_xdp_xmit() needed for the patch or
> > is the patch message just misleading and i should change that ?
> 
> Yes, I think you should ALSO implement ndo_xdp_xmit, maybe as a separate
> patch, but in the same series. (Our experience is that if we don't
> require this, people forget to complete this part of the XDP support).
Ok makes sense. Already started on that i should have something soon
> 
> Also your XDP_TX is not optimal, as (it looks like) you flush TX on
> every send.
Yes i do, the driver is queueing packet by packet (in its default skb
implementation) so i just did the same. Agree it's far from optimal though,
i'll see if i can change that in the next version
> 
> BTW, do you have any performance numbers?
Yes XDP_TX is doing ~330kpps and XDP_REDIRECT ~340kpps(dropping packets)
using 64b packets.  I am not really sure if this is a hardware limitation 
due to only using a single queue. I used ./samples/bpf/xdpsock for AF_XDP
and ./samples/bpf/xdp2 for XDP_TX. I hope i am doing the right tests

The default Rx path is doing ~220kpps with the improved memory allocation
scheme so we do have some improvement although we are far away from line 
rate

The default Tx seems to hang after some point with a txq full message so
i don't have any precise numbers for that

This change on the driver started as an investigation of using AF_XDP
for Time Sensitive networking setups. The offloading seems to work wonders
there since the latency is reduced *A LOT* (more than 10x in my case) in 
Rx path

Another thing i did consider is that Bjorn is right. Since i only have 1 
shared txq i need locking to avoid race conditions. 

Once again thanks for reviewing this

/Ilias


RE: [PATCH] hv_netvsc: fix schedule in RCU context

2018-09-13 Thread Haiyang Zhang



> -----Original Message-----
> From: Stephen Hemminger 
> Sent: Thursday, September 13, 2018 11:04 AM
> To: KY Srinivasan ; Haiyang Zhang
> 
> Cc: netdev@vger.kernel.org; Stephen Hemminger 
> Subject: [PATCH] hv_netvsc: fix schedule in RCU context
> 
> When netvsc device is removed it can call reschedule in RCU context.
> This happens because canceling the subchannel setup work could (in theory)
> cause a reschedule when manipulating the timer.
> 
> To reproduce, run with lockdep enabled kernel and unbind
> a network device from hv_netvsc (via sysfs).
> 
> [  160.682011] WARNING: suspicious RCU usage
> [  160.707466] 4.19.0-rc3-uio+ #2 Not tainted
> [  160.709937] -
> [  160.712352] ./include/linux/rcupdate.h:302 Illegal context switch in RCU
> read-side critical section!
> [  160.723691]
> [  160.723691] other info that might help us debug this:
> [  160.723691]
> [  160.730955]
> [  160.730955] rcu_scheduler_active = 2, debug_locks = 1
> [  160.762813] 5 locks held by rebind-eth.sh/1812:
> [  160.766851]  #0: 8befa37a (sb_writers#6){.+.+}, at:
> vfs_write+0x184/0x1b0
> [  160.773416]  #1: b097f236 (>mutex){+.+.}, at:
> kernfs_fop_write+0xe2/0x1a0
> [  160.783766]  #2: 41ee6889 (kn->count#3){}, at:
> kernfs_fop_write+0xeb/0x1a0
> [  160.787465]  #3: 56d92a74 (>mutex){}, at:
> device_release_driver_internal+0x39/0x250
> [  160.816987]  #4: 30f6031e (rcu_read_lock){}, at:
> netvsc_remove+0x1e/0x250 [hv_netvsc]
> [  160.828629]
> [  160.828629] stack backtrace:
> [  160.831966] CPU: 1 PID: 1812 Comm: rebind-eth.sh Not tainted 4.19.0-rc3-
> uio+ #2
> [  160.832952] Hardware name: Microsoft Corporation Virtual Machine/Virtual
> Machine, BIOS Hyper-V UEFI Release v1.0 11/26/2012
> [  160.832952] Call Trace:
> [  160.832952]  dump_stack+0x85/0xcb
> [  160.832952]  ___might_sleep+0x1a3/0x240
> [  160.832952]  __flush_work+0x57/0x2e0
> [  160.832952]  ? __mutex_lock+0x83/0x990
> [  160.832952]  ? __kernfs_remove+0x24f/0x2e0
> [  160.832952]  ? __kernfs_remove+0x1b2/0x2e0
> [  160.832952]  ? mark_held_locks+0x50/0x80
> [  160.832952]  ? get_work_pool+0x90/0x90
> [  160.832952]  __cancel_work_timer+0x13c/0x1e0
> [  160.832952]  ? netvsc_remove+0x1e/0x250 [hv_netvsc]
> [  160.832952]  ? __lock_is_held+0x55/0x90
> [  160.832952]  netvsc_remove+0x9a/0x250 [hv_netvsc]
> [  160.832952]  vmbus_remove+0x26/0x30
> [  160.832952]  device_release_driver_internal+0x18a/0x250
> [  160.832952]  unbind_store+0xb4/0x180
> [  160.832952]  kernfs_fop_write+0x113/0x1a0
> [  160.832952]  __vfs_write+0x36/0x1a0
> [  160.832952]  ? rcu_read_lock_sched_held+0x6b/0x80
> [  160.832952]  ? rcu_sync_lockdep_assert+0x2e/0x60
> [  160.832952]  ? __sb_start_write+0x141/0x1a0
> [  160.832952]  ? vfs_write+0x184/0x1b0
> [  160.832952]  vfs_write+0xbe/0x1b0
> [  160.832952]  ksys_write+0x55/0xc0
> [  160.832952]  do_syscall_64+0x60/0x1b0
> [  160.832952]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  160.832952] RIP: 0033:0x7fe48f4c8154
> 
> Resolve this by getting RTNL earlier. This is safe because the subchannel
> work queue does trylock on RTNL and will detect the race.
> 
> Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
> Signed-off-by: Stephen Hemminger 

Reviewed-by: Haiyang Zhang 

Thank you!


Re: [PATCHv2 net] ipv6: use rt6_info members when dst is set in rt6_fill_node

2018-09-13 Thread David Miller
From: Xin Long 
Date: Tue, 11 Sep 2018 14:33:58 +0800

> In inet6_rtm_getroute, since Commit 93531c674315 ("net/ipv6: separate
> handling of FIB entries from dst based routes"), it has used rt->from
> to dump route info instead of rt.
> 
> However for some route like cache, some of its information like flags
> or gateway is not the same as that of the 'from' one. It caused 'ip
> route get' to dump the wrong route information.
> 
> In Jianlin's testing, the output information even lost the expiration
> time for a pmtu route cache due to the wrong fib6_flags.
> 
> So change to use rt6_info members for dst addr, src addr, flags and
> gateway when it tries to dump a route entry without fibmatch set.
> 
> v1->v2:
>   - not use rt6i_prefsrc.
>   - also fix the gw dump issue.
> 
> Fixes: 93531c674315 ("net/ipv6: separate handling of FIB entries from dst 
> based routes")
> Reported-by: Jianlin Shi 
> Signed-off-by: Xin Long 

Applied and queued up for -stable, thanks Xin.


Re: [PATCH 2/2] net: qcom/emac: add shared mdio bus support

2018-09-13 Thread Wang, Dongsheng
On 9/13/2018 8:42 PM, Andrew Lunn wrote:
> On Thu, Sep 13, 2018 at 05:04:53PM +0800, Wang Dongsheng wrote:
>> Share the mii_bus with other MAC devices because the QDF2400 emac
>> includes MDIO, and the motherboard has more than one PHY connected
>> to an MDIO bus.
>>
>> Tested: QDF2400 (ACPI), built-in/insmod/rmmod
>>
>> Signed-off-by: Wang Dongsheng 
> Hi Wang
>
> This is a pretty big patch, and is hard to review. Could you try to
> break it up into a number of smaller patches? You could for example
> first refactor emacs_phy_config(), without making any functional
> changes. Then add the sharing. Maybe do OF and ACPI in different
> patches?
>
> Thanks
>Andrew
>
Ok, thanks.


Cheers

Dongsheng



Re: [PATCH iproute2] iproute2: fix use-after-free

2018-09-13 Thread Stephen Hemminger
On Wed, 12 Sep 2018 23:07:20 -0700
Mahesh Bandewar (महेश बंडेवार)  wrote:

> On Wed, Sep 12, 2018 at 5:33 PM, Stephen Hemminger
>  wrote:
> >
> > On Wed, 12 Sep 2018 16:29:28 -0700
> > Mahesh Bandewar  wrote:
> >  
> > > From: Mahesh Bandewar 
> > >
> > > A local program using iproute2 lib pointed out the issue and looking
> > > at the code it is pretty obvious -
> > >
> > > a = (struct nlmsghdr *)b;
> > > ...
> > > free(b);
> > > if (a->nlmsg_seq == seq)
> > > ...
> > >
> > > Fixes: 86bf43c7c2fd ("lib/libnetlink: update rtnl_talk to support malloc 
> > > buff at run time")
> > > Signed-off-by: Mahesh Bandewar   
> >
> > Yes, this is a real problem.
> >
> > Maybe a minimal patch like this would be enough:  
> actually that will leave the memory leak at the 'goto next' line (just
> few lines below) since that jump will overwrite the buf.

It looks like a new buffer is allocated every time through the while loop.
So yes, it looks like the old code, my patch, and your patch would all leak
if there was an error followed by other data in the response.
Though I doubt the kernel would ever do that.

The only user of iov style messages to the kernel is in tc batching.
My gut feeling is that if one message in batch has error, then
the netlink code should return that error and stop processing more.
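
For reference, the bug pattern from the quoted commit message (casting
the buffer to a header pointer, freeing the buffer, then reading
through the pointer) is avoided by copying out what is needed before
the free. A minimal sketch with hypothetical names (struct msg_hdr
stands in for struct nlmsghdr; this is not the libnetlink code):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical minimal header; the real one is struct nlmsghdr. */
struct msg_hdr {
	uint32_t len;
	uint32_t seq;
};

/*
 * Consume a heap buffer that starts with a header and report whether
 * its sequence number matches. The header is copied out before free(),
 * so no pointer into the buffer is dereferenced afterwards.
 */
static int consume_and_match(char *buf, uint32_t want_seq)
{
	struct msg_hdr h;

	assert(buf != NULL);
	memcpy(&h, buf, sizeof(h));	/* read everything we need first */
	free(buf);			/* buffer is dead from here on */
	return h.seq == want_seq;
}
```

This covers only the use-after-free; the leak on the 'goto next' path
discussed above is a separate problem.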


Re: [PATCH v3 net-next 0/6] Add support for Lantiq / Intel vrx200 network

2018-09-13 Thread David Miller
From: Hauke Mehrtens 
Date: Sun,  9 Sep 2018 22:16:41 +0200

> This adds basic support for the GSWIP (Gigabit Switch) found in the
> VRX200 SoC.
> There are different versions of this IP core used in different SoCs, but
> this driver has so far only been tested on the VRX200 SoC line; for other
> SoCs this driver probably needs some adaptations to work.
> 
> I also plan to add Layer 2 offloading to the DSA driver and later also
> layer 3 offloading which is supported by the PPE HW block.
> 
> All these patches should go through the net-next tree.
> 
> This depends on the patch "MIPS: lantiq: dma: add dev pointer" which 
> should go into 4.19.

Series applied to net-next, thanks.


Re: [RFC PATCH iproute2-next] man: Add devlink health man page

2018-09-13 Thread Andrew Lunn
> devlink health sensor set pci/:01:00.0 name TX_COMP_ERROR 
>  action reset off action dump on
> Sets TX_COMP_ERROR sensor parameters for a specific device.

> >>This is what I had in mind:
> >>1. command interface error
> >>2. command interface timeout
> >>3. stuck TX queue (like tx_timeout)
> >>4. stuck TX completion queue (driver did not process packets in a reasonable
> >>time period)
> >>5. stuck RX queue
> >>6. RX completion error
> >>7. TX completion error
> >>8. HW / FW catastrophic error report
> >>9. completion queue overrun

> Such issues do exist in production environment, and need to be handled even
> if root cause is a bug which will be fixed in latest release. My feature
> should help developers / administrator to control and recover their live
> systems, by auto correction and logging support.
> Goal is:
> - Provide alert debug information
> - Self healing
> - If problem needs vendor support, provide a way to gather all needed
> debugging information.

So maybe you have the wrong name for this. Health is nice in terms of
Marketing, but we are actually talking about bug recovery.

devlink bug sensor set pci/:01:00.0 name command_interface_error action reset off action dump on
devlink bug sensor set pci/:01:00.0 name command_interface_timeout action reset off action dump on
devlink bug sensor set pci/:01:00.0 name transmit_completion_error action reset off action dump on
devlink bug sensor set pci/:01:00.0 name completion_queue_overrun action reset off action dump on

seems a lot more understandable than:

devlink health set pci/:01:00.0 name TX_COMP_ERROR action reset off action dump on

Andrew


[PATCH] hv_netvsc: fix schedule in RCU context

2018-09-13 Thread Stephen Hemminger
When netvsc device is removed it can call reschedule in RCU context.
This happens because canceling the subchannel setup work could (in theory)
cause a reschedule when manipulating the timer.

To reproduce, run with lockdep enabled kernel and unbind
a network device from hv_netvsc (via sysfs).

[  160.682011] WARNING: suspicious RCU usage
[  160.707466] 4.19.0-rc3-uio+ #2 Not tainted
[  160.709937] -
[  160.712352] ./include/linux/rcupdate.h:302 Illegal context switch in RCU 
read-side critical section!
[  160.723691]
[  160.723691] other info that might help us debug this:
[  160.723691]
[  160.730955]
[  160.730955] rcu_scheduler_active = 2, debug_locks = 1
[  160.762813] 5 locks held by rebind-eth.sh/1812:
[  160.766851]  #0: 8befa37a (sb_writers#6){.+.+}, at: 
vfs_write+0x184/0x1b0
[  160.773416]  #1: b097f236 (>mutex){+.+.}, at: 
kernfs_fop_write+0xe2/0x1a0
[  160.783766]  #2: 41ee6889 (kn->count#3){}, at: 
kernfs_fop_write+0xeb/0x1a0
[  160.787465]  #3: 56d92a74 (>mutex){}, at: 
device_release_driver_internal+0x39/0x250
[  160.816987]  #4: 30f6031e (rcu_read_lock){}, at: 
netvsc_remove+0x1e/0x250 [hv_netvsc]
[  160.828629]
[  160.828629] stack backtrace:
[  160.831966] CPU: 1 PID: 1812 Comm: rebind-eth.sh Not tainted 4.19.0-rc3-uio+ 
#2
[  160.832952] Hardware name: Microsoft Corporation Virtual Machine/Virtual 
Machine, BIOS Hyper-V UEFI Release v1.0 11/26/2012
[  160.832952] Call Trace:
[  160.832952]  dump_stack+0x85/0xcb
[  160.832952]  ___might_sleep+0x1a3/0x240
[  160.832952]  __flush_work+0x57/0x2e0
[  160.832952]  ? __mutex_lock+0x83/0x990
[  160.832952]  ? __kernfs_remove+0x24f/0x2e0
[  160.832952]  ? __kernfs_remove+0x1b2/0x2e0
[  160.832952]  ? mark_held_locks+0x50/0x80
[  160.832952]  ? get_work_pool+0x90/0x90
[  160.832952]  __cancel_work_timer+0x13c/0x1e0
[  160.832952]  ? netvsc_remove+0x1e/0x250 [hv_netvsc]
[  160.832952]  ? __lock_is_held+0x55/0x90
[  160.832952]  netvsc_remove+0x9a/0x250 [hv_netvsc]
[  160.832952]  vmbus_remove+0x26/0x30
[  160.832952]  device_release_driver_internal+0x18a/0x250
[  160.832952]  unbind_store+0xb4/0x180
[  160.832952]  kernfs_fop_write+0x113/0x1a0
[  160.832952]  __vfs_write+0x36/0x1a0
[  160.832952]  ? rcu_read_lock_sched_held+0x6b/0x80
[  160.832952]  ? rcu_sync_lockdep_assert+0x2e/0x60
[  160.832952]  ? __sb_start_write+0x141/0x1a0
[  160.832952]  ? vfs_write+0x184/0x1b0
[  160.832952]  vfs_write+0xbe/0x1b0
[  160.832952]  ksys_write+0x55/0xc0
[  160.832952]  do_syscall_64+0x60/0x1b0
[  160.832952]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  160.832952] RIP: 0033:0x7fe48f4c8154

Resolve this by getting RTNL earlier. This is safe because the subchannel
work queue does trylock on RTNL and will detect the race.

Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/netvsc_drv.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 70921bbe0e28..915fbd66a02b 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -2272,17 +2272,15 @@ static int netvsc_remove(struct hv_device *dev)
 
cancel_delayed_work_sync(&ndev_ctx->dwork);
 
-   rcu_read_lock();
-   nvdev = rcu_dereference(ndev_ctx->nvdev);
-
-   if  (nvdev)
+   rtnl_lock();
+   nvdev = rtnl_dereference(ndev_ctx->nvdev);
+   if (nvdev)
cancel_work_sync(&nvdev->subchan_work);
 
/*
 * Call to the vsc driver to let it know that the device is being
 * removed. Also blocks mtu and channel changes.
 */
-   rtnl_lock();
vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
if (vf_netdev)
netvsc_unregister_vf(vf_netdev);
@@ -2294,7 +2292,6 @@ static int netvsc_remove(struct hv_device *dev)
list_del(&ndev_ctx->list);
 
rtnl_unlock();
-   rcu_read_unlock();
 
hv_set_drvdata(dev, NULL);
 
-- 
2.18.0
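
As a rough userspace illustration of why the trylock in the subchannel work
matters (this is not part of the patch): once the remove path holds the big
lock while it waits for the work to be cancelled, a worker that blocked on the
same lock could never finish. A worker that only tries the lock can back off
instead. The sketch below uses a C11 atomic_flag as a stand-in for RTNL; all
demo_* names are hypothetical and the real driver requeues the work rather
than returning a status.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for RTNL: a single global try-lockable flag. */
static atomic_flag demo_rtnl = ATOMIC_FLAG_INIT;

static bool demo_rtnl_trylock(void)
{
	/* test_and_set returns the previous value: false means we got it. */
	return !atomic_flag_test_and_set_explicit(&demo_rtnl,
						  memory_order_acquire);
}

static void demo_rtnl_unlock(void)
{
	atomic_flag_clear_explicit(&demo_rtnl, memory_order_release);
}

/*
 * Deferred work: never blocks on the lock. Returns true if the work ran,
 * false if the lock was busy and the caller should retry later.
 */
static bool demo_subchan_work(int *setups_done)
{
	if (!demo_rtnl_trylock())
		return false;

	(*setups_done)++;
	demo_rtnl_unlock();
	return true;
}
```

With the lock free, demo_subchan_work() runs and bumps the counter; while a
remover holds the lock, it returns false without touching the counter.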



Re: [PATCH 2/2] net: qcom/emac: add shared mdio bus support

2018-09-13 Thread Timur Tabi

On 9/13/18 7:42 AM, Andrew Lunn wrote:

This is a pretty big patch, and is hard to review. Could you try to
break it up into a number of smaller patches? You could for example
first refactor emacs_phy_config(), without making any functional
changes. Then add the sharing. Maybe do OF and ACPI in different
patches?


Yes, please.


Re: [PATCH net-next RFC] virtio_net: ethtool tx napi configuration

2018-09-13 Thread Willem de Bruijn
> > +static u32 virtnet_get_priv_flags(struct net_device *dev)
> > +{
> > + struct virtnet_info *vi = netdev_priv(dev);
> > + int priv_flags = 0;
> > +
> > + if (vi->sq[0].napi.weight)
> > + priv_flags |= 0x1;
> > +
> > + return priv_flags;
> > +}
>
> Why the use of priv_flags here?  Is there some reason that we don't want
> to use the more simple
>
> static u32 virtnet_get_priv_flags(struct net_device *dev)
> {
> struct virtnet_info *vi = netdev_priv(dev);
>
> if (vi->sq[0].napi.weight)
> return 1;
>
> return 0;
> }

Sure, that's fine, too.

I just wanted to make it explicit that this is one of possibly many
private flags, and only acts on bit 0. If another private flag is added,
the existing code needs little change, just add a branch on another bit.
But either way works.
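
To make the "one of possibly many flags" point concrete, here is a minimal
standalone sketch. It is not the driver code: struct demo_dev stands in for
struct virtnet_info, and the second flag is invented purely to show how
another bit would slot in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits: only bit 0 (tx napi) comes from the thread. */
#define DEMO_PRIV_FLAG_TX_NAPI	(UINT32_C(1) << 0)
#define DEMO_PRIV_FLAG_EXTRA	(UINT32_C(1) << 1)	/* invented for illustration */

/* Stand-in for the driver state; not the real struct virtnet_info. */
struct demo_dev {
	int tx_napi_weight;	/* non-zero means tx napi is on */
	int extra_enabled;	/* hypothetical second feature */
};

/* Accumulate one bit per feature, so a new flag is just one more branch. */
static uint32_t demo_get_priv_flags(const struct demo_dev *dev)
{
	uint32_t priv_flags = 0;

	assert(dev != NULL);
	if (dev->tx_napi_weight)
		priv_flags |= DEMO_PRIV_FLAG_TX_NAPI;
	if (dev->extra_enabled)
		priv_flags |= DEMO_PRIV_FLAG_EXTRA;

	return priv_flags;
}
```

The early-return variant is shorter for a single flag, but it would have to be
restructured as soon as a second bit appears.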


[PATCH v3 09/30] inet: frags: use rhashtables for reassembly units

2018-09-13 Thread Stephen Hemminger
From: Eric Dumazet 

Some applications still rely on IP fragmentation, and to be fair the Linux
reassembly unit does not work under any serious load.

It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)

A work queue is supposed to garbage collect items when the host is under
memory pressure, and to do a hash rebuild, changing the seed used in hash
computations.

This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
occurring every 5 seconds if host is under fire.

Then there is the problem of sharing this hash table for all netns.

It is time to switch to rhashtables, and allocate one of them per netns
to speed up netns dismantle, since this is a critical metric these days.

Lookup is now using RCU. A followup patch will even remove
the refcount hold/release left from prior implementation and save
a couple of atomic operations.

Before this patch, 16 cpus (16 RX queue NIC) could not handle more
than 1 Mpps frags DDOS.

After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
of storage for the fragments (exact number depends on frags being evicted
after timeout)

$ grep FRAG /proc/net/sockstat
FRAG: inuse 1966916 memory 2140004608
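
For readers who want to sanity-check such a line, a small parsing sketch
follows. It assumes only the "FRAG: inuse N memory M" layout shown above; the
demo_* names are hypothetical and this is not how the kernel produces the
numbers.

```c
#include <assert.h>
#include <stdio.h>

struct demo_frag_stat {
	unsigned long long inuse;	/* value printed after "inuse" */
	unsigned long long memory;	/* bytes printed after "memory" */
};

/* Parse one "FRAG:" line; returns 0 on success, -1 if the format differs. */
static int demo_parse_frag_line(const char *line, struct demo_frag_stat *st)
{
	assert(line != NULL && st != NULL);
	if (sscanf(line, "FRAG: inuse %llu memory %llu",
		   &st->inuse, &st->memory) != 2)
		return -1;
	return 0;
}
```

For the figures quoted here, memory / inuse comes out at exactly 1088 bytes
per entry, and the total stays just under 2^31 bytes, which matches the "up to
2GB" remark.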

A followup patch will change the limits for 64bit arches.

Signed-off-by: Eric Dumazet 
Cc: Kirill Tkhai 
Cc: Herbert Xu 
Cc: Florian Westphal 
Cc: Jesper Dangaard Brouer 
Cc: Alexander Aring 
Cc: Stefan Schmidt 
Signed-off-by: David S. Miller 
(cherry picked from commit 648700f76b03b7e8149d13cc2bdb3355035258a9)
---
 Documentation/networking/ip-sysctl.txt  |   7 +-
 include/net/inet_frag.h |  81 +++---
 include/net/ipv6.h  |  16 +-
 net/ieee802154/6lowpan/6lowpan_i.h  |  26 +-
 net/ieee802154/6lowpan/reassembly.c |  91 +++
 net/ipv4/inet_fragment.c| 346 +---
 net/ipv4/ip_fragment.c  | 112 
 net/ipv6/netfilter/nf_conntrack_reasm.c |  51 +---
 net/ipv6/reassembly.c   | 110 
 9 files changed, 266 insertions(+), 574 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index d499676890d8..f23582a3c661 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -134,13 +134,10 @@ min_adv_mss - INTEGER
 IP Fragmentation:
 
 ipfrag_high_thresh - INTEGER
-   Maximum memory used to reassemble IP fragments. When
-   ipfrag_high_thresh bytes of memory is allocated for this purpose,
-   the fragment handler will toss packets until ipfrag_low_thresh
-   is reached. This also serves as a maximum limit to namespaces
-   different from the initial one.
+   Maximum memory used to reassemble IP fragments.
 
 ipfrag_low_thresh - INTEGER
+   (Obsolete since linux-4.17)
Maximum memory used to reassemble IP fragments before the kernel
begins to remove incomplete fragment queues to free up resources.
The kernel still accepts new fragments for defragmentation.
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 69e531ed8189..3fec0d3a0d01 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -2,7 +2,11 @@
 #ifndef __NET_FRAG_H__
 #define __NET_FRAG_H__
 
+#include <linux/rhashtable.h>
+
 struct netns_frags {
+   struct rhashtable   rhashtable ____cacheline_aligned_in_smp;
+
/* Keep atomic mem on separate cachelines in structs that include it */
atomic_t mem ____cacheline_aligned_in_smp;
/* sysctls */
@@ -26,12 +30,30 @@ enum {
INET_FRAG_COMPLETE  = BIT(2),
 };
 
+struct frag_v4_compare_key {
+   __be32  saddr;
+   __be32  daddr;
+   u32 user;
+   u32 vif;
+   __be16  id;
+   u16 protocol;
+};
+
+struct frag_v6_compare_key {
+   struct in6_addr saddr;
+   struct in6_addr daddr;
+   u32 user;
+   __be32  id;
+   u32 iif;
+};
+
 /**
  * struct inet_frag_queue - fragment queue
  *
- * @lock: spinlock protecting the queue
+ * @node: rhash node
+ * @key: keys identifying this frag.
  * @timer: queue expiration timer
- * @list: hash bucket list
+ * @lock: spinlock protecting this frag
  * @refcnt: reference count of the queue
  * @fragments: received fragments head
  * @fragments_tail: received fragments tail
@@ -41,12 +63,16 @@ enum {
  * @flags: fragment queue flags
  * @max_size: maximum received fragment size
  * @net: namespace that this frag belongs to
- * @list_evictor: list of queues to forcefully evict (e.g. due to low memory)
+ * @rcu: rcu head for freeing deferall
  */
 struct inet_frag_queue {
-   spinlock_t  lock;
+   struct rhash_head   node;
+   union {
+   struct frag_v4_compare_key v4;
+   struct frag_v6_compare_key v6;
+   } key;
struct timer_list   timer;
-   struct hlist_node   list;

[PATCH v3 08/30] rhashtable: add schedule points

2018-09-13 Thread Stephen Hemminger
From: Eric Dumazet 

Rehashing and destroying a large hash table takes a lot of time,
and happens in process context. It is safe to add cond_resched()
in rhashtable_rehash_table() and rhashtable_free_and_destroy().

Signed-off-by: Eric Dumazet 
Acked-by: Herbert Xu 
Signed-off-by: David S. Miller 
(cherry picked from commit ae6da1f503abb5a5081f9f6c4a6881de97830f3e)
---
 lib/rhashtable.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 39215c724fc7..cebbcec877d7 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -364,6 +364,7 @@ static int rhashtable_rehash_table(struct rhashtable *ht)
err = rhashtable_rehash_chain(ht, old_hash);
if (err)
return err;
+   cond_resched();
}
 
/* Publish the new table pointer. */
@@ -1073,6 +1074,7 @@ void rhashtable_free_and_destroy(struct rhashtable *ht,
for (i = 0; i < tbl->size; i++) {
struct rhash_head *pos, *next;
 
+   cond_resched();
for (pos = rht_dereference(*rht_bucket(tbl, i), ht),
 next = !rht_is_a_nulls(pos) ?
rht_dereference(pos->next, ht) : NULL;
-- 
2.18.0



[PATCH v3 18/30] inet: frags: get rid of ipfrag_skb_cb/FRAG_CB

2018-09-13 Thread Stephen Hemminger
From: Eric Dumazet 

ip_defrag uses skb->cb[] to store the fragment offset, and unfortunately
this integer is currently in a different cache line than skb->next,
meaning that we use two cache lines per skb when finding the insertion point.

By aliasing skb->ip_defrag_offset and skb->dev, we pack all the fields
in a single cache line and save precious memory bandwidth.

Note that after the fast path added by Changli Gao in commit
d6bebca92c66 ("fragment: add fast path for in-order fragments")
this change won't help the fast path, since we still need
to access prev->len (2nd cache line), but will show great
benefits when slow path is entered, since we perform
a linear scan of a potentially long list.

Also, note that this potential long list is an attack vector,
we might consider also using an rb-tree there eventually.
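
The aliasing described above can be sketched in plain C11 as follows. This is
a simplified model, not the real struct sk_buff: struct demo_skb keeps only
the fields needed to show that the offset reuses the storage of the device
pointer, and the compiler barrier used in the patch is not modelled.

```c
#include <assert.h>
#include <stddef.h>

struct demo_netdev {
	int ifindex;
};

/* Stand-in for the relevant part of struct sk_buff: two fields share storage. */
struct demo_skb {
	struct demo_skb *next;
	union {
		struct demo_netdev *dev;
		int ip_defrag_offset;
	};
	unsigned int len;
};

/*
 * Record the offset in the storage that used to hold skb->dev.
 * Anything needed from dev must be read first, since the pointer is lost.
 * Returns the interface index, or 0 if there was no device.
 */
static int demo_store_offset(struct demo_skb *skb, int offset)
{
	int iif = 0;

	assert(skb != NULL);
	if (skb->dev)
		iif = skb->dev->ifindex;
	skb->ip_defrag_offset = offset;	/* overwrites skb->dev */
	return iif;
}
```

The ordering is the whole point: the interface index has to be saved before
the offset is written, which is why the patch moves the qp->iif assignment
ahead of the store.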

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit bf66337140c64c27fa37222b7abca7e49d63fb57)
---
 include/linux/skbuff.h |  1 +
 net/ipv4/ip_fragment.c | 35 ++-
 2 files changed, 15 insertions(+), 21 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6dd77767fd5b..f4749678b7ee 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -678,6 +678,7 @@ struct sk_buff {
 * UDP receive path is one user.
 */
unsigned long   dev_scratch;
+   int ip_defrag_offset;
};
/*
 * This is the control buffer. It is free to use for every
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 88fa8ffc5558..5331a0d68374 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -57,14 +57,6 @@
  */
 static const char ip_frag_cache_name[] = "ip4-frags";
 
-struct ipfrag_skb_cb
-{
-   struct inet_skb_parm h;
-   int offset;
-};
-
-#define FRAG_CB(skb)   ((struct ipfrag_skb_cb *)((skb)->cb))
-
 /* Describe an entry in the "incomplete datagrams" queue. */
 struct ipq {
struct inet_frag_queue q;
@@ -353,13 +345,13 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 * this fragment, right?
 */
prev = qp->q.fragments_tail;
-   if (!prev || FRAG_CB(prev)->offset < offset) {
+   if (!prev || prev->ip_defrag_offset < offset) {
next = NULL;
goto found;
}
prev = NULL;
for (next = qp->q.fragments; next != NULL; next = next->next) {
-   if (FRAG_CB(next)->offset >= offset)
+   if (next->ip_defrag_offset >= offset)
break;  /* bingo! */
prev = next;
}
@@ -370,7 +362,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 * any overlaps are eliminated.
 */
if (prev) {
-   int i = (FRAG_CB(prev)->offset + prev->len) - offset;
+   int i = (prev->ip_defrag_offset + prev->len) - offset;
 
if (i > 0) {
offset += i;
@@ -387,8 +379,8 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 
err = -ENOMEM;
 
-   while (next && FRAG_CB(next)->offset < end) {
-   int i = end - FRAG_CB(next)->offset; /* overlap is 'i' bytes */
+   while (next && next->ip_defrag_offset < end) {
+   int i = end - next->ip_defrag_offset; /* overlap is 'i' bytes */
 
if (i < next->len) {
int delta = -next->truesize;
@@ -401,7 +393,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
delta += next->truesize;
if (delta)
add_frag_mem_limit(qp->q.net, delta);
-   FRAG_CB(next)->offset += i;
+   next->ip_defrag_offset += i;
qp->q.meat -= i;
if (next->ip_summed != CHECKSUM_UNNECESSARY)
next->ip_summed = CHECKSUM_NONE;
@@ -425,7 +417,13 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
}
}
 
-   FRAG_CB(skb)->offset = offset;
+   /* Note : skb->ip_defrag_offset and skb->dev share the same location */
+   dev = skb->dev;
+   if (dev)
+   qp->iif = dev->ifindex;
+   /* Makes sure compiler wont do silly aliasing games */
+   barrier();
+   skb->ip_defrag_offset = offset;
 
/* Insert this fragment in the chain of fragments. */
skb->next = next;
@@ -436,11 +434,6 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
else
qp->q.fragments = skb;
 
-   dev = skb->dev;
-   if (dev) {
-   qp->iif = dev->ifindex;
-   skb->dev = NULL;
-   }
qp->q.stamp = skb->tstamp;
qp->q.meat += skb->len;
qp->ecn |= ecn;
@@ -516,7 +509,7 @@ static int 

[PATCH v3 14/30] inet: frags: do not clone skb in ip_expire()

2018-09-13 Thread Stephen Hemminger
From: Eric Dumazet 

An skb_clone() was added in commit ec4fbd64751d ("inet: frag: release
spinlock before calling icmp_send()")

While fixing the bug at that time, it also added a very high cost
for DDOS frags, as the ICMP rate limit is applied after this
expensive operation (skb_clone() + consume_skb(), implying memory
allocations, copy, and freeing)

We can use skb_get(head) here; all we want is to make sure the skb won't
be freed by another cpu.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit 1eec5d5670084ee644597bd26c25e22c69b9f748)
---
 net/ipv4/ip_fragment.c | 16 ++--
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index dc3ed0ac4c58..88fa8ffc5558 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -143,8 +143,8 @@ static bool frag_expire_skip_icmp(u32 user)
 static void ip_expire(struct timer_list *t)
 {
struct inet_frag_queue *frag = from_timer(frag, t, timer);
-   struct sk_buff *clone, *head;
const struct iphdr *iph;
+   struct sk_buff *head;
struct net *net;
struct ipq *qp;
int err;
@@ -187,16 +187,12 @@ static void ip_expire(struct timer_list *t)
(skb_rtable(head)->rt_type != RTN_LOCAL))
goto out;
 
-   clone = skb_clone(head, GFP_ATOMIC);
+   skb_get(head);
+   spin_unlock(&qp->q.lock);
+   icmp_send(head, ICMP_TIME_EXCEEDED, ICMP_EXC_FRAGTIME, 0);
+   kfree_skb(head);
+   goto out_rcu_unlock;
 
-   /* Send an ICMP "Fragment Reassembly Timeout" message. */
-   if (clone) {
-   spin_unlock(&qp->q.lock);
-   icmp_send(clone, ICMP_TIME_EXCEEDED,
- ICMP_EXC_FRAGTIME, 0);
-   consume_skb(clone);
-   goto out_rcu_unlock;
-   }
 out:
spin_unlock(&qp->q.lock);
 out_rcu_unlock:
-- 
2.18.0



[PATCH v3 02/30] inet: frags: add a pointer to struct netns_frags

2018-09-13 Thread Stephen Hemminger
From: Eric Dumazet 

In order to simplify the API, add a pointer to struct inet_frags.
This will allow us to make things less complex.

These functions no longer have a struct inet_frags parameter :

inet_frag_destroy(struct inet_frag_queue *q  /*, struct inet_frags *f */)
inet_frag_put(struct inet_frag_queue *q /*, struct inet_frags *f */)
inet_frag_kill(struct inet_frag_queue *q /*, struct inet_frags *f */)
inet_frags_exit_net(struct netns_frags *nf /*, struct inet_frags *f */)
ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)
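
A small sketch of why the back-pointer lets the extra parameter go away. The
demo_* types are hypothetical stand-ins for inet_frags, netns_frags and
inet_frag_queue, reduced to the fields needed for the illustration.

```c
#include <assert.h>
#include <stddef.h>

struct demo_queue;

/* Per-protocol operations, standing in for struct inet_frags. */
struct demo_frags {
	void (*destructor)(struct demo_queue *q);
};

/* Per-namespace state, standing in for struct netns_frags. */
struct demo_netns_frags {
	int high_thresh;
	struct demo_frags *f;	/* the back-pointer this patch adds */
};

struct demo_queue {
	struct demo_netns_frags *net;
	int destroyed;
};

/* One argument is enough: the ops are reachable through q->net->f. */
static void demo_frag_destroy(struct demo_queue *q)
{
	assert(q != NULL && q->net != NULL && q->net->f != NULL);
	if (q->net->f->destructor)
		q->net->f->destructor(q);
}

/* Example destructor used to observe that the ops were found. */
static void demo_mark_destroyed(struct demo_queue *q)
{
	q->destroyed = 1;
}
```

Callers no longer need to know which protocol's ops table applies; the queue
carries that through its namespace pointer.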

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit 093ba72914b696521e4885756a68a3332782c8de)
---
 include/net/inet_frag.h | 11 ++-
 include/net/ipv6.h  |  3 +--
 net/ieee802154/6lowpan/reassembly.c | 13 +++--
 net/ipv4/inet_fragment.c| 17 ++---
 net/ipv4/ip_fragment.c  |  9 +
 net/ipv6/netfilter/nf_conntrack_reasm.c | 16 +---
 net/ipv6/reassembly.c   | 20 ++--
 7 files changed, 48 insertions(+), 41 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 2ad894e446ac..fd338293a095 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -10,6 +10,7 @@ struct netns_frags {
int high_thresh;
int low_thresh;
int max_dist;
+   struct inet_frags   *f;
 };
 
 /**
@@ -109,20 +110,20 @@ static inline int inet_frags_init_net(struct netns_frags 
*nf)
atomic_set(&nf->mem, 0);
return 0;
 }
-void inet_frags_exit_net(struct netns_frags *nf, struct inet_frags *f);
+void inet_frags_exit_net(struct netns_frags *nf);
 
-void inet_frag_kill(struct inet_frag_queue *q, struct inet_frags *f);
-void inet_frag_destroy(struct inet_frag_queue *q, struct inet_frags *f);
+void inet_frag_kill(struct inet_frag_queue *q);
+void inet_frag_destroy(struct inet_frag_queue *q);
 struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
struct inet_frags *f, void *key, unsigned int hash);
 
 void inet_frag_maybe_warn_overflow(struct inet_frag_queue *q,
   const char *prefix);
 
-static inline void inet_frag_put(struct inet_frag_queue *q, struct inet_frags 
*f)
+static inline void inet_frag_put(struct inet_frag_queue *q)
 {
if (refcount_dec_and_test(&q->refcnt))
-   inet_frag_destroy(q, f);
+   inet_frag_destroy(q);
 }
 
 static inline bool inet_frag_evicting(struct inet_frag_queue *q)
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index f280c61e019a..ff8407b19d05 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -560,8 +560,7 @@ struct frag_queue {
u8  ecn;
 };
 
-void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq,
-  struct inet_frags *frags);
+void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq);
 
 static inline bool ipv6_addr_any(const struct in6_addr *a)
 {
diff --git a/net/ieee802154/6lowpan/reassembly.c 
b/net/ieee802154/6lowpan/reassembly.c
index 9757ce6c077a..9ccb8458b5c3 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -93,10 +93,10 @@ static void lowpan_frag_expire(unsigned long data)
if (fq->q.flags & INET_FRAG_COMPLETE)
goto out;
 
-   inet_frag_kill(&fq->q, &lowpan_frags);
+   inet_frag_kill(&fq->q);
 out:
spin_unlock(&fq->q.lock);
-   inet_frag_put(&fq->q, &lowpan_frags);
+   inet_frag_put(&fq->q);
 }
 
 static inline struct lowpan_frag_queue *
@@ -229,7 +229,7 @@ static int lowpan_frag_reasm(struct lowpan_frag_queue *fq, 
struct sk_buff *prev,
struct sk_buff *fp, *head = fq->q.fragments;
int sum_truesize;
 
-   inet_frag_kill(&fq->q, &lowpan_frags);
+   inet_frag_kill(&fq->q);
 
/* Make the one we just received the head. */
if (prev) {
@@ -437,7 +437,7 @@ int lowpan_frag_rcv(struct sk_buff *skb, u8 frag_type)
ret = lowpan_frag_queue(fq, skb, frag_type);
spin_unlock(&fq->q.lock);
 
-   inet_frag_put(&fq->q, &lowpan_frags);
+   inet_frag_put(&fq->q);
return ret;
}
 
@@ -585,13 +585,14 @@ static int __net_init lowpan_frags_init_net(struct net 
*net)
ieee802154_lowpan->frags.high_thresh = IPV6_FRAG_HIGH_THRESH;
ieee802154_lowpan->frags.low_thresh = IPV6_FRAG_LOW_THRESH;
ieee802154_lowpan->frags.timeout = IPV6_FRAG_TIMEOUT;
+   ieee802154_lowpan->frags.f = &lowpan_frags;
 
res = inet_frags_init_net(&ieee802154_lowpan->frags);
if (res < 0)
return res;
res = lowpan_frags_ns_sysctl_register(net);
if (res < 0)
-   inet_frags_exit_net(&ieee802154_lowpan->frags, &lowpan_frags);
+   inet_frags_exit_net(&ieee802154_lowpan->frags);
return res;
 }
 
@@ -601,7 +602,7 @@ static void __net_exit 

[PATCH v3 22/30] net: modify skb_rbtree_purge to return the truesize of all purged skbs.

2018-09-13 Thread Stephen Hemminger
From: Peter Oskolkov 

Tested: see the next patch in the series.

Suggested-by: Eric Dumazet 
Signed-off-by: Peter Oskolkov 
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 
Signed-off-by: David S. Miller 
(cherry picked from commit 385114dec8a49b5e5945e77ba7de6356106713f4)
---
 include/linux/skbuff.h | 2 +-
 net/core/skbuff.c  | 6 +-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f4749678b7ee..9c8457375aee 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2581,7 +2581,7 @@ static inline void __skb_queue_purge(struct sk_buff_head 
*list)
kfree_skb(skb);
 }
 
-void skb_rbtree_purge(struct rb_root *root);
+unsigned int skb_rbtree_purge(struct rb_root *root);
 
 void *netdev_alloc_frag(unsigned int fragsz);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c7c5f05f2af1..8fd690def5c1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2842,23 +2842,27 @@ EXPORT_SYMBOL(skb_queue_purge);
 /**
  * skb_rbtree_purge - empty a skb rbtree
  * @root: root of the rbtree to empty
+ * Return value: the sum of truesizes of all purged skbs.
  *
  * Delete all buffers on an &sk_buff rbtree. Each buffer is removed from
  * the list and one reference dropped. This function does not take
  * any lock. Synchronization should be handled by the caller (e.g., TCP
  * out-of-order queue is protected by the socket lock).
  */
-void skb_rbtree_purge(struct rb_root *root)
+unsigned int skb_rbtree_purge(struct rb_root *root)
 {
struct rb_node *p = rb_first(root);
+   unsigned int sum = 0;
 
while (p) {
struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
 
p = rb_next(p);
rb_erase(&skb->rbnode, root);
+   sum += skb->truesize;
kfree_skb(skb);
}
+   return sum;
 }
 
 /**
-- 
2.18.0



[PATCH v3 19/30] inet: frags: fix ip6frag_low_thresh boundary

2018-09-13 Thread Stephen Hemminger
From: Eric Dumazet 

Giving an integer to proc_doulongvec_minmax() is dangerous on 64bit arches,
since linker might place next to it a non zero value preventing a change
to ip6frag_low_thresh.

ip6frag_low_thresh is not used anymore in the kernel, but we do not
want to prematurely break user scripts wanting to change it.

Since specifying a minimal value of 0 for proc_doulongvec_minmax()
is moot, let's remove these zero values in all defrag units.

Fixes: 6e00f7dd5e4e ("ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh")
Signed-off-by: Eric Dumazet 
Reported-by: Maciej Żenczykowski 
Signed-off-by: David S. Miller 
(cherry picked from commit 3d23401283e80ceb03f765842787e0e79ff598b7)
---
 net/ieee802154/6lowpan/reassembly.c | 2 --
 net/ipv4/ip_fragment.c  | 5 ++---
 net/ipv6/netfilter/nf_conntrack_reasm.c | 2 --
 net/ipv6/reassembly.c   | 4 +---
 4 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/net/ieee802154/6lowpan/reassembly.c 
b/net/ieee802154/6lowpan/reassembly.c
index 44f148a6bb57..1790b65944b3 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -411,7 +411,6 @@ int lowpan_frag_rcv(struct sk_buff *skb, u8 frag_type)
 }
 
 #ifdef CONFIG_SYSCTL
-static long zero;
 
 static struct ctl_table lowpan_frags_ns_ctl_table[] = {
{
@@ -428,7 +427,6 @@ static struct ctl_table lowpan_frags_ns_ctl_table[] = {
.maxlen = sizeof(unsigned long),
.mode   = 0644,
.proc_handler   = proc_doulongvec_minmax,
-   .extra1 = &zero,
.extra2 = &init_net.ieee802154_lowpan.frags.high_thresh
},
{
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 5331a0d68374..d14d741fb05e 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -672,7 +672,7 @@ struct sk_buff *ip_check_defrag(struct net *net, struct 
sk_buff *skb, u32 user)
 EXPORT_SYMBOL(ip_check_defrag);
 
 #ifdef CONFIG_SYSCTL
-static long zero;
+static int dist_min;
 
 static struct ctl_table ip4_frags_ns_ctl_table[] = {
{
@@ -689,7 +689,6 @@ static struct ctl_table ip4_frags_ns_ctl_table[] = {
.maxlen = sizeof(unsigned long),
.mode   = 0644,
.proc_handler   = proc_doulongvec_minmax,
-   .extra1 = &zero,
.extra2 = &init_net.ipv4.frags.high_thresh
},
{
@@ -705,7 +704,7 @@ static struct ctl_table ip4_frags_ns_ctl_table[] = {
.maxlen = sizeof(int),
.mode   = 0644,
.proc_handler   = proc_dointvec_minmax,
-   .extra1 = &zero
+   .extra1 = &dist_min,
},
{ }
 };
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c 
b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 6613f81e553a..a1dc0d6a5949 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -63,7 +63,6 @@ struct nf_ct_frag6_skb_cb
 static struct inet_frags nf_frags;
 
 #ifdef CONFIG_SYSCTL
-static long zero;
 
 static struct ctl_table nf_ct_frag6_sysctl_table[] = {
{
@@ -79,7 +78,6 @@ static struct ctl_table nf_ct_frag6_sysctl_table[] = {
.maxlen = sizeof(unsigned long),
.mode   = 0644,
.proc_handler   = proc_doulongvec_minmax,
-   .extra1 = &zero,
.extra2 = &init_net.nf_frag.frags.high_thresh
},
{
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 2127da130dc2..e1c5fa5e3873 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -554,7 +554,6 @@ static const struct inet6_protocol frag_protocol = {
 };
 
 #ifdef CONFIG_SYSCTL
-static int zero;
 
 static struct ctl_table ip6_frags_ns_ctl_table[] = {
{
@@ -570,8 +569,7 @@ static struct ctl_table ip6_frags_ns_ctl_table[] = {
.data   = &init_net.ipv6.frags.low_thresh,
.maxlen = sizeof(unsigned long),
.mode   = 0644,
-   .proc_handler   = proc_dointvec_minmax,
-   .extra1 = &zero,
+   .proc_handler   = proc_doulongvec_minmax,
.extra2 = &init_net.ipv6.frags.high_thresh
},
{
-- 
2.18.0



[PATCH v3 12/30] inet: frags: remove inet_frag_maybe_warn_overflow()

2018-09-13 Thread Stephen Hemminger
From: Eric Dumazet 

This function is obsolete, after rhashtable addition to inet defrag.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit 2d44ed22e607f9a285b049de2263e3840673a260)
---
 include/net/inet_frag.h |  2 --
 net/ieee802154/6lowpan/reassembly.c |  5 ++---
 net/ipv4/inet_fragment.c| 11 ---
 net/ipv4/ip_fragment.c  |  5 ++---
 net/ipv6/netfilter/nf_conntrack_reasm.c |  5 ++---
 net/ipv6/reassembly.c   |  5 ++---
 6 files changed, 8 insertions(+), 25 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 0e8e159d88f7..95e353e3305b 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -110,8 +110,6 @@ void inet_frags_exit_net(struct netns_frags *nf);
 void inet_frag_kill(struct inet_frag_queue *q);
 void inet_frag_destroy(struct inet_frag_queue *q);
 struct inet_frag_queue *inet_frag_find(struct netns_frags *nf, void *key);
-void inet_frag_maybe_warn_overflow(struct inet_frag_queue *q,
-  const char *prefix);
 
 static inline void inet_frag_put(struct inet_frag_queue *q)
 {
diff --git a/net/ieee802154/6lowpan/reassembly.c 
b/net/ieee802154/6lowpan/reassembly.c
index 0fa0121f85d4..1aec71a3f904 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -84,10 +84,9 @@ fq_find(struct net *net, const struct lowpan_802154_cb *cb,
struct inet_frag_queue *q;
 
q = inet_frag_find(&ieee802154_lowpan->frags, &key);
-   if (IS_ERR_OR_NULL(q)) {
-   inet_frag_maybe_warn_overflow(q, pr_fmt());
+   if (!q)
return NULL;
-   }
+
return container_of(q, struct lowpan_frag_queue, q);
 }
 
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index ebb8f411e0db..c9e35b81d093 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -218,14 +218,3 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags 
*nf, void *key)
return inet_frag_create(nf, key);
 }
 EXPORT_SYMBOL(inet_frag_find);
-
-void inet_frag_maybe_warn_overflow(struct inet_frag_queue *q,
-  const char *prefix)
-{
-   static const char msg[] = "inet_frag_find: Fragment hash bucket"
-   " list length grew over limit. Dropping fragment.\n";
-
-   if (PTR_ERR(q) == -ENOBUFS)
-   net_dbg_ratelimited("%s%s", prefix, msg);
-}
-EXPORT_SYMBOL(inet_frag_maybe_warn_overflow);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 1222aee3e5ee..38cbf56bb48e 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -221,10 +221,9 @@ static struct ipq *ip_find(struct net *net, struct iphdr 
*iph,
struct inet_frag_queue *q;
 
q = inet_frag_find(&net->ipv4.frags, &key);
-   if (IS_ERR_OR_NULL(q)) {
-   inet_frag_maybe_warn_overflow(q, pr_fmt());
+   if (!q)
return NULL;
-   }
+
return container_of(q, struct ipq, q);
 }
 
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c 
b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 8b12431ae296..54ce1d2a9a9d 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -178,10 +178,9 @@ static struct frag_queue *fq_find(struct net *net, __be32 
id, u32 user,
struct inet_frag_queue *q;
 
q = inet_frag_find(&net->nf_frag.frags, &key);
-   if (IS_ERR_OR_NULL(q)) {
-   inet_frag_maybe_warn_overflow(q, pr_fmt());
+   if (!q)
return NULL;
-   }
+
return container_of(q, struct frag_queue, q);
 }
 
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 70acad126d04..2a77fda5e3bc 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -155,10 +155,9 @@ fq_find(struct net *net, __be32 id, const struct ipv6hdr 
*hdr, int iif)
key.iif = 0;
 
q = inet_frag_find(&net->ipv6.frags, &key);
-   if (IS_ERR_OR_NULL(q)) {
-   inet_frag_maybe_warn_overflow(q, pr_fmt());
+   if (!q)
return NULL;
-   }
+
return container_of(q, struct frag_queue, q);
 }
 
-- 
2.18.0



[PATCH v3 04/30] inet: frags: Convert timers to use timer_setup()

2018-09-13 Thread Stephen Hemminger
From: Kees Cook 

In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.
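
The pattern can be sketched outside the kernel as follows. This is an
illustration under stated assumptions, not the kernel API: demo_container_of
mirrors what from_timer() does conceptually, and struct demo_timer and the
other demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of(). */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_timer {
	void (*function)(struct demo_timer *t);
};

struct demo_frag_queue {
	int id;
	int expired;
	struct demo_timer timer;	/* embedded in the owning object */
};

/* New-style callback: receives the timer, recovers the enclosing object. */
static void demo_frag_expire(struct demo_timer *t)
{
	struct demo_frag_queue *q =
		demo_container_of(t, struct demo_frag_queue, timer);

	q->expired = 1;
}

static void demo_timer_setup(struct demo_timer *t,
			     void (*fn)(struct demo_timer *))
{
	assert(t != NULL && fn != NULL);
	t->function = fn;
}

/* Simulate expiry: the callback is handed the timer itself. */
static void demo_timer_fire(struct demo_timer *t)
{
	t->function(t);
}
```

Because the callback derives its context from the timer's address, no
unsigned long cookie has to be stored and cast back, which is what the
conversions in this patch remove.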

Cc: Alexander Aring 
Cc: Stefan Schmidt 
Cc: "David S. Miller" 
Cc: Alexey Kuznetsov 
Cc: Hideaki YOSHIFUJI 
Cc: Pablo Neira Ayuso 
Cc: Jozsef Kadlecsik 
Cc: Florian Westphal 
Cc: linux-w...@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: netfilter-de...@vger.kernel.org
Cc: coret...@netfilter.org
Signed-off-by: Kees Cook 
Acked-by: Stefan Schmidt  # for ieee802154
Signed-off-by: David S. Miller 
(cherry picked from commit 78802011fbe34331bdef6f2dfb1634011f0e4c32)
---
 include/net/inet_frag.h | 2 +-
 net/ieee802154/6lowpan/reassembly.c | 5 +++--
 net/ipv4/inet_fragment.c| 4 ++--
 net/ipv4/ip_fragment.c  | 5 +++--
 net/ipv6/netfilter/nf_conntrack_reasm.c | 5 +++--
 net/ipv6/reassembly.c   | 5 +++--
 6 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index fd338293a095..69e531ed8189 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -97,7 +97,7 @@ struct inet_frags {
void(*constructor)(struct inet_frag_queue *q,
   const void *arg);
void(*destructor)(struct inet_frag_queue *);
-   void(*frag_expire)(unsigned long data);
+   void(*frag_expire)(struct timer_list *t);
struct kmem_cache   *frags_cachep;
const char  *frags_cache_name;
 };
diff --git a/net/ieee802154/6lowpan/reassembly.c 
b/net/ieee802154/6lowpan/reassembly.c
index 9ccb8458b5c3..6badc05b 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -80,12 +80,13 @@ static void lowpan_frag_init(struct inet_frag_queue *q, 
const void *a)
fq->daddr = *arg->dst;
 }
 
-static void lowpan_frag_expire(unsigned long data)
+static void lowpan_frag_expire(struct timer_list *t)
 {
+   struct inet_frag_queue *frag = from_timer(frag, t, timer);
struct frag_queue *fq;
struct net *net;
 
-   fq = container_of((struct inet_frag_queue *)data, struct frag_queue, q);
+   fq = container_of(frag, struct frag_queue, q);
net = container_of(fq->q.net, struct net, ieee802154_lowpan.frags);
 
spin_lock(&fq->q.lock);
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 4b44f973c37f..97e747b1e9a0 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -150,7 +150,7 @@ inet_evict_bucket(struct inet_frags *f, struct 
inet_frag_bucket *hb)
spin_unlock(&hb->chain_lock);
 
hlist_for_each_entry_safe(fq, n, , list_evictor)
-   f->frag_expire((unsigned long) fq);
+   f->frag_expire(&fq->timer);
 
return evicted;
 }
@@ -367,7 +367,7 @@ static struct inet_frag_queue *inet_frag_alloc(struct 
netns_frags *nf,
f->constructor(q, arg);
add_frag_mem_limit(nf, f->qsize);
 
-   setup_timer(&q->timer, f->frag_expire, (unsigned long)q);
+   timer_setup(&q->timer, f->frag_expire, 0);
spin_lock_init(&q->lock);
refcount_set(&q->refcnt, 1);
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 9d0b08c8ee00..5171c8cc0eb6 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -191,12 +191,13 @@ static bool frag_expire_skip_icmp(u32 user)
 /*
  * Oops, a fragment queue timed out.  Kill it and send an ICMP reply.
  */
-static void ip_expire(unsigned long arg)
+static void ip_expire(struct timer_list *t)
 {
+   struct inet_frag_queue *frag = from_timer(frag, t, timer);
struct ipq *qp;
struct net *net;
 
-   qp = container_of((struct inet_frag_queue *) arg, struct ipq, q);
+   qp = container_of(frag, struct ipq, q);
net = container_of(qp->q.net, struct net, ipv4.frags);
 
rcu_read_lock();
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c 
b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 7ea2b4490672..bc776ef392ea 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -169,12 +169,13 @@ static unsigned int nf_hashfn(const struct 
inet_frag_queue *q)
return nf_hash_frag(nq->id, &nq->saddr, &nq->daddr);
 }
 
-static void nf_ct_frag6_expire(unsigned long data)
+static void nf_ct_frag6_expire(struct timer_list *t)
 {
+   struct inet_frag_queue *frag = from_timer(frag, t, timer);
struct frag_queue *fq;
struct net *net;
 
-   fq = container_of((struct inet_frag_queue *)data, struct frag_queue, q);
+   fq = container_of(frag, struct frag_queue, q);
net = container_of(fq->q.net, struct net, nf_frag.frags);
 
ip6_expire_frag_queue(net, fq);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 26f737c3fc7b..b85ef051b75c 100644
--- 

[PATCH v3 23/30] ipv6: defrag: drop non-last frags smaller than min mtu

2018-09-13 Thread Stephen Hemminger
From: Florian Westphal 

don't bother with pathological cases, they only waste cycles.
IPv6 requires a minimum MTU of 1280 so we should never see fragments
smaller than this (except last frag).

v3: don't use awkward "-offset + len"
v2: drop IPv4 part, which added same check w. IPV4_MIN_MTU (68).
There were concerns that there could be even smaller frags
generated by intermediate nodes, e.g. on radio networks.
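
A minimal userspace sketch of the check described above, for illustration only. The helper name frag_too_small() and its parameters are hypothetical; the two constants mirror the kernel's IPV6_MIN_MTU and IP6_MF values, and the fragment-header field is assumed to be passed in network byte order as it appears on the wire.

```c
#include <arpa/inet.h>   /* htons() */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IPV6_MIN_MTU 1280    /* minimum link MTU required by IPv6 */
#define IP6_MF       0x0001  /* "more fragments" bit in frag_off */

/*
 * Hypothetical helper, not a kernel function.
 * net_len:     bytes from the start of the IPv6 header to the end of the packet
 * frag_off_be: fragment header offset/flags field, network byte order
 *
 * Returns true when the fragment is shorter than the minimum MTU and is
 * not the last fragment, i.e. the case the patch drops.
 */
static bool frag_too_small(unsigned int net_len, uint16_t frag_off_be)
{
	return net_len < IPV6_MIN_MTU && (frag_off_be & htons(IP6_MF));
}
```

A short last fragment (MF clear) still passes, since the tail of a datagram can legitimately be smaller than 1280 bytes.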

Cc: Peter Oskolkov 
Cc: Eric Dumazet 
Signed-off-by: Florian Westphal 
Signed-off-by: David S. Miller 
(cherry picked from commit 0ed4229b08c13c84a3c301a08defdc9e7f4467e6)
---
 net/ipv6/netfilter/nf_conntrack_reasm.c | 4 
 net/ipv6/reassembly.c   | 4 
 2 files changed, 8 insertions(+)

diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c 
b/net/ipv6/netfilter/nf_conntrack_reasm.c
index a1dc0d6a5949..1d2f07cde01a 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -565,6 +565,10 @@ int nf_ct_frag6_gather(struct net *net, struct sk_buff 
*skb, u32 user)
hdr = ipv6_hdr(skb);
fhdr = (struct frag_hdr *)skb_transport_header(skb);
 
+   if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&
+   fhdr->frag_off & htons(IP6_MF))
+   return -EINVAL;
+
skb_orphan(skb);
fq = fq_find(net, fhdr->identification, user, hdr,
 skb->dev ? skb->dev->ifindex : 0);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index e1c5fa5e3873..afaad60dc2ac 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -522,6 +522,10 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
return 1;
}
 
+   if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&
+   fhdr->frag_off & htons(IP6_MF))
+   goto fail_hdr;
+
iif = skb->dev ? skb->dev->ifindex : 0;
fq = fq_find(net, fhdr->identification, hdr, iif);
if (fq) {
-- 
2.18.0



[PATCH v3 25/30] net: add rb_to_skb() and other rb tree helpers

2018-09-13 Thread Stephen Hemminger
From: Eric Dumazet 

Generalize private netem_rb_to_skb()

TCP rtx queue will soon be converted to rb-tree,
so we will need skb_rbtree_walk() helpers.
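
The "safe" walk helper added by this patch loads the successor before the loop body runs, so the body may free the current element. A rough userspace sketch of that loop shape follows, using a plain singly linked list instead of an rb-tree; struct node, node_walk_from_safe(), push() and purge_from() are all hypothetical names chosen for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct node {
	int val;
	struct node *next;
};

/*
 * Same shape as skb_rbtree_walk_from_safe(): tmp is assigned in the loop
 * condition, before the body executes, so the body may free n.
 */
#define node_walk_from_safe(n, tmp) \
	for (; tmp = (n) ? (n)->next : NULL, ((n) != NULL); n = tmp)

/* Prepend a new node holding val; returns the new list head. */
static struct node *push(struct node *head, int val)
{
	struct node *n = malloc(sizeof(*n));

	if (!n)
		abort();
	n->val = val;
	n->next = head;
	return n;
}

/* Free every node starting at n; return the sum of the freed values. */
static int purge_from(struct node *n)
{
	struct node *tmp;
	int sum = 0;

	node_walk_from_safe(n, tmp) {
		sum += n->val;
		free(n);	/* safe: the successor is already in tmp */
	}
	return sum;
}
```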

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit 18a4c0eab2623cc95be98a1e6af1ad18e7695977)
---
 include/linux/skbuff.h  | 18 ++
 net/ipv4/tcp_fastopen.c |  8 +++-
 net/ipv4/tcp_input.c| 33 -
 net/sched/sch_netem.c   | 14 --
 4 files changed, 37 insertions(+), 36 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 758084b434c8..2837e55df03e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3169,6 +3169,12 @@ static inline int __skb_grow_rcsum(struct sk_buff *skb, 
unsigned int len)
 
 #define rb_to_skb(rb) rb_entry_safe(rb, struct sk_buff, rbnode)
 
+#define rb_to_skb(rb) rb_entry_safe(rb, struct sk_buff, rbnode)
+#define skb_rb_first(root) rb_to_skb(rb_first(root))
+#define skb_rb_last(root)  rb_to_skb(rb_last(root))
+#define skb_rb_next(skb)   rb_to_skb(rb_next(&(skb)->rbnode))
+#define skb_rb_prev(skb)   rb_to_skb(rb_prev(&(skb)->rbnode))
+
 #define skb_queue_walk(queue, skb) \
for (skb = (queue)->next; \
 skb != (struct sk_buff *)(queue); \
@@ -3183,6 +3189,18 @@ static inline int __skb_grow_rcsum(struct sk_buff *skb, unsigned int len)
for (; skb != (struct sk_buff *)(queue); \
 skb = skb->next)
 
+#define skb_rbtree_walk(skb, root) \
+   for (skb = skb_rb_first(root); skb != NULL; \
+skb = skb_rb_next(skb))
+
+#define skb_rbtree_walk_from(skb) \
+   for (; skb != NULL; \
+skb = skb_rb_next(skb))
+
+#define skb_rbtree_walk_from_safe(skb, tmp) \
+   for (; tmp = skb ? skb_rb_next(skb) : NULL, (skb != NULL); \
+skb = tmp)
+
 #define skb_queue_walk_from_safe(queue, skb, tmp) \
for (tmp = skb->next; \
 skb != (struct sk_buff *)(queue); \
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index fbbeda647774..0567edb76522 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -458,17 +458,15 @@ bool tcp_fastopen_active_should_disable(struct sock *sk)
 void tcp_fastopen_active_disable_ofo_check(struct sock *sk)
 {
struct tcp_sock *tp = tcp_sk(sk);
-   struct rb_node *p;
-   struct sk_buff *skb;
struct dst_entry *dst;
+   struct sk_buff *skb;
 
if (!tp->syn_fastopen)
return;
 
if (!tp->data_segs_in) {
-   p = rb_first(&tp->out_of_order_queue);
-   if (p && !rb_next(p)) {
-   skb = rb_entry(p, struct sk_buff, rbnode);
+   skb = skb_rb_first(&tp->out_of_order_queue);
+   if (skb && !skb_rb_next(skb)) {
if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
tcp_fastopen_active_disable(sk);
return;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bdabd748f4bc..991f382afc1b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4372,7 +4372,7 @@ static void tcp_ofo_queue(struct sock *sk)
 
p = rb_first(&tp->out_of_order_queue);
while (p) {
-   skb = rb_entry(p, struct sk_buff, rbnode);
+   skb = rb_to_skb(p);
if (after(TCP_SKB_CB(skb)->seq, tp->rcv_nxt))
break;
 
@@ -4440,7 +4440,7 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct 
sk_buff *skb,
 static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 {
struct tcp_sock *tp = tcp_sk(sk);
-   struct rb_node **p, *q, *parent;
+   struct rb_node **p, *parent;
struct sk_buff *skb1;
u32 seq, end_seq;
bool fragstolen;
@@ -4503,7 +4503,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct 
sk_buff *skb)
parent = NULL;
while (*p) {
parent = *p;
-   skb1 = rb_entry(parent, struct sk_buff, rbnode);
+   skb1 = rb_to_skb(parent);
if (before(seq, TCP_SKB_CB(skb1)->seq)) {
p = &parent->rb_left;
continue;
@@ -4548,9 +4548,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct 
sk_buff *skb)
 
 merge_right:
/* Remove other segments covered by skb. */
-   while ((q = rb_next(>rbnode)) != NULL) {
-   skb1 = rb_entry(q, struct sk_buff, rbnode);
-
+   while ((skb1 = 

[PATCH v3 30/30] ip: frags: fix crash in ip_do_fragment()

2018-09-13 Thread Stephen Hemminger
From: Taehee Yoo 

commit 5d407b071dc369c26a38398326ee2be53651cfe4 upstream

A kernel crash occurs when a defragmented packet is fragmented again
in ip_do_fragment().
The defragment routine calls skb_orphan() and then sets
skb->ip_defrag_offset. But skb->sk and skb->ip_defrag_offset are
members of the same union, so frag->sk is no longer NULL.
Hence the skb->sk check in ip_do_fragment() triggers the crash when
the defragmented packet is fragmented.
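
The aliasing described above can be reproduced in a small userspace sketch. struct fake_skb is a hypothetical stand-in that keeps only the two union members in question; the helper names are invented for this illustration and are not kernel functions. The behavior shown assumes a common ABI where a NULL pointer is all-zero bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sock;	/* opaque; only the pointer value matters here */

/* Hypothetical, heavily reduced stand-in for struct sk_buff. */
struct fake_skb {
	union {
		struct sock *sk;
		int ip_defrag_offset;
	};
};

/* Models the "does this skb have a socket?" check in the output path. */
static bool looks_owned(const struct fake_skb *skb)
{
	return skb->sk != NULL;
}

/* Models the defrag path: orphan the skb, then record the offset. */
static void defrag_store_offset(struct fake_skb *skb, int offset)
{
	skb->sk = NULL;			/* state left behind by orphaning */
	skb->ip_defrag_offset = offset;	/* overwrites the same storage */
}

/* Models the fix: clear sk again once reassembly is done. */
static void reasm_clear_sk(struct fake_skb *skb)
{
	skb->sk = NULL;
}
```

With a non-zero offset stored, looks_owned() reports a socket that does not exist; calling reasm_clear_sk() restores the expected NULL.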

test commands:
   %iptables -t nat -I POSTROUTING -j MASQUERADE
   %hping3 192.168.4.2 -s 1000 -p 2000 -d 6

splat looks like:
[  261.069429] kernel BUG at net/ipv4/ip_output.c:636!
[  261.075753] invalid opcode:  [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[  261.083854] CPU: 1 PID: 1349 Comm: hping3 Not tainted 4.19.0-rc2+ #3
[  261.100977] RIP: 0010:ip_do_fragment+0x1613/0x2600
[  261.106945] Code: e8 e2 38 e3 fe 4c 8b 44 24 18 48 8b 74 24 08 e9 92 f6 ff 
ff 80 3c 02 00 0f 85 da 07 00 00 48 8b b5 d0 00 00 00 e9 25 f6 ff ff <0f> 0b 0f 
0b 44 8b 54 24 58 4c 8b 4c 24 18 4c 8b 5c 24 60 4c 8b 6c
[  261.127015] RSP: 0018:8801031cf2c0 EFLAGS: 00010202
[  261.134156] RAX: 11002297537b RBX: ed0020639e6e RCX: 0004
[  261.142156] RDX:  RSI:  RDI: 880114ba9bd8
[  261.150157] RBP: 880114ba8a40 R08: ed0022975395 R09: ed0022975395
[  261.158157] R10: 0001 R11: ed0022975394 R12: 880114ba9ca4
[  261.166159] R13: 0010 R14: 880114ba9bc0 R15: dc00
[  261.174169] FS:  7fbae2199700() GS:88011b40() 
knlGS:
[  261.183012] CS:  0010 DS:  ES:  CR0: 80050033
[  261.189013] CR2: 5579244fe000 CR3: 000119bf4000 CR4: 001006e0
[  261.198158] Call Trace:
[  261.199018]  ? dst_output+0x180/0x180
[  261.205011]  ? save_trace+0x300/0x300
[  261.209018]  ? ip_copy_metadata+0xb00/0xb00
[  261.213034]  ? sched_clock_local+0xd4/0x140
[  261.218158]  ? kill_l4proto+0x120/0x120 [nf_conntrack]
[  261.223014]  ? rt_cpu_seq_stop+0x10/0x10
[  261.227014]  ? find_held_lock+0x39/0x1c0
[  261.233008]  ip_finish_output+0x51d/0xb50
[  261.237006]  ? ip_fragment.constprop.56+0x220/0x220
[  261.243011]  ? nf_ct_l4proto_register_one+0x5b0/0x5b0 [nf_conntrack]
[  261.250152]  ? rcu_is_watching+0x77/0x120
[  261.255010]  ? nf_nat_ipv4_out+0x1e/0x2b0 [nf_nat_ipv4]
[  261.261033]  ? nf_hook_slow+0xb1/0x160
[  261.265007]  ip_output+0x1c7/0x710
[  261.269005]  ? ip_mc_output+0x13f0/0x13f0
[  261.273002]  ? __local_bh_enable_ip+0xe9/0x1b0
[  261.278152]  ? ip_fragment.constprop.56+0x220/0x220
[  261.282996]  ? nf_hook_slow+0xb1/0x160
[  261.287007]  raw_sendmsg+0x21f9/0x4420
[  261.291008]  ? dst_output+0x180/0x180
[  261.297003]  ? sched_clock_cpu+0x126/0x170
[  261.301003]  ? find_held_lock+0x39/0x1c0
[  261.306155]  ? stop_critical_timings+0x420/0x420
[  261.311004]  ? check_flags.part.36+0x450/0x450
[  261.315005]  ? _raw_spin_unlock_irq+0x29/0x40
[  261.320995]  ? _raw_spin_unlock_irq+0x29/0x40
[  261.326142]  ? cyc2ns_read_end+0x10/0x10
[  261.330139]  ? raw_bind+0x280/0x280
[  261.334138]  ? sched_clock_cpu+0x126/0x170
[  261.338995]  ? check_flags.part.36+0x450/0x450
[  261.342991]  ? __lock_acquire+0x4500/0x4500
[  261.348994]  ? inet_sendmsg+0x11c/0x500
[  261.352989]  ? dst_output+0x180/0x180
[  261.357012]  inet_sendmsg+0x11c/0x500
[ ... ]

v2:
 - clear skb->sk in the reassembly routine. (Eric Dumazet)

Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
Suggested-by: Eric Dumazet 
Signed-off-by: Taehee Yoo 
Reviewed-by: Eric Dumazet 
Signed-off-by: David S. Miller 
---
 net/ipv4/ip_fragment.c  | 1 +
 net/ipv6/netfilter/nf_conntrack_reasm.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 88281fbce88c..e7227128df2c 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -599,6 +599,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff 
*skb,
nextp = &fp->next;
fp->prev = NULL;
memset(&fp->rbnode, 0, sizeof(fp->rbnode));
+   fp->sk = NULL;
head->data_len += fp->len;
head->len += fp->len;
if (head->ip_summed != fp->ip_summed)
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c 
b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 82ce0d0f54bf..2ed8536e10b6 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -453,6 +453,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff 
*prev,  struct net_devic
else if (head->ip_summed == CHECKSUM_COMPLETE)
head->csum = csum_add(head->csum, fp->csum);
head->truesize += fp->truesize;
+   fp->sk = NULL;
}
sub_frag_mem_limit(fq->q.net, head->truesize);
 
-- 
2.18.0



[PATCH v3 28/30] ip: add helpers to process in-order fragments faster.

2018-09-13 Thread Stephen Hemminger
From: Peter Oskolkov 

This patch introduces several helper functions/macros that will be
used in the follow-up patch. No runtime changes yet.

The new logic (fully implemented in the second patch) is as follows:

* Nodes in the rb-tree will now contain not single fragments, but lists
  of consecutive fragments ("runs").

* At each point in time, the current "active" run at the tail is
  maintained/tracked. Fragments that arrive in-order, adjacent
  to the previous tail fragment, are added to this tail run without
  triggering the re-balancing of the rb-tree.

* If a fragment arrives out of order with the offset _before_ the tail run,
  it is inserted into the rb-tree as a single fragment.

* If a fragment arrives after the current tail fragment (with a gap),
  it starts a new "tail" run, and is inserted into the rb-tree
  at the end as the head of the new run.

skb->cb is used to store additional information
needed here (suggested by Eric Dumazet).
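
The placement rules listed above can be sketched as a small decision function. This is an illustration under stated assumptions, not the kernel code: enum frag_action and classify_fragment() are hypothetical names, offsets are plain byte offsets into the datagram, and the overlap case is modeled on the "discard overlaps" check in ip_frag_queue().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome labels for an incoming fragment. */
enum frag_action {
	FRAG_FIRST_RUN,		 /* queue is empty: start the first run */
	FRAG_APPEND_TO_TAIL_RUN, /* in order, adjacent to the tail */
	FRAG_NEW_TAIL_RUN,	 /* after the tail, with a gap */
	FRAG_DISCARD_QUEUE,	 /* overlaps the tail fragment */
	FRAG_RBTREE_INSERT,	 /* out of order: normal rb-tree insert */
};

/*
 * have_tail: whether any fragment has been queued yet
 * tail_end:  end offset of the current tail fragment
 * offset:    start offset of the new fragment
 * end:       end offset of the new fragment
 */
static enum frag_action classify_fragment(bool have_tail, int tail_end,
					  int offset, int end)
{
	if (!have_tail)
		return FRAG_FIRST_RUN;

	if (tail_end < end) {
		if (offset < tail_end)
			return FRAG_DISCARD_QUEUE;
		if (offset == tail_end)
			return FRAG_APPEND_TO_TAIL_RUN;
		return FRAG_NEW_TAIL_RUN;
	}

	return FRAG_RBTREE_INSERT;
}
```

Only the last case requires a tree search; the others touch the tail of the queue, which is why in-order traffic avoids rb-tree rebalancing.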

Reported-by: Willem de Bruijn 
Signed-off-by: Peter Oskolkov 
Cc: Eric Dumazet 
Cc: Florian Westphal 
Signed-off-by: David S. Miller 
(cherry picked from commit 353c9cb360874e737fb000545f783df756c06f9a)
---
 include/net/inet_frag.h |  6 
 net/ipv4/ip_fragment.c  | 73 +
 2 files changed, 79 insertions(+)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index e4c71a7644be..335cf7851f12 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -57,7 +57,9 @@ struct frag_v6_compare_key {
  * @lock: spinlock protecting this frag
  * @refcnt: reference count of the queue
  * @fragments: received fragments head
+ * @rb_fragments: received fragments rb-tree root
  * @fragments_tail: received fragments tail
+ * @last_run_head: the head of the last "run". see ip_fragment.c
  * @stamp: timestamp of the last received fragment
  * @len: total length of the original datagram
  * @meat: length of received fragments so far
@@ -78,6 +80,7 @@ struct inet_frag_queue {
struct sk_buff  *fragments;  /* Used in IPv6. */
struct rb_root  rb_fragments; /* Used in IPv4. */
struct sk_buff  *fragments_tail;
+   struct sk_buff  *last_run_head;
ktime_t stamp;
int len;
int meat;
@@ -113,6 +116,9 @@ void inet_frag_kill(struct inet_frag_queue *q);
 void inet_frag_destroy(struct inet_frag_queue *q);
 struct inet_frag_queue *inet_frag_find(struct netns_frags *nf, void *key);
 
+/* Free all skbs in the queue; return the sum of their truesizes. */
+unsigned int inet_frag_rbtree_purge(struct rb_root *root);
+
 static inline void inet_frag_put(struct inet_frag_queue *q)
 {
if (refcount_dec_and_test(&q->refcnt))
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 7cb7ed761d8c..26ace9d2d976 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -57,6 +57,57 @@
  */
 static const char ip_frag_cache_name[] = "ip4-frags";
 
+/* Use skb->cb to track consecutive/adjacent fragments coming at
+ * the end of the queue. Nodes in the rb-tree queue will
+ * contain "runs" of one or more adjacent fragments.
+ *
+ * Invariants:
+ * - next_frag is NULL at the tail of a "run";
+ * - the head of a "run" has the sum of all fragment lengths in frag_run_len.
+ */
+struct ipfrag_skb_cb {
+   struct inet_skb_parm h;
+   struct sk_buff  *next_frag;
+   int frag_run_len;
+};
+
+#define FRAG_CB(skb)   ((struct ipfrag_skb_cb *)((skb)->cb))
+
+static void ip4_frag_init_run(struct sk_buff *skb)
+{
+   BUILD_BUG_ON(sizeof(struct ipfrag_skb_cb) > sizeof(skb->cb));
+
+   FRAG_CB(skb)->next_frag = NULL;
+   FRAG_CB(skb)->frag_run_len = skb->len;
+}
+
+/* Append skb to the last "run". */
+static void ip4_frag_append_to_last_run(struct inet_frag_queue *q,
+   struct sk_buff *skb)
+{
+   RB_CLEAR_NODE(&skb->rbnode);
+   FRAG_CB(skb)->next_frag = NULL;
+
+   FRAG_CB(q->last_run_head)->frag_run_len += skb->len;
+   FRAG_CB(q->fragments_tail)->next_frag = skb;
+   q->fragments_tail = skb;
+}
+
+/* Create a new "run" with the skb. */
+static void ip4_frag_create_run(struct inet_frag_queue *q, struct sk_buff *skb)
+{
+   if (q->last_run_head)
+   rb_link_node(&skb->rbnode, &q->last_run_head->rbnode,
+&q->last_run_head->rbnode.rb_right);
+   else
+   rb_link_node(&skb->rbnode, NULL, &q->rb_fragments.rb_node);
+   rb_insert_color(&skb->rbnode, &q->rb_fragments);
+
+   ip4_frag_init_run(skb);
+   q->fragments_tail = skb;
+   q->last_run_head = skb;
+}
+
 /* Describe an entry in the "incomplete datagrams" queue. */
 struct ipq {
struct inet_frag_queue q;
@@ -654,6 +705,28 @@ struct sk_buff *ip_check_defrag(struct net *net, struct 
sk_buff *skb, u32 user)
 }
 EXPORT_SYMBOL(ip_check_defrag);
 
+unsigned int inet_frag_rbtree_purge(struct 

[PATCH v3 29/30] ip: process in-order fragments efficiently

2018-09-13 Thread Stephen Hemminger
From: Peter Oskolkov 

This patch changes the runtime behavior of IP defrag queue:
incoming in-order fragments are added to the end of the current
list/"run" of in-order fragments at the tail.

On some workloads, UDP stream performance is substantially improved:

RX: ./udp_stream -F 10 -T 2 -l 60
TX: ./udp_stream -c -H  -F 10 -T 5 -l 60

with this patchset applied on a 10Gbps receiver:

  throughput=9524.18
  throughput_units=Mbit/s

upstream (net-next):

  throughput=4608.93
  throughput_units=Mbit/s

Reported-by: Willem de Bruijn 
Signed-off-by: Peter Oskolkov 
Cc: Eric Dumazet 
Cc: Florian Westphal 
Signed-off-by: David S. Miller 
(cherry picked from commit a4fd284a1f8fd4b6c59aa59db2185b1e17c5c11c)
---
 net/ipv4/inet_fragment.c |   2 +-
 net/ipv4/ip_fragment.c   | 110 ---
 2 files changed, 70 insertions(+), 42 deletions(-)

diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 6904cbb7de1a..f6764537148c 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -145,7 +145,7 @@ void inet_frag_destroy(struct inet_frag_queue *q)
fp = xp;
} while (fp);
} else {
-   sum_truesize = skb_rbtree_purge(&q->rb_fragments);
+   sum_truesize = inet_frag_rbtree_purge(&q->rb_fragments);
}
sum = sum_truesize + f->qsize;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 26ace9d2d976..88281fbce88c 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -126,8 +126,8 @@ static u8 ip4_frag_ecn(u8 tos)
 
 static struct inet_frags ip4_frags;
 
-static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
-struct net_device *dev);
+static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
+struct sk_buff *prev_tail, struct net_device *dev);
 
 
 static void ip4_frag_init(struct inet_frag_queue *q, const void *a)
@@ -219,7 +219,12 @@ static void ip_expire(struct timer_list *t)
head = skb_rb_first(&qp->q.rb_fragments);
if (!head)
goto out;
-   rb_erase(&head->rbnode, &qp->q.rb_fragments);
+   if (FRAG_CB(head)->next_frag)
+   rb_replace_node(&head->rbnode,
+   &FRAG_CB(head)->next_frag->rbnode,
+   &qp->q.rb_fragments);
+   else
+   rb_erase(&head->rbnode, &qp->q.rb_fragments);
memset(&head->rbnode, 0, sizeof(head->rbnode));
barrier();
}
@@ -320,7 +325,7 @@ static int ip_frag_reinit(struct ipq *qp)
return -ETIMEDOUT;
}
 
-   sum_truesize = skb_rbtree_purge(&qp->q.rb_fragments);
+   sum_truesize = inet_frag_rbtree_purge(&qp->q.rb_fragments);
sub_frag_mem_limit(qp->q.net, sum_truesize);
 
qp->q.flags = 0;
@@ -329,6 +334,7 @@ static int ip_frag_reinit(struct ipq *qp)
qp->q.fragments = NULL;
qp->q.rb_fragments = RB_ROOT;
qp->q.fragments_tail = NULL;
+   qp->q.last_run_head = NULL;
qp->iif = 0;
qp->ecn = 0;
 
@@ -340,7 +346,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 {
struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
struct rb_node **rbn, *parent;
-   struct sk_buff *skb1;
+   struct sk_buff *skb1, *prev_tail;
struct net_device *dev;
unsigned int fragsize;
int flags, offset;
@@ -418,38 +424,41 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 */
 
/* Find out where to put this fragment.  */
-   skb1 = qp->q.fragments_tail;
-   if (!skb1) {
-   /* This is the first fragment we've received. */
-   rb_link_node(&skb->rbnode, NULL, &qp->q.rb_fragments.rb_node);
-   qp->q.fragments_tail = skb;
-   } else if ((skb1->ip_defrag_offset + skb1->len) < end) {
-   /* This is the common/special case: skb goes to the end. */
+   prev_tail = qp->q.fragments_tail;
+   if (!prev_tail)
+   ip4_frag_create_run(&qp->q, skb);  /* First fragment. */
+   else if (prev_tail->ip_defrag_offset + prev_tail->len < end) {
+   /* This is the common case: skb goes to the end. */
/* Detect and discard overlaps. */
-   if (offset < (skb1->ip_defrag_offset + skb1->len))
+   if (offset < prev_tail->ip_defrag_offset + prev_tail->len)
goto discard_qp;
-   /* Insert after skb1. */
-   rb_link_node(&skb->rbnode, &skb1->rbnode, &skb1->rbnode.rb_right);
-   qp->q.fragments_tail = skb;
+   if (offset == prev_tail->ip_defrag_offset + prev_tail->len)
+   ip4_frag_append_to_last_run(&qp->q, skb);
+   else
+   ip4_frag_create_run(&qp->q, skb);
} else {
-   /* Binary search. Note that skb can become the first 
