Re: [PATCH 2/2] drivers core: multi-threading device shutdown

2018-05-02 Thread Tobin C. Harding
This code was a pleasure to read, super clean.

On Wed, May 02, 2018 at 11:59:31PM -0400, Pavel Tatashin wrote:
> When system is rebooted, halted or kexeced device_shutdown() is
> called.
> 
> This function shuts down every single device by calling either:
>   dev->bus->shutdown(dev)
>   dev->driver->shutdown(dev)
> 
> Even on a machine with just a moderate number of devices, device_shutdown()
> may take multiple seconds to complete, because many devices require
> specific delays to perform this operation.
> 
> Here is sample analysis of time it takes to call device_shutdown() on
> two socket Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz machine.
> 
> device_shutdown   2.95s
>  mlx4_shutdown1.14s
>  megasas_shutdown 0.24s
>  ixgbe_shutdown   0.37s x 4 (four ixgbe devices on my machine).
>  the rest 0.09s
> 
> In mlx4 we spent the most time, but that is because there is a 1 second
> sleep:
> mlx4_shutdown
>  mlx4_unload_one
>   mlx4_free_ownership
>    msleep(1000)
> 
> With megasas we spend a quarter of a second, but sometimes longer (up to 0.5s)
> in this path:
> 
> megasas_shutdown
>   megasas_flush_cache
> megasas_issue_blocked_cmd
>   wait_event_timeout
> 
> Finally, with ixgbe_shutdown() it takes 0.37s for each device, but that time
> is spread all over the place, with the bigger offenders being:
> 
> ixgbe_shutdown
>   __ixgbe_shutdown
> ixgbe_close_suspend
>   ixgbe_down
> ixgbe_init_hw_generic
>   ixgbe_reset_hw_X540
> msleep(100);0.104483472
> ixgbe_get_san_mac_addr_generic  0.048414851
> ixgbe_get_wwn_prefix_generic0.048409893
>   ixgbe_start_hw_X540
> ixgbe_start_hw_generic
>   ixgbe_clear_hw_cntrs_generic  0.048581502
>   ixgbe_setup_fc_generic0.024225800
> 
> All the ixgbe_*generic functions end-up calling:
> ixgbe_read_eerd_X540()
>   ixgbe_acquire_swfw_sync_X540
> usleep_range(5000, 6000);
>   ixgbe_release_swfw_sync_X540
> usleep_range(5000, 6000);
> 
> While these are short sleeps, they end up being called over 24 times!
> 24 * 0.0055s = 0.132s, adding up to 0.528s for four devices.
> 
> While we should keep optimizing the individual device drivers, in some
> cases this is simply a hardware property that forces a specific delay, and
> we must wait.
> 
> So, the solution to this problem is to shut down devices in parallel.
> However, we must shut down children before shutting down parents, so a
> parent device must wait for its children to finish.
> 
> With this patch, on the same machine, device_shutdown() takes 1.142s, and
> without the mlx4 one-second delay only 0.38s.
> 
> Signed-off-by: Pavel Tatashin 
> ---
>  drivers/base/core.c | 238 +++-
>  1 file changed, 189 insertions(+), 49 deletions(-)
> 
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index b610816eb887..f370369a303b 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -25,6 +25,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "base.h"
>  #include "power/power.h"
> @@ -2102,6 +2103,59 @@ const char *device_get_devnode(struct device *dev,
>   return *tmp = s;
>  }
>  
> +/**
> + * device_children_count - device children count
> + * @parent: parent struct device.
> + *
> + * Returns the number of children for this device, or 0 if none.
> + */
> +static int device_children_count(struct device *parent)
> +{
> + struct klist_iter i;
> + int children = 0;
> +
> + if (!parent->p)
> + return 0;
> +
> + klist_iter_init(&parent->p->klist_children, &i);
> + while (next_device(&i))
> + children++;
> + klist_iter_exit(&i);
> +
> + return children;
> +}
> +
> +/**
> + * device_get_child_by_index - Return child using the provided index.
> + * @parent: parent struct device.
> + * @index:  Index of the child, where 0 is the first child in the children
> + * list, and so on.
> + *
> + * Returns child or NULL if child with this index is not present.
> + */
> +static struct device *
> +device_get_child_by_index(struct device *parent, int index)
> +{
> + struct klist_iter i;
> + struct device *dev = NULL, *d;
> + int child_index = 0;
> +
> + if (!parent->p || index < 0)
> + return NULL;
> +
> + klist_iter_init(&parent->p->klist_children, &i);
> + while ((d = next_device(&i)) != NULL) {

perhaps:
while ((d = next_device(&i))) {

> + if (child_index == index) {
> + dev = d;
> + break;
> + }
> + child_index++;
> + }
> + klist_iter_exit(&i);
> +
> + return dev;
> +}
> +
>  /**
>   * device_for_each_child - device child iterator.
>   * @parent: parent struct device.
> @@ -2765,71 

Re: INFO: rcu detected stall in __schedule

2018-05-02 Thread Tetsuo Handa
I'm not sure whether this is a PPP bug.

As of uptime = 484, RCU says that it stalled for 125 seconds.

--
[  484.407032] INFO: rcu_sched self-detected stall on CPU
[  484.412488]  0-...!: (125000 ticks this GP) idle=f3e/1/4611686018427387906 
softirq=112858/112858 fqs=0 
[  484.422300]   (t=125000 jiffies g=61626 c=61625 q=1534)
[  484.427663] rcu_sched kthread starved for 125000 jiffies! g61626 c61625 f0x0 
RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=0
--

484 - 125 = 359, which is about when SND related fuzzing started in that log.

--
2033/05/18 03:36:31 executing program 1:
r0 = socket(0x4a, 0x5, 0x7)
setsockopt$inet_int(r0, 0x0, 0x18, &(0x7f00)=0x200, 0x4)
bind$inet6(r0, &(0x7fc0)={0xa, 0x0, 0x0, @loopback={0x0, 0x1}}, 0x1c)
perf_event_open(&(0x7f40)={0x2, 0x70, 0x3e5}, 0x0, 0x, 
0x, 0x0)
timer_create(0x0, &(0x7f0001c0)={0x0, 0x15, 0x0, @thr={&(0x7f000440), 
&(0x7f000540)}}, &(0x7f000200))
timer_getoverrun(0x0)
perf_event_open(&(0x7f25c000)={0x2, 0x78, 0x3e3}, 0x0, 0x0, 
0x, 0x0)
r1 = syz_open_dev$sndctrl(&(0x7f000200)='/dev/snd/controlC#\x00', 0x2, 0x0)
perf_event_open(&(0x7f001000)={0x0, 0x70, 0x0, 0x0, 0x0, 0x0, 0x0, 0x8ce, 
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7, 0x0, 0x0, 0x0, 
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xfff8, 0x0, 0x0, 0x0, 0x0, 
0x0, 0x0, 0x0, 0x0, @perf_bp={&(0x7f005000), 0x2}, 0x10c}, 0x0, 
0x0, 0x, 0x0)
ioctl$SNDRV_CTL_IOCTL_SUBSCRIBE_EVENTS(r1, 0xc0045516, &(0x7fc0)=0x1)
r2 = syz_open_dev$sndpcmp(&(0x7f000100)='/dev/snd/pcmC#D#p\x00', 0x1, 
0x4000)
ioctl$SNDRV_SEQ_IOCTL_GET_QUEUE_CLIENT(r2, 0xc04c5349, 
&(0x7f000240)={0x200, 0xfcdc, 0x1})
syz_open_dev$tun(&(0x7f0003c0)='/dev/net/tun\x00', 0x0, 0x20402)
ioctl$SNDRV_CTL_IOCTL_PVERSION(r1, 0xc1105517, &(0x7f001000)=""/250)
ioctl$SNDRV_CTL_IOCTL_SUBSCRIBE_EVENTS(r1, 0xc0045516, &(0x7f00))

2033/05/18 03:36:31 executing program 4:
syz_emit_ethernet(0x3e, &(0x7fc0)={@broadcast=[0xff, 0xff, 0xff, 0xff, 
0xff, 0xff], @empty=[0x0, 0x0, 0xb00], [], {@ipv4={0x800, {{0x5, 
0x4, 0x0, 0x0, 0x30, 0x0, 0x0, 0x0, 0x1, 0x0, @remote={0xac, 0x14, 0x14, 0xbb}, 
@dev={0xac, 0x14, 0x14}}, @icmp=@parameter_prob={0x5, 0x4, 0x0, 0x0, 0x0, 0x0, 
{0x5, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, @local={0xac, 0x223, 0x14, 
0xaa}, @dev={0xac, 0x14, 0x14}}}, &(0x7f00)={0x0, 0x2, [0x0, 
0x2e6]})

2033/05/18 03:36:31 executing program 1:
r0 = socket$pppoe(0x18, 0x1, 0x0)
connect$pppoe(r0, &(0x7fc0)={0x18, 0x0, {0x1, @broadcast=[0xff, 0xff, 
0xff, 0xff, 0xff, 0xff], 'ip6_vti0\x00'}}, 0x1e)
r1 = socket(0x3, 0xb, 0x8001)
setsockopt$inet_sctp6_SCTP_ADAPTATION_LAYER(r1, 0x84, 0x7, 
&(0x7f000100)={0x2}, 0x4)
ioctl$sock_inet_SIOCGIFADDR(r0, 0x8915, 
&(0x7f40)={'veth1_to_bridge\x00', {0x2, 0x4e21}})
r2 = syz_open_dev$admmidi(&(0x7f00)='/dev/admmidi#\x00', 0x6, 0x8000)
setsockopt$SO_VM_SOCKETS_BUFFER_MAX_SIZE(r2, 0x28, 0x2, 
&(0x7f80)=0xff00, 0x8)

[  359.306427] snd_virmidi snd_virmidi.0: control 112:0:0:�:0 is already 
present
--


Re: [lkp-robot] 486ad79630 [ 15.532543] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004

2018-05-02 Thread Andrew Morton
On Wed, 2 May 2018 21:58:25 -0700 Cong Wang  wrote:

> On Wed, May 2, 2018 at 9:27 PM, Andrew Morton  
> wrote:
> >
> > So it's saying that something which got committed into Linus's tree
> > after 4.17-rc3 has caused a NULL deref in
> > sock_release->llc_ui_release+0x3a/0xd0
> 
> Do you mean it contains commit 3a04ce7130a7
> ("llc: fix NULL pointer deref for SOCK_ZAPPED")?

That was in 4.17-rc3 so if this report's bisection is correct, that
patch is innocent.

origin.patch (http://ozlabs.org/~akpm/mmots/broken-out/origin.patch)
contains no changes to net/llc/af_llc.c so perhaps this crash is also
occurring in 4.17-rc3 base.


Re: [PATCH] net/xfrm: Fix lookups for states with spi == 0

2018-05-02 Thread Herbert Xu
On Wed, May 02, 2018 at 01:41:36PM +0100, Dmitry Safonov wrote:
>
> But it is still possible to create ipsec with zero SPI.
> And it does not seem to make sense to search for a state in the SPI hash if
> the request has zero SPI.

Fair enough.  In fact a zero SPI is legal and defined for IPcomp.

The bug arose from this patch:

commit 7b4dc3600e4877178ba94c7fbf7e520421378aa6
Author: Masahide NAKAMURA 
Date:   Wed Sep 27 22:21:52 2006 -0700

[XFRM]: Do not add a state whose SPI is zero to the SPI hash.

SPI=0 is used for acquired IPsec SA and MIPv6 RO state.
Such state should not be added to the SPI hash
because we do not care about it on deleting path.

Signed-off-by: Masahide NAKAMURA 
Signed-off-by: YOSHIFUJI Hideaki 

I think it would be better to revert this.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH net] ipv4: fix fnhe usage by non-cached routes

2018-05-02 Thread Julian Anastasov

Hello,

On Wed, 2 May 2018, David Ahern wrote:

> On 5/2/18 12:41 AM, Julian Anastasov wrote:
> > Allow some non-cached routes to use non-expired fnhe:
> > 
> > 1. ip_del_fnhe: moved above and now called by find_exception.
> > The 4.5+ commit deed49df7390 expires fnhe only when caching
> > routes. Change that to:
> > 
> > 1.1. use fnhe for non-cached local output routes, with the help
> > from (2)
> > 
> > 1.2. allow __mkroute_input to detect expired fnhe (outdated
> > fnhe_gw, for example) when do_cache is false, eg. when itag!=0
> > for unicast destinations.
> > 
> > 2. __mkroute_output: keep fi to allow local routes with orig_oif != 0
> > to use fnhe info even when the new route will not be cached into fnhe.
> > After commit 839da4d98960 ("net: ipv4: set orig_oif based on fib
> > result for local traffic") it means all local routes will be affected
> > because they are not cached. This change is used to solve a PMTU
> > problem with IPVS (and probably Netfilter DNAT) setups that redirect
> > local clients from target local IP (local route to Virtual IP)
> > to new remote IP target, eg. IPVS TUN real server. Loopback has
> > 64K MTU and we need to create fnhe on the local route that will
> > keep the reduced PMTU for the Virtual IP. Without this change
> > fnhe_pmtu is updated from ICMP but never exposed to non-cached
> > local routes. This includes routes with flowi4_oif!=0 for 4.6+ and
> > with flowi4_oif=any for 4.14+.
> 
> Can you add a test case to tools/testing/selftests/net/pmtu.sh to cover
> this situation?

Sure, I'll give it a try.

> > @@ -1310,8 +1340,14 @@ static struct fib_nh_exception 
> > *find_exception(struct fib_nh *nh, __be32 daddr)
> >  
> > for (fnhe = rcu_dereference(hash[hval].chain); fnhe;
> >  fnhe = rcu_dereference(fnhe->fnhe_next)) {
> > -   if (fnhe->fnhe_daddr == daddr)
> > +   if (fnhe->fnhe_daddr == daddr) {
> > +   if (fnhe->fnhe_expires &&
> > +   time_after(jiffies, fnhe->fnhe_expires)) {
> > +   ip_del_fnhe(nh, daddr);
> 
> I'm surprised this is done in the fast path vs gc time. (the existing
> code does as well; your change is only moving the call to make the input
> and output paths the same)
> 
> 
> The change looks correct to me and all of my functional tests passed.
> 
> Acked-by: David Ahern 

Thanks for the review!

Regards

--
Julian Anastasov 


[PATCH] sched: fix semicolon.cocci warnings

2018-05-02 Thread kbuild test robot
From: Fengguang Wu 

net/sched/sch_cake.c:580:2-3: Unneeded semicolon


 Remove unneeded semicolon.

Generated by: scripts/coccinelle/misc/semicolon.cocci

Fixes: 907a16741a03 ("sched: Add Common Applications Kept Enhanced (cake) 
qdisc")
CC: Toke Høiland-Jørgensen 
Signed-off-by: Fengguang Wu 
---

 sch_cake.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -577,7 +577,7 @@ cake_hash(struct cake_tin_data *q, const
default:
dsthost_hash = 0;
srchost_hash = 0;
-   };
+   }
 
/* This *must* be after the above switch, since as a
 * side-effect it sorts the src and dst addresses.


Re: [PATCH net-next v7 1/7] sched: Add Common Applications Kept Enhanced (cake) qdisc

2018-05-02 Thread kbuild test robot
Hi Toke,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Toke-H-iland-J-rgensen/sched-Add-Common-Applications-Kept-Enhanced-cake-qdisc/20180503-073002


coccinelle warnings: (new ones prefixed by >>)

>> net/sched/sch_cake.c:580:2-3: Unneeded semicolon

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


Re: [v2 PATCH 1/1] tg3: fix meaningless hw_stats reading after tg3_halt memset 0 hw_stats

2018-05-02 Thread Michael Chan
On Wed, May 2, 2018 at 5:30 PM, Zumeng Chen  wrote:
> On 2018年05月03日 01:32, Michael Chan wrote:
>>
>> On Wed, May 2, 2018 at 3:27 AM, Zumeng Chen  wrote:
>>>
>>> On 2018年05月02日 13:12, Michael Chan wrote:

 On Tue, May 1, 2018 at 5:42 PM, Zumeng Chen 
 wrote:

> diff --git a/drivers/net/ethernet/broadcom/tg3.h
> b/drivers/net/ethernet/broadcom/tg3.h
> index 3b5e98e..c61d83c 100644
> --- a/drivers/net/ethernet/broadcom/tg3.h
> +++ b/drivers/net/ethernet/broadcom/tg3.h
> @@ -3102,6 +3102,7 @@ enum TG3_FLAGS {
>   TG3_FLAG_ROBOSWITCH,
>   TG3_FLAG_ONE_DMA_AT_ONCE,
>   TG3_FLAG_RGMII_MODE,
> +   TG3_FLAG_HALT,

 I think you should be able to use the existing INIT_COMPLETE flag
>>>
>>>
>>> No, it will bring uncertain factors into the existing complicated
>>> logic of INIT_COMPLETE.
>>> And I think the logic here to fix the meaningless hw_stats reading and
>>> the problem of commit f5992b72 is very simple. I even suspect you have
>>> not read the INIT_COMPLETE related code carefully.
>>>
>> We should use an existing flag whenever appropriate
>
>
> I disagree. This is sort of blahblah...
>>

I don't want to see another flag added that is practically the same as
!INIT_COMPLETE.  The driver already has close to one hundred flags.
Adding a new flag that is similar to an existing flag will just make
the code more difficult to understand and maintain.

If you don't want to fix it the cleaner way, Siva or I will fix it.


Re: Silently dropped UDP packets on kernel 4.14

2018-05-02 Thread Florian Westphal
Kristian Evensen  wrote:
> I went for the early-insert approached and have patched

I'm sorry for suggesting that.

It doesn't work, because of NAT.
NAT rewrites packet content and changes the reply tuple, but the tuples
determine the hash insertion location.

I don't know how to solve this problem.


Re: [lkp-robot] 486ad79630 [ 15.532543] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004

2018-05-02 Thread Cong Wang
On Wed, May 2, 2018 at 9:27 PM, Andrew Morton  wrote:
>
> So it's saying that something which got committed into Linus's tree
> after 4.17-rc3 has caused a NULL deref in
> sock_release->llc_ui_release+0x3a/0xd0

Do you mean it contains commit 3a04ce7130a7
("llc: fix NULL pointer deref for SOCK_ZAPPED")?


[PATCH RFC v2 net-next 4/4] bpfilter: rough bpfilter codegen example hack

2018-05-02 Thread Alexei Starovoitov
From: Daniel Borkmann 

Signed-off-by: Daniel Borkmann 
---
 net/bpfilter/Makefile   |   2 +-
 net/bpfilter/bpfilter_mod.h | 285 ++-
 net/bpfilter/ctor.c |  57 +
 net/bpfilter/gen.c  | 290 
 net/bpfilter/init.c |  11 +-
 net/bpfilter/main.c |  15 ++-
 net/bpfilter/sockopt.c  | 137 -
 net/bpfilter/tables.c   |   5 +-
 net/bpfilter/tgts.c |   1 +
 9 files changed, 737 insertions(+), 66 deletions(-)
 create mode 100644 net/bpfilter/gen.c

diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
index bec6181de995..3796651c76cb 100644
--- a/net/bpfilter/Makefile
+++ b/net/bpfilter/Makefile
@@ -4,7 +4,7 @@
 #
 
 hostprogs-y := bpfilter_umh
-bpfilter_umh-objs := main.o tgts.o targets.o tables.o init.o ctor.o sockopt.o
+bpfilter_umh-objs := main.o tgts.o targets.o tables.o init.o ctor.o sockopt.o gen.o
 HOSTCFLAGS += -I. -Itools/include/
 
 # a bit of elf magic to convert bpfilter_umh binary into a binary blob
diff --git a/net/bpfilter/bpfilter_mod.h b/net/bpfilter/bpfilter_mod.h
index f0de41b20793..b4209985efff 100644
--- a/net/bpfilter/bpfilter_mod.h
+++ b/net/bpfilter/bpfilter_mod.h
@@ -21,8 +21,8 @@ struct bpfilter_table_info {
unsigned intinitial_entries;
unsigned inthook_entry[BPFILTER_INET_HOOK_MAX];
unsigned intunderflow[BPFILTER_INET_HOOK_MAX];
-   unsigned intstacksize;
-   void***jumpstack;
+// unsigned intstacksize;
+// void***jumpstack;
unsigned char   entries[0] __aligned(8);
 };
 
@@ -64,22 +64,55 @@ struct bpfilter_ipt_error {
 
 struct bpfilter_target {
struct list_headall_target_list;
-   const char  name[BPFILTER_EXTENSION_MAXNAMELEN];
+   charname[BPFILTER_EXTENSION_MAXNAMELEN];
unsigned intsize;
int hold;
u16 family;
u8  rev;
 };
 
+struct bpfilter_gen_ctx {
+   struct bpf_insn *img;
+   u32 len_cur;
+   u32 len_max;
+   u32 default_verdict;
+   int fd;
+   int ifindex;
+   booloffloaded;
+};
+
+union bpf_attr;
+int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size);
+
+int bpfilter_gen_init(struct bpfilter_gen_ctx *ctx);
+int bpfilter_gen_prologue(struct bpfilter_gen_ctx *ctx);
+int bpfilter_gen_epilogue(struct bpfilter_gen_ctx *ctx);
+int bpfilter_gen_append(struct bpfilter_gen_ctx *ctx,
+   struct bpfilter_ipt_ip *ent, int verdict);
+int bpfilter_gen_commit(struct bpfilter_gen_ctx *ctx);
+void bpfilter_gen_destroy(struct bpfilter_gen_ctx *ctx);
+
 struct bpfilter_target *bpfilter_target_get_by_name(const char *name);
 void bpfilter_target_put(struct bpfilter_target *tgt);
 int bpfilter_target_add(struct bpfilter_target *tgt);
 
-struct bpfilter_table_info *bpfilter_ipv4_table_ctor(struct bpfilter_table *tbl);
+struct bpfilter_table_info *
+bpfilter_ipv4_table_alloc(struct bpfilter_table *tbl, __u32 size_ents);
+struct bpfilter_table_info *
+bpfilter_ipv4_table_finalize(struct bpfilter_table *tbl,
+struct bpfilter_table_info *info,
+__u32 size_ents, __u32 num_ents);
+struct bpfilter_table_info *
+bpfilter_ipv4_table_finalize2(struct bpfilter_table *tbl,
+ struct bpfilter_table_info *info,
+ __u32 size_ents, __u32 num_ents);
+
 int bpfilter_ipv4_register_targets(void);
 void bpfilter_tables_init(void);
 int bpfilter_get_info(void *addr, int len);
 int bpfilter_get_entries(void *cmd, int len);
+int bpfilter_set_replace(void *cmd, int len);
+int bpfilter_set_add_counters(void *cmd, int len);
 int bpfilter_ipv4_init(void);
 
 int copy_from_user(void *dst, void *addr, int len);
@@ -93,4 +126,248 @@ extern int pid;
 extern int debug_fd;
 #define ENOTSUPP524
 
+/* Helper macros for filter block array initializers. */
+
+/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
+
+#define BPF_ALU64_REG(OP, DST, SRC)\
+   ((struct bpf_insn) {\
+   .code  = BPF_ALU64 | BPF_OP(OP) | BPF_X,\
+   .dst_reg = DST, \
+   .src_reg = SRC, \
+   .off   = 0, \
+   .imm   = 0 })
+
+#define BPF_ALU32_REG(OP, DST, SRC)\
+   ((struct bpf_insn) {\
+   .code  = BPF_ALU | BPF_OP(OP) | 

[PATCH v2 net-next 0/4] bpfilter

2018-05-02 Thread Alexei Starovoitov
Hi All,

v1->v2:
this patch set is almost a full rewrite of the earlier umh modules approach
The v1 of patches and follow up discussion was covered by LWN:
https://lwn.net/Articles/749108/

I believe the v2 addresses all issues brought up by Andy and others.
Mainly there are zero changes to kernel/module.c
Instead of teaching the module loading logic to recognize a special
umh module, let normal kernel modules execute part of their own
.init.rodata as a new user space process (Andy's idea).
Patch 1 introduces this new helper:
int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
Input:
  data + len == executable file
Output:
  struct umh_info {
   struct file *pipe_to_umh;
   struct file *pipe_from_umh;
   pid_t pid;
  };

Advantages vs v1:
- the embedded user mode executable is stored as .init.rodata inside
  normal kernel module. These pages are freed when .ko finishes loading
- the elf file is copied into tmpfs file. The user mode process is swappable.
- the communication between user mode process and 'parent' kernel module
  is done via two unix pipes, hence protocol is not exposed to
  user space
- impossible to launch umh on its own (that was the main issue of v1)
  and impossible to be man-in-the-middle due to pipes
- bpfilter.ko consists of tiny kernel part that passes the data
  between kernel and umh via pipes and much bigger umh part that
  doing all the work
- 'lsmod' shows bpfilter.ko as usual.
  'rmmod bpfilter' removes kernel module and kills corresponding umh
- signed bpfilter.ko covers the whole image including umh code

Few issues:
- architecturally bpfilter.ko can be builtin, but doesn't work yet.
  Still debugging. Kinda cool to have user mode executables
  to be part of vmlinux
- the user can still attach to the process and debug it with
  'gdb /proc/pid/exe pid', but 'gdb -p pid' doesn't work.
  (a bit worse comparing to v1)
- tinyconfig will notice a small increase in .text
  +766 | TEXT | 7c8b94806bec umh: introduce fork_usermode_blob() helper

More details in patches 1 and 2 that are ready to land.
Patches 3 and 4 are still rough. They were mainly used for
testing and to demonstrate how bpfilter is building on top.
The patch 4 approach of converting one iptable rule to few bpf
instructions will certainly change in the future, since it doesn't
scale to thousands of rules.

Alexei Starovoitov (2):
  umh: introduce fork_usermode_blob() helper
  net: add skeleton of bpfilter kernel module

Daniel Borkmann (1):
  bpfilter: rough bpfilter codegen example hack

David S. Miller (1):
  bpfilter: add iptable get/set parsing

 fs/exec.c |  38 -
 include/linux/binfmts.h   |   1 +
 include/linux/bpfilter.h  |  15 ++
 include/linux/umh.h   |  12 ++
 include/uapi/linux/bpfilter.h | 200 ++
 kernel/umh.c  | 176 +++-
 net/Kconfig   |   2 +
 net/Makefile  |   1 +
 net/bpfilter/Kconfig  |  17 ++
 net/bpfilter/Makefile |  24 +++
 net/bpfilter/bpfilter_kern.c  |  93 +++
 net/bpfilter/bpfilter_mod.h   | 373 ++
 net/bpfilter/ctor.c   |  91 +++
 net/bpfilter/gen.c| 290 
 net/bpfilter/init.c   |  36 
 net/bpfilter/main.c   | 117 +
 net/bpfilter/msgfmt.h |  17 ++
 net/bpfilter/sockopt.c| 236 ++
 net/bpfilter/tables.c |  73 +
 net/bpfilter/targets.c|  51 ++
 net/bpfilter/tgts.c   |  26 +++
 net/ipv4/Makefile |   2 +
 net/ipv4/bpfilter/Makefile|   2 +
 net/ipv4/bpfilter/sockopt.c   |  42 +
 net/ipv4/ip_sockglue.c|  17 ++
 25 files changed, 1940 insertions(+), 12 deletions(-)
 create mode 100644 include/linux/bpfilter.h
 create mode 100644 include/uapi/linux/bpfilter.h
 create mode 100644 net/bpfilter/Kconfig
 create mode 100644 net/bpfilter/Makefile
 create mode 100644 net/bpfilter/bpfilter_kern.c
 create mode 100644 net/bpfilter/bpfilter_mod.h
 create mode 100644 net/bpfilter/ctor.c
 create mode 100644 net/bpfilter/gen.c
 create mode 100644 net/bpfilter/init.c
 create mode 100644 net/bpfilter/main.c
 create mode 100644 net/bpfilter/msgfmt.h
 create mode 100644 net/bpfilter/sockopt.c
 create mode 100644 net/bpfilter/tables.c
 create mode 100644 net/bpfilter/targets.c
 create mode 100644 net/bpfilter/tgts.c
 create mode 100644 net/ipv4/bpfilter/Makefile
 create mode 100644 net/ipv4/bpfilter/sockopt.c

-- 
2.9.5



[PATCH RFC v2 net-next 3/4] bpfilter: add iptable get/set parsing

2018-05-02 Thread Alexei Starovoitov
From: "David S. Miller" 

parse iptable binary blobs into bpfilter internal data structures
bpfilter.ko only passing the [gs]etsockopt commands from kernel to umh
All parsing is done inside umh

Signed-off-by: David S. Miller 
Signed-off-by: Alexei Starovoitov 
---
 include/uapi/linux/bpfilter.h | 179 ++
 net/bpfilter/Makefile |   2 +-
 net/bpfilter/bpfilter_mod.h   |  96 ++
 net/bpfilter/ctor.c   |  80 +++
 net/bpfilter/init.c   |  33 
 net/bpfilter/main.c   |  51 
 net/bpfilter/sockopt.c| 153 
 net/bpfilter/tables.c |  70 +
 net/bpfilter/targets.c|  51 
 net/bpfilter/tgts.c   |  25 ++
 10 files changed, 739 insertions(+), 1 deletion(-)
 create mode 100644 net/bpfilter/bpfilter_mod.h
 create mode 100644 net/bpfilter/ctor.c
 create mode 100644 net/bpfilter/init.c
 create mode 100644 net/bpfilter/sockopt.c
 create mode 100644 net/bpfilter/tables.c
 create mode 100644 net/bpfilter/targets.c
 create mode 100644 net/bpfilter/tgts.c

diff --git a/include/uapi/linux/bpfilter.h b/include/uapi/linux/bpfilter.h
index 2ec3cc99ea4c..38d54e9947a1 100644
--- a/include/uapi/linux/bpfilter.h
+++ b/include/uapi/linux/bpfilter.h
@@ -18,4 +18,183 @@ enum {
BPFILTER_IPT_GET_MAX,
 };
 
+enum {
+   BPFILTER_XT_TABLE_MAXNAMELEN = 32,
+};
+
+enum {
+   BPFILTER_NF_DROP = 0,
+   BPFILTER_NF_ACCEPT = 1,
+   BPFILTER_NF_STOLEN = 2,
+   BPFILTER_NF_QUEUE = 3,
+   BPFILTER_NF_REPEAT = 4,
+   BPFILTER_NF_STOP = 5,
+   BPFILTER_NF_MAX_VERDICT = BPFILTER_NF_STOP,
+};
+
+enum {
+   BPFILTER_INET_HOOK_PRE_ROUTING  = 0,
+   BPFILTER_INET_HOOK_LOCAL_IN = 1,
+   BPFILTER_INET_HOOK_FORWARD  = 2,
+   BPFILTER_INET_HOOK_LOCAL_OUT= 3,
+   BPFILTER_INET_HOOK_POST_ROUTING = 4,
+   BPFILTER_INET_HOOK_MAX,
+};
+
+enum {
+   BPFILTER_PROTO_UNSPEC   = 0,
+   BPFILTER_PROTO_INET = 1,
+   BPFILTER_PROTO_IPV4 = 2,
+   BPFILTER_PROTO_ARP  = 3,
+   BPFILTER_PROTO_NETDEV   = 5,
+   BPFILTER_PROTO_BRIDGE   = 7,
+   BPFILTER_PROTO_IPV6 = 10,
+   BPFILTER_PROTO_DECNET   = 12,
+   BPFILTER_PROTO_NUMPROTO,
+};
+
+#ifndef INT_MAX
+#define INT_MAX((int)(~0U>>1))
+#endif
+#ifndef INT_MIN
+#define INT_MIN (-INT_MAX - 1)
+#endif
+
+enum {
+   BPFILTER_IP_PRI_FIRST   = INT_MIN,
+   BPFILTER_IP_PRI_CONNTRACK_DEFRAG= -400,
+   BPFILTER_IP_PRI_RAW = -300,
+   BPFILTER_IP_PRI_SELINUX_FIRST   = -225,
+   BPFILTER_IP_PRI_CONNTRACK   = -200,
+   BPFILTER_IP_PRI_MANGLE  = -150,
+   BPFILTER_IP_PRI_NAT_DST = -100,
+   BPFILTER_IP_PRI_FILTER  = 0,
+   BPFILTER_IP_PRI_SECURITY= 50,
+   BPFILTER_IP_PRI_NAT_SRC = 100,
+   BPFILTER_IP_PRI_SELINUX_LAST= 225,
+   BPFILTER_IP_PRI_CONNTRACK_HELPER= 300,
+   BPFILTER_IP_PRI_CONNTRACK_CONFIRM   = INT_MAX,
+   BPFILTER_IP_PRI_LAST= INT_MAX,
+};
+
+#define BPFILTER_FUNCTION_MAXNAMELEN   30
+#define BPFILTER_EXTENSION_MAXNAMELEN  29
+#define BPFILTER_TABLE_MAXNAMELEN  32
+
+struct bpfilter_match;
+struct bpfilter_entry_match {
+   union {
+   struct {
+   __u16   match_size;
+   charname[BPFILTER_EXTENSION_MAXNAMELEN];
+   __u8revision;
+   } user;
+   struct {
+   __u16   match_size;
+   struct bpfilter_match   *match;
+   } kernel;
+   __u16   match_size;
+   } u;
+   unsigned char   data[0];
+};
+
+struct bpfilter_target;
+struct bpfilter_entry_target {
+   union {
+   struct {
+   __u16   target_size;
+   charname[BPFILTER_EXTENSION_MAXNAMELEN];
+   __u8revision;
+   } user;
+   struct {
+   __u16   target_size;
+   struct bpfilter_target  *target;
+   } kernel;
+   __u16   target_size;
+   } u;
+   unsigned char   data[0];
+};
+
+struct bpfilter_standard_target {
+   struct bpfilter_entry_targettarget;
+   int verdict;
+};
+
+struct bpfilter_error_target {
+   struct bpfilter_entry_targettarget;
+   char
error_name[BPFILTER_FUNCTION_MAXNAMELEN];
+};
+
+#define __ALIGN_KERNEL(x, a)   __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1)

[PATCH v2 net-next 1/4] umh: introduce fork_usermode_blob() helper

2018-05-02 Thread Alexei Starovoitov
Introduce helper:
int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
struct umh_info {
   struct file *pipe_to_umh;
   struct file *pipe_from_umh;
   pid_t pid;
};

that GPLed kernel modules (signed or unsigned) can use it to execute part
of its own data as swappable user mode process.

The kernel will do:
- mount "tmpfs"
- allocate a unique file in tmpfs
- populate that file with [data, data + len] bytes
- user-mode-helper code will do_execve that file and, before the process
  starts, the kernel will create two unix pipes for bidirectional
  communication between kernel module and umh
- close tmpfs file, effectively deleting it
- the fork_usermode_blob will return zero on success and populate
  'struct umh_info' with two unix pipes and the pid of the user process

As the first step in the development of the bpfilter project
the fork_usermode_blob() helper is introduced to allow user mode code
to be invoked from a kernel module. The idea is that user mode code plus
normal kernel module code are built as part of the kernel build
and installed as traditional kernel module into distro specified location,
such that from a distribution point of view, there is
no difference between regular kernel modules and kernel modules + umh code.
Such modules can be signed, modprobed, rmmod, etc. The use of this new helper
by a kernel module doesn't make it any special from kernel and user space
tooling point of view.

Such approach enables kernel to delegate functionality traditionally done
by the kernel modules into the user space processes (either root or !root) and
reduces security attack surface of the new code. The buggy umh code would crash
the user process, but not the kernel. Another advantage is that umh code
of the kernel module can be debugged and tested out of user space
(e.g. opening the possibility to run clang sanitizers, fuzzers or
user space test suites on the umh code).
In case of the bpfilter project such architecture allows complex control plane
to be done in the user space while bpf based data plane stays in the kernel.

Since umh can crash, can be oom-ed by the kernel, killed by the admin,
the kernel module that uses them (like bpfilter) needs to manage life
time of umh on its own via two unix pipes and the pid of umh.

The exit code of such kernel module should kill the umh it started,
so that rmmod of the kernel module will cleanup the corresponding umh.
Just like if the kernel module does kmalloc() it should kfree() it in the exit 
code.
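
To make the intended usage concrete, here is a minimal sketch of a module
using the helper, based purely on the description above (this is not code
from this series; the blob symbols are made up, and killing the helper by
sending SIGKILL to info.pid is an assumption about how "kill the umh" is
meant to be done):

#include <linux/module.h>
#include <linux/umh.h>
#include <linux/file.h>
#include <linux/pid.h>
#include <linux/sched/signal.h>

extern char my_umh_start[], my_umh_end[];   /* hypothetical embedded ELF blob */

static struct umh_info info;

static int __init my_mod_init(void)
{
        /* execute the embedded blob as a swappable user mode process */
        return fork_usermode_blob(my_umh_start, my_umh_end - my_umh_start,
                                  &info);
}

static void __exit my_mod_exit(void)
{
        /* the module owns the helper's lifetime: kill it and drop the pipes */
        kill_pid(find_vpid(info.pid), SIGKILL, 1);
        fput(info.pipe_to_umh);
        fput(info.pipe_from_umh);
}

module_init(my_mod_init);
module_exit(my_mod_exit);
MODULE_LICENSE("GPL");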

Signed-off-by: Alexei Starovoitov 
---
 fs/exec.c   |  38 ---
 include/linux/binfmts.h |   1 +
 include/linux/umh.h |  12 
 kernel/umh.c| 176 +++-
 4 files changed, 215 insertions(+), 12 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 183059c427b9..30a36c2a39bf 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1706,14 +1706,13 @@ static int exec_binprm(struct linux_binprm *bprm)
 /*
  * sys_execve() executes a new program.
  */
-static int do_execveat_common(int fd, struct filename *filename,
- struct user_arg_ptr argv,
- struct user_arg_ptr envp,
- int flags)
+static int __do_execve_file(int fd, struct filename *filename,
+   struct user_arg_ptr argv,
+   struct user_arg_ptr envp,
+   int flags, struct file *file)
 {
char *pathbuf = NULL;
struct linux_binprm *bprm;
-   struct file *file;
struct files_struct *displaced;
int retval;
 
@@ -1752,7 +1751,8 @@ static int do_execveat_common(int fd, struct filename 
*filename,
check_unsafe_exec(bprm);
current->in_execve = 1;
 
-   file = do_open_execat(fd, filename, flags);
+   if (!file)
+   file = do_open_execat(fd, filename, flags);
retval = PTR_ERR(file);
if (IS_ERR(file))
goto out_unmark;
@@ -1760,7 +1760,9 @@ static int do_execveat_common(int fd, struct filename 
*filename,
sched_exec();
 
bprm->file = file;
-   if (fd == AT_FDCWD || filename->name[0] == '/') {
+   if (!filename) {
+   bprm->filename = "none";
+   } else if (fd == AT_FDCWD || filename->name[0] == '/') {
bprm->filename = filename->name;
} else {
if (filename->name[0] == '\0')
@@ -1826,7 +1828,8 @@ static int do_execveat_common(int fd, struct filename 
*filename,
task_numa_free(current);
free_bprm(bprm);
kfree(pathbuf);
-   putname(filename);
+   if (filename)
+   putname(filename);
if (displaced)
put_files_struct(displaced);
return retval;
@@ -1849,10 +1852,27 @@ static int do_execveat_common(int fd, struct filename 
*filename,
if (displaced)
reset_files_struct(displaced);
 out_ret:
-  

[PATCH v2 net-next 2/4] net: add skeleton of bpfilter kernel module

2018-05-02 Thread Alexei Starovoitov
bpfilter.ko consists of bpfilter_kern.c (normal kernel module code)
and user mode helper code that is embedded into bpfilter.ko

The steps to build bpfilter.ko are the following:
- main.c is compiled by HOSTCC into the bpfilter_umh elf executable file
- with quite a bit of objcopy and Makefile magic the bpfilter_umh elf file
  is converted into bpfilter_umh.o object file
  with _binary_net_bpfilter_bpfilter_umh_start and _end symbols
  Example:
  $ nm ./bld_x64/net/bpfilter/bpfilter_umh.o
  4cf8 T _binary_net_bpfilter_bpfilter_umh_end
  4cf8 A _binary_net_bpfilter_bpfilter_umh_size
   T _binary_net_bpfilter_bpfilter_umh_start
- bpfilter_umh.o and bpfilter_kern.o are linked together into bpfilter.ko

bpfilter_kern.c is a normal kernel module code that calls
the fork_usermode_blob() helper to execute part of its own data
as a user mode process.

Notice that _binary_net_bpfilter_bpfilter_umh_start - end
is placed into .init.rodata section, so it's freed as soon as __init
function of bpfilter.ko is finished.
As part of __init the bpfilter.ko does first request/reply action
via two unix pipe provided by fork_usermode_blob() helper to
make sure that umh is healthy. If not it will kill it via pid.

Later bpfilter_process_sockopt() will be called from bpfilter hooks
in get/setsockopt() to pass iptable commands into umh via bpfilter.ko

If admin does 'rmmod bpfilter' the __exit code bpfilter.ko will
kill umh as well.
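
To illustrate the request/reply flow described above, here is a rough
sketch (not the actual bpfilter_kern.c; the message layout lives in
msgfmt.h, so struct mbox_request/mbox_reply below are assumed names and
fields, and kernel_read()/kernel_write() are assumed to be the pipe I/O
used):

#include <linux/fs.h>           /* kernel_read()/kernel_write() */
#include <linux/types.h>
#include <linux/umh.h>

/* assumed message layout, see msgfmt.h */
struct mbox_request { __u64 addr; __u64 len; __u32 is_set; __u32 cmd; __u32 pid; };
struct mbox_reply   { __u32 status; };

static int umh_send_recv(struct umh_info *info,
                         const struct mbox_request *req,
                         struct mbox_reply *rep)
{
        loff_t pos = 0;
        ssize_t n;

        n = kernel_write(info->pipe_to_umh, req, sizeof(*req), &pos);
        if (n != sizeof(*req))
                return -EFAULT;

        pos = 0;
        n = kernel_read(info->pipe_from_umh, rep, sizeof(*rep), &pos);
        if (n != sizeof(*rep))
                return -EFAULT;

        return 0;
}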

Signed-off-by: Alexei Starovoitov 
---
 include/linux/bpfilter.h  | 15 +++
 include/uapi/linux/bpfilter.h | 21 ++
 net/Kconfig   |  2 +
 net/Makefile  |  1 +
 net/bpfilter/Kconfig  | 17 
 net/bpfilter/Makefile | 24 +++
 net/bpfilter/bpfilter_kern.c  | 93 +++
 net/bpfilter/main.c   | 63 +
 net/bpfilter/msgfmt.h | 17 
 net/ipv4/Makefile |  2 +
 net/ipv4/bpfilter/Makefile|  2 +
 net/ipv4/bpfilter/sockopt.c   | 42 +++
 net/ipv4/ip_sockglue.c| 17 
 13 files changed, 316 insertions(+)
 create mode 100644 include/linux/bpfilter.h
 create mode 100644 include/uapi/linux/bpfilter.h
 create mode 100644 net/bpfilter/Kconfig
 create mode 100644 net/bpfilter/Makefile
 create mode 100644 net/bpfilter/bpfilter_kern.c
 create mode 100644 net/bpfilter/main.c
 create mode 100644 net/bpfilter/msgfmt.h
 create mode 100644 net/ipv4/bpfilter/Makefile
 create mode 100644 net/ipv4/bpfilter/sockopt.c

diff --git a/include/linux/bpfilter.h b/include/linux/bpfilter.h
new file mode 100644
index ..687b1760bb9f
--- /dev/null
+++ b/include/linux/bpfilter.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_BPFILTER_H
+#define _LINUX_BPFILTER_H
+
+#include 
+
+struct sock;
+int bpfilter_ip_set_sockopt(struct sock *sk, int optname, char *optval,
+   unsigned int optlen);
+int bpfilter_ip_get_sockopt(struct sock *sk, int optname, char *optval,
+   int *optlen);
+extern int (*bpfilter_process_sockopt)(struct sock *sk, int optname,
+  char __user *optval,
+  unsigned int optlen, bool is_set);
+#endif
diff --git a/include/uapi/linux/bpfilter.h b/include/uapi/linux/bpfilter.h
new file mode 100644
index ..2ec3cc99ea4c
--- /dev/null
+++ b/include/uapi/linux/bpfilter.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _UAPI_LINUX_BPFILTER_H
+#define _UAPI_LINUX_BPFILTER_H
+
+#include 
+
+enum {
+   BPFILTER_IPT_SO_SET_REPLACE = 64,
+   BPFILTER_IPT_SO_SET_ADD_COUNTERS = 65,
+   BPFILTER_IPT_SET_MAX,
+};
+
+enum {
+   BPFILTER_IPT_SO_GET_INFO = 64,
+   BPFILTER_IPT_SO_GET_ENTRIES = 65,
+   BPFILTER_IPT_SO_GET_REVISION_MATCH = 66,
+   BPFILTER_IPT_SO_GET_REVISION_TARGET = 67,
+   BPFILTER_IPT_GET_MAX,
+};
+
+#endif /* _UAPI_LINUX_BPFILTER_H */
diff --git a/net/Kconfig b/net/Kconfig
index b62089fb1332..ed6368b306fa 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -201,6 +201,8 @@ source "net/bridge/netfilter/Kconfig"
 
 endif
 
+source "net/bpfilter/Kconfig"
+
 source "net/dccp/Kconfig"
 source "net/sctp/Kconfig"
 source "net/rds/Kconfig"
diff --git a/net/Makefile b/net/Makefile
index a6147c61b174..7f982b7682bd 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_TLS) += tls/
 obj-$(CONFIG_XFRM) += xfrm/
 obj-$(CONFIG_UNIX) += unix/
 obj-$(CONFIG_NET)  += ipv6/
+obj-$(CONFIG_BPFILTER) += bpfilter/
 obj-$(CONFIG_PACKET)   += packet/
 obj-$(CONFIG_NET_KEY)  += key/
 obj-$(CONFIG_BRIDGE)   += bridge/
diff --git a/net/bpfilter/Kconfig b/net/bpfilter/Kconfig
new file mode 100644
index ..782a732b9a5c
--- /dev/null
+++ b/net/bpfilter/Kconfig
@@ -0,0 +1,17 @@

Re: [lkp-robot] 486ad79630 [ 15.532543] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004

2018-05-02 Thread Andrew Morton

(networking cc's added)

On Thu, 3 May 2018 12:14:50 +0800 kernel test robot  wrote:

> Greetings,
> 
> 0day kernel testing robot got the below dmesg and the first bad commit is
> 
> git://git.cmpxchg.org/linux-mmotm.git master
> 
> commit 486ad79630d0ba0b7205a8db9fe15ba392f5ee32
> Author: Andrew Morton 
> AuthorDate: Fri Apr 20 22:00:53 2018 +
> Commit: Johannes Weiner 
> CommitDate: Fri Apr 20 22:00:53 2018 +
> 
> origin


OK, this got confusing.  origin.patch is the diff between 4.17-rc3 and
current mainline.

>
> [many lines deleted]
>
> [main] Setsockopt(101 c 1b24000 a) on fd 177 [3:5:240]
> [main] Setsockopt(1 2c 1b24000 4) on fd 178 [5:2:0]
> [main] Setsockopt(29 8 1b24000 4) on fd 180 [10:1:0]
> [main] Setsockopt(1 20 1b24000 4) on fd 181 [26:2:125]
> [main] Setsockopt(11 1 1b24000 4) on fd 183 [2:2:17]
> [   15.532543] BUG: unable to handle kernel NULL pointer dereference at 
> 0004
> [   15.534143] PGD 80001734b067 P4D 80001734b067 PUD 17350067 PMD 0 
> [   15.535516] Oops: 0002 [#1] PTI
> [   15.536165] Modules linked in:
> [   15.536798] CPU: 0 PID: 363 Comm: trinity-main Not tainted 
> 4.17.0-rc1-1-g486ad79 #2
> [   15.538396] RIP: 0010:llc_ui_release+0x3a/0xd0
> [   15.539293] RSP: 0018:c915bd70 EFLAGS: 00010202
> [   15.540345] RAX: 0001 RBX: 88001fa60008 RCX: 
> 0006
> [   15.541802] RDX: 0006 RSI: 88001fdda660 RDI: 
> 88001fa60008
> [   15.543139] RBP: c915bd80 R08:  R09: 
> 
> [   15.544725] R10:  R11:  R12: 
> 
> [   15.546287] R13: 88001fa61730 R14: 88001e130a60 R15: 
> 880019bdb3f0
> [   15.547962] FS:  7f2221bb1700() GS:82034000() 
> knlGS:
> [   15.549848] CS:  0010 DS:  ES:  CR0: 80050033
> [   15.551186] CR2: 0004 CR3: 1734e000 CR4: 
> 06b0
> [   15.552671] DR0: 02232000 DR1:  DR2: 
> 
> [   15.554105] DR3:  DR6: 0ff0 DR7: 
> 0600
> [   15.34] Call Trace:
> [   15.556049]  sock_release+0x14/0x60
> [   15.556767]  sock_close+0xd/0x20
> [   15.557427]  __fput+0xba/0x1f0
> [   15.558058]  fput+0x9/0x10
> [   15.558682]  task_work_run+0x73/0xa0
> [   15.559416]  do_exit+0x231/0xab0
> [   15.560079]  do_group_exit+0x3f/0xc0
> [   15.560810]  __x64_sys_exit_group+0x13/0x20
> [   15.561656]  do_syscall_64+0x58/0x2f0
> [   15.562407]  ? trace_hardirqs_off_thunk+0x1a/0x1c
> [   15.563360]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [   15.564471] RIP: 0033:0x7f2221696408
> [   15.565264] RSP: 002b:7ffe5c544c48 EFLAGS: 0206 ORIG_RAX: 
> 00e7
> [   15.566924] RAX: ffda RBX:  RCX: 
> 7f2221696408
> [   15.568485] RDX:  RSI: 003c RDI: 
> 
> [   15.570046] RBP:  R08: 00e7 R09: 
> ffa0
> [   15.571603] R10: 7ffe5c5449e0 R11: 0206 R12: 
> 0004
> [   15.573160] R13: 7ffe5c544e30 R14:  R15: 
> 
> [   15.574720] Code: 7b ff 43 78 0f 88 a5 6f 14 00 31 f6 48 89 df e8 ad 33 fb 
> ff 48 89 df e8 55 94 ff ff 85 c0 0f 84 84 00 00 00 4c 8b a3 d8 04 00 00 <41> 
> ff 44 24 04 0f 88 7f 6f 14 00 48 8b 43 58 f6 c4 01 74 58 48 
> [   15.578679] RIP: llc_ui_release+0x3a/0xd0 RSP: c915bd70
> [   15.579874] CR2: 0004
> [   15.580553] ---[ end trace 0dd8fdc6b7182234 ]---
>

So it's saying that something which got committed into Linus's tree
after 4.17-rc3 has caused a NULL deref in
sock_release->llc_ui_release+0x3a/0xd0




[PATCH net] macsonic: Set platform device coherent_dma_mask

2018-05-02 Thread Finn Thain
Set the device's coherent_dma_mask to avoid a WARNING splat.
Please see commit 205e1b7f51e4 ("dma-mapping: warn when there is
no coherent_dma_mask").

Cc: linux-m...@lists.linux-m68k.org
Signed-off-by: Finn Thain 
---
 drivers/net/ethernet/natsemi/macsonic.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/natsemi/macsonic.c 
b/drivers/net/ethernet/natsemi/macsonic.c
index 0937fc2a928e..37b1ffa8bb61 100644
--- a/drivers/net/ethernet/natsemi/macsonic.c
+++ b/drivers/net/ethernet/natsemi/macsonic.c
@@ -523,6 +523,10 @@ static int mac_sonic_platform_probe(struct platform_device 
*pdev)
struct sonic_local *lp;
int err;
 
+   err = dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
+   if (err)
+   return err;
+
dev = alloc_etherdev(sizeof(struct sonic_local));
if (!dev)
return -ENOMEM;
-- 
2.16.1



[PATCH net] macmace: Set platform device coherent_dma_mask

2018-05-02 Thread Finn Thain
Set the device's coherent_dma_mask to avoid a WARNING splat.
Please see commit 205e1b7f51e4 ("dma-mapping: warn when there is
no coherent_dma_mask").

Cc: linux-m...@lists.linux-m68k.org
Tested-by: Stan Johnson 
Signed-off-by: Finn Thain 
---
 drivers/net/ethernet/apple/macmace.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/apple/macmace.c 
b/drivers/net/ethernet/apple/macmace.c
index 137cbb470af2..98292c49ecf0 100644
--- a/drivers/net/ethernet/apple/macmace.c
+++ b/drivers/net/ethernet/apple/macmace.c
@@ -203,6 +203,10 @@ static int mace_probe(struct platform_device *pdev)
unsigned char checksum = 0;
int err;
 
+   err = dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
+   if (err)
+   return err;
+
dev = alloc_etherdev(PRIV_BYTES);
if (!dev)
return -ENOMEM;
-- 
2.16.1



[PATCH 1/2] ixgbe: release lock for the duration of ixgbe_suspend_close()

2018-05-02 Thread Pavel Tatashin
Currently, during device_shutdown() ixgbe holds rtnl_lock for the duration
of lengthy ixgbe_close_suspend(). On machines with multiple ixgbe cards
this lock prevents scaling if device_shutdown() function is multi-threaded.

It is not necessary to hold this lock during ixgbe_close_suspend(),
as it is not held when ixgbe_close() is called, which also happens during
shutdown in the kexec case.

Signed-off-by: Pavel Tatashin 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index afadba99f7b8..e7875b58854b 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -6748,8 +6748,15 @@ static int __ixgbe_shutdown(struct pci_dev *pdev, bool 
*enable_wake)
rtnl_lock();
netif_device_detach(netdev);
 
-   if (netif_running(netdev))
+   if (netif_running(netdev)) {
+   /* Suspend takes a long time, device_shutdown may be
+* parallelized this function, so drop lock for the
+* duration of this call.
+*/
+   rtnl_unlock();
ixgbe_close_suspend(adapter);
+   rtnl_lock();
+   }
 
ixgbe_clear_interrupt_scheme(adapter);
rtnl_unlock();
-- 
2.17.0



[PATCH 2/2] drivers core: multi-threading device shutdown

2018-05-02 Thread Pavel Tatashin
When system is rebooted, halted or kexeced device_shutdown() is
called.

This function shuts down every single device by calling either:
dev->bus->shutdown(dev)
dev->driver->shutdown(dev)

Even on a machine with just a moderate number of devices, device_shutdown()
may take multiple seconds to complete, because many devices require
specific delays to perform this operation.

Here is sample analysis of time it takes to call device_shutdown() on
two socket Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz machine.

device_shutdown 2.95s
 mlx4_shutdown  1.14s
 megasas_shutdown   0.24s
 ixgbe_shutdown 0.37s x 4 (four ixgbe devices on my machine).
 the rest   0.09s

In mlx4 we spent the most time, but that is because there is a 1 second
sleep:
mlx4_shutdown
 mlx4_unload_one
  mlx4_free_ownership
   msleep(1000)

With megasas we spend a quarter of a second, but sometimes longer (up to 0.5s)
in this path:

megasas_shutdown
  megasas_flush_cache
megasas_issue_blocked_cmd
  wait_event_timeout

Finally, with ixgbe_shutdown() it takes 0.37s for each device, but that time
is spread all over the place, with the bigger offenders being:

ixgbe_shutdown
  __ixgbe_shutdown
ixgbe_close_suspend
  ixgbe_down
ixgbe_init_hw_generic
  ixgbe_reset_hw_X540
msleep(100);0.104483472
ixgbe_get_san_mac_addr_generic  0.048414851
ixgbe_get_wwn_prefix_generic0.048409893
  ixgbe_start_hw_X540
ixgbe_start_hw_generic
  ixgbe_clear_hw_cntrs_generic  0.048581502
  ixgbe_setup_fc_generic0.024225800

All the ixgbe_*generic functions end-up calling:
ixgbe_read_eerd_X540()
  ixgbe_acquire_swfw_sync_X540
usleep_range(5000, 6000);
  ixgbe_release_swfw_sync_X540
usleep_range(5000, 6000);

While these are short sleeps, they end up being called over 24 times!
24 * 0.0055s = 0.132s, adding up to 0.528s for four devices.

While we should keep optimizing the individual device drivers, in some
cases this is simply a hardware property that forces a specific delay, and
we must wait.

So, the solution to this problem is to shut down devices in parallel.
However, we must shut down children before shutting down parents, so a
parent device must wait for its children to finish.
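
To make the ordering requirement concrete, here is a minimal
single-threaded sketch expressed with the helpers this patch adds below
(the patch itself spreads the per-subtree work across kernel threads;
this is illustration only, not the patch code):

static void shutdown_subtree(struct device *dev)
{
        int i, children = device_children_count(dev);

        /* children are always shut down before their parent */
        for (i = 0; i < children; i++) {
                struct device *child = device_get_child_by_index(dev, i);

                if (child)
                        shutdown_subtree(child);
        }
        device_shutdown_one(dev);
}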

With this patch, on the same machine, device_shutdown() takes 1.142s, and
without the mlx4 one-second delay only 0.38s.

Signed-off-by: Pavel Tatashin 
---
 drivers/base/core.c | 238 +++-
 1 file changed, 189 insertions(+), 49 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index b610816eb887..f370369a303b 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "base.h"
 #include "power/power.h"
@@ -2102,6 +2103,59 @@ const char *device_get_devnode(struct device *dev,
return *tmp = s;
 }
 
+/**
+ * device_children_count - device children count
+ * @parent: parent struct device.
+ *
+ * Returns the number of children for this device, or 0 if none.
+ */
+static int device_children_count(struct device *parent)
+{
+   struct klist_iter i;
+   int children = 0;
+
+   if (!parent->p)
+   return 0;
+
+   klist_iter_init(&parent->p->klist_children, &i);
+   while (next_device(&i))
+   children++;
+   klist_iter_exit(&i);
+
+   return children;
+}
+
+/**
+ * device_get_child_by_index - Return child using the provided index.
+ * @parent: parent struct device.
+ * @index:  Index of the child, where 0 is the first child in the children
+ * list, and so on.
+ *
+ * Returns child or NULL if child with this index is not present.
+ */
+static struct device *
+device_get_child_by_index(struct device *parent, int index)
+{
+   struct klist_iter i;
+   struct device *dev = NULL, *d;
+   int child_index = 0;
+
+   if (!parent->p || index < 0)
+   return NULL;
+
+   klist_iter_init(&parent->p->klist_children, &i);
+   while ((d = next_device(&i)) != NULL) {
+   if (child_index == index) {
+   dev = d;
+   break;
+   }
+   child_index++;
+   }
+   klist_iter_exit(&i);
+
+   return dev;
+}
+
 /**
  * device_for_each_child - device child iterator.
  * @parent: parent struct device.
@@ -2765,71 +2819,157 @@ int device_move(struct device *dev, struct device 
*new_parent,
 }
 EXPORT_SYMBOL_GPL(device_move);
 
+/*
+ * device_shutdown_one - call ->shutdown() for the device passed as
+ * argument.
+ */
+static void device_shutdown_one(struct device *dev)
+{
+   /* Don't allow any more runtime suspends */
+   pm_runtime_get_noresume(dev);
+   pm_runtime_barrier(dev);
+
+   if (dev->class && 

[PATCH 0/2] multi-threading device shutdown

2018-05-02 Thread Pavel Tatashin
Do a faster shutdown by calling dev->*->shutdown(dev) in parallel.
device_shutdown() calls these functions for every single device but
only using one thread.

Since nothing else is running on the machine by the time device_shutdown()
is called, there is no reason not to utilize all the available CPU
resources.

Pavel Tatashin (2):
  ixgbe: release lock for the duration of ixgbe_suspend_close()
  drivers core: multi-threading device shutdown

 drivers/base/core.c   | 238 ++
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   9 +-
 2 files changed, 197 insertions(+), 50 deletions(-)

-- 
2.17.0



[bpf-next v1 5/9] net/ipv6: Add fib6_lookup

2018-05-02 Thread David Ahern
Add IPv6 equivalent to fib_lookup. Does a fib lookup, including rules,
but returns a FIB entry, fib6_info, rather than a dst based rt6_info.
fib6_lookup is anywhere from 140% (MULTIPLE_TABLES config disabled)
to 60% faster than any of the dst based lookup methods (without custom
rules) and 25% faster with custom rules (e.g., l3mdev rule).

Since the lookup function has a completely different signature,
fib6_rule_action is split into 2 paths: the existing one is
renamed __fib6_rule_action and a new one for the fib6_info path
is added. fib6_rule_action decides which to call based on the
lookup_ptr. If it is fib6_table_lookup then the new path is taken.
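
In other words, the dispatch is roughly the following (sketch of the
description above; the corresponding hunk is truncated at the end of this
mail):

static int fib6_rule_action(struct fib_rule *rule, struct flowi *flp,
                            int flags, struct fib_lookup_arg *arg)
{
        if (arg->lookup_ptr == fib6_table_lookup)
                return fib6_rule_action_alt(rule, flp, flags, arg);

        return __fib6_rule_action(rule, flp, flags, arg);
}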

Caller must hold rcu lock as no reference is taken on the returned
fib entry.
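
A caller is therefore expected to look roughly like this (illustrative
only; flowi6 setup and error handling are omitted):

struct fib6_info *f6i;

rcu_read_lock();
f6i = fib6_lookup(net, oif, &fl6, flags);
if (!IS_ERR(f6i) && f6i != net->ipv6.fib6_null_entry) {
        /* use f6i here; it is only valid under rcu_read_lock() */
}
rcu_read_unlock();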

Signed-off-by: David Ahern 
---
 include/net/ip6_fib.h |  6 
 net/ipv6/fib6_rules.c | 86 +--
 net/ipv6/ip6_fib.c|  7 +
 3 files changed, 97 insertions(+), 2 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 4f7b8f59ea6d..d920dd00139b 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -376,6 +376,12 @@ struct dst_entry *fib6_rule_lookup(struct net *net, struct 
flowi6 *fl6,
   const struct sk_buff *skb,
   int flags, pol_lookup_t lookup);
 
+/* called with rcu lock held; can return error pointer
+ * caller needs to select path
+ */
+struct fib6_info *fib6_lookup(struct net *net, int oif, struct flowi6 *fl6,
+ int flags);
+
 /* called with rcu lock held; caller needs to select path */
 struct fib6_info *fib6_table_lookup(struct net *net, struct fib6_table *table,
int oif, struct flowi6 *fl6, int strict);
diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c
index d040c4bff3a0..f590446595d8 100644
--- a/net/ipv6/fib6_rules.c
+++ b/net/ipv6/fib6_rules.c
@@ -60,6 +60,39 @@ unsigned int fib6_rules_seq_read(struct net *net)
return fib_rules_seq_read(net, AF_INET6);
 }
 
+/* called with rcu lock held; no reference taken on fib6_info */
+struct fib6_info *fib6_lookup(struct net *net, int oif, struct flowi6 *fl6,
+ int flags)
+{
+   struct fib6_info *f6i;
+   int err;
+
+   if (net->ipv6.fib6_has_custom_rules) {
+   struct fib_lookup_arg arg = {
+   .lookup_ptr = fib6_table_lookup,
+   .lookup_data = &oif,
+   .flags = FIB_LOOKUP_NOREF,
+   };
+
+   l3mdev_update_flow(net, flowi6_to_flowi(fl6));
+
+   err = fib_rules_lookup(net->ipv6.fib6_rules_ops,
+  flowi6_to_flowi(fl6), flags, &arg);
+   if (err)
+   return ERR_PTR(err);
+
+   f6i = arg.result ? : net->ipv6.fib6_null_entry;
+   } else {
+   f6i = fib6_table_lookup(net, net->ipv6.fib6_local_tbl,
+   oif, fl6, flags);
+   if (!f6i || f6i == net->ipv6.fib6_null_entry)
+   f6i = fib6_table_lookup(net, net->ipv6.fib6_main_tbl,
+   oif, fl6, flags);
+   }
+
+   return f6i;
+}
+
 struct dst_entry *fib6_rule_lookup(struct net *net, struct flowi6 *fl6,
   const struct sk_buff *skb,
   int flags, pol_lookup_t lookup)
@@ -121,8 +154,48 @@ static int fib6_rule_saddr(struct net *net, struct 
fib_rule *rule, int flags,
return 0;
 }
 
-static int fib6_rule_action(struct fib_rule *rule, struct flowi *flp,
-   int flags, struct fib_lookup_arg *arg)
+static int fib6_rule_action_alt(struct fib_rule *rule, struct flowi *flp,
+   int flags, struct fib_lookup_arg *arg)
+{
+   struct flowi6 *flp6 = &flp->u.ip6;
+   struct net *net = rule->fr_net;
+   struct fib6_table *table;
+   struct fib6_info *f6i;
+   int err = -EAGAIN, *oif;
+   u32 tb_id;
+
+   switch (rule->action) {
+   case FR_ACT_TO_TBL:
+   break;
+   case FR_ACT_UNREACHABLE:
+   return -ENETUNREACH;
+   case FR_ACT_PROHIBIT:
+   return -EACCES;
+   case FR_ACT_BLACKHOLE:
+   default:
+   return -EINVAL;
+   }
+
+   tb_id = fib_rule_get_table(rule, arg);
+   table = fib6_get_table(net, tb_id);
+   if (!table)
+   return -EAGAIN;
+
+   oif = (int *)arg->lookup_data;
+   f6i = fib6_table_lookup(net, table, *oif, flp6, flags);
+   if (f6i != net->ipv6.fib6_null_entry) {
+   err = fib6_rule_saddr(net, rule, flags, flp6,
+ fib6_info_nh_dev(f6i));
+
+   if (likely(!err))
+   arg->result = f6i;
+   }
+
+   return err;
+}
+
+static int 

[bpf-next v1 7/9] net/ipv6: Add fib lookup stubs for use in bpf helper

2018-05-02 Thread David Ahern
Add stubs to retrieve a handle to an IPv6 FIB table, fib6_get_table,
a stub to do a lookup in a specific table, fib6_table_lookup, and
a stub for a full route lookup.

The stubs are needed for core bpf code to handle the case when the
IPv6 module is not builtin.
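
With the stubs in place, core code can do something along these lines
whether or not IPv6 is built in (sketch only; the eafnosupport_* versions
added below simply return NULL):

struct fib6_info *f6i;

rcu_read_lock();
f6i = ipv6_stub->fib6_lookup(net, oif, &fl6, flags);
if (!f6i) {
        /* IPv6 is not available (eafnosupport stub) */
        rcu_read_unlock();
        return -EAFNOSUPPORT;
}
/* ... use f6i under the rcu lock ... */
rcu_read_unlock();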

Signed-off-by: David Ahern 
---
 include/net/addrconf.h   | 14 ++
 net/ipv6/addrconf_core.c | 33 -
 net/ipv6/af_inet6.c  |  6 +-
 3 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 8312cc25a3af..ff766ab207e0 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -223,6 +223,20 @@ struct ipv6_stub {
 const struct in6_addr *addr);
int (*ipv6_dst_lookup)(struct net *net, struct sock *sk,
   struct dst_entry **dst, struct flowi6 *fl6);
+
+   struct fib6_table *(*fib6_get_table)(struct net *net, u32 id);
+   struct fib6_info *(*fib6_lookup)(struct net *net, int oif,
+struct flowi6 *fl6, int flags);
+   struct fib6_info *(*fib6_table_lookup)(struct net *net,
+ struct fib6_table *table,
+ int oif, struct flowi6 *fl6,
+ int flags);
+   struct fib6_info *(*fib6_multipath_select)(const struct net *net,
+  struct fib6_info *f6i,
+  struct flowi6 *fl6, int oif,
+  const struct sk_buff *skb,
+  int strict);
+
void (*udpv6_encap_enable)(void);
void (*ndisc_send_na)(struct net_device *dev, const struct in6_addr 
*daddr,
  const struct in6_addr *solicited_addr,
diff --git a/net/ipv6/addrconf_core.c b/net/ipv6/addrconf_core.c
index 32b564dfd02a..2fe754fd4f5e 100644
--- a/net/ipv6/addrconf_core.c
+++ b/net/ipv6/addrconf_core.c
@@ -134,8 +134,39 @@ static int eafnosupport_ipv6_dst_lookup(struct net *net, 
struct sock *u1,
return -EAFNOSUPPORT;
 }
 
+static struct fib6_table *eafnosupport_fib6_get_table(struct net *net, u32 id)
+{
+   return NULL;
+}
+
+static struct fib6_info *
+eafnosupport_fib6_table_lookup(struct net *net, struct fib6_table *table,
+  int oif, struct flowi6 *fl6, int flags)
+{
+   return NULL;
+}
+
+static struct fib6_info *
+eafnosupport_fib6_lookup(struct net *net, int oif, struct flowi6 *fl6,
+int flags)
+{
+   return NULL;
+}
+
+static struct fib6_info *
+eafnosupport_fib6_multipath_select(const struct net *net, struct fib6_info 
*f6i,
+  struct flowi6 *fl6, int oif,
+  const struct sk_buff *skb, int strict)
+{
+   return f6i;
+}
+
 const struct ipv6_stub *ipv6_stub __read_mostly = &(struct ipv6_stub) {
-   .ipv6_dst_lookup = eafnosupport_ipv6_dst_lookup,
+   .ipv6_dst_lookup   = eafnosupport_ipv6_dst_lookup,
+   .fib6_get_table= eafnosupport_fib6_get_table,
+   .fib6_table_lookup = eafnosupport_fib6_table_lookup,
+   .fib6_lookup   = eafnosupport_fib6_lookup,
+   .fib6_multipath_select = eafnosupport_fib6_multipath_select,
 };
 EXPORT_SYMBOL_GPL(ipv6_stub);
 
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 36d622c477b1..c0e8255d50bb 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -887,7 +887,11 @@ static struct pernet_operations inet6_net_ops = {
 static const struct ipv6_stub ipv6_stub_impl = {
.ipv6_sock_mc_join = ipv6_sock_mc_join,
.ipv6_sock_mc_drop = ipv6_sock_mc_drop,
-   .ipv6_dst_lookup = ip6_dst_lookup,
+   .ipv6_dst_lookup   = ip6_dst_lookup,
+   .fib6_get_table= fib6_get_table,
+   .fib6_table_lookup = fib6_table_lookup,
+   .fib6_lookup   = fib6_lookup,
+   .fib6_multipath_select = fib6_multipath_select,
.udpv6_encap_enable = udpv6_encap_enable,
.ndisc_send_na = ndisc_send_na,
.nd_tbl = &nd_tbl,
-- 
2.11.0



[bpf-next v1 9/9] samples/bpf: Add example of ipv4 and ipv6 forwarding in XDP

2018-05-02 Thread David Ahern
Simple example of fast-path forwarding. It has a serious flaw
in not verifying the egress device index supports XDP forwarding.
If the egress device does not packets are dropped.

Take this only as a simple example of fast-path forwarding.

Signed-off-by: David Ahern 
---
 samples/bpf/Makefile  |   4 +
 samples/bpf/xdp_fwd_kern.c| 113 +
 samples/bpf/xdp_fwd_user.c| 136 ++
 tools/testing/selftests/bpf/bpf_helpers.h |   3 +
 4 files changed, 256 insertions(+)
 create mode 100644 samples/bpf/xdp_fwd_kern.c
 create mode 100644 samples/bpf/xdp_fwd_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 5e31770ac087..393dac1c43f4 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -45,6 +45,7 @@ hostprogs-y += xdp_rxq_info
 hostprogs-y += syscall_tp
 hostprogs-y += cpustat
 hostprogs-y += xdp_adjust_tail
+hostprogs-y += xdp_fwd
 
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
@@ -98,6 +99,7 @@ xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
 syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
 cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
 xdp_adjust_tail-objs := bpf_load.o $(LIBBPF) xdp_adjust_tail_user.o
+xdp_fwd-objs := bpf_load.o $(LIBBPF) xdp_fwd_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -151,6 +153,7 @@ always += xdp2skb_meta_kern.o
 always += syscall_tp_kern.o
 always += cpustat_kern.o
 always += xdp_adjust_tail_kern.o
+always += xdp_fwd_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -197,6 +200,7 @@ HOSTLOADLIBES_xdp_rxq_info += -lelf
 HOSTLOADLIBES_syscall_tp += -lelf
 HOSTLOADLIBES_cpustat += -lelf
 HOSTLOADLIBES_xdp_adjust_tail += -lelf
+HOSTLOADLIBES_xdp_fwd += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on 
cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc 
CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c
new file mode 100644
index ..7eeaa32538b1
--- /dev/null
+++ b/samples/bpf/xdp_fwd_kern.c
@@ -0,0 +1,113 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2017-18 David Ahern 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#define KBUILD_MODNAME "foo"
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "bpf_helpers.h"
+
+#define IPV6_FLOWINFO_MASK  cpu_to_be32(0x0FFFFFFF)
+
+struct bpf_map_def SEC("maps") tx_port = {
+   .type = BPF_MAP_TYPE_DEVMAP,
+   .key_size = sizeof(int),
+   .value_size = sizeof(int),
+   .max_entries = 64,
+};
+
+static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
+{
+   void *data_end = (void *)(long)ctx->data_end;
+   void *data = (void *)(long)ctx->data;
+   struct bpf_fib_lookup fib_params;
+   struct ethhdr *eth = data;
+   int out_index;
+   u16 h_proto;
+   u64 nh_off;
+
+   nh_off = sizeof(*eth);
+   if (data + nh_off > data_end)
+   return XDP_DROP;
+
+   __builtin_memset(&fib_params, 0, sizeof(fib_params));
+
+   h_proto = eth->h_proto;
+   if (h_proto == htons(ETH_P_IP)) {
+   struct iphdr *iph = data + nh_off;
+
+   if (iph + 1 > data_end)
+   return XDP_DROP;
+
+   fib_params.family   = AF_INET;
+   fib_params.tos  = iph->tos;
+   fib_params.l4_protocol  = iph->protocol;
+   fib_params.sport= 0;
+   fib_params.dport= 0;
+   fib_params.tot_len  = ntohs(iph->tot_len);
+   fib_params.ipv4_src = iph->saddr;
+   fib_params.ipv4_dst = iph->daddr;
+   } else if (h_proto == htons(ETH_P_IPV6)) {
+   struct ipv6hdr *iph = data + nh_off;
+
+   if (iph + 1 > data_end)
+   return XDP_DROP;
+
+   fib_params.family   = AF_INET6;
+   fib_params.flowlabel= *(__be32 *)iph & IPV6_FLOWINFO_MASK;
+   fib_params.l4_protocol  = iph->nexthdr;
+   fib_params.sport= 0;
+   fib_params.dport= 0;
+   fib_params.tot_len  = ntohs(iph->payload_len);
+   fib_params.ipv6_src = iph->saddr;
+   fib_params.ipv6_dst = iph->daddr;
+   } else {
+   return 

[bpf-next v1 2/9] net/ipv6: Rename rt6_multipath_select

2018-05-02 Thread David Ahern
Rename rt6_multipath_select to fib6_multipath_select and export it.
A later patch wants access to it similar to IPv4's fib_select_path.

Signed-off-by: David Ahern 
---
 include/net/ip6_fib.h |  5 +
 net/ipv6/route.c  | 17 +
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 5a16630179cb..80d76d8dc683 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -376,6 +376,11 @@ struct dst_entry *fib6_rule_lookup(struct net *net, struct 
flowi6 *fl6,
   const struct sk_buff *skb,
   int flags, pol_lookup_t lookup);
 
+struct fib6_info *fib6_multipath_select(const struct net *net,
+   struct fib6_info *match,
+   struct flowi6 *fl6, int oif,
+   const struct sk_buff *skb, int strict);
+
 struct fib6_node *fib6_node_lookup(struct fib6_node *root,
   const struct in6_addr *daddr,
   const struct in6_addr *saddr);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index d903db30dfff..58af969f3a2c 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -419,11 +419,11 @@ static bool rt6_check_expired(const struct rt6_info *rt)
return false;
 }
 
-static struct fib6_info *rt6_multipath_select(const struct net *net,
- struct fib6_info *match,
-struct flowi6 *fl6, int oif,
-const struct sk_buff *skb,
-int strict)
+struct fib6_info *fib6_multipath_select(const struct net *net,
+   struct fib6_info *match,
+   struct flowi6 *fl6, int oif,
+   const struct sk_buff *skb,
+   int strict)
 {
struct fib6_info *sibling, *next_sibling;
 
@@ -1068,8 +1068,9 @@ static struct rt6_info *ip6_pol_route_lookup(struct net 
*net,
	f6i = rt6_device_match(net, f6i, &fl6->saddr,
  fl6->flowi6_oif, flags);
if (f6i->fib6_nsiblings && fl6->flowi6_oif == 0)
-   f6i = rt6_multipath_select(net, f6i, fl6,
-  fl6->flowi6_oif, skb, flags);
+   f6i = fib6_multipath_select(net, f6i, fl6,
+   fl6->flowi6_oif, skb,
+   flags);
}
if (f6i == net->ipv6.fib6_null_entry) {
	fn = fib6_backtrack(fn, &fl6->saddr);
@@ -1824,7 +1825,7 @@ struct rt6_info *ip6_pol_route(struct net *net, struct 
fib6_table *table,
 redo_rt6_select:
f6i = rt6_select(net, fn, oif, strict);
if (f6i->fib6_nsiblings)
-   f6i = rt6_multipath_select(net, f6i, fl6, oif, skb, strict);
+   f6i = fib6_multipath_select(net, f6i, fl6, oif, skb, strict);
if (f6i == net->ipv6.fib6_null_entry) {
	fn = fib6_backtrack(fn, &fl6->saddr);
if (fn)
-- 
2.11.0



[bpf-next v1 0/9] bpf: Add helper to do FIB lookups

2018-05-02 Thread David Ahern
Provide a helper for doing a FIB and neighbor lookup in the kernel
tables from an XDP program. The helper provides a fastpath for forwarding
packets. If the packet is a local delivery or for any reason is not a
simple lookup and forward, the packet is expected to continue up the stack
for full processing.

The response from a FIB and neighbor lookup is either the egress index
with the bpf_fib_lookup struct filled in with dmac and gateway or
0 meaning the packet should continue up the stack. In time we can
revisit this to return the FIB lookup result errno if it is one of the
special RTN_'s such as RTN_BLACKHOLE (-EINVAL) so that the XDP
programs can do an early drop if desired.
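
For illustration, a minimal sketch of how an XDP program might consume
that return value (hedged: "fib_params", "eth" and the "tx_port" devmap
follow the sample program in patch 9, "rc" is a local name introduced
here, and the exact sample code may differ):

	rc = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), 0);
	if (rc > 0) {
		/* forwarding path resolved: rewrite the MAC addresses from
		 * the filled-in params and redirect to the egress ifindex
		 */
		memcpy(eth->h_dest, fib_params.dmac, ETH_ALEN);
		memcpy(eth->h_source, fib_params.smac, ETH_ALEN);
		return bpf_redirect_map(&tx_port, rc, 0);
	}

	/* rc == 0 (or an error): let the packet continue up the stack */
	return XDP_PASS;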

Patches 1-6 do some more refactoring to IPv6 with the end goal of
extracting a FIB lookup function that aligns with fib_lookup for IPv4,
basically returning a fib6_info without creating a dst based entry.

Patch 7 adds lookup functions to the ipv6 stub. These are needed since
bpf is built into the kernel and ipv6 may not be built or loaded.

Patch 8 adds the bpf helper and 9 adds a sample program.

v1
- updated commit messages and cover letter
- added comment to sample program noting lack of verification on
  egress device supporting XDP

RFC v2
- fixed use of foward helper from cls_act as noted by Daniel
- in patch 1 rename fib6_lookup_1 as well for consistency


David Ahern (9):
  net/ipv6: Rename fib6_lookup to fib6_node_lookup
  net/ipv6: Rename rt6_multipath_select
  net/ipv6: Extract table lookup from ip6_pol_route
  net/ipv6: Refactor fib6_rule_action
  net/ipv6: Add fib6_lookup
  net/ipv6: Update fib6 tracepoint to take fib6_info
  net/ipv6: Add fib lookup stubs for use in bpf helper
  bpf: Provide helper to do lookups in kernel FIB table
  samples/bpf: Add example of ipv4 and ipv6 forwarding in XDP

 include/net/addrconf.h|  14 ++
 include/net/ip6_fib.h |  21 ++-
 include/trace/events/fib6.h   |  14 +-
 include/uapi/linux/bpf.h  |  83 +-
 net/core/filter.c | 263 ++
 net/ipv6/addrconf_core.c  |  33 +++-
 net/ipv6/af_inet6.c   |   6 +-
 net/ipv6/fib6_rules.c | 138 +---
 net/ipv6/ip6_fib.c|  21 ++-
 net/ipv6/route.c  |  76 +
 samples/bpf/Makefile  |   4 +
 samples/bpf/xdp_fwd_kern.c| 113 +
 samples/bpf/xdp_fwd_user.c| 136 +++
 tools/testing/selftests/bpf/bpf_helpers.h |   3 +
 14 files changed, 850 insertions(+), 75 deletions(-)
 create mode 100644 samples/bpf/xdp_fwd_kern.c
 create mode 100644 samples/bpf/xdp_fwd_user.c

-- 
2.11.0



[bpf-next v1 3/9] net/ipv6: Extract table lookup from ip6_pol_route

2018-05-02 Thread David Ahern
ip6_pol_route is used for ingress and egress FIB lookups. Refactor it
moving the table lookup into a separate fib6_table_lookup that can be
invoked separately and export the new function.

ip6_pol_route now calls fib6_table_lookup and uses the result to generate
a dst based rt6_info.

Signed-off-by: David Ahern 
---
 include/net/ip6_fib.h |  4 
 net/ipv6/route.c  | 39 +--
 2 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 80d76d8dc683..4f7b8f59ea6d 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -376,6 +376,10 @@ struct dst_entry *fib6_rule_lookup(struct net *net, struct 
flowi6 *fl6,
   const struct sk_buff *skb,
   int flags, pol_lookup_t lookup);
 
+/* called with rcu lock held; caller needs to select path */
+struct fib6_info *fib6_table_lookup(struct net *net, struct fib6_table *table,
+   int oif, struct flowi6 *fl6, int strict);
+
 struct fib6_info *fib6_multipath_select(const struct net *net,
struct fib6_info *match,
struct flowi6 *fl6, int oif,
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 58af969f3a2c..d0ace0c5c3e9 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1800,21 +1800,12 @@ void rt6_age_exceptions(struct fib6_info *rt,
rcu_read_unlock_bh();
 }
 
-struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
-  int oif, struct flowi6 *fl6,
-  const struct sk_buff *skb, int flags)
+/* must be called with rcu lock held */
+struct fib6_info *fib6_table_lookup(struct net *net, struct fib6_table *table,
+   int oif, struct flowi6 *fl6, int strict)
 {
struct fib6_node *fn, *saved_fn;
struct fib6_info *f6i;
-   struct rt6_info *rt;
-   int strict = 0;
-
-   strict |= flags & RT6_LOOKUP_F_IFACE;
-   strict |= flags & RT6_LOOKUP_F_IGNORE_LINKSTATE;
-   if (net->ipv6.devconf_all->forwarding == 0)
-   strict |= RT6_LOOKUP_F_REACHABLE;
-
-   rcu_read_lock();
 
	fn = fib6_node_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
saved_fn = fn;
@@ -1824,8 +1815,6 @@ struct rt6_info *ip6_pol_route(struct net *net, struct 
fib6_table *table,
 
 redo_rt6_select:
f6i = rt6_select(net, fn, oif, strict);
-   if (f6i->fib6_nsiblings)
-   f6i = fib6_multipath_select(net, f6i, fl6, oif, skb, strict);
if (f6i == net->ipv6.fib6_null_entry) {
	fn = fib6_backtrack(fn, &fl6->saddr);
if (fn)
@@ -1838,6 +1827,28 @@ struct rt6_info *ip6_pol_route(struct net *net, struct 
fib6_table *table,
}
}
 
+   return f6i;
+}
+
+struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
+  int oif, struct flowi6 *fl6,
+  const struct sk_buff *skb, int flags)
+{
+   struct fib6_info *f6i;
+   struct rt6_info *rt;
+   int strict = 0;
+
+   strict |= flags & RT6_LOOKUP_F_IFACE;
+   strict |= flags & RT6_LOOKUP_F_IGNORE_LINKSTATE;
+   if (net->ipv6.devconf_all->forwarding == 0)
+   strict |= RT6_LOOKUP_F_REACHABLE;
+
+   rcu_read_lock();
+
+   f6i = fib6_table_lookup(net, table, oif, fl6, strict);
+   if (f6i->fib6_nsiblings)
+   f6i = fib6_multipath_select(net, f6i, fl6, oif, skb, strict);
+
if (f6i == net->ipv6.fib6_null_entry) {
rt = net->ipv6.ip6_null_entry;
rcu_read_unlock();
-- 
2.11.0



[bpf-next v1 8/9] bpf: Provide helper to do lookups in kernel FIB table

2018-05-02 Thread David Ahern
Provide a helper for doing a FIB and neighbor lookup in the kernel
tables from an XDP program. The helper provides a fastpath for forwarding
packets. If the packet is a local delivery or for any reason is not a
simple lookup and forward, the packet continues up the stack.

If it is to be forwarded, the forwarding can be done directly if the
neighbor is already known. If the neighbor does not exist, the first
few packets go up the stack for neighbor resolution. Once resolved, the
xdp program provides the fast path.

On successful lookup the nexthop dmac, current device smac and egress
device index are returned.

The API supports IPv4, IPv6 and MPLS protocols, but only IPv4 and IPv6
are implemented in this patch. The API includes layer 4 parameters if
the XDP program chooses to do deep packet inspection, allowing comparison
against ACLs implemented as FIB rules.

Header rewrite is left to the XDP program.

The lookup takes 2 flags:
- BPF_FIB_LOOKUP_DIRECT to do a lookup that bypasses FIB rules and goes
  straight to the table associated with the device (expert setting for
  those looking to maximize throughput)

- BPF_FIB_LOOKUP_OUTPUT to do a lookup from the egress perspective.
  Default is an ingress lookup.
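
As a hedged illustration of how the flags combine with the default
behaviour ("params" is assumed to be a struct bpf_fib_lookup already
filled from the packet headers, and "rc"/"ctx" are placeholder names):

	/* default: ingress lookup that goes through the FIB rules */
	rc = bpf_fib_lookup(ctx, &params, sizeof(params), 0);

	/* skip the rules, use the table bound to params.ifindex */
	rc = bpf_fib_lookup(ctx, &params, sizeof(params),
			    BPF_FIB_LOOKUP_DIRECT);

	/* do the lookup from the egress point of view instead */
	rc = bpf_fib_lookup(ctx, &params, sizeof(params),
			    BPF_FIB_LOOKUP_OUTPUT);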

Initial performance numbers collected by Jesper, forwarded packets/sec:

   Full stackXDP FIB lookupXDP Direct lookup
IPv4   1,947,969   7,074,156  7,415,333
IPv6   1,728,000   6,165,504  7,262,720

These numbers are for single CPU core forwarding on a Broadwell
E5-1650 v4 @ 3.60GHz.

Signed-off-by: David Ahern 
---
 include/uapi/linux/bpf.h |  83 ++-
 net/core/filter.c| 263 +++
 2 files changed, 345 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 8daef7326bb7..360a1168c353 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -10,6 +10,8 @@
 
 #include 
 #include 
+#include 
+#include 
 
 /* Extended instruction set based on top of classic BPF */
 
@@ -1801,6 +1803,33 @@ union bpf_attr {
  * Return
  * a non-negative value equal to or less than size on success, or
  * a negative error in case of failure.
+ *
+ * int bpf_fib_lookup(void *ctx, struct bpf_fib_lookup *params, int plen, u32 
flags)
+ * Description
+ * Do FIB lookup in kernel tables using parameters in *params*.
+ * If lookup is successful and the result shows the packet is to be
+ * forwarded, the neighbor tables are searched for the nexthop.
+ * If successful (i.e., FIB lookup shows forwarding and nexthop
+ * is resolved), the nexthop address is returned in ipv4_dst,
+ * ipv6_dst or mpls_out based on family, smac is set to mac
+ * address of egress device, dmac is set to nexthop mac address,
+ * rt_metric is set to metric from route.
+ *
+ * *plen* argument is the size of the passed in struct.
+ * *flags* argument can be one or more BPF_FIB_LOOKUP_ flags:
+ *
+ * **BPF_FIB_LOOKUP_DIRECT** means do a direct table lookup vs
+ * full lookup using FIB rules
+ * **BPF_FIB_LOOKUP_OUTPUT** means do lookup from an egress
+ * perspective (default is ingress)
+ *
+ * *ctx* is either **struct xdp_md** for XDP programs or
+ * **struct sk_buff** tc cls_act programs.
+ *
+ * Return
+ * Egress device index on success, 0 if packet needs to continue
+ * up the stack for further processing or a negative error in case
+ * of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -1870,7 +1899,8 @@ union bpf_attr {
FN(bind),   \
FN(xdp_adjust_tail),\
FN(skb_get_xfrm_state), \
-   FN(get_stack),
+   FN(get_stack),  \
+   FN(fib_lookup),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -2278,4 +2308,55 @@ struct bpf_raw_tracepoint_args {
__u64 args[0];
 };
 
+/* DIRECT:  Skip the FIB rules and go to FIB table associated with device
+ * OUTPUT:  Do lookup from egress perspective; default is ingress
+ */
+#define BPF_FIB_LOOKUP_DIRECT  BIT(0)
+#define BPF_FIB_LOOKUP_OUTPUT  BIT(1)
+
+struct bpf_fib_lookup {
+   /* input */
+   __u8family;   /* network family, AF_INET, AF_INET6, AF_MPLS */
+
+   /* set if lookup is to consider L4 data - e.g., FIB rules */
+   __u8l4_protocol;
+   __be16  sport;
+   __be16  dport;
+
+   /* total length of packet from network header - used for MTU check */
+   __u16   tot_len;
+   __u32   ifindex;  /* L3 device index for lookup */
+
+   union {
+   /* inputs to lookup */
+   __u8tos;/* AF_INET  

[bpf-next v1 6/9] net/ipv6: Update fib6 tracepoint to take fib6_info

2018-05-02 Thread David Ahern
Similar to IPv4, IPv6 should use the FIB lookup result in the
tracepoint.

Signed-off-by: David Ahern 
---
 include/trace/events/fib6.h | 14 +++---
 net/ipv6/route.c| 14 ++
 2 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/include/trace/events/fib6.h b/include/trace/events/fib6.h
index 7e8d48a81b91..1b8d951e3c12 100644
--- a/include/trace/events/fib6.h
+++ b/include/trace/events/fib6.h
@@ -12,10 +12,10 @@
 
 TRACE_EVENT(fib6_table_lookup,
 
-   TP_PROTO(const struct net *net, const struct rt6_info *rt,
+   TP_PROTO(const struct net *net, const struct fib6_info *f6i,
 struct fib6_table *table, const struct flowi6 *flp),
 
-   TP_ARGS(net, rt, table, flp),
+   TP_ARGS(net, f6i, table, flp),
 
TP_STRUCT__entry(
__field(u32,tb_id   )
@@ -48,20 +48,20 @@ TRACE_EVENT(fib6_table_lookup,
in6 = (struct in6_addr *)__entry->dst;
*in6 = flp->daddr;
 
-   if (rt->rt6i_idev) {
-   __assign_str(name, rt->rt6i_idev->dev->name);
+   if (f6i->fib6_nh.nh_dev) {
+   __assign_str(name, f6i->fib6_nh.nh_dev);
} else {
__assign_str(name, "");
}
-   if (rt == net->ipv6.ip6_null_entry) {
+   if (f6i == net->ipv6.fib6_null_entry) {
struct in6_addr in6_zero = {};
 
in6 = (struct in6_addr *)__entry->gw;
*in6 = in6_zero;
 
-   } else if (rt) {
+   } else if (f6i) {
in6 = (struct in6_addr *)__entry->gw;
-   *in6 = rt->rt6i_gateway;
+   *in6 = f6i->fib6_nh.nh_gw;
}
),
 
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index d0ace0c5c3e9..cf8de6899581 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1078,6 +1078,8 @@ static struct rt6_info *ip6_pol_route_lookup(struct net 
*net,
goto restart;
}
 
+   trace_fib6_table_lookup(net, f6i, table, fl6);
+
/* Search through exception table */
	rt = rt6_find_cached_rt(f6i, &fl6->daddr, &fl6->saddr);
if (rt) {
@@ -1096,8 +1098,6 @@ static struct rt6_info *ip6_pol_route_lookup(struct net 
*net,
 
rcu_read_unlock();
 
-   trace_fib6_table_lookup(net, rt, table, fl6);
-
return rt;
 }
 
@@ -1827,6 +1827,8 @@ struct fib6_info *fib6_table_lookup(struct net *net, 
struct fib6_table *table,
}
}
 
+   trace_fib6_table_lookup(net, f6i, table, fl6);
+
return f6i;
 }
 
@@ -1853,7 +1855,6 @@ struct rt6_info *ip6_pol_route(struct net *net, struct 
fib6_table *table,
rt = net->ipv6.ip6_null_entry;
rcu_read_unlock();
	dst_hold(&rt->dst);
-   trace_fib6_table_lookup(net, rt, table, fl6);
return rt;
}
 
@@ -1864,7 +1865,6 @@ struct rt6_info *ip6_pol_route(struct net *net, struct 
fib6_table *table,
	dst_use_noref(&rt->dst, jiffies);
 
rcu_read_unlock();
-   trace_fib6_table_lookup(net, rt, table, fl6);
return rt;
} else if (unlikely((fl6->flowi6_flags & FLOWI_FLAG_KNOWN_NH) &&
!(f6i->fib6_flags & RTF_GATEWAY))) {
@@ -1890,9 +1890,7 @@ struct rt6_info *ip6_pol_route(struct net *net, struct 
fib6_table *table,
	dst_hold(&uncached_rt->dst);
}
 
-   trace_fib6_table_lookup(net, uncached_rt, table, fl6);
return uncached_rt;
-
} else {
/* Get a percpu copy */
 
@@ -1906,7 +1904,7 @@ struct rt6_info *ip6_pol_route(struct net *net, struct 
fib6_table *table,
 
local_bh_enable();
rcu_read_unlock();
-   trace_fib6_table_lookup(net, pcpu_rt, table, fl6);
+
return pcpu_rt;
}
 }
@@ -2486,7 +2484,7 @@ static struct rt6_info *__ip6_route_redirect(struct net 
*net,
 
rcu_read_unlock();
 
-   trace_fib6_table_lookup(net, ret, table, fl6);
+   trace_fib6_table_lookup(net, rt, table, fl6);
return ret;
 };
 
-- 
2.11.0



[bpf-next v1 4/9] net/ipv6: Refactor fib6_rule_action

2018-05-02 Thread David Ahern
Move source address lookup from fib6_rule_action to a helper. It will be
used in a later patch by a second variant for fib6_rule_action.

Signed-off-by: David Ahern 
---
 net/ipv6/fib6_rules.c | 52 ++-
 1 file changed, 31 insertions(+), 21 deletions(-)

diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c
index 6547fc6491a6..d040c4bff3a0 100644
--- a/net/ipv6/fib6_rules.c
+++ b/net/ipv6/fib6_rules.c
@@ -96,6 +96,31 @@ struct dst_entry *fib6_rule_lookup(struct net *net, struct 
flowi6 *fl6,
	return &net->ipv6.ip6_null_entry->dst;
 }
 
+static int fib6_rule_saddr(struct net *net, struct fib_rule *rule, int flags,
+  struct flowi6 *flp6, const struct net_device *dev)
+{
+   struct fib6_rule *r = (struct fib6_rule *)rule;
+
+   /* If we need to find a source address for this traffic,
+* we check the result if it meets requirement of the rule.
+*/
+   if ((rule->flags & FIB_RULE_FIND_SADDR) &&
+   r->src.plen && !(flags & RT6_LOOKUP_F_HAS_SADDR)) {
+   struct in6_addr saddr;
+
+   if (ipv6_dev_get_saddr(net, dev, &flp6->daddr,
+  rt6_flags2srcprefs(flags), &saddr))
+   return -EAGAIN;
+
+   if (!ipv6_prefix_equal(&saddr, &r->src.addr, r->src.plen))
+   return -EAGAIN;
+
+   flp6->saddr = saddr;
+   }
+
+   return 0;
+}
+
 static int fib6_rule_action(struct fib_rule *rule, struct flowi *flp,
int flags, struct fib_lookup_arg *arg)
 {
@@ -134,27 +159,12 @@ static int fib6_rule_action(struct fib_rule *rule, struct 
flowi *flp,
 
rt = lookup(net, table, flp6, arg->lookup_data, flags);
if (rt != net->ipv6.ip6_null_entry) {
-   struct fib6_rule *r = (struct fib6_rule *)rule;
-
-   /*
-* If we need to find a source address for this traffic,
-* we check the result if it meets requirement of the rule.
-*/
-   if ((rule->flags & FIB_RULE_FIND_SADDR) &&
-   r->src.plen && !(flags & RT6_LOOKUP_F_HAS_SADDR)) {
-   struct in6_addr saddr;
-
-   if (ipv6_dev_get_saddr(net,
-  ip6_dst_idev(&rt->dst)->dev,
-  &flp6->daddr,
-  rt6_flags2srcprefs(flags),
-  &saddr))
-   goto again;
-   if (!ipv6_prefix_equal(&saddr, &r->src.addr,
-  r->src.plen))
-   goto again;
-   flp6->saddr = saddr;
-   }
+   err = fib6_rule_saddr(net, rule, flags, flp6,
+ ip6_dst_idev(&rt->dst)->dev);
+
+   if (err == -EAGAIN)
+   goto again;
+
err = rt->dst.error;
if (err != -EAGAIN)
goto out;
-- 
2.11.0



[bpf-next v1 1/9] net/ipv6: Rename fib6_lookup to fib6_node_lookup

2018-05-02 Thread David Ahern
Rename fib6_lookup to fib6_node_lookup to better reflect what it
returns. The fib6_lookup name will be used in a later patch for
an IPv6 equivalent to IPv4's fib_lookup.

Signed-off-by: David Ahern 
---
 include/net/ip6_fib.h |  6 +++---
 net/ipv6/ip6_fib.c| 14 --
 net/ipv6/route.c  |  8 
 3 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 1af450d4e923..5a16630179cb 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -376,9 +376,9 @@ struct dst_entry *fib6_rule_lookup(struct net *net, struct 
flowi6 *fl6,
   const struct sk_buff *skb,
   int flags, pol_lookup_t lookup);
 
-struct fib6_node *fib6_lookup(struct fib6_node *root,
- const struct in6_addr *daddr,
- const struct in6_addr *saddr);
+struct fib6_node *fib6_node_lookup(struct fib6_node *root,
+  const struct in6_addr *daddr,
+  const struct in6_addr *saddr);
 
 struct fib6_node *fib6_locate(struct fib6_node *root,
  const struct in6_addr *daddr, int dst_len,
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 6421c893466e..4cfffa0f676e 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1354,8 +1354,8 @@ struct lookup_args {
	const struct in6_addr	*addr;	/* search key */
 };
 
-static struct fib6_node *fib6_lookup_1(struct fib6_node *root,
-  struct lookup_args *args)
+static struct fib6_node *fib6_node_lookup_1(struct fib6_node *root,
+   struct lookup_args *args)
 {
struct fib6_node *fn;
__be32 dir;
@@ -1400,7 +1400,8 @@ static struct fib6_node *fib6_lookup_1(struct fib6_node 
*root,
 #ifdef CONFIG_IPV6_SUBTREES
if (subtree) {
struct fib6_node *sfn;
-   sfn = fib6_lookup_1(subtree, args + 1);
+   sfn = fib6_node_lookup_1(subtree,
+args + 1);
if (!sfn)
goto backtrack;
fn = sfn;
@@ -1422,8 +1423,9 @@ static struct fib6_node *fib6_lookup_1(struct fib6_node 
*root,
 
 /* called with rcu_read_lock() held
  */
-struct fib6_node *fib6_lookup(struct fib6_node *root, const struct in6_addr 
*daddr,
- const struct in6_addr *saddr)
+struct fib6_node *fib6_node_lookup(struct fib6_node *root,
+  const struct in6_addr *daddr,
+  const struct in6_addr *saddr)
 {
struct fib6_node *fn;
struct lookup_args args[] = {
@@ -1442,7 +1444,7 @@ struct fib6_node *fib6_lookup(struct fib6_node *root, 
const struct in6_addr *dad
}
};
 
-   fn = fib6_lookup_1(root, daddr ? args : args + 1);
+   fn = fib6_node_lookup_1(root, daddr ? args : args + 1);
if (!fn || fn->fn_flags & RTN_TL_ROOT)
fn = root;
 
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 7ee0a34fba46..d903db30dfff 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1006,7 +1006,7 @@ static struct fib6_node* fib6_backtrack(struct fib6_node 
*fn,
pn = rcu_dereference(fn->parent);
sn = FIB6_SUBTREE(pn);
if (sn && sn != fn)
-   fn = fib6_lookup(sn, NULL, saddr);
+   fn = fib6_node_lookup(sn, NULL, saddr);
else
fn = pn;
if (fn->fn_flags & RTN_RTINFO)
@@ -1059,7 +1059,7 @@ static struct rt6_info *ip6_pol_route_lookup(struct net 
*net,
flags &= ~RT6_LOOKUP_F_IFACE;
 
rcu_read_lock();
-   fn = fib6_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
+   fn = fib6_node_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
 restart:
f6i = rcu_dereference(fn->leaf);
if (!f6i) {
@@ -1815,7 +1815,7 @@ struct rt6_info *ip6_pol_route(struct net *net, struct 
fib6_table *table,
 
rcu_read_lock();
 
-   fn = fib6_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
+   fn = fib6_node_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
saved_fn = fn;
 
if (fl6->flowi6_flags & FLOWI_FLAG_SKIP_NH_OIF)
@@ -2420,7 +2420,7 @@ static struct rt6_info *__ip6_route_redirect(struct net 
*net,
 */
 
rcu_read_lock();
-   fn = fib6_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
+   fn = fib6_node_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
 restart:
for_each_fib6_node_rt_rcu(fn) {
if (rt->fib6_nh.nh_flags & RTNH_F_DEAD)
-- 
2.11.0



[PATCH net] tcp: restore autocorking

2018-05-02 Thread Eric Dumazet
When adding rb-tree for TCP retransmit queue, we inadvertently broke
TCP autocorking.

tcp_should_autocork() should really check if the rtx queue is not empty.

Tested:

Before the fix :
$ nstat -n;./netperf -H 10.246.7.152 -Cc -- -m 500;nstat | grep AutoCork
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 
() port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

54     262144  500      10.00    2682.85     2.47     1.59     3.618   2.329
TcpExtTCPAutoCorking            33                 0.0

// Same test, but forcing TCP_NODELAY
$ nstat -n;./netperf -H 10.246.7.152 -Cc -- -D -m 500;nstat | grep AutoCork
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 
() port 0 AF_INET : nodelay
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

54     262144  500      10.00    1408.75     2.44     2.96     6.802   8.259
TcpExtTCPAutoCorking            1                  0.0

After the fix :
$ nstat -n;./netperf -H 10.246.7.152 -Cc -- -m 500;nstat | grep AutoCork
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 
() port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

54     262144  500      10.00    5472.46     2.45     1.43     1.761   1.027
TcpExtTCPAutoCorking            361293             0.0

// With TCP_NODELAY option
$ nstat -n;./netperf -H 10.246.7.152 -Cc -- -D -m 500;nstat | grep AutoCork
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 
() port 0 AF_INET : nodelay
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

54     262144  500      10.00    5454.96     2.46     1.63     1.775   1.174
TcpExtTCPAutoCorking            315448             0.0

Fixes: 75c119afe14f ("tcp: implement rb-tree based retransmit queue")
Signed-off-by: Eric Dumazet 
Reported-by: Michael Wenig 
Tested-by: Michael Wenig 
---
 net/ipv4/tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 
44be7f43455e4aefde8db61e2d941a69abcc642a..c9d00ef54deca15d5760bcbe154001a96fa1e2a7
 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -697,7 +697,7 @@ static bool tcp_should_autocork(struct sock *sk, struct 
sk_buff *skb,
 {
return skb->len < size_goal &&
   sock_net(sk)->ipv4.sysctl_tcp_autocorking &&
-  skb != tcp_write_queue_head(sk) &&
+  !tcp_rtx_queue_empty(sk) &&
   refcount_read(&sk->sk_wmem_alloc) > skb->truesize;
 }
 
-- 
2.17.0.441.gb46fe60e1d-goog



Re: [PATCH net] ipv4: fix fnhe usage by non-cached routes

2018-05-02 Thread David Miller
From: Julian Anastasov 
Date: Wed,  2 May 2018 09:41:19 +0300

> Allow some non-cached routes to use non-expired fnhe:
> 
> 1. ip_del_fnhe: moved above and now called by find_exception.
> The 4.5+ commit deed49df7390 expires fnhe only when caching
> routes. Change that to:
> 
> 1.1. use fnhe for non-cached local output routes, with the help
> from (2)
> 
> 1.2. allow __mkroute_input to detect expired fnhe (outdated
> fnhe_gw, for example) when do_cache is false, eg. when itag!=0
> for unicast destinations.
> 
> 2. __mkroute_output: keep fi to allow local routes with orig_oif != 0
> to use fnhe info even when the new route will not be cached into fnhe.
> After commit 839da4d98960 ("net: ipv4: set orig_oif based on fib
> result for local traffic") it means all local routes will be affected
> because they are not cached. This change is used to solve a PMTU
> problem with IPVS (and probably Netfilter DNAT) setups that redirect
> local clients from target local IP (local route to Virtual IP)
> to new remote IP target, eg. IPVS TUN real server. Loopback has
> 64K MTU and we need to create fnhe on the local route that will
> keep the reduced PMTU for the Virtual IP. Without this change
> fnhe_pmtu is updated from ICMP but never exposed to non-cached
> local routes. This includes routes with flowi4_oif!=0 for 4.6+ and
> with flowi4_oif=any for 4.14+).
> 
> 3. update_or_create_fnhe: make sure fnhe_expires is not 0 for
> new entries
> 
> Fixes: 839da4d98960 ("net: ipv4: set orig_oif based on fib result for local 
> traffic")
> Fixes: d6d5e999e5df ("route: do not cache fib route info on local routes with 
> oif")
> Fixes: deed49df7390 ("route: check and remove route cache when we get route")
> Cc: David Ahern 
> Cc: Xin Long 
> Signed-off-by: Julian Anastasov 

Applied and queued up for -stable, thanks Julian.


Re: pull-request: bpf 2018-05-03

2018-05-02 Thread David Miller
From: Daniel Borkmann 
Date: Thu,  3 May 2018 02:37:12 +0200

> The following pull-request contains BPF updates for your *net* tree.
> 
> The main changes are:
 ...
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Pulled, thanks Daniel.


Re: [RFC v3 4/5] virtio_ring: add event idx support in packed ring

2018-05-02 Thread Tiwei Bie
On Thu, May 03, 2018 at 04:44:39AM +0300, Michael S. Tsirkin wrote:
> On Thu, May 03, 2018 at 09:11:16AM +0800, Tiwei Bie wrote:
> > On Wed, May 02, 2018 at 06:42:57PM +0300, Michael S. Tsirkin wrote:
> > > On Wed, May 02, 2018 at 11:12:55PM +0800, Tiwei Bie wrote:
> > > > On Wed, May 02, 2018 at 04:51:01PM +0300, Michael S. Tsirkin wrote:
> > > > > On Wed, May 02, 2018 at 03:28:19PM +0800, Tiwei Bie wrote:
> > > > > > On Wed, May 02, 2018 at 10:51:06AM +0800, Jason Wang wrote:
> > > > > > > On 2018年04月25日 13:15, Tiwei Bie wrote:
> > > > > > > > This commit introduces the event idx support in packed
> > > > > > > > ring. This feature is temporarily disabled, because the
> > > > > > > > implementation in this patch may not work as expected,
> > > > > > > > and some further discussions on the implementation are
> > > > > > > > needed, e.g. do we have to check the wrap counter when
> > > > > > > > checking whether a kick is needed?
> > > > > > > > 
> > > > > > > > Signed-off-by: Tiwei Bie 
> > > > > > > > ---
> > > > > > > >   drivers/virtio/virtio_ring.c | 53 
> > > > > > > > 
> > > > > > > >   1 file changed, 49 insertions(+), 4 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > > > > b/drivers/virtio/virtio_ring.c
> > > > > > > > index 0181e93897be..b1039c2985b9 100644
> > > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > > @@ -986,7 +986,7 @@ static inline int 
> > > > > > > > virtqueue_add_packed(struct virtqueue *_vq,
> > > > > > > >   static bool virtqueue_kick_prepare_packed(struct virtqueue 
> > > > > > > > *_vq)
> > > > > > > >   {
> > > > > > > > struct vring_virtqueue *vq = to_vvq(_vq);
> > > > > > > > -   u16 flags;
> > > > > > > > +   u16 new, old, off_wrap, flags;
> > > > > > > > bool needs_kick;
> > > > > > > > u32 snapshot;
> > > > > > > > @@ -995,7 +995,12 @@ static bool 
> > > > > > > > virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> > > > > > > >  * suppressions. */
> > > > > > > > virtio_mb(vq->weak_barriers);
> > > > > > > > +   old = vq->next_avail_idx - vq->num_added;
> > > > > > > > +   new = vq->next_avail_idx;
> > > > > > > > +   vq->num_added = 0;
> > > > > > > > +
> > > > > > > > snapshot = *(u32 *)vq->vring_packed.device;
> > > > > > > > +   off_wrap = virtio16_to_cpu(_vq->vdev, snapshot & 
> > > > > > > > 0x);
> > > > > > > > flags = cpu_to_virtio16(_vq->vdev, snapshot >> 16) & 
> > > > > > > > 0x3;
> > > > > > > >   #ifdef DEBUG
> > > > > > > > @@ -1006,7 +1011,10 @@ static bool 
> > > > > > > > virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> > > > > > > > vq->last_add_time_valid = false;
> > > > > > > >   #endif
> > > > > > > > -   needs_kick = (flags != VRING_EVENT_F_DISABLE);
> > > > > > > > +   if (flags == VRING_EVENT_F_DESC)
> > > > > > > > +   needs_kick = vring_need_event(off_wrap & 
> > > > > > > > ~(1<<15), new, old);
> > > > > > > 
> > > > > > > I wonder whether or not the math is correct. Both new and event 
> > > > > > > are in the
> > > > > > > unit of descriptor ring size, but old looks not.
> > > > > > 
> > > > > > What vring_need_event() cares is the distance between
> > > > > > `new` and `old`, i.e. vq->num_added. So I think there
> > > > > > is nothing wrong with `old`. But the calculation of the
> > > > > > distance between `new` and `event_idx` isn't right when
> > > > > > `new` wraps. How do you think about the below code:
> > > > > > 
> > > > > > wrap_counter = off_wrap >> 15;
> > > > > > event_idx = off_wrap & ~(1<<15);
> > > > > > if (wrap_counter != vq->wrap_counter)
> > > > > > event_idx -= vq->vring_packed.num;
> > > > > > 
> > > > > > needs_kick = vring_need_event(event_idx, new, old);
> > > > > 
> > > > > I suspect this hack won't work for non power of 2 ring.
> > > > 
> > > > Above code doesn't require the ring size to be a power of 2.
> > > > 
> > > > For (__u16)(new_idx - old), what we want to get is vq->num_added.
> > > > 
> > > > old = vq->next_avail_idx - vq->num_added;
> > > > new = vq->next_avail_idx;
> > > > 
> > > > When vq->next_avail_idx >= vq->num_added, it's obvious that,
> > > > (__u16)(new_idx - old) is vq->num_added.
> > > > 
> > > > And when vq->next_avail_idx < vq->num_added, new will be smaller
> > > > than old (old will be a big unsigned number), but (__u16)(new_idx
> > > > - old) is still vq->num_added.
> > > > 
> > > > For (__u16)(new_idx - event_idx - 1), when new wraps and event_idx
> > > > doesn't wrap, the most straightforward way to calculate it is:
> > > > (new + vq->vring_packed.num) - event_idx - 1.
> > > 
> > > So how about we use the straightforward way then?
> > 
> > You mean we do new += vq->vring_packed.num instead
> > of event_idx -= vq->vring_packed.num before calling
> > 

Re: [PATCH] sctp: fix a potential missing-check bug

2018-05-02 Thread Marcelo Ricardo Leitner
On Wed, May 02, 2018 at 08:27:05PM -0500, Wenwen Wang wrote:
> On Wed, May 2, 2018 at 8:24 PM, Marcelo Ricardo Leitner
>  wrote:
> > On Wed, May 02, 2018 at 08:15:45PM -0500, Wenwen Wang wrote:
> >> In sctp_setsockopt_maxseg(), the integer 'val' is compared against min_len
> >> and max_len to check whether it is in the appropriate range. If it is not,
> >> an error code -EINVAL will be returned. This is enforced by a security
> >> check. But, this check is only executed when 'val' is not 0. In fact, if
> >> 'val' is 0, it will be assigned with a new value (if the return value of
> >> the function sctp_id2assoc() is not 0) in the following execution. However,
> >> this new value of 'val' is not checked before it is used to assigned to
> >> asoc->user_frag. That means it is possible that the new value of 'val'
> >> could be out of the expected range. This can cause security issues
> >> such as buffer overflows, e.g., the new value of 'val' is used as an index
> >> to access a buffer.
> >>
> >> This patch inserts a check for the new value of 'val' to see if it is in
> >> the expected range. If it is not, an error code -EINVAL will be returned.
> >>
> >> Signed-off-by: Wenwen Wang 
> >> ---
> >>  net/sctp/socket.c | 22 +++---
> >>  1 file changed, 11 insertions(+), 11 deletions(-)
> >
> > ?
> > This patch is the same as previous one. git send-email 
> > maybe?
> >
> >   Marcelo
>
> Thanks for your suggestion, Marcelo. I can send the old file. But, I
> have added a line of comment in this patch.

I meant if you had sent the old patch again by accident, because you
said you worked on an old version of the tree, but then posted a patch
that also doesn't use the new MTU function I mentioned.

  Marcelo


Re: [RFC v3 4/5] virtio_ring: add event idx support in packed ring

2018-05-02 Thread Michael S. Tsirkin
On Thu, May 03, 2018 at 09:11:16AM +0800, Tiwei Bie wrote:
> On Wed, May 02, 2018 at 06:42:57PM +0300, Michael S. Tsirkin wrote:
> > On Wed, May 02, 2018 at 11:12:55PM +0800, Tiwei Bie wrote:
> > > On Wed, May 02, 2018 at 04:51:01PM +0300, Michael S. Tsirkin wrote:
> > > > On Wed, May 02, 2018 at 03:28:19PM +0800, Tiwei Bie wrote:
> > > > > On Wed, May 02, 2018 at 10:51:06AM +0800, Jason Wang wrote:
> > > > > > On 2018年04月25日 13:15, Tiwei Bie wrote:
> > > > > > > This commit introduces the event idx support in packed
> > > > > > > ring. This feature is temporarily disabled, because the
> > > > > > > implementation in this patch may not work as expected,
> > > > > > > and some further discussions on the implementation are
> > > > > > > needed, e.g. do we have to check the wrap counter when
> > > > > > > checking whether a kick is needed?
> > > > > > > 
> > > > > > > Signed-off-by: Tiwei Bie 
> > > > > > > ---
> > > > > > >   drivers/virtio/virtio_ring.c | 53 
> > > > > > > 
> > > > > > >   1 file changed, 49 insertions(+), 4 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > > > b/drivers/virtio/virtio_ring.c
> > > > > > > index 0181e93897be..b1039c2985b9 100644
> > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > @@ -986,7 +986,7 @@ static inline int virtqueue_add_packed(struct 
> > > > > > > virtqueue *_vq,
> > > > > > >   static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> > > > > > >   {
> > > > > > >   struct vring_virtqueue *vq = to_vvq(_vq);
> > > > > > > - u16 flags;
> > > > > > > + u16 new, old, off_wrap, flags;
> > > > > > >   bool needs_kick;
> > > > > > >   u32 snapshot;
> > > > > > > @@ -995,7 +995,12 @@ static bool 
> > > > > > > virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> > > > > > >* suppressions. */
> > > > > > >   virtio_mb(vq->weak_barriers);
> > > > > > > + old = vq->next_avail_idx - vq->num_added;
> > > > > > > + new = vq->next_avail_idx;
> > > > > > > + vq->num_added = 0;
> > > > > > > +
> > > > > > >   snapshot = *(u32 *)vq->vring_packed.device;
> > > > > > > + off_wrap = virtio16_to_cpu(_vq->vdev, snapshot & 0x);
> > > > > > >   flags = cpu_to_virtio16(_vq->vdev, snapshot >> 16) & 
> > > > > > > 0x3;
> > > > > > >   #ifdef DEBUG
> > > > > > > @@ -1006,7 +1011,10 @@ static bool 
> > > > > > > virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> > > > > > >   vq->last_add_time_valid = false;
> > > > > > >   #endif
> > > > > > > - needs_kick = (flags != VRING_EVENT_F_DISABLE);
> > > > > > > + if (flags == VRING_EVENT_F_DESC)
> > > > > > > + needs_kick = vring_need_event(off_wrap & ~(1<<15), new, 
> > > > > > > old);
> > > > > > 
> > > > > > I wonder whether or not the math is correct. Both new and event are 
> > > > > > in the
> > > > > > unit of descriptor ring size, but old looks not.
> > > > > 
> > > > > What vring_need_event() cares is the distance between
> > > > > `new` and `old`, i.e. vq->num_added. So I think there
> > > > > is nothing wrong with `old`. But the calculation of the
> > > > > distance between `new` and `event_idx` isn't right when
> > > > > `new` wraps. How do you think about the below code:
> > > > > 
> > > > >   wrap_counter = off_wrap >> 15;
> > > > >   event_idx = off_wrap & ~(1<<15);
> > > > >   if (wrap_counter != vq->wrap_counter)
> > > > >   event_idx -= vq->vring_packed.num;
> > > > >   
> > > > >   needs_kick = vring_need_event(event_idx, new, old);
> > > > 
> > > > I suspect this hack won't work for non power of 2 ring.
> > > 
> > > Above code doesn't require the ring size to be a power of 2.
> > > 
> > > For (__u16)(new_idx - old), what we want to get is vq->num_added.
> > > 
> > > old = vq->next_avail_idx - vq->num_added;
> > > new = vq->next_avail_idx;
> > > 
> > > When vq->next_avail_idx >= vq->num_added, it's obvious that,
> > > (__u16)(new_idx - old) is vq->num_added.
> > > 
> > > And when vq->next_avail_idx < vq->num_added, new will be smaller
> > > than old (old will be a big unsigned number), but (__u16)(new_idx
> > > - old) is still vq->num_added.
> > > 
> > > For (__u16)(new_idx - event_idx - 1), when new wraps and event_idx
> > > doesn't wrap, the most straightforward way to calculate it is:
> > > (new + vq->vring_packed.num) - event_idx - 1.
> > 
> > So how about we use the straightforward way then?
> 
> You mean we do new += vq->vring_packed.num instead
> of event_idx -= vq->vring_packed.num before calling
> vring_need_event()?
> 
> The problem is that, the second param (new_idx) of
> vring_need_event() will be used for:
> 
> (__u16)(new_idx - event_idx - 1)
> (__u16)(new_idx - old)
> 
> So if we change new, we will need to change old too.

I think that since we have a branch there anyway,
we are better off just special-casing if 

Re: [PATCH bpf-next 07/12] bpf, sparc64: remove ld_abs/ld_ind

2018-05-02 Thread David Miller
From: Daniel Borkmann 
Date: Thu,  3 May 2018 03:05:31 +0200

> Since LD_ABS/LD_IND instructions are now removed from the core and
> reimplemented through a combination of inlined BPF instructions and
> a slow-path helper, we can get rid of the complexity from sparc64 JIT.
> 
> Signed-off-by: Daniel Borkmann 
> Cc: David S. Miller 
> Acked-by: Alexei Starovoitov 

Acked-by: David S. Miller 


[PATCH net-next] ip6_gre: correct the function name in ip6gre_tnl_addr_conflict() comment

2018-05-02 Thread Sun Lianwen
The function name in the ip6gre_tnl_addr_conflict() comment is wrong: it
uses ip6_tnl_addr_conflict instead of ip6gre_tnl_addr_conflict.

Signed-off-by: Sun Lianwen 
---
 net/ipv6/ip6_gre.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 69727bc168cb..4e111da8d453 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -807,7 +807,7 @@ static inline int ip6gre_xmit_ipv6(struct sk_buff *skb, 
struct net_device *dev)
 }
 
 /**
- * ip6_tnl_addr_conflict - compare packet addresses to tunnel's own
+ * ip6gre_tnl_addr_conflict - compare packet addresses to tunnel's own
  *   @t: the outgoing tunnel device
  *   @hdr: IPv6 header from the incoming packet
  *
-- 
2.17.0





Re: [PATCH] sctp: fix a potential missing-check bug

2018-05-02 Thread Wenwen Wang
On Wed, May 2, 2018 at 8:24 PM, Marcelo Ricardo Leitner
 wrote:
> On Wed, May 02, 2018 at 08:15:45PM -0500, Wenwen Wang wrote:
>> In sctp_setsockopt_maxseg(), the integer 'val' is compared against min_len
>> and max_len to check whether it is in the appropriate range. If it is not,
>> an error code -EINVAL will be returned. This is enforced by a security
>> check. But, this check is only executed when 'val' is not 0. In fact, if
>> 'val' is 0, it will be assigned with a new value (if the return value of
>> the function sctp_id2assoc() is not 0) in the following execution. However,
>> this new value of 'val' is not checked before it is used to assigned to
>> asoc->user_frag. That means it is possible that the new value of 'val'
>> could be out of the expected range. This can cause security issues
>> such as buffer overflows, e.g., the new value of 'val' is used as an index
>> to access a buffer.
>>
>> This patch inserts a check for the new value of 'val' to see if it is in
>> the expected range. If it is not, an error code -EINVAL will be returned.
>>
>> Signed-off-by: Wenwen Wang 
>> ---
>>  net/sctp/socket.c | 22 +++---
>>  1 file changed, 11 insertions(+), 11 deletions(-)
>
> ?
> This patch is the same as previous one. git send-email 
> maybe?
>
>   Marcelo

Thanks for your suggestion, Marcelo. I can send the old file. But, I
have added a line of comment in this patch.

Wenwen


Re: [PATCH] sctp: fix a potential missing-check bug

2018-05-02 Thread Marcelo Ricardo Leitner
On Wed, May 02, 2018 at 08:15:45PM -0500, Wenwen Wang wrote:
> In sctp_setsockopt_maxseg(), the integer 'val' is compared against min_len
> and max_len to check whether it is in the appropriate range. If it is not,
> an error code -EINVAL will be returned. This is enforced by a security
> check. But, this check is only executed when 'val' is not 0. In fact, if
> 'val' is 0, it will be assigned with a new value (if the return value of
> the function sctp_id2assoc() is not 0) in the following execution. However,
> this new value of 'val' is not checked before it is used to assigned to
> asoc->user_frag. That means it is possible that the new value of 'val'
> could be out of the expected range. This can cause security issues
> such as buffer overflows, e.g., the new value of 'val' is used as an index
> to access a buffer.
>
> This patch inserts a check for the new value of 'val' to see if it is in
> the expected range. If it is not, an error code -EINVAL will be returned.
>
> Signed-off-by: Wenwen Wang 
> ---
>  net/sctp/socket.c | 22 +++---
>  1 file changed, 11 insertions(+), 11 deletions(-)

?
This patch is the same as previous one. git send-email 
maybe?

  Marcelo


Re: [PATCH] NET/netlink: optimize output of seq_puts in af_netlink.c

2018-05-02 Thread YU Bo

Hi,
On Wed, May 02, 2018 at 10:19:43AM -0400, David Miller wrote:

From: Bo YU 
Date: Wed, 2 May 2018 05:54:24 -0400


Optimization of command output: `cat /proc/net/netlink`

After the patch, we will get:

https://clbin.com/lnu4L

Signed-off-by: Bo YU 
---
 net/netlink/af_netlink.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 55342c4d5cec..2e2dd88fc79f 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2606,13 +2606,13 @@ static int netlink_seq_show(struct seq_file
*seq, void *v)
 {
if (v == SEQ_START_TOKEN) {
seq_puts(seq,
-"sk   Eth PidGroups   "
-"Rmem Wmem Dump Locks Drops 
Inode\n");
+"sk   Eth PidGroups   "
+ "Rmem Wmem Dump Locks Drops Inode\n");


Please do not break the indentation of the code like this.

Sorry, I am ashamed of that. Something went wrong in the vim version I
was using, and checkpatch only told me
"WARNING: quoted string split across lines"

Thank you, i will fix it.



I wish to unfortunately say, that generally speaking, your patch
submissions are not of the best quality, and take up a lot of reviewer
time and resources as a result.

If you do not improve the quality of your submissions, I am giving
you a kind warning that the amount of care and review your patches
will receive will become lower.  Your submissions might even get to
the point where they are effectively ignored.

So please put more care into your work.

Thank you.


[PATCH] sctp: fix a potential missing-check bug

2018-05-02 Thread Wenwen Wang
In sctp_setsockopt_maxseg(), the integer 'val' is compared against min_len
and max_len to check whether it is in the appropriate range. If it is not,
an error code -EINVAL will be returned. This is enforced by a security
check. But, this check is only executed when 'val' is not 0. In fact, if
'val' is 0, it will be assigned a new value (if the return value of
the function sctp_id2assoc() is not 0) in the following execution. However,
this new value of 'val' is not checked before it is assigned to
asoc->user_frag. That means it is possible that the new value of 'val'
could be out of the expected range. This can cause security issues
such as buffer overflows, e.g., the new value of 'val' is used as an index
to access a buffer.

This patch inserts a check for the new value of 'val' to see if it is in
the expected range. If it is not, an error code -EINVAL will be returned.

Signed-off-by: Wenwen Wang 
---
 net/sctp/socket.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 80835ac..03e1cc3 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -3212,6 +3212,7 @@ static int sctp_setsockopt_maxseg(struct sock *sk, char 
__user *optval, unsigned
struct sctp_af *af = sp->pf->af;
struct sctp_assoc_value params;
struct sctp_association *asoc;
+   int min_len, max_len;
int val;
 
if (optlen == sizeof(int)) {
@@ -3231,19 +3232,15 @@ static int sctp_setsockopt_maxseg(struct sock *sk, char 
__user *optval, unsigned
return -EINVAL;
}
 
-   if (val) {
-   int min_len, max_len;
+   min_len = SCTP_DEFAULT_MINSEGMENT - af->net_header_len;
+   min_len -= af->ip_options_len(sk);
+   min_len -= sizeof(struct sctphdr) +
+  sizeof(struct sctp_data_chunk);
 
-   min_len = SCTP_DEFAULT_MINSEGMENT - af->net_header_len;
-   min_len -= af->ip_options_len(sk);
-   min_len -= sizeof(struct sctphdr) +
-  sizeof(struct sctp_data_chunk);
+   max_len = SCTP_MAX_CHUNK_LEN - sizeof(struct sctp_data_chunk);
 
-   max_len = SCTP_MAX_CHUNK_LEN - sizeof(struct sctp_data_chunk);
-
-   if (val < min_len || val > max_len)
-   return -EINVAL;
-   }
+   if (val && (val < min_len || val > max_len))
+   return -EINVAL;
 
asoc = sctp_id2assoc(sk, params.assoc_id);
if (asoc) {
@@ -3253,6 +3250,9 @@ static int sctp_setsockopt_maxseg(struct sock *sk, char 
__user *optval, unsigned
val -= sizeof(struct sctphdr) +
	   sctp_datachk_len(&asoc->stream);
}
+   /* Check the new val to make sure it is in the range. */
+   if (val < min_len || val > max_len)
+   return -EINVAL;
asoc->user_frag = val;
asoc->frag_point = sctp_frag_point(asoc, asoc->pathmtu);
} else {
-- 
2.7.4



Re: [RFC v3 4/5] virtio_ring: add event idx support in packed ring

2018-05-02 Thread Tiwei Bie
On Wed, May 02, 2018 at 06:42:57PM +0300, Michael S. Tsirkin wrote:
> On Wed, May 02, 2018 at 11:12:55PM +0800, Tiwei Bie wrote:
> > On Wed, May 02, 2018 at 04:51:01PM +0300, Michael S. Tsirkin wrote:
> > > On Wed, May 02, 2018 at 03:28:19PM +0800, Tiwei Bie wrote:
> > > > On Wed, May 02, 2018 at 10:51:06AM +0800, Jason Wang wrote:
> > > > > On 2018年04月25日 13:15, Tiwei Bie wrote:
> > > > > > This commit introduces the event idx support in packed
> > > > > > ring. This feature is temporarily disabled, because the
> > > > > > implementation in this patch may not work as expected,
> > > > > > and some further discussions on the implementation are
> > > > > > needed, e.g. do we have to check the wrap counter when
> > > > > > checking whether a kick is needed?
> > > > > > 
> > > > > > Signed-off-by: Tiwei Bie 
> > > > > > ---
> > > > > >   drivers/virtio/virtio_ring.c | 53 
> > > > > > 
> > > > > >   1 file changed, 49 insertions(+), 4 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > > b/drivers/virtio/virtio_ring.c
> > > > > > index 0181e93897be..b1039c2985b9 100644
> > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > @@ -986,7 +986,7 @@ static inline int virtqueue_add_packed(struct 
> > > > > > virtqueue *_vq,
> > > > > >   static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> > > > > >   {
> > > > > > struct vring_virtqueue *vq = to_vvq(_vq);
> > > > > > -   u16 flags;
> > > > > > +   u16 new, old, off_wrap, flags;
> > > > > > bool needs_kick;
> > > > > > u32 snapshot;
> > > > > > @@ -995,7 +995,12 @@ static bool 
> > > > > > virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> > > > > >  * suppressions. */
> > > > > > virtio_mb(vq->weak_barriers);
> > > > > > +   old = vq->next_avail_idx - vq->num_added;
> > > > > > +   new = vq->next_avail_idx;
> > > > > > +   vq->num_added = 0;
> > > > > > +
> > > > > > snapshot = *(u32 *)vq->vring_packed.device;
> > > > > > +   off_wrap = virtio16_to_cpu(_vq->vdev, snapshot & 0x);
> > > > > > flags = cpu_to_virtio16(_vq->vdev, snapshot >> 16) & 0x3;
> > > > > >   #ifdef DEBUG
> > > > > > @@ -1006,7 +1011,10 @@ static bool 
> > > > > > virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> > > > > > vq->last_add_time_valid = false;
> > > > > >   #endif
> > > > > > -   needs_kick = (flags != VRING_EVENT_F_DISABLE);
> > > > > > +   if (flags == VRING_EVENT_F_DESC)
> > > > > > +   needs_kick = vring_need_event(off_wrap & ~(1<<15), new, 
> > > > > > old);
> > > > > 
> > > > > I wonder whether or not the math is correct. Both new and event are 
> > > > > in the
> > > > > unit of descriptor ring size, but old looks not.
> > > > 
> > > > What vring_need_event() cares is the distance between
> > > > `new` and `old`, i.e. vq->num_added. So I think there
> > > > is nothing wrong with `old`. But the calculation of the
> > > > distance between `new` and `event_idx` isn't right when
> > > > `new` wraps. How do you think about the below code:
> > > > 
> > > > wrap_counter = off_wrap >> 15;
> > > > event_idx = off_wrap & ~(1<<15);
> > > > if (wrap_counter != vq->wrap_counter)
> > > > event_idx -= vq->vring_packed.num;
> > > > 
> > > > needs_kick = vring_need_event(event_idx, new, old);
> > > 
> > > I suspect this hack won't work for non power of 2 ring.
> > 
> > Above code doesn't require the ring size to be a power of 2.
> > 
> > For (__u16)(new_idx - old), what we want to get is vq->num_added.
> > 
> > old = vq->next_avail_idx - vq->num_added;
> > new = vq->next_avail_idx;
> > 
> > When vq->next_avail_idx >= vq->num_added, it's obvious that,
> > (__u16)(new_idx - old) is vq->num_added.
> > 
> > And when vq->next_avail_idx < vq->num_added, new will be smaller
> > than old (old will be a big unsigned number), but (__u16)(new_idx
> > - old) is still vq->num_added.
> > 
> > For (__u16)(new_idx - event_idx - 1), when new wraps and event_idx
> > doesn't wrap, the most straightforward way to calculate it is:
> > (new + vq->vring_packed.num) - event_idx - 1.
> 
> So how about we use the straightforward way then?

You mean we do new += vq->vring_packed.num instead
of event_idx -= vq->vring_packed.num before calling
vring_need_event()?

The problem is that, the second param (new_idx) of
vring_need_event() will be used for:

(__u16)(new_idx - event_idx - 1)
(__u16)(new_idx - old)

So if we change new, we will need to change old too.
And that would be an ugly hack..
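
For reference, a small standalone sketch of the 16-bit index arithmetic
being discussed (ring size and index values below are made up for
illustration; nothing here is taken from the patch itself):

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint16_t num = 256;        /* descriptor ring size */
		uint16_t new = 2;          /* avail index, has wrapped */
		uint16_t event_idx = 250;  /* has not wrapped yet */

		/* straightforward distance: (new + num) - event_idx - 1 */
		uint16_t direct = (uint16_t)(new + num - event_idx - 1);

		/* proposed adjustment: event_idx -= num, then reuse the
		 * usual (__u16)(new - event_idx - 1) expression
		 */
		uint16_t adjusted = (uint16_t)(event_idx - num);
		uint16_t wrapped = (uint16_t)(new - adjusted - 1);

		printf("%d %d\n", direct, wrapped);  /* both print 7 */
		return 0;
	}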

Best regards,
Tiwei Bie

> 
> > But we can also calculate it in this way:
> > 
> > event_idx -= vq->vring_packed.num;
> > (event_idx will be a big unsigned number)
> > 
> > Then (__u16)(new_idx - event_idx - 1) will be the value we want.
> > 
> > Best regards,
> > Tiwei Bie
> 
> 
> > > 
> > > 
> > > > Best regards,
> > > > Tiwei Bie
> > > > 
> > > > 
> > > > > 
> 

Re: [PATCH] sctp: fix a potential missing-check bug

2018-05-02 Thread Wenwen Wang
Hi Marcelo,

I guess I worked on an old version of the kernel. I will re-submit the
patch. Sorry :(

Wenwen

On Wed, May 2, 2018 at 6:23 PM, Marcelo Ricardo Leitner
 wrote:
> Hi Wenwen,
>
> On Wed, May 02, 2018 at 05:12:45PM -0500, Wenwen Wang wrote:
>> In sctp_setsockopt_maxseg(), the integer 'val' is compared against min_len
>> and max_len to check whether it is in the appropriate range. If it is not,
>> an error code -EINVAL will be returned. This is enforced by a security
>> check. But, this check is only executed when 'val' is not 0. In fact, if
>
> Which makes sense, no? Especially considering that 0 should be an
> allowed value, as it turns off the user limit.
>
>> 'val' is 0, it will be assigned a new value (if the return value of
>> the function sctp_id2assoc() is not 0) in the following execution. However,
>> this new value of 'val' is not checked before it is used to assign to
>
> Which 'new value'? val is not set to something new during the
> function. It always contains the user supplied value.
>
>> asoc->user_frag. That means it is possible that the new value of 'val'
>> could be out of the expected range. This can cause security issues
>> such as buffer overflows, e.g., the new value of 'val' is used as an index
>> to access a buffer.
>>
>> This patch inserts a check for the new value of 'val' to see if it is in
>> the expected range. If it is not, an error code -EINVAL will be returned.
>>
>> Signed-off-by: Wenwen Wang 
>> ---
>>  net/sctp/socket.c | 21 ++---
>>  1 file changed, 10 insertions(+), 11 deletions(-)
>>
>> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
>> index 80835ac..2beb601 100644
>> --- a/net/sctp/socket.c
>> +++ b/net/sctp/socket.c
>> @@ -3212,6 +3212,7 @@ static int sctp_setsockopt_maxseg(struct sock *sk, 
>> char __user *optval, unsigned
>>   struct sctp_af *af = sp->pf->af;
>>   struct sctp_assoc_value params;
>>   struct sctp_association *asoc;
>> + int min_len, max_len;
>>   int val;
>>
>>   if (optlen == sizeof(int)) {
>> @@ -3231,19 +3232,15 @@ static int sctp_setsockopt_maxseg(struct sock *sk, 
>> char __user *optval, unsigned
>>   return -EINVAL;
>>   }
>>
>> - if (val) {
>> - int min_len, max_len;
>> + min_len = SCTP_DEFAULT_MINSEGMENT - af->net_header_len;
>> + min_len -= af->ip_options_len(sk);
>> + min_len -= sizeof(struct sctphdr) +
>> +sizeof(struct sctp_data_chunk);
>
> Which tree did you base your patch on? Your patch lacks a tag so it
> defaults to net-next, and I reworked this section on current net-next
> and these MTU calculations are now handled by sctp_mtu_payload().
>
> But even for the net tree, I don't understand which issue you're fixing
> here. Actually it seems to me that both versions of the code do the same
> thing.
>
>>
>> - min_len = SCTP_DEFAULT_MINSEGMENT - af->net_header_len;
>> - min_len -= af->ip_options_len(sk);
>> - min_len -= sizeof(struct sctphdr) +
>> -sizeof(struct sctp_data_chunk);
>> + max_len = SCTP_MAX_CHUNK_LEN - sizeof(struct sctp_data_chunk);
>>
>> - max_len = SCTP_MAX_CHUNK_LEN - sizeof(struct sctp_data_chunk);
>> -
>> - if (val < min_len || val > max_len)
>> - return -EINVAL;
>> - }
>> + if (val && (val < min_len || val > max_len))
>> + return -EINVAL;
>>
>>   asoc = sctp_id2assoc(sk, params.assoc_id);
>>   if (asoc) {
>> @@ -3253,6 +3250,8 @@ static int sctp_setsockopt_maxseg(struct sock *sk, 
>> char __user *optval, unsigned
>>   val -= sizeof(struct sctphdr) +
>>  sctp_datachk_len(>stream);
>>   }
>> + if (val < min_len || val > max_len)
>> + return -EINVAL;
>>   asoc->user_frag = val;
>>   asoc->frag_point = sctp_frag_point(asoc, asoc->pathmtu);
>>   } else {
>> --
>> 2.7.4
>>


[PATCH bpf-next 05/12] bpf, x64: remove ld_abs/ld_ind

2018-05-02 Thread Daniel Borkmann
Since LD_ABS/LD_IND instructions are now removed from the core and
reimplemented through a combination of inlined BPF instructions and
a slow-path helper, we can get rid of the complexity from x64 JIT.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 arch/x86/net/Makefile   |   4 +-
 arch/x86/net/bpf_jit.S  | 154 
 arch/x86/net/bpf_jit_comp.c | 144 ++---
 3 files changed, 5 insertions(+), 297 deletions(-)
 delete mode 100644 arch/x86/net/bpf_jit.S

diff --git a/arch/x86/net/Makefile b/arch/x86/net/Makefile
index fefb4b6..20277db 100644
--- a/arch/x86/net/Makefile
+++ b/arch/x86/net/Makefile
@@ -1,6 +1,4 @@
 #
 # Arch-specific network modules
 #
-OBJECT_FILES_NON_STANDARD_bpf_jit.o += y
-
-obj-$(CONFIG_BPF_JIT) += bpf_jit.o bpf_jit_comp.o
+obj-$(CONFIG_BPF_JIT) += bpf_jit_comp.o
diff --git a/arch/x86/net/bpf_jit.S b/arch/x86/net/bpf_jit.S
deleted file mode 100644
index b33093f..000
--- a/arch/x86/net/bpf_jit.S
+++ /dev/null
@@ -1,154 +0,0 @@
-/* bpf_jit.S : BPF JIT helper functions
- *
- * Copyright (C) 2011 Eric Dumazet (eric.duma...@gmail.com)
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; version 2
- * of the License.
- */
-#include 
-#include 
-
-/*
- * Calling convention :
- * rbx : skb pointer (callee saved)
- * esi : offset of byte(s) to fetch in skb (can be scratched)
- * r10 : copy of skb->data
- * r9d : hlen = skb->len - skb->data_len
- */
-#define SKBDATA%r10
-#define SKF_MAX_NEG_OFF$(-0x20) /* SKF_LL_OFF from filter.h */
-
-#define FUNC(name) \
-   .globl name; \
-   .type name, @function; \
-   name:
-
-FUNC(sk_load_word)
-   test%esi,%esi
-   js  bpf_slow_path_word_neg
-
-FUNC(sk_load_word_positive_offset)
-   mov %r9d,%eax   # hlen
-   sub %esi,%eax   # hlen - offset
-   cmp $3,%eax
-   jle bpf_slow_path_word
-   mov (SKBDATA,%rsi),%eax
-   bswap   %eax/* ntohl() */
-   ret
-
-FUNC(sk_load_half)
-   test%esi,%esi
-   js  bpf_slow_path_half_neg
-
-FUNC(sk_load_half_positive_offset)
-   mov %r9d,%eax
-   sub %esi,%eax   #   hlen - offset
-   cmp $1,%eax
-   jle bpf_slow_path_half
-   movzwl  (SKBDATA,%rsi),%eax
-   rol $8,%ax  # ntohs()
-   ret
-
-FUNC(sk_load_byte)
-   test%esi,%esi
-   js  bpf_slow_path_byte_neg
-
-FUNC(sk_load_byte_positive_offset)
-   cmp %esi,%r9d   /* if (offset >= hlen) goto bpf_slow_path_byte */
-   jle bpf_slow_path_byte
-   movzbl  (SKBDATA,%rsi),%eax
-   ret
-
-/* rsi contains offset and can be scratched */
-#define bpf_slow_path_common(LEN)  \
-   lea 32(%rbp), %rdx;\
-   FRAME_BEGIN;\
-   mov %rbx, %rdi; /* arg1 == skb */   \
-   push%r9;\
-   pushSKBDATA;\
-/* rsi already has offset */   \
-   mov $LEN,%ecx;  /* len */   \
-   callskb_copy_bits;  \
-   test%eax,%eax;  \
-   pop SKBDATA;\
-   pop %r9;\
-   FRAME_END
-
-
-bpf_slow_path_word:
-   bpf_slow_path_common(4)
-   js  bpf_error
-   mov 32(%rbp),%eax
-   bswap   %eax
-   ret
-
-bpf_slow_path_half:
-   bpf_slow_path_common(2)
-   js  bpf_error
-   mov 32(%rbp),%ax
-   rol $8,%ax
-   movzwl  %ax,%eax
-   ret
-
-bpf_slow_path_byte:
-   bpf_slow_path_common(1)
-   js  bpf_error
-   movzbl  32(%rbp),%eax
-   ret
-
-#define sk_negative_common(SIZE)   \
-   FRAME_BEGIN;\
-   mov %rbx, %rdi; /* arg1 == skb */   \
-   push%r9;\
-   pushSKBDATA;\
-/* rsi already has offset */   \
-   mov $SIZE,%edx; /* size */  \
-   callbpf_internal_load_pointer_neg_helper;   \
-   test%rax,%rax;  \
-   pop SKBDATA;\
-   pop %r9;\
-   FRAME_END;  \
-   jz  bpf_error
-
-bpf_slow_path_word_neg:
-   cmp SKF_MAX_NEG_OFF, %esi   /* test range */
-   jl  bpf_error   /* offset lower -> error  */
-
-FUNC(sk_load_word_negative_offset)
-   

[PATCH bpf-next 00/12] Move ld_abs/ld_ind to native BPF

2018-05-02 Thread Daniel Borkmann
This set simplifies BPF JITs significantly by moving ld_abs/ld_ind
to native BPF, for details see individual patches. Main rationale
is in patch 'implement ld_abs/ld_ind in native bpf'. Thanks!

Daniel Borkmann (12):
  bpf: prefix cbpf internal helpers with bpf_
  bpf: migrate ebpf ld_abs/ld_ind tests to test_verifier
  bpf: implement ld_abs/ld_ind in native bpf
  bpf: add skb_load_bytes_relative helper
  bpf, x64: remove ld_abs/ld_ind
  bpf, arm64: remove ld_abs/ld_ind
  bpf, sparc64: remove ld_abs/ld_ind
  bpf, arm32: remove ld_abs/ld_ind
  bpf, mips64: remove ld_abs/ld_ind
  bpf, ppc64: remove ld_abs/ld_ind
  bpf, s390x: remove ld_abs/ld_ind
  bpf: sync tools bpf.h uapi header

 arch/arm/net/bpf_jit_32.c   |  77 
 arch/arm64/net/bpf_jit_comp.c   |  65 --
 arch/mips/net/ebpf_jit.c| 104 --
 arch/powerpc/net/Makefile   |   2 +-
 arch/powerpc/net/bpf_jit64.h|  37 +---
 arch/powerpc/net/bpf_jit_asm64.S| 180 -
 arch/powerpc/net/bpf_jit_comp64.c   | 109 +-
 arch/s390/net/Makefile  |   2 +-
 arch/s390/net/bpf_jit.S | 116 ---
 arch/s390/net/bpf_jit.h |  20 +-
 arch/s390/net/bpf_jit_comp.c| 127 ++--
 arch/sparc/net/Makefile |   5 +-
 arch/sparc/net/bpf_jit_64.h |  29 ---
 arch/sparc/net/bpf_jit_asm_64.S | 162 ---
 arch/sparc/net/bpf_jit_comp_64.c|  79 +---
 arch/x86/net/Makefile   |   4 +-
 arch/x86/net/bpf_jit.S  | 154 ---
 arch/x86/net/bpf_jit_comp.c | 144 +-
 include/linux/bpf.h |   4 +-
 include/linux/filter.h  |   4 +-
 include/uapi/linux/bpf.h|  33 +++-
 kernel/bpf/core.c   |  96 +
 kernel/bpf/verifier.c   |  24 +++
 lib/test_bpf.c  | 212 
 net/core/filter.c   | 296 +---
 tools/include/uapi/linux/bpf.h  |  33 +++-
 tools/testing/selftests/bpf/test_verifier.c | 266 -
 27 files changed, 669 insertions(+), 1715 deletions(-)
 delete mode 100644 arch/powerpc/net/bpf_jit_asm64.S
 delete mode 100644 arch/s390/net/bpf_jit.S
 delete mode 100644 arch/sparc/net/bpf_jit_asm_64.S
 delete mode 100644 arch/x86/net/bpf_jit.S

-- 
2.9.5



[PATCH bpf-next 06/12] bpf, arm64: remove ld_abs/ld_ind

2018-05-02 Thread Daniel Borkmann
Since LD_ABS/LD_IND instructions are now removed from the core and
reimplemented through a combination of inlined BPF instructions and
a slow-path helper, we can get rid of the complexity from arm64 JIT.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 arch/arm64/net/bpf_jit_comp.c | 65 ---
 1 file changed, 65 deletions(-)

diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index a933504..0b40c8f 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -723,71 +723,6 @@ static int build_insn(const struct bpf_insn *insn, struct 
jit_ctx *ctx)
emit(A64_CBNZ(0, tmp3, jmp_offset), ctx);
break;
 
-   /* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + imm)) */
-   case BPF_LD | BPF_ABS | BPF_W:
-   case BPF_LD | BPF_ABS | BPF_H:
-   case BPF_LD | BPF_ABS | BPF_B:
-   /* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + src + imm)) */
-   case BPF_LD | BPF_IND | BPF_W:
-   case BPF_LD | BPF_IND | BPF_H:
-   case BPF_LD | BPF_IND | BPF_B:
-   {
-   const u8 r0 = bpf2a64[BPF_REG_0]; /* r0 = return value */
-   const u8 r6 = bpf2a64[BPF_REG_6]; /* r6 = pointer to sk_buff */
-   const u8 fp = bpf2a64[BPF_REG_FP];
-   const u8 r1 = bpf2a64[BPF_REG_1]; /* r1: struct sk_buff *skb */
-   const u8 r2 = bpf2a64[BPF_REG_2]; /* r2: int k */
-   const u8 r3 = bpf2a64[BPF_REG_3]; /* r3: unsigned int size */
-   const u8 r4 = bpf2a64[BPF_REG_4]; /* r4: void *buffer */
-   const u8 r5 = bpf2a64[BPF_REG_5]; /* r5: void *(*func)(...) */
-   int size;
-
-   emit(A64_MOV(1, r1, r6), ctx);
-   emit_a64_mov_i(0, r2, imm, ctx);
-   if (BPF_MODE(code) == BPF_IND)
-   emit(A64_ADD(0, r2, r2, src), ctx);
-   switch (BPF_SIZE(code)) {
-   case BPF_W:
-   size = 4;
-   break;
-   case BPF_H:
-   size = 2;
-   break;
-   case BPF_B:
-   size = 1;
-   break;
-   default:
-   return -EINVAL;
-   }
-   emit_a64_mov_i64(r3, size, ctx);
-   emit(A64_SUB_I(1, r4, fp, ctx->stack_size), ctx);
-   emit_a64_mov_i64(r5, (unsigned long)bpf_load_pointer, ctx);
-   emit(A64_BLR(r5), ctx);
-   emit(A64_MOV(1, r0, A64_R(0)), ctx);
-
-   jmp_offset = epilogue_offset(ctx);
-   check_imm19(jmp_offset);
-   emit(A64_CBZ(1, r0, jmp_offset), ctx);
-   emit(A64_MOV(1, r5, r0), ctx);
-   switch (BPF_SIZE(code)) {
-   case BPF_W:
-   emit(A64_LDR32(r0, r5, A64_ZR), ctx);
-#ifndef CONFIG_CPU_BIG_ENDIAN
-   emit(A64_REV32(0, r0, r0), ctx);
-#endif
-   break;
-   case BPF_H:
-   emit(A64_LDRH(r0, r5, A64_ZR), ctx);
-#ifndef CONFIG_CPU_BIG_ENDIAN
-   emit(A64_REV16(0, r0, r0), ctx);
-#endif
-   break;
-   case BPF_B:
-   emit(A64_LDRB(r0, r5, A64_ZR), ctx);
-   break;
-   }
-   break;
-   }
default:
pr_err_once("unknown opcode %02x\n", code);
return -EINVAL;
-- 
2.9.5



[PATCH bpf-next 03/12] bpf: implement ld_abs/ld_ind in native bpf

2018-05-02 Thread Daniel Borkmann
The main part of this work is to finally allow removal of LD_ABS
and LD_IND from the BPF core by reimplementing them through native
eBPF instead. Both LD_ABS/LD_IND were carried over from cBPF and
keeping them around in native eBPF caused way more trouble than
actually worth it. To just list some of the security issues in
the past:

  * fdfaf64e7539 ("x86: bpf_jit: support negative offsets")
  * 35607b02dbef ("sparc: bpf_jit: fix loads from negative offsets")
  * e0ee9c12157d ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
  * 07aee9439454 ("bpf, sparc: fix usage of wrong reg for load_skb_regs after 
call")
  * 6d59b7dbf72e ("bpf, s390x: do not reload skb pointers in non-skb context")
  * 87338c8e2cbb ("bpf, ppc64: do not reload skb pointers in non-skb context")

For programs in native eBPF, LD_ABS/LD_IND are pretty much legacy
these days due to their limitations and more efficient/flexible
alternatives that have been developed over time such as direct
packet access. LD_ABS/LD_IND only cover 1/2/4 byte loads into a
register, the load happens in host endianness and its exception
handling can yield unexpected behavior. The latter is explained
in depth in f6b1b3bf0d5f ("bpf: fix subprog verifier bypass by
div/mod by 0 exception") with similar cases of exceptions we had.
In native eBPF more recent program types will disable LD_ABS/LD_IND
altogether through may_access_skb() in verifier, and given the
limitations in terms of exception handling, it's also disabled
in programs that use BPF to BPF calls.
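
To make the "direct packet access" alternative mentioned above concrete,
here is a minimal, illustrative tc classifier in restricted C. It is not
part of this patch and assumes clang's BPF target together with
libbpf-style bpf_helpers.h/bpf_endian.h headers:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("classifier")
int keep_only_ipv4(struct __sk_buff *skb)
{
        void *data     = (void *)(long)skb->data;
        void *data_end = (void *)(long)skb->data_end;
        struct ethhdr *eth = data;

        /* Verifier-enforced bounds check; this replaces the implicit
         * runtime check that LD_ABS/LD_IND used to perform. */
        if ((void *)(eth + 1) > data_end)
                return TC_ACT_SHOT;

        return eth->h_proto == bpf_htons(ETH_P_IP) ? TC_ACT_OK : TC_ACT_SHOT;
}

char _license[] SEC("license") = "GPL";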

In terms of cBPF, LD_ABS/LD_IND is used in networking programs to
access packet data. It is not used in seccomp-BPF, but it is used in
programs that rely on cBPF for socket filtering or for reuseport
demuxing. This is mostly relevant for applications that have not yet
migrated to native eBPF.
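
For completeness, the kind of cBPF socket filter that still relies on
LD_ABS today looks roughly like the user-space sketch below (again
illustrative only, not taken from the patch); it attaches a filter that
accepts only ARP frames on a packet socket:

#include <linux/filter.h>
#include <linux/if_ether.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <stdio.h>

int main(void)
{
        /* A = EtherType via LD_ABS halfword at offset 12; accept ARP only. */
        struct sock_filter code[] = {
                BPF_STMT(BPF_LD | BPF_H | BPF_ABS, 12),
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ETH_P_ARP, 0, 1),
                BPF_STMT(BPF_RET | BPF_K, 0xffffffff),  /* keep whole packet */
                BPF_STMT(BPF_RET | BPF_K, 0),           /* drop */
        };
        struct sock_fprog prog = {
                .len    = sizeof(code) / sizeof(code[0]),
                .filter = code,
        };
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

        if (fd < 0 || setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER,
                                 &prog, sizeof(prog)) < 0)
                perror("socket/SO_ATTACH_FILTER");
        return 0;
}

Filters like this keep flowing through bpf_convert_filter(), which after
this series emits the native eBPF sequence instead of relying on the
JITs' asm fast path.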

The main complexity and source of bugs in LD_ABS/LD_IND comes from
their implementation in the various JITs. Most of them keep the model
around from cBPF times by implementing a fast path written in asm.
They typically use two CPU registers, hidden from the BPF program, for
caching the skb's headlen (skb->len - skb->data_len) and skb->data.
Throughout the JIT phase this requires keeping track of whether
LD_ABS/LD_IND are used and, if so, in the native eBPF case the two
registers need to be recached each time a BPF helper would change the
underlying packet data. At least in the eBPF case, available CPU
registers are scarce, and the additional exit path out of the
asm-written JIT helper also makes it inflexible, since not all parts
of the JITer are under control from plain C. An LD_ABS/LD_IND
implementation in eBPF therefore allows the complexity in the JITs to
be reduced significantly, with comparable performance results, e.g.:

                 test_bpf      tcpdump port 22    tcpdump complex
x64    - before  15 21 10      14 19  18
       - after    7 10 10       7 10  15
arm64  - before  40 91 92      40 91 151
       - after   51 64 73      51 62 113

For cBPF we now track any usage of LD_ABS/LD_IND in bpf_convert_filter()
and cache the skb's headlen and data in the cBPF prologue. The
BPF_REG_TMP gets remapped from R8 to R2 since it's really just
used as a local temporary variable. This also lets the image on
x86_64 shrink slightly for seccomp programs, since mapping to %rsi
is not an ereg. In callee-saved R8 and R9 we now track
skb data and headlen, respectively. For normal prologue emission
in the JITs this does not add any extra instructions since R8, R9
are pushed to stack in any case from eBPF side. cBPF uses the
convert_bpf_ld_abs() emitter which probes the fast path inline
already and falls back to bpf_skb_load_helper_{8,16,32}() helper
relying on the cached skb data and headlen as well. R8 and R9
never need to be reloaded due to bpf_helper_changes_pkt_data()
since all skb access in cBPF is read-only. Then, for the case
of native eBPF, we use the bpf_gen_ld_abs() emitter, which calls
the bpf_skb_load_helper_{8,16,32}_no_cache() helper unconditionally,
and neither caches skb data and headlen nor has an inlined fast
path. The reason for the latter is that native eBPF does not have
any extra registers available anyway, but even if there were, it
avoids any reload of skb data and headlen in the first place.
Additionally, for the negative offsets, we provide an alternative
bpf_skb_load_bytes_relative() helper in eBPF which operates
similarly to bpf_skb_load_bytes(). Tested myself on x64, arm64,
s390x, from Sandipan on ppc64.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 include/linux/bpf.h|   2 +
 include/linux/filter.h |   4 +-
 kernel/bpf/core.c  |  96 ++---
 kernel/bpf/verifier.c  |  24 ++
 net/core/filter.c  | 227 +++--
 5 files changed, 255 insertions(+), 98 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h

[PATCH bpf-next 11/12] bpf, s390x: remove ld_abs/ld_ind

2018-05-02 Thread Daniel Borkmann
Since LD_ABS/LD_IND instructions are now removed from the core and
reimplemented through a combination of inlined BPF instructions and
a slow-path helper, we can get rid of the complexity from s390x JIT.
Tested on s390x instance on LinuxONE.

Signed-off-by: Daniel Borkmann 
Cc: Michael Holzheu 
Acked-by: Alexei Starovoitov 
---
 arch/s390/net/Makefile   |   2 +-
 arch/s390/net/bpf_jit.S  | 116 ---
 arch/s390/net/bpf_jit.h  |  20 +--
 arch/s390/net/bpf_jit_comp.c | 127 ---
 4 files changed, 13 insertions(+), 252 deletions(-)
 delete mode 100644 arch/s390/net/bpf_jit.S

diff --git a/arch/s390/net/Makefile b/arch/s390/net/Makefile
index e0d5f24..d4663b4 100644
--- a/arch/s390/net/Makefile
+++ b/arch/s390/net/Makefile
@@ -2,4 +2,4 @@
 #
 # Arch-specific network modules
 #
-obj-$(CONFIG_BPF_JIT) += bpf_jit.o bpf_jit_comp.o
+obj-$(CONFIG_BPF_JIT) += bpf_jit_comp.o
diff --git a/arch/s390/net/bpf_jit.S b/arch/s390/net/bpf_jit.S
deleted file mode 100644
index 25bb464..000
--- a/arch/s390/net/bpf_jit.S
+++ /dev/null
@@ -1,116 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/*
- * BPF Jit compiler for s390, help functions.
- *
- * Copyright IBM Corp. 2012,2015
- *
- * Author(s): Martin Schwidefsky 
- *   Michael Holzheu 
- */
-
-#include 
-#include "bpf_jit.h"
-
-/*
- * Calling convention:
- * registers %r7-%r10, %r11,%r13, and %r15 are call saved
- *
- * Input (64 bit):
- *   %r3 (%b2) = offset into skb data
- *   %r6 (%b5) = return address
- *   %r7 (%b6) = skb pointer
- *   %r12  = skb data pointer
- *
- * Output:
- *   %r14= %b0 = return value (read skb value)
- *
- * Work registers: %r2,%r4,%r5,%r14
- *
- * skb_copy_bits takes 4 parameters:
- *   %r2 = skb pointer
- *   %r3 = offset into skb data
- *   %r4 = pointer to temp buffer
- *   %r5 = length to copy
- *   Return value in %r2: 0 = ok
- *
- * bpf_internal_load_pointer_neg_helper takes 3 parameters:
- *   %r2 = skb pointer
- *   %r3 = offset into data
- *   %r4 = length to copy
- *   Return value in %r2: Pointer to data
- */
-
-#define SKF_MAX_NEG_OFF-0x20   /* SKF_LL_OFF from filter.h */
-
-/*
- * Load SIZE bytes from SKB
- */
-#define sk_load_common(NAME, SIZE, LOAD)   \
-ENTRY(sk_load_##NAME); \
-   ltgr%r3,%r3;/* Is offset negative? */   \
-   jl  sk_load_##NAME##_slow_neg;  \
-ENTRY(sk_load_##NAME##_pos);   \
-   aghi%r3,SIZE;   /* Offset + SIZE */ \
-   clg %r3,STK_OFF_HLEN(%r15); /* Offset + SIZE > hlen? */ \
-   jh  sk_load_##NAME##_slow;  \
-   LOAD%r14,-SIZE(%r3,%r12);   /* Get data from skb */ \
-   b   OFF_OK(%r6);/* Return */\
-   \
-sk_load_##NAME##_slow:;
\
-   lgr %r2,%r7;/* Arg1 = skb pointer */\
-   aghi%r3,-SIZE;  /* Arg2 = offset */ \
-   la  %r4,STK_OFF_TMP(%r15);  /* Arg3 = temp bufffer */   \
-   lghi%r5,SIZE;   /* Arg4 = size */   \
-   brasl   %r14,skb_copy_bits; /* Get data from skb */ \
-   LOAD%r14,STK_OFF_TMP(%r15); /* Load from temp bufffer */\
-   ltgr%r2,%r2;/* Set cc to (%r2 != 0) */  \
-   br  %r6;/* Return */
-
-sk_load_common(word, 4, llgf)  /* r14 = *(u32 *) (skb->data+offset) */
-sk_load_common(half, 2, llgh)  /* r14 = *(u16 *) (skb->data+offset) */
-
-/*
- * Load 1 byte from SKB (optimized version)
- */
-   /* r14 = *(u8 *) (skb->data+offset) */
-ENTRY(sk_load_byte)
-   ltgr%r3,%r3 # Is offset negative?
-   jl  sk_load_byte_slow_neg
-ENTRY(sk_load_byte_pos)
-   clg %r3,STK_OFF_HLEN(%r15)  # Offset >= hlen?
-   jnl sk_load_byte_slow
-   llgc%r14,0(%r3,%r12)# Get byte from skb
-   b   OFF_OK(%r6) # Return OK
-
-sk_load_byte_slow:
-   lgr %r2,%r7 # Arg1 = skb pointer
-   # Arg2 = offset
-   la  %r4,STK_OFF_TMP(%r15)   # Arg3 = pointer to temp buffer
-   lghi%r5,1   # Arg4 = size (1 byte)
-   brasl   %r14,skb_copy_bits  # Get data from skb
-   llgc%r14,STK_OFF_TMP(%r15)  # Load result from temp buffer
-   ltgr%r2,%r2 # Set cc to (%r2 != 0)
-   br  %r6 # Return cc
-
-#define sk_negative_common(NAME, SIZE, LOAD)   

[PATCH bpf-next 08/12] bpf, arm32: remove ld_abs/ld_ind

2018-05-02 Thread Daniel Borkmann
Since LD_ABS/LD_IND instructions are now removed from the core and
reimplemented through a combination of inlined BPF instructions and
a slow-path helper, we can get rid of the complexity from arm32 JIT.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 arch/arm/net/bpf_jit_32.c | 77 ---
 1 file changed, 77 deletions(-)

diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
index b5030e1..82689b9 100644
--- a/arch/arm/net/bpf_jit_32.c
+++ b/arch/arm/net/bpf_jit_32.c
@@ -1452,83 +1452,6 @@ static int build_insn(const struct bpf_insn *insn, 
struct jit_ctx *ctx)
emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
emit_ldx_r(dst, rn, dstk, off, ctx, BPF_SIZE(code));
break;
-   /* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + imm)) */
-   case BPF_LD | BPF_ABS | BPF_W:
-   case BPF_LD | BPF_ABS | BPF_H:
-   case BPF_LD | BPF_ABS | BPF_B:
-   /* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + src + imm)) */
-   case BPF_LD | BPF_IND | BPF_W:
-   case BPF_LD | BPF_IND | BPF_H:
-   case BPF_LD | BPF_IND | BPF_B:
-   {
-   const u8 r4 = bpf2a32[BPF_REG_6][1]; /* r4 = ptr to sk_buff */
-   const u8 r0 = bpf2a32[BPF_REG_0][1]; /*r0: struct sk_buff *skb*/
-/* rtn value */
-   const u8 r1 = bpf2a32[BPF_REG_0][0]; /* r1: int k */
-   const u8 r2 = bpf2a32[BPF_REG_1][1]; /* r2: unsigned int size */
-   const u8 r3 = bpf2a32[BPF_REG_1][0]; /* r3: void *buffer */
-   const u8 r6 = bpf2a32[TMP_REG_1][1]; /* r6: void *(*func)(..) */
-   int size;
-
-   /* Setting up first argument */
-   emit(ARM_MOV_R(r0, r4), ctx);
-
-   /* Setting up second argument */
-   emit_a32_mov_i(r1, imm, false, ctx);
-   if (BPF_MODE(code) == BPF_IND)
-   emit_a32_alu_r(r1, src_lo, false, sstk, ctx,
-  false, false, BPF_ADD);
-
-   /* Setting up third argument */
-   switch (BPF_SIZE(code)) {
-   case BPF_W:
-   size = 4;
-   break;
-   case BPF_H:
-   size = 2;
-   break;
-   case BPF_B:
-   size = 1;
-   break;
-   default:
-   return -EINVAL;
-   }
-   emit_a32_mov_i(r2, size, false, ctx);
-
-   /* Setting up fourth argument */
-   emit(ARM_ADD_I(r3, ARM_SP, imm8m(SKB_BUFFER)), ctx);
-
-   /* Setting up function pointer to call */
-   emit_a32_mov_i(r6, (unsigned int)bpf_load_pointer, false, ctx);
-   emit_blx_r(r6, ctx);
-
-   emit(ARM_EOR_R(r1, r1, r1), ctx);
-   /* Check if return address is NULL or not.
-* if NULL then jump to epilogue
-* else continue to load the value from retn address
-*/
-   emit(ARM_CMP_I(r0, 0), ctx);
-   jmp_offset = epilogue_offset(ctx);
-   check_imm24(jmp_offset);
-   _emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
-
-   /* Load value from the address */
-   switch (BPF_SIZE(code)) {
-   case BPF_W:
-   emit(ARM_LDR_I(r0, r0, 0), ctx);
-   emit_rev32(r0, r0, ctx);
-   break;
-   case BPF_H:
-   emit(ARM_LDRH_I(r0, r0, 0), ctx);
-   emit_rev16(r0, r0, ctx);
-   break;
-   case BPF_B:
-   emit(ARM_LDRB_I(r0, r0, 0), ctx);
-   /* No need to reverse */
-   break;
-   }
-   break;
-   }
/* ST: *(size *)(dst + off) = imm */
case BPF_ST | BPF_MEM | BPF_W:
case BPF_ST | BPF_MEM | BPF_H:
-- 
2.9.5



[PATCH bpf-next 07/12] bpf, sparc64: remove ld_abs/ld_ind

2018-05-02 Thread Daniel Borkmann
Since LD_ABS/LD_IND instructions are now removed from the core and
reimplemented through a combination of inlined BPF instructions and
a slow-path helper, we can get rid of the complexity from sparc64 JIT.

Signed-off-by: Daniel Borkmann 
Cc: David S. Miller 
Acked-by: Alexei Starovoitov 
---
 arch/sparc/net/Makefile  |   5 +-
 arch/sparc/net/bpf_jit_64.h  |  29 ---
 arch/sparc/net/bpf_jit_asm_64.S  | 162 ---
 arch/sparc/net/bpf_jit_comp_64.c |  79 +--
 4 files changed, 6 insertions(+), 269 deletions(-)
 delete mode 100644 arch/sparc/net/bpf_jit_asm_64.S

diff --git a/arch/sparc/net/Makefile b/arch/sparc/net/Makefile
index 76fa8e9..d32aac3 100644
--- a/arch/sparc/net/Makefile
+++ b/arch/sparc/net/Makefile
@@ -1,4 +1,7 @@
 #
 # Arch-specific network modules
 #
-obj-$(CONFIG_BPF_JIT) += bpf_jit_asm_$(BITS).o bpf_jit_comp_$(BITS).o
+obj-$(CONFIG_BPF_JIT) += bpf_jit_comp_$(BITS).o
+ifeq ($(BITS),32)
+obj-$(CONFIG_BPF_JIT) += bpf_jit_asm_32.o
+endif
diff --git a/arch/sparc/net/bpf_jit_64.h b/arch/sparc/net/bpf_jit_64.h
index 428f7fd..fbc836f 100644
--- a/arch/sparc/net/bpf_jit_64.h
+++ b/arch/sparc/net/bpf_jit_64.h
@@ -33,35 +33,6 @@
 #define I5 0x1d
 #define FP 0x1e
 #define I7 0x1f
-
-#define r_SKB  L0
-#define r_HEADLEN  L4
-#define r_SKB_DATA L5
-#define r_TMP  G1
-#define r_TMP2 G3
-
-/* assembly code in arch/sparc/net/bpf_jit_asm_64.S */
-extern u32 bpf_jit_load_word[];
-extern u32 bpf_jit_load_half[];
-extern u32 bpf_jit_load_byte[];
-extern u32 bpf_jit_load_byte_msh[];
-extern u32 bpf_jit_load_word_positive_offset[];
-extern u32 bpf_jit_load_half_positive_offset[];
-extern u32 bpf_jit_load_byte_positive_offset[];
-extern u32 bpf_jit_load_byte_msh_positive_offset[];
-extern u32 bpf_jit_load_word_negative_offset[];
-extern u32 bpf_jit_load_half_negative_offset[];
-extern u32 bpf_jit_load_byte_negative_offset[];
-extern u32 bpf_jit_load_byte_msh_negative_offset[];
-
-#else
-#define r_RESULT   %o0
-#define r_SKB  %o0
-#define r_OFF  %o1
-#define r_HEADLEN  %l4
-#define r_SKB_DATA %l5
-#define r_TMP  %g1
-#define r_TMP2 %g3
 #endif
 
 #endif /* _BPF_JIT_H */
diff --git a/arch/sparc/net/bpf_jit_asm_64.S b/arch/sparc/net/bpf_jit_asm_64.S
deleted file mode 100644
index 7177867..000
--- a/arch/sparc/net/bpf_jit_asm_64.S
+++ /dev/null
@@ -1,162 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#include 
-
-#include "bpf_jit_64.h"
-
-#define SAVE_SZ176
-#define SCRATCH_OFFSTACK_BIAS + 128
-#define BE_PTR(label)  be,pn %xcc, label
-#define SIGN_EXTEND(reg)   sra reg, 0, reg
-
-#define SKF_MAX_NEG_OFF(-0x20) /* SKF_LL_OFF from filter.h */
-
-   .text
-   .globl  bpf_jit_load_word
-bpf_jit_load_word:
-   cmp r_OFF, 0
-   bl  bpf_slow_path_word_neg
-nop
-   .globl  bpf_jit_load_word_positive_offset
-bpf_jit_load_word_positive_offset:
-   sub r_HEADLEN, r_OFF, r_TMP
-   cmp r_TMP, 3
-   ble bpf_slow_path_word
-addr_SKB_DATA, r_OFF, r_TMP
-   andcc   r_TMP, 3, %g0
-   bne load_word_unaligned
-nop
-   retl
-ld [r_TMP], r_RESULT
-load_word_unaligned:
-   ldub[r_TMP + 0x0], r_OFF
-   ldub[r_TMP + 0x1], r_TMP2
-   sll r_OFF, 8, r_OFF
-   or  r_OFF, r_TMP2, r_OFF
-   ldub[r_TMP + 0x2], r_TMP2
-   sll r_OFF, 8, r_OFF
-   or  r_OFF, r_TMP2, r_OFF
-   ldub[r_TMP + 0x3], r_TMP2
-   sll r_OFF, 8, r_OFF
-   retl
-or r_OFF, r_TMP2, r_RESULT
-
-   .globl  bpf_jit_load_half
-bpf_jit_load_half:
-   cmp r_OFF, 0
-   bl  bpf_slow_path_half_neg
-nop
-   .globl  bpf_jit_load_half_positive_offset
-bpf_jit_load_half_positive_offset:
-   sub r_HEADLEN, r_OFF, r_TMP
-   cmp r_TMP, 1
-   ble bpf_slow_path_half
-addr_SKB_DATA, r_OFF, r_TMP
-   andcc   r_TMP, 1, %g0
-   bne load_half_unaligned
-nop
-   retl
-lduh   [r_TMP], r_RESULT
-load_half_unaligned:
-   ldub[r_TMP + 0x0], r_OFF
-   ldub[r_TMP + 0x1], r_TMP2
-   sll r_OFF, 8, r_OFF
-   retl
-or r_OFF, r_TMP2, r_RESULT
-
-   .globl  bpf_jit_load_byte
-bpf_jit_load_byte:
-   cmp r_OFF, 0
-   bl  bpf_slow_path_byte_neg
-nop
-   .globl  bpf_jit_load_byte_positive_offset
-bpf_jit_load_byte_positive_offset:
-   cmp r_OFF, r_HEADLEN
-   bge bpf_slow_path_byte
-nop
-   retl
-ldub   [r_SKB_DATA + r_OFF], r_RESULT
-
-#define bpf_slow_path_common(LEN)  \
-   save%sp, -SAVE_SZ, %sp; \
-   mov %i0, %o0;   \
-   mov %i1, %o1;   \
-   add %fp, SCRATCH_OFF, %o2;  \
-   

[PATCH bpf-next 09/12] bpf, mips64: remove ld_abs/ld_ind

2018-05-02 Thread Daniel Borkmann
Since LD_ABS/LD_IND instructions are now removed from the core and
reimplemented through a combination of inlined BPF instructions and
a slow-path helper, we can get rid of the complexity from mips64 JIT.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 arch/mips/net/ebpf_jit.c | 104 ---
 1 file changed, 104 deletions(-)

diff --git a/arch/mips/net/ebpf_jit.c b/arch/mips/net/ebpf_jit.c
index 3e2798b..7ba7df9 100644
--- a/arch/mips/net/ebpf_jit.c
+++ b/arch/mips/net/ebpf_jit.c
@@ -1267,110 +1267,6 @@ static int build_one_insn(const struct bpf_insn *insn, 
struct jit_ctx *ctx,
return -EINVAL;
break;
 
-   case BPF_LD | BPF_B | BPF_ABS:
-   case BPF_LD | BPF_H | BPF_ABS:
-   case BPF_LD | BPF_W | BPF_ABS:
-   case BPF_LD | BPF_DW | BPF_ABS:
-   ctx->flags |= EBPF_SAVE_RA;
-
-   gen_imm_to_reg(insn, MIPS_R_A1, ctx);
-   emit_instr(ctx, addiu, MIPS_R_A2, MIPS_R_ZERO, 
size_to_len(insn));
-
-   if (insn->imm < 0) {
-   emit_const_to_reg(ctx, MIPS_R_T9, 
(u64)bpf_internal_load_pointer_neg_helper);
-   } else {
-   emit_const_to_reg(ctx, MIPS_R_T9, 
(u64)ool_skb_header_pointer);
-   emit_instr(ctx, daddiu, MIPS_R_A3, MIPS_R_SP, 
ctx->tmp_offset);
-   }
-   goto ld_skb_common;
-
-   case BPF_LD | BPF_B | BPF_IND:
-   case BPF_LD | BPF_H | BPF_IND:
-   case BPF_LD | BPF_W | BPF_IND:
-   case BPF_LD | BPF_DW | BPF_IND:
-   ctx->flags |= EBPF_SAVE_RA;
-   src = ebpf_to_mips_reg(ctx, insn, src_reg_no_fp);
-   if (src < 0)
-   return src;
-   ts = get_reg_val_type(ctx, this_idx, insn->src_reg);
-   if (ts == REG_32BIT_ZERO_EX) {
-   /* sign extend */
-   emit_instr(ctx, sll, MIPS_R_A1, src, 0);
-   src = MIPS_R_A1;
-   }
-   if (insn->imm >= S16_MIN && insn->imm <= S16_MAX) {
-   emit_instr(ctx, daddiu, MIPS_R_A1, src, insn->imm);
-   } else {
-   gen_imm_to_reg(insn, MIPS_R_AT, ctx);
-   emit_instr(ctx, daddu, MIPS_R_A1, MIPS_R_AT, src);
-   }
-   /* truncate to 32-bit int */
-   emit_instr(ctx, sll, MIPS_R_A1, MIPS_R_A1, 0);
-   emit_instr(ctx, daddiu, MIPS_R_A3, MIPS_R_SP, ctx->tmp_offset);
-   emit_instr(ctx, slt, MIPS_R_AT, MIPS_R_A1, MIPS_R_ZERO);
-
-   emit_const_to_reg(ctx, MIPS_R_T8, 
(u64)bpf_internal_load_pointer_neg_helper);
-   emit_const_to_reg(ctx, MIPS_R_T9, (u64)ool_skb_header_pointer);
-   emit_instr(ctx, addiu, MIPS_R_A2, MIPS_R_ZERO, 
size_to_len(insn));
-   emit_instr(ctx, movn, MIPS_R_T9, MIPS_R_T8, MIPS_R_AT);
-
-ld_skb_common:
-   emit_instr(ctx, jalr, MIPS_R_RA, MIPS_R_T9);
-   /* delay slot move */
-   emit_instr(ctx, daddu, MIPS_R_A0, MIPS_R_S0, MIPS_R_ZERO);
-
-   /* Check the error value */
-   b_off = b_imm(exit_idx, ctx);
-   if (is_bad_offset(b_off)) {
-   target = j_target(ctx, exit_idx);
-   if (target == (unsigned int)-1)
-   return -E2BIG;
-
-   if (!(ctx->offsets[this_idx] & OFFSETS_B_CONV)) {
-   ctx->offsets[this_idx] |= OFFSETS_B_CONV;
-   ctx->long_b_conversion = 1;
-   }
-   emit_instr(ctx, bne, MIPS_R_V0, MIPS_R_ZERO, 4 * 3);
-   emit_instr(ctx, nop);
-   emit_instr(ctx, j, target);
-   emit_instr(ctx, nop);
-   } else {
-   emit_instr(ctx, beq, MIPS_R_V0, MIPS_R_ZERO, b_off);
-   emit_instr(ctx, nop);
-   }
-
-#ifdef __BIG_ENDIAN
-   need_swap = false;
-#else
-   need_swap = true;
-#endif
-   dst = MIPS_R_V0;
-   switch (BPF_SIZE(insn->code)) {
-   case BPF_B:
-   emit_instr(ctx, lbu, dst, 0, MIPS_R_V0);
-   break;
-   case BPF_H:
-   emit_instr(ctx, lhu, dst, 0, MIPS_R_V0);
-   if (need_swap)
-   emit_instr(ctx, wsbh, dst, dst);
-   break;
-   case BPF_W:
-   emit_instr(ctx, lw, dst, 0, MIPS_R_V0);
-   if (need_swap) {
-   emit_instr(ctx, wsbh, dst, dst);
-   emit_instr(ctx, rotr, dst, dst, 16);
-   }
-   break;

[PATCH bpf-next 12/12] bpf: sync tools bpf.h uapi header

2018-05-02 Thread Daniel Borkmann
Only sync the header from include/uapi/linux/bpf.h.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 tools/include/uapi/linux/bpf.h | 33 -
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 8daef73..83a95ae 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1801,6 +1801,30 @@ union bpf_attr {
  * Return
  * a non-negative value equal to or less than size on success, or
  * a negative error in case of failure.
+ *
+ * int skb_load_bytes_relative(const struct sk_buff *skb, u32 offset, void 
*to, u32 len, u32 start_header)
+ * Description
+ * This helper is similar to **bpf_skb_load_bytes**\ () in that
+ * it provides an easy way to load *len* bytes from *offset*
+ * from the packet associated to *skb*, into the buffer pointed
+ * by *to*. The difference to **bpf_skb_load_bytes**\ () is that
+ * a fifth argument *start_header* exists in order to select a
+ * base offset to start from. *start_header* can be one of:
+ *
+ * **BPF_HDR_START_MAC**
+ * Base offset to load data from is *skb*'s mac header.
+ * **BPF_HDR_START_NET**
+ * Base offset to load data from is *skb*'s network header.
+ *
+ * In general, "direct packet access" is the preferred method to
+ * access packet data, however, this helper is in particular useful
+ * in socket filters where *skb*\ **->data** does not always point
+ * to the start of the mac header and where "direct packet access"
+ * is not available.
+ *
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -1870,7 +1894,8 @@ union bpf_attr {
FN(bind),   \
FN(xdp_adjust_tail),\
FN(skb_get_xfrm_state), \
-   FN(get_stack),
+   FN(get_stack),  \
+   FN(skb_load_bytes_relative),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -1931,6 +1956,12 @@ enum bpf_adj_room_mode {
BPF_ADJ_ROOM_NET,
 };
 
+/* Mode for BPF_FUNC_skb_load_bytes_relative helper. */
+enum bpf_hdr_start_off {
+   BPF_HDR_START_MAC,
+   BPF_HDR_START_NET,
+};
+
 /* user accessible mirror of in-kernel sk_buff.
  * new fields can only be added to the end of this structure
  */
-- 
2.9.5



[PATCH bpf-next 10/12] bpf, ppc64: remove ld_abs/ld_ind

2018-05-02 Thread Daniel Borkmann
Since LD_ABS/LD_IND instructions are now removed from the core and
reimplemented through a combination of inlined BPF instructions and
a slow-path helper, we can get rid of the complexity from ppc64 JIT.

Signed-off-by: Daniel Borkmann 
Acked-by: Naveen N. Rao 
Acked-by: Alexei Starovoitov 
Tested-by: Sandipan Das 
---
 arch/powerpc/net/Makefile |   2 +-
 arch/powerpc/net/bpf_jit64.h  |  37 ++--
 arch/powerpc/net/bpf_jit_asm64.S  | 180 --
 arch/powerpc/net/bpf_jit_comp64.c | 109 +--
 4 files changed, 11 insertions(+), 317 deletions(-)
 delete mode 100644 arch/powerpc/net/bpf_jit_asm64.S

diff --git a/arch/powerpc/net/Makefile b/arch/powerpc/net/Makefile
index 02d369c..809f019 100644
--- a/arch/powerpc/net/Makefile
+++ b/arch/powerpc/net/Makefile
@@ -3,7 +3,7 @@
 # Arch-specific network modules
 #
 ifeq ($(CONFIG_PPC64),y)
-obj-$(CONFIG_BPF_JIT) += bpf_jit_asm64.o bpf_jit_comp64.o
+obj-$(CONFIG_BPF_JIT) += bpf_jit_comp64.o
 else
 obj-$(CONFIG_BPF_JIT) += bpf_jit_asm.o bpf_jit_comp.o
 endif
diff --git a/arch/powerpc/net/bpf_jit64.h b/arch/powerpc/net/bpf_jit64.h
index 8bdef7e..3609be4 100644
--- a/arch/powerpc/net/bpf_jit64.h
+++ b/arch/powerpc/net/bpf_jit64.h
@@ -20,7 +20,7 @@
  * with our redzone usage.
  *
  * [   prev sp ] <-
- * [   nv gpr save area] 8*8   |
+ * [   nv gpr save area] 6*8   |
  * [tail_call_cnt  ] 8 |
  * [local_tmp_var  ] 8 |
  * fp (r31) -->[   ebpf stack space] upto 512  |
@@ -28,8 +28,8 @@
  * sp (r1) --->[stack pointer  ] --
  */
 
-/* for gpr non volatile registers BPG_REG_6 to 10, plus skb cache registers */
-#define BPF_PPC_STACK_SAVE (8*8)
+/* for gpr non volatile registers BPG_REG_6 to 10 */
+#define BPF_PPC_STACK_SAVE (6*8)
 /* for bpf JIT code internal usage */
 #define BPF_PPC_STACK_LOCALS   16
 /* stack frame excluding BPF stack, ensure this is quadword aligned */
@@ -39,10 +39,8 @@
 #ifndef __ASSEMBLY__
 
 /* BPF register usage */
-#define SKB_HLEN_REG   (MAX_BPF_JIT_REG + 0)
-#define SKB_DATA_REG   (MAX_BPF_JIT_REG + 1)
-#define TMP_REG_1  (MAX_BPF_JIT_REG + 2)
-#define TMP_REG_2  (MAX_BPF_JIT_REG + 3)
+#define TMP_REG_1  (MAX_BPF_JIT_REG + 0)
+#define TMP_REG_2  (MAX_BPF_JIT_REG + 1)
 
 /* BPF to ppc register mappings */
 static const int b2p[] = {
@@ -63,40 +61,23 @@ static const int b2p[] = {
[BPF_REG_FP] = 31,
/* eBPF jit internal registers */
[BPF_REG_AX] = 2,
-   [SKB_HLEN_REG] = 25,
-   [SKB_DATA_REG] = 26,
[TMP_REG_1] = 9,
[TMP_REG_2] = 10
 };
 
-/* PPC NVR range -- update this if we ever use NVRs below r24 */
-#define BPF_PPC_NVR_MIN24
-
-/* Assembly helpers */
-#define DECLARE_LOAD_FUNC(func)u64 func(u64 r3, u64 r4);   
\
-   u64 func##_negative_offset(u64 r3, u64 r4); 
\
-   u64 func##_positive_offset(u64 r3, u64 r4);
-
-DECLARE_LOAD_FUNC(sk_load_word);
-DECLARE_LOAD_FUNC(sk_load_half);
-DECLARE_LOAD_FUNC(sk_load_byte);
-
-#define CHOOSE_LOAD_FUNC(imm, func)
\
-   (imm < 0 ?  
\
-   (imm >= SKF_LL_OFF ? func##_negative_offset : func) :   
\
-   func##_positive_offset)
+/* PPC NVR range -- update this if we ever use NVRs below r27 */
+#define BPF_PPC_NVR_MIN27
 
 #define SEEN_FUNC  0x1000 /* might call external helpers */
 #define SEEN_STACK 0x2000 /* uses BPF stack */
-#define SEEN_SKB   0x4000 /* uses sk_buff */
-#define SEEN_TAILCALL  0x8000 /* uses tail calls */
+#define SEEN_TAILCALL  0x4000 /* uses tail calls */
 
 struct codegen_context {
/*
 * This is used to track register usage as well
 * as calls to external helpers.
 * - register usage is tracked with corresponding
-*   bits (r3-r10 and r25-r31)
+*   bits (r3-r10 and r27-r31)
 * - rest of the bits can be used to track other
 *   things -- for now, we use bits 16 to 23
 *   encoded in SEEN_* macros above
diff --git a/arch/powerpc/net/bpf_jit_asm64.S b/arch/powerpc/net/bpf_jit_asm64.S
deleted file mode 100644
index 7e4c514..000
--- a/arch/powerpc/net/bpf_jit_asm64.S
+++ /dev/null
@@ -1,180 +0,0 @@
-/*
- * bpf_jit_asm64.S: Packet/header access helper functions
- * for PPC64 BPF compiler.
- *
- * Copyright 2016, Naveen N. Rao 
- *IBM Corporation
- *
- * Based on bpf_jit_asm.S by Matt Evans
- *
- * This program is free software; you can redistribute it and/or
- * modify it under 

[PATCH bpf-next 02/12] bpf: migrate ebpf ld_abs/ld_ind tests to test_verifier

2018-05-02 Thread Daniel Borkmann
Remove all eBPF tests involving LD_ABS/LD_IND from test_bpf.ko. Reason
is that the eBPF tests from test_bpf module do not go via BPF verifier
and therefore any instruction rewrites from verifier cannot take place.
Therefore, move them into test_verifier, which runs out of user space,
so that the verifier can rewrite LD_ABS/LD_IND internally in upcoming
patches. It will have the same effect since runtime tests are also
performed from there. This also allows us to finally unexport
bpf_skb_vlan_{push,pop}_proto and keep them internal to the core kernel.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 include/linux/bpf.h |   2 -
 lib/test_bpf.c  | 212 --
 net/core/filter.c   |   6 +-
 tools/testing/selftests/bpf/test_verifier.c | 266 +++-
 4 files changed, 261 insertions(+), 225 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index c553f6f..8ea3f6d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -689,8 +689,6 @@ extern const struct bpf_func_proto bpf_ktime_get_ns_proto;
 extern const struct bpf_func_proto bpf_get_current_pid_tgid_proto;
 extern const struct bpf_func_proto bpf_get_current_uid_gid_proto;
 extern const struct bpf_func_proto bpf_get_current_comm_proto;
-extern const struct bpf_func_proto bpf_skb_vlan_push_proto;
-extern const struct bpf_func_proto bpf_skb_vlan_pop_proto;
 extern const struct bpf_func_proto bpf_get_stackid_proto;
 extern const struct bpf_func_proto bpf_get_stack_proto;
 extern const struct bpf_func_proto bpf_sock_map_update_proto;
diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index 8e15780..35f49bd 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -386,116 +386,6 @@ static int bpf_fill_ld_abs_get_processor_id(struct 
bpf_test *self)
return 0;
 }
 
-#define PUSH_CNT 68
-/* test: {skb->data[0], vlan_push} x 68 + {skb->data[0], vlan_pop} x 68 */
-static int bpf_fill_ld_abs_vlan_push_pop(struct bpf_test *self)
-{
-   unsigned int len = BPF_MAXINSNS;
-   struct bpf_insn *insn;
-   int i = 0, j, k = 0;
-
-   insn = kmalloc_array(len, sizeof(*insn), GFP_KERNEL);
-   if (!insn)
-   return -ENOMEM;
-
-   insn[i++] = BPF_MOV64_REG(R6, R1);
-loop:
-   for (j = 0; j < PUSH_CNT; j++) {
-   insn[i++] = BPF_LD_ABS(BPF_B, 0);
-   insn[i] = BPF_JMP_IMM(BPF_JNE, R0, 0x34, len - i - 2);
-   i++;
-   insn[i++] = BPF_MOV64_REG(R1, R6);
-   insn[i++] = BPF_MOV64_IMM(R2, 1);
-   insn[i++] = BPF_MOV64_IMM(R3, 2);
-   insn[i++] = BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
-bpf_skb_vlan_push_proto.func - 
__bpf_call_base);
-   insn[i] = BPF_JMP_IMM(BPF_JNE, R0, 0, len - i - 2);
-   i++;
-   }
-
-   for (j = 0; j < PUSH_CNT; j++) {
-   insn[i++] = BPF_LD_ABS(BPF_B, 0);
-   insn[i] = BPF_JMP_IMM(BPF_JNE, R0, 0x34, len - i - 2);
-   i++;
-   insn[i++] = BPF_MOV64_REG(R1, R6);
-   insn[i++] = BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
-bpf_skb_vlan_pop_proto.func - 
__bpf_call_base);
-   insn[i] = BPF_JMP_IMM(BPF_JNE, R0, 0, len - i - 2);
-   i++;
-   }
-   if (++k < 5)
-   goto loop;
-
-   for (; i < len - 1; i++)
-   insn[i] = BPF_ALU32_IMM(BPF_MOV, R0, 0xbef);
-
-   insn[len - 1] = BPF_EXIT_INSN();
-
-   self->u.ptr.insns = insn;
-   self->u.ptr.len = len;
-
-   return 0;
-}
-
-static int bpf_fill_ld_abs_vlan_push_pop2(struct bpf_test *self)
-{
-   struct bpf_insn *insn;
-
-   insn = kmalloc_array(16, sizeof(*insn), GFP_KERNEL);
-   if (!insn)
-   return -ENOMEM;
-
-   /* Due to func address being non-const, we need to
-* assemble this here.
-*/
-   insn[0] = BPF_MOV64_REG(R6, R1);
-   insn[1] = BPF_LD_ABS(BPF_B, 0);
-   insn[2] = BPF_LD_ABS(BPF_H, 0);
-   insn[3] = BPF_LD_ABS(BPF_W, 0);
-   insn[4] = BPF_MOV64_REG(R7, R6);
-   insn[5] = BPF_MOV64_IMM(R6, 0);
-   insn[6] = BPF_MOV64_REG(R1, R7);
-   insn[7] = BPF_MOV64_IMM(R2, 1);
-   insn[8] = BPF_MOV64_IMM(R3, 2);
-   insn[9] = BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
-  bpf_skb_vlan_push_proto.func - __bpf_call_base);
-   insn[10] = BPF_MOV64_REG(R6, R7);
-   insn[11] = BPF_LD_ABS(BPF_B, 0);
-   insn[12] = BPF_LD_ABS(BPF_H, 0);
-   insn[13] = BPF_LD_ABS(BPF_W, 0);
-   insn[14] = BPF_MOV64_IMM(R0, 42);
-   insn[15] = BPF_EXIT_INSN();
-
-   self->u.ptr.insns = insn;
-   self->u.ptr.len = 16;
-
-   return 0;
-}
-
-static int bpf_fill_jump_around_ld_abs(struct bpf_test *self)
-{
-   unsigned int len = BPF_MAXINSNS;
-   struct bpf_insn 

[PATCH bpf-next 04/12] bpf: add skb_load_bytes_relative helper

2018-05-02 Thread Daniel Borkmann
This adds a small BPF helper similar to bpf_skb_load_bytes() that
is able to load relative to mac/net header offset from the skb's
linear data. Compared to bpf_skb_load_bytes(), it takes a fifth
argument, namely start_header, which is either BPF_HDR_START_MAC
or BPF_HDR_START_NET. This allows for a more flexible alternative
compared to LD_ABS/LD_IND with negative offset. It's enabled for
tc BPF programs as well as sock filter program types where it's
mainly useful in reuseport programs to ease access to lower header
data.
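
As a usage illustration (not part of the patch), a socket filter might
call the new helper roughly as follows; this assumes a libbpf-style
bpf_helpers.h that declares bpf_skb_load_bytes_relative() and clang's
BPF target:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("socket")
int keep_only_ipv4(struct __sk_buff *skb)
{
        __u8 vihl;

        /* Read the IP version/IHL byte relative to the network header;
         * this works even when skb->data does not point at the mac
         * header for this socket. */
        if (bpf_skb_load_bytes_relative(skb, 0, &vihl, sizeof(vihl),
                                        BPF_HDR_START_NET))
                return 0;                       /* error: drop */

        return (vihl >> 4) == 4 ? skb->len : 0; /* keep IPv4, drop the rest */
}

char _license[] SEC("license") = "GPL";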

Reference: 
https://lists.iovisor.org/pipermail/iovisor-dev/2017-March/000698.html
Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 include/uapi/linux/bpf.h | 33 -
 net/core/filter.c| 45 +
 2 files changed, 77 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 8daef73..83a95ae 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1801,6 +1801,30 @@ union bpf_attr {
  * Return
  * a non-negative value equal to or less than size on success, or
  * a negative error in case of failure.
+ *
+ * int skb_load_bytes_relative(const struct sk_buff *skb, u32 offset, void 
*to, u32 len, u32 start_header)
+ * Description
+ * This helper is similar to **bpf_skb_load_bytes**\ () in that
+ * it provides an easy way to load *len* bytes from *offset*
+ * from the packet associated to *skb*, into the buffer pointed
+ * by *to*. The difference to **bpf_skb_load_bytes**\ () is that
+ * a fifth argument *start_header* exists in order to select a
+ * base offset to start from. *start_header* can be one of:
+ *
+ * **BPF_HDR_START_MAC**
+ * Base offset to load data from is *skb*'s mac header.
+ * **BPF_HDR_START_NET**
+ * Base offset to load data from is *skb*'s network header.
+ *
+ * In general, "direct packet access" is the preferred method to
+ * access packet data, however, this helper is in particular useful
+ * in socket filters where *skb*\ **->data** does not always point
+ * to the start of the mac header and where "direct packet access"
+ * is not available.
+ *
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -1870,7 +1894,8 @@ union bpf_attr {
FN(bind),   \
FN(xdp_adjust_tail),\
FN(skb_get_xfrm_state), \
-   FN(get_stack),
+   FN(get_stack),  \
+   FN(skb_load_bytes_relative),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -1931,6 +1956,12 @@ enum bpf_adj_room_mode {
BPF_ADJ_ROOM_NET,
 };
 
+/* Mode for BPF_FUNC_skb_load_bytes_relative helper. */
+enum bpf_hdr_start_off {
+   BPF_HDR_START_MAC,
+   BPF_HDR_START_NET,
+};
+
 /* user accessible mirror of in-kernel sk_buff.
  * new fields can only be added to the end of this structure
  */
diff --git a/net/core/filter.c b/net/core/filter.c
index 3159f53..516ac1b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1678,6 +1678,47 @@ static const struct bpf_func_proto 
bpf_skb_load_bytes_proto = {
.arg4_type  = ARG_CONST_SIZE,
 };
 
+BPF_CALL_5(bpf_skb_load_bytes_relative, const struct sk_buff *, skb,
+  u32, offset, void *, to, u32, len, u32, start_header)
+{
+   u8 *ptr;
+
+   if (unlikely(offset > 0xffff || len > skb_headlen(skb)))
+   goto err_clear;
+
+   switch (start_header) {
+   case BPF_HDR_START_MAC:
+   ptr = skb_mac_header(skb) + offset;
+   break;
+   case BPF_HDR_START_NET:
+   ptr = skb_network_header(skb) + offset;
+   break;
+   default:
+   goto err_clear;
+   }
+
+   if (likely(ptr >= skb_mac_header(skb) &&
+  ptr + len <= skb_tail_pointer(skb))) {
+   memcpy(to, ptr, len);
+   return 0;
+   }
+
+err_clear:
+   memset(to, 0, len);
+   return -EFAULT;
+}
+
+static const struct bpf_func_proto bpf_skb_load_bytes_relative_proto = {
+   .func   = bpf_skb_load_bytes_relative,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+   .arg2_type  = ARG_ANYTHING,
+   .arg3_type  = ARG_PTR_TO_UNINIT_MEM,
+   .arg4_type  = ARG_CONST_SIZE,
+   .arg5_type  = ARG_ANYTHING,
+};
+
 BPF_CALL_2(bpf_skb_pull_data, struct sk_buff *, skb, u32, len)
 {
/* Idea is the following: should the needed direct read/write
@@ -4028,6 

[PATCH bpf-next 01/12] bpf: prefix cbpf internal helpers with bpf_

2018-05-02 Thread Daniel Borkmann
No change in functionality, just remove the '__' prefix and replace it
with a 'bpf_' prefix instead. We later on add a couple of more helpers
for cBPF and keeping the scheme with '__' is suboptimal there.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 net/core/filter.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index d3781da..07fe378 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -112,12 +112,12 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff 
*skb, unsigned int cap)
 }
 EXPORT_SYMBOL(sk_filter_trim_cap);
 
-BPF_CALL_1(__skb_get_pay_offset, struct sk_buff *, skb)
+BPF_CALL_1(bpf_skb_get_pay_offset, struct sk_buff *, skb)
 {
return skb_get_poff(skb);
 }
 
-BPF_CALL_3(__skb_get_nlattr, struct sk_buff *, skb, u32, a, u32, x)
+BPF_CALL_3(bpf_skb_get_nlattr, struct sk_buff *, skb, u32, a, u32, x)
 {
struct nlattr *nla;
 
@@ -137,7 +137,7 @@ BPF_CALL_3(__skb_get_nlattr, struct sk_buff *, skb, u32, a, 
u32, x)
return 0;
 }
 
-BPF_CALL_3(__skb_get_nlattr_nest, struct sk_buff *, skb, u32, a, u32, x)
+BPF_CALL_3(bpf_skb_get_nlattr_nest, struct sk_buff *, skb, u32, a, u32, x)
 {
struct nlattr *nla;
 
@@ -161,13 +161,13 @@ BPF_CALL_3(__skb_get_nlattr_nest, struct sk_buff *, skb, 
u32, a, u32, x)
return 0;
 }
 
-BPF_CALL_0(__get_raw_cpu_id)
+BPF_CALL_0(bpf_get_raw_cpu_id)
 {
return raw_smp_processor_id();
 }
 
 static const struct bpf_func_proto bpf_get_raw_smp_processor_id_proto = {
-   .func   = __get_raw_cpu_id,
+   .func   = bpf_get_raw_cpu_id,
.gpl_only   = false,
.ret_type   = RET_INTEGER,
 };
@@ -317,16 +317,16 @@ static bool convert_bpf_extensions(struct sock_filter *fp,
/* Emit call(arg1=CTX, arg2=A, arg3=X) */
switch (fp->k) {
case SKF_AD_OFF + SKF_AD_PAY_OFFSET:
-   *insn = BPF_EMIT_CALL(__skb_get_pay_offset);
+   *insn = BPF_EMIT_CALL(bpf_skb_get_pay_offset);
break;
case SKF_AD_OFF + SKF_AD_NLATTR:
-   *insn = BPF_EMIT_CALL(__skb_get_nlattr);
+   *insn = BPF_EMIT_CALL(bpf_skb_get_nlattr);
break;
case SKF_AD_OFF + SKF_AD_NLATTR_NEST:
-   *insn = BPF_EMIT_CALL(__skb_get_nlattr_nest);
+   *insn = BPF_EMIT_CALL(bpf_skb_get_nlattr_nest);
break;
case SKF_AD_OFF + SKF_AD_CPU:
-   *insn = BPF_EMIT_CALL(__get_raw_cpu_id);
+   *insn = BPF_EMIT_CALL(bpf_get_raw_cpu_id);
break;
case SKF_AD_OFF + SKF_AD_RANDOM:
*insn = BPF_EMIT_CALL(bpf_user_rnd_u32);
-- 
2.9.5



pull-request: bpf 2018-05-03

2018-05-02 Thread Daniel Borkmann
Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Several BPF sockmap fixes mostly related to bugs in error path
   handling, that is, a bug in updating the scatterlist length /
   offset accounting, a missing sk_mem_uncharge() in redirect
   error handling, and a bug where the outstanding bytes counter
   sg_size was not zeroed, from John.

2) Fix two memory leaks in the x86-64 BPF JIT, one in an error
   path where we still don't converge after image was allocated
   and another one where BPF calls are used and JIT passes don't
   converge, from Daniel.

3) Minor fix in BPF selftests where in test_stacktrace_build_id()
   we drop useless args in urandom_read and we need to add a missing
   newline in a CHECK() error message, from Song.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!



The following changes since commit 25eb0ea7174c6e84f21fa59dccbddd0318b17b12:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf (2018-04-25 
22:55:33 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to b5b6ff730253ab68ec230e239c4245cb1e8a5397:

  Merge branch 'bpf-sockmap-fixes' (2018-05-02 15:30:46 -0700)


Alexei Starovoitov (2):
  Merge branch 'x86-bpf-jit-fixes'
  Merge branch 'bpf-sockmap-fixes'

Daniel Borkmann (2):
  bpf, x64: fix memleak when not converging after image
  bpf, x64: fix memleak when not converging on calls

John Fastabend (4):
  bpf: fix uninitialized variable in bpf tools
  bpf: sockmap, fix scatterlist update on error path in send with apply
  bpf: sockmap, zero sg_size on error when buffer is released
  bpf: sockmap, fix error handling in redirect failures

Song Liu (1):
  bpf: minor fix to selftest test_stacktrace_build_id()

 arch/x86/net/bpf_jit_comp.c  |  6 ++--
 kernel/bpf/sockmap.c | 48 +---
 tools/bpf/bpf_dbg.c  |  7 +++--
 tools/testing/selftests/bpf/test_progs.c |  4 +--
 4 files changed, 36 insertions(+), 29 deletions(-)


Re: [v2 PATCH 1/1] tg3: fix meaningless hw_stats reading after tg3_halt memset 0 hw_stats

2018-05-02 Thread Zumeng Chen

On 2018-05-03 01:32, Michael Chan wrote:

On Wed, May 2, 2018 at 3:27 AM, Zumeng Chen  wrote:

On 2018-05-02 13:12, Michael Chan wrote:

On Tue, May 1, 2018 at 5:42 PM, Zumeng Chen  wrote:


diff --git a/drivers/net/ethernet/broadcom/tg3.h
b/drivers/net/ethernet/broadcom/tg3.h
index 3b5e98e..c61d83c 100644
--- a/drivers/net/ethernet/broadcom/tg3.h
+++ b/drivers/net/ethernet/broadcom/tg3.h
@@ -3102,6 +3102,7 @@ enum TG3_FLAGS {
  TG3_FLAG_ROBOSWITCH,
  TG3_FLAG_ONE_DMA_AT_ONCE,
  TG3_FLAG_RGMII_MODE,
+   TG3_FLAG_HALT,

I think you should be able to use the existing INIT_COMPLETE flag


No, it will bring uncertain factors into the existing, complicated
logic of INIT_COMPLETE.
And I think the logic here is very simple: it fixes the meaningless
hw_stats reading and the problem of commit f5992b72. I even suspect
you have not read the INIT_COMPLETE-related code carefully.


We should use an existing flag whenever appropriate


I disagree. This is sort of blahblah...

, instead of adding
yet another flag to do similar things. I've looked at the code briefly
and believe that INIT_COMPLETE will work.


When we fix a problem, we'd better consider whether we're introducing a new one.


   If you think it won't work,
please be specific and point out why it won't work.  Thanks.


I don't care whether it works or not; I simply feel it's a bad idea.
INIT_COMPLETE involves a lot of network stuff; it's not simply
related to hardware reset.


Here again,

My fix is very simple and addresses the problem I hit. I think this is
how Linux holds together out of so many pieces: each unit stays clear,
simple, and robust, and we then bring those units together as a whole.

Finally, it's yours, so be it.

Cheers,
Zumeng




Re: Silently dropped UDP packets on kernel 4.14

2018-05-02 Thread Kristian Evensen
Hello,

On Wed, May 2, 2018 at 12:42 AM, Kristian Evensen
 wrote:
> My knowledge of the conntrack/nat subsystem is not that great, and I
> don't know the implications of what I am about to suggest. However,
> considering that the two packets represent the same flow, wouldn't it
> be possible to apply the existing nat-mapping to the second packet,
> and then let the second packet pass?

I have spent the day today trying to solve my problem and I think I am
almost there. I have attached my work in progress patch to this email
if anyone wants to take a look (for kernel 4.14).

I went for the early-insert approach and have patched
nfnetlink_queue to perform an insert if no conntrack entry is found
when the verdict is passed from user-space. If a conntrack entry is
found, then I replace the ct attached to the skb with the existing
conntrack entry. I have verified that my approach works by
artificially delaying the verdict from my application for the second
packet (so that the first packet has passed all netfilter hooks).
After replacing the ct, the second packet is handled correctly and
sent over the wire.

However, something goes wrong after the conntrack entry has been
inserted (using nf_conntrack_hash_check_insert()). At some random (but
very short) time after I see a couple of "early insert ..."/"early
insert confirmed" messages, I get an RCU stall. The trace for the stall looks
as follows:

[  105.420024] INFO: rcu_sched self-detected stall on CPU
[  105.425191]  2-...: (5999 ticks this GP) idle=12a/141/0
softirq=2674/2674 fqs=2543
[  105.433845]   (t=6001 jiffies g=587 c=586 q=5896)
[  105.438545] CPU: 2 PID: 3632 Comm: dlb Not tainted 4.14.36 #0
[  105.444261] Stack :     805c7ada
0031  8052c588
[  105.452610] 8fd48ff4 805668e7 804f5a2c 0002 0e30
0001 8fc11d20 0007
[  105.460957]   805c 7448 
016e 0007 
[  105.469303]  8057 0006b111  8000
 8059 804608e4
[  105.477650] 8056c2c0 8056408c 00e0 8056 
8027c418 0008 805c0008
[  105.485996] ...
[  105.488437] Call Trace:
[  105.490909] [<800103f8>] show_stack+0x58/0x100
[  105.495363] [<8043c1dc>] dump_stack+0x9c/0xe0
[  105.499707] [<8000d938>] arch_trigger_cpumask_backtrace+0x50/0x78
[  105.505785] [<80083540>] rcu_dump_cpu_stacks+0xc4/0x134
[  105.510991] [<800829c4>] rcu_check_callbacks+0x310/0x814
[  105.516295] [<80085f04>] update_process_times+0x34/0x70
[  105.521522] [<8009691c>] tick_handle_periodic+0x34/0xd0
[  105.526749] [<802f6e98>] gic_compare_interrupt+0x48/0x58
[  105.532047] [<80076450>] handle_percpu_devid_irq+0xbc/0x1a8
[  105.537619] [<800707c0>] generic_handle_irq+0x40/0x58
[  105.542679] [<8023a274>] gic_handle_local_int+0x84/0xd0
[  105.547886] [<8023a434>] gic_irq_dispatch+0x10/0x20
[  105.552747] [<800707c0>] generic_handle_irq+0x40/0x58
[  105.557806] [<804591b4>] do_IRQ+0x1c/0x2c
[  105.561805] [<802394bc>] plat_irq_dispatch+0xfc/0x138
[  105.566839] [<8000b508>] except_vec_vi_end+0xb8/0xc4
[  105.571904] [<8f200ca4>] nf_conntrack_lock+0x28c/0x440 [nf_conntrack]
[  105.578341] [ cut here ]
[  105.582948] WARNING: CPU: 2 PID: 3632 at kernel/smp.c:416
smp_call_function_many+0xc8/0x3bc
[  105.591257] Modules linked in: rt2800pci rt2800mmio rt2800lib
qcserial ppp_async option usb_wwan rt2x00pci rt2x00mmio rt2x00lib
rndis_host qmi_wwan ppp_generic nf_nat_pptp nf_conntrack_pptp
nf_conntrack_ipv6p
[  105.662308]  nf_nat_snmp_basic nf_nat_sip nf_nat_redirect
nf_nat_proto_gre nf_nat_masquerade_ipv4 nf_nat_irc nf_conntrack_ipv4
nf_nat_ipv4 nf_nat_h323 nf_nat_ftp nf_nat_amanda nf_nat nf_log_ipv4
nf_flow_tablt
[  105.733288]  ip_set_hash_netiface ip_set_hash_netport
ip_set_hash_netnet ip_set_hash_net ip_set_hash_netportnet
ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip
ip_set_hash_ipport ip_set_hash_ipmarm
[  105.804724]  ohci_hcd ehci_platform sd_mod scsi_mod ehci_hcd
gpio_button_hotplug usbcore nls_base usb_common mii
[  105.814899] CPU: 2 PID: 3632 Comm: dlb Not tainted 4.14.36 #0
[  105.820615] Stack :     805c7ada
0031  8052c588
[  105.828961] 8fd48ff4 805668e7 804f5a2c 0002 0e30
0001 8fc11c88 0007
[  105.837308]   805c 8590 
018d 0007 
[  105.845654]  8057 000c6f33  8000
 8059 8009e304
[  105.854001] 0009 01a0 8000d0fc  
8027c418 0008 805c0008
[  105.862348] ...
[  105.864786] Call Trace:
[  105.867230] [<800103f8>] show_stack+0x58/0x100
[  105.871662] [<8043c1dc>] dump_stack+0x9c/0xe0
[  105.876009] [<8002e190>] __warn+0xe0/0x114
[  105.880091] [<8002e254>] warn_slowpath_null+0x1c/0x28
[  105.885124] [<8009e304>] smp_call_function_many+0xc8/0x3bc
[  105.890591] 

Re: [PATCH] sctp: fix a potential missing-check bug

2018-05-02 Thread Marcelo Ricardo Leitner
Hi Wenwen,

On Wed, May 02, 2018 at 05:12:45PM -0500, Wenwen Wang wrote:
> In sctp_setsockopt_maxseg(), the integer 'val' is compared against min_len
> and max_len to check whether it is in the appropriate range. If it is not,
> an error code -EINVAL will be returned. This is enforced by a security
> check. But, this check is only executed when 'val' is not 0. In fact, if

Which makes sense, no? Especially considering that 0 should be an
allowed value, as it turns off the user limit.

> 'val' is 0, it will be assigned with a new value (if the return value of
> the function sctp_id2assoc() is not 0) in the following execution. However,
> this new value of 'val' is not checked before it is assigned to

Which 'new value'? val is not set to something new during the
function. It always contains the user supplied value.

> asoc->user_frag. That means it is possible that the new value of 'val'
> could be out of the expected range. This can cause security issues
> such as buffer overflows, e.g., the new value of 'val' is used as an index
> to access a buffer.
>
> This patch inserts a check for the new value of 'val' to see if it is in
> the expected range. If it is not, an error code -EINVAL will be returned.
>
> Signed-off-by: Wenwen Wang 
> ---
>  net/sctp/socket.c | 21 ++---
>  1 file changed, 10 insertions(+), 11 deletions(-)
>
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 80835ac..2beb601 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -3212,6 +3212,7 @@ static int sctp_setsockopt_maxseg(struct sock *sk, char 
> __user *optval, unsigned
>   struct sctp_af *af = sp->pf->af;
>   struct sctp_assoc_value params;
>   struct sctp_association *asoc;
> + int min_len, max_len;
>   int val;
>
>   if (optlen == sizeof(int)) {
> @@ -3231,19 +3232,15 @@ static int sctp_setsockopt_maxseg(struct sock *sk, 
> char __user *optval, unsigned
>   return -EINVAL;
>   }
>
> - if (val) {
> - int min_len, max_len;
> + min_len = SCTP_DEFAULT_MINSEGMENT - af->net_header_len;
> + min_len -= af->ip_options_len(sk);
> + min_len -= sizeof(struct sctphdr) +
> +sizeof(struct sctp_data_chunk);

Which tree did you base your patch on? Your patch lacks a tag, so it
defaults to net-next, and I reworked this section on current net-next;
these MTU calculations are now handled by sctp_mtu_payload().

But even for the net tree, I don't understand which issue you're fixing
here. Actually, it seems to me that both versions do the same
thing.

>
> - min_len = SCTP_DEFAULT_MINSEGMENT - af->net_header_len;
> - min_len -= af->ip_options_len(sk);
> - min_len -= sizeof(struct sctphdr) +
> -sizeof(struct sctp_data_chunk);
> + max_len = SCTP_MAX_CHUNK_LEN - sizeof(struct sctp_data_chunk);
>
> - max_len = SCTP_MAX_CHUNK_LEN - sizeof(struct sctp_data_chunk);
> -
> - if (val < min_len || val > max_len)
> - return -EINVAL;
> - }
> + if (val && (val < min_len || val > max_len))
> + return -EINVAL;
>
>   asoc = sctp_id2assoc(sk, params.assoc_id);
>   if (asoc) {
> @@ -3253,6 +3250,8 @@ static int sctp_setsockopt_maxseg(struct sock *sk, char 
> __user *optval, unsigned
>   val -= sizeof(struct sctphdr) +
>  sctp_datachk_len(>stream);
>   }
> + if (val < min_len || val > max_len)
> + return -EINVAL;
>   asoc->user_frag = val;
>   asoc->frag_point = sctp_frag_point(asoc, asoc->pathmtu);
>   } else {
> --
> 2.7.4
>


[PATCH v2 bpf-next 2/2] bpf: add selftest for stackmap with build_id in NMI context

2018-05-02 Thread Song Liu
This new test captures stackmap with build_id with hardware event
PERF_COUNT_HW_CPU_CYCLES.

Because we only support one ips-to-build_id lookup per cpu in NMI
context, stack_amap will not be able to do the lookup in this test.
Therefore, we don't do compare_stack_ips(), as it will always fail.

urandom_read.c is extended to run a configurable number of cycles so that it
can be caught by the perf event.

Signed-off-by: Song Liu 
---
 tools/testing/selftests/bpf/test_progs.c   | 137 +
 tools/testing/selftests/bpf/urandom_read.c |  10 ++-
 2 files changed, 145 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_progs.c 
b/tools/testing/selftests/bpf/test_progs.c
index aa336f0..00bb08c 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -1272,6 +1272,142 @@ static void test_stacktrace_build_id(void)
return;
 }
 
+static void test_stacktrace_build_id_nmi(void)
+{
+   int control_map_fd, stackid_hmap_fd, stackmap_fd, stack_amap_fd;
+   const char *file = "./test_stacktrace_build_id.o";
+   int err, pmu_fd, prog_fd;
+   struct perf_event_attr attr = {
+   .sample_freq = 5000,
+   .freq = 1,
+   .type = PERF_TYPE_HARDWARE,
+   .config = PERF_COUNT_HW_CPU_CYCLES,
+   };
+   __u32 key, previous_key, val, duration = 0;
+   struct bpf_object *obj;
+   char buf[256];
+   int i, j;
+   struct bpf_stack_build_id id_offs[PERF_MAX_STACK_DEPTH];
+   int build_id_matches = 0;
+
+   err = bpf_prog_load(file, BPF_PROG_TYPE_PERF_EVENT, , _fd);
+   if (CHECK(err, "prog_load", "err %d errno %d\n", err, errno))
+   goto out;
+
+   pmu_fd = syscall(__NR_perf_event_open, , -1 /* pid */,
+0 /* cpu 0 */, -1 /* group id */,
+0 /* flags */);
+   if (CHECK(pmu_fd < 0, "perf_event_open",
+ "err %d errno %d. Does the test host support 
PERF_COUNT_HW_CPU_CYCLES?\n",
+ pmu_fd, errno))
+   goto close_prog;
+
+   err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
+   if (CHECK(err, "perf_event_ioc_enable", "err %d errno %d\n",
+ err, errno))
+   goto close_pmu;
+
+   err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
+   if (CHECK(err, "perf_event_ioc_set_bpf", "err %d errno %d\n",
+ err, errno))
+   goto disable_pmu;
+
+   /* find map fds */
+   control_map_fd = bpf_find_map(__func__, obj, "control_map");
+   if (CHECK(control_map_fd < 0, "bpf_find_map control_map",
+ "err %d errno %d\n", err, errno))
+   goto disable_pmu;
+
+   stackid_hmap_fd = bpf_find_map(__func__, obj, "stackid_hmap");
+   if (CHECK(stackid_hmap_fd < 0, "bpf_find_map stackid_hmap",
+ "err %d errno %d\n", err, errno))
+   goto disable_pmu;
+
+   stackmap_fd = bpf_find_map(__func__, obj, "stackmap");
+   if (CHECK(stackmap_fd < 0, "bpf_find_map stackmap", "err %d errno %d\n",
+ err, errno))
+   goto disable_pmu;
+
+   stack_amap_fd = bpf_find_map(__func__, obj, "stack_amap");
+   if (CHECK(stack_amap_fd < 0, "bpf_find_map stack_amap",
+ "err %d errno %d\n", err, errno))
+   goto disable_pmu;
+
+   assert(system("dd if=/dev/urandom of=/dev/zero count=4 2> /dev/null")
+  == 0);
+   assert(system("taskset 0x1 ./urandom_read 10") == 0);
+   /* disable stack trace collection */
+   key = 0;
+   val = 1;
+   bpf_map_update_elem(control_map_fd, , , 0);
+
+   /* for every element in stackid_hmap, we can find a corresponding one
+* in stackmap, and vise versa.
+*/
+   err = compare_map_keys(stackid_hmap_fd, stackmap_fd);
+   if (CHECK(err, "compare_map_keys stackid_hmap vs. stackmap",
+ "err %d errno %d\n", err, errno))
+   goto disable_pmu;
+
+   err = compare_map_keys(stackmap_fd, stackid_hmap_fd);
+   if (CHECK(err, "compare_map_keys stackmap vs. stackid_hmap",
+ "err %d errno %d\n", err, errno))
+   goto disable_pmu;
+
+   err = extract_build_id(buf, 256);
+
+   if (CHECK(err, "get build_id with readelf",
+ "err %d errno %d\n", err, errno))
+   goto disable_pmu;
+
+   err = bpf_map_get_next_key(stackmap_fd, NULL, );
+   if (CHECK(err, "get_next_key from stackmap",
+ "err %d, errno %d\n", err, errno))
+   goto disable_pmu;
+
+   do {
+   char build_id[64];
+
+   err = bpf_map_lookup_elem(stackmap_fd, , id_offs);
+   if (CHECK(err, "lookup_elem from stackmap",
+ "err %d, errno %d\n", err, errno))
+   goto disable_pmu;
+   for (i = 0; i < 

[PATCH v2 bpf-next 1/2] bpf: enable stackmap with build_id in nmi context

2018-05-02 Thread Song Liu
Currently, we cannot parse build_id in nmi context because of
up_read(&current->mm->mmap_sem); this makes stackmap with build_id
less useful. This patch enables parsing build_id in nmi by putting
the up_read() call in an irq_work. To avoid memory allocation in nmi
context, we use a per-cpu variable for the irq_work. As a result, only
one irq_work per cpu is allowed. If the irq_work is in use, we
fall back to only reporting ips.

Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Peter Zijlstra 
Signed-off-by: Song Liu 
---
 init/Kconfig  |  1 +
 kernel/bpf/stackmap.c | 59 +--
 2 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index f013afc..480a4f2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1391,6 +1391,7 @@ config BPF_SYSCALL
bool "Enable bpf() system call"
select ANON_INODES
select BPF
+   select IRQ_WORK
default n
help
  Enable the bpf() system call that allows to manipulate eBPF
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 3ba102b..51d4aea 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "percpu_freelist.h"
 
 #define STACK_CREATE_FLAG_MASK \
@@ -32,6 +33,23 @@ struct bpf_stack_map {
struct stack_map_bucket *buckets[];
 };
 
+/* irq_work to run up_read() for build_id lookup in nmi context */
+struct stack_map_irq_work {
+   struct irq_work irq_work;
+   struct rw_semaphore *sem;
+};
+
+static void do_up_read(struct irq_work *entry)
+{
+   struct stack_map_irq_work *work = container_of(entry,
+   struct stack_map_irq_work, irq_work);
+
+   up_read(work->sem);
+   work->sem = NULL;
+}
+
+static DEFINE_PER_CPU(struct stack_map_irq_work, up_read_work);
+
 static inline bool stack_map_use_build_id(struct bpf_map *map)
 {
return (map->map_flags & BPF_F_STACK_BUILD_ID);
@@ -267,17 +285,27 @@ static void stack_map_get_build_id_offset(struct 
bpf_stack_build_id *id_offs,
 {
int i;
struct vm_area_struct *vma;
+   bool in_nmi_ctx = in_nmi();
+   bool irq_work_busy = false;
+   struct stack_map_irq_work *work;
+
+   if (in_nmi_ctx) {
+   work = this_cpu_ptr(_read_work);
+   if (work->irq_work.flags & IRQ_WORK_BUSY)
+   /* cannot queue more up_read, fallback */
+   irq_work_busy = true;
+   }
 
/*
-* We cannot do up_read() in nmi context, so build_id lookup is
-* only supported for non-nmi events. If at some point, it is
-* possible to run find_vma() without taking the semaphore, we
-* would like to allow build_id lookup in nmi context.
+* We cannot do up_read() in nmi context. To do build_id lookup
+* in nmi context, we need to run up_read() in irq_work. We use
+* a percpu variable to do the irq_work. If the irq_work is
+* already used by another lookup, we fall back to report ips.
 *
 * Same fallback is used for kernel stack (!user) on a stackmap
 * with build_id.
 */
-   if (!user || !current || !current->mm || in_nmi() ||
+   if (!user || !current || !current->mm || irq_work_busy ||
down_read_trylock(>mm->mmap_sem) == 0) {
/* cannot access current->mm, fall back to ips */
for (i = 0; i < trace_nr; i++) {
@@ -299,7 +327,13 @@ static void stack_map_get_build_id_offset(struct 
bpf_stack_build_id *id_offs,
- vma->vm_start;
id_offs[i].status = BPF_STACK_BUILD_ID_VALID;
}
-   up_read(>mm->mmap_sem);
+
+   if (!in_nmi_ctx)
+   up_read(>mm->mmap_sem);
+   else {
+   work->sem = >mm->mmap_sem;
+   irq_work_queue(>irq_work);
+   }
 }
 
 BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
@@ -575,3 +609,16 @@ const struct bpf_map_ops stack_map_ops = {
.map_update_elem = stack_map_update_elem,
.map_delete_elem = stack_map_delete_elem,
 };
+
+static int __init stack_map_init(void)
+{
+   int cpu;
+   struct stack_map_irq_work *work;
+
+   for_each_possible_cpu(cpu) {
+   work = per_cpu_ptr(_read_work, cpu);
+   init_irq_work(>irq_work, do_up_read);
+   }
+   return 0;
+}
+subsys_initcall(stack_map_init);
-- 
2.9.5



[PATCH v2 bpf-next 0/2] bpf: enable stackmap with build_id in nmi

2018-05-02 Thread Song Liu
Changes v1 -> v2:
  1. Rename some variables to (hopefully) reduce confusion;
  2. Check irq_work status with IRQ_WORK_BUSY (instead of work->sem);
  3. In Kconfig, let BPF_SYSCALL select IRQ_WORK;
  4. Add static to DEFINE_PER_CPU();
  5. Remove pr_info() in stack_map_init().

Song Liu (2):
  bpf: enable stackmap with build_id in nmi context
  bpf: add selftest for stackmap with build_id in NMI context

 init/Kconfig   |   1 +
 kernel/bpf/stackmap.c  |  59 +++--
 tools/testing/selftests/bpf/test_progs.c   | 137 +
 tools/testing/selftests/bpf/urandom_read.c |  10 ++-
 4 files changed, 199 insertions(+), 8 deletions(-)

--
2.9.5


Re: [PATCH net] ipv4: fix fnhe usage by non-cached routes

2018-05-02 Thread David Ahern
On 5/2/18 12:41 AM, Julian Anastasov wrote:
> Allow some non-cached routes to use non-expired fnhe:
> 
> 1. ip_del_fnhe: moved above and now called by find_exception.
> The 4.5+ commit deed49df7390 expires fnhe only when caching
> routes. Change that to:
> 
> 1.1. use fnhe for non-cached local output routes, with the help
> from (2)
> 
> 1.2. allow __mkroute_input to detect expired fnhe (outdated
> fnhe_gw, for example) when do_cache is false, eg. when itag!=0
> for unicast destinations.
> 
> 2. __mkroute_output: keep fi to allow local routes with orig_oif != 0
> to use fnhe info even when the new route will not be cached into fnhe.
> After commit 839da4d98960 ("net: ipv4: set orig_oif based on fib
> result for local traffic") it means all local routes will be affected
> because they are not cached. This change is used to solve a PMTU
> problem with IPVS (and probably Netfilter DNAT) setups that redirect
> local clients from target local IP (local route to Virtual IP)
> to new remote IP target, eg. IPVS TUN real server. Loopback has
> 64K MTU and we need to create fnhe on the local route that will
> keep the reduced PMTU for the Virtual IP. Without this change
> fnhe_pmtu is updated from ICMP but never exposed to non-cached
> local routes. This includes routes with flowi4_oif!=0 for 4.6+ and
> with flowi4_oif=any for 4.14+.

Can you add a test case to tools/testing/selftests/net/pmtu.sh to cover
this situation?


> @@ -1310,8 +1340,14 @@ static struct fib_nh_exception *find_exception(struct 
> fib_nh *nh, __be32 daddr)
>  
>   for (fnhe = rcu_dereference(hash[hval].chain); fnhe;
>fnhe = rcu_dereference(fnhe->fnhe_next)) {
> - if (fnhe->fnhe_daddr == daddr)
> + if (fnhe->fnhe_daddr == daddr) {
> + if (fnhe->fnhe_expires &&
> + time_after(jiffies, fnhe->fnhe_expires)) {
> + ip_del_fnhe(nh, daddr);

I'm surprised this is done in the fast path vs gc time. (the existing
code does as well; your change is only moving the call to make the input
and output paths the same)


The change looks correct to me and all of my functional tests passed.

Acked-by: David Ahern 


Re: Performance regressions in TCP_STREAM tests in Linux 4.15 (and later)

2018-05-02 Thread Eric Dumazet


On 05/02/2018 02:47 PM, Michael Wenig wrote:
> After applying Eric's proposed change (see below) to a 4.17 RC3 kernel, the 
> regressions that we had observed in our TCP_STREAM small message tests with 
> TCP_NODELAY enabled are now drastically reduced. Instead of the original 3x 
> thruput and cpu cost regressions, the regression depth is now < 10% for 
> thruput and between 10% - 20% for cpu cost. The improvements in the TCP_RR 
> tests that we had observed after Eric's original commit are not impacted by 
> the change. It would be great if this change could make it into a patch.
> 


Thanks for all the testing; I will submit this patch after more tests on my
side.



> Michael Wenig
> VMware Performance Engineering 
> 
> -Original Message-
> From: Eric Dumazet [mailto:eric.duma...@gmail.com] 
> Sent: Monday, April 30, 2018 10:48 AM
> To: Ben Greear ; Steven Rostedt 
> ; Michael Wenig 
> Cc: netdev@vger.kernel.org; Shilpi Agarwal ; Boon Ang 
> ; Darren Hart ; Steven Rostedt 
> ; Abdul Anshad Azeez 
> Subject: Re: Performance regressions in TCP_STREAM tests in Linux 4.15 (and 
> later)
> 
> 
> 
> On 04/30/2018 09:36 AM, Eric Dumazet wrote:
>>
>>
>> On 04/30/2018 09:14 AM, Ben Greear wrote:
>>> On 04/27/2018 08:11 PM, Steven Rostedt wrote:

 We'd like this email archived in netdev list, but since netdev is 
 notorious for blocking outlook email as spam, it didn't go through. 
 So I'm replying here to help get it into the archives.

 Thanks!

 -- Steve


 On Fri, 27 Apr 2018 23:05:46 +
 Michael Wenig  wrote:

> As part of VMware's performance testing with the Linux 4.15 kernel, 
> we identified CPU cost and throughput regressions when comparing to 
> the Linux 4.14 kernel. The impacted test cases are mostly 
> TCP_STREAM send tests when using small message sizes. The 
> regressions are significant (up 3x) and were tracked down to be a 
> side effect of Eric Dumazat's RB tree changes that went into the Linux 
> 4.15 kernel.
> Further investigation showed our use of the TCP_NODELAY flag in 
> conjunction with Eric's change caused the regressions to show and 
> simply disabling TCP_NODELAY brought performance back to normal.
> Eric's change also resulted into significant improvements in our 
> TCP_RR test cases.
>
>
>
> Based on these results, our theory is that Eric's change made the 
> system overall faster (reduced latency) but as a side effect less 
> aggregation is happening (with TCP_NODELAY) and that results in 
> lower throughput. Previously even though TCP_NODELAY was set, 
> system was slower and we still got some benefit of aggregation. 
> Aggregation helps in better efficiency and higher throughput 
> although it can increase the latency. If you are seeing a 
> regression in your application throughput after this change, using 
> TCP_NODELAY might help bring performance back however that might increase 
> latency.
>>>
>>> I guess you mean _disabling_ TCP_NODELAY instead of _using_ TCP_NODELAY?
>>>
>>
>> Yeah, I guess auto-corking does not work as intended.
> 
> I would try the following patch :
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 
> 44be7f43455e4aefde8db61e2d941a69abcc642a..c9d00ef54deca15d5760bcbe154001a96fa1e2a7
>  100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -697,7 +697,7 @@ static bool tcp_should_autocork(struct sock *sk, struct 
> sk_buff *skb,  {
> return skb->len < size_goal &&
>sock_net(sk)->ipv4.sysctl_tcp_autocorking &&
> -  skb != tcp_write_queue_head(sk) &&
> +  !tcp_rtx_queue_empty(sk) &&
>refcount_read(>sk_wmem_alloc) > skb->truesize;  }
>  
> 


Re: [bpf PATCH v2 0/3] sockmap error path fixes

2018-05-02 Thread Alexei Starovoitov
On Wed, May 02, 2018 at 01:50:14PM -0700, John Fastabend wrote:
> When I added the test_sockmap to selftests I mistakenly changed the
> test logic a bit. The result of this was on redirect cases we ended up
> choosing the wrong sock from the BPF program and ended up sending to a
> socket that had no receive handler. The result was the actual receive
> handler, running on a different socket, is timing out and closing the
> socket. This results in errors (-EPIPE to be specific) on the sending
> side. Typically happening if the sender does not complete the send
> before the receive side times out. So depending on timing and the size
> of the send we may get errors. This exposed some bugs in the sockmap
> error path handling.
> 
> This series fixes the errors. The primary issue is we did not do proper
> memory accounting in these cases which resulted in missing a
> sk_mem_uncharge(). This happened in the redirect path and in one case
> on the normal send path. See the three patches for the details.
> 
> The other take-away from this is we need to fix the test_sockmap and
> also add more negative test cases. That will happen in bpf-next.
> 
> Finally, I tested this using the existing test_sockmap program, the
> older sockmap sample test script, and a few real use cases with
> Cilium. All of these seem to be working correctly.
> 
> v2: fix compiler warning, drop iterator variable 'i' that is no longer
> used in patch 3.

Applied, Thanks.



Re: [PATCH v5] bpf, x86_32: add eBPF JIT compiler for ia32

2018-05-02 Thread Daniel Borkmann
Hi Wang,

On 04/29/2018 02:37 PM, Wang YanQing wrote:
> The JIT compiler emits IA-32 instructions. Currently it supports eBPF
> only; classic BPF is supported through the conversion done by the BPF core.
> 
> Almost all instructions from the eBPF ISA are supported, except the following:
> BPF_ALU64 | BPF_DIV | BPF_K
> BPF_ALU64 | BPF_DIV | BPF_X
> BPF_ALU64 | BPF_MOD | BPF_K
> BPF_ALU64 | BPF_MOD | BPF_X
> BPF_STX | BPF_XADD | BPF_W
> BPF_STX | BPF_XADD | BPF_DW
> 
> It doesn't support BPF_JMP|BPF_CALL with BPF_PSEUDO_CALL at the moment.
> 
> IA32 has few general-purpose registers: EAX|EDX|ECX|EBX|ESI|EDI. I use
> EAX|EDX|ECX|EBX as temporary registers to simulate instructions in the eBPF
> ISA, and allocate ESI|EDI to BPF_REG_AX for constant blinding; all other
> eBPF registers, R0-R10, are simulated through scratch space on the stack.
> 
[...]
> 
> The numbers show we get 30%~50% improvement.
> 
> See Documentation/networking/filter.txt for more information.
> 
> Signed-off-by: Wang YanQing 

Sorry for the delay. There's still a memory leak in this patch that I found
while reviewing; more below, along with how to fix it. Otherwise, a few small
nits that would be nice to address in the respin along with it.

> ---
>  Changes v4-v5:
>  1:Delete is_on_stack, BPF_REG_AX is the only one
>on real hardware registers, so just check with
>it.
>  2:Apply commit 1612a981b766 ("bpf, x64: fix JIT emission
>for dead code"), suggested by Daniel Borkmann.
>  
>  Changes v3-v4:
>  1:Fix changelog in commit.
>I installed llvm-6.0, and then test_progs no longer reports errors.
>I submitted another patch:
>"bpf: fix misaligned access for BPF_PROG_TYPE_PERF_EVENT program type on 
> x86_32 platform"
>to fix another problem; after that patch, test_verifier no longer reports
> errors either.
>  2:Fix clearing r0[1] twice unnecessarily in *BPF_IND|BPF_ABS* simulation.
> 
>  Changes v2-v3:
>  1:Move BPF_REG_AX to real hardware registers for performance reason.
>  3:Using bpf_load_pointer instead of bpf_jit32.S, suggested by Daniel 
> Borkmann.
>  4:Delete partial codes in 1c2a088a6626, suggested by Daniel Borkmann.
>  5:Some bug fixes and comments improvement.
> 
>  Changes v1-v2:
>  1:Fix bug in emit_ia32_neg64.
>  2:Fix bug in emit_ia32_arsh_r64.
>  3:Delete filename in top level comment, suggested by Thomas Gleixner.
>  4:Delete unnecessary boiler plate text, suggested by Thomas Gleixner.
>  5:Rewrite some words in changelog.
>  6:CodingSytle improvement and a little more comments.
> 
>  arch/x86/Kconfig |2 +-
>  arch/x86/include/asm/nospec-branch.h |   26 +-
>  arch/x86/net/Makefile|9 +-
>  arch/x86/net/bpf_jit_comp32.c| 2527 
> ++
>  4 files changed, 2559 insertions(+), 5 deletions(-)
>  create mode 100644 arch/x86/net/bpf_jit_comp32.c
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 00fcf81..1f5fa2f 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -137,7 +137,7 @@ config X86
>   select HAVE_DMA_CONTIGUOUS
>   select HAVE_DYNAMIC_FTRACE
>   select HAVE_DYNAMIC_FTRACE_WITH_REGS
> - select HAVE_EBPF_JITif X86_64
> + select HAVE_EBPF_JIT
>   select HAVE_EFFICIENT_UNALIGNED_ACCESS
>   select HAVE_EXIT_THREAD
>   select HAVE_FENTRY  if X86_64 || DYNAMIC_FTRACE
> diff --git a/arch/x86/include/asm/nospec-branch.h 
> b/arch/x86/include/asm/nospec-branch.h
> index f928ad9..a4c7ca4 100644
> --- a/arch/x86/include/asm/nospec-branch.h
> +++ b/arch/x86/include/asm/nospec-branch.h
> @@ -291,14 +291,17 @@ static inline void 
> indirect_branch_prediction_barrier(void)
>   *lfence
>   *jmp spec_trap
>   *  do_rop:
> - *mov %rax,(%rsp)
> + *mov %rax,(%rsp) for x86_64
> + *mov %edx,(%esp) for x86_32
>   *retq
>   *
>   * Without retpolines configured:
>   *
> - *jmp *%rax
> + *jmp *%rax for x86_64
> + *jmp *%edx for x86_32
>   */
>  #ifdef CONFIG_RETPOLINE
> +#ifdef CONFIG_X86_64
>  # define RETPOLINE_RAX_BPF_JIT_SIZE  17
>  # define RETPOLINE_RAX_BPF_JIT() \
>   EMIT1_off32(0xE8, 7);/* callq do_rop */ \
> @@ -310,9 +313,28 @@ static inline void 
> indirect_branch_prediction_barrier(void)
>   EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */\
>   EMIT1(0xC3); /* retq */
>  #else
> +# define RETPOLINE_EDX_BPF_JIT() \
> +do { \
> + EMIT1_off32(0xE8, 7);/* call do_rop */  \
> + /* spec_trap: */\
> + EMIT2(0xF3, 0x90);   /* pause */\
> + EMIT3(0x0F, 0xAE, 0xE8); /* lfence */   \
> + EMIT2(0xEB, 0xF9);   /* jmp spec_trap */\
> + /* do_rop: */   \
> + EMIT3(0x89, 0x14, 0x24); /* mov %edx,(%esp) */  \

[PATCH] sctp: fix a potential missing-check bug

2018-05-02 Thread Wenwen Wang
In sctp_setsockopt_maxseg(), the integer 'val' is compared against min_len
and max_len to check whether it is in the appropriate range. If it is not,
an error code -EINVAL will be returned. This is enforced by a security
check. But, this check is only executed when 'val' is not 0. In fact, if
'val' is 0, it will be assigned with a new value (if the return value of
the function sctp_id2assoc() is not 0) in the following execution. However,
this new value of 'val' is not checked before it is assigned to
asoc->user_frag. That means it is possible that the new value of 'val'
could be out of the expected range. This can cause security issues
such as buffer overflows, e.g., the new value of 'val' is used as an index
to access a buffer.

This patch inserts a check for the new value of 'val' to see if it is in
the expected range. If it is not, an error code -EINVAL will be returned.

Signed-off-by: Wenwen Wang 
---
 net/sctp/socket.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 80835ac..2beb601 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -3212,6 +3212,7 @@ static int sctp_setsockopt_maxseg(struct sock *sk, char 
__user *optval, unsigned
struct sctp_af *af = sp->pf->af;
struct sctp_assoc_value params;
struct sctp_association *asoc;
+   int min_len, max_len;
int val;
 
if (optlen == sizeof(int)) {
@@ -3231,19 +3232,15 @@ static int sctp_setsockopt_maxseg(struct sock *sk, char 
__user *optval, unsigned
return -EINVAL;
}
 
-   if (val) {
-   int min_len, max_len;
+   min_len = SCTP_DEFAULT_MINSEGMENT - af->net_header_len;
+   min_len -= af->ip_options_len(sk);
+   min_len -= sizeof(struct sctphdr) +
+  sizeof(struct sctp_data_chunk);
 
-   min_len = SCTP_DEFAULT_MINSEGMENT - af->net_header_len;
-   min_len -= af->ip_options_len(sk);
-   min_len -= sizeof(struct sctphdr) +
-  sizeof(struct sctp_data_chunk);
+   max_len = SCTP_MAX_CHUNK_LEN - sizeof(struct sctp_data_chunk);
 
-   max_len = SCTP_MAX_CHUNK_LEN - sizeof(struct sctp_data_chunk);
-
-   if (val < min_len || val > max_len)
-   return -EINVAL;
-   }
+   if (val && (val < min_len || val > max_len))
+   return -EINVAL;
 
asoc = sctp_id2assoc(sk, params.assoc_id);
if (asoc) {
@@ -3253,6 +3250,8 @@ static int sctp_setsockopt_maxseg(struct sock *sk, char 
__user *optval, unsigned
val -= sizeof(struct sctphdr) +
   sctp_datachk_len(>stream);
}
+   if (val < min_len || val > max_len)
+   return -EINVAL;
asoc->user_frag = val;
asoc->frag_point = sctp_frag_point(asoc, asoc->pathmtu);
} else {
-- 
2.7.4



[PATCH net-next] inet: add bound ports statistic

2018-05-02 Thread Stephen Hemminger
This adds a count of bound ports, which fixes the socket summary
command.  'ss -s' has been broken since the changes to slab info;
this is one way to recover the missing value, by adding a
field to /proc/net/sockstat.

Since this is an informational value only, there is no need
for locking.

Overhead of keeping count in hash bucket head is minimal.
It is cache hot already, and the same thing is already done for
listen buckets.

Signed-off-by: Stephen Hemminger 
---
v2 - use unsigned for count
 get rid of leftover increment

 include/net/inet_hashtables.h|  3 +++
 include/net/inet_timewait_sock.h |  2 ++
 net/dccp/proto.c |  1 +
 net/ipv4/inet_hashtables.c   | 22 +++---
 net/ipv4/inet_timewait_sock.c|  8 +---
 net/ipv4/proc.c  |  5 +++--
 net/ipv4/tcp.c   |  1 +
 7 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 9141e95529e7..2302ae0f7818 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -103,6 +103,7 @@ static inline struct net *ib_net(struct inet_bind_bucket 
*ib)
 
 struct inet_bind_hashbucket {
spinlock_t  lock;
+   unsigned intcount;
struct hlist_head   chain;
 };
 
@@ -193,7 +194,9 @@ inet_bind_bucket_create(struct kmem_cache *cachep, struct 
net *net,
struct inet_bind_hashbucket *head,
const unsigned short snum);
 void inet_bind_bucket_destroy(struct kmem_cache *cachep,
+ struct inet_bind_hashbucket *head,
  struct inet_bind_bucket *tb);
+int inet_bind_bucket_count(struct proto *prot);
 
 static inline u32 inet_bhashfn(const struct net *net, const __u16 lport,
   const u32 bhash_size)
diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index c7be1ca8e562..4cdb8034ad80 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -87,7 +87,9 @@ static inline struct inet_timewait_sock *inet_twsk(const 
struct sock *sk)
 void inet_twsk_free(struct inet_timewait_sock *tw);
 void inet_twsk_put(struct inet_timewait_sock *tw);
 
+struct inet_bind_hashbucket;
 void inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
+  struct inet_bind_hashbucket *head,
   struct inet_hashinfo *hashinfo);
 
 struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 84cd4e3fd01b..25f03e62cfea 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -1208,6 +1208,7 @@ static int __init dccp_init(void)
for (i = 0; i < dccp_hashinfo.bhash_size; i++) {
spin_lock_init(_hashinfo.bhash[i].lock);
INIT_HLIST_HEAD(_hashinfo.bhash[i].chain);
+   dccp_hashinfo.bhash[i].count = 0;
}
 
rc = dccp_mib_init();
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 31ff46daae97..aac6de8e5381 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -58,6 +58,18 @@ static u32 sk_ehashfn(const struct sock *sk)
sk->sk_daddr, sk->sk_dport);
 }
 
+/* Count how many entries are in the bind hash table */
+unsigned int inet_bind_bucket_count(struct proto *prot)
+{
+   struct inet_hashinfo *hinfo = prot->h.hashinfo;
+   unsigned int i, ports = 0;
+
+   for (i = 0; i < hinfo->bhash_size; i++)
+   ports += hinfo->bhash[i].count;
+
+   return ports;
+}
+
 /*
  * Allocate and initialize a new local port bind bucket.
  * The bindhash mutex for snum's hash chain must be held here.
@@ -76,6 +88,7 @@ struct inet_bind_bucket *inet_bind_bucket_create(struct 
kmem_cache *cachep,
tb->fastreuseport = 0;
INIT_HLIST_HEAD(>owners);
hlist_add_head(>node, >chain);
+   ++head->count;
}
return tb;
 }
@@ -83,10 +96,13 @@ struct inet_bind_bucket *inet_bind_bucket_create(struct 
kmem_cache *cachep,
 /*
  * Caller must hold hashbucket lock for this tb with local BH disabled
  */
-void inet_bind_bucket_destroy(struct kmem_cache *cachep, struct 
inet_bind_bucket *tb)
+void inet_bind_bucket_destroy(struct kmem_cache *cachep,
+ struct inet_bind_hashbucket *head,
+ struct inet_bind_bucket *tb)
 {
if (hlist_empty(>owners)) {
__hlist_del(>node);
+   --head->count;
kmem_cache_free(cachep, tb);
}
 }
@@ -115,7 +131,7 @@ static void __inet_put_port(struct sock *sk)
__sk_del_bind_node(sk);
inet_csk(sk)->icsk_bind_hash = NULL;
inet_sk(sk)->inet_num = 0;
-   inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
+   

Re: [PATCH V2 net-next 0/6] virtio-net: Add SCTP checksum offload support

2018-05-02 Thread Marcelo Ricardo Leitner
On Tue, May 01, 2018 at 10:07:33PM -0400, Vladislav Yasevich wrote:
> Now that we have SCTP offload capabilities in the kernel, we can add
> them to virtio as well.  First step is SCTP checksum.

SCTP-wise, LGTM:
Acked-by: Marcelo Ricardo Leitner 


[PATCH net] rds: do not leak kernel memory to user land

2018-05-02 Thread Eric Dumazet
syzbot/KMSAN reported an uninit-value in put_cmsg(), originating
from rds_cmsg_recv().

Simply clear the structure, since we have holes there, or since
rx_traces might be smaller than RDS_MSG_RX_DGRAM_TRACE_MAX.

BUG: KMSAN: uninit-value in copy_to_user include/linux/uaccess.h:184 [inline]
BUG: KMSAN: uninit-value in put_cmsg+0x600/0x870 net/core/scm.c:242
CPU: 0 PID: 4459 Comm: syz-executor582 Not tainted 4.16.0+ #87
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
 kmsan_internal_check_memory+0x135/0x1e0 mm/kmsan/kmsan.c:1157
 kmsan_copy_to_user+0x69/0x160 mm/kmsan/kmsan.c:1199
 copy_to_user include/linux/uaccess.h:184 [inline]
 put_cmsg+0x600/0x870 net/core/scm.c:242
 rds_cmsg_recv net/rds/recv.c:570 [inline]
 rds_recvmsg+0x2db5/0x3170 net/rds/recv.c:657
 sock_recvmsg_nosec net/socket.c:803 [inline]
 sock_recvmsg+0x1d0/0x230 net/socket.c:810
 ___sys_recvmsg+0x3fb/0x810 net/socket.c:2205
 __sys_recvmsg net/socket.c:2250 [inline]
 SYSC_recvmsg+0x298/0x3c0 net/socket.c:2262
 SyS_recvmsg+0x54/0x80 net/socket.c:2257
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: 3289025aedc0 ("RDS: add receive message trace used by application")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
Cc: Santosh Shilimkar 
Cc: linux-rdma 
---
 net/rds/recv.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/rds/recv.c b/net/rds/recv.c
index 
de50e2126e404aed541b8d268a28da08154bf08d..dc67458b52f0043c2328d4a77a43536e7c62b0ed
 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -558,6 +558,7 @@ static int rds_cmsg_recv(struct rds_incoming *inc, struct 
msghdr *msg,
struct rds_cmsg_rx_trace t;
int i, j;
 
+   memset(, 0, sizeof(t));
inc->i_rx_lat_trace[RDS_MSG_RX_CMSG] = local_clock();
t.rx_traces =  rs->rs_rx_traces;
for (i = 0; i < rs->rs_rx_traces; i++) {
-- 
2.17.0.441.gb46fe60e1d-goog



RE: Performance regressions in TCP_STREAM tests in Linux 4.15 (and later)

2018-05-02 Thread Michael Wenig
After applying Eric's proposed change (see below) to a 4.17 RC3 kernel, the 
regressions that we had observed in our TCP_STREAM small message tests with 
TCP_NODELAY enabled are now drastically reduced. Instead of the original 3x 
thruput and cpu cost regressions, the regression depth is now < 10% for thruput 
and between 10% - 20% for cpu cost. The improvements in the TCP_RR tests that 
we had observed after Eric's original commit are not impacted by the change. It 
would be great if this change could make it into a patch.

Michael Wenig
VMware Performance Engineering 

-Original Message-
From: Eric Dumazet [mailto:eric.duma...@gmail.com] 
Sent: Monday, April 30, 2018 10:48 AM
To: Ben Greear ; Steven Rostedt ; 
Michael Wenig 
Cc: netdev@vger.kernel.org; Shilpi Agarwal ; Boon Ang 
; Darren Hart ; Steven Rostedt 
; Abdul Anshad Azeez 
Subject: Re: Performance regressions in TCP_STREAM tests in Linux 4.15 (and 
later)



On 04/30/2018 09:36 AM, Eric Dumazet wrote:
> 
> 
> On 04/30/2018 09:14 AM, Ben Greear wrote:
>> On 04/27/2018 08:11 PM, Steven Rostedt wrote:
>>>
>>> We'd like this email archived in netdev list, but since netdev is 
>>> notorious for blocking outlook email as spam, it didn't go through. 
>>> So I'm replying here to help get it into the archives.
>>>
>>> Thanks!
>>>
>>> -- Steve
>>>
>>>
>>> On Fri, 27 Apr 2018 23:05:46 +
>>> Michael Wenig  wrote:
>>>
 As part of VMware's performance testing with the Linux 4.15 kernel, 
 we identified CPU cost and throughput regressions when comparing to 
 the Linux 4.14 kernel. The impacted test cases are mostly 
 TCP_STREAM send tests when using small message sizes. The 
 regressions are significant (up 3x) and were tracked down to be a 
 side effect of Eric Dumazat's RB tree changes that went into the Linux 
 4.15 kernel.
 Further investigation showed our use of the TCP_NODELAY flag in 
 conjunction with Eric's change caused the regressions to show and 
 simply disabling TCP_NODELAY brought performance back to normal.
 Eric's change also resulted into significant improvements in our 
 TCP_RR test cases.



 Based on these results, our theory is that Eric's change made the 
 system overall faster (reduced latency) but as a side effect less 
 aggregation is happening (with TCP_NODELAY) and that results in 
 lower throughput. Previously even though TCP_NODELAY was set, 
 system was slower and we still got some benefit of aggregation. 
 Aggregation helps in better efficiency and higher throughput 
 although it can increase the latency. If you are seeing a 
 regression in your application throughput after this change, using 
 TCP_NODELAY might help bring performance back however that might increase 
 latency.
>>
>> I guess you mean _disabling_ TCP_NODELAY instead of _using_ TCP_NODELAY?
>>
> 
> Yeah, I guess auto-corking does not work as intended.

I would try the following patch :

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 
44be7f43455e4aefde8db61e2d941a69abcc642a..c9d00ef54deca15d5760bcbe154001a96fa1e2a7
 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -697,7 +697,7 @@ static bool tcp_should_autocork(struct sock *sk, struct 
sk_buff *skb,  {
return skb->len < size_goal &&
   sock_net(sk)->ipv4.sysctl_tcp_autocorking &&
-  skb != tcp_write_queue_head(sk) &&
+  !tcp_rtx_queue_empty(sk) &&
   refcount_read(>sk_wmem_alloc) > skb->truesize;  }
 


Re: [PATCH net-next v9 2/4] net: Introduce generic failover module

2018-05-02 Thread Jiri Pirko
Wed, May 02, 2018 at 07:51:12PM CEST, sridhar.samudr...@intel.com wrote:
>
>
>On 5/2/2018 9:15 AM, Jiri Pirko wrote:
>> Sat, Apr 28, 2018 at 11:06:01AM CEST, j...@resnulli.us wrote:
>> > Fri, Apr 27, 2018 at 07:06:58PM CEST, sridhar.samudr...@intel.com wrote:
>> [...]
>> 
>> 
>> > > +
>> > > +err = netdev_rx_handler_register(slave_dev, 
>> > > net_failover_handle_frame,
>> > > + failover_dev);
>> > > +if (err) {
>> > > +netdev_err(slave_dev, "can not register failover rx 
>> > > handler (err = %d)\n",
>> > > +   err);
>> > > +goto err_handler_register;
>> > > +}
>> > > +
>> > > +err = netdev_upper_dev_link(slave_dev, failover_dev, NULL);
>> > Please use netdev_master_upper_dev_link().
>> Don't forget to fillup struct netdev_lag_upper_info - 
>> NETDEV_LAG_TX_TYPE_ACTIVEBACKUP
>> 
>> 
>> Also, please call netdev_lower_state_changed() when the active slave
>> device changes from primary->backup or backup->primary and whenever link
>> state of a slave changes
>> 
>Sure. will look into it.  Do you think this will help with the issue
>you saw with having to change mac on standby twice to get the init scripts
>working? We are now going to block changing the mac on both standby and
>failover.

I don't see any relation to that.


Re: [PATCH net-next v9 2/4] net: Introduce generic failover module

2018-05-02 Thread Jiri Pirko
Wed, May 02, 2018 at 07:51:12PM CEST, sridhar.samudr...@intel.com wrote:
>
>
>On 5/2/2018 9:15 AM, Jiri Pirko wrote:
>> Sat, Apr 28, 2018 at 11:06:01AM CEST, j...@resnulli.us wrote:
>> > Fri, Apr 27, 2018 at 07:06:58PM CEST, sridhar.samudr...@intel.com wrote:
>> [...]
>> 
>> 
>> > > +
>> > > +err = netdev_rx_handler_register(slave_dev, 
>> > > net_failover_handle_frame,
>> > > + failover_dev);
>> > > +if (err) {
>> > > +netdev_err(slave_dev, "can not register failover rx 
>> > > handler (err = %d)\n",
>> > > +   err);
>> > > +goto err_handler_register;
>> > > +}
>> > > +
>> > > +err = netdev_upper_dev_link(slave_dev, failover_dev, NULL);
>> > Please use netdev_master_upper_dev_link().
>> Don't forget to fillup struct netdev_lag_upper_info - 
>> NETDEV_LAG_TX_TYPE_ACTIVEBACKUP
>> 
>> 
>> Also, please call netdev_lower_state_changed() when the active slave
>> device changes from primary->backup or backup->primary and whenever link
>> state of a slave changes
>> 
>Sure. will look into it.  Do you think this will help with the issue
>you saw with having to change mac on standby twice to get the init scripts
>working? We are now going to block changing the mac on both standby and
>failover.
>
>Also, i was wondering if we should set dev->flags to IFF_MASTER on failover
>and IFF_SLAVE on primary and standby. netvsc does this.

No. Don't set it. It is wrong.



>Does this help with the init scripts and network manager to skip slave
>devices for dhcp requests?
>


Re: [PATCH net-next v9 2/4] net: Introduce generic failover module

2018-05-02 Thread Samudrala, Sridhar

On 5/2/2018 1:30 PM, Michael S. Tsirkin wrote:

On Wed, May 02, 2018 at 10:51:12AM -0700, Samudrala, Sridhar wrote:


On 5/2/2018 9:15 AM, Jiri Pirko wrote:

Sat, Apr 28, 2018 at 11:06:01AM CEST, j...@resnulli.us wrote:

Fri, Apr 27, 2018 at 07:06:58PM CEST, sridhar.samudr...@intel.com wrote:

[...]



+
+   err = netdev_rx_handler_register(slave_dev, net_failover_handle_frame,
+failover_dev);
+   if (err) {
+   netdev_err(slave_dev, "can not register failover rx handler (err = 
%d)\n",
+  err);
+   goto err_handler_register;
+   }
+
+   err = netdev_upper_dev_link(slave_dev, failover_dev, NULL);

Please use netdev_master_upper_dev_link().

Don't forget to fillup struct netdev_lag_upper_info - 
NETDEV_LAG_TX_TYPE_ACTIVEBACKUP


Also, please call netdev_lower_state_changed() when the active slave
device changes from primary->backup or backup->primary and whenever link
state of a slave changes


Sure. will look into it.  Do you think this will help with the issue
you saw with having to change mac on standby twice to get the init scripts
working? We are now going to block changing the mac on both standby and
failover.

Also, i was wondering if we should set dev->flags to IFF_MASTER on failover
and IFF_SLAVE on primary and standby.

We do need a way to find things out, that's for sure.
How does userspace know it's a failover
config and find the failover device right now?


# ethtool -i ens12|grep driver
driver: failover

# ethtool -i ens12n_sby|grep driver
driver: virtio_net




Re: [RFC iproute2-next 2/5] ss: make tcp_mem long

2018-05-02 Thread Stephen Hemminger
On Wed, 2 May 2018 14:08:53 -0700
Eric Dumazet  wrote:

> On 05/02/2018 01:27 PM, Stephen Hemminger wrote:
> > The tcp_memory field in /proc/net/sockstat is formatted as
> > a long value by the kernel. Change ss to keep this as a full value.
> > 
> > Signed-off-by: Stephen Hemminger 
> > ---
> >  misc/ss.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/misc/ss.c b/misc/ss.c
> > index 22c76e34f83b..c88a25581755 100644
> > --- a/misc/ss.c
> > +++ b/misc/ss.c
> > @@ -4589,7 +4589,7 @@ static int get_snmp_int(const char *proto, const char 
> > *key, int *result)
> >  
> >  struct ssummary {
> > int socks;
> > -   int tcp_mem;
> > +   long tcp_mem;
> > int tcp_total;
> > int tcp_orphans;
> > int tcp_tws;
> > @@ -4629,7 +4629,7 @@ static void get_sockstat_line(char *line, struct 
> > ssummary *s)
> > else if (strcmp(id, "FRAG6:") == 0)
> > sscanf(rem, "%*s%d%*s%d", >frag6, >frag6_mem);
> > else if (strcmp(id, "TCP:") == 0)
> > -   sscanf(rem, "%*s%d%*s%d%*s%d%*s%d%*s%d",
> > +   sscanf(rem, "%*s%d%*s%d%*s%d%*s%d%*s%ld",
> >>tcp4_hashed,
> >>tcp_orphans, >tcp_tws, >tcp_total, 
> > >tcp_mem);
> >  }
> >   
> 
> Hi Stephen
> 
> It seems nothing uses the value yet?

Yup, let's just drop it from the scan.



Re: [RFC iproute2-next 2/5] ss: make tcp_mem long

2018-05-02 Thread Eric Dumazet


On 05/02/2018 01:27 PM, Stephen Hemminger wrote:
> The tcp_memory field in /proc/net/sockstat is formatted as
> a long value by the kernel. Change ss to keep this as a full value.
> 
> Signed-off-by: Stephen Hemminger 
> ---
>  misc/ss.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/misc/ss.c b/misc/ss.c
> index 22c76e34f83b..c88a25581755 100644
> --- a/misc/ss.c
> +++ b/misc/ss.c
> @@ -4589,7 +4589,7 @@ static int get_snmp_int(const char *proto, const char 
> *key, int *result)
>  
>  struct ssummary {
>   int socks;
> - int tcp_mem;
> + long tcp_mem;
>   int tcp_total;
>   int tcp_orphans;
>   int tcp_tws;
> @@ -4629,7 +4629,7 @@ static void get_sockstat_line(char *line, struct 
> ssummary *s)
>   else if (strcmp(id, "FRAG6:") == 0)
>   sscanf(rem, "%*s%d%*s%d", >frag6, >frag6_mem);
>   else if (strcmp(id, "TCP:") == 0)
> - sscanf(rem, "%*s%d%*s%d%*s%d%*s%d%*s%d",
> + sscanf(rem, "%*s%d%*s%d%*s%d%*s%d%*s%ld",
>  >tcp4_hashed,
>  >tcp_orphans, >tcp_tws, >tcp_total, 
> >tcp_mem);
>  }
> 

Hi Stephen

It seems nothing uses the value yet?

Also, do we care about iproute2 being compiled in 32-bit mode, but eventually
running on a 64-bit kernel?



Re: [PATCH bpf-next v3 15/15] samples/bpf: sample application and documentation for AF_XDP sockets

2018-05-02 Thread Jesper Dangaard Brouer

On Wed,  2 May 2018 13:01:36 +0200 Björn Töpel  wrote:

> +static void rx_drop(struct xdpsock *xsk)
> +{
> + struct xdp_desc descs[BATCH_SIZE];
> + unsigned int rcvd, i;
> +
> + rcvd = xq_deq(>rx, descs, BATCH_SIZE);
> + if (!rcvd)
> + return;
> +
> + for (i = 0; i < rcvd; i++) {
> + u32 idx = descs[i].idx;
> +
> + lassert(idx < NUM_FRAMES);
> +#if DEBUG_HEXDUMP
> + char *pkt;
> + char buf[32];
> +
> + pkt = xq_get_data(xsk, idx, descs[i].offset);
> + sprintf(buf, "idx=%d", idx);
> + hex_dump(pkt, descs[i].len, buf);
> +#endif
> + }
> +
> + xsk->rx_npkts += rcvd;
> +
> + umem_fill_to_kernel_ex(>umem->fq, descs, rcvd);
> +}

I would really like to see an option that can enable reading the
data/memory in the packet.  Else the test is rather fake...

I hacked it myself manually to read the first u32.
 - Before: 10,771,083 pps
 - After:   9,430,741 pps

The slowdown is not as big as I expected, which is good :-)
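
For reference, a minimal sketch of that manual hack, reusing the sample's
xq_get_data() helper from the code quoted above; this is my illustration,
not part of the posted patch:

	/* Hypothetical tweak to the rx_drop() loop: load the first u32 of
	 * every received frame so the CPU actually touches packet data.
	 */
	volatile u32 rx_sink = 0;	/* sink so the load is not optimized away */

	for (i = 0; i < rcvd; i++) {
		u32 idx = descs[i].idx;
		char *pkt = xq_get_data(xsk, idx, descs[i].offset);

		lassert(idx < NUM_FRAMES);
		rx_sink += *(u32 *)pkt;	/* read packet data */
	}
	(void)rx_sink;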

With perf stat I can see more LLC-loads, but not more misses.  Reading data
on the remote CPU is not getting registered as a cache miss.

p.s. these tests are with mlx5 (which only has XDP_REDIRECT on the RX side).

- - 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Before:

sudo ~/perf stat -C3 -e L1-icache-load-misses -e cycles -e  instructions -e 
cache-misses -e   cache-references  -e LLC-store-misses -e LLC-store -e 
LLC-load-misses -e  LLC-load -r 3 sleep 1

 Performance counter stats for 'CPU(s) 3' (3 runs):

   200,020  L1-icache-load-misses   
  ( +-  0.76% )  (33.31%)
 3,920,754,587  cycles  
  ( +-  0.14% )  (44.50%)
 3,062,308,209  instructions  #0.78  insn per cycle 
  ( +-  0.28% )  (55.65%)
   823  cache-misses  #0.011 % of all cache 
refs  ( +- 70.81% )  (66.74%)
 7,587,132  cache-references
  ( +-  0.48% )  (77.83%)
 0  LLC-store-misses
  (77.83%)
   384,401  LLC-store   
  ( +-  2.97% )  (77.83%)
15  LLC-load-misses   #0.00% of all LL-cache 
hits ( +-100.00% )  (22.17%)
 3,192,312  LLC-load
  ( +-  0.35% )  (22.17%)

   1.001199221 seconds time elapsed 
 ( +-  0.00% )


After:

$ sudo ~/perf stat -C3 -e L1-icache-load-misses -e cycles -e  instructions -e 
cache-misses -e   cache-references  -e LLC-store-misses -e LLC-store -e 
LLC-load-misses -e  LLC-load -r 3 sleep 1

 Performance counter stats for 'CPU(s) 3' (3 runs):

   154,921  L1-icache-load-misses   
  ( +-  3.88% )  (33.31%)
 3,924,791,213  cycles  
  ( +-  0.10% )  (44.50%)
 2,930,116,185  instructions  #0.75  insn per cycle 
  ( +-  0.33% )  (55.65%)
   342  cache-misses  #0.002 % of all cache 
refs  ( +- 65.52% )  (66.74%)
15,810,892  cache-references
  ( +-  0.13% )  (77.83%)
 0  LLC-store-misses
  (77.83%)
   925,544  LLC-store   
  ( +-  2.33% )  (77.83%)
   155  LLC-load-misses   #0.00% of all LL-cache 
hits ( +- 67.22% )  (22.17%)
12,791,264  LLC-load
  ( +-  0.04% )  (22.17%)

   1.001206058 seconds time elapsed 
 ( +-  0.00% )



Re: [PATCH net-next 1/4] ipv6: Calculate hash thresholds for IPv6 nexthops

2018-05-02 Thread David Ahern
On 5/2/18 2:48 PM, Thomas Winter wrote:
> Should I look at reworking this? It would be great to have these ECMP routes 
> for other purposes.

Looking at my IPv6 bug list, this change is on it -- allowing ECMP routes
to have a device-only hop.

Let me take a look at it at the same time as a few other bugs.


Re: DSA switch

2018-05-02 Thread Andrew Lunn
On Wed, May 02, 2018 at 11:20:05PM +0300, Ran Shalit wrote:
> Hello,
> 
> Is it possible to use the switch just like an external real switch,
> connecting all ports to the same subnet?

Yes. Just bridge all ports/interfaces together and put your host IP
address on the bridge.
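
For illustration, a minimal iproute2 sketch of that setup; the port names
lan0-lan3 and the address are placeholders for whatever the DSA driver and
your network actually use:

# create the bridge and enslave all DSA ports to it
ip link add name br0 type bridge
for port in lan0 lan1 lan2 lan3; do
	ip link set dev "$port" master br0
	ip link set dev "$port" up
done
# the host's IP address lives on the bridge itself
ip addr add 192.168.1.2/24 dev br0
ip link set dev br0 up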

Andrew


Re: [PATCH bpf-next] bpf/verifier: enable ctx + const + 0.

2018-05-02 Thread Jakub Kicinski
On Wed, 2 May 2018 10:54:56 -0700, William Tu wrote:
> On Wed, May 2, 2018 at 1:29 AM, Daniel Borkmann  wrote:
> > On 05/02/2018 06:52 AM, Alexei Starovoitov wrote:  
> >> On Tue, May 01, 2018 at 09:35:29PM -0700, William Tu wrote:  
> >> Please test it with real program and you'll see crashes and garbage 
> >> returned.  
> >
> > +1, *convert_ctx_access() uses bpf_insn's off to determine what to rewrite,
> > so this is definitely buggy, and wasn't properly tested as it should have
> > been. The test case is also way too simple, just the LDX and then doing a
> > return 0 will get you past verifier, but won't give you anything in terms
> > of runtime testing that test_verifier is doing. A single test case for a
> > non trivial verifier change like this is also _completely insufficient_,
> > this really needs to test all sorts of weird corner cases (involving out of
> > bounds accesses, overflows, etc).  
> 
> Thanks, now I understand.
> It's much more complicated than I thought.

FWIW the NFP JIT would also have to be updated; similarly to
*convert_ctx_access(), in mem_ldx_skb()/mem_ldx_xdp() we are currently
looking at insn.off.  In case you find a way to solve this.. :)


[bpf PATCH v2 3/3] bpf: sockmap, fix error handling in redirect failures

2018-05-02 Thread John Fastabend
When a redirect failure happens we release the in-flight buffers
without calling sk_mem_uncharge(); the uncharge is called before
dropping the sock lock for the redirect. However, we missed updating
the ring start index. When no apply actions are in progress this
is OK because we uncharge the entire buffer before the redirect.
But when we have apply logic running, it's possible that only a
portion of the buffer is being redirected. In this case we only
do memory accounting for the buffer slice being redirected and
expect to be able to loop over the BPF program again and/or if
a sock is closed uncharge the memory at sock destruct time.

With an invalid start index however the program logic looks at
the start pointer index, checks the length, and when seeing the
length is zero (from the initial release and failure to update
the pointer) aborts without uncharging/releasing the remaining
memory.

The fix for this is simply to update the start index. To avoid
fixing this error in two locations we do a small refactor and
remove one case where it is open-coded. Then fix it in the
single function.

Signed-off-by: John Fastabend 
---
 kernel/bpf/sockmap.c |   28 
 1 file changed, 12 insertions(+), 16 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 052c313..098eca5 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -393,7 +393,8 @@ static void return_mem_sg(struct sock *sk, int bytes, 
struct sk_msg_buff *md)
} while (i != md->sg_end);
 }
 
-static void free_bytes_sg(struct sock *sk, int bytes, struct sk_msg_buff *md)
+static void free_bytes_sg(struct sock *sk, int bytes,
+ struct sk_msg_buff *md, bool charge)
 {
struct scatterlist *sg = md->sg_data;
int i = md->sg_start, free;
@@ -403,11 +404,13 @@ static void free_bytes_sg(struct sock *sk, int bytes, 
struct sk_msg_buff *md)
if (bytes < free) {
sg[i].length -= bytes;
sg[i].offset += bytes;
-   sk_mem_uncharge(sk, bytes);
+   if (charge)
+   sk_mem_uncharge(sk, bytes);
break;
}
 
-   sk_mem_uncharge(sk, sg[i].length);
+   if (charge)
+   sk_mem_uncharge(sk, sg[i].length);
put_page(sg_page([i]));
bytes -= sg[i].length;
sg[i].length = 0;
@@ -418,6 +421,7 @@ static void free_bytes_sg(struct sock *sk, int bytes, 
struct sk_msg_buff *md)
if (i == MAX_SKB_FRAGS)
i = 0;
}
+   md->sg_start = i;
 }
 
 static int free_sg(struct sock *sk, int start, struct sk_msg_buff *md)
@@ -576,10 +580,10 @@ static int bpf_tcp_sendmsg_do_redirect(struct sock *sk, 
int send,
   struct sk_msg_buff *md,
   int flags)
 {
+   bool ingress = !!(md->flags & BPF_F_INGRESS);
struct smap_psock *psock;
struct scatterlist *sg;
-   int i, err, free = 0;
-   bool ingress = !!(md->flags & BPF_F_INGRESS);
+   int err = 0;
 
sg = md->sg_data;
 
@@ -607,16 +611,8 @@ static int bpf_tcp_sendmsg_do_redirect(struct sock *sk, 
int send,
 out_rcu:
rcu_read_unlock();
 out:
-   i = md->sg_start;
-   while (sg[i].length) {
-   free += sg[i].length;
-   put_page(sg_page([i]));
-   sg[i].length = 0;
-   i++;
-   if (i == MAX_SKB_FRAGS)
-   i = 0;
-   }
-   return free;
+   free_bytes_sg(NULL, send, md, false);
+   return err;
 }
 
 static inline void bpf_md_init(struct smap_psock *psock)
@@ -720,7 +716,7 @@ static int bpf_exec_tx_verdict(struct smap_psock *psock,
break;
case __SK_DROP:
default:
-   free_bytes_sg(sk, send, m);
+   free_bytes_sg(sk, send, m, true);
apply_bytes_dec(psock, send);
*copied -= send;
psock->sg_size -= send;



[bpf PATCH v2 2/3] bpf: sockmap, zero sg_size on error when buffer is released

2018-05-02 Thread John Fastabend
When an error occurs during a redirect we have two cases that need
to be handled (i) we have a cork'ed buffer (ii) we have a normal
sendmsg buffer.

In the cork'ed buffer case we don't currently support recovering from
errors in a redirect action. So the buffer is released and the error
should _not_ be pushed back to the caller of sendmsg/sendpage. The
rationale here is the user will get an error that relates to old
data that may have been sent by some arbitrary thread on that sock.
Instead we simply consume the data and tell the user that the data
has been consumed. We may add proper error recovery in the future.
However, this patch fixes a bug where the bytes outstanding counter
sg_size was not zeroed. This could result in a case where if the user
has both a cork'ed action and apply action in progress we may
incorrectly call into the BPF program when the user expected an
old verdict to be applied via the apply action. I don't have a use
case where using apply and cork at the same time is valid but we
never explicitly reject it because it should work fine. This patch
ensures the sg_size is zeroed so we don't have this case.

In the normal sendmsg buffer case (no cork data) we also do not
zero sg_size. Again this can confuse the apply logic when the logic
calls into the BPF program when the BPF programmer expected the old
verdict to remain. So ensure we set sg_size to zero here as well. And
additionally, to keep the psock state in sync with the sk_msg_buff,
release all the memory as well. Previously we did this before
returning to the user but this left a gap where the psock and sk_msg_buff
states were out of sync, which seems fragile. No additional overhead
is taken here except for a call to check the length and realize it has
already been freed. This is in the error path as well, so in my
opinion let's have robust code over optimized error paths.

Signed-off-by: John Fastabend 
---
 kernel/bpf/sockmap.c |   15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 943929a..052c313 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -701,15 +701,22 @@ static int bpf_exec_tx_verdict(struct smap_psock *psock,
err = bpf_tcp_sendmsg_do_redirect(redir, send, m, flags);
lock_sock(sk);
 
+   if (unlikely(err < 0)) {
+   free_start_sg(sk, m);
+   psock->sg_size = 0;
+   if (!cork)
+   *copied -= send;
+   } else {
+   psock->sg_size -= send;
+   }
+
if (cork) {
free_start_sg(sk, m);
+   psock->sg_size = 0;
kfree(m);
m = NULL;
+   err = 0;
}
-   if (unlikely(err))
-   *copied -= err;
-   else
-   psock->sg_size -= send;
break;
case __SK_DROP:
default:



[bpf PATCH v2 0/3] sockmap error path fixes

2018-05-02 Thread John Fastabend
When I added the test_sockmap to selftests I mistakenly changed the
test logic a bit. The result of this was that in redirect cases we ended up
choosing the wrong sock from the BPF program and ended up sending to a
socket that had no receive handler. The result was that the actual receive
handler, running on a different socket, timed out and closed the
socket. This results in errors (-EPIPE to be specific) on the sending
side, typically happening if the sender does not complete the send
before the receive side times out. So depending on timing and the size
of the send we may get errors. This exposed some bugs in the sockmap
error path handling.

This series fixes the errors. The primary issue is that we did not do proper
memory accounting in these cases, which resulted in a missing
sk_mem_uncharge(). This happened in the redirect path and in one case
on the normal send path. See the three patches for the details.
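
As an illustration of the accounting rule the series restores, here is a
minimal sketch (not code from the patches; the helper name drop_sg_entry()
is made up for this example): every byte charged to the socket must be
uncharged when the scatterlist entry holding it is dropped, otherwise the
socket teardown warns about the imbalance.

#include <net/sock.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>

/* Drop one scatterlist entry and keep the socket memory accounting
 * balanced: uncharge exactly what was charged earlier, then release
 * the page reference.
 */
static void drop_sg_entry(struct sock *sk, struct scatterlist *sg, bool charged)
{
	if (charged)
		sk_mem_uncharge(sk, sg->length);	/* balance the earlier sk_mem_charge() */
	put_page(sg_page(sg));				/* release the page reference */
	sg->length = 0;
	sg->offset = 0;
}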

The other take-away from this is that we need to fix the test_sockmap and
also add more negative test cases. That will happen in bpf-next.

Finally, I tested this using the existing test_sockmap program, the
older sockmap sample test script, and a few real use cases with
Cilium. All of these seem to be working correctly.

v2: fix compiler warning, drop iterator variable 'i' that is no longer
used in patch 3.

---

John Fastabend (3):
  bpf: sockmap, fix scatterlist update on error path in send with apply
  bpf: sockmap, zero sg_size on error when buffer is released
  bpf: sockmap, fix error handling in redirect failures


 kernel/bpf/sockmap.c |   48 ++--
 1 file changed, 26 insertions(+), 22 deletions(-)

--
Signature


[bpf PATCH v2 1/3] bpf: sockmap, fix scatterlist update on error path in send with apply

2018-05-02 Thread John Fastabend
When the call to do_tcp_sendpage() fails to send the complete block
requested, we either retry if only a partial send was completed or
abort if we receive an error less than or equal to zero. Before
returning though we must update the scatterlist length/offset to
account for any partial send completed.

Before this patch we did this at the end of the retry loop, but
this was buggy when used while applying a verdict to fewer bytes
than in the scatterlist. When the scatterlist length was being set
we forgot to account for the apply logic reducing the size variable.
So the result was we chopped off some bytes in the scatterlist without
doing proper cleanup on them. This results in a WARNING when the
sock is torn down because the bytes have previously been charged to
the socket but are never uncharged.

The fix is to simply do the accounting inside the retry loop,
subtracting from the absolute scatterlist values, rather than trying
to accumulate the totals and subtract at the end.

Reported-by: Alexei Starovoitov 
Signed-off-by: John Fastabend 
---
 kernel/bpf/sockmap.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 634415c..943929a 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -326,6 +326,9 @@ static int bpf_tcp_push(struct sock *sk, int apply_bytes,
if (ret > 0) {
if (apply)
apply_bytes -= ret;
+
+   sg->offset += ret;
+   sg->length -= ret;
size -= ret;
offset += ret;
if (uncharge)
@@ -333,8 +336,6 @@ static int bpf_tcp_push(struct sock *sk, int apply_bytes,
goto retry;
}
 
-   sg->length = size;
-   sg->offset = offset;
return ret;
}
 



Re: [PATCH net-next 1/4] ipv6: Calculate hash thresholds for IPv6 nexthops

2018-05-02 Thread Thomas Winter
> On Wed, May 02, 2018 at 12:58:56PM -0600, David Ahern wrote:
> > On 5/2/18 12:53 PM, Ido Schimmel wrote:
> > > 
> > > So this fixes the issue for me. To reproduce:
> > > 
> > > # ip -6 address add 2001:db8::1/64 dev dummy0
> > > # ip -6 address add 2001:db8::1/64 dev dummy1
> > > 
> > > This reproduces the issue because due to above commit both local routes
> > > are considered siblings... :/
> > > 
> > > local 2001:db8::1 proto kernel metric 0 
> > > nexthop dev dummy0 weight 1 
> > > nexthop dev dummy1 weight 1 pref medium
> > > 
> > > I think it's best to revert the patch and have Thomas submit a fixed
> > > version to net-next. I was actually surprised to see it applied to net.
> > 
> > ugly side effect of the way ecmp routes are managed in IPv6. I think
> > revert is the best option for now.
> 
> OK. I'll send a patch.

fe80::/64  proto kernel  metric 256 
nexthop dev vlan1 weight 1
nexthop dev vlan10 weight 1
nexthop dev vlan30 weight 1
nexthop dev tunnel11 weight 1
nexthop dev tunnel12 weight 1

Sorry, I completely missed that; I was always looking at other route tables. 
Should I look at reworking this? It would be great to have these ECMP routes 
for other purposes.

ip -6 ro show table 601
default  metric 1024 
nexthop dev tunnel11 weight 1
nexthop dev tunnel12 weight 1


Re: [PATCH 0/2] sh_eth: complain on access to unimplemented TSU registers

2018-05-02 Thread David Miller
From: Sergei Shtylyov 
Date: Wed, 2 May 2018 22:53:23 +0300

> Here's a set of 2 patches against DaveM's 'net-next.git' repo. The 1st patch
> routes TSU_POST register accesses thru sh_eth_tsu_{read|write}() and the 2nd
> adds a WARN_ON() for unimplemented register accesses to those functions. I'm
> going to deal with TSU_ADR{H|L} registers in a later series...
> 
> [1/2] sh_eth: use TSU register accessors for TSU_POST
> [2/2] sh_eth: WARN_ON() access to unimplemented TSU register

Series applied to net-next, thanks.


Re: [PATCH net] net_sched: fq: take care of throttled flows before reuse

2018-05-02 Thread David Miller
From: Eric Dumazet 
Date: Wed,  2 May 2018 10:03:30 -0700

> Normally, a socket can not be freed/reused unless all its TX packets
> left qdisc and were TX-completed. However connect(AF_UNSPEC) allows
> this to happen.
> 
> With commit fc59d5bdf1e3 ("pkt_sched: fq: clear time_next_packet for
> reused flows") we cleared f->time_next_packet but took no special
> action if the flow was still in the throttled rb-tree.
> 
> Since f->time_next_packet is the key used in the rb-tree searches,
> blindly clearing it might break rb-tree integrity. We need to make
> sure the flow is no longer in the rb-tree to avoid this problem.
> 
> Fixes: fc59d5bdf1e3 ("pkt_sched: fq: clear time_next_packet for reused flows")
> Signed-off-by: Eric Dumazet 

Applied and queued up for -stable, thanks Eric.
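
A minimal sketch of the rb-tree invariant behind the fix described above,
using a simplified flow node rather than the actual fq code (flow_node and
flow_update_key() are made-up names for illustration):

#include <linux/rbtree.h>
#include <linux/types.h>

struct flow_node {
	struct rb_node node;
	u64 key;		/* stands in for f->time_next_packet */
};

/* The key orders the node inside the rb-tree, so it may only be changed
 * while the node is detached; rewriting it in place breaks tree integrity
 * and later searches/erases.
 */
static void flow_update_key(struct rb_root *root, struct flow_node *f, u64 new_key)
{
	rb_erase(&f->node, root);	/* remove while the old key is still valid */
	f->key = new_key;		/* now the key may safely change */
	/* re-insert with rb_link_node()/rb_insert_color() if the flow is still throttled */
}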


Re: [PATCH net] ipv6: Revert "ipv6: Allow non-gateway ECMP for IPv6"

2018-05-02 Thread David Miller
From: Ido Schimmel 
Date: Wed,  2 May 2018 22:41:56 +0300

> This reverts commit edd7ceb78296 ("ipv6: Allow non-gateway ECMP for
> IPv6").
> 
> Eric reported a division by zero in rt6_multipath_rebalance() which is
> caused by above commit that considers identical local routes to be
> siblings. The division by zero happens because a nexthop weight is not
> set for local routes.
> 
> Revert the commit as it does not fix a bug and has side effects.
> 
> To reproduce:
> 
> # ip -6 address add 2001:db8::1/64 dev dummy0
> # ip -6 address add 2001:db8::1/64 dev dummy1
> 
> Fixes: edd7ceb78296 ("ipv6: Allow non-gateway ECMP for IPv6")
> Signed-off-by: Ido Schimmel 
> Reported-by: Eric Dumazet 

Applied, thank you.


Re: [PATCH net-next v9 2/4] net: Introduce generic failover module

2018-05-02 Thread Michael S. Tsirkin
On Wed, May 02, 2018 at 10:51:12AM -0700, Samudrala, Sridhar wrote:
> 
> 
> On 5/2/2018 9:15 AM, Jiri Pirko wrote:
> > Sat, Apr 28, 2018 at 11:06:01AM CEST, j...@resnulli.us wrote:
> > > Fri, Apr 27, 2018 at 07:06:58PM CEST, sridhar.samudr...@intel.com wrote:
> > [...]
> > 
> > 
> > > > +
> > > > +   err = netdev_rx_handler_register(slave_dev, 
> > > > net_failover_handle_frame,
> > > > +failover_dev);
> > > > +   if (err) {
> > > > +   netdev_err(slave_dev, "can not register failover rx 
> > > > handler (err = %d)\n",
> > > > +  err);
> > > > +   goto err_handler_register;
> > > > +   }
> > > > +
> > > > +   err = netdev_upper_dev_link(slave_dev, failover_dev, NULL);
> > > Please use netdev_master_upper_dev_link().
> > Don't forget to fillup struct netdev_lag_upper_info - 
> > NETDEV_LAG_TX_TYPE_ACTIVEBACKUP
> > 
> > 
> > Also, please call netdev_lower_state_changed() when the active slave
> > device changes from primary->backup or backup->primary and whenever the
> > link state of a slave changes
> > 
> Sure, will look into it.  Do you think this will help with the issue
> you saw with having to change the MAC on standby twice to get the init scripts
> working? We are now going to block changing the MAC on both standby and
> failover.
> 
> Also, I was wondering if we should set dev->flags to IFF_MASTER on failover
> and IFF_SLAVE on primary and standby.

We do need a way to find things out, that's for sure.
How does userspace know it's a failover
config and find the failover device right now?

> netvsc does this.
> Does this help with the init scripts and network manager to skip slave
> devices for dhcp requests?

Try it?
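
For reference, a minimal sketch of the master-link and lower-state
notification asked for in the review above, assuming the current netdev
API; example_link_slave() and example_report_active() are made-up names
and this is not code from the failover patch:

#include <linux/netdevice.h>

static int example_link_slave(struct net_device *failover_dev,
			      struct net_device *slave_dev,
			      struct netlink_ext_ack *extack)
{
	struct netdev_lag_upper_info lag_info = {
		.tx_type = NETDEV_LAG_TX_TYPE_ACTIVEBACKUP,
	};

	/* master link with lag info, instead of a plain netdev_upper_dev_link() */
	return netdev_master_upper_dev_link(slave_dev, failover_dev,
					    NULL, &lag_info, extack);
}

static void example_report_active(struct net_device *slave_dev, bool tx_enabled)
{
	struct netdev_lag_lower_state_info state = {
		.link_up = netif_carrier_ok(slave_dev),
		.tx_enabled = tx_enabled,	/* true only for the active slave */
	};

	/* notify uppers whenever the active slave or its link state changes */
	netdev_lower_state_changed(slave_dev, &state);
}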


[RFC iproute2-next 3/5] ss: use sockstat to get TCP bind ports

2018-05-02 Thread Stephen Hemminger
From: Stephen Hemminger 

Using slabinfo to try and get the number of bind_buckets no longer
works because of slab cache merging. Instead, use the proposed enhancement
of /proc/net/sockstat to get the same data.
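
For illustration, the TCP: line and the sscanf() format line up roughly as
below (a standalone userspace sketch; the sample line and the trailing
"bind" label reflect the proposed sockstat enhancement and are assumptions
here, not the final kernel output):

#include <stdio.h>

int main(void)
{
	/* rem is everything after the "TCP:" tag in /proc/net/sockstat */
	const char *rem = " inuse 5 orphan 0 tw 2 alloc 8 mem 3 bind 4";
	int hashed, orphans, tw, total, ports;
	long mem;

	/* each %*s skips a label, each conversion reads the following value */
	if (sscanf(rem, "%*s%d%*s%d%*s%d%*s%d%*s%ld%*s%d",
		   &hashed, &orphans, &tw, &total, &mem, &ports) == 6)
		printf("hashed=%d orphans=%d tw=%d total=%d mem=%ld bind=%d\n",
		       hashed, orphans, tw, total, mem, ports);
	return 0;
}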

Signed-off-by: Stephen Hemminger 
---
 misc/ss.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index c88a25581755..4f76999c0fee 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -732,7 +732,6 @@ next:
 
 struct slabstat {
int socks;
-   int tcp_ports;
int tcp_tws;
int tcp_syns;
int skbs;
@@ -748,7 +747,6 @@ static int get_slabstat(struct slabstat *s)
static int slabstat_valid;
static const char * const slabstat_ids[] = {
"sock",
-   "tcp_bind_bucket",
"tcp_tw_bucket",
"tcp_open_request",
"skbuff_head_cache",
@@ -4594,6 +4592,7 @@ struct ssummary {
int tcp_orphans;
int tcp_tws;
int tcp4_hashed;
+   int tcp_ports;
int udp4;
int raw4;
int frag4;
@@ -4629,9 +4628,9 @@ static void get_sockstat_line(char *line, struct ssummary 
*s)
else if (strcmp(id, "FRAG6:") == 0)
		sscanf(rem, "%*s%d%*s%d", &s->frag6, &s->frag6_mem);
	else if (strcmp(id, "TCP:") == 0)
-		sscanf(rem, "%*s%d%*s%d%*s%d%*s%d%*s%ld",
+		sscanf(rem, "%*s%d%*s%d%*s%d%*s%d%*s%ld%*s%d",
		       &s->tcp4_hashed,
-		       &s->tcp_orphans, &s->tcp_tws, &s->tcp_total, &s->tcp_mem);
+		       &s->tcp_orphans, &s->tcp_tws, &s->tcp_total, &s->tcp_mem, &s->tcp_ports);
 }
 
 static int get_sockstat(struct ssummary *s)
@@ -4676,8 +4675,7 @@ static int print_summary(void)
   s.tcp_total - (s.tcp4_hashed+s.tcp6_hashed-s.tcp_tws),
   s.tcp_orphans,
   slabstat.tcp_syns,
-  s.tcp_tws, slabstat.tcp_tws,
-  slabstat.tcp_ports
+  s.tcp_tws, slabstat.tcp_tws, s.tcp_ports
   );
 
printf("\n");
-- 
2.17.0



[RFC iproute2-next 1/5] ss: make args to get_snmp_int const

2018-05-02 Thread Stephen Hemminger
These are keys for lookup and should be const.

Signed-off-by: Stephen Hemminger 
---
 misc/ss.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/misc/ss.c b/misc/ss.c
index 3ed7e66962f3..22c76e34f83b 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -4539,7 +4539,7 @@ static int handle_follow_request(struct filter *f)
return ret;
 }
 
-static int get_snmp_int(char *proto, char *key, int *result)
+static int get_snmp_int(const char *proto, const char *key, int *result)
 {
char buf[1024];
FILE *fp;
-- 
2.17.0



[RFC iproute2-next 5/5] ss: use correct slab statistics

2018-05-02 Thread Stephen Hemminger
From: Stephen Hemminger 

The slabinfo names changed years ago, and ss statistics were broken.
This change uses the current slab names and also handles the TCP IPv6 caches.
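
For illustration, the prefix match also picks up the IPv6 caches, e.g.
"tw_sock_TCPv6" matches the "tw_sock_TCP" prefix, so both active-object
counts are summed into one statistic. A standalone sketch with made-up
sample slabinfo lines:

#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *lines[] = {		/* made-up /proc/slabinfo excerpts */
		"tw_sock_TCP       12     64    248 ...",
		"tw_sock_TCPv6      3     32    280 ...",
	};
	const char *id = "tw_sock_TCP";
	int total = 0, v;
	unsigned int i;

	for (i = 0; i < sizeof(lines) / sizeof(lines[0]); i++) {
		if (memcmp(lines[i], id, strlen(id)) != 0)
			continue;		/* name does not start with the id */
		if (sscanf(lines[i], "%*s%d", &v) == 1)
			total += v;		/* accumulate TCP and TCPv6 */
	}
	printf("tw buckets: %d\n", total);
	return 0;
}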

Signed-off-by: Stephen Hemminger 
---
 misc/ss.c | 23 +++
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 97304cd8abfc..66c767cc415b 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -742,12 +742,12 @@ static int get_slabstat(struct slabstat *s)
 {
char buf[256];
FILE *fp;
-   int cnt;
+   int *stats = (int *) s;
static int slabstat_valid;
static const char * const slabstat_ids[] = {
-   "sock",
-   "tcp_tw_bucket",
-   "tcp_open_request",
+   "sock_inode_cache",
+   "tw_sock_TCP",
+   "request_sock_TCP",
};
 
if (slabstat_valid)
@@ -759,24 +759,23 @@ static int get_slabstat(struct slabstat *s)
if (!fp)
return -1;
 
-   cnt = sizeof(*s)/sizeof(int);
-
if (!fgets(buf, sizeof(buf), fp)) {
fclose(fp);
return -1;
}
+
while (fgets(buf, sizeof(buf), fp) != NULL) {
-   int i;
+   int i, v;
 
for (i = 0; i < ARRAY_SIZE(slabstat_ids); i++) {
-   if (memcmp(buf, slabstat_ids[i], 
strlen(slabstat_ids[i])) == 0) {
-   sscanf(buf, "%*s%d", ((int *)s) + i);
-   cnt--;
+   if (memcmp(buf, slabstat_ids[i], 
strlen(slabstat_ids[i])) != 0)
+   continue;
+
+   if (sscanf(buf, "%*s%d", &v) == 1) {
+   stats[i] += v;
break;
}
}
-   if (cnt <= 0)
-   break;
}
 
slabstat_valid = 1;
-- 
2.17.0



[RFC iproute2-next 0/5] ss statistics fixes

2018-05-02 Thread Stephen Hemminger
From: Stephen Hemminger 

The output of the ss -s command has been broken for a long time
because of kernel changes (i.e. since 2.6).

This is an attempt to resolve most of the issues. I still don't like
the way it uses slabinfo to get the data, but some of this information
would be expensive for the kernel to account for otherwise.

Stephen Hemminger (5):
  ss: make args to get_snmp_int const
  ss: make tcp_mem long
  ss: use sockstat to get TCP bind ports
  ss: don't look for skbuff_head_cache
  ss: use correct slab statistics

 misc/ss.c | 39 +--
 1 file changed, 17 insertions(+), 22 deletions(-)

-- 
2.17.0



[RFC iproute2-next 2/5] ss: make tcp_mem long

2018-05-02 Thread Stephen Hemminger
The tcp_memory field in /proc/net/sockstat is formatted as
a long value by the kernel. Change ss to keep this as a full long value.

Signed-off-by: Stephen Hemminger 
---
 misc/ss.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 22c76e34f83b..c88a25581755 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -4589,7 +4589,7 @@ static int get_snmp_int(const char *proto, const char 
*key, int *result)
 
 struct ssummary {
int socks;
-   int tcp_mem;
+   long tcp_mem;
int tcp_total;
int tcp_orphans;
int tcp_tws;
@@ -4629,7 +4629,7 @@ static void get_sockstat_line(char *line, struct ssummary 
*s)
else if (strcmp(id, "FRAG6:") == 0)
		sscanf(rem, "%*s%d%*s%d", &s->frag6, &s->frag6_mem);
	else if (strcmp(id, "TCP:") == 0)
-		sscanf(rem, "%*s%d%*s%d%*s%d%*s%d%*s%d",
+		sscanf(rem, "%*s%d%*s%d%*s%d%*s%d%*s%ld",
		       &s->tcp4_hashed,
		       &s->tcp_orphans, &s->tcp_tws, &s->tcp_total, &s->tcp_mem);
 }
-- 
2.17.0



[RFC iproute2-next 4/5] ss: don't look for skbuff_head_cache

2018-05-02 Thread Stephen Hemminger
From: Stephen Hemminger 

Not used in current code.

Signed-off-by: Stephen Hemminger 
---
 misc/ss.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 4f76999c0fee..97304cd8abfc 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -734,7 +734,6 @@ struct slabstat {
int socks;
int tcp_tws;
int tcp_syns;
-   int skbs;
 };
 
 static struct slabstat slabstat;
@@ -749,7 +748,6 @@ static int get_slabstat(struct slabstat *s)
"sock",
"tcp_tw_bucket",
"tcp_open_request",
-   "skbuff_head_cache",
};
 
if (slabstat_valid)
-- 
2.17.0



Re: [PATCH net-next 00/10] r8169: series with further improvements

2018-05-02 Thread David Miller
From: Heiner Kallweit 
Date: Wed, 2 May 2018 21:28:10 +0200

> I thought I'm more or less done with the basic refactoring. But again
> I stumbled across things that can be improved / simplified.

Looks good, series applied, thanks Heiner.


[PATCH net,stable] qmi_wwan: do not steal interfaces from class drivers

2018-05-02 Thread Bjørn Mork
The USB_DEVICE_INTERFACE_NUMBER matching macro assumes that
the { vendorid, productid, interfacenumber } set uniquely
identifies one specific function.  This has proven to fail
for some configurable devices. One example is the Quectel
EM06/EP06 where the same interface number can be either
QMI or MBIM, without the device ID changing either.

Fix by requiring the vendor-specific class for interface number
based matching.  Functions of other classes can and should use
class based matching instead.
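
For comparison, the two matching styles look like this in a usb_device_id
table (a minimal sketch; the vendor/product IDs and the table name are
placeholders, not entries from the real qmi_wwan table):

#include <linux/usb.h>
#include <linux/module.h>

static const struct usb_device_id example_ids[] = {
	/* interface-number match: binds whatever function sits on interface 4,
	 * which is why the patch adds the vendor-class check in probe() */
	{ USB_DEVICE_INTERFACE_NUMBER(0x1234, 0x5678, 4) },
	/* class match: binds only vendor-specific functions (class 0xff),
	 * regardless of which interface number they end up on */
	{ USB_VENDOR_AND_INTERFACE_INFO(0x1234, USB_CLASS_VENDOR_SPEC, 1, 0xff) },
	{ }	/* terminating entry */
};
MODULE_DEVICE_TABLE(usb, example_ids);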

Fixes: 03304bcb5ec4 ("net: qmi_wwan: use fixed interface number matching")
Signed-off-by: Bjørn Mork 
---
It's quite possible that the fix should be integrated in the
USB_DEVICE_INTERFACE_NUMBER macro instead.  But that has grown a few
other users since it was added, so changing it now seems risky. 
Another option is of course adding a new match macro with the
USB_CLASS_VENDOR_SPEC match integrated. Maybe best?

But I'm proposing this as-is for now, since this quickfix seems most
suitable for stable backporting.

 drivers/net/usb/qmi_wwan.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index 51c68fc416fa..42565dd33aa6 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -1344,6 +1344,18 @@ static int qmi_wwan_probe(struct usb_interface *intf,
id->driver_info = (unsigned long)_wwan_info;
}
 
+   /* There are devices where the same interface number can be
+* configured as different functions. We should only bind to
+* vendor specific functions when matching on interface number
+*/
+   if (id->match_flags & USB_DEVICE_ID_MATCH_INT_NUMBER &&
+   desc->bInterfaceClass != USB_CLASS_VENDOR_SPEC) {
+   dev_dbg(&intf->dev,
+   "Rejecting interface number match for class %02x\n",
+   desc->bInterfaceClass);
+   return -ENODEV;
+   }
+
/* Quectel EC20 quirk where we've QMI on interface 4 instead of 0 */
if (quectel_ec20_detected(intf) && desc->bInterfaceNumber == 0) {
dev_dbg(&intf->dev, "Quectel EC20 quirk, skipping interface 0\n");
-- 
2.11.0


