[PATCH] ceph: Check memory allocation result

2018-01-25 Thread Chengguang Xu
Should check result of kstrndup() in case of memory allocation failure.

Signed-off-by: Chengguang Xu 
---
 net/ceph/ceph_common.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c
index 5c036d2..1e492ef 100644
--- a/net/ceph/ceph_common.c
+++ b/net/ceph/ceph_common.c
@@ -421,6 +421,10 @@ struct ceph_options *
opt->name = kstrndup(argstr[0].from,
  argstr[0].to-argstr[0].from,
  GFP_KERNEL);
+   if (!opt->name) {
+   err = -ENOMEM;
+   goto out;
+   }
break;
case Opt_secret:
opt->key = kzalloc(sizeof(*opt->key), GFP_KERNEL);
-- 
1.8.3.1



[PATCH] atm: firestream: Replace GFP_ATOMIC with GFP_KERNEL in fs_send

2018-01-25 Thread Jia-Ju Bai
After checking all possible call chains to fs_send() here,
my tool finds that fs_send() is never called in atomic context.
This function is assigned to the function pointer "dev->ops->send",
which is only called by vcc_sendmsg() (net/atm/common.c)
through vcc->dev->ops->send(), and vcc_sendmsg() calls schedule(),
which indicates that fs_send() can call functions that may sleep.
Thus GFP_ATOMIC is not necessary, and it can be replaced with GFP_KERNEL.

This was found by DCNS, a static analysis tool that I wrote.
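
For context, the rule the analysis relies on: GFP_KERNEL may sleep while
reclaiming memory, so it is only safe where sleeping is legal, which is
exactly the property established for fs_send() above. A minimal
illustrative sketch, not part of the patch:

#include <linux/slab.h>

/* Process context: the allocator may block and reclaim before failing. */
static void *alloc_may_sleep(size_t len)
{
        return kmalloc(len, GFP_KERNEL);
}

/* Atomic context: must not sleep, so the allocator fails fast instead. */
static void *alloc_no_sleep(size_t len)
{
        return kmalloc(len, GFP_ATOMIC);
}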

Signed-off-by: Jia-Ju Bai 
---
 drivers/atm/firestream.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/atm/firestream.c b/drivers/atm/firestream.c
index d97c056..cce6f9f 100644
--- a/drivers/atm/firestream.c
+++ b/drivers/atm/firestream.c
@@ -1184,7 +1184,7 @@ static int fs_send (struct atm_vcc *atm_vcc, struct 
sk_buff *skb)
 
vcc->last_skb = skb;
 
-   td = kmalloc (sizeof (struct FS_BPENTRY), GFP_ATOMIC);
+   td = kmalloc (sizeof (struct FS_BPENTRY), GFP_KERNEL);
fs_dprintk (FS_DEBUG_ALLOC, "Alloc transd: %p(%zd)\n", td, sizeof 
(struct FS_BPENTRY));
if (!td) {
/* Oops out of mem */
-- 
1.7.9.5



Re: general protection fault in tun_do_read

2018-01-25 Thread Dmitry Vyukov
On Fri, Jan 26, 2018 at 4:24 AM, Jason Wang  wrote:
>
>
> On 2018-01-25 15:59, syzbot wrote:
>>
>> Hello,
>>
>> syzbot hit the following crash on net-next commit
>> 43df215d99e6049d4680309c54232689e16ddd6b (Wed Jan 24 01:24:32 2018 +0000)
>> Merge branch 'bpf-and-netdevsim-test-updates'
>>
>> So far this crash happened 2 times on net-next.
>> Unfortunately, I don't have any reproducer for this crash yet.
>> Raw console output is attached.
>> compiler: gcc (GCC) 7.1.1 20170620
>> .config is attached.
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+885a488f0e7e392e9...@syzkaller.appspotmail.com
>> It will help syzbot understand when the bug is fixed. See footer for
>> details.
>> If you forward the report, please keep this part and the footer.
>
>
> I suspect this is the same as what syzbot reports here
> https://www.spinics.net/lists/netdev/msg480736.html.
>
> We're discussing a solution posted by Michael here
> https://patchwork.ozlabs.org/patch/866064/

Thanks for notifying, Jason.
Let's also tell syzbot about this so that it won't report bugs in
tun_do_read again:

#syz dup: general protection fault in tun_queue_purge



Re: Re: [RFC] net: qcom/emac: mdiobus-dev fwnode should point to emac-adev

2018-01-25 Thread Wang, Dongsheng
Hi, Timur && Andrew,

Please correct me if there is any problem with my understanding.

GPIO is a general property of devices; the property points to
an entity such as a device tree or ACPI table, and we can also
implement it directly in the device node.

For ACPI, there is _DSD, which should include the GPIO property if we
need it. Whichever device implements it, either the MAC or the MDIO bus
can carry a _DSD. We could explicitly define a GPIO property in MDIO,
but I think this may conflict with the existing ACPI definition. The
ACPI people may not agree to do this because _DSD already exists. We
just need to use _DSD to tell the GPIO layer that there may be Device
Specific Data for this device.

So far MDIO owns the external PHY "reset" as an optional feature, and
MDIO is integrated in the MAC, so we need to point the MAC adev to
MDIO->dev.fwnode. Most importantly, this feature does not depend on
the SoC; it depends on the motherboard design.

ACPI "reset-gpios" example:
Documentation/acpi/gpio-properties.txt.

Cheers,
-Dongsheng

On 2018/1/26 0:05:02, "Timur Tabi"  wrote:

>On 01/25/2018 09:59 AM, Andrew Lunn wrote:
>>I expect we will implement something like acpi_mdiobus_register(), and
>>it will take a pointer to an ACPI node. And maybe on top of
>>of_mdiobus_register() and acpi_mdiobus_register() we will add a
>>device_mdiobus_register().
>
>Makes sense.  If you remember, please CC me on any patches.
>
>>What i'm trying to avoid is drivers ending up with different ACPI
>>bindings. If you don't want to add an ACPI node/property then no
>>problems, just don't expect to be able to use any of the optional
>>features of the MDIO core, like the GPIOs for reset.
>
>Well, if a new binding is created, we will update our ACPI tables and
>drivers to use it.  But we may need to keep the legacy code in emac-phy.c
>for backwards compatibility with older firmware.
>
>--
>Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
>Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
>Code Aurora Forum, a Linux Foundation Collaborative Project.
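
A rough sketch of the dispatch helper being discussed might look as
follows; the names and signature are speculative, since nothing like this
existed at the time:

#include <linux/of.h>
#include <linux/of_mdio.h>
#include <linux/phy.h>
#include <linux/property.h>

static int device_mdiobus_register(struct device *dev, struct mii_bus *bus)
{
        struct fwnode_handle *fwnode = dev_fwnode(dev);

        if (is_of_node(fwnode))
                return of_mdiobus_register(bus, to_of_node(fwnode));
        /* an acpi_mdiobus_register(bus, fwnode) branch would go here */
        return mdiobus_register(bus);
}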

Re: Reply: [PATCH] net: clean the sk_frag.page of new cloned socket

2018-01-25 Thread Cong Wang
On Thu, Jan 25, 2018 at 7:14 PM, Eric Dumazet  wrote:
> On Fri, 2018-01-26 at 02:09 +0000, Li,Rongqing wrote:
>
>>
>> crash> bt 8683
>> PID: 8683   TASK: 881faa088000  CPU: 10  COMMAND: "mynode"
>>  #0 [881fff145e78] crash_nmi_callback at 81031712
>>  #1 [881fff145e88] nmi_handle at 816cafe9
>>  #2 [881fff145ec8] do_nmi at 816cb0f0
>>  #3 [881fff145ef0] end_repeat_nmi at 816ca4a1
>> [exception RIP: _raw_spin_lock_irqsave+62]
>> RIP: 816c9a9e  RSP: 881fa992b990  RFLAGS: 0002
>> RAX: 4358  RBX: 88207ffd7e80  RCX: 4358
>> RDX: 4356  RSI: 0246  RDI: 88207ffd7ee8
>> RBP: 881fa992b990   R8:    R9: 019a16e6
>> R10: 4d24  R11: 4000  R12: 0242
>> R13: 4d24  R14: 0001  R15: 
>> ORIG_RAX:   CS: 0010  SS: 0018
>> ---  ---
>>  #4 [881fa992b990] _raw_spin_lock_irqsave at 816c9a9e
>>  #5 [881fa992b998] get_page_from_freelist at 8113ce5f
>>  #6 [881fa992ba70] __alloc_pages_nodemask at 8113d15f
>>  #7 [881fa992bba0] alloc_pages_current at 8117ab29
>>  #8 [881fa992bbe8] sk_page_frag_refill at 815dd310
>>  #9 [881fa992bc18] tcp_sendmsg at 8163e4f3
>> #10 [881fa992bcd8] inet_sendmsg at 81668434
>> #11 [881fa992bd08] sock_sendmsg at 815d9719
>> #12 [881fa992be58] SYSC_sendto at 815d9c81
>> #13 [881fa992bf70] sys_sendto at 815da6ae
>> #14 [881fa992bf80] system_call_fastpath at 816d2189
>>
>
> Note that tcp_sendmsg() does not use sk->sk_frag, but the per task
> page.
>
> Unless something changes sk->sk_allocation, which a user application
> can not do.
>

Some kernel TCP sockets use atomic allocation, e.g.,
o2net_open_listening_sock().


Re: [PATCH bpf-next 1/2] bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map

2018-01-25 Thread Eric Dumazet
On Thu, 2018-01-18 at 15:08 -0800, Yonghong Song wrote:

> +find_leftmost:
> + /* Find the leftmost non-intermediate node, all intermediate nodes
> +  * have exactly two children, so this function will never return NULL.
> +  */

syzbot [1] disagrees violently with this comment.

> + for (node = rcu_dereference(*root); node;) {
> + if (!(node->flags & LPM_TREE_NODE_FLAG_IM))
> + next_node = node;
> + node = rcu_dereference(node->child[0]);
> + }
> +do_copy:
> + next_key->prefixlen = next_node->prefixlen;
> + memcpy((void *)next_key + offsetof(struct bpf_lpm_trie_key, data),
> +next_node->data, trie->data_size);

[1]

syzbot hit the following crash on e9dcd80b9d77a92bfae6ce42a451f5c5fd318832
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: 
https://syzkaller-buganizer.googleplex.com/text?tag=Config=b2216f04db2aa337e2bbf5ebd233919c3e2aa05f
compiler: gcc (GCC) 7.1.1 20170620


Unfortunately, I don't have any reproducer for this bug yet.
raw crash log: 
https://syzkaller-buganizer.googleplex.com/text?tag=CrashLog=4f78be02e2cd37040b8796322e02b147caae6024
dashboard link: https://syzkaller.appspot.com/bug?extid=148b56534d9269ab7433

See http://go/syzbot for details on how to handle this bug.

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault:  [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 8033 Comm: syz-executor3 Not tainted 4.15.0-rc8+ #4
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
RIP: 0010:trie_get_next_key+0x3c2/0xf10 kernel/bpf/lpm_trie.c:682
RSP: 0018:8801aa44f628 EFLAGS: 00010202
RAX: dc00 RBX:  RCX: 81829a9d
RDX: 0004 RSI: c90003b7b000 RDI: 0020
RBP: 8801aa44f8b0 R08: 817e8f95 R09: 0002
R10: 8801aa44f790 R11:  R12: 
R13: 110035489f01 R14: fff4 R15: 
FS:  7fbb3b39b700() GS:8801db30() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 2057a000 CR3: 0001c26e4005 CR4: 001606e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 map_get_next_key kernel/bpf/syscall.c:842 [inline]
 SYSC_bpf kernel/bpf/syscall.c:1881 [inline]
 SyS_bpf+0x11b4/0x4860 kernel/bpf/syscall.c:1846
 entry_SYSCALL_64_fastpath+0x29/0xa0
RIP: 0033:0x452f19
RSP: 002b:7fbb3b39ac58 EFLAGS: 0212 ORIG_RAX: 0141
RAX: ffda RBX: 0071bea0 RCX: 00452f19
RDX: 0018 RSI: 20daf000 RDI: 0004
RBP: 003e R08:  R09: 
R10:  R11: 0212 R12: 006ef670
R13:  R14: 7fbb3b39b6d4 R15: 
Code: 19 d3 ff e8 81 98 ed ff 4d 85 e4 0f 85 30 ff ff ff e8 73 98 ed ff 49 8d 
7f 20 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 
08 3c 03 0f 8e f2 0a 00 00 48 8b b5 98 fd 
RIP: trie_get_next_key+0x3c2/0xf10 kernel/bpf/lpm_trie.c:682 RSP: 
8801aa44f628
---[ end trace b4eb675edf4c4059 ]---
Kernel panic - not syncing: Fatal exception
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..
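
One way to harden the loop Eric quotes would be an explicit NULL check
before the copy; this is only a sketch of the idea, not the eventual fix:

        /* Sketch: do not trust the 'never returns NULL' comment; bail
         * out if the walk finds no non-intermediate node.
         */
        next_node = NULL;
        for (node = rcu_dereference(*root); node;
             node = rcu_dereference(node->child[0])) {
                if (!(node->flags & LPM_TREE_NODE_FLAG_IM))
                        next_node = node;
        }
        if (!next_node)
                return -ENOENT;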



Re: [PATCH net-next 2/2] net: sched: add em_ipt ematch for calling xtables matches

2018-01-25 Thread Eyal Birger
On Thu, Jan 25, 2018 at 2:00 AM, Pablo Neira Ayuso  wrote:
> On Wed, Jan 24, 2018 at 04:37:16PM -0500, David Miller wrote:
>> From: Eyal Birger 
>> Date: Tue, 23 Jan 2018 11:17:32 +0200
>>
>> > +   network_offset = skb_network_offset(skb);
>> > +   skb_pull(skb, network_offset);
>> > +
>> > +   rcu_read_lock();
>> > +
>> > +   if (skb->skb_iif)
>> > +   indev = dev_get_by_index_rcu(em->net, skb->skb_iif);
>> > +
>> > +   nf_hook_state_init(&state, im->hook, im->nfproto, indev ?: skb->dev,
>> > +  skb->dev, NULL, em->net, NULL);
>> > +
>> > +   acpar.match = im->match;
>> > +   acpar.matchinfo = im->match_data;
>> > +   acpar.state = &state;
>> > +
>> > +   ret = im->match->match(skb, &acpar);
>> > +
>> > +   rcu_read_unlock();
>> > +
>> > +   skb_push(skb, network_offset);
>>
>> If the SKB is shared in any way, this pull/push around the NF hook
>> invocation is illegal.
>
> At ingress, skb->data points to the network header, which is what the
> xtables matches expect, so these are actually noops, therefore,
> skb_pull() and skb_push() can be removed.

Right. I added those for completeness in supporting the xmit path.
In the xt_policy use-case it is irrelevant as tc is invoked after
encapsulation.

I will submit a v2 without these, asserting the ingress direction.

Note I have followed the example in em_ipset.c, so that might be a
problem there too...

Thanks!
Eyal.
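
Pablo's point can be restated as a short sketch (illustrative only): at tc
ingress skb->data already sits at the network header, so the offset is
zero and the pull/push pair degenerates to a no-op:

static int match_at_network_header(struct sk_buff *skb)
{
        int off = skb_network_offset(skb);      /* 0 at tc ingress */

        skb_pull(skb, off);                     /* no-op when off == 0 */
        /* ... run the xtables match with skb->data at the IP header ... */
        skb_push(skb, off);
        return 0;
}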


[PATCH net] net: ipv6: send unsolicited NA after DAD

2018-01-25 Thread David Ahern
Unsolicited IPv6 neighbor advertisements should be sent after DAD
completes. Update ndisc_send_unsol_na to skip tentative, non-optimistic
addresses and have those sent by addrconf_dad_completed after DAD.

Fixes: 4a6e3c5def13c ("net: ipv6: send unsolicited NA on admin up")
Reported-by: Vivek Venkatraman 
Signed-off-by: David Ahern 
---
 net/ipv6/addrconf.c | 30 ++
 net/ipv6/ndisc.c|  5 +
 2 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index ab99cb641b7c..adcaaad115f5 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -186,7 +186,8 @@ static struct rt6_info *addrconf_get_prefix_route(const 
struct in6_addr *pfx,
 
 static void addrconf_dad_start(struct inet6_ifaddr *ifp);
 static void addrconf_dad_work(struct work_struct *w);
-static void addrconf_dad_completed(struct inet6_ifaddr *ifp, bool bump_id);
+static void addrconf_dad_completed(struct inet6_ifaddr *ifp, bool bump_id,
+  bool send_na);
 static void addrconf_dad_run(struct inet6_dev *idev);
 static void addrconf_rs_timer(struct timer_list *t);
 static void __ipv6_ifa_notify(int event, struct inet6_ifaddr *ifa);
@@ -3838,12 +3839,17 @@ static void addrconf_dad_begin(struct inet6_ifaddr *ifp)
 idev->cnf.accept_dad < 1) ||
!(ifp->flags & IFA_F_TENTATIVE) ||
ifp->flags & IFA_F_NODAD) {
+   bool send_na = false;
+
+   if (ifp->flags & IFA_F_TENTATIVE &&
+   !(ifp->flags & IFA_F_OPTIMISTIC))
+   send_na = true;
bump_id = ifp->flags & IFA_F_TENTATIVE;
ifp->flags &= 
~(IFA_F_TENTATIVE|IFA_F_OPTIMISTIC|IFA_F_DADFAILED);
spin_unlock(&ifp->lock);
read_unlock_bh(&idev->lock);
 
-   addrconf_dad_completed(ifp, bump_id);
+   addrconf_dad_completed(ifp, bump_id, send_na);
return;
}
 
@@ -3972,16 +3978,21 @@ static void addrconf_dad_work(struct work_struct *w)
}
 
if (ifp->dad_probes == 0) {
+   bool send_na = false;
+
/*
 * DAD was successful
 */
 
+   if (ifp->flags & IFA_F_TENTATIVE &&
+   !(ifp->flags & IFA_F_OPTIMISTIC))
+   send_na = true;
bump_id = ifp->flags & IFA_F_TENTATIVE;
ifp->flags &= 
~(IFA_F_TENTATIVE|IFA_F_OPTIMISTIC|IFA_F_DADFAILED);
spin_unlock(&ifp->lock);
write_unlock_bh(&idev->lock);
 
-   addrconf_dad_completed(ifp, bump_id);
+   addrconf_dad_completed(ifp, bump_id, send_na);
 
goto out;
}
@@ -4019,7 +4030,8 @@ static bool ipv6_lonely_lladdr(struct inet6_ifaddr *ifp)
return true;
 }
 
-static void addrconf_dad_completed(struct inet6_ifaddr *ifp, bool bump_id)
+static void addrconf_dad_completed(struct inet6_ifaddr *ifp, bool bump_id,
+  bool send_na)
 {
struct net_device *dev = ifp->idev->dev;
struct in6_addr lladdr;
@@ -4051,6 +4063,16 @@ static void addrconf_dad_completed(struct inet6_ifaddr 
*ifp, bool bump_id)
if (send_mld)
ipv6_mc_dad_complete(ifp->idev);
 
+   /* send unsolicited NA if enabled */
+   if (send_na &&
+   (ifp->idev->cnf.ndisc_notify ||
+dev_net(dev)->ipv6.devconf_all->ndisc_notify)) {
+   ndisc_send_na(dev, &in6addr_linklocal_allnodes, &ifp->addr,
+ /*router=*/ !!ifp->idev->cnf.forwarding,
+ /*solicited=*/ false, /*override=*/ true,
+ /*inc_opt=*/ true);
+   }
+
if (send_rs) {
/*
 *  If a host as already performed a random delay
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index b3cea200c85e..f61a5b613b52 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -566,6 +566,11 @@ static void ndisc_send_unsol_na(struct net_device *dev)
 
read_lock_bh(&idev->lock);
list_for_each_entry(ifa, &idev->addr_list, if_list) {
+   /* skip tentative addresses until dad completes */
+   if (ifa->flags & IFA_F_TENTATIVE &&
+   !(ifa->flags & IFA_F_OPTIMISTIC))
+   continue;
+
ndisc_send_na(dev, &in6addr_linklocal_allnodes, &ifa->addr,
  /*router=*/ !!idev->cnf.forwarding,
  /*solicited=*/ false, /*override=*/ true,
-- 
2.11.0



Reply: Reply: [PATCH] net: clean the sk_frag.page of new cloned socket

2018-01-25 Thread Li,Rongqing


> -----Original Message-----
> From: Eric Dumazet [mailto:eric.duma...@gmail.com]
> Sent: 2018-01-26 11:14
> To: Li,Rongqing ; netdev@vger.kernel.org
> Cc: eduma...@google.com
> Subject: Re: Reply: [PATCH] net: clean the sk_frag.page of new cloned socket
> 
> On Fri, 2018-01-26 at 02:09 +0000, Li,Rongqing wrote:
> 
> >
> > crash> bt 8683
> > PID: 8683   TASK: 881faa088000  CPU: 10  COMMAND: "mynode"
> >  #0 [881fff145e78] crash_nmi_callback at 81031712
> >  #1 [881fff145e88] nmi_handle at 816cafe9
> >  #2 [881fff145ec8] do_nmi at 816cb0f0
> >  #3 [881fff145ef0] end_repeat_nmi at 816ca4a1
> > [exception RIP: _raw_spin_lock_irqsave+62]
> > RIP: 816c9a9e  RSP: 881fa992b990  RFLAGS: 0002
> > RAX: 4358  RBX: 88207ffd7e80  RCX:
> 4358
> > RDX: 4356  RSI: 0246  RDI:
> 88207ffd7ee8
> > RBP: 881fa992b990   R8:    R9:
> 019a16e6
> > R10: 4d24  R11: 4000  R12:
> 0242
> > R13: 4d24  R14: 0001  R15:
> 
> > ORIG_RAX:   CS: 0010  SS: 0018
> > ---  ---
> >  #4 [881fa992b990] _raw_spin_lock_irqsave at 816c9a9e
> >  #5 [881fa992b998] get_page_from_freelist at 8113ce5f
> >  #6 [881fa992ba70] __alloc_pages_nodemask at 8113d15f
> >  #7 [881fa992bba0] alloc_pages_current at 8117ab29
> >  #8 [881fa992bbe8] sk_page_frag_refill at 815dd310
> >  #9 [881fa992bc18] tcp_sendmsg at 8163e4f3
> > #10 [881fa992bcd8] inet_sendmsg at 81668434
> > #11 [881fa992bd08] sock_sendmsg at 815d9719
> > #12 [881fa992be58] SYSC_sendto at 815d9c81
> > #13 [881fa992bf70] sys_sendto at 815da6ae
> > #14 [881fa992bf80] system_call_fastpath at 816d2189
> >
> 
> Note that tcp_sendmsg() does not use sk->sk_frag, but the per task page.
> 
> Unless something changes sk->sk_allocation, which a user application can
> not do.
> 
> Are you using a pristine upstream kernel ?

No

I do not know how to reproduce my bug, I find it twice online.

-RongQing



Reply: Reply: [PATCH] net: clean the sk_frag.page of new cloned socket

2018-01-25 Thread Li,Rongqing

> > my kernel is 3.10, I did not find the root cause; I guessed at all
> > kinds of possibilities
> >
> 
> Have you backported 22a0e18eac7a9e986fec76c60fa4a2926d1291e2 ?
> 
> 
When I saw this bug, I found this commit and backported it,
but it seems not related to my bug.


> > > I would rather move that in tcp_disconnect() that only fuzzers use,
> > > instead of doing this on every clone and slowing down normal users.
> > >
> >
> >
> > Do you mean we should fix it like below:
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index
> > f08eebe60446..44f8320610ab 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -2431,6 +2431,12 @@ int tcp_disconnect(struct sock *sk, int flags)
> >
> > WARN_ON(inet->inet_num && !icsk->icsk_bind_hash);
> >
> > +
> > +   if (sk->sk_frag.page) {
> > +   put_page(sk->sk_frag.page);
> > +   sk->sk_frag.page = NULL;
> > +   }
> > +
> > sk->sk_error_report(sk);
> > return err;
> >  }
> 
> Yes, something like that.

Ok, thanks

-R



Re: [Patch net-next v3 3/3] net_sched: implement ->change_tx_queue_len() for pfifo_fast

2018-01-25 Thread Cong Wang
On Thu, Jan 25, 2018 at 7:57 PM, Jason Wang  wrote:
>
>
> On 2018-01-26 10:26, Cong Wang wrote:
>>
>> pfifo_fast used to drop based on qdisc_dev(qdisc)->tx_queue_len,
>> so we have to resize skb array when we change tx_queue_len.
>>
>> Other qdiscs which read tx_queue_len are fine because they
>> all save it to sch->limit or somewhere else in qdisc during init.
>> They don't have to implement this, but it is nicer if they do, so
>> that users don't have to re-configure the qdisc after changing
>> tx_queue_len.
>>
>> Cc: John Fastabend 
>> Signed-off-by: Cong Wang 
>> ---
>>   net/sched/sch_generic.c | 18 ++
>>   1 file changed, 18 insertions(+)
>>
>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>> index 08f9fa27e06e..190570f21b20 100644
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -763,6 +763,23 @@ static void pfifo_fast_destroy(struct Qdisc *sch)
>> }
>>   }
>>   +static int pfifo_fast_change_tx_queue_len(struct Qdisc *sch,
>> + unsigned int new_len)
>> +{
>> +   struct pfifo_fast_priv *priv = qdisc_priv(sch);
>> +   struct skb_array *bands[PFIFO_FAST_BANDS];
>> +   int prio;
>> +
>> +   for (prio = 0; prio < PFIFO_FAST_BANDS; prio++) {
>> +   struct skb_array *q = band2list(priv, prio);
>> +
>> +   bands[prio] = q;
>> +   }
>> +
>> +   return skb_array_resize_multiple(bands, PFIFO_FAST_BANDS, new_len,
>> +GFP_KERNEL);
>> +}
>> +
>>   struct Qdisc_ops pfifo_fast_ops __read_mostly = {
>> .id =   "pfifo_fast",
>> .priv_size  =   sizeof(struct pfifo_fast_priv),
>> @@ -773,6 +790,7 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = {
>> .destroy=   pfifo_fast_destroy,
>> .reset  =   pfifo_fast_reset,
>> .dump   =   pfifo_fast_dump,
>> +   .change_tx_queue_len =  pfifo_fast_change_tx_queue_len,
>> .owner  =   THIS_MODULE,
>> .static_flags   =   TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
>>   };
>
>
> Is __skb_array_empty() in pfifo_fast_dequeue() still safe after this change?

Yes, we sync with dequeue path before calling ->change_tx_queue_len().
I already mentioned this in patch 2/3.


Re: [Patch net-next v3 3/3] net_sched: implement ->change_tx_queue_len() for pfifo_fast

2018-01-25 Thread Jason Wang



On 2018-01-26 10:26, Cong Wang wrote:

pfifo_fast used to drop based on qdisc_dev(qdisc)->tx_queue_len,
so we have to resize skb array when we change tx_queue_len.

Other qdiscs which read tx_queue_len are fine because they
all save it to sch->limit or somewhere else in qdisc during init.
They don't have to implement this, but it is nicer if they do, so
that users don't have to re-configure the qdisc after changing
tx_queue_len.

Cc: John Fastabend 
Signed-off-by: Cong Wang 
---
  net/sched/sch_generic.c | 18 ++
  1 file changed, 18 insertions(+)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 08f9fa27e06e..190570f21b20 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -763,6 +763,23 @@ static void pfifo_fast_destroy(struct Qdisc *sch)
}
  }
  
+static int pfifo_fast_change_tx_queue_len(struct Qdisc *sch,

+ unsigned int new_len)
+{
+   struct pfifo_fast_priv *priv = qdisc_priv(sch);
+   struct skb_array *bands[PFIFO_FAST_BANDS];
+   int prio;
+
+   for (prio = 0; prio < PFIFO_FAST_BANDS; prio++) {
+   struct skb_array *q = band2list(priv, prio);
+
+   bands[prio] = q;
+   }
+
+   return skb_array_resize_multiple(bands, PFIFO_FAST_BANDS, new_len,
+GFP_KERNEL);
+}
+
  struct Qdisc_ops pfifo_fast_ops __read_mostly = {
.id =   "pfifo_fast",
.priv_size  =   sizeof(struct pfifo_fast_priv),
@@ -773,6 +790,7 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = {
.destroy=   pfifo_fast_destroy,
.reset  =   pfifo_fast_reset,
.dump   =   pfifo_fast_dump,
+   .change_tx_queue_len =  pfifo_fast_change_tx_queue_len,
.owner  =   THIS_MODULE,
.static_flags   =   TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
  };


Is __skb_array_empty() in pfifo_fast_dequeue() still safe after this change?

Thanks


Re: [PATCH net-next 12/12] tools/virtio: fix smp_mb on x86

2018-01-25 Thread Jason Wang



On 2018-01-26 07:36, Michael S. Tsirkin wrote:

Offset 128 overlaps the last word of the redzone.
Use 132 which is always beyond that.

Signed-off-by: Michael S. Tsirkin 
---
  tools/virtio/ringtest/main.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/virtio/ringtest/main.h b/tools/virtio/ringtest/main.h
index 593a328..301d59b 100644
--- a/tools/virtio/ringtest/main.h
+++ b/tools/virtio/ringtest/main.h
@@ -111,7 +111,7 @@ static inline void busy_wait(void)
  }
  
  #if defined(__x86_64__) || defined(__i386__)

-#define smp_mb() asm volatile("lock; addl $0,-128(%%rsp)" ::: "memory", 
"cc")


Just wondering, does "rsp" work for __i386__?

Thanks


+#define smp_mb() asm volatile("lock; addl $0,-132(%%rsp)" ::: "memory", 
"cc")
  #else
  /*
   * Not using __ATOMIC_SEQ_CST since gcc docs say they are only synchronized




[PATCH] atm: atmtcp: Replace GFP_ATOMIC with GFP_KERNEL in atmtcp_v_send

2018-01-25 Thread Jia-Ju Bai
After checking all possible call chains to atmtcp_v_send() here,
my tool finds that atmtcp_v_send() is never called in atomic context.
This function is assigned to the function pointer "dev->ops->send",
which is only called by vcc_sendmsg() (net/atm/common.c)
through vcc->dev->ops->send(), and vcc_sendmsg() calls schedule(),
which indicates that atmtcp_v_send() can call functions that may sleep.
Thus GFP_ATOMIC is not necessary, and it can be replaced with GFP_KERNEL.

This was found by DCNS, a static analysis tool that I wrote.

Signed-off-by: Jia-Ju Bai 
---
 drivers/atm/atmtcp.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/atm/atmtcp.c b/drivers/atm/atmtcp.c
index afebeb1c..f7a27ed 100644
--- a/drivers/atm/atmtcp.c
+++ b/drivers/atm/atmtcp.c
@@ -210,7 +210,7 @@ static int atmtcp_v_send(struct atm_vcc *vcc,struct sk_buff 
*skb)
return -ENOLINK;
}
size = skb->len+sizeof(struct atmtcp_hdr);
-   new_skb = atm_alloc_charge(out_vcc,size,GFP_ATOMIC);
+   new_skb = atm_alloc_charge(out_vcc,size,GFP_KERNEL);
if (!new_skb) {
if (vcc->pop) vcc->pop(vcc,skb);
else dev_kfree_skb(skb);
-- 
1.7.9.5



[PATCH net] gianfar: prevent integer wrapping in the rx handler

2018-01-25 Thread Andy Spencer
When the frame check sequence (FCS) is split across the last two frames
of a fragmented packet, part of the FCS gets counted twice, once when
subtracting the FCS, and again when subtracting the previously received
data.

For example, if 1602 bytes are received, and the first fragment contains
the first 1600 bytes (including the first two bytes of the FCS), and the
second fragment contains the last two bytes of the FCS:

  'skb->len == 1600' from the first fragment

  size  = lstatus & BD_LENGTH_MASK; # 1602
  size -= ETH_FCS_LEN;  # 1598
  size -= skb->len; # -2

Since the size is unsigned, it wraps around and causes a BUG later in
the packet handling, as shown below:

  kernel BUG at ./include/linux/skbuff.h:2068!
  Oops: Exception in kernel mode, sig: 5 [#1]
  ...
  NIP [c021ec60] skb_pull+0x24/0x44
  LR [c01e2fbc] gfar_clean_rx_ring+0x498/0x690
  Call Trace:
  [df7edeb0] [c01e2c1c] gfar_clean_rx_ring+0xf8/0x690 (unreliable)
  [df7edf20] [c01e33a8] gfar_poll_rx_sq+0x3c/0x9c
  [df7edf40] [c023352c] net_rx_action+0x21c/0x274
  [df7edf90] [c0329000] __do_softirq+0xd8/0x240
  [df7edff0] [c000c108] call_do_irq+0x24/0x3c
  [c0597e90] [c00041dc] do_IRQ+0x64/0xc4
  [c0597eb0] [c000d920] ret_from_except+0x0/0x18
  --- interrupt: 501 at arch_cpu_idle+0x24/0x5c

Change the size to a signed integer and then trim off any part of the
FCS that was received prior to the last fragment.
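
The wraparound is easy to demonstrate in isolation (a stand-alone sketch,
not driver code):

#include <stdio.h>

int main(void)
{
        unsigned int usize = 1602;      /* lstatus & BD_LENGTH_MASK */
        int ssize = 1602;

        usize -= 4;     /* ETH_FCS_LEN */
        usize -= 1600;  /* skb->len from the first fragment */
        ssize -= 4;
        ssize -= 1600;

        printf("unsigned: %u\n", usize); /* 4294967294: passes 'size > 0' */
        printf("signed:   %d\n", ssize); /* -2: caught by 'size < 0' */
        return 0;
}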

Fixes: 6c389fc931bc ("gianfar: fix size of scatter-gathered frames")
Signed-off-by: Andy Spencer 
---
 drivers/net/ethernet/freescale/gianfar.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar.c 
b/drivers/net/ethernet/freescale/gianfar.c
index 7f83700..3bdeb29 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -2932,7 +2932,7 @@ static irqreturn_t gfar_transmit(int irq, void *grp_id)
 static bool gfar_add_rx_frag(struct gfar_rx_buff *rxb, u32 lstatus,
 struct sk_buff *skb, bool first)
 {
-   unsigned int size = lstatus & BD_LENGTH_MASK;
+   int size = lstatus & BD_LENGTH_MASK;
struct page *page = rxb->page;
bool last = !!(lstatus & BD_LFLAG(RXBD_LAST));
 
@@ -2947,11 +2947,16 @@ static bool gfar_add_rx_frag(struct gfar_rx_buff *rxb, 
u32 lstatus,
if (last)
size -= skb->len;
 
-   /* in case the last fragment consisted only of the FCS */
+   /* Add the last fragment if it contains something other than
+* the FCS, otherwise drop it and trim off any part of the FCS
+* that was already received.
+*/
if (size > 0)
skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
rxb->page_offset + RXBUF_ALIGNMENT,
size, GFAR_RXB_TRUESIZE);
+   else if (size < 0)
+   pskb_trim(skb, skb->len + size);
}
 
/* try reuse page */
-- 
2.7.4



[PATCH] atm: solos-pci: Replace GFP_ATOMIC with GFP_KERNEL in psend

2018-01-25 Thread Jia-Ju Bai
After checking all possible call chains to psend() here,
my tool finds that psend() is never called in atomic context.
This function is assigned to the function pointer "dev->ops->send",
which is only called by vcc_sendmsg() (net/atm/common.c)
through vcc->dev->ops->send(), and vcc_sendmsg() calls schedule(),
which indicates that psend() can call functions that may sleep.
Thus GFP_ATOMIC is not necessary, and it can be replaced with GFP_KERNEL.

This was found by DCNS, a static analysis tool that I wrote.

Signed-off-by: Jia-Ju Bai 
---
 drivers/atm/solos-pci.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/atm/solos-pci.c b/drivers/atm/solos-pci.c
index 0df1a1c..ac10b62 100644
--- a/drivers/atm/solos-pci.c
+++ b/drivers/atm/solos-pci.c
@@ -1166,7 +1166,7 @@ static int psend(struct atm_vcc *vcc, struct sk_buff *skb)
if (skb_headroom(skb) < sizeof(*header))
expand_by = sizeof(*header) - skb_headroom(skb);
 
-   ret = pskb_expand_head(skb, expand_by, 0, GFP_ATOMIC);
+   ret = pskb_expand_head(skb, expand_by, 0, GFP_KERNEL);
if (ret) {
dev_warn(&card->dev->dev, "pskb_expand_head failed.\n");
solos_pop(vcc, skb);
-- 
1.7.9.5



Re: [PATCH net-next] ptr_ring: fix integer overflow

2018-01-25 Thread Jason Wang



On 2018-01-26 01:31, Michael S. Tsirkin wrote:

On Thu, Jan 25, 2018 at 10:17:38PM +0800, Jason Wang wrote:


On 2018-01-25 21:45, Michael S. Tsirkin wrote:

On Thu, Jan 25, 2018 at 03:31:42PM +0800, Jason Wang wrote:

We try to allocate one more entry for lockless peeking. The add
operation may overflow, which causes zero to be passed to kmalloc().
In this case, kmalloc() returns ZERO_SIZE_PTR without any notice to
ptr_ring. Trying to produce to or consume from such a ring will lead
to a NULL pointer dereference. Fix this by detecting it and failing
early.

Fixes: bcecb4bbf88a ("net: ptr_ring: otherwise safe empty checks can overrun array 
bounds")
Reported-by: syzbot+87678bcf753b44c39...@syzkaller.appspotmail.com
Cc: John Fastabend
Signed-off-by: Jason Wang

Ugh that's just way too ugly.
I'll work on dropping the extra + 1 - but calling this
function with -1 size is the real source of the bug.
Do you know how come we do that?


It happens when e.g. trying to change tx_queue_len to UINT_MAX. And we
probably can't prevent the user from trying to do this?

Thanks

Right. BTW why net-next? Isn't the crash exploitable in net?



Commit bcecb4bbf88a exists only in net-next. And in net we check r->size 
before trying to dereference the queue.


Thanks
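
For reference, a minimal sketch of the early check under discussion,
assuming the net-next layout where one extra slot is allocated for
lockless peeking:

#include <linux/slab.h>

static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp)
{
        /* size + 1 wraps to 0 for UINT_MAX; also cap the total so the
         * allocation can never silently become ZERO_SIZE_PTR.
         */
        if (size > KMALLOC_MAX_SIZE / sizeof(void *) - 1)
                return NULL;
        return kcalloc(size + 1, sizeof(void *), gfp);
}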


[PATCH] atm: idt77252: Replace mdelay with usleep_range in idt77252_preset

2018-01-25 Thread Jia-Ju Bai
After checking all possible call chains to idt77252_preset() here,
my tool finds that idt77252_preset() is never called in atomic context,
namely never in an interrupt handler or while holding a spinlock.
And idt77252_preset() calls deinit_card, which calls free_irq (can sleep),
which indicates that idt77252_preset() can call functions that may sleep.
Thus mdelay can be replaced with usleep_range to avoid a busy wait.

This was found by DCNS, a static analysis tool that I wrote.
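
As background, the trade-off being exploited is sketched below
(illustrative only): mdelay() busy-waits and is safe anywhere, while
usleep_range() yields the CPU and therefore needs the sleepable context
the analysis above establishes:

#include <linux/delay.h>

static void settle_after_reset(bool can_sleep)
{
        if (can_sleep)
                usleep_range(500, 1000); /* sleeps; bounds in microseconds */
        else
                mdelay(1);               /* spins; safe in atomic context */
}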

Signed-off-by: Jia-Ju Bai 
---
 drivers/atm/idt77252.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/atm/idt77252.c b/drivers/atm/idt77252.c
index 0277f36..cea4bf2 100644
--- a/drivers/atm/idt77252.c
+++ b/drivers/atm/idt77252.c
@@ -3563,7 +3563,7 @@ static int idt77252_preset(struct idt77252_dev *card)
 
/* Software reset */
writel(SAR_CFG_SWRST, SAR_REG_CFG);
-   mdelay(1);
+   usleep_range(500, 1000);
writel(0, SAR_REG_CFG);
 
IPRINTK("%s: Software resetted.\n", card->name);
-- 
1.7.9.5



Re: general protection fault in tun_do_read

2018-01-25 Thread Jason Wang



On 2018-01-25 15:59, syzbot wrote:

Hello,

syzbot hit the following crash on net-next commit
43df215d99e6049d4680309c54232689e16ddd6b (Wed Jan 24 01:24:32 2018 +0000)
Merge branch 'bpf-and-netdevsim-test-updates'

So far this crash happened 2 times on net-next.
Unfortunately, I don't have any reproducer for this crash yet.
Raw console output is attached.
compiler: gcc (GCC) 7.1.1 20170620
.config is attached.

IMPORTANT: if you fix the bug, please add the following tag to the 
commit:

Reported-by: syzbot+885a488f0e7e392e9...@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for 
details.

If you forward the report, please keep this part and the footer.


I suspect this is the same as what syzbot reports here
https://www.spinics.net/lists/netdev/msg480736.html.


We're discussing a solution posted by Michael here 
https://patchwork.ozlabs.org/patch/866064/


Thanks


Re: [PATCH net-next 00/12] ptr_ring fixes

2018-01-25 Thread Jason Wang



On 2018-01-26 07:36, Michael S. Tsirkin wrote:

This fixes a bunch of issues around ptr_ring use in net core.
One of these: "tap: fix use-after-free" is also needed on net,
but can't be backported cleanly.

I will post a net patch separately.

Lightly tested - Jason, could you pls confirm this
addresses the security issue you saw with ptr_ring?
Testing reports would be appreciated too.


With the reproducer provided by syzbot, the problem has been fixed.

Thanks


Re: [PATCH net-next 03/12] ptr_ring: READ/WRITE_ONCE for __ptr_ring_empty

2018-01-25 Thread Jason Wang



On 2018-01-26 10:44, Michael S. Tsirkin wrote:

On Fri, Jan 26, 2018 at 10:37:58AM +0800, Jason Wang wrote:


On 2018-01-26 07:36, Michael S. Tsirkin wrote:

Lockless __ptr_ring_empty requires that consumer head is read and
written at once, atomically. Annotate accordingly to make sure compiler
does it correctly.  Switch locked callers to __ptr_ring_peek which does
not support the lockless operation.

Signed-off-by: Michael S. Tsirkin 
---
   include/linux/ptr_ring.h | 11 ---
   1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 8594c7b..9a72d8f 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -196,7 +196,9 @@ static inline void *__ptr_ring_peek(struct ptr_ring *r)
*/
   static inline bool __ptr_ring_empty(struct ptr_ring *r)
   {
-   return !__ptr_ring_peek(r);
+   if (likely(r->size))
+   return !r->queue[READ_ONCE(r->consumer_head)];
+   return true;
   }

So after patch 8, __ptr_ring_peek() did:

static inline void *__ptr_ring_peek(struct ptr_ring *r)
{
     if (likely(r->size))
         return READ_ONCE(r->queue[r->consumer_head]);
     return NULL;
}

Looks like a duplication.

Thanks

Nope - they are different.

The reason is that __ptr_ring_peek does not need to read the consumer_head once
since callers have a lock,


I get this.


  and __ptr_ring_empty does not need to read
the queue once since it merely compares it to 0.



Does this still work if it is called inside a loop?

Thanks
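
The loop concern can be illustrated with a sketch (not kernel code):
without READ_ONCE() the compiler may hoist the load out of the loop and
spin on a stale value; with it, every iteration performs a fresh, untorn
load:

static void wait_until_nonempty(struct ptr_ring *r)
{
        while (!r->queue[READ_ONCE(r->consumer_head)])
                cpu_relax();    /* compiler barrier, as the ptr_ring comments require */
}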


Re: Reply: [PATCH] net: clean the sk_frag.page of new cloned socket

2018-01-25 Thread Eric Dumazet
On Fri, 2018-01-26 at 02:09 +0000, Li,Rongqing wrote:

> 
> crash> bt 8683
> PID: 8683   TASK: 881faa088000  CPU: 10  COMMAND: "mynode"
>  #0 [881fff145e78] crash_nmi_callback at 81031712
>  #1 [881fff145e88] nmi_handle at 816cafe9
>  #2 [881fff145ec8] do_nmi at 816cb0f0
>  #3 [881fff145ef0] end_repeat_nmi at 816ca4a1
> [exception RIP: _raw_spin_lock_irqsave+62]
> RIP: 816c9a9e  RSP: 881fa992b990  RFLAGS: 0002
> RAX: 4358  RBX: 88207ffd7e80  RCX: 4358
> RDX: 4356  RSI: 0246  RDI: 88207ffd7ee8
> RBP: 881fa992b990   R8:    R9: 019a16e6
> R10: 4d24  R11: 4000  R12: 0242
> R13: 4d24  R14: 0001  R15: 
> ORIG_RAX:   CS: 0010  SS: 0018
> ---  ---
>  #4 [881fa992b990] _raw_spin_lock_irqsave at 816c9a9e
>  #5 [881fa992b998] get_page_from_freelist at 8113ce5f
>  #6 [881fa992ba70] __alloc_pages_nodemask at 8113d15f
>  #7 [881fa992bba0] alloc_pages_current at 8117ab29
>  #8 [881fa992bbe8] sk_page_frag_refill at 815dd310
>  #9 [881fa992bc18] tcp_sendmsg at 8163e4f3
> #10 [881fa992bcd8] inet_sendmsg at 81668434
> #11 [881fa992bd08] sock_sendmsg at 815d9719
> #12 [881fa992be58] SYSC_sendto at 815d9c81
> #13 [881fa992bf70] sys_sendto at 815da6ae
> #14 [881fa992bf80] system_call_fastpath at 816d2189
> 

Note that tcp_sendmsg() does not use sk->sk_frag, but the per task
page.

Unless something changes sk->sk_allocation, which a user application
can not do.

Are you using a pristine upstream kernel ?



Re: 答复: [PATCH] net: clean the sk_frag.page of new cloned socket

2018-01-25 Thread Eric Dumazet
On Fri, 2018-01-26 at 02:09 +0000, Li,Rongqing wrote:
> > >   if (newsk->sk_prot->sockets_allocated)
> > >   sk_sockets_allocated_inc(newsk);
> > 
> > Good catch.
> > 
> > I suspect this was discovered by some syzkaller/syzbot run ?
> > 
> 
> 
> No.
> 
> I am seeing a panic: a page is in both task.task_frag.page and the buddy
> free list. It should not happen, and the page->lru->next and
> page->lru->prev are 0xdead00100100 and 0xdead00200200; then, when a page
> is allocated from the buddy allocator, the system panics at
> __list_del of __rmqueue
> 
> #0 [881fff0c3850] machine_kexec at 8103cca8
>  #1 [881fff0c38a0] crash_kexec at 810c2443
>  #2 [881fff0c3968] oops_end at 816cae70
>  #3 [881fff0c3990] die at 810063eb
>  #4 [881fff0c39c0] do_general_protection at 816ca7ce
>  #5 [881fff0c39f0] general_protection at 816ca0d8
> [exception RIP: __rmqueue+120]
> RIP: 8113a918  RSP: 881fff0c3aa0  RFLAGS: 00010046
> RAX: 88207ffd8018  RBX: 0003  RCX: 0003
> RDX: 0001  RSI: ea006f4cf620  RDI: dead00200200
> RBP: 881fff0c3b00   R8: 88207ffd8018   R9: 
> R10: dead00100100  R11: ea007ecc6480  R12: ea006f4cf600
> R13:   R14: 0003  R15: 88207ffd7e80
> ORIG_RAX:   CS: 0010  SS: 
>  #6 [881fff0c3b08] get_page_from_freelist at 8113ce71
>  #7 [881fff0c3be0] __alloc_pages_nodemask at 8113d15f
>  #8 [881fff0c3d10] __alloc_page_frag at 815e2362
>  #9 [881fff0c3d40] __netdev_alloc_frag at 815e241b
> #10 [881fff0c3d58] __alloc_rx_skb at 815e2f91
> #11 [881fff0c3d78] __netdev_alloc_skb at 815e300b
> #12 [881fff0c3d90] ixgbe_clean_rx_irq at a003a98f [ixgbe]
> #13 [881fff0c3df8] ixgbe_poll at a003c233 [ixgbe]
> #14 [881fff0c3e70] net_rx_action at 815f2f09
> #15 [881fff0c3ec8] __do_softirq at 81064867
> #16 [881fff0c3f38] call_softirq at 816d3a9c
> #17 [881fff0c3f50] do_softirq at 81004e65
> #18 [881fff0c3f68] irq_exit at 81064b7d
> #19 [881fff0c3f78] do_IRQ at 816d4428
> 
> The page info is like below, some element is removed:
> 
> crash> struct page ea006f4cf600 -x
> struct page {
>   flags = 0x2f4000, 
>   mapping = 0x0, 
>   {
> {
>   counters = 0x2, 
>   {
> {
>   _mapcount = {
> counter = 0x
>   }, 
>   {
> inuse = 0x, 
> objects = 0x7fff, 
> frozen = 0x1
>   }, 
>   units = 0x
> }, 
> _count = {
>   counter = 0x2
> }
>   }
> }
>   }, 
>   {
> lru = {
>   next = 0xdead00100100, 
>   prev = 0xdead00200200
> }, 
>   }, 
> …..
>   }
> }
> crash>
> 
> 
> the page ea006f4cf600 is in another task's task_frag.page, and
> the task backtrace is as below
> 
> crash> task 8683|grep ea006f4cf600 -A3  
> page = 0xea006f4cf600, 
> offset = 32768, 
> size = 32768
>   }, 
> crash>
> 
> crash> bt 8683
> PID: 8683   TASK: 881faa088000  CPU: 10  COMMAND: "mynode"
>  #0 [881fff145e78] crash_nmi_callback at 81031712
>  #1 [881fff145e88] nmi_handle at 816cafe9
>  #2 [881fff145ec8] do_nmi at 816cb0f0
>  #3 [881fff145ef0] end_repeat_nmi at 816ca4a1
> [exception RIP: _raw_spin_lock_irqsave+62]
> RIP: 816c9a9e  RSP: 881fa992b990  RFLAGS: 0002
> RAX: 4358  RBX: 88207ffd7e80  RCX: 4358
> RDX: 4356  RSI: 0246  RDI: 88207ffd7ee8
> RBP: 881fa992b990   R8:    R9: 019a16e6
> R10: 4d24  R11: 4000  R12: 0242
> R13: 4d24  R14: 0001  R15: 
> ORIG_RAX:   CS: 0010  SS: 0018
> ---  ---
>  #4 [881fa992b990] _raw_spin_lock_irqsave at 816c9a9e
>  #5 [881fa992b998] get_page_from_freelist at 8113ce5f
>  #6 [881fa992ba70] __alloc_pages_nodemask at 8113d15f
>  #7 [881fa992bba0] alloc_pages_current at 8117ab29
>  #8 [881fa992bbe8] sk_page_frag_refill at 815dd310
>  #9 [881fa992bc18] tcp_sendmsg at 8163e4f3
> #10 [881fa992bcd8] inet_sendmsg at 81668434
> #11 [881fa992bd08] sock_sendmsg at 815d9719
> #12 [881fa992be58] SYSC_sendto at 815d9c81
> #13 [881fa992bf70] sys_sendto at 815da6ae
> #14 [881fa992bf80] system_call_fastpath at 816d2189
> RIP: 7f5bfe1d804b  RSP: 7f5bfa63b3b0  RFLAGS: 0206
> RAX: 002c  RBX: 816d2189  RCX: 7f5bfa63b420
> RDX: 2000  RSI: 0c096000  RDI: 

Reply: [PATCH] net: clean the sk_frag.page of new cloned socket

2018-01-25 Thread Li,Rongqing


> > if (newsk->sk_prot->sockets_allocated)
> > sk_sockets_allocated_inc(newsk);
> 
> Good catch.
> 
> I suspect this was discovered by some syzkaller/syzbot run ?
> 


No.

I am seeing a panic: a page is in both task.task_frag.page and the buddy
free list. It should not happen, and the page->lru->next and
page->lru->prev are 0xdead00100100 and 0xdead00200200; then, when a page
is allocated from the buddy allocator, the system panics at
__list_del of __rmqueue

#0 [881fff0c3850] machine_kexec at 8103cca8
 #1 [881fff0c38a0] crash_kexec at 810c2443
 #2 [881fff0c3968] oops_end at 816cae70
 #3 [881fff0c3990] die at 810063eb
 #4 [881fff0c39c0] do_general_protection at 816ca7ce
 #5 [881fff0c39f0] general_protection at 816ca0d8
[exception RIP: __rmqueue+120]
RIP: 8113a918  RSP: 881fff0c3aa0  RFLAGS: 00010046
RAX: 88207ffd8018  RBX: 0003  RCX: 0003
RDX: 0001  RSI: ea006f4cf620  RDI: dead00200200
RBP: 881fff0c3b00   R8: 88207ffd8018   R9: 
R10: dead00100100  R11: ea007ecc6480  R12: ea006f4cf600
R13:   R14: 0003  R15: 88207ffd7e80
ORIG_RAX:   CS: 0010  SS: 
 #6 [881fff0c3b08] get_page_from_freelist at 8113ce71
 #7 [881fff0c3be0] __alloc_pages_nodemask at 8113d15f
 #8 [881fff0c3d10] __alloc_page_frag at 815e2362
 #9 [881fff0c3d40] __netdev_alloc_frag at 815e241b
#10 [881fff0c3d58] __alloc_rx_skb at 815e2f91
#11 [881fff0c3d78] __netdev_alloc_skb at 815e300b
#12 [881fff0c3d90] ixgbe_clean_rx_irq at a003a98f [ixgbe]
#13 [881fff0c3df8] ixgbe_poll at a003c233 [ixgbe]
#14 [881fff0c3e70] net_rx_action at 815f2f09
#15 [881fff0c3ec8] __do_softirq at 81064867
#16 [881fff0c3f38] call_softirq at 816d3a9c
#17 [881fff0c3f50] do_softirq at 81004e65
#18 [881fff0c3f68] irq_exit at 81064b7d
#19 [881fff0c3f78] do_IRQ at 816d4428

The page info is as below; some elements are removed:

crash> struct page ea006f4cf600 -x
struct page {
  flags = 0x2f4000, 
  mapping = 0x0, 
  {
{
  counters = 0x2, 
  {
{
  _mapcount = {
counter = 0x
  }, 
  {
inuse = 0x, 
objects = 0x7fff, 
frozen = 0x1
  }, 
  units = 0x
}, 
_count = {
  counter = 0x2
}
  }
}
  }, 
  {
lru = {
  next = 0xdead00100100, 
  prev = 0xdead00200200
}, 
  }, 
…..
  }
}
crash>


the page ea006f4cf600 is in another task's task_frag.page, and
the task backtrace is as below

crash> task 8683|grep ea006f4cf600 -A3  
page = 0xea006f4cf600, 
offset = 32768, 
size = 32768
  }, 
crash>

crash> bt 8683
PID: 8683   TASK: 881faa088000  CPU: 10  COMMAND: "mynode"
 #0 [881fff145e78] crash_nmi_callback at 81031712
 #1 [881fff145e88] nmi_handle at 816cafe9
 #2 [881fff145ec8] do_nmi at 816cb0f0
 #3 [881fff145ef0] end_repeat_nmi at 816ca4a1
[exception RIP: _raw_spin_lock_irqsave+62]
RIP: 816c9a9e  RSP: 881fa992b990  RFLAGS: 0002
RAX: 4358  RBX: 88207ffd7e80  RCX: 4358
RDX: 4356  RSI: 0246  RDI: 88207ffd7ee8
RBP: 881fa992b990   R8:    R9: 019a16e6
R10: 4d24  R11: 4000  R12: 0242
R13: 4d24  R14: 0001  R15: 
ORIG_RAX:   CS: 0010  SS: 0018
---  ---
 #4 [881fa992b990] _raw_spin_lock_irqsave at 816c9a9e
 #5 [881fa992b998] get_page_from_freelist at 8113ce5f
 #6 [881fa992ba70] __alloc_pages_nodemask at 8113d15f
 #7 [881fa992bba0] alloc_pages_current at 8117ab29
 #8 [881fa992bbe8] sk_page_frag_refill at 815dd310
 #9 [881fa992bc18] tcp_sendmsg at 8163e4f3
#10 [881fa992bcd8] inet_sendmsg at 81668434
#11 [881fa992bd08] sock_sendmsg at 815d9719
#12 [881fa992be58] SYSC_sendto at 815d9c81
#13 [881fa992bf70] sys_sendto at 815da6ae
#14 [881fa992bf80] system_call_fastpath at 816d2189
RIP: 7f5bfe1d804b  RSP: 7f5bfa63b3b0  RFLAGS: 0206
RAX: 002c  RBX: 816d2189  RCX: 7f5bfa63b420
RDX: 2000  RSI: 0c096000  RDI: 0040
RBP:    R8:    R9: 
R10:   R11: 0246  R12: 815da6ae
R13: 881fa992bf78  R14: a552  R15: 0016
ORIG_RAX: 002c  CS: 0033  SS: 002b
crash>


my kernel is 3.10, 

Re: [PATCH net-next 08/12] ptr_ring: prevent queue load/store tearing

2018-01-25 Thread Michael S. Tsirkin
On Fri, Jan 26, 2018 at 10:38:12AM +0800, Jason Wang wrote:
> 
> 
> On 2018-01-26 07:36, Michael S. Tsirkin wrote:
> > In theory the compiler could tear queue loads or stores in two. It does not
> > seem to be happening in practice but it seems easier to convert the
> > cases where this would be a problem to READ/WRITE_ONCE than worry about
> > it.
> > 
> > Signed-off-by: Michael S. Tsirkin 
> > ---
> >   include/linux/ptr_ring.h | 4 ++--
> >   1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> > index 3a19ebd..1883d61 100644
> > --- a/include/linux/ptr_ring.h
> > +++ b/include/linux/ptr_ring.h
> > @@ -114,7 +114,7 @@ static inline int __ptr_ring_produce(struct ptr_ring 
> > *r, void *ptr)
> > /* Pairs with smp_read_barrier_depends in __ptr_ring_consume. */
> > smp_wmb();
> > -   r->queue[r->producer++] = ptr;
> > +   WRITE_ONCE(r->queue[r->producer++], ptr);
> > if (unlikely(r->producer >= r->size))
> > r->producer = 0;
> 
> You may want WRITE_ONCE() here? And if we just fix the out-of-bounds
> r->producer, we may just need one WRITE_ONCE().
> 
> Thanks

No because producers are serialized.

If we were going to sprinkle write/read once all over the place
we should just make it all volatile and drop the annotations.

I don't care much either way, but for better or worse Linux has the
volatile-considered-harmful doc, which says that you are supposed to think
and only add these things where they are necessary.

> > return 0;
> > @@ -173,7 +173,7 @@ static inline int ptr_ring_produce_bh(struct ptr_ring 
> > *r, void *ptr)
> >   static inline void *__ptr_ring_peek(struct ptr_ring *r)
> >   {
> > if (likely(r->size))
> > -   return r->queue[r->consumer_head];
> > +   return READ_ONCE(r->queue[r->consumer_head]);
> > return NULL;
> >   }


Re: [PATCH net-next 05/12] ptr_ring: disallow lockless __ptr_ring_full

2018-01-25 Thread Michael S. Tsirkin
On Fri, Jan 26, 2018 at 10:38:05AM +0800, Jason Wang wrote:
> 
> 
> On 2018-01-26 07:36, Michael S. Tsirkin wrote:
> > Similar to bcecb4bbf88a ("net: ptr_ring: otherwise safe empty checks can
> > overrun array bounds") a lockless use of __ptr_ring_full might
> > cause an out of bounds access.
> > 
> > We can fix this, but it's easier to just disallow lockless
> > __ptr_ring_full for now.
> 
> It looks to me that just fixing this is better than disallowing it through
> documentation (which is easily ignored ...).
> 
> Thanks

lockless is tricky, and I'd rather not sprinkle READ/WRITE_ONCE where
they aren't necessary.

> > 
> > Signed-off-by: Michael S. Tsirkin 
> > ---
> >   include/linux/ptr_ring.h | 7 ---
> >   1 file changed, 4 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> > index 9a72d8f..f175846 100644
> > --- a/include/linux/ptr_ring.h
> > +++ b/include/linux/ptr_ring.h
> > @@ -45,9 +45,10 @@ struct ptr_ring {
> >   };
> >   /* Note: callers invoking this in a loop must use a compiler barrier,
> > - * for example cpu_relax().  If ring is ever resized, callers must hold
> > - * producer_lock - see e.g. ptr_ring_full.  Otherwise, if callers don't 
> > hold
> > - * producer_lock, the next call to __ptr_ring_produce may fail.
> > + * for example cpu_relax().
> > + *
> > + * NB: this is unlike __ptr_ring_empty in that callers must hold 
> > producer_lock:
> > + * see e.g. ptr_ring_full.
> >*/
> >   static inline bool __ptr_ring_full(struct ptr_ring *r)
> >   {


Re: [PATCH net-next 03/12] ptr_ring: READ/WRITE_ONCE for __ptr_ring_empty

2018-01-25 Thread Michael S. Tsirkin
On Fri, Jan 26, 2018 at 10:37:58AM +0800, Jason Wang wrote:
> 
> 
> On 2018-01-26 07:36, Michael S. Tsirkin wrote:
> > Lockless __ptr_ring_empty requires that consumer head is read and
> > written at once, atomically. Annotate accordingly to make sure compiler
> > does it correctly.  Switch locked callers to __ptr_ring_peek which does
> > not support the lockless operation.
> > 
> > Signed-off-by: Michael S. Tsirkin 
> > ---
> >   include/linux/ptr_ring.h | 11 ---
> >   1 file changed, 8 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> > index 8594c7b..9a72d8f 100644
> > --- a/include/linux/ptr_ring.h
> > +++ b/include/linux/ptr_ring.h
> > @@ -196,7 +196,9 @@ static inline void *__ptr_ring_peek(struct ptr_ring *r)
> >*/
> >   static inline bool __ptr_ring_empty(struct ptr_ring *r)
> >   {
> > -   return !__ptr_ring_peek(r);
> > +   if (likely(r->size))
> > +   return !r->queue[READ_ONCE(r->consumer_head)];
> > +   return true;
> >   }
> 
> So after patch 8, __ptr_ring_peek() did:
> 
> static inline void *__ptr_ring_peek(struct ptr_ring *r)
> {
>     if (likely(r->size))
>         return READ_ONCE(r->queue[r->consumer_head]);
>     return NULL;
> }
> 
> Looks like a duplication.
> 
> Thanks

Nope - they are different.

The reason is that __ptr_ring_peek does not need to read the consumer_head once
since callers have a lock, and __ptr_ring_empty does not need to read
the queue once since it merely compares it to 0.

-- 
MST


Re: [PATCHv6 net-next 0/3] net: erspan: add support for openvswitch

2018-01-25 Thread David Miller
From: William Tu 
Date: Thu, 25 Jan 2018 13:20:08 -0800

> The first patch refactors the erspan header definitions. 
> Originally, the erspan fields were defined as a group in a __be16 field,
> using mask and offset to access each field.  This is more costly, due to
> calling ntohs/htons, and error-prone.  The first patch changes it to use
> bitfields.  The second patch creates erspan.h in UAPI and moves the definition
> 'struct erspan_metadata' to it for later openvswitch to use.  The final patch
> introduces the new OVS tunnel key attribute, OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,
> to program both v1 and v2 erspan tunnel for openvswitch.

Series applied, thanks William.
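
The cost difference the cover letter describes can be illustrated with a
small sketch (field widths are made up here, not the real erspan layout):

#include <linux/types.h>
#include <asm/byteorder.h>

/* Style 1: fields packed into one __be16 and extracted via mask/shift;
 * every access pays an ntohs().
 */
#define VER_MASK        0xf000
#define VER_OFFSET      12

static inline __u8 get_ver_masked(__be16 flags)
{
        return (ntohs(flags) & VER_MASK) >> VER_OFFSET;
}

/* Style 2: bitfields laid out per byte order; members are read directly. */
struct hdr_bitfield {
#if defined(__LITTLE_ENDIAN_BITFIELD)
        __u8    rest:4,
                ver:4;
#elif defined(__BIG_ENDIAN_BITFIELD)
        __u8    ver:4,
                rest:4;
#endif
};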


Re: [PATCH net-next 08/12] ptr_ring: prevent queue load/store tearing

2018-01-25 Thread Jason Wang



On 2018-01-26 07:36, Michael S. Tsirkin wrote:

In theory the compiler could tear queue loads or stores in two. It does not
seem to be happening in practice but it seems easier to convert the
cases where this would be a problem to READ/WRITE_ONCE than worry about
it.

Signed-off-by: Michael S. Tsirkin 
---
  include/linux/ptr_ring.h | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 3a19ebd..1883d61 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -114,7 +114,7 @@ static inline int __ptr_ring_produce(struct ptr_ring *r, 
void *ptr)
/* Pairs with smp_read_barrier_depends in __ptr_ring_consume. */
smp_wmb();
  
-	r->queue[r->producer++] = ptr;

+   WRITE_ONCE(r->queue[r->producer++], ptr);
if (unlikely(r->producer >= r->size))
r->producer = 0;


You may want WRITE_ONCE() here? And if we just fix the out-of-bounds
r->producer, we may just need one WRITE_ONCE().


Thanks


return 0;
@@ -173,7 +173,7 @@ static inline int ptr_ring_produce_bh(struct ptr_ring *r, 
void *ptr)
  static inline void *__ptr_ring_peek(struct ptr_ring *r)
  {
if (likely(r->size))
-   return r->queue[r->consumer_head];
+   return READ_ONCE(r->queue[r->consumer_head]);
return NULL;
  }
  




Re: [PATCH net-next 05/12] ptr_ring: disallow lockless __ptr_ring_full

2018-01-25 Thread Jason Wang



On 2018-01-26 07:36, Michael S. Tsirkin wrote:

Similar to bcecb4bbf88a ("net: ptr_ring: otherwise safe empty checks can
overrun array bounds") a lockless use of __ptr_ring_full might
cause an out of bounds access.

We can fix this, but it's easier to just disallow lockless
__ptr_ring_full for now.


It looks to me that just fixing this is better than disallowing it through
documentation (which is easily ignored ...).


Thanks



Signed-off-by: Michael S. Tsirkin 
---
  include/linux/ptr_ring.h | 7 ---
  1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 9a72d8f..f175846 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -45,9 +45,10 @@ struct ptr_ring {
  };
  
  /* Note: callers invoking this in a loop must use a compiler barrier,

- * for example cpu_relax().  If ring is ever resized, callers must hold
- * producer_lock - see e.g. ptr_ring_full.  Otherwise, if callers don't hold
- * producer_lock, the next call to __ptr_ring_produce may fail.
+ * for example cpu_relax().
+ *
+ * NB: this is unlike __ptr_ring_empty in that callers must hold producer_lock:
+ * see e.g. ptr_ring_full.
   */
  static inline bool __ptr_ring_full(struct ptr_ring *r)
  {




Re: [PATCH net-next 03/12] ptr_ring: READ/WRITE_ONCE for __ptr_ring_empty

2018-01-25 Thread Jason Wang



On 2018-01-26 07:36, Michael S. Tsirkin wrote:

Lockless __ptr_ring_empty requires that consumer head is read and
written at once, atomically. Annotate accordingly to make sure compiler
does it correctly.  Switch locked callers to __ptr_ring_peek which does
not support the lockless operation.

Signed-off-by: Michael S. Tsirkin 
---
  include/linux/ptr_ring.h | 11 ---
  1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 8594c7b..9a72d8f 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -196,7 +196,9 @@ static inline void *__ptr_ring_peek(struct ptr_ring *r)
   */
  static inline bool __ptr_ring_empty(struct ptr_ring *r)
  {
-   return !__ptr_ring_peek(r);
+   if (likely(r->size))
+   return !r->queue[READ_ONCE(r->consumer_head)];
+   return true;
  }


So after patch 8, __ptr_ring_peek() did:

static inline void *__ptr_ring_peek(struct ptr_ring *r)
{
    if (likely(r->size))
        return READ_ONCE(r->queue[r->consumer_head]);
    return NULL;
}

Looks like a duplication.

Thanks


Re: [PATCHv6 net-next 3/3] openvswitch: add erspan version I and II support

2018-01-25 Thread Pravin Shelar
On Thu, Jan 25, 2018 at 1:20 PM, William Tu  wrote:
> The patch adds support for openvswitch to configure erspan
> v1 and v2.  The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr is added
> to uapi as a binary blob to support all ERSPAN v1 and v2's
> fields.  Note that Previous commit "openvswitch: Add erspan tunnel
> support." was reverted since it does not design properly.
>
> Signed-off-by: William Tu 

Acked-by: Pravin B Shelar 

Thanks.


Re: [PATCHv6 net-next 2/3] net: erspan: create erspan metadata uapi header

2018-01-25 Thread Pravin Shelar
On Thu, Jan 25, 2018 at 1:20 PM, William Tu  wrote:
> The patch adds a new uapi header file, erspan.h, and moves
> the 'struct erspan_metadata' from internal erspan.h to it.
>
> Signed-off-by: William Tu 

Acked-by: Pravin B Shelar 

Thanks.


Re: [PATCHv6 net-next 1/3] net: erspan: use bitfield instead of mask and offset

2018-01-25 Thread Pravin Shelar
On Thu, Jan 25, 2018 at 1:20 PM, William Tu  wrote:
> Originally the erspan fields were defined as a group in a __be16 field,
> using mask and offset to access each field.  This is more costly due to
> calling ntohs/htons.  The patch changes it to use bitfields.
>
> Signed-off-by: William Tu 

Acked-by: Pravin B Shelar 

Thanks.


Re: [PATCH net 1/1] r8169: fix memory corruption on retrieval of hardware statistics.

2018-01-25 Thread David Miller
From: Francois Romieu 
Date: Fri, 26 Jan 2018 01:53:26 +0100

> Hardware statistics retrieval hurts in tight invocation loops.
> 
> Avoid extraneous write and enforce strict ordering of writes targeted to
> the tally counters dump area address registers.
> 
> Signed-off-by: Francois Romieu 
> Tested-by: Oliver Freyermuth 

Applied and queued up for -stable, thanks Francois.


[Patch net-next v3 2/3] net_sched: plug in qdisc ops change_tx_queue_len

2018-01-25 Thread Cong Wang
Introduce a new qdisc op, ->change_tx_queue_len(), so that
each qdisc can decide how to implement this if it wants to.
Previously we simply read dev->tx_queue_len; after pfifo_fast
switches to skb array, we need this API to resize the skb array
when dev->tx_queue_len changes.

To avoid handling race conditions with TX BH, we deactivate all
TX queues before changing the value and bring them back after we
are done; this also makes the implementation easier.

Cc: John Fastabend 
Signed-off-by: Cong Wang 
---
 include/net/sch_generic.h |  2 ++
 net/core/dev.c|  1 +
 net/sched/sch_generic.c   | 33 +
 3 files changed, 36 insertions(+)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index eac43e8ca96d..e2ab13687fb9 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -200,6 +200,7 @@ struct Qdisc_ops {
  struct nlattr *arg,
  struct netlink_ext_ack *extack);
void(*attach)(struct Qdisc *sch);
+   int (*change_tx_queue_len)(struct Qdisc *, unsigned 
int);
 
int (*dump)(struct Qdisc *, struct sk_buff *);
int (*dump_stats)(struct Qdisc *, struct gnet_dump 
*);
@@ -489,6 +490,7 @@ void qdisc_class_hash_remove(struct Qdisc_class_hash *,
 void qdisc_class_hash_grow(struct Qdisc *, struct Qdisc_class_hash *);
 void qdisc_class_hash_destroy(struct Qdisc_class_hash *);
 
+int dev_qdisc_change_tx_queue_len(struct net_device *dev);
 void dev_init_scheduler(struct net_device *dev);
 void dev_shutdown(struct net_device *dev);
 void dev_activate(struct net_device *dev);
diff --git a/net/core/dev.c b/net/core/dev.c
index e0b0c2784070..c8443cfaa17a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7070,6 +7070,7 @@ int dev_change_tx_queue_len(struct net_device *dev, 
unsigned long new_len)
dev->tx_queue_len = orig_len;
return res;
}
+   return dev_qdisc_change_tx_queue_len(dev);
}
 
return 0;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 1816bde47256..08f9fa27e06e 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1178,6 +1178,39 @@ void dev_deactivate(struct net_device *dev)
 }
 EXPORT_SYMBOL(dev_deactivate);
 
+static int qdisc_change_tx_queue_len(struct net_device *dev,
+struct netdev_queue *dev_queue)
+{
+   struct Qdisc *qdisc = dev_queue->qdisc_sleeping;
+   const struct Qdisc_ops *ops = qdisc->ops;
+
+   if (ops->change_tx_queue_len)
+   return ops->change_tx_queue_len(qdisc, dev->tx_queue_len);
+   return 0;
+}
+
+int dev_qdisc_change_tx_queue_len(struct net_device *dev)
+{
+   bool up = dev->flags & IFF_UP;
+   unsigned int i;
+   int ret = 0;
+
+   if (up)
+   dev_deactivate(dev);
+
+   for (i = 0; i < dev->num_tx_queues; i++) {
ret = qdisc_change_tx_queue_len(dev, &dev->_tx[i]);
+
+   /* TODO: revert changes on a partial failure */
+   if (ret)
+   break;
+   }
+
+   if (up)
+   dev_activate(dev);
+   return ret;
+}
+
 static void dev_init_scheduler_queue(struct net_device *dev,
 struct netdev_queue *dev_queue,
 void *_qdisc)
-- 
2.13.0



[Patch net-next v3 1/3] net: introduce helper dev_change_tx_queue_len()

2018-01-25 Thread Cong Wang
This patch promotes the local change_tx_queue_len() to a core
helper function, dev_change_tx_queue_len(), so that rtnetlink
and net-sysfs could share the code. This also prepares for the
following patch.

Note, the -EFAULT in the original code doesn't make sense,
we should propagate the errno from notifiers.

Cc: John Fastabend 
Signed-off-by: Cong Wang 
---
 include/linux/netdevice.h |  1 +
 net/core/dev.c| 28 
 net/core/net-sysfs.c  | 25 +
 net/core/rtnetlink.c  | 18 +-
 4 files changed, 35 insertions(+), 37 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 24a62d590350..0804e1d38c78 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3330,6 +3330,7 @@ int dev_get_alias(const struct net_device *, char *, 
size_t);
 int dev_change_net_namespace(struct net_device *, struct net *, const char *);
 int __dev_set_mtu(struct net_device *, int);
 int dev_set_mtu(struct net_device *, int);
+int dev_change_tx_queue_len(struct net_device *, unsigned long);
 void dev_set_group(struct net_device *, int);
 int dev_set_mac_address(struct net_device *, struct sockaddr *);
 int dev_change_carrier(struct net_device *, bool new_carrier);
diff --git a/net/core/dev.c b/net/core/dev.c
index 4670ccabe23a..e0b0c2784070 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7048,6 +7048,34 @@ int dev_set_mtu(struct net_device *dev, int new_mtu)
 EXPORT_SYMBOL(dev_set_mtu);
 
 /**
+ * dev_change_tx_queue_len - Change TX queue length of a netdevice
+ * @dev: device
+ * @new_len: new tx queue length
+ */
+int dev_change_tx_queue_len(struct net_device *dev, unsigned long new_len)
+{
+   unsigned int orig_len = dev->tx_queue_len;
+   int res;
+
+   if (new_len != (unsigned int)new_len)
+   return -ERANGE;
+
+   if (new_len != orig_len) {
+   dev->tx_queue_len = new_len;
+   res = call_netdevice_notifiers(NETDEV_CHANGE_TX_QUEUE_LEN, dev);
+   res = notifier_to_errno(res);
+   if (res) {
+   netdev_err(dev,
+  "refused to change device tx_queue_len\n");
+   dev->tx_queue_len = orig_len;
+   return res;
+   }
+   }
+
+   return 0;
+}
+
+/**
  * dev_set_group - Change group this device belongs to
  * @dev: device
  * @new_group: group this device should belong to
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index c4a28f4667b6..60a5ad2c33ee 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -346,29 +346,6 @@ static ssize_t flags_store(struct device *dev, struct 
device_attribute *attr,
 }
 NETDEVICE_SHOW_RW(flags, fmt_hex);
 
-static int change_tx_queue_len(struct net_device *dev, unsigned long new_len)
-{
-   unsigned int orig_len = dev->tx_queue_len;
-   int res;
-
-   if (new_len != (unsigned int)new_len)
-   return -ERANGE;
-
-   if (new_len != orig_len) {
-   dev->tx_queue_len = new_len;
-   res = call_netdevice_notifiers(NETDEV_CHANGE_TX_QUEUE_LEN, dev);
-   res = notifier_to_errno(res);
-   if (res) {
-   netdev_err(dev,
-  "refused to change device tx_queue_len\n");
-   dev->tx_queue_len = orig_len;
-   return -EFAULT;
-   }
-   }
-
-   return 0;
-}
-
 static ssize_t tx_queue_len_store(struct device *dev,
  struct device_attribute *attr,
  const char *buf, size_t len)
@@ -376,7 +353,7 @@ static ssize_t tx_queue_len_store(struct device *dev,
if (!capable(CAP_NET_ADMIN))
return -EPERM;
 
-   return netdev_store(dev, attr, buf, len, change_tx_queue_len);
+   return netdev_store(dev, attr, buf, len, dev_change_tx_queue_len);
 }
 NETDEVICE_SHOW_RW(tx_queue_len, fmt_dec);
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 97874daa1336..6fa6b9c60694 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2291,19 +2291,11 @@ static int do_setlink(const struct sk_buff *skb,
 
if (tb[IFLA_TXQLEN]) {
unsigned int value = nla_get_u32(tb[IFLA_TXQLEN]);
-   unsigned int orig_len = dev->tx_queue_len;
-
-   if (dev->tx_queue_len ^ value) {
-   dev->tx_queue_len = value;
-   err = call_netdevice_notifiers(
- NETDEV_CHANGE_TX_QUEUE_LEN, dev);
-   err = notifier_to_errno(err);
-   if (err) {
-   dev->tx_queue_len = orig_len;
-   goto errout;
-   }
-   status |= 

[Patch net-next v3 0/3] net_sched: reflect tx_queue_len change for pfifo_fast

2018-01-25 Thread Cong Wang
This patchset restores the pfifo_fast qdisc behavior of dropping
packets based on latest dev->tx_queue_len. Patch 1 introduces
a helper, patch 2 introduces a new Qdisc ops which is called when
we modify tx_queue_len, patch 3 implements this ops for pfifo_fast.

Please see each patch for details.
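
As a reminder of the behavior being restored: with this set applied,
shrinking the queue length via a command like the following
(illustrative device and value) immediately resizes pfifo_fast's skb
arrays instead of leaving the old capacity in place:

    ip link set dev eth0 txqueuelen 100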

---
v3: use skb_array_resize_multiple()
v2: handle error case for ->change_tx_queue_len()

Cong Wang (3):
  net: introduce helper dev_change_tx_queue_len()
  net_sched: plug in qdisc ops change_tx_queue_len
  net_sched: implement ->change_tx_queue_len() for pfifo_fast

 include/linux/netdevice.h |  1 +
 include/net/sch_generic.h |  2 ++
 net/core/dev.c| 29 +++
 net/core/net-sysfs.c  | 25 +--
 net/core/rtnetlink.c  | 18 +
 net/sched/sch_generic.c   | 51 +++
 6 files changed, 89 insertions(+), 37 deletions(-)

-- 
2.13.0



[Patch net-next v3 3/3] net_sched: implement ->change_tx_queue_len() for pfifo_fast

2018-01-25 Thread Cong Wang
pfifo_fast used to drop based on qdisc_dev(qdisc)->tx_queue_len,
so we have to resize the skb array when we change tx_queue_len.

Other qdiscs which read tx_queue_len are fine because they
all save it to sch->limit or somewhere else in the qdisc during init,
so they don't have to implement this. It is still nicer if they do,
so that users don't have to re-configure the qdisc after changing
tx_queue_len.

Cc: John Fastabend 
Signed-off-by: Cong Wang 
---
 net/sched/sch_generic.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 08f9fa27e06e..190570f21b20 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -763,6 +763,23 @@ static void pfifo_fast_destroy(struct Qdisc *sch)
}
 }
 
+static int pfifo_fast_change_tx_queue_len(struct Qdisc *sch,
+ unsigned int new_len)
+{
+   struct pfifo_fast_priv *priv = qdisc_priv(sch);
+   struct skb_array *bands[PFIFO_FAST_BANDS];
+   int prio;
+
+   for (prio = 0; prio < PFIFO_FAST_BANDS; prio++) {
+   struct skb_array *q = band2list(priv, prio);
+
+   bands[prio] = q;
+   }
+
+   return skb_array_resize_multiple(bands, PFIFO_FAST_BANDS, new_len,
+GFP_KERNEL);
+}
+
 struct Qdisc_ops pfifo_fast_ops __read_mostly = {
.id =   "pfifo_fast",
.priv_size  =   sizeof(struct pfifo_fast_priv),
@@ -773,6 +790,7 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = {
.destroy=   pfifo_fast_destroy,
.reset  =   pfifo_fast_reset,
.dump   =   pfifo_fast_dump,
+   .change_tx_queue_len =  pfifo_fast_change_tx_queue_len,
.owner  =   THIS_MODULE,
.static_flags   =   TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
 };
-- 
2.13.0



Re: [PATCH net-next 1/2] net: vrf: Add support for sends to local broadcast address

2018-01-25 Thread David Miller
From: David Ahern 
Date: Thu, 25 Jan 2018 15:01:23 -0700

> On 1/25/18 2:23 PM, David Miller wrote:
>> From: David Ahern 
>> Date: Wed, 24 Jan 2018 19:37:37 -0800
>> 
>>> Sukumar reported that sends to the local broadcast address
>>> (255.255.255.255) are broken. Check for the address in vrf driver
>>> and do not redirect to the VRF device - similar to multicast
>>> packets.
>>>
>>> With this change sockets can use SO_BINDTODEVICE to specify an
>>> egress interface and receive responses. Note: the egress interface
>>> can not be a VRF device but needs to be the enslaved device.
>>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=198521
>>>
>>> Reported-by: Sukumar Gopalakrishnan 
>>> Signed-off-by: David Ahern 
>>>
>>> ---
>>> Dave: Really this is a day 1 bug that goes back to the beginning of VRF.
>>> IMO, backport to the 4.14 LTS kernel is sufficient; the multicast
>>> handling for IPv4 was only complete as of the 4.12 kernel. I directed
>>> this at net-next because it is not urgent for the 4.15 merge window.
>> 
>> You have to decide, either this is for 'net' and -stable, or it isn't.
>> 
>> We don't put things into net-next and then -stable backport it.  It
>> doesn't work like that.
> 
> Please take this one for -net and patch 2 for net-next (it's a new
> feature). I can re-send as separate patches if needed.

Ok I'll do that, no need to resend.

Thanks.


Re: [PATCH net-next v2 00/11] use tc_cls_can_offload_and_chain0() throughout the drivers

2018-01-25 Thread David Miller
From: Jakub Kicinski 
Date: Thu, 25 Jan 2018 14:00:42 -0800

> This set makes all drivers use a new tc_cls_can_offload_and_chain0()
> helper which will set extack in case TC hw offload flag is disabled.
> 
> I chose to keep the new helper which also looks at the chain but
> renamed it more appropriately.  The rationale being that most drivers
> don't accept chains other than 0 and since we have to pass extack
> to the helper we can as well pass the entire struct tc_cls_common_offload
> and perform the most common checks.
> 
> This code makes the assumption that type_data in the callback can
> be interpreted as struct tc_cls_common_offload, i.e. the real offload
> structure has common part as the first member.  This allows us to
> make the check once for all classifier types if a driver supports
> more than one.
> 
> v1:
>  - drop the type validation in nfp and netdevsim.
> v2:
>  - reorder checks in patch 1;
>  - split other changes from patch 1;
>  - add the i40e patch in;
>  - add one more test case - for chain 0 extack.

Series applied, thanks Jakub.
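
For reference, a minimal sketch of the driver-side pattern this series
converges on (the foo_* names are made up; the helper and the
common-offload layout are from the set):

#include <net/pkt_cls.h>

struct foo_priv {                    /* hypothetical driver private data */
    struct net_device *netdev;
};

static int foo_setup_tc_block_cb(enum tc_setup_type type, void *type_data,
                                 void *cb_priv)
{
    /* works because the common part is the first member of every
     * tc_cls_*_offload structure, as the cover letter notes
     */
    struct tc_cls_common_offload *common = type_data;
    struct foo_priv *priv = cb_priv;

    /* rejects chains other than 0 and a disabled hw tc offload flag,
     * filling in extack itself
     */
    if (!tc_cls_can_offload_and_chain0(priv->netdev, common))
        return -EOPNOTSUPP;

    switch (type) {
    case TC_SETUP_CLSFLOWER:
        return 0;                    /* hand off to the real offload path */
    default:
        return -EOPNOTSUPP;
    }
}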


Re: [PATCH net-next v1] samples/bpf: Partially fixes the bpf.o build

2018-01-25 Thread Alexei Starovoitov
On Fri, Jan 26, 2018 at 01:39:30AM +0100, Mickaël Salaün wrote:
> Do not build lib/bpf/bpf.o with this Makefile but use the one from the
> library directory.  This avoids making a buggy bpf.o file (e.g. missing
> symbols).

could you provide an example?
What symbols will be missing?
I don't think there is an issue with existing Makefile.

> This patch is useful if some code (e.g. Landlock tests) needs both the
> bpf.o (from tools/lib/bpf) and the bpf_load.o (from samples/bpf).

is that some future patches?

we're trying to move everything from samples/bpf/ into selftests/bpf/
and convert to use libbpf.a instead of obsolete bpf_load.c
Please use this approach for landlock as well.

> Signed-off-by: Mickaël Salaün 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 
> ---
> 
> This is not a complete fix because the call to multi_depend with
> $(host-cmulti) from scripts/Makefile.host force the build of bpf.o
> anyway. I'm not sure how to completely avoid this automatic build
> though.
> ---
>  samples/bpf/Makefile | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index 7f61a3d57fa7..64335bb94f9f 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -201,13 +201,16 @@ CLANG_ARCH_ARGS = -target $(ARCH)
>  endif
>  
>  # Trick to allow make to be run from this directory
> -all:
> +all: $(LIBBPF)
>   $(MAKE) -C ../../ $(CURDIR)/
>  
>  clean:
>   $(MAKE) -C ../../ M=$(CURDIR) clean
>   @rm -f *~
>  
> +$(LIBBPF): FORCE
> + $(MAKE) -C $(dir $@) $(notdir $@)
> +
>  $(obj)/syscall_nrs.s:$(src)/syscall_nrs.c
>   $(call if_changed_dep,cc_s_c)
>  
> -- 
> 2.15.1
> 


Re: [Intel-wired-lan] [RFC v2 net-next 01/10] net: Add a new socket option for a future transmit time.

2018-01-25 Thread Richard Cochran
On Wed, Jan 24, 2018 at 02:46:24PM -0800, Vinicius Costa Gomes wrote:
> The only robust way that we could think of for keeping the packets
> in order for the device queue is re-ordering packets in the Qdisc.

Right, but you cannot afford the overhead of the timerqueue when using
HW offload, when the HW device sits on a PCIe bus.  Many serious TSN
applications (like industrial controls) will want to have just one
packet queued, readying the next one just in time for the next
deadline.  The control loops are sensitive to the feedback interval.
 
> Even if we reach a decision that the Qdisc should not re-order packets
> (we wouldn't have any dependency on hrtimers in the offload case, as you
> pointed out), we still need hrtimers for the software implementation.

Fine.
 
> So, I guess, the problem remains, if it's possible for the user to
> express a /dev/ptp* clock, what should we do? 

Thinking a bit more, it doesn't make sense to have a user choice for
the HW offloading case.  The clock should implicitly be the device
clock, always.  Using any other clock would make no sense.

Thanks,
Richard


Re: [PATCH net-next v1] bpf: Use the IS_FD_ARRAY() macro in map_update_elem()

2018-01-25 Thread Alexei Starovoitov
On Fri, Jan 26, 2018 at 12:54:02AM +0100, Mickaël Salaün wrote:
> Make the code more readable.
> 
> Signed-off-by: Mickaël Salaün 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 
> ---
>  kernel/bpf/syscall.c | 5 +
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 5bdb0cc84ad2..e24aa3241387 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -709,10 +709,7 @@ static int map_update_elem(union bpf_attr *attr)
>   err = bpf_percpu_hash_update(map, key, value, attr->flags);
>   } else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
>   err = bpf_percpu_array_update(map, key, value, attr->flags);
> - } else if (map->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY ||
> -map->map_type == BPF_MAP_TYPE_PROG_ARRAY ||
> -map->map_type == BPF_MAP_TYPE_CGROUP_ARRAY ||
> -map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS) {
> + } else if (IS_FD_ARRAY(map)) {

Applied to bpf-next, thank you Mickael.



[PATCH 2/2] atm: fore200e: Replace GFP_ATOMIC with GFP_KERNEL in fore200e_send

2018-01-25 Thread Jia-Ju Bai
After checking all possible call chains to fore200e_send here,
my tool finds that fore200e_send is never called in atomic context.
And this function is assigned to a function pointer "dev->ops->send",
which is only called by vcc_sendmsg (net/atm/common.c) 
through vcc->dev->ops->send, and vcc_sendmsg calls schedule,
which indicates that fore200e_send can call functions which may sleep.
Thus GFP_ATOMIC is not necessary, and it can be replaced with GFP_KERNEL.

This is found by a static analysis tool named DCNS written by myself.

Signed-off-by: Jia-Ju Bai 
---
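(A generic reminder of what the change buys -- an illustration, not
code from this driver:)

#include <linux/slab.h>

/* process context: the allocator may sleep and reclaim memory,
 * so the allocation is far less likely to fail spuriously
 */
static void *alloc_may_sleep(size_t len)
{
    return kmalloc(len, GFP_KERNEL);
}

/* atomic context only: must not sleep, fails more easily */
static void *alloc_no_sleep(size_t len)
{
    return kmalloc(len, GFP_ATOMIC);
}
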
 drivers/atm/fore200e.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/atm/fore200e.c b/drivers/atm/fore200e.c
index 6ebc4e4..f6a5326 100644
--- a/drivers/atm/fore200e.c
+++ b/drivers/atm/fore200e.c
@@ -1611,7 +1611,7 @@ int bsq_audit(int where, struct host_bsq* bsq, int 
scheme, int magn)
 }
 
 if (tx_copy) {
-   data = kmalloc(tx_len, GFP_ATOMIC | GFP_DMA);
+   data = kmalloc(tx_len, GFP_KERNEL | GFP_DMA);
if (data == NULL) {
if (vcc->pop) {
vcc->pop(vcc, skb);
-- 
1.7.9.5



[PATCH 1/2] atm: fore200e: Replace GFP_ATOMIC with GFP_KERNEL in fore200e_open

2018-01-25 Thread Jia-Ju Bai
After checking all possible call chains to fore200e_open here,
my tool finds that fore200e_open is never called in atomic context.
And fore200e_open later calls mutex_lock, which can sleep,
so it is assumed that fore200e_open can call functions which may sleep.
Thus GFP_ATOMIC is not necessary, and it can be replaced with GFP_KERNEL.

This is found by a static analysis tool named DCNS written by myself.

Signed-off-by: Jia-Ju Bai 
---
 drivers/atm/fore200e.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/atm/fore200e.c b/drivers/atm/fore200e.c
index 6ebc4e4..bdffb85 100644
--- a/drivers/atm/fore200e.c
+++ b/drivers/atm/fore200e.c
@@ -1422,7 +1422,7 @@ int bsq_audit(int where, struct host_bsq* bsq, int 
scheme, int magn)
 
spin_unlock_irqrestore(&fore200e->q_lock, flags);
 
-fore200e_vcc = kzalloc(sizeof(struct fore200e_vcc), GFP_ATOMIC);
+fore200e_vcc = kzalloc(sizeof(struct fore200e_vcc), GFP_KERNEL);
 if (fore200e_vcc == NULL) {
vc_map->vcc = NULL;
return -ENOMEM;
-- 
1.7.9.5



[net-next:master 1917/1931] drivers/net/ethernet/sfc/ptp.c:646:43: sparse: constant 4000000000 is so big it is long

2018-01-25 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 
master
head:   fdd6d771c7de9d351c6dbdbab5bdc83805c06955
commit: 1280c0f8aafc4c09c59c576c8d50f367070b2619 [1917/1931] sfc: support 
second + quarter ns time format for receive datapath
reproduce:
# apt-get install sparse
git checkout 1280c0f8aafc4c09c59c576c8d50f367070b2619
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> drivers/net/ethernet/sfc/ptp.c:646:43: sparse: constant 4000000000 is so big 
>> it is long
   drivers/net/ethernet/sfc/ptp.c:369:6: sparse: symbol 'efx_ptp_want_txqs' was 
not declared. Should it be static?
   drivers/net/ethernet/sfc/ptp.c:2132:31: sparse: symbol 
'efx_ptp_channel_type' was not declared. Should it be static?

vim +646 drivers/net/ethernet/sfc/ptp.c

   598  
   599  /* Get PTP attributes and set up time conversions */
   600  static int efx_ptp_get_attributes(struct efx_nic *efx)
   601  {
   602  MCDI_DECLARE_BUF(inbuf, MC_CMD_PTP_IN_GET_ATTRIBUTES_LEN);
   603  MCDI_DECLARE_BUF(outbuf, MC_CMD_PTP_OUT_GET_ATTRIBUTES_LEN);
   604  struct efx_ptp_data *ptp = efx->ptp_data;
   605  int rc;
   606  u32 fmt;
   607  size_t out_len;
   608  
   609  /* Get the PTP attributes. If the NIC doesn't support the 
operation we
   610   * use the default format for compatibility with older NICs i.e.
   611   * seconds and nanoseconds.
   612   */
   613  MCDI_SET_DWORD(inbuf, PTP_IN_OP, MC_CMD_PTP_OP_GET_ATTRIBUTES);
   614  MCDI_SET_DWORD(inbuf, PTP_IN_PERIPH_ID, 0);
   615  rc = efx_mcdi_rpc_quiet(efx, MC_CMD_PTP, inbuf, sizeof(inbuf),
    616  outbuf, sizeof(outbuf), &out_len);
   617  if (rc == 0) {
   618  fmt = MCDI_DWORD(outbuf, 
PTP_OUT_GET_ATTRIBUTES_TIME_FORMAT);
   619  } else if (rc == -EINVAL) {
   620  fmt = MC_CMD_PTP_OUT_GET_ATTRIBUTES_SECONDS_NANOSECONDS;
   621  } else if (rc == -EPERM) {
   622  netif_info(efx, probe, efx->net_dev, "no PTP 
support\n");
   623  return rc;
   624  } else {
   625  efx_mcdi_display_error(efx, MC_CMD_PTP, sizeof(inbuf),
   626 outbuf, sizeof(outbuf), rc);
   627  return rc;
   628  }
   629  
   630  switch (fmt) {
   631  case MC_CMD_PTP_OUT_GET_ATTRIBUTES_SECONDS_27FRACTION:
   632  ptp->ns_to_nic_time = efx_ptp_ns_to_s27;
   633  ptp->nic_to_kernel_time = 
efx_ptp_s27_to_ktime_correction;
   634  ptp->nic_time.minor_max = 1 << 27;
   635  ptp->nic_time.sync_event_minor_shift = 19;
   636  break;
   637  case MC_CMD_PTP_OUT_GET_ATTRIBUTES_SECONDS_NANOSECONDS:
   638  ptp->ns_to_nic_time = efx_ptp_ns_to_s_ns;
   639  ptp->nic_to_kernel_time = 
efx_ptp_s_ns_to_ktime_correction;
    640  ptp->nic_time.minor_max = 1000000000;
   641  ptp->nic_time.sync_event_minor_shift = 22;
   642  break;
   643  case MC_CMD_PTP_OUT_GET_ATTRIBUTES_SECONDS_QTR_NANOSECONDS:
   644  ptp->ns_to_nic_time = efx_ptp_ns_to_s_qns;
   645  ptp->nic_to_kernel_time = 
efx_ptp_s_qns_to_ktime_correction;
  > 646  ptp->nic_time.minor_max = 4000000000;
   647  ptp->nic_time.sync_event_minor_shift = 24;
   648  break;
   649  default:
   650  return -ERANGE;
   651  }
   652  
   653  /* Precalculate acceptable difference between the minor time in 
the
   654   * packet prefix and the last MCDI time sync event. We expect 
the
   655   * packet prefix timestamp to be after of sync event by up to 
one
   656   * sync event interval (0.25s) but we allow it to exceed this 
by a
   657   * fuzz factor of (0.1s)
   658   */
   659  ptp->nic_time.sync_event_diff_min = ptp->nic_time.minor_max
   660  - (ptp->nic_time.minor_max / 10);
   661  ptp->nic_time.sync_event_diff_max = (ptp->nic_time.minor_max / 
4)
   662  + (ptp->nic_time.minor_max / 10);
   663  
   664  /* MC_CMD_PTP_OP_GET_ATTRIBUTES has been extended twice from an 
older
   665   * operation MC_CMD_PTP_OP_GET_TIME_FORMAT. The function now 
may return
   666   * a value to use for the minimum acceptable corrected 
synchronization
   667   * window and may return further capabilities.
   668   * If we have the extra information store it. For older 
firmware that
   669   * does not implement the extended command use the default 
value.
   670   */
   671  if (rc == 0 &&
   672  out_len >= 

Re: [bpf PATCH 0/3] bpf sockmap fixes

2018-01-25 Thread John Fastabend
On 01/25/2018 04:26 PM, John Fastabend wrote:
> A set of fixes for sockmap to resolve map/prog not being cleaned
> up when maps are not cleaned up from the program side.
> 

Well that first sentence is a bit confusing now that I read it again.
Here is a better version,

"
A set of fixes for sockmap to resolve programs referencing sockmaps
and closing without deleting all entries in the map and/or not detaching
BPF programs attached to the map. Both leaving entries in the map and
not detaching programs may result in the map failing to be removed by
BPF infrastructure due to reference counts never reaching zero.
"


> For this we pull in the ULP infrastructure to hook into the close()
> hook of the sock layer. This seemed natural because we have additional
> sockmap features (to add support for TX hooks) that will also use the
> ULP infrastructure. This allows us to cleanup entries in the map when
> socks are closed() and avoid trying to get the sk_state_change() hook
> to fire in all cases.
> 
> The second issue resolved here occurs when users don't detach
> programs. The gist is a refcnt issue resolved by implementing the
> release callback. See patch for details.
> 
> For testing I ran both sample/sockmap and selftests bpf/test_maps.c.
> I did not however retest TLS with the small change to ULP layer.
> Mostly because I don't have a TLS setup. I plan/hope to get around
> to writing either a sample or preferably a selftest for this case
> as well (assuming I didn't miss one).
> 
> @Dave Watson can you take a quick look to verify the changes are
> good on TLS ULP side.
> 
> ---
> 
> John Fastabend (3):
>   net: add a UID to use for ULP socket assignment
>   bpf: sockmap, add sock close() hook to remove socks
>   bpf: sockmap, fix leaking maps with attached but not detached progs
> 
> 
>  include/net/tcp.h|8 ++
>  kernel/bpf/sockmap.c |  164 
> --
>  net/ipv4/tcp_ulp.c   |   53 +++-
>  net/tls/tls_main.c   |2 +
>  4 files changed, 150 insertions(+), 77 deletions(-)
> 
> --
> Signature
> 



[net-next:master 1913/1931] drivers/net/ethernet/sfc/efx.c:899:6: sparse: symbol 'efx_default_channel_want_txqs' was not declared. Should it be static?

2018-01-25 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 
master
head:   fdd6d771c7de9d351c6dbdbab5bdc83805c06955
commit: 2935e3c38228ad9bf073eeb0eedff5849eea63db [1913/1931] sfc: on 8000 
series use TX queues for TX timestamps
reproduce:
# apt-get install sparse
git checkout 2935e3c38228ad9bf073eeb0eedff5849eea63db
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> drivers/net/ethernet/sfc/efx.c:899:6: sparse: symbol 
>> 'efx_default_channel_want_txqs' was not declared. Should it be static?
--
>> drivers/net/ethernet/sfc/ptp.c:347:6: sparse: symbol 'efx_ptp_want_txqs' was 
>> not declared. Should it be static?
>> drivers/net/ethernet/sfc/ptp.c:2023:31: sparse: symbol 
>> 'efx_ptp_channel_type' was not declared. Should it be static?

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


[RFC PATCH net-next] sfc: efx_default_channel_want_txqs() can be static

2018-01-25 Thread kbuild test robot

Fixes: 2935e3c38228 ("sfc: on 8000 series use TX queues for TX timestamps")
Signed-off-by: Fengguang Wu 
---
 efx.c |2 +-
 ptp.c |4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index 456866b0..16757cf 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -896,7 +896,7 @@ void efx_schedule_slow_fill(struct efx_rx_queue *rx_queue)
mod_timer(&rx_queue->slow_fill, jiffies + msecs_to_jiffies(100));
 }
 
-bool efx_default_channel_want_txqs(struct efx_channel *channel)
+static bool efx_default_channel_want_txqs(struct efx_channel *channel)
 {
return channel->channel - channel->efx->tx_channel_offset <
channel->efx->n_tx_channels;
diff --git a/drivers/net/ethernet/sfc/ptp.c b/drivers/net/ethernet/sfc/ptp.c
index 433d29d..3e2c5b1 100644
--- a/drivers/net/ethernet/sfc/ptp.c
+++ b/drivers/net/ethernet/sfc/ptp.c
@@ -366,7 +366,7 @@ bool efx_ptp_use_mac_tx_timestamps(struct efx_nic *efx)
 /* PTP 'extra' channel is still a traffic channel, but we only create TX queues
  * if PTP uses MAC TX timestamps, not if PTP uses the MC directly to transmit.
  */
-bool efx_ptp_want_txqs(struct efx_channel *channel)
+static bool efx_ptp_want_txqs(struct efx_channel *channel)
 {
return efx_ptp_use_mac_tx_timestamps(channel->efx);
 }
@@ -2146,7 +2146,7 @@ static int efx_phc_enable(struct ptp_clock_info *ptp,
return 0;
 }
 
-const struct efx_channel_type efx_ptp_channel_type = {
+static const struct efx_channel_type efx_ptp_channel_type = {
.handle_no_channel  = efx_ptp_handle_no_channel,
.pre_probe  = efx_ptp_probe_channel,
.post_remove= efx_ptp_remove_channel,


[PATCH v2 net-next 0/3] net/ipv6: Add support for ONLINK flag

2018-01-25 Thread David Ahern
Add support for RTNH_F_ONLINK with ipv6 routes.

The first patch moves the existing gateway validation into a helper.
The onlink flag requires a different set of checks, and the existing
validation makes ip6_route_info_create long enough.

The second patch makes the table id and lookup flags arguments to
ip6_nh_lookup_table. The onlink check needs to verify the gateway
without the RT6_LOOKUP_F_IFACE flag, and PBR with VRF means the table
id can vary between the table the route is inserted into and the VRF
the egress device is enslaved to.

The third patch adds support for RTNH_F_ONLINK.

I have a set of test cases in a format based on the framework Ido and
Jiri are working on. Once that goes in I will adapt the script and
submit.

v2
- removed table id check. Too constraining for PBR with VRF use cases

David Ahern (3):
  net/ipv6: Move gateway validation into helper
  net/ipv6: Add flags and table id to ip6_nh_lookup_table
  net/ipv6: Add support for onlink flag

 net/ipv6/route.c | 140 ---
 1 file changed, 103 insertions(+), 37 deletions(-)

-- 
2.11.0



[PATCH v2 net-next 2/3] net/ipv6: Add flags and table id to ip6_nh_lookup_table

2018-01-25 Thread David Ahern
onlink verification needs to do a lookup in a potentially different
table than the one in fib6_config, and without the RT6_LOOKUP_F_IFACE
flag. Change ip6_nh_lookup_table to take the table id and flags as
input arguments. Both verifications want to ignore link state, so that
flag can stay in the lookup helper.

Signed-off-by: David Ahern 
---
 net/ipv6/route.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 0ff4ca0948f0..3e0a1c67eb9f 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2440,7 +2440,8 @@ static int ip6_convert_metrics(struct mx6_config *mxc,
 
 static struct rt6_info *ip6_nh_lookup_table(struct net *net,
struct fib6_config *cfg,
-   const struct in6_addr *gw_addr)
+   const struct in6_addr *gw_addr,
+   u32 tbid, int flags)
 {
struct flowi6 fl6 = {
.flowi6_oif = cfg->fc_ifindex,
@@ -2449,15 +2450,15 @@ static struct rt6_info *ip6_nh_lookup_table(struct net 
*net,
};
struct fib6_table *table;
struct rt6_info *rt;
-   int flags = RT6_LOOKUP_F_IFACE | RT6_LOOKUP_F_IGNORE_LINKSTATE;
 
-   table = fib6_get_table(net, cfg->fc_table);
+   table = fib6_get_table(net, tbid);
if (!table)
return NULL;
 
if (!ipv6_addr_any(&cfg->fc_prefsrc))
flags |= RT6_LOOKUP_F_HAS_SADDR;
 
+   flags |= RT6_LOOKUP_F_IGNORE_LINKSTATE;
rt = ip6_pol_route(net, table, cfg->fc_ifindex, &fl6, flags);
 
/* if table lookup failed, fall back to full lookup */
@@ -2480,7 +2481,10 @@ static int ip6_route_check_nh(struct net *net,
int err = -EHOSTUNREACH;
 
if (cfg->fc_table) {
-   grt = ip6_nh_lookup_table(net, cfg, gw_addr);
+   int flags = RT6_LOOKUP_F_IFACE;
+
+   grt = ip6_nh_lookup_table(net, cfg, gw_addr,
+ cfg->fc_table, flags);
if (grt) {
if (grt->rt6i_flags & RTF_GATEWAY ||
(dev && dev != grt->dst.dev)) {
-- 
2.11.0



[PATCH v2 net-next 3/3] net/ipv6: Add support for onlink flag

2018-01-25 Thread David Ahern
Similar to IPv4, allow routes to be added with the RTNH_F_ONLINK flag.
The onlink option requires a gateway and a nexthop device. Any unicast
gateway is allowed (including IPv4-mapped addresses and unresolved
ones) as long as the gateway is not a local address and, if it
resolves, it matches the given device.

Signed-off-by: David Ahern 
---
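With this in place a route like the following (illustrative addresses)
is accepted even though the gateway does not resolve on the link:

    ip -6 route add 2001:db8:2::/64 via 2001:db8:1::1 dev eth0 onlink
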
 net/ipv6/route.c | 51 ++-
 1 file changed, 50 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 3e0a1c67eb9f..8fecdb25fd1e 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2470,6 +2470,31 @@ static struct rt6_info *ip6_nh_lookup_table(struct net 
*net,
return rt;
 }
 
+static int ip6_route_check_nh_onlink(struct net *net,
+struct fib6_config *cfg,
+struct net_device *dev,
+struct netlink_ext_ack *extack)
+{
+   u32 tbid = l3mdev_fib_table(dev) ? : RT_TABLE_LOCAL;
+   const struct in6_addr *gw_addr = &cfg->fc_gateway;
+   u32 flags = RTF_LOCAL | RTF_ANYCAST | RTF_REJECT;
+   struct rt6_info *grt;
+   int err;
+
+   err = 0;
+   grt = ip6_nh_lookup_table(net, cfg, gw_addr, tbid, 0);
+   if (grt) {
+   if (grt->rt6i_flags & flags || dev != grt->dst.dev) {
+   NL_SET_ERR_MSG(extack, "Nexthop has invalid gateway");
+   err = -EINVAL;
+   }
+
+   ip6_rt_put(grt);
+   }
+
+   return err;
+}
+
 static int ip6_route_check_nh(struct net *net,
  struct fib6_config *cfg,
  struct net_device **_dev,
@@ -2572,6 +2597,21 @@ static struct rt6_info *ip6_route_info_create(struct 
fib6_config *cfg,
if (cfg->fc_metric == 0)
cfg->fc_metric = IP6_RT_PRIO_USER;
 
+   if (cfg->fc_flags & RTNH_F_ONLINK) {
+   if (!dev) {
+   NL_SET_ERR_MSG(extack,
+  "Nexthop device required for onlink");
+   err = -ENODEV;
+   goto out;
+   }
+
+   if (!(dev->flags & IFF_UP)) {
+   NL_SET_ERR_MSG(extack, "Nexthop device is not up");
+   err = -ENETDOWN;
+   goto out;
+   }
+   }
+
err = -ENOBUFS;
if (cfg->fc_nlinfo.nlh &&
!(cfg->fc_nlinfo.nlh->nlmsg_flags & NLM_F_CREATE)) {
@@ -2732,7 +2772,12 @@ static struct rt6_info *ip6_route_info_create(struct 
fib6_config *cfg,
goto out;
}
 
-   err = ip6_route_check_nh(net, cfg, , );
+   if (cfg->fc_flags & RTNH_F_ONLINK) {
+   err = ip6_route_check_nh_onlink(net, cfg, dev,
+   extack);
+   } else {
+   err = ip6_route_check_nh(net, cfg, , );
+   }
if (err)
goto out;
}
@@ -2768,6 +2813,7 @@ static struct rt6_info *ip6_route_info_create(struct 
fib6_config *cfg,
if (!(rt->rt6i_flags & (RTF_LOCAL | RTF_ANYCAST)) &&
!netif_carrier_ok(dev))
rt->rt6i_nh_flags |= RTNH_F_LINKDOWN;
+   rt->rt6i_nh_flags |= (cfg->fc_flags & RTNH_F_ONLINK);
rt->dst.dev = dev;
rt->rt6i_idev = idev;
rt->rt6i_table = table;
@@ -3837,6 +3883,8 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct 
nlmsghdr *nlh,
if (rtm->rtm_flags & RTM_F_CLONED)
cfg->fc_flags |= RTF_CACHE;
 
+   cfg->fc_flags |= (rtm->rtm_flags & RTNH_F_ONLINK);
+
cfg->fc_nlinfo.portid = NETLINK_CB(skb).portid;
cfg->fc_nlinfo.nlh = nlh;
cfg->fc_nlinfo.nl_net = sock_net(skb->sk);
@@ -4242,6 +4290,7 @@ static int rt6_nexthop_info(struct sk_buff *skb, struct 
rt6_info *rt,
goto nla_put_failure;
}
 
+   *flags |= (rt->rt6i_nh_flags & RTNH_F_ONLINK);
if (rt->rt6i_nh_flags & RTNH_F_OFFLOAD)
*flags |= RTNH_F_OFFLOAD;
 
-- 
2.11.0



[PATCH v2 net-next 1/3] net/ipv6: Move gateway validation into helper

2018-01-25 Thread David Ahern
Move the existing code that validates the nexthop into a helper. A
follow-on patch adds support for nexthops marked with onlink, and this
helper keeps the complexity of ip6_route_info_create in check.

Signed-off-by: David Ahern 
---
 net/ipv6/route.c | 85 
 1 file changed, 49 insertions(+), 36 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f85da2f1e729..0ff4ca0948f0 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2469,6 +2469,54 @@ static struct rt6_info *ip6_nh_lookup_table(struct net 
*net,
return rt;
 }
 
+static int ip6_route_check_nh(struct net *net,
+ struct fib6_config *cfg,
+ struct net_device **_dev,
+ struct inet6_dev **idev)
+{
+   const struct in6_addr *gw_addr = &cfg->fc_gateway;
+   struct net_device *dev = _dev ? *_dev : NULL;
+   struct rt6_info *grt = NULL;
+   int err = -EHOSTUNREACH;
+
+   if (cfg->fc_table) {
+   grt = ip6_nh_lookup_table(net, cfg, gw_addr);
+   if (grt) {
+   if (grt->rt6i_flags & RTF_GATEWAY ||
+   (dev && dev != grt->dst.dev)) {
+   ip6_rt_put(grt);
+   grt = NULL;
+   }
+   }
+   }
+
+   if (!grt)
+   grt = rt6_lookup(net, gw_addr, NULL, cfg->fc_ifindex, 1);
+
+   if (!grt)
+   goto out;
+
+   if (dev) {
+   if (dev != grt->dst.dev) {
+   ip6_rt_put(grt);
+   goto out;
+   }
+   } else {
+   *_dev = dev = grt->dst.dev;
+   *idev = grt->rt6i_idev;
+   dev_hold(dev);
+   in6_dev_hold(grt->rt6i_idev);
+   }
+
+   if (!(grt->rt6i_flags & RTF_GATEWAY))
+   err = 0;
+
+   ip6_rt_put(grt);
+
+out:
+   return err;
+}
+
 static struct rt6_info *ip6_route_info_create(struct fib6_config *cfg,
  struct netlink_ext_ack *extack)
 {
@@ -2664,8 +2712,6 @@ static struct rt6_info *ip6_route_info_create(struct 
fib6_config *cfg,
rt->rt6i_gateway = *gw_addr;
 
if (gwa_type != (IPV6_ADDR_LINKLOCAL|IPV6_ADDR_UNICAST)) {
-   struct rt6_info *grt = NULL;
-
/* IPv6 strictly inhibits using not link-local
   addresses as nexthop address.
   Otherwise, router will not able to send redirects.
@@ -2682,40 +2728,7 @@ static struct rt6_info *ip6_route_info_create(struct 
fib6_config *cfg,
goto out;
}
 
-   if (cfg->fc_table) {
-   grt = ip6_nh_lookup_table(net, cfg, gw_addr);
-
-   if (grt) {
-   if (grt->rt6i_flags & RTF_GATEWAY ||
-   (dev && dev != grt->dst.dev)) {
-   ip6_rt_put(grt);
-   grt = NULL;
-   }
-   }
-   }
-
-   if (!grt)
-   grt = rt6_lookup(net, gw_addr, NULL,
-cfg->fc_ifindex, 1);
-
-   err = -EHOSTUNREACH;
-   if (!grt)
-   goto out;
-   if (dev) {
-   if (dev != grt->dst.dev) {
-   ip6_rt_put(grt);
-   goto out;
-   }
-   } else {
-   dev = grt->dst.dev;
-   idev = grt->rt6i_idev;
-   dev_hold(dev);
-   in6_dev_hold(grt->rt6i_idev);
-   }
-   if (!(grt->rt6i_flags & RTF_GATEWAY))
-   err = 0;
-   ip6_rt_put(grt);
-
+   err = ip6_route_check_nh(net, cfg, , );
if (err)
goto out;
}
-- 
2.11.0



Re: [PATCH bpf-next v10 00/12] bpf: More sock_ops callbacks

2018-01-25 Thread Alexei Starovoitov
On Thu, Jan 25, 2018 at 04:14:04PM -0800, Lawrence Brakmo wrote:
> This patchset adds support for:
> 
> - direct R or R/W access to many tcp_sock fields
> - passing up to 4 arguments to sock_ops BPF functions
> - tcp_sock field bpf_sock_ops_cb_flags for controlling callbacks
> - optionally calling sock_ops BPF program when RTO fires
> - optionally calling sock_ops BPF program when packet is retransmitted
> - optionally calling sock_ops BPF program when TCP state changes
> - access to tclass and sk_txhash
> - new selftest
> 
> v2: Fixed commit message 0/11. The commit is to "bpf-next" but the patch
> below used "bpf" and Patchwork didn't work correctly.
> v3: Cleaned RTO callback as per Yuchung's comment
> Added BPF enum for TCP states as per Alexei's comment
> v4: Fixed compile warnings related to detecting changes between TCP
> internal states and the BPF defined states.
> v5: Fixed comment issues in some selftest files
> Fixed access issue with u64 fields in bpf_sock_ops struct
> v6: Made fixes based on comments from Eric Dumazet:
> The field bpf_sock_ops_cb_flags was added in a hole on 64bit kernels
> Field bpf_sock_ops_cb_flags is now set through a helper function
> which returns an error when a BPF program tries to set bits for
> callbacks that are not supported in the current kernel.
> Added a comment indicating that when adding fields to bpf_sock_ops_kern
> they should be added before the field named "temp" if they need to be
> cleared before calling the BPF function.  
> v7: Enforced that fields "op" and "replylong[1] .. replylong[3]" not be writable
> based on comments from Eric Dumazet and Alexei Starovoitov.
> Filled 32 bit hole in bpf_sock_ops struct with sk_txhash based on
> comments from Daniel Borkmann.
> Removed unused functions (tcp_call_bpf_1arg, tcp_call_bpf_4arg) based
> on comments from Daniel Borkmann.
> v8: Add commit message 00/12
> Add Acked-by as appropriate
> v9: Moved the bug fix to the front of the patchset
> Changed RETRANS_CB so it is always called (before it was only called if
> the retransmit succeeded). It is now called with an extra argument, the
> return value of tcp_transmit_skb (0 => success). Based on comments
> from Yuchung Cheng.
> Added support for reading 2 new fields, sacked_out and lost_out, based on
> comments from Yuchung Cheng.
> v10: Moved the callback flags from include/uapi/linux/tcp.h to
>  include/uapi/linux/bpf.h
>  Cleaned up the test in selftest. Added a timeout so it always completes,
>  even if the client is not communicating with the server. Made it faster
>  by removing the sleeps. Made sure it works even when called back-to-back
>  20 times.

Applied to bpf-next, thank you Larry.

We'll make sure to send patch 1 to stable when 4.15 is released.

Also, there were a few leftover comments in patch 12;
please clean them up in a followup patch.

Thanks



[PATCH net 1/1] r8169: fix memory corruption on retrieval of hardware statistics.

2018-01-25 Thread Francois Romieu
Hardware statistics retrieval hurts in tight invocation loops.

Avoid extraneous write and enforce strict ordering of writes targeted to
the tally counters dump area address registers.

Signed-off-by: Francois Romieu 
Tested-by: Oliver Freyermuth 
---
 drivers/net/ethernet/realtek/r8169.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index 272c5962e4f7..8e91274174f1 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -2235,19 +2235,14 @@ static bool rtl8169_do_counters(struct net_device *dev, 
u32 counter_cmd)
void __iomem *ioaddr = tp->mmio_addr;
dma_addr_t paddr = tp->counters_phys_addr;
u32 cmd;
-   bool ret;
 
RTL_W32(CounterAddrHigh, (u64)paddr >> 32);
+   RTL_R32(CounterAddrHigh);
cmd = (u64)paddr & DMA_BIT_MASK(32);
RTL_W32(CounterAddrLow, cmd);
RTL_W32(CounterAddrLow, cmd | counter_cmd);
 
-   ret = rtl_udelay_loop_wait_low(tp, &rtl_counters_cond, 10, 1000);
-
-   RTL_W32(CounterAddrLow, 0);
-   RTL_W32(CounterAddrHigh, 0);
-
-   return ret;
+   return rtl_udelay_loop_wait_low(tp, &rtl_counters_cond, 10, 1000);
 }
 
 static bool rtl8169_reset_counters(struct net_device *dev)
-- 
2.14.3



[PATCH net-next v1] samples/bpf: Partially fixes the bpf.o build

2018-01-25 Thread Mickaël Salaün
Do not build lib/bpf/bpf.o with this Makefile but use the one from the
library directory.  This avoids making a buggy bpf.o file (e.g. missing
symbols).

This patch is useful if some code (e.g. Landlock tests) needs both the
bpf.o (from tools/lib/bpf) and the bpf_load.o (from samples/bpf).

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
---

This is not a complete fix because the call to multi_depend with
$(host-cmulti) from scripts/Makefile.host force the build of bpf.o
anyway. I'm not sure how to completely avoid this automatic build
though.
---
 samples/bpf/Makefile | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 7f61a3d57fa7..64335bb94f9f 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -201,13 +201,16 @@ CLANG_ARCH_ARGS = -target $(ARCH)
 endif
 
 # Trick to allow make to be run from this directory
-all:
+all: $(LIBBPF)
$(MAKE) -C ../../ $(CURDIR)/
 
 clean:
$(MAKE) -C ../../ M=$(CURDIR) clean
@rm -f *~
 
+$(LIBBPF): FORCE
+   $(MAKE) -C $(dir $@) $(notdir $@)
+
 $(obj)/syscall_nrs.s:  $(src)/syscall_nrs.c
$(call if_changed_dep,cc_s_c)
 
-- 
2.15.1



[bpf PATCH 3/3] bpf: sockmap, fix leaking maps with attached but not detached progs

2018-01-25 Thread John Fastabend
When a program is attached to a map we increment the program refcnt
to ensure that the program is not removed while it is potentially
being referenced from sockmap side. However, if this same program
also references the map (this is a reasonably common pattern in
my programs) then the verifier will also increment the maps refcnt
from the verifier. This is to ensure the map doesn't get garbage
collected while the program has a reference to it.

So we are left in a state where the map holds a refcnt on the
program, stopping it from being removed and from releasing the map
refcnt. And vice versa, the program holds a refcnt on the map,
stopping it from releasing the refcnt on the prog.

All this is fine as long as users detach the program while the
map fd is still around. But, if the user omits this detach command
we are left with a dangling map we can no longer release.

To resolve this, when the map fd is released, decrement the program
references and remove any reference from the map to the program.
This fixes the issue with possibly dangling map and creates a
user side API constraint. That is, the map fd must be held open
for programs to be attached to a map.

Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
Signed-off-by: John Fastabend 
---
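To make the cycle concrete, a sketch of the pattern described above,
in the style of samples/sockmap (map and section names illustrative):

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

/* the map the parser/verdict programs get attached to ... */
struct bpf_map_def SEC("maps") sock_map = {
    .type        = BPF_MAP_TYPE_SOCKMAP,
    .key_size    = sizeof(int),
    .value_size  = sizeof(int),
    .max_entries = 20,
};

/* ... is also referenced by the program itself, so the verifier takes
 * a ref on the map while the map holds a ref on the program
 */
SEC("sk_skb/verdict")
int bpf_prog_verdict(struct __sk_buff *skb)
{
    return bpf_sk_redirect_map(skb, &sock_map, 0, 0);
}
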
 kernel/bpf/sockmap.c |   19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index b85a79c..a8cd1b7 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -611,11 +611,6 @@ static void sock_map_free(struct bpf_map *map)
}
rcu_read_unlock();
 
-   if (stab->bpf_verdict)
-   bpf_prog_put(stab->bpf_verdict);
-   if (stab->bpf_parse)
-   bpf_prog_put(stab->bpf_parse);
-
sock_map_remove_complete(stab);
 }
 
@@ -891,6 +886,19 @@ static int sock_map_update_elem(struct bpf_map *map,
return err;
 }
 
+static void sock_map_release(struct bpf_map *map, struct file *map_file)
+{
+   struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
+   struct bpf_prog *orig;
+
+   orig = xchg(&stab->bpf_parse, NULL);
+   if (orig)
+   bpf_prog_put(orig);
+   orig = xchg(&stab->bpf_verdict, NULL);
+   if (orig)
+   bpf_prog_put(orig);
+}
+
 const struct bpf_map_ops sock_map_ops = {
.map_alloc = sock_map_alloc,
.map_free = sock_map_free,
@@ -898,6 +906,7 @@ static int sock_map_update_elem(struct bpf_map *map,
.map_get_next_key = sock_map_get_next_key,
.map_update_elem = sock_map_update_elem,
.map_delete_elem = sock_map_delete_elem,
+   .map_release = sock_map_release,
 };
 
 BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock,



[bpf PATCH 0/3] bpf sockmap fixes

2018-01-25 Thread John Fastabend
A set of fixes for sockmap to resolve map/prog not being cleaned
up when maps are not cleaned up from the program side.

For this we pull in the ULP infrastructure to hook into the close()
hook of the sock layer. This seemed natural because we have additional
sockmap features (to add support for TX hooks) that will also use the
ULP infrastructure. This allows us to cleanup entries in the map when
socks are closed() and avoid trying to get the sk_state_change() hook
to fire in all cases.

The second issue resolved here occurs when users don't detach
programs. The gist is a refcnt issue resolved by implementing the
release callback. See patch for details.

For testing I ran both sample/sockmap and selftests bpf/test_maps.c.
I did not however retest TLS with the small change to ULP layer.
Mostly because I don't have a TLS setup. I plan/hope to get around
to writing either a sample or preferably a selftest for this case
as well (assuming I didn't miss one).

@Dave Watson can you take a quick look to verify the changes are
good on TLS ULP side.

---

John Fastabend (3):
  net: add a UID to use for ULP socket assignment
  bpf: sockmap, add sock close() hook to remove socks
  bpf: sockmap, fix leaking maps with attached but not detached progs


 include/net/tcp.h|8 ++
 kernel/bpf/sockmap.c |  164 --
 net/ipv4/tcp_ulp.c   |   53 +++-
 net/tls/tls_main.c   |2 +
 4 files changed, 150 insertions(+), 77 deletions(-)

--
Signature


[bpf PATCH 2/3] bpf: sockmap, add sock close() hook to remove socks

2018-01-25 Thread John Fastabend
The selftests test_maps program was leaving dangling BPF sockmap
programs around because not all psock elements were removed from
the map. The elements in turn hold a reference on the BPF program
they are attached to, causing BPF programs to stay open even after
test_maps has completed.

The original intent was that sk_state_change() would be called
when TCP socks went through TCP_CLOSE state. However, because
socks may be in SOCK_DEAD state or the sock may be a listening
socket the event is not always triggered.

To resolve this use the ULP infrastructure and register our own
proto close() handler. This fixes the above case.

Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
Reported-by: Prashant Bhole 
Signed-off-by: John Fastabend 
---
 include/net/tcp.h|2 +
 kernel/bpf/sockmap.c |  145 +++---
 2 files changed, 80 insertions(+), 67 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index ab9b714..6a88353 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1985,6 +1985,7 @@ static inline void tcp_listendrop(const struct sock *sk)
 
 enum {
TCP_ULP_TLS,
+   TCP_ULP_BPF,
 };
 
 struct tcp_ulp_ops {
@@ -2003,6 +2004,7 @@ struct tcp_ulp_ops {
 int tcp_register_ulp(struct tcp_ulp_ops *type);
 void tcp_unregister_ulp(struct tcp_ulp_ops *type);
 int tcp_set_ulp(struct sock *sk, const char *name);
+int tcp_set_ulp_id(struct sock *sk, const int ulp);
 void tcp_get_available_ulp(char *buf, size_t len);
 void tcp_cleanup_ulp(struct sock *sk);
 
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 1712d31..b85a79c 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -86,9 +86,10 @@ struct smap_psock {
struct work_struct tx_work;
struct work_struct gc_work;
 
+   struct proto *sk_proto;
+   void (*save_close)(struct sock *sk, long timeout);
void (*save_data_ready)(struct sock *sk);
void (*save_write_space)(struct sock *sk);
-   void (*save_state_change)(struct sock *sk);
 };
 
 static inline struct smap_psock *smap_psock_sk(const struct sock *sk)
@@ -104,12 +105,79 @@ static inline void bpf_compute_data_end_sk_skb(struct 
sk_buff *skb)
TCP_SKB_CB(skb)->bpf.data_end = skb->data + skb_headlen(skb);
 }
 
+struct proto tcp_bpf_proto;
+static int bpf_tcp_init(struct sock *sk)
+{
+   struct smap_psock *psock;
+
+   psock = smap_psock_sk(sk);
+   if (unlikely(!psock))
+   return -EINVAL;
+
+   if (unlikely(psock->sk_proto))
+   return -EBUSY;
+
+   psock->save_close = sk->sk_prot->close;
+   psock->sk_proto = sk->sk_prot;
+   sk->sk_prot = &tcp_bpf_proto;
+   return 0;
+}
+
+static void bpf_tcp_release(struct sock *sk)
+{
+   struct smap_psock *psock = smap_psock_sk(sk);
+
+   if (likely(psock)) {
+   sk->sk_prot = psock->sk_proto;
+   psock->sk_proto = NULL;
+   }
+}
+
+static void smap_release_sock(struct smap_psock *psock, struct sock *sock);
+
+void bpf_tcp_close(struct sock *sk, long timeout)
+{
+   struct smap_psock_map_entry *e, *tmp;
+   struct smap_psock *psock;
+   struct sock *osk;
+
+   psock = smap_psock_sk(sk);
+   if (unlikely(!psock))
+   return sk->sk_prot->close(sk, timeout);
+
+   write_lock_bh(&sk->sk_callback_lock);
+   list_for_each_entry_safe(e, tmp, &psock->maps, list) {
+   osk = cmpxchg(e->entry, sk, NULL);
+   if (osk == sk) {
+   list_del(&e->list);
+   smap_release_sock(psock, sk);
+   }
+   }
+   write_unlock_bh(&sk->sk_callback_lock);
+   return psock->save_close(sk, timeout);
+}
+
 enum __sk_action {
__SK_DROP = 0,
__SK_PASS,
__SK_REDIRECT,
 };
 
+static struct tcp_ulp_ops bpf_tcp_ulp_ops __read_mostly = {
+   .name   = "bpf_tcp",
+   .uid= TCP_ULP_BPF,
+   .owner  = NULL,
+   .init   = bpf_tcp_init,
+   .release= bpf_tcp_release,
+};
+
+static int bpf_tcp_ulp_register(void)
+{
+   tcp_bpf_proto = tcp_prot;
+   tcp_bpf_proto.close = bpf_tcp_close;
+   return tcp_register_ulp(&bpf_tcp_ulp_ops);
+}
+
 static int smap_verdict_func(struct smap_psock *psock, struct sk_buff *skb)
 {
struct bpf_prog *prog = READ_ONCE(psock->bpf_verdict);
@@ -174,68 +242,6 @@ static void smap_report_sk_error(struct smap_psock *psock, 
int err)
sk->sk_error_report(sk);
 }
 
-static void smap_release_sock(struct smap_psock *psock, struct sock *sock);
-
-/* Called with lock_sock(sk) held */
-static void smap_state_change(struct sock *sk)
-{
-   struct smap_psock_map_entry *e, *tmp;
-   struct smap_psock *psock;
-   struct socket_wq *wq;
-   struct sock *osk;
-
-   rcu_read_lock();
-
-   /* Allowing transitions into an established syn_recv states allows
-* for 

[bpf PATCH 1/3] net: add a UID to use for ULP socket assignment

2018-01-25 Thread John Fastabend
Create a UID field and enum that can be used to assign ULPs to
sockets. This saves a set of string comparisons if the ULP id
is known.

For sockmap, which is added in the next patches, a ULP is used to
hook into the close state of TCP sockets. In this case the ULP is
added at map insert time, is known in advance, and is set from the
kernel side, so the named lookup is not needed. Because we don't
want to expose psock internals to user space socket options, a user
visible flag is also added. For TLS this is set; for BPF it will be
cleared.

Also remove the pr_notice; the user gets an error code back and should
check that rather than rely on logs.

Signed-off-by: John Fastabend 
---
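(For context, a sketch of the kernel-side call this enables; the
TCP_ULP_BPF id and the actual call site arrive in the next patch:)

    /* attach a known ULP by id: no string compare, not user visible */
    err = tcp_set_ulp_id(sock, TCP_ULP_BPF);
    if (err)
        goto out;
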
 include/net/tcp.h  |6 ++
 net/ipv4/tcp_ulp.c |   53 +++-
 net/tls/tls_main.c |2 ++
 3 files changed, 56 insertions(+), 5 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6da880d..ab9b714 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1983,6 +1983,10 @@ static inline void tcp_listendrop(const struct sock *sk)
 #define TCP_ULP_MAX128
 #define TCP_ULP_BUF_MAX(TCP_ULP_NAME_MAX*TCP_ULP_MAX)
 
+enum {
+   TCP_ULP_TLS,
+};
+
 struct tcp_ulp_ops {
struct list_headlist;
 
@@ -1991,7 +1995,9 @@ struct tcp_ulp_ops {
/* cleanup ulp */
void (*release)(struct sock *sk);
 
+   int uid;
charname[TCP_ULP_NAME_MAX];
+   booluser_visible;
struct module   *owner;
 };
 int tcp_register_ulp(struct tcp_ulp_ops *type);
diff --git a/net/ipv4/tcp_ulp.c b/net/ipv4/tcp_ulp.c
index 6bb9e14..8ef437d 100644
--- a/net/ipv4/tcp_ulp.c
+++ b/net/ipv4/tcp_ulp.c
@@ -29,6 +29,18 @@ static struct tcp_ulp_ops *tcp_ulp_find(const char *name)
return NULL;
 }
 
+static struct tcp_ulp_ops *tcp_ulp_find_id(const int ulp)
+{
+   struct tcp_ulp_ops *e;
+
+   list_for_each_entry_rcu(e, &tcp_ulp_list, list) {
+   if (e->uid == ulp)
+   return e;
+   }
+
+   return NULL;
+}
+
 static const struct tcp_ulp_ops *__tcp_ulp_find_autoload(const char *name)
 {
const struct tcp_ulp_ops *ulp = NULL;
@@ -51,6 +63,18 @@ static const struct tcp_ulp_ops 
*__tcp_ulp_find_autoload(const char *name)
return ulp;
 }
 
+static const struct tcp_ulp_ops *__tcp_ulp_lookup(const int uid)
+{
+   const struct tcp_ulp_ops *ulp = NULL;
+
+   rcu_read_lock();
+   ulp = tcp_ulp_find_id(uid);
+   if (!ulp || !try_module_get(ulp->owner))
+   ulp = NULL;
+   rcu_read_unlock();
+   return ulp;
+}
+
 /* Attach new upper layer protocol to the list
  * of available protocols.
  */
@@ -59,13 +83,10 @@ int tcp_register_ulp(struct tcp_ulp_ops *ulp)
int ret = 0;
 
spin_lock(&tcp_ulp_list_lock);
-   if (tcp_ulp_find(ulp->name)) {
-   pr_notice("%s already registered or non-unique name\n",
- ulp->name);
+   if (tcp_ulp_find(ulp->name))
ret = -EEXIST;
-   } else {
+   else
list_add_tail_rcu(&ulp->list, &tcp_ulp_list);
-   }
spin_unlock(&tcp_ulp_list_lock);
 
return ret;
@@ -124,6 +145,9 @@ int tcp_set_ulp(struct sock *sk, const char *name)
if (!ulp_ops)
return -ENOENT;
 
+   if (!ulp_ops->user_visible)
+   return -ENOENT;
+
err = ulp_ops->init(sk);
if (err) {
module_put(ulp_ops->owner);
@@ -133,3 +157,22 @@ int tcp_set_ulp(struct sock *sk, const char *name)
icsk->icsk_ulp_ops = ulp_ops;
return 0;
 }
+
+int tcp_set_ulp_id(struct sock *sk, int ulp)
+{
+   struct inet_connection_sock *icsk = inet_csk(sk);
+   const struct tcp_ulp_ops *ulp_ops;
+   int err;
+
+   if (icsk->icsk_ulp_ops)
+   return -EEXIST;
+
+   ulp_ops = __tcp_ulp_lookup(ulp);
+   if (!ulp_ops)
+   return -ENOENT;
+
+   err = ulp_ops->init(sk);
+   if (!err)
+   icsk->icsk_ulp_ops = ulp_ops;
+   return err;
+}
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 736719c..b0d5fce 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -484,6 +484,8 @@ static int tls_init(struct sock *sk)
 
 static struct tcp_ulp_ops tcp_tls_ulp_ops __read_mostly = {
.name   = "tls",
+   .uid= TCP_ULP_TLS,
+   .user_visible   = true,
.owner  = THIS_MODULE,
.init   = tls_init,
 };



[PATCH bpf-next v10 09/12] bpf: Add sock_ops R/W access to tclass

2018-01-25 Thread Lawrence Brakmo
Adds direct write access to sk_txhash and access to tclass for ipv6
flows through getsockopt and setsockopt. Sample usage for tclass:

  bpf_getsockopt(skops, SOL_IPV6, IPV6_TCLASS, &v, sizeof(v))

where skops is a pointer to the ctx (struct bpf_sock_ops).

Signed-off-by: Lawrence Brakmo 
---
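(The write side is symmetric; a sketch, noting that per this patch
values outside -1..0xff are rejected and -1 is treated as 0:)

    int v = 0x60;

    bpf_setsockopt(skops, SOL_IPV6, IPV6_TCLASS, &v, sizeof(v));
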
 net/core/filter.c | 47 +--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index a858ebc..fe2c793 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3232,6 +3232,29 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, 
bpf_sock,
ret = -EINVAL;
}
 #ifdef CONFIG_INET
+#if IS_ENABLED(CONFIG_IPV6)
+   } else if (level == SOL_IPV6) {
+   if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
+   return -EINVAL;
+
+   val = *((int *)optval);
+   /* Only some options are supported */
+   switch (optname) {
+   case IPV6_TCLASS:
+   if (val < -1 || val > 0xff) {
+   ret = -EINVAL;
+   } else {
+   struct ipv6_pinfo *np = inet6_sk(sk);
+
+   if (val == -1)
+   val = 0;
+   np->tclass = val;
+   }
+   break;
+   default:
+   ret = -EINVAL;
+   }
+#endif
} else if (level == SOL_TCP &&
   sk->sk_prot->setsockopt == tcp_setsockopt) {
if (optname == TCP_CONGESTION) {
@@ -3241,7 +3264,8 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, 
bpf_sock,
strncpy(name, optval, min_t(long, optlen,
TCP_CA_NAME_MAX-1));
name[TCP_CA_NAME_MAX-1] = 0;
-   ret = tcp_set_congestion_control(sk, name, false, 
reinit);
+   ret = tcp_set_congestion_control(sk, name, false,
+reinit);
} else {
struct tcp_sock *tp = tcp_sk(sk);
 
@@ -3307,6 +3331,22 @@ BPF_CALL_5(bpf_getsockopt, struct bpf_sock_ops_kern *, 
bpf_sock,
} else {
goto err_clear;
}
+#if IS_ENABLED(CONFIG_IPV6)
+   } else if (level == SOL_IPV6) {
+   struct ipv6_pinfo *np = inet6_sk(sk);
+
+   if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
+   goto err_clear;
+
+   /* Only some options are supported */
+   switch (optname) {
+   case IPV6_TCLASS:
+   *((int *)optval) = (int)np->tclass;
+   break;
+   default:
+   goto err_clear;
+   }
+#endif
} else {
goto err_clear;
}
@@ -3871,6 +3911,7 @@ static bool sock_ops_is_valid_access(int off, int size,
if (type == BPF_WRITE) {
switch (off) {
case offsetof(struct bpf_sock_ops, reply):
+   case offsetof(struct bpf_sock_ops, sk_txhash):
if (size != size_default)
return false;
break;
@@ -4690,7 +4731,8 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
break;
 
case offsetof(struct bpf_sock_ops, sk_txhash):
-   SOCK_OPS_GET_FIELD(sk_txhash, sk_txhash, struct sock);
+   SOCK_OPS_GET_OR_SET_FIELD(sk_txhash, sk_txhash,
+ struct sock, type);
break;
 
case offsetof(struct bpf_sock_ops, bytes_received):
@@ -4701,6 +4743,7 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
case offsetof(struct bpf_sock_ops, bytes_acked):
SOCK_OPS_GET_FIELD(bytes_acked, bytes_acked, struct tcp_sock);
break;
+
}
return insn - insn_buf;
 }
-- 
2.9.5



[PATCH bpf-next v10 00/12] bpf: More sock_ops callbacks

2018-01-25 Thread Lawrence Brakmo
This patchset adds support for:

- direct R or R/W access to many tcp_sock fields
- passing up to 4 arguments to sock_ops BPF functions
- tcp_sock field bpf_sock_ops_cb_flags for controlling callbacks
- optionally calling sock_ops BPF program when RTO fires
- optionally calling sock_ops BPF program when packet is retransmitted
- optionally calling sock_ops BPF program when TCP state changes
- access to tclass and sk_txhash
- new selftest

v2: Fixed commit message 0/11. The commit is to "bpf-next" but the patch
below used "bpf" and Patchwork didn't work correctly.
v3: Cleaned RTO callback as per Yuchung's comment
Added BPF enum for TCP states as per Alexei's comment
v4: Fixed compile warnings related to detecting changes between TCP
internal states and the BPF defined states.
v5: Fixed comment issues in some selftest files
Fixed access issue with u64 fields in bpf_sock_ops struct
v6: Made fixes based on comments from Eric Dumazet:
The field bpf_sock_ops_cb_flags was added in a hole on 64bit kernels
Field bpf_sock_ops_cb_flags is now set through a helper function
which returns an error when a BPF program tries to set bits for
callbacks that are not supported in the current kernel.
Added a comment indicating that when adding fields to bpf_sock_ops_kern
they should be added before the field named "temp" if they need to be
cleared before calling the BPF function.  
v7: Enforced that fields "op" and "replylong[1] .. replylong[3]" are not
writable, based on comments from Eric Dumazet and Alexei Starovoitov.
Filled 32 bit hole in bpf_sock_ops struct with sk_txhash based on
comments from Daniel Borkmann.
Removed unused functions (tcp_call_bpf_1arg, tcp_call_bpf_4arg) based
on comments from Daniel Borkmann.
v8: Add commit message 00/12
Add Acked-by as appropriate
v9: Moved the bug fix to the front of the patchset
Changed RETRANS_CB so it is always called (before it was only called if
the retransmit succeeded). It is now called with an extra argument, the
return value of tcp_transmit_skb (0 => success). Based on comments
from Yuchung Cheng.
Added support for reading 2 new fields, sacked_out and lost_out, based on
comments from Yuchung Cheng.
v10: Moved the callback flags from include/uapi/linux/tcp.h to
 include/uapi/linux/bpf.h
 Cleaned up the test in selftest. Added a timeout so it always completes,
 even if the client is not communicating with the server. Made it faster
 by removing the sleeps. Made sure it works even when called back-to-back
 20 times.

Consists of the following patches:
[PATCH bpf-next v10 01/12] bpf: Only reply field should be writeable
[PATCH bpf-next v10 02/12] bpf: Make SOCK_OPS_GET_TCP size
[PATCH bpf-next v10 03/12] bpf: Make SOCK_OPS_GET_TCP struct
[PATCH bpf-next v10 04/12] bpf: Add write access to tcp_sock and sock
[PATCH bpf-next v10 05/12] bpf: Support passing args to sock_ops bpf
[PATCH bpf-next v10 06/12] bpf: Adds field bpf_sock_ops_cb_flags to
[PATCH bpf-next v10 07/12] bpf: Add sock_ops RTO callback
[PATCH bpf-next v10 08/12] bpf: Add support for reading sk_state and
[PATCH bpf-next v10 09/12] bpf: Add sock_ops R/W access to tclass
[PATCH bpf-next v10 10/12] bpf: Add BPF_SOCK_OPS_RETRANS_CB
[PATCH bpf-next v10 11/12] bpf: Add BPF_SOCK_OPS_STATE_CB
[PATCH bpf-next v10 12/12] bpf: add selftest for tcpbpf

 include/linux/filter.h |  10 ++
 include/linux/tcp.h|  11 ++
 include/net/tcp.h  |  42 -
 include/uapi/linux/bpf.h   |  84 -
 net/core/filter.c  | 290 
---
 net/ipv4/tcp.c |  26 ++-
 net/ipv4/tcp_nv.c  |   2 +-
 net/ipv4/tcp_output.c  |   6 +-
 net/ipv4/tcp_timer.c   |   7 +
 tools/include/uapi/linux/bpf.h |  86 -
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/bpf_helpers.h  |   2 +
 tools/testing/selftests/bpf/tcp_client.py  |  51 ++
 tools/testing/selftests/bpf/tcp_server.py  |  83 +
 tools/testing/selftests/bpf/test_tcpbpf.h  |  16 ++
 tools/testing/selftests/bpf/test_tcpbpf_kern.c | 118 +
 tools/testing/selftests/bpf/test_tcpbpf_user.c | 126 ++
 17 files changed, 925 insertions(+), 39 deletions(-)



[PATCH bpf-next v10 04/12] bpf: Add write access to tcp_sock and sock fields

2018-01-25 Thread Lawrence Brakmo
This patch adds a macro, SOCK_OPS_SET_FIELD, for writing to
struct tcp_sock or struct sock fields. This required adding a new
field "temp" to struct bpf_sock_ops_kern for temporary storage that
is used by sock_ops_convert_ctx_access. It is used to store and recover
the contents of a register, so the register can be used to store the
address of the sk. Since we cannot overwrite the dst_reg because it
contains the pointer to ctx, nor the src_reg since it contains the value
we want to store, we need an extra register to contain the address
of the sk.

Also adds the macro SOCK_OPS_GET_OR_SET_FIELD that calls one of the
GET or SET macros depending on the value of the TYPE field.
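
Schematically, the instruction sequence the macro emits for a write looks
like the following pseudo-code (a sketch, not the literal eBPF encoding;
"R9" stands for the borrowed register, stepped down to R8 or R7 when
dst_reg or src_reg already uses it):

  ctx->temp = R9;			/* 1. save borrowed register    */
  R9 = ctx->is_fullsock;
  if (R9 == 0)				/* 2. no full socket: skip store */
	goto restore;
  R9 = ctx->sk;				/* 3. load sk ...               */
  R9->OBJ_FIELD = src_reg;		/*    ... and store the value   */
  restore:
  R9 = ctx->temp;			/* 4. restore borrowed register */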

Signed-off-by: Lawrence Brakmo 
Acked-by: Alexei Starovoitov 
---
 include/linux/filter.h |  9 +
 include/net/tcp.h  |  2 +-
 net/core/filter.c  | 48 
 3 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 425056c..daa5a67 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1007,6 +1007,15 @@ struct bpf_sock_ops_kern {
u32 replylong[4];
};
u32 is_fullsock;
+   u64 temp;   /* temp and everything after is not
+* initialized to 0 before calling
+* the BPF program. New fields that
+* should be initialized to 0 should
+* be inserted before temp.
+* temp is scratch storage used by
+* sock_ops_convert_ctx_access
+* as temporary storage of a register.
+*/
 };
 
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5a1d26a..6092eaf 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2011,7 +2011,7 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
struct bpf_sock_ops_kern sock_ops;
int ret;
 
-   memset(&sock_ops, 0, sizeof(sock_ops));
+   memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
if (sk_fullsock(sk)) {
sock_ops.is_fullsock = 1;
sock_owned_by_me(sk);
diff --git a/net/core/filter.c b/net/core/filter.c
index dbb6d2f..c356ec0 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4491,6 +4491,54 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
  offsetof(OBJ, OBJ_FIELD));  \
} while (0)
 
+/* Helper macro for adding write access to tcp_sock or sock fields.
+ * The macro is called with two registers, dst_reg which contains a pointer
+ * to ctx (context) and src_reg which contains the value that should be
+ * stored. However, we need an additional register since we cannot overwrite
+ * dst_reg because it may be used later in the program.
+ * Instead we "borrow" one of the other register. We first save its value
+ * into a new (temp) field in bpf_sock_ops_kern, use it, and then restore
+ * it at the end of the macro.
+ */
+#define SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)\
+   do {  \
+   int reg = BPF_REG_9;  \
+   BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >   \
+FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
+   if (si->dst_reg == reg || si->src_reg == reg) \
+   reg--;\
+   if (si->dst_reg == reg || si->src_reg == reg) \
+   reg--;\
+   *insn++ = BPF_STX_MEM(BPF_DW, si->dst_reg, reg,   \
+ offsetof(struct bpf_sock_ops_kern,  \
+  temp));\
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
+   struct bpf_sock_ops_kern, \
+   is_fullsock), \
+ reg, si->dst_reg,   \
+ offsetof(struct bpf_sock_ops_kern,  \
+  is_fullsock)); \
+   *insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2);\
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
+   struct bpf_sock_ops_kern, sk),\
+   

[PATCH bpf-next v10 02/12] bpf: Make SOCK_OPS_GET_TCP size independent

2018-01-25 Thread Lawrence Brakmo
Make SOCK_OPS_GET_TCP helper macro size independent (before it only worked
with 4-byte fields).

Signed-off-by: Lawrence Brakmo 
---
 net/core/filter.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index bf9bb75..62e7874 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4470,9 +4470,10 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
break;
 
 /* Helper macro for adding read access to tcp_sock fields. */
-#define SOCK_OPS_GET_TCP32(FIELD_NAME)   \
+#define SOCK_OPS_GET_TCP(FIELD_NAME) \
do {  \
-   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) != 4); \
+   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) >  \
+FIELD_SIZEOF(struct bpf_sock_ops, FIELD_NAME));  \
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
struct bpf_sock_ops_kern, \
is_fullsock), \
@@ -4484,16 +4485,18 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
struct bpf_sock_ops_kern, sk),\
  si->dst_reg, si->src_reg,   \
  offsetof(struct bpf_sock_ops_kern, sk));\
-   *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,\
+   *insn++ = BPF_LDX_MEM(FIELD_SIZEOF(struct tcp_sock,   \
+  FIELD_NAME), si->dst_reg,  \
+ si->dst_reg,\
  offsetof(struct tcp_sock, FIELD_NAME)); \
} while (0)
 
case offsetof(struct bpf_sock_ops, snd_cwnd):
-   SOCK_OPS_GET_TCP32(snd_cwnd);
+   SOCK_OPS_GET_TCP(snd_cwnd);
break;
 
case offsetof(struct bpf_sock_ops, srtt_us):
-   SOCK_OPS_GET_TCP32(srtt_us);
+   SOCK_OPS_GET_TCP(srtt_us);
break;
}
return insn - insn_buf;
-- 
2.9.5



[PATCH bpf-next v10 03/12] bpf: Make SOCK_OPS_GET_TCP struct independent

2018-01-25 Thread Lawrence Brakmo
Changed SOCK_OPS_GET_TCP to SOCK_OPS_GET_FIELD and added 2
arguments so now it can also work with struct sock fields.
The first argument is the name of the field in the bpf_sock_ops
struct, the 2nd argument is the name of the field in the OBJ struct.

Previous: SOCK_OPS_GET_TCP(FIELD_NAME)
New:  SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)

Where OBJ is either "struct tcp_sock" or "struct sock" (without
quotation). BPF_FIELD is the name of the field in the bpf_sock_ops
struct and OBJ_FIELD is the name of the field in the OBJ struct.

Although the field names are currently the same, the kernel struct names
could change in the future and this change makes it easier to support
that.

Note that adding access to tcp_sock fields in sock_ops programs does
not preclude the tcp_sock fields from being removed as long as we are
willing to do one of the following:

  1) Return a fixed value (e.x. 0 or 0x), or
  2) Make the verifier fail if that field is accessed (i.e. program
fails to load) so the user will know that field is no longer
supported.

Signed-off-by: Lawrence Brakmo 
---
 net/core/filter.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 62e7874..dbb6d2f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4469,11 +4469,11 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
   is_fullsock));
break;
 
-/* Helper macro for adding read access to tcp_sock fields. */
-#define SOCK_OPS_GET_TCP(FIELD_NAME) \
+/* Helper macro for adding read access to tcp_sock or sock fields. */
+#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)\
do {  \
-   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) >  \
-FIELD_SIZEOF(struct bpf_sock_ops, FIELD_NAME));  \
+   BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >   \
+FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
struct bpf_sock_ops_kern, \
is_fullsock), \
@@ -4485,18 +4485,18 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
struct bpf_sock_ops_kern, sk),\
  si->dst_reg, si->src_reg,   \
  offsetof(struct bpf_sock_ops_kern, sk));\
-   *insn++ = BPF_LDX_MEM(FIELD_SIZEOF(struct tcp_sock,   \
-  FIELD_NAME), si->dst_reg,  \
- si->dst_reg,\
- offsetof(struct tcp_sock, FIELD_NAME)); \
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(OBJ,   \
+  OBJ_FIELD),\
+ si->dst_reg, si->dst_reg,   \
+ offsetof(OBJ, OBJ_FIELD));  \
} while (0)
 
case offsetof(struct bpf_sock_ops, snd_cwnd):
-   SOCK_OPS_GET_TCP(snd_cwnd);
+   SOCK_OPS_GET_FIELD(snd_cwnd, snd_cwnd, struct tcp_sock);
break;
 
case offsetof(struct bpf_sock_ops, srtt_us):
-   SOCK_OPS_GET_TCP(srtt_us);
+   SOCK_OPS_GET_FIELD(srtt_us, srtt_us, struct tcp_sock);
break;
}
return insn - insn_buf;
-- 
2.9.5



[PATCH bpf-next v10 11/12] bpf: Add BPF_SOCK_OPS_STATE_CB

2018-01-25 Thread Lawrence Brakmo
Adds support for calling sock_ops BPF program when there is a TCP state
change. Two arguments are used; one for the old state and another for
the new state.

There is a new enum in include/uapi/linux/bpf.h that exports the TCP
states that prepends BPF_ to the current TCP state names. If it is ever
necessary to change the internal TCP state values (other than adding
more to the end), then it will become necessary to convert from the
internal TCP state value to the BPF value before calling the BPF
sock_ops function. There are a set of compile checks added in tcp.c
to detect if the internal and BPF values differ so we can make the
necessary fixes.

New op: BPF_SOCK_OPS_STATE_CB.
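
A sock_ops program could consume the new op as follows (a sketch; it
assumes the callback was first enabled with bpf_sock_ops_cb_flags_set()
and BPF_SOCK_OPS_STATE_CB_FLAG, and the reply usage is illustrative):

  SEC("sockops")
  int state_cb(struct bpf_sock_ops *skops)
  {
	if (skops->op == BPF_SOCK_OPS_STATE_CB) {
		__u32 old_state = skops->args[0];
		__u32 new_state = skops->args[1];

		if (old_state == BPF_TCP_SYN_SENT &&
		    new_state == BPF_TCP_ESTABLISHED)
			skops->reply = 1;	/* e.g. note active open done */
	}
	return 1;
  }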

Signed-off-by: Lawrence Brakmo 
---
 include/uapi/linux/bpf.h | 29 -
 net/ipv4/tcp.c   | 24 
 2 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 31c93a0b..db6bdc3 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1006,7 +1006,8 @@ struct bpf_sock_ops {
 /* Definitions for bpf_sock_ops_cb_flags */
 #define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
 #define BPF_SOCK_OPS_RETRANS_CB_FLAG   (1<<1)
-#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x3/* Mask of all currently
+#define BPF_SOCK_OPS_STATE_CB_FLAG (1<<2)
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x7/* Mask of all currently
 * supported cb flags
 */
 
@@ -1054,6 +1055,32 @@ enum {
 * Arg3: return value of
 *   tcp_transmit_skb (0 => success)
 */
+   BPF_SOCK_OPS_STATE_CB,  /* Called when TCP changes state.
+* Arg1: old_state
+* Arg2: new_state
+*/
+};
+
+/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
+ * changes between the TCP and BPF versions. Ideally this should never happen.
+ * If it does, we need to add code to convert them before calling
+ * the BPF sock_ops function.
+ */
+enum {
+   BPF_TCP_ESTABLISHED = 1,
+   BPF_TCP_SYN_SENT,
+   BPF_TCP_SYN_RECV,
+   BPF_TCP_FIN_WAIT1,
+   BPF_TCP_FIN_WAIT2,
+   BPF_TCP_TIME_WAIT,
+   BPF_TCP_CLOSE,
+   BPF_TCP_CLOSE_WAIT,
+   BPF_TCP_LAST_ACK,
+   BPF_TCP_LISTEN,
+   BPF_TCP_CLOSING,/* Now a valid state */
+   BPF_TCP_NEW_SYN_RECV,
+
+   BPF_TCP_MAX_STATES  /* Leave at the end! */
 };
 
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 88b6244..f013ddc 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2042,6 +2042,30 @@ void tcp_set_state(struct sock *sk, int state)
 {
int oldstate = sk->sk_state;
 
+   /* We defined a new enum for TCP states that are exported in BPF
+* so as not to force the internal TCP states to be frozen. The
+* following checks will detect if an internal state value ever
+* differs from the BPF value. If this ever happens, then we will
+* need to remap the internal value to the BPF value before calling
+* tcp_call_bpf_2arg.
+*/
+   BUILD_BUG_ON((int)BPF_TCP_ESTABLISHED != (int)TCP_ESTABLISHED);
+   BUILD_BUG_ON((int)BPF_TCP_SYN_SENT != (int)TCP_SYN_SENT);
+   BUILD_BUG_ON((int)BPF_TCP_SYN_RECV != (int)TCP_SYN_RECV);
+   BUILD_BUG_ON((int)BPF_TCP_FIN_WAIT1 != (int)TCP_FIN_WAIT1);
+   BUILD_BUG_ON((int)BPF_TCP_FIN_WAIT2 != (int)TCP_FIN_WAIT2);
+   BUILD_BUG_ON((int)BPF_TCP_TIME_WAIT != (int)TCP_TIME_WAIT);
+   BUILD_BUG_ON((int)BPF_TCP_CLOSE != (int)TCP_CLOSE);
+   BUILD_BUG_ON((int)BPF_TCP_CLOSE_WAIT != (int)TCP_CLOSE_WAIT);
+   BUILD_BUG_ON((int)BPF_TCP_LAST_ACK != (int)TCP_LAST_ACK);
+   BUILD_BUG_ON((int)BPF_TCP_LISTEN != (int)TCP_LISTEN);
+   BUILD_BUG_ON((int)BPF_TCP_CLOSING != (int)TCP_CLOSING);
+   BUILD_BUG_ON((int)BPF_TCP_NEW_SYN_RECV != (int)TCP_NEW_SYN_RECV);
+   BUILD_BUG_ON((int)BPF_TCP_MAX_STATES != (int)TCP_MAX_STATES);
+
+   if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_STATE_CB_FLAG))
+   tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_STATE_CB, oldstate, state);
+
switch (state) {
case TCP_ESTABLISHED:
if (oldstate != TCP_ESTABLISHED)
-- 
2.9.5



[PATCH bpf-next v10 05/12] bpf: Support passing args to sock_ops bpf function

2018-01-25 Thread Lawrence Brakmo
Adds support for passing up to 4 arguments to sock_ops bpf functions. It
reuses the reply union, so the bpf_sock_ops structures are not
increased in size.
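
Both sides of the interface, side by side (the kernel lines mirror this
patch; the BPF-side names are illustrative):

  /* kernel side: pass two values along with the op */
  u32 args[2] = { arg1, arg2 };
  tcp_call_bpf(sk, op, 2, args);  /* or tcp_call_bpf_2arg(sk, op, arg1, arg2) */

  /* BPF program side: the same values arrive as */
  __u32 a0 = skops->args[0];
  __u32 a1 = skops->args[1];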

Signed-off-by: Lawrence Brakmo 
---
 include/linux/filter.h   |  1 +
 include/net/tcp.h| 40 +++-
 include/uapi/linux/bpf.h |  5 +++--
 net/ipv4/tcp.c   |  2 +-
 net/ipv4/tcp_nv.c|  2 +-
 net/ipv4/tcp_output.c|  2 +-
 6 files changed, 42 insertions(+), 10 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index daa5a67..20384c4 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1003,6 +1003,7 @@ struct bpf_sock_ops_kern {
struct  sock *sk;
u32 op;
union {
+   u32 args[4];
u32 reply;
u32 replylong[4];
};
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6092eaf..093e967 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2006,7 +2006,7 @@ void tcp_cleanup_ulp(struct sock *sk);
  * program loaded).
  */
 #ifdef CONFIG_BPF
-static inline int tcp_call_bpf(struct sock *sk, int op)
+static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 {
struct bpf_sock_ops_kern sock_ops;
int ret;
@@ -2019,6 +2019,8 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
 
sock_ops.sk = sk;
sock_ops.op = op;
+   if (nargs > 0)
+   memcpy(sock_ops.args, args, nargs * sizeof(*args));
 
	ret = BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
if (ret == 0)
@@ -2027,18 +2029,46 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
ret = -1;
return ret;
 }
+
+static inline int tcp_call_bpf_2arg(struct sock *sk, int op, u32 arg1, u32 arg2)
+{
+   u32 args[2] = {arg1, arg2};
+
+   return tcp_call_bpf(sk, op, 2, args);
+}
+
+static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, u32 arg2,
+   u32 arg3)
+{
+   u32 args[3] = {arg1, arg2, arg3};
+
+   return tcp_call_bpf(sk, op, 3, args);
+}
+
 #else
-static inline int tcp_call_bpf(struct sock *sk, int op)
+static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 {
return -EPERM;
 }
+
+static inline int tcp_call_bpf_2arg(struct sock *sk, int op, u32 arg1, u32 arg2)
+{
+   return -EPERM;
+}
+
+static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, u32 arg2,
+   u32 arg3)
+{
+   return -EPERM;
+}
+
 #endif
 
 static inline u32 tcp_timeout_init(struct sock *sk)
 {
int timeout;
 
-   timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT);
+   timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT, 0, NULL);
 
if (timeout <= 0)
timeout = TCP_TIMEOUT_INIT;
@@ -2049,7 +2079,7 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk)
 {
int rwnd;
 
-   rwnd = tcp_call_bpf(sk, BPF_SOCK_OPS_RWND_INIT);
+   rwnd = tcp_call_bpf(sk, BPF_SOCK_OPS_RWND_INIT, 0, NULL);
 
if (rwnd < 0)
rwnd = 0;
@@ -2058,7 +2088,7 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk)
 
 static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk)
 {
-   return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN) == 1);
+   return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN, 0, NULL) == 1);
 }
 
 #if IS_ENABLED(CONFIG_SMC)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 406c19d..8d5874c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -952,8 +952,9 @@ struct bpf_map_info {
 struct bpf_sock_ops {
__u32 op;
union {
-   __u32 reply;
-   __u32 replylong[4];
+   __u32 args[4];  /* Optionally passed to bpf program */
+   __u32 reply;/* Returned by bpf program  */
+   __u32 replylong[4]; /* Optionally returned by bpf prog  */
};
__u32 family;
__u32 remote_ip4;   /* Stored in network byte order */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index d7cf861..88b6244 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -463,7 +463,7 @@ void tcp_init_transfer(struct sock *sk, int bpf_op)
tcp_mtup_init(sk);
icsk->icsk_af_ops->rebuild_header(sk);
tcp_init_metrics(sk);
-   tcp_call_bpf(sk, bpf_op);
+   tcp_call_bpf(sk, bpf_op, 0, NULL);
tcp_init_congestion_control(sk);
tcp_init_buffer_space(sk);
 }
diff --git a/net/ipv4/tcp_nv.c b/net/ipv4/tcp_nv.c
index 0b5a05b..ddbce73 100644
--- a/net/ipv4/tcp_nv.c
+++ b/net/ipv4/tcp_nv.c
@@ -146,7 +146,7 @@ static void tcpnv_init(struct sock *sk)
 * within a datacenter, where we have reasonable estimates of
 * RTTs
 */
-   base_rtt = tcp_call_bpf(sk, BPF_SOCK_OPS_BASE_RTT);
+   base_rtt = tcp_call_bpf(sk, BPF_SOCK_OPS_BASE_RTT, 0, NULL);
if (base_rtt > 

[PATCH bpf-next v10 12/12] bpf: add selftest for tcpbpf

2018-01-25 Thread Lawrence Brakmo
Added a selftest for tcpbpf (sock_ops) that checks that the appropriate
callbacks occurred, that it can access tcp_sock fields, and that their
values are correct.

Run with command: ./test_tcpbpf_user
Adding the flag "-d" will show why it did not pass.

Signed-off-by: Lawrence Brakmo 
Acked-by: Alexei Starovoitov 
---
 tools/include/uapi/linux/bpf.h |  86 -
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/bpf_helpers.h  |   2 +
 tools/testing/selftests/bpf/tcp_client.py  |  51 ++
 tools/testing/selftests/bpf/tcp_server.py  |  83 
 tools/testing/selftests/bpf/test_tcpbpf.h  |  16 
 tools/testing/selftests/bpf/test_tcpbpf_kern.c | 118 +++
 tools/testing/selftests/bpf/test_tcpbpf_user.c | 126 +
 8 files changed, 480 insertions(+), 6 deletions(-)
 create mode 100755 tools/testing/selftests/bpf/tcp_client.py
 create mode 100755 tools/testing/selftests/bpf/tcp_server.py
 create mode 100644 tools/testing/selftests/bpf/test_tcpbpf.h
 create mode 100644 tools/testing/selftests/bpf/test_tcpbpf_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_tcpbpf_user.c

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index af1f49a..db6bdc3 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -17,7 +17,7 @@
 #define BPF_ALU64  0x07/* alu mode in double word width */
 
 /* ld/ldx fields */
-#define BPF_DW 0x18/* double word */
+#define BPF_DW 0x18/* double word (64-bit) */
 #define BPF_XADD   0xc0/* exclusive add */
 
 /* alu/jmp fields */
@@ -642,6 +642,14 @@ union bpf_attr {
  * @optlen: length of optval in bytes
  * Return: 0 or negative error
  *
+ * int bpf_sock_ops_cb_flags_set(bpf_sock_ops, flags)
+ * Set callback flags for sock_ops
+ * @bpf_sock_ops: pointer to bpf_sock_ops_kern struct
+ * @flags: flags value
+ * Return: 0 for no error
+ * -EINVAL if there is no full tcp socket
+ * bits in flags that are not supported by current kernel
+ *
  * int bpf_skb_adjust_room(skb, len_diff, mode, flags)
  * Grow or shrink room in sk_buff.
  * @skb: pointer to skb
@@ -748,7 +756,8 @@ union bpf_attr {
FN(perf_event_read_value),  \
FN(perf_prog_read_value),   \
FN(getsockopt), \
-   FN(override_return),
+   FN(override_return),\
+   FN(sock_ops_cb_flags_set),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -952,8 +961,9 @@ struct bpf_map_info {
 struct bpf_sock_ops {
__u32 op;
union {
-   __u32 reply;
-   __u32 replylong[4];
+   __u32 args[4];  /* Optionally passed to bpf program */
+   __u32 reply;/* Returned by bpf program  */
+   __u32 replylong[4]; /* Optionally returned by bpf prog  */
};
__u32 family;
__u32 remote_ip4;   /* Stored in network byte order */
@@ -968,8 +978,39 @@ struct bpf_sock_ops {
 */
__u32 snd_cwnd;
__u32 srtt_us;  /* Averaged RTT << 3 in usecs */
+   __u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
+   __u32 state;
+   __u32 rtt_min;
+   __u32 snd_ssthresh;
+   __u32 rcv_nxt;
+   __u32 snd_nxt;
+   __u32 snd_una;
+   __u32 mss_cache;
+   __u32 ecn_flags;
+   __u32 rate_delivered;
+   __u32 rate_interval_us;
+   __u32 packets_out;
+   __u32 retrans_out;
+   __u32 total_retrans;
+   __u32 segs_in;
+   __u32 data_segs_in;
+   __u32 segs_out;
+   __u32 data_segs_out;
+   __u32 lost_out;
+   __u32 sacked_out;
+   __u32 sk_txhash;
+   __u64 bytes_received;
+   __u64 bytes_acked;
 };
 
+/* Definitions for bpf_sock_ops_cb_flags */
+#define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
+#define BPF_SOCK_OPS_RETRANS_CB_FLAG   (1<<1)
+#define BPF_SOCK_OPS_STATE_CB_FLAG (1<<2)
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x7/* Mask of all currently
+* supported cb flags
+*/
+
 /* List of known BPF sock_ops operators.
  * New entries can only be added at the end
  */
@@ -1003,6 +1044,43 @@ enum {
 * a congestion threshold. RTTs above
 * this indicate congestion
 */
+   BPF_SOCK_OPS_RTO_CB,/* Called when an RTO has triggered.
+* Arg1: value of icsk_retransmits
+* Arg2: value of icsk_rto
+ 

[PATCH bpf-next v10 01/12] bpf: Only reply field should be writeable

2018-01-25 Thread Lawrence Brakmo
Currently, a sock_ops BPF program can write the op field and all the
reply fields (reply and replylong). This is a bug. The op field should
not have been writeable and there is currently no way to use replylong
field for indices >= 1. This patch enforces that only the reply field
(which equals replylong[0]) is writeable.

Fixes: 40304b2a1567 ("bpf: BPF support for sock_ops")
Signed-off-by: Lawrence Brakmo 
Acked-by: Yuchung Cheng 
---
 net/core/filter.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 18da42a..bf9bb75 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3845,8 +3845,7 @@ static bool sock_ops_is_valid_access(int off, int size,
 {
if (type == BPF_WRITE) {
switch (off) {
-   case offsetof(struct bpf_sock_ops, op) ...
-offsetof(struct bpf_sock_ops, replylong[3]):
+   case offsetof(struct bpf_sock_ops, reply):
break;
default:
return false;
-- 
2.9.5



[PATCH bpf-next v10 10/12] bpf: Add BPF_SOCK_OPS_RETRANS_CB

2018-01-25 Thread Lawrence Brakmo
Adds support for calling sock_ops BPF program when there is a
retransmission. Three arguments are used; one for the sequence number,
another for the number of segments retransmitted, and the last one for
the return value of tcp_transmit_skb (0 => success).
Does not include syn-ack retransmissions.

New op: BPF_SOCK_OPS_RETRANS_CB.
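
For example (a sketch, assuming the callback was enabled via
BPF_SOCK_OPS_RETRANS_CB_FLAG; the accounting in reply is illustrative):

  SEC("sockops")
  int retrans_cb(struct bpf_sock_ops *skops)
  {
	/* args[0] = seq of 1st byte, args[1] = # segments,
	 * args[2] = return value of tcp_transmit_skb (0 => success)
	 */
	if (skops->op == BPF_SOCK_OPS_RETRANS_CB && skops->args[2] == 0)
		skops->reply = skops->args[1];	/* count retransmitted segs */
	return 1;
  }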

Signed-off-by: Lawrence Brakmo 
---
 include/uapi/linux/bpf.h | 9 -
 net/ipv4/tcp_output.c| 4 
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 46520ea..31c93a0b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1005,7 +1005,8 @@ struct bpf_sock_ops {
 
 /* Definitions for bpf_sock_ops_cb_flags */
 #define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
-#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x1/* Mask of all currently
+#define BPF_SOCK_OPS_RETRANS_CB_FLAG   (1<<1)
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x3/* Mask of all currently
 * supported cb flags
 */
 
@@ -1047,6 +1048,12 @@ enum {
 * Arg2: value of icsk_rto
 * Arg3: whether RTO has expired
 */
+   BPF_SOCK_OPS_RETRANS_CB,/* Called when skb is retransmitted.
+* Arg1: sequence number of 1st byte
+* Arg2: # segments
+* Arg3: return value of
+*   tcp_transmit_skb (0 => success)
+*/
 };
 
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d12f7f7..e9f985e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2905,6 +2905,10 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff 
*skb, int segs)
err = tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC);
}
 
+   if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RETRANS_CB_FLAG))
+   tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RETRANS_CB,
+ TCP_SKB_CB(skb)->seq, segs, err);
+
if (likely(!err)) {
TCP_SKB_CB(skb)->sacked |= TCPCB_EVER_RETRANS;
trace_tcp_retransmit_skb(sk, skb);
-- 
2.9.5



[PATCH bpf-next v10 06/12] bpf: Adds field bpf_sock_ops_cb_flags to tcp_sock

2018-01-25 Thread Lawrence Brakmo
Adds field bpf_sock_ops_cb_flags to tcp_sock and bpf_sock_ops. Its primary
use is to determine if there should be calls to sock_ops bpf program at
various points in the TCP code. The field is initialized to zero,
disabling the calls. A sock_ops BPF program can set it, per connection and
as necessary, when the connection is established.

It also adds support for reading and writing the field within a
sock_ops BPF program. Reading is done by accessing the field directly.
However, writing is done through the helper function
bpf_sock_ops_cb_flags_set, in order to return an error if a BPF program
is trying to set a callback that is not supported in the current kernel
(i.e. running an older kernel). The helper function returns 0 if it was
able to set all of the bits set in the argument, a positive number
containing the bits that could not be set, or -EINVAL if the socket is
not a full TCP socket.

Examples of where one could call the bpf program:

1) When RTO fires
2) When a packet is retransmitted
3) When the connection terminates
4) When a packet is sent
5) When a packet is received
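
A program would typically request the callbacks it is interested in when
the connection establishes, for instance (a sketch; the individual
*_CB_FLAG bits are introduced by later patches in this series):

  SEC("sockops")
  int enable_cbs(struct bpf_sock_ops *skops)
  {
	switch (skops->op) {
	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
		/* nonzero return = flag bits this kernel does not support */
		bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_ALL_CB_FLAGS);
		break;
	}
	return 1;
  }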

Signed-off-by: Lawrence Brakmo 
Acked-by: Alexei Starovoitov 
---
 include/linux/tcp.h  | 11 +++
 include/uapi/linux/bpf.h | 17 -
 net/core/filter.c| 34 ++
 3 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 4f93f095..8f4c549 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -335,6 +335,17 @@ struct tcp_sock {
 
int linger2;
 
+
+/* Sock_ops bpf program related variables */
+#ifdef CONFIG_BPF
+   u8  bpf_sock_ops_cb_flags;  /* Control calling BPF programs
+* values defined in uapi/linux/tcp.h
+*/
+#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
+#else
+#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
+#endif
+
 /* Receiver side RTT estimation */
struct {
u32 rtt_us;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 8d5874c..aa12840 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -642,6 +642,14 @@ union bpf_attr {
  * @optlen: length of optval in bytes
  * Return: 0 or negative error
  *
+ * int bpf_sock_ops_cb_flags_set(bpf_sock_ops, flags)
+ * Set callback flags for sock_ops
+ * @bpf_sock_ops: pointer to bpf_sock_ops_kern struct
+ * @flags: flags value
+ * Return: 0 for no error
+ * -EINVAL if there is no full tcp socket
+ * bits in flags that are not supported by current kernel
+ *
  * int bpf_skb_adjust_room(skb, len_diff, mode, flags)
  * Grow or shrink room in sk_buff.
  * @skb: pointer to skb
@@ -748,7 +756,8 @@ union bpf_attr {
FN(perf_event_read_value),  \
FN(perf_prog_read_value),   \
FN(getsockopt), \
-   FN(override_return),
+   FN(override_return),\
+   FN(sock_ops_cb_flags_set),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -969,8 +978,14 @@ struct bpf_sock_ops {
 */
__u32 snd_cwnd;
__u32 srtt_us;  /* Averaged RTT << 3 in usecs */
+   __u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
 };
 
+/* Definitions for bpf_sock_ops_cb_flags */
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0  /* Mask of all currently
+* supported cb flags
+*/
+
 /* List of known BPF sock_ops operators.
  * New entries can only be added at the end
  */
diff --git a/net/core/filter.c b/net/core/filter.c
index c356ec0..6936d19 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3328,6 +3328,33 @@ static const struct bpf_func_proto bpf_getsockopt_proto 
= {
.arg5_type  = ARG_CONST_SIZE,
 };
 
+BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock,
+  int, argval)
+{
+   struct sock *sk = bpf_sock->sk;
+   int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
+
+   if (!sk_fullsock(sk))
+   return -EINVAL;
+
+#ifdef CONFIG_INET
+   if (val)
+   tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
+
+   return argval & (~BPF_SOCK_OPS_ALL_CB_FLAGS);
+#else
+   return -EINVAL;
+#endif
+}
+
+static const struct bpf_func_proto bpf_sock_ops_cb_flags_set_proto = {
+   .func   = bpf_sock_ops_cb_flags_set,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+   .arg2_type  = ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
@@ -3510,6 +3537,8 @@ static const struct 

[PATCH bpf-next v10 08/12] bpf: Add support for reading sk_state and more

2018-01-25 Thread Lawrence Brakmo
Add support for reading many more tcp_sock fields

  state,same as sk->sk_state
  rtt_min   same as sk->rtt_min.s[0].v (current rtt_min)
  snd_ssthresh
  rcv_nxt
  snd_nxt
  snd_una
  mss_cache
  ecn_flags
  rate_delivered
  rate_interval_us
  packets_out
  retrans_out
  total_retrans
  segs_in
  data_segs_in
  segs_out
  data_segs_out
  lost_out
  sacked_out
  sk_txhash
  bytes_received (__u64)
  bytes_acked(__u64)
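
For instance, a program could sample a few of these when a flow ends (a
sketch; the logic is illustrative and assumes BPF_SOCK_OPS_STATE_CB_FLAG
was set):

  SEC("sockops")
  int read_stats(struct bpf_sock_ops *skops)
  {
	if (skops->op == BPF_SOCK_OPS_STATE_CB &&
	    skops->args[1] == BPF_TCP_CLOSE) {
		__u64 acked   = skops->bytes_acked;
		__u32 retrans = skops->total_retrans;

		if (acked && retrans)
			skops->reply = 1;	/* e.g. flag lossy flows */
	}
	return 1;
  }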

Signed-off-by: Lawrence Brakmo 
---
 include/uapi/linux/bpf.h |  22 
 net/core/filter.c| 143 +++
 2 files changed, 154 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c8cecf9..46520ea 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -979,6 +979,28 @@ struct bpf_sock_ops {
__u32 snd_cwnd;
__u32 srtt_us;  /* Averaged RTT << 3 in usecs */
__u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
+   __u32 state;
+   __u32 rtt_min;
+   __u32 snd_ssthresh;
+   __u32 rcv_nxt;
+   __u32 snd_nxt;
+   __u32 snd_una;
+   __u32 mss_cache;
+   __u32 ecn_flags;
+   __u32 rate_delivered;
+   __u32 rate_interval_us;
+   __u32 packets_out;
+   __u32 retrans_out;
+   __u32 total_retrans;
+   __u32 segs_in;
+   __u32 data_segs_in;
+   __u32 segs_out;
+   __u32 data_segs_out;
+   __u32 lost_out;
+   __u32 sacked_out;
+   __u32 sk_txhash;
+   __u64 bytes_received;
+   __u64 bytes_acked;
 };
 
 /* Definitions for bpf_sock_ops_cb_flags */
diff --git a/net/core/filter.c b/net/core/filter.c
index 6936d19..a858ebc 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3855,33 +3855,43 @@ void bpf_warn_invalid_xdp_action(u32 act)
 }
 EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
 
-static bool __is_valid_sock_ops_access(int off, int size)
+static bool sock_ops_is_valid_access(int off, int size,
+enum bpf_access_type type,
+struct bpf_insn_access_aux *info)
 {
+   const int size_default = sizeof(__u32);
+
if (off < 0 || off >= sizeof(struct bpf_sock_ops))
return false;
+
/* The verifier guarantees that size > 0. */
if (off % size != 0)
return false;
-   if (size != sizeof(__u32))
-   return false;
-
-   return true;
-}
 
-static bool sock_ops_is_valid_access(int off, int size,
-enum bpf_access_type type,
-struct bpf_insn_access_aux *info)
-{
if (type == BPF_WRITE) {
switch (off) {
case offsetof(struct bpf_sock_ops, reply):
+   if (size != size_default)
+   return false;
break;
default:
return false;
}
+   } else {
+   switch (off) {
+   case bpf_ctx_range_till(struct bpf_sock_ops, bytes_received,
+   bytes_acked):
+   if (size != sizeof(__u64))
+   return false;
+   break;
+   default:
+   if (size != size_default)
+   return false;
+   break;
+   }
}
 
-   return __is_valid_sock_ops_access(off, size);
+   return true;
 }
 
 static int sk_skb_prologue(struct bpf_insn *insn_buf, bool direct_write,
@@ -4498,6 +4508,32 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
   is_fullsock));
break;
 
+   case offsetof(struct bpf_sock_ops, state):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_state) != 1);
+
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+   struct bpf_sock_ops_kern, sk),
+ si->dst_reg, si->src_reg,
+ offsetof(struct bpf_sock_ops_kern, sk));
+   *insn++ = BPF_LDX_MEM(BPF_B, si->dst_reg, si->dst_reg,
+ offsetof(struct sock_common, skc_state));
+   break;
+
+   case offsetof(struct bpf_sock_ops, rtt_min):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, rtt_min) !=
+sizeof(struct minmax));
+   BUILD_BUG_ON(sizeof(struct minmax) <
+sizeof(struct minmax_sample));
+
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+   struct bpf_sock_ops_kern, sk),
+ si->dst_reg, si->src_reg,
+ offsetof(struct bpf_sock_ops_kern, 

[PATCH bpf-next v10 07/12] bpf: Add sock_ops RTO callback

2018-01-25 Thread Lawrence Brakmo
Adds an optional call to sock_ops BPF program based on whether the
BPF_SOCK_OPS_RTO_CB_FLAG is set in bpf_sock_ops_cb_flags.
The BPF program is passed three arguments: icsk_retransmits, icsk_rto, and
whether the RTO has expired.
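
For example (a sketch; assumes the flag was set earlier on this connection
and uses the helper introduced in patch 06):

  SEC("sockops")
  int rto_cb(struct bpf_sock_ops *skops)
  {
	/* args[0] = icsk_retransmits, args[1] = icsk_rto, args[2] = expired */
	if (skops->op == BPF_SOCK_OPS_RTO_CB && skops->args[2])
		bpf_sock_ops_cb_flags_set(skops, 0);	/* stop RTO callbacks */
	return 1;
  }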

Signed-off-by: Lawrence Brakmo 
---
 include/uapi/linux/bpf.h | 8 +++-
 net/ipv4/tcp_timer.c | 7 +++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index aa12840..c8cecf9 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -982,7 +982,8 @@ struct bpf_sock_ops {
 };
 
 /* Definitions for bpf_sock_ops_cb_flags */
-#define BPF_SOCK_OPS_ALL_CB_FLAGS   0  /* Mask of all currently
+#define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x1/* Mask of all currently
 * supported cb flags
 */
 
@@ -1019,6 +1020,11 @@ enum {
 * a congestion threshold. RTTs above
 * this indicate congestion
 */
+   BPF_SOCK_OPS_RTO_CB,/* Called when an RTO has triggered.
+* Arg1: value of icsk_retransmits
+* Arg2: value of icsk_rto
+* Arg3: whether RTO has expired
+*/
 };
 
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 6db3124..257abdd 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -213,11 +213,18 @@ static int tcp_write_timeout(struct sock *sk)
icsk->icsk_user_timeout);
}
tcp_fastopen_active_detect_blackhole(sk, expired);
+
+   if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RTO_CB_FLAG))
+   tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RTO_CB,
+ icsk->icsk_retransmits,
+ icsk->icsk_rto, (int)expired);
+
if (expired) {
/* Has it gone just too far? */
tcp_write_err(sk);
return 1;
}
+
return 0;
 }
 
-- 
2.9.5



Re: [PATCH net-next 06/12] Revert "net: ptr_ring: otherwise safe empty checks can overrun array bounds"

2018-01-25 Thread John Fastabend
On 01/25/2018 03:36 PM, Michael S. Tsirkin wrote:
> This reverts commit bcecb4bbf88aa03171c30652bca761cf27755a6b.
> 
> If we try to allocate an extra entry as the above commit did, and when
> the requested size is UINT_MAX, addition overflows causing zero size to
> be passed to kmalloc().
> 
> kmalloc then returns ZERO_SIZE_PTR with a subsequent crash.
> 
> Reported-by: syzbot+87678bcf753b44c39...@syzkaller.appspotmail.com
> Cc: John Fastabend 
> Signed-off-by: Michael S. Tsirkin 
> ---

Dang, I missed this case. Thanks.

Acked-by: John Fastabend 

>  include/linux/ptr_ring.h | 7 +--
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> index f175846..3a19ebd 100644
> --- a/include/linux/ptr_ring.h
> +++ b/include/linux/ptr_ring.h
> @@ -466,12 +466,7 @@ static inline int ptr_ring_consume_batched_bh(struct 
> ptr_ring *r,
>  
>  static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t 
> gfp)
>  {
> - /* Allocate an extra dummy element at end of ring to avoid consumer head
> -  * or produce head access past the end of the array. Possible when
> -  * producer/consumer operations and __ptr_ring_peek operations run in
> -  * parallel.
> -  */
> - return kcalloc(size + 1, sizeof(void *), gfp);
> + return kcalloc(size, sizeof(void *), gfp);
>  }
>  
>  static inline void __ptr_ring_set_size(struct ptr_ring *r, int size)
> 



Re: [PATCH net-next 01/12] ptr_ring: keep consumer_head valid at all times

2018-01-25 Thread John Fastabend
On 01/25/2018 03:36 PM, Michael S. Tsirkin wrote:
> The comment near __ptr_ring_peek says:
> 
>  * If ring is never resized, and if the pointer is merely
>  * tested, there's no need to take the lock - see e.g.  __ptr_ring_empty.
> 
> but this was in fact never possible since consumer_head would sometimes
> point outside the ring. Refactor the code so that it's always
> pointing within a ring.
> 
> Fixes: c5ad119fb6c09 ("net: sched: pfifo_fast use skb_array")
> Signed-off-by: Michael S. Tsirkin 
> ---
>  include/linux/ptr_ring.h | 25 -
>  1 file changed, 16 insertions(+), 9 deletions(-)
> 

Thanks for fixing this up.

Acked-by: John Fastabend 


[PATCH net-next v1] bpf: Use the IS_FD_ARRAY() macro in map_update_elem()

2018-01-25 Thread Mickaël Salaün
Make the code more readable.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
---
 kernel/bpf/syscall.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5bdb0cc84ad2..e24aa3241387 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -709,10 +709,7 @@ static int map_update_elem(union bpf_attr *attr)
err = bpf_percpu_hash_update(map, key, value, attr->flags);
} else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
err = bpf_percpu_array_update(map, key, value, attr->flags);
-   } else if (map->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY ||
-  map->map_type == BPF_MAP_TYPE_PROG_ARRAY ||
-  map->map_type == BPF_MAP_TYPE_CGROUP_ARRAY ||
-  map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS) {
+   } else if (IS_FD_ARRAY(map)) {
rcu_read_lock();
err = bpf_fd_array_map_update_elem(map, f.file, key, value,
   attr->flags);
-- 
2.15.1



[PATCH net-next 03/12] ptr_ring: READ/WRITE_ONCE for __ptr_ring_empty

2018-01-25 Thread Michael S. Tsirkin
Lockless __ptr_ring_empty requires that consumer head is read and
written at once, atomically. Annotate accordingly to make sure compiler
does it correctly.  Switch locked callers to __ptr_ring_peek which does
not support the lockless operation.
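
The pairing this establishes, assembled from the hunks below (sketch):

  /* consumer side, under consumer_lock */
  WRITE_ONCE(r->consumer_head, consumer_head);

  /* lockless observer, no lock taken */
  static inline bool __ptr_ring_empty(struct ptr_ring *r)
  {
	if (likely(r->size))
		return !r->queue[READ_ONCE(r->consumer_head)];
	return true;
  }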

Signed-off-by: Michael S. Tsirkin 
---
 include/linux/ptr_ring.h | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 8594c7b..9a72d8f 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -196,7 +196,9 @@ static inline void *__ptr_ring_peek(struct ptr_ring *r)
  */
 static inline bool __ptr_ring_empty(struct ptr_ring *r)
 {
-   return !__ptr_ring_peek(r);
+   if (likely(r->size))
+   return !r->queue[READ_ONCE(r->consumer_head)];
+   return true;
 }
 
 static inline bool ptr_ring_empty(struct ptr_ring *r)
@@ -285,7 +287,8 @@ static inline void __ptr_ring_discard_one(struct ptr_ring 
*r)
consumer_head = 0;
r->consumer_tail = 0;
}
-   r->consumer_head = consumer_head;
+   /* matching READ_ONCE in __ptr_ring_empty for lockless tests */
+   WRITE_ONCE(r->consumer_head, consumer_head);
 }
 
 static inline void *__ptr_ring_consume(struct ptr_ring *r)
@@ -541,7 +544,9 @@ static inline void ptr_ring_unconsume(struct ptr_ring *r, 
void **batch, int n,
goto done;
}
r->queue[head] = batch[--n];
-   r->consumer_tail = r->consumer_head = head;
+   r->consumer_tail = head;
+   /* matching READ_ONCE in __ptr_ring_empty for lockless tests */
+   WRITE_ONCE(r->consumer_head, head);
}
 
 done:
-- 
MST



[PATCH net-next 00/12] ptr_ring fixes

2018-01-25 Thread Michael S. Tsirkin
This fixes a bunch of issues around ptr_ring use in net core.
One of these: "tap: fix use-after-free" is also needed on net,
but can't be backported cleanly.

I will post a net patch separately.

Lightly tested - Jason, could you pls confirm this
addresses the security issue you saw with ptr_ring?
Testing reports would be appreciated too.

Michael S. Tsirkin (12):
  ptr_ring: keep consumer_head valid at all times
  ptr_ring: clean up documentation
  ptr_ring: READ/WRITE_ONCE for __ptr_ring_empty
  tap: fix use-after-free
  ptr_ring: disallow lockless __ptr_ring_full
  Revert "net: ptr_ring: otherwise safe empty checks can overrun array
bounds"
  skb_array: use __ptr_ring_empty
  ptr_ring: prevent queue load/store tearing
  tools/virtio: switch to __ptr_ring_empty
  tools/virtio: more stubs to fix tools build
  tools/virtio: copy READ/WRITE_ONCE
  tools/virtio: fix smp_mb on x86

 drivers/net/tap.c|  3 --
 include/linux/ptr_ring.h | 86 ++--
 include/linux/skb_array.h|  2 +-
 tools/virtio/linux/kernel.h  |  2 +-
 tools/virtio/linux/thread_info.h |  1 +
 tools/virtio/ringtest/main.h | 59 ++-
 tools/virtio/ringtest/ptr_ring.c |  2 +-
 7 files changed, 110 insertions(+), 45 deletions(-)
 create mode 100644 tools/virtio/linux/thread_info.h

-- 
MST



[PATCH net-next 01/12] ptr_ring: keep consumer_head valid at all times

2018-01-25 Thread Michael S. Tsirkin
The comment near __ptr_ring_peek says:

 * If ring is never resized, and if the pointer is merely
 * tested, there's no need to take the lock - see e.g.  __ptr_ring_empty.

but this was in fact never possible since consumer_head would sometimes
point outside the ring. Refactor the code so that it's always
pointing within a ring.
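
To see why this matters, note what a lockless caller ultimately does
(sketch):

  /* __ptr_ring_empty without the consumer lock boils down to */
  return !r->queue[r->consumer_head];

With the old code, consumer_head could transiently equal r->size between
the post-increment and the wrap back to 0, so the lockless read could index
one slot past the end of the queue array.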

Fixes: c5ad119fb6c09 ("net: sched: pfifo_fast use skb_array")
Signed-off-by: Michael S. Tsirkin 
---
 include/linux/ptr_ring.h | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 9ca1726..5ebcdd4 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -248,22 +248,28 @@ static inline void __ptr_ring_discard_one(struct ptr_ring 
*r)
/* Fundamentally, what we want to do is update consumer
 * index and zero out the entry so producer can reuse it.
 * Doing it naively at each consume would be as simple as:
-*   r->queue[r->consumer++] = NULL;
-*   if (unlikely(r->consumer >= r->size))
-*   r->consumer = 0;
+*   consumer = r->consumer;
+*   r->queue[consumer++] = NULL;
+*   if (unlikely(consumer >= r->size))
+*   consumer = 0;
+*   r->consumer = consumer;
 * but that is suboptimal when the ring is full as producer is writing
 * out new entries in the same cache line.  Defer these updates until a
 * batch of entries has been consumed.
 */
-   int head = r->consumer_head++;
+   /* Note: we must keep consumer_head valid at all times for __ptr_ring_empty
+* to work correctly.
+*/
+   int consumer_head = r->consumer_head;
+   int head = consumer_head++;
 
/* Once we have processed enough entries invalidate them in
 * the ring all at once so producer can reuse their space in the ring.
 * We also do this when we reach end of the ring - not mandatory
 * but helps keep the implementation simple.
 */
-   if (unlikely(r->consumer_head - r->consumer_tail >= r->batch ||
-r->consumer_head >= r->size)) {
+   if (unlikely(consumer_head - r->consumer_tail >= r->batch ||
+consumer_head >= r->size)) {
/* Zero out entries in the reverse order: this way we touch the
 * cache line that producer might currently be reading the last;
 * producer won't make progress and touch other cache lines
@@ -271,12 +277,13 @@ static inline void __ptr_ring_discard_one(struct ptr_ring 
*r)
 */
while (likely(head >= r->consumer_tail))
r->queue[head--] = NULL;
-   r->consumer_tail = r->consumer_head;
+   r->consumer_tail = consumer_head;
}
-   if (unlikely(r->consumer_head >= r->size)) {
-   r->consumer_head = 0;
+   if (unlikely(consumer_head >= r->size)) {
+   consumer_head = 0;
r->consumer_tail = 0;
}
+   r->consumer_head = consumer_head;
 }
 
 static inline void *__ptr_ring_consume(struct ptr_ring *r)
-- 
MST



[PATCH net-next 02/12] ptr_ring: clean up documentation

2018-01-25 Thread Michael S. Tsirkin
The only function safe to call without locks
is __ptr_ring_empty. Move documentation about
lockless use there to make sure people do not
try to use __ptr_ring_peek outside locks.

Signed-off-by: Michael S. Tsirkin 
---
 include/linux/ptr_ring.h | 34 ++
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 5ebcdd4..8594c7b 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -169,21 +169,6 @@ static inline int ptr_ring_produce_bh(struct ptr_ring *r, 
void *ptr)
return ret;
 }
 
-/* Note: callers invoking this in a loop must use a compiler barrier,
- * for example cpu_relax(). Callers must take consumer_lock
- * if they dereference the pointer - see e.g. PTR_RING_PEEK_CALL.
- * If ring is never resized, and if the pointer is merely
- * tested, there's no need to take the lock - see e.g.  __ptr_ring_empty.
- * However, if called outside the lock, and if some other CPU
- * consumes ring entries at the same time, the value returned
- * is not guaranteed to be correct.
- * In this case - to avoid incorrectly detecting the ring
- * as empty - the CPU consuming the ring entries is responsible
- * for either consuming all ring entries until the ring is empty,
- * or synchronizing with some other CPU and causing it to
- * execute __ptr_ring_peek and/or consume the ring enteries
- * after the synchronization point.
- */
 static inline void *__ptr_ring_peek(struct ptr_ring *r)
 {
if (likely(r->size))
@@ -191,7 +176,24 @@ static inline void *__ptr_ring_peek(struct ptr_ring *r)
return NULL;
 }
 
-/* See __ptr_ring_peek above for locking rules. */
+/*
+ * Test ring empty status without taking any locks.
+ *
+ * NB: This is only safe to call if ring is never resized.
+ *
+ * However, if some other CPU consumes ring entries at the same time, the value
+ * returned is not guaranteed to be correct.
+ *
+ * In this case - to avoid incorrectly detecting the ring
+ * as empty - the CPU consuming the ring entries is responsible
+ * for either consuming all ring entries until the ring is empty,
+ * or synchronizing with some other CPU and causing it to
+ * re-test __ptr_ring_empty and/or consume the ring entries
+ * after the synchronization point.
+ *
+ * Note: callers invoking this in a loop must use a compiler barrier,
+ * for example cpu_relax().
+ */
 static inline bool __ptr_ring_empty(struct ptr_ring *r)
 {
return !__ptr_ring_peek(r);
-- 
MST



[PATCH net-next 04/12] tap: fix use-after-free

2018-01-25 Thread Michael S. Tsirkin
Lockless access to __ptr_ring_full is only legal if the ring is
never resized, otherwise it might cause use-after-free errors.
Simply drop the lockless test; we'll drop the packet
a bit later when produce fails.

Fixes: 362899b8 ("macvtap: switch to use skb array")
Signed-off-by: Michael S. Tsirkin 
---
 drivers/net/tap.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 7c38659..7787269 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -330,9 +330,6 @@ rx_handler_result_t tap_handle_frame(struct sk_buff **pskb)
if (!q)
return RX_HANDLER_PASS;
 
-   if (__ptr_ring_full(&q->ring))
-   goto drop;
-
skb_push(skb, ETH_HLEN);
 
/* Apply the forward feature mask so that we perform segmentation
-- 
MST



[PATCH net-next 05/12] ptr_ring: disallow lockless __ptr_ring_full

2018-01-25 Thread Michael S. Tsirkin
Similar to bcecb4bbf88a ("net: ptr_ring: otherwise safe empty checks can
overrun array bounds"), a lockless use of __ptr_ring_full might
cause an out-of-bounds access.

We can fix this, but it's easier to just disallow lockless
__ptr_ring_full for now.
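
For reference, the overrun being avoided comes from the producer side
(quoting the existing __ptr_ring_produce logic):

  r->queue[r->producer++] = ptr;
  if (unlikely(r->producer >= r->size))
	r->producer = 0;

Between the increment and the wrap, r->producer can equal r->size, so a
lockless __ptr_ring_full indexing r->queue[r->producer] would read one
slot past the array.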

Signed-off-by: Michael S. Tsirkin 
---
 include/linux/ptr_ring.h | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 9a72d8f..f175846 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -45,9 +45,10 @@ struct ptr_ring {
 };
 
 /* Note: callers invoking this in a loop must use a compiler barrier,
- * for example cpu_relax().  If ring is ever resized, callers must hold
- * producer_lock - see e.g. ptr_ring_full.  Otherwise, if callers don't hold
- * producer_lock, the next call to __ptr_ring_produce may fail.
+ * for example cpu_relax().
+ *
+ * NB: this is unlike __ptr_ring_empty in that callers must hold producer_lock:
+ * see e.g. ptr_ring_full.
  */
 static inline bool __ptr_ring_full(struct ptr_ring *r)
 {
-- 
MST



[PATCH net-next 08/12] ptr_ring: prevent queue load/store tearing

2018-01-25 Thread Michael S. Tsirkin
In theory the compiler could tear queue loads or stores in two. It does not
seem to be happening in practice, but it seems easier to convert the
cases where this would be a problem to READ/WRITE_ONCE than to worry about
it.
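
As an illustration of the hazard (a sketch, not code from this patch): for
r->queue[i] = ptr, a compiler is in principle allowed to emit two
half-width stores,

  ((u32 *)&r->queue[i])[0] = (u32)(uintptr_t)ptr;
  ((u32 *)&r->queue[i])[1] = (u32)((u64)(uintptr_t)ptr >> 32);

and a concurrent lockless reader could then observe a half-written pointer.
READ_ONCE/WRITE_ONCE forbid such transformations on the annotated accesses.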

Signed-off-by: Michael S. Tsirkin 
---
 include/linux/ptr_ring.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 3a19ebd..1883d61 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -114,7 +114,7 @@ static inline int __ptr_ring_produce(struct ptr_ring *r, 
void *ptr)
/* Pairs with smp_read_barrier_depends in __ptr_ring_consume. */
smp_wmb();
 
-   r->queue[r->producer++] = ptr;
+   WRITE_ONCE(r->queue[r->producer++], ptr);
if (unlikely(r->producer >= r->size))
r->producer = 0;
return 0;
@@ -173,7 +173,7 @@ static inline int ptr_ring_produce_bh(struct ptr_ring *r, 
void *ptr)
 static inline void *__ptr_ring_peek(struct ptr_ring *r)
 {
if (likely(r->size))
-   return r->queue[r->consumer_head];
+   return READ_ONCE(r->queue[r->consumer_head]);
return NULL;
 }
 
-- 
MST



[PATCH net-next 07/12] skb_array: use __ptr_ring_empty

2018-01-25 Thread Michael S. Tsirkin
__skb_array_empty should use __ptr_ring_empty since that's the only
legal lockless function.

Signed-off-by: Michael S. Tsirkin 
---
 include/linux/skb_array.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
index c7addf3..a6b6e8b 100644
--- a/include/linux/skb_array.h
+++ b/include/linux/skb_array.h
@@ -69,7 +69,7 @@ static inline int skb_array_produce_any(struct skb_array *a, 
struct sk_buff *skb
  */
 static inline bool __skb_array_empty(struct skb_array *a)
 {
-   return !__ptr_ring_peek(&a->ring);
+   return __ptr_ring_empty(&a->ring);
 }
 
 static inline struct sk_buff *__skb_array_peek(struct skb_array *a)
-- 
MST



[PATCH net-next 10/12] tools/virtio: more stubs to fix tools build

2018-01-25 Thread Michael S. Tsirkin
Signed-off-by: Michael S. Tsirkin 
---
 tools/virtio/linux/kernel.h  | 2 +-
 tools/virtio/linux/thread_info.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)
 create mode 100644 tools/virtio/linux/thread_info.h

diff --git a/tools/virtio/linux/kernel.h b/tools/virtio/linux/kernel.h
index 395521a..fca8381 100644
--- a/tools/virtio/linux/kernel.h
+++ b/tools/virtio/linux/kernel.h
@@ -118,7 +118,7 @@ static inline void free_page(unsigned long addr)
 #define dev_err(dev, format, ...) fprintf (stderr, format, ## __VA_ARGS__)
 #define dev_warn(dev, format, ...) fprintf (stderr, format, ## __VA_ARGS__)
 
-#define WARN_ON_ONCE(cond) ((cond) && fprintf (stderr, "WARNING\n"))
+#define WARN_ON_ONCE(cond) ((cond) ? fprintf (stderr, "WARNING\n") : 0)
 
 #define min(x, y) ({   \
typeof(x) _min1 = (x);  \
diff --git a/tools/virtio/linux/thread_info.h b/tools/virtio/linux/thread_info.h
new file mode 100644
index 000..e0f610d
--- /dev/null
+++ b/tools/virtio/linux/thread_info.h
@@ -0,0 +1 @@
+#define check_copy_size(A, B, C) (1)
-- 
MST



[PATCH net-next 12/12] tools/virtio: fix smp_mb on x86

2018-01-25 Thread Michael S. Tsirkin
Offset 128 overlaps the last word of the x86-64 red zone (the 128 bytes
below %rsp that the ABI lets leaf functions use as scratch).
Use 132, which is always beyond it.

Signed-off-by: Michael S. Tsirkin 
---
 tools/virtio/ringtest/main.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/virtio/ringtest/main.h b/tools/virtio/ringtest/main.h
index 593a328..301d59b 100644
--- a/tools/virtio/ringtest/main.h
+++ b/tools/virtio/ringtest/main.h
@@ -111,7 +111,7 @@ static inline void busy_wait(void)
 } 
 
 #if defined(__x86_64__) || defined(__i386__)
-#define smp_mb() asm volatile("lock; addl $0,-128(%%rsp)" ::: "memory", "cc")
+#define smp_mb() asm volatile("lock; addl $0,-132(%%rsp)" ::: "memory", "cc")
 #else
 /*
  * Not using __ATOMIC_SEQ_CST since gcc docs say they are only synchronized
-- 
MST
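
Background, for readers who don't page in ABI details on demand: the x86-64
SysV ABI reserves a 128-byte "red zone" at [%rsp-128, %rsp-1] that leaf
functions may use for locals without moving %rsp. The dummy locked access at
-128(%rsp) covers bytes rsp-128..rsp-125, i.e. the deepest word of that
zone, so the barrier was poking at memory the compiler may consider live;
-132(%rsp) is the first 4-byte slot entirely below it. Roughly:

	higher addresses
	  [%rsp]                  current stack pointer
	  [%rsp-1 .. %rsp-128]    red zone: may hold leaf-function locals
	  [%rsp-129 ..]           unclaimed; safe target for the dummy
	                          "lock; addl $0, ..." access
	lower addresses

(i386 has no red zone, so the larger offset is harmless there.)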



[PATCH net-next 09/12] tools/virtio: switch to __ptr_ring_empty

2018-01-25 Thread Michael S. Tsirkin
We don't rely on lockless guarantees, but it
seems cleaner than inverting __ptr_ring_peek.

Signed-off-by: Michael S. Tsirkin 
---
 tools/virtio/ringtest/ptr_ring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/virtio/ringtest/ptr_ring.c b/tools/virtio/ringtest/ptr_ring.c
index e6e8130..477899c 100644
--- a/tools/virtio/ringtest/ptr_ring.c
+++ b/tools/virtio/ringtest/ptr_ring.c
@@ -187,7 +187,7 @@ bool enable_kick()
 
 bool avail_empty()
 {
-   return !__ptr_ring_peek(&array);
+   return __ptr_ring_empty(&array);
 }
 
 bool use_buf(unsigned *lenp, void **bufp)
-- 
MST



[PATCH net-next 06/12] Revert "net: ptr_ring: otherwise safe empty checks can overrun array bounds"

2018-01-25 Thread Michael S. Tsirkin
This reverts commit bcecb4bbf88aa03171c30652bca761cf27755a6b.

If we try to allocate an extra entry as the above commit did, and when
the requested size is UINT_MAX, addition overflows causing zero size to
be passed to kmalloc().

kmalloc then returns ZERO_SIZE_PTR with a subsequent crash.

Reported-by: syzbot+87678bcf753b44c39...@syzkaller.appspotmail.com
Cc: John Fastabend 
Signed-off-by: Michael S. Tsirkin 
---
 include/linux/ptr_ring.h | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index f175846..3a19ebd 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -466,12 +466,7 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
 
 static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp)
 {
-   /* Allocate an extra dummy element at end of ring to avoid consumer head
-* or produce head access past the end of the array. Possible when
-* producer/consumer operations and __ptr_ring_peek operations run in
-* parallel.
-*/
-   return kcalloc(size + 1, sizeof(void *), gfp);
+   return kcalloc(size, sizeof(void *), gfp);
 }
 
 static inline void __ptr_ring_set_size(struct ptr_ring *r, int size)
-- 
MST
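
Spelled out, the arithmetic the revert removes (ring sizes are unsigned int):

	/* With size == UINT_MAX, the extra-element allocation wraps: */
	void **queue = kcalloc(size + 1, sizeof(void *), gfp);
	/* size + 1 == 0, and kcalloc(0, ...) returns ZERO_SIZE_PTR, which
	 * is non-NULL, so the allocation appears to succeed and the first
	 * queue[...] dereference crashes. */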



[PATCH net-next 11/12] tools/virtio: copy READ/WRITE_ONCE

2018-01-25 Thread Michael S. Tsirkin
This is to make ptr_ring test build again.

Signed-off-by: Michael S. Tsirkin 
---
 tools/virtio/ringtest/main.h | 57 
 1 file changed, 57 insertions(+)

diff --git a/tools/virtio/ringtest/main.h b/tools/virtio/ringtest/main.h
index 5706e07..593a328 100644
--- a/tools/virtio/ringtest/main.h
+++ b/tools/virtio/ringtest/main.h
@@ -134,4 +134,61 @@ static inline void busy_wait(void)
 barrier(); \
 } while (0)
 
+#if defined(__i386__) || defined(__x86_64__) || defined(__s390x__)
+#define smp_wmb() barrier()
+#else
+#define smp_wmb() smp_release()
+#endif
+
+#ifdef __alpha__
+#define smp_read_barrier_depends() smp_acquire()
+#else
+#define smp_read_barrier_depends() do {} while(0)
+#endif
+
+static __always_inline
+void __read_once_size(const volatile void *p, void *res, int size)
+{
+	switch (size) {							\
+	case 1: *(unsigned char *)res = *(volatile unsigned char *)p; break;	\
+	case 2: *(unsigned short *)res = *(volatile unsigned short *)p; break;	\
+	case 4: *(unsigned int *)res = *(volatile unsigned int *)p; break;	\
+	case 8: *(unsigned long long *)res = *(volatile unsigned long long *)p; break;	\
+	default:							\
+		barrier();						\
+		__builtin_memcpy((void *)res, (const void *)p, size);	\
+		barrier();						\
+	}								\
+}
+
+static __always_inline void __write_once_size(volatile void *p, void *res, int size)
+{
+   switch (size) {
+   case 1: *(volatile unsigned char *)p = *(unsigned char *)res; break;
+   case 2: *(volatile unsigned short *)p = *(unsigned short *)res; break;
+   case 4: *(volatile unsigned int *)p = *(unsigned int *)res; break;
+   case 8: *(volatile unsigned long long *)p = *(unsigned long long *)res; break;
+   default:
+   barrier();
+   __builtin_memcpy((void *)p, (const void *)res, size);
+   barrier();
+   }
+}
+
+#define READ_ONCE(x) \
+({ \
+   union { typeof(x) __val; char __c[1]; } __u;\
+   __read_once_size(&(x), __u.__c, sizeof(x)); \
+   smp_read_barrier_depends(); /* Enforce dependency ordering from x */ \
+   __u.__val;  \
+})
+
+#define WRITE_ONCE(x, val) \
+({ \
+   union { typeof(x) __val; char __c[1]; } __u =   \
+   { .__val = (typeof(x)) (val) }; \
+   __write_once_size(&(x), __u.__c, sizeof(x));\
+   __u.__val;  \
+})
+
 #endif
-- 
MST
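
With these definitions in place, the ptr_ring.h accesses annotated earlier in
the series compile unchanged in the test harness, e.g.:

	/* producer side, as annotated in patch 05: */
	WRITE_ONCE(r->queue[r->producer++], ptr);
	/* consumer side, as annotated in patch 05: */
	return READ_ONCE(r->queue[r->consumer_head]);

Note that the READ_ONCE() above also folds in smp_read_barrier_depends(), so
Alpha's dependency-ordering requirement is met at every call site without
extra barriers.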



Re: [PATCH net-next 0/3 V1] rtnetlink: enable IFLA_IF_NETNSID for RTM_{DEL,SET}LINK

2018-01-25 Thread Nicolas Dichtel
Le 25/01/2018 à 23:30, Jiri Benc a écrit :
> On Thu, 25 Jan 2018 15:20:59 +0100, Nicolas Dichtel wrote:
>> Hmm, I don't agree. For me, it would be the correct answer. If user has a 
>> socket
>> in ns_a and targets a RTM_GETLINK in ns_b, the answer he gets should be like 
>> if
>> it was done in ns_b.
> 
> But that information would be useless for the caller. Why return a
> value that has no meaning for the caller and can not be used? More so
> when the kernel is aware of what the correct meaningful value is?
Why meaningful? The user knows that the answer is as if the request had been
done in the other netns. It enables having only one netlink socket instead of
one per netns, but the code using it stays the same.
I fear that with your approach, it will result in a lot of complexity in the
kernel.

> 
>> This is already the case with messages received with NETLINK_LISTEN_ALL_NSID,
>> there is no reason to do something different.
> 
> NETLINK_LISTEN_ALL_NSID is tough due to way it is implemented. But yes,
> it should translate the netnsids to be valid in the socket's netns.
> That's the only sane way for the listener.
A listener that uses this option should know the details of each netns it
listens to. Thus, it has no problem interpreting the answer.

What is really missing for me is a way to get an fd from an nsid. The user
should be able to call RTM_GETNSID with an fd and an nsid, and the kernel would
perform the needed operations so that the fd points to the corresponding netns.

Nicolas
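
For reference, the direction that does exist today is fd -> nsid: RTM_GETNSID
with a NETNSA_FD attribute returns the id, valid in the socket's netns, of the
netns behind an open fd. A rough sketch of building such a request (error
handling omitted; the reverse nsid -> fd lookup is exactly the part with no
API yet):

	#include <string.h>
	#include <linux/rtnetlink.h>
	#include <linux/net_namespace.h>

	struct nsid_req {
		struct nlmsghdr nlh;
		struct rtgenmsg g;
		char attrs[64];
	};

	static void build_getnsid(struct nsid_req *req, __u32 netns_fd)
	{
		struct rtattr *rta;

		memset(req, 0, sizeof(*req));
		req->nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct rtgenmsg));
		req->nlh.nlmsg_type = RTM_GETNSID;
		req->nlh.nlmsg_flags = NLM_F_REQUEST;

		/* attach the netns fd as a NETNSA_FD (u32) attribute */
		rta = (struct rtattr *)((char *)req + NLMSG_ALIGN(req->nlh.nlmsg_len));
		rta->rta_type = NETNSA_FD;
		rta->rta_len = RTA_LENGTH(sizeof(__u32));
		memcpy(RTA_DATA(rta), &netns_fd, sizeof(__u32));
		req->nlh.nlmsg_len = NLMSG_ALIGN(req->nlh.nlmsg_len) + rta->rta_len;
	}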


Re: [PATCH net] net: don't call update_pmtu unconditionally

2018-01-25 Thread Nicolas Dichtel
Le 25/01/2018 à 22:28, David Miller a écrit :
> From: Nicolas Dichtel 
> Date: Thu, 25 Jan 2018 19:03:03 +0100
> 
>> Some dst_ops (e.g. md_dst_ops) don't set this handler. It may result in:
>> "BUG: unable to handle kernel NULL pointer dereference at   (null)"
>>
>> Let's add a helper to check if update_pmtu is available before calling it.
>>
>> Fixes: 52a589d51f10 ("geneve: update skb dst pmtu on tx path")
>> Fixes: a93bf0ff4490 ("vxlan: update skb dst pmtu on tx path")
>> CC: Roman Kapl 
>> CC: Xin Long 
>> Signed-off-by: Nicolas Dichtel 
> 
> Applied with the fixup suggested by David Ahern and queued up for -stable,
Thank you for taking care of that.


Regards,
Nicolas
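
For readers joining here, the applied helper guards the indirect call along
these lines (in include/net/dst.h; see the applied patch for the exact form):

	static inline void skb_dst_update_pmtu(struct sk_buff *skb, u32 mtu)
	{
		struct dst_entry *dst = skb_dst(skb);

		if (dst && dst->ops->update_pmtu)
			dst->ops->update_pmtu(dst, NULL, skb, mtu);
	}

so geneve/vxlan call the helper instead of dereferencing
dst->ops->update_pmtu unconditionally.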


Re: [PATCH net-next v2 00/12] net: sched: propagate extack to cls offloads on destroy and only with skip_sw

2018-01-25 Thread Jakub Kicinski
On Thu, 25 Jan 2018 13:11:57 -0200, Marcelo Ricardo Leitner wrote:
> On Wed, Jan 24, 2018 at 12:54:12PM -0800, Jakub Kicinski wrote:
> > Hi!
> > 
> > This series addresses some of Jiri's comments and the fact that today
> > drivers may produce extack even if there is no skip_sw flag (meaning the
> > driver failure is not really a problem), and warning messages will
> > only confuse the users.  
> 
> It's a fair point, but I think this is not the best solution. How will
> the user then know why it failed to install in hw? Will have to
> install a new rule, just with skip_sw, and hope that it fails with the
> same reason?
>
> Maybe it's better to let tc/ovs/etc only exhibit this information
> under a certain log/debug level.

What you say does make sense in case of classifiers which are basically
HW offload vehicles.  But for cls_bpf, which people are actually using
heavily as a software solution, I don't want any warning to be produced
just because someone happened to run the command on a Netronome
card :(  Say someone swaps an old NIC for a NFP, and runs their old
cls_bpf scripts and suddenly there are warnings they don't care about
and have no way of silencing.

I do think skip_sw will fail for the same reason, unless someone adds
extacks for IO or memory allocation problems or other transient things.

Do I understand correctly that OvS TC does not set skip_sw?  We could
add a "verbose offload" flag to the API or filter the bad extacks at
the user space level.  Although, again, my preference would be not to
filter at the user space level, because user space can't differentiate
between a probably-doesn't-matter-but-HW-offload-failed warning or legit
something-is-not-right-in-the-software-but-command-succeeded warning :S
So if there is a major use for non-forced offload failure warnings I
would lean towards a new flag.
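
A sketch of the rule being proposed, from a driver's point of view (foo_*
names and the surrounding flags/extack plumbing are illustrative;
tc_skip_sw() and NL_SET_ERR_MSG() are the existing kernel interfaces):

	err = foo_install_flow(dev, flow);	/* hypothetical offload attempt */
	if (err && tc_skip_sw(flags)) {
		/* user demanded hardware: fail loudly with a reason */
		NL_SET_ERR_MSG(extack, "flow not offloadable");
		return err;
	}
	return 0;	/* skip_sw not set: stay silent, run in software */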


Re: [PATCH nf-next,RFC v4] netfilter: nf_flow_table: add hardware offload support

2018-01-25 Thread Jakub Kicinski
On Thu, 25 Jan 2018 12:28:58 +0100, Pablo Neira Ayuso wrote:
> On Wed, Jan 24, 2018 at 05:31:36PM -0800, Jakub Kicinski wrote:
> > On Thu, 25 Jan 2018 01:09:41 +0100, Pablo Neira Ayuso wrote:  
> > > This patch adds the infrastructure to offload flows to hardware, in case
> > > the nic/switch comes with built-in flow tables capabilities.
> > > 
> > > If the hardware comes with no hardware flow tables or they have
> > > limitations in terms of features, the existing infrastructure falls back
> > > to the software flow table implementation.
> > > 
> > > The software flow table garbage collector skips entries that resides in
> > > the hardware, so the hardware will be responsible for releasing this
> > > flow table entry too via flow_offload_dead().
> > > 
> > > Hardware configuration, either to add or to delete entries, is done from
> > > the hardware offload workqueue, to ensure this is done from user context
> > > given that we may sleep when grabbing the mdio mutex.
> > > 
> > > Signed-off-by: Pablo Neira Ayuso   
> > 
> > I wonder how do you deal with device/table removal?  I know regrettably
> > little about internals of nftables.  I assume the table cannot be
> > removed/module unloaded as long as there are flow entries? And on
> > device removal all flows pertaining to the removed ifindex will be
> > automatically flushed?  
> 
> Yes, this code is part of the generic software infrastructure, it's
> not specific to the hardware offload, it's already upstream, see
> net/netfilter/nft_flow_offload.c, see flow_offload_netdev_notifier.

Hm.  At a glance flow_offload_iterate_cleanup() will just mark the
flows as dead, not request their removal from the HW.  Doesn't that
mean that reloading the HW driver with flows installed will likely lead
to HW/FW resources being leaked (unless every driver duplicates a manual
flush on remove)?

> > Still there could be outstanding work items targeting the device, so
> > this WARN_ON:
> > 
> > +   indev = dev_get_by_index(net, ifindex);
> > +   if (WARN_ON(!indev))
> > +   return 0;
> > 
> > looks possible to trigger.  
> 
> It should not, that's why there's a WARN_ON there ;-).
> 
> See nf_flow_table_hw_module_exit(), there's a call to
> cancel_work_sync() to stop the hw offload workqueue, then flushes it.
> After this, there's a flow table cleanup. So noone should be calling
> that function by then.

Ah, I must be misunderstanding.  I meant when device is removed, not
the flow_table_hw module.  Does the nf_flow_table_hw_module_exit() run
when device is removed?  I was expecting that, for example something
like nft_flow_offload_iterate_cleanup() would queue up all the flow
remove calls and then call flush_work() (not cancel_work). 

> > On the general architecture - I think it's worth documenting somewhere
> > clearly that unlike TC offloads and most NDOs add/del of NFT flows are
> > not protected by rtnl_lock.  
> 
> Someone could probably look at getting rid of the rtnl_lock() all over
> the place for hardware offloads; holding up the entire rtnetlink
> subsystem just because some piece of hardware is taking time to
> configure things is not good. Not explicitly related to this, but have
> a look at Florian Westphal's talk on rtnl_lock during NetDev.

Oh, 100% agreed.  I was just pointing out that it could be useful to
mention the locking in kdoc or at least the commit message.

> > > v4: More work in progress
> > > - Decouple nf_flow_table_hw from nft_flow_offload via rcu hooks
> > > - Consolidate ->ndo invocations, now they happen from the hw worker.
> > > - Fix bug in list handling, use list_replace_init()
> > > - cleanup entries on nf_flow_table_hw module removal
> > > - add NFT_FLOWTABLE_F_HW flag to flowtables to explicit signal that user 
> > > wants
> > >   to offload entries to hardware.
> > > 
> > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > index ed0799a12bf2..be0c12acc3f0 100644
> > > --- a/include/linux/netdevice.h
> > > +++ b/include/linux/netdevice.h
> > > @@ -859,6 +859,13 @@ struct dev_ifalias {
> > >   char ifalias[];
> > >  };
> > >  
> > > +struct flow_offload;
> > > +
> > > +enum flow_offload_type {
> > > + FLOW_OFFLOAD_ADD= 0,
> > > + FLOW_OFFLOAD_DEL,
> > > +};
> > > +
> > >  /*
> > >   * This structure defines the management hooks for network devices.
> > >   * The following hooks can be defined; unless noted otherwise, they are
> > > @@ -1316,6 +1323,8 @@ struct net_device_ops {
> > >   int (*ndo_bridge_dellink)(struct net_device *dev,
> > > struct nlmsghdr *nlh,
> > > u16 flags);
> > > + int (*ndo_flow_offload)(enum flow_offload_type type,
> > > + struct flow_offload *flow); 
> > >  
> > 
> > nit: should there be kdoc for the new NDO?  ndo kdoc comment doesn't
> >  look like it would be recognized 
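
For driver writers following along, the proposed NDO would be wired up
roughly like this (foo_* helpers are hypothetical; only the hook's shape
comes from the RFC above):

	static int foo_flow_offload(enum flow_offload_type type,
				    struct flow_offload *flow)
	{
		switch (type) {
		case FLOW_OFFLOAD_ADD:
			return foo_hw_insert_flow(flow);	/* hypothetical */
		case FLOW_OFFLOAD_DEL:
			return foo_hw_remove_flow(flow);	/* hypothetical */
		}
		return -EOPNOTSUPP;
	}

	static const struct net_device_ops foo_netdev_ops = {
		...
		.ndo_flow_offload	= foo_flow_offload,
	};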

[net-next:master 1917/1931] drivers/net/ethernet/sfc/ptp.c:646:3: warning: this decimal constant is unsigned only in ISO C90

2018-01-25 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   fdd6d771c7de9d351c6dbdbab5bdc83805c06955
commit: 1280c0f8aafc4c09c59c576c8d50f367070b2619 [1917/1931] sfc: support second + quarter ns time format for receive datapath
config: i386-randconfig-s1-201803 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
git checkout 1280c0f8aafc4c09c59c576c8d50f367070b2619
# save the attached .config to linux build tree
make ARCH=i386 

All warnings (new ones prefixed by >>):

   drivers/net/ethernet/sfc/ptp.c: In function 'efx_ptp_get_attributes':
>> drivers/net/ethernet/sfc/ptp.c:646:3: warning: this decimal constant is unsigned only in ISO C90
      ptp->nic_time.minor_max = 4000000000;
      ^~~

vim +646 drivers/net/ethernet/sfc/ptp.c

   598	
   599	/* Get PTP attributes and set up time conversions */
   600	static int efx_ptp_get_attributes(struct efx_nic *efx)
   601	{
   602		MCDI_DECLARE_BUF(inbuf, MC_CMD_PTP_IN_GET_ATTRIBUTES_LEN);
   603		MCDI_DECLARE_BUF(outbuf, MC_CMD_PTP_OUT_GET_ATTRIBUTES_LEN);
   604		struct efx_ptp_data *ptp = efx->ptp_data;
   605		int rc;
   606		u32 fmt;
   607		size_t out_len;
   608	
   609		/* Get the PTP attributes. If the NIC doesn't support the operation we
   610		 * use the default format for compatibility with older NICs i.e.
   611		 * seconds and nanoseconds.
   612		 */
   613		MCDI_SET_DWORD(inbuf, PTP_IN_OP, MC_CMD_PTP_OP_GET_ATTRIBUTES);
   614		MCDI_SET_DWORD(inbuf, PTP_IN_PERIPH_ID, 0);
   615		rc = efx_mcdi_rpc_quiet(efx, MC_CMD_PTP, inbuf, sizeof(inbuf),
   616					outbuf, sizeof(outbuf), &out_len);
   617		if (rc == 0) {
   618			fmt = MCDI_DWORD(outbuf, PTP_OUT_GET_ATTRIBUTES_TIME_FORMAT);
   619		} else if (rc == -EINVAL) {
   620			fmt = MC_CMD_PTP_OUT_GET_ATTRIBUTES_SECONDS_NANOSECONDS;
   621		} else if (rc == -EPERM) {
   622			netif_info(efx, probe, efx->net_dev, "no PTP support\n");
   623			return rc;
   624		} else {
   625			efx_mcdi_display_error(efx, MC_CMD_PTP, sizeof(inbuf),
   626					       outbuf, sizeof(outbuf), rc);
   627			return rc;
   628		}
   629	
   630		switch (fmt) {
   631		case MC_CMD_PTP_OUT_GET_ATTRIBUTES_SECONDS_27FRACTION:
   632			ptp->ns_to_nic_time = efx_ptp_ns_to_s27;
   633			ptp->nic_to_kernel_time = efx_ptp_s27_to_ktime_correction;
   634			ptp->nic_time.minor_max = 1 << 27;
   635			ptp->nic_time.sync_event_minor_shift = 19;
   636			break;
   637		case MC_CMD_PTP_OUT_GET_ATTRIBUTES_SECONDS_NANOSECONDS:
   638			ptp->ns_to_nic_time = efx_ptp_ns_to_s_ns;
   639			ptp->nic_to_kernel_time = efx_ptp_s_ns_to_ktime_correction;
   640			ptp->nic_time.minor_max = 1000000000;
   641			ptp->nic_time.sync_event_minor_shift = 22;
   642			break;
   643		case MC_CMD_PTP_OUT_GET_ATTRIBUTES_SECONDS_QTR_NANOSECONDS:
   644			ptp->ns_to_nic_time = efx_ptp_ns_to_s_qns;
   645			ptp->nic_to_kernel_time = efx_ptp_s_qns_to_ktime_correction;
 > 646			ptp->nic_time.minor_max = 4000000000;
   647			ptp->nic_time.sync_event_minor_shift = 24;
   648			break;
   649		default:
   650			return -ERANGE;
   651		}
   652	
   653		/* Precalculate acceptable difference between the minor time in the
   654		 * packet prefix and the last MCDI time sync event. We expect the
   655		 * packet prefix timestamp to be after of sync event by up to one
   656		 * sync event interval (0.25s) but we allow it to exceed this by a
   657		 * fuzz factor of (0.1s)
   658		 */
   659		ptp->nic_time.sync_event_diff_min = ptp->nic_time.minor_max
   660			- (ptp->nic_time.minor_max / 10);
   661		ptp->nic_time.sync_event_diff_max = (ptp->nic_time.minor_max / 4)
   662			+ (ptp->nic_time.minor_max / 10);
   663	
   664		/* MC_CMD_PTP_OP_GET_ATTRIBUTES has been extended twice from an older
   665		 * operation MC_CMD_PTP_OP_GET_TIME_FORMAT. The function now may return
   666		 * a value to use for the minimum acceptable corrected synchronization
   667		 * window and may return further capabilities.
   668		 * If we have the extra information store it. For older firmware that
   669		 * does not implement the extended command use the default value.
   670		 */
   671		if (rc == 0 &&
   672		    out_len >= 
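
The warning is about the constant's type: on i386, 4000000000 does not fit in
a signed long, and C90 silently gives such an unsuffixed decimal constant the
type unsigned long, which newer gcc flags. Assuming nothing relies on the
constant being signed (it is used as a wrap-around maximum: for the
quarter-nanosecond format, diff_min works out to 3600000000 and diff_max to
1400000000), the conventional silencing fix is an explicit suffix:

	ptp->nic_time.minor_max = 4000000000UL;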

Re: [PATCH net-next 2/2] dev: advertise the new ifindex when the netns iface changes

2018-01-25 Thread Jiri Benc
On Thu, 25 Jan 2018 15:01:39 +0100, Nicolas Dichtel wrote:
> The goal is to let the user follow an interface that moves to another
> netns.
> 
> CC: Jiri Benc 
> CC: Christian Brauner 
> Signed-off-by: Nicolas Dichtel 

This is great, thanks, Nicolas!

Reviewed-by: Jiri Benc 


Re: [PATCH net-next 1/2] dev: always advertise the new nsid when the netns iface changes

2018-01-25 Thread Jiri Benc
On Thu, 25 Jan 2018 15:01:38 +0100, Nicolas Dichtel wrote:
> The user should be able to follow any interface that moves to another
> netns.  There is no reason to hide physical interfaces.
> 
> CC: Jiri Benc 
> CC: Christian Brauner 
> Signed-off-by: Nicolas Dichtel 

Reviewed-by: Jiri Benc 


Re: [PATCH net-next 0/3 V1] rtnetlink: enable IFLA_IF_NETNSID for RTM_{DEL,SET}LINK

2018-01-25 Thread Jiri Benc
On Thu, 25 Jan 2018 15:20:59 +0100, Nicolas Dichtel wrote:
> Hmm, I don't agree. For me, it would be the correct answer. If user has a 
> socket
> in ns_a and targets a RTM_GETLINK in ns_b, the answer he gets should be like 
> if
> it was done in ns_b.

But that information would be useless for the caller. Why return a
value that has no meaning for the caller and can not be used? More so
when the kernel is aware of what the correct meaningful value is?

> This is already the case with messages received with NETLINK_LISTEN_ALL_NSID,
> there is no reason to do something different.

NETLINK_LISTEN_ALL_NSID is tough due to way it is implemented. But yes,
it should translate the netnsids to be valid in the socket's netns.
That's the only sane way for the listener.

 Jiri


Re: [regresssion 4.15] Userspace compilation broken by uapi/linux/if_ether.h update

2018-01-25 Thread Hauke Mehrtens
On 01/25/2018 03:58 PM, Guillaume Nault wrote:
> Hi,
> 
> Commit 6926e041a892 ("uapi/if_ether.h: prevent redefinition of struct 
> ethhdr"),
> can break compilation of userspace programs (in my case, accel-ppp
> (https://accel-ppp.org)).
> 
> This happens for userspace programs that end up including
> linux/if_ether.h, netinet/in.h and linux/in.h in this order:
> 
> # cat test_ifether.c
#include <linux/if_ether.h>
#include <netinet/in.h>
#include <linux/in.h>
> 
> int main(void)
> {
>   return 0;
> }
> 
> # gcc test_ifether.c
> In file included from test_ifether.c:2:0:
> /usr/include/linux/in.h:29:3: error: redeclaration of enumerator ‘IPPROTO_IP’
>IPPROTO_IP = 0,  /* Dummy protocol for TCP  */
>^
> /usr/include/netinet/in.h:42:5: note: previous definition of ‘IPPROTO_IP’ was 
> here
>  IPPROTO_IP = 0,/* Dummy protocol for TCP.  */
>  ^~
> 
> 
> Now that linux/libc-compat.h is included in linux/if_ether.h, it is
> processed before netinet/in.h. Therefore, it sets the relevant
> __UAPI_DEF_* guards to 1 (as _NETINET_IN_H isn't yet defined).
> Then netinet/in.h is included, followed by linux/in.h. The later
> doesn't realise that what it defines has already been included by
> netinet/in.h because the __UAPI_DEF_* guards were set too early.
> 
> Of course the situation is a bit more complicated on real projects, as
> these files aren't included directly. For example, in accel-ppp, the
> PPPoE module (accel-ppp/accel-pppd/ctrl/pppoe/pppoe.c) uses
> #include/* includes linux/if_ether.h */
> #include   /* includes netinet/in.h */
> #include  /* (through pppoe.h), includes linux/in.h */
> 
> 
> I don't have a satisfying solution for now, but I'd really like it if
> we could avoid shipping a kernel which forces userspace to play with
> include files ordering to keep compiling.
> 
Hi,

This is about this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6926e041a8920c8ec27e4e155efa760aa01551fd

One option would be to move this into include/uapi/linux/if_ether.h and
remove the include for libc-compat.h:
#ifndef __UAPI_DEF_ETHHDR
#define __UAPI_DEF_ETHHDR   1
#endif

This will only work if netinet/if_ether.h is included before
linux/if_ether.h, but I think this is very likely.

I think we can do this because we do not need any special libc handling
here, as is done for other symbols: __UAPI_DEF_ETHHDR is currently only
needed by musl and not by glibc.

Hauke
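
Concretely, the suggestion amounts to something like this at the top of the
uapi header (a sketch of the proposal, not a tested patch; the struct itself
is unchanged from today's if_ether.h):

	/* Let libcs that provide their own struct ethhdr (currently musl)
	 * suppress the kernel definition by pre-defining the guard. */
	#ifndef __UAPI_DEF_ETHHDR
	#define __UAPI_DEF_ETHHDR	1
	#endif

	#if __UAPI_DEF_ETHHDR
	struct ethhdr {
		unsigned char	h_dest[ETH_ALEN];	/* destination eth addr	*/
		unsigned char	h_source[ETH_ALEN];	/* source ether addr	*/
		__be16		h_proto;		/* packet type ID field	*/
	} __attribute__((packed));
	#endif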

