Re: [PATCH net-next 6/6] netdevsim: Add simple FIB resource controller via devlink

2018-04-05 Thread Jiri Pirko
Thu, Apr 05, 2018 at 11:06:41PM CEST, d...@cumulusnetworks.com wrote:
>On 4/5/18 2:10 PM, David Ahern wrote:
>> 
>> The ASIC here is the kernel tables in a namespace. It does not make
>> sense to have 2 devlink instances for a single namespace.
>
>I put this example controller in netdevsim per a suggestion from Ido.
>The netdevsim seemed like a good idea given that modules intention --
>testing network facilities. Perhaps I should have done this as a
>completely standalone module ...
>
>The intention is to treat the kernel's tables *per namespace* as a
>standalone entity that can be managed very similar to ASIC resources.

So you say you want to treat a namespace as an ASIC? That sounds very
odd to me :/


>Given that I can add a resource controller module
>(drivers/net/kern_res_mgr.c?) that creates a 'struct device' per network
>namespace with a devlink instance. In this case the device would very
>much be tied to the namespace 1:1.

That sounds more reasonable and accurate, yet still odd. You would not
have any netdevices there? Any ports?


Re: [PATCH net-next 6/6] netdevsim: Add simple FIB resource controller via devlink

2018-04-05 Thread Jiri Pirko
Thu, Apr 05, 2018 at 10:10:29PM CEST, d...@cumulusnetworks.com wrote:
>On 4/5/18 11:27 AM, Jiri Pirko wrote:
>> Wed, Mar 28, 2018 at 03:22:00AM CEST, d...@cumulusnetworks.com wrote:
>>> Add devlink support to netdevsim and use it to implement a simple,
>>> profile based resource controller. Only one controller is needed
>>> per namespace, so the first netdevsim netdevice in a namespace
>>> registers with devlink. If that device is deleted, the resource
>>> settings are deleted.
>> 
>> I don't understand why you add 1:1 fixed relationship between
>> netnamespace and devlink instance. That is highly misleading and reader
>> might think that those 2 are somehow related. They are not. You can have
>> multiple devlink instances for many ports in a single namespace.
>
>The netdevsim devlink instance is an example of limiting the number of
>FIB entries and FIB rules for a namespace. It is currently limited to
>the init_net based on past discussion.
>
>It does not make sense to have multiple resource controllers for the
>same network namespace, hence the limit of only registering with devlink
>on the first device create.

Devlink instance represents an ASIC. 1:1. There is no relation with
network namespaces and should not be. I have no clue why you think so.

The model looks as I described it down below in the picture.


>
>> 
>> Could you please clarify?
>> 
>> Also, to see the relationship between individual netdevsim netdevices
>> and the parent devlink instance, we should use devlink_port
>> instances, like this: 
>> 
>>   devlink1  devlink2
>>||||
>>  dl_port1_1 dlport1_2   dlport2_1 dlport2_2
>>||||
>>  eth0  eth1 eth2 eth3
>> 
>> Note that "devlink instance" reprensents one ASIC.
>> The address of the devlink instance is the bus address of the ASIC.
>> Here, you use address of some/first netdevsim netdev instance.
>
>The ASIC here is the kernel tables in a namespace. It does not make
>sense to have 2 devlink instances for a single namespace.

Again. No clue why you build relationship with namespace.


>
>> 
>> The way it is implemented in netdevsim by this patch is wrong on
>> so many levels :(
>> 
>> Could you please fix this? I'm more than happy to help you with this,
>> please say so. Thanks!
>
>What is there to fix?
>
>Not creating a netdevsim device per netdevsim netdevice? That is
>completely unrelated to the devlink change.

To fit the model. Multiple devlink instances, each representing one
"virtual" ASIC, devlink_port instances, 1 for each netdevsim port.
Netdevsim port should simulate real devices. No real device should have
1:1 relation with network namespace. That is just simply wrong.


Re: [PATCH 05/32] fs: introduce new ->get_poll_head and ->poll_mask methods

2018-04-05 Thread Al Viro
On Fri, Mar 30, 2018 at 05:07:42PM +0200, Christoph Hellwig wrote:

> +  get_poll_head: Returns the struct wait_queue_head that poll, select,
> +  epoll or aio poll should wait on in case this instance only has single
> +  waitqueue.  Can return NULL to indicate polling is not supported,
> +  or a POLL* value using the POLL_TO_PTR helper in case a grave error
> +  occured and ->poll_mask shall not be called.

> + if (IS_ERR(head))
> + return PTR_TO_POLL(head);

> + * ->get_poll_head can return a __poll_t in the PTR_ERR, use these macros
> + * to return the value and recover it.  It takes care of the negation as
> + * well as off the annotations.
> + */
> +#define POLL_TO_PTR(mask)(ERR_PTR(-(__force int)(mask)))

Uh-oh...
static inline bool __must_check IS_ERR(__force const void *ptr)
{
return IS_ERR_VALUE((unsigned long)ptr);
}
#define IS_ERR_VALUE(x) unlikely((unsigned long)(void *)(x) >= (unsigned 
long)-MAX_ERRNO)
#define MAX_ERRNO   4095

IOW, your trick relies upon arguments of PTR_TO_POLL being no greater
than 4095.  Now, consider
#define EPOLLRDHUP  (__force __poll_t)0x2000
which is to say, 8192...

So anything that tries e.g. POLL_TO_PTR(EPOLLRDHUP | EPOLLERR) will be in
for a quiet unpleasant surprise...


Re: [PATCH 1/2] af_key: Use DIV_ROUND_UP() instead of open-coded equivalent

2018-04-05 Thread Steffen Klassert
On Wed, Mar 28, 2018 at 09:35:26PM -0400, Kevin Easton wrote:
> On Wed, Mar 28, 2018 at 07:59:25AM +0200, Steffen Klassert wrote:
> > On Mon, Mar 26, 2018 at 07:39:16AM -0400, Kevin Easton wrote:
> > > Several places use (x + 7) / 8 to convert from a number of bits to a 
> > > number
> > > of bytes.  Replace those with DIV_ROUND_UP(x, 8) instead, for consistency
> > > with other parts of the same file.
> > > 
> > > Signed-off-by: Kevin Easton 
> > 
> > Is this a fix or just a cleanup?
> > If it is just a cleanup, please resent it based on ipsec-next.
> 
> Hi Steffen,
> 
> This patch (1/2) is just a cleanup, but it's in preparation for patch 2/2
> which is a fix and modifies some of the same lines.

So please resend the fix without the cleanup, otherwise we can
get conflicts when backporting the fix to the stable trees.


Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.

2018-04-05 Thread Herbert Xu
On Fri, Apr 06, 2018 at 01:11:56PM +1000, NeilBrown wrote:
>
> You don't need to handle memory allocation failures at the point where
> you insert into the table - adding to a linked list requires no new
> memory.

You do actually.  The r in rhashtable stands for resizable.  We
cannot completely rely on a background hash table resize because
the insertions can be triggered from softirq context and be without
rate-limits of any kind (e.g., incoming TCP connections).

Therefore at some point you either have to wait for the resize or
fail the insertion.  Since we cannot sleep then failing is the only
option.

> The likelihood of the error isn't really the issue - it either can
> happen or it cannot.  If it can, it requires code to handle it.

Sure, but you just have to handle it the same way you would handle
a memory allocation failure, because that's what it essentially is.

I'm sorry if that means you have to restructure your code to do that
but that's what you pay for having a resizable hash table because
at some point you just have to perform a resize.

> Ahhh... I see what you are thinking now.  You aren't suggestion a
> sharded count that is summed periodically.  Rather you are suggesting that
> we divide the hash space into N regions each with their own independent
> counter, and resize the table if any one region reaches 70% occupancy -
> on the assumption that the other regions are likely to be close.  Is
> that right?

Yes.

> It would probably be dangerous to allow automatic shrinking (when just
> one drops below 30%) in that case, but it might be OK for growing.

At the greatest granularity it would be a counter per bucket, so
clearly we need to maintain some kind of bound on the granularity
so our sample space is not too small.

Automatic shrinking shouldn't be an issue because that's the slow
kind of resizing that we punt to kthread context and it can afford
to do a careful count.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH net-next] netns: filter uevents correctly

2018-04-05 Thread Eric W. Biederman
Christian Brauner  writes:

> On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
>> On 05.04.2018 17:07, Christian Brauner wrote:
>> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
>> >> On 04.04.2018 22:48, Christian Brauner wrote:
>> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network 
>> >>> namespaces")
>> >>>
>> >>> enabled sending hotplug events into all network namespaces back in 2010.
>> >>> Over time the set of uevents that get sent into all network namespaces 
>> >>> has
>> >>> shrunk. We have now reached the point where hotplug events for all 
>> >>> devices
>> >>> that carry a namespace tag are filtered according to that namespace.
>> >>>
>> >>> Specifically, they are filtered whenever the namespace tag of the kobject
>> >>> does not match the namespace tag of the netlink socket. One example are
>> >>> network devices. Uevents for network devices only show up in the network
>> >>> namespaces these devices are moved to or created in.
>> >>>
>> >>> However, any uevent for a kobject that does not have a namespace tag
>> >>> associated with it will not be filtered and we will *try* to broadcast it
>> >>> into all network namespaces.
>> >>>
>> >>> The original patchset was written in 2010 before user namespaces were a
>> >>> thing. With the introduction of user namespaces sending out uevents 
>> >>> became
>> >>> partially isolated as they were filtered by user namespaces:
>> >>>
>> >>> net/netlink/af_netlink.c:do_one_broadcast()
>> >>>
>> >>> if (!net_eq(sock_net(sk), p->net)) {
>> >>> if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>> >>> return;
>> >>>
>> >>> if (!peernet_has_id(sock_net(sk), p->net))
>> >>> return;
>> >>>
>> >>> if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>> >>>  CAP_NET_BROADCAST))
>> >>> j   return;
>> >>> }
>> >>>
>> >>> The file_ns_capable() check will check whether the caller had
>> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
>> >>> namespace of interest. This check is fine in general but seems 
>> >>> insufficient
>> >>> to me when paired with uevents. The reason is that devices always belong 
>> >>> to
>> >>> the initial user namespace so uevents for kobjects that do not carry a
>> >>> namespace tag should never be sent into another user namespace. This has
>> >>> been the intention all along. But there's one case where this breaks,
>> >>> namely if a new user namespace is created by root on the host and an
>> >>> identity mapping is established between root on the host and root in the
>> >>> new user namespace. Here's a reproducer:
>> >>>
>> >>>  sudo unshare -U --map-root
>> >>>  udevadm monitor -k
>> >>>  # Now change to initial user namespace and e.g. do
>> >>>  modprobe kvm
>> >>>  # or
>> >>>  rmmod kvm
>> >>>
>> >>> will allow the non-initial user namespace to retrieve all uevents from 
>> >>> the
>> >>> host. This seems very anecdotal given that in the general case user
>> >>> namespaces do not see any uevents and also can't really do anything 
>> >>> useful
>> >>> with them.
>> >>>
>> >>> Additionally, it is now possible to send uevents from userspace. As such 
>> >>> we
>> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
>> >>> namespace of the network namespace of the netlink socket) userspace 
>> >>> process
>> >>> make a decision what uevents should be sent.
>> >>>
>> >>> This makes me think that we should simply ensure that uevents for 
>> >>> kobjects
>> >>> that do not carry a namespace tag are *always* filtered by user namespace
>> >>> in kobj_bcast_filter(). Specifically:
>> >>> - If the owning user namespace of the uevent socket is not init_user_ns 
>> >>> the
>> >>>   event will always be filtered.
>> >>> - If the network namespace the uevent socket belongs to was created in 
>> >>> the
>> >>>   initial user namespace but was opened from a non-initial user namespace
>> >>>   the event will be filtered as well.
>> >>> Put another way, uevents for kobjects not carrying a namespace tag are 
>> >>> now
>> >>> always only sent to the initial user namespace. The regression potential
>> >>> for this is near to non-existent since user namespaces can't really do
>> >>> anything with interesting devices.
>> >>>
>> >>> Signed-off-by: Christian Brauner 
>> >>> ---
>> >>>  lib/kobject_uevent.c | 10 +-
>> >>>  1 file changed, 9 insertions(+), 1 deletion(-)
>> >>>
>> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>> >>> index 15ea216a67ce..cb98cddb6e3b 100644
>> >>> --- a/lib/kobject_uevent.c
>> >>> +++ b/lib/kobject_uevent.c
>> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, 
>> >>> struct sk_buff *skb, void *data)
>> >>>  return sock_ns != ns;
>> >>>  }
>> >>>  
>> >>> -return 0;
>> >>> +/*
>> >>> + * 

Re: KASAN: use-after-free Read in worker_thread (2)

2018-04-05 Thread Eric Biggers
On Sat, Nov 11, 2017 at 07:56:01AM -0800, syzbot wrote:
> syzkaller has found reproducer for the following crash on
> d9e0e63d9a6f88440eb201e1491fcf730272c706
> git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> 
> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> for information about syzkaller reproducers
> 
> 
> BUG: KASAN: use-after-free in worker_thread+0x15bb/0x1990
> kernel/workqueue.c:2244
> Read of size 8 at addr 88002d0e3de0 by task kworker/u8:1/1209
> 
> CPU: 0 PID: 1209 Comm: kworker/u8:1 Not tainted 4.14.0-rc8-next-20171110+
> #12
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  print_address_description+0x73/0x250 mm/kasan/report.c:252
>  kasan_report_error mm/kasan/report.c:351 [inline]
>  kasan_report+0x25b/0x340 mm/kasan/report.c:409
>  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
>  worker_thread+0x15bb/0x1990 kernel/workqueue.c:2244
>  kthread+0x37a/0x440 kernel/kthread.c:238
>  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:437
> 
> Allocated by task 11866:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
>  kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
>  kmem_cache_alloc+0x12e/0x760 mm/slab.c:3548
>  kmem_cache_zalloc include/linux/slab.h:693 [inline]
>  kcm_attach net/kcm/kcmsock.c:1394 [inline]
>  kcm_attach_ioctl net/kcm/kcmsock.c:1460 [inline]
>  kcm_ioctl+0x2d1/0x1610 net/kcm/kcmsock.c:1695
>  sock_do_ioctl+0x65/0xb0 net/socket.c:960
>  sock_ioctl+0x2c2/0x440 net/socket.c:1057
>  vfs_ioctl fs/ioctl.c:46 [inline]
>  do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:686
>  SYSC_ioctl fs/ioctl.c:701 [inline]
>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
>  entry_SYSCALL_64_fastpath+0x1f/0x96
> 
> Freed by task 11867:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
>  __cache_free mm/slab.c:3492 [inline]
>  kmem_cache_free+0x77/0x280 mm/slab.c:3750
>  kcm_unattach+0xe50/0x1510 net/kcm/kcmsock.c:1563
>  kcm_unattach_ioctl net/kcm/kcmsock.c:1608 [inline]
>  kcm_ioctl+0xdf0/0x1610 net/kcm/kcmsock.c:1705
>  sock_do_ioctl+0x65/0xb0 net/socket.c:960
>  sock_ioctl+0x2c2/0x440 net/socket.c:1057
>  vfs_ioctl fs/ioctl.c:46 [inline]
>  do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:686
>  SYSC_ioctl fs/ioctl.c:701 [inline]
>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
>  entry_SYSCALL_64_fastpath+0x1f/0x96
> 
> The buggy address belongs to the object at 88002d0e3d00
>  which belongs to the cache kcm_psock_cache of size 576
> The buggy address is located 224 bytes inside of
>  576-byte region [88002d0e3d00, 88002d0e3f40)
> The buggy address belongs to the page:
> page:eab43880 count:1 mapcount:0 mapping:88002d0e2180 index:0x0
> compound_mapcount: 0
> flags: 0x1008100(slab|head)
> raw: 01008100 88002d0e2180  0001000b
> raw: eab14920 eab27e20 88002b0089c0 
> page dumped because: kasan: bad access detected
> 
> Memory state around the buggy address:
>  88002d0e3c80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>  88002d0e3d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > 88002d0e3d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>^
>  88002d0e3e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>  88002d0e3e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ==

No longer occurring, the fix seems to have been commit 7e9964574ee97:

#syz fix: kcm: Only allow TCP sockets to be attached to a KCM mux

- Eric


Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.

2018-04-05 Thread NeilBrown
On Thu, Mar 29 2018, Herbert Xu wrote:

> On Thu, Mar 29, 2018 at 08:26:21AM +1100, NeilBrown wrote:
>>
>> I say "astronomically unlikely", you say "probability .. is extremely
>> low".  I think we are in agreement here.
>> 
>> The point remains that if an error *can* be returned then I have to
>> write code to handle it and test that code.  I'd rather not.
>
> You have be able to handle errors anyway because of memory allocation
> failures.  Ultimately if you keep inserting you will eventually
> fail with ENOMEM.  So I don't see the issue with an additional
> error value.

You don't need to handle memory allocation failures at the point where
you insert into the table - adding to a linked list requires no new
memory.

If you are running out of memory, you will fail to allocate new objects,
so you won't be able to add them to the rhashtable.  Obviously you have
to handle that failure, and that is likely to save rhashtable from
getting many mem-alloc failures.

But once you have allocated a new object, rhashtable should just accept
it.  It doesn't help anyone to have the insertion fail.
The insertion can currently fail even when allocating a new object would
succeed, as assertion can require a GFP_ATOMIC allocation to succeed,
while allocating a new object might only need a GFP_KERNEL allocation to
succeed. 

>
>> > Even if it does happen we won't fail because we will perform
>> > an immediate rehash.  We only fail if it happens right away
>> > after the rehash (that is, at least another 16 elements have
>> > been inserted and you're trying to insert a 17th element, all
>> > while the new hash table has not been completely populated),
>> > which means that somebody has figured out our hash secret and
>> > failing in that case makes sense.
>
> BTW, you didn't acknowledge this bit which I think is crucial to
> how likely such an error is.

The likelihood of the error isn't really the issue - it either can
happen or it cannot.  If it can, it requires code to handle it.

>
>> I never suggested retrying, but I would have to handle it somehow.  I'd
>> rather not.
>
> ...
>
>> While I have no doubt that there are hashtables where someone could try
>> to attack the hash, I am quite sure there are others where is such an
>> attack is meaningless - any code which could generate the required range of
>> keys, could do far worse things more easily.
>
> Our network hashtable has to be secure against adversaries.  I
> understand that this may not be important to your use-case.  However,
> given the fact that the failure would only occur if an adversary
> is present and actively attacking your hash table, I don't think
> it has that much of a negative effect on your use-case either.

I certainly accept that there are circumstances where it is a real
possibility that an adversary might succeed in attacking the hash
function, and changing the seed for each table is an excellent idea.
It isn't clear to me that failing insertions is needed - it is done
rather late and so doesn't throttle much, and could give the attacker
feedback that they had succeeded (?).

Not all hash tables are susceptible to attack.  It would be nice if
developers working in less exposed areas could use rhashtable without ever
having to check for errors.

Error codes are messages from the implementation to the caller.
They should be chosen to help the caller make useful choices, not to
allow the implementation to avoid awkward situations.

>
> Of course if you can reproduce the EBUSY error without your
> disable_count patch or even after you have fixed the issue I
> have pointed out in your disable_count patch you can still reproduce
> it then that would suggest a real bug and we would need to fix it,
> for everyone.

I suspect that I cannot.  However experience tells me that customers do
things that I cannot and would never expect - they can often trigger
errors that I cannot.  It is best if the error cannot be returned.

>
>> Yes, storing a sharded count in the spinlock table does seem like an
>> appropriate granularity.  However that leads me to ask: why do we have
>> the spinlock table?  Why not bit spinlocks in the hashchain head like
>> include/linux/list_bl uses?
>
> The spinlock table predates rhashtable.  Perhaps Thomas/Eric/Dave
> can elucidate this.

Maybe I'll submit an RFC patch to change it - if I can make it work.

>
>> I don't understand how it can ever be "too late", though I appreciate
>> that in some cases "sooner" is better than "later"
>> If we give up on the single atomic_t counter, then we must accept that
>> the number of elements could exceed any given value.  The only promise
>> we can provide is that it wont exceed N% of the table size for more than
>> T seconds.
>
> Sure.  However, assuming we use an estimate that is hash-based,
> that *should* be fairly accurate assuming that your hash function
> is working in the first place.  It's completely different vs.
> estimating based on a per-cpu count which could be wildly 

Re: [PATCH v15 ] net/veth/XDP: Line-rate packet forwarding in kernel

2018-04-05 Thread David Ahern
On 4/3/18 9:15 PM, Md. Islam wrote:
>> Have you looked at what I would consider a more interesting use case of
>> packets into a node and delivered to a namespace via veth?
>>
>>+--+---
>>| Host | container
>>|  |
>>|+---{ veth1 }-|-{veth2}
>>|   |  |
>>+{ eth1 }--
>>
>> Can xdp / bpf on eth1 be used to speed up delivery to the container?
> 
> I didn't consider that, but it sounds like an important use case. How
> do we determine which namespace gets the packet?
> 

FIB lookups of course. Starting with my patch set that handles
forwarding on eth1, what is needed for XDP with veth? ie., a program on
eth1 does the lookup and redirects the packet to veth1 for Tx.
ndo_xdp_xmit for veth knows the packet needs to be forwarded to veth2
internally and there is no skb allocated for the packet yet.


Re: [PATCH net] net/ipv6: Increment OUTxxx counters after netfilter hook

2018-04-05 Thread David Miller
From: Jeff Barnhill <0xeff...@gmail.com>
Date: Thu,  5 Apr 2018 21:29:47 +

> At the end of ip6_forward(), IPSTATS_MIB_OUTFORWDATAGRAMS and
> IPSTATS_MIB_OUTOCTETS are incremented immediately before the NF_HOOK call
> for NFPROTO_IPV6 / NF_INET_FORWARD.  As a result, these counters get
> incremented regardless of whether or not the netfilter hook allows the
> packet to continue being processed.  This change increments the counters
> in ip6_forward_finish() so that it will not happen if the netfilter hook
> chooses to terminate the packet, which is similar to how IPv4 works.
> 
> Signed-off-by: Jeff Barnhill <0xeff...@gmail.com>

Yep, this is more consistent with ipv4.

Applied, thank you.


Re: [PATCH 2/2] net: phy: dp83640: Read strapped configuration settings

2018-04-05 Thread David Miller
From: Andrew Lunn 
Date: Thu, 5 Apr 2018 22:40:49 +0200

> Or could it still contain whatever state the last boot of Linux, or
> maybe the bootloader, left the PHY in?

Right, this is my concern as well.


Re: [PATCH 0/4] hv_netvsc: Fix shutdown issues on older Windows hosts

2018-04-05 Thread David Miller
From: Mohammed Gamal 
Date: Thu,  5 Apr 2018 21:09:17 +0200

> Guests running on WS2012 hosts would not shutdown when changing network
> interface setting (e.g. Number of channels, MTU ... etc). 
> 
> This patch series addresses these shutdown issues we enecountered with WS2012
> hosts. It's essentialy a rework of the series sent in 
> https://lkml.org/lkml/2018/1/23/111 on top of latest upstream
> 
> Fixes: 0ef58b0a05c1 ("hv_netvsc: change GPAD teardown order on older 
> versions")

Series applied, thank you.


Re: [PATCH net-next] hv_netvsc: Add NetVSP v6 into version negotiation

2018-04-05 Thread David Miller
From: Haiyang Zhang 
Date: Thu,  5 Apr 2018 11:42:22 -0700

> From: Haiyang Zhang 
> 
> This patch adds the NetVSP v6 message structures, and includes this
> version into NetVSC/NetVSP version negotiation process.
> 
> Signed-off-by: Haiyang Zhang 

The net-next tree is closed, please resubmit this when the net-next
tree reopens.

Thank you.


Re: [PATCH 08/32] aio: replace kiocb_set_cancel_fn with a cancel_kiocb file operation

2018-04-05 Thread Al Viro
On Fri, Mar 30, 2018 at 05:07:45PM +0200, Christoph Hellwig wrote:
> The current kiocb_set_cancel_fn implementation assumes the kiocb is
> embedded into an aio_kiocb, which is fundamentally unsafe as it might
> have been submitted by non-aio callers.  Instead add a cancel_kiocb
> file operation that replaced the ki_cancel function pointer set by
> kiocb_set_cancel_fn, and only adds iocbs to the active list when
> the read/write_iter methods return -EIOCBQUEUED and the file has
> a cancel_kiocb method.

> @@ -1440,6 +1423,16 @@ static inline ssize_t aio_rw_ret(struct kiocb *req, 
> ssize_t ret)
>  {
>   switch (ret) {
>   case -EIOCBQUEUED:
> + if (req->ki_filp->f_op->cancel_kiocb) {

... and by that point req might've been already freed by IO completion on
another CPU.
> + struct aio_kiocb *iocb =
> + container_of(req, struct aio_kiocb, rw);
> + struct kioctx *ctx = iocb->ki_ctx;
> + unsigned long flags;
> +
> + spin_lock_irqsave(>ctx_lock, flags);
> + list_add_tail(>ki_list, >active_reqs);



Re: [PATCH net] net: mvpp2: Fix parser entry init boundary check

2018-04-05 Thread David Miller
From: Maxime Chevallier 
Date: Thu,  5 Apr 2018 11:55:48 +0200

> Boundary check in mvpp2_prs_init_from_hw must be done according to the
> passed "tid" parameter, not the mvpp2_prs_entry index, which is not yet
> initialized at the time of the check.
> 
> Fixes: 47e0e14eb1a6 ("net: mvpp2: Make mvpp2_prs_hw_read a parser entry init 
> function")
> Signed-off-by: Maxime Chevallier 

Applied, thanks.


Re: [PATCH net] arp: fix arp_filter on l3slave devices

2018-04-05 Thread David Miller
From: David Ahern 
Date: Thu, 5 Apr 2018 08:40:48 -0600

> On 4/5/18 2:25 AM, Miguel Fadon Perlines wrote:
>> arp_filter performs an ip_route_output search for arp source address and
>> checks if output device is the same where the arp request was received,
>> if it is not, the arp request is not answered.
>> 
>> This route lookup is always done on main route table so l3slave devices
>> never find the proper route and arp is not answered.
>> 
>> Passing l3mdev_master_ifindex_rcu(dev) return value as oif fixes the
>> lookup for l3slave devices while maintaining same behavior for non
>> l3slave devices as this function returns 0 in that case.
>> 
>> Signed-off-by: Miguel Fadon Perlines 
>> ---
>>  net/ipv4/arp.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
> 
> Acked-by: David Ahern 
> 
> DaveM: this is a day 1 bug for VRF. Best guess at a Fixes tag would be:
> 
> Fixes: 613d09b30f8b ("net: Use VRF device index for lookups on TX")
> 
> It would be good to get this into stable releases 4.9 and up. Thanks,

Applied and queued up for -stable.


Re: [PATCH net] net: dsa: Discard frames from unused ports

2018-04-05 Thread David Miller
From: Andrew Lunn 
Date: Thu,  5 Apr 2018 01:56:44 +0200

> The Marvell switches under some conditions will pass a frame to the
> host with the port being the CPU port. Such frames are invalid, and
> should be dropped. Not dropping them can result in a crash when
> incrementing the receive statistics for an invalid port.
> 
> Reported-by: Chris Healy 
> Fixes: 5f6b4e14cada ("net: dsa: User per-cpu 64-bit statistics")
> Signed-off-by: Andrew Lunn 
 ...
> + slave_port = >ports[port];
> +
> + if (slave_port->type != DSA_PORT_TYPE_USER)
> + return NULL;

Look like we need a Fixes: update and an adjustment to use unlikely()
here based upon Florian's feedback.


Re: [PATCH net] netns: filter uevents correctly

2018-04-05 Thread David Miller
From: Christian Brauner 
Date: Thu, 05 Apr 2018 01:27:16 +

> David, is it ok to queue this or would you prefer I resend when net-next
> reopens?

This definitely needs more discussion, and after the discussion some
further clarification in the commit log message based upon that
discussion if we move forward with this change at all.

Thank you.


[PATCH] make net_gso_ok return false when gso_type is zero(invalid)

2018-04-05 Thread Wenhua Shi
Signed-off-by: Wenhua Shi 
---
 include/linux/netdevice.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cf44503e..1f26cbcf 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -4187,7 +4187,7 @@ static inline bool net_gso_ok(netdev_features_t features, 
int gso_type)
BUILD_BUG_ON(SKB_GSO_ESP != (NETIF_F_GSO_ESP >> NETIF_F_GSO_SHIFT));
BUILD_BUG_ON(SKB_GSO_UDP != (NETIF_F_GSO_UDP >> NETIF_F_GSO_SHIFT));
 
-   return (features & feature) == feature;
+   return feature && (features & feature) == feature;
 }
 
 static inline bool skb_gso_ok(struct sk_buff *skb, netdev_features_t features)
-- 
2.11.0



Re: [RFC 0/9] bpf: Add buildid check support

2018-04-05 Thread Alexei Starovoitov
On Thu, Apr 05, 2018 at 05:16:36PM +0200, Jiri Olsa wrote:
> hi,
> eBPF programs loaded for kprobes are allowed to read kernel
> internal structures. We check the provided kernel version
> to ensure that the program is loaded for the proper kernel. 
> 
> The problem is that the version check is not enough, because
> it only follows the version setup from kernel's Makefile.
> However, the internal kernel structures change based on the
> .config data, so in practise we have different kernels with
> same version.
> 
> The eBPF kprobe program thus then get loaded for different
> kernel than it's been built for, get wrong data (silently)
> and provide misleading output.
> 
> This patchset implements additional check in eBPF loading code
> on provided build ID (from kernel's elf image, .notes section
> GNU build ID) to ensure we load the eBPF program on correct
> kernel.
> 
> Also available in here (based on bpf-next/master):
>   https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
>   bpf/checksum
> 
> This patchset consists of several changes:
> 
> - adding CONFIG_BUILDID_H option that instructs the build
>   to generate uapi header file with build ID data, that
>   will be included by eBPF program
> 
> - adding CONFIG_BPF_BUILDID_CHECK option and new bpf_attr
>   field to allow build ID checking when loading the eBPF
>   program
> 
> - changing libbpf to read and pass build ID to the kernel
> 
> - several small side fixes
> 
> - example perf eBPF code in bpf-samples/bpf-stdout-example.c
>   to show the build ID support/usage.
> 
> # perf record -vv  -e ./bpf-samples/bpf-stdout-example.c kill 2>&1 | grep 
> buildid
> libbpf: section(7) buildid, size 21, link 0, flags 3, type=1
> libbpf: kernel buildid of ./bpf-samples/bpf-stdout-example.c is: 
> 6e25edeb408513184e2753bebad25d42314501a0
> 
>   The buildid is provided the same way we provide kernel
>   version, in a special "buildid" section:
> 
> # cat ./bpf-samples/bpf-stdout-example.c
> ...
> #include 
> 
> char _buildid[] SEC("buildid") = LINUX_BUILDID_DATA;
> ...
> 
>   where LINUX_BUILDID_DATA is defined in the generated buildid.h.
> 
> please note it's an RFC ;-) any comments and suggestions are welcome

I think this is overkill.

We're very heavy users of kprobe+bpf. It's used for lots
of different cases and usage is constantly growing,
but I haven't seen a single case of :

> The eBPF kprobe program thus then get loaded for different
> kernel than it's been built for, get wrong data (silently)
> and provide misleading output.

but I saw plenty of the opposite. People pre-compile the program
and hack kernel version when they load, since they know in advance
that kprobe+bpf doesn't use any kernel specific things.
The existing kernel version check for kprobe+bpf is already annoying
to them.
This buildid check they can easily bypass the same way.
imo the whole thing just adds complexity and doesn't solve anything.
Sorry, this is a Nack.



Re: [Cluster-devel] [PATCH v2 0/2] gfs2: Stop using rhashtable_walk_peek

2018-04-05 Thread NeilBrown
On Wed, Apr 04 2018, Andreas Grünbacher wrote:

> Herbert Xu  schrieb am Mi. 4. Apr. 2018 um
> 17:51:
>
>> On Wed, Apr 04, 2018 at 11:46:28AM -0400, Bob Peterson wrote:
>> >
>> > The patches look good. The big question is whether to add them to this
>> > merge window while it's still open. Opinions?
>>
>> We're still hashing out the rhashtable interface so I don't think now is
>> the time to rush things.
>
>
> Fair enough. No matter how rhashtable_walk_peek changes, we‘ll still need
> these two patches to fix the glock dump though.

Those two patches look fine to me and don't depend on changes to
rhashtable, so it is up to you when they go upstream.

However, I think the code can be substantially simplified, particularly
once we make rhashtable a little cleverer.
So this is what I'll probably be doing for a similar situation in
lustre

Having examined seqfile closely, it is apparent that if ->start never
changes *ppos, and if ->next always increments it (except maybe on error)
then

1/ ->next is only ever given a 'pos' that was returned by the previous
   call to ->start or ->next.  So it should *always* return the next
   object, after the one previously returned by ->start or ->next.  It
   never needs to 'seek'. The 'traverse()' function in seq_file.c does
   any seeking needed.  ->next needs to increase 'pos' and when walking
   a dynamic list, it is easiest if it just increments it.

2/ ->start is only called with a pos of:
0 - in this case it should rewind to the start
the last pos passed to ->start of ->next
 In this case it should return the same thing that was
 returned last time.  If it no longer exists, then
 the following one should be returned.
one more than the last pos passed to ->start or ->next
 In this case it should return the object after the
 last one returned.

The proposed enhancement to rhashtable_walk* is to add a
rhashtable_walk_prev() which returns the previously returned object,
if it is still in the table, or NULL. It also enhances
rhashtable_walk_start() so that if the previously returned object is
still in the table, it is preserved as the current cursor.
This means that if you take some action to ensure that the
previously returned object remains in the table until the next ->start,
then you can reliably walk the table with no duplicates or omissions
(unless a concurrent rehash causes duplicates)
If you don't keep the object in the table and it gets removed, then
the 'skip' counter is used to find your place, and you might get
duplicates or omissions.

NeilBrown


signature.asc
Description: PGP signature


[no subject]

2018-04-05 Thread venkatvenkatsubra
Hi Netdevhttps://goo.gl/5bDZtk

Re: [PATCH v2 bpf-next 0/3] bpf/verifier: subprog/func_call simplifications

2018-04-05 Thread Alexei Starovoitov
On Thu, Apr 05, 2018 at 04:50:03PM +0100, Jiong Wang wrote:
> On 03/04/2018 02:08, Alexei Starovoitov wrote:
> > Combining subprog pass with do_check is going into opposite direction
> > of this long term work. Divide and conquer. Combining more things into
> > do_check is the opposite of this programming principle.
> 
> Agree. And for the redundant insn traversal issue in check_subprogs that
> Edward trying to fix, I am thinking we could do it by utilizing the
> existing DFS traversal in check_cfg.
> 
> The current DFS probably could be improved into an generic instruction
> information collection pass.
> 
> This won't make the existing DFS complexer as it only does information
> collection as a side job during traversal. These collected information
> then could be used to build any other information to be consumed later,
> for example subprog, basic blocks etc.
> 
> For the redundant insn traversal issue during subprog detection, the
> Like how we mark STATE_LIST_MARK in DFS, we could just call add_subprog
> for BPF_PSEUDO_CALL insn during DFS.
> 
> i.e we change the code logic of check_cfg into:
> 
> check_cfg
> {
>   * DFS traversal:
> - detect back-edge.
> - collect STATE_LIST_MARK.
> - collect subprog destination.
> 
>   * check all insns are reachable.
>   * check_subprogs (insn traversal removed).
> }

I don't think that will work.
Functions are independent from CFG.
With bpf libraries we will have disjoint functions sitting in the kernel
and check_cfg would need to be done separately from function boundaries.



Re: [PATCH 3/9] kbuild: Do not pass arguments to link-vmlinux.sh

2018-04-05 Thread Masahiro Yamada
2018-04-06 3:59 GMT+09:00 Jiri Olsa :
> On Fri, Apr 06, 2018 at 12:50:00AM +0900, Masahiro Yamada wrote:
>> 2018-04-06 0:16 GMT+09:00 Jiri Olsa :
>> > There's no need to pass LD* arguments to link-vmlinux.sh,
>> > because they are passed as variables. The only argument
>> > the link-vmlinux.sh supports is the 'clean' argument.
>> >
>> > Signed-off-by: Jiri Olsa 
>> > ---
>>
>> Wrong.
>>
>> $(LD) $(LDFLAGS) $(LDFLAGS_vmlinux)
>> exist here so that any change in them
>> invokes scripts/linkk-vmlinux.sh
>
> sry, I can't see that.. but it's just a side fix,
> which is actually not needed for the rest
>
> I'll check on more and address this separately


The link command is recorded in .vmlinux.cmd
then, it is checked by if_changed in the next rebuild
because link-vmlinux is invoked by $(call if_changed,...)

 +$(call if_changed,link-vmlinux)



> thanks,
> jirka
>
>>
>>
>>
>>
>> >  Makefile | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/Makefile b/Makefile
>> > index d3300e46f925..a65a3919c6ad 100644
>> > --- a/Makefile
>> > +++ b/Makefile
>> > @@ -1027,7 +1027,7 @@ ARCH_POSTLINK := $(wildcard 
>> > $(srctree)/arch/$(SRCARCH)/Makefile.postlink)
>> >
>> >  # Final link of vmlinux with optional arch pass after final link
>> >  cmd_link-vmlinux = \
>> > -   $(CONFIG_SHELL) $< $(LD) $(LDFLAGS) $(LDFLAGS_vmlinux) ;\
>> > +   $(CONFIG_SHELL) $< ;   \
>> > $(if $(ARCH_POSTLINK), $(MAKE) -f $(ARCH_POSTLINK) $@, true)
>> >
>> >  vmlinux: scripts/link-vmlinux.sh vmlinux_prereq $(vmlinux-deps) FORCE
>>
>>
>>
>>
>>
>>
>> --
>> Best Regards
>> Masahiro Yamada
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kbuild" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards
Masahiro Yamada


Packet corrupts to guest system system from host kernel 4.16.x

2018-04-05 Thread Алексей Болдырев
Why, when using the 4.16 kernel on the host system, is there such a strange 
behavior of virtual machines? What is the problem? He rolled back to 4.9.c - 
the problem was gone.

tcpdump on router:
tcpdump: listening on vlan-00110013, link-type EN10MB (Ethernet), capture size 
262144 bytes
23:59:08.331875 52:54:00:38:cf:94 > 52:54:00:4e:9c:5f, ethertype ARP (0x0806), 
length 38: [|ARP]
0x:  0001 0800 0604 0001 5254 0038 cf94 c0a8  RT.8
0x0010:  b402     
23:59:08.454364 52:54:00:4e:9c:5f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), 
length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.180.2 tell 
192.168.180.1, length 28
23:59:08.454754 52:54:00:38:cf:94 > 52:54:00:4e:9c:5f, ethertype ARP (0x0806), 
length 38: [|ARP]
0x:  0001 0800 0604 0002 5254 0038 cf94 c0a8  RT.8
0x0010:  b402 5254 004e 9c5f  ..RT.N._
23:59:09.462383 52:54:00:4e:9c:5f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), 
length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.180.2 tell 
192.168.180.1, length 28
23:59:09.463607 52:54:00:38:cf:94 > 52:54:00:4e:9c:5f, ethertype ARP (0x0806), 
length 38: [|ARP]
0x:  0001 0800 0604 0002 5254 0038 cf94 c0a8  RT.8
0x0010:  b402 5254 004e 9c5f  ..RT.N._
23:59:10.486352 52:54:00:4e:9c:5f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), 
length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.180.2 tell 
192.168.180.1, length 28
23:59:10.487303 52:54:00:38:cf:94 > 52:54:00:4e:9c:5f, ethertype ARP (0x0806), 
length 38: [|ARP]
0x:  0001 0800 0604 0002 5254 0038 cf94 c0a8  RT.8
0x0010:  b402 5254 004e 9c5f  ..RT.N._
23:59:11.051570 52:54:00:38:cf:94 > 52:54:00:4e:9c:5f, ethertype IPv4 (0x0800), 
length 77: truncated-ip - 4 bytes missing! (tos 0x0, ttl 64, id 54539, offset 
0, flags [DF], proto UDP (17), length 67)
192.168.180.2.59917 > 198.18.120.1.53: 1196+[|domain]
23:59:11.051585 52:54:00:38:cf:94 > 52:54:00:4e:9c:5f, ethertype IPv4 (0x0800), 
length 77: truncated-ip - 4 bytes missing! (tos 0x0, ttl 64, id 54540, offset 
0, flags [DF], proto UDP (17), length 67)
192.168.180.2.59917 > 198.18.120.1.53: 5849+[|domain]
23:59:11.527109 52:54:00:4e:9c:5f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), 
length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.180.2 tell 
192.168.180.1, length 28
23:59:11.527707 52:54:00:38:cf:94 > 52:54:00:4e:9c:5f, ethertype ARP (0x0806), 
length 38: [|ARP]
0x:  0001 0800 0604 0002 5254 0038 cf94 c0a8  RT.8
0x0010:  b402 5254 004e 9c5f  ..RT.N._
^C
11 packets captured
11 packets received by filter
0 packets dropped by kernel




[PATCH net] net/sched: fix NULL dereference in the error path of tcf_bpf_init()

2018-04-05 Thread Davide Caratti
when tcf_bpf_init_from_ops() fails (e.g. because of program having invalid
number of instructions), tcf_bpf_cfg_cleanup() calls bpf_prog_put(NULL) or
bpf_prog_destroy(NULL). Unless CONFIG_BPF_SYSCALL is unset, this causes
the following error:

 BUG: unable to handle kernel NULL pointer dereference at 0020
 PGD 80007345a067 P4D 80007345a067 PUD 340e1067 PMD 0
 Oops:  [#1] SMP PTI
 Modules linked in: act_bpf(E) ip6table_filter ip6_tables iptable_filter 
binfmt_misc ext4 mbcache jbd2 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel 
snd_hda_codec_generic pcbc snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep 
snd_seq snd_seq_device snd_pcm aesni_intel crypto_simd glue_helper cryptd 
joydev snd_timer snd virtio_balloon pcspkr soundcore i2c_piix4 nfsd auth_rpcgss 
nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi qxl 
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_blk drm 
virtio_net virtio_console i2c_core crc32c_intel serio_raw virtio_pci ata_piix 
libata virtio_ring floppy virtio dm_mirror dm_region_hash dm_log dm_mod [last 
unloaded: act_bpf]
 CPU: 3 PID: 5654 Comm: tc Tainted: GE4.16.0.bpf_test+ #408
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:__bpf_prog_put+0xc/0xc0
 RSP: 0018:9594003ef728 EFLAGS: 00010202
 RAX:  RBX: 9594003ef758 RCX: 0024
 RDX:  RSI: 0001 RDI: 
 RBP:  R08: 0001 R09: 0044
 R10: 0220 R11: 8a7ab9f17131 R12: 
 R13: 8a7ab7c3c8e0 R14: 0001 R15: 8a7ab88f1054
 FS:  7fcb2f17c740() GS:8a7abfd8() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 0020 CR3: 7c888006 CR4: 001606e0
 Call Trace:
  tcf_bpf_cfg_cleanup+0x2f/0x40 [act_bpf]
  tcf_bpf_cleanup+0x4c/0x70 [act_bpf]
  __tcf_idr_release+0x79/0x140
  tcf_bpf_init+0x125/0x330 [act_bpf]
  tcf_action_init_1+0x2cc/0x430
  ? get_page_from_freelist+0x3f0/0x11b0
  tcf_action_init+0xd3/0x1b0
  tc_ctl_action+0x18b/0x240
  rtnetlink_rcv_msg+0x29c/0x310
  ? _cond_resched+0x15/0x30
  ? __kmalloc_node_track_caller+0x1b9/0x270
  ? rtnl_calcit.isra.29+0x100/0x100
  netlink_rcv_skb+0xd2/0x110
  netlink_unicast+0x17c/0x230
  netlink_sendmsg+0x2cd/0x3c0
  sock_sendmsg+0x30/0x40
  ___sys_sendmsg+0x27a/0x290
  ? mem_cgroup_commit_charge+0x80/0x130
  ? page_add_new_anon_rmap+0x73/0xc0
  ? do_anonymous_page+0x2a2/0x560
  ? __handle_mm_fault+0xc75/0xe20
  __sys_sendmsg+0x58/0xa0
  do_syscall_64+0x6e/0x1a0
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
 RIP: 0033:0x7fcb2e58eba0
 RSP: 002b:7ffc93c496c8 EFLAGS: 0246 ORIG_RAX: 002e
 RAX: ffda RBX: 7ffc93c497f0 RCX: 7fcb2e58eba0
 RDX:  RSI: 7ffc93c49740 RDI: 0003
 RBP: 5ac6a646 R08: 0002 R09: 
 R10: 7ffc93c49120 R11: 0246 R12: 
 R13: 7ffc93c49804 R14: 0001 R15: 0066afa0
 Code: 5f 00 48 8b 43 20 48 c7 c7 70 2f 7c b8 c7 40 10 00 00 00 00 5b e9 a5 8b 
61 00 0f 1f 44 00 00 0f 1f 44 00 00 41 54 55 48 89 fd 53 <48> 8b 47 20 f0 ff 08 
74 05 5b 5d 41 5c c3 41 89 f4 0f 1f 44 00
 RIP: __bpf_prog_put+0xc/0xc0 RSP: 9594003ef728
 CR2: 0020

Fix it in tcf_bpf_cfg_cleanup(), ensuring that bpf_prog_{put,destroy}(f)
is called only when f is not NULL.

Fixes: bbc09e7842a5 ("net/sched: fix idr leak on the error path of 
tcf_bpf_init()")
Reported-by: Lucas Bates 
Signed-off-by: Davide Caratti 
---
 net/sched/act_bpf.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index 9092531d45d8..18089c02e557 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -248,10 +248,14 @@ static int tcf_bpf_init_from_efd(struct nlattr **tb, 
struct tcf_bpf_cfg *cfg)
 
 static void tcf_bpf_cfg_cleanup(const struct tcf_bpf_cfg *cfg)
 {
-   if (cfg->is_ebpf)
-   bpf_prog_put(cfg->filter);
-   else
-   bpf_prog_destroy(cfg->filter);
+   struct bpf_prog *filter = cfg->filter;
+
+   if (filter) {
+   if (cfg->is_ebpf)
+   bpf_prog_put(filter);
+   else
+   bpf_prog_destroy(filter);
+   }
 
kfree(cfg->bpf_ops);
kfree(cfg->bpf_name);
-- 
2.14.3



Re: Enable and configure storm prevention in a network device

2018-04-05 Thread Florian Fainelli
On 04/05/2018 01:20 PM, David Miller wrote:
> From: Murali Karicheri 
> Date: Thu, 5 Apr 2018 16:14:49 -0400
> 
>> Is there a standard way to implement and configure storm prevention
>> in a Linux network device?
> 
> What kind of "storm", an interrupt storm?
> 

I would assume Murali is referring to L2 broadcast storms which is
common in switches. There is not an API for that AFAICT and I am not
sure what a proper API would look like.
-- 
Florian


Re: [PATCH net] net: dsa: Discard frames from unused ports

2018-04-05 Thread Florian Fainelli
On 04/04/2018 07:17 PM, Andrew Lunn wrote:
> On Wed, Apr 04, 2018 at 05:49:10PM -0700, Florian Fainelli wrote:
>> On 04/04/2018 04:56 PM, Andrew Lunn wrote:
>>> The Marvell switches under some conditions will pass a frame to the
>>> host with the port being the CPU port. Such frames are invalid, and
>>> should be dropped. Not dropping them can result in a crash when
>>> incrementing the receive statistics for an invalid port.
>>>
>>> Reported-by: Chris Healy 
>>> Fixes: 5f6b4e14cada ("net: dsa: User per-cpu 64-bit statistics")
>>
>> Are you sure this is the commit that introduced the problem?
> 
> Hi Florian
> 
> Well, the problem is it crashes when trying to update the
> statistics. The CPU port is not allocated a p->stats64, only slave
> ports get those. So before this patch, there was no crash and the
> frame would be delivered to the master interface. This in itself is
> probably not correct, but also not fatal. Talking to Chris, it seems
> this behaviour has existing for a long while. I needed to use lldpd to
> trigger the issue, because i assume the Marvell switch sees these as
> special frames and forwards them to the CPU. The other thing is, the
> code got refactored recently. So this fix will not rebase to too many
> earlier versions. It needs a fix per tagging protocol for before the
> common dsa_master_find_slave() was added.

Yes what you are explaining makes sense, but does not that mean we would
just be accessing a garbage memory location before as well? It seems to
me like this problem dates back from long before that particular
changeset, and this changeset just made it more obvious. Would that be
accurate?

> 
>>> -   return ds->ports[port].slave;
>>> +   slave_port = >ports[port];
>>> +
>>> +   if (slave_port->type != DSA_PORT_TYPE_USER)
>>
>> Can we optimize this with an unlikely()?
> 
> Yes, that would make sense.
> 
>  Andrew
> 


-- 
Florian


Re: [iovisor-dev] Best userspace programming API for XDP features query to kernel?

2018-04-05 Thread Jakub Kicinski
On Thu, 5 Apr 2018 22:51:33 +0200, Jesper Dangaard Brouer wrote:
> > What about nfp in terms of XDP
> > offload capabilities, should they be included as well or is probing to load
> > the program and see if it loads/JITs as we do today just fine (e.g. you'd
> > otherwise end up with extra flags on a per BPF helper basis)?  
> 
> No, flags per BPF helper basis. As I've described above, helper belong
> to the BPF core, not the driver.  Here I want to know what the specific
> driver support.

I think Daniel meant for nfp offload.  The offload restrictions are
quite involved, are we going to be able to express those?

This is a bit simpler but reminds me of the TC flower capability
discussion.  Expressing features and capabilities gets messy quickly.

I have a gut feeling that a good starting point would be defining and
building a test suite or a set of probing tests to check things work at
system level (incl. redirects to different ports etc.)  I think having
a concrete set of litmus tests that confirm the meaning of a given
feature/capability would go a long way in making people more comfortable
with accepting any form of BPF driver capability.  And serious BPF
projects already do probing so it's just centralizing this in the
kernel.

That's my two cents.


[PATCH net] net/ipv6: Increment OUTxxx counters after netfilter hook

2018-04-05 Thread Jeff Barnhill
At the end of ip6_forward(), IPSTATS_MIB_OUTFORWDATAGRAMS and
IPSTATS_MIB_OUTOCTETS are incremented immediately before the NF_HOOK call
for NFPROTO_IPV6 / NF_INET_FORWARD.  As a result, these counters get
incremented regardless of whether or not the netfilter hook allows the
packet to continue being processed.  This change increments the counters
in ip6_forward_finish() so that it will not happen if the netfilter hook
chooses to terminate the packet, which is similar to how IPv4 works.

Signed-off-by: Jeff Barnhill <0xeff...@gmail.com>
---
 net/ipv6/ip6_output.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index b8ee50e94af3..2e891d2c30ef 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -375,6 +375,11 @@ static int ip6_forward_proxy_check(struct sk_buff *skb)
 static inline int ip6_forward_finish(struct net *net, struct sock *sk,
 struct sk_buff *skb)
 {
+   struct dst_entry *dst = skb_dst(skb);
+
+   __IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTFORWDATAGRAMS);
+   __IP6_ADD_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTOCTETS, 
skb->len);
+
return dst_output(net, sk, skb);
 }
 
@@ -569,8 +574,6 @@ int ip6_forward(struct sk_buff *skb)
 
hdr->hop_limit--;
 
-   __IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTFORWDATAGRAMS);
-   __IP6_ADD_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTOCTETS, 
skb->len);
return NF_HOOK(NFPROTO_IPV6, NF_INET_FORWARD,
   net, NULL, skb, skb->dev, dst->dev,
   ip6_forward_finish);
-- 
2.14.1



Re: [patch net] devlink: convert occ_get op to separate registration

2018-04-05 Thread Jiri Pirko
Thu, Apr 05, 2018 at 10:55:58PM CEST, dsah...@gmail.com wrote:
>On 4/5/18 2:13 PM, Jiri Pirko wrote:
>> From: Jiri Pirko 
>> 
>> This resolves race during initialization where the resources with
>> ops are registered before driver and the structures used by occ_get
>> op is initialized. So keep occ_get callbacks registered only when
>> all structs are initialized.
>> 
>> The example flows, as it is in mlxsw:
>> 1) driver load/asic probe:
>>mlxsw_core
>>   -> mlxsw_sp_resources_register
>> -> mlxsw_sp_kvdl_resources_register
>>   -> devlink_resource_register IDX
>>mlxsw_spectrum
>>   -> mlxsw_sp_kvdl_init
>> -> mlxsw_sp_kvdl_parts_init
>>   -> mlxsw_sp_kvdl_part_init
>> -> devlink_resource_size_get IDX (to get the current setup
>>   size from devlink)
>> -> devlink_resource_occ_get_register IDX (register current
>>   occupancy getter)
>> 2) reload triggered by devlink command:
>>   -> mlxsw_devlink_core_bus_device_reload
>> -> mlxsw_sp_fini
>>   -> mlxsw_sp_kvdl_fini
>>  -> devlink_resource_occ_get_unregister IDX
>> (struct mlxsw_sp *mlxsw_sp is freed at this point, call to occ get
>>  which is using mlxsw_sp would cause use-after free)
>> -> mlxsw_sp_init
>>   -> mlxsw_sp_kvdl_init
>> -> mlxsw_sp_kvdl_parts_init
>>   -> mlxsw_sp_kvdl_part_init
>> -> devlink_resource_size_get IDX (to get the current setup
>>   size from devlink)
>> -> devlink_resource_occ_get_register IDX (register current
>>   occupancy getter)
>> 
>> Fixes: d9f9b9a4d05f ("devlink: Add support for resource abstraction")
>> Signed-off-by: Jiri Pirko 
>> ---
>
>
>Why won't something like the attached work as opposed to playing
>register / unregister games?

Because it is quite driver-specific. For example in netdevsim, with
current implementation you don't have to unreg as the priv is there the
whole time.

Also, I think it is more correct to register getter with the priv
directly. Not to depend on getting with via devlink struct somehow.

I'm not saying that my solution is super nice. But what you suggest
seems much more uglier to me.



>diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c 
>b/drivers/net/ethernet/mellanox/mlxsw/core.c
>index 93ea56620a24..dcded613faa6 100644
>--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
>+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
>@@ -113,6 +113,7 @@ struct mlxsw_core {
>   struct mlxsw_thermal *thermal;
>   struct mlxsw_core_port *ports;
>   unsigned int max_ports;
>+  bool reload_in_progress;
>   bool reload_fail;
>   unsigned long driver_priv[0];
>   /* driver_priv has to be always the last item */
>@@ -154,6 +155,12 @@ void *mlxsw_core_driver_priv(struct mlxsw_core 
>*mlxsw_core)
> }
> EXPORT_SYMBOL(mlxsw_core_driver_priv);
> 
>+bool mlxsw_core_reload_in_progress(struct mlxsw_core *mlxsw_core)
>+{
>+  return mlxsw_core->mlxsw_core_driver_priv;
>+}
>+EXPORT_SYMBOL(mlxsw_core_reload_in_progress);
>+
> struct mlxsw_rx_listener_item {
>   struct list_head list;
>   struct mlxsw_rx_listener rxl;
>@@ -972,12 +979,16 @@ static int mlxsw_devlink_core_bus_device_reload(struct 
>devlink *devlink)
>   if (!mlxsw_bus->reset)
>   return -EOPNOTSUPP;
> 
>+  mlxsw_core->reload_in_progress = true;
>+
>   mlxsw_core_bus_device_unregister(mlxsw_core, true);
>   mlxsw_bus->reset(mlxsw_core->bus_priv);
>   err = mlxsw_core_bus_device_register(mlxsw_core->bus_info,
>mlxsw_core->bus,
>mlxsw_core->bus_priv, true,
>devlink);
>+  mlxsw_core->reload_in_progress = false;
>+
>   if (err)
>   mlxsw_core->reload_fail = true;
>   return err;
>diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.h 
>b/drivers/net/ethernet/mellanox/mlxsw/core.h
>index 092d39399f3c..ffa1db154757 100644
>--- a/drivers/net/ethernet/mellanox/mlxsw/core.h
>+++ b/drivers/net/ethernet/mellanox/mlxsw/core.h
>@@ -60,6 +60,7 @@ struct mlxsw_bus_info;
> unsigned int mlxsw_core_max_ports(const struct mlxsw_core *mlxsw_core);
> 
> void *mlxsw_core_driver_priv(struct mlxsw_core *mlxsw_core);
>+bool mlxsw_core_reload_in_progress(struct mlxsw_core *mlxsw_core);
> 
> int mlxsw_core_driver_register(struct mlxsw_driver *mlxsw_driver);
> void mlxsw_core_driver_unregister(struct mlxsw_driver *mlxsw_driver);
>diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
>b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
>index 53fffd09d133..09b89af37d8a 100644
>--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
>+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
>@@ -3808,8 +3808,12 

[RFC PATCH net-next v5 4/4] netvsc: refactor notifier/event handling code to use the bypass framework

2018-04-05 Thread Sridhar Samudrala
Use the registration/notification framework supported by the generic
bypass infrastructure.

Signed-off-by: Sridhar Samudrala 
---
 drivers/net/hyperv/Kconfig  |   1 +
 drivers/net/hyperv/netvsc_drv.c | 219 
 2 files changed, 63 insertions(+), 157 deletions(-)

diff --git a/drivers/net/hyperv/Kconfig b/drivers/net/hyperv/Kconfig
index 936968d23559..cc3a721baa18 100644
--- a/drivers/net/hyperv/Kconfig
+++ b/drivers/net/hyperv/Kconfig
@@ -1,5 +1,6 @@
 config HYPERV_NET
tristate "Microsoft Hyper-V virtual network driver"
depends on HYPERV
+   depends on MAY_USE_BYPASS
help
  Select this option to enable the Hyper-V virtual network driver.
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index ecc84954c511..b53a2be99bd2 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "hyperv_net.h"
 
@@ -1763,46 +1764,6 @@ static void netvsc_link_change(struct work_struct *w)
rtnl_unlock();
 }
 
-static struct net_device *get_netvsc_bymac(const u8 *mac)
-{
-   struct net_device *dev;
-
-   ASSERT_RTNL();
-
-   for_each_netdev(_net, dev) {
-   if (dev->netdev_ops != _ops)
-   continue;   /* not a netvsc device */
-
-   if (ether_addr_equal(mac, dev->perm_addr))
-   return dev;
-   }
-
-   return NULL;
-}
-
-static struct net_device *get_netvsc_byref(struct net_device *vf_netdev)
-{
-   struct net_device *dev;
-
-   ASSERT_RTNL();
-
-   for_each_netdev(_net, dev) {
-   struct net_device_context *net_device_ctx;
-
-   if (dev->netdev_ops != _ops)
-   continue;   /* not a netvsc device */
-
-   net_device_ctx = netdev_priv(dev);
-   if (!rtnl_dereference(net_device_ctx->nvdev))
-   continue;   /* device is removed */
-
-   if (rtnl_dereference(net_device_ctx->vf_netdev) == vf_netdev)
-   return dev; /* a match */
-   }
-
-   return NULL;
-}
-
 /* Called when VF is injecting data into network stack.
  * Change the associated network device from VF to netvsc.
  * note: already called with rcu_read_lock
@@ -1825,43 +1786,19 @@ static rx_handler_result_t 
netvsc_vf_handle_frame(struct sk_buff **pskb)
return RX_HANDLER_ANOTHER;
 }
 
-static int netvsc_vf_join(struct net_device *vf_netdev,
- struct net_device *ndev)
+static int netvsc_vf_join(struct net_device *ndev,
+ struct net_device *vf_netdev)
 {
struct net_device_context *ndev_ctx = netdev_priv(ndev);
-   int ret;
-
-   ret = netdev_rx_handler_register(vf_netdev,
-netvsc_vf_handle_frame, ndev);
-   if (ret != 0) {
-   netdev_err(vf_netdev,
-  "can not register netvsc VF receive handler (err = 
%d)\n",
-  ret);
-   goto rx_handler_failed;
-   }
-
-   ret = netdev_upper_dev_link(vf_netdev, ndev, NULL);
-   if (ret != 0) {
-   netdev_err(vf_netdev,
-  "can not set master device %s (err = %d)\n",
-  ndev->name, ret);
-   goto upper_link_failed;
-   }
-
-   /* set slave flag before open to prevent IPv6 addrconf */
-   vf_netdev->flags |= IFF_SLAVE;
 
schedule_delayed_work(_ctx->vf_takeover, VF_TAKEOVER_INT);
 
-   call_netdevice_notifiers(NETDEV_JOIN, vf_netdev);
-
netdev_info(vf_netdev, "joined to %s\n", ndev->name);
-   return 0;
 
-upper_link_failed:
-   netdev_rx_handler_unregister(vf_netdev);
-rx_handler_failed:
-   return ret;
+   dev_hold(vf_netdev);
+   rcu_assign_pointer(ndev_ctx->vf_netdev, vf_netdev);
+
+   return 0;
 }
 
 static void __netvsc_vf_setup(struct net_device *ndev,
@@ -1914,85 +1851,84 @@ static void netvsc_vf_setup(struct work_struct *w)
rtnl_unlock();
 }
 
-static int netvsc_register_vf(struct net_device *vf_netdev)
+static int netvsc_register_vf(struct net_device *ndev,
+ struct net_device *vf_netdev)
 {
-   struct net_device *ndev;
struct net_device_context *net_device_ctx;
struct netvsc_device *netvsc_dev;
 
-   if (vf_netdev->addr_len != ETH_ALEN)
-   return NOTIFY_DONE;
-
-   /*
-* We will use the MAC address to locate the synthetic interface to
-* associate with the VF interface. If we don't find a matching
-* synthetic interface, move on.
-*/
-   ndev = get_netvsc_bymac(vf_netdev->perm_addr);
-   if (!ndev)
-   return NOTIFY_DONE;
-
net_device_ctx = netdev_priv(ndev);
netvsc_dev = 

[RFC PATCH net-next v5 0/4] Enable virtio_net to act as a backup for a passthru device

2018-04-05 Thread Sridhar Samudrala
The main motivation for this patch is to enable cloud service providers
to provide an accelerated datapath to virtio-net enabled VMs in a 
transparent manner with no/minimal guest userspace changes. This also
enables hypervisor controlled live migration to be supported with VMs that
have direct attached SR-IOV VF devices.

Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
used by hypervisor to indicate that virtio_net interface should act as
a backup for another device with the same MAC address.

Patch 2 introduces a bypass module that provides a generic interface for 
paravirtual drivers to listen for netdev register/unregister/link change
events from pci ethernet devices with the same MAC and takeover their
datapath. The notifier and event handling code is based on the existing
netvsc implementation. A paravirtual driver can use this module by
registering a set of ops and each instance of the device when it is probed.

Patch 3 extends virtio_net to use alternate datapath when available and
registered. When BACKUP feature is enabled, virtio_net driver creates
an additional 'bypass' netdev that acts as a master device and controls
2 slave devices.  The original virtio_net netdev is registered as
'backup' netdev and a passthru/vf device with the same MAC gets
registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
associated with the same 'pci' device.  The user accesses the network
interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
as default for transmits when it is available with link up and running.

Patch 4 refactors netvsc to use the registration/notification framework
supported by bypass module.

As this patch series is initially focusing on usecases where hypervisor 
fully controls the VM networking and the guest is not expected to directly 
configure any hardware settings, it doesn't expose all the ndo/ethtool ops
that are supported by virtio_net at this time. To support additional usecases,
it should be possible to enable additional ops later by caching the state
in virtio netdev and replaying when the 'active' netdev gets registered. 
 
The hypervisor needs to enable only one datapath at any time so that packets
don't get looped back to the VM over the other datapath. When a VF is
plugged, the virtio datapath link state can be marked as down.
At the time of live migration, the hypervisor needs to unplug the VF device
from the guest on the source host and reset the MAC filter of the VF to
initiate failover of datapath to virtio before starting the migration. After
the migration is completed, the destination hypervisor sets the MAC filter
on the VF and plugs it back to the guest to switch over to VF datapath.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization=151189725224231=2

v5 RFC:
  Based on Jiri's comments, moved the common functionality to a 'bypass'
  module so that the same notifier and event handlers to handle child
  register/unregister/link change events can be shared between virtio_net
  and netvsc.
  Improved error handling based on Siwei's comments.
v4:
- Based on the review comments on the v3 version of the RFC patch and
  Jakub's suggestion for the naming issue with 3 netdev solution,
  proposed 3 netdev in-driver bonding solution for virtio-net.
v3 RFC:
- Introduced 3 netdev model and pointed out a couple of issues with
  that model and proposed 2 netdev model to avoid these issues.
- Removed broadcast/multicast optimization and only use virtio as
  backup path when VF is unplugged.
v2 RFC:
- Changed VIRTIO_NET_F_MASTER to VIRTIO_NET_F_BACKUP (mst)
- made a small change to the virtio-net xmit path to only use VF datapath
  for unicasts. Broadcasts/multicasts use virtio datapath. This avoids
  east-west broadcasts to go over the PCI link.
- added suppport for the feature bit in qemu

Sridhar Samudrala (4):
  virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
  net: Introduce generic bypass module
  virtio_net: Extend virtio to use VF datapath when available
  netvsc: refactor notifier/event handling code to use the bypass
framework

 drivers/net/Kconfig |   1 +
 drivers/net/hyperv/Kconfig  |   1 +
 drivers/net/hyperv/netvsc_drv.c | 219 --
 drivers/net/virtio_net.c| 614 +++-
 include/net/bypass.h|  80 ++
 include/uapi/linux/virtio_net.h |   3 +
 net/Kconfig |  18 ++
 net/core/Makefile   |   1 +
 net/core/bypass.c   | 406 ++
 9 files changed, 1184 insertions(+), 159 deletions(-)
 create mode 100644 include/net/bypass.h
 create mode 100644 net/core/bypass.c

-- 
2.14.3


[RFC PATCH net-next v5 2/4] net: Introduce generic bypass module

2018-04-05 Thread Sridhar Samudrala
This provides a generic interface for paravirtual drivers to listen
for netdev register/unregister/link change events from pci ethernet
devices with the same MAC and takeover their datapath. The notifier and
event handling code is based on the existing netvsc implementation. A
paravirtual driver can use this module by registering a set of ops and
each instance of the device when it is probed.

Signed-off-by: Sridhar Samudrala 
---
 include/net/bypass.h |  80 ++
 net/Kconfig  |  18 +++
 net/core/Makefile|   1 +
 net/core/bypass.c| 406 +++
 4 files changed, 505 insertions(+)
 create mode 100644 include/net/bypass.h
 create mode 100644 net/core/bypass.c

diff --git a/include/net/bypass.h b/include/net/bypass.h
new file mode 100644
index ..e2dd122f951a
--- /dev/null
+++ b/include/net/bypass.h
@@ -0,0 +1,80 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018, Intel Corporation. */
+
+#ifndef _NET_BYPASS_H
+#define _NET_BYPASS_H
+
+#include 
+
+struct bypass_ops {
+   int (*register_child)(struct net_device *bypass_netdev,
+ struct net_device *child_netdev);
+   int (*join_child)(struct net_device *bypass_netdev,
+ struct net_device *child_netdev);
+   int (*unregister_child)(struct net_device *bypass_netdev,
+   struct net_device *child_netdev);
+   int (*release_child)(struct net_device *bypass_netdev,
+struct net_device *child_netdev);
+   int (*update_link)(struct net_device *bypass_netdev,
+  struct net_device *child_netdev);
+   rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
+};
+
+struct bypass_instance {
+   struct list_head list;
+   struct net_device __rcu *bypass_netdev;
+   struct bypass *bypass;
+};
+
+struct bypass {
+   struct list_head list;
+   const struct bypass_ops *ops;
+   const struct net_device_ops *netdev_ops;
+   struct list_head instance_list;
+   struct mutex lock;
+};
+
+#if IS_ENABLED(CONFIG_NET_BYPASS)
+
+struct bypass *bypass_register_driver(const struct bypass_ops *ops,
+ const struct net_device_ops *netdev_ops);
+void bypass_unregister_driver(struct bypass *bypass);
+
+int bypass_register_instance(struct bypass *bypass, struct net_device *dev);
+int bypass_unregister_instance(struct bypass *bypass, struct net_device
*dev);
+
+int bypass_unregister_child(struct net_device *child_netdev);
+
+#else
+
+static inline
+struct bypass *bypass_register_driver(const struct bypass_ops *ops,
+ const struct net_device_ops *netdev_ops)
+{
+   return NULL;
+}
+
+static inline void bypass_unregister_driver(struct bypass *bypass)
+{
+}
+
+static inline int bypass_register_instance(struct bypass *bypass,
+  struct net_device *dev)
+{
+   return 0;
+}
+
+static inline int bypass_unregister_instance(struct bypass *bypass,
+struct net_device *dev)
+{
+   return 0;
+}
+
+static inline int bypass_unregister_child(struct net_device *child_netdev)
+{
+   return 0;
+}
+
+#endif
+
+#endif /* _NET_BYPASS_H */
diff --git a/net/Kconfig b/net/Kconfig
index 0428f12c25c2..994445f4a96a 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
  on MAY_USE_DEVLINK to ensure they do not cause link errors when
  devlink is a loadable module and the driver using it is built-in.
 
+config NET_BYPASS
+   tristate "Bypass interface"
+   ---help---
+ This provides a generic interface for paravirtual drivers to listen
+ for netdev register/unregister/link change events from pci ethernet
+ devices with the same MAC and takeover their datapath. This also
+ enables live migration of a VM with direct attached VF by failing
+ over to the paravirtual datapath when the VF is unplugged.
+
+config MAY_USE_BYPASS
+   tristate
+   default m if NET_BYPASS=m
+   default y if NET_BYPASS=y || NET_BYPASS=n
+   help
+ Drivers using the bypass infrastructure should have a dependency
+ on MAY_USE_BYPASS to ensure they do not cause link errors when
+ bypass is a loadable module and the driver using it is built-in.
+
 endif   # if NET
 
 # Used by archs to tell that they support BPF JIT compiler plus which flavour.
diff --git a/net/core/Makefile b/net/core/Makefile
index 6dbbba8c57ae..a9727ed1c8fc 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -30,3 +30,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
 obj-$(CONFIG_HWBM) += hwbm.o
 obj-$(CONFIG_NET_DEVLINK) += devlink.o
 obj-$(CONFIG_GRO_CELLS) += gro_cells.o
+obj-$(CONFIG_NET_BYPASS) += bypass.o
diff --git a/net/core/bypass.c b/net/core/bypass.c
new file 

[RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-05 Thread Sridhar Samudrala
This patch enables virtio_net to switch over to a VF datapath when a VF
netdev is present with the same MAC address. It allows live migration
of a VM with a direct attached VF without the need to setup a bond/team
between a VF and virtio net device in the guest.

The hypervisor needs to enable only one datapath at any time so that
packets don't get looped back to the VM over the other datapath. When a VF
is plugged, the virtio datapath link state can be marked as down. The
hypervisor needs to unplug the VF device from the guest on the source host
and reset the MAC filter of the VF to initiate failover of datapath to
virtio before starting the migration. After the migration is completed,
the destination hypervisor sets the MAC filter on the VF and plugs it back
to the guest to switch over to VF datapath.

When BACKUP feature is enabled, an additional netdev(bypass netdev) is
created that acts as a master device and tracks the state of the 2 lower
netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
passthru device with the same MAC is registered as 'active' netdev.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization=151189725224231=2

Signed-off-by: Sridhar Samudrala 
---
 drivers/net/Kconfig  |   1 +
 drivers/net/virtio_net.c | 612 ++-
 2 files changed, 612 insertions(+), 1 deletion(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 891846655000..9e2cf61fd1c1 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -331,6 +331,7 @@ config VETH
 config VIRTIO_NET
tristate "Virtio network driver"
depends on VIRTIO
+   depends on MAY_USE_BYPASS
---help---
  This is the virtual network driver for virtio.  It can be used with
  QEMU based VMMs (like KVM or Xen).  Say Y or M.
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index befb5944f3fd..86b2f8f2947d 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -30,8 +30,11 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
+#include 
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -206,6 +209,9 @@ struct virtnet_info {
u32 speed;
 
unsigned long guest_offloads;
+
+   /* upper netdev created when BACKUP feature enabled */
+   struct net_device __rcu *bypass_netdev;
 };
 
 struct padded_vnet_hdr {
@@ -2275,6 +2281,22 @@ static int virtnet_xdp(struct net_device *dev, struct 
netdev_bpf *xdp)
}
 }
 
+static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
+ size_t len)
+{
+   struct virtnet_info *vi = netdev_priv(dev);
+   int ret;
+
+   if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
+   return -EOPNOTSUPP;
+
+   ret = snprintf(buf, len, "_bkup");
+   if (ret >= len)
+   return -EOPNOTSUPP;
+
+   return 0;
+}
+
 static const struct net_device_ops virtnet_netdev = {
.ndo_open= virtnet_open,
.ndo_stop= virtnet_close,
@@ -2292,6 +2314,7 @@ static const struct net_device_ops virtnet_netdev = {
.ndo_xdp_xmit   = virtnet_xdp_xmit,
.ndo_xdp_flush  = virtnet_xdp_flush,
.ndo_features_check = passthru_features_check,
+   .ndo_get_phys_port_name = virtnet_get_phys_port_name,
 };
 
 static void virtnet_config_changed_work(struct work_struct *work)
@@ -2689,6 +2712,576 @@ static int virtnet_validate(struct virtio_device *vdev)
return 0;
 }
 
+/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
+ * When BACKUP feature is enabled, an additional netdev(bypass netdev)
+ * is created that acts as a master device and tracks the state of the
+ * 2 lower netdevs. The original virtio_net netdev is registered as
+ * 'backup' netdev and a passthru device with the same MAC is registered
+ * as 'active' netdev.
+ */
+
+/* bypass state maintained when BACKUP feature is enabled */
+struct virtnet_bypass_info {
+   /* passthru netdev with same MAC */
+   struct net_device __rcu *active_netdev;
+
+   /* virtio_net netdev */
+   struct net_device __rcu *backup_netdev;
+
+   /* active netdev stats */
+   struct rtnl_link_stats64 active_stats;
+
+   /* backup netdev stats */
+   struct rtnl_link_stats64 backup_stats;
+
+   /* aggregated stats */
+   struct rtnl_link_stats64 bypass_stats;
+
+   /* spinlock while updating stats */
+   spinlock_t stats_lock;
+};
+
+static int virtnet_bypass_open(struct net_device *dev)
+{
+   struct virtnet_bypass_info *vbi = netdev_priv(dev);
+   struct net_device *active_netdev, *backup_netdev;
+   int err;
+
+   netif_carrier_off(dev);
+   netif_tx_wake_all_queues(dev);
+
+   active_netdev = rtnl_dereference(vbi->active_netdev);

[RFC PATCH net-next v5 1/4] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit

2018-04-05 Thread Sridhar Samudrala
This feature bit can be used by hypervisor to indicate virtio_net device to
act as a backup for another device with the same MAC address.

VIRTIO_NET_F_BACKUP is defined as bit 62 as it is a device feature bit.

Signed-off-by: Sridhar Samudrala 
---
 drivers/net/virtio_net.c| 2 +-
 include/uapi/linux/virtio_net.h | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7b187ec7411e..befb5944f3fd 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2962,7 +2962,7 @@ static struct virtio_device_id id_table[] = {
VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
VIRTIO_NET_F_CTRL_MAC_ADDR, \
VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
-   VIRTIO_NET_F_SPEED_DUPLEX
+   VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
 
 static unsigned int features[] = {
VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 5de6ed37695b..c7c35fd1a5ed 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,9 @@
 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
 
+#define VIRTIO_NET_F_BACKUP  62/* Act as backup for another device
+* with the same MAC.
+*/
 #define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Device set linkspeed and duplex */
 
 #ifndef VIRTIO_NET_NO_LEGACY
-- 
2.14.3



Re: [PATCH net-next 6/6] netdevsim: Add simple FIB resource controller via devlink

2018-04-05 Thread David Ahern
On 4/5/18 2:10 PM, David Ahern wrote:
> 
> The ASIC here is the kernel tables in a namespace. It does not make
> sense to have 2 devlink instances for a single namespace.

I put this example controller in netdevsim per a suggestion from Ido.
The netdevsim seemed like a good idea given that modules intention --
testing network facilities. Perhaps I should have done this as a
completely standalone module ...

The intention is to treat the kernel's tables *per namespace* as a
standalone entity that can be managed very similar to ASIC resources.
Given that I can add a resource controller module
(drivers/net/kern_res_mgr.c?) that creates a 'struct device' per network
namespace with a devlink instance. In this case the device would very
much be tied to the namespace 1:1.


Re: [patch net] devlink: convert occ_get op to separate registration

2018-04-05 Thread David Ahern
On 4/5/18 2:55 PM, David Ahern wrote:
> @@ -154,6 +155,12 @@ void *mlxsw_core_driver_priv(struct mlxsw_core 
> *mlxsw_core)
>  }
>  EXPORT_SYMBOL(mlxsw_core_driver_priv);
>  
> +bool mlxsw_core_reload_in_progress(struct mlxsw_core *mlxsw_core)
> +{
> + return mlxsw_core->mlxsw_core_driver_priv;

Oops, that is supposed to be:
return mlxsw_core->reload_in_progress;

but I think you get the point.

> +}
> +EXPORT_SYMBOL(mlxsw_core_reload_in_progress);
> +
>  struct mlxsw_rx_listener_item {
>   struct list_head list;
>   struct mlxsw_rx_listener rxl;



Re: [patch net] devlink: convert occ_get op to separate registration

2018-04-05 Thread David Ahern
On 4/5/18 2:13 PM, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> This resolves race during initialization where the resources with
> ops are registered before driver and the structures used by occ_get
> op is initialized. So keep occ_get callbacks registered only when
> all structs are initialized.
> 
> The example flows, as it is in mlxsw:
> 1) driver load/asic probe:
>mlxsw_core
>   -> mlxsw_sp_resources_register
> -> mlxsw_sp_kvdl_resources_register
>   -> devlink_resource_register IDX
>mlxsw_spectrum
>   -> mlxsw_sp_kvdl_init
> -> mlxsw_sp_kvdl_parts_init
>   -> mlxsw_sp_kvdl_part_init
> -> devlink_resource_size_get IDX (to get the current setup
>   size from devlink)
> -> devlink_resource_occ_get_register IDX (register current
>   occupancy getter)
> 2) reload triggered by devlink command:
>   -> mlxsw_devlink_core_bus_device_reload
> -> mlxsw_sp_fini
>   -> mlxsw_sp_kvdl_fini
>   -> devlink_resource_occ_get_unregister IDX
> (struct mlxsw_sp *mlxsw_sp is freed at this point, call to occ get
>  which is using mlxsw_sp would cause use-after free)
> -> mlxsw_sp_init
>   -> mlxsw_sp_kvdl_init
> -> mlxsw_sp_kvdl_parts_init
>   -> mlxsw_sp_kvdl_part_init
> -> devlink_resource_size_get IDX (to get the current setup
>   size from devlink)
> -> devlink_resource_occ_get_register IDX (register current
>   occupancy getter)
> 
> Fixes: d9f9b9a4d05f ("devlink: Add support for resource abstraction")
> Signed-off-by: Jiri Pirko 
> ---


Why won't something like the attached work as opposed to playing
register / unregister games?
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c 
b/drivers/net/ethernet/mellanox/mlxsw/core.c
index 93ea56620a24..dcded613faa6 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -113,6 +113,7 @@ struct mlxsw_core {
struct mlxsw_thermal *thermal;
struct mlxsw_core_port *ports;
unsigned int max_ports;
+   bool reload_in_progress;
bool reload_fail;
unsigned long driver_priv[0];
/* driver_priv has to be always the last item */
@@ -154,6 +155,12 @@ void *mlxsw_core_driver_priv(struct mlxsw_core *mlxsw_core)
 }
 EXPORT_SYMBOL(mlxsw_core_driver_priv);
 
+bool mlxsw_core_reload_in_progress(struct mlxsw_core *mlxsw_core)
+{
+   return mlxsw_core->mlxsw_core_driver_priv;
+}
+EXPORT_SYMBOL(mlxsw_core_reload_in_progress);
+
 struct mlxsw_rx_listener_item {
struct list_head list;
struct mlxsw_rx_listener rxl;
@@ -972,12 +979,16 @@ static int mlxsw_devlink_core_bus_device_reload(struct 
devlink *devlink)
if (!mlxsw_bus->reset)
return -EOPNOTSUPP;
 
+   mlxsw_core->reload_in_progress = true;
+
mlxsw_core_bus_device_unregister(mlxsw_core, true);
mlxsw_bus->reset(mlxsw_core->bus_priv);
err = mlxsw_core_bus_device_register(mlxsw_core->bus_info,
 mlxsw_core->bus,
 mlxsw_core->bus_priv, true,
 devlink);
+   mlxsw_core->reload_in_progress = false;
+
if (err)
mlxsw_core->reload_fail = true;
return err;
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.h 
b/drivers/net/ethernet/mellanox/mlxsw/core.h
index 092d39399f3c..ffa1db154757 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.h
@@ -60,6 +60,7 @@ struct mlxsw_bus_info;
 unsigned int mlxsw_core_max_ports(const struct mlxsw_core *mlxsw_core);
 
 void *mlxsw_core_driver_priv(struct mlxsw_core *mlxsw_core);
+bool mlxsw_core_reload_in_progress(struct mlxsw_core *mlxsw_core);
 
 int mlxsw_core_driver_register(struct mlxsw_driver *mlxsw_driver);
 void mlxsw_core_driver_unregister(struct mlxsw_driver *mlxsw_driver);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 53fffd09d133..09b89af37d8a 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -3808,8 +3808,12 @@ static const struct mlxsw_config_profile 
mlxsw_sp_config_profile = {
 static u64 mlxsw_sp_resource_kvd_linear_occ_get(struct devlink *devlink)
 {
struct mlxsw_core *mlxsw_core = devlink_priv(devlink);
-   struct mlxsw_sp *mlxsw_sp = mlxsw_core_driver_priv(mlxsw_core);
+   struct mlxsw_sp *mlxsw_sp;
+
+   if (mlxsw_core_reload_in_progress(mlxsw_core))
+   return 0;
 
+   mlxsw_sp = mlxsw_core_driver_priv(mlxsw_core);
return mlxsw_sp_kvdl_occ_get(mlxsw_sp);
 }
 
diff --git 

Re: [iovisor-dev] Best userspace programming API for XDP features query to kernel?

2018-04-05 Thread Jesper Dangaard Brouer
On Thu, 5 Apr 2018 12:37:19 +0200
Daniel Borkmann  wrote:

> On 04/04/2018 02:28 PM, Jesper Dangaard Brouer via iovisor-dev wrote:
> > Hi Suricata people,
> > 
> > When Eric Leblond (and I helped) integrated XDP in Suricata, we ran
> > into the issue, that at Suricata load/start time, we cannot determine
> > if the chosen XDP config options, like xdp-cpu-redirect[1], is valid on
> > this HW (e.g require driver XDP_REDIRECT support and bpf cpumap).
> > 
> > We would have liked a way to report that suricata.yaml config was
> > invalid for this hardware/setup.  Now, it just loads, and packets gets
> > silently dropped by XDP (well a WARN_ONCE and catchable via tracepoints).
> > 
> > My question to suricata developers: (Q1) Do you already have code that
> > query the kernel or drivers for features?
> > 
> > At the IOvisor call (2 weeks ago), we discussed two options of exposing
> > XDP features avail in a given driver.
> > 
> > Option#1: Extend existing ethtool -k/-K "offload and other features"
> > with some XDP features, that userspace can query. (Do you already query
> > offloads, regarding Q1)
> > 
> > Option#2: Invent a new 'ip link set xdp' netlink msg with a query option.  
> 
> I don't really mind if you go via ethtool, as long as we handle this
> generically from there and e.g. call the dev's ndo_bpf handler such that
> we keep all the information in one place. This can be a new ndo_bpf command
> e.g. XDP_QUERY_FEATURES or such.

Just to be clear: notice as Victor points out[2], they are programmable
going though the IOCTL (SIOCETHTOOL) and not using cmdline tools.

[2] https://github.com/OISF/suricata/blob/master/src/util-ioctl.c#L326

If you want everything to go through the drivers ndo_bpf call anyway
(which userspace API is netlink based) then at what point to you
want drivers to call their own ndo_bpf, when activated though their
ethtool_ops ? (Sorry, but I don't follow the flow you are proposing)

Notice, I'm not directly against using the drivers ndo_bpf call.  I can
see it does provide kernel more flexibility than the ethtool IOCTL.


> More specifically, how would such feature mask look like? How fine grained
> would this be? When you add a new minor feature to, say, cpumap that not
> all drivers support yet, we'd need a new flag each time, no?

No, CPUMAP is not a driver level feature, and thus does not require a
features flag exposed by the driver.  CPUMAP depends on the driver
feature XDP_REDIRECT.

It is important we separate driver-level features and BPF/XDP core
features.  I feel that we constantly talk past each-other when we mix
that up.

It is true that Suricata _also_ need to detect if the running kernel
support the map type called: BPF_MAP_TYPE_CPUMAP.  *BUT* that is a
completely separate mechanism.  It is a core kernel bpf feature, and I
have accepted that this can only be done via probing the kernel (simply
use the bpf syscall and try to create this map type).

Here, I want to discuss how drivers expose/tell userspace that they
support a given feature: Specifically a bit for: XDP_REDIRECT action
support.


> Same for meta data,

Well, not really.  It would be a "nice-to-have", but not strictly
needed as a feature bit.  XDP meta-data is controlled via a helper.
And the BPF-prog can detect/see runtime, that the helper bpf_xdp_adjust_meta
returns -ENOTSUPP (and need to check the ret value anyhow).  Thus,
there is that not much gained by exposing this to be detected setup
time, as all drivers should eventually support this, and we can detect
it runtime.

The missing XDP_REDIRECT action features bit it different, as the
BPF-prog cannot detect runtime that this is an unsupported action.
Plus, setup time we cannot query the driver for supported XDP actions.


> then potentially for the redirect memory return work,

I'm not sure the redirect memory return types, belong here.  First of
all it is per RX-ring.  Second, some other config method will likely be
config interface.  Like with AF_XDP-zero-copy it is a new NDO. So, it
is more losely coupled.

> or the af_xdp bits, 

No need for bit for AF_XDP in copy-mode (current RFC), as this only
depend on driver supporting XDP_REDIRECT action.

For AF_XDP in zero-copy mode, then yes we need a way for userspace to
"see" if this mode is supported by the driver.  But it might not need a
feature bit here... as the bind() call (which knows the ifindex) could
fail when it tried to enable this ZC mode.  It would make userspace's
live easier to add ZC as a driver feature bit.


> the xdp_rxq_info would have needed it, etc. 

Same comment as mem-type, not necessarily, as it is more losely coupled
to XDP.

> What about nfp in terms of XDP
> offload capabilities, should they be included as well or is probing to load
> the program and see if it loads/JITs as we do today just fine (e.g. you'd
> otherwise end up with extra flags on a per BPF helper basis)?

No, flags per BPF helper basis. As I've described above, helper belong
to the 

Re: marvell switch

2018-04-05 Thread Andrew Lunn
> > Hi Ran
> >
> > The Marvell driver makes each port act like a normal Linux network
> > interface. So if you want to enable a port, do
> >
> > ip link set lan0 up
> >
> > Want to add an ip address to a port
> >
> > ip addr add 10.42.42.42/24 dev lan0
> >
> > Want to bridge two ports
> >
> > ip link add name br0 type bridge
> > ip link set dev br0 up
> > ip link set dev lan0 master br0
> > ip link set dev lan1 master br0
> >
> > Just treat them as normal interfaces.
> >
> 
> If I may please ask,
> What is the purpose of using bridge for configuring switch interfaces.
> Is it in order to isolate some ports from others?
> I ask because according to my understanding the default configuration of
> the driver is to enable switch in "flat" configuration, i.e. as if all
> ports are connected to each other.

Please think about what i said. They are standard Linux network
interfaces. Do standard Linux network interfaces bridge themselves
together by default? No, you need to configure a bridge.

 Andrew


Re: [PATCH 2/2] net: phy: dp83640: Read strapped configuration settings

2018-04-05 Thread Andrew Lunn
On Thu, Apr 05, 2018 at 10:30:45PM +0200, Esben Haabendal wrote:
> Florian Fainelli  writes:
> 
> > On 04/05/2018 04:44 AM, esben.haaben...@gmail.com wrote:
> >> From: Esben Haabendal 
> >> 
> >> Read configration settings, to allow automatic forced speed/duplex setup
> >> by hardware strapping.
> >
> > OK but why? What problem is this solving for you?
> 
> It is ensuring that the PHY is configured according to the HW design.
> If the HW design has set the strap configuration to fx. fixed 100 Mbit
> full-duplex, this avoids Linux configuring it for auto-negotiation.

Hi Esben

Are you sure it contains the HW strapping? Is the driver hitting the
PHY with a hard reset to make it go back to the strapping defaults?
Or could it still contain whatever state the last boot of Linux, or
maybe the bootloader, left the PHY in?

  Andrew


[PATCH v2] net: phy: marvell: Enable interrupt function on LED2 pin

2018-04-05 Thread Esben Haabendal
From: Esben Haabendal 

The LED2[2]/INTn pin on Marvell 88E1318S as well as 88E1510/12/14/18 needs
to be configured to be usable as interrupt not only when WOL is enabled,
but whenever we rely on interrupts from the PHY.

Signed-off-by: Esben Haabendal 
Cc: Rasmus Villemoes 
---
 drivers/net/phy/marvell.c | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
index 0e0978d8a0eb..a6ad0255c512 100644
--- a/drivers/net/phy/marvell.c
+++ b/drivers/net/phy/marvell.c
@@ -828,6 +828,22 @@ static int m88e1121_config_init(struct phy_device *phydev)
return marvell_config_init(phydev);
 }
 
+static int m88e1318_config_init(struct phy_device *phydev)
+{
+   if (phy_interrupt_is_valid(phydev)) {
+   int err = phy_modify_paged(
+   phydev, MII_MARVELL_LED_PAGE,
+   MII_88E1318S_PHY_LED_TCR,
+   MII_88E1318S_PHY_LED_TCR_FORCE_INT,
+   MII_88E1318S_PHY_LED_TCR_INTn_ENABLE |
+   MII_88E1318S_PHY_LED_TCR_INT_ACTIVE_LOW);
+   if (err < 0)
+   return err;
+   }
+
+   return m88e1121_config_init(phydev);
+}
+
 static int m88e1510_config_init(struct phy_device *phydev)
 {
int err;
@@ -870,7 +886,7 @@ static int m88e1510_config_init(struct phy_device *phydev)
phydev->advertising &= ~pause;
}
 
-   return m88e1121_config_init(phydev);
+   return m88e1318_config_init(phydev);
 }
 
 static int m88e1118_config_aneg(struct phy_device *phydev)
@@ -2086,7 +2102,7 @@ static struct phy_driver marvell_drivers[] = {
.features = PHY_GBIT_FEATURES,
.flags = PHY_HAS_INTERRUPT,
.probe = marvell_probe,
-   .config_init = _config_init,
+   .config_init = _config_init,
.config_aneg = _config_aneg,
.read_status = _read_status,
.ack_interrupt = _ack_interrupt,
-- 
2.16.3



Re: [PATCH 1/2] net: phy: Helper function for reading strapped configuration values

2018-04-05 Thread Esben Haabendal
Florian Fainelli  writes:

> On 04/05/2018 04:44 AM, esben.haaben...@gmail.com wrote:
>> From: Esben Haabendal 
>> 
>> Add a function for use in PHY driver probe functions, reading current
>> autoneg, speed and duplex configuration from BMCR register.
>> 
>> Useful for PHY that supports hardware strapped configuration, enabling
>> Linux to respect that configuration (i.e. strapped non-autoneg
>> configuration).
>> 
>> Signed-off-by: Esben Haabendal 
>> Cc: Rasmus Villemoes 
>> ---
>>  drivers/net/phy/phy_device.c | 41 +
>>  include/linux/phy.h  |  1 +
>>  2 files changed, 42 insertions(+)
>> 
>> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
>> index 74664a6c0cdc..cc52ff2a2344 100644
>> --- a/drivers/net/phy/phy_device.c
>> +++ b/drivers/net/phy/phy_device.c
>> @@ -1673,6 +1673,47 @@ int genphy_config_init(struct phy_device *phydev)
>>  }
>>  EXPORT_SYMBOL(genphy_config_init);
>>  
>> +/**
>> + * genphy_read_config - read configuration from PHY
>> + * @phydev: target phy_device struct
>> + *
>> + * Description: Reads MII_BMCR and sets phydev autoneg, speed and duplex
>> + * accordingly.  For use in driver probe functions, to respect strapped
>> + * configuration settings.
>> + */
>> +int genphy_read_config(struct phy_device *phydev)
>
> This duplicates what already exists, in part at least within
> genphy_read_status() can you find a way to use that?

Make a small static function for updating duplex and speed fields from a
BMCR value.  It will not be big re-use, but it would make sense.  I will
do that in next patch version.

/Esben


Re: [PATCH 2/2] net: phy: dp83640: Read strapped configuration settings

2018-04-05 Thread Esben Haabendal
Florian Fainelli  writes:

> On 04/05/2018 04:44 AM, esben.haaben...@gmail.com wrote:
>> From: Esben Haabendal 
>> 
>> Read configration settings, to allow automatic forced speed/duplex setup
>> by hardware strapping.
>
> OK but why? What problem is this solving for you?

It is ensuring that the PHY is configured according to the HW design.
If the HW design has set the strap configuration to fx. fixed 100 Mbit
full-duplex, this avoids Linux configuring it for auto-negotiation.

> In general, we do not really want to preserve too much of what the PHY
> has been previously configured with,

Even when this "previous" configuration is the configuration selected by
the board configuration?

> provided that the PHY driver can re-instate these configuration
> values.

This is sort of what I try to do here.  The PHY driver needs to check
the BMCR register to figure out what the strapped configuration was.
Without that, it is necessary for software configuration to duplicate
the exact same configuration as is encoded in the strap configuration in
order for the PHY to be configured as it is supposed to.

> I just wonder how this can be robust when you connect this PHY with
> auto-negotiation disabled to a peer that expects a set of link
> parameters not covered by the default advertisement values?

I assume it will fail just as it will if you use ethtool to configure
the PHY wrongly.

> This really looks like a recipe for disaster when you could just
> disable auto-negotiation with ethtool.

The current PHY driver approach to always default to enable
auto-negotation, and then allow user-space to configure it differently
with ethtool.

With this patch, the default configuration is chosen based on the strap
configuration, but user-space can still change the configuration with
ethtool if needed / desired.

I don't think it is such a big change.

Bringing up the PHY with a default configuration according to HW
strapping instead of an in-kernel SW hard-coded value sounds like a good
idea to me.

/Esben


Re: [PATCH 2/2] net: phy: dp83640: Read strapped configuration settings

2018-04-05 Thread Esben Haabendal
Florian Fainelli  writes:

> On 04/05/2018 04:44 AM, esben.haaben...@gmail.com wrote:
>> From: Esben Haabendal 
>> 
>> Read configration settings, to allow automatic forced speed/duplex setup
>> by hardware strapping.
>
> OK but why? What problem is this solving for you?

It is ensuring that the PHY is configured according to the HW design.
If the HW design has set the strap configuration to fx. fixed 100 Mbit
full-duplex, this avoids Linux configuring it for auto-negotiation.

> In general, we do not really want to preserve too much of what the PHY
> has been previously configured with,

Even when this "previous" configuration is the configuration selected by
the board configuration?

> provided that the PHY driver can re-instate these configuration
> values.

This is sort of what I try to do here.  The PHY driver needs to check
the BMCR register to figure out what the strapped configuration was.
Without that, it is necessary for software configuration to duplicate
the exact same configuration as is encoded in the strap configuration in
order for the PHY to be configured as it is supposed to.

> I just wonder how this can be robust when you connect this PHY with
> auto-negotiation disabled to a peer that expects a set of link
> parameters not covered by the default advertisement values?

I assume it will fail just as it will if you use ethtool to configure
the PHY wrongly.

> This really looks like a recipe for disaster when you could just
> disable auto-negotiation with ethtool.

The current PHY driver approach to always default to enable
auto-negotation, and then allow user-space to configure it differently
with ethtool.

With this patch, the default configuration is chosen based on the strap
configuration, but user-space can still change the configuration with
ethtool if needed / desired.

I don't think it is such a big change.

Bringing up the PHY with a default configuration according to HW
strapping instead of an in-kernel SW hard-coded value sounds like a good
idea to me.

/Esben


Re: [PATCH 1/2] net: phy: Helper function for reading strapped configuration values

2018-04-05 Thread Esben Haabendal
Florian Fainelli  writes:

> On 04/05/2018 04:44 AM, esben.haaben...@gmail.com wrote:
>> From: Esben Haabendal 
>> 
>> Add a function for use in PHY driver probe functions, reading current
>> autoneg, speed and duplex configuration from BMCR register.
>> 
>> Useful for PHY that supports hardware strapped configuration, enabling
>> Linux to respect that configuration (i.e. strapped non-autoneg
>> configuration).
>> 
>> Signed-off-by: Esben Haabendal 
>> Cc: Rasmus Villemoes 
>> ---
>>  drivers/net/phy/phy_device.c | 41 +
>>  include/linux/phy.h  |  1 +
>>  2 files changed, 42 insertions(+)
>> 
>> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
>> index 74664a6c0cdc..cc52ff2a2344 100644
>> --- a/drivers/net/phy/phy_device.c
>> +++ b/drivers/net/phy/phy_device.c
>> @@ -1673,6 +1673,47 @@ int genphy_config_init(struct phy_device *phydev)
>>  }
>>  EXPORT_SYMBOL(genphy_config_init);
>>  
>> +/**
>> + * genphy_read_config - read configuration from PHY
>> + * @phydev: target phy_device struct
>> + *
>> + * Description: Reads MII_BMCR and sets phydev autoneg, speed and duplex
>> + * accordingly.  For use in driver probe functions, to respect strapped
>> + * configuration settings.
>> + */
>> +int genphy_read_config(struct phy_device *phydev)
>
> This duplicates what already exists, in part at least within
> genphy_read_status() can you find a way to use that?

Make a small static function for updating duplex and speed fields from a
BMCR value.  It will not be big re-use, but it would make sense.  I will
do that in next patch version.

/Esben


Re: Enable and configure storm prevention in a network device

2018-04-05 Thread David Miller
From: Murali Karicheri 
Date: Thu, 5 Apr 2018 16:14:49 -0400

> Is there a standard way to implement and configure storm prevention
> in a Linux network device?

What kind of "storm", an interrupt storm?


[patch net] devlink: convert occ_get op to separate registration

2018-04-05 Thread Jiri Pirko
From: Jiri Pirko 

This resolves race during initialization where the resources with
ops are registered before driver and the structures used by occ_get
op is initialized. So keep occ_get callbacks registered only when
all structs are initialized.

The example flows, as it is in mlxsw:
1) driver load/asic probe:
   mlxsw_core
  -> mlxsw_sp_resources_register
-> mlxsw_sp_kvdl_resources_register
  -> devlink_resource_register IDX
   mlxsw_spectrum
  -> mlxsw_sp_kvdl_init
-> mlxsw_sp_kvdl_parts_init
  -> mlxsw_sp_kvdl_part_init
-> devlink_resource_size_get IDX (to get the current setup
  size from devlink)
-> devlink_resource_occ_get_register IDX (register current
  occupancy getter)
2) reload triggered by devlink command:
  -> mlxsw_devlink_core_bus_device_reload
-> mlxsw_sp_fini
  -> mlxsw_sp_kvdl_fini
-> devlink_resource_occ_get_unregister IDX
(struct mlxsw_sp *mlxsw_sp is freed at this point, call to occ get
 which is using mlxsw_sp would cause use-after free)
-> mlxsw_sp_init
  -> mlxsw_sp_kvdl_init
-> mlxsw_sp_kvdl_parts_init
  -> mlxsw_sp_kvdl_part_init
-> devlink_resource_size_get IDX (to get the current setup
  size from devlink)
-> devlink_resource_occ_get_register IDX (register current
  occupancy getter)

Fixes: d9f9b9a4d05f ("devlink: Add support for resource abstraction")
Signed-off-by: Jiri Pirko 
---
v1->v2:
- rebased on top or netdevsim changes
- improved description
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 24 ++-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |  1 -
 .../net/ethernet/mellanox/mlxsw/spectrum_kvdl.c| 67 
 drivers/net/netdevsim/devlink.c| 65 +--
 include/net/devlink.h  | 40 
 net/core/devlink.c | 74 +++---
 6 files changed, 165 insertions(+), 106 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 53fffd09d133..ca38a30fbe91 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -3805,18 +3805,6 @@ static const struct mlxsw_config_profile 
mlxsw_sp_config_profile = {
},
 };
 
-static u64 mlxsw_sp_resource_kvd_linear_occ_get(struct devlink *devlink)
-{
-   struct mlxsw_core *mlxsw_core = devlink_priv(devlink);
-   struct mlxsw_sp *mlxsw_sp = mlxsw_core_driver_priv(mlxsw_core);
-
-   return mlxsw_sp_kvdl_occ_get(mlxsw_sp);
-}
-
-static const struct devlink_resource_ops mlxsw_sp_resource_kvd_linear_ops = {
-   .occ_get = mlxsw_sp_resource_kvd_linear_occ_get,
-};
-
 static void
 mlxsw_sp_resource_size_params_prepare(struct mlxsw_core *mlxsw_core,
  struct devlink_resource_size_params 
*kvd_size_params,
@@ -3877,8 +3865,7 @@ static int mlxsw_sp_resources_register(struct mlxsw_core 
*mlxsw_core)
err = devlink_resource_register(devlink, MLXSW_SP_RESOURCE_NAME_KVD,
kvd_size, MLXSW_SP_RESOURCE_KVD,
DEVLINK_RESOURCE_ID_PARENT_TOP,
-   _size_params,
-   NULL);
+   _size_params);
if (err)
return err;
 
@@ -3887,8 +3874,7 @@ static int mlxsw_sp_resources_register(struct mlxsw_core 
*mlxsw_core)
linear_size,
MLXSW_SP_RESOURCE_KVD_LINEAR,
MLXSW_SP_RESOURCE_KVD,
-   _size_params,
-   _sp_resource_kvd_linear_ops);
+   _size_params);
if (err)
return err;
 
@@ -3905,8 +3891,7 @@ static int mlxsw_sp_resources_register(struct mlxsw_core 
*mlxsw_core)
double_size,
MLXSW_SP_RESOURCE_KVD_HASH_DOUBLE,
MLXSW_SP_RESOURCE_KVD,
-   _double_size_params,
-   NULL);
+   _double_size_params);
if (err)
return err;
 
@@ -3915,8 +3900,7 @@ static int mlxsw_sp_resources_register(struct mlxsw_core 
*mlxsw_core)
single_size,
MLXSW_SP_RESOURCE_KVD_HASH_SINGLE,

Enable and configure storm prevention in a network device

2018-04-05 Thread Murali Karicheri
Hello Netdev experts,

Is there a standard way to implement and configure storm prevention in a Linux
network device?

Our NIC firmware has the capability to enable storm prevention which is 
implemented
using a credit based scheme. The configuration is how many number of multicast +
broadcast packets allowed in a certain period of time. Is there a standard way
of passing this information from user space to driver? 

Thanks in advance for your input!

-- 
Murali Karicheri
Linux Kernel, Keystone


Re: [PATCH net-next 6/6] netdevsim: Add simple FIB resource controller via devlink

2018-04-05 Thread David Ahern
On 4/5/18 11:27 AM, Jiri Pirko wrote:
> Wed, Mar 28, 2018 at 03:22:00AM CEST, d...@cumulusnetworks.com wrote:
>> Add devlink support to netdevsim and use it to implement a simple,
>> profile based resource controller. Only one controller is needed
>> per namespace, so the first netdevsim netdevice in a namespace
>> registers with devlink. If that device is deleted, the resource
>> settings are deleted.
> 
> I don't understand why you add 1:1 fixed relationship between
> netnamespace and devlink instance. That is highly misleading and reader
> might think that those 2 are somehow related. They are not. You can have
> multiple devlink instances for many ports in a single namespace.

The netdevsim devlink instance is an example of limiting the number of
FIB entries and FIB rules for a namespace. It is currently limited to
the init_net based on past discussion.

It does not make sense to have multiple resource controllers for the
same network namespace, hence the limit of only registering with devlink
on the first device create.

> 
> Could you please clarify?
> 
> Also, to see the relationship between individual netdevsim netdevices
> and the parent devlink instance, we should use devlink_port
> instances, like this: 
> 
>   devlink1  devlink2
>||||
>  dl_port1_1 dlport1_2   dlport2_1 dlport2_2
>||||
>  eth0  eth1 eth2 eth3
> 
> Note that "devlink instance" reprensents one ASIC.
> The address of the devlink instance is the bus address of the ASIC.
> Here, you use address of some/first netdevsim netdev instance.

The ASIC here is the kernel tables in a namespace. It does not make
sense to have 2 devlink instances for a single namespace.

> 
> The way it is implemented in netdevsim by this patch is wrong on
> so many levels :(
> 
> Could you please fix this? I'm more than happy to help you with this,
> please say so. Thanks!

What is there to fix?

Not creating a netdevsim device per netdevsim netdevice? That is
completely unrelated to the devlink change.


Re: [PATCH iproute2] l2tp: no need to export session offsets in JSON output

2018-04-05 Thread Stephen Hemminger
On Thu, 5 Apr 2018 19:24:17 +0200
Guillaume Nault  wrote:

> The offset and peer_offset parameters are only printed to avoid
> confusing external scripts that may parse "ip l2tp show session"
> output. There's no reason to keep them in JSON.
> 
> Signed-off-by: Guillaume Nault 
> ---

Applied thanks.


Re: [PATCH net 0/6] net: better validate user provided tunnel names

2018-04-05 Thread Eric Dumazet


On 04/05/2018 12:21 PM, David Miller wrote:
> From: Eric Dumazet 
> Date: Thu,  5 Apr 2018 06:39:25 -0700
> 
>> This series changes dev_valid_name() to not attempt reading
>> a possibly too long user-provided device name, then use
>> this helper in five different tunnel providers.
> 
> Series applied and queued up for -stable, thanks Eric.
> 
> Reading over this series makes me wonder if we generally have an
> off-by-one bug for device names which are exactly IFNAMSIZ.
> 
> We validate the size using the test:
> 
>   if (strlen(name) >= IFNAMSIZ)
>   return ERROR;
> 
> and thusly after Eric's changes:
> 
>   if (strnlen(name, IFNAMSIZ) == IFNAMSIZ)
>   return ERROR;
> 
> This value computed by str{,n}len() doesn't include the trailing null
> byte.
> 
> So we will accept a name that has exactly IFNAMSIZ bytes long not
> including the trailing null.

In this case strnlen(name, IFNAMSIZ) returns IFNAMSIZ.

So  (strnlen(name, IFNAMSIZ) == IFNAMSIZ) would definitely be true.

The only effect of the change is that strlen() would read 1000 bytes of a
malicious string before we reached the test on the length to reject such name.

While strnlen() is guaranteed to not read more than IFNAMSIZ bytes.



Re: [PATCH net 0/6] net: better validate user provided tunnel names

2018-04-05 Thread David Miller
From: Eric Dumazet 
Date: Thu,  5 Apr 2018 06:39:25 -0700

> This series changes dev_valid_name() to not attempt reading
> a possibly too long user-provided device name, then use
> this helper in five different tunnel providers.

Series applied and queued up for -stable, thanks Eric.

Reading over this series makes me wonder if we generally have an
off-by-one bug for device names which are exactly IFNAMSIZ.

We validate the size using the test:

if (strlen(name) >= IFNAMSIZ)
return ERROR;

and thusly after Eric's changes:

if (strnlen(name, IFNAMSIZ) == IFNAMSIZ)
return ERROR;

This value computed by str{,n}len() doesn't include the trailing null
byte.

So we will accept a name that has exactly IFNAMSIZ bytes long not
including the trailing null.

Then we will copy IFNAMSIZ bytes, minus 1, into the device name and
then tack on the trailling null byte.

So essentially we will set the final non-null byte in the string to
a null byte.

Am I misreading things?


Re: [PATCH net-next v2 2/2] dt: bindings: add new dt entries for brcmfmac

2018-04-05 Thread Arend van Spriel

On 4/5/2018 3:10 PM, Kalle Valo wrote:

Ulf Hansson  writes:


On 20 March 2018 at 10:55, Kalle Valo  wrote:

Arend van Spriel  writes:


If I get it right, you mean something like this:

mmc3: mmc@1c12000 {
...
  broken-sg-support;
  sd-head-align = 4;
  sd-sgentry-align = 512;

  brcmf: wifi@1 {
  ...
  };
};

Where dt: bindings documentation for these entries should reside?
In generic MMC bindings? Well, this is the very special case and
mmc-linux maintainer will unlikely to accept these changes.
Also, extra kernel code modification might be required. It could make
quite trivial change much more complex.


If the MMC maintainers are not copied on this patch series, it will
likely be hard for them to identify this patch series and chime in...


The main question is whether this is indeed a "very special case" as
Alexey claims it to be or that it is likely to be applicable to other
device and host combinations as you are suggesting.

If these properties are imposed by the host or host controller it
would make sense to have these in the mmc bindings.


BTW, last year we were discussing something similar (I mean related to
alignment requirements) with ath10k SDIO patches and at the time the
patch submitter was proposing to have a bounce buffer in ath10k to
workaround that. I don't remember the details anymore, they are on the
ath10k mailing list archive if anyone is curious to know, but I would
not be surprised if they are similar as here. So there might be a need
to solve this in a generic way (but not sure of course as I haven't
checked the details).


I re-call something about these as well, here are the patches. Perhaps
I should pick some of them up...

https://patchwork.kernel.org/patch/10123137/
https://patchwork.kernel.org/patch/10123139/
https://patchwork.kernel.org/patch/10123141/
https://patchwork.kernel.org/patch/10123143/


Actually I was talking about a different patch, found it now:

ath10k_sdio: DMA bounce buffers for read write

https://patchwork.kernel.org/patch/9979543/


The patches Uffe mentions are related to this. In brcmfmac we simply 
implemented functionality to create a scatter list and send it through 
the MMC stack using SDIO CMD53. It is in the creation of the scatter 
list that these alignment properties are needed. Moving the whole 
functionality to the MMC stack will also move those properties into 
their right context, ie. SDIO host controller.


Our broker_sg_support property is basically doing the bounce buffer 
thing if I understand it correctly.


Regards,
Arend



Re: [PATCH net-next v2 2/2] dt: bindings: add new dt entries for brcmfmac

2018-04-05 Thread Ulf Hansson
On 5 April 2018 at 15:10, Kalle Valo  wrote:
> Ulf Hansson  writes:
>
>> On 20 March 2018 at 10:55, Kalle Valo  wrote:
>>> Arend van Spriel  writes:
>>>
>> If I get it right, you mean something like this:
>>
>> mmc3: mmc@1c12000 {
>> ...
>>  broken-sg-support;
>>  sd-head-align = 4;
>>  sd-sgentry-align = 512;
>>
>>  brcmf: wifi@1 {
>>  ...
>>  };
>> };
>>
>> Where dt: bindings documentation for these entries should reside?
>> In generic MMC bindings? Well, this is the very special case and
>> mmc-linux maintainer will unlikely to accept these changes.
>> Also, extra kernel code modification might be required. It could make
>> quite trivial change much more complex.
>
> If the MMC maintainers are not copied on this patch series, it will
> likely be hard for them to identify this patch series and chime in...

 The main question is whether this is indeed a "very special case" as
 Alexey claims it to be or that it is likely to be applicable to other
 device and host combinations as you are suggesting.

 If these properties are imposed by the host or host controller it
 would make sense to have these in the mmc bindings.
>>>
>>> BTW, last year we were discussing something similar (I mean related to
>>> alignment requirements) with ath10k SDIO patches and at the time the
>>> patch submitter was proposing to have a bounce buffer in ath10k to
>>> workaround that. I don't remember the details anymore, they are on the
>>> ath10k mailing list archive if anyone is curious to know, but I would
>>> not be surprised if they are similar as here. So there might be a need
>>> to solve this in a generic way (but not sure of course as I haven't
>>> checked the details).
>>
>> I re-call something about these as well, here are the patches. Perhaps
>> I should pick some of them up...
>>
>> https://patchwork.kernel.org/patch/10123137/
>> https://patchwork.kernel.org/patch/10123139/
>> https://patchwork.kernel.org/patch/10123141/
>> https://patchwork.kernel.org/patch/10123143/
>
> Actually I was talking about a different patch, found it now:
>
> ath10k_sdio: DMA bounce buffers for read write
>
> https://patchwork.kernel.org/patch/9979543/

Ah, yes. This is about buffer alignment, particularly when using DMA.

Normally there should be no constraint on the alignment, if the
mmc/sdio controller driver would implement a fallback mechanism from
DMA to PIO mode, in case the buffer can't be used for DMA.

However, I know about cases where simply PIO doesn't work because of
broken HW and in many cases the mmc drivers don't implement the
fallback to PIO even if they could.

Moreover, it seems reasonable to anyway have a way for mmc driver to
inform upper layers about DMA buffer alignment constraints, as to be
able to use DMA as long as possible.

Thoughts?

Kind regards
Uffe


[PATCH 1/4] hv_netvsc: Use Windows version instead of NVSP version on GPAD teardown

2018-04-05 Thread Mohammed Gamal
When changing network interface settings, Windows guests
older than WS2016 can no longer shutdown. This was addressed
by commit 0ef58b0a05c12 ("hv_netvsc: change GPAD teardown order
on older versions"), however the issue also occurs on WS2012
guests that share NVSP protocol versions with WS2016 guests.
Hence we use Windows version directly to differentiate them.

Fixes: 0ef58b0a05c12 ("hv_netvsc: change GPAD teardown order on older versions")

Signed-off-by: Mohammed Gamal 
---
 drivers/net/hyperv/netvsc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index c9910c3..d65b7fc 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -590,13 +590,13 @@ void netvsc_device_remove(struct hv_device *device)
netdev_dbg(ndev, "net device safe to remove\n");
 
/* older versions require that buffer be revoked before close */
-   if (net_device->nvsp_version < NVSP_PROTOCOL_VERSION_4)
+   if (vmbus_proto_version < VERSION_WIN10)
netvsc_teardown_gpadl(device, net_device);
 
/* Now, we can close the channel safely */
vmbus_close(device->channel);
 
-   if (net_device->nvsp_version >= NVSP_PROTOCOL_VERSION_4)
+   if (vmbus_proto_version >= VERSION_WIN10)
netvsc_teardown_gpadl(device, net_device);
 
/* Release all resources */
-- 
1.8.3.1



[PATCH 0/4] hv_netvsc: Fix shutdown issues on older Windows hosts

2018-04-05 Thread Mohammed Gamal
Guests running on WS2012 hosts would not shutdown when changing network
interface setting (e.g. Number of channels, MTU ... etc). 

This patch series addresses these shutdown issues we enecountered with WS2012
hosts. It's essentialy a rework of the series sent in 
https://lkml.org/lkml/2018/1/23/111 on top of latest upstream

Fixes: 0ef58b0a05c1 ("hv_netvsc: change GPAD teardown order on older versions")

Mohammed Gamal (4):
  hv_netvsc: Use Windows version instead of NVSP version on GPAD
teardown
  hv_netvsc: Split netvsc_revoke_buf() and netvsc_teardown_gpadl()
  hv_netvsc: Ensure correct teardown message sequence order
  hv_netvsc: Pass net_device parameter to revoke and teardown functions

 drivers/net/hyperv/netvsc.c | 60 +
 1 file changed, 44 insertions(+), 16 deletions(-)

-- 
1.8.3.1



[PATCH 2/4] hv_netvsc: Split netvsc_revoke_buf() and netvsc_teardown_gpadl()

2018-04-05 Thread Mohammed Gamal
Split each of the functions into two for each of send/recv buffers.
This will be needed in order to implement a fine-grained messaging
sequence to the host so tht we accommodate the requirements of
different Windows versions

Fixes: 0ef58b0a05c12 ("hv_netvsc: change GPAD teardown order on older versions")

Signed-off-by: Mohammed Gamal 
---
 drivers/net/hyperv/netvsc.c | 46 +
 1 file changed, 34 insertions(+), 12 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index d65b7fc..f4df5de 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -109,11 +109,11 @@ static void free_netvsc_device_rcu(struct netvsc_device 
*nvdev)
call_rcu(>rcu, free_netvsc_device);
 }
 
-static void netvsc_revoke_buf(struct hv_device *device,
- struct netvsc_device *net_device)
+static void netvsc_revoke_recv_buf(struct hv_device *device,
+  struct netvsc_device *net_device)
 {
-   struct nvsp_message *revoke_packet;
struct net_device *ndev = hv_get_drvdata(device);
+   struct nvsp_message *revoke_packet;
int ret;
 
/*
@@ -157,6 +157,14 @@ static void netvsc_revoke_buf(struct hv_device *device,
}
net_device->recv_section_cnt = 0;
}
+}
+
+static void netvsc_revoke_send_buf(struct hv_device *device,
+  struct netvsc_device *net_device)
+{
+   struct net_device *ndev = hv_get_drvdata(device);
+   struct nvsp_message *revoke_packet;
+   int ret;
 
/* Deal with the send buffer we may have setup.
 * If we got a  send section size, it means we received a
@@ -202,8 +210,8 @@ static void netvsc_revoke_buf(struct hv_device *device,
}
 }
 
-static void netvsc_teardown_gpadl(struct hv_device *device,
- struct netvsc_device *net_device)
+static void netvsc_teardown_recv_gpadl(struct hv_device *device,
+  struct netvsc_device *net_device)
 {
struct net_device *ndev = hv_get_drvdata(device);
int ret;
@@ -222,6 +230,13 @@ static void netvsc_teardown_gpadl(struct hv_device *device,
}
net_device->recv_buf_gpadl_handle = 0;
}
+}
+
+static void netvsc_teardown_send_gpadl(struct hv_device *device,
+  struct netvsc_device *net_device)
+{
+   struct net_device *ndev = hv_get_drvdata(device);
+   int ret;
 
if (net_device->send_buf_gpadl_handle) {
ret = vmbus_teardown_gpadl(device->channel,
@@ -437,8 +452,10 @@ static int netvsc_init_buf(struct hv_device *device,
goto exit;
 
 cleanup:
-   netvsc_revoke_buf(device, net_device);
-   netvsc_teardown_gpadl(device, net_device);
+   netvsc_revoke_recv_buf(device, net_device);
+   netvsc_revoke_send_buf(device, net_device);
+   netvsc_teardown_recv_gpadl(device, net_device);
+   netvsc_teardown_send_gpadl(device, net_device);
 
 exit:
return ret;
@@ -575,7 +592,8 @@ void netvsc_device_remove(struct hv_device *device)
= rtnl_dereference(net_device_ctx->nvdev);
int i;
 
-   netvsc_revoke_buf(device, net_device);
+   netvsc_revoke_recv_buf(device, net_device);
+   netvsc_revoke_send_buf(device, net_device);
 
RCU_INIT_POINTER(net_device_ctx->nvdev, NULL);
 
@@ -590,14 +608,18 @@ void netvsc_device_remove(struct hv_device *device)
netdev_dbg(ndev, "net device safe to remove\n");
 
/* older versions require that buffer be revoked before close */
-   if (vmbus_proto_version < VERSION_WIN10)
-   netvsc_teardown_gpadl(device, net_device);
+   if (vmbus_proto_version < VERSION_WIN10) {
+   netvsc_teardown_recv_gpadl(device, net_device);
+   netvsc_teardown_send_gpadl(device, net_device);
+   }
 
/* Now, we can close the channel safely */
vmbus_close(device->channel);
 
-   if (vmbus_proto_version >= VERSION_WIN10)
-   netvsc_teardown_gpadl(device, net_device);
+   if (vmbus_proto_version >= VERSION_WIN10) {
+   netvsc_teardown_recv_gpadl(device, net_device);
+   netvsc_teardown_send_gpadl(device, net_device);
+   }
 
/* Release all resources */
free_netvsc_device_rcu(net_device);
-- 
1.8.3.1



[PATCH 3/4] hv_netvsc: Ensure correct teardown message sequence order

2018-04-05 Thread Mohammed Gamal
Prior to commit 0cf737808ae7 ("hv_netvsc: netvsc_teardown_gpadl() split")
the call sequence in netvsc_device_remove() was as follows (as
implemented in netvsc_destroy_buf()):
1- Send NVSP_MSG1_TYPE_REVOKE_RECV_BUF message
2- Teardown receive buffer GPADL
3- Send NVSP_MSG1_TYPE_REVOKE_SEND_BUF message
4- Teardown send buffer GPADL
5- Close vmbus

This didn't work for WS2016 hosts. Commit 0cf737808ae7
("hv_netvsc: netvsc_teardown_gpadl() split") rearranged the
teardown sequence as follows:
1- Send NVSP_MSG1_TYPE_REVOKE_RECV_BUF message
2- Send NVSP_MSG1_TYPE_REVOKE_SEND_BUF message
3- Close vmbus
4- Teardown receive buffer GPADL
5- Teardown send buffer GPADL

That worked well for WS2016 hosts, but it prevented guests on older hosts from
shutting down after changing network settings. Commit 0ef58b0a05c1
("hv_netvsc: change GPAD teardown order on older versions") ensured the
following message sequence for older hosts
1- Send NVSP_MSG1_TYPE_REVOKE_RECV_BUF message
2- Send NVSP_MSG1_TYPE_REVOKE_SEND_BUF message
3- Teardown receive buffer GPADL
4- Teardown send buffer GPADL
5- Close vmbus

However, with this sequence calling `ip link set eth0 mtu 1000` hangs and the
process becomes uninterruptible. On futher analysis it turns out that on tearing
down the receive buffer GPADL the kernel is waiting indefinitely
in vmbus_teardown_gpadl() for a completion to be signaled.

Here is a snippet of where this occurs:
int vmbus_teardown_gpadl(struct vmbus_channel *channel, u32 gpadl_handle)
{
struct vmbus_channel_gpadl_teardown *msg;
struct vmbus_channel_msginfo *info;
unsigned long flags;
int ret;

info = kmalloc(sizeof(*info) +
   sizeof(struct vmbus_channel_gpadl_teardown), GFP_KERNEL);
if (!info)
return -ENOMEM;

init_completion(>waitevent);
info->waiting_channel = channel;
[]
ret = vmbus_post_msg(msg, sizeof(struct vmbus_channel_gpadl_teardown),
 true);

if (ret)
goto post_msg_err;

wait_for_completion(>waitevent);
[]
}

The completion is signaled from vmbus_ongpadl_torndown(), which gets called when
the corresponding message is received from the host, which apparently never 
happens
in that case.
This patch works around the issue by restoring the first mentioned message 
sequence
for older hosts

Fixes: 0ef58b0a05c1 ("hv_netvsc: change GPAD teardown order on older versions")

Signed-off-by: Mohammed Gamal 
---
 drivers/net/hyperv/netvsc.c | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index f4df5de..df92c2f 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -592,8 +592,17 @@ void netvsc_device_remove(struct hv_device *device)
= rtnl_dereference(net_device_ctx->nvdev);
int i;
 
+   /*
+* Revoke receive buffer. If host is pre-Win2016 then tear down
+* receive buffer GPADL. Do the same for send buffer.
+*/
netvsc_revoke_recv_buf(device, net_device);
+   if (vmbus_proto_version < VERSION_WIN10)
+   netvsc_teardown_recv_gpadl(device, net_device);
+
netvsc_revoke_send_buf(device, net_device);
+   if (vmbus_proto_version < VERSION_WIN10)
+   netvsc_teardown_send_gpadl(device, net_device);
 
RCU_INIT_POINTER(net_device_ctx->nvdev, NULL);
 
@@ -607,15 +616,13 @@ void netvsc_device_remove(struct hv_device *device)
 */
netdev_dbg(ndev, "net device safe to remove\n");
 
-   /* older versions require that buffer be revoked before close */
-   if (vmbus_proto_version < VERSION_WIN10) {
-   netvsc_teardown_recv_gpadl(device, net_device);
-   netvsc_teardown_send_gpadl(device, net_device);
-   }
-
/* Now, we can close the channel safely */
vmbus_close(device->channel);
 
+   /*
+* If host is Win2016 or higher then we do the GPADL tear down
+* here after VMBus is closed.
+   */
if (vmbus_proto_version >= VERSION_WIN10) {
netvsc_teardown_recv_gpadl(device, net_device);
netvsc_teardown_send_gpadl(device, net_device);
-- 
1.8.3.1



[PATCH 4/4] hv_netvsc: Pass net_device parameter to revoke and teardown functions

2018-04-05 Thread Mohammed Gamal
The callers to netvsc_revoke_*_buf() and netvsc_teardown_*_gpadl()
already have their net_device instances. Pass them as a paramaeter to
the function instead of obtaining them from netvsc_device struct
everytime

Signed-off-by: Mohammed Gamal 
---
 drivers/net/hyperv/netvsc.c | 37 ++---
 1 file changed, 18 insertions(+), 19 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index df92c2f..04f611e 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -110,9 +110,9 @@ static void free_netvsc_device_rcu(struct netvsc_device 
*nvdev)
 }
 
 static void netvsc_revoke_recv_buf(struct hv_device *device,
-  struct netvsc_device *net_device)
+  struct netvsc_device *net_device,
+  struct net_device *ndev)
 {
-   struct net_device *ndev = hv_get_drvdata(device);
struct nvsp_message *revoke_packet;
int ret;
 
@@ -160,9 +160,9 @@ static void netvsc_revoke_recv_buf(struct hv_device *device,
 }
 
 static void netvsc_revoke_send_buf(struct hv_device *device,
-  struct netvsc_device *net_device)
+  struct netvsc_device *net_device,
+  struct net_device *ndev)
 {
-   struct net_device *ndev = hv_get_drvdata(device);
struct nvsp_message *revoke_packet;
int ret;
 
@@ -211,9 +211,9 @@ static void netvsc_revoke_send_buf(struct hv_device *device,
 }
 
 static void netvsc_teardown_recv_gpadl(struct hv_device *device,
-  struct netvsc_device *net_device)
+  struct netvsc_device *net_device,
+  struct net_device *ndev)
 {
-   struct net_device *ndev = hv_get_drvdata(device);
int ret;
 
if (net_device->recv_buf_gpadl_handle) {
@@ -233,9 +233,9 @@ static void netvsc_teardown_recv_gpadl(struct hv_device 
*device,
 }
 
 static void netvsc_teardown_send_gpadl(struct hv_device *device,
-  struct netvsc_device *net_device)
+  struct netvsc_device *net_device,
+  struct net_device *ndev)
 {
-   struct net_device *ndev = hv_get_drvdata(device);
int ret;
 
if (net_device->send_buf_gpadl_handle) {
@@ -452,10 +452,10 @@ static int netvsc_init_buf(struct hv_device *device,
goto exit;
 
 cleanup:
-   netvsc_revoke_recv_buf(device, net_device);
-   netvsc_revoke_send_buf(device, net_device);
-   netvsc_teardown_recv_gpadl(device, net_device);
-   netvsc_teardown_send_gpadl(device, net_device);
+   netvsc_revoke_recv_buf(device, net_device, ndev);
+   netvsc_revoke_send_buf(device, net_device, ndev);
+   netvsc_teardown_recv_gpadl(device, net_device, ndev);
+   netvsc_teardown_send_gpadl(device, net_device, ndev);
 
 exit:
return ret;
@@ -474,7 +474,6 @@ static int negotiate_nvsp_ver(struct hv_device *device,
init_packet->hdr.msg_type = NVSP_MSG_TYPE_INIT;
init_packet->msg.init_msg.init.min_protocol_ver = nvsp_ver;
init_packet->msg.init_msg.init.max_protocol_ver = nvsp_ver;
-
trace_nvsp_send(ndev, init_packet);
 
/* Send the init request */
@@ -596,13 +595,13 @@ void netvsc_device_remove(struct hv_device *device)
 * Revoke receive buffer. If host is pre-Win2016 then tear down
 * receive buffer GPADL. Do the same for send buffer.
 */
-   netvsc_revoke_recv_buf(device, net_device);
+   netvsc_revoke_recv_buf(device, net_device, ndev);
if (vmbus_proto_version < VERSION_WIN10)
-   netvsc_teardown_recv_gpadl(device, net_device);
+   netvsc_teardown_recv_gpadl(device, net_device, ndev);
 
-   netvsc_revoke_send_buf(device, net_device);
+   netvsc_revoke_send_buf(device, net_device, ndev);
if (vmbus_proto_version < VERSION_WIN10)
-   netvsc_teardown_send_gpadl(device, net_device);
+   netvsc_teardown_send_gpadl(device, net_device, ndev);
 
RCU_INIT_POINTER(net_device_ctx->nvdev, NULL);
 
@@ -624,8 +623,8 @@ void netvsc_device_remove(struct hv_device *device)
 * here after VMBus is closed.
*/
if (vmbus_proto_version >= VERSION_WIN10) {
-   netvsc_teardown_recv_gpadl(device, net_device);
-   netvsc_teardown_send_gpadl(device, net_device);
+   netvsc_teardown_recv_gpadl(device, net_device, ndev);
+   netvsc_teardown_send_gpadl(device, net_device, ndev);
}
 
/* Release all resources */
-- 
1.8.3.1



Re: [PATCH 3/9] kbuild: Do not pass arguments to link-vmlinux.sh

2018-04-05 Thread Jiri Olsa
On Fri, Apr 06, 2018 at 12:50:00AM +0900, Masahiro Yamada wrote:
> 2018-04-06 0:16 GMT+09:00 Jiri Olsa :
> > There's no need to pass LD* arguments to link-vmlinux.sh,
> > because they are passed as variables. The only argument
> > the link-vmlinux.sh supports is the 'clean' argument.
> >
> > Signed-off-by: Jiri Olsa 
> > ---
> 
> Wrong.
> 
> $(LD) $(LDFLAGS) $(LDFLAGS_vmlinux)
> exist here so that any change in them
> invokes scripts/linkk-vmlinux.sh

sry, I can't see that.. but it's just a side fix,
which is actually not needed for the rest

I'll check on more and address this separately

thanks,
jirka

> 
> 
> 
> 
> >  Makefile | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/Makefile b/Makefile
> > index d3300e46f925..a65a3919c6ad 100644
> > --- a/Makefile
> > +++ b/Makefile
> > @@ -1027,7 +1027,7 @@ ARCH_POSTLINK := $(wildcard 
> > $(srctree)/arch/$(SRCARCH)/Makefile.postlink)
> >
> >  # Final link of vmlinux with optional arch pass after final link
> >  cmd_link-vmlinux = \
> > -   $(CONFIG_SHELL) $< $(LD) $(LDFLAGS) $(LDFLAGS_vmlinux) ;\
> > +   $(CONFIG_SHELL) $< ;   \
> > $(if $(ARCH_POSTLINK), $(MAKE) -f $(ARCH_POSTLINK) $@, true)
> >
> >  vmlinux: scripts/link-vmlinux.sh vmlinux_prereq $(vmlinux-deps) FORCE
> 
> 
> 
> 
> 
> 
> -- 
> Best Regards
> Masahiro Yamada


Re: [PATCH 07/12] brcmfmac: Convert ALLFFMAC to ether_broadcast_addr

2018-04-05 Thread Arend van Spriel

On 3/31/2018 9:05 AM, Joe Perches wrote:

Remove the local ALLFFMAC extern array and use the new global instead.


I stumbled upon this one couple of weeks ago. I moved the definition to 
flowring.c although I considered for a moment to pick up the task you 
took here valiantly.



Miscellanea:

o Convert char *mac to const char *mac as it can't be modified


The real reason is off course that your new global is const and thus mac 
variable need to be const as well to avoid compiler warning.


I have to agree with Kalle regarding the upstream logistics, but for 
what it is worth...


Acked-by: Arend van Spriel 

Signed-off-by: Joe Perches 
---
  drivers/net/wireless/broadcom/brcm80211/brcmfmac/common.c   | 2 --
  drivers/net/wireless/broadcom/brcm80211/brcmfmac/common.h   | 2 --
  drivers/net/wireless/broadcom/brcm80211/brcmfmac/flowring.c | 8 
  3 files changed, 4 insertions(+), 8 deletions(-)




[PATCH net-next] hv_netvsc: Add NetVSP v6 into version negotiation

2018-04-05 Thread Haiyang Zhang
From: Haiyang Zhang 

This patch adds the NetVSP v6 message structures, and includes this
version into NetVSC/NetVSP version negotiation process.

Signed-off-by: Haiyang Zhang 
---
 drivers/net/hyperv/hyperv_net.h | 33 +
 drivers/net/hyperv/netvsc.c |  3 ++-
 2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 960f06141472..036cd55c66fe 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -237,6 +237,7 @@ void netvsc_switch_datapath(struct net_device *nv_dev, bool 
vf);
 #define NVSP_PROTOCOL_VERSION_20x30002
 #define NVSP_PROTOCOL_VERSION_40x4
 #define NVSP_PROTOCOL_VERSION_50x5
+#define NVSP_PROTOCOL_VERSION_60x6
 
 enum {
NVSP_MSG_TYPE_NONE = 0,
@@ -308,6 +309,11 @@ enum {
NVSP_MSG5_TYPE_SEND_INDIRECTION_TABLE,
 
NVSP_MSG5_MAX = NVSP_MSG5_TYPE_SEND_INDIRECTION_TABLE,
+
+   /* Version 6 messages */
+   NVSP_MSG6_TYPE_PD_API,
+
+   NVSP_MSG6_MAX = NVSP_MSG6_TYPE_PD_API
 };
 
 enum {
@@ -619,12 +625,39 @@ union nvsp_5_message_uber {
struct nvsp_5_send_indirect_table send_table;
 } __packed;
 
+enum nvsp6_pd_api_op {
+   PD_API_OP_NOTIFY_VSP = 0,
+   PD_API_OP_CONFIG,
+   PD_API_OP_MAX
+};
+
+struct nvsp_6_pd_api_req {
+   u32 op;
+   u64 mmio_pa; /* MMIO Physical Address */
+   u32 mmio_len;
+   u32 num_subchn; /* Number of PD subchannels */
+} __packed;
+
+struct nvsp_6_pd_api_comp {
+   u32 op;
+   u32 status;
+   u32 num_subchn; /* Number of PD subchannels */
+   u8 is_supported; /* Is supported by VSP */
+   u8 is_enabled; /* Is enabled by VSP */
+} __packed;
+
+union nvsp_6_message_uber {
+   struct nvsp_6_pd_api_req pd_req;
+   struct nvsp_6_pd_api_comp pd_comp;
+} __packed;
+
 union nvsp_all_messages {
union nvsp_message_init_uber init_msg;
union nvsp_1_message_uber v1_msg;
union nvsp_2_message_uber v2_msg;
union nvsp_4_message_uber v4_msg;
union nvsp_5_message_uber v5_msg;
+   union nvsp_6_message_uber v6_msg;
 } __packed;
 
 /* ALL Messages */
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index c9910c33e671..3abe57bd85bb 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -509,7 +509,8 @@ static int netvsc_connect_vsp(struct hv_device *device,
struct net_device *ndev = hv_get_drvdata(device);
static const u32 ver_list[] = {
NVSP_PROTOCOL_VERSION_1, NVSP_PROTOCOL_VERSION_2,
-   NVSP_PROTOCOL_VERSION_4, NVSP_PROTOCOL_VERSION_5
+   NVSP_PROTOCOL_VERSION_4, NVSP_PROTOCOL_VERSION_5,
+   NVSP_PROTOCOL_VERSION_6
};
struct nvsp_message *init_packet;
int ndis_version, i, ret;
-- 
2.15.1



Re: marvell switch

2018-04-05 Thread Andrew Lunn
On Thu, Apr 05, 2018 at 05:26:37PM +, Ran Shalit wrote:
> בתאריך יום ה׳, 5 באפר׳ 2018, 19:09, מאת Andrew Lunn ‏:
> 
> > > Is there a wiki which explains switch configuration ?
> >
> > Nope. The whole idea is that they behave like normal linux
> > interfaces. So there is no need to document them. You already know how
> > to use them.
> >
> > > Is it possible to open socket and send/recieve on switch ports (lan0
> > > for example) ?
> >
> > Sure. It is as normal interface. Give is an IP address and then do
> > TCP/IP as usual.
> >
> 
> There is something I don't fully understand.
> I thought that ports in switch which are not connected to the manage cpu
> are only configured by cpu and then can coomunicate with each other.
> 
> What does it mean to open socket in such port (not the cpu port) and send
> data (ethernet frames) ? Does it mean that the cpu port is sending data
> directly to that port ?

The CPU port is configured to use {E}DSA tagging. Frames ingressing on
the CPU port contain an extra header. That header tells the switch
which port to send the frame out of. The switch can also send frames
received on a port out the CPU port to the host, again, with a
header. The host can tell from the header which port the frame
ingressed.

Using this, we can make ports look like normal Linux interfaces.

  Andrew



RE: [Intel-wired-lan] [next-queue PATCH v6 08/10] igb: Add MAC address support for ethtool nftuple filters

2018-04-05 Thread Vinicius Costa Gomes
Hi,

"Brown, Aaron F"  writes:

>> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
>> Behalf Of Vinicius Costa Gomes
>> Sent: Thursday, March 29, 2018 2:08 PM
>> To: intel-wired-...@lists.osuosl.org
>> Cc: netdev@vger.kernel.org; Sanchez-Palencia, Jesus > palen...@intel.com>
>> Subject: [Intel-wired-lan] [next-queue PATCH v6 08/10] igb: Add MAC
>> address support for ethtool nftuple filters
>> 
>> This adds the capability of configuring the queue steering of arriving
>> packets based on their source and destination MAC addresses.
>> 
>> In practical terms this adds support for the following use cases,
>> characterized by these examples:
>> 
>> $ ethtool -N eth0 flow-type ether dst aa:aa:aa:aa:aa:aa action 0
>> (this will direct packets with destination address "aa:aa:aa:aa:aa:aa"
>> to the RX queue 0)
>
> This is now working for me, testing with the dst MAC being the MAC on the 
> i210.  I set the filter and all the traffic to the destination MAC address 
> gets routed to the chosen RX queue.
>
>> $ ethtool -N eth0 flow-type ether src 44:44:44:44:44:44 action 3
>> (this will direct packets with source address "44:44:44:44:44:44" to
>> the RX queue 3)
>
> However, I am still not getting the raw ethernet source filter to
> work.  Even back to back with no other system to "confuse" the stream,
> I set the filter so the source MAC is the same as the MAC on the link
> partner, send traffic and the traffic bounces around the queues as if
> the filter is not set.

It seems there is at least a documentation issue in the i210 datasheet,
steering (placing traffic into a specific queue) by source address
doesn't work, filtering (accepting the traffic based on some rule) does
work. I pointed this out in the cover letter of v5 as a known issue, but
forgot to repeat it for v6, sorry about the confusion.

But only the filtering part is useful, I think, it enables cases like
this:

$ ethtool -N enp2s0 flow-type ether src 68:05:ca:4a:c9:73 proto 0x22f0 action 3

I added that note in the hope that someone else would have an stronger
opinion about what to do.

Anyway, my plan for now will be to document this better and turn the
case that only the source address is specified into an error.

>
>> 
>> Signed-off-by: Vinicius Costa Gomes 
>> ---
>>  drivers/net/ethernet/intel/igb/igb_ethtool.c | 35
>> 
>>  1 file changed, 31 insertions(+), 4 deletions(-)


Cheers,
--
Vinicius


Re: [PATCH net-next 6/6] netdevsim: Add simple FIB resource controller via devlink

2018-04-05 Thread Jiri Pirko
Wed, Mar 28, 2018 at 03:22:00AM CEST, d...@cumulusnetworks.com wrote:
>Add devlink support to netdevsim and use it to implement a simple,
>profile based resource controller. Only one controller is needed
>per namespace, so the first netdevsim netdevice in a namespace
>registers with devlink. If that device is deleted, the resource
>settings are deleted.

I don't understand why you add 1:1 fixed relationship between
netnamespace and devlink instance. That is highly misleading and reader
might think that those 2 are somehow related. They are not. You can have
multiple devlink instances for many ports in a single namespace.

Could you please clarify?

Also, to see the relationship between individual netdevsim netdevices
and the parent devlink instance, we should use devlink_port
instances, like this: 

  devlink1  devlink2
   ||||
 dl_port1_1 dlport1_2   dlport2_1 dlport2_2
   ||||
 eth0  eth1 eth2 eth3

Note that "devlink instance" reprensents one ASIC.
The address of the devlink instance is the bus address of the ASIC.
Here, you use address of some/first netdevsim netdev instance.

The way it is implemented in netdevsim by this patch is wrong on
so many levels :(

Could you please fix this? I'm more than happy to help you with this,
please say so. Thanks!


[...]

>+  err = devlink_resource_register(devlink, "IPv4", (u64)-1,
>+  NSIM_RESOURCE_IPV4,
>+  DEVLINK_RESOURCE_ID_PARENT_TOP,
>+  , NULL);
>+  if (err) {
>+  pr_err("Failed to register IPv4 top resource\n");
>+  goto out;


this goto is pointless. Just return.


Re: Kernel bug from adding bpf actions in tc

2018-04-05 Thread Davide Caratti
On Thu, 2018-04-05 at 11:23 -0400, Lucas Bates wrote:
> Hi Davide,
> 
> Our overnight tc test runs of net-next revealed a kernel bug on one of
> the BPF tests you submitted, d959.  The add action completes
> successfully, but the bug occurs on the verify when tdc does a get of
> the action that was just added.  Here's the text of the dump:
> 

looking at the call trace, I think cfg->filter is NULL when
tcf_bpf_cleanup() is called, and apparently we are in the error path of
tcf_bpf_init(), when 

prog->bpf_ops = cfg.bpf_ops;
...
rcu_assign_pointer(prog->filter, cfg.filter);

have not been executed yet.

If tcf_idr_release() is called in this situation, cfg->is_ebpf is assigned
to true, and bpf_prog_put() can dereference a NULL pointer.

I will try reproducing in the next hours, and eventually followup with a
patch.

thanks!
regards,
-- 
davide


[PATCH iproute2] l2tp: no need to export session offsets in JSON output

2018-04-05 Thread Guillaume Nault
The offset and peer_offset parameters are only printed to avoid
confusing external scripts that may parse "ip l2tp show session"
output. There's no reason to keep them in JSON.

Signed-off-by: Guillaume Nault 
---
 ip/ipl2tp.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/ip/ipl2tp.c b/ip/ipl2tp.c
index 427e0249..05e96387 100644
--- a/ip/ipl2tp.c
+++ b/ip/ipl2tp.c
@@ -306,8 +306,9 @@ static void print_session(struct l2tp_data *data)
print_string(PRINT_FP, NULL, "%s", _SL_);
}
 
-   print_uint(PRINT_ANY, "offset", "  offset %u,", 0);
-   print_uint(PRINT_ANY, "peer_offset", " peer offset %u\n", 0);
+   /* Show offsets only for plain console output (for legacy scripts) */
+   print_uint(PRINT_FP, "offset", "  offset %u,", 0);
+   print_uint(PRINT_FP, "peer_offset", " peer offset %u\n", 0);
 
if (p->cookie_len > 0)
print_cookie("cookie", "cookie",
-- 
2.17.0



Re: [PATCH iproute2] ip/l2tp: remove offset and peer-offset options

2018-04-05 Thread Guillaume Nault
On Wed, Apr 04, 2018 at 04:43:10PM -0700, Stephen Hemminger wrote:
> On Tue, 3 Apr 2018 17:39:54 +0200
> Guillaume Nault  wrote:
> 
> > Ignore options "peer-offset" and "offset" when creating sessions. Keep
> > them when dumping sessions in order to avoid breaking external scripts.
> > 
> > "peer-offset" has always been a noop in iproute2. "offset" is now
> > ignored in Linux 4.16 (and was broken before that).
> > 
> > Signed-off-by: Guillaume Nault 
> 
> Sure, this makes sense applied.
> In theory, you could have just dropped them from the JSON output.

Indeed, I'll followup on this.


Re: [PATCH v3 2/4] bus: fsl-mc: add restool userspace support

2018-04-05 Thread Andrew Lunn
> > Hi Andrew,
> > 
> > We're waiting for the DPIO driver (which we depend on) to be moved
> > out of staging first, it's currently under review:
> > https://lkml.org/lkml/2018/3/27/1086
> 
> That's stalled on my side right now as the merge window is open and I
> can't do any new stuff until after 4.17-rc1 is out.  So everyone please
> be patient a bit...

I took a quick look.

There are a few inline functions in .c files which is generally
frowned upon. Let the compiler decide.

e.g:

static inline struct dpaa2_io *service_select_by_cpu(struct dpaa2_io *d,
 int cpu)
static inline struct dpaa2_io *service_select(struct dpaa2_io *d)

dpaa2_io_down() seems to be too simple. dpaa2_io_create() sets up
interrupt triggers, notifications, and adds the new object to the
dpio_list. dpaa2_io_down() seems to just free the memory. Do
notifications need to be disabled, the object taken off the list?

dpaa2_io_store_create() allocates memory using kzalloc() and then uses
dma_map_single(,,DMA_FROM_DEVICE). The documentation says:

   DMA_FROM_DEVICE synchronisation must be done before the driver
   accesses data that may be changed by the device.  This memory
   should be treated as read-only by the driver.  If the driver needs
   to write to it at any point, it should be DMA_BIDIRECTIONAL (see
   below).

Since it has just been allocated, this seems questionable.

I'm also not sure where the correct call to
dma_map_single(,,DMA_FROM_DEVICE) is?  Should dpaa2_io_store_next()
doing this?

The DMA API usage might need a closer review.

Andrew


Re: [PATCH net-next] pptp: remove a buggy dst release in pptp_connect()

2018-04-05 Thread Sasha Levin
Hi.

[This is an automated email]

This commit has been processed by the -stable helper bot and determined
to be a high probability candidate for -stable trees. (score: 30.1120)

The bot has tested the following trees: v4.15.15, v4.14.32, v4.9.92, v4.4.126, 

v4.15.15: Build OK!
v4.14.32: Build OK!
v4.9.92: Build OK!
v4.4.126: Build OK!

Please let us know if you'd like to have this patch included in a stable tree.

--
Thanks.
Sasha

Re: [PATCH] qtnfmac: pearl: pcie: fix memory leak in qtnf_fw_work_handler

2018-04-05 Thread Gustavo A. R. Silva

Hi Sergey,

On 04/05/2018 11:31 AM, Sergey Matyukevich wrote:

Hello Gustavo,


In case memory resources for fw were succesfully allocated, release
them before jumping to fw_load_fail.

Addresses-Coverity-ID: 1466092 ("Resource leak")
Fixes: c3b2f7ca4186 ("qtnfmac: implement asynchronous firmware loading")
Signed-off-by: Gustavo A. R. Silva 
---
  drivers/net/wireless/quantenna/qtnfmac/pearl/pcie.c | 4 
  1 file changed, 4 insertions(+)


Thanks for the patch!



Glad to help. :)


Reviewed-by: Sergey Matyukevich 



Thanks
--
Gustavo



Re: [PATCH] qtnfmac: pearl: pcie: fix memory leak in qtnf_fw_work_handler

2018-04-05 Thread Sergey Matyukevich
Hello Gustavo,

> In case memory resources for fw were succesfully allocated, release
> them before jumping to fw_load_fail.
> 
> Addresses-Coverity-ID: 1466092 ("Resource leak")
> Fixes: c3b2f7ca4186 ("qtnfmac: implement asynchronous firmware loading")
> Signed-off-by: Gustavo A. R. Silva 
> ---
>  drivers/net/wireless/quantenna/qtnfmac/pearl/pcie.c | 4 
>  1 file changed, 4 insertions(+)

Thanks for the patch!

Reviewed-by: Sergey Matyukevich 

Regards,
Sergey


Re: [PATCH v2 bpf-next 0/3] bpf/verifier: subprog/func_call simplifications

2018-04-05 Thread Edward Cree
On 05/04/18 16:50, Jiong Wang wrote:
> On 03/04/2018 02:08, Alexei Starovoitov wrote:
>> Combining subprog pass with do_check is going into opposite direction
>> of this long term work. Divide and conquer. Combining more things into
>> do_check is the opposite of this programming principle.
>
> Agree. And for the redundant insn traversal issue in check_subprogs that
> Edward trying to fix, I am thinking we could do it by utilizing the
> existing DFS traversal in check_cfg.
>
> The current DFS probably could be improved into an generic instruction
> information collection pass.

> And if we want to build basic-block later, we could just call a new add_bb
> (similar as add_subprog) for jump targets etc. (some of those places are
> actually STATE_LIST_MARK_JMPTARGET if we separate STATE_LIST_MARK as
> STATE_LIST_MARK_JMPTARGET and STATE_LIST_MARK_HEURISTIC).
>
> Regards,
> Jiong
>
This is broadly similar to my medium-term plan, except that I would use
 the Kahn's-algorithm style tsort rather than the depth-first tsort
 algorithm used by current check_cfg.
* chop up prog into functions and bbs, incidentally placing S_L_Ms
* create control flow graph similar to my function call graph
* tsort it to detect loops, similar to how my check_max_stack_depth()
  implementation detects recursion.

-Ed


[PATCH] ieee802154: mcr20a: Fix memory leak in mcr20a_probe

2018-04-05 Thread Gustavo A. R. Silva
Free allocated memory for pdata before return.

Addresses-Coverity-ID: 1466096 ("Resource leak")
Fixes: 8c6ad9cc5157 ("ieee802154: Add NXP MCR20A IEEE 802.15.4 transceiver 
driver")
Signed-off-by: Gustavo A. R. Silva 
---
 drivers/net/ieee802154/mcr20a.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ieee802154/mcr20a.c b/drivers/net/ieee802154/mcr20a.c
index 55a22c7..944470d 100644
--- a/drivers/net/ieee802154/mcr20a.c
+++ b/drivers/net/ieee802154/mcr20a.c
@@ -1267,7 +1267,7 @@ mcr20a_probe(struct spi_device *spi)
ret = mcr20a_get_platform_data(spi, pdata);
if (ret < 0) {
dev_crit(>dev, "mcr20a_get_platform_data failed.\n");
-   return ret;
+   goto free_pdata;
}
 
/* init reset gpio */
@@ -1275,7 +1275,7 @@ mcr20a_probe(struct spi_device *spi)
ret = devm_gpio_request_one(>dev, pdata->rst_gpio,
GPIOF_OUT_INIT_HIGH, "reset");
if (ret)
-   return ret;
+   goto free_pdata;
}
 
/* reset mcr20a */
@@ -1291,7 +1291,8 @@ mcr20a_probe(struct spi_device *spi)
hw = ieee802154_alloc_hw(sizeof(*lp), _hw_ops);
if (!hw) {
dev_crit(>dev, "ieee802154_alloc_hw failed\n");
-   return -ENOMEM;
+   ret = -ENOMEM;
+   goto free_pdata;
}
 
/* init mcr20a local data */
@@ -1366,6 +1367,8 @@ mcr20a_probe(struct spi_device *spi)
 
 free_dev:
ieee802154_free_hw(lp->hw);
+free_pdata:
+   kfree(pdata);
 
return ret;
 }
-- 
2.7.4



Re: [PATCH] net: thunderx: rework mac addresses list to u64 array

2018-04-05 Thread Christoph Hellwig
On Thu, Apr 05, 2018 at 09:07:49AM -0700, Vadim Lomovtsev wrote:
> > 
> > > + mc_list = kmalloc(sizeof(*mc_list) +
> > > +   sizeof(u64) * 
> > > netdev_mc_count(netdev),
> > > +   GFP_ATOMIC);
> > 
> > kmalloc_array(), please.
> 
> In this case it would require two memory allocation calls to kmalloc() for
> xcast_addr_list struct and to kmalloc_array() for 'mc' addresses, becasue of
> different data types and so two null-ptr checks .. this is what I'd like get 
> rid off.
> 
> My idea of this was to keep number of array elements and themselves within the
> same memory block/page to reduce number of memory allocation requests, number
> of allocated pages/blocks and avoid possible memory fragmentation (however, I 
> believe
> the latter is already handled at the mm layer).

Indeed, lets keep it as-is.


Re: [PATCH v3 2/4] bus: fsl-mc: add restool userspace support

2018-04-05 Thread gregkh
On Thu, Apr 05, 2018 at 03:35:30PM +, Ruxandra Ioana Ciocoi Radulescu wrote:
> > -Original Message-
> > From: Andrew Lunn [mailto:and...@lunn.ch]
> > Sent: Thursday, April 5, 2018 6:24 PM
> > To: Laurentiu Tudor 
> > Cc: Stuart Yoder ; Arnd Bergmann ;
> > Ioana Ciornei ; gregkh
> > ; Linux Kernel Mailing List  > ker...@vger.kernel.org>; Ruxandra Ioana Ciocoi Radulescu
> > ; Razvan Stefanescu
> > ; Roy Pledge ;
> > Networking 
> > Subject: Re: [PATCH v3 2/4] bus: fsl-mc: add restool userspace support
> > 
> > On Thu, Apr 05, 2018 at 02:43:29PM +, Laurentiu Tudor wrote:
> > > Hi Andrew,
> > >
> > > On 04/05/2018 03:48 PM, Andrew Lunn wrote:
> > > >>> Hi Laurentiu
> > > >>>
> > > >>> So i can use switchdev without it? I can modprobe the switchdev
> > > >>> driver, all the physical interfaces will appear, and i can use ip addr
> > > >>> add etc. I do not need to use a user space tool at all in order to use
> > > >>> the network functionality?
> > > >>
> > > >> Absolutely!
> > > >
> > > > Great.
> > > >
> > > > Then the easiest way forwards is to simply drop the IOCTL code for the
> > > > moment. Get the basic support for the hardware into the kernel
> > > > first. Then come back later to look at dynamic behaviour which needs
> > > > some form of configuration.
> > >
> > > Hmm, not sure I understand. We already have a fully functional ethernet
> > > driver [1] and a switch driver [2] ...
> > 
> > In staging, the tree of crap. You want to get it out of there and into
> > the main tree. But that effort is being side lined by the discussion
> > around this IOCTL call.  The best way forward is to to accept Greg is
> > not going to take this patchset at the moment, and move on. As you
> > said, it is not needed for the Ethernet and switchdev driver.
> > 
> > What needs to happen before the Ethernet driver can be reviewed for
> > moving out of staging?
> 
> Hi Andrew,
> 
> We're waiting for the DPIO driver (which we depend on) to be moved
> out of staging first, it's currently under review:
> https://lkml.org/lkml/2018/3/27/1086

That's stalled on my side right now as the merge window is open and I
can't do any new stuff until after 4.17-rc1 is out.  So everyone please
be patient a bit...

thanks,

greg k-h


[PATCH] qtnfmac: pearl: pcie: fix memory leak in qtnf_fw_work_handler

2018-04-05 Thread Gustavo A. R. Silva
In case memory resources for fw were succesfully allocated, release
them before jumping to fw_load_fail.

Addresses-Coverity-ID: 1466092 ("Resource leak")
Fixes: c3b2f7ca4186 ("qtnfmac: implement asynchronous firmware loading")
Signed-off-by: Gustavo A. R. Silva 
---
 drivers/net/wireless/quantenna/qtnfmac/pearl/pcie.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/wireless/quantenna/qtnfmac/pearl/pcie.c 
b/drivers/net/wireless/quantenna/qtnfmac/pearl/pcie.c
index f117904..6c1e139 100644
--- a/drivers/net/wireless/quantenna/qtnfmac/pearl/pcie.c
+++ b/drivers/net/wireless/quantenna/qtnfmac/pearl/pcie.c
@@ -1185,6 +1185,10 @@ static void qtnf_fw_work_handler(struct work_struct 
*work)
if (qtnf_poll_state(>bda->bda_ep_state, QTN_EP_FW_LOADRDY,
QTN_FW_DL_TIMEOUT_MS)) {
pr_err("card is not ready\n");
+
+   if (!flashboot)
+   release_firmware(fw);
+
goto fw_load_fail;
}
 
-- 
2.7.4



Re: marvell switch

2018-04-05 Thread Andrew Lunn
> Is there a wiki which explains switch configuration ?

Nope. The whole idea is that they behave like normal linux
interfaces. So there is no need to document them. You already know how
to use them.

> Is it possible to open socket and send/recieve on switch ports (lan0
> for example) ?

Sure. It is as normal interface. Give is an IP address and then do
TCP/IP as usual.

   Andrew


Re: [PATCH] net: thunderx: rework mac addresses list to u64 array

2018-04-05 Thread Vadim Lomovtsev
Hi Christoph,

Thank you for your feedback and time.

On Thu, Apr 05, 2018 at 08:07:48AM -0700, Christoph Hellwig wrote:
> >  struct xcast_addr_list {
> > -   struct list_head list;
> > int  count;
> > +   u64  mc[0];
> 
> Please use the standard C99 syntax here:
> 
>   u64  mc[];

Ok, will update.

> 
> > +   mc_list = kmalloc(sizeof(*mc_list) +
> > + sizeof(u64) * 
> > netdev_mc_count(netdev),
> > + GFP_ATOMIC);
> 
> kmalloc_array(), please.

In this case it would require two memory allocation calls to kmalloc() for
xcast_addr_list struct and to kmalloc_array() for 'mc' addresses, becasue of
different data types and so two null-ptr checks .. this is what I'd like get 
rid off.

My idea of this was to keep number of array elements and themselves within the
same memory block/page to reduce number of memory allocation requests, number
of allocated pages/blocks and avoid possible memory fragmentation (however, I 
believe
the latter is already handled at the mm layer).

WBR,
Vadim


Re: [PATCH 3/9] kbuild: Do not pass arguments to link-vmlinux.sh

2018-04-05 Thread Masahiro Yamada
2018-04-06 0:16 GMT+09:00 Jiri Olsa :
> There's no need to pass LD* arguments to link-vmlinux.sh,
> because they are passed as variables. The only argument
> the link-vmlinux.sh supports is the 'clean' argument.
>
> Signed-off-by: Jiri Olsa 
> ---

Wrong.

$(LD) $(LDFLAGS) $(LDFLAGS_vmlinux)
exist here so that any change in them
invokes scripts/linkk-vmlinux.sh




>  Makefile | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/Makefile b/Makefile
> index d3300e46f925..a65a3919c6ad 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -1027,7 +1027,7 @@ ARCH_POSTLINK := $(wildcard 
> $(srctree)/arch/$(SRCARCH)/Makefile.postlink)
>
>  # Final link of vmlinux with optional arch pass after final link
>  cmd_link-vmlinux = \
> -   $(CONFIG_SHELL) $< $(LD) $(LDFLAGS) $(LDFLAGS_vmlinux) ;\
> +   $(CONFIG_SHELL) $< ;   \
> $(if $(ARCH_POSTLINK), $(MAKE) -f $(ARCH_POSTLINK) $@, true)
>
>  vmlinux: scripts/link-vmlinux.sh vmlinux_prereq $(vmlinux-deps) FORCE






-- 
Best Regards
Masahiro Yamada


Re: [PATCH 2/2] net: phy: dp83640: Read strapped configuration settings

2018-04-05 Thread Florian Fainelli


On 04/05/2018 04:44 AM, esben.haaben...@gmail.com wrote:
> From: Esben Haabendal 
> 
> Read configration settings, to allow automatic forced speed/duplex setup
> by hardware strapping.

OK but why? What problem is this solving for you? In general, we do not
really want to preserve too much of what the PHY has been previously
configured with, provided that the PHY driver can re-instate these
configuration values.

I just wonder how this can be robust when you connect this PHY with
auto-negotiation disabled to a peer that expects a set of link
parameters not covered by the default advertisement values? This really
looks like a recipe for disaster when you could just disable
auto-negotiation with ethtool.

> 
> Signed-off-by: Esben Haabendal 
> Cc: Rasmus Villemoes 
> ---
>  drivers/net/phy/dp83640.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/net/phy/dp83640.c b/drivers/net/phy/dp83640.c
> index 654f42d00092..01e21b4998ad 100644
> --- a/drivers/net/phy/dp83640.c
> +++ b/drivers/net/phy/dp83640.c
> @@ -1134,6 +1134,10 @@ static int dp83640_probe(struct phy_device *phydev)
>   if (!dp83640)
>   goto no_memory;
>  
> + err = genphy_read_config(phydev);
> + if (err)
> + goto no_config;
> +
>   dp83640->phydev = phydev;
>   INIT_DELAYED_WORK(>ts_work, rx_timestamp_work);
>  
> @@ -1166,6 +1170,7 @@ static int dp83640_probe(struct phy_device *phydev)
>  
>  no_register:
>   clock->chosen = NULL;
> +no_config:
>   kfree(dp83640);
>  no_memory:
>   dp83640_clock_put(clock);
> 

-- 
Florian


Re: [PATCH 1/2] net: phy: Helper function for reading strapped configuration values

2018-04-05 Thread Florian Fainelli


On 04/05/2018 04:44 AM, esben.haaben...@gmail.com wrote:
> From: Esben Haabendal 
> 
> Add a function for use in PHY driver probe functions, reading current
> autoneg, speed and duplex configuration from BMCR register.
> 
> Useful for PHY that supports hardware strapped configuration, enabling
> Linux to respect that configuration (i.e. strapped non-autoneg
> configuration).
> 
> Signed-off-by: Esben Haabendal 
> Cc: Rasmus Villemoes 
> ---
>  drivers/net/phy/phy_device.c | 41 +
>  include/linux/phy.h  |  1 +
>  2 files changed, 42 insertions(+)
> 
> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> index 74664a6c0cdc..cc52ff2a2344 100644
> --- a/drivers/net/phy/phy_device.c
> +++ b/drivers/net/phy/phy_device.c
> @@ -1673,6 +1673,47 @@ int genphy_config_init(struct phy_device *phydev)
>  }
>  EXPORT_SYMBOL(genphy_config_init);
>  
> +/**
> + * genphy_read_config - read configuration from PHY
> + * @phydev: target phy_device struct
> + *
> + * Description: Reads MII_BMCR and sets phydev autoneg, speed and duplex
> + * accordingly.  For use in driver probe functions, to respect strapped
> + * configuration settings.
> + */
> +int genphy_read_config(struct phy_device *phydev)

This duplicates what already exists, in part at least within
genphy_read_status() can you find a way to use that?

I really wonder how this is going to work though because an user can
decide to force the PHY to have auto-negotiation disabled just like a
MAC could actually attempt to do that while connecting to the PHY...
more comments in patch 2.
-- 
Florian


Re: marvell switch

2018-04-05 Thread Ran Shalit
On Thu, Apr 5, 2018 at 3:22 PM, Andrew Lunn  wrote:
> On Thu, Apr 05, 2018 at 05:47:24AM +0300, Ran Shalit wrote:
>> Hello,
>>
>> I am trying to use marvell switch in linux,
>> Is it that the kernel drivers from marvell switch are used just to
>> enable all ports, or do they also provide APIs to userspace to enable
>> specific ports only.
>> I have not find examples or wiki for marvell switch, so I am not too
>> sure as what are the drivers meant for.
>
> Hi Ran
>
> The Marvell driver makes each port act like a normal Linux network
> interface. So if you want to enable a port, do
>
> ip link set lan0 up
>
> Want to add an ip address to a port
>
> ip addr add 10.42.42.42/24 dev lan0
>
> Want to bridge two ports
>
> ip link add name br0 type bridge
> ip link set dev br0 up
> ip link set dev lan0 master br0
> ip link set dev lan1 master br0
>
> Just treat them as normal interfaces.
>

Is there a wiki which explains switch configuration ?
What does it mean that they are treated as normal interfaces ?
Is it possible to open socket and send/recieve on switch ports (lan0
for example) ?
My understanding is that these ports are not treated the same as eth0
interface for example.

Thanks for the assistance,
ranran

>  Andrew


Re: [RFC] ethtool: Support for driver private ioctl's

2018-04-05 Thread Florian Fainelli


On 04/05/2018 03:47 AM, Jose Abreu wrote:
> Hi All,
> 
> I would like to know your opinion regarding adding support for
> driver private ioctl's in ethtool.
> 
> Background: Synopsys Ethernet IP's have a certain number of
> features which can be reconfigured at runtime. Giving you two
> examples: One of the most recent one is the safety features,
> which can be enabled/disabled and forced at runtime. Another one
> is a Flexible RX Parser which can route specific packets to
> specific RX DMA channels. Given that these are features specific
> to our IP's it would not be useful to add an uniform API for this
> because the users would only be one or two drivers ...

Parsing of packets and directing the matched packets to specific
queues/channels can be done through ethtool rxnfc API, tc/cls_flower as
well, so you should really check whether those APIs don't already allow
you to do what you want.

ethtool already supports a concept of private  flags, not ioctl() though
which allows you to toggle boolean values for instance (or technically
up to how many bits a "flag" is used to represent) is that enough or do
you need to turn on/off the feature as well as pass configuration
parameters?

> 
> This new feature would change the help usage for ethtool so that
> each driver private option would be shown, and then each driver
> specific file would have a structure with all the available
> options. Finally, each driver would have to handle the private
> IOCTL's.
> 
> We already have this working locally and now I would like to know
> your opinion about upstreaming this ... Do you think this can be
> useful for anyone else? Or should we change direction to use, for
> example, debugfs/configfs?

In general, even if there is only one driver implementing a particular
feature, the approach chosen is to come up with an API that is as
generic as possible. Even if there is a single user of that API in tree,
having something that was thought to be generic is better than allowing
uncontrolled private ioctl() implementations.
-- 
Florian


Re: [PATCH v2 bpf-next 0/3] bpf/verifier: subprog/func_call simplifications

2018-04-05 Thread Jiong Wang

On 03/04/2018 02:08, Alexei Starovoitov wrote:

Combining subprog pass with do_check is going into opposite direction
of this long term work. Divide and conquer. Combining more things into
do_check is the opposite of this programming principle.


Agree. And for the redundant insn traversal issue in check_subprogs that
Edward trying to fix, I am thinking we could do it by utilizing the
existing DFS traversal in check_cfg.

The current DFS probably could be improved into an generic instruction
information collection pass.

This won't make the existing DFS complexer as it only does information
collection as a side job during traversal. These collected information
then could be used to build any other information to be consumed later,
for example subprog, basic blocks etc.

For the redundant insn traversal issue during subprog detection, the
Like how we mark STATE_LIST_MARK in DFS, we could just call add_subprog
for BPF_PSEUDO_CALL insn during DFS.

i.e we change the code logic of check_cfg into:

check_cfg
{
  * DFS traversal:
- detect back-edge.
- collect STATE_LIST_MARK.
- collect subprog destination.

  * check all insns are reachable.
  * check_subprogs (insn traversal removed).
}

I prototyped a patch for above change, the change looks to me is easier to
review. There are 8 regressions on selftests however after this change due
to check_subprogs moved after some checks in DFS that some errors detected
before the expected errors, need to be cleaned up.

Does this direction sound good?

And if we want to build basic-block later, we could just call a new add_bb
(similar as add_subprog) for jump targets etc. (some of those places are
actually STATE_LIST_MARK_JMPTARGET if we separate STATE_LIST_MARK as
STATE_LIST_MARK_JMPTARGET and STATE_LIST_MARK_HEURISTIC).

Regards,
Jiong



RE: [PATCH v3 2/4] bus: fsl-mc: add restool userspace support

2018-04-05 Thread Ruxandra Ioana Ciocoi Radulescu
> -Original Message-
> From: Andrew Lunn [mailto:and...@lunn.ch]
> Sent: Thursday, April 5, 2018 6:24 PM
> To: Laurentiu Tudor 
> Cc: Stuart Yoder ; Arnd Bergmann ;
> Ioana Ciornei ; gregkh
> ; Linux Kernel Mailing List  ker...@vger.kernel.org>; Ruxandra Ioana Ciocoi Radulescu
> ; Razvan Stefanescu
> ; Roy Pledge ;
> Networking 
> Subject: Re: [PATCH v3 2/4] bus: fsl-mc: add restool userspace support
> 
> On Thu, Apr 05, 2018 at 02:43:29PM +, Laurentiu Tudor wrote:
> > Hi Andrew,
> >
> > On 04/05/2018 03:48 PM, Andrew Lunn wrote:
> > >>> Hi Laurentiu
> > >>>
> > >>> So i can use switchdev without it? I can modprobe the switchdev
> > >>> driver, all the physical interfaces will appear, and i can use ip addr
> > >>> add etc. I do not need to use a user space tool at all in order to use
> > >>> the network functionality?
> > >>
> > >> Absolutely!
> > >
> > > Great.
> > >
> > > Then the easiest way forwards is to simply drop the IOCTL code for the
> > > moment. Get the basic support for the hardware into the kernel
> > > first. Then come back later to look at dynamic behaviour which needs
> > > some form of configuration.
> >
> > Hmm, not sure I understand. We already have a fully functional ethernet
> > driver [1] and a switch driver [2] ...
> 
> In staging, the tree of crap. You want to get it out of there and into
> the main tree. But that effort is being side lined by the discussion
> around this IOCTL call.  The best way forward is to to accept Greg is
> not going to take this patchset at the moment, and move on. As you
> said, it is not needed for the Ethernet and switchdev driver.
> 
> What needs to happen before the Ethernet driver can be reviewed for
> moving out of staging?

Hi Andrew,

We're waiting for the DPIO driver (which we depend on) to be moved
out of staging first, it's currently under review:
https://lkml.org/lkml/2018/3/27/1086

Ioana


Re: [virtio-dev] Re: [RFC PATCH 1/3] qemu: virtio-bypass should explicitly bind to a passthrough device

2018-04-05 Thread Paolo Bonzini
On 04/04/2018 10:02, Siwei Liu wrote:
>> pci_bus_num is almost always a bug if not done within
>> a context of a PCI host, bridge, etc.
>>
>> In particular this will not DTRT if run before guest assigns bus
>> numbers.
>>
> I was seeking means to reserve a specific pci bus slot from drivers,
> and update the driver when guest assigns the bus number but it seems
> there's no low-hanging fruits. Because of that reason the bus_num is
> only obtained until it's really needed (during get_config) and I
> assume at that point the pci bus assignment is already done. I know
> the current one is not perfect, but we need that information (PCI
> bus:slot.func number) to name the guest device correctly.

Can you use the -device "id", and look it up as

devices = container_get(qdev_get_machine(), "/peripheral");
return object_resolve_path_component(devices, id);

?

Thanks,

Paolo


Re: [PATCH v3 2/4] bus: fsl-mc: add restool userspace support

2018-04-05 Thread Andrew Lunn
On Thu, Apr 05, 2018 at 02:43:29PM +, Laurentiu Tudor wrote:
> Hi Andrew,
> 
> On 04/05/2018 03:48 PM, Andrew Lunn wrote:
> >>> Hi Laurentiu
> >>>
> >>> So i can use switchdev without it? I can modprobe the switchdev
> >>> driver, all the physical interfaces will appear, and i can use ip addr
> >>> add etc. I do not need to use a user space tool at all in order to use
> >>> the network functionality?
> >>
> >> Absolutely!
> >
> > Great.
> >
> > Then the easiest way forwards is to simply drop the IOCTL code for the
> > moment. Get the basic support for the hardware into the kernel
> > first. Then come back later to look at dynamic behaviour which needs
> > some form of configuration.
> 
> Hmm, not sure I understand. We already have a fully functional ethernet 
> driver [1] and a switch driver [2] ...

In staging, the tree of crap. You want to get it out of there and into
the main tree. But that effort is being side lined by the discussion
around this IOCTL call.  The best way forward is to to accept Greg is
not going to take this patchset at the moment, and move on. As you
said, it is not needed for the Ethernet and switchdev driver.

What needs to happen before the Ethernet driver can be reviewed for
moving out of staging?

   Andrew


Kernel bug from adding bpf actions in tc

2018-04-05 Thread Lucas Bates
Hi Davide,

Our overnight tc test runs of net-next revealed a kernel bug on one of
the BPF tests you submitted, d959.  The add action completes
successfully, but the bug occurs on the verify when tdc does a get of
the action that was just added.  Here's the text of the dump:

[   61.973632] BUG: unable to handle kernel NULL pointer dereference
at 0020
[   61.974366] PGD 800081881067 P4D 800081881067 PUD 83784067 PMD 0
[   61.974986] Oops:  [#1] SMP PTI
[   61.975309] Modules linked in: kvm_intel kvm irqbypass
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel
aes_x86_64 crypto_simd psmouse glue_helper cryptd serio_raw
[   61.976800] CPU: 28 PID: 1087 Comm: tc Not tainted 4.16.0+ #28
[   61.977329] RIP: 0010:__bpf_prog_put+0x5/0xe0
[   61.977731] RSP: 0018:9647c4823788 EFLAGS: 00010202
[   61.978204] RAX:  RBX: 9647c48237a0 RCX: 176c
[   61.978845] RDX:  RSI: 0001 RDI: 
[   61.979484] RBP:  R08: 00025be0 R09: a4794077
[   61.980121] R10: dc1bc2130b00 R11:  R12: 
[   61.980763] R13: 0001 R14: 8889869969f0 R15: ffea
[   61.981398] FS:  7faa72489700() GS:888988f0()
knlGS:
[   61.982114] CS:  0010 DS:  ES:  CR0: 80050033
[   61.982627] CR2: 0020 CR3: 85938004 CR4: 003606a0
[   61.983263] DR0:  DR1:  DR2: 
[   61.983897] DR3:  DR6: fffe0ff0 DR7: 0400
[   61.984531] Call Trace:
[   61.984766]  tcf_bpf_cfg_cleanup+0x2f/0x40
[   61.985139]  tcf_bpf_cleanup+0x3c/0x50
[   61.985479]  ? uncore_event_cpu_online+0x80/0x3c0
[   61.985922]  __tcf_idr_release+0x72/0x150
[   61.986297]  tcf_bpf_init+0x102/0x3e0
[   61.986637]  ? perf_trace_sched_process_exec+0xf4/0x140
[   61.987108]  tcf_action_init_1+0x36c/0x410
[   61.987482]  ? ___slab_alloc+0x218/0x4b0
[   61.987841]  tcf_action_init+0x106/0x190
[   61.988204]  tc_ctl_action+0x11a/0x220
[   61.988551]  rtnetlink_rcv_msg+0x243/0x2f0
[   61.988931]  ? _cond_resched+0x16/0x40
[   61.989277]  ? __kmalloc_node_track_caller+0x1e6/0x2a0
[   61.989746]  ? rtnl_calcit.isra.29+0xe0/0xe0
[   61.990137]  netlink_rcv_skb+0xde/0x110
[   61.990494]  netlink_unicast+0x16d/0x220
[   61.990858]  netlink_sendmsg+0x293/0x370
[   61.991224]  sock_sendmsg+0x36/0x40
[   61.991559]  ___sys_sendmsg+0x2cb/0x2e0
[   61.991913]  ? pagecache_get_page+0x27/0x220
[   61.992302]  ? filemap_fault+0xa2/0x650
[   61.992651]  ? page_add_file_rmap+0x108/0x200
[   61.993057]  ? alloc_set_pte+0x2aa/0x530
[   61.993419]  ? finish_fault+0x4e/0x70
[   61.993758]  ? __handle_mm_fault+0xbfc/0x1110
[   61.994160]  ? __sys_sendmsg+0x53/0x80
[   61.994506]  __sys_sendmsg+0x53/0x80
[   61.994849]  do_syscall_64+0x6e/0x120
[   61.995189]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[   61.995662] RIP: 0033:0x7faa71885bd0
[   61.995991] RSP: 002b:7ffccf65bf28 EFLAGS: 0246 ORIG_RAX:
002e
[   61.996693] RAX: ffda RBX: 7ffccf65c050 RCX: 7faa71885bd0
[   61.997350] RDX:  RSI: 7ffccf65bfa0 RDI: 0003
[   61.998001] RBP: 5ac513e0 R08: 0001 R09: 
[   61.998649] R10: 05e7 R11: 0246 R12: 
[   61.999287] R13: 7ffccf6600b0 R14: 0001 R15: 00674600
[   61.42] Code: c6 72 00 48 8b 43 20 48 c7 c7 d0 61 9b a5 c7 40
10 00 00 00 00 5b e9 1b d9 74 00 90 66 2e 0f 1f 84 00 00 00 00 00 66
66 66 66 90 <48> 8b 47 20 f0 ff 08 74 01 c3 41 54 41 89 f4 55 48 89 fd
53 66
[   62.001660] RIP: __bpf_prog_put+0x5/0xe0 RSP: 9647c4823788
[   62.002183] CR2: 0020
[   62.002488] ---[ end trace 19b56d1a66dd8e2a ]---



I'm sending this your way because you were the last one to touch this
part of the code.  Have you seen this in your own testing?  (This
can't be replicated by hand, only by running tdc).

- Lucas


[RFC 0/9] bpf: Add buildid check support

2018-04-05 Thread Jiri Olsa
hi,
eBPF programs loaded for kprobes are allowed to read kernel
internal structures. We check the provided kernel version
to ensure that the program is loaded for the proper kernel. 

The problem is that the version check is not enough, because
it only follows the version setup from kernel's Makefile.
However, the internal kernel structures change based on the
.config data, so in practise we have different kernels with
same version.

The eBPF kprobe program thus then get loaded for different
kernel than it's been built for, get wrong data (silently)
and provide misleading output.

This patchset implements additional check in eBPF loading code
on provided build ID (from kernel's elf image, .notes section
GNU build ID) to ensure we load the eBPF program on correct
kernel.

Also available in here (based on bpf-next/master):
  https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
  bpf/checksum

This patchset consists of several changes:

- adding CONFIG_BUILDID_H option that instructs the build
  to generate uapi header file with build ID data, that
  will be included by eBPF program

- adding CONFIG_BPF_BUILDID_CHECK option and new bpf_attr
  field to allow build ID checking when loading the eBPF
  program

- changing libbpf to read and pass build ID to the kernel

- several small side fixes

- example perf eBPF code in bpf-samples/bpf-stdout-example.c
  to show the build ID support/usage.

# perf record -vv  -e ./bpf-samples/bpf-stdout-example.c kill 2>&1 | grep 
buildid
libbpf: section(7) buildid, size 21, link 0, flags 3, type=1
libbpf: kernel buildid of ./bpf-samples/bpf-stdout-example.c is: 
6e25edeb408513184e2753bebad25d42314501a0

  The buildid is provided the same way we provide kernel
  version, in a special "buildid" section:

# cat ./bpf-samples/bpf-stdout-example.c
...
#include 

char _buildid[] SEC("buildid") = LINUX_BUILDID_DATA;
...

  where LINUX_BUILDID_DATA is defined in the generated buildid.h.

please note it's an RFC ;-) any comments and suggestions are welcome

thanks,
jirka


---
Jiri Olsa (9):
  perf tools: Make read_build_id function public
  perf tools: Add fetch_kernel_buildid function
  kbuild: Do not pass arguments to link-vmlinux.sh
  kbuild: Add filechk2 function
  bpf: Add CONFIG_BUILDID_H option
  bpf: Add CONFIG_BPF_BUILDID_CHECK option
  libbpf: Synchronize uapi bpf.h header
  libbpf: Add support to attach buildid to program load
  perf tools: The buildid usage in example eBPF program

 Makefile| 14 +-
 include/uapi/linux/bpf.h|  2 ++
 init/Kconfig| 12 
 kernel/bpf/syscall.c| 84 
+++-
 scripts/Kbuild.include  | 24 
 scripts/Makefile|  1 +
 scripts/extract-buildid.c   | 42 
++
 tools/bpf/bpftool/Makefile  |  5 -
 tools/include/uapi/linux/bpf.h  |  3 +++
 tools/lib/bpf/bpf.c |  6 --
 tools/lib/bpf/bpf.h |  5 +++--
 tools/lib/bpf/libbpf.c  | 46 
--
 tools/perf/bpf-samples/bpf-stdout-example.c | 42 
++
 tools/perf/tests/bpf.c  |  9 -
 tools/perf/util/symbol-minimal.c| 50 
++
 tools/perf/util/util.c  | 66 
++
 tools/perf/util/util.h  |  6 ++
 17 files changed, 355 insertions(+), 62 deletions(-)
 create mode 100644 scripts/extract-buildid.c
 create mode 100644 tools/perf/bpf-samples/bpf-stdout-example.c


[PATCH 1/9] perf tools: Make read_build_id function public

2018-04-05 Thread Jiri Olsa
And renaming it into parse_notes_buildid to be more
precise and usable in following patches.

Link: http://lkml.kernel.org/n/tip-v1mz76rkdxfnbfz2v05fu...@git.kernel.org
Signed-off-by: Jiri Olsa 
---
 tools/perf/util/symbol-minimal.c | 50 ++--
 tools/perf/util/util.c   | 48 ++
 tools/perf/util/util.h   |  4 
 3 files changed, 54 insertions(+), 48 deletions(-)

diff --git a/tools/perf/util/symbol-minimal.c b/tools/perf/util/symbol-minimal.c
index ff48d0d49584..bd281d3dc508 100644
--- a/tools/perf/util/symbol-minimal.c
+++ b/tools/perf/util/symbol-minimal.c
@@ -24,53 +24,6 @@ static bool check_need_swap(int file_endian)
return host_endian != file_endian;
 }
 
-#define NOTE_ALIGN(sz) (((sz) + 3) & ~3)
-
-#define NT_GNU_BUILD_ID3
-
-static int read_build_id(void *note_data, size_t note_len, void *bf,
-size_t size, bool need_swap)
-{
-   struct {
-   u32 n_namesz;
-   u32 n_descsz;
-   u32 n_type;
-   } *nhdr;
-   void *ptr;
-
-   ptr = note_data;
-   while (ptr < (note_data + note_len)) {
-   const char *name;
-   size_t namesz, descsz;
-
-   nhdr = ptr;
-   if (need_swap) {
-   nhdr->n_namesz = bswap_32(nhdr->n_namesz);
-   nhdr->n_descsz = bswap_32(nhdr->n_descsz);
-   nhdr->n_type = bswap_32(nhdr->n_type);
-   }
-
-   namesz = NOTE_ALIGN(nhdr->n_namesz);
-   descsz = NOTE_ALIGN(nhdr->n_descsz);
-
-   ptr += sizeof(*nhdr);
-   name = ptr;
-   ptr += namesz;
-   if (nhdr->n_type == NT_GNU_BUILD_ID &&
-   nhdr->n_namesz == sizeof("GNU")) {
-   if (memcmp(name, "GNU", sizeof("GNU")) == 0) {
-   size_t sz = min(size, descsz);
-   memcpy(bf, ptr, sz);
-   memset(bf + sz, 0, size - sz);
-   return 0;
-   }
-   }
-   ptr += descsz;
-   }
-
-   return -1;
-}
-
 int filename__read_debuglink(const char *filename __maybe_unused,
 char *debuglink __maybe_unused,
 size_t size __maybe_unused)
@@ -153,7 +106,8 @@ int filename__read_build_id(const char *filename, void *bf, 
size_t size)
if (fread(buf, buf_size, 1, fp) != 1)
goto out_free;
 
-   ret = read_build_id(buf, buf_size, bf, size, need_swap);
+   ret = parse_notes_buildid(buf, buf_size, bf, size,
+ need_swap);
if (ret == 0)
ret = size;
break;
diff --git a/tools/perf/util/util.c b/tools/perf/util/util.c
index 1019bbc5dbd8..41fc61d941ef 100644
--- a/tools/perf/util/util.c
+++ b/tools/perf/util/util.c
@@ -447,6 +447,54 @@ fetch_kernel_version(unsigned int *puint, char *str,
return 0;
 }
 
+#define NOTE_ALIGN(sz) (((sz) + 3) & ~3)
+
+#define NT_GNU_BUILD_ID3
+
+int parse_notes_buildid(void *note_data, size_t note_len, void *bf,
+   size_t size, bool need_swap)
+{
+   struct {
+   u32 n_namesz;
+   u32 n_descsz;
+   u32 n_type;
+   } *nhdr;
+   void *ptr;
+
+   ptr = note_data;
+   while (ptr < (note_data + note_len)) {
+   const char *name;
+   size_t namesz, descsz;
+
+   nhdr = ptr;
+   if (need_swap) {
+   nhdr->n_namesz = bswap_32(nhdr->n_namesz);
+   nhdr->n_descsz = bswap_32(nhdr->n_descsz);
+   nhdr->n_type = bswap_32(nhdr->n_type);
+   }
+
+   namesz = NOTE_ALIGN(nhdr->n_namesz);
+   descsz = NOTE_ALIGN(nhdr->n_descsz);
+
+   ptr += sizeof(*nhdr);
+   name = ptr;
+   ptr += namesz;
+   if (nhdr->n_type == NT_GNU_BUILD_ID &&
+   nhdr->n_namesz == sizeof("GNU")) {
+   if (memcmp(name, "GNU", sizeof("GNU")) == 0) {
+   size_t sz = min(size, descsz);
+
+   memcpy(bf, ptr, sz);
+   memset(bf + sz, 0, size - sz);
+   return 0;
+   }
+   }
+   ptr += descsz;
+   }
+
+   return -1;
+}
+
 const char *perf_tip(const char *dirpath)
 {
struct strlist *tips;
diff --git a/tools/perf/util/util.h b/tools/perf/util/util.h
index 9496365da3d7..27106548396b 100644
--- a/tools/perf/util/util.h
+++ b/tools/perf/util/util.h
@@ -47,6 +47,10 

[PATCH 2/9] perf tools: Add fetch_kernel_buildid function

2018-04-05 Thread Jiri Olsa
Adding fetch_kernel_buildid helper function to
retrieve build id from running kernel. It will
be used in following patches.

Link: http://lkml.kernel.org/n/tip-at98orsncas8v2ito61u3...@git.kernel.org
Signed-off-by: Jiri Olsa 
---
 tools/perf/util/util.c | 18 ++
 tools/perf/util/util.h |  2 ++
 2 files changed, 20 insertions(+)

diff --git a/tools/perf/util/util.c b/tools/perf/util/util.c
index 41fc61d941ef..26f93866bc02 100644
--- a/tools/perf/util/util.c
+++ b/tools/perf/util/util.c
@@ -495,6 +495,24 @@ int parse_notes_buildid(void *note_data, size_t note_len, 
void *bf,
return -1;
 }
 
+int fetch_kernel_buildid(char *buildid, int size)
+{
+   char path[PATH_MAX], *buf;
+   size_t len;
+   int err;
+
+   scnprintf(path, PATH_MAX, "%s/kernel/notes",
+ sysfs__mountpoint());
+
+   if (filename__read_str(path, , ))
+   return -EINVAL;
+
+   err = parse_notes_buildid(buf, len, buildid, size, false);
+
+   free(buf);
+   return err;
+}
+
 const char *perf_tip(const char *dirpath)
 {
struct strlist *tips;
diff --git a/tools/perf/util/util.h b/tools/perf/util/util.h
index 27106548396b..1ccaa957e3ab 100644
--- a/tools/perf/util/util.h
+++ b/tools/perf/util/util.h
@@ -48,6 +48,8 @@ extern int cacheline_size;
 int fetch_kernel_version(unsigned int *puint,
 char *str, size_t str_sz);
 
+int fetch_kernel_buildid(char *buildid, int size);
+
 int parse_notes_buildid(void *note_data, size_t note_len, void *bf,
size_t size, bool need_swap);
 
-- 
2.13.6



[PATCH 4/9] kbuild: Add filechk2 function

2018-04-05 Thread Jiri Olsa
Adding filechk2 function  that has the same semantics
as filechk, but it takes the target file from the 2nd
argument instead of from the '$@' as in the filechk
function.

This function is needed when we can't have separate
target for the file, like in the following patch.

Signed-off-by: Jiri Olsa 
---
 scripts/Kbuild.include | 24 
 1 file changed, 24 insertions(+)

diff --git a/scripts/Kbuild.include b/scripts/Kbuild.include
index 065324a8046f..9775ce2771d4 100644
--- a/scripts/Kbuild.include
+++ b/scripts/Kbuild.include
@@ -67,6 +67,30 @@ define filechk
fi
 endef
 
+# filechk2 is used to check if the content of a generated file is updated.
+# It follows the same logic as the filechk except instead of the $@ target
+# variable, the checked file is passed in 2nd argument.
+#
+# Sample usage:
+# define filechk_sample
+#  echo $KERNELRELEASE
+# endef
+# version.h : Makefile
+#  $(call filechk2,sample,version.h)
+# endef
+define filechk2
+   $(Q)set -e; \
+   $(kecho) '  CHK $(2)';  \
+   mkdir -p $(dir $(2));   \
+   $(filechk_$(1)) < $< > $(2).tmp;\
+   if [ -r $(2) ] && cmp -s $(2) $(2).tmp; then\
+   rm -f $(2).tmp; \
+   else\
+   $(kecho) '  UPD $(2)';  \
+   mv -f $(2).tmp $(2);\
+   fi
+endef
+
 ##
 # gcc support functions
 # See documentation in Documentation/kbuild/makefiles.txt
-- 
2.13.6



[PATCH 3/9] kbuild: Do not pass arguments to link-vmlinux.sh

2018-04-05 Thread Jiri Olsa
There's no need to pass LD* arguments to link-vmlinux.sh,
because they are passed as variables. The only argument
the link-vmlinux.sh supports is the 'clean' argument.

Signed-off-by: Jiri Olsa 
---
 Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index d3300e46f925..a65a3919c6ad 100644
--- a/Makefile
+++ b/Makefile
@@ -1027,7 +1027,7 @@ ARCH_POSTLINK := $(wildcard 
$(srctree)/arch/$(SRCARCH)/Makefile.postlink)
 
 # Final link of vmlinux with optional arch pass after final link
 cmd_link-vmlinux = \
-   $(CONFIG_SHELL) $< $(LD) $(LDFLAGS) $(LDFLAGS_vmlinux) ;\
+   $(CONFIG_SHELL) $< ;   \
$(if $(ARCH_POSTLINK), $(MAKE) -f $(ARCH_POSTLINK) $@, true)
 
 vmlinux: scripts/link-vmlinux.sh vmlinux_prereq $(vmlinux-deps) FORCE
-- 
2.13.6



[PATCH 5/9] bpf: Add CONFIG_BUILDID_H option

2018-04-05 Thread Jiri Olsa
Adding CONFIG_BUILDID_H option that forces build to generate
file with GNU build id value:

  include/linux/buildid.h

It contains following macros:

  #define LINUX_BUILDID_DATA "\x6c\x41\x0f\xea\xa9\x5d ...
  #define LINUX_BUILDID_SIZE 20

Those macros will be used in following patches to identify
kernel in more precise way when loading eBPF program that
can touch kernel internal structures.

There's new build output for the check and update
of the buildid.h:

$ make
...
LD  vmlinux.o
MODPOST vmlinux.o
KSYM.tmp_kallsyms1.o
KSYM.tmp_kallsyms2.o
LD  vmlinux
SORTEX  vmlinux
SYSMAP  System.map
CHK include/generated/uapi/linux/buildid.h
UPD include/generated/uapi/linux/buildid.h
...

Signed-off-by: Jiri Olsa 
---
 Makefile  | 12 
 init/Kconfig  |  3 +++
 scripts/Makefile  |  1 +
 scripts/extract-buildid.c | 42 ++
 4 files changed, 58 insertions(+)
 create mode 100644 scripts/extract-buildid.c

diff --git a/Makefile b/Makefile
index a65a3919c6ad..92b04d8f08bc 100644
--- a/Makefile
+++ b/Makefile
@@ -1023,6 +1023,15 @@ endif
 include/generated/autoksyms.h: FORCE
$(Q)$(CONFIG_SHELL) $(srctree)/scripts/adjust_autoksyms.sh true
 
+ifdef CONFIG_BUILDID_H
+buildid_h := include/linux/buildid.h
+
+define filechk_buildid.h
+   buildid=`readelf -n $@ | grep 'Build ID' | sed -e 's/^.*Build ID: 
\(.*\)$$/\1/'`; \
+   scripts/extract-buildid $$buildid
+endef
+endif
+
 ARCH_POSTLINK := $(wildcard $(srctree)/arch/$(SRCARCH)/Makefile.postlink)
 
 # Final link of vmlinux with optional arch pass after final link
@@ -1032,6 +1041,9 @@ cmd_link-vmlinux =
 \
 
 vmlinux: scripts/link-vmlinux.sh vmlinux_prereq $(vmlinux-deps) FORCE
+$(call if_changed,link-vmlinux)
+ifdef CONFIG_BUILDID_H
+   +$(call filechk2,buildid.h,$(buildid_h))
+endif
 
 # Build samples along the rest of the kernel
 ifdef CONFIG_SAMPLES
diff --git a/init/Kconfig b/init/Kconfig
index 2852692d7c9c..572df24dda9b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1386,6 +1386,9 @@ config KALLSYMS_BASE_RELATIVE
 
 # end of the "standard kernel features (expert users)" menu
 
+config BUILDID_H
+   bool
+
 # syscall, maps, verifier
 config BPF_SYSCALL
bool "Enable bpf() system call"
diff --git a/scripts/Makefile b/scripts/Makefile
index 25ab143cbe14..fa34eaed6c29 100644
--- a/scripts/Makefile
+++ b/scripts/Makefile
@@ -19,6 +19,7 @@ hostprogs-$(CONFIG_ASN1)   += asn1_compiler
 hostprogs-$(CONFIG_MODULE_SIG)  += sign-file
 hostprogs-$(CONFIG_SYSTEM_TRUSTED_KEYRING) += extract-cert
 hostprogs-$(CONFIG_SYSTEM_EXTRA_CERTIFICATE) += insert-sys-cert
+hostprogs-$(CONFIG_BUILDID_H)   += extract-buildid
 
 HOSTCFLAGS_sortextable.o = -I$(srctree)/tools/include
 HOSTCFLAGS_asn1_compiler.o = -I$(srctree)/include
diff --git a/scripts/extract-buildid.c b/scripts/extract-buildid.c
new file mode 100644
index ..a116723da3ad
--- /dev/null
+++ b/scripts/extract-buildid.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Formats buildid into following macros:
+ *
+ * #define LINUX_BUILDID_DATA "\x6c\x41\x0f\xea\xa9\x5d\x46 ...
+ * #define LINUX_BUILDID_SIZE 20
+ *
+ */
+#include 
+#include 
+
+int main(int argc, char **argv)
+{
+   char *id;
+   int len, i;
+
+   if (argc != 2) {
+   fprintf(stderr, "usage: %s buildid\n", argv[0]);
+   return -1;
+   }
+
+   id  = argv[1];
+   len = strlen(id);
+
+   printf("#ifndef _LINUX_BUILDID_H\n");
+   printf("#define _LINUX_BUILDID_H\n");
+   printf("\n");
+
+   printf("#define LINUX_BUILDID_DATA \"");
+
+   for (i = 0; i < len; i += 2)
+   printf("\\x%c%c", id[i], id[i + 1]);
+
+   printf("\"\n");
+
+   printf("#define LINUX_BUILDID_SIZE %u\n", len / 2);
+
+   printf("\n");
+   printf("#endif /* _LINUX_BUILDID_H */\n");
+   return 0;
+}
-- 
2.13.6



[PATCH 7/9] libbpf: Synchronize uapi bpf.h header

2018-04-05 Thread Jiri Olsa
Synchronize include/uapi/linux/bpf.h with tools version.

Link: http://lkml.kernel.org/n/tip-gaja0nnet6oku657642nx...@git.kernel.org
Signed-off-by: Jiri Olsa 
---
 tools/include/uapi/linux/bpf.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 9d07465023a2..17d8d330e6c7 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -308,6 +308,8 @@ union bpf_attr {
 * (context accesses, allowed helpers, etc).
 */
__u32   expected_attach_type;
+   /* Checked when prog_type=kprobe and CONFIG_BPF_BUILDID_CHECK. 
*/
+   __aligned_u64   kern_buildid;
};
 
struct { /* anonymous struct used by BPF_OBJ_* commands */
@@ -864,6 +866,7 @@ enum bpf_func_id {
 /* BPF_FUNC_skb_set_tunnel_key flags. */
 #define BPF_F_ZERO_CSUM_TX (1ULL << 1)
 #define BPF_F_DONT_FRAGMENT(1ULL << 2)
+#define BPF_F_SEQ_NUMBER   (1ULL << 3)
 
 /* BPF_FUNC_perf_event_output, BPF_FUNC_perf_event_read and
  * BPF_FUNC_perf_event_read_value flags.
-- 
2.13.6



[PATCH 6/9] bpf: Add CONFIG_BPF_BUILDID_CHECK option

2018-04-05 Thread Jiri Olsa
Adding CONFIG_BPF_BUILDID_CHECK option that forces kernel
to check on provided build id when loading eBPF program.
If the build id does not match the one in kernel the program
fails to load.

Adding new field into struct bpf_attr. The kern_buildid
points to the user memory that contains the buildid.

Kernel expose the notes section via sysfs, but there's
currently no other use for kernel's buildid, so I needed
to add new __init buildid_init function to parse it out.

Signed-off-by: Jiri Olsa 
---
 include/uapi/linux/bpf.h |  2 ++
 init/Kconfig |  9 ++
 kernel/bpf/syscall.c | 84 +++-
 3 files changed, 94 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c5ec89732a8d..17d8d330e6c7 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -308,6 +308,8 @@ union bpf_attr {
 * (context accesses, allowed helpers, etc).
 */
__u32   expected_attach_type;
+   /* Checked when prog_type=kprobe and CONFIG_BPF_BUILDID_CHECK. 
*/
+   __aligned_u64   kern_buildid;
};
 
struct { /* anonymous struct used by BPF_OBJ_* commands */
diff --git a/init/Kconfig b/init/Kconfig
index 572df24dda9b..6d32220de7e0 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1406,6 +1406,15 @@ config BPF_JIT_ALWAYS_ON
  Enables BPF JIT and removes BPF interpreter to avoid
  speculative execution of BPF instructions by the interpreter
 
+config BPF_BUILDID_CHECK
+   bool "Check buildid for kprobe and tracepoint programs"
+   depends on BPF_SYSCALL
+   select BUILDID_H
+   default n
+   help
+ Enables BPF program load code to check on kernel Build ID
+ for kprobe programs.
+
 config USERFAULTFD
bool "Enable userfaultfd() system call"
select ANON_INODES
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0244973ee544..d0b3bc0bd9e6 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1239,7 +1239,85 @@ static int bpf_prog_attach_check_attach_type(const 
struct bpf_prog *prog,
 }
 
 /* last field in 'union bpf_attr' used by this command */
-#defineBPF_PROG_LOAD_LAST_FIELD expected_attach_type
+#defineBPF_PROG_LOAD_LAST_FIELD kern_buildid
+
+#ifdef CONFIG_BPF_BUILDID_CHECK
+
+static struct {
+   const char *ptr;
+   int size;
+} buildid;
+
+#define BUILDID_MAX 40
+
+static int __check_buildid(union bpf_attr *attr)
+{
+   char id[buildid.size];
+
+   /* copy buildid from user space */
+   if (strncpy_from_user(id, u64_to_user_ptr(attr->kern_buildid),
+ buildid.size) < 0)
+   return -EFAULT;
+
+   return memcmp(id, buildid.ptr, buildid.size);
+}
+
+static int check_buildid(union bpf_attr *attr)
+{
+   return buildid.ptr ? __check_buildid(attr) : 0;
+}
+
+#define NT_ALIGN(n)(((n) + 3) & ~3)
+#define NT_GNU_BUILD_ID3
+
+extern const void __start_notes __weak;
+extern const void __stop_notes __weak;
+
+static int __init buildid_init(void)
+{
+   const void *ptr = &__start_notes;
+
+   while (ptr < &__stop_notes) {
+   const Elf64_Nhdr *nhdr = ptr;
+   size_t namesz = NT_ALIGN(nhdr->n_namesz),
+  descsz = NT_ALIGN(nhdr->n_descsz);
+   const char *name;
+
+   ptr += sizeof(*nhdr);
+   name = ptr;
+   ptr += namesz;
+
+   if (nhdr->n_type != NT_GNU_BUILD_ID ||
+   nhdr->n_namesz != sizeof("GNU") ||
+   memcmp(name, "GNU", sizeof("GNU"))) {
+   ptr += descsz;
+   continue;
+   }
+
+   buildid.ptr = ptr;
+   buildid.size = descsz;
+   break;
+   }
+
+   /* Sanity checks on the parsed buildid. */
+   if (!buildid.ptr) {
+   pr_warn("bpf: GNU buildid not found, switching off the 
check\n");
+   } else if (buildid.size > 64) {
+   pr_warn("bpf: GNU buildid too long, switching off the check\n");
+   buildid.ptr = NULL;
+   }
+
+   return 0;
+}
+
+subsys_initcall(buildid_init);
+
+#else
+static int check_buildid(union bpf_attr *attr __maybe_unused)
+{
+   return 0;
+}
+#endif /* CONFIG_BPF_BUILDID_CHECK */
 
 static int bpf_prog_load(union bpf_attr *attr)
 {
@@ -1271,6 +1349,10 @@ static int bpf_prog_load(union bpf_attr *attr)
attr->kern_version != LINUX_VERSION_CODE)
return -EINVAL;
 
+   if (type == BPF_PROG_TYPE_KPROBE &&
+   check_buildid(attr))
+   return -EINVAL;
+
if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
type != BPF_PROG_TYPE_CGROUP_SKB &&
!capable(CAP_SYS_ADMIN))
-- 
2.13.6



[PATCH 8/9] libbpf: Add support to attach buildid to program load

2018-04-05 Thread Jiri Olsa
Adding support to retrieve buildid from elf's "buildid"
section and passing it through to the load_program
function to kernel bpf syscall.

Fixing perf use of the bpf_load_program function and
linking in the vsprintf.o into bpftool to have the
scnprintf function in.

Link: http://lkml.kernel.org/n/tip-2pafwtzbyosmf9ftuf0ud...@git.kernel.org
Signed-off-by: Jiri Olsa 
---
 tools/bpf/bpftool/Makefile |  5 -
 tools/lib/bpf/bpf.c|  6 --
 tools/lib/bpf/bpf.h|  5 +++--
 tools/lib/bpf/libbpf.c | 46 --
 tools/perf/tests/bpf.c |  9 -
 5 files changed, 59 insertions(+), 12 deletions(-)

diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile
index 4e69782c4a79..9ac11ea5de1c 100644
--- a/tools/bpf/bpftool/Makefile
+++ b/tools/bpf/bpftool/Makefile
@@ -75,11 +75,14 @@ include $(wildcard $(OUTPUT)*.d)
 all: $(OUTPUT)bpftool
 
 SRCS = $(wildcard *.c)
-OBJS = $(patsubst %.c,$(OUTPUT)%.o,$(SRCS)) $(OUTPUT)disasm.o
+OBJS = $(patsubst %.c,$(OUTPUT)%.o,$(SRCS)) $(OUTPUT)disasm.o 
$(OUTPUT)vsprintf.o
 
 $(OUTPUT)disasm.o: $(srctree)/kernel/bpf/disasm.c
$(QUIET_CC)$(COMPILE.c) -MMD -o $@ $<
 
+$(OUTPUT)vsprintf.o: $(srctree)/tools/lib/vsprintf.c
+   $(QUIET_CC)$(COMPILE.c) -MMD -o $@ $<
+
 $(OUTPUT)bpftool: $(OBJS) $(LIBBPF)
$(QUIET_LINK)$(CC) $(CFLAGS) -o $@ $^ $(LIBS)
 
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index acbb3f8b3bec..8e384db5bbbd 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -168,6 +168,7 @@ int bpf_load_program_xattr(const struct 
bpf_load_program_attr *load_attr,
attr.log_size = 0;
attr.log_level = 0;
attr.kern_version = load_attr->kern_version;
+   attr.kern_buildid = ptr_to_u64(load_attr->buildid);
memcpy(attr.prog_name, load_attr->name,
   min(name_len, BPF_OBJ_NAME_LEN - 1));
 
@@ -185,8 +186,8 @@ int bpf_load_program_xattr(const struct 
bpf_load_program_attr *load_attr,
 
 int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns,
 size_t insns_cnt, const char *license,
-__u32 kern_version, char *log_buf,
-size_t log_buf_sz)
+__u32 kern_version, const char *buildid,
+char *log_buf, size_t log_buf_sz)
 {
struct bpf_load_program_attr load_attr;
 
@@ -198,6 +199,7 @@ int bpf_load_program(enum bpf_prog_type type, const struct 
bpf_insn *insns,
load_attr.insns_cnt = insns_cnt;
load_attr.license = license;
load_attr.kern_version = kern_version;
+   load_attr.buildid = buildid;
 
return bpf_load_program_xattr(_attr, log_buf, log_buf_sz);
 }
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 39f6a0d64a3b..b5ffb178ebdd 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -49,6 +49,7 @@ struct bpf_load_program_attr {
size_t insns_cnt;
const char *license;
__u32 kern_version;
+   const char *buildid;
 };
 
 /* Recommend log buffer size */
@@ -57,8 +58,8 @@ int bpf_load_program_xattr(const struct bpf_load_program_attr 
*load_attr,
   char *log_buf, size_t log_buf_sz);
 int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns,
 size_t insns_cnt, const char *license,
-__u32 kern_version, char *log_buf,
-size_t log_buf_sz);
+__u32 kern_version, const char *buildid,
+char *log_buf, size_t log_buf_sz);
 int bpf_verify_program(enum bpf_prog_type type, const struct bpf_insn *insns,
   size_t insns_cnt, int strict_alignment,
   const char *license, __u32 kern_version,
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 5922443063f0..421f2c2e0ebe 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -220,6 +220,7 @@ static LIST_HEAD(bpf_objects_list);
 
 struct bpf_object {
char license[64];
+   char buildid[64];
u32 kern_version;
 
struct bpf_program *programs;
@@ -599,6 +600,32 @@ bpf_object__init_kversion(struct bpf_object *obj,
return 0;
 }
 
+static void buildid_scnprint(char *buf, int n, char *buildid, int size)
+{
+   int i, ret = 0;
+
+   for (i = 0; i < size; i++) {
+   ret += scnprintf(buf + ret, n - ret, "%x",
+(unsigned char) buildid[i]);
+   }
+}
+
+static int
+bpf_object__init_buildid(struct bpf_object *obj,
+void *data, size_t size)
+{
+   char buf[64];
+
+   if (size > sizeof(obj->buildid))
+   return -LIBBPF_ERRNO__FORMAT;
+
+   memcpy(>buildid, data, size);
+
+   buildid_scnprint(buf, 64, obj->buildid, size);
+   pr_debug("kernel buildid of %s is: %s\n", obj->path, buf);
+   return 0;
+}
+
 static int compare_bpf_map(const void *_a, const 

  1   2   >