date:20171112

Re: [patch net-next v2 04/10] net: sched: introduce block mechanism to handle netif_keep_dst calls

2017-11-12 Thread Jiri Pirko

Mon, Nov 13, 2017 at 08:47:26AM CET, jakub.kicin...@netronome.com wrote:
>On Sun, 12 Nov 2017 16:55:58 +0100, Jiri Pirko wrote:
>> From: Jiri Pirko 
>> 
>> Couple of classifiers call netif_keep_dst directly on q->dev. That is
>> not possible to do directly for shared blocks where multiple qdiscs are
>> owning the block. So introduce an infrastructure to keep track of the
>> block owners in list and use this list to implement block variant of
>> netif_keep_dst.
>> 
>> Signed-off-by: Jiri Pirko 
>
>Could you use the list you add here to check the ethtool tc offload
>flag? :)

It is a list of qdisc sub parts. Not a list of netdevs

Re: [patch net-next v2 06/10] net: sched: allow ingress and clsact qdiscs to share filter blocks

2017-11-12 Thread Jiri Pirko

Mon, Nov 13, 2017 at 08:54:52AM CET, jakub.kicin...@netronome.com wrote:
>On Sun, 12 Nov 2017 16:56:00 +0100, Jiri Pirko wrote:
>> diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
>> index 5ecc38f..ee89efc 100644
>> --- a/net/sched/sch_ingress.c
>> +++ b/net/sched/sch_ingress.c
>> @@ -60,6 +60,29 @@ static void clsact_chain_head_change(struct tcf_proto *tp_head, void *priv)
>>  struct mini_Qdisc_pair *miniqp = priv;
>>  
>>  mini_qdisc_pair_swap(miniqp, tp_head);
>> +};
>> +
>> +static const struct nla_policy ingress_policy[TCA_CLSACT_MAX + 1] = {
>> +[TCA_CLSACT_INGRESS_BLOCK]  = { .type = NLA_U32 },
>> +};
>> +
>> +static int ingress_parse_opt(struct nlattr *opt, u32 *p_ingress_block_index)
>
>nit: why the p_ prefix on all the pointers?

Just to differentiate:
u32 *ingress_block_index
and
u32 ingress_block_index

Re: [patch net-next v2 01/10] cls_bpf: move prog offload->netdev check into drivers

2017-11-12 Thread Jiri Pirko

Mon, Nov 13, 2017 at 08:17:34AM CET, jakub.kicin...@netronome.com wrote:
>On Mon, 13 Nov 2017 07:25:38 +0100, Jiri Pirko wrote:
>> Mon, Nov 13, 2017 at 03:14:18AM CET, jakub.kicin...@netronome.com wrote:
>> >On Sun, 12 Nov 2017 16:55:55 +0100, Jiri Pirko wrote:  
>> >> From: Jiri Pirko 
>> >> 
>> >> In order to remove tp->q usage in cls_bpf, the offload->netdev check
>> >> needs to be moved to individual drivers as only they will have access
>> >> to appropriate struct net_device.
>> >> 
>> >> Signed-off-by: Jiri Pirko   
>> >
>> >This seems not entirely correct and it adds unnecessary code.  I think  
>> 
>> What is not correct?
>
>From quick reading it looks like you will allow to install the
>dev-specific filter without skip_sw flag.  You haven't fixed what

Right. I see it now.


>your previous series broke in cls_bpf offload model and now you 

What do you mean exactly?


>break it even further.
>
>> >the XDP and cls_bpf handling could be unified, making way for binding
>> >the same program to multiple ports of the same device.  Would you mind
>> >waiting a day for me to send corrections to BPF offload?  
>> 
>> Well I'm trying to get this in before net-next closes...
>
>Right, and I'm surprised by that.  I'd hope you'll understand my caution
>here given recent history.

Sure.

Re: [PATCH net-next 2/8] rtnetlink: add rtnl_register_module

2017-11-12 Thread Peter Zijlstra

On Mon, Nov 13, 2017 at 08:21:59AM +0100, Florian Westphal wrote:
> Reason is that some places do this:
> 
> rtnl_register(pf, RTM_FOO, doit, NULL, 0);
> rtnl_register(pf, RTM_FOO, NULL, dumpit, 0);

Sure, however,

> (from different call sites in the stack).
> > -   if (doit)
> > -   tab[msgindex].doit = doit;
> > -   if (dumpit)
> > -   tab[msgindex].dumpit = dumpit;
> 
> Which is the reason for these if () tests.

then we assign NULL, which is fine, no?

Re: [patch net-next v2 06/10] net: sched: allow ingress and clsact qdiscs to share filter blocks

2017-11-12 Thread Jakub Kicinski

On Sun, 12 Nov 2017 16:56:00 +0100, Jiri Pirko wrote:
> diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
> index 5ecc38f..ee89efc 100644
> --- a/net/sched/sch_ingress.c
> +++ b/net/sched/sch_ingress.c
> @@ -60,6 +60,29 @@ static void clsact_chain_head_change(struct tcf_proto *tp_head, void *priv)
>   struct mini_Qdisc_pair *miniqp = priv;
>  
>   mini_qdisc_pair_swap(miniqp, tp_head);
> +};
> +
> +static const struct nla_policy ingress_policy[TCA_CLSACT_MAX + 1] = {
> + [TCA_CLSACT_INGRESS_BLOCK]  = { .type = NLA_U32 },
> +};
> +
> +static int ingress_parse_opt(struct nlattr *opt, u32 *p_ingress_block_index)

nit: why the p_ prefix on all the pointers?

Re: [patch net-next v2 04/10] net: sched: introduce block mechanism to handle netif_keep_dst calls

2017-11-12 Thread Jakub Kicinski

On Sun, 12 Nov 2017 16:55:58 +0100, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> Couple of classifiers call netif_keep_dst directly on q->dev. That is
> not possible to do directly for shared blocks where multiple qdiscs are
> owning the block. So introduce an infrastructure to keep track of the
> block owners in list and use this list to implement block variant of
> netif_keep_dst.
> 
> Signed-off-by: Jiri Pirko 

Could you use the list you add here to check the ethtool tc offload
flag? :)

Re: [PATCH net-next 2/8] rtnetlink: add rtnl_register_module

2017-11-12 Thread Florian Westphal

Peter Zijlstra  wrote:
> On Tue, Nov 07, 2017 at 10:47:51AM +0100, Florian Westphal wrote:
> > I would expect this to trigger all the time, due to
> > 
> > rtnl_register(AF_INET, RTM_GETROUTE, ...
> > rtnl_register(AF_INET, RTM_GETADDR, ...
> 
> Ah, sure, then something like so then...
> 
> There's bound to be bugs there too, as I pretty much typed this without
> thinking, but it should show the idea.

Just to let you know, I am backlogged at the moment so I will not have
time to work on this for the time being.

> ---
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index 5ace48926b19..de1336775602 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -63,6 +63,7 @@ struct rtnl_link {
>   rtnl_doit_func  doit;
>   rtnl_dumpit_func dumpit;
>   unsigned int flags;
> + struct rcu_head rcu;
>  };

This will need to be split:

struct rtnl_link {
        rtnl_doit_func          doit;
        unsigned int            flags;
        struct rcu_head         rcu;
};
struct rtnl_link_dump {
        rtnl_dumpit_func        dumpit;
        struct rcu_head         rcu;
};

> -static struct rtnl_link __rcu *rtnl_msg_handlers[RTNL_FAMILY_MAX + 1];
> +static struct rtnl_link __rcu **rtnl_msg_handlers[RTNL_FAMILY_MAX + 1];

So this will need to be two arrays.

Reason is that some places do this:

rtnl_register(pf, RTM_FOO, doit, NULL, 0);
rtnl_register(pf, RTM_FOO, NULL, dumpit, 0);

(from different call sites in the stack).
> - if (doit)
> - tab[msgindex].doit = doit;
> - if (dumpit)
> - tab[msgindex].dumpit = dumpit;

Which is the reason for these if () tests.
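
A minimal userspace sketch of the point above, not the kernel code: the
struct, the typedefs and handler_register() are hypothetical stand-ins.
It only shows why a NULL argument has to mean "keep what is already
registered" when doit and dumpit arrive in two separate calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's handler types. */
typedef int (*doit_fn)(void);
typedef int (*dumpit_fn)(void);

struct handler {
	doit_fn      doit;
	dumpit_fn    dumpit;
	unsigned int flags;
};

/*
 * Merge a registration into an existing slot. A NULL argument leaves
 * the current pointer untouched, so two calls for the same message
 * type (one with doit, one with dumpit) do not clobber each other.
 */
static void handler_register(struct handler *slot, doit_fn doit,
			     dumpit_fn dumpit, unsigned int flags)
{
	assert(slot != NULL);
	if (doit)
		slot->doit = doit;
	if (dumpit)
		slot->dumpit = dumpit;
	slot->flags |= flags;
}

/* Dummy handlers used only to exercise the sketch. */
static int sample_doit(void)   { return 1; }
static int sample_dumpit(void) { return 2; }
```

Without the two if () tests the second call would overwrite the first
handler with NULL, which is the case the split into two structs would
also have to cover.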

Re: [PATCH net-next] decnet: move to staging

2017-11-12 Thread Greg KH

On Mon, Nov 13, 2017 at 07:30:09AM +0100, Jiri Pirko wrote:
> Sun, Nov 12, 2017 at 09:02:14PM CET, step...@networkplumber.org wrote:
> >Support for Decnet has been orphaned for many years.
> >In the interest of reducing the potential bug surface and pre-holiday
> >cleaning, move the decnet protocol into staging for eventual removal.
> >
> >Signed-off-by: Stephen Hemminger 
> 
> Why not just remove it in the same way tokenring was removed in the
> past? I fear that in staging, this will rot forever for no good reason.

No, I will not let it stay around for long :)

Stephen, no objection from me for this, but can you add a TODO file much
like drivers/staging/irda/TODO has?

thanks,

greg k-h

Re: [patch net-next v2 01/10] cls_bpf: move prog offload->netdev check into drivers

2017-11-12 Thread Jakub Kicinski

On Mon, 13 Nov 2017 07:25:38 +0100, Jiri Pirko wrote:
> Mon, Nov 13, 2017 at 03:14:18AM CET, jakub.kicin...@netronome.com wrote:
> >On Sun, 12 Nov 2017 16:55:55 +0100, Jiri Pirko wrote:  
> >> From: Jiri Pirko 
> >> 
> >> In order to remove tp->q usage in cls_bpf, the offload->netdev check
> >> needs to be moved to individual drivers as only they will have access
> >> to appropriate struct net_device.
> >> 
> >> Signed-off-by: Jiri Pirko   
> >
> >This seems not entirely correct and it adds unnecessary code.  I think  
> 
> What is not correct?

From quick reading it looks like you will allow to install the
dev-specific filter without skip_sw flag.  You haven't fixed what
your previous series broke in cls_bpf offload model and now you 
break it even further.

> >the XDP and cls_bpf handling could be unified, making way for binding
> >the same program to multiple ports of the same device.  Would you mind
> >waiting a day for me to send corrections to BPF offload?  
> 
> Well I'm trying to get this in before net-next closes...

Right, and I'm surprised by that.  I'd hope you'll understand my caution
here given recent history.

Re: [kernel-hardening] [PATCH v4] scripts: add leaking_addresses.pl

2017-11-12 Thread Kaiwan N Billimoria

On Mon, Nov 13, 2017 at 11:38 AM, Tobin C. Harding  wrote:
> On Mon, Nov 13, 2017 at 11:16:28AM +0530, kaiwan.billimo...@gmail.com wrote:
>> On Mon, 2017-11-13 at 09:21 +1100, Tobin C. Harding wrote:
>> > On Fri, Nov 10, 2017 at 07:26:34PM +0530, kaiwan.billimo...@gmail.com
>> >  wrote:
>> > > On Tue, 2017-11-07 at 21:32 +1100, Tobin C. Harding wrote:
>> > > > Currently we are leaking addresses from the kernel to user space.
>> > > > This
...
>
> So, Linus has requested that I set up a tree for the development of
> this. I have to work out the details of how to do that and then I'll
> email you so you can pull the current version. I can then take
> your patch via LKML as per usual.
>
Super. Thanks.

Re: [PATCH net-next] decnet: move to staging

2017-11-12 Thread Jiri Pirko

Sun, Nov 12, 2017 at 09:02:14PM CET, step...@networkplumber.org wrote:
>Support for Decnet has been orphaned for many years.
>In the interest of reducing the potential bug surface and pre-holiday
>cleaning, move the decnet protocol into staging for eventual removal.
>
>Signed-off-by: Stephen Hemminger 

Why not just remove it in the same way tokenring was removed in the
past? I fear that in staging, this will rot forever for no good reason.

Re: [patch net-next v2 01/10] cls_bpf: move prog offload->netdev check into drivers

2017-11-12 Thread Jiri Pirko

Mon, Nov 13, 2017 at 03:14:18AM CET, jakub.kicin...@netronome.com wrote:
>On Sun, 12 Nov 2017 16:55:55 +0100, Jiri Pirko wrote:
>> From: Jiri Pirko 
>> 
>> In order to remove tp->q usage in cls_bpf, the offload->netdev check
>> needs to be moved to individual drivers as only they will have access
>> to appropriate struct net_device.
>> 
>> Signed-off-by: Jiri Pirko 
>
>This seems not entirely correct and it adds unnecessary code.  I think

What is not correct?


>the XDP and cls_bpf handling could be unified, making way for binding
>the same program to multiple ports of the same device.  Would you mind
>waiting a day for me to send corrections to BPF offload?

Well I'm trying to get this in before net-next closes...

Re: SRIOV switchdev mode BoF minutes

2017-11-12 Thread Or Gerlitz

On Sun, Nov 12, 2017 at 10:38 PM, Alexander Duyck
 wrote:
> On Sun, Nov 12, 2017 at 11:49 AM, Or Gerlitz  wrote:
>> Hi Dave and all,
>>
>> During and after the BoF on SRIOV switchdev mode, we came into a
>> consensus among the developers from four different HW vendors (CC
>> audience) that a correct thing to do would be to disallow any new
>> extensions to the legacy mode.
>>
>> The idea is to put focus on the new mode and not add new UAPIs and
>> kernel code which was turned to be a wrong design which does not allow
>> for properly offloading a kernel switching SW model to e-switch HW.

> You may not recall but we tried to transition the i40e driver over to
> SwitchDev, the parts supported by i40e have a much more robust l2
> forwarding framework than the 82599, and the result was we were told
> that while we might look at doing port representors some other way,
> there was no way we could use switchdev since the hardware couldn't
> support the requirements of switchdev in terms of default routes and
> forwarding behavior. I am planning to resolve the port representor
> issue by looking at coming up with something like a "source mode"
> macvlan based port representor. I figure that is probably the closest
> match for what the Intel hardware does since really the VFs are
> nothing more than a physical macvlan interface in and of themselves as
> the hardware doesn't have a full switch.

Hi Alex,

What we call the slow path requirements are the following:

1. xmit on VF rep always turns to a receive on the VF, regardless of
the offloaded SW steering rules ("send-to-vport")

2. xmit on VF which doesn't meet any offloaded SW steering rules must
be received into the host OS from the VF rep

1,2 above must hold also for the uplink and the PF reps

When the i40e limitation was described to @ netdev, it seems you have a problem
with VF xmit that should be turned to be a recv on the VF rep but also
goes to the wire.

It smells as if a FW patch can solve that, doesn't it?

> I would have to disagree with this. For devices such as 82599 that
> doesn't have a true switch this may limit future functionality since
> we can't move it over to switchdev mode. For example one thing I may
> need to add is the ability to disable multicast and broadcast receive
> on a per-VF basis at some point in the future.

We are in the same boat with ConnectX3/mlx4, so lucky for us that misery loves
company (my google search also yielded "many narrow-half consolation"; is that
completely unrelated?). The legacy mode for ixgbe/mlx4 has been there for ~8-10
years, and since then both companies have had 2-3 newer HW generations. I don't
see why you can't go to your customers and tell them that newish functionality
needs newer HW; it will also help sell more of the new stuff. If you keep
extending the legacy mode, more people/drivers will do that as well and it will
not let us go in the right direction.

Or.

Re: [kernel-hardening] [PATCH v4] scripts: add leaking_addresses.pl

2017-11-12 Thread Tobin C. Harding

On Mon, Nov 13, 2017 at 11:16:28AM +0530, kaiwan.billimo...@gmail.com wrote:
> On Mon, 2017-11-13 at 09:21 +1100, Tobin C. Harding wrote:
> > On Fri, Nov 10, 2017 at 07:26:34PM +0530, kaiwan.billimo...@gmail.com
> >  wrote:
> > > On Tue, 2017-11-07 at 21:32 +1100, Tobin C. Harding wrote:
> > > > Currently we are leaking addresses from the kernel to user space.
> > > > This
> > > > script is an attempt to find some of those leakages. Script
> > > > parses
> > > > `dmesg` output and /proc and /sys files for hex strings that look
> > > > like
> > > > kernel addresses.
> > > > 
> > > > Only works for 64 bit kernels, the reason being that kernel
> > > > addresses
> > > > on 64 bit kernels have '' as the leading bit pattern making
> > > > greping
> > > > possible. On 32 kernels we don't have this luxury.
> > > 
> > > Tobin C. Harding  wrote:
> > > > Only works for 64 bit kernels, the reason being that kernel
> > > > addresses
> > > > on 64 bit kernels have '' as the leading bit pattern making
> > > > greping
> > > > possible. On 32 kernels we don't have this luxury.
> > > 
> > > [RFC] leaking_addresses.pl - enhance it to work for 32-bit kernels
> > > as well
> > > 
> > > (Firstly, apologies if I've got the protocol horribly wrong- should
> > > this
> > > be a new thread altogether?)
> > 
> > I think this patch will need to wait until the patch set that is
> > currently in flight is either merged or dropped.
> > 
> Thanks for looking at it!
> Okay; blocking on merge || drop...  :-)

So, Linus has requested that I set up a tree for the development of
this. I have to work out the details of how to do that and then I'll
email you so you can pull the current version. I can then take
your patch via LKML as per usual.

> > We can work this out pragmatically, Perl can give us an architecture
> > string then a few regexs can ascertain which architecture we are
> > running
> > on. This is in the inflight patch set. 
> > 
> > > The patch below does Not take into account (yet) stuff like:
> > >  - exactly which files & dirs should be skipped on 32-bit (will it
> > > be
> > > identical to 64-bit?; unsure..)
> > 
> > As per discussion later in this thread we may need to consider
> > architecture specific lists for files/directories to skip. 
> Right
> > 
> > >  - it currently hard-codes a global 'PAGE_OFFSET_32BIT=0xc000'
> > > , just
> > >  so I can test quickly; must figure whether to query it or pass it;
> > >  Suggestions?
> > 
> > Perhaps we should have a command line option for this.
> > 
> > --kernel-base-address
> 
> Why not just detect it programmatically? We could devise a series of
> fallbacks; something like:
> - if .config exists in the kernel source tree root, grep it for
> PAGE_OFFSET
> - if not, grep the arch-specific (arch//configs/)
> for the same
> - if for some reason we don't have enough info regarding specific
> platform and thus the defconfig filename (could happen for ARM, PPC?),
> we then fail and request the user to pass it as a parameter.
> 
> > >  - the 'false positives'; again, what differs for 32-bit?

Sounds good to me.

thanks,
Tobin.

Re: linux-next: manual merge of the tip tree with the net-next tree

2017-11-12 Thread Stephen Rothwell

Hi all,

On Mon, 30 Oct 2017 20:55:47 + Mark Brown  wrote:
>
> Today's linux-next merge of the tip tree got a conflict in:
> 
>   net/ipv4/tcp_output.c
> 
> between commit:
> 
>   6aa7de059173a ("locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()")
> 
> in the tip tree and some change in the net-next tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
> 
> diff --cc net/ipv4/tcp_output.c
> index a69a34f57330,48531da1aba6..
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@@ -1978,7 -1908,7 +1978,7 @@@ static bool tcp_tso_should_defer(struc
>   if ((skb != tcp_write_queue_tail(sk)) && (limit >= skb->len))
>   goto send_now;
>   
> - win_divisor = ACCESS_ONCE(sock_net(sk)->ipv4.sysctl_tcp_tso_win_divisor);
>  -win_divisor = READ_ONCE(sysctl_tcp_tso_win_divisor);
> ++win_divisor = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_tso_win_divisor);
>   if (win_divisor) {
>   u32 chunk = min(tp->snd_wnd, tp->snd_cwnd * tp->mss_cache);
>   

Just a reminder that this conflict still exists.

-- 
Cheers,
Stephen Rothwell

[PATCH net] sctp: add wait_buf flag in asoc to avoid the peeloff and wait sndbuf race

2017-11-12 Thread Xin Long

Commit dfcb9f4f99f1 ("sctp: deny peeloff operation on asocs with threads
sleeping on it") fixed the race between peeloff and wait sndbuf by
checking waitqueue_active(&asoc->wait) in sctp_do_peeloff().

But it actually doesn't work, because even if waitqueue_active() returns false
the thread waiting for sndbuf may not yet hold the sk lock.

This patch fixes it by adding a wait_buf flag to the asoc, setting it
before entering the waiting loop, clearing it once the waiting loop breaks,
and checking it in sctp_do_peeloff() instead.

Fixes: dfcb9f4f99f1 ("sctp: deny peeloff operation on asocs with threads sleeping on it")
Suggested-by: Marcelo Ricardo Leitner 
Signed-off-by: Xin Long 
---
 include/net/sctp/structs.h | 1 +
 net/sctp/socket.c  | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 0477945..446350e 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1883,6 +1883,7 @@ struct sctp_association {
 
__u8 need_ecne:1,   /* Need to send an ECNE Chunk? */
 temp:1,/* Is it a temporary association? */
+wait_buf:1,
 force_delay:1,
 prsctp_enable:1,
 reconf_enable:1;
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 6f45d17..1b2c78c 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -4946,7 +4946,7 @@ int sctp_do_peeloff(struct sock *sk, sctp_assoc_t id, struct socket **sockp)
/* If there is a thread waiting on more sndbuf space for
 * sending on this asoc, it cannot be peeled.
 */
-   if (waitqueue_active(&asoc->wait))
+   if (asoc->wait_buf)
return -EBUSY;
 
/* An association cannot be branched off from an already peeled-off
@@ -7835,6 +7835,7 @@ static int sctp_wait_for_sndbuf(struct sctp_association *asoc, long *timeo_p,
/* Increment the association's refcnt.  */
sctp_association_hold(asoc);
 
+   asoc->wait_buf = 1;
/* Wait on the association specific sndbuf space. */
for (;;) {
prepare_to_wait_exclusive(&asoc->wait, &wait,
@@ -7860,6 +7861,7 @@ static int sctp_wait_for_sndbuf(struct sctp_association *asoc, long *timeo_p,
}
 
 out:
+   asoc->wait_buf = 0;
finish_wait(&asoc->wait, &wait);
 
/* Release the association's refcnt.  */
-- 
2.1.0
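
A much-reduced userspace sketch of the flag protocol the patch
describes. struct asoc_sketch and the three helpers are hypothetical
names for illustration; locking and the real wait queue are not
modelled at all. The flag covers the whole waiting section, including
the window in which the waiter has dropped the sock lock and is no
longer (or not yet) on the wait queue.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, much-reduced stand-in for struct sctp_association. */
struct asoc_sketch {
	unsigned int wait_buf:1;	/* a sender is waiting for sndbuf space */
};

/* Mirrors the patched check in sctp_do_peeloff(): refuse while a waiter exists. */
static int peeloff_allowed(const struct asoc_sketch *asoc)
{
	assert(asoc != NULL);
	return asoc->wait_buf ? -EBUSY : 0;
}

/* Set before entering the wait loop, cleared after leaving it. */
static void wait_begin(struct asoc_sketch *asoc) { asoc->wait_buf = 1; }
static void wait_end(struct asoc_sketch *asoc)   { asoc->wait_buf = 0; }
```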

Re: [kernel-hardening] [PATCH v4] scripts: add leaking_addresses.pl

2017-11-12 Thread kaiwan . billimoria

On Mon, 2017-11-13 at 09:21 +1100, Tobin C. Harding wrote:
> On Fri, Nov 10, 2017 at 07:26:34PM +0530, kaiwan.billimo...@gmail.com
>  wrote:
> > On Tue, 2017-11-07 at 21:32 +1100, Tobin C. Harding wrote:
> > > Currently we are leaking addresses from the kernel to user space.
> > > This
> > > script is an attempt to find some of those leakages. Script
> > > parses
> > > `dmesg` output and /proc and /sys files for hex strings that look
> > > like
> > > kernel addresses.
> > > 
> > > Only works for 64 bit kernels, the reason being that kernel
> > > addresses
> > > on 64 bit kernels have '' as the leading bit pattern making
> > > greping
> > > possible. On 32 kernels we don't have this luxury.
> > 
> > Tobin C. Harding  wrote:
> > > Only works for 64 bit kernels, the reason being that kernel
> > > addresses
> > > on 64 bit kernels have '' as the leading bit pattern making
> > > greping
> > > possible. On 32 kernels we don't have this luxury.
> > 
> > [RFC] leaking_addresses.pl - enhance it to work for 32-bit kernels
> > as well
> > 
> > (Firstly, apologies if I've got the protocol horribly wrong- should
> > this
> > be a new thread altogether?)
> 
> I think this patch will need to wait until the patch set that is
> currently in flight is either merged or dropped.
> 
Thanks for looking at it!
Okay; blocking on merge || drop...  :-)

> > 
> We can work this out pragmatically, Perl can give us an architecture
> string then a few regexs can ascertain which architecture we are
> running
> on. This is in the inflight patch set. 
> 
> > The patch below does Not take into account (yet) stuff like:
> >  - exactly which files & dirs should be skipped on 32-bit (will it
> > be
> > identical to 64-bit?; unsure..)
> 
> As per discussion later in this thread we may need to consider
> architecture specific lists for files/directories to skip. 
Right
> 
> >  - it currently hard-codes a global 'PAGE_OFFSET_32BIT=0xc000'
> > , just
> >  so I can test quickly; must figure whether to query it or pass it;
> >  Suggestions?
> 
> Perhaps we should have a command line option for this.
> 
>   --kernel-base-address

Why not just detect it programmatically? We could devise a series of
fallbacks; something like:
- if .config exists in the kernel source tree root, grep it for
PAGE_OFFSET
- if not, grep the arch-specific (arch//configs/)
for the same
- if for some reason we don't have enough info regarding specific
platform and thus the defconfig filename (could happen for ARM, PPC?),
we then fail and request the user to pass it as a parameter.
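
The first fallback step could look roughly like the sketch below. It is
written in C rather than Perl purely for illustration, and the key name
CONFIG_PAGE_OFFSET is an assumption: not every architecture exports the
value under that name, in which case the caller would move on to the
next fallback.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Try to read a page offset from one line of a kernel .config file.
 * Returns 1 and stores the value on success, 0 if the line does not
 * match. The key name is an assumption made for this sketch.
 */
static int parse_page_offset(const char *line, unsigned long long *out)
{
	static const char key[] = "CONFIG_PAGE_OFFSET=";
	char *end;
	unsigned long long val;

	assert(line != NULL && out != NULL);
	if (strncmp(line, key, sizeof(key) - 1) != 0)
		return 0;
	line += sizeof(key) - 1;
	val = strtoull(line, &end, 0);	/* base 0 accepts a 0x prefix */
	if (end == line)
		return 0;		/* no digits after '=' */
	*out = val;
	return 1;
}
```

A caller would feed it each line of .config in turn and fall back to
the defconfig, and finally to a user-supplied parameter, when no line
matches.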

> >  - the 'false positives'; again, what differs for 32-bit?
> >(BTW, shouldn't the dmesg 'root=UUID=<...>' line be a false
> > positive
> > & skipped?).
> 
> We could probably do with architecture specific false
> positives. Inflight patch set refactors false_positive() so adding to
> this should be easy.
Sure.
> 
> > Also, I must point out that I'm a complete newbie to Perl :-) so,
> > pl excuse
> > my highly inadequate perl-foo; I rely on you perl gurus out there
> > to fix
> > and optimize :)
> 
> I'm no Perl guru but following are a few tips I have picked up over
> the
> last month.
Thanks, will fix the issues you point out..
> 
> > 
> Conceptually your ideas look good to me. If there is some reason this
> approach won't work hopefully someone else will jump in and say so.
> 
> Nice work, thanks for putting in effort to get 32 bit machines
> supported. Let's see what happens with the inflight patch set then
> work
> on getting these ideas in.
> 
Thanks! yes..
> thanks,
> Tobin.

[PATCH net] sctp: check stream reset info len before making reconf chunk

2017-11-12 Thread Xin Long

Now when resetting stream, if both in and out flags are set, the info
len can reach:
  sizeof(struct sctp_strreset_outreq) + SCTP_MAX_STREAM(65535) +
  sizeof(struct sctp_strreset_inreq)  + SCTP_MAX_STREAM(65535)
even without duplicated stream no, this value is far greater than the
chunk's max size.

_sctp_make_chunk doesn't do any check for this, so the skb it allocates can
be huge; syzbot even reported a crash due to this.

This patch is to check stream reset info len before making reconf
chunk and return NULL if the len exceeds chunk's capacity.

Fixes: cc16f00f6529 ("sctp: add support for generating stream reconf ssn reset request chunk")
Reported-by: Dmitry Vyukov 
Signed-off-by: Xin Long 
---
 net/sctp/sm_make_chunk.c | 7 +--
 net/sctp/stream.c| 8 +---
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index 514465b..a21328a 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -3598,14 +3598,17 @@ struct sctp_chunk *sctp_make_strreset_req(
__u16 stream_len = stream_num * 2;
struct sctp_strreset_inreq inreq;
struct sctp_chunk *retval;
-   __u16 outlen, inlen;
+   int outlen, inlen;
 
outlen = (sizeof(outreq) + stream_len) * out;
inlen = (sizeof(inreq) + stream_len) * in;
 
+   if (outlen + inlen > SCTP_MAX_CHUNK_LEN - sizeof(struct sctp_chunkhdr))
+   return ERR_PTR(-EINVAL);
+
retval = sctp_make_reconf(asoc, outlen + inlen);
if (!retval)
-   return NULL;
+   return ERR_PTR(-ENOMEM);
 
if (outlen) {
outreq.param_hdr.type = SCTP_PARAM_RESET_OUT_REQUEST;
diff --git a/net/sctp/stream.c b/net/sctp/stream.c
index fa8371f..51a25bf 100644
--- a/net/sctp/stream.c
+++ b/net/sctp/stream.c
@@ -162,8 +162,8 @@ int sctp_send_reset_streams(struct sctp_association *asoc,
 
kfree(nstr_list);
 
-   if (!chunk) {
-   retval = -ENOMEM;
+   if (IS_ERR(chunk)) {
+   retval = PTR_ERR(chunk);
goto out;
}
 
@@ -482,8 +482,10 @@ struct sctp_chunk *sctp_process_strreset_inreq(
}
 
chunk = sctp_make_strreset_req(asoc, nums, str_p, 1, 0);
-   if (!chunk)
+   if (IS_ERR(chunk)) {
+   chunk = NULL;
goto out;
+   }
 
if (nums)
for (i = 0; i < nums; i++)
-- 
2.1.0
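
A small worked example of the length arithmetic, as a userspace sketch.
The three SKETCH_* constants are assumed values chosen for
illustration, not taken from the kernel headers, and unlike the patch
the sketch also computes stream_len in int. It shows why the sum has to
be computed in something wider than 16 bits before it is compared
against the chunk capacity.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes for illustration only; the real values come from the
 * kernel's struct definitions and constants. */
#define SKETCH_OUTREQ_HDR	16	/* assumed sizeof(struct sctp_strreset_outreq) */
#define SKETCH_INREQ_HDR	8	/* assumed sizeof(struct sctp_strreset_inreq) */
#define SKETCH_CHUNK_ROOM	65528	/* assumed max chunk length minus chunk header */

/* Total parameter length computed in int, so it cannot wrap at 16 bits. */
static int reset_info_len(uint16_t stream_num, int out, int in)
{
	int stream_len = stream_num * 2;	/* two bytes per stream id */
	int outlen = (SKETCH_OUTREQ_HDR + stream_len) * out;
	int inlen = (SKETCH_INREQ_HDR + stream_len) * in;

	return outlen + inlen;
}

/* What a 16-bit variable would hold for the same request. */
static uint16_t reset_info_len_u16(uint16_t stream_num, int out, int in)
{
	return (uint16_t)reset_info_len(stream_num, out, in);
}

/* The capacity check, done on the untruncated value. */
static int reset_info_fits(uint16_t stream_num, int out, int in)
{
	return reset_info_len(stream_num, out, in) <= SKETCH_CHUNK_ROOM;
}
```

With 20000 streams and both directions set, the int result is 80024
while the 16-bit result wraps to 14488, which would look like a
perfectly valid length.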

Re: [kernel-hardening] Re: [PATCH v4] scripts: add leaking_addresses.pl

2017-11-12 Thread Kaiwan N Billimoria

On Mon, Nov 13, 2017 at 10:05 AM, Tobin C. Harding  wrote:
> On Mon, Nov 13, 2017 at 06:37:28AM +0300, Kirill A. Shutemov wrote:
>> On Mon, Nov 13, 2017 at 10:06:46AM +1100, Tobin C. Harding wrote:
>> > On Sun, Nov 12, 2017 at 02:10:07AM +0300, Kirill A. Shutemov wrote:
...
>> >
>> > Thanks for the link. So it looks like we need to refactor the kernel
>> > address regular expression into a function that takes into account the
>> > machine architecture and the number of page table levels. We will need
>> > to add this to the false positive checks also.
>> >
>> > > Not sure if we care. It won't work too for other 64-bit architectures that
>> > > have more than 256TB of virtual address space.
>> >
>> > Is this because of the virtual memory map?
>>
>> On x86 direct mapping is the nearest thing we have to userspace.
>>
>> > Did you mean 512TB?
>>
>> No, I mean 256TB.
>>
>> You have all kernel memory in the range from 0x to
>> 0x if you have 256 TB of virtual address space. If you
>> have more, some things might be outside the range.
>
> Doesn't 4-level paging already limit a system to 64TB of memory? So any
> system better equipped than this will use 5-level paging right? If I am
> totally talking rubbish please ignore, I'm appreciative that you pointed
> out the limitation already. Perhaps we can add a comment to the script
>
> # Script may miss some addresses on machines with more than 256TB of
> # memory.

I think the 256TB is wrt *virtual* address space not physical RAM.

Also, IMHO, the script should 'transparently' take into account the # of paging
levels (instead of the user needing to pass a parameter).
IOW it should be able to detect the same (say, from the .config file) and act
accordingly - in the sense, the regex's and associated logic would accordingly
differ.
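
As a rough sketch of what "acting accordingly" could mean, the check
below derives the start of the kernel half from the number of virtual
address bits instead of hard-coding a prefix. It is in C for
illustration only; in_kernel_half() is a hypothetical helper, and the
mapping from paging levels to address bits (48 for 4-level, 57 for
5-level) is stated for x86-64 and will differ on other architectures.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return 1 if addr lies in the upper (kernel) half of a canonical
 * 64-bit address space with va_bits of virtual address, else 0.
 */
static int in_kernel_half(uint64_t addr, unsigned int va_bits)
{
	uint64_t start;

	assert(va_bits >= 2 && va_bits <= 64);
	/* Lowest address whose bits from va_bits-1 up to 63 are all ones. */
	start = ~UINT64_C(0) << (va_bits - 1);
	return addr >= start;
}
```

An address such as 0xff10000000000000 is rejected with 48 bits but
accepted with 57, so a fixed leading-pattern regex tuned for one
configuration would miss it on the other.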

Re: linux-next: manual merge of the drivers-x86 tree with the net-next tree

2017-11-12 Thread Stephen Rothwell

Hi all,

On Mon, 9 Oct 2017 18:56:33 +0100 Mark Brown  wrote:
>
> Today's linux-next merge of the drivers-x86 tree got a conflict in:
> 
>   Documentation/admin-guide/thunderbolt.rst
> 
> between commit:
> 
>e69b6c02b4c3b ("net: Add support for networking over Thunderbolt cable")
> 
> from the net-next tree and commit:
> 
>    ce6a90027c10f ("platform/x86: Add driver to force WMI Thunderbolt controller power status")
> 
> from the drivers-x86 tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
> 
> diff --cc Documentation/admin-guide/thunderbolt.rst
> index 5c62d11d77e8,dadcd66ee12f..
> --- a/Documentation/admin-guide/thunderbolt.rst
> +++ b/Documentation/admin-guide/thunderbolt.rst
> @@@ -198,26 -198,17 +198,41 @@@ information is missing
>   To recover from this mode, one needs to flash a valid NVM image to the
>   host host controller in the same way it is done in the previous chapter.
>   
>  +Networking over Thunderbolt cable
>  +---------------------------------
>  +Thunderbolt technology allows software communication across two hosts
>  +connected by a Thunderbolt cable.
>  +
>  +It is possible to tunnel any kind of traffic over Thunderbolt link but
>  +currently we only support Apple ThunderboltIP protocol.
>  +
>  +If the other host is running Windows or macOS only thing you need to
>  +do is to connect Thunderbolt cable between the two hosts, the
>  +``thunderbolt-net`` is loaded automatically. If the other host is also
>  +Linux you should load ``thunderbolt-net`` manually on one host (it does
>  +not matter which one)::
>  +
>  +  # modprobe thunderbolt-net
>  +
>  +This triggers module load on the other host automatically. If the driver
>  +is built-in to the kernel image, there is no need to do anything.
>  +
>  +The driver will create one virtual ethernet interface per Thunderbolt
>  +port which are named like ``thunderbolt0`` and so on. From this point
>  +you can either use standard userspace tools like ``ifconfig`` to
>  +configure the interface or let your GUI to handle it automatically.
> ++
> + Forcing power
> + -
> + Many OEMs include a method that can be used to force the power of a
> + thunderbolt controller to an "On" state even if nothing is connected.
> + If supported by your machine this will be exposed by the WMI bus with
> + a sysfs attribute called "force_power".
> + 
> + For example the intel-wmi-thunderbolt driver exposes this attribute in:
> +   /sys/devices/platform/PNP0C14:00/wmi_bus/wmi_bus-PNP0C14:00/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power
> + 
> +   To force the power to on, write 1 to this attribute file.
> +   To disable force power, write 0 to this attribute file.
> + 
> + Note: it's currently not possible to query the force power state of a platform.

Just a reminder that this conflict still exists.

-- 
Cheers,
Stephen Rothwell

Re: linux-next: manual merge of the cgroup tree with the net-next tree

2017-11-12 Thread Stephen Rothwell

Hi Mark,

On Mon, 9 Oct 2017 19:38:36 +0100 Mark Brown  wrote:
>
> Hi Tejun,
> 
> Today's linux-next merge of the cgroup tree got a conflict in:
> 
>   kernel/cgroup/cgroup.c
> 
> between commit:
> 
>   324bda9e6c5ad ("bpf: multi program support for cgroup+bpf")
> 
> from the net-next tree and commit:
> 
>   041cd640b2f3c ("cgroup: Implement cgroup2 basic CPU usage accounting")
> 
> from the cgroup tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
> 
> diff --cc kernel/cgroup/cgroup.c
> index 00f5b358aeac,c3421ee0d230..
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@@ -4765,8 -4785,9 +4788,11 @@@ static struct cgroup *cgroup_create(str
>   
>   return cgrp;
>   
>  +out_idr_free:
>  +cgroup_idr_remove(&root->cgroup_idr, cgrp->id);
> + out_stat_exit:
> + if (cgroup_on_dfl(parent))
> + cgroup_stat_exit(cgrp);
>   out_cancel_ref:
>   percpu_ref_exit(&cgrp->self.refcnt);
>   out_free_cgrp:

Just a reminder that this conflict still exists.

-- 
Cheers,
Stephen Rothwell

[PATCH net] sctp: do not free asoc when it is already dead in sctp_sendmsg

2017-11-12 Thread Xin Long

Now in sctp_sendmsg, sctp_wait_for_sndbuf can schedule out without
holding the lock on sk. This means the current asoc can be freed
elsewhere, for example when an abort packet is received.

If the asoc was just created in sctp_sendmsg and sctp_wait_for_sndbuf
returns an error, the asoc will be freed again because new_asoc is not
NULL. This triggers a use-after-free.

Fix it by setting new_asoc to NULL if the asoc is already dead when the
CPU schedules back, so that it is not freed again in sctp_sendmsg.

Reported-by: Dmitry Vyukov 
Signed-off-by: Xin Long 
---
 net/sctp/socket.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 6f45d17..f575976 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -83,8 +83,8 @@
 /* Forward declarations for internal helper functions. */
 static int sctp_writeable(struct sock *sk);
 static void sctp_wfree(struct sk_buff *skb);
-static int sctp_wait_for_sndbuf(struct sctp_association *, long *timeo_p,
-   size_t msg_len);
+static int sctp_wait_for_sndbuf(struct sctp_association *asoc, long *timeo_p,
+   size_t msg_len, struct sctp_association **new);
 static int sctp_wait_for_packet(struct sock *sk, int *err, long *timeo_p);
 static int sctp_wait_for_connect(struct sctp_association *, long *timeo_p);
 static int sctp_wait_for_accept(struct sock *sk, long timeo);
@@ -1962,7 +1962,7 @@ static int sctp_sendmsg(struct sock *sk, struct msghdr *msg, size_t msg_len)
 
timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
if (!sctp_wspace(asoc)) {
-   err = sctp_wait_for_sndbuf(asoc, &timeo, msg_len);
+   err = sctp_wait_for_sndbuf(asoc, &timeo, msg_len, &new_asoc);
if (err)
goto out_free;
}
@@ -7822,7 +7822,7 @@ void sctp_sock_rfree(struct sk_buff *skb)
 
 /* Helper function to wait for space in the sndbuf.  */
 static int sctp_wait_for_sndbuf(struct sctp_association *asoc, long *timeo_p,
-   size_t msg_len)
+   size_t msg_len, struct sctp_association **new)
 {
struct sock *sk = asoc->base.sk;
int err = 0;
@@ -7839,10 +7839,13 @@ static int sctp_wait_for_sndbuf(struct sctp_association *asoc, long *timeo_p,
for (;;) {
prepare_to_wait_exclusive(&asoc->wait, &wait,
  TASK_INTERRUPTIBLE);
+   if (asoc->base.dead) {
+   *new = NULL;
+   goto do_error;
+   }
if (!*timeo_p)
goto do_nonblock;
-   if (sk->sk_err || asoc->state >= SCTP_STATE_SHUTDOWN_PENDING ||
-   asoc->base.dead)
+   if (sk->sk_err || asoc->state >= SCTP_STATE_SHUTDOWN_PENDING)
goto do_error;
if (signal_pending(current))
goto do_interrupted;
-- 
2.1.0
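
The ownership hand-off described in the patch above can be sketched in
userspace roughly as follows. This is only an illustration, not the kernel
code: the names demo_assoc, demo_wait_for_space, demo_sendmsg and the
free_count counter are hypothetical. The point shown is that the wait helper
clears the caller's "newly created" pointer when it finds the object already
dead, so the caller's error path does not free it a second time.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct sctp_association. */
struct demo_assoc {
	int dead;	/* set once another path has torn the object down */
};

/* Counts how often the sender's error path freed an association. */
static int free_count;

/*
 * Stand-in for the wait helper. If the association died while we were
 * "asleep", clear the caller's ownership pointer before reporting the error.
 */
static int demo_wait_for_space(struct demo_assoc *asoc, int have_space,
			       struct demo_assoc **new_owner)
{
	if (asoc->dead) {
		*new_owner = NULL;	/* someone else owns the teardown now */
		return -EPIPE;
	}
	return have_space ? 0 : -EAGAIN;
}

/* Stand-in for the sender, which believes it created asoc itself. */
static int demo_sendmsg(struct demo_assoc *asoc, int have_space)
{
	struct demo_assoc *new_asoc = asoc;
	int err = demo_wait_for_space(asoc, have_space, &new_asoc);

	if (err && new_asoc) {
		/* Error path: free only what we still own. */
		free(new_asoc);
		free_count++;
	}
	return err;
}
```

In this sketch a dead association leaves free_count untouched, while an
ordinary error on a live, newly created association frees it exactly once.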

Re: [PATCH] net: dsa: lan9303: correctly check return value of devm_gpiod_get_optional

2017-11-12 Thread Andrew Lunn

On Mon, Nov 13, 2017 at 12:08:49PM +0800, Phil Reid wrote:
> On 12/11/2017 23:38, Pan Bian wrote:
> >Function devm_gpiod_get_optional() returns an ERR_PTR on failure. Its
> >return value should not be validated by a NULL check. Instead, use IS_ERR.
> >
> >Signed-off-by: Pan Bian 
> >---
> >  drivers/net/dsa/lan9303-core.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> >diff --git a/drivers/net/dsa/lan9303-core.c b/drivers/net/dsa/lan9303-core.c
> >index b471413..6d3fc8f 100644
> >--- a/drivers/net/dsa/lan9303-core.c
> >+++ b/drivers/net/dsa/lan9303-core.c
> >@@ -828,7 +828,7 @@ static void lan9303_probe_reset_gpio(struct lan9303 *chip,
> > chip->reset_gpio = devm_gpiod_get_optional(chip->dev, "reset",
> >GPIOD_OUT_LOW);
> >-if (!chip->reset_gpio) {
> >+if (IS_ERR(chip->reset_gpio)) {
> > dev_dbg(chip->dev, "No reset GPIO defined\n");
> > return;
> > }
> >
> Should not an error actually report the error and error out (ie fail probe).
> But a null is the optional return and ok. (ie when -ENOENT return from sub 
> gpiod_get call).
> 
> IS_ERR should be a separate condition check I think.

Hi Phil

Yes, you are right. In particular, -EPROBE_DEFER should be propagated
up and cause the probe to fail and be called later.

Care to submit a patch?

 Andrew
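
The three-way distinction discussed above (error pointer, NULL for "not
present", valid descriptor) could look roughly like the sketch below. It is
a userspace illustration only: ERR_PTR, PTR_ERR and IS_ERR are re-implemented
here, modeled on include/linux/err.h, and demo_probe_reset_gpio is a
hypothetical helper, not the lan9303 driver code.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO	4095
#define EPROBE_DEFER	517	/* kernel-internal errno value */

/* Userspace re-implementations modeled on include/linux/err.h. */
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/*
 * Hypothetical probe helper. gpio_or_err is what an "optional" getter
 * returned: an error pointer, NULL (no reset line described), or a
 * valid descriptor.
 */
static int demo_probe_reset_gpio(void *gpio_or_err, void **reset_gpio)
{
	if (IS_ERR(gpio_or_err))
		return (int)PTR_ERR(gpio_or_err); /* includes -EPROBE_DEFER */

	/* NULL is fine here: the reset GPIO is simply not defined. */
	*reset_gpio = gpio_or_err;
	return 0;
}
```

A caller would return any non-zero result from its own probe function, so
that a deferred probe is retried later instead of being silently ignored.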

Re: [PATCH v4] scripts: add leaking_addresses.pl

2017-11-12 Thread Tobin C. Harding

On Mon, Nov 13, 2017 at 06:37:28AM +0300, Kirill A. Shutemov wrote:
> On Mon, Nov 13, 2017 at 10:06:46AM +1100, Tobin C. Harding wrote:
> > On Sun, Nov 12, 2017 at 02:10:07AM +0300, Kirill A. Shutemov wrote:
> > > On Tue, Nov 07, 2017 at 09:32:11PM +1100, Tobin C. Harding wrote:
> > > > Currently we are leaking addresses from the kernel to user space. This
> > > > script is an attempt to find some of those leakages. Script parses
> > > > `dmesg` output and /proc and /sys files for hex strings that look like
> > > > kernel addresses.
> > > > 
> > > > Only works for 64 bit kernels, the reason being that kernel addresses
> > > > on 64 bit kernels have 'ffff' as the leading bit pattern making greping
> > > > possible. On 32 bit kernels we don't have this luxury.
> > > 
> > > Well, it's not going to work as well as intended on x86 machine with
> > > 5-level paging. Kernel address space there starts at 0xff10000000000000.
> > > It will still catch pointers to kernel/modules text, but the rest is
> > > outside of 0xffff... space. See Documentation/x86/x86_64/mm.txt.
> > 
> > Thanks for the link. So it looks like we need to refactor the kernel
> > address regular expression into a function that takes into account the
> > machine architecture and the number of page table levels. We will need
> > to add this to the false positive checks also.
> > 
> > > Not sure if we care. It also won't work for other 64-bit architectures that
> > > have more than 256TB of virtual address space.
> > 
> > Is this because of the virtual memory map?
> 
> On x86 direct mapping is the nearest thing we have to userspace.
> 
> > Did you mean 512TB?
> 
> No, I mean 256TB.
> 
> You have all kernel memory in the range from 0xffff000000000000 to
> 0xffffffffffffffff if you have 256 TB of virtual address space. If you
> have more, some things might be outside the range.

Doesn't 4-level paging already limit a system to 64TB of memory? So any
system better equipped than this will use 5-level paging right? If I am
totally talking rubbish please ignore, I'm appreciative that you pointed
out the limitation already. Perhaps we can add a comment to the script

# Script may miss some addresses on machines with more than 256TB of
# memory.

thanks,
Tobin.
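
As a rough worked example of the limitation discussed above: a check for the
'ffff' prefix only covers the top 2^48 bytes (256 TiB) of a 64-bit address
space. The helpers below are hypothetical and written in C purely for
illustration (the script itself is Perl); the sample address
0xff10000000000000 is the start of the 5-level direct mapping listed in
Documentation/x86/x86_64/mm.txt.

```c
#include <stdint.h>

/* True when the top 16 bits are all ones, i.e. the value prints as 0xffff... */
static int demo_has_ffff_prefix(uint64_t value)
{
	return (value >> 48) == 0xffffULL;
}

/*
 * Size in TiB of an address space with va_bits of virtual address.
 * 2^va_bits bytes / 2^40 bytes per TiB; va_bits must be at least 40.
 */
static uint64_t demo_va_space_tib(unsigned int va_bits)
{
	return (uint64_t)1 << (va_bits - 40);
}
```

With 48 virtual address bits (4-level paging on x86-64) the helper reports
256 TiB, and every value in the upper 0xffff... range passes the prefix test.
With 57 bits (5-level paging) the space is far larger, and an address such as
0xff10000000000000 no longer carries the 'ffff' prefix.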

Re: [PATCH] net: dsa: lan9303: correctly check return value of devm_gpiod_get_optional

2017-11-12 Thread Phil Reid


On 12/11/2017 23:38, Pan Bian wrote:

> Function devm_gpiod_get_optional() returns an ERR_PTR on failure. Its
> return value should not be validated by a NULL check. Instead, use IS_ERR.
>
> Signed-off-by: Pan Bian 
> ---
>   drivers/net/dsa/lan9303-core.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/dsa/lan9303-core.c b/drivers/net/dsa/lan9303-core.c
> index b471413..6d3fc8f 100644
> --- a/drivers/net/dsa/lan9303-core.c
> +++ b/drivers/net/dsa/lan9303-core.c
> @@ -828,7 +828,7 @@ static void lan9303_probe_reset_gpio(struct lan9303 *chip,
>  chip->reset_gpio = devm_gpiod_get_optional(chip->dev, "reset",
>     GPIOD_OUT_LOW);
> -	if (!chip->reset_gpio) {
> +   if (IS_ERR(chip->reset_gpio)) {
>  dev_dbg(chip->dev, "No reset GPIO defined\n");
>  return;
>  }

Shouldn't an error actually be reported and cause the probe to fail?
A NULL return is the "optional" case and is fine (i.e. when the underlying
gpiod_get call returns -ENOENT).

I think IS_ERR should be a separate condition check.

Relatedly, lan9303_handle_reset() always returns 0, yet lan9303_probe()
checks its return value.

It should probably be checking lan9303_probe_reset_gpio() instead.

--
Regards
Phil Reid

[PATCH net-next V2] vhost_net: conditionally enable tx polling

2017-11-12 Thread Jason Wang

We always poll tx for the socket. This is suboptimal since it slightly
increases the waitqueue traversal time and, more importantly, vhost
cannot benefit from commit 9e641bdcfa4e ("net-tun: restructure
tun_do_read for better sleep/wakeup efficiency"): even if we've stopped
rx polling during handle_rx(), the tx poll is still left in the
waitqueue.

Pktgen from a remote host to VM over mlx4 on two 2.00GHz Xeon E5-2650
shows 11.7% improvements on rx PPS. (from 1.28Mpps to 1.44Mpps)

Cc: Wei Xu 
Cc: Matthew Rosato 
Signed-off-by: Jason Wang 
---
Changes from V1:
- don't try to disable tx polling during start
- poll tx on error unconditionally
---
 drivers/vhost/net.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 68677d9..8d626d7 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -471,6 +471,7 @@ static void handle_tx(struct vhost_net *net)
goto out;
 
vhost_disable_notify(&net->dev, vq);
+   vhost_net_disable_vq(net, vq);
 
hdr_size = nvq->vhost_hlen;
zcopy = nvq->ubufs;
@@ -556,6 +557,7 @@ static void handle_tx(struct vhost_net *net)
% UIO_MAXIOV;
}
vhost_discard_vq_desc(vq, 1);
+   vhost_net_enable_vq(net, vq);
break;
}
if (err != len)
-- 
2.7.4

Re: [PATCH v4] scripts: add leaking_addresses.pl

2017-11-12 Thread Kirill A. Shutemov

On Mon, Nov 13, 2017 at 10:06:46AM +1100, Tobin C. Harding wrote:
> On Sun, Nov 12, 2017 at 02:10:07AM +0300, Kirill A. Shutemov wrote:
> > On Tue, Nov 07, 2017 at 09:32:11PM +1100, Tobin C. Harding wrote:
> > > Currently we are leaking addresses from the kernel to user space. This
> > > script is an attempt to find some of those leakages. Script parses
> > > `dmesg` output and /proc and /sys files for hex strings that look like
> > > kernel addresses.
> > > 
> > > Only works for 64 bit kernels, the reason being that kernel addresses
> > > on 64 bit kernels have 'ffff' as the leading bit pattern making greping
> > > possible. On 32 bit kernels we don't have this luxury.
> > 
> > Well, it's not going to work as well as intended on x86 machine with
> > 5-level paging. Kernel address space there starts at 0xff10000000000000.
> > It will still catch pointers to kernel/modules text, but the rest is
> > outside of 0xffff... space. See Documentation/x86/x86_64/mm.txt.
> 
> Thanks for the link. So it looks like we need to refactor the kernel
> address regular expression into a function that takes into account the
> machine architecture and the number of page table levels. We will need
> to add this to the false positive checks also.
> 
> > Not sure if we care. It also won't work for other 64-bit architectures that
> > have more than 256TB of virtual address space.
> 
> Is this because of the virtual memory map?

On x86 direct mapping is the nearest thing we have to userspace.

> Did you mean 512TB?

No, I mean 256TB.

You have all kernel memory in the range from 0xffff000000000000 to
0xffffffffffffffff if you have 256 TB of virtual address space. If you
have more, some things might be outside the range.

-- 
 Kirill A. Shutemov

[PATCH 1/1] net: ipv4: use BUG_ON instead of condition followed by BUG

2017-11-12 Thread Jian Wang

Signed-off-by: Jian Wang 
---
 net/ipv4/ip_output.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index e8e675b..1a53553 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -769,8 +769,8 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
/*
 *  Copy a block of the IP datagram.
 */
-   if (skb_copy_bits(skb, ptr, skb_transport_header(skb2), len))
-   BUG();
+   BUG_ON(skb_copy_bits(skb, ptr, skb_transport_header(skb2), len));
+
left -= len;
 
/*
-- 
2.7.4

Fwd: FW: [PATCH 15/31] nds32: System calls handling

2017-11-12 Thread Vincent Chen

>>On Wed, Nov 8, 2017 at 6:55 AM, Greentime Hu  wrote:
>> From: Greentime Hu 
>
>> +#endif /* __ASM_NDS32_SYSCALLS_H */
>> diff --git a/arch/nds32/include/asm/unistd.h
>> b/arch/nds32/include/asm/unistd.h new file mode 100644 index
>> 000..b30adca
>> --- /dev/null
>> +++ b/arch/nds32/include/asm/unistd.h
>> @@ -0,0 +1,21 @@
>
>> +#define __ARCH_WANT_SYS_LLSEEK
>
>This gets set from include/asm-generic/unistd.h if you include that file.
>
Dear Arnd:

Thanks
I will remove it in the next version patch.


>> +#define __ARCH_WANT_SYS_CLONE
>
>This seems ok, though it would be nice to have the reverse logic and have 
>architectures opt-out of the generic version when they need to provide their 
>own, rather than having most architectures set it.
>

Thanks
I will provide nds32 SYSCALL_DEFINE_5(clone) in the next version patch.

>> +#define __ARCH_WANT_SYS_OLD_MMAP
>
>I don't see why you need this, can it be dropped?

Thanks
I will remove it in the next version patch.

>
>> diff --git a/arch/nds32/include/uapi/asm/unistd.h
>> b/arch/nds32/include/uapi/asm/unistd.h
>> new file mode 100644
>> index 000..01b466d
>> --- /dev/null
>> +++ b/arch/nds32/include/uapi/asm/unistd.h
>
>> +#define __NR_ipc   (__NR_arch_specific_syscall + 2)
>> +#define __NR_sysfs (__NR_arch_specific_syscall + 3)
>> +#define __NR__llseek __NR_llseek
>
>
>
>> +__SYSCALL(__NR_cacheflush, sys_cacheflush) __SYSCALL(__NR_syscall,
>> +sys_syscall) __SYSCALL(__NR_ipc, sys_ipc) __SYSCALL(__NR_sysfs,
>> +sys_sysfs)
>> +
>> +__SYSCALL(__NR_fadvise64_64, sys_fadvise64_64_wrapper)
>> +__SYSCALL(__NR_rt_sigreturn, sys_rt_sigreturn_wrapper)
>> +__SYSCALL(__NR_mmap, sys_old_mmap)
>
>Usually we handle those overrides by defining the macros in asm/unistd.h 
>before including the asm-generic version. Can you do that as well for 
>consistency?
>

Thanks
Ok, I will modify it in the next version patch

>I don't see a reason for sys_ipc, sys_sysfs or sys_old_mmap() here in a new 
>architecture. Can you drop those or explain why you need them?
>

Thanks
 I will remove them in the next version patch

>> +/*
>> + * Special system call wrappers
>> + *
>> + * $r0 = syscall number
>> + * $r8 = syscall table
>> + */
>> +   .type   sys_syscall, #function
>> +ENTRY(sys_syscall)
>> +   addi    $p1, $r0, #-__NR_syscalls
>> +   bgtz    $p1, 3f
>> +   move    $p1, $r0
>> +   move    $r0, $r1
>> +   move    $r1, $r2
>> +   move    $r2, $r3
>> +   move    $r3, $r4
>> +   move    $r4, $r5
>> +! add for syscall 6 args
>> +   lwi $r5, [$sp + #SP_OFFSET ]
>> +   lwi $r5, [$r5]
>> +! ~add for syscall 6 args
>> +
>> +   lw  $p1, [tbl+$p1<<2]
>> +   jr  $p1
>> +ENDPROC(sys_syscall)
>
>Can you explain what this is used for?
>

This is used to handle syscall(int number, ...).
Unlike on other architectures, on nds32 the system call number must be
determined at compile time when issuing a system call.
Therefore, we can only parse the arguments of syscall(int number, ...)
in kernel space and dispatch the call to the destination handler there.
(Other architectures can handle this in user space, in glibc's syscall
wrapper.)
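
The in-kernel dispatch described above amounts to shifting every argument
down by one position and indexing a handler table with the first one. A
minimal C sketch of that idea follows; demo_table, demo_sum and demo_first
are hypothetical stand-ins, and the real nds32 entry point is the assembly
quoted earlier, not this code.

```c
#include <errno.h>
#include <stddef.h>

typedef long (*demo_syscall_fn)(long, long, long, long, long, long);

/* Two toy handlers standing in for real system calls. */
static long demo_sum(long a, long b, long c, long d, long e, long f)
{
	return a + b + c + d + e + f;
}

static long demo_first(long a, long b, long c, long d, long e, long f)
{
	(void)b; (void)c; (void)d; (void)e; (void)f;
	return a;
}

static const demo_syscall_fn demo_table[] = { demo_sum, demo_first };
#define DEMO_NR_SYSCALLS (sizeof(demo_table) / sizeof(demo_table[0]))

/*
 * Indirect entry point: nr selects the handler, and the remaining six
 * values become that handler's arguments 0..5.
 */
static long demo_sys_syscall(long nr, long a0, long a1, long a2,
			     long a3, long a4, long a5)
{
	if (nr < 0 || (size_t)nr >= DEMO_NR_SYSCALLS)
		return -ENOSYS;	/* unknown number: behave like ni_syscall */

	return demo_table[nr](a0, a1, a2, a3, a4, a5);
}
```

Because one slot is consumed by the number itself, an indirect call with six
real arguments needs a seventh value, which is why the assembly version
fetches the last argument from the user stack.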

>> --- /dev/null
>> +++ b/arch/nds32/kernel/sys_nds32.c
>> +
>> +long sys_mmap2(unsigned long addr, unsigned long len,
>> +  unsigned long prot, unsigned long flags,
>> +  unsigned long fd, unsigned long pgoff) {
>> +   if (pgoff & (~PAGE_MASK >> 12))
>> +   return -EINVAL;
>> +
>> +   return sys_mmap_pgoff(addr, len, prot, flags, fd,
>> + pgoff >> (PAGE_SHIFT - 12)); }
>> +
>> +asmlinkage long sys_fadvise64_64_wrapper(int fd, int advice, loff_t offset,
>> +loff_t len) {
>> +   return sys_fadvise64_64(fd, offset, len, advice); }
>
>You should always use SYSCALL_DEFINE*() macros to define entry points for your 
>own syscalls in C code for consistency. I also wonder if we should just move 
>those two into common code, a lot of architectures need the first one in 
>particular.
>

The sys_fadvise64_64_wrapper is used to reorder the input parameters.

To solve a register alignment problem, we change the parameter order
of fadvise64_64 when issuing this syscall. We therefore need this
wrapper to put the parameters back into the order that the kernel's
sys_fadvise64_64 expects.
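
To make the register alignment argument concrete, the sketch below counts
32-bit argument slots under the assumption that a 64-bit argument must start
in an even-numbered slot (the rule used by the ARM EABI; whether nds32 follows
exactly the same rule is an assumption here). demo_slots_needed is a
hypothetical helper written for this illustration.

```c
/*
 * Count the 32-bit argument slots consumed by a parameter list.
 * arg_bytes[i] is 4 or 8; an 8-byte argument is aligned to an even slot.
 */
static int demo_slots_needed(const int *arg_bytes, int n)
{
	int slot = 0;

	for (int i = 0; i < n; i++) {
		if (arg_bytes[i] == 8) {
			slot += slot & 1;	/* skip one slot to align */
			slot += 2;
		} else {
			slot += 1;
		}
	}
	return slot;
}
```

Under that assumption the native order (fd, offset, len, advice) needs seven
slots because of one padding slot after fd, while the reordered
(fd, advice, offset, len) fits in six.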

>   Arnd


Best regards
Vincent

Fwd: FW: [PATCH 17/31] nds32: Signal handling support

2017-11-12 Thread Vincent Chen

>> +static int restore_sigframe(struct pt_regs *regs,
>> + struct rt_sigframe __user * sf) {
>
>[snip]
>
>> + err |= !valid_user_regs(regs);
>
>IDGI...  Where do you modify ->ipsw at all and how can valid_user_regs() come 
>to be false here?
>
Thanks.
This code is trivial and I will remove it in the next version patch



>> +asmlinkage int sys_rt_sigreturn(struct pt_regs *regs) {
>> + struct rt_sigframe __user *frame;
>> +
>> + /* Always make any pending restarted system calls return -EINTR */
>> + current->restart_block.fn = do_no_restart_syscall;
>> +
>> + /*
>> +  * Since we stacked the signal on a 64-bit boundary,
>> +  * then 'sp' should be two-word aligned here.  If it's
>> +  * not, then the user is trying to mess with us.
>> +  */
>> + if (regs->sp & 7)
>> + goto badframe;
>> +
>> + frame = (struct rt_sigframe __user *)regs->sp;
>> +
>> + if (!access_ok(VERIFY_READ, frame, sizeof(*frame)))
>> + goto badframe;
>> +
>> + if (restore_sigframe(regs, frame))
>> + goto badframe;
>> +
>> + if (restore_altstack(&frame->uc.uc_stack))
>> + goto badframe;
>> +
>> + return regs->uregs[0];
>> +
>> +badframe:
>> + force_sig(SIGSEGV, current);
>> + return 0;
>> +}
>
>AFAICS, you are copying arm; take a good look at sys_rt_sigreturn_wrapper 
>there - specifically, the 'mov why, #0' part.  Consider what happens if you 
>get an interrupt at the moment when $r0 contains -ERESTARTSYS (for
example) and signal arrives while we are processing the interrupt.  It
will be handled on the way out, without any syscall restart crap
('why'
is 0 on that codepath).  So far, so good, but think what'll happen
when you are done with the signal handler.  sigreturn() is called, the
values we had stashed in sigcontext go back into registers (OK,
pt_regs on kernel stack that will be used to reload the registers on
return to userland)... and the signals that had been blocked for the
duration of handlers (see sa_mask in sigaction(2)) get unblocked.  And
it turns out that you have one of those pending - it had arrived while
we'd been in the handler.

>Now we are fucked.  You have TIF_SIGPENDING set, so do_notify_resume() is 
>called.  And everything looks exactly as if you had a syscall restart 
>situation - regs->uregs[0] being one of -ERESTART... and 'syscall' flag being 
>true.  So we go into the second signal handler (as we ought to) with saved 
>->ipc set 4 bytes back from where we were going to return.
>It would've been the right thing to do if it *was* a syscall restart, but we 
>were returning to the location where the original interrupt had caught us.
>
>Result: with the right timing, an interrupt arriving when userland process has 
>$r0 equal -512 may lead to instruction pointer jumping 4 bytes back.
>Pity the poor sod trying to debug that kind of breakage...
>
>Restart should *NOT* be triggered upon sigreturn(2).

Thanks for your detailed description.
I got it and I will fix this bug in the next version patch
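
The failure mode described above can be modeled in a few lines. This is a
hypothetical userspace model, not the nds32 code: demo_regs, demo_do_signal
and demo_sigreturn are invented names, and the 4-byte rewind stands in for
the real restart logic. The fix it illustrates is that sigreturn must report
"not in a syscall" so that restored register contents are never mistaken for
a restartable return value.

```c
#define ERESTARTSYS 512	/* kernel-internal errno value */

struct demo_regs {
	long r0;		/* return-value register */
	unsigned long pc;	/* saved program counter */
};

/* Signal delivery: rewind pc only when we really came from a syscall. */
static void demo_do_signal(struct demo_regs *regs, int in_syscall)
{
	if (in_syscall && regs->r0 == -ERESTARTSYS)
		regs->pc -= 4;	/* re-execute the syscall instruction */
}

/*
 * sigreturn: restore the saved registers and return the "in syscall"
 * flag the exit path should use. It is always 0, so whatever happens
 * to be in r0 cannot trigger a restart.
 */
static int demo_sigreturn(struct demo_regs *regs,
			  const struct demo_regs *saved)
{
	*regs = *saved;
	return 0;
}
```

With the flag forced to 0, a restored r0 of -512 leaves pc untouched; passing
1 instead reproduces the 4-byte jump back that the review warns about.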


>>
>> +static int
>> +setup_return(struct pt_regs *regs, struct ksignal *ksig, void __user
>> +* frame) {
>> + unsigned long handler = (unsigned long)ksig->ka.sa.sa_handler;
>> + unsigned long retcode;
>> +
>> + /*
>> +  * Maybe we need to deliver a 32-bit signal to a 26-bit task.
>>
>Deliver to what, again?  That comment made sense (if rather sad one) on arm, 
>but what is it doing here?
>
Thanks.
I will remove it in the next version patch


>> +static int do_signal(struct pt_regs *regs, int syscall) {
>> + unsigned int retval = 0, continue_addr = 0, restart_addr = 0;
>> + struct ksignal ksig;
>> + int restart = 0;
>> +
>> + /*
>> +  * We want the common case to go fast, which
>> +  * is why we may in certain cases get here from
>> +  * kernel mode. Just return without doing anything
>> +  * if so.
>> +  */
>> + if (!user_mode(regs))
>> + return 0;
>
>Which cases would those be?

Thanks.
This code is trivial too. I will remove it in the next version patch

Re: CONFIG_DEBUG_INFO_SPLIT impacts on faddr2line

2017-11-12 Thread Fengguang Wu


[...]

> > > Oh - and talking about "big step forward" - does the 0day robot do
> > > any suspend/resume testing at all?
> > Yes, we do. CC Rui and Aaron on power testing.
>
> yes, we have added suspend/resume test in 0day, including both
> functionality and suspend/resume performance. It is not widely run
> because most of the 0Day testboxes are servers/desktops, now we've just
> added some client laptops as testboxes, and will add more in the near
> future. :)
>
> > > Even on non-laptop hardware, it should be possible to do something
> > > like
> > >
> > >    echo platform > /sys/power/pm_test
> > >    echo freeze > /sys/power/state
> > >
> > > or similar (assuming CONFIG_PM_DEBUG is enabled).
>
> yes.
>
> I will run native suspend/resume test on laptops and other test boxes
> that really support it, and run suspend/resume test in pm_test modes on
> the others to help us find more issues.


It's a good plan, thanks! Client devices can be much cheaper than servers.
They offer more hardware diversity and are more widely available.

On the other hand, if there are PM features that can be tested inside
QEMU, that would be good to have, since no real hardware can be tested as
cheaply and extensively as a large number of VMs.

Thanks,
Fengguang

Re: [patch net-next v2 01/10] cls_bpf: move prog offload->netdev check into drivers

2017-11-12 Thread Jakub Kicinski

On Sun, 12 Nov 2017 16:55:55 +0100, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> In order to remove tp->q usage in cls_bpf, the offload->netdev check
> needs to be moved to individual drivers as only they will have access
> to appropriate struct net_device.
> 
> Signed-off-by: Jiri Pirko 

This seems not entirely correct and it adds unnecessary code.  I think
the XDP and cls_bpf handling could be unified, making way for binding
the same program to multiple ports of the same device.  Would you mind
waiting a day for me to send corrections to BPF offload?

Re: CONFIG_DEBUG_INFO_SPLIT impacts on faddr2line

2017-11-12 Thread Zhang Rui

On Mon, 2017-11-13 at 09:13 +0800, Fengguang Wu wrote:
> CC Andi and more DEBUG_INFO_SPLIT people.
> 
> On Sun, Nov 12, 2017 at 11:31:56AM -0800, Linus Torvalds wrote:
> > 
> > On Wed, Nov 8, 2017 at 9:12 AM, Fengguang Wu wrote:
> > > 
> > > 
> > > OK. Here is the original faddr2line output:
> > > 
> > > $ ~/linux/scripts/faddr2line vmlinux vlan_device_event+0x7f5/0xa40
> > > vlan_device_event+0x7f5/0xa40:
> > > vlan_device_event at net/8021q/vlan.h:60
> > > 
> > > And below is call trace embedded with full faddr2line output.
> > > 
> > > I notice that this trace shows no additional inline files at all.
> > > Is it because I did some kconfig option wrong, so that inline
> > > info is
> > > lost? Eg.
> > > 
> > > CONFIG_OPTIMIZE_INLINING=y (it looks better set to N)
> > > CONFIG_DEBUG_INFO_REDUCED=y
> > > CONFIG_DEBUG_INFO_SPLIT=y
> > Ok, this annoyed me, so I went back and looked.
> > 
> > It's the "CONFIG_DEBUG_INFO_SPLIT" thing that makes faddr2line
> > unable
> > to see the inlining information,
> > 
> > Using OPTIMIZE_INLINING is fine.
> Good to know that!
> 
> > 
> > I'm not sure that addr2line could be made to understand the .dwo
> > files
> > that DEBUG_INFO_SPLIT causes (particularly since we munge the
> > vmlinux
> > file itself, who knows how that could confuse things).
> > 
> > So can I ask that you make the 0day build scripts always use
> > 
> >  CONFIG_DEBUG_INFO=y
> >  CONFIG_DEBUG_INFO_REDUCED=y
> >  # CONFIG_DEBUG_INFO_SPLIT is not set
> > 
> > because with that "DEBUG_INFO_REDUCED=y", the use of
> > DEBUG_INFO_SPLIT
> > shouldn't be _that_ big of a deal.
> > 
> > Yes, splitting the debug info does help reduce disk usage for the
> > build, and presumably speed it up a bit too due to less IO and
> > reduced
> > copying of the debug info data, but right now it really makes the
> > debug info much less useful.
> Yes DEBUG_INFO_SPLIT helps reduce build cost. Equally importantly,
> it helps cut down the *.ko sizes, which saves boot test cost, too.
> Since in our test scheme, the below modules.cgz will be loaded as
> part
> of initrd on boot testing. Which will cost memory, and to the lesser
> degree, IO and uncompressing time.
> 
> Here is the diff of the modules.cgz size:
> 
> Big files under
> /pkg/linux/x86_64-rhel-7.2+CONFIG_DEBUG_INFO_REDUCED/gcc-6/v4.14-rc7/,
> comparing to +CONFIG_DEBUG_INFO_SPLIT:
> 
> =>54M  135M  modules.cgz
>  7.3M  7.3M  vmlinuz-4.14.0-rc7
>  1.2M  1.2M  linux-headers.cgz
>  7.6M  7.7M  linux-selftests.cgz
>   31M   31M  linux-perf.cgz
> 
> Nevertheless, that's machine cost. If DEBUG_INFO_SPLIT hurts our
> ability to analyze bugs, I think the forthright way would be to
> disable it in our tests.
> 
> > 
> > Just to see the difference:
> > 
> > - with DEBUG_INFO_SPLIT=y
> > 
> >    [torvalds@i7 linux]$ ./scripts/faddr2line vmlinux __schedule+0x314
> >    __schedule+0x314/0x840:
> >    __schedule at kernel/sched/stats.h:12
> > 
> > - with DEBUG_INFO_SPLIT is not set
> > 
> >    [torvalds@i7 linux]$ ./scripts/faddr2line vmlinux __schedule+0x314
> >    __schedule+0x314/0x840:
> >    rq_sched_info_arrive at kernel/sched/stats.h:12
> > (inlined by) sched_info_arrive at kernel/sched/stats.h:99
> > (inlined by) __sched_info_switch at kernel/sched/stats.h:151
> > (inlined by) sched_info_switch at kernel/sched/stats.h:158
> > (inlined by) prepare_task_switch at kernel/sched/core.c:2582
> > (inlined by) context_switch at kernel/sched/core.c:2755
> > (inlined by) __schedule at kernel/sched/core.c:3366
> > 
> > and while (once again) this is a pretty extreme case, we do use a
> > lot
> > of inlines, and gcc will add its own inlining. Getting this whole
> > information - particularly for the faulting IP - would really help
> > in
> > some situations.
> > 
> > I love what the 0day robot is doing, this would be another big step
> > forward.
> Thank you for the helpful information and appreciations!
> I'll make the change to disable DEBUG_INFO_SPLIT.
> 
> > 
> > Oh - and talking about "big step forward" - does the 0day robot do
> > any
> > suspend/resume testing at all?
> Yes, we do. CC Rui and Aaron on power testing.
> 
yes, we have added suspend/resume test in 0day, including both
functionality and suspend/resume performance. It is not widely run
because most of the 0Day testboxes are servers/desktops, now we've just
added some client laptops as testboxes, and will add more in the near
future. :)
> > 
> > Even on non-laptop hardware, it should be possible to do something
> > like
> > 
> >    echo platform > /sys/power/pm_test
> >    echo freeze > /sys/power/state
> > 
> > or similar (assuming CONFIG_PM_DEBUG is enabled).
> > 

yes.

I will run native suspend/resume test on laptops and other test boxes
that really support it, and run suspend/resume test in pm_test modes on
the others to help us find more issues.

thanks,
rui
> > Maybe you already do something like this?
> Rui/Aaron have better

Re: [PATCH net-next 0/5] net: improve the process of redirect and toobig for ipv6 tunnels

2017-11-12 Thread David Miller

From: Xin Long 
Date: Sat, 11 Nov 2017 19:06:48 +0800

> Now let's say there are 3 kinds of icmp packets to process for tunnels,
> toobig(needfrag), redirect, others, their process should be:
> 
>  - toobig(needfrag)
>update the lower dst's pmtu by route cache, also update sk dst's pmtu
>if possible, or it will be fine if sk dst pmtu will get updated on tx
>path.
> 
>  - redirect
>update the lower dst's gw by route cache and return, no need to send
>this redirect packet to user sk.
> 
>  - others
>send the packet to user's sk, or it will also be fine to use err_count
>to count it and report fail link on tx path.
> 
> All ipv4 tunnels basically follow this while some of ipv6 tunnels are
> doing in different ways, like ip6gre and ip6_tunnels update tnl dev's
> mtu instead of updating lower dst pmtu, no redirect process on their
> err_handlers, which doesn't make any sense and even causes performance
> problems.
> 
> This patchset is to improve the process of redirect and toobig for ip6gre
> ip4ip6, ip6ip6 tunnels, as in ipv4 tunnels.

Series applied, thank you.
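
The three-way handling listed in the cover letter above can be summarized as
a small classifier. This is a hypothetical sketch rather than code from the
series: the demo_ names are invented, and only the ICMPv6 type numbers
(2 for Packet Too Big, 137 for Redirect) are taken from the ICMPv6 and
Neighbor Discovery specifications.

```c
/* ICMPv6 type values (RFC 4443 Packet Too Big, RFC 4861 Redirect). */
#define DEMO_ICMPV6_PKT_TOOBIG	2
#define DEMO_NDISC_REDIRECT	137

enum demo_action {
	DEMO_UPDATE_PMTU,	/* update the lower dst's pmtu via route cache */
	DEMO_UPDATE_GW,		/* update the lower dst's gateway, then return */
	DEMO_REPORT_TO_SK,	/* pass the error on to the user's socket */
};

/* Decide what a tunnel error handler should do with an ICMPv6 message. */
static enum demo_action demo_classify(int icmp6_type)
{
	switch (icmp6_type) {
	case DEMO_ICMPV6_PKT_TOOBIG:
		return DEMO_UPDATE_PMTU;
	case DEMO_NDISC_REDIRECT:
		return DEMO_UPDATE_GW;
	default:
		return DEMO_REPORT_TO_SK;
	}
}
```

For instance, a Destination Unreachable message (type 1) falls into the
"others" bucket and is reported to the socket.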

Re: [PATCH net-next 1/1] forcedeth: remove redudant assignments in xmit

2017-11-12 Thread David Miller

From: Zhu Yanjun 
Date: Fri, 10 Nov 2017 21:10:00 -0500

> In xmit process, the variables are set many times. In fact,
> it is enough for these variables to be set once.
> After a long time test, the throughput performance is better
> than before.
> 
> CC: Srinivas Eeda 
> CC: Joe Jin 
> CC: Junxiao Bi 
> Signed-off-by: Zhu Yanjun 

Applied.

Re: [net-next v5 0/4] Openvswitch meter action

2017-11-12 Thread David Miller

From: Pravin Shelar 
Date: Sat, 11 Nov 2017 07:25:30 +0530

> On Sat, Nov 11, 2017 at 1:39 AM, Andy Zhou  wrote:
>> This patch series is the first attempt to add openvswitch
>> meter support. We have previously experimented with adding
>> metering support in nftables. However 1) It was not clear
>> how to expose a named nftables object cleanly, and 2)
>> the logic that implements metering is quite small, < 100 lines
>> of code.
>>
>> With those two observations, it seems cleaner to add meter
>> support in the openvswitch module directly.
>>
>> ---
>>
>> v1(RFC)->v2:  remove unused code improve locking
>>   and other review comments
>> v2 -> v3: rebase
>> v3 -> v4: fix undefined "__udivdi3" references on 32 bit builds.
>>   use div_u64() instead.
>> v4 -> v5: rebase
>>
> Acked-by: Pravin B Shelar 

Series applied, thanks.

Re: [PATCH net-next 0/4] net: dsa: b53: Support prepended Broadcom tags

2017-11-12 Thread David Miller

From: Florian Fainelli 
Date: Fri, 10 Nov 2017 15:22:51 -0800

> This patch series adds support for prepended 4-bytes Broadcom tags that we
> already support. This type of tag will typically be used when interfaced to
> a SoC like BCM58xx (NorthStar Plus) which supports a Flow Accelerator (WIP).
> In that case, we need to support a slightly different tagging format.
> 
> The first patch does a bit of re-factoring and passes a port index to
> the get_tag_protocol() function since at least two different drivers need
> that type of information (mt7530, b53) to support tagging or not.

Series applied, thanks Florian.

Re: [patch 1/1] net/sched/sch_red.c: work around gcc-4.4.4 anon union initializer issue

2017-11-12 Thread David Miller

From: a...@linux-foundation.org
Date: Fri, 10 Nov 2017 15:09:53 -0800

> From: Andrew Morton 
> Subject: net/sched/sch_red.c: work around gcc-4.4.4 anon union initializer issue
> 
> gcc-4.4.4 (at least) has issues with initializers and anonymous unions:
> 
> net/sched/sch_red.c: In function 'red_dump_offload':
> net/sched/sch_red.c:282: error: unknown field 'stats' specified in initializer
> net/sched/sch_red.c:282: warning: initialization makes integer from pointer without a cast
> net/sched/sch_red.c:283: error: unknown field 'stats' specified in initializer
> net/sched/sch_red.c:283: warning: initialization makes integer from pointer without a cast
> net/sched/sch_red.c: In function 'red_dump_stats':
> net/sched/sch_red.c:352: error: unknown field 'xstats' specified in initializer
> net/sched/sch_red.c:352: warning: initialization makes integer from pointer without a cast
> 
> Work around this.
> 
> Fixes: 602f3baf2218 ("net_sch: red: Add offload ability to RED qdisc")
> Cc: Nogah Frankel 
> Cc: Jiri Pirko 
> Cc: Simon Horman 
> Cc: David S. Miller 
> Signed-off-by: Andrew Morton 

Applied.
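
Illustration (not part of the thread): the patch body is not quoted above, so the following is only a sketch of the kind of construct involved, under the assumption that the workaround wraps the anonymous union's initializers in their own braces. struct offload_req and both init functions are hypothetical names. The first form names a member of the anonymous union directly, which is what old gcc reports as "unknown field"; the second initializes the union positionally inside an extra pair of braces.

```c
#include <assert.h>

/* Hypothetical stand-in for a struct with an anonymous union member. */
struct offload_req {
	int command;
	union {			/* anonymous union (C11, GNU extension before) */
		struct {
			const int *bstats;
			const int *qstats;
		} stats;
		int *xstats;
	};
};

/* Direct form: the designators reach into the anonymous union. */
static struct offload_req init_direct(const int *b, const int *q)
{
	assert(b && q);
	struct offload_req req = {
		.command = 1,
		.stats.bstats = b,
		.stats.qstats = q,
	};
	return req;
}

/* Braced form: the union is initialized as the next positional member. */
static struct offload_req init_braced(const int *b, const int *q)
{
	assert(b && q);
	struct offload_req req = {
		.command = 1,
		{
			.stats.bstats = b,
			.stats.qstats = q,
		},
	};
	return req;
}
```

On a current compiler both functions build the same object; the braced form only matters for compilers that cannot resolve designators through an anonymous union.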

Re: [PATCH net-next] net/mlx4: Use Kconfig flag to remove support of old gen2 Mellanox devices

2017-11-12 Thread David Miller

From: Tariq Toukan 
Date: Fri, 10 Nov 2017 09:10:29 +0200

> From: Slava Shwartsman 
> 
> Since Mellanox focus is on newer adapters, we would like to have the
> ability to disable the support for old gen2 adapters.
> 
> This can be done by turning off the MLX4_CORE_GEN2 Kconfig flag.
> We keep it turned on by default.
> 
> Signed-off-by: Slava Shwartsman 
> Signed-off-by: Tariq Toukan 

Applied.

Re: [PATCH net-next] tcp: Namespace-ify sysctl_tcp_default_congestion_control

2017-11-12 Thread David Miller

From: Stephen Hemminger 
Date: Fri, 10 Nov 2017 10:26:37 +0900

> Make the default TCP congestion control a per-namespace
> value. The congestion control setting of new namespaces is inherited
> from the root namespace. Modules are only autoloaded in the root namespace.
> 
> Signed-off-by: Stephen Hemminger 

I have to think some more about this and the semantics you've chosen.

Is it really buying us anything to restrict the module load to the
initial namespace?  Unless it's really required this makes things like
running tests in sub-namespaces unnecessarily cumbersome.

Re: [PATCH V2 net] net: hns3: Updates MSI/MSI-X alloc/free APIs(depricated) to new APIs

2017-11-12 Thread David Miller

From: Salil Mehta 
Date: Thu, 9 Nov 2017 16:38:13 +

> This patch migrates the HNS3 driver code from use of deprecated PCI
> MSI/MSI-X interrupt vector allocation/free APIs to new common APIs.
> 
> Signed-off-by: Salil Mehta 
> Suggested-by: Christoph Hellwig 

This doesn't apply cleanly to the net-next tree.

Re: [PATCH v4] af_netlink: ensure that NLMSG_DONE never fails in dumps

2017-11-12 Thread David Miller

From: David Miller 
Date: Sat, 11 Nov 2017 23:21:01 +0900 (KST)

> Aha, that's what I missed.  Indeed, it cannot happen.

Applied and queued up for -stable.

Re: [PATCH v3 net-next 0/3] netem: add nsec scheduling and slot feature

2017-11-12 Thread David Miller

From: Dave Taht 
Date: Wed,  8 Nov 2017 15:12:25 -0800

> This patch series converts netem away from the old "ticks" interface and
> userspace API, and adds support for a new "slot" feature intended to
> emulate bursty macs such as WiFi and LTE better.
> 
> Changes since v2:
> Use u64 for packet_len_sched_time()
> Use simpler max(time_to_send,q->slot.slot_next)
> 
> Changes since v1:
> Always pass new nanosecond APIs to userspace

Series applied, thanks!

CONFIG_DEBUG_INFO_SPLIT impacts on faddr2line

2017-11-12 Thread Fengguang Wu


CC Andi and more DEBUG_INFO_SPLIT people.

On Sun, Nov 12, 2017 at 11:31:56AM -0800, Linus Torvalds wrote:
> On Wed, Nov 8, 2017 at 9:12 AM, Fengguang Wu  wrote:
>>
>> OK. Here is the original faddr2line output:
>>
>> $ ~/linux/scripts/faddr2line vmlinux vlan_device_event+0x7f5/0xa40
>> vlan_device_event+0x7f5/0xa40:
>> vlan_device_event at net/8021q/vlan.h:60
>>
>> And below is call trace embedded with full faddr2line output.
>>
>> I notice that this trace shows no additional inline files at all.
>> Is it because I did some kconfig option wrong, so that inline info is
>> lost? Eg.
>>
>> CONFIG_OPTIMIZE_INLINING=y (it looks better set to N)
>> CONFIG_DEBUG_INFO_REDUCED=y
>> CONFIG_DEBUG_INFO_SPLIT=y
>
> Ok, this annoyed me, so I went back and looked.
>
> It's the "CONFIG_DEBUG_INFO_SPLIT" thing that makes faddr2line unable
> to see the inlining information,
>
> Using OPTIMIZE_INLINING is fine.

Good to know that!

> I'm not sure that addr2line could be made to understand the .dwo files
> that DEBUG_INFO_SPLIT causes (particularly since we munge the vmlinux
> file itself, who knows how that could confuse things).
>
> So can I ask that you make the 0day build scripts always use
>
>  CONFIG_DEBUG_INFO=y
>  CONFIG_DEBUG_INFO_REDUCED=y
>  # CONFIG_DEBUG_INFO_SPLIT is not set
>
> because with that "DEBUG_INFO_REDUCED=y", the use of DEBUG_INFO_SPLIT
> shouldn't be _that_ big of a deal.
>
> Yes, splitting the debug info does help reduce disk usage for the
> build, and presumably speed it up a bit too due to less IO and reduced
> copying of the debug info data, but right now it really makes the
> debug info much less useful.

Yes DEBUG_INFO_SPLIT helps reduce build cost. Equally importantly,
it helps cut down the *.ko sizes, which saves boot test cost, too.
Since in our test scheme, the below modules.cgz will be loaded as part
of initrd on boot testing. Which will cost memory, and to the lesser
degree, IO and uncompressing time.

Here is the diff of the modules.cgz size:

Big files under
/pkg/linux/x86_64-rhel-7.2+CONFIG_DEBUG_INFO_REDUCED/gcc-6/v4.14-rc7/,
comparing to +CONFIG_DEBUG_INFO_SPLIT:

=>54M  135M  modules.cgz
7.3M  7.3M  vmlinuz-4.14.0-rc7
1.2M  1.2M  linux-headers.cgz
7.6M  7.7M  linux-selftests.cgz
 31M   31M  linux-perf.cgz

Nevertheless, that's machine cost. If DEBUG_INFO_SPLIT hurts our
ability to analyze bugs, I think the forthright way would be to
disable it in our tests.

> Just to see the difference:
>
> - with DEBUG_INFO_SPLIT=y
>
>    [torvalds@i7 linux]$ ./scripts/faddr2line vmlinux __schedule+0x314
>    __schedule+0x314/0x840:
>    __schedule at kernel/sched/stats.h:12
>
> - with DEBUG_INFO_SPLIT is not set
>
>    [torvalds@i7 linux]$ ./scripts/faddr2line vmlinux __schedule+0x314
>    __schedule+0x314/0x840:
>    rq_sched_info_arrive at kernel/sched/stats.h:12
>    (inlined by) sched_info_arrive at kernel/sched/stats.h:99
>    (inlined by) __sched_info_switch at kernel/sched/stats.h:151
>    (inlined by) sched_info_switch at kernel/sched/stats.h:158
>    (inlined by) prepare_task_switch at kernel/sched/core.c:2582
>    (inlined by) context_switch at kernel/sched/core.c:2755
>    (inlined by) __schedule at kernel/sched/core.c:3366
>
> and while (once again) this is a pretty extreme case, we do use a lot
> of inlines, and gcc will add its own inlining. Getting this whole
> information - particularly for the faulting IP - would really help in
> some situations.
>
> I love what the 0day robot is doing, this would be another big step forward.

Thank you for the helpful information and appreciations!
I'll make the change to disable DEBUG_INFO_SPLIT.

> Oh - and talking about "big step forward" - does the 0day robot do any
> suspend/resume testing at all?

Yes, we do. CC Rui and Aaron on power testing.

> Even on non-laptop hardware, it should be possible to do something like
>
>    echo platform > /sys/power/pm_test
>    echo freeze > /sys/power/state
>
> or similar (assuming CONFIG_PM_DEBUG is enabled).
>
> Maybe you already do something like this?

Rui/Aaron have better knowledge on the current status. It does look an
error-prone area that's worth more testing efforts.

> Anyway, regardless this was a good release for the 0day robot. Thanks.


My (and our) pleasure. I'd like to thank you and all the people who
take time to analyze/fix the bugs. It's great to see the long standing
bugs being fixed in mainline -- they have been a big source of noises
that hurt our auto bisect capabilities.

Regards,
Fengguang

Re: [PATCH net-next v2] ipv6: try not to take rtnl_lock in ip6mr_sk_done

2017-11-12 Thread David Miller

From: frugg...@arista.com (Francesco Ruggeri)
Date: Wed, 08 Nov 2017 11:23:46 -0800

> Avoid traversing the list of mr6_tables (which requires the
> rtnl_lock) in ip6mr_sk_done(), when we know in advance that
> a match will not be found.
> This can happen when rawv6_close()/ip6mr_sk_done() is invoked
> on non-mroute6 sockets.
> This patch helps reduce rtnl_lock contention when destroying
> a large number of net namespaces, each having a non-mroute6
> raw socket.
> 
> v2: same patch, only fixed subject line and expanded comment.
> 
> Signed-off-by: Francesco Ruggeri 

Applied, thanks.

Re: [PATCH v2 net-next 00/12] tls: Add generic NIC offload infrastructure

2017-11-12 Thread David Miller

From: Ilya Lesokhin 
Date: Wed,  8 Nov 2017 15:38:25 +0200

> Changes from v1:
> - Remove the binding of the socket to a specific netdev 
>   through sk->sk_bound_dev_if.
>   Add a check in validate_xmit_skb to detect route changes
>   and call SW fallback code to do the crypto in software.
> - tls_get_record now returns the tls record sequence number.
>   This is required to support connections with rcd_sn != iv.
> - Bug fixes to the TLS code.
> 
> This patchset adds a generic infrastructure to offload TLS crypto to
> network devices.
> 
> Patches 1-6 refactor and fix various issues in the TLS code
> Patches 7-8 Export functions that we need
> patch 9 adds infrastructure for offloaded socket fallback
> patches 10-11 add new NDOs and capabilities.
> patch 12 adds the TLS NIC offload infrastructure.
> 
> Github with mlx5e TLS offload support:
> https://github.com/Mellanox/tls-offload/tree/tls_device_v2
> 
> Paper: https://www.netdevconf.org/1.2/papers/netdevconf-TLS.pdf

This doesn't apply cleanly to net-next, and the net-next tree is now
closed so please resubmit this after the merge window.

Thank you.

Re: [PATCH] net: realtek: r8169: remove redundant assignment to giga_ctrl

2017-11-12 Thread David Miller

From: Colin King 
Date: Wed,  8 Nov 2017 13:23:23 +

> From: Colin Ian King 
> 
> The variable giga_ctrl is being assigned to zero however this is
> never read and hence the assignment is redundant, so remove it.
> Cleans up clang warning:
> 
> drivers/net/ethernet/realtek/r8169.c:1978:3: warning: Value stored
> to 'giga_ctrl' is never read
> 
> Signed-off-by: Colin Ian King 

Applied, thanks Colin.

Re: [kernel-hardening] [PATCH v4] scripts: add leaking_addresses.pl

2017-11-12 Thread Michael Ellerman

Frank Rowand  writes:
> Hi Michael,
>
> On 11/12/17 03:49, Michael Ellerman wrote:
...
>> 
>> On our bare metal machines the device tree comes from skiboot
>> (firmware), with some of the content provided by hostboot (other
>> firmware), both of which are open source, so in theory most of the
>> information is available in *some* source tree. But there's still
>> information about runtime allocations etc. that is not available in the
>> source anywhere.
>
> Thanks for the additional information. 
>
> Can you explain a little bit what "runtime allocations" are?  Are you
> referring to the memory reservation block, the memory node(s) and the
> chosen node?  Or other information?

Yeah I was thinking of memory reservations. They're under the
reserved-memory node as well as the reservation block, eg:

$ ls -1 /proc/device-tree/reserved-memory/
ibm,firmware-allocs-memory@10
ibm,firmware-allocs-memory@18
ibm,firmware-allocs-memory@39c0
ibm,firmware-allocs-memory@8
ibm,firmware-code@3000
ibm,firmware-data@3100
ibm,firmware-heap@3030
ibm,firmware-stacks@31c0
ibm,hbrt-code-image@1ffd51
ibm,hbrt-target-image@1ffd6a
ibm,hbrt-vpd-image@1ffd70
ibm,slw-image@1ffda0
ibm,slw-image@1ffde0
ibm,slw-image@1ffe20
ibm,slw-image@1ffe60


There's also some new systems where a catalog of PMU events is stored in
flash as a DTB and then stitched into the device tree by skiboot before
booting Linux.

Anyway my point was mainly just that the device tree is not simply a
copy of something in the kernel source.

cheers

Re: [PATCH net-next] net: dsa: lan9303: Fix lan9303_alr_del_port()

2017-11-12 Thread David Miller

From: Egil Hjelmeland 
Date: Wed,  8 Nov 2017 11:44:36 +0100

> Fix embarrassing bug in lan9303_alr_del_port(): Instead of zeroing
> entr->mac_addr, I destroyed the next cache entry. Affected .port_fdb_del and
> .port_mdb_del.
> 
> Fixes: 0620427ea0d6 ("net: dsa: lan9303: Add fdb/mdb manipulation")
> Signed-off-by: Egil Hjelmeland 

Applied.

Re: [ftrace-bpf 1/5] add BPF_PROG_TYPE_FTRACE to bpf

2017-11-12 Thread Alexei Starovoitov

On Sun, Nov 12, 2017 at 07:28:24AM +, yupeng0...@gmail.com wrote:
> Add a new type, BPF_PROG_TYPE_FTRACE, to bpf so that bpf programs can be
> attached to ftrace. Ftrace passes the function parameters to the bpf prog,
> and the bpf prog returns 1 or 0 to indicate whether ftrace can trace this
> function. The main purpose is to provide an accurate way to trigger
> function graph trace. Changes in code:
> 1. add FTRACE_BPF_FILTER in kernel/trace/Kconfig. Letting ftrace pass
> function parameters to bpf requires modifying architecture-dependent code,
> so this feature is only enabled when it is enabled in Kconfig and the
> architecture supports it. If an architecture supports this feature, it
> should define a macro named FTRACE_BPF_FILTER, e.g.:
> So other code in the kernel can check whether the macro FTRACE_BPF_FILTER
> is defined to know whether this feature is really enabled.
> 2. add BPF_PROG_TYPE_FTRACE in bpf_prog_type
> 3. check kernel version when loading a BPF_PROG_TYPE_FTRACE bpf prog
> 4. define ftrace_prog_func_proto. The prog input is a pointer to struct
> ftrace_regs, which is similar to pt_regs in kprobe and is
> architecture-dependent code. If an architecture doesn't define
> FTRACE_BPF_FILTER, a fake ftrace_prog_func_proto is used.
> 5. add BPF_PROG_TYPE in bpf_types.h
> 
> Signed-off-by: yupeng0...@gmail.com

In general I like the bigger concept of adding bpf filtering to ftrace,
but there are a lot of fundamental issues with this patch set.

1. anything bpf related has to go via net-next tree.

> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -118,6 +118,7 @@ enum bpf_prog_type {
>   BPF_PROG_TYPE_UNSPEC,
>   BPF_PROG_TYPE_SOCKET_FILTER,
>   BPF_PROG_TYPE_KPROBE,
> + BPF_PROG_TYPE_FTRACE,
>   BPF_PROG_TYPE_SCHED_CLS,

2.
this obviously breaks ABI. New types can only be added to the end.
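
Illustration (not part of the thread): a minimal sketch of why inserting an enumerator in the middle of a uapi enum changes the ABI. The three enums are abbreviated, hypothetical copies of the list in the quoted hunk; only the relative positions matter.

```c
#include <assert.h>

/* Values that existing userspace binaries were compiled against. */
enum prog_type_old {
	OLD_UNSPEC,		/* 0 */
	OLD_SOCKET_FILTER,	/* 1 */
	OLD_KPROBE,		/* 2 */
	OLD_SCHED_CLS,		/* 3 */
};

/* New entry inserted in the middle: every later value shifts by one. */
enum prog_type_inserted {
	INS_UNSPEC,
	INS_SOCKET_FILTER,
	INS_KPROBE,
	INS_FTRACE,		/* takes over 3 */
	INS_SCHED_CLS,		/* now 4 */
};

/* New entry appended at the end: existing values are untouched. */
enum prog_type_appended {
	APP_UNSPEC,
	APP_SOCKET_FILTER,
	APP_KPROBE,
	APP_SCHED_CLS,		/* still 3 */
	APP_FTRACE,		/* 4 */
};

/* Appending is checked at compile time; it keeps old numbers valid. */
static_assert((int)OLD_SCHED_CLS == (int)APP_SCHED_CLS,
	      "appending must not renumber existing types");

/* Returns 1 if an old binary's SCHED_CLS value still means SCHED_CLS. */
static int sched_cls_stable_after_insert(void)
{
	return (int)OLD_SCHED_CLS == (int)INS_SCHED_CLS;
}

static int sched_cls_stable_after_append(void)
{
	return (int)OLD_SCHED_CLS == (int)APP_SCHED_CLS;
}
```

With the inserted layout, an old binary that passes 3 for a SCHED_CLS program would be interpreted as asking for the new type, which is the breakage the comment above refers to.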

> +static bool ftrace_prog_is_valid_access(int off, int size,
> + enum bpf_access_type type,
> + struct bpf_insn_access_aux *info)
> +{
> + if (off < 0 || off >= sizeof(struct ftrace_regs))
> + return false;

3.
this won't even compile, since ftrace_regs is only added in the patch 4.

Since bpf program will see ftrace_regs as an input it becomes
abi, so has to be defined in uapi/linux/bpf_ftrace.h or similar.
We need to think through how to make it generic across archs
instead of defining ftrace_regs for each arch.

4.
the patch 2/3 takes an approach of passing FD integer value in text form
to the kernel. That approach was discussed years ago and rejected.
It has to use binary interface like perf_event + ioctl.
See RFC patches where we're extending perf_event_open syscall to
support binary access to kprobe/uprobe.
imo binary interface to ftrace is pre-requisite to ftrace+bpf work.
We've had too many issues with text based kprobe api to repeat
the same mistake here.

5.
patch 4 hacks save_mcount_regs asm to pass ctx pointer in %rcx
whereas it's only used in ftrace_graph_caller which doesn't seem right.
It points out to another issue that such ftrace+bpf integration
is only done for ftrace_graph_caller without extensibility in mind.
If we do ftrace+bpf I'd rather see generic framework that applies
to all of ftrace instead of single feature of it.

6.
copyright line copy-pasted incorrectly.

Re: [PATCH net-next] net: dsa: lan9303: Documentation: Add missing word "Mbps"

2017-11-12 Thread David Miller

From: Egil Hjelmeland 
Date: Wed,  8 Nov 2017 11:55:14 +0100

> Signed-off-by: Egil Hjelmeland 

Applied.

Re: [PATCH] ip_gre: fix ip-config error reported by lkp-robot

2017-11-12 Thread David Miller

From: William Tu 
Date: Tue,  7 Nov 2017 07:57:44 -0800

> lkp-robot reports the following two errors:
>   IP-Config: Failed to open gretap0
>   IP-Config: Failed to open erspan0
> because the device's MAC address is zero.  Fix it by assigning
> a random Ethernet address.
> 
> Signed-off-by: William Tu 

This isn't really a good idea.

If there is a tunnel source address of zero, which is where the device
address comes from, we shouldn't allow the interface to come up
because it is misconfigured.

[PATCH] uapi: fix linux/rxrpc.h userspace compilation errors

2017-11-12 Thread Dmitry V. Levin

Consistently use types provided by <linux/types.h> to fix the following
linux/rxrpc.h userspace compilation errors:

/usr/include/linux/rxrpc.h:24:2: error: unknown type name 'u16'
  u16  srx_service; /* service desired */
/usr/include/linux/rxrpc.h:25:2: error: unknown type name 'u16'
  u16  transport_type; /* type of transport socket (SOCK_DGRAM) */
/usr/include/linux/rxrpc.h:26:2: error: unknown type name 'u16'
  u16  transport_len; /* length of transport address */

Use __kernel_sa_family_t instead of sa_family_t the same way
as uapi/linux/in.h does, to fix the following
linux/rxrpc.h userspace compilation errors:

/usr/include/linux/rxrpc.h:23:2: error: unknown type name 'sa_family_t'
  sa_family_t srx_family; /* address family */
/usr/include/linux/rxrpc.h:28:3: error: unknown type name 'sa_family_t'
  sa_family_t family;  /* transport address family */

Fixes: 727f8914477e ("rxrpc: Expose UAPI definitions to userspace")
Cc:  # v4.14
Signed-off-by: Dmitry V. Levin 
---
 include/uapi/linux/rxrpc.h | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/rxrpc.h b/include/uapi/linux/rxrpc.h
index 9656aad8f8f7..9d4afea308a4 100644
--- a/include/uapi/linux/rxrpc.h
+++ b/include/uapi/linux/rxrpc.h
@@ -20,12 +20,12 @@
  * RxRPC socket address
  */
 struct sockaddr_rxrpc {
-   sa_family_t srx_family; /* address family */
-   u16 srx_service; /* service desired */
-   u16 transport_type; /* type of transport socket (SOCK_DGRAM) */
-   u16 transport_len;  /* length of transport address */
+   __kernel_sa_family_t srx_family; /* address family */
+   __u16   srx_service; /* service desired */
+   __u16   transport_type; /* type of transport socket (SOCK_DGRAM) */
+   __u16   transport_len;  /* length of transport address */
union {
-   sa_family_t family; /* transport address family */
+   __kernel_sa_family_t family;/* transport address family */
struct sockaddr_in sin; /* IPv4 transport address */
struct sockaddr_in6 sin6;   /* IPv6 transport address */
} transport;
-- 
ldv
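
Illustration (not part of the patch): a small userspace sketch of the rule the patch applies, namely that an exported header should only rely on the double-underscore types that the uapi headers themselves provide. struct sketch_sockaddr is a hypothetical, abbreviated struct, not the real sockaddr_rxrpc, and the sketch assumes a Linux system with the kernel uapi headers installed.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/types.h>	/* __u16 */
#include <linux/socket.h>	/* __kernel_sa_family_t */

/* Hypothetical header-style struct using only uapi-provided types. */
struct sketch_sockaddr {
	__kernel_sa_family_t	family;		/* address family */
	__u16			service;	/* service desired */
	__u16			transport_type;	/* type of transport socket */
	__u16			transport_len;	/* length of transport address */
};

/* The fixed-width type must really be two bytes wide. */
static_assert(sizeof(__u16) == 2, "__u16 is a 16-bit type");

static size_t sketch_sockaddr_size(void)
{
	return sizeof(struct sketch_sockaddr);
}
```

Because neither u16 nor sa_family_t is mentioned, the struct compiles without pulling in kernel-internal typedefs or libc socket headers.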

Re: [PATCH v7 net-next 00/13] gtp: Additional feature support - Part I

2017-11-12 Thread Tom Herbert

On Sun, Nov 12, 2017 at 12:56 PM, Harald Welte  wrote:
> Hi Tom,
>
> sorry for the delayed response.  But I remain committed in pushing
> the non-controversial part of your GTP patches forward.
>
> On Sat, Oct 28, 2017 at 06:47:59PM +0200, Harald Welte wrote:
>> Thanks.  As indicated, I'm planning some testing later this weekend on
>> the non-IPv6 patches, and am happy to add my Acked-by and/or re-submit
>> those to Dave after that.
>
> After some more delays and returning from netdev 2.2, I've finally put
> together a testing setup and successfully (manually) tested with the
> following patches:
>
> 01/13 vxlan: Move gro_cells_init to ndo_init
> 02/13 iptunnel: Add common functions to get a tunnel route
> 04/13 gtp: Call common functions to get tunnel routes and add dst_cache
> 05/13 iptunnel: Generalize tunnel update pmtu
> 06/13 gtp: Change to use gro_cells
> 07/13 gtp: Use goto for exceptions in gtp_udp_encap_recv funcs
> 08/13 gtp: udp recv clean up
> 09/13 gtp: Call function to update path mtu
> 10/13 gtp: Eliminate pktinfo and add port configuration
>
The IPv6 related code in patches 4-10 needs to be taken out. It can be
restored once there is support for IPv6.

> I hereby acknowledge those patches.  How should we proceed?  Should I
>
> a) do nothing, you will add Acked-By and re-submit?
>
> b) send an individual Acked-By in a reply to each related patch here on
>netdev and you will re-submit those patches?
>
> c) simply create a rebased set from those patches and
>re-submit them to the list for net-next myself, with the Acked-by?
>
Feel free to do c). I can Ack and test once the patches are ready.

Tom

> d) be preposterous and provide a gtp git tree for DaveM to pull from?
>
> As discussed before, I will not merge/ack IPv6 until we have an
> implementation that is interoperable.  I have a TODO list of other
> bugfixes and improvements for Kernel GTP, but I'm hopeful that IPv6 can
> still be addressed before the end of 2017.
>
> Regards,
> Harald
> --
> - Harald Welte    http://laforge.gnumonks.org/
> 
> "Privacy in residential applications is a desirable marketing option."
>   (ETSI EN 300 175-7 Ch. A6)

Re: [PATCH iproute2 0/2] add batch command support to devlink

2017-11-12 Thread Stephen Hemminger

On Fri, 10 Nov 2017 07:20:12 +0100
Ivan Vecera  wrote:

> This patch series adds support for devlink commands batching. The first
> just removes a requirement to have declared 'resolve_hosts' variable in
> any command that use any function implemented in utils.c (it is really
> confusing to see this declaration in utils like bridge or devlink).
> 
> Ivan Vecera (2):
>   lib: make resolve_hosts variable common
>   devlink: add batch command support
> 
>  bridge/bridge.c|  1 -
>  devlink/devlink.c  | 70 
> +++---
>  genl/genl.c|  1 -
>  ip/ip.c|  1 -
>  ip/rtmon.c |  1 -
>  lib/utils.c|  1 +
>  man/man8/devlink.8 | 16 +
>  misc/arpd.c|  2 --
>  misc/ss.c  |  1 -
>  tc/tc.c|  1 -
>  10 files changed, 79 insertions(+), 16 deletions(-)
> 

Applied

Re: [PATCH iproute2 2/2] devlink: add batch command support

2017-11-12 Thread Stephen Hemminger

On Fri, 10 Nov 2017 21:47:35 +0200
Leon Romanovsky  wrote:

> On Fri, Nov 10, 2017 at 08:10:43AM +0100, Ivan Vecera wrote:
> > On 10.11.2017 07:57, Leon Romanovsky wrote:  
> > > On Fri, Nov 10, 2017 at 07:20:14AM +0100, Ivan Vecera wrote:  
> > >> The patch adds support to batch devlink commands.
> > >>
> > >> Cc: Jiri Pirko 
> > >> Cc: Arkadi Sharshevsky 
> > >> Signed-off-by: Ivan Vecera 
> > >> ---
> > >>  devlink/devlink.c  | 70 
> > >> +++---
> > >>  man/man8/devlink.8 | 16 +
> > >>  2 files changed, 78 insertions(+), 8 deletions(-)
> > >>  
> > >
> > > <..>
> > >  
> > >> diff --git a/man/man8/devlink.8 b/man/man8/devlink.8
> > >> index a480766c..a975ef34 100644
> > >> --- a/man/man8/devlink.8
> > >> +++ b/man/man8/devlink.8
> > >> @@ -12,6 +12,12 @@ devlink \- Devlink tool
> > >>  .sp
> > >>
> > >>  .ti -8
> > >> +.B devlink
> > >> +.RB "[ " -force " ] "
> > >> +.BI "-batch " filename
> > >> +.sp
> > >> +
> > >> +.ti -8
> > >>  .IR OBJECT " := { "
> > >>  .BR dev " | " port " | " monitor " }"
> > >>  .sp
> > >> @@ -32,6 +38,16 @@ Print the version of the
> > >>  utility and exit.
> > >>
> > >>  .TP
> > >> +.BR "\-b", " \-batch " 
> > >> +Read commands from provided file or standard input and invoke them.
> > >> +First failure will cause termination of devlink.  
> > >
> > > It is worth documenting the expected format of that file.
> > > And IMHO, it is better to have the ability to load a JSON file which was
> > > generated by -j, instead of declaring new format/knob.  
> > It's just a list of command-lines... like other utils (bridge,ip...)  
> 
> I'm implementing a similar thing in RDMAtool (part of iproute2) and chose the
> JSON approach; it is more user- and script-friendly.
> 

If you want to do batch from rdmatool then it must take a list of commands by
default. An additional option to take JSON input, "rdmatool -j --batch...",
would be good as well.




Re: Per-CPU Queueing for QoS

2017-11-12 Thread Stephen Hemminger

On Sun, 12 Nov 2017 13:43:13 -0800
Michael Ma  wrote:

> Any comments? We plan to implement this as a qdisc and appreciate any early 
> feedback.
> 
> Thanks,
> Michael
> 
> > On Nov 9, 2017, at 5:20 PM, Michael Ma  wrote:
> > 
> > Currently txq/qdisc selection is based on flow hash so packets from
> > the same flow will follow the order when they enter qdisc/txq, which
> > avoids out-of-order problem.
> > 
> > To improve the concurrency of QoS algorithm we plan to have multiple
> > per-cpu queues for a single TC class and do busy polling from a
> > per-class thread to drain these queues. If we can do this frequently
> > enough the out-of-order situation in this polling thread should not be
> > that bad.
> > 
> > To give more details - in the send path we introduce per-cpu per-class
> > queues so that packets from the same class and same core will be
> > enqueued to the same place. Then a per-class thread polls the queues
> > belonging to its class from all the cpus and aggregates them into
> > another per-class queue. This can effectively reduce contention but
> > inevitably introduces a potential out-of-order issue.
> > 
> > Any concern/suggestion for working towards this direction?  

In general, there are no meta design discussions in Linux development.
Several developers have tried to do lockless qdisc and similar things
in the past.

The devil is in the details, show us the code.
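
Illustration (not part of the thread): a single-threaded toy model of the scheme described in the quoted proposal, written only to make the ordering concern concrete. Every name here is hypothetical, there is no locking or real per-CPU data, and it is not the proposed qdisc. Packets are represented by their arrival sequence numbers.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_CPUS	4
#define SKETCH_QLEN	8

/* One FIFO per CPU for a single class; entries are sequence numbers. */
struct percpu_fifo {
	int pkts[SKETCH_QLEN];
	size_t head;	/* next slot to dequeue */
	size_t tail;	/* next slot to enqueue */
};

struct class_queues {
	struct percpu_fifo cpu[SKETCH_NR_CPUS];
};

/* Send path: a packet only touches the queue of the CPU it was sent on. */
static int class_enqueue(struct class_queues *cq, unsigned int cpu, int seq)
{
	struct percpu_fifo *q;

	assert(cpu < SKETCH_NR_CPUS);
	q = &cq->cpu[cpu];
	if (q->tail == SKETCH_QLEN)
		return -1;	/* full; this sketch never wraps around */
	q->pkts[q->tail++] = seq;
	return 0;
}

/*
 * Per-class poller: visit the CPUs in index order and move everything
 * queued so far into one aggregate array. Returns the packet count.
 */
static size_t class_drain(struct class_queues *cq, int *out, size_t out_len)
{
	size_t n = 0;
	unsigned int cpu;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		struct percpu_fifo *q = &cq->cpu[cpu];

		while (q->head < q->tail && n < out_len)
			out[n++] = q->pkts[q->head++];
	}
	return n;
}
```

If sequence 0 is enqueued on CPU 1 and sequence 1 on CPU 0 before the poller runs, the drain emits 1 before 0: ordering is preserved per CPU but not across CPUs, which is the out-of-order risk the proposal acknowledges.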

Re: [PATCH 0/7] net: core: devname allocation cleanups

2017-11-12 Thread Stephen Hemminger

On Mon, 13 Nov 2017 00:15:03 +0100
Rasmus Villemoes  wrote:

> It's somewhat confusing to have both dev_alloc_name and
> dev_get_valid_name. I can't see why the former is less strict than the
> latter, so make them (or rather dev_alloc_name_ns and
> dev_get_valid_name) equivalent, hardening dev_alloc_name() a little.
> 
> Obvious follow-up patches would be to only export one function, and
> make dev_alloc_name a static inline wrapper for that (whichever name
> is chosen for the exported interface). But maybe there is a good
> reason the two exported interfaces do different checking, so I'll
> refrain from including the trivial but tree-wide renaming in this
> series.
> 
> Rasmus Villemoes (7):
>   net: core: improve sanity checking in __dev_alloc_name
>   net: core: move dev_alloc_name_ns a little higher
>   net: core: eliminate dev_alloc_name{,_ns} code duplication
>   net: core: drop pointless check in __dev_alloc_name
>   net: core: check dev_valid_name in __dev_alloc_name
>   net: core: maybe return -EEXIST in __dev_alloc_name
>   net: core: dev_get_valid_name is now the same as dev_alloc_name_ns
> 
>  net/core/dev.c | 62 
> +-
>  1 file changed, 22 insertions(+), 40 deletions(-)
> 

Looks good to me. Can't see anything obviously wrong with this.
I think the two functions started out heading in different directions.

Re: [PATCH 6/7] net: core: maybe return -EEXIST in __dev_alloc_name

2017-11-12 Thread Stephen Hemminger

On Mon, 13 Nov 2017 00:15:09 +0100
Rasmus Villemoes  wrote:

> If we're given format string with no %d, -EEXIST is a saner error code.
> 
> Signed-off-by: Rasmus Villemoes 
> ---
>  net/core/dev.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index c0a92cf27566..7c08b4ca7b76 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1104,7 +1104,7 @@ static int __dev_alloc_name(struct net *net, const char *name, char *buf)
>* when the name is long and there isn't enough space left
>* for the digits, or if all bits are used.
>*/
> - return -ENFILE;
> + return p ? -ENFILE : -EEXIST;
>  }
>  
>  static int dev_alloc_name_ns(struct net *net,

This is potentially a change to user ABI with no real advantage.
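
Illustration (not part of the thread): a simplified userspace sketch of the distinction the quoted one-line change draws. A name with no "%d" that is already taken is reported as -EEXIST, while a "%d" pattern that runs out of usable indices is reported as -ENFILE. alloc_name_sketch, the table of taken names, and the tiny index limit are all hypothetical; the real __dev_alloc_name() works differently in detail.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_NAMSIZ	16
#define SKETCH_MAX_IDX	4	/* tiny index space for the sketch */

/* Hypothetical table of names already in use. */
static const char *const taken[] = { "eth0", "eth1", "lo" };

static int name_taken(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(taken) / sizeof(taken[0]); i++)
		if (strcmp(taken[i], name) == 0)
			return 1;
	return 0;
}

/*
 * "fmt" is either a literal name or a prefix followed by "%d".
 * Returns the chosen index (0 for a literal name) or a negative errno.
 */
static int alloc_name_sketch(const char *fmt, char *buf)
{
	const char *p;
	int i;

	assert(fmt && buf);
	p = strstr(fmt, "%d");

	if (!p) {
		if (strlen(fmt) >= SKETCH_NAMSIZ)
			return -EINVAL;
		/* Nothing to enumerate, so a clash means "already exists". */
		if (name_taken(fmt))
			return -EEXIST;
		strcpy(buf, fmt);
		return 0;
	}

	for (i = 0; i < SKETCH_MAX_IDX; i++) {
		int len = snprintf(buf, SKETCH_NAMSIZ, "%.*s%d",
				   (int)(p - fmt), fmt, i);

		if (len >= SKETCH_NAMSIZ)
			break;		/* no room left for the digits */
		if (!name_taken(buf))
			return i;
	}
	/* There was a "%d", but no usable index could be found. */
	return -ENFILE;
}
```

The objection above still applies to any such change: callers that today match on -ENFILE for a taken literal name would see a different errno afterwards.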

Re: [PATCH iproute2 1/1] tc: distinguish Add/Replace qdisc operations

2017-11-12 Thread Stephen Hemminger

On Thu, 26 Oct 2017 17:30:08 -0400
Roman Mashak  wrote:

> Signed-off-by: Roman Mashak 
> ---
>  tc/tc_qdisc.c | 10 ++
>  1 file changed, 10 insertions(+)

Applied to 4.14

Re: [PATCH iproute2] man: Clarify idleslope calculation for tc-cbs

2017-11-12 Thread Stephen Hemminger

On Fri, 10 Nov 2017 14:34:36 -0800
Jesus Sanchez-Palencia  wrote:

> In order to calculate the idleSlope parameter of CBS correctly, users
> must take into account the entire packet size, including the overhead
> from all layers.
> 
> Add some more details to the man page to clarify that, giving one
> simple example and pointing users to the correct 802.1Q section for
> further clarifications if needed.
> 
> Signed-off-by: Jesus Sanchez-Palencia 
> ---
>  man/man8/tc-cbs.8 | 14 +-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/man/man8/tc-cbs.8 b/man/man8/tc-cbs.8
> index 97e00c84..32e1e0d4 100644
> --- a/man/man8/tc-cbs.8
> +++ b/man/man8/tc-cbs.8
> @@ -43,7 +43,19 @@ second) when there is at least one packet waiting for transmission.
>  Packets are transmitted when the current value of credits is equal or
>  greater than zero. When there is no packet to be transmitted the
>  amount of credits is set to zero. This is the main tunable of the CBS
> -algorithm.
> +algorithm and represents the bandwidth that will be consumed.
> +Note that when calculating idleslope, the entire packet size must be
> +considered, including headers from all layers (i.e. MAC framing and any
> +overhead from the physical layer), as described by IEEE 802.1Q-2014
> +section 34.4.
> +
> +As an example, for an ethernet frame carrying 284 bytes of payload,
> +and with no VLAN tags, you must add 14 bytes for the Ethernet headers,
> +4 bytes for the Frame check sequence (CRC), and 20 bytes for the L1
> +overhead: 12 bytes of interpacket gap, 7 bytes of preamble and 1 byte
> +of start of frame delimiter. That results in 322 bytes for the total
> +packet size, which is then used for calculating the idleslope.
> +
>  .TP
>  sendslope
>  Sendslope is the rate of credits that is depleted (it should be a

Applied to net-next
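
Illustration (not part of the thread): the arithmetic from the quoted man page text, written out as a sketch. The per-layer byte counts are the ones listed in the patch; the packet rate passed to idleslope_kbit() is an assumed input, since the quoted text does not give one, and the kbit/s unit is assumed to match what tc-cbs expects for idleslope.

```c
#include <assert.h>
#include <stdint.h>

#define SK_ETH_HDR_BYTES	14	/* Ethernet headers */
#define SK_ETH_FCS_BYTES	4	/* frame check sequence (CRC) */
#define SK_L1_IPG_BYTES		12	/* interpacket gap */
#define SK_L1_PREAMBLE_BYTES	7	/* preamble */
#define SK_L1_SFD_BYTES		1	/* start of frame delimiter */

/* On-wire size of an untagged Ethernet frame carrying "payload" bytes. */
static uint32_t wire_bytes(uint32_t payload)
{
	return payload + SK_ETH_HDR_BYTES + SK_ETH_FCS_BYTES +
	       SK_L1_IPG_BYTES + SK_L1_PREAMBLE_BYTES + SK_L1_SFD_BYTES;
}

/*
 * Bandwidth in kbit/s consumed by "pps" such frames per second.
 * The packet rate is a caller-supplied assumption.
 */
static uint64_t idleslope_kbit(uint32_t payload, uint32_t pps)
{
	assert(pps > 0);
	return (uint64_t)wire_bytes(payload) * 8 * pps / 1000;
}
```

For the 284-byte payload in the example, wire_bytes(284) gives the 322 bytes stated in the text; at a hypothetical 8000 packets per second that corresponds to 322 * 8 * 8000 / 1000 = 20608 kbit/s.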

[RFC v2 6/6] bpf: add new test test_many_kprobe

2017-11-12 Thread Song Liu

The test compares old text based kprobe API with PERF_TYPE_PROBE.

Here is a sample output of this test:

Creating 1000 kprobes with text-based API takes 6.979683 seconds
Cleaning 1000 kprobes with text-based API takes 84.897687 seconds
Creating 1000 kprobes with PERF_TYPE_PROBE (function name) takes 5.077558 seconds
Cleaning 1000 kprobes with PERF_TYPE_PROBE (function name) takes 81.241354 seconds
Creating 1000 kprobes with PERF_TYPE_PROBE (function addr) takes 5.218255 seconds
Cleaning 1000 kprobes with PERF_TYPE_PROBE (function addr) takes 80.010731 seconds

Signed-off-by: Song Liu 
Reviewed-by: Josef Bacik 
---
 samples/bpf/Makefile|   3 +
 samples/bpf/bpf_load.c  |   5 +-
 samples/bpf/bpf_load.h  |   4 +
 samples/bpf/test_many_kprobe_user.c | 184 
 4 files changed, 193 insertions(+), 3 deletions(-)
 create mode 100644 samples/bpf/test_many_kprobe_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 3b4945c..1b729d6 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -44,6 +44,7 @@ hostprogs-y += xdp_redirect_map
 hostprogs-y += xdp_redirect_cpu
 hostprogs-y += xdp_monitor
 hostprogs-y += syscall_tp
+hostprogs-y += test_many_kprobe
 
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o
@@ -92,6 +93,7 @@ xdp_redirect_map-objs := bpf_load.o $(LIBBPF) xdp_redirect_map_user.o
 xdp_redirect_cpu-objs := bpf_load.o $(LIBBPF) xdp_redirect_cpu_user.o
 xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o
 syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
+test_many_kprobe-objs := bpf_load.o $(LIBBPF) test_many_kprobe_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -182,6 +184,7 @@ HOSTLOADLIBES_xdp_redirect_map += -lelf
 HOSTLOADLIBES_xdp_redirect_cpu += -lelf
 HOSTLOADLIBES_xdp_monitor += -lelf
 HOSTLOADLIBES_syscall_tp += -lelf
+HOSTLOADLIBES_test_many_kprobe += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index a47cb1c..590e6f0 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -639,9 +639,8 @@ void read_trace_pipe(void)
}
 }
 
-#define MAX_SYMS 30
-static struct ksym syms[MAX_SYMS];
-static int sym_cnt;
+struct ksym syms[MAX_SYMS];
+int sym_cnt;
 
 static int ksym_cmp(const void *p1, const void *p2)
 {
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index e7a8a21..16bc263 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -67,6 +67,10 @@ static inline __u64 ptr_to_u64(const void *ptr)
return (__u64) (unsigned long) ptr;
 }
 
+#define MAX_SYMS 30
+extern struct ksym syms[MAX_SYMS];
+extern int sym_cnt;
+
 int load_kallsyms(void);
 struct ksym *ksym_search(long key);
 int set_link_xdp_fd(int ifindex, int fd, __u32 flags);
diff --git a/samples/bpf/test_many_kprobe_user.c 
b/samples/bpf/test_many_kprobe_user.c
new file mode 100644
index 0000000..70b680e
--- /dev/null
+++ b/samples/bpf/test_many_kprobe_user.c
@@ -0,0 +1,184 @@
+/* Copyright (c) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "libbpf.h"
+#include "bpf_load.h"
+#include "perf-sys.h"
+
+#define MAX_KPROBES 1000
+
+#define DEBUGFS "/sys/kernel/debug/tracing/"
+
+int kprobes[MAX_KPROBES] = {0};
+int kprobe_count;
+int perf_event_fds[MAX_KPROBES];
+const char license[] = "GPL";
+
+static __u64 time_get_ns(void)
+{
+   struct timespec ts;
+
+   clock_gettime(CLOCK_MONOTONIC, &ts);
+   return ts.tv_sec * 1000000000ull + ts.tv_nsec;
+}
+
+static int kprobe_api(char *func, void *addr, bool use_new_api)
+{
+   int efd;
+   struct perf_event_attr attr = {};
+   struct probe_desc pd;
+   char buf[256];
+   int err, id;
+
+   attr.sample_type = PERF_SAMPLE_RAW;
+   attr.sample_period = 1;
+   attr.wakeup_events = 1;
+
+   if (use_new_api) {
+   attr.type = PERF_TYPE_PROBE;
+   if (func) {
+   pd.func = ptr_to_u64(func);
+   pd.offset = 0;
+   } else {
+   pd.func = 0;
+   pd.offset = ptr_to_u64(addr);
+   }
+
+   attr.probe_desc = ptr_to_u64(&pd);
+   } else {
+   attr.type = PERF_TYPE_TRACEPOINT;
+   snprintf(buf, sizeof(buf),
+"echo 'p:%s %s' >> 
/sys/kernel/debug/tracing/kprobe_events",
+func, func);
+

[RFC v2 0/6] enable creating [k,u]probe with perf_event_open

2017-11-12 Thread Song Liu

Changes v1 to v2:
  Fix build issue reported by kbuild test bot by adding ifdef of
  CONFIG_KPROBE_EVENTS, and CONFIG_UPROBE_EVENTS.

v1 cover letter:

This is to follow up the discussion over "new kprobe api" at Linux
Plumbers 2017:

https://www.linuxplumbersconf.org/2017/ocw/proposals/4808

With the current kernel, user space tools can only create/destroy [k,u]probes
with a text-based API (kprobe_events and uprobe_events in tracefs). This
approach relies on user space to clean up the [k,u]probes after using them.
However, it is not easy for user space to clean up properly.

To solve this problem, we introduce a file descriptor based API.
Specifically, we extended perf_event_open to create a [k,u]probe and attach
it to the file descriptor created by perf_event_open. These [k,u]probes
are associated with this file descriptor, so they are not available in
tracefs.
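
As a rough illustration of the descriptor side of this API, the sketch
below fills in a probe description for a kprobe, either by symbol name or
by raw address. The struct is only a local mirror of the probe_desc layout
proposed in this series, and probe_desc_sketch, describe_kprobe_by_name()
and describe_kprobe_by_addr() are hypothetical names used for illustration;
none of them come from a released kernel or library header.

```c
#include <stdint.h>
#include <string.h>

/*
 * Local mirror of the struct probe_desc proposed in this RFC.
 * Assumption: this is an illustration only, not a released UAPI type.
 */
struct probe_desc_sketch {
	union {
		uint64_t func;   /* kprobe: pointer to the function name */
		uint64_t path;   /* uprobe: pointer to the binary path */
	};
	union {
		uint64_t addr;   /* kprobe: raw address, used when func is 0 */
		uint64_t offset; /* offset relative to func or path */
	};
};

/* Same idea as the ptr_to_u64() helper added to bpf_load.h in patch 5/6. */
static inline uint64_t ptr_to_u64(const void *ptr)
{
	return (uint64_t)(uintptr_t)ptr;
}

/* Hypothetical helper: describe a kprobe by symbol name plus offset. */
static struct probe_desc_sketch
describe_kprobe_by_name(const char *func, uint64_t offset)
{
	struct probe_desc_sketch pd;

	memset(&pd, 0, sizeof(pd));
	pd.func = ptr_to_u64(func);
	pd.offset = offset;
	return pd;
}

/* Hypothetical helper: describe a kprobe by raw address (func left at 0). */
static struct probe_desc_sketch describe_kprobe_by_addr(uint64_t addr)
{
	struct probe_desc_sketch pd;

	memset(&pd, 0, sizeof(pd));
	pd.func = 0;
	pd.addr = addr;
	return pd;
}
```

With the series applied, a caller would then store a pointer to such a
descriptor in perf_event_attr.probe_desc and set type to PERF_TYPE_PROBE
before calling perf_event_open().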

We reuse a large portion of the existing trace_kprobe and trace_uprobe code.
Currently, the file descriptor API does not support arguments as the
text-based API does. This should not be a problem, as users of the file
descriptor based API read data through other methods (bpf, etc.).

I also include a patch to bcc, and a patch to the perf_event_open man page.
Please see the list below. A fork of bcc with this patch is also available
on github:

  https://github.com/liu-song-6/bcc/tree/new_perf_event_opn

Thanks,
Song

man-pages patch:
  perf_event_open.2: add new type PERF_TYPE_PROBE

bcc patch:
  bcc: Try use new API to create [k,u]probe with perf_event_open

kernel patches:

Song Liu (6):
  perf: Add new type PERF_TYPE_PROBE
  perf: copy new perf_event.h to tools/include/uapi
  perf: implement kprobe support to PERF_TYPE_PROBE
  perf: implement uprobe support to PERF_TYPE_PROBE
  bpf: add option for bpf_load.c to use PERF_TYPE_PROBE
  bpf: add new test test_many_kprobe

 include/linux/trace_events.h  |   2 +
 include/uapi/linux/perf_event.h   |  35 ++-
 kernel/events/core.c  |  39 ++-
 kernel/trace/trace_event_perf.c   | 127 +++
 kernel/trace/trace_kprobe.c   |  91 +++--
 kernel/trace/trace_probe.h|  11 ++
 kernel/trace/trace_uprobe.c   |  90 +++--
 samples/bpf/Makefile  |   3 +
 samples/bpf/bpf_load.c|  61 ++-
 samples/bpf/bpf_load.h|  12 +++
 samples/bpf/test_many_kprobe_user.c   | 184 ++
 tools/include/uapi/linux/perf_event.h |  35 ++-
 12 files changed, 643 insertions(+), 47 deletions(-)
 create mode 100644 samples/bpf/test_many_kprobe_user.c

--
2.9.5

[RFC v2 4/6] perf: implement uprobe support to PERF_TYPE_PROBE

2017-11-12 Thread Song Liu

This patch adds uprobe support to perf_probe with a similar pattern
as the previous patch (for kprobe).

Two functions, create_local_trace_uprobe() and
destroy_local_trace_uprobe(), are created so a uprobe can be created
and attached to the file descriptor created by perf_event_open().

Signed-off-by: Song Liu 
Reviewed-by: Yonghong Song 
Reviewed-by: Josef Bacik 
---
 kernel/trace/trace_event_perf.c | 48 +-
 kernel/trace/trace_probe.h  |  4 ++
 kernel/trace/trace_uprobe.c | 90 -
 3 files changed, 131 insertions(+), 11 deletions(-)

diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index bf9b99b..4e4de84 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -256,6 +256,39 @@ static int perf_probe_create_kprobe(struct perf_event 
*p_event,
 }
 #endif /* CONFIG_KPROBE_EVENTS */
 
+#ifdef CONFIG_UPROBE_EVENTS
+static int perf_probe_create_uprobe(struct perf_event *p_event,
+   struct probe_desc *pd, char *name)
+{
+   struct trace_event_call *tp_event;
+   int ret;
+
+   if (!name)
+   return -EINVAL;
+   tp_event = create_local_trace_uprobe(
+   name, pd->offset, p_event->attr.is_return);
+   if (IS_ERR(tp_event))
+   return PTR_ERR(tp_event);
+   /*
+* local trace_uprobe need to hold event_mutex to call
+* uprobe_buffer_enable() and uprobe_buffer_disable().
+* event_mutex is not required for local trace_kprobes.
+*/
+   mutex_lock(&event_mutex);
+   ret = perf_trace_event_init(tp_event, p_event);
+   if (ret)
+   destroy_local_trace_uprobe(tp_event);
+   mutex_unlock(&event_mutex);
+   return ret;
+}
+#else
+static int perf_probe_create_uprobe(struct perf_event *p_event,
+   struct probe_desc *pd, char *name)
+{
+   return -EOPNOTSUPP;
+}
+#endif /* CONFIG_UPROBE_EVENTS */
+
 int perf_probe_init(struct perf_event *p_event)
 {
struct probe_desc pd;
@@ -292,7 +325,7 @@ int perf_probe_init(struct perf_event *p_event)
if (!p_event->attr.is_uprobe)
ret = perf_probe_create_kprobe(p_event, &pd, name);
else
-   ret = -EOPNOTSUPP;
+   ret = perf_probe_create_uprobe(p_event, &pd, name);
 out:
kfree(name);
return ret;
@@ -308,13 +341,26 @@ void perf_trace_destroy(struct perf_event *p_event)
 
 void perf_probe_destroy(struct perf_event *p_event)
 {
+   /*
+* local trace_uprobe need to hold event_mutex to call
+* uprobe_buffer_enable() and uprobe_buffer_disable().
+* event_mutex is not required for local trace_kprobes.
+*/
+   if (p_event->attr.is_uprobe)
+   mutex_lock(&event_mutex);
perf_trace_event_close(p_event);
perf_trace_event_unreg(p_event);
+   if (p_event->attr.is_uprobe)
+   mutex_unlock(&event_mutex);
 
if (!p_event->attr.is_uprobe) {
 #ifdef CONFIG_KPROBE_EVENTS
destroy_local_trace_kprobe(p_event->tp_event);
 #endif
+   } else {
+#ifdef CONFIG_UPROBE_EVENTS
+   destroy_local_trace_uprobe(p_event->tp_event);
+#endif
}
 }
 
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 910ae1b..86b5925 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -417,4 +417,8 @@ extern struct trace_event_call *
 create_local_trace_kprobe(char *func, void *addr, unsigned long offs,
  bool is_return);
 extern void destroy_local_trace_kprobe(struct trace_event_call *event_call);
+
+extern struct trace_event_call *
+create_local_trace_uprobe(char *name, unsigned long offs, bool is_return);
+extern void destroy_local_trace_uprobe(struct trace_event_call *event_call);
 #endif
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 153c0e4..1aa82be 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -31,8 +31,8 @@
 #define UPROBE_EVENT_SYSTEM "uprobes"
 
 struct uprobe_trace_entry_head {
-   struct trace_entry  ent;
-   unsigned long   vaddr[];
+   struct trace_entry  ent;
+   unsigned long   vaddr[];
 };
 
 #define SIZEOF_TRACE_ENTRY(is_return)  \
@@ -1292,16 +1292,25 @@ static struct trace_event_functions uprobe_funcs = {
.trace  = print_uprobe_event
 };
 
-static int register_uprobe_event(struct trace_uprobe *tu)
+static inline void init_trace_event_call(struct trace_uprobe *tu,
+struct trace_event_call *call)
 {
-   struct trace_event_call *call = &tu->tp.call;
-   int ret;
-
-   /* Initialize trace_event_call */
INIT_LIST_HEAD(&call->class->fields);
call->event.funcs = &uprobe_funcs;
call->class->define_fields = uprobe_event_define_fields;
 
+

[RFC] bcc: Try use new API to create [k,u]probe with perf_event_open

2017-11-12 Thread Song Liu

A new kernel API allows creating [k,u]probes with perf_event_open.
This patch tries to use the new API. If the new API doesn't work,
we fall back to the old API.

bpf_detach_probe() looks up the event being removed. If the event
is not found, we skip the clean up procedure.
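
The try-then-fall-back flow described above can be sketched roughly as
follows. This is only an outline under stated assumptions: the function
pointer type open_probe_fn, open_probe_with_fallback() and the stub_*
functions are hypothetical stand-ins for the real perf_event_open and
tracefs paths, and the used_old flag is one possible way to remember
whether any text-based event exists to be cleaned up at detach time.

```c
/* Hypothetical signature for "open a probe on fn_name, return fd or < 0". */
typedef int (*open_probe_fn)(const char *fn_name);

/*
 * Try the new fd-based API first. Only when it fails do we go through
 * the old text-based path. *used_old records which path produced the fd,
 * so a later detach can skip tracefs cleanup when nothing was written.
 */
static int open_probe_with_fallback(const char *fn_name,
				    open_probe_fn try_new,
				    open_probe_fn use_old,
				    int *used_old)
{
	int pfd = try_new(fn_name);

	if (pfd >= 0) {
		*used_old = 0;
		return pfd;
	}

	*used_old = 1;
	return use_old(fn_name);
}

/* Stand-ins for illustration: a kernel without the new API ... */
static int stub_new_api_unsupported(const char *fn_name)
{
	(void)fn_name;
	return -1;
}

/* ... a kernel with the new API ... */
static int stub_new_api_ok(const char *fn_name)
{
	(void)fn_name;
	return 7;
}

/* ... and the old text-based path, which always works in this sketch. */
static int stub_old_api_ok(const char *fn_name)
{
	(void)fn_name;
	return 42;
}
```

In the patch itself the same decision is carried by the pfd value passed
into bpf_attach_tracing_event(): a negative pfd means the old path still
has to look up the event id and call perf_event_open.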

Signed-off-by: Song Liu 
---
 src/cc/libbpf.c | 224 +++-
 1 file changed, 155 insertions(+), 69 deletions(-)

diff --git a/src/cc/libbpf.c b/src/cc/libbpf.c
index 77413df..d7be0a9 100644
--- a/src/cc/libbpf.c
+++ b/src/cc/libbpf.c
@@ -520,38 +520,66 @@ int bpf_attach_socket(int sock, int prog) {
   return setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog, sizeof(prog));
 }
 
+/*
+ * new kernel API allows creating [k,u]probe with perf_event_open, which
+ * makes it easier to clean up the [k,u]probe. This function tries to
+ * create pfd with the new API.
+ */
+static int bpf_try_perf_event_open_with_probe(struct probe_desc *pd, int pid,
+int cpu, int group_fd, int is_uprobe, int is_return)
+{
+  struct perf_event_attr attr = {};
+
+  attr.type = PERF_TYPE_PROBE;
+  attr.probe_desc = ptr_to_u64(pd);
+  attr.sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_CALLCHAIN;
+  attr.sample_period = 1;
+  attr.wakeup_events = 1;
+  attr.is_uprobe = is_uprobe;
+  attr.is_return = is_return;
+  return syscall(__NR_perf_event_open, &attr, pid, cpu, group_fd,
+ PERF_FLAG_FD_CLOEXEC);
+}
+
 static int bpf_attach_tracing_event(int progfd, const char *event_path,
-struct perf_reader *reader, int pid, int cpu, int group_fd) {
-  int efd, pfd;
+struct perf_reader *reader, int pid, int cpu, int group_fd, int pfd) {
+  int efd;
   ssize_t bytes;
   char buf[256];
   struct perf_event_attr attr = {};
 
-  snprintf(buf, sizeof(buf), "%s/id", event_path);
-  efd = open(buf, O_RDONLY, 0);
-  if (efd < 0) {
-fprintf(stderr, "open(%s): %s\n", buf, strerror(errno));
-return -1;
-  }
+  /*
+   * Only look up id and call perf_event_open when
+   * bpf_try_perf_event_open_with_probe() didn't return a valid pfd.
+   */
+  if (pfd < 0) {
+snprintf(buf, sizeof(buf), "%s/id", event_path);
+efd = open(buf, O_RDONLY, 0);
+if (efd < 0) {
+  fprintf(stderr, "open(%s): %s\n", buf, strerror(errno));
+  return -1;
+}
 
-  bytes = read(efd, buf, sizeof(buf));
-  if (bytes <= 0 || bytes >= sizeof(buf)) {
-fprintf(stderr, "read(%s): %s\n", buf, strerror(errno));
+bytes = read(efd, buf, sizeof(buf));
+if (bytes <= 0 || bytes >= sizeof(buf)) {
+  fprintf(stderr, "read(%s): %s\n", buf, strerror(errno));
+  close(efd);
+  return -1;
+}
 close(efd);
-return -1;
-  }
-  close(efd);
-  buf[bytes] = '\0';
-  attr.config = strtol(buf, NULL, 0);
-  attr.type = PERF_TYPE_TRACEPOINT;
-  attr.sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_CALLCHAIN;
-  attr.sample_period = 1;
-  attr.wakeup_events = 1;
-  pfd = syscall(__NR_perf_event_open, &attr, pid, cpu, group_fd, 
PERF_FLAG_FD_CLOEXEC);
-  if (pfd < 0) {
-fprintf(stderr, "perf_event_open(%s/id): %s\n", event_path, 
strerror(errno));
-return -1;
+buf[bytes] = '\0';
+attr.config = strtol(buf, NULL, 0);
+attr.type = PERF_TYPE_TRACEPOINT;
+attr.sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_CALLCHAIN;
+attr.sample_period = 1;
+attr.wakeup_events = 1;
+pfd = syscall(__NR_perf_event_open, &attr, pid, cpu, group_fd, 
PERF_FLAG_FD_CLOEXEC);
+if (pfd < 0) {
+  fprintf(stderr, "perf_event_open(%s/id): %s\n", event_path, 
strerror(errno));
+  return -1;
+}
   }
+
   perf_reader_set_fd(reader, pfd);
 
   if (perf_reader_mmap(reader, attr.type, attr.sample_type) < 0)
@@ -579,31 +607,41 @@ void * bpf_attach_kprobe(int progfd, enum 
bpf_probe_attach_type attach_type, con
   char event_alias[128];
   struct perf_reader *reader = NULL;
   static char *event_type = "kprobe";
+  struct probe_desc pd;
+  int pfd;
 
   reader = perf_reader_new(cb, NULL, NULL, cb_cookie, 
probe_perf_reader_page_cnt);
   if (!reader)
 goto error;
 
-  snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/%s_events", 
event_type);
-  kfd = open(buf, O_WRONLY | O_APPEND, 0);
-  if (kfd < 0) {
-fprintf(stderr, "open(%s): %s\n", buf, strerror(errno));
-goto error;
-  }
+  /* try use new API to create kprobe */
+  pd.func = ptr_to_u64((void *)fn_name);
+  pd.offset = 0;
+  pfd = bpf_try_perf_event_open_with_probe(&pd, pid, cpu, group_fd, 0,
+   attach_type != BPF_PROBE_ENTRY);
 
-  snprintf(event_alias, sizeof(event_alias), "%s_bcc_%d", ev_name, getpid());
-  snprintf(buf, sizeof(buf), "%c:%ss/%s %s", attach_type==BPF_PROBE_ENTRY ? 
'p' : 'r',
-   event_type, event_alias, fn_name);
-  if (write(kfd, buf, strlen(buf)) < 0) {
-if (errno == EINVAL)
-  fprintf(stderr, "check dmesg output for possible cause\n");
+  if (pfd < 0) {
+snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/%s_events", 
event_type);
+kfd =

[RFC v2 2/6] perf: copy new perf_event.h to tools/include/uapi

2017-11-12 Thread Song Liu

perf_event.h was updated in the previous patch; this patch applies the same
changes to the tools/ version. This part is put in a separate patch in
case the two files are backported separately.

Signed-off-by: Song Liu 
Reviewed-by: Yonghong Song 
Reviewed-by: Josef Bacik 
Acked-by: Alexei Starovoitov 
---
 tools/include/uapi/linux/perf_event.h | 35 +--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/tools/include/uapi/linux/perf_event.h 
b/tools/include/uapi/linux/perf_event.h
index 362493a..cc42d59 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -33,6 +33,7 @@ enum perf_type_id {
PERF_TYPE_HW_CACHE  = 3,
PERF_TYPE_RAW   = 4,
PERF_TYPE_BREAKPOINT= 5,
+   PERF_TYPE_PROBE = 6,
 
PERF_TYPE_MAX,  /* non-ABI */
 };
@@ -299,6 +300,29 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER4 104 /* add: sample_regs_intr */
 #define PERF_ATTR_SIZE_VER5 112 /* add: aux_watermark */
 
+#define MAX_PROBE_FUNC_NAME_LEN 64
+/*
+ * Describe a kprobe or uprobe for PERF_TYPE_PROBE.
+ * perf_event_attr.probe_desc will point to this structure. is_uprobe
+ * and is_return are used to differentiate different types of probe
+ * (k/u, probe/retprobe).
+ *
+ * The two unions should be used as follows:
+ * For uprobe: use path and offset;
+ * For kprobe: if func is empty, use addr
+ * if func is not empty, use func and offset
+ */
+struct probe_desc {
+   union {
+   __aligned_u64   func;
+   __aligned_u64   path;
+   };
+   union {
+   __aligned_u64   addr;
+   __u64   offset;
+   };
+};
+
 /*
  * Hardware event_id to monitor via a performance monitoring event:
  *
@@ -320,7 +344,10 @@ struct perf_event_attr {
/*
 * Type specific configuration information.
 */
-   __u64   config;
+   union {
+   __u64   config;
+   __u64   probe_desc; /* ptr to struct probe_desc */
+   };
 
union {
__u64   sample_period;
@@ -370,7 +397,11 @@ struct perf_event_attr {
context_switch :  1, /* context switch data */
write_backward :  1, /* Write ring buffer from 
end to beginning */
namespaces :  1, /* include namespaces data 
*/
-   __reserved_1   : 35;
+
+   /* For PERF_TYPE_PROBE */
+   is_uprobe  :  1, /* 0: kprobe, 1: uprobe */
+   is_return  :  1, /* 0: probe, 1: retprobe */
+   __reserved_1   : 33;
 
union {
__u32   wakeup_events;/* wakeup every n events */
-- 
2.9.5

[RFC v2 5/6] bpf: add option for bpf_load.c to use PERF_TYPE_PROBE

2017-11-12 Thread Song Liu

Function load_and_attach() is updated to be able to create kprobes
with either the old text-based API or the new PERF_TYPE_PROBE API.

A global flag use_perf_type_probe is added to select between the
two APIs.

Signed-off-by: Song Liu 
Reviewed-by: Josef Bacik 
---
 samples/bpf/bpf_load.c | 56 --
 samples/bpf/bpf_load.h |  8 
 2 files changed, 44 insertions(+), 20 deletions(-)

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 522ca92..a47cb1c 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -8,7 +8,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -42,6 +41,7 @@ int prog_array_fd = -1;
 
 struct bpf_map_data map_data[MAX_MAPS];
 int map_data_count = 0;
+bool use_perf_type_probe = true;
 
 static int populate_prog_array(const char *event, int prog_fd)
 {
@@ -70,8 +70,9 @@ static int load_and_attach(const char *event, struct bpf_insn 
*prog, int size)
size_t insns_cnt = size / sizeof(struct bpf_insn);
enum bpf_prog_type prog_type;
char buf[256];
-   int fd, efd, err, id;
+   int fd, efd, err, id = -1;
struct perf_event_attr attr = {};
+   struct probe_desc pd;
 
attr.type = PERF_TYPE_TRACEPOINT;
attr.sample_type = PERF_SAMPLE_RAW;
@@ -128,7 +129,7 @@ static int load_and_attach(const char *event, struct 
bpf_insn *prog, int size)
return populate_prog_array(event, fd);
}
 
-   if (is_kprobe || is_kretprobe) {
+   if (!use_perf_type_probe && (is_kprobe || is_kretprobe)) {
if (is_kprobe)
event += 7;
else
@@ -169,27 +170,42 @@ static int load_and_attach(const char *event, struct 
bpf_insn *prog, int size)
strcat(buf, "/id");
}
 
-   efd = open(buf, O_RDONLY, 0);
-   if (efd < 0) {
-   printf("failed to open event %s\n", event);
-   return -1;
-   }
-
-   err = read(efd, buf, sizeof(buf));
-   if (err < 0 || err >= sizeof(buf)) {
-   printf("read from '%s' failed '%s'\n", event, strerror(errno));
-   return -1;
+   if (use_perf_type_probe && (is_kprobe || is_kretprobe)) {
+   attr.type = PERF_TYPE_PROBE;
+   pd.func = ptr_to_u64(event + strlen(is_kprobe ? "kprobe/"
+   : "kretprobe/"));
+   pd.offset = 0;
+   attr.is_return  = !!is_kretprobe;
+   attr.probe_desc = ptr_to_u64(&pd);
+   } else {
+   efd = open(buf, O_RDONLY, 0);
+   if (efd < 0) {
+   printf("failed to open event %s\n", event);
+   return -1;
+   }
+   err = read(efd, buf, sizeof(buf));
+   if (err < 0 || err >= sizeof(buf)) {
+   printf("read from '%s' failed '%s'\n", event,
+  strerror(errno));
+   return -1;
+   }
+   close(efd);
+   buf[err] = 0;
+   id = atoi(buf);
+   attr.config = id;
}
 
-   close(efd);
-
-   buf[err] = 0;
-   id = atoi(buf);
-   attr.config = id;
-
efd = sys_perf_event_open(&attr, -1/*pid*/, 0/*cpu*/, -1/*group_fd*/, 
0);
if (efd < 0) {
-   printf("event %d fd %d err %s\n", id, efd, strerror(errno));
+   if (use_perf_type_probe && (is_kprobe || is_kretprobe))
+   printf("k%sprobe %s fd %d err %s\n",
+  is_kprobe ? "" : "ret",
+  event + strlen(is_kprobe ? "kprobe/"
+ : "kretprobe/"),
+  efd, strerror(errno));
+   else
+   printf("event %d fd %d err %s\n", id, efd,
+  strerror(errno));
return -1;
}
event_fd[prog_cnt - 1] = efd;
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index 7d57a42..e7a8a21 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -2,6 +2,7 @@
 #ifndef __BPF_LOAD_H
 #define __BPF_LOAD_H
 
+#include 
 #include "libbpf.h"
 
 #define MAX_MAPS 32
@@ -38,6 +39,8 @@ extern int map_fd[MAX_MAPS];
 extern struct bpf_map_data map_data[MAX_MAPS];
 extern int map_data_count;
 
+extern bool use_perf_type_probe;
+
 /* parses elf file compiled by llvm .c->.o
  * . parses 'maps' section and creates maps via BPF syscall
  * . parses 'license' section and passes it to syscall
@@ -59,6 +62,11 @@ struct ksym {
char *name;
 };
 
+static inline __u64 ptr_to_u64(const void *ptr)
+{
+   return (__u64) (unsigned long) ptr;
+}
+
 int load_kallsyms(void);
 struct ksym *ksym_search(long key);
 int set_link_xdp_fd(int ifindex, int fd, __u32 flags);
-- 
2.9.5

[RFC] perf_event_open.2: add new type PERF_TYPE_PROBE

2017-11-12 Thread Song Liu

A new type PERF_TYPE_PROBE is being added to perf_event_attr. This
patch adds information about this type.

Note: the following two flags are also added to the man page. They
are from perf_event.h in the latest kernel repo. However, they are not
related to PERF_TYPE_PROBE. Therefore, their usage is not described
in this patch.

  write_backward :  1
  namespaces :  1

Signed-off-by: Song Liu 
---
 man2/perf_event_open.2 | 82 --
 1 file changed, 80 insertions(+), 2 deletions(-)

diff --git a/man2/perf_event_open.2 b/man2/perf_event_open.2
index 03dc748..a37b78b 100644
--- a/man2/perf_event_open.2
+++ b/man2/perf_event_open.2
@@ -205,7 +205,12 @@ for the event being created.
 struct perf_event_attr {
 __u32 type; /* Type of event */
 __u32 size; /* Size of attribute structure */
-__u64 config;   /* Type-specific configuration */
+
+/* Type-specific configuration */
+union {
+__u64 config;
+__u64 probe_desc; /* ptr to struct probe_desc */
+};
 
 union {
 __u64 sample_period;/* Period of sampling */
@@ -244,8 +249,13 @@ struct perf_event_attr {
due to exec */
   use_clockid:  1,  /* use clockid for time fields */
   context_switch :  1,  /* context switch data */
+  write_backward :  1,  /* Write ring buffer from end to beginning */
+  namespaces :  1,  /* include namespaces data */
 
-  __reserved_1   : 37;
+  /* For PERF_TYPE_PROBE */
+  is_uprobe  :  1,  /* 0 for kprobe, 1 for uprobe */
+  is_return  :  1,  /* 0 for [k,u]probe, 1 for [k,u]retprobe */
+  __reserved_1   : 33;
 
 union {
 __u32 wakeup_events;/* wakeup every n events */
@@ -336,6 +346,13 @@ field.
 For instance,
 .I /sys/bus/event_source/devices/cpu/type
 contains the value for the core CPU PMU, which is usually 4.
+.TP
+.BR PERF_TYPE_PROBE " (since Linux 4.TBD)"
+This indicates a kprobe or uprobe should be created and
+attached to the file descriptor.
+See fields
+.IR probe_desc ", " is_uprobe ", and " is_return
+for more details.
 .RE
 .TP
 .I "size"
@@ -627,6 +644,67 @@ then leave
 .I config
 set to zero.
 Its parameters are set in other places.
+.PP
+If
+.I type
+is
+.BR PERF_TYPE_PROBE ,
+.I probe_desc
+is used instead of
+.IR config .
+.RE
+.TP
+.I probe_desc
+The
+.I probe_desc
+field is used with
+.I type
+of
+.BR PERF_TYPE_PROBE ,
+to save a pointer to struct probe_desc:
+.PP
+.in +8n
+.EX
+struct probe_desc {
+union {
+__aligned_u64 func;
+__aligned_u64 path;
+};
+union {
+__aligned_u64 addr;
+__u64 offset;
+};
+};
+.EE
+Different fields of struct probe_desc are used to describe kprobes
+and uprobes. For kprobes: use
+.I func
+and
+.IR offset ,
+or use
+.I addr
+and leave
+.I func
+as NULL. For uprobe: use
+.I path
+and
+.IR offset .
+.RE
+.TP
+.IR is_uprobe ", " is_return
+These two bits are used with
+.I type
+of
+.BR PERF_TYPE_PROBE ,
+to specify type of the probe:
+.PP
+.in +8n
+.EX
+is_uprobe == 0, is_return == 0: kprobe
+is_uprobe == 0, is_return == 1: kretprobe
+is_uprobe == 1, is_return == 0: uprobe
+is_uprobe == 1, is_return == 1: uretprobe
+.EE
 .RE
 .TP
 .IR sample_period ", " sample_freq
-- 
2.9.5

[RFC v2 1/6] perf: Add new type PERF_TYPE_PROBE

2017-11-12 Thread Song Liu

A new perf type PERF_TYPE_PROBE is added to allow creating [k,u]probes
with perf_event_open. These [k,u]probes are associated with the file
descriptor created by perf_event_open, and thus are easy to clean up when
the file descriptor is destroyed.

Struct probe_desc and two flags, is_uprobe and is_return, are added
to describe the probe being created with perf_event_open.

Note: We use type __u64 for the pointer probe_desc instead of __aligned_u64.
The reason is to avoid changing the size of struct perf_event_attr
and breaking the new-kernel-old-utility scenario. To avoid alignment problems
with the pointer, we will (in the following patches) copy probe_desc to
an __aligned_u64 before using it as a pointer.
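
One way to picture the "copy before use" step is the user-space sketch
below. It is an assumption about the general technique, not the kernel
code from the later patches: aligned_u64 is a local typedef that relies on
the GCC/Clang aligned attribute, and u64_field_to_ptr() is a hypothetical
helper that copies a possibly misaligned 64-bit field into an aligned
temporary before turning it into a pointer.

```c
#include <stdint.h>
#include <string.h>

/* Local stand-in for __aligned_u64 (assumes GCC/Clang attribute syntax). */
typedef uint64_t aligned_u64 __attribute__((aligned(8)));

/*
 * Hypothetical helper: 'field' points at 8 bytes holding a pointer value
 * that may not be 8-byte aligned. memcpy() makes no alignment assumption,
 * so the value is copied into an aligned temporary first and only then
 * converted to a pointer.
 */
static const void *u64_field_to_ptr(const void *field)
{
	aligned_u64 tmp;

	memcpy(&tmp, field, sizeof(tmp));
	return (const void *)(uintptr_t)tmp;
}
```

Reading the field through memcpy() rather than through a cast to a
64-bit pointer type is what keeps the access safe on architectures that
fault on misaligned loads.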

Signed-off-by: Song Liu 
Reviewed-by: Yonghong Song 
Reviewed-by: Josef Bacik 
Acked-by: Alexei Starovoitov 
---
 include/uapi/linux/perf_event.h | 35 +--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 362493a..cc42d59 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -33,6 +33,7 @@ enum perf_type_id {
PERF_TYPE_HW_CACHE  = 3,
PERF_TYPE_RAW   = 4,
PERF_TYPE_BREAKPOINT= 5,
+   PERF_TYPE_PROBE = 6,
 
PERF_TYPE_MAX,  /* non-ABI */
 };
@@ -299,6 +300,29 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER4 104 /* add: sample_regs_intr */
 #define PERF_ATTR_SIZE_VER5 112 /* add: aux_watermark */
 
+#define MAX_PROBE_FUNC_NAME_LEN 64
+/*
+ * Describe a kprobe or uprobe for PERF_TYPE_PROBE.
+ * perf_event_attr.probe_desc will point to this structure. is_uprobe
+ * and is_return are used to differentiate different types of probe
+ * (k/u, probe/retprobe).
+ *
+ * The two unions should be used as follows:
+ * For uprobe: use path and offset;
+ * For kprobe: if func is empty, use addr
+ * if func is not empty, use func and offset
+ */
+struct probe_desc {
+   union {
+   __aligned_u64   func;
+   __aligned_u64   path;
+   };
+   union {
+   __aligned_u64   addr;
+   __u64   offset;
+   };
+};
+
 /*
  * Hardware event_id to monitor via a performance monitoring event:
  *
@@ -320,7 +344,10 @@ struct perf_event_attr {
/*
 * Type specific configuration information.
 */
-   __u64   config;
+   union {
+   __u64   config;
+   __u64   probe_desc; /* ptr to struct probe_desc */
+   };
 
union {
__u64   sample_period;
@@ -370,7 +397,11 @@ struct perf_event_attr {
context_switch :  1, /* context switch data */
write_backward :  1, /* Write ring buffer from 
end to beginning */
namespaces :  1, /* include namespaces data 
*/
-   __reserved_1   : 35;
+
+   /* For PERF_TYPE_PROBE */
+   is_uprobe  :  1, /* 0: kprobe, 1: uprobe */
+   is_return  :  1, /* 0: probe, 1: retprobe */
+   __reserved_1   : 33;
 
union {
__u32   wakeup_events;/* wakeup every n events */
-- 
2.9.5

[RFC v2 3/6] perf: implement kprobe support to PERF_TYPE_PROBE

2017-11-12 Thread Song Liu

A new pmu, perf_probe, is created for PERF_TYPE_PROBE. Based on
input from perf_event_open(), perf_probe creates a kprobe (or
kretprobe) for the perf_event. This kprobe is private to this
perf_event, and thus not added to global lists, and not
available in tracefs.

Two functions, create_local_trace_kprobe() and
destroy_local_trace_kprobe(), are added to create and destroy these
local trace_kprobes.

Signed-off-by: Song Liu 
Reviewed-by: Yonghong Song 
Reviewed-by: Josef Bacik 
---
 include/linux/trace_events.h|  2 +
 kernel/events/core.c| 39 +-
 kernel/trace/trace_event_perf.c | 81 
 kernel/trace/trace_kprobe.c | 91 +
 kernel/trace/trace_probe.h  |  7 
 5 files changed, 210 insertions(+), 10 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 84014ec..96ce715 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -528,6 +528,8 @@ extern int  perf_trace_init(struct perf_event *event);
 extern void perf_trace_destroy(struct perf_event *event);
 extern int  perf_trace_add(struct perf_event *event, int flags);
 extern void perf_trace_del(struct perf_event *event, int flags);
+extern int  perf_probe_init(struct perf_event *event);
+extern void perf_probe_destroy(struct perf_event *event);
 extern int  ftrace_profile_set_filter(struct perf_event *event, int event_id,
 char *filter_str);
 extern void ftrace_profile_free_filter(struct perf_event *event);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 42d24bd..97dc648 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8053,6 +8053,28 @@ static int perf_tp_event_init(struct perf_event *event)
return 0;
 }
 
+static int perf_probe_event_init(struct perf_event *event)
+{
+   int err;
+
+   if (event->attr.type != PERF_TYPE_PROBE)
+   return -ENOENT;
+
+   /*
+* no branch sampling for probe events
+*/
+   if (has_branch_stack(event))
+   return -EOPNOTSUPP;
+
+   err = perf_probe_init(event);
+   if (err)
+   return err;
+
+   event->destroy = perf_probe_destroy;
+
+   return 0;
+}
+
 static struct pmu perf_tracepoint = {
.task_ctx_nr= perf_sw_context,
 
@@ -8064,9 +8086,20 @@ static struct pmu perf_tracepoint = {
.read   = perf_swevent_read,
 };
 
+static struct pmu perf_probe = {
+   .task_ctx_nr= perf_sw_context,
+   .event_init = perf_probe_event_init,
+   .add= perf_trace_add,
+   .del= perf_trace_del,
+   .start  = perf_swevent_start,
+   .stop   = perf_swevent_stop,
+   .read   = perf_swevent_read,
+};
+
 static inline void perf_tp_register(void)
 {
perf_pmu_register(&perf_tracepoint, "tracepoint", PERF_TYPE_TRACEPOINT);
+   perf_pmu_register(&perf_probe, "probe", PERF_TYPE_PROBE);
 }
 
 static void perf_event_free_filter(struct perf_event *event)
@@ -8149,7 +8182,8 @@ static int perf_event_set_bpf_prog(struct perf_event 
*event, u32 prog_fd)
struct bpf_prog *prog;
int ret;
 
-   if (event->attr.type != PERF_TYPE_TRACEPOINT)
+   if (event->attr.type != PERF_TYPE_TRACEPOINT &&
+   event->attr.type != PERF_TYPE_PROBE)
return perf_event_set_bpf_handler(event, prog_fd);
 
is_kprobe = event->tp_event->flags & TRACE_EVENT_FL_UKPROBE;
@@ -8188,7 +8222,8 @@ static int perf_event_set_bpf_prog(struct perf_event 
*event, u32 prog_fd)
 
 static void perf_event_free_bpf_prog(struct perf_event *event)
 {
-   if (event->attr.type != PERF_TYPE_TRACEPOINT) {
+   if (event->attr.type != PERF_TYPE_TRACEPOINT &&
+   event->attr.type != PERF_TYPE_PROBE) {
perf_event_free_bpf_handler(event);
return;
}
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 13ba2d3..bf9b99b 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include "trace.h"
+#include "trace_probe.h"
 
 static char __percpu *perf_trace_buf[PERF_NR_CONTEXTS];
 
@@ -229,6 +230,74 @@ int perf_trace_init(struct perf_event *p_event)
return ret;
 }
 
+#ifdef CONFIG_KPROBE_EVENTS
+static int perf_probe_create_kprobe(struct perf_event *p_event,
+   struct probe_desc *pd, char *name)
+{
+   struct trace_event_call *tp_event;
+   int ret;
+
+   tp_event = create_local_trace_kprobe(
+   name, (void *)(unsigned long)(pd->addr), pd->offset,
+   p_event->attr.is_return);
+   if (IS_ERR(tp_event))
+   return PTR_ERR(tp_event);
+   ret = perf_trace_event_init(tp_event, p_event);
+   if (ret)
+

[PATCH 0/7] net: core: devname allocation cleanups

2017-11-12 Thread Rasmus Villemoes

It's somewhat confusing to have both dev_alloc_name and
dev_get_valid_name. I can't see why the former is less strict than the
latter, so make them (or rather dev_alloc_name_ns and
dev_get_valid_name) equivalent, hardening dev_alloc_name() a little.

Obvious follow-up patches would be to only export one function, and
make dev_alloc_name a static inline wrapper for that (whichever name
is chosen for the exported interface). But maybe there is a good
reason the two exported interfaces do different checking, so I'll
refrain from including the trivial but tree-wide renaming in this
series.

Rasmus Villemoes (7):
  net: core: improve sanity checking in __dev_alloc_name
  net: core: move dev_alloc_name_ns a little higher
  net: core: eliminate dev_alloc_name{,_ns} code duplication
  net: core: drop pointless check in __dev_alloc_name
  net: core: check dev_valid_name in __dev_alloc_name
  net: core: maybe return -EEXIST in __dev_alloc_name
  net: core: dev_get_valid_name is now the same as dev_alloc_name_ns

 net/core/dev.c | 62 +-
 1 file changed, 22 insertions(+), 40 deletions(-)

-- 
2.11.0

[PATCH 1/7] net: core: improve sanity checking in __dev_alloc_name

2017-11-12 Thread Rasmus Villemoes

__dev_alloc_name is called from the public (and exported)
dev_alloc_name(), so we don't have a guarantee that strlen(name) is at
most IFNAMSIZ. If somebody manages to get __dev_alloc_name called with a
% char beyond the first IFNAMSIZ-1 characters, we'd be making a
snprintf() call that will very easily crash the kernel (using an
appropriate %p extension, we'll likely dereference some completely bogus
pointer).

In the normal case where strlen() is sane, we don't even save anything
by limiting to IFNAMSIZ, so just use strchr().

Signed-off-by: Rasmus Villemoes 
---
 net/core/dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 11596a302a26..87e19804757b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1062,7 +1062,7 @@ static int __dev_alloc_name(struct net *net, const char 
*name, char *buf)
unsigned long *inuse;
struct net_device *d;
 
-   p = strnchr(name, IFNAMSIZ-1, '%');
+   p = strchr(name, '%');
if (p) {
/*
 * Verify the string as this thing may have come from
-- 
2.11.0
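
The gap that patch 1/7 closes can be reproduced in user space. The sketch
below is an illustration under stated assumptions, not kernel code:
bounded_strchr() is a hypothetical stand-in for the kernel's strnchr(), and
percent_escapes_bounded_search() is a name made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define IFNAMSIZ 16	/* same value as the kernel's IFNAMSIZ */

/*
 * Hypothetical user-space stand-in for the kernel's strnchr(): look for c
 * in at most count bytes of s, stopping early at the terminating NUL.
 */
static const char *bounded_strchr(const char *s, size_t count, int c)
{
	assert(s != NULL);
	for (; count-- && *s != '\0'; ++s)
		if (*s == (char)c)
			return s;
	return NULL;
}

/*
 * Returns 1 when the bounded search (the old code) sees no '%' although
 * the unbounded search (the new code) finds one, i.e. when the name would
 * have skipped the format-string validation and still reached snprintf().
 */
static int percent_escapes_bounded_search(const char *name)
{
	return bounded_strchr(name, IFNAMSIZ - 1, '%') == NULL &&
	       strchr(name, '%') != NULL;
}
```

An over-long name whose '%' sits past the searched prefix makes the helper
return 1, while ordinary names such as "eth%d" or "eth0" return 0.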

[PATCH 3/7] net: core: eliminate dev_alloc_name{,_ns} code duplication

2017-11-12 Thread Rasmus Villemoes

dev_alloc_name contained a BUG_ON(), which I moved to dev_alloc_name_ns;
the only other caller of that already has the same BUG_ON.

Signed-off-by: Rasmus Villemoes 
---
 net/core/dev.c | 12 ++--
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 240ae6bc1097..1077bfe97bde 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1112,6 +1112,7 @@ static int dev_alloc_name_ns(struct net *net,
char buf[IFNAMSIZ];
int ret;
 
+   BUG_ON(!net);
ret = __dev_alloc_name(net, name, buf);
if (ret >= 0)
strlcpy(dev->name, buf, IFNAMSIZ);
@@ -1134,16 +1135,7 @@ static int dev_alloc_name_ns(struct net *net,
 
 int dev_alloc_name(struct net_device *dev, const char *name)
 {
-   char buf[IFNAMSIZ];
-   struct net *net;
-   int ret;
-
-   BUG_ON(!dev_net(dev));
-   net = dev_net(dev);
-   ret = __dev_alloc_name(net, name, buf);
-   if (ret >= 0)
-   strlcpy(dev->name, buf, IFNAMSIZ);
-   return ret;
+   return dev_alloc_name_ns(dev_net(dev), dev, name);
 }
 EXPORT_SYMBOL(dev_alloc_name);
 
-- 
2.11.0

[PATCH 4/7] net: core: drop pointless check in __dev_alloc_name

2017-11-12 Thread Rasmus Villemoes

The only caller passes a stack buffer as buf, so it won't equal the
passed-in name. Moreover, we're already using buf as a scratch buffer
inside the if (p) {} block, so if buf and name were the same, that
snprintf() call would be overwriting its own format string.

Signed-off-by: Rasmus Villemoes 
---
 net/core/dev.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 1077bfe97bde..14541b7a3195 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1093,8 +1093,7 @@ static int __dev_alloc_name(struct net *net, const char 
*name, char *buf)
free_page((unsigned long) inuse);
}
 
-   if (buf != name)
-   snprintf(buf, IFNAMSIZ, name, i);
+   snprintf(buf, IFNAMSIZ, name, i);
if (!__dev_get_by_name(net, buf))
return i;
 
-- 
2.11.0

[PATCH 2/7] net: core: move dev_alloc_name_ns a little higher

2017-11-12 Thread Rasmus Villemoes

No functional change.

Signed-off-by: Rasmus Villemoes 
---
 net/core/dev.c | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 87e19804757b..240ae6bc1097 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1105,6 +1105,19 @@ static int __dev_alloc_name(struct net *net, const char 
*name, char *buf)
return -ENFILE;
 }
 
+static int dev_alloc_name_ns(struct net *net,
+struct net_device *dev,
+const char *name)
+{
+   char buf[IFNAMSIZ];
+   int ret;
+
+   ret = __dev_alloc_name(net, name, buf);
+   if (ret >= 0)
+   strlcpy(dev->name, buf, IFNAMSIZ);
+   return ret;
+}
+
 /**
  * dev_alloc_name - allocate a name for a device
  * @dev: device
@@ -1134,19 +1147,6 @@ int dev_alloc_name(struct net_device *dev, const char 
*name)
 }
 EXPORT_SYMBOL(dev_alloc_name);
 
-static int dev_alloc_name_ns(struct net *net,
-struct net_device *dev,
-const char *name)
-{
-   char buf[IFNAMSIZ];
-   int ret;
-
-   ret = __dev_alloc_name(net, name, buf);
-   if (ret >= 0)
-   strlcpy(dev->name, buf, IFNAMSIZ);
-   return ret;
-}
-
 int dev_get_valid_name(struct net *net, struct net_device *dev,
   const char *name)
 {
-- 
2.11.0

[PATCH 5/7] net: core: check dev_valid_name in __dev_alloc_name

2017-11-12 Thread Rasmus Villemoes

We currently only exclude non-sysfs-friendly names via
dev_get_valid_name; there doesn't seem to be a reason to allow such
names when we're called via dev_alloc_name.

This does duplicate the dev_valid_name check in the dev_get_valid_name()
case; we'll fix that shortly.

Signed-off-by: Rasmus Villemoes 
---
 net/core/dev.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 14541b7a3195..c0a92cf27566 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1062,6 +1062,9 @@ static int __dev_alloc_name(struct net *net, const char 
*name, char *buf)
unsigned long *inuse;
struct net_device *d;
 
+   if (!dev_valid_name(name))
+   return -EINVAL;
+
p = strchr(name, '%');
if (p) {
/*
-- 
2.11.0

[PATCH 7/7] net: core: dev_get_valid_name is now the same as dev_alloc_name_ns

2017-11-12 Thread Rasmus Villemoes

If name contains a %, it's easy to see that this patch doesn't change
anything (other than eliminate the duplicate dev_valid_name
call). Otherwise, we'll now just spend a little time in snprintf()
copying name to the stack buffer allocated in dev_alloc_name_ns, and do
the __dev_get_by_name using that buffer rather than name.

Signed-off-by: Rasmus Villemoes 
---
 net/core/dev.c | 14 +-
 1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 7c08b4ca7b76..e29eea26f9c1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1144,19 +1144,7 @@ EXPORT_SYMBOL(dev_alloc_name);
 int dev_get_valid_name(struct net *net, struct net_device *dev,
   const char *name)
 {
-   BUG_ON(!net);
-
-   if (!dev_valid_name(name))
-   return -EINVAL;
-
-   if (strchr(name, '%'))
-   return dev_alloc_name_ns(net, dev, name);
-   else if (__dev_get_by_name(net, name))
-   return -EEXIST;
-   else if (dev->name != name)
-   strlcpy(dev->name, name, IFNAMSIZ);
-
-   return 0;
+   return dev_alloc_name_ns(net, dev, name);
 }
 EXPORT_SYMBOL(dev_get_valid_name);
 
-- 
2.11.0

[PATCH 6/7] net: core: maybe return -EEXIST in __dev_alloc_name

2017-11-12 Thread Rasmus Villemoes

If we're given a format string with no %d, -EEXIST is a saner error code.

Signed-off-by: Rasmus Villemoes 
---
 net/core/dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index c0a92cf27566..7c08b4ca7b76 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1104,7 +1104,7 @@ static int __dev_alloc_name(struct net *net, const char 
*name, char *buf)
 * when the name is long and there isn't enough space left
 * for the digits, or if all bits are used.
 */
-   return -ENFILE;
+   return p ? -ENFILE : -EEXIST;
 }
 
 static int dev_alloc_name_ns(struct net *net,
-- 
2.11.0
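
The error-code split in patch 6/7 can be modelled in user space. This is a
simplified sketch, not the kernel implementation: the taken[] table stands in
for __dev_get_by_name(), max_units replaces the kernel's in-use bitmap, the
dev_valid_name() check is left out, and the name is assembled with a fixed
format string rather than passed to snprintf() as the format. All helper
names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define IFNAMSIZ 16

/* Hypothetical stand-in for __dev_get_by_name(): a fixed set of taken names. */
static const char *const taken[] = { "eth0", "eth1", "lo" };

static int name_in_use(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(taken) / sizeof(taken[0]); i++)
		if (strcmp(taken[i], name) == 0)
			return 1;
	return 0;
}

/*
 * Returns the unit number that was picked (0 for a plain name) or a
 * negative errno: -ENFILE when a "%d" template has no free unit left,
 * -EEXIST when a plain name is already taken.
 */
static int model_alloc_name(const char *name, char *buf, int max_units)
{
	const char *p;
	int i;

	assert(name != NULL && buf != NULL);
	p = strchr(name, '%');
	if (p) {
		/* Only a single "%d" is accepted, as in the kernel check. */
		if (p[1] != 'd' || strchr(p + 2, '%'))
			return -EINVAL;
		for (i = 0; i < max_units; i++) {
			snprintf(buf, IFNAMSIZ, "%.*s%d%s",
				 (int)(p - name), name, i, p + 2);
			if (!name_in_use(buf))
				return i;
		}
		return -ENFILE;
	}
	snprintf(buf, IFNAMSIZ, "%s", name);
	return name_in_use(buf) ? -EEXIST : 0;
}
```

With the table above, the template "eth%d" picks unit 2, while the plain name
"lo" is reported as already existing instead of as an exhausted name space.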

Re: [PATCH v4] scripts: add leaking_addresses.pl

2017-11-12 Thread Tobin C. Harding

On Sun, Nov 12, 2017 at 02:10:07AM +0300, Kirill A. Shutemov wrote:
> On Tue, Nov 07, 2017 at 09:32:11PM +1100, Tobin C. Harding wrote:
> > Currently we are leaking addresses from the kernel to user space. This
> > script is an attempt to find some of those leakages. Script parses
> > `dmesg` output and /proc and /sys files for hex strings that look like
> > kernel addresses.
> > 
> > Only works for 64 bit kernels, the reason being that kernel addresses
> > on 64 bit kernels have '' as the leading bit pattern making greping
> > possible. On 32 kernels we don't have this luxury.
> 
> Well, it's not going to work as well as intended on an x86 machine with
> 5-level paging. Kernel address space there starts at 0xff10.
> It will still catch pointers to kernel/modules text, but the rest is
> outside of 0x... space. See Documentation/x86/x86_64/mm.txt.

Thanks for the link. So it looks like we need to refactor the kernel
address regular expression into a function that takes into account the
machine architecture and the number of page table levels. We will need
to add this to the false positive checks also.

> Not sure if we care. It won't work either for other 64-bit architectures
> that have more than 256TB of virtual address space.

Is this because of the virtual memory map? Did you mean 512TB?

from mm.txt:
ffd4 - ffd5 (=49 bits) virtual memory map (512TB)

Perhaps an option (--terse) that only checks the virtual memory map
range (above for 5-level paging) and

ea00 - eaff (=40 bits) virtual memory map (1TB)

for 4-level paging?

> Just wanted to point to the limitation.

Appreciate it, thanks.

Tobin.
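
The refactoring discussed above (a check that depends on the number of page
table levels) could look roughly like the following. It is written in C
rather than the script's Perl, purely as a sketch. The 48-bit and 57-bit
virtual address widths for 4-level and 5-level paging are taken from the
x86-64 architecture, not from this thread, and both function names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Lowest address of the upper (kernel) half of the canonical x86-64
 * address space for the given number of page table levels.
 */
static uint64_t kernel_half_start(int paging_levels)
{
	/* Assumption: 4-level paging uses 48-bit VAs, 5-level uses 57-bit. */
	int va_bits = (paging_levels == 5) ? 57 : 48;

	/* In the upper canonical half, bits (va_bits - 1)..63 are all set. */
	return ~UINT64_C(0) << (va_bits - 1);
}

/* Returns 1 if addr lies in the kernel half for this paging mode. */
static int looks_like_kernel_address(uint64_t addr, int paging_levels)
{
	assert(paging_levels == 4 || paging_levels == 5);
	return addr >= kernel_half_start(paging_levels);
}
```

A range test of this kind only narrows the candidates; the per-architecture
false positive handling mentioned above would still be needed on top of it.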

[PATCH net-next 1/3 v3] bpf: improve verifier ARG_CONST_SIZE_OR_ZERO semantics

2017-11-12 Thread Yonghong Song

For helpers, the argument type ARG_CONST_SIZE_OR_ZERO permits the
access size to be 0 when accessing the previous argument (arg).
Right now, it requires the arg to be NULL when the size passed
is 0 or could be 0. It also requires a non-NULL arg when the size
is proved to be non-0.

This patch changes verifier ARG_CONST_SIZE_OR_ZERO behavior
such that for size-0 or possible size-0, the arg is not required
to be NULL.

There are a couple of reasons for this semantics change, and
all of them intend to simplify user bpf programs which
may improve user experience and/or increase chances of
verifier acceptance. Together with the next patch which
changes bpf_probe_read arg2 type from ARG_CONST_SIZE to
ARG_CONST_SIZE_OR_ZERO, the following two examples, which
fail the verifier currently, are able to get verifier acceptance.

Example 1:
   unsigned long len = pend - pstart;
   len = len > MAX_PAYLOAD_LEN ? MAX_PAYLOAD_LEN : len;
   len &= MAX_PAYLOAD_LEN;
   bpf_probe_read(data->payload, len, pstart);

It does not have a test for "len > 0" and so it fails the verifier.
Users may not be aware that they have to add this test.
Converting the bpf_probe_read helper to have
ARG_CONST_SIZE_OR_ZERO helps the above code get
verifier acceptance.

Example 2:
  Here is one example where llvm "messed up" the code and
  the verifier fails.

..
   unsigned long len = pend - pstart;
   if (len > 0 && len <= MAX_PAYLOAD_LEN)
 bpf_probe_read(data->payload, len, pstart);
..

The compiler generates the following code and verifier fails:
..
39: (79) r2 = *(u64 *)(r10 -16)
40: (1f) r2 -= r8
41: (bf) r1 = r2
42: (07) r1 += -1
43: (25) if r1 > 0xffe goto pc+3
  R0=inv(id=0) R1=inv(id=0,umax_value=4094,var_off=(0x0; 0xfff))
  R2=inv(id=0) R6=map_value(id=0,off=0,ks=4,vs=4095,imm=0) R7=inv(id=0)
  R8=inv(id=0) R9=inv0 R10=fp0
44: (bf) r1 = r6
45: (bf) r3 = r8
46: (85) call bpf_probe_read#45
R2 min value is negative, either use unsigned or 'var &= const'
..

The compiler optimization is correct. If r1 = 0,
r1 - 1 = 0xffffffffffffffff > 0xffe.  If r1 != 0, r1 - 1 will not wrap.
r1 > 0xffe at insn #43 can actually capture
both "r1 > 0" and "len <= MAX_PAYLOAD_LEN".
This however causes an issue in verifier as the value range of arg2
"r2" does not properly get refined, which leads to verification failure.

Relaxing bpf_probe_read arg2 from ARG_CONST_SIZE to ARG_CONST_SIZE_OR_ZERO
allows the following simplified code:
   unsigned long len = pend - pstart;
   if (len <= MAX_PAYLOAD_LEN)
 bpf_probe_read(data->payload, len, pstart);

The llvm compiler will generate less complex code and the
verifier is able to verify that the program is okay.

Signed-off-by: Yonghong Song 
Acked-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 kernel/bpf/verifier.c | 40 
 1 file changed, 24 insertions(+), 16 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 4a942e2..dd54d20 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -799,12 +799,13 @@ static int check_stack_read(struct bpf_verifier_env *env,
 
 /* check read/write into map element returned by bpf_map_lookup_elem() */
 static int __check_map_access(struct bpf_verifier_env *env, u32 regno, int off,
-   int size)
+ int size, bool zero_size_allowed)
 {
struct bpf_reg_state *regs = cur_regs(env);
struct bpf_map *map = regs[regno].map_ptr;
 
-   if (off < 0 || size <= 0 || off + size > map->value_size) {
+   if (off < 0 || size < 0 || (size == 0 && !zero_size_allowed) ||
+   off + size > map->value_size) {
verbose(env, "invalid access to map value, value_size=%d off=%d 
size=%d\n",
map->value_size, off, size);
return -EACCES;
@@ -814,7 +815,7 @@ static int __check_map_access(struct bpf_verifier_env *env, 
u32 regno, int off,
 
 /* check read/write into a map element with possible variable offset */
 static int check_map_access(struct bpf_verifier_env *env, u32 regno,
-   int off, int size)
+   int off, int size, bool zero_size_allowed)
 {
struct bpf_verifier_state *state = env->cur_state;
struct bpf_reg_state *reg = &state->regs[regno];
@@ -837,7 +838,8 @@ static int check_map_access(struct bpf_verifier_env *env, 
u32 regno,
regno);
return -EACCES;
}
-   err = __check_map_access(env, regno, reg->smin_value + off, size);
+   err = __check_map_access(env, regno, reg->smin_value + off, size,
+zero_size_allowed);
if (err) {
verbose(env, "R%d min value is outside of the array range\n",
regno);
@@ -853,7 +855,8 @@ static int check_map_access(struct bpf_verifier_env *env, 
u32 regno,
regno);
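
The wraparound argument in the commit message of patch 1/3 can be checked in
plain C. The sketch below is a user-space illustration, not verifier code.
MAX_PAYLOAD_LEN = 0xfff is an assumption inferred from the 0xffe constant at
insn 43, and the function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value; the BPF listing compares (len - 1) against 0xffe. */
#define MAX_PAYLOAD_LEN UINT64_C(0xfff)

/* The range test as written in the BPF program source. */
static int check_as_written(uint64_t len)
{
	return len > 0 && len <= MAX_PAYLOAD_LEN;
}

/*
 * The test as compiled: subtract one, then do a single unsigned compare.
 * For len == 0 the subtraction wraps to UINT64_MAX, which fails the
 * compare, so the one branch covers both halves of the original test.
 */
static int check_as_compiled(uint64_t len)
{
	uint64_t r1 = len - 1;

	assert(MAX_PAYLOAD_LEN > 0);
	return r1 <= MAX_PAYLOAD_LEN - 1;
}

/* Returns 1 when both forms give the same answer for len. */
static int checks_agree(uint64_t len)
{
	return check_as_written(len) == check_as_compiled(len);
}
```

The two forms agree for every input, which is why the optimization is valid.
The catch described in the commit message is that the comparison is made on
the decremented copy (r1), so the verifier learns nothing about r2 itself.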

[PATCH net-next 3/3 v3] bpf: fix and add test cases for ARG_CONST_SIZE_OR_ZERO semantics change

2017-11-12 Thread Yonghong Song

Fix a few test cases to allow non-NULL map/packet/stack pointer
with size = 0. Change a few tests using bpf_probe_read to use
bpf_probe_write_user so ARG_CONST_SIZE arg can still be properly
tested. One existing test case already covers size = 0 with non-NULL
packet pointer, so add additional tests so all cases of
size = 0 and 0 <= size <= legal_upper_bound with non-NULL
map/packet/stack pointer are covered.

Signed-off-by: Yonghong Song 
Acked-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 tools/testing/selftests/bpf/test_verifier.c | 131 
 1 file changed, 112 insertions(+), 19 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index bb3c4ad..bf092b8 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -3579,7 +3579,7 @@ static struct bpf_test tests[] = {
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
},
{
-   "helper access to packet: test19, cls helper fail range zero",
+   "helper access to packet: test19, cls helper range zero",
.insns = {
BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
offsetof(struct __sk_buff, data)),
@@ -3599,8 +3599,7 @@ static struct bpf_test tests[] = {
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),
},
-   .result = REJECT,
-   .errstr = "invalid access to packet",
+   .result = ACCEPT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
},
{
@@ -4379,10 +4378,10 @@ static struct bpf_test tests[] = {
BPF_LD_MAP_FD(BPF_REG_1, 0),
BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4),
-   BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
-   BPF_MOV64_IMM(BPF_REG_2, 0),
+   BPF_MOV64_IMM(BPF_REG_1, 0),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
BPF_MOV64_IMM(BPF_REG_3, 0),
-   BPF_EMIT_CALL(BPF_FUNC_probe_read),
+   BPF_EMIT_CALL(BPF_FUNC_probe_write_user),
BPF_EXIT_INSN(),
},
.fixup_map2 = { 3 },
@@ -4486,9 +4485,10 @@ static struct bpf_test tests[] = {
BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_1,
offsetof(struct test_val, foo)),
-   BPF_MOV64_IMM(BPF_REG_2, 0),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_1),
+   BPF_MOV64_IMM(BPF_REG_1, 0),
BPF_MOV64_IMM(BPF_REG_3, 0),
-   BPF_EMIT_CALL(BPF_FUNC_probe_read),
+   BPF_EMIT_CALL(BPF_FUNC_probe_write_user),
BPF_EXIT_INSN(),
},
.fixup_map2 = { 3 },
@@ -4622,13 +4622,14 @@ static struct bpf_test tests[] = {
BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
BPF_MOV64_IMM(BPF_REG_3, 0),
BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_3),
-   BPF_MOV64_IMM(BPF_REG_2, 0),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_1),
+   BPF_MOV64_IMM(BPF_REG_1, 0),
BPF_MOV64_IMM(BPF_REG_3, 0),
-   BPF_EMIT_CALL(BPF_FUNC_probe_read),
+   BPF_EMIT_CALL(BPF_FUNC_probe_write_user),
BPF_EXIT_INSN(),
},
.fixup_map2 = { 3 },
-   .errstr = "R1 min value is outside of the array range",
+   .errstr = "R2 min value is outside of the array range",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_TRACEPOINT,
},
@@ -4765,13 +4766,14 @@ static struct bpf_test tests[] = {
BPF_JMP_IMM(BPF_JGT, BPF_REG_3,
offsetof(struct test_val, foo), 4),
BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_3),
-   BPF_MOV64_IMM(BPF_REG_2, 0),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_1),
+   BPF_MOV64_IMM(BPF_REG_1, 0),
BPF_MOV64_IMM(BPF_REG_3, 0),
-   BPF_EMIT_CALL(BPF_FUNC_probe_read),
+   BPF_EMIT_CALL(BPF_FUNC_probe_write_user),
BPF_EXIT_INSN(),
},
.fixup_map2 = { 3 },
-   .errstr = "R1 min value is outside of the array range",
+   .errstr = "R2 min value is outside of the array range",
.result = REJECT,
.prog_type

[PATCH net-next 0/3 v3] bpf: improve verifier ARG_CONST_SIZE_OR_ZERO semantics

2017-11-12 Thread Yonghong Song

This patch set intends to change verifier ARG_CONST_SIZE_OR_ZERO
semantics so that simpler bpf programs can be written with verifier
acceptance. Patch #1 comment provided the detailed examples and
the patch itself implements the new semantics. Patch #2
changes bpf_probe_read helper arg2 type from
ARG_CONST_SIZE to ARG_CONST_SIZE_OR_ZERO. Patch #3 fixed a few
test cases and added some for better coverage.

Changelog:
v2 -> v3:
  Fix comments to make patchwork happy
v1 -> v2:
  Fix typo in commit message pointed by Sergei Shtylyov

Yonghong Song (3):
  bpf: improve verifier ARG_CONST_SIZE_OR_ZERO semantics
  bpf: change helper bpf_probe_read arg2 type to ARG_CONST_SIZE_OR_ZERO
  bpf: fix and add test cases for ARG_CONST_SIZE_OR_ZERO semantics
change

 kernel/bpf/verifier.c   |  40 +
 kernel/trace/bpf_trace.c|   8 +-
 tools/testing/selftests/bpf/test_verifier.c | 131 
 3 files changed, 142 insertions(+), 37 deletions(-)

-- 
2.9.5

[PATCH net-next 2/3 v3] bpf: change helper bpf_probe_read arg2 type to ARG_CONST_SIZE_OR_ZERO

2017-11-12 Thread Yonghong Song

The helper bpf_probe_read arg2 type is changed
from ARG_CONST_SIZE to ARG_CONST_SIZE_OR_ZERO to permit
size-0 buffer. Together with newer ARG_CONST_SIZE_OR_ZERO
semantics which allows non-NULL buffer with size 0,
this allows simpler bpf programs with verifier acceptance.
The previous commit which changes ARG_CONST_SIZE_OR_ZERO semantics
has details on examples.

Signed-off-by: Yonghong Song 
Acked-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 kernel/trace/bpf_trace.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 506efe6..a5580c6 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -78,12 +78,16 @@ EXPORT_SYMBOL_GPL(trace_call_bpf);
 
 BPF_CALL_3(bpf_probe_read, void *, dst, u32, size, const void *, unsafe_ptr)
 {
-   int ret;
+   int ret = 0;
+
+   if (unlikely(size == 0))
+   goto out;
 
ret = probe_kernel_read(dst, unsafe_ptr, size);
if (unlikely(ret < 0))
memset(dst, 0, size);
 
+ out:
return ret;
 }
 
@@ -92,7 +96,7 @@ static const struct bpf_func_proto bpf_probe_read_proto = {
.gpl_only   = true,
.ret_type   = RET_INTEGER,
.arg1_type  = ARG_PTR_TO_UNINIT_MEM,
-   .arg2_type  = ARG_CONST_SIZE,
+   .arg2_type  = ARG_CONST_SIZE_OR_ZERO,
.arg3_type  = ARG_ANYTHING,
 };
 
-- 
2.9.5
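
Patch 2/3 relies on the relaxed bounds rule that patch 1/3 adds to
__check_map_access(). A user-space model of that predicate, under the
assumption that only the arithmetic matters here (the function name and the
plain int parameters are simplifications for the example):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Model of the map-value bounds test after patch 1/3: a zero size is
 * accepted only when the helper's argument type allows it, a negative
 * size or offset is always rejected, and the access must end inside
 * the value.
 */
static int model_check_map_access(int off, int size, int value_size,
				  bool zero_size_allowed)
{
	assert(value_size >= 0);
	if (off < 0 || size < 0 || (size == 0 && !zero_size_allowed) ||
	    off + size > value_size)
		return -EACCES;
	return 0;
}
```

With zero_size_allowed set, a call such as bpf_probe_read(dst, 0, src) on a
non-NULL map pointer passes this test, and the early return added to the
helper in patch 2/3 then makes the zero-length read a no-op.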

Microchip KSZ* DSA drivers Re: [PATCH v1 RFC 1/1] Add Microchip KSZ8795 DSA driver

2017-11-12 Thread Pavel Machek

Hi!

Is there any news here? Is a new release planned? Is there a git
tree somewhere? I probably should get it working, soon.. so I guess I
can help with testing.

Thanks and best regards,
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


[PATCH net-next] net: Mention net-next status web page in netdev-FAQ.txt

2017-11-12 Thread laforge

From: Harald Welte 

According to
  https://www.mail-archive.com/netdev@vger.kernel.org/msg177411.html
there is a status page available at
  http://vger.kernel.org/~davem/net-next.html
to obtain the current status of the net-next tree.  Let's add this
information to the netdev FAQ.

Signed-off-by: Harald Welte 
---
 Documentation/networking/netdev-FAQ.txt | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/netdev-FAQ.txt 
b/Documentation/networking/netdev-FAQ.txt
index cfc66ea72329..2a3278d5cf35 100644
--- a/Documentation/networking/netdev-FAQ.txt
+++ b/Documentation/networking/netdev-FAQ.txt
@@ -64,7 +64,10 @@ A: To understand this, you need to know a bit of background 
information
 
If you aren't subscribed to netdev and/or are simply unsure if net-next
has re-opened yet, simply check the net-next git repository link above for
-   any new networking-related commits.
+   any new networking-related commits.  You may also check the following
+   website for the current status:
+
+http://vger.kernel.org/~davem/net-next.html
 
The "net" tree continues to collect fixes for the vX.Y content, and
is fed back to Linus at regular (~weekly) intervals.  Meaning that the
-- 
2.15.0

Re: [kernel-hardening] [PATCH v4] scripts: add leaking_addresses.pl

2017-11-12 Thread Tobin C. Harding

On Fri, Nov 10, 2017 at 07:26:34PM +0530, kaiwan.billimo...@gmail.com wrote:
> On Tue, 2017-11-07 at 21:32 +1100, Tobin C. Harding wrote:
> > Currently we are leaking addresses from the kernel to user space.
> > This
> > script is an attempt to find some of those leakages. Script parses
> > `dmesg` output and /proc and /sys files for hex strings that look
> > like
> > kernel addresses.
> > 
> > Only works for 64 bit kernels, the reason being that kernel addresses
> > on 64 bit kernels have '' as the leading bit pattern making
> > greping
> > possible. On 32 kernels we don't have this luxury.
> 
> Tobin C. Harding  wrote:
> >Only works for 64 bit kernels, the reason being that kernel addresses
> >on 64 bit kernels have '' as the leading bit pattern making greping
> >possible. On 32 kernels we don't have this luxury.
> 
> [RFC] leaking_addresses.pl - enhance it to work for 32-bit kernels as well
> 
> (Firstly, apologies if I've got the protocol horribly wrong- should this
> be a new thread altogether?)

I think this patch will need to wait until the patch set that is
currently in flight is either merged or dropped.

> Ok so, I was interested in figuring - why not have this useful script work
> for 32-bit kernel virtual addresses as well (and those systems by
> extension).

Awesome.

> The approach am considering, pl correct me if I'm way off:
> on 32-bit, the kernel macro PAGE_OFFSET will give us the user-kernel split;
> (alternatively, could also script up CONFIG_VMSPLIT_[n]G and figure the
> split from there.)
> 
> For the time being, let's say we go with the "use PAGE_OFFSET" approach and
> PAGE_OFFSET = 0xc000 , which implies we have a 3:1 GB user:kernel split.
> So any virtual addresses >= PAGE_OFFSET are kernel virtual addresses (i
> know, untrue on some ARM-32 systems!).
> 
> As a very early and *far-from-perfect* start, I've enhanced Tobin's Perl
> script to take into account 32-bit address space by passing the
> parameter '--bit-size='.

We can work this out pragmatically, Perl can give us an architecture
string then a few regexes can ascertain which architecture we are running
on. This is in the inflight patch set. 

> The patch below does Not take into account (yet) stuff like:
>  - exactly which files & dirs should be skipped on 32-bit (will it be
> identical to 64-bit?; unsure..)

As per discussion later in this thread we may need to consider
architecture specific lists for files/directories to skip. 

>  - it currently hard-codes a global 'PAGE_OFFSET_32BIT=0xc000' , just
>  so I can test quickly; must figure whether to query it or pass it;
>  Suggestions?

Perhaps we should have a command line option for this.

--kernel-base-address

>  - the 'false positives'; again, what differs for 32-bit?
>(BTW, shouldn't the dmesg 'root=UUID=<...>' line be a false positive
> & skipped?).

We could probably do with architecture specific false
positives. Inflight patch set refactors false_positive() so adding to
this should be easy.

> Also, I must point out that I'm a complete newbie to Perl :-) so, pl excuse
> my highly inadequate perl-foo; I rely on you perl gurus out there to fix
> and optimize :)

I'm no Perl guru but following are a few tips I have picked up over the
last month.

> Yes, I've **Very Minimally** tested the patch in its current form on:
> a) a regular (Fedora 26) x86_64 desktop,
> b) a (Debian 7) 32-bit kernel (VM) with PAGE_OFFSET=3 Gb
> and it seems all right, considering...
> 
> Some sample output from test (b), if interested:
> =
> dmesg: [0.00] found SMP MP-table at [c00f1280] f1280
> dmesg: [0.00] Base memory trampoline at [c009b000] 9b000 size 16384
> dmesg: [0.00] ACPI: Local APIC address 0xfee0
> dmesg: [0.00] free_area_init_node: node 0, pgdat c1418bc0, 
> node_mem_map dfbfa200
> dmesg: [0.00] ACPI: Local APIC address 0xfee0
> dmesg: [0.00] ACPI: IOAPIC (id[0x00] address[0xfec0] gsi_base[0])
> dmesg: [0.00] IOAPIC[0]: apic_id 0, version 17, address 0xfec0, 
> GSI 0-23
> dmesg: [0.00] PERCPU: Embedded 14 pages/cpu @dfbe8000 s33344 r0 
> d24000 u57344
> dmesg: [0.00] fixmap  : 0xffd36000 - 0xf000   (2852 kB)
> dmesg: [0.00] pkmap   : 0xffa0 - 0xffc0   (2048 kB)
> dmesg: [0.00] vmalloc : 0xe07fb000 - 0xff9fe000   ( 498 MB)
> dmesg: [0.00] lowmem  : 0xc000 - 0xdfffb000   ( 511 MB)
> dmesg: [0.00]   .init : 0xc1421000 - 0xc148c000   ( 428 kB)
> 
> [...]
> 
> /proc/kallsyms: c10010e8 T _stext
> /proc/kallsyms: c1002000 T hypercall_page
> /proc/kallsyms: c1003000 t arch_local_save_flags
> /proc/kallsyms: c1003007 t arch_local_irq_enable
> /proc/kallsyms: c100300e T do_one_initcall
> 
> << ... plenty more kallsyms of course (92.5% of the output to be precise!) 
> ... >>
> 
> /proc/modules: loop 17803 0 - Live 0xe097c000
> /proc/modules: crc32c_intel 12659 0 - Live 0xe096e000
>

[PATCH net-next] net: Extend Kernel GTP-U tunneling documentation

2017-11-12 Thread laforge

From: Harald Welte 

* clarify specification references for v0/v1
* add section "APN vs. Network device"
* add section "Local GTP-U entity and tunnel identification"

Signed-off-by: Andreas Schultz 
Signed-off-by: Harald Welte 
---
 Documentation/networking/gtp.txt | 103 +--
 1 file changed, 99 insertions(+), 4 deletions(-)

diff --git a/Documentation/networking/gtp.txt b/Documentation/networking/gtp.txt
index 93e96750f103..0d9c18f05ec6 100644
--- a/Documentation/networking/gtp.txt
+++ b/Documentation/networking/gtp.txt
@@ -1,6 +1,7 @@
 The Linux kernel GTP tunneling module
 ==
-Documentation by Harald Welte 
+Documentation by Harald Welte  and
+ Andreas Schultz 
 
 In 'drivers/net/gtp.c' you are finding a kernel-level implementation
 of a GTP tunnel endpoint.
@@ -91,9 +92,13 @@ http://git.osmocom.org/libgtpnl/
 
 == Protocol Versions ==
 
-There are two different versions of GTP-U: v0 and v1.  Both are
-implemented in the Kernel GTP module.  Version 0 is a legacy version,
-and deprecated from recent 3GPP specifications.
+There are two different versions of GTP-U: v0 [GSM TS 09.60] and v1
+[3GPP TS 29.281].  Both are implemented in the Kernel GTP module.
+Version 0 is a legacy version, and deprecated from recent 3GPP
+specifications.
+
+GTP-U uses UDP for transporting PDUs.  The receiving UDP port is 2152
+for GTPv1-U and 3386 for GTPv0-U.
 
 There are three versions of GTP-C: v0, v1, and v2.  As the kernel
 doesn't implement GTP-C, we don't have to worry about this.  It's the
@@ -133,3 +138,93 @@ doe to a lack of user interest, it never got merged.
 In 2015, Andreas Schultz came to the rescue and fixed lots more bugs,
 extended it with new features and finally pushed all of us to get it
 mainline, where it was merged in 4.7.0.
+
+== Architectural Details ==
+
+=== Local GTP-U entity and tunnel identification ===
+
+GTP-U uses UDP for transporting PDUs. The receiving UDP port is 2152
+for GTPv1-U and 3386 for GTPv0-U.
+
+There is only one GTP-U entity (and therefore SGSN/GGSN/S-GW/PDN-GW
+instance) per IP address. Tunnel Endpoint Identifiers (TEIDs) are unique
+per GTP-U entity.
+
+A specific tunnel is only defined by the destination entity. Since the
+destination port is constant, only the destination IP and TEID define
+a tunnel. The source IP and Port have no meaning for the tunnel.
+
+Therefore:
+
+  * when sending, the remote entity is defined by the remote IP and
+the tunnel endpoint id. The source IP and port have no meaning and
+can be changed at any time.
+
+  * when receiving, the local entity is defined by the local
+destination IP and the tunnel endpoint id. The source IP and port
+have no meaning and can change at any time.
+
+[3GPP TS 29.281] Section 4.3.0 defines this so:
+
+> The TEID in the GTP-U header is used to de-multiplex traffic
+> incoming from remote tunnel endpoints so that it is delivered to the
+> User plane entities in a way that allows multiplexing of different
+> users, different packet protocols and different QoS levels.
+> Therefore no two remote GTP-U endpoints shall send traffic to a
+> GTP-U protocol entity using the same TEID value except
+> for data forwarding as part of mobility procedures.
+
+The definition above only defines that two remote GTP-U endpoints
+*should not* send to the same TEID, it *does not* forbid or exclude
+such a scenario. In fact, the mentioned mobility procedures make it
+necessary that the GTP-U entity accepts traffic for TEIDs from
+multiple or unknown peers.
+
+Therefore, the receiving side identifies tunnels exclusively based on
+TEIDs, not based on the source IP!
+
+== APN vs. Network Device ==
+
+The GTP-U driver creates a Linux network device for each Gi/SGi
+interface.
+
+[3GPP TS 29.281] calls the Gi/SGi reference point an interface. This
+may lead to the impression that the GGSN/P-GW can have only one such
+interface.
+
+In fact, the Gi/SGi reference point defines the interworking
+between the 3GPP packet domain (PDN), based on GTP-U tunnels, and
+IP-based networks.
+
+There is no provision in any of the 3GPP documents that limits the
+number of Gi/SGi interfaces implemented by a GGSN/P-GW.
+
+[3GPP TS 29.061] Section 11.3 makes it clear that the selection of a
+specific Gi/SGi interface is made through the Access Point Name
+(APN):
+
+> 2. each private network manages its own addressing. In general this
+>will result in different private networks having overlapping
+>address ranges. A logically separate connection (e.g. an IP in IP
+>tunnel or layer 2 virtual circuit) is used between the GGSN/P-GW
+>and each private network.
+>
+>In this case the IP address alone is not necessarily unique.  The
+>pair of values, Access Point Name (APN) and IPv4

Re: Per-CPU Queueing for QoS

2017-11-12 Thread Michael Ma

Any comments? We plan to implement this as a qdisc and appreciate any early 
feedback.

Thanks,
Michael

> On Nov 9, 2017, at 5:20 PM, Michael Ma  wrote:
> 
> Currently txq/qdisc selection is based on flow hash so packets from
> the same flow will follow the order when they enter qdisc/txq, which
> avoids out-of-order problem.
> 
> To improve the concurrency of the QoS algorithm we plan to have multiple
> per-cpu queues for a single TC class and do busy polling from a
> per-class thread to drain these queues. If we can do this frequently
> enough the out-of-order situation in this polling thread should not be
> that bad.
> 
> To give more details - in the send path we introduce per-cpu per-class
> queues so that packets from the same class and same core will be
> enqueued to the same place. Then a per-class thread polls the queues
> belonging to its class from all the cpus and aggregates them into
> another per-class queue. This can effectively reduce contention but
> inevitably introduces a potential out-of-order issue.
> 
> Any concern/suggestion for working towards this direction?
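
The scheme described above can be sketched in user space as follows. This
is a minimal, single-threaded illustration and not the proposed qdisc: the
names (struct class_queues, percpu_enqueue, class_drain), the CPU count and
the ring size are all assumptions made for the example, and a real
implementation would need proper synchronization between the per-cpu
producers and the polling thread.

```c
#include <stddef.h>

#define NR_CPUS   4   /* hypothetical CPU count */
#define RING_SIZE 8   /* hypothetical per-CPU ring capacity */

struct pkt {
	int flow;
	int seq;
};

/* One ring per CPU for a given class; only that CPU enqueues to it. */
struct cpu_ring {
	struct pkt slots[RING_SIZE];
	size_t head;	/* next slot to dequeue */
	size_t tail;	/* next slot to enqueue */
};

struct class_queues {
	struct cpu_ring percpu[NR_CPUS];
};

/* Send path: enqueue on the ring owned by the sending CPU.
 * Returns 0 on success, -1 if that ring is full.
 */
static int percpu_enqueue(struct class_queues *cq, int cpu, struct pkt p)
{
	struct cpu_ring *r = &cq->percpu[cpu];

	if (r->tail - r->head == RING_SIZE)
		return -1;
	r->slots[r->tail % RING_SIZE] = p;
	r->tail++;
	return 0;
}

/* Per-class poller: one pass over all CPUs, moving pending packets
 * into out[] (the aggregated per-class queue). Returns the count moved.
 */
static size_t class_drain(struct class_queues *cq, struct pkt *out, size_t max)
{
	size_t n = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		struct cpu_ring *r = &cq->percpu[cpu];

		while (r->head != r->tail && n < max) {
			out[n++] = r->slots[r->head % RING_SIZE];
			r->head++;
		}
	}
	return n;
}
```

If a flow moves from CPU 2 to CPU 0 between two packets, a drain pass that
visits CPU 0 first emits the later packet before the earlier one, which is
the out-of-order risk mentioned above.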

Re: [kernel-hardening] [PATCH v4] scripts: add leaking_addresses.pl

2017-11-12 Thread Tobin C. Harding

On Sun, Nov 12, 2017 at 10:02:55AM -0800, Frank Rowand wrote:
> Hi Michael,
> 
> On 11/12/17 03:49, Michael Ellerman wrote:
> > Hi Frank,
> > 
> > Frank Rowand  writes:
> >> Hi Michael, Tobin,
> >>
> >> On 11/08/17 04:10, Michael Ellerman wrote:
> >>> "Tobin C. Harding"  writes:
> >>>> Currently we are leaking addresses from the kernel to user space. This
> >>>> script is an attempt to find some of those leakages. Script parses
> >>>> `dmesg` output and /proc and /sys files for hex strings that look like
> >>>> kernel addresses.
> >>>>
> >>>> Only works for 64 bit kernels, the reason being that kernel addresses
> >>>> on 64 bit kernels have 'ffff' as the leading bit pattern making grepping
> >>>> possible.
> >>>
> >>> That doesn't work super well on other architectures :D
> >>>
> >>> I don't speak perl but presumably you can check the arch somehow and
> >>> customise the regex?
> >>>
> >>> ...
> >>>> +# Return _all_ non false positive addresses from $line.
> >>>> +sub extract_addresses
> >>>> +{
> >>>> +my ($line) = @_;
> >>>> +my $address = '\b(0x)?ffff[[:xdigit:]]{12}\b';
> >>>
> >>> On 64-bit powerpc (ppc64/ppc64le) we'd want:
> >>>
> >>> +my $address = '\b(0x)?[89abcdef]00[[:xdigit:]]{13}\b';
> >>>
> >>>
> >>>> +# Do not parse these files (absolute path).
> >>>> +my @skip_parse_files_abs = ('/proc/kmsg',
> >>>> +'/proc/kcore',
> >>>> +'/proc/fs/ext4/sdb1/mb_groups',
> >>>> +'/proc/1/fd/3',
> >>>> +'/sys/kernel/debug/tracing/trace_pipe',
> >>>> +'/sys/kernel/security/apparmor/revision');
> >>>
> >>> Can you add:
> >>>
> >>>   /sys/firmware/devicetree
> >>>
> >>> and/or /proc/device-tree (which is a symlink to the above).
> >>
> >> /proc/device-tree is a symlink to /sys/firmware/devicetree/base
> > 
> > Oh yep, forgot about the base part.
> > 
> >> /sys/firmware contains
> >>fdt  -- the flattened device tree that was passed to the
> >>kernel on boot
> >>devicetree/base/ -- the data that is currently in the live device tree.
> >>This live device tree is represented as directories
> >>and files beneath base/
> >>
> >> The information in fdt is directly available in the kernel source tree
> > 
> > On ARM that might be true, but not on powerpc.

Looks like we should be considering architecture specific lists for
files/directories to skip.

thanks,
Tobin.
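
As an illustration of what an architecture-specific match could look like,
here is a small C sketch (the script itself is Perl; C is used here only
for the example). It assumes the x86_64 pattern is a leading ffff followed
by 12 hex digits, and takes the ppc64 pattern from Michael's suggestion
above. The enum and the function name are hypothetical, and like the
regexes it only accepts lower-case prefixes.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical architecture selector for this example only. */
enum arch { ARCH_X86_64, ARCH_PPC64 };

/* Return true if s (optionally prefixed with "0x") is exactly 16 hex
 * digits whose leading digits match the pattern assumed for arch.
 */
static bool looks_like_kernel_address(const char *s, enum arch arch)
{
	if (strncmp(s, "0x", 2) == 0)
		s += 2;

	if (strlen(s) != 16)
		return false;

	for (size_t i = 0; i < 16; i++)
		if (!isxdigit((unsigned char)s[i]))
			return false;

	switch (arch) {
	case ARCH_X86_64:
		/* ffff followed by 12 more hex digits */
		return strncmp(s, "ffff", 4) == 0;
	case ARCH_PPC64:
		/* [89abcdef]00 followed by 13 more hex digits */
		return strchr("89abcdef", s[0]) != NULL &&
		       s[1] == '0' && s[2] == '0';
	}
	return false;
}
```

The same split could be applied to the skip lists: one table of paths per
architecture, selected once at startup.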

[PATCH RfC 2/2] net: phy: core: don't disable device interrupts in phy_change

2017-11-12 Thread Heiner Kallweit

If state is not PHY_HALTED I see no need to temporarily disable
interrupts on the device. As long as the current interrupt isn't acked
on the device no new interrupt can happen anyway.

In addition remove an unneeded enabling of interrupts in the state
machine when handling state PHY_CHANGELINK.

Tested on an Odroid-C2 with RTL8211F phy in interrupt mode.

Signed-off-by: Heiner Kallweit 
---
 drivers/net/phy/phy.c | 19 ++++++-------------
 1 file changed, 6 insertions(+), 13 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index b3784c9a2..4a11de8cb 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -727,8 +727,9 @@ void phy_change(struct phy_device *phydev)
!phydev->drv->did_interrupt(phydev))
return;
 
-   if (phy_disable_interrupts(phydev))
-   goto phy_err;
+   if (phydev->state == PHY_HALTED)
+   if (phy_disable_interrupts(phydev))
+   goto phy_err;
}
 
mutex_lock(&phydev->lock);
@@ -736,15 +737,11 @@ void phy_change(struct phy_device *phydev)
phydev->state = PHY_CHANGELINK;
mutex_unlock(&phydev->lock);
 
-   if (phy_interrupt_is_valid(phydev)) {
-   /* Reenable interrupts */
-   if (PHY_HALTED != phydev->state &&
-   phy_config_interrupt(phydev, PHY_INTERRUPT_ENABLED))
-   goto phy_err;
-   }
-
/* reschedule state queue work to run as soon as possible */
phy_trigger_machine(phydev, true);
+
+   if (phy_interrupt_is_valid(phydev) && phy_clear_interrupt(phydev))
+   goto phy_err;
return;
 
 phy_err:
@@ -984,10 +981,6 @@ void phy_state_machine(struct work_struct *work)
phydev->state = PHY_NOLINK;
phy_link_down(phydev, true);
}
-
-   if (phy_interrupt_is_valid(phydev))
-   err = phy_config_interrupt(phydev,
-  PHY_INTERRUPT_ENABLED);
break;
case PHY_HALTED:
if (phydev->link) {
-- 
2.15.0

[PATCH RfC 1/2] net: phy: core: remove now uneeded disabling of interrupts

2017-11-12 Thread Heiner Kallweit

After commits c974bdbc3e "net: phy: Use threaded IRQ, to allow IRQ from
sleeping devices" and 664fcf123a30 "net: phy: Threaded interrupts allow
some simplification" all relevant code pieces run in process context
anyway and I don't think we need the disabling of interrupts any longer.

Interestingly enough, the latter commit already removed the comment
explaining why interrupts need to be temporarily disabled.

On my system phy interrupt mode works fine with this patch.
However, I may be missing something, especially in the context of shared
phy interrupts, so I'd appreciate it if more people could test this.

Signed-off-by: Heiner Kallweit 
---
 drivers/net/phy/phy.c | 26 ++------------------------
 include/linux/phy.h   |  1 -
 2 files changed, 2 insertions(+), 25 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 2b1e67bc1..b3784c9a2 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -629,9 +629,6 @@ static irqreturn_t phy_interrupt(int irq, void *phy_dat)
if (PHY_HALTED == phydev->state)
return IRQ_NONE;/* It can't be ours.  */
 
-   disable_irq_nosync(irq);
-   atomic_inc(&phydev->irq_disable);
-
phy_change(phydev);
 
return IRQ_HANDLED;
@@ -689,7 +686,6 @@ static int phy_disable_interrupts(struct phy_device *phydev)
  */
 int phy_start_interrupts(struct phy_device *phydev)
 {
-   atomic_set(&phydev->irq_disable, 0);
if (request_threaded_irq(phydev->irq, NULL, phy_interrupt,
 IRQF_ONESHOT | IRQF_SHARED,
 phydev_name(phydev), phydev) < 0) {
@@ -716,13 +712,6 @@ int phy_stop_interrupts(struct phy_device *phydev)
 
free_irq(phydev->irq, phydev);
 
-   /* If work indeed has been cancelled, disable_irq() will have
-* been left unbalanced from phy_interrupt() and enable_irq()
-* has to be called so that other devices on the line work.
-*/
-   while (atomic_dec_return(&phydev->irq_disable) >= 0)
-   enable_irq(phydev->irq);
-
return err;
 }
 EXPORT_SYMBOL(phy_stop_interrupts);
@@ -736,7 +725,7 @@ void phy_change(struct phy_device *phydev)
if (phy_interrupt_is_valid(phydev)) {
if (phydev->drv->did_interrupt &&
!phydev->drv->did_interrupt(phydev))
-   goto ignore;
+   return;
 
if (phy_disable_interrupts(phydev))
goto phy_err;
@@ -748,27 +737,16 @@ void phy_change(struct phy_device *phydev)
mutex_unlock(&phydev->lock);
 
if (phy_interrupt_is_valid(phydev)) {
-   atomic_dec(&phydev->irq_disable);
-   enable_irq(phydev->irq);
-
/* Reenable interrupts */
if (PHY_HALTED != phydev->state &&
phy_config_interrupt(phydev, PHY_INTERRUPT_ENABLED))
-   goto irq_enable_err;
+   goto phy_err;
}
 
/* reschedule state queue work to run as soon as possible */
phy_trigger_machine(phydev, true);
return;
 
-ignore:
-   atomic_dec(&phydev->irq_disable);
-   enable_irq(phydev->irq);
-   return;
-
-irq_enable_err:
-   disable_irq(phydev->irq);
-   atomic_inc(&phydev->irq_disable);
 phy_err:
phy_error(phydev);
 }
diff --git a/include/linux/phy.h b/include/linux/phy.h
index dc82a07cb..8a87e441f 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -468,7 +468,6 @@ struct phy_device {
/* Interrupt and Polling infrastructure */
struct work_struct phy_queue;
struct delayed_work state_queue;
-   atomic_t irq_disable;
 
struct mutex lock;
 
-- 
2.15.0

Re: [PATCH v7 net-next 00/13] gtp: Additional feature support - Part I

2017-11-12 Thread Harald Welte

Hi Tom,

sorry for the delayed response.  But I remain committed to pushing
the non-controversial part of your GTP patches forward.

On Sat, Oct 28, 2017 at 06:47:59PM +0200, Harald Welte wrote:
> Thanks.  As indicated, I'm planning some testing later this weekend on
> the non-IPv6 patches, and am happy to add my Acked-by and/or re-submit
> those to Dave after that.

After some more delays and returning from netdev 2.2, I've finally put
together a testing setup and successfully (manually) tested with the
following patches:

01/13 vxlan: Move gro_cells_init to ndo_init
02/13 iptunnel: Add common functions to get a tunnel route
04/13 gtp: Call common functions to get tunnel routes and add dst_cache
05/13 iptunnel: Generalize tunnel update pmtu
06/13 gtp: Change to use gro_cells
07/13 gtp: Use goto for exceptions in gtp_udp_encap_recv funcs
08/13 gtp: udp recv clean up
09/13 gtp: Call function to update path mtu
10/13 gtp: Eliminate pktinfo and add port configuration

I hereby acknowledge those patches.  How should we proceed?  Should I

a) do nothing, you will add Acked-By and re-submit?

b) send an individual Acked-By in a reply to each related patch here on
   netdev and you will re-submit those patches?

c) simply create a rebased set from those patches and
   re-submit them to the list for net-next myself, with the Acked-by?

d) be preposterous and provide a gtp git tree for DaveM to pull from?

As discussed before, I will not merge/ack IPv6 until we have an
implementation that is interoperable.  I have a TODO list of other
bugfixes and improvements for Kernel GTP, but I'm hopeful that IPv6 can
still be addressed before the end of 2017.

Regards,
Harald
-- 
- Harald Welte    http://laforge.gnumonks.org/

"Privacy in residential applications is a desirable marketing option."
  (ETSI EN 300 175-7 Ch. A6)

linux-next: Signed-off-by missing for commit in the net-next tree

2017-11-12 Thread Stephen Rothwell

Hi all,

Commit

  cbad52e92ad7 ("sfc: don't warn on successful change of MAC")

is missing a Signed-off-by from its author.

-- 
Cheers,
Stephen Rothwell

Re: SRIOV switchdev mode BoF minutes

2017-11-12 Thread Alexander Duyck

On Sun, Nov 12, 2017 at 11:49 AM, Or Gerlitz  wrote:
> Hi Dave and all,
>
> During and after the BoF on SRIOV switchdev mode, we came to a
> consensus among the developers from four different HW vendors (CC
> audience) that the correct thing to do would be to disallow any new
> extensions to the legacy mode.
>
> The idea is to put focus on the new mode and not add new UAPIs and
> kernel code for what turned out to be a wrong design, one that does not
> allow for properly offloading a kernel switching SW model to e-switch HW.

I would have to disagree with this. For devices such as the 82599 that
don't have a true switch, this may limit future functionality since
we can't move them over to switchdev mode. For example one thing I may
need to add is the ability to disable multicast and broadcast receive
on a per-VF basis at some point in the future.

You may not recall, but we tried to transition the i40e driver over to
switchdev. The parts supported by i40e have a much more robust L2
forwarding framework than the 82599, and the result was we were told
that while we might look at doing port representors some other way,
there was no way we could use switchdev since the hardware couldn't
support the requirements of switchdev in terms of default routes and
forwarding behavior. I am planning to resolve the port representor
issue by looking at coming up with something like a "source mode"
macvlan based port representor. I figure that is probably the closest
match for what the Intel hardware does since really the VFs are
nothing more than a physical macvlan interface in and of themselves as
the hardware doesn't have a full switch.

> We also had a good session the day after regarding alignment for the
> representation model of the uplink (physical port) and PF/s.
>
> The VF representor netdevs  exist for all drivers that support the new
> mode but the representation for the uplink and PF wasn't the same for
> all. The decision was to represent the uplink and PFs vports in the
> same manner done for VFs, using rep netdevs. This alignment would
> provide a more strict and clear view of the kernel model for e-switch
> to users and upper layer control plane SW.
>
> Or.

This part sounds fine.

- Alex

[PATCH] ipv6: sr: update the struct ipv6_sr_hdr

2017-11-12 Thread Ahmed Abdelsalam

The IPv6 Segment Routing Header (SRH) format has been updated starting
from revision 6 of the SRH IETF draft. The update includes the following
SRH fields:

(1) The "First Segment" field changed to be "Last Entry" which contains 
the index, in the Segment List, of the last element of the Segment List.

(2) The 16-bit "reserved" field is now used as a "tag", which marks a packet
as part of a class or group of packets, e.g., packets sharing the same
set of properties.

This patch updates the struct ipv6_sr_hdr so that it complies with the
updated SRH draft. It also updates the different parts of the kernel that
were using the old field names.

Signed-off-by: Ahmed Abdelsalam 
---
 This patch is tested by re-compiling the whole kernel after the changes.

 include/uapi/linux/seg6.h |  4 ++--
 net/ipv6/exthdrs.c|  2 +-
 net/ipv6/seg6.c   |  4 ++--
 net/ipv6/seg6_hmac.c  | 14 +++++++-------
 net/ipv6/seg6_iptunnel.c  |  4 ++--
 5 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/include/uapi/linux/seg6.h b/include/uapi/linux/seg6.h
index 2f6fb0d..3f4b3ab 100644
--- a/include/uapi/linux/seg6.h
+++ b/include/uapi/linux/seg6.h
@@ -26,9 +26,9 @@ struct ipv6_sr_hdr {
__u8hdrlen;
__u8type;
__u8segments_left;
-   __u8first_segment;
+   __u8last_entry;
__u8flags;
-   __u16   reserved;
+   __u16   tag;
 
struct in6_addr segments[0];
 };
diff --git a/net/ipv6/exthdrs.c b/net/ipv6/exthdrs.c
index 83bd757..d53af71 100644
--- a/net/ipv6/exthdrs.c
+++ b/net/ipv6/exthdrs.c
@@ -918,7 +918,7 @@ static void ipv6_push_rthdr4(struct sk_buff *skb, u8 *proto,
sr_phdr = skb_push(skb, plen);
memcpy(sr_phdr, sr_ihdr, sizeof(struct ipv6_sr_hdr));
 
-   hops = sr_ihdr->first_segment + 1;
+   hops = sr_ihdr->last_entry + 1;
memcpy(sr_phdr->segments + 1, sr_ihdr->segments + 1,
   (hops - 1) * sizeof(struct in6_addr));
 
diff --git a/net/ipv6/seg6.c b/net/ipv6/seg6.c
index c814077..3d5279d 100644
--- a/net/ipv6/seg6.c
+++ b/net/ipv6/seg6.c
@@ -40,10 +40,10 @@ bool seg6_validate_srh(struct ipv6_sr_hdr *srh, int len)
if (((srh->hdrlen + 1) << 3) != len)
return false;
 
-   if (srh->segments_left > srh->first_segment)
+   if (srh->segments_left > srh->last_entry)
return false;
 
-   tlv_offset = sizeof(*srh) + ((srh->first_segment + 1) << 4);
+   tlv_offset = sizeof(*srh) + ((srh->last_entry + 1) << 4);
 
trailing = len - tlv_offset;
if (trailing < 0)
diff --git a/net/ipv6/seg6_hmac.c b/net/ipv6/seg6_hmac.c
index 33fb35c..5107ebb 100644
--- a/net/ipv6/seg6_hmac.c
+++ b/net/ipv6/seg6_hmac.c
@@ -91,7 +91,7 @@ static struct sr6_tlv_hmac *seg6_get_tlv_hmac(struct ipv6_sr_hdr *srh)
 {
struct sr6_tlv_hmac *tlv;
 
-   if (srh->hdrlen < (srh->first_segment + 1) * 2 + 5)
+   if (srh->hdrlen < (srh->last_entry + 1) * 2 + 5)
return NULL;
 
if (!sr_has_hmac(srh))
@@ -175,8 +175,8 @@ int seg6_hmac_compute(struct seg6_hmac_info *hinfo, struct ipv6_sr_hdr *hdr,
 * hash function (RadioGatun) with up to 1216 bits
 */
 
-   /* saddr(16) + first_seg(1) + flags(1) + keyid(4) + seglist(16n) */
-   plen = 16 + 1 + 1 + 4 + (hdr->first_segment + 1) * 16;
+   /* saddr(16) + last_entry(1) + flags(1) + keyid(4) + seglist(16n) */
+   plen = 16 + 1 + 1 + 4 + (hdr->last_entry + 1) * 16;
 
/* this limit allows for 14 segments */
if (plen >= SEG6_HMAC_RING_SIZE)
@@ -186,7 +186,7 @@ int seg6_hmac_compute(struct seg6_hmac_info *hinfo, struct ipv6_sr_hdr *hdr,
 * as follows, in order:
 *
 * 1. Source IPv6 address (128 bits)
-* 2. first_segment value (8 bits)
+* 2. last_entry value (8 bits)
 * 3. Flags (8 bits)
 * 4. HMAC Key ID (32 bits)
 * 5. All segments in the segments list (n * 128 bits)
@@ -200,8 +200,8 @@ int seg6_hmac_compute(struct seg6_hmac_info *hinfo, struct ipv6_sr_hdr *hdr,
memcpy(off, saddr, 16);
off += 16;
 
-   /* first_segment value */
-   *off++ = hdr->first_segment;
+   /* last_entry value */
+   *off++ = hdr->last_entry;
 
/* flags */
*off++ = hdr->flags;
@@ -211,7 +211,7 @@ int seg6_hmac_compute(struct seg6_hmac_info *hinfo, struct ipv6_sr_hdr *hdr,
off += 4;
 
/* all segments in the list */
-   for (i = 0; i < hdr->first_segment + 1; i++) {
+   for (i = 0; i < hdr->last_entry + 1; i++) {
memcpy(off, hdr->segments + i, 16);
off += 16;
}
diff --git a/net/ipv6/seg6_iptunnel.c b/net/ipv6/seg6_iptunnel.c
index bd6cc68..fc9813e 100644
--- a/net/ipv6/seg6_iptunnel.c
+++ b/net/ipv6/seg6_iptunnel.c
@@ -133,7 +133,7 @@ int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh, int proto)
 
isrh->nexthdr = proto;
 
-   hdr->daddr =

Re: [PATCH] net: phy: realtek: fix RTL8211F interrupt mode

2017-11-12 Thread Jerome Brunet

On Sun, 2017-11-12 at 21:06 +0100, Andrew Lunn wrote:
> On Sun, Nov 12, 2017 at 07:36:48PM +0100, Jerome Brunet wrote:
> > On Sun, 2017-11-12 at 19:25 +0100, Andrew Lunn wrote:
> > > On Sun, Nov 12, 2017 at 04:16:04PM +0100, Heiner Kallweit wrote:
> > > > After commit b94d22d94ad22 "ARM64: dts: meson-gx: add external PHY
> > > > interrupt on some platforms" ethernet stopped working on my Odroid-C2
> > > > which has a RTL8211F phy.
> > > 
> > > Hi Jerome
> > > 
> > > Please could you test this. I just want to be sure we don't introduce
> > > a regression by breaking the boards you tested on.
> > 
> > Sure I'll try it tomorrow.
> > 
> > When I tested this, I was more focused on the SoC side of it (the interrupt
> > controller itself) and whether the interrupt worked or not. The board (p200) I
> > tested on used a Micrel PHY and worked well ...
> 
> Hi Jerome
> 
> Ah, O.K. Do you have a board with a RTL8211?
I do have some. I'll confirm Heiner's report and fix tomorrow.

> If all your boards use a
> different PHY, there is no chance of a regression for you.
I'm not expecting any, quite the contrary actually.

> 
> Thanks
>   Andrew

Re: [PATCH] net: phy: realtek: fix RTL8211F interrupt mode

2017-11-12 Thread Andrew Lunn

On Sun, Nov 12, 2017 at 07:36:48PM +0100, Jerome Brunet wrote:
> On Sun, 2017-11-12 at 19:25 +0100, Andrew Lunn wrote:
> > On Sun, Nov 12, 2017 at 04:16:04PM +0100, Heiner Kallweit wrote:
> > > After commit b94d22d94ad22 "ARM64: dts: meson-gx: add external PHY
> > > interrupt on some platforms" ethernet stopped working on my Odroid-C2
> > > which has a RTL8211F phy.
> > 
> > Hi Jerome
> > 
> > Please could you test this. I just want to be sure we don't introduce
> > a regression by breaking the boards you tested on.
> 
> Sure I'll try it tomorrow.
> 
> When I tested this, I was more focused on the SoC side of it (the interrupt
> controller itself) and whether the interrupt worked or not. The board (p200) I
> tested on used a Micrel PHY and worked well ...

Hi Jerome

Ah, O.K. Do you have a board with a RTL8211? If all your boards use a
different PHY, there is no chance of a regression for you.

Thanks
Andrew

[PATCH net-next] decnet: move to staging

2017-11-12 Thread Stephen Hemminger

Support for Decnet has been orphaned for many years.
In the interest of reducing the potential bug surface and pre-holiday
cleaning, move the decnet protocol into staging for eventual removal.

Signed-off-by: Stephen Hemminger 
---
 MAINTAINERS  | 2 +-
 drivers/staging/Kconfig  | 5 +
 drivers/staging/Makefile | 1 +
 {net => drivers/staging}/decnet/Kconfig  | 0
 {net => drivers/staging}/decnet/Makefile | 0
 {net => drivers/staging}/decnet/README   | 0
 {net => drivers/staging}/decnet/TODO | 0
 {net => drivers/staging}/decnet/af_decnet.c  | 0
 {net => drivers/staging}/decnet/dn_dev.c | 0
 {net => drivers/staging}/decnet/dn_fib.c | 0
 {net => drivers/staging}/decnet/dn_neigh.c   | 0
 {net => drivers/staging}/decnet/dn_nsp_in.c  | 0
 {net => drivers/staging}/decnet/dn_nsp_out.c | 0
 {net => drivers/staging}/decnet/dn_route.c   | 0
 {net => drivers/staging}/decnet/dn_rules.c   | 0
 {net => drivers/staging}/decnet/dn_table.c   | 0
 {net => drivers/staging}/decnet/dn_timer.c   | 0
 {net => drivers/staging}/decnet/netfilter/Kconfig| 0
 {net => drivers/staging}/decnet/netfilter/Makefile   | 0
 {net => drivers/staging}/decnet/netfilter/dn_rtmsg.c | 0
 {net => drivers/staging}/decnet/sysctl_net_decnet.c  | 0
 net/Kconfig  | 2 --
 net/Makefile | 1 -
 23 files changed, 7 insertions(+), 4 deletions(-)
 rename {net => drivers/staging}/decnet/Kconfig (100%)
 rename {net => drivers/staging}/decnet/Makefile (100%)
 rename {net => drivers/staging}/decnet/README (100%)
 rename {net => drivers/staging}/decnet/TODO (100%)
 rename {net => drivers/staging}/decnet/af_decnet.c (100%)
 rename {net => drivers/staging}/decnet/dn_dev.c (100%)
 rename {net => drivers/staging}/decnet/dn_fib.c (100%)
 rename {net => drivers/staging}/decnet/dn_neigh.c (100%)
 rename {net => drivers/staging}/decnet/dn_nsp_in.c (100%)
 rename {net => drivers/staging}/decnet/dn_nsp_out.c (100%)
 rename {net => drivers/staging}/decnet/dn_route.c (100%)
 rename {net => drivers/staging}/decnet/dn_rules.c (100%)
 rename {net => drivers/staging}/decnet/dn_table.c (100%)
 rename {net => drivers/staging}/decnet/dn_timer.c (100%)
 rename {net => drivers/staging}/decnet/netfilter/Kconfig (100%)
 rename {net => drivers/staging}/decnet/netfilter/Makefile (100%)
 rename {net => drivers/staging}/decnet/netfilter/dn_rtmsg.c (100%)
 rename {net => drivers/staging}/decnet/sysctl_net_decnet.c (100%)

diff --git a/MAINTAINERS b/MAINTAINERS
index 29aa89a1837b..66e2d302d9eb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3968,7 +3968,7 @@ W:http://linux-decnet.sourceforge.net
 L: linux-decnet-u...@lists.sourceforge.net
 S: Orphan
 F: Documentation/networking/decnet.txt
-F: net/decnet/
+F: drivers/staging/decnet/
 
 DECSTATION PLATFORM SUPPORT
 M: "Maciej W. Rozycki" 
diff --git a/drivers/staging/Kconfig b/drivers/staging/Kconfig
index 554683912cff..e30af73c3797 100644
--- a/drivers/staging/Kconfig
+++ b/drivers/staging/Kconfig
@@ -30,6 +30,11 @@ source "drivers/staging/wlan-ng/Kconfig"
 
 source "drivers/staging/comedi/Kconfig"
 
+if NETFILTER
+source "drivers/staging/decnet/netfilter/Kconfig"
+endif
+source "drivers/staging/decnet/Kconfig"
+
 source "drivers/staging/olpc_dcon/Kconfig"
 
 source "drivers/staging/rtl8192u/Kconfig"
diff --git a/drivers/staging/Makefile b/drivers/staging/Makefile
index 6e536020029a..89655cc80a91 100644
--- a/drivers/staging/Makefile
+++ b/drivers/staging/Makefile
@@ -7,6 +7,7 @@ obj-$(CONFIG_IRDA)  += irda/net/
 obj-$(CONFIG_IRDA) += irda/drivers/
 obj-$(CONFIG_PRISM2_USB)   += wlan-ng/
 obj-$(CONFIG_COMEDI)   += comedi/
+obj-$(CONFIG_DECNET)   += decnet/
 obj-$(CONFIG_FB_OLPC_DCON) += olpc_dcon/
 obj-$(CONFIG_RTL8192U) += rtl8192u/
 obj-$(CONFIG_RTL8192E) += rtl8192e/
diff --git a/net/decnet/Kconfig b/drivers/staging/decnet/Kconfig
similarity index 100%
rename from net/decnet/Kconfig
rename to drivers/staging/decnet/Kconfig
diff --git a/net/decnet/Makefile b/drivers/staging/decnet/Makefile
similarity index 100%
rename from net/decnet/Makefile
rename to drivers/staging/decnet/Makefile
diff --git a/net/decnet/README b/drivers/staging/decnet/README
similarity index 100%
rename from net/decnet/README
rename to drivers/staging/decnet/README
diff --git a/net/decnet/TODO b/drivers/staging/decnet/TODO
similarity index 100%
rename from net/decnet/TODO
rename to drivers/staging/decnet/TODO
diff --git a/net/decnet/af_decnet.c b/drivers/staging/decnet/af_decnet.c
similarity index 100%
rename from net/decnet/af_decnet.c
rename to drivers/staging/decnet/af_decnet.c
diff --git a/net/decnet/dn_dev.c

SRIOV switchdev mode BoF minutes

2017-11-12 Thread Or Gerlitz

Hi Dave and all,

During and after the BoF on SRIOV switchdev mode, we came to a
consensus among the developers from four different HW vendors (CC
audience) that the correct thing to do would be to disallow any new
extensions to the legacy mode.

The idea is to put focus on the new mode and not add new UAPIs and
kernel code for what turned out to be a wrong design, one that does not
allow for properly offloading a kernel switching SW model to e-switch HW.

We also had a good session the day after regarding alignment for the
representation model of the uplink (physical port) and PF/s.

The VF representor netdevs  exist for all drivers that support the new
mode but the representation for the uplink and PF wasn't the same for
all. The decision was to represent the uplink and PFs vports in the
same manner done for VFs, using rep netdevs. This alignment would
provide a more strict and clear view of the kernel model for e-switch
to users and upper layer control plane SW.

Or.

[PATCH v5 13/13] xfrm6_tunnel: exit_net cleanup check added

2017-11-12 Thread Vasily Averin

Be sure that the spi_byaddr and spi_byspi arrays initialized in the
net_init hook were returned to their initial state.

Signed-off-by: Vasily Averin 
---
 net/ipv6/xfrm6_tunnel.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/net/ipv6/xfrm6_tunnel.c b/net/ipv6/xfrm6_tunnel.c
index 4e438bc..f85f0d7 100644
--- a/net/ipv6/xfrm6_tunnel.c
+++ b/net/ipv6/xfrm6_tunnel.c
@@ -338,6 +338,14 @@ static int __net_init xfrm6_tunnel_net_init(struct net *net)
 
 static void __net_exit xfrm6_tunnel_net_exit(struct net *net)
 {
+   struct xfrm6_tunnel_net *xfrm6_tn = xfrm6_tunnel_pernet(net);
+   unsigned int i;
+
+   for (i = 0; i < XFRM6_TUNNEL_SPI_BYADDR_HSIZE; i++)
+   WARN_ON_ONCE(!hlist_empty(&xfrm6_tn->spi_byaddr[i]));
+
+   for (i = 0; i < XFRM6_TUNNEL_SPI_BYSPI_HSIZE; i++)
+   WARN_ON_ONCE(!hlist_empty(&xfrm6_tn->spi_byspi[i]));
 }
 
 static struct pernet_operations xfrm6_tunnel_net_ops = {
-- 
2.7.4
