Re: [PATCH net-next v4 0/3] net: mpls: fragmentation and gso fixes for locally originated traffic
From: David Ahern
Date: Wed, 24 Aug 2016 20:10:42 -0700

> This series fixes mtu and fragmentation for tunnels using lwtunnel
> output redirect, and fixes GSO for MPLS for locally originated traffic
> reported by Lennert Buytenhek.
>
> A follow-on series will address fragmentation and GSO for forwarded
> MPLS traffic. Hardware offload of GSO with MPLS also needs to be
> addressed.
>
> Simon: Can you verify this works with OVS for single and multiple
> labels?

Series applied, thanks.
Re: [PATCH net-next] net: batch calls to flush_all_backlogs()
From: Eric Dumazet
Date: Fri, 26 Aug 2016 12:50:39 -0700

> From: Eric Dumazet
>
> After commit 145dd5f9c88f ("net: flush the softnet backlog in process
> context"), we can easily batch calls to flush_all_backlogs() for all
> devices processed in rollback_registered_many()
>
> Tested: ...
> Signed-off-by: Eric Dumazet

Applied, thanks.
Re: [net-next] ixgbe: Eliminate useless message and improve logic
From: Jeff Kirsher
Date: Tue, 30 Aug 2016 11:33:43 -0700

> From: Mark Rustad
>
> Remove a useless log message and improve the logic for setting
> a PHY address from the contents of the MNG_IF_SEL register.
>
> Signed-off-by: Mark Rustad
> Tested-by: Andrew Bowers
> Signed-off-by: Jeff Kirsher

Applied.
Re: [PATCH net-next 0/8] rxrpc: Preparation for removal of use of skbs from AFS
From: David Howells <dhowe...@redhat.com>
Date: Tue, 30 Aug 2016 16:41:37 +0100

> Here's a set of patches that prepare the way for the removal of the use of
> sk_buffs from fs/afs (they'll be entirely retained within net/rxrpc):
>
> (1) Fix a potential NULL-pointer deref in rxrpc_abort_calls().
>
> (2) Condense all the terminal call state machine states to a single one
>     plus supplementary info.
>
> (3) Add a trace point for rxrpc call usage debugging.
>
> (4) Cleanups and missing headers.
>
> (5) Provide a way for AFS to ask about a call's peer address without
>     having an sk_buff to query.
>
> (6) Use call->peer directly rather than going via call->conn (which might
>     be NULL).
>
> (7) Pass struct socket * to various rxrpc kernel interface functions so
>     they can use that directly rather than getting it from the rxrpc_call
>     struct.
> ...
> Tagged thusly:
>
> 	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
> 	rxrpc-rewrite-20160830-1

Pulled, thanks David.
Re: [PATCH 0/7] Netfilter fixes for net
From: Pablo Neira Ayuso
Date: Tue, 30 Aug 2016 13:26:16 +0200

> The following patchset contains Netfilter fixes for your net tree,
> they are: ...
>
> You can pull these changes from:
>
> 	git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Pulled, thanks a lot Pablo.
[PATCH net-next] rtnetlink: fdb dump: optimize by saving last interface markers
From: Roopa Prabhu

fdb dumps spanning multiple skbs currently restart from the first interface again for every skb. This results in unnecessary iterations over the already-visited interfaces and their fdb entries. In large-scale setups we have seen this slow down fdb dumps considerably. On a system with 30k MACs we see fdb dumps spanning more than 300 skbs.

To fix the problem, this patch replaces the existing single fdb marker with three markers: netdev hash entry, netdev, and fdb index, so a dump continues where it left off instead of restarting from the first netdev. This is consistent with link dumps.

In the process of fixing the performance issue, this patch also re-implements the fix done by commit 472681d57a5d ("net: ndo_fdb_dump should report -EMSGSIZE to rtnl_fdb_dump") (with an internal fix from Wilson Kok) in the following ways:
- change ndo_fdb_dump handlers to return an error code instead of the last fdb index
- use cb->args strictly for dump frag markers and not error codes, consistent with other dump functions
Below results were taken on a system with 1000 netdevs and 35085 fdb entries:

before patch:
$ time bridge fdb show | wc -l
15065
real	1m11.791s
user	0m0.070s
sys	1m8.395s
(existing code does not return all macs)

after patch:
$ time bridge fdb show | wc -l
35085
real	0m2.017s
user	0m0.113s
sys	0m1.942s

Signed-off-by: Roopa Prabhu
Signed-off-by: Wilson Kok
---
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c |   7 +-
 drivers/net/vxlan.c                              |  14 ++-
 include/linux/netdevice.h                        |   4 +-
 include/linux/rtnetlink.h                        |   2 +-
 include/net/switchdev.h                          |   4 +-
 net/bridge/br_fdb.c                              |  23 ++---
 net/bridge/br_private.h                          |   2 +-
 net/core/rtnetlink.c                             | 105 ++-
 net/switchdev/switchdev.c                        |  10 +--
 9 files changed, 98 insertions(+), 73 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
index 3ebef27..3ae3968 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
@@ -432,18 +432,19 @@ static int qlcnic_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
 static int qlcnic_fdb_dump(struct sk_buff *skb, struct netlink_callback *ncb,
 			   struct net_device *netdev,
-			   struct net_device *filter_dev, int idx)
+			   struct net_device *filter_dev, int *idx)
 {
 	struct qlcnic_adapter *adapter = netdev_priv(netdev);
+	int err = 0;
 
 	if (!adapter->fdb_mac_learn)
 		return ndo_dflt_fdb_dump(skb, ncb, netdev, filter_dev, idx);
 
 	if ((adapter->flags & QLCNIC_ESWITCH_ENABLED) ||
 	    qlcnic_sriov_check(adapter))
-		idx = ndo_dflt_fdb_dump(skb, ncb, netdev, filter_dev, idx);
+		err = ndo_dflt_fdb_dump(skb, ncb, netdev, filter_dev, idx);
 
-	return idx;
+	return err;
 }
 
 static void qlcnic_82xx_cancel_idc_work(struct qlcnic_adapter *adapter)
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index c0dda6f..f5b381d 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -861,20 +861,20 @@ out:
 /* Dump forwarding table */
 static int vxlan_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
 			  struct net_device *dev,
-			  struct net_device *filter_dev, int idx)
+			  struct net_device *filter_dev, int *idx)
 {
 	struct vxlan_dev *vxlan = netdev_priv(dev);
 	unsigned int h;
+	int err = 0;
 
 	for (h = 0; h < FDB_HASH_SIZE; ++h) {
 		struct vxlan_fdb *f;
-		int err;
 
 		hlist_for_each_entry_rcu(f, &vxlan->fdb_head[h], hlist) {
 			struct vxlan_rdst *rd;
 
 			list_for_each_entry_rcu(rd, &f->remotes, list) {
-				if (idx < cb->args[0])
+				if (*idx < cb->args[2])
 					goto skip;
 
 				err = vxlan_fdb_info(skb, vxlan, f,
@@ -882,17 +882,15 @@ static int vxlan_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
 						     cb->nlh->nlmsg_seq,
 						     RTM_NEWNEIGH,
 						     NLM_F_MULTI, rd);
-				if (err < 0) {
-					cb->args[1] = err;
+				if (err < 0)
 					goto out;
-				}
 skip:
-				++idx;
+
Re: [PATCH v3 0/5] meson: Meson8b and GXBB DWMAC glue driver
From: Martin Blumenstingl
Date: Tue, 30 Aug 2016 20:49:28 +0200

> On Mon, Aug 29, 2016 at 5:40 AM, David Miller wrote:
>> From: Martin Blumenstingl
>> Date: Sun, 28 Aug 2016 18:16:32 +0200
>>
>>> This adds a DWMAC glue driver for the PRG_ETHERNET registers found in
>>> Meson8b and GXBB SoCs. Compared to the "old" meson6b-dwmac glue driver
>>> the register layout is completely different.
>>> Thus I introduced a separate driver.
>>>
>>> Changes since v2:
>>> - fixed unloading the glue driver when built as a module. This pulls in a
>>>   patch from Joachim Eastwood (thanks) to get our private data structure
>>>   (bsp_priv).
>>
>> This doesn't apply cleanly at all to the net-next tree, so I have
>> no idea where you expect these changes to be applied.
> OK, maybe Kevin can help me out here, as I think the patches should go
> to various trees.
>
> I think patches 1, 3 and 4 should go through the net-next tree (as
> these touch drivers/net/ethernet/stmicro/stmmac/ and the corresponding
> documentation).
> Patch 2 should probably go through clk-meson-gxbb / clk-next (just
> like the other clk changes we had).
> The last patch (patch 5) should probably go through the ARM SoC tree
> (just like the other dts changes we had).
>
> @David, Kevin: would this be fine for you?

I would prefer that all of the patches go through one tree; that way all the dependencies are satisfied in one place.
Re: pull-request: mac80211 2016-08-30
From: Johannes Berg
Date: Tue, 30 Aug 2016 08:19:18 +0200

> Nothing much, but we have three little fixes, see below. I've included the
> static inline so that BATMAN_ADV_BATMAN_V can be changed to be allowed w/o
> cfg80211 sooner, and it's a trivial change.
>
> Let me know if there's any problem.

Pulled, thanks Johannes.
Re: [PATCH net] rps: flow_dissector: Fix uninitialized flow_keys used in __skb_get_hash possibly
On Wed, Aug 31, 2016 at 12:14 PM, Eric Dumazet wrote:
> On Wed, 2016-08-31 at 10:56 +0800, f...@ikuai8.com wrote:
>> From: Gao Feng
>>
>> The original code depends on the function parameters being evaluated from
>> left to right. But the evaluation order of function arguments is not
>> defined by the C standard.
>>
>> When flow_keys_have_l4() is invoked before ___skb_get_hash(skb, &keys,
>> hashrnd) with some compilers or environments, the keys passed to
>> flow_keys_have_l4() are not initialized.
>>
>> Signed-off-by: Gao Feng
>> ---
>
> Good catch, please add
>
> Fixes: 6db61d79c1e1 ("flow_dissector: Ignore flow dissector return value from ___skb_get_hash")
> Acked-by: Eric Dumazet

Should I add it to the description and resend the patch?

Best Regards
Feng
Re: [PATCH net] rps: flow_dissector: Fix uninitialized flow_keys used in __skb_get_hash possibly
On Wed, 2016-08-31 at 10:56 +0800, f...@ikuai8.com wrote:
> From: Gao Feng
>
> The original code depends on the function parameters being evaluated from
> left to right. But the evaluation order of function arguments is not
> defined by the C standard.
>
> When flow_keys_have_l4() is invoked before ___skb_get_hash(skb, &keys,
> hashrnd) with some compilers or environments, the keys passed to
> flow_keys_have_l4() are not initialized.
>
> Signed-off-by: Gao Feng
> ---

Good catch, please add

Fixes: 6db61d79c1e1 ("flow_dissector: Ignore flow dissector return value from ___skb_get_hash")
Acked-by: Eric Dumazet
Re: [PATCH net-next 4/6] perf, bpf: add perf events core support for BPF_PROG_TYPE_PERF_EVENT programs
On Mon, Aug 29, 2016 at 02:17:18PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 26, 2016 at 07:31:22PM -0700, Alexei Starovoitov wrote:
> > +static int perf_event_set_bpf_handler(struct perf_event *event, u32 prog_fd)
> > +{
> > +	struct bpf_prog *prog;
> > +
> > +	if (event->overflow_handler_context)
> > +		/* hw breakpoint or kernel counter */
> > +		return -EINVAL;
> > +
> > +	if (event->prog)
> > +		return -EEXIST;
> > +
> > +	prog = bpf_prog_get_type(prog_fd, BPF_PROG_TYPE_PERF_EVENT);
> > +	if (IS_ERR(prog))
> > +		return PTR_ERR(prog);
> > +
> > +	event->prog = prog;
> > +	event->orig_overflow_handler = READ_ONCE(event->overflow_handler);
> > +	WRITE_ONCE(event->overflow_handler, bpf_overflow_handler);
> > +	return 0;
> > +}
> > +
> > +static void perf_event_free_bpf_handler(struct perf_event *event)
> > +{
> > +	struct bpf_prog *prog = event->prog;
> > +
> > +	if (!prog)
> > +		return;
>
> Does it make sense to do something like:
>
> 	WARN_ON_ONCE(event->overflow_handler != bpf_overflow_handler);

Yes, that's an implicit assumption here, but checking for it would be overkill. event->overflow_handler and event->prog are set back to back in two places and reset here once, together. Such a WARN_ON will only make people reading this code in the future think that this bit is too complex to analyze by hand.
> > +
> > +	WRITE_ONCE(event->overflow_handler, event->orig_overflow_handler);
> > +	event->prog = NULL;
> > +	bpf_prog_put(prog);
> > +}
> >
> >  static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
> >  {
> >  	bool is_kprobe, is_tracepoint;
> >  	struct bpf_prog *prog;
> >
> > +	if (event->attr.type == PERF_TYPE_HARDWARE ||
> > +	    event->attr.type == PERF_TYPE_SOFTWARE)
> > +		return perf_event_set_bpf_handler(event, prog_fd);
> > +
> >  	if (event->attr.type != PERF_TYPE_TRACEPOINT)
> >  		return -EINVAL;
> >
> > @@ -7647,6 +7711,8 @@ static void perf_event_free_bpf_prog(struct perf_event *event)
> >  {
> >  	struct bpf_prog *prog;
> >
> > +	perf_event_free_bpf_handler(event);
> > +
> >  	if (!event->tp_event)
> >  		return;
> >
>
> Does it at all make sense to merge the tp_event->prog thing into this
> new event->prog?

'struct trace_event_call *tp_event' is global while tp_event->perf_events are per cpu, so I don't see how we can do that without breaking user space logic. Right now users do a single perf_event_open of a kprobe and attach a prog that is executed on all cpus where the kprobe is firing. Additional per-cpu filtering is done from within the bpf prog.

> >  #ifdef CONFIG_HAVE_HW_BREAKPOINT
> > @@ -8957,6 +9029,14 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
> >  	if (!overflow_handler && parent_event) {
> >  		overflow_handler = parent_event->overflow_handler;
> >  		context = parent_event->overflow_handler_context;
> > +		if (overflow_handler == bpf_overflow_handler) {
> > +			event->prog = bpf_prog_inc(parent_event->prog);
> > +			event->orig_overflow_handler = parent_event->orig_overflow_handler;
> > +			if (IS_ERR(event->prog)) {
> > +				event->prog = NULL;
> > +				overflow_handler = NULL;
> > +			}
> > +		}
> >  	}
>
> Should we not fail the entire perf_event_alloc() call in that IS_ERR()
> case?

Yes. Good point. Will do.
Re: [RFC v2 09/10] landlock: Handle cgroups (performance)
On Tue, Aug 30, 2016 at 6:36 PM, Alexei Starovoitov wrote:
> On Tue, Aug 30, 2016 at 02:45:14PM -0700, Andy Lutomirski wrote:
>>
>> One might argue that landlock shouldn't be tied to seccomp (in theory,
>> attached progs could be given access to syscall_get_xyz()), but I
>
> proposed lsm is way more powerful than syscall_get_xyz.
> no need to dumb it down.

I think you're misunderstanding me. Mickaël's code allows one to make the LSM hook filters depend on the syscall using SECCOMP_RET_LANDLOCK. I'm suggesting that a similar effect could be achieved by allowing the eBPF LSM hook to call syscall_get_xyz() if it wants to.

>> think that the seccomp attachment mechanism is the right way to
>> install unprivileged filters. It handles the no_new_privs stuff, it
>> allows TSYNC, it's totally independent of systemwide policy, etc.
>>
>> Trying to use cgroups or similar for this is going to be much nastier.
>> Some tighter sandboxes (Sandstorm, etc) aren't even going to dream of
>> putting cgroupfs in their containers, so requiring cgroups or similar
>> would be a mess for that type of application.
>
> I don't see why it is a 'mess'. cgroups are already used by the majority
> of systems, so I don't see why requiring a cgroup is such a big deal.

Requiring cgroup to be configured in isn't a big deal. Requiring

> But let's say we don't do them. How is the implementation going to look
> for a task-based hierarchy? Note that we need an array of bpf_prog
> pointers, one for each lsm hook. Where is this array going to be stored?
> We cannot put it in task_struct, since it's too large. Cannot put it
> into 'struct seccomp' directly either, unless it becomes a pointer.
> Is that the proposal?

It would go in struct seccomp_filter or in something pointed to from there.

> So now we will be wasting an extra 1 kbyte of memory per task. Not great.
> We'd want to optimize it by sharing such a struct seccomp with prog array
> across threads of the same task? Or dynamically allocating it when
> landlock is in use? May sound nice, but how do we account for that kernel
> memory? I guess also solvable by charging memlock.
> With a cgroup-based approach we don't need to worry about all that.

The considerations are essentially identical either way. With cgroups, if you want to share the memory between multiple separate sandboxes (Firejail instances, Sandstorm grains, Chromium instances, xdg-apps, etc), you'd need to get them all to coordinate to share a cgroup. With a seccomp-like interface, you'd need to get them to coordinate to share an installed layer (using my FD idea or similar). There would *not* be any duplication of this memory just because a sandboxed process called fork().

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC
[PATCH net-next] rps: flow_dissector: Add the const for the parameter of flow_keys_have_l4
From: Gao Feng

Add const to the parameter of flow_keys_have_l4() for readability.

Signed-off-by: Gao Feng
---
 include/net/flow_dissector.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index f266b51..d953492 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -183,7 +183,7 @@ struct flow_keys_digest {
 void make_flow_keys_digest(struct flow_keys_digest *digest,
 			   const struct flow_keys *flow);
 
-static inline bool flow_keys_have_l4(struct flow_keys *keys)
+static inline bool flow_keys_have_l4(const struct flow_keys *keys)
 {
 	return (keys->ports.ports || keys->tags.flow_label);
 }
-- 
1.9.1
Re: [PATCH RFC 4/4] xfs: Transmit flow steering
On Tue, Aug 30, 2016 at 5:00 PM, Tom Herbert wrote:
> XFS maintains a per-device flow table that is indexed by the skbuff
> hash. The XFS table is only consulted when there is no queue saved in
> a transmit socket for an skbuff.
>
> Each entry in the flow table contains a queue index and a queue
> pointer. The queue pointer is set when a queue is chosen using a
> flow table entry. This pointer is set to the head pointer in the
> transmit queue (which is maintained by BQL).
>
> The new function get_xfs_index() looks up flows in the XPS table.
> The entry returned gives the last queue a matching flow used. The
> returned queue is compared against the normal XPS queue. If they
> are different, then we only switch if the tail pointer in the TX
> queue has advanced past the pointer saved in the entry. In this
> way OOO should be avoided when XPS wants to use a different queue.
>
> Signed-off-by: Tom Herbert

This looks pretty good. I haven't had a chance to test it though, as it will probably take me a few days. A few minor items called out below.

Thanks.

- Alex

> ---
>  net/Kconfig    |  6
>  net/core/dev.c | 93 --
>  2 files changed, 84 insertions(+), 15 deletions(-)
>
> diff --git a/net/Kconfig b/net/Kconfig
> index 7b6cd34..5e3eddf 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -255,6 +255,12 @@ config XPS
>  	depends on SMP
>  	default y
>
> +config XFS
> +	bool
> +	depends on XPS
> +	depends on BQL
> +	default y
> +
>  config HWBM
>  	bool
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 1d5c6dd..722e487 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3210,6 +3210,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
>  }
>  #endif /* CONFIG_NET_EGRESS */
>
> +/* Must be called with RCU read_lock */
>  static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>  {
>  #ifdef CONFIG_XPS
> @@ -3217,7 +3218,6 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>  	struct xps_map *map;
>  	int queue_index = -1;
>
> -	rcu_read_lock();
>  	dev_maps = rcu_dereference(dev->xps_maps);
>  	if (dev_maps) {
>  		map = rcu_dereference(
> @@ -3232,7 +3232,6 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>  			queue_index = -1;
>  		}
>  	}
> -	rcu_read_unlock();
>
>  	return queue_index;
>  #else
> @@ -3240,26 +3239,90 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>  #endif
>  }
>
> -static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)
> +/* Must be called with RCU read_lock */
> +static int get_xfs_index(struct net_device *dev, struct sk_buff *skb)
>  {
> -	struct sock *sk = skb->sk;
> -	int queue_index = sk_tx_queue_get(sk);
> +#ifdef CONFIG_XFS
> +	struct xps_dev_flow_table *flow_table;
> +	struct xps_dev_flow ent;
> +	int queue_index;
> +	struct netdev_queue *txq;
> +	u32 hash;
>
> -	if (queue_index < 0 || skb->ooo_okay ||
> -	    queue_index >= dev->real_num_tx_queues) {
> -		int new_index = get_xps_queue(dev, skb);
> -		if (new_index < 0)
> -			new_index = skb_tx_hash(dev, skb);
> +	flow_table = rcu_dereference(dev->xps_flow_table);
> +	if (!flow_table)
> +		return -1;
>
> -		if (queue_index != new_index && sk &&
> -		    sk_fullsock(sk) &&
> -		    rcu_access_pointer(sk->sk_dst_cache))
> -			sk_tx_queue_set(sk, new_index);
> +	queue_index = get_xps_queue(dev, skb);
> +	if (queue_index < 0)
> +		return -1;

Actually I think this bit here probably needs to fall back to using skb_tx_hash if you don't get a usable result. The problem is you could have a system that is running with a mix of XFS assigned for some CPUs and just using skb_tx_hash for others. We shouldn't steal flows from the ones selected using skb_tx_hash until they have met the flow transition criteria.

> -		queue_index = new_index;
> +	hash = skb_get_hash(skb);
> +	if (!hash)
> +		return -1;

I'm not sure the !hash test makes any sense. Isn't 0 a valid hash value?

> +	ent.v64 = flow_table->flows[hash & flow_table->mask].v64;
> +	if (ent.queue_index >= 0 &&
> +	    ent.queue_index < dev->real_num_tx_queues) {
> +		txq = netdev_get_tx_queue(dev, ent.queue_index);
> +		if (queue_index != ent.queue_index) {
> +			if ((int)(txq->tail_cnt - ent.queue_ptr) >= 0) {
> +				/* The current queue's tail has advanced
> +				 * beyond the last packet
[PATCH net] rps: flow_dissector: Fix uninitialized flow_keys used in __skb_get_hash possibly
From: Gao Feng

The original code depends on the function parameters being evaluated from left to right. But the evaluation order of function arguments is not defined by the C standard.

When flow_keys_have_l4() is invoked before ___skb_get_hash(skb, &keys, hashrnd) with some compilers or environments, the keys passed to flow_keys_have_l4() are not initialized.

Signed-off-by: Gao Feng
---
 net/core/flow_dissector.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 61ad43f..52742a0 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -680,11 +680,13 @@ EXPORT_SYMBOL_GPL(__skb_get_hash_symmetric);
 void __skb_get_hash(struct sk_buff *skb)
 {
 	struct flow_keys keys;
+	u32 hash;
 
 	__flow_hash_secret_init();
 
-	__skb_set_sw_hash(skb, ___skb_get_hash(skb, &keys, hashrnd),
-			  flow_keys_have_l4(&keys));
+	hash = ___skb_get_hash(skb, &keys, hashrnd);
+
+	__skb_set_sw_hash(skb, hash, flow_keys_have_l4(&keys));
 }
 EXPORT_SYMBOL(__skb_get_hash);
-- 
1.9.1
Re: [PATCH] net/mlx4_en: protect ring->xdp_prog with rcu_read_lock
On Tue, Aug 30, 2016 at 12:35:58PM +0300, Saeed Mahameed wrote:
> On Mon, Aug 29, 2016 at 8:46 PM, Tom Herbert wrote:
> > On Mon, Aug 29, 2016 at 8:55 AM, Brenden Blanco wrote:
> >> On Mon, Aug 29, 2016 at 05:59:26PM +0300, Tariq Toukan wrote:
> >>> Hi Brenden,
> >>>
> >>> The solution direction should be XDP specific so that it does not hurt
> >>> the regular flow.
> >> An rcu_read_lock is _already_ taken for _every_ packet. This is 1/64th of
>
> In other words, "let's add a new small speed bump; we already have
> plenty ahead, so why not slow down now anyway".
>
> Every single new instruction hurts performance. In this case maybe you
> are right, maybe we won't feel any performance
> impact, but that doesn't mean it is ok to do this.

Actually, I will make a stronger assertion. Unless your .config contains CONFIG_PREEMPT=y (not most distros) or something like DEBUG_ATOMIC_SLEEP (to trigger PREEMPT_COUNT), the code in this patch will be a nop. Therefore, adding the protections that you mention below will be _slower_ than the code already proposed.

> >> that.
> >>>
> >>> On 26/08/2016 11:38 PM, Brenden Blanco wrote:
> >>> > Depending on the preempt mode, the bpf_prog stored in xdp_prog may be
> >>> > freed despite the use of call_rcu inside bpf_prog_put. The situation is
> >>> > possible when running in PREEMPT_RCU=y mode, for instance, since the rcu
> >>> > callback for destroying the bpf prog can run even during the bh handling
> >>> > in the mlx4 rx path.
> >>> >
> >>> > Several options were considered before this patch was settled on:
> >>> >
> >>> > Add a napi_synchronize loop in mlx4_xdp_set, which would occur after all
> >>> > of the rings are updated with the new program.
> >>> > This approach has the disadvantage that as the number of rings
> >>> > increases, the speed of update will slow down significantly due to
> >>> > napi_synchronize's msleep(1).
> >>> I prefer this option as it doesn't hurt the data path. A delay in a
> >>> control command can be tolerated.
> >>> > Add a new rcu_head in bpf_prog_aux, to be used by a new bpf_prog_put_bh.
> >>> > The action of the bpf_prog_put_bh would be to then call bpf_prog_put
> >>> > later. Those drivers that consume a bpf prog in a bh context (like mlx4)
> >>> > would then use the bpf_prog_put_bh instead when the ring is up. This has
> >>> > the problem of complexity, in maintaining proper refcnts and rcu lists,
> >>> > and would likely be harder to review. In addition, this approach to
> >>> > freeing must be exclusive with other frees of the bpf prog, for instance
> >>> > a _bh prog must not be referenced from a prog array that is consumed by
> >>> > a non-_bh prog.
> >>> >
> >>> > The placement of rcu_read_lock in this patch is functionally the same as
> >>> > putting an rcu_read_lock in napi_poll. Actually doing so could be a
> >>> > potentially controversial change, but would bring the implementation in
> >>> > line with sk_busy_loop (though of course the nature of those two paths
> >>> > is substantially different), and would also avoid future copy/paste
> >>> > problems with future supporters of XDP. Still, this patch does not take
> >>> > that opinionated option.
> >>> So you decided to add a lock for all non-XDP flows, which are 99% of
> >>> the cases.
> >>> We should avoid this.
> >> The whole point of the rcu_read_lock architecture is to be taken in the
> >> fast path. There won't be a performance impact from this patch.
> >
> > +1, this is nothing at all like a spinlock and really this should be
> > just like any other rcu-like access.
> >
> > Brenden, tracking down how the structure is freed needed a few steps,
> > please make sure the RCU requirements are well documented. Also, I'm
> > still not a fan of using xchg to set the program; it seems that a lock
> > could be used in that path.
> >
> > Thanks,
> > Tom
>
> Sorry folks, I am with Tariq on this. You can't just add a single
> instruction which is only valid/needed for 1% of the use cases
> to the driver's general data path, even if it was as cheap as one cpu cycle!

How about 0?

$ diff mlx4_en.ko.norcu.s mlx4_en.ko.rcu.s | wc -l
0

> Let me try to suggest something:
> instead of taking the rcu_read_lock for the whole
> mlx4_en_process_rx_cq, we can minimize it to the XDP code path only
> by double-checking xdp_prog (a non-protected check followed by a
> protected check inside the mlx4 XDP critical path).
>
> i.e. instead of:
>
> rcu_read_lock();
>
> xdp_prog = ring->xdp_prog;
>
> //__Do lots of non-XDP related stuff__
>
> if (xdp_prog) {
>     //Do XDP magic ..
> }
> //__Do more of non-XDP related stuff__
>
> rcu_read_unlock();
>
> We can minimize it to the XDP critical path only:
>
> //Non-protected xdp_prog dereference.
> if (xdp_prog) {
>
Re: [PATCH V2] rtl_bt: Add firmware and config file for RTL8822BE
On Tue, 2016-08-30 at 20:11 -0500, Larry Finger wrote:
> This device is a new model from Realtek. Updates to driver btrtl will
> soon be submitted to the kernel.
>
> These files were provided by the Realtek developer.
>
> Signed-off-by: 陆朱伟
> Signed-off-by: Larry Finger
> Cc: linux-blueto...@vger.kernel.org
> ---
>
> V2 - fix error in file names in WHENCE
> ---
[...]
> Found in vendor driver, linux_bt_usb_2.11.20140423_8723be.rar
> From https://github.com/troy-tan/driver_store
> +Files rtl_bt/rtl8822e_* came directly from Realtek.
[...]

You missed this wildcard, but I fixed it up.

Applied and pushed, thanks.

Ben.

--
Ben Hutchings
Anthony's Law of Force: Don't force it, get a larger hammer.
Re: [Bridge] [PATCH net-next v2 2/2] net: bridge: add per-port multicast flood flag
On Tue, Aug 30, 2016 at 05:23:08PM +0200, Nikolay Aleksandrov via Bridge wrote:
> diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
> index 1da3221845f1..ed0dd3340084 100644
> --- a/net/bridge/br_if.c
> +++ b/net/bridge/br_if.c
> @@ -362,7 +362,7 @@ static struct net_bridge_port *new_nbp(struct net_bridge *br,
>  	p->path_cost = port_cost(dev);
>  	p->priority = 0x8000 >> BR_PORT_BITS;
>  	p->port_no = index;
> -	p->flags = BR_LEARNING | BR_FLOOD;
> +	p->flags = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD;

I'm discontent with this new flag becoming the default. Could you elaborate a little more on your use case: when/why do you want/need this flag?
Re: [RFC v2 09/10] landlock: Handle cgroups (performance)
On Tue, Aug 30, 2016 at 02:45:14PM -0700, Andy Lutomirski wrote:
>
> One might argue that landlock shouldn't be tied to seccomp (in theory,
> attached progs could be given access to syscall_get_xyz()), but I

proposed lsm is way more powerful than syscall_get_xyz.
no need to dumb it down.

> think that the seccomp attachment mechanism is the right way to
> install unprivileged filters. It handles the no_new_privs stuff, it
> allows TSYNC, it's totally independent of systemwide policy, etc.
>
> Trying to use cgroups or similar for this is going to be much nastier.
> Some tighter sandboxes (Sandstorm, etc) aren't even going to dream of
> putting cgroupfs in their containers, so requiring cgroups or similar
> would be a mess for that type of application.

I don't see why it is a 'mess'. cgroups are already used by the majority of systems, so I don't see why requiring a cgroup is such a big deal.

But let's say we don't do them. How is the implementation going to look for a task-based hierarchy? Note that we need an array of bpf_prog pointers, one for each lsm hook. Where is this array going to be stored? We cannot put it in task_struct, since it's too large. Cannot put it into 'struct seccomp' directly either, unless it becomes a pointer. Is that the proposal? So now we will be wasting an extra 1 kbyte of memory per task. Not great. We'd want to optimize it by sharing such a struct seccomp with prog array across threads of the same task? Or dynamically allocating it when landlock is in use? May sound nice, but how do we account for that kernel memory? I guess also solvable by charging memlock. With a cgroup-based approach we don't need to worry about all that.
Re: [RFCv2 07/16] bpf: enable non-core use of the verfier
On Tue, Aug 30, 2016 at 11:00:38PM +0200, Daniel Borkmann wrote: > On 08/30/2016 10:48 PM, Alexei Starovoitov wrote: > >On Tue, Aug 30, 2016 at 10:22:46PM +0200, Jakub Kicinski wrote: > >>On Tue, 30 Aug 2016 21:07:50 +0200, Daniel Borkmann wrote: > Having two modes seems more straight forward and I think we would only > need to pay attention in the LD_IMM64 case, I don't think I've seen > LLVM generating XORs, it's just the cBPF -> eBPF conversion. > >>> > >>>Okay, though, I think that the cBPF to eBPF migration wouldn't even > >>>pass through the bpf_parse() handling, since verifier is not aware on > >>>some of their aspects such as emitting calls directly (w/o *proto) or > >>>arg mappings. Probably make sense to reject these (bpf_prog_was_classic()) > >>>if they cannot be handled anyway? > >> > >>TBH again I only use cBPF for testing. It's a convenient way of > >>generating certain instruction sequences. I can probably just drop > >>it completely but the XOR patch is just 3 lines of code so not a huge > >>cost either... I'll keep patch 6 in my tree for now. > > > >if xor matching is only need for classic, I would drop that patch > >just to avoid unnecessary state collection. The number of lines > >is not a concern, but extra state for state prunning is. > > > >>Alternatively - is there any eBPF assembler out there? Something > >>converting verifier output back into ELF would be quite cool. > > > >would certainly be nice. I don't think there is anything standalone. > >btw llvm can be made to work as assembler only, but simple flex/bison > >is probably better. > > Never tried it out, but seems llvm backend doesn't have asm parser > implemented? > > $ clang -target bpf -O2 -c foo.c -S -o foo.S > $ llvm-mc -arch bpf foo.S -filetype=obj -o foo.o > llvm-mc: error: this target does not support assembly parsing. > > LLVM IR might work, but maybe too high level(?); alternatively, we could > make bpf_asm from tools/net/ eBPF aware for debugging purposes. 
If you > have a toolchain supporting libbfd et al, you could probably make use > of bpf_jit_dump() (like JITs do) and then bpf_jit_disasm tool (from > same dir as bpf_asm). Yes. llvm-based bpf asm is not complete. It's straightforward to add though. It won't be going through IR. Only 'mc' (machine instruction) layer.
[PATCH V2] rtl_bt: Add firmware and config file for RTL8822BE
This device is a new model from Realtek. Updates to driver btrtl will soon be submitted to the kernel. These files were provided by the Realtek developer.

Signed-off-by: 陆朱伟
Signed-off-by: Larry Finger
Cc: linux-blueto...@vger.kernel.org
---
V2 - fix error in file names in WHENCE
---
 WHENCE                     | 3 +++
 rtl_bt/rtl8822b_config.bin | Bin 0 -> 32 bytes
 rtl_bt/rtl8822b_fw.bin     | Bin 0 -> 51756 bytes
 3 files changed, 3 insertions(+)
 create mode 100644 rtl_bt/rtl8822b_config.bin
 create mode 100644 rtl_bt/rtl8822b_fw.bin

diff --git a/WHENCE b/WHENCE
index d0bef0d..a9d7c97 100644
--- a/WHENCE
+++ b/WHENCE
@@ -2755,11 +2755,14 @@ File: rtl_bt/rtl8723b_fw.bin
 File: rtl_bt/rtl8761a_fw.bin
 File: rtl_bt/rtl8812ae_fw.bin
 File: rtl_bt/rtl8821a_fw.bin
+File: rtl_bt/rtl8822b_fw.bin
+File: rtl_bt/rtl8822b_config.bin
 Licence: Redistributable. See LICENCE.rtlwifi_firmware.txt for details.
 Found in vendor driver, linux_bt_usb_2.11.20140423_8723be.rar
 From https://github.com/troy-tan/driver_store
+Files rtl_bt/rtl8822b_* came directly from Realtek.
--
diff --git a/rtl_bt/rtl8822b_config.bin b/rtl_bt/rtl8822b_config.bin
new file mode 100644
index ..a691e7ca258b0e7dc4ff2bdbdc1d13f2a613526b
GIT binary patch
literal 32
[base85-encoded binary data omitted]

diff --git a/rtl_bt/rtl8822b_fw.bin b/rtl_bt/rtl8822b_fw.bin
new file mode 100644
index ..b7d6d1229491314875b3d4a7266462c47998c0fb
GIT binary patch
literal 51756
[base85-encoded binary data omitted]
Re: [PATCH] rtl_bt: Add firmware and config file for RTL8822BE
On 08/30/2016 09:51 AM, Ben Hutchings wrote: On Tue, 2016-08-30 at 09:08 -0500, Larry Finger wrote: This device is a new model from Realtek. Updates to driver btrtl will soon be submitted to the kernel. These files were provided by the Realtek developer. Signed-off-by: 陆朱伟Signed-off-by: Larry Finger Cc: linux-blueto...@vger.kernel.org --- WHENCE | 3 +++ rtl_bt/rtl8822b_config.bin | Bin 0 -> 32 bytes rtl_bt/rtl8822b_fw.bin | Bin 0 -> 51756 bytes 3 files changed, 3 insertions(+) create mode 100644 rtl_bt/rtl8822b_config.bin create mode 100644 rtl_bt/rtl8822b_fw.bin diff --git a/WHENCE b/WHENCE index d0bef0d..a9d7c97 100644 --- a/WHENCE +++ b/WHENCE @@ -2755,11 +2755,14 @@ File: rtl_bt/rtl8723b_fw.bin File: rtl_bt/rtl8761a_fw.bin File: rtl_bt/rtl8812ae_fw.bin File: rtl_bt/rtl8821a_fw.bin +File: rtl_bt/rtl8822e_fw.bin +File: rtl_bt/rtl8822e_config.bin [...] Should the filenames begin with "rtl822b" or "rtl822e"? They should start with rtl8822b. V2 will be sent shortly. Thanks, Larry
[PATCH RFC 0/4] xfs: Transmit flow steering
This patch set introduces transmit flow steering. The idea is that we record the transmit queues in a flow table that is indexed by skbuff hash. The flow table entries have two values: the queue_index and the head cnt of packets from the TX queue. We only allow a queue to change for a flow if the tail cnt in the TX queue advances beyond the recorded head cnt. That is the condition indicating that all outstanding packets for the flow have completed transmission, so the queue can change.

Tracking of inflight packets is performed as part of BQL. Two fields are added to the netdev_queue structure: head_cnt and tail_cnt. head_cnt is incremented in netdev_tx_sent_queue and tail_cnt is incremented in netdev_tx_completed_queue by the number of packets completed.

This patch set creates /sys/class/net/eth*/xps_dev_flow_table_cnt, which gives the number of entries in the XPS flow table.

Tom Herbert (4):
  net: Set SW hash in skb_set_hash_from_sk
  bql: Add tracking of inflight packets
  net: Add xps_dev_flow_table_cnt
  xfs: Transmit flow steering

 include/linux/netdevice.h | 26 +
 include/net/sock.h        | 6 +--
 net/Kconfig               | 6 +++
 net/core/dev.c            | 93 +++
 net/core/net-sysfs.c      | 87 
 5 files changed, 199 insertions(+), 19 deletions(-)
--
2.8.0.rc2
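The queue-switch rule described in this cover letter can be modeled in plain userspace C. This is an illustrative sketch, not the patch's code: names such as `flow_ent` and `pick_queue` are made up, and the wraparound-safe comparison mirrors the `(int)(txq->tail_cnt - ent.queue_ptr) >= 0` test used in patch 4.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one flow-table entry: the last queue used and
 * the old queue's head_cnt recorded when the flow last enqueued. */
struct flow_ent {
	int queue_index;
	uint32_t queue_ptr;
};

/* Wraparound-safe "tail has reached ptr" test, as in the patch. */
static int tail_passed(uint32_t tail_cnt, uint32_t ptr)
{
	return (int32_t)(tail_cnt - ptr) >= 0;
}

/* Switch the flow to wanted_queue only when every packet previously
 * sent on the old queue has completed; otherwise stay on the old
 * queue to avoid out-of-order delivery. (The real code also
 * refreshes queue_ptr from the chosen queue's head_cnt.) */
static int pick_queue(const struct flow_ent *ent, int wanted_queue,
		      uint32_t old_queue_tail_cnt)
{
	if (ent->queue_index == wanted_queue)
		return wanted_queue;
	if (tail_passed(old_queue_tail_cnt, ent->queue_ptr))
		return wanted_queue;
	return ent->queue_index;
}
```

The unsigned-subtract-then-signed-compare idiom tolerates counter wraparound, which matters since head_cnt/tail_cnt are free-running 32-bit counters.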
[PATCH bug-fix] iproute: fix documentation for ip rule scan order
From 416f45b62f33017d19a9b14e7b0179807c993cbe Mon Sep 17 00:00:00 2001
From: Iskren Chernev
Date: Tue, 30 Aug 2016 17:08:54 -0700
Subject: [PATCH bug-fix] iproute: fix documentation for ip rule scan order

---
 man/man8/ip-rule.8 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/man/man8/ip-rule.8 b/man/man8/ip-rule.8
index 1774ae3..3508d80 100644
--- a/man/man8/ip-rule.8
+++ b/man/man8/ip-rule.8
@@ -93,7 +93,7 @@ Each policy routing rule consists of a
 .B selector
 and an
 .B action predicate.
-The RPDB is scanned in order of decreasing priority. The selector
+The RPDB is scanned in order of increasing priority. The selector
 of each rule is applied to {source address, destination address,
 incoming interface, tos, fwmark} and, if the selector matches
 the packet, the action is performed. The action predicate may
 return with success.
--
2.4.5
[PATCH RFC 2/4] bql: Add tracking of inflight packets
Add two fields to netdev_queue as head_cnt and tail_cnt. head_cnt is incremented for every sent packet in netdev_tx_sent_queue and tail_cnt is incremented by the number of packets in netdev_tx_completed_queue. So then the number of inflight packets for a queue is simply queue->head_cnt - queue->tail_cnt. Add inflight_pkts to be reported in sys-fs. Signed-off-by: Tom Herbert--- include/linux/netdevice.h | 4 net/core/net-sysfs.c | 11 +++ 2 files changed, 15 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index d122be9..487d1df 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -592,6 +592,8 @@ struct netdev_queue { #ifdef CONFIG_BQL struct dql dql; + unsigned inthead_cnt; + unsigned inttail_cnt; #endif } cacheline_aligned_in_smp; @@ -2958,6 +2960,7 @@ static inline void netdev_tx_sent_queue(struct netdev_queue *dev_queue, unsigned int bytes) { #ifdef CONFIG_BQL + dev_queue->head_cnt++; dql_queued(_queue->dql, bytes); if (likely(dql_avail(_queue->dql) >= 0)) @@ -2999,6 +3002,7 @@ static inline void netdev_tx_completed_queue(struct netdev_queue *dev_queue, if (unlikely(!bytes)) return; + dev_queue->tail_cnt += pkts; dql_completed(_queue->dql, bytes); /* diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index 6e4f347..5a33f6a 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -1147,6 +1147,16 @@ static ssize_t bql_show_inflight(struct netdev_queue *queue, static struct netdev_queue_attribute bql_inflight_attribute = __ATTR(inflight, S_IRUGO, bql_show_inflight, NULL); +static ssize_t bql_show_inflight_pkts(struct netdev_queue *queue, + struct netdev_queue_attribute *attr, + char *buf) +{ + return sprintf(buf, "%u\n", queue->head_cnt - queue->tail_cnt); +} + +static struct netdev_queue_attribute bql_inflight_pkts_attribute = + __ATTR(inflight_pkts, S_IRUGO, bql_show_inflight_pkts, NULL); + #define BQL_ATTR(NAME, FIELD) \ static ssize_t bql_show_ ## NAME(struct netdev_queue *queue, \ struct 
netdev_queue_attribute *attr, \ @@ -1176,6 +1186,7 @@ static struct attribute *dql_attrs[] = { _limit_min_attribute.attr, _hold_time_attribute.attr, _inflight_attribute.attr, + _inflight_pkts_attribute.attr, NULL }; -- 2.8.0.rc2
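A userspace sketch of the two counters this patch adds (illustrative only; `struct txq_cnt` and the helper names are stand-ins, not the kernel structures): unsigned arithmetic makes `head_cnt - tail_cnt` the inflight packet count even after either counter wraps.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-ins for the head_cnt/tail_cnt fields added to
 * netdev_queue: sent() models the increment in netdev_tx_sent_queue,
 * completed() the one in netdev_tx_completed_queue. */
struct txq_cnt {
	uint32_t head_cnt;
	uint32_t tail_cnt;
};

static void sent(struct txq_cnt *q)
{
	q->head_cnt++;		/* one packet handed to the NIC */
}

static void completed(struct txq_cnt *q, uint32_t pkts)
{
	q->tail_cnt += pkts;	/* pkts packets reported complete */
}

/* Same expression the new bql_show_inflight_pkts() sysfs hook prints. */
static uint32_t inflight(const struct txq_cnt *q)
{
	return q->head_cnt - q->tail_cnt;
}
```

Because both counters are uint32_t, the difference stays correct across 2^32 wraparound as long as fewer than 2^31 packets are ever in flight at once.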
[PATCH RFC 4/4] xfs: Transmit flow steering
XFS maintains a per device flow table that is indexed by the skbuff hash. The XFS table is only consulted when there is no queue saved in a transmit socket for an skbuff. Each entry in the flow table contains a queue index and a queue pointer. The queue pointer is set when a queue is chosen using a flow table entry. This pointer is set to the head pointer in the transmit queue (which is maintained by BQL). The new function get_xfs_index looks up flows in the XPS table. The entry returned gives the last queue a matching flow used. The returned queue is compared against the normal XPS queue. If they are different, then we only switch if the tail pointer in the TX queue has advanced past the pointer saved in the entry. In this way OOO should be avoided when XPS wants to use a different queue. Signed-off-by: Tom Herbert--- net/Kconfig| 6  net/core/dev.c | 93 -- 2 files changed, 84 insertions(+), 15 deletions(-) diff --git a/net/Kconfig b/net/Kconfig index 7b6cd34..5e3eddf 100644 --- a/net/Kconfig +++ b/net/Kconfig @@ -255,6 +255,12 @@ config XPS depends on SMP default y +config XFS + bool + depends on XPS + depends on BQL + default y + config HWBM bool diff --git a/net/core/dev.c b/net/core/dev.c index 1d5c6dd..722e487 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3210,6 +3210,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev) } #endif /* CONFIG_NET_EGRESS */ +/* Must be called with RCU read_lock */ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb) { #ifdef CONFIG_XPS @@ -3217,7 +3218,6 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb) struct xps_map *map; int queue_index = -1; - rcu_read_lock(); dev_maps = rcu_dereference(dev->xps_maps); if (dev_maps) { map = rcu_dereference( @@ -3232,7 +3232,6 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb) queue_index = -1; } } - rcu_read_unlock(); return queue_index; #else @@ -3240,26 +3239,90 @@ static
inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb) #endif } -static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb) +/* Must be called with RCU read_lock */ +static int get_xfs_index(struct net_device *dev, struct sk_buff *skb) { - struct sock *sk = skb->sk; - int queue_index = sk_tx_queue_get(sk); +#ifdef CONFIG_XFS + struct xps_dev_flow_table *flow_table; + struct xps_dev_flow ent; + int queue_index; + struct netdev_queue *txq; + u32 hash; - if (queue_index < 0 || skb->ooo_okay || - queue_index >= dev->real_num_tx_queues) { - int new_index = get_xps_queue(dev, skb); - if (new_index < 0) - new_index = skb_tx_hash(dev, skb); + flow_table = rcu_dereference(dev->xps_flow_table); + if (!flow_table) + return -1; - if (queue_index != new_index && sk && - sk_fullsock(sk) && - rcu_access_pointer(sk->sk_dst_cache)) - sk_tx_queue_set(sk, new_index); + queue_index = get_xps_queue(dev, skb); + if (queue_index < 0) + return -1; - queue_index = new_index; + hash = skb_get_hash(skb); + if (!hash) + return -1; + + ent.v64 = flow_table->flows[hash & flow_table->mask].v64; + if (ent.queue_index >= 0 && + ent.queue_index < dev->real_num_tx_queues) { + txq = netdev_get_tx_queue(dev, ent.queue_index); + if (queue_index != ent.queue_index) { + if ((int)(txq->tail_cnt - ent.queue_ptr) >= 0) { + /* The current queue's tail has advanced +* beyond the last packet that was +* enqueued using the table entry. All +* previous packets sent for this flow +* should have been completed so the +* queue for the flow can be changed. +*/ + ent.queue_index = queue_index; + txq = netdev_get_tx_queue(dev, queue_index); + } else { + queue_index = ent.queue_index; + } + } + } else { + /* Queue from the table was bad, use the new one. */ + ent.queue_index = queue_index; + txq = netdev_get_tx_queue(dev, queue_index); } + /* Save the updated entry */ + ent.queue_ptr = txq->head_cnt; +
[PATCH RFC 1/4] net: Set SW hash in skb_set_hash_from_sk
Use the __skb_set_sw_hash to set the hash in an skbuff from the socket txhash. Signed-off-by: Tom Herbert--- include/net/sock.h | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/include/net/sock.h b/include/net/sock.h index c797c57..12e585c 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1910,10 +1910,8 @@ static inline void sock_poll_wait(struct file *filp, static inline void skb_set_hash_from_sk(struct sk_buff *skb, struct sock *sk) { - if (sk->sk_txhash) { - skb->l4_hash = 1; - skb->hash = sk->sk_txhash; - } + if (sk->sk_txhash) + __skb_set_sw_hash(skb, sk->sk_txhash, true); } void skb_set_owner_w(struct sk_buff *skb, struct sock *sk); -- 2.8.0.rc2
[PATCH RFC 3/4] net: Add xps_dev_flow_table_cnt
Add infrastructure and definitions to create XFS flow tables. This creates the new sys entry /sys/class/net/eth*/xps_dev_flow_table_cnt Signed-off-by: Tom Herbert--- include/linux/netdevice.h | 22 ++ net/core/net-sysfs.c | 76 +++ 2 files changed, 98 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 487d1df..d30e1bb 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -736,8 +736,28 @@ struct xps_dev_maps { }; #define XPS_DEV_MAPS_SIZE (sizeof(struct xps_dev_maps) + \ (nr_cpu_ids * sizeof(struct xps_map *))) + +struct xps_dev_flow { + union { + u64 v64; + struct { + int queue_index; + unsigned intqueue_ptr; + }; + }; +}; + +struct xps_dev_flow_table { + unsigned int mask; + struct rcu_head rcu; + struct xps_dev_flow flows[0]; +}; +#define XPS_DEV_FLOW_TABLE_SIZE(_num) (sizeof(struct xps_dev_flow_table) + \ + ((_num) * sizeof(struct xps_dev_flow))) + #endif /* CONFIG_XPS */ + #define TC_MAX_QUEUE 16 #define TC_BITMASK 15 /* HW offloaded queuing disciplines txq count and offset maps */ @@ -1810,6 +1830,8 @@ struct net_device { #ifdef CONFIG_XPS struct xps_dev_maps __rcu *xps_maps; + struct xps_dev_flow_table __rcu *xps_flow_table; + unsigned int xps_dev_flow_table_cnt; #endif #ifdef CONFIG_NET_CLS_ACT struct tcf_proto __rcu *egress_cl_list; diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index 5a33f6a..41d0bc9 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -503,6 +503,79 @@ static ssize_t phys_switch_id_show(struct device *dev, } static DEVICE_ATTR_RO(phys_switch_id); +#ifdef CONFIG_XPS +static void xps_dev_flow_table_release(struct rcu_head *rcu) +{ + struct xps_dev_flow_table *table = container_of(rcu, + struct xps_dev_flow_table, rcu); + vfree(table); +} + +static int change_xps_dev_flow_table_cnt(struct net_device *dev, +unsigned long count) +{ + unsigned long mask; + struct xps_dev_flow_table *table, *old_table; + static DEFINE_SPINLOCK(xps_dev_flow_lock); + + if 
(!capable(CAP_NET_ADMIN)) + return -EPERM; + + if (count) { + mask = count - 1; + /* mask = roundup_pow_of_two(count) - 1; +* without overflows... +*/ + while ((mask | (mask >> 1)) != mask) + mask |= (mask >> 1); + /* On 64 bit arches, must check mask fits in table->mask (u32), +* and on 32bit arches, must check +* XPS_DEV_FLOW_TABLE_SIZE(mask + 1) doesn't overflow. +*/ +#if BITS_PER_LONG > 32 + if (mask > (unsigned long)(u32)mask) + return -EINVAL; +#else + if (mask > (ULONG_MAX - XPS_DEV_FLOW_TABLE_SIZE(1)) + / sizeof(struct xps_dev_flow)) { + /* Enforce a limit to prevent overflow */ + return -EINVAL; + } +#endif + table = vmalloc(XPS_DEV_FLOW_TABLE_SIZE(mask + 1)); + if (!table) + return -ENOMEM; + + table->mask = mask; + for (count = 0; count <= mask; count++) + table->flows[count].queue_index = -1; + } else + table = NULL; + + spin_lock(_dev_flow_lock); + old_table = rcu_dereference_protected(dev->xps_flow_table, + lockdep_is_held(_dev_flow_lock)); + rcu_assign_pointer(dev->xps_flow_table, table); + dev->xps_dev_flow_table_cnt = count; + spin_unlock(_dev_flow_lock); + + if (old_table) + call_rcu(_table->rcu, xps_dev_flow_table_release); + + return 0; +} + +static ssize_t xps_dev_flow_table_cnt_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t len) +{ + return netdev_store(dev, attr, buf, len, change_xps_dev_flow_table_cnt); +} + +NETDEVICE_SHOW_RW(xps_dev_flow_table_cnt, fmt_dec); + +#endif + static struct attribute *net_class_attrs[] = { _attr_netdev_group.attr, _attr_type.attr, @@ -531,6 +604,9 @@ static struct attribute *net_class_attrs[] = { _attr_phys_port_name.attr, _attr_phys_switch_id.attr, _attr_proto_down.attr, +#ifdef CONFIG_XPS + _attr_xps_dev_flow_table_cnt.attr, +#endif NULL, }; ATTRIBUTE_GROUPS(net_class); -- 2.8.0.rc2
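The bit-smearing loop in change_xps_dev_flow_table_cnt rounds the requested count up to a power of two without calling roundup_pow_of_two. A standalone copy of that loop behaves as follows (a sketch, assuming count > 0; the helper name `table_mask` is illustrative):

```c
#include <assert.h>

/* Smears the highest set bit of (count - 1) downward until the value
 * is all-ones below that bit, yielding roundup_pow_of_two(count) - 1,
 * i.e. the mask for a table of the next power-of-two size. */
static unsigned long table_mask(unsigned long count)
{
	unsigned long mask = count - 1;

	while ((mask | (mask >> 1)) != mask)
		mask |= (mask >> 1);
	return mask;
}
```

For example, a requested count of 5 produces mask 7 (table size 8), while an exact power of two such as 8 produces mask 7 unchanged; the patch then allocates mask + 1 entries.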
Re: [ovs-dev] [PATCH net-next v11 5/6] openvswitch: add layer 3 flow/port support
On 26 August 2016 at 02:13, Simon Hormanwrote: > On Thu, Aug 25, 2016 at 05:33:57PM -0700, Joe Stringer wrote: >> On 25 August 2016 at 03:08, Simon Horman wrote: >> > Please find my working patch below. >> > >> > From: Simon Horman >> > Subject: [PATCH] system-traffic: Exercise GSO >> > >> > Exercise GSO for: unencapsulated; MPLS; GRE; and MPLS in GRE. >> > >> > There is scope to extend this testing to other encapsulation formats >> > if desired. >> > >> > This is motivated by a desire to test GRE and MPLS encapsulation in >> > the context of L3/VPN (MPLS over non-TEB GRE work). That is not >> > tested here but tests for those cases would idealy be based on those in >> > this patch. >> > >> > Signed-off-by: Simon Horman >> >> I realised that these tests disable TSO, but they don't actually check >> if GSO is enabled. Maybe it's safe to assume this, but it's more >> explicit to actually look for it in the tests. > > Good point, I'll see about checking that. > >> With particular setups (fedora23 in particular, both kernel and >> userspace testsuites) I see this: >> >> ./system-traffic.at:371: ip netns exec at_ns0 sh << NS_EXEC_HEREDOC >> ip route add 10.1.2.0/24 encap mpls 100 via inet 10.1.1.2 dev ns_gre0 >> NS_EXEC_HEREDOC >> --- /dev/null 2016-08-19 01:28:02.15100 + >> +++ >> /home/gitlab-runner/builds/83c49bff/0/root/gitlab-ovs/ovs/tests/system-kmod-testsuite.dir/at-groups/10/stderr >> 2016-08-25 17:16:27.32400 + >> @@ -0,0 +1 @@ >> +Error: either "to" is duplicate, or "encap" is a garbage. >> >> I'm guessing the ip tool is a little out of date. We could detect and >> skip this with something like: >> >> AT_SKIP_IF([ip route help 2>&1 | grep encap]) >> >> in the CHECK_MPLS. > > Thanks, I'll add something like that. > >> Hmm, I'm still seeing the bad counts of segments retransmited even >> with the diff change on a kernel I have built at bf0f500bd019 ("Merge >> tag 'trace-v4.8-1' of >> git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace"). 
>> If it's passing on latest net-next then maybe I just need to swap out >> that box's kernel for a newer build. I'll try that. > > It is possible that it is detecting a bug. > Which test is failing? FWIW I tried with a newer build, commit 9a0a5c4cb1af ("net: systemport: Fix ordering in intrl2_*_mask_clear macro"). I no longer see the issue. Unfortunately I lost my test output. It was one of these two: 8: datapath - ping over gre tunnel FAILED (system-traffic.at:294) 9: datapath - http over gre tunnel FAILED (system-traffic.at:348) I also realised that I didn't have MPLS router enabled in my kernel config so the MPLS tests were getting skipped. I enabled MPLS_ROUTING, but now I see this failure on the "http over mpls" tests: ./system-traffic.at:111: ip netns exec at_ns0 sh << NS_EXEC_HEREDOC ip route add 10.1.1.0/24 encap mpls 100 via inet 172.31.1.2 dev p0 NS_EXEC_HEREDOC --- /dev/null 2016-08-30 15:22:28.813316948 -0700 +++ /home/gitlab-runner/builds/f1d4a2be/0/root/gitlab-ovs/ovs/tests/system-kmod-testsuite.dir/at-groups/4/stderr 2016-08-30 15:33:45.133306581 -0700 @@ -0,0 +1 @@ +RTNETLINK answers: Operation not supported > At this stage I have mostly added TSO/GSO testing to existing checks. > Perhaps it would be better to break them out into separate checks so > ping/http can be be checked without considering TSO/GSO which may have some > value in situations where TSO/GSO is broken which is actually what I am > interested in testing. Sounds reasonable.
Re: [PATCH 3/4] arm64: dts: rockchip: support gmac for rk3399
Am Mittwoch, 31. August 2016, 04:30:06 schrieb Caesar Wang: > This patch adds needed gamc information for rk3399, > also support the gmac pd. > > Signed-off-by: Roger Chen> Signed-off-by: Caesar Wang > --- > > arch/arm64/boot/dts/rockchip/rk3399.dtsi | 90 > 1 file changed, 90 insertions(+) > > diff --git a/arch/arm64/boot/dts/rockchip/rk3399.dtsi > b/arch/arm64/boot/dts/rockchip/rk3399.dtsi index 32aebc8..53ac651 100644 > --- a/arch/arm64/boot/dts/rockchip/rk3399.dtsi > +++ b/arch/arm64/boot/dts/rockchip/rk3399.dtsi > @@ -200,6 +200,26 @@ > }; > }; > > + gmac: eth@fe30 { > + compatible = "rockchip,rk3399-gmac"; > + reg = <0x0 0xfe30 0x0 0x1>; > + rockchip,grf = <>; should move below the reset-names . > + interrupts = ; > + interrupt-names = "macirq"; > + clocks = < SCLK_MAC>, < SCLK_MAC_RX>, > + < SCLK_MAC_TX>, < SCLK_MACREF>, > + < SCLK_MACREF_OUT>, < ACLK_GMAC>, > + < PCLK_GMAC>; > + clock-names = "stmmaceth", "mac_clk_rx", > + "mac_clk_tx", "clk_mac_ref", > + "clk_mac_refout", "aclk_mac", > + "pclk_mac"; > + resets = < SRST_A_GMAC>; > + reset-names = "stmmaceth"; > + power-domains = < RK3399_PD_GMAC>; The driver core should handle regular power-domain handling on device creation already, right? So I should be able to apply patches 3 and 4 even without the dwmac patches, right? Also if resending please move power-domains above resets Heiko
Re: [RFC v2 09/10] landlock: Handle cgroups (performance)
On Aug 30, 2016 1:56 PM, "Alexei Starovoitov"wrote: > > On Tue, Aug 30, 2016 at 10:33:31PM +0200, Mickaël Salaün wrote: > > > > > > On 30/08/2016 22:23, Andy Lutomirski wrote: > > > On Tue, Aug 30, 2016 at 1:20 PM, Mickaël Salaün wrote: > > >> > > >> On 30/08/2016 20:55, Andy Lutomirski wrote: > > >>> On Sun, Aug 28, 2016 at 2:42 AM, Mickaël Salaün > > >>> wrote: > > > > > > On 28/08/2016 10:13, Andy Lutomirski wrote: > > > On Aug 27, 2016 11:14 PM, "Mickaël Salaün" wrote: > > >> > > >> > > >> On 27/08/2016 22:43, Alexei Starovoitov wrote: > > >>> On Sat, Aug 27, 2016 at 09:35:14PM +0200, Mickaël Salaün wrote: > > On 27/08/2016 20:06, Alexei Starovoitov wrote: > > > On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote: > > >> As said above, Landlock will not run an eBPF programs when not > > >> strictly > > >> needed. Attaching to a cgroup will have the same performance > > >> impact as > > >> attaching to a process hierarchy. > > > > > > Having a prog per cgroup per lsm_hook is the only scalable way I > > > could come up with. If you see another way, please propose. > > > current->seccomp.landlock_prog is not the answer. > > > > Hum, I don't see the difference from a performance point of view > > between > > a cgroup-based or a process hierarchy-based system. > > > > Maybe a better option should be to use an array of pointers with N > > entries, one for each supported hook, instead of a unique pointer > > list? > > >>> > > >>> yes, clearly array dereference is faster than link list walk. > > >>> Now the question is where to keep this prog_array[num_lsm_hooks] ? > > >>> Since we cannot keep it inside task_struct, we have to allocate it. > > >>> Every time the task is creted then. What to do on the fork? That > > >>> will require changes all over. Then the obvious optimization would > > >>> be > > >>> to share this allocated array of prog pointers across multiple > > >>> tasks... > > >>> and little by little this new facility will look like cgroup. 
> > >>> Hence the suggestion to put this array into cgroup from the start. > > >> > > >> I see your point :) > > >> > > >>> > > Anyway, being able to attach an LSM hook program to a cgroup > > thanks to > > the new BPF_PROG_ATTACH seems a good idea (while keeping the > > possibility > > to use a process hierarchy). The downside will be to handle an LSM > > hook > > program which is not triggered by a seccomp-filter, but this > > should be > > needed anyway to handle interruptions. > > >>> > > >>> what do you mean 'not triggered by seccomp' ? > > >>> You're not suggesting that this lsm has to enable seccomp to be > > >>> functional? > > >>> imo that's non starter due to overhead. > > >> > > >> Yes, for now, it is triggered by a new seccomp filter return value > > >> RET_LANDLOCK, which can take a 16-bit value called cookie. This must > > >> not > > >> be needed but could be useful to bind a seccomp filter security > > >> policy > > >> with a Landlock one. Waiting for Kees's point of view… > > >> > > > > > > I'm not Kees, but I'd be okay with that. I still think that doing > > > this by process hierarchy a la seccomp will be easier to use and to > > > understand (which is quite important for this kind of work) than doing > > > it by cgroup. > > > > > > A feature I've wanted to add for a while is to have an fd that > > > represents a seccomp layer, the idea being that you would set up your > > > seccomp layer (with syscall filter, landlock hooks, etc) and then you > > > would have a syscall to install that layer. Then an unprivileged > > > sandbox manager could set up its layer and still be able to inject new > > > processes into it later on, no cgroups needed. > > > > A nice thing I didn't highlight about Landlock is that a process can > > prepare a layer of rules (arraymap of handles + Landlock programs) and > > pass the file descriptors of the Landlock programs to another process. > > This process could then apply this programs to get sandboxed. 
However, > > for now, because a Landlock program is only triggered by a seccomp > > filter (which do not follow the Landlock programs as a FD), they will > > be > > useless. > > > > The FD referring to an arraymap of handles can also be used to update a > > map and change the behavior of a Landlock program. A master process can > > then add or remove restrictions to another process hierarchy on the > > fly. >
Re: [PATCH 2/4] net: stmmac: dwmac-rk: add pd_gmac support for rk3399
Hi David, [auto build test ERROR on rockchip/for-next] [also build test ERROR on v4.8-rc4 next-20160825] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] [Suggest to use git(>=2.9.0) format-patch --base= (or --base=auto for convenience) to record what (public, well-known) commit your patch series was built on] [Check https://git-scm.com/docs/git-format-patch for more information] url: https://github.com/0day-ci/linux/commits/Caesar-Wang/Support-the-rk3399-gmac-pd-function/20160831-043741 base: https://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip.git for-next config: xtensa-allmodconfig (attached as .config) compiler: xtensa-linux-gcc (GCC) 4.9.0 reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=xtensa All errors (new ones prefixed by >>): drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c: In function 'rk_gmac_powerdown': >> drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c:672:23: error: 'pdev' >> undeclared (first use in this function) pm_runtime_put_sync(>dev); ^ drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c:672:23: note: each undeclared identifier is reported only once for each function it appears in drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c: At top level: drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c:679:12: error: redefinition of 'rk_gmac_init' static int rk_gmac_init(struct platform_device *pdev, void *priv) ^ drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c:638:12: note: previous definition of 'rk_gmac_init' was here static int rk_gmac_init(struct platform_device *pdev, void *priv) ^ drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c: In function 'rk_gmac_init': drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c:683:2: error: implicit declaration of function 'rk_gmac_powerup' [-Werror=implicit-function-declaration] return 
rk_gmac_powerup(bsp_priv); ^ drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c: At top level: drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c:638:12: warning: 'rk_gmac_init' defined but not used [-Wunused-function] static int rk_gmac_init(struct platform_device *pdev, void *priv) ^ cc1: some warnings being treated as errors vim +/pdev +672 drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c 666 667 return 0; 668 } 669 670 static void rk_gmac_powerdown(struct rk_priv_data *gmac) 671 { > 672 pm_runtime_put_sync(>dev); 673 pm_runtime_disable(>dev); 674 675 phy_power_on(gmac, false); --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
Re: [RFCv2 07/16] bpf: enable non-core use of the verifier
On 08/30/2016 10:48 PM, Alexei Starovoitov wrote: On Tue, Aug 30, 2016 at 10:22:46PM +0200, Jakub Kicinski wrote: On Tue, 30 Aug 2016 21:07:50 +0200, Daniel Borkmann wrote: Having two modes seems more straight forward and I think we would only need to pay attention in the LD_IMM64 case, I don't think I've seen LLVM generating XORs, it's just the cBPF -> eBPF conversion. Okay, though, I think that the cBPF to eBPF migration wouldn't even pass through the bpf_parse() handling, since verifier is not aware on some of their aspects such as emitting calls directly (w/o *proto) or arg mappings. Probably make sense to reject these (bpf_prog_was_classic()) if they cannot be handled anyway? TBH again I only use cBPF for testing. It's a convenient way of generating certain instruction sequences. I can probably just drop it completely but the XOR patch is just 3 lines of code so not a huge cost either... I'll keep patch 6 in my tree for now. if xor matching is only need for classic, I would drop that patch just to avoid unnecessary state collection. The number of lines is not a concern, but extra state for state prunning is. Alternatively - is there any eBPF assembler out there? Something converting verifier output back into ELF would be quite cool. would certainly be nice. I don't think there is anything standalone. btw llvm can be made to work as assembler only, but simple flex/bison is probably better. Never tried it out, but seems llvm backend doesn't have asm parser implemented? $ clang -target bpf -O2 -c foo.c -S -o foo.S $ llvm-mc -arch bpf foo.S -filetype=obj -o foo.o llvm-mc: error: this target does not support assembly parsing. LLVM IR might work, but maybe too high level(?); alternatively, we could make bpf_asm from tools/net/ eBPF aware for debugging purposes. If you have a toolchain supporting libbfd et al, you could probably make use of bpf_jit_dump() (like JITs do) and then bpf_jit_disasm tool (from same dir as bpf_asm).
Re: [RFC v2 09/10] landlock: Handle cgroups (performance)
On Tue, Aug 30, 2016 at 10:33:31PM +0200, Mickaël Salaün wrote: > > > On 30/08/2016 22:23, Andy Lutomirski wrote: > > On Tue, Aug 30, 2016 at 1:20 PM, Mickaël Salaünwrote: > >> > >> On 30/08/2016 20:55, Andy Lutomirski wrote: > >>> On Sun, Aug 28, 2016 at 2:42 AM, Mickaël Salaün wrote: > > > On 28/08/2016 10:13, Andy Lutomirski wrote: > > On Aug 27, 2016 11:14 PM, "Mickaël Salaün" wrote: > >> > >> > >> On 27/08/2016 22:43, Alexei Starovoitov wrote: > >>> On Sat, Aug 27, 2016 at 09:35:14PM +0200, Mickaël Salaün wrote: > On 27/08/2016 20:06, Alexei Starovoitov wrote: > > On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote: > >> As said above, Landlock will not run an eBPF programs when not > >> strictly > >> needed. Attaching to a cgroup will have the same performance > >> impact as > >> attaching to a process hierarchy. > > > > Having a prog per cgroup per lsm_hook is the only scalable way I > > could come up with. If you see another way, please propose. > > current->seccomp.landlock_prog is not the answer. > > Hum, I don't see the difference from a performance point of view > between > a cgroup-based or a process hierarchy-based system. > > Maybe a better option should be to use an array of pointers with N > entries, one for each supported hook, instead of a unique pointer > list? > >>> > >>> yes, clearly array dereference is faster than link list walk. > >>> Now the question is where to keep this prog_array[num_lsm_hooks] ? > >>> Since we cannot keep it inside task_struct, we have to allocate it. > >>> Every time the task is creted then. What to do on the fork? That > >>> will require changes all over. Then the obvious optimization would be > >>> to share this allocated array of prog pointers across multiple > >>> tasks... > >>> and little by little this new facility will look like cgroup. > >>> Hence the suggestion to put this array into cgroup from the start. 
> >> > >> I see your point :) > >> > >>> > Anyway, being able to attach an LSM hook program to a cgroup thanks > to > the new BPF_PROG_ATTACH seems a good idea (while keeping the > possibility > to use a process hierarchy). The downside will be to handle an LSM > hook > program which is not triggered by a seccomp-filter, but this should > be > needed anyway to handle interruptions. > >>> > >>> what do you mean 'not triggered by seccomp' ? > >>> You're not suggesting that this lsm has to enable seccomp to be > >>> functional? > >>> imo that's non starter due to overhead. > >> > >> Yes, for now, it is triggered by a new seccomp filter return value > >> RET_LANDLOCK, which can take a 16-bit value called cookie. This must > >> not > >> be needed but could be useful to bind a seccomp filter security policy > >> with a Landlock one. Waiting for Kees's point of view… > >> > > > > I'm not Kees, but I'd be okay with that. I still think that doing > > this by process hierarchy a la seccomp will be easier to use and to > > understand (which is quite important for this kind of work) than doing > > it by cgroup. > > > > A feature I've wanted to add for a while is to have an fd that > > represents a seccomp layer, the idea being that you would set up your > > seccomp layer (with syscall filter, landlock hooks, etc) and then you > > would have a syscall to install that layer. Then an unprivileged > > sandbox manager could set up its layer and still be able to inject new > > processes into it later on, no cgroups needed. > > A nice thing I didn't highlight about Landlock is that a process can > prepare a layer of rules (arraymap of handles + Landlock programs) and > pass the file descriptors of the Landlock programs to another process. > This process could then apply this programs to get sandboxed. However, > for now, because a Landlock program is only triggered by a seccomp > filter (which do not follow the Landlock programs as a FD), they will be > useless. 
> > The FD referring to an arraymap of handles can also be used to update a > map and change the behavior of a Landlock program. A master process can > then add or remove restrictions to another process hierarchy on the fly. > >>> > >>> Maybe this could be extended a little bit. The fd could hold the > >>> seccomp filter *and* the LSM hook filters. FMODE_EXECUTE could give > >>> the ability to install it and FMODE_WRITE could give the ability to > >>> modify it. > >>> > >> > >> This is interesting! It should be possible to append the seccomp
Re: [PATCH V2] dt: net: enhance DWC EQoS binding to support Tegra186
On 08/30/2016 01:01 PM, Rob Herring wrote: On Wed, Aug 24, 2016 at 03:20:46PM -0600, Stephen Warren wrote: From: Stephen Warren

The Synopsys DWC EQoS is a configurable IP block which supports multiple options for bus type, clocking and reset structure, and feature list. Extend the DT binding to define a "compatible value" for the configuration contained in NVIDIA's Tegra186 SoC, and define some new properties and list property entries required by that configuration. diff --git a/Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt b/Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt Required properties: -- clocks: Phandles to the reference clock and the bus clock -- clock-names: Should be "phy_ref_clk" for the reference clock and "apb_pclk" - for the bus clock. +- clocks: Phandle and clock specifiers for each entry in clock-names, in the + same order. See ../clock/clock-bindings.txt. +- clock-names: May contain any/all of the following depending on the IP + configuration, in any order: No, they should be in a defined order. If the binding only defines "clocks", then yes the order must be specified. If the binding defines clock-names too, then the order is arbitrary since all clocks must be looked up via clock-names. That's the entire point of having a clock-names property. ... +The EQOS transmit path clock. The HW signal name is clk_tx_i. +In some configurations (e.g. GMII/RGMII), this clock also drives the PHY TX +path. In other configurations, other clocks (such as tx_125, rmii) may +drive the PHY TX path. + - "rx" +The EQOS receive path clock. The HW signal name is clk_rx_i. +In some configurations (e.g. GMII/RGMII), this clock also drives the PHY RX +path. In other configurations, other clocks (such as rx_125, pmarx_0, +pmarx_1, rmii) may drive the PHY RX path. + - "slave_bus" +(Alternate name "apb_pclk"; only one alternate must appear.) +The CPU/slave-bus (CSR) interface clock. 
Despite the name, this applies to +any bus type; APB, AHB, AXI, etc. The HW signal name is hclk_i (AHB) or +clk_csr_i (other buses). Sounds like 2 clocks here. + - "master_bus" +The master bus interface clock. Only required in configurations that use a +separate clock for the master and slave bus interfaces. The HW signal name +is hclk_i (AHB) or aclk_i (AXI). Sounds like 2 clocks. I'm guessing these are mutually exclusive based on whether you configure the IP for AHB or AXI? Yes, my understanding is that the two clocks are mutually exclusive in both cases. It seems simpler to have an entry in clocks/clock-names for each logical purpose, so that the driver can always retrieve a "slave bus clock" and a "master bus clock". That way, there's never any conditional code in the driver; it just gets two fixed clock names and enables them, no matter what the HW configuration. If the binding specifies 3 clocks, hclk_i, clk_csr_i, and aclk_i, then the driver needs to know which subset of clocks to retrieve based on compatible value or HW configuration. That seems like unnecessary complexity. I suppose the driver could just attempt to retrieve all 3 clocks, and ignore any missing clocks, but that would allow malformed DTs not to be noticed since the driver wouldn't validate the set of clocks present, and could lead to the driver touching the HW without all required clocks active, which at least in Tegra can lead to a HW hang. + Note: Support for additional IP configurations may require adding the + following clocks to this list in the future: clk_rx_125_i, clk_tx_125_i, + clk_pmarx_0_i, clk_pmarx1_i, clk_rmii_i, clk_revmii_rx_i, clk_revmii_tx_i. 
+ + The following compatible values require the following set of clocks: + - "nvidia,tegra186-eqos", "snps,dwc-qos-ethernet-4.10": +- "slave_bus" +- "master_bus" +- "rx" +- "tx" +- "ptp_ref" + - "axis,artpec6-eqos", "snps,dwc-qos-ethernet-4.10": +- "phy_ref_clk" +- "apb_clk" It would be good if this was marked deprecated and the full set of clocks could be described and supported. Not sure if you can figure that out. Is it really only 2 clocks, or these have multiple connections to the same source. Lars, can you answer here? I deliberately didn't attempt to change the binding definition for the existing use-case, since I'm not familiar with that SoC, and don't relish changing DTs for a platform I can't test.
Re: [RFCv2 16/16] nfp: bpf: add offload of TC direct action mode
On Tue, 30 Aug 2016 22:02:10 +0200, Daniel Borkmann wrote: > On 08/30/2016 12:52 PM, Jakub Kicinski wrote: > > On Mon, 29 Aug 2016 23:09:35 +0200, Daniel Borkmann wrote: > [...] > >> > >> In da mode, RECLASSIFY is not supported, so this one could be scratched. > >> For the OK and UNSPEC part, couldn't both be treated the same (as in: OK / > >> pass to stack roughly equivalent as in sch_handle_ingress())? Or is the > >> issue that you cannot populate skb->tc_index when passing to stack (maybe > >> just fine to leave it at 0 for now)? > > > > The comment is a bit confus(ed|ing). The problem is: > > > > tc filter add skip_sw > > tc filter add skip_hw > > > > If packet appears in the stack - was it because of OK or UNSPEC (or > > RECLASSIFY) in filter1? Do we need to run filter2 or not? Passing > > tc_index can be implemented the same way I do mark today. > > Okay, I see, thanks for explaining. So, if passing tc_index (or any other > meta data) can be implemented the same way as we do with mark already, > could we store such verdict, say, in some unused skb->tc_verd bits (the > skb->tc_index could be filled by the program already) and pass that up the > stack to differentiate between them? There should be no prior user before > ingress, so that patch 4 could become something like: > >if (tc_skip_sw(prog->gen_flags)) { > filter_res = tc_map_hw_verd_to_act(skb); >} else if (at_ingress) { > ... >} ... This looks promising! > And I assume it wouldn't make any sense anyway to have a skip_sw filter > being chained /after/ some skip_hw and the like, right? Right. I think it should be enforced by TC core or at least some shared code similar to tc_flags_valid() to reject offload attempts of filters which are not first in line from the wire. Right now AFAICT enabling transparent offload with ethtool may result in things going down to HW completely out of order and user doesn't even have to specify the skip_* flags...
Re: [RFCv2 07/16] bpf: enable non-core use of the verfier
On Tue, Aug 30, 2016 at 10:22:46PM +0200, Jakub Kicinski wrote: > On Tue, 30 Aug 2016 21:07:50 +0200, Daniel Borkmann wrote: > > > Having two modes seems more straight forward and I think we would only > > > need to pay attention in the LD_IMM64 case, I don't think I've seen > > > LLVM generating XORs, it's just the cBPF -> eBPF conversion. > > > > Okay, though, I think that the cBPF to eBPF migration wouldn't even > > pass through the bpf_parse() handling, since verifier is not aware on > > some of their aspects such as emitting calls directly (w/o *proto) or > > arg mappings. Probably make sense to reject these (bpf_prog_was_classic()) > > if they cannot be handled anyway? > > TBH again I only use cBPF for testing. It's a convenient way of > generating certain instruction sequences. I can probably just drop > it completely but the XOR patch is just 3 lines of code so not a huge > cost either... I'll keep patch 6 in my tree for now. if xor matching is only needed for classic, I would drop that patch just to avoid unnecessary state collection. The number of lines is not a concern, but extra state for state pruning is. > Alternatively - is there any eBPF assembler out there? Something > converting verifier output back into ELF would be quite cool. would certainly be nice. I don't think there is anything standalone. btw llvm can be made to work as assembler only, but simple flex/bison is probably better.
Re: [PATCH net-next 2/6] net/mlx5e: Read ETS settings directly from firmware
On Tue, Aug 30, 2016 at 2:29 PM, Saeed Mahameed wrote: > From: Huy Nguyen > > Current implementation does not read the setting > directly from FW when ieee_getets is called. what's wrong with that? explain
Re: [PATCH net-next 1/6] net/mlx5e: Support DCBX CEE API
On Tue, Aug 30, 2016 at 2:29 PM, Saeed Mahameed wrote: > From: Huy Nguyen > > Add DCBX CEE API interface for CX4. Configurations are stored in a > temporary structure and are applied to the card's firmware when the > CEE's setall callback function is called. > > Note: > priority group in CEE is equivalent to traffic class in ConnectX-4 > hardware spec. > > bw allocation per priority in CEE is not supported because CX4 > only supports bw allocation per traffic class. > > user priority in CEE does not have an equivalent term in CX4. > Therefore, user priority to priority mapping in CEE is not supported. basically our driver suites (mlx4/mlx5) are not written for a certain HW, but rather for multiple (past, present and future) devices, using dev caps advertised by the firmware to the driver. I see here lots of explicit CX4 mentioning... so (1) try to avoid it or make the description more general (2) do you base your code on dev caps or hard coded assumptions? > Test: see DCBX_LinuxDriverCX4 document section 6.4 what's the relevance of this for the upstream commit change log? > Signed-off-by: Huy Nguyen > Signed-off-by: Saeed Mahameed
Re: [RFC v2 09/10] landlock: Handle cgroups (performance)
On 30/08/2016 22:23, Andy Lutomirski wrote: > On Tue, Aug 30, 2016 at 1:20 PM, Mickaël Salaünwrote: >> >> On 30/08/2016 20:55, Andy Lutomirski wrote: >>> On Sun, Aug 28, 2016 at 2:42 AM, Mickaël Salaün wrote: On 28/08/2016 10:13, Andy Lutomirski wrote: > On Aug 27, 2016 11:14 PM, "Mickaël Salaün" wrote: >> >> >> On 27/08/2016 22:43, Alexei Starovoitov wrote: >>> On Sat, Aug 27, 2016 at 09:35:14PM +0200, Mickaël Salaün wrote: On 27/08/2016 20:06, Alexei Starovoitov wrote: > On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote: >> As said above, Landlock will not run an eBPF programs when not >> strictly >> needed. Attaching to a cgroup will have the same performance impact >> as >> attaching to a process hierarchy. > > Having a prog per cgroup per lsm_hook is the only scalable way I > could come up with. If you see another way, please propose. > current->seccomp.landlock_prog is not the answer. Hum, I don't see the difference from a performance point of view between a cgroup-based or a process hierarchy-based system. Maybe a better option should be to use an array of pointers with N entries, one for each supported hook, instead of a unique pointer list? >>> >>> yes, clearly array dereference is faster than link list walk. >>> Now the question is where to keep this prog_array[num_lsm_hooks] ? >>> Since we cannot keep it inside task_struct, we have to allocate it. >>> Every time the task is creted then. What to do on the fork? That >>> will require changes all over. Then the obvious optimization would be >>> to share this allocated array of prog pointers across multiple tasks... >>> and little by little this new facility will look like cgroup. >>> Hence the suggestion to put this array into cgroup from the start. >> >> I see your point :) >> >>> Anyway, being able to attach an LSM hook program to a cgroup thanks to the new BPF_PROG_ATTACH seems a good idea (while keeping the possibility to use a process hierarchy). 
The downside will be to handle an LSM hook program which is not triggered by a seccomp-filter, but this should be needed anyway to handle interruptions. >>> >>> what do you mean 'not triggered by seccomp' ? >>> You're not suggesting that this lsm has to enable seccomp to be >>> functional? >>> imo that's non starter due to overhead. >> >> Yes, for now, it is triggered by a new seccomp filter return value >> RET_LANDLOCK, which can take a 16-bit value called cookie. This must not >> be needed but could be useful to bind a seccomp filter security policy >> with a Landlock one. Waiting for Kees's point of view… >> > > I'm not Kees, but I'd be okay with that. I still think that doing > this by process hierarchy a la seccomp will be easier to use and to > understand (which is quite important for this kind of work) than doing > it by cgroup. > > A feature I've wanted to add for a while is to have an fd that > represents a seccomp layer, the idea being that you would set up your > seccomp layer (with syscall filter, landlock hooks, etc) and then you > would have a syscall to install that layer. Then an unprivileged > sandbox manager could set up its layer and still be able to inject new > processes into it later on, no cgroups needed. A nice thing I didn't highlight about Landlock is that a process can prepare a layer of rules (arraymap of handles + Landlock programs) and pass the file descriptors of the Landlock programs to another process. This process could then apply this programs to get sandboxed. However, for now, because a Landlock program is only triggered by a seccomp filter (which do not follow the Landlock programs as a FD), they will be useless. The FD referring to an arraymap of handles can also be used to update a map and change the behavior of a Landlock program. A master process can then add or remove restrictions to another process hierarchy on the fly. >>> >>> Maybe this could be extended a little bit. 
The fd could hold the >>> seccomp filter *and* the LSM hook filters. FMODE_EXECUTE could give >>> the ability to install it and FMODE_WRITE could give the ability to >>> modify it. >>> >> >> This is interesting! It should be possible to append the seccomp stack >> of a source process to the seccomp stack of the target process when a >> Landlock program is passed and then activated through seccomp(2). >> >> For the FMODE_EXECUTE/FMODE_WRITE, are you suggesting to manage >> permission of the eBPF program FD in a specific way? >> > > This wouldn't be an eBPF program FD -- it
[PATCH 1/4] net: stmmac: dwmac-rk: fixes the gmac resume after PD on/off
From: Roger Chen

GMAC Power Domain (PD) will be disabled during suspend. That causes the GRF registers to be reset, so the corresponding GRF registers for the GMAC must be set up again on resume. Signed-off-by: Roger Chen Signed-off-by: Caesar Wang --- drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c | 20 +++- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c index 9210591..ea0e493 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c @@ -629,6 +629,17 @@ static struct rk_priv_data *rk_gmac_setup(struct platform_device *pdev, "rockchip,grf"); bsp_priv->pdev = pdev; + gmac_clk_init(bsp_priv); + + return bsp_priv; +} + +static int rk_gmac_init(struct platform_device *pdev, void *priv) +{ + struct rk_priv_data *bsp_priv = priv; + int ret; + struct device *dev = &pdev->dev; + /*rmii or rgmii*/ if (bsp_priv->phy_iface == PHY_INTERFACE_MODE_RGMII) { dev_info(dev, "init for RGMII\n"); @@ -641,15 +652,6 @@ static struct rk_priv_data *rk_gmac_setup(struct platform_device *pdev, dev_err(dev, "NO interface defined!\n"); } - gmac_clk_init(bsp_priv); - - return bsp_priv; -} - -static int rk_gmac_powerup(struct rk_priv_data *bsp_priv) -{ - int ret; - ret = phy_power_on(bsp_priv, true); if (ret) return ret; -- 1.9.1
[PATCH 2/4] net: stmmac: dwmac-rk: add pd_gmac support for rk3399
From: David Wu

Add the gmac power domain support for rk3399, in order to reduce power consumption. Signed-off-by: David Wu Signed-off-by: Caesar Wang --- drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c index ea0e493..71a1ca5 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c @@ -30,6 +30,7 @@ #include #include #include +#include <linux/pm_runtime.h> #include "stmmac_platform.h" @@ -660,11 +661,17 @@ static int rk_gmac_init(struct platform_device *pdev, void *priv) if (ret) return ret; + pm_runtime_enable(&pdev->dev); + pm_runtime_get_sync(&pdev->dev); + return 0; } static void rk_gmac_powerdown(struct rk_priv_data *gmac) { + pm_runtime_put_sync(&gmac->pdev->dev); + pm_runtime_disable(&gmac->pdev->dev); + phy_power_on(gmac, false); gmac_clk_enable(gmac, false); } -- 1.9.1
[PATCH 4/4] arm64: dts: rockchip: enable the gmac for rk3399 evb board
We add the required and optional properties for the evb board. See [0] for the detailed information. [0]: Documentation/devicetree/bindings/net/rockchip-dwmac.txt

Signed-off-by: Roger Chen
Signed-off-by: Caesar Wang --- arch/arm64/boot/dts/rockchip/rk3399-evb.dts | 31 + 1 file changed, 31 insertions(+) diff --git a/arch/arm64/boot/dts/rockchip/rk3399-evb.dts b/arch/arm64/boot/dts/rockchip/rk3399-evb.dts index d47b4e9..ed6f2e8 100644 --- a/arch/arm64/boot/dts/rockchip/rk3399-evb.dts +++ b/arch/arm64/boot/dts/rockchip/rk3399-evb.dts @@ -94,12 +94,43 @@ regulator-always-on; regulator-boot-on; }; + + clkin_gmac: external-gmac-clock { + compatible = "fixed-clock"; + clock-frequency = <125000000>; + clock-output-names = "clkin_gmac"; + #clock-cells = <0>; + }; + + vcc_phy: vcc-phy-regulator { + compatible = "regulator-fixed"; + regulator-name = "vcc_phy"; + regulator-always-on; + regulator-boot-on; + }; + }; _phy { status = "okay"; }; +&gmac { + phy-supply = <&vcc_phy>; + phy-mode = "rgmii"; + clock_in_out = "input"; + snps,reset-gpio = < 15 GPIO_ACTIVE_LOW>; + snps,reset-active-low; + snps,reset-delays-us = <0 1 5>; + assigned-clocks = <&cru SCLK_RMII_SRC>; + assigned-clock-parents = <&clkin_gmac>; + pinctrl-names = "default"; + pinctrl-0 = <&rgmii_pins>; + tx_delay = <0x28>; + rx_delay = <0x11>; + status = "okay"; +}; + { status = "okay"; }; -- 1.9.1
[PATCH 0/4] Support the rk3399 gmac pd function
This series adds handling for the gmac pd issue, and supports the rk3399 gmac in the devicetree. Caesar Wang (2): arm64: dts: rockchip: support gmac for rk3399 arm64: dts: rockchip: enable the gmac for rk3399 evb board David Wu (1): net: stmmac: dwmac-rk: add pd_gmac support for rk3399 Roger Chen (1): net: stmmac: dwmac-rk: fixes the gmac resume after PD on/off arch/arm64/boot/dts/rockchip/rk3399-evb.dts| 31 + arch/arm64/boot/dts/rockchip/rk3399.dtsi | 90 ++ drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c | 27 +--- 3 files changed, 139 insertions(+), 9 deletions(-) -- 1.9.1
[PATCH 3/4] arm64: dts: rockchip: support gmac for rk3399
This patch adds the needed gmac information for rk3399, and also supports the gmac pd.

Signed-off-by: Roger Chen
Signed-off-by: Caesar Wang --- arch/arm64/boot/dts/rockchip/rk3399.dtsi | 90 1 file changed, 90 insertions(+) diff --git a/arch/arm64/boot/dts/rockchip/rk3399.dtsi b/arch/arm64/boot/dts/rockchip/rk3399.dtsi index 32aebc8..53ac651 100644 --- a/arch/arm64/boot/dts/rockchip/rk3399.dtsi +++ b/arch/arm64/boot/dts/rockchip/rk3399.dtsi @@ -200,6 +200,26 @@ }; }; + gmac: ethernet@fe300000 { + compatible = "rockchip,rk3399-gmac"; + reg = <0x0 0xfe300000 0x0 0x10000>; + rockchip,grf = <&grf>; + interrupts = ; + interrupt-names = "macirq"; + clocks = <&cru SCLK_MAC>, <&cru SCLK_MAC_RX>, +<&cru SCLK_MAC_TX>, <&cru SCLK_MACREF>, +<&cru SCLK_MACREF_OUT>, <&cru ACLK_GMAC>, +<&cru PCLK_GMAC>; + clock-names = "stmmaceth", "mac_clk_rx", + "mac_clk_tx", "clk_mac_ref", + "clk_mac_refout", "aclk_mac", + "pclk_mac"; + resets = <&cru SRST_A_GMAC>; + reset-names = "stmmaceth"; + power-domains = <&power RK3399_PD_GMAC>; + status = "disabled"; + }; + sdio0: dwmmc@fe310000 { compatible = "rockchip,rk3399-dw-mshc", "rockchip,rk3288-dw-mshc"; @@ -611,6 +631,11 @@ status = "disabled"; }; + qos_gmac: qos@ffa5c000 { + compatible = "syscon"; + reg = <0x0 0xffa5c000 0x0 0x20>; + }; + qos_hdcp: qos@ffa90000 { compatible = "syscon"; reg = <0x0 0xffa90000 0x0 0x20>; }; @@ -704,6 +729,11 @@ #size-cells = <0>; /* These power domains are grouped by VD_CENTER */ + pd_gmac@RK3399_PD_GMAC { + reg = <RK3399_PD_GMAC>; + clocks = <&cru ACLK_GMAC>; + pm_qos = <&qos_gmac>; + }; pd_iep@RK3399_PD_IEP { reg = <RK3399_PD_IEP>; clocks = <&cru ACLK_IEP>, @@ -1183,6 +1213,66 @@ drive-strength = <13>; }; + gmac { + rgmii_pins: rgmii-pins { + rockchip,pins = + /* mac_txclk */ + <3 17 RK_FUNC_1 &pcfg_pull_none_13ma>, + /* mac_rxclk */ + <3 14 RK_FUNC_1 &pcfg_pull_none>, + /* mac_mdio */ + <3 13 RK_FUNC_1 &pcfg_pull_none>, + /* mac_txen */ + <3 12 RK_FUNC_1 &pcfg_pull_none_13ma>, + /* mac_clk */ + <3 11 RK_FUNC_1 &pcfg_pull_none>, + /* mac_rxdv */ + <3 9 RK_FUNC_1 &pcfg_pull_none>, + /* mac_mdc */ + <3 8 RK_FUNC_1 &pcfg_pull_none>, + /* mac_rxd1 */ + <3 7 RK_FUNC_1 &pcfg_pull_none>, + /* mac_rxd0 */ + <3 6 RK_FUNC_1 &pcfg_pull_none>, + /* mac_txd1 */ + <3 5 RK_FUNC_1 &pcfg_pull_none_13ma>, + /* mac_txd0 */ + <3 4 RK_FUNC_1 &pcfg_pull_none_13ma>, + /* mac_rxd3 */ + <3 3 RK_FUNC_1 &pcfg_pull_none>, + /* mac_rxd2 */ + <3 2 RK_FUNC_1 &pcfg_pull_none>, + /* mac_txd3 */ + <3 1 RK_FUNC_1 &pcfg_pull_none_13ma>, + /* mac_txd2 */ + <3 0 RK_FUNC_1 &pcfg_pull_none_13ma>; + }; + + rmii_pins: rmii-pins { + rockchip,pins = + /* mac_mdio */ + <3 13 RK_FUNC_1 &pcfg_pull_none>, + /* mac_txen */ + <3 12 RK_FUNC_1 &pcfg_pull_none_13ma>, + /* mac_clk */ + <3 11 RK_FUNC_1 &pcfg_pull_none>, +
Re: [RFC v2 06/10] landlock: Add LSM hooks
On 30/08/2016 22:18, Andy Lutomirski wrote: > On Tue, Aug 30, 2016 at 1:10 PM, Mickaël Salaünwrote: >> >> On 30/08/2016 20:56, Andy Lutomirski wrote: >>> On Aug 25, 2016 12:34 PM, "Mickaël Salaün" wrote: Add LSM hooks which can be used by userland through Landlock (eBPF) programs. This programs are limited to a whitelist of functions (cf. next commit). The eBPF program context is depicted by the struct landlock_data (cf. include/uapi/linux/bpf.h): * hook: LSM hook ID (useful when using the same program for multiple LSM hooks); * cookie: the 16-bit value from the seccomp filter that triggered this Landlock program; * args[6]: array of LSM hook arguments. The LSM hook arguments can contain raw values as integers or (unleakable) pointers. The only way to use the pointers are to pass them to an eBPF function according to their types (e.g. the bpf_landlock_cmp_fs_beneath_with_struct_file function can use a struct file pointer). For now, there is three hooks for file system access control: * file_open; * file_permission; * mmap_file. >>> >>> What's the purpose of exposing struct cred * to userspace? It's >>> primarily just an optimization to save a bit of RAM, and it's a >>> dubious optimization at that. What are you using it for? Would it >>> make more sense to use struct task_struct * or struct pid * instead? >>> >>> Also, exposing struct cred * has a really weird side-effect: it allows >>> (maybe even encourages) checking for pointer equality between two >>> struct cred * objects. Doing so will have erratic results. >>> >> >> The pointers exposed in the ePBF context are not directly readable by an >> unprivileged eBPF program thanks to the strong typing of the Landlock >> context and the static eBPF verification. There is no way to leak a >> kernel pointer to userspace from an unprivileged eBPF program: pointer >> arithmetic and comparison are prohibited. Pointers can only be pass as >> argument to dedicated eBPF functions. 
> > I'm not talking about leaking the value -- I'm talking about leaking > the predicate (a == b) for two struct cred pointers. That predicate > shouldn't be available because it has very odd effects. I'm pretty sure this case is covered by the impossibility of doing pointer comparisons. > > >> > >> For now, struct cred * is simply not used by any eBPF function and then > >> not usable at all. It only exist here because I map the LSM hook > >> arguments in a generic/automatic way to the eBPF context. > > Maybe remove it from this patch set then? Well, this is done with the LANDLOCK_HOOK* macros but I will remove it.
Re: [RFCv2 07/16] bpf: enable non-core use of the verfier
On Tue, 30 Aug 2016 21:07:50 +0200, Daniel Borkmann wrote: > > Having two modes seems more straight forward and I think we would only > > need to pay attention in the LD_IMM64 case, I don't think I've seen > > LLVM generating XORs, it's just the cBPF -> eBPF conversion. > > Okay, though, I think that the cBPF to eBPF migration wouldn't even > pass through the bpf_parse() handling, since verifier is not aware on > some of their aspects such as emitting calls directly (w/o *proto) or > arg mappings. Probably make sense to reject these (bpf_prog_was_classic()) > if they cannot be handled anyway? TBH again I only use cBPF for testing. It's a convenient way of generating certain instruction sequences. I can probably just drop it completely but the XOR patch is just 3 lines of code so not a huge cost either... I'll keep patch 6 in my tree for now. Alternatively - is there any eBPF assembler out there? Something converting verifier output back into ELF would be quite cool.
Re: [PATCH net-next 6/6] net/mlx5: Add handling for port module event
On Tue, Aug 30, 2016 at 2:29 PM, Saeed Mahameed wrote: > From: Huy Nguyen > +++ b/include/linux/mlx5/device.h > @@ -543,6 +544,15 @@ struct mlx5_eqe_vport_change { > __be32 rsvd1[6]; > } __packed; > > +struct mlx5_eqe_port_module { > + u8 rsvd0[1]; > + u8 module; > + u8 rsvd1[1]; > + u8 module_status; > + u8 rsvd2[2]; > + u8 error_type; > +}; > + Saeed, any reason for this struct and friends not to be @ the FW IFC file?
Re: [RFC v2 09/10] landlock: Handle cgroups (performance)
On Tue, Aug 30, 2016 at 1:20 PM, Mickaël Salaünwrote: > > On 30/08/2016 20:55, Andy Lutomirski wrote: >> On Sun, Aug 28, 2016 at 2:42 AM, Mickaël Salaün wrote: >>> >>> >>> On 28/08/2016 10:13, Andy Lutomirski wrote: On Aug 27, 2016 11:14 PM, "Mickaël Salaün" wrote: > > > On 27/08/2016 22:43, Alexei Starovoitov wrote: >> On Sat, Aug 27, 2016 at 09:35:14PM +0200, Mickaël Salaün wrote: >>> On 27/08/2016 20:06, Alexei Starovoitov wrote: On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote: > As said above, Landlock will not run an eBPF programs when not > strictly > needed. Attaching to a cgroup will have the same performance impact as > attaching to a process hierarchy. Having a prog per cgroup per lsm_hook is the only scalable way I could come up with. If you see another way, please propose. current->seccomp.landlock_prog is not the answer. >>> >>> Hum, I don't see the difference from a performance point of view between >>> a cgroup-based or a process hierarchy-based system. >>> >>> Maybe a better option should be to use an array of pointers with N >>> entries, one for each supported hook, instead of a unique pointer list? >> >> yes, clearly array dereference is faster than link list walk. >> Now the question is where to keep this prog_array[num_lsm_hooks] ? >> Since we cannot keep it inside task_struct, we have to allocate it. >> Every time the task is creted then. What to do on the fork? That >> will require changes all over. Then the obvious optimization would be >> to share this allocated array of prog pointers across multiple tasks... >> and little by little this new facility will look like cgroup. >> Hence the suggestion to put this array into cgroup from the start. > > I see your point :) > >> >>> Anyway, being able to attach an LSM hook program to a cgroup thanks to >>> the new BPF_PROG_ATTACH seems a good idea (while keeping the possibility >>> to use a process hierarchy). 
The downside will be to handle an LSM hook >>> program which is not triggered by a seccomp-filter, but this should be >>> needed anyway to handle interruptions. >> >> what do you mean 'not triggered by seccomp' ? >> You're not suggesting that this lsm has to enable seccomp to be >> functional? >> imo that's non starter due to overhead. > > Yes, for now, it is triggered by a new seccomp filter return value > RET_LANDLOCK, which can take a 16-bit value called cookie. This must not > be needed but could be useful to bind a seccomp filter security policy > with a Landlock one. Waiting for Kees's point of view… > I'm not Kees, but I'd be okay with that. I still think that doing this by process hierarchy a la seccomp will be easier to use and to understand (which is quite important for this kind of work) than doing it by cgroup. A feature I've wanted to add for a while is to have an fd that represents a seccomp layer, the idea being that you would set up your seccomp layer (with syscall filter, landlock hooks, etc) and then you would have a syscall to install that layer. Then an unprivileged sandbox manager could set up its layer and still be able to inject new processes into it later on, no cgroups needed. >>> >>> A nice thing I didn't highlight about Landlock is that a process can >>> prepare a layer of rules (arraymap of handles + Landlock programs) and >>> pass the file descriptors of the Landlock programs to another process. >>> This process could then apply this programs to get sandboxed. However, >>> for now, because a Landlock program is only triggered by a seccomp >>> filter (which do not follow the Landlock programs as a FD), they will be >>> useless. >>> >>> The FD referring to an arraymap of handles can also be used to update a >>> map and change the behavior of a Landlock program. A master process can >>> then add or remove restrictions to another process hierarchy on the fly. >> >> Maybe this could be extended a little bit. 
The fd could hold the >> seccomp filter *and* the LSM hook filters. FMODE_EXECUTE could give >> the ability to install it and FMODE_WRITE could give the ability to >> modify it. >> > > This is interesting! It should be possible to append the seccomp stack > of a source process to the seccomp stack of the target process when a > Landlock program is passed and then activated through seccomp(2). > > For the FMODE_EXECUTE/FMODE_WRITE, are you suggesting to manage > permission of the eBPF program FD in a specific way? > This wouldn't be an eBPF program FD -- it would be an FD encapsulating an entire configuration including seccomp BPF program, whatever landlock stuff is associated, and eventual seccomp monitor configuration (once I write
Re: [RFC v2 09/10] landlock: Handle cgroups (performance)
On 30/08/2016 20:55, Andy Lutomirski wrote: > On Sun, Aug 28, 2016 at 2:42 AM, Mickaël Salaün wrote: >> >> >> On 28/08/2016 10:13, Andy Lutomirski wrote: >>> On Aug 27, 2016 11:14 PM, "Mickaël Salaün" wrote: On 27/08/2016 22:43, Alexei Starovoitov wrote: > On Sat, Aug 27, 2016 at 09:35:14PM +0200, Mickaël Salaün wrote: >> On 27/08/2016 20:06, Alexei Starovoitov wrote: >>> On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote: As said above, Landlock will not run eBPF programs when not strictly needed. Attaching to a cgroup will have the same performance impact as attaching to a process hierarchy. >>> >>> Having a prog per cgroup per lsm_hook is the only scalable way I >>> could come up with. If you see another way, please propose. >>> current->seccomp.landlock_prog is not the answer. >> >> Hum, I don't see the difference from a performance point of view between >> a cgroup-based or a process hierarchy-based system. >> >> Maybe a better option would be to use an array of pointers with N >> entries, one for each supported hook, instead of a unique pointer list? > > yes, clearly array dereference is faster than link list walk. > Now the question is where to keep this prog_array[num_lsm_hooks] ? > Since we cannot keep it inside task_struct, we have to allocate it. > Every time a task is created, then. What to do on the fork? That > will require changes all over. Then the obvious optimization would be > to share this allocated array of prog pointers across multiple tasks... > and little by little this new facility will look like cgroup. > Hence the suggestion to put this array into cgroup from the start. I see your point :) > >> Anyway, being able to attach an LSM hook program to a cgroup thanks to >> the new BPF_PROG_ATTACH seems a good idea (while keeping the possibility >> to use a process hierarchy). 
The downside will be to handle an LSM hook >> program which is not triggered by a seccomp-filter, but this should be >> needed anyway to handle interruptions. > > what do you mean 'not triggered by seccomp' ? > You're not suggesting that this lsm has to enable seccomp to be > functional? > imo that's non starter due to overhead. Yes, for now, it is triggered by a new seccomp filter return value RET_LANDLOCK, which can take a 16-bit value called cookie. This must not be needed but could be useful to bind a seccomp filter security policy with a Landlock one. Waiting for Kees's point of view… >>> >>> I'm not Kees, but I'd be okay with that. I still think that doing >>> this by process hierarchy a la seccomp will be easier to use and to >>> understand (which is quite important for this kind of work) than doing >>> it by cgroup. >>> >>> A feature I've wanted to add for a while is to have an fd that >>> represents a seccomp layer, the idea being that you would set up your >>> seccomp layer (with syscall filter, landlock hooks, etc) and then you >>> would have a syscall to install that layer. Then an unprivileged >>> sandbox manager could set up its layer and still be able to inject new >>> processes into it later on, no cgroups needed. >> >> A nice thing I didn't highlight about Landlock is that a process can >> prepare a layer of rules (arraymap of handles + Landlock programs) and >> pass the file descriptors of the Landlock programs to another process. >> This process could then apply this programs to get sandboxed. However, >> for now, because a Landlock program is only triggered by a seccomp >> filter (which do not follow the Landlock programs as a FD), they will be >> useless. >> >> The FD referring to an arraymap of handles can also be used to update a >> map and change the behavior of a Landlock program. A master process can >> then add or remove restrictions to another process hierarchy on the fly. > > Maybe this could be extended a little bit. 
The fd could hold the > seccomp filter *and* the LSM hook filters. FMODE_EXECUTE could give > the ability to install it and FMODE_WRITE could give the ability to > modify it. > This is interesting! It should be possible to append the seccomp stack of a source process to the seccomp stack of the target process when a Landlock program is passed and then activated through seccomp(2). For the FMODE_EXECUTE/FMODE_WRITE, are you suggesting to manage permission of the eBPF program FD in a specific way? signature.asc Description: OpenPGP digital signature
Re: [RFC v2 06/10] landlock: Add LSM hooks
On Tue, Aug 30, 2016 at 1:10 PM, Mickaël Salaün wrote: > > On 30/08/2016 20:56, Andy Lutomirski wrote: >> On Aug 25, 2016 12:34 PM, "Mickaël Salaün" wrote: >>> >>> Add LSM hooks which can be used by userland through Landlock (eBPF) >>> programs. These programs are limited to a whitelist of functions (cf. >>> next commit). The eBPF program context is depicted by the struct >>> landlock_data (cf. include/uapi/linux/bpf.h): >>> * hook: LSM hook ID (useful when using the same program for multiple LSM >>> hooks); >>> * cookie: the 16-bit value from the seccomp filter that triggered this >>> Landlock program; >>> * args[6]: array of LSM hook arguments. >>> >>> The LSM hook arguments can contain raw values as integers or >>> (unleakable) pointers. The only way to use the pointers is to pass them >>> to an eBPF function according to their types (e.g. the >>> bpf_landlock_cmp_fs_beneath_with_struct_file function can use a struct >>> file pointer). >>> >>> For now, there are three hooks for file system access control: >>> * file_open; >>> * file_permission; >>> * mmap_file. >>> >> >> What's the purpose of exposing struct cred * to userspace? It's >> primarily just an optimization to save a bit of RAM, and it's a >> dubious optimization at that. What are you using it for? Would it >> make more sense to use struct task_struct * or struct pid * instead? >> >> Also, exposing struct cred * has a really weird side-effect: it allows >> (maybe even encourages) checking for pointer equality between two >> struct cred * objects. Doing so will have erratic results. >> > > The pointers exposed in the eBPF context are not directly readable by an > unprivileged eBPF program thanks to the strong typing of the Landlock > context and the static eBPF verification. There is no way to leak a > kernel pointer to userspace from an unprivileged eBPF program: pointer > arithmetic and comparison are prohibited. Pointers can only be passed as > arguments to dedicated eBPF functions. 
I'm not talking about leaking the value -- I'm talking about leaking the predicate (a == b) for two struct cred pointers. That predicate shouldn't be available because it has very odd effects. > > For now, struct cred * is simply not used by any eBPF function and then > not usable at all. It only exist here because I map the LSM hook > arguments in a generic/automatic way to the eBPF context. Maybe remove it from this patch set then? --Andy
Re: [RFC v2 06/10] landlock: Add LSM hooks
On 30/08/2016 20:56, Andy Lutomirski wrote: > On Aug 25, 2016 12:34 PM, "Mickaël Salaün" wrote: >> >> Add LSM hooks which can be used by userland through Landlock (eBPF) >> programs. These programs are limited to a whitelist of functions (cf. >> next commit). The eBPF program context is depicted by the struct >> landlock_data (cf. include/uapi/linux/bpf.h): >> * hook: LSM hook ID (useful when using the same program for multiple LSM >> hooks); >> * cookie: the 16-bit value from the seccomp filter that triggered this >> Landlock program; >> * args[6]: array of LSM hook arguments. >> >> The LSM hook arguments can contain raw values as integers or >> (unleakable) pointers. The only way to use the pointers is to pass them >> to an eBPF function according to their types (e.g. the >> bpf_landlock_cmp_fs_beneath_with_struct_file function can use a struct >> file pointer). >> >> For now, there are three hooks for file system access control: >> * file_open; >> * file_permission; >> * mmap_file. >> > > What's the purpose of exposing struct cred * to userspace? It's > primarily just an optimization to save a bit of RAM, and it's a > dubious optimization at that. What are you using it for? Would it > make more sense to use struct task_struct * or struct pid * instead? > > Also, exposing struct cred * has a really weird side-effect: it allows > (maybe even encourages) checking for pointer equality between two > struct cred * objects. Doing so will have erratic results. > The pointers exposed in the eBPF context are not directly readable by an unprivileged eBPF program thanks to the strong typing of the Landlock context and the static eBPF verification. There is no way to leak a kernel pointer to userspace from an unprivileged eBPF program: pointer arithmetic and comparison are prohibited. Pointers can only be passed as arguments to dedicated eBPF functions. For now, struct cred * is simply not used by any eBPF function and thus not usable at all. 
It only exists here because I map the LSM hook arguments in a generic/automatic way to the eBPF context. I'm planning to extend the Landlock context with extra pointers, regardless of the LSM hook. We could then use task_struct, skb or any other kernel objects, in a safe way, with dedicated functions.
Re: [RFCv2 16/16] nfp: bpf: add offload of TC direct action mode
On 08/30/2016 12:52 PM, Jakub Kicinski wrote: On Mon, 29 Aug 2016 23:09:35 +0200, Daniel Borkmann wrote: +* 0,1 okNOT SUPPORTED[1] +* 2 drop 0x22 -> drop, count as stat1 +* 4,5 nuke 0x02 -> drop +* 7 redir 0x44 -> redir, count as stat2 +* * unspec 0x11 -> pass, count as stat0 +* +* [1] We can't support OK and RECLASSIFY because we can't tell TC +* the exact decision made. We are forced to support UNSPEC +* to handle aborts so that's the only one we handle for passing +* packets up the stack. In da mode, RECLASSIFY is not supported, so this one could be scratched. For the OK and UNSPEC part, couldn't both be treated the same (as in: OK / pass to stack roughly equivalent as in sch_handle_ingress())? Or is the issue that you cannot populate skb->tc_index when passing to stack (maybe just fine to leave it at 0 for now)? The comment is a bit confus(ed|ing). The problem is: tc filter add skip_sw tc filter add skip_hw If packet appears in the stack - was it because of OK or UNSPEC (or RECLASSIFY) in filter1? Do we need to run filter2 or not? Passing tc_index can be implemented the same way I do mark today. Okay, I see, thanks for explaining. So, if passing tc_index (or any other meta data) can be implemented the same way as we do with mark already, could we store such verdict, say, in some unused skb->tc_verd bits (the skb->tc_index could be filled by the program already) and pass that up the stack to differentiate between them? There should be no prior user before ingress, so that patch 4 could become something like: if (tc_skip_sw(prog->gen_flags)) { filter_res = tc_map_hw_verd_to_act(skb); } else if (at_ingress) { ... } ... And I assume it wouldn't make any sense anyway to have a skip_sw filter being chained /after/ some skip_hw and the like, right? Just curious, does TC_ACT_REDIRECT work in this scenario? I do the redirects in the card, all the problems stem from the Ok, cool. 
difficulty of passing full ret code in the skb from the driver to tc_classify()/cls_bpf_classify().
Re: [RFC v2 00/10] Landlock LSM: Unprivileged sandboxing
On Tue, Aug 30, 2016 at 12:51 PM, Mickaël Salaün wrote: > > On 30/08/2016 18:06, Andy Lutomirski wrote: >> On Thu, Aug 25, 2016 at 3:32 AM, Mickaël Salaün wrote: >>> Hi, >>> >>> This series is a proof of concept to fill some missing parts of seccomp, such >>> as the ability to check syscall argument pointers or to create more dynamic >>> security policies. The goal of this new stackable Linux Security Module (LSM) called >>> Landlock is to allow any process, including unprivileged ones, to create >>> powerful security sandboxes comparable to the Seatbelt/XNU Sandbox or the >>> OpenBSD Pledge. This kind of sandbox helps to mitigate the security impact of >>> bugs or unexpected/malicious behaviors in userland applications. >> >> Mickaël, will you be at KS and/or LPC? >> > > I won't be at KS/LPC but I will give a talk at Kernel Recipes (Paris) > for which registration will start Thursday (and will not last long). :) There's a teeny tiny chance I'll be there. I've done way too much traveling lately.
Re: [RFC v2 00/10] Landlock LSM: Unprivileged sandboxing
On 30/08/2016 18:06, Andy Lutomirski wrote: > On Thu, Aug 25, 2016 at 3:32 AM, Mickaël Salaün wrote: >> Hi, >> >> This series is a proof of concept to fill some missing parts of seccomp, such as >> the ability to check syscall argument pointers or to create more dynamic security >> policies. The goal of this new stackable Linux Security Module (LSM) called >> Landlock is to allow any process, including unprivileged ones, to create >> powerful security sandboxes comparable to the Seatbelt/XNU Sandbox or the >> OpenBSD Pledge. This kind of sandbox helps to mitigate the security impact of >> bugs or unexpected/malicious behaviors in userland applications. > > Mickaël, will you be at KS and/or LPC? > I won't be at KS/LPC but I will give a talk at Kernel Recipes (Paris) for which registration will start Thursday (and will not last long). :) Mickaël
Re: [PATCH v2] ipv6: Use inbound ifaddr as source addresses for ICMPv6 errors
On Mon, Aug 29, 2016 at 02:34:32AM +0800, Eli Cooper wrote: > Hello, > > > On 2016/8/29 1:18, Guillaume Nault wrote: > > On Sun, Aug 28, 2016 at 11:34:06AM +0800, Eli Cooper wrote: > >> According to RFC 1885 2.2(c), the source address of ICMPv6 > >> errors in response to forwarded packets should be set to the > >> unicast address of the forwarding interface in order to be helpful > >> in diagnosis. > >> > > FWIW, this behaviour was deprecated ten years ago by RFC 4443: > > "The address SHOULD be chosen according to the rules that would be used > > to select the source address for any other packet originated by the > > node, given the destination address of the packet." > > > > The door is left open for other address selection algorithms but, IMHO, > > changing the kernel's behaviour is better justified by real use cases > > than by obsolete RFCs. > > I agree, sorry for the obsolete RFC. This is actually motivated by a > real use case: Say a Linux box is acting as a router that forwards > packets with policy routing from two local networks to two uplinks, > respectively. An outside host is performing a traceroute to a host on > one of the LANs. If the kernel's default route is via the other LAN's > uplink, it will send ICMPv6 packets with a source address that has > nothing to do with the network in question, yet the message probably > will reach the outside host. > > Here, using the address of the inbound or exiting interface as the source > address is evidently "a more informative choice." I surmise this is the reason > why the comment reads "Force OUTPUT device used as source address" when > dealing with hop limit exceeded packets in ip6_forward(), although not > effectively so. The current behaviour not only confuses diagnosis, but > also might be undesirable if the addresses of the networks are best kept > secret from each other. > That makes more sense indeed. 
Would be nice to have this use case in the commit message rather than the blind reference to the obsolete RFC. Regards, Guillaume
[PATCH net-next 1/1] rxrpc: Don't expose skbs to in-kernel users [ver #2]
Don't expose skbs to in-kernel users, such as the AFS filesystem, but instead provide a notification hook that indicates that a call needs attention and another that indicates that there's a new call to be collected. This makes the following possibilities more achievable: (1) Call refcounting can be made simpler if skbs don't hold refs to calls. (2) skbs referring to non-data events will be able to be freed much sooner rather than being queued for AFS to pick up, as rxrpc_kernel_recv_data will be able to consult the call state. (3) We can shortcut the receive phase when a call is remotely aborted because we don't have to go through all the packets to get to the one cancelling the operation. (4) It makes it easier to do encryption/decryption directly between AFS's buffers and sk_buffs. (5) Encryption/decryption can more easily be done in AFS's thread contexts - usually that of the userspace process that issued a syscall - rather than in one of rxrpc's background threads on a workqueue. (6) AFS will be able to wait synchronously on a call inside AF_RXRPC. To make this work, the following interface function has been added: int rxrpc_kernel_recv_data( struct socket *sock, struct rxrpc_call *call, void *buffer, size_t bufsize, size_t *_offset, bool want_more, u32 *_abort_code); This is the recvmsg equivalent. It allows the caller to find out about the state of a specific call and to transfer received data into a buffer piecemeal. afs_extract_data() and rxrpc_kernel_recv_data() now do all the extraction logic between them. They don't wait synchronously yet because the socket lock needs to be dealt with. Five interface functions have been removed: rxrpc_kernel_is_data_last() rxrpc_kernel_get_abort_code() rxrpc_kernel_get_error_number() rxrpc_kernel_free_skb() rxrpc_kernel_data_consumed() As a temporary hack, sk_buffs going to an in-kernel call are queued on the rxrpc_call struct (->knlrecv_queue) rather than being handed over to the in-kernel user. 
To process the queue internally, a temporary function, temp_deliver_data() has been added. This will be replaced with common code between the rxrpc_recvmsg() path and the kernel_rxrpc_recv_data() path in a future patch. Signed-off-by: David Howells--- Documentation/networking/rxrpc.txt | 72 +++--- fs/afs/cmservice.c | 142 ++-- fs/afs/fsclient.c | 148 +--- fs/afs/internal.h | 33 +-- fs/afs/rxrpc.c | 439 +--- fs/afs/vlclient.c |7 - include/net/af_rxrpc.h | 35 +-- net/rxrpc/af_rxrpc.c | 29 +- net/rxrpc/ar-internal.h| 23 ++ net/rxrpc/call_accept.c| 13 + net/rxrpc/call_object.c|5 net/rxrpc/conn_event.c |1 net/rxrpc/input.c | 10 + net/rxrpc/output.c |2 net/rxrpc/recvmsg.c| 191 +--- net/rxrpc/skbuff.c |1 16 files changed, 565 insertions(+), 586 deletions(-) diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt index cfc8cb91452f..1b63bbc6b94f 100644 --- a/Documentation/networking/rxrpc.txt +++ b/Documentation/networking/rxrpc.txt @@ -748,6 +748,37 @@ The kernel interface functions are as follows: The msg must not specify a destination address, control data or any flags other than MSG_MORE. len is the total amount of data to transmit. + (*) Receive data from a call. + + int rxrpc_kernel_recv_data(struct socket *sock, + struct rxrpc_call *call, + void *buf, + size_t size, + size_t *_offset, + bool want_more, + u32 *_abort) + + This is used to receive data from either the reply part of a client call + or the request part of a service call. buf and size specify how much + data is desired and where to store it. *_offset is added on to buf and + subtracted from size internally; the amount copied into the buffer is + added to *_offset before returning. + + want_more should be true if further data will be required after this is + satisfied and false if this is the last item of the receive phase. 
+ + There are three normal returns: 0 if the buffer was filled and want_more + was true; 1 if the buffer was filled, the last DATA packet has been + emptied and want_more was false; and -EAGAIN if the function needs to be + called again. + + If the last DATA packet is processed but the buffer contains less than + the amount requested, EBADMSG is returned. If
[PATCH net-next 0/1] rxrpc: Remove use of skbs from AFS [ver #2]
Here's a single patch that removes the use of sk_buffs from fs/afs. From this point on they'll be entirely retained within net/rxrpc and AFS just asks AF_RXRPC for linear buffers of data. This needs to be applied on top of the just-posted preparatory patch set. This makes some future developments easier/possible: (1) Simpler rxrpc_call usage counting. (2) Earlier freeing of metadata sk_buffs. (3) Rx phase shortcutting on abort/error. (4) Encryption/decryption in the AFS fs contexts/threads and directly between sk_buffs and AFS buffers. (5) Synchronous waiting in reception for AFS. Changes: (V2) Fixed afs_transfer_reply() whereby call->offset was incorrectly being added to the buffer pointer (it doesn't matter as long as the reply fits entirely inside a single packet). Removed an unused goto-label and an unused variable. The patch can be found here also: http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite Tagged thusly: git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git rxrpc-rewrite-20160830-2v2 David --- David Howells (1): rxrpc: Don't expose skbs to in-kernel users Documentation/networking/rxrpc.txt | 72 +++--- fs/afs/cmservice.c | 142 ++-- fs/afs/fsclient.c | 148 +--- fs/afs/internal.h | 33 +-- fs/afs/rxrpc.c | 439 +--- fs/afs/vlclient.c |7 - include/net/af_rxrpc.h | 35 +-- net/rxrpc/af_rxrpc.c | 29 +- net/rxrpc/ar-internal.h| 23 ++ net/rxrpc/call_accept.c| 13 + net/rxrpc/call_object.c|5 net/rxrpc/conn_event.c |1 net/rxrpc/input.c | 10 + net/rxrpc/output.c |2 net/rxrpc/recvmsg.c| 191 +--- net/rxrpc/skbuff.c |1 16 files changed, 565 insertions(+), 586 deletions(-)
Re: [PATCH v3 4/5] net: stmmac: add a glue driver for the Amlogic Meson 8b / GXBB DWMAC
On 08/28, Martin Blumenstingl wrote: > +static int meson8b_init_clk(struct meson8b_dwmac *dwmac) > +{ > + struct clk_init_data init; > + int i, ret; > + struct device *dev = &dwmac->pdev->dev; > + char clk_name[32]; > + const char *clk_div_parents[1]; > + const char *mux_parent_names[MUX_CLK_NUM_PARENTS]; > + static struct clk_div_table clk_25m_div_table[] = { > + { .val = 0, .div = 5 }, > + { .val = 1, .div = 10 }, > + { /* sentinel */ }, > + }; > + > + /* get the mux parents from DT */ > + for (i = 0; i < MUX_CLK_NUM_PARENTS; i++) { > + char name[16]; > + > + snprintf(name, sizeof(name), "clkin%d", i); > + dwmac->m250_mux_parent[i] = devm_clk_get(dev, name); > + if (IS_ERR(dwmac->m250_mux_parent[i])) { > + ret = PTR_ERR(dwmac->m250_mux_parent[i]); > + if (ret != -EPROBE_DEFER) > + dev_err(dev, "Missing clock %s\n", name); > + return ret; > + } > + > + mux_parent_names[i] = > + __clk_get_name(dwmac->m250_mux_parent[i]); > + } > + > + /* create the m250_mux */ > + snprintf(clk_name, sizeof(clk_name), "%s#m250_sel", dev_name(dev)); > + init.name = clk_name; > + init.ops = &clk_mux_ops; > + init.flags = CLK_IS_BASIC; Please don't use this flag unless you need it. > + init.parent_names = mux_parent_names; > + init.num_parents = MUX_CLK_NUM_PARENTS; > + > + dwmac->m250_mux.reg = dwmac->regs + PRG_ETH0; > + dwmac->m250_mux.shift = PRG_ETH0_CLK_M250_SEL_SHIFT; > + dwmac->m250_mux.mask = PRG_ETH0_CLK_M250_SEL_MASK; > + dwmac->m250_mux.flags = 0; > + dwmac->m250_mux.table = NULL; > + dwmac->m250_mux.hw.init = &init; > + > + dwmac->m250_mux_clk = devm_clk_register(dev, &dwmac->m250_mux.hw); > + if (WARN_ON(PTR_ERR_OR_ZERO(dwmac->m250_mux_clk))) Why not if (WARN_ON(IS_ERR()))? The OR_ZERO part seems confusing. 
> + return PTR_ERR(dwmac->m250_mux_clk); > + > + /* create the m250_div */ > + snprintf(clk_name, sizeof(clk_name), "%s#m250_div", dev_name(dev)); > + init.name = devm_kstrdup(dev, clk_name, GFP_KERNEL); > + init.ops = &clk_divider_ops; > + init.flags = CLK_IS_BASIC | CLK_SET_RATE_PARENT; > + clk_div_parents[0] = __clk_get_name(dwmac->m250_mux_clk); > + init.parent_names = clk_div_parents; > + init.num_parents = ARRAY_SIZE(clk_div_parents); > + > + dwmac->m250_div.reg = dwmac->regs + PRG_ETH0; > + dwmac->m250_div.shift = PRG_ETH0_CLK_M250_DIV_SHIFT; > + dwmac->m250_div.width = PRG_ETH0_CLK_M250_DIV_WIDTH; > + dwmac->m250_div.hw.init = &init; > + dwmac->m250_div.flags = CLK_DIVIDER_ONE_BASED | CLK_DIVIDER_ALLOW_ZERO; > + > + dwmac->m250_div_clk = devm_clk_register(dev, &dwmac->m250_div.hw); We've been trying to move away from devm_clk_register() to devm_clk_hw_register() so that clk providers aren't also clk consumers. Obviously in this case this driver is a provider and a consumer, so this isn't as important. Kevin did something similar in the mmc driver, so I'll reiterate what I said on that patch. Perhaps we should make __clk_create_clk() into a real clk provider API so that we can use devm_clk_hw_register() here and then generate a clk for this device. That would allow us to have proper consumer tracking without relying on the clk that is returned from clk_register() (the intent is to make that clk instance internal to the framework). 
> + if (WARN_ON(PTR_ERR_OR_ZERO(dwmac->m250_div_clk))) > + return PTR_ERR(dwmac->m250_div_clk); > + > + /* create the m25_div */ > + snprintf(clk_name, sizeof(clk_name), "%s#m25_div", dev_name(dev)); > + init.name = devm_kstrdup(dev, clk_name, GFP_KERNEL); > + init.ops = &clk_divider_ops; > + init.flags = CLK_IS_BASIC | CLK_SET_RATE_PARENT; > + clk_div_parents[0] = __clk_get_name(dwmac->m250_div_clk); > + init.parent_names = clk_div_parents; > + init.num_parents = ARRAY_SIZE(clk_div_parents); > + > + dwmac->m25_div.reg = dwmac->regs + PRG_ETH0; > + dwmac->m25_div.shift = PRG_ETH0_CLK_M25_DIV_SHIFT; > + dwmac->m25_div.width = PRG_ETH0_CLK_M25_DIV_WIDTH; > + dwmac->m25_div.table = clk_25m_div_table; > + dwmac->m25_div.hw.init = &init; > + dwmac->m25_div.flags = CLK_DIVIDER_ALLOW_ZERO; > + > + dwmac->m25_div_clk = devm_clk_register(dev, &dwmac->m25_div.hw); > + if (WARN_ON(PTR_ERR_OR_ZERO(dwmac->m25_div_clk))) > + return PTR_ERR(dwmac->m25_div_clk); > + > + return 0; This could be return WARN_ON(PTR_ERR_OR_ZERO(...)) > + > +static int meson8b_dwmac_probe(struct platform_device *pdev) > +{ > + struct plat_stmmacenet_data *plat_dat; > + struct stmmac_resources stmmac_res; > + struct resource *res; > + struct meson8b_dwmac *dwmac; > + int ret; > + > + ret = stmmac_get_platform_resources(pdev, &stmmac_res); > + if (ret) > +
Re: [RFCv2 07/16] bpf: enable non-core use of the verifier
On 08/30/2016 12:48 PM, Jakub Kicinski wrote: On Mon, 29 Aug 2016 22:17:10 +0200, Daniel Borkmann wrote: On 08/29/2016 10:13 PM, Daniel Borkmann wrote: On 08/27/2016 07:32 PM, Alexei Starovoitov wrote: On Sat, Aug 27, 2016 at 12:40:04PM +0100, Jakub Kicinski wrote: probably array_of_insn_aux_data[num_insns] should do it. Unlike reg_state that is forked on branches, this array is only one. This would be for struct nfp_insn_meta, right? So, struct bpf_ext_parser_ops could become: static const struct bpf_ext_parser_ops nfp_bpf_pops = { .insn_hook = nfp_verify_insn, .insn_size = sizeof(struct nfp_insn_meta), }; ... where bpf_parse() would prealloc that f.e. in env->insn_meta[]. Hm.. this is tempting, I will have to store the pointer type in nfp_insn_meta soon, anyway. (Well, actually everything can live in env->private_data.) We are discussing changing the place verifier keep its pointer type annotation, I don't think we could put that in the private_data. Agree, was also my concern when I read patch 5 and 6. It would not only be related to types, but also different imm values, where the memcmp() could fail on. Potentially the latter can be avoided by only checking types which should be sufficient. Hmm, maybe only bpf_parse() should go through this stricter mode since only relevant for drivers (otoh downside would be that bugs would end up less likely to be found). I don't want only checking types because it would defeat my exit code validation :) I was thinking about doing a lazy evaluation - registering branches to explored_states with UNKNOWN and only upgrading to CONST when someone actually needed the imm value. I'm not sure the complexity would be justified, though. Having two modes seems more straight forward and I think we would only need to pay attention in the LD_IMM64 case, I don't think I've seen LLVM generating XORs, it's just the cBPF -> eBPF conversion. 
Okay, though, I think that the cBPF to eBPF migration wouldn't even pass through the bpf_parse() handling, since verifier is not aware on some of their aspects such as emitting calls directly (w/o *proto) or arg mappings. Probably make sense to reject these (bpf_prog_was_classic()) if they cannot be handled anyway? I see. Indeed then you'd need the verifier to walk all paths to make sure constant return values. I think this would still not cover the cases where you'd fetch a return value/verdict from a map, but this should be ignored/ rejected for now, also since majority of programs are not written in such a way. If you only need yes/no check then such info can probably be collected unconditionally during initial program load. Like prog->cb_access flag. One other comment wrt the header, when you move these things there, would be good to prefix with bpf_* so that this doesn't clash in future with other header files. Good point!
[PATCH v2] cfg80211: Remove deprecated create_singlethread_workqueue
The workqueue "cfg80211_wq" is involved in cleanup, scan and event related works. It queues multiple work items &rdev->event_work, &rdev->dfs_update_channels_wk, &wiphy_to_rdev(request->wiphy)->scan_done_wk, &wiphy_to_rdev(wiphy)->sched_scan_results_wk, which require strict execution ordering. Hence, an ordered dedicated workqueue has been used. Since it's a wireless driver, WQ_MEM_RECLAIM has been set to ensure forward progress under memory pressure. Signed-off-by: Bhaktipriya Shridhar --- net/wireless/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/wireless/core.c b/net/wireless/core.c index d25c82b..2cd4563 100644 --- a/net/wireless/core.c +++ b/net/wireless/core.c @@ -1218,7 +1218,7 @@ static int __init cfg80211_init(void) if (err) goto out_fail_reg; - cfg80211_wq = create_singlethread_workqueue("cfg80211"); + cfg80211_wq = alloc_ordered_workqueue("cfg80211", WQ_MEM_RECLAIM); if (!cfg80211_wq) { err = -ENOMEM; goto out_fail_wq; -- 2.1.4
Re: [PATCH V2] dt: net: enhance DWC EQoS binding to support Tegra186
On Wed, Aug 24, 2016 at 03:20:46PM -0600, Stephen Warren wrote: > From: Stephen Warren> > The Synopsys DWC EQoS is a configurable IP block which supports multiple > options for bus type, clocking and reset structure, and feature list. > Extend the DT binding to define a "compatible value" for the configuration > contained in NVIDIA's Tegra186 SoC, and define some new properties and > list property entries required by that configuration. > > Signed-off-by: Stephen Warren > --- > v2: > * Add an explicit compatible value for the Axis SoC's version of the EQOS > IP; this allows the driver to handle any SoC-specific integration quirks > that are required, rather than only knowing about the IP block in > isolation. This is good general DT practice. The existing value is still > documented to support existing DTs. > * Reworked the list of clocks the binding requires: > - Combined "tx" and "phy_ref_clk"; for GMII/RGMII configurations, these > are the same thing. > - Added extra description to the "rx" and "tx" clocks, to make it clear > exactly which HW clock they represent. > - Made the new "tx" and "slave_bus" names more prominent than the > original "phy_ref_clk" and "apb_pclk". The new names are more generic > and should work for any enhanced version of the binding (e.g. to > support additional PHY types). New compatible values will hopefully > choose to require the new names. > * Added a couple extra clocks to the list that may need to be supported in > future binding revisions. > * Fixed a typo; "clocks" -> "resets". 
> ---
>  .../bindings/net/snps,dwc-qos-ethernet.txt | 75 --
>  1 file changed, 71 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt b/Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt
> index 51f8d2eba8d8..1d028259824a 100644
> --- a/Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt
> +++ b/Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt
> @@ -1,21 +1,87 @@
>  * Synopsys DWC Ethernet QoS IP version 4.10 driver (GMAC)
>
> +This binding supports the Synopsys Designware Ethernet QoS (Quality Of Service)
> +IP block. The IP supports multiple options for bus type, clocking and reset
> +structure, and feature list. Consequently, a number of properties and list
> +entries in properties are marked as optional, or only required in specific HW
> +configurations.
>
>  Required properties:
> -- compatible: Should be "snps,dwc-qos-ethernet-4.10"
> +- compatible: One of:
> +  - "axis,artpec6-eqos", "snps,dwc-qos-ethernet-4.10"
> +    Represents the IP core when integrated into the Axis ARTPEC-6 SoC.
> +  - "nvidia,tegra186-eqos", "snps,dwc-qos-ethernet-4.10"
> +    Represents the IP core when integrated into the NVIDIA Tegra186 SoC.
> +  - "snps,dwc-qos-ethernet-4.10"
> +    This combination is deprecated. It should be treated as equivalent to
> +    "axis,artpec6-eqos", "snps,dwc-qos-ethernet-4.10". It is supported to be
> +    compatible with earlier revisions of this binding.
>  - reg: Address and length of the register set for the device
> -- clocks: Phandles to the reference clock and the bus clock
> -- clock-names: Should be "phy_ref_clk" for the reference clock and "apb_pclk"
> -  for the bus clock.
> +- clocks: Phandle and clock specifiers for each entry in clock-names, in the
> +  same order. See ../clock/clock-bindings.txt.
> +- clock-names: May contain any/all of the following depending on the IP
> +  configuration, in any order:

No, they should be in a defined order.

> +  - "tx"
> +    (Alternate name "phy_ref_clk"; only one alternate must appear.)

Obviously, the prior clocks were just made up for what someone needed at the time and did not read the spec. I think it would be better to just separate the old names and state they are deprecated and which compatibles they are for.

> +    The EQOS transmit path clock. The HW signal name is clk_tx_i.
> +    In some configurations (e.g. GMII/RGMII), this clock also drives the PHY TX
> +    path. In other configurations, other clocks (such as tx_125, rmii) may
> +    drive the PHY TX path.
> +  - "rx"
> +    The EQOS receive path clock. The HW signal name is clk_rx_i.
> +    In some configurations (e.g. GMII/RGMII), this clock also drives the PHY RX
> +    path. In other configurations, other clocks (such as rx_125, pmarx_0,
> +    pmarx_1, rmii) may drive the PHY RX path.
> +  - "slave_bus"
> +    (Alternate name "apb_pclk"; only one alternate must appear.)
> +    The CPU/slave-bus (CSR) interface clock. Despite the name, this applies to
> +    any bus type; APB, AHB, AXI, etc. The HW signal name is hclk_i (AHB) or
> +    clk_csr_i (other buses).

Sounds like 2 clocks here.

> +  - "master_bus"
> +    The master bus interface clock. Only required in configurations that use a
> +    separate clock for the
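To make the proposed clock-names concrete, a hypothetical node under the new binding might look like the sketch below. The unit address, reg values, and clock phandles are invented for illustration; only the compatible string and clock names come from the binding text above.

```dts
ethernet@2490000 {
	compatible = "nvidia,tegra186-eqos", "snps,dwc-qos-ethernet-4.10";
	reg = <0x02490000 0x10000>;
	/* entries must match clock-names order */
	clocks = <&car 10>, <&car 11>, <&car 12>;
	clock-names = "slave_bus", "rx", "tx";
};
```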
Re: [RFC v2 06/10] landlock: Add LSM hooks
On Aug 25, 2016 12:34 PM, "Mickaël Salaün" wrote:
> Add LSM hooks which can be used by userland through Landlock (eBPF)
> programs. These programs are limited to a whitelist of functions (cf.
> next commit). The eBPF program context is depicted by the struct
> landlock_data (cf. include/uapi/linux/bpf.h):
> * hook: LSM hook ID (useful when using the same program for multiple LSM
>   hooks);
> * cookie: the 16-bit value from the seccomp filter that triggered this
>   Landlock program;
> * args[6]: array of LSM hook arguments.
>
> The LSM hook arguments can contain raw values as integers or
> (unleakable) pointers. The only way to use the pointers is to pass them
> to an eBPF function according to their types (e.g. the
> bpf_landlock_cmp_fs_beneath_with_struct_file function can use a struct
> file pointer).
>
> For now, there are three hooks for file system access control:
> * file_open;
> * file_permission;
> * mmap_file.

What's the purpose of exposing struct cred * to userspace? It's primarily just an optimization to save a bit of RAM, and it's a dubious optimization at that. What are you using it for? Would it make more sense to use struct task_struct * or struct pid * instead?

Also, exposing struct cred * has a really weird side-effect: it allows (maybe even encourages) checking for pointer equality between two struct cred * objects. Doing so will have erratic results.
Re: [RFC v2 09/10] landlock: Handle cgroups (performance)
On Sun, Aug 28, 2016 at 2:42 AM, Mickaël Salaün wrote:
>
> On 28/08/2016 10:13, Andy Lutomirski wrote:
>> On Aug 27, 2016 11:14 PM, "Mickaël Salaün" wrote:
>>>
>>> On 27/08/2016 22:43, Alexei Starovoitov wrote:
>>>> On Sat, Aug 27, 2016 at 09:35:14PM +0200, Mickaël Salaün wrote:
>>>>> On 27/08/2016 20:06, Alexei Starovoitov wrote:
>>>>>> On Sat, Aug 27, 2016 at 04:06:38PM +0200, Mickaël Salaün wrote:
>>>>>>> As said above, Landlock will not run an eBPF program when not strictly
>>>>>>> needed. Attaching to a cgroup will have the same performance impact as
>>>>>>> attaching to a process hierarchy.
>>>>>>
>>>>>> Having a prog per cgroup per lsm_hook is the only scalable way I
>>>>>> could come up with. If you see another way, please propose.
>>>>>> current->seccomp.landlock_prog is not the answer.
>>>>>
>>>>> Hum, I don't see the difference from a performance point of view between
>>>>> a cgroup-based or a process hierarchy-based system.
>>>>>
>>>>> Maybe a better option would be to use an array of pointers with N
>>>>> entries, one for each supported hook, instead of a unique pointer list?
>>>>
>>>> yes, clearly array dereference is faster than link list walk.
>>>> Now the question is where to keep this prog_array[num_lsm_hooks]?
>>>> Since we cannot keep it inside task_struct, we have to allocate it
>>>> every time the task is created then. What to do on the fork? That will
>>>> require changes all over. Then the obvious optimization would be to
>>>> share this allocated array of prog pointers across multiple tasks...
>>>> and little by little this new facility will look like cgroup.
>>>> Hence the suggestion to put this array into cgroup from the start.
>>>
>>> I see your point :)
>>>
>>>>> Anyway, being able to attach an LSM hook program to a cgroup thanks to
>>>>> the new BPF_PROG_ATTACH seems a good idea (while keeping the possibility
>>>>> to use a process hierarchy). The downside will be to handle an LSM hook
>>>>> program which is not triggered by a seccomp-filter, but this should be
>>>>> needed anyway to handle interruptions.
>>>>
>>>> what do you mean 'not triggered by seccomp'?
>>>> You're not suggesting that this lsm has to enable seccomp to be
>>>> functional? imo that's non starter due to overhead.
>>>
>>> Yes, for now, it is triggered by a new seccomp filter return value
>>> RET_LANDLOCK, which can take a 16-bit value called cookie. This must not
>>> be needed but could be useful to bind a seccomp filter security policy
>>> with a Landlock one. Waiting for Kees's point of view…
>>
>> I'm not Kees, but I'd be okay with that. I still think that doing
>> this by process hierarchy a la seccomp will be easier to use and to
>> understand (which is quite important for this kind of work) than doing
>> it by cgroup.
>>
>> A feature I've wanted to add for a while is to have an fd that
>> represents a seccomp layer, the idea being that you would set up your
>> seccomp layer (with syscall filter, landlock hooks, etc.) and then you
>> would have a syscall to install that layer. Then an unprivileged
>> sandbox manager could set up its layer and still be able to inject new
>> processes into it later on, no cgroups needed.
>
> A nice thing I didn't highlight about Landlock is that a process can
> prepare a layer of rules (arraymap of handles + Landlock programs) and
> pass the file descriptors of the Landlock programs to another process.
> That process could then apply these programs to get sandboxed. However,
> for now, because a Landlock program is only triggered by a seccomp
> filter (which does not follow the Landlock programs as an FD), they
> will be useless.
>
> The FD referring to an arraymap of handles can also be used to update a
> map and change the behavior of a Landlock program. A master process can
> then add or remove restrictions to another process hierarchy on the fly.

Maybe this could be extended a little bit. The fd could hold the seccomp filter *and* the LSM hook filters. FMODE_EXECUTE could give the ability to install it and FMODE_WRITE could give the ability to modify it.
Re: [PATCH v3 0/5] meson: Meson8b and GXBB DWMAC glue driver
On Mon, Aug 29, 2016 at 5:40 AM, David Miller wrote:
> From: Martin Blumenstingl
> Date: Sun, 28 Aug 2016 18:16:32 +0200
>
>> This adds a DWMAC glue driver for the PRG_ETHERNET registers found in
>> Meson8b and GXBB SoCs. Based on the "old" meson6b-dwmac glue driver,
>> the register layout is completely different; thus I introduced a
>> separate driver.
>>
>> Changes since v2:
>> - fixed unloading the glue driver when built as module. This pulls in a
>>   patch from Joachim Eastwood (thanks) to get our private data structure
>>   (bsp_priv).
>
> This doesn't apply cleanly at all to the net-next tree, so I have
> no idea where you expect these changes to be applied.

OK, maybe Kevin can help me out here, as I think the patches should go to various trees. I think patches 1, 3 and 4 should go through the net-next tree (as these touch drivers/net/ethernet/stmicro/stmmac/ and the corresponding documentation). Patch 2 should probably go through clk-meson-gxbb / clk-next (just like the other clk changes we had). The last patch (patch 5) should probably go through the ARM SoC tree (just like the other dts changes we had).

@David, Kevin: would this be fine for you?
[net-next] ixgbe: Eliminate useless message and improve logic
From: Mark Rustad

Remove a useless log message and improve the logic for setting
a PHY address from the contents of the MNG_IF_SEL register.

Signed-off-by: Mark Rustad
Tested-by: Andrew Bowers
Signed-off-by: Jeff Kirsher
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c | 16 +---
 1 file changed, 5 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
index fb1b819..e092a89 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
@@ -2394,18 +2394,12 @@ static void ixgbe_read_mng_if_sel_x550em(struct ixgbe_hw *hw)
 	/* If X552 (X550EM_a) and MDIO is connected to external PHY, then set
 	 * PHY address. This register field has only been used for X552.
 	 */
-	if (!hw->phy.nw_mng_if_sel) {
-		if (hw->mac.type == ixgbe_mac_x550em_a) {
-			struct ixgbe_adapter *adapter = hw->back;
-
-			e_warn(drv, "nw_mng_if_sel not set\n");
-		}
-		return;
+	if (hw->mac.type == ixgbe_mac_x550em_a &&
+	    hw->phy.nw_mng_if_sel & IXGBE_NW_MNG_IF_SEL_MDIO_ACT) {
+		hw->phy.mdio.prtad = (hw->phy.nw_mng_if_sel &
+				      IXGBE_NW_MNG_IF_SEL_MDIO_PHY_ADD) >>
+				     IXGBE_NW_MNG_IF_SEL_MDIO_PHY_ADD_SHIFT;
 	}
-
-	hw->phy.mdio.prtad = (hw->phy.nw_mng_if_sel &
-			      IXGBE_NW_MNG_IF_SEL_MDIO_PHY_ADD) >>
-			     IXGBE_NW_MNG_IF_SEL_MDIO_PHY_ADD_SHIFT;
 }

 /** ixgbe_init_phy_ops_X550em - PHY/SFP specific init
--
2.7.4
Re: [PATCH net] tg3: Fix for disallow tx coalescing time to be 0
Hello.

On 08/30/2016 05:38 PM, Ivan Vecera wrote:

> The recent commit 087d7a8c disallows to set Rx coalescing time to be 0

You should specify both 12-digit SHA1 and the commit summary enclosed in ("").

> as this stops generating interrupts for the incoming packets. I found
> the zero Tx coalescing time stops generating interrupts similarly for
> outgoing packets and fires Tx watchdog later. To avoid this, don't
> allow to set Tx coalescing time to 0.
>
> Cc: satish.baddipad...@broadcom.com
> Cc: siva.kal...@broadcom.com
> Cc: michael.c...@broadcom.com
> Signed-off-by: Ivan Vecera
[...]

MBR, Sergei
Re: [PATCH net] l2tp: fix use-after-free during module unload
Hello.

On 08/30/2016 05:05 PM, Sabrina Dubroca wrote:

> Tunnel deletion is delayed by both a workqueue (l2tp_tunnel_delete ->
> wq -> l2tp_tunnel_del_work) and RCU (sk_destruct -> RCU ->
> l2tp_tunnel_destruct).
>
> By the time l2tp_tunnel_destruct() runs to destroy the tunnel and
> finish destroying the socket, the private data reserved via the
> net_generic mechanism has already been freed, but
> l2tp_tunnel_destruct() actually uses this data.
>
> Make sure tunnel deletion for the netns has completed before returning
> from l2tp_net_exit() by first flushing the tunnel removal workqueue, and

The patch tells me the function is named l2tp_exit_net(). :-)

> then waiting for RCU callbacks to complete.
>
> Fixes: 167eb17e0b17 ("l2tp: create tunnel sockets in the right namespace")
> Signed-off-by: Sabrina Dubroca
> ---
>  net/l2tp/l2tp_core.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
> index 1e40dacaa137..a2ed3bda4ddc 100644
> --- a/net/l2tp/l2tp_core.c
> +++ b/net/l2tp/l2tp_core.c
> @@ -1855,6 +1855,9 @@ static __net_exit void l2tp_exit_net(struct net *net)
>  		(void)l2tp_tunnel_delete(tunnel);
>  	}
>  	rcu_read_unlock_bh();
> +
> +	flush_workqueue(l2tp_wq);
> +	rcu_barrier();
>  }
>
>  static struct pernet_operations l2tp_net_ops = {

MBR, Sergei
[PATCH net-next 03/12] net: l3mdev: Allow the l3mdev to be a loopback
Allow an L3 master device to act as the loopback for that L3 domain.
For IPv4 the device can also have the address 127.0.0.1.

Signed-off-by: David Ahern
---
 include/net/l3mdev.h | 6 +++---
 net/ipv4/route.c     | 8 ++--
 net/ipv6/route.c     | 12 ++--
 3 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/include/net/l3mdev.h b/include/net/l3mdev.h
index 74ffe5aef299..5f03a89bb075 100644
--- a/include/net/l3mdev.h
+++ b/include/net/l3mdev.h
@@ -90,7 +90,7 @@ static inline int l3mdev_master_ifindex_by_index(struct net *net, int ifindex)
 }

 static inline
-const struct net_device *l3mdev_master_dev_rcu(const struct net_device *_dev)
+struct net_device *l3mdev_master_dev_rcu(const struct net_device *_dev)
 {
 	/* netdev_master_upper_dev_get_rcu calls
 	 * list_first_or_null_rcu to walk the upper dev list.
@@ -99,7 +99,7 @@ const struct net_device *l3mdev_master_dev_rcu(const struct net_device *_dev)
 	 * typecast to remove the const
 	 */
 	struct net_device *dev = (struct net_device *)_dev;
-	const struct net_device *master;
+	struct net_device *master;

 	if (!dev)
 		return NULL;
@@ -253,7 +253,7 @@ static inline int l3mdev_master_ifindex_by_index(struct net *net, int ifindex)
 }

 static inline
-const struct net_device *l3mdev_master_dev_rcu(const struct net_device *dev)
+struct net_device *l3mdev_master_dev_rcu(const struct net_device *dev)
 {
 	return NULL;
 }
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a1f2830d8110..1119f18fb720 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2016,7 +2016,9 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 		return ERR_PTR(-EINVAL);

 	if (likely(!IN_DEV_ROUTE_LOCALNET(in_dev)))
-		if (ipv4_is_loopback(fl4->saddr) && !(dev_out->flags & IFF_LOOPBACK))
+		if (ipv4_is_loopback(fl4->saddr) &&
+		    !(dev_out->flags & IFF_LOOPBACK) &&
+		    !netif_is_l3_master(dev_out))
 			return ERR_PTR(-EINVAL);

 	if (ipv4_is_lbcast(fl4->daddr))
@@ -2300,7 +2302,9 @@ struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
 			else
 				fl4->saddr = fl4->daddr;
 		}
-		dev_out = net->loopback_dev;
+
+		/* L3 master device is the loopback for that domain */
+		dev_out = l3mdev_master_dev_rcu(dev_out) ? : net->loopback_dev;
 		fl4->flowi4_oif = dev_out->ifindex;
 		flags |= RTCF_LOCAL;
 		goto make_route;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 49817555449e..4a0f77aa49cf 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2556,8 +2556,16 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev,
 {
 	u32 tb_id;
 	struct net *net = dev_net(idev->dev);
-	struct rt6_info *rt = ip6_dst_alloc(net, net->loopback_dev,
-					    DST_NOCOUNT);
+	struct net_device *dev = net->loopback_dev;
+	struct rt6_info *rt;
+
+	/* use L3 Master device as loopback for host routes if device
+	 * is enslaved and address is not link local or multicast
+	 */
+	if (!rt6_need_strict(addr))
+		dev = l3mdev_master_dev_rcu(idev->dev) ? : dev;
+
+	rt = ip6_dst_alloc(net, dev, DST_NOCOUNT);

 	if (!rt)
 		return ERR_PTR(-ENOMEM);
--
2.1.4
[PATCH net-next 00/12] net: Convert vrf from dst to tx hook
The motivation for this series is that ICMP Unreachable - Fragmentation Needed packets are not handled properly for VRFs. Specifically, the FIB lookup in __ip_rt_update_pmtu fails, so no nexthop exception is created with the reduced MTU. As a result connections stall if packets larger than the smallest MTU in the path are generated.

While investigating that problem I also noticed that the MSS for all connections in a VRF is based on the VRF device's MTU and not the interface the packets ultimately go through.

VRF currently uses a dst to direct packets to the device. The first FIB lookup returns this dst and then the lookup in the VRF driver gets the actual output route. A side effect of this design is that the VRF dst is cached on sockets and then used for calculations like the MSS.

This series fixes the problem by removing the output dst that points to the VRF and always doing the actual FIB lookup. This allows the real dst to be cached on sockets and used for MSS. Packets are diverted to the VRF device on Tx using an l3mdev hook in the output path, similar to what is done for Rx.

The end result is a much smaller and faster implementation for VRFs with fewer intrusions into the network stack, less code duplication in the VRF driver (output processing and FIB lookups) and symmetrical packet handling for Rx and Tx paths. The l3mdev and vrf hooks are more tightly focused on the primary goal of controlling the table used for lookups and a secondary goal of providing device based features for VRF such as packet socket hooks for tcpdump and netfilter hooks.

Comparison of netperf performance for a build without l3mdev (best case performance), the old vrf driver and the VRF driver from this series. Data are collected using VMs with virtio + vhost. The netperf client runs in the VM and netserver runs in the host. 1-byte RR tests are done as these packets exaggerate the performance hit due to the extra lookups done for l3mdev and VRF.
Command: netperf -cC -H ${ip} -l 60 -t {TCP,UDP}_RR [-J red]

                TCP_RR           UDP_RR
             IPv4    IPv6     IPv4    IPv6
  no l3mdev  30105   31101    32436   26297
  vrf old    27223   28476    28912   26122
  vrf new    29001   30630    31024   26351

  * Transactions per second as reported by netperf
  * netperf modified to take a bind-to-device argument -- the -J red option

About the series:

- patch 1 adds the flow update (changing oif or iif to L3 master device and
  setting the flag to skip the oif check) to ipv4 and ipv6 paths just before
  hitting the rules. This catches all code paths in a single spot.

- patch 2 adds the Tx hook to push the packet to the l3mdev if relevant

- patch 3 adds some checks so the vrf device can act as a vrf-local loopback.
  These paths were not hit before since the vrf dst was returned from the
  lookup.

- patches 4 and 5 flip the ipv4 and ipv6 stacks to the tx out hook

- patches 6-12 remove no longer needed l3mdev code

David Ahern (12):
  net: flow: Add l3mdev flow update
  net: l3mdev: Add hook to output path
  net: l3mdev: Allow the l3mdev to be a loopback
  net: vrf: Flip the IPv4 path from dst to tx out hook
  net: vrf: Flip the IPv6 path from dst to tx out hook
  net: remove redundant l3mdev calls
  net: l3mdev: Remove l3mdev_get_saddr
  net: ipv6: Remove l3mdev_get_saddr6
  net: l3mdev: Remove l3mdev_get_rtable
  net: l3mdev: Remove l3mdev_get_rt6_dst
  net: l3mdev: Remove l3mdev_fib_oif
  net: flow: Remove FLOWI_FLAG_L3MDEV_SRC flag

 drivers/net/vrf.c       | 545
 include/net/flow.h      |   3 +-
 include/net/l3mdev.h    | 132 +---
 include/net/route.h     |  10 -
 net/ipv4/fib_rules.c    |   3 +
 net/ipv4/ip_output.c    |  11 +-
 net/ipv4/raw.c          |   6 -
 net/ipv4/route.c        |  24 +--
 net/ipv4/udp.c          |   6 -
 net/ipv4/xfrm4_policy.c |   2 +-
 net/ipv6/fib6_rules.c   |   3 +
 net/ipv6/ip6_output.c   |  28 +--
 net/ipv6/ndisc.c        |  11 +-
 net/ipv6/output_core.c  |   7 +
 net/ipv6/raw.c          |   7 +
 net/ipv6/route.c        |  24 +--
 net/ipv6/tcp_ipv6.c     |   8 +-
 net/ipv6/xfrm6_policy.c |   2 +-
 net/l3mdev/l3mdev.c     | 122 ---
 19 files changed, 288 insertions(+), 666 deletions(-)

--
2.1.4
[PATCH net-next 06/12] net: remove redundant l3mdev calls
A previous patch added l3mdev flow update making these hooks redundant. Signed-off-by: David Ahern--- net/ipv4/ip_output.c| 3 +-- net/ipv4/route.c| 12 ++-- net/ipv4/xfrm4_policy.c | 2 +- net/ipv6/ip6_output.c | 2 -- net/ipv6/ndisc.c| 11 ++- net/ipv6/route.c| 7 +-- net/ipv6/tcp_ipv6.c | 8 ++-- net/ipv6/xfrm6_policy.c | 2 +- 8 files changed, 10 insertions(+), 37 deletions(-) diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 3c727d4eaba9..75f8167615ba 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -1574,8 +1574,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb, } oif = arg->bound_dev_if; - if (!oif && netif_index_is_l3_master(net, skb->skb_iif)) - oif = skb->skb_iif; + oif = oif ? : skb->skb_iif; flowi4_init_output(, oif, IP4_REPLY_MARK(net, skb->mark), diff --git a/net/ipv4/route.c b/net/ipv4/route.c index d9936f90a755..ec994380d354 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -1829,7 +1829,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr, * Now we are ready to route packet. */ fl4.flowi4_oif = 0; - fl4.flowi4_iif = l3mdev_fib_oif_rcu(dev); + fl4.flowi4_iif = dev->ifindex; fl4.flowi4_mark = skb->mark; fl4.flowi4_tos = tos; fl4.flowi4_scope = RT_SCOPE_UNIVERSE; @@ -2148,7 +2148,6 @@ struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4, unsigned int flags = 0; struct fib_result res; struct rtable *rth; - int master_idx; int orig_oif; int err = -ENETUNREACH; @@ -2158,9 +2157,6 @@ struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4, orig_oif = fl4->flowi4_oif; - master_idx = l3mdev_master_ifindex_by_index(net, fl4->flowi4_oif); - if (master_idx) - fl4->flowi4_oif = master_idx; fl4->flowi4_iif = LOOPBACK_IFINDEX; fl4->flowi4_tos = tos & IPTOS_RT_MASK; fl4->flowi4_scope = ((tos & RTO_ONLINK) ? 
@@ -2261,8 +2257,7 @@ struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4, if (err) { res.fi = NULL; res.table = NULL; - if (fl4->flowi4_oif && - !netif_index_is_l3_master(net, fl4->flowi4_oif)) { + if (fl4->flowi4_oif) { /* Apparently, routing tables are wrong. Assume, that the destination is on link. @@ -2575,9 +2570,6 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh) fl4.flowi4_oif = tb[RTA_OIF] ? nla_get_u32(tb[RTA_OIF]) : 0; fl4.flowi4_mark = mark; - if (netif_index_is_l3_master(net, fl4.flowi4_oif)) - fl4.flowi4_flags = FLOWI_FLAG_L3MDEV_SRC | FLOWI_FLAG_SKIP_NH_OIF; - if (iif) { struct net_device *dev; diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c index b644a23c3db0..3155ed73d3b3 100644 --- a/net/ipv4/xfrm4_policy.c +++ b/net/ipv4/xfrm4_policy.c @@ -112,7 +112,7 @@ _decode_session4(struct sk_buff *skb, struct flowi *fl, int reverse) int oif = 0; if (skb_dst(skb)) - oif = l3mdev_fib_oif(skb_dst(skb)->dev); + oif = skb_dst(skb)->dev->ifindex; memset(fl4, 0, sizeof(struct flowi4)); fl4->flowi4_mark = skb->mark; diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 9711f32eedd7..84d1b3feaf2e 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -1067,8 +1067,6 @@ struct dst_entry *ip6_dst_lookup_flow(const struct sock *sk, struct flowi6 *fl6, return ERR_PTR(err); if (final_dst) fl6->daddr = *final_dst; - if (!fl6->flowi6_oif) - fl6->flowi6_oif = l3mdev_fib_oif(dst->dev); return xfrm_lookup_route(net, dst, flowi6_to_flowi(fl6), sk, 0); } diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c index fe65cdc28a45..d8e671457d10 100644 --- a/net/ipv6/ndisc.c +++ b/net/ipv6/ndisc.c @@ -67,7 +67,6 @@ #include #include #include -#include #include #include @@ -457,11 +456,9 @@ static void ndisc_send_skb(struct sk_buff *skb, if (!dst) { struct flowi6 fl6; - int oif = l3mdev_fib_oif(skb->dev); + int oif = skb->dev->ifindex; icmpv6_flow_init(sk, , type, saddr, daddr, oif); - if (oif != 
skb->dev->ifindex) - fl6.flowi6_flags |= FLOWI_FLAG_L3MDEV_SRC; dst = icmp6_dst_alloc(skb->dev, ); if (IS_ERR(dst)) { kfree_skb(skb); @@ -1538,7 +1535,6 @@ void ndisc_send_redirect(struct sk_buff *skb, const struct in6_addr *target) int rd_len;
[PATCH net-next 02/12] net: l3mdev: Add hook to output path
This patch adds the infrastructure to the output path to pass an skb to an l3mdev device if it has a hook registered. This is the Tx parallel to l3mdev_ip{6}_rcv in the receive path and is the basis for removing the dst based hook. Signed-off-by: David Ahern--- include/net/l3mdev.h | 47 +++ net/ipv4/ip_output.c | 8 net/ipv6/ip6_output.c | 8 net/ipv6/output_core.c | 7 +++ net/ipv6/raw.c | 7 +++ 5 files changed, 77 insertions(+) diff --git a/include/net/l3mdev.h b/include/net/l3mdev.h index 81e175e80537..74ffe5aef299 100644 --- a/include/net/l3mdev.h +++ b/include/net/l3mdev.h @@ -11,6 +11,7 @@ #ifndef _NET_L3MDEV_H_ #define _NET_L3MDEV_H_ +#include #include /** @@ -18,6 +19,10 @@ * * @l3mdev_fib_table: Get FIB table id to use for lookups * + * @l3mdev_l3_rcv:Hook in L3 receive path + * + * @l3mdev_l3_out:Hook in L3 output path + * * @l3mdev_get_rtable: Get cached IPv4 rtable (dst_entry) for device * * @l3mdev_get_saddr: Get source address for a flow @@ -29,6 +34,9 @@ struct l3mdev_ops { u32 (*l3mdev_fib_table)(const struct net_device *dev); struct sk_buff * (*l3mdev_l3_rcv)(struct net_device *dev, struct sk_buff *skb, u16 proto); + struct sk_buff * (*l3mdev_l3_out)(struct net_device *dev, + struct sock *sk, struct sk_buff *skb, + u16 proto); /* IPv4 ops */ struct rtable * (*l3mdev_get_rtable)(const struct net_device *dev, @@ -201,6 +209,33 @@ struct sk_buff *l3mdev_ip6_rcv(struct sk_buff *skb) return l3mdev_l3_rcv(skb, AF_INET6); } +static inline +struct sk_buff *l3mdev_l3_out(struct sock *sk, struct sk_buff *skb, u16 proto) +{ + struct net_device *dev = skb_dst(skb)->dev; + struct net_device *master = NULL; + + if (netif_is_l3_slave(dev)) { + master = netdev_master_upper_dev_get_rcu(dev); + if (master && master->l3mdev_ops->l3mdev_l3_out) + skb = master->l3mdev_ops->l3mdev_l3_out(master, sk, + skb, proto); + } + + return skb; +} + +static inline +struct sk_buff *l3mdev_ip_out(struct sock *sk, struct sk_buff *skb) +{ + return l3mdev_l3_out(sk, skb, AF_INET); +} + 
+static inline +struct sk_buff *l3mdev_ip6_out(struct sock *sk, struct sk_buff *skb) +{ + return l3mdev_l3_out(sk, skb, AF_INET6); +} #else static inline int l3mdev_master_ifindex_rcu(const struct net_device *dev) @@ -287,6 +322,18 @@ struct sk_buff *l3mdev_ip6_rcv(struct sk_buff *skb) } static inline +struct sk_buff *l3mdev_ip_out(struct sock *sk, struct sk_buff *skb) +{ + return skb; +} + +static inline +struct sk_buff *l3mdev_ip6_out(struct sock *sk, struct sk_buff *skb) +{ + return skb; +} + +static inline int l3mdev_fib_rule_match(struct net *net, struct flowi *fl, struct fib_lookup_arg *arg) { diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index dde37fb340bf..3c727d4eaba9 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -98,6 +98,14 @@ int __ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb) iph->tot_len = htons(skb->len); ip_send_check(iph); + + /* if egress device is enslaved to an L3 master device pass the +* skb to its handler for processing +*/ + skb = l3mdev_ip_out(sk, skb); + if (unlikely(!skb)) + return 0; + return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT, net, sk, skb, NULL, skb_dst(skb)->dev, dst_output); diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 1dfc402d9ad1..bcec7e73eb0b 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -228,6 +228,14 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6, if ((skb->len <= mtu) || skb->ignore_df || skb_is_gso(skb)) { IP6_UPD_PO_STATS(net, ip6_dst_idev(skb_dst(skb)), IPSTATS_MIB_OUT, skb->len); + + /* if egress device is enslaved to an L3 master device pass the +* skb to its handler for processing +*/ + skb = l3mdev_ip6_out((struct sock *)sk, skb); + if (unlikely(!skb)) + return 0; + /* hooks should never assume socket lock is held. 
* we promote our socket to non const */ diff --git a/net/ipv6/output_core.c b/net/ipv6/output_core.c index 462f2a76b5c2..7cca8ac66fe9 100644 --- a/net/ipv6/output_core.c +++ b/net/ipv6/output_core.c @@ -148,6 +148,13 @@ int __ip6_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
[PATCH net-next 01/12] net: flow: Add l3mdev flow update
Add l3mdev hook to set FLOWI_FLAG_SKIP_NH_OIF flag and update oif/iif in flow struct if its oif or iif points to a device enslaved to an L3 Master device. Only 1 needs to be converted to match the l3mdev FIB rule. This moves the flow adjustment for l3mdev to a single point catching all lookups. It is redundant for existing hooks (those are removed in later patches) but is needed for missed lookups such as PMTU updates. Signed-off-by: David Ahern--- include/net/l3mdev.h | 6 ++ net/ipv4/fib_rules.c | 3 +++ net/ipv6/fib6_rules.c | 3 +++ net/l3mdev/l3mdev.c | 35 +++ 4 files changed, 47 insertions(+) diff --git a/include/net/l3mdev.h b/include/net/l3mdev.h index e90095091aa0..81e175e80537 100644 --- a/include/net/l3mdev.h +++ b/include/net/l3mdev.h @@ -49,6 +49,8 @@ struct l3mdev_ops { int l3mdev_fib_rule_match(struct net *net, struct flowi *fl, struct fib_lookup_arg *arg); +void l3mdev_update_flow(struct net *net, struct flowi *fl); + int l3mdev_master_ifindex_rcu(const struct net_device *dev); static inline int l3mdev_master_ifindex(struct net_device *dev) { @@ -290,6 +292,10 @@ int l3mdev_fib_rule_match(struct net *net, struct flowi *fl, { return 1; } +static inline +void l3mdev_update_flow(struct net *net, struct flowi *fl) +{ +} #endif #endif /* _NET_L3MDEV_H_ */ diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c index 6e9ea69e5f75..770bebed6b28 100644 --- a/net/ipv4/fib_rules.c +++ b/net/ipv4/fib_rules.c @@ -56,6 +56,9 @@ int __fib_lookup(struct net *net, struct flowi4 *flp, }; int err; + /* update flow if oif or iif point to device enslaved to l3mdev */ + l3mdev_update_flow(net, flowi4_to_flowi(flp)); + err = fib_rules_lookup(net->ipv4.rules_ops, flowi4_to_flowi(flp), 0, ); #ifdef CONFIG_IP_ROUTE_CLASSID if (arg.rule) diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c index 5857c1fc8b67..eea23b57c6a5 100644 --- a/net/ipv6/fib6_rules.c +++ b/net/ipv6/fib6_rules.c @@ -38,6 +38,9 @@ struct dst_entry *fib6_rule_lookup(struct net *net, struct flowi6 
*fl6, .flags = FIB_LOOKUP_NOREF, }; + /* update flow if oif or iif point to device enslaved to l3mdev */ + l3mdev_update_flow(net, flowi6_to_flowi(fl6)); + fib_rules_lookup(net->ipv6.fib6_rules_ops, flowi6_to_flowi(fl6), flags, ); diff --git a/net/l3mdev/l3mdev.c b/net/l3mdev/l3mdev.c index c4a1c3e84e12..43610e5acc4e 100644 --- a/net/l3mdev/l3mdev.c +++ b/net/l3mdev/l3mdev.c @@ -222,3 +222,38 @@ int l3mdev_fib_rule_match(struct net *net, struct flowi *fl, return rc; } + +void l3mdev_update_flow(struct net *net, struct flowi *fl) +{ + struct net_device *dev; + int ifindex; + + rcu_read_lock(); + + if (fl->flowi_oif) { + dev = dev_get_by_index_rcu(net, fl->flowi_oif); + if (dev) { + ifindex = l3mdev_master_ifindex_rcu(dev); + if (ifindex) { + fl->flowi_oif = ifindex; + fl->flowi_flags |= FLOWI_FLAG_SKIP_NH_OIF; + goto out; + } + } + } + + if (fl->flowi_iif) { + dev = dev_get_by_index_rcu(net, fl->flowi_iif); + if (dev) { + ifindex = l3mdev_master_ifindex_rcu(dev); + if (ifindex) { + fl->flowi_iif = ifindex; + fl->flowi_flags |= FLOWI_FLAG_SKIP_NH_OIF; + } + } + } + +out: + rcu_read_unlock(); +} +EXPORT_SYMBOL_GPL(l3mdev_update_flow); -- 2.1.4
[PATCH net-next 04/12] net: vrf: Flip IPv4 path from dst to out hook
Flip the IPv4 output path from use of the vrf dst to the l3mdev tx out hook. Signed-off-by: David Ahern--- drivers/net/vrf.c | 171 -- net/ipv4/route.c | 4 -- 2 files changed, 64 insertions(+), 111 deletions(-) diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c index 1ce7420322ee..7517645347c3 100644 --- a/drivers/net/vrf.c +++ b/drivers/net/vrf.c @@ -230,79 +230,28 @@ static netdev_tx_t vrf_process_v6_outbound(struct sk_buff *skb, static netdev_tx_t vrf_process_v4_outbound(struct sk_buff *skb, struct net_device *vrf_dev) { - struct iphdr *ip4h = ip_hdr(skb); - int ret = NET_XMIT_DROP; - struct flowi4 fl4 = { - /* needed to match OIF rule */ - .flowi4_oif = vrf_dev->ifindex, - .flowi4_iif = LOOPBACK_IFINDEX, - .flowi4_tos = RT_TOS(ip4h->tos), - .flowi4_flags = FLOWI_FLAG_ANYSRC | FLOWI_FLAG_L3MDEV_SRC | - FLOWI_FLAG_SKIP_NH_OIF, - .daddr = ip4h->daddr, - }; - struct net *net = dev_net(vrf_dev); - struct rtable *rt; - - rt = ip_route_output_flow(net, , NULL); - if (IS_ERR(rt)) - goto err; - - if (rt->rt_type != RTN_UNICAST && rt->rt_type != RTN_LOCAL) { - ip_rt_put(rt); - goto err; - } + struct net_vrf *vrf = netdev_priv(vrf_dev); + struct dst_entry *dst = NULL; + struct rtable *rth_local; skb_dst_drop(skb); - /* if dst.dev is loopback or the VRF device again this is locally -* originated traffic destined to a local address. 
Short circuit -* to Rx path using our local dst -*/ - if (rt->dst.dev == net->loopback_dev || rt->dst.dev == vrf_dev) { - struct net_vrf *vrf = netdev_priv(vrf_dev); - struct rtable *rth_local; - struct dst_entry *dst = NULL; - - ip_rt_put(rt); - - rcu_read_lock(); - - rth_local = rcu_dereference(vrf->rth_local); - if (likely(rth_local)) { - dst = _local->dst; - dst_hold(dst); - } - - rcu_read_unlock(); - - if (unlikely(!dst)) - goto err; + rcu_read_lock(); - return vrf_local_xmit(skb, vrf_dev, dst); + rth_local = rcu_dereference(vrf->rth_local); + if (likely(rth_local)) { + dst = _local->dst; + dst_hold(dst); } - skb_dst_set(skb, >dst); - - /* strip the ethernet header added for pass through VRF device */ - __skb_pull(skb, skb_network_offset(skb)); + rcu_read_unlock(); - if (!ip4h->saddr) { - ip4h->saddr = inet_select_addr(skb_dst(skb)->dev, 0, - RT_SCOPE_LINK); + if (unlikely(!dst)) { + vrf_tx_error(vrf_dev, skb); + return NET_XMIT_DROP; } - ret = ip_local_out(dev_net(skb_dst(skb)->dev), skb->sk, skb); - if (unlikely(net_xmit_eval(ret))) - vrf_dev->stats.tx_errors++; - else - ret = NET_XMIT_SUCCESS; - -out: - return ret; -err: - vrf_tx_error(vrf_dev, skb); - goto out; + return vrf_local_xmit(skb, vrf_dev, dst); } static netdev_tx_t is_ip_tx_frame(struct sk_buff *skb, struct net_device *dev) @@ -473,64 +422,71 @@ static int vrf_rt6_create(struct net_device *dev) } #endif -/* modelled after ip_finish_output2 */ +/* run skb through packet sockets for tcpdump with dev set to vrf dev */ static int vrf_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb) { - struct dst_entry *dst = skb_dst(skb); - struct rtable *rt = (struct rtable *)dst; - struct net_device *dev = dst->dev; - unsigned int hh_len = LL_RESERVED_SPACE(dev); - struct neighbour *neigh; - u32 nexthop; - int ret = -EINVAL; - - /* Be paranoid, rather than too clever. 
*/ - if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) { - struct sk_buff *skb2; - - skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev)); - if (!skb2) { - ret = -ENOMEM; - goto err; - } - if (skb->sk) - skb_set_owner_w(skb2, skb->sk); - - consume_skb(skb); - skb = skb2; + if (likely(skb_headroom(skb) >= ETH_HLEN)) { + struct ethhdr *eth = (struct ethhdr *)skb_push(skb, ETH_HLEN); + + ether_addr_copy(eth->h_source, skb->dev->dev_addr); + eth_zero_addr(eth->h_dest); + eth->h_proto = skb->protocol; + dev_queue_xmit_nit(skb, skb->dev); +
[PATCH net-next 05/12] net: vrf: Flip IPv6 path from dst to out hook
Flip the IPv6 output path from use of the vrf dst to the l3mdev tx out hook. Signed-off-by: David Ahern--- drivers/net/vrf.c | 156 -- net/ipv6/ip6_output.c | 9 ++- net/ipv6/route.c | 5 -- 3 files changed, 70 insertions(+), 100 deletions(-) diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c index 7517645347c3..df58bc791cfd 100644 --- a/drivers/net/vrf.c +++ b/drivers/net/vrf.c @@ -140,80 +140,42 @@ static int vrf_local_xmit(struct sk_buff *skb, struct net_device *dev, static netdev_tx_t vrf_process_v6_outbound(struct sk_buff *skb, struct net_device *dev) { - const struct ipv6hdr *iph = ipv6_hdr(skb); - struct net *net = dev_net(skb->dev); - struct flowi6 fl6 = { - /* needed to match OIF rule */ - .flowi6_oif = dev->ifindex, - .flowi6_iif = LOOPBACK_IFINDEX, - .daddr = iph->daddr, - .saddr = iph->saddr, - .flowlabel = ip6_flowinfo(iph), - .flowi6_mark = skb->mark, - .flowi6_proto = iph->nexthdr, - .flowi6_flags = FLOWI_FLAG_L3MDEV_SRC | FLOWI_FLAG_SKIP_NH_OIF, - }; - int ret = NET_XMIT_DROP; - struct dst_entry *dst; - struct dst_entry *dst_null = >ipv6.ip6_null_entry->dst; - - dst = ip6_route_output(net, NULL, ); - if (dst == dst_null) - goto err; + struct net_vrf *vrf = netdev_priv(dev); + struct dst_entry *dst = NULL; + struct rt6_info *rt6_local; skb_dst_drop(skb); - /* if dst.dev is loopback or the VRF device again this is locally -* originated traffic destined to a local address. Short circuit -* to Rx path using our local dst -*/ - if (dst->dev == net->loopback_dev || dst->dev == dev) { - struct net_vrf *vrf = netdev_priv(dev); - struct rt6_info *rt6_local; - - /* release looked up dst and use cached local dst */ - dst_release(dst); + rcu_read_lock(); - rcu_read_lock(); + rt6_local = rcu_dereference(vrf->rt6_local); + if (unlikely(!rt6_local)) { + rcu_read_unlock(); + goto err; + } - rt6_local = rcu_dereference(vrf->rt6_local); - if (unlikely(!rt6_local)) { + /* Ordering issue: cached local dst is created on newlink +* before the IPv6 initialization. 
Using the local dst +* requires rt6i_idev to be set so make sure it is. +*/ + if (unlikely(!rt6_local->rt6i_idev)) { + rt6_local->rt6i_idev = in6_dev_get(dev); + if (!rt6_local->rt6i_idev) { rcu_read_unlock(); goto err; } - - /* Ordering issue: cached local dst is created on newlink -* before the IPv6 initialization. Using the local dst -* requires rt6i_idev to be set so make sure it is. -*/ - if (unlikely(!rt6_local->rt6i_idev)) { - rt6_local->rt6i_idev = in6_dev_get(dev); - if (!rt6_local->rt6i_idev) { - rcu_read_unlock(); - goto err; - } - } - - dst = _local->dst; - dst_hold(dst); - - rcu_read_unlock(); - - return vrf_local_xmit(skb, dev, _local->dst); } - skb_dst_set(skb, dst); + dst = _local->dst; + if (likely(dst)) + dst_hold(dst); - /* strip the ethernet header added for pass through VRF device */ - __skb_pull(skb, skb_network_offset(skb)); + rcu_read_unlock(); - ret = ip6_local_out(net, skb->sk, skb); - if (unlikely(net_xmit_eval(ret))) - dev->stats.tx_errors++; - else - ret = NET_XMIT_SUCCESS; + if (unlikely(!dst)) + goto err; - return ret; + return vrf_local_xmit(skb, dev, dst); err: vrf_tx_error(dev, skb); return NET_XMIT_DROP; @@ -286,44 +248,43 @@ static netdev_tx_t vrf_xmit(struct sk_buff *skb, struct net_device *dev) } #if IS_ENABLED(CONFIG_IPV6) -/* modelled after ip6_finish_output2 */ -static int vrf_finish_output6(struct net *net, struct sock *sk, - struct sk_buff *skb) -{ - struct dst_entry *dst = skb_dst(skb); - struct net_device *dev = dst->dev; - struct neighbour *neigh; - struct in6_addr *nexthop; - int ret; +static int vrf_finish_output(struct net *net, struct sock *sk, +struct sk_buff *skb); +/* modelled after ip6_output */ +static int vrf_output6(struct net *net, struct sock *sk, struct sk_buff *skb) +{ skb->protocol =
[PATCH net-next 10/12] net: l3mdev: Remove l3mdev_get_rt6_dst
No longer used Signed-off-by: David Ahern--- drivers/net/vrf.c| 92 +++- include/net/l3mdev.h | 14 net/l3mdev/l3mdev.c | 32 -- 3 files changed, 4 insertions(+), 134 deletions(-) diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c index 08103bc7f1f5..23801647c113 100644 --- a/drivers/net/vrf.c +++ b/drivers/net/vrf.c @@ -48,7 +48,6 @@ static bool add_fib_rules = true; struct net_vrf { struct rtable __rcu *rth_local; - struct rt6_info __rcu *rt6; struct rt6_info __rcu *rt6_local; u32 tb_id; }; @@ -289,25 +288,11 @@ static struct sk_buff *vrf_ip6_out(struct net_device *vrf_dev, /* holding rtnl */ static void vrf_rt6_release(struct net_device *dev, struct net_vrf *vrf) { - struct rt6_info *rt6 = rtnl_dereference(vrf->rt6); struct rt6_info *rt6_local = rtnl_dereference(vrf->rt6_local); struct net *net = dev_net(dev); struct dst_entry *dst; - RCU_INIT_POINTER(vrf->rt6, NULL); - RCU_INIT_POINTER(vrf->rt6_local, NULL); - synchronize_rcu(); - - /* move dev in dst's to loopback so this VRF device can be deleted -* - based on dst_ifdown -*/ - if (rt6) { - dst = >dst; - dev_put(dst->dev); - dst->dev = net->loopback_dev; - dev_hold(dst->dev); - dst_release(dst); - } + rcu_assign_pointer(vrf->rt6_local, NULL); if (rt6_local) { if (rt6_local->rt6i_idev) @@ -327,7 +312,7 @@ static int vrf_rt6_create(struct net_device *dev) struct net_vrf *vrf = netdev_priv(dev); struct net *net = dev_net(dev); struct fib6_table *rt6i_table; - struct rt6_info *rt6, *rt6_local; + struct rt6_info *rt6_local; int rc = -ENOMEM; /* IPv6 can be CONFIG enabled and then disabled runtime */ @@ -338,24 +323,12 @@ static int vrf_rt6_create(struct net_device *dev) if (!rt6i_table) goto out; - /* create a dst for routing packets out a VRF device */ - rt6 = ip6_dst_alloc(net, dev, flags); - if (!rt6) - goto out; - - dst_hold(>dst); - - rt6->rt6i_table = rt6i_table; - rt6->dst.output = vrf_output6; - /* create a dst for local routing - packets sent locally * to local address via the VRF device as a loopback */ 
rt6_local = ip6_dst_alloc(net, dev, flags); - if (!rt6_local) { - dst_release(>dst); + if (!rt6_local) goto out; - } dst_hold(_local->dst); @@ -364,7 +337,6 @@ static int vrf_rt6_create(struct net_device *dev) rt6_local->rt6i_table = rt6i_table; rt6_local->dst.input = ip6_input; - rcu_assign_pointer(vrf->rt6, rt6); rcu_assign_pointer(vrf->rt6_local, rt6_local); rc = 0; @@ -693,7 +665,7 @@ static struct rt6_info *vrf_ip6_route_lookup(struct net *net, rcu_read_lock(); /* fib6_table does not have a refcnt and can not be freed */ - rt6 = rcu_dereference(vrf->rt6); + rt6 = rcu_dereference(vrf->rt6_local); if (likely(rt6)) table = rt6->rt6i_table; @@ -816,66 +788,10 @@ static struct sk_buff *vrf_l3_rcv(struct net_device *vrf_dev, return skb; } -#if IS_ENABLED(CONFIG_IPV6) -static struct dst_entry *vrf_get_rt6_dst(const struct net_device *dev, -struct flowi6 *fl6) -{ - bool need_strict = rt6_need_strict(>daddr); - struct net_vrf *vrf = netdev_priv(dev); - struct net *net = dev_net(dev); - struct dst_entry *dst = NULL; - struct rt6_info *rt; - - /* send to link-local or multicast address */ - if (need_strict) { - int flags = RT6_LOOKUP_F_IFACE; - - /* VRF device does not have a link-local address and -* sending packets to link-local or mcast addresses over -* a VRF device does not make sense -*/ - if (fl6->flowi6_oif == dev->ifindex) { - struct dst_entry *dst = >ipv6.ip6_null_entry->dst; - - dst_hold(dst); - return dst; - } - - if (!ipv6_addr_any(>saddr)) - flags |= RT6_LOOKUP_F_HAS_SADDR; - - rt = vrf_ip6_route_lookup(net, dev, fl6, fl6->flowi6_oif, flags); - if (rt) - dst = >dst; - - } else if (!(fl6->flowi6_flags & FLOWI_FLAG_L3MDEV_SRC)) { - - rcu_read_lock(); - - rt = rcu_dereference(vrf->rt6); - if (likely(rt)) { - dst = >dst; - dst_hold(dst); - } - - rcu_read_unlock(); - } - - /* make sure oif is set
[PATCH net-next 12/12] net: flow: Remove FLOWI_FLAG_L3MDEV_SRC flag
No longer used.

Signed-off-by: David Ahern
---
 include/net/flow.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/net/flow.h b/include/net/flow.h
index d47ef4bb5423..035aa7716967 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -34,8 +34,7 @@ struct flowi_common {
 	__u8	flowic_flags;
 #define FLOWI_FLAG_ANYSRC		0x01
 #define FLOWI_FLAG_KNOWN_NH		0x02
-#define FLOWI_FLAG_L3MDEV_SRC		0x04
-#define FLOWI_FLAG_SKIP_NH_OIF		0x08
+#define FLOWI_FLAG_SKIP_NH_OIF		0x04
 	__u32	flowic_secid;
 	struct flowi_tunnel	flowic_tun_key;
 };
-- 
2.1.4
[PATCH net-next 11/12] net: l3mdev: Remove l3mdev_fib_oif
No longer used.

Signed-off-by: David Ahern
---
 include/net/l3mdev.h | 29 -----------------------------
 1 file changed, 29 deletions(-)

diff --git a/include/net/l3mdev.h b/include/net/l3mdev.h
index 3c1d71474f55..6aae664b427a 100644
--- a/include/net/l3mdev.h
+++ b/include/net/l3mdev.h
@@ -95,26 +95,6 @@ struct net_device *l3mdev_master_dev_rcu(const struct net_device *_dev)
 	return master;
 }
 
-/* get index of an interface to use for FIB lookups. For devices
- * enslaved to an L3 master device FIB lookups are based on the
- * master index
- */
-static inline int l3mdev_fib_oif_rcu(struct net_device *dev)
-{
-	return l3mdev_master_ifindex_rcu(dev) ? : dev->ifindex;
-}
-
-static inline int l3mdev_fib_oif(struct net_device *dev)
-{
-	int oif;
-
-	rcu_read_lock();
-	oif = l3mdev_fib_oif_rcu(dev);
-	rcu_read_unlock();
-
-	return oif;
-}
-
 u32 l3mdev_fib_table_rcu(const struct net_device *dev);
 u32 l3mdev_fib_table_by_index(struct net *net, int ifindex);
 static inline u32 l3mdev_fib_table(const struct net_device *dev)
@@ -224,15 +204,6 @@ struct net_device *l3mdev_master_dev_rcu(const struct net_device *dev)
 	return NULL;
 }
 
-static inline int l3mdev_fib_oif_rcu(struct net_device *dev)
-{
-	return dev ? dev->ifindex : 0;
-}
-static inline int l3mdev_fib_oif(struct net_device *dev)
-{
-	return dev ? dev->ifindex : 0;
-}
-
 static inline u32 l3mdev_fib_table_rcu(const struct net_device *dev)
 {
 	return 0;
-- 
2.1.4
[PATCH net-next 08/12] net: ipv6: Remove l3mdev_get_saddr6
No longer needed Signed-off-by: David Ahern--- drivers/net/vrf.c | 41 - include/net/l3mdev.h | 11 --- net/ipv6/ip6_output.c | 9 + net/l3mdev/l3mdev.c | 24 4 files changed, 1 insertion(+), 84 deletions(-) diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c index ec65bf2afcb2..cc18319b4b0d 100644 --- a/drivers/net/vrf.c +++ b/drivers/net/vrf.c @@ -909,46 +909,6 @@ static struct dst_entry *vrf_get_rt6_dst(const struct net_device *dev, return dst; } - -/* called under rcu_read_lock */ -static int vrf_get_saddr6(struct net_device *dev, const struct sock *sk, - struct flowi6 *fl6) -{ - struct net *net = dev_net(dev); - struct dst_entry *dst; - struct rt6_info *rt; - int err; - - if (rt6_need_strict(>daddr)) { - rt = vrf_ip6_route_lookup(net, dev, fl6, fl6->flowi6_oif, - RT6_LOOKUP_F_IFACE); - if (unlikely(!rt)) - return 0; - - dst = >dst; - } else { - __u8 flags = fl6->flowi6_flags; - - fl6->flowi6_flags |= FLOWI_FLAG_L3MDEV_SRC; - fl6->flowi6_flags |= FLOWI_FLAG_SKIP_NH_OIF; - - dst = ip6_route_output(net, sk, fl6); - rt = (struct rt6_info *)dst; - - fl6->flowi6_flags = flags; - } - - err = dst->error; - if (!err) { - err = ip6_route_get_saddr(net, rt, >daddr, - sk ? 
inet6_sk(sk)->srcprefs : 0, - >saddr); - } - - dst_release(dst); - - return err; -} #endif static const struct l3mdev_ops vrf_l3mdev_ops = { @@ -958,7 +918,6 @@ static const struct l3mdev_ops vrf_l3mdev_ops = { .l3mdev_l3_out = vrf_l3_out, #if IS_ENABLED(CONFIG_IPV6) .l3mdev_get_rt6_dst = vrf_get_rt6_dst, - .l3mdev_get_saddr6 = vrf_get_saddr6, #endif }; diff --git a/include/net/l3mdev.h b/include/net/l3mdev.h index 8085be19a767..391c46130ef6 100644 --- a/include/net/l3mdev.h +++ b/include/net/l3mdev.h @@ -43,9 +43,6 @@ struct l3mdev_ops { /* IPv6 ops */ struct dst_entry * (*l3mdev_get_rt6_dst)(const struct net_device *dev, struct flowi6 *fl6); - int(*l3mdev_get_saddr6)(struct net_device *dev, - const struct sock *sk, - struct flowi6 *fl6); }; #ifdef CONFIG_NET_L3_MASTER_DEV @@ -172,8 +169,6 @@ static inline bool netif_index_is_l3_master(struct net *net, int ifindex) } struct dst_entry *l3mdev_get_rt6_dst(struct net *net, struct flowi6 *fl6); -int l3mdev_get_saddr6(struct net *net, const struct sock *sk, - struct flowi6 *fl6); static inline struct sk_buff *l3mdev_l3_rcv(struct sk_buff *skb, u16 proto) @@ -291,12 +286,6 @@ struct dst_entry *l3mdev_get_rt6_dst(struct net *net, struct flowi6 *fl6) return NULL; } -static inline int l3mdev_get_saddr6(struct net *net, const struct sock *sk, - struct flowi6 *fl6) -{ - return 0; -} - static inline struct sk_buff *l3mdev_ip_rcv(struct sk_buff *skb) { diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 84d1b3feaf2e..2d067b0c2f10 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -918,13 +918,6 @@ static int ip6_dst_lookup_tail(struct net *net, const struct sock *sk, int err; int flags = 0; - if (ipv6_addr_any(>saddr) && fl6->flowi6_oif && - (!*dst || !(*dst)->error)) { - err = l3mdev_get_saddr6(net, sk, fl6); - if (err) - goto out_err; - } - /* The correct way to handle this would be to do * ip6_route_get_saddr, and then ip6_route_output; however, * the route-specific preferred source forces the @@ 
-1016,7 +1009,7 @@ static int ip6_dst_lookup_tail(struct net *net, const struct sock *sk, out_err_release: dst_release(*dst); *dst = NULL; -out_err: + if (err == -ENETUNREACH) IP6_INC_STATS(net, NULL, IPSTATS_MIB_OUTNOROUTES); return err; diff --git a/net/l3mdev/l3mdev.c b/net/l3mdev/l3mdev.c index b30034efccff..998e4dc2e6f9 100644 --- a/net/l3mdev/l3mdev.c +++ b/net/l3mdev/l3mdev.c @@ -131,30 +131,6 @@ struct dst_entry *l3mdev_get_rt6_dst(struct net *net, } EXPORT_SYMBOL_GPL(l3mdev_get_rt6_dst); -int l3mdev_get_saddr6(struct net *net, const struct sock *sk, - struct flowi6 *fl6) -{ - struct net_device *dev; - int rc = 0; - - if (fl6->flowi6_oif) { - rcu_read_lock(); - - dev = dev_get_by_index_rcu(net, fl6->flowi6_oif); - if
[PATCH net-next 07/12] net: ipv4: Remove l3mdev_get_saddr
No longer needed Signed-off-by: David Ahern--- drivers/net/vrf.c| 38 -- include/net/l3mdev.h | 12 include/net/route.h | 10 -- net/ipv4/raw.c | 6 -- net/ipv4/udp.c | 6 -- net/l3mdev/l3mdev.c | 31 --- 6 files changed, 103 deletions(-) diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c index df58bc791cfd..ec65bf2afcb2 100644 --- a/drivers/net/vrf.c +++ b/drivers/net/vrf.c @@ -668,43 +668,6 @@ static struct rtable *vrf_get_rtable(const struct net_device *dev, return rth; } -/* called under rcu_read_lock */ -static int vrf_get_saddr(struct net_device *dev, struct flowi4 *fl4) -{ - struct fib_result res = { .tclassid = 0 }; - struct net *net = dev_net(dev); - u32 orig_tos = fl4->flowi4_tos; - u8 flags = fl4->flowi4_flags; - u8 scope = fl4->flowi4_scope; - u8 tos = RT_FL_TOS(fl4); - int rc; - - if (unlikely(!fl4->daddr)) - return 0; - - fl4->flowi4_flags |= FLOWI_FLAG_SKIP_NH_OIF; - fl4->flowi4_iif = LOOPBACK_IFINDEX; - /* make sure oif is set to VRF device for lookup */ - fl4->flowi4_oif = dev->ifindex; - fl4->flowi4_tos = tos & IPTOS_RT_MASK; - fl4->flowi4_scope = ((tos & RTO_ONLINK) ? -RT_SCOPE_LINK : RT_SCOPE_UNIVERSE); - - rc = fib_lookup(net, fl4, , 0); - if (!rc) { - if (res.type == RTN_LOCAL) - fl4->saddr = res.fi->fib_prefsrc ? 
: fl4->daddr; - else - fib_select_path(net, , fl4, -1); - } - - fl4->flowi4_flags = flags; - fl4->flowi4_tos = orig_tos; - fl4->flowi4_scope = scope; - - return rc; -} - static int vrf_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb) { return 0; @@ -991,7 +954,6 @@ static int vrf_get_saddr6(struct net_device *dev, const struct sock *sk, static const struct l3mdev_ops vrf_l3mdev_ops = { .l3mdev_fib_table = vrf_fib_table, .l3mdev_get_rtable = vrf_get_rtable, - .l3mdev_get_saddr = vrf_get_saddr, .l3mdev_l3_rcv = vrf_l3_rcv, .l3mdev_l3_out = vrf_l3_out, #if IS_ENABLED(CONFIG_IPV6) diff --git a/include/net/l3mdev.h b/include/net/l3mdev.h index 5f03a89bb075..8085be19a767 100644 --- a/include/net/l3mdev.h +++ b/include/net/l3mdev.h @@ -25,8 +25,6 @@ * * @l3mdev_get_rtable: Get cached IPv4 rtable (dst_entry) for device * - * @l3mdev_get_saddr: Get source address for a flow - * * @l3mdev_get_rt6_dst: Get cached IPv6 rt6_info (dst_entry) for device */ @@ -41,8 +39,6 @@ struct l3mdev_ops { /* IPv4 ops */ struct rtable * (*l3mdev_get_rtable)(const struct net_device *dev, const struct flowi4 *fl4); - int (*l3mdev_get_saddr)(struct net_device *dev, - struct flowi4 *fl4); /* IPv6 ops */ struct dst_entry * (*l3mdev_get_rt6_dst)(const struct net_device *dev, @@ -175,8 +171,6 @@ static inline bool netif_index_is_l3_master(struct net *net, int ifindex) return rc; } -int l3mdev_get_saddr(struct net *net, int ifindex, struct flowi4 *fl4); - struct dst_entry *l3mdev_get_rt6_dst(struct net *net, struct flowi6 *fl6); int l3mdev_get_saddr6(struct net *net, const struct sock *sk, struct flowi6 *fl6); @@ -291,12 +285,6 @@ static inline bool netif_index_is_l3_master(struct net *net, int ifindex) return false; } -static inline int l3mdev_get_saddr(struct net *net, int ifindex, - struct flowi4 *fl4) -{ - return 0; -} - static inline struct dst_entry *l3mdev_get_rt6_dst(struct net *net, struct flowi6 *fl6) { diff --git a/include/net/route.h b/include/net/route.h index 
ad777d79af94..0429d47cad25 100644 --- a/include/net/route.h +++ b/include/net/route.h @@ -29,7 +29,6 @@ #include #include #include -#include #include #include #include @@ -285,15 +284,6 @@ static inline struct rtable *ip_route_connect(struct flowi4 *fl4, ip_route_connect_init(fl4, dst, src, tos, oif, protocol, sport, dport, sk); - if (!src && oif) { - int rc; - - rc = l3mdev_get_saddr(net, oif, fl4); - if (rc < 0) - return ERR_PTR(rc); - - src = fl4->saddr; - } if (!dst || !src) { rt = __ip_route_output_key(net, fl4); if (IS_ERR(rt)) diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c index 438f50c1a676..90a85c955872 100644 --- a/net/ipv4/raw.c +++ b/net/ipv4/raw.c @@ -606,12 +606,6 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) (inet->hdrincl ? FLOWI_FLAG_KNOWN_NH : 0),
[PATCH net-next 09/12] net: l3mdev: Remove l3mdev_get_rtable
No longer used Signed-off-by: David Ahern--- drivers/net/vrf.c| 47 ++- include/net/l3mdev.h | 21 - 2 files changed, 2 insertions(+), 66 deletions(-) diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c index cc18319b4b0d..08103bc7f1f5 100644 --- a/drivers/net/vrf.c +++ b/drivers/net/vrf.c @@ -47,7 +47,6 @@ static bool add_fib_rules = true; struct net_vrf { - struct rtable __rcu *rth; struct rtable __rcu *rth_local; struct rt6_info __rcu *rt6; struct rt6_info __rcu *rt6_local; @@ -460,26 +459,16 @@ static struct sk_buff *vrf_l3_out(struct net_device *vrf_dev, /* holding rtnl */ static void vrf_rtable_release(struct net_device *dev, struct net_vrf *vrf) { - struct rtable *rth = rtnl_dereference(vrf->rth); struct rtable *rth_local = rtnl_dereference(vrf->rth_local); struct net *net = dev_net(dev); struct dst_entry *dst; - RCU_INIT_POINTER(vrf->rth, NULL); RCU_INIT_POINTER(vrf->rth_local, NULL); synchronize_rcu(); /* move dev in dst's to loopback so this VRF device can be deleted * - based on dst_ifdown */ - if (rth) { - dst = >dst; - dev_put(dst->dev); - dst->dev = net->loopback_dev; - dev_hold(dst->dev); - dst_release(dst); - } - if (rth_local) { dst = _local->dst; dev_put(dst->dev); @@ -492,31 +481,20 @@ static void vrf_rtable_release(struct net_device *dev, struct net_vrf *vrf) static int vrf_rtable_create(struct net_device *dev) { struct net_vrf *vrf = netdev_priv(dev); - struct rtable *rth, *rth_local; + struct rtable *rth_local; if (!fib_new_table(dev_net(dev), vrf->tb_id)) return -ENOMEM; - /* create a dst for routing packets out through a VRF device */ - rth = rt_dst_alloc(dev, 0, RTN_UNICAST, 1, 1, 0); - if (!rth) - return -ENOMEM; - /* create a dst for local ingress routing - packets sent locally * to local address via the VRF device as a loopback */ rth_local = rt_dst_alloc(dev, RTCF_LOCAL, RTN_LOCAL, 1, 1, 0); - if (!rth_local) { - dst_release(>dst); + if (!rth_local) return -ENOMEM; - } - - rth->dst.output = vrf_output; - rth->rt_table_id = vrf->tb_id; 
rth_local->rt_table_id = vrf->tb_id; - rcu_assign_pointer(vrf->rth, rth); rcu_assign_pointer(vrf->rth_local, rth_local); return 0; @@ -648,26 +626,6 @@ static u32 vrf_fib_table(const struct net_device *dev) return vrf->tb_id; } -static struct rtable *vrf_get_rtable(const struct net_device *dev, -const struct flowi4 *fl4) -{ - struct rtable *rth = NULL; - - if (!(fl4->flowi4_flags & FLOWI_FLAG_L3MDEV_SRC)) { - struct net_vrf *vrf = netdev_priv(dev); - - rcu_read_lock(); - - rth = rcu_dereference(vrf->rth); - if (likely(rth)) - dst_hold(>dst); - - rcu_read_unlock(); - } - - return rth; -} - static int vrf_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb) { return 0; @@ -913,7 +871,6 @@ static struct dst_entry *vrf_get_rt6_dst(const struct net_device *dev, static const struct l3mdev_ops vrf_l3mdev_ops = { .l3mdev_fib_table = vrf_fib_table, - .l3mdev_get_rtable = vrf_get_rtable, .l3mdev_l3_rcv = vrf_l3_rcv, .l3mdev_l3_out = vrf_l3_out, #if IS_ENABLED(CONFIG_IPV6) diff --git a/include/net/l3mdev.h b/include/net/l3mdev.h index 391c46130ef6..44ceec61de63 100644 --- a/include/net/l3mdev.h +++ b/include/net/l3mdev.h @@ -23,8 +23,6 @@ * * @l3mdev_l3_out:Hook in L3 output path * - * @l3mdev_get_rtable: Get cached IPv4 rtable (dst_entry) for device - * * @l3mdev_get_rt6_dst: Get cached IPv6 rt6_info (dst_entry) for device */ @@ -36,10 +34,6 @@ struct l3mdev_ops { struct sock *sk, struct sk_buff *skb, u16 proto); - /* IPv4 ops */ - struct rtable * (*l3mdev_get_rtable)(const struct net_device *dev, -const struct flowi4 *fl4); - /* IPv6 ops */ struct dst_entry * (*l3mdev_get_rt6_dst)(const struct net_device *dev, struct flowi6 *fl6); @@ -140,15 +134,6 @@ static inline u32 l3mdev_fib_table(const struct net_device *dev) return tb_id; } -static inline struct rtable *l3mdev_get_rtable(const struct net_device *dev, - const struct flowi4 *fl4) -{ - if (netif_is_l3_master(dev) &&
Re: [PATCH] net: pegasus: Remove deprecated create_singlethread_workqueue
On 16-08-30 22:02:47, Bhaktipriya Shridhar wrote:
> The workqueue "pegasus_workqueue" queues a single work item per pegasus
> instance and hence it doesn't require execution ordering. Hence,
> alloc_workqueue has been used to replace the deprecated
> create_singlethread_workqueue instance.
>
> The WQ_MEM_RECLAIM flag has been set to ensure forward progress under
> memory pressure since it's a network driver.
>
> Since there are fixed number of work items, explicit concurrency
> limit is unnecessary here.
>
> Signed-off-by: Bhaktipriya Shridhar
> ---
>  drivers/net/usb/pegasus.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/usb/pegasus.c b/drivers/net/usb/pegasus.c
> index 9bbe0161..1434e5d 100644
> --- a/drivers/net/usb/pegasus.c
> +++ b/drivers/net/usb/pegasus.c
> @@ -1129,7 +1129,8 @@ static int pegasus_probe(struct usb_interface *intf,
>  		return -ENODEV;
>
>  	if (pegasus_count == 0) {
> -		pegasus_workqueue = create_singlethread_workqueue("pegasus");
> +		pegasus_workqueue = alloc_workqueue("pegasus", WQ_MEM_RECLAIM,
> +						    0);
>  		if (!pegasus_workqueue)
>  			return -ENOMEM;
>  	}
> --
> 2.1.4

Nope, there is no need for singlethread-ness here. As long as the flag
you used is doing the right thing I am OK with the patch.

Petko
[PATCH] net: pegasus: Remove deprecated create_singlethread_workqueue
The workqueue "pegasus_workqueue" queues a single work item per pegasus
instance and hence it doesn't require execution ordering. Hence,
alloc_workqueue has been used to replace the deprecated
create_singlethread_workqueue instance.

The WQ_MEM_RECLAIM flag has been set to ensure forward progress under
memory pressure since it's a network driver.

Since there are a fixed number of work items, an explicit concurrency
limit is unnecessary here.

Signed-off-by: Bhaktipriya Shridhar
---
 drivers/net/usb/pegasus.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/usb/pegasus.c b/drivers/net/usb/pegasus.c
index 9bbe0161..1434e5d 100644
--- a/drivers/net/usb/pegasus.c
+++ b/drivers/net/usb/pegasus.c
@@ -1129,7 +1129,8 @@ static int pegasus_probe(struct usb_interface *intf,
 		return -ENODEV;
 
 	if (pegasus_count == 0) {
-		pegasus_workqueue = create_singlethread_workqueue("pegasus");
+		pegasus_workqueue = alloc_workqueue("pegasus", WQ_MEM_RECLAIM,
+						    0);
 		if (!pegasus_workqueue)
 			return -ENOMEM;
 	}
-- 
2.1.4
[PATCH] bonding: Remove deprecated create_singlethread_workqueue
alloc_ordered_workqueue() with WQ_MEM_RECLAIM set replaces the deprecated
create_singlethread_workqueue(). This is the identity conversion.

The workqueue "wq" queues multiple work items viz &bond->mcast_work,
&bond->mii_work, &bond->arp_work, &bond->alb_work, &bond->ad_work and
&bond->slave_arr_work, which require strict execution ordering. Hence,
an ordered dedicated workqueue has been used.

Since it is a network driver, WQ_MEM_RECLAIM has been set to ensure
forward progress under memory pressure.

Signed-off-by: Bhaktipriya Shridhar
---
 drivers/net/bonding/bond_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 941ec99..ebaf1a9 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4635,7 +4635,7 @@ static int bond_init(struct net_device *bond_dev)
 
 	netdev_dbg(bond_dev, "Begin bond_init\n");
 
-	bond->wq = create_singlethread_workqueue(bond_dev->name);
+	bond->wq = alloc_ordered_workqueue(bond_dev->name, WQ_MEM_RECLAIM);
 	if (!bond->wq)
 		return -ENOMEM;
 
-- 
2.1.4
Re: [PATCH net] sunrpc: fix UDP memory accounting
On 25 Aug 2016, at 12:42, Paolo Abeni wrote:

> The commit f9b2ee714c5c ("SUNRPC: Move UDP receive data path into a
> workqueue context"), as a side effect, moved the skb_free_datagram()
> call outside the scope of the related socket lock, but UDP sockets
> require such lock to be held for proper memory accounting.
> Fix it by replacing skb_free_datagram() with skb_free_datagram_locked().
>
> Fixes: f9b2ee714c5c ("SUNRPC: Move UDP receive data path into a workqueue context")
> Reported-and-tested-by: Jan Stancek
> Signed-off-by: Paolo Abeni

Thanks for finding this. A similar fix in 2009 for svcsock.c was done by
Eric Dumazet: 9d410c796067 ("net: fix sk_forward_alloc corruption")

skb_free_datagram_locked() is used for all xprt types in svcsock.c; should
we use it for the xs_local_transport as well in xprtsock.c?

Ben

> ---
>  net/sunrpc/xprtsock.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> index 8ede3bc..bf16883 100644
> --- a/net/sunrpc/xprtsock.c
> +++ b/net/sunrpc/xprtsock.c
> @@ -1074,7 +1074,7 @@ static void xs_udp_data_receive(struct sock_xprt *transport)
>  		skb = skb_recv_datagram(sk, 0, 1, &err);
>  		if (skb != NULL) {
>  			xs_udp_data_read_skb(&transport->xprt, sk, skb);
> -			skb_free_datagram(sk, skb);
> +			skb_free_datagram_locked(sk, skb);
>  			continue;
>  		}
>  		if (!test_and_clear_bit(XPRT_SOCK_DATA_READY, &transport->sock_state))
> --
> 1.8.3.1
Re: [PATCH net-next] rxrpc: Remove use of skbs from AFS
Sorry about this, stgit mail is playing silly devils and not inserting the
patch numbers if there's a cover letter but only one patch :-/

David
[PATCH net-next] rxrpc: Remove use of skbs from AFS
Here's a single patch that removes the use of sk_buffs from fs/afs. From
this point on they'll be entirely retained within net/rxrpc and AFS just
asks AF_RXRPC for linear buffers of data.

This needs to be applied on top of the just-posted preparatory patch set.

This makes some future developments easier/possible:

 (1) Simpler rxrpc_call usage counting.

 (2) Earlier freeing of metadata sk_buffs.

 (3) Rx phase shortcutting on abort/error.

 (4) Encryption/decryption in the AFS fs contexts/threads and directly
     between sk_buffs and AFS buffers.

 (5) Synchronous waiting in reception for AFS.

Changes:

 (V2) Fixed afs_transfer_reply() whereby call->offset was incorrectly being
      added to the buffer pointer (it doesn't matter as long as the reply
      fits entirely inside a single packet).

      Removed an unused goto-label and an unused variable.

The patch can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite

Tagged thusly:

	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
	rxrpc-rewrite-20160830-2v2

David
---
David Howells (1):
      rxrpc: Don't expose skbs to in-kernel users

 Documentation/networking/rxrpc.txt |  72 +++---
 fs/afs/cmservice.c                 | 142 ++--
 fs/afs/fsclient.c                  | 148 +---
 fs/afs/internal.h                  |  33 +--
 fs/afs/rxrpc.c                     | 439 +---
 fs/afs/vlclient.c                  |   7 -
 include/net/af_rxrpc.h             |  35 +--
 net/rxrpc/af_rxrpc.c               |  29 +-
 net/rxrpc/ar-internal.h            |  23 ++
 net/rxrpc/call_accept.c            |  13 +
 net/rxrpc/call_object.c            |   5
 net/rxrpc/conn_event.c             |   1
 net/rxrpc/input.c                  |  10 +
 net/rxrpc/output.c                 |   2
 net/rxrpc/recvmsg.c                | 191 +---
 net/rxrpc/skbuff.c                 |   1
 16 files changed, 565 insertions(+), 586 deletions(-)
Re: [PATCH v4] brcmfmac: add missing header dependencies
Baoyou Xie writes:

> On 29 August 2016 at 23:31, Rafał Miłecki wrote:
>> On 29 August 2016 at 14:39, Baoyou Xie wrote:
>>> We get 1 warning when build kernel with W=1:
>>> drivers/net/wireless/broadcom/brcm80211/brcmfmac/tracepoint.c:23:6:
>>> warning: no previous prototype for '__brcmf_err' [-Wmissing-prototypes]
>>
>> building?
>
> I'm not native English, but I think so.
>
>>> In fact, this function is declared in brcmfmac/debug.h, so this patch
>>> add missing header dependencies.
>>
>> adds
>>
>>> Signed-off-by: Baoyou Xie
>>> Acked-by: Arnd Bergmann
>>
>> Please don't resend patches just to add tags like that. This only
>> increases a noise and patchwork handles this just fine, see:
>> https://patchwork.kernel.org/patch/9303285/
>> https://patchwork.kernel.org/patch/9303285/mbox/
>
> Do I need to resend a patch that fixes two typos (build/add)? Or you modify
> them on your way?

I can fix those when I commit the patch.

-- 
Kalle Valo
Re: [RFC v2 00/10] Landlock LSM: Unprivileged sandboxing
On Thu, Aug 25, 2016 at 3:32 AM, Mickaël Salaün wrote:
> Hi,
>
> This series is a proof of concept to fill some missing part of seccomp as the
> ability to check syscall argument pointers or creating more dynamic security
> policies. The goal of this new stackable Linux Security Module (LSM) called
> Landlock is to allow any process, including unprivileged ones, to create
> powerful security sandboxes comparable to the Seatbelt/XNU Sandbox or the
> OpenBSD Pledge. This kind of sandbox help to mitigate the security impact of
> bugs or unexpected/malicious behaviors in userland applications.

Mickaël, will you be at KS and/or LPC?
Re: [PATCH next] tcp: fastopen: fix rcv_wup initialization for TFO server on SYN/data
Sorry, there's a typo in the subject line: that should be "net" rather
than "next" (I'm proposing "net" since it's a bug fix). Looks like
"git am" strips this mistake, but I'm happy to resubmit if it helps.

thanks,
neal
[PATCH next] tcp: fastopen: fix rcv_wup initialization for TFO server on SYN/data
Yuchung noticed that on the first TFO server data packet sent after
the (TFO) handshake, the server echoed the TCP timestamp value in the
SYN/data instead of the timestamp value in the final ACK of the
handshake. This problem did not happen on regular opens.

The tcp_replace_ts_recent() logic that decides whether to remember an
incoming TS value needs tp->rcv_wup to hold the latest receive
sequence number that we have ACKed (latest tp->rcv_nxt we have ACKed).
This commit fixes this issue by ensuring that a TFO server properly
updates tp->rcv_wup to match tp->rcv_nxt at the time it sends a
SYN/ACK for the SYN/data.

Reported-by: Yuchung Cheng
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
---
 net/ipv4/tcp_fastopen.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index 54d9f9b..62a5751 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -226,6 +226,7 @@ static struct sock *tcp_fastopen_create_child(struct sock *sk,
 	tcp_fastopen_add_skb(child, skb);
 
 	tcp_rsk(req)->rcv_nxt = tp->rcv_nxt;
+	tp->rcv_wup = tp->rcv_nxt;
 
 	/* tcp_conn_request() is sending the SYNACK,
 	 * and queues the child into listener accept queue.
 	 */
-- 
2.8.0.rc3.226.g39d4020
[PATCH net-next] rxrpc: Don't expose skbs to in-kernel users
Don't expose skbs to in-kernel users, such as the AFS filesystem, but
instead provide a notification hook that indicates that a call needs
attention and another that indicates that there's a new call to be
collected.

This makes the following possibilities more achievable:

 (1) Call refcounting can be made simpler if skbs don't hold refs to calls.

 (2) skbs referring to non-data events will be able to be freed much sooner
     rather than being queued for AFS to pick up as rxrpc_kernel_recv_data
     will be able to consult the call state.

 (3) We can shortcut the receive phase when a call is remotely aborted
     because we don't have to go through all the packets to get to the one
     cancelling the operation.

 (4) It makes it easier to do encryption/decryption directly between AFS's
     buffers and sk_buffs.

 (5) Encryption/decryption can more easily be done in the AFS's thread
     contexts - usually that of the userspace process that issued a syscall
     - rather than in one of rxrpc's background threads on a workqueue.

 (6) AFS will be able to wait synchronously on a call inside AF_RXRPC.

To make this work, the following interface function has been added:

	int rxrpc_kernel_recv_data(
		struct socket *sock, struct rxrpc_call *call,
		void *buffer, size_t bufsize, size_t *_offset,
		bool want_more, u32 *_abort_code);

This is the recvmsg equivalent. It allows the caller to find out about the
state of a specific call and to transfer received data into a buffer
piecemeal.

afs_extract_data() and rxrpc_kernel_recv_data() now do all the extraction
logic between them. They don't wait synchronously yet because the socket
lock needs to be dealt with.

Five interface functions have been removed:

	rxrpc_kernel_is_data_last()
	rxrpc_kernel_get_abort_code()
	rxrpc_kernel_get_error_number()
	rxrpc_kernel_free_skb()
	rxrpc_kernel_data_consumed()

As a temporary hack, sk_buffs going to an in-kernel call are queued on the
rxrpc_call struct (->knlrecv_queue) rather than being handed over to the
in-kernel user.
To process the queue internally, a temporary function, temp_deliver_data()
has been added. This will be replaced with common code between the
rxrpc_recvmsg() path and the kernel_rxrpc_recv_data() path in a future
patch.

Signed-off-by: David Howells
---
 Documentation/networking/rxrpc.txt |  72 +++---
 fs/afs/cmservice.c                 | 142 ++--
 fs/afs/fsclient.c                  | 148 +---
 fs/afs/internal.h                  |  34 +--
 fs/afs/rxrpc.c                     | 439 +---
 fs/afs/vlclient.c                  |   7 -
 include/net/af_rxrpc.h             |  35 +--
 net/rxrpc/af_rxrpc.c               |  29 +-
 net/rxrpc/ar-internal.h            |  23 ++
 net/rxrpc/call_accept.c            |  13 +
 net/rxrpc/call_object.c            |   5
 net/rxrpc/conn_event.c             |   1
 net/rxrpc/input.c                  |  10 +
 net/rxrpc/output.c                 |   2
 net/rxrpc/recvmsg.c                | 195 +---
 net/rxrpc/skbuff.c                 |   1
 16 files changed, 570 insertions(+), 586 deletions(-)

diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt
index cfc8cb91452f..1b63bbc6b94f 100644
--- a/Documentation/networking/rxrpc.txt
+++ b/Documentation/networking/rxrpc.txt
@@ -748,6 +748,37 @@ The kernel interface functions are as follows:
     The msg must not specify a destination address, control data or any
     flags other than MSG_MORE.  len is the total amount of data to transmit.
 
+ (*) Receive data from a call.
+
+	int rxrpc_kernel_recv_data(struct socket *sock,
+				   struct rxrpc_call *call,
+				   void *buf,
+				   size_t size,
+				   size_t *_offset,
+				   bool want_more,
+				   u32 *_abort)
+
+     This is used to receive data from either the reply part of a client call
+     or the request part of a service call.  buf and size specify how much
+     data is desired and where to store it.  *_offset is added on to buf and
+     subtracted from size internally; the amount copied into the buffer is
+     added to *_offset before returning.
+
+     want_more should be true if further data will be required after this is
+     satisfied and false if this is the last item of the receive phase.
+
+     There are three normal returns: 0 if the buffer was filled and want_more
+     was true; 1 if the buffer was filled, the last DATA packet has been
+     emptied and want_more was false; and -EAGAIN if the function needs to be
+     called again.
+
+     If the last DATA packet is processed but the buffer contains less than
+     the amount requested, EBADMSG is returned.  If
[PATCH net-next] rxrpc: Remove use of skbs from AFS
Here's a single patch that removes the use of sk_buffs from fs/afs.  From
this point on they'll be entirely retained within net/rxrpc and AFS just
asks AF_RXRPC for linear buffers of data.  This needs to be applied on top
of the just-posted preparatory patch set.

This makes some future developments easier/possible:

 (1) Simpler rxrpc_call usage counting.

 (2) Earlier freeing of metadata sk_buffs.

 (3) Rx phase shortcutting on abort/error.

 (4) Encryption/decryption in the AFS fs contexts/threads and directly
     between sk_buffs and AFS buffers.

 (5) Synchronous waiting in reception for AFS.

The patch can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite

Tagged thusly:

	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
	rxrpc-rewrite-20160830-2

David
---
David Howells (1):
      rxrpc: Don't expose skbs to in-kernel users

 Documentation/networking/rxrpc.txt |  72 +++---
 fs/afs/cmservice.c                 | 142 ++--
 fs/afs/fsclient.c                  | 148 +---
 fs/afs/internal.h                  |  34 +--
 fs/afs/rxrpc.c                     | 439 +---
 fs/afs/vlclient.c                  |   7 -
 include/net/af_rxrpc.h             |  35 +--
 net/rxrpc/af_rxrpc.c               |  29 +-
 net/rxrpc/ar-internal.h            |  23 ++
 net/rxrpc/call_accept.c            |  13 +
 net/rxrpc/call_object.c            |   5
 net/rxrpc/conn_event.c             |   1
 net/rxrpc/input.c                  |  10 +
 net/rxrpc/output.c                 |   2
 net/rxrpc/recvmsg.c                | 195 +---
 net/rxrpc/skbuff.c                 |   1
 16 files changed, 570 insertions(+), 586 deletions(-)
Re: [PATCH 4/8] dmaengine: sa11x0: unexport sa11x0_dma_filter_fn and clean up
On Mon, Aug 29, 2016 at 12:26:20PM +0100, Russell King wrote:
> As we now have no users of sa11x0_dma_filter_fn() in the tree, we can
> unexport this function, and remove the now unused header file.

Acked-by: Vinod Koul

-- 
~Vinod
Re: [PATCH 1/8] dmaengine: sa11x0: add DMA filters
On Mon, Aug 29, 2016 at 12:26:04PM +0100, Russell King wrote:
> Add DMA filters for the sa11x0 DMA channels.  This will allow us to
> migrate away from directly using the DMA filter function in drivers.

Acked-by: Vinod Koul

-- 
~Vinod
[PATCH net] net: bridge: don't increment tx_dropped in br_do_proxy_arp
pskb_may_pull may fail due to various reasons (e.g. alloc failure), but the
skb isn't changed/dropped and processing continues so we shouldn't
increment tx_dropped.

CC: Kyeyoon Park
CC: Roopa Prabhu
CC: Stephen Hemminger
CC: bri...@lists.linux-foundation.org
Fixes: 958501163ddd ("bridge: Add support for IEEE 802.11 Proxy ARP")
Signed-off-by: Nikolay Aleksandrov
---
 net/bridge/br_input.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 8e486203d133..abe11f085479 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -80,13 +80,10 @@ static void br_do_proxy_arp(struct sk_buff *skb, struct net_bridge *br,
 
 	BR_INPUT_SKB_CB(skb)->proxyarp_replied = false;
 
-	if (dev->flags & IFF_NOARP)
+	if ((dev->flags & IFF_NOARP) ||
+	    !pskb_may_pull(skb, arp_hdr_len(dev)))
 		return;
 
-	if (!pskb_may_pull(skb, arp_hdr_len(dev))) {
-		dev->stats.tx_dropped++;
-		return;
-	}
 
 	parp = arp_hdr(skb);
 
 	if (parp->ar_pro != htons(ETH_P_IP) ||
-- 
2.1.4
Re: [PATCH net] tg3: Fix for disallow tx coalescing time to be 0
On Tue, Aug 30, 2016 at 7:38 AM, Ivan Vecera wrote:
> The recent commit 087d7a8c disallows to set Rx coalescing time to be 0
> as this stops generating interrupts for the incoming packets. I found
> the zero Tx coalescing time stops generating interrupts similarly for
> outgoing packets and fires Tx watchdog later. To avoid this, don't allow
> to set Tx coalescing time to 0.
>
> Cc: satish.baddipad...@broadcom.com
> Cc: siva.kal...@broadcom.com
> Cc: michael.c...@broadcom.com
> Signed-off-by: Ivan Vecera
> ---
>  drivers/net/ethernet/broadcom/tg3.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
> index 6592612..07e3beb 100644
> --- a/drivers/net/ethernet/broadcom/tg3.c
> +++ b/drivers/net/ethernet/broadcom/tg3.c
> @@ -14012,6 +14012,7 @@ static int tg3_set_coalesce(struct net_device *dev, struct ethtool_coalesce *ec)
>  	if ((ec->rx_coalesce_usecs > MAX_RXCOL_TICKS) ||
>  	    (!ec->rx_coalesce_usecs) ||
>  	    (ec->tx_coalesce_usecs > MAX_TXCOL_TICKS) ||
> +	    (!ec->tx_coalesce_usecs) ||
>  	    (ec->rx_max_coalesced_frames > MAX_RXMAX_FRAMES) ||
>  	    (ec->tx_max_coalesced_frames > MAX_TXMAX_FRAMES) ||
>  	    (ec->rx_coalesce_usecs_irq > max_rxcoal_tick_int) ||

As Rick pointed out last time, we can remove this check which follows the
block of code above:

	/* No tx interrupts will be generated if both are zero */
	if ((ec->tx_coalesce_usecs == 0) &&
	    (ec->tx_max_coalesced_frames == 0))
		return -EINVAL;
Re: [PATCH 0/4] SA11x0 Clocks and removal of Neponset SMC91x hack
On Tue, 30 Aug 2016, Russell King - ARM Linux wrote:

> This mini-series (which follows several other series on which it
> depends) gets rid of the Assabet/Neponset hack in the smc91x driver.
>
> In order to do that, we need to get several pieces in place first:
> * gpiolib support throughout SA11x0/Assabet/Neponset so that we can
>   represent control signals through gpiolib
> * CCF support, so we can re-use the code in drivers/clk to implement
>   the external crystal oscillator attached to the SMC91x.  This
>   external crystal oscillator is enabled via a control signal.
>
> This series:
> - performs the SA11x0 CCF conversion
> - adds an optional clock to SMC91x to cater for an external crystal
>   oscillator
> - switches the Neponset code to provide a 'struct clk' representing
>   this oscillator
> - removes the SMC91x hack to assert the enable signal
>
> This results in the platform specific includes being removed from the
> SMC91x driver.
>
> Please ack these changes; due to the dependencies, I wish to merge
> them through my tree.  Thanks.

Looks nice to me.

Acked-by: Nicolas Pitre

>  arch/arm/Kconfig                   |   1 +
>  arch/arm/mach-sa1100/clock.c       | 191 +
>  arch/arm/mach-sa1100/neponset.c    |  42 
>  drivers/net/ethernet/smsc/smc91x.c |  47 ++---
>  drivers/net/ethernet/smsc/smc91x.h |   1 +
>  5 files changed, 166 insertions(+), 116 deletions(-)
>
> -- 
> RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
> according to speedtest.net.
[PATCH net-next 6/8] rxrpc: Provide a way for AFS to ask for the peer address of a call
Provide a function so that kernel users, such as AFS, can ask for the peer
address of a call:

	void rxrpc_kernel_get_peer(struct rxrpc_call *call,
				   struct sockaddr_rxrpc *_srx);

In the future the kernel service won't get sk_buffs to look inside.
Further, this allows us to hide any canonicalisation inside AF_RXRPC for
when IPv6 support is added.

Also propagate this through to afs_find_server() and issue a warning if we
can't handle the address family yet.

Signed-off-by: David Howells
---
 Documentation/networking/rxrpc.txt |  7 +++
 fs/afs/cmservice.c                 | 20 +++-
 fs/afs/internal.h                  |  5 -
 fs/afs/rxrpc.c                     |  2 +-
 fs/afs/server.c                    | 11 ---
 include/net/af_rxrpc.h             |  2 ++
 net/rxrpc/peer_object.c            | 15 +++
 7 files changed, 48 insertions(+), 14 deletions(-)

diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt
index 70c926ae212d..dfe0b008df74 100644
--- a/Documentation/networking/rxrpc.txt
+++ b/Documentation/networking/rxrpc.txt
@@ -868,6 +868,13 @@ The kernel interface functions are as follows:
     This is used to allocate a null RxRPC key that can be used to indicate
     anonymous security for a particular domain.
 
+ (*) Get the peer address of a call.
+
+	void rxrpc_kernel_get_peer(struct socket *sock, struct rxrpc_call *call,
+				   struct sockaddr_rxrpc *_srx);
+
+     This is used to find the remote peer address of a call.
+
 
 ===
 CONFIGURABLE PARAMETERS

diff --git a/fs/afs/cmservice.c b/fs/afs/cmservice.c
index ca32d891bbc3..77ee481059ac 100644
--- a/fs/afs/cmservice.c
+++ b/fs/afs/cmservice.c
@@ -167,9 +167,9 @@ static void SRXAFSCB_CallBack(struct work_struct *work)
 static int afs_deliver_cb_callback(struct afs_call *call, struct sk_buff *skb,
 				   bool last)
 {
+	struct sockaddr_rxrpc srx;
 	struct afs_callback *cb;
 	struct afs_server *server;
-	struct in_addr addr;
 	__be32 *bp;
 	u32 tmp;
 	int ret, loop;
@@ -178,6 +178,7 @@ static int afs_deliver_cb_callback(struct afs_call *call, struct sk_buff *skb,
 
 	switch (call->unmarshall) {
 	case 0:
+		rxrpc_kernel_get_peer(afs_socket, call->rxcall, &srx);
 		call->offset = 0;
 		call->unmarshall++;
 
@@ -282,8 +283,7 @@ static int afs_deliver_cb_callback(struct afs_call *call, struct sk_buff *skb,
 
 	/* we'll need the file server record as that tells us which set of
 	 * vnodes to operate upon */
-	memcpy(&addr, &ip_hdr(skb)->saddr, 4);
-	server = afs_find_server(&addr);
+	server = afs_find_server(&srx);
 	if (!server)
 		return -ENOTCONN;
 	call->server = server;
@@ -314,12 +314,14 @@ static int afs_deliver_cb_init_call_back_state(struct afs_call *call,
 					       struct sk_buff *skb,
 					       bool last)
 {
+	struct sockaddr_rxrpc srx;
 	struct afs_server *server;
-	struct in_addr addr;
 	int ret;
 
 	_enter(",{%u},%d", skb->len, last);
 
+	rxrpc_kernel_get_peer(afs_socket, call->rxcall, &srx);
+
 	ret = afs_data_complete(call, skb, last);
 	if (ret < 0)
 		return ret;
@@ -329,8 +331,7 @@ static int afs_deliver_cb_init_call_back_state(struct afs_call *call,
 
 	/* we'll need the file server record as that tells us which set of
 	 * vnodes to operate upon */
-	memcpy(&addr, &ip_hdr(skb)->saddr, 4);
-	server = afs_find_server(&addr);
+	server = afs_find_server(&srx);
 	if (!server)
 		return -ENOTCONN;
 	call->server = server;
@@ -347,11 +348,13 @@ static int afs_deliver_cb_init_call_back_state3(struct afs_call *call,
 						struct sk_buff *skb,
 						bool last)
 {
+	struct sockaddr_rxrpc srx;
 	struct afs_server *server;
-	struct in_addr addr;
 
 	_enter(",{%u},%d", skb->len, last);
 
+	rxrpc_kernel_get_peer(afs_socket, call->rxcall, &srx);
+
 	/* There are some arguments that we ignore */
 	afs_data_consumed(call, skb);
 	if (!last)
@@ -362,8 +365,7 @@ static int afs_deliver_cb_init_call_back_state3(struct afs_call *call,
 
 	/* we'll need the file server record as that tells us which set of
 	 * vnodes to operate upon */
-	memcpy(&addr, &ip_hdr(skb)->saddr, 4);
-	server = afs_find_server(&addr);
+	server = afs_find_server(&srx);
 	if (!server)
 		return -ENOTCONN;
 	call->server = server;

diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index df976b2a7f40..d97552de9c59 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -20,6 +20,7 @@
 #include
 #include
[PATCH net-next 8/8] rxrpc: Pass struct socket * to more rxrpc kernel interface functions
Pass struct socket * to more rxrpc kernel interface functions.  They should
be starting from this rather than the socket pointer in the rxrpc_call
struct if they need to access the socket.

I have left:

	rxrpc_kernel_is_data_last()
	rxrpc_kernel_get_abort_code()
	rxrpc_kernel_get_error_number()
	rxrpc_kernel_free_skb()
	rxrpc_kernel_data_consumed()

unmodified as they're all about to be removed (and, in any case, don't
touch the socket).

Signed-off-by: David Howells
---
 Documentation/networking/rxrpc.txt | 11 ---
 fs/afs/rxrpc.c                     | 26 +++---
 include/net/af_rxrpc.h             | 10 +++---
 net/rxrpc/af_rxrpc.c               |  5 +++--
 net/rxrpc/output.c                 | 20 +++-
 5 files changed, 44 insertions(+), 28 deletions(-)

diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt
index dfe0b008df74..cfc8cb91452f 100644
--- a/Documentation/networking/rxrpc.txt
+++ b/Documentation/networking/rxrpc.txt
@@ -725,7 +725,8 @@ The kernel interface functions are as follows:
 
 (*) End a client call.
 
-	void rxrpc_kernel_end_call(struct rxrpc_call *call);
+	void rxrpc_kernel_end_call(struct socket *sock,
+				   struct rxrpc_call *call);
 
     This is used to end a previously begun call.  The user_call_ID is
     expunged from AF_RXRPC's knowledge and will not be seen again in
     association with
@@ -733,7 +734,9 @@ The kernel interface functions are as follows:
 
 (*) Send data through a call.
 
-	int rxrpc_kernel_send_data(struct rxrpc_call *call, struct msghdr *msg,
+	int rxrpc_kernel_send_data(struct socket *sock,
+				   struct rxrpc_call *call,
+				   struct msghdr *msg,
 				   size_t len);
 
     This is used to supply either the request part of a client call or the
@@ -747,7 +750,9 @@ The kernel interface functions are as follows:
 
 (*) Abort a call.
 
-	void rxrpc_kernel_abort_call(struct rxrpc_call *call, u32 abort_code);
+	void rxrpc_kernel_abort_call(struct socket *sock,
+				     struct rxrpc_call *call,
+				     u32 abort_code);
 
     This is used to abort a call if it's still in an abortable state.  The
     abort code specified will be placed in the ABORT message sent.

diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c
index a1916750e2f9..7b0d18900f50 100644
--- a/fs/afs/rxrpc.c
+++ b/fs/afs/rxrpc.c
@@ -207,7 +207,7 @@ static void afs_free_call(struct afs_call *call)
 static void afs_end_call_nofree(struct afs_call *call)
 {
 	if (call->rxcall) {
-		rxrpc_kernel_end_call(call->rxcall);
+		rxrpc_kernel_end_call(afs_socket, call->rxcall);
 		call->rxcall = NULL;
 	}
 	if (call->type->destructor)
@@ -325,8 +325,8 @@ static int afs_send_pages(struct afs_call *call, struct msghdr *msg,
 		 * returns from sending the request */
 		if (first + loop >= last)
 			call->state = AFS_CALL_AWAIT_REPLY;
-		ret = rxrpc_kernel_send_data(call->rxcall, msg,
-					     to - offset);
+		ret = rxrpc_kernel_send_data(afs_socket, call->rxcall,
+					     msg, to - offset);
 		kunmap(pages[loop]);
 		if (ret < 0)
 			break;
@@ -406,7 +406,8 @@ int afs_make_call(struct in_addr *addr, struct afs_call *call, gfp_t gfp,
 	 * request */
 	if (!call->send_pages)
 		call->state = AFS_CALL_AWAIT_REPLY;
-	ret = rxrpc_kernel_send_data(rxcall, &msg, call->request_size);
+	ret = rxrpc_kernel_send_data(afs_socket, rxcall,
+				     &msg, call->request_size);
 	if (ret < 0)
 		goto error_do_abort;
 
@@ -421,7 +422,7 @@ int afs_make_call(struct in_addr *addr, struct afs_call *call, gfp_t gfp,
 	return wait_mode->wait(call);
 
 error_do_abort:
-	rxrpc_kernel_abort_call(rxcall, RX_USER_ABORT);
+	rxrpc_kernel_abort_call(afs_socket, rxcall, RX_USER_ABORT);
 	while ((skb = skb_dequeue(&call->rx_queue)))
 		afs_free_skb(skb);
 error_kill_call:
@@ -509,7 +510,8 @@ static void afs_deliver_to_call(struct afs_call *call)
 		if (call->state != AFS_CALL_AWAIT_REPLY)
 			abort_code = RXGEN_SS_UNMARSHAL;
 	do_abort:
-		rxrpc_kernel_abort_call(call->rxcall,
+		rxrpc_kernel_abort_call(afs_socket,
+					call->rxcall,