Re: [PATCH 17/17] batman-adv: Avoid precedence issues in macros

2016-10-28 Thread Sven Eckelmann
On Freitag, 28. Oktober 2016 18:56:24 CEST Joe Perches wrote:
[...]
> > But yes, looks to me like it is missing. Do you want to propose a patch or
> > should I do?
> 
> As you are working on this file, perhaps you should.

Ok, I will submit a patch with you as Reported-by to
the batman-adv mailing list.

Kind regards,
Sven



Re: [PATCH v7 0/6] Add eBPF hooks for cgroups

2016-10-28 Thread Alexei Starovoitov
On Sat, Oct 29, 2016 at 01:59:23PM +0900, Lorenzo Colitti wrote:
> On Sat, Oct 29, 2016 at 1:51 PM, Alexei Starovoitov
>  wrote:
> >> What's the use case for egress?
> >>
> >> We (android networking) are currently looking at implementing network
> >> accounting via eBPF in order to replace the out-of-tree xt_qtaguid
> >> code. A per-cgroup eBPF program run on all traffic would be great. But
> >> when we looked at this patchset we realized it would not be useful for
> >> accounting purposes because even if a packet is counted here, it might
> >> still be dropped by netfilter hooks.
> >
> > don't use out-of-tree and instead drop using this mechanism or
> > any other in-kernel method? ;)
> 
> Getting rid of out-of-tree code is the goal, yes. We do have a
> requirement that things continue to work, though. Accounting for a
> packet in ip{,6}_output is not correct if that packet ends up being
> dropped by iptables later on.

understood.
it could be solved by swapping the order of cgroup_bpf_run_filter()
and NF_INET_POST_ROUTING in patch 5. It was proposed some time back, but
the current patch, I think, is more symmetrical.
cgroup+bpf runs after nf hook on rx and runs before it on tx.
imo it's more consistent.
Regardless of this choice... are you going to backport cgroupv2 to
android? Because this set is v2 only.

> > We (facebook infrastructure) have been using iptables and bpf networking
> > together with great success. They nicely co-exist and complement each other.
> > There is no need to reinvent the wheel if existing solution works.
> > iptables are great for their purpose.
> 
> That doesn't really answer my "what is the use case for egress"
> question though, right? Or are you saying "we use this, but we can't
> talk about how we use it"?

If the question is "why is patch 4 alone not enough and why is patch 5 needed",
then the answer is symmetrical access. Accounting for RX only is a half-done job.

> > there is iptables+cBPF support. It's being used in some cases already.
> 
> Adding eBPF support to the xt_bpf iptables code would be an option for
> what we want to do, yes. I think this requires the eBPF map to be
> an fd that is available to the process that exec()s iptables, but we
> could do that.

yes. that's certainly doable, but sooner or later such an approach will hit a
scalability issue when the number of cgroups is large. It's the same issue we saw
with cls_bpf and bpf_skb_under_cgroup(). Hence this patch set, which is centered
around cgroups instead of hooks, was needed. Note that, unlike tc and nf,
there is no way to attach to a hook here: the bpf program is attached to a cgroup.
It's an important distinction vs everything that currently exists in the stack.
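
For reference, the attachment model here is per-cgroup rather than per-hook. A
minimal userspace sketch of attaching an egress program with the BPF_PROG_ATTACH
command and BPF_CGROUP_INET_EGRESS attach type proposed by this series (error
handling trimmed; prog_fd is assumed to come from an earlier BPF_PROG_LOAD):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* attach an already-loaded eBPF program to a cgroupv2 directory for egress */
static int attach_egress_prog(const char *cgroup_path, int prog_fd)
{
	union bpf_attr attr;
	int cgroup_fd = open(cgroup_path, O_DIRECTORY | O_RDONLY);

	if (cgroup_fd < 0)
		return -1;

	memset(&attr, 0, sizeof(attr));
	attr.target_fd = cgroup_fd;	/* the cgroup, not a netdev or hook */
	attr.attach_bpf_fd = prog_fd;	/* program fd from BPF_PROG_LOAD */
	attr.attach_type = BPF_CGROUP_INET_EGRESS;

	return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
}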



Re: net/dccp: warning in dccp_feat_clone_sp_val/__might_sleep

2016-10-28 Thread Cong Wang
On Fri, Oct 28, 2016 at 5:40 PM, Andrey Konovalov  wrote:
> Hi,
>
> I've got the following error report while running the syzkaller fuzzer:
>
> [ cut here ]
> WARNING: CPU: 0 PID: 4608 at kernel/sched/core.c:7724
> __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
> do not call blocking ops when !TASK_RUNNING; state=1 set at
> [] prepare_to_wait+0xbc/0x210
> kernel/sched/wait.c:178
> Modules linked in:
> CPU: 0 PID: 4608 Comm: syz-executor Not tainted 4.9.0-rc2+ #320
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>  88006625f7a0 81b46914 88006625f818 
>  84052960  88006625f7e8 8237
>  88006aceac00 1e2c ed000cc4beff 84052960
> Call Trace:
>  [< inline >] __dump_stack lib/dump_stack.c:15
>  [] dump_stack+0xb3/0x10f lib/dump_stack.c:51
>  [] __warn+0x1a7/0x1f0 kernel/panic.c:550
>  [] warn_slowpath_fmt+0xac/0xd0 kernel/panic.c:565
>  [] __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
>  [< inline >] slab_pre_alloc_hook mm/slab.h:393
>  [< inline >] slab_alloc_node mm/slub.c:2634
>  [< inline >] slab_alloc mm/slub.c:2716
>  [] __kmalloc_track_caller+0x150/0x2a0 mm/slub.c:4240
>  [] kmemdup+0x24/0x50 mm/util.c:113
>  [] dccp_feat_clone_sp_val.part.5+0x4f/0xe0
> net/dccp/feat.c:374
>  [< inline >] dccp_feat_clone_sp_val net/dccp/feat.c:1141
>  [< inline >] dccp_feat_change_recv net/dccp/feat.c:1141
>  [] dccp_feat_parse_options+0xaa1/0x13d0 
> net/dccp/feat.c:1411
>  [] dccp_parse_options+0x721/0x1010 net/dccp/options.c:128
>  [] dccp_rcv_state_process+0x200/0x15b0 net/dccp/input.c:644
>  [] dccp_v4_do_rcv+0xf4/0x1a0 net/dccp/ipv4.c:681
>  [< inline >] sk_backlog_rcv ./include/net/sock.h:872
>  [] __release_sock+0x126/0x3a0 net/core/sock.c:2044
>  [] release_sock+0x59/0x1c0 net/core/sock.c:2502
>  [< inline >] inet_wait_for_connect net/ipv4/af_inet.c:547
>  [] __inet_stream_connect+0x5d2/0xbb0 net/ipv4/af_inet.c:617
>  [] inet_stream_connect+0x55/0xa0 net/ipv4/af_inet.c:656
>  [] SYSC_connect+0x244/0x2f0 net/socket.c:1533
>  [] SyS_connect+0x24/0x30 net/socket.c:1514
>  [] entry_SYSCALL_64_fastpath+0x1f/0xc2
> arch/x86/entry/entry_64.S:209

Should be fixed by the attached patch. I will verify it with your
reproducer tomorrow.

Thanks!
diff --git a/net/dccp/feat.c b/net/dccp/feat.c
index 1704948..c90cb35 100644
--- a/net/dccp/feat.c
+++ b/net/dccp/feat.c
@@ -367,11 +367,11 @@ static inline int dccp_feat_must_be_understood(u8 feat_num)
 }
 
 /* copy constructor, fval must not already contain allocated memory */
-static int dccp_feat_clone_sp_val(dccp_feat_val *fval, u8 const *val, u8 len)
+static int dccp_feat_clone_sp_val(dccp_feat_val *fval, u8 const *val, u8 len, gfp_t flags)
 {
fval->sp.len = len;
if (fval->sp.len > 0) {
-   fval->sp.vec = kmemdup(val, len, gfp_any());
+   fval->sp.vec = kmemdup(val, len, flags);
if (fval->sp.vec == NULL) {
fval->sp.len = 0;
return -ENOBUFS;
@@ -404,7 +404,8 @@ static void dccp_feat_val_destructor(u8 feat_num, dccp_feat_val *val)
 
if (type == FEAT_SP && dccp_feat_clone_sp_val(&new->val,
  original->val.sp.vec,
- original->val.sp.len)) {
+ original->val.sp.len,
+ gfp_any())) {
kfree(new);
return NULL;
}
@@ -735,7 +736,7 @@ static int __feat_register_sp(struct list_head *fn, u8 feat, u8 is_local,
if (feat == DCCPF_CCID && !ccid_support_check(sp_val, sp_len))
return -EOPNOTSUPP;
 
-   if (dccp_feat_clone_sp_val(&fval, sp_val, sp_len))
+   if (dccp_feat_clone_sp_val(&fval, sp_val, sp_len, gfp_any()))
return -ENOMEM;
 
return dccp_feat_push_change(fn, feat, is_local, mandatory, &fval);
@@ -1138,7 +1139,7 @@ static u8 dccp_feat_change_recv(struct list_head *fn, u8 is_mandatory, u8 opt,
 *   otherwise we accept the preferred value;
 * - else if we are the client, we use the first list element.
 */
-   if (dccp_feat_clone_sp_val(&fval, val, 1))
+   if (dccp_feat_clone_sp_val(&fval, val, 1, GFP_ATOMIC))
return DCCP_RESET_CODE_TOO_BUSY;
 
if (len > 1 && server) {
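
For context on why the explicit gfp flag matters: gfp_any() only looks at softirq
context, so on this call path (process context, but with the task state already set
to TASK_INTERRUPTIBLE by inet_wait_for_connect before release_sock processes the
backlog) it returns GFP_KERNEL and the allocation may sleep, which is what
__might_sleep warns about. The helper is roughly (per include/net/sock.h):

static inline gfp_t gfp_any(void)
{
	/* GFP_KERNEL (may sleep) unless we are in softirq context */
	return in_softirq() ? GFP_ATOMIC : GFP_KERNEL;
}

Passing GFP_ATOMIC explicitly on the receive path, as the patch does for
dccp_feat_change_recv(), avoids the sleeping allocation while the socket is
still in the wait loop.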


Re: [PATCH v7 0/6] Add eBPF hooks for cgroups

2016-10-28 Thread Lorenzo Colitti
On Sat, Oct 29, 2016 at 1:51 PM, Alexei Starovoitov
 wrote:
>> What's the use case for egress?
>>
>> We (android networking) are currently looking at implementing network
>> accounting via eBPF in order to replace the out-of-tree xt_qtaguid
>> code. A per-cgroup eBPF program run on all traffic would be great. But
>> when we looked at this patchset we realized it would not be useful for
>> accounting purposes because even if a packet is counted here, it might
>> still be dropped by netfilter hooks.
>
> don't use out-of-tree and instead drop using this mechanism or
> any other in-kernel method? ;)

Getting rid of out-of-tree code is the goal, yes. We do have a
requirement that things continue to work, though. Accounting for a
packet in ip{,6}_output is not correct if that packet ends up being
dropped by iptables later on.

> We (facebook infrastructure) have been using iptables and bpf networking
> together with great success. They nicely co-exist and complement each other.
> There is no need to reinvent the wheel if existing solution works.
> iptables are great for their purpose.

That doesn't really answer my "what is the use case for egress"
question though, right? Or are you saying "we use this, but we can't
talk about how we use it"?

> there is iptables+cBPF support. It's being used in some cases already.

Adding eBPF support to the xt_bpf iptables code would be an option for
what we want to do, yes. I think this requires the eBPF map to be
an fd that is available to the process that exec()s iptables, but we
could do that.
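
One way to make such a map fd available to the process that exec()s iptables is to
pin the map in the bpf filesystem and re-open it by path in the other process. A
rough sketch using the existing BPF_OBJ_PIN/BPF_OBJ_GET commands (the pathname is
only an example; bpffs must be mounted there):

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* pin an existing map fd at a bpffs path so another process can reach it */
static int pin_map(int map_fd, const char *path)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.pathname = (__u64)(unsigned long)path;
	attr.bpf_fd = map_fd;

	return syscall(__NR_bpf, BPF_OBJ_PIN, &attr, sizeof(attr));
}

/* later, e.g. in the process that runs iptables: */
static int get_pinned_map(const char *path)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.pathname = (__u64)(unsigned long)path;

	return syscall(__NR_bpf, BPF_OBJ_GET, &attr, sizeof(attr));
}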


Re: [PATCH v7 0/6] Add eBPF hooks for cgroups

2016-10-28 Thread Alexei Starovoitov
On Sat, Oct 29, 2016 at 12:51:37PM +0900, Lorenzo Colitti wrote:
> On Thu, Oct 27, 2016 at 5:40 PM, Daniel Mack  wrote:
> > It's not anything new. These hooks live on the very same level as
> > SO_ATTACH_FILTER. The only differences are that the BPF programs are
> > stored in the cgroup, and not in the socket, and that they exist for
> > egress as well.
> 
> What's the use case for egress?
> 
> We (android networking) are currently looking at implementing network
> accounting via eBPF in order to replace the out-of-tree xt_qtaguid
> code. A per-cgroup eBPF program run on all traffic would be great. But
> when we looked at this patchset we realized it would not be useful for
> accounting purposes because even if a packet is counted here, it might
> still be dropped by netfilter hooks.

don't use out-of-tree and instead drop using this mechanism or
any other in-kernel method? ;)
We (facebook infrastructure) have been using iptables and bpf networking
together with great success. They nicely co-exist and complement each other.
There is no need to reinvent the wheel if existing solution works.
iptables are great for their purpose.

> It seems like it would be much more useful to be able to do this in an
> iptables rule.

there is iptables+cBPF support. It's being used in some cases already.



Re: [PATCH v7 0/6] Add eBPF hooks for cgroups

2016-10-28 Thread Lorenzo Colitti
On Thu, Oct 27, 2016 at 5:40 PM, Daniel Mack  wrote:
> It's not anything new. These hooks live on the very same level as
> SO_ATTACH_FILTER. The only differences are that the BPF programs are
> stored in the cgroup, and not in the socket, and that they exist for
> egress as well.

What's the use case for egress?

We (android networking) are currently looking at implementing network
accounting via eBPF in order to replace the out-of-tree xt_qtaguid
code. A per-cgroup eBPF program run on all traffic would be great. But
when we looked at this patchset we realized it would not be useful for
accounting purposes because even if a packet is counted here, it might
still be dropped by netfilter hooks.

It seems like it would be much more useful to be able to do this in an
iptables rule.


RE: [PATCH net-next RFC WIP] Patch for XDP support for virtio_net

2016-10-28 Thread Shrijeet Mukherjee
> -Original Message-
> From: Alexei Starovoitov [mailto:alexei.starovoi...@gmail.com]
> Sent: Friday, October 28, 2016 11:22 AM
> To: Jakub Kicinski 
> Cc: John Fastabend ; David Miller
> ; alexander.du...@gmail.com; m...@redhat.com;
> bro...@redhat.com; shrij...@gmail.com; t...@herbertland.com;
> netdev@vger.kernel.org; s...@cumulusnetworks.com;
> ro...@cumulusnetworks.com; niko...@cumulusnetworks.com
> Subject: Re: [PATCH net-next RFC WIP] Patch for XDP support for virtio_net
>
> On Fri, Oct 28, 2016 at 05:18:12PM +0100, Jakub Kicinski wrote:
> > On Fri, 28 Oct 2016 08:56:35 -0700, John Fastabend wrote:
> > > On 16-10-27 07:10 PM, David Miller wrote:
> > > > From: Alexander Duyck 
> > > > Date: Thu, 27 Oct 2016 18:43:59 -0700
> > > >
> > > >> On Thu, Oct 27, 2016 at 6:35 PM, David Miller
>  wrote:
> > > >>> From: "Michael S. Tsirkin" 
> > > >>> Date: Fri, 28 Oct 2016 01:25:48 +0300
> > > >>>
> > >  On Thu, Oct 27, 2016 at 05:42:18PM -0400, David Miller wrote:
> > > > From: "Michael S. Tsirkin" 
> > > > Date: Fri, 28 Oct 2016 00:30:35 +0300
> > > >
> > > >> Something I'd like to understand is how does XDP address the
> > > >> problem that 100Byte packets are consuming 4K of memory now.
> > > >
> > > > Via page pools.  We're going to make a generic one, but right
> > > > now each and every driver implements a quick list of pages to
> > > > allocate from (and thus avoid the DMA map/unmap overhead,
> > > > etc.)
> > > 
> > >  So to clarify, ATM virtio doesn't attempt to avoid dma
> > >  map/unmap so there should be no issue with that even when using
> > >  sub/page regions, assuming DMA APIs support sub-page map/unmap correctly.
> > > >>>
> > > >>> That's not what I said.
> > > >>>
> > > >>> The page pools are meant to address the performance degradation
> > > >>> from going to having one packet per page for the sake of XDP's
> > > >>> requirements.
> > > >>>
> > > >>> You still need to have one packet per page for correct XDP
> > > >>> operation whether you do page pools or not, and whether you have
> > > >>> DMA mapping (or its equivalent virtualization operation) or not.
> > > >>
> > > >> Maybe I am missing something here, but why do you need to limit
> > > >> things to one packet per page for correct XDP operation?  Most of
> > > >> the drivers out there now are usually storing something closer to
> > > >> at least 2 packets per page, and with the DMA API fixes I am
> > > >> working on there should be no issue with changing the contents
> > > >> inside those pages since we won't invalidate or overwrite the
> > > >> data after the DMA buffer has been synchronized for use by the CPU.
> > > >
> > > > Because with SKB's you can share the page with other packets.
> > > >
> > > > With XDP you simply cannot.
> > > >
> > > > It's software semantics that are the issue.  SKB frag list pages
> > > > are read only, XDP packets are writable.
> > > >
> > > > This has nothing to do with "writability" of the pages wrt. DMA
> > > > mapping or cpu mappings.
> > > >
> > >
> > > Sorry I'm not seeing it either. The current xdp_buff is defined by,
> > >
> > >   struct xdp_buff {
> > >   void *data;
> > >   void *data_end;
> > >   };
> > >
> > > The verifier has an xdp_is_valid_access() check to ensure we don't
> > > go past data_end. The page for now at least never leaves the driver.
> > > For the work to get xmit to other devices working I'm still not sure
> > > I see any issue.
> >
> > +1
> >
> > Do we want to make the packet-per-page a requirement because it could
> > be useful in the future from architectural standpoint?  I guess there
> > is a trade-off here between having the comfort of people following
> > this requirement today and making driver support for XDP even more complex.
>
> It looks to me that packet per page makes drivers simpler instead of
> complex.
> ixgbe split-page and mlx* order-3/5 tricks are definitely complex.
> The skb truesize concerns come from the host when data is delivered to
> user space and we need to have precise memory accounting for different
> applications and different users. XDP is all root and imo it's far away from
> the days when multi-user non-root issues start to pop up.
> At the same time XDP doesn't require to use 4k buffer in something like
> Netronome.
> If xdp bpf program can be offloaded into HW with 1800 byte buffers, great!
> For x86 cpu the 4k byte is a natural allocation chunk. Anything lower
> requires delicate dma tricks paired with even more complex slab allocator
> and atomic refcnts.
> All of that can only drive the performance down.
> Comparing to kernel bypass xdp is using 4k pages whereas dpdk has to use
> huge pages, so xdp is saving a ton of memory already!

Generally agree, but SRIOV nics with multiple queues can end up in a bad
spot if each buffer was 4K right ? I see a specific page pool to be used
by queues which are enabled for XDP as the easiest to swing solution that
way the memor
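
For reference, the data/data_end bounds checking that the verifier enforces through
xdp_is_valid_access() (mentioned in the quoted discussion above) looks roughly like
this from the program side; a minimal restricted-C sketch:

#include <linux/bpf.h>
#include <linux/if_ether.h>

/* minimal XDP program: any packet access must be guarded by a bounds
 * check against data_end or the verifier rejects the program
 */
int xdp_sketch(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;

	if (data + sizeof(*eth) > data_end)	/* required bounds check */
		return XDP_ABORTED;

	/* eth->h_proto etc. can be read safely from here on */
	return XDP_PASS;
}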

Re: [bnx2] [Regression 4.8] Driver loading fails without firmware

2016-10-28 Thread Baoquan He
On 10/27/16 at 03:21pm, Paul Menzel wrote:
> Dear Baoquan,
> > > Baoquan, could you please fix this regression. My suggestion is, that you
> > > add the old code back, but check if the firmware has been loaded. If it
> > > hasn’t, load it again.
> > > 
> > > That way, people can update their Linux kernel, and it continues working
> > > without changing the initramfs, or anything else.
> > 
> > I saw your mail but I am also not familiar with bnx2 driver. As the
> > commit log says I just tried to make bnx2 driver reset itself earlier.
> > 
> > So you did a git bisect and found this commit caused the regression,
> > right? If yes, and network developers have no action, I will look into
> > the code and see if I have idea to fix it.
> 
> Well, I looked through the commits and found that one, which would explain
> the changed behavior.
> 
> To be sure, and to follow your request, I took Linux 4.8.4 and reverted your
> commit (attached). Then I deleted the firmware again from the initramfs, and
> rebooted. The devices showed up just fine as before.
> 
> So to summarize, the commit is indeed the culprit.

Hi Paul,

Sorry for this.

Could you tell me the steps to reproduce? I will find a machine with a bnx2
NIC and check if there are other ways.

Thanks
Baoquan

> From 61b8dac8796343a797858b4a2eb0a59a0cfcd735 Mon Sep 17 00:00:00 2001
> From: Paul Menzel 
> Date: Thu, 27 Oct 2016 11:34:52 +0200
> Subject: [PATCH] Revert "bnx2: Reset device during driver initialization"
> 
> This reverts commit 3e1be7ad2d38c6bd6aeef96df9bd0a7822f4e51c.
> ---
>  drivers/net/ethernet/broadcom/bnx2.c | 12 +---
>  1 file changed, 5 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c
> index 27f11a5..ecd357d 100644
> --- a/drivers/net/ethernet/broadcom/bnx2.c
> +++ b/drivers/net/ethernet/broadcom/bnx2.c
> @@ -6356,6 +6356,10 @@ bnx2_open(struct net_device *dev)
>   struct bnx2 *bp = netdev_priv(dev);
>   int rc;
>  
> + rc = bnx2_request_firmware(bp);
> + if (rc < 0)
> + goto out;
> +
>   netif_carrier_off(dev);
>  
>   bnx2_disable_int(bp);
> @@ -6424,6 +6428,7 @@ bnx2_open(struct net_device *dev)
>   bnx2_free_irq(bp);
>   bnx2_free_mem(bp);
>   bnx2_del_napi(bp);
> + bnx2_release_firmware(bp);
>   goto out;
>  }
>  
> @@ -8570,12 +8575,6 @@ bnx2_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>  
>   pci_set_drvdata(pdev, dev);
>  
> - rc = bnx2_request_firmware(bp);
> - if (rc < 0)
> - goto error;
> -
> -
> - bnx2_reset_chip(bp, BNX2_DRV_MSG_CODE_RESET);
>   memcpy(dev->dev_addr, bp->mac_addr, ETH_ALEN);
>  
>   dev->hw_features = NETIF_F_IP_CSUM | NETIF_F_SG |
> @@ -8608,7 +8607,6 @@ bnx2_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>   return 0;
>  
>  error:
> - bnx2_release_firmware(bp);
>   pci_iounmap(pdev, bp->regview);
>   pci_release_regions(pdev);
>   pci_disable_device(pdev);
> -- 
> 2.4.1
> 



Re: [PATCH v2 0/5] make POSIX timers optional with some Kconfig help

2016-10-28 Thread Nicolas Pitre
On Sat, 29 Oct 2016, Paul Bolle wrote:

> On Tue, 2016-10-25 at 22:28 -0400, Nicolas Pitre wrote:
> > When POSIX timers are configured out, the PTP clock subsystem should be
> > left out as well. However a bunch of ethernet drivers currently *select*
> > the latter in their Kconfig entries. Therefore some more work was needed
> > to break that hard dependency from those drivers without preventing their
> > usage altogether.
> 
> By the way: would you have pointers to threads that discussed attempts
> to achieve this using currently available Kconfig options?

You could probably go backward from here:
https://lkml.org/lkml/2016/9/20/606


Nicolas


Re: [PATCH 17/17] batman-adv: Avoid precedence issues in macros

2016-10-28 Thread Joe Perches
On Fri, 2016-10-28 at 23:27 +0200, Sven Eckelmann wrote:
> On Freitag, 28. Oktober 2016 14:13:06 CEST Joe Perches wrote:
> > On Thu, 2016-10-27 at 21:01 +0200, Simon Wunderlich wrote:
> > > From: Sven Eckelmann 
> > > 
> > > It must be avoided that arguments to a macro are evaluated ungrouped (which
> > > enforces normal operator precedence). Otherwise the result of the macro
> > > is not well defined.
> > 
> > Curiosity:
> > 
> > in net/batman-adv/tp_meter.c
> 
> [...]
> > orig_node = batadv_orig_hash_find(bat_priv, tp_vars->other_end);
> > if (unlikely(!orig_node)) {
> > err = BATADV_TP_REASON_DST_UNREACHABLE;
> > tp_vars->reason = err;
> > goto out;
> > }
> > 
> > primary_if = batadv_primary_if_get_selected(bat_priv);
> > if (unlikely(!primary_if)) {
> > err = BATADV_TP_REASON_DST_UNREACHABLE;
> > goto out;
> > }
> > 
> > err is not used in the out block
> > 
> > Is the last if block supposed to set tp_vars->reason to err?
> 
> This seems to be unrelated to this patch.

Kinda.  I was looking at how the one instance of
batadv_tp_is_error was used and it's in this file.

> But yes, looks to me like it is missing. Do you want to propose a patch or
> should I do?

As you are working on this file, perhaps you should.

cheers, Joe
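
To illustrate the precedence issue the quoted commit message is about, a contrived
example (not taken from the patch):

/* without parentheses the expansion is at the mercy of operator precedence */
#define BAD_QUARTER(x)	x / 4		/* BAD_QUARTER(a + b)  expands to a + b / 4     */
#define GOOD_QUARTER(x)	((x) / 4)	/* GOOD_QUARTER(a + b) expands to ((a + b) / 4) */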



Re: [PATCH v7 0/6] Add eBPF hooks for cgroups

2016-10-28 Thread Alexei Starovoitov
On Fri, Oct 28, 2016 at 01:28:39PM +0200, Pablo Neira Ayuso wrote:
> Hi Alexei,
> 
> On Wed, Oct 26, 2016 at 08:35:04PM -0700, Alexei Starovoitov wrote:
> > On Wed, Oct 26, 2016 at 09:59:33PM +0200, Pablo Neira Ayuso wrote:
> > > On Tue, Oct 25, 2016 at 12:14:08PM +0200, Daniel Mack wrote:
> > > [...]
> > > >   Dumping programs once they are installed is problematic because of
> > > >   the internal optimizations done to the eBPF program during its
> > > >   lifetime. Also, the references to maps etc. would need to be
> > > >   restored during the dump.
> > > > 
> > > >   Just exposing whether or not a program is attached would be
> > > >   trivial to do, however, most easily through another bpf(2)
> > > >   command. That can be added later on though.
> > > 
> > > I don't know if anyone told you, but during last netconf, this topic
> > > took a bit of time of discussion and it was controversial, I would say
> > > 1/3 of netdev hackers there showed their concerns, and that's
> > > something that should not be skipped IMO.
> > 
> > Though I attended netconf over hangouts, I think it was pretty
> > clear that bpf needs 'introspection' of loaded bpf programs and it
> > was a universal desire of everyone. Not 1/3 of hackers.
> 
> Introspection is a different thing, very useful, no doubt. But this
> infrastructure is allowing way more than simple innocuous introspection.

During netconf the discussion around bpf introspection was about
making bpf more visible to user space.
This patch does cgroup+bpf infrastructure which is orthogonal.

> > As commit log says it's an orthogonal work and over the last
> > month we've been discussing pros and cons of different approaches.
> > The audit infra, tracepoints and other ideas.
> > We kept the discussion in private because, unfortunately, public
> > discussions are not fruitful due to threads like this one.
> 
> We need to understand what people are trying to solve and it what way.
> That's why we have all those conferences and places to meet and
> discuss too. Please, don't think like that, this is sending the wrong
> message to everyone here, that is kind of: bypass public discussions
> and don't take time to describe what you're doing since it is a waste
> of time. That's not good.

In general I agree that it's best to discuss technical topics
on the mailing list and resolve personal conflicts in person
during the conferences.
Unfortunately the obstructionism seen in this thread forces people
to discuss in private to have a constructive conversation.
Anyway, neither your reply nor mine has anything to do with this patch set,
and both are a typical example of why I prefer to discuss bpf related
ideas in private.

> > The further points below were disputed many times in the past.
> > Let's address them one more time:
> > 
> > > path. But this is adding hooks to push bpf programs in the middle of
> > > our generic stack, this is way different domain.
> > 
> > incorrect. look at socket filters, cls_bpf.
> 
> Classic socket filters don't allow you to deploy a global policy in
> such a fine grain way as this is doing. Then, cls_bpf is fine since it
> is visible via tc command, so sysadmins can use tools they are
> familiar with to inspect policies and say "oh look, some of the
> processes I'm deploying have installed filters via cls_bpf". However,
> this approach is visible in no way.

the audit and/or tracepoint infra that is being worked on in parallel
will answer this question in full detail.
It's no different from application using seccomp to control its children
or cls_bpf doing policy enforcement of containers or xdp doing low
level filtering.
xdp has boolean "is program loaded" flag and it turned out
to be not very useful for debuggability.
It should be solved in a universal way for all bpf programs.
We're not going to add something specific to this cgroup+bpf infra,
since it will be just as clumsy as xdp's bool flag.
Once again, audit/tracepoint for bpf prog introspection is a separate
work orthogonal to this set.

> [...]
> > > around this socket code in functions so we can invoke it. I guess
> > > filtering of UDP and TCP should be good for you at this stage.
> > 
> > DanielM mentioned a few times that it's not only about UDP and TCP.
> 
> OK, since this is limited to the scope of inet sockets, let's revisit
> what we have around: DCCP is hopeless, who cares. We also have SCTP,
> that is deployed by telcos in datacenters, it cannot reach that domain
> because many Internet gateways are broken for it, so you may not get
> too far with it. Arguably it would be good to have SCTP support at
> some point, but I guess this is not a priority now. Then, UDPlite
> almost comes for free since it relies on the existing UDP
> infrastructure, it's basically UDP with a bit more features. What
> else?

thank you for nice description of current state of sctp,
but it has nothing to do with this patch set. The goal is to
monitor application networking activity at L3 level.

> > > This would requir

net/dccp: warning in dccp_feat_clone_sp_val/__might_sleep

2016-10-28 Thread Andrey Konovalov
Hi,

I've got the following error report while running the syzkaller fuzzer:

[ cut here ]
WARNING: CPU: 0 PID: 4608 at kernel/sched/core.c:7724
__might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
do not call blocking ops when !TASK_RUNNING; state=1 set at
[] prepare_to_wait+0xbc/0x210
kernel/sched/wait.c:178
Modules linked in:
CPU: 0 PID: 4608 Comm: syz-executor Not tainted 4.9.0-rc2+ #320
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 88006625f7a0 81b46914 88006625f818 
 84052960  88006625f7e8 8237
 88006aceac00 1e2c ed000cc4beff 84052960
Call Trace:
 [< inline >] __dump_stack lib/dump_stack.c:15
 [] dump_stack+0xb3/0x10f lib/dump_stack.c:51
 [] __warn+0x1a7/0x1f0 kernel/panic.c:550
 [] warn_slowpath_fmt+0xac/0xd0 kernel/panic.c:565
 [] __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
 [< inline >] slab_pre_alloc_hook mm/slab.h:393
 [< inline >] slab_alloc_node mm/slub.c:2634
 [< inline >] slab_alloc mm/slub.c:2716
 [] __kmalloc_track_caller+0x150/0x2a0 mm/slub.c:4240
 [] kmemdup+0x24/0x50 mm/util.c:113
 [] dccp_feat_clone_sp_val.part.5+0x4f/0xe0
net/dccp/feat.c:374
 [< inline >] dccp_feat_clone_sp_val net/dccp/feat.c:1141
 [< inline >] dccp_feat_change_recv net/dccp/feat.c:1141
 [] dccp_feat_parse_options+0xaa1/0x13d0 net/dccp/feat.c:1411
 [] dccp_parse_options+0x721/0x1010 net/dccp/options.c:128
 [] dccp_rcv_state_process+0x200/0x15b0 net/dccp/input.c:644
 [] dccp_v4_do_rcv+0xf4/0x1a0 net/dccp/ipv4.c:681
 [< inline >] sk_backlog_rcv ./include/net/sock.h:872
 [] __release_sock+0x126/0x3a0 net/core/sock.c:2044
 [] release_sock+0x59/0x1c0 net/core/sock.c:2502
 [< inline >] inet_wait_for_connect net/ipv4/af_inet.c:547
 [] __inet_stream_connect+0x5d2/0xbb0 net/ipv4/af_inet.c:617
 [] inet_stream_connect+0x55/0xa0 net/ipv4/af_inet.c:656
 [] SYSC_connect+0x244/0x2f0 net/socket.c:1533
 [] SyS_connect+0x24/0x30 net/socket.c:1514
 [] entry_SYSCALL_64_fastpath+0x1f/0xc2
arch/x86/entry/entry_64.S:209
---[ end trace 0dc4109d69f4e51e ]---

On commit 14970f204b1993af7459d5bd34aaff38dfee6670 (Oct 27).

A reproducer is attached.


dccp-feat-warn-poc.c
Description: Binary data


[PATCH net-next] bpf, inode: add support for symlinks and fix mtime/ctime

2016-10-28 Thread Daniel Borkmann
While commit bb35a6ef7da4 ("bpf, inode: allow for rename and link ops")
added support for hard links that can be used for prog and map nodes,
this work adds simple symlink support, which can be used f.e. for
directories also when unprivileged and works with cmdline tooling that
understands S_IFLNK anyway. Since the switch in e27f4a942a0e ("bpf: Use
mount_nodev not mount_ns to mount the bpf filesystem"), there can be
various mount instances with mount_nodev() and thus hierarchy can be
flattened to facilitate object sharing. Thus, we can keep bpf tooling
also working by repointing paths.

Most of the functionality can be used from vfs library operations. The
symlink is stored in the inode itself, that is in i_link, which is
sufficient in our case as opposed to storing it in the page cache.
While at it, I noticed that bpf_mkdir() and bpf_mkobj() don't update
the directory's mtime and ctime, so add a common helper for it called
bpf_dentry_finalize() that takes care of it for all cases now.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 kernel/bpf/inode.c | 45 +++--
 1 file changed, 39 insertions(+), 6 deletions(-)
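
As a usage illustration of the "repointing paths" idea above (paths are made up),
a stable symlink inside bpffs can be re-pointed to a new generation of pinned
objects without breaking tooling that only knows the stable path:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* first generation of pinned progs/maps */
	if (symlink("/sys/fs/bpf/progs-v1", "/sys/fs/bpf/current"))
		perror("symlink");

	/* later: repoint the stable path to a new generation */
	unlink("/sys/fs/bpf/current");
	if (symlink("/sys/fs/bpf/progs-v2", "/sys/fs/bpf/current"))
		perror("symlink");

	return 0;
}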

diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 1ed8473..2565809 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -87,6 +87,7 @@ static struct inode *bpf_get_inode(struct super_block *sb,
switch (mode & S_IFMT) {
case S_IFDIR:
case S_IFREG:
+   case S_IFLNK:
break;
default:
return ERR_PTR(-EINVAL);
@@ -119,6 +120,16 @@ static int bpf_inode_type(const struct inode *inode, enum bpf_type *type)
return 0;
 }
 
+static void bpf_dentry_finalize(struct dentry *dentry, struct inode *inode,
+   struct inode *dir)
+{
+   d_instantiate(dentry, inode);
+   dget(dentry);
+
+   dir->i_mtime = current_time(dir);
+   dir->i_ctime = dir->i_mtime;
+}
+
 static int bpf_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 {
struct inode *inode;
@@ -133,9 +144,7 @@ static int bpf_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
inc_nlink(inode);
inc_nlink(dir);
 
-   d_instantiate(dentry, inode);
-   dget(dentry);
-
+   bpf_dentry_finalize(dentry, inode, dir);
return 0;
 }
 
@@ -151,9 +160,7 @@ static int bpf_mkobj_ops(struct inode *dir, struct dentry *dentry,
inode->i_op = iops;
inode->i_private = dentry->d_fsdata;
 
-   d_instantiate(dentry, inode);
-   dget(dentry);
-
+   bpf_dentry_finalize(dentry, inode, dir);
return 0;
 }
 
@@ -181,13 +188,37 @@ static int bpf_mkobj(struct inode *dir, struct dentry *dentry, umode_t mode,
 {
if (strchr(dentry->d_name.name, '.'))
return ERR_PTR(-EPERM);
+
return simple_lookup(dir, dentry, flags);
 }
 
+static int bpf_symlink(struct inode *dir, struct dentry *dentry,
+  const char *target)
+{
+   char *link = kstrdup(target, GFP_USER | __GFP_NOWARN);
+   struct inode *inode;
+
+   if (!link)
+   return -ENOMEM;
+
+   inode = bpf_get_inode(dir->i_sb, dir, S_IRWXUGO | S_IFLNK);
+   if (IS_ERR(inode)) {
+   kfree(link);
+   return PTR_ERR(inode);
+   }
+
+   inode->i_op = &simple_symlink_inode_operations;
+   inode->i_link = link;
+
+   bpf_dentry_finalize(dentry, inode, dir);
+   return 0;
+}
+
 static const struct inode_operations bpf_dir_iops = {
.lookup = bpf_lookup,
.mknod  = bpf_mkobj,
.mkdir  = bpf_mkdir,
+   .symlink= bpf_symlink,
.rmdir  = simple_rmdir,
.rename = simple_rename,
.link   = simple_link,
@@ -324,6 +355,8 @@ static void bpf_evict_inode(struct inode *inode)
truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);
 
+   if (S_ISLNK(inode->i_mode))
+   kfree(inode->i_link);
if (!bpf_inode_type(inode, &type))
bpf_any_put(inode->i_private, type);
 }
-- 
1.9.3



Re: [PATCH] [v2 netfilter-next] netfilter: nf_tables: fib warnings

2016-10-28 Thread Florian Westphal
Arnd Bergmann  wrote:
> The newly added nft fib code produces two warnings:
> 
> net/ipv4/netfilter/nft_fib_ipv4.c: In function 'nft_fib4_eval':
> net/ipv4/netfilter/nft_fib_ipv4.c:80:6: error: unused variable 'i' [-Werror=unused-variable]
> net/ipv4/netfilter/nft_fib_ipv4.c: In function ‘nft_fib4_eval’:
> net/ipv4/netfilter/nft_fib_ipv4.c:137:6: error: ‘oif’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
> 
> The first one is obvious as the only user of that variable is
> inside of an #ifdef
> 
> The second one is a bit trickier. It's clear that oif is in fact
> uninitialized when it gets used when neither NFTA_FIB_F_IIF nor
> NFTA_FIB_F_OIF are set, and just setting it to NULL won't work
> as it may later get dereferenced.
> 
> However, there is no need to search the result list if it is
> NULL, as Florian pointed out. This integrates his (untested)
> change to do so. I have confirmed that the combined patch
> solves both warnings, but as I don't fully understand Florian's
> change, I can't tell if it's correct.
> 
> Suggested-by: Florian Westphal 
> Fixes: 84f5eedb983e ("netfilter: nf_tables: add fib expression")
> Signed-off-by: Arnd Bergmann 

chain pre {
type filter hook prerouting priority 0; policy accept;
fib saddr oif "eth0"
}

eth0: default route, 192.168.7.10/16
eth1: 10.0.0.2/8

ping from 192.168.7.1 from peer on eth0: result eth0, ok
ping from 10.0.0.2 from peer on eth0: no result, ok
ping from 10.0.0.3 from peer on eth0: result eth1, ok

chain pre {
type filter hook prerouting priority 0; policy accept;
fib saddr . iif oif "eth0"
}

ping from 192.168.7.1 from peer on eth0: result eth0, ok
ping from 10.0.0.2 from peer on eth0: no result, ok
ping from 10.0.0.3 from peer on eth0: no result, ok

so:

Tested-by: Florian Westphal 


Re: [PATCH net-next] genetlink: Fix generic netlink family unregister

2016-10-28 Thread Johannes Berg
On Fri, 2016-10-28 at 16:01 -0700, Pravin B Shelar wrote:
> This patch fixes a typo in unregister operation.
> 
> Following crash is fixed by this patch. It can be easily reproduced
> by repeating modprobe and rmmod module that uses genetlink.

Yikes, I ran all my tests in VMs and forgot this wasn't even executed.
Sorry about that!

Reviewed-by: Johannes Berg 

johannes


[PATCH net-next] genetlink: Fix generic netlink family unregister

2016-10-28 Thread Pravin B Shelar
This patch fixes a typo in unregister operation.

Following crash is fixed by this patch. It can be easily reproduced
by repeating modprobe and rmmod module that uses genetlink.

[  261.446686] BUG: unable to handle kernel paging request at a0264088
[  261.448921] IP: [] strcmp+0xe/0x30
[  261.450494] PGD 1c09067
[  261.451266] PUD 1c0a063
[  261.452091] PMD 8068d5067
[  261.452525] PTE 0
[  261.453164]
[  261.453618] Oops:  [#1] SMP
[  261.454577] Modules linked in: openvswitch(+) ...
[  261.480753] RIP: 0010:[]  [] 
strcmp+0xe/0x30
[  261.483069] RSP: 0018:c90003c0bc28  EFLAGS: 00010282
[  261.510145] Call Trace:
[  261.510896]  [] genl_family_find_byname+0x5a/0x70
[  261.512819]  [] genl_register_family+0xb9/0x630
[  261.514805]  [] dp_init+0xbc/0x120 [openvswitch]
[  261.518268]  [] do_one_initcall+0x3d/0x160
[  261.525041]  [] do_init_module+0x60/0x1f1
[  261.526754]  [] load_module+0x22af/0x2860
[  261.530144]  [] SYSC_finit_module+0x96/0xd0
[  261.531901]  [] SyS_finit_module+0xe/0x10
[  261.533605]  [] do_syscall_64+0x6e/0x180
[  261.535284]  [] entry_SYSCALL64_slow_path+0x25/0x25
[  261.546512] RIP  [] strcmp+0xe/0x30
[  261.550198] ---[ end trace 76505a814dd68770 ]---

Fixes: 2ae0f17df1c ("genetlink: use idr to track families").

Reported-by: Jarno Rajahalme 
CC: Johannes Berg 
Signed-off-by: Pravin B Shelar 
---
 net/netlink/genetlink.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index df0cbcd..caf04d7 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -399,7 +399,7 @@ int genl_unregister_family(const struct genl_family *family)
 {
genl_lock_all();
 
-   if (genl_family_find_byid(family->id)) {
+   if (!genl_family_find_byid(family->id)) {
genl_unlock_all();
return -ENOENT;
}
-- 
1.9.1



Re: [PATCH v2 0/5] make POSIX timers optional with some Kconfig help

2016-10-28 Thread Paul Bolle
On Tue, 2016-10-25 at 22:28 -0400, Nicolas Pitre wrote:
> When POSIX timers are configured out, the PTP clock subsystem should be
> left out as well. However a bunch of ethernet drivers currently *select*
> the latter in their Kconfig entries. Therefore some more work was needed
> to break that hard dependency from those drivers without preventing their
> usage altogether.

By the way: would you have pointers to threads that discussed attempts
to achieve this using currently available Kconfig options?

Thanks,


Paul Bolle


Re: [PATCH 2/2] at803x: double check SGMII side autoneg

2016-10-28 Thread Timur Tabi

Zefir Kurtisi wrote:

+   /* check if the SGMII link is OK. */
+   if (!(phy_read(phydev, AT803X_PSSR) & AT803X_PSSR_MR_AN_COMPLETE)) {
+   pr_warn("803x_aneg_done: SGMII link is not ok\n");
+   aneg_done = 0;


I see this message appear sometimes when bringing up the interface via 
ifup.  However, contrary to your patch description, everything seems to 
work:


$ iperf3 -c 192.168.1.1 -t 3600
Connecting to host 192.168.1.1, port 5201
[  4] local 192.168.1.2 port 52640 connected to 192.168.1.1 port 5201
[ ID] Interval       Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00 sec   109 MBytes   909 Mbits/sec    0    485 KBytes
[  4]   1.00-2.00 sec   108 MBytes   902 Mbits/sec    0    622 KBytes

I wonder if you're impacted with all of the pause frame problems I'm having.

--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.


Re: [PATCH v2 1/5] kconfig: introduce the "imply" keyword

2016-10-28 Thread Paul Bolle
On Thu, 2016-10-27 at 23:10 -0400, Nicolas Pitre wrote:
> On Fri, 28 Oct 2016, Paul Bolle wrote:
> > And in your example BAR is bool, right? Does the above get more
> > complicated if BAR would be tristate?
> 
> If BAR=m then implying BAZ from FOO=y will force BAZ to y or n, 
> bypassing the restriction provided by BAR like "select" does.  This is 
> somewhat questionable for "select" to do that, and the code emits a 
> warning when "select" bypasses a direct dependency set to n, but not 
> when set to m. For now "imply" simply tries to be consistent with 
> the "select" behavior.

Side note: yes, one can select a symbol that's missing one or more
dependencies. But since Kconfig has two separate methods to describe
relations (ie, selecting and depending) there's logically the
possibility of conflict. So we need a rule to resolve that conflict.
That rule is: "select" beats "depends on". I don't think that this rule
is less plausible than the opposite rule.


Paul Bolle


Re: [PATCH v2 1/5] kconfig: introduce the "imply" keyword

2016-10-28 Thread Nicolas Pitre
On Fri, 28 Oct 2016, Paul Bolle wrote:

> On Thu, 2016-10-27 at 23:10 -0400, Nicolas Pitre wrote:
> > On Fri, 28 Oct 2016, Paul Bolle wrote:
> > > What happens when a tristate symbol is implied by a symbol set to 'y'
> > > and by a symbol set to 'm'?
> > 
> > That's respectively the third and second rows in the table above.
> 
> I meant: two separate symbols implying the same symbol at the same
> time. One of those symbols set to 'y' and the other set to 'm'.

Then it's the greatest of the set i.e. y.


Nicolas


[net-next PATCH v2 4/4] net: Add support for XPS with QoS via traffic classes

2016-10-28 Thread Alexander Duyck
This patch adds support for setting and using XPS when QoS via traffic
classes is enabled.  With this change we will factor in the priority and
traffic class mapping of the packet and use that information to correctly
select the queue.

This allows us to define a set of queues for a given traffic class via
mqprio and then configure the XPS mapping for those queues so that the
traffic flows can avoid head-of-line blocking between the individual CPUs
if so desired.

Signed-off-by: Alexander Duyck 
---

v2: Mostly for loop updates based on changes in patch 3
Replaced "i = 0; i < X; i++" with "i = X; i--;" as i was just a counter

 include/linux/netdevice.h |4 +-
 net/core/dev.c|  117 -
 net/core/net-sysfs.c  |   31 
 3 files changed, 105 insertions(+), 47 deletions(-)
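
Conceptually, the per-CPU lookup described above becomes two-dimensional; a sketch
of the indexing only (the real code is in the hunks below):

/* with num_tc traffic classes each CPU owns a contiguous block of num_tc
 * slots in cpu_map[], so lookups for one traffic class can never return a
 * queue that belongs to another
 */
static inline int xps_map_index(int cpu, int tc, int num_tc)
{
	return cpu * num_tc + tc;
}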

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b60a156..56f90f7 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -732,8 +732,8 @@ struct xps_dev_maps {
struct rcu_head rcu;
struct xps_map __rcu *cpu_map[0];
 };
-#define XPS_DEV_MAPS_SIZE (sizeof(struct xps_dev_maps) +   \
-(nr_cpu_ids * sizeof(struct xps_map *)))
+#define XPS_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \
+   (nr_cpu_ids * (_tcs) * sizeof(struct xps_map *)))
 #endif /* CONFIG_XPS */
 
 #define TC_MAX_QUEUE   16
diff --git a/net/core/dev.c b/net/core/dev.c
index fea7061..0ebb035 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2002,14 +2002,22 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
 struct xps_dev_maps *dev_maps,
 int cpu, u16 offset, u16 count)
 {
-   int i, j;
+   int num_tc = dev->num_tc ? : 1;
+   bool active = false;
+   int tci;
 
-   for (i = count, j = offset; i--; j++) {
-   if (!remove_xps_queue(dev_maps, cpu, j))
-   break;
+   for (tci = cpu * num_tc; num_tc--; tci++) {
+   int i, j;
+
+   for (i = count, j = offset; i--; j++) {
+   if (!remove_xps_queue(dev_maps, cpu, j))
+   break;
+   }
+
+   active |= i < 0;
}
 
-   return i < 0;
+   return active;
 }
 
 static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
@@ -2086,20 +2094,28 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
u16 index)
 {
struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
+   int i, cpu, tci, numa_node_id = -2;
+   int maps_sz, num_tc = 1, tc = 0;
struct xps_map *map, *new_map;
-   int maps_sz = max_t(unsigned int, XPS_DEV_MAPS_SIZE, L1_CACHE_BYTES);
-   int cpu, numa_node_id = -2;
bool active = false;
 
+   if (dev->num_tc) {
+   num_tc = dev->num_tc;
+   tc = netdev_txq_to_tc(dev, index);
+   if (tc < 0)
+   return -EINVAL;
+   }
+
+   maps_sz = XPS_DEV_MAPS_SIZE(num_tc);
+   if (maps_sz < L1_CACHE_BYTES)
+   maps_sz = L1_CACHE_BYTES;
+
mutex_lock(&xps_map_mutex);
 
dev_maps = xmap_dereference(dev->xps_maps);
 
/* allocate memory for queue storage */
-   for_each_online_cpu(cpu) {
-   if (!cpumask_test_cpu(cpu, mask))
-   continue;
-
+   for_each_cpu_and(cpu, cpu_online_mask, mask) {
if (!new_dev_maps)
new_dev_maps = kzalloc(maps_sz, GFP_KERNEL);
if (!new_dev_maps) {
@@ -2107,25 +2123,38 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
return -ENOMEM;
}
 
-   map = dev_maps ? xmap_dereference(dev_maps->cpu_map[cpu]) :
+   tci = cpu * num_tc + tc;
+   map = dev_maps ? xmap_dereference(dev_maps->cpu_map[tci]) :
 NULL;
 
map = expand_xps_map(map, cpu, index);
if (!map)
goto error;
 
-   RCU_INIT_POINTER(new_dev_maps->cpu_map[cpu], map);
+   RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
}
 
if (!new_dev_maps)
goto out_no_new_maps;
 
for_each_possible_cpu(cpu) {
+   /* copy maps belonging to foreign traffic classes */
+   for (i = tc, tci = cpu * num_tc; dev_maps && i--; tci++) {
+   /* fill in the new device map from the old device map */
+   map = xmap_dereference(dev_maps->cpu_map[tci]);
+   RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+   }
+
+   /* We need to explicitly update tci as prevous loop
+* could break out early if dev_maps is NULL.
+*/
+   

[net-next PATCH v2 3/4] net: Refactor removal of queues from XPS map and apply on num_tc changes

2016-10-28 Thread Alexander Duyck
This patch updates the code for removing queues from the XPS map and makes
it so that we can apply the code any time we change either the number of
traffic classes or the mapping of a given block of queues.  This way we
avoid having queues pulling traffic from a foreign traffic class.

Signed-off-by: Alexander Duyck 
---

v2: Replaced do/while with for loop
Changed order for remove_xps_queue_cpu loop
Updated i to reflect count, j to be temporary offset
Updated test to check for i < 0 instead of i < offset

 net/core/dev.c |   73 ++--
 1 file changed, 50 insertions(+), 23 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 08ed625..fea7061 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1970,32 +1970,50 @@ int netdev_txq_to_tc(struct net_device *dev, unsigned int txq)
 #define xmap_dereference(P)\
rcu_dereference_protected((P), lockdep_is_held(&xps_map_mutex))
 
-static struct xps_map *remove_xps_queue(struct xps_dev_maps *dev_maps,
-   int cpu, u16 index)
+static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
+int tci, u16 index)
 {
struct xps_map *map = NULL;
int pos;
 
if (dev_maps)
-   map = xmap_dereference(dev_maps->cpu_map[cpu]);
+   map = xmap_dereference(dev_maps->cpu_map[tci]);
+   if (!map)
+   return false;
 
-   for (pos = 0; map && pos < map->len; pos++) {
-   if (map->queues[pos] == index) {
-   if (map->len > 1) {
-   map->queues[pos] = map->queues[--map->len];
-   } else {
-   RCU_INIT_POINTER(dev_maps->cpu_map[cpu], NULL);
-   kfree_rcu(map, rcu);
-   map = NULL;
-   }
+   for (pos = map->len; pos--;) {
+   if (map->queues[pos] != index)
+   continue;
+
+   if (map->len > 1) {
+   map->queues[pos] = map->queues[--map->len];
break;
}
+
+   RCU_INIT_POINTER(dev_maps->cpu_map[tci], NULL);
+   kfree_rcu(map, rcu);
+   return false;
}
 
-   return map;
+   return true;
 }
 
-static void netif_reset_xps_queues_gt(struct net_device *dev, u16 index)
+static bool remove_xps_queue_cpu(struct net_device *dev,
+struct xps_dev_maps *dev_maps,
+int cpu, u16 offset, u16 count)
+{
+   int i, j;
+
+   for (i = count, j = offset; i--; j++) {
+   if (!remove_xps_queue(dev_maps, cpu, j))
+   break;
+   }
+
+   return i < 0;
+}
+
+static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
+  u16 count)
 {
struct xps_dev_maps *dev_maps;
int cpu, i;
@@ -2007,21 +2025,16 @@ static void netif_reset_xps_queues_gt(struct net_device *dev, u16 index)
if (!dev_maps)
goto out_no_maps;
 
-   for_each_possible_cpu(cpu) {
-   for (i = index; i < dev->num_tx_queues; i++) {
-   if (!remove_xps_queue(dev_maps, cpu, i))
-   break;
-   }
-   if (i == dev->num_tx_queues)
-   active = true;
-   }
+   for_each_possible_cpu(cpu)
+   active |= remove_xps_queue_cpu(dev, dev_maps, cpu,
+  offset, count);
 
if (!active) {
RCU_INIT_POINTER(dev->xps_maps, NULL);
kfree_rcu(dev_maps, rcu);
}
 
-   for (i = index; i < dev->num_tx_queues; i++)
+   for (i = offset + (count - 1); count--; i--)
netdev_queue_numa_node_write(netdev_get_tx_queue(dev, i),
 NUMA_NO_NODE);
 
@@ -2029,6 +2042,11 @@ static void netif_reset_xps_queues_gt(struct net_device *dev, u16 index)
mutex_unlock(&xps_map_mutex);
 }
 
+static void netif_reset_xps_queues_gt(struct net_device *dev, u16 index)
+{
+   netif_reset_xps_queues(dev, index, dev->num_tx_queues - index);
+}
+
 static struct xps_map *expand_xps_map(struct xps_map *map,
  int cpu, u16 index)
 {
@@ -2192,6 +2210,9 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 #endif
 void netdev_reset_tc(struct net_device *dev)
 {
+#ifdef CONFIG_XPS
+   netif_reset_xps_queues_gt(dev, 0);
+#endif
dev->num_tc = 0;
memset(dev->tc_to_txq, 0, sizeof(dev->tc_to_txq));
memset(dev->prio_tc_map, 0, sizeof(dev->prio_tc_map));
@@ -2203,6 +2224,9 @@ int netdev_set_tc_queue(struct net_device *dev, u8 tc, u16 count, u16 offset)
if (tc >= dev->num_tc)
return -

[net-next PATCH v2 2/4] net: Add sysfs value to determine queue traffic class

2016-10-28 Thread Alexander Duyck
Add a sysfs attribute for a Tx queue that allows us to determine the
traffic class for a given queue.  This will allow us to more easily
determine this in the future.  It is needed as XPS will take the traffic
class for a group of queues into account in order to avoid pulling traffic
from one traffic class into another.

Signed-off-by: Alexander Duyck 
---

v2: Added this patch to the series.

 include/linux/netdevice.h |1 +
 net/core/dev.c|   17 +
 net/core/net-sysfs.c  |   20 +++-
 3 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d045432..b60a156 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1920,6 +1920,7 @@ int netdev_set_prio_tc_map(struct net_device *dev, u8 prio, u8 tc)
return 0;
 }
 
+int netdev_txq_to_tc(struct net_device *dev, unsigned int txq);
 void netdev_reset_tc(struct net_device *dev);
 int netdev_set_tc_queue(struct net_device *dev, u8 tc, u16 count, u16 offset);
 int netdev_set_num_tc(struct net_device *dev, u8 num_tc);
diff --git a/net/core/dev.c b/net/core/dev.c
index d4d45bf..08ed625 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1948,6 +1948,23 @@ static void netif_setup_tc(struct net_device *dev, unsigned int txq)
}
 }
 
+int netdev_txq_to_tc(struct net_device *dev, unsigned int txq)
+{
+   if (dev->num_tc) {
+   struct netdev_tc_txq *tc = &dev->tc_to_txq[0];
+   int i;
+
+   for (i = 0; i < TC_MAX_QUEUE; i++, tc++) {
+   if ((txq - tc->offset) < tc->count)
+   return i;
+   }
+
+   return -1;
+   }
+
+   return 0;
+}
+
 #ifdef CONFIG_XPS
 static DEFINE_MUTEX(xps_map_mutex);
 #define xmap_dereference(P)\
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 6e4f347..c4e4e7d 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1021,7 +1021,6 @@ static ssize_t show_trans_timeout(struct netdev_queue *queue,
return sprintf(buf, "%lu", trans_timeout);
 }
 
-#ifdef CONFIG_XPS
 static unsigned int get_netdev_queue_index(struct netdev_queue *queue)
 {
struct net_device *dev = queue->dev;
@@ -1033,6 +1032,21 @@ static unsigned int get_netdev_queue_index(struct netdev_queue *queue)
return i;
 }
 
+static ssize_t show_traffic_class(struct netdev_queue *queue,
+ struct netdev_queue_attribute *attribute,
+ char *buf)
+{
+   struct net_device *dev = queue->dev;
+   int index = get_netdev_queue_index(queue);
+   int tc = netdev_txq_to_tc(dev, index);
+
+   if (tc < 0)
+   return -EINVAL;
+
+   return sprintf(buf, "%u\n", tc);
+}
+
+#ifdef CONFIG_XPS
 static ssize_t show_tx_maxrate(struct netdev_queue *queue,
   struct netdev_queue_attribute *attribute,
   char *buf)
@@ -1075,6 +1089,9 @@ static ssize_t set_tx_maxrate(struct netdev_queue *queue,
 static struct netdev_queue_attribute queue_trans_timeout =
__ATTR(tx_timeout, S_IRUGO, show_trans_timeout, NULL);
 
+static struct netdev_queue_attribute queue_traffic_class =
+   __ATTR(traffic_class, S_IRUGO, show_traffic_class, NULL);
+
 #ifdef CONFIG_BQL
 /*
  * Byte queue limits sysfs structures and functions.
@@ -1260,6 +1277,7 @@ static ssize_t store_xps_map(struct netdev_queue *queue,
 
 static struct attribute *netdev_queue_default_attrs[] = {
&queue_trans_timeout.attr,
+   &queue_traffic_class.attr,
 #ifdef CONFIG_XPS
&xps_cpus_attribute.attr,
&queue_tx_maxrate.attr,



[net-next PATCH v2 1/4] net: Move functions for configuring traffic classes out of inline headers

2016-10-28 Thread Alexander Duyck
The functions for configuring the traffic class to queue mappings have
other effects that need to be addressed.  Instead of trying to export a
bunch of new functions just relocate the functions so that we can
instrument them directly with the functionality they will need.

Signed-off-by: Alexander Duyck 
---
 include/linux/netdevice.h |   31 +++
 net/core/dev.c|   29 +
 2 files changed, 32 insertions(+), 28 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 458c876..d045432 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1920,34 +1920,9 @@ int netdev_set_prio_tc_map(struct net_device *dev, u8 prio, u8 tc)
return 0;
 }
 
-static inline
-void netdev_reset_tc(struct net_device *dev)
-{
-   dev->num_tc = 0;
-   memset(dev->tc_to_txq, 0, sizeof(dev->tc_to_txq));
-   memset(dev->prio_tc_map, 0, sizeof(dev->prio_tc_map));
-}
-
-static inline
-int netdev_set_tc_queue(struct net_device *dev, u8 tc, u16 count, u16 offset)
-{
-   if (tc >= dev->num_tc)
-   return -EINVAL;
-
-   dev->tc_to_txq[tc].count = count;
-   dev->tc_to_txq[tc].offset = offset;
-   return 0;
-}
-
-static inline
-int netdev_set_num_tc(struct net_device *dev, u8 num_tc)
-{
-   if (num_tc > TC_MAX_QUEUE)
-   return -EINVAL;
-
-   dev->num_tc = num_tc;
-   return 0;
-}
+void netdev_reset_tc(struct net_device *dev);
+int netdev_set_tc_queue(struct net_device *dev, u8 tc, u16 count, u16 offset);
+int netdev_set_num_tc(struct net_device *dev, u8 num_tc);
 
 static inline
 int netdev_get_num_tc(struct net_device *dev)
diff --git a/net/core/dev.c b/net/core/dev.c
index f55fb45..d4d45bf 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2173,6 +2173,35 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 EXPORT_SYMBOL(netif_set_xps_queue);
 
 #endif
+void netdev_reset_tc(struct net_device *dev)
+{
+   dev->num_tc = 0;
+   memset(dev->tc_to_txq, 0, sizeof(dev->tc_to_txq));
+   memset(dev->prio_tc_map, 0, sizeof(dev->prio_tc_map));
+}
+EXPORT_SYMBOL(netdev_reset_tc);
+
+int netdev_set_tc_queue(struct net_device *dev, u8 tc, u16 count, u16 offset)
+{
+   if (tc >= dev->num_tc)
+   return -EINVAL;
+
+   dev->tc_to_txq[tc].count = count;
+   dev->tc_to_txq[tc].offset = offset;
+   return 0;
+}
+EXPORT_SYMBOL(netdev_set_tc_queue);
+
+int netdev_set_num_tc(struct net_device *dev, u8 num_tc)
+{
+   if (num_tc > TC_MAX_QUEUE)
+   return -EINVAL;
+
+   dev->num_tc = num_tc;
+   return 0;
+}
+EXPORT_SYMBOL(netdev_set_num_tc);
+
 /*
  * Routine to help set real_num_tx_queues. To avoid skbs mapped to queues
  * greater then real_num_tx_queues stale skbs on the qdisc must be flushed.



[net-next PATCH v2 0/4] Add support for XPS when using DCB

2016-10-28 Thread Alexander Duyck
This patch series enables proper isolation between traffic classes when
using XPS while DCB is enabled.  Previously enabling XPS would cause the
traffic to be potentially pulled from one traffic class into another on
egress.  This change essentially multiplies the XPS map by the number of
traffic classes and allows us to do a lookup per traffic class for a given
CPU.

To guarantee the isolation I invalidate the XPS map for any queues that are
moved from one traffic class to another, or if we change the number of
traffic classes.

v2: Added sysfs to display traffic class
Replaced do/while with for loop
Cleaned up several other for for loops throughout the patch

---

Alexander Duyck (4):
  net: Move functions for configuring traffic classes out of inline headers
  net: Add sysfs value to determine queue traffic class
  net: Refactor removal of queues from XPS map and apply on num_tc changes
  net: Add support for XPS with QoS via traffic classes


 include/linux/netdevice.h |   36 +--
 net/core/dev.c|  226 +++--
 net/core/net-sysfs.c  |   51 --
 3 files changed, 219 insertions(+), 94 deletions(-)

--


Re: [PATCH net 2/3] sctp: return back transport in __sctp_rcv_init_lookup

2016-10-28 Thread Marcelo Ricardo Leitner
On Fri, Oct 28, 2016 at 05:42:21PM -0200, Marcelo Ricardo Leitner wrote:
> On Fri, Oct 28, 2016 at 06:10:53PM +0800, Xin Long wrote:
> > Prior to this patch, it used a local variable to save the transport that is
> > looked up by __sctp_lookup_association(), and didn't return it back. But in
> > sctp_rcv, it is used to initialize chunk->transport. So when hitting this
> > code, it was initializing chunk->transport with some random stack value
> > instead.
> > 
> > This patch is to return the transport back through transport pointer
> > that is from __sctp_rcv_lookup_harder().
> > 
> > Signed-off-by: Xin Long 
> 
> Acked-by: Marcelo Ricardo Leitner 
> 
> transport pointer in sctp_rcv() is initialized to null and there are
> checks for it after this path, so this shouldn't be exploitable, just
> malfunction.

This actually sort of contradicts the changelog.

Xin, did I miss something here? Seems we need to update the changelog if
not.



Re: Why do we need tasklet in IFB?

2016-10-28 Thread Stephen Hemminger
On Fri, 28 Oct 2016 14:36:27 -0700
Michael Ma  wrote:

> Hi -
> 
> Currently IFB uses a tasklet to process tx/rx on the interface that
> forwarded the packet to IFB. My understanding of why we're doing this
> is that since dev_queue_xmit() can be invoked in interrupt context, we
> want to defer the processing of the original tx/rx in case ifb_xmit()
> is called from interrupt context.

dev_queue_xmit is only called from interrupt if doing netconsole.


Why do we need tasklet in IFB?

2016-10-28 Thread Michael Ma
Hi -

Currently IFB uses a tasklet to process tx/rx on the interface that
forwarded the packet to IFB. My understanding of why we're doing this
is that since dev_queue_xmit() can be invoked in interrupt context, we
want to defer the processing of the original tx/rx in case ifb_xmit()
is called from interrupt context.
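
For reference, the pattern in question looks roughly like this -- a
simplified paraphrase of the ifb tx path, with locking and the
ingress/egress split elided:

#include <linux/interrupt.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct example_ifb_q {
	struct sk_buff_head	rq;
	struct tasklet_struct	task;
};

/* ifb_xmit() only queues the skb and kicks a tasklet... */
static netdev_tx_t example_ifb_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct example_ifb_q *q = netdev_priv(dev);

	__skb_queue_tail(&q->rq, skb);
	if (skb_queue_len(&q->rq) == 1)
		tasklet_schedule(&q->task);

	return NETDEV_TX_OK;
}

/* ...and the tasklet later re-injects the packet.  The real ri_tasklet()
 * also looks at the skb's tc metadata and uses netif_receive_skb() for
 * packets that originally came from rx.
 */
static void example_ifb_tasklet(unsigned long arg)
{
	struct example_ifb_q *q = (struct example_ifb_q *)arg;
	struct sk_buff *skb;

	while ((skb = __skb_dequeue(&q->rq)) != NULL)
		dev_queue_xmit(skb);
}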

However, if the packet originally came from rx, the calling context
should already be a tasklet and there is no need to queue the processing
to another tasklet. Even if the packet originates from tx, we can rely
on the deferred processing of the "original device" or TC if necessary.
Did I miss anything here?

Furthermore, looking at the bonding device's code there isn't this
kind of buffering/tasklet handling for packet forwarding even though
bond also has its own txq/rxq configured separately from the actual
nic, which is very similar to IFB.

So why do we need tasklet in IFB?

Thanks,
Michael


Re: [PATCH v2 1/5] kconfig: introduce the "imply" keyword

2016-10-28 Thread Paul Bolle
On Thu, 2016-10-27 at 23:10 -0400, Nicolas Pitre wrote:
> On Fri, 28 Oct 2016, Paul Bolle wrote:
> > What happens when a tristate symbol is implied by a symbol set to 'y'
> > and by a symbol set to 'm'?
> 
> That's respectively the third and second rows in the table above.

I meant: two separate symbols implying the same symbol at the same
time. One of those symbols set to 'y' and the other set to 'm'.

Anyhow, I hope to play with a mock Kconfig fragment the next few days
to find out myself.

Thanks,


Paul Bolle


Re: [PATCH v2 1/5] kconfig: introduce the "imply" keyword

2016-10-28 Thread Paul Bolle
On Thu, 2016-10-27 at 23:10 -0400, Nicolas Pitre wrote:
> On Fri, 28 Oct 2016, Paul Bolle wrote:
> > You probably got "["if" <expr>]" for free by copying what's there for
> > select. But this series doesn't use it, so perhaps it would be better
> > to not document it yet. But that is rather sneaky. Dunno.
> 
> If it is not documented then the chance that someone uses it are slim. 
> And since it comes "for free" I don't see the point of making the tool 
> less flexible. And not having this flexibility could make some 
> transitions from "select" to "imply" needlessly difficult.

My point is moot. I somehow missed that this series adds
    imply PTP_1588_CLOCK if TILEGX

So you are quite right in documenting this.


Paul Bolle


Re: [PATCH 17/17] batman-adv: Avoid precedence issues in macros

2016-10-28 Thread Sven Eckelmann
On Freitag, 28. Oktober 2016 14:13:06 CEST Joe Perches wrote:
> On Thu, 2016-10-27 at 21:01 +0200, Simon Wunderlich wrote:
> > From: Sven Eckelmann 
> > 
> > It must be avoided that arguments to a macro are evaluated ungrouped (which
> > enforces normal operator precedence). Otherwise the result of the macro
> > is not well defined.
> 
> Curiosity:
> 
> in net/batman-adv/tp_meter.c
[...]
>   orig_node = batadv_orig_hash_find(bat_priv, tp_vars->other_end);
>   if (unlikely(!orig_node)) {
>   err = BATADV_TP_REASON_DST_UNREACHABLE;
>   tp_vars->reason = err;
>   goto out;
>   }
> 
>   primary_if = batadv_primary_if_get_selected(bat_priv);
>   if (unlikely(!primary_if)) {
>   err = BATADV_TP_REASON_DST_UNREACHABLE;
>   goto out;
>   }
> 
> err is not used in the out block
> 
> Is the last if block supposed to set tp_vars->reason to err?

This seems to be unrelated to this patch.

But yes, looks to me like it is missing. Do you want to propose a patch or
should I do? Just make sure you Cc Antonio Quartulli  (and of
course b.a.t.m@lists.open-mesh.org). He is the original author of
33a3bb4a3345 ("batman-adv: throughput meter implementation").
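
Presumably the fix will simply mirror the earlier error paths in
batadv_tp_send(), i.e. something along these lines (untested sketch):

	primary_if = batadv_primary_if_get_selected(bat_priv);
	if (unlikely(!primary_if)) {
		err = BATADV_TP_REASON_DST_UNREACHABLE;
		tp_vars->reason = err;	/* the assignment that looks missing */
		goto out;
	}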

Kind regards,
Sven



Re: [PATCH net] net: clear sk_err_soft in sk_clone_lock()

2016-10-28 Thread Soheil Hassas Yeganeh
On Fri, Oct 28, 2016 at 4:40 PM, Eric Dumazet  wrote:
> From: Eric Dumazet 
>
> At accept() time, it is possible the parent has a non zero
> sk_err_soft, leftover from a prior error.
>
> Make sure we do not leave this value in the child, as it
> makes future getsockopt(SO_ERROR) calls quite unreliable.
>
> Signed-off-by: Eric Dumazet 
Acked-by: Soheil Hassas Yeganeh 

> ---
>  net/core/sock.c |1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 
> d8e4532e89e7c28737c95c723e5f5b3d184a7805..662ccf1c40ed1b66ee253b063dcbfbd186deccee
>  100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1543,6 +1543,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const 
> gfp_t priority)
> RCU_INIT_POINTER(newsk->sk_reuseport_cb, NULL);
>
> newsk->sk_err  = 0;
> +   newsk->sk_err_soft = 0;
> newsk->sk_priority = 0;
> newsk->sk_incoming_cpu = raw_smp_processor_id();
> atomic64_set(&newsk->sk_cookie, 0);
>
>

Very nice catch! Thank you, Eric!
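
For reference, the userspace-visible symptom is on the accepted socket:
a freshly accepted child should report SO_ERROR == 0, but the stale
value inherited from the listener made that unreliable.  A minimal check
in plain POSIX C (error handling trimmed, listen_fd assumed to be a
listening TCP socket):

#include <stdio.h>
#include <sys/socket.h>

static void check_child_soerror(int listen_fd)
{
	int child, err = 0;
	socklen_t len = sizeof(err);

	child = accept(listen_fd, NULL, NULL);
	if (child < 0)
		return;

	/* should print nothing on a fixed kernel */
	if (getsockopt(child, SOL_SOCKET, SO_ERROR, &err, &len) == 0 && err)
		printf("stale error on freshly accepted socket: %d\n", err);
}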


Re: [PATCH 17/17] batman-adv: Avoid precedence issues in macros

2016-10-28 Thread Joe Perches
On Thu, 2016-10-27 at 21:01 +0200, Simon Wunderlich wrote:
> From: Sven Eckelmann 
> 
> It must be avoided that arguments to a macro are evaluated ungrouped (which
> enforces normal operator precedence). Otherwise the result of the macro
> is not well defined.

Curiosity:

in net/batman-adv/tp_meter.c

static int batadv_tp_send(void *arg)
{
struct batadv_tp_vars *tp_vars = arg;
struct batadv_priv *bat_priv = tp_vars->bat_priv;
struct batadv_hard_iface *primary_if = NULL;
struct batadv_orig_node *orig_node = NULL;
size_t payload_len, packet_len;
int err = 0;

if (unlikely(tp_vars->role != BATADV_TP_SENDER)) {
err = BATADV_TP_REASON_DST_UNREACHABLE;
tp_vars->reason = err;
goto out;
}

orig_node = batadv_orig_hash_find(bat_priv, tp_vars->other_end);
if (unlikely(!orig_node)) {
err = BATADV_TP_REASON_DST_UNREACHABLE;
tp_vars->reason = err;
goto out;
}

primary_if = batadv_primary_if_get_selected(bat_priv);
if (unlikely(!primary_if)) {
err = BATADV_TP_REASON_DST_UNREACHABLE;
goto out;
}

err is not used in the out block

Is the last if block supposed to set tp_vars->reason to err?




Re: [PATCH] net: phy: at803x: the Atheros 8031 supports pause frames

2016-10-28 Thread Florian Fainelli
On 10/28/2016 01:06 PM, Timur Tabi wrote:
> Florian Fainelli wrote:
>> On 10/27/2016 03:24 PM, Timur Tabi wrote:
>>> Florian Fainelli wrote:
>>>
 Hu? In my experience that should not come from supporting Pause frames
 or not, but rather properly configuring a (RG)MII delay, but your
 mileage may vary.
>>>
>>> I can assure you, I'm more confused than you.  I've been working in this
>>> for almost two weeks, and not only does this patch align with other phy
>>> drivers, but it does fix my bug.  Without this patch, phylib does not
>>> set the pause frame bits (10 and 11) in MII_ADVERTISE.
>>
>> And that's expected, but if your MAC does not act upon phydev->pause and
>> phydev->asym_pause, then chances are that you can indeed lose packets
>> every once in a while.
> 
> Can you give me more details on that?  I'm not really an expert on our
> MAC, so I'm not sure how it's supposed to work.  The MAC supports the
> concept of "flow control".  The non-upstream version of my driver reads
> PHY register 0x11, which apparently is a PHY-specific status register
> that says whether or not "transmit pause" and "receive pause" is
> enabled.  (The AT8031 datasheet is here, if you'd like to read it:
> http://www.datasheet-pdf.com/datasheet/AtherosCommunications/734720/AR8031.pdf.html).

How it works is that you read phydev->pause and phydev->asym_pause to
know what your link partner advertises in terms of flow control/pause
frame (they are the same thing here). Based on that, and your local
settings (from ethtool -A) you advertise/enable flow control as well or
don't do it, and the result is either both agree on what is being used,
flow control is enabled, end-to-end, or it is not. It's exactly the same
thing as speed, except that there is a possibility to do asymmetric things,
where flow control is only enabled in one direction or the other.

The way this works is this:

You transmit data, faster than your link partner can process it (busy
receiving, loaded, you name it), it sends back Pause frames back at you
(see below for how this is typically implemented), your MAC gets these
Pause frames back, which feed into the transmit logic, your TX
completion interrupt gets signaled slower than the actual data rate you
are pushing to your link partner, you call dev_kfree_skb() a little
slower from your TX completion handler, the networking stack knows this
means to back off, your transmission rate adjusts to your link partner's
processing capability, no packets get lost.

Conversely, if you were receiving a stream of packets that you have a
hard time keeping up with, the part of your Ethernet MAC that fills
system memory with packets should realize that it is producing them
faster than you are consuming them, this triggers the flow control
mechanism of your MAC receiver, and it sends Pause frames back at the
link partner.
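
In MAC-driver terms, acting on this boils down to reading phydev->pause
and phydev->asym_pause in the adjust_link callback and programming the
MAC's flow-control enables from them.  A sketch with made-up register
names (the real bit layout, and the full 802.3 pause resolution, are MAC
specific):

#include <linux/bitops.h>
#include <linux/netdevice.h>
#include <linux/phy.h>

#define EXAMPLE_MAC_FLOWCTL_REG	0x0f0	/* made-up register offset */
#define EXAMPLE_MAC_FLOWCTL_TX	BIT(0)	/* made-up bit */
#define EXAMPLE_MAC_FLOWCTL_RX	BIT(1)	/* made-up bit */

static void example_mac_writel(struct net_device *ndev, u32 reg, u32 val)
{
	/* stand-in for the driver's real register write */
}

static void example_adjust_link(struct net_device *ndev)
{
	struct phy_device *phydev = ndev->phydev;
	u32 flowctl = 0;

	/* The real tx/rx pause resolution per 802.3 is more involved; this
	 * only shows where the negotiated information comes from.
	 */
	if (phydev->pause || phydev->asym_pause)
		flowctl = EXAMPLE_MAC_FLOWCTL_TX | EXAMPLE_MAC_FLOWCTL_RX;

	example_mac_writel(ndev, EXAMPLE_MAC_FLOWCTL_REG, flowctl);
}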

> 
> 
> If both are enabled in the PHY, then the driver enables the same
> features in the MAC.  Unfortunately, I don't have any documentation that
> explains that that really means.  I've tried enabling and disabling this
> feature in the MAC, and it makes no difference.  You would think that if
> the feature is disabled in both the MAC and the PHY, then no packets
> would be lost, but that's not the case.

The PHY does not participate in flow control, other than passing frames
through (and your Atheros PHY may have a register to control that
passthrough, but that would be surprising, could not find one).

> 
>> The part that is totally crappy about Pause frames and PHYLIB is that we
>> need some information about whether this should be supported or not,
>> such that we can change MII_ADVERTISE accordingly to reflect that, and
>> what best way to do this than use one of these SUPPORTED_* bits to set
>> that, except, that unlike speed (which is both a MAC and PHY property),
>> Pause is exclusively MAC, yet, it transpires in PHYLIB.
> 
> Then what do those bits in register 0x11 do?  If I don't set them, I
> lose packets, no matter how my MAC is programmed.

This is just a status register, as you can see, all bits are read-only,
it just aggregates the resolution of what is advertised and what is
negotiated, which is not without value, but PHYLIB does the same
intersection here.

> 
> It's like that joke, "Doctor, if I hold my arm like this, then it
> hurts!", "Well, then don't hold your arm like that."  If I don't program
> those PHY register bits, then my hardware doesn't work.
> 
>> MACs that I work with for instance need to be told to ignore pause
>> frames, or not ignore them, it all depends on what you want to
>> advertise/support.
> 
> That's the problem, I don't know what I "want", except that I want my
> hardware to work. :-)  99% of my knowledge of the hardware comes from
> scanning the source code of the internal version of the driver that 1)
> does not support phylib,  and 2) is hard-coded to initialize the Atheros
> 8031 PHY.

May I suggest reading about standards a bit more, or just looking at

[PATCH v2 0/6] add NS2 support to bgmac

2016-10-28 Thread Jon Mason
Changes in v2:
* Remove the PHY power-on (per Andrew Lunn)
* Misc PHY clean-ups regarding comments and #defines (per Andrew Lunn)
  This results in none of the original PHY code from Vikas being
  present.  So, I'm removing him as an author and giving him
  "Inspired-by" credit.
* Move PHY lane swapping to PHY driver (per Andrew Lunn and Florian
  Fainelli)
* Remove bgmac sleep (per Florian Fainelli)
* Re-add bgmac chip reset (per Florian Fainelli and Ray Jui)
* Rebased on latest net-next
* Added patch for bcm54xx_auxctl_read, which is used in the BCM54810


Add support for the amac found in the Broadcom Northstar2 SoC to the
bgmac driver.  This necessitates adding support to connect to an
externally defined phy (as described in the device tree) in the driver.
These phy changes are in addition to the changes necessary to get NS2
working.


Jon Mason (6):
  net: phy: broadcom: add bcm54xx_auxctl_read
  net: phy: broadcom: Add BCM54810 PHY entry
  Documentation: devicetree: net: add NS2 bindings to amac
  net: ethernet: bgmac: device tree phy enablement
  net: ethernet: bgmac: add NS2 support
  arm64: dts: NS2: add AMAC ethernet support

 .../devicetree/bindings/net/brcm,amac.txt  |   5 +-
 arch/arm64/boot/dts/broadcom/ns2-svk.dts   |   5 ++
 arch/arm64/boot/dts/broadcom/ns2.dtsi  |  12 +++
 drivers/net/ethernet/broadcom/bgmac-bcma.c |  48 ++
 drivers/net/ethernet/broadcom/bgmac-platform.c | 100 -
 drivers/net/ethernet/broadcom/bgmac.c  |  55 ++--
 drivers/net/ethernet/broadcom/bgmac.h  |   8 ++
 drivers/net/phy/Kconfig|   2 +-
 drivers/net/phy/broadcom.c |  68 +-
 include/linux/brcmphy.h|  11 +++
 10 files changed, 260 insertions(+), 54 deletions(-)

-- 
2.7.4



[PATCH v2 5/6] net: ethernet: bgmac: add NS2 support

2016-10-28 Thread Jon Mason
Add support for the variant of amac hardware present in the Broadcom
Northstar2 based SoCs.  Northstar2 requires an additional register to be
configured with the port speed/duplex (NICPM).  This is handled in
the link callback to hide it from the instances that do not use it.
Also, clearing of the pending interrupts on init is required due to
observed issues on some platforms.

Signed-off-by: Jon Mason 
---
 drivers/net/ethernet/broadcom/bgmac-platform.c | 56 +-
 drivers/net/ethernet/broadcom/bgmac.c  |  3 ++
 drivers/net/ethernet/broadcom/bgmac.h  |  1 +
 3 files changed, 58 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bgmac-platform.c 
b/drivers/net/ethernet/broadcom/bgmac-platform.c
index aed5dc5..f6d48c7 100644
--- a/drivers/net/ethernet/broadcom/bgmac-platform.c
+++ b/drivers/net/ethernet/broadcom/bgmac-platform.c
@@ -14,12 +14,21 @@
 #define pr_fmt(fmt)KBUILD_MODNAME ": " fmt
 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include "bgmac.h"
 
+#define NICPM_IOMUX_CTRL   0x0008
+
+#define NICPM_IOMUX_CTRL_INIT_VAL  0x3196e000
+#define NICPM_IOMUX_CTRL_SPD_SHIFT 10
+#define NICPM_IOMUX_CTRL_SPD_10M   0
+#define NICPM_IOMUX_CTRL_SPD_100M  1
+#define NICPM_IOMUX_CTRL_SPD_1000M 2
+
 static u32 platform_bgmac_read(struct bgmac *bgmac, u16 offset)
 {
return readl(bgmac->plat.base + offset);
@@ -87,12 +96,46 @@ static void platform_bgmac_cmn_maskset32(struct bgmac 
*bgmac, u16 offset,
WARN_ON(1);
 }
 
+static void bgmac_nicpm_speed_set(struct net_device *net_dev)
+{
+   struct bgmac *bgmac = netdev_priv(net_dev);
+   u32 val;
+
+   if (!bgmac->plat.nicpm_base)
+   return;
+
+   val = NICPM_IOMUX_CTRL_INIT_VAL;
+   switch (bgmac->net_dev->phydev->speed) {
+   default:
+   pr_err("Unsupported speed.  Defaulting to 1000Mb\n");
+   case SPEED_1000:
+   val |= NICPM_IOMUX_CTRL_SPD_1000M << NICPM_IOMUX_CTRL_SPD_SHIFT;
+   break;
+   case SPEED_100:
+   val |= NICPM_IOMUX_CTRL_SPD_100M << NICPM_IOMUX_CTRL_SPD_SHIFT;
+   break;
+   case SPEED_10:
+   val |= NICPM_IOMUX_CTRL_SPD_10M << NICPM_IOMUX_CTRL_SPD_SHIFT;
+   break;
+   }
+
+   writel(val, bgmac->plat.nicpm_base + NICPM_IOMUX_CTRL);
+
+   bgmac_adjust_link(bgmac->net_dev);
+}
+
 static int platform_phy_connect(struct bgmac *bgmac)
 {
struct phy_device *phy_dev;
 
-   phy_dev = of_phy_get_and_connect(bgmac->net_dev, bgmac->dev->of_node,
-bgmac_adjust_link);
+   if (bgmac->plat.nicpm_base)
+   phy_dev = of_phy_get_and_connect(bgmac->net_dev,
+bgmac->dev->of_node,
+bgmac_nicpm_speed_set);
+   else
+   phy_dev = of_phy_get_and_connect(bgmac->net_dev,
+bgmac->dev->of_node,
+bgmac_adjust_link);
if (!phy_dev) {
dev_err(bgmac->dev, "Phy connect failed\n");
return -ENODEV;
@@ -182,6 +225,14 @@ static int bgmac_probe(struct platform_device *pdev)
if (IS_ERR(bgmac->plat.idm_base))
return PTR_ERR(bgmac->plat.idm_base);
 
+   regs = platform_get_resource_byname(pdev, IORESOURCE_MEM, "nicpm_base");
+   if (regs) {
+   bgmac->plat.nicpm_base = devm_ioremap_resource(&pdev->dev,
+  regs);
+   if (IS_ERR(bgmac->plat.nicpm_base))
+   return PTR_ERR(bgmac->plat.nicpm_base);
+   }
+
bgmac->read = platform_bgmac_read;
bgmac->write = platform_bgmac_write;
bgmac->idm_read = platform_bgmac_idm_read;
@@ -213,6 +264,7 @@ static int bgmac_remove(struct platform_device *pdev)
 static const struct of_device_id bgmac_of_enet_match[] = {
{.compatible = "brcm,amac",},
{.compatible = "brcm,nsp-amac",},
+   {.compatible = "brcm,ns2-amac",},
{},
 };
 
diff --git a/drivers/net/ethernet/broadcom/bgmac.c 
b/drivers/net/ethernet/broadcom/bgmac.c
index 4584958..a805cc8 100644
--- a/drivers/net/ethernet/broadcom/bgmac.c
+++ b/drivers/net/ethernet/broadcom/bgmac.c
@@ -1082,6 +1082,9 @@ static void bgmac_enable(struct bgmac *bgmac)
 /* http://bcm-v4.sipsolutions.net/mac-gbit/gmac/chipinit */
 static void bgmac_chip_init(struct bgmac *bgmac)
 {
+   /* Clear any erroneously pending interrupts */
+   bgmac_write(bgmac, BGMAC_INT_STATUS, ~0);
+
/* 1 interrupt per received frame */
bgmac_write(bgmac, BGMAC_INT_RECV_LAZY, 1 << BGMAC_IRL_FC_SHIFT);
 
diff --git a/drivers/net/ethernet/broadcom/bgmac.h 
b/drivers/net/ethernet/broadcom/bgmac.h
index ea52ac3..b1820ea 100644
--- a/drivers/net

[PATCH v2 2/6] net: phy: broadcom: Add BCM54810 PHY entry

2016-10-28 Thread Jon Mason
The BCM54810 PHY requires some semi-unique configuration in addition to
the standard config.
Also, some users of the BCM54810 require the PHY lanes to be swapped.
Since there is no way to detect this, add a device tree query to see if
it is applicable.

Inspired-by: Vikas Soni 
Signed-off-by: Jon Mason 
---
 drivers/net/phy/Kconfig|  2 +-
 drivers/net/phy/broadcom.c | 58 +-
 include/linux/brcmphy.h| 10 
 3 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index 45f68ea..31967ca 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -217,7 +217,7 @@ config BROADCOM_PHY
select BCM_NET_PHYLIB
---help---
  Currently supports the BCM5411, BCM5421, BCM5461, BCM54616S, BCM5464,
- BCM5481 and BCM5482 PHYs.
+ BCM5481, BCM54810 and BCM5482 PHYs.
 
 config CICADA_PHY
tristate "Cicada PHYs"
diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c
index 3a64b3d..777ea29 100644
--- a/drivers/net/phy/broadcom.c
+++ b/drivers/net/phy/broadcom.c
@@ -18,7 +18,7 @@
 #include 
 #include 
 #include 
-
+#include 
 
 #define BRCM_PHY_MODEL(phydev) \
((phydev)->drv->phy_id & (phydev)->drv->phy_id_mask)
@@ -45,6 +45,34 @@ static int bcm54xx_auxctl_write(struct phy_device *phydev, 
u16 regnum, u16 val)
return phy_write(phydev, MII_BCM54XX_AUX_CTL, regnum | val);
 }
 
+static int bcm54810_config(struct phy_device *phydev)
+{
+   int rc, val;
+
+   val = bcm_phy_read_exp(phydev, BCM54810_EXP_BROADREACH_LRE_MISC_CTL);
+   val &= ~BCM54810_EXP_BROADREACH_LRE_MISC_CTL_EN;
+   rc = bcm_phy_write_exp(phydev, BCM54810_EXP_BROADREACH_LRE_MISC_CTL,
+  val);
+   if (rc < 0)
+   return rc;
+
+   val = bcm54xx_auxctl_read(phydev, MII_BCM54XX_AUXCTL_SHDWSEL_MISC);
+   val &= ~MII_BCM54XX_AUXCTL_SHDWSEL_MISC_RGMII_SKEW_EN;
+   val |= MII_BCM54XX_AUXCTL_MISC_WREN;
+   rc = bcm54xx_auxctl_write(phydev, MII_BCM54XX_AUXCTL_SHDWSEL_MISC,
+ val);
+   if (rc < 0)
+   return rc;
+
+   val = bcm_phy_read_shadow(phydev, BCM54810_SHD_CLK_CTL);
+   val &= ~BCM54810_SHD_CLK_CTL_GTXCLK_EN;
+   rc = bcm_phy_write_shadow(phydev, BCM54810_SHD_CLK_CTL, val);
+   if (rc < 0)
+   return rc;
+
+   return 0;
+}
+
 /* Needs SMDSP clock enabled via bcm54xx_phydsp_config() */
 static int bcm50610_a0_workaround(struct phy_device *phydev)
 {
@@ -217,6 +245,12 @@ static int bcm54xx_config_init(struct phy_device *phydev)
(phydev->dev_flags & PHY_BRCM_AUTO_PWRDWN_ENABLE))
bcm54xx_adjust_rxrefclk(phydev);
 
+   if (BRCM_PHY_MODEL(phydev) == PHY_ID_BCM54810) {
+   err = bcm54810_config(phydev);
+   if (err)
+   return err;
+   }
+
bcm54xx_phydsp_config(phydev);
 
return 0;
@@ -314,6 +348,7 @@ static int bcm5482_read_status(struct phy_device *phydev)
 
 static int bcm5481_config_aneg(struct phy_device *phydev)
 {
+   struct device_node *np = phydev->mdio.dev.of_node;
int ret;
 
/* Aneg firsly. */
@@ -344,6 +379,14 @@ static int bcm5481_config_aneg(struct phy_device *phydev)
phy_write(phydev, 0x18, reg);
}
 
+   if (of_property_read_bool(np, "brcm,enet-phy-lane-swap")) {
+   /* Lane Swap - Undocumented register...magic! */
+   ret = bcm_phy_write_exp(phydev, MII_BCM54XX_EXP_SEL_ER + 0x9,
+   0x11B);
+   if (ret < 0)
+   return ret;
+   }
+
return ret;
 }
 
@@ -578,6 +621,18 @@ static struct phy_driver broadcom_drivers[] = {
.ack_interrupt  = bcm_phy_ack_intr,
.config_intr= bcm_phy_config_intr,
 }, {
+   .phy_id = PHY_ID_BCM54810,
+   .phy_id_mask= 0xfff0,
+   .name   = "Broadcom BCM54810",
+   .features   = PHY_GBIT_FEATURES |
+ SUPPORTED_Pause | SUPPORTED_Asym_Pause,
+   .flags  = PHY_HAS_MAGICANEG | PHY_HAS_INTERRUPT,
+   .config_init= bcm54xx_config_init,
+   .config_aneg= bcm5481_config_aneg,
+   .read_status= genphy_read_status,
+   .ack_interrupt  = bcm_phy_ack_intr,
+   .config_intr= bcm_phy_config_intr,
+}, {
.phy_id = PHY_ID_BCM5482,
.phy_id_mask= 0xfff0,
.name   = "Broadcom BCM5482",
@@ -661,6 +716,7 @@ static struct mdio_device_id __maybe_unused broadcom_tbl[] 
= {
{ PHY_ID_BCM54616S, 0xfff0 },
{ PHY_ID_BCM5464, 0xfff0 },
{ PHY_ID_BCM5481, 0xfff0 },
+   { PHY_ID_BCM54810, 0xfff0 },
{ PHY_ID_BCM5482, 0xfff0 },
{ PHY_ID_BCM50610, 0xfff0 },
{ PHY_ID_BCM50610M, 0xff

[PATCH v2 4/6] net: ethernet: bgmac: device tree phy enablement

2016-10-28 Thread Jon Mason
Change the bgmac driver to allow for PHYs defined by the device tree

Signed-off-by: Jon Mason 
---
 drivers/net/ethernet/broadcom/bgmac-bcma.c | 48 
 drivers/net/ethernet/broadcom/bgmac-platform.c | 48 +++-
 drivers/net/ethernet/broadcom/bgmac.c  | 52 ++
 drivers/net/ethernet/broadcom/bgmac.h  |  7 
 4 files changed, 105 insertions(+), 50 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bgmac-bcma.c 
b/drivers/net/ethernet/broadcom/bgmac-bcma.c
index c16ec3a..3e3efde 100644
--- a/drivers/net/ethernet/broadcom/bgmac-bcma.c
+++ b/drivers/net/ethernet/broadcom/bgmac-bcma.c
@@ -80,6 +80,50 @@ static void bcma_bgmac_cmn_maskset32(struct bgmac *bgmac, 
u16 offset, u32 mask,
bcma_maskset32(bgmac->bcma.cmn, offset, mask, set);
 }
 
+static int bcma_phy_connect(struct bgmac *bgmac)
+{
+   struct phy_device *phy_dev;
+   char bus_id[MII_BUS_ID_SIZE + 3];
+
+   /* Connect to the PHY */
+   snprintf(bus_id, sizeof(bus_id), PHY_ID_FMT, bgmac->mii_bus->id,
+bgmac->phyaddr);
+   phy_dev = phy_connect(bgmac->net_dev, bus_id, bgmac_adjust_link,
+ PHY_INTERFACE_MODE_MII);
+   if (IS_ERR(phy_dev)) {
+   dev_err(bgmac->dev, "PHY connecton failed\n");
+   return PTR_ERR(phy_dev);
+   }
+
+   return 0;
+}
+
+static int bcma_phy_direct_connect(struct bgmac *bgmac)
+{
+   struct fixed_phy_status fphy_status = {
+   .link = 1,
+   .speed = SPEED_1000,
+   .duplex = DUPLEX_FULL,
+   };
+   struct phy_device *phy_dev;
+   int err;
+
+   phy_dev = fixed_phy_register(PHY_POLL, &fphy_status, -1, NULL);
+   if (!phy_dev || IS_ERR(phy_dev)) {
+   dev_err(bgmac->dev, "Failed to register fixed PHY device\n");
+   return -ENODEV;
+   }
+
+   err = phy_connect_direct(bgmac->net_dev, phy_dev, bgmac_adjust_link,
+PHY_INTERFACE_MODE_MII);
+   if (err) {
+   dev_err(bgmac->dev, "Connecting PHY failed\n");
+   return err;
+   }
+
+   return err;
+}
+
 static const struct bcma_device_id bgmac_bcma_tbl[] = {
BCMA_CORE(BCMA_MANUF_BCM, BCMA_CORE_4706_MAC_GBIT,
  BCMA_ANY_REV, BCMA_ANY_CLASS),
@@ -275,6 +319,10 @@ static int bgmac_probe(struct bcma_device *core)
bgmac->cco_ctl_maskset = bcma_bgmac_cco_ctl_maskset;
bgmac->get_bus_clock = bcma_bgmac_get_bus_clock;
bgmac->cmn_maskset32 = bcma_bgmac_cmn_maskset32;
+   if (bgmac->mii_bus)
+   bgmac->phy_connect = bcma_phy_connect;
+   else
+   bgmac->phy_connect = bcma_phy_direct_connect;
 
err = bgmac_enet_probe(bgmac);
if (err)
diff --git a/drivers/net/ethernet/broadcom/bgmac-platform.c 
b/drivers/net/ethernet/broadcom/bgmac-platform.c
index be52f27..aed5dc5 100644
--- a/drivers/net/ethernet/broadcom/bgmac-platform.c
+++ b/drivers/net/ethernet/broadcom/bgmac-platform.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "bgmac.h"
 
@@ -86,6 +87,46 @@ static void platform_bgmac_cmn_maskset32(struct bgmac 
*bgmac, u16 offset,
WARN_ON(1);
 }
 
+static int platform_phy_connect(struct bgmac *bgmac)
+{
+   struct phy_device *phy_dev;
+
+   phy_dev = of_phy_get_and_connect(bgmac->net_dev, bgmac->dev->of_node,
+bgmac_adjust_link);
+   if (!phy_dev) {
+   dev_err(bgmac->dev, "Phy connect failed\n");
+   return -ENODEV;
+   }
+
+   return 0;
+}
+
+static int platform_phy_direct_connect(struct bgmac *bgmac)
+{
+   struct fixed_phy_status fphy_status = {
+   .link = 1,
+   .speed = SPEED_1000,
+   .duplex = DUPLEX_FULL,
+   };
+   struct phy_device *phy_dev;
+   int err;
+
+   phy_dev = fixed_phy_register(PHY_POLL, &fphy_status, -1, NULL);
+   if (!phy_dev || IS_ERR(phy_dev)) {
+   dev_err(bgmac->dev, "Failed to register fixed PHY device\n");
+   return -ENODEV;
+   }
+
+   err = phy_connect_direct(bgmac->net_dev, phy_dev, bgmac_adjust_link,
+PHY_INTERFACE_MODE_MII);
+   if (err) {
+   dev_err(bgmac->dev, "Connecting PHY failed\n");
+   return err;
+   }
+
+   return err;
+}
+
 static int bgmac_probe(struct platform_device *pdev)
 {
struct device_node *np = pdev->dev.of_node;
@@ -102,7 +143,6 @@ static int bgmac_probe(struct platform_device *pdev)
/* Set the features of the 4707 family */
bgmac->feature_flags |= BGMAC_FEAT_CLKCTLST;
bgmac->feature_flags |= BGMAC_FEAT_NO_RESET;
-   bgmac->feature_flags |= BGMAC_FEAT_FORCE_SPEED_2500;
bgmac->feature_flags |= BGMAC_FEAT_CMDCFG_SR_REV4;
bgmac->feature_flags |= BGMAC_FEAT_TX_MASK_SETUP

[PATCH v2 6/6] arm64: dts: NS2: add AMAC ethernet support

2016-10-28 Thread Jon Mason
Add support for the AMAC ethernet to the Broadcom Northstar2 SoC device
tree

Signed-off-by: Jon Mason 
---
 arch/arm64/boot/dts/broadcom/ns2-svk.dts |  5 +
 arch/arm64/boot/dts/broadcom/ns2.dtsi| 12 
 2 files changed, 17 insertions(+)

diff --git a/arch/arm64/boot/dts/broadcom/ns2-svk.dts 
b/arch/arm64/boot/dts/broadcom/ns2-svk.dts
index 2d7872a..9e5c3c9 100644
--- a/arch/arm64/boot/dts/broadcom/ns2-svk.dts
+++ b/arch/arm64/boot/dts/broadcom/ns2-svk.dts
@@ -56,6 +56,10 @@
};
 };
 
+&enet {
+   status = "ok";
+};
+
 &pci_phy0 {
status = "ok";
 };
@@ -172,6 +176,7 @@
 &mdio_mux_iproc {
mdio@10 {
gphy0: eth-phy@10 {
+   brcm,enet-phy-lane-swap;
reg = <0x10>;
};
};
diff --git a/arch/arm64/boot/dts/broadcom/ns2.dtsi 
b/arch/arm64/boot/dts/broadcom/ns2.dtsi
index d95dc40..773ed59 100644
--- a/arch/arm64/boot/dts/broadcom/ns2.dtsi
+++ b/arch/arm64/boot/dts/broadcom/ns2.dtsi
@@ -191,6 +191,18 @@
 
#include "ns2-clock.dtsi"
 
+   enet: ethernet@6100 {
+   compatible = "brcm,ns2-amac";
+   reg = <0x6100 0x1000>,
+ <0x6109 0x1000>,
+ <0x6103 0x100>;
+   reg-names = "amac_base", "idm_base", "nicpm_base";
+   interrupts = ;
+   phy-handle = <&gphy0>;
+   phy-mode = "rgmii";
+   status = "disabled";
+   };
+
dma0: dma@6136 {
compatible = "arm,pl330", "arm,primecell";
reg = <0x6136 0x1000>;
-- 
2.7.4



[PATCH v2 3/6] Documentation: devicetree: net: add NS2 bindings to amac

2016-10-28 Thread Jon Mason
Signed-off-by: Jon Mason 
---
 Documentation/devicetree/bindings/net/brcm,amac.txt | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/brcm,amac.txt 
b/Documentation/devicetree/bindings/net/brcm,amac.txt
index ba5ecc1..f2b194e 100644
--- a/Documentation/devicetree/bindings/net/brcm,amac.txt
+++ b/Documentation/devicetree/bindings/net/brcm,amac.txt
@@ -2,11 +2,12 @@ Broadcom AMAC Ethernet Controller Device Tree Bindings
 -
 
 Required properties:
- - compatible: "brcm,amac" or "brcm,nsp-amac"
+ - compatible: "brcm,amac", "brcm,nsp-amac", or "brcm,ns2-amac"
  - reg:Address and length of the GMAC registers,
Address and length of the GMAC IDM registers
+   Address and length of the NIC Port Manager registers (optional)
  - reg-names:  Names of the registers.  Must have both "amac_base" and
-   "idm_base"
+   "idm_base". "nicpm_base" is optional (required for NS2)
  - interrupts: Interrupt number
 
 Optional properties:
-- 
2.7.4



[PATCH v2 1/6] net: phy: broadcom: add bcm54xx_auxctl_read

2016-10-28 Thread Jon Mason
Add a helper function to read the AUXCTL register for the BCM54xx.  This
mirrors the bcm54xx_auxctl_write function already present in the code.

Signed-off-by: Jon Mason 
---
 drivers/net/phy/broadcom.c | 10 ++
 include/linux/brcmphy.h|  1 +
 2 files changed, 11 insertions(+)

diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c
index 583ef8a..3a64b3d 100644
--- a/drivers/net/phy/broadcom.c
+++ b/drivers/net/phy/broadcom.c
@@ -30,6 +30,16 @@ MODULE_DESCRIPTION("Broadcom PHY driver");
 MODULE_AUTHOR("Maciej W. Rozycki");
 MODULE_LICENSE("GPL");
 
+static int bcm54xx_auxctl_read(struct phy_device *phydev, u16 regnum)
+{
+   /* The register must be written to both the Shadow Register Select and
+* the Shadow Read Register Selector
+*/
+   phy_write(phydev, MII_BCM54XX_AUX_CTL, regnum |
+ regnum << MII_BCM54XX_AUXCTL_SHDWSEL_READ_SHIFT);
+   return phy_read(phydev, MII_BCM54XX_AUX_CTL);
+}
+
 static int bcm54xx_auxctl_write(struct phy_device *phydev, u16 regnum, u16 val)
 {
return phy_write(phydev, MII_BCM54XX_AUX_CTL, regnum | val);
diff --git a/include/linux/brcmphy.h b/include/linux/brcmphy.h
index 60def78..0ed6691 100644
--- a/include/linux/brcmphy.h
+++ b/include/linux/brcmphy.h
@@ -110,6 +110,7 @@
 #define MII_BCM54XX_AUXCTL_MISC_FORCE_AMDIX0x0200
 #define MII_BCM54XX_AUXCTL_MISC_RDSEL_MISC 0x7000
 #define MII_BCM54XX_AUXCTL_SHDWSEL_MISC0x0007
+#define MII_BCM54XX_AUXCTL_SHDWSEL_READ_SHIFT  12
 
 #define MII_BCM54XX_AUXCTL_SHDWSEL_MASK0x0007
 
-- 
2.7.4



RE: [PATCH net-next] lan78xx: Use irq_domain for phy interrupt from USB Int. EP

2016-10-28 Thread Woojung.Huh
> > +static void lan78xx_irq_mask(struct irq_data *irqd)
> > +{
> > +   struct irq_domain_data *data = irq_data_get_irq_chip_data(irqd);
> > +   struct lan78xx_net *dev =
> > +   container_of(data, struct lan78xx_net, domain_data);
> > +   u32 buf;
> > +
> > +   lan78xx_read_reg(dev, INT_EP_CTL, &buf);
> 
> lan78xx_read_reg() uses kmalloc() with GFP_KERNEL, while
> irq_mask/irq_unmask can be called in atomic context AFAIR, you may need
> to pass down a specific gfp_t to lan78xx_read_reg.
> 
> What about usb_submit_urb(), can that work in atomic context? Do you
> need to have lan78xx_read_reg() acquire a raw spinlock or something to
> serialize them?
> 
> > +   buf &= ~INT_EP_PHY_INT_EN_;
> 
> Even though you may have only one interrupt line to deal with at the
> moment, better make this bit derived from irqd->hwirq instead of hard
> coding it here.
> 
> > +   lan78xx_write_reg(dev, INT_EP_CTL, buf);
> > +}
> > +
> > +static void lan78xx_irq_unmask(struct irq_data *irqd)
> > +{
> > +   struct irq_domain_data *data = irq_data_get_irq_chip_data(irqd);
> > +   struct lan78xx_net *dev =
> > +   container_of(data, struct lan78xx_net, domain_data);
> > +   u32 buf;
> > +
> > +   lan78xx_read_reg(dev, INT_EP_CTL, &buf);
> > +   buf |= INT_EP_PHY_INT_EN_;
> 
> Same here, this should come from irqd->hwirq.
> 
> > +   lan78xx_write_reg(dev, INT_EP_CTL, buf);
> > +}
> > +
> > +static struct irq_chip lan78xx_irqchip = {
> > +   .name   = "lan78xx-phyirq",
> > +   .irq_mask   = lan78xx_irq_mask,
> > +   .irq_unmask = lan78xx_irq_unmask,
> > +};
> > +
> > +static int lan78xx_setup_irq_domain(struct lan78xx_net *dev)
> > +{
> > +   struct device_node *of_node;
> > +   struct irq_domain *irqdomain;
> > +   unsigned int irq_base = 0;
> > +   int ret = 0;
> > +
> > +   of_node = dev->udev->dev.parent->of_node;
> > +
> > +   dev->domain_data.irqchip = &lan78xx_irqchip;
> > +   dev->domain_data.irq_handler = handle_simple_irq;
> > +
> > +   irqdomain = irq_domain_add_simple(of_node, 1, 0,
> &chip_domain_ops,
> > + &dev->domain_data);
> 
> Is there really just one interrupt associated with this peripheral here?
> 
> >
> > +   if (lan78xx_setup_irq_domain(dev) < 0) {
> > +   netdev_warn(dev->net, "lan78xx_setup_irq_domain()
> failed");
> > +   return -EIO;
> > +   }
> 
> Any reason not to propagate the error code from
> lan78xx_setup_irq_domain() here?

Thanks for the prompt review.
I'll look into it and submit an update.

Thanks.
- Woojung
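
(For reference, deriving the bit from the irq_data as Florian suggests
would look roughly like the sketch below.  It assumes the hwirq number
maps directly onto the INT_EP_CTL bit position, which the irq_domain
mapping would have to guarantee; the atomic-context concern about
lan78xx_read_reg() is a separate issue.)

/* Untested sketch against the patch above, not the final code. */
static void lan78xx_irq_mask(struct irq_data *irqd)
{
	struct irq_domain_data *data = irq_data_get_irq_chip_data(irqd);
	struct lan78xx_net *dev =
		container_of(data, struct lan78xx_net, domain_data);
	u32 buf;

	lan78xx_read_reg(dev, INT_EP_CTL, &buf);
	buf &= ~BIT(irqd_to_hwirq(irqd));	/* instead of INT_EP_PHY_INT_EN_ */
	lan78xx_write_reg(dev, INT_EP_CTL, buf);
}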


Re: [PATCH net-next RFC WIP] Patch for XDP support for virtio_net

2016-10-28 Thread Jakub Kicinski
On Fri, 28 Oct 2016 13:35:02 -0700, Alexander Duyck wrote:
> > At the same time XDP doesn't require to use 4k buffer in something like 
> > Netronome.
> > If xdp bpf program can be offloaded into HW with 1800 byte buffers, great!  
> 
> So are you saying this is only really meant to be used with a full bpf
> hardware offload then?

I think Alexei just meant to say I don't have to worry about providing
4k buffers when program is offloaded but on the host a page is the
memory granularity for XDP...


[PATCH net] net: clear sk_err_soft in sk_clone_lock()

2016-10-28 Thread Eric Dumazet
From: Eric Dumazet 

At accept() time, it is possible the parent has a non zero
sk_err_soft, leftover from a prior error.

Make sure we do not leave this value in the child, as it
makes future getsockopt(SO_ERROR) calls quite unreliable.

Signed-off-by: Eric Dumazet 
---
 net/core/sock.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/sock.c b/net/core/sock.c
index 
d8e4532e89e7c28737c95c723e5f5b3d184a7805..662ccf1c40ed1b66ee253b063dcbfbd186deccee
 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1543,6 +1543,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const 
gfp_t priority)
RCU_INIT_POINTER(newsk->sk_reuseport_cb, NULL);
 
newsk->sk_err  = 0;
+   newsk->sk_err_soft = 0;
newsk->sk_priority = 0;
newsk->sk_incoming_cpu = raw_smp_processor_id();
atomic64_set(&newsk->sk_cookie, 0);




Re: [PATCH net-next RFC WIP] Patch for XDP support for virtio_net

2016-10-28 Thread Jakub Kicinski
On Fri, 28 Oct 2016 11:22:25 -0700, Alexei Starovoitov wrote:
> On Fri, Oct 28, 2016 at 05:18:12PM +0100, Jakub Kicinski wrote:
> > On Fri, 28 Oct 2016 08:56:35 -0700, John Fastabend wrote:  
> > > On 16-10-27 07:10 PM, David Miller wrote:  
> > > > From: Alexander Duyck 
> > > > Date: Thu, 27 Oct 2016 18:43:59 -0700
> > > > 
> > > >> On Thu, Oct 27, 2016 at 6:35 PM, David Miller  
> > > >> wrote:
> > > >>> From: "Michael S. Tsirkin" 
> > > >>> Date: Fri, 28 Oct 2016 01:25:48 +0300
> > > >>>
> > >  On Thu, Oct 27, 2016 at 05:42:18PM -0400, David Miller wrote:
> > > > From: "Michael S. Tsirkin" 
> > > > Date: Fri, 28 Oct 2016 00:30:35 +0300
> > > >
> > > >> Something I'd like to understand is how does XDP address the
> > > >> problem that 100Byte packets are consuming 4K of memory now.
> > > >
> > > > Via page pools.  We're going to make a generic one, but right now
> > > > each and every driver implements a quick list of pages to allocate
> > > > from (and thus avoid the DMA man/unmap overhead, etc.)
> > > 
> > >  So to clarify, ATM virtio doesn't attempt to avoid dma map/unmap
> > >  so there should be no issue with that even when using sub/page
> > >  regions, assuming DMA APIs support sub-page map/unmap correctly.
> > > >>>
> > > >>> That's not what I said.
> > > >>>
> > > >>> The page pools are meant to address the performance degradation from
> > > >>> going to having one packet per page for the sake of XDP's
> > > >>> requirements.
> > > >>>
> > > >>> You still need to have one packet per page for correct XDP operation
> > > >>> whether you do page pools or not, and whether you have DMA mapping
> > > >>> (or it's equivalent virutalization operation) or not.
> > > >>
> > > >> Maybe I am missing something here, but why do you need to limit things
> > > >> to one packet per page for correct XDP operation?  Most of the drivers
> > > >> out there now are usually storing something closer to at least 2
> > > >> packets per page, and with the DMA API fixes I am working on there
> > > >> should be no issue with changing the contents inside those pages since
> > > >> we won't invalidate or overwrite the data after the DMA buffer has
> > > >> been synchronized for use by the CPU.
> > > > 
> > > > Because with SKB's you can share the page with other packets.
> > > > 
> > > > With XDP you simply cannot.
> > > > 
> > > > It's software semantics that are the issue.  SKB frag list pages
> > > > are read only, XDP packets are writable.
> > > > 
> > > > This has nothing to do with "writability" of the pages wrt. DMA
> > > > mapping or cpu mappings.
> > > > 
> > > 
> > > Sorry I'm not seeing it either. The current xdp_buff is defined
> > > by,
> > > 
> > >   struct xdp_buff {
> > >   void *data;
> > >   void *data_end;
> > >   };
> > > 
> > > The verifier has an xdp_is_valid_access() check to ensure we don't go
> > > past data_end. The page for now at least never leaves the driver. For
> > > the work to get xmit to other devices working I'm still not sure I see
> > > any issue.  
> > 
> > +1
> > 
> > Do we want to make the packet-per-page a requirement because it could
> > be useful in the future from architectural standpoint?  I guess there
> > is a trade-off here between having the comfort of people following this
> > requirement today and making driver support for XDP even more complex.  
>
> It looks to me that packet per page makes drivers simpler instead of complex.
> ixgbe split-page and mlx* order-3/5 tricks are definitely complex.
> The skb truesize concerns come from the host when data is delivered to user
> space and we need to have precise memory accounting for different applications
> and different users. XDP is all root and imo it's far away from the days when
> multi-user non-root issues start to pop up.
> At the same time XDP doesn't require to use 4k buffer in something like 
> Netronome.
> If xdp bpf program can be offloaded into HW with 1800 byte buffers, great!
> For x86 cpu the 4k byte is a natural allocation chunk. Anything lower requires
> delicate dma tricks paired with even more complex slab allocator and atomic 
> refcnts.
> All of that can only drive the performance down.
> Comparing to kernel bypass xdp is using 4k pages whereas dpdk has
> to use huge pages, so xdp is saving a ton of memory already!
 
Thanks a lot for explaining!  I was not aware of the user space
delivery and precise memory accounting requirements.

My understanding was that we would be driving toward some common
implementation of the recycle tricks either as part of the frag
allocator or some new mechanism of handing out mapped pages.  Basically
moving the allocation strategy decision further up the stack.

I'm not actually worried about the offload, full disclosure, I have a
XDP implementation based on the frag allocator :) (still pending
internal review), so if packet-per-page is the way forward then I'll
have revise to fli

Re: [PATCH net-next RFC WIP] Patch for XDP support for virtio_net

2016-10-28 Thread Alexander Duyck
On Fri, Oct 28, 2016 at 11:22 AM, Alexei Starovoitov
 wrote:
> On Fri, Oct 28, 2016 at 05:18:12PM +0100, Jakub Kicinski wrote:
>> On Fri, 28 Oct 2016 08:56:35 -0700, John Fastabend wrote:
>> > On 16-10-27 07:10 PM, David Miller wrote:
>> > > From: Alexander Duyck 
>> > > Date: Thu, 27 Oct 2016 18:43:59 -0700
>> > >
>> > >> On Thu, Oct 27, 2016 at 6:35 PM, David Miller  
>> > >> wrote:
>> > >>> From: "Michael S. Tsirkin" 
>> > >>> Date: Fri, 28 Oct 2016 01:25:48 +0300
>> > >>>
>> >  On Thu, Oct 27, 2016 at 05:42:18PM -0400, David Miller wrote:
>> > > From: "Michael S. Tsirkin" 
>> > > Date: Fri, 28 Oct 2016 00:30:35 +0300
>> > >
>> > >> Something I'd like to understand is how does XDP address the
>> > >> problem that 100Byte packets are consuming 4K of memory now.
>> > >
>> > > Via page pools.  We're going to make a generic one, but right now
>> > > each and every driver implements a quick list of pages to allocate
>> > > from (and thus avoid the DMA man/unmap overhead, etc.)
>> > 
>> >  So to clarify, ATM virtio doesn't attempt to avoid dma map/unmap
>> >  so there should be no issue with that even when using sub/page
>> >  regions, assuming DMA APIs support sub-page map/unmap correctly.
>> > >>>
>> > >>> That's not what I said.
>> > >>>
>> > >>> The page pools are meant to address the performance degradation from
>> > >>> going to having one packet per page for the sake of XDP's
>> > >>> requirements.
>> > >>>
>> > >>> You still need to have one packet per page for correct XDP operation
>> > >>> whether you do page pools or not, and whether you have DMA mapping
>> > >>> (or it's equivalent virutalization operation) or not.
>> > >>
>> > >> Maybe I am missing something here, but why do you need to limit things
>> > >> to one packet per page for correct XDP operation?  Most of the drivers
>> > >> out there now are usually storing something closer to at least 2
>> > >> packets per page, and with the DMA API fixes I am working on there
>> > >> should be no issue with changing the contents inside those pages since
>> > >> we won't invalidate or overwrite the data after the DMA buffer has
>> > >> been synchronized for use by the CPU.
>> > >
>> > > Because with SKB's you can share the page with other packets.
>> > >
>> > > With XDP you simply cannot.
>> > >
>> > > It's software semantics that are the issue.  SKB frag list pages
>> > > are read only, XDP packets are writable.
>> > >
>> > > This has nothing to do with "writability" of the pages wrt. DMA
>> > > mapping or cpu mappings.
>> > >
>> >
>> > Sorry I'm not seeing it either. The current xdp_buff is defined
>> > by,
>> >
>> >   struct xdp_buff {
>> > void *data;
>> > void *data_end;
>> >   };
>> >
>> > The verifier has an xdp_is_valid_access() check to ensure we don't go
>> > past data_end. The page for now at least never leaves the driver. For
>> > the work to get xmit to other devices working I'm still not sure I see
>> > any issue.
>>
>> +1
>>
>> Do we want to make the packet-per-page a requirement because it could
>> be useful in the future from architectural standpoint?  I guess there
>> is a trade-off here between having the comfort of people following this
>> requirement today and making driver support for XDP even more complex.
>
> It looks to me that packet per page makes drivers simpler instead of complex.
> ixgbe split-page and mlx* order-3/5 tricks are definitely complex.
> The skb truesize concerns come from the host when data is delivered to user
> space and we need to have precise memory accounting for different applications
> and different users. XDP is all root and imo it's far away from the days when
> multi-user non-root issues start to pop up.

Right but having XDP require 4K pages is going to hurt performance for
user space when we are using sockets.  We cannot justify killing
application performance just because we want to support XDP, and
having to alloc new memory and memcpy out of the buffers isn't going
to work as a viable workaround for this either.

> At the same time XDP doesn't require to use 4k buffer in something like 
> Netronome.
> If xdp bpf program can be offloaded into HW with 1800 byte buffers, great!

So are you saying this is only really meant to be used with a full bpf
hardware offload then?

> For x86 cpu the 4k byte is a natural allocation chunk. Anything lower requires
> delicate dma tricks paired with even more complex slab allocator and atomic 
> refcnts.
> All of that can only drive the performance down.
> Comparing to kernel bypass xdp is using 4k pages whereas dpdk has
> to use huge pages, so xdp is saving a ton of memory already!

Do you really think the page pool Jesper is talking about doing is any
different?  Half the reason why I have to implement the DMA API
changes that I am are so that the page pool can unmap a page if the
device decides to cut it from the pool without invalidating the data
written by the CPU.

If anything I think we end up

Re: [PATCH] [rfc, netfilter-next] netfilter: nf_tables: fib warnings

2016-10-28 Thread Arnd Bergmann
On Friday, October 28, 2016 6:21:49 PM CEST Florian Westphal wrote:
> Good point.  In case oif is NULL we don't have to search the result
> list for a match anyway, so we could do this (not even build tested):
> 

It didn't apply cleanly, but I've integrated it with the change to
initialize oif to NULL and the added #ifdef I had in my first version
and got a clean build.

I sent out a v2 now, but didn't try hard to understand your changes.

Arnd



[PATCH] [v2 netfilter-next] netfilter: nf_tables: fib warnings

2016-10-28 Thread Arnd Bergmann
The newly added nft fib code produces two warnings:

net/ipv4/netfilter/nft_fib_ipv4.c: In function 'nft_fib4_eval':
net/ipv4/netfilter/nft_fib_ipv4.c:80:6: error: unused variable 'i' 
[-Werror=unused-variable]
net/ipv4/netfilter/nft_fib_ipv4.c: In function ‘nft_fib4_eval’:
net/ipv4/netfilter/nft_fib_ipv4.c:137:6: error: ‘oif’ may be used uninitialized 
in this function [-Werror=maybe-uninitialized]

The first one is obvious as the only user of that variable is
inside of an #ifdef

The second one is a bit trickier. It's clear that oif is in fact
uninitialized when it gets used when neither NFTA_FIB_F_IIF nor
NFTA_FIB_F_OIF are set, and just setting it to NULL won't work
as it may later get dereferenced.

However, there is no need to search the result list if it is
NULL, as Florian pointed out. This integrates his (untested)
change to do so. I have confirmed that the combined patch
solves both warnings, but as I don't fully understand Florian's
change, I can't tell if it's correct.

Suggested-by: Florian Westphal 
Fixes: 84f5eedb983e ("netfilter: nf_tables: add fib expression")
Signed-off-by: Arnd Bergmann 
---
v2: integrate changes that Florian suggested
---
 net/ipv4/netfilter/nft_fib_ipv4.c | 23 ++-
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/netfilter/nft_fib_ipv4.c 
b/net/ipv4/netfilter/nft_fib_ipv4.c
index 6787c563cfc9..db91fd42db67 100644
--- a/net/ipv4/netfilter/nft_fib_ipv4.c
+++ b/net/ipv4/netfilter/nft_fib_ipv4.c
@@ -77,7 +77,9 @@ void nft_fib4_eval(const struct nft_expr *expr, struct 
nft_regs *regs,
};
const struct net_device *oif;
struct net_device *found;
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
int i;
+#endif
 
/*
 * Do not set flowi4_oif, it restricts results (for example, asking
@@ -90,6 +92,8 @@ void nft_fib4_eval(const struct nft_expr *expr, struct 
nft_regs *regs,
oif = pkt->out;
else if (priv->flags & NFTA_FIB_F_IIF)
oif = pkt->in;
+   else
+   oif = NULL;
 
if (pkt->hook == NF_INET_PRE_ROUTING && fib4_is_local(pkt->skb)) {
nft_fib_store_result(dest, priv->result, pkt, LOOPBACK_IFINDEX);
@@ -130,6 +134,11 @@ void nft_fib4_eval(const struct nft_expr *expr, struct 
nft_regs *regs,
break;
}
 
+   if (!oif) {
+   found = FIB_RES_DEV(res);
+   goto ok;
+   }
+
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
for (i = 0; i < res.fi->fib_nhs; i++) {
struct fib_nh *nh = &res.fi->fib_nh[i];
@@ -139,16 +148,12 @@ void nft_fib4_eval(const struct nft_expr *expr, struct 
nft_regs *regs,
goto ok;
}
}
-#endif
-   if (priv->flags & NFTA_FIB_F_OIF) {
-   found = FIB_RES_DEV(res);
-   if (found == oif)
-   goto ok;
-   return;
-   }
-
-   *dest = FIB_RES_DEV(res)->ifindex;
return;
+#else
+   found = FIB_RES_DEV(res);
+   if (found != oif)
+   return;
+#endif
 ok:
switch (priv->result) {
case NFT_FIB_RESULT_OIF:
-- 
2.9.0



Re: [PATCH] net: phy: at803x: the Atheros 8031 supports pause frames

2016-10-28 Thread Timur Tabi

Florian Fainelli wrote:

On 10/27/2016 03:24 PM, Timur Tabi wrote:

Florian Fainelli wrote:


Hu? In my experience that should not come from supporting Pause frames
or not, but rather properly configuring a (RG)MII delay, but your
mileage may vary.


I can assure you, I'm more confused than you.  I've been working in this
for almost two weeks, and not only does this patch align with other phy
drivers, but it does fix my bug.  Without this patch, phylib does not
set the pause frame bits (10 and 11) in MII_ADVERTISE.


And that's expected, but if your MAC does not act upon phydev->pause and
phydev->asym_pause, then chances are that you can indeed lose packets
every once in a while.


Can you give me more details on that?  I'm not really an expert on our 
MAC, so I'm not sure how it's supposed to work.  The MAC supports the 
concept of "flow control".  The non-upstream version of my driver reads 
PHY register 0x11, which apparently is a PHY-specific status register 
that says whether or not "transmit pause" and "receive pause" is 
enabled.  (The AT8031 datasheet is here, if you'd like to read it: 
http://www.datasheet-pdf.com/datasheet/AtherosCommunications/734720/AR8031.pdf.html).


If both are enabled in the PHY, then the driver enables the same 
features in the MAC.  Unfortunately, I don't have any documentation that 
explains that that really means.  I've tried enabling and disabling this 
feature in the MAC, and it makes no difference.  You would think that if 
the feature is disabled in both the MAC and the PHY, then no packets 
would be lost, but that's not the case.



The part that is totally crappy about Pause frames and PHYLIB is that we
need some information about whether this should be supported or not,
such that we can change MII_ADVERTISE accordingly to reflect that, and
what best way to do this than use one of these SUPPORTED_* bits to set
that, except, that unlike speed (which is both a MAC and PHY property),
Pause is exclusively MAC, yet, it transpires in PHYLIB.


Then what do those bits in register 0x11 do?  If I don't set them, I 
lose packets, no matter how my MAC is programmed.


It's like that joke, "Doctor, if I hold my arm like this, then it 
hurts!", "Well, then don't hold your arm like that."  If I don't program 
those PHY register bits, then my hardware doesn't work.



MACs that I work with for instance need to be told to ignore pause
frames, or not ignore them, it all depends on what you want to
advertise/support.


That's the problem, I don't know what I "want", except that I want my 
hardware to work. :-)  99% of my knowledge of the hardware comes from 
scanning the source code of the internal version of the driver that 1) 
does not support phylib,  and 2) is hard-coded to initialize the Atheros 
8031 PHY.



It does not, support for Pause frames is a MAC-level feature, yet,
PHYLIB (and that's been on my todo for a while now) insists on reporting
the confusing phydev->pause and phydev->asym_pause, which really is what
has been determined from auto-negotiating with your partner, as opposed
to being a supported thing or not. The PHY has absolutely no business
in that.


But there are pause frame bits in the MII_ADVERTISE register, and this
setting directly impacts whether those bits are set.  I don't see how
this is a MAC-level feature.


This is a MAC level feature, that needs to be auto-negotiated with your
link partner, which is why there is room for this in MII_ADVERTISE, but
the PHY absolutely does not participate in Pause frames, other than
passing them through the MAC, like normal frames. Whether your MAC acts
upon that or not is a MAC dependent feature.


Ok, to me that means that the PHY driver must tell phylib whether or not 
it supports pause frames.  The question becomes:


1) Is my patch correct?
2) Is my patch necessary?
3) Is my patch sufficient?

1) I believe my patch is correct, because many other PHY drivers do the 
same thing for the same reason.  If the PHY supports register 4 bits 10 
and 11, then those SUPPORTED_xxx bits should be set (see the sketch below).


2) I don't know why my patch appears to be necessary, because I don't 
understand why a 1Gb NIC on a 2GHz processor drops frames.  I suspect 
the problem is not performance but rather a mis-programming.


3) Is my patch sufficient?  The internal driver always enables pause 
frame support in the PHY if the PHY says that pause frames are enabled. 
 In fact, I can't even turn off flow control in the internal driver:


$ sudo ethtool -A eth1 tx off rx off
$ sudo ethtool -a eth1
Pause parameters for eth1:
Autonegotiate:  on
RX: on
TX: on

The driver insists on leaving flow control enabled if autonegotiation is 
enabled.
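
(Regarding 1) above: the shape of the change being discussed is the
usual one.  A sketch only, mirroring the BCM54810 phy_driver entry
quoted earlier in this digest; everything else in the at803x entry stays
as it is in drivers/net/phy/at803x.c:)

#include <linux/phy.h>

static struct phy_driver at8031_example = {
	/* phy_id, phy_id_mask, callbacks, etc. unchanged from at803x.c */
	.features	= PHY_GBIT_FEATURES | SUPPORTED_Pause |
			  SUPPORTED_Asym_Pause,
};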



Yes I am sure you should set these bits, the documentation is not
necessarily the authoritative source here (although it should be). What
this was probably meant to say is something like: don't set any bits
that are not defined in ethtool.h, i.e: do not use this to store
persistent st

Re: [PATCH net 3/3] sctp: hold transport instead of assoc when lookup assoc in rx path

2016-10-28 Thread Marcelo Ricardo Leitner
On Fri, Oct 28, 2016 at 06:10:54PM +0800, Xin Long wrote:
> Prior to this patch, in the rx path, before calling lock_sock, it needed to
> hold the assoc obtained from __sctp_lookup_association, in case some other
> place would free/put the assoc.
> 
> But in __sctp_lookup_association, it looks up and holds the transport, then
> gets the assoc via transport->asoc, then holds the assoc and puts the
> transport. This means it didn't hold the transport, yet the transport was
> returned and later on directly assigned to chunk->transport.
> 
> Without the protection of sock lock, the transport may be freed/put by
> other places, which would cause a use-after-free issue.
> 
> This patch is to fix this issue by holding transport instead of assoc.
> As holding transport can make sure to access assoc is also safe, and
> actually it looks up assoc by searching transport rhashtable, to hold
> transport here makes more sense.
> 
> Note that the function will be renamed later on on another patch.
> 
> Signed-off-by: Xin Long 

Acked-by: Marcelo Ricardo Leitner 

> ---
>  include/net/sctp/sctp.h |  2 +-
>  net/sctp/input.c| 32 
>  net/sctp/ipv6.c |  2 +-
>  3 files changed, 18 insertions(+), 18 deletions(-)
> 
> diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
> index 87a7f42..31acc3f 100644
> --- a/include/net/sctp/sctp.h
> +++ b/include/net/sctp/sctp.h
> @@ -152,7 +152,7 @@ void sctp_unhash_endpoint(struct sctp_endpoint *);
>  struct sock *sctp_err_lookup(struct net *net, int family, struct sk_buff *,
>struct sctphdr *, struct sctp_association **,
>struct sctp_transport **);
> -void sctp_err_finish(struct sock *, struct sctp_association *);
> +void sctp_err_finish(struct sock *, struct sctp_transport *);
>  void sctp_icmp_frag_needed(struct sock *, struct sctp_association *,
>  struct sctp_transport *t, __u32 pmtu);
>  void sctp_icmp_redirect(struct sock *, struct sctp_transport *,
> diff --git a/net/sctp/input.c b/net/sctp/input.c
> index 8e0bc58..a01a56e 100644
> --- a/net/sctp/input.c
> +++ b/net/sctp/input.c
> @@ -181,9 +181,10 @@ int sctp_rcv(struct sk_buff *skb)
>* bound to another interface, via SO_BINDTODEVICE, treat it as OOTB
>*/
>   if (sk->sk_bound_dev_if && (sk->sk_bound_dev_if != af->skb_iif(skb))) {
> - if (asoc) {
> - sctp_association_put(asoc);
> + if (transport) {
> + sctp_transport_put(transport);
>   asoc = NULL;
> + transport = NULL;
>   } else {
>   sctp_endpoint_put(ep);
>   ep = NULL;
> @@ -269,8 +270,8 @@ int sctp_rcv(struct sk_buff *skb)
>   bh_unlock_sock(sk);
>  
>   /* Release the asoc/ep ref we took in the lookup calls. */
> - if (asoc)
> - sctp_association_put(asoc);
> + if (transport)
> + sctp_transport_put(transport);
>   else
>   sctp_endpoint_put(ep);
>  
> @@ -283,8 +284,8 @@ int sctp_rcv(struct sk_buff *skb)
>  
>  discard_release:
>   /* Release the asoc/ep ref we took in the lookup calls. */
> - if (asoc)
> - sctp_association_put(asoc);
> + if (transport)
> + sctp_transport_put(transport);
>   else
>   sctp_endpoint_put(ep);
>  
> @@ -300,6 +301,7 @@ int sctp_backlog_rcv(struct sock *sk, struct sk_buff *skb)
>  {
>   struct sctp_chunk *chunk = SCTP_INPUT_CB(skb)->chunk;
>   struct sctp_inq *inqueue = &chunk->rcvr->inqueue;
> + struct sctp_transport *t = chunk->transport;
>   struct sctp_ep_common *rcvr = NULL;
>   int backloged = 0;
>  
> @@ -351,7 +353,7 @@ int sctp_backlog_rcv(struct sock *sk, struct sk_buff *skb)
>  done:
>   /* Release the refs we took in sctp_add_backlog */
>   if (SCTP_EP_TYPE_ASSOCIATION == rcvr->type)
> - sctp_association_put(sctp_assoc(rcvr));
> + sctp_transport_put(t);
>   else if (SCTP_EP_TYPE_SOCKET == rcvr->type)
>   sctp_endpoint_put(sctp_ep(rcvr));
>   else
> @@ -363,6 +365,7 @@ int sctp_backlog_rcv(struct sock *sk, struct sk_buff *skb)
>  static int sctp_add_backlog(struct sock *sk, struct sk_buff *skb)
>  {
>   struct sctp_chunk *chunk = SCTP_INPUT_CB(skb)->chunk;
> + struct sctp_transport *t = chunk->transport;
>   struct sctp_ep_common *rcvr = chunk->rcvr;
>   int ret;
>  
> @@ -373,7 +376,7 @@ static int sctp_add_backlog(struct sock *sk, struct 
> sk_buff *skb)
>* from us
>*/
>   if (SCTP_EP_TYPE_ASSOCIATION == rcvr->type)
> - sctp_association_hold(sctp_assoc(rcvr));
> + sctp_transport_hold(t);
>   else if (SCTP_EP_TYPE_SOCKET == rcvr->type)
>   sctp_endpoint_hold(sctp_ep(rcvr));
>   else
> @@ -537,15 +540,15 @@ struct sock *sctp_err_lookup(struct net *net, int 
> fami

[patch net-next 16/16] mlxsw: switchib: Introduce SwitchIB and SwitchIB silicon driver

2016-10-28 Thread Jiri Pirko
From: Elad Raz 

SwitchIB and SwitchIB-2 are Infiniband switches with up to 36 ports. This
driver initializes the hardware and firmware, which implement the IB
management and the connection with the SM.

Signed-off-by: Elad Raz 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/Kconfig|  11 +
 drivers/net/ethernet/mellanox/mlxsw/Makefile   |   2 +
 drivers/net/ethernet/mellanox/mlxsw/pci.h  |   2 +
 drivers/net/ethernet/mellanox/mlxsw/port.h |   3 +
 drivers/net/ethernet/mellanox/mlxsw/switchib.c | 598 +
 5 files changed, 616 insertions(+)
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/switchib.c

diff --git a/drivers/net/ethernet/mellanox/mlxsw/Kconfig 
b/drivers/net/ethernet/mellanox/mlxsw/Kconfig
index d0bcf59..bac2e5e 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlxsw/Kconfig
@@ -29,6 +29,17 @@ config MLXSW_PCI
  To compile this driver as a module, choose M here: the
  module will be called mlxsw_pci.
 
+config MLXSW_SWITCHIB
+   tristate "Mellanox Technologies SwitchIB and SwitchIB-2 support"
+   depends on MLXSW_CORE && NET_SWITCHDEV
+   default m
+   ---help---
+ This driver supports Mellanox Technologies SwitchIB and SwitchIB-2
+ Infiniband Switch ASICs.
+
+ To compile this driver as a module, choose M here: the
+ module will be called mlxsw_switchib.
+
 config MLXSW_SWITCHX2
tristate "Mellanox Technologies SwitchX-2 support"
depends on MLXSW_CORE && MLXSW_PCI && NET_SWITCHDEV
diff --git a/drivers/net/ethernet/mellanox/mlxsw/Makefile 
b/drivers/net/ethernet/mellanox/mlxsw/Makefile
index d20ae18..badef87 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/Makefile
+++ b/drivers/net/ethernet/mellanox/mlxsw/Makefile
@@ -3,6 +3,8 @@ mlxsw_core-objs := core.o
 mlxsw_core-$(CONFIG_MLXSW_CORE_HWMON) += core_hwmon.o
 obj-$(CONFIG_MLXSW_PCI)+= mlxsw_pci.o
 mlxsw_pci-objs := pci.o
+obj-$(CONFIG_MLXSW_SWITCHIB)   += mlxsw_switchib.o
+mlxsw_switchib-objs:= switchib.o
 obj-$(CONFIG_MLXSW_SWITCHX2)   += mlxsw_switchx2.o
 mlxsw_switchx2-objs:= switchx2.o
 obj-$(CONFIG_MLXSW_SPECTRUM)   += mlxsw_spectrum.o
diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.h 
b/drivers/net/ethernet/mellanox/mlxsw/pci.h
index 35a0011..d655823 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/pci.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/pci.h
@@ -39,6 +39,8 @@
 
 #define PCI_DEVICE_ID_MELLANOX_SWITCHX2    0xc738
 #define PCI_DEVICE_ID_MELLANOX_SPECTRUM    0xcb84
+#define PCI_DEVICE_ID_MELLANOX_SWITCHIB    0xcb20
+#define PCI_DEVICE_ID_MELLANOX_SWITCHIB2   0xcf08
 
 #if IS_ENABLED(CONFIG_MLXSW_PCI)
 
diff --git a/drivers/net/ethernet/mellanox/mlxsw/port.h 
b/drivers/net/ethernet/mellanox/mlxsw/port.h
index 7296071..3d42146 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/port.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/port.h
@@ -52,6 +52,9 @@
 #define MLXSW_PORT_MAX_PHY_PORTS   0x40
 #define MLXSW_PORT_MAX_PORTS   (MLXSW_PORT_MAX_PHY_PORTS + 1)
 
+#define MLXSW_PORT_MAX_IB_PHY_PORTS    36
+#define MLXSW_PORT_MAX_IB_PORTS        (MLXSW_PORT_MAX_IB_PHY_PORTS + 1)
+
 #define MLXSW_PORT_DEVID_BITS_OFFSET   10
 #define MLXSW_PORT_PHY_BITS_OFFSET 4
 #define MLXSW_PORT_PHY_BITS_MASK   (MLXSW_PORT_MAX_PHY_PORTS - 1)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchib.c 
b/drivers/net/ethernet/mellanox/mlxsw/switchib.c
new file mode 100644
index 000..ec0b27e
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlxsw/switchib.c
@@ -0,0 +1,598 @@
+/*
+ * drivers/net/ethernet/mellanox/mlxsw/switchib.c
+ * Copyright (c) 2016 Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2016 Elad Raz 
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in the
+ *documentation and/or other materials provided with the distribution.
+ * 3. Neither the names of the copyright holders nor the names of its
+ *contributors may be used to endorse or promote products derived from
+ *this software without specific prior written permission.
+ *
+ * Alternatively, this software may be distributed under the terms of the
+ * GNU General Public License ("GPL") version 2 as published by the Free
+ * Software Foundation.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTI

[patch net-next 15/16] mlxsw: switchx2: Add IB port support

2016-10-28 Thread Jiri Pirko
From: Elad Raz 

SwitchX-2 is an IB-capable device. This patch adds support for changing the
port type between Ethernet and Infiniband.

When the port is set to IB, the FW implements the Subnet Management Agent
(SMA) that manages the port. All port attributes can be controlled remotely
by the SM.

Usage:
$ devlink port show
pci/0000:03:00.0/1: type eth netdev eth0
pci/0000:03:00.0/3: type eth netdev eth1
pci/0000:03:00.0/5: type eth netdev eth2
pci/0000:03:00.0/6: type eth netdev eth3
pci/0000:03:00.0/8: type eth netdev eth4

$ devlink port set pci/0000:03:00.0/1 type ib

$ devlink port show
pci/0000:03:00.0/1: type ib

Signed-off-by: Elad Raz 
Reviewed-by: Jiri Pirko 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/ib.h   |  39 +
 drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 229 +++--
 2 files changed, 249 insertions(+), 19 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/ib.h

diff --git a/drivers/net/ethernet/mellanox/mlxsw/ib.h 
b/drivers/net/ethernet/mellanox/mlxsw/ib.h
new file mode 100644
index 000..ce313aa
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlxsw/ib.h
@@ -0,0 +1,39 @@
+/*
+ * drivers/net/ethernet/mellanox/mlxsw/ib.h
+ * Copyright (c) 2016 Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2016 Elad Raz 
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in the
+ *documentation and/or other materials provided with the distribution.
+ * 3. Neither the names of the copyright holders nor the names of its
+ *contributors may be used to endorse or promote products derived from
+ *this software without specific prior written permission.
+ *
+ * Alternatively, this software may be distributed under the terms of the
+ * GNU General Public License ("GPL") version 2 as published by the Free
+ * Software Foundation.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
+ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+ * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+ * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ */
+#ifndef _MLXSW_IB_H
+#define _MLXSW_IB_H
+
+#define MLXSW_IB_DEFAULT_MTU 4096
+
+#endif /* _MLXSW_IB_H */
diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c 
b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
index 3aa0948..5208764 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
@@ -3,7 +3,7 @@
  * Copyright (c) 2015 Mellanox Technologies. All rights reserved.
  * Copyright (c) 2015 Jiri Pirko 
  * Copyright (c) 2015 Ido Schimmel 
- * Copyright (c) 2015 Elad Raz 
+ * Copyright (c) 2015-2016 Elad Raz 
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
@@ -53,6 +53,7 @@
 #include "port.h"
 #include "trap.h"
 #include "txheader.h"
+#include "ib.h"
 
 static const char mlxsw_sx_driver_name[] = "mlxsw_switchx2";
 static const char mlxsw_sx_driver_version[] = "1.0";
@@ -218,14 +219,14 @@ static int mlxsw_sx_port_oper_status_get(struct 
mlxsw_sx_port *mlxsw_sx_port,
return 0;
 }
 
-static int mlxsw_sx_port_mtu_set(struct mlxsw_sx_port *mlxsw_sx_port, u16 mtu)
+static int __mlxsw_sx_port_mtu_set(struct mlxsw_sx_port *mlxsw_sx_port,
+  u16 mtu)
 {
struct mlxsw_sx *mlxsw_sx = mlxsw_sx_port->mlxsw_sx;
char pmtu_pl[MLXSW_REG_PMTU_LEN];
int max_mtu;
int err;
 
-   mtu += MLXSW_TXHDR_LEN + ETH_HLEN;
mlxsw_reg_pmtu_pack(pmtu_pl, mlxsw_sx_port->local_port, 0);
err = mlxsw_reg_query(mlxsw_sx->core, MLXSW_REG(pmtu), pmtu_pl);
if (err)
@@ -239,6 +240,32 @@ static int mlxsw_sx_port_mtu_set(struct mlxsw_sx_port 
*mlxsw_sx_port, u16 mtu)
return mlxsw_reg_write(mlxsw_sx->core, MLXSW_REG(pmtu), pmtu_pl);
 }
 
+static int mlxsw_sx_port_mtu_eth_set(struct mlxsw_sx_port *mlxsw_sx_port,
+ 

[patch net-next 14/16] mlxsw: switchx2: Add eth prefix to port create and remove

2016-10-28 Thread Jiri Pirko
From: Elad Raz 

Since we are about to add Infiniband port create and remove, add an "eth"
prefix to the existing port create and remove APIs.

Signed-off-by: Elad Raz 
Reviewed-by: Jiri Pirko 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c 
b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
index d110e3f..3aa0948 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
@@ -994,8 +994,8 @@ mlxsw_sx_port_mac_learning_mode_set(struct mlxsw_sx_port 
*mlxsw_sx_port,
return mlxsw_reg_write(mlxsw_sx->core, MLXSW_REG(spmlr), spmlr_pl);
 }
 
-static int __mlxsw_sx_port_create(struct mlxsw_sx *mlxsw_sx, u8 local_port,
- u8 module, u8 width)
+static int __mlxsw_sx_port_eth_create(struct mlxsw_sx *mlxsw_sx, u8 local_port,
+ u8 module, u8 width)
 {
struct mlxsw_sx_port *mlxsw_sx_port;
struct net_device *dev;
@@ -1119,8 +1119,8 @@ static int __mlxsw_sx_port_create(struct mlxsw_sx 
*mlxsw_sx, u8 local_port,
return err;
 }
 
-static int mlxsw_sx_port_create(struct mlxsw_sx *mlxsw_sx, u8 local_port,
-   u8 module, u8 width)
+static int mlxsw_sx_port_eth_create(struct mlxsw_sx *mlxsw_sx, u8 local_port,
+   u8 module, u8 width)
 {
int err;
 
@@ -1130,7 +1130,7 @@ static int mlxsw_sx_port_create(struct mlxsw_sx 
*mlxsw_sx, u8 local_port,
local_port);
return err;
}
-   err = __mlxsw_sx_port_create(mlxsw_sx, local_port, module, width);
+   err = __mlxsw_sx_port_eth_create(mlxsw_sx, local_port, module, width);
if (err)
goto err_port_create;
 
@@ -1141,7 +1141,7 @@ static int mlxsw_sx_port_create(struct mlxsw_sx 
*mlxsw_sx, u8 local_port,
return err;
 }
 
-static void __mlxsw_sx_port_remove(struct mlxsw_sx *mlxsw_sx, u8 local_port)
+static void __mlxsw_sx_port_eth_remove(struct mlxsw_sx *mlxsw_sx, u8 
local_port)
 {
struct mlxsw_sx_port *mlxsw_sx_port = mlxsw_sx->ports[local_port];
 
@@ -1153,9 +1153,9 @@ static void __mlxsw_sx_port_remove(struct mlxsw_sx 
*mlxsw_sx, u8 local_port)
free_netdev(mlxsw_sx_port->dev);
 }
 
-static void mlxsw_sx_port_remove(struct mlxsw_sx *mlxsw_sx, u8 local_port)
+static void mlxsw_sx_port_eth_remove(struct mlxsw_sx *mlxsw_sx, u8 local_port)
 {
-   __mlxsw_sx_port_remove(mlxsw_sx, local_port);
+   __mlxsw_sx_port_eth_remove(mlxsw_sx, local_port);
mlxsw_core_port_fini(mlxsw_sx->core, local_port);
 }
 
@@ -1170,7 +1170,7 @@ static void mlxsw_sx_ports_remove(struct mlxsw_sx 
*mlxsw_sx)
 
for (i = 1; i < MLXSW_PORT_MAX_PORTS; i++)
if (mlxsw_sx_port_created(mlxsw_sx, i))
-   mlxsw_sx_port_remove(mlxsw_sx, i);
+   mlxsw_sx_port_eth_remove(mlxsw_sx, i);
kfree(mlxsw_sx->ports);
 }
 
@@ -1193,7 +1193,7 @@ static int mlxsw_sx_ports_create(struct mlxsw_sx 
*mlxsw_sx)
goto err_port_module_info_get;
if (!width)
continue;
-   err = mlxsw_sx_port_create(mlxsw_sx, i, module, width);
+   err = mlxsw_sx_port_eth_create(mlxsw_sx, i, module, width);
if (err)
goto err_port_create;
}
@@ -1203,7 +1203,7 @@ static int mlxsw_sx_ports_create(struct mlxsw_sx 
*mlxsw_sx)
 err_port_module_info_get:
for (i--; i >= 1; i--)
if (mlxsw_sx_port_created(mlxsw_sx, i))
-   mlxsw_sx_port_remove(mlxsw_sx, i);
+   mlxsw_sx_port_eth_remove(mlxsw_sx, i);
kfree(mlxsw_sx->ports);
return err;
 }
-- 
2.5.5



[patch net-next 13/16] mlxsw: core: Add port type (Eth/IB) set API

2016-10-28 Thread Jiri Pirko
From: Elad Raz 

Add "port_type_set" API to mlxsw core. The core layer send the change type
callback to the port along with it's private information.
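
For illustration only, a driver implementing the new callback could dispatch
roughly as follows. This is a minimal sketch, not code from this series;
mlxsw_xx_port_type_set and the eth/ib create and remove helpers are
hypothetical names:

static int mlxsw_xx_port_type_set(struct mlxsw_core *mlxsw_core, u8 local_port,
				  enum devlink_port_type new_type)
{
	struct mlxsw_xx *mlxsw_xx = mlxsw_core_driver_priv(mlxsw_core);

	/* tear down the current port instance, then create one of the new type */
	__mlxsw_xx_port_remove(mlxsw_xx, local_port);
	switch (new_type) {
	case DEVLINK_PORT_TYPE_ETH:
		return __mlxsw_xx_port_eth_create(mlxsw_xx, local_port);
	case DEVLINK_PORT_TYPE_IB:
		return __mlxsw_xx_port_ib_create(mlxsw_xx, local_port);
	default:
		return -EOPNOTSUPP;
	}
}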

Signed-off-by: Elad Raz 
Reviewed-by: Jiri Pirko 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/core.c | 41 ++
 drivers/net/ethernet/mellanox/mlxsw/core.h |  6 +
 2 files changed, 47 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c 
b/drivers/net/ethernet/mellanox/mlxsw/core.c
index dcd7202..6004817 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -93,6 +93,7 @@ struct mlxsw_core_pcpu_stats {
 struct mlxsw_core_port {
struct devlink_port devlink_port;
void *port_driver_priv;
+   u8 local_port;
 };
 
 void *mlxsw_core_port_driver_priv(struct mlxsw_core_port *mlxsw_core_port)
@@ -937,6 +938,21 @@ static void *__dl_port(struct devlink_port *devlink_port)
return container_of(devlink_port, struct mlxsw_core_port, devlink_port);
 }
 
+static int mlxsw_devlink_port_type_set(struct devlink_port *devlink_port,
+  enum devlink_port_type port_type)
+{
+   struct mlxsw_core *mlxsw_core = devlink_priv(devlink_port->devlink);
+   struct mlxsw_driver *mlxsw_driver = mlxsw_core->driver;
+   struct mlxsw_core_port *mlxsw_core_port = __dl_port(devlink_port);
+
+   if (!mlxsw_driver->port_type_set)
+   return -EOPNOTSUPP;
+
+   return mlxsw_driver->port_type_set(mlxsw_core,
+  mlxsw_core_port->local_port,
+  port_type);
+}
+
 static int mlxsw_devlink_sb_port_pool_get(struct devlink_port *devlink_port,
  unsigned int sb_index, u16 pool_index,
  u32 *p_threshold)
@@ -1060,6 +1076,7 @@ mlxsw_devlink_sb_occ_tc_port_bind_get(struct devlink_port 
*devlink_port,
 }
 
 static const struct devlink_ops mlxsw_devlink_ops = {
+   .port_type_set  = mlxsw_devlink_port_type_set,
.port_split = mlxsw_devlink_port_split,
.port_unsplit   = mlxsw_devlink_port_unsplit,
.sb_pool_get= mlxsw_devlink_sb_pool_get,
@@ -1687,6 +1704,7 @@ int mlxsw_core_port_init(struct mlxsw_core *mlxsw_core, 
u8 local_port)
struct devlink_port *devlink_port = &mlxsw_core_port->devlink_port;
int err;
 
+   mlxsw_core_port->local_port = local_port;
err = devlink_port_register(devlink, devlink_port, local_port);
if (err)
memset(mlxsw_core_port, 0, sizeof(*mlxsw_core_port));
@@ -1720,6 +1738,18 @@ void mlxsw_core_port_eth_set(struct mlxsw_core 
*mlxsw_core, u8 local_port,
 }
 EXPORT_SYMBOL(mlxsw_core_port_eth_set);
 
+void mlxsw_core_port_ib_set(struct mlxsw_core *mlxsw_core, u8 local_port,
+   void *port_driver_priv)
+{
+   struct mlxsw_core_port *mlxsw_core_port =
+   &mlxsw_core->ports[local_port];
+   struct devlink_port *devlink_port = &mlxsw_core_port->devlink_port;
+
+   mlxsw_core_port->port_driver_priv = port_driver_priv;
+   devlink_port_type_ib_set(devlink_port, NULL);
+}
+EXPORT_SYMBOL(mlxsw_core_port_ib_set);
+
 void mlxsw_core_port_clear(struct mlxsw_core *mlxsw_core, u8 local_port,
   void *port_driver_priv)
 {
@@ -1732,6 +1762,17 @@ void mlxsw_core_port_clear(struct mlxsw_core 
*mlxsw_core, u8 local_port,
 }
 EXPORT_SYMBOL(mlxsw_core_port_clear);
 
+enum devlink_port_type mlxsw_core_port_type_get(struct mlxsw_core *mlxsw_core,
+   u8 local_port)
+{
+   struct mlxsw_core_port *mlxsw_core_port =
+   &mlxsw_core->ports[local_port];
+   struct devlink_port *devlink_port = &mlxsw_core_port->devlink_port;
+
+   return devlink_port->type;
+}
+EXPORT_SYMBOL(mlxsw_core_port_type_get);
+
 static void mlxsw_core_buf_dump_dbg(struct mlxsw_core *mlxsw_core,
const char *buf, size_t size)
 {
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.h 
b/drivers/net/ethernet/mellanox/mlxsw/core.h
index 82866d5..c0acc1b 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.h
@@ -148,8 +148,12 @@ void mlxsw_core_port_fini(struct mlxsw_core *mlxsw_core, 
u8 local_port);
 void mlxsw_core_port_eth_set(struct mlxsw_core *mlxsw_core, u8 local_port,
 void *port_driver_priv, struct net_device *dev,
 bool split, u32 split_group);
+void mlxsw_core_port_ib_set(struct mlxsw_core *mlxsw_core, u8 local_port,
+   void *port_driver_priv);
 void mlxsw_core_port_clear(struct mlxsw_core *mlxsw_core, u8 local_port,
   void *port_driver_priv);

[patch net-next 12/16] mlxsw: core: Add "eth" prefix to mlxsw_core_port_set

2016-10-28 Thread Jiri Pirko
From: Elad Raz 

Since we are about to introduce IB port APIs, we will add prefixes to
existing APIs.

Signed-off-by: Elad Raz 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/core.c | 8 
 drivers/net/ethernet/mellanox/mlxsw/core.h | 6 +++---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 5 +++--
 drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 4 ++--
 4 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c 
b/drivers/net/ethernet/mellanox/mlxsw/core.c
index b50acb1..dcd7202 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -1705,9 +1705,9 @@ void mlxsw_core_port_fini(struct mlxsw_core *mlxsw_core, 
u8 local_port)
 }
 EXPORT_SYMBOL(mlxsw_core_port_fini);
 
-void mlxsw_core_port_set(struct mlxsw_core *mlxsw_core, u8 local_port,
-void *port_driver_priv, struct net_device *dev,
-bool split, u32 split_group)
+void mlxsw_core_port_eth_set(struct mlxsw_core *mlxsw_core, u8 local_port,
+void *port_driver_priv, struct net_device *dev,
+bool split, u32 split_group)
 {
struct mlxsw_core_port *mlxsw_core_port =
&mlxsw_core->ports[local_port];
@@ -1718,7 +1718,7 @@ void mlxsw_core_port_set(struct mlxsw_core *mlxsw_core, 
u8 local_port,
devlink_port_split_set(devlink_port, split_group);
devlink_port_type_eth_set(devlink_port, dev);
 }
-EXPORT_SYMBOL(mlxsw_core_port_set);
+EXPORT_SYMBOL(mlxsw_core_port_eth_set);
 
 void mlxsw_core_port_clear(struct mlxsw_core *mlxsw_core, u8 local_port,
   void *port_driver_priv)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.h 
b/drivers/net/ethernet/mellanox/mlxsw/core.h
index f4bbbd4..82866d5 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.h
@@ -145,9 +145,9 @@ void mlxsw_core_lag_mapping_clear(struct mlxsw_core 
*mlxsw_core,
 void *mlxsw_core_port_driver_priv(struct mlxsw_core_port *mlxsw_core_port);
 int mlxsw_core_port_init(struct mlxsw_core *mlxsw_core, u8 local_port);
 void mlxsw_core_port_fini(struct mlxsw_core *mlxsw_core, u8 local_port);
-void mlxsw_core_port_set(struct mlxsw_core *mlxsw_core, u8 local_port,
-void *port_driver_priv, struct net_device *dev,
-bool split, u32 split_group);
+void mlxsw_core_port_eth_set(struct mlxsw_core *mlxsw_core, u8 local_port,
+void *port_driver_priv, struct net_device *dev,
+bool split, u32 split_group);
 void mlxsw_core_port_clear(struct mlxsw_core *mlxsw_core, u8 local_port,
   void *port_driver_priv);
 
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index ad4ff27..8bca020 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -2358,8 +2358,9 @@ static int __mlxsw_sp_port_create(struct mlxsw_sp 
*mlxsw_sp, u8 local_port,
goto err_register_netdev;
}
 
-   mlxsw_core_port_set(mlxsw_sp->core, mlxsw_sp_port->local_port,
-   mlxsw_sp_port, dev, mlxsw_sp_port->split, module);
+   mlxsw_core_port_eth_set(mlxsw_sp->core, mlxsw_sp_port->local_port,
+   mlxsw_sp_port, dev, mlxsw_sp_port->split,
+   module);
mlxsw_core_schedule_dw(&mlxsw_sp_port->hw_stats.update_dw, 0);
return 0;
 
diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c 
b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
index a905589..d110e3f 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
@@ -1098,8 +1098,8 @@ static int __mlxsw_sx_port_create(struct mlxsw_sx 
*mlxsw_sx, u8 local_port,
goto err_register_netdev;
}
 
-   mlxsw_core_port_set(mlxsw_sx->core, mlxsw_sx_port->local_port,
-   mlxsw_sx_port, dev, false, 0);
+   mlxsw_core_port_eth_set(mlxsw_sx->core, mlxsw_sx_port->local_port,
+   mlxsw_sx_port, dev, false, 0);
mlxsw_sx->ports[local_port] = mlxsw_sx_port;
return 0;
 
-- 
2.5.5



[patch net-next 11/16] mlxsw: switchx2: Add Infiniband switch partition

2016-10-28 Thread Jiri Pirko
From: Elad Raz 

In order to put a port into the Infiniband fabric, it should be assigned
to a separate swid (switch partition) that is initialized as an IB swid.

Signed-off-by: Elad Raz 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/port.h | 1 +
 drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 6 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/port.h 
b/drivers/net/ethernet/mellanox/mlxsw/port.h
index af371a8..7296071 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/port.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/port.h
@@ -44,6 +44,7 @@
 
 #define MLXSW_PORT_SWID_DISABLED_PORT  255
 #define MLXSW_PORT_SWID_ALL_SWIDS  254
+#define MLXSW_PORT_SWID_TYPE_IB    1
 #define MLXSW_PORT_SWID_TYPE_ETH   2
 
 #define MLXSW_PORT_MID 0xd000
diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c 
b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
index 8eac26f..a905589 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
@@ -1590,13 +1590,17 @@ static struct mlxsw_config_profile 
mlxsw_sx_config_profile = {
.used_flood_mode= 1,
.flood_mode = 3,
.used_max_ib_mc = 1,
-   .max_ib_mc  = 0,
+   .max_ib_mc  = 6,
.used_max_pkey  = 1,
.max_pkey   = 0,
.swid_config= {
{
.used_type  = 1,
.type   = MLXSW_PORT_SWID_TYPE_ETH,
+   },
+   {
+   .used_type  = 1,
+   .type   = MLXSW_PORT_SWID_TYPE_IB,
}
},
.resource_query_enable  = 0,
-- 
2.5.5



[patch net-next 05/16] mlxsw: switchx2: Add support for physical port names

2016-10-28 Thread Jiri Pirko
From: Elad Raz 

Export to userspace the front panel name of the port, so that udev can
rename the ports accordingly. The convention suggested by the switchdev
documentation is used: pX
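
For reference, the exported name surfaces through the standard
ndo_get_phys_port_name path; an in-kernel caller could read it back roughly
like this (an illustrative sketch, not part of the patch; userspace reads
/sys/class/net/<dev>/phys_port_name instead):

	char name[IFNAMSIZ];

	if (!dev_get_phys_port_name(dev, name, sizeof(name)))
		pr_info("front panel name: %s\n", name);	/* e.g. "p1" */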

Signed-off-by: Elad Raz 
Reviewed-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 32 ++
 1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c 
b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
index 6882e35..941393b 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
@@ -81,6 +81,9 @@ struct mlxsw_sx_port {
struct mlxsw_sx_port_pcpu_stats __percpu *pcpu_stats;
struct mlxsw_sx *mlxsw_sx;
u8 local_port;
+   struct {
+   u8 module;
+   } mapping;
 };
 
 /* tx_hdr_version
@@ -257,7 +260,8 @@ mlxsw_sx_port_system_port_mapping_set(struct mlxsw_sx_port 
*mlxsw_sx_port)
 }
 
 static int mlxsw_sx_port_module_info_get(struct mlxsw_sx *mlxsw_sx,
-u8 local_port, u8 *p_width)
+u8 local_port, u8 *p_module,
+u8 *p_width)
 {
char pmlp_pl[MLXSW_REG_PMLP_LEN];
int err;
@@ -266,6 +270,7 @@ static int mlxsw_sx_port_module_info_get(struct mlxsw_sx 
*mlxsw_sx,
err = mlxsw_reg_query(mlxsw_sx->core, MLXSW_REG(pmlp), pmlp_pl);
if (err)
return err;
+   *p_module = mlxsw_reg_pmlp_module_get(pmlp_pl, 0);
*p_width = mlxsw_reg_pmlp_width_get(pmlp_pl);
return 0;
 }
@@ -383,12 +388,26 @@ mlxsw_sx_port_get_stats64(struct net_device *dev,
return stats;
 }
 
+static int mlxsw_sx_port_get_phys_port_name(struct net_device *dev, char *name,
+   size_t len)
+{
+   struct mlxsw_sx_port *mlxsw_sx_port = netdev_priv(dev);
+   int err;
+
+   err = snprintf(name, len, "p%d", mlxsw_sx_port->mapping.module + 1);
+   if (err >= len)
+   return -EINVAL;
+
+   return 0;
+}
+
 static const struct net_device_ops mlxsw_sx_port_netdev_ops = {
.ndo_open   = mlxsw_sx_port_open,
.ndo_stop   = mlxsw_sx_port_stop,
.ndo_start_xmit = mlxsw_sx_port_xmit,
.ndo_change_mtu = mlxsw_sx_port_change_mtu,
.ndo_get_stats64= mlxsw_sx_port_get_stats64,
+   .ndo_get_phys_port_name = mlxsw_sx_port_get_phys_port_name,
 };
 
 static void mlxsw_sx_port_get_drvinfo(struct net_device *dev,
@@ -957,7 +976,8 @@ mlxsw_sx_port_mac_learning_mode_set(struct mlxsw_sx_port 
*mlxsw_sx_port,
return mlxsw_reg_write(mlxsw_sx->core, MLXSW_REG(spmlr), spmlr_pl);
 }
 
-static int mlxsw_sx_port_create(struct mlxsw_sx *mlxsw_sx, u8 local_port)
+static int mlxsw_sx_port_create(struct mlxsw_sx *mlxsw_sx, u8 local_port,
+   u8 module)
 {
struct mlxsw_sx_port *mlxsw_sx_port;
struct net_device *dev;
@@ -971,6 +991,7 @@ static int mlxsw_sx_port_create(struct mlxsw_sx *mlxsw_sx, 
u8 local_port)
mlxsw_sx_port->dev = dev;
mlxsw_sx_port->mlxsw_sx = mlxsw_sx;
mlxsw_sx_port->local_port = local_port;
+   mlxsw_sx_port->mapping.module = module;
 
mlxsw_sx_port->pcpu_stats =
netdev_alloc_pcpu_stats(struct mlxsw_sx_port_pcpu_stats);
@@ -1119,7 +1140,7 @@ static void mlxsw_sx_ports_remove(struct mlxsw_sx 
*mlxsw_sx)
 static int mlxsw_sx_ports_create(struct mlxsw_sx *mlxsw_sx)
 {
size_t alloc_size;
-   u8 width;
+   u8 module, width;
int i;
int err;
 
@@ -1129,12 +1150,13 @@ static int mlxsw_sx_ports_create(struct mlxsw_sx 
*mlxsw_sx)
return -ENOMEM;
 
for (i = 1; i < MLXSW_PORT_MAX_PORTS; i++) {
-   err = mlxsw_sx_port_module_info_get(mlxsw_sx, i, &width);
+   err = mlxsw_sx_port_module_info_get(mlxsw_sx, i, &module,
+   &width);
if (err)
goto err_port_module_info_get;
if (!width)
continue;
-   err = mlxsw_sx_port_create(mlxsw_sx, i);
+   err = mlxsw_sx_port_create(mlxsw_sx, i, module);
if (err)
goto err_port_create;
}
-- 
2.5.5



[patch net-next 08/16] mlxsw: reg: Add Infiniband support to PTYS

2016-10-28 Thread Jiri Pirko
From: Elad Raz 

In order to support the Infiniband fabric, we need to introduce IB speeds and
capabilities to the PTYS EMADs.
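
For illustration only (not part of the patch), the new helpers could be used
to query a port's IB capabilities roughly like this; mlxsw_xx and local_port
stand in for the surrounding driver context:

	char ptys_pl[MLXSW_REG_PTYS_LEN];
	u16 ib_proto_cap, ib_link_width_cap;
	int err;

	mlxsw_reg_ptys_ib_pack(ptys_pl, local_port, 0, 0);
	err = mlxsw_reg_query(mlxsw_xx->core, MLXSW_REG(ptys), ptys_pl);
	if (err)
		return err;
	mlxsw_reg_ptys_ib_unpack(ptys_pl, &ib_proto_cap, &ib_link_width_cap,
				 NULL, NULL);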

Signed-off-by: Elad Raz 
Reviewed-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/reg.h | 71 +++
 1 file changed, 71 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h 
b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index 3ecf348..2f31af0 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -2054,6 +2054,7 @@ MLXSW_REG_DEFINE(ptys, MLXSW_REG_PTYS_ID, 
MLXSW_REG_PTYS_LEN);
  */
 MLXSW_ITEM32(reg, ptys, local_port, 0x00, 16, 8);
 
+#define MLXSW_REG_PTYS_PROTO_MASK_IB   BIT(0)
 #define MLXSW_REG_PTYS_PROTO_MASK_ETH  BIT(2)
 
 /* reg_ptys_proto_mask
@@ -2112,18 +2113,61 @@ MLXSW_ITEM32(reg, ptys, an_status, 0x04, 28, 4);
  */
 MLXSW_ITEM32(reg, ptys, eth_proto_cap, 0x0C, 0, 32);
 
+/* reg_ptys_ib_link_width_cap
+ * IB port supported widths.
+ * Access: RO
+ */
+MLXSW_ITEM32(reg, ptys, ib_link_width_cap, 0x10, 16, 16);
+
+#define MLXSW_REG_PTYS_IB_SPEED_SDR    BIT(0)
+#define MLXSW_REG_PTYS_IB_SPEED_DDR    BIT(1)
+#define MLXSW_REG_PTYS_IB_SPEED_QDR    BIT(2)
+#define MLXSW_REG_PTYS_IB_SPEED_FDR10  BIT(3)
+#define MLXSW_REG_PTYS_IB_SPEED_FDR    BIT(4)
+#define MLXSW_REG_PTYS_IB_SPEED_EDR    BIT(5)
+
+/* reg_ptys_ib_proto_cap
+ * IB port supported speeds and protocols.
+ * Access: RO
+ */
+MLXSW_ITEM32(reg, ptys, ib_proto_cap, 0x10, 0, 16);
+
 /* reg_ptys_eth_proto_admin
  * Speed and protocol to set port to.
  * Access: RW
  */
 MLXSW_ITEM32(reg, ptys, eth_proto_admin, 0x18, 0, 32);
 
+/* reg_ptys_ib_link_width_admin
+ * IB width to set port to.
+ * Access: RW
+ */
+MLXSW_ITEM32(reg, ptys, ib_link_width_admin, 0x1C, 16, 16);
+
+/* reg_ptys_ib_proto_admin
+ * IB speeds and protocols to set port to.
+ * Access: RW
+ */
+MLXSW_ITEM32(reg, ptys, ib_proto_admin, 0x1C, 0, 16);
+
 /* reg_ptys_eth_proto_oper
  * The current speed and protocol configured for the port.
  * Access: RO
  */
 MLXSW_ITEM32(reg, ptys, eth_proto_oper, 0x24, 0, 32);
 
+/* reg_ptys_ib_link_width_oper
+ * The current IB width to set port to.
+ * Access: RO
+ */
+MLXSW_ITEM32(reg, ptys, ib_link_width_oper, 0x28, 16, 16);
+
+/* reg_ptys_ib_proto_oper
+ * The current IB speed and protocol.
+ * Access: RO
+ */
+MLXSW_ITEM32(reg, ptys, ib_proto_oper, 0x28, 0, 16);
+
 /* reg_ptys_eth_proto_lp_advertise
  * The protocols that were advertised by the link partner during
  * autonegotiation.
@@ -2153,6 +2197,33 @@ static inline void mlxsw_reg_ptys_eth_unpack(char 
*payload,
*p_eth_proto_oper = mlxsw_reg_ptys_eth_proto_oper_get(payload);
 }
 
+static inline void mlxsw_reg_ptys_ib_pack(char *payload, u8 local_port,
+ u16 proto_admin, u16 link_width)
+{
+   MLXSW_REG_ZERO(ptys, payload);
+   mlxsw_reg_ptys_local_port_set(payload, local_port);
+   mlxsw_reg_ptys_proto_mask_set(payload, MLXSW_REG_PTYS_PROTO_MASK_IB);
+   mlxsw_reg_ptys_ib_proto_admin_set(payload, proto_admin);
+   mlxsw_reg_ptys_ib_link_width_admin_set(payload, link_width);
+}
+
+static inline void mlxsw_reg_ptys_ib_unpack(char *payload, u16 *p_ib_proto_cap,
+   u16 *p_ib_link_width_cap,
+   u16 *p_ib_proto_oper,
+   u16 *p_ib_link_width_oper)
+{
+   if (p_ib_proto_cap)
+   *p_ib_proto_cap = mlxsw_reg_ptys_ib_proto_cap_get(payload);
+   if (p_ib_link_width_cap)
+   *p_ib_link_width_cap =
+   mlxsw_reg_ptys_ib_link_width_cap_get(payload);
+   if (p_ib_proto_oper)
+   *p_ib_proto_oper = mlxsw_reg_ptys_ib_proto_oper_get(payload);
+   if (p_ib_link_width_oper)
+   *p_ib_link_width_oper =
+   mlxsw_reg_ptys_ib_link_width_oper_get(payload);
+}
+
 /* PPAD - Port Physical Address Register
  * -
  * The PPAD register configures the per port physical MAC address.
-- 
2.5.5



[patch net-next 06/16] mlxsw: switchx2: Fix port speed configuration

2016-10-28 Thread Jiri Pirko
From: Elad Raz 

In SwitchX-2 we configure the port speed to negotiate with a 40G link only.
Add support for all the other supported speeds.
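
As a worked illustration (not part of the patch), a 4-lane port ends up
advertising every link mode up to 40G, assuming the 10G-per-lane base speed
defined below:

	u32 upper_speed = MLXSW_SX_PORT_BASE_SPEED * 4;	/* 4 lanes -> 40000 Mb/s */
	u32 eth_proto_admin = mlxsw_sx_to_ptys_upper_speed(upper_speed);
	/* every link mode whose speed is <= upper_speed is OR-ed into the mask */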

Fixes: 31557f0f9755 ("mlxsw: Introduce Mellanox SwitchX-2 ASIC support")
Signed-off-by: Elad Raz 
Reviewed-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 30 --
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c 
b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
index 941393b..44dae8d 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
@@ -662,6 +662,7 @@ static const struct mlxsw_sx_port_link_mode 
mlxsw_sx_port_link_mode[] = {
 };
 
 #define MLXSW_SX_PORT_LINK_MODE_LEN ARRAY_SIZE(mlxsw_sx_port_link_mode)
+#define MLXSW_SX_PORT_BASE_SPEED 10000 /* Mb/s */
 
 static u32 mlxsw_sx_from_ptys_supported_port(u32 ptys_eth_proto)
 {
@@ -809,6 +810,18 @@ static u32 mlxsw_sx_to_ptys_speed(u32 speed)
return ptys_proto;
 }
 
+static u32 mlxsw_sx_to_ptys_upper_speed(u32 upper_speed)
+{
+   u32 ptys_proto = 0;
+   int i;
+
+   for (i = 0; i < MLXSW_SX_PORT_LINK_MODE_LEN; i++) {
+   if (mlxsw_sx_port_link_mode[i].speed <= upper_speed)
+   ptys_proto |= mlxsw_sx_port_link_mode[i].mask;
+   }
+   return ptys_proto;
+}
+
 static int mlxsw_sx_port_set_settings(struct net_device *dev,
  struct ethtool_cmd *cmd)
 {
@@ -955,13 +968,17 @@ static int mlxsw_sx_port_stp_state_set(struct 
mlxsw_sx_port *mlxsw_sx_port,
return err;
 }
 
-static int mlxsw_sx_port_speed_set(struct mlxsw_sx_port *mlxsw_sx_port,
-  u32 speed)
+static int
+mlxsw_sx_port_speed_by_width_set(struct mlxsw_sx_port *mlxsw_sx_port, u8 width)
 {
struct mlxsw_sx *mlxsw_sx = mlxsw_sx_port->mlxsw_sx;
+   u32 upper_speed = MLXSW_SX_PORT_BASE_SPEED * width;
char ptys_pl[MLXSW_REG_PTYS_LEN];
+   u32 eth_proto_admin;
 
-   mlxsw_reg_ptys_pack(ptys_pl, mlxsw_sx_port->local_port, speed);
+   eth_proto_admin = mlxsw_sx_to_ptys_upper_speed(upper_speed);
+   mlxsw_reg_ptys_pack(ptys_pl, mlxsw_sx_port->local_port,
+   eth_proto_admin);
return mlxsw_reg_write(mlxsw_sx->core, MLXSW_REG(ptys), ptys_pl);
 }
 
@@ -977,7 +994,7 @@ mlxsw_sx_port_mac_learning_mode_set(struct mlxsw_sx_port 
*mlxsw_sx_port,
 }
 
 static int mlxsw_sx_port_create(struct mlxsw_sx *mlxsw_sx, u8 local_port,
-   u8 module)
+   u8 module, u8 width)
 {
struct mlxsw_sx_port *mlxsw_sx_port;
struct net_device *dev;
@@ -1038,8 +1055,7 @@ static int mlxsw_sx_port_create(struct mlxsw_sx 
*mlxsw_sx, u8 local_port,
goto err_port_swid_set;
}
 
-   err = mlxsw_sx_port_speed_set(mlxsw_sx_port,
- MLXSW_REG_PTYS_ETH_SPEED_40GBASE_CR4);
+   err = mlxsw_sx_port_speed_by_width_set(mlxsw_sx_port, width);
if (err) {
dev_err(mlxsw_sx->bus_info->dev, "Port %d: Failed to set 
speed\n",
mlxsw_sx_port->local_port);
@@ -1156,7 +1172,7 @@ static int mlxsw_sx_ports_create(struct mlxsw_sx 
*mlxsw_sx)
goto err_port_module_info_get;
if (!width)
continue;
-   err = mlxsw_sx_port_create(mlxsw_sx, i, module);
+   err = mlxsw_sx_port_create(mlxsw_sx, i, module, width);
if (err)
goto err_port_create;
}
-- 
2.5.5



[patch net-next 10/16] mlxsw: Make devlink port instances independent of spectrum/switchx2 port instances

2016-10-28 Thread Jiri Pirko
From: Jiri Pirko 

Currently, devlink register/unregister is done directly from
spectrum/switchx2 port create/remove functions. With a need to
introduce a port type change, the devlink port instances have to be
persistent across type changes, therefore across port create/remove
function calls. So do a bit of reshuffling to achieve that.
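
Roughly, the resulting lifecycle looks like this (an illustrative sketch using
the helper names as they end up later in this series, not code from this
patch):

	mlxsw_core_port_init(mlxsw_core, local_port);	/* devlink_port_register() */
	/* driver creates a port instance and gives it a type */
	mlxsw_core_port_eth_set(mlxsw_core, local_port, port_priv, dev,
				split, split_group);
	/* on port remove or type change only the type is cleared ... */
	mlxsw_core_port_clear(mlxsw_core, local_port, port_priv);
	/* ... and the devlink port itself stays registered until */
	mlxsw_core_port_fini(mlxsw_core, local_port);	/* devlink_port_unregister() */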

Signed-off-by: Jiri Pirko 
Signed-off-by: Elad Raz 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/core.c | 82 +-
 drivers/net/ethernet/mellanox/mlxsw/core.h | 26 +++-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 53 +++--
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |  1 -
 drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 49 ++-
 5 files changed, 147 insertions(+), 64 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c 
b/drivers/net/ethernet/mellanox/mlxsw/core.c
index 02183f6..b50acb1 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -90,6 +90,22 @@ struct mlxsw_core_pcpu_stats {
u32 port_rx_invalid;
 };
 
+struct mlxsw_core_port {
+   struct devlink_port devlink_port;
+   void *port_driver_priv;
+};
+
+void *mlxsw_core_port_driver_priv(struct mlxsw_core_port *mlxsw_core_port)
+{
+   return mlxsw_core_port->port_driver_priv;
+}
+EXPORT_SYMBOL(mlxsw_core_port_driver_priv);
+
+static bool mlxsw_core_port_check(struct mlxsw_core_port *mlxsw_core_port)
+{
+   return mlxsw_core_port->port_driver_priv != NULL;
+}
+
 struct mlxsw_core {
struct mlxsw_driver *driver;
const struct mlxsw_bus *bus;
@@ -114,6 +130,7 @@ struct mlxsw_core {
} lag;
struct mlxsw_res res;
struct mlxsw_hwmon *hwmon;
+   struct mlxsw_core_port ports[MLXSW_PORT_MAX_PORTS];
unsigned long driver_priv[0];
/* driver_priv has to be always the last item */
 };
@@ -928,7 +945,8 @@ static int mlxsw_devlink_sb_port_pool_get(struct 
devlink_port *devlink_port,
struct mlxsw_driver *mlxsw_driver = mlxsw_core->driver;
struct mlxsw_core_port *mlxsw_core_port = __dl_port(devlink_port);
 
-   if (!mlxsw_driver->sb_port_pool_get)
+   if (!mlxsw_driver->sb_port_pool_get ||
+   !mlxsw_core_port_check(mlxsw_core_port))
return -EOPNOTSUPP;
return mlxsw_driver->sb_port_pool_get(mlxsw_core_port, sb_index,
  pool_index, p_threshold);
@@ -942,7 +960,8 @@ static int mlxsw_devlink_sb_port_pool_set(struct 
devlink_port *devlink_port,
struct mlxsw_driver *mlxsw_driver = mlxsw_core->driver;
struct mlxsw_core_port *mlxsw_core_port = __dl_port(devlink_port);
 
-   if (!mlxsw_driver->sb_port_pool_set)
+   if (!mlxsw_driver->sb_port_pool_set ||
+   !mlxsw_core_port_check(mlxsw_core_port))
return -EOPNOTSUPP;
return mlxsw_driver->sb_port_pool_set(mlxsw_core_port, sb_index,
  pool_index, threshold);
@@ -958,7 +977,8 @@ mlxsw_devlink_sb_tc_pool_bind_get(struct devlink_port 
*devlink_port,
struct mlxsw_driver *mlxsw_driver = mlxsw_core->driver;
struct mlxsw_core_port *mlxsw_core_port = __dl_port(devlink_port);
 
-   if (!mlxsw_driver->sb_tc_pool_bind_get)
+   if (!mlxsw_driver->sb_tc_pool_bind_get ||
+   !mlxsw_core_port_check(mlxsw_core_port))
return -EOPNOTSUPP;
return mlxsw_driver->sb_tc_pool_bind_get(mlxsw_core_port, sb_index,
 tc_index, pool_type,
@@ -975,7 +995,8 @@ mlxsw_devlink_sb_tc_pool_bind_set(struct devlink_port 
*devlink_port,
struct mlxsw_driver *mlxsw_driver = mlxsw_core->driver;
struct mlxsw_core_port *mlxsw_core_port = __dl_port(devlink_port);
 
-   if (!mlxsw_driver->sb_tc_pool_bind_set)
+   if (!mlxsw_driver->sb_tc_pool_bind_set ||
+   !mlxsw_core_port_check(mlxsw_core_port))
return -EOPNOTSUPP;
return mlxsw_driver->sb_tc_pool_bind_set(mlxsw_core_port, sb_index,
 tc_index, pool_type,
@@ -1013,7 +1034,8 @@ mlxsw_devlink_sb_occ_port_pool_get(struct devlink_port 
*devlink_port,
struct mlxsw_driver *mlxsw_driver = mlxsw_core->driver;
struct mlxsw_core_port *mlxsw_core_port = __dl_port(devlink_port);
 
-   if (!mlxsw_driver->sb_occ_port_pool_get)
+   if (!mlxsw_driver->sb_occ_port_pool_get ||
+   !mlxsw_core_port_check(mlxsw_core_port))
return -EOPNOTSUPP;
return mlxsw_driver->sb_occ_port_pool_get(mlxsw_core_port, sb_index,
  pool_index, p_cur, p_max);
@@ -1029,7 +1051,8 @@ mlxsw_devlink_sb_occ_tc_port_bind_get(struct devlink_port 
*devlink_port,
struct mlxsw_driver *mlxsw_driver = mlxsw_core->driver;
struct mlxsw_core_por

[patch net-next 09/16] mlxsw: reg: Add local-port to Infiniband port mapping

2016-10-28 Thread Jiri Pirko
From: Elad Raz 

In order to change a port's type to Infiniband, we should change its
mapping from local port to Infiniband. Adding the PLIB (Port Local to
InfiniBand) register allows this mapping.
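
For illustration only, a pack helper built on the accessors that
MLXSW_ITEM32() generates for the fields below could look like this
(mlxsw_reg_plib_pack is a hypothetical name, not part of this patch):

	static inline void mlxsw_reg_plib_pack(char *payload, u8 local_port,
					       u8 ib_port)
	{
		MLXSW_REG_ZERO(plib, payload);
		mlxsw_reg_plib_local_port_set(payload, local_port);
		mlxsw_reg_plib_ib_port_set(payload, ib_port);
	}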

Signed-off-by: Elad Raz 
Reviewed-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/reg.h | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h 
b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index 2f31af0..a61ce34 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -2748,6 +2748,27 @@ static inline void mlxsw_reg_ppcnt_pack(char *payload, 
u8 local_port,
mlxsw_reg_ppcnt_prio_tc_set(payload, prio_tc);
 }
 
+/* PLIB - Port Local to InfiniBand Port
+ * 
+ * The PLIB register performs mapping from Local Port into InfiniBand Port.
+ */
+#define MLXSW_REG_PLIB_ID 0x500A
+#define MLXSW_REG_PLIB_LEN 0x10
+
+MLXSW_REG_DEFINE(plib, MLXSW_REG_PLIB_ID, MLXSW_REG_PLIB_LEN);
+
+/* reg_plib_local_port
+ * Local port number.
+ * Access: Index
+ */
+MLXSW_ITEM32(reg, plib, local_port, 0x00, 16, 8);
+
+/* reg_plib_ib_port
+ * InfiniBand port remapping for local_port.
+ * Access: RW
+ */
+MLXSW_ITEM32(reg, plib, ib_port, 0x00, 0, 8);
+
 /* PPTB - Port Prio To Buffer Register
  * ---
  * Configures the switch priority to buffer table.
@@ -5188,6 +5209,7 @@ static const struct mlxsw_reg_info *mlxsw_reg_infos[] = {
MLXSW_REG(paos),
MLXSW_REG(pfcc),
MLXSW_REG(ppcnt),
+   MLXSW_REG(plib),
MLXSW_REG(pptb),
MLXSW_REG(pbmc),
MLXSW_REG(pspa),
-- 
2.5.5



[patch net-next 07/16] mlxsw: reg: Add eth prefix to PTYS pack and unpack

2016-10-28 Thread Jiri Pirko
From: Elad Raz 

We want to add Infiniband support to PTYS. In order to maintain proper
conventions, we will change the pack and unpack prefix to eth.

Signed-off-by: Elad Raz 
Reviewed-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/reg.h  | 11 ++-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 17 +
 drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 18 ++
 3 files changed, 25 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h 
b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index debcf26..3ecf348 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -2131,8 +2131,8 @@ MLXSW_ITEM32(reg, ptys, eth_proto_oper, 0x24, 0, 32);
  */
 MLXSW_ITEM32(reg, ptys, eth_proto_lp_advertise, 0x30, 0, 32);
 
-static inline void mlxsw_reg_ptys_pack(char *payload, u8 local_port,
-  u32 proto_admin)
+static inline void mlxsw_reg_ptys_eth_pack(char *payload, u8 local_port,
+  u32 proto_admin)
 {
MLXSW_REG_ZERO(ptys, payload);
mlxsw_reg_ptys_local_port_set(payload, local_port);
@@ -2140,9 +2140,10 @@ static inline void mlxsw_reg_ptys_pack(char *payload, u8 
local_port,
mlxsw_reg_ptys_eth_proto_admin_set(payload, proto_admin);
 }
 
-static inline void mlxsw_reg_ptys_unpack(char *payload, u32 *p_eth_proto_cap,
-u32 *p_eth_proto_adm,
-u32 *p_eth_proto_oper)
+static inline void mlxsw_reg_ptys_eth_unpack(char *payload,
+u32 *p_eth_proto_cap,
+u32 *p_eth_proto_adm,
+u32 *p_eth_proto_oper)
 {
if (p_eth_proto_cap)
*p_eth_proto_cap = mlxsw_reg_ptys_eth_proto_cap_get(payload);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index d649064..d1c2f9f 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -2003,12 +2003,12 @@ static int mlxsw_sp_port_get_link_ksettings(struct 
net_device *dev,
int err;
 
autoneg = mlxsw_sp_port->link.autoneg;
-   mlxsw_reg_ptys_pack(ptys_pl, mlxsw_sp_port->local_port, 0);
+   mlxsw_reg_ptys_eth_pack(ptys_pl, mlxsw_sp_port->local_port, 0);
err = mlxsw_reg_query(mlxsw_sp->core, MLXSW_REG(ptys), ptys_pl);
if (err)
return err;
-   mlxsw_reg_ptys_unpack(ptys_pl, &eth_proto_cap, &eth_proto_admin,
- &eth_proto_oper);
+   mlxsw_reg_ptys_eth_unpack(ptys_pl, &eth_proto_cap, &eth_proto_admin,
+ &eth_proto_oper);
 
mlxsw_sp_port_get_link_supported(eth_proto_cap, cmd);
 
@@ -2037,11 +2037,11 @@ mlxsw_sp_port_set_link_ksettings(struct net_device *dev,
bool autoneg;
int err;
 
-   mlxsw_reg_ptys_pack(ptys_pl, mlxsw_sp_port->local_port, 0);
+   mlxsw_reg_ptys_eth_pack(ptys_pl, mlxsw_sp_port->local_port, 0);
err = mlxsw_reg_query(mlxsw_sp->core, MLXSW_REG(ptys), ptys_pl);
if (err)
return err;
-   mlxsw_reg_ptys_unpack(ptys_pl, &eth_proto_cap, NULL, NULL);
+   mlxsw_reg_ptys_eth_unpack(ptys_pl, &eth_proto_cap, NULL, NULL);
 
autoneg = cmd->base.autoneg == AUTONEG_ENABLE;
eth_proto_new = autoneg ?
@@ -2054,7 +2054,8 @@ mlxsw_sp_port_set_link_ksettings(struct net_device *dev,
return -EINVAL;
}
 
-   mlxsw_reg_ptys_pack(ptys_pl, mlxsw_sp_port->local_port, eth_proto_new);
+   mlxsw_reg_ptys_eth_pack(ptys_pl, mlxsw_sp_port->local_port,
+   eth_proto_new);
err = mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(ptys), ptys_pl);
if (err)
return err;
@@ -2092,8 +2093,8 @@ mlxsw_sp_port_speed_by_width_set(struct mlxsw_sp_port 
*mlxsw_sp_port, u8 width)
u32 eth_proto_admin;
 
eth_proto_admin = mlxsw_sp_to_ptys_upper_speed(upper_speed);
-   mlxsw_reg_ptys_pack(ptys_pl, mlxsw_sp_port->local_port,
-   eth_proto_admin);
+   mlxsw_reg_ptys_eth_pack(ptys_pl, mlxsw_sp_port->local_port,
+   eth_proto_admin);
return mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(ptys), ptys_pl);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c 
b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
index 44dae8d..7b74dcb 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
@@ -762,14 +762,14 @@ static int mlxsw_sx_port_get_settings(struct net_device 
*dev,
u32 eth_proto_oper;
int err;
 
-   mlxsw_reg_ptys_pack(ptys_pl, mlxsw_sx_port->local_port, 0);
+   mlxsw_reg_ptys_eth_pack(ptys_pl, mlxsw

[patch net-next 04/16] mlxsw: spectrum: Move port used check outside port remove function

2016-10-28 Thread Jiri Pirko
From: Jiri Pirko 

Be symmetrical with create and do the check outside the remove function.

Signed-off-by: Jiri Pirko 
Signed-off-by: Elad Raz 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 22 +++---
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 2573faa..d649064 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -2403,8 +2403,6 @@ static void mlxsw_sp_port_remove(struct mlxsw_sp 
*mlxsw_sp, u8 local_port)
 {
struct mlxsw_sp_port *mlxsw_sp_port = mlxsw_sp->ports[local_port];
 
-   if (!mlxsw_sp_port)
-   return;
cancel_delayed_work_sync(&mlxsw_sp_port->hw_stats.update_dw);
mlxsw_core_port_fini(&mlxsw_sp_port->core_port);
unregister_netdev(mlxsw_sp_port->dev); /* This calls ndo_stop */
@@ -2422,12 +2420,18 @@ static void mlxsw_sp_port_remove(struct mlxsw_sp 
*mlxsw_sp, u8 local_port)
free_netdev(mlxsw_sp_port->dev);
 }
 
+static bool mlxsw_sp_port_created(struct mlxsw_sp *mlxsw_sp, u8 local_port)
+{
+   return mlxsw_sp->ports[local_port] != NULL;
+}
+
 static void mlxsw_sp_ports_remove(struct mlxsw_sp *mlxsw_sp)
 {
int i;
 
for (i = 1; i < MLXSW_PORT_MAX_PORTS; i++)
-   mlxsw_sp_port_remove(mlxsw_sp, i);
+   if (mlxsw_sp_port_created(mlxsw_sp, i))
+   mlxsw_sp_port_remove(mlxsw_sp, i);
kfree(mlxsw_sp->ports);
 }
 
@@ -2461,7 +2465,8 @@ static int mlxsw_sp_ports_create(struct mlxsw_sp 
*mlxsw_sp)
 err_port_create:
 err_port_module_info_get:
for (i--; i >= 1; i--)
-   mlxsw_sp_port_remove(mlxsw_sp, i);
+   if (mlxsw_sp_port_created(mlxsw_sp, i))
+   mlxsw_sp_port_remove(mlxsw_sp, i);
kfree(mlxsw_sp->ports);
return err;
 }
@@ -2503,7 +2508,8 @@ static int mlxsw_sp_port_split_create(struct mlxsw_sp 
*mlxsw_sp, u8 base_port,
 
 err_port_create:
for (i--; i >= 0; i--)
-   mlxsw_sp_port_remove(mlxsw_sp, base_port + i);
+   if (mlxsw_sp_port_created(mlxsw_sp, base_port + i))
+   mlxsw_sp_port_remove(mlxsw_sp, base_port + i);
i = count;
 err_port_swid_set:
for (i--; i >= 0; i--)
@@ -2593,7 +2599,8 @@ static int mlxsw_sp_port_split(struct mlxsw_core 
*mlxsw_core, u8 local_port,
}
 
for (i = 0; i < count; i++)
-   mlxsw_sp_port_remove(mlxsw_sp, base_port + i);
+   if (mlxsw_sp_port_created(mlxsw_sp, base_port + i))
+   mlxsw_sp_port_remove(mlxsw_sp, base_port + i);
 
err = mlxsw_sp_port_split_create(mlxsw_sp, base_port, module, count);
if (err) {
@@ -2638,7 +2645,8 @@ static int mlxsw_sp_port_unsplit(struct mlxsw_core 
*mlxsw_core, u8 local_port)
base_port = base_port + 2;
 
for (i = 0; i < count; i++)
-   mlxsw_sp_port_remove(mlxsw_sp, base_port + i);
+   if (mlxsw_sp_port_created(mlxsw_sp, base_port + i))
+   mlxsw_sp_port_remove(mlxsw_sp, base_port + i);
 
mlxsw_sp_port_unsplit_create(mlxsw_sp, base_port, count);
 
-- 
2.5.5



Re: [PATCH net 2/3] sctp: return back transport in __sctp_rcv_init_lookup

2016-10-28 Thread Marcelo Ricardo Leitner
On Fri, Oct 28, 2016 at 06:10:53PM +0800, Xin Long wrote:
> Prior to this patch, it used a local variable to save the transport that is
> looked up by __sctp_lookup_association(), and didn't return it back. But in
> sctp_rcv, it is used to initialize chunk->transport. So when hitting this
> code, it was initializing chunk->transport with some random stack value
> instead.
> 
> This patch is to return the transport back through transport pointer
> that is from __sctp_rcv_lookup_harder().
> 
> Signed-off-by: Xin Long 

Acked-by: Marcelo Ricardo Leitner 

transport pointer in sctp_rcv() is initialized to null and there are
checks for it after this path, so this shouldn't be exploitable, just
malfunction.

> ---
>  net/sctp/input.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/net/sctp/input.c b/net/sctp/input.c
> index a2ea1d1..8e0bc58 100644
> --- a/net/sctp/input.c
> +++ b/net/sctp/input.c
> @@ -1021,7 +1021,6 @@ static struct sctp_association 
> *__sctp_rcv_init_lookup(struct net *net,
>   struct sctphdr *sh = sctp_hdr(skb);
>   union sctp_params params;
>   sctp_init_chunk_t *init;
> - struct sctp_transport *transport;
>   struct sctp_af *af;
>  
>   /*
> @@ -1052,7 +1051,7 @@ static struct sctp_association 
> *__sctp_rcv_init_lookup(struct net *net,
>  
>   af->from_addr_param(paddr, params.addr, sh->source, 0);
>  
> - asoc = __sctp_lookup_association(net, laddr, paddr, &transport);
> + asoc = __sctp_lookup_association(net, laddr, paddr, transportp);
>   if (asoc)
>   return asoc;
>   }
> -- 
> 2.1.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


[patch net-next 03/16] mlxsw: switchx2: Move port used check outside port remove function

2016-10-28 Thread Jiri Pirko
From: Jiri Pirko 

Be symmetrical with create and do the check outside the remove function.

Signed-off-by: Jiri Pirko 
Signed-off-by: Elad Raz 
---
 drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c 
b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
index 67f1b70..6882e35 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
@@ -1093,21 +1093,26 @@ static void mlxsw_sx_port_remove(struct mlxsw_sx 
*mlxsw_sx, u8 local_port)
 {
struct mlxsw_sx_port *mlxsw_sx_port = mlxsw_sx->ports[local_port];
 
-   if (!mlxsw_sx_port)
-   return;
mlxsw_core_port_fini(&mlxsw_sx_port->core_port);
unregister_netdev(mlxsw_sx_port->dev); /* This calls ndo_stop */
+   mlxsw_sx->ports[local_port] = NULL;
mlxsw_sx_port_swid_set(mlxsw_sx_port, MLXSW_PORT_SWID_DISABLED_PORT);
free_percpu(mlxsw_sx_port->pcpu_stats);
free_netdev(mlxsw_sx_port->dev);
 }
 
+static bool mlxsw_sx_port_created(struct mlxsw_sx *mlxsw_sx, u8 local_port)
+{
+   return mlxsw_sx->ports[local_port] != NULL;
+}
+
 static void mlxsw_sx_ports_remove(struct mlxsw_sx *mlxsw_sx)
 {
int i;
 
for (i = 1; i < MLXSW_PORT_MAX_PORTS; i++)
-   mlxsw_sx_port_remove(mlxsw_sx, i);
+   if (mlxsw_sx_port_created(mlxsw_sx, i))
+   mlxsw_sx_port_remove(mlxsw_sx, i);
kfree(mlxsw_sx->ports);
 }
 
@@ -1138,7 +1143,8 @@ static int mlxsw_sx_ports_create(struct mlxsw_sx 
*mlxsw_sx)
 err_port_create:
 err_port_module_info_get:
for (i--; i >= 1; i--)
-   mlxsw_sx_port_remove(mlxsw_sx, i);
+   if (mlxsw_sx_port_created(mlxsw_sx, i))
+   mlxsw_sx_port_remove(mlxsw_sx, i);
kfree(mlxsw_sx->ports);
return err;
 }
-- 
2.5.5



[patch net-next 02/16] mlxsw: switchx2: Check if port is usable before calling port create

2016-10-28 Thread Jiri Pirko
From: Jiri Pirko 

Do it in the same way we do it in spectrum. Check if the port is usable first
and only in that case create a port instance.

Signed-off-by: Jiri Pirko 
Reviewed-by: Elad Raz 
Reviewed-by: Ido Schimmel 
---
 drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 32 +-
 1 file changed, 11 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c 
b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
index 8d11e5447..67f1b70 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
@@ -256,18 +256,17 @@ mlxsw_sx_port_system_port_mapping_set(struct 
mlxsw_sx_port *mlxsw_sx_port)
return mlxsw_reg_write(mlxsw_sx->core, MLXSW_REG(sspr), sspr_pl);
 }
 
-static int mlxsw_sx_port_module_check(struct mlxsw_sx_port *mlxsw_sx_port,
- bool *p_usable)
+static int mlxsw_sx_port_module_info_get(struct mlxsw_sx *mlxsw_sx,
+u8 local_port, u8 *p_width)
 {
-   struct mlxsw_sx *mlxsw_sx = mlxsw_sx_port->mlxsw_sx;
char pmlp_pl[MLXSW_REG_PMLP_LEN];
int err;
 
-   mlxsw_reg_pmlp_pack(pmlp_pl, mlxsw_sx_port->local_port);
+   mlxsw_reg_pmlp_pack(pmlp_pl, local_port);
err = mlxsw_reg_query(mlxsw_sx->core, MLXSW_REG(pmlp), pmlp_pl);
if (err)
return err;
-   *p_usable = mlxsw_reg_pmlp_width_get(pmlp_pl) ? true : false;
+   *p_width = mlxsw_reg_pmlp_width_get(pmlp_pl);
return 0;
 }
 
@@ -962,7 +961,6 @@ static int mlxsw_sx_port_create(struct mlxsw_sx *mlxsw_sx, 
u8 local_port)
 {
struct mlxsw_sx_port *mlxsw_sx_port;
struct net_device *dev;
-   bool usable;
int err;
 
dev = alloc_etherdev(sizeof(struct mlxsw_sx_port));
@@ -1005,19 +1003,6 @@ static int mlxsw_sx_port_create(struct mlxsw_sx 
*mlxsw_sx, u8 local_port)
 */
dev->needed_headroom = MLXSW_TXHDR_LEN;
 
-   err = mlxsw_sx_port_module_check(mlxsw_sx_port, &usable);
-   if (err) {
-   dev_err(mlxsw_sx->bus_info->dev, "Port %d: Failed to check 
module\n",
-   mlxsw_sx_port->local_port);
-   goto err_port_module_check;
-   }
-
-   if (!usable) {
-   dev_dbg(mlxsw_sx->bus_info->dev, "Port %d: Not usable, skipping 
initialization\n",
-   mlxsw_sx_port->local_port);
-   goto port_not_usable;
-   }
-
err = mlxsw_sx_port_system_port_mapping_set(mlxsw_sx_port);
if (err) {
dev_err(mlxsw_sx->bus_info->dev, "Port %d: Failed to set system 
port mapping\n",
@@ -1097,8 +1082,6 @@ static int mlxsw_sx_port_create(struct mlxsw_sx 
*mlxsw_sx, u8 local_port)
mlxsw_sx_port_swid_set(mlxsw_sx_port, MLXSW_PORT_SWID_DISABLED_PORT);
 err_port_swid_set:
 err_port_system_port_mapping_set:
-port_not_usable:
-err_port_module_check:
 err_dev_addr_get:
free_percpu(mlxsw_sx_port->pcpu_stats);
 err_alloc_stats:
@@ -1131,6 +1114,7 @@ static void mlxsw_sx_ports_remove(struct mlxsw_sx 
*mlxsw_sx)
 static int mlxsw_sx_ports_create(struct mlxsw_sx *mlxsw_sx)
 {
size_t alloc_size;
+   u8 width;
int i;
int err;
 
@@ -1140,6 +1124,11 @@ static int mlxsw_sx_ports_create(struct mlxsw_sx 
*mlxsw_sx)
return -ENOMEM;
 
for (i = 1; i < MLXSW_PORT_MAX_PORTS; i++) {
+   err = mlxsw_sx_port_module_info_get(mlxsw_sx, i, &width);
+   if (err)
+   goto err_port_module_info_get;
+   if (!width)
+   continue;
err = mlxsw_sx_port_create(mlxsw_sx, i);
if (err)
goto err_port_create;
@@ -1147,6 +1136,7 @@ static int mlxsw_sx_ports_create(struct mlxsw_sx 
*mlxsw_sx)
return 0;
 
 err_port_create:
+err_port_module_info_get:
for (i--; i >= 1; i--)
mlxsw_sx_port_remove(mlxsw_sx, i);
kfree(mlxsw_sx->ports);
-- 
2.5.5



[patch net-next 01/16] mlxsw: core: Zero payload buffers for couple of registers

2016-10-28 Thread Jiri Pirko
From: Elad Raz 

We recently discovered a bug in the firmware in which a field's length in
one of the registers was incorrectly set. This caused the firmware to
access garbage data that wasn't initialized by the driver and therefore
emit error messages.

While the bug is already fixed and the driver usually zeros the buffers
passed to the firmware, there are a handful of cases where this isn't
done. Zero the buffer in these cases and prevent similar bugs from
recurring, as they tend to be hard to debug.
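
For illustration, the pattern is simply to zero-initialize the on-stack
payload on query-only paths that never call a mlxsw_reg_*_pack() helper
(which would start with MLXSW_REG_ZERO()); a sketch, not taken from the
patch:

	char mfcr_pl[MLXSW_REG_MFCR_LEN] = {0};	/* no pack() call, so zero it here */
	int err;

	err = mlxsw_reg_query(mlxsw_hwmon->core, MLXSW_REG(mfcr), mfcr_pl);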

Fixes: 52581961d83d ("mlxsw: core: Implement fan control using hwmon")
Signed-off-by: Elad Raz 
Reviewed-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/core_hwmon.c | 4 ++--
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c   | 2 +-
 drivers/net/ethernet/mellanox/mlxsw/switchx2.c   | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_hwmon.c 
b/drivers/net/ethernet/mellanox/mlxsw/core_hwmon.c
index 1ac8bf1..ab710e3 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_hwmon.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_hwmon.c
@@ -262,7 +262,7 @@ static void mlxsw_hwmon_attr_add(struct mlxsw_hwmon 
*mlxsw_hwmon,
 
 static int mlxsw_hwmon_temp_init(struct mlxsw_hwmon *mlxsw_hwmon)
 {
-   char mtcap_pl[MLXSW_REG_MTCAP_LEN];
+   char mtcap_pl[MLXSW_REG_MTCAP_LEN] = {0};
char mtmp_pl[MLXSW_REG_MTMP_LEN];
u8 sensor_count;
int i;
@@ -295,7 +295,7 @@ static int mlxsw_hwmon_temp_init(struct mlxsw_hwmon 
*mlxsw_hwmon)
 
 static int mlxsw_hwmon_fans_init(struct mlxsw_hwmon *mlxsw_hwmon)
 {
-   char mfcr_pl[MLXSW_REG_MFCR_LEN];
+   char mfcr_pl[MLXSW_REG_MFCR_LEN] = {0};
enum mlxsw_reg_mfcr_pwm_frequency freq;
unsigned int type_index;
unsigned int num;
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index d652f7f..2573faa 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -158,7 +158,7 @@ static void mlxsw_sp_txhdr_construct(struct sk_buff *skb,
 
 static int mlxsw_sp_base_mac_get(struct mlxsw_sp *mlxsw_sp)
 {
-   char spad_pl[MLXSW_REG_SPAD_LEN];
+   char spad_pl[MLXSW_REG_SPAD_LEN] = {0};
int err;
 
err = mlxsw_reg_query(mlxsw_sp->core, MLXSW_REG(spad), spad_pl);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c 
b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
index 8c8f5d8..8d11e5447 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
@@ -890,7 +890,7 @@ static const struct switchdev_ops 
mlxsw_sx_port_switchdev_ops = {
 
 static int mlxsw_sx_hw_id_get(struct mlxsw_sx *mlxsw_sx)
 {
-   char spad_pl[MLXSW_REG_SPAD_LEN];
+   char spad_pl[MLXSW_REG_SPAD_LEN] = {0};
int err;
 
err = mlxsw_reg_query(mlxsw_sx->core, MLXSW_REG(spad), spad_pl);
-- 
2.5.5



Re: [PATCH net 1/3] sctp: hold transport instead of assoc in sctp_diag

2016-10-28 Thread Marcelo Ricardo Leitner
On Fri, Oct 28, 2016 at 06:10:52PM +0800, Xin Long wrote:
> In sctp_transport_lookup_process(), Commit 1cceda784980 ("sctp: fix
> the issue sctp_diag uses lock_sock in rcu_read_lock") moved cb() out
> of rcu lock, but it put the transport and held the assoc instead,
> ignoring that cb() still uses the transport. This may cause a
> use-after-free issue.
> 
> This patch is to hold transport instead of assoc there.
> 
> Fixes: 1cceda784980 ("sctp: fix the issue sctp_diag uses lock_sock in 
> rcu_read_lock")
> Signed-off-by: Xin Long 

Acked-by: Marcelo Ricardo Leitner 

> ---
>  net/sctp/socket.c | 5 +
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 9fbb6fe..71b75f9 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -4480,12 +4480,9 @@ int sctp_transport_lookup_process(int (*cb)(struct 
> sctp_transport *, void *),
>   if (!transport || !sctp_transport_hold(transport))
>   goto out;
>  
> - sctp_association_hold(transport->asoc);
> - sctp_transport_put(transport);
> -
>   rcu_read_unlock();
>   err = cb(transport, p);
> - sctp_association_put(transport->asoc);
> + sctp_transport_put(transport);
>  
>  out:
>   return err;
> -- 
> 2.1.0
> 


[patch net-next 00/16] mlxsw: Add Infiniband support for Mellanox switches

2016-10-28 Thread Jiri Pirko
From: Jiri Pirko 

This patchset adds basic Infiniband support for SwitchX-2, Switch-IB
and Switch-IB-2 ASIC drivers.

SwitchX-2 ASIC is VPI capable, which means each port can be either
Ethernet or Infiniband. When the port is configured as Infiniband,
the Subnet Management Agent (SMA) is managed by the SwitchX-2 firmware
and not by the host. Port configuration, MTU and more are configured
remotely by the Subnet Manager (SM).

Usage:
$ devlink port show
pci/:03:00.0/1: type eth netdev eth0
pci/:03:00.0/3: type eth netdev eth1
pci/:03:00.0/5: type eth netdev eth2
pci/:03:00.0/6: type eth netdev eth3
pci/:03:00.0/8: type eth netdev eth4

$ devlink port set pci/:03:00.0/1 type ib

$ devlink port show
pci/:03:00.0/1: type ib

Switch-IB (FDR) and Switch-IB-2 (EDR 100Gbs) ASICs are Infiniband-only
switches. The support provided in the mlxsw_switchib.ko driver is port
initialization only. The firmware running in the Silicon implements
the SMA.

Please note that this patchset does only very basic port initialization.
ib_device or RDMA implementations are not part of this patchset.

Elad Raz (12):
  mlxsw: core: Zero payload buffers for couple of registers
  mlxsw: switchx2: Add support for physical port names
  mlxsw: switchx2: Fix port speed configuration
  mlxsw: reg: Add eth prefix to PTYS pack and unpack
  mlxsw: reg: Add Infiniband support to PTYS
  mlxsw: reg: Add local-port to Infiniband port mapping
  mlxsw: switchx2: Add Infiniband switch partition
  mlxsw: core: Add "eth" prefix to mlxsw_core_port_set
  mlxsw: core: Add port type (Eth/IB) set API
  mlxsw: switchx2: Add eth prefix to port create and remove
  mlxsw: switchx2: Add IB port support
  mlxsw: switchib: Introduce SwitchIB and SwitchIB silicon driver

Jiri Pirko (4):
  mlxsw: switchx2: Check if port is usable before calling port create
  mlxsw: switchx2: Move port used check outside port remove function
  mlxsw: spectrum: Move port used check outside port remove function
  mlxsw: Make devlink port instances independent of spectrum/switchx2
port instances

 drivers/net/ethernet/mellanox/mlxsw/Kconfig  |  11 +
 drivers/net/ethernet/mellanox/mlxsw/Makefile |   2 +
 drivers/net/ethernet/mellanox/mlxsw/core.c   | 123 -
 drivers/net/ethernet/mellanox/mlxsw/core.h   |  32 +-
 drivers/net/ethernet/mellanox/mlxsw/core_hwmon.c |   4 +-
 drivers/net/ethernet/mellanox/mlxsw/ib.h |  39 ++
 drivers/net/ethernet/mellanox/mlxsw/pci.h|   2 +
 drivers/net/ethernet/mellanox/mlxsw/port.h   |   4 +
 drivers/net/ethernet/mellanox/mlxsw/reg.h| 104 +++-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c   |  95 ++--
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h   |   1 -
 drivers/net/ethernet/mellanox/mlxsw/switchib.c   | 598 +++
 drivers/net/ethernet/mellanox/mlxsw/switchx2.c   | 380 +++---
 13 files changed, 1258 insertions(+), 137 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/ib.h
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/switchib.c

-- 
2.5.5



Re: [PATCH net-next] lan78xx: Use irq_domain for phy interrupt from USB Int. EP

2016-10-28 Thread Florian Fainelli
On 10/28/2016 11:54 AM, woojung@microchip.com wrote:
> From: Woojung Huh 
> 
> To utilize phylib with interrupt fully than handling some of phy stuff in the 
> MAC driver,
> create irq_domain for USB interrupt EP of phy interrupt and
> pass the irq number to phy_connect_direct() instead of PHY_IGNORE_INTERRUPT.
> 
> Idea comes from drivers/gpio/gpio-dl2.c
> 
> Signed-off-by: Woojung Huh 
> ---

> +static void lan78xx_irq_mask(struct irq_data *irqd)
> +{
> + struct irq_domain_data *data = irq_data_get_irq_chip_data(irqd);
> + struct lan78xx_net *dev =
> + container_of(data, struct lan78xx_net, domain_data);
> + u32 buf;
> +
> + lan78xx_read_reg(dev, INT_EP_CTL, &buf);

lan78xx_read_reg() uses kmalloc() with GFP_KERNEL, while
irq_mask/irq_unmask can be called in atomic context AFAIR, so you may need
to pass down a specific gfp_t to lan78xx_read_reg().

What about usb_submit_urb(), can that work in atomic context? Do you
need to have lan78xx_read_reg() acquire a raw spinlock or something to
serialize them?
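
A rough sketch of the gfp_t idea (illustrative only: the helper name below is
made up, and it still leaves open the question above about the USB transfer
itself running in atomic context):

	/* Hypothetical variant of the register read helper where the caller
	 * supplies the allocation flags for the bounce buffer.
	 */
	static int lan78xx_read_reg_gfp(struct lan78xx_net *dev, u32 index,
					u32 *data, gfp_t gfp)
	{
		u32 *buf = kmalloc(sizeof(u32), gfp);	/* was GFP_KERNEL */
		int ret;

		if (!buf)
			return -ENOMEM;

		/* perform the control transfer exactly as lan78xx_read_reg()
		 * does today, filling *buf; placeholder here:
		 */
		ret = 0;
		if (!ret)
			*data = le32_to_cpup((__le32 *)buf);

		kfree(buf);
		return ret;
	}

lan78xx_irq_mask()/lan78xx_irq_unmask() could then pass GFP_ATOMIC.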

> + buf &= ~INT_EP_PHY_INT_EN_;

Even though you may have only one interrupt line to deal with at the
moment, it is better to derive this bit from irqd->hwirq instead of hard
coding it here.
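
As a sketch of that suggestion (this assumes the INT_EP_CTL enable bits line
up 1:1 with the hwirq numbers, which would need checking against the register
layout; if they don't, a small hwirq-to-bit lookup would be needed instead):

	static void lan78xx_irq_mask(struct irq_data *irqd)
	{
		struct irq_domain_data *data = irq_data_get_irq_chip_data(irqd);
		struct lan78xx_net *dev =
			container_of(data, struct lan78xx_net, domain_data);
		u32 buf;

		lan78xx_read_reg(dev, INT_EP_CTL, &buf);
		/* derive the enable bit from the hwirq instead of hard coding it */
		buf &= ~BIT(irqd_to_hwirq(irqd));
		lan78xx_write_reg(dev, INT_EP_CTL, buf);
	}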

> + lan78xx_write_reg(dev, INT_EP_CTL, buf);
> +}
> +
> +static void lan78xx_irq_unmask(struct irq_data *irqd)
> +{
> + struct irq_domain_data *data = irq_data_get_irq_chip_data(irqd);
> + struct lan78xx_net *dev =
> + container_of(data, struct lan78xx_net, domain_data);
> + u32 buf;
> +
> + lan78xx_read_reg(dev, INT_EP_CTL, &buf);
> + buf |= INT_EP_PHY_INT_EN_;

Same here, this should come from irqd->hwirq.

> + lan78xx_write_reg(dev, INT_EP_CTL, buf);
> +}
> +
> +static struct irq_chip lan78xx_irqchip = {
> + .name   = "lan78xx-phyirq",
> + .irq_mask   = lan78xx_irq_mask,
> + .irq_unmask = lan78xx_irq_unmask,
> +};
> +
> +static int lan78xx_setup_irq_domain(struct lan78xx_net *dev)
> +{
> + struct device_node *of_node;
> + struct irq_domain *irqdomain;
> + unsigned int irq_base = 0;
> + int ret = 0;
> +
> + of_node = dev->udev->dev.parent->of_node;
> +
> + dev->domain_data.irqchip = &lan78xx_irqchip;
> + dev->domain_data.irq_handler = handle_simple_irq;
> +
> + irqdomain = irq_domain_add_simple(of_node, 1, 0, &chip_domain_ops,
> +   &dev->domain_data);

Is there really just one interrupt associated with this peripheral here?

>  
> + if (lan78xx_setup_irq_domain(dev) < 0) {
> + netdev_warn(dev->net, "lan78xx_setup_irq_domain() failed");
> + return -EIO;
> + }

Any reason not to propagate the error code from
lan78xx_setup_irq_domain() here?
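
I.e. something along these lines (a sketch of the suggestion, not the actual
patch):

	ret = lan78xx_setup_irq_domain(dev);
	if (ret < 0) {
		netdev_warn(dev->net, "lan78xx_setup_irq_domain() failed");
		return ret;	/* propagate the real error instead of -EIO */
	}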

-- 
Florian


[PATCH net-next] lan78xx: Use irq_domain for phy interrupt from USB Int. EP

2016-10-28 Thread Woojung.Huh
From: Woojung Huh 

To fully utilize phylib with interrupts, rather than handling some of the
phy work in the MAC driver,
create an irq_domain for the phy interrupt signalled over the USB interrupt
EP and pass the irq number to phy_connect_direct() instead of
PHY_IGNORE_INTERRUPT.

Idea comes from drivers/gpio/gpio-dl2.c

Signed-off-by: Woojung Huh 
---
 drivers/net/usb/lan78xx.c | 142 +-
 1 file changed, 128 insertions(+), 14 deletions(-)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index c4e748e..88ff21d 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -30,13 +30,17 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
 #include 
 #include "lan78xx.h"
 
 #define DRIVER_AUTHOR  "WOOJUNG HUH "
 #define DRIVER_DESC"LAN78XX USB 3.0 Gigabit Ethernet Devices"
 #define DRIVER_NAME"lan78xx"
-#define DRIVER_VERSION "1.0.4"
+#define DRIVER_VERSION "1.0.5"
 
 #define TX_TIMEOUT_JIFFIES (5 * HZ)
 #define THROTTLE_JIFFIES   (HZ / 8)
@@ -296,6 +300,13 @@ struct statstage {
struct lan78xx_statstage64  curr_stat;
 };
 
+struct irq_domain_data {
+   struct irq_domain   *irqdomain;
+   unsigned intirq_base;
+   struct irq_chip *irqchip;
+   irq_flow_handler_t  irq_handler;
+};
+
 struct lan78xx_net {
struct net_device   *net;
struct usb_device   *udev;
@@ -351,6 +362,8 @@ struct lan78xx_net {
 
int delta;
struct statstagestats;
+
+   struct irq_domain_data  domain_data;
 };
 
 /* use ethtool to change the level for any given device */
@@ -1120,8 +1133,6 @@ static int lan78xx_link_reset(struct lan78xx_net *dev)
if (unlikely(ret < 0))
return -EIO;
 
-   phy_mac_interrupt(phydev, 0);
-
del_timer(&dev->stat_monitor);
} else if (phydev->link && !dev->link_on) {
dev->link_on = true;
@@ -1163,7 +1174,6 @@ static int lan78xx_link_reset(struct lan78xx_net *dev)
 
ret = lan78xx_update_flowcontrol(dev, ecmd.base.duplex, ladv,
 radv);
-   phy_mac_interrupt(phydev, 1);
 
if (!timer_pending(&dev->stat_monitor)) {
dev->delta = 1;
@@ -1202,7 +1212,10 @@ static void lan78xx_status(struct lan78xx_net *dev, 
struct urb *urb)
 
if (intdata & INT_ENP_PHY_INT) {
netif_dbg(dev, link, dev->net, "PHY INTR: 0x%08x\n", intdata);
- lan78xx_defer_kevent(dev, EVENT_LINK_RESET);
+   lan78xx_defer_kevent(dev, EVENT_LINK_RESET);
+
+   if (dev->domain_data.irq_base > 0)
+   generic_handle_irq(dev->domain_data.irq_base);
} else
netdev_warn(dev->net,
"unexpected interrupt: 0x%08x\n", intdata);
@@ -1844,6 +1857,103 @@ static void lan78xx_link_status_change(struct 
net_device *net)
}
 }
 
+static int irq_map(struct irq_domain *d, unsigned int irq,
+  irq_hw_number_t hwirq)
+{
+   struct irq_domain_data *data = d->host_data;
+
+   irq_set_chip_data(irq, data);
+   irq_set_chip_and_handler(irq, data->irqchip, data->irq_handler);
+   irq_set_noprobe(irq);
+
+   return 0;
+}
+
+static void irq_unmap(struct irq_domain *d, unsigned int irq)
+{
+   irq_set_chip_and_handler(irq, NULL, NULL);
+   irq_set_chip_data(irq, NULL);
+}
+
+static const struct irq_domain_ops chip_domain_ops = {
+   .map= irq_map,
+   .unmap  = irq_unmap,
+};
+
+static void lan78xx_irq_mask(struct irq_data *irqd)
+{
+   struct irq_domain_data *data = irq_data_get_irq_chip_data(irqd);
+   struct lan78xx_net *dev =
+   container_of(data, struct lan78xx_net, domain_data);
+   u32 buf;
+
+   lan78xx_read_reg(dev, INT_EP_CTL, &buf);
+   buf &= ~INT_EP_PHY_INT_EN_;
+   lan78xx_write_reg(dev, INT_EP_CTL, buf);
+}
+
+static void lan78xx_irq_unmask(struct irq_data *irqd)
+{
+   struct irq_domain_data *data = irq_data_get_irq_chip_data(irqd);
+   struct lan78xx_net *dev =
+   container_of(data, struct lan78xx_net, domain_data);
+   u32 buf;
+
+   lan78xx_read_reg(dev, INT_EP_CTL, &buf);
+   buf |= INT_EP_PHY_INT_EN_;
+   lan78xx_write_reg(dev, INT_EP_CTL, buf);
+}
+
+static struct irq_chip lan78xx_irqchip = {
+   .name   = "lan78xx-phyirq",
+   .irq_mask   = lan78xx_irq_mask,
+   .irq_unmask = lan78xx_irq_unmask,
+};
+
+static int lan78xx_setup_irq_domain(struct lan78xx_net *dev)
+{
+   struct device_node *of_node;
+   struct irq_domain *irqdomain;
+   unsigned int irq_base = 0;
+   int ret = 0;
+
+   of_node = dev->udev->dev.parent->of_node;
+
+   dev->domain_data.irqchip = &lan78xx_irqchip;
+   dev->domain_data.ir

net-next still on 4.8.0?

2016-10-28 Thread Andrew Lunn
Hi Dave

During the merge window, you didn't close net-next. Trying something
new.

We are now at v4.9-rc2, yet net-next is still based on 4.8.0. Is this
also something new you are trying? Or do you still plan to rebase to
an -rcX at some point?

Thanks
Andrew


Re: [PATCH v2] net: cpsw: fix obtaining mac address for am3517

2016-10-28 Thread Jeroen Hofstee

Hello Tony,

On 28-10-16 17:52, Tony Lindgren wrote:

* Jeroen Hofstee  [161028 08:33]:

Commit b6745f6e4e63 ("drivers: net: cpsw: davinci_emac: move reading mac
id to common file") did not only move the code for an am3517, it also
added the slave parameter, resulting in an invalid (all zero) mac address
being returned for an am3517, since it only has a single emac and the slave
parameter is pointing to the second. So simply always read the first and
valid mac-address for a ti,am3517-emac.

And other davinci_emac.c users can have more than one. So is the
reason the slave parameter points to the second instance because
of the location in the hardware?


Sort of, the slave parameter gets determined by whether there are one
or two register ranges associated with the davinci_emac. In davinci_emac.c

res_ctrl = platform_get_resource(pdev, IORESOURCE_MEM, 1);
...
rc = davinci_emac_try_get_mac(pdev, res_ctrl ? 0 : 1,
  priv->mac_addr);

So if there are two ranges, the slave param becomes 0. If there is only
one, it will be 1. Since the am3517 only has a single regs entry, it ends
up with slave 1, while there is only a single davinci_emac.
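
So the fix boils down to something like this where the slave index is chosen
(a minimal sketch of the idea from the commit message; the actual patch may
place the check elsewhere):

	int slave;

	/* am3517 has a single EMAC, so always use the first MAC address
	 * slot no matter how many register ranges are described.
	 */
	if (of_device_is_compatible(pdev->dev.of_node, "ti,am3517-emac"))
		slave = 0;
	else
		slave = res_ctrl ? 0 : 1;

	rc = davinci_emac_try_get_mac(pdev, slave, priv->mac_addr);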

Regards,
Jeroen

arch/arm/boot/dts/dm816x.dtsi
-
eth0: ethernet@4a10 {
compatible = "ti,dm816-emac";
ti,hwmods = "emac0";
reg = <0x4a10 0x800
   0x4a100900 0x3700>;
clocks = <&sysclk24_ck>;
syscon = <&scm_conf>;
ti,davinci-ctrl-reg-offset = <0>;
ti,davinci-ctrl-mod-reg-offset = <0x900>;
ti,davinci-ctrl-ram-offset = <0x2000>;
ti,davinci-ctrl-ram-size = <0x2000>;
interrupts = <40 41 42 43>;
phy-handle = <&phy0>;
};

eth1: ethernet@4a12 {
compatible = "ti,dm816-emac";
ti,hwmods = "emac1";
reg = <0x4a12 0x4000>;
clocks = <&sysclk24_ck>;
syscon = <&scm_conf>;
ti,davinci-ctrl-reg-offset = <0>;
ti,davinci-ctrl-mod-reg-offset = <0x900>;
ti,davinci-ctrl-ram-offset = <0x2000>;
ti,davinci-ctrl-ram-size = <0x2000>;
interrupts = <44 45 46 47>;
phy-handle = <&phy1>;
};

arch/arm/boot/dts/am3517.dtsi
---
davinci_emac: ethernet@0x5c00 {
compatible = "ti,am3517-emac";
ti,hwmods = "davinci_emac";
status = "disabled";
reg = <0x5c00 0x3>;
interrupts = <67 68 69 70>;
syscon = <&scm_conf>;
ti,davinci-ctrl-reg-offset = <0x1>;
ti,davinci-ctrl-mod-reg-offset = <0>;
ti,davinci-ctrl-ram-offset = <0x2>;
ti,davinci-ctrl-ram-size = <0x2000>;
ti,davinci-rmii-en = /bits/ 8 <1>;
local-mac-address = [ 00 00 00 00 00 00 ];
};


[PATCH net-next v3]ldmvsw: tx queue stuck in stopped state after LDC reset

2016-10-28 Thread Aaron Young
From: Aaron Young 

The following patch fixes an issue with the ldmvsw driver where
the network connection of a guest domain becomes non-functional after
the guest domain has panic'd and rebooted.

The root cause was determined to be the following series of
events:

1. Guest domain panics - resulting in the guest no longer processing
   network packets (from ldmvsw driver)
2. The ldmvsw driver (in the control domain) eventually exerts flow
   control due to no more available tx drings and stops the tx queue
   for the guest domain
3. The LDC of the network connection for the guest is reset when
   the guest domain reboots after the panic.
4. The LDC reset event is received by the ldmvsw driver and the ldmvsw
   responds by clearing the tx queue for the guest.
5. ldmvsw waits indefinitely for a DATA ACK from the guest - which is
   the normal method to re-enable the tx queue. But the ACK never comes
   because the tx queue was cleared due to the LDC reset.

To fix this issue, in addition to clearing the tx queue, re-enable the
tx queue on an LDC reset. This prevents the ldmvsw from getting caught in
this deadlocked state of waiting for a DATA ACK which will never come.

Signed-off-by: Aaron Young 
Acked-by: Sowmini Varadhan 
---
 drivers/net/ethernet/sun/sunvnet_common.c |   16 +---
 1 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/sun/sunvnet_common.c 
b/drivers/net/ethernet/sun/sunvnet_common.c
index 58efe69..8878b75 100644
--- a/drivers/net/ethernet/sun/sunvnet_common.c
+++ b/drivers/net/ethernet/sun/sunvnet_common.c
@@ -704,9 +704,8 @@ static int handle_mcast(struct vnet_port *port, void 
*msgbuf)
return 0;
 }
 
-/* Got back a STOPPED LDC message on port. If the queue is stopped,
- * wake it up so that we'll send out another START message at the
- * next TX.
+/* If the queue is stopped, wake it up so that we'll
+ * send out another START message at the next TX.
  */
 static void maybe_tx_wakeup(struct vnet_port *port)
 {
@@ -734,6 +733,7 @@ bool sunvnet_port_is_up_common(struct vnet_port *vnet)
 
 static int vnet_event_napi(struct vnet_port *port, int budget)
 {
+   struct net_device *dev = VNET_PORT_TO_NET_DEVICE(port);
struct vio_driver_state *vio = &port->vio;
int tx_wakeup, err;
int npkts = 0;
@@ -747,6 +747,16 @@ static int vnet_event_napi(struct vnet_port *port, int 
budget)
if (event == LDC_EVENT_RESET) {
vnet_port_reset(port);
vio_port_up(vio);
+
+   /* If the device is running but its tx queue was
+* stopped (due to flow control), restart it.
+* This is necessary since vnet_port_reset()
+* clears the tx drings and thus we may never get
+* back a VIO_TYPE_DATA ACK packet - which is
+* the normal mechanism to restart the tx queue.
+*/
+   if (netif_running(dev))
+   maybe_tx_wakeup(port);
}
port->rx_event = 0;
return 0;
-- 
1.7.1



Re: [PATCH net-next RFC WIP] Patch for XDP support for virtio_net

2016-10-28 Thread Alexei Starovoitov
On Fri, Oct 28, 2016 at 05:18:12PM +0100, Jakub Kicinski wrote:
> On Fri, 28 Oct 2016 08:56:35 -0700, John Fastabend wrote:
> > On 16-10-27 07:10 PM, David Miller wrote:
> > > From: Alexander Duyck 
> > > Date: Thu, 27 Oct 2016 18:43:59 -0700
> > >   
> > >> On Thu, Oct 27, 2016 at 6:35 PM, David Miller  
> > >> wrote:  
> > >>> From: "Michael S. Tsirkin" 
> > >>> Date: Fri, 28 Oct 2016 01:25:48 +0300
> > >>>  
> >  On Thu, Oct 27, 2016 at 05:42:18PM -0400, David Miller wrote:  
> > > From: "Michael S. Tsirkin" 
> > > Date: Fri, 28 Oct 2016 00:30:35 +0300
> > >  
> > >> Something I'd like to understand is how does XDP address the
> > >> problem that 100Byte packets are consuming 4K of memory now.  
> > >
> > > Via page pools.  We're going to make a generic one, but right now
> > > each and every driver implements a quick list of pages to allocate
> > > from (and thus avoid the DMA man/unmap overhead, etc.)  
> > 
> >  So to clarify, ATM virtio doesn't attempt to avoid dma map/unmap
> >  so there should be no issue with that even when using sub/page
> >  regions, assuming DMA APIs support sub-page map/unmap correctly.  
> > >>>
> > >>> That's not what I said.
> > >>>
> > >>> The page pools are meant to address the performance degradation from
> > >>> going to having one packet per page for the sake of XDP's
> > >>> requirements.
> > >>>
> > >>> You still need to have one packet per page for correct XDP operation
> > >>> whether you do page pools or not, and whether you have DMA mapping
> > >>> (or it's equivalent virutalization operation) or not.  
> > >>
> > >> Maybe I am missing something here, but why do you need to limit things
> > >> to one packet per page for correct XDP operation?  Most of the drivers
> > >> out there now are usually storing something closer to at least 2
> > >> packets per page, and with the DMA API fixes I am working on there
> > >> should be no issue with changing the contents inside those pages since
> > >> we won't invalidate or overwrite the data after the DMA buffer has
> > >> been synchronized for use by the CPU.  
> > > 
> > > Because with SKB's you can share the page with other packets.
> > > 
> > > With XDP you simply cannot.
> > > 
> > > It's software semantics that are the issue.  SKB frag list pages
> > > are read only, XDP packets are writable.
> > > 
> > > This has nothing to do with "writability" of the pages wrt. DMA
> > > mapping or cpu mappings.
> > >   
> > 
> > Sorry I'm not seeing it either. The current xdp_buff is defined
> > by,
> > 
> >   struct xdp_buff {
> > void *data;
> > void *data_end;
> >   };
> > 
> > The verifier has an xdp_is_valid_access() check to ensure we don't go
> > past data_end. The page for now at least never leaves the driver. For
> > the work to get xmit to other devices working I'm still not sure I see
> > any issue.
> 
> +1
> 
> Do we want to make the packet-per-page a requirement because it could
> be useful in the future from architectural standpoint?  I guess there
> is a trade-off here between having the comfort of people following this
> requirement today and making driver support for XDP even more complex.

It looks to me that packet per page makes drivers simpler instead of complex.
ixgbe split-page and mlx* order-3/5 tricks are definitely complex.
The skb truesize concerns come from the host when data is delivered to user
space and we need to have precise memory accounting for different applications
and different users. XDP is all root and imo it's far away from the days when
multi-user non-root issues start to pop up.
At the same time XDP doesn't require using a 4k buffer in something like
Netronome.
If the xdp bpf program can be offloaded into HW with 1800 byte buffers, great!
For the x86 cpu, 4k bytes is the natural allocation chunk. Anything lower
requires delicate dma tricks paired with an even more complex slab allocator
and atomic refcnts.
All of that can only drive the performance down.
Compared to kernel bypass, xdp is using 4k pages whereas dpdk has
to use huge pages, so xdp is saving a ton of memory already!



Re: [PATCH] mac80211: fix incorrect error return path on tmp allocation failure

2016-10-28 Thread Johannes Berg
On Fri, 2016-10-28 at 19:08 +0100, Colin King wrote:
> From: Colin Ian King 
> 
> The current exit path when tmp fails to be allocated is via the
> fail label which frees tfm2 which has not yet been allocated,
> which is problematic since tfm2 is not initialized and is a garbage
> pointer. Fix this by exiting directly to the return at the end
> of the function and hence avoiding the freeing of tfm2.

Yeah, thanks. Arnd beat you to the fix by about 8 hours, so I've
already applied his patch and sent an updated pull request :)

johannes


Re: [PATCH v2] net: cpsw: fix obtaining mac address for am3517

2016-10-28 Thread Tony Lindgren
* Jeroen Hofstee  [161028 11:19]:
> Hello Tony,
> 
> On 28-10-16 17:52, Tony Lindgren wrote:
> > * Jeroen Hofstee  [161028 08:33]:
> > > Commit b6745f6e4e63 ("drivers: net: cpsw: davinci_emac: move reading mac
> > > id to common file") did not only move the code for an am3517, it also
> > > added the slave parameter, resulting in an invalid (all zero) mac address
> > > being returned for an am3517, since it only has a single emac and the 
> > > slave
> > > parameter is pointing to the second. So simply always read the first and
> > > valid mac-address for a ti,am3517-emac.
> > And others davinci_emac.c users can have more than one. So is the
> > reason the slave parameter points to the second instance because
> > of the location in the hardware?
> 
> Sort of, the slave parameter gets determined by the fact if there is one
> or two register range(s) associated with the davinci_emac. In davinci_emac.c
> 
> res_ctrl = platform_get_resource(pdev, IORESOURCE_MEM, 1);
> ...
> rc = davinci_emac_try_get_mac(pdev, res_ctrl ? 0 : 1,
>   priv->mac_addr);
> 
> So it there are two ranges, the slave param becomes 0. It there is only one,
> it
> will be 1. Since the am3517 only has a single regs entry it ends up with
> slave 1,
> while there is only a single davinci_emac.

OK thanks for clarifying it:

Acked-by: Tony Lindgren 


Re: [net-next PATCH 03/27] swiotlb: Add support for DMA_ATTR_SKIP_CPU_SYNC

2016-10-28 Thread Alexander Duyck
On Fri, Oct 28, 2016 at 10:34 AM, Konrad Rzeszutek Wilk
 wrote:
> On Tue, Oct 25, 2016 at 11:37:03AM -0400, Alexander Duyck wrote:
>> As a first step to making DMA_ATTR_SKIP_CPU_SYNC apply to architectures
>> beyond just ARM I need to make it so that the swiotlb will respect the
>> flag.  In order to do that I also need to update the swiotlb-xen since it
>> heavily makes use of the functionality.
>>
>> Cc: Konrad Rzeszutek Wilk 
>
> I am pretty sure I acked it the RFC. Was there a particular
> reason (this is very different from the RFC?) you dropped my ACk?
>
> Thanks.

If I recall, you had acked patch 1, but for 2 you had some review
comments and suggested I change a few things.  What was patch 2 in
the RFC was split out into patches 2 and 3.  That is why I didn't
include an Ack from you for those patches.

Patch 2 is a fix for Xen to address the fact that you could return
either 0 or ~0.  It was part of patch 2 originally and I pulled it out
into a separate patch.

Patch 3 does most of what patch 2 in the RFC was doing before with
fixes to address the fact that I was moving some code to avoid going
over 80 characters.  I found a different way to fix that by just
updating attrs before using it instead of ORing in the value when
passing it as a parameter.

>> @@ -558,11 +560,12 @@ void xen_swiotlb_unmap_page(struct device *hwdev, 
>> dma_addr_t dev_addr,
>>
>> start_dma_addr,
>>sg_phys(sg),
>>sg->length,
>> -  dir);
>> +  dir, attrs);
>>   if (map == SWIOTLB_MAP_ERROR) {
>>   dev_warn(hwdev, "swiotlb buffer is full\n");
>>   /* Don't panic here, we expect map_sg users
>>  to do proper error handling. */
>> + attrs |= DMA_ATTR_SKIP_CPU_SYNC;
>>   xen_swiotlb_unmap_sg_attrs(hwdev, sgl, i, dir,
>>  attrs);
>>   sg_dma_len(sgl) = 0;

The biggest difference from patch 2 in the RFC is right here.  Before,
this code was moved off to the end of the function and reached via
a label which I then jumped to.  Instead, I just ORed the
DMA_ATTR_SKIP_CPU_SYNC into attrs and skipped the problem entirely.
It should be harmless to do this way since attrs isn't used anywhere
else once we have had the error.

I hope that helps to clear it up.  So if you want I will add your
Acked-by for patches 2 and 3, but I just wanted to make sure this
worked with the changes you suggested.

Thanks.

- Alex


[PATCH] mac80211: fix incorrect error return path on tmp allocation failure

2016-10-28 Thread Colin King
From: Colin Ian King 

The current exit path when tmp fails to be allocated is via the
fail label, which frees tfm2 even though it has not yet been allocated;
this is problematic since tfm2 is not initialized and is a garbage
pointer. Fix this by exiting directly to the return at the end
of the function and hence avoiding the freeing of tfm2.

Signed-off-by: Colin Ian King 
---
 net/mac80211/fils_aead.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/mac80211/fils_aead.c b/net/mac80211/fils_aead.c
index b81b4f24..c114737 100644
--- a/net/mac80211/fils_aead.c
+++ b/net/mac80211/fils_aead.c
@@ -112,7 +112,7 @@ static int aes_siv_encrypt(const u8 *key, size_t key_len,
tmp = kmemdup(plain, plain_len, GFP_KERNEL);
if (!tmp) {
res = -ENOMEM;
-   goto fail;
+   goto fail_ret;
}
 
/* IV for CTR before encrypted data */
@@ -150,6 +150,7 @@ static int aes_siv_encrypt(const u8 *key, size_t key_len,
 fail:
kfree(tmp);
crypto_free_skcipher(tfm2);
+fail_ret:
return res;
 }
 
-- 
2.9.3



Re: [PATCH for-next 00/14][PULL request] Mellanox mlx5 core driver updates 2016-10-25

2016-10-28 Thread David Miller

I really dislike pull requests of this form.

You add lots of datastructures and helper functions but no actual
users of these facilities to the driver.

Do this instead:

1) Add TSAR infrastructure
2) Add use of TSAR facilities to the driver

That's one pull request.

I don't care if this is hard, or if there are entanglements with
Infiniband or whatever, you must submit changes in this manner.

I will not accept additions to a driver that don't even get really
used.


Re: [PATCH net-next] udp: do fwd memory scheduling on dequeue

2016-10-28 Thread Eric Dumazet
On Fri, 2016-10-28 at 10:16 -0700, Eric Dumazet wrote:
> Nice !
> 
> I was working on this as well and my implementation was somewhat
> different.

This is my WIP

Note this can be split in two parts.

1) One adding struct sock *sk param to ip_cmsg_recv_offset()
 
   This was because I left skb->sk NULL for skbs stored in receive
queue.
   You chose instead to set skb->sk, which is unusual (check
skb_orphan() BUG_ON())

2) Udp changes.

Tell me what you think, thanks again !
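
For reference, the skb_orphan() check mentioned in 1) looks roughly like this
(paraphrased from include/linux/skbuff.h): an skb with skb->sk set but no
destructor trips the BUG_ON when it is later orphaned.

	static inline void skb_orphan(struct sk_buff *skb)
	{
		if (skb->destructor) {
			skb->destructor(skb);
			skb->sk		= NULL;
			skb->destructor	= NULL;
		} else {
			BUG_ON(skb->sk);	/* sk set without a destructor */
		}
	}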


diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 601258f6e621..52e70431abc9 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3033,9 +3033,13 @@ int __skb_wait_for_more_packets(struct sock *sk, int 
*err, long *timeo_p,
const struct sk_buff *skb);
 struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned flags,
int *peeked, int *off, int *err,
-   struct sk_buff **last);
+   struct sk_buff **last,
+   void (*destructor)(struct sock *sk,
+  struct sk_buff 
*skb));
 struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned flags,
-   int *peeked, int *off, int *err);
+   int *peeked, int *off, int *err,
+   void (*destructor)(struct sock *sk,
+  struct sk_buff *skb));
 struct sk_buff *skb_recv_datagram(struct sock *sk, unsigned flags, int noblock,
  int *err);
 unsigned int datagram_poll(struct file *file, struct socket *sock,
diff --git a/include/net/ip.h b/include/net/ip.h
index 79b102ffbe16..f91fc376f3fb 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -572,7 +572,8 @@ int ip_options_rcv_srr(struct sk_buff *skb);
  */
 
 void ipv4_pktinfo_prepare(const struct sock *sk, struct sk_buff *skb);
-void ip_cmsg_recv_offset(struct msghdr *msg, struct sk_buff *skb, int tlen, 
int offset);
+void ip_cmsg_recv_offset(struct msghdr *msg, struct sock *sk,
+struct sk_buff *skb, int tlen, int offset);
 int ip_cmsg_send(struct sock *sk, struct msghdr *msg,
 struct ipcm_cookie *ipc, bool allow_ipv6);
 int ip_setsockopt(struct sock *sk, int level, int optname, char __user *optval,
@@ -594,7 +595,7 @@ void ip_local_error(struct sock *sk, int err, __be32 daddr, 
__be16 dport,
 
 static inline void ip_cmsg_recv(struct msghdr *msg, struct sk_buff *skb)
 {
-   ip_cmsg_recv_offset(msg, skb, 0, 0);
+   ip_cmsg_recv_offset(msg, skb->sk, skb, 0, 0);
 }
 
 bool icmp_global_allow(void);
diff --git a/include/net/udp.h b/include/net/udp.h
index 18f1e6b91927..0178f4552c4d 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -248,6 +248,7 @@ static inline __be16 udp_flow_src_port(struct net *net, 
struct sk_buff *skb,
 /* net/ipv4/udp.c */
 void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len);
 int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb);
+void udp_skb_destructor(struct sock *sk, struct sk_buff *skb);
 
 void udp_v4_early_demux(struct sk_buff *skb);
 int udp_get_port(struct sock *sk, unsigned short snum,
diff --git a/net/core/datagram.c b/net/core/datagram.c
index bfb973aebb5b..48228d15f33c 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -198,7 +198,8 @@ static struct sk_buff *skb_set_peeked(struct sk_buff *skb)
  */
 struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
int *peeked, int *off, int *err,
-   struct sk_buff **last)
+   struct sk_buff **last,
+   void (*destructor)(struct sock *sk, 
struct sk_buff *skb))
 {
struct sk_buff_head *queue = &sk->sk_receive_queue;
struct sk_buff *skb;
@@ -241,9 +242,11 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, 
unsigned int flags,
}
 
atomic_inc(&skb->users);
-   } else
+   } else {
__skb_unlink(skb, queue);
-
+   if (destructor)
+   destructor(sk, skb);
+   }
spin_unlock_irqrestore(&queue->lock, cpu_flags);
*off = _off;
return skb;
@@ -262,7 +265,9 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, 
unsigned int flags,
 EXPORT_SYMBOL(__skb_try_recv_datagram);
 
 struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned int flags,
-   int *peeked, int *off, int *err)
+   int *peeked, int *off,

Re: [PATCH v2] LSO feature added to Cadence GEM driver

2016-10-28 Thread David Miller
From: Rafal Ozieblo 
Date: Tue, 25 Oct 2016 13:05:30 +0100

> New Cadence GEM hardware support Large Segment Offload (LSO):
> TCP segmentation offload (TSO) as well as UDP fragmentation
> offload (UFO). Support for those features was added to the driver.
> 
> Signed-off-by: Rafal Ozieblo 

Patch does not apply cleanly to net-next, please respin.

Also please fix your Subject line, it should be of the form:

[PATCH net-next v2] cadence: Add LSO support.


Re: [PATCH] net caif: insert missing spaces in pr_* messages and unbreak multi-line strings

2016-10-28 Thread David Miller
From: Colin King 
Date: Tue, 25 Oct 2016 12:18:42 +0100

> From: Colin Ian King 
> 
> Some of the pr_* messages are missing spaces, so insert these and also
> unbreak multi-line literal strings in pr_* messages
> 
> Signed-off-by: Colin Ian King 

Applied.


Re: [patch net-next 0/5] mlxsw: small driver update

2016-10-28 Thread David Miller
From: Jiri Pirko 
Date: Thu, 27 Oct 2016 15:12:56 +0200

> For details, see individual patches.

Series applied, thanks.


Re: [patch net 0/2] mlxsw: Couple of fixes

2016-10-28 Thread David Miller
From: Jiri Pirko 
Date: Tue, 25 Oct 2016 11:25:55 +0200

> From: Jiri Pirko 
> 
> Couple of LPM tree management fixes.

Series applied, thanks Jiri.


Re: [net-next PATCH 02/27] swiotlb-xen: Enforce return of DMA_ERROR_CODE in mapping function

2016-10-28 Thread Konrad Rzeszutek Wilk
On Tue, Oct 25, 2016 at 11:36:58AM -0400, Alexander Duyck wrote:
> The mapping function should always return DMA_ERROR_CODE when a mapping has
> failed as this is what the DMA API expects when a DMA error has occurred.
> The current function for mapping a page in Xen was returning either
> DMA_ERROR_CODE or 0 depending on where it failed.
> 
> On x86 DMA_ERROR_CODE is 0, but on other architectures such as ARM it is
> ~0. We need to make sure we return the same error value if either the
> mapping failed or the device is not capable of accessing the mapping.
> 
> If we are returning DMA_ERROR_CODE as our error value we can drop the
> function for checking the error code as the default is to compare the
> return value against DMA_ERROR_CODE if no function is defined.
> 
> Cc: Konrad Rzeszutek Wilk 

I am pretty sure I gave an Ack. Any particular reason for dropping it?
(If so, please add a comment under the --- explaining the reason.)

Thanks.
> Signed-off-by: Alexander Duyck 
> ---
>  arch/arm/xen/mm.c  |1 -
>  arch/x86/xen/pci-swiotlb-xen.c |1 -
>  drivers/xen/swiotlb-xen.c  |   18 ++
>  include/xen/swiotlb-xen.h  |3 ---
>  4 files changed, 6 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/arm/xen/mm.c b/arch/arm/xen/mm.c
> index d062f08..bd62d94 100644
> --- a/arch/arm/xen/mm.c
> +++ b/arch/arm/xen/mm.c
> @@ -186,7 +186,6 @@ void xen_destroy_contiguous_region(phys_addr_t pstart, 
> unsigned int order)
>  EXPORT_SYMBOL(xen_dma_ops);
>  
>  static struct dma_map_ops xen_swiotlb_dma_ops = {
> - .mapping_error = xen_swiotlb_dma_mapping_error,
>   .alloc = xen_swiotlb_alloc_coherent,
>   .free = xen_swiotlb_free_coherent,
>   .sync_single_for_cpu = xen_swiotlb_sync_single_for_cpu,
> diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c
> index 0e98e5d..a9fafb5 100644
> --- a/arch/x86/xen/pci-swiotlb-xen.c
> +++ b/arch/x86/xen/pci-swiotlb-xen.c
> @@ -19,7 +19,6 @@
>  int xen_swiotlb __read_mostly;
>  
>  static struct dma_map_ops xen_swiotlb_dma_ops = {
> - .mapping_error = xen_swiotlb_dma_mapping_error,
>   .alloc = xen_swiotlb_alloc_coherent,
>   .free = xen_swiotlb_free_coherent,
>   .sync_single_for_cpu = xen_swiotlb_sync_single_for_cpu,
> diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
> index 87e6035..b8014bf 100644
> --- a/drivers/xen/swiotlb-xen.c
> +++ b/drivers/xen/swiotlb-xen.c
> @@ -416,11 +416,12 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, 
> struct page *page,
>   /*
>* Ensure that the address returned is DMA'ble
>*/
> - if (!dma_capable(dev, dev_addr, size)) {
> - swiotlb_tbl_unmap_single(dev, map, size, dir);
> - dev_addr = 0;
> - }
> - return dev_addr;
> + if (dma_capable(dev, dev_addr, size))
> + return dev_addr;
> +
> + swiotlb_tbl_unmap_single(dev, map, size, dir);
> +
> + return DMA_ERROR_CODE;
>  }
>  EXPORT_SYMBOL_GPL(xen_swiotlb_map_page);
>  
> @@ -648,13 +649,6 @@ void xen_swiotlb_unmap_page(struct device *hwdev, 
> dma_addr_t dev_addr,
>  }
>  EXPORT_SYMBOL_GPL(xen_swiotlb_sync_sg_for_device);
>  
> -int
> -xen_swiotlb_dma_mapping_error(struct device *hwdev, dma_addr_t dma_addr)
> -{
> - return !dma_addr;
> -}
> -EXPORT_SYMBOL_GPL(xen_swiotlb_dma_mapping_error);
> -
>  /*
>   * Return whether the given device DMA address mask can be supported
>   * properly.  For example, if your device can only drive the low 24-bits
> diff --git a/include/xen/swiotlb-xen.h b/include/xen/swiotlb-xen.h
> index 7c35e27..a0083be 100644
> --- a/include/xen/swiotlb-xen.h
> +++ b/include/xen/swiotlb-xen.h
> @@ -51,9 +51,6 @@ extern void xen_swiotlb_unmap_page(struct device *hwdev, 
> dma_addr_t dev_addr,
>  int nelems, enum dma_data_direction dir);
>  
>  extern int
> -xen_swiotlb_dma_mapping_error(struct device *hwdev, dma_addr_t dma_addr);
> -
> -extern int
>  xen_swiotlb_dma_supported(struct device *hwdev, u64 mask);
>  
>  extern int
> 


Re: [net-next PATCH 03/27] swiotlb: Add support for DMA_ATTR_SKIP_CPU_SYNC

2016-10-28 Thread Konrad Rzeszutek Wilk
On Tue, Oct 25, 2016 at 11:37:03AM -0400, Alexander Duyck wrote:
> As a first step to making DMA_ATTR_SKIP_CPU_SYNC apply to architectures
> beyond just ARM I need to make it so that the swiotlb will respect the
> flag.  In order to do that I also need to update the swiotlb-xen since it
> heavily makes use of the functionality.
> 
> Cc: Konrad Rzeszutek Wilk 

I am pretty sure I acked it in the RFC. Was there a particular
reason (is this very different from the RFC?) you dropped my Ack?

Thanks.

> Signed-off-by: Alexander Duyck 
> ---
>  drivers/xen/swiotlb-xen.c |   11 +++---
>  include/linux/swiotlb.h   |6 --
>  lib/swiotlb.c |   48 
> +++--
>  3 files changed, 40 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
> index b8014bf..3d048af 100644
> --- a/drivers/xen/swiotlb-xen.c
> +++ b/drivers/xen/swiotlb-xen.c
> @@ -405,7 +405,8 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, 
> struct page *page,
>*/
>   trace_swiotlb_bounced(dev, dev_addr, size, swiotlb_force);
>  
> - map = swiotlb_tbl_map_single(dev, start_dma_addr, phys, size, dir);
> + map = swiotlb_tbl_map_single(dev, start_dma_addr, phys, size, dir,
> +  attrs);
>   if (map == SWIOTLB_MAP_ERROR)
>   return DMA_ERROR_CODE;
>  
> @@ -419,7 +420,8 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, 
> struct page *page,
>   if (dma_capable(dev, dev_addr, size))
>   return dev_addr;
>  
> - swiotlb_tbl_unmap_single(dev, map, size, dir);
> + swiotlb_tbl_unmap_single(dev, map, size, dir,
> +  attrs | DMA_ATTR_SKIP_CPU_SYNC);
>  
>   return DMA_ERROR_CODE;
>  }
> @@ -445,7 +447,7 @@ static void xen_unmap_single(struct device *hwdev, 
> dma_addr_t dev_addr,
>  
>   /* NOTE: We use dev_addr here, not paddr! */
>   if (is_xen_swiotlb_buffer(dev_addr)) {
> - swiotlb_tbl_unmap_single(hwdev, paddr, size, dir);
> + swiotlb_tbl_unmap_single(hwdev, paddr, size, dir, attrs);
>   return;
>   }
>  
> @@ -558,11 +560,12 @@ void xen_swiotlb_unmap_page(struct device *hwdev, 
> dma_addr_t dev_addr,
>start_dma_addr,
>sg_phys(sg),
>sg->length,
> -  dir);
> +  dir, attrs);
>   if (map == SWIOTLB_MAP_ERROR) {
>   dev_warn(hwdev, "swiotlb buffer is full\n");
>   /* Don't panic here, we expect map_sg users
>  to do proper error handling. */
> + attrs |= DMA_ATTR_SKIP_CPU_SYNC;
>   xen_swiotlb_unmap_sg_attrs(hwdev, sgl, i, dir,
>  attrs);
>   sg_dma_len(sgl) = 0;
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index e237b6f..4517be9 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -44,11 +44,13 @@ enum dma_sync_target {
>  extern phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
> dma_addr_t tbl_dma_addr,
> phys_addr_t phys, size_t size,
> -   enum dma_data_direction dir);
> +   enum dma_data_direction dir,
> +   unsigned long attrs);
>  
>  extern void swiotlb_tbl_unmap_single(struct device *hwdev,
>phys_addr_t tlb_addr,
> -  size_t size, enum dma_data_direction dir);
> +  size_t size, enum dma_data_direction dir,
> +  unsigned long attrs);
>  
>  extern void swiotlb_tbl_sync_single(struct device *hwdev,
>   phys_addr_t tlb_addr,
> diff --git a/lib/swiotlb.c b/lib/swiotlb.c
> index 47aad37..b538d39 100644
> --- a/lib/swiotlb.c
> +++ b/lib/swiotlb.c
> @@ -425,7 +425,8 @@ static void swiotlb_bounce(phys_addr_t orig_addr, 
> phys_addr_t tlb_addr,
>  phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
>  dma_addr_t tbl_dma_addr,
>  phys_addr_t orig_addr, size_t size,
> -enum dma_data_direction dir)
> +enum dma_data_direction dir,
> +unsigned long attrs)
>  {
>   unsigned long flags;
>   phys_addr_t tlb_addr;
> @@ -526,7 +527,8 @@ phys_addr_t swiotlb_tbl_map_single(struct

Re: pull request (net-next): ipsec-next 2016-10-25

2016-10-28 Thread David Miller
From: Steffen Klassert 
Date: Tue, 25 Oct 2016 07:05:40 +0200

> Just a leftover from the last development cycle.
> 
> 1) Remove some unused code, from Florian Westphal.
> 
> Please pull or let me know if there are problems.

Pulled, thanks!


Re: [PATCH net-next] udp: do fwd memory scheduling on dequeue

2016-10-28 Thread Eric Dumazet

Nice !

I was working on this as well and my implementation was somewhat
different.

I then stopped when I found the UDP checksum bug and was waiting for net
to be merged into net-next
( 
https://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=10df8e6152c6c400a563a673e9956320bfce1871
 )

On Fri, 2016-10-28 at 15:20 +0200, Paolo Abeni wrote:
> A new argument is added to __skb_recv_datagram, it allows
> the caller to perform protocol specific action on dequeue
> under the receive queue spinlock.
> The UDP protocol uses such argument to perform fwd memory
> reclaiming on dequeue, while protocol memory and rmem updating
> is delayed after the lock release, to keep the time spent
> under the lock as low as possible.
> The UDP specific skb destructor is not used anymore, instead
> explicit memory reclaiming is performed at close() time and
> when skbs are removed from the receive queue.
> The in kernel UDP protocol users now need to use an
> skb_recv_udp() variant instead of skb_recv_datagram() to
> properly perform memory accounting on dequeue.
> 
> Overall, this allows acquiring only once the receive queue
> lock on dequeue.
> 
> Tested using pktgen with random src port, 64 bytes packet,
> wire-speed on a 10G link as sender and udp_sink as the receiver,
> using an l4 tuple rxhash to stress the contention, and one or more
> udp_sink instances with reuseport.
> 
> nr sinks  vanilla patched
> 1 440 560
> 3 21502300
> 6 36503800
> 9 44504600
> 1262506450
> 
> Suggested-by: Eric Dumazet 
> Acked-by: Hannes Frederic Sowa 
> Signed-off-by: Paolo Abeni 
> ---
>  include/linux/skbuff.h |  4 
>  include/net/udp.h  | 10 
>  net/core/datagram.c| 11 ++---
>  net/ipv4/udp.c | 65 
> --
>  net/ipv6/udp.c |  3 +--
>  net/rxrpc/input.c  |  7 +++---
>  net/sunrpc/svcsock.c   |  2 +-
>  net/sunrpc/xprtsock.c  |  2 +-
>  net/unix/af_unix.c |  4 ++--
>  9 files changed, 83 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 601258f..dd171a9 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -3028,13 +3028,17 @@ static inline void skb_frag_list_init(struct sk_buff 
> *skb)
>  #define skb_walk_frags(skb, iter)\
>   for (iter = skb_shinfo(skb)->frag_list; iter; iter = iter->next)
>  
> +typedef void (*skb_dequeue_cb_t)(struct sock *sk, struct sk_buff *skb,
> +  int flags);
>  
>  int __skb_wait_for_more_packets(struct sock *sk, int *err, long *timeo_p,
>   const struct sk_buff *skb);
>  struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned flags,
> + skb_dequeue_cb_t dequeue_cb,
>   int *peeked, int *off, int *err,
>   struct sk_buff **last);
>  struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned flags,
> + skb_dequeue_cb_t dequeue_cb,
>   int *peeked, int *off, int *err);
>  struct sk_buff *skb_recv_datagram(struct sock *sk, unsigned flags, int 
> noblock,
> int *err);
> diff --git a/include/net/udp.h b/include/net/udp.h
> index 18f1e6b..983c861 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -48,6 +48,7 @@ struct udp_skb_cb {
>   } header;
>   __u16   cscov;
>   __u8partial_cov;
> + int fwd_memory_released;

I was not adding this field.

>  };
>  #define UDP_SKB_CB(__skb)((struct udp_skb_cb *)((__skb)->cb))
>  
> @@ -248,6 +249,15 @@ static inline __be16 udp_flow_src_port(struct net *net, 
> struct sk_buff *skb,
>  /* net/ipv4/udp.c */
>  void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len);
>  int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb);
> +struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags, int 
> noblock,
> +int *peeked, int *off, int *err);
> +static inline struct sk_buff *skb_recv_udp(struct sock *sk, unsigned int 
> flags,
> +int noblock, int *err)
> +{
> + int peeked, off = 0;
> +
> + return __skb_recv_udp(sk, flags, noblock, &peeked, &off, err);
> +}
>  
>  void udp_v4_early_demux(struct sk_buff *skb);
>  int udp_get_port(struct sock *sk, unsigned short snum,
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index bfb973a..226548b 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -165,6 +165,7 @@ static struct sk_buff *skb_set_peeked(struct sk_buff *skb)
>   *   __skb_try_recv_datagram - Receive a datagram skbuff
>   *   @sk: socket
>   *   @flags: MSG_ flags
> + *   @dequeue_cb: invoked under the rece

Re: [PATCH net-next RFC WIP] Patch for XDP support for virtio_net

2016-10-28 Thread David Miller
From: John Fastabend 
Date: Fri, 28 Oct 2016 08:56:35 -0700

> On 16-10-27 07:10 PM, David Miller wrote:
>> From: Alexander Duyck 
>> Date: Thu, 27 Oct 2016 18:43:59 -0700
>> 
>>> On Thu, Oct 27, 2016 at 6:35 PM, David Miller  wrote:
 From: "Michael S. Tsirkin" 
 Date: Fri, 28 Oct 2016 01:25:48 +0300

> On Thu, Oct 27, 2016 at 05:42:18PM -0400, David Miller wrote:
>> From: "Michael S. Tsirkin" 
>> Date: Fri, 28 Oct 2016 00:30:35 +0300
>>
>>> Something I'd like to understand is how does XDP address the
>>> problem that 100Byte packets are consuming 4K of memory now.
>>
>> Via page pools.  We're going to make a generic one, but right now
>> each and every driver implements a quick list of pages to allocate
>> from (and thus avoid the DMA man/unmap overhead, etc.)
>
> So to clarify, ATM virtio doesn't attempt to avoid dma map/unmap
> so there should be no issue with that even when using sub/page
> regions, assuming DMA APIs support sub-page map/unmap correctly.

 That's not what I said.

 The page pools are meant to address the performance degradation from
 going to having one packet per page for the sake of XDP's
 requirements.

 You still need to have one packet per page for correct XDP operation
 whether you do page pools or not, and whether you have DMA mapping
 (or it's equivalent virutalization operation) or not.
>>>
>>> Maybe I am missing something here, but why do you need to limit things
>>> to one packet per page for correct XDP operation?  Most of the drivers
>>> out there now are usually storing something closer to at least 2
>>> packets per page, and with the DMA API fixes I am working on there
>>> should be no issue with changing the contents inside those pages since
>>> we won't invalidate or overwrite the data after the DMA buffer has
>>> been synchronized for use by the CPU.
>> 
>> Because with SKB's you can share the page with other packets.
>> 
>> With XDP you simply cannot.
>> 
>> It's software semantics that are the issue.  SKB frag list pages
>> are read only, XDP packets are writable.
>> 
>> This has nothing to do with "writability" of the pages wrt. DMA
>> mapping or cpu mappings.
>> 
> 
> Sorry I'm not seeing it either. The current xdp_buff is defined
> by,
> 
>   struct xdp_buff {
>   void *data;
>   void *data_end;
>   };
> 
> The verifier has an xdp_is_valid_access() check to ensure we don't go
> past data_end. The page for now at least never leaves the driver. For
> the work to get xmit to other devices working I'm still not sure I see
> any issue.

I guess I can say that the packets must be "writable" until I'm blue
in the face but I'll say it again, semantically writable pages are a
requirement.  And if multiple packets share a page this requirement
is not satisfied.

Also, we want to do several things in the future:

1) Allow push/pop of headers via eBPF code, which means we need
   headroom.

2) Transparently zero-copy pass packets into userspace, basically
   the user will have a semi-permanently mapped ring of all the
   packet pages sitting in the RX queue of the device and the
   page pool associated with it.  This way we avoid all of the
   TLB flush/map overhead for the user's mapping of the packets
   just as we avoid the DMA map/unmap overhead.

And that's just the beginninng.

I'm sure others can come up with more reasons why we have this
requirement.



Re: [net-next PATCH 00/27] Add support for DMA writable pages being writable by the network stack

2016-10-28 Thread David Miller
From: Alexander Duyck 
Date: Fri, 28 Oct 2016 08:48:01 -0700

> So the feedback for this set has been mostly just a few "Acked-by"s,
> and it looks like the series was marked as "Not Applicable" in
> patchwork.  I was wondering what the correct merge strategy for this
> patch set should be going forward?

I marked it as not applicable because it's definitely not a networking
change, and merging it via my tree would be really inappropriate, even
though we need it for some infrastructure we want to build for
networking.

So you have to merge this upstream via a more appropriate path.

> I was wondering if I should be looking at breaking up the set and
> splitting it over a few different trees, or if I should just hold onto
> it and resubmit it when the merge window opens?  My preference would
> be to submit it as a single set so I can know all the patches are
> present to avoid any possible regressions due to only part of the set
> being present.

I don't think you need to split it up.


Re: [PATCH net 2/2] geneve: avoid using stale geneve socket.

2016-10-28 Thread Pravin Shelar
On Fri, Oct 28, 2016 at 7:27 AM, John W. Linville
 wrote:
> On Thu, Oct 27, 2016 at 11:51:56AM -0700, Pravin B Shelar wrote:
>> This patch is similar to earlier vxlan patch.
>> Geneve device close operation frees geneve socket. This
>> operation can race with geneve-xmit function which
>> dereferences geneve socket. Following patch uses RCU
>> mechanism to avoid this situation.
>>
>> Signed-off-by: Pravin B Shelar 
>
> LGTM, although I reckon that Stephen's comment about RCU_INIT_POINTER
> applies to this patch as well.
>
Geneve code is a bit different: it is using rcu_assign_pointer()
while destroying the object, so we need the barrier.

> Either way...
>
> Acked-by: John W. Linville 
>


[PATCH v2 net 1/2] vxlan: avoid using stale vxlan socket.

2016-10-28 Thread Pravin B Shelar
When vxlan device is closed vxlan socket is freed. This
operation can race with vxlan-xmit function which
dereferences vxlan socket. Following patch uses RCU
mechanism to avoid this situation.

Signed-off-by: Pravin B Shelar 
---
 drivers/net/vxlan.c | 80 +
 include/net/vxlan.h |  4 +--
 2 files changed, 52 insertions(+), 32 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index c1639a3..f3c2fa3 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -943,17 +943,20 @@ static bool vxlan_snoop(struct net_device *dev,
 static bool vxlan_group_used(struct vxlan_net *vn, struct vxlan_dev *dev)
 {
struct vxlan_dev *vxlan;
+   struct vxlan_sock *sock4;
+   struct vxlan_sock *sock6 = NULL;
unsigned short family = dev->default_dst.remote_ip.sa.sa_family;
 
+   sock4 = rtnl_dereference(dev->vn4_sock);
+
/* The vxlan_sock is only used by dev, leaving group has
 * no effect on other vxlan devices.
 */
-   if (family == AF_INET && dev->vn4_sock &&
-   atomic_read(&dev->vn4_sock->refcnt) == 1)
+   if (family == AF_INET && sock4 && atomic_read(&sock4->refcnt) == 1)
return false;
 #if IS_ENABLED(CONFIG_IPV6)
-   if (family == AF_INET6 && dev->vn6_sock &&
-   atomic_read(&dev->vn6_sock->refcnt) == 1)
+   sock6 = rtnl_dereference(dev->vn6_sock);
+   if (family == AF_INET6 && sock6 && atomic_read(&sock6->refcnt) == 1)
return false;
 #endif
 
@@ -961,10 +964,12 @@ static bool vxlan_group_used(struct vxlan_net *vn, struct 
vxlan_dev *dev)
if (!netif_running(vxlan->dev) || vxlan == dev)
continue;
 
-   if (family == AF_INET && vxlan->vn4_sock != dev->vn4_sock)
+   if (family == AF_INET &&
+   rtnl_dereference(vxlan->vn4_sock) != sock4)
continue;
 #if IS_ENABLED(CONFIG_IPV6)
-   if (family == AF_INET6 && vxlan->vn6_sock != dev->vn6_sock)
+   if (family == AF_INET6 &&
+   rtnl_dereference(vxlan->vn6_sock) != sock6)
continue;
 #endif
 
@@ -1005,22 +1010,25 @@ static bool __vxlan_sock_release_prep(struct vxlan_sock 
*vs)
 
 static void vxlan_sock_release(struct vxlan_dev *vxlan)
 {
-   bool ipv4 = __vxlan_sock_release_prep(vxlan->vn4_sock);
+   struct vxlan_sock *sock4 = rtnl_dereference(vxlan->vn4_sock);
 #if IS_ENABLED(CONFIG_IPV6)
-   bool ipv6 = __vxlan_sock_release_prep(vxlan->vn6_sock);
+   struct vxlan_sock *sock6 = rtnl_dereference(vxlan->vn6_sock);
+
+   rcu_assign_pointer(vxlan->vn6_sock, NULL);
 #endif
 
+   rcu_assign_pointer(vxlan->vn4_sock, NULL);
synchronize_net();
 
-   if (ipv4) {
-   udp_tunnel_sock_release(vxlan->vn4_sock->sock);
-   kfree(vxlan->vn4_sock);
+   if (__vxlan_sock_release_prep(sock4)) {
+   udp_tunnel_sock_release(sock4->sock);
+   kfree(sock4);
}
 
 #if IS_ENABLED(CONFIG_IPV6)
-   if (ipv6) {
-   udp_tunnel_sock_release(vxlan->vn6_sock->sock);
-   kfree(vxlan->vn6_sock);
+   if (__vxlan_sock_release_prep(sock6)) {
+   udp_tunnel_sock_release(sock6->sock);
+   kfree(sock6);
}
 #endif
 }
@@ -1036,18 +1044,21 @@ static int vxlan_igmp_join(struct vxlan_dev *vxlan)
int ret = -EINVAL;
 
if (ip->sa.sa_family == AF_INET) {
+   struct vxlan_sock *sock4 = rtnl_dereference(vxlan->vn4_sock);
struct ip_mreqn mreq = {
.imr_multiaddr.s_addr   = ip->sin.sin_addr.s_addr,
.imr_ifindex= ifindex,
};
 
-   sk = vxlan->vn4_sock->sock->sk;
+   sk = sock4->sock->sk;
lock_sock(sk);
ret = ip_mc_join_group(sk, &mreq);
release_sock(sk);
 #if IS_ENABLED(CONFIG_IPV6)
} else {
-   sk = vxlan->vn6_sock->sock->sk;
+   struct vxlan_sock *sock6 = rtnl_dereference(vxlan->vn6_sock);
+
+   sk = sock6->sock->sk;
lock_sock(sk);
ret = ipv6_stub->ipv6_sock_mc_join(sk, ifindex,
   &ip->sin6.sin6_addr);
@@ -1067,18 +1078,21 @@ static int vxlan_igmp_leave(struct vxlan_dev *vxlan)
int ret = -EINVAL;
 
if (ip->sa.sa_family == AF_INET) {
+   struct vxlan_sock *sock4 = rtnl_dereference(vxlan->vn4_sock);
struct ip_mreqn mreq = {
.imr_multiaddr.s_addr   = ip->sin.sin_addr.s_addr,
.imr_ifindex= ifindex,
};
 
-   sk = vxlan->vn4_sock->sock->sk;
+   sk = sock4->sock->sk;
lock_sock(sk);
ret = ip_mc_leave_group(sk, &mreq);
release_sock(sk

[PATCH v2 net 2/2] geneve: avoid using stale geneve socket.

2016-10-28 Thread Pravin B Shelar
This patch is similar to earlier vxlan patch.
Geneve device close operation frees geneve socket. This
operation can race with geneve-xmit function which
dereferences geneve socket. Following patch uses RCU
mechanism to avoid this situation.

Signed-off-by: Pravin B Shelar 
Acked-by: John W. Linville 
---
 drivers/net/geneve.c | 45 ++---
 1 file changed, 34 insertions(+), 11 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 16af1ce..42edd7b 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -58,9 +58,9 @@ struct geneve_dev {
struct hlist_node  hlist;   /* vni hash table */
struct net *net;/* netns for packet i/o */
struct net_device  *dev;/* netdev for geneve tunnel */
-   struct geneve_sock *sock4;  /* IPv4 socket used for geneve tunnel */
+   struct geneve_sock __rcu *sock4;/* IPv4 socket used for geneve tunnel */
 #if IS_ENABLED(CONFIG_IPV6)
-   struct geneve_sock *sock6;  /* IPv6 socket used for geneve tunnel */
+   struct geneve_sock __rcu *sock6;/* IPv6 socket used for geneve tunnel */
 #endif
u8 vni[3];  /* virtual network ID for tunnel */
u8 ttl; /* TTL override */
@@ -543,9 +543,19 @@ static void __geneve_sock_release(struct geneve_sock *gs)
 
 static void geneve_sock_release(struct geneve_dev *geneve)
 {
-   __geneve_sock_release(geneve->sock4);
+   struct geneve_sock *gs4 = rtnl_dereference(geneve->sock4);
 #if IS_ENABLED(CONFIG_IPV6)
-   __geneve_sock_release(geneve->sock6);
+   struct geneve_sock *gs6 = rtnl_dereference(geneve->sock6);
+
+   rcu_assign_pointer(geneve->sock6, NULL);
+#endif
+
+   rcu_assign_pointer(geneve->sock4, NULL);
+   synchronize_net();
+
+   __geneve_sock_release(gs4);
+#if IS_ENABLED(CONFIG_IPV6)
+   __geneve_sock_release(gs6);
 #endif
 }
 
@@ -586,10 +596,10 @@ static int geneve_sock_add(struct geneve_dev *geneve, bool ipv6)
gs->flags = geneve->flags;
 #if IS_ENABLED(CONFIG_IPV6)
if (ipv6)
-   geneve->sock6 = gs;
+   rcu_assign_pointer(geneve->sock6, gs);
else
 #endif
-   geneve->sock4 = gs;
+   rcu_assign_pointer(geneve->sock4, gs);
 
hash = geneve_net_vni_hash(geneve->vni);
hlist_add_head_rcu(&geneve->hlist, &gs->vni_list[hash]);
@@ -603,9 +613,7 @@ static int geneve_open(struct net_device *dev)
bool metadata = geneve->collect_md;
int ret = 0;
 
-   geneve->sock4 = NULL;
 #if IS_ENABLED(CONFIG_IPV6)
-   geneve->sock6 = NULL;
if (ipv6 || metadata)
ret = geneve_sock_add(geneve, true);
 #endif
@@ -720,6 +728,9 @@ static struct rtable *geneve_get_v4_rt(struct sk_buff *skb,
struct rtable *rt = NULL;
__u8 tos;
 
+   if (!rcu_dereference(geneve->sock4))
+   return ERR_PTR(-EIO);
+
memset(fl4, 0, sizeof(*fl4));
fl4->flowi4_mark = skb->mark;
fl4->flowi4_proto = IPPROTO_UDP;
@@ -772,11 +783,15 @@ static struct dst_entry *geneve_get_v6_dst(struct sk_buff *skb,
 {
bool use_cache = ip_tunnel_dst_cache_usable(skb, info);
struct geneve_dev *geneve = netdev_priv(dev);
-   struct geneve_sock *gs6 = geneve->sock6;
struct dst_entry *dst = NULL;
struct dst_cache *dst_cache;
+   struct geneve_sock *gs6;
__u8 prio;
 
+   gs6 = rcu_dereference(geneve->sock6);
+   if (!gs6)
+   return ERR_PTR(-EIO);
+
memset(fl6, 0, sizeof(*fl6));
fl6->flowi6_mark = skb->mark;
fl6->flowi6_proto = IPPROTO_UDP;
@@ -842,7 +857,7 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
   struct ip_tunnel_info *info)
 {
struct geneve_dev *geneve = netdev_priv(dev);
-   struct geneve_sock *gs4 = geneve->sock4;
+   struct geneve_sock *gs4;
struct rtable *rt = NULL;
const struct iphdr *iip; /* interior IP header */
int err = -EINVAL;
@@ -853,6 +868,10 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
bool xnet = !net_eq(geneve->net, dev_net(geneve->dev));
u32 flags = geneve->flags;
 
+   gs4 = rcu_dereference(geneve->sock4);
+   if (!gs4)
+   goto tx_error;
+
if (geneve->collect_md) {
if (unlikely(!info || !(info->mode & IP_TUNNEL_INFO_TX))) {
netdev_dbg(dev, "no tunnel metadata\n");
@@ -932,9 +951,9 @@ static netdev_tx_t geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
struct ip_tunnel_info *info)
 {
struct geneve_dev *geneve = netdev_priv(dev);
-   struct geneve_sock *gs6 = geneve->sock6;
struct dst_entry *dst = NULL;
const struct iphdr *iip; /* interior IP header */
+   str

Re: [PATCH net 1/2] vxlan: avoid using stale vxlan socket.

2016-10-28 Thread Pravin Shelar
On Thu, Oct 27, 2016 at 4:02 PM, Stephen Hemminger
 wrote:
> On Thu, 27 Oct 2016 11:51:55 -0700
> Pravin B Shelar  wrote:
>
>> - vxlan->vn4_sock = NULL;
>> + rcu_assign_pointer(vxlan->vn4_sock, NULL);
>>  #if IS_ENABLED(CONFIG_IPV6)
>> - vxlan->vn6_sock = NULL;
>> + rcu_assign_pointer(vxlan->vn6_sock, NULL);
>>   if (ipv6 || metadata)
>
> Use RCU_INIT_POINTER when setting a pointer to NULL.
> It avoids unnecessary barrier.

OK. I have posted a v2 patch that uses RCU_INIT_POINTER().
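
For reference, a minimal sketch of the pointer-publishing pattern being
discussed here -- this is not the posted patch, and the struct/function
names (my_sock, my_dev, my_publish, ...) are made up for illustration:

#include <linux/netdevice.h>
#include <linux/rcupdate.h>
#include <linux/rtnetlink.h>
#include <linux/slab.h>

struct my_sock { int placeholder; };

struct my_dev {
	struct my_sock __rcu *sock4;
};

/* Publishing a fully initialised object: rcu_assign_pointer() has the
 * write barrier that keeps readers from seeing a half-built sock.
 */
static void my_publish(struct my_dev *dev, struct my_sock *ns)
{
	rcu_assign_pointer(dev->sock4, ns);
}

/* Tearing down: storing NULL publishes no new data, so the cheaper
 * RCU_INIT_POINTER() is enough; then wait out in-flight readers.
 */
static void my_teardown(struct my_dev *dev)
{
	struct my_sock *old = rtnl_dereference(dev->sock4);

	RCU_INIT_POINTER(dev->sock4, NULL);
	synchronize_net();
	kfree(old);
}

/* Reader (e.g. an xmit path): dereference under rcu_read_lock() and
 * cope with the pointer being NULL.
 */
static void my_xmit(struct my_dev *dev)
{
	struct my_sock *sock;

	rcu_read_lock();
	sock = rcu_dereference(dev->sock4);
	if (sock) {
		/* ... transmit via sock ... */
	}
	rcu_read_unlock();
}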


[PATCH net] dctcp: avoid bogus doubling of cwnd after loss

2016-10-28 Thread Florian Westphal
If a congestion control module doesn't provide an .undo_cwnd function,
tcp_undo_cwnd_reduction() will set cwnd to

   tp->snd_cwnd = max(tp->snd_cwnd, tp->snd_ssthresh << 1);

... which makes sense for reno (it sets ssthresh to half the current cwnd),
but it makes no sense for dctcp, which sets ssthresh based on the current
congestion estimate.

This can cause severe growth of cwnd (eventually overflowing u32).

Fix this by saving the last cwnd on loss and restoring cwnd based on that,
similar to cubic and other algorithms.

Fixes: e3118e8359bb7c ("net: tcp: add DCTCP congestion control algorithm")
Cc: Lawrence Brakmo 
Cc: Andrew Shewmaker 
Cc: Glenn Judd 
Acked-by: Daniel Borkmann 
Signed-off-by: Florian Westphal 
---
 net/ipv4/tcp_dctcp.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index 10d728b6804c..ab37c6775630 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -56,6 +56,7 @@ struct dctcp {
u32 next_seq;
u32 ce_state;
u32 delayed_ack_reserved;
+   u32 loss_cwnd;
 };
 
 static unsigned int dctcp_shift_g __read_mostly = 4; /* g = 1/2^4 */
@@ -96,6 +97,7 @@ static void dctcp_init(struct sock *sk)
ca->dctcp_alpha = min(dctcp_alpha_on_init, DCTCP_MAX_ALPHA);
 
ca->delayed_ack_reserved = 0;
+   ca->loss_cwnd = 0;
ca->ce_state = 0;
 
dctcp_reset(tp, ca);
@@ -111,9 +113,10 @@ static void dctcp_init(struct sock *sk)
 
 static u32 dctcp_ssthresh(struct sock *sk)
 {
-   const struct dctcp *ca = inet_csk_ca(sk);
+   struct dctcp *ca = inet_csk_ca(sk);
struct tcp_sock *tp = tcp_sk(sk);
 
+   ca->loss_cwnd = tp->snd_cwnd;
	return max(tp->snd_cwnd - ((tp->snd_cwnd * ca->dctcp_alpha) >> 11U), 2U);
 }
 
@@ -308,12 +311,20 @@ static size_t dctcp_get_info(struct sock *sk, u32 ext, int *attr,
return 0;
 }
 
+static u32 dctcp_cwnd_undo(struct sock *sk)
+{
+   const struct dctcp *ca = inet_csk_ca(sk);
+
+   return max(tcp_sk(sk)->snd_cwnd, ca->loss_cwnd);
+}
+
 static struct tcp_congestion_ops dctcp __read_mostly = {
.init   = dctcp_init,
.in_ack_event   = dctcp_update_alpha,
.cwnd_event = dctcp_cwnd_event,
.ssthresh   = dctcp_ssthresh,
.cong_avoid = tcp_reno_cong_avoid,
+   .undo_cwnd  = dctcp_cwnd_undo,
.set_state  = dctcp_state,
.get_info   = dctcp_get_info,
.flags  = TCP_CONG_NEEDS_ECN,
-- 
2.7.3
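
For context, the decision the changelog describes can be sketched roughly
as below.  This is not the actual tcp_undo_cwnd_reduction() code, just an
illustration of the two branches; sketch_undo_cwnd is a made-up name:

#include <net/tcp.h>

static void sketch_undo_cwnd(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	const struct inet_connection_sock *icsk = inet_csk(sk);

	if (icsk->icsk_ca_ops->undo_cwnd)
		/* the CC module undoes its own reduction; with this
		 * patch dctcp returns max(snd_cwnd, loss_cwnd) here
		 */
		tp->snd_cwnd = icsk->icsk_ca_ops->undo_cwnd(sk);
	else
		/* reno-style fallback: only sensible when ssthresh
		 * was set to cwnd/2 on loss
		 */
		tp->snd_cwnd = max(tp->snd_cwnd, tp->snd_ssthresh << 1);
}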



Re: Bonding MAC address

2016-10-28 Thread Shanker Wang
I think it is intentional, because changing the MAC address of an
interface can cause problems. So if you find it necessary to
change the MAC address, it should be done manually.

Consider these three examples:

* A host connects to the Internet through a bond interface and
obtains its IP address from a DHCP server. Changing the MAC
address automatically could lead to a new IP address being
assigned, which may not be what the user expects.

* A Linux box acts as a gateway, providing service to hosts in
the local LAN. Changing the MAC address of an interface without
changing its IP address would cause other hosts to fail to
communicate with the gateway, since the old MAC address is
still in their ARP cache tables. Communication recovers only
after the ARP cache entries expire, which can take a short or
a long time.

* The scenario is the same as the second one, but with ARP
snooping or other mechanisms that protect hosts from being
spoofed by a fake gateway enabled in the local LAN. After
changing the MAC address, the Linux box itself will look like
a "spoofer" and may get blocked.

So changing the MAC address of an interface can be dangerous,
can lead to network malfunction, and cannot be done automatically.

> On 28 Oct 2016, at 19:09, Igor Ryzhov  wrote:
> 
> Hello everyone,
> 
> Studying the bonding driver I found that the bonding device doesn't choose a new 
> MAC address when the slave whose MAC is in use is released.
> This is a warning about that - "the permanent HWaddr of slave is still in use 
> by bond - set the HWaddr of slave to a different address to avoid conflicts".
> Why not choose a new MAC for the bonding device? Is it intentional or just not 
> implemented?
> 
> Best regards,
> Igor



smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH] Revert "usb: dwc3: gadget: use allocated/queued reqs for LST bit"

2016-10-28 Thread Ville Syrjälä
On Fri, Oct 28, 2016 at 01:16:13PM +0300, Felipe Balbi wrote:
> 
> Hi,
> 
> Ville Syrjälä  writes:
> > On Thu, Oct 06, 2016 at 12:08:26PM +0300, Ville Syrjälä wrote:
> >> On Thu, Oct 06, 2016 at 10:36:09AM +0300, Felipe Balbi wrote:
> >> > 
> >> > Hi,
> >> > 
> >> > Felipe Balbi  writes:
> >> 
> >> > Okay, I have found a regression on dwc3 and another patch follows:
> >> > 
> >> > commit 5e1a2af3e46248c55098cdae643c4141851b703e
> >> > Author: Felipe Balbi 
> >> > Date:   Wed Oct 5 14:24:37 2016 +0300
> >> > 
> >> > usb: dwc3: gadget: properly account queued requests
> >> > 
> >> > Some requests could be accounted for multiple
> >> > times. Let's fix that so each and every requests is
> >> > accounted for only once.
> >> > 
> >> > Cc:  # v4.8
> >> > Fixes: 55a0237f8f47 ("usb: dwc3: gadget: use allocated/queued reqs 
> >> > for LST bit")
> >> > Signed-off-by: Felipe Balbi 
> >> > 
> >> > diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
> >> > index 07cc8929f271..3c3ced128c77 100644
> >> > --- a/drivers/usb/dwc3/gadget.c
> >> > +++ b/drivers/usb/dwc3/gadget.c
> >> > @@ -783,6 +783,7 @@ static void dwc3_prepare_one_trb(struct dwc3_ep *dep,
> >> >  req->trb = trb;
> >> >  req->trb_dma = dwc3_trb_dma_offset(dep, trb);
> >> >  req->first_trb_index = dep->trb_enqueue;
> >> > +dep->queued_requests++;
> >> >  }
> >> >  
> >> >  dwc3_ep_inc_enq(dep);
> >> > @@ -833,8 +834,6 @@ static void dwc3_prepare_one_trb(struct dwc3_ep *dep,
> >> >  
> >> >  trb->ctrl |= DWC3_TRB_CTRL_HWO;
> >> >  
> >> > -dep->queued_requests++;
> >> > -
> >> >  trace_dwc3_prepare_trb(dep, trb);
> >> >  }
> >> >  
> >> > @@ -1861,8 +1860,11 @@ static int __dwc3_cleanup_done_trbs(struct dwc3 
> >> > *dwc, struct dwc3_ep *dep,
> >> >  unsigned ints_pkt = 0;
> >> >  unsigned inttrb_status;
> >> >  
> >> > -dep->queued_requests--;
> >> >  dwc3_ep_inc_deq(dep);
> >> > +
> >> > +if (req->trb == trb)
> >> > +dep->queued_requests--;
> >> > +
> >> >  trace_dwc3_complete_trb(dep, trb);
> >> >  
> >> >  /*
> >> > 
> >> > I have also built a branch which you can use for testing. Here's a pull
> >> > request, once you tell me it works for you, then I can send proper
> >> > patches out:
> >> > 
> >> > The following changes since commit 
> >> > c8d2bc9bc39ebea8437fd974fdbc21847bb897a3:
> >> > 
> >> >   Linux 4.8 (2016-10-02 16:24:33 -0700)
> >> > 
> >> > are available in the git repository at:
> >> > 
> >> >   git://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb.git 
> >> > tmp-test-v4.8
> >> > 
> >> > for you to fetch changes up to c968b8d1effe64a7980802d1eef29f4e1922faca:
> >> > 
> >> >   usb: dwc3: gadget: properly account queued requests (2016-10-06 
> >> > 10:16:37 +0300)
> >> > 
> >> > 
> >> > Felipe Balbi (2):
> >> >   usb: gadget: function: u_ether: don't starve tx request queue
> >> >   usb: dwc3: gadget: properly account queued requests
> >> > 
> >> >  drivers/usb/dwc3/gadget.c | 7 ---
> >> >  drivers/usb/gadget/function/u_ether.c | 5 +++--
> >> >  2 files changed, 7 insertions(+), 5 deletions(-)
> >> 
> >> Tried your branch, but unfortunately I'm still seeing the lags. New trace
> >> attached.
> >
> > Just a reminder that the regressions is still there in 4.9-rc2.
> > Unfortunateyly with all the stuff already piled on top, getting USB
> > working on my device is no longer a simple matter of reverting the
> > one commit. I had to revert 10 patches to get even a clean revert, but
> > unfortunately that approach just lead to the transfer hanging immediately.
> >
> > So what I ended up doing is reverting all of this:
> > git log --no-merges --format=oneline 
> > 55a0237f8f47957163125e20ee9260538c5c341c^..HEAD -- drivers/usb/dwc3/ 
> > include/linux/ulpi/interface.h drivers/usb/Kconfig drivers/usb/core/Kconfig
> >
> > which comes out at whopping 70 commits. Having to carry that around
> > is going to be quite a pain especially as more stuff might be piled on
> > top.
> >
> > /me thinks a stable backport of any fix (assuming one is found
> > eventually) is going to be quite the challenge...
> 
> Yeah, I'm guessing we're gonna need some help from networking folks. The
> only thing we did since v4.7 was actually respect req->no_interrupt flag
> coming from u_ether itself. No idea why that causes so much trouble for
> u_ether.
> 
> BTW, Instead of reverting so many patches, you can just remove
> throttling:
> 
> diff --git a/drivers/usb/gadget/function/u_ether.c 
> b/drivers/usb/gadget/function/u_ether.c
> index f4a640216913..119a2e5848e8 100644
> --- a/drivers/usb/gadget/function/u_ether.c
> +++ b/drivers/usb/gadget/function/u_ether.c
> @@ -589,14 +589,6 @@ static netdev_tx_t eth_start_xmit(struct sk_buff *skb,
>  
> 

Re: [PATCH] [rfc, netfilter-next] netfilter: nf_tables: fib warnings

2016-10-28 Thread Florian Westphal
Arnd Bergmann  wrote:
> On Friday, October 28, 2016 5:50:31 PM CEST Florian Westphal wrote:
> > Arnd Bergmann  wrote:
> > > The newly added nft fib code produces two warnings:
> > > 
> > > net/ipv4/netfilter/nft_fib_ipv4.c: In function 'nft_fib4_eval':
> > > net/ipv4/netfilter/nft_fib_ipv4.c:80:6: error: unused variable 'i' 
> > > [-Werror=unused-variable]
> > > net/ipv4/netfilter/nft_fib_ipv4.c: In function ‘nft_fib4_eval’:
> > > net/ipv4/netfilter/nft_fib_ipv4.c:137:6: error: ‘oif’ may be used 
> > > uninitialized in this function [-Werror=maybe-uninitialized]
> > > 
> > > The first one is obvious as the only user of that variable is
> > > inside of an #ifdef, but the second one is a bit trickier.
> > > It is clear that 'oif' is uninitialized here if neither
> > > NFTA_FIB_F_OIF nor NFTA_FIB_F_IIF are set.
> > > 
> > > I have no idea how that should be handled, this patch just
> > > returns without doing anything, which may or may not be
> > > the right thing to do.
> > 
> > It should be initialized to NULL.
> 
> Ok, I had considered that, but wasn't sure if ->nh_dev could
> ever be NULL, as that would then get dereferenced.

Good point.  In case oif is NULL we don't have to search the result
list for a match anyway, so we could do this (not even build tested):

diff --git a/net/ipv4/netfilter/nft_fib_ipv4.c b/net/ipv4/netfilter/nft_fib_ipv4.c
--- a/net/ipv4/netfilter/nft_fib_ipv4.c
+++ b/net/ipv4/netfilter/nft_fib_ipv4.c
@@ -130,6 +130,11 @@ void nft_fib4_eval(const struct nft_expr *expr, struct nft_regs *regs,
break;
}
 
+   if (!oif) {
+   found = FIB_RES_DEV(res);
+   goto ok;
+   }
+
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
for (i = 0; i < res.fi->fib_nhs; i++) {
struct fib_nh *nh = &res.fi->fib_nh[i];
@@ -139,16 +144,12 @@ void nft_fib4_eval(const struct nft_expr *expr, struct nft_regs *regs,
goto ok;
}
}
-#endif
-   if (priv->flags & NFTA_FIB_F_OIF) {
-   found = FIB_RES_DEV(res);
-   if (found == oif)
-   goto ok;
-   return;
-   }
-
-   *dest = FIB_RES_DEV(res)->ifindex;
return;
+#else
+   found = FIB_RES_DEV(res);
+   if (found != oif)
+   return;
+#endif
 ok:
switch (priv->result) {


I can take care of this as a followup.



Re: [PATCH net-next RFC WIP] Patch for XDP support for virtio_net

2016-10-28 Thread Jakub Kicinski
On Fri, 28 Oct 2016 08:56:35 -0700, John Fastabend wrote:
> On 16-10-27 07:10 PM, David Miller wrote:
> > From: Alexander Duyck 
> > Date: Thu, 27 Oct 2016 18:43:59 -0700
> >   
> >> On Thu, Oct 27, 2016 at 6:35 PM, David Miller  wrote: 
> >>  
> >>> From: "Michael S. Tsirkin" 
> >>> Date: Fri, 28 Oct 2016 01:25:48 +0300
> >>>  
>  On Thu, Oct 27, 2016 at 05:42:18PM -0400, David Miller wrote:  
> > From: "Michael S. Tsirkin" 
> > Date: Fri, 28 Oct 2016 00:30:35 +0300
> >  
> >> Something I'd like to understand is how does XDP address the
> >> problem that 100Byte packets are consuming 4K of memory now.  
> >
> > Via page pools.  We're going to make a generic one, but right now
> > each and every driver implements a quick list of pages to allocate
> > from (and thus avoid the DMA man/unmap overhead, etc.)  
> 
>  So to clarify, ATM virtio doesn't attempt to avoid dma map/unmap
>  so there should be no issue with that even when using sub/page
>  regions, assuming DMA APIs support sub-page map/unmap correctly.  
> >>>
> >>> That's not what I said.
> >>>
> >>> The page pools are meant to address the performance degradation from
> >>> going to having one packet per page for the sake of XDP's
> >>> requirements.
> >>>
> >>> You still need to have one packet per page for correct XDP operation
> >>> whether you do page pools or not, and whether you have DMA mapping
> >>> (or it's equivalent virutalization operation) or not.  
> >>
> >> Maybe I am missing something here, but why do you need to limit things
> >> to one packet per page for correct XDP operation?  Most of the drivers
> >> out there now are usually storing something closer to at least 2
> >> packets per page, and with the DMA API fixes I am working on there
> >> should be no issue with changing the contents inside those pages since
> >> we won't invalidate or overwrite the data after the DMA buffer has
> >> been synchronized for use by the CPU.  
> > 
> > Because with SKB's you can share the page with other packets.
> > 
> > With XDP you simply cannot.
> > 
> > It's software semantics that are the issue.  SKB frag list pages
> > are read only, XDP packets are writable.
> > 
> > This has nothing to do with "writability" of the pages wrt. DMA
> > mapping or cpu mappings.
> >   
> 
> Sorry I'm not seeing it either. The current xdp_buff is defined
> by,
> 
>   struct xdp_buff {
>   void *data;
>   void *data_end;
>   };
> 
> The verifier has an xdp_is_valid_access() check to ensure we don't go
> past data_end. The page for now at least never leaves the driver. For
> the work to get xmit to other devices working I'm still not sure I see
> any issue.

+1

Do we want to make the packet-per-page a requirement because it could
be useful in the future from an architectural standpoint?  I guess there
is a trade-off here between the comfort of having people follow this
requirement today and making driver support for XDP even more complex.
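
For anyone following along, the data/data_end window quoted above is also
what the verifier polices on the program side.  A minimal XDP program
looks roughly like the sketch below (illustrative only; the section name
and function name are arbitrary):

#include <linux/bpf.h>
#include <linux/if_ether.h>

__attribute__((section("xdp"), used))
int xdp_pass_eth(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;

	/* the verifier rejects any access it cannot prove stays below
	 * data_end, hence the explicit bounds check
	 */
	if (data + sizeof(*eth) > data_end)
		return XDP_ABORTED;

	/* eth may now be read and, with XDP, also rewritten in place */
	return XDP_PASS;
}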

