Re: [PATCH 17/17] batman-adv: Avoid precedence issues in macros
On Freitag, 28. Oktober 2016 18:56:24 CEST Joe Perches wrote:
[...]
> > But yes, looks to me like it is missing. Do you want to propose a patch or
> > should I do?
>
> As you are working on this file, perhaps you should.

Ok, I will submit a patch with you as Reported-by to the batman-adv
mailing list.

Kind regards,
	Sven
Re: [PATCH v7 0/6] Add eBPF hooks for cgroups
On Sat, Oct 29, 2016 at 01:59:23PM +0900, Lorenzo Colitti wrote:
> On Sat, Oct 29, 2016 at 1:51 PM, Alexei Starovoitov wrote:
> >> What's the use case for egress?
> >>
> >> We (android networking) are currently looking at implementing network
> >> accounting via eBPF in order to replace the out-of-tree xt_qtaguid
> >> code. A per-cgroup eBPF program run on all traffic would be great. But
> >> when we looked at this patchset we realized it would not be useful for
> >> accounting purposes because even if a packet is counted here, it might
> >> still be dropped by netfilter hooks.
> >
> > don't use out-of-tree and instead drop using this mechanism or
> > any other in-kernel method? ;)
>
> Getting rid of out-of-tree code is the goal, yes. We do have a
> requirement that things continue to work, though. Accounting for a
> packet in ip{,6}_output is not correct if that packet ends up being
> dropped by iptables later on.

understood. it could be solved by swapping the order of
cgroup_bpf_run_filter() and NF_INET_POST_ROUTING in patch 5.
It was proposed some time back, but the current patch, I think, is
more symmetrical: cgroup+bpf runs after the nf hook on rx and runs
before it on tx. imo it's more consistent.
Regardless of this choice... are you going to backport cgroupv2 to
android? Because this set is v2 only.

> > We (facebook infrastructure) have been using iptables and bpf networking
> > together with great success. They nicely co-exist and complement each other.
> > There is no need to reinvent the wheel if existing solution works.
> > iptables are great for their purpose.
>
> That doesn't really answer my "what is the use case for egress"
> question though, right? Or are you saying "we use this, but we can't
> talk about how we use it"?

if the question is "why patch 4 alone is not enough and patch 5 is
needed", then it's symmetrical access. Accounting for RX only is a
half-done job.

> > there is iptables+cBPF support. It's being used in some cases already.
>
> Adding eBPF support to the xt_bpf iptables code would be an option for
> what we want to do, yes. I think this requires the eBPF map to be
> an fd that is available to the process that exec()s iptables, but we
> could do that.

yes. that's certainly doable, but sooner or later such an approach will
hit a scalability issue when the number of cgroups is large. Same issue
we saw with cls_bpf and bpf_skb_under_cgroup(). Hence this patch set was
needed that is centered around cgroups instead of hooks.
Note, unlike tc and nf, there is no way to attach to a hook. The bpf
program is attached to a cgroup. It's an important distinction vs
everything that currently exists in the stack.
Re: net/dccp: warning in dccp_feat_clone_sp_val/__might_sleep
On Fri, Oct 28, 2016 at 5:40 PM, Andrey Konovalov wrote:
> Hi,
>
> I've got the following error report while running the syzkaller fuzzer:
>
> [ cut here ]
> WARNING: CPU: 0 PID: 4608 at kernel/sched/core.c:7724
> __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
> do not call blocking ops when !TASK_RUNNING; state=1 set at
> [] prepare_to_wait+0xbc/0x210 kernel/sched/wait.c:178
> Modules linked in:
> CPU: 0 PID: 4608 Comm: syz-executor Not tainted 4.9.0-rc2+ #320
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> 88006625f7a0 81b46914 88006625f818
> 84052960 88006625f7e8 8237
> 88006aceac00 1e2c ed000cc4beff 84052960
> Call Trace:
> [< inline >] __dump_stack lib/dump_stack.c:15
> [] dump_stack+0xb3/0x10f lib/dump_stack.c:51
> [] __warn+0x1a7/0x1f0 kernel/panic.c:550
> [] warn_slowpath_fmt+0xac/0xd0 kernel/panic.c:565
> [] __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
> [< inline >] slab_pre_alloc_hook mm/slab.h:393
> [< inline >] slab_alloc_node mm/slub.c:2634
> [< inline >] slab_alloc mm/slub.c:2716
> [] __kmalloc_track_caller+0x150/0x2a0 mm/slub.c:4240
> [] kmemdup+0x24/0x50 mm/util.c:113
> [] dccp_feat_clone_sp_val.part.5+0x4f/0xe0 net/dccp/feat.c:374
> [< inline >] dccp_feat_clone_sp_val net/dccp/feat.c:1141
> [< inline >] dccp_feat_change_recv net/dccp/feat.c:1141
> [] dccp_feat_parse_options+0xaa1/0x13d0 net/dccp/feat.c:1411
> [] dccp_parse_options+0x721/0x1010 net/dccp/options.c:128
> [] dccp_rcv_state_process+0x200/0x15b0 net/dccp/input.c:644
> [] dccp_v4_do_rcv+0xf4/0x1a0 net/dccp/ipv4.c:681
> [< inline >] sk_backlog_rcv ./include/net/sock.h:872
> [] __release_sock+0x126/0x3a0 net/core/sock.c:2044
> [] release_sock+0x59/0x1c0 net/core/sock.c:2502
> [< inline >] inet_wait_for_connect net/ipv4/af_inet.c:547
> [] __inet_stream_connect+0x5d2/0xbb0 net/ipv4/af_inet.c:617
> [] inet_stream_connect+0x55/0xa0 net/ipv4/af_inet.c:656
> [] SYSC_connect+0x244/0x2f0 net/socket.c:1533
> [] SyS_connect+0x24/0x30 net/socket.c:1514
> [] entry_SYSCALL_64_fastpath+0x1f/0xc2
> arch/x86/entry/entry_64.S:209

Should be fixed by the attached patch. I will verify it with your
reproducer tomorrow. Thanks!

diff --git a/net/dccp/feat.c b/net/dccp/feat.c
index 1704948..c90cb35 100644
--- a/net/dccp/feat.c
+++ b/net/dccp/feat.c
@@ -367,11 +367,11 @@ static inline int dccp_feat_must_be_understood(u8 feat_num)
 }
 
 /* copy constructor, fval must not already contain allocated memory */
-static int dccp_feat_clone_sp_val(dccp_feat_val *fval, u8 const *val, u8 len)
+static int dccp_feat_clone_sp_val(dccp_feat_val *fval, u8 const *val, u8 len, gfp_t flags)
 {
 	fval->sp.len = len;
 	if (fval->sp.len > 0) {
-		fval->sp.vec = kmemdup(val, len, gfp_any());
+		fval->sp.vec = kmemdup(val, len, flags);
 		if (fval->sp.vec == NULL) {
 			fval->sp.len = 0;
 			return -ENOBUFS;
@@ -404,7 +404,8 @@ static void dccp_feat_val_destructor(u8 feat_num, dccp_feat_val *val)
 	if (type == FEAT_SP &&
 	    dccp_feat_clone_sp_val(&new->val, original->val.sp.vec,
-				   original->val.sp.len)) {
+				   original->val.sp.len,
+				   gfp_any())) {
 		kfree(new);
 		return NULL;
 	}
@@ -735,7 +736,7 @@ static int __feat_register_sp(struct list_head *fn, u8 feat, u8 is_local,
 	if (feat == DCCPF_CCID && !ccid_support_check(sp_val, sp_len))
 		return -EOPNOTSUPP;
 
-	if (dccp_feat_clone_sp_val(&fval, sp_val, sp_len))
+	if (dccp_feat_clone_sp_val(&fval, sp_val, sp_len, gfp_any()))
 		return -ENOMEM;
 
 	return dccp_feat_push_change(fn, feat, is_local, mandatory, &fval);
@@ -1138,7 +1139,7 @@ static u8 dccp_feat_change_recv(struct list_head *fn, u8 is_mandatory, u8 opt,
 	 * otherwise we accept the preferred value;
 	 * - else if we are the client, we use the first list element.
 	 */
-	if (dccp_feat_clone_sp_val(&fval, val, 1))
+	if (dccp_feat_clone_sp_val(&fval, val, 1, GFP_ATOMIC))
 		return DCCP_RESET_CODE_TOO_BUSY;
 
 	if (len > 1 && server) {
Re: [PATCH v7 0/6] Add eBPF hooks for cgroups
On Sat, Oct 29, 2016 at 1:51 PM, Alexei Starovoitov wrote:
>> What's the use case for egress?
>>
>> We (android networking) are currently looking at implementing network
>> accounting via eBPF in order to replace the out-of-tree xt_qtaguid
>> code. A per-cgroup eBPF program run on all traffic would be great. But
>> when we looked at this patchset we realized it would not be useful for
>> accounting purposes because even if a packet is counted here, it might
>> still be dropped by netfilter hooks.
>
> don't use out-of-tree and instead drop using this mechanism or
> any other in-kernel method? ;)

Getting rid of out-of-tree code is the goal, yes. We do have a
requirement that things continue to work, though. Accounting for a
packet in ip{,6}_output is not correct if that packet ends up being
dropped by iptables later on.

> We (facebook infrastructure) have been using iptables and bpf networking
> together with great success. They nicely co-exist and complement each other.
> There is no need to reinvent the wheel if existing solution works.
> iptables are great for their purpose.

That doesn't really answer my "what is the use case for egress"
question though, right? Or are you saying "we use this, but we can't
talk about how we use it"?

> there is iptables+cBPF support. It's being used in some cases already.

Adding eBPF support to the xt_bpf iptables code would be an option for
what we want to do, yes. I think this requires the eBPF map to be
an fd that is available to the process that exec()s iptables, but we
could do that.
Re: [PATCH v7 0/6] Add eBPF hooks for cgroups
On Sat, Oct 29, 2016 at 12:51:37PM +0900, Lorenzo Colitti wrote:
> On Thu, Oct 27, 2016 at 5:40 PM, Daniel Mack wrote:
> > It's not anything new. These hooks live on the very same level as
> > SO_ATTACH_FILTER. The only differences are that the BPF programs are
> > stored in the cgroup, and not in the socket, and that they exist for
> > egress as well.
>
> What's the use case for egress?
>
> We (android networking) are currently looking at implementing network
> accounting via eBPF in order to replace the out-of-tree xt_qtaguid
> code. A per-cgroup eBPF program run on all traffic would be great. But
> when we looked at this patchset we realized it would not be useful for
> accounting purposes because even if a packet is counted here, it might
> still be dropped by netfilter hooks.

don't use out-of-tree and instead drop using this mechanism or
any other in-kernel method? ;)

We (facebook infrastructure) have been using iptables and bpf
networking together with great success. They nicely co-exist and
complement each other. There is no need to reinvent the wheel if the
existing solution works. iptables are great for their purpose.

> It seems like it would be much more useful to be able to do this in an
> iptables rule.

there is iptables+cBPF support. It's being used in some cases already.
Re: [PATCH v7 0/6] Add eBPF hooks for cgroups
On Thu, Oct 27, 2016 at 5:40 PM, Daniel Mack wrote:
> It's not anything new. These hooks live on the very same level as
> SO_ATTACH_FILTER. The only differences are that the BPF programs are
> stored in the cgroup, and not in the socket, and that they exist for
> egress as well.

What's the use case for egress?

We (android networking) are currently looking at implementing network
accounting via eBPF in order to replace the out-of-tree xt_qtaguid
code. A per-cgroup eBPF program run on all traffic would be great. But
when we looked at this patchset we realized it would not be useful for
accounting purposes because even if a packet is counted here, it might
still be dropped by netfilter hooks.

It seems like it would be much more useful to be able to do this in an
iptables rule.
RE: [PATCH net-next RFC WIP] Patch for XDP support for virtio_net
> -----Original Message-----
> From: Alexei Starovoitov [mailto:alexei.starovoi...@gmail.com]
> Sent: Friday, October 28, 2016 11:22 AM
> To: Jakub Kicinski
> Cc: John Fastabend ; David Miller ; alexander.du...@gmail.com;
> m...@redhat.com; bro...@redhat.com; shrij...@gmail.com;
> t...@herbertland.com; netdev@vger.kernel.org; s...@cumulusnetworks.com;
> ro...@cumulusnetworks.com; niko...@cumulusnetworks.com
> Subject: Re: [PATCH net-next RFC WIP] Patch for XDP support for virtio_net
>
> On Fri, Oct 28, 2016 at 05:18:12PM +0100, Jakub Kicinski wrote:
> > On Fri, 28 Oct 2016 08:56:35 -0700, John Fastabend wrote:
> > > On 16-10-27 07:10 PM, David Miller wrote:
> > > > From: Alexander Duyck
> > > > Date: Thu, 27 Oct 2016 18:43:59 -0700
> > > >
> > > >> On Thu, Oct 27, 2016 at 6:35 PM, David Miller wrote:
> > > >>> From: "Michael S. Tsirkin"
> > > >>> Date: Fri, 28 Oct 2016 01:25:48 +0300
> > > >>>
> > > >>>> On Thu, Oct 27, 2016 at 05:42:18PM -0400, David Miller wrote:
> > > >>>>> From: "Michael S. Tsirkin"
> > > >>>>> Date: Fri, 28 Oct 2016 00:30:35 +0300
> > > >>>>>
> > > >>>>>> Something I'd like to understand is how does XDP address the
> > > >>>>>> problem that 100Byte packets are consuming 4K of memory now.
> > > >>>>>
> > > >>>>> Via page pools. We're going to make a generic one, but right
> > > >>>>> now each and every driver implements a quick list of pages to
> > > >>>>> allocate from (and thus avoid the DMA map/unmap overhead, etc.)
> > > >>>>
> > > >>>> So to clarify, ATM virtio doesn't attempt to avoid dma
> > > >>>> map/unmap so there should be no issue with that even when using
> > > >>>> sub-page regions, assuming DMA APIs support sub-page map/unmap
> > > >>>> correctly.
> > > >>>
> > > >>> That's not what I said.
> > > >>>
> > > >>> The page pools are meant to address the performance degradation
> > > >>> from going to having one packet per page for the sake of XDP's
> > > >>> requirements.
> > > >>>
> > > >>> You still need to have one packet per page for correct XDP
> > > >>> operation whether you do page pools or not, and whether you have
> > > >>> DMA mapping (or its equivalent virtualization operation) or not.
> > > >>
> > > >> Maybe I am missing something here, but why do you need to limit
> > > >> things to one packet per page for correct XDP operation? Most of
> > > >> the drivers out there now are usually storing something closer to
> > > >> at least 2 packets per page, and with the DMA API fixes I am
> > > >> working on there should be no issue with changing the contents
> > > >> inside those pages since we won't invalidate or overwrite the
> > > >> data after the DMA buffer has been synchronized for use by the CPU.
> > > >
> > > > Because with SKB's you can share the page with other packets.
> > > >
> > > > With XDP you simply cannot.
> > > >
> > > > It's software semantics that are the issue. SKB frag list pages
> > > > are read only, XDP packets are writable.
> > > >
> > > > This has nothing to do with "writability" of the pages wrt. DMA
> > > > mapping or cpu mappings.
> > >
> > > Sorry I'm not seeing it either. The current xdp_buff is defined by,
> > >
> > > struct xdp_buff {
> > > 	void *data;
> > > 	void *data_end;
> > > };
> > >
> > > The verifier has an xdp_is_valid_access() check to ensure we don't
> > > go past data_end. The page for now at least never leaves the driver.
> > > For the work to get xmit to other devices working I'm still not sure
> > > I see any issue.
> >
> > +1
> >
> > Do we want to make the packet-per-page a requirement because it could
> > be useful in the future from architectural standpoint? I guess there
> > is a trade-off here between having the comfort of people following
> > this requirement today and making driver support for XDP even more
> > complex.
>
> It looks to me that packet per page makes drivers simpler instead of
> complex.
> ixgbe split-page and mlx* order-3/5 tricks are definitely complex.
> The skb truesize concerns come from the host when data is delivered to
> user space and we need to have precise memory accounting for different
> applications and different users. XDP is all root and imo it's far away
> from the days when multi-user non-root issues start to pop up.
> At the same time XDP doesn't require to use 4k buffer in something like
> Netronome.
> If xdp bpf program can be offloaded into HW with 1800 byte buffers, great!
> For x86 cpu the 4k byte is a natural allocation chunk. Anything lower
> requires delicate dma tricks paired with even more complex slab
> allocator and atomic refcnts.
> All of that can only drive the performance down.
> Comparing to kernel bypass xdp is using 4k pages whereas dpdk has to
> use huge pages, so xdp is saving a ton of memory already!

Generally agree, but SRIOV nics with multiple queues can end up in a
bad spot if each buffer was 4K, right? I see a specific page pool to
be used by queues which are enabled for XDP as the easiest to swing
solution; that way the memor
Re: [bnx2] [Regression 4.8] Driver loading fails without firmware
On 10/27/16 at 03:21pm, Paul Menzel wrote:
> Dear Baoquan,
>
> > > Baoquan, could you please fix this regression. My suggestion is, that you
> > > add the old code back, but check if the firmware has been loaded. If it
> > > hasn’t, load it again.
> > >
> > > That way, people can update their Linux kernel, and it continues working
> > > without changing the initramfs, or anything else.
> >
> > I saw your mail but I am also not familiar with bnx2 driver. As the
> > commit log says I just tried to make bnx2 driver reset itself earlier.
> >
> > So you did a git bisect and found this commit caused the regression,
> > right? If yes, and network developers have no action, I will look into
> > the code and see if I have idea to fix it.
>
> Well, I looked through the commits and found that one, which would explain
> the changed behavior.
>
> To be sure, and to follow your request, I took Linux 4.8.4 and reverted your
> commit (attached). Then I deleted the firmware again from the initramfs, and
> rebooted. The devices showed up just fine as before.
>
> So to summarize, the commit is indeed the culprit.

Hi Paul,

Sorry for this. Could you tell the steps to reproduce? I will find a
machine with a bnx2 NIC and check if there's another way.

Thanks
Baoquan

> From 61b8dac8796343a797858b4a2eb0a59a0cfcd735 Mon Sep 17 00:00:00 2001
> From: Paul Menzel
> Date: Thu, 27 Oct 2016 11:34:52 +0200
> Subject: [PATCH] Revert "bnx2: Reset device during driver initialization"
>
> This reverts commit 3e1be7ad2d38c6bd6aeef96df9bd0a7822f4e51c.
> ---
>  drivers/net/ethernet/broadcom/bnx2.c | 12 +++++-------
>  1 file changed, 5 insertions(+), 7 deletions(-)
>
> diff --git a/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c
> index 27f11a5..ecd357d 100644
> --- a/drivers/net/ethernet/broadcom/bnx2.c
> +++ b/drivers/net/ethernet/broadcom/bnx2.c
> @@ -6356,6 +6356,10 @@ bnx2_open(struct net_device *dev)
>  	struct bnx2 *bp = netdev_priv(dev);
>  	int rc;
>
> +	rc = bnx2_request_firmware(bp);
> +	if (rc < 0)
> +		goto out;
> +
>  	netif_carrier_off(dev);
>
>  	bnx2_disable_int(bp);
> @@ -6424,6 +6428,7 @@ bnx2_open(struct net_device *dev)
>  	bnx2_free_irq(bp);
>  	bnx2_free_mem(bp);
>  	bnx2_del_napi(bp);
> +	bnx2_release_firmware(bp);
>  	goto out;
>  }
>
> @@ -8570,12 +8575,6 @@ bnx2_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>
>  	pci_set_drvdata(pdev, dev);
>
> -	rc = bnx2_request_firmware(bp);
> -	if (rc < 0)
> -		goto error;
> -
> -	bnx2_reset_chip(bp, BNX2_DRV_MSG_CODE_RESET);
>  	memcpy(dev->dev_addr, bp->mac_addr, ETH_ALEN);
>
>  	dev->hw_features = NETIF_F_IP_CSUM | NETIF_F_SG |
> @@ -8608,7 +8607,6 @@ bnx2_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>  	return 0;
>
>  error:
> -	bnx2_release_firmware(bp);
>  	pci_iounmap(pdev, bp->regview);
>  	pci_release_regions(pdev);
>  	pci_disable_device(pdev);
> --
> 2.4.1
>
Re: [PATCH v2 0/5] make POSIX timers optional with some Kconfig help
On Sat, 29 Oct 2016, Paul Bolle wrote:
> On Tue, 2016-10-25 at 22:28 -0400, Nicolas Pitre wrote:
> > When POSIX timers are configured out, the PTP clock subsystem should be
> > left out as well. However a bunch of ethernet drivers currently *select*
> > the latter in their Kconfig entries. Therefore some more work was needed
> > to break that hard dependency from those drivers without preventing their
> > usage altogether.
>
> By the way: would you have pointers to threads that discussed attempts
> to achieve this using currently available Kconfig options?

You could probably go backward from here:

https://lkml.org/lkml/2016/9/20/606

Nicolas
Re: [PATCH 17/17] batman-adv: Avoid precedence issues in macros
On Fri, 2016-10-28 at 23:27 +0200, Sven Eckelmann wrote:
> On Freitag, 28. Oktober 2016 14:13:06 CEST Joe Perches wrote:
> > On Thu, 2016-10-27 at 21:01 +0200, Simon Wunderlich wrote:
> > > From: Sven Eckelmann
> > >
> > > It must be avoided that arguments to a macro are evaluated ungrouped
> > > (which enforces normal operator precedence). Otherwise the result of
> > > the macro is not well defined.
> >
> > Curiosity:
> >
> > in net/batman-adv/tp_meter.c
> > [...]
> > 	orig_node = batadv_orig_hash_find(bat_priv, tp_vars->other_end);
> > 	if (unlikely(!orig_node)) {
> > 		err = BATADV_TP_REASON_DST_UNREACHABLE;
> > 		tp_vars->reason = err;
> > 		goto out;
> > 	}
> >
> > 	primary_if = batadv_primary_if_get_selected(bat_priv);
> > 	if (unlikely(!primary_if)) {
> > 		err = BATADV_TP_REASON_DST_UNREACHABLE;
> > 		goto out;
> > 	}
> >
> > err is not used in the out block
> >
> > Is the last if block supposed to set tp_vars->reason to err?
>
> This seems to be unrelated to this patch.

Kinda. I was looking at how the one instance of batadv_tp_is_error
was used and it's in this file.

> But yes, looks to me like it is missing. Do you want to propose a patch or
> should I do?

As you are working on this file, perhaps you should.

cheers, Joe
Re: [PATCH v7 0/6] Add eBPF hooks for cgroups
On Fri, Oct 28, 2016 at 01:28:39PM +0200, Pablo Neira Ayuso wrote:
> Hi Alexei,
>
> On Wed, Oct 26, 2016 at 08:35:04PM -0700, Alexei Starovoitov wrote:
> > On Wed, Oct 26, 2016 at 09:59:33PM +0200, Pablo Neira Ayuso wrote:
> > > On Tue, Oct 25, 2016 at 12:14:08PM +0200, Daniel Mack wrote:
> > > [...]
> > > > Dumping programs once they are installed is problematic because of
> > > > the internal optimizations done to the eBPF program during its
> > > > lifetime. Also, the references to maps etc. would need to be
> > > > restored during the dump.
> > > >
> > > > Just exposing whether or not a program is attached would be
> > > > trivial to do, however, most easily through another bpf(2)
> > > > command. That can be added later on though.
> > >
> > > I don't know if anyone told you, but during last netconf, this topic
> > > took a bit of time of discussion and it was controversial, I would say
> > > 1/3 of netdev hackers there showed their concerns, and that's
> > > something that should not be skipped IMO.
> >
> > Though I attended netconf over hangouts, I think it was pretty
> > clear that bpf needs 'introspection' of loaded bpf programs and it
> > was a universal desire of everyone. Not 1/3 of hackers.
>
> Introspection is a different thing, very useful, no doubt. But this
> infrastructure is allowing way more than simple innocuous introspection.

During netconf the discussion around bpf introspection was about
making bpf more visible to user space. This patch does cgroup+bpf
infrastructure, which is orthogonal.

> > As commit log says it's an orthogonal work and over the last
> > month we've been discussing pros and cons of different approaches.
> > The audit infra, tracepoints and other ideas.
> > We kept the discussion in private because, unfortunately, public
> > discussions are not fruitful due to threads like this one.
>
> We need to understand what people are trying to solve and in what way.
> That's why we have all those conferences and places to meet and
> discuss too. Please, don't think like that, this is sending the wrong
> message to everyone here, that is kind of: bypass public discussions
> and don't take time to describe what you're doing since it is a waste
> of time. That's not good.

In general I agree that it's best to discuss technical topics on the
mailing list and resolve personal conflicts in person during the
conferences. Unfortunately the obstructionism seen in this thread
forces people to discuss in private to have a constructive
conversation. Anyway, neither your reply nor mine has anything to do
with this patch set, which is a typical example of why I prefer to
discuss bpf related ideas in private.

> > The further points below were disputed many times in the past.
> > Let's address them one more time:
> >
> > > path. But this is adding hooks to push bpf programs in the middle of
> > > our generic stack, this is way different domain.
> >
> > incorrect. look at socket filters, cls_bpf.
>
> Classic socket filters don't allow you to deploy a global policy in
> such a fine grain way as this is doing. Then, cls_bpf is fine since it
> is visible via tc command, so sysadmins can use tools they are
> familiar with to inspect policies and say "oh look, some of the
> processes I'm deploying have installed filters via cls_bpf". However,
> this approach is visible in no way.

the audit and/or tracepoint infra that is being worked on in parallel
will answer this question in full detail. It's no different from an
application using seccomp to control its children, or cls_bpf doing
policy enforcement of containers, or xdp doing low level filtering.
xdp has a boolean "is program loaded" flag and it turned out to be not
very useful for debuggability. It should be solved in a universal way
for all bpf programs. We're not going to add something specific to
this cgroup+bpf infra, since it will be just as clumsy as xdp's bool
flag.
Once again, audit/tracepoint for bpf prog introspection is a separate
work orthogonal to this set.

> [...]
> > > around this socket code in functions so we can invoke it. I guess
> > > filtering of UDP and TCP should be good for you at this stage.
> >
> > DanielM mentioned few times that it's not only about UDP and TCP.
>
> OK, since this is limited to the scope of inet sockets, let's revisit
> what we have around: DCCP is hopeless, who cares. We also have SCTP,
> that is deployed by telcos in datacenters, it cannot reach that domain
> because many Internet gateways are broken for it, so you may not get
> too far with it. Arguably it would be good to have SCTP support at
> some point, but I guess this is not a priority now. Then, UDPlite
> almost comes for free since it relies on the existing UDP
> infrastructure, it's basically UDP with a bit more features. What
> else?

thank you for the nice description of the current state of SCTP, but
it has nothing to do with this patch set. The goal is to monitor
application networking activity at the L3 level.

> > > This would requir
net/dccp: warning in dccp_feat_clone_sp_val/__might_sleep
Hi,

I've got the following error report while running the syzkaller fuzzer:

[ cut here ]
WARNING: CPU: 0 PID: 4608 at kernel/sched/core.c:7724
__might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
do not call blocking ops when !TASK_RUNNING; state=1 set at
[] prepare_to_wait+0xbc/0x210 kernel/sched/wait.c:178
Modules linked in:
CPU: 0 PID: 4608 Comm: syz-executor Not tainted 4.9.0-rc2+ #320
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
88006625f7a0 81b46914 88006625f818
84052960 88006625f7e8 8237
88006aceac00 1e2c ed000cc4beff 84052960
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[] dump_stack+0xb3/0x10f lib/dump_stack.c:51
[] __warn+0x1a7/0x1f0 kernel/panic.c:550
[] warn_slowpath_fmt+0xac/0xd0 kernel/panic.c:565
[] __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
[< inline >] slab_pre_alloc_hook mm/slab.h:393
[< inline >] slab_alloc_node mm/slub.c:2634
[< inline >] slab_alloc mm/slub.c:2716
[] __kmalloc_track_caller+0x150/0x2a0 mm/slub.c:4240
[] kmemdup+0x24/0x50 mm/util.c:113
[] dccp_feat_clone_sp_val.part.5+0x4f/0xe0 net/dccp/feat.c:374
[< inline >] dccp_feat_clone_sp_val net/dccp/feat.c:1141
[< inline >] dccp_feat_change_recv net/dccp/feat.c:1141
[] dccp_feat_parse_options+0xaa1/0x13d0 net/dccp/feat.c:1411
[] dccp_parse_options+0x721/0x1010 net/dccp/options.c:128
[] dccp_rcv_state_process+0x200/0x15b0 net/dccp/input.c:644
[] dccp_v4_do_rcv+0xf4/0x1a0 net/dccp/ipv4.c:681
[< inline >] sk_backlog_rcv ./include/net/sock.h:872
[] __release_sock+0x126/0x3a0 net/core/sock.c:2044
[] release_sock+0x59/0x1c0 net/core/sock.c:2502
[< inline >] inet_wait_for_connect net/ipv4/af_inet.c:547
[] __inet_stream_connect+0x5d2/0xbb0 net/ipv4/af_inet.c:617
[] inet_stream_connect+0x55/0xa0 net/ipv4/af_inet.c:656
[] SYSC_connect+0x244/0x2f0 net/socket.c:1533
[] SyS_connect+0x24/0x30 net/socket.c:1514
[] entry_SYSCALL_64_fastpath+0x1f/0xc2 arch/x86/entry/entry_64.S:209
---[ end trace 0dc4109d69f4e51e ]---

On commit
14970f204b1993af7459d5bd34aaff38dfee6670 (Oct 27).

A reproducer is attached: dccp-feat-warn-poc.c
[PATCH net-next] bpf, inode: add support for symlinks and fix mtime/ctime
While commit bb35a6ef7da4 ("bpf, inode: allow for rename and link ops")
added support for hard links that can be used for prog and map nodes,
this work adds simple symlink support, which can be used f.e. for
directories also when unprivileged and works with cmdline tooling that
understands S_IFLNK anyway. Since the switch in e27f4a942a0e ("bpf: Use
mount_nodev not mount_ns to mount the bpf filesystem"), there can be
various mount instances with mount_nodev() and thus hierarchy can be
flattened to facilitate object sharing. Thus, we can keep bpf tooling
also working by repointing paths.

Most of the functionality can be used from vfs library operations. The
symlink is stored in the inode itself, that is in i_link, which is
sufficient in our case as opposed to storing it in the page cache.

While at it, I noticed that bpf_mkdir() and bpf_mkobj() don't update
the directory's mtime and ctime, so add a common helper for it called
bpf_dentry_finalize() that takes care of it for all cases now.

Signed-off-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
---
 kernel/bpf/inode.c | 45 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 39 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 1ed8473..2565809 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -87,6 +87,7 @@ static struct inode *bpf_get_inode(struct super_block *sb,
 	switch (mode & S_IFMT) {
 	case S_IFDIR:
 	case S_IFREG:
+	case S_IFLNK:
 		break;
 	default:
 		return ERR_PTR(-EINVAL);
@@ -119,6 +120,16 @@ static int bpf_inode_type(const struct inode *inode, enum bpf_type *type)
 	return 0;
 }
 
+static void bpf_dentry_finalize(struct dentry *dentry, struct inode *inode,
+				struct inode *dir)
+{
+	d_instantiate(dentry, inode);
+	dget(dentry);
+
+	dir->i_mtime = current_time(dir);
+	dir->i_ctime = dir->i_mtime;
+}
+
 static int bpf_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode;
@@ -133,9 +144,7 @@ static int bpf_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 	inc_nlink(inode);
 	inc_nlink(dir);
 
-	d_instantiate(dentry, inode);
-	dget(dentry);
-
+	bpf_dentry_finalize(dentry, inode, dir);
 	return 0;
 }
 
@@ -151,9 +160,7 @@ static int bpf_mkobj_ops(struct inode *dir, struct dentry *dentry,
 	inode->i_op = iops;
 	inode->i_private = dentry->d_fsdata;
 
-	d_instantiate(dentry, inode);
-	dget(dentry);
-
+	bpf_dentry_finalize(dentry, inode, dir);
 	return 0;
 }
 
@@ -181,13 +188,37 @@ static int bpf_mkobj(struct inode *dir, struct dentry *dentry, umode_t mode,
 {
 	if (strchr(dentry->d_name.name, '.'))
 		return ERR_PTR(-EPERM);
+
 	return simple_lookup(dir, dentry, flags);
 }
 
+static int bpf_symlink(struct inode *dir, struct dentry *dentry,
+		       const char *target)
+{
+	char *link = kstrdup(target, GFP_USER | __GFP_NOWARN);
+	struct inode *inode;
+
+	if (!link)
+		return -ENOMEM;
+
+	inode = bpf_get_inode(dir->i_sb, dir, S_IRWXUGO | S_IFLNK);
+	if (IS_ERR(inode)) {
+		kfree(link);
+		return PTR_ERR(inode);
+	}
+
+	inode->i_op = &simple_symlink_inode_operations;
+	inode->i_link = link;
+
+	bpf_dentry_finalize(dentry, inode, dir);
+	return 0;
+}
+
 static const struct inode_operations bpf_dir_iops = {
 	.lookup		= bpf_lookup,
 	.mknod		= bpf_mkobj,
 	.mkdir		= bpf_mkdir,
+	.symlink	= bpf_symlink,
 	.rmdir		= simple_rmdir,
 	.rename		= simple_rename,
 	.link		= simple_link,
@@ -324,6 +355,8 @@ static void bpf_evict_inode(struct inode *inode)
 	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
 
+	if (S_ISLNK(inode->i_mode))
+		kfree(inode->i_link);
 	if (!bpf_inode_type(inode, &type))
 		bpf_any_put(inode->i_private, type);
 }
-- 
1.9.3
Re: [PATCH] [v2 netfilter-next] netfilter: nf_tables: fib warnings
Arnd Bergmann wrote:
> The newly added nft fib code produces two warnings:
>
> net/ipv4/netfilter/nft_fib_ipv4.c: In function 'nft_fib4_eval':
> net/ipv4/netfilter/nft_fib_ipv4.c:80:6: error: unused variable 'i'
> [-Werror=unused-variable]
> net/ipv4/netfilter/nft_fib_ipv4.c: In function ‘nft_fib4_eval’:
> net/ipv4/netfilter/nft_fib_ipv4.c:137:6: error: ‘oif’ may be used
> uninitialized in this function [-Werror=maybe-uninitialized]
>
> The first one is obvious as the only user of that variable is
> inside of an #ifdef
>
> The second one is a bit trickier. It's clear that oif is in fact
> uninitialized when it gets used when neither NFTA_FIB_F_IIF nor
> NFTA_FIB_F_OIF are set, and just setting it to NULL won't work
> as it may later get dereferenced.
>
> However, there is no need to search the result list if it is
> NULL, as Florian pointed out. This integrates his (untested)
> change to do so. I have confirmed that the combined patch
> solves both warnings, but as I don't fully understand Florian's
> change, I can't tell if it's correct.
>
> Suggested-by: Florian Westphal
> Fixes: 84f5eedb983e ("netfilter: nf_tables: add fib expression")
> Signed-off-by: Arnd Bergmann

chain pre {
	type filter hook prerouting priority 0; policy accept;
	fib saddr oif "eth0"
}

eth0: default route, 192.168.7.10/16
eth1: 10.0.0.2/8

ping from 192.168.7.1 from peer on eth0: result eth0, ok
ping from 10.0.0.2 from peer on eth0: no result, ok
ping from 10.0.0.3 from peer on eth0: result eth1, ok

chain pre {
	type filter hook prerouting priority 0; policy accept;
	fib saddr . iif oif "eth0"
}

ping from 192.168.7.1 from peer on eth0: result eth0, ok
ping from 10.0.0.2 from peer on eth0: no result, ok
ping from 10.0.0.3 from peer on eth0: no result, ok

so:

Tested-by: Florian Westphal
Re: [PATCH net-next] genetlink: Fix generic netlink family unregister
On Fri, 2016-10-28 at 16:01 -0700, Pravin B Shelar wrote: > This patch fixes a typo in unregister operation. > > Following crash is fixed by this patch. It can be easily reproduced > by repeating modprobe and rmmod module that uses genetlink. Yikes, I ran all my tests in VMs and forgot this wasn't even executed. Sorry about that! Reviewed-by: Johannes Berg johannes
[PATCH net-next] genetlink: Fix generic netlink family unregister
This patch fixes a typo in unregister operation. Following crash is fixed by this patch. It can be easily reproduced by repeating modprobe and rmmod module that uses genetlink. [ 261.446686] BUG: unable to handle kernel paging request at a0264088 [ 261.448921] IP: [] strcmp+0xe/0x30 [ 261.450494] PGD 1c09067 [ 261.451266] PUD 1c0a063 [ 261.452091] PMD 8068d5067 [ 261.452525] PTE 0 [ 261.453164] [ 261.453618] Oops: [#1] SMP [ 261.454577] Modules linked in: openvswitch(+) ... [ 261.480753] RIP: 0010:[] [] strcmp+0xe/0x30 [ 261.483069] RSP: 0018:c90003c0bc28 EFLAGS: 00010282 [ 261.510145] Call Trace: [ 261.510896] [] genl_family_find_byname+0x5a/0x70 [ 261.512819] [] genl_register_family+0xb9/0x630 [ 261.514805] [] dp_init+0xbc/0x120 [openvswitch] [ 261.518268] [] do_one_initcall+0x3d/0x160 [ 261.525041] [] do_init_module+0x60/0x1f1 [ 261.526754] [] load_module+0x22af/0x2860 [ 261.530144] [] SYSC_finit_module+0x96/0xd0 [ 261.531901] [] SyS_finit_module+0xe/0x10 [ 261.533605] [] do_syscall_64+0x6e/0x180 [ 261.535284] [] entry_SYSCALL64_slow_path+0x25/0x25 [ 261.546512] RIP [] strcmp+0xe/0x30 [ 261.550198] ---[ end trace 76505a814dd68770 ]--- Fixes: 2ae0f17df1c ("genetlink: use idr to track families"). Reported-by: Jarno Rajahalme CC: Johannes Berg Signed-off-by: Pravin B Shelar --- net/netlink/genetlink.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c index df0cbcd..caf04d7 100644 --- a/net/netlink/genetlink.c +++ b/net/netlink/genetlink.c @@ -399,7 +399,7 @@ int genl_unregister_family(const struct genl_family *family) { genl_lock_all(); - if (genl_family_find_byid(family->id)) { + if (!genl_family_find_byid(family->id)) { genl_unlock_all(); return -ENOENT; } -- 1.9.1
Re: [PATCH v2 0/5] make POSIX timers optional with some Kconfig help
On Tue, 2016-10-25 at 22:28 -0400, Nicolas Pitre wrote: > When POSIX timers are configured out, the PTP clock subsystem should be > left out as well. However a bunch of ethernet drivers currently *select* > the later in their Kconfig entries. Therefore some more work was needed > to break that hard dependency from those drivers without preventing their > usage altogether. By the way: would you have pointers to threads that discussed attempts to achieve this using currently available Kconfig options? Thanks, Paul Bolle
Re: [PATCH 2/2] at803x: double check SGMII side autoneg
Zefir Kurtisi wrote: + /* check if the SGMII link is OK. */ + if (!(phy_read(phydev, AT803X_PSSR) & AT803X_PSSR_MR_AN_COMPLETE)) { + pr_warn("803x_aneg_done: SGMII link is not ok\n"); + aneg_done = 0; I see this message appear sometimes when bringing up the interface via ifup. However, contrary to your patch description, everything seems to work: $ iperf3 -c 192.168.1.1 -t 3600 Connecting to host 192.168.1.1, port 5201 [ 4] local 192.168.1.2 port 52640 connected to 192.168.1.1 port 5201 [ ID] Interval Transfer Bandwidth Retr Cwnd [ 4] 0.00-1.00 sec 109 MBytes 909 Mbits/sec 0 485 KBytes [ 4] 1.00-2.00 sec 108 MBytes 902 Mbits/sec 0 622 KBytes I wonder if you're impacted with all of the pause frame problems I'm having. -- Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
Re: [PATCH v2 1/5] kconfig: introduce the "imply" keyword
On Thu, 2016-10-27 at 23:10 -0400, Nicolas Pitre wrote: > On Fri, 28 Oct 2016, Paul Bolle wrote: > > And in your example BAR is bool, right? Does the above get more > > complicated if BAR would be tristate? > > If BAR=m then implying BAZ from FOO=y will force BAZ to y or n, > bypassing the restriction provided by BAR like "select" does. This is > somewhat questionable for "select" to do that, and the code emits a > warning when "select" bypasses a direct dependency set to n, but not > when set to m. For now "imply" simply tries to be consistent with > the "select" behavior. Side note: yes, one can select a symbol that's missing one or more dependencies. But since Kconfig has two separate methods to describe relations (ie, selecting and depending) there's logically the possibility of conflict. So we need a rule to resolve that conflict. That rule is: "select" beats "depends on". I don't think that this rule is less plausible than the opposite rule. Paul Bolle
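The conflict can be made concrete with a small fragment (symbol names made up for illustration):

```kconfig
config BAR
	bool "bar"

config BAZ
	bool "baz"
	depends on BAR

config FOO
	bool "foo"
	select BAZ
```

With FOO=y and BAR=n, "select" wins: BAZ ends up y anyway, and Kconfig only emits a warning about the unmet direct dependency.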
Re: [PATCH v2 1/5] kconfig: introduce the "imply" keyword
On Fri, 28 Oct 2016, Paul Bolle wrote: > On Thu, 2016-10-27 at 23:10 -0400, Nicolas Pitre wrote: > > On Fri, 28 Oct 2016, Paul Bolle wrote: > > > What happens when a tristate symbol is implied by a symbol set to 'y' > > > and by a symbol set to 'm'? > > > > That's respectively the third and second rows in the table above. > > I meant: two separate symbols implying the same symbol at the same > time. One of those symbols set to 'y' and the other set to 'm'. Then it's the greatest of the set i.e. y. Nicolas
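In Kconfig terms (hypothetical symbols), the case being discussed looks like:

```kconfig
config FOO
	bool "foo"
	imply BAZ

config BAR
	tristate "bar"
	imply BAZ

config BAZ
	tristate "baz"
```

With FOO=y and BAR=m, BAZ defaults to the greatest implied value, i.e. y — though, unlike "select", the user can still lower it to m or n.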
[net-next PATCH v2 4/4] net: Add support for XPS with QoS via traffic classes
This patch adds support for setting and using XPS when QoS via traffic classes is enabled. With this change we will factor in the priority and traffic class mapping of the packet and use that information to correctly select the queue. This allows us to define a set of queues for a given traffic class via mqprio and then configure the XPS mapping for those queues so that the traffic flows can avoid head-of-line blocking between the individual CPUs if so desired. Signed-off-by: Alexander Duyck --- v2: Mostly for loop updates based on changes in patch 3 Replaced "i = 0; i < X; i++" with "i = X; i--;" as i was just a counter include/linux/netdevice.h |4 +- net/core/dev.c| 117 - net/core/net-sysfs.c | 31 3 files changed, 105 insertions(+), 47 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b60a156..56f90f7 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -732,8 +732,8 @@ struct xps_dev_maps { struct rcu_head rcu; struct xps_map __rcu *cpu_map[0]; }; -#define XPS_DEV_MAPS_SIZE (sizeof(struct xps_dev_maps) + \ -(nr_cpu_ids * sizeof(struct xps_map *))) +#define XPS_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \ + (nr_cpu_ids * (_tcs) * sizeof(struct xps_map *))) #endif /* CONFIG_XPS */ #define TC_MAX_QUEUE 16 diff --git a/net/core/dev.c b/net/core/dev.c index fea7061..0ebb035 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2002,14 +2002,22 @@ static bool remove_xps_queue_cpu(struct net_device *dev, struct xps_dev_maps *dev_maps, int cpu, u16 offset, u16 count) { - int i, j; + int num_tc = dev->num_tc ? 
: 1; + bool active = false; + int tci; - for (i = count, j = offset; i--; j++) { - if (!remove_xps_queue(dev_maps, cpu, j)) - break; + for (tci = cpu * num_tc; num_tc--; tci++) { + int i, j; + + for (i = count, j = offset; i--; j++) { + if (!remove_xps_queue(dev_maps, cpu, j)) + break; + } + + active |= i < 0; } - return i < 0; + return active; } static void netif_reset_xps_queues(struct net_device *dev, u16 offset, @@ -2086,20 +2094,28 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask, u16 index) { struct xps_dev_maps *dev_maps, *new_dev_maps = NULL; + int i, cpu, tci, numa_node_id = -2; + int maps_sz, num_tc = 1, tc = 0; struct xps_map *map, *new_map; - int maps_sz = max_t(unsigned int, XPS_DEV_MAPS_SIZE, L1_CACHE_BYTES); - int cpu, numa_node_id = -2; bool active = false; + if (dev->num_tc) { + num_tc = dev->num_tc; + tc = netdev_txq_to_tc(dev, index); + if (tc < 0) + return -EINVAL; + } + + maps_sz = XPS_DEV_MAPS_SIZE(num_tc); + if (maps_sz < L1_CACHE_BYTES) + maps_sz = L1_CACHE_BYTES; + mutex_lock(&xps_map_mutex); dev_maps = xmap_dereference(dev->xps_maps); /* allocate memory for queue storage */ - for_each_online_cpu(cpu) { - if (!cpumask_test_cpu(cpu, mask)) - continue; - + for_each_cpu_and(cpu, cpu_online_mask, mask) { if (!new_dev_maps) new_dev_maps = kzalloc(maps_sz, GFP_KERNEL); if (!new_dev_maps) { @@ -2107,25 +2123,38 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask, return -ENOMEM; } - map = dev_maps ? xmap_dereference(dev_maps->cpu_map[cpu]) : + tci = cpu * num_tc + tc; + map = dev_maps ? 
xmap_dereference(dev_maps->cpu_map[tci]) : NULL; map = expand_xps_map(map, cpu, index); if (!map) goto error; - RCU_INIT_POINTER(new_dev_maps->cpu_map[cpu], map); + RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map); } if (!new_dev_maps) goto out_no_new_maps; for_each_possible_cpu(cpu) { + /* copy maps belonging to foreign traffic classes */ + for (i = tc, tci = cpu * num_tc; dev_maps && i--; tci++) { + /* fill in the new device map from the old device map */ + map = xmap_dereference(dev_maps->cpu_map[tci]); + RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map); + } + + /* We need to explicitly update tci as previous loop +* could break out early if dev_maps is NULL. +*/ +
[net-next PATCH v2 3/4] net: Refactor removal of queues from XPS map and apply on num_tc changes
This patch updates the code for removing queues from the XPS map and makes it so that we can apply the code any time we change either the number of traffic classes or the mapping of a given block of queues. This way we avoid having queues pulling traffic from a foreign traffic class. Signed-off-by: Alexander Duyck --- v2: Replaced do/while with for loop Changed order for remove_xps_queue_cpu loop Updated i to reflect count, j to be temporary offset Updated test to check for i < 0 instead of i < offset net/core/dev.c | 73 ++-- 1 file changed, 50 insertions(+), 23 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index 08ed625..fea7061 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1970,32 +1970,50 @@ int netdev_txq_to_tc(struct net_device *dev, unsigned int txq) #define xmap_dereference(P)\ rcu_dereference_protected((P), lockdep_is_held(&xps_map_mutex)) -static struct xps_map *remove_xps_queue(struct xps_dev_maps *dev_maps, - int cpu, u16 index) +static bool remove_xps_queue(struct xps_dev_maps *dev_maps, +int tci, u16 index) { struct xps_map *map = NULL; int pos; if (dev_maps) - map = xmap_dereference(dev_maps->cpu_map[cpu]); + map = xmap_dereference(dev_maps->cpu_map[tci]); + if (!map) + return false; - for (pos = 0; map && pos < map->len; pos++) { - if (map->queues[pos] == index) { - if (map->len > 1) { - map->queues[pos] = map->queues[--map->len]; - } else { - RCU_INIT_POINTER(dev_maps->cpu_map[cpu], NULL); - kfree_rcu(map, rcu); - map = NULL; - } + for (pos = map->len; pos--;) { + if (map->queues[pos] != index) + continue; + + if (map->len > 1) { + map->queues[pos] = map->queues[--map->len]; break; } + + RCU_INIT_POINTER(dev_maps->cpu_map[tci], NULL); + kfree_rcu(map, rcu); + return false; } - return map; + return true; } -static void netif_reset_xps_queues_gt(struct net_device *dev, u16 index) +static bool remove_xps_queue_cpu(struct net_device *dev, +struct xps_dev_maps *dev_maps, +int cpu, u16 offset, u16 count) +{ + int i, j; + + for (i = 
count, j = offset; i--; j++) { + if (!remove_xps_queue(dev_maps, cpu, j)) + break; + } + + return i < 0; +} + +static void netif_reset_xps_queues(struct net_device *dev, u16 offset, + u16 count) { struct xps_dev_maps *dev_maps; int cpu, i; @@ -2007,21 +2025,16 @@ static void netif_reset_xps_queues_gt(struct net_device *dev, u16 index) if (!dev_maps) goto out_no_maps; - for_each_possible_cpu(cpu) { - for (i = index; i < dev->num_tx_queues; i++) { - if (!remove_xps_queue(dev_maps, cpu, i)) - break; - } - if (i == dev->num_tx_queues) - active = true; - } + for_each_possible_cpu(cpu) + active |= remove_xps_queue_cpu(dev, dev_maps, cpu, + offset, count); if (!active) { RCU_INIT_POINTER(dev->xps_maps, NULL); kfree_rcu(dev_maps, rcu); } - for (i = index; i < dev->num_tx_queues; i++) + for (i = offset + (count - 1); count--; i--) netdev_queue_numa_node_write(netdev_get_tx_queue(dev, i), NUMA_NO_NODE); @@ -2029,6 +2042,11 @@ static void netif_reset_xps_queues_gt(struct net_device *dev, u16 index) mutex_unlock(&xps_map_mutex); } +static void netif_reset_xps_queues_gt(struct net_device *dev, u16 index) +{ + netif_reset_xps_queues(dev, index, dev->num_tx_queues - index); +} + static struct xps_map *expand_xps_map(struct xps_map *map, int cpu, u16 index) { @@ -2192,6 +2210,9 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask, #endif void netdev_reset_tc(struct net_device *dev) { +#ifdef CONFIG_XPS + netif_reset_xps_queues_gt(dev, 0); +#endif dev->num_tc = 0; memset(dev->tc_to_txq, 0, sizeof(dev->tc_to_txq)); memset(dev->prio_tc_map, 0, sizeof(dev->prio_tc_map)); @@ -2203,6 +2224,9 @@ int netdev_set_tc_queue(struct net_device *dev, u8 tc, u16 count, u16 offset) if (tc >= dev->num_tc) return -
[net-next PATCH v2 2/4] net: Add sysfs value to determine queue traffic class
Add a sysfs attribute for a Tx queue that allows us to determine the traffic class for a given queue. This will allow us to more easily determine this in the future. It is needed as XPS will take the traffic class for a group of queues into account in order to avoid pulling traffic from one traffic class into another. Signed-off-by: Alexander Duyck --- v2: Added this patch to the series. include/linux/netdevice.h |1 + net/core/dev.c| 17 + net/core/net-sysfs.c | 20 +++- 3 files changed, 37 insertions(+), 1 deletion(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index d045432..b60a156 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1920,6 +1920,7 @@ int netdev_set_prio_tc_map(struct net_device *dev, u8 prio, u8 tc) return 0; } +int netdev_txq_to_tc(struct net_device *dev, unsigned int txq); void netdev_reset_tc(struct net_device *dev); int netdev_set_tc_queue(struct net_device *dev, u8 tc, u16 count, u16 offset); int netdev_set_num_tc(struct net_device *dev, u8 num_tc); diff --git a/net/core/dev.c b/net/core/dev.c index d4d45bf..08ed625 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1948,6 +1948,23 @@ static void netif_setup_tc(struct net_device *dev, unsigned int txq) } } +int netdev_txq_to_tc(struct net_device *dev, unsigned int txq) +{ + if (dev->num_tc) { + struct netdev_tc_txq *tc = &dev->tc_to_txq[0]; + int i; + + for (i = 0; i < TC_MAX_QUEUE; i++, tc++) { + if ((txq - tc->offset) < tc->count) + return i; + } + + return -1; + } + + return 0; +} + #ifdef CONFIG_XPS static DEFINE_MUTEX(xps_map_mutex); #define xmap_dereference(P)\ diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index 6e4f347..c4e4e7d 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -1021,7 +1021,6 @@ static ssize_t show_trans_timeout(struct netdev_queue *queue, return sprintf(buf, "%lu", trans_timeout); } -#ifdef CONFIG_XPS static unsigned int get_netdev_queue_index(struct netdev_queue *queue) { struct net_device *dev 
= queue->dev; @@ -1033,6 +1032,21 @@ static unsigned int get_netdev_queue_index(struct netdev_queue *queue) return i; } +static ssize_t show_traffic_class(struct netdev_queue *queue, + struct netdev_queue_attribute *attribute, + char *buf) +{ + struct net_device *dev = queue->dev; + int index = get_netdev_queue_index(queue); + int tc = netdev_txq_to_tc(dev, index); + + if (tc < 0) + return -EINVAL; + + return sprintf(buf, "%u\n", tc); +} + +#ifdef CONFIG_XPS static ssize_t show_tx_maxrate(struct netdev_queue *queue, struct netdev_queue_attribute *attribute, char *buf) @@ -1075,6 +1089,9 @@ static ssize_t set_tx_maxrate(struct netdev_queue *queue, static struct netdev_queue_attribute queue_trans_timeout = __ATTR(tx_timeout, S_IRUGO, show_trans_timeout, NULL); +static struct netdev_queue_attribute queue_traffic_class = + __ATTR(traffic_class, S_IRUGO, show_traffic_class, NULL); + #ifdef CONFIG_BQL /* * Byte queue limits sysfs structures and functions. @@ -1260,6 +1277,7 @@ static ssize_t store_xps_map(struct netdev_queue *queue, static struct attribute *netdev_queue_default_attrs[] = { &queue_trans_timeout.attr, + &queue_traffic_class.attr, #ifdef CONFIG_XPS &xps_cpus_attribute.attr, &queue_tx_maxrate.attr,
[net-next PATCH v2 1/4] net: Move functions for configuring traffic classes out of inline headers
The functions for configuring the traffic class to queue mappings have other effects that need to be addressed. Instead of trying to export a bunch of new functions just relocate the functions so that we can instrument them directly with the functionality they will need. Signed-off-by: Alexander Duyck --- include/linux/netdevice.h | 31 +++ net/core/dev.c| 29 + 2 files changed, 32 insertions(+), 28 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 458c876..d045432 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1920,34 +1920,9 @@ int netdev_set_prio_tc_map(struct net_device *dev, u8 prio, u8 tc) return 0; } -static inline -void netdev_reset_tc(struct net_device *dev) -{ - dev->num_tc = 0; - memset(dev->tc_to_txq, 0, sizeof(dev->tc_to_txq)); - memset(dev->prio_tc_map, 0, sizeof(dev->prio_tc_map)); -} - -static inline -int netdev_set_tc_queue(struct net_device *dev, u8 tc, u16 count, u16 offset) -{ - if (tc >= dev->num_tc) - return -EINVAL; - - dev->tc_to_txq[tc].count = count; - dev->tc_to_txq[tc].offset = offset; - return 0; -} - -static inline -int netdev_set_num_tc(struct net_device *dev, u8 num_tc) -{ - if (num_tc > TC_MAX_QUEUE) - return -EINVAL; - - dev->num_tc = num_tc; - return 0; -} +void netdev_reset_tc(struct net_device *dev); +int netdev_set_tc_queue(struct net_device *dev, u8 tc, u16 count, u16 offset); +int netdev_set_num_tc(struct net_device *dev, u8 num_tc); static inline int netdev_get_num_tc(struct net_device *dev) diff --git a/net/core/dev.c b/net/core/dev.c index f55fb45..d4d45bf 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2173,6 +2173,35 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask, EXPORT_SYMBOL(netif_set_xps_queue); #endif +void netdev_reset_tc(struct net_device *dev) +{ + dev->num_tc = 0; + memset(dev->tc_to_txq, 0, sizeof(dev->tc_to_txq)); + memset(dev->prio_tc_map, 0, sizeof(dev->prio_tc_map)); +} +EXPORT_SYMBOL(netdev_reset_tc); + +int 
netdev_set_tc_queue(struct net_device *dev, u8 tc, u16 count, u16 offset) +{ + if (tc >= dev->num_tc) + return -EINVAL; + + dev->tc_to_txq[tc].count = count; + dev->tc_to_txq[tc].offset = offset; + return 0; +} +EXPORT_SYMBOL(netdev_set_tc_queue); + +int netdev_set_num_tc(struct net_device *dev, u8 num_tc) +{ + if (num_tc > TC_MAX_QUEUE) + return -EINVAL; + + dev->num_tc = num_tc; + return 0; +} +EXPORT_SYMBOL(netdev_set_num_tc); + /* * Routine to help set real_num_tx_queues. To avoid skbs mapped to queues * greater then real_num_tx_queues stale skbs on the qdisc must be flushed.
[net-next PATCH v2 0/4] Add support for XPS when using DCB
This patch series enables proper isolation between traffic classes when using XPS while DCB is enabled. Previously enabling XPS would cause the traffic to be potentially pulled from one traffic class into another on egress. This change essentially multiplies the XPS map by the number of traffic classes and allows us to do a lookup per traffic class for a given CPU. To guarantee the isolation I invalidate the XPS map for any queues that are moved from one traffic class to another, or if we change the number of traffic classes. v2: Added sysfs to display traffic class Replaced do/while with for loop Cleaned up several other for loops throughout the patch --- Alexander Duyck (4): net: Move functions for configuring traffic classes out of inline headers net: Add sysfs value to determine queue traffic class net: Refactor removal of queues from XPS map and apply on num_tc changes net: Add support for XPS with QoS via traffic classes include/linux/netdevice.h | 36 +-- net/core/dev.c| 226 +++-- net/core/net-sysfs.c | 51 -- 3 files changed, 219 insertions(+), 94 deletions(-) --
Re: [PATCH net 2/3] sctp: return back transport in __sctp_rcv_init_lookup
On Fri, Oct 28, 2016 at 05:42:21PM -0200, Marcelo Ricardo Leitner wrote: > On Fri, Oct 28, 2016 at 06:10:53PM +0800, Xin Long wrote: > > Prior to this patch, it used a local variable to save the transport that is > > looked up by __sctp_lookup_association(), and didn't return it back. But in > > sctp_rcv, it is used to initialize chunk->transport. So when hitting this > > code, it was initializing chunk->transport with some random stack value > > instead. > > > > This patch is to return the transport back through transport pointer > > that is from __sctp_rcv_lookup_harder(). > > > > Signed-off-by: Xin Long > > Acked-by: Marcelo Ricardo Leitner > > transport pointer in sctp_rcv() is initialized to null and there are > checks for it after this path, so this shouldn't be exploitable, just > malfunction. This actually sort of contradicts the changelog. Xin, did I miss something here? Seems we need to update the changelog if not.
Re: Why do we need tasklet in IFB?
On Fri, 28 Oct 2016 14:36:27 -0700 Michael Ma wrote: > Hi - > > Currently IFB uses tasklet to process tx/rx on the interface that > forwarded the packet to IFB. My understanding on why we're doing this > is that since dev_queue_xmit() can be invoked in interrupt, we want to > defer the processing of original tx/rx in case ifb_xmit() is called > from interrupt. dev_queue_xmit is only called from interrupt if doing netconsole.
Why do we need tasklet in IFB?
Hi - Currently IFB uses tasklet to process tx/rx on the interface that forwarded the packet to IFB. My understanding on why we're doing this is that since dev_queue_xmit() can be invoked in interrupt, we want to defer the processing of original tx/rx in case ifb_xmit() is called from interrupt. However, if the packet is originally from rx, calling context should already be a tasklet and there is no need to queue the processing to another tasklet anymore. Even if the packet is originated from tx, we can rely on the deferred processing of the "original device" or TC if necessary. Did I miss anything here? Furthermore, looking at the bonding device's code there isn't this kind of buffering/tasklet handling for packet forwarding even though bond also has its own txq/rxq configured separately from the actual nic, which is very similar to IFB. So why do we need tasklet in IFB? Thanks, Michael
Re: [PATCH v2 1/5] kconfig: introduce the "imply" keyword
On Thu, 2016-10-27 at 23:10 -0400, Nicolas Pitre wrote: > On Fri, 28 Oct 2016, Paul Bolle wrote: > > What happens when a tristate symbol is implied by a symbol set to 'y' > > and by a symbol set to 'm'? > > That's respectively the third and second rows in the table above. I meant: two separate symbols implying the same symbol at the same time. One of those symbols set to 'y' and the other set to 'm'. Anyhow, I hope to play with a mock Kconfig fragment the next few days to find out myself. Thanks, Paul Bolle
Re: [PATCH v2 1/5] kconfig: introduce the "imply" keyword
On Thu, 2016-10-27 at 23:10 -0400, Nicolas Pitre wrote: > On Fri, 28 Oct 2016, Paul Bolle wrote: > > You probably got "["if" ]" for free by copying what's there for > > select. But this series doesn't use it, so perhaps it would be better > > to not document it yet. But that is rather sneaky. Dunno. > > If it is not documented then the chance that someone uses it are slim. > And since it comes "for free" I don't see the point of making the tool > less flexible. And not having this flexibility could make some > transitions from "select" to "imply" needlessly difficult. My point is moot. I somehow missed that this series adds imply PTP_1588_CLOCK if TILEGX So you are quite right in documenting this. Paul Bolle
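For reference, the conditional form referred to here looks like this as a fragment (the enclosing driver symbol is shown for illustration and may differ from the series):

```kconfig
config TILE_NET
	tristate "Tilera GBE/XGBE network driver support"
	imply PTP_1588_CLOCK if TILEGX
```

The implication only takes effect when TILEGX is enabled; otherwise PTP_1588_CLOCK is left alone.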
Re: [PATCH 17/17] batman-adv: Avoid precedence issues in macros
On Freitag, 28. Oktober 2016 14:13:06 CEST Joe Perches wrote: > On Thu, 2016-10-27 at 21:01 +0200, Simon Wunderlich wrote: > > From: Sven Eckelmann > > > > It must be avoided that arguments to a macro are evaluated ungrouped (which > > enforces normal operator precedence). Otherwise the result of the macro > > is not well defined. > Curiosity: > in net/batman-adv/tp_meter.c [...] > orig_node = batadv_orig_hash_find(bat_priv, tp_vars->other_end); > if (unlikely(!orig_node)) { > err = BATADV_TP_REASON_DST_UNREACHABLE; > tp_vars->reason = err; > goto out; > } > > primary_if = batadv_primary_if_get_selected(bat_priv); > if (unlikely(!primary_if)) { > err = BATADV_TP_REASON_DST_UNREACHABLE; > goto out; > } > > err is not used in the out block > > Is the last if block supposed to set tp_vars->reason to err? This seems to be unrelated to this patch. But yes, looks to me like it is missing. Do you want to propose a patch or should I do? Just make sure you Cc Antonio Quartulli (and of course b.a.t.m@lists.open-mesh.org). He is the original author of 33a3bb4a3345 ("batman-adv: throughput meter implementation"). Kind regards, Sven signature.asc Description: This is a digitally signed message part.
Re: [PATCH net] net: clear sk_err_soft in sk_clone_lock()
On Fri, Oct 28, 2016 at 4:40 PM, Eric Dumazet wrote: > From: Eric Dumazet > > At accept() time, it is possible the parent has a non zero > sk_err_soft, leftover from a prior error. > > Make sure we do not leave this value in the child, as it > makes future getsockopt(SO_ERROR) calls quite unreliable. > > Signed-off-by: Eric Dumazet Acked-by: Soheil Hassas Yeganeh > --- > net/core/sock.c |1 + > 1 file changed, 1 insertion(+) > > diff --git a/net/core/sock.c b/net/core/sock.c > index > d8e4532e89e7c28737c95c723e5f5b3d184a7805..662ccf1c40ed1b66ee253b063dcbfbd186deccee > 100644 > --- a/net/core/sock.c > +++ b/net/core/sock.c > @@ -1543,6 +1543,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const > gfp_t priority) > RCU_INIT_POINTER(newsk->sk_reuseport_cb, NULL); > > newsk->sk_err = 0; > + newsk->sk_err_soft = 0; > newsk->sk_priority = 0; > newsk->sk_incoming_cpu = raw_smp_processor_id(); > atomic64_set(&newsk->sk_cookie, 0); > > Very nice catch! Thank you, Eric!
Re: [PATCH 17/17] batman-adv: Avoid precedence issues in macros
On Thu, 2016-10-27 at 21:01 +0200, Simon Wunderlich wrote: > From: Sven Eckelmann > > It must be avoided that arguments to a macro are evaluated ungrouped (which > enforces normal operator precedence). Otherwise the result of the macro > is not well defined. Curiosity: in net/batman-adv/tp_meter.c static int batadv_tp_send(void *arg) { struct batadv_tp_vars *tp_vars = arg; struct batadv_priv *bat_priv = tp_vars->bat_priv; struct batadv_hard_iface *primary_if = NULL; struct batadv_orig_node *orig_node = NULL; size_t payload_len, packet_len; int err = 0; if (unlikely(tp_vars->role != BATADV_TP_SENDER)) { err = BATADV_TP_REASON_DST_UNREACHABLE; tp_vars->reason = err; goto out; } orig_node = batadv_orig_hash_find(bat_priv, tp_vars->other_end); if (unlikely(!orig_node)) { err = BATADV_TP_REASON_DST_UNREACHABLE; tp_vars->reason = err; goto out; } primary_if = batadv_primary_if_get_selected(bat_priv); if (unlikely(!primary_if)) { err = BATADV_TP_REASON_DST_UNREACHABLE; goto out; } err is not used in the out block Is the last if block supposed to set tp_vars->reason to err?
Re: [PATCH] net: phy: at803x: the Atheros 8031 supports pause frames
On 10/28/2016 01:06 PM, Timur Tabi wrote: > Florian Fainelli wrote: >> On 10/27/2016 03:24 PM, Timur Tabi wrote: >>> Florian Fainelli wrote: >>> Hu? In my experience that should not come from supporting Pause frames or not, but rather properly configuring a (RG)MII delay, but your mileage may vary. >>> >>> I can assure you, I'm more confused than you. I've been working on this >>> for almost two weeks, and not only does this patch align with other phy >>> drivers, but it does fix my bug. Without this patch, phylib does not >>> set the pause frame bits (10 and 11) in MII_ADVERTISE. >> >> And that's expected, but if your MAC does not act upon phydev->pause and >> phydev->asym_pause, then chances are that you can indeed lose packets >> every once in a while. > > Can you give me more details on that? I'm not really an expert on our > MAC, so I'm not sure how it's supposed to work. The MAC supports the > concept of "flow control". The non-upstream version of my driver reads > PHY register 0x11, which apparently is a PHY-specific status register > that says whether or not "transmit pause" and "receive pause" is > enabled. (The AT8031 datasheet is here, if you'd like to read it: > http://www.datasheet-pdf.com/datasheet/AtherosCommunications/734720/AR8031.pdf.html). How it works is that you read phydev->pause and phydev->asym_pause to know what your link partner advertises in terms of flow control/pause frame (they are the same thing here). Based on that, and your local settings (from ethtool -A) you advertise/enable flow control as well or don't do it, and the result is either both agree on what is being used, flow control is enabled, end-to-end, or it is not. It's exactly the same thing as speed, except that there is a possibility to do asymmetric things, where flow control is only enabled in one direction or the other.
The way this works is this: You transmit data, faster than your link partner can process it (busy receiving, loaded, you name it), it sends Pause frames back at you (see below for how this is typically implemented), your MAC gets these Pause frames back, which feed into the transmit logic, your TX completion interrupt gets signaled slower than the actual data rate you are pushing to your link partner, you call dev_kfree_skb() a little slower from your TX completion handler, the networking stack knows this means to back off, your transmission rate adjusts to your link partner's processing capability, no packets get lost. Conversely, if you were receiving a stream of packets that you have a hard time keeping up with, the part of your Ethernet MAC that fills system memory with packets should realize that it is producing them faster than you are consuming them, this triggers the flow control mechanism of your MAC receiver, and it sends Pause frames back at the link partner. > > > If both are enabled in the PHY, then the driver enables the same > features in the MAC. Unfortunately, I don't have any documentation that > explains what that really means. I've tried enabling and disabling this > feature in the MAC, and it makes no difference. You would think that if > the feature is disabled in both the MAC and the PHY, then no packets > would be lost, but that's not the case. The PHY does not participate in flow control, other than passing frames through (and your Atheros PHY may have a register to control that passthrough, but that would be surprising, could not find one).
> >> The part that is totally crappy about Pause frames and PHYLIB is that we >> need some information about whether this should be supported or not, >> such that we can change MII_ADVERTISE accordingly to reflect that, and >> what best way to do this than use one of these SUPPORTED_* bits to set >> that, except, that unlike speed (which is both a MAC and PHY property), >> Pause is exclusively MAC, yet, it transpires in PHYLIB. > > Then what do those bits in register 0x11 do? If I don't set them, I > lose packets, no matter how my MAC is programmed. This is just a status register, as you can see, all bits are read-only, it just aggregates the resolution of what is advertised and what is negotiated, which is not without value, but PHYLIB does the same intersection here. > > It's like that joke, "Doctor, if I hold my arm like this, then it > hurts!", "Well, then don't hold your arm like that." If I don't program > those PHY register bits, then my hardware doesn't work. > >> MACs that I work with for instance need to be told to ignore pause >> frames, or not ignore them, it all depends on what you want to >> advertise/support. > > That's the problem, I don't know what I "want", except that I want my > hardware to work. :-) 99% of my knowledge of the hardware comes from > scanning the source code of the internal version of the driver that 1) > does not support phylib, and 2) is hard-coded to initialize the Atheros > 8031 PHY. May I suggest reading about standards a bit more, or just looking at
[PATCH v2 0/6] add NS2 support to bgmac
Changes in v2:
* Remove the PHY power-on (per Andrew Lunn)
* Misc PHY clean-ups regarding comments and #defines (per Andrew Lunn)
  This results in none of the original PHY code from Vikas being present.
  So, I'm removing him as an author and giving him "Inspired-by" credit.
* Move PHY lane swapping to PHY driver (per Andrew Lunn and Florian Fainelli)
* Remove bgmac sleep (per Florian Fainelli)
* Re-add bgmac chip reset (per Florian Fainelli and Ray Jui)
* Rebased on latest net-next
* Added patch for bcm54xx_auxctl_read, which is used in the BCM54810

Add support for the amac found in the Broadcom Northstar2 SoC to the
bgmac driver. This necessitates adding support to connect to an
externally defined phy (as described in the device tree) in the driver.
These phy changes are in addition to the changes necessary to get NS2
working.

Jon Mason (6):
  net: phy: broadcom: add bcm54xx_auxctl_read
  net: phy: broadcom: Add BCM54810 PHY entry
  Documentation: devicetree: net: add NS2 bindings to amac
  net: ethernet: bgmac: device tree phy enablement
  net: ethernet: bgmac: add NS2 support
  arm64: dts: NS2: add AMAC ethernet support

 .../devicetree/bindings/net/brcm,amac.txt      |   5 +-
 arch/arm64/boot/dts/broadcom/ns2-svk.dts       |   5 ++
 arch/arm64/boot/dts/broadcom/ns2.dtsi          |  12 +++
 drivers/net/ethernet/broadcom/bgmac-bcma.c     |  48 ++
 drivers/net/ethernet/broadcom/bgmac-platform.c | 100 -
 drivers/net/ethernet/broadcom/bgmac.c          |  55 ++--
 drivers/net/ethernet/broadcom/bgmac.h          |   8 ++
 drivers/net/phy/Kconfig                        |   2 +-
 drivers/net/phy/broadcom.c                     |  68 +-
 include/linux/brcmphy.h                        |  11 +++
 10 files changed, 260 insertions(+), 54 deletions(-)

-- 
2.7.4
[PATCH v2 5/6] net: ethernet: bgmac: add NS2 support
Add support for the variant of amac hardware present in the Broadcom Northstar2 based SoCs. Northstar2 requires an additional register to be configured with the port speed/duplexity (NICPM). This can be added to the link callback to hide it from the instances that do not use this. Also, clearing of the pending interrupts on init is required due to observed issues on some platforms. Signed-off-by: Jon Mason --- drivers/net/ethernet/broadcom/bgmac-platform.c | 56 +- drivers/net/ethernet/broadcom/bgmac.c | 3 ++ drivers/net/ethernet/broadcom/bgmac.h | 1 + 3 files changed, 58 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bgmac-platform.c b/drivers/net/ethernet/broadcom/bgmac-platform.c index aed5dc5..f6d48c7 100644 --- a/drivers/net/ethernet/broadcom/bgmac-platform.c +++ b/drivers/net/ethernet/broadcom/bgmac-platform.c @@ -14,12 +14,21 @@ #define pr_fmt(fmt)KBUILD_MODNAME ": " fmt #include +#include #include #include #include #include #include "bgmac.h" +#define NICPM_IOMUX_CTRL 0x0008 + +#define NICPM_IOMUX_CTRL_INIT_VAL 0x3196e000 +#define NICPM_IOMUX_CTRL_SPD_SHIFT 10 +#define NICPM_IOMUX_CTRL_SPD_10M 0 +#define NICPM_IOMUX_CTRL_SPD_100M 1 +#define NICPM_IOMUX_CTRL_SPD_1000M 2 + static u32 platform_bgmac_read(struct bgmac *bgmac, u16 offset) { return readl(bgmac->plat.base + offset); @@ -87,12 +96,46 @@ static void platform_bgmac_cmn_maskset32(struct bgmac *bgmac, u16 offset, WARN_ON(1); } +static void bgmac_nicpm_speed_set(struct net_device *net_dev) +{ + struct bgmac *bgmac = netdev_priv(net_dev); + u32 val; + + if (!bgmac->plat.nicpm_base) + return; + + val = NICPM_IOMUX_CTRL_INIT_VAL; + switch (bgmac->net_dev->phydev->speed) { + default: + pr_err("Unsupported speed. 
Defaulting to 1000Mb\n"); + case SPEED_1000: + val |= NICPM_IOMUX_CTRL_SPD_1000M << NICPM_IOMUX_CTRL_SPD_SHIFT; + break; + case SPEED_100: + val |= NICPM_IOMUX_CTRL_SPD_100M << NICPM_IOMUX_CTRL_SPD_SHIFT; + break; + case SPEED_10: + val |= NICPM_IOMUX_CTRL_SPD_10M << NICPM_IOMUX_CTRL_SPD_SHIFT; + break; + } + + writel(val, bgmac->plat.nicpm_base + NICPM_IOMUX_CTRL); + + bgmac_adjust_link(bgmac->net_dev); +} + static int platform_phy_connect(struct bgmac *bgmac) { struct phy_device *phy_dev; - phy_dev = of_phy_get_and_connect(bgmac->net_dev, bgmac->dev->of_node, -bgmac_adjust_link); + if (bgmac->plat.nicpm_base) + phy_dev = of_phy_get_and_connect(bgmac->net_dev, +bgmac->dev->of_node, +bgmac_nicpm_speed_set); + else + phy_dev = of_phy_get_and_connect(bgmac->net_dev, +bgmac->dev->of_node, +bgmac_adjust_link); if (!phy_dev) { dev_err(bgmac->dev, "Phy connect failed\n"); return -ENODEV; @@ -182,6 +225,14 @@ static int bgmac_probe(struct platform_device *pdev) if (IS_ERR(bgmac->plat.idm_base)) return PTR_ERR(bgmac->plat.idm_base); + regs = platform_get_resource_byname(pdev, IORESOURCE_MEM, "nicpm_base"); + if (regs) { + bgmac->plat.nicpm_base = devm_ioremap_resource(&pdev->dev, + regs); + if (IS_ERR(bgmac->plat.nicpm_base)) + return PTR_ERR(bgmac->plat.nicpm_base); + } + bgmac->read = platform_bgmac_read; bgmac->write = platform_bgmac_write; bgmac->idm_read = platform_bgmac_idm_read; @@ -213,6 +264,7 @@ static int bgmac_remove(struct platform_device *pdev) static const struct of_device_id bgmac_of_enet_match[] = { {.compatible = "brcm,amac",}, {.compatible = "brcm,nsp-amac",}, + {.compatible = "brcm,ns2-amac",}, {}, }; diff --git a/drivers/net/ethernet/broadcom/bgmac.c b/drivers/net/ethernet/broadcom/bgmac.c index 4584958..a805cc8 100644 --- a/drivers/net/ethernet/broadcom/bgmac.c +++ b/drivers/net/ethernet/broadcom/bgmac.c @@ -1082,6 +1082,9 @@ static void bgmac_enable(struct bgmac *bgmac) /* http://bcm-v4.sipsolutions.net/mac-gbit/gmac/chipinit */ static void 
bgmac_chip_init(struct bgmac *bgmac) { + /* Clear any erroneously pending interrupts */ + bgmac_write(bgmac, BGMAC_INT_STATUS, ~0); + /* 1 interrupt per received frame */ bgmac_write(bgmac, BGMAC_INT_RECV_LAZY, 1 << BGMAC_IRL_FC_SHIFT); diff --git a/drivers/net/ethernet/broadcom/bgmac.h b/drivers/net/ethernet/broadcom/bgmac.h index ea52ac3..b1820ea 100644 --- a/drivers/net
[PATCH v2 2/6] net: phy: broadcom: Add BCM54810 PHY entry
The BCM54810 PHY requires some semi-unique configuration, which results in some additional configuration in addition to the standard config. Also, some users of the BCM54810 require the PHY lanes to be swapped. Since there is no way to detect this, add a device tree query to see if it is applicable. Inspired-by: Vikas Soni Signed-off-by: Jon Mason --- drivers/net/phy/Kconfig| 2 +- drivers/net/phy/broadcom.c | 58 +- include/linux/brcmphy.h| 10 3 files changed, 68 insertions(+), 2 deletions(-) diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig index 45f68ea..31967ca 100644 --- a/drivers/net/phy/Kconfig +++ b/drivers/net/phy/Kconfig @@ -217,7 +217,7 @@ config BROADCOM_PHY select BCM_NET_PHYLIB ---help--- Currently supports the BCM5411, BCM5421, BCM5461, BCM54616S, BCM5464, - BCM5481 and BCM5482 PHYs. + BCM5481, BCM54810 and BCM5482 PHYs. config CICADA_PHY tristate "Cicada PHYs" diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c index 3a64b3d..777ea29 100644 --- a/drivers/net/phy/broadcom.c +++ b/drivers/net/phy/broadcom.c @@ -18,7 +18,7 @@ #include #include #include - +#include #define BRCM_PHY_MODEL(phydev) \ ((phydev)->drv->phy_id & (phydev)->drv->phy_id_mask) @@ -45,6 +45,34 @@ static int bcm54xx_auxctl_write(struct phy_device *phydev, u16 regnum, u16 val) return phy_write(phydev, MII_BCM54XX_AUX_CTL, regnum | val); } +static int bcm54810_config(struct phy_device *phydev) +{ + int rc, val; + + val = bcm_phy_read_exp(phydev, BCM54810_EXP_BROADREACH_LRE_MISC_CTL); + val &= ~BCM54810_EXP_BROADREACH_LRE_MISC_CTL_EN; + rc = bcm_phy_write_exp(phydev, BCM54810_EXP_BROADREACH_LRE_MISC_CTL, + val); + if (rc < 0) + return rc; + + val = bcm54xx_auxctl_read(phydev, MII_BCM54XX_AUXCTL_SHDWSEL_MISC); + val &= ~MII_BCM54XX_AUXCTL_SHDWSEL_MISC_RGMII_SKEW_EN; + val |= MII_BCM54XX_AUXCTL_MISC_WREN; + rc = bcm54xx_auxctl_write(phydev, MII_BCM54XX_AUXCTL_SHDWSEL_MISC, + val); + if (rc < 0) + return rc; + + val = bcm_phy_read_shadow(phydev, 
BCM54810_SHD_CLK_CTL); + val &= ~BCM54810_SHD_CLK_CTL_GTXCLK_EN; + rc = bcm_phy_write_shadow(phydev, BCM54810_SHD_CLK_CTL, val); + if (rc < 0) + return rc; + + return 0; +} + /* Needs SMDSP clock enabled via bcm54xx_phydsp_config() */ static int bcm50610_a0_workaround(struct phy_device *phydev) { @@ -217,6 +245,12 @@ static int bcm54xx_config_init(struct phy_device *phydev) (phydev->dev_flags & PHY_BRCM_AUTO_PWRDWN_ENABLE)) bcm54xx_adjust_rxrefclk(phydev); + if (BRCM_PHY_MODEL(phydev) == PHY_ID_BCM54810) { + err = bcm54810_config(phydev); + if (err) + return err; + } + bcm54xx_phydsp_config(phydev); return 0; @@ -314,6 +348,7 @@ static int bcm5482_read_status(struct phy_device *phydev) static int bcm5481_config_aneg(struct phy_device *phydev) { + struct device_node *np = phydev->mdio.dev.of_node; int ret; /* Aneg firsly. */ @@ -344,6 +379,14 @@ static int bcm5481_config_aneg(struct phy_device *phydev) phy_write(phydev, 0x18, reg); } + if (of_property_read_bool(np, "brcm,enet-phy-lane-swap")) { + /* Lane Swap - Undocumented register...magic! 
*/ + ret = bcm_phy_write_exp(phydev, MII_BCM54XX_EXP_SEL_ER + 0x9, + 0x11B); + if (ret < 0) + return ret; + } + return ret; } @@ -578,6 +621,18 @@ static struct phy_driver broadcom_drivers[] = { .ack_interrupt = bcm_phy_ack_intr, .config_intr= bcm_phy_config_intr, }, { + .phy_id = PHY_ID_BCM54810, + .phy_id_mask= 0xfff0, + .name = "Broadcom BCM54810", + .features = PHY_GBIT_FEATURES | + SUPPORTED_Pause | SUPPORTED_Asym_Pause, + .flags = PHY_HAS_MAGICANEG | PHY_HAS_INTERRUPT, + .config_init= bcm54xx_config_init, + .config_aneg= bcm5481_config_aneg, + .read_status= genphy_read_status, + .ack_interrupt = bcm_phy_ack_intr, + .config_intr= bcm_phy_config_intr, +}, { .phy_id = PHY_ID_BCM5482, .phy_id_mask= 0xfff0, .name = "Broadcom BCM5482", @@ -661,6 +716,7 @@ static struct mdio_device_id __maybe_unused broadcom_tbl[] = { { PHY_ID_BCM54616S, 0xfff0 }, { PHY_ID_BCM5464, 0xfff0 }, { PHY_ID_BCM5481, 0xfff0 }, + { PHY_ID_BCM54810, 0xfff0 }, { PHY_ID_BCM5482, 0xfff0 }, { PHY_ID_BCM50610, 0xfff0 }, { PHY_ID_BCM50610M, 0xff
[PATCH v2 4/6] net: ethernet: bgmac: device tree phy enablement
Change the bgmac driver to allow for phy's defined by the device tree Signed-off-by: Jon Mason --- drivers/net/ethernet/broadcom/bgmac-bcma.c | 48 drivers/net/ethernet/broadcom/bgmac-platform.c | 48 +++- drivers/net/ethernet/broadcom/bgmac.c | 52 ++ drivers/net/ethernet/broadcom/bgmac.h | 7 4 files changed, 105 insertions(+), 50 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bgmac-bcma.c b/drivers/net/ethernet/broadcom/bgmac-bcma.c index c16ec3a..3e3efde 100644 --- a/drivers/net/ethernet/broadcom/bgmac-bcma.c +++ b/drivers/net/ethernet/broadcom/bgmac-bcma.c @@ -80,6 +80,50 @@ static void bcma_bgmac_cmn_maskset32(struct bgmac *bgmac, u16 offset, u32 mask, bcma_maskset32(bgmac->bcma.cmn, offset, mask, set); } +static int bcma_phy_connect(struct bgmac *bgmac) +{ + struct phy_device *phy_dev; + char bus_id[MII_BUS_ID_SIZE + 3]; + + /* Connect to the PHY */ + snprintf(bus_id, sizeof(bus_id), PHY_ID_FMT, bgmac->mii_bus->id, +bgmac->phyaddr); + phy_dev = phy_connect(bgmac->net_dev, bus_id, bgmac_adjust_link, + PHY_INTERFACE_MODE_MII); + if (IS_ERR(phy_dev)) { + dev_err(bgmac->dev, "PHY connecton failed\n"); + return PTR_ERR(phy_dev); + } + + return 0; +} + +static int bcma_phy_direct_connect(struct bgmac *bgmac) +{ + struct fixed_phy_status fphy_status = { + .link = 1, + .speed = SPEED_1000, + .duplex = DUPLEX_FULL, + }; + struct phy_device *phy_dev; + int err; + + phy_dev = fixed_phy_register(PHY_POLL, &fphy_status, -1, NULL); + if (!phy_dev || IS_ERR(phy_dev)) { + dev_err(bgmac->dev, "Failed to register fixed PHY device\n"); + return -ENODEV; + } + + err = phy_connect_direct(bgmac->net_dev, phy_dev, bgmac_adjust_link, +PHY_INTERFACE_MODE_MII); + if (err) { + dev_err(bgmac->dev, "Connecting PHY failed\n"); + return err; + } + + return err; +} + static const struct bcma_device_id bgmac_bcma_tbl[] = { BCMA_CORE(BCMA_MANUF_BCM, BCMA_CORE_4706_MAC_GBIT, BCMA_ANY_REV, BCMA_ANY_CLASS), @@ -275,6 +319,10 @@ static int bgmac_probe(struct bcma_device *core) 
bgmac->cco_ctl_maskset = bcma_bgmac_cco_ctl_maskset; bgmac->get_bus_clock = bcma_bgmac_get_bus_clock; bgmac->cmn_maskset32 = bcma_bgmac_cmn_maskset32; + if (bgmac->mii_bus) + bgmac->phy_connect = bcma_phy_connect; + else + bgmac->phy_connect = bcma_phy_direct_connect; err = bgmac_enet_probe(bgmac); if (err) diff --git a/drivers/net/ethernet/broadcom/bgmac-platform.c b/drivers/net/ethernet/broadcom/bgmac-platform.c index be52f27..aed5dc5 100644 --- a/drivers/net/ethernet/broadcom/bgmac-platform.c +++ b/drivers/net/ethernet/broadcom/bgmac-platform.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include "bgmac.h" @@ -86,6 +87,46 @@ static void platform_bgmac_cmn_maskset32(struct bgmac *bgmac, u16 offset, WARN_ON(1); } +static int platform_phy_connect(struct bgmac *bgmac) +{ + struct phy_device *phy_dev; + + phy_dev = of_phy_get_and_connect(bgmac->net_dev, bgmac->dev->of_node, +bgmac_adjust_link); + if (!phy_dev) { + dev_err(bgmac->dev, "Phy connect failed\n"); + return -ENODEV; + } + + return 0; +} + +static int platform_phy_direct_connect(struct bgmac *bgmac) +{ + struct fixed_phy_status fphy_status = { + .link = 1, + .speed = SPEED_1000, + .duplex = DUPLEX_FULL, + }; + struct phy_device *phy_dev; + int err; + + phy_dev = fixed_phy_register(PHY_POLL, &fphy_status, -1, NULL); + if (!phy_dev || IS_ERR(phy_dev)) { + dev_err(bgmac->dev, "Failed to register fixed PHY device\n"); + return -ENODEV; + } + + err = phy_connect_direct(bgmac->net_dev, phy_dev, bgmac_adjust_link, +PHY_INTERFACE_MODE_MII); + if (err) { + dev_err(bgmac->dev, "Connecting PHY failed\n"); + return err; + } + + return err; +} + static int bgmac_probe(struct platform_device *pdev) { struct device_node *np = pdev->dev.of_node; @@ -102,7 +143,6 @@ static int bgmac_probe(struct platform_device *pdev) /* Set the features of the 4707 family */ bgmac->feature_flags |= BGMAC_FEAT_CLKCTLST; bgmac->feature_flags |= BGMAC_FEAT_NO_RESET; - bgmac->feature_flags |= BGMAC_FEAT_FORCE_SPEED_2500; 
bgmac->feature_flags |= BGMAC_FEAT_CMDCFG_SR_REV4; bgmac->feature_flags |= BGMAC_FEAT_TX_MASK_SETUP
[PATCH v2 6/6] arm64: dts: NS2: add AMAC ethernet support
Add support for the AMAC ethernet to the Broadcom Northstar2 SoC device tree Signed-off-by: Jon Mason --- arch/arm64/boot/dts/broadcom/ns2-svk.dts | 5 + arch/arm64/boot/dts/broadcom/ns2.dtsi| 12 2 files changed, 17 insertions(+) diff --git a/arch/arm64/boot/dts/broadcom/ns2-svk.dts b/arch/arm64/boot/dts/broadcom/ns2-svk.dts index 2d7872a..9e5c3c9 100644 --- a/arch/arm64/boot/dts/broadcom/ns2-svk.dts +++ b/arch/arm64/boot/dts/broadcom/ns2-svk.dts @@ -56,6 +56,10 @@ }; }; +&enet { + status = "ok"; +}; + &pci_phy0 { status = "ok"; }; @@ -172,6 +176,7 @@ &mdio_mux_iproc { mdio@10 { gphy0: eth-phy@10 { + brcm,enet-phy-lane-swap; reg = <0x10>; }; }; diff --git a/arch/arm64/boot/dts/broadcom/ns2.dtsi b/arch/arm64/boot/dts/broadcom/ns2.dtsi index d95dc40..773ed59 100644 --- a/arch/arm64/boot/dts/broadcom/ns2.dtsi +++ b/arch/arm64/boot/dts/broadcom/ns2.dtsi @@ -191,6 +191,18 @@ #include "ns2-clock.dtsi" + enet: ethernet@6100 { + compatible = "brcm,ns2-amac"; + reg = <0x6100 0x1000>, + <0x6109 0x1000>, + <0x6103 0x100>; + reg-names = "amac_base", "idm_base", "nicpm_base"; + interrupts = ; + phy-handle = <&gphy0>; + phy-mode = "rgmii"; + status = "disabled"; + }; + dma0: dma@6136 { compatible = "arm,pl330", "arm,primecell"; reg = <0x6136 0x1000>; -- 2.7.4
[PATCH v2 3/6] Documentation: devicetree: net: add NS2 bindings to amac
Signed-off-by: Jon Mason --- Documentation/devicetree/bindings/net/brcm,amac.txt | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/Documentation/devicetree/bindings/net/brcm,amac.txt b/Documentation/devicetree/bindings/net/brcm,amac.txt index ba5ecc1..f2b194e 100644 --- a/Documentation/devicetree/bindings/net/brcm,amac.txt +++ b/Documentation/devicetree/bindings/net/brcm,amac.txt @@ -2,11 +2,12 @@ Broadcom AMAC Ethernet Controller Device Tree Bindings - Required properties: - - compatible: "brcm,amac" or "brcm,nsp-amac" + - compatible: "brcm,amac", "brcm,nsp-amac", or "brcm,ns2-amac" - reg:Address and length of the GMAC registers, Address and length of the GMAC IDM registers + Address and length of the NIC Port Manager registers (optional) - reg-names: Names of the registers. Must have both "amac_base" and - "idm_base" + "idm_base". "nicpm_base" is optional (required for NS2) - interrupts: Interrupt number Optional properties: -- 2.7.4
[PATCH v2 1/6] net: phy: broadcom: add bcm54xx_auxctl_read
Add a helper function to read the AUXCTL register for the BCM54xx. This
mirrors the bcm54xx_auxctl_write function already present in the code.

Signed-off-by: Jon Mason
---
 drivers/net/phy/broadcom.c | 10 ++
 include/linux/brcmphy.h    |  1 +
 2 files changed, 11 insertions(+)

diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c
index 583ef8a..3a64b3d 100644
--- a/drivers/net/phy/broadcom.c
+++ b/drivers/net/phy/broadcom.c
@@ -30,6 +30,16 @@ MODULE_DESCRIPTION("Broadcom PHY driver");
 MODULE_AUTHOR("Maciej W. Rozycki");
 MODULE_LICENSE("GPL");
 
+static int bcm54xx_auxctl_read(struct phy_device *phydev, u16 regnum)
+{
+	/* The register must be written to both the Shadow Register Select and
+	 * the Shadow Read Register Selector
+	 */
+	phy_write(phydev, MII_BCM54XX_AUX_CTL, regnum |
+		  regnum << MII_BCM54XX_AUXCTL_SHDWSEL_READ_SHIFT);
+	return phy_read(phydev, MII_BCM54XX_AUX_CTL);
+}
+
 static int bcm54xx_auxctl_write(struct phy_device *phydev, u16 regnum, u16 val)
 {
 	return phy_write(phydev, MII_BCM54XX_AUX_CTL, regnum | val);
diff --git a/include/linux/brcmphy.h b/include/linux/brcmphy.h
index 60def78..0ed6691 100644
--- a/include/linux/brcmphy.h
+++ b/include/linux/brcmphy.h
@@ -110,6 +110,7 @@
 #define MII_BCM54XX_AUXCTL_MISC_FORCE_AMDIX	0x0200
 #define MII_BCM54XX_AUXCTL_MISC_RDSEL_MISC	0x7000
 #define MII_BCM54XX_AUXCTL_SHDWSEL_MISC	0x0007
+#define MII_BCM54XX_AUXCTL_SHDWSEL_READ_SHIFT	12
 #define MII_BCM54XX_AUXCTL_SHDWSEL_MASK	0x0007
-- 
2.7.4
RE: [PATCH net-next] lan78xx: Use irq_domain for phy interrupt from USB Int. EP
> > +static void lan78xx_irq_mask(struct irq_data *irqd) > > +{ > > + struct irq_domain_data *data = irq_data_get_irq_chip_data(irqd); > > + struct lan78xx_net *dev = > > + container_of(data, struct lan78xx_net, domain_data); > > + u32 buf; > > + > > + lan78xx_read_reg(dev, INT_EP_CTL, &buf); > > lan78xx_read_reg() uses kmalloc() with GFP_KERNEL, while > irq_mask/irq_unmask can be called in atomic context AFAIR, you may need > to pass down a specific gfp_t to lan78xx_read_reg. > > What about usb_submit_urb(), can that work in atomic context? Do you > need to have lan78xx_read_reg() acquire a raw spinlock or something to > serialize them? > > > + buf &= ~INT_EP_PHY_INT_EN_; > > Even though you may have only one interrupt line to deal with at the > moment, better make this bit derived from irqd->hwirq instead of hard > coding it here. > > > + lan78xx_write_reg(dev, INT_EP_CTL, buf); > > +} > > + > > +static void lan78xx_irq_unmask(struct irq_data *irqd) > > +{ > > + struct irq_domain_data *data = irq_data_get_irq_chip_data(irqd); > > + struct lan78xx_net *dev = > > + container_of(data, struct lan78xx_net, domain_data); > > + u32 buf; > > + > > + lan78xx_read_reg(dev, INT_EP_CTL, &buf); > > + buf |= INT_EP_PHY_INT_EN_; > > Same here, this should come from irqd->hwirq. 
> > > + lan78xx_write_reg(dev, INT_EP_CTL, buf); > > +} > > + > > +static struct irq_chip lan78xx_irqchip = { > > + .name = "lan78xx-phyirq", > > + .irq_mask = lan78xx_irq_mask, > > + .irq_unmask = lan78xx_irq_unmask, > > +}; > > + > > +static int lan78xx_setup_irq_domain(struct lan78xx_net *dev) > > +{ > > + struct device_node *of_node; > > + struct irq_domain *irqdomain; > > + unsigned int irq_base = 0; > > + int ret = 0; > > + > > + of_node = dev->udev->dev.parent->of_node; > > + > > + dev->domain_data.irqchip = &lan78xx_irqchip; > > + dev->domain_data.irq_handler = handle_simple_irq; > > + > > + irqdomain = irq_domain_add_simple(of_node, 1, 0, > &chip_domain_ops, > > + &dev->domain_data); > > Is there really just one interrupt associated with this peripheral here? > > > > > + if (lan78xx_setup_irq_domain(dev) < 0) { > > + netdev_warn(dev->net, "lan78xx_setup_irq_domain() > failed"); > > + return -EIO; > > + } > > Any reason not to propagate the error code from > lan78xx_setup_irq_domain() here? Thanks for prompt review. I'll look into it and submit update. Thanks. - Woojung
Re: [PATCH net-next RFC WIP] Patch for XDP support for virtio_net
On Fri, 28 Oct 2016 13:35:02 -0700, Alexander Duyck wrote:
> > At the same time XDP doesn't require to use 4k buffer in something like
> > Netronome.
> > If xdp bpf program can be offloaded into HW with 1800 byte buffers, great!
>
> So are you saying this is only really meant to be used with a full bpf
> hardware offload then?

I think Alexei just meant to say I don't have to worry about providing 4k buffers when the program is offloaded, but on the host a page is the memory granularity for XDP...
[PATCH net] net: clear sk_err_soft in sk_clone_lock()
From: Eric Dumazet

At accept() time, it is possible the parent has a non zero sk_err_soft,
leftover from a prior error. Make sure we do not leave this value in the
child, as it makes future getsockopt(SO_ERROR) calls quite unreliable.

Signed-off-by: Eric Dumazet
---
 net/core/sock.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/sock.c b/net/core/sock.c
index d8e4532e89e7c28737c95c723e5f5b3d184a7805..662ccf1c40ed1b66ee253b063dcbfbd186deccee 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1543,6 +1543,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 		RCU_INIT_POINTER(newsk->sk_reuseport_cb, NULL);
 
 		newsk->sk_err = 0;
+		newsk->sk_err_soft = 0;
 		newsk->sk_priority = 0;
 		newsk->sk_incoming_cpu = raw_smp_processor_id();
 		atomic64_set(&newsk->sk_cookie, 0);
Re: [PATCH net-next RFC WIP] Patch for XDP support for virtio_net
On Fri, 28 Oct 2016 11:22:25 -0700, Alexei Starovoitov wrote: > On Fri, Oct 28, 2016 at 05:18:12PM +0100, Jakub Kicinski wrote: > > On Fri, 28 Oct 2016 08:56:35 -0700, John Fastabend wrote: > > > On 16-10-27 07:10 PM, David Miller wrote: > > > > From: Alexander Duyck > > > > Date: Thu, 27 Oct 2016 18:43:59 -0700 > > > > > > > >> On Thu, Oct 27, 2016 at 6:35 PM, David Miller > > > >> wrote: > > > >>> From: "Michael S. Tsirkin" > > > >>> Date: Fri, 28 Oct 2016 01:25:48 +0300 > > > >>> > > > On Thu, Oct 27, 2016 at 05:42:18PM -0400, David Miller wrote: > > > > From: "Michael S. Tsirkin" > > > > Date: Fri, 28 Oct 2016 00:30:35 +0300 > > > > > > > >> Something I'd like to understand is how does XDP address the > > > >> problem that 100Byte packets are consuming 4K of memory now. > > > > > > > > Via page pools. We're going to make a generic one, but right now > > > > each and every driver implements a quick list of pages to allocate > > > > from (and thus avoid the DMA man/unmap overhead, etc.) > > > > > > So to clarify, ATM virtio doesn't attempt to avoid dma map/unmap > > > so there should be no issue with that even when using sub/page > > > regions, assuming DMA APIs support sub-page map/unmap correctly. > > > >>> > > > >>> That's not what I said. > > > >>> > > > >>> The page pools are meant to address the performance degradation from > > > >>> going to having one packet per page for the sake of XDP's > > > >>> requirements. > > > >>> > > > >>> You still need to have one packet per page for correct XDP operation > > > >>> whether you do page pools or not, and whether you have DMA mapping > > > >>> (or it's equivalent virutalization operation) or not. > > > >> > > > >> Maybe I am missing something here, but why do you need to limit things > > > >> to one packet per page for correct XDP operation? 
Most of the drivers > > > >> out there now are usually storing something closer to at least 2 > > > >> packets per page, and with the DMA API fixes I am working on there > > > >> should be no issue with changing the contents inside those pages since > > > >> we won't invalidate or overwrite the data after the DMA buffer has > > > >> been synchronized for use by the CPU. > > > > > > > > Because with SKB's you can share the page with other packets. > > > > > > > > With XDP you simply cannot. > > > > > > > > It's software semantics that are the issue. SKB frag list pages > > > > are read only, XDP packets are writable. > > > > > > > > This has nothing to do with "writability" of the pages wrt. DMA > > > > mapping or cpu mappings. > > > > > > > > > > Sorry I'm not seeing it either. The current xdp_buff is defined > > > by, > > > > > > struct xdp_buff { > > > void *data; > > > void *data_end; > > > }; > > > > > > The verifier has an xdp_is_valid_access() check to ensure we don't go > > > past data_end. The page for now at least never leaves the driver. For > > > the work to get xmit to other devices working I'm still not sure I see > > > any issue. > > > > +1 > > > > Do we want to make the packet-per-page a requirement because it could > > be useful in the future from architectural standpoint? I guess there > > is a trade-off here between having the comfort of people following this > > requirement today and making driver support for XDP even more complex. > > It looks to me that packet per page makes drivers simpler instead of complex. > ixgbe split-page and mlx* order-3/5 tricks are definitely complex. > The skb truesize concerns come from the host when data is delivered to user > space and we need to have precise memory accounting for different applications > and different users. XDP is all root and imo it's far away from the days when > multi-user non-root issues start to pop up. > At the same time XDP doesn't require to use 4k buffer in something like > Netronome. 
> If xdp bpf program can be offloaded into HW with 1800 byte buffers, great! > For x86 cpu the 4k byte is a natural allocation chunk. Anything lower requires > delicate dma tricks paired with even more complex slab allocator and atomic > recnts. > All of that can only drive the performance down. > Comparing to kernel bypass xdp is using 4k pages whereas dpdk has > to use huge pages, so xdp is saving a ton of memory already! Thanks a lot for explaining! I was not aware of the user space delivery and precise memory accounting requirements. My understanding was that we would be driving toward some common implementation of the recycle tricks either as part of the frag allocator or some new mechanism of handing out mapped pages. Basically moving the allocation strategy decision further up the stack. I'm not actually worried about the offload, full disclosure, I have a XDP implementation based on the frag allocator :) (still pending internal review), so if packet-per-page is the way forward then I'll have revise to fli
Re: [PATCH net-next RFC WIP] Patch for XDP support for virtio_net
On Fri, Oct 28, 2016 at 11:22 AM, Alexei Starovoitov wrote: > On Fri, Oct 28, 2016 at 05:18:12PM +0100, Jakub Kicinski wrote: >> On Fri, 28 Oct 2016 08:56:35 -0700, John Fastabend wrote: >> > On 16-10-27 07:10 PM, David Miller wrote: >> > > From: Alexander Duyck >> > > Date: Thu, 27 Oct 2016 18:43:59 -0700 >> > > >> > >> On Thu, Oct 27, 2016 at 6:35 PM, David Miller >> > >> wrote: >> > >>> From: "Michael S. Tsirkin" >> > >>> Date: Fri, 28 Oct 2016 01:25:48 +0300 >> > >>> >> > On Thu, Oct 27, 2016 at 05:42:18PM -0400, David Miller wrote: >> > > From: "Michael S. Tsirkin" >> > > Date: Fri, 28 Oct 2016 00:30:35 +0300 >> > > >> > >> Something I'd like to understand is how does XDP address the >> > >> problem that 100Byte packets are consuming 4K of memory now. >> > > >> > > Via page pools. We're going to make a generic one, but right now >> > > each and every driver implements a quick list of pages to allocate >> > > from (and thus avoid the DMA man/unmap overhead, etc.) >> > >> > So to clarify, ATM virtio doesn't attempt to avoid dma map/unmap >> > so there should be no issue with that even when using sub/page >> > regions, assuming DMA APIs support sub-page map/unmap correctly. >> > >>> >> > >>> That's not what I said. >> > >>> >> > >>> The page pools are meant to address the performance degradation from >> > >>> going to having one packet per page for the sake of XDP's >> > >>> requirements. >> > >>> >> > >>> You still need to have one packet per page for correct XDP operation >> > >>> whether you do page pools or not, and whether you have DMA mapping >> > >>> (or it's equivalent virutalization operation) or not. >> > >> >> > >> Maybe I am missing something here, but why do you need to limit things >> > >> to one packet per page for correct XDP operation? 
Most of the drivers >> > >> out there now are usually storing something closer to at least 2 >> > >> packets per page, and with the DMA API fixes I am working on there >> > >> should be no issue with changing the contents inside those pages since >> > >> we won't invalidate or overwrite the data after the DMA buffer has >> > >> been synchronized for use by the CPU. >> > > >> > > Because with SKB's you can share the page with other packets. >> > > >> > > With XDP you simply cannot. >> > > >> > > It's software semantics that are the issue. SKB frag list pages >> > > are read only, XDP packets are writable. >> > > >> > > This has nothing to do with "writability" of the pages wrt. DMA >> > > mapping or cpu mappings. >> > > >> > >> > Sorry I'm not seeing it either. The current xdp_buff is defined >> > by, >> > >> > struct xdp_buff { >> > void *data; >> > void *data_end; >> > }; >> > >> > The verifier has an xdp_is_valid_access() check to ensure we don't go >> > past data_end. The page for now at least never leaves the driver. For >> > the work to get xmit to other devices working I'm still not sure I see >> > any issue. >> >> +1 >> >> Do we want to make the packet-per-page a requirement because it could >> be useful in the future from architectural standpoint? I guess there >> is a trade-off here between having the comfort of people following this >> requirement today and making driver support for XDP even more complex. > > It looks to me that packet per page makes drivers simpler instead of complex. > ixgbe split-page and mlx* order-3/5 tricks are definitely complex. > The skb truesize concerns come from the host when data is delivered to user > space and we need to have precise memory accounting for different applications > and different users. XDP is all root and imo it's far away from the days when > multi-user non-root issues start to pop up. Right but having XDP require 4K pages is going to hurt performance for user space when we are using sockets. 
We cannot justify killing application performance just because we want to support XDP, and having to alloc new memory and memcpy out of the buffers isn't going to work as a viable workaround for this either. > At the same time XDP doesn't require to use 4k buffer in something like > Netronome. > If xdp bpf program can be offloaded into HW with 1800 byte buffers, great! So are you saying this is only really meant to be used with a full bpf hardware offload then? > For x86 cpu the 4k byte is a natural allocation chunk. Anything lower requires > delicate dma tricks paired with even more complex slab allocator and atomic > recnts. > All of that can only drive the performance down. > Comparing to kernel bypass xdp is using 4k pages whereas dpdk has > to use huge pages, so xdp is saving a ton of memory already! Do you really think the page pool Jesper is talking about doing is any different? Half the reason why I have to implement the DMA API changes that I am are so that the page pool can unmap a page if the device decides to cut it from the pool without invalidating the data written by the CPU. If anything I think we end up
Re: [PATCH] [rfc, netfilter-next] netfilter: nf_tables: fib warnings
On Friday, October 28, 2016 6:21:49 PM CEST Florian Westphal wrote: > Good point. In case oif is NULL we don't have to search the result > list for a match anyway, so we could do this (not even build tested): > It didn't apply cleanly, but I've integrated it with the change to initialize oif to NULL and the added #ifdef I had in my first version and got a clean build. I sent out a v2 now, but didn't try hard to understand your changes. Arnd
[PATCH] [v2 netfilter-next] netfilter: nf_tables: fib warnings
The newly added nft fib code produces two warnings:

net/ipv4/netfilter/nft_fib_ipv4.c: In function 'nft_fib4_eval':
net/ipv4/netfilter/nft_fib_ipv4.c:80:6: error: unused variable 'i' [-Werror=unused-variable]
net/ipv4/netfilter/nft_fib_ipv4.c: In function 'nft_fib4_eval':
net/ipv4/netfilter/nft_fib_ipv4.c:137:6: error: 'oif' may be used uninitialized in this function [-Werror=maybe-uninitialized]

The first one is obvious, as the only user of that variable is inside
of an #ifdef.

The second one is a bit trickier. It's clear that oif is in fact
uninitialized when it gets used when neither NFTA_FIB_F_IIF nor
NFTA_FIB_F_OIF are set, and just setting it to NULL won't work, as it
may later get dereferenced. However, there is no need to search the
result list if it is NULL, as Florian pointed out. This integrates his
(untested) change to do so. I have confirmed that the combined patch
solves both warnings, but as I don't fully understand Florian's change,
I can't tell if it's correct.

Suggested-by: Florian Westphal
Fixes: 84f5eedb983e ("netfilter: nf_tables: add fib expression")
Signed-off-by: Arnd Bergmann
---
v2: integrate changes that Florian suggested
---
 net/ipv4/netfilter/nft_fib_ipv4.c | 23 ++-
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/netfilter/nft_fib_ipv4.c b/net/ipv4/netfilter/nft_fib_ipv4.c
index 6787c563cfc9..db91fd42db67 100644
--- a/net/ipv4/netfilter/nft_fib_ipv4.c
+++ b/net/ipv4/netfilter/nft_fib_ipv4.c
@@ -77,7 +77,9 @@ void nft_fib4_eval(const struct nft_expr *expr, struct nft_regs *regs,
 	};
 	const struct net_device *oif;
 	struct net_device *found;
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
 	int i;
+#endif
 
 	/*
 	 * Do not set flowi4_oif, it restricts results (for example, asking
@@ -90,6 +92,8 @@ void nft_fib4_eval(const struct nft_expr *expr, struct nft_regs *regs,
 		oif = pkt->out;
 	else if (priv->flags & NFTA_FIB_F_IIF)
 		oif = pkt->in;
+	else
+		oif = NULL;
 
 	if (pkt->hook == NF_INET_PRE_ROUTING && fib4_is_local(pkt->skb)) {
 		nft_fib_store_result(dest, priv->result, pkt, LOOPBACK_IFINDEX);
@@ -130,6 +134,11 @@ void nft_fib4_eval(const struct nft_expr *expr, struct nft_regs *regs,
 			break;
 	}
 
+	if (!oif) {
+		found = FIB_RES_DEV(res);
+		goto ok;
+	}
+
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 	for (i = 0; i < res.fi->fib_nhs; i++) {
 		struct fib_nh *nh = &res.fi->fib_nh[i];
@@ -139,16 +148,12 @@ void nft_fib4_eval(const struct nft_expr *expr, struct nft_regs *regs,
 			goto ok;
 		}
 	}
-#endif
-	if (priv->flags & NFTA_FIB_F_OIF) {
-		found = FIB_RES_DEV(res);
-		if (found == oif)
-			goto ok;
-		return;
-	}
-
-	*dest = FIB_RES_DEV(res)->ifindex;
 	return;
+#else
+	found = FIB_RES_DEV(res);
+	if (found != oif)
+		return;
+#endif
 ok:
 	switch (priv->result) {
 	case NFT_FIB_RESULT_OIF:
-- 
2.9.0
Re: [PATCH] net: phy: at803x: the Atheros 8031 supports pause frames
Florian Fainelli wrote:
> On 10/27/2016 03:24 PM, Timur Tabi wrote:
>> Florian Fainelli wrote:
>>> Hu? In my experience that should not come from supporting Pause
>>> frames or not, but rather properly configuring a (RG)MII delay, but
>>> your mileage may vary.
>>
>> I can assure you, I'm more confused than you. I've been working on
>> this for almost two weeks, and not only does this patch align with
>> other phy drivers, but it does fix my bug. Without this patch, phylib
>> does not set the pause frame bits (10 and 11) in MII_ADVERTISE.
>
> And that's expected, but if your MAC does not act upon phydev->pause
> and phydev->asym_pause, then chances are that you can indeed lose
> packets every once in a while.

Can you give me more details on that? I'm not really an expert on our
MAC, so I'm not sure how it's supposed to work.

The MAC supports the concept of "flow control". The non-upstream version
of my driver reads PHY register 0x11, which apparently is a PHY-specific
status register that says whether or not "transmit pause" and "receive
pause" is enabled. (The AT8031 datasheet is here, if you'd like to read
it:
http://www.datasheet-pdf.com/datasheet/AtherosCommunications/734720/AR8031.pdf.html).
If both are enabled in the PHY, then the driver enables the same
features in the MAC. Unfortunately, I don't have any documentation that
explains what that really means.

I've tried enabling and disabling this feature in the MAC, and it makes
no difference. You would think that if the feature is disabled in both
the MAC and the PHY, then no packets would be lost, but that's not the
case.

> The part that is totally crappy about Pause frames and PHYLIB is that
> we need some information about whether this should be supported or
> not, such that we can change MII_ADVERTISE accordingly to reflect
> that, and what best way to do this than use one of these SUPPORTED_*
> bits to set that, except, that unlike speed (which is both a MAC and
> PHY property), Pause is exclusively MAC, yet, it transpires in PHYLIB.
Then what do those bits in register 0x11 do? If I don't set them, I lose
packets, no matter how my MAC is programmed. It's like that joke,
"Doctor, if I hold my arm like this, then it hurts!", "Well, then don't
hold your arm like that." If I don't program those PHY register bits,
then my hardware doesn't work.

MACs that I work with for instance need to be told to ignore pause
frames, or not ignore them, it all depends on what you want to
advertise/support.

That's the problem, I don't know what I "want", except that I want my
hardware to work. :-) 99% of my knowledge of the hardware comes from
scanning the source code of the internal version of the driver that 1)
does not support phylib, and 2) is hard-coded to initialize the Atheros
8031 PHY.

It does not, support for Pause frames is a MAC-level feature, yet,
PHYLIB (and that's been on my todo for a while now) insists on reporting
the confusing phydev->pause and phydev->asym_pause, which really is what
has been determined from auto-negotiating with your partner, as opposed
to being a supported thing or not. The PHY has absolutely no business in
that.

But there are pause frame bits in the MII_ADVERTISE register, and this
setting directly impacts whether those bits are set. I don't see how
this is a MAC-level feature.

This is a MAC level feature, that needs to be auto-negotiated with your
link partner, which is why there is room for this in MII_ADVERTISE, but
the PHY absolutely does not participate in Pause frames, other than
passing them through to the MAC, like normal frames. Whether your MAC
acts upon that or not is a MAC dependent feature.

Ok, to me that means that the PHY driver must tell phylib whether or not
it supports pause frames. The question becomes:

1) Is my patch correct?
2) Is my patch necessary?
3) Is my patch sufficient?

1) I believe my patch is correct, because many other PHY drivers do the
same thing for the same reason.
If the PHY supports register 4 bits 10 and 11, then those SUPPORTED_xxx
bits should be set.

2) I don't know why my patch appears to be necessary, because I don't
understand why a 1Gb NIC on a 2GHz processor drops frames. I suspect
the problem is not performance but rather a mis-programming.

3) Is my patch sufficient? The internal driver always enables pause
frame support in the PHY if the PHY says that pause frames are enabled.
In fact, I can't even turn off flow control in the internal driver:

$ sudo ethtool -A eth1 tx off rx off
$ sudo ethtool -a eth1
Pause parameters for eth1:
Autonegotiate:	on
RX:		on
TX:		on

The driver insists on leaving flow control enabled if autonegotiation
is enabled.

Yes I am sure you should set these bits, the documentation is not
necessarily the authoritative source here (although it should be). What
was probably meant to be written is something like: don't set any bits
that are not defined in ethtool.h, i.e.: do not use this to store
persistent st
Re: [PATCH net 3/3] sctp: hold transport instead of assoc when lookup assoc in rx path
On Fri, Oct 28, 2016 at 06:10:54PM +0800, Xin Long wrote: > Prior to this patch, in rx path, before calling lock_sock, it needed to > hold assoc when got it by __sctp_lookup_association, in case other place > would free/put assoc. > > But in __sctp_lookup_association, it lookup and hold transport, then got > assoc by transport->assoc, then hold assoc and put transport. It means > it didn't hold transport, yet it was returned and later on directly > assigned to chunk->transport. > > Without the protection of sock lock, the transport may be freed/put by > other places, which would cause a use-after-free issue. > > This patch is to fix this issue by holding transport instead of assoc. > As holding transport can make sure to access assoc is also safe, and > actually it looks up assoc by searching transport rhashtable, to hold > transport here makes more sense. > > Note that the function will be renamed later on on another patch. > > Signed-off-by: Xin Long Acked-by: Marcelo Ricardo Leitner > --- > include/net/sctp/sctp.h | 2 +- > net/sctp/input.c| 32 > net/sctp/ipv6.c | 2 +- > 3 files changed, 18 insertions(+), 18 deletions(-) > > diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h > index 87a7f42..31acc3f 100644 > --- a/include/net/sctp/sctp.h > +++ b/include/net/sctp/sctp.h > @@ -152,7 +152,7 @@ void sctp_unhash_endpoint(struct sctp_endpoint *); > struct sock *sctp_err_lookup(struct net *net, int family, struct sk_buff *, >struct sctphdr *, struct sctp_association **, >struct sctp_transport **); > -void sctp_err_finish(struct sock *, struct sctp_association *); > +void sctp_err_finish(struct sock *, struct sctp_transport *); > void sctp_icmp_frag_needed(struct sock *, struct sctp_association *, > struct sctp_transport *t, __u32 pmtu); > void sctp_icmp_redirect(struct sock *, struct sctp_transport *, > diff --git a/net/sctp/input.c b/net/sctp/input.c > index 8e0bc58..a01a56e 100644 > --- a/net/sctp/input.c > +++ b/net/sctp/input.c > @@ -181,9 +181,10 @@ int 
sctp_rcv(struct sk_buff *skb) >* bound to another interface, via SO_BINDTODEVICE, treat it as OOTB >*/ > if (sk->sk_bound_dev_if && (sk->sk_bound_dev_if != af->skb_iif(skb))) { > - if (asoc) { > - sctp_association_put(asoc); > + if (transport) { > + sctp_transport_put(transport); > asoc = NULL; > + transport = NULL; > } else { > sctp_endpoint_put(ep); > ep = NULL; > @@ -269,8 +270,8 @@ int sctp_rcv(struct sk_buff *skb) > bh_unlock_sock(sk); > > /* Release the asoc/ep ref we took in the lookup calls. */ > - if (asoc) > - sctp_association_put(asoc); > + if (transport) > + sctp_transport_put(transport); > else > sctp_endpoint_put(ep); > > @@ -283,8 +284,8 @@ int sctp_rcv(struct sk_buff *skb) > > discard_release: > /* Release the asoc/ep ref we took in the lookup calls. */ > - if (asoc) > - sctp_association_put(asoc); > + if (transport) > + sctp_transport_put(transport); > else > sctp_endpoint_put(ep); > > @@ -300,6 +301,7 @@ int sctp_backlog_rcv(struct sock *sk, struct sk_buff *skb) > { > struct sctp_chunk *chunk = SCTP_INPUT_CB(skb)->chunk; > struct sctp_inq *inqueue = &chunk->rcvr->inqueue; > + struct sctp_transport *t = chunk->transport; > struct sctp_ep_common *rcvr = NULL; > int backloged = 0; > > @@ -351,7 +353,7 @@ int sctp_backlog_rcv(struct sock *sk, struct sk_buff *skb) > done: > /* Release the refs we took in sctp_add_backlog */ > if (SCTP_EP_TYPE_ASSOCIATION == rcvr->type) > - sctp_association_put(sctp_assoc(rcvr)); > + sctp_transport_put(t); > else if (SCTP_EP_TYPE_SOCKET == rcvr->type) > sctp_endpoint_put(sctp_ep(rcvr)); > else > @@ -363,6 +365,7 @@ int sctp_backlog_rcv(struct sock *sk, struct sk_buff *skb) > static int sctp_add_backlog(struct sock *sk, struct sk_buff *skb) > { > struct sctp_chunk *chunk = SCTP_INPUT_CB(skb)->chunk; > + struct sctp_transport *t = chunk->transport; > struct sctp_ep_common *rcvr = chunk->rcvr; > int ret; > > @@ -373,7 +376,7 @@ static int sctp_add_backlog(struct sock *sk, struct > sk_buff *skb) >* from us >*/ > if 
(SCTP_EP_TYPE_ASSOCIATION == rcvr->type) > - sctp_association_hold(sctp_assoc(rcvr)); > + sctp_transport_hold(t); > else if (SCTP_EP_TYPE_SOCKET == rcvr->type) > sctp_endpoint_hold(sctp_ep(rcvr)); > else > @@ -537,15 +540,15 @@ struct sock *sctp_err_lookup(struct net *net, int > fami
[patch net-next 16/16] mlxsw: switchib: Introduce SwitchIB and SwitchIB silicon driver
From: Elad Raz

SwitchIB and SwitchIB-2 are Infiniband switches with up to 36 ports.
This driver initializes the hardware and firmware, which implement the
IB management and the connection with the SM.

Signed-off-by: Elad Raz
Signed-off-by: Jiri Pirko
---
 drivers/net/ethernet/mellanox/mlxsw/Kconfig    | 11 +
 drivers/net/ethernet/mellanox/mlxsw/Makefile   |  2 +
 drivers/net/ethernet/mellanox/mlxsw/pci.h      |  2 +
 drivers/net/ethernet/mellanox/mlxsw/port.h     |  3 +
 drivers/net/ethernet/mellanox/mlxsw/switchib.c | 598 +
 5 files changed, 616 insertions(+)
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/switchib.c

diff --git a/drivers/net/ethernet/mellanox/mlxsw/Kconfig b/drivers/net/ethernet/mellanox/mlxsw/Kconfig
index d0bcf59..bac2e5e 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlxsw/Kconfig
@@ -29,6 +29,17 @@ config MLXSW_PCI
 	  To compile this driver as a module, choose M here: the
 	  module will be called mlxsw_pci.
 
+config MLXSW_SWITCHIB
+	tristate "Mellanox Technologies SwitchIB and SwitchIB-2 support"
+	depends on MLXSW_CORE && NET_SWITCHDEV
+	default m
+	---help---
+	  This driver supports Mellanox Technologies SwitchIB and SwitchIB-2
+	  Infiniband Switch ASICs.
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called mlxsw_switchib.
+ config MLXSW_SWITCHX2 tristate "Mellanox Technologies SwitchX-2 support" depends on MLXSW_CORE && MLXSW_PCI && NET_SWITCHDEV diff --git a/drivers/net/ethernet/mellanox/mlxsw/Makefile b/drivers/net/ethernet/mellanox/mlxsw/Makefile index d20ae18..badef87 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/Makefile +++ b/drivers/net/ethernet/mellanox/mlxsw/Makefile @@ -3,6 +3,8 @@ mlxsw_core-objs := core.o mlxsw_core-$(CONFIG_MLXSW_CORE_HWMON) += core_hwmon.o obj-$(CONFIG_MLXSW_PCI)+= mlxsw_pci.o mlxsw_pci-objs := pci.o +obj-$(CONFIG_MLXSW_SWITCHIB) += mlxsw_switchib.o +mlxsw_switchib-objs:= switchib.o obj-$(CONFIG_MLXSW_SWITCHX2) += mlxsw_switchx2.o mlxsw_switchx2-objs:= switchx2.o obj-$(CONFIG_MLXSW_SPECTRUM) += mlxsw_spectrum.o diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.h b/drivers/net/ethernet/mellanox/mlxsw/pci.h index 35a0011..d655823 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/pci.h +++ b/drivers/net/ethernet/mellanox/mlxsw/pci.h @@ -39,6 +39,8 @@ #define PCI_DEVICE_ID_MELLANOX_SWITCHX20xc738 #define PCI_DEVICE_ID_MELLANOX_SPECTRUM0xcb84 +#define PCI_DEVICE_ID_MELLANOX_SWITCHIB0xcb20 +#define PCI_DEVICE_ID_MELLANOX_SWITCHIB2 0xcf08 #if IS_ENABLED(CONFIG_MLXSW_PCI) diff --git a/drivers/net/ethernet/mellanox/mlxsw/port.h b/drivers/net/ethernet/mellanox/mlxsw/port.h index 7296071..3d42146 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/port.h +++ b/drivers/net/ethernet/mellanox/mlxsw/port.h @@ -52,6 +52,9 @@ #define MLXSW_PORT_MAX_PHY_PORTS 0x40 #define MLXSW_PORT_MAX_PORTS (MLXSW_PORT_MAX_PHY_PORTS + 1) +#define MLXSW_PORT_MAX_IB_PHY_PORTS36 +#define MLXSW_PORT_MAX_IB_PORTS(MLXSW_PORT_MAX_IB_PHY_PORTS + 1) + #define MLXSW_PORT_DEVID_BITS_OFFSET 10 #define MLXSW_PORT_PHY_BITS_OFFSET 4 #define MLXSW_PORT_PHY_BITS_MASK (MLXSW_PORT_MAX_PHY_PORTS - 1) diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchib.c b/drivers/net/ethernet/mellanox/mlxsw/switchib.c new file mode 100644 index 000..ec0b27e --- /dev/null +++ 
b/drivers/net/ethernet/mellanox/mlxsw/switchib.c @@ -0,0 +1,598 @@ +/* + * drivers/net/ethernet/mellanox/mlxsw/switchib.c + * Copyright (c) 2016 Mellanox Technologies. All rights reserved. + * Copyright (c) 2016 Elad Raz + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright + *notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + *notice, this list of conditions and the following disclaimer in the + *documentation and/or other materials provided with the distribution. + * 3. Neither the names of the copyright holders nor the names of its + *contributors may be used to endorse or promote products derived from + *this software without specific prior written permission. + * + * Alternatively, this software may be distributed under the terms of the + * GNU General Public License ("GPL") version 2 as published by the Free + * Software Foundation. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTI
[patch net-next 15/16] mlxsw: switchx2: Add IB port support
From: Elad Raz

SwitchX-2 is an IB-capable device. This patch adds support for changing
the port type between Ethernet and Infiniband. When the port is set to
IB, the FW implements the Subnet Management Agent (SMA) that manages
the port. All port attributes can be controlled remotely by the SM.

Usage:

$ devlink port show
pci/:03:00.0/1: type eth netdev eth0
pci/:03:00.0/3: type eth netdev eth1
pci/:03:00.0/5: type eth netdev eth2
pci/:03:00.0/6: type eth netdev eth3
pci/:03:00.0/8: type eth netdev eth4
$ devlink port set pci/:03:00.0/1 type ib
$ devlink port show
pci/:03:00.0/1: type ib

Signed-off-by: Elad Raz
Reviewed-by: Jiri Pirko
Signed-off-by: Jiri Pirko
---
 drivers/net/ethernet/mellanox/mlxsw/ib.h       |  39 +
 drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 229 +++--
 2 files changed, 249 insertions(+), 19 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/ib.h

diff --git a/drivers/net/ethernet/mellanox/mlxsw/ib.h b/drivers/net/ethernet/mellanox/mlxsw/ib.h
new file mode 100644
index 000..ce313aa
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlxsw/ib.h
@@ -0,0 +1,39 @@
+/*
+ * drivers/net/ethernet/mellanox/mlxsw/ib.h
+ * Copyright (c) 2016 Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2016 Elad Raz
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the names of the copyright holders nor the names of its
+ *    contributors may be used to endorse or promote products derived from
+ *    this software without specific prior written permission.
+ * + * Alternatively, this software may be distributed under the terms of the + * GNU General Public License ("GPL") version 2 as published by the Free + * Software Foundation. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS + * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN + * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + */ +#ifndef _MLXSW_IB_H +#define _MLXSW_IB_H + +#define MLXSW_IB_DEFAULT_MTU 4096 + +#endif /* _MLXSW_IB_H */ diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c index 3aa0948..5208764 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c +++ b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c @@ -3,7 +3,7 @@ * Copyright (c) 2015 Mellanox Technologies. All rights reserved. 
* Copyright (c) 2015 Jiri Pirko * Copyright (c) 2015 Ido Schimmel - * Copyright (c) 2015 Elad Raz + * Copyright (c) 2015-2016 Elad Raz * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: @@ -53,6 +53,7 @@ #include "port.h" #include "trap.h" #include "txheader.h" +#include "ib.h" static const char mlxsw_sx_driver_name[] = "mlxsw_switchx2"; static const char mlxsw_sx_driver_version[] = "1.0"; @@ -218,14 +219,14 @@ static int mlxsw_sx_port_oper_status_get(struct mlxsw_sx_port *mlxsw_sx_port, return 0; } -static int mlxsw_sx_port_mtu_set(struct mlxsw_sx_port *mlxsw_sx_port, u16 mtu) +static int __mlxsw_sx_port_mtu_set(struct mlxsw_sx_port *mlxsw_sx_port, + u16 mtu) { struct mlxsw_sx *mlxsw_sx = mlxsw_sx_port->mlxsw_sx; char pmtu_pl[MLXSW_REG_PMTU_LEN]; int max_mtu; int err; - mtu += MLXSW_TXHDR_LEN + ETH_HLEN; mlxsw_reg_pmtu_pack(pmtu_pl, mlxsw_sx_port->local_port, 0); err = mlxsw_reg_query(mlxsw_sx->core, MLXSW_REG(pmtu), pmtu_pl); if (err) @@ -239,6 +240,32 @@ static int mlxsw_sx_port_mtu_set(struct mlxsw_sx_port *mlxsw_sx_port, u16 mtu) return mlxsw_reg_write(mlxsw_sx->core, MLXSW_REG(pmtu), pmtu_pl); } +static int mlxsw_sx_port_mtu_eth_set(struct mlxsw_sx_port *mlxsw_sx_port, +
[patch net-next 14/16] mlxsw: switchx2: Add eth prefix to port create and remove
From: Elad Raz Since we are about to add Infiniband port remove and create we will add "eth" prefix to port create and remove APIs. Signed-off-by: Elad Raz Reviewed-by: Jiri Pirko Signed-off-by: Jiri Pirko --- drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 22 +++--- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c index d110e3f..3aa0948 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c +++ b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c @@ -994,8 +994,8 @@ mlxsw_sx_port_mac_learning_mode_set(struct mlxsw_sx_port *mlxsw_sx_port, return mlxsw_reg_write(mlxsw_sx->core, MLXSW_REG(spmlr), spmlr_pl); } -static int __mlxsw_sx_port_create(struct mlxsw_sx *mlxsw_sx, u8 local_port, - u8 module, u8 width) +static int __mlxsw_sx_port_eth_create(struct mlxsw_sx *mlxsw_sx, u8 local_port, + u8 module, u8 width) { struct mlxsw_sx_port *mlxsw_sx_port; struct net_device *dev; @@ -1119,8 +1119,8 @@ static int __mlxsw_sx_port_create(struct mlxsw_sx *mlxsw_sx, u8 local_port, return err; } -static int mlxsw_sx_port_create(struct mlxsw_sx *mlxsw_sx, u8 local_port, - u8 module, u8 width) +static int mlxsw_sx_port_eth_create(struct mlxsw_sx *mlxsw_sx, u8 local_port, + u8 module, u8 width) { int err; @@ -1130,7 +1130,7 @@ static int mlxsw_sx_port_create(struct mlxsw_sx *mlxsw_sx, u8 local_port, local_port); return err; } - err = __mlxsw_sx_port_create(mlxsw_sx, local_port, module, width); + err = __mlxsw_sx_port_eth_create(mlxsw_sx, local_port, module, width); if (err) goto err_port_create; @@ -1141,7 +1141,7 @@ static int mlxsw_sx_port_create(struct mlxsw_sx *mlxsw_sx, u8 local_port, return err; } -static void __mlxsw_sx_port_remove(struct mlxsw_sx *mlxsw_sx, u8 local_port) +static void __mlxsw_sx_port_eth_remove(struct mlxsw_sx *mlxsw_sx, u8 local_port) { struct mlxsw_sx_port *mlxsw_sx_port = mlxsw_sx->ports[local_port]; @@ -1153,9 +1153,9 @@ static void 
__mlxsw_sx_port_remove(struct mlxsw_sx *mlxsw_sx, u8 local_port) free_netdev(mlxsw_sx_port->dev); } -static void mlxsw_sx_port_remove(struct mlxsw_sx *mlxsw_sx, u8 local_port) +static void mlxsw_sx_port_eth_remove(struct mlxsw_sx *mlxsw_sx, u8 local_port) { - __mlxsw_sx_port_remove(mlxsw_sx, local_port); + __mlxsw_sx_port_eth_remove(mlxsw_sx, local_port); mlxsw_core_port_fini(mlxsw_sx->core, local_port); } @@ -1170,7 +1170,7 @@ static void mlxsw_sx_ports_remove(struct mlxsw_sx *mlxsw_sx) for (i = 1; i < MLXSW_PORT_MAX_PORTS; i++) if (mlxsw_sx_port_created(mlxsw_sx, i)) - mlxsw_sx_port_remove(mlxsw_sx, i); + mlxsw_sx_port_eth_remove(mlxsw_sx, i); kfree(mlxsw_sx->ports); } @@ -1193,7 +1193,7 @@ static int mlxsw_sx_ports_create(struct mlxsw_sx *mlxsw_sx) goto err_port_module_info_get; if (!width) continue; - err = mlxsw_sx_port_create(mlxsw_sx, i, module, width); + err = mlxsw_sx_port_eth_create(mlxsw_sx, i, module, width); if (err) goto err_port_create; } @@ -1203,7 +1203,7 @@ static int mlxsw_sx_ports_create(struct mlxsw_sx *mlxsw_sx) err_port_module_info_get: for (i--; i >= 1; i--) if (mlxsw_sx_port_created(mlxsw_sx, i)) - mlxsw_sx_port_remove(mlxsw_sx, i); + mlxsw_sx_port_eth_remove(mlxsw_sx, i); kfree(mlxsw_sx->ports); return err; } -- 2.5.5
[patch net-next 13/16] mlxsw: core: Add port type (Eth/IB) set API
From: Elad Raz Add "port_type_set" API to mlxsw core. The core layer send the change type callback to the port along with it's private information. Signed-off-by: Elad Raz Reviewed-by: Jiri Pirko Signed-off-by: Jiri Pirko --- drivers/net/ethernet/mellanox/mlxsw/core.c | 41 ++ drivers/net/ethernet/mellanox/mlxsw/core.h | 6 + 2 files changed, 47 insertions(+) diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c index dcd7202..6004817 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/core.c +++ b/drivers/net/ethernet/mellanox/mlxsw/core.c @@ -93,6 +93,7 @@ struct mlxsw_core_pcpu_stats { struct mlxsw_core_port { struct devlink_port devlink_port; void *port_driver_priv; + u8 local_port; }; void *mlxsw_core_port_driver_priv(struct mlxsw_core_port *mlxsw_core_port) @@ -937,6 +938,21 @@ static void *__dl_port(struct devlink_port *devlink_port) return container_of(devlink_port, struct mlxsw_core_port, devlink_port); } +static int mlxsw_devlink_port_type_set(struct devlink_port *devlink_port, + enum devlink_port_type port_type) +{ + struct mlxsw_core *mlxsw_core = devlink_priv(devlink_port->devlink); + struct mlxsw_driver *mlxsw_driver = mlxsw_core->driver; + struct mlxsw_core_port *mlxsw_core_port = __dl_port(devlink_port); + + if (!mlxsw_driver->port_type_set) + return -EOPNOTSUPP; + + return mlxsw_driver->port_type_set(mlxsw_core, + mlxsw_core_port->local_port, + port_type); +} + static int mlxsw_devlink_sb_port_pool_get(struct devlink_port *devlink_port, unsigned int sb_index, u16 pool_index, u32 *p_threshold) @@ -1060,6 +1076,7 @@ mlxsw_devlink_sb_occ_tc_port_bind_get(struct devlink_port *devlink_port, } static const struct devlink_ops mlxsw_devlink_ops = { + .port_type_set = mlxsw_devlink_port_type_set, .port_split = mlxsw_devlink_port_split, .port_unsplit = mlxsw_devlink_port_unsplit, .sb_pool_get= mlxsw_devlink_sb_pool_get, @@ -1687,6 +1704,7 @@ int mlxsw_core_port_init(struct mlxsw_core *mlxsw_core, u8 local_port) 
struct devlink_port *devlink_port = &mlxsw_core_port->devlink_port; int err; + mlxsw_core_port->local_port = local_port; err = devlink_port_register(devlink, devlink_port, local_port); if (err) memset(mlxsw_core_port, 0, sizeof(*mlxsw_core_port)); @@ -1720,6 +1738,18 @@ void mlxsw_core_port_eth_set(struct mlxsw_core *mlxsw_core, u8 local_port, } EXPORT_SYMBOL(mlxsw_core_port_eth_set); +void mlxsw_core_port_ib_set(struct mlxsw_core *mlxsw_core, u8 local_port, + void *port_driver_priv) +{ + struct mlxsw_core_port *mlxsw_core_port = + &mlxsw_core->ports[local_port]; + struct devlink_port *devlink_port = &mlxsw_core_port->devlink_port; + + mlxsw_core_port->port_driver_priv = port_driver_priv; + devlink_port_type_ib_set(devlink_port, NULL); +} +EXPORT_SYMBOL(mlxsw_core_port_ib_set); + void mlxsw_core_port_clear(struct mlxsw_core *mlxsw_core, u8 local_port, void *port_driver_priv) { @@ -1732,6 +1762,17 @@ void mlxsw_core_port_clear(struct mlxsw_core *mlxsw_core, u8 local_port, } EXPORT_SYMBOL(mlxsw_core_port_clear); +enum devlink_port_type mlxsw_core_port_type_get(struct mlxsw_core *mlxsw_core, + u8 local_port) +{ + struct mlxsw_core_port *mlxsw_core_port = + &mlxsw_core->ports[local_port]; + struct devlink_port *devlink_port = &mlxsw_core_port->devlink_port; + + return devlink_port->type; +} +EXPORT_SYMBOL(mlxsw_core_port_type_get); + static void mlxsw_core_buf_dump_dbg(struct mlxsw_core *mlxsw_core, const char *buf, size_t size) { diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.h b/drivers/net/ethernet/mellanox/mlxsw/core.h index 82866d5..c0acc1b 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/core.h +++ b/drivers/net/ethernet/mellanox/mlxsw/core.h @@ -148,8 +148,12 @@ void mlxsw_core_port_fini(struct mlxsw_core *mlxsw_core, u8 local_port); void mlxsw_core_port_eth_set(struct mlxsw_core *mlxsw_core, u8 local_port, void *port_driver_priv, struct net_device *dev, bool split, u32 split_group); +void mlxsw_core_port_ib_set(struct mlxsw_core *mlxsw_core, u8 
local_port, + void *port_driver_priv); void mlxsw_core_port_clear(struct mlxsw_core *mlxsw_core, u8 local_port, void *port_driver_priv);
[patch net-next 12/16] mlxsw: core: Add "eth" prefix to mlxsw_core_port_set
From: Elad Raz Since we are about to introduce IB port APIs, we will add prefixes to existing APIs. Signed-off-by: Elad Raz Signed-off-by: Jiri Pirko --- drivers/net/ethernet/mellanox/mlxsw/core.c | 8 drivers/net/ethernet/mellanox/mlxsw/core.h | 6 +++--- drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 5 +++-- drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 4 ++-- 4 files changed, 12 insertions(+), 11 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c index b50acb1..dcd7202 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/core.c +++ b/drivers/net/ethernet/mellanox/mlxsw/core.c @@ -1705,9 +1705,9 @@ void mlxsw_core_port_fini(struct mlxsw_core *mlxsw_core, u8 local_port) } EXPORT_SYMBOL(mlxsw_core_port_fini); -void mlxsw_core_port_set(struct mlxsw_core *mlxsw_core, u8 local_port, -void *port_driver_priv, struct net_device *dev, -bool split, u32 split_group) +void mlxsw_core_port_eth_set(struct mlxsw_core *mlxsw_core, u8 local_port, +void *port_driver_priv, struct net_device *dev, +bool split, u32 split_group) { struct mlxsw_core_port *mlxsw_core_port = &mlxsw_core->ports[local_port]; @@ -1718,7 +1718,7 @@ void mlxsw_core_port_set(struct mlxsw_core *mlxsw_core, u8 local_port, devlink_port_split_set(devlink_port, split_group); devlink_port_type_eth_set(devlink_port, dev); } -EXPORT_SYMBOL(mlxsw_core_port_set); +EXPORT_SYMBOL(mlxsw_core_port_eth_set); void mlxsw_core_port_clear(struct mlxsw_core *mlxsw_core, u8 local_port, void *port_driver_priv) diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.h b/drivers/net/ethernet/mellanox/mlxsw/core.h index f4bbbd4..82866d5 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/core.h +++ b/drivers/net/ethernet/mellanox/mlxsw/core.h @@ -145,9 +145,9 @@ void mlxsw_core_lag_mapping_clear(struct mlxsw_core *mlxsw_core, void *mlxsw_core_port_driver_priv(struct mlxsw_core_port *mlxsw_core_port); int mlxsw_core_port_init(struct mlxsw_core *mlxsw_core, u8 local_port); void 
mlxsw_core_port_fini(struct mlxsw_core *mlxsw_core, u8 local_port); -void mlxsw_core_port_set(struct mlxsw_core *mlxsw_core, u8 local_port, -void *port_driver_priv, struct net_device *dev, -bool split, u32 split_group); +void mlxsw_core_port_eth_set(struct mlxsw_core *mlxsw_core, u8 local_port, +void *port_driver_priv, struct net_device *dev, +bool split, u32 split_group); void mlxsw_core_port_clear(struct mlxsw_core *mlxsw_core, u8 local_port, void *port_driver_priv); diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c index ad4ff27..8bca020 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c @@ -2358,8 +2358,9 @@ static int __mlxsw_sp_port_create(struct mlxsw_sp *mlxsw_sp, u8 local_port, goto err_register_netdev; } - mlxsw_core_port_set(mlxsw_sp->core, mlxsw_sp_port->local_port, - mlxsw_sp_port, dev, mlxsw_sp_port->split, module); + mlxsw_core_port_eth_set(mlxsw_sp->core, mlxsw_sp_port->local_port, + mlxsw_sp_port, dev, mlxsw_sp_port->split, + module); mlxsw_core_schedule_dw(&mlxsw_sp_port->hw_stats.update_dw, 0); return 0; diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c index a905589..d110e3f 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c +++ b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c @@ -1098,8 +1098,8 @@ static int __mlxsw_sx_port_create(struct mlxsw_sx *mlxsw_sx, u8 local_port, goto err_register_netdev; } - mlxsw_core_port_set(mlxsw_sx->core, mlxsw_sx_port->local_port, - mlxsw_sx_port, dev, false, 0); + mlxsw_core_port_eth_set(mlxsw_sx->core, mlxsw_sx_port->local_port, + mlxsw_sx_port, dev, false, 0); mlxsw_sx->ports[local_port] = mlxsw_sx_port; return 0; -- 2.5.5
[patch net-next 11/16] mlxsw: switchx2: Add Infiniband switch partition
From: Elad Raz In order to put a port in the Infiniband fabric, it should be assigned to a separate swid (switch partition) that is initialized as an IB swid. Signed-off-by: Elad Raz Signed-off-by: Jiri Pirko --- drivers/net/ethernet/mellanox/mlxsw/port.h | 1 + drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 6 +- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/port.h b/drivers/net/ethernet/mellanox/mlxsw/port.h index af371a8..7296071 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/port.h +++ b/drivers/net/ethernet/mellanox/mlxsw/port.h @@ -44,6 +44,7 @@ #define MLXSW_PORT_SWID_DISABLED_PORT 255 #define MLXSW_PORT_SWID_ALL_SWIDS 254 +#define MLXSW_PORT_SWID_TYPE_IB 1 #define MLXSW_PORT_SWID_TYPE_ETH 2 #define MLXSW_PORT_MID 0xd000 diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c index 8eac26f..a905589 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c +++ b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c @@ -1590,13 +1590,17 @@ static struct mlxsw_config_profile mlxsw_sx_config_profile = { .used_flood_mode = 1, .flood_mode = 3, .used_max_ib_mc = 1, - .max_ib_mc = 0, + .max_ib_mc = 6, .used_max_pkey = 1, .max_pkey = 0, .swid_config = { { .used_type = 1, .type = MLXSW_PORT_SWID_TYPE_ETH, + }, + { + .used_type = 1, + .type = MLXSW_PORT_SWID_TYPE_IB, } }, .resource_query_enable = 0, -- 2.5.5
[patch net-next 05/16] mlxsw: switchx2: Add support for physical port names
From: Elad Raz Export to userspace the front panel name of the port, so that udev can rename the ports accordingly. The convention suggested by switchdev documentation is used: pX Signed-off-by: Elad Raz Reviewed-by: Jiri Pirko Reviewed-by: Ido Schimmel Signed-off-by: Jiri Pirko --- drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 32 ++ 1 file changed, 27 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c index 6882e35..941393b 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c +++ b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c @@ -81,6 +81,9 @@ struct mlxsw_sx_port { struct mlxsw_sx_port_pcpu_stats __percpu *pcpu_stats; struct mlxsw_sx *mlxsw_sx; u8 local_port; + struct { + u8 module; + } mapping; }; /* tx_hdr_version @@ -257,7 +260,8 @@ mlxsw_sx_port_system_port_mapping_set(struct mlxsw_sx_port *mlxsw_sx_port) } static int mlxsw_sx_port_module_info_get(struct mlxsw_sx *mlxsw_sx, -u8 local_port, u8 *p_width) +u8 local_port, u8 *p_module, +u8 *p_width) { char pmlp_pl[MLXSW_REG_PMLP_LEN]; int err; @@ -266,6 +270,7 @@ static int mlxsw_sx_port_module_info_get(struct mlxsw_sx *mlxsw_sx, err = mlxsw_reg_query(mlxsw_sx->core, MLXSW_REG(pmlp), pmlp_pl); if (err) return err; + *p_module = mlxsw_reg_pmlp_module_get(pmlp_pl, 0); *p_width = mlxsw_reg_pmlp_width_get(pmlp_pl); return 0; } @@ -383,12 +388,26 @@ mlxsw_sx_port_get_stats64(struct net_device *dev, return stats; } +static int mlxsw_sx_port_get_phys_port_name(struct net_device *dev, char *name, + size_t len) +{ + struct mlxsw_sx_port *mlxsw_sx_port = netdev_priv(dev); + int err; + + err = snprintf(name, len, "p%d", mlxsw_sx_port->mapping.module + 1); + if (err >= len) + return -EINVAL; + + return 0; +} + static const struct net_device_ops mlxsw_sx_port_netdev_ops = { .ndo_open = mlxsw_sx_port_open, .ndo_stop = mlxsw_sx_port_stop, .ndo_start_xmit = mlxsw_sx_port_xmit, .ndo_change_mtu = mlxsw_sx_port_change_mtu, 
.ndo_get_stats64= mlxsw_sx_port_get_stats64, + .ndo_get_phys_port_name = mlxsw_sx_port_get_phys_port_name, }; static void mlxsw_sx_port_get_drvinfo(struct net_device *dev, @@ -957,7 +976,8 @@ mlxsw_sx_port_mac_learning_mode_set(struct mlxsw_sx_port *mlxsw_sx_port, return mlxsw_reg_write(mlxsw_sx->core, MLXSW_REG(spmlr), spmlr_pl); } -static int mlxsw_sx_port_create(struct mlxsw_sx *mlxsw_sx, u8 local_port) +static int mlxsw_sx_port_create(struct mlxsw_sx *mlxsw_sx, u8 local_port, + u8 module) { struct mlxsw_sx_port *mlxsw_sx_port; struct net_device *dev; @@ -971,6 +991,7 @@ static int mlxsw_sx_port_create(struct mlxsw_sx *mlxsw_sx, u8 local_port) mlxsw_sx_port->dev = dev; mlxsw_sx_port->mlxsw_sx = mlxsw_sx; mlxsw_sx_port->local_port = local_port; + mlxsw_sx_port->mapping.module = module; mlxsw_sx_port->pcpu_stats = netdev_alloc_pcpu_stats(struct mlxsw_sx_port_pcpu_stats); @@ -1119,7 +1140,7 @@ static void mlxsw_sx_ports_remove(struct mlxsw_sx *mlxsw_sx) static int mlxsw_sx_ports_create(struct mlxsw_sx *mlxsw_sx) { size_t alloc_size; - u8 width; + u8 module, width; int i; int err; @@ -1129,12 +1150,13 @@ static int mlxsw_sx_ports_create(struct mlxsw_sx *mlxsw_sx) return -ENOMEM; for (i = 1; i < MLXSW_PORT_MAX_PORTS; i++) { - err = mlxsw_sx_port_module_info_get(mlxsw_sx, i, &width); + err = mlxsw_sx_port_module_info_get(mlxsw_sx, i, &module, + &width); if (err) goto err_port_module_info_get; if (!width) continue; - err = mlxsw_sx_port_create(mlxsw_sx, i); + err = mlxsw_sx_port_create(mlxsw_sx, i, module); if (err) goto err_port_create; } -- 2.5.5
[patch net-next 08/16] mlxsw: reg: Add Infiniband support to PTYS
From: Elad Raz In order to support the Infiniband fabric, we need to introduce IB speeds and capabilities to PTYS EMADs. Signed-off-by: Elad Raz Reviewed-by: Jiri Pirko Reviewed-by: Ido Schimmel Signed-off-by: Jiri Pirko --- drivers/net/ethernet/mellanox/mlxsw/reg.h | 71 +++ 1 file changed, 71 insertions(+) diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h index 3ecf348..2f31af0 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/reg.h +++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h @@ -2054,6 +2054,7 @@ MLXSW_REG_DEFINE(ptys, MLXSW_REG_PTYS_ID, MLXSW_REG_PTYS_LEN); */ MLXSW_ITEM32(reg, ptys, local_port, 0x00, 16, 8); +#define MLXSW_REG_PTYS_PROTO_MASK_IB BIT(0) #define MLXSW_REG_PTYS_PROTO_MASK_ETH BIT(2) /* reg_ptys_proto_mask @@ -2112,18 +2113,61 @@ MLXSW_ITEM32(reg, ptys, an_status, 0x04, 28, 4); */ MLXSW_ITEM32(reg, ptys, eth_proto_cap, 0x0C, 0, 32); +/* reg_ptys_ib_link_width_cap + * IB port supported widths. + * Access: RO + */ +MLXSW_ITEM32(reg, ptys, ib_link_width_cap, 0x10, 16, 16); + +#define MLXSW_REG_PTYS_IB_SPEED_SDR BIT(0) +#define MLXSW_REG_PTYS_IB_SPEED_DDR BIT(1) +#define MLXSW_REG_PTYS_IB_SPEED_QDR BIT(2) +#define MLXSW_REG_PTYS_IB_SPEED_FDR10 BIT(3) +#define MLXSW_REG_PTYS_IB_SPEED_FDR BIT(4) +#define MLXSW_REG_PTYS_IB_SPEED_EDR BIT(5) + +/* reg_ptys_ib_proto_cap + * IB port supported speeds and protocols. + * Access: RO + */ +MLXSW_ITEM32(reg, ptys, ib_proto_cap, 0x10, 0, 16); + /* reg_ptys_eth_proto_admin * Speed and protocol to set port to. * Access: RW */ MLXSW_ITEM32(reg, ptys, eth_proto_admin, 0x18, 0, 32); +/* reg_ptys_ib_link_width_admin + * IB width to set port to. + * Access: RW + */ +MLXSW_ITEM32(reg, ptys, ib_link_width_admin, 0x1C, 16, 16); + +/* reg_ptys_ib_proto_admin + * IB speeds and protocols to set port to. + * Access: RW + */ +MLXSW_ITEM32(reg, ptys, ib_proto_admin, 0x1C, 0, 16); + /* reg_ptys_eth_proto_oper * The current speed and protocol configured for the port.
* Access: RO */ MLXSW_ITEM32(reg, ptys, eth_proto_oper, 0x24, 0, 32); +/* reg_ptys_ib_link_width_oper + * The current IB width to set port to. + * Access: RO + */ +MLXSW_ITEM32(reg, ptys, ib_link_width_oper, 0x28, 16, 16); + +/* reg_ptys_ib_proto_oper + * The current IB speed and protocol. + * Access: RO + */ +MLXSW_ITEM32(reg, ptys, ib_proto_oper, 0x28, 0, 16); + /* reg_ptys_eth_proto_lp_advertise * The protocols that were advertised by the link partner during * autonegotiation. @@ -2153,6 +2197,33 @@ static inline void mlxsw_reg_ptys_eth_unpack(char *payload, *p_eth_proto_oper = mlxsw_reg_ptys_eth_proto_oper_get(payload); } +static inline void mlxsw_reg_ptys_ib_pack(char *payload, u8 local_port, + u16 proto_admin, u16 link_width) +{ + MLXSW_REG_ZERO(ptys, payload); + mlxsw_reg_ptys_local_port_set(payload, local_port); + mlxsw_reg_ptys_proto_mask_set(payload, MLXSW_REG_PTYS_PROTO_MASK_IB); + mlxsw_reg_ptys_ib_proto_admin_set(payload, proto_admin); + mlxsw_reg_ptys_ib_link_width_admin_set(payload, link_width); +} + +static inline void mlxsw_reg_ptys_ib_unpack(char *payload, u16 *p_ib_proto_cap, + u16 *p_ib_link_width_cap, + u16 *p_ib_proto_oper, + u16 *p_ib_link_width_oper) +{ + if (p_ib_proto_cap) + *p_ib_proto_cap = mlxsw_reg_ptys_ib_proto_cap_get(payload); + if (p_ib_link_width_cap) + *p_ib_link_width_cap = + mlxsw_reg_ptys_ib_link_width_cap_get(payload); + if (p_ib_proto_oper) + *p_ib_proto_oper = mlxsw_reg_ptys_ib_proto_oper_get(payload); + if (p_ib_link_width_oper) + *p_ib_link_width_oper = + mlxsw_reg_ptys_ib_link_width_oper_get(payload); +} + /* PPAD - Port Physical Address Register * - * The PPAD register configures the per port physical MAC address. -- 2.5.5
[patch net-next 06/16] mlxsw: switchx2: Fix port speed configuration
From: Elad Raz In SwitchX-2 we configure the port speed to negotiate with a 40G link only. Add support for all other supported speeds. Fixes: 31557f0f9755 ("mlxsw: Introduce Mellanox SwitchX-2 ASIC support") Signed-off-by: Elad Raz Reviewed-by: Jiri Pirko Reviewed-by: Ido Schimmel Signed-off-by: Jiri Pirko --- drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 30 -- 1 file changed, 23 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c index 941393b..44dae8d 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c +++ b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c @@ -662,6 +662,7 @@ static const struct mlxsw_sx_port_link_mode mlxsw_sx_port_link_mode[] = { }; #define MLXSW_SX_PORT_LINK_MODE_LEN ARRAY_SIZE(mlxsw_sx_port_link_mode) +#define MLXSW_SX_PORT_BASE_SPEED 10000 /* Mb/s */ static u32 mlxsw_sx_from_ptys_supported_port(u32 ptys_eth_proto) { @@ -809,6 +810,18 @@ static u32 mlxsw_sx_to_ptys_speed(u32 speed) return ptys_proto; } +static u32 mlxsw_sx_to_ptys_upper_speed(u32 upper_speed) +{ + u32 ptys_proto = 0; + int i; + + for (i = 0; i < MLXSW_SX_PORT_LINK_MODE_LEN; i++) { + if (mlxsw_sx_port_link_mode[i].speed <= upper_speed) + ptys_proto |= mlxsw_sx_port_link_mode[i].mask; + } + return ptys_proto; +} + static int mlxsw_sx_port_set_settings(struct net_device *dev, struct ethtool_cmd *cmd) { @@ -955,13 +968,17 @@ static int mlxsw_sx_port_stp_state_set(struct mlxsw_sx_port *mlxsw_sx_port, return err; } -static int mlxsw_sx_port_speed_set(struct mlxsw_sx_port *mlxsw_sx_port, - u32 speed) +static int +mlxsw_sx_port_speed_by_width_set(struct mlxsw_sx_port *mlxsw_sx_port, u8 width) { struct mlxsw_sx *mlxsw_sx = mlxsw_sx_port->mlxsw_sx; + u32 upper_speed = MLXSW_SX_PORT_BASE_SPEED * width; char ptys_pl[MLXSW_REG_PTYS_LEN]; + u32 eth_proto_admin; - mlxsw_reg_ptys_pack(ptys_pl, mlxsw_sx_port->local_port, speed); + eth_proto_admin = mlxsw_sx_to_ptys_upper_speed(upper_speed); +
mlxsw_reg_ptys_pack(ptys_pl, mlxsw_sx_port->local_port, + eth_proto_admin); return mlxsw_reg_write(mlxsw_sx->core, MLXSW_REG(ptys), ptys_pl); } @@ -977,7 +994,7 @@ mlxsw_sx_port_mac_learning_mode_set(struct mlxsw_sx_port *mlxsw_sx_port, } static int mlxsw_sx_port_create(struct mlxsw_sx *mlxsw_sx, u8 local_port, - u8 module) + u8 module, u8 width) { struct mlxsw_sx_port *mlxsw_sx_port; struct net_device *dev; @@ -1038,8 +1055,7 @@ static int mlxsw_sx_port_create(struct mlxsw_sx *mlxsw_sx, u8 local_port, goto err_port_swid_set; } - err = mlxsw_sx_port_speed_set(mlxsw_sx_port, - MLXSW_REG_PTYS_ETH_SPEED_40GBASE_CR4); + err = mlxsw_sx_port_speed_by_width_set(mlxsw_sx_port, width); if (err) { dev_err(mlxsw_sx->bus_info->dev, "Port %d: Failed to set speed\n", mlxsw_sx_port->local_port); @@ -1156,7 +1172,7 @@ static int mlxsw_sx_ports_create(struct mlxsw_sx *mlxsw_sx) goto err_port_module_info_get; if (!width) continue; - err = mlxsw_sx_port_create(mlxsw_sx, i, module); + err = mlxsw_sx_port_create(mlxsw_sx, i, module, width); if (err) goto err_port_create; } -- 2.5.5
[patch net-next 10/16] mlxsw: Make devlink port instances independent of spectrum/switchx2 port instances
From: Jiri Pirko Currently, devlink register/unregister is done directly from spectrum/switchx2 port create/remove functions. With a need to introduce a port type change, the devlink port instances have to be persistent across type changes, therefore across port create/remove function calls. So do a bit of reshuffling to achieve that. Signed-off-by: Jiri Pirko Signed-off-by: Elad Raz Reviewed-by: Ido Schimmel Signed-off-by: Jiri Pirko --- drivers/net/ethernet/mellanox/mlxsw/core.c | 82 +- drivers/net/ethernet/mellanox/mlxsw/core.h | 26 +++- drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 53 +++-- drivers/net/ethernet/mellanox/mlxsw/spectrum.h | 1 - drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 49 ++- 5 files changed, 147 insertions(+), 64 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c index 02183f6..b50acb1 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/core.c +++ b/drivers/net/ethernet/mellanox/mlxsw/core.c @@ -90,6 +90,22 @@ struct mlxsw_core_pcpu_stats { u32 port_rx_invalid; }; +struct mlxsw_core_port { + struct devlink_port devlink_port; + void *port_driver_priv; +}; + +void *mlxsw_core_port_driver_priv(struct mlxsw_core_port *mlxsw_core_port) +{ + return mlxsw_core_port->port_driver_priv; +} +EXPORT_SYMBOL(mlxsw_core_port_driver_priv); + +static bool mlxsw_core_port_check(struct mlxsw_core_port *mlxsw_core_port) +{ + return mlxsw_core_port->port_driver_priv != NULL; +} + struct mlxsw_core { struct mlxsw_driver *driver; const struct mlxsw_bus *bus; @@ -114,6 +130,7 @@ struct mlxsw_core { } lag; struct mlxsw_res res; struct mlxsw_hwmon *hwmon; + struct mlxsw_core_port ports[MLXSW_PORT_MAX_PORTS]; unsigned long driver_priv[0]; /* driver_priv has to be always the last item */ }; @@ -928,7 +945,8 @@ static int mlxsw_devlink_sb_port_pool_get(struct devlink_port *devlink_port, struct mlxsw_driver *mlxsw_driver = mlxsw_core->driver; struct mlxsw_core_port *mlxsw_core_port = 
__dl_port(devlink_port); - if (!mlxsw_driver->sb_port_pool_get) + if (!mlxsw_driver->sb_port_pool_get || + !mlxsw_core_port_check(mlxsw_core_port)) return -EOPNOTSUPP; return mlxsw_driver->sb_port_pool_get(mlxsw_core_port, sb_index, pool_index, p_threshold); @@ -942,7 +960,8 @@ static int mlxsw_devlink_sb_port_pool_set(struct devlink_port *devlink_port, struct mlxsw_driver *mlxsw_driver = mlxsw_core->driver; struct mlxsw_core_port *mlxsw_core_port = __dl_port(devlink_port); - if (!mlxsw_driver->sb_port_pool_set) + if (!mlxsw_driver->sb_port_pool_set || + !mlxsw_core_port_check(mlxsw_core_port)) return -EOPNOTSUPP; return mlxsw_driver->sb_port_pool_set(mlxsw_core_port, sb_index, pool_index, threshold); @@ -958,7 +977,8 @@ mlxsw_devlink_sb_tc_pool_bind_get(struct devlink_port *devlink_port, struct mlxsw_driver *mlxsw_driver = mlxsw_core->driver; struct mlxsw_core_port *mlxsw_core_port = __dl_port(devlink_port); - if (!mlxsw_driver->sb_tc_pool_bind_get) + if (!mlxsw_driver->sb_tc_pool_bind_get || + !mlxsw_core_port_check(mlxsw_core_port)) return -EOPNOTSUPP; return mlxsw_driver->sb_tc_pool_bind_get(mlxsw_core_port, sb_index, tc_index, pool_type, @@ -975,7 +995,8 @@ mlxsw_devlink_sb_tc_pool_bind_set(struct devlink_port *devlink_port, struct mlxsw_driver *mlxsw_driver = mlxsw_core->driver; struct mlxsw_core_port *mlxsw_core_port = __dl_port(devlink_port); - if (!mlxsw_driver->sb_tc_pool_bind_set) + if (!mlxsw_driver->sb_tc_pool_bind_set || + !mlxsw_core_port_check(mlxsw_core_port)) return -EOPNOTSUPP; return mlxsw_driver->sb_tc_pool_bind_set(mlxsw_core_port, sb_index, tc_index, pool_type, @@ -1013,7 +1034,8 @@ mlxsw_devlink_sb_occ_port_pool_get(struct devlink_port *devlink_port, struct mlxsw_driver *mlxsw_driver = mlxsw_core->driver; struct mlxsw_core_port *mlxsw_core_port = __dl_port(devlink_port); - if (!mlxsw_driver->sb_occ_port_pool_get) + if (!mlxsw_driver->sb_occ_port_pool_get || + !mlxsw_core_port_check(mlxsw_core_port)) return -EOPNOTSUPP; return 
mlxsw_driver->sb_occ_port_pool_get(mlxsw_core_port, sb_index, pool_index, p_cur, p_max); @@ -1029,7 +1051,8 @@ mlxsw_devlink_sb_occ_tc_port_bind_get(struct devlink_port *devlink_port, struct mlxsw_driver *mlxsw_driver = mlxsw_core->driver; struct mlxsw_core_por
[patch net-next 09/16] mlxsw: reg: Add local-port to Infiniband port mapping
From: Elad Raz In order to change a port type to an Infiniband port we should change its mapping from local-port to Infiniband. Adding the PLIB (Port Local to InfiniBand) register allows this mapping. Signed-off-by: Elad Raz Reviewed-by: Jiri Pirko Reviewed-by: Ido Schimmel Signed-off-by: Jiri Pirko --- drivers/net/ethernet/mellanox/mlxsw/reg.h | 22 ++ 1 file changed, 22 insertions(+) diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h index 2f31af0..a61ce34 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/reg.h +++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h @@ -2748,6 +2748,27 @@ static inline void mlxsw_reg_ppcnt_pack(char *payload, u8 local_port, mlxsw_reg_ppcnt_prio_tc_set(payload, prio_tc); } +/* PLIB - Port Local to InfiniBand Port + * + * The PLIB register performs mapping from Local Port into InfiniBand Port. + */ +#define MLXSW_REG_PLIB_ID 0x500A +#define MLXSW_REG_PLIB_LEN 0x10 + +MLXSW_REG_DEFINE(plib, MLXSW_REG_PLIB_ID, MLXSW_REG_PLIB_LEN); + +/* reg_plib_local_port + * Local port number. + * Access: Index + */ +MLXSW_ITEM32(reg, plib, local_port, 0x00, 16, 8); + +/* reg_plib_ib_port + * InfiniBand port remapping for local_port. + * Access: RW + */ +MLXSW_ITEM32(reg, plib, ib_port, 0x00, 0, 8); + /* PPTB - Port Prio To Buffer Register + * --- + * Configures the switch priority to buffer table. @@ -5188,6 +5209,7 @@ static const struct mlxsw_reg_info *mlxsw_reg_infos[] = { MLXSW_REG(paos), MLXSW_REG(pfcc), MLXSW_REG(ppcnt), + MLXSW_REG(plib), MLXSW_REG(pptb), MLXSW_REG(pbmc), MLXSW_REG(pspa), -- 2.5.5
[patch net-next 07/16] mlxsw: reg: Add eth prefix to PTYS pack and unpack
From: Elad Raz We want to add Infiniband support to PTYS. In order to maintain proper conventions, we will change pack and unpack prefix to eth. Signed-off-by: Elad Raz Reviewed-by: Jiri Pirko Reviewed-by: Ido Schimmel Signed-off-by: Jiri Pirko --- drivers/net/ethernet/mellanox/mlxsw/reg.h | 11 ++- drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 17 + drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 18 ++ 3 files changed, 25 insertions(+), 21 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h index debcf26..3ecf348 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/reg.h +++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h @@ -2131,8 +2131,8 @@ MLXSW_ITEM32(reg, ptys, eth_proto_oper, 0x24, 0, 32); */ MLXSW_ITEM32(reg, ptys, eth_proto_lp_advertise, 0x30, 0, 32); -static inline void mlxsw_reg_ptys_pack(char *payload, u8 local_port, - u32 proto_admin) +static inline void mlxsw_reg_ptys_eth_pack(char *payload, u8 local_port, + u32 proto_admin) { MLXSW_REG_ZERO(ptys, payload); mlxsw_reg_ptys_local_port_set(payload, local_port); @@ -2140,9 +2140,10 @@ static inline void mlxsw_reg_ptys_pack(char *payload, u8 local_port, mlxsw_reg_ptys_eth_proto_admin_set(payload, proto_admin); } -static inline void mlxsw_reg_ptys_unpack(char *payload, u32 *p_eth_proto_cap, -u32 *p_eth_proto_adm, -u32 *p_eth_proto_oper) +static inline void mlxsw_reg_ptys_eth_unpack(char *payload, +u32 *p_eth_proto_cap, +u32 *p_eth_proto_adm, +u32 *p_eth_proto_oper) { if (p_eth_proto_cap) *p_eth_proto_cap = mlxsw_reg_ptys_eth_proto_cap_get(payload); diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c index d649064..d1c2f9f 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c @@ -2003,12 +2003,12 @@ static int mlxsw_sp_port_get_link_ksettings(struct net_device *dev, int err; autoneg = mlxsw_sp_port->link.autoneg; - 
mlxsw_reg_ptys_pack(ptys_pl, mlxsw_sp_port->local_port, 0); + mlxsw_reg_ptys_eth_pack(ptys_pl, mlxsw_sp_port->local_port, 0); err = mlxsw_reg_query(mlxsw_sp->core, MLXSW_REG(ptys), ptys_pl); if (err) return err; - mlxsw_reg_ptys_unpack(ptys_pl, ð_proto_cap, ð_proto_admin, - ð_proto_oper); + mlxsw_reg_ptys_eth_unpack(ptys_pl, ð_proto_cap, ð_proto_admin, + ð_proto_oper); mlxsw_sp_port_get_link_supported(eth_proto_cap, cmd); @@ -2037,11 +2037,11 @@ mlxsw_sp_port_set_link_ksettings(struct net_device *dev, bool autoneg; int err; - mlxsw_reg_ptys_pack(ptys_pl, mlxsw_sp_port->local_port, 0); + mlxsw_reg_ptys_eth_pack(ptys_pl, mlxsw_sp_port->local_port, 0); err = mlxsw_reg_query(mlxsw_sp->core, MLXSW_REG(ptys), ptys_pl); if (err) return err; - mlxsw_reg_ptys_unpack(ptys_pl, ð_proto_cap, NULL, NULL); + mlxsw_reg_ptys_eth_unpack(ptys_pl, ð_proto_cap, NULL, NULL); autoneg = cmd->base.autoneg == AUTONEG_ENABLE; eth_proto_new = autoneg ? @@ -2054,7 +2054,8 @@ mlxsw_sp_port_set_link_ksettings(struct net_device *dev, return -EINVAL; } - mlxsw_reg_ptys_pack(ptys_pl, mlxsw_sp_port->local_port, eth_proto_new); + mlxsw_reg_ptys_eth_pack(ptys_pl, mlxsw_sp_port->local_port, + eth_proto_new); err = mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(ptys), ptys_pl); if (err) return err; @@ -2092,8 +2093,8 @@ mlxsw_sp_port_speed_by_width_set(struct mlxsw_sp_port *mlxsw_sp_port, u8 width) u32 eth_proto_admin; eth_proto_admin = mlxsw_sp_to_ptys_upper_speed(upper_speed); - mlxsw_reg_ptys_pack(ptys_pl, mlxsw_sp_port->local_port, - eth_proto_admin); + mlxsw_reg_ptys_eth_pack(ptys_pl, mlxsw_sp_port->local_port, + eth_proto_admin); return mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(ptys), ptys_pl); } diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c index 44dae8d..7b74dcb 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c +++ b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c @@ -762,14 +762,14 @@ static int mlxsw_sx_port_get_settings(struct 
net_device *dev, u32 eth_proto_oper; int err; - mlxsw_reg_ptys_pack(ptys_pl, mlxsw_sx_port->local_port, 0); + mlxsw_reg_ptys_eth_pack(ptys_pl, mlxsw
[patch net-next 04/16] mlxsw: spectrum: Move port used check outside port remove function
From: Jiri Pirko Be symmetrical with create and do the check outside the remove function. Signed-off-by: Jiri Pirko Signed-off-by: Elad Raz --- drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 22 +++--- 1 file changed, 15 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c index 2573faa..d649064 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c @@ -2403,8 +2403,6 @@ static void mlxsw_sp_port_remove(struct mlxsw_sp *mlxsw_sp, u8 local_port) { struct mlxsw_sp_port *mlxsw_sp_port = mlxsw_sp->ports[local_port]; - if (!mlxsw_sp_port) - return; cancel_delayed_work_sync(&mlxsw_sp_port->hw_stats.update_dw); mlxsw_core_port_fini(&mlxsw_sp_port->core_port); unregister_netdev(mlxsw_sp_port->dev); /* This calls ndo_stop */ @@ -2422,12 +2420,18 @@ static void mlxsw_sp_port_remove(struct mlxsw_sp *mlxsw_sp, u8 local_port) free_netdev(mlxsw_sp_port->dev); } +static bool mlxsw_sp_port_created(struct mlxsw_sp *mlxsw_sp, u8 local_port) +{ + return mlxsw_sp->ports[local_port] != NULL; +} + static void mlxsw_sp_ports_remove(struct mlxsw_sp *mlxsw_sp) { int i; for (i = 1; i < MLXSW_PORT_MAX_PORTS; i++) - mlxsw_sp_port_remove(mlxsw_sp, i); + if (mlxsw_sp_port_created(mlxsw_sp, i)) + mlxsw_sp_port_remove(mlxsw_sp, i); kfree(mlxsw_sp->ports); } @@ -2461,7 +2465,8 @@ static int mlxsw_sp_ports_create(struct mlxsw_sp *mlxsw_sp) err_port_create: err_port_module_info_get: for (i--; i >= 1; i--) - mlxsw_sp_port_remove(mlxsw_sp, i); + if (mlxsw_sp_port_created(mlxsw_sp, i)) + mlxsw_sp_port_remove(mlxsw_sp, i); kfree(mlxsw_sp->ports); return err; } @@ -2503,7 +2508,8 @@ static int mlxsw_sp_port_split_create(struct mlxsw_sp *mlxsw_sp, u8 base_port, err_port_create: for (i--; i >= 0; i--) - mlxsw_sp_port_remove(mlxsw_sp, base_port + i); + if (mlxsw_sp_port_created(mlxsw_sp, base_port + i)) + mlxsw_sp_port_remove(mlxsw_sp, base_port + i); i = count;
err_port_swid_set: for (i--; i >= 0; i--) @@ -2593,7 +2599,8 @@ static int mlxsw_sp_port_split(struct mlxsw_core *mlxsw_core, u8 local_port, } for (i = 0; i < count; i++) - mlxsw_sp_port_remove(mlxsw_sp, base_port + i); + if (mlxsw_sp_port_created(mlxsw_sp, base_port + i)) + mlxsw_sp_port_remove(mlxsw_sp, base_port + i); err = mlxsw_sp_port_split_create(mlxsw_sp, base_port, module, count); if (err) { @@ -2638,7 +2645,8 @@ static int mlxsw_sp_port_unsplit(struct mlxsw_core *mlxsw_core, u8 local_port) base_port = base_port + 2; for (i = 0; i < count; i++) - mlxsw_sp_port_remove(mlxsw_sp, base_port + i); + if (mlxsw_sp_port_created(mlxsw_sp, base_port + i)) + mlxsw_sp_port_remove(mlxsw_sp, base_port + i); mlxsw_sp_port_unsplit_create(mlxsw_sp, base_port, count); -- 2.5.5
Re: [PATCH net 2/3] sctp: return back transport in __sctp_rcv_init_lookup
On Fri, Oct 28, 2016 at 06:10:53PM +0800, Xin Long wrote: > Prior to this patch, it used a local variable to save the transport that is > looked up by __sctp_lookup_association(), and didn't return it back. But in > sctp_rcv, it is used to initialize chunk->transport. So when hitting this > code, it was initializing chunk->transport with some random stack value > instead. > > This patch is to return the transport back through transport pointer > that is from __sctp_rcv_lookup_harder(). > > Signed-off-by: Xin Long Acked-by: Marcelo Ricardo Leitner transport pointer in sctp_rcv() is initialized to null and there are checks for it after this path, so this shouldn't be exploitable, just malfunction. > --- > net/sctp/input.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/net/sctp/input.c b/net/sctp/input.c > index a2ea1d1..8e0bc58 100644 > --- a/net/sctp/input.c > +++ b/net/sctp/input.c > @@ -1021,7 +1021,6 @@ static struct sctp_association > *__sctp_rcv_init_lookup(struct net *net, > struct sctphdr *sh = sctp_hdr(skb); > union sctp_params params; > sctp_init_chunk_t *init; > - struct sctp_transport *transport; > struct sctp_af *af; > > /* > @@ -1052,7 +1051,7 @@ static struct sctp_association > *__sctp_rcv_init_lookup(struct net *net, > > af->from_addr_param(paddr, params.addr, sh->source, 0); > > - asoc = __sctp_lookup_association(net, laddr, paddr, &transport); > + asoc = __sctp_lookup_association(net, laddr, paddr, transportp); > if (asoc) > return asoc; > }
[patch net-next 03/16] mlxsw: switchx2: Move port used check outside port remove function
From: Jiri Pirko Be symmetrical with create and do the check outside the remove function. Signed-off-by: Jiri Pirko Signed-off-by: Elad Raz --- drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 14 ++ 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c index 67f1b70..6882e35 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c +++ b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c @@ -1093,21 +1093,26 @@ static void mlxsw_sx_port_remove(struct mlxsw_sx *mlxsw_sx, u8 local_port) { struct mlxsw_sx_port *mlxsw_sx_port = mlxsw_sx->ports[local_port]; - if (!mlxsw_sx_port) - return; mlxsw_core_port_fini(&mlxsw_sx_port->core_port); unregister_netdev(mlxsw_sx_port->dev); /* This calls ndo_stop */ + mlxsw_sx->ports[local_port] = NULL; mlxsw_sx_port_swid_set(mlxsw_sx_port, MLXSW_PORT_SWID_DISABLED_PORT); free_percpu(mlxsw_sx_port->pcpu_stats); free_netdev(mlxsw_sx_port->dev); } +static bool mlxsw_sx_port_created(struct mlxsw_sx *mlxsw_sx, u8 local_port) +{ + return mlxsw_sx->ports[local_port] != NULL; +} + static void mlxsw_sx_ports_remove(struct mlxsw_sx *mlxsw_sx) { int i; for (i = 1; i < MLXSW_PORT_MAX_PORTS; i++) - mlxsw_sx_port_remove(mlxsw_sx, i); + if (mlxsw_sx_port_created(mlxsw_sx, i)) + mlxsw_sx_port_remove(mlxsw_sx, i); kfree(mlxsw_sx->ports); } @@ -1138,7 +1143,8 @@ static int mlxsw_sx_ports_create(struct mlxsw_sx *mlxsw_sx) err_port_create: err_port_module_info_get: for (i--; i >= 1; i--) - mlxsw_sx_port_remove(mlxsw_sx, i); + if (mlxsw_sx_port_created(mlxsw_sx, i)) + mlxsw_sx_port_remove(mlxsw_sx, i); kfree(mlxsw_sx->ports); return err; } -- 2.5.5
[patch net-next 02/16] mlxsw: switchx2: Check if port is usable before calling port create
From: Jiri Pirko Do it in a same way we do it in spectrum. Check if port is usable first and only in that case create a port instance. Signed-off-by: Jiri Pirko Reviewed-by: Elad Raz Reviewed-by: Ido Schimmel --- drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 32 +- 1 file changed, 11 insertions(+), 21 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c index 8d11e5447..67f1b70 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c +++ b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c @@ -256,18 +256,17 @@ mlxsw_sx_port_system_port_mapping_set(struct mlxsw_sx_port *mlxsw_sx_port) return mlxsw_reg_write(mlxsw_sx->core, MLXSW_REG(sspr), sspr_pl); } -static int mlxsw_sx_port_module_check(struct mlxsw_sx_port *mlxsw_sx_port, - bool *p_usable) +static int mlxsw_sx_port_module_info_get(struct mlxsw_sx *mlxsw_sx, +u8 local_port, u8 *p_width) { - struct mlxsw_sx *mlxsw_sx = mlxsw_sx_port->mlxsw_sx; char pmlp_pl[MLXSW_REG_PMLP_LEN]; int err; - mlxsw_reg_pmlp_pack(pmlp_pl, mlxsw_sx_port->local_port); + mlxsw_reg_pmlp_pack(pmlp_pl, local_port); err = mlxsw_reg_query(mlxsw_sx->core, MLXSW_REG(pmlp), pmlp_pl); if (err) return err; - *p_usable = mlxsw_reg_pmlp_width_get(pmlp_pl) ? 
true : false; + *p_width = mlxsw_reg_pmlp_width_get(pmlp_pl); return 0; } @@ -962,7 +961,6 @@ static int mlxsw_sx_port_create(struct mlxsw_sx *mlxsw_sx, u8 local_port) { struct mlxsw_sx_port *mlxsw_sx_port; struct net_device *dev; - bool usable; int err; dev = alloc_etherdev(sizeof(struct mlxsw_sx_port)); @@ -1005,19 +1003,6 @@ static int mlxsw_sx_port_create(struct mlxsw_sx *mlxsw_sx, u8 local_port) */ dev->needed_headroom = MLXSW_TXHDR_LEN; - err = mlxsw_sx_port_module_check(mlxsw_sx_port, &usable); - if (err) { - dev_err(mlxsw_sx->bus_info->dev, "Port %d: Failed to check module\n", - mlxsw_sx_port->local_port); - goto err_port_module_check; - } - - if (!usable) { - dev_dbg(mlxsw_sx->bus_info->dev, "Port %d: Not usable, skipping initialization\n", - mlxsw_sx_port->local_port); - goto port_not_usable; - } - err = mlxsw_sx_port_system_port_mapping_set(mlxsw_sx_port); if (err) { dev_err(mlxsw_sx->bus_info->dev, "Port %d: Failed to set system port mapping\n", @@ -1097,8 +1082,6 @@ static int mlxsw_sx_port_create(struct mlxsw_sx *mlxsw_sx, u8 local_port) mlxsw_sx_port_swid_set(mlxsw_sx_port, MLXSW_PORT_SWID_DISABLED_PORT); err_port_swid_set: err_port_system_port_mapping_set: -port_not_usable: -err_port_module_check: err_dev_addr_get: free_percpu(mlxsw_sx_port->pcpu_stats); err_alloc_stats: @@ -1131,6 +1114,7 @@ static void mlxsw_sx_ports_remove(struct mlxsw_sx *mlxsw_sx) static int mlxsw_sx_ports_create(struct mlxsw_sx *mlxsw_sx) { size_t alloc_size; + u8 width; int i; int err; @@ -1140,6 +1124,11 @@ static int mlxsw_sx_ports_create(struct mlxsw_sx *mlxsw_sx) return -ENOMEM; for (i = 1; i < MLXSW_PORT_MAX_PORTS; i++) { + err = mlxsw_sx_port_module_info_get(mlxsw_sx, i, &width); + if (err) + goto err_port_module_info_get; + if (!width) + continue; err = mlxsw_sx_port_create(mlxsw_sx, i); if (err) goto err_port_create; @@ -1147,6 +1136,7 @@ static int mlxsw_sx_ports_create(struct mlxsw_sx *mlxsw_sx) return 0; err_port_create: +err_port_module_info_get: for (i--; i >= 1; 
i--) mlxsw_sx_port_remove(mlxsw_sx, i); kfree(mlxsw_sx->ports); -- 2.5.5
[patch net-next 01/16] mlxsw: core: Zero payload buffers for couple of registers
From: Elad Raz We recently discovered a bug in the firmware in which a field's length in one of the registers was incorrectly set. This caused the firmware to access garbage data that wasn't initialized by the driver and therefore emit error messages. While the bug is already fixed and the driver usually zeros the buffers passed to the firmware, there are a handful of cases where this isn't done. Zero the buffer in these cases and prevent similar bugs from recurring, as they tend to be hard to debug. Fixes: 52581961d83d ("mlxsw: core: Implement fan control using hwmon") Signed-off-by: Elad Raz Reviewed-by: Jiri Pirko Reviewed-by: Ido Schimmel Signed-off-by: Jiri Pirko --- drivers/net/ethernet/mellanox/mlxsw/core_hwmon.c | 4 ++-- drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 2 +- drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_hwmon.c b/drivers/net/ethernet/mellanox/mlxsw/core_hwmon.c index 1ac8bf1..ab710e3 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/core_hwmon.c +++ b/drivers/net/ethernet/mellanox/mlxsw/core_hwmon.c @@ -262,7 +262,7 @@ static void mlxsw_hwmon_attr_add(struct mlxsw_hwmon *mlxsw_hwmon, static int mlxsw_hwmon_temp_init(struct mlxsw_hwmon *mlxsw_hwmon) { - char mtcap_pl[MLXSW_REG_MTCAP_LEN]; + char mtcap_pl[MLXSW_REG_MTCAP_LEN] = {0}; char mtmp_pl[MLXSW_REG_MTMP_LEN]; u8 sensor_count; int i; @@ -295,7 +295,7 @@ static int mlxsw_hwmon_temp_init(struct mlxsw_hwmon *mlxsw_hwmon) static int mlxsw_hwmon_fans_init(struct mlxsw_hwmon *mlxsw_hwmon) { - char mfcr_pl[MLXSW_REG_MFCR_LEN]; + char mfcr_pl[MLXSW_REG_MFCR_LEN] = {0}; enum mlxsw_reg_mfcr_pwm_frequency freq; unsigned int type_index; unsigned int num; diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c index d652f7f..2573faa 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c +++ 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c @@ -158,7 +158,7 @@ static void mlxsw_sp_txhdr_construct(struct sk_buff *skb, static int mlxsw_sp_base_mac_get(struct mlxsw_sp *mlxsw_sp) { - char spad_pl[MLXSW_REG_SPAD_LEN]; + char spad_pl[MLXSW_REG_SPAD_LEN] = {0}; int err; err = mlxsw_reg_query(mlxsw_sp->core, MLXSW_REG(spad), spad_pl); diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c index 8c8f5d8..8d11e5447 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c +++ b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c @@ -890,7 +890,7 @@ static const struct switchdev_ops mlxsw_sx_port_switchdev_ops = { static int mlxsw_sx_hw_id_get(struct mlxsw_sx *mlxsw_sx) { - char spad_pl[MLXSW_REG_SPAD_LEN]; + char spad_pl[MLXSW_REG_SPAD_LEN] = {0}; int err; err = mlxsw_reg_query(mlxsw_sx->core, MLXSW_REG(spad), spad_pl); -- 2.5.5
Re: [PATCH net 1/3] sctp: hold transport instead of assoc in sctp_diag
On Fri, Oct 28, 2016 at 06:10:52PM +0800, Xin Long wrote: > In sctp_transport_lookup_process(), Commit 1cceda784980 ("sctp: fix > the issue sctp_diag uses lock_sock in rcu_read_lock") moved cb() out > of rcu lock, but it put the transport and held the assoc instead, ignoring > that cb() still uses the transport. It may cause a use-after-free issue. > > This patch is to hold the transport instead of the assoc there. > > Fixes: 1cceda784980 ("sctp: fix the issue sctp_diag uses lock_sock in > rcu_read_lock") > Signed-off-by: Xin Long Acked-by: Marcelo Ricardo Leitner > --- > net/sctp/socket.c | 5 + > 1 file changed, 1 insertion(+), 4 deletions(-) > > diff --git a/net/sctp/socket.c b/net/sctp/socket.c > index 9fbb6fe..71b75f9 100644 > --- a/net/sctp/socket.c > +++ b/net/sctp/socket.c > @@ -4480,12 +4480,9 @@ int sctp_transport_lookup_process(int (*cb)(struct > sctp_transport *, void *), > if (!transport || !sctp_transport_hold(transport)) > goto out; > > - sctp_association_hold(transport->asoc); > - sctp_transport_put(transport); > - > rcu_read_unlock(); > err = cb(transport, p); > - sctp_association_put(transport->asoc); > + sctp_transport_put(transport); > > out: > return err; > -- > 2.1.0
[patch net-next 00/16] mlxsw: Add Infiniband support for Mellanox switches
From: Jiri Pirko This patchset adds basic Infiniband support for SwitchX-2, Switch-IB and Switch-IB-2 ASIC drivers. SwitchX-2 ASIC is VPI capable, which means each port can be either Ethernet or Infiniband. When the port is configured as Infiniband, the Subnet Management Agent (SMA) is managed by the SwitchX-2 firmware and not by the host. Port configuration, MTU and more are configured remotely by the Subnet Manager (SM). Usage: $ devlink port show pci/:03:00.0/1: type eth netdev eth0 pci/:03:00.0/3: type eth netdev eth1 pci/:03:00.0/5: type eth netdev eth2 pci/:03:00.0/6: type eth netdev eth3 pci/:03:00.0/8: type eth netdev eth4 $ devlink port set pci/:03:00.0/1 type ib $ devlink port show pci/:03:00.0/1: type ib Switch-IB (FDR) and Switch-IB-2 (EDR 100Gbs) ASICs are Infiniband-only switches. The support provided in the mlxsw_switchib.ko driver is port initialization only. The firmware running in the Silicon implements the SMA. Please note that this patchset does only very basic port initialization. ib_device or RDMA implementations are not part of this patchset. 
Elad Raz (12): mlxsw: core: Zero payload buffers for couple of registers mlxsw: switchx2: Add support for physical port names mlxsw: switchx2: Fix port speed configuration mlxsw: reg: Add eth prefix to PTYS pack and unpack mlxsw: reg: Add Infiniband support to PTYS mlxsw: reg: Add local-port to Infiniband port mapping mlxsw: switchx2: Add Infiniband switch partition mlxsw: core: Add "eth" prefix to mlxsw_core_port_set mlxsw: core: Add port type (Eth/IB) set API mlxsw: switchx2: Add eth prefix to port create and remove mlxsw: switchx2: Add IB port support mlxsw: switchib: Introduce SwitchIB and SwitchIB silicon driver Jiri Pirko (4): mlxsw: switchx2: Check if port is usable before calling port create mlxsw: switchx2: Move port used check outside port remove function mlxsw: spectrum: Move port used check outside port remove function mlxsw: Make devlink port instances independent of spectrum/switchx2 port instances drivers/net/ethernet/mellanox/mlxsw/Kconfig | 11 + drivers/net/ethernet/mellanox/mlxsw/Makefile | 2 + drivers/net/ethernet/mellanox/mlxsw/core.c | 123 - drivers/net/ethernet/mellanox/mlxsw/core.h | 32 +- drivers/net/ethernet/mellanox/mlxsw/core_hwmon.c | 4 +- drivers/net/ethernet/mellanox/mlxsw/ib.h | 39 ++ drivers/net/ethernet/mellanox/mlxsw/pci.h| 2 + drivers/net/ethernet/mellanox/mlxsw/port.h | 4 + drivers/net/ethernet/mellanox/mlxsw/reg.h| 104 +++- drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 95 ++-- drivers/net/ethernet/mellanox/mlxsw/spectrum.h | 1 - drivers/net/ethernet/mellanox/mlxsw/switchib.c | 598 +++ drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 380 +++--- 13 files changed, 1258 insertions(+), 137 deletions(-) create mode 100644 drivers/net/ethernet/mellanox/mlxsw/ib.h create mode 100644 drivers/net/ethernet/mellanox/mlxsw/switchib.c -- 2.5.5
Re: [PATCH net-next] lan78xx: Use irq_domain for phy interrupt from USB Int. EP
On 10/28/2016 11:54 AM, woojung@microchip.com wrote: > From: Woojung Huh > > To utilize phylib with interrupt fully than handling some of phy stuff in the > MAC driver, > create irq_domain for USB interrupt EP of phy interrupt and > pass the irq number to phy_connect_direct() instead of PHY_IGNORE_INTERRUPT. > > Idea comes from drivers/gpio/gpio-dl2.c > > Signed-off-by: Woojung Huh > --- > +static void lan78xx_irq_mask(struct irq_data *irqd) > +{ > + struct irq_domain_data *data = irq_data_get_irq_chip_data(irqd); > + struct lan78xx_net *dev = > + container_of(data, struct lan78xx_net, domain_data); > + u32 buf; > + > + lan78xx_read_reg(dev, INT_EP_CTL, &buf); lan78xx_read_reg() uses kmalloc() with GFP_KERNEL, while irq_mask/irq_unmask can be called in atomic context AFAIR, you may need to pass down a specific gfp_t to lan78xx_read_reg. What about usb_submit_urb(), can that work in atomic context? Do you need to have lan78xx_read_reg() acquire a raw spinlock or something to serialize them? > + buf &= ~INT_EP_PHY_INT_EN_; Even though you may have only one interrupt line to deal with at the moment, better make this bit derived from irqd->hwirq instead of hard coding it here. > + lan78xx_write_reg(dev, INT_EP_CTL, buf); > +} > + > +static void lan78xx_irq_unmask(struct irq_data *irqd) > +{ > + struct irq_domain_data *data = irq_data_get_irq_chip_data(irqd); > + struct lan78xx_net *dev = > + container_of(data, struct lan78xx_net, domain_data); > + u32 buf; > + > + lan78xx_read_reg(dev, INT_EP_CTL, &buf); > + buf |= INT_EP_PHY_INT_EN_; Same here, this should come from irqd->hwirq. 
> + lan78xx_write_reg(dev, INT_EP_CTL, buf); > +} > + > +static struct irq_chip lan78xx_irqchip = { > + .name = "lan78xx-phyirq", > + .irq_mask = lan78xx_irq_mask, > + .irq_unmask = lan78xx_irq_unmask, > +}; > + > +static int lan78xx_setup_irq_domain(struct lan78xx_net *dev) > +{ > + struct device_node *of_node; > + struct irq_domain *irqdomain; > + unsigned int irq_base = 0; > + int ret = 0; > + > + of_node = dev->udev->dev.parent->of_node; > + > + dev->domain_data.irqchip = &lan78xx_irqchip; > + dev->domain_data.irq_handler = handle_simple_irq; > + > + irqdomain = irq_domain_add_simple(of_node, 1, 0, &chip_domain_ops, > + &dev->domain_data); Is there really just one interrupt associated with this peripheral here? > > + if (lan78xx_setup_irq_domain(dev) < 0) { > + netdev_warn(dev->net, "lan78xx_setup_irq_domain() failed"); > + return -EIO; > + } Any reason not to propagate the error code from lan78xx_setup_irq_domain() here? -- Florian
[PATCH net-next] lan78xx: Use irq_domain for phy interrupt from USB Int. EP
From: Woojung Huh To utilize phylib with interrupt fully than handling some of phy stuff in the MAC driver, create irq_domain for USB interrupt EP of phy interrupt and pass the irq number to phy_connect_direct() instead of PHY_IGNORE_INTERRUPT. Idea comes from drivers/gpio/gpio-dl2.c Signed-off-by: Woojung Huh --- drivers/net/usb/lan78xx.c | 142 +- 1 file changed, 128 insertions(+), 14 deletions(-) diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c index c4e748e..88ff21d 100644 --- a/drivers/net/usb/lan78xx.c +++ b/drivers/net/usb/lan78xx.c @@ -30,13 +30,17 @@ #include #include #include +#include +#include +#include +#include #include #include "lan78xx.h" #define DRIVER_AUTHOR "WOOJUNG HUH " #define DRIVER_DESC"LAN78XX USB 3.0 Gigabit Ethernet Devices" #define DRIVER_NAME"lan78xx" -#define DRIVER_VERSION "1.0.4" +#define DRIVER_VERSION "1.0.5" #define TX_TIMEOUT_JIFFIES (5 * HZ) #define THROTTLE_JIFFIES (HZ / 8) @@ -296,6 +300,13 @@ struct statstage { struct lan78xx_statstage64 curr_stat; }; +struct irq_domain_data { + struct irq_domain *irqdomain; + unsigned intirq_base; + struct irq_chip *irqchip; + irq_flow_handler_t irq_handler; +}; + struct lan78xx_net { struct net_device *net; struct usb_device *udev; @@ -351,6 +362,8 @@ struct lan78xx_net { int delta; struct statstagestats; + + struct irq_domain_data domain_data; }; /* use ethtool to change the level for any given device */ @@ -1120,8 +1133,6 @@ static int lan78xx_link_reset(struct lan78xx_net *dev) if (unlikely(ret < 0)) return -EIO; - phy_mac_interrupt(phydev, 0); - del_timer(&dev->stat_monitor); } else if (phydev->link && !dev->link_on) { dev->link_on = true; @@ -1163,7 +1174,6 @@ static int lan78xx_link_reset(struct lan78xx_net *dev) ret = lan78xx_update_flowcontrol(dev, ecmd.base.duplex, ladv, radv); - phy_mac_interrupt(phydev, 1); if (!timer_pending(&dev->stat_monitor)) { dev->delta = 1; @@ -1202,7 +1212,10 @@ static void lan78xx_status(struct lan78xx_net *dev, struct urb *urb) if 
(intdata & INT_ENP_PHY_INT) { netif_dbg(dev, link, dev->net, "PHY INTR: 0x%08x\n", intdata); - lan78xx_defer_kevent(dev, EVENT_LINK_RESET); + lan78xx_defer_kevent(dev, EVENT_LINK_RESET); + + if (dev->domain_data.irq_base > 0) + generic_handle_irq(dev->domain_data.irq_base); } else netdev_warn(dev->net, "unexpected interrupt: 0x%08x\n", intdata); @@ -1844,6 +1857,103 @@ static void lan78xx_link_status_change(struct net_device *net) } } +static int irq_map(struct irq_domain *d, unsigned int irq, + irq_hw_number_t hwirq) +{ + struct irq_domain_data *data = d->host_data; + + irq_set_chip_data(irq, data); + irq_set_chip_and_handler(irq, data->irqchip, data->irq_handler); + irq_set_noprobe(irq); + + return 0; +} + +static void irq_unmap(struct irq_domain *d, unsigned int irq) +{ + irq_set_chip_and_handler(irq, NULL, NULL); + irq_set_chip_data(irq, NULL); +} + +static const struct irq_domain_ops chip_domain_ops = { + .map= irq_map, + .unmap = irq_unmap, +}; + +static void lan78xx_irq_mask(struct irq_data *irqd) +{ + struct irq_domain_data *data = irq_data_get_irq_chip_data(irqd); + struct lan78xx_net *dev = + container_of(data, struct lan78xx_net, domain_data); + u32 buf; + + lan78xx_read_reg(dev, INT_EP_CTL, &buf); + buf &= ~INT_EP_PHY_INT_EN_; + lan78xx_write_reg(dev, INT_EP_CTL, buf); +} + +static void lan78xx_irq_unmask(struct irq_data *irqd) +{ + struct irq_domain_data *data = irq_data_get_irq_chip_data(irqd); + struct lan78xx_net *dev = + container_of(data, struct lan78xx_net, domain_data); + u32 buf; + + lan78xx_read_reg(dev, INT_EP_CTL, &buf); + buf |= INT_EP_PHY_INT_EN_; + lan78xx_write_reg(dev, INT_EP_CTL, buf); +} + +static struct irq_chip lan78xx_irqchip = { + .name = "lan78xx-phyirq", + .irq_mask = lan78xx_irq_mask, + .irq_unmask = lan78xx_irq_unmask, +}; + +static int lan78xx_setup_irq_domain(struct lan78xx_net *dev) +{ + struct device_node *of_node; + struct irq_domain *irqdomain; + unsigned int irq_base = 0; + int ret = 0; + + of_node = 
dev->udev->dev.parent->of_node; + + dev->domain_data.irqchip = &lan78xx_irqchip; + dev->domain_data.ir
net-next still on 4.8.0?
Hi Dave During the merge window, you didn't close net-next. Trying something new. We are now at v4.9-rc2, yet net-next is still based on 4.8.0. Is this also something new you are trying? Or do you still plan to rebase to an -rcX at some point? Thanks Andrew
Re: [PATCH v2] net: cpsw: fix obtaining mac address for am3517
Hello Tony, On 28-10-16 17:52, Tony Lindgren wrote: * Jeroen Hofstee [161028 08:33]: Commit b6745f6e4e63 ("drivers: net: cpsw: davinci_emac: move reading mac id to common file") did not only move the code for an am3517, it also added the slave parameter, resulting in an invalid (all zero) mac address being returned for an am3517, since it only has a single emac and the slave parameter is pointing to the second. So simply always read the first and valid mac-address for a ti,am3517-emac. And other davinci_emac.c users can have more than one. So is the reason the slave parameter points to the second instance because of the location in the hardware? Sort of, the slave parameter is determined by whether there are one or two register ranges associated with the davinci_emac. In davinci_emac.c res_ctrl = platform_get_resource(pdev, IORESOURCE_MEM, 1); ... rc = davinci_emac_try_get_mac(pdev, res_ctrl ? 0 : 1, priv->mac_addr); So if there are two ranges, the slave param becomes 0. If there is only one, it will be 1. Since the am3517 only has a single regs entry it ends up with slave 1, while there is only a single davinci_emac.
Regards, Jeroen arch/arm/boot/dts/dm816x.dtsi - eth0: ethernet@4a10 { compatible = "ti,dm816-emac"; ti,hwmods = "emac0"; reg = <0x4a10 0x800 0x4a100900 0x3700>; clocks = <&sysclk24_ck>; syscon = <&scm_conf>; ti,davinci-ctrl-reg-offset = <0>; ti,davinci-ctrl-mod-reg-offset = <0x900>; ti,davinci-ctrl-ram-offset = <0x2000>; ti,davinci-ctrl-ram-size = <0x2000>; interrupts = <40 41 42 43>; phy-handle = <&phy0>; }; eth1: ethernet@4a12 { compatible = "ti,dm816-emac"; ti,hwmods = "emac1"; reg = <0x4a12 0x4000>; clocks = <&sysclk24_ck>; syscon = <&scm_conf>; ti,davinci-ctrl-reg-offset = <0>; ti,davinci-ctrl-mod-reg-offset = <0x900>; ti,davinci-ctrl-ram-offset = <0x2000>; ti,davinci-ctrl-ram-size = <0x2000>; interrupts = <44 45 46 47>; phy-handle = <&phy1>; }; arch/arm/boot/dts/am3517.dtsi --- davinci_emac: ethernet@0x5c00 { compatible = "ti,am3517-emac"; ti,hwmods = "davinci_emac"; status = "disabled"; reg = <0x5c00 0x3>; interrupts = <67 68 69 70>; syscon = <&scm_conf>; ti,davinci-ctrl-reg-offset = <0x1>; ti,davinci-ctrl-mod-reg-offset = <0>; ti,davinci-ctrl-ram-offset = <0x2>; ti,davinci-ctrl-ram-size = <0x2000>; ti,davinci-rmii-en = /bits/ 8 <1>; local-mac-address = [ 00 00 00 00 00 00 ]; };
[PATCH net-next v3]ldmvsw: tx queue stuck in stopped state after LDC reset
From: Aaron Young The following patch fixes an issue with the ldmvsw driver where the network connection of a guest domain becomes non-functional after the guest domain has panic'd and rebooted. The root cause was determined to be from the following series of events: 1. Guest domain panics - resulting in the guest no longer processing network packets (from ldmvsw driver) 2. The ldmvsw driver (in the control domain) eventually exerts flow control due to no more available tx drings and stops the tx queue for the guest domain 3. The LDC of the network connection for the guest is reset when the guest domain reboots after the panic. 4. The LDC reset event is received by the ldmvsw driver and the ldmvsw responds by clearing the tx queue for the guest. 5. ldmvsw waits indefinitely for a DATA ACK from the guest - which is the normal method to re-enable the tx queue. But the ACK never comes because the tx queue was cleared due to the LDC reset. To fix this issue, in addition to clearing the tx queue, re-enable the tx queue on a LDC reset. This prevents the ldmvsw from getting caught in this deadlocked state of waiting for a DATA ACK which will never come. Signed-off-by: Aaron Young Acked-by: Sowmini Varadhan --- drivers/net/ethernet/sun/sunvnet_common.c | 16 +--- 1 files changed, 13 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/sun/sunvnet_common.c b/drivers/net/ethernet/sun/sunvnet_common.c index 58efe69..8878b75 100644 --- a/drivers/net/ethernet/sun/sunvnet_common.c +++ b/drivers/net/ethernet/sun/sunvnet_common.c @@ -704,9 +704,8 @@ static int handle_mcast(struct vnet_port *port, void *msgbuf) return 0; } -/* Got back a STOPPED LDC message on port. If the queue is stopped, - * wake it up so that we'll send out another START message at the - * next TX. +/* If the queue is stopped, wake it up so that we'll + * send out another START message at the next TX. 
*/ static void maybe_tx_wakeup(struct vnet_port *port) { @@ -734,6 +733,7 @@ bool sunvnet_port_is_up_common(struct vnet_port *vnet) static int vnet_event_napi(struct vnet_port *port, int budget) { + struct net_device *dev = VNET_PORT_TO_NET_DEVICE(port); struct vio_driver_state *vio = &port->vio; int tx_wakeup, err; int npkts = 0; @@ -747,6 +747,16 @@ static int vnet_event_napi(struct vnet_port *port, int budget) if (event == LDC_EVENT_RESET) { vnet_port_reset(port); vio_port_up(vio); + + /* If the device is running but its tx queue was +* stopped (due to flow control), restart it. +* This is necessary since vnet_port_reset() +* clears the tx drings and thus we may never get +* back a VIO_TYPE_DATA ACK packet - which is +* the normal mechanism to restart the tx queue. +*/ + if (netif_running(dev)) + maybe_tx_wakeup(port); } port->rx_event = 0; return 0; -- 1.7.1
Re: [PATCH net-next RFC WIP] Patch for XDP support for virtio_net
On Fri, Oct 28, 2016 at 05:18:12PM +0100, Jakub Kicinski wrote: > On Fri, 28 Oct 2016 08:56:35 -0700, John Fastabend wrote: > > On 16-10-27 07:10 PM, David Miller wrote: > > > From: Alexander Duyck > > > Date: Thu, 27 Oct 2016 18:43:59 -0700 > > > > > >> On Thu, Oct 27, 2016 at 6:35 PM, David Miller > > >> wrote: > > >>> From: "Michael S. Tsirkin" > > >>> Date: Fri, 28 Oct 2016 01:25:48 +0300 > > >>> > > On Thu, Oct 27, 2016 at 05:42:18PM -0400, David Miller wrote: > > > From: "Michael S. Tsirkin" > > > Date: Fri, 28 Oct 2016 00:30:35 +0300 > > > > > >> Something I'd like to understand is how does XDP address the > > >> problem that 100Byte packets are consuming 4K of memory now. > > > > > > Via page pools. We're going to make a generic one, but right now > > > each and every driver implements a quick list of pages to allocate > > > from (and thus avoid the DMA man/unmap overhead, etc.) > > > > So to clarify, ATM virtio doesn't attempt to avoid dma map/unmap > > so there should be no issue with that even when using sub/page > > regions, assuming DMA APIs support sub-page map/unmap correctly. > > >>> > > >>> That's not what I said. > > >>> > > >>> The page pools are meant to address the performance degradation from > > >>> going to having one packet per page for the sake of XDP's > > >>> requirements. > > >>> > > >>> You still need to have one packet per page for correct XDP operation > > >>> whether you do page pools or not, and whether you have DMA mapping > > >>> (or it's equivalent virutalization operation) or not. > > >> > > >> Maybe I am missing something here, but why do you need to limit things > > >> to one packet per page for correct XDP operation? 
Most of the drivers > > >> out there now are usually storing something closer to at least 2 > > >> packets per page, and with the DMA API fixes I am working on there > > >> should be no issue with changing the contents inside those pages since > > >> we won't invalidate or overwrite the data after the DMA buffer has > > >> been synchronized for use by the CPU. > > > > > > Because with SKB's you can share the page with other packets. > > > > > > With XDP you simply cannot. > > > > > > It's software semantics that are the issue. SKB frag list pages > > > are read only, XDP packets are writable. > > > > > > This has nothing to do with "writability" of the pages wrt. DMA > > > mapping or cpu mappings. > > > > > > > Sorry I'm not seeing it either. The current xdp_buff is defined > > by, > > > > struct xdp_buff { > > void *data; > > void *data_end; > > }; > > > > The verifier has an xdp_is_valid_access() check to ensure we don't go > > past data_end. The page for now at least never leaves the driver. For > > the work to get xmit to other devices working I'm still not sure I see > > any issue. > > +1 > > Do we want to make the packet-per-page a requirement because it could > be useful in the future from architectural standpoint? I guess there > is a trade-off here between having the comfort of people following this > requirement today and making driver support for XDP even more complex. It looks to me that packet per page makes drivers simpler instead of complex. ixgbe split-page and mlx* order-3/5 tricks are definitely complex. The skb truesize concerns come from the host when data is delivered to user space and we need to have precise memory accounting for different applications and different users. XDP is all root and imo it's far away from the days when multi-user non-root issues start to pop up. At the same time XDP doesn't require to use 4k buffer in something like Netronome. If xdp bpf program can be offloaded into HW with 1800 byte buffers, great! 
For x86 cpu the 4k byte is a natural allocation chunk. Anything lower requires delicate dma tricks paired with an even more complex slab allocator and atomic refcnts. All of that can only drive the performance down. Compared to kernel bypass, xdp is using 4k pages whereas dpdk has to use huge pages, so xdp is saving a ton of memory already!
Re: [PATCH] mac80211: fix incorrect error return path on tmp allocation failure
On Fri, 2016-10-28 at 19:08 +0100, Colin King wrote: > From: Colin Ian King > > The current exit path when tmp fails to be allocated is via the > fail label which frees tfm2 which has not yet been allocated, > which is problematic since tfm2 is not initialized and is a garbage > pointer. Fix this by exiting directly to the return at the end > of the function and hence avoiding the freeing of tfm2. Yeah, thanks. Arnd beat you to the fix by about 8 hours, so I've already applied his patch and sent an updated pull request :) johannes
Re: [PATCH v2] net: cpsw: fix obtaining mac address for am3517
* Jeroen Hofstee [161028 11:19]: > Hello Tony, > > On 28-10-16 17:52, Tony Lindgren wrote: > > * Jeroen Hofstee [161028 08:33]: > > > Commit b6745f6e4e63 ("drivers: net: cpsw: davinci_emac: move reading mac > > > id to common file") did not only move the code for an am3517, it also > > > added the slave parameter, resulting in an invalid (all zero) mac address > > > being returned for an am3517, since it only has a single emac and the > > > slave > > > parameter is pointing to the second. So simply always read the first and > > > valid mac-address for a ti,am3517-emac. > > And others davinci_emac.c users can have more than one. So is the > > reason the slave parameter points to the second instance because > > of the location in the hardware? > > Sort of, the slave parameter gets determined by the fact if there is one > or two register range(s) associated with the davinci_emac. In davinci_emac.c > > res_ctrl = platform_get_resource(pdev, IORESOURCE_MEM, 1); > ... > rc = davinci_emac_try_get_mac(pdev, res_ctrl ? 0 : 1, > priv->mac_addr); > > So it there are two ranges, the slave param becomes 0. It there is only one, > it > will be 1. Since the am3517 only has a single regs entry it ends up with > slave 1, > while there is only a single davinci_emac. OK thanks for clarifying it: Acked-by: Tony Lindgren
Re: [net-next PATCH 03/27] swiotlb: Add support for DMA_ATTR_SKIP_CPU_SYNC
On Fri, Oct 28, 2016 at 10:34 AM, Konrad Rzeszutek Wilk wrote: > On Tue, Oct 25, 2016 at 11:37:03AM -0400, Alexander Duyck wrote: >> As a first step to making DMA_ATTR_SKIP_CPU_SYNC apply to architectures >> beyond just ARM I need to make it so that the swiotlb will respect the >> flag. In order to do that I also need to update the swiotlb-xen since it >> heavily makes use of the functionality. >> >> Cc: Konrad Rzeszutek Wilk > > I am pretty sure I acked it the RFC. Was there a particular > reason (this is very different from the RFC?) you dropped my ACk? > > Thanks. If I recall you had acked patch 1, but for 2 you had some review comments on and suggested I change a few things. What was patch 2 in the RFC was split out into patches 2 and 3. That is why I didn't include an Ack from you for those patches. Patch 2 is a fix for Xen to address the fact that you could return either 0 or ~0. It was part of patch 2 originally and I pulled it out into a separate patch. Patch 3 does most of what patch 2 in the RFC was doing before with fixes to address the fact that I was moving some code to avoid going over 80 characters. I found a different way to fix that by just updating attrs before using it instead of ORing in the value when passing it as a parameter. >> @@ -558,11 +560,12 @@ void xen_swiotlb_unmap_page(struct device *hwdev, >> dma_addr_t dev_addr, >> >> start_dma_addr, >>sg_phys(sg), >>sg->length, >> - dir); >> + dir, attrs); >> if (map == SWIOTLB_MAP_ERROR) { >> dev_warn(hwdev, "swiotlb buffer is full\n"); >> /* Don't panic here, we expect map_sg users >> to do proper error handling. */ >> + attrs |= DMA_ATTR_SKIP_CPU_SYNC; >> xen_swiotlb_unmap_sg_attrs(hwdev, sgl, i, dir, >> attrs); >> sg_dma_len(sgl) = 0; The biggest difference from patch 2 in the RFC is right here. This code before was moving this off to the end of the function and adding a label which I then jumped to. I just ORed the DMA_ATTR_SKIP_CPU_SYNC into attrs and skipped the problem entirely. 
It should be harmless to do it this way since attrs isn't used anywhere else once we have had the error. I hope that helps to clear it up. So if you want, I will add your Acked-by for patches 2 and 3, but I just wanted to make sure this worked with the changes you suggested. Thanks. - Alex
[PATCH] mac80211: fix incorrect error return path on tmp allocation failure
From: Colin Ian King The current exit path when tmp fails to be allocated is via the fail label which frees tfm2 which has not yet been allocated, which is problematic since tfm2 is not initialized and is a garbage pointer. Fix this by exiting directly to the return at the end of the function and hence avoiding the freeing of tfm2. Signed-off-by: Colin Ian King --- net/mac80211/fils_aead.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/mac80211/fils_aead.c b/net/mac80211/fils_aead.c index b81b4f24..c114737 100644 --- a/net/mac80211/fils_aead.c +++ b/net/mac80211/fils_aead.c @@ -112,7 +112,7 @@ static int aes_siv_encrypt(const u8 *key, size_t key_len, tmp = kmemdup(plain, plain_len, GFP_KERNEL); if (!tmp) { res = -ENOMEM; - goto fail; + goto fail_ret; } /* IV for CTR before encrypted data */ @@ -150,6 +150,7 @@ static int aes_siv_encrypt(const u8 *key, size_t key_len, fail: kfree(tmp); crypto_free_skcipher(tfm2); +fail_ret: return res; } -- 2.9.3
Re: [PATCH for-next 00/14][PULL request] Mellanox mlx5 core driver updates 2016-10-25
I really dislike pull requests of this form. You add lots of data structures and helper functions but no actual users of these facilities to the driver. Do this instead:

1) Add TSAR infrastructure
2) Add use of TSAR facilities to the driver

That's one pull request. I don't care if this is hard, or if there are entanglements with Infiniband or whatever, you must submit changes in this manner. I will not accept additions to a driver that don't even get really used.
Re: [PATCH net-next] udp: do fwd memory scheduling on dequeue
On Fri, 2016-10-28 at 10:16 -0700, Eric Dumazet wrote: > Nice ! > > I was working on this as well and my implementation was somewhat > different. This is my WIP Note this can be split in two parts. 1) One adding struct sock *sk param to ip_cmsg_recv_offset() This was because I left skb->sk NULL for skbs stored in receive queue. You chose instead to set skb->sk, which is unusual (check skb_orphan() BUG_ON()) 2) Udp changes. Tell me what you think, thanks again ! diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 601258f6e621..52e70431abc9 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3033,9 +3033,13 @@ int __skb_wait_for_more_packets(struct sock *sk, int *err, long *timeo_p, const struct sk_buff *skb); struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned flags, int *peeked, int *off, int *err, - struct sk_buff **last); + struct sk_buff **last, + void (*destructor)(struct sock *sk, + struct sk_buff *skb)); struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned flags, - int *peeked, int *off, int *err); + int *peeked, int *off, int *err, + void (*destructor)(struct sock *sk, + struct sk_buff *skb)); struct sk_buff *skb_recv_datagram(struct sock *sk, unsigned flags, int noblock, int *err); unsigned int datagram_poll(struct file *file, struct socket *sock, diff --git a/include/net/ip.h b/include/net/ip.h index 79b102ffbe16..f91fc376f3fb 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -572,7 +572,8 @@ int ip_options_rcv_srr(struct sk_buff *skb); */ void ipv4_pktinfo_prepare(const struct sock *sk, struct sk_buff *skb); -void ip_cmsg_recv_offset(struct msghdr *msg, struct sk_buff *skb, int tlen, int offset); +void ip_cmsg_recv_offset(struct msghdr *msg, struct sock *sk, +struct sk_buff *skb, int tlen, int offset); int ip_cmsg_send(struct sock *sk, struct msghdr *msg, struct ipcm_cookie *ipc, bool allow_ipv6); int ip_setsockopt(struct sock *sk, int level, int optname, char __user *optval, @@ -594,7 +595,7 @@ 
void ip_local_error(struct sock *sk, int err, __be32 daddr, __be16 dport, static inline void ip_cmsg_recv(struct msghdr *msg, struct sk_buff *skb) { - ip_cmsg_recv_offset(msg, skb, 0, 0); + ip_cmsg_recv_offset(msg, skb->sk, skb, 0, 0); } bool icmp_global_allow(void); diff --git a/include/net/udp.h b/include/net/udp.h index 18f1e6b91927..0178f4552c4d 100644 --- a/include/net/udp.h +++ b/include/net/udp.h @@ -248,6 +248,7 @@ static inline __be16 udp_flow_src_port(struct net *net, struct sk_buff *skb, /* net/ipv4/udp.c */ void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len); int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb); +void udp_skb_destructor(struct sock *sk, struct sk_buff *skb); void udp_v4_early_demux(struct sk_buff *skb); int udp_get_port(struct sock *sk, unsigned short snum, diff --git a/net/core/datagram.c b/net/core/datagram.c index bfb973aebb5b..48228d15f33c 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -198,7 +198,8 @@ static struct sk_buff *skb_set_peeked(struct sk_buff *skb) */ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags, int *peeked, int *off, int *err, - struct sk_buff **last) + struct sk_buff **last, + void (*destructor)(struct sock *sk, struct sk_buff *skb)) { struct sk_buff_head *queue = &sk->sk_receive_queue; struct sk_buff *skb; @@ -241,9 +242,11 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags, } atomic_inc(&skb->users); - } else + } else { __skb_unlink(skb, queue); - + if (destructor) + destructor(sk, skb); + } spin_unlock_irqrestore(&queue->lock, cpu_flags); *off = _off; return skb; @@ -262,7 +265,9 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags, EXPORT_SYMBOL(__skb_try_recv_datagram); struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned int flags, - int *peeked, int *off, int *err) + int *peeked, int *off,
Re: [PATCH v2] LSO feature added to Cadence GEM driver
From: Rafal Ozieblo
Date: Tue, 25 Oct 2016 13:05:30 +0100

> New Cadence GEM hardware supports Large Segment Offload (LSO):
> TCP segmentation offload (TSO) as well as UDP fragmentation
> offload (UFO). Support for those features was added to the driver.
>
> Signed-off-by: Rafal Ozieblo

Patch does not apply cleanly to net-next, please respin.

Also please fix your Subject line; it should be of the form:

	[PATCH net-next v2] cadence: Add LSO support.
Re: [PATCH] net caif: insert missing spaces in pr_* messages and unbreak multi-line strings
From: Colin King
Date: Tue, 25 Oct 2016 12:18:42 +0100

> From: Colin Ian King
>
> Some of the pr_* messages are missing spaces, so insert these and also
> unbreak multi-line literal strings in pr_* messages.
>
> Signed-off-by: Colin Ian King

Applied.
Re: [patch net-next 0/5] mlxsw: small driver update
From: Jiri Pirko
Date: Thu, 27 Oct 2016 15:12:56 +0200

> For details, see individual patches.

Series applied, thanks.
Re: [patch net 0/2] mlxsw: Couple of fixes
From: Jiri Pirko
Date: Tue, 25 Oct 2016 11:25:55 +0200

> From: Jiri Pirko
>
> Couple of LPM tree management fixes.

Series applied, thanks Jiri.
Re: [net-next PATCH 02/27] swiotlb-xen: Enforce return of DMA_ERROR_CODE in mapping function
On Tue, Oct 25, 2016 at 11:36:58AM -0400, Alexander Duyck wrote: > The mapping function should always return DMA_ERROR_CODE when a mapping has > failed as this is what the DMA API expects when a DMA error has occurred. > The current function for mapping a page in Xen was returning either > DMA_ERROR_CODE or 0 depending on where it failed. > > On x86 DMA_ERROR_CODE is 0, but on other architectures such as ARM it is > ~0. We need to make sure we return the same error value if either the > mapping failed or the device is not capable of accessing the mapping. > > If we are returning DMA_ERROR_CODE as our error value we can drop the > function for checking the error code as the default is to compare the > return value against DMA_ERROR_CODE if no function is defined. > > Cc: Konrad Rzeszutek Wilk I am pretty sure I gave an Ack. Any particular reason from dropping it (if so, please add a comment under the --- of the reason). Thanks. > Signed-off-by: Alexander Duyck > --- > arch/arm/xen/mm.c |1 - > arch/x86/xen/pci-swiotlb-xen.c |1 - > drivers/xen/swiotlb-xen.c | 18 ++ > include/xen/swiotlb-xen.h |3 --- > 4 files changed, 6 insertions(+), 17 deletions(-) > > diff --git a/arch/arm/xen/mm.c b/arch/arm/xen/mm.c > index d062f08..bd62d94 100644 > --- a/arch/arm/xen/mm.c > +++ b/arch/arm/xen/mm.c > @@ -186,7 +186,6 @@ void xen_destroy_contiguous_region(phys_addr_t pstart, > unsigned int order) > EXPORT_SYMBOL(xen_dma_ops); > > static struct dma_map_ops xen_swiotlb_dma_ops = { > - .mapping_error = xen_swiotlb_dma_mapping_error, > .alloc = xen_swiotlb_alloc_coherent, > .free = xen_swiotlb_free_coherent, > .sync_single_for_cpu = xen_swiotlb_sync_single_for_cpu, > diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c > index 0e98e5d..a9fafb5 100644 > --- a/arch/x86/xen/pci-swiotlb-xen.c > +++ b/arch/x86/xen/pci-swiotlb-xen.c > @@ -19,7 +19,6 @@ > int xen_swiotlb __read_mostly; > > static struct dma_map_ops xen_swiotlb_dma_ops = { > - .mapping_error = 
xen_swiotlb_dma_mapping_error, > .alloc = xen_swiotlb_alloc_coherent, > .free = xen_swiotlb_free_coherent, > .sync_single_for_cpu = xen_swiotlb_sync_single_for_cpu, > diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c > index 87e6035..b8014bf 100644 > --- a/drivers/xen/swiotlb-xen.c > +++ b/drivers/xen/swiotlb-xen.c > @@ -416,11 +416,12 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, > struct page *page, > /* >* Ensure that the address returned is DMA'ble >*/ > - if (!dma_capable(dev, dev_addr, size)) { > - swiotlb_tbl_unmap_single(dev, map, size, dir); > - dev_addr = 0; > - } > - return dev_addr; > + if (dma_capable(dev, dev_addr, size)) > + return dev_addr; > + > + swiotlb_tbl_unmap_single(dev, map, size, dir); > + > + return DMA_ERROR_CODE; > } > EXPORT_SYMBOL_GPL(xen_swiotlb_map_page); > > @@ -648,13 +649,6 @@ void xen_swiotlb_unmap_page(struct device *hwdev, > dma_addr_t dev_addr, > } > EXPORT_SYMBOL_GPL(xen_swiotlb_sync_sg_for_device); > > -int > -xen_swiotlb_dma_mapping_error(struct device *hwdev, dma_addr_t dma_addr) > -{ > - return !dma_addr; > -} > -EXPORT_SYMBOL_GPL(xen_swiotlb_dma_mapping_error); > - > /* > * Return whether the given device DMA address mask can be supported > * properly. For example, if your device can only drive the low 24-bits > diff --git a/include/xen/swiotlb-xen.h b/include/xen/swiotlb-xen.h > index 7c35e27..a0083be 100644 > --- a/include/xen/swiotlb-xen.h > +++ b/include/xen/swiotlb-xen.h > @@ -51,9 +51,6 @@ extern void xen_swiotlb_unmap_page(struct device *hwdev, > dma_addr_t dev_addr, > int nelems, enum dma_data_direction dir); > > extern int > -xen_swiotlb_dma_mapping_error(struct device *hwdev, dma_addr_t dma_addr); > - > -extern int > xen_swiotlb_dma_supported(struct device *hwdev, u64 mask); > > extern int >
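The point of the patch — the DMA "error cookie" differs per architecture, so the map function must return DMA_ERROR_CODE itself rather than a hard-coded 0 — can be sketched in userspace. The values and the dma_capable() stand-in below are illustrative, not the kernel implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;

/* ARM-style cookie; x86 uses 0 instead, which is why a bare
 * "return 0" on failure was only accidentally correct there. */
#define DMA_ERROR_CODE ((dma_addr_t)~0)

/* Can the device, limited to `mask`, reach this whole region? */
static int dma_capable(dma_addr_t addr, size_t size, dma_addr_t mask)
{
	return addr + size - 1 <= mask;
}

/* Map stub: return the address, or DMA_ERROR_CODE on failure --
 * never a literal 0. */
static dma_addr_t map_page(dma_addr_t addr, size_t size, dma_addr_t mask)
{
	if (dma_capable(addr, size, mask))
		return addr;
	return DMA_ERROR_CODE;
}
```

With this shape, the default error check (compare against DMA_ERROR_CODE) works on every architecture, so the custom .mapping_error hook can be dropped, as the patch does.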
Re: [net-next PATCH 03/27] swiotlb: Add support for DMA_ATTR_SKIP_CPU_SYNC
On Tue, Oct 25, 2016 at 11:37:03AM -0400, Alexander Duyck wrote: > As a first step to making DMA_ATTR_SKIP_CPU_SYNC apply to architectures > beyond just ARM I need to make it so that the swiotlb will respect the > flag. In order to do that I also need to update the swiotlb-xen since it > heavily makes use of the functionality. > > Cc: Konrad Rzeszutek Wilk I am pretty sure I acked it the RFC. Was there a particular reason (this is very different from the RFC?) you dropped my ACk? Thanks. > Signed-off-by: Alexander Duyck > --- > drivers/xen/swiotlb-xen.c | 11 +++--- > include/linux/swiotlb.h |6 -- > lib/swiotlb.c | 48 > +++-- > 3 files changed, 40 insertions(+), 25 deletions(-) > > diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c > index b8014bf..3d048af 100644 > --- a/drivers/xen/swiotlb-xen.c > +++ b/drivers/xen/swiotlb-xen.c > @@ -405,7 +405,8 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, > struct page *page, >*/ > trace_swiotlb_bounced(dev, dev_addr, size, swiotlb_force); > > - map = swiotlb_tbl_map_single(dev, start_dma_addr, phys, size, dir); > + map = swiotlb_tbl_map_single(dev, start_dma_addr, phys, size, dir, > + attrs); > if (map == SWIOTLB_MAP_ERROR) > return DMA_ERROR_CODE; > > @@ -419,7 +420,8 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, > struct page *page, > if (dma_capable(dev, dev_addr, size)) > return dev_addr; > > - swiotlb_tbl_unmap_single(dev, map, size, dir); > + swiotlb_tbl_unmap_single(dev, map, size, dir, > + attrs | DMA_ATTR_SKIP_CPU_SYNC); > > return DMA_ERROR_CODE; > } > @@ -445,7 +447,7 @@ static void xen_unmap_single(struct device *hwdev, > dma_addr_t dev_addr, > > /* NOTE: We use dev_addr here, not paddr! 
*/ > if (is_xen_swiotlb_buffer(dev_addr)) { > - swiotlb_tbl_unmap_single(hwdev, paddr, size, dir); > + swiotlb_tbl_unmap_single(hwdev, paddr, size, dir, attrs); > return; > } > > @@ -558,11 +560,12 @@ void xen_swiotlb_unmap_page(struct device *hwdev, > dma_addr_t dev_addr, >start_dma_addr, >sg_phys(sg), >sg->length, > - dir); > + dir, attrs); > if (map == SWIOTLB_MAP_ERROR) { > dev_warn(hwdev, "swiotlb buffer is full\n"); > /* Don't panic here, we expect map_sg users > to do proper error handling. */ > + attrs |= DMA_ATTR_SKIP_CPU_SYNC; > xen_swiotlb_unmap_sg_attrs(hwdev, sgl, i, dir, > attrs); > sg_dma_len(sgl) = 0; > diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h > index e237b6f..4517be9 100644 > --- a/include/linux/swiotlb.h > +++ b/include/linux/swiotlb.h > @@ -44,11 +44,13 @@ enum dma_sync_target { > extern phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, > dma_addr_t tbl_dma_addr, > phys_addr_t phys, size_t size, > - enum dma_data_direction dir); > + enum dma_data_direction dir, > + unsigned long attrs); > > extern void swiotlb_tbl_unmap_single(struct device *hwdev, >phys_addr_t tlb_addr, > - size_t size, enum dma_data_direction dir); > + size_t size, enum dma_data_direction dir, > + unsigned long attrs); > > extern void swiotlb_tbl_sync_single(struct device *hwdev, > phys_addr_t tlb_addr, > diff --git a/lib/swiotlb.c b/lib/swiotlb.c > index 47aad37..b538d39 100644 > --- a/lib/swiotlb.c > +++ b/lib/swiotlb.c > @@ -425,7 +425,8 @@ static void swiotlb_bounce(phys_addr_t orig_addr, > phys_addr_t tlb_addr, > phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, > dma_addr_t tbl_dma_addr, > phys_addr_t orig_addr, size_t size, > -enum dma_data_direction dir) > +enum dma_data_direction dir, > +unsigned long attrs) > { > unsigned long flags; > phys_addr_t tlb_addr; > @@ -526,7 +527,8 @@ phys_addr_t swiotlb_tbl_map_single(struct
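The error-path use of the attribute can be modeled minimally: when a freshly created mapping is torn down before the CPU ever touched it, there is no data to bounce back, so the unmap ORs in DMA_ATTR_SKIP_CPU_SYNC. The flag value below mirrors the kernel's, but treat it as an assumption here:

```c
#include <assert.h>

#define DMA_ATTR_SKIP_CPU_SYNC (1UL << 5)	/* illustrative value */

static int sync_count;

/* Unmap stub: the expensive bounce-buffer copy back to the CPU
 * buffer only happens when the caller did not ask to skip it. */
static void unmap_single(unsigned long attrs)
{
	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
		sync_count++;
}
```

This is exactly what the xen-swiotlb error paths in the patch do: `attrs | DMA_ATTR_SKIP_CPU_SYNC` on the cleanup unmap, avoiding a pointless sync of data that was never valid.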
Re: pull request (net-next): ipsec-next 2016-10-25
From: Steffen Klassert
Date: Tue, 25 Oct 2016 07:05:40 +0200

> Just a leftover from the last development cycle.
>
> 1) Remove some unused code, from Florian Westphal.
>
> Please pull or let me know if there are problems.

Pulled, thanks!
Re: [PATCH net-next] udp: do fwd memory scheduling on dequeue
Nice !

I was working on this as well, and my implementation was somewhat
different. I then stopped when I found the UDP checksum bug and was
waiting for net to be merged into net-next
( https://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=10df8e6152c6c400a563a673e9956320bfce1871 )

On Fri, 2016-10-28 at 15:20 +0200, Paolo Abeni wrote:
> A new argument is added to __skb_recv_datagram; it allows
> the caller to perform protocol specific action on dequeue
> under the receive queue spinlock.
> The UDP protocol uses such argument to perform fwd memory
> reclaiming on dequeue, while protocol memory and rmem updating
> is delayed after the lock release, to keep the time spent
> under the lock as low as possible.
> The UDP specific skb destructor is not used anymore; instead,
> explicit memory reclaiming is performed at close() time and
> when skbs are removed from the receive queue.
> The in-kernel UDP protocol users now need to use an
> skb_recv_udp() variant instead of skb_recv_datagram() to
> properly perform memory accounting on dequeue.
>
> Overall, this allows acquiring the receive queue lock only
> once on dequeue.
>
> Tested using pktgen with random src port, 64 bytes packet,
> wire-speed on a 10G link as sender and udp_sink as the receiver,
> using an l4 tuple rxhash to stress the contention, and one or more
> udp_sink instances with reuseport.
>
> nr sinks	vanilla		patched
> 1		440		560
> 3		2150		2300
> 6		3650		3800
> 9		4450		4600
> 12		6250		6450
>
> Suggested-by: Eric Dumazet
> Acked-by: Hannes Frederic Sowa
> Signed-off-by: Paolo Abeni
> ---
>  include/linux/skbuff.h |  4
>  include/net/udp.h      | 10
>  net/core/datagram.c    | 11 ++---
>  net/ipv4/udp.c         | 65 --
>  net/ipv6/udp.c         |  3 +--
>  net/rxrpc/input.c      |  7 +++---
>  net/sunrpc/svcsock.c   |  2 +-
>  net/sunrpc/xprtsock.c  |  2 +-
>  net/unix/af_unix.c     |  4 ++--
>  9 files changed, 83 insertions(+), 25 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 601258f..dd171a9 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -3028,13 +3028,17 @@ static inline void skb_frag_list_init(struct sk_buff *skb)
>  #define skb_walk_frags(skb, iter)	\
>  	for (iter = skb_shinfo(skb)->frag_list; iter; iter = iter->next)
>
> +typedef void (*skb_dequeue_cb_t)(struct sock *sk, struct sk_buff *skb,
> +				 int flags);
>
>  int __skb_wait_for_more_packets(struct sock *sk, int *err, long *timeo_p,
>  				const struct sk_buff *skb);
>  struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned flags,
> +					skb_dequeue_cb_t dequeue_cb,
>  					int *peeked, int *off, int *err,
>  					struct sk_buff **last);
>  struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned flags,
> +				    skb_dequeue_cb_t dequeue_cb,
>  				    int *peeked, int *off, int *err);
>  struct sk_buff *skb_recv_datagram(struct sock *sk, unsigned flags, int noblock,
>  				  int *err);
> diff --git a/include/net/udp.h b/include/net/udp.h
> index 18f1e6b..983c861 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -48,6 +48,7 @@ struct udp_skb_cb {
>  	} header;
>  	__u16		cscov;
>  	__u8		partial_cov;
> +	int		fwd_memory_released;

I was not adding this field.
> }; > #define UDP_SKB_CB(__skb)((struct udp_skb_cb *)((__skb)->cb)) > > @@ -248,6 +249,15 @@ static inline __be16 udp_flow_src_port(struct net *net, > struct sk_buff *skb, > /* net/ipv4/udp.c */ > void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len); > int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb); > +struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags, int > noblock, > +int *peeked, int *off, int *err); > +static inline struct sk_buff *skb_recv_udp(struct sock *sk, unsigned int > flags, > +int noblock, int *err) > +{ > + int peeked, off = 0; > + > + return __skb_recv_udp(sk, flags, noblock, &peeked, &off, err); > +} > > void udp_v4_early_demux(struct sk_buff *skb); > int udp_get_port(struct sock *sk, unsigned short snum, > diff --git a/net/core/datagram.c b/net/core/datagram.c > index bfb973a..226548b 100644 > --- a/net/core/datagram.c > +++ b/net/core/datagram.c > @@ -165,6 +165,7 @@ static struct sk_buff *skb_set_peeked(struct sk_buff *skb) > * __skb_try_recv_datagram - Receive a datagram skbuff > * @sk: socket > * @flags: MSG_ flags > + * @dequeue_cb: invoked under the rece
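The dequeue-callback shape of the patch can be modeled in userspace. In the sketch below the struct layouts and the MSG_PEEK handling are simplified stand-ins for the kernel's; the point is that the callback runs only when the skb actually leaves the queue, which in the kernel happens under the receive queue lock:

```c
#include <assert.h>
#include <stddef.h>

struct sk_buff { int truesize; struct sk_buff *next; };
struct sock { int fwd_alloc; struct sk_buff *head; };

typedef void (*skb_dequeue_cb_t)(struct sock *sk, struct sk_buff *skb,
				 int flags);

#define MSG_PEEK 2	/* same value as the kernel flag, for illustration */

/* Dequeue one skb; on MSG_PEEK the skb stays queued and no
 * accounting callback is invoked. */
static struct sk_buff *recv_datagram(struct sock *sk, int flags,
				     skb_dequeue_cb_t dequeue_cb)
{
	struct sk_buff *skb = sk->head;

	if (!skb)
		return NULL;
	if (!(flags & MSG_PEEK)) {
		sk->head = skb->next;		/* __skb_unlink() equivalent */
		if (dequeue_cb)
			dequeue_cb(sk, skb, flags); /* under the lock in the kernel */
	}
	return skb;
}

/* UDP's callback: reclaim forward-allocated memory on dequeue. */
static void udp_dequeue_cb(struct sock *sk, struct sk_buff *skb, int flags)
{
	(void)flags;
	sk->fwd_alloc += skb->truesize;
}
```

The rmem and protocol-memory updates the commit message mentions would then happen after the lock is released, keeping the critical section short.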
Re: [PATCH net-next RFC WIP] Patch for XDP support for virtio_net
From: John Fastabend Date: Fri, 28 Oct 2016 08:56:35 -0700 > On 16-10-27 07:10 PM, David Miller wrote: >> From: Alexander Duyck >> Date: Thu, 27 Oct 2016 18:43:59 -0700 >> >>> On Thu, Oct 27, 2016 at 6:35 PM, David Miller wrote: From: "Michael S. Tsirkin" Date: Fri, 28 Oct 2016 01:25:48 +0300 > On Thu, Oct 27, 2016 at 05:42:18PM -0400, David Miller wrote: >> From: "Michael S. Tsirkin" >> Date: Fri, 28 Oct 2016 00:30:35 +0300 >> >>> Something I'd like to understand is how does XDP address the >>> problem that 100Byte packets are consuming 4K of memory now. >> >> Via page pools. We're going to make a generic one, but right now >> each and every driver implements a quick list of pages to allocate >> from (and thus avoid the DMA man/unmap overhead, etc.) > > So to clarify, ATM virtio doesn't attempt to avoid dma map/unmap > so there should be no issue with that even when using sub/page > regions, assuming DMA APIs support sub-page map/unmap correctly. That's not what I said. The page pools are meant to address the performance degradation from going to having one packet per page for the sake of XDP's requirements. You still need to have one packet per page for correct XDP operation whether you do page pools or not, and whether you have DMA mapping (or it's equivalent virutalization operation) or not. >>> >>> Maybe I am missing something here, but why do you need to limit things >>> to one packet per page for correct XDP operation? Most of the drivers >>> out there now are usually storing something closer to at least 2 >>> packets per page, and with the DMA API fixes I am working on there >>> should be no issue with changing the contents inside those pages since >>> we won't invalidate or overwrite the data after the DMA buffer has >>> been synchronized for use by the CPU. >> >> Because with SKB's you can share the page with other packets. >> >> With XDP you simply cannot. >> >> It's software semantics that are the issue. 
>> SKB frag list pages are read only, XDP packets are writable.
>>
>> This has nothing to do with "writability" of the pages wrt. DMA
>> mapping or cpu mappings.
>
> Sorry, I'm not seeing it either. The current xdp_buff is defined by:
>
>	struct xdp_buff {
>		void *data;
>		void *data_end;
>	};
>
> The verifier has an xdp_is_valid_access() check to ensure we don't go
> past data_end. The page, for now at least, never leaves the driver.
> For the work to get xmit to other devices working, I'm still not sure
> I see any issue.

I guess I can say that the packets must be "writable" until I'm blue
in the face, but I'll say it again: semantically writable pages are a
requirement. And if multiple packets share a page, this requirement is
not satisfied.

Also, we want to do several things in the future:

1) Allow push/pop of headers via eBPF code, which means we need
   headroom.

2) Transparently zero-copy pass packets into userspace. Basically, the
   user will have a semi-permanently mapped ring of all the packet
   pages sitting in the RX queue of the device and the page pool
   associated with it. This way we avoid all of the TLB flush/map
   overhead for the user's mapping of the packets, just as we avoid
   the DMA map/unmap overhead.

And that's just the beginning. I'm sure others can come up with more
reasons why we have this requirement.
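Point (1) — headroom for header push/pop — is easy to model in userspace. The sketch below extends the quoted xdp_buff with an assumed data_hard_start field marking the start of the packet's private page; it only works safely because no other packet shares that page:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal model of xdp_buff plus headroom tracking; data_hard_start
 * is an assumption for illustration, not the quoted definition. */
struct xdp_buff {
	unsigned char *data;
	unsigned char *data_end;
	unsigned char *data_hard_start;	/* start of the private page */
};

/* Push a header of `len` bytes in front of the packet. With two
 * packets sharing a page, growing one this way could overwrite the
 * other -- hence the one-packet-per-page requirement. */
static int xdp_push_header(struct xdp_buff *xdp, const void *hdr, size_t len)
{
	if (xdp->data - xdp->data_hard_start < (ptrdiff_t)len)
		return -1;	/* not enough headroom */
	xdp->data -= len;
	memcpy(xdp->data, hdr, len);
	return 0;
}
```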
Re: [net-next PATCH 00/27] Add support for DMA writable pages being writable by the network stack
From: Alexander Duyck
Date: Fri, 28 Oct 2016 08:48:01 -0700

> So the feedback for this set has been mostly just a few "Acked-by"s,
> and it looks like the series was marked as "Not Applicable" in
> patchwork. I was wondering what the correct merge strategy for this
> patch set should be going forward?

I marked it as not applicable because it's definitely not a networking
change, and merging it via my tree would be really inappropriate, even
though we need it for some infrastructure we want to build for
networking.

So you have to merge this upstream via a more appropriate path.

> I was wondering if I should be looking at breaking up the set and
> splitting it over a few different trees, or if I should just hold onto
> it and resubmit it when the merge window opens? My preference would
> be to submit it as a single set so I can know all the patches are
> present to avoid any possible regressions due to only part of the set
> being present.

I don't think you need to split it up.
Re: [PATCH net 2/2] geneve: avoid using stale geneve socket.
On Fri, Oct 28, 2016 at 7:27 AM, John W. Linville wrote:
> On Thu, Oct 27, 2016 at 11:51:56AM -0700, Pravin B Shelar wrote:
>> This patch is similar to the earlier vxlan patch.
>> Geneve device close operation frees the geneve socket. This
>> operation can race with the geneve-xmit function, which
>> dereferences the geneve socket. The following patch uses the RCU
>> mechanism to avoid this situation.
>>
>> Signed-off-by: Pravin B Shelar
>
> LGTM, although I reckon that Stephen's comment about RCU_INIT_POINTER
> applies to this patch as well.

The geneve code is a bit different; it uses rcu_assign_pointer() while
destroying the object, so we need the barrier.

> Either way...
>
> Acked-by: John W. Linville
[PATCH v2 net 1/2] vxlan: avoid using stale vxlan socket.
When vxlan device is closed vxlan socket is freed. This operation can race with vxlan-xmit function which dereferences vxlan socket. Following patch uses RCU mechanism to avoid this situation. Signed-off-by: Pravin B Shelar --- drivers/net/vxlan.c | 80 + include/net/vxlan.h | 4 +-- 2 files changed, 52 insertions(+), 32 deletions(-) diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index c1639a3..f3c2fa3 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -943,17 +943,20 @@ static bool vxlan_snoop(struct net_device *dev, static bool vxlan_group_used(struct vxlan_net *vn, struct vxlan_dev *dev) { struct vxlan_dev *vxlan; + struct vxlan_sock *sock4; + struct vxlan_sock *sock6 = NULL; unsigned short family = dev->default_dst.remote_ip.sa.sa_family; + sock4 = rtnl_dereference(dev->vn4_sock); + /* The vxlan_sock is only used by dev, leaving group has * no effect on other vxlan devices. */ - if (family == AF_INET && dev->vn4_sock && - atomic_read(&dev->vn4_sock->refcnt) == 1) + if (family == AF_INET && sock4 && atomic_read(&sock4->refcnt) == 1) return false; #if IS_ENABLED(CONFIG_IPV6) - if (family == AF_INET6 && dev->vn6_sock && - atomic_read(&dev->vn6_sock->refcnt) == 1) + sock6 = rtnl_dereference(dev->vn6_sock); + if (family == AF_INET6 && sock6 && atomic_read(&sock6->refcnt) == 1) return false; #endif @@ -961,10 +964,12 @@ static bool vxlan_group_used(struct vxlan_net *vn, struct vxlan_dev *dev) if (!netif_running(vxlan->dev) || vxlan == dev) continue; - if (family == AF_INET && vxlan->vn4_sock != dev->vn4_sock) + if (family == AF_INET && + rtnl_dereference(vxlan->vn4_sock) != sock4) continue; #if IS_ENABLED(CONFIG_IPV6) - if (family == AF_INET6 && vxlan->vn6_sock != dev->vn6_sock) + if (family == AF_INET6 && + rtnl_dereference(vxlan->vn6_sock) != sock6) continue; #endif @@ -1005,22 +1010,25 @@ static bool __vxlan_sock_release_prep(struct vxlan_sock *vs) static void vxlan_sock_release(struct vxlan_dev *vxlan) { - bool ipv4 = 
__vxlan_sock_release_prep(vxlan->vn4_sock); + struct vxlan_sock *sock4 = rtnl_dereference(vxlan->vn4_sock); #if IS_ENABLED(CONFIG_IPV6) - bool ipv6 = __vxlan_sock_release_prep(vxlan->vn6_sock); + struct vxlan_sock *sock6 = rtnl_dereference(vxlan->vn6_sock); + + rcu_assign_pointer(vxlan->vn6_sock, NULL); #endif + rcu_assign_pointer(vxlan->vn4_sock, NULL); synchronize_net(); - if (ipv4) { - udp_tunnel_sock_release(vxlan->vn4_sock->sock); - kfree(vxlan->vn4_sock); + if (__vxlan_sock_release_prep(sock4)) { + udp_tunnel_sock_release(sock4->sock); + kfree(sock4); } #if IS_ENABLED(CONFIG_IPV6) - if (ipv6) { - udp_tunnel_sock_release(vxlan->vn6_sock->sock); - kfree(vxlan->vn6_sock); + if (__vxlan_sock_release_prep(sock6)) { + udp_tunnel_sock_release(sock6->sock); + kfree(sock6); } #endif } @@ -1036,18 +1044,21 @@ static int vxlan_igmp_join(struct vxlan_dev *vxlan) int ret = -EINVAL; if (ip->sa.sa_family == AF_INET) { + struct vxlan_sock *sock4 = rtnl_dereference(vxlan->vn4_sock); struct ip_mreqn mreq = { .imr_multiaddr.s_addr = ip->sin.sin_addr.s_addr, .imr_ifindex= ifindex, }; - sk = vxlan->vn4_sock->sock->sk; + sk = sock4->sock->sk; lock_sock(sk); ret = ip_mc_join_group(sk, &mreq); release_sock(sk); #if IS_ENABLED(CONFIG_IPV6) } else { - sk = vxlan->vn6_sock->sock->sk; + struct vxlan_sock *sock6 = rtnl_dereference(vxlan->vn6_sock); + + sk = sock6->sock->sk; lock_sock(sk); ret = ipv6_stub->ipv6_sock_mc_join(sk, ifindex, &ip->sin6.sin6_addr); @@ -1067,18 +1078,21 @@ static int vxlan_igmp_leave(struct vxlan_dev *vxlan) int ret = -EINVAL; if (ip->sa.sa_family == AF_INET) { + struct vxlan_sock *sock4 = rtnl_dereference(vxlan->vn4_sock); struct ip_mreqn mreq = { .imr_multiaddr.s_addr = ip->sin.sin_addr.s_addr, .imr_ifindex= ifindex, }; - sk = vxlan->vn4_sock->sock->sk; + sk = sock4->sock->sk; lock_sock(sk); ret = ip_mc_leave_group(sk, &mreq); release_sock(sk
[PATCH v2 net 2/2] geneve: avoid using stale geneve socket.
This patch is similar to earlier vxlan patch. Geneve device close operation frees geneve socket. This operation can race with geneve-xmit function which dereferences geneve socket. Following patch uses RCU mechanism to avoid this situation. Signed-off-by: Pravin B Shelar Acked-by: John W. Linville --- drivers/net/geneve.c | 45 ++--- 1 file changed, 34 insertions(+), 11 deletions(-) diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c index 16af1ce..42edd7b 100644 --- a/drivers/net/geneve.c +++ b/drivers/net/geneve.c @@ -58,9 +58,9 @@ struct geneve_dev { struct hlist_node hlist; /* vni hash table */ struct net *net;/* netns for packet i/o */ struct net_device *dev;/* netdev for geneve tunnel */ - struct geneve_sock *sock4; /* IPv4 socket used for geneve tunnel */ + struct geneve_sock __rcu *sock4;/* IPv4 socket used for geneve tunnel */ #if IS_ENABLED(CONFIG_IPV6) - struct geneve_sock *sock6; /* IPv6 socket used for geneve tunnel */ + struct geneve_sock __rcu *sock6;/* IPv6 socket used for geneve tunnel */ #endif u8 vni[3]; /* virtual network ID for tunnel */ u8 ttl; /* TTL override */ @@ -543,9 +543,19 @@ static void __geneve_sock_release(struct geneve_sock *gs) static void geneve_sock_release(struct geneve_dev *geneve) { - __geneve_sock_release(geneve->sock4); + struct geneve_sock *gs4 = rtnl_dereference(geneve->sock4); #if IS_ENABLED(CONFIG_IPV6) - __geneve_sock_release(geneve->sock6); + struct geneve_sock *gs6 = rtnl_dereference(geneve->sock6); + + rcu_assign_pointer(geneve->sock6, NULL); +#endif + + rcu_assign_pointer(geneve->sock4, NULL); + synchronize_net(); + + __geneve_sock_release(gs4); +#if IS_ENABLED(CONFIG_IPV6) + __geneve_sock_release(gs6); #endif } @@ -586,10 +596,10 @@ static int geneve_sock_add(struct geneve_dev *geneve, bool ipv6) gs->flags = geneve->flags; #if IS_ENABLED(CONFIG_IPV6) if (ipv6) - geneve->sock6 = gs; + rcu_assign_pointer(geneve->sock6, gs); else #endif - geneve->sock4 = gs; + rcu_assign_pointer(geneve->sock4, gs); hash = 
geneve_net_vni_hash(geneve->vni); hlist_add_head_rcu(&geneve->hlist, &gs->vni_list[hash]); @@ -603,9 +613,7 @@ static int geneve_open(struct net_device *dev) bool metadata = geneve->collect_md; int ret = 0; - geneve->sock4 = NULL; #if IS_ENABLED(CONFIG_IPV6) - geneve->sock6 = NULL; if (ipv6 || metadata) ret = geneve_sock_add(geneve, true); #endif @@ -720,6 +728,9 @@ static struct rtable *geneve_get_v4_rt(struct sk_buff *skb, struct rtable *rt = NULL; __u8 tos; + if (!rcu_dereference(geneve->sock4)) + return ERR_PTR(-EIO); + memset(fl4, 0, sizeof(*fl4)); fl4->flowi4_mark = skb->mark; fl4->flowi4_proto = IPPROTO_UDP; @@ -772,11 +783,15 @@ static struct dst_entry *geneve_get_v6_dst(struct sk_buff *skb, { bool use_cache = ip_tunnel_dst_cache_usable(skb, info); struct geneve_dev *geneve = netdev_priv(dev); - struct geneve_sock *gs6 = geneve->sock6; struct dst_entry *dst = NULL; struct dst_cache *dst_cache; + struct geneve_sock *gs6; __u8 prio; + gs6 = rcu_dereference(geneve->sock6); + if (!gs6) + return ERR_PTR(-EIO); + memset(fl6, 0, sizeof(*fl6)); fl6->flowi6_mark = skb->mark; fl6->flowi6_proto = IPPROTO_UDP; @@ -842,7 +857,7 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev, struct ip_tunnel_info *info) { struct geneve_dev *geneve = netdev_priv(dev); - struct geneve_sock *gs4 = geneve->sock4; + struct geneve_sock *gs4; struct rtable *rt = NULL; const struct iphdr *iip; /* interior IP header */ int err = -EINVAL; @@ -853,6 +868,10 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev, bool xnet = !net_eq(geneve->net, dev_net(geneve->dev)); u32 flags = geneve->flags; + gs4 = rcu_dereference(geneve->sock4); + if (!gs4) + goto tx_error; + if (geneve->collect_md) { if (unlikely(!info || !(info->mode & IP_TUNNEL_INFO_TX))) { netdev_dbg(dev, "no tunnel metadata\n"); @@ -932,9 +951,9 @@ static netdev_tx_t geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev, struct ip_tunnel_info *info) { struct geneve_dev 
*geneve = netdev_priv(dev); - struct geneve_sock *gs6 = geneve->sock6; struct dst_entry *dst = NULL; const struct iphdr *iip; /* interior IP header */ + str
Re: [PATCH net 1/2] vxlan: avoid using stale vxlan socket.
On Thu, Oct 27, 2016 at 4:02 PM, Stephen Hemminger wrote:
> On Thu, 27 Oct 2016 11:51:55 -0700 Pravin B Shelar wrote:
>
>> -	vxlan->vn4_sock = NULL;
>> +	rcu_assign_pointer(vxlan->vn4_sock, NULL);
>>  #if IS_ENABLED(CONFIG_IPV6)
>> -	vxlan->vn6_sock = NULL;
>> +	rcu_assign_pointer(vxlan->vn6_sock, NULL);
>>  	if (ipv6 || metadata)
>
> Use RCU_INIT_POINTER when setting a pointer to NULL.
> It avoids an unnecessary barrier.

OK. I have posted a v2 patch to use RCU_INIT_POINTER().
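The barrier distinction behind Stephen's suggestion can be illustrated with C11 atomics — a userspace analogy, not the kernel RCU API; struct and variable names are stand-ins:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct vxlan_sock { int refcnt; };

static _Atomic(struct vxlan_sock *) vn4_sock;

/* rcu_assign_pointer() analogue: release ordering, so a reader that
 * sees the pointer also sees the initialized object behind it. */
static void publish(struct vxlan_sock *vs)
{
	atomic_store_explicit(&vn4_sock, vs, memory_order_release);
}

/* RCU_INIT_POINTER() analogue: storing NULL publishes no new data,
 * so no release barrier is needed -- Stephen's point. The kernel
 * code would then call synchronize_net() before freeing. */
static void retire(void)
{
	atomic_store_explicit(&vn4_sock, NULL, memory_order_relaxed);
}

/* rcu_dereference()/rtnl_dereference() analogue for readers. */
static struct vxlan_sock *deref(void)
{
	return atomic_load_explicit(&vn4_sock, memory_order_acquire);
}
```

After retire(), a grace period (synchronize_net() in the patch) guarantees no reader still holds the old pointer, and only then is the socket freed — which is what fixes the stale-socket race in geneve_xmit/vxlan_xmit.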
[PATCH net] dctcp: avoid bogus doubling of cwnd after loss
If a congestion control module doesn't provide .undo_cwnd function, tcp_undo_cwnd_reduction() will set cwnd to tp->snd_cwnd = max(tp->snd_cwnd, tp->snd_ssthresh << 1); ... which makes sense for reno (it sets ssthresh to half the current cwnd), but it makes no sense for dctcp, which sets ssthresh based on the current congestion estimate. This can cause severe growth of cwnd (eventually overflowing u32). Fix this by saving last cwnd on loss and restore cwnd based on that, similar to cubic and other algorithms. Fixes: e3118e8359bb7c ("net: tcp: add DCTCP congestion control algorithm") Cc: Lawrence Brakmo Cc: Andrew Shewmaker Cc: Glenn Judd Acked-by: Daniel Borkmann Signed-off-by: Florian Westphal --- net/ipv4/tcp_dctcp.c | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c index 10d728b6804c..ab37c6775630 100644 --- a/net/ipv4/tcp_dctcp.c +++ b/net/ipv4/tcp_dctcp.c @@ -56,6 +56,7 @@ struct dctcp { u32 next_seq; u32 ce_state; u32 delayed_ack_reserved; + u32 loss_cwnd; }; static unsigned int dctcp_shift_g __read_mostly = 4; /* g = 1/2^4 */ @@ -96,6 +97,7 @@ static void dctcp_init(struct sock *sk) ca->dctcp_alpha = min(dctcp_alpha_on_init, DCTCP_MAX_ALPHA); ca->delayed_ack_reserved = 0; + ca->loss_cwnd = 0; ca->ce_state = 0; dctcp_reset(tp, ca); @@ -111,9 +113,10 @@ static void dctcp_init(struct sock *sk) static u32 dctcp_ssthresh(struct sock *sk) { - const struct dctcp *ca = inet_csk_ca(sk); + struct dctcp *ca = inet_csk_ca(sk); struct tcp_sock *tp = tcp_sk(sk); + ca->loss_cwnd = tp->snd_cwnd; return max(tp->snd_cwnd - ((tp->snd_cwnd * ca->dctcp_alpha) >> 11U), 2U); } @@ -308,12 +311,20 @@ static size_t dctcp_get_info(struct sock *sk, u32 ext, int *attr, return 0; } +static u32 dctcp_cwnd_undo(struct sock *sk) +{ + const struct dctcp *ca = inet_csk_ca(sk); + + return max(tcp_sk(sk)->snd_cwnd, ca->loss_cwnd); +} + static struct tcp_congestion_ops dctcp __read_mostly = { .init = dctcp_init, .in_ack_event = 
dctcp_update_alpha, .cwnd_event = dctcp_cwnd_event, .ssthresh = dctcp_ssthresh, .cong_avoid = tcp_reno_cong_avoid, + .undo_cwnd = dctcp_cwnd_undo, .set_state = dctcp_state, .get_info = dctcp_get_info, .flags = TCP_CONG_NEEDS_ECN, -- 2.7.3
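The save/restore logic of the fix can be checked with a small userspace model. Types and names loosely follow the patch (alpha is scaled so that 1024 means "halve"); this is not the kernel code:

```c
#include <assert.h>

struct dctcp_state { unsigned int snd_cwnd, loss_cwnd; };

/* On loss: remember cwnd, then reduce based on the DCTCP congestion
 * estimate alpha (shifted by 11, as in the patch), floor of 2. */
static unsigned int dctcp_ssthresh(struct dctcp_state *s, unsigned int alpha)
{
	unsigned int r;

	s->loss_cwnd = s->snd_cwnd;	/* save cwnd before reduction */
	r = s->snd_cwnd - ((s->snd_cwnd * alpha) >> 11U);
	return r > 2U ? r : 2U;
}

/* Undo: restore the pre-loss cwnd instead of doubling ssthresh,
 * which for DCTCP could balloon cwnd without bound. */
static unsigned int dctcp_cwnd_undo(const struct dctcp_state *s)
{
	return s->snd_cwnd > s->loss_cwnd ? s->snd_cwnd : s->loss_cwnd;
}
```

With the old default behavior, undo computed `max(cwnd, ssthresh << 1)` — sensible for Reno, where ssthresh is half the prior cwnd, but meaningless for DCTCP's estimate-based ssthresh; restoring the saved loss_cwnd matches what cubic and others do.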
Re: Bonding MAC address
I think it is intentional, because changing the MAC address of an
interface would cause problems. As a result, if you find it necessary
to change the MAC address, it should be done manually. Consider these
three examples:

* A host is connecting to the internet through a bond interface and
  obtains an IP address from a DHCP server. Changing the MAC address
  automatically would possibly lead to re-assigning a new IP address,
  which could not be expected.

* A linux-box is acting as a gateway, providing service to hosts in
  the local LAN. Changing the MAC address without changing the IP
  address of an interface would cause other hosts to fail to
  communicate with the gateway, since the old MAC address is still in
  their ARP cache tables. Communication will recover after the ARP
  cache expires, which can take a short or long time.

* The scenario is the same as the second one, but consider that ARP
  snooping or other mechanisms to protect hosts from being spoofed by
  a fake gateway are enabled in the local LAN. After changing the MAC
  address, the linux-box itself will look like a "spoofer" and may get
  blocked.

So changing the MAC address of an interface can be dangerous, can lead
to network malfunction, and cannot be done automatically.

> On 28 Oct 2016, at 19:09, Igor Ryzhov wrote:
>
> Hello everyone,
>
> Studying the bonding driver I found that the bonding device doesn't
> choose a new MAC address when the slave whose MAC is in use is
> released. There is a warning about that - "the permanent HWaddr of
> slave is still in use by bond - set the HWaddr of slave to a
> different address to avoid conflicts".
> Why not choose a new MAC for the bonding device? Is it intentional
> or just not implemented?
>
> Best regards,
> Igor
Re: [PATCH] Revert "usb: dwc3: gadget: use allocated/queued reqs for LST bit"
On Fri, Oct 28, 2016 at 01:16:13PM +0300, Felipe Balbi wrote:
>
> Hi,
>
> Ville Syrjälä writes:
> > On Thu, Oct 06, 2016 at 12:08:26PM +0300, Ville Syrjälä wrote:
> >> On Thu, Oct 06, 2016 at 10:36:09AM +0300, Felipe Balbi wrote:
> >> >
> >> > Hi,
> >> >
> >> > Felipe Balbi writes:
> >> > >> > Okay, I have found a regression on dwc3 and another patch follows:
> >> >
> >> > commit 5e1a2af3e46248c55098cdae643c4141851b703e
> >> > Author: Felipe Balbi
> >> > Date:   Wed Oct 5 14:24:37 2016 +0300
> >> >
> >> >     usb: dwc3: gadget: properly account queued requests
> >> >
> >> >     Some requests could be accounted for multiple
> >> >     times. Let's fix that so each and every requests is
> >> >     accounted for only once.
> >> >
> >> >     Cc:  # v4.8
> >> >     Fixes: 55a0237f8f47 ("usb: dwc3: gadget: use allocated/queued reqs
> >> >     for LST bit")
> >> >     Signed-off-by: Felipe Balbi
> >> >
> >> > diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
> >> > index 07cc8929f271..3c3ced128c77 100644
> >> > --- a/drivers/usb/dwc3/gadget.c
> >> > +++ b/drivers/usb/dwc3/gadget.c
> >> > @@ -783,6 +783,7 @@ static void dwc3_prepare_one_trb(struct dwc3_ep *dep,
> >> >  		req->trb = trb;
> >> >  		req->trb_dma = dwc3_trb_dma_offset(dep, trb);
> >> >  		req->first_trb_index = dep->trb_enqueue;
> >> > +		dep->queued_requests++;
> >> >  	}
> >> >
> >> >  	dwc3_ep_inc_enq(dep);
> >> > @@ -833,8 +834,6 @@ static void dwc3_prepare_one_trb(struct dwc3_ep *dep,
> >> >
> >> >  	trb->ctrl |= DWC3_TRB_CTRL_HWO;
> >> >
> >> > -	dep->queued_requests++;
> >> > -
> >> >  	trace_dwc3_prepare_trb(dep, trb);
> >> >  }
> >> >
> >> > @@ -1861,8 +1860,11 @@ static int __dwc3_cleanup_done_trbs(struct dwc3
> >> > *dwc, struct dwc3_ep *dep,
> >> >  	unsigned int		s_pkt = 0;
> >> >  	unsigned int		trb_status;
> >> >
> >> > -	dep->queued_requests--;
> >> >  	dwc3_ep_inc_deq(dep);
> >> > +
> >> > +	if (req->trb == trb)
> >> > +		dep->queued_requests--;
> >> > +
> >> >  	trace_dwc3_complete_trb(dep, trb);
> >> >
> >> >  	/*
> >> >
> >> > I have also built a branch which you can use for testing. Here's a pull
> >> > request, once you tell me it works for you, then I can send proper
> >> > patches out:
> >> >
> >> > The following changes since commit
> >> > c8d2bc9bc39ebea8437fd974fdbc21847bb897a3:
> >> >
> >> >   Linux 4.8 (2016-10-02 16:24:33 -0700)
> >> >
> >> > are available in the git repository at:
> >> >
> >> >   git://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb.git
> >> > tmp-test-v4.8
> >> >
> >> > for you to fetch changes up to c968b8d1effe64a7980802d1eef29f4e1922faca:
> >> >
> >> >   usb: dwc3: gadget: properly account queued requests (2016-10-06
> >> > 10:16:37 +0300)
> >> >
> >> > Felipe Balbi (2):
> >> >   usb: gadget: function: u_ether: don't starve tx request queue
> >> >   usb: dwc3: gadget: properly account queued requests
> >> >
> >> >  drivers/usb/dwc3/gadget.c             | 7 ---
> >> >  drivers/usb/gadget/function/u_ether.c | 5 +++--
> >> >  2 files changed, 7 insertions(+), 5 deletions(-)
> >>
> >> Tried your branch, but unfortunately I'm still seeing the lags. New trace
> >> attached.
> >
> > Just a reminder that the regressions is still there in 4.9-rc2.
> > Unfortunateyly with all the stuff already piled on top, getting USB
> > working on my device is no longer a simple matter of reverting the
> > one commit. I had to revert 10 patches to get even a clean revert, but
> > unfortunately that approach just lead to the transfer hanging immediately.
> >
> > So what I ended up doing is reverting all of this:
> > git log --no-merges --format=oneline
> > 55a0237f8f47957163125e20ee9260538c5c341c^..HEAD -- drivers/usb/dwc3/
> > include/linux/ulpi/interface.h drivers/usb/Kconfig drivers/usb/core/Kconfig
> >
> > which comes out at whopping 70 commits. Having to carry that around
> > is going to be quite a pain especially as more stuff might be piled on
> > top.
> >
> > /me thinks a stable backport of any fix (assuming one is found
> > eventually) is going to be quite the challenge...
>
> Yeah, I'm guessing we're gonna need some help from networking folks. The
> only thing we did since v4.7 was actually respect req->no_interrupt flag
> coming from u_ether itself. No idea why that causes so much trouble for
> u_ether.
>
> BTW, Instead of reverting so many patches, you can just remove
> throttling:
>
> diff --git a/drivers/usb/gadget/function/u_ether.c
> b/drivers/usb/gadget/function/u_ether.c
> index f4a640216913..119a2e5848e8 100644
> --- a/drivers/usb/gadget/function/u_ether.c
> +++ b/drivers/usb/gadget/function/u_ether.c
> @@ -589,14 +589,6 @@ static netdev_tx_t eth_start_xmit(struct sk_buff *skb,
> >
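The accounting change being tested in this thread rests on a simple invariant: a request may occupy several TRBs, but dep->queued_requests must count requests, so both the increment and the decrement have to key off the request's first TRB. A user-space sketch of that invariant (plain C; the struct and function names are simplified stand-ins for the dwc3 code, not the driver itself):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the dwc3 structures. A request records
 * only its first TRB, as in the patch above. */
struct trb { int id; };

struct request {
	struct trb *trb;	/* first TRB of this request */
};

struct endpoint {
	int queued_requests;
};

/* Called once per TRB prepared; mirrors dwc3_prepare_one_trb().
 * The counter is bumped only when the request's first TRB is
 * recorded, so the request is accounted exactly once. */
static void prepare_one_trb(struct endpoint *dep, struct request *req,
			    struct trb *trb, int first)
{
	if (first) {
		req->trb = trb;
		dep->queued_requests++;
	}
}

/* Called once per completed TRB; mirrors __dwc3_cleanup_done_trbs().
 * Only the request's own first TRB drops the counter. */
static void cleanup_done_trb(struct endpoint *dep, struct request *req,
			     struct trb *trb)
{
	if (req->trb == trb)
		dep->queued_requests--;
}
```

Tying the increment to the req->trb assignment and guarding the decrement with req->trb == trb means a multi-TRB request changes the counter exactly once in each direction, which is what the patch above does in dwc3_prepare_one_trb() and __dwc3_cleanup_done_trbs().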
Re: [PATCH] [rfc, netfilter-next] netfilter: nf_tables: fib warnings
Arnd Bergmann wrote:
> On Friday, October 28, 2016 5:50:31 PM CEST Florian Westphal wrote:
> > Arnd Bergmann wrote:
> > > The newly added nft fib code produces two warnings:
> > >
> > > net/ipv4/netfilter/nft_fib_ipv4.c: In function 'nft_fib4_eval':
> > > net/ipv4/netfilter/nft_fib_ipv4.c:80:6: error: unused variable 'i'
> > > [-Werror=unused-variable]
> > > net/ipv4/netfilter/nft_fib_ipv4.c: In function ‘nft_fib4_eval’:
> > > net/ipv4/netfilter/nft_fib_ipv4.c:137:6: error: ‘oif’ may be used
> > > uninitialized in this function [-Werror=maybe-uninitialized]
> > >
> > > The first one is obvious as the only user of that variable is
> > > inside of an #ifdef, but the second one is a bit trickier.
> > > It is clear that 'oif' is uninitialized here if neither
> > > NFTA_FIB_F_OIF nor NFTA_FIB_F_IIF are set.
> > >
> > > I have no idea how that should be handled, this patch just
> > > returns without doing anything, which may or may not be
> > > the right thing to do.
> >
> > It should be initialized to NULL.
>
> Ok, I had considered that, but wasn't sure if ->nh_dev could
> ever be NULL, as that would then get dereferenced.

Good point.

In case oif is NULL we don't have to search the result list for a match
anyway, so we could do this (not even build tested):

diff --git a/net/ipv4/netfilter/nft_fib_ipv4.c b/net/ipv4/netfilter/nft_fib_ipv4.c
--- a/net/ipv4/netfilter/nft_fib_ipv4.c
+++ b/net/ipv4/netfilter/nft_fib_ipv4.c
@@ -130,6 +130,11 @@ void nft_fib4_eval(const struct nft_expr *expr, struct nft_regs *regs,
 		break;
 	}
 
+	if (!oif) {
+		found = FIB_RES_DEV(res);
+		goto ok;
+	}
+
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 	for (i = 0; i < res.fi->fib_nhs; i++) {
 		struct fib_nh *nh = &res.fi->fib_nh[i];
@@ -139,16 +144,12 @@ void nft_fib4_eval(const struct nft_expr *expr, struct nft_regs *regs,
 			goto ok;
 		}
 	}
-#endif
-	if (priv->flags & NFTA_FIB_F_OIF) {
-		found = FIB_RES_DEV(res);
-		if (found == oif)
-			goto ok;
-		return;
-	}
-
-	*dest = FIB_RES_DEV(res)->ifindex;
 	return;
+#else
+	found = FIB_RES_DEV(res);
+	if (found != oif)
+		return;
+#endif
 
 ok:
 	switch (priv->result) {

I can take care of this as a followup.
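To make the proposed control flow easier to follow, here is a hypothetical user-space sketch (plain C; resolve_oif, fib_result_dev, and the pared-down net_device are illustrative stand-ins, not the kernel API): when no output interface was requested, the FIB result's device is taken directly, and the nexthop list is scanned only when a specific oif must match.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for struct net_device. */
struct net_device { int ifindex; };

/* Stand-in for FIB_RES_DEV(): the first nexthop's device. */
static const struct net_device *
fib_result_dev(const struct net_device *nhs, int nr_nhs)
{
	return nr_nhs ? &nhs[0] : NULL;
}

/* Returns the matched device's ifindex, or 0 when the requested
 * oif is not among the nexthops (the "return without storing"
 * case in the patch above). */
static int resolve_oif(const struct net_device *nhs, int nr_nhs,
		       const struct net_device *oif)
{
	const struct net_device *found;
	int i;

	if (!oif) {
		/* No oif requested: no need to search the result list. */
		found = fib_result_dev(nhs, nr_nhs);
		return found ? found->ifindex : 0;
	}

	/* Multipath case: scan the nexthops for the requested device. */
	for (i = 0; i < nr_nhs; i++) {
		if (&nhs[i] == oif)
			return nhs[i].ifindex;
	}
	return 0;
}
```

The early-exit branch is also what keeps a NULL oif away from any dereference, addressing Arnd's ->nh_dev concern.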
Re: [PATCH net-next RFC WIP] Patch for XDP support for virtio_net
On Fri, 28 Oct 2016 08:56:35 -0700, John Fastabend wrote:
> On 16-10-27 07:10 PM, David Miller wrote:
> > From: Alexander Duyck
> > Date: Thu, 27 Oct 2016 18:43:59 -0700
> >
> >> On Thu, Oct 27, 2016 at 6:35 PM, David Miller wrote:
> >>
> >>> From: "Michael S. Tsirkin"
> >>> Date: Fri, 28 Oct 2016 01:25:48 +0300
> >>>
> On Thu, Oct 27, 2016 at 05:42:18PM -0400, David Miller wrote:
> > From: "Michael S. Tsirkin"
> > Date: Fri, 28 Oct 2016 00:30:35 +0300
> >
> >> Something I'd like to understand is how does XDP address the
> >> problem that 100Byte packets are consuming 4K of memory now.
> >
> > Via page pools. We're going to make a generic one, but right now
> > each and every driver implements a quick list of pages to allocate
> > from (and thus avoid the DMA man/unmap overhead, etc.)
>
> So to clarify, ATM virtio doesn't attempt to avoid dma map/unmap
> so there should be no issue with that even when using sub/page
> regions, assuming DMA APIs support sub-page map/unmap correctly.
> >>>
> >>> That's not what I said.
> >>>
> >>> The page pools are meant to address the performance degradation from
> >>> going to having one packet per page for the sake of XDP's
> >>> requirements.
> >>>
> >>> You still need to have one packet per page for correct XDP operation
> >>> whether you do page pools or not, and whether you have DMA mapping
> >>> (or it's equivalent virutalization operation) or not.
> >>
> >> Maybe I am missing something here, but why do you need to limit things
> >> to one packet per page for correct XDP operation? Most of the drivers
> >> out there now are usually storing something closer to at least 2
> >> packets per page, and with the DMA API fixes I am working on there
> >> should be no issue with changing the contents inside those pages since
> >> we won't invalidate or overwrite the data after the DMA buffer has
> >> been synchronized for use by the CPU.
> >
> > Because with SKB's you can share the page with other packets.
> >
> > With XDP you simply cannot.
> >
> > It's software semantics that are the issue. SKB frag list pages
> > are read only, XDP packets are writable.
> >
> > This has nothing to do with "writability" of the pages wrt. DMA
> > mapping or cpu mappings.
> >
>
> Sorry I'm not seeing it either. The current xdp_buff is defined
> by,
>
> struct xdp_buff {
> 	void *data;
> 	void *data_end;
> };
>
> The verifier has an xdp_is_valid_access() check to ensure we don't go
> past data_end. The page for now at least never leaves the driver. For
> the work to get xmit to other devices working I'm still not sure I see
> any issue.

+1

Do we want to make the packet-per-page a requirement because it could
be useful in the future from architectural standpoint? I guess there is
a trade-off here between having the comfort of people following this
requirement today and making driver support for XDP even more complex.
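The data/data_end pair discussed above is what makes XDP access checking cheap: every read an XDP program performs must be provably bounded by data_end, which the eBPF verifier enforces statically at load time. A user-space sketch of that bounds discipline (xdp_access_ok is a made-up helper for illustration; real XDP programs express the same comparison inline):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Mirror of the xdp_buff quoted in the thread. */
struct xdp_buff {
	void *data;
	void *data_end;
};

/* Returns 1 if `len` bytes starting at offset `off` lie entirely
 * inside the packet, i.e. the access never crosses data_end. */
static int xdp_access_ok(const struct xdp_buff *xdp, size_t off, size_t len)
{
	uint8_t *p = (uint8_t *)xdp->data + off;

	return p + len <= (uint8_t *)xdp->data_end;
}
```

The same comparison, written as `if (data + off + len > data_end) return XDP_DROP;` in a real program, is what xdp_is_valid_access() obliges the verifier to see before it allows the pointer dereference.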