qmi_wwan: Null pointer dereference when removing driver
Unsure at which point was added, but issue not present in stock debian 4.11 kernel. Running on a Thinkpad X220 with coreboot. I'm building from upstream. When I attempt to remove the qmi_wwan module (which also happens pre-suspend) the rmmod process gets killed, and the following shows in dmesg: [ 59.979791] usb 2-1.4: USB disconnect, device number 4 [ 59.980102] qmi_wwan 2-1.4:1.6 wwp0s29u1u4i6: unregister 'qmi_wwan' usb-:00:1d.0-1.4, WWAN/QMI device [ 60.006821] BUG: unable to handle kernel NULL pointer dereference at 00e0 [ 60.006879] IP: qmi_wwan_disconnect+0x25/0xc0 [qmi_wwan] [ 60.006911] PGD 0 [ 60.006911] P4D 0 [ 60.006957] Oops: [#1] SMP [ 60.006978] Modules linked in: fuse(E) ccm(E) rfcomm(E) cmac(E) bnep(E) qmi_wwan(E) cdc_wdm(E) cdc_ether(E) usbnet(E) mii(E) btusb(E) btrtl(E) btbcm(E) btintel(E) bluetooth(E) joydev(E) xpad(E) ecdh_generic(E) ff_memless(E) binfmt_misc(E) snd_hda_codec_hdmi(E) snd_hda_codec_conexant(E) snd_hda_codec_generic(E) arc4(E) iTCO_wdt(E) iTCO_vendor_support(E) intel_rapl(E) x86_pkg_temp_thermal(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) ghash_clmulni_intel(E) aesni_intel(E) iwlmvm(E) aes_x86_64(E) crypto_simd(E) mac80211(E) cryptd(E) glue_helper(E) snd_hda_intel(E) snd_hda_codec(E) iwlwifi(E) snd_hwdep(E) psmouse(E) snd_hda_core(E) snd_pcm(E) serio_raw(E) sdhci_pci(E) pcspkr(E) snd_timer(E) ehci_pci(E) e1000e(E) i2c_i801(E) ehci_hcd(E) snd(E) sg(E) i915(E) lpc_ich(E) [ 60.007366] ptp(E) usbcore(E) cfg80211(E) mfd_core(E) pps_core(E) shpchp(E) ac(E) battery(E) tpm_tis(E) tpm_tis_core(E) evdev(E) tpm(E) parport_pc(E) ppdev(E) lp(E) parport(E) ip_tables(E) x_tables(E) autofs4(E) [ 60.007474] CPU: 2 PID: 33 Comm: kworker/2:1 Tainted: GE 4.12.3-nr44-normandy-r1500619820+ #1 [ 60.007524] Hardware name: LENOVO 4291LR7/4291LR7, BIOS CBET4000 4.6-810-g50522254fb 07/21/2017 [ 60.007580] Workqueue: usb_hub_wq hub_event [usbcore] [ 60.007609] task: 8c882b716040 task.stack: b8e800d84000 [ 60.007644] RIP: 0010:qmi_wwan_disconnect+0x25/0xc0 [qmi_wwan] [ 60.007678] RSP: 0018:b8e800d87b38 EFLAGS: 00010246 [ 60.007711] RAX: RBX: RCX: [ 60.007752] RDX: 0001 RSI: 8c8824f3f1d0 RDI: 8c8824ef6400 [ 60.007792] RBP: 8c8824ef6400 R08: R09: [ 60.007833] R10: b8e800d87780 R11: 0011 R12: c07ea0e8 [ 60.007874] R13: 8c8824e2e000 R14: 8c8824e2e098 R15: [ 60.007915] FS: () GS:8c883530() knlGS: [ 60.007960] CS: 0010 DS: ES: CR0: 80050033 [ 60.007994] CR2: 00e0 CR3: 000229ca5000 CR4: 000406e0 [ 60.008035] Call Trace: [ 60.008065] ? usb_unbind_interface+0x71/0x270 [usbcore] [ 60.008101] ? device_release_driver_internal+0x154/0x210 [ 60.008135] ? qmi_wwan_unbind+0x6d/0xc0 [qmi_wwan] [ 60.008168] ? usbnet_disconnect+0x6c/0xf0 [usbnet] [ 60.008194] ? qmi_wwan_disconnect+0x87/0xc0 [qmi_wwan] [ 60.008232] ? usb_unbind_interface+0x71/0x270 [usbcore] [ 60.008264] ? device_release_driver_internal+0x154/0x210 [ 60.008296] ? bus_remove_device+0xf5/0x160 [ 60.008324] ? device_del+0x1dc/0x310 [ 60.008355] ? usb_remove_ep_devs+0x1b/0x30 [usbcore] [ 60.008393] ? usb_disable_device+0x93/0x250 [usbcore] [ 60.008430] ? usb_disconnect+0x90/0x260 [usbcore] [ 60.008468] ? hub_event+0x1d9/0x14a0 [usbcore] [ 60.008500] ? process_one_work+0x175/0x370 [ 60.008528] ? worker_thread+0x4a/0x380 [ 60.008555] ? kthread+0xfc/0x130 [ 60.008579] ? process_one_work+0x370/0x370 [ 60.008606] ? kthread_park+0x60/0x60 [ 60.008631] ? ret_from_fork+0x22/0x30 [ 60.008656] Code: 66 0f 1f 44 00 00 66 66 66 66 90 55 48 89 fd 53 48 83 ec 10 48 8b 9f c8 00 00 00 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 83 e0 00 00 00 02 74 51 e8 0d b3 2b cd 85 c0 74 67 48 8b bb [ 60.011925] RIP: qmi_wwan_disconnect+0x25/0xc0 [qmi_wwan] RSP: b8e800d87b38 [ 60.013564] CR2: 00e0 [ 60.022125] ---[ end trace e536b59f45bc0f25 ]--- [ 60.025385] IPv6: ADDRCONF(NETDEV_UP): wlp2s0: link is not ready If I attempt a second rmmod, the process hangs. If I attempt it on 4.11.x it works as expected: [ 16.897783] fuse init (API version 7.26) [ 68.073552] usbcore: deregistering interface driver qmi_wwan [ 68.075808] qmi_wwan 2-1.4:1.6 wwp0s29u1u4i6: unregister 'qmi_wwan' usb-:00:1d.0-1.4, WWAN/QMI device [ 72.431403] e1000e: enp0s25 NIC Link is Down So I'm pretty certain it's not coreboot causing the issue.
[PATCH V2 net] Revert "vhost: cache used event for better performance"
This reverts commit 809ecb9bca6a9424ccd392d67e368160f8b76c92. Since it was reported to break vhost_net. We want to cache used event and use it to check for notification. The assumption was that guest won't move the event idx back, but this could happen in fact when 16 bit index wraps around after 64K entries. Signed-off-by: Jason Wang--- - Changes from V1: tweak commit log - The patch is needed for -stable. --- drivers/vhost/vhost.c | 28 ++-- drivers/vhost/vhost.h | 3 --- 2 files changed, 6 insertions(+), 25 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index e4613a3..9cb3f72 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -308,7 +308,6 @@ static void vhost_vq_reset(struct vhost_dev *dev, vq->avail = NULL; vq->used = NULL; vq->last_avail_idx = 0; - vq->last_used_event = 0; vq->avail_idx = 0; vq->last_used_idx = 0; vq->signalled_used = 0; @@ -1402,7 +1401,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp) r = -EINVAL; break; } - vq->last_avail_idx = vq->last_used_event = s.num; + vq->last_avail_idx = s.num; /* Forget the cached index value. */ vq->avail_idx = vq->last_avail_idx; break; @@ -2241,6 +2240,10 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq) __u16 old, new; __virtio16 event; bool v; + /* Flush out used index updates. This is paired +* with the barrier that the Guest executes when enabling +* interrupts. */ + smp_mb(); if (vhost_has_feature(vq, VIRTIO_F_NOTIFY_ON_EMPTY) && unlikely(vq->avail_idx == vq->last_avail_idx)) @@ -2248,10 +2251,6 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq) if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) { __virtio16 flags; - /* Flush out used index updates. This is paired -* with the barrier that the Guest executes when enabling -* interrupts. */ - smp_mb(); if (vhost_get_avail(vq, flags, >avail->flags)) { vq_err(vq, "Failed to get flags"); return true; @@ -2266,26 +2265,11 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq) if (unlikely(!v)) return true; - /* We're sure if the following conditions are met, there's no -* need to notify guest: -* 1) cached used event is ahead of new -* 2) old to new updating does not cross cached used event. */ - if (vring_need_event(vq->last_used_event, new + vq->num, new) && - !vring_need_event(vq->last_used_event, new, old)) - return false; - - /* Flush out used index updates. This is paired -* with the barrier that the Guest executes when enabling -* interrupts. */ - smp_mb(); - if (vhost_get_avail(vq, event, vhost_used_event(vq))) { vq_err(vq, "Failed to get used event idx"); return true; } - vq->last_used_event = vhost16_to_cpu(vq, event); - - return vring_need_event(vq->last_used_event, new, old); + return vring_need_event(vhost16_to_cpu(vq, event), new, old); } /* This actually signals the guest, using eventfd. */ diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index f720958..bb7c29b 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -115,9 +115,6 @@ struct vhost_virtqueue { /* Last index we used. */ u16 last_used_idx; - /* Last used evet we've seen */ - u16 last_used_event; - /* Used flags */ u16 used_flags; -- 2.7.4
Re: [PATCH 1/4] can: dev: Add support for limiting configured bitrate
Hi Kurt, On 07/26/2017 03:04 PM, Kurt Van Dijck wrote: > Hi, > > I know my response is late ... > >> Hi Oliver >> On 07/20/2017 02:43 AM, Oliver Hartkopp wrote: >>> Hi Franklin, >>> >>> On 07/20/2017 01:36 AM, Franklin S Cooper Jr wrote: >>> +#ifdef CONFIG_OF +void of_transceiver_is_fixed(struct net_device *dev) +{ >>> >>> (..) >>> +} +EXPORT_SYMBOL(of_transceiver_is_fixed); +#endif >>> >>> I'm not sure about the naming here. >>> >>> As this is a CAN transceiver related option it should be named accordingly: > > I contest the the name too: > 1) the can transceiver isn't fixed at all, it limited to the higher > bitrates. Its "possible" that this subnode may have additional properties beyond bitrates in the future. But your right as of now it is specifically addressing max bit rates. The naming of this function and subnode is based on "fixed-link". So "fixed" is just implying that certain properties can't be changed. > > 2) of_can_transceiver_is_fixed suggests to test if a transceiver is > fixed, it does not suggest to load some properties from the device tree. > of_can_load_transceiver looks way more clear to me. I address this partially in my rev 2 that I've already sent. I'm now using of_can_transceiver_fixed. The fact its of_ already implies it is loading properties from device tree. So I don't think of_can_load_transceiver really makes things clearer. > > That's my opinion. > The important things, like the contents of the functions, look good. You mind throwing your two cents in the thread for my latest patch? Specifically the conversation regarding naming the properties. A couple of people prefer to not use "arbitration" in one of the property names. Currently I believe there are two options on property names that can be used and I'm open to a majority vote on which one to go with. > > Kind regards, > Kurt Van Dijck >
RE: [PATCH V2 net-next] TLP: Don't reschedule PTO when there's one outstanding TLP retransmission
> -Original Message- > From: Sergei Shtylyov [mailto:sergei.shtyl...@cogentembedded.com] > Sent: Wednesday, July 26, 2017 6:03 PM > To: maowenan; netdev@vger.kernel.org; da...@davemloft.net; > ncardw...@google.com; ych...@google.com; nandi...@google.com; > weiyongjun (A); Chenweilong; Wangkefeng (Kevin) > Subject: Re: [PATCH V2 net-next] TLP: Don't reschedule PTO when there's one > outstanding TLP retransmission > > On 7/26/2017 12:44 PM, Mao Wenan wrote: > > > If there is one TLP probe went out(TLP use the write_queue_tail packet > > as TLP probe, we assume this first TLP probe named A), and this TLP > > probe was not acked by receive side. > > > > Then the transmit side sent the next two packetes out(named B,C), but > > unfortunately these two packets are also not acked by receive side. > > > > And then there is one data packet with ack_seq A arrive at transmit > > side, in tcp_ack() will call tcp_schedule_loss_probe() to rearm PTO, > > the handler tcp_send_loss_probe() is to check > > if(tp->tlp_high_seq) then go to rearm_timer(because there is one > > outstanding TLP named A), so the new TLP probe can't be sent out and > > it needs to rearm the RTO timer(timeout is relative to the transmit > > time of the write queue head). > > > > After that, there is another data packet with ack_seq A is received, > > if the tlp_time_stamp is greater than rto_time_stamp, it will reset > > the TLP timeout, which is before previous RTO timeout, so PTO is rearm > > and previous RTO is cleared. Because there is no retransmission packet > > was sent or no TLP sack receive, tp->tlp_high_seq can't be reset to > > zero and the next TLP probe also can't be sent out, so there is no > > way(or very long time) to retransmit the lost packet. > > > > This fix is to check(tp->tlp_high_seq) in tcp_schedule_loss_probe() > > when TLP PTO is after RTO, It is not needed to reschedule PTO when > > there is one outstanding TLP retransmission, so if the TLP A is lost > > RTO can retransmit lost packet, then tp->tlp_high_seq will be set to > > 0, and TLP will go to the normal work process. > > > > v1->v2 > > refine some words of code and patch comments. > > > > Signed-off-by: Mao Wenan> > --- > > net/ipv4/tcp_output.c | 6 ++ > > 1 file changed, 6 insertions(+) > > > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index > > 886d874..f85c7ef 100644 > > --- a/net/ipv4/tcp_output.c > > +++ b/net/ipv4/tcp_output.c > > @@ -2423,6 +2423,12 @@ bool tcp_schedule_loss_probe(struct sock *sk) > > tlp_time_stamp = tcp_jiffies32 + timeout; > > rto_time_stamp = (u32)inet_csk(sk)->icsk_timeout; > > if ((s32)(tlp_time_stamp - rto_time_stamp) > 0) { > > + /* It is not needed to reschedule PTO when there > > +* is one outstanding TLP retransmission. > > +*/ > > + if (tp->tlp_high_seq) { > > + return false; > > + } > > I have already told you to remove the needless {}... :-/ Thanks, sorry for this. > > [...] > > MBR, Sergei
RE: [PATCH net-next] TLP: Don't reschedule PTO when there's one outstanding TLP retransmission.
> -Original Message- > From: Yuchung Cheng [mailto:ych...@google.com] > Sent: Thursday, July 27, 2017 12:45 AM > To: maowenan > Cc: Neal Cardwell; Netdev; David Miller; weiyongjun (A); Chenweilong; Nandita > Dukkipati; Wangkefeng (Kevin) > Subject: Re: [PATCH net-next] TLP: Don't reschedule PTO when there's one > outstanding TLP retransmission. > > On Tue, Jul 25, 2017 at 7:12 PM, maowenan> wrote: > > > > > > > > > -Original Message- > > > From: Neal Cardwell [mailto:ncardw...@google.com] > > > Sent: Tuesday, July 25, 2017 9:30 PM > > > To: maowenan > > > Cc: Netdev; David Miller; weiyongjun (A); Chenweilong; Yuchung > > > Cheng; Nandita Dukkipati > > > Subject: Re: [PATCH net-next] TLP: Don't reschedule PTO when there's > > > one outstanding TLP retransmission. > > > > > > On Tue, Jul 25, 2017 at 4:35 AM, Mao Wenan > > > wrote: > > > > If there is one TLP probe went out(TLP use the write_queue_tail > > > > packet as TLP probe, we assume this first TLP probe named A), and > > > > this TLP probe was not acked by receive side. > > > > > > > > Then the transmit side sent the next two packetes out(named B,C), > > > > but unfortunately these two packets are also not acked by receive side. > > > > > > > > And then there is one data packet with ack_seq A arrive, in > > > > tcp_ack() will call tcp_schedule_loss_probe() to rearm PTO, the > > > > handler > > > > tcp_send_loss_probe() pass if(tp->tlp_high_seq)(because there is > > > > one outstanding TLP named A,tp->tlp_high_seq is not zero), so the > > > > new TLP probe can't be went out and need to rearm the RTO > > > > timer(timeout is relative to the transmit time of the write queue head). > > > > > > > > After this, another data packet with ack_seq A is received, if the > > > > tlp_time_stamp is after rto_time_stamp, it will reset the TLP > > > > timeout with delta value, which is before previous RTO timeout, so > > > > PTO is rearm and previous RTO is cleared. This TLP probe also > > > > can't be sent out because of tp->tlp_high_seq != 0, so there is no > > > > way(or need very long time)to retransmit the packet because of TLP A is > lost. > > > > > > > > This fix is not to pass the if(tp->tlp_high_seq) in > > > > tcp_schedule_loss_probe() when TLP PTO is after RTO, It is no need > > > > to reschedule PTO when there is one outstanding TLP > > > > retransmission, so if the TLP A is lost then RTO can retransmit > > > > that packet, and > > > > tp->tlp_high_seq will be set to 0. After this TLP will go to the > > > > tp->normal work > > > process. > > > > > > > > Signed-off-by: Mao Wenan > > > > > > Thanks for posting this. This is a pretty involved scenario. To help > > > document/test precisely what the behavior is before and after your > > > patch, would you be able to post a packetdrill ( > > > https://github.com/google/packetdrill ) test case for this scenario? > > > > > > Can I ask if you saw this scenario in an actual trace, or noticed > > > this by inspection? > > [Mao Wenan] It is my actual product scenario, from some debug trace > > info I found that PTO is always rescheduled before RTO, this is PTO is > > trigged by receiving one tcp_ack packets. After this patch, the > > product works well and there is no previous issue happen. The packet > > can retransmit with RTO if TLP probe is lost. > > I will say sorry that my local test can't reproduce because i can't > > build successfully the same complex scenario, I don't have mature test > > case to reproduce now but I will do my best to try in local test > > environment. > tcpdump output of an example trace also helps. Feel free to remove IP address, > ports, or payload information. We only need the timing and TCP header info. Ok, please check as below. Server: 192.169.0.18, client: 192.169.0.17 .. 20:36:12.239562 IP 192.169.0.18.1234 > 192.169.0.17.56246: P 501216544:501216616(72) ack 2308880351 win 92 20:36:12.240034 IP 192.169.0.18.1234 > 192.169.0.17.56246: P 501216616:501216688(72) ack 2308880351 win 92 20:36:12.260764 IP 192.169.0.18.1234 > 192.169.0.17.56246: P 501216616:501216688(72) ack 2308880351 win 92 /* first TLP probe A, seq: 501216616 */ 20:36:13.018020 IP 192.169.0.17.56246 > 192.169.0.18.1234: . ack 501216616 win 1970 20:36:13.018386 IP 192.169.0.18.1234 > 192.169.0.17.56246: P 501216688:501216760(72) ack 2308880351 win 92 /* server send packet B */ 20:36:13.022448 IP 192.169.0.18.1234 > 192.169.0.17.56246: P 501216760:501216832(72) ack 2308880351 win 92 /* server send packet C */ 20:36:13.033813 IP 192.169.0.17.56246 > 192.169.0.18.1234: P 2308880351:2308880437(86) ack 501216616 win 1970 /* receive client data packet ack 501216616 */ 20:36:13.034213 IP 192.169.0.17.56246 > 192.169.0.18.1234: P 2308880437:2308880523(86) ack 501216616 win 1970 /* receive client data packet ack 501216616 */ 20:36:13.034633 IP 192.169.0.18.1234 > 192.169.0.17.56246: .
Re: [PATCH v7 2/3] PCI: Enable PCIe Relaxed Ordering if supported
On 2017/7/27 2:26, Casey Leedom wrote: > By the way Ding, two issues: > > 1. Did we ever get any acknowledgement from either Intel or AMD > on this patch? I know that we can't ensure that, but it sure would > be nice since the PCI Quirks that we're putting in affect their > products. > Still no Intel and AMD guys has ack this, this is what I am worried about, should I ping some man again ? Thanks Ding > > Casey > . >
Re: [PATCH v7 2/3] PCI: Enable PCIe Relaxed Ordering if supported
On 2017/7/27 3:05, Casey Leedom wrote: > | From: Alexander Duyck> | Sent: Wednesday, July 26, 2017 11:44 AM > | > | On Jul 26, 2017 11:26 AM, "Casey Leedom" wrote: > | | > | | I think that the patch will need to be extended to modify > | | drivers/pci.c/iov.c:sriov_enable() to explicitly turn off > | | Relaxed Ordering Enable if the Root Complex is marked > | for no RO TLPs. > | > | I'm not sure that would be an issue. Wouldn't most VFs inherit the PF's > settings? > > Ah yes, you're right. This is covered in section 3.5.4 of the Single Root I/O > Virtualization and Sharing Specification, Revision 1.0 (September 11, 2007), > governing the PCIe Capability Device Control register. It states that the VF > version of that register shall follow the setting of the corresponding PF. > > So we should enhance the cxgb4vf/sge.c:t4vf_sge_alloc_rxq() in the same > way we did for the cxgb4 driver, but that's not critical since the Relaxed > Ordering Enable supersedes the internal chip's desire to use the Relaxed > Ordering Attribute. > > Ding, send me a note if you'd like me to work that up for you. > Ok, you could send the change log and I could put it in the v8 version together, will you base on the patch 3/3 or build a independence patch? Ding > | Also I thought most of the VF configuration space is read only. > > Yes, but not all of it. And when a VF is exported to a Virtual Machine, > then the Hypervisor captures and interprets all accesses to the VF's > PCIe Configuration Space from the VM. > > Thanks again for reminding me of the subtle aspect of the SR_IOV > specification that I forgot. > > Casey > . >
Re: [PATCH v3 net-next] net: systemport: Support 64bit statistics
On 07/26/2017 05:40 PM, Jianming.qiao wrote: > When using Broadcom Systemport device in 32bit Platform, ifconfig can > only report up to 4G tx,rx status, which will be wrapped to 0 when the > number of incoming or outgoing packets exceeds 4G, only taking > around 2 hours in busy network environment (such as streaming). > Therefore, it makes hard for network diagnostic tool to get reliable > statistical result, so the patch is used to add 64bit support for > Broadcom Systemport device in 32bit Platform. > > Signed-off-by: Jianming.qiao> --- > drivers/net/ethernet/broadcom/bcmsysport.c | 71 > -- > drivers/net/ethernet/broadcom/bcmsysport.h | 9 +++- > 2 files changed, 55 insertions(+), 25 deletions(-) > > diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c > b/drivers/net/ethernet/broadcom/bcmsysport.c > index 5274501..967221f 100644 > --- a/drivers/net/ethernet/broadcom/bcmsysport.c > +++ b/drivers/net/ethernet/broadcom/bcmsysport.c > @@ -662,6 +662,7 @@ static int bcm_sysport_alloc_rx_bufs(struct > bcm_sysport_priv *priv) > static unsigned int bcm_sysport_desc_rx(struct bcm_sysport_priv *priv, > unsigned int budget) > { > + struct bcm_sysport_stats *stats64 = >stats64; > struct net_device *ndev = priv->netdev; > unsigned int processed = 0, to_process; > struct bcm_sysport_cb *cb; > @@ -765,6 +766,10 @@ static unsigned int bcm_sysport_desc_rx(struct > bcm_sysport_priv *priv, > skb->protocol = eth_type_trans(skb, ndev); > ndev->stats.rx_packets++; > ndev->stats.rx_bytes += len; > + u64_stats_update_begin(>syncp); > + stats64->rx_packets++; > + stats64->rx_bytes += len; > + u64_stats_update_end(>syncp); > > napi_gro_receive(>napi, skb); > next: > @@ -784,24 +789,31 @@ static void bcm_sysport_tx_reclaim_one(struct > bcm_sysport_tx_ring *ring, > unsigned int *pkts_compl) > { > struct bcm_sysport_priv *priv = ring->priv; > + struct bcm_sysport_stats *stats64 = >stats64; > struct device *kdev = >pdev->dev; > + unsigned int len = 0; > > if (cb->skb) { > - ring->bytes += cb->skb->len; > - *bytes_compl += cb->skb->len; > + len = cb->skb->len; > + *bytes_compl += len; > dma_unmap_single(kdev, dma_unmap_addr(cb, dma_addr), >dma_unmap_len(cb, dma_len), >DMA_TO_DEVICE); > - ring->packets++; > (*pkts_compl)++; > bcm_sysport_free_cb(cb); > /* SKB fragment */ > } else if (dma_unmap_addr(cb, dma_addr)) { > - ring->bytes += dma_unmap_len(cb, dma_len); > + len = dma_unmap_len(cb, dma_len); > dma_unmap_page(kdev, dma_unmap_addr(cb, dma_addr), > dma_unmap_len(cb, dma_len), DMA_TO_DEVICE); > dma_unmap_addr_set(cb, dma_addr, 0); > } > + > + u64_stats_update_begin(>syncp); > + ring->bytes += len; > + if (cb->skb) > + ring->packets++; How can this be safe now with bcm_sysport_free_cb() doing cb->skb = NULL? I know I recommended moving the 64-bit statistics update out of the if () else () clause before but still, it sounds like we are no longer able to count the ring packets here. > + u64_stats_update_end(>syncp); > } > > /* Reclaim queued SKBs for transmission completion, lockless version */ > @@ -1671,24 +1683,6 @@ static int bcm_sysport_change_mac(struct net_device > *dev, void *p) > return 0; > } > > -static struct net_device_stats *bcm_sysport_get_nstats(struct net_device > *dev) > -{ > - struct bcm_sysport_priv *priv = netdev_priv(dev); > - unsigned long tx_bytes = 0, tx_packets = 0; > - struct bcm_sysport_tx_ring *ring; > - unsigned int q; > - > - for (q = 0; q < dev->num_tx_queues; q++) { > - ring = >tx_rings[q]; > - tx_bytes += ring->bytes; > - tx_packets += ring->packets; > - } > - > - dev->stats.tx_bytes = tx_bytes; > - dev->stats.tx_packets = tx_packets; > - return >stats; > -} > - > static void bcm_sysport_netif_start(struct net_device *dev) > { > struct bcm_sysport_priv *priv = netdev_priv(dev); > @@ -1923,6 +1917,37 @@ static int bcm_sysport_stop(struct net_device *dev) > return 0; > } > > +static void bcm_sysport_get_stats64(struct net_device *dev, > + struct rtnl_link_stats64 *stats) > +{ > + struct bcm_sysport_priv *priv = netdev_priv(dev); > + struct bcm_sysport_stats *stats64 = >stats64; > + struct bcm_sysport_tx_ring *ring; > + u64 tx_packets = 0, tx_bytes = 0; > + unsigned int start; > + unsigned int q; > + > + netdev_stats_to_stats64(stats, >stats); > + > +
[PATCH v3 net-next] net: systemport: Support 64bit statistics
When using Broadcom Systemport device in 32bit Platform, ifconfig can only report up to 4G tx,rx status, which will be wrapped to 0 when the number of incoming or outgoing packets exceeds 4G, only taking around 2 hours in busy network environment (such as streaming). Therefore, it makes hard for network diagnostic tool to get reliable statistical result, so the patch is used to add 64bit support for Broadcom Systemport device in 32bit Platform. Signed-off-by: Jianming.qiao--- drivers/net/ethernet/broadcom/bcmsysport.c | 71 -- drivers/net/ethernet/broadcom/bcmsysport.h | 9 +++- 2 files changed, 55 insertions(+), 25 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c index 5274501..967221f 100644 --- a/drivers/net/ethernet/broadcom/bcmsysport.c +++ b/drivers/net/ethernet/broadcom/bcmsysport.c @@ -662,6 +662,7 @@ static int bcm_sysport_alloc_rx_bufs(struct bcm_sysport_priv *priv) static unsigned int bcm_sysport_desc_rx(struct bcm_sysport_priv *priv, unsigned int budget) { + struct bcm_sysport_stats *stats64 = >stats64; struct net_device *ndev = priv->netdev; unsigned int processed = 0, to_process; struct bcm_sysport_cb *cb; @@ -765,6 +766,10 @@ static unsigned int bcm_sysport_desc_rx(struct bcm_sysport_priv *priv, skb->protocol = eth_type_trans(skb, ndev); ndev->stats.rx_packets++; ndev->stats.rx_bytes += len; + u64_stats_update_begin(>syncp); + stats64->rx_packets++; + stats64->rx_bytes += len; + u64_stats_update_end(>syncp); napi_gro_receive(>napi, skb); next: @@ -784,24 +789,31 @@ static void bcm_sysport_tx_reclaim_one(struct bcm_sysport_tx_ring *ring, unsigned int *pkts_compl) { struct bcm_sysport_priv *priv = ring->priv; + struct bcm_sysport_stats *stats64 = >stats64; struct device *kdev = >pdev->dev; + unsigned int len = 0; if (cb->skb) { - ring->bytes += cb->skb->len; - *bytes_compl += cb->skb->len; + len = cb->skb->len; + *bytes_compl += len; dma_unmap_single(kdev, dma_unmap_addr(cb, dma_addr), dma_unmap_len(cb, dma_len), DMA_TO_DEVICE); - ring->packets++; (*pkts_compl)++; bcm_sysport_free_cb(cb); /* SKB fragment */ } else if (dma_unmap_addr(cb, dma_addr)) { - ring->bytes += dma_unmap_len(cb, dma_len); + len = dma_unmap_len(cb, dma_len); dma_unmap_page(kdev, dma_unmap_addr(cb, dma_addr), dma_unmap_len(cb, dma_len), DMA_TO_DEVICE); dma_unmap_addr_set(cb, dma_addr, 0); } + + u64_stats_update_begin(>syncp); + ring->bytes += len; + if (cb->skb) + ring->packets++; + u64_stats_update_end(>syncp); } /* Reclaim queued SKBs for transmission completion, lockless version */ @@ -1671,24 +1683,6 @@ static int bcm_sysport_change_mac(struct net_device *dev, void *p) return 0; } -static struct net_device_stats *bcm_sysport_get_nstats(struct net_device *dev) -{ - struct bcm_sysport_priv *priv = netdev_priv(dev); - unsigned long tx_bytes = 0, tx_packets = 0; - struct bcm_sysport_tx_ring *ring; - unsigned int q; - - for (q = 0; q < dev->num_tx_queues; q++) { - ring = >tx_rings[q]; - tx_bytes += ring->bytes; - tx_packets += ring->packets; - } - - dev->stats.tx_bytes = tx_bytes; - dev->stats.tx_packets = tx_packets; - return >stats; -} - static void bcm_sysport_netif_start(struct net_device *dev) { struct bcm_sysport_priv *priv = netdev_priv(dev); @@ -1923,6 +1917,37 @@ static int bcm_sysport_stop(struct net_device *dev) return 0; } +static void bcm_sysport_get_stats64(struct net_device *dev, + struct rtnl_link_stats64 *stats) +{ + struct bcm_sysport_priv *priv = netdev_priv(dev); + struct bcm_sysport_stats *stats64 = >stats64; + struct bcm_sysport_tx_ring *ring; + u64 tx_packets = 0, tx_bytes = 0; + unsigned int start; + unsigned int q; + + netdev_stats_to_stats64(stats, >stats); + + for (q = 0; q < dev->num_tx_queues; q++) { + ring = >tx_rings[q]; + do { + start = u64_stats_fetch_begin_irq(>syncp); + tx_bytes += ring->bytes; + tx_packets += ring->packets; + } while (u64_stats_fetch_retry_irq(>syncp, start)); + } + + stats->tx_packets = tx_packets; +
Re: [PATCH] bpf: testing: fix devmap tests
On 07/27/2017 02:32 AM, John Fastabend wrote: Apparently through one of my revisions of the initial patches series I lost the devmap test. We can add more testing later but for now lets fix the simple one we have. Fixes: 546ac1ffb70d "bpf: add devmap, a map for storing net device references" Reported-by: Jakub KicinskiSigned-off-by: John Fastabend Looks good, thanks! Acked-by: Daniel Borkmann
RE: [Intel-wired-lan] [PATCH] igb: support BCM54616 PHY
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On Behalf > Of John W. Linville > Sent: Friday, July 21, 2017 11:12 AM > To: netdev@vger.kernel.org > Cc: intel-wired-...@lists.osuosl.org; John W. Linville >> Subject: [Intel-wired-lan] [PATCH] igb: support BCM54616 PHY > > The management port on an Edgecore AS7712-32 switch uses an igb MAC, > but > it uses a BCM54616 PHY. Without a patch like this, loading the igb > module produces dmesg output like this: > > [3.439125] igb: Copyright (c) 2007-2014 Intel Corporation. > [3.439866] igb: probe of :00:14.0 failed with error -2 > > Signed-off-by: John W. Linville > Cc: Jeff Kirsher > --- > drivers/net/ethernet/intel/igb/e1000_82575.c | 6 ++ > drivers/net/ethernet/intel/igb/e1000_defines.h | 1 + > drivers/net/ethernet/intel/igb/e1000_hw.h | 1 + > 3 files changed, 8 insertions(+) I do not have the specific hardware (Edgecore switch) but as far as regression tests go this works fine. Tested-by: Aaron Brown
Re: [PATCH] bpf: testing: fix devmap tests
On 07/26/2017 05:32 PM, John Fastabend wrote: > Apparently through one of my revisions of the initial patches > series I lost the devmap test. We can add more testing later but > for now lets fix the simple one we have. > > Fixes: 546ac1ffb70d "bpf: add devmap, a map for storing net device references" > Reported-by: Jakub Kicinski> Signed-off-by: John Fastabend > --- Forgot prefix, this is for net-next. Sorry about that. Thanks, John
[PATCH] bpf: testing: fix devmap tests
Apparently through one of my revisions of the initial patches series I lost the devmap test. We can add more testing later but for now lets fix the simple one we have. Fixes: 546ac1ffb70d "bpf: add devmap, a map for storing net device references" Reported-by: Jakub KicinskiSigned-off-by: John Fastabend --- tools/testing/selftests/bpf/test_maps.c |4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index ce2988b..1579cab 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -104,6 +104,7 @@ enum bpf_map_type { BPF_MAP_TYPE_LPM_TRIE, BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_HASH_OF_MAPS, + BPF_MAP_TYPE_DEVMAP, }; enum bpf_prog_type { diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c index 36d6ac3..c991ab6 100644 --- a/tools/testing/selftests/bpf/test_maps.c +++ b/tools/testing/selftests/bpf/test_maps.c @@ -440,7 +440,7 @@ static void test_arraymap_percpu_many_keys(void) static void test_devmap(int task, void *data) { - int next_key, fd; + int fd; __u32 key, value; fd = bpf_create_map(BPF_MAP_TYPE_DEVMAP, sizeof(key), sizeof(value), @@ -620,6 +620,8 @@ static void run_all_tests(void) test_arraymap_percpu_many_keys(); + test_devmap(0, NULL); + test_map_large(); test_map_parallel(); test_map_stress();
Re: [PATCH net-next v2 07/10] net: dsa: lan9303: Added basic offloading of unicast traffic
Hi Egil, [auto build test WARNING on net-next/master] url: https://github.com/0day-ci/linux/commits/Egil-Hjelmeland/net-dsa-lan9303-unicast-offload-fdb-mdb-STP/20170727-074246 config: x86_64-randconfig-x000-201730 (attached as .config) compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901 reproduce: # save the attached .config to linux build tree make ARCH=x86_64 Note: it may well be a FALSE warning. FWIW you are at least aware of it now. http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings All warnings (new ones prefixed by >>): drivers/net//dsa/lan9303-core.c: In function 'lan9303_port_stp_state_set': >> drivers/net//dsa/lan9303-core.c:945:16: warning: 'portstate' may be used >> uninitialized in this function [-Wmaybe-uninitialized] int portmask, portstate; ^ vim +/portstate +945 drivers/net//dsa/lan9303-core.c 941 942 static void lan9303_port_stp_state_set(struct dsa_switch *ds, int port, 943 u8 state) 944 { > 945 int portmask, portstate; 946 struct lan9303 *chip = ds->priv; 947 948 dev_dbg(chip->dev, "%s(port %d, state %d)\n", 949 __func__, port, state); 950 if (!chip->is_bridged) 951 return; 952 953 switch (state) { 954 case BR_STATE_DISABLED: 955 portstate = LAN9303_SWE_PORT_STATE_DISABLED_PORT0; 956 break; 957 case BR_STATE_BLOCKING: 958 case BR_STATE_LISTENING: 959 portstate = LAN9303_SWE_PORT_STATE_BLOCKING_PORT0; 960 break; 961 case BR_STATE_LEARNING: 962 portstate = LAN9303_SWE_PORT_STATE_LEARNING_PORT0; 963 break; 964 case BR_STATE_FORWARDING: 965 portstate = LAN9303_SWE_PORT_STATE_FORWARDING_PORT0; 966 break; 967 default: 968 dev_err(chip->dev, "%s(port %d, state %d)\n", 969 __func__, port, state); 970 } 971 portmask = 0x3 << (port * 2); 972 portstate <<= (port * 2); 973 lan9303_write_switch_reg_mask(chip, LAN9303_SWE_PORT_STATE, 974portstate, portmask); 975 } 976 --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: application/gzip
Re: [PATCH net] bpf: don't zero out the info struct in bpf_obj_get_info_by_fd()
From: Jakub KicinskiDate: Tue, 25 Jul 2017 15:16:12 -0700 > The buffer passed to bpf_obj_get_info_by_fd() should be initialized > to zeros. Kernel will enforce that to guarantee we can safely extend > info structures in the future. > > Making the bpf_obj_get_info_by_fd() call in libbpf perform the zeroing > is problematic, however, since some members of the info structures > may need to be initialized by the callers (for instance pointers > to buffers to which kernel is to dump translated and jited images). > > Remove the zeroing and fix up the in-tree callers before any kernel > has been released with this code. > > As Daniel points out this seems to be the intended operation anyway, > since commit 95b9afd3987f ("bpf: Test for bpf ID") is itself setting > the buffer pointers before calling bpf_obj_get_info_by_fd(). > > Signed-off-by: Jakub Kicinski Applied, thanks Jakub.
Re: [PATCH] netpoll: Fix device name check in netpoll_setup()
From: Matthias KaehlckeDate: Tue, 25 Jul 2017 11:36:25 -0700 > Apparently netpoll_setup() assumes that netpoll.dev_name is a pointer > when checking if the device name is set: > > if (np->dev_name) { > ... > > However the field is a character array, therefore the condition always > yields true. Check instead whether the first byte of the array has a > non-zero value. > > Signed-off-by: Matthias Kaehlcke Applied, thanks a lot.
Re: [net-next 00/14][pull request] 40GbE Intel Wired LAN Driver Updates 2017-07-25
From: Jeff KirsherDate: Wed, 26 Jul 2017 03:31:09 -0700 > This series contains updates to i40e and i40evf only. > > Gustavo Silva fixes a variable assignment, where the incorrect variable > was being used to store the error parameter. > > Carolyn provides a fix for a problem found in systems when entering S4 > state, by ensuring that the misc vector's IRQ is disabled as well. > > Jake removes the single-threaded restriction on the module workqueue, > which was causing issues with events such as CORER. Does some future > proofing, by changing how the driver displays the UDP tunnel type. > > Paul adds a retry in releasing resources if the admin queue times out > during the first attempt to release the resources. > > Jesse fixes up references to 32bit timspec, since there are a small set > of errors on 32 bit, so we need to be using the right calls for dealing > with timespec64 variables. Cleaned up code indentation and corrected > an "if" conditional check, as well as making the code flow more clear. > Cast or changed the types to remove warnings for comparing signed and > unsigned types. Adds missing includes in i40evf, which were being used > but were not being directly included. > > Daniel Borkmann fixes i40e to fill the XDP prog_id with the id just like > other XDP enabled drivers, so that on dump we can retrieve the attached > program based on the id and dump BPF insns, opcodes, etc back to user > space. > > Tushar Dave adds le32_to_cpu while evaluating the hardware descriptor > fields, since they are in little-endian format. Also removed > unnecessary "__packed" to a couple of i40evf structures. > > Stefan Assmann fixes an issue when an administratively set MAC was set > and should now be switched back to 00:00:00:00:00:00, the pf_set_mac > flag is not being toggled back to false. ... > git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 40GbE Also pulled, thanks Jeff.
[PATCH net-next v2 00/10] netvsc fixes and new features
This series has updates for Hyper-V network driver. The first four patches are just minor fixes. The next group of patches aims to reduce the driver memory footprint by reducing the size of the per-device receive and send buffers, and the per-channel receive completion queue. The biggest change is in the last three patches that implement transparent management of SR-IOV VF device. A little more explaination is required. What does it do? With the new netvsc VF logic, the Linux hyper-V network virtual driver will directly manage the link to SR-IOV VF device. When VF device is detected (hot plug) it is automatically made a slave device of the netvsc device. The VF device state reflects the state of the netvsc device; i.e. if netvsc is set down, then VF is set down. If netvsc is set up, then VF is brought up. Packet flow is independent of VF status; all packets are sent and received as if they were associated with the netvsc device. If VF is removed or link is down then the synthetic VMBUS path is used. What was wrong with using bonding script? A lot of work went into getting the bonding script to work on all distributions, but it was a major struggle. Linux network devices can be configured many, many ways and there is no one solution from userspace to make it all work. What is really hard is when configuration is attached to synthetic device during boot (eth0) and then the same addresses and firewall rules needs to also work later if doing bonding. The new code gets around all of this. How does VF work during initialization? Since all packets are sent and received through the logical netvsc device, initialization is much easier. Just configure the regular netvsc Ethernet device; when/if SR-IOV is enabled it just works. Provisioning and cloud init only need to worry about setting up netvsc device (eth0). If SR-IOV is enabled (even as a later step), the address and rules stay the same. What devices show up? Both netvsc and PCI devices are visible in the system. The netvsc device is active and named in usual manner (eth0). The PCI device is visible to Linux and gets renamed by udev to a persistent name (enP2p3s0). The PCI device name is now irrelevant now. The logica also sets the PCI VF device SLAVE flag on the network device so network tools can see the relationship if they are smart enough to understand how layered devices work. This is a lot like how I see Windows working. The VF device is visible in Device Manager, but is not configured. Is there any performance impact? There is no visible change in performance. The bonding and netvsc driver both have equivalent steps. Is it compatible with old bonding script? It turns out that if you use the old bonding script, then everything still works but in a sub-optimum manner. What happens is that bonding is unable to steal the VF from the netvsc device so it creates a one legged bond. Packet flow then is: bond0 <--> eth0 <- -> VF (enP2p3s0). In other words, if you get it wrong it still works, just awkward and slower. What if I add address or firewall rule onto the VF? Same problems occur with now as already occur with bonding, bridging, teaming on Linux if user incorrectly does configuration onto an underlying slave device. It will sort of work, packets will come in and out but the Linux kernel gets confused and things like ARP don’t work right. There is no way to block manipulation of the slave device, and I am sure someone will find some special use case where they want it. Stephen Hemminger (10): netvsc: fix return value for set_channels netvsc: fix warnings reported by lockdep netvsc: don't print pointer value in error message netvsc: remove unnecessary indirection of page_buffer netvsc: optimize receive completions netvsc: signal host if receive ring is emptied netvsc: allow smaller send/recv buffer size netvsc: transparent VF management netvsc: add documentation netvsc: remove bonding setup script Documentation/networking/netvsc.txt | 50 MAINTAINERS | 1 + drivers/net/hyperv/hyperv_net.h | 33 ++- drivers/net/hyperv/netvsc.c | 317 +++- drivers/net/hyperv/netvsc_drv.c | 476 drivers/net/hyperv/rndis_filter.c | 108 tools/hv/bondvf.sh | 255 --- 7 files changed, 642 insertions(+), 598 deletions(-) create mode 100644 Documentation/networking/netvsc.txt delete mode 100755 tools/hv/bondvf.sh -- 2.11.0
[PATCH net-next v2 09/10] netvsc: add documentation
Add some background documentation on netvsc device options and limitations. Signed-off-by: Stephen Hemminger--- Documentation/networking/netvsc.txt | 50 + MAINTAINERS | 1 + 2 files changed, 51 insertions(+) create mode 100644 Documentation/networking/netvsc.txt diff --git a/Documentation/networking/netvsc.txt b/Documentation/networking/netvsc.txt new file mode 100644 index ..86f72dced4da --- /dev/null +++ b/Documentation/networking/netvsc.txt @@ -0,0 +1,50 @@ +Hyper-V network driver +== + +Compatibility += + +This driver is compatible and tested on Windows Server 2012 R2, 2016 and +Windows 10. + +Features + + + Checksum offload + + The netvsc driver supports checksum offload as long as the + Hyper-V host version does. Windows Server 2016 and Azure + support checksum offload for TCP and UDP for both IPv4 and + IPv6. Windows Server 2012 only supports checksum offload for TCP. + + Receive Side Scaling + + Hyper-V supports receive side scaling. For TCP, packets are + distributed among available queues based on IP address and port + number. Current versions of Hyper-V host, only distribute UDP + packets based on the IP source and destination address. + The port number is not used as part of the hash value for UDP. + Fragmented IP packets are not distributed between queues; + all fragmented packets arrive on the first channel. + + Generic Receive Offload, aka GRO + + The driver supports GRO and it is enabled by default. GRO coalesces + like packets and significantly reduces CPU usage under heavy Rx + load. + + SR-IOV support + -- + Hyper-V supports SR-IOV as a hardware acceleration option. If SR-IOV + is enabled in both the vSwitch and the guest configuration, then the + Virtual Function (VF) device is passed to the guest as a PCI + device. In this case, both a synthetic (netvsc) and VF device are + visible in the guest OS and both NIC's have the same MAC address. + + The VF is enslaved by netvsc device. The netvsc driver will transparently + switch the data path to the VF when it is available and up. + Network state (addresses, firewall, etc) should be applied only to the + netvsc device; the slave device should not be accessed directly in + most cases. The exceptions are if some special queue discipline or + flow direction is desired, these should be applied directly to the + VF slave device. diff --git a/MAINTAINERS b/MAINTAINERS index 297e610c9163..d30c17df1deb 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -6294,6 +6294,7 @@ M:Haiyang Zhang M: Stephen Hemminger L: de...@linuxdriverproject.org S: Maintained +F: Documentation/networking/netvsc.txt F: arch/x86/include/asm/mshyperv.h F: arch/x86/include/uapi/asm/hyperv.h F: arch/x86/kernel/cpu/mshyperv.c -- 2.11.0
[PATCH net-next v2 10/10] netvsc: remove bonding setup script
No longer needed, now all managed by transparent VF logic. Signed-off-by: Stephen Hemminger--- tools/hv/bondvf.sh | 255 - 1 file changed, 255 deletions(-) delete mode 100755 tools/hv/bondvf.sh diff --git a/tools/hv/bondvf.sh b/tools/hv/bondvf.sh deleted file mode 100755 index 80f102860cf8.. --- a/tools/hv/bondvf.sh +++ /dev/null @@ -1,255 +0,0 @@ -#!/bin/bash - -# This example script creates bonding network devices based on synthetic NIC -# (the virtual network adapter usually provided by Hyper-V) and the matching -# VF NIC (SRIOV virtual function). So the synthetic NIC and VF NIC can -# function as one network device, and fail over to the synthetic NIC if VF is -# down. -# -# Usage: -# - After configured vSwitch and vNIC with SRIOV, start Linux virtual -# machine (VM) -# - Run this scripts on the VM. It will create configuration files in -# distro specific directory. -# - Reboot the VM, so that the bonding config are enabled. -# -# The config files are DHCP by default. You may edit them if you need to change -# to Static IP or change other settings. -# - -sysdir=/sys/class/net -netvsc_cls={f8615163-df3e-46c5-913f-f2d2f965ed0e} -bondcnt=0 - -# Detect Distro -if [ -f /etc/redhat-release ]; -then - cfgdir=/etc/sysconfig/network-scripts - distro=redhat -elif grep -q 'Ubuntu' /etc/issue -then - cfgdir=/etc/network - distro=ubuntu -elif grep -q 'SUSE' /etc/issue -then - cfgdir=/etc/sysconfig/network - distro=suse -else - echo "Unsupported Distro" - exit 1 -fi - -echo Detected Distro: $distro, or compatible - -# Get a list of ethernet names -list_eth=(`cd $sysdir && ls -d */ | cut -d/ -f1 | grep -v bond`) -eth_cnt=${#list_eth[@]} - -echo List of net devices: - -# Get the MAC addresses -for (( i=0; i < $eth_cnt; i++ )) -do - list_mac[$i]=`cat $sysdir/${list_eth[$i]}/address` - echo ${list_eth[$i]}, ${list_mac[$i]} -done - -# Find NIC with matching MAC -for (( i=0; i < $eth_cnt-1; i++ )) -do - for (( j=i+1; j < $eth_cnt; j++ )) - do - if [ "${list_mac[$i]}" = "${list_mac[$j]}" ] - then - list_match[$i]=${list_eth[$j]} - break - fi - done -done - -function create_eth_cfg_redhat { - local fn=$cfgdir/ifcfg-$1 - - rm -f $fn - echo DEVICE=$1 >>$fn - echo TYPE=Ethernet >>$fn - echo BOOTPROTO=none >>$fn - echo UUID=`uuidgen` >>$fn - echo ONBOOT=yes >>$fn - echo PEERDNS=yes >>$fn - echo IPV6INIT=yes >>$fn - echo MASTER=$2 >>$fn - echo SLAVE=yes >>$fn -} - -function create_eth_cfg_pri_redhat { - create_eth_cfg_redhat $1 $2 -} - -function create_bond_cfg_redhat { - local fn=$cfgdir/ifcfg-$1 - - rm -f $fn - echo DEVICE=$1 >>$fn - echo TYPE=Bond >>$fn - echo BOOTPROTO=dhcp >>$fn - echo UUID=`uuidgen` >>$fn - echo ONBOOT=yes >>$fn - echo PEERDNS=yes >>$fn - echo IPV6INIT=yes >>$fn - echo BONDING_MASTER=yes >>$fn - echo BONDING_OPTS=\"mode=active-backup miimon=100 primary=$2\" >>$fn -} - -function del_eth_cfg_ubuntu { - local mainfn=$cfgdir/interfaces - local fnlist=( $mainfn ) - - local dirlist=(`awk '/^[ \t]*source/{print $2}' $mainfn`) - - local i - for i in "${dirlist[@]}" - do - fnlist+=(`ls $i 2>/dev/null`) - done - - local tmpfl=$(mktemp) - - local nic_start='^[ \t]*(auto|iface|mapping|allow-.*)[ \t]+'$1 - local nic_end='^[ \t]*(auto|iface|mapping|allow-.*|source)' - - local fn - for fn in "${fnlist[@]}" - do - awk "/$nic_end/{x=0} x{next} /$nic_start/{x=1;next} 1" \ - $fn >$tmpfl - - cp $tmpfl $fn - done - - rm $tmpfl -} - -function create_eth_cfg_ubuntu { - local fn=$cfgdir/interfaces - - del_eth_cfg_ubuntu $1 - echo $'\n'auto $1 >>$fn - echo iface $1 inet manual >>$fn - echo bond-master $2 >>$fn -} - -function create_eth_cfg_pri_ubuntu { - local fn=$cfgdir/interfaces - - del_eth_cfg_ubuntu $1 - echo $'\n'allow-hotplug $1 >>$fn - echo iface $1 inet manual >>$fn - echo bond-master $2 >>$fn - echo bond-primary $1 >>$fn -} - -function create_bond_cfg_ubuntu { - local fn=$cfgdir/interfaces - - del_eth_cfg_ubuntu $1 - - echo $'\n'auto $1 >>$fn - echo iface $1 inet dhcp >>$fn - echo bond-mode active-backup >>$fn - echo bond-miimon 100 >>$fn - echo bond-slaves none >>$fn -} - -function create_eth_cfg_suse { -local fn=$cfgdir/ifcfg-$1 - -rm -f $fn - echo BOOTPROTO=none >>$fn - echo STARTMODE=auto >>$fn -} - -function create_eth_cfg_pri_suse { - local fn=$cfgdir/ifcfg-$1 - - rm -f $fn - echo BOOTPROTO=none >>$fn - echo
[PATCH net-next v2 02/10] netvsc: fix warnings reported by lockdep
This includes a bunch of fixups for issues reported by lockdep. * ethtool routines can assume RTNL * send is done with RCU lock (and BH disable) * avoid refetching internal device struct (netvsc) instead pass it as a parameter. Signed-off-by: Stephen Hemminger--- drivers/net/hyperv/hyperv_net.h | 3 +- drivers/net/hyperv/netvsc.c | 2 +- drivers/net/hyperv/netvsc_drv.c | 15 --- drivers/net/hyperv/rndis_filter.c | 84 +++ 4 files changed, 54 insertions(+), 50 deletions(-) diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h index 4e7ff348327e..fb62ea632914 100644 --- a/drivers/net/hyperv/hyperv_net.h +++ b/drivers/net/hyperv/hyperv_net.h @@ -217,7 +217,8 @@ int rndis_filter_receive(struct net_device *ndev, struct vmbus_channel *channel, void *data, u32 buflen); -int rndis_filter_set_device_mac(struct net_device *ndev, char *mac); +int rndis_filter_set_device_mac(struct netvsc_device *ndev, + const char *mac); void netvsc_switch_datapath(struct net_device *nv_dev, bool vf); diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 06f39a99da7c..94c00acac58a 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -832,7 +832,7 @@ int netvsc_send(struct net_device_context *ndev_ctx, struct sk_buff *skb) { struct netvsc_device *net_device - = rcu_dereference_rtnl(ndev_ctx->nvdev); + = rcu_dereference_bh(ndev_ctx->nvdev); struct hv_device *device = ndev_ctx->device_ctx; int ret = 0; struct netvsc_channel *nvchan; diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index f1eaf675d2e9..a04f2efbbc25 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -923,6 +923,8 @@ static void netvsc_get_stats64(struct net_device *net, static int netvsc_set_mac_addr(struct net_device *ndev, void *p) { + struct net_device_context *ndc = netdev_priv(ndev); + struct netvsc_device *nvdev = rtnl_dereference(ndc->nvdev); struct sockaddr *addr = p; char save_adr[ETH_ALEN]; unsigned char save_aatype; @@ -935,7 +937,10 @@ static int netvsc_set_mac_addr(struct net_device *ndev, void *p) if (err != 0) return err; - err = rndis_filter_set_device_mac(ndev, addr->sa_data); + if (!nvdev) + return -ENODEV; + + err = rndis_filter_set_device_mac(nvdev, addr->sa_data); if (err != 0) { /* roll back to saved MAC */ memcpy(ndev->dev_addr, save_adr, ETH_ALEN); @@ -981,7 +986,7 @@ static void netvsc_get_ethtool_stats(struct net_device *dev, struct ethtool_stats *stats, u64 *data) { struct net_device_context *ndc = netdev_priv(dev); - struct netvsc_device *nvdev = rcu_dereference(ndc->nvdev); + struct netvsc_device *nvdev = rtnl_dereference(ndc->nvdev); const void *nds = >eth_stats; const struct netvsc_stats *qstats; unsigned int start; @@ -1019,7 +1024,7 @@ static void netvsc_get_ethtool_stats(struct net_device *dev, static void netvsc_get_strings(struct net_device *dev, u32 stringset, u8 *data) { struct net_device_context *ndc = netdev_priv(dev); - struct netvsc_device *nvdev = rcu_dereference(ndc->nvdev); + struct netvsc_device *nvdev = rtnl_dereference(ndc->nvdev); u8 *p = data; int i; @@ -1077,7 +1082,7 @@ netvsc_get_rxnfc(struct net_device *dev, struct ethtool_rxnfc *info, u32 *rules) { struct net_device_context *ndc = netdev_priv(dev); - struct netvsc_device *nvdev = rcu_dereference(ndc->nvdev); + struct netvsc_device *nvdev = rtnl_dereference(ndc->nvdev); if (!nvdev) return -ENODEV; @@ -1127,7 +1132,7 @@ static int netvsc_get_rxfh(struct net_device *dev, u32 *indir, u8 *key, u8 *hfunc) { struct net_device_context *ndc = netdev_priv(dev); - struct netvsc_device *ndev = rcu_dereference(ndc->nvdev); + struct netvsc_device *ndev = rtnl_dereference(ndc->nvdev); struct rndis_device *rndis_dev; int i; diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c index eaa3f0d5682a..bf21ea92c743 100644 --- a/drivers/net/hyperv/rndis_filter.c +++ b/drivers/net/hyperv/rndis_filter.c @@ -85,14 +85,6 @@ static struct rndis_device *get_rndis_device(void) return device; } -static struct netvsc_device * -net_device_to_netvsc_device(struct net_device *ndev) -{ - struct net_device_context *net_device_ctx = netdev_priv(ndev); - - return rtnl_dereference(net_device_ctx->nvdev); -} - static struct rndis_request *get_rndis_request(struct rndis_device
[PATCH net-next v2 04/10] netvsc: remove unnecessary indirection of page_buffer
The internal API was passing struct hv_page_buffer ** when only simple struct hv_page_buffer * was necessary for passing an array. Signed-off-by: Stephen Hemminger--- drivers/net/hyperv/hyperv_net.h | 2 +- drivers/net/hyperv/netvsc.c | 21 ++--- drivers/net/hyperv/netvsc_drv.c | 10 -- drivers/net/hyperv/rndis_filter.c | 4 ++-- 4 files changed, 17 insertions(+), 20 deletions(-) diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h index fb62ea632914..9ca3ed692d73 100644 --- a/drivers/net/hyperv/hyperv_net.h +++ b/drivers/net/hyperv/hyperv_net.h @@ -190,7 +190,7 @@ void netvsc_device_remove(struct hv_device *device); int netvsc_send(struct net_device_context *ndc, struct hv_netvsc_packet *packet, struct rndis_message *rndis_msg, - struct hv_page_buffer **page_buffer, + struct hv_page_buffer *page_buffer, struct sk_buff *skb); void netvsc_linkstatus_callback(struct hv_device *device_obj, struct rndis_message *resp); diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index f0c15e782ce0..d3c0b19f6d34 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -697,7 +697,7 @@ static u32 netvsc_copy_to_send_buf(struct netvsc_device *net_device, u32 pend_size, struct hv_netvsc_packet *packet, struct rndis_message *rndis_msg, - struct hv_page_buffer **pb, + struct hv_page_buffer *pb, struct sk_buff *skb) { char *start = net_device->send_buf; @@ -718,9 +718,9 @@ static u32 netvsc_copy_to_send_buf(struct netvsc_device *net_device, } for (i = 0; i < page_count; i++) { - char *src = phys_to_virt((*pb)[i].pfn << PAGE_SHIFT); - u32 offset = (*pb)[i].offset; - u32 len = (*pb)[i].len; + char *src = phys_to_virt(pb[i].pfn << PAGE_SHIFT); + u32 offset = pb[i].offset; + u32 len = pb[i].len; memcpy(dest, (src + offset), len); msg_size += len; @@ -739,7 +739,7 @@ static inline int netvsc_send_pkt( struct hv_device *device, struct hv_netvsc_packet *packet, struct netvsc_device *net_device, - struct hv_page_buffer **pb, + struct hv_page_buffer *pb, struct sk_buff *skb) { struct nvsp_message nvmsg; @@ -750,7 +750,6 @@ static inline int netvsc_send_pkt( struct netdev_queue *txq = netdev_get_tx_queue(ndev, packet->q_idx); u64 req_id; int ret; - struct hv_page_buffer *pgbuf; u32 ring_avail = hv_ringbuf_avail_percent(_channel->outbound); nvmsg.hdr.msg_type = NVSP_MSG1_TYPE_SEND_RNDIS_PKT; @@ -776,11 +775,11 @@ static inline int netvsc_send_pkt( return -ENODEV; if (packet->page_buf_cnt) { - pgbuf = packet->cp_partial ? (*pb) + - packet->rmsg_pgcnt : (*pb); + if (packet->cp_partial) + pb += packet->rmsg_pgcnt; + ret = vmbus_sendpacket_pagebuffer_ctl(out_channel, - pgbuf, - packet->page_buf_cnt, + pb, packet->page_buf_cnt, , sizeof(struct nvsp_message), req_id, @@ -830,7 +829,7 @@ static inline void move_pkt_msd(struct hv_netvsc_packet **msd_send, int netvsc_send(struct net_device_context *ndev_ctx, struct hv_netvsc_packet *packet, struct rndis_message *rndis_msg, - struct hv_page_buffer **pb, + struct hv_page_buffer *pb, struct sk_buff *skb) { struct netvsc_device *net_device diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index a04f2efbbc25..8ff4cbf582cc 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -282,9 +282,8 @@ static u32 fill_pg_buf(struct page *page, u32 offset, u32 len, static u32 init_page_array(void *hdr, u32 len, struct sk_buff *skb, struct hv_netvsc_packet *packet, - struct hv_page_buffer **page_buf) + struct hv_page_buffer *pb) { - struct hv_page_buffer *pb = *page_buf; u32 slots_used = 0; char *data = skb->data; int frags = skb_shinfo(skb)->nr_frags; @@ -359,8 +358,7 @@ static int netvsc_start_xmit(struct sk_buff *skb, struct net_device *net)
[PATCH net-next v2 07/10] netvsc: allow smaller send/recv buffer size
The default value of send and receive buffer area for host DMA is much larger than it needs to be. Experimentation shows that 4M receive and 1M send is sufficient. Make the size a module parameter so that it can be adjusted as needed for testing or special needs. Signed-off-by: Stephen Hemminger--- drivers/net/hyperv/hyperv_net.h | 2 ++ drivers/net/hyperv/netvsc.c | 18 -- drivers/net/hyperv/netvsc_drv.c | 31 ++- 3 files changed, 44 insertions(+), 7 deletions(-) diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h index f2cef5aaed1f..4779422868cf 100644 --- a/drivers/net/hyperv/hyperv_net.h +++ b/drivers/net/hyperv/hyperv_net.h @@ -148,6 +148,8 @@ struct netvsc_device_info { unsigned char mac_adr[ETH_ALEN]; int ring_size; u32 num_chn; + u32 recv_buf_size; + u32 send_buf_size; }; enum rndis_device_state { diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index c5e6f7fc4f2b..dd64227c736c 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -482,14 +482,16 @@ static int negotiate_nvsp_ver(struct hv_device *device, } static int netvsc_connect_vsp(struct hv_device *device, - struct netvsc_device *net_device) + struct netvsc_device *net_device, + const struct netvsc_device_info *device_info) { const u32 ver_list[] = { NVSP_PROTOCOL_VERSION_1, NVSP_PROTOCOL_VERSION_2, NVSP_PROTOCOL_VERSION_4, NVSP_PROTOCOL_VERSION_5 }; struct nvsp_message *init_packet; - int ndis_version, i, ret; + u32 max_recv_buf_size, ndis_version; + int i, ret; init_packet = _device->channel_init_pkt; @@ -534,10 +536,14 @@ static int netvsc_connect_vsp(struct hv_device *device, /* Post the big receive buffer to NetVSP */ if (net_device->nvsp_version <= NVSP_PROTOCOL_VERSION_2) - net_device->recv_buf_size = NETVSC_RECEIVE_BUFFER_SIZE_LEGACY; + max_recv_buf_size = NETVSC_RECEIVE_BUFFER_SIZE_LEGACY; else - net_device->recv_buf_size = NETVSC_RECEIVE_BUFFER_SIZE; - net_device->send_buf_size = NETVSC_SEND_BUFFER_SIZE; + max_recv_buf_size = NETVSC_RECEIVE_BUFFER_SIZE; + + net_device->recv_buf_size = min(max_recv_buf_size, + device_info->recv_buf_size); + net_device->send_buf_size = min_t(u32, NETVSC_SEND_BUFFER_SIZE, + device_info->send_buf_size); ret = netvsc_init_buf(device, net_device); @@ -1302,7 +1308,7 @@ struct netvsc_device *netvsc_device_add(struct hv_device *device, rcu_assign_pointer(net_device_ctx->nvdev, net_device); /* Connect with the NetVsp */ - ret = netvsc_connect_vsp(device, net_device); + ret = netvsc_connect_vsp(device, net_device, device_info); if (ret != 0) { netdev_err(ndev, "unable to connect to NetVSP - %d\n", ret); diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index 8ff4cbf582cc..b662c0198807 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -44,13 +44,23 @@ #include "hyperv_net.h" -#define RING_SIZE_MIN 64 +#define RING_SIZE_MIN 64 +#define RECV_BUFFER_MIN 16 +#define SEND_BUFFER_MIN 4 #define LINKCHANGE_INT (2 * HZ) static int ring_size = 128; module_param(ring_size, int, S_IRUGO); MODULE_PARM_DESC(ring_size, "Ring buffer size (# of pages)"); +static unsigned int recv_buffer_size = (4 * 1024 * 1024) / PAGE_SIZE; +module_param(recv_buffer_size, uint, S_IRUGO); +MODULE_PARM_DESC(recv_buffer_size, "Receive buffer size (# of pages)"); + +static unsigned int send_buffer_size = (1024 * 1024) / PAGE_SIZE; +module_param(send_buffer_size, uint, S_IRUGO); +MODULE_PARM_DESC(send_buffer_size, "Send buffer size (# of pages)"); + static const u32 default_msg = NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK | NETIF_MSG_IFUP | NETIF_MSG_IFDOWN | NETIF_MSG_RX_ERR | @@ -751,6 +761,8 @@ static int netvsc_set_channels(struct net_device *net, memset(_info, 0, sizeof(device_info)); device_info.num_chn = count; device_info.ring_size = ring_size; + device_info.send_buf_size = nvdev->send_buf_size; + device_info.recv_buf_size = nvdev->recv_buf_size; nvdev = rndis_filter_device_add(dev, _info); if (!IS_ERR(nvdev)) { @@ -848,6 +860,8 @@ static int netvsc_change_mtu(struct net_device *ndev, int mtu) memset(_info, 0, sizeof(device_info)); device_info.ring_size = ring_size; device_info.num_chn = nvdev->num_chn; + device_info.send_buf_size = nvdev->send_buf_size; +
[PATCH net-next v2 06/10] netvsc: signal host if receive ring is emptied
Latency improvement related to NAPI conversion. If all packets are processed from receive ring then need to signal host. Signed-off-by: Stephen Hemminger--- drivers/net/hyperv/netvsc.c | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 4c709b454d34..c5e6f7fc4f2b 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -1195,10 +1195,15 @@ int netvsc_poll(struct napi_struct *napi, int budget) nvchan->desc = hv_pkt_iter_next(channel, nvchan->desc); } - /* If send of pending receive completions suceeded -* and did not exhaust NAPI budget + /* if ring is empty, signal host */ + if (!nvchan->desc) + hv_pkt_iter_close(channel); + + /* If send of pending receive completions suceeded +* and did not exhaust NAPI budget this time * and not doing busy poll -* then reschedule if more data has arrived from host +* then re-enable host interrupts +* and reschedule if ring is not empty. */ if (send_recv_completions(nvchan) == 0 && work_done < budget && -- 2.11.0
[PATCH net-next v2 05/10] netvsc: optimize receive completions
Optimize how receive completion ring are managed. * Allocate only as many slots as needed for all buffers from host * Allocate before setting up sub channel for better error detection * Don't need to keep copy of initial receive section message * Precompute the watermark for when receive flushing is needed * Replace division with conditional test * Replace atomic per-device variable with per-channel check. * Handle corner case where receive completion send fails if ring buffer to host is full. Signed-off-by: Stephen Hemminger--- drivers/net/hyperv/hyperv_net.h | 14 +- drivers/net/hyperv/netvsc.c | 267 -- drivers/net/hyperv/rndis_filter.c | 20 +-- 3 files changed, 126 insertions(+), 175 deletions(-) diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h index 9ca3ed692d73..f2cef5aaed1f 100644 --- a/drivers/net/hyperv/hyperv_net.h +++ b/drivers/net/hyperv/hyperv_net.h @@ -186,6 +186,7 @@ struct net_device_context; struct netvsc_device *netvsc_device_add(struct hv_device *device, const struct netvsc_device_info *info); +int netvsc_alloc_recv_comp_ring(struct netvsc_device *net_device, u32 q_idx); void netvsc_device_remove(struct hv_device *device); int netvsc_send(struct net_device_context *ndc, struct hv_netvsc_packet *packet, @@ -657,13 +658,10 @@ struct recv_comp_data { u32 status; }; -/* Netvsc Receive Slots Max */ -#define NETVSC_RECVSLOT_MAX (NETVSC_RECEIVE_BUFFER_SIZE / ETH_DATA_LEN + 1) - struct multi_recv_comp { - void *buf; /* queued receive completions */ - u32 first; /* first data entry */ - u32 next; /* next entry for writing */ + struct recv_comp_data *slots; + u32 first; /* first data entry */ + u32 next; /* next entry for writing */ }; struct netvsc_stats { @@ -750,7 +748,7 @@ struct netvsc_device { u32 recv_buf_size; u32 recv_buf_gpadl_handle; u32 recv_section_cnt; - struct nvsp_1_receive_buffer_section *recv_section; + u32 recv_completion_cnt; /* Send buffer allocated by us */ void *send_buf; @@ -778,8 +776,6 @@ struct netvsc_device { u32 max_pkt; /* max number of pkt in one send, e.g. 8 */ u32 pkt_align; /* alignment bytes, e.g. 8 */ - atomic_t num_outstanding_recvs; - atomic_t open_cnt; struct netvsc_channel chan_table[VRSS_CHANNEL_MAX]; diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index d3c0b19f6d34..4c709b454d34 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -72,9 +72,6 @@ static struct netvsc_device *alloc_net_device(void) if (!net_device) return NULL; - net_device->chan_table[0].mrc.buf - = vzalloc(NETVSC_RECVSLOT_MAX * sizeof(struct recv_comp_data)); - init_waitqueue_head(_device->wait_drain); net_device->destroy = false; atomic_set(_device->open_cnt, 0); @@ -92,7 +89,7 @@ static void free_netvsc_device(struct rcu_head *head) int i; for (i = 0; i < VRSS_CHANNEL_MAX; i++) - vfree(nvdev->chan_table[i].mrc.buf); + vfree(nvdev->chan_table[i].mrc.slots); kfree(nvdev); } @@ -171,12 +168,6 @@ static void netvsc_destroy_buf(struct hv_device *device) net_device->recv_buf = NULL; } - if (net_device->recv_section) { - net_device->recv_section_cnt = 0; - kfree(net_device->recv_section); - net_device->recv_section = NULL; - } - /* Deal with the send buffer we may have setup. * If we got a send section size, it means we received a * NVSP_MSG1_TYPE_SEND_SEND_BUF_COMPLETE msg (ie sent @@ -239,11 +230,26 @@ static void netvsc_destroy_buf(struct hv_device *device) kfree(net_device->send_section_map); } +int netvsc_alloc_recv_comp_ring(struct netvsc_device *net_device, u32 q_idx) +{ + struct netvsc_channel *nvchan = _device->chan_table[q_idx]; + int node = cpu_to_node(nvchan->channel->target_cpu); + size_t size; + + size = net_device->recv_completion_cnt * sizeof(struct recv_comp_data); + nvchan->mrc.slots = vzalloc_node(size, node); + if (!nvchan->mrc.slots) + nvchan->mrc.slots = vzalloc(size); + + return nvchan->mrc.slots ? 0 : -ENOMEM; +} + static int netvsc_init_buf(struct hv_device *device, struct netvsc_device *net_device) { int ret = 0; struct nvsp_message *init_packet; + struct nvsp_1_message_send_receive_buffer_complete *resp; struct net_device *ndev; size_t map_words; int node; @@ -300,43 +306,41 @@ static int netvsc_init_buf(struct hv_device *device, wait_for_completion(_device->channel_init_wait); /* Check
[PATCH net-next v2 08/10] netvsc: transparent VF management
This patch implements transparent fail over from synthetic NIC to SR-IOV virtual function NIC in Hyper-V environment. It is a better alternative to using bonding as is done now. Instead, the receive and transmit fail over is done internally inside the driver. Using bonding driver has lots of issues because it depends on the script being run early enough in the boot process and with sufficient information to make the association. This patch moves all that functionality into the kernel. Signed-off-by: Stephen Hemminger--- drivers/net/hyperv/hyperv_net.h | 12 ++ drivers/net/hyperv/netvsc_drv.c | 418 +++- 2 files changed, 341 insertions(+), 89 deletions(-) diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h index 4779422868cf..8ff4ff0d77dd 100644 --- a/drivers/net/hyperv/hyperv_net.h +++ b/drivers/net/hyperv/hyperv_net.h @@ -682,6 +682,15 @@ struct netvsc_ethtool_stats { unsigned long tx_busy; }; +struct netvsc_vf_pcpu_stats { + u64 rx_packets; + u64 rx_bytes; + u64 tx_packets; + u64 tx_bytes; + struct u64_stats_sync syncp; + u32 tx_dropped; +}; + struct netvsc_reconfig { struct list_head list; u32 event; @@ -715,6 +724,9 @@ struct net_device_context { /* State to manage the associated VF interface. */ struct net_device __rcu *vf_netdev; + struct netvsc_vf_pcpu_stats __percpu *vf_stats; + struct work_struct vf_takeover; + struct work_struct vf_notify; /* 1: allocated, serial number is valid. 0: not allocated */ u32 vf_alloc; diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index b662c0198807..73edc3db97b2 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -34,6 +34,7 @@ #include #include #include +#include #include #include @@ -81,6 +82,7 @@ static void netvsc_set_multicast_list(struct net_device *net) static int netvsc_open(struct net_device *net) { struct net_device_context *ndev_ctx = netdev_priv(net); + struct net_device *vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev); struct netvsc_device *nvdev = rtnl_dereference(ndev_ctx->nvdev); struct rndis_device *rdev; int ret = 0; @@ -97,15 +99,29 @@ static int netvsc_open(struct net_device *net) netif_tx_wake_all_queues(net); rdev = nvdev->extension; - if (!rdev->link_state && !ndev_ctx->datapath) + + if (!rdev->link_state) netif_carrier_on(net); - return ret; + if (vf_netdev) { + /* Setting synthetic device up transparently sets +* slave as up. If open fails, then slave will be +* still be offline (and not used). +*/ + ret = dev_open(vf_netdev); + if (ret) + netdev_warn(net, + "unable to open slave: %s: %d\n", + vf_netdev->name, ret); + } + return 0; } static int netvsc_close(struct net_device *net) { struct net_device_context *net_device_ctx = netdev_priv(net); + struct net_device *vf_netdev + = rtnl_dereference(net_device_ctx->vf_netdev); struct netvsc_device *nvdev = rtnl_dereference(net_device_ctx->nvdev); int ret; u32 aread, i, msec = 10, retry = 0, retry_max = 20; @@ -151,6 +167,9 @@ static int netvsc_close(struct net_device *net) ret = -ETIMEDOUT; } + if (vf_netdev) + dev_close(vf_netdev); + return ret; } @@ -234,13 +253,11 @@ static inline int netvsc_get_tx_queue(struct net_device *ndev, * * TODO support XPS - but get_xps_queue not exported */ -static u16 netvsc_select_queue(struct net_device *ndev, struct sk_buff *skb, - void *accel_priv, select_queue_fallback_t fallback) +static u16 netvsc_pick_tx(struct net_device *ndev, struct sk_buff *skb) { - unsigned int num_tx_queues = ndev->real_num_tx_queues; int q_idx = sk_tx_queue_get(skb->sk); - if (q_idx < 0 || skb->ooo_okay) { + if (q_idx < 0 || skb->ooo_okay || q_idx >= ndev->real_num_tx_queues) { /* If forwarding a packet, we use the recorded queue when * available for better cache locality. */ @@ -250,12 +267,33 @@ static u16 netvsc_select_queue(struct net_device *ndev, struct sk_buff *skb, q_idx = netvsc_get_tx_queue(ndev, skb, q_idx); } - while (unlikely(q_idx >= num_tx_queues)) - q_idx -= num_tx_queues; - return q_idx; } +static u16 netvsc_select_queue(struct net_device *ndev, struct sk_buff *skb, + void *accel_priv, + select_queue_fallback_t fallback) +{ +
[PATCH net-next v2 03/10] netvsc: don't print pointer value in error message
Using %p to print pointer to packet meta-data doesn't give any good info, and exposes kernel memory offsets. Signed-off-by: Stephen Hemminger--- drivers/net/hyperv/netvsc.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 94c00acac58a..f0c15e782ce0 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -805,8 +805,10 @@ static inline int netvsc_send_pkt( ret = -ENOSPC; } } else { - netdev_err(ndev, "Unable to send packet %p ret %d\n", - packet, ret); + netdev_err(ndev, + "Unable to send packet pages %u len %u, ret %d\n", + packet->page_buf_cnt, packet->total_data_buflen, + ret); } return ret; -- 2.11.0
[PATCH net-next v2 01/10] netvsc: fix return value for set_channels
The error and normal case got swapped. Signed-off-by: Stephen Hemminger--- drivers/net/hyperv/netvsc_drv.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index 262486ce8e2a..f1eaf675d2e9 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -758,8 +758,8 @@ static int netvsc_set_channels(struct net_device *net, if (!IS_ERR(nvdev)) { netif_set_real_num_tx_queues(net, nvdev->num_chn); netif_set_real_num_rx_queues(net, nvdev->num_chn); - ret = PTR_ERR(nvdev); } else { + ret = PTR_ERR(nvdev); device_info.num_chn = orig; rndis_filter_device_add(dev, _info); } -- 2.11.0
Re: [net-next v4 0/5][pull request] 10GbE Intel Wired LAN Driver Updates 2017-07-25
From: Jeff KirsherDate: Tue, 25 Jul 2017 16:41:33 -0700 > This series contains updates to ixgbe only. > > Tony provides all of the changes in the series, starting with adding a > check to ensure that adding a MAC filter was successful, before setting the > MACVLAN. In order to receive notifications of link configurations of the > external PHY and support the configuration of the internal iXFI link on > X552 devices, Tony enables LASI interrupts. Update the iXFI driver code > flow, since the MAC register NW_MNG_IF_SEL fields have been redefined for > X553 devices, so add MAC checks for iXFI flows. Added additional checks > for flow control autonegotiation, since it is not support for X553 fiber > and XFI devices. > > v2: removed unnecessary parens noticed by David Miller in patch 6 of the > series. > v3: dropped patch 6 of the original series, while we work out a more > generic solution for malicious driver detection (MDD) support. > v4: updated patch 1 of the series with the comments from Joe Perches which > were: > - switched logic to return on error > - return 0 on success > - declare retval as an integer ... > git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 10GbE Pulled, thanks Jeff!
Re: [PATCH] net: inet: diag: expose sockets cgroup classid
On Wed, Jul 26, 2017 at 03:01:32PM -0700, Cong Wang wrote: >On Wed, Jul 26, 2017 at 10:22 AM, Levin, Alexander (Sasha Levin) >wrote: >> + if (ext & (1 << (INET_DIAG_CLASS_ID - 1))) { >> + u32 classid = 0; >> + >> +#ifdef CONFIG_SOCK_CGROUP_DATA >> + classid = sock_cgroup_classid(>sk_cgrp_data); >> +#endif > > >If CONFIG_SOCK_CGROUP_DATA is not enabled, should we put 0 >or put nothing (that is, skipping this nla_put())? My logic was that if CONFIG_SOCK_CGROUP_DATA is disabled, then all sockets have the same default classid ==> 0, rather than having to deal with whether that config is enabled or not. >> + >> + nla_put(skb, INET_DIAG_CLASS_ID, sizeof(classid), ); > >nla_put_u32() Will fix, thanks! -- Thanks, Sasha
Re: [PATCH RFC 02/13] net: phy: split out PHY speed and duplex string generation
On 07/25/2017 07:02 AM, Russell King wrote: > Other code would like to make use of this, so make the speed and duplex > string generation visible, and place it in a separate file. > > Signed-off-by: Russell KingReviewed-by: Florian Fainelli -- Florian
Re: [PATCH RFC 01/13] net: phy: allow settings table to support more than 32 link modes
On 07/25/2017 07:02 AM, Russell King wrote: > Allow the phy settings table to support more than 32 link modes by > switching to the ethtool link mode bit number representation, rather > than storing the mask. This will allow phylink and other ethtool > code to share the settings table to look up settings. > > Signed-off-by: Russell KingReviewed-by: Florian Fainelli -- Florian
Re: [PATCH RFC 07/13] net: phy: add I2C mdio bus
On 07/25/2017 07:03 AM, Russell King wrote: > Add an I2C MDIO bus bridge library, to allow phylib to access PHYs which > are connected to an I2C bus instead of the more conventional MDIO bus. > Such PHYs can be found in SFP adapters and SFF modules. > > Since PHYs appear at I2C bus address 0x40..0x5f, and 0x50/0x51 are > reserved for SFP EEPROMs/diagnostics, we must not allow the MDIO bus > to access these I2C addresses. > > Signed-off-by: Russell KingReviewed-by: Florian Fainelli -- Florian
Re: [PATCH V2 net-next 20/21] net-next/hinic: Add ethtool and stats
On Wed, Jul 19, 2017 at 03:36:28PM +0300, Aviad Krawczyk wrote: > Hi Joe, > > I tried to be consistent with the comments before, that requested > that we will use dev_err exclude some special cases for use netif. > > We will replace the dev_err(>dev,.. to netdev_err in the > next fix. netdev_err() should be used when possible. You just have to be careful in the probe() function, before netdev exists and you get "(NULL net_device):" or before it is registered and you get "(unnamed net_device)" instead of "eth42" etc. Andrew
Re: [PATCH RFC 12/13] phylink: add in-band autonegotiation support for 10GBase-KR mode.
On Tue, Jul 25, 2017 at 03:03:34PM +0100, Russell King wrote: > Add in-band autonegotation support for 10GBase-KR mode. > > Signed-off-by: Russell KingReviewed-by: Andrew Lunn Andrew
[Patch net v2] team: use a larger struct for mac address
IPv6 tunnels use sizeof(struct in6_addr) as dev->addr_len, but in many places especially bonding, we use struct sockaddr to copy and set mac addr, this could lead to stack out-of-bounds access. Fix it by using a larger address storage like bonding. Reported-by: Andrey KonovalovCc: Jiri Pirko Signed-off-by: Cong Wang --- drivers/net/team/team.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c index 464570409796..ae53e899259f 100644 --- a/drivers/net/team/team.c +++ b/drivers/net/team/team.c @@ -60,11 +60,11 @@ static struct team_port *team_port_get_rtnl(const struct net_device *dev) static int __set_port_dev_addr(struct net_device *port_dev, const unsigned char *dev_addr) { - struct sockaddr addr; + struct sockaddr_storage addr; - memcpy(addr.sa_data, dev_addr, port_dev->addr_len); - addr.sa_family = port_dev->type; - return dev_set_mac_address(port_dev, ); + memcpy(addr.__data, dev_addr, port_dev->addr_len); + addr.ss_family = port_dev->type; + return dev_set_mac_address(port_dev, (struct sockaddr *)); } static int team_port_set_orig_dev_addr(struct team_port *port) -- 2.13.0
[Patch net v2] net: check dev->addr_len for dev_set_mac_address()
Historically, dev_ifsioc() uses struct sockaddr as mac address definition, this is why dev_set_mac_address() accepts a struct sockaddr pointer as input but now we have various types of mac addresse whose lengths are up to MAX_ADDR_LEN, longer than struct sockaddr, and saved in dev->addr_len. It is too late to fix dev_ifsioc() due to API compatibility, so just reject those larger than sizeof(struct sockaddr), otherwise we would read and use some random bytes from kernel stack. Fortunately, only a few IPv6 tunnel devices have addr_len larger than sizeof(struct sockaddr) and they don't support ndo_set_mac_addr(). But with team driver, in lb mode, they can still be enslaved to a team master and make its mac addr length as the same. Cc: Jiri PirkoSigned-off-by: Cong Wang --- net/core/dev_ioctl.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/core/dev_ioctl.c b/net/core/dev_ioctl.c index 06b147d7d9e2..709a4e6fb447 100644 --- a/net/core/dev_ioctl.c +++ b/net/core/dev_ioctl.c @@ -263,6 +263,8 @@ static int dev_ifsioc(struct net *net, struct ifreq *ifr, unsigned int cmd) return dev_set_mtu(dev, ifr->ifr_mtu); case SIOCSIFHWADDR: + if (dev->addr_len > sizeof(struct sockaddr)) + return -EINVAL; return dev_set_mac_address(dev, >ifr_hwaddr); case SIOCSIFHWBROADCAST: -- 2.13.0
Re: [PATCH RFC 07/13] net: phy: add I2C mdio bus
On Tue, Jul 25, 2017 at 03:03:08PM +0100, Russell King wrote: > Add an I2C MDIO bus bridge library, to allow phylib to access PHYs which > are connected to an I2C bus instead of the more conventional MDIO bus. > Such PHYs can be found in SFP adapters and SFF modules. > > Since PHYs appear at I2C bus address 0x40..0x5f, and 0x50/0x51 are > reserved for SFP EEPROMs/diagnostics, we must not allow the MDIO bus > to access these I2C addresses. > > Signed-off-by: Russell KingReviewed-by: Andrew Lunn Andrew
Re: [PATCH RFC 03/13] net: phy: move phy_lookup_setting() and guts of phy_supported_speeds() to phy-core
On Tue, Jul 25, 2017 at 03:02:47PM +0100, Russell King wrote: > phy_lookup_setting() provides useful functionality in ethtool code > outside phylib. Move it to phy-core and allow it to be re-used (eg, > in phylink) rather than duplicated elsewhere. Note that this supports > the larger linkmode space. > > As we move the phy settings table, we also need to move the guts of > phy_supported_speeds() as well. > > Signed-off-by: Russell KingReviewed-by: Andrew Lunn Andrew
[PATCH v3 1/3] ptp: introduce ptp auxiliary worker
Many PTP drivers required to perform some asynchronous or periodic work, like periodically handling PHC counter overflow or handle delayed timestamp for RX/TX network packets. In most of the cases, such work is implemented using workqueues. Unfortunately, Kernel workqueues might introduce significant delay in work scheduling under high system load and on -RT, which could cause misbehavior of PTP drivers due to internal counter overflow, for example, and there is no way to tune its execution policy and priority manually. Hence, The kthread_worker can be used instead of workqueues, as it creates separate named kthread for each worker and its its execution policy and priority can be configured using chrt tool. This problem was reported for two drivers TI CPSW CPTS and dp83640, so instead of modifying each of these driver it was proposed to add PTP auxiliary worker to the PHC subsystem [1]. The patch adds PTP auxiliary worker in PHC subsystem using kthread_worker and kthread_delayed_work and introduces two new PHC subsystem APIs: - long (*do_aux_work)(struct ptp_clock_info *ptp) callback in ptp_clock_info structure, which driver should assign if it require to perform asynchronous or periodic work. Driver should return the delay of the PTP next auxiliary work scheduling time (>=0) or negative value in case further scheduling is not required. - int ptp_schedule_worker(struct ptp_clock *ptp, unsigned long delay) which allows schedule PTP auxiliary work. The name of kthread_worker thread corresponds PTP PHC device name "ptp%d". [1] https://www.spinics.net/lists/netdev/msg445392.html Signed-off-by: Grygorii Strashko--- drivers/ptp/ptp_clock.c | 41 drivers/ptp/ptp_private.h| 3 +++ include/linux/ptp_clock_kernel.h | 20 3 files changed, 64 insertions(+) diff --git a/drivers/ptp/ptp_clock.c b/drivers/ptp/ptp_clock.c index b774357..899433c 100644 --- a/drivers/ptp/ptp_clock.c +++ b/drivers/ptp/ptp_clock.c @@ -28,6 +28,7 @@ #include #include #include +#include #include "ptp_private.h" @@ -184,6 +185,19 @@ static void delete_ptp_clock(struct posix_clock *pc) kfree(ptp); } +static void ptp_aux_kworker(struct kthread_work *work) +{ + struct ptp_clock *ptp = container_of(work, struct ptp_clock, +aux_work.work); + struct ptp_clock_info *info = ptp->info; + long delay; + + delay = info->do_aux_work(info); + + if (delay >= 0) + kthread_queue_delayed_work(ptp->kworker, >aux_work, delay); +} + /* public interface */ struct ptp_clock *ptp_clock_register(struct ptp_clock_info *info, @@ -217,6 +231,19 @@ struct ptp_clock *ptp_clock_register(struct ptp_clock_info *info, mutex_init(>pincfg_mux); init_waitqueue_head(>tsev_wq); + if (ptp->info->do_aux_work) { + char *worker_name = kasprintf(GFP_KERNEL, "ptp%d", ptp->index); + + kthread_init_delayed_work(>aux_work, ptp_aux_kworker); + ptp->kworker = kthread_create_worker(0, worker_name ? +worker_name : info->name); + if (IS_ERR(ptp->kworker)) { + err = PTR_ERR(ptp->kworker); + pr_err("failed to create ptp aux_worker %d\n", err); + goto kworker_err; + } + } + err = ptp_populate_pin_groups(ptp); if (err) goto no_pin_groups; @@ -259,6 +286,9 @@ struct ptp_clock *ptp_clock_register(struct ptp_clock_info *info, no_device: ptp_cleanup_pin_groups(ptp); no_pin_groups: + if (ptp->kworker) + kthread_destroy_worker(ptp->kworker); +kworker_err: mutex_destroy(>tsevq_mux); mutex_destroy(>pincfg_mux); ida_simple_remove(_clocks_map, index); @@ -274,6 +304,11 @@ int ptp_clock_unregister(struct ptp_clock *ptp) ptp->defunct = 1; wake_up_interruptible(>tsev_wq); + if (ptp->kworker) { + kthread_cancel_delayed_work_sync(>aux_work); + kthread_destroy_worker(ptp->kworker); + } + /* Release the clock's resources. */ if (ptp->pps_source) pps_unregister_source(ptp->pps_source); @@ -339,6 +374,12 @@ int ptp_find_pin(struct ptp_clock *ptp, } EXPORT_SYMBOL(ptp_find_pin); +int ptp_schedule_worker(struct ptp_clock *ptp, unsigned long delay) +{ + return kthread_mod_delayed_work(ptp->kworker, >aux_work, delay); +} +EXPORT_SYMBOL(ptp_schedule_worker); + /* module operations */ static void __exit ptp_exit(void) diff --git a/drivers/ptp/ptp_private.h b/drivers/ptp/ptp_private.h index d958889..b86f1bf 100644 --- a/drivers/ptp/ptp_private.h +++ b/drivers/ptp/ptp_private.h @@ -22,6 +22,7 @@ #include #include +#include #include #include #include @@ -56,6 +57,8 @@ struct ptp_clock {
[PATCH v3 0/3] net: ethernet: ti: cpts: fix tx timestamping timeout
Hi With the low Ethernet connection speed cpdma notification about packet processing can be received before CPTS TX timestamp event, which is set when packet actually left CPSW while cpdma notification is sent when packet pushed in CPSW fifo. As result, when connection is slow and CPU is fast enough TX timestamping is not working properly. Issue was discovered using timestamping tool on am57x boards with Ethernet link speed forced to 100M and on am335x-evm with Ethernet link speed forced to 10M. Patch3 - This series fixes it by introducing TX SKB queue to store PTP SKBs for which Ethernet Transmit Event hasn't been received yet and then re-check this queue with new Ethernet Transmit Events by scheduling CPTS overflow work more often until TX SKB queue is not empty. Patch 1,2 - As CPTS overflow work is time critical task it important to ensure that its scheduling is not delayed. Unfortunately, There could be significant delay in CPTS work schedule under high system load and on -RT which could cause CPTS misbehavior due to internal counter overflow and there is no way to tune CPTS overflow work execution policy and priority manually. The kthread_worker can be used instead of workqueues, as it creates separate named kthread for each worker and its its execution policy and priority can be configured using chrt tool. Instead of modifying CPTS driver itself it was proposed to it was proposed to add PTP auxiliary worker to the PHC subsystem [1], so other drivers can benefit from this feature also. [1] https://www.spinics.net/lists/netdev/msg445392.html changes in v3: - patch 1: added proper error handling in ptp_clock_register. minor comments applied. changes in v2: - added PTP auxiliary worker to the PHC subsystem Links v2: https://www.spinics.net/lists/netdev/msg445859.html v1: https://www.spinics.net/lists/netdev/msg445387.html Grygorii Strashko (3): ptp: introduce ptp auxiliary worker net: ethernet: ti: cpts: convert to use ptp auxiliary worker net: ethernet: ti: cpts: fix tx timestamping timeout drivers/net/ethernet/ti/cpts.c | 111 +-- drivers/net/ethernet/ti/cpts.h | 2 +- drivers/ptp/ptp_clock.c | 41 +++ drivers/ptp/ptp_private.h| 3 ++ include/linux/ptp_clock_kernel.h | 20 +++ 5 files changed, 160 insertions(+), 17 deletions(-) -- 2.10.1
[PATCH v3 3/3] net: ethernet: ti: cpts: fix tx timestamping timeout
With the low speed Ethernet connection CPDMA notification about packet processing can be received before CPTS TX timestamp event, which is set when packet actually left CPSW while cpdma notification is sent when packet pushed in CPSW fifo. As result, when connection is slow and CPU is fast enough TX timestamping is not working properly. Fix it, by introducing TX SKB queue to store PTP SKBs for which Ethernet Transmit Event hasn't been received yet and then re-check this queue with new Ethernet Transmit Events by scheduling CPTS overflow work more often (every 1 jiffies) until TX SKB queue is not empty. Side effect of this change is: - User space tools require to take into account possible delay in TX timestamp processing (for example ptp4l works with tx_timestamp_timeout=400 under net traffic and tx_timestamp_timeout=25 in idle). Signed-off-by: Grygorii Strashko--- drivers/net/ethernet/ti/cpts.c | 86 -- drivers/net/ethernet/ti/cpts.h | 1 + 2 files changed, 84 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/ti/cpts.c b/drivers/net/ethernet/ti/cpts.c index 3ed438e..c2121d2 100644 --- a/drivers/net/ethernet/ti/cpts.c +++ b/drivers/net/ethernet/ti/cpts.c @@ -31,9 +31,18 @@ #include "cpts.h" +#define CPTS_SKB_TX_WORK_TIMEOUT 1 /* jiffies */ + +struct cpts_skb_cb_data { + unsigned long tmo; +}; + #define cpts_read32(c, r) readl_relaxed(>reg->r) #define cpts_write32(c, v, r) writel_relaxed(v, >reg->r) +static int cpts_match(struct sk_buff *skb, unsigned int ptp_class, + u16 ts_seqid, u8 ts_msgtype); + static int event_expired(struct cpts_event *event) { return time_after(jiffies, event->tmo); @@ -77,6 +86,47 @@ static int cpts_purge_events(struct cpts *cpts) return removed ? 0 : -1; } +static bool cpts_match_tx_ts(struct cpts *cpts, struct cpts_event *event) +{ + struct sk_buff *skb, *tmp; + u16 seqid; + u8 mtype; + bool found = false; + + mtype = (event->high >> MESSAGE_TYPE_SHIFT) & MESSAGE_TYPE_MASK; + seqid = (event->high >> SEQUENCE_ID_SHIFT) & SEQUENCE_ID_MASK; + + /* no need to grab txq.lock as access is always done under cpts->lock */ + skb_queue_walk_safe(>txq, skb, tmp) { + struct skb_shared_hwtstamps ssh; + unsigned int class = ptp_classify_raw(skb); + struct cpts_skb_cb_data *skb_cb = + (struct cpts_skb_cb_data *)skb->cb; + + if (cpts_match(skb, class, seqid, mtype)) { + u64 ns = timecounter_cyc2time(>tc, event->low); + + memset(, 0, sizeof(ssh)); + ssh.hwtstamp = ns_to_ktime(ns); + skb_tstamp_tx(skb, ); + found = true; + __skb_unlink(skb, >txq); + dev_consume_skb_any(skb); + dev_dbg(cpts->dev, "match tx timestamp mtype %u seqid %04x\n", + mtype, seqid); + } else if (time_after(jiffies, skb_cb->tmo)) { + /* timeout any expired skbs over 1s */ + dev_dbg(cpts->dev, + "expiring tx timestamp mtype %u seqid %04x\n", + mtype, seqid); + __skb_unlink(skb, >txq); + dev_consume_skb_any(skb); + } + } + + return found; +} + /* * Returns zero if matching event type was found. */ @@ -101,9 +151,15 @@ static int cpts_fifo_read(struct cpts *cpts, int match) event->low = lo; type = event_type(event); switch (type) { + case CPTS_EV_TX: + if (cpts_match_tx_ts(cpts, event)) { + /* if the new event matches an existing skb, +* then don't queue it +*/ + break; + } case CPTS_EV_PUSH: case CPTS_EV_RX: - case CPTS_EV_TX: list_del_init(>list); list_add_tail(>list, >events); break; @@ -229,8 +285,15 @@ static long cpts_overflow_check(struct ptp_clock_info *ptp) struct cpts *cpts = container_of(ptp, struct cpts, info); unsigned long delay = cpts->ov_check_period; struct timespec64 ts; + unsigned long flags; + + spin_lock_irqsave(>lock, flags); + ts = ns_to_timespec64(timecounter_read(>tc)); + + if (!skb_queue_empty(>txq)) + delay = CPTS_SKB_TX_WORK_TIMEOUT; + spin_unlock_irqrestore(>lock, flags); - cpts_ptp_gettime(>info, ); pr_debug("cpts overflow check at %lld.%09lu\n", ts.tv_sec, ts.tv_nsec); return
[PATCH v4 2/3] net: ethernet: ti: cpts: convert to use ptp auxiliary worker
There could be significant delay in CPTS work schedule under high system load and on -RT which could cause CPTS misbehavior due to internal counter overflow. Usage of own kthread_worker allows to avoid such kind of issues and makes it possible to tune priority of CPTS kthread_worker thread on -RT. Hence, the CPTS driver is converted to use PTP auxiliary worker as PHC subsystem implements such functionality in a generic way now. Signed-off-by: Grygorii Strashko--- drivers/net/ethernet/ti/cpts.c | 27 +-- drivers/net/ethernet/ti/cpts.h | 1 - 2 files changed, 13 insertions(+), 15 deletions(-) diff --git a/drivers/net/ethernet/ti/cpts.c b/drivers/net/ethernet/ti/cpts.c index 32279d2..3ed438e 100644 --- a/drivers/net/ethernet/ti/cpts.c +++ b/drivers/net/ethernet/ti/cpts.c @@ -224,6 +224,17 @@ static int cpts_ptp_enable(struct ptp_clock_info *ptp, return -EOPNOTSUPP; } +static long cpts_overflow_check(struct ptp_clock_info *ptp) +{ + struct cpts *cpts = container_of(ptp, struct cpts, info); + unsigned long delay = cpts->ov_check_period; + struct timespec64 ts; + + cpts_ptp_gettime(>info, ); + pr_debug("cpts overflow check at %lld.%09lu\n", ts.tv_sec, ts.tv_nsec); + return (long)delay; +} + static struct ptp_clock_info cpts_info = { .owner = THIS_MODULE, .name = "CTPS timer", @@ -236,18 +247,9 @@ static struct ptp_clock_info cpts_info = { .gettime64 = cpts_ptp_gettime, .settime64 = cpts_ptp_settime, .enable = cpts_ptp_enable, + .do_aux_work= cpts_overflow_check, }; -static void cpts_overflow_check(struct work_struct *work) -{ - struct timespec64 ts; - struct cpts *cpts = container_of(work, struct cpts, overflow_work.work); - - cpts_ptp_gettime(>info, ); - pr_debug("cpts overflow check at %lld.%09lu\n", ts.tv_sec, ts.tv_nsec); - schedule_delayed_work(>overflow_work, cpts->ov_check_period); -} - static int cpts_match(struct sk_buff *skb, unsigned int ptp_class, u16 ts_seqid, u8 ts_msgtype) { @@ -378,7 +380,7 @@ int cpts_register(struct cpts *cpts) } cpts->phc_index = ptp_clock_index(cpts->clock); - schedule_delayed_work(>overflow_work, cpts->ov_check_period); + ptp_schedule_worker(cpts->clock, cpts->ov_check_period); return 0; err_ptp: @@ -392,8 +394,6 @@ void cpts_unregister(struct cpts *cpts) if (WARN_ON(!cpts->clock)) return; - cancel_delayed_work_sync(>overflow_work); - ptp_clock_unregister(cpts->clock); cpts->clock = NULL; @@ -476,7 +476,6 @@ struct cpts *cpts_create(struct device *dev, void __iomem *regs, cpts->dev = dev; cpts->reg = (struct cpsw_cpts __iomem *)regs; spin_lock_init(>lock); - INIT_DELAYED_WORK(>overflow_work, cpts_overflow_check); ret = cpts_of_parse(cpts, node); if (ret) diff --git a/drivers/net/ethernet/ti/cpts.h b/drivers/net/ethernet/ti/cpts.h index 01ea82b..586edd9 100644 --- a/drivers/net/ethernet/ti/cpts.h +++ b/drivers/net/ethernet/ti/cpts.h @@ -119,7 +119,6 @@ struct cpts { u32 cc_mult; /* for the nominal frequency */ struct cyclecounter cc; struct timecounter tc; - struct delayed_work overflow_work; int phc_index; struct clk *refclk; struct list_head events; -- 2.10.1
Re: [PATCH] net: inet: diag: expose sockets cgroup classid
On Wed, Jul 26, 2017 at 10:22 AM, Levin, Alexander (Sasha Levin)wrote: > + if (ext & (1 << (INET_DIAG_CLASS_ID - 1))) { > + u32 classid = 0; > + > +#ifdef CONFIG_SOCK_CGROUP_DATA > + classid = sock_cgroup_classid(>sk_cgrp_data); > +#endif If CONFIG_SOCK_CGROUP_DATA is not enabled, should we put 0 or put nothing (that is, skipping this nla_put())? > + > + nla_put(skb, INET_DIAG_CLASS_ID, sizeof(classid), ); nla_put_u32() Thanks.
Re: [PATCH RFC 04/13] net: phy: add 1000Base-X to phy settings table
On Tue, Jul 25, 2017 at 03:02:52PM +0100, Russell King wrote: > Add the missing 1000Base-X entry to the phy settings table. This was > not included because the original code could not cope with more than > 32 bits of link mode mask. > > Signed-off-by: Russell KingHi Russell Looks like there are another 15 to add to the table. But that can wait for another time when somebody needs them. Reviewed-by: Andrew Lunn Andrew
Re: [PATCH RFC 02/13] net: phy: split out PHY speed and duplex string generation
On Tue, Jul 25, 2017 at 03:02:42PM +0100, Russell King wrote: > Other code would like to make use of this, so make the speed and duplex > string generation visible, and place it in a separate file. > > Signed-off-by: Russell KingReviewed-by: Andrew Lunn Andrew
Re: [PATCH net-next v2 01/10] net: dsa: lan9303: Fixed MDIO interface
> Good. Just one question about process. Could I have posted my work > as a RFC? To get one round of initial feedback before chopping into > small patch requests. As well as indicating where I am heading. Or is > that just waste of human bandwidth? Depends. Post 100 RFC patches, i won't look at them. Post 21 with a cover note making it clear you are planning to submit them in blocks of 7, i might. But it is best to assume reviewers have small blocks of time. 21 patches take 3 times a long to review as 7. The block of time might not be enough for 21, so the review gets differed. 7 are more likely to fit in the available time, so it happens quickly. Andrew
Re: [PATCH] netpoll: Fix device name check in netpoll_setup()
On Wed, Jul 26, 2017 at 11:44 AM, Doug Andersonwrote: > Hi, > > On Tue, Jul 25, 2017 at 11:36 AM, Matthias Kaehlcke wrote: >> Apparently netpoll_setup() assumes that netpoll.dev_name is a pointer >> when checking if the device name is set: >> >> if (np->dev_name) { >> ... >> >> However the field is a character array, therefore the condition always >> yields true. Check instead whether the first byte of the array has a >> non-zero value. >> >> Signed-off-by: Matthias Kaehlcke >> --- >> net/core/netpoll.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/net/core/netpoll.c b/net/core/netpoll.c >> index 8357f164c660..912731bed7b7 100644 >> --- a/net/core/netpoll.c >> +++ b/net/core/netpoll.c >> @@ -666,7 +666,7 @@ int netpoll_setup(struct netpoll *np) >> int err; >> >> rtnl_lock(); >> - if (np->dev_name) { >> + if (np->dev_name[0]) { >> struct net *net = current->nsproxy->net_ns; >> ndev = __dev_get_by_name(net, np->dev_name); >> } > > It's really up to the maintainer of the code, but my first instinct > here would be to instead remove the "if" test unless we really expect > dev->dev_name to be blank in lots of cases. It will slightly slow > down the error case but should avoid an "if" test in the non-error > case. By definition it should be safe since currently the "if" test > should always evaluate to true. > netconsole could set this dev_name to empty via configfs, so this patch is correct.
Re: [PATCH net-next v2 01/10] net: dsa: lan9303: Fixed MDIO interface
Den 26. juli 2017 22:07, skrev David Miller: From: Andrew LunnDate: Wed, 26 Jul 2017 19:52:24 +0200 So I really want to group the patches into only a few series in order to not spend months on the process. I strongly agree with Vivien here. Good patches get accepted in about 3 days. You should expect feedback within a day or two. That allows you to have fast cycle times for getting patches in. +1 Small simple patches will get everything in 10 times fast than if you clump everything together into larger, harder to review ones. Good. Just one question about process. Could I have posted my work as a RFC? To get one round of initial feedback before chopping into small patch requests. As well as indicating where I am heading. Or is that just waste of human bandwidth? Egil
Re: [PATCH net-next v2] bpf: install libbpf headers on 'make install'
From: Jakub KicinskiDate: Tue, 25 Jul 2017 11:17:11 -0700 > Add a new target to install the bpf.h header to $(prefix)/include/bpf/ > directory. This is necessary to build standalone applications using > libbpf, without the need to clone the kernel sources and point to them. > > Signed-off-by: Jakub Kicinski Applied, thank you.
Re: [Patch net] bonding: commit link status change after propose
From: Cong WangDate: Tue, 25 Jul 2017 09:44:25 -0700 > Commit de77ecd4ef02 ("bonding: improve link-status update in mii-monitoring") > moves link status commitment into bond_mii_monitor(), but it still relies > on the return value of bond_miimon_inspect() as the hint. We need to return > non-zero as long as we propose a link status change. > > Fixes: de77ecd4ef02 ("bonding: improve link-status update in mii-monitoring") > Reported-by: Benjamin Gilbert > Tested-by: Benjamin Gilbert > Cc: Mahesh Bandewar > Signed-off-by: Cong Wang Applied and queued up for -stable.
Re: [PATCH 0/8] net: mvpp2: add TX interrupts support
From: Thomas PetazzoniDate: Tue, 25 Jul 2017 17:55:01 +0200 > Hello, > > So far, the mvpp2 driver was using an hrtimer to handle TX > completion. This patch series adds support for using TX interrupts > (for each CPU) on PPv2.2, the variant of the IP used on Marvell Armada > 7K/8K. > > This series has been tested on Marvell Armada 7K (PPv2.2) and Armada > 375 (PPv2.1). > > Dave: > > - This series depends on the previous series sent by Antoine TĂ©nart >"net: mvpp2: MAC/GoP configuration and optional PHYs". Functionally >speaking there is no real dependency, but we touch in a few areas >the same piece of code, so I based my patch series on top of >Antoine's. > > - Please do not apply the last patch of this series "arm64: dts: >marvell: add TX interrupts for PPv2.2", it will be taken by the ARM >mvebu maintainers. Please don't do things this way. Patiently wait for Antione's series to make it into my tree, then submit your's. Also, if we're continually going to elide the DTS file patches, just don't bother adding them to the series. That way you don't have to give me special instructions, and I don't have the possibility of making a mistake and applying it accidently. Thanks.
Re: [RFC] switchdev: clarify ndo_get_phys_port_name() formats
On Wed, 26 Jul 2017 12:23:17 +0300, Or Gerlitz wrote: > On Wed, Jul 26, 2017 at 11:13 AM, Jakub Kicinskiwrote: > > On Wed, 26 Jul 2017 07:48:40 +0200, Jiri Pirko wrote: > > >> I think it would make sense if the driver would just fill-up a struct in > >> the ndo call and core would generate the string. > > > I do like the idea of core generating the string. I have a temptation > > to try to involve devlink in this somehow, since port id and split info > > is already available there. Perhaps that would only overcomplicate > > things. > > > The other question is: can we remove the option to generate an arbitrary > > string completely? I think Or and mlx5 folks may object since it would > > mean mlx5 VF repr names would change. > > What we have today is the representor driver instance setting the VF index as > the of phys port name and we're telling users to have this sort of udev rule: > > SUBSYSTEM=="net", ACTION=="add", > ATTR{phys_switch_id}=="", \ ATTR{phys_port_name}!="", > NAME="$PF_NIC$attr{phys_port_name}" Example names generated by this rule would be pfnic_0, pfnic_1 for vf representors 0 and 1? > that has affiliation to the PF where this VF belongs. AFAIK this > model/assumption is now under push to > higher level (open stack), so lets see if/what we want to change here > w.r.t to sriov offloading drivers. > > I would opt for an approach where the value returned by the kernel is > the minimal possible and > user-space has flexibility to further orchestrate that with udev > rules. I wasn't fully clear on Jakub's suggestion > which parts must come from the kernel. Do we have any length > limitation for the phys port name? Looks like the limit today is IFNAMSIZ. I'm in favor of leaving the flexibility to the userspace, why I suggested adding the pf%d or pf%dvf%d to the name is that I don't think we have any other way today of telling whether representor is for a physical port, VF or PF. If I understand mlx5 code, you're not reporting port ids for physical ports so presence of the name already implies it's a VF but in case you want to add port splitting support, for example, reporting the name on physical ports will become more of a necessity. If we adopt Jiri's suggestion of returning structured data it will be very easy to give user space type and indexes separately, but we should probably still return the string for backwards compatibility.
Re: [PATCH net-next v2 01/10] net: dsa: lan9303: Fixed MDIO interface
From: Andrew LunnDate: Wed, 26 Jul 2017 19:52:24 +0200 >> > So I really want to group the patches into only a few series in order >> > to not spend months on the process. > > I strongly agree with Vivien here. Good patches get accepted in about > 3 days. You should expect feedback within a day or two. That allows > you to have fast cycle times for getting patches in. +1 Small simple patches will get everything in 10 times fast than if you clump everything together into larger, harder to review ones.
Re: [PATCH 1/4] can: dev: Add support for limiting configured bitrate
Hi, I know my response is late ... > Hi Oliver > On 07/20/2017 02:43 AM, Oliver Hartkopp wrote: > > Hi Franklin, > > > > On 07/20/2017 01:36 AM, Franklin S Cooper Jr wrote: > > > >> +#ifdef CONFIG_OF > >> +void of_transceiver_is_fixed(struct net_device *dev) > >> +{ > > > > (..) > > > >> +} > >> +EXPORT_SYMBOL(of_transceiver_is_fixed); > >> +#endif > > > > I'm not sure about the naming here. > > > > As this is a CAN transceiver related option it should be named accordingly: I contest the the name too: 1) the can transceiver isn't fixed at all, it limited to the higher bitrates. 2) of_can_transceiver_is_fixed suggests to test if a transceiver is fixed, it does not suggest to load some properties from the device tree. of_can_load_transceiver looks way more clear to me. That's my opinion. The important things, like the contents of the functions, look good. Kind regards, Kurt Van Dijck
Re: [PATCH net] net: phy: Run state machine to completion
On 07/26/2017 12:34 PM, Florian Fainelli wrote: > On 07/26/2017 12:24 PM, Florian Fainelli wrote: >> Marc reported that he was not getting the PHY library adjust_link() >> callback function to run when calling phy_stop() + phy_disconnect() >> which does not indeed happen because we don't make sure we flush the >> PHYLIB delayed work and let it run to completion. We also need to have >> a synchronous call to phy_state_machine() in order to have the state >> machine actually act on PHY_HALTED, set the PHY device's link down, turn >> the network device's carrier off and finally call the adjust_link() >> function. >> >> Reported-by: Marc Gonzalez>> Fixes: a390d1f379cf ("phylib: convert state_queue work to delayed_work") >> Signed-off-by: Florian Fainelli >> --- >> David, this dates back from before the commit mentioned in Fixes but it would >> be hard to backport to earlier kernels if we flagged the original design flaw >> that used timers. Also, I am not clear on the timer API whether there was a >> way >> to ensure timers would run to completion before they would be cancelled. >> >> Marc, please add your Signed-off-by tag since you contributed the second >> line. >> >> Thanks! > > David, please hold off before applying this, I found a corner case with > power management where we may be accessing clock gated registers and > cause bus errors, will report back here. This was a downstream only problem because I was missing these two commits: 49d52e8108a21749dc2114b924c907db43358984 ("net: phy: handle state correctly in phy_stop_machine") d5c3d84657db57bd23ecd58b97f1c99dd42a7b80 ("net: phy: Avoid polling PHY with PHY_IGNORE_INTERRUPTS") This was tested with bcmgenet (PHY_IGNORE_INTERRUPT and PHY_POLL) and with bcm_sf2 (net/dsa/slave.c), would appreciate some review/testing from others as well. > >> >> drivers/net/phy/phy.c | 5 + >> 1 file changed, 5 insertions(+) >> >> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c >> index d0626bf5c540..30e7c43e0d87 100644 >> --- a/drivers/net/phy/phy.c >> +++ b/drivers/net/phy/phy.c >> @@ -743,12 +743,17 @@ void phy_trigger_machine(struct phy_device *phydev, >> bool sync) >> */ >> void phy_stop_machine(struct phy_device *phydev) >> { >> +/* Run the state machine to completion */ >> +flush_delayed_work(>state_queue); >> cancel_delayed_work_sync(>state_queue); >> >> mutex_lock(>lock); >> if (phydev->state > PHY_UP && phydev->state != PHY_HALTED) >> phydev->state = PHY_UP; >> mutex_unlock(>lock); >> + >> +/* Now we can run the state machine synchronously */ >> +phy_state_machine(>state_queue.work); >> } >> >> /** >> > > -- Florian
Re: [PATCH net-next 10/18] net: mvpp2: use the GoP interrupt for link status changes
On Wed, Jul 26, 2017 at 06:26:48PM +0200, Andrew Lunn wrote: > And this is another bit of code you probably need to change in a while > with phylink lands. The way the MAC driver handles link up/down and configuration events changes significantly when a MAC driver switches to phylink, since a directly connected SFP cage needs to have the MAC reconfigured between SGMII and 1000base-X modes. If you add SFP+ into that, also 10Gbase-KR as well. Note also that the "link up" condition for SFP (and probably SFF) is more complex than just "is the module reporting that it's receiving a signal" - especially with 1000base-X, there's negotiation to be performed, so you also need to know (if the module is connected directly to the MAC) whether the Serdes is in sync and has finished negotiation (and itself says it has link with the remote end.) With the Marvell 88x3310 PHY, the MAC driver already needs to switch between 10Gbase-KR and SGMII modes, as the 88x3310 automatically makes that switch on its MAC facing interface without software intervention. -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up according to speedtest.net.
Re: [PATCH net] ipv6: no need to return rt->dst.error if it is not null entry.
On Wed, Jul 26, 2017 at 12:00 PM, David Ahernwrote: > On 7/26/17 12:55 PM, Roopa Prabhu wrote: >> On Wed, Jul 26, 2017 at 11:49 AM, David Ahern wrote: >>> On 7/26/17 12:27 PM, Roopa Prabhu wrote: agreed...so looks like the check in v3 should be + if ( rt == net->ipv6.ip6_null_entry || +(rt->dst.error && + #ifdef CONFIG_IPV6_MULTIPLE_TABLES + rt != net->ipv6.ip6_prohibit_entry && + rt != net->ipv6.ip6_blk_hole_entry && +#endif + )) { err = rt->dst.error; ip6_rt_put(rt); goto errout; >>> >>> I don't think so. If I add a prohibit route and use the fibmatch >>> attribute, I want to see the route from the FIB that was matched. >> >> >> yes, exactly. wouldn't 'rt != net->ipv6.ip6_prohibit_entry' above let >> it fall through to the route fill code ? >> >> ah...but i guess you are saying that they will have rt6_info's of >> their own and will not match. got it. ack. >> > > This: > > diff --git a/net/ipv6/route.c b/net/ipv6/route.c > index 4d30c96a819d..24de81c804c2 100644 > --- a/net/ipv6/route.c > +++ b/net/ipv6/route.c > @@ -3637,11 +3637,6 @@ static int inet6_rtm_getroute(struct sk_buff > *in_skb, struct nlmsghdr *nlh, > dst = ip6_route_lookup(net, , 0); > > rt = container_of(dst, struct rt6_info, dst); > - if (rt->dst.error) { > - err = rt->dst.error; > - ip6_rt_put(rt); > - goto errout; > - } > > if (rt == net->ipv6.ip6_null_entry) { > err = rt->dst.error; > > Puts back the original behavior. In that case, only rt == null_entry > drops to the error path which is correct. All other rt values will drop > to rt6_fill_node and return rt data. yes, i thought so too and hence acked v1. But, following congs comment, realized that it may mask some real errors for fibmatch ? I just tested a case of unreachable route with just the above patch you posted, and I do get the error correctly. so, I guess you are saying all real errors for fibmatch will have "rt == net->ipv6.ip6_null_entry" and we should be ok. sounds good to me.
Re: [PATCH net] net: phy: Run state machine to completion
On 07/26/2017 12:24 PM, Florian Fainelli wrote: > Marc reported that he was not getting the PHY library adjust_link() > callback function to run when calling phy_stop() + phy_disconnect() > which does not indeed happen because we don't make sure we flush the > PHYLIB delayed work and let it run to completion. We also need to have > a synchronous call to phy_state_machine() in order to have the state > machine actually act on PHY_HALTED, set the PHY device's link down, turn > the network device's carrier off and finally call the adjust_link() > function. > > Reported-by: Marc Gonzalez> Fixes: a390d1f379cf ("phylib: convert state_queue work to delayed_work") > Signed-off-by: Florian Fainelli > --- > David, this dates back from before the commit mentioned in Fixes but it would > be hard to backport to earlier kernels if we flagged the original design flaw > that used timers. Also, I am not clear on the timer API whether there was a > way > to ensure timers would run to completion before they would be cancelled. > > Marc, please add your Signed-off-by tag since you contributed the second line. > > Thanks! David, please hold off before applying this, I found a corner case with power management where we may be accessing clock gated registers and cause bus errors, will report back here. > > drivers/net/phy/phy.c | 5 + > 1 file changed, 5 insertions(+) > > diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c > index d0626bf5c540..30e7c43e0d87 100644 > --- a/drivers/net/phy/phy.c > +++ b/drivers/net/phy/phy.c > @@ -743,12 +743,17 @@ void phy_trigger_machine(struct phy_device *phydev, > bool sync) > */ > void phy_stop_machine(struct phy_device *phydev) > { > + /* Run the state machine to completion */ > + flush_delayed_work(>state_queue); > cancel_delayed_work_sync(>state_queue); > > mutex_lock(>lock); > if (phydev->state > PHY_UP && phydev->state != PHY_HALTED) > phydev->state = PHY_UP; > mutex_unlock(>lock); > + > + /* Now we can run the state machine synchronously */ > + phy_state_machine(>state_queue.work); > } > > /** > -- Florian
Re: ARM GLX Khadas VIM Pro - Ethernet detected as only 10Mbps and stalled after some traffic
On Tue, 2017-07-25 at 18:56 +0200, crow wrote: > Hi, > Today I did test on ArchLinuxArm the Kernel v4.13-rc2. On downloading > the linux git source the network will eventually get stalled. Here are > the information > > Over SSH (network works). > > [root@alarm ~]# uname -a > Linux alarm 4.13.0-rc2-1-ARCH #1 SMP Mon Jul 24 20:02:50 MDT 2017 > aarch64 GNU/Linux > [root@alarm ~]# mii-tool -vvv eth0 > Using SIOCGMIIPHY=0x8947 > eth0: negotiated 1000baseT-HD flow-control, link ok [Replying again on the last thread :) ] This 1000BaseT Half Duplex looks suspucious if the PHY is supposed to be a 10/100Mbps > Â registers for MII PHY 8: > 1000 782d 0181 4400 01e1 c1e1 000f 2001 > > 0040 0002 40e8 5400 1c1c > fff0 000a 1407 004a 105a > Â product info: vendor 00:60:51, model 0 rev 0 > Â basic mode:Â Â Â autonegotiation enabled > Â basic status: autonegotiation complete, link ok > Â capabilities: 1000baseT-HD 1000baseT-FD 100baseTx-FD 100baseTx-HD > 10baseT-FD 10baseT-HD > Â advertising:Â Â 1000baseT-HD 1000baseT-FD 100baseTx-FD 100baseTx-HD > 10baseT-FD 10baseT-HD > Â link partner: 1000baseT-HD 1000baseT-FD 100baseTx-FD 100baseTx-HD > 10baseT-FD 10baseT-HD > [root@alarm ~]# ethtool -S eth0 > NIC statistics: > Â mmc_tx_octetcount_gb: 0 > Â mmc_tx_framecount_gb: 0 > Â mmc_tx_broadcastframe_g: 0 > Â mmc_tx_multicastframe_g: 0 > Â mmc_tx_64_octets_gb: 0 > Â mmc_tx_65_to_127_octets_gb: 0 > Â mmc_tx_128_to_255_octets_gb: 0 > Â mmc_tx_256_to_511_octets_gb: 0 > Â mmc_tx_512_to_1023_octets_gb: 0 > Â mmc_tx_1024_to_max_octets_gb: 0 > Â mmc_tx_unicast_gb: 0 > Â mmc_tx_multicast_gb: 0 > Â mmc_tx_broadcast_gb: 0 > Â mmc_tx_underflow_error: 0 > Â mmc_tx_singlecol_g: 0 > Â mmc_tx_multicol_g: 0 > Â mmc_tx_deferred: 0 > Â mmc_tx_latecol: 0 > Â mmc_tx_exesscol: 0 > Â mmc_tx_carrier_error: 0 > Â mmc_tx_octetcount_g: 0 > Â mmc_tx_framecount_g: 0 > Â mmc_tx_excessdef: 0 > Â mmc_tx_pause_frame: 0 > Â mmc_tx_vlan_frame_g: 0 > Â mmc_rx_framecount_gb: 133 > Â mmc_rx_octetcount_gb: 16646 > Â mmc_rx_octetcount_g: 16646 > Â mmc_rx_broadcastframe_g: 9 > Â mmc_rx_multicastframe_g: 22 > Â mmc_rx_crc_error: 0 > Â mmc_rx_align_error: 0 > Â mmc_rx_run_error: 0 > Â mmc_rx_jabber_error: 0 > Â mmc_rx_undersize_g: 0 > Â mmc_rx_oversize_g: 0 > Â mmc_rx_64_octets_gb: 45 > Â mmc_rx_65_to_127_octets_gb: 64 > Â mmc_rx_128_to_255_octets_gb: 13 > Â mmc_rx_256_to_511_octets_gb: 7 > Â mmc_rx_512_to_1023_octets_gb: 4 > Â mmc_rx_1024_to_max_octets_gb: 0 > Â mmc_rx_unicast_g: 102 > Â mmc_rx_length_error: 0 > Â mmc_rx_autofrangetype: 0 > Â mmc_rx_pause_frames: 0 > Â mmc_rx_fifo_overflow: 0 > Â mmc_rx_vlan_frames_gb: 0 > Â mmc_rx_watchdog_error: 0 > Â mmc_rx_ipc_intr_mask: 1073692671 > Â mmc_rx_ipc_intr: 0 > Â mmc_rx_ipv4_gd: 117 > Â mmc_rx_ipv4_hderr: 0 > Â mmc_rx_ipv4_nopay: 0 > Â mmc_rx_ipv4_frag: 0 > Â mmc_rx_ipv4_udsbl: 0 > Â mmc_rx_ipv4_gd_octets: 12585 > Â mmc_rx_ipv4_hderr_octets: 0 > Â mmc_rx_ipv4_nopay_octets: 0 > Â mmc_rx_ipv4_frag_octets: 0 > Â mmc_rx_ipv4_udsbl_octets: 0 > Â mmc_rx_ipv6_gd_octets: 104 > Â mmc_rx_ipv6_hderr_octets: 0 > Â mmc_rx_ipv6_nopay_octets: 0 > Â mmc_rx_ipv6_gd: 1 > Â mmc_rx_ipv6_hderr: 0 > Â mmc_rx_ipv6_nopay: 0 > Â mmc_rx_udp_gd: 31 > Â mmc_rx_udp_err: 0 > Â mmc_rx_tcp_gd: 85 > Â mmc_rx_tcp_err: 0 > Â mmc_rx_icmp_gd: 2 > Â mmc_rx_icmp_err: 0 > Â mmc_rx_udp_gd_octets: 2963 > Â mmc_rx_udp_err_octets: 0 > Â mmc_rx_tcp_gd_octets: 7254 > Â mmc_rx_tcp_err_octets: 0 > Â mmc_rx_icmp_gd_octets: 92 > Â mmc_rx_icmp_err_octets: 0 > Â tx_underflow: 0 > Â tx_carrier: 0 > Â tx_losscarrier: 0 > Â vlan_tag: 0 > Â tx_deferred: 0 > Â tx_vlan: 0 > Â tx_jabber: 0 > Â tx_frame_flushed: 0 > Â tx_payload_error: 0 > Â tx_ip_header_error: 0 > Â rx_desc: 0 > Â sa_filter_fail: 0 > Â overflow_error: 0 > Â ipc_csum_error: 0 > Â rx_collision: 0 > Â rx_crc_errors: 0 > Â dribbling_bit: 0 > Â rx_length: 0 > Â rx_mii: 0 > Â rx_multicast: 0 > Â rx_gmac_overflow: 0 > Â rx_watchdog: 0 > Â da_rx_filter_fail: 0 > Â sa_rx_filter_fail: 0 > Â rx_missed_cntr: 0 > Â rx_overflow_cntr: 0 > Â rx_vlan: 0 > Â tx_undeflow_irq: 0 > Â tx_process_stopped_irq: 0 > Â tx_jabber_irq: 0 > Â rx_overflow_irq: 0 > Â rx_buf_unav_irq: 0 > Â rx_process_stopped_irq: 0 > Â rx_watchdog_irq: 0 > Â tx_early_irq: 0 > Â fatal_bus_error_irq: 0 > Â rx_early_irq: 0 > Â threshold: 1 > Â tx_pkt_n: 83 > Â rx_pkt_n: 133 > Â normal_irq_n: 130 > Â rx_normal_irq_n: 129 > Â napi_poll: 130 > Â tx_normal_irq_n: 1 > Â tx_clean: 192 > Â tx_set_ic_bit: 1 > Â irq_receive_pmt_irq_n: 0 > Â mmc_tx_irq_n: 0 > Â mmc_rx_irq_n: 0 > Â mmc_rx_csum_offload_irq_n: 0
Re: ARM GLX Khadas VIM Pro - Ethernet detected as only 10Mbps and stalled after some traffic
On Sun, 2017-06-11 at 08:31 +0200, crow wrote: > Hi Andrew, > > On Sat, Jun 10, 2017 at 5:27 PM, Andrew Lunnwrote: > > > Also what Martin Blumenstingl wrote is following which is also crucial > > > for fixing the issue: > > > Amlogic has given their ethernet PHY driver some updates [2], it now > > > includes wake-on-lan, and they now have an internal_phy_read_status > > > which uses reset_internal_phy if there's a link and some error counter > > > exceeds some magic value. > > > > Hi Crow > > > > You could probably just drop the Amlogic driver into mainline and see > > if it works better. If that solves your problem, we can look at > > merging the changes. > > > > Andrew > > Thank your for the suggestion, and thanks Martin to explaining me over > IRC what actually I should do. > > I recompiled mainline kernel 4.12-rc4 with the Amlogic driver: > replaced drivers/net/phy/meson-gxl.c with > https://github.com/khadas/linux/blob/ubuntu-4.9/drivers/amlogic/ethernet/phy/a > mlogic.c > > But this did not solve the issue. As soon as i start git clone i lose > network connection to device (no session timeout/disconnect this time, > but I am unable to reconnect over SSH or to get OK ping replay back). > > Here are the tests: > Linux khadasvimpro 4.12.0-rc4-4-ARCH #1 SMP Sun Jun 11 03:39:21 CEST > 2017 aarch64 GNU/Linux > > # modinfo meson_gxl > filename: > /lib/modules/4.12.0-rc4-4-ARCH/kernel/drivers/net/phy/meson-gxl.ko.gz > license:GPL > author: Neil Armstrong > author: Baoqi wang > description:Amlogic Meson GXL Internal PHY driver > alias:  mdio:0001100101000100 > depends: > intree: Y > vermagic:   4.12.0-rc4-4-ARCH SMP mod_unload aarch64 > # > # mii-tool -vvv eth0 > Using SIOCGMIIPHY=0x8947 > eth0: negotiated 1000baseT-HD flow-control, link ok Hum, 1000Mbps Half duplex looks duplex looks suspicious The PHY is supposed to be a 10/100, right ? or did I miss something ? >  registers for MII PHY 8: > 1000 782d 0181 4400 01e1 c1e1 000f 2001 > > 0040 0002 40e8 5400 1c1c > fff0 040a 1407 004a 105a >  product info: vendor 00:60:51, model 0 rev 0 >  basic mode:   autonegotiation enabled >  basic status: autonegotiation complete, link ok >  capabilities: 1000baseT-HD 1000baseT-FD 100baseTx-FD 100baseTx-HD > 10baseT-FD 10baseT-HD >  advertising:  1000baseT-HD 1000baseT-FD 100baseTx-FD 100baseTx-HD > 10baseT-FD 10baseT-HD >  link partner: 1000baseT-HD 1000baseT-FD 100baseTx-FD 100baseTx-HD > 10baseT-FD 10baseT-HD > $ > > over SSH startet following but it stall already at 0%: > > $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux- > stable.git > Cloning into 'linux-stable'... > remote: Counting objects: 5948690, done. > remote: Compressing objects: 100% (124799/124799), done. > Receiving objects:   0% (11668/5948690), 2.27 MiB | 4.52 MiB/s > > shows timeout while trying to sync with NTP server: > > # journalctl -f > systemd-timesyncd[299]: Timed out waiting for reply from > 83.68.137.76:123 (2.at.pool.ntp.org). > systemd-timesyncd[299]: Timed out waiting for reply from > 86.59.113.114:123 (2.at.pool.ntp.org). > > while still not working dump the register: > # mii-tool -vvv eth0 > Using SIOCGMIIPHY=0x8947 > eth0: negotiated 1000baseT-HD flow-control, link ok >  registers for MII PHY 8: > 1000 782d 0181 4400 01e1 c1e1 000d 2001 > > 0040 0002 40e8 5400 1c1c > fff0 040a 1407 105a >  product info: vendor 00:60:51, model 0 rev 0 >  basic mode:   autonegotiation enabled >  basic status: autonegotiation complete, link ok >  capabilities: 1000baseT-HD 1000baseT-FD 100baseTx-FD 100baseTx-HD > 10baseT-FD 10baseT-HD >  advertising:  1000baseT-HD 1000baseT-FD 100baseTx-FD 100baseTx-HD > 10baseT-FD 10baseT-HD >  link partner: 1000baseT-HD 1000baseT-FD 100baseTx-FD 100baseTx-HD > 10baseT-FD 10baseT-HD > # > > # ifconfig eth0 down && ifconfig eth0 up > # dmesg -T | tail -n 10 > [Sun Jun 11 07:40:34 2017] meson8b-dwmac c941.ethernet eth0: > device MAC address 00:15:18:01:81:31 > [Sun Jun 11 07:40:34 2017] random: crng init done > [Sun Jun 11 07:40:34 2017] Meson GXL Internal PHY 0.e40908ff:08: > attached PHY driver [Meson GXL Internal PHY] > (mii_bus:phy_addr=0.e40908ff:08, irq=-1) > [Sun Jun 11 07:40:34 2017] meson8b-dwmac c941.ethernet eth0: PTP > not supported by HW > [Sun Jun 11 07:40:36 2017] brcmfmac: brcmf_c_preinit_dcmds: Firmware > version = wl0: Mar  1 2015 07:29:38 version 7.45.18 (r538002) FWID > 01-6a2c8ad4 > [Sun Jun 11 07:40:36 2017] meson8b-dwmac c941.ethernet eth0: Link > is Up - 100Mbps/Full - flow control off > [Sun Jun 11 07:41:23 2017] EXT4-fs (mmcblk1p1): mounted filesystem > with ordered data mode. Opts: (null) > [Sun Jun 11 07:54:28 2017] Meson GXL Internal PHY
[PATCH net] net: phy: Run state machine to completion
Marc reported that he was not getting the PHY library adjust_link() callback function to run when calling phy_stop() + phy_disconnect() which does not indeed happen because we don't make sure we flush the PHYLIB delayed work and let it run to completion. We also need to have a synchronous call to phy_state_machine() in order to have the state machine actually act on PHY_HALTED, set the PHY device's link down, turn the network device's carrier off and finally call the adjust_link() function. Reported-by: Marc GonzalezFixes: a390d1f379cf ("phylib: convert state_queue work to delayed_work") Signed-off-by: Florian Fainelli --- David, this dates back from before the commit mentioned in Fixes but it would be hard to backport to earlier kernels if we flagged the original design flaw that used timers. Also, I am not clear on the timer API whether there was a way to ensure timers would run to completion before they would be cancelled. Marc, please add your Signed-off-by tag since you contributed the second line. Thanks! drivers/net/phy/phy.c | 5 + 1 file changed, 5 insertions(+) diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c index d0626bf5c540..30e7c43e0d87 100644 --- a/drivers/net/phy/phy.c +++ b/drivers/net/phy/phy.c @@ -743,12 +743,17 @@ void phy_trigger_machine(struct phy_device *phydev, bool sync) */ void phy_stop_machine(struct phy_device *phydev) { + /* Run the state machine to completion */ + flush_delayed_work(>state_queue); cancel_delayed_work_sync(>state_queue); mutex_lock(>lock); if (phydev->state > PHY_UP && phydev->state != PHY_HALTED) phydev->state = PHY_UP; mutex_unlock(>lock); + + /* Now we can run the state machine synchronously */ + phy_state_machine(>state_queue.work); } /** -- 2.9.3
Re: [PATCH net-next] net: phy: Remove stale comments referencing timer
On Wed, Jul 26, 2017 at 12:05:38PM -0700, Florian Fainelli wrote: > Since commit a390d1f379cf ("phylib: convert state_queue work to > delayed_work"), the PHYLIB state machine was converted to use delayed > workqueues, yet some functions were still referencing the PHY library > timer in their comments, fix that and remove the now unused > linux/timer.h include. > > Signed-off-by: Florian FainelliReviewed-by: Andrew Lunn Andrew
[PATCH v3 net-next 2/2] liquidio: cleanup: removed cryptic and misleading macro
From: Rick FarringtonSigned-off-by: Rick Farrington Signed-off-by: Felix Manlunas --- drivers/net/ethernet/cavium/liquidio/octeon_console.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_console.c b/drivers/net/ethernet/cavium/liquidio/octeon_console.c index 501ad95..15ad1ab 100644 --- a/drivers/net/ethernet/cavium/liquidio/octeon_console.c +++ b/drivers/net/ethernet/cavium/liquidio/octeon_console.c @@ -42,8 +42,6 @@ module_param(console_bitmask, int, 0644); MODULE_PARM_DESC(console_bitmask, "Bitmask indicating which consoles have debug output redirected to syslog."); -#define CAST_ULL(v) ((u64)(v)) - #define BOOTLOADER_PCI_READ_BUFFER_DATA_ADDR0x0006c008 #define BOOTLOADER_PCI_READ_BUFFER_LEN_ADDR 0x0006c004 #define BOOTLOADER_PCI_READ_BUFFER_OWNER_ADDR 0x0006c000 @@ -233,7 +231,7 @@ static int __cvmx_bootmem_check_version(struct octeon_device *oct, (exact_match && major_version != exact_match)) { dev_err(>pci_dev->dev, "bootmem ver mismatch %d.%d addr:0x%llx\n", major_version, minor_version, - CAST_ULL(oct->bootmem_desc_addr)); + (long long)oct->bootmem_desc_addr); return -1; } else { return 0; -- 2.9.0
[PATCH v3 net-next 1/2] liquidio: standardization: use min_t instead of custom macro
From: Rick FarringtonSigned-off-by: Rick Farrington Signed-off-by: Derek Chickles Signed-off-by: Felix Manlunas --- drivers/net/ethernet/cavium/liquidio/octeon_console.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_console.c b/drivers/net/ethernet/cavium/liquidio/octeon_console.c index e08f760..501ad95 100644 --- a/drivers/net/ethernet/cavium/liquidio/octeon_console.c +++ b/drivers/net/ethernet/cavium/liquidio/octeon_console.c @@ -42,7 +42,6 @@ module_param(console_bitmask, int, 0644); MODULE_PARM_DESC(console_bitmask, "Bitmask indicating which consoles have debug output redirected to syslog."); -#define MIN(a, b) min((a), (b)) #define CAST_ULL(v) ((u64)(v)) #define BOOTLOADER_PCI_READ_BUFFER_DATA_ADDR0x0006c008 @@ -704,7 +703,7 @@ static int octeon_console_read(struct octeon_device *oct, u32 console_num, if (bytes_to_read <= 0) return bytes_to_read; - bytes_to_read = MIN(bytes_to_read, (s32)buf_size); + bytes_to_read = min_t(s32, bytes_to_read, buf_size); /* Check to see if what we want to read is not contiguous, and limit * ourselves to the contiguous block -- 2.9.0
[PATCH v3 net-next 0/2] liquidio: standardization and cleanup
From: Rick FarringtonThis patchset corrects some non-standard macro usage. 1. Replaced custom MIN macro with use of standard 'min_t'. 2. Removed cryptic and misleading macro 'CAST_ULL'. change log: V1 -> V2: 1. Add driver cleanup of macro 'CAST_ULL'. V2 -> V3: 1. Remove extra parentheses from previous usage of macro 'CAST_ULL'. Rick Farrington (2): liquidio: standardization: use min_t instead of custom macro liquidio: cleanup: removed cryptic and misleading macro drivers/net/ethernet/cavium/liquidio/octeon_console.c | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) -- 2.9.0
Re: [PATCH v7 2/3] PCI: Enable PCIe Relaxed Ordering if supported
| From: Alexander Duyck| Sent: Wednesday, July 26, 2017 11:44 AM | | On Jul 26, 2017 11:26 AM, "Casey Leedom" wrote: | | | | Â Â I think that the patch will need to be extended to modify | | Â Â drivers/pci.c/iov.c:sriov_enable() to explicitly turn off | | Â Â Relaxed Ordering Enable if the Root Complex is marked | Â Â for no RO TLPs. | | I'm not sure that would be an issue. Wouldn't most VFs inherit the PF's settings? Ah yes, you're right. This is covered in section 3.5.4 of the Single Root I/O Virtualization and Sharing Specification, Revision 1.0 (September 11, 2007), governing the PCIe Capability Device Control register. It states that the VF version of that register shall follow the setting of the corresponding PF. So we should enhance the cxgb4vf/sge.c:t4vf_sge_alloc_rxq() in the same way we did for the cxgb4 driver, but that's not critical since the Relaxed Ordering Enable supersedes the internal chip's desire to use the Relaxed Ordering Attribute. Ding, send me a note if you'd like me to work that up for you. | Also I thought most of the VF configuration space is read only. Yes, but not all of it. And when a VF is exported to a Virtual Machine, then the Hypervisor captures and interprets all accesses to the VF's PCIe Configuration Space from the VM. Thanks again for reminding me of the subtle aspect of the SR_IOV specification that I forgot. Casey
[PATCH net-next] net: phy: Remove stale comments referencing timer
Since commit a390d1f379cf ("phylib: convert state_queue work to delayed_work"), the PHYLIB state machine was converted to use delayed workqueues, yet some functions were still referencing the PHY library timer in their comments, fix that and remove the now unused linux/timer.h include. Signed-off-by: Florian Fainelli--- drivers/net/phy/phy.c | 11 +-- 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c index d0626bf5c540..ac1dcf0289fa 100644 --- a/drivers/net/phy/phy.c +++ b/drivers/net/phy/phy.c @@ -30,7 +30,6 @@ #include #include #include -#include #include #include #include @@ -705,8 +704,8 @@ EXPORT_SYMBOL(phy_start_aneg); * * Description: The PHY infrastructure can run a state machine * which tracks whether the PHY is starting up, negotiating, - * etc. This function starts the timer which tracks the state - * of the PHY. If you want to maintain your own state machine, + * etc. This function starts the delayed workqueue which tracks + * the state of the PHY. If you want to maintain your own state machine, * do not call this function. */ void phy_start_machine(struct phy_device *phydev) @@ -737,9 +736,9 @@ void phy_trigger_machine(struct phy_device *phydev, bool sync) * phy_stop_machine - stop the PHY state machine tracking * @phydev: target phy_device struct * - * Description: Stops the state machine timer, sets the state to UP - * (unless it wasn't up yet). This function must be called BEFORE - * phy_detach. + * Description: Stops the state machine delayed workqueue, sets the + * state to UP (unless it wasn't up yet). This function must be + * called BEFORE phy_detach. */ void phy_stop_machine(struct phy_device *phydev) { -- 2.9.3
Re: TCP fast retransmit issues
On Wed, Jul 26, 2017 at 2:38 PM, Neal Cardwellwrote: > Yeah, it looks like I can reproduce this issue with (1) bad sacks > causing repeated TLPs, and (2) TLPs timers being pushed out to later > times due to incoming data. Scripts are attached. I'm testing a fix of only scheduling a TLP if (flag & FLAG_DATA_ACKED) is true... neal
Re: [PATCH net] ipv6: no need to return rt->dst.error if it is not null entry.
On 7/26/17 12:55 PM, Roopa Prabhu wrote: > On Wed, Jul 26, 2017 at 11:49 AM, David Ahernwrote: >> On 7/26/17 12:27 PM, Roopa Prabhu wrote: >>> agreed...so looks like the check in v3 should be >>> >>> >>> + if ( rt == net->ipv6.ip6_null_entry || >>> +(rt->dst.error && >>> + #ifdef CONFIG_IPV6_MULTIPLE_TABLES >>> + rt != net->ipv6.ip6_prohibit_entry && >>> + rt != net->ipv6.ip6_blk_hole_entry && >>> +#endif >>> + )) { >>> err = rt->dst.error; >>> ip6_rt_put(rt); >>> goto errout; >>> >> >> I don't think so. If I add a prohibit route and use the fibmatch >> attribute, I want to see the route from the FIB that was matched. > > > yes, exactly. wouldn't 'rt != net->ipv6.ip6_prohibit_entry' above let > it fall through to the route fill code ? > > ah...but i guess you are saying that they will have rt6_info's of > their own and will not match. got it. ack. > This: diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 4d30c96a819d..24de81c804c2 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -3637,11 +3637,6 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, dst = ip6_route_lookup(net, , 0); rt = container_of(dst, struct rt6_info, dst); - if (rt->dst.error) { - err = rt->dst.error; - ip6_rt_put(rt); - goto errout; - } if (rt == net->ipv6.ip6_null_entry) { err = rt->dst.error; Puts back the original behavior. In that case, only rt == null_entry drops to the error path which is correct. All other rt values will drop to rt6_fill_node and return rt data.
Re: [PATCH net] ipv6: no need to return rt->dst.error if it is not null entry.
On Wed, Jul 26, 2017 at 11:49 AM, David Ahernwrote: > On 7/26/17 12:27 PM, Roopa Prabhu wrote: >> agreed...so looks like the check in v3 should be >> >> >> + if ( rt == net->ipv6.ip6_null_entry || >> +(rt->dst.error && >> + #ifdef CONFIG_IPV6_MULTIPLE_TABLES >> + rt != net->ipv6.ip6_prohibit_entry && >> + rt != net->ipv6.ip6_blk_hole_entry && >> +#endif >> + )) { >> err = rt->dst.error; >> ip6_rt_put(rt); >> goto errout; >> > > I don't think so. If I add a prohibit route and use the fibmatch > attribute, I want to see the route from the FIB that was matched. yes, exactly. wouldn't 'rt != net->ipv6.ip6_prohibit_entry' above let it fall through to the route fill code ? ah...but i guess you are saying that they will have rt6_info's of their own and will not match. got it. ack.
Re: [PATCH] netpoll: Fix device name check in netpoll_setup()
Hi, On Tue, Jul 25, 2017 at 11:36 AM, Matthias Kaehlckewrote: > Apparently netpoll_setup() assumes that netpoll.dev_name is a pointer > when checking if the device name is set: > > if (np->dev_name) { > ... > > However the field is a character array, therefore the condition always > yields true. Check instead whether the first byte of the array has a > non-zero value. > > Signed-off-by: Matthias Kaehlcke > --- > net/core/netpoll.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/core/netpoll.c b/net/core/netpoll.c > index 8357f164c660..912731bed7b7 100644 > --- a/net/core/netpoll.c > +++ b/net/core/netpoll.c > @@ -666,7 +666,7 @@ int netpoll_setup(struct netpoll *np) > int err; > > rtnl_lock(); > - if (np->dev_name) { > + if (np->dev_name[0]) { > struct net *net = current->nsproxy->net_ns; > ndev = __dev_get_by_name(net, np->dev_name); > } It's really up to the maintainer of the code, but my first instinct here would be to instead remove the "if" test unless we really expect dev->dev_name to be blank in lots of cases. It will slightly slow down the error case but should avoid an "if" test in the non-error case. By definition it should be safe since currently the "if" test should always evaluate to true. -Doug
Re: [PATCH net] ipv6: no need to return rt->dst.error if it is not null entry.
On 7/26/17 12:27 PM, Roopa Prabhu wrote: > agreed...so looks like the check in v3 should be > > > + if ( rt == net->ipv6.ip6_null_entry || > +(rt->dst.error && > + #ifdef CONFIG_IPV6_MULTIPLE_TABLES > + rt != net->ipv6.ip6_prohibit_entry && > + rt != net->ipv6.ip6_blk_hole_entry && > +#endif > + )) { > err = rt->dst.error; > ip6_rt_put(rt); > goto errout; > I don't think so. If I add a prohibit route and use the fibmatch attribute, I want to see the route from the FIB that was matched.
Re: [PATCHv3 net] ipv6: no need to return rt->dst.error if it is prohibit entry
On 7/26/17 11:09 AM, David Ahern wrote: > I don't understand why you are focused on the built-in null and prohibit > route entries. I see. You are using fib rules for the prohibit entry; I am using an explicit route entry. If I run 'ip ro get fibmatch' for the latter I want to see that route entry since it is a route in the FIB: # ip -6 ro get fibmatch vrf red 5000::1 prohibit 5000::/120 dev lo table red metric 1024 error -13 pref medium So there are multiple cases to verify.
Re: TCP fast retransmit issues
On Wed, Jul 26, 2017 at 1:06 PM, Neal Cardwellwrote: > On Wed, Jul 26, 2017 at 12:43 PM, Neal Cardwell wrote: >> (2) It looks like there is a bug in the sender code where it seems to >> be repeatedly using a TLP timer firing 211ms after every ACK is >> received to transmit another TLP probe (a new packet in this case). >> Somehow these weird invalid SACKs seem to be triggering a code path >> that makes us think we can send another TLP, when we probably should >> be firing an RTO. That's my interpretation, anyway. I will try to >> reproduce this with packetdrill. > > Hmm. It looks like this might be a general issue, where any time we > get an ACK that doesn't ACK/SACK anything new (whether because it's > incoming data in a bi-directional flow, or a middlebox breaking the > SACKs), then we schedule a TLP timer further out in time. Probably we > should only push the TLP timer out if something is cumulatively ACKed. > > But that's not a trivial thing to do, because by the time we are > deciding whether to schedule another TLP, we have already canceled the > previous TLP and reinstalled an RTO. Hmm. Yeah, it looks like I can reproduce this issue with (1) bad sacks causing repeated TLPs, and (2) TLPs timers being pushed out to later times due to incoming data. Scripts are attached. neal tlp-bad-sacks.pkt Description: Binary data tlp-bidirectional.pkt Description: Binary data
Re: [PATCH v2 2/4] can: fixed-transceiver: Add documentation for CAN fixed transceiver bindings
On 07/26/2017 12:05 PM, Oliver Hartkopp wrote: > On 07/26/2017 06:41 PM, Andrew Lunn wrote: >> On Mon, Jul 24, 2017 at 06:05:19PM -0500, Franklin S Cooper Jr wrote: > >>> + >>> +Optional: >>> + max-arbitration-speed: a positive non 0 value that determines the max >>> +speed that CAN can run in non CAN-FD mode or during the >>> +arbitration phase in CAN-FD mode. >> >> Hi Franklin >> >> Since this and the next property are optional, it is good to document >> what happens when they are not in the DT blob. Also document what 0 >> means. The driver ignores values less than 0 with the exception being max-data-speed which supports a value of -1. Not sure what I'm documenting when the binding specifically says to use a positive non zero value. Its the same reason I don't document what happens if you give it a negative value. >> >>> + >>> + max-data-speed:a positive non 0 value that determines the max >>> data rate >>> +that can be used in CAN-FD mode. A value of -1 implies >>> +CAN-FD is not supported by the transceiver. >> >> -1 is ugly. I think it would be better to have a missing >> max-data-speed property indicate that CAN-FD is not supported. > Although this leads to your later point I don't think this is the right approach. Its an optional property. Not including the property should not assume it isn't supported. > Thanks Andrew! I had the same feeling about '-1' :-) Ok I'll go back to having 0 = not supported. Which will handle the documentation comment above. > >> And >> maybe put 'fd' into the property name. > > Good point. In fact the common naming to set bitrates for CAN(FD) > controllers are 'bitrate' and 'data bitrate'. > > 'speed' is not really a good word for that. I'm fine with switching to using bitrate instead of speed. Kurk was originally the one that suggested to use the term arbitration and data since thats how the spec refers to it. Which I do agree with. But your right that in the drivers (struct can_priv) we just use bittiming and data_bittiming (CAN-FD timings). I don't think adding "fd" into the property name makes sense unless we are calling it something like "max-canfd-bitrate" which I would agree is the easiest to understand. So what is the preference if we end up sticking with two properties? Option 1 or 2? 1) max-bitrate max-data-bitrate 2) max-bitrate max-canfd-bitrate > > Finally, @Franklin: > > A CAN transceiver is limited in bandwidth. But you only have one RX and > one TX line between the CAN controller and the CAN transceiver. The > transceiver does not know about CAN FD - it has just a physical(!) layer > with a limited bandwidth. This is ONE limitation. > > So I tend to specify only ONE 'max-bitrate' property for the > fixed-transceiver binding. > > The fact whether the CAN controller is CAN FD capable or not is provided > by the netlink configuration interface for CAN controllers. Part of the reasoning to have two properties is to indicate that you don't support CAN FD while limiting the "arbitration" bit rate. With one property you can not determine this and end up having to make some assumptions that can quickly end up biting people. > > Regards, > Oliver >
Re: [PATCH net] ipv6: no need to return rt->dst.error if it is not null entry.
On Wed, Jul 26, 2017 at 10:18 AM, David Ahernwrote: > On 7/25/17 1:32 AM, Hangbin Liu wrote: >> On Mon, Jul 24, 2017 at 09:28:07PM -0600, David Ahern wrote: >>> On 7/24/17 6:08 PM, Hangbin Liu wrote: That's why I think we should remove both rt->dst.error and ip6_null_entry check in inet6_rtm_getroute(). And even further, remove the ip6_null_entry check in rt6_dump_route(). >>> >>> git blame net/ipv6/route.c >>> >>> find the commits and review. >> >> Hi David, >> >> Sorry, would you like to give me more hints? >> >> I saw your commit 1f17e2f2c8a8 ("net: ipv6: ignore null_entry on route >> dumps"). But I think this issue has been fixed by >> >> 2f460933f58e ("ipv6: initialize route null entry in addrconf_init()") and >> 242d3a49a2a1 ("ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf") >> >> I use a reproducer which add unreachable route in netns with lo down. And I >> could not trigger Panic in the fixed kernel. That's why I think we could >> remove ip6_null_entry check in rt6_dump_route(). > > Understood. Cong's patch to fix the intialization order (lo device > before route init) makes sure the idev is not null. That said, the > null_entry route is internal ipv6 logic and is not a route entry to be > returned to userspace. agreed...so looks like the check in v3 should be + if ( rt == net->ipv6.ip6_null_entry || +(rt->dst.error && + #ifdef CONFIG_IPV6_MULTIPLE_TABLES + rt != net->ipv6.ip6_prohibit_entry && + rt != net->ipv6.ip6_blk_hole_entry && +#endif + )) { err = rt->dst.error; ip6_rt_put(rt); goto errout;
Re: [PATCH v7 2/3] PCI: Enable PCIe Relaxed Ordering if supported
By the way Ding, two issues: 1. Did we ever get any acknowledgement from either Intel or AMD on this patch? I know that we can't ensure that, but it sure would be nice since the PCI Quirks that we're putting in affect their products. 2. I just realized that there's still a small hole in the patch with respect to PCIe SR-IOV Virtual Functions. When the PCI Quirk notices a problematic PCIe Device and marks it to note that it's not "happy" receiving Transaction Layer Packets with the Relaxed Ordering Attribute set, it's my understanding that your current patch iterates down the PCIe Fabric and turns off the PCIe Capability Device Control[Relaxed Ordering Enable]. But this scan may miss any SR-IOV VFs because they probably won't be instantiated at that time. And, at a later time, when they are instantiated, they could well have their Relaxed Ordering Enable set. I think that the patch will need to be extended to modify drivers/pci.c/iov.c:sriov_enable() to explicitly turn off Relaxed Ordering Enable if the Root Complex is marked for no RO TLPs. Casey
Re: [PATCH 0/4] Configuring traffic classes via new hardware offload mechanism in tc/mqprio
On 7/21/2017 2:42 AM, Richard Cochran wrote: On Mon, May 22, 2017 at 12:31:12PM -0700, Jeff Kirsher wrote: On Fri, 2017-05-19 at 17:58 -0700, Amritha Nambiar wrote: The following series introduces a new harware offload mode in tc/mqprio where the TCs, the queue configurations and bandwidth rate limits are offloaded to the hardware. ... This was meant to be sent out as an RFC, but apparently that did not get conveyed when these were sent out Friday. I am looking at tc/mqprio and dcb with the purpose of implementing - Forwarding and Queuing Enhancements for Time-Sensitive Streams (FQTSS) 802.1Q-2014 Clause 34 - Scheduled Traffic (time based scheduling) P802.1Qbv using the HW capabilities of the i210. This series looks like a promising avenue for these features. My question is, did series go anywhere? I didn't see any follow ups on netdev, but maybe I missed something. I have submitted another series as non-RFC for next-queue maintained by Jeff Kirsher. This is going to come through the next-queue tree. You can follow the new series "[PATCH 0/6] Configuring traffic classes via new hardware offload mechanism in tc/mqprio" at https://www.mail-archive.com/netdev@vger.kernel.org/msg177390.html Thanks, Richard
[PATCH net-next 3/3] nfp: only use direct firmware requests
request_firmware() will fallback to user space helper and may cause long delays when driver is loaded if udev doesn't correctly handle FW requests. Since we never really made use of the user space helper functionality switch to the simpler request_firmware_direct() call. The side effect of this change is that no warning will be printed when the FW image does not exists. To help users figure out which FW file is missing print a info message when we request each file. Signed-off-by: Jakub Kicinski--- drivers/net/ethernet/netronome/nfp/nfp_main.c | 42 +-- 1 file changed, 26 insertions(+), 16 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c b/drivers/net/ethernet/netronome/nfp/nfp_main.c index 13d056da0765..dd769eceb33d 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_main.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c @@ -174,6 +174,21 @@ static int nfp_pcie_sriov_configure(struct pci_dev *pdev, int num_vfs) return nfp_pcie_sriov_enable(pdev, num_vfs); } +static const struct firmware * +nfp_net_fw_request(struct pci_dev *pdev, struct nfp_pf *pf, const char *name) +{ + const struct firmware *fw = NULL; + int err; + + err = request_firmware_direct(, name, >dev); + nfp_info(pf->cpp, " %s: %s\n", +name, err ? "not found" : "found, loading..."); + if (err) + return NULL; + + return fw; +} + /** * nfp_net_fw_find() - Find the correct firmware image for netdev mode * @pdev: PCI Device structure @@ -184,29 +199,30 @@ static int nfp_pcie_sriov_configure(struct pci_dev *pdev, int num_vfs) static const struct firmware * nfp_net_fw_find(struct pci_dev *pdev, struct nfp_pf *pf) { - const struct firmware *fw = NULL; struct nfp_eth_table_port *port; + const struct firmware *fw; const char *fw_model; char fw_name[256]; const u8 *serial; - int spc, err = 0; u16 interface; - int i, j; + int spc, i, j; + + nfp_info(pf->cpp, "Looking for firmware file in order of priority:\n"); /* First try to find a firmware image specific for this device */ interface = nfp_cpp_interface(pf->cpp); nfp_cpp_serial(pf->cpp, ); sprintf(fw_name, "netronome/serial-%pMF-%02hhx-%02hhx.nffw", serial, interface >> 8, interface & 0xff); - err = request_firmware_direct(, fw_name, >dev); - if (!err) - goto done; + fw = nfp_net_fw_request(pdev, pf, fw_name); + if (fw) + return fw; /* Then try the PCI name */ sprintf(fw_name, "netronome/pci-%s.nffw", pci_name(pdev)); - err = request_firmware_direct(, fw_name, >dev); - if (!err) - goto done; + fw = nfp_net_fw_request(pdev, pf, fw_name); + if (fw) + return fw; /* Finally try the card type and media */ if (!pf->eth_tbl) { @@ -241,13 +257,7 @@ nfp_net_fw_find(struct pci_dev *pdev, struct nfp_pf *pf) if (spc <= 0) return NULL; - err = request_firmware(, fw_name, >dev); - if (err) - return NULL; -done: - dev_info(>dev, "Loading FW image: %s\n", fw_name); - - return fw; + return nfp_net_fw_request(pdev, pf, fw_name); } /** -- 2.11.0
[PATCH net-next 1/3] nfp: remove the probe deferral when FW not present
We use a hack to defer probe when firmware was not pre-loaded or found on disk. This helps in case users forgot to include firmware in initramfs, the driver will most likely get another shot at probing after real root is mounted. This is not for what EPROBE_DEFER is supposed to be used, and when FW is completely missing every time new device is probed NFP will reprobe spamming kernel logs. Remove this hack, users will now have to make sure the right firmware image is present in initramfs if nfp.ko is placed there or built in. Signed-off-by: Jakub Kicinski--- drivers/net/ethernet/netronome/nfp/nfp_net_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c index 5797dbf2b507..d5e2361f0e86 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c @@ -704,7 +704,7 @@ int nfp_net_pci_probe(struct nfp_pf *pf) if (!pf->rtbl) { nfp_err(pf->cpp, "No %s, giving up.\n", pf->fw_loaded ? "symbol table" : "firmware found"); - return -EPROBE_DEFER; + return -EINVAL; } mutex_lock(>lock); -- 2.11.0
[PATCH net-next 2/3] nfp: look for firmware image by device serial number and PCI name
We generally look up firmware by card type, but that doesn't allow users who have more than one card of the same type in their system to select firmware per adapter. Unfortunately user space firmware helper seems fraught with difficulties and to be on its way out. In particular support for handling firmware uevents have been dropped from systemd and most distributions don't enable the FW fallback by default any more. To allow users selecting firmware for a particular device look up firmware names by serial and pci_name(). Use the direct lookup to disable generating uevents when enabled in Kconfig and not print any warnings to logs if adapter-specific files are missing. Users can place in /lib/firmware/netronome files named: pci-${pci_name}.nffw serial-${serial}.nffw to target a specific card. E.g.: pci-:04:00.0.nffw pci-:82:00.0.nffw serial-00-aa-bb-11-22-33-10-ff.nffw We use the full serial number including the interface id, as it appears in lspci output (bytes separated by '-'). Signed-off-by: Jakub Kicinski--- drivers/net/ethernet/netronome/nfp/nfp_main.c | 20 +++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c b/drivers/net/ethernet/netronome/nfp/nfp_main.c index d67969d3e484..13d056da0765 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_main.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c @@ -188,9 +188,27 @@ nfp_net_fw_find(struct pci_dev *pdev, struct nfp_pf *pf) struct nfp_eth_table_port *port; const char *fw_model; char fw_name[256]; + const u8 *serial; int spc, err = 0; + u16 interface; int i, j; + /* First try to find a firmware image specific for this device */ + interface = nfp_cpp_interface(pf->cpp); + nfp_cpp_serial(pf->cpp, ); + sprintf(fw_name, "netronome/serial-%pMF-%02hhx-%02hhx.nffw", + serial, interface >> 8, interface & 0xff); + err = request_firmware_direct(, fw_name, >dev); + if (!err) + goto done; + + /* Then try the PCI name */ + sprintf(fw_name, "netronome/pci-%s.nffw", pci_name(pdev)); + err = request_firmware_direct(, fw_name, >dev); + if (!err) + goto done; + + /* Finally try the card type and media */ if (!pf->eth_tbl) { dev_err(>dev, "Error: can't identify media config\n"); return NULL; @@ -226,7 +244,7 @@ nfp_net_fw_find(struct pci_dev *pdev, struct nfp_pf *pf) err = request_firmware(, fw_name, >dev); if (err) return NULL; - +done: dev_info(>dev, "Loading FW image: %s\n", fw_name); return fw; -- 2.11.0
[PATCH net-next 0/3] nfp: extend firmware request logic
Hi! We have been pondering for some time how to support loading different application firmwares onto NFP. We want to support both users selecting one of the firmware images provided by Netronome (which are optimized for different use cases each) as well as firmware images created by users themselves or other companies. In the end we decided to go with the simplest solution - depending on the right firmware image being placed in /lib/firmware. This vastly simplifies driver logic and also doesn't require any new API. Different NICs on one system may want to run different applications therefore we first try to load firmware specific to the device (by serial number of PCI slot) and if not present try the device model based name we have been using so far. Jakub Kicinski (3): nfp: remove the probe deferral when FW not present nfp: look for firmware image by device serial number and PCI name nfp: only use direct firmware requests drivers/net/ethernet/netronome/nfp/nfp_main.c | 48 ++- drivers/net/ethernet/netronome/nfp/nfp_net_main.c | 2 +- 2 files changed, 39 insertions(+), 11 deletions(-) -- 2.11.0
Re: [PATCH RFC 00/13] phylink and sfp support
On Wed, Jul 26, 2017 at 06:44:11PM +0200, Andrew Lunn wrote: > Do you have a git branch we can look at to see these changes. It helps > understand the API when you can see both sides of it. git://git.armlinux.org.uk/~rmk/linux-arm.git phy or head to the cgit at: git.armlinux.org.uk/cgit/linux-arm.git/log/?h=phy Not only does that include the mvneta changes, but also the changes to the 88x3310 driver (although incomplete, but functional.) -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up according to speedtest.net.
Re: [RFC] switchdev: clarify ndo_get_phys_port_name() formats
On Tue, Jul 25, 2017 at 07:34:47PM -0700, Jakub Kicinski wrote: > On Tue, 25 Jul 2017 21:48:15 -0400, Andy Gospodarek wrote: > > On Tue, Jul 25, 2017 at 03:26:47PM -0700, Jakub Kicinski wrote: > > > On Tue, 25 Jul 2017 11:22:41 -0400, Andy Gospodarek wrote: > > > > On Mon, Jul 24, 2017 at 10:13:44PM -0700, Jakub Kicinski wrote: > > > > > We are still in position where we can suggest uniform naming > > > > > convention for ndo_get_phys_port_name(). switchdev.txt file > > > > > already contained a suggestion of how to name external ports. > > > > > Since the use of switchdev for SR-IOV NIC's eswitches is growing, > > > > > establish a format for ports of those devices as well. > > > > > > > > > > Signed-off-by: Jakub Kicinski> > > > > > > > This is a nice addition and I suspect there could be even more done to > > > > update this file to cover the VF rep usage. > > > > > > > > > --- > > > > > Documentation/networking/switchdev.txt | 14 +++--- > > > > > 1 file changed, 11 insertions(+), 3 deletions(-) > > > > > > > > > > diff --git a/Documentation/networking/switchdev.txt > > > > > b/Documentation/networking/switchdev.txt > > > > > index 3e7b946dea27..7c4b6025fb4b 100644 > > > > > --- a/Documentation/networking/switchdev.txt > > > > > +++ b/Documentation/networking/switchdev.txt > > > > > @@ -119,9 +119,17 @@ into 4 10G ports, resulting in 4 port netdevs, > > > > > the device can give a unique > > > > > SUBSYSTEM=="net", ACTION=="add", > > > > > ATTR{phys_switch_id}=="", \ > > > > > ATTR{phys_port_name}!="", NAME="swX$attr{phys_port_name}" > > > > > > > > > > -Suggested naming convention is "swXpYsZ", where X is the switch name > > > > > or ID, Y > > > > > -is the port name or ID, and Z is the sub-port name or ID. For > > > > > example, sw1p1s0 > > > > > -would be sub-port 0 on port 1 on switch 1. > > > > > +Suggested formats of the port name returned by > > > > > ndo_get_phys_port_name are: > > > > > + - pA for external ports; > > > > > + - pAsB for split external ports; > > > > > + - pfCfor PF ports (so called PF representors); > > > > > + - pfCvfD for VF ports (so called VF representors). > > > > > > > > I hate to clutter this up, but might be also need to add: > > > > > > > > - pfCsBfor split PF ports (so called PF representors); > > > > - pfCsBvfD for split VF ports (so called VF representors). > > > > > > > > or are we comfortable that these additions to the name for split ports > > > > are implied? > > > > > > Hm.. What is a split PF port? Splits happen on the physical port - see > > > my rant on the thread this is a reply to ;) PFs are PCIe functions, > > > on the opposite side of the eswitch from the wires. > > > > I'm with you that I think there is value in separate netdevs to > > represent "PFs, VFs and external ports/MACs" -- particularly for the > > use-case you to create rules to control PF<->VF traffic. > > > > So while I'm not saying it is a _great_ idea to support such a thing as > > port-splitting of PFs, I suggested this addition as I'm not willing to > > restrict > > such a design/implementation if a vendor or customer desired. It seemed > > useful to provde some guidance on how to name them -- even if we do not > > like them. :-) > > If I understand you correctly split PF would be a situation where > device has multiple port instances on the PCIe PF side? IOW switch sees > multiple endpoints on the PF side? Let me attempt an ASCII diagram :) > > > HOST A || HOST B >|| > PF A | V | V | V | V || PF B| V | V | V > >| F | F | F | F ... || | F | F | F ... > > port A0 | port A1 | 0 | 1 | 2 | 3 || port B0 | port B1 | 0 | 1 | 2 > >|| > PCI Express link ||PCI Express link > \ \ \ | | | || / / / > \ \ \ | | | || / / / > \__\__\' | | |'/___/___/ > /---\ > > |<<== | > | ==>> | > | SR-IOV e-switch | > > |<<== | > > | ==>> | > > \---/ > |
Re: [PATCH net-next v2 01/10] net: dsa: lan9303: Fixed MDIO interface
> > So I really want to group the patches into only a few series in order > > to not spend months on the process. I strongly agree with Vivien here. Good patches get accepted in about 3 days. You should expect feedback within a day or two. That allows you to have fast cycle times for getting patches in. Andrew
Re: [PATCH net-next v2 08/10] net: dsa: lan9303: Added ALR/fdb/mdb handling
Hi Egil > +/* This function will wait a while until mask & reg == value */ > +/* Otherwise, return timeout */ > +static int lan9303_csr_reg_wait(struct lan9303 *chip, int regno, > + int mask, char value) > +{ > + int i; > + > + for (i = 0; i < 0x1000; i++) { > + u32 reg; > + > + lan9303_read_switch_reg(chip, regno, ); > + if ((reg & mask) == value) > + return 0; > + } > + return -ETIMEDOUT; Busy looping is probably not a good idea. Can you add a usleep()? > +} > + > +static int _lan9303_alr_make_entry_raw(struct lan9303 *chip, u32 dat0, u32 > dat1) What does the _ indicate. I could understand having it when you have lan9303_alr_make_entry_raw() call _lan9303_alr_make_entry_raw() after taking a lock, but i don't see anything like that here. > +{ > + lan9303_write_switch_reg( > + chip, LAN9303_SWE_ALR_WR_DAT_0, dat0); > + lan9303_write_switch_reg( > + chip, LAN9303_SWE_ALR_WR_DAT_1, dat1); > + lan9303_write_switch_reg( > + chip, LAN9303_SWE_ALR_CMD, ALR_CMD_MAKE_ENTRY); > + lan9303_csr_reg_wait( > + chip, LAN9303_SWE_ALR_CMD_STS, ALR_STS_MAKE_PEND, 0); > + lan9303_write_switch_reg(chip, LAN9303_SWE_ALR_CMD, 0); > + return 0; > +} > + > +typedef void alr_loop_cb_t( > + struct lan9303 *chip, u32 dat0, u32 dat1, int portmap, void *ctx); > + > +static void lan9303_alr_loop(struct lan9303 *chip, alr_loop_cb_t *cb, void > *ctx) > +{ > + int i; > + > + lan9303_write_switch_reg(chip, LAN9303_SWE_ALR_CMD, ALR_CMD_GET_FIRST); > + lan9303_write_switch_reg(chip, LAN9303_SWE_ALR_CMD, 0); > + > + for (i = 1; i < LAN9303_NUM_ALR_RECORDS; i++) { > + u32 dat0, dat1; > + int alrport, portmap; > + > + lan9303_read_switch_reg(chip, LAN9303_SWE_ALR_RD_DAT_0, ); > + lan9303_read_switch_reg(chip, LAN9303_SWE_ALR_RD_DAT_1, ); > + if (dat1 & ALR_DAT1_END_OF_TABL) > + break; > + > + alrport = (dat1 & ALR_DAT1_PORT_MASK) >> ALR_DAT1_PORT_BITOFFS; > + portmap = alrport_2_portmap[alrport]; > + > + cb(chip, dat0, dat1, portmap, ctx); > + > + lan9303_write_switch_reg( > + chip, LAN9303_SWE_ALR_CMD, ALR_CMD_GET_NEXT); > + lan9303_write_switch_reg(chip, LAN9303_SWE_ALR_CMD, 0); > + } > +} > + > +/* ALR: lan9303_alr_loop callback functions */ > + > +static void _alr_reg_to_mac(u32 dat0, u32 dat1, u8 mac[6]) > +{ > + mac[0] = (dat0 >> 0) & 0xff; > + mac[1] = (dat0 >> 8) & 0xff; > + mac[2] = (dat0 >> 16) & 0xff; > + mac[3] = (dat0 >> 24) & 0xff; > + mac[4] = (dat1 >> 0) & 0xff; > + mac[5] = (dat1 >> 8) & 0xff; > +} > + > +/* Clear learned (non-static) entry on given port */ > +static void alr_loop_cb_del_port_learned( > + struct lan9303 *chip, u32 dat0, u32 dat1, int portmap, void *ctx) > +{ > + int *port = ctx; > + > + if (((BIT(*port) & portmap) == 0) || (dat1 & ALR_DAT1_STATIC)) > + return; > + > + /* learned entries has only one port, we can just delete */ > + dat1 &= ~ALR_DAT1_VALID; /* delete entry */ > + _lan9303_alr_make_entry_raw(chip, dat0, dat1); > +} > + > +struct port_fdb_dump_ctx { > + int port; > + struct switchdev_obj_port_fdb *fdb; > + switchdev_obj_dump_cb_t *cb; > +}; > + > +static void alr_loop_cb_fdb_port_dump( > + struct lan9303 *chip, u32 dat0, u32 dat1, int portmap, void *ctx) > +{ > + struct port_fdb_dump_ctx *dump_ctx = ctx; > + struct switchdev_obj_port_fdb *fdb = dump_ctx->fdb; > + u8 mac[ETH_ALEN]; > + > + if ((BIT(dump_ctx->port) & portmap) == 0) > + return; > + > + _alr_reg_to_mac(dat0, dat1, mac); > + ether_addr_copy(fdb->addr, mac); > + fdb->vid = 0; > + fdb->ndm_state = (dat1 & ALR_DAT1_STATIC) ? > + NUD_NOARP : NUD_REACHABLE; > + dump_ctx->cb(>obj); > +} > + > +/* ALR: Add/modify/delete ALR entries */ > + > +/* Set a static ALR entry. Delete entry if port_map is zero */ > +static void _lan9303_alr_set_entry(struct lan9303 *chip, const u8 *mac, > +u8 port_map, bool stp_override) > +{ > + u32 dat0, dat1, alr_port; > + > + dat1 = ALR_DAT1_STATIC; > + if (port_map) > + dat1 |= ALR_DAT1_VALID; /* otherwise no ports: delete entry */ > + if (stp_override) > + dat1 |= ALR_DAT1_AGE_OVERRID; > + > + alr_port = portmap_2_alrport[port_map & 7]; > + dat1 &= ~ALR_DAT1_PORT_MASK; > + dat1 |= alr_port << ALR_DAT1_PORT_BITOFFS; > + > + dat0 = 0; > + dat0 |= (mac[0] << 0); > + dat0 |= (mac[1] << 8); > + dat0 |= (mac[2] << 16); > + dat0 |= (mac[3] << 24); > + > + dat1 |= (mac[4] << 0); > + dat1 |= (mac[5] << 8); > + > + dev_dbg(chip->dev, "%s %pM %d %08x %08x\n", > + __func__, mac, port_map, dat0,
Re: [PATCH net-next v2 07/10] net: dsa: lan9303: Added basic offloading of unicast traffic
Hi Egil > +/* forward special tagged packets from port 0 to port 1 *or* port 2 */ > +static int lan9303_setup_tagging(struct lan9303 *chip) > +{ > + int ret; Blank line please. > + /* enable defining the destination port via special VLAN tagging > + * for port 0 > + */ > + ret = lan9303_write_switch_reg(chip, LAN9303_SWE_INGRESS_PORT_TYPE, > +0x03); #define for 0x03. > + if (ret) > + return ret; > + > + /* tag incoming packets at port 1 and 2 on their way to port 0 to be > + * able to discover their source port > + */ > + return lan9303_write_switch_reg( > + chip, LAN9303_BM_EGRSS_PORT_TYPE, > + LAN9303_BM_EGRSS_PORT_TYPE_SPECIAL_TAG_PORT0); > +} > + > /* We want a special working switch: > * - do not forward packets between port 1 and 2 > * - forward everything from port 1 to port 0 > * - forward everything from port 2 to port 0 > - * - forward special tagged packets from port 0 to port 1 *or* port 2 > */ > static int lan9303_separate_ports(struct lan9303 *chip) > { > @@ -534,22 +555,6 @@ static int lan9303_separate_ports(struct lan9303 *chip) > if (ret) > return ret; > > - /* enable defining the destination port via special VLAN tagging > - * for port 0 > - */ > - ret = lan9303_write_switch_reg(chip, LAN9303_SWE_INGRESS_PORT_TYPE, > -0x03); > - if (ret) > - return ret; > - > - /* tag incoming packets at port 1 and 2 on their way to port 0 to be > - * able to discover their source port > - */ > - ret = lan9303_write_switch_reg(chip, LAN9303_BM_EGRSS_PORT_TYPE, > - LAN9303_BM_EGRSS_PORT_TYPE_SPECIAL_TAG_PORT0); > - if (ret) > - return ret; > - > /* prevent port 1 and 2 from forwarding packets by their own */ > return lan9303_write_switch_reg(chip, LAN9303_SWE_PORT_STATE, > LAN9303_SWE_PORT_STATE_FORWARDING_PORT0 | > @@ -557,6 +562,12 @@ static int lan9303_separate_ports(struct lan9303 *chip) > LAN9303_SWE_PORT_STATE_BLOCKING_PORT2); > } > > +static void lan9303_bridge_ports(struct lan9303 *chip) > +{ > + /* ports bridged: remove mirroring */ > + lan9303_write_switch_reg(chip, LAN9303_SWE_PORT_MIRROR, 0); > +} > + > static int lan9303_handle_reset(struct lan9303 *chip) > { > if (!chip->reset_gpio) > @@ -707,6 +718,10 @@ static int lan9303_setup(struct dsa_switch *ds) > return -EINVAL; > } > > + ret = lan9303_setup_tagging(chip); > + if (ret) > + dev_err(chip->dev, "failed to setup port tagging %d\n", ret); > + > ret = lan9303_separate_ports(chip); > if (ret) > dev_err(chip->dev, "failed to separate ports %d\n", ret); > @@ -898,17 +913,81 @@ static void lan9303_port_disable(struct dsa_switch *ds, > int port, > } > } > > +static int lan9303_port_bridge_join(struct dsa_switch *ds, int port, > + struct net_device *br) > +{ > + struct lan9303 *chip = ds->priv; > + > + dev_dbg(chip->dev, "%s(port %d)\n", __func__, port); > + if (ds->ports[1].bridge_dev == ds->ports[2].bridge_dev) { > + lan9303_bridge_ports(chip); > + chip->is_bridged = true; /* unleash stp_state_set() */ > + } > + > + return 0; > +} > + > +static void lan9303_port_bridge_leave(struct dsa_switch *ds, int port, > + struct net_device *br) > +{ > + struct lan9303 *chip = ds->priv; > + > + dev_dbg(chip->dev, "%s(port %d)\n", __func__, port); > + if (chip->is_bridged) { > + lan9303_separate_ports(chip); > + chip->is_bridged = false; > + } > +} > + > +static void lan9303_port_stp_state_set(struct dsa_switch *ds, int port, > +u8 state) > +{ > + int portmask, portstate; > + struct lan9303 *chip = ds->priv; > + > + dev_dbg(chip->dev, "%s(port %d, state %d)\n", > + __func__, port, state); > + if (!chip->is_bridged) > + return; I think you are over-simplifying here. Say i have a layer 2 VPN and i bridge port 1 and the VPN? The software bridge still wants to do STP on port 1, in order to solve loops. > + > + switch (state) { > + case BR_STATE_DISABLED: > + portstate = LAN9303_SWE_PORT_STATE_DISABLED_PORT0; > + break; > + case BR_STATE_BLOCKING: > + case BR_STATE_LISTENING: > + portstate = LAN9303_SWE_PORT_STATE_BLOCKING_PORT0; > + break; > + case BR_STATE_LEARNING: > + portstate = LAN9303_SWE_PORT_STATE_LEARNING_PORT0; > + break; > + case BR_STATE_FORWARDING: > + portstate = LAN9303_SWE_PORT_STATE_FORWARDING_PORT0; > + break; > + default: > +
Re: [PATCH] net: phy: Log only PHY state transitions
On 07/25/2017 12:33 PM, David Miller wrote: > From: Marc Gonzalez> Date: Tue, 25 Jul 2017 11:31:46 +0200 > >> In the current code, old and new PHY states are always logged. >> From now on, log only PHY state transitions. >> >> Signed-off-by: Marc Gonzalez >> --- >> drivers/net/phy/phy.c | 3 ++- >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c >> index d0626bf5c540..6bb764e716fc 100644 >> --- a/drivers/net/phy/phy.c >> +++ b/drivers/net/phy/phy.c >> @@ -1226,7 +1226,8 @@ void phy_state_machine(struct work_struct *work) >> if (err < 0) >> phy_error(phydev); >> >> -phydev_dbg(phydev, "PHY state change %s -> %s\n", >> +if (old_state != phydev->state) >> +phydev_dbg(phydev, "PHY state change %s -> %s\n", >> phy_state_to_str(old_state), >> phy_state_to_str(phydev->state)); > > Something is not kosher with this indentation at all. Indeed, please align the two phy_state_to_str() where the opening parenthesis for phydev_dbg() starts, such this visually looks like: if (old_state != phydev->state) phydev_dbg(phydev, "PHY state change %s -> %s\n", phy_state_to_str(old_state), phy_state_to_str(phydev->state)); Thanks! -- Florian
[PATCH] net: inet: diag: expose sockets cgroup classid
This is useful for directly looking up a task based on class id rather than having to scan through all open file descriptors. Signed-off-by: Sasha Levin--- include/uapi/linux/inet_diag.h | 1 + net/ipv4/inet_diag.c | 10 ++ 2 files changed, 11 insertions(+) diff --git a/include/uapi/linux/inet_diag.h b/include/uapi/linux/inet_diag.h index bbe201047df6..678496897a68 100644 --- a/include/uapi/linux/inet_diag.h +++ b/include/uapi/linux/inet_diag.h @@ -142,6 +142,7 @@ enum { INET_DIAG_PAD, INET_DIAG_MARK, INET_DIAG_BBRINFO, + INET_DIAG_CLASS_ID, __INET_DIAG_MAX, }; diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c index 3828b3a805cd..ffc6dad9780a 100644 --- a/net/ipv4/inet_diag.c +++ b/net/ipv4/inet_diag.c @@ -274,6 +274,16 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk, goto errout; } + if (ext & (1 << (INET_DIAG_CLASS_ID - 1))) { + u32 classid = 0; + +#ifdef CONFIG_SOCK_CGROUP_DATA + classid = sock_cgroup_classid(>sk_cgrp_data); +#endif + + nla_put(skb, INET_DIAG_CLASS_ID, sizeof(classid), ); + } + out: nlmsg_end(skb, nlh); return 0; -- 2.11.0
Re: [PATCH net] ipv6: no need to return rt->dst.error if it is not null entry.
On 7/25/17 1:32 AM, Hangbin Liu wrote: > On Mon, Jul 24, 2017 at 09:28:07PM -0600, David Ahern wrote: >> On 7/24/17 6:08 PM, Hangbin Liu wrote: >>> That's why I think we should remove both rt->dst.error and ip6_null_entry >>> check in inet6_rtm_getroute(). And even further, remove the ip6_null_entry >>> check in rt6_dump_route(). >> >> git blame net/ipv6/route.c >> >> find the commits and review. > > Hi David, > > Sorry, would you like to give me more hints? > > I saw your commit 1f17e2f2c8a8 ("net: ipv6: ignore null_entry on route > dumps"). But I think this issue has been fixed by > > 2f460933f58e ("ipv6: initialize route null entry in addrconf_init()") and > 242d3a49a2a1 ("ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf") > > I use a reproducer which add unreachable route in netns with lo down. And I > could not trigger Panic in the fixed kernel. That's why I think we could > remove ip6_null_entry check in rt6_dump_route(). Understood. Cong's patch to fix the intialization order (lo device before route init) makes sure the idev is not null. That said, the null_entry route is internal ipv6 logic and is not a route entry to be returned to userspace.
Re: [PATCH net-next v2 06/10] net: dsa: lan9303: added sysfs node swe_bcst_throt
On Tue, Jul 25, 2017 at 06:15:49PM +0200, Egil Hjelmeland wrote: > Allowing per-port access to Switch Engine Broadcast Throttling Register Hi Egil In general, we are against using sysfs. If there is a generic mechanism, that applies for all sorts of network interfaces, it should be used instead of sysfs. Is this intended to reduce the effect of broadcast storms? Andrew
Re: [PATCH net-next v2 04/10] net: dsa: lan9303: Added adjust_link() method
On Tue, Jul 25, 2017 at 06:15:47PM +0200, Egil Hjelmeland wrote: > This makes the driver react to device tree "fixed-link" declaration > on CPU port. > > - turn off autonegotiation > - force speed 10 or 100 mb/s > - force duplex mode > > Signed-off-by: Egil Hjelmeland> --- > drivers/net/dsa/lan9303-core.c | 33 + > 1 file changed, 33 insertions(+) > > diff --git a/drivers/net/dsa/lan9303-core.c b/drivers/net/dsa/lan9303-core.c > index 0806a0684d55..be6d78f45a5f 100644 > --- a/drivers/net/dsa/lan9303-core.c > +++ b/drivers/net/dsa/lan9303-core.c > @@ -17,6 +17,7 @@ > #include > #include > #include > +#include > > #include "lan9303.h" > > @@ -746,6 +747,37 @@ static int lan9303_phy_write(struct dsa_switch *ds, int > phy, int regnum, > return chip->ops->phy_write(chip, phy, regnum, val); > } > > +static void lan9303_adjust_link(struct dsa_switch *ds, int port, > + struct phy_device *phydev) > +{ > + struct lan9303 *chip = ds->priv; > + > + int ctl, res; > + > + ctl = lan9303_phy_read(ds, port, MII_BMCR); > + > + if (!phy_is_pseudo_fixed_link(phydev)) > + return; > + Hi Egil Maybe do this check before reading MII_BMCR? > + ctl &= ~BMCR_ANENABLE; Should this also mask out BMCR_SPEED100 and DUPLEX_FULL? Otherwise how do you select 10/Half if it is already configured for 100/Full? > + if (phydev->speed == SPEED_100) > + ctl |= BMCR_SPEED100; > + > + if (phydev->duplex == DUPLEX_FULL) > + ctl |= BMCR_FULLDPLX; > + > + res = lan9303_phy_write(ds, port, MII_BMCR, ctl); > + > + if (port == chip->phy_addr_sel_strap) { > + /* Virtual Phy: Remove Turbo 200Mbit mode */ > + lan9303_read(chip->regmap, LAN9303_VIRT_SPECIAL_CTRL, ); > + > + ctl &= ~(1 << 10); // TURBO BIT BIT(10), or better still something like BIT(LAN9303_VIRT_SPECIAL_TURBO) > + res = regmap_write(chip->regmap, > + LAN9303_VIRT_SPECIAL_CTRL, ctl); > + } > +} Andrew
Re: [PATCHv3 net] ipv6: no need to return rt->dst.error if it is prohibit entry
On 7/26/17 3:20 AM, Hangbin Liu wrote: > After commit 18c3a61c4264 ("net: ipv6: RTM_GETROUTE: return matched fib > result when requested"). When we get a prohibit ertry, we will return > -EACCES directly. > > Before: Do you mean "Before commit 18c3a61c4264?" > + ip netns exec client ip -6 route get 2003::1 > prohibit 2003::1 dev lo table unspec proto kernel src 2001::1 metric > 4294967295 error -13 > > After: And "After commit 18c3a61c4264?" > + ip netns exec server ip -6 route get 2002::1 > RTNETLINK answers: Permission denied > > Fix this by add prohibit and blk hole check. > > At the same time, after commit > 2f460933f58e ("ipv6: initialize route null entry in addrconf_init()") and > 242d3a49a2a1 ("ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf") > We will init rt6i_idev correctly. So we could dump ip6_null_entry > (unreachable route entry) safely now. > > Fixes: 18c3a61c4264 ("net: ipv6: RTM_GETROUTE: return matched fib...") > Signed-off-by: Hangbin Liu> --- > net/ipv6/route.c | 13 ++--- > 1 file changed, 6 insertions(+), 7 deletions(-) > > diff --git a/net/ipv6/route.c b/net/ipv6/route.c > index 4d30c96..b05da74 100644 > --- a/net/ipv6/route.c > +++ b/net/ipv6/route.c > @@ -3637,13 +3637,12 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, > struct nlmsghdr *nlh, > dst = ip6_route_lookup(net, , 0); > > rt = container_of(dst, struct rt6_info, dst); > - if (rt->dst.error) { > - err = rt->dst.error; > - ip6_rt_put(rt); > - goto errout; > - } > - > - if (rt == net->ipv6.ip6_null_entry) { > + if (rt->dst.error && > +#ifdef CONFIG_IPV6_MULTIPLE_TABLES > + rt != net->ipv6.ip6_prohibit_entry && > + rt != net->ipv6.ip6_blk_hole_entry && > +#endif > + rt != net->ipv6.ip6_null_entry) { > err = rt->dst.error; > ip6_rt_put(rt); > goto errout; > This is what I see with your patch: # ip -6 ro ls vrf red 2001:db8:1::/120 dev eth1 proto kernel metric 256 pref medium prohibit 5000::/120 dev lo metric 1024 error -13 pref medium fe80::/64 dev eth1 proto kernel metric 256 pref medium ff00::/8 dev eth1 metric 256 pref medium unreachable default dev lo metric 8192 error -113 pref medium ie., I added a prohibit route for 5000:/120 and then running: # ip -6 ro get vrf red 5000::1 RTNETLINK answers: Permission denied Which is the behavior without your patch. Now if I delete just the first bit: diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 4d30c96a819d..8fc52de40175 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -3637,12 +3637,6 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, dst = ip6_route_lookup(net, , 0); rt = container_of(dst, struct rt6_info, dst); - if (rt->dst.error) { - err = rt->dst.error; - ip6_rt_put(rt); - goto errout; - } - if (rt == net->ipv6.ip6_null_entry) { err = rt->dst.error; ip6_rt_put(rt); Then I get: # ip -6 ro get vrf red 5000::1 prohibit 5000::1 from :: dev lo table red src 2001:db8::2 metric 1024 error -13 pref medium which seems to be your objective. I don't understand why you are focused on the built-in null and prohibit route entries. When I add a default unreachable or prohibit route those are different rt6_info entries. Take a look at ip6_route_info_create.
Re: TCP fast retransmit issues
On Wed, Jul 26, 2017 at 12:43 PM, Neal Cardwellwrote: > (1) Because the connection negotiated SACK, the Linux TCP sender does > not get to its tcp_add_reno_sack() code to count dupacks and enter > fast recovery on the 3rd dupack. The sender keeps waiting for specific > packets to be SACKed that would signal that something has probably > been lost. We could probably mitigate this by having the sender turn > off SACK once it sees SACKed ranges that are completely invalid (way > out of window). Then it should use the old non-SACK "Recovery on 3rd > dupack" path. > > (2) It looks like there is a bug in the sender code where it seems to > be repeatedly using a TLP timer firing 211ms after every ACK is > received to transmit another TLP probe (a new packet in this case). > Somehow these weird invalid SACKs seem to be triggering a code path > that makes us think we can send another TLP, when we probably should > be firing an RTO. That's my interpretation, anyway. I will try to > reproduce this with packetdrill. Hmm. It looks like this might be a general issue, where any time we get an ACK that doesn't ACK/SACK anything new (whether because it's incoming data in a bi-directional flow, or a middlebox breaking the SACKs), then we schedule a TLP timer further out in time. Probably we should only push the TLP timer out if something is cumulatively ACKed. But that's not a trivial thing to do, because by the time we are deciding whether to schedule another TLP, we have already canceled the previous TLP and reinstalled an RTO. Hmm. neal
Re: [PATCH v2 2/4] can: fixed-transceiver: Add documentation for CAN fixed transceiver bindings
On 07/26/2017 06:41 PM, Andrew Lunn wrote: On Mon, Jul 24, 2017 at 06:05:19PM -0500, Franklin S Cooper Jr wrote: + +Optional: + max-arbitration-speed: a positive non 0 value that determines the max + speed that CAN can run in non CAN-FD mode or during the + arbitration phase in CAN-FD mode. Hi Franklin Since this and the next property are optional, it is good to document what happens when they are not in the DT blob. Also document what 0 means. + + max-data-speed: a positive non 0 value that determines the max data rate + that can be used in CAN-FD mode. A value of -1 implies + CAN-FD is not supported by the transceiver. -1 is ugly. I think it would be better to have a missing max-data-speed property indicate that CAN-FD is not supported. Thanks Andrew! I had the same feeling about '-1' :-) And maybe put 'fd' into the property name. Good point. In fact the common naming to set bitrates for CAN(FD) controllers are 'bitrate' and 'data bitrate'. 'speed' is not really a good word for that. Finally, @Franklin: A CAN transceiver is limited in bandwidth. But you only have one RX and one TX line between the CAN controller and the CAN transceiver. The transceiver does not know about CAN FD - it has just a physical(!) layer with a limited bandwidth. This is ONE limitation. So I tend to specify only ONE 'max-bitrate' property for the fixed-transceiver binding. The fact whether the CAN controller is CAN FD capable or not is provided by the netlink configuration interface for CAN controllers. Regards, Oliver
[RFC] net: make net.core.{r,w}mem_{default,max} namespaced
The following sysctl are global and can't be read or set from a netns: net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max Make the following sysctl parameters available from within a network namespace, allowing to set unique values per network namespace. My concern is about the initial value of this sysctl in the newly creates netns: I'm not sure if is better to copy them from the init namespace or set them to the default values. Setting them to the default value has the advantage that a new namespace behaves like a freshly booted system, while copying them from the init netns has the advantage of keeping the current behaviour as the values from the init netns are used. Signed-off-by: Matteo Croce--- include/net/netns/core.h| 5 +++ include/net/sock.h | 6 include/net/tcp.h | 3 +- net/core/net_namespace.c| 22 + net/core/sock.c | 31 +- net/core/sysctl_net_core.c | 70 ++--- net/ipv4/ip_output.c| 2 +- net/ipv4/syncookies.c | 3 +- net/ipv4/tcp_minisocks.c| 3 +- net/ipv4/tcp_output.c | 12 --- net/ipv6/syncookies.c | 3 +- net/netfilter/ipvs/ip_vs_sync.c | 4 +-- 12 files changed, 89 insertions(+), 75 deletions(-) diff --git a/include/net/netns/core.h b/include/net/netns/core.h index 78eb1ff75475..9b613162467d 100644 --- a/include/net/netns/core.h +++ b/include/net/netns/core.h @@ -9,6 +9,11 @@ struct netns_core { struct ctl_table_header *sysctl_hdr; int sysctl_somaxconn; + u32 sysctl_wmem_max; + u32 sysctl_rmem_max; + + u32 sysctl_wmem_default; + u32 sysctl_rmem_default; struct prot_inuse __percpu *inuse; }; diff --git a/include/net/sock.h b/include/net/sock.h index 7c0632c7e870..e62a279e420f 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -2363,13 +2363,7 @@ bool sk_net_capable(const struct sock *sk, int cap); void sk_get_meminfo(const struct sock *sk, u32 *meminfo); -extern __u32 sysctl_wmem_max; -extern __u32 sysctl_rmem_max; - extern int sysctl_tstamp_allow_data; extern int sysctl_optmem_max; -extern __u32 sysctl_wmem_default; -extern __u32 sysctl_rmem_default; - #endif /* _SOCK_H */ diff --git a/include/net/tcp.h b/include/net/tcp.h index 70483296157f..460f4373d42a 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1300,7 +1300,8 @@ static inline void tcp_slow_start_after_idle_check(struct sock *sk) /* Determine a window scaling and initial window to offer. */ void tcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd, __u32 *window_clamp, int wscale_ok, - __u8 *rcv_wscale, __u32 init_rcv_wnd); + __u8 *rcv_wscale, __u32 init_rcv_wnd, + __u32 rmem_max); static inline int tcp_win_from_space(int space) { diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index 8726d051f31d..2d72b2bd6eab 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -23,6 +23,16 @@ #include #include +/* Take into consideration the size of the struct sk_buff overhead in the + * determination of these values, since that is non-constant across + * platforms. This makes socket queueing behavior and performance + * not depend upon such differences. + */ +#define _SK_MEM_PACKETS256 +#define _SK_MEM_OVERHEAD SKB_TRUESIZE(256) +#define SK_WMEM_MAX(_SK_MEM_OVERHEAD * _SK_MEM_PACKETS) +#define SK_RMEM_MAX(_SK_MEM_OVERHEAD * _SK_MEM_PACKETS) + /* * Our network namespace constructor/destructor lists */ @@ -318,6 +328,18 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns) static int __net_init net_defaults_init_net(struct net *net) { net->core.sysctl_somaxconn = SOMAXCONN; + if (net_eq(net, _net)) { + init_net.core.sysctl_wmem_max = SK_WMEM_MAX; + init_net.core.sysctl_rmem_max = SK_RMEM_MAX; + init_net.core.sysctl_wmem_default = SK_WMEM_MAX; + init_net.core.sysctl_rmem_default = SK_RMEM_MAX; + } else { + net->core.sysctl_wmem_max = init_net.core.sysctl_wmem_max; + net->core.sysctl_rmem_max = init_net.core.sysctl_rmem_max; + net->core.sysctl_wmem_default = init_net.core.sysctl_wmem_default; + net->core.sysctl_rmem_default = init_net.core.sysctl_rmem_default; + } + return 0; } diff --git a/net/core/sock.c b/net/core/sock.c index ac2a404c73eb..8086a660d75f 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -307,24 +307,6 @@ static struct lock_class_key af_wlock_keys[AF_MAX]; static struct lock_class_key af_elock_keys[AF_MAX]; static struct lock_class_key
Re: [PATCH net-next v2 02/10] net: dsa: lan9303: Do not disable/enable switch fabric port 0 at startup
On Tue, Jul 25, 2017 at 06:15:45PM +0200, Egil Hjelmeland wrote: > For some mysterious reason enable switch fabric port 0 TX fails to > work, when the TX has previous been disabled. Resolved by not > disable/enable switch fabric port 0 at startup. Port 1 and 2 are > still disabled in early init. > > Signed-off-by: Egil Hjelmeland> --- > drivers/net/dsa/lan9303-core.c | 7 --- > 1 file changed, 7 deletions(-) > > diff --git a/drivers/net/dsa/lan9303-core.c b/drivers/net/dsa/lan9303-core.c > index e622db586c3d..c2b53659f58f 100644 > --- a/drivers/net/dsa/lan9303-core.c > +++ b/drivers/net/dsa/lan9303-core.c > @@ -557,9 +557,6 @@ static int lan9303_disable_processing(struct lan9303 > *chip) > { > int ret; > > - ret = lan9303_disable_packet_processing(chip, LAN9303_PORT_0_OFFSET); > - if (ret) > - return ret; > ret = lan9303_disable_packet_processing(chip, LAN9303_PORT_1_OFFSET); > if (ret) > return ret; > @@ -633,10 +630,6 @@ static int lan9303_setup(struct dsa_switch *ds) > if (ret) > dev_err(chip->dev, "failed to separate ports %d\n", ret); > > - ret = lan9303_enable_packet_processing(chip, LAN9303_PORT_0_OFFSET); > - if (ret) > - dev_err(chip->dev, "failed to re-enable switching %d\n", ret); > - Does this mean you are relying on something else enabling port 0? The bootloader? I'm wondering if it is better to keep the enable, but remove the disable? Andrew
Re: [PATCH net-next v2 01/10] net: dsa: lan9303: Fixed MDIO interface
On Tue, Jul 25, 2017 at 06:15:44PM +0200, Egil Hjelmeland wrote: > Fixes after testing on actual HW: > > - lan9303_mdio_write()/_read() must multiply register number > by 4 to get offset > > - Indirect access (PMI) to phy register only work in I2C mode. In > MDIO mode phy registers must be accessed directly. Introduced > struct lan9303_phy_ops to handle the two modes. Renamed functions > to clarify. > > - lan9303_detect_phy_setup() : Failed MDIO read return 0x. > Handle that. > > Signed-off-by: Egil Hjelmeland> --- > drivers/net/dsa/lan9303-core.c | 42 > +++--- > drivers/net/dsa/lan9303.h | 11 +++ > drivers/net/dsa/lan9303_i2c.c | 2 ++ > drivers/net/dsa/lan9303_mdio.c | 34 ++ > 4 files changed, 74 insertions(+), 15 deletions(-) > > diff --git a/drivers/net/dsa/lan9303-core.c b/drivers/net/dsa/lan9303-core.c > index cd76e61f1fca..e622db586c3d 100644 > --- a/drivers/net/dsa/lan9303-core.c > +++ b/drivers/net/dsa/lan9303-core.c > @@ -20,6 +20,9 @@ > > #include "lan9303.h" > > +/* 13.2 System Control and Status Registers > + * Multiply register number by 4 to get address offset. > + */ > #define LAN9303_CHIP_REV 0x14 > # define LAN9303_CHIP_ID 0x9303 > #define LAN9303_IRQ_CFG 0x15 > @@ -53,6 +56,9 @@ > #define LAN9303_VIRT_PHY_BASE 0x70 > #define LAN9303_VIRT_SPECIAL_CTRL 0x77 > > +/*13.4 Switch Fabric Control and Status Registers > + * Accessed indirectly via SWITCH_CSR_CMD, SWITCH_CSR_DATA. > + */ > #define LAN9303_SW_DEV_ID 0x > #define LAN9303_SW_RESET 0x0001 > #define LAN9303_SW_RESET_RESET BIT(0) > @@ -242,7 +248,7 @@ static int lan9303_virt_phy_reg_write(struct lan9303 > *chip, int regnum, u16 val) > return regmap_write(chip->regmap, LAN9303_VIRT_PHY_BASE + regnum, val); > } > > -static int lan9303_port_phy_reg_wait_for_completion(struct lan9303 *chip) > +static int lan9303_indirect_phy_wait_for_completion(struct lan9303 *chip) > { > int ret, i; > u32 reg; > @@ -262,7 +268,7 @@ static int > lan9303_port_phy_reg_wait_for_completion(struct lan9303 *chip) > return -EIO; > } > > -static int lan9303_port_phy_reg_read(struct lan9303 *chip, int addr, int > regnum) > +static int lan9303_indirect_phy_read(struct lan9303 *chip, int addr, int > regnum) > { > int ret; > u32 val; > @@ -272,7 +278,7 @@ static int lan9303_port_phy_reg_read(struct lan9303 > *chip, int addr, int regnum) > > mutex_lock(>indirect_mutex); > > - ret = lan9303_port_phy_reg_wait_for_completion(chip); > + ret = lan9303_indirect_phy_wait_for_completion(chip); > if (ret) > goto on_error; > > @@ -281,7 +287,7 @@ static int lan9303_port_phy_reg_read(struct lan9303 > *chip, int addr, int regnum) > if (ret) > goto on_error; > > - ret = lan9303_port_phy_reg_wait_for_completion(chip); > + ret = lan9303_indirect_phy_wait_for_completion(chip); > if (ret) > goto on_error; > > @@ -299,8 +305,8 @@ static int lan9303_port_phy_reg_read(struct lan9303 > *chip, int addr, int regnum) > return ret; > } > > -static int lan9303_phy_reg_write(struct lan9303 *chip, int addr, int regnum, > - unsigned int val) > +static int lan9303_indirect_phy_write(struct lan9303 *chip, int addr, > + int regnum, u16 val) > { > int ret; > u32 reg; > @@ -311,7 +317,7 @@ static int lan9303_phy_reg_write(struct lan9303 *chip, > int addr, int regnum, > > mutex_lock(>indirect_mutex); > > - ret = lan9303_port_phy_reg_wait_for_completion(chip); > + ret = lan9303_indirect_phy_wait_for_completion(chip); > if (ret) > goto on_error; > > @@ -328,6 +334,11 @@ static int lan9303_phy_reg_write(struct lan9303 *chip, > int addr, int regnum, > return ret; > } > > +const struct lan9303_phy_ops lan9303_indirect_phy_ops = { > + .phy_read = lan9303_indirect_phy_read, > + .phy_write = lan9303_indirect_phy_write, > +}; > + > static int lan9303_switch_wait_for_completion(struct lan9303 *chip) > { > int ret, i; > @@ -427,14 +438,15 @@ static int lan9303_detect_phy_setup(struct lan9303 > *chip) >* Special reg 18 of phy 3 reads as 0x, if 'phy_addr_sel_strap' is 0 >* and the IDs are 0-1-2, else it contains something different from >* 0x, which means 'phy_addr_sel_strap' is 1 and the IDs are 1-2-3. > + * 0x is returned for failed MDIO access. Hi Egil 0x is not really a failure. It just means there is nothing on the bus at that address. The bus has a weak pull-up, so defaults to one. >*/ > - reg = lan9303_port_phy_reg_read(chip, 3, MII_LAN911X_SPECIAL_MODES); > + reg = chip->ops->phy_read(chip, 3, MII_LAN911X_SPECIAL_MODES); > if (reg < 0) { > dev_err(chip->dev, "Failed to detect phy
Re: TC-pedit man page examples error
Hello Phil, Thank you for taking my message. I have gotten a solution working so feel free to ignore the rest of this email. By vanilla iproute2 do you mean the latest package through the yum or apt repository? All I can tell you is that in the example I gave the iproute2 package was the latest one through the RHEL 7 yum repository. Otherwise I will need clarification on what vanilla means. I attempted the above example code on two other Ubuntu machines, one of which I can confirm is Ubuntu 16.04 LTS. Unfortunately I am not at that machine right now so I cannot give you the output. Both machines gave similar error messages. The reason I emailed you via netdev was because the example code I used was taken verbatim from the tc-pedit man page on my machine. The man page also gave me email addresses to contact in case an error was found with the man page, which is what I did. I did get the code to work when using the latest version of iproute2, so I assume the only practical solution is to manually install the latest version of iproute2 or wait for the various distros of Linux to use updated iproute2. I appreciate your help. Have a nice day. Tyler Bautista On Wed, Jul 26, 2017 at 1:43 AM, Phil Sutterwrote: > Hi Tyler, > > On Tue, Jul 25, 2017 at 08:33:40AM -0700, Tyler Bautista wrote: >> To whom it may concern, >> I recently attempted to use simple tc action pedit commands on the man >> page and I ran into some errors. The following is some information >> about my version of iproute and my machine: >> >> the following is my iproute package information >> >> Loaded plugins: fastestmirror, langpacks >> iproute-3.10.0-74.el7.x86_64 > > This mailing list is for upstream issues only. Can you reproduce the > problem with vanilla iproute2? > > You are using RHEL7's iproute package, which has a number of known > issues in pedit action: > > https://bugzilla.redhat.com/show_bug.cgi?id=1322406 > > These are planned to be resolved for RHEL7.5, not sure when it will land > in CentOS. > > Cheers, Phil