Re: [PATCH] vhost: introduce O(1) vq metadata cache
Hi Jason,

[auto build test WARNING on vhost/linux-next]
[also build test WARNING on v4.9 next-20161214]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Jason-Wang/vhost-introduce-O-1-vq-metadata-cache/20161214-160153
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux-next
config: i386-randconfig-x005-201650 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All warnings (new ones prefixed by >>):

   drivers/vhost/vhost.c: In function 'vhost_vq_meta_fetch':
>> drivers/vhost/vhost.c:719:9: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
     return (void *)(node->userspace_addr + (u64)addr - node->start);
            ^

vim +719 drivers/vhost/vhost.c

   703  					   node->start,
   704  					   node->size))
   705  			return 0;
   706  	}
   707  	return 1;
   708  }
   709  
   710  static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
   711  					       u64 addr, unsigned int size,
   712  					       int type)
   713  {
   714  	const struct vhost_umem_node *node = vq->meta_iotlb[type];
   715  
   716  	if (!node)
   717  		return NULL;
   718  
 > 719  	return (void *)(node->userspace_addr + (u64)addr - node->start);
   720  }
   721  
   722  /* Can we switch to this memory table? */
   723  /* Caller should have device mutex but not vq mutex */
   724  static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
   725  			    int log_all)
   726  {
   727  	int i;

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Re: stmmac driver...
Hi David, Giuseppe and Alexandre,

>>>> There are a lot of patches and discussions happening around the stmmac
>>>> driver lately and both of you are listed as the maintainers. I really
>>>> need prompt and conclusive reviews of these patch submissions from you,
>>>> and participation in all discussions about the driver.
>>>
>>> yes we are trying to do the best.
>>>
>>>> Otherwise I have only three things I can do: 1) let the patches rot in
>>>> patchwork for days 2) trust that the patches are sane and fit your
>>>> desires and goals and just apply them or 3) reject them since they
>>>> aren't being reviewed properly.
>>>
>>> at this stage, I think the best is: (3).
>>
>> I think the patches David mentioned also included XLGMAC. He sent this
>> email before I explained QoS and XLGMAC were different IPs. Do you mind
>> we do XLGMAC development under drivers/net/ethernet/synopsys/ ? I think
>> we don't have conflict since we will keep QoS development in stmmac.
>
> Great. Many thanks for the clarification :-)
>
> Regards
> Peppe

Do you agree that we do XLGMAC development under
drivers/net/ethernet/synopsys/ in the future? There is no conflict of
interest since this is a new IP without a driver. As you see, there are
several drivers for QoS (GMAC) and several drivers for XGMAC. We want to
avoid this situation for the new IP XLGMAC.

Regards,
Jie
Re: [PATCH v2 net-next 1/2] phy: add phy fixup unregister functions
Hi all, On 2016/12/8 4:26, woojung@microchip.com wrote: >>From : Woojung Huh > > Add functions to unregister phy fixup for modules. > > int phy_unregister_fixup(const char *bus_id, u32 phy_uid, u32 phy_uid_mask) > Unregister phy fixup from phy_fixup_list per bus_id, phy_uid & > phy_uid_mask > > int phy_unregister_fixup_for_uid(u32 phy_uid, u32 phy_uid_mask) > Unregister phy fixup from phy_fixup_list. > Use it for fixup registered by phy_register_fixup_for_uid() > > int phy_unregister_fixup_for_id(const char *bus_id) > Unregister phy fixup from phy_fixup_list. > Use it for fixup registered by phy_register_fixup_for_id() > > Signed-off-by: Woojung Huh > --- > Documentation/networking/phy.txt | 9 > drivers/net/phy/phy_device.c | 47 > > include/linux/phy.h | 4 > 3 files changed, 60 insertions(+) > > diff --git a/Documentation/networking/phy.txt > b/Documentation/networking/phy.txt > index e017d93..16f90d8 100644 > --- a/Documentation/networking/phy.txt > +++ b/Documentation/networking/phy.txt > @@ -407,6 +407,15 @@ Board Fixups > The stubs set one of the two matching criteria, and set the other one to > match anything. > > + When phy_register_fixup() or *_for_uid()/*_for_id() is called at module, > + unregister fixup and free allocate memory are required. > + > + Call one of following function before unloading module. 
> + > + int phy_unregister_fixup(const char *phy_id, u32 phy_uid, u32 phy_uid_mask); > + int phy_unregister_fixup_for_uid(u32 phy_uid, u32 phy_uid_mask); > + int phy_register_fixup_for_id(const char *phy_id); > + > Standards > > IEEE Standard 802.3: CSMA/CD Access Method and Physical Layer > Specifications, Section Two: > diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c > index aeaf1bc..32fa7c7 100644 > --- a/drivers/net/phy/phy_device.c > +++ b/drivers/net/phy/phy_device.c > @@ -235,6 +235,53 @@ int phy_register_fixup_for_id(const char *bus_id, > } > EXPORT_SYMBOL(phy_register_fixup_for_id); > > +/** > + * phy_unregister_fixup - remove a phy_fixup from the list > + * @bus_id: A string matches fixup->bus_id (or PHY_ANY_ID) in phy_fixup_list > + * @phy_uid: A phy id matches fixup->phy_id (or PHY_ANY_UID) in > phy_fixup_list > + * @phy_uid_mask: Applied to phy_uid and fixup->phy_uid before comparison > + */ > +int phy_unregister_fixup(const char *bus_id, u32 phy_uid, u32 phy_uid_mask) > +{ > + struct list_head *pos, *n; > + struct phy_fixup *fixup; > + int ret; > + > + ret = -ENODEV; > + > + mutex_lock(&phy_fixup_lock); > + list_for_each_safe(pos, n, &phy_fixup_list) { > + fixup = list_entry(pos, struct phy_fixup, list); > + > + if ((!strcmp(fixup->bus_id, bus_id)) && > + ((fixup->phy_uid & phy_uid_mask) == > + (phy_uid & phy_uid_mask))) { > + list_del(&fixup->list); > + kfree(fixup); > + ret = 0; > + break; > + } > + } > + mutex_unlock(&phy_fixup_lock); > + > + return ret; > +} > +EXPORT_SYMBOL(phy_unregister_fixup); > + I just want to commit the unregister patch and found this patch. Good job! But I consider this patch may miss something. If one SoC has 2 MAC ports and each port uses the different network driver, the 2 drivers may register fixup for the same PHY chip with different "run" function because the PHY chip works in different mode. In such a case, this patch doesn't consider "run" function and may cause problem. 
When the driver that registered its fixup last is removed, it will remove
the other driver's fixup. Should this condition be considered and fixed?

> +/* Unregisters a fixup of any PHY with the UID in phy_uid */
> +int phy_unregister_fixup_for_uid(u32 phy_uid, u32 phy_uid_mask)
> +{
> +	return phy_unregister_fixup(PHY_ANY_ID, phy_uid, phy_uid_mask);
> +}
> +EXPORT_SYMBOL(phy_unregister_fixup_for_uid);
> +
> +/* Unregisters a fixup of the PHY with id string bus_id */
> +int phy_unregister_fixup_for_id(const char *bus_id)
> +{
> +	return phy_unregister_fixup(bus_id, PHY_ANY_UID, 0xffffffff);
> +}
> +EXPORT_SYMBOL(phy_unregister_fixup_for_id);
> +
>  /* Returns 1 if fixup matches phydev in bus_id and phy_uid.
>   * Fixups can be set to match any in one or more fields.
>   */
> diff --git a/include/linux/phy.h b/include/linux/phy.h
> index feb8a98..f7d95f6 100644
> --- a/include/linux/phy.h
> +++ b/include/linux/phy.h
> @@ -860,6 +860,10 @@ int phy_register_fixup_for_id(const char *bus_id,
>  int phy_register_fixup_for_uid(u32 phy_uid, u32 phy_uid_mask,
>  			       int (*run)(struct phy_device *));
>  
> +int phy_unregister_fixup(const char *bus_id, u32 phy_uid, u32 phy_uid_mask);
> +int phy_unregister_fixup_for_id(const char *bus_id);
> +int phy_unregister_fixup_for_uid(u32 phy_uid, u32 phy_uid_mask);
> +
>  int
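One way to address the question above would be an unregister variant that also matches the fixup's run callback, so one driver cannot remove another's fixup for the same PHY id. The following is a userspace sketch with simplified stand-in types, not the posted patch — the kernel version would need a new exported function:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-ins for struct phy_fixup and phy_fixup_list. */
struct fixup {
	const char *bus_id;
	unsigned int phy_uid;
	int (*run)(void *phydev);
	struct fixup *next;
};

static struct fixup *fixup_list;

static void register_fixup(struct fixup *f)
{
	f->next = fixup_list;
	fixup_list = f;
}

/* Like the posted phy_unregister_fixup(), but additionally requires the
 * run callback to match, so only the caller's own fixup is removed even
 * when two drivers registered fixups for the same PHY id. */
static int unregister_fixup_run(const char *bus_id, unsigned int phy_uid,
				unsigned int mask, int (*run)(void *))
{
	for (struct fixup **pp = &fixup_list; *pp; pp = &(*pp)->next) {
		struct fixup *f = *pp;

		if (!strcmp(f->bus_id, bus_id) &&
		    (f->phy_uid & mask) == (phy_uid & mask) &&
		    f->run == run) {
			*pp = f->next;	/* list_del() equivalent */
			return 0;
		}
	}
	return -1;		/* -ENODEV in the kernel version */
}

static int run_a(void *phydev) { (void)phydev; return 0; }
static int run_b(void *phydev) { (void)phydev; return 0; }
```

With two fixups registered for the same UID but different run callbacks, removing one by its callback leaves the other in place.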
Re: stmmac driver...
Hello Jie

On 12/14/2016 5:05 AM, Jie Deng wrote:
> Hi Peppe,
>
> On 2016/12/12 22:17, Giuseppe CAVALLARO wrote:
>> Hi David
>>
>> On 12/7/2016 7:06 PM, David Miller wrote:
>>> Giuseppe and Alexandre,
>>>
>>> There are a lot of patches and discussions happening around the stmmac
>>> driver lately and both of you are listed as the maintainers. I really
>>> need prompt and conclusive reviews of these patch submissions from you,
>>> and participation in all discussions about the driver.
>>
>> yes we are trying to do the best.
>>
>>> Otherwise I have only three things I can do: 1) let the patches rot in
>>> patchwork for days 2) trust that the patches are sane and fit your
>>> desires and goals and just apply them or 3) reject them since they
>>> aren't being reviewed properly.
>>
>> at this stage, I think the best is: (3).
>
> I think the patches David mentioned also included XLGMAC. He sent this
> email before I explained QoS and XLGMAC were different IPs. Do you mind
> we do XLGMAC development under drivers/net/ethernet/synopsys/ ? I think
> we don't have conflict since we will keep QoS development in stmmac.

Great. Many thanks for the clarification :-)

Regards
Peppe

> Thanks in advance.

you are welcome

Peppe
[PATCH] net: davicom: dm9000: use new api ethtool_{get|set}_link_ksettings
The ethtool api {get|set}_settings is deprecated. We move this driver to new api {get|set}_link_ksettings. Signed-off-by: Philippe Reynes --- drivers/net/ethernet/davicom/dm9000.c | 14 -- 1 files changed, 8 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/davicom/dm9000.c b/drivers/net/ethernet/davicom/dm9000.c index f1a81c5..008dc81 100644 --- a/drivers/net/ethernet/davicom/dm9000.c +++ b/drivers/net/ethernet/davicom/dm9000.c @@ -570,19 +570,21 @@ static void dm9000_set_msglevel(struct net_device *dev, u32 value) dm->msg_enable = value; } -static int dm9000_get_settings(struct net_device *dev, struct ethtool_cmd *cmd) +static int dm9000_get_link_ksettings(struct net_device *dev, +struct ethtool_link_ksettings *cmd) { struct board_info *dm = to_dm9000_board(dev); - mii_ethtool_gset(&dm->mii, cmd); + mii_ethtool_get_link_ksettings(&dm->mii, cmd); return 0; } -static int dm9000_set_settings(struct net_device *dev, struct ethtool_cmd *cmd) +static int dm9000_set_link_ksettings(struct net_device *dev, +const struct ethtool_link_ksettings *cmd) { struct board_info *dm = to_dm9000_board(dev); - return mii_ethtool_sset(&dm->mii, cmd); + return mii_ethtool_set_link_ksettings(&dm->mii, cmd); } static int dm9000_nway_reset(struct net_device *dev) @@ -741,8 +743,6 @@ static int dm9000_set_wol(struct net_device *dev, struct ethtool_wolinfo *w) static const struct ethtool_ops dm9000_ethtool_ops = { .get_drvinfo= dm9000_get_drvinfo, - .get_settings = dm9000_get_settings, - .set_settings = dm9000_set_settings, .get_msglevel = dm9000_get_msglevel, .set_msglevel = dm9000_set_msglevel, .nway_reset = dm9000_nway_reset, @@ -752,6 +752,8 @@ static int dm9000_set_wol(struct net_device *dev, struct ethtool_wolinfo *w) .get_eeprom_len = dm9000_get_eeprom_len, .get_eeprom = dm9000_get_eeprom, .set_eeprom = dm9000_set_eeprom, + .get_link_ksettings = dm9000_get_link_ksettings, + .set_link_ksettings = dm9000_set_link_ksettings, }; static void dm9000_show_carrier(struct 
board_info *db, -- 1.7.4.4
Re: [Query] Delayed vxlan socket creation?
On Wed, 14 Dec 2016 07:49:24 +, Du, Fan wrote:
> I'm interested in one Docker issue[1] which looks like it is related to
> kernel vxlan socket creation, as described in the thread. From my
> limited knowledge here, socket creation is synchronous, and after the
> *socket* syscall, the sock handle will be valid and ready to link up.
>
> Somehow I'm not sure of the detailed scenario here, and which/how a
> possible commit fixes it?

baf606d9c9b1^..56ef9c909b40

Jiri
Re: [PATCH] vhost: introduce O(1) vq metadata cache
On 2016/12/14 16:14, kbuild test robot wrote:
> Hi Jason,
>
> [auto build test WARNING on vhost/linux-next]
> [also build test WARNING on v4.9 next-20161214]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url:    https://github.com/0day-ci/linux/commits/Jason-Wang/vhost-introduce-O-1-vq-metadata-cache/20161214-160153
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux-next
> config: i386-randconfig-x005-201650 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386

Thanks, V2 will be posted soon.
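For reference, the usual way to make such a cast clean on 32-bit builds is to go through uintptr_t (the kernel typically uses unsigned long), which makes the u64-to-pointer truncation explicit instead of triggering -Wint-to-pointer-cast. A userspace sketch of the pattern, with the vhost node fields simplified:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the vhost umem node: on i386 pointers are
 * 32-bit while the address arithmetic is done in u64, so casting the
 * u64 result directly to a pointer warns. Casting through uintptr_t
 * keeps the arithmetic in 64 bits and makes the narrowing explicit. */
struct node {
	uint64_t userspace_addr;
	uint64_t start;
};

static void *meta_fetch(const struct node *n, uint64_t addr)
{
	/* (void *)(uintptr_t)... instead of (void *)(u64 expression) */
	return (void *)(uintptr_t)(n->userspace_addr + addr - n->start);
}
```

On 64-bit targets the extra cast is a no-op; on 32-bit it documents that the high half is intentionally dropped.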
Re: Designing a safe RX-zero-copy Memory Model for Networking
On Tue, 13 Dec 2016 12:08:21 -0800 John Fastabend wrote:

> On 16-12-13 11:53 AM, David Miller wrote:
> > From: John Fastabend
> > Date: Tue, 13 Dec 2016 09:43:59 -0800
> >
> >> What does "zero-copy send packet-pages to the application/socket that
> >> requested this" mean? At the moment on x86 page-flipping appears to be
> >> more expensive than memcpy (I can post some data shortly) and shared
> >> memory was proposed and rejected for security reasons when we were
> >> working on bifurcated driver.
> >
> > The whole idea is that we map all the active RX ring pages into
> > userspace from the start.
> >
> > And just how Jesper's page pool work will avoid DMA map/unmap,
> > it will also avoid changing the userspace mapping of the pages
> > as well.
> >
> > Thus avoiding the TLB/VM overhead altogether.

Exactly. It is worth mentioning that pages entering the page pool need
to be cleared (measured cost 143 cycles), in order not to leak any
kernel info. The primary focus of this design is to make sure not to
leak kernel info to userspace, but with an "exclusive" mode it also
supports isolation between applications.

> I get this but it requires applications to be isolated. The pages from
> a queue can not be shared between multiple applications in different
> trust domains. And the application has to be cooperative meaning it
> can't "look" at data that has not been marked by the stack as OK. In
> these schemes we tend to end up with something like virtio/vhost or
> af_packet.

I expect 3 modes when enabling RX-zero-copy on a page_pool. The first
two would require CAP_NET_ADMIN privileges. All modes have a trust
domain id that needs to match, e.g. when a page reaches the socket.

Mode-1 "Shared": Application chooses the lowest isolation level,
allowing multiple applications to mmap the VMA area.

Mode-2 "Single-user": Application requests to be the only user of the
RX queue. This blocks other applications from mmap'ing the VMA area.

Mode-3 "Exclusive": Application requests to own the RX queue.
Packets are no longer allowed for normal netstack delivery.

Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are
still allowed to travel the netstack and thus can contain packet data
from other normal applications. This is part of the design, to share
the NIC between the netstack and an accelerated userspace application
using RX zero-copy delivery.

> Any ACLs/filtering/switching/headers need to be done in hardware or
> the application trust boundaries are broken.

The software solution outlined allows the application to make the
choice of what trust boundary it wants. The "exclusive" mode-3 makes
most sense together with HW filters. Already today, we support creating
a new RX queue based on an ethtool ntuple HW filter; you then simply
attach your application to that queue in mode-3 and have full
isolation.

> If the above can not be met then a copy is needed. What I am trying
> to tease out is the above comment along with other statements like
> this "can be done with out HW filter features".

Does this address your concerns?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
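The three-mode policy sketched above could be modeled roughly as follows. All names here are hypothetical — nothing like this exists in the kernel at the time of this discussion; it only restates the rules from the mail (shared and single-user need CAP_NET_ADMIN, single-user and exclusive need the queue to themselves):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical RX-zero-copy modes, following the description above. */
enum pp_zc_mode {
	PP_ZC_SHARED,		/* mode-1: many apps may mmap the VMA */
	PP_ZC_SINGLE_USER,	/* mode-2: one app, pages still travel netstack */
	PP_ZC_EXCLUSIVE,	/* mode-3: app owns the RX queue outright */
};

/* Would this application be allowed to mmap the pool's VMA area? */
static bool pp_zc_mmap_allowed(enum pp_zc_mode mode, int existing_users,
			       bool cap_net_admin)
{
	switch (mode) {
	case PP_ZC_SHARED:
		/* any number of users, but pages may carry other apps' data */
		return cap_net_admin;
	case PP_ZC_SINGLE_USER:
		/* pages still travel the netstack, so privilege is required */
		return cap_net_admin && existing_users == 0;
	case PP_ZC_EXCLUSIVE:
		/* queue is detached from the netstack: isolation by HW filter */
		return existing_users == 0;
	}
	return false;
}
```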
RE: [PATCH 3/3] secure_seq: use fast&secure siphash instead of slow&insecure md5
From: Jason A. Donenfeld
> Sent: 14 December 2016 00:17
> This gives a clear speed and security improvement. Rather than manually
> filling MD5 buffers, we simply create a layout by a simple anonymous
> struct, for which gcc generates rather efficient code.
...
> +	const struct {
> +		struct in6_addr saddr;
> +		struct in6_addr daddr;
> +		__be16 sport;
> +		__be16 dport;
> +	} __packed combined = {
> +		.saddr = *(struct in6_addr *)saddr,
> +		.daddr = *(struct in6_addr *)daddr,
> +		.sport = sport,
> +		.dport = dport
> +	};

You need to look at the effect of marking this (and the other)
structures 'packed' on architectures like sparc64.

	David
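David's concern can be seen in miniature: marking a struct __packed drops its assumed alignment to 1, so on strict-alignment architectures such as sparc64 the compiler may emit byte-wise loads and stores for every member access, even when the object is in fact naturally aligned. A userspace sketch with simplified field types (uint32_t/uint16_t standing in for in6_addr/__be16; the attribute syntax is GCC-specific):

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Simplified stand-in for struct in6_addr. */
struct in6_like {
	uint32_t s6_addr32[4];
};

/* As in the patch: packed, so alignment drops to 1. */
struct combined_packed {
	struct in6_like saddr;
	struct in6_like daddr;
	uint16_t sport;
	uint16_t dport;
} __attribute__((packed));

/* Same layout without packing: naturally 4-byte aligned. */
struct combined_natural {
	struct in6_like saddr;
	struct in6_like daddr;
	uint16_t sport;
	uint16_t dport;
};
```

Here the two layouts happen to have the same size (no padding is needed either way), so __packed buys nothing but forces the compiler to assume misalignment on every access.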
RE: [PATCH 1/3] siphash: add cryptographically secure hashtable function
From: Jason A. Donenfeld > Sent: 14 December 2016 00:17 > SipHash is a 64-bit keyed hash function that is actually a > cryptographically secure PRF, like HMAC. Except SipHash is super fast, > and is meant to be used as a hashtable keyed lookup function. > > SipHash isn't just some new trendy hash function. It's been around for a > while, and there really isn't anything that comes remotely close to > being useful in the way SipHash is. With that said, why do we need this? > > There are a variety of attacks known as "hashtable poisoning" in which an > attacker forms some data such that the hash of that data will be the > same, and then preceeds to fill up all entries of a hashbucket. This is > a realistic and well-known denial-of-service vector. > > Linux developers already seem to be aware that this is an issue, and > various places that use hash tables in, say, a network context, use a > non-cryptographically secure function (usually jhash) and then try to > twiddle with the key on a time basis (or in many cases just do nothing > and hope that nobody notices). While this is an admirable attempt at > solving the problem, it doesn't actually fix it. SipHash fixes it. > > (It fixes it in such a sound way that you could even build a stream > cipher out of SipHash that would resist the modern cryptanalysis.) > > There are a modicum of places in the kernel that are vulnerable to > hashtable poisoning attacks, either via userspace vectors or network > vectors, and there's not a reliable mechanism inside the kernel at the > moment to fix it. The first step toward fixing these issues is actually > getting a secure primitive into the kernel for developers to use. Then > we can, bit by bit, port things over to it as deemed appropriate. > > Dozens of languages are already using this internally for their hash > tables. Some of the BSDs already use this in their kernels. SipHash is > a widely known high-speed solution to a widely known problem, and it's > time we catch-up. ... 
> +u64 siphash24(const u8 *data, size_t len, const u8 key[SIPHASH24_KEY_LEN])
...
> +	u64 k0 = get_unaligned_le64(key);
> +	u64 k1 = get_unaligned_le64(key + sizeof(u64));
...
> +	m = get_unaligned_le64(data);

All these unaligned accesses are going to get expensive on
architectures like sparc64.

	David
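For context, a portable way to express what get_unaligned_le64() does is a memcpy-based load: the compiler emits a single 8-byte load on x86, but falls back to byte-wise accesses on strict-alignment targets — which is exactly the cost David is pointing at. A userspace sketch (the byte-order check uses GCC/Clang builtins):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Userspace equivalent of the kernel's get_unaligned_le64(): read 8
 * bytes from a possibly-unaligned pointer and interpret them as
 * little-endian. memcpy lets the compiler choose the cheapest safe
 * load for the architecture instead of faulting on misalignment. */
static uint64_t load_unaligned_le64(const uint8_t *p)
{
	uint64_t v;

	memcpy(&v, p, sizeof(v));
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
	v = __builtin_bswap64(v);	/* bytes on the wire are little-endian */
#endif
	return v;
}
```

On sparc64 this compiles to eight byte loads plus shifts when the pointer may be unaligned; when callers can guarantee alignment, a plain aligned load would be cheaper — hence the review comment.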
dsa: handling more than 1 cpu port
Hi Andrew,

switches supported by qca8k have 2 gmacs that we can wire an external
mii interface to. Usually this is used to wire 2 of the SoCs MACs to the
switch. The switch itself is aware of one of the MACs being the cpu port
and expects this to be port/mac0. Using the other will break the
hardware offloading features.

The original series from Matthieu added a feature to configure the
switch in a pass-through mode [1]. This resulted in us having to define
the pass-through inside the dts, which means that we lose userland
configurability.

Assume the setup as follows ...

port0   - cpu port wired to SoCs MAC0
port1-4 - the lan ports
port5   - the wan port
port6   - wired to the SoCs MAC1

What I have done now is bring up one bridge for port1-4 and a second one
for port5-6. Once set up, I can pass traffic on the SoCs MAC1 and it
will flow via port6 and egress on port5. So far so good; however, due to
the way the port based vlans are set up and how the bridge_join/leave()
logic works, port5/6 will also forward traffic to the cpu port. The
driver now has to be told that we are trunking traffic on eth1 via
port6; also the MII mode is not known to the driver. Adding some hackish
register writes will make this work nicely.

My proposed way of fixing this cleanly in an upstream friendly way would
be as follows

1) add an extra dsa,ethernet property to the 2nd MII port

	dsa@0 {
		compatible = "qca,ar8xxx";
		dsa,ethernet = <&gmac1>;
		[...]
		switch@0 {
			[...]
			port@5 {
				reg = <5>;
				label = "wan";
				phy-handle = <&phy_port5>;
			};
			port@6 {
				reg = <6>;
				label = "gmac2";
				dsa,ethernet = <&gmac2>;
				fixed-link {
					speed = <1000>;
					full-duplex;
				};
			};
		};
	};

2) fix up the port_bridge_join/leave() logic such that if a port is
present in the bridge that has a reference to an ethernet interface, it
will remove all ports in the bridge from the port based vlan of the
actual cpu port.
3) in case of this switch we also need to fiddle with the bcast/mcast
flooding registers

Would this be ok, and would it be better to probe the extra dsa,ethernet
inside the subsystem or the driver? I was considering adding a
dsa_is_trunk_port() or similar to achieve this.

	John

[1] https://patchwork.ozlabs.org/patch/477525/
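A toy model of the dsa_is_trunk_port() idea floated above — purely hypothetical, since neither the helper nor a per-port dsa,ethernet property exists upstream; it only illustrates the bridge_join/leave() rule from step 2:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: a port counts as a trunk port when its DT node
 * carries its own dsa,ethernet phandle, distinct from the switch-level
 * cpu port (port0 in the setup described above). */
struct dsa_port_model {
	int index;
	bool is_cpu;
	bool has_own_ethernet;	/* port-level dsa,ethernet present */
};

static bool dsa_is_trunk_port(const struct dsa_port_model *p)
{
	return !p->is_cpu && p->has_own_ethernet;
}

/* Step 2 in miniature: a bridge containing a trunk port must have its
 * member ports removed from the cpu port's port-based vlan, so traffic
 * egresses via the trunk (port6/eth1) instead of the cpu port. */
static bool bridge_needs_cpu_vlan_removal(const struct dsa_port_model *ports,
					  size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (dsa_is_trunk_port(&ports[i]))
			return true;
	return false;
}
```

With the example setup, the wan bridge {port5, port6} trips the check (port6 is the trunk), while the lan bridge {port1-4} does not.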
Re: sctp: suspicious rcu_dereference_check() usage in sctp_epaddr_lookup_transport
On Wed, Dec 14, 2016 at 5:37 AM, Marcelo Ricardo Leitner wrote: > On Tue, Dec 13, 2016 at 07:07:01PM +0100, Dmitry Vyukov wrote: >> Hello, >> >> I am getting the following reports while running syzkaller fuzzer: >> >> [ INFO: suspicious RCU usage. ] >> 4.9.0+ #85 Not tainted >> --- >> ./include/linux/rhashtable.h:572 suspicious rcu_dereference_check() usage! >> >> other info that might help us debug this: >> >> rcu_scheduler_active = 1, debug_locks = 0 >> 1 lock held by syz-executor1/18023: >> #0: (sk_lock-AF_INET){+.+.+.}, at: [< inline >] lock_sock >> include/net/sock.h:1454 >> #0: (sk_lock-AF_INET){+.+.+.}, at: [] >> sctp_getsockopt+0x45f/0x6800 net/sctp/socket.c:6432 >> >> stack backtrace: >> CPU: 2 PID: 18023 Comm: syz-executor1 Not tainted 4.9.0+ #85 >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 >> Call Trace: >> [< inline >] __dump_stack lib/dump_stack.c:15 >> [] dump_stack+0x2ee/0x3ef lib/dump_stack.c:51 >> [ ] lockdep_rcu_suspicious+0x139/0x180 >> kernel/locking/lockdep.c:4448 >> [< inline >] __rhashtable_lookup ./include/linux/rhashtable.h:572 >> [< inline >] rhltable_lookup ./include/linux/rhashtable.h:660 >> [ ] sctp_epaddr_lookup_transport+0x641/0x930 >> net/sctp/input.c:946 > > I think this was introduced in the rhlist converstion. We had removed > some rcu_read_lock() calls on sctp stack because rhashtable was already > calling it, but then we didn't add them back when moving to rhlist. 
>
> This code:
>
> +/* return a transport without holding it, as it's only used under sock lock */
>  struct sctp_transport *sctp_epaddr_lookup_transport(
> 				const struct sctp_endpoint *ep,
> 				const union sctp_addr *paddr)
>  {
> 	struct net *net = sock_net(ep->base.sk);
> +	struct rhlist_head *tmp, *list;
> +	struct sctp_transport *t;
> 	struct sctp_hash_cmp_arg arg = {
> -		.ep    = ep,
> 		.paddr = paddr,
> 		.net   = net,
> +		.lport = htons(ep->base.bind_addr.port),
> 	};
>
> -	return rhashtable_lookup_fast(&sctp_transport_hashtable, &arg,
> -				      sctp_hash_params);
> +	list = rhltable_lookup(&sctp_transport_hashtable, &arg,
> +			       sctp_hash_params);
>
> had an implicit rcu_read_lock() in rhashtable_lookup_fast, but there is
> none in rhltable_lookup, and rhltable_lookup uses _rcu calls which
> assume we have rcu read protection.

You're right, we need to call rcu_read_lock before using
rhltable_lookup. Will post a fix for it, thanks.
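The shape of the fix being agreed on here — take the RCU read lock in the caller around the rhltable lookup — can be modeled in userspace with a counter standing in for the RCU read-side critical section (everything below is a stub, not kernel code):

```c
#include <assert.h>

/* Stub RCU: a depth counter stands in for the read-side critical
 * section that lockdep checks for. */
static int rcu_depth;
static void rcu_read_lock(void)   { rcu_depth++; }
static void rcu_read_unlock(void) { rcu_depth--; }

/* Stand-in for rhltable_lookup(): like the real one, it must run under
 * the RCU read lock — this assert plays the role of the
 * "suspicious rcu_dereference_check() usage" splat in the report. */
static int rhltable_lookup_stub(void)
{
	assert(rcu_depth > 0);
	return 42;	/* pretend lookup result */
}

/* The fixed pattern: the sctp_epaddr_lookup_transport()-style caller
 * takes rcu_read_lock() itself, since rhltable_lookup() no longer does
 * (unlike the old rhashtable_lookup_fast()). */
static int lookup_transport(void)
{
	int v;

	rcu_read_lock();
	v = rhltable_lookup_stub();
	rcu_read_unlock();
	return v;
}
```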
[RFC PATCH net-next v4 2/2] macb: Enable 1588 support in SAMA5Dx platforms.
This patch does the following:
- Enable HW time stamp for the following platforms: SAMA5D2, SAMA5D3
  and SAMA5D4.
- HW time stamp capabilities are advertised via ethtool and the macb
  ioctl is updated accordingly.
- HW time stamps on the PTP Ethernet packets are received using the
  SO_TIMESTAMPING API, where timers are obtained from the PTP
  event/peer registers.

Note: Patch on net-next, on December 7th.

Signed-off-by: Andrei Pistirica
---
Patch history:

Version 1:
- Integration with SAMA5D2 only. This feature wasn't tested on any
  other platform that might use cadence/gem. Patch is not completely
  ported to the very latest version of net-next, and it will be after
  review.

Version 2 modifications:
- add PTP caps for SAMA5D2/3/4 platforms
- and cosmetic changes

Version 3 modifications:
- add support for SAMA5D2/3/4 platforms using the GEM-PTP interface.

Version 4 modifications:
- time stamp only PTP_V2 events
- maximum adjustment value is set based on Richard's input

Note: Patch on net-next, on December 14th.
drivers/net/ethernet/cadence/macb.c | 168 ++-- 1 file changed, 163 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/cadence/macb.c b/drivers/net/ethernet/cadence/macb.c index 538544a..8d5c976 100644 --- a/drivers/net/ethernet/cadence/macb.c +++ b/drivers/net/ethernet/cadence/macb.c @@ -714,6 +714,8 @@ static void macb_tx_interrupt(struct macb_queue *queue) /* First, update TX stats if needed */ if (skb) { + gem_ptp_do_txstamp(bp, skb); + netdev_vdbg(bp->dev, "skb %u (data %p) TX complete\n", macb_tx_ring_wrap(bp, tail), skb->data); @@ -878,6 +880,8 @@ static int gem_rx(struct macb *bp, int budget) GEM_BFEXT(RX_CSUM, ctrl) & GEM_RX_CSUM_CHECKED_MASK) skb->ip_summed = CHECKSUM_UNNECESSARY; + gem_ptp_do_rxstamp(bp, skb); + bp->stats.rx_packets++; bp->stats.rx_bytes += skb->len; @@ -2080,6 +2084,9 @@ static int macb_open(struct net_device *dev) netif_tx_start_all_queues(dev); + if (bp->ptp_info) + bp->ptp_info->ptp_init(dev); + return 0; } @@ -2101,6 +2108,9 @@ static int macb_close(struct net_device *dev) macb_free_consistent(bp); + if (bp->ptp_info) + bp->ptp_info->ptp_remove(dev); + return 0; } @@ -2374,6 +2384,133 @@ static int macb_set_ringparam(struct net_device *netdev, return 0; } +#ifdef CONFIG_MACB_USE_HWSTAMP +static unsigned int gem_get_tsu_rate(struct macb *bp) +{ + /* Note: TSU rate is hardwired to PCLK. 
*/ + return clk_get_rate(bp->pclk); +} + +static s32 gem_get_ptp_max_adj(void) +{ + return 3921508; +} + +static int gem_get_ts_info(struct net_device *dev, + struct ethtool_ts_info *info) +{ + struct macb *bp = netdev_priv(dev); + + ethtool_op_get_ts_info(dev, info); + info->so_timestamping = + SOF_TIMESTAMPING_TX_SOFTWARE | + SOF_TIMESTAMPING_RX_SOFTWARE | + SOF_TIMESTAMPING_SOFTWARE | + SOF_TIMESTAMPING_TX_HARDWARE | + SOF_TIMESTAMPING_RX_HARDWARE | + SOF_TIMESTAMPING_RAW_HARDWARE; + info->phc_index = -1; + + if (bp->ptp_clock) + info->phc_index = ptp_clock_index(bp->ptp_clock); + + return 0; +} + +static int gem_set_hwtst(struct net_device *netdev, +struct ifreq *ifr, int cmd) +{ + struct hwtstamp_config config; + struct macb *priv = netdev_priv(netdev); + u32 regval; + + netdev_vdbg(netdev, "macb_hwtstamp_ioctl\n"); + + if (copy_from_user(&config, ifr->ifr_data, sizeof(config))) + return -EFAULT; + + /* reserved for future extensions */ + if (config.flags) + return -EINVAL; + + switch (config.tx_type) { + case HWTSTAMP_TX_OFF: + priv->hwts_tx_en = false; + break; + case HWTSTAMP_TX_ON: + priv->hwts_tx_en = true; + break; + default: + return -ERANGE; + } + + switch (config.rx_filter) { + case HWTSTAMP_FILTER_NONE: + if (priv->hwts_rx_en) + priv->hwts_rx_en = false; + break; + case HWTSTAMP_FILTER_PTP_V2_L4_EVENT: + case HWTSTAMP_FILTER_PTP_V2_L2_EVENT: + case HWTSTAMP_FILTER_PTP_V2_L2_SYNC: + case HWTSTAMP_FILTER_PTP_V2_L4_SYNC: + case HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ: + case HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ: + case HWTSTAMP_FILTER_PTP_V2_EVENT: + case HWTSTAMP_FILTER_PTP_V2_SYNC: + case HWTSTAMP_FILTER_PTP_V2_DELAY_REQ: + config.rx_filter = HWTSTAMP_FILT
[RFC PATCH net-next v4 1/2] macb: Add 1588 support in Cadence GEM.
Cadence GEM provides a 102 bit time counter with 48 bits for seconds,
30 bits for nsecs and 24 bits for sub-nsecs to control 1588
timestamping.

This patch does the following:
- Registers to the ptp clock framework.
- Timer initialization is done by writing the time of day to the timer
  counter.
- The ns increment register is programmed as
  NSEC_PER_SEC/tsu-clock-rate. For a 16 bit subns precision, the subns
  increment equals the remainder of (NS_PER_SEC/TSU_CLK) * (2^16).
- Timestamps are obtained from the TX/RX PTP event/PEER registers. The
  timestamp obtained thus is updated in the skb for upper layers to
  access.
- The driver registers functions with ptp to perform time and frequency
  adjustment.
- Time adjustment is done by writing to the 1588_ADJUST register. The
  controller will read the delta in this register and update the timer
  counter register. Alternatively, for large time offset adjustments,
  the driver reads the secs and nsecs counter values, adds/subtracts
  the delta and updates the timer counter.
- Frequency is adjusted by adjusting addend (8 bit nanosecond
  increment) and addendsub (16 bit increment nanosecond fractions). The
  102 bit counter is incremented at nominal frequency with the addend
  and addendsub values. Each period, the addend and addendsub values
  are adjusted based on the ppm drift.

Signed-off-by: Andrei Pistirica
Signed-off-by: Harini Katakam
---
Patch history:

Version 1:
- This patch is based on Harini's original patch, implemented in a
  separate file to ease the review/maintenance and integration with
  other platforms (e.g. Zynq Ultrascale+ MPSoC).
- Feature was tested on the SAMA5D2 platform using ptp4l v1.6 from the
  linuxptp project and also with ptpd2 version 2.3.1. PTP was tested
  over IPv4, IPv6 and 802.3 protocols.
- In case macb is compiled as a module, it has been renamed to
  cadence-macb.ko to avoid naming confusion in the Makefile.
Version 2 modifications: - bitfields for TSU are named according to SAMA5D2 data sheet - identify GEM-PTP support based on platform capability - add spinlock for TSU access - change macb_ptp_adjfreq and use fewer 64bit divisions Version 3 modifications: - new adjfine api with one 64 division for frequency adjustment (based on Richard's input) - add maximum adjustment frequency (ppb) based on nominal frequency - per platform PTP configuration - cosmetic changes Note 1: Kbuild uses "select" instead of "imply", and the macb maintainer agreed to make the change when it will be available in net-next. Version 4 modifications: - update adjfine for a better approximation - add maximum adjustment frequency callback to PTP platform configuraion Note 1: This driver does not support GEM-GXL! Note 2: Patch on net-next, on December 14th. drivers/net/ethernet/cadence/Kconfig| 10 +- drivers/net/ethernet/cadence/Makefile | 8 +- drivers/net/ethernet/cadence/macb.h | 118 ++ drivers/net/ethernet/cadence/macb_ptp.c | 366 4 files changed, 500 insertions(+), 2 deletions(-) create mode 100644 drivers/net/ethernet/cadence/macb_ptp.c diff --git a/drivers/net/ethernet/cadence/Kconfig b/drivers/net/ethernet/cadence/Kconfig index f0bcb15..ebbc65f 100644 --- a/drivers/net/ethernet/cadence/Kconfig +++ b/drivers/net/ethernet/cadence/Kconfig @@ -29,6 +29,14 @@ config MACB support for the MACB/GEM chip. To compile this driver as a module, choose M here: the module - will be called macb. + will be called cadence-macb. + +config MACB_USE_HWSTAMP + bool "Use IEEE 1588 hwstamp" + depends on MACB + default y + select PTP_1588_CLOCK + ---help--- + Enable IEEE 1588 Precision Time Protocol (PTP) support for MACB. endif # NET_CADENCE diff --git a/drivers/net/ethernet/cadence/Makefile b/drivers/net/ethernet/cadence/Makefile index 91f79b1..4402d42 100644 --- a/drivers/net/ethernet/cadence/Makefile +++ b/drivers/net/ethernet/cadence/Makefile @@ -2,4 +2,10 @@ # Makefile for the Atmel network device drivers. 
 #
-obj-$(CONFIG_MACB) += macb.o
+cadence-macb-y := macb.o
+
+ifeq ($(CONFIG_MACB_USE_HWSTAMP),y)
+cadence-macb-y += macb_ptp.o
+endif
+
+obj-$(CONFIG_MACB) += cadence-macb.o

diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
index d67adad..e65e985 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -10,6 +10,9 @@
 #ifndef _MACB_H
 #define _MACB_H
 
+#include 
+#include 
+
 #define MACB_GREGS_NBR 16
 #define MACB_GREGS_VERSION 2
 #define MACB_MAX_QUEUES 8
@@ -131,6 +134,20 @@
 #define GEM_RXIPCCNT	0x01a8 /* IP header Checksum Error Counter */
 #define GEM_RXTCPCCNT	0x01ac /* TCP Checksum Error Counter */
 #define GEM_RXUDPCCNT	0x01b0 /* UDP Checksum Error Counter */
+#define GEM_TISUBN	0x01bc /* 1588 Timer Increment Sub-ns */
+#define GEM_TSH	0x01c0 /* 1588 Timer Seconds High */
+#define GEM_TSL
[PATCH V2] vhost: introduce O(1) vq metadata cache
When device IOTLB is enabled, all address translations are stored in an interval tree. The O(lgN) search time can be slow for virtqueue metadata (avail, used and descriptors), since these are accessed much more often than other addresses. So this patch introduces an O(1) array which points to the interval tree nodes that store the translations of the vq metadata. The array is updated during vq IOTLB prefetching and reset on each invalidation and TLB update. Each time we want to access vq metadata, this small array is queried before the interval tree. This is sufficient for static mappings but not dynamic mappings; we can do optimizations on top.

Tests were done with l2fwd in a guest (2M hugepages):

       noiommu  | before        | after
    tx 1.32Mpps | 1.06Mpps(82%) | 1.30Mpps(98%)
    rx 2.33Mpps | 1.46Mpps(63%) | 2.29Mpps(98%)

We can almost reach the same performance as noiommu mode.

Signed-off-by: Jason Wang
---
Changes from V1:
- silence a 32bit build warning
---
 drivers/vhost/vhost.c | 136 --
 drivers/vhost/vhost.h | 8 +++
 2 files changed, 118 insertions(+), 26 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c6f2d89..50ed625 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -282,6 +282,22 @@ void vhost_poll_queue(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_queue);
 
+static void __vhost_vq_meta_reset(struct vhost_virtqueue *vq)
+{
+	int j;
+
+	for (j = 0; j < VHOST_NUM_ADDRS; j++)
+		vq->meta_iotlb[j] = NULL;
+}
+
+static void vhost_vq_meta_reset(struct vhost_dev *d)
+{
+	int i;
+
+	for (i = 0; i < d->nvqs; ++i)
+		__vhost_vq_meta_reset(d->vqs[i]);
+}
+
 static void vhost_vq_reset(struct vhost_dev *dev,
 			   struct vhost_virtqueue *vq)
 {
@@ -311,6 +327,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->busyloop_timeout = 0;
 	vq->umem = NULL;
 	vq->iotlb = NULL;
+	__vhost_vq_meta_reset(vq);
 }
 
 static int vhost_worker(void *data)
@@ -690,6 +707,18 @@ static int vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
 	return 1;
 }
 
+static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
+					       u64 addr, unsigned int size,
+					       int type)
+{
+	const struct vhost_umem_node *node = vq->meta_iotlb[type];
+
+	if (!node)
+		return NULL;
+
+	return (void *)(uintptr_t)(node->userspace_addr + addr - node->start);
+}
+
 /* Can we switch to this memory table? */
 /* Caller should have device mutex but not vq mutex */
 static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
@@ -732,8 +761,14 @@ static int vhost_copy_to_user(struct vhost_virtqueue *vq, void *to,
 	 * could be access through iotlb. So -EAGAIN should
 	 * not happen in this case.
 	 */
-	/* TODO: more fast path */
 	struct iov_iter t;
+	void __user *uaddr = vhost_vq_meta_fetch(vq,
+						 (u64)(uintptr_t)to, size,
+						 VHOST_ADDR_DESC);
+
+	if (uaddr)
+		return __copy_to_user(uaddr, from, size);
+
 	ret = translate_desc(vq, (u64)(uintptr_t)to, size, vq->iotlb_iov,
 			     ARRAY_SIZE(vq->iotlb_iov),
 			     VHOST_ACCESS_WO);
@@ -761,8 +796,14 @@ static int vhost_copy_from_user(struct vhost_virtqueue *vq, void *to,
 	 * could be access through iotlb. So -EAGAIN should
 	 * not happen in this case.
 	 */
-	/* TODO: more fast path */
+	void __user *uaddr = vhost_vq_meta_fetch(vq,
+						 (u64)(uintptr_t)from, size,
+						 VHOST_ADDR_DESC);
 	struct iov_iter f;
+
+	if (uaddr)
+		return __copy_from_user(to, uaddr, size);
+
 	ret = translate_desc(vq, (u64)(uintptr_t)from, size, vq->iotlb_iov,
 			     ARRAY_SIZE(vq->iotlb_iov),
 			     VHOST_ACCESS_RO);
@@ -782,17 +823,12 @@ static void vhost_copy_from_user(struct vhost_virtqueue *vq, void *to,
 	return ret;
 }
 
-static void __user *__vhost_get_user(struct vhost_virtqueue *vq,
-				     void *addr, unsigned size)
+static void __user *__vhost_get_user_slow(struct vhost_virtqueue *vq,
+					  void *addr, unsigned int size,
+					  int type)
 {
 	int ret;
 
-	/* This function should be called after iotlb
-	 * prefetch, which means we're sure that vq
-	 * could be access through iotlb. So -EAGAIN should
-
Re: dsa: handling more than 1 cpu port
On Wed, Dec 14, 2016 at 11:01:54AM +0100, John Crispin wrote:
> Hi Andrew,
>
> switches supported by qca8k have 2 gmacs that we can wire an external
> mii interface to. Usually this is used to wire 2 of the SoCs MACs to the
> switch. The switch itself is aware of one of the MACs being the cpu port
> and expects this to be port/mac0. Using the other will break the
> hardware offloading features.

Just to be sure here. There is no way to use the second port connected to the CPU as a CPU port? The Marvell chips do allow this.

So I developed a proof of concept which had a mapping between cpu ports and slave ports: slave port X should use cpu port y for its traffic. This never got past proof of concept.

If this can be made to work for qca8k, I would prefer having this general concept to specific hacks for pass through.

Andrew
Re: dsa: handling more than 1 cpu port
On 14/12/2016 11:31, Andrew Lunn wrote:
> On Wed, Dec 14, 2016 at 11:01:54AM +0100, John Crispin wrote:
>> Hi Andrew,
>>
>> switches supported by qca8k have 2 gmacs that we can wire an external
>> mii interface to. Usually this is used to wire 2 of the SoCs MACs to the
>> switch. The switch itself is aware of one of the MACs being the cpu port
>> and expects this to be port/mac0. Using the other will break the
>> hardware offloading features.
>
> Just to be sure here. There is no way to use the second port connected
> to the CPU as a CPU port?

Both macs are considered cpu ports and both allow for the tag to be injected. For HW NAT/routing/... offloading to work, however, the lan ports need to trunk via port0 and not port6.

> The Marvell chips do allow this. So I developed a proof of concept
> which had a mapping between cpu ports and slave ports. Slave port X
> should use cpu port y for its traffic. This never got past proof of
> concept.
>
> If this can be made to work for qca8k, I would prefer having this
> general concept to specific hacks for pass through.

Oh cool, can you send those patches my way please? How do you configure this from userland? Does the cpu port get its own swdev which I just add to my lan bridge, and then add the 2nd cpu port to the wan bridge?

John
Re: [PATCH net-next 1/3] net:dsa:mv88e6xxx: use hashtable to store multicast entries
On Wed, Dec 14, 2016 at 11:03:14AM +0100, Volodymyr Bendiuga wrote: > Hi, > > I understand what you wrote, but we do not refresh anything > at intervals. FDB dump works only on user request, i.e. user > run bridge fdb dump command. What we did is just a smarter > way of feeding the dump request with entries obtained from > switchcore. Interesting thing is that fdb dump in switchcore > reads all entries, and from those entries only one is returned at > a time, others are discarded, because request comes as per port. Ah, O.K. > What we do is we dump entries from switchcore once, when the > first dump request comes, save all of them in the cache, and then > all following fdb dump requests for each port will be fed from the cache, > so no more hardware dump operations. We set the cache to be valid for > 2 seconds, but this could be adjusted and tweaked. So in our case > we decrease the number of MDIO reads quite a lot. > What we are thinking now, is that this logics could be moved > to net/dsa layer, and maybe unified with fdb load logics, if possible. We should check what the other switches do. Can they perform a dump with the hardware performing the port filter? Those switches will not benefit from such a cache. But they should also not suffer. Combining it with load is a bigger question. Currently the drivers are stateless. That makes the error handling very simple. We don't have to worry about running out of memory, since we don't allocate any memory. If we run out of memory for this dump cache, it is not serious, since a dump does not result in state change. But if we are out of memory on a load, we do have state changes. We have to deal with the complexity of allocating resources in the _prepare call, and then use the resource in the do call. I would much prefer to avoid this at first, by optimising the atu get. If we don't see a significant speed up, then we can consider this complex solution of a cache for load. Andrew
Re: net/arp: ARP cache aging failed.
On 2016/11/26 4:40, Julian Anastasov wrote: > > Hello, > > On Fri, 25 Nov 2016, Hannes Frederic Sowa wrote: > >> On 25.11.2016 09:18, Julian Anastasov wrote: >>> >>> Another option would be to add similar bit to >>> struct sock (sk_dst_pending_confirm), may be ip_finish_output2 >>> can propagate it to dst_neigh_output via new arg and then to >>> reset it. >> >> I don't understand how this can help? Maybe I understood it wrong? > > The idea is, the indication from highest level (TCP) to > be propagated to lowest level (neigh). > >>> Or to change n->confirmed via some new inline sock >>> function instead of changing dst_neigh_output. At this place >>> skb->sk is optional, may be we need (skb->sk && dst == >>> skb->sk->sk_dst_cache) check. Also, when sk_dst_cache changes >>> we should clear this flag. >> >> In "(skb->sk && dst == skb->sk->sk_dst_cache)" where does dst come from? > > This is in case we do not want to propagate indication > from TCP to tunnels, see below... > >> I don't see a possibility besides using mac_header or inner_mac_header >> to look up the incoming MAC address and confirm that one in the neighbor >> cache so far (we could try to optimize this case for rt_gateway though). > > My idea is as follows: > > - TCP instead of calling dst_confirm should call new func > dst_confirm_sk(sk) where sk->sk_dst_pending_confirm is set, > not dst->pending_confirm. > > - ip_finish_output2: use skb->sk (TCP) to check for > sk_dst_pending_confirm and update n->confirmed in some > new inline func > > Correct me if I'm wrong but here is how I see the > situation: > > - TCP caches output dst in socket, for example, dst1, > sets it as skb->dst > > - if xfrm tunnels are used then dst1 can point to some tunnel, > i.e. 
it is not in all cases the same skb->dst, as seen by
> ip_finish_output2
>
> - netfilter can DNAT and assign different skb->dst
>
> - as a result, before now, dst1->pending_confirm is set but
> it is not used properly because ip_finish_output2 uses
> the final skb->dst which can be different or with
> rt_gateway = 0
>
> - considering that tunnels do not use dst_confirm,
> n->confirmed is not changed every time. There are some
> interesting cases:
>
> 1. both dst1 and the final skb->dst point to same
> dst with rt_gateway = 0: n->confirmed is updated but
> as we see it can be for the wrong neigh.
>
> 2. dst1 is different from skb->dst, so n->confirmed is
> not changed. This can happen on DNAT or when using
> tunnel to secure gateway.
>
> - in ip_finish_output2 we have skb->sk (original TCP sk) and
> sk arg (equal to skb->sk or second option is sock of some UDP
> tunnel, etc). The idea is to use skb->sk->sk_dst_pending_confirm,
> i.e. highest level sk because the sk arg, if different, does not
> have such flag set (tunnels do not call dst_confirm).
>
> - ip_finish_output2 should call new func dst_neigh_confirm_sk
> (which has nothing related to dst, hm... better name is needed):
>
> 	if (!IS_ERR(neigh)) {
> -		int res = dst_neigh_output(dst, neigh, skb);
> +		int res;
> +
> +		dst_neigh_confirm_sk(skb->sk, neigh);
> +		res = dst_neigh_output(dst, neigh, skb);
>
> which should do something like this:
>
> 	if (sk && sk->sk_dst_pending_confirm) {
> 		unsigned long now = jiffies;
>
> 		sk->sk_dst_pending_confirm = 0;
> 		/* avoid dirtying neighbour */
> 		if (n->confirmed != now)
> 			n->confirmed = now;
> 	}
>
> Additional dst == skb->sk->sk_dst_cache check will
> not propagate the indication on DNAT/tunnel. For me,
> it is better without such check.
>
> So, the idea is to move TCP and other similar
> users to the new dst_confirm_sk() method. If other
> dst_confirm() users are left, they should be checked
> if dsts with rt_gateway = 0 can be wrongly used.
>
> Regards
>
> --
> Julian Anastasov
>
> .
>

Sorry for the late reply. Based on your ideas, I made a patch and have tested it for a while. It works for me.

From dc033fe4bac8931590e15774a298f04ccea8ed4c Mon Sep 17 00:00:00 2001
From: YueHaibing
Date: Wed, 14 Dec 2016 02:43:51 -0500
Subject: [PATCH] arp: do neigh confirm based on sk arg

Signed-off-by: YueHaibing
---
 include/net/sock.h     | 18 ++
 net/ipv4/ip_output.c   |  5 -
 net/ipv4/ping.c        |  2 +-
 net/ipv4/raw.c         |  2 +-
 net/ipv4/tcp_input.c   |  8 ++--
 net/ipv4/tcp_metrics.c |  5 +++--
 net/ipv4/udp.c         |  2 +-
 net/ipv6/raw.c         |  2 +-
 net/ipv6/route.c       |  6 +++---
 net/ipv6/udp.c         |  2 +-
 10 files changed, 35 insertions(+), 17 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 92b2697..ece8dfa 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -434,6 +434,7 @@ struct sock {
 	struct sk_buff *sk_send_head;
 	__s32 sk_peek_off;
 	int sk_write
Re: [PATCH 2/3] selftests: do not require bash to run bpf tests
On 12/14/2016 11:58 AM, Rolf Eike Beer wrote: From b9d6c1b7427d708ef2d4d57aac17b700b3694d71 Mon Sep 17 00:00:00 2001 From: Rolf Eike Beer Date: Wed, 14 Dec 2016 09:58:12 +0100 Subject: [PATCH 2/3] selftests: do not require bash to run bpf tests Nothing in this minimal script seems to require bash. We often run these tests on embedded devices where the only shell available is the busybox ash. Signed-off-by: Rolf Eike Beer Acked-by: Daniel Borkmann
Re: dsa: handling more than 1 cpu port
On 14/12/2016 12:00, Andrew Lunn wrote:
> On Wed, Dec 14, 2016 at 11:35:30AM +0100, John Crispin wrote:
>> On 14/12/2016 11:31, Andrew Lunn wrote:
>>> On Wed, Dec 14, 2016 at 11:01:54AM +0100, John Crispin wrote:
>>>> Hi Andrew,
>>>>
>>>> switches supported by qca8k have 2 gmacs that we can wire an external
>>>> mii interface to. Usually this is used to wire 2 of the SoCs MACs to the
>>>> switch. The switch itself is aware of one of the MACs being the cpu port
>>>> and expects this to be port/mac0. Using the other will break the
>>>> hardware offloading features.
>>>
>>> Just to be sure here. There is no way to use the second port connected
>>> to the CPU as a CPU port?
>>
>> both macs are considered cpu ports and both allow for the tag to be
>> injected. for HW NAT/routing/... offloading to work, the lan ports need
>> to trunk via port0 and not port6 however.
>
> Maybe you can do a restricted version of the generic solution. LAN
> ports are mapped to cpu port0, WAN port to cpu port 6?

Hardcoding it is exactly what I want to avoid, same as using magic name matching. The dts should describe the HW and not the usage dictated by what is printed on the casing of the switch. As you mention below, this is marketing chatter. Imho, ports should be able to have any name, and we should be able to bridge them however we like and attach the bridges to any cpu port that we want.

>>> The Marvell chips do allow this. So I developed a proof of concept
>>> which had a mapping between cpu ports and slave ports. Slave port X
>>> should use cpu port y for its traffic. This never got past proof of
>>> concept.
>>>
>>> If this can be made to work for qca8k, I would prefer having this
>>> general concept to specific hacks for pass through.
>>
>> oh cool, can you send those patches my way please? how do you configure
>> this from userland? does the cpu port get its own swdev which I just add
>> to my lan bridge and then add the 2nd cpu port to the wan bridge?
> > https://github.com/lunn/linux/tree/v4.1-rc4-net-next-multiple-cpu > > You don't configure anything from userland. Which was actually a criticism. It is in device tree. But my solution is generic. Having one WAN port and four bridged LAN ports is a pure marketing requirement. The hardware will happily do two WAN ports and 3 LAN ports, for example. And the switch is happy to take traffic for the WAN port and a LAN port over the same CPU port, and keep the traffic separate. So we can have some form of load balancing. We are not limited to 1->1, 1->4; we can do 1->2, 1->3 to increase the overall performance. And to the user it is all transparent. > > This PoC is for the old DSA binding. The new binding makes it easier to express this. Which is one of the reasons for the new binding. > > Andrew > I'll have a look at the patches. Thanks! John
Re: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function
Hello, On 14.12.2016 04:59, Jason A. Donenfeld wrote: > SipHash is a 64-bit keyed hash function that is actually a > cryptographically secure PRF, like HMAC. Except SipHash is super fast, > and is meant to be used as a hashtable keyed lookup function. Can you show or cite benchmarks in comparison with jhash? Last time I looked, especially for short inputs, siphash didn't beat jhash (also on all the 32 bit devices etc.). > SipHash isn't just some new trendy hash function. It's been around for a > while, and there really isn't anything that comes remotely close to > being useful in the way SipHash is. With that said, why do we need this? > > There are a variety of attacks known as "hashtable poisoning" in which an > attacker forms some data such that the hash of that data will be the > same, and then proceeds to fill up all entries of a hashbucket. This is > a realistic and well-known denial-of-service vector. This pretty much depends on the linearity of the hash function? I don't think a crypto secure hash function is needed for a hash table. Albeit I agree that siphash certainly looks good to be used here. > Linux developers already seem to be aware that this is an issue, and > various places that use hash tables in, say, a network context, use a > non-cryptographically secure function (usually jhash) and then try to > twiddle with the key on a time basis (or in many cases just do nothing > and hope that nobody notices). While this is an admirable attempt at > solving the problem, it doesn't actually fix it. SipHash fixes it. I am pretty sure that SipHash still needs a random key per hash table also. So far it was only the choice of hash function you are questioning. > (It fixes it in such a sound way that you could even build a stream > cipher out of SipHash that would resist the modern cryptanalysis.)
> > There are a modicum of places in the kernel that are vulnerable to > hashtable poisoning attacks, either via userspace vectors or network > vectors, and there's not a reliable mechanism inside the kernel at the > moment to fix it. The first step toward fixing these issues is actually > getting a secure primitive into the kernel for developers to use. Then > we can, bit by bit, port things over to it as deemed appropriate. Hmm, I tried to follow up with all the HashDoS work and so far didn't see any HashDoS attacks against the Jenkins/SpookyHash family. If this is an issue we might need to also put those changes into stable. > Dozens of languages are already using this internally for their hash > tables. Some of the BSDs already use this in their kernels. SipHash is > a widely known high-speed solution to a widely known problem, and it's > time we catch-up. Bye, Hannes
Re: dsa: handling more than 1 cpu port
On Wed, Dec 14, 2016 at 11:35:30AM +0100, John Crispin wrote:
> On 14/12/2016 11:31, Andrew Lunn wrote:
> > On Wed, Dec 14, 2016 at 11:01:54AM +0100, John Crispin wrote:
> >> Hi Andrew,
> >>
> >> switches supported by qca8k have 2 gmacs that we can wire an external
> >> mii interface to. Usually this is used to wire 2 of the SoCs MACs to the
> >> switch. The switch itself is aware of one of the MACs being the cpu port
> >> and expects this to be port/mac0. Using the other will break the
> >> hardware offloading features.
> >
> > Just to be sure here. There is no way to use the second port connected
> > to the CPU as a CPU port?
>
> both macs are considered cpu ports and both allow for the tag to be
> injected. for HW NAT/routing/... offloading to work, the lan ports need
> to trunk via port0 and not port6 however.

Maybe you can do a restricted version of the generic solution. LAN ports are mapped to cpu port0, WAN port to cpu port 6?

> > The Marvell chips do allow this. So I developed a proof of concept
> > which had a mapping between cpu ports and slave ports. Slave port X
> > should use cpu port y for its traffic. This never got past proof of
> > concept.
> >
> > If this can be made to work for qca8k, I would prefer having this
> > general concept to specific hacks for pass through.
>
> oh cool, can you send those patches my way please? how do you configure
> this from userland? does the cpu port get its own swdev which I just add
> to my lan bridge and then add the 2nd cpu port to the wan bridge?

https://github.com/lunn/linux/tree/v4.1-rc4-net-next-multiple-cpu

You don't configure anything from userland, which was actually a criticism: it is in the device tree. But my solution is generic. Having one WAN port and four bridged LAN ports is a pure marketing requirement. The hardware will happily do two WAN ports and 3 LAN ports, for example. And the switch is happy to take traffic for the WAN port and a LAN port over the same CPU port, and keep the traffic separate. So we can have some form of load balancing. We are not limited to 1->1, 1->4; we can do 1->2, 1->3 to increase the overall performance. And to the user it is all transparent.

This PoC is for the old DSA binding. The new binding makes it easier to express this, which is one of the reasons for the new binding.

Andrew
[PATCH 2/3] selftests: do not require bash to run bpf tests
>From b9d6c1b7427d708ef2d4d57aac17b700b3694d71 Mon Sep 17 00:00:00 2001 From: Rolf Eike Beer Date: Wed, 14 Dec 2016 09:58:12 +0100 Subject: [PATCH 2/3] selftests: do not require bash to run bpf tests Nothing in this minimal script seems to require bash. We often run these tests on embedded devices where the only shell available is the busybox ash. Signed-off-by: Rolf Eike Beer --- tools/testing/selftests/bpf/test_kmod.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/bpf/test_kmod.sh b/tools/testing/selftests/bpf/test_kmod.sh index 92e627a..6d58cca 100755 --- a/tools/testing/selftests/bpf/test_kmod.sh +++ b/tools/testing/selftests/bpf/test_kmod.sh @@ -1,4 +1,4 @@ -#!/bin/bash +#!/bin/sh SRC_TREE=../../../../ -- 2.8.3 -- Rolf Eike Beer, emlix GmbH, http://www.emlix.com Fon +49 551 30664-0, Fax +49 551 30664-11 Bertha-von-Suttner-Str. 9, 37085 Göttingen, Germany Sitz der Gesellschaft: Göttingen, Amtsgericht Göttingen HR B 3160 Geschäftsführung: Heike Jordan, Dr. Uwe Kracke – Ust-IdNr.: DE 205 198 055 emlix – smart embedded open source
Re: Re: Synopsys Ethernet QoS
Hi! > There are some performance problems with the stmmac driver though: > > When running iperf3 with 3 streams: > iperf3 -c 192.168.0.90 -P 3 -t 30 > iperf3 -c 192.168.0.90 -P 3 -t 30 -R > > I get really bad fairness between the streams. > > This appears to be an issue with how TX IRQ coalescing is implemented in > stmmac. > Disabling TX IRQ coalescing in the stmmac driver makes the problem go away. Yes, I'm hitting the same problem https://lkml.org/lkml/2016/12/11/90 > Also netperf TCP_RR and UDP_RR gives really bad results compared to the > dwceqos driver (without IRQ coalescing). > 2000 transactions/sec vs 9000 transactions/sec. > Turning TX IRQ coalescing off and RX interrupt watchdog off in stmmac > gives the same performance. I guess it's a trade off, low CPU usage > vs low latency, so I don't know how important TCP_RR/UDP_RR really is. Well... 75% performance hit is a bug, plain and simple, not CPU tradeoff. > The best thing would be to get a good working TX IRQ coalesce > implementation with HR timers in stmmac. Actually it seems that using HR timers is not the only problem, AFAICT the logic is wrong way around. (But yes, it needs to be HR timer, too, and probably packet count needs to be reduced, too.) Best regards, Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html signature.asc Description: Digital signature
Re: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function
Hi David, On Wed, Dec 14, 2016 at 10:56 AM, David Laight wrote: > ... >> +u64 siphash24(const u8 *data, size_t len, const u8 key[SIPHASH24_KEY_LEN]) > ... >> + u64 k0 = get_unaligned_le64(key); >> + u64 k1 = get_unaligned_le64(key + sizeof(u64)); > ... >> + m = get_unaligned_le64(data); > > All these unaligned accesses are going to get expensive on architectures > like sparc64. Yes, the unaligned accesses aren't pretty. Since in pretty much all use cases thus far, the data can easily be made aligned, perhaps it makes sense to create siphash24() and siphash24_unaligned(). Any thoughts on doing something like that? Jason
Re: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform
Hi David, On Wed, Dec 14, 2016 at 10:51 AM, David Laight wrote: > From: Jason A. Donenfeld >> Sent: 14 December 2016 00:17 >> This gives a clear speed and security improvement. Rather than manually >> filling MD5 buffers, we simply create a layout by a simple anonymous >> struct, for which gcc generates rather efficient code. > ... >> + const struct { >> + struct in6_addr saddr; >> + struct in6_addr daddr; >> + __be16 sport; >> + __be16 dport; >> + } __packed combined = { >> + .saddr = *(struct in6_addr *)saddr, >> + .daddr = *(struct in6_addr *)daddr, >> + .sport = sport, >> + .dport = dport >> + }; > > You need to look at the effect of marking this (and the other) > structures 'packed' on architectures like sparc64. In all current uses of __packed in the code, I think the impact is precisely zero, because all structures have members in descending order of size, with each member being a perfect multiple of the one below it. The __packed is therefore just there for safety, in case somebody comes in and screws everything up by sticking a u8 in between. In that case, it wouldn't be desirable to hash the structure padding bits. In the worst case, I don't believe the impact would be worse than a byte-by-byte memcpy, which is what the old code did. But anyway, these structures are already naturally packed anyway, so the present impact is nil. Jason
Re: [PATCH] arp: do neigh confirm based on sk arg
Hi YueHaibing, [auto build test WARNING on v4.9-rc8] [cannot apply to net/master net-next/master sparc-next/master next-20161214] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/YueHaibing/arp-do-neigh-confirm-based-on-sk-arg/20161214-191755 config: openrisc-or1ksim_defconfig (attached as .config) compiler: or32-linux-gcc (GCC) 4.5.1-or32-1.0rc1 reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=openrisc All warnings (new ones prefixed by >>): net/ipv4/tcp_input.c: In function 'tcp_rcv_state_process': >> net/ipv4/tcp_input.c:6003:21: warning: unused variable 'dst' vim +/dst +6003 net/ipv4/tcp_input.c ^1da177e4 Linus Torvalds 2005-04-16 5987*/ 168a8f580 Jerry Chu 2012-08-31 5988 tcp_rearm_rto(sk); 168a8f580 Jerry Chu 2012-08-31 5989 } else ^1da177e4 Linus Torvalds 2005-04-16 5990 tcp_init_metrics(sk); ^1da177e4 Linus Torvalds 2005-04-16 5991 c0402760f Yuchung Cheng 2016-09-19 5992 if (!inet_csk(sk)->icsk_ca_ops->cong_control) 02cf4ebd8 Neal Cardwell 2013-10-21 5993 tcp_update_pacing_rate(sk); 02cf4ebd8 Neal Cardwell 2013-10-21 5994 61eb90035 Joe Perches2013-05-24 5995 /* Prevent spurious tcp_cwnd_restart() on first data packet */ ^1da177e4 Linus Torvalds 2005-04-16 5996 tp->lsndtime = tcp_time_stamp; ^1da177e4 Linus Torvalds 2005-04-16 5997 ^1da177e4 Linus Torvalds 2005-04-16 5998 tcp_initialize_rcv_mss(sk); ^1da177e4 Linus Torvalds 2005-04-16 5999 tcp_fast_path_on(tp); ^1da177e4 Linus Torvalds 2005-04-16 6000 break; ^1da177e4 Linus Torvalds 2005-04-16 6001 c48b22daa Joe Perches2013-05-24 6002 case TCP_FIN_WAIT1: { c48b22daa Joe Perches2013-05-24 @6003 struct dst_entry *dst; c48b22daa Joe Perches2013-05-24 6004 int tmo; c48b22daa Joe Perches2013-05-24 6005 168a8f580 Jerry Chu 2012-08-31 6006 /* If we enter the 
TCP_FIN_WAIT1 state and we are a 168a8f580 Jerry Chu 2012-08-31 6007* Fast Open socket and this is the first acceptable 168a8f580 Jerry Chu 2012-08-31 6008* ACK we have received, this would have acknowledged 168a8f580 Jerry Chu 2012-08-31 6009* our SYNACK so stop the SYNACK timer. 168a8f580 Jerry Chu 2012-08-31 6010*/ 00db41243 Ian Morris 2015-04-03 6011 if (req) { :: The code at line 6003 was first introduced by commit :: c48b22daa6062fff9eded311b4d6974c29b40487 tcp: Remove 2 indentation levels in tcp_rcv_state_process :: TO: Joe Perches :: CC: David S. Miller --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: application/gzip
Re: [PATCH 1/1] Fixed to BUG_ON to WARN_ON def
Thanks Ozgur for your report. On 12/12/2016 8:18 PM, Leon Romanovsky wrote: On Mon, Dec 12, 2016 at 03:04:28PM +0200, Ozgur Karatas wrote: Dear Romanovsky; Please avoid top-posting in your replies. Thanks I'm trying to learn english and I apologize for my mistake words and phrases. So, I think the code when call to "sg_set_buf" and next time set memory and buffer. For example, isn't to call "WARN_ON" function, get a error to implicit declaration, right? Because, you will use to "BUG_ON" get a error implicit declaration of functions. I'm not sure that I followed you. mem->offset is set by sg_set_buf from the buf variable returned by dma_alloc_coherent(). The HW needs a very precise size for this buf, in multiples of pages and aligned to page boundaries. sg_set_buf(mem, buf, PAGE_SIZE << order); WARN_ON(mem->offset); See the patch inline, which removes this BUG_ON in a proper and safe way. From 7babe807affa2b27d51d3610afb75b693929ea1a Mon Sep 17 00:00:00 2001 From: Leon Romanovsky Date: Mon, 12 Dec 2016 20:02:45 +0200 Subject: [PATCH] net/mlx4: Remove BUG_ON from ICM allocation routine This patch removes the BUG_ON() macro from mlx4_alloc_icm_coherent() by checking DMA address alignment in advance and performing proper folding in case of error.
Fixes: 5b0bf5e25efe ("mlx4_core: Support ICM tables in coherent memory") Reported-by: Ozgur Karatas Signed-off-by: Leon Romanovsky --- drivers/net/ethernet/mellanox/mlx4/icm.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c index 2a9dd46..e1f9e7c 100644 --- a/drivers/net/ethernet/mellanox/mlx4/icm.c +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c @@ -118,8 +118,13 @@ static int mlx4_alloc_icm_coherent(struct device *dev, struct scatterlist *mem, if (!buf) return -ENOMEM; + if (offset_in_page(buf)) { + dma_free_coherent(dev, PAGE_SIZE << order, + buf, sg_dma_address(mem)); + return -ENOMEM; + } + sg_set_buf(mem, buf, PAGE_SIZE << order); - BUG_ON(mem->offset); sg_dma_len(mem) = PAGE_SIZE << order; return 0; } -- Thanks Leon for the patch. It is the right way to do so. Reviewed-by: Tariq Toukan We will submit Leon's patch in a new email. Regards, Tariq 2.10.2
Re: Synopsys Ethernet QoS
Hi! > So if there is a long time before handling interrupts, > I guess that it makes sense that one stream could > get an advantage in the net scheduler. > > If I find the time, and if no one beats me to it, I will try to replace > the normal timers with HR timers + a smaller default timeout. > Can you try something like this? Highres timers will be needed, too, but this fixes the logic problem. You'll need to apply it twice as code is copy&pasted. Best regards, Pavel +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c */ priv->tx_count_frames += nfrags + 1; if (likely(priv->tx_coal_frames > priv->tx_count_frames)) { - mod_timer(&priv->txtimer, - STMMAC_COAL_TIMER(priv->tx_coal_timer)); + if (priv->tx_count_frames == nfrags + 1) + mod_timer(&priv->txtimer, + STMMAC_COAL_TIMER(priv->tx_coal_timer)); } else { priv->tx_count_frames = 0; priv->hw->desc->set_tx_ic(desc); -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html signature.asc Description: Digital signature
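Pavel's one-line change arms the coalescing timer only for the first frame of a batch instead of pushing the deadline out on every transmit. The effect can be illustrated with a toy, discrete-time model (all names, the timeout value, and the one-packet-per-tick workload are made up for illustration and deliberately ignore the nfrags/tx_coal_frames details of the real driver):

```c
#include <assert.h>

#define COAL_TIMEOUT 4   /* ticks, stand-in for STMMAC_COAL_TIMER() */

struct sim {
	int pending;     /* frames waiting for a TX-complete interrupt */
	int expires;     /* tick at which the timer fires, -1 if idle */
};

/* Old policy: every transmit pushes the deadline out again. */
static void xmit_rearm_always(struct sim *s, int now)
{
	s->pending++;
	s->expires = now + COAL_TIMEOUT;
}

/* Proposed policy: only the first frame of a batch arms the timer. */
static void xmit_rearm_first(struct sim *s, int now)
{
	s->pending++;
	if (s->pending == 1)
		s->expires = now + COAL_TIMEOUT;
}

/* Tick at which the timer first fires, given one packet transmitted
 * every tick for `ticks` ticks; -1 if it never fires. */
static int first_fire(void (*xmit)(struct sim *, int), int ticks)
{
	struct sim s = { .pending = 0, .expires = -1 };
	for (int now = 0; now < ticks; now++) {
		if (s.expires >= 0 && now >= s.expires)
			return now;            /* timer fired */
		xmit(&s, now);
	}
	return -1;                             /* never fired */
}
```

Under a steady stream, the rearm-on-every-transmit policy never fires (the deadline always stays ahead of the clock), while arming only on the first pending frame bounds the delay to one timeout.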
Re: [PATCH] vsock: lookup and setup guest_cid inside vhost_vsock_lock
On Wed, Dec 14, 2016 at 07:24:36PM +0800, Gao feng wrote: > Multi vsocks may setup the same cid at the same time. > > Signed-off-by: Gao feng > --- > drivers/vhost/vsock.c | 25 + > 1 file changed, 17 insertions(+), 8 deletions(-) Good catch, a classic time-of-check-to-time-of-use race condition. Reviewed-by: Stefan Hajnoczi signature.asc Description: PGP signature
Re: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function
Hi Hannes, On Wed, Dec 14, 2016 at 12:21 PM, Hannes Frederic Sowa wrote: > Can you show or cite benchmarks in comparison with jhash? Last time I > looked, especially for short inputs, siphash didn't beat jhash (also on > all the 32 bit devices etc.). I assume that jhash is likely faster than siphash, but I wouldn't be surprised if with optimization we can make siphash at least pretty close on 64-bit platforms. (I'll do some tests though; maybe I'm wrong and jhash is already slower.) With that said, siphash is here to replace uses of jhash where hashtable poisoning vulnerabilities make it necessary. Where there's no significant security improvement, if there's no speed improvement either, then of course nothing's going to change. I should have mentioned md5_transform in this first message too, as two other patches in this series actually replace md5_transform usage with siphash. I think in this case, siphash is a clear performance winner (and security winner) over md5_transform. So if the push back against replacing jhash usages is just too high, at the very least it remains useful already for the md5_transform usage. > This pretty much depends on the linearity of the hash function? I don't > think a crypto secure hash function is needed for a hash table. Albeit I > agree that siphash certainly looks good to be used here. In order to prevent the aforementioned poisoning attacks, a PRF with perfect linearity is required, which is what's achieved when it's a cryptographically secure one. Check out section 7 of https://131002.net/siphash/siphash.pdf . > I am pretty sure that SipHash still needs a random key per hash table > also. So far it was only the choice of hash function you are questioning. Siphash needs a random secret key, yes. The point is that the hash function remains secure so long as the secret key is kept secret. 
Other functions can't make the same guarantee, and so nervous periodic key rotation is necessary, but in most cases nothing is done, and so things just leak over time. > Hmm, I tried to follow up with all the HashDoS work and so far didn't > see any HashDoS attacks against the Jenkins/SpookyHash family. > > If this is an issue we might need to also put those changes into stable. jhash just isn't secure; it's not a cryptographically secure PRF. If there hasn't already been an academic paper put out there about it this year, let's make this thread 1000 messages long to garner attention, and next year perhaps we'll see one. No doubt that motivated government organizations, defense contractors, criminals, and other netizens have already done research in private. Replacing insecure functions with secure functions is usually a good thing. Jason
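For reference, the keyed PRF under discussion is compact. Below is a user-space sketch of SipHash-2-4 following the structure of the public-domain reference implementation — it is not the kernel API from this patch series, and the function name and byte-wise loads are illustrative:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROTL64(x, b) (uint64_t)(((x) << (b)) | ((x) >> (64 - (b))))

#define SIPROUND do { \
	v0 += v1; v1 = ROTL64(v1, 13); v1 ^= v0; v0 = ROTL64(v0, 32); \
	v2 += v3; v3 = ROTL64(v3, 16); v3 ^= v2; \
	v0 += v3; v3 = ROTL64(v3, 21); v3 ^= v0; \
	v2 += v1; v1 = ROTL64(v1, 17); v1 ^= v2; v2 = ROTL64(v2, 32); \
} while (0)

static uint64_t le64(const uint8_t *p)
{
	uint64_t v = 0;
	for (int i = 0; i < 8; i++)
		v |= (uint64_t)p[i] << (8 * i);
	return v;
}

/* SipHash-2-4: 128-bit secret key (k0, k1), 64-bit keyed output. */
static uint64_t siphash24(const uint8_t *in, size_t len,
			  uint64_t k0, uint64_t k1)
{
	uint64_t v0 = k0 ^ 0x736f6d6570736575ULL;
	uint64_t v1 = k1 ^ 0x646f72616e646f6dULL;
	uint64_t v2 = k0 ^ 0x6c7967656e657261ULL;
	uint64_t v3 = k1 ^ 0x7465646279746573ULL;
	const uint8_t *end = in + (len & ~(size_t)7);

	for (; in != end; in += 8) {
		uint64_t m = le64(in);
		v3 ^= m;
		SIPROUND; SIPROUND;
		v0 ^= m;
	}

	/* Last block: remaining bytes plus the length in the top byte. */
	uint64_t b = (uint64_t)len << 56;
	for (size_t i = 0; i < (len & 7); i++)
		b |= (uint64_t)end[i] << (8 * i);
	v3 ^= b;
	SIPROUND; SIPROUND;
	v0 ^= b;

	v2 ^= 0xff;
	SIPROUND; SIPROUND; SIPROUND; SIPROUND;
	return v0 ^ v1 ^ v2 ^ v3;
}
```

The whole compression function is a handful of add-rotate-xor rounds, which is why it stays competitive with non-cryptographic hashes on 64-bit machines while keeping the keyed-PRF guarantee the thread is about.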
Re: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform
On 14.12.2016 13:53, Jason A. Donenfeld wrote: > Hi David, > > On Wed, Dec 14, 2016 at 10:51 AM, David Laight > wrote: >> From: Jason A. Donenfeld >>> Sent: 14 December 2016 00:17 >>> This gives a clear speed and security improvement. Rather than manually >>> filling MD5 buffers, we simply create a layout by a simple anonymous >>> struct, for which gcc generates rather efficient code. >> ... >>> + const struct { >>> + struct in6_addr saddr; >>> + struct in6_addr daddr; >>> + __be16 sport; >>> + __be16 dport; >>> + } __packed combined = { >>> + .saddr = *(struct in6_addr *)saddr, >>> + .daddr = *(struct in6_addr *)daddr, >>> + .sport = sport, >>> + .dport = dport >>> + }; >> >> You need to look at the effect of marking this (and the other) >> structures 'packed' on architectures like sparc64. > > In all current uses of __packed in the code, I think the impact is > precisely zero, because all structures have members in descending > order of size, with each member being a perfect multiple of the one > below it. The __packed is therefore just there for safety, in case > somebody comes in and screws everything up by sticking a u8 in > between. In that case, it wouldn't be desirable to hash the structure > padding bits. In the worst case, I don't believe the impact would be > worse than a byte-by-byte memcpy, which is what the old code did. But > anyway, these structures are already naturally packed anyway, so the > present impact is nil. __packed not only removes all padding of the struct but also changes the alignment assumptions for the whole struct itself. The rule, the struct is aligned by its maximum alignment of a member is no longer true. That said, the code accessing this struct will change (not on archs that can deal efficiently with unaligned access, but on others). A proper test for not introducing padding is to use something with BUILD_BUG_ON. 
Also, gcc clears the padding of the struct, so padding shouldn't be that bad there (it is better than byte access on MIPS). Btw. I think gcc7 gets support for store-merging optimization. Bye, Hannes
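The BUILD_BUG_ON-style check Hannes suggests can be sketched in user space with C11's `_Static_assert` (the struct and type names below are stand-ins for the ones in the patch). Note that this particular layout has no padding even without `__packed`, which is exactly the claim being made about the real struct — members appear in descending alignment order:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the kernel types used in the patch under discussion. */
struct in6_addr_example { uint8_t s6_addr[16]; };
typedef uint16_t be16_example;

struct combined_example {
	struct in6_addr_example saddr;
	struct in6_addr_example daddr;
	be16_example sport;
	be16_example dport;
};

/*
 * Compile-time padding check in the spirit of the kernel's
 * BUILD_BUG_ON(): the struct must be exactly the sum of its members,
 * so that no uninitialized padding bytes are fed into the hash.
 */
_Static_assert(sizeof(struct combined_example) ==
	       2 * sizeof(struct in6_addr_example) +
	       2 * sizeof(be16_example),
	       "struct combined_example has padding");
```

If anyone later inserts a smaller member in the middle, the assertion fails at compile time instead of silently hashing padding.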
Re: [PATCH 3/3] netns: fix net_generic() "id - 1" bloat
On Tue, Dec 13, 2016 at 5:42 PM, David Laight wrote: > From: Alexey Dobriyan >> Sent: 13 December 2016 14:23 > ... >> Well, the point of the patch is to save .text, so might as well save >> as much as possible. Any form other than "ptr[id]" is going >> to be either bigger or bigger and slower and "ptr" should be the first field. > > You've not read and understood the next bit: > >> > However if you offset the 'id' values so that only >> > values 2 up are valid the code becomes: >> > return net->gen2->ptr[id - 2]; >> > which will be exactly the same code as: >> > return net->gen1->ptr[id]; >> > but it is much more obvious that 'id' values must be >= 2. >> > >> > The '2' should be generated from the structure offset, but with my method >> > it doesn't actually matter if it is wrong. > > If you have foo->bar[id - const] then the compiler has to add the > offset of 'bar' and subtract for 'const'. > If the numbers match no add or subtract is needed. > > It is much cleaner to do this by explicitly removing the offset on the > accesses than using a union. Surprisingly, the trick only works if the array index is cast to "unsigned long" before subtracting. Code becomes ... ptr = ng->ptr[(unsigned long)id - 3]; ... I'll post a patch when net-next reopens.
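The accessor shape Alexey describes can be sketched in user space (all names and the minimum-ID constant below are toy stand-ins, not the real netns code). Functionally it is plain array indexing with an offset; the cast matters only for code generation — with a 32-bit index the subtraction must be computed (and wrap) in 32 bits before extension, while after casting to `unsigned long` a 64-bit compiler can fold the `- MIN_PERNET_OPS_ID_TOY` scaled by the element size straight into the addressing mode:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of struct net_generic: IDs start at the minimum ops ID. */
#define MIN_PERNET_OPS_ID_TOY 3
#define MAX_SLOTS_TOY 16

struct net_generic_toy {
	void *ptr[MAX_SLOTS_TOY - MIN_PERNET_OPS_ID_TOY];
};

/*
 * Accessor in the shape discussed in the thread: the index is widened
 * to unsigned long *before* the constant is subtracted.
 */
static inline void *net_generic_get(const struct net_generic_toy *ng,
				    unsigned int id)
{
	return ng->ptr[(unsigned long)id - MIN_PERNET_OPS_ID_TOY];
}
```

Callers then pass IDs starting at `MIN_PERNET_OPS_ID_TOY`, making the "IDs below the minimum are invalid" rule visible at the access site.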
Re: Synopsys Ethernet QoS
Hi, Às 12:57 PM de 12/14/2016, Pavel Machek escreveu: > Hi! > >> So if there is a long time before handling interrupts, >> I guess that it makes sense that one stream could >> get an advantage in the net scheduler. >> >> If I find the time, and if no one beats me to it, I will try to replace >> the normal timers with HR timers + a smaller default timeout. >> > > Can you try something like this? Highres timers will be needed, too, > but this fixes the logic problem. > > You'll need to apply it twice as code is copy&pasted. > > Best regards, > Pavel > > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c > >*/ > priv->tx_count_frames += nfrags + 1; > if (likely(priv->tx_coal_frames > priv->tx_count_frames)) { > - mod_timer(&priv->txtimer, > - STMMAC_COAL_TIMER(priv->tx_coal_timer)); > + if (priv->tx_count_frames == nfrags + 1) > + mod_timer(&priv->txtimer, > + STMMAC_COAL_TIMER(priv->tx_coal_timer)); > } else { > priv->tx_count_frames = 0; > priv->hw->desc->set_tx_ic(desc); > > I know that this is completely off topic, but I am facing a difficulty with stmmac. I have interrupts, the MAC well configured, and RX packets being received successfully, but TX is not working, resulting in TX errors = total TX packets. I have done a lot of debugging and my conclusion is that, for some reason, when using stmmac, after starting TX DMA the HW state machine enters a dead-end state, resulting in those errors. Has anyone faced this problem? Thanks.
Re: [PATCHv3 perf/core 0/7] Reuse libbpf from samples/bpf
Em Fri, Dec 09, 2016 at 04:30:54PM +0100, Daniel Borkmann escreveu: > Hi Arnaldo, > > On 12/09/2016 04:09 PM, Arnaldo Carvalho de Melo wrote: > > Em Thu, Dec 08, 2016 at 06:46:13PM -0800, Joe Stringer escreveu: > > > (Was "libbpf: Synchronize implementations") > > > > > > Update tools/lib/bpf to provide the remaining bpf wrapper pieces needed > > > by the > > > samples/bpf/ code, then get rid of all of the duplicate BPF libraries in > > > samples/bpf/libbpf.[ch]. > > > > > > --- > > > v3: Add ack for first patch. > > > Split out second patch from v2 into separate changes for remaining > > > diff. > > > Add patches to switch samples/bpf over to using tools/lib/. > > > v2: https://www.mail-archive.com/netdev@vger.kernel.org/msg135088.html > > > Don't shift non-bpf code into libbpf. > > > Drop the patch to synchronize ELF definitions with tc. > > > v1: https://www.mail-archive.com/netdev@vger.kernel.org/msg135088.html > > > First post. > > > > Thanks, applied after addressing the -I$(objtree) issue raised by Wang, > > [ Sorry for late reply. ] > > First of all, glad to see us getting rid of the duplicate lib eventually! :) > > Please note that this might result in hopefully just a minor merge issue > with net-next. Looks like patch 4/7 touches test_maps.c and test_verifier.c, > which moved to a new bpf selftest suite [1] this net-next cycle. Seems it's > just log buffer and some renames there, which can be discarded for both > files sitting in selftests. Yeah, I've got to this point, and the merge has a little bit more than that, including BPF_PROG_ATTACH/BPF_PROG_DETACH, etc, working on it... - Arnaldo
Re: [net-next PATCH v5 1/6] net: virtio dynamically disable/enable LRO
On Thu, Dec 08, 2016 at 04:04:58PM -0800, John Fastabend wrote: > On 16-12-08 01:36 PM, Michael S. Tsirkin wrote: > > On Wed, Dec 07, 2016 at 12:11:11PM -0800, John Fastabend wrote: > >> This adds support for dynamically setting the LRO feature flag. The > >> message to control guest features in the backend uses the > >> CTRL_GUEST_OFFLOADS msg type. > >> > >> Signed-off-by: John Fastabend > >> --- > >> drivers/net/virtio_net.c | 40 +++- > >> 1 file changed, 39 insertions(+), 1 deletion(-) > >> > >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > >> index a21d93a..a5c47b1 100644 > >> --- a/drivers/net/virtio_net.c > >> +++ b/drivers/net/virtio_net.c > >> @@ -1419,6 +1419,36 @@ static void virtnet_init_settings(struct net_device > >> *dev) > >>.set_settings = virtnet_set_settings, > >> }; > >> > >> +static int virtnet_set_features(struct net_device *netdev, > >> + netdev_features_t features) > >> +{ > >> + struct virtnet_info *vi = netdev_priv(netdev); > >> + struct virtio_device *vdev = vi->vdev; > >> + struct scatterlist sg; > >> + u64 offloads = 0; > >> + > >> + if (features & NETIF_F_LRO) > >> + offloads |= (1 << VIRTIO_NET_F_GUEST_TSO4) | > >> + (1 << VIRTIO_NET_F_GUEST_TSO6); > >> + > >> + if (features & NETIF_F_RXCSUM) > >> + offloads |= (1 << VIRTIO_NET_F_GUEST_CSUM); > >> + > >> + if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)) { > >> + sg_init_one(&sg, &offloads, sizeof(uint64_t)); > >> + if (!virtnet_send_command(vi, > >> +VIRTIO_NET_CTRL_GUEST_OFFLOADS, > >> +VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET, > >> +&sg)) { > > > > Hmm I just realised that this will slow down setups that bridge > > virtio net interfaces since bridge calls this if provided. > > See below. > > > Really? What code is trying to turn off GRO via the GUEST_OFFLOADS LRO > command. My qemu/Linux setup has a set of tap/vhost devices attached to > a bridge and all of them have LRO enabled even with this patch series. > > I must missing a setup handler somewhere? 
> > > > >> + dev_warn(&netdev->dev, > >> + "Failed to set guest offloads by virtnet > >> command.\n"); > >> + return -EINVAL; > >> + } > >> + } > > > > Hmm if VIRTIO_NET_F_CTRL_GUEST_OFFLOADS is off, this fails > > silently. It might actually be a good idea to avoid > > breaking setups. > > > >> + > >> + return 0; > >> +} > >> + > >> static const struct net_device_ops virtnet_netdev = { > >>.ndo_open= virtnet_open, > >>.ndo_stop= virtnet_close, > >> @@ -1435,6 +1465,7 @@ static void virtnet_init_settings(struct net_device > >> *dev) > >> #ifdef CONFIG_NET_RX_BUSY_POLL > >>.ndo_busy_poll = virtnet_busy_poll, > >> #endif > >> + .ndo_set_features = virtnet_set_features, > >> }; > >> > >> static void virtnet_config_changed_work(struct work_struct *work) > >> @@ -1815,6 +1846,12 @@ static int virtnet_probe(struct virtio_device *vdev) > >>if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM)) > >>dev->features |= NETIF_F_RXCSUM; > >> > >> + if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) && > >> + virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO6)) { > >> + dev->features |= NETIF_F_LRO; > >> + dev->hw_features |= NETIF_F_LRO; > > > > So the issue is I think that the virtio "LRO" isn't really > > LRO, it's typically just GRO forwarded to guests. > > So these are easily re-split along MTU boundaries, > > which makes it ok to forward these across bridges. > > > > It's not nice that we don't document this in the spec, > > but it's the reality and people rely on this. > > > > For now, how about doing a custom thing and just disable/enable > > it as XDP is attached/detached? > > The annoying part about doing this is ethtool will say that it is fixed > yet it will be changed by seemingly unrelated operation. I'm not sure I > like the idea to start automatically configuring the link via xdp_set. I really don't like the idea of dropping performance by a factor of 3 for people bridging two virtio net interfaces. 
So how about a simple approach for now, just disable XDP if GUEST_TSO is enabled? We can discuss better approaches in next version. > > > >> + } > >> + > >>dev->vlan_features = dev->features; > >> > >>/* MTU range: 68 - 65535 */ > >> @@ -2057,7 +2094,8 @@ static int virtnet_restore(struct virtio_device > >> *vdev) > >>VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, \ > >>VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \ > >>VIRTIO_NET_F_CTRL_MAC_ADDR, \ > >> - VIRTIO_NET_F_MTU > >> + VIRTIO_NET_F_MTU, \ > >> + VIRTIO_NET_F_CTRL_GUEST_OFFLOADS > >> > >> static unsigned int features[] = { > >>
Re: [PATCH] arp: do neigh confirm based on sk arg
Hi YueHaibing, [auto build test WARNING on v4.9-rc8] [cannot apply to net/master net-next/master sparc-next/master next-20161214] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/YueHaibing/arp-do-neigh-confirm-based-on-sk-arg/20161214-191755 reproduce: make htmldocs All warnings (new ones prefixed by >>): >> include/net/sock.h:452: warning: No description found for parameter >> 'sk_dst_pending_confirm' vim +/sk_dst_pending_confirm +452 include/net/sock.h ^1da177e4 Linus Torvalds 2005-04-16 436 int sk_write_pending; 4ea59a6cc YueHaibing 2016-12-14 437 unsigned short sk_dst_pending_confirm; d5f642384 Alexey Dobriyan 2008-11-04 438 #ifdef CONFIG_SECURITY ^1da177e4 Linus Torvalds 2005-04-16 439 void *sk_security; d5f642384 Alexey Dobriyan 2008-11-04 440 #endif 2a56a1fec Tejun Heo 2015-12-07 441 struct sock_cgroup_data sk_cgrp_data; baac50bbc Johannes Weiner 2016-01-14 442 struct mem_cgroup *sk_memcg; ^1da177e4 Linus Torvalds 2005-04-16 443 void (*sk_state_change)(struct sock *sk); 676d23690 David S. 
Miller 2014-04-11 444 void (*sk_data_ready)(struct sock *sk); ^1da177e4 Linus Torvalds 2005-04-16 445 void (*sk_write_space)(struct sock *sk); ^1da177e4 Linus Torvalds 2005-04-16 446 void (*sk_error_report)(struct sock *sk); ^1da177e4 Linus Torvalds 2005-04-16 447 int (*sk_backlog_rcv)(struct sock *sk, ^1da177e4 Linus Torvalds 2005-04-16 448 struct sk_buff *skb); ^1da177e4 Linus Torvalds 2005-04-16 449 void (*sk_destruct)(struct sock *sk); ef456144d Craig Gallek2016-01-04 450 struct sock_reuseport __rcu *sk_reuseport_cb; a4298e452 Eric Dumazet2016-04-01 451 struct rcu_head sk_rcu; ^1da177e4 Linus Torvalds 2005-04-16 @452 }; ^1da177e4 Linus Torvalds 2005-04-16 453 559835ea7 Pravin B Shelar 2013-09-24 454 #define __sk_user_data(sk) ((*((void __rcu **)&(sk)->sk_user_data))) 559835ea7 Pravin B Shelar 2013-09-24 455 559835ea7 Pravin B Shelar 2013-09-24 456 #define rcu_dereference_sk_user_data(sk) rcu_dereference(__sk_user_data((sk))) 559835ea7 Pravin B Shelar 2013-09-24 457 #define rcu_assign_sk_user_data(sk, ptr) rcu_assign_pointer(__sk_user_data((sk)), ptr) 559835ea7 Pravin B Shelar 2013-09-24 458 4a17fd522 Pavel Emelyanov 2012-04-19 459 /* 4a17fd522 Pavel Emelyanov 2012-04-19 460 * SK_CAN_REUSE and SK_NO_REUSE on a socket mean that the socket is OK :: The code at line 452 was first introduced by commit :: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 Linux-2.6.12-rc2 :: TO: Linus Torvalds :: CC: Linus Torvalds --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: application/gzip
Re: [PATCHv2 2/5] sh_eth: enable wake-on-lan for Gen2 devices
Hello! You forgot "R-Car" before "Gen2" in the subject. On 12/12/2016 07:09 PM, Niklas Söderlund wrote: Tested on Gen2 r8a7791/Koelsch. Signed-off-by: Niklas Söderlund --- drivers/net/ethernet/renesas/sh_eth.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c index 87640b9..348ed22 100644 --- a/drivers/net/ethernet/renesas/sh_eth.c +++ b/drivers/net/ethernet/renesas/sh_eth.c @@ -624,8 +624,9 @@ static struct sh_eth_cpu_data r8a779x_data = { .register_type = SH_ETH_REG_FAST_RCAR, - .ecsr_value = ECSR_PSRTO | ECSR_LCHNG | ECSR_ICD, - .ecsipr_value = ECSIPR_PSRTOIP | ECSIPR_LCHNGIP | ECSIPR_ICDIP, + .ecsr_value = ECSR_PSRTO | ECSR_LCHNG | ECSR_ICD | ECSR_MPD, + .ecsipr_value = ECSIPR_PSRTOIP | ECSIPR_LCHNGIP | ECSIPR_ICDIP | + ECSIPR_MPDIP, These expressions seem to have been sorted by the bit # before your patch, now they aren't... care to fix? :-) [...] MBR, Sergei
Re: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform
Hi Hannes, Thanks for the feedback. > __packed not only removes all padding of the struct but also changes the > alignment assumptions for the whole struct itself. The rule, the struct > is aligned by its maximum alignment of a member is no longer true. That > said, the code accessing this struct will change (not on archs that can > deal efficiently with unaligned access, but on others). That's interesting. There currently aren't any alignment requirements in siphash because we use the unaligned helper functions, but as David pointed out in another thread, maybe that too should change. In that case, we'd have an aligned-only version of the function that requires 8-byte aligned input. Perhaps the best way to go about that would be to just mark the struct as __packed __aligned(8). Or, I guess, since 64-bit accesses get split into two on 32-bit, that'd be best described as __packed __aligned(sizeof(long)). Would that be an acceptable solution? Jason
[PATCH] vsock: lookup and setup guest_cid inside vhost_vsock_lock
Multi vsocks may setup the same cid at the same time. Signed-off-by: Gao feng --- drivers/vhost/vsock.c | 25 + 1 file changed, 17 insertions(+), 8 deletions(-) diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c index e3b30ea..a08332b 100644 --- a/drivers/vhost/vsock.c +++ b/drivers/vhost/vsock.c @@ -50,11 +50,10 @@ static u32 vhost_transport_get_local_cid(void) return VHOST_VSOCK_DEFAULT_HOST_CID; } -static struct vhost_vsock *vhost_vsock_get(u32 guest_cid) +static struct vhost_vsock *__vhost_vsock_get(u32 guest_cid) { struct vhost_vsock *vsock; - spin_lock_bh(&vhost_vsock_lock); list_for_each_entry(vsock, &vhost_vsock_list, list) { u32 other_cid = vsock->guest_cid; @@ -63,15 +62,24 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid) continue; if (other_cid == guest_cid) { - spin_unlock_bh(&vhost_vsock_lock); return vsock; } } - spin_unlock_bh(&vhost_vsock_lock); return NULL; } +static struct vhost_vsock *vhost_vsock_get(u32 guest_cid) +{ + struct vhost_vsock *vsock; + + spin_lock_bh(&vhost_vsock_lock); + vsock = __vhost_vsock_get(guest_cid); + spin_unlock_bh(&vhost_vsock_lock); + + return vsock; +} + static void vhost_transport_do_send_pkt(struct vhost_vsock *vsock, struct vhost_virtqueue *vq) @@ -562,11 +570,12 @@ static int vhost_vsock_set_cid(struct vhost_vsock *vsock, u64 guest_cid) return -EINVAL; /* Refuse if CID is already in use */ - other = vhost_vsock_get(guest_cid); - if (other && other != vsock) - return -EADDRINUSE; - spin_lock_bh(&vhost_vsock_lock); + other = __vhost_vsock_get(guest_cid); + if (other && other != vsock) { + spin_unlock_bh(&vhost_vsock_lock); + return -EADDRINUSE; + } vsock->guest_cid = guest_cid; spin_unlock_bh(&vhost_vsock_lock); -- 2.5.5
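The structure the patch adopts — performing the duplicate lookup and the assignment inside one critical section, with a lockless `__`-prefixed helper for callers that already hold the lock — is a generic fix for time-of-check-to-time-of-use races. A minimal user-space sketch of the same pattern (all names and the CID table are illustrative, using a pthread mutex in place of the vhost spinlock):

```c
#include <assert.h>
#include <pthread.h>

#define MAX_ENTRIES 8
#define EADDRINUSE_TOY 98

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned int table[MAX_ENTRIES];
static int table_used;

/* Like __vhost_vsock_get(): caller must hold table_lock. */
static int __cid_in_use(unsigned int cid)
{
	for (int i = 0; i < table_used; i++)
		if (table[i] == cid)
			return 1;
	return 0;
}

/*
 * Like the patched vhost_vsock_set_cid(): check *and* insert under the
 * same lock, so two concurrent callers cannot both pass the check and
 * then both register the same CID.
 */
static int register_cid(unsigned int cid)
{
	pthread_mutex_lock(&table_lock);
	if (__cid_in_use(cid)) {
		pthread_mutex_unlock(&table_lock);
		return -EADDRINUSE_TOY;
	}
	table[table_used++] = cid;
	pthread_mutex_unlock(&table_lock);
	return 0;
}
```

The buggy shape is a lookup helper that takes and drops the lock itself, followed by an insert that re-takes it: both racers can observe "not in use" in the window between the two critical sections.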
Re: [PATCH net-next] net: remove abuse of VLAN DEI/CFI bit
On Tue, Dec 13, 2016 at 05:21:18PM -0800, Stephen Hemminger wrote: > On Sat, 3 Dec 2016 10:22:28 +0100 (CET) > Michał Mirosław wrote: > > > This All-in-one patch removes abuse of VLAN CFI bit, so it can be passed > > intact through linux networking stack. > > > > Signed-off-by: Michał Mirosław > > --- > > > > Dear NetDevs > > > > I guess this needs to be split to the prep..convert[]..finish sequence, > > but if you like it as is, then it's ready. > > > > The biggest question is if the modified interface and vlan_present > > is the way to go. This can be changed to use vlan_proto != 0 instead > > of an extra flag bit. > > > > As I can't test most of the driver changes, please look at them carefully. > > OVS and bridge eyes are especially welcome. > > > > Best Regards, > > Michał Mirosław > > Is the motivation to support 802.1ad Drop Eligibility Indicator (DEI)? > > If so then you need to be more verbose in the commit log, and lots more > work is needed. You need to rename fields and validate every place a > driver is using DEI bit to make sure it really does the right thing > on that hardware. It is not just a mechanical change. There are not many mentions of CFI bit in the Linux tree. Places that used it as VLAN_TAG_PRESENT are fixed with this patchset. Other uses are: - VLAN code: ignored - ebt_vlan: ignored - OVS: cleared because of netlink API assumptions - DSA: transferred to/from (E)DSA tag - drivers: gianfar: uses properly in filtering rules - drivers: cnic: false-positive (uses only VLAN ID, CFI bit marks the field 'valid') - drivers: qedr: false-positive (like cnic) So unless there is something hidden in the hardware, no driver does anything special with the CFI bit. After this patchset only OVS will need further modifications to be able to support handling of DEI bit. Best Regards, Michał Mirosław
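The bit being fought over here is bit 12 of the 16-bit 802.1Q Tag Control Information word, between the 3-bit priority (PCP) and the 12-bit VLAN ID. A minimal sketch of the field split — the `_TOY`-suffixed constants are illustrative stand-ins for the kernel's VLAN_PRIO_SHIFT / CFI / VID masks:

```c
#include <assert.h>
#include <stdint.h>

/* 802.1Q TCI layout: PCP (3 bits) | DEI/CFI (1 bit) | VID (12 bits). */
#define VLAN_PRIO_SHIFT_TOY 13
#define VLAN_DEI_MASK_TOY   0x1000
#define VLAN_VID_MASK_TOY   0x0fff

static inline unsigned int tci_prio(uint16_t tci)
{
	return tci >> VLAN_PRIO_SHIFT_TOY;       /* priority code point */
}

static inline unsigned int tci_dei(uint16_t tci)
{
	return !!(tci & VLAN_DEI_MASK_TOY);      /* drop-eligible bit */
}

static inline unsigned int tci_vid(uint16_t tci)
{
	return tci & VLAN_VID_MASK_TOY;          /* VLAN identifier */
}
```

Because DEI/CFI is a real on-wire bit, reusing it in-stack as a "tag present" flag (VLAN_TAG_PRESENT) is what prevents it from being passed through intact — which is the abuse the patchset removes.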
Re: [PATCH 2/3] selftests: do not require bash to run bpf tests
On 12/14/2016 04:03 AM, Daniel Borkmann wrote: > On 12/14/2016 11:58 AM, Rolf Eike Beer wrote: >> From b9d6c1b7427d708ef2d4d57aac17b700b3694d71 Mon Sep 17 00:00:00 2001 >> From: Rolf Eike Beer >> Date: Wed, 14 Dec 2016 09:58:12 +0100 >> Subject: [PATCH 2/3] selftests: do not require bash to run bpf tests >> >> Nothing in this minimal script seems to require bash. We often run these >> tests >> on embedded devices where the only shell available is the busybox ash. >> >> Signed-off-by: Rolf Eike Beer > > Acked-by: Daniel Borkmann Thanks. I will get these into 4.10-rc1 or rc2 -- Shuah
Re: [v1] net:ethernet:cavium:octeon:octeon_mgmt: Handle return NULL error from devm_ioremap
Hi Arvind, [auto build test ERROR on net-next/master] [also build test ERROR on v4.9 next-20161214] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Arvind-Yadav/net-ethernet-cavium-octeon-octeon_mgmt-Handle-return-NULL-error-from-devm_ioremap/20161213-224624 config: mips-cavium_octeon_defconfig (attached as .config) compiler: mips64-linux-gnuabi64-gcc (Debian 6.1.1-9) 6.1.1 20160705 reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=mips All errors (new ones prefixed by >>): drivers/net/ethernet/cavium/octeon/octeon_mgmt.c: In function 'octeon_mgmt_probe': >> drivers/net/ethernet/cavium/octeon/octeon_mgmt.c:1473:11: error: 'dev' >> undeclared (first use in this function) dev_err(dev, "failed to map I/O memory\n"); ^~~ drivers/net/ethernet/cavium/octeon/octeon_mgmt.c:1473:11: note: each undeclared identifier is reported only once for each function it appears in vim +/dev +1473 drivers/net/ethernet/cavium/octeon/octeon_mgmt.c 1467 1468 p->mix = (u64)devm_ioremap(&pdev->dev, p->mix_phys, p->mix_size); 1469 p->agl = (u64)devm_ioremap(&pdev->dev, p->agl_phys, p->agl_size); 1470 p->agl_prt_ctl = (u64)devm_ioremap(&pdev->dev, p->agl_prt_ctl_phys, 1471 p->agl_prt_ctl_size); 1472 if (!p->mix || !p->agl || !p->agl_prt_ctl) { > 1473 dev_err(dev, "failed to map I/O memory\n"); 1474 result = -ENOMEM; 1475 goto err; 1476 } --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: application/gzip
RE: [PATCH 3/3] netns: fix net_generic() "id - 1" bloat
From: Alexey Dobriyan > Sent: 14 December 2016 13:20 ... > > If you have foo->bar[id - const] then the compiler has to add the > > offset of 'bar' and subtract for 'const'. > > If the numbers match no add or subtract is needed. > > > > It is much cleaner to do this by explicitly removing the offset on the > > accesses than using a union. > > Surprisingly, the trick only works if array index is cast to "unsigned long" > before subtracting. > > Code becomes > > ... > ptr = ng->ptr[(unsigned long)id - 3]; > ... The compiler may also be able to optimise it away if 'id' is 'int' rather than 'unsigned int'. Oh, if you need casts like that use an accessor function. David
RE: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform
From: Jason A. Donenfeld > Sent: 14 December 2016 13:44 > To: Hannes Frederic Sowa > > __packed not only removes all padding of the struct but also changes the > > alignment assumptions for the whole struct itself. The rule, the struct > > is aligned by its maximum alignment of a member is no longer true. That > > said, the code accessing this struct will change (not on archs that can > > deal efficiently with unaligned access, but on others). > > That's interesting. There currently aren't any alignment requirements > in siphash because we use the unaligned helper functions, but as David > pointed out in another thread, maybe that too should change. In that > case, we'd have an aligned-only version of the function that requires > 8-byte aligned input. Perhaps the best way to go about that would be > to just mark the struct as __packed __aligned(8). Or, I guess, since > 64-bit accesses gets split into two on 32-bit, that'd be best descried > as __packed __aligned(sizeof(long)). Would that be an acceptable > solution? Just remove the __packed and ensure that the structure is 'nice'. This includes ensuring there is no 'tail padding'. In some cases you'll need to put the port number into a 32bit field. I'd also require that the key be aligned. It probably ought to be a named structure type with two 64bit members (or with an array member that has two elements). David
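David's suggestion — a named key type with two 64-bit members rather than a bare byte buffer — can be sketched as below. The type name is illustrative (the kernel's eventual `siphash_key_t` took essentially this shape), and the compile-time check enforces the "no tail padding" requirement he mentions:

```c
#include <assert.h>
#include <stdint.h>

/* Named 128-bit key type: an array member with two 64-bit elements. */
typedef struct {
	uint64_t key[2];
} siphash_key_toy_t;

/*
 * No internal or tail padding: the struct is exactly its two members.
 * Natural alignment then follows the platform's uint64_t alignment;
 * where 8-byte alignment must be guaranteed on 32-bit ABIs, an
 * explicit aligned attribute can be added.
 */
_Static_assert(sizeof(siphash_key_toy_t) == 2 * sizeof(uint64_t),
	       "unexpected padding in key struct");
```

A typed key also keeps callers from accidentally passing a short or misaligned buffer, since the compiler checks the parameter type instead of a `const u8 *`.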
Re: [PATCHv3 perf/core 0/7] Reuse libbpf from samples/bpf
Em Wed, Dec 14, 2016 at 10:25:01AM -0300, Arnaldo Carvalho de Melo escreveu: > Em Fri, Dec 09, 2016 at 04:30:54PM +0100, Daniel Borkmann escreveu: > > On 12/09/2016 04:09 PM, Arnaldo Carvalho de Melo wrote: > > > > v3: Add ack for first patch. > > > > Split out second patch from v2 into separate changes for remaining > > > > diff. > > > > Add patches to switch samples/bpf over to using tools/lib/. > > > > v2: https://www.mail-archive.com/netdev@vger.kernel.org/msg135088.html > > > > Don't shift non-bpf code into libbpf. > > > > Drop the patch to synchronize ELF definitions with tc. > > > > v1: https://www.mail-archive.com/netdev@vger.kernel.org/msg135088.html > > > > First post. > > > Thanks, applied after addressing the -I$(objtree) issue raised by Wang, > > [ Sorry for late reply. ] > > First of all, glad to see us getting rid of the duplicate lib eventually! :) > > > > Please note that this might result in hopefully just a minor merge issue > > with net-next. Looks like patch 4/7 touches test_maps.c and test_verifier.c, > > which moved to a new bpf selftest suite [1] this net-next cycle. Seems it's > > just log buffer and some renames there, which can be discarded for both > > files sitting in selftests. > > Yeah, I've got to this point, and the merge has a little bit more than > that, including BPF_PROG_ATTACH/BPF_PROG_DETACH, etc, working on it... So, Joe, can you try refreshing this work, starting from what I have in perf/core? It has the changes coming from net-next that Daniel warned us about and some more. 
[acme@jouet linux]$ git log --oneline -5 1f125a4aa4d8 tools lib bpf: Add flags to bpf_create_map() 5adf5614f72d tools lib bpf: use __u32 from linux/types.h ff687c38d803 tools lib bpf: Sync {tools,}/include/uapi/linux/bpf.h 53452c69b4c3 perf annotate: Fix jump target outside of function address range 2f41ae602b57 perf annotate: Support jump instruction with target as second operand [acme@jouet linux]$ I tried refreshing it, but it seems samples/bpf/ needs some love and care first, as I can't get it to build before these patches, to make sure nothing gets broken. Trying to bisect it I get to what seems multiple bisect breakages, last tag I got it to build, with lots of warnings, was v4.8, after that I get things like the ones below. I could try fixing it, but may be missing something, and want to push the other stuff in this branch... [acme@jouet linux]$ egrep SAMPLES\|BPF .config CONFIG_BPF=y CONFIG_BPF_SYSCALL=y CONFIG_NETFILTER_XT_MATCH_BPF=m CONFIG_NET_CLS_BPF=m CONFIG_NET_ACT_BPF=m CONFIG_BPF_JIT=y CONFIG_HAVE_EBPF_JIT=y CONFIG_BPF_EVENTS=y # CONFIG_TEST_BPF is not set CONFIG_SAMPLES=y [acme@jouet linux]$ [acme@jouet linux]$ make -C samples/bpf make: Entering directory '/home/acme/git/linux/samples/bpf' make -C ../../ $PWD/ make[1]: Entering directory '/home/acme/git/linux' CHK include/config/kernel.release CHK include/generated/uapi/linux/version.h CHK include/generated/utsrelease.h CHK include/generated/timeconst.h CHK include/generated/bounds.h CHK include/generated/asm-offsets.h CALLscripts/checksyscalls.sh HOSTCC /home/acme/git/linux/samples/bpf/bpf_load.o In file included from /home/acme/git/linux/samples/bpf/bpf_load.c:21:0: /home/acme/git/linux/samples/bpf/bpf_helpers.h:76:11: error: ‘BPF_FUNC_skb_in_cgroup’ undeclared here (not in a function) (void *) BPF_FUNC_skb_in_cgroup; ^~ scripts/Makefile.host:124: recipe for target '/home/acme/git/linux/samples/bpf/bpf_load.o' failed make[2]: *** [/home/acme/git/linux/samples/bpf/bpf_load.o] Error 1 
Makefile:1646: recipe for target '/home/acme/git/linux/samples/bpf/' failed [acme@jouet linux]$ make -C samples/bpf make: Entering directory '/home/acme/git/linux/samples/bpf' make -C ../../ $PWD/ make[1]: Entering directory '/home/acme/git/linux' scripts/kconfig/conf --silentoldconfig Kconfig # # configuration written to .config # SYSTBL arch/x86/entry/syscalls/../../include/generated/asm/syscalls_32.h SYSHDR arch/x86/entry/syscalls/../../include/generated/asm/unistd_32_ia32.h SYSHDR arch/x86/entry/syscalls/../../include/generated/uapi/asm/unistd_32.h CHK include/config/kernel.release UPD include/config/kernel.release CHK include/generated/uapi/linux/version.h UPD include/generated/uapi/linux/version.h CHK include/generated/utsrelease.h UPD include/generated/utsrelease.h CHK include/generated/timeconst.h CC kernel/bounds.s CHK include/generated/bounds.h GEN scripts/gdb/linux/constants.py CC arch/x86/kernel/asm-offsets.s CHK include/generated/asm-offsets.h CALLscripts/checksyscalls.sh HOSTCC /home/acme/git/linux/samples/bpf/bpf_load.o In file included from /home/acme/git/linux/samples/bpf/bpf_load.c:21:0: /home/acme/git/linux/samples/bpf/bpf_helpers.h:49:11: error: ‘BPF_FUNC_current_task_under_cgroup’ undeclared here (not in a function)
[PATCH net-next 1/1] driver: ipvlan: Define common functions to reduce duplicated code for adding and deleting IP addresses
From: Gao Feng There are some duplicated codes in ipvlan_add_addr6/4 and ipvlan_del_addr6/4. Now define two common functions ipvlan_add_addr and ipvlan_del_addr to decrease the duplicated codes. It could be helful to maintain the codes. Signed-off-by: Gao Feng --- drivers/net/ipvlan/ipvlan_main.c | 68 +--- 1 file changed, 29 insertions(+), 39 deletions(-) diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c index 693ec5b..5874d30 100644 --- a/drivers/net/ipvlan/ipvlan_main.c +++ b/drivers/net/ipvlan/ipvlan_main.c @@ -669,23 +669,22 @@ static int ipvlan_device_event(struct notifier_block *unused, return NOTIFY_DONE; } -static int ipvlan_add_addr6(struct ipvl_dev *ipvlan, struct in6_addr *ip6_addr) +static int ipvlan_add_addr(struct ipvl_dev *ipvlan, void *iaddr, bool is_v6) { struct ipvl_addr *addr; - if (ipvlan_addr_busy(ipvlan->port, ip6_addr, true)) { - netif_err(ipvlan, ifup, ipvlan->dev, - "Failed to add IPv6=%pI6c addr for %s intf\n", - ip6_addr, ipvlan->dev->name); - return -EINVAL; - } addr = kzalloc(sizeof(struct ipvl_addr), GFP_ATOMIC); if (!addr) return -ENOMEM; addr->master = ipvlan; - memcpy(&addr->ip6addr, ip6_addr, sizeof(struct in6_addr)); - addr->atype = IPVL_IPV6; + if (is_v6) { + memcpy(&addr->ip6addr, iaddr, sizeof(struct in6_addr)); + addr->atype = IPVL_IPV6; + } else { + memcpy(&addr->ip4addr, iaddr, sizeof(struct in_addr)); + addr->atype = IPVL_IPV4; + } list_add_tail(&addr->anode, &ipvlan->addrs); /* If the interface is not up, the address will be added to the hash @@ -697,11 +696,11 @@ static int ipvlan_add_addr6(struct ipvl_dev *ipvlan, struct in6_addr *ip6_addr) return 0; } -static void ipvlan_del_addr6(struct ipvl_dev *ipvlan, struct in6_addr *ip6_addr) +static void ipvlan_del_addr(struct ipvl_dev *ipvlan, void *iaddr, bool is_v6) { struct ipvl_addr *addr; - addr = ipvlan_find_addr(ipvlan, ip6_addr, true); + addr = ipvlan_find_addr(ipvlan, iaddr, is_v6); if (!addr) return; @@ -712,6 +711,23 @@ static void 
ipvlan_del_addr6(struct ipvl_dev *ipvlan, struct in6_addr *ip6_addr) return; } +static int ipvlan_add_addr6(struct ipvl_dev *ipvlan, struct in6_addr *ip6_addr) +{ + if (ipvlan_addr_busy(ipvlan->port, ip6_addr, true)) { + netif_err(ipvlan, ifup, ipvlan->dev, + "Failed to add IPv6=%pI6c addr for %s intf\n", + ip6_addr, ipvlan->dev->name); + return -EINVAL; + } + + return ipvlan_add_addr(ipvlan, ip6_addr, true); +} + +static void ipvlan_del_addr6(struct ipvl_dev *ipvlan, struct in6_addr *ip6_addr) +{ + return ipvlan_del_addr(ipvlan, ip6_addr, true); +} + static int ipvlan_addr6_event(struct notifier_block *unused, unsigned long event, void *ptr) { @@ -745,45 +761,19 @@ static int ipvlan_addr6_event(struct notifier_block *unused, static int ipvlan_add_addr4(struct ipvl_dev *ipvlan, struct in_addr *ip4_addr) { - struct ipvl_addr *addr; - if (ipvlan_addr_busy(ipvlan->port, ip4_addr, false)) { netif_err(ipvlan, ifup, ipvlan->dev, "Failed to add IPv4=%pI4 on %s intf.\n", ip4_addr, ipvlan->dev->name); return -EINVAL; } - addr = kzalloc(sizeof(struct ipvl_addr), GFP_KERNEL); - if (!addr) - return -ENOMEM; - - addr->master = ipvlan; - memcpy(&addr->ip4addr, ip4_addr, sizeof(struct in_addr)); - addr->atype = IPVL_IPV4; - list_add_tail(&addr->anode, &ipvlan->addrs); - - /* If the interface is not up, the address will be added to the hash -* list by ipvlan_open. -*/ - if (netif_running(ipvlan->dev)) - ipvlan_ht_addr_add(ipvlan, addr); - return 0; + return ipvlan_add_addr(ipvlan, ip4_addr, false); } static void ipvlan_del_addr4(struct ipvl_dev *ipvlan, struct in_addr *ip4_addr) { - struct ipvl_addr *addr; - - addr = ipvlan_find_addr(ipvlan, ip4_addr, false); - if (!addr) - return; - - ipvlan_ht_addr_del(addr); - list_del(&addr->anode); - kfree_rcu(addr, rcu); - - return; + return ipvlan_del_addr(ipvlan, ip4_addr, false); } static int ipvlan_addr4_event(struct notifier_block *unused, -- 1.9.1
Re: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function
Hello, On 14.12.2016 14:10, Jason A. Donenfeld wrote: > On Wed, Dec 14, 2016 at 12:21 PM, Hannes Frederic Sowa > wrote: >> Can you show or cite benchmarks in comparison with jhash? Last time I >> looked, especially for short inputs, siphash didn't beat jhash (also on >> all the 32 bit devices etc.). > > I assume that jhash is likely faster than siphash, but I wouldn't be > surprised if with optimization we can make siphash at least pretty > close on 64-bit platforms. (I'll do some tests though; maybe I'm wrong > and jhash is already slower.) Yes, numbers would be very usable here. I am mostly concerned about small plastic router cases. E.g. assume you double packet processing time with a change of the hashing function at what point is the actual packet processing more of an attack vector than the hashtable? > With that said, siphash is here to replace uses of jhash where > hashtable poisoning vulnerabilities make it necessary. Where there's > no significant security improvement, if there's no speed improvement > either, then of course nothing's going to change. It still changes currently well working source. ;-) > I should have mentioned md5_transform in this first message too, as > two other patches in this series actually replace md5_transform usage > with siphash. I think in this case, siphash is a clear performance > winner (and security winner) over md5_transform. So if the push back > against replacing jhash usages is just too high, at the very least it > remains useful already for the md5_transform usage. MD5 is considered broken because its collision resistance is broken? SipHash doesn't even claim to have collision resistance (which we don't need here)? But I agree, certainly it could be a nice speed-up! >> This pretty much depends on the linearity of the hash function? I don't >> think a crypto secure hash function is needed for a hash table. Albeit I >> agree that siphash certainly looks good to be used here. 
> > In order to prevent the aforementioned poisoning attacks, a PRF with > perfect linearity is required, which is what's achieved when it's a > cryptographically secure one. Check out section 7 of > https://131002.net/siphash/siphash.pdf . I think you mean non-linearity. Otherwise I agree that siphash is certainly a better suited hashing algorithm as far as I know. But it would be really interesting to compare some performance numbers. Hard to say anything without them. >> I am pretty sure that SipHash still needs a random key per hash table >> also. So far it was only the choice of hash function you are questioning. > > Siphash needs a random secret key, yes. The point is that the hash > function remains secure so long as the secret key is kept secret. > Other functions can't make the same guarantee, and so nervous periodic > key rotation is necessary, but in most cases nothing is done, and so > things just leak over time. > > >> Hmm, I tried to follow up with all the HashDoS work and so far didn't >> see any HashDoS attacks against the Jenkins/SpookyHash family. >> >> If this is an issue we might need to also put those changes into stable. > > jhash just isn't secure; it's not a cryptographically secure PRF. If > there hasn't already been an academic paper put out there about it > this year, let's make this thread 1000 messages long to garner > attention, and next year perhaps we'll see one. No doubt that > motivated government organizations, defense contractors, criminals, > and other netizens have already done research in private. Replacing > insecure functions with secure functions is usually a good thing. I think this is a weak argument. In general I am in favor to switch to siphash, but it would be nice to see some benchmarks with the specific kernel implementation also on some smaller 32 bit CPUs and especially without using any SIMD instructions (which might have been used in paper comparison). Bye, Hannes
RE: [PATCH v2 net-next 1/2] phy: add phy fixup unregister functions
> I just want to commit the unregister patch and found this patch. Good job!
> But I consider this patch may miss something.
> If one SoC has 2 MAC ports and each port uses a different network driver,
> the 2 drivers may register fixups for the same PHY chip with different
> "run" functions because the PHY chip works in different modes.
> In such a case, this patch doesn't consider the "run" function and may cause
> a problem.
> When removing the driver which registered its fixup last, it will remove
> another driver's fixup.
> Should this condition be considered and fixed?

Good point. The current PHY fixup is a list kept independently of the phydev structure, and fixups run in two places: phy_device_register() and phy_init_hw(). It's not clear that two separate fixup runs are needed; it may be a good idea to pass the PHY fixup when calling phy_attach() or phy_attach_direct() and store it in the phydev structure. The fixup could then be applied in phy_init_hw() per PHY device and removed when the PHY is detached.

Welcome any comments.

- Woojung
netfilter -stable backport request
Hi, I would like to request a -stable backport for the following patchset that as we speak can be found in pablo's nf-next: # git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git [PATCH 1/3] commit 2394ae21e8b652aff0db1c02e946243c1e2f5edb netfilter: x_tables: pass xt_counters struct instead of packet counter [PATCH 2/3] commit 18b61e8161cc308cbfd06d2e2c6c0758dfd925ef netfilter: x_tables: pass xt_counters struct to counter allocator [PATCH 3/3] commit 722d6785e3b29a3b9f95c4d77542a1416094786a netfilter: x_tables: pack percpu counter allocations Please add this to stable branches : v4.4.x, v4.8.x The above patchset is fixing a netfilter regression which introduced a performance slowdown in binary arp/ip/ip6tables starting at commit : #v4.2-rc1 commit 71ae0dff02d756e4d2ca710b79f2ff5390029a5f netfilter: xtables: use percpu rule counters Regards, Eric
[PATCH RFC 1/2] bpf: add a longest prefix match trie map implementation
This trie implements a longest prefix match algorithm that can be used to match IP addresses to a stored set of ranges. Internally, data is stored in an unbalanced trie of nodes that has a maximum height of n, where n is the prefixlen the trie was created with. Tries may be created with prefix lengths that are multiples of 8, in the range from 8 to 2048. The key used for lookup and update operations is a struct bpf_lpm_trie_key, and the value is a uint64_t. The code carries more information about the internal implementation. Signed-off-by: Daniel Mack Reviewed-by: David Herrmann --- include/uapi/linux/bpf.h | 7 + kernel/bpf/Makefile | 2 +- kernel/bpf/lpm_trie.c| 491 +++ 3 files changed, 499 insertions(+), 1 deletion(-) create mode 100644 kernel/bpf/lpm_trie.c diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 0eb0e87..d564277 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -63,6 +63,12 @@ struct bpf_insn { __s32 imm;/* signed immediate constant */ }; +/* Key of an a BPF_MAP_TYPE_LPM_TRIE entry */ +struct bpf_lpm_trie_key { + __u32 prefixlen; /* up to 32 for AF_INET, 128 for AF_INET6 */ + __u8data[0];/* Arbitrary size */ +}; + /* BPF syscall commands, see bpf(2) man-page for details. 
*/ enum bpf_cmd { BPF_MAP_CREATE, @@ -89,6 +95,7 @@ enum bpf_map_type { BPF_MAP_TYPE_CGROUP_ARRAY, BPF_MAP_TYPE_LRU_HASH, BPF_MAP_TYPE_LRU_PERCPU_HASH, + BPF_MAP_TYPE_LPM_TRIE, }; enum bpf_prog_type { diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile index 1276474..e1ce4f4 100644 --- a/kernel/bpf/Makefile +++ b/kernel/bpf/Makefile @@ -1,7 +1,7 @@ obj-y := core.o obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o -obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o +obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o ifeq ($(CONFIG_PERF_EVENTS),y) obj-$(CONFIG_BPF_SYSCALL) += stackmap.o endif diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c new file mode 100644 index 000..cae759d --- /dev/null +++ b/kernel/bpf/lpm_trie.c @@ -0,0 +1,491 @@ +/* + * Longest prefix match list implementation + * + * Copyright (c) 2016 Daniel Mack + * Copyright (c) 2016 David Herrmann + * + * This file is subject to the terms and conditions of version 2 of the GNU + * General Public License. See the file COPYING in the main directory of the + * Linux distribution for more details. + */ + +#include +#include +#include +#include +#include +#include + +/* Intermediate node */ +#define LPM_TREE_NODE_FLAG_IM BIT(0) + +struct lpm_trie_node; + +struct lpm_trie_node { + struct rcu_head rcu; + struct lpm_trie_node*child[2]; + u32 prefixlen; + u32 flags; + u64 value; + u8 data[0]; +}; + +struct lpm_trie { + struct bpf_map map; + struct lpm_trie_node*root; + size_t n_entries; + size_t max_prefixlen; + size_t data_size; + spinlock_t lock; +}; + +/* + * This trie implements a longest prefix match algorithm that can be used to + * match IP addresses to a stored set of ranges. + * + * Data stored in @data of struct bpf_lpm_key and struct lpm_trie_node is + * interpreted as big endian, so data[0] stores the most significant byte. 
+ * + * Match ranges are internally stored in instances of struct lpm_trie_node + * which each contain their prefix length as well as two pointers that may + * lead to more nodes containing more specific matches. Each node also stores + * a value that is defined by and returned to userspace via the update_elem + * and lookup functions. + * + * For instance, let's start with a trie that was created with a prefix length + * of 32, so it can be used for IPv4 addresses, and one single element that + * matches 192.168.0.0/16. The data array would hence contain + * [0xc0, 0xa8, 0x00, 0x00] in big-endian notation. This documentation will + * stick to IP-address notation for readability though. + * + * As the trie is empty initially, the new node (1) will be places as root + * node, denoted as (R) in the example below. As there are no other node, both + * child pointers are %NULL. + * + * ++ + * | (1) (R) | + * | 192.168.0.0/16 | + * |value: 1| + * | [0][1] | + * ++ + * + * Next, let's add a new node (2) matching 192.168.0.0/24. As there is already + * a node with the same data and a smaller prefix (ie, a less specific one), + * node (2) will become a child of (1). In child index depends on the next bit + * that is outside of that (1) matches, and that bit i
[PATCH RFC 0/2] bpf: add longest prefix match map
This patch set adds longest prefix match algorithm that can be used to match IP addresses to a stored set of ranges. It is exposed as a bpf map type. Internally, data is stored in an unbalanced tree of nodes that has a maximum height of n, where n is the prefixlen the trie was created with. Not that this has nothing to do with fib or fib6 and is in no way meant to replace or share code with it. It's rather a much simpler implementation that is specifically written with bpf maps in mind. Patch 1/2 adds the implementation, and 2/2 an extensive test suite. Feedback is much appreciated. Thanks, Daniel Daniel Mack (1): bpf: add a longest prefix match trie map implementation David Herrmann (1): bpf: Add tests for the lpm trie map include/uapi/linux/bpf.h | 7 + kernel/bpf/Makefile| 2 +- kernel/bpf/lpm_trie.c | 491 + tools/testing/selftests/bpf/.gitignore | 1 + tools/testing/selftests/bpf/Makefile | 4 +- tools/testing/selftests/bpf/test_lpm_map.c | 348 6 files changed, 850 insertions(+), 3 deletions(-) create mode 100644 kernel/bpf/lpm_trie.c create mode 100644 tools/testing/selftests/bpf/test_lpm_map.c -- 2.9.3
[PATCH RFC 2/2] bpf: Add tests for the lpm trie map
From: David Herrmann The first part of this program runs randomized tests against the lpm-bpf-map. It implements a "Trivial Longest Prefix Match" (tlpm) based on simple, linear, single linked lists. The implementation should be pretty straightforward. Based on tlpm, this inserts randomized data into bpf-lpm-maps and verifies the trie-based bpf-map implementation behaves the same way as tlpm. The second part uses 'real world' IPv4 and IPv6 addresses and tests the trie with those. Signed-off-by: David Herrmann Signed-off-by: Daniel Mack --- tools/testing/selftests/bpf/.gitignore | 1 + tools/testing/selftests/bpf/Makefile | 4 +- tools/testing/selftests/bpf/test_lpm_map.c | 348 + 3 files changed, 351 insertions(+), 2 deletions(-) create mode 100644 tools/testing/selftests/bpf/test_lpm_map.c diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore index 071431b..d3b1c9b 100644 --- a/tools/testing/selftests/bpf/.gitignore +++ b/tools/testing/selftests/bpf/.gitignore @@ -1,3 +1,4 @@ test_verifier test_maps test_lru_map +test_lpm_map diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index 7a5f245..064a3e5 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -1,8 +1,8 @@ CFLAGS += -Wall -O2 -I../../../../usr/include -test_objs = test_verifier test_maps test_lru_map +test_objs = test_verifier test_maps test_lru_map test_lpm_map -TEST_PROGS := test_verifier test_maps test_lru_map test_kmod.sh +TEST_PROGS := test_verifier test_maps test_lru_map test_lpm_map test_kmod.sh TEST_FILES := $(test_objs) all: $(test_objs) diff --git a/tools/testing/selftests/bpf/test_lpm_map.c b/tools/testing/selftests/bpf/test_lpm_map.c new file mode 100644 index 000..08db750 --- /dev/null +++ b/tools/testing/selftests/bpf/test_lpm_map.c @@ -0,0 +1,348 @@ +/* + * Randomized tests for eBPF longest-prefix-match maps + * + * This program runs randomized tests against the lpm-bpf-map. 
It implements a + * "Trivial Longest Prefix Match" (tlpm) based on simple, linear, singly linked + * lists. The implementation should be pretty straightforward. + * + * Based on tlpm, this inserts randomized data into bpf-lpm-maps and verifies + * the trie-based bpf-map implementation behaves the same way as tlpm. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "bpf_sys.h" +#include "bpf_util.h" + +struct tlpm_node { + struct tlpm_node *next; + size_t n_bits; + uint8_t key[]; +}; + +static struct tlpm_node *tlpm_add(struct tlpm_node *list, + const uint8_t *key, + size_t n_bits) +{ + struct tlpm_node *node; + size_t n; + + /* add new entry with @key/@n_bits to @list and return new head */ + + n = (n_bits + 7) / 8; + node = malloc(sizeof(*node) + n); + assert(node); + + node->next = list; + node->n_bits = n_bits; + memcpy(node->key, key, n); + + return node; +} + +static void tlpm_clear(struct tlpm_node *list) +{ + struct tlpm_node *node; + + /* free all entries in @list */ + + while ((node = list)) { + list = list->next; + free(node); + } +} + +static struct tlpm_node *tlpm_match(struct tlpm_node *list, + const uint8_t *key, + size_t n_bits) +{ + struct tlpm_node *best = NULL; + size_t i; + + /* +* Perform longest prefix-match on @key/@n_bits. That is, iterate all +* entries and match each prefix against @key. Remember the "best" +* entry we find (i.e., the longest prefix that matches) and return it +* to the caller when done. 
+*/ + + for ( ; list; list = list->next) { + for (i = 0; i < n_bits && i < list->n_bits; ++i) { + if ((key[i / 8] & (1 << (7 - i % 8))) != + (list->key[i / 8] & (1 << (7 - i % 8 + break; + } + + if (i >= list->n_bits) { + if (!best || i > best->n_bits) + best = list; + } + } + + return best; +} + +static void test_lpm_basic(void) +{ + struct tlpm_node *list = NULL, *t1, *t2; + + /* very basic, static tests to verify tlpm works as expected */ + + assert(!tlpm_match(list, (uint8_t[]){ 0xff }, 8)); + + t1 = list = tlpm_add(list, (uint8_t[]){ 0xff }, 8); + assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff }, 8)); + assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff, 0xff }, 16)); + assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff, 0x00 }, 16)); + assert(!tlpm_match(list, (uint8_t[]){ 0x7f }, 8)); + assert(!tlpm_mat
Re: dsa: handling more than 1 cpu port
On 14/12/2016 12:00, Andrew Lunn wrote: > On Wed, Dec 14, 2016 at 11:35:30AM +0100, John Crispin wrote: >> >> >> On 14/12/2016 11:31, Andrew Lunn wrote: >>> On Wed, Dec 14, 2016 at 11:01:54AM +0100, John Crispin wrote: Hi Andrew, switches supported by qca8k have 2 gmacs that we can wire an external mii interface to. Usually this is used to wire 2 of the SoCs MACs to the switch. Thw switch itself is aware of one of the MACs being the cpu port and expects this to be port/mac0. Using the other will break the hardware offloading features. >>> >>> Just to be sure here. There is no way to use the second port connected >>> to the CPU as a CPU port? >> >> both macs are considered cpu ports and both allow for the tag to be >> injected. for HW NAT/routing/... offloading to work, the lan ports neet >> to trunk via port0 and not port6 however. > > Maybe you can do a restricted version of the generic solution. LAN > ports are mapped to cpu port0. WAN port to cpu port 6? > >>> The Marvell chips do allow this. So i developed a proof of concept >>> which had a mapping between cpu ports and slave ports. slave port X >>> should you cpu port y for its traffic. This never got past proof of >>> concept. >>> >>> If this can be made to work for qca8k, i would prefer having this >>> general concept, than specific hacks for pass through. >> >> oh cool, can you send those patches my way please ? how do you configure >> this from userland ? does the cpu port get its on swdev which i just add >> to my lan bridge and then add the 2nd cpu port to the wan bridge ? > > https://github.com/lunn/linux/tree/v4.1-rc4-net-next-multiple-cpu > > You don't configure anything from userland. Which was actually a > criticism. It is in device tree. But my solution is generic. Having > one WAN port and four bridges LAN ports is a pure marketing > requirement. The hardware will happily do two WAN ports and 3 LAN > ports, for example. 
And the switch is happy to take traffic for the > WAN port and a LAN port over the same CPU port, and keep the traffic > separate. So we can have some form of load balancing. We are not > limited to 1->1, 1->4, we can do 1->2, 1->3 to increase the overall > performance. And to the user it is all transparent. > > This PoC is for the old DSA binding. The new binding makes it easier > to express this. Which is one of the reasons for the new binding. > > Andrew > Hi Andrew, spent some time looking at this and thinking about possible solutions. my initial idea was to detect which cpu port to based on the cpu port being included inside the bridge. however that wont allow us to control ports using the tag outside of a bridge. i think that your approach is the only sane one. we could add a sysfs interface later, allowing us to change the default cpu port <-> mappings, but the device tree needs to include some sane defaults. i'll use your patches as a base for a series. John
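For illustration, the "sane defaults in device tree" John mentions might look roughly like the following for a qca8k-style switch, with the LAN ports trunking via CPU port 0 and the WAN port via port 6. This is a sketch only: the node layout follows the new DSA binding, but expressing two "cpu"-labelled ports with distinct ethernet phandles this way is an assumption, not an existing binding.

```dts
&switch {
	ports {
		port@0 {
			reg = <0>;
			label = "cpu";
			ethernet = <&gmac0>;	/* LAN traffic trunks via port 0 */
		};
		port@1 { reg = <1>; label = "lan1"; };
		port@2 { reg = <2>; label = "lan2"; };
		port@5 { reg = <5>; label = "wan"; };
		port@6 {
			reg = <6>;
			label = "cpu";
			ethernet = <&gmac1>;	/* WAN traffic via port 6 */
		};
	};
};
```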
ipv6: handle -EFAULT from skb_copy_bits
It seems to be possible to craft a packet for sendmsg that triggers the -EFAULT path in skb_copy_bits resulting in a BUG_ON that looks like:

RIP: 0010:[] [] rawv6_sendmsg+0xc30/0xc40
RSP: 0018:881f6c4a7c18 EFLAGS: 00010282
RAX: fff2 RBX: 881f6c681680 RCX: 0002
RDX: 881f6c4a7cf8 RSI: 0030 RDI: 881fed0f6a00
RBP: 881f6c4a7da8 R08: R09: 0009
R10: 881fed0f6a00 R11: 0009 R12: 0030
R13: 881fed0f6a00 R14: 881fee39ba00 R15: 881fefa93a80
Call Trace:
 [] ? unmap_page_range+0x693/0x830
 [] inet_sendmsg+0x67/0xa0
 [] sock_sendmsg+0x38/0x50
 [] SYSC_sendto+0xef/0x170
 [] SyS_sendto+0xe/0x10
 [] do_syscall_64+0x50/0xa0
 [] entry_SYSCALL64_slow_path+0x25/0x25

Handle this in rawv6_push_pending_frames and jump to the failure path.

Signed-off-by: Dave Jones

diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 291ebc260e70..35aa82faa052 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -591,7 +591,9 @@ static int rawv6_push_pending_frames(struct sock *sk, struct flowi6 *fl6,
 	}
 	offset += skb_transport_offset(skb);
-	BUG_ON(skb_copy_bits(skb, offset, &csum, 2));
+	err = skb_copy_bits(skb, offset, &csum, 2);
+	if (err < 0)
+		goto out;

 	/* in case cksum was not initialized */
 	if (unlikely(csum))
ravb/sh_eth/b44: BUG: sleeping function called from invalid context
Hi,

When CONFIG_DEBUG_ATOMIC_SLEEP=y, running "ethtool -s eth0 speed 100" on Salvator-X gives:

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:97
in_atomic(): 1, irqs_disabled(): 128, pid: 1683, name: ethtool
CPU: 0 PID: 1683 Comm: ethtool Tainted: GW 4.9.0-salvator-x-00426-g326519df42c65007-dirty #976
Hardware name: Renesas Salvator-X board based on r8a7796 (DT)
Call trace:
 [] dump_backtrace+0x0/0x208
 [] show_stack+0x14/0x1c
 [] dump_stack+0x94/0xb4
 [] ___might_sleep+0x108/0x11c
 [] __might_sleep+0x84/0x94
 [] mutex_lock+0x24/0x40
 [] phy_start_aneg+0x20/0x130
 [] phy_ethtool_ksettings_set+0xd0/0xe8
 [] ravb_set_link_ksettings+0x4c/0xa4
 [] ethtool_set_settings+0xec/0xfc
 [] dev_ethtool+0x188/0x17c4
 [] dev_ioctl+0x53c/0x6b8
 [] sock_do_ioctl.constprop.45+0x3c/0x4c
 [] sock_ioctl+0x33c/0x370
 [] vfs_ioctl+0x20/0x38
 [] do_vfs_ioctl+0x844/0x954
 [] SyS_ioctl+0x44/0x68
 [] el0_svc_naked+0x24/0x28

ravb_set_link_ksettings() calls phy_ethtool_ksettings_set() with a spinlock held and interrupts disabled, while phy_start_aneg() tries to obtain a mutex. The same issue is present in sh_eth_set_link_ksettings() (verified) and b44_set_link_ksettings() (code inspection).

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
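Reduced to its essentials, the broken pattern is a sleeping lock taken inside a spinlocked, IRQs-off region. A schematic of the ravb code path (not a literal quote of the driver):

```c
/* drivers/net/ethernet/renesas/ravb_main.c (schematic) */
static int ravb_set_link_ksettings(struct net_device *ndev,
				   const struct ethtool_link_ksettings *cmd)
{
	struct ravb_private *priv = netdev_priv(ndev);
	unsigned long flags;

	spin_lock_irqsave(&priv->lock, flags);	/* atomic context from here */
	...
	phy_ethtool_ksettings_set(ndev->phydev, cmd);
	/* -> phy_start_aneg() -> mutex_lock(&phydev->lock): may sleep,
	 * hence the "sleeping function called from invalid context" splat.
	 * The phylib call has to move outside the spinlocked region. */
	spin_unlock_irqrestore(&priv->lock, flags);
	return 0;
}
```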
[v2] net:ethernet:cavium:octeon:octeon_mgmt: Handle return NULL error from devm_ioremap
If devm_ioremap() fails, it returns NULL, and the kernel can then run into a NULL-pointer dereference. Add an error check to avoid dereferencing a NULL pointer.

Signed-off-by: Arvind Yadav
---
 drivers/net/ethernet/cavium/octeon/octeon_mgmt.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
index 4ab404f..33c2fec 100644
--- a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
+++ b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
@@ -1479,6 +1479,12 @@ static int octeon_mgmt_probe(struct platform_device *pdev)
 	p->agl = (u64)devm_ioremap(&pdev->dev, p->agl_phys, p->agl_size);
 	p->agl_prt_ctl = (u64)devm_ioremap(&pdev->dev, p->agl_prt_ctl_phys,
 					   p->agl_prt_ctl_size);
+	if (!p->mix || !p->agl || !p->agl_prt_ctl) {
+		dev_err(&pdev->dev, "failed to map I/O memory\n");
+		result = -ENOMEM;
+		goto err;
+	}
+
 	spin_lock_init(&p->lock);

 	skb_queue_head_init(&p->tx_list);
--
2.7.4
Re: [v1] net:ethernet:cavium:octeon:octeon_mgmt: Handle return NULL error from devm_ioremap
Sorry for the build failure. I have sent new changes which do not have this failure.

Thanks
-Arvind

On Wednesday 14 December 2016 08:10 PM, kbuild test robot wrote:

Hi Arvind,

[auto build test ERROR on net-next/master]
[also build test ERROR on v4.9 next-20161214]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Arvind-Yadav/net-ethernet-cavium-octeon-octeon_mgmt-Handle-return-NULL-error-from-devm_ioremap/20161213-224624
config: mips-cavium_octeon_defconfig (attached as .config)
compiler: mips64-linux-gnuabi64-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=mips

All errors (new ones prefixed by >>):

   drivers/net/ethernet/cavium/octeon/octeon_mgmt.c: In function 'octeon_mgmt_probe':
   drivers/net/ethernet/cavium/octeon/octeon_mgmt.c:1473:11: error: 'dev' undeclared (first use in this function)
        dev_err(dev, "failed to map I/O memory\n");
                ^~~
   drivers/net/ethernet/cavium/octeon/octeon_mgmt.c:1473:11: note: each undeclared identifier is reported only once for each function it appears in

vim +/dev +1473 drivers/net/ethernet/cavium/octeon/octeon_mgmt.c

  1467
  1468		p->mix = (u64)devm_ioremap(&pdev->dev, p->mix_phys, p->mix_size);
  1469		p->agl = (u64)devm_ioremap(&pdev->dev, p->agl_phys, p->agl_size);
  1470		p->agl_prt_ctl = (u64)devm_ioremap(&pdev->dev, p->agl_prt_ctl_phys,
  1471						   p->agl_prt_ctl_size);
  1472		if (!p->mix || !p->agl || !p->agl_prt_ctl) {
  1473			dev_err(dev, "failed to map I/O memory\n");
  1474			result = -ENOMEM;
  1475			goto err;
  1476		}

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                Intel Corporation
Re: Designing a safe RX-zero-copy Memory Model for Networking
On 16-12-14 01:39 AM, Jesper Dangaard Brouer wrote: > On Tue, 13 Dec 2016 12:08:21 -0800 > John Fastabend wrote: > >> On 16-12-13 11:53 AM, David Miller wrote: >>> From: John Fastabend >>> Date: Tue, 13 Dec 2016 09:43:59 -0800 >>> What does "zero-copy send packet-pages to the application/socket that requested this" mean? At the moment on x86 page-flipping appears to be more expensive than memcpy (I can post some data shortly) and shared memory was proposed and rejected for security reasons when we were working on bifurcated driver. >>> >>> The whole idea is that we map all the active RX ring pages into >>> userspace from the start. >>> >>> And just how Jesper's page pool work will avoid DMA map/unmap, >>> it will also avoid changing the userspace mapping of the pages >>> as well. >>> >>> Thus avoiding the TLB/VM overhead altogether. >>> > > Exactly. It is worth mentioning that pages entering the page pool need > to be cleared (measured cost 143 cycles), in order to not leak any > kernel info. The primary focus of this design is to make sure not to > leak kernel info to userspace, but with an "exclusive" mode also > support isolation between applications. > > >> I get this but it requires applications to be isolated. The pages from >> a queue can not be shared between multiple applications in different >> trust domains. And the application has to be cooperative meaning it >> can't "look" at data that has not been marked by the stack as OK. In >> these schemes we tend to end up with something like virtio/vhost or >> af_packet. > > I expect 3 modes, when enabling RX-zero-copy on a page_pool. The first > two would require CAP_NET_ADMIN privileges. All modes have a trust > domain id, that need to match e.g. when page reach the socket. Even mode 3 should required cap_net_admin we don't want userspace to grab queues off the nic without it IMO. > > Mode-1 "Shared": Application choose lowest isolation level, allowing > multiple application to mmap VMA area. 
My only point here is applications can read each others data and all applications need to cooperate for example one app could try to write continuously to read only pages causing faults and what not. This is all non standard and doesn't play well with cgroups and "normal" applications. It requires a new orchestration model. I'm a bit skeptical of the use case but I know of a handful of reasons to use this model. Maybe take a look at the ivshmem implementation in DPDK. Also this still requires a hardware filter to push "application" traffic onto reserved queues/pages as far as I can tell. > > Mode-2 "Single-user": Application request it want to be the only user > of the RX queue. This blocks other application to mmap VMA area. > Assuming data is read-only sharing with the stack is possibly OK :/. I guess you would need to pools of memory for data and skb so you don't leak skb into user space. The devils in the details here. There are lots of hooks in the kernel that can for example push the packet with a 'redirect' tc action for example. And letting an app "read" data or impact performance of an unrelated application is wrong IMO. Stacked devices also provide another set of details that are a bit difficult to track down see all the hardware offload efforts. I assume all these concerns are shared between mode-1 and mode-2 > Mode-3 "Exclusive": Application request to own RX queue. Packets are > no longer allowed for normal netstack delivery. > I have patches for this mode already but haven't pushed them due to an alternative solution using VFIO. > Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are > still allowed to travel netstack and thus can contain packet data from > other normal applications. This is part of the design, to share the > NIC between netstack and an accelerated userspace application using RX > zero-copy delivery. > I don't think this is acceptable to be honest. 
Letting an application potentially read/impact other arbitrary applications on the system seems like a non-starter even with CAP_NET_ADMIN. At least this was the conclusion from bifurcated driver work some time ago. > >> Any ACLs/filtering/switching/headers need to be done in hardware or >> the application trust boundaries are broken. > > The software solution outlined allow the application to make the choice > of what trust boundary it wants. > > The "exclusive" mode-3 make most sense together with HW filters. > Already today, we support creating a new RX queue based on ethtool > ntuple HW filter and then you simply attach your application that queue > in mode-3, and have full isolation. > Still pretty fuzzy on why mode-1 and mode-2 do not need hw filters? Without hardware filters we have no way of knowing who/what data is put in the page. > >> If the above can not be met then a copy is needed. What I am trying >> to tease out is the above comment along with other statements like >> this "can be
Re: [PATCH 4/3] random: use siphash24 instead of md5 for get_random_int/long
On Wed, Dec 14, 2016 at 04:10:37AM +0100, Jason A. Donenfeld wrote:
> This duplicates the current algorithm for get_random_int/long, but uses
> siphash24 instead. This comes with several benefits. It's certainly
> faster and more cryptographically secure than MD5. This patch also
> hashes the pid, entropy, and timestamp as fixed width fields, in order
> to increase diffusion.
>
> The previous md5 algorithm used a per-cpu md5 state, which caused
> successive calls to the function to chain upon each other. While it's
> not entirely clear that this kind of chaining is absolutely necessary
> when using a secure PRF like siphash24, it can't hurt, and the timing of
> the call chain does add a degree of natural entropy. So, in keeping with
> this design, instead of the massive per-cpu 64-byte md5 state, there is
> instead a per-cpu previously returned value for chaining.
>
> Signed-off-by: Jason A. Donenfeld
> Cc: Jean-Philippe Aumasson

The original reason for get_random_int was because the original urandom
algorithms were too slow. When we moved to chacha20, which is much
faster, I didn't think to revisit get_random_int() and get_random_long().

One somewhat undesirable aspect of the current algorithm is that we
never change random_int_secret. So I've been toying with the following,
which is 4 times faster than md5. (I haven't tried benchmarking against
siphash yet.)

[    3.606139] random benchmark!!
[    3.606276] get_random_int # cycles: 326578
[    3.606317] get_random_int_new # cycles: 95438
[    3.607423] get_random_bytes # cycles: 2653388

					- Ted

P.S. It's interesting to note that siphash24 and chacha20 are both
add-rotate-xor based algorithms.
diff --git a/drivers/char/random.c b/drivers/char/random.c
index d6876d506220..be172ea75799 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1681,6 +1681,38 @@ static int rand_initialize(void)
 }
 early_initcall(rand_initialize);
 
+static unsigned int get_random_int_new(void);
+
+static int rand_benchmark(void)
+{
+	cycles_t start, finish;
+	int i, out;
+
+	pr_crit("random benchmark!!\n");
+	start = get_cycles();
+	for (i = 0; i < 1000; i++) {
+		get_random_int();
+	}
+	finish = get_cycles();
+	pr_err("get_random_int # cycles: %llu\n", finish - start);
+
+	start = get_cycles();
+	for (i = 0; i < 1000; i++) {
+		get_random_int_new();
+	}
+	finish = get_cycles();
+	pr_err("get_random_int_new # cycles: %llu\n", finish - start);
+
+	start = get_cycles();
+	for (i = 0; i < 1000; i++) {
+		get_random_bytes(&out, sizeof(out));
+	}
+	finish = get_cycles();
+	pr_err("get_random_bytes # cycles: %llu\n", finish - start);
+	return 0;
+}
+device_initcall(rand_benchmark);
+
 #ifdef CONFIG_BLOCK
 void rand_initialize_disk(struct gendisk *disk)
 {
@@ -2064,8 +2096,10 @@ unsigned int get_random_int(void)
 	__u32 *hash;
 	unsigned int ret;
 
+#if 0 // force slow path
 	if (arch_get_random_int(&ret))
 		return ret;
+#endif
 
 	hash = get_cpu_var(get_random_int_hash);
@@ -2100,6 +2134,38 @@ unsigned long get_random_long(void)
 }
 EXPORT_SYMBOL(get_random_long);
 
+struct random_buf {
+	__u8 buf[CHACHA20_BLOCK_SIZE];
+	int ptr;
+};
+
+static DEFINE_PER_CPU(struct random_buf, batched_entropy);
+
+static void get_batched_entropy(void *buf, int n)
+{
+	struct random_buf *p;
+
+	p = &get_cpu_var(batched_entropy);
+
+	if ((p->ptr == 0) ||
+	    (p->ptr + n >= CHACHA20_BLOCK_SIZE)) {
+		extract_crng(p->buf);
+		p->ptr = 0;
+	}
+	BUG_ON(n > CHACHA20_BLOCK_SIZE);
+	memcpy(buf, p->buf, n);
+	p->ptr += n;
+	put_cpu_var(batched_entropy);
+}
+
+static unsigned int get_random_int_new(void)
+{
+	int ret;
+
+	get_batched_entropy(&ret, sizeof(ret));
+	return ret;
+}
+
 /**
  * randomize_page - Generate a random, page aligned address
  * @start:	The smallest acceptable address the caller will take.
Re: "virtio-net: enable multiqueue by default" in linux-next breaks networking on GCE
On Wed, Dec 14, 2016 at 12:24:43PM +0800, Wei Xu wrote: > > BTW, although this is a guest issue, is there anyway to view the GCE > host kernel or qemu(if it is) version? No, there isn't, as far as I know. - Ted
Re: Designing a safe RX-zero-copy Memory Model for Networking
On Wed, Dec 14, 2016 at 8:32 AM, John Fastabend wrote: > On 16-12-14 01:39 AM, Jesper Dangaard Brouer wrote: >> On Tue, 13 Dec 2016 12:08:21 -0800 >> John Fastabend wrote: >> >>> On 16-12-13 11:53 AM, David Miller wrote: From: John Fastabend Date: Tue, 13 Dec 2016 09:43:59 -0800 > What does "zero-copy send packet-pages to the application/socket that > requested this" mean? At the moment on x86 page-flipping appears to be > more expensive than memcpy (I can post some data shortly) and shared > memory was proposed and rejected for security reasons when we were > working on bifurcated driver. The whole idea is that we map all the active RX ring pages into userspace from the start. And just how Jesper's page pool work will avoid DMA map/unmap, it will also avoid changing the userspace mapping of the pages as well. Thus avoiding the TLB/VM overhead altogether. >> >> Exactly. It is worth mentioning that pages entering the page pool need >> to be cleared (measured cost 143 cycles), in order to not leak any >> kernel info. The primary focus of this design is to make sure not to >> leak kernel info to userspace, but with an "exclusive" mode also >> support isolation between applications. >> >> >>> I get this but it requires applications to be isolated. The pages from >>> a queue can not be shared between multiple applications in different >>> trust domains. And the application has to be cooperative meaning it >>> can't "look" at data that has not been marked by the stack as OK. In >>> these schemes we tend to end up with something like virtio/vhost or >>> af_packet. >> >> I expect 3 modes, when enabling RX-zero-copy on a page_pool. The first >> two would require CAP_NET_ADMIN privileges. All modes have a trust >> domain id, that need to match e.g. when page reach the socket. > > Even mode 3 should required cap_net_admin we don't want userspace to > grab queues off the nic without it IMO. 
> >> >> Mode-1 "Shared": Application choose lowest isolation level, allowing >> multiple application to mmap VMA area. > > My only point here is applications can read each others data and all > applications need to cooperate for example one app could try to write > continuously to read only pages causing faults and what not. This is > all non standard and doesn't play well with cgroups and "normal" > applications. It requires a new orchestration model. > > I'm a bit skeptical of the use case but I know of a handful of reasons > to use this model. Maybe take a look at the ivshmem implementation in > DPDK. > > Also this still requires a hardware filter to push "application" traffic > onto reserved queues/pages as far as I can tell. > >> >> Mode-2 "Single-user": Application request it want to be the only user >> of the RX queue. This blocks other application to mmap VMA area. >> > > Assuming data is read-only sharing with the stack is possibly OK :/. I > guess you would need to pools of memory for data and skb so you don't > leak skb into user space. > > The devils in the details here. There are lots of hooks in the kernel > that can for example push the packet with a 'redirect' tc action for > example. And letting an app "read" data or impact performance of an > unrelated application is wrong IMO. Stacked devices also provide another > set of details that are a bit difficult to track down see all the > hardware offload efforts. > > I assume all these concerns are shared between mode-1 and mode-2 > >> Mode-3 "Exclusive": Application request to own RX queue. Packets are >> no longer allowed for normal netstack delivery. >> > > I have patches for this mode already but haven't pushed them due to > an alternative solution using VFIO. > >> Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are >> still allowed to travel netstack and thus can contain packet data from >> other normal applications. 
This is part of the design, to share the >> NIC between netstack and an accelerated userspace application using RX >> zero-copy delivery. >> > > I don't think this is acceptable to be honest. Letting an application > potentially read/impact other arbitrary applications on the system > seems like a non-starter even with CAP_NET_ADMIN. At least this was > the conclusion from bifurcated driver work some time ago. I agree. This is a no-go from the performance perspective as well. At a minimum you would have to be zeroing out the page between uses to avoid leaking data, and that assumes that the program we are sending the pages to is slightly well behaved. If we think zeroing out an sk_buff is expensive wait until we are trying to do an entire 4K page. I think we are stuck with having to use a HW filter to split off application traffic to a specific ring, and then having to share the memory between the application and the kernel on that ring only. Any other approach just opens us up to all sorts of security concerns since it would be poss
Re: Designing a safe RX-zero-copy Memory Model for Networking
On Tue, 13 Dec 2016, Hannes Frederic Sowa wrote: > > Interesting. So you even imagine sockets registering memory regions > > with the NIC. If we had a proper NIC HW filter API across the drivers, > > to register the steering rule (like ibv_create_flow), this would be > > doable, but we don't (DPDK actually have an interesting proposal[1]) > > On a side note, this is what windows does with RIO ("registered I/O"). > Maybe you want to look at the API to get some ideas: allocating and > pinning down memory in user space and registering that with sockets to > get zero-copy IO. Yup that is also what I think. Regarding the memory registration and flow steering for user space RX/TX ring please look at the qpair model implemented by the RDMA subsystem in the kernel. The memory semantics are clearly established there and have been in use for more than a decade.
[PATCH net 0/2] net/sched: cls_flower: Fix mask handling
Hi,

This series fixes how the mask is handled.

Thanks.

Paul Blakey (2):
  net/sched: cls_flower: Use mask for addr_type
  net/sched: cls_flower: Use masked key when calling HW offloads

 net/sched/cls_flower.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

-- 
1.8.3.1
[PATCH net 1/2] net/sched: cls_flower: Use mask for addr_type
When addr_type is set, mask should also be set.

Fixes: 66530bdf85eb ('sched,cls_flower: set key address type when present')
Fixes: bc3103f1ed40 ('net/sched: cls_flower: Classify packet in ip tunnels')
Signed-off-by: Paul Blakey
Reviewed-by: Roi Dayan
Acked-by: Jiri Pirko
---
 net/sched/cls_flower.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index e040c51..9758f5a 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -509,6 +509,7 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
 	if (tb[TCA_FLOWER_KEY_IPV4_SRC] || tb[TCA_FLOWER_KEY_IPV4_DST]) {
 		key->control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
+		mask->control.addr_type = ~0;
 		fl_set_key_val(tb, &key->ipv4.src, TCA_FLOWER_KEY_IPV4_SRC,
 			       &mask->ipv4.src, TCA_FLOWER_KEY_IPV4_SRC_MASK,
 			       sizeof(key->ipv4.src));
@@ -517,6 +518,7 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
 			       sizeof(key->ipv4.dst));
 	} else if (tb[TCA_FLOWER_KEY_IPV6_SRC] || tb[TCA_FLOWER_KEY_IPV6_DST]) {
 		key->control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;
+		mask->control.addr_type = ~0;
 		fl_set_key_val(tb, &key->ipv6.src, TCA_FLOWER_KEY_IPV6_SRC,
 			       &mask->ipv6.src, TCA_FLOWER_KEY_IPV6_SRC_MASK,
 			       sizeof(key->ipv6.src));
@@ -571,6 +573,7 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
 	if (tb[TCA_FLOWER_KEY_ENC_IPV4_SRC] || tb[TCA_FLOWER_KEY_ENC_IPV4_DST]) {
 		key->enc_control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
+		mask->enc_control.addr_type = ~0;
 		fl_set_key_val(tb, &key->enc_ipv4.src,
 			       TCA_FLOWER_KEY_ENC_IPV4_SRC,
 			       &mask->enc_ipv4.src,
@@ -586,6 +589,7 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
 	if (tb[TCA_FLOWER_KEY_ENC_IPV6_SRC] || tb[TCA_FLOWER_KEY_ENC_IPV6_DST]) {
 		key->enc_control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;
+		mask->enc_control.addr_type = ~0;
 		fl_set_key_val(tb, &key->enc_ipv6.src,
 			       TCA_FLOWER_KEY_ENC_IPV6_SRC,
 			       &mask->enc_ipv6.src,
-- 
1.8.3.1
[PATCH net 2/2] net/sched: cls_flower: Use masked key when calling HW offloads
Zero bits on the mask signify a "don't care" on the corresponding bits
in the key. Some HWs require those bits on the key to be zero. Since
these bits are masked anyway, it's okay to provide the masked key to
all drivers.

Fixes: 5b33f48842fa ('net/flower: Introduce hardware offload support')
Signed-off-by: Paul Blakey
Reviewed-by: Roi Dayan
Acked-by: Jiri Pirko
---
 net/sched/cls_flower.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 9758f5a..35ac28d 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -252,7 +252,7 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
 	offload.cookie = (unsigned long)f;
 	offload.dissector = dissector;
 	offload.mask = mask;
-	offload.key = &f->key;
+	offload.key = &f->mkey;
 	offload.exts = &f->exts;
 
 	tc->type = TC_SETUP_CLSFLOWER;
-- 
1.8.3.1
Re: [PATCH] IB/mlx4: avoid a -Wmaybe-uninitialize warning
On 10/25/2016 12:16 PM, Arnd Bergmann wrote: > There is an old warning about mlx4_SW2HW_EQ_wrapper on x86: > > ethernet/mellanox/mlx4/resource_tracker.c: In function > ‘mlx4_SW2HW_EQ_wrapper’: > ethernet/mellanox/mlx4/resource_tracker.c:3071:10: error: ‘eq’ may be used > uninitialized in this function [-Werror=maybe-uninitialized] > > The problem here is that gcc won't track the state of the variable > across a spin_unlock. Moving the assignment out of the lock is > safe here and avoids the warning. > > Signed-off-by: Arnd Bergmann Thanks, applied. -- Doug Ledford GPG Key ID: 0E572FDD signature.asc Description: OpenPGP digital signature
Re: [net-next PATCH v5 1/6] net: virtio dynamically disable/enable LRO
On 16-12-14 05:31 AM, Michael S. Tsirkin wrote: > On Thu, Dec 08, 2016 at 04:04:58PM -0800, John Fastabend wrote: >> On 16-12-08 01:36 PM, Michael S. Tsirkin wrote: >>> On Wed, Dec 07, 2016 at 12:11:11PM -0800, John Fastabend wrote: This adds support for dynamically setting the LRO feature flag. The message to control guest features in the backend uses the CTRL_GUEST_OFFLOADS msg type. Signed-off-by: John Fastabend --- [...] static void virtnet_config_changed_work(struct work_struct *work) @@ -1815,6 +1846,12 @@ static int virtnet_probe(struct virtio_device *vdev) if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM)) dev->features |= NETIF_F_RXCSUM; + if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) && + virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO6)) { + dev->features |= NETIF_F_LRO; + dev->hw_features |= NETIF_F_LRO; >>> >>> So the issue is I think that the virtio "LRO" isn't really >>> LRO, it's typically just GRO forwarded to guests. >>> So these are easily re-split along MTU boundaries, >>> which makes it ok to forward these across bridges. >>> >>> It's not nice that we don't document this in the spec, >>> but it's the reality and people rely on this. >>> >>> For now, how about doing a custom thing and just disable/enable >>> it as XDP is attached/detached? >> >> The annoying part about doing this is ethtool will say that it is fixed >> yet it will be changed by seemingly unrelated operation. I'm not sure I >> like the idea to start automatically configuring the link via xdp_set. > > I really don't like the idea of dropping performance > by a factor of 3 for people bridging two virtio net > interfaces. > > So how about a simple approach for now, just disable > XDP if GUEST_TSO is enabled? > > We can discuss better approaches in next version. > So the proposal is to add a check in XDP setup so that if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO{4|6}) return -ENOPSUPP; Or whatever is the most appropriate return code? 
Then we can disable TSO via qemu-system with guest_tso4=off,guest_tso6=off for XDP use cases. Sounds like a reasonable start to me. I'll make the change should this go through DaveMs net-next tree or do you want it on virtio tree? Either is fine with me. Thanks, John
Re: [Query] Delayed vxlan socket creation?
On Tue, Dec 13, 2016 at 11:49 PM, Du, Fan wrote: > Hi > > I'm interested to one Docker issue[1] which looks like related to kernel > vxlan socket creation > as described in the thread. From my limited knowledge here, socket creation > is synchronous , > and after the *socket* syscall, the sock handle will be valid and ready to > linkup. You need to read the code. vxlan tunnel is a UDP tunnel, it needs a kernel socket (and a port) to setup UDP communication, unlike GRE tunnel etc. > > Somehow I'm not sure the detailed scenario here, and which/how possible > commit fix? > Thanks! > > Quoted analysis: > -- > (Found in kernel 3.13) > The issue happens because in older kernels when a vxlan interface is created, > the socket creation is queued up in a worker thread which actually creates > the socket. But this needs to happen before we bring up the link on the vxlan > interface. > If for some chance, the worker thread hasn't completed the creation of the > socket > before we did link up then when we do link up the kernel checks if the socket > was > created and if not it will return ENOTCONN. This was a bug in the kernel > which got fixed > in later kernels. That is why retrying with a timer fixes the issue. This was introduced by commit 1c51a9159ddefa5119724a4c7da3fd3ef44b68d5 and later fixed by commit 56ef9c909b40483d2c8cb63fcbf83865f162d5ec.
RE: Designing a safe RX-zero-copy Memory Model for Networking
From: Christoph Lameter > Sent: 14 December 2016 17:00 > On Tue, 13 Dec 2016, Hannes Frederic Sowa wrote: > > > > Interesting. So you even imagine sockets registering memory regions > > > with the NIC. If we had a proper NIC HW filter API across the drivers, > > > to register the steering rule (like ibv_create_flow), this would be > > > doable, but we don't (DPDK actually have an interesting proposal[1]) > > > > On a side note, this is what windows does with RIO ("registered I/O"). > > Maybe you want to look at the API to get some ideas: allocating and > > pinning down memory in user space and registering that with sockets to > > get zero-copy IO. > > Yup that is also what I think. Regarding the memory registration and flow > steering for user space RX/TX ring please look at the qpair model > implemented by the RDMA subsystem in the kernel. The memory semantics are > clearly established there and have been in use for more than a decade. Isn't there a bigger problem for transmit? If the kernel is doing ANY validation on the frames it must copy the data to memory the application cannot modify before doing the validation. Otherwise the application could change the data afterwards. David
Re: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform
On Wed, Dec 14, 2016 at 3:47 PM, David Laight wrote: > Just remove the __packed and ensure that the structure is 'nice'. > This includes ensuring there is no 'tail padding'. > In some cases you'll need to put the port number into a 32bit field. I'd rather not. There's no point in wasting extra cycles on hashing useless data, just for some particular syntactic improvement. __packed __aligned(8) will do what we want perfectly, I think. > I'd also require that the key be aligned. Yep, I'll certainly do this for the siphash24_aligned version in the v3.
Re: [PATCH] infiniband: nes: nes_nic: use new api ethtool_{get|set}_link_ksettings
On 10/25/2016 11:29 AM, Philippe Reynes wrote: > The ethtool api {get|set}_settings is deprecated. > We move this driver to new api {get|set}_link_ksettings. > > Signed-off-by: Philippe Reynes Thanks, applied. -- Doug Ledford GPG Key ID: 0E572FDD signature.asc Description: OpenPGP digital signature
Re: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform
From: "Jason A. Donenfeld" Date: Wed, 14 Dec 2016 13:53:10 +0100 > In all current uses of __packed in the code, I think the impact is > precisely zero, because all structures have members in descending > order of size, with each member being a perfect multiple of the one > below it. The __packed is therefore just there for safety, in case > somebody comes in and screws everything up by sticking a u8 in > between. Just marking the structure __packed, whether necessary or not, makes the compiler assume that the members are not aligned and causes byte-by-byte accesses to be performed for words. Never, _ever_, use __packed unless absolutely necessary, it pessimizes the code on cpus that require proper alignment of types.
Re: [kernel-hardening] Re: [PATCH 4/3] random: use siphash24 instead of md5 for get_random_int/long
Hey Ted,

On Wed, Dec 14, 2016 at 5:37 PM, Theodore Ts'o wrote:
> One somewhat undesirable aspect of the current algorithm is that we
> never change random_int_secret.

Why exactly would this be a problem? So long as the secret is kept
secret, the PRF is secure. If an attacker can read arbitrary kernel
memory, there are much much bigger issues to be concerned about. As
well, the "chaining" variable I introduce ensures that the random
numbers are, per-cpu, related to the uniqueness of timing of subsequent
calls.

> So I've been toying with the following, which is 4 times faster than
> md5. (I haven't tried benchmarking against siphash yet.)
>
> [    3.606139] random benchmark!!
> [    3.606276] get_random_int # cycles: 326578
> [    3.606317] get_random_int_new # cycles: 95438
> [    3.607423] get_random_bytes # cycles: 2653388

Cool, I'll benchmark it against the siphash implementation. I like what
you did with batching up lots of chacha output, and doling it out bit
by bit. I suspect this will be quite fast, because with chacha20 you
get an entire block.

> P.S. It's interesting to note that siphash24 and chacha20 are both
> add-rotate-xor based algorithms.

Quite! Lots of nice shiny things are turning out to be ARX -- ChaCha,
BLAKE2, Siphash, NORX. The simplicity is really appealing.

Jason
Re: [v2] net:ethernet:cavium:octeon:octeon_mgmt: Handle return NULL error from devm_ioremap
Yes, I have seen this error. We have a device with very little memory,
basically an OMAP2 board. We have to port Android L on it; it has a
3.10 kernel. On this device we were getting page allocation failures:
the vmalloc area was not large enough to run all applications, so we
decided to increase the vmalloc reserve space. Once we increased the
vmalloc space, we started getting ioremap failures, and the kernel hit
NULL-pointer dereference errors.

Here, it's just a check to avoid any kernel crash because of ioremap
failure. We can keep this check to avoid this kind of scenario.

Thanks
-Arvind

On Wednesday 14 December 2016 11:02 PM, David Daney wrote:
> On 12/14/2016 08:25 AM, Arvind Yadav wrote:
>> Here, If devm_ioremap will fail. It will return NULL. Kernel can run
>> into a NULL-pointer dereference. This error check will avoid NULL
>> pointer dereference.
>
> Have you ever seen this failure in the wild?
>
> How was the patch tested?
>
> Thanks,
> David Daney
>
>> Signed-off-by: Arvind Yadav
>> ---
>>  drivers/net/ethernet/cavium/octeon/octeon_mgmt.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
>> index 4ab404f..33c2fec 100644
>> --- a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
>> +++ b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
>> @@ -1479,6 +1479,12 @@ static int octeon_mgmt_probe(struct platform_device *pdev)
>>  	p->agl = (u64)devm_ioremap(&pdev->dev, p->agl_phys, p->agl_size);
>>  	p->agl_prt_ctl = (u64)devm_ioremap(&pdev->dev, p->agl_prt_ctl_phys,
>>  					   p->agl_prt_ctl_size);
>> +	if (!p->mix || !p->agl || !p->agl_prt_ctl) {
>> +		dev_err(&pdev->dev, "failed to map I/O memory\n");
>> +		result = -ENOMEM;
>> +		goto err;
>> +	}
>> +
>>  	spin_lock_init(&p->lock);
>>  	skb_queue_head_init(&p->tx_list);
Re: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform
Hi David,

On Wed, Dec 14, 2016 at 6:56 PM, David Miller wrote:
> Just marking the structure __packed, whether necessary or not, makes
> the compiler assume that the members are not aligned and causes
> byte-by-byte accesses to be performed for words.
> Never, _ever_, use __packed unless absolutely necessary, it pessimizes
> the code on cpus that require proper alignment of types.

Oh, jimminy cricket, I did not realize that it made assignments
byte-by-byte *always*. So what options am I left with? What immediately
comes to mind are:

1)

struct {
	u64 a;
	u32 b;
	u32 c;
	u16 d;
	u8 end[];
} a = { .a = a, .b = b, .c = c, .d = d };
siphash24(&a, offsetof(typeof(a), end), key);

2)

u8 bytes[sizeof(u64) + sizeof(u32) * 2 + sizeof(u16)];
*(u64 *)&bytes[0] = a;
*(u32 *)&bytes[sizeof(u64)] = b;
*(u32 *)&bytes[sizeof(u64) + sizeof(u32)] = c;
*(u16 *)&bytes[sizeof(u64) + sizeof(u32) * 2] = d;
siphash24(bytes, sizeof(bytes), key);

Personally I find (1) a bit neater than (2). What's your opinion?

Jason
Re: [v2] net:ethernet:cavium:octeon:octeon_mgmt: Handle return NULL error from devm_ioremap
On 12/14/2016 10:06 AM, arvind Yadav wrote: Yes, I have seen this error. We have a device with very less memory. Basically it's OMAP2 board. We have to port Android L on this. It's has 3.10 kernel version. In this device, we were getting Page allocation failure. This makes absolutely no sense to me. OCTEON is a mips64 SoC with a ton of memory where ioremap can never fail, and it doesn't run Android, and you are talking about OMAP2. Q1: Have you observed a failure on the device for which you are modifying the driver? Q2: Have you tested the patch on hardware that uses the driver you are modifying by running network traffic through the Ethernet interface this driver controls? If you cannot answer yes to both of those questions, then you should probably note in the changelog that the patch is untested. David. Vmalloc size was not enough to run all application. So we have decide to increase vmalloc reserve space. once we increases Vmalloc space. We start getting ioremap falilure. Kernel is getting NULL-pointer dereference error. Here, It's just check to avoid any kernel crash because of ioremap failure. We can keep this check to avoid this kind of scenario. Thanks -Arvind On Wednesday 14 December 2016 11:02 PM, David Daney wrote: On 12/14/2016 08:25 AM, Arvind Yadav wrote: Here, If devm_ioremap will fail. It will return NULL. Kernel can run into a NULL-pointer dereference. This error check will avoid NULL pointer dereference. i Have you ever seen this failure in the wild? How was the patch tested? 
Thanks, David Daney Signed-off-by: Arvind Yadav --- drivers/net/ethernet/cavium/octeon/octeon_mgmt.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c index 4ab404f..33c2fec 100644 --- a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c +++ b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c @@ -1479,6 +1479,12 @@ static int octeon_mgmt_probe(struct platform_device *pdev) p->agl = (u64)devm_ioremap(&pdev->dev, p->agl_phys, p->agl_size); p->agl_prt_ctl = (u64)devm_ioremap(&pdev->dev, p->agl_prt_ctl_phys, p->agl_prt_ctl_size); +if (!p->mix || !p->agl || !p->agl_prt_ctl) { +dev_err(&pdev->dev, "failed to map I/O memory\n"); +result = -ENOMEM; +goto err; +} + spin_lock_init(&p->lock); skb_queue_head_init(&p->tx_list);
Re: [v2] net:ethernet:cavium:octeon:octeon_mgmt: Handle return NULL error from devm_ioremap
Hi David,

I have given my comments inline.

Thanks,
Arvind

On Wednesday 14 December 2016 11:44 PM, David Daney wrote:
> On 12/14/2016 10:06 AM, arvind Yadav wrote:
>> Yes, I have seen this error. We have a device with very little
>> memory. Basically it's an OMAP2 board. We had to port Android L on
>> this, running a 3.10 kernel. On this device, we were getting page
>> allocation failures.
>
> This makes absolutely no sense to me. OCTEON is a mips64 SoC with a
> ton of memory where ioremap can never fail, and it doesn't run
> Android, and you are talking about OMAP2.

- I just gave an example of where I have seen an ioremap issue; please
don't conflate the two. I know that it will not fail now. But ioremap
returns NULL on failure, and we should catch that error. Other MIPS
SoC drivers have the same check. It is just a check that will not
impact any functionality or performance of this driver, and it will
avoid a NULL pointer dereference. If a function can return an error,
we should catch it.

> Q1: Have you observed a failure on the device for which you are
> modifying the driver?

- No, I did not observe this error.

> Q2: Have you tested the patch on hardware that uses the driver you
> are modifying by running network traffic through the Ethernet
> interface this driver controls?

- Right now we cannot test this kind of failure.

> If you cannot answer yes to both of those questions, then you should
> probably note in the changelog that the patch is untested.
>
> David.
>
>> Vmalloc size was not enough to run all applications, so we decided
>> to increase the vmalloc reserve space. Once we increased the vmalloc
>> space, we started getting ioremap failures and the kernel hit a
>> NULL-pointer dereference. Here, it's just a check to avoid a kernel
>> crash because of an ioremap failure. We can keep this check to avoid
>> this kind of scenario.
>>
>> Thanks,
>> -Arvind
>>
>> On Wednesday 14 December 2016 11:02 PM, David Daney wrote:
>>> On 12/14/2016 08:25 AM, Arvind Yadav wrote:
>>>> Here, if devm_ioremap fails, it will return NULL and the kernel
>>>> can run into a NULL-pointer dereference. This error check will
>>>> avoid the NULL pointer dereference.
>>>
>>> Have you ever seen this failure in the wild? How was the patch
>>> tested?
>>>
>>> Thanks,
>>> David Daney
>>>
>>>> Signed-off-by: Arvind Yadav
>>>> ---
>>>>  drivers/net/ethernet/cavium/octeon/octeon_mgmt.c | 6 ++++++
>>>>  1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
>>>> index 4ab404f..33c2fec 100644
>>>> --- a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
>>>> +++ b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
>>>> @@ -1479,6 +1479,12 @@ static int octeon_mgmt_probe(struct platform_device *pdev)
>>>>  	p->agl = (u64)devm_ioremap(&pdev->dev, p->agl_phys, p->agl_size);
>>>>  	p->agl_prt_ctl = (u64)devm_ioremap(&pdev->dev, p->agl_prt_ctl_phys, p->agl_prt_ctl_size);
>>>> +	if (!p->mix || !p->agl || !p->agl_prt_ctl) {
>>>> +		dev_err(&pdev->dev, "failed to map I/O memory\n");
>>>> +		result = -ENOMEM;
>>>> +		goto err;
>>>> +	}
>>>> +
>>>>  	spin_lock_init(&p->lock);
>>>>  	skb_queue_head_init(&p->tx_list);
[PATCH v3 3/3] random: use siphash24 instead of md5 for get_random_int/long
This duplicates the current algorithm for get_random_int/long, but uses siphash24 instead. This comes with several benefits. It's certainly faster and more cryptographically secure than MD5. This patch also hashes the pid, entropy, and timestamp as fixed width fields, in order to increase diffusion. The previous md5 algorithm used a per-cpu md5 state, which caused successive calls to the function to chain upon each other. While it's not entirely clear that this kind of chaining is absolutely necessary when using a secure PRF like siphash24, it can't hurt, and the timing of the call chain does add a degree of natural entropy. So, in keeping with this design, instead of the massive per-cpu 64-byte md5 state, there is instead a per-cpu previously returned value for chaining. Signed-off-by: Jason A. Donenfeld Cc: Jean-Philippe Aumasson Cc: Ted Tso --- Changes from v2->v3: - Structs are no longer packed, to mitigate slow byte-by-byte assignment. drivers/char/random.c | 52 --- 1 file changed, 33 insertions(+), 19 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index d6876d506220..b1c2e3b26430 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -262,6 +262,7 @@ #include #include #include +#include #include #include @@ -2042,7 +2043,7 @@ struct ctl_table random_table[] = { }; #endif /* CONFIG_SYSCTL */ -static u32 random_int_secret[MD5_MESSAGE_BYTES / 4] cacheline_aligned; +static u8 random_int_secret[SIPHASH24_KEY_LEN] __aligned(SIPHASH24_ALIGNMENT); int random_int_secret_init(void) { @@ -2050,8 +2051,7 @@ int random_int_secret_init(void) return 0; } -static DEFINE_PER_CPU(__u32 [MD5_DIGEST_WORDS], get_random_int_hash) - __aligned(sizeof(unsigned long)); +static DEFINE_PER_CPU(u64, get_random_int_chaining); /* * Get a random word for internal kernel use only. 
Similar to urandom but @@ -2061,19 +2061,26 @@ static DEFINE_PER_CPU(__u32 [MD5_DIGEST_WORDS], get_random_int_hash) */ unsigned int get_random_int(void) { - __u32 *hash; unsigned int ret; + struct { + u64 chaining; + unsigned long ts; + unsigned long entropy; + pid_t pid; + char end[]; + } __aligned(SIPHASH24_ALIGNMENT) combined; + u64 *chaining; if (arch_get_random_int(&ret)) return ret; - hash = get_cpu_var(get_random_int_hash); - - hash[0] += current->pid + jiffies + random_get_entropy(); - md5_transform(hash, random_int_secret); - ret = hash[0]; - put_cpu_var(get_random_int_hash); - + chaining = get_cpu_ptr(&get_random_int_chaining); + combined.chaining = *chaining; + combined.ts = jiffies; + combined.entropy = random_get_entropy(); + combined.pid = current->pid; + ret = *chaining = siphash24((u8 *)&combined, offsetof(typeof(combined), end), random_int_secret); + put_cpu_ptr(chaining); return ret; } EXPORT_SYMBOL(get_random_int); @@ -2083,19 +2090,26 @@ EXPORT_SYMBOL(get_random_int); */ unsigned long get_random_long(void) { - __u32 *hash; unsigned long ret; + struct { + u64 chaining; + unsigned long ts; + unsigned long entropy; + pid_t pid; + char end[]; + } __aligned(SIPHASH24_ALIGNMENT) combined; + u64 *chaining; if (arch_get_random_long(&ret)) return ret; - hash = get_cpu_var(get_random_int_hash); - - hash[0] += current->pid + jiffies + random_get_entropy(); - md5_transform(hash, random_int_secret); - ret = *(unsigned long *)hash; - put_cpu_var(get_random_int_hash); - + chaining = get_cpu_ptr(&get_random_int_chaining); + combined.chaining = *chaining; + combined.ts = jiffies; + combined.entropy = random_get_entropy(); + combined.pid = current->pid; + ret = *chaining = siphash24((u8 *)&combined, offsetof(typeof(combined), end), random_int_secret); + put_cpu_ptr(chaining); return ret; } EXPORT_SYMBOL(get_random_long); -- 2.11.0
[PATCH v3 2/3] secure_seq: use siphash24 instead of md5_transform
This gives a clear speed and security improvement. Siphash is both faster and more solid crypto than the aging MD5. Rather than manually filling MD5 buffers, we simply lay out the input with a simple anonymous struct, for which gcc generates rather efficient code. Signed-off-by: Jason A. Donenfeld Cc: Andi Kleen Cc: David Miller Cc: David Laight --- Changes from v2->v3: - Structs are no longer packed, to mitigate slow byte-by-byte assignment. - A typo has been fixed in the port number assignment. net/core/secure_seq.c | 166 ++ 1 file changed, 85 insertions(+), 81 deletions(-) diff --git a/net/core/secure_seq.c b/net/core/secure_seq.c index 88a8e429fc3e..00eb141c981b 100644 --- a/net/core/secure_seq.c +++ b/net/core/secure_seq.c @@ -1,3 +1,5 @@ +/* Copyright (C) 2016 Jason A. Donenfeld . All Rights Reserved. */ + #include #include #include @@ -8,14 +10,14 @@ #include #include #include - +#include #include #if IS_ENABLED(CONFIG_IPV6) || IS_ENABLED(CONFIG_INET) +#include #include -#define NET_SECRET_SIZE (MD5_MESSAGE_BYTES / 4) -static u32 net_secret[NET_SECRET_SIZE] cacheline_aligned; +static u8 net_secret[SIPHASH24_KEY_LEN] __aligned(SIPHASH24_ALIGNMENT); static __always_inline void net_secret_init(void) { @@ -44,44 +46,41 @@ static u32 seq_scale(u32 seq) u32 secure_tcpv6_sequence_number(const __be32 *saddr, const __be32 *daddr, __be16 sport, __be16 dport, u32 *tsoff) { - u32 secret[MD5_MESSAGE_BYTES / 4]; - u32 hash[MD5_DIGEST_WORDS]; - u32 i; - + const struct { + struct in6_addr saddr; + struct in6_addr daddr; + __be16 sport; + __be16 dport; + char end[]; + } __aligned(SIPHASH24_ALIGNMENT) combined = { + .saddr = *(struct in6_addr *)saddr, + .daddr = *(struct in6_addr *)daddr, + .sport = sport, + .dport = dport + }; + u64 hash; net_secret_init(); - memcpy(hash, saddr, 16); - for (i = 0; i < 4; i++) - secret[i] = net_secret[i] + (__force u32)daddr[i]; - secret[4] = net_secret[4] + - (((__force u16)sport << 16) + (__force u16)dport); - for (i = 5; i < 
MD5_MESSAGE_BYTES / 4; i++) - secret[i] = net_secret[i]; - - md5_transform(hash, secret); - - *tsoff = sysctl_tcp_timestamps == 1 ? hash[1] : 0; - return seq_scale(hash[0]); + hash = siphash24((const u8 *)&combined, offsetof(typeof(combined), end), net_secret); + *tsoff = sysctl_tcp_timestamps == 1 ? (hash >> 32) : 0; + return seq_scale(hash); } EXPORT_SYMBOL(secure_tcpv6_sequence_number); u32 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr, __be16 dport) { - u32 secret[MD5_MESSAGE_BYTES / 4]; - u32 hash[MD5_DIGEST_WORDS]; - u32 i; - + const struct { + struct in6_addr saddr; + struct in6_addr daddr; + __be16 dport; + char end[]; + } __aligned(SIPHASH24_ALIGNMENT) combined = { + .saddr = *(struct in6_addr *)saddr, + .daddr = *(struct in6_addr *)daddr, + .dport = dport + }; net_secret_init(); - memcpy(hash, saddr, 16); - for (i = 0; i < 4; i++) - secret[i] = net_secret[i] + (__force u32) daddr[i]; - secret[4] = net_secret[4] + (__force u32)dport; - for (i = 5; i < MD5_MESSAGE_BYTES / 4; i++) - secret[i] = net_secret[i]; - - md5_transform(hash, secret); - - return hash[0]; + return siphash24((const u8 *)&combined, offsetof(typeof(combined), end), net_secret); } EXPORT_SYMBOL(secure_ipv6_port_ephemeral); #endif @@ -91,33 +90,39 @@ EXPORT_SYMBOL(secure_ipv6_port_ephemeral); u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport, u32 *tsoff) { - u32 hash[MD5_DIGEST_WORDS]; - + const struct { + __be32 saddr; + __be32 daddr; + __be16 sport; + __be16 dport; + char end[]; + } __aligned(SIPHASH24_ALIGNMENT) combined = { + .saddr = saddr, + .daddr = daddr, + .sport = sport, + .dport = dport + }; + u64 hash; net_secret_init(); - hash[0] = (__force u32)saddr; - hash[1] = (__force u32)daddr; - hash[2] = ((__force u16)sport << 16) + (__force u16)dport; - hash[3] = net_secret[15]; - - md5_transform(hash, net_secret); - - *tsoff = sysctl_tcp_timestamps == 1 ? 
hash[1] : 0; - return seq_scale(hash[0]); + hash = siphash24((const u8 *)&combined, offsetof(typeof(combined), end), net_secret); + *tsoff = sysctl_tcp_timestamps == 1 ? (hash
[PATCH v3 1/3] siphash: add cryptographically secure hashtable function
SipHash is a 64-bit keyed hash function that is actually a cryptographically secure PRF, like HMAC. Except SipHash is super fast, and is meant to be used as a hashtable keyed lookup function.

SipHash isn't just some new trendy hash function. It's been around for a while, and there really isn't anything that comes remotely close to being useful in the way SipHash is. With that said, why do we need this?

There are a variety of attacks known as "hashtable poisoning" in which an attacker forms some data such that the hash of that data will be the same, and then proceeds to fill up all entries of a hashbucket. This is a realistic and well-known denial-of-service vector.

Linux developers already seem to be aware that this is an issue, and various places that use hash tables in, say, a network context, use a non-cryptographically secure function (usually jhash) and then try to twiddle with the key on a time basis (or in many cases just do nothing and hope that nobody notices). While this is an admirable attempt at solving the problem, it doesn't actually fix it. SipHash fixes it. (It fixes it in such a sound way that you could even build a stream cipher out of SipHash that would resist modern cryptanalysis.)

There are a number of places in the kernel that are vulnerable to hashtable poisoning attacks, either via userspace vectors or network vectors, and there's not a reliable mechanism inside the kernel at the moment to fix it. The first step toward fixing these issues is actually getting a secure primitive into the kernel for developers to use. Then we can, bit by bit, port things over to it as deemed appropriate.

Secondly, a few places are using MD5 for creating secure sequence numbers, port numbers, or fast random numbers. Siphash is a faster, more fitting, and more secure replacement for MD5 in those situations.

Dozens of languages are already using this internally for their hash tables. Some of the BSDs already use this in their kernels. 
SipHash is a widely known high-speed solution to a widely known problem, and it's time we catch up. Signed-off-by: Jason A. Donenfeld Cc: Jean-Philippe Aumasson Cc: Daniel J. Bernstein Cc: Linus Torvalds Cc: Eric Biggers Cc: David Laight --- Changes from v2->v3: - There is now a fast aligned version of the function and a not-as-fast unaligned version. The requirements for each have been documented in a docbook-style comment. As well, the header now contains a constant for the expected alignment. - The test suite has been updated to check both the unaligned and aligned version of the function. include/linux/siphash.h | 30 ++ lib/Kconfig.debug | 6 +- lib/Makefile| 5 +- lib/siphash.c | 153 lib/test_siphash.c | 85 +++ 5 files changed, 274 insertions(+), 5 deletions(-) create mode 100644 include/linux/siphash.h create mode 100644 lib/siphash.c create mode 100644 lib/test_siphash.c diff --git a/include/linux/siphash.h b/include/linux/siphash.h new file mode 100644 index ..82dc1a911a2e --- /dev/null +++ b/include/linux/siphash.h @@ -0,0 +1,30 @@ +/* Copyright (C) 2016 Jason A. Donenfeld + * + * This file is provided under a dual BSD/GPLv2 license. 
+ * + * SipHash: a fast short-input PRF + * https://131002.net/siphash/ + */ + +#ifndef _LINUX_SIPHASH_H +#define _LINUX_SIPHASH_H + +#include + +enum siphash_lengths { + SIPHASH24_KEY_LEN = 16, + SIPHASH24_ALIGNMENT = 8 +}; + +u64 siphash24(const u8 *data, size_t len, const u8 key[SIPHASH24_KEY_LEN]); + +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS +static inline u64 siphash24_unaligned(const u8 *data, size_t len, const u8 key[SIPHASH24_KEY_LEN]) +{ + return siphash24(data, len, key); +} +#else +u64 siphash24_unaligned(const u8 *data, size_t len, const u8 key[SIPHASH24_KEY_LEN]); +#endif + +#endif /* _LINUX_SIPHASH_H */ diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index e6327d102184..32bbf689fc46 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1843,9 +1843,9 @@ config TEST_HASH tristate "Perform selftest on hash functions" default n help - Enable this option to test the kernel's integer () - and string () hash functions on boot - (or module load). + Enable this option to test the kernel's integer (), + string (), and siphash () + hash functions on boot (or module load). This is intended to help people writing architecture-specific optimized versions. If unsure, say N. diff --git a/lib/Makefile b/lib/Makefile index 50144a3aeebd..71d398b04a74 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -22,7 +22,8 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \ sha1.o chacha20.o md5.o irq_regs.o argv_split.o \ flex_proportions.o ratelimit.o show_mem.o \ is_single_threaded.o plist.o decompress.o kobject_ueven
Re: [v2] net:ethernet:cavium:octeon:octeon_mgmt: Handle return NULL error from devm_ioremap
On 12/14/2016 10:39 AM, arvind Yadav wrote:
> Hi David,
>
> I have gave my comment.
>
> Thanks
> Arvind
>
> On Wednesday 14 December 2016 11:44 PM, David Daney wrote:
>> On 12/14/2016 10:06 AM, arvind Yadav wrote:
>>> Yes, I have seen this error. We have a device with very less memory.
>>> Basically it's OMAP2 board. We have to port Android L on this.
>>> It's has 3.10 kernel version. In this device, we were getting Page
>>> allocation failure.
>>
>> This makes absolutely no sense to me. OCTEON is a mips64 SoC with a
>> ton of memory where ioremap can never fail, and it doesn't run
>> Android, and you are talking about OMAP2.
> -I just gave as example where i have seen ioremap issue.
> Please don't relate. I know, Now it will not fail. ioremap will through
> NULL on failure. We should catch this error. Even other driver of MIPS
> soc is having same check. It's just check which will not impact any
> functionality or performance of this driver. It will avoid NULL pointer
> error. We know, if function is returning any error. we should catch.

Your patch subject should also be changed to insert spaces after the
colons, so this would be:

net: ethernet: cavium: octeon: octeon_mgmt:

--
Florian
stmmac: lockups (was Re: Synopsys Ethernet QoS)
Hi!

> I know that this is completely off topic, but I am facing a difficulty
> with stmmac. I have interrupts, the MAC well configured, RX packets
> being received successfully, but TX is not working, resulting in TX
> errors = total TX packets.
> I have done a lot of debugging, and my conclusion is that for some
> reason, when using stmmac, after starting TX DMA the hw state machine
> enters a dead-end state resulting in those errors. Has anyone faced
> this trouble?

Well, what I'm debugging are lockups after many packets transmitted
(followed by the netdev watchdog; stmmac_tx_err() does not work for me,
it kills the device even when run from a working state; ifconfig
down/up helps). 4.4 locks up in minutes to hours; 4.9 seems to work
better (but I believe I've seen a lockup there too, once). So...
probably a different problem?

Best regards,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
signature.asc Description: Digital signature
[v3] net: ethernet: cavium: octeon: octeon_mgmt: Handle return NULL error from devm_ioremap
If devm_ioremap fails, it returns NULL, and the kernel can then run
into a NULL-pointer dereference. This error check avoids the NULL
pointer dereference.

Signed-off-by: Arvind Yadav
---
 drivers/net/ethernet/cavium/octeon/octeon_mgmt.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
index 4ab404f..33c2fec 100644
--- a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
+++ b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
@@ -1479,6 +1479,12 @@ static int octeon_mgmt_probe(struct platform_device *pdev)
 	p->agl = (u64)devm_ioremap(&pdev->dev, p->agl_phys, p->agl_size);
 	p->agl_prt_ctl = (u64)devm_ioremap(&pdev->dev, p->agl_prt_ctl_phys, p->agl_prt_ctl_size);
+	if (!p->mix || !p->agl || !p->agl_prt_ctl) {
+		dev_err(&pdev->dev, "failed to map I/O memory\n");
+		result = -ENOMEM;
+		goto err;
+	}
+
 	spin_lock_init(&p->lock);
 	skb_queue_head_init(&p->tx_list);
--
2.7.4
Re: [v2] net:ethernet:cavium:octeon:octeon_mgmt: Handle return NULL error from devm_ioremap
Hi,

As per your suggestion, I have changed the subject.

Thanks

On Thursday 15 December 2016 12:24 AM, Florian Fainelli wrote:
> On 12/14/2016 10:39 AM, arvind Yadav wrote:
>> Hi David,
>>
>> I have gave my comment.
>>
>> Thanks
>> Arvind
>>
>> On Wednesday 14 December 2016 11:44 PM, David Daney wrote:
>>> On 12/14/2016 10:06 AM, arvind Yadav wrote:
>>>> Yes, I have seen this error. We have a device with very less memory.
>>>> Basically it's OMAP2 board. We have to port Android L on this.
>>>> It's has 3.10 kernel version. In this device, we were getting Page
>>>> allocation failure.
>>>
>>> This makes absolutely no sense to me. OCTEON is a mips64 SoC with a
>>> ton of memory where ioremap can never fail, and it doesn't run
>>> Android, and you are talking about OMAP2.
>> -I just gave as example where i have seen ioremap issue.
>> Please don't relate. I know, Now it will not fail. ioremap will through
>> NULL on failure. We should catch this error. Even other driver of MIPS
>> soc is having same check. It's just check which will not impact any
>> functionality or performance of this driver. It will avoid NULL pointer
>> error. We know, if function is returning any error. we should catch.
>
> Your patch subject should also be changed to insert spaces after the
> colons, so this would be:
>
> net: ethernet: cavium: octeon: octeon_mgmt:
[PATCH net] net: vrf: Fix NAT within a VRF
Connection tracking with VRF is broken because the pass through the VRF device drops the connection tracking info. Removing the call to nf_reset allows DNAT and MASQUERADE to work across interfaces within a VRF. Fixes: 73e20b761acf ("net: vrf: Add support for PREROUTING rules on vrf device") Signed-off-by: David Ahern --- drivers/net/vrf.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c index 3bca24651dc0..015a1321c7dd 100644 --- a/drivers/net/vrf.c +++ b/drivers/net/vrf.c @@ -849,8 +849,6 @@ static struct sk_buff *vrf_rcv_nfhook(u8 pf, unsigned int hook, { struct net *net = dev_net(dev); - nf_reset(skb); - if (NF_HOOK(pf, hook, net, NULL, skb, dev, NULL, vrf_rcv_finish) < 0) skb = NULL;/* kfree_skb(skb) handled by nf code */ -- 2.1.4
Re: [v2] net:ethernet:cavium:octeon:octeon_mgmt: Handle return NULL error from devm_ioremap
On 12/14/2016 08:25 AM, Arvind Yadav wrote:
> Here, If devm_ioremap will fail. It will return NULL. Kernel can run
> into a NULL-pointer dereference. This error check will avoid NULL
> pointer dereference.

Have you ever seen this failure in the wild?

How was the patch tested?

Thanks,
David Daney

> Signed-off-by: Arvind Yadav
> ---
>  drivers/net/ethernet/cavium/octeon/octeon_mgmt.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
> index 4ab404f..33c2fec 100644
> --- a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
> +++ b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
> @@ -1479,6 +1479,12 @@ static int octeon_mgmt_probe(struct platform_device *pdev)
>  	p->agl = (u64)devm_ioremap(&pdev->dev, p->agl_phys, p->agl_size);
>  	p->agl_prt_ctl = (u64)devm_ioremap(&pdev->dev, p->agl_prt_ctl_phys, p->agl_prt_ctl_size);
> +	if (!p->mix || !p->agl || !p->agl_prt_ctl) {
> +		dev_err(&pdev->dev, "failed to map I/O memory\n");
> +		result = -ENOMEM;
> +		goto err;
> +	}
> +
>  	spin_lock_init(&p->lock);
>  	skb_queue_head_init(&p->tx_list);
[PATCH 1/3] Bluetooth: btusb: Use an error label for error paths
Use a label to remove the repetitive cleanup for error cases. (This label will also be used in subsequent patches.) Signed-off-by: Rajat Jain --- drivers/bluetooth/btusb.c | 19 +-- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c index 2f633df..ce22cef 100644 --- a/drivers/bluetooth/btusb.c +++ b/drivers/bluetooth/btusb.c @@ -2991,18 +2991,15 @@ static int btusb_probe(struct usb_interface *intf, err = usb_set_interface(data->udev, 0, 0); if (err < 0) { BT_ERR("failed to set interface 0, alt 0 %d", err); - hci_free_dev(hdev); - return err; + goto out_free_dev; } } if (data->isoc) { err = usb_driver_claim_interface(&btusb_driver, data->isoc, data); - if (err < 0) { - hci_free_dev(hdev); - return err; - } + if (err < 0) + goto out_free_dev; } #ifdef CONFIG_BT_HCIBTUSB_BCM @@ -3016,14 +3013,16 @@ static int btusb_probe(struct usb_interface *intf, #endif err = hci_register_dev(hdev); - if (err < 0) { - hci_free_dev(hdev); - return err; - } + if (err < 0) + goto out_free_dev; usb_set_intfdata(intf, data); return 0; + +out_free_dev: + hci_free_dev(hdev); + return err; } static void btusb_disconnect(struct usb_interface *intf) -- 2.8.0.rc3.226.g39d4020
Re: [kernel-hardening] Re: [PATCH 4/3] random: use siphash24 instead of md5 for get_random_int/long
Hi again,

On Wed, Dec 14, 2016 at 5:37 PM, Theodore Ts'o wrote:
> [3.606139] random benchmark!!
> [3.606276] get_random_int # cycles: 326578
> [3.606317] get_random_int_new # cycles: 95438
> [3.607423] get_random_bytes # cycles: 2653388

Looks to me like my siphash implementation is much faster for
get_random_long, and more or less tied for get_random_int:

[1.729370] random benchmark!!
[1.729710] get_random_long # cycles: 349771
[1.730128] get_random_long_chacha # cycles: 359660
[1.730457] get_random_long_siphash # cycles: 94255
[1.731307] get_random_bytes # cycles: 1354894
[1.731707] get_random_int # cycles: 305640
[1.732095] get_random_int_chacha # cycles: 80726
[1.732425] get_random_int_siphash # cycles: 94265
[1.733278] get_random_bytes # cycles: 1315873

Given the increasing usage of get_random_long for ASLR and related, I
think this makes the siphash approach worth pursuing. The chacha
approach is also not significantly different from the md5 approach in
terms of speed for get_random_long. Additionally, since siphash is a
PRF, I think this opens up a big window for optimizing it even further.

Benchmark here:
https://git.zx2c4.com/linux-dev/commit/?h=rng-bench

Jason
[PATCH 3/3] Bluetooth: btusb: Configure Marvell to use one of the pins for oob wakeup
The Marvell devices may have many gpio pins, and hence for wakeup on these out-of-band pins, the chip needs to be told which pin is to be used for wakeup, using an hci command. Thus, we read the pin number etc from the device tree node and send a command to the chip. Signed-off-by: Rajat Jain --- Note that while I would have liked to name the compatible string as more like "marvell, usb8997-bt", the devicetrees/bindings/usb/usb-device.txt requires the compatible property to be of the form "usbVID,PID". .../{marvell-bt-sd8xxx.txt => marvell-bt-8xxx.txt} | 25 - drivers/bluetooth/btusb.c | 59 ++ 2 files changed, 82 insertions(+), 2 deletions(-) rename Documentation/devicetree/bindings/net/{marvell-bt-sd8xxx.txt => marvell-bt-8xxx.txt} (76%) diff --git a/Documentation/devicetree/bindings/net/marvell-bt-sd8xxx.txt b/Documentation/devicetree/bindings/net/marvell-bt-8xxx.txt similarity index 76% rename from Documentation/devicetree/bindings/net/marvell-bt-sd8xxx.txt rename to Documentation/devicetree/bindings/net/marvell-bt-8xxx.txt index 6a9a63c..471bef8 100644 --- a/Documentation/devicetree/bindings/net/marvell-bt-sd8xxx.txt +++ b/Documentation/devicetree/bindings/net/marvell-bt-8xxx.txt @@ -1,4 +1,4 @@ -Marvell 8897/8997 (sd8897/sd8997) bluetooth SDIO devices +Marvell 8897/8997 (sd8897/sd8997) bluetooth devices (SDIO or USB based) -- Required properties: @@ -6,11 +6,13 @@ Required properties: - compatible : should be one of the following: * "marvell,sd8897-bt" * "marvell,sd8997-bt" + * "usb1286,204e" Optional properties: - marvell,cal-data: Calibration data downloaded to the device during initialization. This is an array of 28 values(u8). + This is only applicable to SDIO devices. - marvell,wakeup-pin: It represents wakeup pin number of the bluetooth chip. firmware will use the pin to wakeup host system (u16). @@ -29,7 +31,9 @@ Example: IRQ pin 119 is used as system wakeup source interrupt. 
wakeup pin 13 and gap 100ms are configured so that firmware can wakeup host using this device side pin and wakeup latency. -calibration data is also available in below example. + +Example for SDIO device follows (calibration data is also available in +below example). &mmc3 { status = "okay"; @@ -54,3 +58,20 @@ calibration data is also available in below example. marvell,wakeup-gap-ms = /bits/ 16 <0x64>; }; }; + +Example for USB device: + +&usb_host1_ohci { +status = "okay"; +#address-cells = <1>; +#size-cells = <0>; + +mvl_bt1: bt@1 { + compatible = "usb1286,204e"; + reg = <1>; + interrupt-parent = <&gpio0>; + interrupts = <119 IRQ_TYPE_LEVEL_LOW>; + marvell,wakeup-pin = /bits/ 16 <0x0d>; + marvell,wakeup-gap-ms = /bits/ 16 <0x64>; +}; +}; diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c index 32a6f22..99d7f6d 100644 --- a/drivers/bluetooth/btusb.c +++ b/drivers/bluetooth/btusb.c @@ -2343,6 +2343,58 @@ static int btusb_shutdown_intel(struct hci_dev *hdev) return 0; } +#ifdef CONFIG_PM +static const struct of_device_id mvl_oob_wake_match_table[] = { + { .compatible = "usb1286,204e" }, + { } +}; +MODULE_DEVICE_TABLE(of, mvl_oob_wake_match_table); + +/* Configure an out-of-band gpio as wake-up pin, if specified in device tree */ +static int marvell_config_oob_wake(struct hci_dev *hdev) +{ + struct sk_buff *skb; + struct btusb_data *data = hci_get_drvdata(hdev); + struct device *dev = &data->udev->dev; + u16 pin, gap, opcode; + int ret; + u8 cmd[5]; + + if (!of_match_device(mvl_oob_wake_match_table, dev)) + return 0; + + if (of_property_read_u16(dev->of_node, "marvell,wakeup-pin", &pin) || + of_property_read_u16(dev->of_node, "marvell,wakeup-gap-ms", &gap)) + return -EINVAL; + + /* Vendor specific command to configure a GPIO as wake-up pin */ + opcode = hci_opcode_pack(0x3F, 0x59); + cmd[0] = opcode & 0xFF; + cmd[1] = opcode >> 8; + cmd[2] = 2; /* length of parameters that follow */ + cmd[3] = pin; + cmd[4] = gap; /* time in ms, for which wakeup pin 
should be asserted */ + + skb = bt_skb_alloc(sizeof(cmd), GFP_KERNEL); + if (!skb) { + bt_dev_err(hdev, "%s: No memory\n", __func__); + return -ENOMEM; + } + + memcpy(skb_put(skb, sizeof(cmd)), cmd, sizeof(cmd)); + hci_skb_pkt_type(skb) = HCI_COMMAND_PKT; + + ret = btusb_send_frame(hdev, skb); + if (ret) { + bt_dev_err(hdev, "%s: configuration failed\n", __func__); + kfree_skb(skb); + return ret; + } + + return 0; +} +#endif + static int btusb_set_bdaddr_marvell(struct hci_dev *hdev, const bda
[PATCH 2/3] Bluetooth: btusb: Add out-of-band wakeup support
Some BT chips (e.g. Marvell 8997) contain a wakeup pin that can be connected to a gpio on the CPU side and used to wake up the host out-of-band. This can be useful in situations where in-band wakeup is not possible or not preferable (e.g. in-band wakeup may require the USB host controller to remain active, and hence consume more system power during system sleep). The oob gpio interrupt to be used for wakeup on the CPU side is read from the device tree node (using standard interrupt descriptors). A device tree binding document is also added for the driver. Signed-off-by: Rajat Jain --- Documentation/devicetree/bindings/net/btusb.txt | 38 drivers/bluetooth/btusb.c | 82 + 2 files changed, 120 insertions(+) create mode 100644 Documentation/devicetree/bindings/net/btusb.txt diff --git a/Documentation/devicetree/bindings/net/btusb.txt b/Documentation/devicetree/bindings/net/btusb.txt new file mode 100644 index 000..bb27f92 --- /dev/null +++ b/Documentation/devicetree/bindings/net/btusb.txt @@ -0,0 +1,38 @@ +Generic Bluetooth controller over USB (btusb driver) +--- + +Required properties: + + - compatible : should comply with the format "usbVID,PID" specified in +Documentation/devicetree/bindings/usb/usb-device.txt +At the time of writing, the only OF supported devices +(more may be added later) are: + + "usb1286,204e" (Marvell 8997) + +Optional properties: + + - interrupt-parent: phandle of the parent interrupt controller + - interrupts : The first interrupt specified is the interrupt that shall be +used for out-of-band wake-on-bt. Driver will request an irq +based on this interrupt number. During system suspend, the irq +will be enabled so that the bluetooth chip can wakeup host +platform out of band. During system resume, the irq will be +disabled to make sure unnecessary interrupt is not received. 
+ +Example: + +Following example uses irq pin number 3 of gpio0 for out of band wake-on-bt: + +&usb_host1_ehci { +status = "okay"; +#address-cells = <1>; +#size-cells = <0>; + +mvl_bt1: bt@1 { + compatible = "usb1286,204e"; + reg = <1>; + interrupt-parent = <&gpio0>; + interrupts = <3 IRQ_TYPE_LEVEL_LOW>; +}; +}; diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c index ce22cef..32a6f22 100644 --- a/drivers/bluetooth/btusb.c +++ b/drivers/bluetooth/btusb.c @@ -24,6 +24,8 @@ #include #include #include +#include +#include #include #include @@ -369,6 +371,7 @@ static const struct usb_device_id blacklist_table[] = { #define BTUSB_BOOTING 9 #define BTUSB_RESET_RESUME 10 #define BTUSB_DIAG_RUNNING 11 +#define BTUSB_OOB_WAKE_DISABLED12 struct btusb_data { struct hci_dev *hdev; @@ -416,6 +419,8 @@ struct btusb_data { int (*recv_bulk)(struct btusb_data *data, void *buffer, int count); int (*setup_on_usb)(struct hci_dev *hdev); + + int oob_wake_irq; /* irq for out-of-band wake-on-bt */ }; static inline void btusb_free_frags(struct btusb_data *data) @@ -2728,6 +2733,65 @@ static int btusb_bcm_set_diag(struct hci_dev *hdev, bool enable) } #endif +#ifdef CONFIG_PM +static irqreturn_t btusb_oob_wake_handler(int irq, void *priv) +{ + struct btusb_data *data = priv; + + /* Disable only if not already disabled (keep it balanced) */ + if (!test_and_set_bit(BTUSB_OOB_WAKE_DISABLED, &data->flags)) { + disable_irq_wake(irq); + disable_irq_nosync(irq); + } + pm_wakeup_event(&data->udev->dev, 0); + return IRQ_HANDLED; +} + +static const struct of_device_id btusb_match_table[] = { + { .compatible = "usb1286,204e" }, + { } +}; +MODULE_DEVICE_TABLE(of, btusb_match_table); + +/* Use an oob wakeup pin? 
*/ +static int btusb_config_oob_wake(struct hci_dev *hdev) +{ + struct btusb_data *data = hci_get_drvdata(hdev); + struct device *dev = &data->udev->dev; + int irq, ret; + + if (!of_match_device(btusb_match_table, dev)) + return 0; + + /* Move on if no IRQ specified */ + irq = irq_of_parse_and_map(dev->of_node, 0); + if (!irq) { + bt_dev_dbg(hdev, "%s: no oob wake irq in DT", __func__); + return 0; + } + + set_bit(BTUSB_OOB_WAKE_DISABLED, &data->flags); + + ret = devm_request_irq(&hdev->dev, irq, btusb_oob_wake_handler, + IRQF_TRIGGER_LOW, "oob wake-on-bt", data); + if (ret) { + bt_dev_err(hdev, "%s: irq request failed", __func__); + return ret; + } + + ret = device_init_wakeup(dev, true); + if (ret) { + bt_dev_err(hdev, "%s: failed to init_wakeup
Re: [PATCH v3 1/3] siphash: add cryptographically secure hashtable function
On Wed, Dec 14, 2016 at 10:46 AM, Jason A. Donenfeld wrote: > SipHash is a 64-bit keyed hash function that is actually a > cryptographically secure PRF, like HMAC. Except SipHash is super fast, > and is meant to be used as a hashtable keyed lookup function. > "super fast" is relative. My quick test shows that this faster than Toeplitz (good, but not exactly hard to achieve), but is about 4x slower than jhash. > SipHash isn't just some new trendy hash function. It's been around for a > while, and there really isn't anything that comes remotely close to > being useful in the way SipHash is. With that said, why do we need this? > I don't think we need advertising nor a lesson on hashing. It would be much more useful if you just point us to the paper on siphash (which I assume I http://cr.yp.to/siphash/siphash-20120918.pdf ?). > There are a variety of attacks known as "hashtable poisoning" in which an > attacker forms some data such that the hash of that data will be the > same, and then preceeds to fill up all entries of a hashbucket. This is > a realistic and well-known denial-of-service vector. > > Linux developers already seem to be aware that this is an issue, and > various places that use hash tables in, say, a network context, use a > non-cryptographically secure function (usually jhash) and then try to > twiddle with the key on a time basis (or in many cases just do nothing > and hope that nobody notices). While this is an admirable attempt at > solving the problem, it doesn't actually fix it. SipHash fixes it. > Key rotation is important anyway, without any key rotation even if the key is compromised in siphash by some external means we would have an insecure hash until the system reboots. > (It fixes it in such a sound way that you could even build a stream > cipher out of SipHash that would resist the modern cryptanalysis.) 
> There are a modicum of places in the kernel that are vulnerable to
> hashtable poisoning attacks, either via userspace vectors or network
> vectors, and there's not a reliable mechanism inside the kernel at the
> moment to fix it. The first step toward fixing these issues is actually
> getting a secure primitive into the kernel for developers to use. Then
> we can, bit by bit, port things over to it as deemed appropriate.
>
> Secondly, a few places are using MD5 for creating secure sequence
> numbers, port numbers, or fast random numbers. Siphash is a faster, more
> fitting, and more secure replacement for MD5 in those situations.
>
> Dozens of languages are already using this internally for their hash
> tables. Some of the BSDs already use this in their kernels. SipHash is
> a widely known high-speed solution to a widely known problem, and it's
> time we catch up.

Maybe so, but we need to do due diligence before considering adopting siphash as the primary hashing in the network stack. Consider that we may very well perform a hash over L4 tuples on _every_ packet. We've done a good job at limiting this to be at most one hash per packet, but nevertheless the performance of the hash function must be taken into account. A few points:

1) My quick test shows siphash is about four times more expensive than jhash. On my test system, computing a hash over an IPv4 tuple (two 32-bit addresses and two 16-bit ports) takes 6.9 nsecs with Jenkins hash, 33 nsecs with siphash. Given that we have eliminated most of the packet header hashes this might be tolerable, but we should still be looking at ways to optimize.

2) I like moving to u64 (quad words) in the hash; this is an improvement over Jenkins, which is based on 32-bit words. If we put this in the kernel we probably want several variants of siphash for specific sizes (e.g. siphash1, siphash2, siphash3, siphashn for a hash over one, two, three, or n sixty-four-bit words).
3) I also tested siphash distribution and Avalanche Effect (these really should have been covered in the paper :-( ). Both of these are good with siphash; in fact they have almost identical characteristics to Jenkins hash.

Tom

> Signed-off-by: Jason A. Donenfeld
> Cc: Jean-Philippe Aumasson
> Cc: Daniel J. Bernstein
> Cc: Linus Torvalds
> Cc: Eric Biggers
> Cc: David Laight
> ---
> Changes from v2->v3:
>
> - There is now a fast aligned version of the function and a not-as-fast
>   unaligned version. The requirements for each have been documented in
>   a docbook-style comment. As well, the header now contains a constant
>   for the expected alignment.
>
> - The test suite has been updated to check both the unaligned and aligned
>   versions of the function.
>
>  include/linux/siphash.h |  30 ++
>  lib/Kconfig.debug       |   6 +-
>  lib/Makefile            |   5 +-
>  lib/siphash.c           | 153
>  lib/test_siphash.c      |  85 +++
>  5 files changed, 274 insertions(+), 5 deletions(-)
>  create mode 100644 include/linux/siphash.h
>  create mode 100644 lib/sipha
Re: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform
On 14.12.2016 19:06, Jason A. Donenfeld wrote:
> Hi David,
>
> On Wed, Dec 14, 2016 at 6:56 PM, David Miller wrote:
>> Just marking the structure __packed, whether necessary or not, makes
>> the compiler assume that the members are not aligned and causes
>> byte-by-byte accesses to be performed for words.
>> Never, _ever_, use __packed unless absolutely necessary, it pessimizes
>> the code on cpus that require proper alignment of types.
>
> Oh, jimminy cricket, I did not realize that it made assignments
> byte-by-byte *always*. So what options am I left with? What
> immediately comes to mind are:
>
> 1)
>
> struct {
>     u64 a;
>     u32 b;
>     u32 c;
>     u16 d;
>     u8 end[];

I don't think this helps. Did you test it? I don't see a reason why padding could be left out between `d' and `end' because of the flexible array member.

Bye,
Hannes
Re: [PATCH v3 1/3] siphash: add cryptographically secure hashtable function
Hi Tom,

On Wed, Dec 14, 2016 at 8:18 PM, Tom Herbert wrote:
> "super fast" is relative. My quick test shows that this faster than
> Toeplitz (good, but not exactly hard to achieve), but is about 4x
> slower than jhash.

Fast relative to other cryptographically secure PRFs.

>> SipHash isn't just some new trendy hash function. It's been around for a
>> while, and there really isn't anything that comes remotely close to
>> being useful in the way SipHash is. With that said, why do we need this?
> I don't think we need advertising nor a lesson on hashing. It would be
> much more useful if you just point us to the paper on siphash (which I
> assume I http://cr.yp.to/siphash/siphash-20120918.pdf ?).

Ugh. Sorry. It definitely wasn't my intention to give an uninvited lesson or an annoying advert. For the former, I didn't want to make any assumptions about fields of knowledge, because I honestly have no idea. For the latter, I wrote that sentence to indicate that siphash isn't just some newfangled hipster function, but something useful and well established. I didn't mean it as a form of advertising. My apologies if I've offended your sensibilities. That cr.yp.to link is fine, or https://131002.net/siphash/siphash.pdf I believe.

> Key rotation is important anyway, without any key rotation even if the
> key is compromised in siphash by some external means we would have an
> insecure hash until the system reboots.

I'm a bit surprised to read this. I've never designed a system to be secure even in the event of remote arbitrary kernel memory disclosure, and I wasn't aware this was generally considered an architectural requirement for Linux. In any case, if you want this, I suppose you can have it with siphash too.

> Maybe so, but we need to do due diligence before considering adopting
> siphash as the primary hashing in the network stack. Consider that we
> may very well perform a hash over L4 tuples on _every_ packet.
> We've done a good job at limiting this to be at most one hash per
> packet, but nevertheless the performance of the hash function must be
> taken into account.

I agree with you. It seems like each case is going to need to be measured on a case-by-case basis. In this series I make the first use of siphash in the secure sequence generation and get_random_int/long, where siphash replaces md5, so there's a pretty clear performance win. But for the jhash replacements, indeed, things are going to need to be individually evaluated.

> 1) My quick test shows siphash is about four times more expensive than
> jhash. On my test system, computing a hash over IPv4 tuple (two 32 bit
> addresses and 2 16 bit source ports) is 6.9 nsecs in Jenkins hash, 33
> nsecs with siphash. Given that we have eliminated most of the packet
> header hashes this might be tolerable, but still should be looking at
> ways to optimize.
> 2) I like moving to use u64 (quad words) in the hash, this is an
> improvement over Jenkins which is based on 32 bit words. If we put
> this in the kernel we probably want to have several variants of
> siphash for specific sizes (e.g. siphash1, siphash2, siphash3,
> siphashn for hash over one, two, three, or n sixty four bit words).

I think your suggestion in (2) will contribute to further optimizations for (1). In v2, I had another patch in there adding siphash_1word, siphash_2words, etc., like jhash, but I implemented it by taking u32 variables, concatenating them into a buffer, and passing them to the main siphash function. I removed it from v3 because I thought that kind of missed the whole point. In particular:

a) siphash24_1word, siphash24_2words, siphash24_3words, etc. should take u64, not u32, since that's what siphash operates on natively

b) Rather than concatenating them in a buffer, I should write specializations of the siphash24 function _especially_ for these input sizes, to avoid the copy and to reduce the bookkeeping.
I'll add these functions to v4 implemented like that. Thanks for the useful feedback and benchmarks! Jason
Re: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform
Hi Hannes,

On Wed, Dec 14, 2016 at 8:22 PM, Hannes Frederic Sowa wrote:
> I don't think this helps. Did you test it? I don't see reason why
> padding could be left out between `d' and `end' because of the flexible
> array member?

Because the type u8 doesn't have any alignment requirements, it can nestle right up there, cozy with the u16:

zx2c4@thinkpad ~ $ cat a.c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

int main()
{
        struct {
                uint64_t a;
                uint32_t b;
                uint32_t c;
                uint16_t d;
                char x[];
        } a;
        printf("%zu\n", sizeof(a));
        printf("%zu\n", offsetof(typeof(a), x));
        return 0;
}
zx2c4@thinkpad ~ $ gcc a.c
zx2c4@thinkpad ~ $ ./a.out
24
18

Jason
RE: Designing a safe RX-zero-copy Memory Model for Networking
On Wed, 14 Dec 2016, David Laight wrote:
> If the kernel is doing ANY validation on the frames it must copy the
> data to memory the application cannot modify before doing the validation.
> Otherwise the application could change the data afterwards.

The application is not allowed to change the data after a work request has been submitted to send the frame. Changes are possible after the completion notification has been received. The kernel can enforce that by making the frame(s) read-only, so the app would take a page fault if it tried such a thing.
Re: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function
Hi Hannes,

On Wed, Dec 14, 2016 at 4:09 PM, Hannes Frederic Sowa wrote:
> Yes, numbers would be very usable here. I am mostly concerned about
> small plastic router cases. E.g. assume you double packet processing
> time with a change of the hashing function; at what point is the actual
> packet processing more of an attack vector than the hashtable?

I agree. Looks like Tom did some very quick benchmarks. I'll do some more precise benchmarks myself when we graduate from looking at md5 replacement (the easy case) to looking at jhash replacement (the harder case).

>> With that said, siphash is here to replace uses of jhash where
>> hashtable poisoning vulnerabilities make it necessary. Where there's
>> no significant security improvement, if there's no speed improvement
>> either, then of course nothing's going to change.
>
> It still changes currently well working source. ;-)

I mean if siphash doesn't make things better in some way, we'll just continue using jhash, so no source change or anything. In other words: an evolutionary, conservative approach rather than hasty "replace 'em all!" tomfoolery.

> MD5 is considered broken because its collision resistance is broken?
> SipHash doesn't even claim to have collision resistance (which we don't
> need here)?

Not just that, but it's not immediately clear to me that using MD5 as a PRF the way it is now with md5_transform is even a straightforwardly good idea.

> But I agree, certainly it could be a nice speed-up!

The benchmarks for the secure sequence number generation and the rng are indeed really promising.

> I think you mean non-linearity.

Yea, of course; editing typo, sorry.

> In general I am in favor of switching to siphash, but it would be nice to
> see some benchmarks with the specific kernel implementation, also on some
> smaller 32-bit CPUs and especially without using any SIMD instructions
> (which might have been used in the paper comparison).

Sure, agreed.
Each proposed jhash replacement will need to be benchmarked on little MIPS machines and x86 monsters alike, with patches indicating PPS before and after. Jason