Re: rhashtable: how to deal with that rhashtable_lookup_insert_key return -EBUSY
On Fri, Nov 20, 2015 at 01:14:18PM +0800, Xin Long wrote:
> when I use rhashtable_lookup_insert_key, sometimes it will return -EBUSY.
> I'm not sure if there is a good way to work around it.
> or should I just try again and again until it's inserted successfully?
>
> I have seen some uses in the kernel by now, but it seems that no one
> considers this issue for their cases. but it indeed exists in my case.
>
> did I use it incorrectly or something else?

AFAIK, insert returning -EBUSY is a situation users have to be aware of
and retry the insert. I sent a patch[1] to fix this in test_rhashtable.
That patch though retried in case of -ENOMEM as well, which was
considered wrong to do and therefore it wasn't accepted. But in my test
runs, -ENOMEM happened quite frequently and it also wasn't a permanent
error. For details, see the following discussion[2].

Herbert, did you manage to reproduce the problem meanwhile? If so, was
there any progress on fixing rhashtable? Otherwise, I could respin my
patch from [1] to cover only the -EBUSY case by default and add a
parameter to make non-permanent -ENOMEM visible.

Cheers, Phil

[1]: https://lkml.org/lkml/2015/8/28/197
[2]: https://lkml.org/lkml/2015/8/28/281

>
> Thanks.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
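For readers following the thread, the "retry only on -EBUSY" idea can be sketched in plain userspace C. This is a minimal sketch under stated assumptions, not the rhashtable API: fake_table, fake_insert() and insert_retry_busy() are hypothetical names, and the stub only mimics an insert that reports -EBUSY a few times before succeeding. Bounding the number of attempts is an assumption of this sketch, not something the thread settled on.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for an insert that can fail transiently.
 * It is NOT the rhashtable API; it only mimics "-EBUSY for a while,
 * then success". */
struct fake_table {
	int busy_left;	/* how many more calls will report -EBUSY */
	int inserted;	/* number of successful inserts */
};

static int fake_insert(struct fake_table *t, int key)
{
	(void)key;
	if (t->busy_left > 0) {
		t->busy_left--;
		return -EBUSY;
	}
	t->inserted++;
	return 0;
}

/* Retry only on -EBUSY, and at most max_tries times, so a table that
 * never becomes available cannot spin the caller forever. Any other
 * error (for example -ENOMEM) is handed back to the caller unchanged. */
static int insert_retry_busy(struct fake_table *t, int key, int max_tries)
{
	int err = -EBUSY;
	int i;

	assert(max_tries > 0);
	for (i = 0; i < max_tries && err == -EBUSY; i++)
		err = fake_insert(t, key);
	return err;
}
```

Whether a real caller may loop like this depends on its context (for example, whether it is allowed to sleep or yield between attempts); the sketch leaves that decision to the caller.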
Re: [PATCH V2 net-next 1/5] net:hns: Add support of Hip06 SoC to the Hislicon Network Subsystem
On 11/18/2015 6:52 PM, David Miller wrote:
> From: Salil
> Date: Wed, 18 Nov 2015 02:52:23 +0800
>
>> @@ -387,19 +409,23 @@ static void hns_rcb_ring_get_cfg(struct hnae_queue *q, int ring_type)
>>  	struct rcb_common_cb *rcb_common;
>>  	struct ring_pair_cb *ring_pair_cb;
>>  	u32 buf_size;
>> -	u16 desc_num;
>> -	int irq_idx;
>> +	u16 desc_num, mdnum_ppkt;
>> +	int irq_idx, is_ver1;
>
> Please use "bool" and true/false for boolean conditions like is_ver1.
>
> Please audit your entire submission for this problem.

Thanks for your time and comments. As per your suggestions, I have
changed the data type of the variable "is_ver" to "bool" wherever
possible in PATCH V3, posted yesterday.

Best Regards
Salil
Kernel 4.1.12 crash
Hi all.

Today some BRASes running the 4.1.12 kernel crashed. Here are the crash
traces:

http://pastebin.com/p68hNS8R
http://pastebin.com/36ieRAM2
http://pastebin.com/3BRTVEB6

The same hardware works fine on the 3.2 kernel; the trouble was noticed
after the kernel upgrade.

What additional info is needed?
Re: [PATCH V3 net-next 4/5] net:hns: Add support of ethtool TSO set option for Hip06 in HNS
On 11/19/2015 11:58 PM, Salil Mehta wrote:

> From: Salil
>
> This patch adds the support of ethtool TSO option to V1 patch,
> meant to add support of Hip06 SoC to HNS
>
> Signed-off-by: Salil Mehta
> Signed-off-by: lisheng
> ---
>  drivers/net/ethernet/hisilicon/hns/hns_enet.c | 47 +
>  1 file changed, 47 insertions(+)
>
> diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
> index 055e14c..a0763ab 100644
> --- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
> +++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
> @@ -1386,6 +1386,51 @@ static int hns_nic_change_mtu(struct net_device *ndev, int new_mtu)
>  	return ret;
>  }
>
> +static int hns_nic_set_features(struct net_device *netdev,
> +				netdev_features_t features)
> +{
> +	struct hns_nic_priv *priv = netdev_priv(netdev);
> +	struct hnae_handle *h = priv->ae_handle;
> +
> +	switch (priv->enet_ver) {
> +	case AE_VERSION_1:
> +		if ((features & NETIF_F_TSO) || (features & NETIF_F_TSO6))

		if (features & (NETIF_F_TSO | NETIF_F_TSO6))

> +			netdev_info(netdev, "enet v1 do not support tso!\n");
> +			break;

   The *break* should have the same indentation level as *if*.

> +	default:
> +		if ((features & NETIF_F_TSO) || (features & NETIF_F_TSO6)) {

		if (features & (NETIF_F_TSO | NETIF_F_TSO6)) {

> +			priv->ops.fill_desc = fill_tso_desc;
> +			priv->ops.maybe_stop_tx = hns_nic_maybe_stop_tso;
> +			/* The chip only support 7*4096 */
> +			netif_set_gso_max_size(netdev, 7 * 4096);
> +			h->dev->ops->set_tso_stats(h, 1);
> +		} else {
> +			priv->ops.fill_desc = fill_v2_desc;
> +			priv->ops.maybe_stop_tx = hns_nic_maybe_stop_tx;
> +			h->dev->ops->set_tso_stats(h, 0);
> +		}
> +			break;

   Same here.

> +	}
> +	netdev->features = features;
> +	return 0;
> +}
> +
> +static netdev_features_t hns_nic_fix_features(
> +		struct net_device *netdev, netdev_features_t features)
> +{
> +	struct hns_nic_priv *priv = netdev_priv(netdev);
> +
> +	switch (priv->enet_ver) {
> +	case AE_VERSION_1:
> +		features &= ~(NETIF_F_TSO | NETIF_F_TSO6 |
> +				NETIF_F_HW_VLAN_CTAG_FILTER);
> +		break;
> +	default:
> +		break;
> +	}

   Here it's indented correctly.

> +	return features;
> +}
> +
>  /**
>   * nic_set_multicast_list - set mutl mac address
>   * @netdev: net device
[...]

MBR, Sergei
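The review comment above relies on the fact that testing two flag bits separately with || gives the same answer as testing them once against the OR of both masks. A minimal userspace sketch of that equivalence follows; the DEMO_F_* values and function names are made up for illustration and are not the kernel's NETIF_F_* definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits for illustration only; the real NETIF_F_*
 * values in the kernel are different. */
#define DEMO_F_TSO	(UINT64_C(1) << 0)
#define DEMO_F_TSO6	(UINT64_C(1) << 1)
#define DEMO_F_OTHER	(UINT64_C(1) << 2)

/* Form used in the patch: test each bit separately. */
static int wants_tso_long(uint64_t features)
{
	return (features & DEMO_F_TSO) || (features & DEMO_F_TSO6);
}

/* Form suggested in the review: test both bits with one mask. */
static int wants_tso_short(uint64_t features)
{
	return (features & (DEMO_F_TSO | DEMO_F_TSO6)) != 0;
}

/* Walk every combination of the three demo bits and check that both
 * forms agree. */
static void check_equivalence(void)
{
	uint64_t f;

	for (f = 0; f <= (DEMO_F_TSO | DEMO_F_TSO6 | DEMO_F_OTHER); f++)
		assert(wants_tso_long(f) == wants_tso_short(f));
}
```

The single-mask form is shorter and evaluates `features` once, which is presumably why it was suggested.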
tty,net: use-after-free in x25_asy_open_tty
Hi all,

While fuzzing with syzkaller inside a kvmtools guest running latest -next
kernel, I've hit:

[ 634.336761] ==
[ 634.338226] BUG: KASAN: use-after-free in x25_asy_open_tty+0x13d/0x490 at addr 8800a743efd0
[ 634.339558] Read of size 4 by task syzkaller_execu/8981
[ 634.340359] =
[ 634.341598] BUG kmalloc-512 (Not tainted): kasan: bad access detected
[ 634.342605] -
[ 634.342605]
[ 634.344196] Disabling lock debugging due to kernel taint
[ 634.345046] INFO: Allocated in r3964_open+0x55/0x590 age=3 cpu=0 pid=8981
[ 634.346165] ___slab_alloc+0x434/0x5b0
[ 634.346912] __slab_alloc.isra.37+0x79/0xd0
[ 634.347642] kmem_cache_alloc_trace+0xf5/0x350
[ 634.348398] r3964_open+0x55/0x590
[ 634.348952] tty_ldisc_open.isra.2+0x8a/0xd0
[ 634.349616] tty_set_ldisc+0x344/0x910
[ 634.350202] tty_ioctl+0x1534/0x1d70
[ 634.350762] do_vfs_ioctl+0xc90/0xd40
[ 634.351349] SyS_ioctl+0x6d/0xb0
[ 634.351890] entry_SYSCALL_64_fastpath+0x35/0x9e
[ 634.352548] INFO: Freed in r3964_close+0x23b/0x280 age=10 cpu=0 pid=8981
[ 634.353599] __slab_free+0x64/0x260
[ 634.354151] kfree+0x281/0x2f0
[ 634.354641] r3964_close+0x23b/0x280
[ 634.355219] tty_ldisc_close.isra.1+0xc2/0xd0
[ 634.355890] tty_set_ldisc+0x2bd/0x910
[ 634.356559] tty_ioctl+0x1534/0x1d70
[ 634.357121] do_vfs_ioctl+0xc90/0xd40
[ 634.357614] SyS_ioctl+0x6d/0xb0
[ 634.358133] entry_SYSCALL_64_fastpath+0x35/0x9e
[ 634.358853] INFO: Slab 0xea00029d0f00 objects=20 used=10 fp=0x8800a743efd0 flags=0x1f80004080
[ 634.360308] INFO: Object 0x8800a743efd0 @offset=12240 fp=0x8800a743f300
[ 634.360308]
[ 634.361652] Bytes b4 8800a743efc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 634.363048] Object 8800a743efd0: 00 f3 43 a7 00 88 ff ff ff ff ff ff 00 00 00 00 ..C.
[ 634.364424] Object 8800a743efe0: ff ff ff ff ff ff ff ff a0 7d 41 ab ff ff ff ff .}A.
[ 634.365835] Object 8800a743eff0: a0 cf a8 a9 ff ff ff ff 00 00 00 00 00 00 00 00
[ 634.367346] Object 8800a743f000: 00 e8 33 a4 ff ff ff ff 03 00 00 00 00 00 00 00 ..3.
[ 634.368721] Object 8800a743f010: 3e a2 5b 9c ff ff ff ff 80 c9 d6 b4 00 88 ff ff >.[.
[ 634.370139] Object 8800a743f020: 00 79 7a 6b 61 6c 6c 65 00 80 50 a7 00 88 ff ff .yzkalle..P.
[ 634.371635] Object 8800a743f030: 20 e7 50 a7 00 88 ff ff 00 00 00 00 00 00 00 00 .P.
[ 634.373000] Object 8800a743f040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 634.374418] Object 8800a743f050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 634.375843] Object 8800a743f060: 00 00 00 00 00 00 00 00 01 00 00 00 67 6d c1 1b gm..
[ 634.377339] Object 8800a743f070: 00 00 00 00 ad 4e ad de ff ff ff ff ad 4e ad de .N...N..
[ 634.378747] Object 8800a743f080: ff ff ff ff ff ff ff ff a0 48 2c a9 ff ff ff ff .H,.
[ 634.380174] Object 8800a743f090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 634.381584] Object 8800a743f0a0: c0 21 cd a3 ff ff ff ff 03 00 00 00 00 00 00 00 .!..
[ 634.382949] Object 8800a743f0b0: 00 00 00 00 01 00 00 00 b8 f0 43 a7 00 88 ff ff ..C.
[ 634.384365] Object 8800a743f0c0: b8 f0 43 a7 00 88 ff ff 00 00 00 00 00 00 00 00 ..C.
[ 634.385637] Object 8800a743f0d0: 68 f0 43 a7 00 88 ff ff 60 7d 41 ab ff ff ff ff h.C.`}A.
[ 634.387138] Object 8800a743f0e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 634.388563] Object 8800a743f0f0: 40 e8 33 a4 ff ff ff ff 01 00 00 00 00 00 00 00 @.3.
[ 634.389977] Object 8800a743f100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 634.391396] Object 8800a743f110: 00 00 00 00 00 80 00 00 00 00 00 00 00 00 00 00
[ 634.392868] Object 8800a743f120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 634.393649] Object 8800a743f130: c0 73 5b 9c ff ff ff ff d0 ef 43 a7 00 88 ff ff .s[...C.
[ 634.394483] Object 8800a743f140: 00 00 00 00 ff ff ff ff ff ff ff ff 00 00 00 00
[ 634.395281] Object 8800a743f150: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 634.396081] Object 8800a743f160: 00 00 00 00 00 00 00 00 20 7d 41 ab ff ff ff ff }A.
[ 634.396928] Object 8800a743f170: b0 cd a8 a9 ff ff ff ff 00 00 00 00 00 00 00 00
[ 634.397714] Object 8800a743f180: 80 e8 33 a4 ff ff ff ff 00 00 00 00 00 00 00 00 ..3.
[ 634.398511] Object 8800a743f190:
Re: [PATCH net-next] bpf: add show_fdinfo handler for maps
Hi Alexei,

On Fri, Nov 20, 2015, at 04:30, Alexei Starovoitov wrote:
> On Thu, Nov 19, 2015 at 09:12:30PM +0100, Hannes Frederic Sowa wrote:
> > On Thu, Nov 19, 2015, at 19:32, Alexei Starovoitov wrote:
> > > On Thu, Nov 19, 2015 at 07:19:24PM +0100, Hannes Frederic Sowa wrote:
> > > > On Thu, Nov 19, 2015, at 11:56, Daniel Borkmann wrote:
> > > > > Add a handler for show_fdinfo() to be used by the anon-inodes
> > > > > backend for eBPF maps, and dump the map specification there. Not
> > > > > only useful for admins, but also it provides a minimal way to
> > > > > compare specs from ELF vs pinned object.
> > > > >
> > > > > Signed-off-by: Daniel Borkmann
> > > >
> > > > Acked-by: Hannes Frederic Sowa
> > > >
> > > > Does it make sense to include bpf_htab->count in case of a hashmap?
> > >
> > > no. user space should not rely on such things. It can only be misused.
> >
> > Sorry, I don't get it. How can it be misused? As an admin it would
> > certainly be interesting to know the pressure on the map? Do you expect
> > kmsg messages from the eBPF program?
>
> If user space can see both 'count' and 'max_entries', it can be very
> tempting to start assuming 'full' and 'empty' state of the map which will
> lead to race conditions and bad design.
> bpf programs and maps are inherently multi-thread and concurrent.
> If userapp wants to do the counting of elements it needs to do so on its
> own and shoot itself in the foot eventually.
> For the same reason I don't want to see BPF_MAP_GET_COUNT command.

Hmmm... I don't understand your argument. This is the same with memory
management in general and we still report memory statistics to user
space.

I really would find it helpful to have a feeling if a map is nearly full
or nearly empty. We can also count collisions or the load in the buckets,
but some evidence of what is going on would be nice, wouldn't it?
Re: [PATCH] wireless: change cfg80211 regulatory domain info as debug messages
On Sun, 2015-11-15 at 19:25 +0100, Stefan Lippers-Hollmann wrote:
> Hi
>
> On 2015-11-15, Dave Young wrote:
> > cfg80211 module prints a lot of messages like below. Actually
> > printing once is acceptable but sometimes it will print again and
> > again, it looks very annoying. It is better to change these detail
> > messages to debugging only.
>
> It is a lot of info, easily repeated 3 times on boot, but it's also
> the only real chance to determine why you ended up with the
> regulatory domain settings you got, rather than just the values
> itself. Given that a lot (most?) of officially shipping wireless
> devices are misconfigured (wrong EEPROM regdom settings for the
> region they're sold in) and considering that the limits can even
> change at runtime (IEEE 802.11d), it is imho quite important not just
> to be able to see what the current restrictions (iw reg get) are, but
> also why the kernel settled on those.
>

Hm. I kinda sympathize with both points of view here, not sure what to
do. Maybe we could skip this for the world regdomain only? It doesn't
really change, and we typically don't care that much for it? That'd
probably get rid of most of the lines already.

Alternatively, perhaps the internal computations should be more
transparently visible through some other mechanism?

johannes
pull-request: wireless-drivers 2015-11-20
Hi Dave,

here are the first wireless-drivers fixes for 4.4. There are a few
patches adding new device support and a new firmware, but I think they
are justified at this early stage of the release cycle. Otherwise there
should not be anything special, all patches are really small.

Please let me know if you have any problems.

Kalle

The following changes since commit f1a454a37618b819f2528ccd234f77a02b3a6016:

  ipg: Remove ipg driver (2015-11-16 17:11:31 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers.git tags/wireless-drivers-for-davem-2015-11-20

for you to fetch changes up to eeec5d0ef7ee54a75e09e861c3cc44177b8752c7:

  rtlwifi: rtl8821ae: Fix lockups on boot (2015-11-17 15:58:53 +0200)

iwlwifi

* bump API to firmware 19 - not released yet.
* fix D3 flows (Luca)
* new device IDs (Oren)
* fix NULL pointer dereference (Avri)

ath10k

* fix invalid NSS for 4x4 devices
* add QCA9377 hw1.0 support
* fix QCA6174 regression with CE5 usage

wil6210

* new maintainer - Maya Erez

rtlwifi

* rtl8821ae: Fix lockups on boot

Avri Altman (1):
      iwlwifi: mvm: Avoid dereferencing sta if it was already flushed

Bartosz Markowski (4):
      ath10k: fix the currently supported QCA9377 target version name
      ath10k: update missing hw_params of QCA9377 hw1.1
      ath10k: introduce dev_id to hw_params
      ath10k: add QCA9377 hw1.0 support

Emmanuel Grumbach (1):
      iwlwifi: bump firmware API to 19

Kalle Valo (2):
      Merge tag 'iwlwifi-for-kalle-2015-11-15' of https://git.kernel.org/.../iwlwifi/iwlwifi-fixes
      Merge ath-current from ath.git

Larry Finger (1):
      rtlwifi: rtl8821ae: Fix lockups on boot

Luca Coelho (1):
      iwlwifi: mvm: don't overwrite the key indices in D3 entry

Oren Givon (1):
      iwlwifi: Add new PCI IDs for the 8260 series

Rajkumar Manoharan (2):
      ath10k: fix invalid NSS for 4x4 devices
      ath10k: poll HTT send completion when CE 5 is unused

Ryan Hsu (1):
      ath10k: override CE5 configuration for QCA6147 device

Vladimir Kondratiev (1):
      MAINTAINERS: wil6210: new maintainer - Maya Erez

 MAINTAINERS                                         |  2 +-
 drivers/net/wireless/ath/ath10k/core.c              | 49 ++-
 drivers/net/wireless/ath/ath10k/core.h              |  1 +
 drivers/net/wireless/ath/ath10k/hw.h                | 17 +++-
 drivers/net/wireless/ath/ath10k/mac.c               |  2 +-
 drivers/net/wireless/ath/ath10k/pci.c               | 53 +---
 drivers/net/wireless/iwlwifi/iwl-7000.c             |  2 +-
 drivers/net/wireless/iwlwifi/iwl-8000.c             |  2 +-
 drivers/net/wireless/iwlwifi/mvm/d3.c               |  8 +-
 drivers/net/wireless/iwlwifi/mvm/mac80211.c         | 11 ++-
 drivers/net/wireless/iwlwifi/mvm/sta.c              | 88 +++-
 drivers/net/wireless/iwlwifi/mvm/sta.h              |  4 +-
 drivers/net/wireless/iwlwifi/pcie/drv.c             | 19 -
 .../net/wireless/realtek/rtlwifi/rtl8821ae/hw.c     |  2 +-
 .../net/wireless/realtek/rtlwifi/rtl8821ae/sw.c     |  2 +-
 15 files changed, 189 insertions(+), 73 deletions(-)
Re: [PATCH 09/14] net: tcp_memcontrol: simplify linkage between socket and page counter
On Thu, Nov 12, 2015 at 06:41:28PM -0500, Johannes Weiner wrote:
> There won't be any separate counters for socket memory consumed by
> protocols other than TCP in the future. Remove the indirection and

I really want to believe you're right. And with vmpressure propagation
implemented properly you are likely to be right.

However, we might still want to account other socket protos to
memcg->memory in the unified hierarchy, e.g. UDP, or SCTP, or whatever
else. Adding new consumers should be trivial, but it will break the
legacy usecase, where only TCP sockets are supposed to be accounted.
What about adding a check to sock_update_memcg() so that it would enable
accounting only for TCP sockets in case legacy hierarchy is used?

For the same reason, I think we'd better rename memcg->tcp_mem to
something like memcg->sk_mem or we can even drop the cg_proto struct
altogether embedding its fields directly to mem_cgroup struct.

Also, I don't see any reason to have tcp_memcontrol.c file. It's tiny and
with this patch it does not depend on tcp code any more. Let's move it to
memcontrol.c?

Other than that this patch looks OK to me.

Thanks,
Vladimir

> link sockets directly to their owning memory cgroup.
>
> Signed-off-by: Johannes Weiner
Re: [PATCH v2 23/27] rt2x00: move under ralink vendor directory
Jakub Kicinski writes:

> On Wed, 18 Nov 2015 16:46:02 +0200, Kalle Valo wrote:
>> Part of reorganising wireless drivers directory and Kconfig.
>>
>> Signed-off-by: Kalle Valo
>
> For Ralink you could probably drop the rt2x00 directory. RaLink Tech.
> doesn't exist any more and rt2x00 contains drivers for all of their
> devices.
>
> Obviously this is just a suggestion, not a show stopper.

Like I said with a similar comment to brcm80211 I would like to do that
separately. This is 27 patches already and I don't want to make these any
more complicated than necessary.

--
Kalle Valo
Re: [PATCH 07/14] net: tcp_memcontrol: simplify the per-memcg limit access
On Thu, Nov 12, 2015 at 06:41:26PM -0500, Johannes Weiner wrote:
> tcp_memcontrol replicates the global sysctl_mem limit array per
> cgroup, but it only ever sets these entries to the value of the
> memory_allocated page_counter limit. Use the latter directly.
>
> Signed-off-by: Johannes Weiner

Reviewed-by: Vladimir Davydov
Re: [B.A.T.M.A.N.] [PATCH 3/3] batman-adv: Less function calls in batadv_is_ap_isolated() after error detection
>> -out:
>> +	batadv_tt_global_entry_free_ref(tt_global_entry);
>> +local_entry_free:
>> +	batadv_tt_local_entry_free_ref(tt_local_entry);
>> +vlan_free:
>>  	batadv_softif_vlan_free_ref(vlan);
>> -	if (tt_global_entry)
>> -		batadv_tt_global_entry_free_ref(tt_global_entry);
>> -	if (tt_local_entry)
>> -		batadv_tt_local_entry_free_ref(tt_local_entry);
>>  	return ret;

> if you really want to make this codestyle change, I'd suggest you to go
> through the whole batman-adv code and apply the same change where needed.

Thanks for your interest in similar source code changes.

I would prefer general acceptance for this specific update suggestion
before I invest further software development effort in the affected
network module.

> It does not make sense to change the codestyle in one spot only.

I agree in the sense that it would be nice if more places could be
improved as well.

> On top of that, by going through the batman-adv code you might agree
> that the current style is actually not a bad idea.

I got the impression that the current Linux coding style convention
disagrees with the affected jump label selection to some degree,
doesn't it?

Regards,
Markus
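The quoted hunk replaces a single "out:" label plus NULL checks with one label per acquired reference. A minimal userspace sketch of that pattern follows, under stated assumptions: struct ref, ref_get(), ref_put() and demo_lookup_all() are hypothetical helpers for illustration, and only the label names are borrowed from the patch. It does not reproduce the logic of batadv_is_ap_isolated().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference-counted object, for illustration only. */
struct ref {
	int count;
};

/* Take a reference if the object is "available", else return NULL. */
static struct ref *ref_get(struct ref *r, int available)
{
	if (!available)
		return NULL;
	r->count++;
	return r;
}

static void ref_put(struct ref *r)
{
	assert(r && r->count > 0);
	r->count--;
}

/* Acquire three references in order. On failure, jump to the label that
 * releases exactly what has been taken so far, in reverse order of
 * acquisition, so no "if (ptr)" checks are needed on the exit path. */
static int demo_lookup_all(struct ref *vlan, struct ref *local,
			   struct ref *global, int have_local,
			   int have_global)
{
	int ret = 0;

	if (!ref_get(vlan, 1))
		return 0;
	if (!ref_get(local, have_local))
		goto vlan_free;
	if (!ref_get(global, have_global))
		goto local_entry_free;

	ret = 1;

	ref_put(global);
local_entry_free:
	ref_put(local);
vlan_free:
	ref_put(vlan);
	return ret;
}
```

The trade-off debated in the thread is visible here: the labelled form avoids redundant checks, while a single exit label with NULL checks keeps every error path identical.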
Re: [PATCH RFT v2] sh_eth: fix kernel oops in skb_put()
Hi Sergei,

On Fri, 20 Nov 2015 02:53:39 +0900, Sergei Shtylyov wrote:
>
> Shoji-san, can I push this patch to net.git? I doubt that it has
> ill effects in itself -- the reason of the slowdown you're seeing
> should be somewhere else...

Sure. I've tested and the null access problem is gone for sure. I'm
pretty sure that the fix won't break anything.

It's going to take, however, some more time to pin down the slow down
problem. I'll report when I find the cause.

Thanks,
--
yashi
[PATCH net 1/2] net: ipmr: fix static mfc/dev leaks on table destruction
From: Nikolay Aleksandrov

When destroying an mrt table the static mfc entries and the static
devices are kept, which leads to devices that can never be destroyed
(because of refcnt taken) and leaked memory, for example:

unreferenced object 0x880034c144c0 (size 192):
  comm "mfc-broken", pid 4777, jiffies 4320349055 (age 46001.964s)
  hex dump (first 32 bytes):
    98 53 f0 34 00 88 ff ff 98 53 f0 34 00 88 ff ff .S.4.S.4
    ef 0a 0a 14 01 02 03 04 00 00 00 00 01 00 00 00
  backtrace:
    [] kmemleak_alloc+0x4e/0xb0
    [] kmem_cache_alloc+0x190/0x300
    [] ip_mroute_setsockopt+0x5cb/0x910
    [] do_ip_setsockopt.isra.11+0x105/0xff0
    [] ip_setsockopt+0x30/0xa0
    [] raw_setsockopt+0x33/0x90
    [] sock_common_setsockopt+0x14/0x20
    [] SyS_setsockopt+0x71/0xc0
    [] entry_SYSCALL_64_fastpath+0x16/0x7a
    [] 0x

Make sure that everything is cleaned on netns destruction.

Signed-off-by: Nikolay Aleksandrov
---
This doesn't fix a specific commit as the behaviour seems to have been
like that since beginning of use of mroute_clean_tables to cleanup on
netns exit, but the fix can be sent back up to acbb219d5f53 ("net: ipv4:
ipmr_expire_timer causes crash when removing net namespace") which
started cleaning up on netns destruction.

 net/ipv4/ipmr.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 92dd4b74d513..292123bc30fa 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -134,7 +134,7 @@ static int __ipmr_fill_mroute(struct mr_table *mrt, struct sk_buff *skb,
 			      struct mfc_cache *c, struct rtmsg *rtm);
 static void mroute_netlink_event(struct mr_table *mrt, struct mfc_cache *mfc,
 				 int cmd);
-static void mroute_clean_tables(struct mr_table *mrt);
+static void mroute_clean_tables(struct mr_table *mrt, bool all);
 static void ipmr_expire_process(unsigned long arg);
 
 #ifdef CONFIG_IP_MROUTE_MULTIPLE_TABLES
@@ -350,7 +350,7 @@ static struct mr_table *ipmr_new_table(struct net *net, u32 id)
 static void ipmr_free_table(struct mr_table *mrt)
 {
 	del_timer_sync(&mrt->ipmr_expire_timer);
-	mroute_clean_tables(mrt);
+	mroute_clean_tables(mrt, true);
 	kfree(mrt);
 }
 
@@ -1208,7 +1208,7 @@ static int ipmr_mfc_add(struct net *net, struct mr_table *mrt,
  *	Close the multicast socket, and clear the vif tables etc
  */
 
-static void mroute_clean_tables(struct mr_table *mrt)
+static void mroute_clean_tables(struct mr_table *mrt, bool all)
 {
 	int i;
 	LIST_HEAD(list);
@@ -1217,8 +1217,9 @@ static void mroute_clean_tables(struct mr_table *mrt)
 	/* Shut down all active vif entries */
 
 	for (i = 0; i < mrt->maxvif; i++) {
-		if (!(mrt->vif_table[i].flags & VIFF_STATIC))
-			vif_delete(mrt, i, 0, &list);
+		if (!all && (mrt->vif_table[i].flags & VIFF_STATIC))
+			continue;
+		vif_delete(mrt, i, 0, &list);
 	}
 	unregister_netdevice_many(&list);
 
@@ -1226,7 +1227,7 @@ static void mroute_clean_tables(struct mr_table *mrt)
 
 	for (i = 0; i < MFC_LINES; i++) {
 		list_for_each_entry_safe(c, next, &mrt->mfc_cache_array[i], list) {
-			if (c->mfc_flags & MFC_STATIC)
+			if (!all && (c->mfc_flags & MFC_STATIC))
 				continue;
 			list_del_rcu(&c->list);
 			mroute_netlink_event(mrt, c, RTM_DELROUTE);
@@ -1261,7 +1262,7 @@ static void mrtsock_destruct(struct sock *sk)
 						    NETCONFA_IFINDEX_ALL,
 						    net->ipv4.devconf_all);
 			RCU_INIT_POINTER(mrt->mroute_sk, NULL);
-			mroute_clean_tables(mrt);
+			mroute_clean_tables(mrt, false);
 		}
 	}
 	rtnl_unlock();
-- 
2.4.3
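The effect of the new "all" parameter can be illustrated outside the kernel. The following is a minimal userspace sketch under stated assumptions: demo_entry, DEMO_STATIC and demo_clean_table() are hypothetical names, and a plain array stands in for the vif table and mfc cache lists. It only shows the skip condition "!all && static", not the real locking or netlink notifications.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag and entry type, for illustration only. */
#define DEMO_STATIC 0x1u

struct demo_entry {
	unsigned int flags;
	bool present;
};

/* Remove entries from the table and return how many were removed.
 * With all == false, static entries are kept (the socket-close case);
 * with all == true, everything goes (the table-destruction case). */
static size_t demo_clean_table(struct demo_entry *tbl, size_t n, bool all)
{
	size_t removed = 0;
	size_t i;

	assert(tbl != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (!tbl[i].present)
			continue;
		if (!all && (tbl[i].flags & DEMO_STATIC))
			continue;
		tbl[i].present = false;
		removed++;
	}
	return removed;
}
```

Calling it first with all == false and then with all == true mirrors the two call sites in the patch: closing the multicast routing socket leaves static entries in place, while freeing the table removes them too.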
[PATCH net 0/2] net: ipmr, ip6mr: fix static leaks on netns destruction
From: Nikolay Aleksandrov

Hi,
While testing various ipmr scenarios I found that static mfc entries and
static devices get leaked on netns/table destruction because
mroute_clean_tables doesn't delete them. It is fine to leave the static
entries when cleaning up the mrtsock, but when destroying the table they
need to be removed.

Cheers,
 Nik

Nikolay Aleksandrov (2):
  net: ipmr: fix static mfc/dev leaks on table destruction
  net: ip6mr: fix static mfc/dev leaks on table destruction

 net/ipv4/ipmr.c  | 15 ++++++++-------
 net/ipv6/ip6mr.c | 15 ++++++++-------
 2 files changed, 16 insertions(+), 14 deletions(-)

-- 
2.4.3
[PATCH net 2/2] net: ip6mr: fix static mfc/dev leaks on table destruction
From: Nikolay Aleksandrov

Similar to ipv4, when destroying an mrt table the static mfc entries and
the static devices are kept, which leads to devices that can never be
destroyed (because of refcnt taken) and leaked memory. Make sure that
everything is cleaned up on netns destruction.

Fixes: 8229efdaef1e ("netns: ip6mr: enable namespace support in ipv6 multicast forwarding code")
CC: Benjamin Thery
Signed-off-by: Nikolay Aleksandrov
---
 net/ipv6/ip6mr.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index ad19136086dd..7a4a1b81dbb6 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -118,7 +118,7 @@ static void mr6_netlink_event(struct mr6_table *mrt,
 			      struct mfc6_cache *mfc, int cmd);
 static int ip6mr_rtm_dumproute(struct sk_buff *skb,
 			       struct netlink_callback *cb);
-static void mroute_clean_tables(struct mr6_table *mrt);
+static void mroute_clean_tables(struct mr6_table *mrt, bool all);
 static void ipmr_expire_process(unsigned long arg);
 
 #ifdef CONFIG_IPV6_MROUTE_MULTIPLE_TABLES
@@ -334,7 +334,7 @@ static struct mr6_table *ip6mr_new_table(struct net *net, u32 id)
 static void ip6mr_free_table(struct mr6_table *mrt)
 {
 	del_timer_sync(&mrt->ipmr_expire_timer);
-	mroute_clean_tables(mrt);
+	mroute_clean_tables(mrt, true);
 	kfree(mrt);
 }
 
@@ -1542,7 +1542,7 @@ static int ip6mr_mfc_add(struct net *net, struct mr6_table *mrt,
  *	Close the multicast socket, and clear the vif tables etc
  */
 
-static void mroute_clean_tables(struct mr6_table *mrt)
+static void mroute_clean_tables(struct mr6_table *mrt, bool all)
 {
 	int i;
 	LIST_HEAD(list);
@@ -1552,8 +1552,9 @@ static void mroute_clean_tables(struct mr6_table *mrt)
 	 *	Shut down all active vif entries
 	 */
 	for (i = 0; i < mrt->maxvif; i++) {
-		if (!(mrt->vif6_table[i].flags & VIFF_STATIC))
-			mif6_delete(mrt, i, &list);
+		if (!all && (mrt->vif6_table[i].flags & VIFF_STATIC))
+			continue;
+		mif6_delete(mrt, i, &list);
 	}
 	unregister_netdevice_many(&list);
 
@@ -1562,7 +1563,7 @@ static void mroute_clean_tables(struct mr6_table *mrt)
 	 */
 	for (i = 0; i < MFC6_LINES; i++) {
 		list_for_each_entry_safe(c, next, &mrt->mfc6_cache_array[i], list) {
-			if (c->mfc_flags & MFC_STATIC)
+			if (!all && (c->mfc_flags & MFC_STATIC))
 				continue;
 			write_lock_bh(&mrt_lock);
 			list_del(&c->list);
@@ -1625,7 +1626,7 @@ int ip6mr_sk_done(struct sock *sk)
 						     net->ipv6.devconf_all);
 			write_unlock_bh(&mrt_lock);
 
-			mroute_clean_tables(mrt);
+			mroute_clean_tables(mrt, false);
 			err = 0;
 			break;
 		}
-- 
2.4.3
Re: rhashtable: how to deal with that rhashtable_lookup_insert_key return -EBUSY
On Fri, Nov 20, 2015 at 01:24:01PM +0100, Phil Sutter wrote:
>
> Herbert, did you manage to reproduce the problem meanwhile? If so, was
> there any progress on fixing rhashtable? Otherwise, I could respin my
> patch from [1] to cover only -EBUSY case by default and add a parameter
> to make non-permanent -ENOMEM visible.

No, I have not been able to reproduce this yet.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH 12/14] mm: memcontrol: move socket code for unified hierarchy accounting
On Thu, Nov 12, 2015 at 06:41:31PM -0500, Johannes Weiner wrote:
> The unified hierarchy memory controller will account socket
> memory. Move the infrastructure functions accordingly.
>
> Signed-off-by: Johannes Weiner
> Acked-by: Michal Hocko

Reviewed-by: Vladimir Davydov
Re: [PATCH 06/14] net: tcp_memcontrol: remove dead per-memcg count of allocated sockets
On Thu, Nov 12, 2015 at 06:41:25PM -0500, Johannes Weiner wrote:
> The number of allocated sockets is used for calculations in the soft
> limit phase, where packets are accepted but the socket is under memory
> pressure. Since there is no soft limit phase in tcp_memcontrol, and
> memory pressure is only entered when packets are already dropped, this
> is actually dead code. Remove it.

Actually, we can get into the soft limit phase due to the global limit
(tcp_memory_pressure is set), but then using per-memcg sockets_allocated
counter is just wrong.

>
> As this is the last user of parent_cg_proto(), remove that too.
>
> Signed-off-by: Johannes Weiner

Reviewed-by: Vladimir Davydov
Re: [PATCH 08/14] net: tcp_memcontrol: sanitize tcp memory accounting callbacks
On Thu, Nov 12, 2015 at 06:41:27PM -0500, Johannes Weiner wrote:
> There won't be a tcp control soft limit, so integrating the memcg code
> into the global skmem limiting scheme complicates things
> unnecessarily. Replace this with simple and clear charge and uncharge
> calls--hidden behind a jump label--to account skb memory.
>
> Note that this is not purely aesthetic: as a result of shoehorning the
> per-memcg code into the same memory accounting functions that handle
> the global level, the old code would compare the per-memcg consumption
> against the smaller of the per-memcg limit and the global limit. This
> allowed the total consumption of multiple sockets to exceed the global
> limit, as long as the individual sockets stayed within bounds. After
> this change, the code will always compare the per-memcg consumption to
> the per-memcg limit, and the global consumption to the global limit,
> and thus close this loophole.
>
> Without a soft limit, the per-memcg memory pressure state in sockets
> is generally questionable. However, we did it until now, so we
> continue to enter it when the hard limit is hit, and packets are
> dropped, to let other sockets in the cgroup know that they shouldn't
> grow their transmit windows, either. However, keep it simple in the
> new callback model and leave memory pressure lazily when the next
> packet is accepted (as opposed to doing it synchronously when packets
> are processed). When packets are dropped, network performance will
> already be in the toilet, so that should be a reasonable trade-off.
>
> As described above, consumption is now checked on the per-memcg level
> and the global level separately. Likewise, memory pressure states are
> maintained on both the per-memcg level and the global level, and a
> socket is considered under pressure when either level asserts as much.
>
> Signed-off-by: Johannes Weiner

It leaves the legacy functionality intact, while making the code look much better.

Reviewed-by: Vladimir Davydov
Re: [PATCH 6/7] sock, cgroup: add sock->sk_cgroup
Hi Tejun,

On 11/19/2015 07:52 PM, Tejun Heo wrote:
> +/*
> + * There's a theoretical window where the following accessors race with
> + * updaters and return part of the previous pointer as the prioidx or
> + * classid. Such races are short-lived and the result isn't critical.
> + */
>  static inline u16 sock_cgroup_prioidx(struct sock_cgroup_data *skcd)
>  {
> -	return skcd->prioidx;
> +	return (skcd->is_data & 1) ? skcd->prioidx : 1;
>  }
>
>  static inline u32 sock_cgroup_classid(struct sock_cgroup_data *skcd)
>  {
> -	return skcd->classid;
> +	return (skcd->is_data & 1) ? skcd->classid : 0;
>  }

I am still trying to understand what the code does, hence this stupid question: why does sock_cgroup_prioidx() return 1 when the value is not data, while sock_cgroup_classid() returns 0?

thanks,
daniel
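
For readers following the question: the quoted accessors test the low bit of a field to decide whether the storage currently holds packed data or something else (the patch comment mentions a pointer). Below is a minimal userspace sketch of that low-bit tagging idea. Everything with a demo_ prefix, including the bit layout, is an assumption made up for illustration and is not the kernel's sock_cgroup_data definition; the fallback values 1 and 0 are simply copied from the quoted patch, and the sketch does not try to answer why those values were chosen.

```c
#include <stdint.h>

/* Hypothetical stand-in for sock_cgroup_data: a single 64-bit word that
 * holds either an aligned pointer (low bit 0) or packed data (low bit 1). */
struct demo_skcd {
	uint64_t val;
};

/* Store packed data and set the low "is data" tag bit.
 * Assumed layout: bit 0 = tag, bits 16..31 = prioidx, bits 32..63 = classid. */
static inline void demo_set_data(struct demo_skcd *s, uint16_t prioidx,
				 uint32_t classid)
{
	s->val = ((uint64_t)classid << 32) | ((uint64_t)prioidx << 16) | 1u;
}

/* Store a pointer; any object aligned to 2 bytes or more has low bit 0. */
static inline void demo_set_ptr(struct demo_skcd *s, const void *p)
{
	s->val = (uint64_t)(uintptr_t)p;
}

/* Accessors mirror the quoted patch: use the field only if the tag bit is
 * set, otherwise fall back to a fixed value (1 and 0, as in the patch). */
static inline uint16_t demo_prioidx(const struct demo_skcd *s)
{
	return (s->val & 1) ? (uint16_t)(s->val >> 16) : 1;
}

static inline uint32_t demo_classid(const struct demo_skcd *s)
{
	return (s->val & 1) ? (uint32_t)(s->val >> 32) : 0;
}
```

With this layout, a word written by demo_set_ptr() always yields the fallbacks, and a word written by demo_set_data() yields the stored values.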
Re: [PATCH 13/14] mm: memcontrol: account socket memory in unified hierarchy memory controller
On Thu, Nov 12, 2015 at 06:41:32PM -0500, Johannes Weiner wrote:
...
> @@ -5514,16 +5550,43 @@ void sock_release_memcg(struct sock *sk)
>   */
>  bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
>  {
> +	unsigned int batch = max(CHARGE_BATCH, nr_pages);
>  	struct page_counter *counter;
> +	bool force = false;
>
> -	if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
> -				    nr_pages, &counter)) {
> -		memcg->tcp_mem.memory_pressure = 0;
> +#ifdef CONFIG_MEMCG_KMEM
> +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
> +		if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
> +					    nr_pages, &counter)) {
> +			memcg->tcp_mem.memory_pressure = 0;
> +			return true;
> +		}
> +		page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
> +		memcg->tcp_mem.memory_pressure = 1;
> +		return false;
> +	}
> +#endif
> +	if (consume_stock(memcg, nr_pages))
>  		return true;
> +retry:
> +	if (page_counter_try_charge(&memcg->memory, batch, &counter))
> +		goto done;
> +
> +	if (batch > nr_pages) {
> +		batch = nr_pages;
> +		goto retry;
>  	}
> -	page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
> -	memcg->tcp_mem.memory_pressure = 1;
> -	return false;
> +
> +	page_counter_charge(&memcg->memory, batch);
> +	force = true;
> +done:
> +	css_get_many(&memcg->css, batch);

Is there any point in taking a css reference for each charged page? For kmem it is absolutely necessary, because dangling slabs must block destruction of memcg's kmem caches, which are destroyed on css_free. But for sockets there's no such problem: memcg will be destroyed only after all sockets are destroyed and therefore uncharged (since sock_update_memcg pins css).

> +	if (batch > nr_pages)
> +		refill_stock(memcg, batch - nr_pages);
> +
> +	schedule_work(&memcg->socket_work);

I think it's suboptimal to schedule the work even if we are below the high threshold. BTW why do we need this work at all? Why is reclaim_high called from task_work not enough?

Thanks,
Vladimir

> +
> +	return !force;
>  }
>
>  /**
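
To make the control flow of the quoted hunk easier to follow, here is a rough userspace sketch of the "try a whole batch, fall back to the exact amount, then force the charge" pattern. The demo_counter type, the batch size of 32, and all demo_ names are assumptions for illustration only; the per-cpu stock, css references, and the scheduled work from the patch are deliberately left out, so this is not the memcg implementation.

```c
#include <stdbool.h>

/* Toy stand-in for a page_counter: usage may exceed limit only if forced. */
struct demo_counter {
	unsigned long usage;
	unsigned long limit;
};

#define DEMO_CHARGE_BATCH 32UL	/* assumed batch size, for illustration */

/* Charge n pages only if that keeps usage within the limit. */
static bool demo_try_charge(struct demo_counter *c, unsigned long n)
{
	if (c->usage + n > c->limit)
		return false;
	c->usage += n;
	return true;
}

/*
 * Mirrors the retry logic of the quoted hunk: try the larger batch first,
 * retry with exactly nr_pages, and finally charge past the limit.
 * Returns true if the charge fit under the limit, false if it was forced.
 * *surplus receives the pages charged beyond nr_pages (the amount the
 * patch passes to refill_stock()).
 */
static bool demo_charge_skmem(struct demo_counter *c, unsigned long nr_pages,
			      unsigned long *surplus)
{
	unsigned long batch = nr_pages > DEMO_CHARGE_BATCH ?
			      nr_pages : DEMO_CHARGE_BATCH;
	bool force = false;

retry:
	if (demo_try_charge(c, batch))
		goto done;

	if (batch > nr_pages) {
		batch = nr_pages;
		goto retry;
	}

	c->usage += batch;	/* forced charge, may exceed the limit */
	force = true;
done:
	*surplus = batch - nr_pages;
	return !force;
}
```

In this toy model, a counter with limit 40 accepts a first 4-page request as a full 32-page batch (surplus 28), a second 4-page request only as an exact charge, and a following 8-page request only as a forced charge.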
Re: [PATCH net-next] net: encx24j600: move rev announcement to probe function
From: j...@ringle.org
Date: Wed, 18 Nov 2015 16:22:21 -0500

> From: Jon Ringle
>
> When encx24j600 is open and closed many times due to userspace polling the
> interface, the log gets noise with this log message.
>
> Moving this to encx24j600_spi_probe function where it belongs.
>
> Signed-off-by: Jon Ringle

Applied, thanks.
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)
Jason Baron writes:
> On 11/19/2015 06:52 PM, Rainer Weikusat wrote:
>
> [...]
>
>> @@ -1590,21 +1718,35 @@ restart:
>>  		goto out_unlock;
>>  	}
>>
>> -	if (unix_peer(other) != sk && unix_recvq_full(other)) {
>> -		if (!timeo) {
>> +	if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {
>> +		if (timeo) {
>> +			timeo = unix_wait_for_peer(other, timeo);
>> +
>> +			err = sock_intr_errno(timeo);
>> +			if (signal_pending(current))
>> +				goto out_free;
>> +
>> +			goto restart;
>> +		}
>> +
>> +		if (unix_peer(sk) != other ||
>> +		    unix_dgram_peer_wake_me(sk, other)) {
>>  			err = -EAGAIN;
>>  			goto out_unlock;
>>  		}
>
> Hi,
>
> So here we are calling unix_dgram_peer_wake_me() without the sk lock the
> first time through - right?

Yes. And this is obviously wrong. I spent most of the 'evening time' (some people would call that 'night time') testing this and didn't get to read through it again yet. Thank you for pointing this out. I'll send an updated patch shortly.
Re: [PATCH net 1/2] tcp: disable Fast Open on timeouts after handshake
From: Yuchung Cheng
Date: Wed, 18 Nov 2015 18:17:30 -0800

> Some middle-boxes black-hole the data after the Fast Open handshake
> (https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf).
> The exact reason is unknown. The work-around is to disable Fast Open
> temporarily after multiple recurring timeouts with few or no data
> delivered in the established state.
>
> Signed-off-by: Yuchung Cheng
> Signed-off-by: Eric Dumazet
> Reported-by: Christoph Paasch

Applied and queued up for -stable.

Just out of curiosity, why isn't a test for zero data sufficient? Do these middle-boxes sometimes not black-hole all of the data?
Fw: [Bug 108191] New: tcp option TCP_USER_TIMEOUT working incorrect within tcp keepalive.
Begin forwarded message:

Date: Fri, 20 Nov 2015 11:03:58 +
From: "bugzilla-dae...@bugzilla.kernel.org"
To: "shemmin...@linux-foundation.org"
Subject: [Bug 108191] New: tcp option TCP_USER_TIMEOUT working incorrect within tcp keepalive.

https://bugzilla.kernel.org/show_bug.cgi?id=108191

Bug ID: 108191
Summary: tcp option TCP_USER_TIMEOUT working incorrect within tcp keepalive.
Product: Networking
Version: 2.5
Kernel Version: 4.3
Hardware: All
OS: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: IPV4
Assignee: shemmin...@linux-foundation.org
Reporter: jj@163.com
Regression: No

TCP_USER_TIMEOUT specifies how long a sent packet may remain unacknowledged before the connection is closed. In the retransmission case, retransmits_timed_out() checks this timeout and it works well. In the keepalive case, however, the code below may be buggy:

	elapsed = keepalive_time_elapsed(tp);

	if (elapsed >= keepalive_time_when(tp)) {
		/* If the TCP_USER_TIMEOUT option is enabled, use that
		 * to determine when to timeout instead.
		 */
		if ((icsk->icsk_user_timeout != 0 &&
		    elapsed >= icsk->icsk_user_timeout &&
		    icsk->icsk_probes_out > 0) ||
		    (icsk->icsk_user_timeout == 0 &&
		    icsk->icsk_probes_out >= keepalive_probes(tp))) {
			tcp_send_active_reset(sk, GFP_ATOMIC);
			tcp_write_err(sk);
			goto out;
		}
	.

The check

	elapsed >= icsk->icsk_user_timeout

should be

	elapsed-keepalive_time_when(tp) >= icsk->icsk_user_timeout

Here is the timeline:

	idle ... keepalive1 .. keepalive2 ... keepalive_probes
	<- katime_when ->
	<- keepalive_intvl ->
	<- TCP_USER_TIMEOUT ->      // user expected timeout
	<---elapsed>
	<-elapsed-katime_when->

	/* test code */
	int v;
	v=1; setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &v, 4);
	v=30; setsockopt(fd, SOL_TCP, TCP_KEEPIDLE, &v, 4);
	v=5; setsockopt(fd, SOL_TCP, TCP_KEEPINTVL, &v, 4);
	v=3; setsockopt(fd, SOL_TCP, TCP_KEEPCNT, &v, 4);
	v=20*1000; setsockopt(fd, SOL_TCP, TCP_USER_TIMEOUT, &v, 4);
	connect(fd, addr, sizeof(addr));
	// when connect
	// drop the recv data
	// iptables -t filter -A INPUT --protocol tcp --dport -j DROP
	pause();

After 30 s, TCP starts sending keepalives and then closes the connection without doing the first retransmit (because 30+5 > 20), whereas we expect it to wait 20 seconds.

--
You are receiving this mail because:
You are the assignee for the bug.
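
To make the report's arithmetic concrete, the sketch below evaluates the reset condition quoted from the bug next to the reporter's proposed variant. It is a userspace model only: the demo_ka struct, the demo_ function names, and the use of plain milliseconds are assumptions for illustration, and nothing here claims the proposal is the right fix.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the socket settings used in the report. */
struct demo_ka {
	uint32_t keepidle_ms;		/* keepalive_time_when() */
	uint32_t user_timeout_ms;	/* icsk_user_timeout, 0 = unset */
	uint32_t probes_max;		/* keepalive_probes() */
};

/* Reset condition as quoted in the report. */
static bool demo_reset_current(const struct demo_ka *k, uint32_t elapsed_ms,
			       uint32_t probes_out)
{
	return (k->user_timeout_ms != 0 &&
		elapsed_ms >= k->user_timeout_ms && probes_out > 0) ||
	       (k->user_timeout_ms == 0 && probes_out >= k->probes_max);
}

/* Reporter's proposal: measure the user timeout from the end of the idle
 * period, i.e. compare (elapsed - keepidle) against the user timeout. */
static bool demo_reset_proposed(const struct demo_ka *k, uint32_t elapsed_ms,
				uint32_t probes_out)
{
	uint32_t since_idle = elapsed_ms > k->keepidle_ms ?
			      elapsed_ms - k->keepidle_ms : 0;

	return (k->user_timeout_ms != 0 &&
		since_idle >= k->user_timeout_ms && probes_out > 0) ||
	       (k->user_timeout_ms == 0 && probes_out >= k->probes_max);
}
```

With the report's settings (idle 30 s, interval 5 s, user timeout 20 s), the quoted condition already holds at 35 s with one probe outstanding, while the proposed variant only holds once 50 s have elapsed.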
Re: Add a SOCK_DESTROY operation to close sockets from userspace
On 11/19/15 6:55 PM, Lorenzo Colitti wrote:
> upstream alternatives. We might even be able to show up at netdev 1.1
> for some higher-bandwidth conversations.

This use case would make a great talk for netdev. There are similar problems when netdevs are moved between namespaces (and VRFs).
Re: [PATCH net] bnx2x: Fix vxlan removal
From: Yuval Mintz
Date: Thu, 19 Nov 2015 11:56:51 +0200

> Commit ac7eccd4d48fc "bnx2x: track vxlan port count" contains a bug -
> Instead of achieving the required goal, vxlan configuration would not
> be removed since we're decrementing the port instead of the counter.
>
> CC: Jiri Benc
> Signed-off-by: Yuval Mintz

Applied, thank you.
Re: [PATCH net-next] bpf: add show_fdinfo handler for maps
From: Daniel Borkmann
Date: Thu, 19 Nov 2015 11:56:22 +0100

> Add a handler for show_fdinfo() to be used by the anon-inodes
> backend for eBPF maps, and dump the map specification there. Not
> only useful for admins, but also it provides a minimal way to
> compare specs from ELF vs pinned object.
>
> Signed-off-by: Daniel Borkmann

Applied, thanks.
Re: network stream fairness
On 11/09/2015 05:07 PM, Eric Dumazet wrote: > On Mon, 2015-11-09 at 16:53 +0100, Niklas Cassel wrote: >> On 11/09/2015 04:50 PM, Eric Dumazet wrote: >>> On Mon, 2015-11-09 at 16:41 +0100, Niklas Cassel wrote: I have a ethernet driver for a 100 Mbps NIC. The NIC has dedicated hardware for offloading. The driver has implemented TSO, GSO and BQL. Since the CPU on the SoC is rather weak, I'd rather not increase the CPU load by turning off offloading. Since commit 605ad7f184b6 ("tcp: refine TSO autosizing") the bandwidth is no longer fair between streams. see output at the end of the mail, where I'm testing with 2 streams. If I revert 605ad7f184b6 on 4.3, I get a stable 45 Mbps per stream. I can also use vanilla 4.3 and do: echo 3000 > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max to also get a stable 45 Mbps per stream. My question is, am I supposed to set the BQL limit explicitly? It is possible that I have missed something in my driver, but my understanding is that the TCP stack sets and adjusts the BQL limit automatically. Perhaps the following info might help: After running iperf3 on vanilla 4.3: /sys/class/net/eth0/queues/tx-0/byte_queue_limits/ limit 89908 limit_max 1879048192 After running iperf3 on vanilla 4.3 + BQL explicitly set: /sys/class/net/eth0/queues/tx-0/byte_queue_limits/ limit 3000 limit_max 3000 After running iperf3 on 4.3 + 605ad7f184b6 reverted: /sys/class/net/eth0/queues/tx-0/byte_queue_limits/ limit 8886 limit_max 1879048192 >>> >>> There is absolutely nothing ensuring fairness among multiple TCP flows. >>> >>> One TCP flow can very easily grab whole bandwidth for itself, there are >>> numerous descriptions of this phenomena in various TCP studies. >>> >>> This is why we have packet schedulers ;) >> >> Oh.. How stupid of me, I forgot to mention.. all of the measurements were >> done with fq_codel. > > Your numbers suggest a cwnd growth then, which might show a CC bug. 
> > Please run the following when your iper3 runs on regular 4.3 kernel > > for i in `seq 1 10` > do > ss -temoi dst 192.168.0.141 > sleep 1 > done > > I've been able to reproduce this on a ARMv7, single core, 100 Mbps NIC. Kernel vanilla 4.3, driver has BQL implemented, but is unfortunately not upstreamed. ethtool -k eth0 Offload parameters for eth0: rx-checksumming: off tx-checksumming: on scatter-gather: off tcp segmentation offload: off udp fragmentation offload: off generic segmentation offload: off ip addr show dev eth0 2: eth0:mtu 1500 qdisc fq_codel state UP group default qlen 1000 link/ether 00:40:8c:18:58:c8 brd ff:ff:ff:ff:ff:ff inet 192.168.0.136/24 brd 192.168.0.255 scope global eth0 valid_lft forever preferred_lft forever # before iperf3 run tc -s -d qdisc qdisc noqueue 0: dev lo root refcnt 2 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 qdisc fq_codel 0: dev eth0 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn Sent 21001 bytes 45 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0 new_flows_len 0 old_flows_len 0 sysctl net.ipv4.tcp_congestion_control net.ipv4.tcp_congestion_control = cubic # after iperf3 run tc -s -d qdisc qdisc noqueue 0: dev lo root refcnt 2 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 qdisc fq_codel 0: dev eth0 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn Sent 5618224754 bytes 3710914 pkt (dropped 0, overlimits 0 requeues 1) backlog 0b 0p requeues 1 maxpacket 1514 drop_overlimit 0 new_flow_count 2 ecn_mark 0 new_flows_len 0 old_flows_len 0 Note that it appears stable for 411 seconds before you can see the congestion window growth. It appears that the amount of time you have to wait before things go downhill varies a lot. No switch was used between the server and client; they were connected directly. 
For full iperf3 log and output from ss command, see attachment. [ ID] Interval Transfer Bandwidth Retr Cwnd [ 4] 411.00-412.00 sec 5.09 MBytes 42.7 Mbits/sec0 22.6 KBytes [ 6] 411.00-412.00 sec 5.14 MBytes 43.1 Mbits/sec0 22.6 KBytes [SUM] 411.00-412.00 sec 10.2 MBytes 85.8 Mbits/sec0 - - - - - - - - - - - - - - - - - - - - - - - - - [ 4] 412.00-413.00 sec 5.12 MBytes 43.0 Mbits/sec0 22.6 KBytes [ 6] 412.00-413.00 sec 5.13 MBytes 43.0 Mbits/sec0 22.6 KBytes [SUM] 412.00-413.00 sec 10.3 MBytes 86.0 Mbits/sec0 - - - - - - - - - - - - - - - - - - - - - - - - - [ 4] 413.00-414.00 sec 5.17 MBytes 43.4 Mbits/sec0 22.6 KBytes [ 6]
Re: [PATCH net-next 0/2] ppp: Remove PPPOX_ZOMBIE socket state
From: Guillaume Nault
Date: Thu, 19 Nov 2015 12:52:30 +0100

> Several issues have been found lately wrt. the PPPOX_ZOMBIE socket
> state. This state is now only set upon reception of a PADT to stop
> further transmissions. However this is redundant with the PADT
> workqueue mechanism introduced by 287f3a943fef ("pppoe: Use workqueue
> to die properly when a PADT is received").
>
> We can thus simplify pppox socket state handling by getting rid of
> PPPOX_ZOMBIE entirely.

Nice, applied to net-next, thanks!
Re: [PATCH] net: cpsw: Fix ethernet regression for dm814x
From: Tony Lindgren
Date: Wed, 18 Nov 2015 17:27:25 -0800

> Commit b6745f6e4e63 ("drivers: net: cpsw: davinci_emac: move reading mac
> id to common file") started using of_machine_is_compatible for detecting
> type but missed at dm8148 causing Ethernet to stop working.
>
> Let's fix the issue by adding handling for dm814x.
>
> Cc: Mugunthan V N
> Signed-off-by: Tony Lindgren

Applied, thank you.
Re: [PATCH net 2/2] tcp: fix Fast Open snmp over-counting bug
From: Yuchung Cheng
Date: Wed, 18 Nov 2015 18:17:31 -0800

> Fix incrementing TCPFastOpenActiveFailed snmp stats multiple times
> when the handshake experiences multiple SYN timeouts.
>
> Signed-off-by: Yuchung Cheng
> Signed-off-by: Eric Dumazet

Applied.
Re: [PATCH net] tcp: fix potential huge kmalloc() calls in TCP_REPAIR
From: Eric Dumazet
Date: Wed, 18 Nov 2015 21:03:33 -0800

> From: Eric Dumazet
>
> tcp_send_rcvq() is used for re-injecting data into tcp receive queue.
>
> Problems :
>
> - No check against size is performed, allowed user to fool kernel in
>   attempting very large memory allocations, eventually triggering
>   OOM when memory is fragmented.
>
> - In case of fault during the copy we do not return correct errno.
>
> Lets use alloc_skb_with_frags() to cook optimal skbs.
>
> Fixes: 292e8d8c8538 ("tcp: Move rcvq sending to tcp_input.c")
> Fixes: c0e88ff0f256 ("tcp: Repair socket queues")
> Signed-off-by: Eric Dumazet

Good catch, applied and queued up for -stable.

Thanks!
Re: [PATCH] net: tulip: turn compile-time warning into dev_warn()
From: Arnd Bergmann
Date: Thu, 19 Nov 2015 11:42:26 +0100

> The tulip driver causes annoying build-time warnings for allmodconfig
> builds for all recent architectures:
>
> dec/tulip/winbond-840.c:910:2: warning: #warning Processor architecture undefined
> dec/tulip/tulip_core.c:101:2: warning: #warning Processor architecture undefined!
>
> This is the last remaining warning for arm64, and I'd like to get rid of
> it. We don't really know the cache line size, architecturally it would
> be at least 16 bytes, but all implementations I found have 64 or 128
> bytes. Configuring tulip for 32-byte lines as we do on ARM32 seems to
> be the safe but slow default, and nobody who cares about performance these
> days would use a tulip chip anyway, so we can just use that.
>
> To save the next person the job of trying to find out what this is for
> and picking a default for their architecture just to kill off the warning,
> I'm now removing the preprocessor #warning and turning it into a pr_warn
> or dev_warn that prints the equivalent information when the driver gets
> loaded.
>
> Signed-off-by: Arnd Bergmann

Seems reasonable, applied, thanks Arnd!
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)
On 11/19/2015 06:52 PM, Rainer Weikusat wrote:

[...]

> @@ -1590,21 +1718,35 @@ restart:
>  		goto out_unlock;
>  	}
>
> -	if (unix_peer(other) != sk && unix_recvq_full(other)) {
> -		if (!timeo) {
> +	if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {
> +		if (timeo) {
> +			timeo = unix_wait_for_peer(other, timeo);
> +
> +			err = sock_intr_errno(timeo);
> +			if (signal_pending(current))
> +				goto out_free;
> +
> +			goto restart;
> +		}
> +
> +		if (unix_peer(sk) != other ||
> +		    unix_dgram_peer_wake_me(sk, other)) {
>  			err = -EAGAIN;
>  			goto out_unlock;
>  		}

Hi,

So here we are calling unix_dgram_peer_wake_me() without the sk lock the first time through - right? In that case, we can end up registering on the queue of other for the callback but we might have already connected to a different remote. In that case, the wakeup will crash if 'sk' has been freed in the meantime.

Thanks,

-Jason
Re: [PATCH v8] net: ethernet: add driver for Aurora VLSI NB8800 Ethernet controller
From: Mans Rullgard
Date: Thu, 19 Nov 2015 13:02:59 +

> This adds a driver for the Aurora VLSI NB8800 Ethernet controller.
> It is an almost complete rewrite of a driver originally found in
> a Sigma Designs 2.6.22 tree.
>
> Signed-off-by: Mans Rullgard

Applied, thank you.
Fw: [Bug 108201] New: Can connect with Huawei E3131-s2 (Hi-Link) 3G modem only after reboot.
Appears to be a cdc_ether driver bug. See Bugzilla for more followup info.

Begin forwarded message:

Date: Fri, 20 Nov 2015 11:11:26 +
From: "bugzilla-dae...@bugzilla.kernel.org"
To: "shemmin...@linux-foundation.org"
Subject: [Bug 108201] New: Can connect with Huawei E3131-s2 (Hi-Link) 3G modem only after reboot.

https://bugzilla.kernel.org/show_bug.cgi?id=108201

Bug ID: 108201
Summary: Can connect with Huawei E3131-s2 (Hi-Link) 3G modem only after reboot.
Product: Networking
Version: 2.5
Kernel Version: 4.1.12-1-default
Hardware: x86-64
OS: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: Other
Assignee: shemmin...@linux-foundation.org
Reporter: cameron...@poczta.fm
Regression: No

When I start (boot) the system from the shutdown state I can't connect to the Internet, because no network connection is listed in the Plasma network pop-up menu (next to the clock at the bottom right of the screen) in the new openSUSE Leap 42.1 with KDE 5. When I shut down the laptop and start it again (boot again) it still doesn't connect (no connection available), but if I restart (reboot) the system the connection works (it shows up in the Plasma networking pop-up menu).

From what I can see in journalctl, when I boot the laptop for the first time the system recognizes this modem as a memory stick and there are errors in journalctl:

NetworkManager[878]: (eth1): failed to find device 4 'eth1' with udev
NetworkManager[878]: (eth1): new Ethernet device (carrier: OFF, driver: 'cdc_ether', ifindex: 4)
kernel: cdc_ether 3-1:1.0 eth1: register 'cdc_ether' at usb-:00:12.2-1, CDC Ethernet Device, 58:2c:80:13:92:63
kernel: usbcore: registered new interface driver cdc_ether
NetworkManager[878]: (eth1): device state change: unmanaged -> unavailable (reason 'managed') [10 20 2]
kernel: cdc_ether 3-1:1.0 eth1: kevent 12 may have been dropped
NetworkManager[878]: (eth1): link connected
NetworkManager[878]: (eth1): device state change: unavailable -> disconnected (reason 'none') [20 30 0]

...but when I reboot the laptop the system recognizes it as a modem straight away and there are no "failed" or "dropped" messages. I will attach 2 files with journalctl output from the first boot and from the reboot.

I have always installed the Huawei E3131-s2 (Hi-Link) using the Linux driver stored in the modem's internal memory and it always worked, but now that I have reinstalled openSUSE with the newer version 42.1 there was a "failed" error with runmbbservice, so I deactivated it. Anyway, the connection did not work either.

Another thing: when I unplug the modem and plug it back in, it fails again and the connection vanishes (it doesn't show up when I plug the modem back in), and I have to reboot the system to be able to connect to the Internet.

It is annoying to always boot the system and then reboot in order to connect to the Internet :( So, can anyone fix this please?

--
You are receiving this mail because:
You are the assignee for the bug.
Re: [patch net-next 0/3] mlxsw: small driver update
From: Jiri Pirko
Date: Thu, 19 Nov 2015 12:27:37 +0100

> Couple of VLAN-related patches.

Series applied.

I'm really pleased with this driver and work you guys are doing on it.

Thanks!
Re: [PATCH linux-firmware] bnx2x: Add FW 7.13.1.0.
On Thu, Nov 19, 2015 at 06:41:26PM +0200, Yuval Mintz wrote:
> This adds new FW for bnx2x, which adds the following:
> - Ability to change outer vlan ID for some multi-function modes.
> - FW ability for Geneve RSS classification according to inner headers.
> - Prevent VFs from sending MAC control frames.
>
> Signed-off-by: Yuval Mintz

applied, thanks.
[PATCH] gianfar: use of_property_read_bool()
use of_property_read_bool() for testing bool property

Signed-off-by: Saurabh Sengar
---
 drivers/net/ethernet/freescale/gianfar.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 3e6b9b4..ebeea5e 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -738,7 +738,6 @@ static int gfar_of_init(struct platform_device *ofdev, struct net_device **pdev)
 	struct gfar_private *priv = NULL;
 	struct device_node *np = ofdev->dev.of_node;
 	struct device_node *child = NULL;
-	struct property *stash;
 	u32 stash_len = 0;
 	u32 stash_idx = 0;
 	unsigned int num_tx_qs, num_rx_qs;
@@ -854,9 +853,7 @@ static int gfar_of_init(struct platform_device *ofdev, struct net_device **pdev)
 		goto err_grp_init;
 	}
 
-	stash = of_find_property(np, "bd-stash", NULL);
-
-	if (stash) {
+	if (of_property_read_bool(np, "bd-stash")) {
 		priv->device_flags |= FSL_GIANFAR_DEV_HAS_BD_STASHING;
 		priv->bd_stash_en = 1;
 	}
-- 
1.9.1
[PATCH v2 4/4] rhashtable-test: allow to retry even if -ENOMEM was returned
This is rather a hack to expose the current issue with rhashtable: under high pressure it sometimes returns -ENOMEM even though system memory is not exhausted and a consecutive insert may succeed.

Signed-off-by: Phil Sutter
---
 lib/test_rhashtable.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 6fa77b3..270bf72 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -52,6 +52,10 @@ static int tcount = 10;
 module_param(tcount, int, 0);
 MODULE_PARM_DESC(tcount, "Number of threads to spawn (default: 10)");
 
+static bool enomem_retry = false;
+module_param(enomem_retry, bool, 0);
+MODULE_PARM_DESC(enomem_retry, "Retry insert even if -ENOMEM was returned (default: off)");
+
 struct test_obj {
 	int			value;
 	struct rhash_head	node;
@@ -79,14 +83,22 @@ static struct semaphore startup_sem = __SEMAPHORE_INITIALIZER(startup_sem, 0);
 static int insert_retry(struct rhashtable *ht, struct rhash_head *obj,
 			const struct rhashtable_params params)
 {
-	int err, retries = -1;
+	int err, retries = -1, enomem_retries = 0;
 
 	do {
 		retries++;
 		cond_resched();
 		err = rhashtable_insert_fast(ht, obj, params);
+		if (err == -ENOMEM && enomem_retry) {
+			enomem_retries++;
+			err = -EBUSY;
+		}
 	} while (err == -EBUSY);
 
+	if (enomem_retries)
+		pr_info(" %u insertions retried after -ENOMEM\n",
+			enomem_retries);
+
 	return err ? : retries;
 }
 
-- 
2.1.2
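
For anyone who wants to experiment with the retry logic outside the kernel, the following is a small userspace sketch of the same loop. It is only an illustration: the callback type, the scripted fake insert, and every demo_ name are assumptions, and no rhashtable API is involved. As in the patch, the loop is unbounded, so real callers may want a retry cap.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical insert callback: returns 0 on success or a negative errno. */
typedef int (*demo_insert_fn)(void *ctx);

/*
 * Same shape as insert_retry() in the patch: retry while the insert
 * reports -EBUSY and, if enomem_retry is set, treat -ENOMEM as transient
 * too. Returns the number of retries on success, or the negative errno
 * of a permanent failure. *enomem_retries counts the -ENOMEM retries.
 */
static int demo_insert_retry(demo_insert_fn insert, void *ctx,
			     bool enomem_retry, unsigned int *enomem_retries)
{
	int err, retries = -1;

	*enomem_retries = 0;
	do {
		retries++;
		err = insert(ctx);
		if (err == -ENOMEM && enomem_retry) {
			(*enomem_retries)++;
			err = -EBUSY;
		}
	} while (err == -EBUSY);

	return err ? err : retries;
}

/* Scripted fake insert that replays a fixed sequence of results. */
struct demo_script {
	const int *results;
	int pos;
};

static int demo_scripted_insert(void *ctx)
{
	struct demo_script *s = ctx;

	return s->results[s->pos++];
}
```

Feeding it the sequence -EBUSY, -ENOMEM, 0 gives two retries with enomem_retry enabled, and a permanent -ENOMEM failure with it disabled.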
Re: [PATCH v2 4/4] rhashtable-test: allow to retry even if -ENOMEM was returned
On Fri, Nov 20, 2015 at 06:17:20PM +0100, Phil Sutter wrote:
> This is rather a hack to expose the current issue with rhashtable to
> under high pressure sometimes return -ENOMEM even though system memory
> is not exhausted and a consecutive insert may succeed.

Please note that this problem does not show every time when running the test in default configuration on my system. With increased number of threads though, it becomes very visible. Load test_rhashtable like so:

modprobe test_rhashtable enomem_retry=1 tcount=20

and grep dmesg for 'insertions retried after -ENOMEM'. In my case:

# dmesg | grep -E '(insertions retried after -ENOMEM|Started)' | tail
[ 34.642980] 1 insertions retried after -ENOMEM
[ 34.642989] 1 insertions retried after -ENOMEM
[ 34.642994] 1 insertions retried after -ENOMEM
[ 34.648353] 28294 insertions retried after -ENOMEM
[ 34.689687] 31262 insertions retried after -ENOMEM
[ 34.714015] 16280 insertions retried after -ENOMEM
[ 34.736019] 15327 insertions retried after -ENOMEM
[ 34.755100] 39012 insertions retried after -ENOMEM
[ 34.769116] 49369 insertions retried after -ENOMEM
[ 35.387200] Started 20 threads, 0 failed

Cheers, Phil
[PATCH v2 3/4] rhashtable-test: calculate max_entries value by default
A maximum table size of 64k entries is insufficient for the multiple threads test even in default configuration (10 threads * 50000 objects = 500000 objects in total). Since we know how many objects will be inserted, calculate the max size unless overridden by parameter.

Note that specifying the exact number of objects upon table init won't suffice as that value is being rounded down to the next power of two - anticipate this by rounding up to the next power of two beforehand.

Signed-off-by: Phil Sutter
---
 lib/test_rhashtable.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index cfc3440..6fa77b3 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -36,9 +36,9 @@ static int runs = 4;
 module_param(runs, int, 0);
 MODULE_PARM_DESC(runs, "Number of test runs per variant (default: 4)");
 
-static int max_size = 65536;
+static int max_size = 0;
 module_param(max_size, int, 0);
-MODULE_PARM_DESC(runs, "Maximum table size (default: 65536)");
+MODULE_PARM_DESC(runs, "Maximum table size (default: calculated)");
 
 static bool shrinking = false;
 module_param(shrinking, bool, 0);
@@ -321,7 +321,7 @@ static int __init test_rht_init(void)
 	entries = min(entries, MAX_ENTRIES);
 
 	test_rht_params.automatic_shrinking = shrinking;
-	test_rht_params.max_size = max_size;
+	test_rht_params.max_size = max_size ? : roundup_pow_of_two(entries);
 	test_rht_params.nelem_hint = size;
 
 	pr_info("Running rhashtable test nelem=%d, max_size=%d, shrinking=%d\n",
@@ -367,6 +367,8 @@ static int __init test_rht_init(void)
 		return -ENOMEM;
 	}
 
+	test_rht_params.max_size = max_size ? :
+				   roundup_pow_of_two(tcount * entries);
 	err = rhashtable_init(&ht, &test_rht_params);
 	if (err < 0) {
 		pr_warn("Test failed: Unable to initialize hashtable: %d\n",
-- 
2.1.2
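
The rounding argument in the commit message can be checked with a few lines of userspace C. The helpers below are simple loop-based stand-ins written for this illustration (they are not the kernel's roundup_pow_of_two() macro), and the object count of 500000 used in the note afterwards is an assumed example figure.

```c
#include <stdint.h>

/* Smallest power of two >= n. Assumes n >= 1; the 64-bit accumulator
 * avoids overflow for any 32-bit input. */
static uint64_t demo_roundup_pow2(uint32_t n)
{
	uint64_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Largest power of two <= n. Assumes n >= 1. */
static uint64_t demo_rounddown_pow2(uint32_t n)
{
	uint64_t p = 1;

	while ((p << 1) <= n)
		p <<= 1;
	return p;
}
```

For an assumed count of 500000 objects, rounding down gives 262144, which is too small to hold them all, while rounding up first gives 524288, which survives a later round-down unchanged.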
[PATCH v2 2/4] rhashtable-test: retry insert operations
After adding cond_resched() calls to threadfunc(), a surprisingly high
rate of insert failures occurred probably due to table resizes getting a
better chance to run in background. To not soften up the remaining
tests, retry inserts until they either succeed or fail permanently.

Also change the non-threaded test to retry insert operations, too.

Suggested-by: Thomas Graf
Signed-off-by: Phil Sutter
---
 lib/test_rhashtable.c | 53 +++++++++++++++++++++++++++++------------------------
 1 file changed, 29 insertions(+), 24 deletions(-)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 63654e3..cfc3440 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -76,6 +76,20 @@ static struct rhashtable_params test_rht_params = {
 static struct semaphore prestart_sem;
 static struct semaphore startup_sem = __SEMAPHORE_INITIALIZER(startup_sem, 0);
 
+static int insert_retry(struct rhashtable *ht, struct rhash_head *obj,
+			const struct rhashtable_params params)
+{
+	int err, retries = -1;
+
+	do {
+		retries++;
+		cond_resched();
+		err = rhashtable_insert_fast(ht, obj, params);
+	} while (err == -EBUSY);
+
+	return err ? : retries;
+}
+
 static int __init test_rht_lookup(struct rhashtable *ht)
 {
 	unsigned int i;
@@ -157,7 +171,7 @@ static s64 __init test_rhashtable(struct rhashtable *ht)
 {
 	struct test_obj *obj;
 	int err;
-	unsigned int i, insert_fails = 0;
+	unsigned int i, insert_retries = 0;
 	s64 start, end;
 
 	/*
@@ -170,22 +184,16 @@ static s64 __init test_rhashtable(struct rhashtable *ht)
 		struct test_obj *obj = &array[i];
 
 		obj->value = i * 2;
-
-		err = rhashtable_insert_fast(ht, &obj->node, test_rht_params);
-		if (err == -ENOMEM || err == -EBUSY) {
-			/* Mark failed inserts but continue */
-			obj->value = TEST_INSERT_FAIL;
-			insert_fails++;
-		} else if (err) {
+		err = insert_retry(ht, &obj->node, test_rht_params);
+		if (err > 0)
+			insert_retries += err;
+		else if (err)
 			return err;
-		}
-
-		cond_resched();
 	}
 
-	if (insert_fails)
-		pr_info(" %u insertions failed due to memory pressure\n",
-			insert_fails);
+	if (insert_retries)
+		pr_info(" %u insertions retried due to memory pressure\n",
+			insert_retries);
 
 	test_bucket_stats(ht);
 	rcu_read_lock();
@@ -244,7 +252,7 @@ static int thread_lookup_test(struct thread_data *tdata)
 
 static int threadfunc(void *data)
 {
-	int i, step, err = 0, insert_fails = 0;
+	int i, step, err = 0, insert_retries = 0;
 	struct thread_data *tdata = data;
 
 	up(&prestart_sem);
@@ -253,21 +261,18 @@ static int threadfunc(void *data)
 
 	for (i = 0; i < entries; i++) {
 		tdata->objs[i].value = (tdata->id << 16) | i;
-		cond_resched();
-		err = rhashtable_insert_fast(&ht, &tdata->objs[i].node,
-					     test_rht_params);
-		if (err == -ENOMEM || err == -EBUSY) {
-			tdata->objs[i].value = TEST_INSERT_FAIL;
-			insert_fails++;
+		err = insert_retry(&ht, &tdata->objs[i].node, test_rht_params);
+		if (err > 0) {
+			insert_retries += err;
 		} else if (err) {
 			pr_err(" thread[%d]: rhashtable_insert_fast failed\n",
 			       tdata->id);
 			goto out;
 		}
 	}
-	if (insert_fails)
-		pr_info(" thread[%d]: %d insert failures\n",
-			tdata->id, insert_fails);
+	if (insert_retries)
+		pr_info(" thread[%d]: %u insertions retried due to memory pressure\n",
+			tdata->id, insert_retries);
 
 	err = thread_lookup_test(tdata);
 	if (err) {
-- 
2.1.2
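The retry convention used by insert_retry() (a non-negative retry count on success, a negative errno on permanent failure) can be tried outside the kernel. The following is only a sketch under stated assumptions: the insert_fn callback type and the two stub inserters are hypothetical and stand in for rhashtable_insert_fast(), and there is no cond_resched() equivalent.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical insert callback: returns 0 on success or a negative errno. */
typedef int (*insert_fn)(void *table, void *obj);

/*
 * Retry for as long as the insert reports the transient -EBUSY.
 * Returns the number of retries that were needed (>= 0) on success, or
 * the negative errno of a permanent failure such as -ENOMEM.
 */
static int insert_retry(void *table, void *obj, insert_fn insert)
{
	int err, retries = -1;

	assert(insert != NULL);
	do {
		retries++;
		err = insert(table, obj);
	} while (err == -EBUSY);

	return err ? err : retries;
}

/* Stub: reports -EBUSY until the counter behind table reaches zero. */
static int busy_then_ok(void *table, void *obj)
{
	int *busy_left = table;

	(void)obj;
	if (*busy_left > 0) {
		(*busy_left)--;
		return -EBUSY;
	}
	return 0;
}

/* Stub: always fails permanently. */
static int always_enomem(void *table, void *obj)
{
	(void)table;
	(void)obj;
	return -ENOMEM;
}
```

A caller can tell the three outcomes apart the same way the patch does: a positive value is added to the retry counter, zero means the first attempt succeeded, and a negative value aborts the test.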
[PATCH v2 0/4] improve fault-tolerance of rhashtable runtime-test
The following series aims to improve lib/test_rhashtable in different
situations:

Patch 1 allows the kernel to reschedule so the test does not block too
        long on slow systems.
Patch 2 fixes behaviour under pressure, retrying inserts in non-permanent
        error case (-EBUSY).
Patch 3 auto-adjusts the upper table size limit according to the number
        of threads (in concurrency test). In fact, the current default is
        already too small.
Patch 4 makes it possible to retry inserts even in supposedly permanent
        error case (-ENOMEM) to expose rhashtable's remaining problem of
        -ENOMEM being not as permanent as it is expected to be.

Changes since v1:
- Introduce insert_retry() which is then used in single-threaded test as
  well.
- Do not retry inserts by default if -ENOMEM was returned.
- Rename the retry counter to be a bit more verbose about what it
  contains.
- Add patch 4 as a debugging aid.

Phil Sutter (4):
  rhashtable-test: add cond_resched() to thread test
  rhashtable-test: retry insert operations
  rhashtable-test: calculate max_entries value by default
  rhashtable-test: allow to retry even if -ENOMEM was returned

 lib/test_rhashtable.c | 76
 1 file changed, 50 insertions(+), 26 deletions(-)

-- 
2.1.2
[PATCH v2 1/4] rhashtable-test: add cond_resched() to thread test
This should fix soft lockup bugs triggered on slow systems.

Signed-off-by: Phil Sutter
---
 lib/test_rhashtable.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 8c1ad1c..63654e3 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -236,6 +236,8 @@ static int thread_lookup_test(struct thread_data *tdata)
 				obj->value, key);
 			err++;
 		}
+
+		cond_resched();
 	}
 	return err;
 }
@@ -251,6 +253,7 @@ static int threadfunc(void *data)
 
 	for (i = 0; i < entries; i++) {
 		tdata->objs[i].value = (tdata->id << 16) | i;
+		cond_resched();
 		err = rhashtable_insert_fast(&ht, &tdata->objs[i].node,
 					     test_rht_params);
 		if (err == -ENOMEM || err == -EBUSY) {
@@ -285,6 +288,8 @@ static int threadfunc(void *data)
 			goto out;
 		}
 		tdata->objs[i].value = TEST_INSERT_FAIL;
+
+		cond_resched();
 	}
 	err = thread_lookup_test(tdata);
 	if (err) {
-- 
2.1.2
Re: [PATCH net 1/2] tcp: disable Fast Open on timeouts after handshake
On Fri, Nov 20, 2015 at 7:52 AM, David Miller wrote:
> From: Yuchung Cheng
> Date: Wed, 18 Nov 2015 18:17:30 -0800
>
>> Some middle-boxes black-hole the data after the Fast Open handshake
>> (https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf).
>> The exact reason is unknown. The work-around is to disable Fast Open
>> temporarily after multiple recurring timeouts with few or no data
>> delivered in the established state.
>>
>> Signed-off-by: Yuchung Cheng
>> Signed-off-by: Eric Dumazet
>> Reported-by: Christoph Paasch
>
> Applied and queued up for -stable.
>
> Just out of curiosity, why isn't a test for zero data sufficient?
>
> Do these middle-boxes sometimes not black-hole all of the data?

Great question. I should have been clearer in the commit message. The
answer is yes, it should be sufficient. The tricky part is that
tp->bytes_acked includes data acked in the SYN. Since we don't remember
the size of the data sent in the SYN, we use the heuristic instead.
Re: network stream fairness
On Fri, 2015-11-20 at 16:33 +0100, Niklas Cassel wrote:
> I've been able to reproduce this on a ARMv7, single core, 100 Mbps NIC.
> Kernel vanilla 4.3, driver has BQL implemented, but is unfortunately not
> upstreamed.
>
> ethtool -k eth0
> Offload parameters for eth0:
> rx-checksumming: off
> tx-checksumming: on
> scatter-gather: off
> tcp segmentation offload: off
> udp fragmentation offload: off
> generic segmentation offload: off
>
> ip addr show dev eth0
> 2: eth0: mtu 1500 qdisc fq_codel state UP group default qlen 1000
>     link/ether 00:40:8c:18:58:c8 brd ff:ff:ff:ff:ff:ff
>     inet 192.168.0.136/24 brd 192.168.0.255 scope global eth0
>        valid_lft forever preferred_lft forever
>
> # before iperf3 run
> tc -s -d qdisc
> qdisc noqueue 0: dev lo root refcnt 2
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc fq_codel 0: dev eth0 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
>  Sent 21001 bytes 45 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
>   maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
>   new_flows_len 0 old_flows_len 0
>
> sysctl net.ipv4.tcp_congestion_control
> net.ipv4.tcp_congestion_control = cubic
>
> # after iperf3 run
> tc -s -d qdisc
> qdisc noqueue 0: dev lo root refcnt 2
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc fq_codel 0: dev eth0 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
>  Sent 5618224754 bytes 3710914 pkt (dropped 0, overlimits 0 requeues 1)
>  backlog 0b 0p requeues 1
>   maxpacket 1514 drop_overlimit 0 new_flow_count 2 ecn_mark 0
>   new_flows_len 0 old_flows_len 0
>
> Note that it appears stable for 411 seconds before you can see the
> congestion window growth. It appears that the amount of time you have
> to wait before things go downhill varies a lot.
> No switch was used between the server and client; they were connected
> directly.
Hi Niklas

Your results seem to show there is no special issue ;)

With TSO off and GSO off, there is no way a 'TSO autosizing' patch would
have any effect, since this code path is not taken.

You have to wait 400 seconds before getting into a mode where one of the
flows gets a bigger cwnd (25 instead of 16), and then TCP cubic simply
shows typical unfairness ...

If you absolutely need to guarantee a given throughput per flow, you
might consider using the fq packet scheduler and the SO_MAX_PACING_RATE
socket option.

Thanks !
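As an illustration of the last suggestion, here is a minimal sketch of capping one socket's pacing rate. It assumes a Linux system where the interface's root qdisc has been switched to fq (for example with "tc qdisc replace dev eth0 root fq"); the helper name set_max_pacing_rate() is made up for this example, and the fallback value 47 is the asm-generic definition, used only if the libc headers lack the constant.

```c
#include <assert.h>
#include <sys/socket.h>

#ifndef SO_MAX_PACING_RATE
#define SO_MAX_PACING_RATE 47	/* asm-generic value, for older libc headers */
#endif

/*
 * Cap the pacing rate of a single socket. The rate is given in bytes
 * per second. Returns 0 on success, or -1 with errno set.
 */
static int set_max_pacing_rate(int fd, unsigned int bytes_per_sec)
{
	assert(fd >= 0);
	return setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
			  &bytes_per_sec, sizeof(bytes_per_sec));
}
```

On a 100 Mbps link shared by two flows, calling set_max_pacing_rate(fd, 6250000) on each socket would limit each flow to roughly 50 Mbit/s. Note that this caps a flow rather than reserving bandwidth for it.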
Re: [PATCH net-next 0/2] bnx2x: Statistics patch series
From: Yuval Mintz
Date: Thu, 19 Nov 2015 17:04:34 +0200

> This series contains 2 small statistics-related patches,
> first adding a new SW statistics and the other exposing port stats
> for multi-function devices.
>
> Please consider applying this series to `net-next'.

Series applied, thanks Yuval.
Re: [PATCH 08/14] net: tcp_memcontrol: sanitize tcp memory accounting callbacks
On Fri, Nov 20, 2015 at 01:58:57PM +0300, Vladimir Davydov wrote:
> On Thu, Nov 12, 2015 at 06:41:27PM -0500, Johannes Weiner wrote:
> > There won't be a tcp control soft limit, so integrating the memcg code
> > into the global skmem limiting scheme complicates things
> > unnecessarily. Replace this with simple and clear charge and uncharge
> > calls--hidden behind a jump label--to account skb memory.
> >
> > Note that this is not purely aesthetic: as a result of shoehorning the
> > per-memcg code into the same memory accounting functions that handle
> > the global level, the old code would compare the per-memcg consumption
> > against the smaller of the per-memcg limit and the global limit. This
> > allowed the total consumption of multiple sockets to exceed the global
> > limit, as long as the individual sockets stayed within bounds. After
> > this change, the code will always compare the per-memcg consumption to
> > the per-memcg limit, and the global consumption to the global limit,
> > and thus close this loophole.
> >
> > Without a soft limit, the per-memcg memory pressure state in sockets
> > is generally questionable. However, we did it until now, so we
> > continue to enter it when the hard limit is hit, and packets are
> > dropped, to let other sockets in the cgroup know that they shouldn't
> > grow their transmit windows, either. However, keep it simple in the
> > new callback model and leave memory pressure lazily when the next
> > packet is accepted (as opposed to doing it synchronously when packets
> > are processed). When packets are dropped, network performance will
> > already be in the toilet, so that should be a reasonable trade-off.
> >
> > As described above, consumption is now checked on the per-memcg level
> > and the global level separately. Likewise, memory pressure states are
> > maintained on both the per-memcg level and the global level, and a
> > socket is considered under pressure when either level asserts as much.
> >
> > Signed-off-by: Johannes Weiner
>
> It leaves the legacy functionality intact, while making the code look
> much better.
>
> Reviewed-by: Vladimir Davydov

Thank you very much!
Re: [PATCHSET v2] netfilter, cgroup: implement xt_cgroup2 match
From: Tejun Heo
Date: Thu, 19 Nov 2015 13:52:44 -0500

> This is the second take of the xt_cgroup2 patchset. Changes from the
> last take are
>
> * Instead of adding sock->sk_cgroup separately, sock->sk_cgrp_data now
>   carries either (prioidx, classid) pair or cgroup2 pointer. This
>   avoids inflating struct sock with yet another cgroup related field.
>   Unfortunately, this does add some complexity but that's the
>   trade-off and the complexity is contained in cgroup proper.
>
> * Various small updates as per David and Jan's reviews.

I like this a lot better, thanks.

Please address Daniel's feedback on patch #6 and then I'm personally
fine with this series.

Pablo, are you ok with me merging this into net-next directly or would
you rather I take patches 1-6 into net-next and then you can merge and
then add patch #7 on top?

Thanks.
[PATCH V4 net-next 0/5] net:hns: Add support of Hip06 SoC to the Hislicon Network Subsystem
This PATCH V4 addresses the review comment provided by Sergei Shtylyov.
The changelog of every patch has also been modified.

PATCH V3:
Addresses the review comment floated by David Miller

PATCH V2:
1) Bug Fixes and Clean-up: Internally identified
2) Addresses internal review comments by Kenneth Lee and by Huang Daode
3) Addresses the review comment from "Yisen.Zhuang(Zhuangyuzeng)"
4) Adds fix from Fengguang Wu for an error generated from
   "kbuild test robot" from Intel
5) Ethtool support for TSO set option from Lisheng

PATCH V1:
Adds initial support of Hip06 SoC with below changes:

This patch-set adds support of new Hisilicon Hip06 SoC to the existing
(already part of net-next) HNS ethernet driver for Hip05 SoC.

Hip06 is a multi-core SoC and is a derivative of Hip05 SoC with lots of
new hardware features supported like RSS, TSO, hardware VLAN assist etc.

The changes in the driver are mainly due to following:
1) changes in the DMA descriptor provided by the Hip06 ethernet hardware.
   These changes need to co-exist with already present Hip05 DMA
   descriptor and its operating functions. The decision to choose the
   correct type of DMA descriptor is taken dynamically depending upon
   the version of the hardware (i.e. V1/hip05 or V2/hip06, see already
   existing hisilicon-hns-nic.txt binding file for the detailed
   description version and naming).
2) To support new features added to the Hip06 ethernet hardware:
   a. RSS (Receive Side Scaling)
   b. TSO (TCP Segment Offload)
   c. Hardware VLAN support (currently we are initializing hardware to
      not assist in stripping the vlan tag at hardware level. Proper
      support of this feature and ethtool would come after these patches
      have been accepted)

Kindly note that this patchset has been based on latest net-next.

Salil Mehta (5):
  net:hns: Add support of Hip06 SoC to the Hislicon Network Subsystem
  net:hns: Add Hip06 "RSS(Receive Side Scaling)" support to HNS Driver
  net:hns: Add Hip06 "TSO(TCP Segment Offload)" support HNS Driver
  net:hns: Add support of ethtool TSO set option for Hip06 in HNS
  net:hns: Add the init code to disable Hip06 "Hardware VLAN assist"

 drivers/net/ethernet/hisilicon/hns/hnae.h          |  56
 drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c  |  90
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c | 213
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.h |  25
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_misc.c |   6
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c  |  79
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.h  |  32
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c  |  68
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.h  |   8
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h  |  88
 drivers/net/ethernet/hisilicon/hns/hns_enet.c      | 487
 drivers/net/ethernet/hisilicon/hns/hns_enet.h      |  12
 drivers/net/ethernet/hisilicon/hns/hns_ethtool.c   |  95
 13 files changed, 1072 insertions(+), 187 deletions(-)

-- 
1.7.9.5
Re: [PATCH v3 net-next] ravb: use clock rate as basis for GTI.TIV
From: Simon Horman
Date: Fri, 20 Nov 2015 11:29:39 -0800

> The GTI.TIV may be set to 2GHz^2 / rate, where rate is
> that of the clock of the device. Rather than assuming a
> rate of 130MHz use the actual rate of the clock.
>
> The motivation for this is to use the correct rate on
> the r8a7795/Salvator-X which is advertised as 133MHz but
> may differ depending on the extal present on the Salvator-X.
>
> Signed-off-by: Simon Horman

Applied, thanks Simon.
RE: [PATCH net-next 0/8] tipc: some cleanups and improvements
> -----Original Message-----
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Friday, 20 November, 2015 14:07
> To: Jon Maloy
> Cc: netdev@vger.kernel.org; paul.gortma...@windriver.com;
> parthasarathy.xx.bhuvara...@ericsson.com; Richard Alpe; Ying Xue;
> ma...@donjonn.com; tipc-discuss...@lists.sourceforge.net
> Subject: Re: [PATCH net-next 0/8] tipc: some cleanups and improvements
>
> From: Jon Maloy
> Date: Thu, 19 Nov 2015 14:30:38 -0500
>
> > This series mostly contains cleanups and cosmetic code changes.
> > The only real functional change is in #4 and #5, where we change the
> > locking structure for nodes and links in order to permit full
> > concurrency between links working in parallel on different interfaces.
> > Since the groundwork for this has been done in previous commit series,
> > this change constitutes only the final, small step to achieve that goal.
>
> Series applied, thanks.
>
> Generally speaking, rwlock usage really never buys you anything significant.
> Therefore in the long run I think a single spinlock plus RCU is going to be
> much better for per-node locking in TIPC.

Thank you for the feedback. My own benchmarking has already confirmed
what you are stating. I am currently looking at how to convert it to RCU.

///jon
Re: [PATCH net] ppp: fix pppoe_dev deletion condition in pppoe_release()
Hello!

David Miller wrote on Fri, 23 Oct 2015 03:30:48 -0700 (PDT):

> From: Guillaume Nault
> Date: Thu, 22 Oct 2015 16:57:10 +0200
>
>> We can't rely on PPPOX_ZOMBIE to decide whether to clear po->pppoe_dev.
>> PPPOX_ZOMBIE can be set by pppoe_disc_rcv() even when po->pppoe_dev is
>> NULL. So we have no guarantee that (sk->sk_state & PPPOX_ZOMBIE) implies
>> (po->pppoe_dev != NULL).
>> Since we're releasing a PPPoE socket, we want to release the pppoe_dev
>> if it exists and reset sk_state to PPPOX_DEAD, no matter the previous
>> value of sk_state. So we can just check for po->pppoe_dev and avoid any
>> assumption on sk->sk_state.
>>
>> Fixes: 2b018d57ff18 ("pppoe: drop PPPOX_ZOMBIEs in pppoe_release")
>> Signed-off-by: Guillaume Nault
>
> Applied and queued up for -stable, thanks.

Somehow this commit (1acea4f6ce1b1c0941438aca75dd2e5c6b09db60) did not
make it into Linux 4.2.6, 4.1.13, or 3.18.24. But I don't find it in
your stable bundle on Patchwork either. Has this patch been
inadvertently "lost in translation"?

Best regards,

-- 
Christoph Schulz
Re: [PATCH 13/14] mm: memcontrol: account socket memory in unified hierarchy memory controller
On Fri, Nov 20, 2015 at 04:10:33PM +0300, Vladimir Davydov wrote:
> On Thu, Nov 12, 2015 at 06:41:32PM -0500, Johannes Weiner wrote:
> ...
> > @@ -5514,16 +5550,43 @@ void sock_release_memcg(struct sock *sk)
> >   */
> >  bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
> >  {
> > +	unsigned int batch = max(CHARGE_BATCH, nr_pages);
> >  	struct page_counter *counter;
> > +	bool force = false;
> >
> > -	if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
> > -				    nr_pages, &counter)) {
> > -		memcg->tcp_mem.memory_pressure = 0;
> > +#ifdef CONFIG_MEMCG_KMEM
> > +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
> > +		if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
> > +					    nr_pages, &counter)) {
> > +			memcg->tcp_mem.memory_pressure = 0;
> > +			return true;
> > +		}
> > +		page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
> > +		memcg->tcp_mem.memory_pressure = 1;
> > +		return false;
> > +	}
> > +#endif
> > +	if (consume_stock(memcg, nr_pages))
> >  		return true;
> > +retry:
> > +	if (page_counter_try_charge(&memcg->memory, batch, &counter))
> > +		goto done;
> > +
> > +	if (batch > nr_pages) {
> > +		batch = nr_pages;
> > +		goto retry;
> >  	}
> > -	page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
> > -	memcg->tcp_mem.memory_pressure = 1;
> > -	return false;
> > +
> > +	page_counter_charge(&memcg->memory, batch);
> > +	force = true;
> > +done:
>
> > +	css_get_many(&memcg->css, batch);
>
> Is there any point to get css reference per each charged page? For kmem
> it is absolutely necessary, because dangling slabs must block
> destruction of memcg's kmem caches, which are destroyed on css_free. But
> for sockets there's no such problem: memcg will be destroyed only after
> all sockets are destroyed and therefore uncharged (since
> sock_update_memcg pins css).

I'm afraid we have to when we want to share 'stock' with cache and
anon pages, which hold individual references. drain_stock() always
assumes one reference per cached page.

> > +	if (batch > nr_pages)
> > +		refill_stock(memcg, batch - nr_pages);
> > +
> > +	schedule_work(&memcg->socket_work);
>
> I think it's suboptimal to schedule the work even if we are below the
> high threshold.

Hm, it seemed unnecessary to duplicate the hierarchy check since this
is in the batch-exhausted slowpath anyway.

> BTW why do we need this work at all? Why is reclaim_high called from
> task_work not enough?

The problem lies in the memcg association: the random task that gets
interrupted by an arriving packet might not be in the same memcg as
the one owning receiving socket. And multiple interrupts could happen
while we're in the kernel already charging pages. We'd basically have
to maintain a list of memcgs that need to run reclaim_high associated
with current.
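To make the control flow of the quoted hunk easier to follow, here is a rough user-space model of the "charge a batch, fall back to the exact amount, then force" logic together with the stock that absorbs the surplus. It is only a sketch: struct counter, counter_try_charge() and charge_skmem() are simplified, hypothetical stand-ins for the page_counter and stock helpers, css references and the reclaim work are left out, and the return value is an assumption because the quoted hunk is cut off before the function returns.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified stand-in for struct page_counter: usage may not exceed
 * limit unless a charge is forced.
 */
struct counter {
	unsigned long usage;
	unsigned long limit;
};

/* Charge n pages only if the limit allows it. */
static bool counter_try_charge(struct counter *c, unsigned long n)
{
	if (c->usage + n > c->limit)
		return false;
	c->usage += n;
	return true;
}

/*
 * Model of the slow path. *stock holds pages that were charged earlier
 * but not yet handed out. Returns true if the charge fit within the
 * limit, false if it had to be forced (assumed return convention).
 */
static bool charge_skmem(struct counter *c, unsigned long *stock,
			 unsigned long nr_pages, unsigned long charge_batch)
{
	unsigned long batch = charge_batch > nr_pages ? charge_batch : nr_pages;
	bool force = false;

	assert(nr_pages > 0);

	/* Fast path: serve the request from previously charged surplus. */
	if (*stock >= nr_pages) {
		*stock -= nr_pages;
		return true;
	}

	/* Try a whole batch first, then fall back to the exact amount. */
	while (!counter_try_charge(c, batch)) {
		if (batch > nr_pages) {
			batch = nr_pages;
			continue;
		}
		/* Even the exact amount does not fit: charge it anyway. */
		c->usage += batch;
		force = true;
		break;
	}

	/* Keep whatever was charged beyond the request for later callers. */
	*stock += batch - nr_pages;
	return !force;
}
```

With a limit of 40 pages and a batch of 32, a first request for 4 pages charges 32 and leaves 28 in the stock. A following request for 30 pages cannot be served from the stock, fails both the batch and the exact charge, and is forced. A later request for 8 pages is served from the stock without touching the counter.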
[PATCH V4 net-next 1/5] net:hns: Add support of Hip06 SoC to the Hislicon Network Subsystem
This patchset adds support of Hisilicon Hip06 SoC to the existing HNS
ethernet driver.

The changes in the driver are mainly due to changes in the DMA
descriptor provided by the Hip06 ethernet hardware. These changes
need to co-exist with already present Hip05 DMA descriptor and its
operating functions. The decision to choose the correct type of DMA
descriptor is taken dynamically depending upon the version of the
hardware (i.e. V1/hip05 or V2/hip06, see already existing
hisilicon-hns-nic.txt binding file for detailed description). Other
changes are included in the SBM, DSAF and PPE modules as well. Changes
affecting the driver related to the newly added ethernet hardware
features in Hip06 would be added as separate patch over this and
subsequent patches.

Signed-off-by: Salil Mehta
Signed-off-by: yankejian
Signed-off-by: huangdaode
Signed-off-by: lipeng
Signed-off-by: lisheng
Signed-off-by: Fengguang Wu
---
PATCH V4: No change over PATCH V3
PATCH V3:
- This patch addresses comments floated by David Miller on PATCH V2.
  In summary, changing is_ver1 data-type from 'int' to 'bool' at
  different places of the code:
  Link: https://lkml.org/lkml/2015/11/18/656
PATCH V2:
- Fix the comment from "kbuild test robot" from Intel(Fengguang Wu)
  Link: https://lkml.org/lkml/2015/10/20/562
        https://lkml.org/lkml/2015/10/20/563
- Fixes the internal review comments from:
  Kenneth Lee
  huangdaode
PATCH V1:
- Initial driver version to support HNS over Hip06 SoC
---
 drivers/net/ethernet/hisilicon/hns/hnae.h          |  49
 drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c  |  29
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c | 213
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.h |  25
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_misc.c |   6
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c  |   6
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c  |  68
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.h  |   8
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h  |  72
 drivers/net/ethernet/hisilicon/hns/hns_enet.c      | 364
 drivers/net/ethernet/hisilicon/hns/hns_enet.h      |  12
 drivers/net/ethernet/hisilicon/hns/hns_ethtool.c   |   2
 12 files changed, 677 insertions(+), 177 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hnae.h b/drivers/net/ethernet/hisilicon/hns/hnae.h
index cec95ac..aa53dd3 100644
--- a/drivers/net/ethernet/hisilicon/hns/hnae.h
+++ b/drivers/net/ethernet/hisilicon/hns/hnae.h
@@ -35,7 +35,7 @@
 #include
 #include
 
-#define HNAE_DRIVER_VERSION "1.3.0"
+#define HNAE_DRIVER_VERSION "2.0"
 #define HNAE_DRIVER_NAME "hns"
 #define HNAE_COPYRIGHT "Copyright(c) 2015 Huawei Corporation."
 #define HNAE_DRIVER_STRING "Hisilicon Network Subsystem Driver"
@@ -63,6 +63,7 @@ do { \
 
 #define AE_VERSION_1 ('6' << 16 | '6' << 8 | '0')
 #define AE_VERSION_2 ('1' << 24 | '6' << 16 | '1' << 8 | '0')
+#define AE_IS_VER1(ver) ((ver) == AE_VERSION_1)
 #define AE_NAME_SIZE 16
 
 /* some said the RX and TX RCB format should not be the same in the future. But
@@ -144,23 +145,59 @@ enum hnae_led_state {
 #define HNS_RXD_ASID_S 24
 #define HNS_RXD_ASID_M (0xff << HNS_RXD_ASID_S)
 
+#define HNSV2_TXD_RI_B 1
+#define HNSV2_TXD_L4CS_B 2
+#define HNSV2_TXD_L3CS_B 3
+#define HNSV2_TXD_FE_B 4
+#define HNSV2_TXD_VLD_B 5
+
+#define HNSV2_TXD_TSE_B 0
+#define HNSV2_TXD_VLAN_EN_B 1
+#define HNSV2_TXD_SNAP_B 2
+#define HNSV2_TXD_IPV6_B 3
+#define HNSV2_TXD_SCTP_B 4
+
 /* hardware spec ring buffer format */
 struct __packed hnae_desc {
 	__le64 addr;
 	union {
 		struct {
-			__le16 asid_bufnum_pid;
+			union {
+				__le16 asid_bufnum_pid;
+				__le16 asid;
+			};
 			__le16 send_size;
-			__le32 flag_ipoffset;
-			__le32 reserved_3[4];
+			union {
+				__le32 flag_ipoffset;
+				struct {
+					__u8 bn_pid;
+					__u8 ra_ri_cs_fe_vld;
+					__u8 ip_offset;
+					__u8 tse_vlan_snap_v6_sctp_nth;
+				};
+			};
+			__le16 mss;
+			__u8 l4_len;
+			__u8 reserved1;
+			__le16 paylen;
+			__u8 vmid;
+			__u8 qid;
+			__le32 reserved2[2];
 		} tx;
 		struct {
 			__le32 ipoff_bnum_pid_flag;
Re: tty,net: use-after-free in x25_asy_open_tty
[ + David Miller ]

On 11/20/2015 08:56 AM, Sasha Levin wrote:
> Hi all,
>
> While fuzzing with syzkaller inside a kvmtools guest running latest -next
> kernel, I've hit:
>
> [ 634.336761] ==
> [ 634.338226] BUG: KASAN: use-after-free in x25_asy_open_tty+0x13d/0x490 at addr 8800a743efd0
> [ 634.339558] Read of size 4 by task syzkaller_execu/8981
> [ 634.340359] =
> [ 634.341598] BUG kmalloc-512 (Not tainted): kasan: bad access detected

Thanks for the report, Sasha.

Would you please test the patch below?

The ldisc api should really prevent these kinds of errors. I'll prepare
a patch to the tty core which should address the api weakness.

Regards,
Peter Hurley

--- >% ---
Subject: [PATCH] wan/x25: Fix use-after-free in x25_asy_open_tty()

The N_X25 line discipline may access the previous line discipline's closed
and already-freed private data on open [1].

The tty->disc_data field _never_ refers to valid data on entry to the
line discipline's open() method. Rather, the ldisc is expected to
initialize that field for its own use for the lifetime of the instance
(ie. from open() to close() only).

[1] Report by Sasha Levin

[ 634.336761] ==
[ 634.338226] BUG: KASAN: use-after-free in x25_asy_open_tty+0x13d/0x490 at addr 8800a743efd0
[ 634.339558] Read of size 4 by task syzkaller_execu/8981
[ 634.340359] =
[ 634.341598] BUG kmalloc-512 (Not tainted): kasan: bad access detected
...
[ 634.405018] Call Trace:
[ 634.405277] dump_stack (lib/dump_stack.c:52)
[ 634.405775] print_trailer (mm/slub.c:655)
[ 634.406361] object_err (mm/slub.c:662)
[ 634.406824] kasan_report_error (mm/kasan/report.c:138 mm/kasan/report.c:236)
[ 634.409581] __asan_report_load4_noabort (mm/kasan/report.c:279)
[ 634.411355] x25_asy_open_tty (drivers/net/wan/x25_asy.c:559 (discriminator 1))
[ 634.413997] tty_ldisc_open.isra.2 (drivers/tty/tty_ldisc.c:447)
[ 634.414549] tty_set_ldisc (drivers/tty/tty_ldisc.c:567)
[ 634.415057] tty_ioctl (drivers/tty/tty_io.c:2646 drivers/tty/tty_io.c:2879)
[ 634.423524] do_vfs_ioctl (fs/ioctl.c:43 fs/ioctl.c:607)
[ 634.427491] SyS_ioctl (fs/ioctl.c:622 fs/ioctl.c:613)
[ 634.427945] entry_SYSCALL_64_fastpath (arch/x86/entry/entry_64.S:188)

Reported-by: Sasha Levin
Signed-off-by: Peter Hurley
---
 drivers/net/wan/x25_asy.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/net/wan/x25_asy.c b/drivers/net/wan/x25_asy.c
index 5c47b01..cd39025 100644
--- a/drivers/net/wan/x25_asy.c
+++ b/drivers/net/wan/x25_asy.c
@@ -549,16 +549,12 @@ static void x25_asy_receive_buf(struct tty_struct *tty,
 
 static int x25_asy_open_tty(struct tty_struct *tty)
 {
-	struct x25_asy *sl = tty->disc_data;
+	struct x25_asy *sl;
 	int err;
 
 	if (tty->ops->write == NULL)
 		return -EOPNOTSUPP;
 
-	/* First make sure we're not already connected. */
-	if (sl && sl->magic == X25_ASY_MAGIC)
-		return -EEXIST;
-
 	/* OK. Find a free X.25 channel to use. */
 	sl = x25_asy_alloc();
 	if (sl == NULL)
-- 
2.6.3
Re: [PATCH net-next] net: remove useless check in napi_gro_frags()
From: Eric Dumazet
Date: Thu, 19 Nov 2015 13:43:45 -0800

> On Thu, 2015-11-19 at 16:06 -0500, Aaron Conole wrote:
>
>> Would the following be an appropriate change in addition to the one
>> you've posted, then? If so I can repost as a formal patch, if you'd
>> like. At present, there's only one user of napi_frags_skb(), and your
>> patch removes the NULL check. If this can really only be the result of
>> buggy driver, then perhaps we should just call out the bug?
>
> Lets mark my patch as "premature" optimization, and revisit whole thing
> after audit of the 10 drivers using this interface ;)

Also BUG_ON() is way too large a hammer.

An attempt to continue should be made in some way, so that the person
inspecting the message can still have a network and work on fixing the
driver after the check triggers :-)
Re: [PATCH net-next 0/8] tipc: some cleanups and improvements
From: Jon Maloy
Date: Thu, 19 Nov 2015 14:30:38 -0500

> This series mostly contains cleanups and cosmetic code changes.
> The only real functional change is in #4 and #5, where we change the
> locking structure for nodes and links in order to permit full
> concurrency between links working in parallel on different interfaces.
> Since the groundwork for this has been done in previous commit series,
> this change constitutes only the final, small step to achieve that goal.

Series applied, thanks.

Generally speaking, rwlock usage really never buys you anything significant. Therefore in the long run I think a single spinlock plus RCU is going to be much better for per-node locking in TIPC.
[PATCH v3 net-next] ravb: use clock rate as basis for GTI.TIV
The GTI.TIV may be set to 2GHz^2 / rate, where rate is that of the clock of the device. Rather than assuming a rate of 130MHz use the actual rate of the clock. The motivation for this is to use the correct rate on the r8a7795/Salvator-X which is advertised as 133MHz but may differ depending on the extal present on the Salvator-X. Signed-off-by: Simon Horman--- v2 * Corrected typos in changelog, as pointed out by Geert Uytterhoeven * Use do_div() rather than 64-bit division to allow compilation on 32-bit ARM v3 * Dropped RFC prefix --- drivers/net/ethernet/renesas/ravb.h | 3 +++ drivers/net/ethernet/renesas/ravb_main.c | 38 +++- 2 files changed, 40 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h index 0623fff932e4..f9dee7436e81 100644 --- a/drivers/net/ethernet/renesas/ravb.h +++ b/drivers/net/ethernet/renesas/ravb.h @@ -576,6 +576,9 @@ enum GTI_BIT { GTI_TIV = 0x0FFF, }; +#define GTI_TIV_MAXGTI_TIV +#define GTI_TIV_MIN0x20 + /* GIC */ enum GIC_BIT { GIC_PTCE= 0x0001, /* Undocumented? 
*/ diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c index ee8d1ec61fab..990dc55cdada 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c +++ b/drivers/net/ethernet/renesas/ravb_main.c @@ -32,6 +32,8 @@ #include #include +#include + #include "ravb.h" #define RAVB_DEF_MSG_ENABLE \ @@ -1659,6 +1661,38 @@ static const struct of_device_id ravb_match_table[] = { }; MODULE_DEVICE_TABLE(of, ravb_match_table); +static int ravb_set_gti(struct net_device *ndev) +{ + + struct device *dev = ndev->dev.parent; + struct device_node *np = dev->of_node; + unsigned long rate; + struct clk *clk; + uint64_t inc; + + clk = of_clk_get(np, 0); + if (IS_ERR(clk)) { + dev_err(dev, "could not get clock\n"); + return PTR_ERR(clk); + } + + rate = clk_get_rate(clk); + clk_put(clk); + + inc = 10ULL << 20; + do_div(inc, rate); + + if (inc < GTI_TIV_MIN || inc > GTI_TIV_MAX) { + dev_err(dev, "gti.tiv increment 0x%llx is outside the range 0x%x - 0x%x\n", + inc, GTI_TIV_MIN, GTI_TIV_MAX); + return -EINVAL; + } + + ravb_write(ndev, inc, GTI); + + return 0; +} + static int ravb_probe(struct platform_device *pdev) { struct device_node *np = pdev->dev.of_node; @@ -1755,7 +1789,9 @@ static int ravb_probe(struct platform_device *pdev) CCC); /* Set GTI value */ - ravb_write(ndev, ((1000 << 20) / 130) & GTI_TIV, GTI); + error = ravb_set_gti(ndev); + if (error) + goto out_release; /* Request GTI loading */ ravb_write(ndev, ravb_read(ndev, GCCR) | GCCR_LTI, GCCR); -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
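As a worked example of the increment calculation in the patch above, here is a minimal userspace sketch. It rests on two assumptions that are not confirmed by the text as archived: the numerator is taken to be 10^9 << 20 (nanoseconds per second scaled by 2^20), which reproduces the replaced `((1000 << 20) / 130)` constant for a 130 MHz clock, and the TIV field is taken to be 28 bits wide. The function name `gti_tiv_from_rate()` is hypothetical.

```c
#include <errno.h>
#include <stdint.h>

#define GTI_TIV_MIN 0x20u
/* Assumption: 28-bit field; the mask as printed in the archive looks truncated. */
#define GTI_TIV_MAX 0x0FFFFFFFu

/*
 * Compute the timer increment for a clock running at rate_hz:
 * the clock period in nanoseconds, scaled by 2^20.
 * Returns 0 and stores the increment, or -EINVAL if the rate is zero
 * or the result falls outside [GTI_TIV_MIN, GTI_TIV_MAX].
 */
static int gti_tiv_from_rate(uint64_t rate_hz, uint32_t *tiv)
{
	uint64_t inc;

	if (rate_hz == 0)
		return -EINVAL;

	/* Assumption: numerator is 10^9 ns/s shifted left by 20 bits. */
	inc = (UINT64_C(1000000000) << 20) / rate_hz;

	if (inc < GTI_TIV_MIN || inc > GTI_TIV_MAX)
		return -EINVAL;

	*tiv = (uint32_t)inc;
	return 0;
}
```

Under these assumptions a 130 MHz clock yields the same value as the old hard-coded expression, while a different clock rate yields a correspondingly different increment instead of silently reusing the 130 MHz one.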
[PATCH V4 net-next 4/5] net:hns: Add support of ethtool TSO set option for Hip06 in HNS
From: Salil

This patch adds support for the ethtool TSO option for the Hip06 SoC to the HNS driver.

Signed-off-by: Salil Mehta
Signed-off-by: lisheng
---
PATCH V4:
- This fixes the comments given by Sergei Shtylyov over the PATCH V3:
  Link: https://lkml.org/lkml/2015/11/20/358
PATCH V3/V2:
- No change over the initial patch
PATCH V1:
- Initial version of Ethtool support of TSO by Lisheng
---
 drivers/net/ethernet/hisilicon/hns/hns_enet.c | 47 +
 1 file changed, 47 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
index 055e14c..09995d2 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
@@ -1386,6 +1386,51 @@ static int hns_nic_change_mtu(struct net_device *ndev, int new_mtu)
 	return ret;
 }
 
+static int hns_nic_set_features(struct net_device *netdev,
+				netdev_features_t features)
+{
+	struct hns_nic_priv *priv = netdev_priv(netdev);
+	struct hnae_handle *h = priv->ae_handle;
+
+	switch (priv->enet_ver) {
+	case AE_VERSION_1:
+		if (features & (NETIF_F_TSO | NETIF_F_TSO6))
+			netdev_info(netdev, "enet v1 do not support tso!\n");
+		break;
+	default:
+		if (features & (NETIF_F_TSO | NETIF_F_TSO6)) {
+			priv->ops.fill_desc = fill_tso_desc;
+			priv->ops.maybe_stop_tx = hns_nic_maybe_stop_tso;
+			/* The chip only support 7*4096 */
+			netif_set_gso_max_size(netdev, 7 * 4096);
+			h->dev->ops->set_tso_stats(h, 1);
+		} else {
+			priv->ops.fill_desc = fill_v2_desc;
+			priv->ops.maybe_stop_tx = hns_nic_maybe_stop_tx;
+			h->dev->ops->set_tso_stats(h, 0);
+		}
+		break;
+	}
+	netdev->features = features;
+	return 0;
+}
+
+static netdev_features_t hns_nic_fix_features(
+		struct net_device *netdev, netdev_features_t features)
+{
+	struct hns_nic_priv *priv = netdev_priv(netdev);
+
+	switch (priv->enet_ver) {
+	case AE_VERSION_1:
+		features &= ~(NETIF_F_TSO | NETIF_F_TSO6 |
+				NETIF_F_HW_VLAN_CTAG_FILTER);
+		break;
+	default:
+		break;
+	}
+	return features;
+}
+
 /**
  * nic_set_multicast_list - set mutl mac address
  * @netdev: net device
@@ -1481,6 +1526,8 @@ static const struct net_device_ops hns_nic_netdev_ops = {
 	.ndo_set_mac_address = hns_nic_net_set_mac_address,
 	.ndo_change_mtu = hns_nic_change_mtu,
 	.ndo_do_ioctl = hns_nic_do_ioctl,
+	.ndo_set_features = hns_nic_set_features,
+	.ndo_fix_features = hns_nic_fix_features,
 	.ndo_get_stats64 = hns_nic_get_stats64,
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_poll_controller = hns_nic_poll_controller,
--
1.7.9.5
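The feature masking that the fix_features callback in the patch above performs can be illustrated with a small standalone sketch. The flag bits and the `enum enet_version` values below are hypothetical placeholders, not the kernel's NETIF_F_* or AE_VERSION_* definitions; only the masking logic mirrors the patch.

```c
#include <stdint.h>

/* Hypothetical flag bits for illustration; the kernel's NETIF_F_* values differ. */
#define F_SG			(UINT64_C(1) << 0)
#define F_TSO			(UINT64_C(1) << 1)
#define F_TSO6			(UINT64_C(1) << 2)
#define F_VLAN_CTAG_FILTER	(UINT64_C(1) << 3)

/* Hypothetical hardware revisions. */
enum enet_version { ENET_V1 = 1, ENET_V2 = 2 };

/*
 * Strip the features that version 1 hardware cannot offer, so that a
 * later "set features" step never sees a request it cannot honour.
 */
static uint64_t fix_features(enum enet_version ver, uint64_t requested)
{
	if (ver == ENET_V1)
		requested &= ~(F_TSO | F_TSO6 | F_VLAN_CTAG_FILTER);
	return requested;
}
```

Filtering in the "fix" step keeps the "set" step simple: by the time features are applied, unsupported bits have already been removed for the older hardware.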
Re: [PATCH RESEND] cgroups: Allow dynamically changing net_classid
Hello,

On Fri, Nov 20, 2015 at 12:31:39PM -0800, Nina Schiff wrote:
> The classid of a process is changed either when a process is moved to
> or from a cgroup or when the net_cls.classid file is updated.
> Previously net_cls only supported propogating these changes to the
> cgroup's related sockets when a process was added or removed from the
> cgroup. This means it was neccessary to remove and re-add all processes
> to a cgroup in order to update its classid. This change introduces
> support for doing this dynamically - i.e. when the value is changed in
> the net_cls_classid file, this will also trigger an update to the
> classid associated with all sockets controlled by the cgroup.
> This mimics the behaviour of other cgroup subsystems.
> net_prio circumvents this issue by storing an index into a table with
> each socket (and so any updates to the table, don't require updating
> the value associated with the socket). net_cls, however, passes the
> socket the classid directly, and so this additional step is needed.
>
> Signed-off-by: Nina Schiff

Acked-by: Tejun Heo

This was broken from the beginning. Thanks for fixing this.

BTW, this will cause a context conflict with the cgroup2 match patches. I'll update the patchset once this lands in net-next.

--
tejun
Re: [PATCH net 1/1] tipc: correct settings of broadcast link state
From: Jon Maloy
Date: Thu, 19 Nov 2015 14:12:50 -0500

> Since commit 5266698661401afc5e ("tipc: let broadcast packet
> reception use new link receive function") the broadcast send
> link state was meant to always be set to LINK_ESTABLISHED, since
> we don't need this link to follow the regular link FSM rules. It
> was also the intention that this state anyway shouldn't impact
> the run-time working state of the link, since the latter in
> reality is controlled by the number of registered peers.
>
> We have now discovered that this assumption is not quite correct.
> If the broadcast link is reset because of too many retransmissions,
> its state will inadvertently go to LINK_RESETTING, and never go
> back to LINK_ESTABLISHED, because the LINK_FAILURE event was not
> anticipated. This will work well once, but if it happens a second
> time, the reset on a link in LINK_RESETTING has has no effect, and
> neither the broadcast link nor the unicast links will go down as
> they should.
>
> Furthermore, it is confusing that the management tool shows that
> this link is in UP state when that obviously isn't the case.
>
> We now ensure that this state strictly follows the true working
> state of the link. The state is set to LINK_ESTABLISHED when
> the number of peers is non-zero, and to LINK_RESET otherwise.
>
> Signed-off-by: Jon Maloy

Applied, thanks.
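The rule stated at the end of the quoted commit message (state follows the peer count) can be sketched as below. This is a minimal illustration, not TIPC code: `struct bc_link` and the helper names are hypothetical, and only the two states relevant to the rule are modelled.

```c
/* Only the two states the rule distinguishes; the real FSM has more. */
enum link_state { LINK_RESET, LINK_ESTABLISHED };

/* Hypothetical broadcast link: a peer count and the derived state. */
struct bc_link {
	unsigned int peers;
	enum link_state state;
};

/* Derive the state from the peer count instead of tracking it separately. */
static void bc_link_sync_state(struct bc_link *l)
{
	l->state = l->peers ? LINK_ESTABLISHED : LINK_RESET;
}

static void bc_link_add_peer(struct bc_link *l)
{
	l->peers++;
	bc_link_sync_state(l);
}

static void bc_link_remove_peer(struct bc_link *l)
{
	if (l->peers)
		l->peers--;
	bc_link_sync_state(l);
}
```

Because the state is recomputed on every membership change, it cannot get stuck in an intermediate value after a failure, which is the problem the commit message describes.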
Re: [PATCH 3/3 v3] dl2k: Implement suspend
From: Ondrej Zary
Date: Thu, 19 Nov 2015 20:13:06 +0100

> Add suspend/resume support to dl2k driver.
> This requires RX/TX rings to be reset so split out the required
> functionality from alloc_list() into new rio_reset_ring().
>
> Tested on Asus NX1101 (IP1000A) and D-Link DGE-550T (DL-2000).
>
> Signed-off-by: Ondrej Zary

Applied to net-next.
Re: [PATCHSET v2] netfilter, cgroup: implement xt_cgroup2 match
On Fri, Nov 20, 2015 at 08:56:25PM +0100, Pablo Neira Ayuso wrote:
> Regarding #7, I have a couple two concerns:
>
> 1) cgroup currently doesn't work the way users expect, ie. to perform any
>    reasonable firewalling. Since this relies on early demux, only a
>    limited number of sockets get access to the cgroup info.

Oops, sorry, I forgot to indicate that I'm referring to the INPUT chain.

> 2) We have traditionally rejected match2 and target2 extensions. I
>    guess you can accomodate the new cgroup code through the revision
>    iptables infrastructure, so we still use the cgroup match.
>
> Let me know, thanks.
Re: [PATCHSET v2] netfilter, cgroup: implement xt_cgroup2 match
On Fri, Nov 20, 2015 at 01:59:12PM -0500, David Miller wrote:
> From: Tejun Heo
> Date: Thu, 19 Nov 2015 13:52:44 -0500
>
> > This is the second take of the xt_cgroup2 patchset. Changes from the
> > last take are
> >
> > * Instead of adding sock->sk_cgroup separately, sock->sk_cgrp_data now
> >   carries either (prioidx, classid) pair or cgroup2 pointer. This
> >   avoids inflating struct sock with yet another cgroup related field.
> >   Unfortunately, this does add some complexity but that's the
> >   trade-off and the complexity is contained in cgroup proper.
> >
> > * Various small updats as per David and Jan's reviews.
>
> I like this a lot better, thanks.
>
> Please address Daniel's feedback on patch #6 and then I'm personally
> fine with this series.
>
> Pablo, are you ok with me merging this into net-next directly or
> would you rather I take patches 1-6 into net-next and then you can
> merge and then add patch #7 on top?

I'd suggest you get 1-6, then I'll pull this into my tree. Thanks David!

Regarding #7, I have a couple of concerns:

1) cgroup currently doesn't work the way users expect, i.e. to perform any
   reasonable firewalling. Since this relies on early demux, only a
   limited number of sockets get access to the cgroup info.

2) We have traditionally rejected match2 and target2 extensions. I
   guess you can accommodate the new cgroup code through the revision
   iptables infrastructure, so we still use the cgroup match.

Let me know, thanks.
[PATCH RESEND] cgroups: Allow dynamically changing net_classid
The classid of a process is changed either when a process is moved to or from a cgroup or when the net_cls.classid file is updated. Previously net_cls only supported propogating these changes to the cgroup's related sockets when a process was added or removed from the cgroup. This means it was neccessary to remove and re-add all processes to a cgroup in order to update its classid. This change introduces support for doing this dynamically - i.e. when the value is changed in the net_cls_classid file, this will also trigger an update to the classid associated with all sockets controlled by the cgroup. This mimics the behaviour of other cgroup subsystems. net_prio circumvents this issue by storing an index into a table with each socket (and so any updates to the table, don't require updating the value associated with the socket). net_cls, however, passes the socket the classid directly, and so this additional step is needed. Signed-off-by: Nina Schiff--- Concatented two email addresses by mistake, so resending net/core/netclassid_cgroup.c | 26 ++ 1 file changed, 18 insertions(+), 8 deletions(-) diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c index 6441f47..2e4df84 100644 --- a/net/core/netclassid_cgroup.c +++ b/net/core/netclassid_cgroup.c @@ -56,7 +56,7 @@ static void cgrp_css_free(struct cgroup_subsys_state *css) kfree(css_cls_state(css)); } -static int update_classid(const void *v, struct file *file, unsigned n) +static int update_classid_sock(const void *v, struct file *file, unsigned n) { int err; struct socket *sock = sock_from_file(file, ); @@ -67,18 +67,25 @@ static int update_classid(const void *v, struct file *file, unsigned n) return 0; } -static void cgrp_attach(struct cgroup_subsys_state *css, - struct cgroup_taskset *tset) +static void update_classid(struct cgroup_subsys_state *css, void *v) { - struct cgroup_cls_state *cs = css_cls_state(css); - void *v = (void *)(unsigned long)cs->classid; + struct css_task_iter it; struct 
task_struct *p; - cgroup_taskset_for_each(p, tset) { + css_task_iter_start(css, ); + while ((p = css_task_iter_next())) { task_lock(p); - iterate_fd(p->files, 0, update_classid, v); + iterate_fd(p->files, 0, update_classid_sock, v); task_unlock(p); } + css_task_iter_end(); +} + +static void cgrp_attach(struct cgroup_subsys_state *css, + struct cgroup_taskset *tset) +{ + update_classid(css, + (void *)(unsigned long)css_cls_state(css)->classid); } static u64 read_classid(struct cgroup_subsys_state *css, struct cftype *cft) @@ -89,8 +96,11 @@ static u64 read_classid(struct cgroup_subsys_state *css, struct cftype *cft) static int write_classid(struct cgroup_subsys_state *css, struct cftype *cft, u64 value) { - css_cls_state(css)->classid = (u32) value; + struct cgroup_cls_state *cs = css_cls_state(css); + + cs->classid = (u32)value; + update_classid(css, (void *)(unsigned long)cs->classid); return 0; } -- 2.4.6 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
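The difference the commit message above draws between net_prio (socket holds an index into a table) and net_cls (socket holds a copy of the value) can be sketched in userspace C. All names here (`struct sock_sketch`, `prio_table`, `update_classid()`) are hypothetical and the data layout is deliberately simplified; it only shows why a copied value needs an explicit walk over the sockets when it changes.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_GROUPS 4

/* net_prio style: the value lives in a shared table, sockets keep an index. */
static uint32_t prio_table[MAX_GROUPS];

struct sock_sketch {
	unsigned int group;	/* which group owns the socket */
	unsigned int prio_idx;	/* index into prio_table */
	uint32_t classid;	/* net_cls style: value copied into the socket */
};

/* A table update is visible immediately through the index. */
static uint32_t sock_prio(const struct sock_sketch *sk)
{
	return prio_table[sk->prio_idx];
}

/* A copied value must be refreshed on every socket of the group. */
static void update_classid(struct sock_sketch *socks, size_t n,
			   unsigned int group, uint32_t classid)
{
	for (size_t i = 0; i < n; i++)
		if (socks[i].group == group)
			socks[i].classid = classid;
}
```

In this sketch, writing `prio_table[1]` changes what every socket with `prio_idx == 1` reports, whereas a new classid reaches the sockets only after `update_classid()` has visited them, which corresponds to the extra step the patch adds on write.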
Re: [PATCH net 2/2] net: ip6mr: fix static mfc/dev leaks on table destruction
On Fri, Nov 20, 2015 at 4:54 AM, Nikolay Aleksandrov wrote:
> From: Nikolay Aleksandrov
>
> Similar to ipv4, when destroying an mrt table the static mfc entries and
> the static devices are kept, which leads to devices that can never be
> destroyed (because of refcnt taken) and leaked memory. Make sure that
> everything is cleaned up on netns destruction.
>
> Fixes: 8229efdaef1e ("netns: ip6mr: enable namespace support in ipv6
> multicast forwarding code")
> CC: Benjamin Thery
> Signed-off-by: Nikolay Aleksandrov

Reviewed-by: Cong Wang
Re: [PATCH net 1/2] net: ipmr: fix static mfc/dev leaks on table destruction
On Fri, Nov 20, 2015 at 4:54 AM, Nikolay Aleksandrov wrote:
> From: Nikolay Aleksandrov
>
> When destroying an mrt table the static mfc entries and the static
> devices are kept, which leads to devices that can never be destroyed
> (because of refcnt taken) and leaked memory, for example:
> unreferenced object 0x880034c144c0 (size 192):
>   comm "mfc-broken", pid 4777, jiffies 4320349055 (age 46001.964s)
>   hex dump (first 32 bytes):
>     98 53 f0 34 00 88 ff ff 98 53 f0 34 00 88 ff ff  .S.4.S.4
>     ef 0a 0a 14 01 02 03 04 00 00 00 00 01 00 00 00
>   backtrace:
>     [] kmemleak_alloc+0x4e/0xb0
>     [] kmem_cache_alloc+0x190/0x300
>     [] ip_mroute_setsockopt+0x5cb/0x910
>     [] do_ip_setsockopt.isra.11+0x105/0xff0
>     [] ip_setsockopt+0x30/0xa0
>     [] raw_setsockopt+0x33/0x90
>     [] sock_common_setsockopt+0x14/0x20
>     [] SyS_setsockopt+0x71/0xc0
>     [] entry_SYSCALL_64_fastpath+0x16/0x7a
>     [] 0x
>
> Make sure that everything is cleaned on netns destruction.
>
> Signed-off-by: Nikolay Aleksandrov

Looks good to me,

Reviewed-by: Cong Wang
Re: [PATCH 09/14] net: tcp_memcontrol: simplify linkage between socket and page counter
On Fri, Nov 20, 2015 at 03:42:16PM +0300, Vladimir Davydov wrote:
> On Thu, Nov 12, 2015 at 06:41:28PM -0500, Johannes Weiner wrote:
> > There won't be any separate counters for socket memory consumed by
> > protocols other than TCP in the future. Remove the indirection and
>
> I really want to believe you're right. And with vmpressure propagation
> implemented properly you are likely to be right.
>
> However, we might still want to account other socket protos to
> memcg->memory in the unified hierarchy, e.g. UDP, or SCTP, or whatever
> else. Adding new consumers should be trivial, but it will break the
> legacy usecase, where only TCP sockets are supposed to be accounted.
> What about adding a check to sock_update_memcg() so that it would enable
> accounting only for TCP sockets in case legacy hierarchy is used?

Yup, I was thinking the same thing. But we can cross that bridge when we come to it and are actually adding further packet types.

> For the same reason, I think we'd better rename memcg->tcp_mem to
> something like memcg->sk_mem or we can even drop the cg_proto struct
> altogether embedding its fields directly to mem_cgroup struct.
>
> Also, I don't see any reason to have tcp_memcontrol.c file. It's tiny
> and with this patch it does not depend on tcp code any more. Let's move
> it to memcontrol.c?

I actually had all this at first, but then wondered if it makes more sense to keep the legacy code in isolation. Don't you think it would be easier to keep track of what's v1 and what's v2 if we keep the legacy stuff physically separate as much as possible? In particular I found that 'tcp_mem.' marker really useful while working on the code.

In the same vein, tcp_memcontrol.c doesn't really hurt anybody and I'd expect it to remain mostly unopened and unchanged in the future. But if we merge it into memcontrol.c, that code will likely be in the way and we'd have to make it explicit somehow that this is not actually part of the new memory controller anymore.

What do you think?

> Other than that this patch looks OK to me.

Thank you!
Re: [PATCH 6/7] sock, cgroup: add sock->sk_cgroup
Hello, Daniel.

On Fri, Nov 20, 2015 at 12:04:05PM +0100, Daniel Wagner wrote:
> > static inline u16 sock_cgroup_prioidx(struct sock_cgroup_data *skcd)
> > {
> > -	return skcd->prioidx;
> > +	return (skcd->is_data & 1) ? skcd->prioidx : 1;
> > }
> >
> > static inline u32 sock_cgroup_classid(struct sock_cgroup_data *skcd)
> > {
> > -	return skcd->classid;
> > +	return (skcd->is_data & 1) ? skcd->classid : 0;
> > }
>
> I still try to understand what the code does, hence this stupid question:
>
> Why is sock_cgroup_prioidx() returning 1 if is not data and
> sock_cgroup_classid() a 0?

I prolly should have added comments there. prioidx carries the cgroup ID on the hierarchy net_prio is attached to, so if nothing is configured, the default value would be the ID of the root cgroup which is always 1. For net_cls, the unconfigured default value is zero. Will refresh the patch with comments.

Thanks.

--
tejun
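The defaults explained above can be shown with a simplified userspace sketch. `struct skcd_sketch` is a hypothetical flattened layout (a plain flag plus two fields); the patch under discussion packs the same information into a structure shared with a cgroup pointer, which this sketch does not attempt to reproduce.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical simplified layout; not the kernel's struct sock_cgroup_data. */
struct skcd_sketch {
	bool is_data;		/* true: prioidx and classid hold configured values */
	uint16_t prioidx;
	uint32_t classid;
};

static uint16_t skcd_prioidx(const struct skcd_sketch *s)
{
	/* Unconfigured: ID of the root cgroup, which is always 1. */
	return s->is_data ? s->prioidx : 1;
}

static uint32_t skcd_classid(const struct skcd_sketch *s)
{
	/* Unconfigured: the net_cls default of 0. */
	return s->is_data ? s->classid : 0;
}
```

The two accessors return different fallbacks because the two fields mean different things: one is an ID that always has a valid root value, the other is a tag whose "nothing set" value is zero.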
[PATCH V4 net-next 3/5] net:hns: Add Hip06 "TSO(TCP Segment Offload)" support HNS Driver
This patch adds the support of "TSO (TCP Segment Offload)" feature provided by the Hip06 ethernet hardware to the HNS ethernet driver. Enabling this feature would help offload the TCP Segmentation process to the Hip06 ethernet hardware. This eventually would help in saving precious cpu cycles. Signed-off-by: Salil MehtaSigned-off-by: lisheng --- PATCH V4: No change over the previous patches PATCH V3/V2: - No change over the initial floated patch for TSO PATCH V1: - Initial support of TSO feature in Hip06 SoC in HNS driver --- drivers/net/ethernet/hisilicon/hns/hnae.h |1 + drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c |8 ++ drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c |5 ++ drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.h |2 +- drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h |1 + drivers/net/ethernet/hisilicon/hns/hns_enet.c | 82 - 6 files changed, 95 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns/hnae.h b/drivers/net/ethernet/hisilicon/hns/hnae.h index 1ee42cb..6ec5bd7 100644 --- a/drivers/net/ethernet/hisilicon/hns/hnae.h +++ b/drivers/net/ethernet/hisilicon/hns/hnae.h @@ -472,6 +472,7 @@ struct hnae_ae_ops { int (*set_mac_addr)(struct hnae_handle *handle, void *p); int (*set_mc_addr)(struct hnae_handle *handle, void *addr); int (*set_mtu)(struct hnae_handle *handle, int new_mtu); + void (*set_tso_stats)(struct hnae_handle *handle, int enable); void (*update_stats)(struct hnae_handle *handle, struct net_device_stats *net_stats); void (*get_stats)(struct hnae_handle *handle, u64 *data); diff --git a/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c b/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c index e5a31bc..d02fa58 100644 --- a/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c +++ b/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c @@ -277,6 +277,13 @@ static int hns_ae_set_mtu(struct hnae_handle *handle, int new_mtu) return hns_mac_set_mtu(mac_cb, new_mtu); } +static void hns_ae_set_tso_stats(struct hnae_handle 
*handle, int enable) +{ + struct hns_ppe_cb *ppe_cb = hns_get_ppe_cb(handle); + + hns_ppe_set_tso_enable(ppe_cb, enable); +} + static int hns_ae_start(struct hnae_handle *handle) { int ret; @@ -824,6 +831,7 @@ static struct hnae_ae_ops hns_dsaf_ops = { .set_mc_addr = hns_ae_set_multicast_one, .set_mtu = hns_ae_set_mtu, .update_stats = hns_ae_update_stats, + .set_tso_stats = hns_ae_set_tso_stats, .get_stats = hns_ae_get_stats, .get_strings = hns_ae_get_strings, .get_sset_count = hns_ae_get_sset_count, diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c index 824fe50..b6bf292 100644 --- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c +++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c @@ -19,6 +19,11 @@ #include "hns_dsaf_ppe.h" +void hns_ppe_set_tso_enable(struct hns_ppe_cb *ppe_cb, u32 value) +{ + dsaf_set_dev_bit(ppe_cb, PPEV2_CFG_TSO_EN_REG, 0, !!value); +} + void hns_ppe_set_rss_key(struct hns_ppe_cb *ppe_cb, const u32 rss_key[HNS_PPEV2_RSS_KEY_NUM]) { diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.h b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.h index dac8532..0f5cb69 100644 --- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.h +++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.h @@ -113,7 +113,7 @@ void hns_ppe_get_regs(struct hns_ppe_cb *ppe_cb, void *data); void hns_ppe_get_strings(struct hns_ppe_cb *ppe_cb, int stringset, u8 *data); void hns_ppe_get_stats(struct hns_ppe_cb *ppe_cb, u64 *data); - +void hns_ppe_set_tso_enable(struct hns_ppe_cb *ppe_cb, u32 value); void hns_ppe_set_rss_key(struct hns_ppe_cb *ppe_cb, const u32 rss_key[HNS_PPEV2_RSS_KEY_NUM]); void hns_ppe_set_indir_table(struct hns_ppe_cb *ppe_cb, diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h index b070d57..98c163e 100644 --- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h +++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h 
@@ -317,6 +317,7 @@ #define PPE_CFG_TAG_GEN_REG0x90 #define PPE_CFG_PARSE_TAG_REG 0x94 #define PPE_CFG_PRO_CHECK_EN_REG 0x98 +#define PPEV2_CFG_TSO_EN_REG0xA0 #define PPE_INTEN_REG 0x100 #define PPE_RINT_REG 0x104 #define PPE_INTSTS_REG 0x108 diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c b/drivers/net/ethernet/hisilicon/hns/hns_enet.c index e235714..055e14c 100644 --- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c +++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c @@ -223,6
[PATCH V4 net-next 2/5] net:hns: Add Hip06 "RSS(Receive Side Scaling)" support to HNS Driver
This patch adds the support of "RSS (Receive Side Scaling)" feature provided by the Hip06 ethernet hardware to the HNS ethernet driver. This feature helps in distributing the different flows (mapped as hash by hardware using Toeplitz Hash) to different Queues asssociated with the processor cores. The mapping of flow-hash values to the different queues is stored in indirection table (which is per Packet- parse-Engine/PPE). This patch also provides the changes to re-program the (flow-hash<->Qid) mapping using the ethtool. Signed-off-by: Salil MehtaReviewed-by: Kenneth Lee --- PATCH V4: - No Change over previous patches PATCH V3: - No change ove PATCH V2 PATCH V2: - Fix for review-comments on PATCH V1 by Yisen.Zhuang(Zhuangyuzeng) Link: https://lkml.org/lkml/2015/10/21/1032 - Rework for Internal review comments by Kenneth Lee PATCH V1: - Initial version to support RSS and its Ethtool interface on Hip06 SoC --- drivers/net/ethernet/hisilicon/hns/hnae.h |6 ++ drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c | 53 +++- drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c | 61 +- drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.h | 32 +-- drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h | 14 drivers/net/ethernet/hisilicon/hns/hns_ethtool.c | 93 + 6 files changed, 249 insertions(+), 10 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns/hnae.h b/drivers/net/ethernet/hisilicon/hns/hnae.h index aa53dd3..1ee42cb 100644 --- a/drivers/net/ethernet/hisilicon/hns/hnae.h +++ b/drivers/net/ethernet/hisilicon/hns/hnae.h @@ -483,6 +483,12 @@ struct hnae_ae_ops { enum hnae_led_state status); void (*get_regs)(struct hnae_handle *handle, void *data); int (*get_regs_len)(struct hnae_handle *handle); + u32 (*get_rss_key_size)(struct hnae_handle *handle); + u32 (*get_rss_indir_size)(struct hnae_handle *handle); + int (*get_rss)(struct hnae_handle *handle, u32 *indir, u8 *key, + u8 *hfunc); + int (*set_rss)(struct hnae_handle *handle, const u32 *indir, + const u8 *key, const u8 
hfunc); }; struct hnae_ae_dev { diff --git a/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c b/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c index c03bc1e..e5a31bc 100644 --- a/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c +++ b/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c @@ -749,6 +749,53 @@ int hns_ae_get_regs_len(struct hnae_handle *handle) return total_num; } +static u32 hns_ae_get_rss_key_size(struct hnae_handle *handle) +{ + return HNS_PPEV2_RSS_KEY_SIZE; +} + +static u32 hns_ae_get_rss_indir_size(struct hnae_handle *handle) +{ + return HNS_PPEV2_RSS_IND_TBL_SIZE; +} + +static int hns_ae_get_rss(struct hnae_handle *handle, u32 *indir, u8 *key, + u8 *hfunc) +{ + struct hns_ppe_cb *ppe_cb = hns_get_ppe_cb(handle); + + /* currently we support only one type of hash function i.e. Toep hash */ + if (hfunc) + *hfunc = ETH_RSS_HASH_TOP; + + /* get the RSS Key required by the user */ + if (key) + memcpy(key, ppe_cb->rss_key, HNS_PPEV2_RSS_KEY_SIZE); + + /* update the current hash->queue mappings from the shadow RSS table */ + memcpy(indir, ppe_cb->rss_indir_table, HNS_PPEV2_RSS_IND_TBL_SIZE); + + return 0; +} + +static int hns_ae_set_rss(struct hnae_handle *handle, const u32 *indir, + const u8 *key, const u8 hfunc) +{ + struct hns_ppe_cb *ppe_cb = hns_get_ppe_cb(handle); + + /* set the RSS Hash Key if specififed by the user */ + if (key) + hns_ppe_set_rss_key(ppe_cb, (int *)key); + + /* update the shadow RSS table with user specified qids */ + memcpy(ppe_cb->rss_indir_table, indir, HNS_PPEV2_RSS_IND_TBL_SIZE); + + /* now update the hardware */ + hns_ppe_set_indir_table(ppe_cb, ppe_cb->rss_indir_table); + + return 0; +} + static struct hnae_ae_ops hns_dsaf_ops = { .get_handle = hns_ae_get_handle, .put_handle = hns_ae_put_handle, @@ -783,7 +830,11 @@ static struct hnae_ae_ops hns_dsaf_ops = { .update_led_status = hns_ae_update_led_status, .set_led_id = hns_ae_cpld_set_led_id, .get_regs = hns_ae_get_regs, - .get_regs_len = hns_ae_get_regs_len + 
.get_regs_len = hns_ae_get_regs_len, + .get_rss_key_size = hns_ae_get_rss_key_size, + .get_rss_indir_size = hns_ae_get_rss_indir_size, + .get_rss = hns_ae_get_rss, + .set_rss = hns_ae_set_rss }; int hns_dsaf_ae_init(struct dsaf_device *dsaf_dev) diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c index 9531992..824fe50 100644 --- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
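The hash-to-queue mapping described in the RSS patch above can be sketched as follows. This is an illustration under stated assumptions: the table size of 8 is hypothetical, the lookup by `hash % size` is an assumed indexing scheme (the hardware may use different bits of the hash), and `struct rss_sketch` with its helpers is not part of the driver.

```c
#include <stddef.h>
#include <stdint.h>

#define IND_TBL_SIZE 8	/* hypothetical size, chosen for readability */

/* Shadow copy of the indirection table: slot -> queue id. */
struct rss_sketch {
	uint32_t indir[IND_TBL_SIZE];
};

/* Replace the whole table with user-specified queue ids. */
static void rss_set_indir(struct rss_sketch *r, const uint32_t *indir)
{
	for (size_t i = 0; i < IND_TBL_SIZE; i++)
		r->indir[i] = indir[i];
}

/* Assumption: the flow hash selects a slot by modulo of the table size. */
static uint32_t rss_pick_queue(const struct rss_sketch *r, uint32_t flow_hash)
{
	return r->indir[flow_hash % IND_TBL_SIZE];
}
```

Reprogramming the table changes which queue a given flow hash lands on without touching the hash function itself, which is what the ethtool interface in the patch exposes.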
[PATCH V4 net-next 5/5] net:hns: Add the init code to disable Hip06 "Hardware VLAN assist"
This patch adds the initialization code to disable the hardware VLAN support for VLAN tag stripping by default for now. Proper support of the "hardware VLAN assistance" feature will come in upcoming patches.

Signed-off-by: Salil Mehta
---
PATCH V4:
- No change over the earlier patches
PATCH V2/V3:
- No change over the initial floated patch
PATCH V1:
- Initial code to disable the hardware VLAN assist for now
---
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c |7 +++
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h |1 +
 2 files changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
index b6bf292..544f323 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
@@ -176,6 +176,11 @@ static void hns_ppe_cnt_clr_ce(struct hns_ppe_cb *ppe_cb)
 			 PPE_CNT_CLR_CE_B, 1);
 }
 
+static void hns_ppe_set_vlan_strip(struct hns_ppe_cb *ppe_cb, int en)
+{
+	dsaf_write_dev(ppe_cb, PPEV2_VLAN_STRIP_EN_REG, en);
+}
+
 /**
  * hns_ppe_checksum_hw - set ppe checksum caculate
  * @ppe_device: ppe device
@@ -345,6 +350,8 @@ static void hns_ppe_init_hw(struct hns_ppe_cb *ppe_cb)
 	hns_ppe_cnt_clr_ce(ppe_cb);
 
 	if (!AE_IS_VER1(dsaf_dev->dsaf_ver)) {
+		hns_ppe_set_vlan_strip(ppe_cb, 0);
+
 		hns_ppe_set_rss_key(ppe_cb, rss_key);
 
 		for (i = 0; i < HNS_PPEV2_RSS_IND_TBL_SIZE; i++)
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h
index 98c163e..6c18ca9 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h
@@ -318,6 +318,7 @@
 #define PPE_CFG_PARSE_TAG_REG 0x94
 #define PPE_CFG_PRO_CHECK_EN_REG 0x98
 #define PPEV2_CFG_TSO_EN_REG 0xA0
+#define PPEV2_VLAN_STRIP_EN_REG 0xAC
 #define PPE_INTEN_REG 0x100
 #define PPE_RINT_REG 0x104
 #define PPE_INTSTS_REG 0x108
--
1.7.9.5
Re: [PATCH 2/3] dl2k: Reorder and cleanup initialization
From: Ondrej Zary
Date: Thu, 19 Nov 2015 20:13:05 +0100

> Move HW init and stop into separate functions.
> Request IRQ only after the HW has been reset (so interrupts are
> disabled and no stale interrupts are pending).
>
> Signed-off-by: Ondrej Zary

Applied to net-next.
Re: [PATCH 1/3 v2] dl2k: Handle memory allocation errors in alloc_list
From: Ondrej Zary
Date: Thu, 19 Nov 2015 20:13:04 +0100

> If memory allocation fails in alloc_list(), free the already allocated
> memory and return -ENOMEM. In rio_open(), call alloc_list() first and
> abort if it fails. Move HW access (set RFDListPtr) out of alloc_list().
>
> Signed-off-by: Ondrej Zary

Applied to net-next.
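For readers who want to picture the error handling described in the quoted commit message, here is a minimal userspace sketch of the "free what was already allocated, then return -ENOMEM" pattern. It is not the dl2k code: alloc_ring() and free_ring() are hypothetical names, and malloc() stands in for whatever allocator the driver uses.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: release the first n buffers of the ring. */
static void free_ring(void **ring, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		free(ring[i]);
		ring[i] = NULL;
	}
}

/*
 * Hypothetical stand-in for an alloc_list()-style function: allocate
 * n buffers; on the first failure, undo the allocations made so far
 * and report -ENOMEM so the caller can abort before touching hardware.
 */
static int alloc_ring(void **ring, size_t n, size_t buf_size)
{
	for (size_t i = 0; i < n; i++) {
		ring[i] = malloc(buf_size);
		if (!ring[i]) {
			free_ring(ring, i);	/* only slots 0..i-1 are live */
			return -ENOMEM;
		}
	}
	return 0;
}
```

Calling the allocation first, as the commit message suggests for rio_open(), means a failure leaves nothing to unwind on the hardware side.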
[PATCH net-next 4/6] kcm: Kernel Connection Multiplexor module
This module implement the Kernel Connection Multiplexor. Kernel Connection Multiplexor (KCM) is a facility that provides a message based interface over TCP for generic application protocols. With KCM an application can efficiently send and receive application protocol messages over TCP using datagram sockets. For more information see the included Documentation/networking/kcm.txt Signed-off-by: Tom Herbert--- include/linux/socket.h |6 +- include/net/kcm.h| 121 +++ include/uapi/linux/kcm.h | 27 + net/Kconfig |1 + net/Makefile |1 + net/kcm/Kconfig | 10 + net/kcm/Makefile |3 + net/kcm/kcmsock.c| 1974 ++ 8 files changed, 2142 insertions(+), 1 deletion(-) create mode 100644 include/net/kcm.h create mode 100644 include/uapi/linux/kcm.h create mode 100644 net/kcm/Kconfig create mode 100644 net/kcm/Makefile create mode 100644 net/kcm/kcmsock.c diff --git a/include/linux/socket.h b/include/linux/socket.h index d834af2..73bf6c6 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -200,7 +200,9 @@ struct ucred { #define AF_ALG 38 /* Algorithm sockets*/ #define AF_NFC 39 /* NFC sockets */ #define AF_VSOCK 40 /* vSockets */ -#define AF_MAX 41 /* For now.. */ +#define AF_KCM 41 /* Kernel Connection Multiplexor*/ + +#define AF_MAX 42 /* For now.. */ /* Protocol families, same as address families. */ #define PF_UNSPEC AF_UNSPEC @@ -246,6 +248,7 @@ struct ucred { #define PF_ALG AF_ALG #define PF_NFC AF_NFC #define PF_VSOCK AF_VSOCK +#define PF_KCM AF_KCM #define PF_MAX AF_MAX /* Maximum queue length specifiable by listen. 
*/ @@ -323,6 +326,7 @@ struct ucred { #define SOL_CAIF 278 #define SOL_ALG279 #define SOL_NFC280 +#define SOL_KCM281 /* IPX options */ #define IPX_TYPE 1 diff --git a/include/net/kcm.h b/include/net/kcm.h new file mode 100644 index 000..4f371fe --- /dev/null +++ b/include/net/kcm.h @@ -0,0 +1,121 @@ +/* Kernel Connection Multiplexor */ + +#ifndef __NET_KCM_H_ +#define __NET_KCM_H_ + +#include +#include +#include + +#ifdef __KERNEL__ + +extern unsigned int kcm_net_id; + +struct kcm_tx_msg { + unsigned int sent; + unsigned int fragidx; + unsigned int frag_offset; + unsigned int msg_flags; + struct sk_buff *frag_skb; + struct sk_buff *last_skb; +}; + +struct kcm_rx_msg { + int full_len; + int accum_len; + int offset; +}; + +/* Socket structure for KCM client sockets */ +struct kcm_sock { + struct sock sk; + struct kcm_mux *mux; + struct list_head kcm_sock_list; + int index; + u32 done : 1; + struct work_struct done_work; + + /* Transmit */ + struct kcm_psock *tx_psock; + struct work_struct tx_work; + struct list_head wait_psock_list; + struct sk_buff *seq_skb; + + /* Don't use bit fields here, these are set under different locks */ + bool tx_wait; + bool tx_wait_more; + + /* Receive */ + struct kcm_psock *rx_psock; + struct list_head wait_rx_list; /* KCMs waiting for receiving */ + bool rx_wait; + u32 rx_disabled : 1; +}; + +struct bpf_prog; + +/* Structure for an attached lower socket */ +struct kcm_psock { + struct sock *sk; + struct kcm_mux *mux; + int index; + + u32 tx_stopped : 1; + u32 rx_stopped : 1; + u32 done : 1; + u32 unattaching : 1; + + void (*save_state_change)(struct sock *sk); + void (*save_data_ready)(struct sock *sk); + void (*save_write_space)(struct sock *sk); + + struct list_head psock_list; + + /* Receive */ + struct sk_buff *rx_skb_head; + struct sk_buff **rx_skb_nextp; + struct sk_buff *ready_rx_msg; + struct list_head psock_ready_list; + struct work_struct rx_work; + struct delayed_work rx_delayed_work; + struct bpf_prog *bpf_prog; + struct 
kcm_sock *rx_kcm; + + /* Transmit */ + struct kcm_sock *tx_kcm; + struct list_head psock_avail_list; +}; + +/* Per net MUX list */ +struct kcm_net { + struct mutex mutex; + struct list_head mux_list; + int count; +}; + +/* Structure for a MUX */ +struct kcm_mux { + struct list_head kcm_mux_list; + struct rcu_head rcu; + struct kcm_net *knet; + + struct list_head kcm_socks; /* All KCM sockets on MUX */ + int kcm_socks_cnt; /* Total KCM socket count for MUX */ + struct list_head psocks;/* List of all psocks on MUX */ + int psocks_cnt; /* Total attached sockets */ + + /* Receive */ + spinlock_t rx_lock
[PATCH net-next 5/6] kcm: Add statistics and proc interfaces
This patch adds various counters for KCM. These include counters for messages and bytes received or sent, as well as counters for number of attached/unattached TCP sockets and other error or edge events. The statistics are exposed via a proc interface. /proc/net/kcm provides statistics per KCM socket and per psock (attached TCP sockets). /proc/net/kcm_stats provides aggregate statistics. Signed-off-by: Tom Herbert--- include/net/kcm.h | 102 + net/kcm/Makefile | 2 +- net/kcm/kcmproc.c | 422 ++ net/kcm/kcmsock.c | 66 + 4 files changed, 591 insertions(+), 1 deletion(-) create mode 100644 net/kcm/kcmproc.c diff --git a/include/net/kcm.h b/include/net/kcm.h index 4f371fe..83b4f91 100644 --- a/include/net/kcm.h +++ b/include/net/kcm.h @@ -11,6 +11,45 @@ extern unsigned int kcm_net_id; +#define KCM_STATS_ADD(stat, count) \ + ((stat) += (count)) + +#define KCM_STATS_INCR(stat) \ + ((stat)++) + +struct kcm_psock_stats { + unsigned long long rx_msgs; + unsigned long long rx_bytes; + unsigned long long tx_msgs; + unsigned long long tx_bytes; + unsigned int rx_aborts; + unsigned int rx_mem_fail; + unsigned int rx_need_more_hdr; + unsigned int rx_bad_hdr_len; + unsigned long long reserved; + unsigned long long unreserved; + unsigned int tx_aborts; +}; + +struct kcm_mux_stats { + unsigned long long rx_msgs; + unsigned long long rx_bytes; + unsigned long long tx_msgs; + unsigned long long tx_bytes; + unsigned int rx_ready_drops; + unsigned int tx_retries; + unsigned int psock_attach; + unsigned int psock_unattach_rsvd; + unsigned int psock_unattach; +}; + +struct kcm_stats { + unsigned long long rx_msgs; + unsigned long long rx_bytes; + unsigned long long tx_msgs; + unsigned long long tx_bytes; +}; + struct kcm_tx_msg { unsigned int sent; unsigned int fragidx; @@ -35,6 +74,8 @@ struct kcm_sock { u32 done : 1; struct work_struct done_work; + struct kcm_stats stats; + /* Transmit */ struct kcm_psock *tx_psock; struct work_struct tx_work; @@ -71,6 +112,8 @@ struct kcm_psock { struct 
list_head psock_list; + struct kcm_psock_stats stats; + /* Receive */ struct sk_buff *rx_skb_head; struct sk_buff **rx_skb_nextp; @@ -80,15 +123,21 @@ struct kcm_psock { struct delayed_work rx_delayed_work; struct bpf_prog *bpf_prog; struct kcm_sock *rx_kcm; + unsigned long long saved_rx_bytes; + unsigned long long saved_rx_msgs; /* Transmit */ struct kcm_sock *tx_kcm; struct list_head psock_avail_list; + unsigned long long saved_tx_bytes; + unsigned long long saved_tx_msgs; }; /* Per net MUX list */ struct kcm_net { struct mutex mutex; + struct kcm_psock_stats aggregate_psock_stats; + struct kcm_mux_stats aggregate_mux_stats; struct list_head mux_list; int count; }; @@ -104,6 +153,9 @@ struct kcm_mux { struct list_head psocks;/* List of all psocks on MUX */ int psocks_cnt; /* Total attached sockets */ + struct kcm_mux_stats stats; + struct kcm_psock_stats aggregate_psock_stats; + /* Receive */ spinlock_t rx_lock cacheline_aligned_in_smp; struct list_head kcm_rx_waiters; /* KCMs waiting for receiving */ @@ -116,6 +168,56 @@ struct kcm_mux { struct list_head kcm_tx_waiters; /* KCMs waiting for a TX psock */ }; +#ifdef CONFIG_PROC_FS +int kcm_proc_init(void); +void kcm_proc_exit(void); +#else +static int kcm_proc_init(void) { return 0; } +static void kcm_proc_exit(void) { } +#endif + + +static inline void aggregate_psock_stats(struct kcm_psock_stats *stats, +struct kcm_psock_stats *agg_stats) +{ + /* Save psock statistics in the mux when psock is being unattached. 
*/ + +#define SAVE_PSOCK_STATS(_stat) (agg_stats->_stat += stats->_stat) + + SAVE_PSOCK_STATS(rx_msgs); + SAVE_PSOCK_STATS(rx_bytes); + SAVE_PSOCK_STATS(rx_aborts); + SAVE_PSOCK_STATS(rx_mem_fail); + SAVE_PSOCK_STATS(rx_need_more_hdr); + SAVE_PSOCK_STATS(rx_bad_hdr_len); + SAVE_PSOCK_STATS(tx_msgs); + SAVE_PSOCK_STATS(tx_bytes); + SAVE_PSOCK_STATS(reserved); + SAVE_PSOCK_STATS(unreserved); + SAVE_PSOCK_STATS(tx_aborts); + +#undef SAVE_PSOCK_STATS +} + +static inline void aggregate_mux_stats(struct kcm_mux_stats *stats, + struct kcm_mux_stats *agg_stats) +{ + /* Save psock statistics in the mux when psock is being unattached. */ + +#define SAVE_MUX_STATS(_stat) (agg_stats->_stat += stats->_stat) + +
[PATCH net-next 6/6] kcm: Add description in Documentation
Add kcm.txt to desribe KCM and interfaces. Signed-off-by: Tom Herbert--- Documentation/networking/kcm.txt | 273 +++ 1 file changed, 273 insertions(+) create mode 100644 Documentation/networking/kcm.txt diff --git a/Documentation/networking/kcm.txt b/Documentation/networking/kcm.txt new file mode 100644 index 000..5432090 --- /dev/null +++ b/Documentation/networking/kcm.txt @@ -0,0 +1,273 @@ +Kernel Connection Mulitplexor +- + +Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based +interface over TCP for generic application protocols. With KCM an application +can efficiently send and receive application protocol messages over TCP using +datagram sockets. + +KCM implements an NxM multiplexor in the kernel as diagrammed below: + +++ ++ ++ ++ +| KCM socket | | KCM socket | | KCM socket | | KCM socket | +++ ++ ++ ++ + | | || + +---+ | | +--+ + | | | | + +--+ + | Multiplexor| + +--+ + | | | | | + +-+ | | | + + | | | | | ++--+ +--+ +--+ +--+ +--+ +| Psock | | Psock | | Psock | | Psock | | Psock | ++--+ +--+ +--+ +--+ +--+ + | | || | ++--+ +--+ +--+ +--+ +--+ +| TCP sock | | TCP sock | | TCP sock | | TCP sock | | TCP sock | ++--+ +--+ +--+ +--+ +--+ + +KCM sockets +--- + +The KCM sockets provide the user interface to the muliplexor. All the KCM sockets +bound to a multiplexor are considered to have equivalent function, and I/O +operations in different sockets may be done in parallel without the need for +synchronization between threads in userspace. + +Multiplexor +--- + +The multiplexor provides the message steering. In the transmit path, messages +written on a KCM socket are sent atomically on an appropriate TCP socket. +Similarly, in the receive path, messages are constructed on each TCP socket +(Psock) and complete messages are steered to a KCM socket. + +TCP sockets & Psocks + + +TCP sockets may be bound to a KCM multiplexor. 
A Psock structure is allocated +for each bound TCP socket, this structure holds the state for constructing +messages on receive as well as other connection specific information for KCM. + +Connected mode semantics + + +Each multiplexor assumes that all attached TCP connections are to the same +destination and can use the different connections for load balancing when +transmitting. The normal send and recv calls (include sendmmsg and recvmmsg) +can be used to send and receive messages from the KCM socket. + +Socket types + + +KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types. + +Message delineation +--- + +Messages are sent over a TCP stream with some application protocol message +format that typically includes a header which frames the messages. The length +of a received message can be deduced from the application protocol header +(often just a simple length field). + +A TCP stream must be parsed to determine message boundaries. Berkeley Packet +Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a +BPF program must be specified. The program is called at the start of receiving +a new message and is given an skbuff that contains the bytes received so far. +It parses the message header and returns the length of the message. Given this +information, KCM will construct the message of the stated length and deliver it +to a KCM socket. + +TCP socket management +- + +When a TCP socket is attached to a KCM multiplexor data ready (POLLIN) and +write space available (POLLOUT) events are handled by the multiplexor. If there +is a state change (disconnection) or other error on a TCP socket, an error is +posted on the TCP socket so that a POLLERR event happens and KCM discontinues +using the socket. When the application gets the error notification for a +TCP socket, it should unattach the socket from KCM and then handle the error +condition (the typical response is to close the socket and create a new +connection if necessary). 
+ +User interface +== + +Creating a multiplexor +-- + +A new multiplexor and initial KCM socket is created by a socket call: + +
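The "Message delineation" section above says the parser looks at the bytes received so far and returns the length of the message. As a rough illustration only, the following plain C function does that for a made-up wire format. The 4-byte big-endian length prefix, the name frame_len() and the "return 0 when the header is incomplete" convention are all assumptions for this sketch; the real parser is a BPF program operating on an skbuff, and its return conventions are not spelled out in the text quoted here.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_HDR_LEN 4	/* assumed format: 4-byte big-endian payload length */

/*
 * Return the total message length (header plus payload) once the
 * whole header is available, or 0 if more bytes are needed first.
 */
static size_t frame_len(const uint8_t *buf, size_t avail)
{
	uint32_t payload;

	if (avail < DEMO_HDR_LEN)
		return 0;

	payload = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
		  ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];

	return (size_t)DEMO_HDR_LEN + payload;
}
```

With a length in hand, the receiver knows how many more stream bytes belong to the current message, which is the information KCM is described as needing to assemble and deliver it.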
[PATCH net-next 2/6] net: Make sock_alloc exportable
Export it for cases where we want to create sockets by hand.

Signed-off-by: Tom Herbert
---
 include/linux/net.h | 1 +
 net/socket.c        | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index 70ac5e2..f9e3d3a 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -210,6 +210,7 @@ int __sock_create(struct net *net, int family, int type, int proto,
 int sock_create(int family, int type, int proto, struct socket **res);
 int sock_create_kern(struct net *net, int family, int type, int proto, struct socket **res);
 int sock_create_lite(int family, int type, int proto, struct socket **res);
+struct socket *sock_alloc(void);
 void sock_release(struct socket *sock);
 int sock_sendmsg(struct socket *sock, struct msghdr *msg);
 int sock_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
diff --git a/net/socket.c b/net/socket.c
index dd2c247..21373f8 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -532,7 +532,7 @@ static const struct inode_operations sockfs_inode_ops = {
  * NULL is returned.
  */
-static struct socket *sock_alloc(void)
+struct socket *sock_alloc(void)
 {
 	struct inode *inode;
 	struct socket *sock;
@@ -553,6 +553,7 @@ static struct socket *sock_alloc(void)
 	this_cpu_add(sockets_in_use, 1);
 	return sock;
 }
+EXPORT_SYMBOL(sock_alloc);
 /**
  * sock_release- close a socket
--
2.4.6
[PATCH net-next 3/6] net: Add MSG_BATCH flag
Add a new msg flag called MSG_BATCH. This flag is used in sendmsg to indicate that more messages will follow (i.e. a batch of messages is being sent). This is similar to MSG_MORE except that the following messages are not merged into one packet, they are sent individually.

MSG_BATCH is a performance optimization in cases where a socket implementation can benefit by transmitting packets in a batch.

This patch also updates sendmmsg so that each contained message except for the last one is marked as MSG_BATCH.

Signed-off-by: Tom Herbert
---
 include/linux/socket.h |  1 +
 net/socket.c           | 17 +++++++++++++----
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 5bf59c8..d834af2 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -274,6 +274,7 @@ struct ucred {
 #define MSG_MORE	0x8000	/* Sender will send more */
 #define MSG_WAITFORONE	0x10000	/* recvmmsg(): block until 1+ packets avail */
 #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
+#define MSG_BATCH	0x40000 /* sendmmsg(): more messages coming */
 #define MSG_EOF         MSG_FIN
 #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
diff --git a/net/socket.c b/net/socket.c
index 21373f8..ef64b72 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1880,7 +1880,7 @@ static int copy_msghdr_from_user(struct msghdr *kmsg,
 static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg,
 			 struct msghdr *msg_sys, unsigned int flags,
-			 struct used_address *used_address)
+			 struct used_address *used_address, bool doing_mmsg)
 {
 	struct compat_msghdr __user *msg_compat =
 	    (struct compat_msghdr __user *)msg;
@@ -1906,6 +1906,8 @@ static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg,
 	if (msg_sys->msg_controllen > INT_MAX)
 		goto out_freeiov;
+	if (doing_mmsg)
+		flags |= (msg_sys->msg_flags & MSG_EOR);
 	ctl_len = msg_sys->msg_controllen;
 	if ((MSG_CMSG_COMPAT & flags) && ctl_len) {
 		err =
@@ -1984,7 +1986,7 @@ long __sys_sendmsg(int fd, struct user_msghdr __user *msg, unsigned flags)
 	if (!sock)
 		goto out;
-	err = ___sys_sendmsg(sock, msg, &msg_sys, flags, NULL);
+	err = ___sys_sendmsg(sock, msg, &msg_sys, flags, NULL, false);
 	fput_light(sock->file, fput_needed);
 out:
@@ -2011,6 +2013,7 @@ int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 	struct compat_mmsghdr __user *compat_entry;
 	struct msghdr msg_sys;
 	struct used_address used_address;
+	unsigned int oflags = flags;
 	if (vlen > UIO_MAXIOV)
 		vlen = UIO_MAXIOV;
@@ -2025,11 +2028,16 @@ int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 	entry = mmsg;
 	compat_entry = (struct compat_mmsghdr __user *)mmsg;
 	err = 0;
+	flags |= MSG_BATCH;
 	while (datagrams < vlen) {
+		if (datagrams == vlen - 1)
+			flags = oflags;
+
 		if (MSG_CMSG_COMPAT & flags) {
 			err = ___sys_sendmsg(sock, (struct user_msghdr __user *)compat_entry,
-					     &msg_sys, flags, &used_address);
+					     &msg_sys, flags, &used_address,
+					     true);
 			if (err < 0)
 				break;
 			err = __put_user(err, &compat_entry->msg_len);
@@ -2037,7 +2045,8 @@ int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 		} else {
 			err = ___sys_sendmsg(sock,
 					     (struct user_msghdr __user *)entry,
-					     &msg_sys, flags, &used_address);
+					     &msg_sys, flags, &used_address,
+					     true);
 			if (err < 0)
 				break;
 			err = put_user(err, &entry->msg_len);
--
2.4.6
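The flag handling that the patch adds to the sendmmsg loop can be reduced to a small pure function: every message in the vector gets the batch flag except the last, which goes out with the caller's original flags. The sketch below is a userspace illustration under that reading. batch_flags() and DEMO_MSG_BATCH are hypothetical names, and the constant is an arbitrary demo bit, not a value taken from any header.

```c
#include <stddef.h>

/* Hypothetical demo bit standing in for the new flag. */
#define DEMO_MSG_BATCH (1u << 18)

/*
 * Flags to use for message number idx (0-based) out of vlen messages:
 * the caller's flags plus the batch hint, except for the final
 * message, which is sent with the caller's flags unchanged.
 */
static unsigned int batch_flags(unsigned int oflags, size_t idx, size_t vlen)
{
	if (idx + 1 == vlen)
		return oflags;

	return oflags | DEMO_MSG_BATCH;
}
```

Clearing the hint on the last message is what tells the receiving socket implementation that the batch is complete and nothing further needs to be held back.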
Re: [PATCH v2 06/27] brcm80211: move under broadcom vendor directory
On 11/19/2015 08:48 AM, Kalle Valo wrote:
> Hauke Mehrtens writes:
>
>> On 11/18/2015 03:45 PM, Kalle Valo wrote:
>>> Part of reorganising wireless drivers directory and Kconfig. Note that I had to
>>> edit Makefiles from subdirectories to use the new location.
>>>
>>> Signed-off-by: Kalle Valo
>>> ---
>>
>> I would prefer to remove the brcm80211 directory in this process and create:
>> drivers/net/wireless/broadcom/brcmfmac
>> drivers/net/wireless/broadcom/brcmsmac
>> drivers/net/wireless/broadcom/brcmutil
>> drivers/net/wireless/broadcom/include
>>
>> This way we have one directory less.
>
> I think this could be done separately. This patchset is big enough
> already, I would not like to make it anymore complicated. And I actually
> like the brcm80211 directory, I would not mind keeping it still.

I prefer to keep it as brcmsmac and brcmfmac rely on brcmutil module so I want to keep them together under brcm80211.

So does this patch go in before or after the patches I submitted before the merge window. I hope after :-p

Regards,
Arend
Re: [PATCH v2 06/27] brcm80211: move under broadcom vendor directory
On 11/19/2015 08:54 AM, Kalle Valo wrote:
> Florian Fainelli writes:
>
>> On 18/11/15 11:19, Hauke Mehrtens wrote:
>>> On 11/18/2015 03:45 PM, Kalle Valo wrote:
>>>> Part of reorganising wireless drivers directory and Kconfig. Note that I had to
>>>> edit Makefiles from subdirectories to use the new location.
>>>>
>>>> Signed-off-by: Kalle Valo
>>>> ---
>>>
>>> I would prefer to remove the brcm80211 directory in this process and create:
>>> drivers/net/wireless/broadcom/brcmfmac
>>> drivers/net/wireless/broadcom/brcmsmac
>>> drivers/net/wireless/broadcom/brcmutil
>>> drivers/net/wireless/broadcom/include
>>>
>>> This way we have one directory less.
>>
>> Would not that make keeping track of the previous and future history
>> harder for people contributing to these drivers? I could imagine that
>> for Arend and other Broadcom engineers, dealing with a simple level move
>> would be manageable, but having to account for a different directory
>> hierarchy could be a pain.
>>
>> What is the impact on compat-wireless after/before these changes by the way?
>
> It's called backports nowadays :) But I understood that as long as we
> have a separate kconfig option for the vendor directories
> (CONFIG_WLAN_VENDOR_*) it should be ok. For 4.3 we didn't have that for
> realtek directory and that caused pain for backports.

That is my understanding as well.

Regards,
Arend
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
Rainer Weikusatwrites: An AF_UNIX datagram socket being the client in an n:1 association with some server socket is only allowed to send messages to the server if the receive queue of this socket contains at most sk_max_ack_backlog datagrams. This implies that prospective writers might be forced to go to sleep despite none of the message presently enqueued on the server receive queue were sent by them. In order to ensure that these will be woken up once space becomes again available, the present unix_dgram_poll routine does a second sock_poll_wait call with the peer_wait wait queue of the server socket as queue argument (unix_dgram_recvmsg does a wake up on this queue after a datagram was received). This is inherently problematic because the server socket is only guaranteed to remain alive for as long as the client still holds a reference to it. In case the connection is dissolved via connect or by the dead peer detection logic in unix_dgram_sendmsg, the server socket may be freed despite "the polling mechanism" (in particular, epoll) still has a pointer to the corresponding peer_wait queue. There's no way to forcibly deregister a wait queue with epoll. Based on an idea by Jason Baron, the patch below changes the code such that a wait_queue_t belonging to the client socket is enqueued on the peer_wait queue of the server whenever the peer receive queue full condition is detected by either a sendmsg or a poll. A wake up on the peer queue is then relayed to the ordinary wait queue of the client socket via wake function. The connection to the peer wait queue is again dissolved if either a wake up is about to be relayed or the client socket reconnects or a dead peer is detected or the client socket is itself closed. This enables removing the second sock_poll_wait from unix_dgram_poll, thus avoiding the use-after-free, while still ensuring that no blocked writer sleeps forever. 
Signed-off-by: Rainer Weikusat Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets") --- - uninvert the lock/ check code in _dgram_sendmsg - introduce a unix_dgram_peer_wake_disconnect_wakuep helper function as there were two calls with a wakeup immediately following and two without diff --git a/include/net/af_unix.h b/include/net/af_unix.h index b36d837..2a91a05 100644 --- a/include/net/af_unix.h +++ b/include/net/af_unix.h @@ -62,6 +62,7 @@ struct unix_sock { #define UNIX_GC_CANDIDATE 0 #define UNIX_GC_MAYBE_CYCLE1 struct socket_wqpeer_wq; + wait_queue_tpeer_wake; }; static inline struct unix_sock *unix_sk(const struct sock *sk) diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 94f6582..3d93b0d 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -326,6 +326,118 @@ found: return s; } +/* Support code for asymmetrically connected dgram sockets + * + * If a datagram socket is connected to a socket not itself connected + * to the first socket (eg, /dev/log), clients may only enqueue more + * messages if the present receive queue of the server socket is not + * "too large". This means there's a second writeability condition + * poll and sendmsg need to test. The dgram recv code will do a wake + * up on the peer_wait wait queue of a socket upon reception of a + * datagram which needs to be propagated to sleeping would-be writers + * since these might not have sent anything so far. 
This can't be + * accomplished via poll_wait because the lifetime of the server + * socket might be less than that of its clients if these break their + * association with it or if the server socket is closed while clients + * are still connected to it and there's no way to inform "a polling + * implementation" that it should let go of a certain wait queue + * + * In order to propagate a wake up, a wait_queue_t of the client + * socket is enqueued on the peer_wait queue of the server socket + * whose wake function does a wake_up on the ordinary client socket + * wait queue. This connection is established whenever a write (or + * poll for write) hit the flow control condition and broken when the + * association to the server socket is dissolved or after a wake up + * was relayed. + */ + +static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags, + void *key) +{ + struct unix_sock *u; + wait_queue_head_t *u_sleep; + + u = container_of(q, struct unix_sock, peer_wake); + + __remove_wait_queue(_sk(u->peer_wake.private)->peer_wait, + q); + u->peer_wake.private = NULL; + + /* relaying can only happen while the wq still exists */ + u_sleep = sk_sleep(>sk); + if (u_sleep) + wake_up_interruptible_poll(u_sleep, key); + + return 0; +} + +static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other)
[PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
Kernel Connection Multiplexor (KCM) is a facility that provides a message based interface over TCP for generic application protocols. The motivation for this is based on the observation that although TCP is a byte stream transport protocol with no concept of message boundaries, a common use case is to implement a framed application layer protocol running over TCP. To date, most TCP stacks offer a byte stream API for applications, which places the burden of message delineation, message I/O operation atomicity, and load balancing on the application. With KCM an application can efficiently send and receive application protocol messages over TCP using a datagram interface.

In order to delineate messages in a TCP stream for receive in KCM, the kernel implements a message parser. For this we chose to employ BPF which is applied to the TCP stream. BPF code parses application layer messages and returns a message length. Nearly all binary application protocols are parsable in this manner, so KCM should be applicable across a wide range of applications. Other than message length determination in receive, KCM does not require any other application specific awareness. KCM does not implement any other application protocol semantics -- these are provided in userspace or could be implemented in a kernel module layered above KCM.

KCM implements an NxM multiplexor in the kernel as diagrammed below:

[ASCII diagram: four KCM sockets at the top, all connected to a single Multiplexor, which connects to five Psocks, each attached to its own TCP sock]

The KCM sockets provide the datagram interface to applications, Psocks are the state for each attached TCP connection (i.e. where message delineation is performed on receive).

A description of the APIs and design can be found in the included Documentation/networking/kcm.txt.

In this patch set:

- Add MSG_BATCH flag. This is used in sendmsg msg_hdr flags to indicate that more messages will be sent on the socket. The stack may batch messages up if it is beneficial for transmission.
- In sendmmsg, set MSG_BATCH in all sub messages except for the last one.
- In order to allow sendmmsg to contain multiple messages with SOCK_SEQPACKET we allow each msg_hdr in the sendmmsg to set MSG_EOR.
- Add KCM module
  - This supports SOCK_DGRAM and SOCK_SEQPACKET.
- KCM documentation

Testing:

Dave Watson has integrated KCM into Thrift and we intend to put these changes into open source. Example of this is in:

https://github.com/djwatson/fbthrift/commit/dd7e0f9cf4e80912fdb90f6cd394db24e61a14cc

Some initial KCM Thrift benchmark numbers (comment from Dave)

Thrift by default ties a single connection to a single thread. KCM is instead able to load balance multiple connections across multiple epoll loops easily.

A test sending ~5k bytes of data to a kcm thrift server, dropping the bytes on recv:

              QPS      Latency / std dev Latency
without KCM   70336    209/123
with KCM      70353    191/124

A test sending a small request, then doing work in the epoll thread, before serving more requests:

              QPS      Latency / std dev Latency
without KCM   14282    559/602
with KCM      23192    344/234

At the high end, there's definitely some additional kernel overhead:

Cranking the pipelining way up, with lots of small requests

              QPS      Latency / std dev Latency
without KCM   1863429  127/119
with KCM      1337713  192/241

---

So for a "realistic" workload, KCM performs pretty well (second case). Under extreme conditions of highest tps we still have some work to do. By its nature a multiplexor will spread work between CPUs, which is logically good for load balancing but can conflict with the goal of promoting affinity. Batching messages on both send and receive is the means to recoup performance.

Future support:

- Integration with TLS (TLS-in-kernel is a separate initiative).
- Page operations/splice support
- Unconnected KCM sockets. Will be able to attach sockets to
[PATCH net-next 1/6] rcu: Add list_next_or_null_rcu
This is a convenience function that returns the next entry in an RCU list or NULL if at the end of the list.

Signed-off-by: Tom Herbert
---
 include/linux/rculist.h | 21 +
 1 file changed, 21 insertions(+)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 5ed5409..a9376fd 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -290,6 +290,27 @@ static inline void list_splice_init_rcu(struct list_head *list,
 })
 /**
+ * list_next_or_null_rcu - get the first element from a list
+ * @head:	the head for the list.
+ * @ptr:	the list head to take the next element from.
+ * @type:	the type of the struct this is embedded in.
+ * @member:	the name of the list_head within the struct.
+ *
+ * Note that if the ptr is at the end of the list, NULL is returned.
+ *
+ * This primitive may safely run concurrently with the _rcu list-mutation
+ * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
+ */
+#define list_next_or_null_rcu(head, ptr, type, member) \
+({ \
+	struct list_head *__head = (head); \
+	struct list_head *__ptr = (ptr); \
+	struct list_head *__next = READ_ONCE(__ptr->next); \
+	likely(__next != __head) ? list_entry_rcu(__next, type, \
+						  member) : NULL; \
+})
+
+/**
  * list_for_each_entry_rcu - iterate over rcu list of given type
  * @pos:	the type * to use as a loop cursor.
  * @head:	the head for your list.
--
2.4.6
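To make the "next entry or NULL at the end" semantics concrete, here is a single-threaded userspace analogue on a circular doubly linked list. It deliberately leaves out everything RCU-related (no rcu_read_lock(), no READ_ONCE()), so it only illustrates the end-of-list test, not the concurrency guarantees. All names here (struct item, item_next_or_null(), the *_demo helpers) are hypothetical.

```c
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Hypothetical payload type embedding a list node. */
struct item {
	int value;
	struct list_head node;
};

#define container_of_demo(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void list_init_demo(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail_demo(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/*
 * Entry following ptr, or NULL when ptr is the last node, i.e. when
 * the next pointer wraps around to the list head.
 */
static struct item *item_next_or_null(struct list_head *head,
				      struct list_head *ptr)
{
	struct list_head *next = ptr->next;

	return next != head ? container_of_demo(next, struct item, node) : NULL;
}
```

Passing the head itself as ptr yields the first entry, which is how a caller can start a walk and keep stepping until NULL comes back.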
Re: [PATCH 2/7] kernfs: implement kernfs_walk_and_get()
On Fri, Nov 20, 2015 at 04:12:54PM -0500, Tejun Heo wrote:
> On Thu, Nov 19, 2015 at 08:41:04PM -0800, Greg Kroah-Hartman wrote:
> > On Thu, Nov 19, 2015 at 01:52:46PM -0500, Tejun Heo wrote:
> > > Implement kernfs_walk_and_get() which is similar to
> > > kernfs_find_and_get() but can walk a path instead of just a name.
> > >
> > > v2: Use strlcpy() instead of strlen() + memcpy() as suggested by
> > >     David.
> > >
> > > Signed-off-by: Tejun Heo
> > > Cc: Greg Kroah-Hartman
> > > Cc: David Miller
> > > ---
> > >  fs/kernfs/dir.c        | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> > >  include/linux/kernfs.h | 12 ++++++++++++
> > >  2 files changed, 58 insertions(+)
> >
> > Acked-by: Greg Kroah-Hartman
>
> Greg, would it be okay to route this one through either cgroup or net
> tree?

Either is fine with me, whatever works best for you.

greg k-h
Re: [PATCH RFC 00/15] net: The beginning of the end for NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM
From: Tom Herbert
Date: Thu, 19 Nov 2015 11:55:46 -0800

> Goals of this patch set:
>
> We propose that drivers advertise NETIF_F_HW_CSUM instead of protocol
> specific values of NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM. If the
> driver's device is constrained (for instance it can only offload simple
> IPv4 and IPv6 packets) then these constraints can be checked in the
> transmit path and skb_checksum_help would be called for packets that the
> driver is unable to offload. In order to facilitate this, we add some
> helper functions that take a specification argument indicating the
> type of packets a device is able to offload. If a packet does not match
> the specification, the helper function calls skb_checksum_help.

I very much like the direction this is taking things.

And I do sincerely hope that this does in fact actually encourage HW
vendors to drop all of the protocol specific offloading, and just
support 2's complement sums.

They can turn that _trivially_ into whatever the Windows et al. driver
interfaces want in their respective drivers.

There is absolutely no reason to implement protocol specific checksum
offloads in silicon in this day and age. Absolutely none.

So driver folks tell your hardware buddies to just stop doing it now
and get with the program. Even your marketing department shouldn't
care, they can list support for every protocol on the planet in their
specs and packaging if they want, and it might even look impressive...

Thanks.
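The helper idea quoted above (check a packet against a device's offload specification and fall back to a software checksum otherwise) can be sketched in userspace roughly as follows. Everything here is an assumption made for illustration: the SPEC_* flags, struct pkt and csum_offload_or_fallback() are hypothetical names and are not the helpers added by the patch set, and sw_checksum() stands in for skb_checksum_help by computing the Internet checksum as described in RFC 1071.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability bits; these are NOT kernel definitions. */
enum {
	SPEC_IPV4  = 1u << 0,
	SPEC_IPV6  = 1u << 1,
	SPEC_TCP   = 1u << 2,
	SPEC_UDP   = 1u << 3,
	SPEC_ENCAP = 1u << 4,
};

/* Hypothetical packet descriptor used only in this sketch. */
struct pkt {
	unsigned int kind;	/* SPEC_* bits describing this packet */
	const uint8_t *data;	/* bytes covered by the checksum */
	size_t len;
	uint16_t csum;		/* filled in when computed in software */
	bool hw_offload;	/* true when left for the device */
};

/* Internet checksum (RFC 1071) over 'len' bytes, big-endian 16-bit words. */
static uint16_t sw_checksum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)data[i] << 8) | data[i + 1];
	if (len & 1)
		sum += (uint32_t)data[len - 1] << 8;	/* pad odd byte */
	while (sum >> 16)				/* fold carries */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/*
 * If every property of the packet is covered by the device
 * specification, leave the checksum to the device; otherwise compute
 * it in software. Returns true when the packet was offloaded.
 */
static bool csum_offload_or_fallback(struct pkt *p, unsigned int dev_spec)
{
	if ((p->kind & ~dev_spec) == 0) {
		p->hw_offload = true;
		return true;
	}
	p->csum = sw_checksum(p->data, p->len);
	p->hw_offload = false;
	return false;
}
```

A driver for a constrained device would pass a narrow dev_spec (for instance plain IPv4/IPv6 with TCP or UDP), so an encapsulated packet falls through to the software path without the stack needing protocol specific feature flags.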
Re: [PATCHSET v2] netfilter, cgroup: implement xt_cgroup2 match
Hello, David, Pablo.

On Fri, Nov 20, 2015 at 08:56:25PM +0100, Pablo Neira Ayuso wrote:
> > Pablo, are you ok with me merging this into net-next directly or
> > would you rather I take patches 1-6 into net-next and then you can
> > merge and then add patch #7 on top?
>
> I'd suggest you get 1-6, then I'll pull this into my tree. Thanks David!

Hmm 1-3 will be needed to address similar issues in a different
controller, so putting them in a separate branch would work best. I
created a branch which contains the 1-3 on top of v4.4-rc1.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-4.5-ancestor-test

If creating a different branch from net side is better, please let me
know.

> Regarding #7, I have a couple of concerns:
>
> 1) cgroup currently doesn't work the way users expect, ie. to perform any
>    reasonable firewalling. Since this relies on early demux, only a
>    limited number of sockets get access to the cgroup info.

Right, it doesn't work well on INPUT side, so the big warning in the
man page.

> 2) We have traditionally rejected match2 and target2 extensions. I
>    guess you can accommodate the new cgroup code through the revision
>    iptables infrastructure, so we still use the cgroup match.

I thought it would be confusing because the two are completely
separate. Hmmm... okay, I'll merge it into xt_cgroup.

Thanks.

-- 
tejun
Re: [PATCH 2/7] kernfs: implement kernfs_walk_and_get()
On Thu, Nov 19, 2015 at 08:41:04PM -0800, Greg Kroah-Hartman wrote:
> On Thu, Nov 19, 2015 at 01:52:46PM -0500, Tejun Heo wrote:
> > Implement kernfs_walk_and_get() which is similar to
> > kernfs_find_and_get() but can walk a path instead of just a name.
> >
> > v2: Use strlcpy() instead of strlen() + memcpy() as suggested by
> >     David.
> >
> > Signed-off-by: Tejun Heo
> > Cc: Greg Kroah-Hartman
> > Cc: David Miller
> > ---
> >  fs/kernfs/dir.c        | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/kernfs.h | 12 ++++++++++++
> >  2 files changed, 58 insertions(+)
>
> Acked-by: Greg Kroah-Hartman

Greg, would it be okay to route this one through either cgroup or net
tree?

Thanks.

-- 
tejun
Re: [PATCH net-next] net: avoid NULL deref in napi_get_frags()
From: Eric Dumazet
Date: Thu, 19 Nov 2015 12:11:23 -0800

> From: Eric Dumazet
>
> napi_alloc_skb() can return NULL.
> We should not crash should this happen.
>
> Fixes: 93f93a440415 ("net: move skb_mark_napi_id() into core networking stack")
> Signed-off-by: Eric Dumazet

Applied, thanks.
Re: [PATCH 04/14] net: tcp_memcontrol: remove bogus hierarchy pressure propagation
On Thu, Nov 12, 2015 at 06:41:23PM -0500, Johannes Weiner wrote:
> When a cgroup currently breaches its socket memory limit, it enters
> memory pressure mode for itself and its *ancestors*. This throttles
> transmission in unrelated sibling and cousin subtrees that have
> nothing to do with the breached limit.
>
> On the contrary, breaching a limit should make that group and its
> *children* enter memory pressure mode. But this happens already,
> albeit lazily: if an ancestor limit is breached, siblings will enter
> memory pressure on their own once the next packet arrives for them.

Hmm, we still call sk_prot->enter_memory_pressure, which might hurt a
workload in the root cgroup AFAICS. Strange. You fix it in patch 8
though.

>
> So no additional hierarchy code is needed. Remove the bogus stuff.
>
> Signed-off-by: Johannes Weiner

Reviewed-by: Vladimir Davydov
Re: [B.A.T.M.A.N.] [PATCH 3/3] batman-adv: Less function calls in batadv_is_ap_isolated() after error detection
On 04/11/15 04:56, SF Markus Elfring wrote:
> From: Markus Elfring
> Date: Tue, 3 Nov 2015 21:10:51 +0100
>
> The variables "tt_local_entry" and "tt_global_entry" were eventually checked
> again despite of a corresponding null pointer test before.
> Let us avoid this double check by reordering a function call sequence
> and the better selection of jump targets.
>
> Signed-off-by: Markus Elfring
> ---
>  net/batman-adv/translation-table.c | 21 +++++++++------------
>  1 file changed, 9 insertions(+), 12 deletions(-)
>
> diff --git a/net/batman-adv/translation-table.c b/net/batman-adv/translation-table.c
> index 965a004..3ac32d9 100644
> --- a/net/batman-adv/translation-table.c
> +++ b/net/batman-adv/translation-table.c
> @@ -3323,27 +3323,24 @@ bool batadv_is_ap_isolated(struct batadv_priv *bat_priv, u8 *src, u8 *dst,
>  		return false;
>  
>  	if (!atomic_read(&vlan->ap_isolation))
> -		goto out;
> +		goto vlan_free;
>  
>  	tt_local_entry = batadv_tt_local_hash_find(bat_priv, dst, vid);
>  	if (!tt_local_entry)
> -		goto out;
> +		goto vlan_free;
>  
>  	tt_global_entry = batadv_tt_global_hash_find(bat_priv, src, vid);
>  	if (!tt_global_entry)
> -		goto out;
> +		goto local_entry_free;
>  
> -	if (!_batadv_is_ap_isolated(tt_local_entry, tt_global_entry))
> -		goto out;
> -
> -	ret = true;
> +	if (_batadv_is_ap_isolated(tt_local_entry, tt_global_entry))
> +		ret = true;
>  
> -out:
> +	batadv_tt_global_entry_free_ref(tt_global_entry);
> +local_entry_free:
> +	batadv_tt_local_entry_free_ref(tt_local_entry);
> +vlan_free:
>  	batadv_softif_vlan_free_ref(vlan);
> -	if (tt_global_entry)
> -		batadv_tt_global_entry_free_ref(tt_global_entry);
> -	if (tt_local_entry)
> -		batadv_tt_local_entry_free_ref(tt_local_entry);
>  	return ret;

Markus, if you really want to make this codestyle change, I'd suggest
you go through the whole batman-adv code and apply the same change
where needed. It does not make sense to change the codestyle in one
spot only.

On top of that, by going through the batman-adv code you might agree
that the current style is actually not a bad idea.

Cheers,

-- 
Antonio Quartulli
Re: [PATCH 4/7] netprio_cgroup: limit the maximum css->id to USHRT_MAX
On 11/19/2015 07:52 PM, Tejun Heo wrote:
> netprio builds per-netdev contiguous priomap array which is indexed by
> css->id. The array is allocated using kzalloc() effectively limiting
> the maximum ID supported to some thousand range. This patch caps the
> maximum supported css->id to USHRT_MAX which should be way above what
> is actually useable.
>
> This allows reducing sock->sk_cgrp_prioidx to u16 from u32. The freed
> up part will be used to overload the cgroup related fields.
> sock->sk_cgrp_prioidx's position is swapped with sk_mark so that the
> two cgroup related fields are adjacent.
>
> Signed-off-by: Tejun Heo
> Cc: Daniel Borkmann
> Cc: Daniel Wagner

Acked-by: Daniel Wagner
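As a rough illustration of why capping the ID makes the narrower field safe, here is a small userspace sketch. It is not taken from the patch: PRIO_ID_MAX, struct fake_sock, assign_prioidx() and the -ENOSPC return value are all assumptions chosen for the example.

```c
#include <errno.h>
#include <limits.h>
#include <stdint.h>

/* Assumed cap, following the patch description (name is hypothetical). */
#define PRIO_ID_MAX USHRT_MAX

/* Hypothetical stand-in for the socket field that shrinks to 16 bits. */
struct fake_sock {
	uint16_t cgrp_prioidx;
};

/*
 * Reject IDs that would not fit in the 16-bit field; otherwise store
 * the ID. The error code is picked for illustration only.
 */
static int assign_prioidx(struct fake_sock *sk, unsigned int css_id)
{
	if (css_id > PRIO_ID_MAX)
		return -ENOSPC;
	sk->cgrp_prioidx = (uint16_t)css_id;
	return 0;
}
```

Because oversized IDs are refused up front, the store into the 16-bit field can never silently truncate.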
[PATCH] packet: Allow packets with only a header (but no payload)
9c70776 added validation for the packet size in packet_snd. That change
enforced that every packet needs a header with at least hard_header_len
bytes and at least one byte of payload.

This patch relaxes the check so that a packet consisting only of the
header is accepted again. This fixes PPPoE connections which do not
have a "Service" or "Host-Uniq" configured (which is violating the
spec, but is still widely used in real-world setups). Those are
currently failing with the following message:
"pppd: packet size is too short (24 <= 24)"

Signed-off-by: Martin Blumenstingl
---
v2: Simply change the existing logic in ll_header_truncated instead of
    splitting it and having multiple checks.

 include/linux/netdevice.h | 3 ++-
 net/packet/af_packet.c    | 4 ++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 67bfac1..1f42cb7 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1398,7 +1398,8 @@ enum netdev_priv_flags {
  *	@dma:		DMA channel
  *	@mtu:		Interface MTU value
  *	@type:		Interface hardware type
- *	@hard_header_len: Hardware header length
+ *	@hard_header_len: Hardware header length, which means that this is the
+ *			  minimum size of a packet.
  *
  *	@needed_headroom: Extra headroom the hardware may need, but not in all
  *			  cases can this be guaranteed
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 1cf928f..992396a 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2329,8 +2329,8 @@ static void tpacket_destruct_skb(struct sk_buff *skb)
 static bool ll_header_truncated(const struct net_device *dev, int len)
 {
 	/* net device doesn't like empty head */
-	if (unlikely(len <= dev->hard_header_len)) {
-		net_warn_ratelimited("%s: packet size is too short (%d <= %d)\n",
+	if (unlikely(len < dev->hard_header_len)) {
+		net_warn_ratelimited("%s: packet size is too short (%d < %d)\n",
 				     current->comm, len, dev->hard_header_len);
 		return true;
 	}
-- 
2.6.2
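The effect of the one-character change can be seen in a small userspace sketch that compares the two conditions side by side. struct fake_dev and the two function names are hypothetical stand-ins; only the comparisons themselves come from the diff.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the single net_device field the check reads. */
struct fake_dev {
	int hard_header_len;
};

/* Condition before the patch: a header-only packet counts as truncated. */
static bool ll_header_truncated_old(const struct fake_dev *dev, int len)
{
	return len <= dev->hard_header_len;
}

/* Condition after the patch: only packets shorter than the header fail. */
static bool ll_header_truncated_new(const struct fake_dev *dev, int len)
{
	return len < dev->hard_header_len;
}
```

With hard_header_len = 24 and len = 24, the values from the pppd message, the old condition reports truncation and the new one does not, while a 23-byte packet is still rejected by both.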
Re: r8169 regression: UDP packets dropped intermittantly
Jonathan Woithe:
[...]
> This indicates to me that in the fault condition, packets coming into the PC
> are being held by the lower layers (perhaps even the hardware) for a very
> long time, and in fact only seem to be released once a packet is queued for
> transmission.
>
> I then ran a test using the capture script you suggested. For this test I
> arranged to only send the C-A packet sequence repeatedly until an error
> condition was detected. Approximate times of the sequence's start time and
> the outcome of the sequence were:
>
> 1447985720: C response received, A response received
> 1447985722: C response received, A response was not seen(*)
> 1447985739: C response received, A response received
>
> (*) Based on the earlier test, I expect it was delivered to the OS layer at
> the start of the next test, when the "C" packet was sent.

The register dumps all look the same. Nothing to see here.

The hardware stats are not exactly clear. Is the initial Tx - Rx packet
difference (6) at the hardware stats level expected ?
[ASCII plot: Tx and Rx hardware packet counters (y axis, 256 to 272)
against time (x axis, 15:15 to 15:50), with one point annotated
"late Tx burp"]

                       Tx  Rx
1447985715.230668398  263 257
1447985716.234480645  263 257
1447985717.238563603  263 257
1447985718.242506424  263 257
1447985719.246230757  263 257
1447985720.249900368  263 257
1447985721.253727549  265 259
1447985722.257578916  265 259
1447985723.261457073  267 261
1447985724.264998842  267 261
1447985725.268928099  267 261
1447985726.272550987  269 262
1447985727.277086547  269 262
1447985728.280609761  269 262
1447985729.284707715  269 262
1447985730.288299143  269 262
1447985731.292144772  269 262
1447985732.295803667  269 262
1447985733.299568961  269 262
1447985734.303204633  269 262
1447985735.307015920  269 262
1447985736.310640604  269 262
1447985737.314153757  269 262
1447985738.318170685  269 262
1447985739.322196876  269 262
1447985740.325972075  271 264
1447985741.330118113  271 264
1447985742.334009017  271 264
1447985743.337646906  271 264
1447985744.341710301  271 264
1447985745.345781415  271 264
1447985746.349656916  271 264
1447985747.353606904  271 264

-- 
Ueimor