Re: [PATCH net] rds: RDS (tcp) hangs on sendto() to unresponding address
10/8/2018 9:17 AM, Ka-Cheong Poon wrote: In rds_send_mprds_hash(), if the calculated hash value is non-zero and the MPRDS connections are not yet up, it will wait. But it should not wait if the send is non-blocking. In this case, it should just use the base c_path for sending the message. Signed-off-by: Ka-Cheong Poon --- Thanks for getting this out on the list. Acked-by: Santosh Shilimkar
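The non-blocking fallback described above can be sketched in plain userspace C. This is an illustrative model only, not the kernel code: the struct, the path count, and mprds_pick_path() are made-up stand-ins for rds_send_mprds_hash() and its connection paths.

```c
#include <stdbool.h>

#define RDS_MPATH_WORKERS 8   /* illustrative path count */

/* Toy model of a multipath RDS connection: one "up" flag per c_path. */
struct conn {
	bool path_up[RDS_MPATH_WORKERS];
};

/* Sketch of the fix: a non-blocking sender whose hashed path is not
 * yet up falls back to the base path (index 0) instead of sleeping. */
static int mprds_pick_path(const struct conn *c, unsigned int hash, bool nonblock)
{
	if (hash != 0 && !c->path_up[hash]) {
		if (nonblock)
			return 0;   /* use the base c_path, don't wait */
		/* a blocking sender would sleep here until the path is up */
	}
	return (int)hash;
}
```

The key point of the patch is only the `nonblock` branch: the hash computation and the wait itself are unchanged.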
Re: [PATCH 1/1] net: rds: use memset to optimize the recv
On 9/14/2018 1:45 AM, Zhu Yanjun wrote: The function rds_inc_init is in the recv path. Using memset can optimize the function rds_inc_init. The test result: Before: 1) + 24.950 us | rds_inc_init [rds](); After: 1) + 10.990 us | rds_inc_init [rds](); Signed-off-by: Zhu Yanjun --- Looks good. Thanks !! Acked-by: Santosh Shilimkar
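The optimization being acked replaces several field-by-field zero stores with one memset over the contiguous tail of the structure. A hedged userspace sketch of the idea (the struct and its fields are illustrative, not the real struct rds_incoming):

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for struct rds_incoming; field names and layout
 * are illustrative only. */
struct inc {
	long i_refcount;
	void *i_conn;
	unsigned long i_rx_jiffies;     /* start of the zeroed tail */
	unsigned long i_usercopy[3];
};

/* Old pattern: zero each field with an individual store. */
static void inc_init_fields(struct inc *i)
{
	i->i_rx_jiffies = 0;
	i->i_usercopy[0] = 0;
	i->i_usercopy[1] = 0;
	i->i_usercopy[2] = 0;
}

/* Optimized pattern: one memset over the contiguous tail, which the
 * compiler/libc can turn into a few wide stores. */
static void inc_init_memset(struct inc *i)
{
	memset(&i->i_rx_jiffies, 0,
	       sizeof(*i) - offsetof(struct inc, i_rx_jiffies));
}
```

Both produce the same zeroed region; the memset form is what the quoted ftrace numbers (24.950 us down to 10.990 us) are measuring in the real function.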
Re: [Patch net v2] rds: fix two RCU related problems
On 9/10/2018 6:27 PM, Cong Wang wrote: When a rds sock is bound, it is inserted into the bind_hash_table which is protected by RCU. But when releasing rds sock, after it is removed from this hash table, it is freed immediately without respecting RCU grace period. This could cause some use-after-free as reported by syzbot. Mark the rds sock with SOCK_RCU_FREE before inserting it into the bind_hash_table, so that it would be always freed after a RCU grace period. The other problem is in rds_find_bound(), the rds sock could be freed in between rhashtable_lookup_fast() and rds_sock_addref(), so we need to extend RCU read lock protection in rds_find_bound() to close this race condition. Reported-and-tested-by: syzbot+8967084bcac563795...@syzkaller.appspotmail.com Reported-by: syzbot+93a5839deb3555374...@syzkaller.appspotmail.com Cc: Sowmini Varadhan Cc: Santosh Shilimkar Cc: rds-de...@oss.oracle.com Signed-off-by: Cong Wang --- Thank you !! Acked-by: Santosh Shilimkar
Re: [Patch net] rds: mark bound socket with SOCK_RCU_FREE
On 9/10/2018 5:45 PM, Cong Wang wrote: On Mon, Sep 10, 2018 at 5:26 PM Santosh Shilimkar wrote: Would you mind posting an updated patch please with call_rcu and above extended RCU grace period with rcu_read_lock. Thanks !! If you prefer to fix _two_ problems in one patch, sure. For the record, the bug this patch fixes is NOT same with the one in rds_find_bound(), because there is no rds_find_bound() in the backtrace. If you want to see the full backtrace, here it is: https://marc.info/?l=linux-netdev=15364808962=2 This is why I believe they are two problems. Whether fixing two problems in one patch or two patches is merely a preference, I leave it up to you. I had a partial fix as well in past but since it wasn't covering complete window, it was abandoned. Since we know the other race, it would be best to address it together and hence requested a combine patch. Thanks for help !! Regards, Santosh
Re: [Patch net] rds: mark bound socket with SOCK_RCU_FREE
On 9/10/2018 5:16 PM, Cong Wang wrote: On Mon, Sep 10, 2018 at 5:04 PM Sowmini Varadhan wrote: On (09/10/18 16:51), Cong Wang wrote:

	__rds_create_bind_key(key, addr, port, scope_id);
-	rs = rhashtable_lookup_fast(&bind_hash_table, key, ht_parms);
+	rcu_read_lock();
+	rs = rhashtable_lookup(&bind_hash_table, key, ht_parms);
	if (rs && !sock_flag(rds_rs_to_sk(rs), SOCK_DEAD))
		rds_sock_addref(rs);
	else
		rs = NULL;
+	rcu_read_unlock();

aiui, the rcu_read lock/unlock is only useful if the write side doing destructive operations does something to make sure readers are done before doing the destructive operation. AFAIK, that does not exist for rds socket management today. That is exactly why we need it here, right? As you mentioned previously, the sock could be freed after rhashtable_lookup_fast() but before rds_sock_addref(), so extending the RCU read section past rds_sock_addref() exactly solves the problem here. That's the case really. Or is there any other destructive problem you didn't mention? Mind being specific? Although the sock release path is not a very hot path, blocking it isn't a good idea either, especially when you can use call_rcu(), which has the same effect. I don't see any reason we should prefer synchronize_rcu() here. Usually correctness (making sure all readers are done, before nuking a data structure) is a little bit more important than performance, aka "safety before speed" is what I've always been taught. Hmm, so you are saying synchronize_rcu() is kinda more correct than call_rcu()?? I never heard this before, would like to know why. To my best knowledge, the only difference between them is the context, one is blocking, the other is non-blocking. Their correctness must be equivalent. We have burnt our hands with blocking synchronize_rcu() and that was actually the main reason we moved to rw locks from rcu to localise the cost. call_rcu() should be just fine as long as it plugs the hole.
I don't want to add blocking in this path since this slows down the connection shutdown path and at times leads to soft dumps. Would you mind posting an updated patch please with call_rcu and the above extended RCU grace period with rcu_read_lock. Thanks !! Regards, Santosh
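The fix being agreed on has two independent parts: free the bound socket only after an RCU grace period (SOCK_RCU_FREE / call_rcu), and keep the lookup and the refcount grab inside one read-side critical section. The second part can be sketched in userspace C; a pthread mutex stands in for rcu_read_lock()/rcu_read_unlock(), and all names are illustrative stand-ins, not the kernel API.

```c
#include <pthread.h>
#include <stddef.h>

/* Toy socket: refcount plus a "dead" flag standing in for SOCK_DEAD. */
struct fake_sock {
	int refcount;
	int dead;
};

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for rhashtable_lookup(): just hands back the candidate. */
static struct fake_sock *lookup(struct fake_sock *table_entry)
{
	return table_entry;
}

/* Sketch of the fixed rds_find_bound() shape: lookup AND addref happen
 * inside one critical section, so the entry cannot be freed between
 * finding it and pinning it. */
static struct fake_sock *find_bound(struct fake_sock *table_entry)
{
	struct fake_sock *rs;

	pthread_mutex_lock(&table_lock);       /* rcu_read_lock() */
	rs = lookup(table_entry);
	if (rs && !rs->dead)
		rs->refcount++;                /* rds_sock_addref() */
	else
		rs = NULL;
	pthread_mutex_unlock(&table_lock);     /* rcu_read_unlock() */
	return rs;
}
```

The writer side (the mutex here, a grace period in the kernel) guarantees no reader is between lookup and addref when the object is torn down.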
Re: [Patch net] rds: mark bound socket with SOCK_RCU_FREE
On 9/10/2018 3:24 PM, Cong Wang wrote: When a rds sock is bound, it is inserted into the bind_hash_table which is protected by RCU. But when releasing a rds sock, after it is removed from this hash table, it is freed immediately without respecting the RCU grace period. This could cause some use-after-free as reported by syzbot. Indeed. Mark the rds sock as SOCK_RCU_FREE before inserting it into the bind_hash_table, so that it would always be freed after a RCU grace period. Reported-and-tested-by: syzbot+8967084bcac563795...@syzkaller.appspotmail.com Cc: Sowmini Varadhan Cc: Santosh Shilimkar Cc: rds-de...@oss.oracle.com Signed-off-by: Cong Wang ---

 net/rds/bind.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/rds/bind.c b/net/rds/bind.c
index 3ab55784b637..2281b34415b9 100644
--- a/net/rds/bind.c
+++ b/net/rds/bind.c
@@ -235,6 +235,7 @@ int rds_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		goto out;
 	}

+	sock_set_flag(sk, SOCK_RCU_FREE);
 	ret = rds_add_bound(rs, binding_addr, &port, scope_id);
 	if (ret)
 		goto out;

I wasn't aware of this "SOCK_RCU_FREE" so really thanks for this patch. Have been scratching my head over this for a while thinking about a generic provision at sk level to synchronize. This is much better than adding the sync at the upper layer. It does have the tax of slowing down RDS sock freeing like other kernel components using rcu sync, but anyway this hole needs to be plugged so am fine to go ahead with this change. Thanks for the patch. FWIW, Acked-by: Santosh Shilimkar
Re: [PATCH] rds: tcp: remove duplicated include from tcp.c
On 8/21/2018 7:05 AM, Yue Haibing wrote: Remove duplicated include. Signed-off-by: Yue Haibing --- Looks fine. Acked-by: Santosh Shilimkar
Re: [RFC v3 net-next 4/5] rds: invoke socket sg filter attached to rds socket
On 8/17/2018 4:08 PM, Tushar Dave wrote: RDS module sits on top of TCP (rds_tcp) and IB (rds_rdma), so messages arrive in the form of skb (over TCP) and scatterlist (over IB/RDMA). However, because the socket filter only deals with skb (e.g. struct skb as bpf context) we can only use the socket filter for rds_tcp and not for rds_rdma. Considering one filtering solution for RDS, it seems that the common denominator between sk_buff and scatterlist is scatterlist. Therefore, this patch converts skb to sgvec and invokes sg_filter_run for rds_tcp, and simply invokes sg_filter_run for IB/rds_rdma. Signed-off-by: Tushar Dave Reviewed-by: Sowmini Varadhan --- Looks good Tushar !! Acked-by: Santosh Shilimkar
Re: [PATCH net-next] rds: avoid lock hierarchy violation between m_rs_lock and rs_recv_lock
On 8/8/2018 3:18 PM, Sowmini Varadhan wrote: On (08/08/18 14:51), Santosh Shilimkar wrote: This bug doesn't make sense since two different transports are using the same socket (Loop and rds_tcp) and running together. For the same transport, such a race can't happen with the MSG_ON_SOCK flag. CPU1 -> rds_loop_inc_free, CPU0 -> rds_tcp_cork ... The test is just reporting a lock hierarchy violation. As far as I can tell, this wasn't an actual deadlock that happened because, as you point out, either a socket has the rds_tcp transport or the rds_loop transport, so this particular pair of stack traces would not happen with the code as it exists today. Exactly. But there is a valid lock hierarchy violation here, and imho it's a good idea to get that cleaned up. The lock hierarchy violation is protected for the same transport. I don't see this violation possible for legitimate use, hence the comment. If we start supporting two different transports on the same socket then we have many more cases to fix, and such a lock violation will be just one of those. Loop transport seems to keep throwing surprises. Need to confirm, but it looks like it can co-exist with another transport on the same socket if those traces are to be believed. If that's the case, then definitely that needs to be plugged. Regards, Santosh
Re: [PATCH net-next] rds: avoid lock hierarchy violation between m_rs_lock and rs_recv_lock
On 8/8/2018 1:57 PM, Sowmini Varadhan wrote: The following deadlock, reported by syzbot, can occur if CPU0 is in rds_send_remove_from_sock() while CPU1 is in rds_clear_recv_queue()

       CPU0                                 CPU1
       lock(&(&rm->m_rs_lock)->rlock);
                                            lock(&rs->rs_recv_lock);
                                            lock(&(&rm->m_rs_lock)->rlock);
       lock(&rs->rs_recv_lock);

The deadlock should be avoided by moving the messages from the rs_recv_queue into a tmp_list in rds_clear_recv_queue() under the rs_recv_lock, and then dropping the refcnt on the messages in the tmp_list (potentially resulting in rds_message_purge()) after dropping the rs_recv_lock. The same lock hierarchy violation also exists in rds_still_queued() and should be avoided in a similar manner. Signed-off-by: Sowmini Varadhan Reported-by: syzbot+52140d69ac6dc6b92...@syzkaller.appspotmail.com --- This bug doesn't make sense since two different transports are using the same socket (Loop and rds_tcp) and running together. For the same transport, such a race can't happen with the MSG_ON_SOCK flag. CPU1 -> rds_loop_inc_free, CPU0 -> rds_tcp_cork ... I need to understand this test better. Regards, Santosh
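The fix pattern in the commit message, detach the queue under the lock and drop message references only after unlocking, can be sketched in userspace C. The list type and the names here are illustrative stand-ins for rds_clear_recv_queue() and the message refcount drop, not the kernel list API.

```c
#include <pthread.h>
#include <stdlib.h>

struct msg {
	struct msg *next;
};

struct recv_queue {
	pthread_mutex_t lock;       /* stands in for rs_recv_lock */
	struct msg *head;
};

static int freed;                   /* counts "refcount drops" for the demo */

/* Stand-in for dropping the last ref (which may run rds_message_purge()
 * and take other locks, so it must not run under the queue lock). */
static void msg_put(struct msg *m)
{
	free(m);
	freed++;
}

static void clear_recv_queue(struct recv_queue *q)
{
	struct msg *tmp;

	pthread_mutex_lock(&q->lock);   /* only splice under the lock */
	tmp = q->head;
	q->head = NULL;
	pthread_mutex_unlock(&q->lock);

	while (tmp) {                   /* lock dropped: safe to purge */
		struct msg *next = tmp->next;
		msg_put(tmp);
		tmp = next;
	}
}
```

Because msg_put() runs with the queue lock released, the two locks are never held in the conflicting order shown in the syzbot trace.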
Re: [PATCH net-next 2/2] rds: Remove IPv6 dependency
On 7/30/2018 10:48 PM, Ka-Cheong Poon wrote: This patch removes the IPv6 dependency from RDS. Signed-off-by: Ka-Cheong Poon ---

 net/rds/Kconfig          |  2 +-
 net/rds/af_rds.c         | 32 +++-
 net/rds/bind.c           |  4 +++-
 net/rds/connection.c     | 26 --
 net/rds/ib.c             | 31 ++-
 net/rds/ib_cm.c          |  9 +
 net/rds/ib_rdma.c        |  2 ++
 net/rds/rdma_transport.c | 10 ++
 net/rds/recv.c           |  2 ++
 net/rds/send.c           |  2 ++
 net/rds/tcp.c            | 25 +
 net/rds/tcp_listen.c     | 21 +
 12 files changed, 140 insertions(+), 26 deletions(-)

diff --git a/net/rds/Kconfig b/net/rds/Kconfig
index 4c7f259..41f7556 100644
--- a/net/rds/Kconfig
+++ b/net/rds/Kconfig
@@ -1,7 +1,7 @@
 config RDS
 	tristate "The RDS Protocol"
-	depends on INET && IPV6
+	depends on INET

Thanks for followup Ka-Cheong.

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index fc5c48b..65387e1 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -156,18 +156,20 @@ static int rds_getname(struct socket *sock, struct sockaddr *uaddr,
 		return sizeof(*sin);
 	}
-	if (ipv6_addr_type(&rs->rs_conn_addr) &
-	    IPV6_ADDR_MAPPED) {
-		sin = (struct sockaddr_in *)uaddr;
-		memset(sin, 0, sizeof(*sin));
-		sin->sin_family = AF_INET;
-		return sizeof(*sin);
+#if IS_ENABLED(CONFIG_IPV6)
+	if (!(ipv6_addr_type(&rs->rs_conn_addr) &
+	      IPV6_ADDR_MAPPED)) {
+		sin6 = (struct sockaddr_in6 *)uaddr;
+		memset(sin6, 0, sizeof(*sin6));
+		sin6->sin6_family = AF_INET6;
+		return sizeof(*sin6);
 	}
+#endif

I don't like this #ifdef spaghetti all over in the middle of functions, but it seems like that is the best option considering the involved changes from the previous series. FWIW, Acked-by: Santosh Shilimkar
Re: [PATCH net-next 1/2] rds: rds_ib_recv_alloc_cache() should call alloc_percpu_gfp() instead
On 7/30/2018 10:48 PM, Ka-Cheong Poon wrote: Currently, rds_ib_conn_alloc() calls rds_ib_recv_alloc_caches() without passing along the gfp_t flag. But rds_ib_recv_alloc_caches() and rds_ib_recv_alloc_cache() should take a gfp_t parameter so that rds_ib_recv_alloc_cache() can call alloc_percpu_gfp() using the correct flag instead of calling alloc_percpu(). Signed-off-by: Ka-Cheong Poon --- Looks good !! Acked-by: Santosh Shilimkar
Re: [PATCH] rds: send: Fix dead code in rds_sendmsg
On 7/25/2018 8:22 AM, Gustavo A. R. Silva wrote: Currently, code at label *out* is unreachable. Fix this by updating variable *ret* with -EINVAL, so the jump to *out* can be properly executed instead of directly returning from function. Addresses-Coverity-ID: 1472059 ("Structurally dead code") Fixes: 1e2b44e78eea ("rds: Enable RDS IPv6 support") Signed-off-by: Gustavo A. R. Silva --- Looks fine. Acked-by: Santosh Shilimkar
Re: [PATCH v4 net-next 3/3] rds: Extend RDS API for IPv6 support
On 7/23/2018 7:16 AM, Ka-Cheong Poon wrote: There are many data structures (RDS socket options) used by RDS apps which use a 32 bit integer to store IP address. To support IPv6, struct in6_addr needs to be used. To ensure backward compatibility, a new data structure is introduced for each of those data structures which use a 32 bit integer to represent an IP address. And new socket options are introduced to use those new structures. This means that existing apps should work without a problem with the new RDS module. For apps which want to use IPv6, those new data structures and socket options can be used. IPv4 mapped address is used to represent IPv4 address in the new data structures. v4: Revert changes to SO_RDS_TRANSPORT Signed-off-by: Ka-Cheong Poon --- Looks good to me now. Thanks !! Acked-by: Santosh Shilimkar
Re: [PATCH v4 net-next 2/3] rds: Enable RDS IPv6 support
On 7/23/2018 7:16 AM, Ka-Cheong Poon wrote: This patch enables RDS to use IPv6 addresses. For RDS/TCP, the listener is now an IPv6 endpoint which accepts both IPv4 and IPv6 connection requests. RDS/RDMA/IB uses a private data (struct rds_ib_connect_private) exchange between endpoints at RDS connection establishment time to support RDMA. This private data exchange uses a 32 bit integer to represent an IP address. This needs to be changed in order to support IPv6. A new private data struct rds6_ib_connect_private is introduced to handle this. To ensure backward compatibility, an IPv6 capable RDS stack uses another RDMA listener port (RDS_CM_PORT) to accept IPv6 connection. And it continues to use the original RDS_PORT for IPv4 RDS connections. When it needs to communicate with an IPv6 peer, it uses the RDS_CM_PORT to send the connection set up request. v4: Changed port history comments in rds.h (Sowmini Varadhan). v3: Added support to set up IPv4 connection using mapped address (David Miller). Added support to set up connection between link local and non-link addresses. Various review comments from Santosh Shilimkar and Sowmini Varadhan. v2: Fixed bound and peer address scope mismatched issue. Added back rds_connect() IPv6 changes. Signed-off-by: Ka-Cheong Poon --- Looks good to me now. Thanks !! Acked-by: Santosh Shilimkar
Re: [PATCH v4 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr
On 7/23/2018 7:16 AM, Ka-Cheong Poon wrote: This patch changes the internal representation of an IP address to use struct in6_addr. IPv4 address is stored as an IPv4 mapped address. All the functions which take an IP address as argument are also changed to use struct in6_addr. But RDS socket layer is not modified such that it still does not accept IPv6 address from an application. And RDS layer does not accept nor initiate IPv6 connections. v2: Fixed sparse warnings. Signed-off-by: Ka-Cheong Poon --- Looks good to me now. Thanks !! Acked-by: Santosh Shilimkar
Re: [PATCH v3 net-next 0/3] rds: IPv6 support
On 7/13/2018 4:02 AM, Ka-Cheong Poon wrote: This patch set adds IPv6 support to the kernel RDS and related modules. Existing RDS apps using IPv4 address continue to run without any problem. New RDS apps which want to use IPv6 address can do so by passing the address in struct sockaddr_in6 to bind(), connect() or sendmsg(). And those apps also need to use the new IPv6 equivalents of some of the existing socket options as the existing options use a 32 bit integer to store IP address. [...] Ka-Cheong Poon (3): rds: Changing IP address internal representation to struct in6_addr rds: Enable RDS IPv6 support rds: Extend RDS API for IPv6 support Could you please post v4 and as aligned, please drop the SO_TRANSPORT change from the set. regards, Santosh
Re: [PATCH v3 net-next 3/3] rds: Extend RDS API for IPv6 support
On 7/13/2018 4:27 PM, David Miller wrote: From: Santosh Shilimkar Date: Fri, 13 Jul 2018 15:00:59 -0700 Of course any application built using the upstream header and using SO_RDS_TRANSPORT will break, but since this particular option was added for a special case (application wants to upfront select a transport instead of letting bind figure it out), our hope is it's not used by other application(s). We can't let people have different UAPIs from upstream on a whim like this then change the already released upstream UAPI to match. Please take this into consideration when making changes in the future. It will not be repeated in the future. I'm not allowing this upstream UAPI break, sorry. Ok Dave !! Regards, Santosh
Re: [PATCH v3 net-next 3/3] rds: Extend RDS API for IPv6 support
Hi Dave, On 7/13/2018 2:25 PM, David Miller wrote: From: Ka-Cheong Poon Date: Fri, 13 Jul 2018 04:02:59 -0700

@@ -52,7 +52,7 @@
 #define RDS_RECVERR			5
 #define RDS_CONG_MONITOR		6
 #define RDS_GET_MR_FOR_DEST		7
-#define SO_RDS_TRANSPORT		8
+#define SO_RDS_TRANSPORT		9

There is no way you can change this value without breaking applications. Downstream Oracle shipping applications have been built with 9 as SO_RDS_TRANSPORT from the beginning. 8 is used for RDS_CONN_RESET but support for it doesn't exist upstream yet. Ka-Cheong was aligning them so that we have the same number upstream as well as in shipping products. Of course any application built using the upstream header and using SO_RDS_TRANSPORT will break, but since this particular option was added for a special case (application wants to upfront select a transport instead of letting bind figure it out), our hope is it's not used by other application(s). Regards, Santosh
Re: [PATCH v2 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr
Hi Ka-Cheong, On 7/6/2018 8:25 AM, Sowmini Varadhan wrote: On (07/06/18 23:08), Ka-Cheong Poon wrote: As mentioned in a previous mail, it is unclear why the port number is transport specific. Most Internet services use the same port number running over TCP/UDP as shown in the IANA database. And the IANA RDS registration is the same. What is the rationale for having a transport specific number in the RDS implementation? Because not every transport may need a port number. Let's keep separate ports for the RDMA and TCP transports. This has already been useful for the wireshark dissector and can also help eBPF-like external tooling. The fragment format and re-assembly are different across transports. I do see your point and also agree that the port number isn't transport specific, and in case we need to add another transport, what port to use. But maybe till then let's keep the existing behavior. This port switch is not related to IPv6 support as such, so let's deal with it separately. Regards, Santosh
Re: [PATCH v2 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr
On 6/27/2018 3:23 AM, Ka-Cheong Poon wrote: This patch changes the internal representation of an IP address to use struct in6_addr. IPv4 address is stored as an IPv4 mapped address. All the functions which take an IP address as argument are also changed to use struct in6_addr. But RDS socket layer is not modified such that it still does not accept IPv6 address from an application. And RDS layer does not accept nor initiate IPv6 connections. v2: Fixed sparse warnings. Signed-off-by: Ka-Cheong Poon ---

 net/rds/af_rds.c         | 138 --
 net/rds/bind.c           |  91 +-
 net/rds/cong.c           |  23 ++--
 net/rds/connection.c     | 132 +
 net/rds/ib.c             |  17 +--
 net/rds/ib.h             |  45 +--
 net/rds/ib_cm.c          | 300 +++
 net/rds/ib_rdma.c        |  15 +--
 net/rds/ib_recv.c        |  18 +--
 net/rds/ib_send.c        |  10 +-
 net/rds/loop.c           |   7 +-
 net/rds/rdma.c           |   6 +-
 net/rds/rdma_transport.c |  56 ++---
 net/rds/rds.h            |  69 +++
 net/rds/recv.c           |  51 +---
 net/rds/send.c           |  67 ---
 net/rds/tcp.c            |  32 -
 net/rds/tcp_connect.c    |  34 +++---
 net/rds/tcp_listen.c     |  18 +--
 net/rds/tcp_recv.c       |   9 +-
 net/rds/tcp_send.c       |   4 +-
 net/rds/threads.c        |  69 +--
 net/rds/transport.c      |  15 ++-
 23 files changed, 857 insertions(+), 369 deletions(-)

diff --git a/net/rds/bind.c b/net/rds/bind.c
index 5aa3a64..3822886 100644
--- a/net/rds/bind.c
+++ b/net/rds/bind.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006 Oracle. All rights reserved.
+ * Copyright (c) 2006, 2018 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses. You may choose to be licensed under the terms of the GNU
@@ -33,6 +33,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -42,42 +43,58 @@
 static const struct rhashtable_params ht_parms = {
 	.nelem_hint = 768,
-	.key_len = sizeof(u64),
+	.key_len = RDS_BOUND_KEY_LEN,

Do we really need the scope_id to be part of the key? With link local/global, do you see any collisions? Please educate me on the actual use case. This could avoid a bunch of changes, hence the question.
@@ -114,7 +132,7 @@ static int rds_add_bound(struct rds_sock *rs, __be32 addr, __be16 *port)
 				 rs, , (int)ntohs(*port));
 			break;
 		} else {
-			rs->rs_bound_addr = 0;
+			rs->rs_bound_addr = in6addr_any;

Can you elaborate why 0 is not ok?

 			rds_sock_put(rs);
 			ret = -ENOMEM;
 			break;
@@ -127,44 +145,61 @@
 void rds_remove_bound(struct rds_sock *rs)
 {
-	if (!rs->rs_bound_addr)
+	if (ipv6_addr_any(&rs->rs_bound_addr))
 		return;

-	rdsdebug("rs %p unbinding from %pI4:%d\n",
+	rdsdebug("rs %p unbinding from %pI6c:%d\n",
 		 rs, &rs->rs_bound_addr, ntohs(rs->rs_bound_port));

 	rhashtable_remove_fast(&bind_hash_table, &rs->rs_bound_node, ht_parms);
 	rds_sock_put(rs);
-	rs->rs_bound_addr = 0;
+	rs->rs_bound_addr = in6addr_any;
 }

 int rds_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 {
 	struct sock *sk = sock->sk;
-	struct sockaddr_in *sin = (struct sockaddr_in *)uaddr;
 	struct rds_sock *rs = rds_sk_to_rs(sk);
+	struct in6_addr v6addr, *binding_addr;
 	struct rds_transport *trans;
+	__u32 scope_id = 0;
 	int ret = 0;
+	__be16 port;

+	/* We only allow an RDS socket to be bound to and IPv4 address. IPv6

s/'bound to and IPv4'/'bound to an IPv4'

+	 * address support will be added later.
+	 */
+	if (addr_len == sizeof(struct sockaddr_in)) {
+		struct sockaddr_in *sin = (struct sockaddr_in *)uaddr;
+
+		if (sin->sin_family != AF_INET ||
+		    sin->sin_addr.s_addr == htonl(INADDR_ANY))
+			return -EINVAL;
+		ipv6_addr_set_v4mapped(sin->sin_addr.s_addr, &v6addr);
+		binding_addr = &v6addr;
+		port = sin->sin_port;
+	} else if (addr_len == sizeof(struct sockaddr_in6)) {
+		return -EPROTONOSUPPORT;
+	} else {
+		return -EINVAL;
+	}
 	lock_sock(sk);
-	if (addr_len != sizeof(struct sockaddr_in) ||
-	    sin->sin_family != AF_INET ||
-	    rs->rs_bound_addr ||
-	    sin->sin_addr.s_addr == htonl(INADDR_ANY)) {
+	/* RDS socket does not allow re-binding. */
+	if (!ipv6_addr_any(&rs->rs_bound_addr)) {
 		ret = -EINVAL;
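The representation change under review stores IPv4 bind addresses as IPv4-mapped IPv6 addresses and tests "unbound" with ipv6_addr_any() instead of comparing against 0. A userspace analogue using the standard socket headers; v4_to_mapped() and addr_is_any() are illustrative helpers mirroring the kernel's ipv6_addr_set_v4mapped() and ipv6_addr_any(), not kernel functions.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Build an IPv4-mapped IPv6 address (::ffff:a.b.c.d) from a v4 address
 * given in network byte order. */
static struct in6_addr v4_to_mapped(uint32_t v4_be)
{
	struct in6_addr a;

	memset(&a, 0, sizeof(a));
	a.s6_addr[10] = 0xff;
	a.s6_addr[11] = 0xff;
	memcpy(&a.s6_addr[12], &v4_be, 4);   /* v4 address occupies the tail */
	return a;
}

/* Analogue of the kernel's ipv6_addr_any() check that the patch uses in
 * place of "rs_bound_addr == 0". */
static int addr_is_any(const struct in6_addr *a)
{
	static const struct in6_addr any = IN6ADDR_ANY_INIT;

	return memcmp(a, &any, sizeof(any)) == 0;
}
```

With this encoding a single in6_addr field carries both address families, which is why the patch can replace every `__be32` address with `struct in6_addr` internally while the socket layer still only accepts IPv4.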
Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support
On 6/27/2018 3:07 AM, Ka-Cheong Poon wrote: On 06/26/2018 09:08 PM, Sowmini Varadhan wrote: On (06/26/18 21:02), Ka-Cheong Poon wrote: [...] I don't expect RDS apps will want to use link local address in the first place. In fact, most normal network apps don't. This is not true. You're not doing this for IPv4 and RDS today (you don't have to do this for UDP, afaik). Do you know of any IPv4 RDS app which uses IPv4 link local address? In fact, IPv4 link local address is explicitly disallowed for active-active bonding. Yes. Cluster-ware HAIP makes use of link local addresses. That check was mainly because of RDMA CM issues, but that only means active-active isn't used. The bonding works just fine and if needed cluster-ware can also use the TCP transport. Let's not add this new behavior for link local; it's actually not really relevant to v6 addressing support. Regards, Santosh
Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support
On 6/25/2018 10:50 AM, Sowmini Varadhan wrote: On (06/26/18 01:43), Ka-Cheong Poon wrote: Yes, I think if the socket is bound, it should check the scope_id in msg_name (if not NULL) to make sure that they match. A bound RDS socket can send to multiple peers. But if the bound local address is link local, it should only be allowed to send to peers on the same link. agree. Yep. It's in line with RDS bind behavior. If a socket is bound, I guess the scope_id should be used. So if a socket is not bound to a link local address and the socket is used to send to a link local peer, it should fail. PF_RDS sockets *MUST* always be bound. See Documentation/networking/rds.txt: "Sockets must be bound before you can send or receive data. This is needed because binding also selects a transport and attaches it to the socket. Once bound, the transport assignment does not change." In any case link local or not, the socket needs to be bound before any data can be sent, as documented. The send path already enforces it. Also, why is there no IPv6 support in rds_connect? Oops, I missed this when I ported the internal version to the net-next version. Will add it back. So the net-next version wasn't tested? IPv6 connections themselves wouldn't be formed with this missing. As mentioned already, please test v2 before posting on the list. Regards, Santosh
[PATCH] rds: avoid unenecessary cong_update in loop transport
For the loop transport, which is a self loopback, the remote port congestion update isn't relevant. In fact, the xmit path already ignores it. The receive path needs to do the same. Reported-by: syzbot+4c20b3866171ce844...@syzkaller.appspotmail.com Reviewed-by: Sowmini Varadhan Signed-off-by: Santosh Shilimkar ---

 net/rds/loop.c | 1 +
 net/rds/rds.h  | 5 +
 net/rds/recv.c | 5 +
 3 files changed, 11 insertions(+)

diff --git a/net/rds/loop.c b/net/rds/loop.c
index f2bf78d..dac6218 100644
--- a/net/rds/loop.c
+++ b/net/rds/loop.c
@@ -193,4 +193,5 @@ struct rds_transport rds_loop_transport = {
 	.inc_copy_to_user	= rds_message_inc_copy_to_user,
 	.inc_free		= rds_loop_inc_free,
 	.t_name			= "loopback",
+	.t_type			= RDS_TRANS_LOOP,
 };
diff --git a/net/rds/rds.h b/net/rds/rds.h
index b04c333..f2272fb 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -479,6 +479,11 @@ struct rds_notifier {
 	int n_status;
 };

+/* Available as part of RDS core, so doesn't need to participate
+ * in get_preferred transport etc
+ */
+#define RDS_TRANS_LOOP	3
+
 /**
  * struct rds_transport - transport specific behavioural hooks
  *
diff --git a/net/rds/recv.c b/net/rds/recv.c
index dc67458..192ac6f 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -103,6 +103,11 @@ static void rds_recv_rcvbuf_delta(struct rds_sock *rs, struct sock *sk,
 		rds_stats_add(s_recv_bytes_added_to_socket, delta);
 	else
 		rds_stats_add(s_recv_bytes_removed_from_socket, -delta);
+
+	/* loop transport doesn't send/recv congestion updates */
+	if (rs->rs_transport->t_type == RDS_TRANS_LOOP)
+		return;
+
 	now_congested = rs->rs_rcv_bytes > rds_sk_rcvbuf(rs);

 	rdsdebug("rs %p (%pI4:%u) recv bytes %d buf %d "
--
1.9.1
Re: KASAN: null-ptr-deref Read in rds_ib_get_mr
On 5/11/2018 12:48 AM, Yanjun Zhu wrote: On 2018/5/11 13:20, DaeRyong Jeong wrote: We report the crash: KASAN: null-ptr-deref Read in rds_ib_get_mr

Note that this bug was previously reported by syzkaller. https://syzkaller.appspot.com/bug?id=0bb56a5a48b000b52aa2b0d8dd20b1f545214d91 Nonetheless, this bug has not been fixed yet, and we hope that this report and our analysis, which gets help from the RaceFuzzer's feature, will be helpful to fix the crash. This crash has been found in v4.17-rc1 using RaceFuzzer (a modified version of Syzkaller), which we describe more at the end of this report. Our analysis shows that the race occurs when invoking two syscalls concurrently, bind$rds and setsockopt$RDS_GET_MR.

Analysis: We think the concurrent execution of __rds_rdma_map() and rds_bind() causes the problem. __rds_rdma_map() checks whether rs->rs_bound_addr is 0 or not. But the concurrent execution with rds_bind() can by-pass this check. Therefore, __rds_rdma_map() calls rs->rs_transport->get_mr() and rds_ib_get_mr() causes the null deref at ib_rdma.c:544 in v4.17-rc1, when dereferencing rs_conn.

Thread interleaving:
CPU0 (__rds_rdma_map)                           CPU1 (rds_bind)
                                                // rds_add_bound() sets rs->bound_addr as non 0
                                                ret = rds_add_bound(rs, sin->sin_addr.s_addr, &sin->sin_port);
if (rs->rs_bound_addr == 0 || !rs->rs_transport) {
	ret = -ENOTCONN; /* XXX not a great errno */
	goto out;
}
                                                if (rs->rs_transport) { /* previously bound */
                                                	trans = rs->rs_transport;
                                                	if (trans->laddr_check(sock_net(sock->sk),
                                                	    sin->sin_addr.s_addr) != 0) {
                                                		ret = -ENOPROTOOPT;
                                                		// rds_remove_bound() sets rs->bound_addr as 0
                                                		rds_remove_bound(rs);
...
trans_private = rs->rs_transport->get_mr(sg, nents, rs, &mr->r_key);
(in rds_ib_get_mr())
struct rds_ib_connection *ic = rs->rs_conn->c_transport_data;

Call sequence (v4.17-rc1):
CPU0
	rds_setsockopt
		rds_get_mr
			__rds_rdma_map
				rds_ib_get_mr

CPU1
	rds_bind
		rds_add_bound
		...
		rds_remove_bound

Crash log:
==================================================================
BUG: KASAN: null-ptr-deref in rds_ib_get_mr+0x3a/0x150 net/rds/ib_rdma.c:544
Read of size 8 at addr 0068 by task syz-executor0/32067

CPU: 0 PID: 32067 Comm: syz-executor0 Not tainted 4.17.0-rc1 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x166/0x21c lib/dump_stack.c:113
 kasan_report_error mm/kasan/report.c:352 [inline]
 kasan_report+0x140/0x360 mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 __asan_load8+0x54/0x90 mm/kasan/kasan.c:699
 rds_ib_get_mr+0x3a/0x150 net/rds/ib_rdma.c:544
 __rds_rdma_map+0x521/0x9d0 net/rds/rdma.c:271
 rds_get_mr+0xad/0xf0 net/rds/rdma.c:333
 rds_setsockopt+0x57f/0x720 net/rds/af_rds.c:347
 __sys_setsockopt+0x147/0x230 net/socket.c:1903
 __do_sys_setsockopt net/socket.c:1914 [inline]
 __se_sys_setsockopt net/socket.c:1911 [inline]
 __x64_sys_setsockopt+0x67/0x80 net/socket.c:1911
 do_syscall_64+0x15f/0x4a0 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4563f9
RSP: 002b:7f6a2b3c2b28 EFLAGS: 0246 ORIG_RAX: 0036
RAX: ffda RBX: 0072bee0 RCX: 004563f9
RDX: 0002 RSI: 0114 RDI: 0015
RBP: 0575 R08: 0020 R09:
R10: 2140 R11: 0246 R12: 7f6a2b3c36d4
R13: R14: 006fd398 R15:
==================================================================

diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index e678699..2228b50 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -539,11 +539,17 @@ void rds_ib_flush_mrs(void)
 void *rds_ib_get_mr(struct scatterlist *sg, unsigned long nents,
 		    struct rds_sock *rs, u32 *key_ret)
 {
-	struct rds_ib_device *rds_ibdev;
+	struct rds_ib_device *rds_ibdev = NULL;
 	struct rds_ib_mr *ibmr = NULL;
-	struct rds_ib_connection *ic = rs->rs_conn->c_transport_data;
+	struct rds_ib_connection *ic = NULL;
 	int ret;

+	if (rs->rs_bound_addr == 0) {
+		ret = -EPERM;
+		goto out;
+	}
+

No, you can't return such an error for this API, and the socket related checks need to be done at the core layer. I remember fixing this race but probably never pushed the fix upstream. The MR code is due for an update with the optimized FRWR code, which is now stable enough. We will address this issue as well as part of that patchset.
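The race being analyzed is a classic check-then-use window: rds_bind()/rds_remove_bound() can change rs_bound_addr between __rds_rdma_map()'s check and the get_mr() call. One way such a window is closed, sketched here in userspace C with illustrative names (this is not the actual kernel fix, which per the reply above is deferred to the FRWR rework), is to perform the check and the use under the same lock the bind path takes:

```c
#include <pthread.h>

/* Toy socket: the lock must be the one the bind/unbind path also takes. */
struct sockish {
	pthread_mutex_t lock;
	unsigned int bound_addr;   /* 0 means unbound */
	int mr_count;              /* stands in for get_mr() side effects */
};

/* Sketch: check bound state and act on it in one critical section, so a
 * concurrent unbind cannot slip in between the check and the use. */
static int rdma_map(struct sockish *rs)
{
	int ret = 0;

	pthread_mutex_lock(&rs->lock);
	if (rs->bound_addr == 0)
		ret = -1;          /* -ENOTCONN in the kernel */
	else
		rs->mr_count++;    /* safe: binding can't change under us */
	pthread_mutex_unlock(&rs->lock);
	return ret;
}
```

The unlocked kernel check (`if (rs->rs_bound_addr == 0)`) is exactly what RaceFuzzer's interleaving bypasses; any real fix has to order the check against rds_remove_bound() rather than just repeat it.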
Re: [PATCH] rds: ib: Fix missing call to rds_ib_dev_put in rds_ib_setup_qp
On 4/25/2018 4:22 AM, Dag Moxnes wrote: The function rds_ib_setup_qp is calling rds_ib_get_client_data and should correspondingly call rds_ib_dev_put. This call was lost in the non-error path with the introduction of error handling done in commit 3b12f73a5c29 ("rds: ib: add error handle") Signed-off-by: Dag Moxnes <dag.mox...@oracle.com> --- Thanks Dag for following it up. Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH 1/1] Revert "rds: ib: add error handle"
On 4/24/2018 4:25 AM, Dag Moxnes wrote: I was going to suggest the following correction: If all agree that this is the correct way of doing it, I can go ahead and post it. Yes please. Go ahead and post your fix. Regards, Santosh P.S.: Avoid top posting, please.
Re: [PATCH net-next] rds: tcp: must use spin_lock_irq* and not spin_lock_bh with rds_tcp_conn_lock
On 3/15/2018 3:54 AM, Sowmini Varadhan wrote: rds_tcp_connection allocation/free management has the potential to be called from __rds_conn_create after IRQs have been disabled, so spin_[un]lock_bh cannot be used with rds_tcp_conn_lock. Bottom-halves that need to synchronize for critical sections protected by rds_tcp_conn_lock should instead use rds_destroy_pending() correctly. Reported-by: syzbot+c68e51bb5e699d3f8...@syzkaller.appspotmail.com Fixes: ebeeb1ad9b8a ("rds: tcp: use rds_destroy_pending() to synchronize netns/module teardown and rds connection/workq management") Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com> --- Thanks Sowmini for the WARN_ON() discussion off-list. Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
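The locking rule being applied, as a hedged sketch (names simplified): because __rds_conn_create() can be entered with IRQs already disabled, the _bh lock variants are unsafe on rds_tcp_conn_lock — spin_unlock_bh() could re-enable bottom halves in a context that must keep them off. Saving and restoring the IRQ state works from any context.

```c
static LIST_HEAD(rds_tcp_conn_list);
static DEFINE_SPINLOCK(rds_tcp_conn_lock);

/* Illustrative only: add a connection to the global list. */
static void rds_tcp_conn_track(struct rds_tcp_connection *tc)
{
	unsigned long flags;

	/* spin_lock_irqsave(), not spin_lock_bh(): safe even when the
	 * caller already runs with IRQs disabled. */
	spin_lock_irqsave(&rds_tcp_conn_lock, flags);
	list_add_tail(&tc->t_tcp_node, &rds_tcp_conn_list);
	spin_unlock_irqrestore(&rds_tcp_conn_lock, flags);
}
```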
Re: [PATCH v2 net] rds: Incorrect reference counting in TCP socket creation
On 3/2/2018 6:42 AM, David Miller wrote: From: "santosh.shilim...@oracle.com" Date: Thu, 1 Mar 2018 22:22:07 -0800 Versioning comments typically go below the "---" and are not part of the commit message. I like them to be in the commit message most of the time, especially for patch series header postings. For series headers I have seen that followed, but not in patch commits, since those go into the kernel git history. Later, if someone reviews the patch and asks "why didn't they do X", and the version history text says "don't do X based upon feedback from developer Y", that helps a lot. I agree the versioning info helps, but I didn't know that you like it to be in the commit message of patches as well. Will keep that in mind for the future on the netdev list. Regards, Santosh
Re: [PATCH V3 net-next 2/3] rds: deliver zerocopy completion notification with data
On 2/26/2018 9:11 AM, Sowmini Varadhan wrote: On (02/26/18 09:07), Santosh Shilimkar wrote: Just in case you haven't seen it yet, Dan Carpenter reported an skb deref warning on a previous version of the patch. Not sure why it wasn't sent on netdev. Yes, I saw it, but that was for the previous version of the patch, around code that delivers notifications on sk_error_queue. This patch series removes the sk_error_queue support, so the smatch warning is not applicable after this patch. I see it now. Thanks !! Regards, Santosh
Re: [PATCH V3 net-next 2/3] rds: deliver zerocopy completion notification with data
On 2/25/2018 3:21 PM, Sowmini Varadhan wrote: This commit is an optimization over commit 01883eda72bd ("rds: support for zcopy completion notification") for PF_RDS sockets. RDS applications are predominantly request-response transactions, so it is more efficient to reduce the number of system calls and have zerocopy completion notification delivered as ancillary data on the POLLIN channel. Cookies are passed up as ancillary data (at level SOL_RDS) in a struct rds_zcopy_cookies when the returned value of recvmsg() is greater than, or equal to, 0. A max of RDS_MAX_ZCOOKIES may be passed with each message. This commit removes support for zerocopy completion notification on MSG_ERRQUEUE for PF_RDS sockets. Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com> --- v2: remove sk_error_queue path; lots of cautionary checks in rds_recvmsg_zcookie() and callers to make sure we don't remove cookies from the queue and then fail to pass them up to the caller v3: - bounds check on skb->cb to make sure there is enough room for struct rds_zcopy_cookies as well as the rds_znotifier; - Refactor cautionary checks in rds_recvmsg_zcookie: if no msg_control has been passed, or if there is not enough msg_controllen for a rds_zcopy_cookies, return silently (do not return an error, as the caller may have wanted other ancillary data which may happen to fit in the space provided) - return bool from rds_recvmsg_zcookie, some other code cleanup Just in case you haven't seen it yet, Dan Carpenter reported an skb deref warning on a previous version of the patch. Not sure why it wasn't sent on netdev. smatch warnings: net/rds/recv.c:605 rds_recvmsg_zcookie() warn: variable dereferenced before check 'skb' (see line 596) With that addressed, Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [for-next 7/7] IB/mlx5: Implement fragmented completion queue (CQ)
On 2/24/2018 1:40 AM, Majd Dibbiny wrote: On Feb 23, 2018, at 9:13 PM, Saeed Mahameed <sae...@mellanox.com> wrote: On Thu, 2018-02-22 at 16:04 -0800, Santosh Shilimkar wrote: Hi Saeed On 2/21/2018 12:13 PM, Saeed Mahameed wrote: [...] Jason mentioned this patch to me off-list. We were seeing a similar issue with SRQs & QPs, so I am wondering whether you have any plans to do a similar change for other resources too, so that they don't rely on higher-order page allocation for ICM tables. Hi Santosh, Adding Majd. Which ULP is in question? How big are the QPs/SRQs you create that lead to this problem? For ICM tables we already allocate only order-0 pages: see alloc_system_page() in drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c. But for kernel RDMA SRQ and QP buffers there is a place for improvement. Majd, do you know if we have any near-future plans for this? It's in our plans to move all the buffers to use 0-order pages. Santosh, is this RDS? Do you have a persistent failure with some configuration? Can you please share more information? No, the issue was seen with user verbs, and actually with the MLX4 driver. My last question was more about ICM allocation for all the resources in both the MLX4 and MLX5 drivers. With the MLX4 driver, we have seen corruption issues with MLX4_NO_RR while recycling the resources, so we ended up switching back to round-robin bitmap allocation as it was before it was changed by one of Jack's commits, 7c6d74d23 ("mlx4_core: Roll back round robin bitmap allocation commit for CQs, SRQs, and MPTs"). With the default round robin, the corruption issue went away, but its undesired effect of bloating the ICM tables until you hit the resource limit means more memory fragmentation. Since these resources make use of higher-order allocations, in fragmented-memory scenarios we see contention on the mm lock for seconds, since the compaction layer is trying to stitch pages together, which takes time. If these allocations don't make use of higher-order pages, the issue can certainly be avoided, hence the reason behind the question. Of course, we wouldn't have ended up with this issue if 'MLX4_NO_RR' worked without corruption :-) Regards, Santosh
Re: [for-next 7/7] IB/mlx5: Implement fragmented completion queue (CQ)
Hi Saeed On 2/21/2018 12:13 PM, Saeed Mahameed wrote: From: Yonatan Cohen The current implementation of create CQ requires contiguous memory; such a requirement is problematic once the memory is fragmented or the system is low on memory, and it causes failures in dma_zalloc_coherent(). This patch implements a new scheme of fragmented CQs to overcome this issue by introducing a new type, 'struct mlx5_frag_buf_ctrl', to allocate fragmented buffers rather than contiguous ones. Base the Completion Queues (CQs) on this new fragmented buffer. It fixes the following crashes: kworker/29:0: page allocation failure: order:6, mode:0x80d0 CPU: 29 PID: 8374 Comm: kworker/29:0 Tainted: G OE 3.10.0 Workqueue: ib_cm cm_work_handler [ib_cm] Call Trace: [<>] dump_stack+0x19/0x1b [<>] warn_alloc_failed+0x110/0x180 [<>] __alloc_pages_slowpath+0x6b7/0x725 [<>] __alloc_pages_nodemask+0x405/0x420 [<>] dma_generic_alloc_coherent+0x8f/0x140 [<>] x86_swiotlb_alloc_coherent+0x21/0x50 [<>] mlx5_dma_zalloc_coherent_node+0xad/0x110 [mlx5_core] [<>] ? mlx5_db_alloc_node+0x69/0x1b0 [mlx5_core] [<>] mlx5_buf_alloc_node+0x3e/0xa0 [mlx5_core] [<>] mlx5_buf_alloc+0x14/0x20 [mlx5_core] [<>] create_cq_kernel+0x90/0x1f0 [mlx5_ib] [<>] mlx5_ib_create_cq+0x3b0/0x4e0 [mlx5_ib] Signed-off-by: Yonatan Cohen Reviewed-by: Tariq Toukan Signed-off-by: Leon Romanovsky Signed-off-by: Saeed Mahameed --- Jason mentioned this patch to me off-list. We were seeing a similar issue with SRQs & QPs, so I am wondering whether you have any plans to do a similar change for other resources too, so that they don't rely on higher-order page allocation for ICM tables. Regards, Santosh
Re: [PATCH net-next] rds: rds_msg_zcopy should return error of null rm->data.op_mmp_znotifier
On 2/22/2018 1:40 PM, Sowmini Varadhan wrote: if either or both of MSG_ZEROCOPY and SOCK_ZEROCOPY have not been specified, the rm->data.op_mmp_znotifier allocation will be skipped. In this case, it is invalid to pass down a cmsghdr with RDS_CMSG_ZCOPY_COOKIE, so return EINVAL from rds_msg_zcopy for this case. Reported-by: syzbot+f893ae7bb2f6456df...@syzkaller.appspotmail.com Fixes: 0cebaccef3ac ("rds: zerocopy Tx support.") Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com> --- Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH net-next] RDS: deliver zerocopy completion notification with data as an optimization
On 2/21/2018 12:19 PM, Sowmini Varadhan wrote: This commit is an optimization that builds on top of commit 01883eda72bd ("rds: support for zcopy completion notification") for PF_RDS sockets. Cookies associated with zerocopy completion are passed up on the POLLIN channel, piggybacked with data whereever possible. Such cookies are passed s/whereever/wherever up as ancillary data (at level SOL_RDS) in a struct rds_zcopy_cookies when the returned value of recvmsg() is >= 0. A max of SO_EE_ORIGIN_MAX_ZCOOKIES may be passed with each message. Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com> --- Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH] rds: send: mark expected switch fall-through in rds_rm_size
On 2/20/2018 10:05 AM, Gustavo A. R. Silva wrote: Hi Santosh, On 02/20/2018 11:54 AM, Santosh Shilimkar wrote: Hi, On 2/19/2018 10:10 AM, Gustavo A. R. Silva wrote: In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Addresses-Coverity-ID: 1465362 ("Missing break in switch") Signed-off-by: Gustavo A. R. Silva <gust...@embeddedor.com> --- net/rds/send.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/rds/send.c b/net/rds/send.c index 028ab59..79d158b 100644 --- a/net/rds/send.c +++ b/net/rds/send.c @@ -902,6 +902,8 @@ static int rds_rm_size(struct msghdr *msg, int num_sgs) case RDS_CMSG_ZCOPY_COOKIE: zcopy_cookie = true; + /* fall through */ + case RDS_CMSG_RDMA_DEST: case RDS_CMSG_RDMA_MAP: cmsg_groups |= 2; So Coverity greps for the comment "fall through" for the -Wimplicit-fallthrough build? No. Basically, Coverity only reports cases in which a break, return or continue statement is missing. Now, if the statements I mention above are missing and you add the following line to your Makefile: KBUILD_CFLAGS += $(call cc-option,-Wimplicit-fallthrough) you will get a warning if a fall-through comment is missing. That makes sense. Adding that comment itself is fine, but I was curious: what happens if someone makes a spelling error in this comment? ;-) In this case, Coverity would still report the same "Missing break in switch" error, but you'll get a GCC warning. Got it. Thanks !!
Re: [PATCH] rds: send: mark expected switch fall-through in rds_rm_size
On 2/20/2018 10:01 AM, David Miller wrote: From: Santosh Shilimkar <santosh.shilim...@oracle.com> Date: Tue, 20 Feb 2018 09:54:09 -0800 So Coverity greps for the comment "fall through" for the -Wimplicit-fallthrough build? From what I understand, 'gcc' does in the latest versions. Coverity might as well, I don't know. Good to know about 'gcc' adding such an option. Thanks !!
Re: [PATCH] rds: send: mark expected switch fall-through in rds_rm_size
Hi, On 2/19/2018 10:10 AM, Gustavo A. R. Silva wrote: In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Addresses-Coverity-ID: 1465362 ("Missing break in switch") Signed-off-by: Gustavo A. R. Silva <gust...@embeddedor.com> --- net/rds/send.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/rds/send.c b/net/rds/send.c index 028ab59..79d158b 100644 --- a/net/rds/send.c +++ b/net/rds/send.c @@ -902,6 +902,8 @@ static int rds_rm_size(struct msghdr *msg, int num_sgs) case RDS_CMSG_ZCOPY_COOKIE: zcopy_cookie = true; + /* fall through */ + case RDS_CMSG_RDMA_DEST: case RDS_CMSG_RDMA_MAP: cmsg_groups |= 2; So Coverity greps for the comment "fall through" for the -Wimplicit-fallthrough build? Adding that comment itself is fine, but I was curious: what happens if someone makes a spelling error in this comment? ;-) For the patch itself, Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH v3 net-next 5/7] rds: zerocopy Tx support.
On 2/15/2018 10:49 AM, Sowmini Varadhan wrote: If the MSG_ZEROCOPY flag is specified with rds_sendmsg(), and, if the SO_ZEROCOPY socket option has been set on the PF_RDS socket, application pages sent down with rds_sendmsg() are pinned. The pinning uses the accounting infrastructure added by Commit a91dbff551a6 ("sock: ulimit on MSG_ZEROCOPY pages") The payload bytes in the message may not be modified for the duration that the message has been pinned. A multi-threaded application using this infrastructure may thus need to be notified about send-completion so that it can free/reuse the buffers passed to rds_sendmsg(). Notification of send-completion will identify each message-buffer by a cookie that the application must specify as ancillary data to rds_sendmsg(). The ancillary data in this case has cmsg_level == SOL_RDS and cmsg_type == RDS_CMSG_ZCOPY_COOKIE. Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com> --- Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH v3 net-next 4/7] rds: support for zcopy completion notification
On 2/15/2018 10:49 AM, Sowmini Varadhan wrote: RDS removes a datagram (rds_message) from the retransmit queue when an ACK is received. The ACK indicates that the receiver has queued the RDS datagram, so that the sender can safely forget the datagram. When all references to the rds_message are quiesced, rds_message_purge is called to release resources used by the rds_message If the datagram to be removed had pinned pages set up, add an entry to the rs->rs_znotify_queue so that the notifcation will be sent up via rds_rm_zerocopy_callback() when the rds_message is eventually freed by rds_message_purge. rds_rm_zerocopy_callback() attempts to batch the number of cookies sent with each notification to a max of SO_EE_ORIGIN_MAX_ZCOOKIES. This is achieved by checking the tail skb in the sk_error_queue: if this has room for one more cookie, the cookie from the current notification is added; else a new skb is added to the sk_error_queue. Every invocation of rds_rm_zerocopy_callback() will trigger a ->sk_error_report to notify the application. Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com> --- Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH v3 net-next 2/7] rds: hold a sock ref from rds_message to the rds_sock
On 2/15/2018 10:49 AM, Sowmini Varadhan wrote: The existing model holds a reference from the rds_sock to the rds_message, but the rds_message does not itself hold a sock_put() on the rds_sock. Instead the m_rs field in the rds_message is assigned when the message is queued on the sock, and nulled when the message is dequeued from the sock. We want to be able to notify userspace when the rds_message is actually freed (from rds_message_purge(), after the refcounts to the rds_message go to 0). At the time that rds_message_purge() is called, the message is no longer on the rds_sock retransmit queue. Thus the explicit reference for the m_rs is needed to send a notification that will signal to userspace that it is now safe to free/reuse any pages that may have been pinned down for zerocopy. This patch manages the m_rs assignment in the rds_message with the necessary refcount book-keeping. Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com> --- Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH V2 net-next 4/7] rds: support for zcopy completion notification
On 2/14/2018 2:26 PM, Willem de Bruijn wrote: On Wed, Feb 14, 2018 at 1:50 PM, Santosh Shilimkar [...] This error change might need to go through another subsystem tree. Maybe you can separate it and also copy "linux-...@vger.kernel.org" Previous changes to this file also went in through net-next, so this is fine. The error queue is a socket, so network, data structure. OK, if that is acceptable then there is no need to split the change. As for naming, zerocopy is unfortunately an overused term. I did not help matters when adding msg_zerocopy. :-) That said, let's try to keep consistent naming across socket families, so no zmsg prefix only in RDS. Fair enough. Regards, Santosh
Re: [PATCH V2 net-next 4/7] rds: support for zcopy completion notification
On 2/14/2018 1:25 PM, Sowmini Varadhan wrote: On (02/14/18 13:10), Santosh Shilimkar wrote: RDS supports true zero copy already with the RDMA transport, so some of this code can easily get confused. btw, another way to solve this is to have the RDMA code use the suffix "rdma" (which is what it really is) as needed. And in the same breath, here zcopy is No message from user ;-) ZCOPY is otherwise often tied to RDMA directly. Renaming churns are not that useful, and I definitely agree with what Dave said if it were a renaming change to the existing code, like what you are suggesting with 'rdma'. Anyway, I don't want to contest this too much since I can follow that code and know what each does and means :-) The comment was from a long-term readability perspective, for someone completely new reading the code and needing to distinguish the different modes. Regards, Santosh
Re: [PATCH V2 net-next 5/7] rds: zerocopy Tx support.
On 2/14/2018 11:49 AM, Sowmini Varadhan wrote: On (02/14/18 11:10), Santosh Shilimkar wrote: s/RDS_CMSG_ZCOPY_COOKIE/RDS_CMSG_ZMSGCOPY_COOKIE Please see https://www.spinics.net/lists/netdev/msg483627.html Just saw it and responded to Dave. @@ -356,6 +358,53 @@ int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from) sg = rm->data.op_sg; sg_off = 0; /* Dear gcc, sg->page will be null from kzalloc. */ + if (zcopy) { + int total_copied = 0; + struct sk_buff *skb; + + skb = alloc_skb(SO_EE_ORIGIN_MAX_ZCOOKIES * sizeof(u32), + GFP_KERNEL); This can sleep, so you might want to check whether you want to use the ATOMIC version here. I think it should be fine: rds_message_copy_from_user() is called in process context, and if you notice, the calling function rds_sendmsg() also has this 1100 rm = rds_message_alloc(ret, GFP_KERNEL); 1101 if (!rm) { 1102 ret = -ENOMEM; 1103 goto out; 1104 } : 1106 /* Attach data to the rm */ : 1113 ret = rds_message_copy_from_user(rm, &msg->msg_iter); So using GFP_KERNEL is as safe as the call at line 1100. I was just asking you to check if it is safe. The path already does that, so we are good. + return -ENOMEM; + } -ENOMEM is a new application-visible change, but probably the right one for this particular case. We just need to make sure the application can handle this error. I think the application already handles this correctly (see line 1102 above) Indeed. Thanks for checking. Regards, Santosh
Re: [PATCH V2 net-next 4/7] rds: support for zcopy completion notification
On 2/14/2018 11:02 AM, David Miller wrote: From: Sowmini Varadhan <sowmini.varad...@oracle.com> Date: Wed, 14 Feb 2018 14:01:10 -0500 On (02/14/18 10:50), Santosh Shilimkar wrote: generic comment and please update it where it is applicable in terms of variable names, notifiers etc. RDS supports true zero copy already with the RDMA transport, so some of this code can easily get confused. So I suggest something like below. s/zerocopy/zeromsgcopy s/zcopy/zmsgcopy s/zcookie/zmsgcpycookie s/znotifier/zmsgcpynotifier I'd like to hear some additional opinions from the list on this: the existing socket API for TCP etc. already uses ZEROCOPY, and other than extending variable names (and putting me at risk of violating the "fit within 80 chars per line" requirement, leading to not-so-pretty line wraps), I'm not seeing much value in this. I agree, this name change request seems pointless. Just keep the names the way they are, thank you. Dave I understand your point for generic code, and I was not at all asking to rename anything in generic code. My point was about new code getting added to RDS to support this zero copy message option. Within RDS, the core code is common, and all the infrastructure with it, and it will be hard to follow the code if that distinction is not maintained. Of course you have the last say on this list, so if you want to overrule the comment, I will step back :-) Regards, Santosh
Re: [PATCH V2 net-next 4/7] rds: support for zcopy completion notification
On 2/14/2018 11:01 AM, Sowmini Varadhan wrote: On (02/14/18 10:50), Santosh Shilimkar wrote: generic comment and please update it where it is applicable in terms of variable names, notifiers etc. RDS supports true zero copy already with the RDMA transport, so some of this code can easily get confused. So I suggest something like below. s/zerocopy/zeromsgcopy s/zcopy/zmsgcopy s/zcookie/zmsgcpycookie s/znotifier/zmsgcpynotifier I'd like to hear some additional opinions from the list on this: the existing socket API for TCP etc. already uses ZEROCOPY, and other than extending variable names (and putting me at risk of violating the "fit within 80 chars per line" requirement, leading to not-so-pretty line wraps), I'm not seeing much value in this. What do you mean by not seeing value? It has value for the RDS code, which already supports zero copy via the RDMA transport. For TCP sockets in isolation, ZCOPY may be alright. The patch under discussion is for RDS, hence the comment. Regards, Santosh
Re: [PATCH V2 net-next 5/7] rds: zerocopy Tx support.
On 2/14/2018 2:28 AM, Sowmini Varadhan wrote: If the MSG_ZEROCOPY flag is specified with rds_sendmsg(), and, if the SO_ZEROCOPY socket option has been set on the PF_RDS socket, application pages sent down with rds_sendmsg() are pinned. The pinning uses the accounting infrastructure added by Commit a91dbff551a6 ("sock: ulimit on MSG_ZEROCOPY pages") The payload bytes in the message may not be modified for the duration that the message has been pinned. A multi-threaded application using this infrastructure may thus need to be notified about send-completion so that it can free/reuse the buffers passed to rds_sendmsg(). Notification of send-completion will identify each message-buffer by a cookie that the application must specify as ancillary data to rds_sendmsg(). The ancillary data in this case has cmsg_level == SOL_RDS and cmsg_type == RDS_CMSG_ZCOPY_COOKIE. Signed-off-by: Sowmini Varadhan --- v2: - remove unused data_len argument to rds_rm_size; - unmap as necessary if we fail in the middle of zerocopy setup include/uapi/linux/rds.h |1 + net/rds/message.c| 51 +- net/rds/rds.h|3 +- net/rds/send.c | 44 ++- 4 files changed, 91 insertions(+), 8 deletions(-) diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h index e71d449..12e3bca 100644 --- a/include/uapi/linux/rds.h +++ b/include/uapi/linux/rds.h @@ -103,6 +103,7 @@ #define RDS_CMSG_MASKED_ATOMIC_FADD 8 #define RDS_CMSG_MASKED_ATOMIC_CSWP 9 #define RDS_CMSG_RXPATH_LATENCY 11 +#define RDS_CMSG_ZCOPY_COOKIE 12 s/RDS_CMSG_ZCOPY_COOKIE/RDS_CMSG_ZMSGCOPY_COOKIE #define RDS_INFO_FIRST1 #define RDS_INFO_COUNTERS 1 diff --git a/net/rds/message.c b/net/rds/message.c index d874b74..e499566 100644 --- a/net/rds/message.c +++ b/net/rds/message.c @@ -341,12 +341,14 @@ struct rds_message *rds_message_map_pages(unsigned long *page_addrs, unsigned in return rm; } -int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from) +int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from, + bool zcopy) { unsigned long to_copy, nbytes; unsigned long sg_off; struct scatterlist *sg; int ret = 0; + int length = iov_iter_count(from); rm->m_inc.i_hdr.h_len = cpu_to_be32(iov_iter_count(from)); @@ -356,6 +358,53 @@ int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from) sg = rm->data.op_sg; sg_off = 0; /* Dear gcc, sg->page will be null from kzalloc. */ + if (zcopy) { + int total_copied = 0; + struct sk_buff *skb; + + skb = alloc_skb(SO_EE_ORIGIN_MAX_ZCOOKIES * sizeof(u32), + GFP_KERNEL); This can sleep, so you might want to check whether you want to use the ATOMIC version here. + if (!skb) + return -ENOMEM; + rm->data.op_mmp_znotifier = RDS_ZCOPY_SKB(skb); + memset(rm->data.op_mmp_znotifier, 0, + sizeof(*rm->data.op_mmp_znotifier)); + if (mm_account_pinned_pages(&rm->data.op_mmp_znotifier->z_mmp, + length)) { + consume_skb(skb); + rm->data.op_mmp_znotifier = NULL; + return -ENOMEM; + } -ENOMEM is a new application-visible change, but probably the right one for this particular case. We just need to make sure the application can handle this error. diff --git a/net/rds/rds.h b/net/rds/rds.h index 6e8fc4c..dfdc9b3 100644 --- a/net/rds/rds.h +++ b/net/rds/rds.h @@ -784,7 +784,8 @@ void rds_for_each_conn_info(struct socket *sock, unsigned int len, /* message.c */ struct rds_message *rds_message_alloc(unsigned int nents, gfp_t gfp); struct scatterlist *rds_message_alloc_sgs(struct rds_message *rm, int nents); -int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from); +int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from, + bool zcopy); struct rds_message *rds_message_map_pages(unsigned long *page_addrs, unsigned int total_len); void rds_message_populate_header(struct rds_header *hdr, __be16 sport, __be16 dport, u64 seq); diff --git a/net/rds/send.c b/net/rds/send.c index 5ac0925..80171cf 100644 --- a/net/rds/send.c +++ b/net/rds/send.c [...] @@ -1087,8 +1112,15 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len) goto out; } + if (zcopy) { + if (rs->rs_transport->t_type != RDS_TRANS_TCP) { + ret = -EOPNOTSUPP; + goto out; + } +
Re: [PATCH V2 net-next 4/7] rds: support for zcopy completion notification
On 2/14/2018 2:28 AM, Sowmini Varadhan wrote: RDS removes a datagram (rds_message) from the retransmit queue when an ACK is received. The ACK indicates that the receiver has queued the RDS datagram, so that the sender can safely forget the datagram. When all references to the rds_message are quiesced, rds_message_purge is called to release resources used by the rds_message If the datagram to be removed had pinned pages set up, add an entry to the rs->rs_znotify_queue so that the notification will be sent up via rds_rm_zerocopy_callback() when the rds_message is eventually freed by rds_message_purge. rds_rm_zerocopy_callback() attempts to batch the number of cookies sent with each notification to a max of SO_EE_ORIGIN_MAX_ZCOOKIES. This is achieved by checking the tail skb in the sk_error_queue: if this has room for one more cookie, the cookie from the current notification is added; else a new skb is added to the sk_error_queue. Every invocation of rds_rm_zerocopy_callback() will trigger a ->sk_error_report to notify the application. Signed-off-by: Sowmini Varadhan --- v2: - make sure to always sock_put m_rs even if there is no znotifier. - major rewrite of notification, resulting in much simplification. include/uapi/linux/errqueue.h |2 + net/rds/af_rds.c |2 + net/rds/message.c | 83 +--- net/rds/rds.h | 14 +++ net/rds/recv.c|2 + 5 files changed, 96 insertions(+), 7 deletions(-) Generic comment: please update it where applicable in terms of variable names, notifiers, etc. RDS supports true zero copy already with the RDMA transport, so some of this code can easily get confused. So I suggest something like below.
s/zerocopy/zeromsgcopy s/zcopy/zmsgcopy s/zcookie/zmsgcpycookie s/znotifier/zmsgcpynotifier diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h index dc64cfa..28812ed 100644 --- a/include/uapi/linux/errqueue.h +++ b/include/uapi/linux/errqueue.h @@ -20,11 +20,13 @@ struct sock_extended_err { #define SO_EE_ORIGIN_ICMP6 3 #define SO_EE_ORIGIN_TXSTATUS 4 #define SO_EE_ORIGIN_ZEROCOPY 5 +#define SO_EE_ORIGIN_ZCOOKIE 6 #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS #define SO_EE_OFFENDER(ee) ((struct sockaddr*)((ee)+1)) #define SO_EE_CODE_ZEROCOPY_COPIED 1 +#define SO_EE_ORIGIN_MAX_ZCOOKIES 8 This error change might need to go through another subsystem tree. Maybe you can separate it and also copy "linux-...@vger.kernel.org" Otherwise the changes look fine to me. Regards, Santosh
Re: [PATCH V2 net-next 2/7] rds: hold a sock ref from rds_message to the rds_sock
On 2/14/2018 2:28 AM, Sowmini Varadhan wrote: The existing model holds a reference from the rds_sock to the rds_message, but the rds_message does not itself hold a sock_put() on the rds_sock. Instead the m_rs field in the rds_message is assigned when the message is queued on the sock, and nulled when the message is dequeued from the sock. We want to be able to notify userspace when the rds_message is actually freed (from rds_message_purge(), after the refcounts to the rds_message go to 0). At the time that rds_message_purge() is called, the message is no longer on the rds_sock retransmit queue. Thus the explicit reference for the m_rs is needed to send a notification that will signal to userspace that it is now safe to free/reuse any pages that may have been pinned down for zerocopy. This patch manages the m_rs assignment in the rds_message with the necessary refcount book-keeping. Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com> --- [...] @@ -756,9 +755,6 @@ void rds_send_drop_to(struct rds_sock *rs, struct sockaddr_in *dest) */ if (!test_and_clear_bit(RDS_MSG_ON_CONN, &rm->m_flags)) { spin_unlock_irqrestore(&cp->cp_lock, flags); - spin_lock_irqsave(&rm->m_rs_lock, flags); - rm->m_rs = NULL; - spin_unlock_irqrestore(&rm->m_rs_lock, flags); continue; } list_del_init(&rm->m_conn_item); This hunk was clearly wrong, so it is good that you got rid of it as well. The patch looks fine to me. Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: KASAN: use-after-free Read in rds_find_bound
On 2/14/2018 9:52 AM, Dmitry Vyukov wrote: On Wed, Feb 14, 2018 at 6:35 PM, Santosh Shilimkar <santosh.shilim...@oracle.com> wrote: Hi Santosh, What is that fix? You forgot to provide any link/reference. I also don't see any patches from you at around that date... Fix [1] was later not added since there was a still a race. Wanted to see if the issue re-appears after recent netns fix [2]. We will not see if the bug re-appears or not until this bug is closed. Please see this recent discussion about another rds bug: https://groups.google.com/d/msg/syzkaller-bugs/3XjmOzr5jRU/g7pXIsY1BgAJ In the current state syzbot will never report bugs in these functions again. OK. Can you close that one then in that case ? Anybody can do this: I see. #syz fix: rds: tcp: use rds_destroy_pending() to synchronize netns/module teardown and rds connection/workq management syzbot provides full self-service, see first email and in particular this: https://github.com/google/syzkaller/blob/master/docs/syzbot.md#communication-with-syzbot ok will have a look. Regards, Santosh
Re: KASAN: use-after-free Read in rds_find_bound
On 2/14/2018 9:14 AM, Dmitry Vyukov wrote: On Wed, Feb 14, 2018 at 5:53 PM, Santosh Shilimkar <santosh.shilim...@oracle.com> wrote: [...] Hi Santosh, What is that fix? You forgot to provide any link/reference. I also don't see any patches from you at around that date... Fix [1] was later not added since there was still a race. Wanted to see if the issue re-appears after the recent netns fix [2]. We will not see if the bug re-appears or not until this bug is closed. Please see this recent discussion about another rds bug: https://groups.google.com/d/msg/syzkaller-bugs/3XjmOzr5jRU/g7pXIsY1BgAJ In the current state syzbot will never report bugs in these functions again. OK. Can you close that one in that case?
Re: KASAN: use-after-free Read in rds_find_bound
On 2/13/2018 12:12 PM, Dmitry Vyukov wrote: On Sat, Dec 30, 2017 at 8:41 PM, santosh.shilim...@oracle.com wrote: On 12/30/17 1:17 AM, syzbot wrote: Hello, syzkaller hit the following crash on fba961ab29e5ffb055592442808bb0f7962e05da git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/master compiler: gcc (GCC) 7.1.1 20170620 .config is attached Raw console output is attached. Unfortunately, I don't have any reproducer for this bug yet. IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+93a5839deb3555374...@syzkaller.appspotmail.com Posted a fix [1] for the above issue. Didn't test it, but it looks straightforward. Hi Santosh, What is that fix? You forgot to provide any link/reference. I also don't see any patches from you at around that date... Fix [1] was later not added since there was still a race. Wanted to see if the issue re-appears after the recent netns fix [2]. Regards, Santosh [1] https://patchwork.kernel.org/patch/10137901/ [2] https://patchwork.ozlabs.org/patch/868902/
Re: [PATCH V2 net-next] rds: do not call ->conn_alloc with GFP_KERNEL
On 2/13/2018 9:05 AM, Sowmini Varadhan wrote: Commit ebeeb1ad9b8a ("rds: tcp: use rds_destroy_pending() to synchronize netns/module teardown and rds connection/workq management") adds an rcu read critical section to __rds_conn_create. The memory allocations in that critical section need to use GFP_ATOMIC to avoid sleeping. This patch was verified with the syzkaller reproducer. Fixes: ebeeb1ad9b8a ("rds: tcp: use rds_destroy_pending() to synchronize netns/module teardown and rds connection/workq management") Reported-by: syzbot+a0564419941aaae3f...@syzkaller.appspotmail.com Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com> --- Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH V2 net-next] rds: tcp: use rds_destroy_pending() to synchronize netns/module teardown and rds connection/workq management
On 2/3/2018 4:26 AM, Sowmini Varadhan wrote: An rds_connection can get added during netns deletion between lines 528 and 529 of 506 static void rds_tcp_kill_sock(struct net *net) : /* code to pull out all the rds_connections that should be destroyed */ : 528 spin_unlock_irq(&rds_tcp_conn_lock); 529 list_for_each_entry_safe(tc, _tc, &tmp_list, t_tcp_node) 530 rds_conn_destroy(tc->t_cpath->cp_conn); Such an rds_connection would miss out the rds_conn_destroy() loop (that cancels all pending work) and (if it was scheduled after netns deletion) could trigger the use-after-free. A similar race-window exists for the module unload path in rds_tcp_exit -> rds_tcp_destroy_conns Concurrency with netns deletion (rds_tcp_kill_sock()) must be handled by checking check_net() before enqueuing new work or adding new connections. Concurrency with module-unload is handled by maintaining a module specific flag that is set at the start of the module exit function, and must be checked before enqueuing new work or adding new connections. This commit refactors existing RDS_DESTROY_PENDING checks added by commit 3db6e0d172c9 ("rds: use RCU to synchronize work-enqueue with connection teardown") and consolidates all the concurrency checks listed above into the function rds_destroy_pending(). Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com> --- v2: use check_net() for the netns delete case, as recommended on the list. refactor RDS_DESTROY_PENDING checks and consolidate into rds_destroy_pending() Thanks for the update. It looks in line with the off-list chat. Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: KASAN: stack-out-of-bounds Read in rds_sendmsg
On 1/30/2018 6:16 PM, Eric Biggers wrote: On Thu, Dec 21, 2017 at 08:44:32AM -0800, Santosh Shilimkar wrote: +Avinash On 12/21/2017 1:10 AM, syzbot wrote: syzkaller has found reproducer for the following crash on [..] audit: type=1400 audit(1513847224.110:7): avc: denied { map } for pid=3157 comm="syzkaller455006" path="/root/syzkaller455006870" dev="sda1" ino=16481 scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023 tcontext=unconfined_u:object_r:user_home_t:s0 tclass=file permissive=1 == BUG: KASAN: stack-out-of-bounds in rds_rdma_bytes net/rds/send.c:1013 [inline] Could you please post the discussed fix if you are ready with it? This new report is the same as the last one, and the cmsg length check should address it. [...] This crash seems to have stopped occurring. I assume it was fixed by commit 14e138a86f63 (thanks Avinash!), so let's tell syzbot so that it can start reporting crashes in the same place again: #syz fix: RDS: Check cmsg_len before dereferencing CMSG_DATA Thanks Eric for confirmation !! Regards, Santosh
Re: [PATCH net-next 0/7] RDS zerocopy support
Hi Sowmini, On 1/24/2018 3:45 AM, Sowmini Varadhan wrote: This patch series follows up on the RFC and subsequent review comments at https://patchwork.ozlabs.org/cover/862248/ Review comments addressed are - drop MSG_PEEK change for sk_error_queue - (patch4) batch of SO_EE_ORIGIN_MAX_ZCOOKIES (#defined to 8) is sent up as part of the data in the error notification. The ancillary data with this notification specifies the number of cookies in ee_data, with ee_origin set to SO_EE_ORIGIN_ZCOOKIE - (patch4, patch5) allocate the skb to be used for error notification up-front (in rds_sendmsg()) so that we never have to fail due to skb allocation failure in the callback routine. - other minor review fixes around refactoring code for the setsockopt of ZEROCOPY, use of iov_iter_npages() etc. This patch series also updates selftests/net/msg_zerocopy.c to support PF_RDS sockets (both with and without zerocopy) The RDS changes look largely good, but I need some time to look at the completion notification and send side changes. Will try to provide feedback in the next few days. regards, Santosh
Re: [PATCH net-next] rds: tcp: per-netns flag to stop new connection creation when rds-tcp is being dismantled
On 1/24/2018 1:03 PM, Sowmini Varadhan wrote: An rds_connection can get added during netns deletion between lines 528 and 529 of 506 static void rds_tcp_kill_sock(struct net *net) : /* code to pull out all the rds_connections that should be destroyed */ : 528 spin_unlock_irq(&rds_tcp_conn_lock); 529 list_for_each_entry_safe(tc, _tc, &tmp_list, t_tcp_node) 530 rds_conn_destroy(tc->t_cpath->cp_conn); Such an rds_connection would miss out the rds_conn_destroy() loop (that cancels all pending work) and (if it was scheduled after netns deletion) could trigger the use-after-free. A similar race-window exists for the module unload path in rds_tcp_exit -> rds_tcp_destroy_conns To avoid the addition of new rds_connections during kill_sock or netns_delete, this patch introduces a per-netns flag, RTN_DELETE_PENDING, that will cause RDS connection creation to fail. RCU is used to make sure that we wait for the critical section of __rds_conn_create threads (that may have started before the setting of RTN_DELETE_PENDING) to complete before starting the connection destruction. Reported-by: syzbot+bbd8e9a06452cc480...@syzkaller.appspotmail.com Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com> --- net/rds/connection.c |3 ++ net/rds/tcp.c| 82 - net/rds/tcp.h|1 + 3 files changed, 57 insertions(+), 29 deletions(-) FWIW, Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com> Just for the archives, summarizing the off-list discussion. Netns destroy now makes use of conn_destroy, which in the past was used only for module unload, and that path is racy. It's not possible to make it race-free with flags alone; it needs an RCU-sync kind of mechanism. RDS being sensitive to brownouts on reconnects, RCU usage has been minimised. Netns delete is expected to be a non-frequent operation, and hence the usage of RCU as done in this patch is probably ok. If needed it will be revisited in the future for optimization. regards, Santosh
Re: [PATCH] RDS: Fix rds-ping inducing kernel panic
On 1/22/2018 2:17 PM, Kees Cook wrote: On Tue, Jan 23, 2018 at 4:01 AM, Santosh Shilimkar <santosh.shilim...@oracle.com> wrote: On 1/22/2018 3:24 AM, Kees Cook wrote: As described in: https://bugzilla.redhat.com/show_bug.cgi?id=822754 Attempting an RDS connection from the IP address of an IPoIB interface to itself causes a kernel panic due to a BUG_ON() being triggered. Making the test less strict allows rds-ping to work without crashing the machine. A local unprivileged user could use this flaw to crash the system. Are you able to reproduce this issue on a mainline kernel? IIRC, this shouldn't happen anymore, but if you see it, please let me know. Will try it as well. rds-ping on the self loopback device is often tested, and is also used for monitoring services in production. I don't have an RDS test setup, no. But it sounds like kernels without this patch aren't seeing the problem. Yep. That's what I thought and hence asked. I am not sure if it's applicable anymore. In fact the issue with the loopback device was due to a congestion update, and that's already been addressed with commit '18fc25c94: {rds: prevent BUG_ON triggered on congestion update to loopback}' That looks very much like it was fixed there. Thanks! Yeah. Thanks Kees !! Regards, Santosh
Re: [PATCH] RDS: Fix rds-ping inducing kernel panic
On 1/22/2018 7:47 AM, David Miller wrote: From: Leon Romanovsky Date: Mon, 22 Jan 2018 17:10:54 +0200 On Mon, Jan 22, 2018 at 03:24:15AM -0800, Kees Cook wrote: diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c index 8557a1cae041..5fbf635d17cb 100644 --- a/net/rds/ib_send.c +++ b/net/rds/ib_send.c @@ -506,7 +506,7 @@ int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm, int flow_controlled = 0; int nr_sig = 0; - BUG_ON(off % RDS_FRAG_SIZE); + BUG_ON(!conn->c_loopback && off % RDS_FRAG_SIZE); BUG_ON(hdr_off != 0 && hdr_off != sizeof(struct rds_header)); To be honest this function is full of BUG_ONs, and it looks fishy to have them there. Why don't we return EINVAL instead of crashing the system? I completely agree that these assertions should just cause an error-out rather than trigger a BUG(). Andy did remove a bunch of them, but there are still a few leftovers. Will have a look at the remainder, since most of them were added during early development and remained there. Thanks Dave/Leon. Regards, Santosh
Re: [PATCH] RDS: Fix rds-ping inducing kernel panic
On 1/22/2018 3:24 AM, Kees Cook wrote: As described in: https://bugzilla.redhat.com/show_bug.cgi?id=822754 Attempting an RDS connection from the IP address of an IPoIB interface to itself causes a kernel panic due to a BUG_ON() being triggered. Making the test less strict allows rds-ping to work without crashing the machine. A local unprivileged user could use this flaw to crash the system. Are you able to reproduce this issue on a mainline kernel? IIRC, this shouldn't happen anymore, but if you see it, please let me know. Will try it as well. rds-ping on the self loopback device is often tested, and is also used for monitoring services in production. I think this fix was written by Jay Fenlason, and extracted from the RedHat kernel patches here: https://oss.oracle.com/git/gitweb.cgi?p=redpatch.git;a=commitdiff;h=c7b6a0a1d8d636852be130fa15fa8be10d4704e8 It was part of the Red Hat patched kernel but not carried in shipping Oracle UEK kernels, at least afaik. This fix appears to have been carried by at least RedHat, Oracle, and Ubuntu for several years. CVE-2012-2372 Reported-by: Honggang Li Cc: sta...@vger.kernel.org Signed-off-by: Kees Cook --- This is what I get for researching CVE lifetimes... I am not sure if it's applicable anymore. In fact the issue with the loopback device was due to a congestion update, and that's already been addressed with commit '18fc25c94: {rds: prevent BUG_ON triggered on congestion update to loopback}' Regards, Santosh
Re: [PATCH] RDS: null pointer dereference in rds_atomic_free_op
On 1/3/2018 1:06 PM, simo.ghan...@gmail.com wrote: From: Mohamed Ghannam <simo.ghan...@gmail.com> set rm->atomic.op_active to 0 when rds_pin_pages() fails or the user supplied address is invalid, this prevents a NULL pointer usage in rds_atomic_free_op() Signed-off-by: Mohamed Ghannam <simo.ghan...@gmail.com> --- Good catch !! Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: WARNING in rds_cmsg_rdma_args
On 1/3/2018 10:51 AM, Mohamed Ghannam wrote: The fix : https://patchwork.ozlabs.org/patch/854723/ Cool. Thanks Mohamed for following it up. Regards, Santosh
Re: WARNING in rds_cmsg_rdma_args
On 1/3/2018 12:58 AM, syzbot wrote: Hello, syzkaller hit the following crash on ad036b63ee57df9ab802a4eb20cbbbec66aa4520 git://git.cmpxchg.org/linux-mmots.git/master compiler: gcc (GCC) 7.1.1 20170620 .config is attached Raw console output is attached. C reproducer is attached syzkaller reproducer is attached. See for information about syzkaller reproducers IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+ef175b5825285531e...@syzkaller.appspotmail.com It will help syzbot understand when the bug is fixed. See footer for details. If you forward the report, please keep this part and the footer. audit: type=1400 audit(1514947982.760:7): avc: denied { map } for pid=3468 comm="syzkaller284818" path="/root/syzkaller284818499" dev="sda1" ino=16481 scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023 tcontext=unconfined_u:object_r:user_home_t:s0 tclass=file permissive=1 WARNING: CPU: 1 PID: 3468 at net/rds/rdma.c:617 rds_cmsg_rdma_args+0xe96/0x1360 net/rds/rdma.c:617 Kernel panic - not syncing: panic_on_warn set ... +Mohamed Ghannam, who was discussing a similar issue and was going to post a fix for it. Checking args->nr_local against 0 upfront would address this issue as well. Regards, Santosh
[PATCH] rds: fix use-after-free read in rds_find_bound
The socket can get freed as part of the sock_close() callback, so check the validity of the underlying socket before taking a reference. Reported-by: syzbot+93a5839deb3555374...@syzkaller.appspotmail.com Signed-off-by: Santosh Shilimkar <santosh.shilim...@oracle.com> --- net/rds/bind.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/rds/bind.c b/net/rds/bind.c index 75d43dc..8dec06e 100644 --- a/net/rds/bind.c +++ b/net/rds/bind.c @@ -61,7 +61,7 @@ struct rds_sock *rds_find_bound(__be32 addr, __be16 port) struct rds_sock *rs; rs = rhashtable_lookup_fast(&bind_hash_table, &key, ht_parms); - if (rs && !sock_flag(rds_rs_to_sk(rs), SOCK_DEAD)) + if (rs && rds_rs_to_sk(rs) && !sock_flag(rds_rs_to_sk(rs), SOCK_DEAD)) rds_sock_addref(rs); else rs = NULL; -- 1.9.1
Re: [PATCH V2 net-next 3/3] rds: tcp: cleanup if kmem_cache_alloc fails in rds_tcp_conn_alloc()
On 12/22/2017 9:39 AM, Sowmini Varadhan wrote: If kmem_cache_alloc() fails in the middle of the for() loop, cleanup anything that might have been allocated so far. Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com> --- v2: target net-next, not net Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH V2 net-next 2/3] rds: tcp: initialize t_tcp_detached to false
On 12/22/2017 9:39 AM, Sowmini Varadhan wrote: Commit f10b4cff98c6 ("rds: tcp: atomically purge entries from rds_tcp_conn_list during netns delete") adds the field t_tcp_detached, but this needs to be initialized explicitly to false. Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com> --- v2: target net-next, not net Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH V2 net-next 1/3] rds: Reset rs->rs_bound_addr in rds_add_bound() failure path
On 12/22/2017 9:38 AM, Sowmini Varadhan wrote: If the rds_sock is not added to the bind_hash_table, we must reset rs_bound_addr so that rds_remove_bound will not trip on this rds_sock. rds_add_bound() does a rds_sock_put() in this failure path, so failing to reset rs_bound_addr will result in a socket refcount bug, and will trigger a WARN_ON with the stack shown below when the application subsequently tries to close the PF_RDS socket. WARNING: CPU: 20 PID: 19499 at net/rds/af_rds.c:496 \ rds_sock_destruct+0x15/0x30 [rds] : __sk_destruct+0x21/0x190 rds_remove_bound.part.13+0xb6/0x140 [rds] rds_release+0x71/0x120 [rds] sock_release+0x1a/0x70 sock_close+0xe/0x20 __fput+0xd5/0x210 task_work_run+0x82/0xa0 do_exit+0x2ce/0xb30 ? syscall_trace_enter+0x1cc/0x2b0 do_group_exit+0x39/0xa0 SyS_exit_group+0x10/0x10 do_syscall_64+0x61/0x1a0 Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com> --- v2: target net-next, not net Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: KASAN: stack-out-of-bounds Read in rds_sendmsg
+Avinash On 12/21/2017 1:10 AM, syzbot wrote: syzkaller has found reproducer for the following crash on [..] audit: type=1400 audit(1513847224.110:7): avc: denied { map } for pid=3157 comm="syzkaller455006" path="/root/syzkaller455006870" dev="sda1" ino=16481 scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023 tcontext=unconfined_u:object_r:user_home_t:s0 tclass=file permissive=1 == BUG: KASAN: stack-out-of-bounds in rds_rdma_bytes net/rds/send.c:1013 [inline] Could you please post the discussed fix if you are ready with it? This new report is the same as the last one, and the cmsg length check should address it. Regards, Santosh
Re: [PATCH v2 4/5] rds: Add runchecks.cfg for net/rds
On 12/16/2017 6:02 PM, Knut Omang wrote: On Sat, 2017-12-16 at 12:00 -0800, santosh.shilim...@oracle.com wrote: On 12/16/17 10:24 AM, Joe Perches wrote: [...] Most of these existing messages from checkpatch should probably be inspected and corrected where possible to minimize the style differences between this subsystem and the rest of the kernel. For instance, here's a trivial patch to substitute pr_ for printks and a couple braces next to these substitutions. Thanks Joe. I actually had a similar patch a while back, but since it was a lot of churn and the code was already merged, I never submitted it and then later forgot about it. Will look into it. Please look at my set here first - I have already spent considerable time cleaning up stuff while working on this: Just closing the loop. As discussed, I can use your patches without any new tool dependency, since the existing checkpatch.pl already gives those warnings. I started picking up Joe's patch, but since you have changes, I can use them instead once you untie them from runcheck. Regarding the $subject, just re-iterating that I don't want any custom script for RDS and want to just follow the generic guidelines followed by netdev for all net/* code. Regards, Santosh
Re: BUG: unable to handle kernel NULL pointer dereference in rds_send_xmit
On 12/18/2017 9:12 AM, David Miller wrote: From: Santosh Shilimkar <santosh.shilim...@oracle.com> Date: Mon, 18 Dec 2017 08:28:05 -0800 On 12/18/2017 12:43 AM, syzbot wrote: Hello, syzkaller hit the following crash on 6084b576dca2e898f5c101baef151f7bfdbb606d git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/master compiler: gcc (GCC) 7.1.1 20170620 .config is attached Raw console output is attached. Unfortunately, I don't have any reproducer for this bug yet. BUG: unable to handle kernel NULL pointer dereference at 0028 program syz-executor6 is using a deprecated SCSI ioctl, please convert it to SG_IO IP: rds_send_xmit+0x80/0x930 net/rds/send.c:186 Looks like another one tripping on an empty transport. Mostly the change below should address it, but we will test to confirm it does. diff --git a/net/rds/send.c b/net/rds/send.c index 7244d2e..e2d0eaa 100644 --- a/net/rds/send.c +++ b/net/rds/send.c @@ -183,7 +183,7 @@ int rds_send_xmit(struct rds_conn_path *cp) goto out; } - if (conn->c_trans->xmit_path_prepare) + if (conn->c_trans && conn->c_trans->xmit_path_prepare) conn->c_trans->xmit_path_prepare(cp); We seem to be accumulating a lot of checks like this, maybe there is a more general way to deal with this problem? Agree. Some of these additional transport hooks got added later to the specific transports which need them. Will review this overall and see if it can be addressed generically. Regards, Santosh
Re: BUG: spinlock bad magic (2)
On 12/18/2017 4:36 AM, syzbot wrote: Hello, syzkaller hit the following crash on 6084b576dca2e898f5c101baef151f7bfdbb606d git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/master compiler: gcc (GCC) 7.1.1 20170620 .config is attached Raw console output is attached. Unfortunately, I don't have any reproducer for this bug yet. [...] BUG: unable to handle kernel NULL pointer dereference at 0028 IP: rds_send_xmit+0x80/0x930 net/rds/send.c:186 This one seems to be the same bug as reported below. BUG: unable to handle kernel NULL pointer dereference in rds_send_xmit
Re: BUG: unable to handle kernel NULL pointer dereference in rds_send_xmit
On 12/18/2017 12:43 AM, syzbot wrote: Hello, syzkaller hit the following crash on 6084b576dca2e898f5c101baef151f7bfdbb606d git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/master compiler: gcc (GCC) 7.1.1 20170620 .config is attached Raw console output is attached. Unfortunately, I don't have any reproducer for this bug yet. BUG: unable to handle kernel NULL pointer dereference at 0028 program syz-executor6 is using a deprecated SCSI ioctl, please convert it to SG_IO IP: rds_send_xmit+0x80/0x930 net/rds/send.c:186 Looks like another one tripping on an empty transport. Mostly the change below should address it, but we will test to confirm it does. diff --git a/net/rds/send.c b/net/rds/send.c index 7244d2e..e2d0eaa 100644 --- a/net/rds/send.c +++ b/net/rds/send.c @@ -183,7 +183,7 @@ int rds_send_xmit(struct rds_conn_path *cp) goto out; } - if (conn->c_trans->xmit_path_prepare) + if (conn->c_trans && conn->c_trans->xmit_path_prepare) conn->c_trans->xmit_path_prepare(cp);
Re: [PATCH net-next] rds: debug: fix null check on static array
On 12/6/2017 3:32 AM, Sowmini Varadhan wrote: On (12/06/17 10:47), Prashant Bhole wrote: t_name cannot be NULL since it is an array field of a struct. Replacing null check on static array with string length check using strnlen() t_name is always initialized for all rds transports today, and would be all zeros unless someone stomped over that memory (in which case there could be more serious issues than a busted debug string) but it's good to be safe and check I suppose. Thanks Sowmini !! Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH net] rds: Fix NULL pointer dereference in __rds_rdma_map
On 12/6/2017 8:18 AM, Håkon Bugge wrote: This is a fix for syzkaller719569, where memory registration was attempted without any underlying transport being loaded. Analysis of the case reveals that it is the setsockopt() RDS_GET_MR (2) and RDS_GET_MR_FOR_DEST (7) that are vulnerable. Here is an example stack trace when the bug is hit: BUG: unable to handle kernel NULL pointer dereference at 00c0 IP: __rds_rdma_map+0x36/0x440 [rds] Oops: [#1] SMP [...] Call Trace: rds_get_mr+0x56/0x80 [rds] rds_setsockopt+0x172/0x340 [rds] ? __fget_light+0x25/0x60 ? __fdget+0x13/0x20 SyS_setsockopt+0x80/0xe0 do_syscall_64+0x67/0x1b0 entry_SYSCALL64_slow_path+0x25/0x25 [...] The fix is to check the existence of an underlying transport in __rds_rdma_map(). Signed-off-by: Håkon Bugge <haakon.bu...@oracle.com> Reported-by: syzbot <syzkal...@googlegroups.com> --- Thanks Haakon for testing and posting out the discussed change. Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH net-next 3/3] rds: tcp: atomically purge entries from rds_tcp_conn_list during netns delete
On 11/30/2017 11:11 AM, Sowmini Varadhan wrote: The rds_tcp_kill_sock() function parses the rds_tcp_conn_list to find the rds_connection entries marked for deletion as part of the netns deletion under the protection of the rds_tcp_conn_lock. Since the rds_tcp_conn_list tracks rds_tcp_connections (which have a 1:1 mapping with rds_conn_path), multiple tc entries in the rds_tcp_conn_list will map to a single rds_connection, and will be deleted as part of the rds_conn_destroy() operation that is done outside the rds_tcp_conn_lock. The rds_tcp_conn_list traversal done under the protection of rds_tcp_conn_lock should not leave any doomed tc entries in the list after the rds_tcp_conn_lock is released, else another concurrently executing netns delete (for a different netns) thread may trip on these entries. Reported-by: syzbot <syzkal...@googlegroups.com> Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com> --- Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH net-next 2/3] rds: tcp: correctly sequence cleanup on netns deletion.
On 11/30/2017 11:11 AM, Sowmini Varadhan wrote: Commit 8edc3affc077 ("rds: tcp: Take explicit refcounts on struct net") introduces a regression in rds-tcp netns cleanup. The cleanup_net(), (and thus rds_tcp_dev_event notification) is only called from put_net() when all netns refcounts go to 0, but this cannot happen if the rds_connection itself is holding a c_net ref that it expects to release in rds_tcp_kill_sock. Instead, the rds_tcp_kill_sock callback should make sure to tear down state carefully, ensuring that the socket teardown is only done after all data-structures and workqs that depend on it are quiesced. The original motivation for commit 8edc3affc077 ("rds: tcp: Take explicit refcounts on struct net") was to resolve a race condition reported by syzkaller where workqs for tx/rx/connect were triggered after the namespace was deleted. Those worker threads should have been cancelled/flushed before socket tear-down and indeed, rds_conn_path_destroy() does try to sequence this by doing /* cancel cp_send_w */ /* cancel cp_recv_w */ /* flush cp_down_w */ /* free data structures */ Here the "flush cp_down_w" will trigger rds_conn_shutdown and thus invoke rds_tcp_conn_path_shutdown() to close the tcp socket, so that we ought to have satisfied the requirement that "socket-close is done after all other dependent state is quiesced". However, rds_conn_shutdown has a bug in that it *always* triggers the reconnect workq (and if connection is successful, we always restart tx/rx workqs so with the right timing, we risk the race conditions reported by syzkaller). Netns deletion is like module teardown- no need to restart a reconnect in this case. We can use the c_destroy_in_prog bit to avoid restarting the reconnect. 
Fixes: 8edc3affc077 ("rds: tcp: Take explicit refcounts on struct net") Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com> --- net/rds/connection.c |3 ++- net/rds/rds.h|6 +++--- net/rds/tcp.c|4 ++-- 3 files changed, 7 insertions(+), 6 deletions(-) diff --git a/net/rds/connection.c b/net/rds/connection.c index 7ee2d5d..9efc82c 100644 --- a/net/rds/connection.c +++ b/net/rds/connection.c @@ -366,6 +366,8 @@ void rds_conn_shutdown(struct rds_conn_path *cp) * to the conn hash, so we never trigger a reconnect on this * conn - the reconnect is always triggered by the active peer. */ cancel_delayed_work_sync(&cp->cp_conn_w); + if (conn->c_destroy_in_prog) + return; Not related to this patch, but it would be safer to use cp_flags or, if needed, add a flag at the conn level for the bundle, and use bit-wise ops to avoid possible races when setting c_destroy_in_prog. Something similar to RDS_DESTROY_PENDING etc. The patch itself looks good to me in terms of netns ref counting. Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH net-next 1/3] rds: tcp: remove redundant function rds_tcp_conn_paths_destroy()
On 11/30/2017 11:11 AM, Sowmini Varadhan wrote: A side-effect of Commit c14b0366813a ("rds: tcp: set linger to 1 when unloading a rds-tcp") is that we always send a RST on the tcp connection for rds_conn_destroy(), so rds_tcp_conn_paths_destroy() is not needed any more and is removed in this patch. Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com> --- Looks good. Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: general protection fault in __rds_rdma_map
On 11/27/2017 10:30 AM, syzbot wrote: Hello, syzkaller hit the following crash on e1d1ea549b57790a3d8cf6300e6ef86118d692a3 git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master compiler: gcc (GCC) 7.1.1 20170620 .config is attached Raw console output is attached. C reproducer is attached syzkaller reproducer is attached. See https://goo.gl/kgGztJ for information about syzkaller reproducers kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: [#1] SMP KASAN RDS: rds_bind could not find a transport for 224.0.0.2, load rds_tcp or rds_rdma? Seems like the RDMA operation got triggered on a non-RDMA transport, leading to a non-populated rs->rs_transport->get_mr(). Also seems like the test was running in a namespace, and the RDMA transport doesn't yet support that. Thanks for reporting. Will look into a fix internally. Regards, Santosh
Re: [lldp-devel] Fwd: [PATCH RESEND] Fix segfault on "lldptool -t -i eth2 -V PFC -c enabled"
On 11/8/2017 1:31 AM, Valentin Vidic wrote: On Wed, Nov 08, 2017 at 04:24:48AM -0500, Sowmini Varadhan wrote: I just tried setting up a git pull-request for this to Valentin's repo, I'm not sure if I followed procedures correctly (sending text patches to a list comes more naturally to me, and I may have fat-fingered something) To whom should I send this patch? We also have a couple of other patches in the pipeline that we are testing, so setting up a mailing list would be welcomed! Right, but my repo is just for Debian packaging :) We would need a SUSE repo or email address where to send patches - I also have a few small fixes for lldpad and fcoe waiting. I also heard the Mellanox folks have a few patches, so looping them in as well. IMO email alone will not be enough, and a public repo is needed so that the latest sources can be available for people developing/using it. Hannes, you mentioned that you are taking fixes. Is the repo where you are applying these fixes public? Sorry if I missed that part. Regards, Santosh
Re: [PATCH net] rds: ib: Fix NULL pointer dereference in debug code
_DEBUG -DDEBUG" and inserting an artificial delay between the rdsdebug() and ib_post_recv() statements:

	/* XXX when can this fail? */
	ret = ib_post_recv(ic->i_cm_id->qp, &recv->r_wr, &failed_wr);
+	if (can_wait)
+		usleep_range(1000, 5000);
	rdsdebug("recv %p ibinc %p page %p addr %lu ret %d\n", recv,
		 recv->r_ibinc, sg_page(&recv->r_frag->f_sg),
		 (long) ib_sg_dma_address(

The fix is simply to move the rdsdebug() statement up before the ib_post_recv() and remove the printing of ret, which is taken care of anyway by the non-debug code. Signed-off-by: Håkon Bugge <haakon.bu...@oracle.com> Reviewed-by: Knut Omang <knut.om...@oracle.com> Reviewed-by: Wei Lin Guay <wei.lin.g...@oracle.com> --- net/rds/ib_recv.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH] rds: Fix inaccurate accounting of unsignaled wrs
On 10/24/2017 9:15 AM, Håkon Bugge wrote: On 24 Oct 2017, at 18:05, Santosh Shilimkar <santosh.shilim...@oracle.com <mailto:santosh.shilim...@oracle.com>> wrote: [...] Instead of partially doing changes inside/outside the helper, you could also add an inline helper for the solicited state, like rds_ib_set_wr_solicited_state(), and use that along with this change. Why? There is no book-keeping associated with setting send-solicited. It is set on the last fragment of a message and on the last fragment sent before throttling due to flow-control. Creating a function to perform: FOO |= BAR; seems like overkill to me. It's just an inline helper, and it keeps the code consistent for flag manipulation. The compiler output will be the same as "FOO |= BAR;" :-) That being said, in my opinion the fragments of a (large) send should be scattered with send-solicited. But that is another commit. But with such a commit, I agree with you, a helper function is required. We already talked about it, so let's leave it there. Regards, Santosh
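For what it's worth, the helper under discussion really would be a one-liner. A sketch of the suggested rds_ib_set_wr_solicited_state(); the name, struct, and flag value below are placeholders for illustration, not the kernel's definitions:

```c
/* Illustrative only: a trivial inline helper that keeps flag
 * manipulation behind one name, as suggested in the thread above.
 * The real code would set IB_SEND_SOLICITED on the WR's send_flags. */
#define IB_SEND_SOLICITED_SKETCH (1u << 1)	/* placeholder bit */

struct send_wr_sketch {
	unsigned int send_flags;
};

static inline void set_wr_solicited_state(struct send_wr_sketch *wr)
{
	wr->send_flags |= IB_SEND_SOLICITED_SKETCH;	/* FOO |= BAR */
}
```

As Santosh notes, the generated code is identical to writing the OR inline; the helper only buys naming consistency.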
Re: [PATCH] rds: Fix inaccurate accounting of unsignaled wrs
On 10/24/2017 7:16 AM, Håkon Bugge wrote: The number of unsignaled work-requests posted to the IB send queue is tracked by a counter in the rds_ib_connection struct. When it reaches zero, or the caller explicitly asks for it, the send-signaled bit is set in send_flags and the counter is reset. This is performed by the rds_ib_set_wr_signal_state() function. However, this function is not always used, which yields inaccurate accounting. This commit fixes this, re-factors some related code bloat, and makes the actual parameter type to the function consistent. Signed-off-by: Håkon Bugge --- Instead of partially doing changes inside/outside the helper, you could also add an inline helper for the solicited state, like rds_ib_set_wr_solicited_state(), and use that along with this change. Regards, Santosh
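The accounting the patch centralizes can be sketched as follows. This is a hedged user-space approximation, not the kernel's rds_ib_set_wr_signal_state() verbatim: the struct, flag value, and the exact decrement/short-circuit behavior are simplified for illustration.

```c
#define IB_SEND_SIGNALED_SKETCH (1u << 0)	/* placeholder bit */

/* Sketch of the pattern: count down a budget of unsignaled WRs and
 * force the signaled bit when the budget is exhausted, or when the
 * caller explicitly wants a completion, then reset the budget. */
struct ib_conn_sketch {
	unsigned int i_unsignaled_wrs;
};

static int set_wr_signal_state(struct ib_conn_sketch *ic,
			       unsigned int *send_flags, int notify,
			       unsigned int max_unsig_wrs)
{
	if (notify || --ic->i_unsignaled_wrs == 0) {
		ic->i_unsignaled_wrs = max_unsig_wrs;
		*send_flags |= IB_SEND_SIGNALED_SKETCH;
		return 1;	/* this WR will generate a completion */
	}
	return 0;
}
```

The bug the commit fixes is exactly what happens when some call sites set the signaled bit by hand instead of going through one helper: the counter drifts and completions are batched incorrectly.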
Re: [PATCH] rds: Fix uninitialized variable
$subject s/rds:/rds: ib: On 10/24/2017 8:02 AM, Håkon Bugge wrote: send_flags needs to be initialized before calling rds_ib_set_wr_signal_state(). Signed-off-by: Håkon Bugge <haakon.bu...@oracle.com> --- Looks fine otherwise. Please re-post with subject fixed. Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH net] RDS: IB: Initialize max_items based on underlying device attributes
Hi Avinash, On 10/3/2017 5:50 PM, Avinash Repaka wrote: Use max_1m_mrs/max_8k_mrs while setting max_items, as the former variables are set based on the underlying device attricutes. s/attricutes/attributes Signed-off-by: Avinash Repaka <avinash.rep...@oracle.com> --- net/rds/ib_rdma.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) This patch is fine, but it would be nice if you could collect these changes into a series since you are updating the FRWR support, like this patch and the other cleanup patch you posted yesterday. With that log fixed, Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH net] RDS: IB: Limit the scope of has_fr/has_fmr variables
Hi Dave, On 10/2/2017 1:30 PM, Santosh Shilimkar wrote: On 10/1/2017 10:56 PM, David Miller wrote: From: David Miller <da...@davemloft.net> [...] Actually, reverted, this breaks the build. net/rds/rdma_transport.c:38:10: fatal error: ib.h: No such file or directory #include "ib.h" Although I can't see how in the world this patch is causing such an error. Weird indeed. Will sort this out with Avinash. Thanks Dave. I tried a few build combinations on net-next but couldn't reproduce the build failure. AFAIU, the build error can only happen if for some reason the ib.h file got deleted accidentally. I did delete the ib.h file and saw the error below: CC [M] net/rds/rdma_transport.o net/rds/rdma_transport.c:38:16: error: ib.h: No such file or directory Could it be that the ib.h file got deleted or mangled while applying the $subject patch? Regards, Santosh
Re: [PATCH net] RDS: IB: Limit the scope of has_fr/has_fmr variables
On 10/1/2017 10:56 PM, David Miller wrote: From: David Miller Date: Sun, 01 Oct 2017 22:54:19 -0700 (PDT) From: Avinash Repaka Date: Fri, 29 Sep 2017 18:13:50 -0700 This patch fixes the scope of has_fr and has_fmr variables as they are needed only in rds_ib_add_one(). Signed-off-by: Avinash Repaka Applied. Actually, reverted, this breaks the build. net/rds/rdma_transport.c:38:10: fatal error: ib.h: No such file or directory #include "ib.h" Although I can't see how in the world this patch is causing such an error. Weird indeed. Will sort this out with Avinash. Thanks Dave. Regards, Santosh
Re: [PATCH net] RDS: IB: Limit the scope of has_fr/has_fmr variables
On 9/29/2017 6:13 PM, Avinash Repaka wrote: This patch fixes the scope of has_fr and has_fmr variables as they are needed only in rds_ib_add_one(). Signed-off-by: Avinash Repaka <avinash.rep...@oracle.com> --- Indeed the final merge version actually didn't need those across files. Change looks good to me. Thanks !! Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH net v2] rds: Fix incorrect statistics counting
On 9/6/2017 9:35 AM, Håkon Bugge wrote: In rds_send_xmit() there is logic to batch the sends. However, if another thread has acquired the lock and has incremented the send_gen, it is considered a race and we yield. The code incrementing the s_send_lock_queue_raced statistics counter did not count this event correctly. This commit counts the race condition correctly. Changes from v1: - Removed check for *someone_on_xmit()* - Fixed incorrect indentation Signed-off-by: Håkon Bugge <haakon.bu...@oracle.com> Reviewed-by: Knut Omang <knut.om...@oracle.com> --- Thanks for the update. Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
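The race being counted can be sketched with C11 atomics. This is a user-space approximation of the pattern (the names in_xmit, s_send_lock_queue_raced_sketch, and send_xmit_sketch are illustrative stand-ins, not the kernel's symbols): one thread owns the transmit path via a test-and-set flag, and a thread that loses that race yields after bumping the raced counter exactly once.

```c
#include <stdatomic.h>

/* Sketch of acquire_in_xmit()-style ownership plus correct raced
 * accounting: the losing thread increments the counter once and
 * yields; the winner transmits and releases the flag. */
static atomic_flag in_xmit = ATOMIC_FLAG_INIT;
static atomic_uint s_send_lock_queue_raced_sketch;

static int try_acquire_in_xmit(void)
{
	/* atomic_flag_test_and_set() returns the previous value;
	 * a set previous value means another thread holds the path. */
	return !atomic_flag_test_and_set(&in_xmit);
}

static void release_in_xmit(void)
{
	atomic_flag_clear(&in_xmit);
}

static void send_xmit_sketch(void)
{
	if (!try_acquire_in_xmit()) {
		/* raced with another sender: count it exactly once */
		atomic_fetch_add(&s_send_lock_queue_raced_sketch, 1);
		return;
	}
	/* ... batch and transmit queued messages ... */
	release_in_xmit();
}
```

The v1 thread's concern applies here too: the counter read is only statistical, so nothing in the hot path should pay extra cycles just to make it precise.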
Re: [PATCH net] rds: Fix incorrect statistics counting
On 9/6/2017 9:12 AM, Håkon Bugge wrote: [...] Hi Santosh, Yes, I agree about the accuracy of s_send_lock_queue_raced. But the main point is that the existing code counts some partial share of the cases when it is _not_ raced. So, in the critical path, my patch adds one test_bit(), which hits the local CPU cache, if not raced. If raced, some other thread is in control, so I would not think the added cycles would make any big difference. Cycles added for no good reason is the point. I can send a v2 where the race tightening is removed if you like. Yes please. Regards, Santosh
Re: [PATCH net] rds: Fix incorrect statistics counting
On 9/6/2017 8:29 AM, Håkon Bugge wrote: In rds_send_xmit() there is logic to batch the sends. However, if another thread has acquired the lock, it is considered a race and we yield. The code incrementing the s_send_lock_queue_raced statistics counter did not count this event correctly. This commit removes a small race in determining the race and increments the statistics counter correctly. Signed-off-by: Håkon Bugge Reviewed-by: Knut Omang --- net/rds/send.c | 16 +--- 1 file changed, 13 insertions(+), 3 deletions(-) Those counters are not really meant to be that accurate, so I am not very keen to add additional cycles in the send path and additional code. Have you seen any real issue, or is this just an observation? The s_send_lock_queue_raced counter is never used to check for small increments, hence the question. Regards, Santosh
Re: [PATCH net] rds: Fix non-atomic operation on shared flag variable
On 9/5/2017 8:42 AM, Håkon Bugge wrote: The bits in m_flags in struct rds_message are used for a plurality of reasons, and from different contexts. To avoid any missing updates to m_flags, use the atomic set_bit() instead of the non-atomic equivalent. Signed-off-by: Håkon Bugge <haakon.bu...@oracle.com> Reviewed-by: Knut Omang <knut.om...@oracle.com> Reviewed-by: Wei Lin Guay <wei.lin.g...@oracle.com> --- net/rds/send.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rds/send.c b/net/rds/send.c
index 41b9f0f..058a407 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -273,7 +273,7 @@ int rds_send_xmit(struct rds_conn_path *cp)
 		len = ntohl(rm->m_inc.i_hdr.h_len);
 		if (cp->cp_unacked_packets == 0 ||
 		    cp->cp_unacked_bytes < len) {
-			__set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
+			set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
 			cp->cp_unacked_packets =
 				rds_sysctl_max_unacked_packets;
@@ -829,7 +829,7 @@ static int rds_send_queue_rm(struct rds_sock *rs, struct rds_connection *conn,
 	 * throughput hits a certain threshold.
 	 */
 	if (rs->rs_snd_bytes >= rds_sk_sndbuf(rs) / 2)
-		__set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
+		set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
 	list_add_tail(&rm->m_sock_item, &rs->rs_send_queue);
 	set_bit(RDS_MSG_ON_SOCK, &rm->m_flags);

Indeed, these couple of instances remained for the m_flags. Patch looks good. Thanks !! Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
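For readers wondering why the atomicity matters, a user-space analogue: set_bit() is a single atomic read-modify-write, while __set_bit() compiles to a plain load, OR, store that can overwrite a concurrent update to a neighbouring bit in the same word. The bit numbers and names below are placeholders, not the RDS flag values.

```c
#include <stdatomic.h>

/* m_flags stand-in: the atomic setter cannot lose a concurrent
 * fetch_or to another bit in the same word; the non-atomic one can,
 * because its load + or + store may clobber a store made in between. */
static void set_bit_sketch(unsigned int nr, atomic_ulong *addr)
{
	atomic_fetch_or(addr, 1UL << nr);	/* like set_bit() */
}

static void nonatomic_set_bit_sketch(unsigned int nr, unsigned long *addr)
{
	*addr |= 1UL << nr;			/* like __set_bit(): racy */
}
```

The non-atomic form is still fine when the word is not yet shared (e.g. before the message is queued); the patch targets exactly the call sites where other contexts can already see m_flags.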
Re: [PATCH 5/7] RDS: make rhashtable_params const
On 8/25/2017 7:21 AM, Bhumika Goyal wrote: Make this const as it is either used during a copy operation or passed to a const argument of the function rhltable_init. Signed-off-by: Bhumika Goyal <bhumi...@gmail.com> --- Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH net] rds: Reintroduce statistics counting
On 8/8/2017 2:13 AM, Håkon Bugge wrote: In commit 7e3f2952eeb1 ("rds: don't let RDS shutdown a connection while senders are present"), refilling the receive queue was removed from rds_ib_recv(), along with the increment of s_ib_rx_refill_from_thread. Commit 73ce4317bf98 ("RDS: make sure we post recv buffers") re-introduces filling the receive queue from rds_ib_recv(), but does not add the statistics counter. rds_ib_recv() was later renamed to rds_ib_recv_path(). This commit reintroduces the statistics counting of s_ib_rx_refill_from_thread and s_ib_rx_refill_from_cq. Signed-off-by: Håkon Bugge <haakon.bu...@oracle.com> Reviewed-by: Knut Omang <knut.om...@oracle.com> Reviewed-by: Wei Lin Guay <wei.lin.g...@oracle.com> --- Looks fine. Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH 0/3] ARM: dts: keystone-k2g: Add DCAN instances to 66AK2G
Hi Franklin, On 8/2/2017 1:18 PM, Franklin S Cooper Jr wrote: Add DCAN nodes to the 66AK2G based SoC dtsi. Franklin S Cooper Jr (2): dt-bindings: net: c_can: Update binding for clock and power-domains property ARM: configs: keystone: Enable D_CAN driver Lokesh Vutla (1): ARM: dts: k2g: Add DCAN nodes Is there any DCAN driver dependency in this patchset? If not, I can queue this up, so do let me know. Regards, Santosh
Re: [PATCH net] rds: Make sure updates to cp_send_gen can be observed
On 7/20/2017 3:28 AM, Håkon Bugge wrote: cp->cp_send_gen is treated as a normal variable, although it may be used by different threads. This is fixed by using {READ,WRITE}_ONCE when it is incremented and READ_ONCE when it is read outside the {acquire,release}_in_xmit protection. There is an explicit memory barrier before the value is read outside {acquire,release}_in_xmit, so that takes care of the load/store synchronization. Normative reference from the Linux-Kernel Memory Model: Loads from and stores to shared (but non-atomic) variables should be protected with the READ_ONCE(), WRITE_ONCE(), and ACCESS_ONCE(). Clause 5.1.2.4/25 in the C standard is also relevant. Signed-off-by: Håkon Bugge <haakon.bu...@oracle.com> Reviewed-by: Knut Omang <knut.om...@oracle.com> --- Having said that, the {READ,WRITE}_ONCE usage makes it clear and explicit, so it's fine with me. Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
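A rough user-space approximation of what these macros buy, using the GCC/clang __typeof__ extension (the _SKETCH names and the bump_send_gen helper are illustrative, not the kernel's definitions): the volatile cast forces the compiler to emit exactly one load or store of the shared variable and prevents it from tearing, fusing, or refetching the access.

```c
/* User-space sketch of the kernel's READ_ONCE/WRITE_ONCE: a volatile
 * access pins down one load/one store per source-level access. */
#define READ_ONCE_SKETCH(x)	(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE_SKETCH(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

static unsigned int cp_send_gen_sketch;	/* stand-in for cp->cp_send_gen */

static unsigned int bump_send_gen(void)
{
	/* the increment pattern from rds_send_xmit(), sketched: done
	 * under the in_xmit ownership, so a plain RMW race is excluded;
	 * the ONCE accessors only constrain the compiler. */
	unsigned int gen = READ_ONCE_SKETCH(cp_send_gen_sketch) + 1;

	WRITE_ONCE_SKETCH(cp_send_gen_sketch, gen);
	return gen;
}
```

Note these macros do not add ordering against other variables; as the thread says, the existing explicit memory barrier still does that job.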
Re: [PATCH net 1/2] rds: tcp: use sock_create_lite() to create the accept socket
On 7/6/2017 8:15 AM, Sowmini Varadhan wrote: There are two problems with calling sock_create_kern() from rds_tcp_accept_one(): 1. it sets up a new_sock->sk, which is wasteful, because this ->sk is going to get replaced by inet_accept() in the subsequent ->accept() 2. the new_sock->sk is a leaked reference in sock_graft(), which expects to find a null parent->sk Avoid these problems by calling sock_create_lite(). Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com> --- Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH 1/3] RDS: IB: Delete an error message for a failed memory allocation in rds_ib_add_one()
On 5/22/2017 7:11 AM, SF Markus Elfring wrote: From: Markus Elfring <elfr...@users.sourceforge.net> Date: Mon, 22 May 2017 15:34:28 +0200 Omit an extra message for a memory allocation failure in this function. This issue was detected by using the Coccinelle software. Link: http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net> --- net/rds/ib.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index 7a64c8db81ab..c5514d058171 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -166,8 +166,5 @@ static void rds_ib_add_one(struct ib_device *device)
-	if (!rds_ibdev->vector_load) {
-		pr_err("RDS/IB: %s failed to allocate vector memory\n",
-			__func__);
+	if (!rds_ibdev->vector_load)
 		goto put_dev;
-	}

Well, the ENOMEM is not carried here, so the message was useful, but it's not critical, so it's fine to clean that up. Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
Re: [PATCH 2/3] RDS: IB: Improve a size determination in rds_ib_add_one()
On 5/22/2017 7:12 AM, SF Markus Elfring wrote: From: Markus Elfring <elfr...@users.sourceforge.net> Date: Mon, 22 May 2017 15:40:21 +0200 Replace the specification of a data structure by a pointer dereference as the parameter for the operator "sizeof" to make the corresponding size determination a bit safer according to the Linux coding style convention. Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net> --- Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
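The convention this patch applies, sketched with calloc() standing in for kzalloc() (struct rds_ibdev_sketch and alloc_ibdev_sketch are hypothetical names for illustration):

```c
#include <stdlib.h>

/* sizeof(*ptr) tracks the pointee's type automatically, so the
 * allocation size stays correct if the struct type is later renamed
 * or the pointer's type changes; repeating the struct name does not. */
struct rds_ibdev_sketch {
	int vector_load;
};

static struct rds_ibdev_sketch *alloc_ibdev_sketch(void)
{
	struct rds_ibdev_sketch *rds_ibdev;

	/* preferred: sizeof(*rds_ibdev), not sizeof(struct rds_ibdev_sketch) */
	rds_ibdev = calloc(1, sizeof(*rds_ibdev));
	return rds_ibdev;
}
```

This is the "a bit safer" property the commit message refers to: the compiler, not the programmer, keeps the size in sync with the type.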
Re: [PATCH] RDS: IB: ensure an initialized ret is printed in pr_warn message
On 4/7/2017 7:24 AM, Dan Carpenter wrote: Setting it to zero doesn't seem like the right thing; it should be an error code. Oh, heh... Smatch parses this one correctly. "ret" is always initialized but the code is probably buggy in a different way: [...] 561 ibmr = rds_ib_reg_fmr(rds_ibdev, sg, nents, key_ret); 562 if (ibmr) This condition is always true because those functions return ERR_PTRs not NULLs. Thanks Dan. I will fix that up. Regards, Santosh
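To see why Dan's `if (ibmr)` observation holds, here is a minimal user-space re-creation of the kernel's ERR_PTR convention (the _sketch names mirror include/linux/err.h loosely; this is illustrative, not the kernel headers): an error return is a small negative errno cast to a pointer, which is non-NULL, so a plain NULL check always passes and the right test is IS_ERR().

```c
#include <errno.h>
#include <stdint.h>

/* Error codes live in the top MAX_ERRNO bytes of the address space,
 * a range no valid pointer ever occupies. */
#define MAX_ERRNO_SKETCH 4095

static inline void *ERR_PTR_sketch(long error)
{
	return (void *)error;		/* e.g. -ENOMEM -> 0xffff...f4 */
}

static inline int IS_ERR_sketch(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO_SKETCH;
}
```

So the buggy pattern `ibmr = reg(...); if (ibmr)` should instead be `if (!IS_ERR(ibmr))` (or the call should be changed to return NULL on failure).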