Re: [PATCH v3] datapath: Avoid using stack larger than 1024.
Thanks for your work. I sent v3 for net-next and added the "Acked-by". If it is
applied, I will backport it to d...@openvswitch.org.

On Fri, Jun 30, 2017 at 1:45 AM, Pravin Shelar wrote:
> On Wed, Jun 28, 2017 at 6:38 PM, Tonghao Zhang wrote:
>> When compiling OvS-master on the 4.4.0-81 kernel, there is a warning:
>>
>>     CC [M]  /root/ovs/datapath/linux/datapath.o
>>     /root/ovs/datapath/linux/datapath.c: In function 'ovs_flow_cmd_set':
>>     /root/ovs/datapath/linux/datapath.c:1221:1: warning: the frame size
>>     of 1040 bytes is larger than 1024 bytes [-Wframe-larger-than=]
>>
>> This patch factors out the match-init and action-copy code to avoid the
>> "-Wframe-larger-than=1024" warning. Because the mask is only used to get
>> the actions, we add a new function for that step to save some stack space.
>>
>> Signed-off-by: Tonghao Zhang
>> ---
> Looks good, but you have to submit the patch against net-next for the
> upstream OVS module first.
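The fix pattern discussed here, confining a large temporary to a helper so its stack usage is charged to a short-lived frame, can be sketched in plain C. All names and the 512-byte size below are illustrative, not the actual OVS datapath code:

```c
#include <assert.h>
#include <string.h>

/* Stand-in for a large on-stack object such as a flow mask
 * (512 bytes is illustrative, not OVS's real struct size). */
struct big_mask {
    unsigned char bytes[512];
};

/* noinline (GCC/Clang) keeps this helper's frame separate from the
 * caller's, so the big_mask occupies stack only during this call. */
__attribute__((noinline))
static int get_flow_actions(int key, int *actions)
{
    struct big_mask mask;             /* large temporary, confined here */

    memset(&mask, key, sizeof(mask));
    *actions = mask.bytes[0] + 1;     /* "derive" the actions from the mask */
    return 0;
}

/* The caller's frame now holds only small locals, instead of the
 * match + mask + actions that pushed the original function past
 * the -Wframe-larger-than=1024 limit. */
static int flow_cmd_set(int key)
{
    int actions = 0;

    if (get_flow_actions(key, &actions))
        return -1;
    return actions;
}
```

With the mask confined to get_flow_actions(), its bytes no longer count against flow_cmd_set()'s frame, which is the effect the patch relies on.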
Re: [PATCH RFC 0/2] kproxy: Kernel Proxy
On Thu, Jun 29, 2017 at 04:43:28PM -0700, Tom Herbert wrote:
> On Thu, Jun 29, 2017 at 1:58 PM, Willy Tarreau wrote:
> > On Thu, Jun 29, 2017 at 01:40:26PM -0700, Tom Herbert wrote:
> >> > In fact that's not much what I observe in the field. In practice,
> >> > large data streams are cheaply relayed using splice(); I could
> >> > achieve 60 Gbps of HTTP forwarding via HAProxy on a 4-core Xeon
> >> > 2 years ago. And when you use SSL, the cost of the copy to/from
> >> > the kernel is small compared to all the crypto operations
> >> > surrounding it.
> >>
> >> Right, getting rid of the extra crypto operations and so-called "SSL
> >> inspection" is the ultimate goal this is going towards.
> >
> > Yep, but in order to take decisions at L7 you need to decapsulate SSL.
>
> Decapsulate or decrypt? There's a big difference... :-) I am aiming to
> just have to decapsulate.

Sorry, but what distinction do you make? For me "decapsulate" means
"extract the next-level layer", and for SSL that means you need to decrypt.

> >> Performance is relevant because we potentially want security applied
> >> to every message in every communication in a containerized data
> >> center. Putting a userspace hop in the datapath of every packet is
> >> known to be problematic, not just for the performance hit but also
> >> because it increases the attack surface on users' privacy.
> >
> > While I totally agree on the performance hit when inspecting each
> > packet, I fail to see the relation with users' privacy. In fact under
> > some circumstances it can even be the opposite. For example, using
> > something like kTLS for a TCP/HTTP proxy can result in cleartext being
> > observable in strace, while it's not visible when TLS is terminated in
> > userland because all you see are openssl's read()/write() operations.
> > Maybe you have specific attacks in mind?
>
> No, just the normal problem of making yet one more tool systematically
> have access to user data.

OK.
> >> > Regarding kernel-side protocol parsing, there's an unfortunate
> >> > trend of moving more and more protocols to userland due to these
> >> > protocols evolving very quickly. At least you'll want to find a way
> >> > to provide these parsers from userspace, which will inevitably come
> >> > with its own set of problems or limitations :-/
> >>
> >> That's why everything is going BPF now ;-)
> >
> > Yes, I knew you were going to suggest this :-) I'm still prudent about
> > it, to be honest. I don't think it would be that easy to implement an
> > HPACK encoder/decoder using BPF. And even regarding just plain HTTP
> > parsing, certain very small operations in haproxy's parser can quickly
> > result in a 10% performance degradation when improperly optimized
> > (i.e. changing a "likely", altering branch prediction, or cache walk
> > patterns when using arrays to evaluate character classes faster). But
> > for general usage I indeed think it should be OK.
>
> HTTP might qualify as a special case, and I believe there's already
> been some work to put HTTP in the kernel by Alexander Krizhanovsky and
> others. In this case maybe an HTTP parser could be a front end before BPF.

It could indeed be an option. We've seen this with Tux in the past.

> Although, it's pretty clear we'll need regex in BPF if we want to use
> it with HTTP.

I think so as well. And some loop-like operations (foreach or stuff like
this, so that they remain bounded).

Willy
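The splice()-based forwarding Willy describes can be sketched as a minimal Linux-only toy: the payload moves fd to pipe to fd inside the kernel and never transits a userspace buffer. Two temporary files stand in for sockets here; this is an assumed illustration, not HAProxy's actual code:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Move up to len bytes from in_fd to out_fd through a pipe using
 * splice(2), so the payload is never copied into user space. */
static ssize_t relay(int in_fd, int out_fd, size_t len)
{
    int p[2];
    ssize_t total = 0;

    if (pipe(p) < 0)
        return -1;
    while (len > 0) {
        ssize_t in = splice(in_fd, NULL, p[1], NULL, len, SPLICE_F_MOVE);
        if (in <= 0)
            break;                              /* EOF or error */
        ssize_t out = splice(p[0], NULL, out_fd, NULL, (size_t)in,
                             SPLICE_F_MOVE);
        if (out < 0) {
            total = -1;
            break;
        }
        total += out;
        len -= (size_t)in;
    }
    close(p[0]);
    close(p[1]);
    return total;
}

/* Demo: send a message between two temp files through the relay and
 * verify it arrived intact. Returns 1 on success. */
static int demo(void)
{
    const char msg[] = "spliced without a userspace copy";
    char back[sizeof(msg)] = {0};
    FILE *src = tmpfile(), *dst = tmpfile();

    if (!src || !dst)
        return 0;
    fwrite(msg, 1, sizeof(msg), src);
    rewind(src);                  /* flush and reposition before splicing */
    if (relay(fileno(src), fileno(dst), sizeof(msg)) != (ssize_t)sizeof(msg))
        return 0;
    rewind(dst);
    if (fread(back, 1, sizeof(back), dst) != sizeof(msg))
        return 0;
    fclose(src);
    fclose(dst);
    return memcmp(back, msg, sizeof(msg)) == 0;
}
```

splice(2) requires one end of each transfer to be a pipe, hence the intermediate pipe; with sockets on both sides the same loop gives the zero-copy relay behaviour discussed above.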
Re: [PATCH net-next] cxgb4: Add PTP Hardware Clock (PHC) support (fwd)
The complete context isn't shown, but it seems likely that there is a
goto out_free under line 1207, with no unlock on adap->ptp_lock.

julia

-- Forwarded message --
Date: Fri, 30 Jun 2017 11:54:23 +0800
From: kbuild test robot <fengguang...@intel.com>
To: kbu...@01.org
Cc: Julia Lawall <julia.law...@lip6.fr>
Subject: Re: [PATCH net-next] cxgb4: Add PTP Hardware Clock (PHC) support

Hi Atul,

[auto build test WARNING on net-next/master]

url: https://github.com/0day-ci/linux/commits/Atul-Gupta/cxgb4-Add-PTP-Hardware-Clock-PHC-support/20170629-200758
:: branch date: 16 hours ago
:: commit date: 16 hours ago

>> drivers/net/ethernet/chelsio/cxgb4/sge.c:1190:2-8: preceding lock on line 1204

git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout 159226c60ceb77858018f6c31d17a575b3679b8a
vim +1190 drivers/net/ethernet/chelsio/cxgb4/sge.c

fd3a47900 drivers/net/cxgb4/sge.c                  Dimitris Michailidis 2010-04-01  1184  /*
fd3a47900 drivers/net/cxgb4/sge.c                  Dimitris Michailidis 2010-04-01  1185   * The chip min packet length is 10 octets but play safe and reject
fd3a47900 drivers/net/cxgb4/sge.c                  Dimitris Michailidis 2010-04-01  1186   * anything shorter than an Ethernet header.
fd3a47900 drivers/net/cxgb4/sge.c                  Dimitris Michailidis 2010-04-01  1187   */
fd3a47900 drivers/net/cxgb4/sge.c                  Dimitris Michailidis 2010-04-01  1188      if (unlikely(skb->len < ETH_HLEN)) {
a7525198a drivers/net/ethernet/chelsio/cxgb4/sge.c Eric W. Biederman    2014-03-15  1189  out_free:   dev_kfree_skb_any(skb);
fd3a47900 drivers/net/cxgb4/sge.c                  Dimitris Michailidis 2010-04-01 @1190          return NETDEV_TX_OK;
fd3a47900 drivers/net/cxgb4/sge.c                  Dimitris Michailidis 2010-04-01  1191      }
fd3a47900 drivers/net/cxgb4/sge.c                  Dimitris Michailidis 2010-04-01  1192
637d3e997 drivers/net/ethernet/chelsio/cxgb4/sge.c Hariprasad Shenai    2015-05-05  1193      /* Discard the packet if the length is greater than mtu */
637d3e997 drivers/net/ethernet/chelsio/cxgb4/sge.c Hariprasad Shenai    2015-05-05  1194      max_pkt_len = ETH_HLEN + dev->mtu;
8d09e6b8b drivers/net/ethernet/chelsio/cxgb4/sge.c Hariprasad Shenai    2016-07-28  1195      if (skb_vlan_tagged(skb))
637d3e997 drivers/net/ethernet/chelsio/cxgb4/sge.c Hariprasad Shenai    2015-05-05  1196          max_pkt_len += VLAN_HLEN;
637d3e997 drivers/net/ethernet/chelsio/cxgb4/sge.c Hariprasad Shenai    2015-05-05  1197      if (!skb_shinfo(skb)->gso_size && (unlikely(skb->len > max_pkt_len)))
637d3e997 drivers/net/ethernet/chelsio/cxgb4/sge.c Hariprasad Shenai    2015-05-05  1198          goto out_free;
637d3e997 drivers/net/ethernet/chelsio/cxgb4/sge.c Hariprasad Shenai    2015-05-05  1199
fd3a47900 drivers/net/cxgb4/sge.c                  Dimitris Michailidis 2010-04-01  1200      pi = netdev_priv(dev);
fd3a47900 drivers/net/cxgb4/sge.c                  Dimitris Michailidis 2010-04-01  1201      adap = pi->adapter;
fd3a47900 drivers/net/cxgb4/sge.c                  Dimitris Michailidis 2010-04-01  1202      qidx = skb_get_queue_mapping(skb);
159226c60 drivers/net/ethernet/chelsio/cxgb4/sge.c Atul Gupta           2017-06-28  1203      if (ptp_enabled) {
159226c60 drivers/net/ethernet/chelsio/cxgb4/sge.c Atul Gupta           2017-06-28 @1204          spin_lock(&adap->ptp_lock);
159226c60 drivers/net/ethernet/chelsio/cxgb4/sge.c Atul Gupta           2017-06-28  1205          if (!(adap->ptp_tx_skb)) {
159226c60 drivers/net/ethernet/chelsio/cxgb4/sge.c Atul Gupta           2017-06-28  1206              skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
159226c60 drivers/net/ethernet/chelsio/cxgb4/sge.c Atul Gupta           2017-06-28  1207              adap->ptp_tx_skb = skb_get(skb);

:: The code at line 1190 was first introduced by commit
:: fd3a47900b6f9fa72a4074ecb630f9dae62f1a95 cxgb4: Add packet queues and packet DMA code
:: TO: Dimitris Michailidis <d...@chelsio.com>
:: CC: David S. Miller <da...@davemloft.net>

---
0-DAY kernel test infrastructure              Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                 Intel Corporation
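The robot's warning is the classic locked-region/goto mismatch: an error path inside the PTP branch reaches the shared out_free label while ptp_lock is still held. A user-space reconstruction of the fixed shape (a pthread mutex stands in for the spinlock and counters for the skb bookkeeping; this is a hypothetical sketch, not the cxgb4 code):

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t ptp_lock = PTHREAD_MUTEX_INITIALIZER;
static int tx_busy;   /* stands in for adap->ptp_tx_skb being set */
static int freed;     /* counts dev_kfree_skb_any() stand-in calls */

#define NETDEV_TX_OK 0

/* Shape of the corrected transmit path: every branch that leaves the
 * locked region must drop ptp_lock before jumping to out_free. */
static int xmit(int ptp_enabled)
{
    if (ptp_enabled) {
        pthread_mutex_lock(&ptp_lock);
        if (tx_busy) {
            pthread_mutex_unlock(&ptp_lock); /* the unlock the robot found missing */
            goto out_free;
        }
        tx_busy = 1;                         /* adap->ptp_tx_skb = skb_get(skb) */
        pthread_mutex_unlock(&ptp_lock);
    }
    return NETDEV_TX_OK;

out_free:
    freed++;                                 /* dev_kfree_skb_any(skb) */
    return NETDEV_TX_OK;
}
```

Both the drop path and the success path return NETDEV_TX_OK; the property the coccinelle check enforces is that the lock is acquirable again after either path.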
[PATCH net-next 09/11] sctp: remove the typedef sctp_data_chunk_t
This patch is to remove the typedef sctp_data_chunk_t, and replace with
struct sctp_data_chunk in the places where it's using this typedef.

Signed-off-by: Xin Long
---
 include/linux/sctp.h         | 4 ++--
 include/net/sctp/constants.h | 2 +-
 include/net/sctp/sm.h        | 2 +-
 net/sctp/output.c            | 4 ++--
 net/sctp/sm_statefuns.c      | 6 +++---
 net/sctp/ulpqueue.c          | 2 +-
 6 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/sctp.h b/include/linux/sctp.h
index 55d84c1..91c888f 100644
--- a/include/linux/sctp.h
+++ b/include/linux/sctp.h
@@ -235,10 +235,10 @@ struct sctp_datahdr {
 	__u8 payload[0];
 };
 
-typedef struct sctp_data_chunk {
+struct sctp_data_chunk {
 	struct sctp_chunkhdr chunk_hdr;
 	struct sctp_datahdr data_hdr;
-} sctp_data_chunk_t;
+};
 
 /* DATA Chuck Specific Flags */
 enum {
diff --git a/include/net/sctp/constants.h b/include/net/sctp/constants.h
index 02e867b..9b18044 100644
--- a/include/net/sctp/constants.h
+++ b/include/net/sctp/constants.h
@@ -152,7 +152,7 @@ SCTP_SUBTYPE_CONSTRUCTOR(PRIMITIVE, sctp_event_primitive_t, primitive)
 /* Calculate the actual data size in a data chunk */
 #define SCTP_DATA_SNDSIZE(c) ((int)((unsigned long)(c->chunk_end) \
 				- (unsigned long)(c->chunk_hdr) \
-				- sizeof(sctp_data_chunk_t)))
+				- sizeof(struct sctp_data_chunk)))
 
 /* Internal error codes */
 typedef enum {
diff --git a/include/net/sctp/sm.h b/include/net/sctp/sm.h
index 245eb22..860f378 100644
--- a/include/net/sctp/sm.h
+++ b/include/net/sctp/sm.h
@@ -347,7 +347,7 @@ static inline __u16 sctp_data_size(struct sctp_chunk *chunk)
 	__u16 size;
 
 	size = ntohs(chunk->chunk_hdr->length);
-	size -= sizeof(sctp_data_chunk_t);
+	size -= sizeof(struct sctp_data_chunk);
 
 	return size;
 }
diff --git a/net/sctp/output.c b/net/sctp/output.c
index 89cee14..977ab5d 100644
--- a/net/sctp/output.c
+++ b/net/sctp/output.c
@@ -723,8 +723,8 @@ static sctp_xmit_t sctp_packet_can_append_data(struct sctp_packet *packet,
 	/* Check whether this chunk and all the rest of pending data will fit
 	 * or delay in hopes of bundling a full sized packet.
 	 */
-	if (chunk->skb->len + q->out_qlen >
-	    transport->pathmtu - packet->overhead - sizeof(sctp_data_chunk_t) - 4)
+	if (chunk->skb->len + q->out_qlen > transport->pathmtu -
+	    packet->overhead - sizeof(struct sctp_data_chunk) - 4)
 		/* Enough data queued to fill a packet */
 		return SCTP_XMIT_OK;
 
diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index 1ba9a9b..212fe76 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -2990,7 +2990,7 @@ sctp_disposition_t sctp_sf_eat_data_6_2(struct net *net,
 		return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
 	}
 
-	if (!sctp_chunk_length_valid(chunk, sizeof(sctp_data_chunk_t)))
+	if (!sctp_chunk_length_valid(chunk, sizeof(struct sctp_data_chunk)))
 		return sctp_sf_violation_chunklen(net, ep, asoc, type, arg,
 						  commands);
 
@@ -3109,7 +3109,7 @@ sctp_disposition_t sctp_sf_eat_data_fast_4_4(struct net *net,
 		return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
 	}
 
-	if (!sctp_chunk_length_valid(chunk, sizeof(sctp_data_chunk_t)))
+	if (!sctp_chunk_length_valid(chunk, sizeof(struct sctp_data_chunk)))
 		return sctp_sf_violation_chunklen(net, ep, asoc, type, arg,
 						  commands);
 
@@ -6262,7 +6262,7 @@ static int sctp_eat_data(const struct sctp_association *asoc,
 	 * Actually, allow a little bit of overflow (up to a MTU).
 	 */
 	datalen = ntohs(chunk->chunk_hdr->length);
-	datalen -= sizeof(sctp_data_chunk_t);
+	datalen -= sizeof(struct sctp_data_chunk);
 
 	deliver = SCTP_CMD_CHUNK_ULP;
 
diff --git a/net/sctp/ulpqueue.c b/net/sctp/ulpqueue.c
index 25f7e41..0225d62 100644
--- a/net/sctp/ulpqueue.c
+++ b/net/sctp/ulpqueue.c
@@ -1090,7 +1090,7 @@ void sctp_ulpq_renege(struct sctp_ulpq *ulpq, struct sctp_chunk *chunk,
 
 	if (chunk) {
 		needed = ntohs(chunk->chunk_hdr->length);
-		needed -= sizeof(sctp_data_chunk_t);
+		needed -= sizeof(struct sctp_data_chunk);
 	} else
 		needed = SCTP_DEFAULT_MAXWINDOW;
-- 
2.1.0
[PATCH net-next 08/11] sctp: remove the typedef sctp_datahdr_t
This patch is to remove the typedef sctp_datahdr_t, and replace with
struct sctp_datahdr in the places where it's using this typedef. It is
also to use sizeof(variable) instead of sizeof(type).

Signed-off-by: Xin Long
---
 include/linux/sctp.h    |  6 +++---
 net/sctp/sm_statefuns.c | 13 ++++++++-----
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/include/linux/sctp.h b/include/linux/sctp.h
index d5c0dda..55d84c1 100644
--- a/include/linux/sctp.h
+++ b/include/linux/sctp.h
@@ -227,17 +227,17 @@ enum { SCTP_PARAM_ACTION_MASK = cpu_to_be16(0xc000), };
 
 /* RFC 2960 Section 3.3.1 Payload Data (DATA) (0) */
 
-typedef struct sctp_datahdr {
+struct sctp_datahdr {
 	__be32 tsn;
 	__be16 stream;
 	__be16 ssn;
 	__be32 ppid;
 	__u8 payload[0];
-} sctp_datahdr_t;
+};
 
 typedef struct sctp_data_chunk {
 	struct sctp_chunkhdr chunk_hdr;
-	sctp_datahdr_t data_hdr;
+	struct sctp_datahdr data_hdr;
 } sctp_data_chunk_t;
 
 /* DATA Chuck Specific Flags */
diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index 0a01c68..1ba9a9b 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -3010,7 +3010,8 @@ sctp_disposition_t sctp_sf_eat_data_6_2(struct net *net,
 		return SCTP_DISPOSITION_ABORT;
 	case SCTP_IERROR_PROTO_VIOLATION:
 		return sctp_sf_abort_violation(net, ep, asoc, chunk, commands,
-			(u8 *)chunk->subh.data_hdr, sizeof(sctp_datahdr_t));
+					       (u8 *)chunk->subh.data_hdr,
+					       sizeof(struct sctp_datahdr));
 	default:
 		BUG();
 	}
@@ -3124,7 +3125,8 @@ sctp_disposition_t sctp_sf_eat_data_fast_4_4(struct net *net,
 		return SCTP_DISPOSITION_ABORT;
 	case SCTP_IERROR_PROTO_VIOLATION:
 		return sctp_sf_abort_violation(net, ep, asoc, chunk, commands,
-			(u8 *)chunk->subh.data_hdr, sizeof(sctp_datahdr_t));
+					       (u8 *)chunk->subh.data_hdr,
+					       sizeof(struct sctp_datahdr));
 	default:
 		BUG();
 	}
@@ -6197,7 +6199,7 @@ static int sctp_eat_data(const struct sctp_association *asoc,
 			 struct sctp_chunk *chunk,
 			 sctp_cmd_seq_t *commands)
 {
-	sctp_datahdr_t *data_hdr;
+	struct sctp_datahdr *data_hdr;
 	struct sctp_chunk *err;
 	size_t datalen;
 	sctp_verb_t deliver;
@@ -6210,8 +6212,9 @@ static int sctp_eat_data(const struct sctp_association *asoc,
 	u16 sid;
 	u8 ordered = 0;
 
-	data_hdr = chunk->subh.data_hdr = (sctp_datahdr_t *)chunk->skb->data;
-	skb_pull(chunk->skb, sizeof(sctp_datahdr_t));
+	data_hdr = (struct sctp_datahdr *)chunk->skb->data;
+	chunk->subh.data_hdr = data_hdr;
+	skb_pull(chunk->skb, sizeof(*data_hdr));
 
 	tsn = ntohl(data_hdr->tsn);
 	pr_debug("%s: TSN 0x%x\n", __func__, tsn);
-- 
2.1.0
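The sizeof(*data_hdr) change in sctp_eat_data() is the interesting part of this patch: once the pointer variable carries the type, a later type change cannot silently desynchronize the skb_pull() length. A user-space sketch of the same parse, re-declaring the fixed part of the on-wire DATA header from RFC 4960 Section 3.3.1 (assumed mirror, not kernel code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* User-space mirror of the kernel's struct sctp_datahdr
 * (RFC 4960 Section 3.3.1, fixed 12-byte part). */
struct sctp_datahdr {
    uint32_t tsn;
    uint16_t stream;
    uint16_t ssn;
    uint32_t ppid;
};

/* Pull the DATA header off a buffer the way sctp_eat_data() does:
 * advance by sizeof(*hdr), never by a hard-coded type name. */
static const uint8_t *pull_data_hdr(const uint8_t *buf, uint32_t *tsn)
{
    struct sctp_datahdr hdr;

    memcpy(&hdr, buf, sizeof(hdr));   /* copy to avoid misaligned access */
    *tsn = ntohl(hdr.tsn);
    return buf + sizeof(hdr);         /* skb_pull(skb, sizeof(*data_hdr)) */
}

/* Demo: parse the TSN out of a hand-built DATA header. */
static uint32_t demo_tsn(void)
{
    uint8_t buf[12] = {0};
    uint32_t be = htonl(0x01020304);
    uint32_t tsn = 0;

    memcpy(buf, &be, sizeof(be));
    pull_data_hdr(buf, &tsn);
    return tsn;
}
```

The three uint32/two uint16 fields pack to exactly 12 bytes with no padding, matching the wire format.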
[PATCH net-next 11/11] sctp: remove the typedef sctp_init_chunk_t
This patch is to remove the typedef sctp_init_chunk_t, and replace with
struct sctp_init_chunk in the places where it's using this typedef.

Signed-off-by: Xin Long
---
 include/linux/sctp.h       |  6 +++---
 include/net/sctp/command.h |  4 ++--
 include/net/sctp/structs.h |  4 ++--
 net/sctp/input.c           |  4 ++--
 net/sctp/sm_make_chunk.c   |  6 +++---
 net/sctp/sm_sideeffect.c   |  2 +-
 net/sctp/sm_statefuns.c    | 28 ++++++++++++++--------------
 7 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/include/linux/sctp.h b/include/linux/sctp.h
index 5624195..99e8664 100644
--- a/include/linux/sctp.h
+++ b/include/linux/sctp.h
@@ -266,10 +266,10 @@ struct sctp_inithdr {
 	__u8 params[0];
 };
 
-typedef struct sctp_init_chunk {
+struct sctp_init_chunk {
 	struct sctp_chunkhdr chunk_hdr;
 	struct sctp_inithdr init_hdr;
-} sctp_init_chunk_t;
+};
 
 
 /* Section 3.3.2.1. IPv4 Address Parameter (5) */
@@ -341,7 +341,7 @@ typedef struct sctp_hmac_algo_param {
  * The INIT ACK chunk is used to acknowledge the initiation of an SCTP
  * association.
  */
-typedef sctp_init_chunk_t sctp_initack_chunk_t;
+typedef struct sctp_init_chunk sctp_initack_chunk_t;
 
 /* Section 3.3.3.1 State Cookie (7) */
 typedef struct sctp_cookie_param {
diff --git a/include/net/sctp/command.h b/include/net/sctp/command.h
index d4a20d0..d4679e7 100644
--- a/include/net/sctp/command.h
+++ b/include/net/sctp/command.h
@@ -132,7 +132,7 @@ typedef union {
 	struct sctp_association *asoc;
 	struct sctp_transport *transport;
 	struct sctp_bind_addr *bp;
-	sctp_init_chunk_t *init;
+	struct sctp_init_chunk *init;
 	struct sctp_ulpevent *ulpevent;
 	struct sctp_packet *packet;
 	sctp_sackhdr_t *sackh;
@@ -173,7 +173,7 @@ SCTP_ARG_CONSTRUCTOR(CHUNK,	struct sctp_chunk *, chunk)
 SCTP_ARG_CONSTRUCTOR(ASOC,	struct sctp_association *, asoc)
 SCTP_ARG_CONSTRUCTOR(TRANSPORT,	struct sctp_transport *, transport)
 SCTP_ARG_CONSTRUCTOR(BA,	struct sctp_bind_addr *, bp)
-SCTP_ARG_CONSTRUCTOR(PEER_INIT,	sctp_init_chunk_t *, init)
+SCTP_ARG_CONSTRUCTOR(PEER_INIT,	struct sctp_init_chunk *, init)
 SCTP_ARG_CONSTRUCTOR(ULPEVENT,	struct sctp_ulpevent *, ulpevent)
 SCTP_ARG_CONSTRUCTOR(PACKET,	struct sctp_packet *, packet)
 SCTP_ARG_CONSTRUCTOR(SACKH,	sctp_sackhdr_t *, sackh)
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 2393d2e..07c11fe 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1298,11 +1298,11 @@ int sctp_has_association(struct net *net, const union sctp_addr *laddr,
 
 int sctp_verify_init(struct net *net, const struct sctp_endpoint *ep,
 		     const struct sctp_association *asoc,
-		     enum sctp_cid cid, sctp_init_chunk_t *peer_init,
+		     enum sctp_cid cid, struct sctp_init_chunk *peer_init,
 		     struct sctp_chunk *chunk, struct sctp_chunk **err_chunk);
 int sctp_process_init(struct sctp_association *, struct sctp_chunk *chunk,
 		      const union sctp_addr *peer,
-		      sctp_init_chunk_t *init, gfp_t gfp);
+		      struct sctp_init_chunk *init, gfp_t gfp);
 
 __u32 sctp_generate_tag(const struct sctp_endpoint *);
 __u32 sctp_generate_tsn(const struct sctp_endpoint *);
diff --git a/net/sctp/input.c b/net/sctp/input.c
index a9994c4..41eb2ec 100644
--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -1051,7 +1051,7 @@ static struct sctp_association *__sctp_rcv_init_lookup(struct net *net,
 	union sctp_addr *paddr = &addr;
 	struct sctphdr *sh = sctp_hdr(skb);
 	union sctp_params params;
-	sctp_init_chunk_t *init;
+	struct sctp_init_chunk *init;
 	struct sctp_af *af;
 
 	/*
@@ -1070,7 +1070,7 @@ static struct sctp_association *__sctp_rcv_init_lookup(struct net *net,
 	/* Find the start of the TLVs and the end of the chunk.  This is
 	 * the region we search for address parameters.
 	 */
-	init = (sctp_init_chunk_t *)skb->data;
+	init = (struct sctp_init_chunk *)skb->data;
 
 	/* Walk the parameters looking for embedded addresses. */
 	sctp_walk_params(params, init, init_hdr.params) {
diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index 8b9ca10..3af4dd0 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -2242,8 +2242,8 @@ static sctp_ierror_t sctp_verify_param(struct net *net,
 /* Verify the INIT packet before we process it. */
 int sctp_verify_init(struct net *net, const struct sctp_endpoint *ep,
 		     const struct sctp_association *asoc, enum sctp_cid cid,
-		     sctp_init_chunk_t *peer_init, struct sctp_chunk *chunk,
-		     struct sctp_chunk **errp)
+		     struct sctp_init_chunk *peer_init,
+		     struct sctp_chunk *chunk, struct sctp_chunk **errp)
 {
 	union
[PATCH net-next 10/11] sctp: remove the typedef sctp_inithdr_t
This patch is to remove the typedef sctp_inithdr_t, and replace with
struct sctp_inithdr in the places where it's using this typedef.

Signed-off-by: Xin Long
---
 include/linux/sctp.h                    |  6 +++---
 net/netfilter/nf_conntrack_proto_sctp.c |  4 ++--
 net/sctp/sm_make_chunk.c                |  4 ++--
 net/sctp/sm_statefuns.c                 | 12 ++++++------
 4 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/linux/sctp.h b/include/linux/sctp.h
index 91c888f..5624195 100644
--- a/include/linux/sctp.h
+++ b/include/linux/sctp.h
@@ -257,18 +257,18 @@ enum { SCTP_DATA_FRAG_MASK = 0x03, };
  * This chunk is used to initiate a SCTP association between two
  * endpoints.
  */
-typedef struct sctp_inithdr {
+struct sctp_inithdr {
 	__be32 init_tag;
 	__be32 a_rwnd;
 	__be16 num_outbound_streams;
 	__be16 num_inbound_streams;
 	__be32 initial_tsn;
 	__u8 params[0];
-} sctp_inithdr_t;
+};
 
 typedef struct sctp_init_chunk {
 	struct sctp_chunkhdr chunk_hdr;
-	sctp_inithdr_t init_hdr;
+	struct sctp_inithdr init_hdr;
 } sctp_init_chunk_t;
 
diff --git a/net/netfilter/nf_conntrack_proto_sctp.c b/net/netfilter/nf_conntrack_proto_sctp.c
index b841a8a..31c6c8e 100644
--- a/net/netfilter/nf_conntrack_proto_sctp.c
+++ b/net/netfilter/nf_conntrack_proto_sctp.c
@@ -395,7 +395,7 @@ static int sctp_packet(struct nf_conn *ct,
 		/* If it is an INIT or an INIT ACK note down the vtag */
 		if (sch->type == SCTP_CID_INIT ||
 		    sch->type == SCTP_CID_INIT_ACK) {
-			sctp_inithdr_t _inithdr, *ih;
+			struct sctp_inithdr _inithdr, *ih;
 
 			ih = skb_header_pointer(skb, offset + sizeof(_sch),
 						sizeof(_inithdr), &_inithdr);
@@ -471,7 +471,7 @@ static bool sctp_new(struct nf_conn *ct, const struct sk_buff *skb,
 		/* Copy the vtag into the state info */
 		if (sch->type == SCTP_CID_INIT) {
-			sctp_inithdr_t _inithdr, *ih;
+			struct sctp_inithdr _inithdr, *ih;
 			/* Sec 8.5.1 (A) */
 			if (sh->vtag)
 				return false;
diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index 3ed2108..8b9ca10 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -217,7 +217,7 @@ struct sctp_chunk *sctp_make_init(const struct sctp_association *asoc,
 {
 	struct net *net = sock_net(asoc->base.sk);
 	struct sctp_endpoint *ep = asoc->ep;
-	sctp_inithdr_t init;
+	struct sctp_inithdr init;
 	union sctp_params addrs;
 	size_t chunksize;
 	struct sctp_chunk *retval = NULL;
@@ -385,7 +385,7 @@ struct sctp_chunk *sctp_make_init_ack(const struct sctp_association *asoc,
 				      const struct sctp_chunk *chunk,
 				      gfp_t gfp, int unkparam_len)
 {
-	sctp_inithdr_t initack;
+	struct sctp_inithdr initack;
 	struct sctp_chunk *retval;
 	union sctp_params addrs;
 	struct sctp_sock *sp;
diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index 212fe76..71b6e3f 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -389,10 +389,10 @@ sctp_disposition_t sctp_sf_do_5_1B_init(struct net *net,
 	}
 
 	/* Grab the INIT header. */
-	chunk->subh.init_hdr = (sctp_inithdr_t *)chunk->skb->data;
+	chunk->subh.init_hdr = (struct sctp_inithdr *)chunk->skb->data;
 
 	/* Tag the variable length parameters. */
-	chunk->param_hdr.v = skb_pull(chunk->skb, sizeof(sctp_inithdr_t));
+	chunk->param_hdr.v = skb_pull(chunk->skb, sizeof(struct sctp_inithdr));
 
 	new_asoc = sctp_make_temp_asoc(ep, chunk, GFP_ATOMIC);
 	if (!new_asoc)
@@ -522,7 +522,7 @@ sctp_disposition_t sctp_sf_do_5_1C_ack(struct net *net,
 		return sctp_sf_violation_chunklen(net, ep, asoc, type, arg,
 						  commands);
 	/* Grab the INIT header. */
-	chunk->subh.init_hdr = (sctp_inithdr_t *) chunk->skb->data;
+	chunk->subh.init_hdr = (struct sctp_inithdr *)chunk->skb->data;
 
 	/* Verify the INIT chunk before processing it. */
 	err_chunk = NULL;
@@ -576,7 +576,7 @@ sctp_disposition_t sctp_sf_do_5_1C_ack(struct net *net,
 	/* Tag the variable length parameters.  Note that we never
 	 * convert the parameters in an INIT chunk.
 	 */
-	chunk->param_hdr.v = skb_pull(chunk->skb, sizeof(sctp_inithdr_t));
+	chunk->param_hdr.v = skb_pull(chunk->skb, sizeof(struct sctp_inithdr));
 
 	initchunk = (sctp_init_chunk_t *) chunk->chunk_hdr;
 
@@ -1454,10 +1454,10 @@ static sctp_disposition_t sctp_sf_do_unexpected_init(
 		return sctp_sf_violation_chunklen(net, ep, asoc,
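The __u8 params[0] member these INIT headers end with is the hook for TLV iteration, which is what sctp_walk_params() does after the skb_pull() above. A simplified user-space sketch of walking length-prefixed SCTP parameters (assumed helper, not the kernel macro; lengths include the 4-byte header and are padded to a 4-byte boundary per RFC 4960):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Mirror of struct sctp_paramhdr: every variable-length parameter
 * starts with a big-endian type and a length covering the header. */
struct paramhdr {
    uint16_t type;
    uint16_t length;
};

/* Count the parameters in a TLV region, advancing by the padded
 * length; stop on anything malformed, as the kernel walker does. */
static int count_params(const uint8_t *p, size_t len)
{
    int n = 0;

    while (len >= sizeof(struct paramhdr)) {
        struct paramhdr h;
        memcpy(&h, p, sizeof(h));
        size_t plen = ntohs(h.length);
        if (plen < sizeof(h) || plen > len)
            break;                        /* truncated or bogus length */
        n++;
        plen = (plen + 3) & ~(size_t)3;   /* pad to 4-byte boundary */
        if (plen >= len)
            break;
        p += plen;
        len -= plen;
    }
    return n;
}

/* Demo: two parameters, the second one needing padding. */
static int demo_walk(void)
{
    uint8_t buf[16] = {0};
    struct paramhdr h1 = { htons(9),  htons(8) }; /* Cookie Preservative */
    struct paramhdr h2 = { htons(12), htons(6) }; /* Supported Addr Types */

    memcpy(buf, &h1, sizeof(h1));
    memcpy(buf + 8, &h2, sizeof(h2));
    return count_params(buf, sizeof(buf));
}
```

The padding step matters: the second parameter declares 6 bytes but the next TLV starts on the following 4-byte boundary.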
[PATCH net-next 06/11] sctp: remove the typedef sctp_param_t
This patch is to remove the typedef sctp_param_t, and replace with
enum sctp_param in the places where it's using this typedef. It is
also to remove the useless declaration sctp_addip_addr_config and fix
the lack of parameter names in some other functions' declarations.

Signed-off-by: Xin Long
---
 include/linux/sctp.h     |  4 ++--
 include/net/sctp/sm.h    | 14 ++++++--------
 net/sctp/sm_make_chunk.c |  2 +-
 3 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/include/linux/sctp.h b/include/linux/sctp.h
index 142bb6a..5eecc0f 100644
--- a/include/linux/sctp.h
+++ b/include/linux/sctp.h
@@ -167,7 +167,7 @@ struct sctp_paramhdr {
 	__be16 length;
 };
 
-typedef enum {
+enum sctp_param {
 
 	/* RFC 2960 Section 3.3.5 */
 	SCTP_PARAM_HEARTBEAT_INFO		= cpu_to_be16(1),
@@ -207,7 +207,7 @@ typedef enum {
 	SCTP_PARAM_RESET_RESPONSE		= cpu_to_be16(0x0010),
 	SCTP_PARAM_RESET_ADD_OUT_STREAMS	= cpu_to_be16(0x0011),
 	SCTP_PARAM_RESET_ADD_IN_STREAMS		= cpu_to_be16(0x0012),
-} sctp_param_t; /* enum */
+}; /* enum */
 
 
 /* RFC 2960 Section 3.2.1
diff --git a/include/net/sctp/sm.h b/include/net/sctp/sm.h
index 47113f2..245eb22 100644
--- a/include/net/sctp/sm.h
+++ b/include/net/sctp/sm.h
@@ -325,19 +325,17 @@ void sctp_generate_heartbeat_event(unsigned long peer);
 void sctp_generate_reconf_event(unsigned long peer);
 void sctp_generate_proto_unreach_event(unsigned long peer);
 
-void sctp_ootb_pkt_free(struct sctp_packet *);
+void sctp_ootb_pkt_free(struct sctp_packet *packet);
 
-struct sctp_association *sctp_unpack_cookie(const struct sctp_endpoint *,
-					    const struct sctp_association *,
-					    struct sctp_chunk *,
+struct sctp_association *sctp_unpack_cookie(const struct sctp_endpoint *ep,
+					    const struct sctp_association *asoc,
+					    struct sctp_chunk *chunk,
 					    gfp_t gfp, int *err,
 					    struct sctp_chunk **err_chk_p);
-int sctp_addip_addr_config(struct sctp_association *, sctp_param_t,
-			   struct sockaddr_storage*, int);
 
 /* 3rd level prototypes */
-__u32 sctp_generate_tag(const struct sctp_endpoint *);
-__u32 sctp_generate_tsn(const struct sctp_endpoint *);
+__u32 sctp_generate_tag(const struct sctp_endpoint *ep);
+__u32 sctp_generate_tsn(const struct sctp_endpoint *ep);
 
 /* Extern declarations for major data structures.  */
 extern sctp_timer_event_t *sctp_timer_events[SCTP_NUM_TIMEOUT_TYPES];
diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index 9f9d40c..3ed2108 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -1882,7 +1882,7 @@ struct __sctp_missing {
  * Report a missing mandatory parameter.
  */
 static int sctp_process_missing_param(const struct sctp_association *asoc,
-				      sctp_param_t paramtype,
+				      enum sctp_param paramtype,
 				      struct sctp_chunk *chunk,
 				      struct sctp_chunk **errp)
 {
-- 
2.1.0
[PATCH net-next 05/11] sctp: remove the typedef sctp_paramhdr_t
This patch is to remove the typedef sctp_paramhdr_t, and replace with
struct sctp_paramhdr in the places where it's using this typedef. It is
also to fix some indents and use sizeof(variable) instead of sizeof(type).

Signed-off-by: Xin Long
---
 include/linux/sctp.h       | 44 ++++++++++++++++++++--------------------
 include/net/sctp/sctp.h    |  2 +-
 include/net/sctp/structs.h |  5 +++--
 net/sctp/associola.c       |  6 +++---
 net/sctp/auth.c            | 22 +++++++++++-----------
 net/sctp/endpointola.c     |  7 ++++---
 net/sctp/sm_make_chunk.c   | 47 +++++++++++++++++++++---------------------
 net/sctp/sm_statefuns.c    |  6 +++---
 net/sctp/socket.c          |  7 ++++---
 net/sctp/stream.c          |  4 ++--
 10 files changed, 79 insertions(+), 71 deletions(-)

diff --git a/include/linux/sctp.h b/include/linux/sctp.h
index ffdccb4..142bb6a 100644
--- a/include/linux/sctp.h
+++ b/include/linux/sctp.h
@@ -162,10 +162,10 @@ enum { SCTP_CHUNK_FLAG_T = 0x01 };
  * Section 3.2.1 Optional/Variable-length Parmaeter Format.
  */
 
-typedef struct sctp_paramhdr {
+struct sctp_paramhdr {
 	__be16 type;
 	__be16 length;
-} sctp_paramhdr_t;
+};
 
 typedef enum {
 
@@ -274,37 +274,37 @@ typedef struct sctp_init_chunk {
 
 /* Section 3.3.2.1. IPv4 Address Parameter (5) */
 typedef struct sctp_ipv4addr_param {
-	sctp_paramhdr_t param_hdr;
+	struct sctp_paramhdr param_hdr;
 	struct in_addr  addr;
 } sctp_ipv4addr_param_t;
 
 /* Section 3.3.2.1. IPv6 Address Parameter (6) */
 typedef struct sctp_ipv6addr_param {
-	sctp_paramhdr_t param_hdr;
+	struct sctp_paramhdr param_hdr;
 	struct in6_addr addr;
 } sctp_ipv6addr_param_t;
 
 /* Section 3.3.2.1 Cookie Preservative (9) */
 typedef struct sctp_cookie_preserve_param {
-	sctp_paramhdr_t param_hdr;
+	struct sctp_paramhdr param_hdr;
 	__be32 lifespan_increment;
 } sctp_cookie_preserve_param_t;
 
 /* Section 3.3.2.1 Host Name Address (11) */
 typedef struct sctp_hostname_param {
-	sctp_paramhdr_t param_hdr;
+	struct sctp_paramhdr param_hdr;
 	uint8_t hostname[0];
 } sctp_hostname_param_t;
 
 /* Section 3.3.2.1 Supported Address Types (12) */
 typedef struct sctp_supported_addrs_param {
-	sctp_paramhdr_t param_hdr;
+	struct sctp_paramhdr param_hdr;
 	__be16 types[0];
 } sctp_supported_addrs_param_t;
 
 /* Appendix A. ECN Capable (32768) */
 typedef struct sctp_ecn_capable_param {
-	sctp_paramhdr_t param_hdr;
+	struct sctp_paramhdr param_hdr;
 } sctp_ecn_capable_param_t;
 
 /* ADDIP Section 3.2.6 Adaptation Layer Indication */
@@ -321,19 +321,19 @@ typedef struct sctp_supported_ext_param {
 
 /* AUTH Section 3.1 Random */
 typedef struct sctp_random_param {
-	sctp_paramhdr_t param_hdr;
+	struct sctp_paramhdr param_hdr;
 	__u8 random_val[0];
 } sctp_random_param_t;
 
 /* AUTH Section 3.2 Chunk List */
 typedef struct sctp_chunks_param {
-	sctp_paramhdr_t param_hdr;
+	struct sctp_paramhdr param_hdr;
 	__u8 chunks[0];
 } sctp_chunks_param_t;
 
 /* AUTH Section 3.3 HMAC Algorithm */
 typedef struct sctp_hmac_algo_param {
-	sctp_paramhdr_t param_hdr;
+	struct sctp_paramhdr param_hdr;
 	__be16 hmac_ids[0];
 } sctp_hmac_algo_param_t;
 
@@ -345,14 +345,14 @@ typedef sctp_init_chunk_t sctp_initack_chunk_t;
 
 /* Section 3.3.3.1 State Cookie (7) */
 typedef struct sctp_cookie_param {
-	sctp_paramhdr_t p;
+	struct sctp_paramhdr p;
 	__u8 body[0];
 } sctp_cookie_param_t;
 
 /* Section 3.3.3.1 Unrecognized Parameters (8) */
 typedef struct sctp_unrecognized_param {
-	sctp_paramhdr_t param_hdr;
-	sctp_paramhdr_t unrecognized;
+	struct sctp_paramhdr param_hdr;
+	struct sctp_paramhdr unrecognized;
 } sctp_unrecognized_param_t;
 
 
@@ -399,7 +399,7 @@ typedef struct sctp_sack_chunk {
  */
 
 typedef struct sctp_heartbeathdr {
-	sctp_paramhdr_t info;
+	struct sctp_paramhdr info;
 } sctp_heartbeathdr_t;
 
 typedef struct sctp_heartbeat_chunk {
@@ -639,7 +639,7 @@ struct sctp_fwdtsn_chunk {
  * report status of ASCONF processing.
  */
 typedef struct sctp_addip_param {
-	sctp_paramhdr_t	param_hdr;
+	struct sctp_paramhdr	param_hdr;
 	__be32		crr_id;
 } sctp_addip_param_t;
 
@@ -724,7 +724,7 @@ struct sctp_reconf_chunk {
 };
 
 struct sctp_strreset_outreq {
-	sctp_paramhdr_t param_hdr;
+	struct sctp_paramhdr param_hdr;
 	__u32 request_seq;
 	__u32 response_seq;
 	__u32 send_reset_at_tsn;
@@ -732,18 +732,18 @@ struct sctp_strreset_outreq {
 };
 
 struct sctp_strreset_inreq {
-	sctp_paramhdr_t param_hdr;
+	struct sctp_paramhdr param_hdr;
 	__u32 request_seq;
 	__u16 list_of_streams[0];
 };
 
 struct sctp_strreset_tsnreq {
-	sctp_paramhdr_t param_hdr;
+	struct sctp_paramhdr param_hdr;
 	__u32 request_seq;
 };
 
 struct
[PATCH net-next 07/11] sctp: remove the typedef sctp_param_action_t
Remove this typedef; there are no places even using it.

Signed-off-by: Xin Long
---
 include/linux/sctp.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/sctp.h b/include/linux/sctp.h
index 5eecc0f..d5c0dda 100644
--- a/include/linux/sctp.h
+++ b/include/linux/sctp.h
@@ -216,12 +216,12 @@ enum sctp_param {
  * not recognize the Parameter Type.
  *
  */
-typedef enum {
+enum {
 	SCTP_PARAM_ACTION_DISCARD	= cpu_to_be16(0x0000),
 	SCTP_PARAM_ACTION_DISCARD_ERR	= cpu_to_be16(0x4000),
 	SCTP_PARAM_ACTION_SKIP		= cpu_to_be16(0x8000),
 	SCTP_PARAM_ACTION_SKIP_ERR	= cpu_to_be16(0xc000),
-} sctp_param_action_t;
+};
 
 enum { SCTP_PARAM_ACTION_MASK = cpu_to_be16(0xc000), };
 
-- 
2.1.0
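The constants this hunk leaves behind are deliberately stored big-endian (cpu_to_be16 in the kernel, htons in user space): the top two bits of an SCTP parameter type encode how a receiver handles an unrecognized parameter (RFC 4960 Section 3.2.1), so a network-byte-order mask can be applied directly to the on-wire field without byte-swapping. A small user-space illustration (hypothetical helper, not kernel code):

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>

/* Network-byte-order action constants, mirroring the kernel enum. */
#define SCTP_PARAM_ACTION_DISCARD      htons(0x0000)
#define SCTP_PARAM_ACTION_DISCARD_ERR  htons(0x4000)
#define SCTP_PARAM_ACTION_SKIP         htons(0x8000)
#define SCTP_PARAM_ACTION_SKIP_ERR     htons(0xc000)
#define SCTP_PARAM_ACTION_MASK         htons(0xc000)

/* Classify an on-wire parameter type as-is: both the mask and the
 * field are in network byte order, so no conversion is needed. */
static uint16_t param_action(uint16_t be_param_type)
{
    return be_param_type & SCTP_PARAM_ACTION_MASK;
}
```

This is why the series keeps the be16 storage even while dropping the typedef names: the values compare directly against packet bytes.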
[PATCH net-next 01/11] sctp: remove the typedef sctp_sctphdr_t
This patch is to remove the typedef sctp_sctphdr_t, and replace with
struct sctphdr in the places where it's using this typedef. It is also
to fix some indents and use sizeof(variable) instead of sizeof(type).

Signed-off-by: Xin Long
---
 include/linux/sctp.h                    |  4 ++--
 net/netfilter/ipset/ip_set_getport.c    |  4 ++--
 net/netfilter/ipvs/ip_vs_core.c         |  6 +++---
 net/netfilter/ipvs/ip_vs_proto_sctp.c   | 15 +++++++--------
 net/netfilter/nf_conntrack_proto_sctp.c |  2 +-
 net/netfilter/nf_nat_proto_sctp.c       |  2 +-
 net/netfilter/xt_sctp.c                 | 16 ++++++++--------
 7 files changed, 24 insertions(+), 25 deletions(-)

diff --git a/include/linux/sctp.h b/include/linux/sctp.h
index 7a4804c..85540ec 100644
--- a/include/linux/sctp.h
+++ b/include/linux/sctp.h
@@ -57,12 +57,12 @@
 #include
 
 /* Section 3.1.  SCTP Common Header Format */
-typedef struct sctphdr {
+struct sctphdr {
 	__be16 source;
 	__be16 dest;
 	__be32 vtag;
 	__le32 checksum;
-} sctp_sctphdr_t;
+};
 
 static inline struct sctphdr *sctp_hdr(const struct sk_buff *skb)
 {
diff --git a/net/netfilter/ipset/ip_set_getport.c b/net/netfilter/ipset/ip_set_getport.c
index 42c3e3b..3f09cdb 100644
--- a/net/netfilter/ipset/ip_set_getport.c
+++ b/net/netfilter/ipset/ip_set_getport.c
@@ -38,8 +38,8 @@ get_port(const struct sk_buff *skb, int protocol, unsigned int protooff,
 		break;
 	}
 	case IPPROTO_SCTP: {
-		sctp_sctphdr_t _sh;
-		const sctp_sctphdr_t *sh;
+		struct sctphdr _sh;
+		const struct sctphdr *sh;
 
 		sh = skb_header_pointer(skb, protooff, sizeof(_sh), &_sh);
 		if (!sh)
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index ad99c1c..6f39af9 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -1038,8 +1038,8 @@ static int ip_vs_out_icmp_v6(struct netns_ipvs *ipvs, struct sk_buff *skb,
 static inline int is_sctp_abort(const struct sk_buff *skb, int nh_len)
 {
 	sctp_chunkhdr_t *sch, schunk;
-	sch = skb_header_pointer(skb, nh_len + sizeof(sctp_sctphdr_t),
-				 sizeof(schunk), &schunk);
+	sch = skb_header_pointer(skb, nh_len + sizeof(struct sctphdr),
+				 sizeof(schunk), &schunk);
 	if (sch == NULL)
 		return 0;
 	if (sch->type == SCTP_CID_ABORT)
@@ -1072,7 +1072,7 @@ static inline bool is_new_conn(const struct sk_buff *skb,
 	case IPPROTO_SCTP: {
 		sctp_chunkhdr_t *sch, schunk;
 
-		sch = skb_header_pointer(skb, iph->len + sizeof(sctp_sctphdr_t),
+		sch = skb_header_pointer(skb, iph->len + sizeof(struct sctphdr),
 					 sizeof(schunk), &schunk);
 		if (sch == NULL)
 			return false;
diff --git a/net/netfilter/ipvs/ip_vs_proto_sctp.c b/net/netfilter/ipvs/ip_vs_proto_sctp.c
index 56f8e4b..6b38cad 100644
--- a/net/netfilter/ipvs/ip_vs_proto_sctp.c
+++ b/net/netfilter/ipvs/ip_vs_proto_sctp.c
@@ -16,15 +16,14 @@ sctp_conn_schedule(struct netns_ipvs *ipvs, int af, struct sk_buff *skb,
 {
 	struct ip_vs_service *svc;
 	sctp_chunkhdr_t _schunkh, *sch;
-	sctp_sctphdr_t *sh, _sctph;
+	struct sctphdr *sh, _sctph;
 	__be16 _ports[2], *ports = NULL;
 
 	if (likely(!ip_vs_iph_icmp(iph))) {
 		sh = skb_header_pointer(skb, iph->len, sizeof(_sctph), &_sctph);
 		if (sh) {
-			sch = skb_header_pointer(
-				skb, iph->len + sizeof(sctp_sctphdr_t),
-				sizeof(_schunkh), &_schunkh);
+			sch = skb_header_pointer(skb, iph->len + sizeof(_sctph),
+						 sizeof(_schunkh), &_schunkh);
 			if (sch && (sch->type == SCTP_CID_INIT ||
 				    sysctl_sloppy_sctp(ipvs)))
 				ports = &sh->source;
@@ -77,7 +76,7 @@ sctp_conn_schedule(struct netns_ipvs *ipvs, int af, struct sk_buff *skb,
 	return 1;
 }
 
-static void sctp_nat_csum(struct sk_buff *skb, sctp_sctphdr_t *sctph,
+static void sctp_nat_csum(struct sk_buff *skb, struct sctphdr *sctph,
 			  unsigned int sctphoff)
 {
 	sctph->checksum = sctp_compute_cksum(skb, sctphoff);
@@ -88,7 +87,7 @@ static int
 sctp_snat_handler(struct sk_buff *skb, struct ip_vs_protocol *pp,
 		  struct ip_vs_conn *cp, struct ip_vs_iphdr *iph)
 {
-	sctp_sctphdr_t *sctph;
+	struct sctphdr *sctph;
 	unsigned int sctphoff = iph->len;
 	bool payload_csum = false;
 
@@ -135,7 +134,7 @@ static int
 sctp_dnat_handler(struct sk_buff *skb, struct ip_vs_protocol *pp,
 		  struct ip_vs_conn *cp, struct ip_vs_iphdr *iph)
 {
-	sctp_sctphdr_t *sctph;
+	struct
[PATCH net-next 04/11] sctp: remove the typedef sctp_cid_action_t
Remove this typedef; nothing even uses it. Signed-off-by: Xin Long
---
 include/linux/sctp.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/sctp.h b/include/linux/sctp.h
index 6d7b884..ffdccb4 100644
--- a/include/linux/sctp.h
+++ b/include/linux/sctp.h
@@ -117,12 +117,12 @@ enum sctp_cid {
  * the action that must be taken if the processing endpoint does not
  * recognize the Chunk Type.
  */
-typedef enum {
+enum {
 	SCTP_CID_ACTION_DISCARD		= 0x00,
 	SCTP_CID_ACTION_DISCARD_ERR	= 0x40,
 	SCTP_CID_ACTION_SKIP		= 0x80,
 	SCTP_CID_ACTION_SKIP_ERR	= 0xc0,
-} sctp_cid_action_t;
+};
 
 enum { SCTP_CID_ACTION_MASK = 0xc0, };
-- 
2.1.0
[PATCH net-next 03/11] sctp: remove the typedef sctp_cid_t
This patch is to remove the typedef sctp_cid_t, and replace with enum sctp_cid in the places where it's using this typedef. Signed-off-by: Xin Long --- include/linux/sctp.h | 4 ++-- include/net/sctp/auth.h | 6 -- include/net/sctp/constants.h | 4 ++-- include/net/sctp/structs.h | 2 +- net/sctp/auth.c | 6 +++--- net/sctp/sm_make_chunk.c | 4 ++-- net/sctp/sm_statetable.c | 4 ++-- 7 files changed, 16 insertions(+), 14 deletions(-) diff --git a/include/linux/sctp.h b/include/linux/sctp.h index 9ad5b9e..6d7b884 100644 --- a/include/linux/sctp.h +++ b/include/linux/sctp.h @@ -82,7 +82,7 @@ struct sctp_chunkhdr { * Value field. It takes a value from 0 to 254. The value of 255 is * reserved for future use as an extension field. */ -typedef enum { +enum sctp_cid { SCTP_CID_DATA = 0, SCTP_CID_INIT = 1, SCTP_CID_INIT_ACK = 2, @@ -109,7 +109,7 @@ typedef enum { SCTP_CID_ASCONF = 0xC1, SCTP_CID_ASCONF_ACK = 0x80, SCTP_CID_RECONF = 0x82, -} sctp_cid_t; /* enum */ +}; /* enum */ /* Section 3.2 diff --git a/include/net/sctp/auth.h b/include/net/sctp/auth.h index 9b9fb12..171244b 100644 --- a/include/net/sctp/auth.h +++ b/include/net/sctp/auth.h @@ -97,8 +97,10 @@ void sctp_auth_asoc_set_default_hmac(struct sctp_association *asoc, struct sctp_hmac_algo_param *hmacs); int sctp_auth_asoc_verify_hmac_id(const struct sctp_association *asoc, __be16 hmac_id); -int sctp_auth_send_cid(sctp_cid_t chunk, const struct sctp_association *asoc); -int sctp_auth_recv_cid(sctp_cid_t chunk, const struct sctp_association *asoc); +int sctp_auth_send_cid(enum sctp_cid chunk, + const struct sctp_association *asoc); +int sctp_auth_recv_cid(enum sctp_cid chunk, + const struct sctp_association *asoc); void sctp_auth_calculate_hmac(const struct sctp_association *asoc, struct sk_buff *skb, struct sctp_auth_chunk *auth, gfp_t gfp); diff --git a/include/net/sctp/constants.h b/include/net/sctp/constants.h index b07a745..02e867b 100644 --- a/include/net/sctp/constants.h +++ b/include/net/sctp/constants.h @@
-130,7 +130,7 @@ typedef enum { */ typedef union { - sctp_cid_t chunk; + enum sctp_cid chunk; sctp_event_timeout_t timeout; sctp_event_other_t other; sctp_event_primitive_t primitive; @@ -141,7 +141,7 @@ static inline sctp_subtype_t\ SCTP_ST_## _name (_type _arg) \ { sctp_subtype_t _retval; _retval._elt = _arg; return _retval; } -SCTP_SUBTYPE_CONSTRUCTOR(CHUNK,sctp_cid_t, chunk) +SCTP_SUBTYPE_CONSTRUCTOR(CHUNK,enum sctp_cid, chunk) SCTP_SUBTYPE_CONSTRUCTOR(TIMEOUT, sctp_event_timeout_t, timeout) SCTP_SUBTYPE_CONSTRUCTOR(OTHER,sctp_event_other_t, other) SCTP_SUBTYPE_CONSTRUCTOR(PRIMITIVE,sctp_event_primitive_t, primitive) diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h index e26763b..9e9605e 100644 --- a/include/net/sctp/structs.h +++ b/include/net/sctp/structs.h @@ -1297,7 +1297,7 @@ int sctp_has_association(struct net *net, const union sctp_addr *laddr, int sctp_verify_init(struct net *net, const struct sctp_endpoint *ep, const struct sctp_association *asoc, -sctp_cid_t, sctp_init_chunk_t *peer_init, +enum sctp_cid cid, sctp_init_chunk_t *peer_init, struct sctp_chunk *chunk, struct sctp_chunk **err_chunk); int sctp_process_init(struct sctp_association *, struct sctp_chunk *chunk, const union sctp_addr *peer, diff --git a/net/sctp/auth.c b/net/sctp/auth.c index f99d485..7171dd3 100644 --- a/net/sctp/auth.c +++ b/net/sctp/auth.c @@ -632,7 +632,7 @@ void sctp_auth_asoc_set_default_hmac(struct sctp_association *asoc, /* Check to see if the given chunk is supposed to be authenticated */ -static int __sctp_auth_cid(sctp_cid_t chunk, struct sctp_chunks_param *param) +static int __sctp_auth_cid(enum sctp_cid chunk, struct sctp_chunks_param *param) { unsigned short len; int found = 0; @@ -668,7 +668,7 @@ static int __sctp_auth_cid(sctp_cid_t chunk, struct sctp_chunks_param *param) } /* Check if peer requested that this chunk is authenticated */ -int sctp_auth_send_cid(sctp_cid_t chunk, const struct sctp_association *asoc) +int sctp_auth_send_cid(enum 
sctp_cid chunk, const struct sctp_association *asoc) { if (!asoc) return 0; @@ -680,7 +680,7 @@ int sctp_auth_send_cid(sctp_cid_t chunk, const struct sctp_association *asoc) } /* Check if we requested that peer authenticate this chunk. */ -int sctp_auth_recv_cid(sctp_cid_t chunk, const
[PATCH net-next 02/11] sctp: remove the typedef sctp_chunkhdr_t
This patch is to remove the typedef sctp_chunkhdr_t, and replace with struct sctp_chunkhdr in the places where it's using this typedef. It is also to fix some indents and use sizeof(variable) instead of sizeof(type)., especially in sctp_new. Signed-off-by: Xin Long--- include/linux/sctp.h| 34 - include/net/sctp/sctp.h | 2 +- net/netfilter/ipvs/ip_vs_core.c | 4 +- net/netfilter/ipvs/ip_vs_proto_sctp.c | 6 +-- net/netfilter/nf_conntrack_proto_sctp.c | 29 +++--- net/netfilter/xt_sctp.c | 4 +- net/sctp/input.c| 20 +- net/sctp/inqueue.c | 15 net/sctp/sm_make_chunk.c| 17 + net/sctp/sm_sideeffect.c| 5 ++- net/sctp/sm_statefuns.c | 67 + net/sctp/ulpevent.c | 2 +- 12 files changed, 102 insertions(+), 103 deletions(-) diff --git a/include/linux/sctp.h b/include/linux/sctp.h index 85540ec..9ad5b9e 100644 --- a/include/linux/sctp.h +++ b/include/linux/sctp.h @@ -70,11 +70,11 @@ static inline struct sctphdr *sctp_hdr(const struct sk_buff *skb) } /* Section 3.2. Chunk Field Descriptions. */ -typedef struct sctp_chunkhdr { +struct sctp_chunkhdr { __u8 type; __u8 flags; __be16 length; -} sctp_chunkhdr_t; +}; /* Section 3.2. Chunk Type Values. 
@@ -236,8 +236,8 @@ typedef struct sctp_datahdr { } sctp_datahdr_t; typedef struct sctp_data_chunk { -sctp_chunkhdr_t chunk_hdr; -sctp_datahdr_t data_hdr; + struct sctp_chunkhdr chunk_hdr; + sctp_datahdr_t data_hdr; } sctp_data_chunk_t; /* DATA Chuck Specific Flags */ @@ -267,7 +267,7 @@ typedef struct sctp_inithdr { } sctp_inithdr_t; typedef struct sctp_init_chunk { - sctp_chunkhdr_t chunk_hdr; + struct sctp_chunkhdr chunk_hdr; sctp_inithdr_t init_hdr; } sctp_init_chunk_t; @@ -386,7 +386,7 @@ typedef struct sctp_sackhdr { } sctp_sackhdr_t; typedef struct sctp_sack_chunk { - sctp_chunkhdr_t chunk_hdr; + struct sctp_chunkhdr chunk_hdr; sctp_sackhdr_t sack_hdr; } sctp_sack_chunk_t; @@ -403,7 +403,7 @@ typedef struct sctp_heartbeathdr { } sctp_heartbeathdr_t; typedef struct sctp_heartbeat_chunk { - sctp_chunkhdr_t chunk_hdr; + struct sctp_chunkhdr chunk_hdr; sctp_heartbeathdr_t hb_hdr; } sctp_heartbeat_chunk_t; @@ -413,7 +413,7 @@ typedef struct sctp_heartbeat_chunk { * chunk descriptor. */ typedef struct sctp_abort_chunk { -sctp_chunkhdr_t uh; + struct sctp_chunkhdr uh; } sctp_abort_chunk_t; @@ -425,8 +425,8 @@ typedef struct sctp_shutdownhdr { } sctp_shutdownhdr_t; struct sctp_shutdown_chunk_t { -sctp_chunkhdr_tchunk_hdr; -sctp_shutdownhdr_t shutdown_hdr; + struct sctp_chunkhdr chunk_hdr; + sctp_shutdownhdr_t shutdown_hdr; }; /* RFC 2960. 
Section 3.3.10 Operation Error (ERROR) (9) */ @@ -438,8 +438,8 @@ typedef struct sctp_errhdr { } sctp_errhdr_t; typedef struct sctp_operr_chunk { -sctp_chunkhdr_t chunk_hdr; - sctp_errhdr_t err_hdr; + struct sctp_chunkhdr chunk_hdr; + sctp_errhdr_t err_hdr; } sctp_operr_chunk_t; /* RFC 2960 3.3.10 - Operation Error @@ -528,7 +528,7 @@ typedef struct sctp_ecnehdr { } sctp_ecnehdr_t; typedef struct sctp_ecne_chunk { - sctp_chunkhdr_t chunk_hdr; + struct sctp_chunkhdr chunk_hdr; sctp_ecnehdr_t ence_hdr; } sctp_ecne_chunk_t; @@ -540,7 +540,7 @@ typedef struct sctp_cwrhdr { } sctp_cwrhdr_t; typedef struct sctp_cwr_chunk { - sctp_chunkhdr_t chunk_hdr; + struct sctp_chunkhdr chunk_hdr; sctp_cwrhdr_t cwr_hdr; } sctp_cwr_chunk_t; @@ -649,7 +649,7 @@ typedef struct sctp_addiphdr { } sctp_addiphdr_t; typedef struct sctp_addip_chunk { - sctp_chunkhdr_t chunk_hdr; + struct sctp_chunkhdr chunk_hdr; sctp_addiphdr_t addip_hdr; } sctp_addip_chunk_t; @@ -709,7 +709,7 @@ typedef struct sctp_authhdr { } sctp_authhdr_t; typedef struct sctp_auth_chunk { - sctp_chunkhdr_t chunk_hdr; + struct sctp_chunkhdr chunk_hdr; sctp_authhdr_t auth_hdr; } sctp_auth_chunk_t; @@ -719,7 +719,7 @@ struct sctp_infox { }; struct sctp_reconf_chunk { - sctp_chunkhdr_t chunk_hdr; + struct sctp_chunkhdr chunk_hdr; __u8 params[0]; }; diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h index 069582e..d756bd0 100644 --- a/include/net/sctp/sctp.h +++ b/include/net/sctp/sctp.h @@ -478,7 +478,7 @@ _sctp_walk_errors((err), (chunk_hdr), ntohs((chunk_hdr)->length)) #define _sctp_walk_errors(err, chunk_hdr, end)\ for (err = (sctp_errhdr_t *)((void *)chunk_hdr + \ - sizeof(sctp_chunkhdr_t));\ + sizeof(struct sctp_chunkhdr));\ (void *)err <= (void *)chunk_hdr + end - ntohs(err->length)
[PATCH net-next 00/11] sctp: remove typedefs from structures part 1
As we know, typedefs are discouraged in the kernel; checkpatch.pl even warns about them. sctp currently uses them for many structures, and all of these typedefs should be removed. As the first part, this patchset removes them for 11 basic structures in linux/sctp.h. It also fixes some indentation. No code logic is changed by these patches; they are pure cleanup.

Xin Long (11):
  sctp: remove the typedef sctp_sctphdr_t
  sctp: remove the typedef sctp_chunkhdr_t
  sctp: remove the typedef sctp_cid_t
  sctp: remove the typedef sctp_cid_action_t
  sctp: remove the typedef sctp_paramhdr_t
  sctp: remove the typedef sctp_param_t
  sctp: remove the typedef sctp_param_action_t
  sctp: remove the typedef sctp_datahdr_t
  sctp: remove the typedef sctp_data_chunk_t
  sctp: remove the typedef sctp_inithdr_t
  sctp: remove the typedef sctp_init_chunk_t

 include/linux/sctp.h                    | 118 ++--
 include/net/sctp/auth.h                 |   6 +-
 include/net/sctp/command.h              |   4 +-
 include/net/sctp/constants.h            |   6 +-
 include/net/sctp/sctp.h                 |   4 +-
 include/net/sctp/sm.h                   |  16 ++--
 include/net/sctp/structs.h              |   9 ++-
 net/netfilter/ipset/ip_set_getport.c    |   4 +-
 net/netfilter/ipvs/ip_vs_core.c         |  10 +--
 net/netfilter/ipvs/ip_vs_proto_sctp.c   |  21 +++--
 net/netfilter/nf_conntrack_proto_sctp.c |  33 
 net/netfilter/nf_nat_proto_sctp.c       |   2 +-
 net/netfilter/xt_sctp.c                 |  20 ++---
 net/sctp/associola.c                    |   6 +-
 net/sctp/auth.c                         |  28 ---
 net/sctp/endpointola.c                  |   7 +-
 net/sctp/input.c                        |  24 +++---
 net/sctp/inqueue.c                      |  15 ++--
 net/sctp/output.c                       |   4 +-
 net/sctp/sm_make_chunk.c                |  80 +--
 net/sctp/sm_sideeffect.c                |   7 +-
 net/sctp/sm_statefuns.c                 | 132 
 net/sctp/sm_statetable.c                |   4 +-
 net/sctp/socket.c                       |   7 +-
 net/sctp/stream.c                       |   4 +-
 net/sctp/ulpevent.c                     |   2 +-
 net/sctp/ulpqueue.c                     |   2 +-
 27 files changed, 292 insertions(+), 283 deletions(-)
-- 
2.1.0
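The mechanical pattern this series applies everywhere can be seen in miniature below. This is a hedged userspace sketch: uint16_t/uint32_t stand in for the kernel's __be16/__be32/__le32, and parse_common_header() is an invented caller, not code from the series.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Before: typedef struct sctphdr { ... } sctp_sctphdr_t;
 * After:  a plain tagged struct, spelled "struct sctphdr" at each use. */
struct sctphdr {
	uint16_t source;
	uint16_t dest;
	uint32_t vtag;
	uint32_t checksum;
};

/* Callers now name the struct directly; the series also prefers
 * sizeof(variable) over sizeof(type), as below. */
static int parse_common_header(const void *buf, struct sctphdr *out)
{
	memcpy(out, buf, sizeof(*out));	/* sizeof(*out), not sizeof(struct sctphdr) */
	return (int)sizeof(*out);
}
```

The point of the cleanup is purely stylistic: the type's identity is unchanged, so every hunk in the series is a textual substitution.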
[PATCH] vmalloc: respect the GFP_NOIO and GFP_NOFS flags
The __vmalloc function takes a gfp_mask parameter with the allocation flags; however, it doesn't fully respect the GFP_NOIO and GFP_NOFS flags. The pages are allocated with the specified gfp flags, but the pagetables are always allocated with GFP_KERNEL. This allocation can cause unexpected recursion into the filesystem or I/O subsystem.

It is not practical to extend the page table allocation routines with gfp flags, because that would require modifying architecture-specific code in all architectures. However, a process can temporarily request that all its allocations are done with GFP_NOFS or GFP_NOIO with the functions memalloc_nofs_save and memalloc_noio_save.

This patch makes the vmalloc code use memalloc_nofs_save or memalloc_noio_save if the supplied gfp flags do not contain __GFP_FS or __GFP_IO. It fixes some possible deadlocks in drivers/mtd/ubi/io.c, fs/gfs2/, fs/btrfs/free-space-tree.c, fs/ubifs/ and fs/nfs/blocklayout/extent_tree.c, where __vmalloc is used with the GFP_NOFS flag. The patch also simplifies code in dm-bufio.c, dm-ioctl.c and fs/xfs/kmem.c by removing explicit calls to memalloc_nofs_save and memalloc_noio_save before the call to __vmalloc.
Signed-off-by: Mikulas Patocka--- drivers/md/dm-bufio.c | 24 +--- drivers/md/dm-ioctl.c |6 +- fs/xfs/kmem.c | 14 -- mm/util.c |6 +++--- mm/vmalloc.c | 18 +- 5 files changed, 22 insertions(+), 46 deletions(-) Index: linux-2.6/mm/vmalloc.c === --- linux-2.6.orig/mm/vmalloc.c +++ linux-2.6/mm/vmalloc.c @@ -31,6 +31,7 @@ #include #include #include +#include #include #include @@ -1670,6 +1671,8 @@ static void *__vmalloc_area_node(struct unsigned int nr_pages, array_size, i; const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO; const gfp_t alloc_mask = gfp_mask | __GFP_HIGHMEM | __GFP_NOWARN; + unsigned noio_flag; + int r; nr_pages = get_vm_area_size(area) >> PAGE_SHIFT; array_size = (nr_pages * sizeof(struct page *)); @@ -1712,8 +1715,21 @@ static void *__vmalloc_area_node(struct cond_resched(); } - if (map_vm_area(area, prot, pages)) + if (unlikely(!(gfp_mask & __GFP_IO))) + noio_flag = memalloc_noio_save(); + else if (unlikely(!(gfp_mask & __GFP_FS))) + noio_flag = memalloc_nofs_save(); + + r = map_vm_area(area, prot, pages); + + if (unlikely(!(gfp_mask & __GFP_IO))) + memalloc_noio_restore(noio_flag); + else if (unlikely(!(gfp_mask & __GFP_FS))) + memalloc_nofs_restore(noio_flag); + + if (unlikely(r)) goto fail; + return area->addr; fail: Index: linux-2.6/mm/util.c === --- linux-2.6.orig/mm/util.c +++ linux-2.6/mm/util.c @@ -351,10 +351,10 @@ void *kvmalloc_node(size_t size, gfp_t f void *ret; /* -* vmalloc uses GFP_KERNEL for some internal allocations (e.g page tables) -* so the given set of flags has to be compatible. +* vmalloc uses blocking allocations for some internal allocations +* (e.g page tables) so the given set of flags has to be compatible. 
*/ - WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL); + WARN_ON_ONCE(!gfpflags_allow_blocking(flags)); /* * We want to attempt a large physically contiguous block first because Index: linux-2.6/drivers/md/dm-bufio.c === --- linux-2.6.orig/drivers/md/dm-bufio.c +++ linux-2.6/drivers/md/dm-bufio.c @@ -386,9 +386,6 @@ static void __cache_size_refresh(void) static void *alloc_buffer_data(struct dm_bufio_client *c, gfp_t gfp_mask, enum data_mode *data_mode) { - unsigned noio_flag; - void *ptr; - if (c->block_size <= DM_BUFIO_BLOCK_SIZE_SLAB_LIMIT) { *data_mode = DATA_MODE_SLAB; return kmem_cache_alloc(DM_BUFIO_CACHE(c), gfp_mask); @@ -402,26 +399,7 @@ static void *alloc_buffer_data(struct dm } *data_mode = DATA_MODE_VMALLOC; - - /* -* __vmalloc allocates the data pages and auxiliary structures with -* gfp_flags that were specified, but pagetables are always allocated -* with GFP_KERNEL, no matter what was specified as gfp_mask. -* -* Consequently, we must set per-process flag PF_MEMALLOC_NOIO so that -* all allocations done by this process (including pagetables) are done -* as if GFP_NOIO was specified. -*/ - - if (gfp_mask & __GFP_NORETRY) - noio_flag = memalloc_noio_save(); - - ptr = __vmalloc(c->block_size, gfp_mask, PAGE_KERNEL); - - if (gfp_mask &
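The dm-bufio hunk above removes an open-coded version of exactly this save/restore dance. The pattern itself can be modeled in userspace; the task-flag mechanics below are a simplified stand-in for the kernel's memalloc_noio_save()/memalloc_noio_restore() helpers, not the real implementation.

```c
#include <assert.h>

/* Model of the per-task flag: while it is set, every nested allocation
 * (page tables included) must behave as if GFP_NOIO were passed. */
#define PF_MEMALLOC_NOIO 0x1

static unsigned int current_flags;	/* stand-in for current->flags */

/* Set the flag and return its previous state so scopes can nest. */
static unsigned int memalloc_noio_save(void)
{
	unsigned int old = current_flags & PF_MEMALLOC_NOIO;

	current_flags |= PF_MEMALLOC_NOIO;
	return old;
}

/* Restore only the saved bit; an enclosing NOIO scope stays in force. */
static void memalloc_noio_restore(unsigned int old)
{
	current_flags = (current_flags & ~PF_MEMALLOC_NOIO) | old;
}

/* Stand-in for map_vm_area(): reports whether I/O would be allowed. */
static int nested_alloc_allows_io(void)
{
	return !(current_flags & PF_MEMALLOC_NOIO);
}
```

The save/restore (rather than set/clear) shape is what makes the patch safe when __vmalloc is itself called from a section that already dropped __GFP_IO.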
Re: kernel (master) build failure w. !CONFIG_NET_RX_BUSY_POLL
On Thu, 2017-06-29 at 15:35 -0400, David Miller wrote:
> From: Mike Galbraith
> Date: Wed, 28 Jun 2017 10:01:00 +0200
> 
> > Greetings network wizards,
> > 
> > The latest RT explicitly disables CONFIG_NET_RX_BUSY_POLL, thus
> > uncovering $subject. Below is what I did about it.
> > 
> > -Mike
> > 
> > net: Move napi_hash_add/del() inside CONFIG_NET_RX_BUSY_POLL
> > 
> > Since 545cd5e5ec54 ("net: Busy polling should ignore sender CPUs"),
> > kernel build fails when CONFIG_NET_RX_BUSY_POLL is disabled. Move
> > napi_hash_add/del() accordingly.
> > 
> > Banged-upon-by: Mike Galbraith
> 
> First, this is in no way a standard signoff or ack tag.

It's not intended to be; it's intended to stress that this is not a submission, it's a local fix for a problem that network folks will likely fix up differently... but who knows, showing what I did about it after taking a look could perhaps save a wee bit of labor, so... Hohum, so much for good intentions. I thought I was clearly showing that separation of the polling business was not as complete as other changes led me to believe it is intended to be.

> Second, you must provide the complete set of build failure details
> so that anyone just reading your posting can understand and fully
> review your patch without having to go anywhere else for the
> pertinent information.

Here's the naked failure instead.

  CC      net/core/dev.o
net/core/dev.c: In function 'napi_hash_add':
net/core/dev.c:5315:43: error: 'MIN_NAPI_ID' undeclared (first use in this function)
   if (unlikely(++napi_gen_id < MIN_NAPI_ID))
                                ^
net/core/dev.c:5315:43: note: each undeclared identifier is reported only once for each function it appears in
scripts/Makefile.build:302: recipe for target 'net/core/dev.o' failed
make[1]: *** [net/core/dev.o] Error 1
Makefile:1663: recipe for target 'net/core/dev.o' failed
make: *** [net/core/dev.o] Error 2

-Mike
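For context on the line the compiler flags: MIN_NAPI_ID reserves the low NAPI ids so busy-poll code can tell valid ids apart, and 545cd5e5ec54 defines it under the busy-poll header, which is why napi_hash_add() stops building once CONFIG_NET_RX_BUSY_POLL is off. A userspace sketch of that wrap-around logic follows; the constant's value here is a placeholder, not the kernel's (upstream derives it from NR_CPUS).

```c
#include <assert.h>

/* Placeholder for the reserved-id threshold from include/net/busy_poll.h. */
#define MIN_NAPI_ID 16U

static unsigned int napi_gen_id = (unsigned int)-1;	/* force a wrap */

/* Simplified id generator from napi_hash_add(): after the counter wraps,
 * skip the reserved [0, MIN_NAPI_ID) range. */
static unsigned int next_napi_id(void)
{
	if (++napi_gen_id < MIN_NAPI_ID)
		napi_gen_id = MIN_NAPI_ID;
	return napi_gen_id;
}
```

Nothing in this logic needs busy polling, which is Mike's point: only the constant lives behind the config option, so either the constant or the hash add/del pair has to move.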
Re: [PATCH] mm: convert three more cases to kvmalloc
On Thu, 29 Jun 2017, Michal Hocko wrote:

> On Wed 28-06-17 23:24:10, Mikulas Patocka wrote:
> [...]
> > From: Mikulas Patocka
> >
> > The patch a7c3e901 ("mm: introduce kv[mz]alloc helpers") converted a lot
> > of kernel code to kvmalloc. This patch converts three more forgotten
> > cases.
>
> Thanks! I have two remarks below but other than that feel free to add
>
> > Signed-off-by: Mikulas Patocka
>
> Acked-by: Michal Hocko
> [...]
> > Index: linux-2.6/kernel/bpf/syscall.c
> > ===
> > --- linux-2.6.orig/kernel/bpf/syscall.c
> > +++ linux-2.6/kernel/bpf/syscall.c
> > @@ -58,16 +58,7 @@ void *bpf_map_area_alloc(size_t size)
> >  	 * trigger under memory pressure as we really just want to
> >  	 * fail instead.
> >  	 */
> > -	const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
> > -	void *area;
> > -
> > -	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
> > -		area = kmalloc(size, GFP_USER | flags);
> > -		if (area != NULL)
> > -			return area;
> > -	}
> > -
> > -	return __vmalloc(size, GFP_KERNEL | flags, PAGE_KERNEL);
> > +	return kvmalloc(size, GFP_USER | __GFP_NOWARN | __GFP_NORETRY |
> > 			__GFP_ZERO);
>
> kvzalloc without additional flags would be more appropriate.
> __GFP_NORETRY is explicitly documented as non-supported

How is __GFP_NORETRY non-supported?

> and NOWARN wouldn't be applied everywhere in the vmalloc path.

__GFP_NORETRY and __GFP_NOWARN wouldn't be applied in the page-table allocation, and they would be applied in the page allocation - that seems acceptable.

But the problem here is that if the system is under memory stress, __GFP_NORETRY allocations would randomly fail (they would fail, for example, if there's plenty of free swap space and the system is busy swapping), and that would make the BPF creation code randomly fail.

BPF maintainers, please explain: how are you dealing with the random memory allocation failures? Is there some other code in the BPF stack that retries the failed allocations?
> > }
> >
> > void bpf_map_area_free(void *area)

> > Index: linux-2.6/kernel/cgroup/cgroup-v1.c
> > ===
> > --- linux-2.6.orig/kernel/cgroup/cgroup-v1.c
> > +++ linux-2.6/kernel/cgroup/cgroup-v1.c
> > @@ -184,15 +184,10 @@ struct cgroup_pidlist {
> >  /*
> >   * The following two functions "fix" the issue where there are more pids
> >   * than kmalloc will give memory for; in such cases, we use vmalloc/vfree.
> > - * TODO: replace with a kernel-wide solution to this problem
> >   */
> > -#define PIDLIST_TOO_LARGE(c) ((c) * sizeof(pid_t) > (PAGE_SIZE * 2))
> >  static void *pidlist_allocate(int count)
> >  {
> > -	if (PIDLIST_TOO_LARGE(count))
> > -		return vmalloc(count * sizeof(pid_t));
> > -	else
> > -		return kmalloc(count * sizeof(pid_t), GFP_KERNEL);
> > +	return kvmalloc(count * sizeof(pid_t), GFP_KERNEL);
> >  }
>
> I would rather use kvmalloc_array to have an overflow protection as
> well.

Yes.

Mikulas

> >  static void pidlist_free(void *p)
>
> --
> Michal Hocko
> SUSE Labs
>
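Michal's kvmalloc_array suggestion is about overflow protection: `count * sizeof(pid_t)` can wrap and silently allocate a too-small buffer, while the array helper refuses the multiplication instead. A userspace sketch of the check it performs (the real helper lives in the kernel's mm headers; alloc_array_checked is an invented name):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate n elements of the given size, failing cleanly if n * size
 * would overflow size_t instead of allocating a truncated buffer. */
static void *alloc_array_checked(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* n * size would overflow */
	return malloc(n * size);
}
```

With the check in the helper, every converted call site like pidlist_allocate() gets the protection for free rather than each open-coding it.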
Re: RFC: bring UP 'lo' by default after creating new netns
"Mahesh Bandewar (महेश बंडेवार)" writes:

> Creation of new network namespace is almost always followed up by
> bringing up the loopback device.
>
>   ip netns add foo
>   ip -netns foo link set lo up
>
> I'm not sure if there are any consequences of bringing the device UP
> at the creation of network-namespace.

Hard coded in net/core/net_namespace.c:copy_net_ns is definitely the wrong place in the code for something like this. If this lives anywhere it should live in drivers/net/loopback.c, or possibly in net/core/dev.c:net_dev_init.

If we want this, we want to match what we do when we set up the primary network namespace, just so that there are no unneeded surprises with network namespaces.

Eric
Re: [PATCH v1 1/2] dt-binding: ptp: add bindings document for dte based ptp clock
Hi Rob, On 17-06-29 03:40 PM, Rob Herring wrote: On Tue, Jun 27, 2017 at 12:10 AM, Scott Brandenwrote: Hi Rob/Florian, Thanks for input but still don't see any need for SoC specific compatible stings. IP revision specific yes. On 17-06-22 06:04 PM, Florian Fainelli wrote: On 06/22/2017 05:42 PM, Scott Branden wrote: On 17-06-21 08:19 PM, Rob Herring wrote: On Tue, Jun 20, 2017 at 3:48 PM, Scott Branden wrote: Hi Rob, On 17-06-18 07:04 AM, Rob Herring wrote: On Mon, Jun 12, 2017 at 01:26:00PM -0700, Arun Parameswaran wrote: Add device tree binding documentation for the Broadcom DTE PTP clock driver. Signed-off-by: Arun Parameswaran --- Documentation/devicetree/bindings/ptp/brcm,ptp-dte.txt | 13 + 1 file changed, 13 insertions(+) create mode 100644 Documentation/devicetree/bindings/ptp/brcm,ptp-dte.txt diff --git a/Documentation/devicetree/bindings/ptp/brcm,ptp-dte.txt b/Documentation/devicetree/bindings/ptp/brcm,ptp-dte.txt new file mode 100644 index 000..07590bc --- /dev/null +++ b/Documentation/devicetree/bindings/ptp/brcm,ptp-dte.txt @@ -0,0 +1,13 @@ +* Broadcom Digital Timing Engine(DTE) based PTP clock driver Bindings describe h/w, not drivers. + +Required properties: +- compatible: should be "brcm,ptp-dte" Looks too generic. You need SoC specific compatible strings. Rob, could you please help me understand the use of adding SoC specific compatible strings. I still don't get it. It's my understanding that the SoC compatibility string is to future proof against bugs/incompatibilities between different versions of the hardware block due to integration issues or any other reason. You can then compare in your driver because the strings were already used in the dtb. That would make sense if you can't already differentiate what SoC you are running on. But the SoC is already specified in the root of the device tree in the compatible string? Why can't you just use of_machine_is_compatible inside your driver when needed? 
Use of of_machine_is_compatible in drivers will result in the same mess we had with machine_is_X defines pre-DT. It practically guarantees that you must update the driver for a new SoC (with fallback compatibles you don't). Plus the matching logic for of_machine_is_compatible is open coded logic in every driver which is worse IMO than having a standard match table. I don't understand what you mean by fallback compatible then. Let's say I have 3 SoCs that each contain the same ip block. You want us to add a fallback compatibility per SoC, is that correct? I think Rob meant a fallback compatibility for the particular block. E.g: brcm,iproc-ptp is the fallback compatible string, but in your SoC-specific DTS, you would have at least: compatible = "brcm,cygnus-ptp", "brcm,iproc-ptp"; Where cygnus-ptp is more specific than iproc-ptp Then, if there is a workaround discovered in a particular SoC the driver can be updated in the future without changing the dtb. Then, the block gets added to a 4th SoC. You want us to another new compatibility string for the new SoC? If the new SoC has a bug then the driver has to be updated whether it is in uses the fallback compatible or machine_is_compatible string. There is no difference in amount of code added to a driver when a new SoC is introduced into the system that has bugs that need to be handled by the driver. The difference is in your recommendation we need to go through all the drivers used by the new SoC and add fallback compatibility strings. Not really, the fallback is what the driver should be matching by default (hence the name fallback) and if and only if you need to have a SoC-specific behavior in your driver (because of bugs, or slight differences) would you be matching this SoC-specific compatible string to capture that and key the driver behavior based off that. Then, we have to modify all the devicetree documentation for all the drivers. 
Then, we have to ensure that all dts files populate this new fallback string (even if it is unused). We don't see the benefit in doing any of that. Using machine_is_compatible and having less compatibility strings to deal appears much cleaner and more foolproof for all situations. When you introduce a new SoC, you would update all the bindings for the devices (not drivers) that are susceptible to be used by this SoC. If all goes well, your SoC has little bugs, and your drivers don't even need to see an update because they are already matching the fallback compatible string. That's what I understand from the suggestion but I may be totally off rails here. In this particular case, the IP is integrated within and outside a family of SoCs (the exact same block is in Capri for example). The only fallback compatablity string that makes sense is "brcm, dte-ptp-v1". When we integrate a new version of this IP we would add "brcm, dte-ptp-v2". Unless you
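Florian's fallback scheme can be made concrete with a toy matcher. This is a deliberately simplified stand-in for the kernel's of_match_node() (which additionally scores matches); the compatible strings are the ones from the thread, and the "new SoC" table entry is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A DT node lists compatibles most-specific first; a driver advertises
 * the compatibles it handles. Matching prefers the node's earliest
 * (most specific) string that the driver knows about. */
static const char *match_compatible(const char *node[], int n_node,
				    const char *table[], int n_tbl)
{
	for (int i = 0; i < n_node; i++)
		for (int j = 0; j < n_tbl; j++)
			if (strcmp(node[i], table[j]) == 0)
				return node[i];
	return NULL;	/* no driver binds */
}
```

The payoff: a DTS carrying `compatible = "brcm,cygnus-ptp", "brcm,iproc-ptp";` binds to a driver that only lists the fallback, so a new SoC reusing the block needs no driver change; only when a SoC-specific quirk appears does the driver add the more specific string and key its behavior off which entry matched.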
linux-next: manual merge of the net-next tree with Linus' tree
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  drivers/net/ethernet/rocker/rocker_ofdpa.c

between commit:

  acb4b7df48b5 ("rocker: move dereference before free")

from Linus' tree and commit:

  00fc0c51e35b ("rocker: Change world_ops API and implementation to be switchdev independant")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non-trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts.

--
Cheers,
Stephen Rothwell

diff --cc drivers/net/ethernet/rocker/rocker_ofdpa.c
index a9ce82d3e9cf,bd0e3f157e9e..
--- a/drivers/net/ethernet/rocker/rocker_ofdpa.c
+++ b/drivers/net/ethernet/rocker/rocker_ofdpa.c
@@@ -1505,10 -1409,10 +1409,10 @@@ static int ofdpa_port_ipv4_nh(struct of
  		*index = entry->index;
  		resolved = false;
  	} else if (removing) {
- 		ofdpa_neigh_del(found);
  		*index = found->index;
 -		ofdpa_neigh_del(trans, found);
++		ofdpa_neigh_del(found);
  	} else if (updating) {
 -		ofdpa_neigh_update(found, trans, NULL, false);
 +		ofdpa_neigh_update(found, NULL, false);
  		resolved = !is_zero_ether_addr(found->eth_dst);
  		*index = found->index;
  	} else {
Re: [PATCH RFC 0/2] kproxy: Kernel Proxy
On 29 June 2017 at 16:21, Tom Herbert wrote:
> I think the main part of that discussion was around stream parser
> which is needed for message delineation. For a 1:1 proxy, KCM is
> probably overkill (the whole KCM data path and lock becomes
> superfluous). Also, there's no concept of creating a whole message
> before routing it; in the 1:1 case we should let the message pass
> through once it's cleared by the filter (this is the strparser change
> I referred to). As I mentioned, for L7 load balancing we would want a
> multiplexor, probably also M:N, but the structure is different since
> there are still no user-facing sockets; they're all TCP for instance.
> IMO, the 1:1 proxy case is compelling to solve in itself...

I see. I was definitely thinking m:n. We should definitely evaluate whether it makes sense to have a specific 1:1 implementation if we need m:n anyway. For L7 LB, m:n seems obvious, as a particular L4 connection may act as a transport for multiple requests bidirectionally. KCM looks like a good starting point for that.

When I talked about enqueueing entire messages, the main concern is to buffer up the payload after the TLS handshake to the point where a forwarding decision can be made. I would definitely not advocate buffering entire messages before starting to forward.
[PATCH net-next v3] datapath: Avoid using stack larger than 1024.
From: Tonghao Zhang

When compiling OvS-master on 4.4.0-81 kernel, there is a warning:

  CC [M]  /root/ovs/datapath/linux/datapath.o
  /root/ovs/datapath/linux/datapath.c: In function 'ovs_flow_cmd_set':
  /root/ovs/datapath/linux/datapath.c:1221:1: warning: the frame size of
  1040 bytes is larger than 1024 bytes [-Wframe-larger-than=]

This patch factors out match-init and action-copy to avoid the "Wframe-larger-than=1024" warning. Because the mask is only used to get actions, we add a new function to save some stack space.

Signed-off-by: Tonghao Zhang
Acked-by: Pravin B Shelar
---
 net/openvswitch/datapath.c | 81 ++---
 1 file changed, 58 insertions(+), 23 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index c85029c..fb9f114 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -1100,6 +1100,58 @@ static struct sw_flow_actions *get_flow_actions(struct net *net,
 	return acts;
 }
 
+/* Factor out match-init and action-copy to avoid
+ * "Wframe-larger-than=1024" warning. Because mask is only
+ * used to get actions, we add a new function to save some
+ * stack space.
+ *
+ * If there are no key and action attrs, we return 0
+ * directly. In the case, the caller will also not use the
+ * match as before. If there is action attr, we try to get
+ * actions and save them to *acts. Before returning from
+ * the function, we reset the match->mask pointer. Because
+ * we should not return match object with dangling reference
+ * to mask.
+ */
+static int ovs_nla_init_match_and_action(struct net *net,
+					 struct sw_flow_match *match,
+					 struct sw_flow_key *key,
+					 struct nlattr **a,
+					 struct sw_flow_actions **acts,
+					 bool log)
+{
+	struct sw_flow_mask mask;
+	int error = 0;
+
+	if (a[OVS_FLOW_ATTR_KEY]) {
+		ovs_match_init(match, key, true, &mask);
+		error = ovs_nla_get_match(net, match, a[OVS_FLOW_ATTR_KEY],
+					  a[OVS_FLOW_ATTR_MASK], log);
+		if (error)
+			goto error;
+	}
+
+	if (a[OVS_FLOW_ATTR_ACTIONS]) {
+		if (!a[OVS_FLOW_ATTR_KEY]) {
+			OVS_NLERR(log,
+				  "Flow key attribute not present in set flow.");
+			return -EINVAL;
+		}
+
+		*acts = get_flow_actions(net, a[OVS_FLOW_ATTR_ACTIONS], key,
+					 &mask, log);
+		if (IS_ERR(*acts)) {
+			error = PTR_ERR(*acts);
+			goto error;
+		}
+	}
+
+	/* On success, error is 0. */
+error:
+	match->mask = NULL;
+	return error;
+}
+
 static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
 {
 	struct net *net = sock_net(skb->sk);
@@ -1107,7 +1159,6 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
 	struct ovs_header *ovs_header = info->userhdr;
 	struct sw_flow_key key;
 	struct sw_flow *flow;
-	struct sw_flow_mask mask;
 	struct sk_buff *reply = NULL;
 	struct datapath *dp;
 	struct sw_flow_actions *old_acts = NULL, *acts = NULL;
@@ -1119,34 +1170,18 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
 	bool ufid_present;
 
 	ufid_present = ovs_nla_get_ufid(&sfid, a[OVS_FLOW_ATTR_UFID], log);
-	if (a[OVS_FLOW_ATTR_KEY]) {
-		ovs_match_init(&match, &key, true, &mask);
-		error = ovs_nla_get_match(net, &match, a[OVS_FLOW_ATTR_KEY],
-					  a[OVS_FLOW_ATTR_MASK], log);
-	} else if (!ufid_present) {
+	if (!a[OVS_FLOW_ATTR_KEY] && !ufid_present) {
 		OVS_NLERR(log,
 			  "Flow set message rejected, Key attribute missing.");
-		error = -EINVAL;
+		return -EINVAL;
 	}
+
+	error = ovs_nla_init_match_and_action(net, &match, &key, a,
+					      &acts, log);
 	if (error)
 		goto error;
 
-	/* Validate actions. */
-	if (a[OVS_FLOW_ATTR_ACTIONS]) {
-		if (!a[OVS_FLOW_ATTR_KEY]) {
-			OVS_NLERR(log,
-				  "Flow key attribute not present in set flow.");
-			error = -EINVAL;
-			goto error;
-		}
-
-		acts = get_flow_actions(net, a[OVS_FLOW_ATTR_ACTIONS], &key,
-					&mask, log);
-		if (IS_ERR(acts)) {
-			error = PTR_ERR(acts);
-			goto error;
-
Re: [PATCH RFC 25/26] tile: Remove spin_unlock_wait() arch-specific definitions
On Thu, Jun 29, 2017 at 05:10:41PM -0700, Linus Torvalds wrote:
> On Thu, Jun 29, 2017 at 5:06 PM, Linus Torvalds wrote:
> >
> > Please don't make this one commit for every architecture.
> >
> > Once something gets removed, it gets removed. There's no point in
> > "remove it from architecture X". If there are no more users, we're
> > done with it, and making it be 25 patches with the same commit message
> > instead of just one doesn't help anybody.
>
> Just to clarify: I think the actual *users* are worth doing one by
> one, particularly if there are user-specific explanations of what that
> particular code wanted, and why spin_unlock_wait() doesn't really
> help.

Got it, and I did merge -only- the arch-specific definition removals into one commit. Should I also merge the core-code definition removals into that same commit, or is it OK to remove the core-code definitions with one commit and the arch-specific definitions with another? (My guess is that you would prefer I removed -all- definitions with one commit, including the core-kernel definitions, but at this point I figure I should just ask.)

> And I think that you actually have those per-user insights by now,
> after looking at the long thread.

One Acked-by thus far, so some progress!

> So I'm not saying "do one patch for the whole series". One patch per
> removal of use is fine - in fact preferred.

Got it. It allows the developers and maintainers to tell me where my analysis is wrong, for one thing. ;-)

> But once there are no actual more users, just remove all the
> architecture definitions in one go, because explaining the same thing
> several times doesn't actually help anything.
>
> In fact, *if* we end up ever resurrecting that thing, it's good if we
> can resurrect it in one go. Then we can resurrect the one or two users
> that turned out to matter after all and could come up with why some
> particular ordering was ok too.

Understood!

							Thanx, Paul
Re: [PATCH RFC 25/26] tile: Remove spin_unlock_wait() arch-specific definitions
On Thu, Jun 29, 2017 at 05:09:56PM -0700, Paul E. McKenney wrote:
> On Thu, Jun 29, 2017 at 05:06:16PM -0700, Linus Torvalds wrote:
> > On Thu, Jun 29, 2017 at 5:01 PM, Paul E. McKenney wrote:
> > > There is no agreed-upon definition of spin_unlock_wait()'s semantics,
> > > and it appears that all callers could do just as well with a lock/unlock
> > > pair. This commit therefore removes the underlying arch-specific
> > > arch_spin_unlock_wait().
> >
> > Please don't make this one commit for every architecture.
> >
> > Once something gets removed, it gets removed. There's no point in
> > "remove it from architecture X". If there are no more users, we're
> > done with it, and making it be 25 patches with the same commit message
> > instead of just one doesn't help anybody.
>
> Apologies, I will merge them.

I suppose that I could have sent the merged patch... Please see below.

							Thanx, Paul

commit e81efa4219e9b07e476448572aafe8f9c4ad28c8
Author: Paul E. McKenney
Date:   Thu Jun 29 15:53:02 2017 -0700

    arch: Remove spin_unlock_wait() arch-specific definitions

    There is no agreed-upon definition of spin_unlock_wait()'s semantics,
    and it appears that all callers could do just as well with a lock/unlock
    pair. This commit therefore removes the underlying arch-specific
    arch_spin_unlock_wait() for all architectures providing them.

    Signed-off-by: Paul E. McKenney
    Cc: Will Deacon
    Cc: Peter Zijlstra
    Cc: Alan Stern
    Cc: Andrea Parri
    Cc: Linus Torvalds

diff --git a/arch/alpha/include/asm/spinlock.h b/arch/alpha/include/asm/spinlock.h
index a40b9fc0c6c3..718ac0b64adf 100644
--- a/arch/alpha/include/asm/spinlock.h
+++ b/arch/alpha/include/asm/spinlock.h
@@ -16,11 +16,6 @@
 #define arch_spin_lock_flags(lock, flags)   arch_spin_lock(lock)
 #define arch_spin_is_locked(x)	((x)->lock != 0)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-	smp_cond_load_acquire(&lock->lock, !VAL);
-}
-
 static inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 {
 	return lock.lock == 0;
diff --git a/arch/arc/include/asm/spinlock.h b/arch/arc/include/asm/spinlock.h
index 233d5ffe6ec7..a325e6a36523 100644
--- a/arch/arc/include/asm/spinlock.h
+++ b/arch/arc/include/asm/spinlock.h
@@ -16,11 +16,6 @@
 #define arch_spin_is_locked(x)	((x)->slock != __ARCH_SPIN_LOCK_UNLOCKED__)
 #define arch_spin_lock_flags(lock, flags)	arch_spin_lock(lock)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-	smp_cond_load_acquire(&lock->slock, !VAL);
-}
-
 #ifdef CONFIG_ARC_HAS_LLSC
 
 static inline void arch_spin_lock(arch_spinlock_t *lock)
diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
index 4bec45442072..c030143c18c6 100644
--- a/arch/arm/include/asm/spinlock.h
+++ b/arch/arm/include/asm/spinlock.h
@@ -52,22 +52,6 @@ static inline void dsb_sev(void)
  * memory.
  */
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-	u16 owner = READ_ONCE(lock->tickets.owner);
-
-	for (;;) {
-		arch_spinlock_t tmp = READ_ONCE(*lock);
-
-		if (tmp.tickets.owner == tmp.tickets.next ||
-		    tmp.tickets.owner != owner)
-			break;
-
-		wfe();
-	}
-	smp_acquire__after_ctrl_dep();
-}
-
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
 static inline void arch_spin_lock(arch_spinlock_t *lock)
diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h
index cae331d553f8..f445bd7f2b9f 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -26,58 +26,6 @@
  * The memory barriers are implicit with the load-acquire and store-release
  * instructions.
  */
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-	unsigned int tmp;
-	arch_spinlock_t lockval;
-	u32 owner;
-
-	/*
-	 * Ensure prior spin_lock operations to other locks have completed
-	 * on this CPU before we test whether "lock" is locked.
-	 */
-	smp_mb();
-	owner = READ_ONCE(lock->owner) << 16;
-
-	asm volatile(
-"	sevl\n"
-"1:	wfe\n"
-"2:	ldaxr	%w0, %2\n"
-	/* Is the lock free? */
-"	eor	%w1, %w0, %w0, ror #16\n"
-"	cbz	%w1, 3f\n"
-	/* Lock taken -- has there been a subsequent unlock->lock transition? */
-"	eor	%w1, %w3, %w0, lsl #16\n"
-"	cbz	%w1, 1b\n"
-	/*
-	 * The owner has been updated, so there was an unlock->lock
-	 * transition that we missed. That
[PATCH RFC 05/26] exit: Replace spin_unlock_wait() with lock/unlock pair
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore replaces the spin_unlock_wait() call in do_exit() with spin_lock() followed immediately by spin_unlock(). This should be safe from a performance perspective because the lock is a per-task lock, and this is happening only at task-exit time.

Signed-off-by: Paul E. McKenney
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Will Deacon
Cc: Alan Stern
Cc: Andrea Parri
Cc: Linus Torvalds
---
 kernel/exit.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 516acdb0e0ec..1a976e47ddd1 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -832,7 +832,8 @@ void __noreturn do_exit(long code)
 	 * Ensure that we must observe the pi_state in exit_mm() ->
 	 * mm_release() -> exit_pi_state_list().
 	 */
-	raw_spin_unlock_wait(&tsk->pi_lock);
+	raw_spin_lock(&tsk->pi_lock);
+	raw_spin_unlock(&tsk->pi_lock);
 
 	if (unlikely(in_atomic())) {
 		pr_info("note: %s[%d] exited with preempt_count %d\n",
-- 
2.5.2
[PATCH RFC 17/26] metag: Remove spin_unlock_wait() arch-specific definitions
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore removes the underlying arch-specific arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney
Cc: James Hogan
Cc: Will Deacon
Cc: Peter Zijlstra
Cc: Alan Stern
Cc: Andrea Parri
Cc: Linus Torvalds
---
 arch/metag/include/asm/spinlock.h | 5 -
 1 file changed, 5 deletions(-)

diff --git a/arch/metag/include/asm/spinlock.h b/arch/metag/include/asm/spinlock.h
index c0c7a22be1ae..ddf7fe5708a6 100644
--- a/arch/metag/include/asm/spinlock.h
+++ b/arch/metag/include/asm/spinlock.h
@@ -15,11 +15,6 @@
  * locked.
  */
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-	smp_cond_load_acquire(&lock->lock, !VAL);
-}
-
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
 #define	arch_read_lock_flags(lock, flags) arch_read_lock(lock)
-- 
2.5.2
[PATCH RFC 10/26] arc: Remove spin_unlock_wait() arch-specific definitions
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore removes the underlying arch-specific arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney
Cc: Vineet Gupta
Cc: Will Deacon
Cc: Peter Zijlstra
Cc: Alan Stern
Cc: Andrea Parri
Cc: Linus Torvalds
---
 arch/arc/include/asm/spinlock.h | 5 -
 1 file changed, 5 deletions(-)

diff --git a/arch/arc/include/asm/spinlock.h b/arch/arc/include/asm/spinlock.h
index 233d5ffe6ec7..a325e6a36523 100644
--- a/arch/arc/include/asm/spinlock.h
+++ b/arch/arc/include/asm/spinlock.h
@@ -16,11 +16,6 @@
 #define arch_spin_is_locked(x)	((x)->slock != __ARCH_SPIN_LOCK_UNLOCKED__)
 #define arch_spin_lock_flags(lock, flags)	arch_spin_lock(lock)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-	smp_cond_load_acquire(&lock->slock, !VAL);
-}
-
 #ifdef CONFIG_ARC_HAS_LLSC
 
 static inline void arch_spin_lock(arch_spinlock_t *lock)
-- 
2.5.2
[PATCH RFC 02/26] task_work: Replace spin_unlock_wait() with lock/unlock pair
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore replaces the spin_unlock_wait() call in task_work_run() with spin_lock() followed immediately by spin_unlock(). This should be safe from a performance perspective because calls to the other side of the race, task_work_cancel(), should be rare.

Signed-off-by: Paul E. McKenney
Cc: Oleg Nesterov
Cc: Andrew Morton
Cc: Will Deacon
Cc: Peter Zijlstra
Cc: Alan Stern
Cc: Andrea Parri
Cc: Linus Torvalds
---
 kernel/task_work.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/task_work.c b/kernel/task_work.c
index d513051fcca2..b9b428832229 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -109,7 +109,8 @@ void task_work_run(void)
 		 * the first entry == work, cmpxchg(task_works) should
 		 * fail, but it can play with *work and other entries.
 		 */
-		raw_spin_unlock_wait(&task->pi_lock);
+		raw_spin_lock(&task->pi_lock);
+		raw_spin_unlock(&task->pi_lock);
 
 		do {
 			next = work->next;
-- 
2.5.2
[PATCH RFC 16/26] m32r: Remove spin_unlock_wait() arch-specific definitions
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore removes the underlying arch-specific arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney
Cc: Will Deacon
Cc: Peter Zijlstra
Cc: Alan Stern
Cc: Andrea Parri
Cc: Linus Torvalds
---
 arch/m32r/include/asm/spinlock.h | 5 -
 1 file changed, 5 deletions(-)

diff --git a/arch/m32r/include/asm/spinlock.h b/arch/m32r/include/asm/spinlock.h
index 323c7fc953cd..a56825592b90 100644
--- a/arch/m32r/include/asm/spinlock.h
+++ b/arch/m32r/include/asm/spinlock.h
@@ -30,11 +30,6 @@
 #define arch_spin_is_locked(x)	(*(volatile int *)(&(x)->slock) <= 0)
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-	smp_cond_load_acquire(&lock->slock, VAL > 0);
-}
-
 /**
  * arch_spin_trylock - Try spin lock and return a result
  * @lock: Pointer to the lock variable
-- 
2.5.2
[PATCH RFC 25/26] tile: Remove spin_unlock_wait() arch-specific definitions
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore removes the underlying arch-specific arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney
Cc: Chris Metcalf
Cc: Will Deacon
Cc: Peter Zijlstra
Cc: Alan Stern
Cc: Andrea Parri
Cc: Linus Torvalds
---
 arch/tile/include/asm/spinlock_32.h |  2 --
 arch/tile/include/asm/spinlock_64.h |  2 --
 arch/tile/lib/spinlock_32.c         | 23 ---
 arch/tile/lib/spinlock_64.c         | 22 --
 4 files changed, 49 deletions(-)

diff --git a/arch/tile/include/asm/spinlock_32.h b/arch/tile/include/asm/spinlock_32.h
index b14b1ba5bf9c..cba8ba9b8da6 100644
--- a/arch/tile/include/asm/spinlock_32.h
+++ b/arch/tile/include/asm/spinlock_32.h
@@ -64,8 +64,6 @@ static inline void arch_spin_unlock(arch_spinlock_t *lock)
 	lock->current_ticket = old_ticket + TICKET_QUANTUM;
 }
 
-void arch_spin_unlock_wait(arch_spinlock_t *lock);
-
 /*
  * Read-write spinlocks, allowing multiple readers
  * but only one writer.
diff --git a/arch/tile/include/asm/spinlock_64.h b/arch/tile/include/asm/spinlock_64.h
index b9718fb4e74a..9a2c2d605752 100644
--- a/arch/tile/include/asm/spinlock_64.h
+++ b/arch/tile/include/asm/spinlock_64.h
@@ -58,8 +58,6 @@ static inline void arch_spin_unlock(arch_spinlock_t *lock)
 	__insn_fetchadd4(&lock->lock, 1U << __ARCH_SPIN_CURRENT_SHIFT);
 }
 
-void arch_spin_unlock_wait(arch_spinlock_t *lock);
-
 void arch_spin_lock_slow(arch_spinlock_t *lock, u32 val);
 
 /* Grab the "next" ticket number and bump it atomically.
diff --git a/arch/tile/lib/spinlock_32.c b/arch/tile/lib/spinlock_32.c
index 076c6cc43113..db9333f2447c 100644
--- a/arch/tile/lib/spinlock_32.c
+++ b/arch/tile/lib/spinlock_32.c
@@ -62,29 +62,6 @@ int arch_spin_trylock(arch_spinlock_t *lock)
 }
 EXPORT_SYMBOL(arch_spin_trylock);
 
-void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-	u32 iterations = 0;
-	int curr = READ_ONCE(lock->current_ticket);
-	int next = READ_ONCE(lock->next_ticket);
-
-	/* Return immediately if unlocked. */
-	if (next == curr)
-		return;
-
-	/* Wait until the current locker has released the lock. */
-	do {
-		delay_backoff(iterations++);
-	} while (READ_ONCE(lock->current_ticket) == curr);
-
-	/*
-	 * The TILE architecture doesn't do read speculation; therefore
-	 * a control dependency guarantees a LOAD->{LOAD,STORE} order.
-	 */
-	barrier();
-}
-EXPORT_SYMBOL(arch_spin_unlock_wait);
-
 /*
  * The low byte is always reserved to be the marker for a "tns" operation
  * since the low bit is set to "1" by a tns. The next seven bits are
diff --git a/arch/tile/lib/spinlock_64.c b/arch/tile/lib/spinlock_64.c
index a4b5b2cbce93..de414c22892f 100644
--- a/arch/tile/lib/spinlock_64.c
+++ b/arch/tile/lib/spinlock_64.c
@@ -62,28 +62,6 @@ int arch_spin_trylock(arch_spinlock_t *lock)
 }
 EXPORT_SYMBOL(arch_spin_trylock);
 
-void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-	u32 iterations = 0;
-	u32 val = READ_ONCE(lock->lock);
-	u32 curr = arch_spin_current(val);
-
-	/* Return immediately if unlocked. */
-	if (arch_spin_next(val) == curr)
-		return;
-
-	/* Wait until the current locker has released the lock. */
-	do {
-		delay_backoff(iterations++);
-	} while (arch_spin_current(READ_ONCE(lock->lock)) == curr);
-
-	/*
-	 * The TILE architecture doesn't do read speculation; therefore
-	 * a control dependency guarantees a LOAD->{LOAD,STORE} order.
-	 */
-	barrier();
-}
-EXPORT_SYMBOL(arch_spin_unlock_wait);
 
 /*
  * If the read lock fails due to a writer, we retry periodically
-- 
2.5.2
[PATCH RFC 03/26] sched: Replace spin_unlock_wait() with lock/unlock pair
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore replaces the spin_unlock_wait() call in do_task_dead() with spin_lock() followed immediately by spin_unlock(). This should be safe from a performance perspective because the lock is this task's ->pi_lock, and this is called only after the task exits.

Signed-off-by: Paul E. McKenney
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Will Deacon
Cc: Alan Stern
Cc: Andrea Parri
Cc: Linus Torvalds
---
 kernel/sched/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e91138fcde86..6dea3d9728c8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3461,7 +3461,8 @@ void __noreturn do_task_dead(void)
 	 * is held by try_to_wake_up()
 	 */
 	smp_mb();
-	raw_spin_unlock_wait(&current->pi_lock);
+	raw_spin_lock(&current->pi_lock);
+	raw_spin_unlock(&current->pi_lock);
 
 	/* Causes final put_task_struct in finish_task_switch(): */
 	__set_current_state(TASK_DEAD);
-- 
2.5.2
Re: [PATCH RFC 25/26] tile: Remove spin_unlock_wait() arch-specific definitions
On Thu, Jun 29, 2017 at 5:06 PM, Linus Torvalds wrote:
>
> Please don't make this one commit for every architecture.
>
> Once something gets removed, it gets removed. There's no point in
> "remove it from architecture X". If there are no more users, we're
> done with it, and making it be 25 patches with the same commit message
> instead of just one doesn't help anybody.

Just to clarify: I think the actual *users* are worth doing one by one, particularly if there are user-specific explanations of what that particular code wanted, and why spin_unlock_wait() doesn't really help.

And I think that you actually have those per-user insights by now, after looking at the long thread.

So I'm not saying "do one patch for the whole series". One patch per removal of use is fine - in fact preferred.

But once there are no actual more users, just remove all the architecture definitions in one go, because explaining the same thing several times doesn't actually help anything.

In fact, *if* we end up ever resurrecting that thing, it's good if we can resurrect it in one go. Then we can resurrect the one or two users that turned out to matter after all and could come up with why some particular ordering was ok too.

                 Linus
Re: [PATCH RFC 25/26] tile: Remove spin_unlock_wait() arch-specific definitions
On Thu, Jun 29, 2017 at 05:06:16PM -0700, Linus Torvalds wrote:
> On Thu, Jun 29, 2017 at 5:01 PM, Paul E. McKenney wrote:
> > There is no agreed-upon definition of spin_unlock_wait()'s semantics,
> > and it appears that all callers could do just as well with a lock/unlock
> > pair. This commit therefore removes the underlying arch-specific
> > arch_spin_unlock_wait().
>
> Please don't make this one commit for every architecture.
>
> Once something gets removed, it gets removed. There's no point in
> "remove it from architecture X". If there are no more users, we're
> done with it, and making it be 25 patches with the same commit message
> instead of just one doesn't help anybody.

Apologies, I will merge them.

							Thanx, Paul
[PATCH RFC 04/26] completion: Replace spin_unlock_wait() with lock/unlock pair
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore replaces the spin_unlock_wait() call in completion_done() with spin_lock() followed immediately by spin_unlock(). This should be safe from a performance perspective because the lock is held only briefly, while the wakeup completes.

Signed-off-by: Paul E. McKenney
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Will Deacon
Cc: Alan Stern
Cc: Andrea Parri
Cc: Linus Torvalds
---
 kernel/sched/completion.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/completion.c b/kernel/sched/completion.c
index 53f9558fa925..d4b89a59629f 100644
--- a/kernel/sched/completion.c
+++ b/kernel/sched/completion.c
@@ -307,14 +307,9 @@ bool completion_done(struct completion *x)
 	 * If ->done, we need to wait for complete() to release ->wait.lock
 	 * otherwise we can end up freeing the completion before complete()
 	 * is done referencing it.
-	 *
-	 * The RMB pairs with complete()'s RELEASE of ->wait.lock and orders
-	 * the loads of ->done and ->wait.lock such that we cannot observe
-	 * the lock before complete() acquires it while observing the ->done
-	 * after it's acquired the lock.
 	 */
-	smp_rmb();
-	spin_unlock_wait(&x->wait.lock);
+	spin_lock(&x->wait.lock);
+	spin_unlock(&x->wait.lock);
 	return true;
 }
 EXPORT_SYMBOL(completion_done);
-- 
2.5.2
[PATCH RFC 18/26] mips: Remove spin_unlock_wait() arch-specific definitions
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore removes the underlying arch-specific arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney
Cc: Ralf Baechle
Cc: Will Deacon
Cc: Peter Zijlstra
Cc: Alan Stern
Cc: Andrea Parri
Cc: Linus Torvalds
---
 arch/mips/include/asm/spinlock.h | 16 
 1 file changed, 16 deletions(-)

diff --git a/arch/mips/include/asm/spinlock.h b/arch/mips/include/asm/spinlock.h
index a8df44d60607..81b4945031ee 100644
--- a/arch/mips/include/asm/spinlock.h
+++ b/arch/mips/include/asm/spinlock.h
@@ -50,22 +50,6 @@ static inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-	u16 owner = READ_ONCE(lock->h.serving_now);
-	smp_rmb();
-	for (;;) {
-		arch_spinlock_t tmp = READ_ONCE(*lock);
-
-		if (tmp.h.serving_now == tmp.h.ticket ||
-		    tmp.h.serving_now != owner)
-			break;
-
-		cpu_relax();
-	}
-	smp_acquire__after_ctrl_dep();
-}
-
 static inline int arch_spin_is_contended(arch_spinlock_t *lock)
 {
 	u32 counters = ACCESS_ONCE(lock->lock);
-- 
2.5.2
[PATCH RFC 13/26] blackfin: Remove spin_unlock_wait() arch-specific definitions
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore removes the underlying arch-specific arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney
Cc: Steven Miao
Cc: Will Deacon
Cc: Peter Zijlstra
Cc: Alan Stern
Cc: Andrea Parri
Cc: Linus Torvalds
---
 arch/blackfin/include/asm/spinlock.h | 5 -
 1 file changed, 5 deletions(-)

diff --git a/arch/blackfin/include/asm/spinlock.h b/arch/blackfin/include/asm/spinlock.h
index c58f4a83ed6f..f6431439d15d 100644
--- a/arch/blackfin/include/asm/spinlock.h
+++ b/arch/blackfin/include/asm/spinlock.h
@@ -48,11 +48,6 @@ static inline void arch_spin_unlock(arch_spinlock_t *lock)
 	__raw_spin_unlock_asm(&lock->lock);
 }
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-	smp_cond_load_acquire(&lock->lock, !VAL);
-}
-
 static inline int arch_read_can_lock(arch_rwlock_t *rw)
 {
 	return __raw_uncached_fetch_asm(&rw->lock) > 0;
-- 
2.5.2
[PATCH RFC 07/26] drivers/ata: Replace spin_unlock_wait() with lock/unlock pair
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore eliminates the spin_unlock_wait() call and associated else-clause and hoists the then-clause's lock and unlock out of the "if" statement. This should be safe from a performance perspective because according to Tejun there should be few if any drivers that don't set their own error handler.

Signed-off-by: Paul E. McKenney
Acked-by: Tejun Heo
Cc: Will Deacon
Cc: Peter Zijlstra
Cc: Alan Stern
Cc: Andrea Parri
Cc: Linus Torvalds
---
 drivers/ata/libata-eh.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index ef68232b5222..779f6f18c1f4 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -645,12 +645,11 @@ void ata_scsi_cmd_error_handler(struct Scsi_Host *host, struct ata_port *ap,
 	 * completions are honored. A scmd is determined to have
 	 * timed out iff its associated qc is active and not failed.
 	 */
+	spin_lock_irqsave(ap->lock, flags);
 	if (ap->ops->error_handler) {
 		struct scsi_cmnd *scmd, *tmp;
 		int nr_timedout = 0;
 
-		spin_lock_irqsave(ap->lock, flags);
-
 		/* This must occur under the ap->lock as we don't want
		   a polled recovery to race the real interrupt handler
@@ -700,12 +699,11 @@ void ata_scsi_cmd_error_handler(struct Scsi_Host *host, struct ata_port *ap,
 		if (nr_timedout)
 			__ata_port_freeze(ap);
 
-		spin_unlock_irqrestore(ap->lock, flags);
-
 		/* initialize eh_tries */
 		ap->eh_tries = ATA_EH_MAX_TRIES;
-	} else
-		spin_unlock_wait(ap->lock);
+	}
+	spin_unlock_irqrestore(ap->lock, flags);
 }
 EXPORT_SYMBOL(ata_scsi_cmd_error_handler);
-- 
2.5.2
Re: [PATCH RFC 25/26] tile: Remove spin_unlock_wait() arch-specific definitions
On Thu, Jun 29, 2017 at 5:01 PM, Paul E. McKenney wrote:
> There is no agreed-upon definition of spin_unlock_wait()'s semantics,
> and it appears that all callers could do just as well with a lock/unlock
> pair. This commit therefore removes the underlying arch-specific
> arch_spin_unlock_wait().

Please don't make this one commit for every architecture.

Once something gets removed, it gets removed. There's no point in "remove it from architecture X". If there are no more users, we're done with it, and making it be 25 patches with the same commit message instead of just one doesn't help anybody.

                 Linus
[PATCH RFC 15/26] ia64: Remove spin_unlock_wait() arch-specific definitions
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore removes the underlying arch-specific arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney
Cc: Tony Luck
Cc: Fenghua Yu
Cc: Will Deacon
Cc: Peter Zijlstra
Cc: Alan Stern
Cc: Andrea Parri
Cc: Linus Torvalds
---
 arch/ia64/include/asm/spinlock.h | 21 -
 1 file changed, 21 deletions(-)

diff --git a/arch/ia64/include/asm/spinlock.h b/arch/ia64/include/asm/spinlock.h
index ca9e76149a4a..df2c121164b8 100644
--- a/arch/ia64/include/asm/spinlock.h
+++ b/arch/ia64/include/asm/spinlock.h
@@ -76,22 +76,6 @@ static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
 	ACCESS_ONCE(*p) = (tmp + 2) & ~1;
 }
 
-static __always_inline void __ticket_spin_unlock_wait(arch_spinlock_t *lock)
-{
-	int	*p = (int *)&lock->lock, ticket;
-
-	ia64_invala();
-
-	for (;;) {
-		asm volatile ("ld4.c.nc %0=[%1]" : "=r"(ticket) : "r"(p) : "memory");
-		if (!(((ticket >> TICKET_SHIFT) ^ ticket) & TICKET_MASK))
-			return;
-		cpu_relax();
-	}
-
-	smp_acquire__after_ctrl_dep();
-}
-
 static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
 {
 	long tmp = ACCESS_ONCE(lock->lock);
@@ -143,11 +127,6 @@ static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock,
 	arch_spin_lock(lock);
 }
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-	__ticket_spin_unlock_wait(lock);
-}
-
 #define arch_read_can_lock(rw)		(*(volatile int *)(rw) >= 0)
 #define arch_write_can_lock(rw)	(*(volatile int *)(rw) == 0)
-- 
2.5.2
[PATCH RFC 14/26] hexagon: Remove spin_unlock_wait() arch-specific definitions
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore removes the underlying arch-specific arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney
Cc: Richard Kuo
Cc: Will Deacon
Cc: Peter Zijlstra
Cc: Alan Stern
Cc: Andrea Parri
Cc: Linus Torvalds
---
 arch/hexagon/include/asm/spinlock.h | 5 -
 1 file changed, 5 deletions(-)

diff --git a/arch/hexagon/include/asm/spinlock.h b/arch/hexagon/include/asm/spinlock.h
index a1c55788c5d6..53a8d5885887 100644
--- a/arch/hexagon/include/asm/spinlock.h
+++ b/arch/hexagon/include/asm/spinlock.h
@@ -179,11 +179,6 @@ static inline unsigned int arch_spin_trylock(arch_spinlock_t *lock)
  */
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-	smp_cond_load_acquire(&lock->lock, !VAL);
-}
-
 #define arch_spin_is_locked(x) ((x)->lock != 0)
 
 #define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
-- 
2.5.2
[PATCH RFC 12/26] arm64: Remove spin_unlock_wait() arch-specific definitions
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore removes the underlying arch-specific arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney
Cc: Catalin Marinas
Cc: Will Deacon
Cc: Peter Zijlstra
Cc: Alan Stern
Cc: Andrea Parri
Cc: Linus Torvalds
---
 arch/arm64/include/asm/spinlock.h | 58 ---
 1 file changed, 5 insertions(+), 53 deletions(-)

diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h
index cae331d553f8..f445bd7f2b9f 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -26,58 +26,6 @@
  * The memory barriers are implicit with the load-acquire and store-release
  * instructions.
  */
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-	unsigned int tmp;
-	arch_spinlock_t lockval;
-	u32 owner;
-
-	/*
-	 * Ensure prior spin_lock operations to other locks have completed
-	 * on this CPU before we test whether "lock" is locked.
-	 */
-	smp_mb();
-	owner = READ_ONCE(lock->owner) << 16;
-
-	asm volatile(
-"	sevl\n"
-"1:	wfe\n"
-"2:	ldaxr	%w0, %2\n"
-	/* Is the lock free? */
-"	eor	%w1, %w0, %w0, ror #16\n"
-"	cbz	%w1, 3f\n"
-	/* Lock taken -- has there been a subsequent unlock->lock transition? */
-"	eor	%w1, %w3, %w0, lsl #16\n"
-"	cbz	%w1, 1b\n"
-	/*
-	 * The owner has been updated, so there was an unlock->lock
-	 * transition that we missed. That means we can rely on the
-	 * store-release of the unlock operation paired with the
-	 * load-acquire of the lock operation to publish any of our
-	 * previous stores to the new lock owner and therefore don't
-	 * need to bother with the writeback below.
-	 */
-"	b	4f\n"
-"3:\n"
-	/*
-	 * Serialise against any concurrent lockers by writing back the
-	 * unlocked lock value
-	 */
-	ARM64_LSE_ATOMIC_INSN(
-	/* LL/SC */
-"	stxr	%w1, %w0, %2\n"
-	__nops(2),
-	/* LSE atomics */
-"	mov	%w1, %w0\n"
-"	cas	%w0, %w0, %2\n"
-"	eor	%w1, %w1, %w0\n")
-	/* Somebody else wrote to the lock, GOTO 10 and reload the value */
-"	cbnz	%w1, 2b\n"
-"4:"
-	: "=&r" (lockval), "=&r" (tmp), "+Q" (*lock)
-	: "r" (owner)
-	: "memory");
-}
 
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
@@ -176,7 +124,11 @@ static inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 
 static inline int arch_spin_is_locked(arch_spinlock_t *lock)
 {
-	smp_mb(); /* See arch_spin_unlock_wait */
+	/*
+	 * Ensure prior spin_lock operations to other locks have completed
+	 * on this CPU before we test whether "lock" is locked.
+	 */
+	smp_mb(); /* ^^^ */
 	return !arch_spin_value_unlocked(READ_ONCE(*lock));
 }
-- 
2.5.2
[PATCH RFC 19/26] mn10300: Remove spin_unlock_wait() arch-specific definitions
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore removes the underlying arch-specific arch_spin_unlock_wait(). Signed-off-by: Paul E. McKenney Cc: David Howells Cc: Cc: Will Deacon Cc: Peter Zijlstra Cc: Alan Stern Cc: Andrea Parri Cc: Linus Torvalds --- arch/mn10300/include/asm/spinlock.h | 5 - 1 file changed, 5 deletions(-) diff --git a/arch/mn10300/include/asm/spinlock.h b/arch/mn10300/include/asm/spinlock.h index 9c7b8f7942d8..fe413b41df6c 100644 --- a/arch/mn10300/include/asm/spinlock.h +++ b/arch/mn10300/include/asm/spinlock.h @@ -26,11 +26,6 @@ #define arch_spin_is_locked(x) (*(volatile signed char *)(&(x)->slock) != 0) -static inline void arch_spin_unlock_wait(arch_spinlock_t *lock) -{ - smp_cond_load_acquire(&lock->slock, !VAL); -} - static inline void arch_spin_unlock(arch_spinlock_t *lock) { asm volatile( -- 2.5.2
[PATCH RFC 22/26] s390: Remove spin_unlock_wait() arch-specific definitions
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore removes the underlying arch-specific arch_spin_unlock_wait(). Signed-off-by: Paul E. McKenney Cc: Martin Schwidefsky Cc: Heiko Carstens Cc: Cc: Will Deacon Cc: Peter Zijlstra Cc: Alan Stern Cc: Andrea Parri Cc: Linus Torvalds --- arch/s390/include/asm/spinlock.h | 7 --- 1 file changed, 7 deletions(-) diff --git a/arch/s390/include/asm/spinlock.h b/arch/s390/include/asm/spinlock.h index f7838ecd83c6..217ee5210c32 100644 --- a/arch/s390/include/asm/spinlock.h +++ b/arch/s390/include/asm/spinlock.h @@ -98,13 +98,6 @@ static inline void arch_spin_unlock(arch_spinlock_t *lp) : "cc", "memory"); } -static inline void arch_spin_unlock_wait(arch_spinlock_t *lock) -{ - while (arch_spin_is_locked(lock)) - arch_spin_relax(lock); - smp_acquire__after_ctrl_dep(); -} - /* * Read-write spinlocks, allowing multiple readers * but only one writer. -- 2.5.2
[PATCH RFC 21/26] powerpc: Remove spin_unlock_wait() arch-specific definitions
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore removes the underlying arch-specific arch_spin_unlock_wait(). Signed-off-by: Paul E. McKenney Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Cc: Cc: Will Deacon Cc: Peter Zijlstra Cc: Alan Stern Cc: Andrea Parri Cc: Linus Torvalds --- arch/powerpc/include/asm/spinlock.h | 33 - 1 file changed, 33 deletions(-) diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h index 8c1b913de6d7..d256e448ea49 100644 --- a/arch/powerpc/include/asm/spinlock.h +++ b/arch/powerpc/include/asm/spinlock.h @@ -170,39 +170,6 @@ static inline void arch_spin_unlock(arch_spinlock_t *lock) lock->slock = 0; } -static inline void arch_spin_unlock_wait(arch_spinlock_t *lock) -{ - arch_spinlock_t lock_val; - - smp_mb(); - - /* -* Atomically load and store back the lock value (unchanged). This -* ensures that our observation of the lock value is ordered with -* respect to other lock operations. -*/ - __asm__ __volatile__( -"1:" PPC_LWARX(%0, 0, %2, 0) "\n" -" stwcx. %0, 0, %2\n" -" bne- 1b\n" - : "=&r" (lock_val), "+m" (*lock) - : "r" (lock) - : "cr0", "xer"); - - if (arch_spin_value_unlocked(lock_val)) - goto out; - - while (lock->slock) { - HMT_low(); - if (SHARED_PROCESSOR) - __spin_yield(lock); - } - HMT_medium(); - -out: - smp_mb(); -} - /* * Read-write spinlocks, allowing multiple readers * but only one writer. -- 2.5.2
[PATCH RFC 01/26] netfilter: Replace spin_unlock_wait() with lock/unlock pair
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore replaces the spin_unlock_wait() calls in nf_conntrack_lock() and nf_conntrack_all_lock() with spin_lock() followed immediately by spin_unlock(). These functions do not appear to be invoked on any fastpaths. Signed-off-by: Paul E. McKenney Cc: Pablo Neira Ayuso Cc: Jozsef Kadlecsik Cc: Florian Westphal Cc: "David S. Miller" Cc: Cc: Cc: Cc: Will Deacon Cc: Peter Zijlstra Cc: Alan Stern Cc: Andrea Parri Cc: Linus Torvalds --- net/netfilter/nf_conntrack_core.c | 26 -- 1 file changed, 8 insertions(+), 18 deletions(-) diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index e847dbaa0c6b..9f997859d160 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -99,15 +99,11 @@ void nf_conntrack_lock(spinlock_t *lock) __acquires(lock) spin_lock(lock); while (unlikely(nf_conntrack_locks_all)) { spin_unlock(lock); - - /* -* Order the 'nf_conntrack_locks_all' load vs. the -* spin_unlock_wait() loads below, to ensure -* that 'nf_conntrack_locks_all_lock' is indeed held: -*/ - smp_rmb(); /* spin_lock(&nf_conntrack_locks_all_lock) */ - spin_unlock_wait(&nf_conntrack_locks_all_lock); + /* Wait for nf_conntrack_locks_all_lock holder to release ... */ + spin_lock(&nf_conntrack_locks_all_lock); + spin_unlock(&nf_conntrack_locks_all_lock); spin_lock(lock); + /* ... and retry. 
*/ } } EXPORT_SYMBOL_GPL(nf_conntrack_lock); @@ -150,17 +146,11 @@ static void nf_conntrack_all_lock(void) spin_lock(&nf_conntrack_locks_all_lock); nf_conntrack_locks_all = true; - - /* -* Order the above store of 'nf_conntrack_locks_all' against -* the spin_unlock_wait() loads below, such that if -* nf_conntrack_lock() observes 'nf_conntrack_locks_all' -* we must observe nf_conntrack_locks[] held: -*/ - smp_mb(); /* spin_lock(&nf_conntrack_locks_all_lock) */ - for (i = 0; i < CONNTRACK_LOCKS; i++) { - spin_unlock_wait(&nf_conntrack_locks[i]); + /* Wait for any current holder to release lock. */ + spin_lock(&nf_conntrack_locks[i]); + spin_unlock(&nf_conntrack_locks[i]); + /* Next acquisition will see nf_conntrack_locks_all == true. */ } } -- 2.5.2
[PATCH RFC 24/26] sparc: Remove spin_unlock_wait() arch-specific definitions
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore removes the underlying arch-specific arch_spin_unlock_wait(). Signed-off-by: Paul E. McKenney Cc: "David S. Miller" Cc: Cc: Will Deacon Cc: Peter Zijlstra Cc: Alan Stern Cc: Andrea Parri Cc: Linus Torvalds --- arch/sparc/include/asm/spinlock_32.h | 5 - arch/sparc/include/asm/spinlock_64.h | 5 - 2 files changed, 10 deletions(-) diff --git a/arch/sparc/include/asm/spinlock_32.h b/arch/sparc/include/asm/spinlock_32.h index 8011e79f59c9..67345b2dc408 100644 --- a/arch/sparc/include/asm/spinlock_32.h +++ b/arch/sparc/include/asm/spinlock_32.h @@ -14,11 +14,6 @@ #define arch_spin_is_locked(lock) (*((volatile unsigned char *)(lock)) != 0) -static inline void arch_spin_unlock_wait(arch_spinlock_t *lock) -{ - smp_cond_load_acquire(&lock->lock, !VAL); -} - static inline void arch_spin_lock(arch_spinlock_t *lock) { __asm__ __volatile__( diff --git a/arch/sparc/include/asm/spinlock_64.h b/arch/sparc/include/asm/spinlock_64.h index 07c9f2e9bf57..923d57f9b79d 100644 --- a/arch/sparc/include/asm/spinlock_64.h +++ b/arch/sparc/include/asm/spinlock_64.h @@ -26,11 +26,6 @@ #define arch_spin_is_locked(lp) ((lp)->lock != 0) -static inline void arch_spin_unlock_wait(arch_spinlock_t *lock) -{ - smp_cond_load_acquire(&lock->lock, !VAL); -} - static inline void arch_spin_lock(arch_spinlock_t *lock) { unsigned long tmp; -- 2.5.2
[PATCH RFC 23/26] sh: Remove spin_unlock_wait() arch-specific definitions
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore removes the underlying arch-specific arch_spin_unlock_wait(). Signed-off-by: Paul E. McKenney Cc: Yoshinori Sato Cc: Rich Felker Cc: Cc: Will Deacon Cc: Peter Zijlstra Cc: Alan Stern Cc: Andrea Parri Cc: Linus Torvalds --- arch/sh/include/asm/spinlock-cas.h | 5 - arch/sh/include/asm/spinlock-llsc.h | 5 - 2 files changed, 10 deletions(-) diff --git a/arch/sh/include/asm/spinlock-cas.h b/arch/sh/include/asm/spinlock-cas.h index c46e8cc7b515..5ed7dbbd94ff 100644 --- a/arch/sh/include/asm/spinlock-cas.h +++ b/arch/sh/include/asm/spinlock-cas.h @@ -29,11 +29,6 @@ static inline unsigned __sl_cas(volatile unsigned *p, unsigned old, unsigned new #define arch_spin_is_locked(x) ((x)->lock <= 0) #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock) -static inline void arch_spin_unlock_wait(arch_spinlock_t *lock) -{ - smp_cond_load_acquire(&lock->lock, VAL > 0); -} - static inline void arch_spin_lock(arch_spinlock_t *lock) { while (!__sl_cas(&lock->lock, 1, 0)); diff --git a/arch/sh/include/asm/spinlock-llsc.h b/arch/sh/include/asm/spinlock-llsc.h index cec78143fa83..f77263aae760 100644 --- a/arch/sh/include/asm/spinlock-llsc.h +++ b/arch/sh/include/asm/spinlock-llsc.h @@ -21,11 +21,6 @@ #define arch_spin_is_locked(x) ((x)->lock <= 0) #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock) -static inline void arch_spin_unlock_wait(arch_spinlock_t *lock) -{ - smp_cond_load_acquire(&lock->lock, VAL > 0); -} - /* * Simple spin lock operations. There are two variants, one clears IRQ's * on the local processor, one does not. -- 2.5.2
[PATCH RFC 26/26] xtensa: Remove spin_unlock_wait() arch-specific definitions
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore removes the underlying arch-specific arch_spin_unlock_wait(). Signed-off-by: Paul E. McKenney Cc: Chris Zankel Cc: Max Filippov Cc: --- arch/xtensa/include/asm/spinlock.h | 5 - 1 file changed, 5 deletions(-) diff --git a/arch/xtensa/include/asm/spinlock.h b/arch/xtensa/include/asm/spinlock.h index a36221cf6363..3bb49681ee24 100644 --- a/arch/xtensa/include/asm/spinlock.h +++ b/arch/xtensa/include/asm/spinlock.h @@ -33,11 +33,6 @@ #define arch_spin_is_locked(x) ((x)->slock != 0) -static inline void arch_spin_unlock_wait(arch_spinlock_t *lock) -{ - smp_cond_load_acquire(&lock->slock, !VAL); -} - #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock) static inline void arch_spin_lock(arch_spinlock_t *lock) -- 2.5.2
[PATCH RFC 20/26] parisc: Remove spin_unlock_wait() arch-specific definitions
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore removes the underlying arch-specific arch_spin_unlock_wait(). Signed-off-by: Paul E. McKenney Cc: "James E.J. Bottomley" Cc: Helge Deller Cc: Cc: Will Deacon Cc: Peter Zijlstra Cc: Alan Stern Cc: Andrea Parri Cc: Linus Torvalds --- arch/parisc/include/asm/spinlock.h | 7 --- 1 file changed, 7 deletions(-) diff --git a/arch/parisc/include/asm/spinlock.h b/arch/parisc/include/asm/spinlock.h index e32936cd7f10..55bfe4affca3 100644 --- a/arch/parisc/include/asm/spinlock.h +++ b/arch/parisc/include/asm/spinlock.h @@ -14,13 +14,6 @@ static inline int arch_spin_is_locked(arch_spinlock_t *x) #define arch_spin_lock(lock) arch_spin_lock_flags(lock, 0) -static inline void arch_spin_unlock_wait(arch_spinlock_t *x) -{ - volatile unsigned int *a = __ldcw_align(x); - - smp_cond_load_acquire(a, VAL); -} - static inline void arch_spin_lock_flags(arch_spinlock_t *x, unsigned long flags) { -- 2.5.2
[PATCH RFC 09/26] alpha: Remove spin_unlock_wait() arch-specific definitions
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore removes the underlying arch-specific arch_spin_unlock_wait(). Signed-off-by: Paul E. McKenney Cc: Richard Henderson Cc: Ivan Kokshaysky Cc: Matt Turner Cc: Cc: Will Deacon Cc: Peter Zijlstra Cc: Alan Stern Cc: Andrea Parri Cc: Linus Torvalds --- arch/alpha/include/asm/spinlock.h | 5 - 1 file changed, 5 deletions(-) diff --git a/arch/alpha/include/asm/spinlock.h b/arch/alpha/include/asm/spinlock.h index a40b9fc0c6c3..718ac0b64adf 100644 --- a/arch/alpha/include/asm/spinlock.h +++ b/arch/alpha/include/asm/spinlock.h @@ -16,11 +16,6 @@ #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock) #define arch_spin_is_locked(x) ((x)->lock != 0) -static inline void arch_spin_unlock_wait(arch_spinlock_t *lock) -{ - smp_cond_load_acquire(&lock->lock, !VAL); -} - static inline int arch_spin_value_unlocked(arch_spinlock_t lock) { return lock.lock == 0; -- 2.5.2
[PATCH RFC 06/26] ipc: Replace spin_unlock_wait() with lock/unlock pair
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore replaces the spin_unlock_wait() call in exit_sem() with spin_lock() followed immediately by spin_unlock(). This should be safe from a performance perspective because exit_sem() is rarely invoked in production. Signed-off-by: Paul E. McKenney Cc: Andrew Morton Cc: Davidlohr Bueso Cc: Manfred Spraul Cc: Will Deacon Cc: Peter Zijlstra Cc: Alan Stern Cc: Andrea Parri Cc: Linus Torvalds --- ipc/sem.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/ipc/sem.c b/ipc/sem.c index 947dc2348271..e88d0749a929 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -2096,7 +2096,8 @@ void exit_sem(struct task_struct *tsk) * possibility where we exit while freeary() didn't * finish unlocking sem_undo_list. */ - spin_unlock_wait(&ulp->lock); + spin_lock(&ulp->lock); + spin_unlock(&ulp->lock); rcu_read_unlock(); break; } -- 2.5.2
[PATCH RFC 11/26] arm: Remove spin_unlock_wait() arch-specific definitions
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore removes the underlying arch-specific arch_spin_unlock_wait(). Signed-off-by: Paul E. McKenney Cc: Russell King Cc: Cc: Will Deacon Cc: Peter Zijlstra Cc: Alan Stern Cc: Andrea Parri Cc: Linus Torvalds --- arch/arm/include/asm/spinlock.h | 16 1 file changed, 16 deletions(-) diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h index 4bec45442072..c030143c18c6 100644 --- a/arch/arm/include/asm/spinlock.h +++ b/arch/arm/include/asm/spinlock.h @@ -52,22 +52,6 @@ static inline void dsb_sev(void) * memory. */ -static inline void arch_spin_unlock_wait(arch_spinlock_t *lock) -{ - u16 owner = READ_ONCE(lock->tickets.owner); - - for (;;) { - arch_spinlock_t tmp = READ_ONCE(*lock); - - if (tmp.tickets.owner == tmp.tickets.next || - tmp.tickets.owner != owner) - break; - - wfe(); - } - smp_acquire__after_ctrl_dep(); -} - #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock) static inline void arch_spin_lock(arch_spinlock_t *lock) -- 2.5.2
[PATCH RFC 0/26] Remove spin_unlock_wait()
Hello! There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This series therefore removes spin_unlock_wait() and changes its users to instead use a lock/unlock pair. The commits are as follows, in three groups: 1-7.Change uses of spin_unlock_wait() and raw_spin_unlock_wait() to instead use a spin_lock/spin_unlock pair. These may be applied in any order, but must be applied before any later commits in this series. The commit logs state why I believe that these commits won't noticeably degrade performance. 8. Remove core-kernel definitions for spin_unlock_wait() and raw_spin_unlock_wait(). 9-26. Remove arch-specific definitions of arch_spin_unlock_wait(). Thoughts? Thanx, Paul arch/alpha/include/asm/spinlock.h|5 - arch/arc/include/asm/spinlock.h |5 - arch/arm/include/asm/spinlock.h | 16 arch/arm64/include/asm/spinlock.h| 58 + arch/blackfin/include/asm/spinlock.h |5 - arch/hexagon/include/asm/spinlock.h |5 - arch/ia64/include/asm/spinlock.h | 21 -- arch/m32r/include/asm/spinlock.h |5 - arch/metag/include/asm/spinlock.h|5 - arch/mips/include/asm/spinlock.h | 16 arch/mn10300/include/asm/spinlock.h |5 - arch/parisc/include/asm/spinlock.h |7 -- arch/powerpc/include/asm/spinlock.h | 33 - arch/s390/include/asm/spinlock.h |7 -- arch/sh/include/asm/spinlock-cas.h |5 - arch/sh/include/asm/spinlock-llsc.h |5 - arch/sparc/include/asm/spinlock_32.h |5 - arch/sparc/include/asm/spinlock_64.h |5 - arch/tile/include/asm/spinlock_32.h |2 arch/tile/include/asm/spinlock_64.h |2 arch/tile/lib/spinlock_32.c | 23 -- arch/tile/lib/spinlock_64.c | 22 -- arch/xtensa/include/asm/spinlock.h |5 - drivers/ata/libata-eh.c |8 -- include/asm-generic/qspinlock.h | 14 include/linux/spinlock.h | 31 - include/linux/spinlock_up.h |6 - ipc/sem.c|3 kernel/exit.c|3 kernel/locking/qspinlock.c | 117 --- kernel/sched/completion.c|9 -- kernel/sched/core.c |3 kernel/task_work.c |3 net/netfilter/nf_conntrack_core.c| 26 
++- 34 files changed, 26 insertions(+), 464 deletions(-)
RFC: bring UP 'lo' by default after creating new netns
Creation of new network namespace is almost always followed up by bringing up the loopback device. ip netns add foo ip -netns foo link set lo up I'm not sure if there are any consequences of bringing the device UP at the creation of network-namespace. thanks, --mahesh.. diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index 2178db8e47cd..ac0e86c9a17f 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -428,6 +428,11 @@ struct net *copy_net_ns(unsigned long flags, net_drop_ns(net); return ERR_PTR(rv); } + /* Set the loopback device UP */ + rtnl_lock(); + dev_open(net->loopback_dev); + rtnl_unlock(); + return net; }
Re: [PATCH RFC 0/2] kproxy: Kernel Proxy
On Thu, Jun 29, 2017 at 1:58 PM, Willy Tarreauwrote: > On Thu, Jun 29, 2017 at 01:40:26PM -0700, Tom Herbert wrote: >> > In fact that's not much what I observe in field. In practice, large >> > data streams are cheaply relayed using splice(), I could achieve >> > 60 Gbps of HTTP forwarding via HAProxy on a 4-core xeon 2 years ago. >> > And when you use SSL, the cost of the copy to/from kernel is small >> > compared to all the crypto operations surrounding this. >> > >> Right, getting rid of the extra crypto operations and so called "SSL >> inspection" is the ultimate goal this is going towards. > > Yep but in order to take decisions at L7 you need to decapsulate SSL. > Decapsulate or decrypt? There's a big difference... :-) I'm am aiming to just have to decapsulate. >> HTTP is only one use case. The are other interesting use cases such as >> those in container security where the application protocol might be >> something like simple RPC. > > OK that indeed makes sense in such environments. > >> Performance is relevant because we >> potentially want security applied to every message in every >> communication in a containerized data center. Putting the userspace >> hop in the datapath of every packet is know to be problematic, not >> just for the performance hit also because it increases the attack >> surface on users' privacy. > > While I totally agree on the performance hit when inspecting each packet, > I fail to see the relation with users' privacy. In fact under some > circumstances it can even be the opposite. For example, using something > like kTLS for a TCP/HTTP proxy can result in cleartext being observable > in strace while it's not visible when TLS is terminated in userland because > all you see are openssl's read()/write() operations. Maybe you have specific > attacks in mind ? > No, just the normal problem of making yet one more tool systematically have access to user data. 
>> > Regarding kernel-side protocol parsing, there's an unfortunate trend >> > at moving more and more protocols to userland due to these protocols >> > evolving very quickly. At least you'll want to find a way to provide >> > these parsers from userspace, which will inevitably come with its set >> > of problems or limitations :-/ >> > >> That's why everything is going BPF now ;-) > > Yes, I knew you were going to suggest this :-) I'm still prudent on it > to be honnest. I don't think it would be that easy to implement an HPACK > encoder/decoder using BPF. And even regarding just plain HTTP parsing, > certain very small operations in haproxy's parser can quickly result in > a 10% performance degradation when improperly optimized (ie: changing a > "likely", altering branch prediction, or cache walk patterns when using > arrays to evaluate character classes faster). But for general usage I > indeed think it should be OK. > HTTP might qualify as a special case, and I believe there's already been some work to put http in kernel by Alexander Krizhanovsky and others. In this case maybe http parse could be front end before BPF. Although, pretty clear we'll need regex in BPF if we want use it with http. Tom
Re: [PATCH RFC 0/2] kproxy: Kernel Proxy
On Thu, Jun 29, 2017 at 3:04 PM, Thomas Grafwrote: > Hi Tom > > On 29 June 2017 at 11:27, Tom Herbert wrote: >> This is raw, minimally tested, and error hanlding needs work. Posting >> as RFC to get feedback on the design... >> >> Sidecar proxies are becoming quite popular on server as a means to >> perform layer 7 processing on application data as it is sent. Such >> sidecars are used for SSL proxies, application firewalls, and L7 >> load balancers. While these proxies provide nice functionality, >> their performance is obviously terrible since all the data needs >> to take an extra hop though userspace. > Hi Thomas, > I really appreciate this work. It would have been nice to at least > attribute me in some way as this is exactly what I presented at > Netconf 2017 [0]. > Sure, will do that! > I'm also wondering why this is not built on top of KCM which you > suggested to use when we discussed this. > I think the main part of that discussion was around stream parser which is needed for message delineation. For a 1:1 proxy, KCM is probably overkill (the whole KCM data path and lock becomes superfluous). Also, there's no concept of creating a whole message before routing it, in the 1:1 case we should let the message pass through once it's cleared by the filter (this is the strparser change I referred to). As I mentioned, for L7 load balancing we would want a multiplexor probably also M:N, but the structure is different since there's still no user facing sockets, they're all TCP for instance. IMO, the 1:1 proxy case is compelling to solve in itself... Tom > [0] > https://docs.google.com/presentation/d/1dwSKSBGpUHD3WO5xxzZWj8awV_-xL-oYhvqQMOBhhtk/edit#slide=id.g203aae413f_0_0
Re: CAN-FD Transceiver Limitations
On 06/29/2017 05:36 PM, Kurt Van Dijck wrote: mcan@0 { ... fixed-transceiver { max-canfd-speed = <2000> }; ... }; > > Since when would a transceiver support different speeds for CAN & CANFD? When I say CAN I'm referring to CAN 2.0 specification which mentioned speeds upto 1 Mbit/s. While CAN FD supports higher bitrates. > No transceivers were available, but they are now. > I see no datalink problem applying 2MBit for regular CAN with apropriate > physical layer, and CAN does not predefine the physical layer > (advise != define). > > IMHO, > fixed-transceiver { > max-arbitration-speed = <200> > max-data-speed = <400> > }; > is way better to describe the hardware. > Regular CAN chips would not consider max-data-speed... What is arbitration speed? Also if I understand you correctly then I agree drivers for traditional CAN wouldn't care about this subnode. Although it may be helpful for max-data-speed to become max-canfd-speed or something along those lines. Just so the property's purpose is clear. > > Kind regards, > Kurt >
[PATCH 03/29] netfilter: dup: resolve warnings about missing prototypes
From: stephen hemmingerMissing include file causes: net/netfilter/nf_dup_netdev.c:26:6: warning: no previous prototype for ‘nf_fwd_netdev_egress’ [-Wmissing-prototypes] net/netfilter/nf_dup_netdev.c:40:6: warning: no previous prototype for ‘nf_dup_netdev_egress’ [-Wmissing-prototypes] Signed-off-by: Stephen Hemminger Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_dup_netdev.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/netfilter/nf_dup_netdev.c b/net/netfilter/nf_dup_netdev.c index c9d7f95768ab..f4a566e67213 100644 --- a/net/netfilter/nf_dup_netdev.c +++ b/net/netfilter/nf_dup_netdev.c @@ -13,6 +13,7 @@ #include #include #include +#include static void nf_do_netdev_egress(struct sk_buff *skb, struct net_device *dev) { -- 2.1.4
Re: [PATCH] sctp: Add peeloff-flags socket option
On Thu, Jun 29, 2017 at 02:13:40PM -0400, Neil Horman wrote: > diff --git a/fs/file.c b/fs/file.c > index 1c2972e..a4521da 100644 > --- a/fs/file.c > +++ b/fs/file.c > @@ -807,6 +807,7 @@ void set_close_on_exec(unsigned int fd, int flag) > __clear_close_on_exec(fd, fdt); > spin_unlock(&files->file_lock); > } > +EXPORT_SYMBOL(set_close_on_exec); Huh? > + set_close_on_exec(peeloff->sd, !!(flags & SOCK_CLOEXEC)); Why hadn't you set it when that descriptor had been allocated? That retval = get_unused_fd_flags(0); could just pass O_CLOEXEC and you'd get it set from the very beginning...
[PATCH 01/29] netfilter: ctnetlink: delete extra spaces
From: linzhang This patch cleans up extra spaces. Signed-off-by: linzhang Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_conntrack_netlink.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index 9799a50bc604..f08604dd1a59 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -636,11 +636,11 @@ ctnetlink_conntrack_event(unsigned int events, struct nf_ct_event *item) if (events & (1 << IPCT_DESTROY)) { type = IPCTNL_MSG_CT_DELETE; group = NFNLGRP_CONNTRACK_DESTROY; - } else if (events & ((1 << IPCT_NEW) | (1 << IPCT_RELATED))) { + } else if (events & ((1 << IPCT_NEW) | (1 << IPCT_RELATED))) { type = IPCTNL_MSG_CT_NEW; flags = NLM_F_CREATE|NLM_F_EXCL; group = NFNLGRP_CONNTRACK_NEW; - } else if (events) { + } else if (events) { type = IPCTNL_MSG_CT_NEW; group = NFNLGRP_CONNTRACK_UPDATE; } else -- 2.1.4
[PATCH 04/29] netfilter: nft_rt: make local functions static
From: stephen hemmingerResolves warnings: net/netfilter/nft_rt.c:26:6: warning: no previous prototype for ‘nft_rt_get_eval’ [-Wmissing-prototypes] net/netfilter/nft_rt.c:75:5: warning: no previous prototype for ‘nft_rt_get_init’ [-Wmissing-prototypes] net/netfilter/nft_rt.c:106:5: warning: no previous prototype for ‘nft_rt_get_dump’ [-Wmissing-prototypes] Signed-off-by: Stephen Hemminger Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nft_rt.c | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/net/netfilter/nft_rt.c b/net/netfilter/nft_rt.c index d3eb640bc784..c7383d8f88d0 100644 --- a/net/netfilter/nft_rt.c +++ b/net/netfilter/nft_rt.c @@ -23,9 +23,9 @@ struct nft_rt { enum nft_registers dreg:8; }; -void nft_rt_get_eval(const struct nft_expr *expr, -struct nft_regs *regs, -const struct nft_pktinfo *pkt) +static void nft_rt_get_eval(const struct nft_expr *expr, + struct nft_regs *regs, + const struct nft_pktinfo *pkt) { const struct nft_rt *priv = nft_expr_priv(expr); const struct sk_buff *skb = pkt->skb; @@ -72,9 +72,9 @@ const struct nla_policy nft_rt_policy[NFTA_RT_MAX + 1] = { [NFTA_RT_KEY] = { .type = NLA_U32 }, }; -int nft_rt_get_init(const struct nft_ctx *ctx, - const struct nft_expr *expr, - const struct nlattr * const tb[]) +static int nft_rt_get_init(const struct nft_ctx *ctx, + const struct nft_expr *expr, + const struct nlattr * const tb[]) { struct nft_rt *priv = nft_expr_priv(expr); unsigned int len; @@ -103,8 +103,8 @@ int nft_rt_get_init(const struct nft_ctx *ctx, NFT_DATA_VALUE, len); } -int nft_rt_get_dump(struct sk_buff *skb, - const struct nft_expr *expr) +static int nft_rt_get_dump(struct sk_buff *skb, + const struct nft_expr *expr) { const struct nft_rt *priv = nft_expr_priv(expr); -- 2.1.4
Re: [PATCH net-next v3 0/9] introduce flower offload capabilities
On Thu, 29 Jun 2017 22:08:10 +0200, Simon Horman wrote: > Hi, > > this series adds flower offload to the NFP driver. It builds on recent > work to add representor and a skeleton flower app - now the app does what > its name says. > > In general the approach taken is to allow some flows within > the universe of possible flower matches and tc actions to be offloaded. > It is planned that this support will grow over time but the support > offered by this patch-set seems to be a reasonable starting point. > > Key Changes since v2: > * Revised flow structures to simplify setup/teardown and locking of stats > * Addressed other code-change review of v2 > > Other review questions regarding v2 have been answered on netdev. Looks OK, thanks Simon and Pieter! Reviewed-by: Jakub Kicinski
[PATCH 18/29] netfilter: nft_set_hash: add lookup variant for fixed size hashtable
This patch provides a faster variant of the lookup function for 2 and 4 byte keys. Optimizing the one byte case is not worthwhile, as the set backend selection will always select the bitmap set type for such case. Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nft_set_hash.c | 60 ++-- 1 file changed, 58 insertions(+), 2 deletions(-) diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c index b2eab94362d6..0fa01d772c5e 100644 --- a/net/netfilter/nft_set_hash.c +++ b/net/netfilter/nft_set_hash.c @@ -409,6 +409,38 @@ static bool nft_hash_lookup(const struct net *net, const struct nft_set *set, return false; } +/* nft_hash_select_ops() makes sure key size can be either 2 or 4 bytes. */ +static inline u32 nft_hash_key(const u32 *key, u32 klen) +{ + if (klen == 4) + return *key; + + return *(u16 *)key; +} + +static bool nft_hash_lookup_fast(const struct net *net, +const struct nft_set *set, +const u32 *key, const struct nft_set_ext **ext) +{ + struct nft_hash *priv = nft_set_priv(set); + u8 genmask = nft_genmask_cur(net); + const struct nft_hash_elem *he; + u32 hash, k1, k2; + + k1 = nft_hash_key(key, set->klen); + hash = jhash_1word(k1, priv->seed); + hash = reciprocal_scale(hash, priv->buckets); + hlist_for_each_entry_rcu(he, &priv->table[hash], node) { + k2 = nft_hash_key(nft_set_ext_key(&he->ext)->data, set->klen); + if (k1 == k2 && + nft_set_elem_active(&he->ext, genmask)) { + *ext = &he->ext; + return true; + } + } + return false; +} + static int nft_hash_insert(const struct net *net, const struct nft_set *set, const struct nft_set_elem *elem, struct nft_set_ext **ext) @@ -588,12 +620,36 @@ static struct nft_set_ops nft_hash_ops __read_mostly = { .features = NFT_SET_MAP | NFT_SET_OBJECT, }; +static struct nft_set_ops nft_hash_fast_ops __read_mostly = { + .type = &nft_hash_type, + .privsize = nft_hash_privsize, + .elemsize = offsetof(struct nft_hash_elem, ext), + .estimate = nft_hash_estimate, + .init = nft_hash_init, + .destroy = nft_hash_destroy, + .insert = 
nft_hash_insert, + .activate = nft_hash_activate, + .deactivate = nft_hash_deactivate, + .flush = nft_hash_flush, + .remove = nft_hash_remove, + .lookup = nft_hash_lookup_fast, + .walk = nft_hash_walk, + .features = NFT_SET_MAP | NFT_SET_OBJECT, +}; + static const struct nft_set_ops * nft_hash_select_ops(const struct nft_ctx *ctx, const struct nft_set_desc *desc, u32 flags) { - if (desc->size) - return &nft_hash_ops; + if (desc->size) { + switch (desc->klen) { + case 2: + case 4: + return &nft_hash_fast_ops; + default: + return &nft_hash_ops; + } + } return &nft_rhash_ops; } -- 2.1.4
[PATCH 07/29] netfilter: conntrack: add nf_ct_iterate_destroy
From: Florian Westphalsledgehammer to be used on module unload (to remove affected conntracks from all namespaces). It will also flag all unconfirmed conntracks as dying, i.e. they will not be committed to main table. Signed-off-by: Florian Westphal Signed-off-by: Pablo Neira Ayuso --- include/net/netfilter/nf_conntrack.h | 4 ++ net/netfilter/nf_conntrack_core.c| 87 ++-- 2 files changed, 78 insertions(+), 13 deletions(-) diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h index f21180ea4558..48407569585d 100644 --- a/include/net/netfilter/nf_conntrack.h +++ b/include/net/netfilter/nf_conntrack.h @@ -229,6 +229,10 @@ void nf_ct_iterate_cleanup_net(struct net *net, int (*iter)(struct nf_conn *i, void *data), void *data, u32 portid, int report); +/* also set unconfirmed conntracks as dying. Only use in module exit path. */ +void nf_ct_iterate_destroy(int (*iter)(struct nf_conn *i, void *data), + void *data); + struct nf_conntrack_zone; void nf_conntrack_free(struct nf_conn *ct); diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 08733685d732..7ecee79c78b8 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -1586,7 +1586,7 @@ static void nf_conntrack_attach(struct sk_buff *nskb, const struct sk_buff *skb) /* Bring out ya dead! 
 */
 static struct nf_conn *
-get_next_corpse(struct net *net, int (*iter)(struct nf_conn *i, void *data),
+get_next_corpse(int (*iter)(struct nf_conn *i, void *data),
 		void *data, unsigned int *bucket)
 {
 	struct nf_conntrack_tuple_hash *h;
@@ -1603,8 +1603,7 @@ get_next_corpse(struct net *net, int (*iter)(struct nf_conn *i, void *data),
 			if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL)
 				continue;
 			ct = nf_ct_tuplehash_to_ctrack(h);
-			if (net_eq(nf_ct_net(ct), net) &&
-			    iter(ct, data))
+			if (iter(ct, data))
 				goto found;
 		}
 	}
@@ -1621,6 +1620,39 @@ get_next_corpse(struct net *net, int (*iter)(struct nf_conn *i, void *data),
 	return ct;
 }
 
+static void nf_ct_iterate_cleanup(int (*iter)(struct nf_conn *i, void *data),
+				  void *data, u32 portid, int report)
+{
+	struct nf_conn *ct;
+	unsigned int bucket = 0;
+
+	might_sleep();
+
+	while ((ct = get_next_corpse(iter, data, &bucket)) != NULL) {
+		/* Time to push up daises... */
+
+		nf_ct_delete(ct, portid, report);
+		nf_ct_put(ct);
+		cond_resched();
+	}
+}
+
+struct iter_data {
+	int (*iter)(struct nf_conn *i, void *data);
+	void *data;
+	struct net *net;
+};
+
+static int iter_net_only(struct nf_conn *i, void *data)
+{
+	struct iter_data *d = data;
+
+	if (!net_eq(d->net, nf_ct_net(i)))
+		return 0;
+
+	return d->iter(i, d->data);
+}
+
 static void __nf_ct_unconfirmed_destroy(struct net *net)
 {
@@ -1653,8 +1685,7 @@ void nf_ct_iterate_cleanup_net(struct net *net,
 			       int (*iter)(struct nf_conn *i, void *data),
 			       void *data, u32 portid, int report)
 {
-	struct nf_conn *ct;
-	unsigned int bucket = 0;
+	struct iter_data d;
 
 	might_sleep();
 
@@ -1663,21 +1694,51 @@ void nf_ct_iterate_cleanup_net(struct net *net,
 	__nf_ct_unconfirmed_destroy(net);
 
+	d.iter = iter;
+	d.data = data;
+	d.net = net;
+
 	synchronize_net();
 
-	while ((ct = get_next_corpse(net, iter, data, &bucket)) != NULL) {
-		/* Time to push up daises...
 */
+	nf_ct_iterate_cleanup(iter_net_only, &d, portid, report);
+}
+EXPORT_SYMBOL_GPL(nf_ct_iterate_cleanup_net);
 
-		nf_ct_delete(ct, portid, report);
-		nf_ct_put(ct);
-		cond_resched();
+/**
+ * nf_ct_iterate_destroy - destroy unconfirmed conntracks and iterate table
+ * @iter: callback to invoke for each conntrack
+ * @data: data to pass to @iter
+ *
+ * Like nf_ct_iterate_cleanup, but first marks conntracks on the
+ * unconfirmed list as dying (so they will not be inserted into
+ * main table).
+ */
+void
+nf_ct_iterate_destroy(int (*iter)(struct nf_conn *i, void *data), void *data)
+{
+	struct net *net;
+
+	rtnl_lock();
+	for_each_net(net) {
+		if (atomic_read(&net->ct.count) == 0)
+			continue;
+		__nf_ct_unconfirmed_destroy(net);
+	}
+	rtnl_unlock();
+
+	/* a conntrack could have been unlinked from unconfirmed list
[PATCH 08/29] netfilter: conntrack: restart iteration on resize
From: Florian Westphal

We could miss some conntracks when a resize occurs in parallel. Avoid this by sampling the generation seqcount and doing a restart if needed.

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
---
 net/netfilter/nf_conntrack_core.c | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 7ecee79c78b8..c3bd9b086dcc 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1623,17 +1623,25 @@ get_next_corpse(int (*iter)(struct nf_conn *i, void *data),
 static void nf_ct_iterate_cleanup(int (*iter)(struct nf_conn *i, void *data),
 				  void *data, u32 portid, int report)
 {
+	unsigned int bucket = 0, sequence;
 	struct nf_conn *ct;
-	unsigned int bucket = 0;
 
 	might_sleep();
 
-	while ((ct = get_next_corpse(iter, data, &bucket)) != NULL) {
-		/* Time to push up daises... */
+	for (;;) {
+		sequence = read_seqcount_begin(&nf_conntrack_generation);
 
-		nf_ct_delete(ct, portid, report);
-		nf_ct_put(ct);
-		cond_resched();
+		while ((ct = get_next_corpse(iter, data, &bucket)) != NULL) {
+			/* Time to push up daises... */
+
+			nf_ct_delete(ct, portid, report);
+			nf_ct_put(ct);
+			cond_resched();
+		}
+
+		if (!read_seqcount_retry(&nf_conntrack_generation, sequence))
+			break;
+		bucket = 0;
 	}
 }
-- 
2.1.4
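The retry idiom used here (sample a generation count, iterate, restart from bucket 0 if the count moved) can be sketched in user space. All names below are made up for illustration, and the "seqcount" is reduced to a plain counter with none of the kernel's memory barriers:

```c
/* Toy generation counter: a resizer bumps it, so any change observed
 * between begin and retry means the table may have been reshuffled. */
static unsigned int generation;
static int passes;

static unsigned int read_begin(void) { return generation; }
static int read_retry(unsigned int snap) { return generation != snap; }

/* Walk all buckets; restart from bucket 0 whenever a resize raced us. */
static int iterate_all(int nbuckets, int (*visit)(int bucket, void *data),
		       void *data)
{
	unsigned int seq;
	int bucket, visited;

	do {
		seq = read_begin();
		visited = 0;
		for (bucket = 0; bucket < nbuckets; bucket++)
			visited += visit(bucket, data);
	} while (read_retry(seq));	/* resize raced us: start over */

	return visited;
}

/* Test callback: simulate a concurrent resize during the first pass. */
static int count_visit(int bucket, void *data)
{
	(void)data;
	if (passes == 0 && bucket == 2) {
		generation++;
		passes = 1;
	}
	return 1;
}
```

The cost of a spurious restart is bounded: deletions already performed are not repeated (the entries are gone), so the loop converges once no resize intervenes.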
[PATCH 05/29] netfilter: conntrack: rename nf_ct_iterate_cleanup
From: Florian Westphal

There are several places where we needlessly call nf_ct_iterate_cleanup; we should instead iterate the full table once at module unload time. This is a leftover from back when the conntrack table was duplicated per net namespace. So rename nf_ct_iterate_cleanup to nf_ct_iterate_cleanup_net. A later patch will then add a non-net variant.

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
---
 include/net/netfilter/nf_conntrack.h        |  6 +++---
 net/ipv4/netfilter/nf_nat_masquerade_ipv4.c |  4 ++--
 net/ipv6/netfilter/nf_nat_masquerade_ipv6.c | 10 +-
 net/netfilter/nf_conntrack_core.c           | 10 +-
 net/netfilter/nf_conntrack_netlink.c        |  4 ++--
 net/netfilter/nf_conntrack_proto.c          |  4 ++--
 net/netfilter/nf_nat_core.c                 |  6 +++---
 7 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 8ece3612d0cd..f21180ea4558 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -225,9 +225,9 @@ extern s32 (*nf_ct_nat_offset)(const struct nf_conn *ct,
 			       u32 seq);
 
 /* Iterate over all conntracks: if iter returns true, it's deleted.
*/ -void nf_ct_iterate_cleanup(struct net *net, - int (*iter)(struct nf_conn *i, void *data), - void *data, u32 portid, int report); +void nf_ct_iterate_cleanup_net(struct net *net, + int (*iter)(struct nf_conn *i, void *data), + void *data, u32 portid, int report); struct nf_conntrack_zone; diff --git a/net/ipv4/netfilter/nf_nat_masquerade_ipv4.c b/net/ipv4/netfilter/nf_nat_masquerade_ipv4.c index dc1dea15c1b4..f39037fca923 100644 --- a/net/ipv4/netfilter/nf_nat_masquerade_ipv4.c +++ b/net/ipv4/netfilter/nf_nat_masquerade_ipv4.c @@ -98,8 +98,8 @@ static int masq_device_event(struct notifier_block *this, */ NF_CT_ASSERT(dev->ifindex != 0); - nf_ct_iterate_cleanup(net, device_cmp, - (void *)(long)dev->ifindex, 0, 0); + nf_ct_iterate_cleanup_net(net, device_cmp, + (void *)(long)dev->ifindex, 0, 0); } return NOTIFY_DONE; diff --git a/net/ipv6/netfilter/nf_nat_masquerade_ipv6.c b/net/ipv6/netfilter/nf_nat_masquerade_ipv6.c index 2297c9f073ba..d7b679037bae 100644 --- a/net/ipv6/netfilter/nf_nat_masquerade_ipv6.c +++ b/net/ipv6/netfilter/nf_nat_masquerade_ipv6.c @@ -75,8 +75,8 @@ static int masq_device_event(struct notifier_block *this, struct net *net = dev_net(dev); if (event == NETDEV_DOWN) - nf_ct_iterate_cleanup(net, device_cmp, - (void *)(long)dev->ifindex, 0, 0); + nf_ct_iterate_cleanup_net(net, device_cmp, + (void *)(long)dev->ifindex, 0, 0); return NOTIFY_DONE; } @@ -99,7 +99,7 @@ static void iterate_cleanup_work(struct work_struct *work) w = container_of(work, struct masq_dev_work, work); index = w->ifindex; - nf_ct_iterate_cleanup(w->net, device_cmp, (void *)index, 0, 0); + nf_ct_iterate_cleanup_net(w->net, device_cmp, (void *)index, 0, 0); put_net(w->net); kfree(w); @@ -110,12 +110,12 @@ static void iterate_cleanup_work(struct work_struct *work) /* ipv6 inet notifier is an atomic notifier, i.e. we cannot * schedule. 
* - * Unfortunately, nf_ct_iterate_cleanup can run for a long + * Unfortunately, nf_ct_iterate_cleanup_net can run for a long * time if there are lots of conntracks and the system * handles high softirq load, so it frequently calls cond_resched * while iterating the conntrack table. * - * So we defer nf_ct_iterate_cleanup walk to the system workqueue. + * So we defer nf_ct_iterate_cleanup_net walk to the system workqueue. * * As we can have 'a lot' of inet_events (depending on amount * of ipv6 addresses being deleted), we also need to add an upper diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index e847dbaa0c6b..2730f9df33b7 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -1634,9 +1634,9 @@ get_next_corpse(struct net *net, int (*iter)(struct nf_conn *i, void *data), return ct; } -void nf_ct_iterate_cleanup(struct net *net, - int (*iter)(struct nf_conn *i, void *data), - void *data, u32 portid, int report) +void nf_ct_iterate_cleanup_net(struct net *net, + int (*iter)(struct nf_conn *i, void *data), + void *data, u32 portid, int report) { struct nf_conn *ct; unsigned int bucket = 0; @@ -1654,7
[PATCH 16/29] netfilter: nf_tables: allow large allocations for new sets
The new fixed size hashtable backend implementation may result in a large array of buckets that would spew splats from mm. Update this code to fall back on vmalloc in case the memory allocation order is too costly.

Signed-off-by: Pablo Neira Ayuso
---
 net/netfilter/nf_tables_api.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 2969016d8cad..bc8f03a53734 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -13,6 +13,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -3054,10 +3055,11 @@ static int nf_tables_newset(struct net *net, struct sock *nlsk,
 	if (ops->privsize != NULL)
 		size = ops->privsize(nla, &desc);
 
-	err = -ENOMEM;
-	set = kzalloc(sizeof(*set) + size + udlen, GFP_KERNEL);
-	if (set == NULL)
+	set = kvzalloc(sizeof(*set) + size + udlen, GFP_KERNEL);
+	if (!set) {
+		err = -ENOMEM;
 		goto err1;
+	}
 
 	nla_strlcpy(name, nla[NFTA_SET_NAME], sizeof(set->name));
 	err = nf_tables_set_alloc_name(&ctx, set, name);
@@ -3100,7 +3102,7 @@ static int nf_tables_newset(struct net *net, struct sock *nlsk,
 err3:
 	ops->destroy(set);
 err2:
-	kfree(set);
+	kvfree(set);
 err1:
 	module_put(ops->type->owner);
 	return err;
@@ -3110,7 +3112,7 @@ static void nft_set_destroy(struct nft_set *set)
 {
 	set->ops->destroy(set);
 	module_put(set->ops->type->owner);
-	kfree(set);
+	kvfree(set);
 }
 
 static void nf_tables_set_destroy(const struct nft_ctx *ctx, struct nft_set *set)
-- 
2.1.4
[PATCH 00/29] Netfilter updates for net-next
Hi David,

The following patchset contains Netfilter updates for your net-next tree. This batch contains connection tracking updates for the cleanup iteration path, patches from Florian Westphal:

1) Skip unconfirmed conntracks in nf_ct_iterate_cleanup_net(), just set the dying bit to let the CPU release them.

2) Add nf_ct_iterate_destroy() to be used on module removal, to kill conntracks from all namespaces.

3) Restart iteration on hashtable resizing, since both may occur at the same time.

4) Use the new nf_ct_iterate_destroy() to remove conntracks with NAT mappings on module removal.

5) Use nf_ct_iterate_destroy() to remove conntrack entries on helper module removal, from Liping Zhang.

6) Use nf_ct_iterate_cleanup_net() to remove the timeout extension if the user requests this, also from Liping.

7) Add net_ns_barrier() and use it from the FTP helper, to make sure no concurrent namespace removal happens while the helper module is being removed.

8) Use NFPROTO_MAX in the layer 3 conntrack protocol array, to reduce module size. Same thing in nf_tables.

Updates for the nf_tables infrastructure:

9) Prepare usage of the extended ACK reporting infrastructure for nf_tables.

10) Remove unnecessary forward declaration in the nf_tables hash set.

11) Skip set size estimation if the number of elements is not specified.

12) Changes to accommodate a (faster) unresizable hash set implementation, for anonymous sets and dynamically sized fixed sets with no timeouts.

13) Faster lookup function for unresizable hash tables with 2 and 4 byte keys.

And, finally, a bunch of assorted small updates and cleanups:

14) Do not hold a reference to netdev from ipt_CLUSTERIP; instead subscribe to device events and look up the index from the packet path. This fixes an issue that has been present since the very beginning, patch from Xin Long.

15) Use nf_register_net_hook() in ipt_CLUSTERIP, from Florian Westphal.

16) Use ebt_invalid_target() whenever possible in the ebtables tree, from Gao Feng.
17) Calm down a compilation warning in the nf_dup infrastructure, patch from stephen hemminger.

18) Statify functions in the nftables rt expression, also from stephen.

19) Update the Makefile to use the canonical method to specify nf_tables-objs, from Jike Song.

20) Use nf_conntrack_helpers_register() in amanda and H323.

21) Space cleanup for ctnetlink, from linzhang.

You can pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git

Thanks!

The following changes since commit 417ccf6b5bc3f1a390505d5ef65ec17f10e8b29a:

  net: make struct request_sock_ops::obj_size unsigned (2017-05-23 11:13:19 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git HEAD

for you to fetch changes up to 04ba724b659c6808b0ca31528121bdb2f2807e00:

  netfilter: nfnetlink: extended ACK reporting (2017-06-19 19:38:24 +0200)

Florian Westphal (10):
      netfilter: ipt_CLUSTERIP: switch to nf_register_net_hook
      netfilter: conntrack: rename nf_ct_iterate_cleanup
      netfilter: conntrack: don't call iter for non-confirmed conntracks
      netfilter: conntrack: add nf_ct_iterate_destroy
      netfilter: conntrack: restart iteration on resize
      netfilter: nat: destroy nat mappings on module exit path only
      netfilter: move table iteration out of netns exit paths
      netns: add and use net_ns_barrier
      netfilter: conntrack: use NFPROTO_MAX to size array
      netfilter: nf_tables: reduce chain type table size

Gao Feng (1):
      netfilter: ebt: Use new helper ebt_invalid_target to check target

Jike Song (1):
      netfilter, kbuild: use canonical method to specify objs.
Liping Zhang (3): netfilter: nf_ct_helper: use nf_ct_iterate_destroy to unlink helper objs netfilter: cttimeout: use nf_ct_iterate_cleanup_net to unlink timeout objs netfilter: use nf_conntrack_helpers_register when possible Pablo Neira Ayuso (10): netfilter: nft_set_hash: unnecessary forward declaration netfilter: nf_tables: no size estimation if number of set elements is unknown netfilter: nft_set_hash: use nft_rhash prefix for resizable set backend netfilter: nf_tables: select set backend flavour depending on description netfilter: nf_tables: pass set description to ->privsize netfilter: nft_set_hash: add nft_hash_buckets() netfilter: nf_tables: allow large allocations for new sets netfilter: nft_set_hash: add non-resizable hashtable implementation netfilter: nft_set_hash: add lookup variant for fixed size hashtable netfilter: nfnetlink: extended ACK reporting Xin Long (1): netfilter: ipt_CLUSTERIP: do not hold dev linzhang (1): netfilter: ctnetlink: delete extra spaces stephen hemminger (2): netfilter: dup: resolve warnings about missing prototypes
[PATCH 17/29] netfilter: nft_set_hash: add non-resizable hashtable implementation
This patch adds a simple non-resizable hashtable implementation. If the user specifies the set size, then this new faster hashtable flavour is selected. Signed-off-by: Pablo Neira Ayuso--- net/netfilter/nft_set_hash.c | 210 +-- 1 file changed, 202 insertions(+), 8 deletions(-) diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c index 466cb7092dfa..b2eab94362d6 100644 --- a/net/netfilter/nft_set_hash.c +++ b/net/netfilter/nft_set_hash.c @@ -371,14 +371,181 @@ static u32 nft_hash_buckets(u32 size) static bool nft_rhash_estimate(const struct nft_set_desc *desc, u32 features, struct nft_set_estimate *est) { - if (desc->size) - est->size = sizeof(struct nft_rhash) + - nft_hash_buckets(desc->size) * - sizeof(struct nft_rhash_elem *) + - desc->size * sizeof(struct nft_rhash_elem); - else - est->size = ~0; + est->size = ~0; + est->lookup = NFT_SET_CLASS_O_1; + est->space = NFT_SET_CLASS_O_N; + + return true; +} + +struct nft_hash { + u32 seed; + u32 buckets; + struct hlist_head table[]; +}; + +struct nft_hash_elem { + struct hlist_node node; + struct nft_set_ext ext; +}; + +static bool nft_hash_lookup(const struct net *net, const struct nft_set *set, + const u32 *key, const struct nft_set_ext **ext) +{ + struct nft_hash *priv = nft_set_priv(set); + u8 genmask = nft_genmask_cur(net); + const struct nft_hash_elem *he; + u32 hash; + + hash = jhash(key, set->klen, priv->seed); + hash = reciprocal_scale(hash, priv->buckets); + hlist_for_each_entry_rcu(he, >table[hash], node) { + if (!memcmp(nft_set_ext_key(>ext), key, set->klen) && + nft_set_elem_active(>ext, genmask)) { + *ext = >ext; + return true; + } + } + return false; +} + +static int nft_hash_insert(const struct net *net, const struct nft_set *set, + const struct nft_set_elem *elem, + struct nft_set_ext **ext) +{ + struct nft_hash_elem *this = elem->priv, *he; + struct nft_hash *priv = nft_set_priv(set); + u8 genmask = nft_genmask_next(net); + u32 hash; + + hash = jhash(nft_set_ext_key(>ext), 
set->klen, priv->seed); + hash = reciprocal_scale(hash, priv->buckets); + hlist_for_each_entry(he, >table[hash], node) { + if (!memcmp(nft_set_ext_key(>ext), + nft_set_ext_key(>ext), set->klen) && + nft_set_elem_active(>ext, genmask)) { + *ext = >ext; + return -EEXIST; + } + } + hlist_add_head_rcu(>node, >table[hash]); + return 0; +} + +static void nft_hash_activate(const struct net *net, const struct nft_set *set, + const struct nft_set_elem *elem) +{ + struct nft_hash_elem *he = elem->priv; + + nft_set_elem_change_active(net, set, >ext); +} + +static bool nft_hash_flush(const struct net *net, + const struct nft_set *set, void *priv) +{ + struct nft_hash_elem *he = priv; + + nft_set_elem_change_active(net, set, >ext); + return true; +} + +static void *nft_hash_deactivate(const struct net *net, +const struct nft_set *set, +const struct nft_set_elem *elem) +{ + struct nft_hash *priv = nft_set_priv(set); + struct nft_hash_elem *this = elem->priv, *he; + u8 genmask = nft_genmask_next(net); + u32 hash; + + hash = jhash(nft_set_ext_key(>ext), set->klen, priv->seed); + hash = reciprocal_scale(hash, priv->buckets); + hlist_for_each_entry(he, >table[hash], node) { + if (!memcmp(nft_set_ext_key(>ext), >key.val, + set->klen) || + nft_set_elem_active(>ext, genmask)) { + nft_set_elem_change_active(net, set, >ext); + return he; + } + } + return NULL; +} + +static void nft_hash_remove(const struct net *net, + const struct nft_set *set, + const struct nft_set_elem *elem) +{ + struct nft_hash_elem *he = elem->priv; + + hlist_del_rcu(>node); +} + +static void nft_hash_walk(const struct nft_ctx *ctx, struct nft_set *set, + struct nft_set_iter *iter) +{ + struct nft_hash *priv = nft_set_priv(set); + struct nft_hash_elem *he; + struct nft_set_elem elem; + int i; + + for (i = 0; i <
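For readers unfamiliar with the backend being added in this patch, a minimal user-space analogue of a fixed-size chained hash set might look like the sketch below: singly linked buckets, a duplicate check on insert (as with the kernel's -EEXIST), and lookup by key compare. The hash mix is an arbitrary stand-in for jhash + reciprocal_scale, and all names are illustrative:

```c
#include <stdint.h>
#include <stdlib.h>

#define BUCKETS 8		/* fixed at creation time, never resized */

struct helem {
	struct helem *next;
	uint32_t key;
	int val;
};

struct htable {
	struct helem *table[BUCKETS];
	uint32_t seed;
};

/* Stand-in for jhash + reciprocal_scale: any decent mixer works here. */
static unsigned int hash_key(uint32_t key, uint32_t seed)
{
	uint64_t h = (key ^ seed) * 0x9e3779b97f4a7c15ull;
	return (unsigned int)((h >> 32) % BUCKETS);
}

/* Insert fails with -1 (think -EEXIST) when the key is already present. */
static int ht_insert(struct htable *ht, uint32_t key, int val)
{
	unsigned int b = hash_key(key, ht->seed);
	struct helem *he;

	for (he = ht->table[b]; he; he = he->next)
		if (he->key == key)
			return -1;

	he = malloc(sizeof(*he));
	he->key = key;
	he->val = val;
	he->next = ht->table[b];
	ht->table[b] = he;
	return 0;
}

static int ht_lookup(const struct htable *ht, uint32_t key, int *val)
{
	unsigned int b = hash_key(key, ht->seed);
	const struct helem *he;

	for (he = ht->table[b]; he; he = he->next)
		if (he->key == key) {
			*val = he->val;
			return 1;
		}
	return 0;
}
```

Because the bucket array never moves, readers need no resize handling at all, which is exactly what makes this flavour cheaper than the rhashtable-backed one.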
[PATCH 12/29] netfilter: nft_set_hash: use nft_rhash prefix for resizable set backend
This patch prepares the introduction of a non-resizable hashtable implementation that is significantly faster. Signed-off-by: Pablo Neira Ayuso--- net/netfilter/nft_set_hash.c | 212 +-- 1 file changed, 106 insertions(+), 106 deletions(-) diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c index 1f1cc33895fd..7c21e3da0d88 100644 --- a/net/netfilter/nft_set_hash.c +++ b/net/netfilter/nft_set_hash.c @@ -22,43 +22,43 @@ #include /* We target a hash table size of 4, element hint is 75% of final size */ -#define NFT_HASH_ELEMENT_HINT 3 +#define NFT_RHASH_ELEMENT_HINT 3 -struct nft_hash { +struct nft_rhash { struct rhashtable ht; struct delayed_work gc_work; }; -struct nft_hash_elem { +struct nft_rhash_elem { struct rhash_head node; struct nft_set_ext ext; }; -struct nft_hash_cmp_arg { +struct nft_rhash_cmp_arg { const struct nft_set*set; const u32 *key; u8 genmask; }; -static inline u32 nft_hash_key(const void *data, u32 len, u32 seed) +static inline u32 nft_rhash_key(const void *data, u32 len, u32 seed) { - const struct nft_hash_cmp_arg *arg = data; + const struct nft_rhash_cmp_arg *arg = data; return jhash(arg->key, len, seed); } -static inline u32 nft_hash_obj(const void *data, u32 len, u32 seed) +static inline u32 nft_rhash_obj(const void *data, u32 len, u32 seed) { - const struct nft_hash_elem *he = data; + const struct nft_rhash_elem *he = data; return jhash(nft_set_ext_key(>ext), len, seed); } -static inline int nft_hash_cmp(struct rhashtable_compare_arg *arg, - const void *ptr) +static inline int nft_rhash_cmp(struct rhashtable_compare_arg *arg, + const void *ptr) { - const struct nft_hash_cmp_arg *x = arg->key; - const struct nft_hash_elem *he = ptr; + const struct nft_rhash_cmp_arg *x = arg->key; + const struct nft_rhash_elem *he = ptr; if (memcmp(nft_set_ext_key(>ext), x->key, x->set->klen)) return 1; @@ -69,49 +69,49 @@ static inline int nft_hash_cmp(struct rhashtable_compare_arg *arg, return 0; } -static const struct rhashtable_params 
nft_hash_params = { - .head_offset= offsetof(struct nft_hash_elem, node), - .hashfn = nft_hash_key, - .obj_hashfn = nft_hash_obj, - .obj_cmpfn = nft_hash_cmp, +static const struct rhashtable_params nft_rhash_params = { + .head_offset= offsetof(struct nft_rhash_elem, node), + .hashfn = nft_rhash_key, + .obj_hashfn = nft_rhash_obj, + .obj_cmpfn = nft_rhash_cmp, .automatic_shrinking= true, }; -static bool nft_hash_lookup(const struct net *net, const struct nft_set *set, - const u32 *key, const struct nft_set_ext **ext) +static bool nft_rhash_lookup(const struct net *net, const struct nft_set *set, +const u32 *key, const struct nft_set_ext **ext) { - struct nft_hash *priv = nft_set_priv(set); - const struct nft_hash_elem *he; - struct nft_hash_cmp_arg arg = { + struct nft_rhash *priv = nft_set_priv(set); + const struct nft_rhash_elem *he; + struct nft_rhash_cmp_arg arg = { .genmask = nft_genmask_cur(net), .set = set, .key = key, }; - he = rhashtable_lookup_fast(>ht, , nft_hash_params); + he = rhashtable_lookup_fast(>ht, , nft_rhash_params); if (he != NULL) *ext = >ext; return !!he; } -static bool nft_hash_update(struct nft_set *set, const u32 *key, - void *(*new)(struct nft_set *, -const struct nft_expr *, -struct nft_regs *regs), - const struct nft_expr *expr, - struct nft_regs *regs, - const struct nft_set_ext **ext) +static bool nft_rhash_update(struct nft_set *set, const u32 *key, +void *(*new)(struct nft_set *, + const struct nft_expr *, + struct nft_regs *regs), +const struct nft_expr *expr, +struct nft_regs *regs, +const struct nft_set_ext **ext) { - struct nft_hash *priv = nft_set_priv(set); - struct nft_hash_elem *he, *prev; - struct nft_hash_cmp_arg arg = { + struct nft_rhash *priv = nft_set_priv(set); + struct nft_rhash_elem *he, *prev; + struct
[PATCH 14/29] netfilter: nf_tables: pass set description to ->privsize
The new non-resizable hashtable variant needs this to calculate the size of the bucket array. Signed-off-by: Pablo Neira Ayuso--- include/net/netfilter/nf_tables.h | 3 ++- net/netfilter/nf_tables_api.c | 2 +- net/netfilter/nft_set_bitmap.c| 3 ++- net/netfilter/nft_set_hash.c | 3 ++- net/netfilter/nft_set_rbtree.c| 3 ++- 5 files changed, 9 insertions(+), 5 deletions(-) diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h index f27012098846..bd5be0d691d5 100644 --- a/include/net/netfilter/nf_tables.h +++ b/include/net/netfilter/nf_tables.h @@ -351,7 +351,8 @@ struct nft_set_ops { struct nft_set *set, struct nft_set_iter *iter); - unsigned int(*privsize)(const struct nlattr * const nla[]); + unsigned int(*privsize)(const struct nlattr * const nla[], + const struct nft_set_desc *desc); bool(*estimate)(const struct nft_set_desc *desc, u32 features, struct nft_set_estimate *est); diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index c0b2b19607e1..2969016d8cad 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -3052,7 +3052,7 @@ static int nf_tables_newset(struct net *net, struct sock *nlsk, size = 0; if (ops->privsize != NULL) - size = ops->privsize(nla); + size = ops->privsize(nla, ); err = -ENOMEM; set = kzalloc(sizeof(*set) + size + udlen, GFP_KERNEL); diff --git a/net/netfilter/nft_set_bitmap.c b/net/netfilter/nft_set_bitmap.c index 87d17691278f..734989c40579 100644 --- a/net/netfilter/nft_set_bitmap.c +++ b/net/netfilter/nft_set_bitmap.c @@ -236,7 +236,8 @@ static inline u32 nft_bitmap_total_size(u32 klen) return sizeof(struct nft_bitmap) + nft_bitmap_size(klen); } -static unsigned int nft_bitmap_privsize(const struct nlattr * const nla[]) +static unsigned int nft_bitmap_privsize(const struct nlattr * const nla[], + const struct nft_set_desc *desc) { u32 klen = ntohl(nla_get_be32(nla[NFTA_SET_KEY_LEN])); diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c 
index 4ba0717408d9..455a11ce8cd0 100644 --- a/net/netfilter/nft_set_hash.c +++ b/net/netfilter/nft_set_hash.c @@ -321,7 +321,8 @@ static void nft_rhash_gc(struct work_struct *work) nft_set_gc_interval(set)); } -static unsigned int nft_rhash_privsize(const struct nlattr * const nla[]) +static unsigned int nft_rhash_privsize(const struct nlattr * const nla[], + const struct nft_set_desc *desc) { return sizeof(struct nft_rhash); } diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c index 29d41d378339..491e805d3ca2 100644 --- a/net/netfilter/nft_set_rbtree.c +++ b/net/netfilter/nft_set_rbtree.c @@ -251,7 +251,8 @@ static void nft_rbtree_walk(const struct nft_ctx *ctx, read_unlock_bh(>lock); } -static unsigned int nft_rbtree_privsize(const struct nlattr * const nla[]) +static unsigned int nft_rbtree_privsize(const struct nlattr * const nla[], + const struct nft_set_desc *desc) { return sizeof(struct nft_rbtree); } -- 2.1.4
[PATCH 20/29] netfilter: cttimeout: use nf_ct_iterate_cleanup_net to unlink timeout objs
From: Liping Zhang

Similar to nf_conntrack_helper, we can use nf_ct_iterate_cleanup_net to remove this copy & paste code.

Signed-off-by: Liping Zhang
Signed-off-by: Pablo Neira Ayuso
---
 net/netfilter/nfnetlink_cttimeout.c | 39 +
 1 file changed, 5 insertions(+), 34 deletions(-)

diff --git a/net/netfilter/nfnetlink_cttimeout.c b/net/netfilter/nfnetlink_cttimeout.c
index a3e7bb54d96a..49638b03ccc9 100644
--- a/net/netfilter/nfnetlink_cttimeout.c
+++ b/net/netfilter/nfnetlink_cttimeout.c
@@ -287,49 +287,20 @@ static int cttimeout_get_timeout(struct net *net, struct sock *ctnl,
 	return ret;
 }
 
-static void untimeout(struct nf_conntrack_tuple_hash *i,
-		      struct ctnl_timeout *timeout)
+static int untimeout(struct nf_conn *ct, void *timeout)
 {
-	struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(i);
 	struct nf_conn_timeout *timeout_ext = nf_ct_timeout_find(ct);
 
 	if (timeout_ext && (!timeout || timeout_ext->timeout == timeout))
 		RCU_INIT_POINTER(timeout_ext->timeout, NULL);
+
+	/* We are not intended to delete this conntrack.
 */
+	return 0;
 }
 
 static void ctnl_untimeout(struct net *net, struct ctnl_timeout *timeout)
 {
-	struct nf_conntrack_tuple_hash *h;
-	const struct hlist_nulls_node *nn;
-	unsigned int last_hsize;
-	spinlock_t *lock;
-	int i, cpu;
-
-	for_each_possible_cpu(cpu) {
-		struct ct_pcpu *pcpu = per_cpu_ptr(net->ct.pcpu_lists, cpu);
-
-		spin_lock_bh(&pcpu->lock);
-		hlist_nulls_for_each_entry(h, nn, &pcpu->unconfirmed, hnnode)
-			untimeout(h, timeout);
-		spin_unlock_bh(&pcpu->lock);
-	}
-
-	local_bh_disable();
-restart:
-	last_hsize = nf_conntrack_htable_size;
-	for (i = 0; i < last_hsize; i++) {
-		lock = &nf_conntrack_locks[i % CONNTRACK_LOCKS];
-		nf_conntrack_lock(lock);
-		if (last_hsize != nf_conntrack_htable_size) {
-			spin_unlock(lock);
-			goto restart;
-		}
-
-		hlist_nulls_for_each_entry(h, nn, &nf_conntrack_hash[i], hnnode)
-			untimeout(h, timeout);
-		spin_unlock(lock);
-	}
-	local_bh_enable();
+	nf_ct_iterate_cleanup_net(net, untimeout, timeout, 0, 0);
 }
 
 /* try to delete object, fail if it is still in use. */
-- 
2.1.4
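The conversion works because of the iterator contract of nf_ct_iterate_cleanup_net(): the callback returns nonzero to have the entry deleted and zero to keep it, so untimeout() can detach timeouts without removing any conntrack. A toy version of that contract, with hypothetical types standing in for the kernel's:

```c
#include <stddef.h>

struct conn {
	int timeout_id;		/* 0 means "no timeout attached" */
	int dead;
};

/* Generic walk: a nonzero return from the iterator means "delete this
 * one", mirroring nf_ct_iterate_cleanup_net()'s contract. */
static int iterate(struct conn *tbl, size_t n,
		   int (*iter)(struct conn *c, void *data), void *data)
{
	size_t i;
	int deleted = 0;

	for (i = 0; i < n; i++)
		if (iter(&tbl[i], data)) {
			tbl[i].dead = 1;
			deleted++;
		}
	return deleted;
}

/* untimeout-style callback: detach matching timeouts, delete nothing. */
static int untimeout(struct conn *c, void *timeout)
{
	int id = *(int *)timeout;

	if (c->timeout_id == id)
		c->timeout_id = 0;
	return 0;	/* we are not intended to delete this conntrack */
}
```

Collapsing the per-cpu unconfirmed-list walk and the resize-aware table walk into one shared iterator is what lets both this patch and the helper-unregister patch drop their near-identical locking code.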
[PATCH 15/29] netfilter: nft_set_hash: add nft_hash_buckets()
Add an nft_hash_buckets() helper function to calculate the number of hashtable buckets based on the number of elements. This function can be reused from the follow-up patch that adds non-resizable hashtables.

Signed-off-by: Pablo Neira Ayuso
---
 net/netfilter/nft_set_hash.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c
index 455a11ce8cd0..466cb7092dfa 100644
--- a/net/netfilter/nft_set_hash.c
+++ b/net/netfilter/nft_set_hash.c
@@ -363,12 +363,17 @@ static void nft_rhash_destroy(const struct nft_set *set)
 				    (void *)set);
 }
 
+static u32 nft_hash_buckets(u32 size)
+{
+	return roundup_pow_of_two(size * 4 / 3);
+}
+
 static bool nft_rhash_estimate(const struct nft_set_desc *desc, u32 features,
 			       struct nft_set_estimate *est)
 {
 	if (desc->size)
 		est->size = sizeof(struct nft_rhash) +
-			    roundup_pow_of_two(desc->size * 4 / 3) *
+			    nft_hash_buckets(desc->size) *
 			    sizeof(struct nft_rhash_elem *) +
 			    desc->size * sizeof(struct nft_rhash_elem);
 	else
-- 
2.1.4
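The sizing rule is easy to check by hand: the element hint is 75% of the final size (see NFT_HASH_ELEMENT_HINT earlier in the series), so the bucket count is size * 4 / 3 rounded up to a power of two. A stand-alone sketch, with roundup_pow_of_two() reimplemented since we are outside the kernel tree:

```c
#include <stdint.h>

/* Smallest power of two >= x, as the kernel's roundup_pow_of_two(). */
static uint32_t roundup_pow_of_two(uint32_t x)
{
	uint32_t p = 1;

	while (p < x)
		p <<= 1;
	return p;
}

/* Same sizing rule as nft_hash_buckets(): the element hint is 75% of
 * the final table size, so scale by 4/3 before rounding up. */
static uint32_t nft_hash_buckets(uint32_t size)
{
	return roundup_pow_of_two(size * 4 / 3);
}
```

The power-of-two constraint matters for the fast lookup patch later in the series: with buckets a power of two, reciprocal_scale() distributes hashes evenly without a division.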
[PATCH 19/29] netfilter: nf_ct_helper: use nf_ct_iterate_destroy to unlink helper objs
From: Liping ZhangWhen we unlink the helper objects, we will iterate the nf_conntrack_hash, iterate the unconfirmed list, handle the hash resize situation, etc. Actually this logic is same as the nf_ct_iterate_destroy, so we can use it to remove these copy & paste code. Signed-off-by: Liping Zhang Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_conntrack_helper.c | 50 +++-- 1 file changed, 4 insertions(+), 46 deletions(-) diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c index 7f6100ca63be..9129bb3b5153 100644 --- a/net/netfilter/nf_conntrack_helper.c +++ b/net/netfilter/nf_conntrack_helper.c @@ -285,16 +285,16 @@ int __nf_ct_try_assign_helper(struct nf_conn *ct, struct nf_conn *tmpl, EXPORT_SYMBOL_GPL(__nf_ct_try_assign_helper); /* appropriate ct lock protecting must be taken by caller */ -static inline int unhelp(struct nf_conntrack_tuple_hash *i, -const struct nf_conntrack_helper *me) +static int unhelp(struct nf_conn *ct, void *me) { - struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(i); struct nf_conn_help *help = nfct_help(ct); if (help && rcu_dereference_raw(help->helper) == me) { nf_conntrack_event(IPCT_HELPER, ct); RCU_INIT_POINTER(help->helper, NULL); } + + /* We are not intended to delete this conntrack. */ return 0; } @@ -437,33 +437,10 @@ int nf_conntrack_helper_register(struct nf_conntrack_helper *me) } EXPORT_SYMBOL_GPL(nf_conntrack_helper_register); -static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me, -struct net *net) -{ - struct nf_conntrack_tuple_hash *h; - const struct hlist_nulls_node *nn; - int cpu; - - /* Get rid of expecteds, set helpers to NULL. 
*/ - for_each_possible_cpu(cpu) { - struct ct_pcpu *pcpu = per_cpu_ptr(net->ct.pcpu_lists, cpu); - - spin_lock_bh(>lock); - hlist_nulls_for_each_entry(h, nn, >unconfirmed, hnnode) - unhelp(h, me); - spin_unlock_bh(>lock); - } -} - void nf_conntrack_helper_unregister(struct nf_conntrack_helper *me) { - struct nf_conntrack_tuple_hash *h; struct nf_conntrack_expect *exp; const struct hlist_node *next; - const struct hlist_nulls_node *nn; - unsigned int last_hsize; - spinlock_t *lock; - struct net *net; unsigned int i; mutex_lock(_ct_helper_mutex); @@ -491,26 +468,7 @@ void nf_conntrack_helper_unregister(struct nf_conntrack_helper *me) } spin_unlock_bh(_conntrack_expect_lock); - rtnl_lock(); - for_each_net(net) - __nf_conntrack_helper_unregister(me, net); - rtnl_unlock(); - - local_bh_disable(); -restart: - last_hsize = nf_conntrack_htable_size; - for (i = 0; i < last_hsize; i++) { - lock = _conntrack_locks[i % CONNTRACK_LOCKS]; - nf_conntrack_lock(lock); - if (last_hsize != nf_conntrack_htable_size) { - spin_unlock(lock); - goto restart; - } - hlist_nulls_for_each_entry(h, nn, _conntrack_hash[i], hnnode) - unhelp(h, me); - spin_unlock(lock); - } - local_bh_enable(); + nf_ct_iterate_destroy(unhelp, me); } EXPORT_SYMBOL_GPL(nf_conntrack_helper_unregister); -- 2.1.4
[PATCH 13/29] netfilter: nf_tables: select set backend flavour depending on description
This patch adds the infrastructure to support several implementations of the same set type. This selection will be based on the set description and the features available for this set. This allow us to select set backend implementation that will result in better performance numbers. Signed-off-by: Pablo Neira Ayuso--- include/net/netfilter/nf_tables.h | 26 + net/netfilter/nf_tables_api.c | 59 --- net/netfilter/nft_set_bitmap.c| 10 +-- net/netfilter/nft_set_hash.c | 10 +-- net/netfilter/nft_set_rbtree.c| 10 +-- 5 files changed, 80 insertions(+), 35 deletions(-) diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h index 8a8bab8d7b15..f27012098846 100644 --- a/include/net/netfilter/nf_tables.h +++ b/include/net/netfilter/nf_tables.h @@ -281,6 +281,23 @@ struct nft_set_estimate { enum nft_set_class space; }; +/** + * struct nft_set_type - nf_tables set type + * + * @select_ops: function to select nft_set_ops + * @ops: default ops, used when no select_ops functions is present + * @list: used internally + * @owner: module reference + */ +struct nft_set_type { + const struct nft_set_ops*(*select_ops)(const struct nft_ctx *, + const struct nft_set_desc *desc, + u32 flags); + const struct nft_set_ops*ops; + struct list_headlist; + struct module *owner; +}; + struct nft_set_ext; struct nft_expr; @@ -297,8 +314,6 @@ struct nft_expr; * @privsize: function to return size of set private data * @init: initialize private data of new set instance * @destroy: destroy private data of set instance - * @list: nf_tables_set_ops list node - * @owner: module reference * @elemsize: element private size * @features: features supported by the implementation */ @@ -345,14 +360,13 @@ struct nft_set_ops { const struct nlattr * const nla[]); void(*destroy)(const struct nft_set *set); - struct list_headlist; - struct module *owner; unsigned intelemsize; u32 features; + const struct nft_set_type *type; }; -int nft_register_set(struct nft_set_ops *ops); -void 
nft_unregister_set(struct nft_set_ops *ops); +int nft_register_set(struct nft_set_type *type); +void nft_unregister_set(struct nft_set_type *type); /** * struct nft_set - nf_tables set instance diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index da314be0c048..c0b2b19607e1 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -2377,64 +2377,77 @@ static int nf_tables_delrule(struct net *net, struct sock *nlsk, * Sets */ -static LIST_HEAD(nf_tables_set_ops); +static LIST_HEAD(nf_tables_set_types); -int nft_register_set(struct nft_set_ops *ops) +int nft_register_set(struct nft_set_type *type) { nfnl_lock(NFNL_SUBSYS_NFTABLES); - list_add_tail_rcu(>list, _tables_set_ops); + list_add_tail_rcu(>list, _tables_set_types); nfnl_unlock(NFNL_SUBSYS_NFTABLES); return 0; } EXPORT_SYMBOL_GPL(nft_register_set); -void nft_unregister_set(struct nft_set_ops *ops) +void nft_unregister_set(struct nft_set_type *type) { nfnl_lock(NFNL_SUBSYS_NFTABLES); - list_del_rcu(>list); + list_del_rcu(>list); nfnl_unlock(NFNL_SUBSYS_NFTABLES); } EXPORT_SYMBOL_GPL(nft_unregister_set); +#define NFT_SET_FEATURES (NFT_SET_INTERVAL | NFT_SET_MAP | \ +NFT_SET_TIMEOUT | NFT_SET_OBJECT) + +static bool nft_set_ops_candidate(const struct nft_set_ops *ops, u32 flags) +{ + return (flags & ops->features) == (flags & NFT_SET_FEATURES); +} + /* * Select a set implementation based on the data characteristics and the * given policy. The total memory use might not be known if no size is * given, in that case the amount of memory per element is used. 
*/ static const struct nft_set_ops * -nft_select_set_ops(const struct nlattr * const nla[], +nft_select_set_ops(const struct nft_ctx *ctx, + const struct nlattr * const nla[], const struct nft_set_desc *desc, enum nft_set_policies policy) { const struct nft_set_ops *ops, *bops; struct nft_set_estimate est, best; - u32 features; + const struct nft_set_type *type; + u32 flags = 0; #ifdef CONFIG_MODULES - if (list_empty(_tables_set_ops)) { + if (list_empty(_tables_set_types)) { nfnl_unlock(NFNL_SUBSYS_NFTABLES); request_module("nft-set");
[PATCH 26/29] netfilter: use nf_conntrack_helpers_register when possible
From: Liping Zhangamanda_helper, nf_conntrack_helper_ras and nf_conntrack_helper_q931 are all arrays, so we can use nf_conntrack_helpers_register to register the ct helper, this will help us to eliminate some "goto errX" statements. Also introduce h323_helper_init/exit helper function to register the ct helpers, this is prepared for the followup patch, which will add net namespace support for ct helper. Signed-off-by: Liping Zhang Acked-by: Florian Westphal Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_conntrack_amanda.c| 12 +++ net/netfilter/nf_conntrack_h323_main.c | 63 +++--- 2 files changed, 40 insertions(+), 35 deletions(-) diff --git a/net/netfilter/nf_conntrack_amanda.c b/net/netfilter/nf_conntrack_amanda.c index 03d2ccffa9fa..20edd589fe06 100644 --- a/net/netfilter/nf_conntrack_amanda.c +++ b/net/netfilter/nf_conntrack_amanda.c @@ -197,8 +197,8 @@ static void __exit nf_conntrack_amanda_fini(void) { int i; - nf_conntrack_helper_unregister(_helper[0]); - nf_conntrack_helper_unregister(_helper[1]); + nf_conntrack_helpers_unregister(amanda_helper, + ARRAY_SIZE(amanda_helper)); for (i = 0; i < ARRAY_SIZE(search); i++) textsearch_destroy(search[i].ts); } @@ -218,16 +218,12 @@ static int __init nf_conntrack_amanda_init(void) goto err1; } } - ret = nf_conntrack_helper_register(_helper[0]); + ret = nf_conntrack_helpers_register(amanda_helper, + ARRAY_SIZE(amanda_helper)); if (ret < 0) goto err1; - ret = nf_conntrack_helper_register(_helper[1]); - if (ret < 0) - goto err2; return 0; -err2: - nf_conntrack_helper_unregister(_helper[0]); err1: while (--i >= 0) textsearch_destroy(search[i].ts); diff --git a/net/netfilter/nf_conntrack_h323_main.c b/net/netfilter/nf_conntrack_h323_main.c index 3bcdc718484e..f71f0d2558fd 100644 --- a/net/netfilter/nf_conntrack_h323_main.c +++ b/net/netfilter/nf_conntrack_h323_main.c @@ -1815,14 +1815,44 @@ static struct nf_conntrack_helper nf_conntrack_helper_ras[] __read_mostly = { }, }; +static int __init h323_helper_init(void) +{ 
+ int ret; + + ret = nf_conntrack_helper_register(_conntrack_helper_h245); + if (ret < 0) + return ret; + ret = nf_conntrack_helpers_register(nf_conntrack_helper_q931, + ARRAY_SIZE(nf_conntrack_helper_q931)); + if (ret < 0) + goto err1; + ret = nf_conntrack_helpers_register(nf_conntrack_helper_ras, + ARRAY_SIZE(nf_conntrack_helper_ras)); + if (ret < 0) + goto err2; + + return 0; +err2: + nf_conntrack_helpers_unregister(nf_conntrack_helper_q931, + ARRAY_SIZE(nf_conntrack_helper_q931)); +err1: + nf_conntrack_helper_unregister(_conntrack_helper_h245); + return ret; +} + +static void __exit h323_helper_exit(void) +{ + nf_conntrack_helpers_unregister(nf_conntrack_helper_ras, + ARRAY_SIZE(nf_conntrack_helper_ras)); + nf_conntrack_helpers_unregister(nf_conntrack_helper_q931, + ARRAY_SIZE(nf_conntrack_helper_q931)); + nf_conntrack_helper_unregister(_conntrack_helper_h245); +} + // static void __exit nf_conntrack_h323_fini(void) { - nf_conntrack_helper_unregister(_conntrack_helper_ras[1]); - nf_conntrack_helper_unregister(_conntrack_helper_ras[0]); - nf_conntrack_helper_unregister(_conntrack_helper_q931[1]); - nf_conntrack_helper_unregister(_conntrack_helper_q931[0]); - nf_conntrack_helper_unregister(_conntrack_helper_h245); + h323_helper_exit(); kfree(h323_buffer); pr_debug("nf_ct_h323: fini\n"); } @@ -1837,32 +1867,11 @@ static int __init nf_conntrack_h323_init(void) h323_buffer = kmalloc(65536, GFP_KERNEL); if (!h323_buffer) return -ENOMEM; - ret = nf_conntrack_helper_register(_conntrack_helper_h245); + ret = h323_helper_init(); if (ret < 0) goto err1; - ret = nf_conntrack_helper_register(_conntrack_helper_q931[0]); - if (ret < 0) - goto err2; - ret = nf_conntrack_helper_register(_conntrack_helper_q931[1]); - if (ret < 0) - goto err3; - ret = nf_conntrack_helper_register(_conntrack_helper_ras[0]); - if (ret < 0) - goto err4; - ret = nf_conntrack_helper_register(_conntrack_helper_ras[1]); - if (ret < 0) - goto err5;
[PATCH 06/29] netfilter: conntrack: don't call iter for non-confirmed conntracks
From: Florian Westphal

nf_ct_iterate_cleanup_net currently calls the iter() callback also for conntracks on the unconfirmed list, but this is unsafe. Accesses to nf_conn itself are fine, but some users access the extension area in the iter() callback, and that only works reliably for confirmed conntracks (ct->ext can be reallocated at any time for an unconfirmed conntrack). The second issue is that there is a short window where a conntrack entry is neither on the list nor in the table: to confirm an entry, it is first removed from the unconfirmed list, then inserted into the table. Fix this by iterating the unconfirmed list first and marking all entries as dying, then waiting for the RCU grace period. This makes sure all entries that were about to be confirmed either are in the main table, or will be dropped soon.

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
---
 net/netfilter/nf_conntrack_core.c | 39 +--
 1 file changed, 29 insertions(+), 10 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 2730f9df33b7..08733685d732 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -1592,7 +1592,6 @@ get_next_corpse(struct net *net, int (*iter)(struct nf_conn *i, void *data), struct nf_conntrack_tuple_hash *h; struct nf_conn *ct; struct hlist_nulls_node *n; - int cpu; spinlock_t *lockp; for (; *bucket < nf_conntrack_htable_size; (*bucket)++) { @@ -1614,24 +1613,40 @@ get_next_corpse(struct net *net, int (*iter)(struct nf_conn *i, void *data), cond_resched(); } + return NULL; +found: + atomic_inc(&ct->ct_general.use); + spin_unlock(lockp); + local_bh_enable(); + return ct; +} + +static void +__nf_ct_unconfirmed_destroy(struct net *net) +{ + int cpu; + for_each_possible_cpu(cpu) { - struct ct_pcpu *pcpu = per_cpu_ptr(net->ct.pcpu_lists, cpu); + struct nf_conntrack_tuple_hash *h; + struct hlist_nulls_node *n; + struct ct_pcpu *pcpu; + + pcpu = per_cpu_ptr(net->ct.pcpu_lists, cpu);
spin_lock_bh(>lock); hlist_nulls_for_each_entry(h, n, >unconfirmed, hnnode) { + struct nf_conn *ct; + ct = nf_ct_tuplehash_to_ctrack(h); - if (iter(ct, data)) - set_bit(IPS_DYING_BIT, >status); + + /* we cannot call iter() on unconfirmed list, the +* owning cpu can reallocate ct->ext at any time. +*/ + set_bit(IPS_DYING_BIT, >status); } spin_unlock_bh(>lock); cond_resched(); } - return NULL; -found: - atomic_inc(>ct_general.use); - spin_unlock(lockp); - local_bh_enable(); - return ct; } void nf_ct_iterate_cleanup_net(struct net *net, @@ -1646,6 +1661,10 @@ void nf_ct_iterate_cleanup_net(struct net *net, if (atomic_read(>ct.count) == 0) return; + __nf_ct_unconfirmed_destroy(net); + + synchronize_net(); + while ((ct = get_next_corpse(net, iter, data, )) != NULL) { /* Time to push up daises... */ -- 2.1.4
[PATCH 23/29] netns: add and use net_ns_barrier
From: Florian WestphalQuoting Joe Stringer: If a user loads nf_conntrack_ftp, sends FTP traffic through a network namespace, destroys that namespace then unloads the FTP helper module, then the kernel will crash. Events that lead to the crash: 1. conntrack is created with ftp helper in netns x 2. This netns is destroyed 3. netns destruction is scheduled 4. netns destruction wq starts, removes netns from global list 5. ftp helper is unloaded, which resets all helpers of the conntracks via for_each_net() but because netns is already gone from list the for_each_net() loop doesn't include it, therefore all of these conntracks are unaffected. 6. helper module unload finishes 7. netns wq invokes destructor for rmmod'ed helper CC: "Eric W. Biederman" Reported-by: Joe Stringer Signed-off-by: Florian Westphal Acked-by: David S. Miller Acked-by: "Eric W. Biederman" Signed-off-by: Pablo Neira Ayuso --- include/net/net_namespace.h | 3 +++ net/core/net_namespace.c | 17 + net/netfilter/nf_conntrack_core.c | 9 + 3 files changed, 29 insertions(+) diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h index fe80bb48ab1f..a24a57593202 100644 --- a/include/net/net_namespace.h +++ b/include/net/net_namespace.h @@ -158,6 +158,7 @@ extern struct net init_net; struct net *copy_net_ns(unsigned long flags, struct user_namespace *user_ns, struct net *old_net); +void net_ns_barrier(void); #else /* CONFIG_NET_NS */ #include #include @@ -168,6 +169,8 @@ static inline struct net *copy_net_ns(unsigned long flags, return ERR_PTR(-EINVAL); return old_net; } + +static inline void net_ns_barrier(void) {} #endif /* CONFIG_NET_NS */ diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index 1934efd4a9d4..1f15abb1d733 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -482,6 +482,23 @@ static void cleanup_net(struct work_struct *work) net_drop_ns(net); } } + +/** + * net_ns_barrier - wait until concurrent net_cleanup_work is done + * + * cleanup_net runs 
from work queue and will first remove namespaces + * from the global list, then run net exit functions. + * + * Call this in module exit path to make sure that all netns + * ->exit ops have been invoked before the function is removed. + */ +void net_ns_barrier(void) +{ + mutex_lock(_mutex); + mutex_unlock(_mutex); +} +EXPORT_SYMBOL(net_ns_barrier); + static DECLARE_WORK(net_cleanup_work, cleanup_net); void __put_net(struct net *net) diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index c3bd9b086dcc..9979f46c81dc 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -1720,6 +1720,8 @@ EXPORT_SYMBOL_GPL(nf_ct_iterate_cleanup_net); * Like nf_ct_iterate_cleanup, but first marks conntracks on the * unconfirmed list as dying (so they will not be inserted into * main table). + * + * Can only be called in module exit path. */ void nf_ct_iterate_destroy(int (*iter)(struct nf_conn *i, void *data), void *data) @@ -1734,6 +1736,13 @@ nf_ct_iterate_destroy(int (*iter)(struct nf_conn *i, void *data), void *data) } rtnl_unlock(); + /* Need to wait for netns cleanup worker to finish, if its +* running -- it might have deleted a net namespace from +* the global list, so our __nf_ct_unconfirmed_destroy() might +* not have affected all namespaces. +*/ + net_ns_barrier(); + /* a conntrack could have been unlinked from unconfirmed list * before we grabbed pcpu lock in __nf_ct_unconfirmed_destroy(). * This makes sure its inserted into conntrack table. -- 2.1.4
[PATCH 09/29] netfilter: nat: destroy nat mappings on module exit path only
From: Florian WestphalWe don't need pernetns cleanup anymore. If the netns is being destroyed, conntrack netns exit will kill all entries in this namespace, and neither conntrack hash table nor bysource hash are per namespace. For the rmmod case, we have to make sure we remove all entries from the nat bysource table, so call the new nf_ct_iterate_destroy in module exit path. Signed-off-by: Florian Westphal Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_nat_core.c | 37 + 1 file changed, 5 insertions(+), 32 deletions(-) diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c index daf5b22c07f8..d26cc2f864e6 100644 --- a/net/netfilter/nf_nat_core.c +++ b/net/netfilter/nf_nat_core.c @@ -582,12 +582,8 @@ static void nf_nat_l4proto_clean(u8 l3proto, u8 l4proto) .l3proto = l3proto, .l4proto = l4proto, }; - struct net *net; - rtnl_lock(); - for_each_net(net) - nf_ct_iterate_cleanup_net(net, nf_nat_proto_remove, , 0, 0); - rtnl_unlock(); + nf_ct_iterate_destroy(nf_nat_proto_remove, ); } static void nf_nat_l3proto_clean(u8 l3proto) @@ -595,13 +591,8 @@ static void nf_nat_l3proto_clean(u8 l3proto) struct nf_nat_proto_clean clean = { .l3proto = l3proto, }; - struct net *net; - rtnl_lock(); - - for_each_net(net) - nf_ct_iterate_cleanup_net(net, nf_nat_proto_remove, , 0, 0); - rtnl_unlock(); + nf_ct_iterate_destroy(nf_nat_proto_remove, ); } /* Protocol registration. 
*/ @@ -822,17 +813,6 @@ nfnetlink_parse_nat_setup(struct nf_conn *ct, } #endif -static void __net_exit nf_nat_net_exit(struct net *net) -{ - struct nf_nat_proto_clean clean = {}; - - nf_ct_iterate_cleanup_net(net, nf_nat_proto_clean, , 0, 0); -} - -static struct pernet_operations nf_nat_net_ops = { - .exit = nf_nat_net_exit, -}; - static struct nf_ct_helper_expectfn follow_master_nat = { .name = "nat-follow-master", .expectfn = nf_nat_follow_master, @@ -853,10 +833,6 @@ static int __init nf_nat_init(void) return ret; } - ret = register_pernet_subsys(_nat_net_ops); - if (ret < 0) - goto cleanup_extend; - nf_ct_helper_expectfn_register(_master_nat); BUG_ON(nfnetlink_parse_nat_setup_hook != NULL); @@ -867,18 +843,15 @@ static int __init nf_nat_init(void) RCU_INIT_POINTER(nf_nat_decode_session_hook, __nf_nat_decode_session); #endif return 0; - - cleanup_extend: - rhltable_destroy(_nat_bysource_table); - nf_ct_extend_unregister(_extend); - return ret; } static void __exit nf_nat_cleanup(void) { + struct nf_nat_proto_clean clean = {}; unsigned int i; - unregister_pernet_subsys(_nat_net_ops); + nf_ct_iterate_destroy(nf_nat_proto_clean, ); + nf_ct_extend_unregister(_extend); nf_ct_helper_expectfn_unregister(_master_nat); RCU_INIT_POINTER(nfnetlink_parse_nat_setup_hook, NULL); -- 2.1.4
[PATCH 24/29] netfilter: ebt: Use new helper ebt_invalid_target to check target
From: Gao FengUse the new helper function ebt_invalid_target instead of the old macro INVALID_TARGET and other duplicated codes to enhance the readability. Signed-off-by: Gao Feng Signed-off-by: Pablo Neira Ayuso --- include/linux/netfilter_bridge/ebtables.h | 2 -- net/bridge/netfilter/ebt_dnat.c | 2 +- net/bridge/netfilter/ebt_mark.c | 2 +- net/bridge/netfilter/ebt_redirect.c | 2 +- net/bridge/netfilter/ebt_snat.c | 2 +- 5 files changed, 4 insertions(+), 6 deletions(-) diff --git a/include/linux/netfilter_bridge/ebtables.h b/include/linux/netfilter_bridge/ebtables.h index e0cbf17af780..2c2a5514b0df 100644 --- a/include/linux/netfilter_bridge/ebtables.h +++ b/include/linux/netfilter_bridge/ebtables.h @@ -122,8 +122,6 @@ extern unsigned int ebt_do_table(struct sk_buff *skb, #define BASE_CHAIN (par->hook_mask & (1 << NF_BR_NUMHOOKS)) /* Clear the bit in the hook mask that tells if the rule is on a base chain */ #define CLEAR_BASE_CHAIN_BIT (par->hook_mask &= ~(1 << NF_BR_NUMHOOKS)) -/* True if the target is not a standard target */ -#define INVALID_TARGET (info->target < -NUM_STANDARD_TARGETS || info->target >= 0) static inline bool ebt_invalid_target(int target) { diff --git a/net/bridge/netfilter/ebt_dnat.c b/net/bridge/netfilter/ebt_dnat.c index e0bb624c3845..dfc86a0199da 100644 --- a/net/bridge/netfilter/ebt_dnat.c +++ b/net/bridge/netfilter/ebt_dnat.c @@ -61,7 +61,7 @@ static int ebt_dnat_tg_check(const struct xt_tgchk_param *par) (strcmp(par->table, "broute") != 0 || hook_mask & ~(1 << NF_BR_BROUTING))) return -EINVAL; - if (INVALID_TARGET) + if (ebt_invalid_target(info->target)) return -EINVAL; return 0; } diff --git a/net/bridge/netfilter/ebt_mark.c b/net/bridge/netfilter/ebt_mark.c index 66697cbd0a8b..19f0f9592d32 100644 --- a/net/bridge/netfilter/ebt_mark.c +++ b/net/bridge/netfilter/ebt_mark.c @@ -44,7 +44,7 @@ static int ebt_mark_tg_check(const struct xt_tgchk_param *par) tmp = info->target | ~EBT_VERDICT_BITS; if (BASE_CHAIN && tmp == EBT_RETURN) return 
-EINVAL; - if (tmp < -NUM_STANDARD_TARGETS || tmp >= 0) + if (ebt_invalid_target(tmp)) return -EINVAL; tmp = info->target & ~EBT_VERDICT_BITS; if (tmp != MARK_SET_VALUE && tmp != MARK_OR_VALUE && diff --git a/net/bridge/netfilter/ebt_redirect.c b/net/bridge/netfilter/ebt_redirect.c index 8d2a85e0594e..a7223eaf490b 100644 --- a/net/bridge/netfilter/ebt_redirect.c +++ b/net/bridge/netfilter/ebt_redirect.c @@ -47,7 +47,7 @@ static int ebt_redirect_tg_check(const struct xt_tgchk_param *par) (strcmp(par->table, "broute") != 0 || hook_mask & ~(1 << NF_BR_BROUTING))) return -EINVAL; - if (INVALID_TARGET) + if (ebt_invalid_target(info->target)) return -EINVAL; return 0; } diff --git a/net/bridge/netfilter/ebt_snat.c b/net/bridge/netfilter/ebt_snat.c index e56ccd060d26..11cf9e9e9222 100644 --- a/net/bridge/netfilter/ebt_snat.c +++ b/net/bridge/netfilter/ebt_snat.c @@ -51,7 +51,7 @@ static int ebt_snat_tg_check(const struct xt_tgchk_param *par) if (BASE_CHAIN && tmp == EBT_RETURN) return -EINVAL; - if (tmp < -NUM_STANDARD_TARGETS || tmp >= 0) + if (ebt_invalid_target(tmp)) return -EINVAL; tmp = info->target | EBT_VERDICT_BITS; if ((tmp & ~NAT_ARP_BIT) != ~NAT_ARP_BIT) -- 2.1.4
[PATCH 25/29] netfilter, kbuild: use canonical method to specify objs.
From: Jike Song

Should use ":=" instead of "+=".

Signed-off-by: Jike Song
Signed-off-by: Pablo Neira Ayuso
--- net/netfilter/Makefile | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile index c9b78e7b342f..913380919301 100644 --- a/net/netfilter/Makefile +++ b/net/netfilter/Makefile @@ -70,10 +70,9 @@ obj-$(CONFIG_NETFILTER_SYNPROXY) += nf_synproxy_core.o obj-$(CONFIG_NF_DUP_NETDEV)+= nf_dup_netdev.o # nf_tables -nf_tables-objs += nf_tables_core.o nf_tables_api.o nf_tables_trace.o -nf_tables-objs += nft_immediate.o nft_cmp.o nft_range.o -nf_tables-objs += nft_bitwise.o nft_byteorder.o nft_payload.o -nf_tables-objs += nft_lookup.o nft_dynset.o +nf_tables-objs := nf_tables_core.o nf_tables_api.o nf_tables_trace.o \ + nft_immediate.o nft_cmp.o nft_range.o nft_bitwise.o \ + nft_byteorder.o nft_payload.o nft_lookup.o nft_dynset.o obj-$(CONFIG_NF_TABLES)+= nf_tables.o obj-$(CONFIG_NF_TABLES_INET) += nf_tables_inet.o -- 2.1.4
[PATCH 29/29] netfilter: nfnetlink: extended ACK reporting
Pass down struct netlink_ext_ack as parameter to all of our nfnetlink subsystem callbacks, so we can work on follow up patches to provide finer grain error reporting using the new infrastructure that 2d4bc93368f5 ("netlink: extended ACK reporting") provides. No functional change, just pass down this new object to callbacks. Signed-off-by: Pablo Neira Ayuso--- include/linux/netfilter/nfnetlink.h | 10 +++--- net/netfilter/ipset/ip_set_core.c| 39 net/netfilter/nf_conntrack_netlink.c | 39 net/netfilter/nf_tables_api.c| 59 net/netfilter/nfnetlink.c| 21 + net/netfilter/nfnetlink_acct.c | 9 -- net/netfilter/nfnetlink_cthelper.c | 9 -- net/netfilter/nfnetlink_cttimeout.c | 15 ++--- net/netfilter/nfnetlink_log.c| 6 ++-- net/netfilter/nfnetlink_queue.c | 12 +--- net/netfilter/nft_compat.c | 3 +- net/netfilter/xt_osf.c | 6 ++-- 12 files changed, 152 insertions(+), 76 deletions(-) diff --git a/include/linux/netfilter/nfnetlink.h b/include/linux/netfilter/nfnetlink.h index 996711d8a7b4..41d04e9d088a 100644 --- a/include/linux/netfilter/nfnetlink.h +++ b/include/linux/netfilter/nfnetlink.h @@ -1,7 +1,6 @@ #ifndef _NFNETLINK_H #define _NFNETLINK_H - #include #include #include @@ -10,13 +9,16 @@ struct nfnl_callback { int (*call)(struct net *net, struct sock *nl, struct sk_buff *skb, const struct nlmsghdr *nlh, - const struct nlattr * const cda[]); + const struct nlattr * const cda[], + struct netlink_ext_ack *extack); int (*call_rcu)(struct net *net, struct sock *nl, struct sk_buff *skb, const struct nlmsghdr *nlh, - const struct nlattr * const cda[]); + const struct nlattr * const cda[], + struct netlink_ext_ack *extack); int (*call_batch)(struct net *net, struct sock *nl, struct sk_buff *skb, const struct nlmsghdr *nlh, - const struct nlattr * const cda[]); + const struct nlattr * const cda[], + struct netlink_ext_ack *extack); const struct nla_policy *policy;/* netlink attribute policy */ const u_int16_t attr_count; /* number of nlattr's */ }; diff --git 
a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c index ba6a5516dc7c..e495b5e484b1 100644 --- a/net/netfilter/ipset/ip_set_core.c +++ b/net/netfilter/ipset/ip_set_core.c @@ -841,14 +841,16 @@ find_free_id(struct ip_set_net *inst, const char *name, ip_set_id_t *index, static int ip_set_none(struct net *net, struct sock *ctnl, struct sk_buff *skb, const struct nlmsghdr *nlh, - const struct nlattr * const attr[]) + const struct nlattr * const attr[], + struct netlink_ext_ack *extack) { return -EOPNOTSUPP; } static int ip_set_create(struct net *net, struct sock *ctnl, struct sk_buff *skb, const struct nlmsghdr *nlh, -const struct nlattr * const attr[]) +const struct nlattr * const attr[], +struct netlink_ext_ack *extack) { struct ip_set_net *inst = ip_set_pernet(net); struct ip_set *set, *clash = NULL; @@ -989,7 +991,8 @@ ip_set_destroy_set(struct ip_set *set) static int ip_set_destroy(struct net *net, struct sock *ctnl, struct sk_buff *skb, const struct nlmsghdr *nlh, - const struct nlattr * const attr[]) + const struct nlattr * const attr[], + struct netlink_ext_ack *extack) { struct ip_set_net *inst = ip_set_pernet(net); struct ip_set *s; @@ -1067,7 +1070,8 @@ ip_set_flush_set(struct ip_set *set) static int ip_set_flush(struct net *net, struct sock *ctnl, struct sk_buff *skb, const struct nlmsghdr *nlh, - const struct nlattr * const attr[]) + const struct nlattr * const attr[], + struct netlink_ext_ack *extack) { struct ip_set_net *inst = ip_set_pernet(net); struct ip_set *s; @@ -1106,7 +1110,8 @@ ip_set_setname2_policy[IPSET_ATTR_CMD_MAX + 1] = { static int ip_set_rename(struct net *net, struct sock *ctnl, struct sk_buff *skb, const struct nlmsghdr *nlh, -const struct nlattr * const attr[]) +const struct nlattr * const attr[], +struct netlink_ext_ack *extack) { struct ip_set_net *inst = ip_set_pernet(net); struct ip_set *set, *s; @@
[PATCH 28/29] netfilter: nf_tables: reduce chain type table size
From: Florian Westphal

        text   data   bss     dec    hex  filename
old:  151590   2240  1152  154982  25d66  net/netfilter/nf_tables_api.o
new:  151666   2240   416  154322  25ad2  net/netfilter/nf_tables_api.o

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
--- net/netfilter/nf_tables_api.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index bc8f03a53734..5f3339978f6b 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -387,7 +387,7 @@ static inline u64 nf_tables_alloc_handle(struct nft_table *table) return ++table->hgenerator; } -static const struct nf_chain_type *chain_type[AF_MAX][NFT_CHAIN_T_MAX]; +static const struct nf_chain_type *chain_type[NFPROTO_NUMPROTO][NFT_CHAIN_T_MAX]; static const struct nf_chain_type * __nf_tables_chain_type_lookup(int family, const struct nlattr *nla) @@ -870,6 +870,9 @@ int nft_register_chain_type(const struct nf_chain_type *ctype) { int err = 0; + if (WARN_ON(ctype->family >= NFPROTO_NUMPROTO)) + return -EINVAL; + nfnl_lock(NFNL_SUBSYS_NFTABLES); if (chain_type[ctype->family][ctype->type] != NULL) { err = -EBUSY; -- 2.1.4
[PATCH 22/29] netfilter: move table iteration out of netns exit paths
From: Florian WestphalWe only need to iterate & remove in case of module removal; for netns destruction all conntracks will be removed anyway. Signed-off-by: Florian Westphal Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_conntrack_proto.c | 26 +- 1 file changed, 17 insertions(+), 9 deletions(-) diff --git a/net/netfilter/nf_conntrack_proto.c b/net/netfilter/nf_conntrack_proto.c index b7d01f27d463..6a36623e897c 100644 --- a/net/netfilter/nf_conntrack_proto.c +++ b/net/netfilter/nf_conntrack_proto.c @@ -265,6 +265,8 @@ void nf_ct_l3proto_unregister(struct nf_conntrack_l3proto *proto) mutex_unlock(_ct_proto_mutex); synchronize_rcu(); + /* Remove all contrack entries for this protocol */ + nf_ct_iterate_destroy(kill_l3proto, proto); } EXPORT_SYMBOL_GPL(nf_ct_l3proto_unregister); @@ -280,9 +282,6 @@ void nf_ct_l3proto_pernet_unregister(struct net *net, */ if (proto->net_ns_put) proto->net_ns_put(net); - - /* Remove all contrack entries for this protocol */ - nf_ct_iterate_cleanup_net(net, kill_l3proto, proto, 0, 0); } EXPORT_SYMBOL_GPL(nf_ct_l3proto_pernet_unregister); @@ -421,17 +420,23 @@ int nf_ct_l4proto_pernet_register_one(struct net *net, } EXPORT_SYMBOL_GPL(nf_ct_l4proto_pernet_register_one); -void nf_ct_l4proto_unregister_one(struct nf_conntrack_l4proto *l4proto) +static void __nf_ct_l4proto_unregister_one(struct nf_conntrack_l4proto *l4proto) + { BUG_ON(l4proto->l3proto >= PF_MAX); - mutex_lock(_ct_proto_mutex); BUG_ON(rcu_dereference_protected( nf_ct_protos[l4proto->l3proto][l4proto->l4proto], lockdep_is_held(_ct_proto_mutex) ) != l4proto); rcu_assign_pointer(nf_ct_protos[l4proto->l3proto][l4proto->l4proto], _conntrack_l4proto_generic); +} + +void nf_ct_l4proto_unregister_one(struct nf_conntrack_l4proto *l4proto) +{ + mutex_lock(_ct_proto_mutex); + __nf_ct_l4proto_unregister_one(l4proto); mutex_unlock(_ct_proto_mutex); synchronize_rcu(); @@ -448,9 +453,6 @@ void nf_ct_l4proto_pernet_unregister_one(struct net *net, pn->users--; 
nf_ct_l4proto_unregister_sysctl(net, pn, l4proto); - - /* Remove all contrack entries for this protocol */ - nf_ct_iterate_cleanup_net(net, kill_l4proto, l4proto, 0, 0); } EXPORT_SYMBOL_GPL(nf_ct_l4proto_pernet_unregister_one); @@ -500,8 +502,14 @@ EXPORT_SYMBOL_GPL(nf_ct_l4proto_pernet_register); void nf_ct_l4proto_unregister(struct nf_conntrack_l4proto *l4proto[], unsigned int num_proto) { + mutex_lock(_ct_proto_mutex); while (num_proto-- != 0) - nf_ct_l4proto_unregister_one(l4proto[num_proto]); + __nf_ct_l4proto_unregister_one(l4proto[num_proto]); + mutex_unlock(_ct_proto_mutex); + + synchronize_net(); + /* Remove all contrack entries for this protocol */ + nf_ct_iterate_destroy(kill_l4proto, l4proto); } EXPORT_SYMBOL_GPL(nf_ct_l4proto_unregister); -- 2.1.4
[PATCH 21/29] netfilter: ipt_CLUSTERIP: do not hold dev
From: Xin LongIt's a terrible thing to hold dev in iptables target. When the dev is being removed, unregister_netdevice has to wait for the dev to become free. dmesg will keep logging the err: kernel:unregister_netdevice: waiting for veth0_in to become free. \ Usage count = 1 until iptables rules with this target are removed manually. The worse thing is when deleting a netns, a virtual nic will be deleted instead of reset to init_net in default_device_ops exit/exit_batch. As it is earlier than to flush the iptables rules in iptable_filter_net_ops exit, unregister_netdevice will block to wait for the nic to become free. As unregister_netdevice is actually waiting for iptables rules flushing while iptables rules have to be flushed after unregister_netdevice. This 'dead lock' will cause unregister_netdevice to block there forever. As the netns is not available to operate at that moment, iptables rules can not even be flushed manually either. The reproducer can be: # ip netns add test # ip link add veth0_in type veth peer name veth0_out # ip link set veth0_in netns test # ip netns exec test ip link set lo up # ip netns exec test ip link set veth0_in up # ip netns exec test iptables -I INPUT -d 1.2.3.4 -i veth0_in -j \ CLUSTERIP --new --clustermac 89:d4:47:eb:9a:fa --total-nodes 3 \ --local-node 1 --hashmode sourceip-sourceport # ip netns del test This issue can be triggered by all virtual nics with ipt_CLUSTERIP. This patch is to fix it by not holding dev in ipt_CLUSTERIP, but saving the dev->ifindex instead of the dev. As Pablo Neira Ayuso's suggestion, it will refresh c->ifindex and dev's mc by registering a netdevice notifier, just as what xt_TEE does. So it removes the old codes updating dev's mc, and also no need to initialize c->ifindex with dev->ifindex. But as one config can be shared by more than one targets, and the netdev notifier is per config, not per target. It couldn't get e->ip.iniface in the notifier handler. 
So e->ip.iniface has to be saved into config. Note that for backwards compatibility, this patch doesn't remove the codes checking if the dev exists before creating a config. v1->v2: - As Pablo Neira Ayuso's suggestion, register a netdevice notifier to manage c->ifindex and dev's mc. Reported-by: Jianlin Shi Signed-off-by: Xin Long Signed-off-by: Pablo Neira Ayuso --- net/ipv4/netfilter/ipt_CLUSTERIP.c | 101 +++-- 1 file changed, 73 insertions(+), 28 deletions(-) diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c b/net/ipv4/netfilter/ipt_CLUSTERIP.c index f30bee8e407b..7d72decb80f9 100644 --- a/net/ipv4/netfilter/ipt_CLUSTERIP.c +++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c @@ -47,7 +47,7 @@ struct clusterip_config { __be32 clusterip; /* the IP address */ u_int8_t clustermac[ETH_ALEN]; /* the MAC address */ - struct net_device *dev; /* device */ + int ifindex;/* device ifindex */ u_int16_t num_total_nodes; /* total number of nodes */ unsigned long local_nodes; /* node number array */ @@ -57,6 +57,9 @@ struct clusterip_config { enum clusterip_hashmode hash_mode; /* which hashing mode */ u_int32_t hash_initval; /* hash initialization */ struct rcu_head rcu; + + char ifname[IFNAMSIZ]; /* device ifname */ + struct notifier_block notifier; /* refresh c->ifindex in it */ }; #ifdef CONFIG_PROC_FS @@ -98,9 +101,8 @@ clusterip_config_put(struct clusterip_config *c) * entry(rule) is removed, remove the config from lists, but don't free it * yet, since proc-files could still be holding references */ static inline void -clusterip_config_entry_put(struct clusterip_config *c) +clusterip_config_entry_put(struct net *net, struct clusterip_config *c) { - struct net *net = dev_net(c->dev); struct clusterip_net *cn = net_generic(net, clusterip_net_id); local_bh_disable(); @@ -109,8 +111,7 @@ clusterip_config_entry_put(struct clusterip_config *c) spin_unlock(>lock); local_bh_enable(); - dev_mc_del(c->dev, c->clustermac); - dev_put(c->dev); + unregister_netdevice_notifier(>notifier); /* In case 
anyone still accesses the file, the open/close * functions are also incrementing the refcount on their own, @@ -170,19 +171,55 @@ clusterip_config_init_nodelist(struct clusterip_config *c, set_bit(i->local_nodes[n] - 1, >local_nodes); } -static struct clusterip_config * -clusterip_config_init(const struct ipt_clusterip_tgt_info *i, __be32 ip, - struct net_device *dev) +static int +clusterip_netdev_event(struct notifier_block *this, unsigned long event,
[PATCH 27/29] netfilter: conntrack: use NFPROTO_MAX to size array
From: Florian Westphal

We don't support anything larger than NFPROTO_MAX, so we can shrink this a bit:

      text    data     dec     hex filename
old:  8259    1096    9355    248b net/netfilter/nf_conntrack_proto.o
new:  8259     624    8883    22b3 net/netfilter/nf_conntrack_proto.o

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
---
 include/net/netfilter/nf_conntrack_l3proto.h |  4 ++--
 net/netfilter/nf_conntrack_proto.c           | 18 +-
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_l3proto.h b/include/net/netfilter/nf_conntrack_l3proto.h
index e01559b4d781..6d14b36e3a49 100644
--- a/include/net/netfilter/nf_conntrack_l3proto.h
+++ b/include/net/netfilter/nf_conntrack_l3proto.h
@@ -71,7 +71,7 @@ struct nf_conntrack_l3proto {
 	struct module *me;
 };

-extern struct nf_conntrack_l3proto __rcu *nf_ct_l3protos[AF_MAX];
+extern struct nf_conntrack_l3proto __rcu *nf_ct_l3protos[NFPROTO_NUMPROTO];

 #ifdef CONFIG_SYSCTL
 /* Protocol pernet registration. */
@@ -100,7 +100,7 @@ extern struct nf_conntrack_l3proto nf_conntrack_l3proto_generic;
 static inline struct nf_conntrack_l3proto *
 __nf_ct_l3proto_find(u_int16_t l3proto)
 {
-	if (unlikely(l3proto >= AF_MAX))
+	if (unlikely(l3proto >= NFPROTO_NUMPROTO))
 		return &nf_conntrack_l3proto_generic;
 	return rcu_dereference(nf_ct_l3protos[l3proto]);
 }
diff --git a/net/netfilter/nf_conntrack_proto.c b/net/netfilter/nf_conntrack_proto.c
index 6a36623e897c..1dcad229c3cc 100644
--- a/net/netfilter/nf_conntrack_proto.c
+++ b/net/netfilter/nf_conntrack_proto.c
@@ -28,8 +28,8 @@
 #include
 #include

-static struct nf_conntrack_l4proto __rcu **nf_ct_protos[PF_MAX] __read_mostly;
-struct nf_conntrack_l3proto __rcu *nf_ct_l3protos[AF_MAX] __read_mostly;
+static struct nf_conntrack_l4proto __rcu **nf_ct_protos[NFPROTO_NUMPROTO] __read_mostly;
+struct nf_conntrack_l3proto __rcu *nf_ct_l3protos[NFPROTO_NUMPROTO] __read_mostly;
 EXPORT_SYMBOL_GPL(nf_ct_l3protos);

 static DEFINE_MUTEX(nf_ct_proto_mutex);
@@ -68,7 +68,7 @@ nf_ct_unregister_sysctl(struct ctl_table_header **header,
 struct nf_conntrack_l4proto *
 __nf_ct_l4proto_find(u_int16_t l3proto, u_int8_t l4proto)
 {
-	if (unlikely(l3proto >= AF_MAX || nf_ct_protos[l3proto] == NULL))
+	if (unlikely(l3proto >= NFPROTO_NUMPROTO || nf_ct_protos[l3proto] == NULL))
 		return &nf_conntrack_l4proto_generic;

 	return rcu_dereference(nf_ct_protos[l3proto][l4proto]);
@@ -212,7 +212,7 @@ int nf_ct_l3proto_register(struct nf_conntrack_l3proto *proto)
 	int ret = 0;
 	struct nf_conntrack_l3proto *old;

-	if (proto->l3proto >= AF_MAX)
+	if (proto->l3proto >= NFPROTO_NUMPROTO)
 		return -EBUSY;

 	if (proto->tuple_to_nlattr && !proto->nlattr_tuple_size)
@@ -254,7 +254,7 @@ EXPORT_SYMBOL_GPL(nf_ct_l3proto_pernet_register);

 void nf_ct_l3proto_unregister(struct nf_conntrack_l3proto *proto)
 {
-	BUG_ON(proto->l3proto >= AF_MAX);
+	BUG_ON(proto->l3proto >= NFPROTO_NUMPROTO);

 	mutex_lock(&nf_ct_proto_mutex);
 	BUG_ON(rcu_dereference_protected(nf_ct_l3protos[proto->l3proto],
@@ -341,7 +341,7 @@ int nf_ct_l4proto_register_one(struct nf_conntrack_l4proto *l4proto)
 {
 	int ret = 0;

-	if (l4proto->l3proto >= PF_MAX)
+	if (l4proto->l3proto >= ARRAY_SIZE(nf_ct_protos))
 		return -EBUSY;

 	if ((l4proto->to_nlattr && !l4proto->nlattr_size) ||
@@ -423,7 +423,7 @@ EXPORT_SYMBOL_GPL(nf_ct_l4proto_pernet_register_one);

 static void __nf_ct_l4proto_unregister_one(struct nf_conntrack_l4proto *l4proto)
 {
-	BUG_ON(l4proto->l3proto >= PF_MAX);
+	BUG_ON(l4proto->l3proto >= ARRAY_SIZE(nf_ct_protos));

 	BUG_ON(rcu_dereference_protected(
 			nf_ct_protos[l4proto->l3proto][l4proto->l4proto],
@@ -556,7 +556,7 @@ void nf_conntrack_proto_pernet_fini(struct net *net)
 int nf_conntrack_proto_init(void)
 {
 	unsigned int i;
-	for (i = 0; i < AF_MAX; i++)
+	for (i = 0; i < NFPROTO_NUMPROTO; i++)
 		rcu_assign_pointer(nf_ct_l3protos[i],
 				   &nf_conntrack_l3proto_generic);
 	return 0;
@@ -566,6 +566,6 @@ void nf_conntrack_proto_fini(void)
 {
 	unsigned int i;
 	/* free l3proto protocol tables */
-	for (i = 0; i < PF_MAX; i++)
+	for (i = 0; i < ARRAY_SIZE(nf_ct_protos); i++)
 		kfree(nf_ct_protos[i]);
 }
--
2.1.4
[PATCH 11/29] netfilter: nf_tables: no size estimation if number of set elements is unknown
This size estimation is ignored by the existing set backend selection
logic. Since this estimation structure is stack allocated, set this to ~0
to make it easier to catch bugs in future changes.

Signed-off-by: Pablo Neira Ayuso
---
 net/netfilter/nft_set_hash.c   | 17 -
 net/netfilter/nft_set_rbtree.c |  8 +++-
 2 files changed, 7 insertions(+), 18 deletions(-)

diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c
index 850be3a00e62..1f1cc33895fd 100644
--- a/net/netfilter/nft_set_hash.c
+++ b/net/netfilter/nft_set_hash.c
@@ -365,22 +365,13 @@ static void nft_hash_destroy(const struct nft_set *set)
 static bool nft_hash_estimate(const struct nft_set_desc *desc, u32 features,
 			      struct nft_set_estimate *est)
 {
-	unsigned int esize;
-
-	esize = sizeof(struct nft_hash_elem);
-	if (desc->size) {
+	if (desc->size)
 		est->size = sizeof(struct nft_hash) +
 			    roundup_pow_of_two(desc->size * 4 / 3) *
 			    sizeof(struct nft_hash_elem *) +
-			    desc->size * esize;
-	} else {
-		/* Resizing happens when the load drops below 30% or goes
-		 * above 75%. The average of 52.5% load (approximated by 50%)
-		 * is used for the size estimation of the hash buckets,
-		 * meaning we calculate two buckets per element.
-		 */
-		est->size = esize + 2 * sizeof(struct nft_hash_elem *);
-	}
+			    desc->size * sizeof(struct nft_hash_elem);
+	else
+		est->size = ~0;

 	est->lookup = NFT_SET_CLASS_O_1;
 	est->space  = NFT_SET_CLASS_O_N;
diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index e97e2fb53f0a..fbfb3cbb3916 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -283,13 +283,11 @@ static void nft_rbtree_destroy(const struct nft_set *set)
 static bool nft_rbtree_estimate(const struct nft_set_desc *desc, u32 features,
 				struct nft_set_estimate *est)
 {
-	unsigned int nsize;
-
-	nsize = sizeof(struct nft_rbtree_elem);
 	if (desc->size)
-		est->size = sizeof(struct nft_rbtree) + desc->size * nsize;
+		est->size = sizeof(struct nft_rbtree) +
+			    desc->size * sizeof(struct nft_rbtree_elem);
 	else
-		est->size = nsize;
+		est->size = ~0;

 	est->lookup = NFT_SET_CLASS_O_LOG_N;
 	est->space  = NFT_SET_CLASS_O_N;
--
2.1.4
[PATCH 02/29] netfilter: ipt_CLUSTERIP: switch to nf_register_net_hook
From: Florian Westphal

One of the last remaining users of the old API; hopefully a followup
commit can remove it soon.

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
---
 net/ipv4/netfilter/ipt_CLUSTERIP.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c b/net/ipv4/netfilter/ipt_CLUSTERIP.c
index 038f293c2376..f30bee8e407b 100644
--- a/net/ipv4/netfilter/ipt_CLUSTERIP.c
+++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c
@@ -743,14 +743,20 @@ static const struct file_operations clusterip_proc_fops = {
 static int clusterip_net_init(struct net *net)
 {
 	struct clusterip_net *cn = net_generic(net, clusterip_net_id);
+	int ret;

 	INIT_LIST_HEAD(&cn->configs);

 	spin_lock_init(&cn->lock);

+	ret = nf_register_net_hook(net, &cip_arp_ops);
+	if (ret < 0)
+		return ret;
+
 #ifdef CONFIG_PROC_FS
 	cn->procdir = proc_mkdir("ipt_CLUSTERIP", net->proc_net);
 	if (!cn->procdir) {
+		nf_unregister_net_hook(net, &cip_arp_ops);
 		pr_err("Unable to proc dir entry\n");
 		return -ENOMEM;
 	}
@@ -765,6 +771,7 @@ static void clusterip_net_exit(struct net *net)
 	struct clusterip_net *cn = net_generic(net, clusterip_net_id);
 	proc_remove(cn->procdir);
 #endif
+	nf_unregister_net_hook(net, &cip_arp_ops);
 }

 static struct pernet_operations clusterip_net_ops = {
@@ -786,17 +793,11 @@ static int __init clusterip_tg_init(void)
 	if (ret < 0)
 		goto cleanup_subsys;

-	ret = nf_register_hook(&cip_arp_ops);
-	if (ret < 0)
-		goto cleanup_target;
-
 	pr_info("ClusterIP Version %s loaded successfully\n",
 		CLUSTERIP_VERSION);

 	return 0;

-cleanup_target:
-	xt_unregister_target(&clusterip_tg_reg);
 cleanup_subsys:
 	unregister_pernet_subsys(&clusterip_net_ops);
 	return ret;
@@ -806,7 +807,6 @@ static void __exit clusterip_tg_exit(void)
 {
 	pr_info("ClusterIP Version %s unloading\n", CLUSTERIP_VERSION);

-	nf_unregister_hook(&cip_arp_ops);
 	xt_unregister_target(&clusterip_tg_reg);
 	unregister_pernet_subsys(&clusterip_net_ops);
--
2.1.4
[PATCH 10/29] netfilter: nft_set_hash: unnecessary forward declaration
Replace the struct rhashtable_params forward declaration by the structure
definition itself.

Signed-off-by: Pablo Neira Ayuso
---
 net/netfilter/nft_set_hash.c | 18 --
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c
index 3d3a6df4ce70..850be3a00e62 100644
--- a/net/netfilter/nft_set_hash.c
+++ b/net/netfilter/nft_set_hash.c
@@ -40,8 +40,6 @@ struct nft_hash_cmp_arg {
 	u8 genmask;
 };

-static const struct rhashtable_params nft_hash_params;
-
 static inline u32 nft_hash_key(const void *data, u32 len, u32 seed)
 {
 	const struct nft_hash_cmp_arg *arg = data;
@@ -71,6 +69,14 @@ static inline int nft_hash_cmp(struct rhashtable_compare_arg *arg,
 	return 0;
 }

+static const struct rhashtable_params nft_hash_params = {
+	.head_offset		= offsetof(struct nft_hash_elem, node),
+	.hashfn			= nft_hash_key,
+	.obj_hashfn		= nft_hash_obj,
+	.obj_cmpfn		= nft_hash_cmp,
+	.automatic_shrinking	= true,
+};
+
 static bool nft_hash_lookup(const struct net *net, const struct nft_set *set,
 			    const u32 *key, const struct nft_set_ext **ext)
 {
@@ -320,14 +326,6 @@ static unsigned int nft_hash_privsize(const struct nlattr * const nla[])
 	return sizeof(struct nft_hash);
 }

-static const struct rhashtable_params nft_hash_params = {
-	.head_offset		= offsetof(struct nft_hash_elem, node),
-	.hashfn			= nft_hash_key,
-	.obj_hashfn		= nft_hash_obj,
-	.obj_cmpfn		= nft_hash_cmp,
-	.automatic_shrinking	= true,
-};
-
 static int nft_hash_init(const struct nft_set *set,
			 const struct nft_set_desc *desc,
			 const struct nlattr * const tb[])
--
2.1.4
Re: CAN-FD Transceiver Limitations
> >>
> >> mcan@0 {
> >>	...
> >>	fixed-transceiver {
> >>		max-canfd-speed = <2000>
> >>	};
> >>	...
> >> };

Since when would a transceiver support different speeds for CAN & CANFD?
No transceivers were available, but they are now. I see no datalink
problem applying 2MBit for regular CAN with an appropriate physical
layer, and CAN does not predefine the physical layer (advise != define).

IMHO,

	fixed-transceiver {
		max-arbitration-speed = <200>
		max-data-speed = <400>
	};

is way better to describe the hardware. Regular CAN chips would not
consider max-data-speed...

Kind regards,
Kurt
Re: [PATCH v1 1/2] dt-binding: ptp: add bindings document for dte based ptp clock
On Tue, Jun 27, 2017 at 12:10 AM, Scott Branden wrote:
> Hi Rob/Florian,
>
> Thanks for input but still don't see any need for SoC-specific
> compatible strings. IP-revision-specific, yes.
>
> On 17-06-22 06:04 PM, Florian Fainelli wrote:
>>
>> On 06/22/2017 05:42 PM, Scott Branden wrote:
>>>
>>> On 17-06-21 08:19 PM, Rob Herring wrote:

On Tue, Jun 20, 2017 at 3:48 PM, Scott Branden wrote:
> Hi Rob,
>
> On 17-06-18 07:04 AM, Rob Herring wrote:
>> On Mon, Jun 12, 2017 at 01:26:00PM -0700, Arun Parameswaran wrote:
>>> Add device tree binding documentation for the Broadcom DTE
>>> PTP clock driver.
>>>
>>> Signed-off-by: Arun Parameswaran
>>> ---
>>>  Documentation/devicetree/bindings/ptp/brcm,ptp-dte.txt | 13 +
>>>  1 file changed, 13 insertions(+)
>>>  create mode 100644 Documentation/devicetree/bindings/ptp/brcm,ptp-dte.txt
>>>
>>> diff --git a/Documentation/devicetree/bindings/ptp/brcm,ptp-dte.txt b/Documentation/devicetree/bindings/ptp/brcm,ptp-dte.txt
>>> new file mode 100644
>>> index 000..07590bc
>>> --- /dev/null
>>> +++ b/Documentation/devicetree/bindings/ptp/brcm,ptp-dte.txt
>>> @@ -0,0 +1,13 @@
>>> +* Broadcom Digital Timing Engine (DTE) based PTP clock driver
>>
>> Bindings describe h/w, not drivers.
>>
>>> +
>>> +Required properties:
>>> +- compatible: should be "brcm,ptp-dte"
>>
>> Looks too generic. You need SoC-specific compatible strings.
>
> Rob, could you please help me understand the use of adding SoC-specific
> compatible strings. I still don't get it.
>
> It's my understanding that the SoC compatible string is to future-proof
> against bugs/incompatibilities between different versions of the
> hardware block due to integration issues or any other reason.
> You can then compare in your driver because the strings were already
> used in the dtb.
>
> That would make sense if you can't already differentiate what SoC you
> are running on.
> But the SoC is already specified in the root of the device tree in the
> compatible string?
> Why can't you just use of_machine_is_compatible inside your driver when
> needed?

Use of of_machine_is_compatible in drivers will result in the same mess
we had with machine_is_X defines pre-DT. It practically guarantees that
you must update the driver for a new SoC (with fallback compatibles you
don't). Plus the matching logic for of_machine_is_compatible is
open-coded logic in every driver, which is worse IMO than having a
standard match table.

>>> I don't understand what you mean by fallback compatible then.
>>>
>>> Let's say I have 3 SoCs that each contain the same IP block.
>>> You want us to add a fallback compatibility per SoC, is that correct?
>>
>> I think Rob meant a fallback compatibility for the particular block.
>> E.g.: brcm,iproc-ptp is the fallback compatible string, but in your
>> SoC-specific DTS, you would have at least:
>>
>> compatible = "brcm,cygnus-ptp", "brcm,iproc-ptp";
>>
>> Where cygnus-ptp is more specific than iproc-ptp
>
>>> Then, if there is a workaround discovered in a particular SoC the driver
>>> can be updated in the future without changing the dtb.
>>>
>>> Then, the block gets added to a 4th SoC.
>>> You want us to add another new compatibility string for the new SoC?
>>> If the new SoC has a bug then the driver has to be updated whether it
>>> uses the fallback compatible or machine_is_compatible string.
>>>
>>> There is no difference in amount of code added to a driver when a new
>>> SoC is introduced into the system that has bugs that need to be handled
>>> by the driver.
>>>
>>> The difference is in your recommendation we need to go through all the
>>> drivers used by the new SoC and add fallback compatibility strings.
>> Not really, the fallback is what the driver should be matching by
>> default (hence the name fallback), and if and only if you need to have
>> SoC-specific behavior in your driver (because of bugs, or slight
>> differences) would you be matching this SoC-specific compatible string
>> to capture that and key the driver behavior based off that.
>>
>>> Then, we have to modify all the devicetree documentation for all the
>>> drivers. Then, we have to ensure that all dts files populate this new
>>> fallback string (even if it is unused). We don't see the benefit in
>>> doing any of that. Using machine_is_compatible and having fewer
>>> compatibility strings to deal with appears much cleaner and more
>>> foolproof for all situations.
>>
>> When you introduce a new SoC, you would update all the bindings for the
>> devices (not drivers) that are susceptible to be used by this
Re: [PATCH net-next v4 04/16] bpf: Sample bpf program to set SYN/SYN-ACK RTOs
On 6/29/17, 12:39 PM, "netdev-ow...@vger.kernel.org on behalf of Jesper Dangaard Brouer" wrote:

    On Wed, 28 Jun 2017 10:31:12 -0700 Lawrence Brakmo wrote:

    > +++ b/samples/bpf/tcp_synrto_kern.c
    > @@ -0,0 +1,60 @@
    > +/* Copyright (c) 2017 Facebook
    > + *
    > + * This program is free software; you can redistribute it and/or
    > + * modify it under the terms of version 2 of the GNU General Public
    > + * License as published by the Free Software Foundation.
    > + *
    > + * BPF program to set SYN and SYN-ACK RTOs to 10ms when using IPv6 addresses
    > + * and the first 5.5 bytes of the IPv6 addresses are the same (in this example
    > + * that means both hosts are in the same datacenter.

    Missing end ")". I really like this short comment of what the program
    does, as it helps people browsing these sample programs. Can you also
    mention in the comment (of all these bpf programs) that people load
    this bpf object file via the program 'load_sock_ops'?

Thank you for finding the typo and for the comment on adding how to load
the sample programs. Will be done in v5, due later today.

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH RFC 0/2] kproxy: Kernel Proxy
Hi Tom,

On 29 June 2017 at 11:27, Tom Herbert wrote:
> This is raw, minimally tested, and error handling needs work. Posting
> as RFC to get feedback on the design...
>
> Sidecar proxies are becoming quite popular on servers as a means to
> perform layer 7 processing on application data as it is sent. Such
> sidecars are used for SSL proxies, application firewalls, and L7
> load balancers. While these proxies provide nice functionality,
> their performance is obviously terrible since all the data needs
> to take an extra hop through userspace.

I really appreciate this work. It would have been nice to at least
attribute me in some way, as this is exactly what I presented at
Netconf 2017 [0]. I'm also wondering why this is not built on top of
KCM, which you suggested to use when we discussed this.

[0] https://docs.google.com/presentation/d/1dwSKSBGpUHD3WO5xxzZWj8awV_-xL-oYhvqQMOBhhtk/edit#slide=id.g203aae413f_0_0
RE: [net-next v2 6/6] ixgbe: Add malicious driver detection support
>-Original Message- >From: David Miller [mailto:da...@davemloft.net] >Sent: Thursday, June 29, 2017 11:28 AM >To: gerlitz...@gmail.com >Cc: Tantilov, Emil S; Kirsher, Jeffrey T > ; Greenwalt, Paul ; >netdev@vger.kernel.org; nhor...@redhat.com; sassm...@redhat.com; >jogre...@redhat.com >Subject: Re: [net-next v2 6/6] ixgbe: Add malicious driver detection >support > >From: Or Gerlitz >Date: Wed, 28 Jun 2017 17:28:59 +0300 > >> On Wed, Jun 28, 2017 at 1:14 AM, Tantilov, Emil S >> wrote: >> >>> Mainly because I am not sure that other (non-Intel) drivers will benefit >from >>> such an option. In normal operation this functionality should not cause >issues >>> and if it doesn't we may be able to deprecate the private flag in the >future. >> >> If you think this functionality makes sense any driver running over HW >> implementing >> it would like to be able to expose that and hence you better not use >> private flag. As I mentioned I don't know if this will be useful for other drivers. The i40e driver enables it by default without possibility to disable it and if this protection does not cause problems for ixgbe then we may not need the option in the future. Because of this I wasn't sure if it's worth polluting the tools with options that may end up not being needed/used at all. >> Are we sure the trust UAPI can't be extended for that matter? > >Yeah, we should probably make this a generic control if possible. MDD is set globally for the device, while the trusted option is set per VF. So if we do end up adding an option it probably won't work as extension for trusted. Thanks, Emil
Re: [PATCH RFC 0/2] kproxy: Kernel Proxy
On Thu, Jun 29, 2017 at 01:40:26PM -0700, Tom Herbert wrote:
> > In fact that's not much what I observe in the field. In practice, large
> > data streams are cheaply relayed using splice(); I could achieve
> > 60 Gbps of HTTP forwarding via HAProxy on a 4-core Xeon 2 years ago.
> > And when you use SSL, the cost of the copy to/from the kernel is small
> > compared to all the crypto operations surrounding it.
>
> Right, getting rid of the extra crypto operations and so-called "SSL
> inspection" is the ultimate goal this is going towards.

Yep, but in order to take decisions at L7 you need to decapsulate SSL.

> HTTP is only one use case. There are other interesting use cases such
> as those in container security, where the application protocol might be
> something like simple RPC.

OK, that indeed makes sense in such environments.

> Performance is relevant because we potentially want security applied to
> every message in every communication in a containerized data center.
> Putting the userspace hop in the datapath of every packet is known to
> be problematic, not just for the performance hit but also because it
> increases the attack surface on users' privacy.

While I totally agree on the performance hit when inspecting each packet,
I fail to see the relation with users' privacy. In fact under some
circumstances it can even be the opposite. For example, using something
like kTLS for a TCP/HTTP proxy can result in cleartext being observable
in strace while it's not visible when TLS is terminated in userland,
because all you see are openssl's read()/write() operations. Maybe you
have specific attacks in mind?

> > Regarding kernel-side protocol parsing, there's an unfortunate trend
> > at moving more and more protocols to userland due to these protocols
> > evolving very quickly.
> > At least you'll want to find a way to provide these parsers from
> > userspace, which will inevitably come with its set of problems or
> > limitations :-/
>
> That's why everything is going BPF now ;-)

Yes, I knew you were going to suggest this :-) I'm still prudent on it,
to be honest. I don't think it would be that easy to implement an HPACK
encoder/decoder using BPF. And even regarding just plain HTTP parsing,
certain very small operations in haproxy's parser can quickly result in
a 10% performance degradation when improperly optimized (ie: changing a
"likely", altering branch prediction, or cache walk patterns when using
arrays to evaluate character classes faster). But for general usage I
indeed think it should be OK.

> > All this to say that while I can definitely imagine the benefits of
> > having in-kernel sockets for in-kernel L7 processing or filtering,
> > I'm having strong doubts about the benefits that userland may receive
> > by using this (or maybe you already have any performance numbers
> > supporting this ?).
>
> Nope, no numbers yet.

OK, no worries. Thanks for your explanations!

Willy
Re: [PATCH RFC 0/2] kproxy: Kernel Proxy
Hi Willy,

Thanks for the comments!

> In fact that's not much what I observe in the field. In practice, large
> data streams are cheaply relayed using splice(); I could achieve
> 60 Gbps of HTTP forwarding via HAProxy on a 4-core Xeon 2 years ago.
> And when you use SSL, the cost of the copy to/from the kernel is small
> compared to all the crypto operations surrounding it.

Right, getting rid of the extra crypto operations and so-called "SSL
inspection" is the ultimate goal this is going towards.

> Another point is that most HTTP requests are quite small (typically ~80%
> 20kB or less), and in this case the L7 processing and certain syscalls
> significantly dominate the operations; data copies are comparatively
> small. Simply parsing an HTTP header takes time (when you do it
> correctly). You can hardly parse and index more than 800MB-1GB/s of
> HTTP headers per core, which limits you to roughly 1-1.2 M req+resp per
> second for a 400 byte request and a 400 byte response, and that's
> without any processing at all. But when doing this, certain syscalls
> like connect(), close() or epoll_ctl() start to be quite expensive.
> Even splice() is expensive to forward small data chunks because you
> need two calls, and recv+send is faster. In fact our TCP stack has been
> so much optimized for realistic workloads over the years that it
> becomes hard to gain more by cheating on it :-)
>
> In the end in haproxy I'm seeing about 300k req+resp per second in
> HTTP keep-alive and more like 100-130k with close, when disabling
> TCP quick-ack during accept() and connect() to save one ACK on each
> side (just doing this generally brings performance gains between 7
> and 10%).

HTTP is only one use case. There are other interesting use cases such as
those in container security where the application protocol might be
something like simple RPC. Performance is relevant because we
potentially want security applied to every message in every
communication in a containerized data center.
Putting the userspace hop in the datapath of every packet is known to be
problematic, not just for the performance hit but also because it
increases the attack surface on users' privacy.

> Regarding kernel-side protocol parsing, there's an unfortunate trend
> at moving more and more protocols to userland due to these protocols
> evolving very quickly. At least you'll want to find a way to provide
> these parsers from userspace, which will inevitably come with its set
> of problems or limitations :-/

That's why everything is going BPF now ;-)

> All this to say that while I can definitely imagine the benefits of
> having in-kernel sockets for in-kernel L7 processing or filtering,
> I'm having strong doubts about the benefits that userland may receive
> by using this (or maybe you already have any performance numbers
> supporting this ?).

Nope, no numbers yet.

> Just my two cents,
> Willy
[GIT] Networking
1) Need to access netdev->num_rx_queues behind an accessor in the netvsc
   driver, otherwise the build breaks with some configs, from Arnd Bergmann.

2) Add dummy xfrm_dev_event() so that the build doesn't fail when
   CONFIG_XFRM_OFFLOAD is not set. From Hangbin Liu.

3) Don't OOPS when pfkey_msg2xfrm_state() signals an error, from Dan
   Carpenter.

4) Fix MCDI command size for filter operations in the sfc driver, from
   Martin Habets.

5) Fix UFO segmenting so that we don't calculate incorrect checksums,
   from Michal Kubecek.

6) When ipv6 datagram connects fail, reset destination address and port.
   From Wei Wang.

7) TCP disconnect must reset the cached receive DST, from WANG Cong.

8) Fix sign extension bug on 32-bit in dev_get_stats(), from Eric Dumazet.

9) fman driver has to depend on HAS_DMA, from Madalin Bucur.

10) Fix bpf pointer leak with xadd in the verifier, from Daniel Borkmann.

11) Fix negative page counts with GRO, from Michal Kubecek.

Please pull, thanks a lot!

The following changes since commit 48b6bbef9a1789f0365c1a385879a1fea4460016:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2017-06-21 12:40:20 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git

for you to fetch changes up to d58299a478c416c0b48e4b31c6332fe7beb63000:

  sfc: fix attempt to translate invalid filter ID (2017-06-29 15:59:38 -0400)

----------------------------------------------------------------
Andrew F. Davis (1):
      net: usb: asix88179_178a: Add support for the Belkin B2B128

Arnd Bergmann (1):
      netvsc: don't access netdev->num_rx_queues directly

Colin Ian King (1):
      Trivial fix to spelling mistake in arc_printk message

Dan Carpenter (3):
      xfrm: Oops on error in pfkey_msg2xfrm_state()
      xfrm: NULL dereference on allocation failure
      rocker: move dereference before free

Daniel Borkmann (1):
      bpf: prevent leaking pointer via xadd on unpriviledged

David S. Miller (4):
      Merge branch 'macvlan-Fix-some-issues-with-changing-mac-addresses'
      Merge branch 'master' of git://git.kernel.org/.../klassert/ipsec
      Merge branch 'bnxt_en-fixes'
      Merge branch 'arcnet-fixes'

Edward Cree (1):
      sfc: fix attempt to translate invalid filter ID

Eric Dumazet (1):
      net: prevent sign extension in dev_get_stats()

Gao Feng (1):
      net: sched: Fix one possible panic when no destroy callback

Hangbin Liu (2):
      xfrm: fix xfrm_dev_event() missing when compile without CONFIG_XFRM_OFFLOAD
      xfrm: move xfrm_garbage_collect out of xfrm_policy_flush

Ido Schimmel (1):
      mlxsw: spectrum_router: Fix NULL pointer dereference

Jason Wang (1):
      virtio-net: serialize tx routine during reset

Lokesh Vutla (1):
      drivers: net: cpsw-common: Fix reading of mac address for am43 SoCs

Madalin Bucur (1):
      fsl/fman: add dependency on HAS_DMA

Martin Habets (1):
      sfc: Fix MCDI command size for filter operations

Michael Chan (2):
      bnxt_en: Add missing logic to handle TPA end error conditions.
      bnxt_en: Fix netpoll handling.

Michael Grzeschik (4):
      arcnet: change irq handler to lock irqsave
      arcnet: com20020: remove needless base_addr assignment
      arcnet: com20020-pci: fix dev_id calculation
      arcnet: com20020-pci: add missing pdev setup in netdev structure

Michal Kubeček (2):
      net: account for current skb length when deciding about UFO
      net: handle NAPI_GRO_FREE_STOLEN_HEAD case also in napi_frags_finish()

Mintz, Yuval (1):
      bnx2x: Don't log mc removal needlessly

Richard Cochran (1):
      net: dp83640: Avoid NULL pointer dereference.

Serhey Popovych (1):
      veth: Be more robust on network device creation when no attributes

Vlad Yasevich (4):
      macvlan: Do not return error when setting the same mac address
      macvlan: Fix passthru macvlan mac address inheritance
      macvlan: convert port passthru to flags.
      macvlan: Let passthru macvlan correctly restore lower mac address

WANG Cong (4):
      ipv6: only call ip6_route_dev_notify() once for NETDEV_UNREGISTER
      ipv6: avoid unregistering inet6_dev for loopback
      sit: use __GFP_NOWARN for user controlled allocation
      tcp: reset sk_rx_dst in tcp_disconnect()

Wei Liu (1):
      xen-netback: correctly schedule rate-limited queues

Wei Wang (1):
      net: ipv6: reset daddr and dport in sk if connect() fails

Yossi Kuperman (2):
      xfrm6: Fix IPv6 payload_len in xfrm6_transport_finish
      esp6_offload: Fix IP6CB(skb)->nhoff for ESP GRO

Zach Brown (1):
      net/phy: micrel: configure intterupts after autoneg workaround

 drivers/net/arcnet/arcnet.c                      | 7 ---
 drivers/net/arcnet/capmode.c                     | 2 +-
 drivers/net/arcnet/com20020-pci.c                | 6 --
 drivers/net/arcnet/com20020.c                    | 2 --
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 2 +-
Re: BUG: Dentry ffff9f795a08fe60{i=af565f,n=lo} still in use (1) [unmount of proc proc]
On Thu, Jun 29, 2017 at 12:06 PM, Eric W. Biederman wrote:
> Andrei Vagin writes:
>
>> Hello,
>>
>> We run CRIU tests on Linus' tree and today we found this issue.
>>
>> CRIU tests are a set of small programs to check checkpoint/restore
>> of different primitives (files, sockets, signals, pipes, etc).
>> https://github.com/xemul/criu/tree/master/test
>>
>> Each test is executed three times: without namespaces, in a set of all
>> namespaces except userns, and in a set of all namespaces. When a test
>> passes the preparation steps, it sends a signal to an executor, and
>> then the executor dumps and restores the test processes, and sends a
>> signal back to the test to check that everything was restored
>> correctly.
>
> I am not certain what you are saying, and you seem to have Cc'd
> every list except the netdev and netfilter lists that are needed
> to deal with this.
>
> Are you saying that the change from Liping Zhang is needed? Or are you
> saying that change introduces the problem below?

Hi Eric,

Here I tried to explain our use case. I don't know which changes in the
kernel affect this issue. Actually I reported a similar problem a few
months ago on linux-next: https://lkml.org/lkml/2017/3/10/1586

So I don't think that the change from Liping Zhang affects this issue
somehow. I mentioned it just to describe what kernel we used. And I
don't know how to reproduce the issue. You can see from the kernel log
that the kernel worked for more than 6 hours in our case. During this
period we ran all our tests a few times, so I think there is a kind of
race.

> I could not find the mentioned commits. Are they in Linus's tree or
> someone's next tree that feeds into linux-next?
Here is the patch from Liping Zhang https://patchwork.ozlabs.org/patch/770887/ The second mentioned commit is HEAD of the master branch in Linus' tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6474924e2b5ddb0030c38966adcbe3b49022 Thanks, Andrei > > Eric > >> [root@zdtm linux]# git describe HEAD~1 >> v4.12-rc7-25-g6474924 >> >> And there is one more patch from the netfilter tree: >> commit b216759c0cb5d37d1eec3cd5b67ba38bace94fd8 >> Author: Liping Zhang >> Date: Sun Jun 4 19:17:34 2017 +0800 >> >> netfilter: nf_ct_dccp/sctp: fix memory leak after netns cleanup >> >> [root@zdtm linux]# uname -a >> Linux zdtm.openvz.org 4.12.0-rc7+ #9 SMP Thu Jun 29 08:28:18 CEST 2017 >> x86_64 x86_64 x86_64 GNU/Linux >> >> A kernel config is attached. >> >> >> [22458.504137] BUG: Dentry 9f795a08fe60{i=af565f,n=lo} still in >> use (1) [unmount of proc proc] >> [22458.505117] [ cut here ] >> [22458.505299] WARNING: CPU: 0 PID: 15036 at fs/dcache.c:1445 >> umount_check+0x66/0x80 >> [22458.505564] Modules linked in: nfsd auth_rpcgss nfs_acl lockd grace >> sunrpc macvlan tun veth nf_conntrack_netlink xt_mark udp_diag tcp_diag >> inet_diag netlink_diag af_packet_diag unix_diag binfmt_misc >> ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink >> ebtable_broute bridge stp llc ebtable_nat ip6table_security >> ip6table_mangle ip6table_raw ip6table_nat nf_conntrack_ipv6 >> nf_defrag_ipv6 nf_nat_ipv6 iptable_security iptable_mangle iptable_raw >> iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat >> nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables btrfs >> xor raid6_pq loop ppdev crct10dif_pclmul crc32_pclmul >> ghash_clmulni_intel lpc_ich sbs virtio_balloon shpchp sbshc parport_pc >> parport tpm_tis tpm_tis_core tpm xfs libcrc32c crc32c_intel >> ata_generic serio_raw pata_acpi >> [22458.507586] virtio_pci virtio virtio_ring e1000 >> [22458.507771] CPU: 0 PID: 15036 Comm: kworker/0:2 Not tainted 4.12.0-rc7+ #9 
>> [22458.507830] systemd-journald[561]: Compressed data object 807 -> >> 605 using LZ4 >> [22458.508184] Hardware name: Parallels Software International Inc. >> Parallels Virtual Platform/Parallels Virtual Platform, BIOS >> 6.12.26068.1232434 02/27/2017 >> [22458.508641] Workqueue: events proc_cleanup_work >> [22458.508848] task: 9f797be8 task.stack: b2b8825f >> [22458.509033] RIP: 0010:umount_check+0x66/0x80 >> [22458.509172] RSP: 0018:b2b8825f3c68 EFLAGS: 00010286 >> [22458.509363] RAX: 0054 RBX: 9f798492afa0 RCX: >> >> [22458.509589] RDX: RSI: 9f79bce0e388 RDI: >> 9f79bce0e388 >> [22458.509823] RBP: b2b8825f3c70 R08: R09: >> >> [22458.510022] R10: b2b8825f3bf0 R11: R12: >> 9f795a08fe60 >> [22458.510217] R13: 9f795a08ff50 R14: 9f795a08fee0 R15: >> 9f798492afa0 >> [22458.510456] FS: () GS:9f79bce0() >> knlGS: >> [22458.510726] CS: 0010 DS: ES: CR0: 80050033 >> [22458.510953] CR2: 7f834657a6b0 CR3: 602a3000 CR4: >> 06f0 >>
Re: [PATCH NET V7 0/2] Add loopback support in phy_driver and hns ethtool fix
From: Andrew Lunn
Date: Thu, 29 Jun 2017 20:22:16 +0200

> Is less broken a sufficient criteria for acceptance?

Sometimes, it depends upon the situation.

If you are continuing to resolve those issues, I'll wait and
watch for future respins then.