Re: [PATCH] Doc: z8530book: Fix typo in API-z8530-sync-txdma-open.html
From: Masanari Iida Date: Fri, 10 Jul 2015 21:20:28 +0900 > This patch fix a spelling typo found in API-z8530-sync-txdma-open.html. > It is because this file was generated from comment in source, > I have to fix comment in source. > > Signed-off-by: Masanari Iida Applied, t hanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Panic with demuxed ipv4 multicast udp sockets on 4.0.4
On Fri, 2015-07-10 at 13:31 -0700, Alex Gartrell wrote: > Hey everyone, > > There's some kind of nasty condition in which sk_rx_dst points to an > apparently garbage datastructure and it's blowing up in the early > demux code because dst->ops is NULL. The packet in question was for > bit torrent local peer discovery > https://en.wikipedia.org/wiki/Local_Peer_Discovery . We're seeing > this on about a 1/200 chance of panic per day. > > crash> bt > PID: 1899532 TASK: 88000826cf00 CPU: 9 COMMAND: "hhvm.node.1" > #0 [88047fc23990] machine_kexec at 8103bf05 > #1 [88047fc239e0] crash_kexec at 810cb4e8 > #2 [88047fc23ab0] oops_end at 81006468 > #3 [88047fc23ae0] no_context at 8167aac1 > #4 [88047fc23b40] __bad_area_nosemaphore at 8167acb9 > #5 [88047fc23b90] bad_area_nosemaphore at 8167aceb > #6 [88047fc23ba0] __do_page_fault at 81044ac5 > #7 [88047fc23c10] do_page_fault at 81044eec > #8 [88047fc23c20] page_fault at 81686c02 > [exception RIP: udp_v4_early_demux+481] > RIP: 816249a1 RSP: 88047fc23cd8 RFLAGS: 00010246 > RAX: 880248ee4500 RBX: 093a RCX: 0002 > RDX: RSI: RDI: 880248ee4500 > RBP: 88047fc23d48 R8: R9: > R10: 0001 R11: c9000199f3a0 R12: 88006f8a6300 > R13: 81cbb1c0 R14: 0001 R15: 880474798f00 > ORIG_RAX: CS: 0010 SS: > #9 [88047fc23cd0] udp_v4_early_demux at 81624bb3 > #10 [88047fc23d50] ip_rcv_finish at 815f3055 > #11 [88047fc23d80] ip_rcv at 815f3952 > #12 [88047fc23dc0] __netif_receive_skb_core at 815b96d4 > #13 [88047fc23e30] __netif_receive_skb at 815b9911 > #14 [88047fc23e50] process_backlog at 815b99f0 > #15 [88047fc23ea0] net_rx_action at 815ba1e8 > #16 [88047fc23f30] __do_softirq at 81054ce6 > #17 [88047fc23f90] irq_exit at 81055075 > #18 [88047fc23fa0] smp_call_function_single_interrupt at 810319f5 > #19 [88047fc23fb0] call_function_single_interrupt at 8168637a > --- --- > #20 [8800792dff58] call_function_single_interrupt at 8168637a > RIP: 006e7b4c RSP: 7f4c8ba38b80 RFLAGS: 0216 > RAX: 006b RBX: 816851f2 RCX: 7f49f4de84d6 > RDX: 7f49f4de84d8 RSI: 7f48dbcce731 RDI: > RBP: 7f4c8ba38bd0 R8: 006b R9: > R10: 7f48dbcce737 R11: 7f49f4de84e0 R12: 7f4adab85198 > R13: 0014 R14: 7f4adaaa4c00 R15: > ORIG_RAX: ff04 CS: 0033 SS: 002b > crash> print *(struct *dst_entry)0x880248ee4500 > A syntax error in expression, near `*dst_entry)0x880248ee4500'. > gdb: gdb request failed: print *(struct *dst_entry)0x880248ee4500 > crash> print *(struct dst_entry*)0x880248ee4500 > $1 = { > callback_head = { > next = 0x880248ee4d00, > func = 0x0 > }, > child = 0x13eacdfb7df67f6b, > dev = 0x880113975d00, > ops = 0x0, > _metrics = 13729079323838086211, > expires = 103079215104, > path = 0x24c8d3baa, > from = 0x0, > xfrm = 0x6, > input = 0x0, > output = 0x0, > flags = 5536, > pending_confirm = 33114, > error = -1, > obsolete = -1, > header_len = 0, > trailer_len = 0, > tclassid = 0, > __pad_to_align_refcnt = {0, 704374636708}, > __refcnt = { > counter = 14 > }, > __use = 2097153, > lastuse = 536576, > { > next = 0x0, > rt_next = 0x0, > rt6_next = 0x0, > dn_next = 0x0 > } > } Seems similar to what commit d0c294c53a771 fixed Have you tried following patch ? diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 83aa604f9273..35c0a4ac540c 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1995,7 +1995,7 @@ void udp_v4_early_demux(struct sk_buff *skb) skb->sk = sk; skb->destructor = sock_efree; - dst = sk->sk_rx_dst; + dst = READ_ONCE(sk->sk_rx_dst); if (dst) dst = dst_check(dst, 0); -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fighting out-of-order reception with RPS?
On Fri, 2015-07-10 at 22:36 +0200, Oliver Hartkopp wrote: > On 07/10/2015 04:48 AM, Tom Herbert wrote: > > On Wed, Jul 8, 2015 at 10:55 PM, Oliver Hartkopp > > wrote: > >> Both drivers do not use NAPI. The just follow the way > >> > >> interrupt -> alloc_skb() -> fill skb -> netif_rx(skb) > >> > >> I'm usually testing with the USB adapters as the PCIe setup is not very > >> handy. > >> > > Okay, I see what is happening. In netif_rx when RPS is not enabled > > that packet is queued to the backlog queue for the local CPU. Since > > you're doing round robin on the interrupts then OOO packets can be a > > result. Unfortunately, this is the expected behavior. The correct > > kernel fix would be to move to these drivers to use NAPI. > > Hm. Doesn't sound like a good solution when there's a difference between NAPI > and non-NAPI drivers in matters of OOO, right? Isn't OOO a problem for you ? Then you either have to : 1) Use a single CPU to handle IRQ from the device 2) Use NAPI > > > RPS > > eliminates the OOO, but if there is no ability to derive a flow hash > > from packets everything will wind up one queue without load balancing. > > Correct. > > That's why I added > > skb_set_hash(skb, dev->ifindex, PKT_HASH_TYPE_L2); > > in my driver, because the only relevant flow identifiction is the number of > the incoming CAN interface. > > > Besides that, automatically setting RPS in drivers is a difficult > > proposition since there is no definitively "correct" way to do that in > > an arbitrary configuration. > > What about checking in netif_rx() if the non-NAPI driver has set a hash (aka > the driver is OOO sensitive)? > And if so we could automatically set rps_cpus for this interface in a way that > all CPUs are enabled to take skbs following the hash. Wow, netif_rx() is packet processing fast path, certainly not the place to add controlling path decisions. Please convert your driver to NAPI. You might then even benefit from GRO. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] net: inet_diag: always export IPV6_V6ONLY sockopt for listening sockets
From: Phil Sutter Date: Fri, 10 Jul 2015 11:39:57 +0200 > Reconsidering my commit 20462155 "net: inet_diag: export IPV6_V6ONLY > sockopt", I am not happy with the limitations it causes for socket > analysing code in userspace. Exporting the value only if it is set makes > it hard for userspace to decide whether the option is not set or the > kernel does not support exporting the option at all. > > From an auditor's perspective, the interesting question for listening > AF_INET6 sockets is: "Does it NOT have IPV6_V6ONLY set?" Because it is > the unexpected case. This patch allows to answer this question reliably. > > Signed-off-by: Phil Sutter > Cc: Eric Dumazet Applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 0/9] be2net: patch set
From: Sathya Perla Date: Fri, 10 Jul 2015 05:32:42 -0400 > Hi David, the following patch set has code cleanup patches, minor enhancements > and non-critical fixes. Pls consider applying to the net-next tree. Thanks! Series applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next v2] ipv6: Do not iterate over all interfaces when finding source address on specific interface.
From: YOSHIFUJI Hideaki/吉藤英明 Date: Fri, 10 Jul 2015 16:58:31 +0900 > If outgoing interface is specified and the candidate address is > restricted to the outgoing interface, it is enough to iterate > over that given interface only. > > Signed-off-by: YOSHIFUJI Hideaki > Acked-by: Erik Kline Applied, thanks! -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: Drop owner assignment from platform_driver
From: Krzysztof Kozlowski Date: Fri, 10 Jul 2015 15:29:23 +0900 > platform_driver does not need to set an owner because > platform_driver_register() will set it. > > Signed-off-by: Krzysztof Kozlowski Applied to net-next, thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: phy: Support setting polarity in marvell phy driver
From: David Thomson Date: Fri, 10 Jul 2015 16:28:25 +1200 > Support manually setting the polarity to mdi or mdix > > Signed-off-by: David Thomson Applied to net-next. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: phy: Pass mdix ethtool setting through to phy driver
From: David Thomson Date: Fri, 10 Jul 2015 13:56:54 +1200 > Pass the mdix setting from ethtool down to the phy driver, to allow > driver specific implementations of manually setting the polarity. > > Signed-off-by: David Thomson Applied to net-next. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 4/4] gianfar: Add paged allocation and Rx S/G
From: Claudiu Manoil Date: Thu, 9 Jul 2015 19:24:44 +0300 > @@ -2839,6 +2815,94 @@ static irqreturn_t gfar_transmit(int irq, void *grp_id) > return IRQ_HANDLED; > } > > +static bool gfar_add_rx_frag(struct gfar_rx_buff *rxb, unsigned int lstatus, > + struct sk_buff *skb, bool first) > +{ You really are again playing Russian Roulette with the type of 'lstatus' here. Use 'u32' consistently. > +static struct sk_buff *gfar_get_next_rxbuff(struct gfar_priv_rx_q *rx_queue, > + unsigned long lstatus, > + struct sk_buff *skb) Likewise. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 2/4] gianfar: Fix and cleanup rxbd status handling
From: Claudiu Manoil Date: Thu, 9 Jul 2015 19:24:42 +0300 > @@ -2921,6 +2921,7 @@ int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, > int rx_work_limit) > i = rx_queue->next_to_clean; > > while (rx_work_limit--) { > + unsigned long lstatus; > > if (cleaned_cnt >= GFAR_RX_BUFF_ALLOC) { > gfar_alloc_rx_buffs(rx_queue, cleaned_cnt); > @@ -2928,7 +2929,8 @@ int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, > int rx_work_limit) > } > > bdp = &rx_queue->rx_bd_base[i]; > - if (be16_to_cpu(bdp->status) & RXBD_EMPTY) > + lstatus = be32_to_cpu(bdp->lstatus); > + if (lstatus & BD_LFLAG(RXBD_EMPTY)) > break; > > /* order rx buffer descriptor reads */ "lstatus" is not an "unsigned long", it's a "u32". Don't use ambiguous types for objects with exact known sizes. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] net: dsa: mv88e6xxx: add write access to debugfs regs file
From: Vivien Didelot Date: Thu, 9 Jul 2015 17:13:29 -0400 > Allow write access to the regs file in the debugfs interface, with the > following parameters: > > echo> regs > > Where "name" is the register name (as shown in the header row), "reg" is > the register address (as shown in the first column) and "value" is the > 16-bit value. e.g.: > > echo GLOBAL 1a 5550 > regs > > Signed-off-by: Vivien Didelot I don't know about this. This starts to smell like a back door for proprietary userspace SDKs to program the switch hardware. Yes, they can do it via other mechanisms, but we don't have to make it any eaiser for them either. If you want to poke registers, hack the module just like any other person with appropriate privileges can do. Frankly, all of this debugfs crap in the DSA drivers smells like poo. I don't like it _AT_ _ALL_, and I shouldn't have allowed any of it into the tree in the first place. I might just remove it all myself, it bothers me so much. Fetching information should be done by well typed, generic, interfaces that apply to any similar device or object. All of this debugfs stuff smells of hacks and special case crap that's only usable for one device type and that makes it the single most terrible interface to give to users. Thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] sctp: SCTP_SOCKOPT_PEELOFF return socket pointer for kernel users
From: Marcelo Ricardo Leitner Date: Thu, 9 Jul 2015 11:15:19 -0300 > SCTP has this operation to peel off associations from a given socket and > create a new socket using this association. We currently have two ways > to use this operation: > - via getsockopt(), on which it will also create and return a file > descriptor for this new socket > - via sctp_do_peeloff(), which is for kernel only > > The caveat with using sctp_do_peeloff() directly is that it creates a > dependency to SCTP module, while all other operations are handled via > kernel_{socket,sendmsg,getsockopt...}() interface. This causes the > kernel to load SCTP module even when it's not directly used > > This patch then updates SCTP_SOCKOPT_PEELOFF so that for kernel users of > this protocol it will not allocate a file descriptor but instead just > return the socket pointer directly. > > If called by an user application it will work as before. > > Signed-off-by: Marcelo Ricardo Leitner I do not like this at all. Socket option implementations should not change their behavior or what datastructures they consume or return just because the socket happens to be a kernel socket. I'm not applying this series, sorry. Also, your patch series lacked an intial "PATCH 0/N" posting, so you could at least spend the time to discuss this patch series at a high level and explain your overall motivations. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -next] Revert "ipv4: use skb coalescing in defragmentation"
On Sat, 2015-07-11 at 01:37 +0200, Florian Westphal wrote: > This reverts commit 3cc4949269e01f39443d0fcfffb5bc6b47878d45. > > There is nothing wrong with coalescing during defragmentation, it > reduces truesize overhead and simplifies things for the receiving > socket (no fraglist walk needed). > > However, it also destroys geometry of the original fragments. > While that doesn't cause any breakage (we make sure to not exceed largest > original size) ip_do_fragment contains a 'fastpath' that takes advantage > of a present frag list and results in fragments that (in most cases) > match what was received. > > In case its needed the coalescing could be done later, when we're sure > the skb is not forwarded. But discussion during NFWS resulted in > 'lets just remove this for now'. > > Cc: Eric Dumazet > Signed-off-by: Florian Westphal > --- Acked-by: Eric Dumazet -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv3 net 0/2] net: fixes for device unregistration
From: Julian Anastasov Date: Thu, 9 Jul 2015 09:59:08 +0300 > Test script from Eric W. Biederman can catch a problem > where packets from backlog are processed long after the last > synchronize_net call. This can be reproduced after few tests > if commit 381c759d9916 ("ipv4: Avoid crashing in ip_error") > is reverted for the test. Incoming packets do not hold > reference to device but even if they do, subsystems do not > expect packets to fly during and after the NETDEV_UNREGISTER > event. > > The first fix has the cost of netif_running check in fast path. > The second fix calls rcu_read_lock while local IRQ is disabled, > I hope this is not against the rules. Thanks for fixing this tricky bug. Applied and queued up for -stable. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC net-next] net: Build IPv6 into kernel by default
From: Tom Herbert Date: Thu, 9 Jul 2015 13:42:29 -0700 > This patch makes the default to build IPv6 into the kernel. IPv6 > now has significant traction and any remaining vestiges of IPv6 > not being provided parity with IPv4 should be swept away. IPv6 is now > core to the Internet and kernel. I guess I'm fine with this, just fix up the doc error Dave Jones pointed out. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net] bridge: mdb: allow the user to delete mdb entry if there's a querier
From: Nikolay Aleksandrov Date: Thu, 9 Jul 2015 04:12:45 -0700 > From: Satish Ashok > > Until now when a querier was present static entries couldn't be deleted. > Fix this and allow the user to manipulate the mdb with or without a > querier. > > Signed-off-by: Satish Ashok > Signed-off-by: Nikolay Aleksandrov Applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -next] Revert "ipv4: use skb coalescing in defragmentation"
This reverts commit 3cc4949269e01f39443d0fcfffb5bc6b47878d45. There is nothing wrong with coalescing during defragmentation, it reduces truesize overhead and simplifies things for the receiving socket (no fraglist walk needed). However, it also destroys geometry of the original fragments. While that doesn't cause any breakage (we make sure to not exceed largest original size) ip_do_fragment contains a 'fastpath' that takes advantage of a present frag list and results in fragments that (in most cases) match what was received. In case its needed the coalescing could be done later, when we're sure the skb is not forwarded. But discussion during NFWS resulted in 'lets just remove this for now'. Cc: Eric Dumazet Signed-off-by: Florian Westphal --- diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c index a50dc6d..4d3fffa 100644 --- a/net/ipv4/ip_fragment.c +++ b/net/ipv4/ip_fragment.c @@ -522,7 +522,6 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev, int len; int ihlen; int err; - int sum_truesize; u8 ecn; ipq_kill(qp); @@ -590,32 +589,19 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev, add_frag_mem_limit(&qp->q, clone->truesize); } + skb_shinfo(head)->frag_list = head->next; skb_push(head, head->data - skb_network_header(head)); - sum_truesize = head->truesize; - for (fp = head->next; fp;) { - bool headstolen; - int delta; - struct sk_buff *next = fp->next; - - sum_truesize += fp->truesize; + for (fp=head->next; fp; fp = fp->next) { + head->data_len += fp->len; + head->len += fp->len; if (head->ip_summed != fp->ip_summed) head->ip_summed = CHECKSUM_NONE; else if (head->ip_summed == CHECKSUM_COMPLETE) head->csum = csum_add(head->csum, fp->csum); - - if (skb_try_coalesce(head, fp, &headstolen, &delta)) { - kfree_skb_partial(fp, headstolen); - } else { - if (!skb_shinfo(head)->frag_list) - skb_shinfo(head)->frag_list = fp; - head->data_len += fp->len; - head->len += fp->len; - head->truesize += fp->truesize; - } - fp = next; + head->truesize += fp->truesize; } - sub_frag_mem_limit(&qp->q, sum_truesize); + sub_frag_mem_limit(&qp->q, head->truesize); head->next = NULL; head->dev = dev; -- 2.0.5 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -next 6/6] netfilter: nftables: Only run the nftables chains in the proper netns
- Register the nftables chains in the network namespace that they need to run in. - Remove the hacks that stopped chains running in the wrong network namespace. Signed-off-by: "Eric W. Biederman" --- net/netfilter/nf_tables_api.c | 6 -- net/netfilter/nf_tables_core.c | 5 - 2 files changed, 4 insertions(+), 7 deletions(-) diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index cfe636808541..12e65ec46bd4 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -130,20 +130,22 @@ static void nft_trans_destroy(struct nft_trans *trans) int nft_register_basechain(struct nft_base_chain *basechain, unsigned int hook_nops) { + struct net *net = read_pnet(&basechain->pnet); if (basechain->flags & NFT_BASECHAIN_DISABLED) return 0; - return nf_register_hooks(basechain->ops, hook_nops); + return nf_register_net_hooks(net, basechain->ops, hook_nops); } EXPORT_SYMBOL_GPL(nft_register_basechain); void nft_unregister_basechain(struct nft_base_chain *basechain, unsigned int hook_nops) { + struct net *net = read_pnet(&basechain->pnet); if (basechain->flags & NFT_BASECHAIN_DISABLED) return; - nf_unregister_hooks(basechain->ops, hook_nops); + nf_unregister_net_hooks(net, basechain->ops, hook_nops); } EXPORT_SYMBOL_GPL(nft_unregister_basechain); diff --git a/net/netfilter/nf_tables_core.c b/net/netfilter/nf_tables_core.c index f77bad46ac68..05d0b03530f6 100644 --- a/net/netfilter/nf_tables_core.c +++ b/net/netfilter/nf_tables_core.c @@ -114,7 +114,6 @@ unsigned int nft_do_chain(struct nft_pktinfo *pkt, const struct nf_hook_ops *ops) { const struct nft_chain *chain = ops->priv, *basechain = chain; - const struct net *chain_net = read_pnet(&nft_base_chain(basechain)->pnet); const struct net *net = dev_net(pkt->in ? pkt->in : pkt->out); const struct nft_rule *rule; const struct nft_expr *expr, *last; @@ -125,10 +124,6 @@ nft_do_chain(struct nft_pktinfo *pkt, const struct nf_hook_ops *ops) int rulenum; unsigned int gencursor = nft_genmask_cur(net); - /* Ignore chains that are not for the current network namespace */ - if (!net_eq(net, chain_net)) - return NF_ACCEPT; - do_chain: rulenum = 0; rule = list_entry(&chain->rules, struct nft_rule, list); -- 2.2.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -next 5/6] netfilter: Per network namespace netfilter hooks.
- Add a new set of functions for registering and unregistering per network namespace hooks. - Modify the old global namespace hook functions to use the per network namespace hooks in their implementation, so their remains a single list that needs to be walked for any hook (this is important for keeping the hook priority working and for keeping the code walking the hooks simple). - Only allow registering the per netdevice hooks in the network namespace where the network device lives. - Dynamically allocate the structures in the per network namespace hook list in nf_register_net_hook, and unregister them in nf_unregister_net_hook. Dynamic allocate is required somewhere as the number of network namespaces are not fixed so we might as well allocate them in the registration function. The chain of registered hooks on any list is expected to be small so the cost of walking that list to find the entry we are unregistering should also be small. Performing the management of the dynamically allocated list entries in the registration and unregistration functions keeps the complexity from spreading. Signed-off-by: "Eric W. Biederman" --- include/linux/netfilter.h | 15 +++- include/net/netns/netfilter.h | 1 + net/netfilter/core.c | 184 +- 3 files changed, 175 insertions(+), 25 deletions(-) diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h index d7826e59779e..4903e392a6ac 100644 --- a/include/linux/netfilter.h +++ b/include/linux/netfilter.h @@ -11,6 +11,8 @@ #include #include #include +#include +#include #ifdef CONFIG_NETFILTER static inline int NF_DROP_GETERR(int verdict) @@ -118,6 +120,13 @@ struct nf_sockopt_ops { }; /* Function to register/unregister hook points. */ +int nf_register_net_hook(struct net *net, const struct nf_hook_ops *ops); +void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *ops); +int nf_register_net_hooks(struct net *net, const struct nf_hook_ops *reg, + unsigned int n); +void nf_unregister_net_hooks(struct net *net, const struct nf_hook_ops *reg, +unsigned int n); + int nf_register_hook(struct nf_hook_ops *reg); void nf_unregister_hook(struct nf_hook_ops *reg); int nf_register_hooks(struct nf_hook_ops *reg, unsigned int n); @@ -128,8 +137,6 @@ void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n); int nf_register_sockopt(struct nf_sockopt_ops *reg); void nf_unregister_sockopt(struct nf_sockopt_ops *reg); -extern struct list_head nf_hooks[NFPROTO_NUMPROTO][NF_MAX_HOOKS]; - #ifdef HAVE_JUMP_LABEL extern struct static_key nf_hooks_needed[NFPROTO_NUMPROTO][NF_MAX_HOOKS]; @@ -167,10 +174,10 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook, int (*okfn)(struct sock *, struct sk_buff *), int thresh) { - struct list_head *nf_hook_list = &nf_hooks[pf][hook]; + struct net *net = dev_net(indev?indev:outdev); + struct list_head *nf_hook_list = &net->nf.hooks[pf][hook]; if (nf_hook_list_active(nf_hook_list, pf, hook)) { struct nf_hook_state state; - nf_hook_state_init(&state, nf_hook_list, hook, thresh, pf, indev, outdev, sk, okfn); return nf_hook_slow(skb, &state); diff --git a/include/net/netns/netfilter.h b/include/net/netns/netfilter.h index 532e4ba64f49..38aa4983e2a9 100644 --- a/include/net/netns/netfilter.h +++ b/include/net/netns/netfilter.h @@ -14,5 +14,6 @@ struct netns_nf { #ifdef CONFIG_SYSCTL struct ctl_table_header *nf_log_dir_header; #endif + struct list_head hooks[NFPROTO_NUMPROTO][NF_MAX_HOOKS]; }; #endif diff --git a/net/netfilter/core.c b/net/netfilter/core.c index 8bbcdfcdbfbd..320ae65dc09a 100644 --- a/net/netfilter/core.c +++ b/net/netfilter/core.c @@ -52,9 +52,6 @@ void nf_unregister_afinfo(const struct nf_afinfo *afinfo) } EXPORT_SYMBOL_GPL(nf_unregister_afinfo); -struct list_head nf_hooks[NFPROTO_NUMPROTO][NF_MAX_HOOKS] __read_mostly; -EXPORT_SYMBOL(nf_hooks); - #ifdef HAVE_JUMP_LABEL struct static_key nf_hooks_needed[NFPROTO_NUMPROTO][NF_MAX_HOOKS]; EXPORT_SYMBOL(nf_hooks_needed); @@ -62,27 +59,40 @@ EXPORT_SYMBOL(nf_hooks_needed); static DEFINE_MUTEX(nf_hook_mutex); -static struct list_head *find_nf_hook_list(const struct nf_hook_ops *reg) +static struct list_head *find_nf_hook_list(struct net *net, + const struct nf_hook_ops *reg) { struct list_head *nf_hook_list = NULL; if (reg->pf != NFPROTO_NETDEV) - nf_hook_list = &nf_hooks[reg->pf][reg->hooknum]; + nf_hook_list = &net->nf.hooks[reg->pf][reg->hooknum]; else if (reg->hooknum == NF_NETDEV_INGRESS) { #ifdef CONFIG_NETFILTER_INGRESS - if (reg->dev) + if (reg->de
[PATCH net v2] net: bcmgenet: fix accounting of packet drops vs errors
bcmgenet driver needs to separate packet drops from packet errors. When the driver has to drop a *good* packet, due to lack of buffers or replacement skbs, increment only dev->stats.[rx|tx]_dropped. When the driver encounters a bad Rx packet or Tx error, increment only dev->stats.[rx|tx]_errors + relevant detailed error counter. Signed-off-by: Petri Gynther --- drivers/net/ethernet/broadcom/genet/bcmgenet.c | 4 1 file changed, 4 deletions(-) diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c index b43b2cb..64c1e9d 100644 --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c @@ -1230,7 +1230,6 @@ static struct sk_buff *bcmgenet_put_tx_csum(struct net_device *dev, new_skb = skb_realloc_headroom(skb, sizeof(*status)); dev_kfree_skb(skb); if (!new_skb) { - dev->stats.tx_errors++; dev->stats.tx_dropped++; return NULL; } @@ -1465,7 +1464,6 @@ static unsigned int bcmgenet_desc_rx(struct bcmgenet_rx_ring *ring, if (unlikely(!skb)) { dev->stats.rx_dropped++; - dev->stats.rx_errors++; goto next; } @@ -1493,7 +1491,6 @@ static unsigned int bcmgenet_desc_rx(struct bcmgenet_rx_ring *ring, if (unlikely(!(dma_flag & DMA_EOP) || !(dma_flag & DMA_SOP))) { netif_err(priv, rx_status, dev, "dropping fragmented packet!\n"); - dev->stats.rx_dropped++; dev->stats.rx_errors++; dev_kfree_skb_any(skb); goto next; @@ -1515,7 +1512,6 @@ static unsigned int bcmgenet_desc_rx(struct bcmgenet_rx_ring *ring, dev->stats.rx_frame_errors++; if (dma_flag & DMA_RX_LG) dev->stats.rx_length_errors++; - dev->stats.rx_dropped++; dev->stats.rx_errors++; dev_kfree_skb_any(skb); goto next; -- 2.4.3.573.g4eafbef -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -next 4/6] netfilter: Factor out the hook list selection from nf_register_hook
- Add a new function find_nf_hook_list to select the nf_hook_list - Fail nf_register_hook when asked for a per netdevice hook list when support for per netdevice hook lists is not built into the kernel. - Move the hook list head selection outside of nf_hook_mutex as nothing in the selection requires the hook list, and error handling is simpler if a mutex is not held. Signed-off-by: "Eric W. Biederman" --- net/netfilter/core.c | 32 ++-- 1 file changed, 18 insertions(+), 14 deletions(-) diff --git a/net/netfilter/core.c b/net/netfilter/core.c index 7e3bc12f1c62..8bbcdfcdbfbd 100644 --- a/net/netfilter/core.c +++ b/net/netfilter/core.c @@ -62,27 +62,31 @@ EXPORT_SYMBOL(nf_hooks_needed); static DEFINE_MUTEX(nf_hook_mutex); -int nf_register_hook(struct nf_hook_ops *reg) +static struct list_head *find_nf_hook_list(const struct nf_hook_ops *reg) { - struct list_head *nf_hook_list; - struct nf_hook_ops *elem; + struct list_head *nf_hook_list = NULL; - mutex_lock(&nf_hook_mutex); - switch (reg->pf) { - case NFPROTO_NETDEV: + if (reg->pf != NFPROTO_NETDEV) + nf_hook_list = &nf_hooks[reg->pf][reg->hooknum]; + else if (reg->hooknum == NF_NETDEV_INGRESS) { #ifdef CONFIG_NETFILTER_INGRESS - if (reg->hooknum == NF_NETDEV_INGRESS) { - BUG_ON(reg->dev == NULL); + if (reg->dev) nf_hook_list = ®->dev->nf_hooks_ingress; - break; - } #endif - /* Fall through. */ - default: - nf_hook_list = &nf_hooks[reg->pf][reg->hooknum]; - break; } + return nf_hook_list; +} + +int nf_register_hook(struct nf_hook_ops *reg) +{ + struct list_head *nf_hook_list; + struct nf_hook_ops *elem; + nf_hook_list = find_nf_hook_list(reg); + if (!nf_hook_list) + return -ENOENT; + + mutex_lock(&nf_hook_mutex); list_for_each_entry(elem, nf_hook_list, list) { if (reg->priority < elem->priority) break; -- 2.2.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -next 3/6] netfilter: Simply the tests for enabling and disabling the ingress queue hook
Replace an overcomplicated switch statement with a simple if statement. This also removes the ingress queue enable outside of nf_hook_mutex as the protection provided by the mutex is not necessary and the code is clearer having both of the static key increments together. Signed-off-by: "Eric W. Biederman" --- net/netfilter/core.c | 17 ++--- 1 file changed, 6 insertions(+), 11 deletions(-) diff --git a/net/netfilter/core.c b/net/netfilter/core.c index a0e54974e2c9..7e3bc12f1c62 100644 --- a/net/netfilter/core.c +++ b/net/netfilter/core.c @@ -74,7 +74,6 @@ int nf_register_hook(struct nf_hook_ops *reg) if (reg->hooknum == NF_NETDEV_INGRESS) { BUG_ON(reg->dev == NULL); nf_hook_list = ®->dev->nf_hooks_ingress; - net_inc_ingress_queue(); break; } #endif @@ -90,6 +89,10 @@ int nf_register_hook(struct nf_hook_ops *reg) } list_add_rcu(®->list, elem->list.prev); mutex_unlock(&nf_hook_mutex); +#ifdef CONFIG_NETFILTER_INGRESS + if ((reg->pf == NFPROTO_NETDEV) && (reg->hooknum == NF_NETDEV_INGRESS)) + net_inc_ingress_queue(); +#endif #ifdef HAVE_JUMP_LABEL static_key_slow_inc(&nf_hooks_needed[reg->pf][reg->hooknum]); #endif @@ -102,18 +105,10 @@ void nf_unregister_hook(struct nf_hook_ops *reg) mutex_lock(&nf_hook_mutex); list_del_rcu(®->list); mutex_unlock(&nf_hook_mutex); - switch (reg->pf) { - case NFPROTO_NETDEV: #ifdef CONFIG_NETFILTER_INGRESS - if (reg->hooknum == NF_NETDEV_INGRESS) { - net_dec_ingress_queue(); - break; - } - break; + if ((reg->pf == NFPROTO_NETDEV) && (reg->hooknum == NF_NETDEV_INGRESS)) + net_dec_ingress_queue(); #endif - default: - break; - } #ifdef HAVE_JUMP_LABEL static_key_slow_dec(&nf_hooks_needed[reg->pf][reg->hooknum]); #endif -- 2.2.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -next 2/6] netfilter: kill nf_hooks_active
The function obscures what is going on in nf_hook_thresh and it's existence requires computing the hook list twice. Signed-off-by: "Eric W. Biederman" --- include/linux/netfilter.h | 10 +++--- 1 file changed, 3 insertions(+), 7 deletions(-) diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h index 00050dfd9f23..d7826e59779e 100644 --- a/include/linux/netfilter.h +++ b/include/linux/netfilter.h @@ -150,11 +150,6 @@ static inline bool nf_hook_list_active(struct list_head *nf_hook_list, } #endif -static inline bool nf_hooks_active(u_int8_t pf, unsigned int hook) -{ - return nf_hook_list_active(&nf_hooks[pf][hook], pf, hook); -} - int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state); /** @@ -172,10 +167,11 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook, int (*okfn)(struct sock *, struct sk_buff *), int thresh) { - if (nf_hooks_active(pf, hook)) { + struct list_head *nf_hook_list = &nf_hooks[pf][hook]; + if (nf_hook_list_active(nf_hook_list, pf, hook)) { struct nf_hook_state state; - nf_hook_state_init(&state, &nf_hooks[pf][hook], hook, thresh, + nf_hook_state_init(&state, nf_hook_list, hook, thresh, pf, indev, outdev, sk, okfn); return nf_hook_slow(skb, &state); } -- 2.2.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -next 1/6] netfilter: nf_queue: Don't recompute the hook_list head
If someone sends packets from one of the netdevice ingress hooks to the a userspace queue, and then userspace later accepts the packet, the netfilter code can enter an infinite loop as the list head will never be found. Pass in the saved list_head to avoid this. Signed-off-by: "Eric W. Biederman" --- net/netfilter/nf_queue.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/netfilter/nf_queue.c b/net/netfilter/nf_queue.c index cd60d397fe05..8a8b2abc35ff 100644 --- a/net/netfilter/nf_queue.c +++ b/net/netfilter/nf_queue.c @@ -213,7 +213,7 @@ void nf_reinject(struct nf_queue_entry *entry, unsigned int verdict) if (verdict == NF_ACCEPT) { next_hook: - verdict = nf_iterate(&nf_hooks[entry->state.pf][entry->state.hook], + verdict = nf_iterate(entry->state.hook_list, skb, &entry->state, &elem); } -- 2.2.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -next 0/6] Per network namespace netfilter chains
By maintining a set of functions to register and unregister netfilter hooks both globally and per network namespace I have managed to write a compact patchset that maintain per network netfilter chains, and registers the nftables netfilter hooks per network namespace. There are lots of other possible and desirable cleanups but this one is a core change needed to make the other changes independent small changes. Eric W. Biederman (6): netfilter: nf_queue: Don't recompute the hook_list head netfilter: kill nf_hooks_active netfilter: Simply the tests for enabling and disabling the ingress queue hook netfilter: Factor out the hook list selection from nf_register_hook netfilter: Per network namespace netfilter hooks. netfilter: nftables: Only run the nftables chains in the proper netns include/linux/netfilter.h | 23 +++-- include/net/netns/netfilter.h | 1 + net/netfilter/core.c | 221 + net/netfilter/nf_queue.c | 2 +- net/netfilter/nf_tables_api.c | 6 +- net/netfilter/nf_tables_core.c | 5 - 6 files changed, 200 insertions(+), 58 deletions(-) Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] nohz: prevent tilegx network driver interrupts
On Fri, Jul 10, 2015 at 07:06:23PM -0400, Chris Metcalf wrote: > On 7/10/2015 6:45 PM, Josh Cartwright wrote: > >>+static inline const struct cpumask *housekeeping_cpumask(void) > >>>+{ > >>>+#ifdef CONFIG_NO_HZ_FULL > >>>+ if (tick_nohz_full_enabled()) > >>>+ return housekeeping_mask; > >>>+#endif > >Just a small comment: > > > >We can take these checks out from under a #ifdef CONFIG_NO_HZ_FULL > >check, given that are stubbed tick_nohz_full_enabled() defined above. > > True for the "if" clause, but the "housekeeping_mask" variable is only defined > under CONFIG_NO_HZ_FULL. Indeed! I should have read more carefully. Sorry for the noise. Josh signature.asc Description: PGP signature
Re: [PATCH v2] nohz: prevent tilegx network driver interrupts
On 7/10/2015 6:45 PM, Josh Cartwright wrote: +static inline const struct cpumask *housekeeping_cpumask(void) >+{ >+#ifdef CONFIG_NO_HZ_FULL >+ if (tick_nohz_full_enabled()) >+ return housekeeping_mask; >+#endif Just a small comment: We can take these checks out from under a #ifdef CONFIG_NO_HZ_FULL check, given that are stubbed tick_nohz_full_enabled() defined above. True for the "if" clause, but the "housekeeping_mask" variable is only defined under CONFIG_NO_HZ_FULL. -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] ixgbe: Remove bimodal SR-IOV disabling
> -Original Message- > From: Alex Williamson [mailto:alex.william...@redhat.com] > Sent: Friday, July 10, 2015 3:44 PM > To: Rose, Gregory V > Cc: intel-wired-...@lists.osuosl.org; Kirsher, Jeffrey T; > netdev@vger.kernel.org; linux-ker...@vger.kernel.org > Subject: Re: [PATCH] ixgbe: Remove bimodal SR-IOV disabling > > On Fri, 2015-07-10 at 21:36 +, Rose, Gregory V wrote: > > > > > -Original Message- > > > From: Alex Williamson [mailto:alex.william...@redhat.com] > > > Sent: Friday, July 10, 2015 2:32 PM > > > To: intel-wired-...@lists.osuosl.org; Kirsher, Jeffrey T > > > Cc: netdev@vger.kernel.org; linux-ker...@vger.kernel.org; Rose, > > > Gregory V > > > Subject: [PATCH] ixgbe: Remove bimodal SR-IOV disabling > > > > > > When unbinding an SR-IOV device with VFs configured from ixgbe, the > > > driver behaves in one of two ways. If max_vfs was specified, the > > > SR-IOV state is disabled, removing the VFs. The occurs regardless > > > of whether the VF count was later modified through sysfs. If > > > however max_vfs is zero, such as by not specifying the module > > > parameter, the VFs persist after the PF is unbound from ixgbe. If > > > the PF is then bound to vfio-pci to be assigned to a VM, the PF is > non-functional. > > > > > > >From the comment, commit da36b64736cf ("ixgbe: Implement PCI SR-IOV > > > sysfs callback operation") clearly intended this alternate behavior, > > > but probably didn't realize the PF doesn't work in this mode. > > > > > > This bimodal behavior is confusing to users and results in a state > > > where the PF is broken for other uses unless the user sets > > > sriov_numvfs to zero prior to unbinding the device. Remove this > > > behavior so that VFs are removed and the PF is functional for other > > > uses after unbind, regardless of the way VFs are enabled. > > > > > > Signed-off-by: Alex Williamson > > > Cc: Greg Rose > > > Cc: Jeff Kirsher > > > --- > > > > > > I can only think that not disabling SR-IOV was meant to enable some > > > sort of persistence for VFs, but that's probably better accomplished > > > with either udev rules and/or modprobe.d install scripts. > > > > > > drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |7 +-- > > > 1 file changed, 1 insertion(+), 6 deletions(-) > > > > > > diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c > > > b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c > > > index 5be12a0..de04e3e 100644 > > > --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c > > > +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c > > > @@ -8810,12 +8810,7 @@ static void ixgbe_remove(struct pci_dev *pdev) > > > unregister_netdev(netdev); > > > > > > #ifdef CONFIG_PCI_IOV > > > - /* > > > - * Only disable SR-IOV on unload if the user specified the now > > > - * deprecated max_vfs module parameter. > > > - */ > > > - if (max_vfs) > > > - ixgbe_disable_sriov(adapter); > > > + ixgbe_disable_sriov(adapter); > > > #endif > > > ixgbe_clear_interrupt_scheme(adapter); > > > > > > > Please remove max_vfs module parameter - it is deprecated and should be > removed from upstream builds. Dave let us get away with a kernel module a > few years ago because the other necessary infrastructure to enable SR-IOV > virtual functions via the PCIe interface was not available. Now that it's > there it should be removed and vendors/end users should be forced to move > away from this. > > I can't really say I'm in favor of removing that option. It's probably > going to break a lot of people because doing the udev rules right is hard. > The sysfs sriov interface has been tossed over the wall as the right way > to do things, but there's really no infrastructure to facilitate even the > simple peanut butter, everybody gets the same number of VFs, interface > that max_vfs provides. I think the existence of this bug is probably a > good indication that the sysfs interface has not really been adopted yet. > Thanks, Alright, I'll go with that reasoning. Acked-by: Greg Rose > > Alex N�r��yb�X��ǧv�^�){.n�+���z�^�)w*jg����ݢj/���z�ޖ��2�ޙ&�)ߡ�a�����G���h��j:+v���w��٥
Re: [PATCH] vhost-blk: Add vhost-blk support v6
On Sat, Dec 1, 2012 at 5:33 PM, Asias He wrote: > vhost-blk is an in-kernel virito-blk device accelerator. > > Due to lack of proper in-kernel AIO interface, this version converts > guest's I/O request to bio and use submit_bio() to submit I/O directly. > So this version any supports raw block device as guest's disk image, > e.g. /dev/sda, /dev/ram0. We can add file based image support to > vhost-blk once we have in-kernel AIO interface. There are some work in > progress for in-kernel AIO interface from Dave Kleikamp and Zach Brown: > >http://marc.info/?l=linux-fsdevel&m=133312234313122 > > Performance evaluation: > - > LKVM: Fio with libaio ioengine on 1 Fusion IO device > IOPS(k)BeforeAfterImprovement > seq-read 107 121 +13.0% > seq-write 130 179 +37.6% > rnd-read 102 122 +19.6% > rnd-write 125 159 +27.0% > > QEMU: Fio with libaio ioengine on 1 Fusion IO device > IOPS(k)BeforeAfterImprovement > seq-read 76123 +61.8% > seq-write 139 173 +24.4% > rnd-read 73120 +64.3% > rnd-write 75156 +108.0% > > QEMU: Fio with libaio ioengine on 1 Ramdisk device > IOPS(k)BeforeAfterImprovement > seq-read 138 437 +216% > seq-write 191 436 +128% > rnd-read 137 426 +210% > rnd-write 140 415 +196% > > QEMU: Fio with libaio ioengine on 8 Ramdisk device > 50% read + 50% write > IOPS(k)BeforeAfterImprovement > randrw 64/64 189/189 +195%/+195% > > Userspace bits: > - > 1) LKVM > The latest vhost-blk userspace bits for kvm tool can be found here: > g...@github.com:asias/linux-kvm.git blk.vhost-blk > > 2) QEMU > The latest vhost-blk userspace prototype for QEMU can be found here: > g...@github.com:asias/qemu.git blk.vhost-blk > > Changes in v6: > - Use inline req_page_list to reduce kmalloc > - Switch to single thread model, thanks mst! > - Wait until requests fired before vhost_blk_flush to be finished > > Changes in v5: > - Do not assume the buffer layout > - Fix wakeup race > > Changes in v4: > - Mark req->status as userspace pointer > - Use __copy_to_user() instead of copy_to_user() in vhost_blk_set_status() > - Add if (need_resched()) schedule() in blk thread > - Kill vhost_blk_stop_vq() and move it into vhost_blk_stop() > - Use vq_err() instead of pr_warn() > - Fail un Unsupported request > - Add flush in vhost_blk_set_features() > > Changes in v3: > - Sending REQ_FLUSH bio instead of vfs_fsync, thanks Christoph! > - Check file passed by user is a raw block device file > > Acked-by: David S. Miller > Signed-off-by: Asias He Hi Asias, Is this still under development or stopped? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] nohz: prevent tilegx network driver interrupts
On Fri, Jul 10, 2015 at 03:37:25PM -0400, Chris Metcalf wrote: > Normally the tilegx networking shim sends irqs to all the cores > to distribute the load of processing incoming-packet interrupts, > so that you can get to multiple Gb's of traffic inbound. > > However, in nohz_full mode we don't want to interrupt the > nohz_full cores by default, so we limit the set of cores we use > to only the online housekeeping cores. > > To make client code easier to read, we introduce a new nohz_full > accessor, housekeeping_cpumask(), which returns a pointer to the > housekeeping_mask if nohz_full is enabled, and otherwise returns > the cpu_possible_mask. > > Signed-off-by: Chris Metcalf > --- [..] > +static inline const struct cpumask *housekeeping_cpumask(void) > +{ > +#ifdef CONFIG_NO_HZ_FULL > + if (tick_nohz_full_enabled()) > + return housekeeping_mask; > +#endif Just a small comment: We can take these checks out from under a #ifdef CONFIG_NO_HZ_FULL check, given that are stubbed tick_nohz_full_enabled() defined above. Josh signature.asc Description: PGP signature
Re: [PATCH] ixgbe: Remove bimodal SR-IOV disabling
On Fri, 2015-07-10 at 21:36 +, Rose, Gregory V wrote: > > > -Original Message- > > From: Alex Williamson [mailto:alex.william...@redhat.com] > > Sent: Friday, July 10, 2015 2:32 PM > > To: intel-wired-...@lists.osuosl.org; Kirsher, Jeffrey T > > Cc: netdev@vger.kernel.org; linux-ker...@vger.kernel.org; Rose, Gregory V > > Subject: [PATCH] ixgbe: Remove bimodal SR-IOV disabling > > > > When unbinding an SR-IOV device with VFs configured from ixgbe, the driver > > behaves in one of two ways. If max_vfs was specified, the SR-IOV state is > > disabled, removing the VFs. The occurs regardless of whether the VF count > > was later modified through sysfs. If however max_vfs is zero, such as by > > not specifying the module parameter, the VFs persist after the PF is > > unbound from ixgbe. If the PF is then bound to vfio-pci to be assigned to > > a VM, the PF is non-functional. > > > > >From the comment, commit da36b64736cf ("ixgbe: Implement PCI SR-IOV > > sysfs callback operation") clearly intended this alternate behavior, but > > probably didn't realize the PF doesn't work in this mode. > > > > This bimodal behavior is confusing to users and results in a state where > > the PF is broken for other uses unless the user sets sriov_numvfs to zero > > prior to unbinding the device. Remove this behavior so that VFs are > > removed and the PF is functional for other uses after unbind, regardless > > of the way VFs are enabled. > > > > Signed-off-by: Alex Williamson > > Cc: Greg Rose > > Cc: Jeff Kirsher > > --- > > > > I can only think that not disabling SR-IOV was meant to enable some sort > > of persistence for VFs, but that's probably better accomplished with > > either udev rules and/or modprobe.d install scripts. > > > > drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |7 +-- > > 1 file changed, 1 insertion(+), 6 deletions(-) > > > > diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c > > b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c > > index 5be12a0..de04e3e 100644 > > --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c > > +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c > > @@ -8810,12 +8810,7 @@ static void ixgbe_remove(struct pci_dev *pdev) > > unregister_netdev(netdev); > > > > #ifdef CONFIG_PCI_IOV > > - /* > > -* Only disable SR-IOV on unload if the user specified the now > > -* deprecated max_vfs module parameter. > > -*/ > > - if (max_vfs) > > - ixgbe_disable_sriov(adapter); > > + ixgbe_disable_sriov(adapter); > > #endif > > ixgbe_clear_interrupt_scheme(adapter); > > > > Please remove max_vfs module parameter - it is deprecated and should be > removed from upstream builds. Dave let us get away with a kernel module a > few years ago because the other necessary infrastructure to enable SR-IOV > virtual functions via the PCIe interface was not available. Now that it's > there it should be removed and vendors/end users should be forced to move > away from this. I can't really say I'm in favor of removing that option. It's probably going to break a lot of people because doing the udev rules right is hard. The sysfs sriov interface has been tossed over the wall as the right way to do things, but there's really no infrastructure to facilitate even the simple peanut butter, everybody gets the same number of VFs, interface that max_vfs provides. I think the existence of this bug is probably a good indication that the sysfs interface has not really been adopted yet. Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] ixgbe: Remove bimodal SR-IOV disabling
> -Original Message- > From: Alex Williamson [mailto:alex.william...@redhat.com] > Sent: Friday, July 10, 2015 2:32 PM > To: intel-wired-...@lists.osuosl.org; Kirsher, Jeffrey T > Cc: netdev@vger.kernel.org; linux-ker...@vger.kernel.org; Rose, Gregory V > Subject: [PATCH] ixgbe: Remove bimodal SR-IOV disabling > > When unbinding an SR-IOV device with VFs configured from ixgbe, the driver > behaves in one of two ways. If max_vfs was specified, the SR-IOV state is > disabled, removing the VFs. The occurs regardless of whether the VF count > was later modified through sysfs. If however max_vfs is zero, such as by > not specifying the module parameter, the VFs persist after the PF is > unbound from ixgbe. If the PF is then bound to vfio-pci to be assigned to > a VM, the PF is non-functional. > > >From the comment, commit da36b64736cf ("ixgbe: Implement PCI SR-IOV > sysfs callback operation") clearly intended this alternate behavior, but > probably didn't realize the PF doesn't work in this mode. > > This bimodal behavior is confusing to users and results in a state where > the PF is broken for other uses unless the user sets sriov_numvfs to zero > prior to unbinding the device. Remove this behavior so that VFs are > removed and the PF is functional for other uses after unbind, regardless > of the way VFs are enabled. > > Signed-off-by: Alex Williamson > Cc: Greg Rose > Cc: Jeff Kirsher > --- > > I can only think that not disabling SR-IOV was meant to enable some sort > of persistence for VFs, but that's probably better accomplished with > either udev rules and/or modprobe.d install scripts. > > drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |7 +-- > 1 file changed, 1 insertion(+), 6 deletions(-) > > diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c > b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c > index 5be12a0..de04e3e 100644 > --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c > +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c > @@ -8810,12 +8810,7 @@ static void ixgbe_remove(struct pci_dev *pdev) > unregister_netdev(netdev); > > #ifdef CONFIG_PCI_IOV > - /* > - * Only disable SR-IOV on unload if the user specified the now > - * deprecated max_vfs module parameter. > - */ > - if (max_vfs) > - ixgbe_disable_sriov(adapter); > + ixgbe_disable_sriov(adapter); > #endif > ixgbe_clear_interrupt_scheme(adapter); > Please remove max_vfs module parameter - it is deprecated and should be removed from upstream builds. Dave let us get away with a kernel module a few years ago because the other necessary infrastructure to enable SR-IOV virtual functions via the PCIe interface was not available. Now that it's there it should be removed and vendors/end users should be forced to move away from this. Thanks, - Greg Rose Intel Corp. Networking Division
[PATCH] ixgbe: Remove bimodal SR-IOV disabling
When unbinding an SR-IOV device with VFs configured from ixgbe, the driver behaves in one of two ways. If max_vfs was specified, the SR-IOV state is disabled, removing the VFs. The occurs regardless of whether the VF count was later modified through sysfs. If however max_vfs is zero, such as by not specifying the module parameter, the VFs persist after the PF is unbound from ixgbe. If the PF is then bound to vfio-pci to be assigned to a VM, the PF is non-functional. >From the comment, commit da36b64736cf ("ixgbe: Implement PCI SR-IOV sysfs callback operation") clearly intended this alternate behavior, but probably didn't realize the PF doesn't work in this mode. This bimodal behavior is confusing to users and results in a state where the PF is broken for other uses unless the user sets sriov_numvfs to zero prior to unbinding the device. Remove this behavior so that VFs are removed and the PF is functional for other uses after unbind, regardless of the way VFs are enabled. Signed-off-by: Alex Williamson Cc: Greg Rose Cc: Jeff Kirsher --- I can only think that not disabling SR-IOV was meant to enable some sort of persistence for VFs, but that's probably better accomplished with either udev rules and/or modprobe.d install scripts. drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 5be12a0..de04e3e 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -8810,12 +8810,7 @@ static void ixgbe_remove(struct pci_dev *pdev) unregister_netdev(netdev); #ifdef CONFIG_PCI_IOV - /* -* Only disable SR-IOV on unload if the user specified the now -* deprecated max_vfs module parameter. -*/ - if (max_vfs) - ixgbe_disable_sriov(adapter); + ixgbe_disable_sriov(adapter); #endif ixgbe_clear_interrupt_scheme(adapter); -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 02/11] soc/fsl: Introduce DPAA BMan device management driver
On Fri, 2015-07-10 at 15:57 -0500, Pledge Roy-R01356 wrote: > > > > On Fri, 2015-07-10 at 13:36 +0200, Paul Bolle wrote: > > > On do, 2015-07-09 at 16:21 -0400, Roy Pledge wrote: > > > > +#ifdef CONFIG_FSL_DPA_CHECKING > > > > +#define DPA_ASSERT(x) \ > > > > + do { \ > > > > + if (!(x)) { \ > > > > + pr_crit("ASSERT: (%s:%d) %s\n", __FILE__, > > > > __LINE__, \ > > > > + __stringify_1(x)); \ > > > > + dump_stack(); \ > > > > + panic("assertion failure"); \ > > > > > > Not my call, but why panic() here? > > > > I'm pretty sure I've complained about this before (as well as all the > > BUG_ONs). > > > Is the concern here just the call to panic()? I'm happy to change what > happens when an issue is detected but the DPA_ASSERT() calls are very > useful when testing changes to the driver and when bringing up the drivers > on new silicon variants. Use WARN_ON() or a variant thereof. -Scott -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] fixed_phy: handle link-down case
10.07.2015 23:44, Florian Fainelli пишет: On 10/07/15 09:41, Stas Sergeev wrote: Currently fixed_phy driver recognizes only the link-up state. This simple patch adds an implementation of link-down state. The actual change is 1-line, the rest is an indentation. It is not clear to me how this is useful, if you have a link_update callback manipulating the link state, the fixed PHY driver returns appropriate MII_BMSR values and always re-initializes everything. It returns the appropriate values only for link status (when its down), but it still returns speed, duplex etc as if the link is up. I had hard times finding the relevant specs, but from what I have googled, when link is down, the speed/duplex/etc status fields should _also_ be zero, which is what my patch does. What is more important is that fixed_phy_add() would return -EINVAL if you didn't specify speed while the link is down. This is an absolute must-fix, or I will have to add an arbitrary speed value again, on which you already objected. Is this meant to be some sort of optimization? If so, you could just avoid the re-intendation completely and do a goto instead? Oh, c'mon... Adding goto just to keep the _patch_ smaller? (not smaller code, just a smaller patch) Well, this is certainly something that can be done, feel free to request that explicitly and I'll release v3 next week. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] of_mdio: add new DT property 'autoneg' for fixed-link
10.07.2015 23:39, Florian Fainelli пишет: - in-band status is an implementation delail, and it is specific to a particular protocols. If you request the in-band status for some protocol that doesn't support it, perhaps you should get -EINVAL, because such a config makes no sense. With autonegotiation, the rules are not that strict: it can be "unimplemented", which doesn't necessary mean nonsense in the config. So by specifying "autoneg", you are not specific about the kind of auto-negotiation protocol available, which is precisely my point: you need to go down to that level of detail for this to be useful. So maybe something like: autoneg = "in-band-status" would actually be a better thing in terms of description because then you would tell what can be made available/working? I would agree with this if your argument below is true (see below). - autonegotiation is a wider term, and may be implemented by some other means than the in-band status (which is probably impossible for a fixed-link though). - In the terms that the driver uses, it is autonegotiation, eg MVNETA_GMAC_AUTONEG_CONFIG. And when you go down the implementation details, you see MVNETA_GMAC_INBAND_AN_ENABLE, which is just one AN bit of many. But arguably, there could be another auto-negotiation method, which is not in-band status related, which means that you would need a way to distinguish between using in-band status, or using something else or nothing, would not you? "something else" is a big question here. Can you think of _any_ other way that is both not an MDIO (suits to fixed-link) and not an in-band? If the answer is yes (even theoretically), then autoneg = "in-band" | "off" may make sense. Otherwise boolean just looks enough. If we would implement autoneg outside of the fixed-link, then its semantic would likely be autoneg = "mdio" | "in-band" | "off" But the fact that we put it under fixed-link where only a single AN possibility exist, may probably be underlined by a semantic specific to fixed-link. One may also argue that autoneg = "any-possible-autoneg-that-works" is better than specifying it explicitly, which is exactly what the boolean does. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 02/11] soc/fsl: Introduce DPAA BMan device management driver
> > On Fri, 2015-07-10 at 13:36 +0200, Paul Bolle wrote: > > On do, 2015-07-09 at 16:21 -0400, Roy Pledge wrote: > > > +#ifdef CONFIG_FSL_DPA_CHECKING > > > +#define DPA_ASSERT(x) \ > > > + do { \ > > > + if (!(x)) { \ > > > + pr_crit("ASSERT: (%s:%d) %s\n", __FILE__, __LINE__, \ > > > + __stringify_1(x)); \ > > > + dump_stack(); \ > > > + panic("assertion failure"); \ > > > > Not my call, but why panic() here? > > I'm pretty sure I've complained about this before (as well as all the > BUG_ONs). > Is the concern here just the call to panic()? I'm happy to change what happens when an issue is detected but the DPA_ASSERT() calls are very useful when testing changes to the driver and when bringing up the drivers on new silicon variants.
Re: [PATCH 1/3] fixed_phy: handle link-down case
On 10/07/15 09:41, Stas Sergeev wrote: > > Currently fixed_phy driver recognizes only the link-up state. > This simple patch adds an implementation of link-down state. > The actual change is 1-line, the rest is an indentation. It is not clear to me how this is useful, if you have a link_update callback manipulating the link state, the fixed PHY driver returns appropriate MII_BMSR values and always re-initializes everything. Is this meant to be some sort of optimization? If so, you could just avoid the re-intendation completely and do a goto instead? > > Signed-off-by: Stas Sergeev > > CC: Florian Fainelli > CC: netdev@vger.kernel.org > CC: linux-ker...@vger.kernel.org > --- > drivers/net/phy/fixed_phy.c | 99 > +++-- > 1 file changed, 50 insertions(+), 49 deletions(-) > > diff --git a/drivers/net/phy/fixed_phy.c b/drivers/net/phy/fixed_phy.c > index 1960b46..a5d93cf 100644 > --- a/drivers/net/phy/fixed_phy.c > +++ b/drivers/net/phy/fixed_phy.c > @@ -52,58 +52,59 @@ static int fixed_phy_update_regs(struct fixed_phy *fp) > u16 lpagb = 0; > u16 lpa = 0; > > - if (fp->status.duplex) { > - bmcr |= BMCR_FULLDPLX; > - > - switch (fp->status.speed) { > - case 1000: > - bmsr |= BMSR_ESTATEN; > - bmcr |= BMCR_SPEED1000; > - lpagb |= LPA_1000FULL; > - break; > - case 100: > - bmsr |= BMSR_100FULL; > - bmcr |= BMCR_SPEED100; > - lpa |= LPA_100FULL; > - break; > - case 10: > - bmsr |= BMSR_10FULL; > - lpa |= LPA_10FULL; > - break; > - default: > - pr_warn("fixed phy: unknown speed\n"); > - return -EINVAL; > - } > - } else { > - switch (fp->status.speed) { > - case 1000: > - bmsr |= BMSR_ESTATEN; > - bmcr |= BMCR_SPEED1000; > - lpagb |= LPA_1000HALF; > - break; > - case 100: > - bmsr |= BMSR_100HALF; > - bmcr |= BMCR_SPEED100; > - lpa |= LPA_100HALF; > - break; > - case 10: > - bmsr |= BMSR_10HALF; > - lpa |= LPA_10HALF; > - break; > - default: > - pr_warn("fixed phy: unknown speed\n"); > - return -EINVAL; > - } > - } > - > - if (fp->status.link) > + if (fp->status.link) { > bmsr |= BMSR_LSTATUS | BMSR_ANEGCOMPLETE; > > - if (fp->status.pause) > - lpa |= LPA_PAUSE_CAP; > + if (fp->status.duplex) { > + bmcr |= BMCR_FULLDPLX; > + > + switch (fp->status.speed) { > + case 1000: > + bmsr |= BMSR_ESTATEN; > + bmcr |= BMCR_SPEED1000; > + lpagb |= LPA_1000FULL; > + break; > + case 100: > + bmsr |= BMSR_100FULL; > + bmcr |= BMCR_SPEED100; > + lpa |= LPA_100FULL; > + break; > + case 10: > + bmsr |= BMSR_10FULL; > + lpa |= LPA_10FULL; > + break; > + default: > + pr_warn("fixed phy: unknown speed\n"); > + return -EINVAL; > + } > + } else { > + switch (fp->status.speed) { > + case 1000: > + bmsr |= BMSR_ESTATEN; > + bmcr |= BMCR_SPEED1000; > + lpagb |= LPA_1000HALF; > + break; > + case 100: > + bmsr |= BMSR_100HALF; > + bmcr |= BMCR_SPEED100; > + lpa |= LPA_100HALF; > + break; > + case 10: > + bmsr |= BMSR_10HALF; > + lpa |= LPA_10HALF; > + break; > + default: > + pr_warn("fixed phy: unknown speed\n"); > + return -EINVAL; > + } > + } > > - if (fp->status.asym_pause) > - lpa |= LPA_PAUSE_ASYM; > + if (fp->status.pause) > + lpa |= LPA_PAUSE_CAP; > + > + if (fp->status.asym_pause) > +
Re: [PATCH v2 0/2] net: enable inband link state negotiation only when explicitly requested
10.07.2015 23:31, Florian Fainelli пишет: On 10/07/15 09:38, Stas Sergeev wrote: Hello. Currently the link status auto-negotiation is enabled for any SGMII link with fixed-link DT binding. The regression was reported: https://lkml.org/lkml/2015/7/8/865 Apparently not all HW that implements SGMII protocol, generates the inband status for the auto-negotiation to work. More details here: https://lkml.org/lkml/2015/7/10/206 The following patches reverts to the old behavior by default, which is to not enable the auto-negotiation for fixed-link. The new DT property is added that allows to explicitly request the auto-negotiation. Those who were affected by the change, please send your Tested-by, Thanks! For future submissions, would you mind CC'ing everybody for the entire patch series and not just on a per-patch basis? I used get_maintainers.pl on a per-patch basis. This is what SubmittingPatches.txt seems to suggest. It doesn't say how to produce the CC lists for patch series, it seems... Anyway, I'll try to add more recipients to the announce e-mail next time. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] of_mdio: add new DT property 'autoneg' for fixed-link
On 10/07/15 13:08, Stas Sergeev wrote: > 10.07.2015 21:37, Florian Fainelli пишет: >> On 10/07/15 09:43, Stas Sergeev wrote: >>> Currently for fixed-link the MAC driver decides whether to use the >>> link status auto-negotiation or not. >>> Unfortunately the auto-negotiation may not work when expected by >>> the MAC driver. Sebastien Rannou explains: >>> << Yes, I confirm that my HW does not generate an in-band status. >>> AFAIK, it's >>> a PHY that aggregates 4xSGMIIs to 1xQSGMII ; the MAC side of the PHY >>> (with >>> inband status) is connected to the switch through QSGMII, and in this >>> context >>> we are on the media side of the PHY. >> >>> https://lkml.org/lkml/2015/7/10/206 >>> >>> This patch introduces the new boolean property 'autoneg' that allows >>> the user to request the auto-negotiation explicitly. >> The implementation looks better, but the name might still be slightly >> controversial. I would go with "use-in-band-status" which is more >> strictly defined than "autoneg" which could mean anything and everything. >> >> What do you think? > I actually think autoneg is a bit better. > > - Autonegotiation is a widely used and known term: > https://en.wikipedia.org/wiki/Autonegotiation > And who knows what in-band status is? You and I apparently do because otherwise you would not have ran into this problem and more generally, anyone having some mild exposure to the (S|R)GMII protocols should at some point, but that is a pointless argument. > And, more importantly, who knows what is it used for? > Who even knows it is used for autonegotiation? It is not about what do people know most, it is about being accurate and specific. > > - When we set autoneg for fixed-link, we basically just > say "no MDIO here, but please do autoneg by any other > means, if possible". I agree with this. > > - in-band status is an implementation delail, and it is > specific to a particular protocols. If you request the > in-band status for some protocol that doesn't support > it, perhaps you should get -EINVAL, because such a > config makes no sense. With autonegotiation, the rules > are not that strict: it can be "unimplemented", which doesn't > necessary mean nonsense in the config. So by specifying "autoneg", you are not specific about the kind of auto-negotiation protocol available, which is precisely my point: you need to go down to that level of detail for this to be useful. So maybe something like: autoneg = "in-band-status" would actually be a better thing in terms of description because then you would tell what can be made available/working? > > - autonegotiation is a wider term, and may be implemented > by some other means than the in-band status (which is > probably impossible for a fixed-link though). > > - In the terms that the driver uses, it is autonegotiation, eg > MVNETA_GMAC_AUTONEG_CONFIG. And when you go down > the implementation details, you see MVNETA_GMAC_INBAND_AN_ENABLE, > which is just one AN bit of many. But arguably, there could be another auto-negotiation method, which is not in-band status related, which means that you would need a way to distinguish between using in-band status, or using something else or nothing, would not you? > > So I really would prefer to keep things as is. > But if you insist, I can rename, but there will still be no > -EINVAL checks for obviously wrong configs. -- Florian -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fighting out-of-order reception with RPS?
On 07/10/2015 04:48 AM, Tom Herbert wrote: > On Wed, Jul 8, 2015 at 10:55 PM, Oliver Hartkopp > wrote: >> Both drivers do not use NAPI. The just follow the way >> >> interrupt -> alloc_skb() -> fill skb -> netif_rx(skb) >> >> I'm usually testing with the USB adapters as the PCIe setup is not very >> handy. >> > Okay, I see what is happening. In netif_rx when RPS is not enabled > that packet is queued to the backlog queue for the local CPU. Since > you're doing round robin on the interrupts then OOO packets can be a > result. Unfortunately, this is the expected behavior. The correct > kernel fix would be to move to these drivers to use NAPI. Hm. Doesn't sound like a good solution when there's a difference between NAPI and non-NAPI drivers in matters of OOO, right? > RPS > eliminates the OOO, but if there is no ability to derive a flow hash > from packets everything will wind up one queue without load balancing. Correct. That's why I added skb_set_hash(skb, dev->ifindex, PKT_HASH_TYPE_L2); in my driver, because the only relevant flow identifiction is the number of the incoming CAN interface. > Besides that, automatically setting RPS in drivers is a difficult > proposition since there is no definitively "correct" way to do that in > an arbitrary configuration. What about checking in netif_rx() if the non-NAPI driver has set a hash (aka the driver is OOO sensitive)? And if so we could automatically set rps_cpus for this interface in a way that all CPUs are enabled to take skbs following the hash. Best regards, Oliver -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 0/2] net: enable inband link state negotiation only when explicitly requested
On 10/07/15 09:38, Stas Sergeev wrote: > Hello. > > Currently the link status auto-negotiation is enabled > for any SGMII link with fixed-link DT binding. > The regression was reported: > https://lkml.org/lkml/2015/7/8/865 > Apparently not all HW that implements SGMII protocol, generates the > inband status for the auto-negotiation to work. > More details here: > https://lkml.org/lkml/2015/7/10/206 > > The following patches reverts to the old behavior by default, > which is to not enable the auto-negotiation for fixed-link. > The new DT property is added that allows to explicitly request > the auto-negotiation. > > Those who were affected by the change, please send your Tested-by, > Thanks! For future submissions, would you mind CC'ing everybody for the entire patch series and not just on a per-patch basis? -- Florian -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Panic with demuxed ipv4 multicast udp sockets on 4.0.4
Hey everyone, There's some kind of nasty condition in which sk_rx_dst points to an apparently garbage datastructure and it's blowing up in the early demux code because dst->ops is NULL. The packet in question was for bit torrent local peer discovery https://en.wikipedia.org/wiki/Local_Peer_Discovery . We're seeing this on about a 1/200 chance of panic per day. crash> bt PID: 1899532 TASK: 88000826cf00 CPU: 9 COMMAND: "hhvm.node.1" #0 [88047fc23990] machine_kexec at 8103bf05 #1 [88047fc239e0] crash_kexec at 810cb4e8 #2 [88047fc23ab0] oops_end at 81006468 #3 [88047fc23ae0] no_context at 8167aac1 #4 [88047fc23b40] __bad_area_nosemaphore at 8167acb9 #5 [88047fc23b90] bad_area_nosemaphore at 8167aceb #6 [88047fc23ba0] __do_page_fault at 81044ac5 #7 [88047fc23c10] do_page_fault at 81044eec #8 [88047fc23c20] page_fault at 81686c02 [exception RIP: udp_v4_early_demux+481] RIP: 816249a1 RSP: 88047fc23cd8 RFLAGS: 00010246 RAX: 880248ee4500 RBX: 093a RCX: 0002 RDX: RSI: RDI: 880248ee4500 RBP: 88047fc23d48 R8: R9: R10: 0001 R11: c9000199f3a0 R12: 88006f8a6300 R13: 81cbb1c0 R14: 0001 R15: 880474798f00 ORIG_RAX: CS: 0010 SS: #9 [88047fc23cd0] udp_v4_early_demux at 81624bb3 #10 [88047fc23d50] ip_rcv_finish at 815f3055 #11 [88047fc23d80] ip_rcv at 815f3952 #12 [88047fc23dc0] __netif_receive_skb_core at 815b96d4 #13 [88047fc23e30] __netif_receive_skb at 815b9911 #14 [88047fc23e50] process_backlog at 815b99f0 #15 [88047fc23ea0] net_rx_action at 815ba1e8 #16 [88047fc23f30] __do_softirq at 81054ce6 #17 [88047fc23f90] irq_exit at 81055075 #18 [88047fc23fa0] smp_call_function_single_interrupt at 810319f5 #19 [88047fc23fb0] call_function_single_interrupt at 8168637a --- --- #20 [8800792dff58] call_function_single_interrupt at 8168637a RIP: 006e7b4c RSP: 7f4c8ba38b80 RFLAGS: 0216 RAX: 006b RBX: 816851f2 RCX: 7f49f4de84d6 RDX: 7f49f4de84d8 RSI: 7f48dbcce731 RDI: RBP: 7f4c8ba38bd0 R8: 006b R9: R10: 7f48dbcce737 R11: 7f49f4de84e0 R12: 7f4adab85198 R13: 0014 R14: 7f4adaaa4c00 R15: ORIG_RAX: ff04 CS: 0033 SS: 002b crash> print *(struct *dst_entry)0x880248ee4500 A syntax error in expression, near `*dst_entry)0x880248ee4500'. gdb: gdb request failed: print *(struct *dst_entry)0x880248ee4500 crash> print *(struct dst_entry*)0x880248ee4500 $1 = { callback_head = { next = 0x880248ee4d00, func = 0x0 }, child = 0x13eacdfb7df67f6b, dev = 0x880113975d00, ops = 0x0, _metrics = 13729079323838086211, expires = 103079215104, path = 0x24c8d3baa, from = 0x0, xfrm = 0x6, input = 0x0, output = 0x0, flags = 5536, pending_confirm = 33114, error = -1, obsolete = -1, header_len = 0, trailer_len = 0, tclassid = 0, __pad_to_align_refcnt = {0, 704374636708}, __refcnt = { counter = 14 }, __use = 2097153, lastuse = 536576, { next = 0x0, rt_next = 0x0, rt6_next = 0x0, dn_next = 0x0 } } -- Alex Gartrell -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] of_mdio: add new DT property 'autoneg' for fixed-link
10.07.2015 21:37, Florian Fainelli пишет: On 10/07/15 09:43, Stas Sergeev wrote: Currently for fixed-link the MAC driver decides whether to use the link status auto-negotiation or not. Unfortunately the auto-negotiation may not work when expected by the MAC driver. Sebastien Rannou explains: << Yes, I confirm that my HW does not generate an in-band status. AFAIK, it's a PHY that aggregates 4xSGMIIs to 1xQSGMII ; the MAC side of the PHY (with inband status) is connected to the switch through QSGMII, and in this context we are on the media side of the PHY. >> https://lkml.org/lkml/2015/7/10/206 This patch introduces the new boolean property 'autoneg' that allows the user to request the auto-negotiation explicitly. The implementation looks better, but the name might still be slightly controversial. I would go with "use-in-band-status" which is more strictly defined than "autoneg" which could mean anything and everything. What do you think? I actually think autoneg is a bit better. - Autonegotiation is a widely used and known term: https://en.wikipedia.org/wiki/Autonegotiation And who knows what in-band status is? And, more importantly, who knows what is it used for? Who even knows it is used for autonegotiation? - When we set autoneg for fixed-link, we basically just say "no MDIO here, but please do autoneg by any other means, if possible". - in-band status is an implementation delail, and it is specific to a particular protocols. If you request the in-band status for some protocol that doesn't support it, perhaps you should get -EINVAL, because such a config makes no sense. With autonegotiation, the rules are not that strict: it can be "unimplemented", which doesn't necessary mean nonsense in the config. - autonegotiation is a wider term, and may be implemented by some other means than the in-band status (which is probably impossible for a fixed-link though). - In the terms that the driver uses, it is autonegotiation, eg MVNETA_GMAC_AUTONEG_CONFIG. And when you go down the implementation details, you see MVNETA_GMAC_INBAND_AN_ENABLE, which is just one AN bit of many. So I really would prefer to keep things as is. But if you insist, I can rename, but there will still be no -EINVAL checks for obviously wrong configs. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] net: dsa: mv88e6xxx: sleep in _mv88e6xxx_stats_wait
Hi Vivien, On Fri, Jul 10, 2015 at 03:21:47PM -0400, Vivien Didelot wrote: > Hi Guenter, > >> I must have missed where is the benefit from spin reading 10 times this > >> register, rather than sleeping 1ms between tests. Does this busy bit > >> behaves differently from the phy, atu, scratch, or vtu busy bits? > >> > > Benefit is reaction time, mostly. If the result isn't ready after the > > first spin, the new code path adds a mandatory 1-2ms delay. This could > > add up to a lot if that kind of retry is seen a lot. > > To me, it looks like if this mandatory 1-2ms delay is an issue, then > _mv88e6xxx_wait must be fixed. Maybe reducing this delay is an option? > Good point. The timeout is most definitely quite large and for sure on the safe side. It might make sense to add some statistics gathering to see how long the maximum observed delay actually is. Guenter -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] nohz: prevent tilegx network driver interrupts
Normally the tilegx networking shim sends irqs to all the cores to distribute the load of processing incoming-packet interrupts, so that you can get to multiple Gb's of traffic inbound. However, in nohz_full mode we don't want to interrupt the nohz_full cores by default, so we limit the set of cores we use to only the online housekeeping cores. To make client code easier to read, we introduce a new nohz_full accessor, housekeeping_cpumask(), which returns a pointer to the housekeeping_mask if nohz_full is enabled, and otherwise returns the cpu_possible_mask. Signed-off-by: Chris Metcalf --- drivers/net/ethernet/tile/tilegx.c | 4 +++- include/linux/tick.h | 9 + 2 files changed, 12 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/tile/tilegx.c b/drivers/net/ethernet/tile/tilegx.c index a3f7610002aa..0a15acc075b3 100644 --- a/drivers/net/ethernet/tile/tilegx.c +++ b/drivers/net/ethernet/tile/tilegx.c @@ -40,6 +40,7 @@ #include #include #include +#include #include #include @@ -2273,7 +2274,8 @@ static int __init tile_net_init_module(void) tile_net_dev_init(name, mac); if (!network_cpus_init()) - network_cpus_map = *cpu_online_mask; + cpumask_and(&network_cpus_map, housekeeping_cpumask(), + cpu_online_mask); return 0; } diff --git a/include/linux/tick.h b/include/linux/tick.h index 3741ba1a652c..d3b00f0b2221 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -160,6 +160,15 @@ static inline void tick_nohz_full_kick_all(void) { } static inline void __tick_nohz_task_switch(struct task_struct *tsk) { } #endif +static inline const struct cpumask *housekeeping_cpumask(void) +{ +#ifdef CONFIG_NO_HZ_FULL + if (tick_nohz_full_enabled()) + return housekeeping_mask; +#endif + return cpu_possible_mask; +} + static inline bool is_housekeeping_cpu(int cpu) { #ifdef CONFIG_NO_HZ_FULL -- 2.1.2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 08/22] fjes: net_device_ops.ndo_start_xmit
Hi Izumi-san, On Wed, 24 Jun 2015 11:55:40 +0900 Taku Izumi wrote: > This patch adds net_device_ops.ndo_start_xmit callback, > which is called when sending packets. > > Signed-off-by: Taku Izumi > --- > drivers/net/fjes/fjes.h | 1 + > drivers/net/fjes/fjes_hw.c | 58 ++ > drivers/net/fjes/fjes_hw.h | 12 +++ > drivers/net/fjes/fjes_main.c | 177 > +++ > 4 files changed, 248 insertions(+) > > diff --git a/drivers/net/fjes/fjes.h b/drivers/net/fjes/fjes.h > index f182ed3..7af4304 100644 > --- a/drivers/net/fjes/fjes.h > +++ b/drivers/net/fjes/fjes.h > @@ -29,6 +29,7 @@ > #define FJES_ACPI_SYMBOL "Extended Socket" > #define FJES_MAX_QUEUES 1 > #define FJES_TX_RETRY_INTERVAL (20 * HZ) > +#define FJES_TX_RETRY_TIMEOUT(100) > #define FJES_OPEN_ZONE_UPDATE_WAIT (300) /* msec */ > > /* board specific private data structure */ > diff --git a/drivers/net/fjes/fjes_hw.c b/drivers/net/fjes/fjes_hw.c > index a17cb96..b2e7509 100644 > --- a/drivers/net/fjes/fjes_hw.c > +++ b/drivers/net/fjes/fjes_hw.c > @@ -793,3 +793,61 @@ int fjes_hw_wait_epstop(struct fjes_hw *hw) > return (wait_time < FJES_COMMAND_EPSTOP_WAIT_TIMEOUT * 1000) > ? 0 : -EBUSY; > } > + > +bool fjes_hw_check_epbuf_version(struct epbuf_handler *epbh, u32 version) > +{ > + union ep_buffer_info *info = epbh->info; > + > + return (info->common.version == version); > +} > + > +bool fjes_hw_check_mtu(struct epbuf_handler *epbh, u32 mtu) > +{ > + union ep_buffer_info *info = epbh->info; > + > + return (info->v1i.frame_max == FJES_MTU_TO_FRAME_SIZE(mtu)); > +} > + > +bool fjes_hw_check_vlan_id(struct epbuf_handler *epbh, u16 vlan_id) > +{ > + union ep_buffer_info *info = epbh->info; > + int i; > + bool ret = false; > + > + if (vlan_id == 0) { > + ret = true; > + } else { > + for (i = 0; i < EP_BUFFER_SUPPORT_VLAN_MAX; i++) { > + if (vlan_id == info->v1i.vlan_id[i]) { > + ret = true; > + break; > + } > + } > + } > + return ret; > +} > + > +int fjes_hw_epbuf_tx_pkt_send(struct epbuf_handler *epbh, > + void *frame, size_t size) > +{ > + union ep_buffer_info *info = epbh->info; > + struct esmem_frame *ring_frame; > + > + if (EP_RING_FULL(info->v1i.head, info->v1i.tail, info->v1i.count_max)) > + return -ENOBUFS; > + > + ring_frame = > + (struct esmem_frame *)&(epbh-> > + ring[EP_RING_INDEX > +(info->v1i.tail - 1, > + info->v1i.count_max) * > +info->v1i.frame_max]); > + > + ring_frame->frame_size = size; > + memcpy((void *)(ring_frame->frame_data), (void *)frame, size); > + > + EP_RING_INDEX_INC(epbh->info->v1i.tail, info->v1i.count_max); > + > + return 0; > +} > + > diff --git a/drivers/net/fjes/fjes_hw.h b/drivers/net/fjes/fjes_hw.h > index 823d802..95b9732 100644 > --- a/drivers/net/fjes/fjes_hw.h > +++ b/drivers/net/fjes/fjes_hw.h > @@ -49,6 +49,9 @@ struct fjes_hw; > > #define FJES_ZONING_ZONE_TYPE_NONE (0xFF) > > +#define FJES_TX_DELAY_SEND_NONE (0) > +#define FJES_TX_DELAY_SEND_PENDING (1) > + > #define FJES_RX_STOP_REQ_NONE(0x0) > #define FJES_RX_STOP_REQ_DONE(0x1) > #define FJES_RX_STOP_REQ_REQUEST (0x2) > @@ -60,6 +63,11 @@ struct fjes_hw; > > #define EP_RING_NUM(buffer_size, frame_size) \ > (u32)((buffer_size) / (frame_size)) > +#define EP_RING_INDEX(_num, _max) (((_num) + (_max)) % (_max)) > +#define EP_RING_INDEX_INC(_num, _max) \ > + ((_num) = EP_RING_INDEX((_num) + 1, (_max))) > +#define EP_RING_FULL(_head, _tail, _max) \ > + (0 == EP_RING_INDEX(((_tail) - (_head)), (_max))) > > #define FJES_MTU_TO_BUFFER_SIZE(mtu) \ > (ETH_HLEN + VLAN_HLEN + (mtu) + ETH_FCS_LEN) > @@ -308,5 +316,9 @@ enum ep_partner_status > > bool fjes_hw_epid_is_same_zone(struct fjes_hw *, int); > int fjes_hw_epid_is_shared(struct fjes_device_shared_info *, int); > +bool fjes_hw_check_epbuf_version(struct epbuf_handler *, u32); > +bool fjes_hw_check_mtu(struct epbuf_handler *, u32); > +bool fjes_hw_check_vlan_id(struct epbuf_handler *, u16); > +int fjes_hw_epbuf_tx_pkt_send(struct epbuf_handler *, void *, size_t); > > #endif /* FJES_HW_H_ */ > diff --git a/drivers/net/fjes/fjes_main.c b/drivers/net/fjes/fjes_main.c > index 832439b..f2ef286 100644 > --- a/drivers/net/fjes/fjes_main.c > +++ b/drivers/net/fjes/fjes_main.c > @@ -51,6 +51,7 @@ static int fjes_open(struct net_device *); > static int fjes_close(struct net_device *); > static int fjes_setup_resources(struct fjes_adapter *); > static void fjes_free_resource
Re: [PATCH 1/2] net: dsa: mv88e6xxx: sleep in _mv88e6xxx_stats_wait
Hi Guenter, On Jul 10, 2015, at 2:36 PM, Guenter Roeck li...@roeck-us.net wrote: > Hi Vivien, > > On Fri, Jul 10, 2015 at 02:20:47PM -0400, Vivien Didelot wrote: >> > >> > is this really beneficial and/or needed ? >> >> Except using existing generic code, no. >> >> > It adds at least 1ms delay to a loop which did not have any delay at >> > all unless the register read itself was sleeping. >> >> I must have missed where is the benefit from spin reading 10 times this >> register, rather than sleeping 1ms between tests. Does this busy bit >> behaves differently from the phy, atu, scratch, or vtu busy bits? >> > Benefit is reaction time, mostly. If the result isn't ready after the > first spin, the new code path adds a mandatory 1-2ms delay. This could > add up to a lot if that kind of retry is seen a lot. To me, it looks like if this mandatory 1-2ms delay is an issue, then _mv88e6xxx_wait must be fixed. Maybe reducing this delay is an option? > I don't now if there is a specific time limit for this busy bit, > and/or if it behaves differently than the others in terms of timing. > >> > Is the original function seen to return a timeout error under some >> > circumstances ? >> >> I didn't experience it myself, but I guess it may happen. In addition to >> that, the current implementation doesn't check eventual read error. >> That's why I saw a benefit in using _mv88e6xxx_wait(). > > Checking for a read error (or a timeout) is definitely a good thing. > I could also imagine that, for example, a "clear statistics" request > takes more time than currently supported. This is why I asked if you > had seen a timeout with the old code. > > Personally I'd rather leave the wait loop alone and only introduce > error checking unless there is a reason to introduce a sleep, > but I'd like to hear Andrew's and/or Florian's opinion. Andrew may not reply since he's on vacation, but I add Florian in Cc. Thanks, -v -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] nohz: prevent tilegx network driver interrupts
On 07/10/2015 02:24 PM, Frederic Weisbecker wrote: On Fri, Jul 10, 2015 at 01:33:44PM -0400, Chris Metcalf wrote: In nohz_full mode, by default distribute networking shim interrupts across the housekeeping cores, not all the cores. I can't really tell, I have no idea what this driver does. It seems to be about networking CPUs but I have no idea what we are affining here. Whether it is task, interrupt, ... And what those affine things do, if it is safe to do that reduce affinity etc.. I looked at the driver but I can't make my way there. I think you need a more detailed changelog :-) Fair enough! :-) See updated changelog to follow. Signed-off-by: Chris Metcalf --- The alternate approaches to this might be: 1. "#define housekeeping_mask cpu_online_mask" in the non-nohz_full arm in , then just unconditionally use "housekeeping_mask". Indeed we are doing more and more references on housekeeping_mask, so we should probably think about an off-case. Now the nohz-full off-case should rather be cpu_possible_mask than cpu_online_mask. housekeeping_mask doesn't take into account onlining at all. That suggests that in this case, we might want to default to something like "housekeeping_mask & cpu_online_mask", since you really don't want to send irqs to offline cores to process your packets :-) The tilegx chips typically don't do cpu offlining anyway, since we've never really found a usecase, so whatever you boot with you always have available. We do have support for a bare-metal mode which you can run on some of the cores, so you may start with fewer than cpu_possible actually running, but it will always be that same set of cores. So this does suggest that my original patch is wrong for that same reason. 2. Provide an accessor that returns the cpumask to use for housekeeping chores and implement it in the obvious ways for both nohz_full and non-nohz_full. The latter seems like arguably the most satisfying approach, but the patch below is, if nothing else, suitable to push for 4.3 without any further API development work. I don't know. 1) looks easier. On reflection, the problem with (1) is that if you are in NO_HZ_FULL mode but !tick_nohz_full_enabled(), you want to fall back to just using cpu_possible_mask anyway. So I think a simple accessor that returns an appropriate cpumask pointer is probably the best bet (along the lines of the existing is_housekeeping_cpu() accessor). -- Chris Metcalf, EZChip Semiconductor http://www.ezchip.com -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC net-next 11/22] dst: Metadata destinations
On 07/10/15 at 09:57pm, Julian Anastasov wrote: > > Hello, > > On Fri, 10 Jul 2015, Thomas Graf wrote: > > > --- a/net/ipv4/route.c > > +++ b/net/ipv4/route.c > > @@ -1691,6 +1691,8 @@ static int ip_route_input_slow(struct sk_buff *skb, > > __be32 daddr, __be32 saddr, > >by fib_lookup. > > */ > > > > + skb_dst_drop(skb); > > + > > This is not very safe. There are places that > call ip_route_input() for temporary lookups and they > do not set NULL. For example, icmp_route_lookup(), > may be there are other such places... Wow. What a tremendous hack ;-) It saves and restores the original dst ref to avoid the leaked reference. > OTOH, ip_options_rcv_srr() looks correct to use > skb_dst_set(skb, NULL), may be such call should be > added if it is missing... Agreed. This seems to be the right fix. I'll apply to this icmp_route_lookup() as well. Thanks! -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
re: net: fec: Ensure clocks are enabled while using mdio bus
Hi Andrew, The kernelci.org system reported[1][2][3] new imx6 boot failures in next-20150710. I've bisected[4] these failures to 6c3e921b18ed "net: fec: Ensure clocks are enabled while using mdio bus". Reverting this commit on top of next-20150710, resolves the boot failures. I have not investigated any further, but did add the imx maintainer to the CC list. Apologies in advanced for not replying to the original thread, as I was not subscribed to the netdev list until now. Any chance you can confirm this regression on your end? Cheers, Tyler [1] http://kernelci.org/boot/all/job/next/kernel/next-20150710/ [2] http://storage.kernelci.org/next/next-20150710/arm-multi_v7_defconfig/lab-collabora/boot-imx6q-sabrelite.html [3] http://storage.kernelci.org/next/next-20150710/arm-imx_v6_v7_defconfig/lab-tbaker/boot-imx6q-cm-fx6.html [4] http://hastebin.com/uzofovupey.pas -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC net-next 11/22] dst: Metadata destinations
Hello, On Fri, 10 Jul 2015, Thomas Graf wrote: > --- a/net/ipv4/route.c > +++ b/net/ipv4/route.c > @@ -1691,6 +1691,8 @@ static int ip_route_input_slow(struct sk_buff *skb, > __be32 daddr, __be32 saddr, > by fib_lookup. >*/ > > + skb_dst_drop(skb); > + This is not very safe. There are places that call ip_route_input() for temporary lookups and they do not set NULL. For example, icmp_route_lookup(), may be there are other such places... OTOH, ip_options_rcv_srr() looks correct to use skb_dst_set(skb, NULL), may be such call should be added if it is missing... > if (ipv4_is_multicast(saddr) || ipv4_is_lbcast(saddr)) > goto martian_source; Regards -- Julian Anastasov -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC net-next 12/22] arp: Inherit metadata dst when creating ARP requests
On 07/10/2015 05:19 PM, Thomas Graf wrote: If output device wants to see the dst, inherit the dst of the original skb and pass it on to generate the ARP request. Signed-off-by: Thomas Graf --- net/ipv4/arp.c | 71 +++--- 1 file changed, 43 insertions(+), 28 deletions(-) diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c index 933a928..3400aea 100644 --- a/net/ipv4/arp.c +++ b/net/ipv4/arp.c @@ -291,6 +291,46 @@ static void arp_error_report(struct neighbour *neigh, struct sk_buff *skb) kfree_skb(skb); } +/* + * Create and send an arp packet. + */ +static void arp_send_dst(int type, int ptype, __be32 dest_ip, +struct net_device *dev, __be32 src_ip, +const unsigned char *dest_hw, +const unsigned char *src_hw, +const unsigned char *target_hw, struct sk_buff *oskb) +{ + struct sk_buff *skb; + + /* +* No arp on this interface. +*/ The networking code, we do multi-line comments this way: /* bla * bla */ + + if (dev->flags&IFF_NOARP) Please surround & with spaces. [...] @@ -597,32 +638,6 @@ void arp_xmit(struct sk_buff *skb) EXPORT_SYMBOL(arp_xmit); /* - * Create and send an arp packet. - */ -void arp_send(int type, int ptype, __be32 dest_ip, - struct net_device *dev, __be32 src_ip, - const unsigned char *dest_hw, const unsigned char *src_hw, - const unsigned char *target_hw) -{ - struct sk_buff *skb; - - /* -* No arp on this interface. -*/ - - if (dev->flags&IFF_NOARP) - return; Ah, you're moving the code... [...] WBR, Sergei -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 02/11] soc/fsl: Introduce DPAA BMan device management driver
On Fri, 2015-07-10 at 13:29 -0500, Pledge Roy-R01356 wrote: > > return in_be32((void *)bm + offset); > > > ^ > > > [...]/drivers/soc/fsl/qbman/bman.c: In function ‘__bm_out’: > > > [...]/drivers/soc/fsl/qbman/bman.c:172:2: error: implicit declaration > > > of function ‘out_be32’ [-Werror=implicit-function-declaration] > > > out_be32((void *)bm + offset, val); > > > > These PPCisms will need to be fixed. LS1043A is an ARM chip with DPAA > > 1.0. > > LS1043 (ARM, Little Endian) support is still work in progress. The patches > for that are being tested now but are based on the SDK version of the > driver and will need to be massaged in order to get them applied here. Our > plan of record is to add upstream support for this at a later time. If you're already reworking this for upstream acceptance, why not fix the more obvious PPCisms now? -Scott -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC net-next 3/6] net: Introduce VRF device driver - v2
On 7/9/15 10:56 PM, Eric W. Biederman wrote: I have given specific areas of concern, and explained myself and you are blowing me off. You have not had answered my question with any additional details or code references -- ie., a specific example. Asking you for clarification and details is not blowing you off. To recap: Eric: "With respect to sockets there is also the issue that ip addresses are not per vrf." David: "IP addresses are per interface and interfaces are uniquely assigned to a VRF so why do you think IP addresses are not per VRF?" Eric: "I have read large swaths of the linux networking code over the years. Further I was thinking more about non-local addresses ip addresses, but I would not be surprised if there are also issues with local addresses." David: "Well, if someone has a specific example I'll take a look." So, let me try this again: All of the IPv4 and IPv6 addresses I am aware of are held in structs linked to a specific netdevice. Can you give me a specific example of what you mean here? I can't respond to your feedback based on the little information you have given me. Besides the fragment reassembly and xfrm there are things like the ineetpeer cache. noted. Which means things like packet fragmentation reassembly can easily do the wrong thing. Similarly things like the xfrm for ipsec tunnels are not hooked into this mix. So I really do not see how this VRF/MRF thing as designed can support general purpose sockets. I am not certain it can correctly support any kind of socket except perhaps SOCK_RAW. Sockets bound to the VRF device work properly. Why do you think they won't? Because there are many locations in the network stack (like fragment reassembly) that make the assumption that ip addresses are unique and do not bother looking at network device or anything else. If fragments manage to come into play I don't expect it would be hard to poision a connections with fragments from another routing domain with overlapping ip addresses. If that is true it is a problem with the networking stack today and is completely independent of this VRF proposal. Not at all. It is required functionality for reassembly of ip fragments when the packets come in via different paths. This is required to support multi-path ip reception. This only becomes a bug in the scenario you have proposed. Can you please point to a specific line in one of these patches that impacts fragmentation? This patch -- patch 3 -- adds a VRF driver. It has not mucked with packets at all. Patch 4 (again I have a better split in a forthcoming revision) tweaks a few places in the IPv4 stack with respect to socket lookups (dif modified) and FIB table lookups (specifying a table to use or tweaking oif/iif). Since the VRF device has not touched the packet and does not introduce a tunnel how has its use/existence impacted fragmentation? I do plan tests to include ipsec for example and fragmentation; it's a matter of time. If you have suggestions on a setup/test case I should include please let me know. David -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] of_mdio: add new DT property 'autoneg' for fixed-link
On 10/07/15 09:43, Stas Sergeev wrote: > > Currently for fixed-link the MAC driver decides whether to use the > link status auto-negotiation or not. > Unfortunately the auto-negotiation may not work when expected by > the MAC driver. Sebastien Rannou explains: > << Yes, I confirm that my HW does not generate an in-band status. AFAIK, it's > a PHY that aggregates 4xSGMIIs to 1xQSGMII ; the MAC side of the PHY (with > inband status) is connected to the switch through QSGMII, and in this context > we are on the media side of the PHY. >> > https://lkml.org/lkml/2015/7/10/206 > > This patch introduces the new boolean property 'autoneg' that allows > the user to request the auto-negotiation explicitly. The implementation looks better, but the name might still be slightly controversial. I would go with "use-in-band-status" which is more strictly defined than "autoneg" which could mean anything and everything. What do you think? > > Signed-off-by: Stas Sergeev > > CC: Rob Herring > CC: Pawel Moll > CC: Mark Rutland > CC: Ian Campbell > CC: Kumar Gala > CC: Florian Fainelli > CC: Grant Likely > CC: devicet...@vger.kernel.org > CC: linux-ker...@vger.kernel.org > CC: netdev@vger.kernel.org > --- > .../devicetree/bindings/net/fixed-link.txt | 6 +- > drivers/of/of_mdio.c | 23 > -- > include/linux/of_mdio.h| 5 + > 3 files changed, 31 insertions(+), 3 deletions(-) > > diff --git a/Documentation/devicetree/bindings/net/fixed-link.txt > b/Documentation/devicetree/bindings/net/fixed-link.txt > index 82bf7e0..e2959a8 100644 > --- a/Documentation/devicetree/bindings/net/fixed-link.txt > +++ b/Documentation/devicetree/bindings/net/fixed-link.txt > @@ -9,8 +9,12 @@ Such a fixed link situation is described by creating a > 'fixed-link' > sub-node of the Ethernet MAC device node, with the following > properties: > > +* 'autoneg' (boolean, optional), to enable the auto-negotiation of link > + state. Auto-negotiation is MII protocol, HW and driver-specific and is > + not supported in many cases, so use it only when you know what you do. > * 'speed' (integer, mandatory), to indicate the link speed. Accepted > - values are 10, 100 and 1000 > + values are 10, 100 and 1000. If the auto-negotiation is enabled, > + 'speed' may not be set. It will then be auto-negotiated, if possible. > * 'full-duplex' (boolean, optional), to indicate that full duplex is >used. When absent, half duplex is assumed. > * 'pause' (boolean, optional), to indicate that pause should be > diff --git a/drivers/of/of_mdio.c b/drivers/of/of_mdio.c > index 1bd4305..12b2ede 100644 > --- a/drivers/of/of_mdio.c > +++ b/drivers/of/of_mdio.c > @@ -280,6 +280,22 @@ bool of_phy_is_fixed_link(struct device_node *np) > } > EXPORT_SYMBOL(of_phy_is_fixed_link); > > +bool of_phy_is_autoneg_link(struct device_node *np) > +{ > + struct device_node *dn; > + bool ret; > + > + dn = of_get_child_by_name(np, "fixed-link"); > + if (!dn) > + return false; > + > + ret = of_property_read_bool(dn, "autoneg"); > + > + of_node_put(dn); > + return ret; > +} > +EXPORT_SYMBOL(of_phy_is_autoneg_link); > + > int of_phy_register_fixed_link(struct device_node *np) > { > struct fixed_phy_status status = {}; > @@ -291,10 +307,13 @@ int of_phy_register_fixed_link(struct device_node *np) > /* New binding */ > fixed_link_node = of_get_child_by_name(np, "fixed-link"); > if (fixed_link_node) { > - status.link = 1; > + bool autoneg = of_property_read_bool(fixed_link_node, > + "autoneg"); > + status.link = !autoneg; > status.duplex = of_property_read_bool(fixed_link_node, > "full-duplex"); > - if (of_property_read_u32(fixed_link_node, "speed", > &status.speed)) > + if (of_property_read_u32(fixed_link_node, "speed", > + &status.speed) != 0 && !autoneg) > return -EINVAL; > status.pause = of_property_read_bool(fixed_link_node, "pause"); > status.asym_pause = of_property_read_bool(fixed_link_node, > diff --git a/include/linux/of_mdio.h b/include/linux/of_mdio.h > index d449018..647f348 100644 > --- a/include/linux/of_mdio.h > +++ b/include/linux/of_mdio.h > @@ -65,6 +65,7 @@ static inline struct mii_bus *of_mdio_find_bus(struct > device_node *mdio_np) > #if defined(CONFIG_OF) && defined(CONFIG_FIXED_PHY) > extern int of_phy_register_fixed_link(struct device_node *np); > extern bool of_phy_is_fixed_link(struct device_node *np); > +extern bool of_phy_is_autoneg_link(struct device_node *np); > #else > static inline int of_phy_register_fixed_link(struct device_node *np) > { > @@ -74,6 +75,10 @@ static inline bool of_phy_is_fixed_link(struct device_node
Re: [PATCH 1/2] net: dsa: mv88e6xxx: sleep in _mv88e6xxx_stats_wait
Hi Vivien, On Fri, Jul 10, 2015 at 02:20:47PM -0400, Vivien Didelot wrote: > > > > is this really beneficial and/or needed ? > > Except using existing generic code, no. > > > It adds at least 1ms delay to a loop which did not have any delay at > > all unless the register read itself was sleeping. > > I must have missed where is the benefit from spin reading 10 times this > register, rather than sleeping 1ms between tests. Does this busy bit > behaves differently from the phy, atu, scratch, or vtu busy bits? > Benefit is reaction time, mostly. If the result isn't ready after the first spin, the new code path adds a mandatory 1-2ms delay. This could add up to a lot if that kind of retry is seen a lot. I don't now if there is a specific time limit for this busy bit, and/or if it behaves differently than the others in terms of timing. > > Is the original function seen to return a timeout error under some > > circumstances ? > > I didn't experience it myself, but I guess it may happen. In addition to > that, the current implementation doesn't check eventual read error. > That's why I saw a benefit in using _mv88e6xxx_wait(). Checking for a read error (or a timeout) is definitely a good thing. I could also imagine that, for example, a "clear statistics" request takes more time than currently supported. This is why I asked if you had seen a timeout with the old code. Personally I'd rather leave the wait loop alone and only introduce error checking unless there is a reason to introduce a sleep, but I'd like to hear Andrew's and/or Florian's opinion. Thanks, Guenter -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] nohz: prevent tilegx network driver interrupts
On Fri, Jul 10, 2015 at 01:33:44PM -0400, Chris Metcalf wrote: > In nohz_full mode, by default distribute networking shim > interrupts across the housekeeping cores, not all the cores. I can't really tell, I have no idea what this driver does. It seems to be about networking CPUs but I have no idea what we are affining here. Whether it is task, interrupt, ... And what those affine things do, if it is safe to do that reduce affinity etc.. I looked at the driver but I can't make my way there. I think you need a more detailed changelog :-) > Signed-off-by: Chris Metcalf > --- > The alternate approaches to this might be: > > 1. "#define housekeeping_mask cpu_online_mask" in the non-nohz_full >arm in , then just unconditionally use >"housekeeping_mask". Indeed we are doing more and more references on housekeeping_mask, so we should probably think about an off-case. Now the nohz-full off-case should rather be cpu_possible_mask than cpu_online_mask. housekeeping_mask doesn't take into account onlining at all. > > 2. Provide an accessor that returns the cpumask to use for housekeeping >chores and implement it in the obvious ways for both nohz_full >and non-nohz_full. > > The latter seems like arguably the most satisfying approach, but > the patch below is, if nothing else, suitable to push for 4.3 > without any further API development work. I don't know. 1) looks easier. > > Frederic (or others), comments? > > drivers/net/ethernet/tile/tilegx.c | 8 +++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/ethernet/tile/tilegx.c > b/drivers/net/ethernet/tile/tilegx.c > index a3f7610002aa..7687c62e7d75 100644 > --- a/drivers/net/ethernet/tile/tilegx.c > +++ b/drivers/net/ethernet/tile/tilegx.c > @@ -40,6 +40,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -2272,8 +2273,13 @@ static int __init tile_net_init_module(void) > for (i = 0; gxio_mpipe_link_enumerate_mac(i, name, mac) >= 0; i++) > tile_net_dev_init(name, mac); > > - if (!network_cpus_init()) > + if (!network_cpus_init()) { > +#ifdef CONFIG_NO_HZ_FULL > + network_cpus_map = *housekeeping_mask; > +#else > network_cpus_map = *cpu_online_mask; > +#endif > + } > > return 0; > } > -- > 2.1.2 > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 02/11] soc/fsl: Introduce DPAA BMan device management driver
> return in_be32((void *)bm + offset); > > ^ > > [...]/drivers/soc/fsl/qbman/bman.c: In function ‘__bm_out’: > > [...]/drivers/soc/fsl/qbman/bman.c:172:2: error: implicit declaration > > of function ‘out_be32’ [-Werror=implicit-function-declaration] > > out_be32((void *)bm + offset, val); > > These PPCisms will need to be fixed. LS1043A is an ARM chip with DPAA 1.0. LS1043 (ARM, Little Endian) support is still work in progress. The patches for that are being tested now but are based on the SDK version of the driver and will need to be massaged in order to get them applied here. Our plan of record is to add upstream support for this at a later time. > > > ^ > > [...]/drivers/soc/fsl/qbman/bman.c: In function ‘of_fsl_bman_probe’: > > [...]/drivers/soc/fsl/qbman/bman.c:463:17: error: ‘NO_IRQ’ undeclared > > (first use in this function) > > if (err_irq == NO_IRQ) { > > This isn't even a PPCism. It's just wrong. Compare to zero instead. Yeah - I recall fixing this when doing the ARM port but I guess I didn't make the same change in this version. I will address this in the next version. > > -Scott
Re: [RFC PATCH net-next] sctp: fix src address selection if using secondary addresses
On Fri, Jul 10, 2015 at 01:14:21PM -0400, Vlad Yasevich wrote: > On 07/10/2015 12:17 PM, Marcelo Ricardo Leitner wrote: > > On Fri, Jul 10, 2015 at 11:35:28AM -0400, Vlad Yasevich wrote: > >> On 07/10/2015 07:53 AM, Marcelo Ricardo Leitner wrote: > >>> On Thu, Jul 09, 2015 at 09:55:19PM +0200, Michael Tuexen wrote: > > On 09 Jul 2015, at 18:54, Marcelo Ricardo Leitner > > wrote: > > > > Cc'ing Michael too. > I'm not familiar with the Linux kernel code, so I can't comment on it. > But making sure to use a source address belonging to the emitting > interface makes sense for me. > > Best regards > Michael > >>> > >>> That's pretty much what I was looking for, in case I missed something on > >>> SCTP RFCs. > >>> > >> > >> Well, the RFCs do not really specify what the source address should be, > >> and there > > > > That's why I was afraid of having missed something ;) > > > >> have been numerous times where I've seen weak host model in use on the wire > >> even with a BSD peer. > >> > >> This also puts a very big nail through many suggestions we've had over the > >> years > >> to allow source based path multihoming in addition to destination based > >> multihoming > >> we currently support. > >> > >> It might be a good idea to make rp-filter like behavior best effort, and > >> have > >> the old behavior as fallback. I am still trying to think up different > >> scenarios > >> where rp-filter behavior will cause things to fail prematurely... > > > > The old behavior is like "if we don't have a src yet and can't find a > > preferred src for this dst, use the 1st bound address". We can add it > > but as I said, I'm afraid it is just doing wrong and not worth. If such > > randomly src addressed packet is meant to be routed, the router will > > likely drop it as it is seen as a spoof. And if it reaches the peer, it > > will probably come back through a different path. > > > > I'm tempted to say that current usual use cases are handled by the first > > check on this function, which returns the preferred/primary address for > > the interface and checks against bound addresses. Whenever you reach the > > second check, it just allows you to use that 1st bound address that is > > checked. I mean, I can't see use cases that we would be breaking with > > this change. > > Yes, the secondary check didn't amount to much, but we've kept it since 2.5 > days (when sctp was introduced). I've made attempts over the years to > try to make it stricter, but that never amounted to anything that worked well. > > > > > But yeah, it impacts source based routing, and I'm not aware of previous > > discussions on it. I'll try to dig some up but if possible, please share > > some pointers on it. > > It's been suggested a few times that we should support source based > multihoming > particularly for the case where one peer has only 1 address. > We've always punted on this, but people still ask every now and then. Ah okay, now I see it. > I do have a question about the code though.. Have you tried with mutlipath > routing > enabled. I see rp_filter checks have special code to handle that. Seem like > we > might get false negatives in sctp. In the sense of CONFIG_IP_ROUTE_MULTIPATH=y, yes, but just that. My routes were simple ones, either 2 peers attaches to 2 local subnets, or with a gateway in the middle (with 2 subnets on each side, but mapped 1-1, no crossing. Aka subnet1<->subnet2 and subnet3<->subnet4 while not (subnet1<->subnet4 or subnet3<->subnet2). Note that this is not rp_filter strictly speaking, as it's mirrored. rp_filter needs to calculate all possible output routes (actually until it finds a valid one) for finding one that would match the one used for incoming. This check already has an output path, and it's calculating if such input would be acceptable. We can't really expect/check for other hits because it invalidates the chosen output path. Hmmm... but we could support multipath in the output selection, ie in the outputs of ip_route_output_key(), probably in another patch then? Marcelo -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] of_mdio: add new DT property 'link' for fixed-link
On 10/07/15 04:20, Stas Sergeev wrote: > 10.07.2015 11:46, Sebastien Rannou пишет: >> On Fri, 10 Jul 2015, Stas Sergeev wrote: >> >>> 10.07.2015 00:15, Florian Fainelli пишет: Then, if the in-band status indication is not reliable (which really should be completely understood), >>> Agree! >>> But this is not something I can help with. >>> Sebastien Rannou reports the problem, please ask him whatever >>> you see fits to get a better understanding of a problem. >>> The fact that his HW does not generate the inband status, is >>> _my own guess_. >> >> Yes, I confirm that my HW does not generate an in-band status. AFAIK, it's >> a PHY that aggregates 4xSGMIIs to 1xQSGMII ; the MAC side of the PHY (with >> inband status) is connected to the switch through QSGMII, and in this context >> we are on the media side of the PHY. > Hmm, interesting. > So if I parse the above correctly, you have something like 88E1340S set > up into a mode when SGMII is used as media interface and QSGMII as system > interface (terms are from datasheet page 5), then you connect the media > interface to armada-xp and system interface to the switch. > > I wonder if it is the right thing to do. > AFAIK you could as well set up armada-xp into QSGMII mode and connect > that to switch. The driver would then disable the use of inband status > and everything would be fine. > Either way, your use-case proves that only DT can decide the use of an > inband status. I do not think there is any debate around the need for a property that defines whether in-band-status is both reliable and usable, the debate is about *where* to put it. I still think this does not belong in the fixed-link property, but now that you have explained a bit more in the other patch what your understanding of "fixed-link" is, I can see the confusion. Instead of having a link = "auto", property, how about just something like this: fixed-link { speed = <1000>; full-duplex; use-in-band-status; }; or event this: fixed-link { use-in-band-status; }; If you parse the 'use-in-band-status' which means that it is reliable information, then you can override whatever was defined in the DT under the 'fixed-link' property? -- Florian -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] net: dsa: mv88e6xxx: sleep in _mv88e6xxx_stats_wait
Hi Guenter, On Jul 10, 2015, at 1:10 PM, Guenter Roeck li...@roeck-us.net wrote: > On Fri, Jul 10, 2015 at 12:57:28PM -0400, Vivien Didelot wrote: >> The current _mv88e6xxx_stats_wait function does not sleep while testing >> the stats busy bit. Fix this by using the generic _mv88e6xxx_wait >> function. >> >> Note that it requires to move _mv88e6xxx_wait on top of >> _mv88e6xxx_stats_wait to avoid undefined reference compilation error. >> >> Signed-off-by: Vivien Didelot >> --- >> drivers/net/dsa/mv88e6xxx.c | 52 >> +++-- >> 1 file changed, 22 insertions(+), 30 deletions(-) >> >> diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c >> index f6c7409..7753db1 100644 >> --- a/drivers/net/dsa/mv88e6xxx.c >> +++ b/drivers/net/dsa/mv88e6xxx.c >> @@ -557,19 +557,31 @@ static bool mv88e6xxx_6352_family(struct dsa_switch >> *ds) >> return false; >> } >> >> +/* Must be called with SMI lock held */ >> +static int _mv88e6xxx_wait(struct dsa_switch *ds, int reg, int offset, >> + u16 mask) >> +{ >> +unsigned long timeout = jiffies + HZ / 10; >> + >> +while (time_before(jiffies, timeout)) { >> +int ret; >> + >> +ret = _mv88e6xxx_reg_read(ds, reg, offset); >> +if (ret < 0) >> +return ret; >> +if (!(ret & mask)) >> +return 0; >> + >> +usleep_range(1000, 2000); >> +} >> +return -ETIMEDOUT; >> +} >> + >> /* Must be called with SMI mutex held */ >> static int _mv88e6xxx_stats_wait(struct dsa_switch *ds) >> { >> -int ret; >> -int i; >> - >> -for (i = 0; i < 10; i++) { >> -ret = _mv88e6xxx_reg_read(ds, REG_GLOBAL, GLOBAL_STATS_OP); >> -if ((ret & GLOBAL_STATS_OP_BUSY) == 0) >> -return 0; >> -} > > Hi Vivien, > > is this really beneficial and/or needed ? Except using existing generic code, no. > It adds at least 1ms delay to a loop which did not have any delay at > all unless the register read itself was sleeping. I must have missed where is the benefit from spin reading 10 times this register, rather than sleeping 1ms between tests. Does this busy bit behaves differently from the phy, atu, scratch, or vtu busy bits? > Is the original function seen to return a timeout error under some > circumstances ? I didn't experience it myself, but I guess it may happen. In addition to that, the current implementation doesn't check eventual read error. That's why I saw a benefit in using _mv88e6xxx_wait(). Thanks, -v -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ravb: kill useless initializers
Some of the local variable intializers in the driver turned out to be pointless, kill them. Signed-off-by: Sergei Shtylyov --- The patch is against the Dave Miller's 'net-next.git' repo. Changes in version 2: - fixed typo in the subject. drivers/net/ethernet/renesas/ravb_main.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) Index: net-next/drivers/net/ethernet/renesas/ravb_main.c === --- net-next.orig/drivers/net/ethernet/renesas/ravb_main.c +++ net-next/drivers/net/ethernet/renesas/ravb_main.c @@ -223,9 +223,9 @@ static void ravb_ring_free(struct net_de static void ravb_ring_format(struct net_device *ndev, int q) { struct ravb_private *priv = netdev_priv(ndev); - struct ravb_ex_rx_desc *rx_desc = NULL; - struct ravb_tx_desc *tx_desc = NULL; - struct ravb_desc *desc = NULL; + struct ravb_ex_rx_desc *rx_desc; + struct ravb_tx_desc *tx_desc; + struct ravb_desc *desc; int rx_ring_size = sizeof(*rx_desc) * priv->num_rx_ring[q]; int tx_ring_size = sizeof(*tx_desc) * priv->num_tx_ring[q]; struct sk_buff *skb; @@ -435,7 +435,7 @@ static int ravb_tx_free(struct net_devic struct net_device_stats *stats = &priv->stats[q]; struct ravb_tx_desc *desc; int free_num = 0; - int entry = 0; + int entry; u32 size; for (; priv->cur_tx[q] - priv->dirty_tx[q] > 0; priv->dirty_tx[q]++) { @@ -508,8 +508,8 @@ static bool ravb_rx(struct net_device *n struct sk_buff *skb; dma_addr_t dma_addr; struct timespec64 ts; - u16 pkt_len = 0; u8 desc_status; + u16 pkt_len; int limit; boguscnt = min(boguscnt, *quota); @@ -1272,8 +1272,8 @@ static void ravb_tx_timeout_work(struct static netdev_tx_t ravb_start_xmit(struct sk_buff *skb, struct net_device *ndev) { struct ravb_private *priv = netdev_priv(ndev); - struct ravb_tstamp_skb *ts_skb = NULL; u16 q = skb_get_queue_mapping(skb); + struct ravb_tstamp_skb *ts_skb; struct ravb_tx_desc *desc; unsigned long flags; u32 dma_addr; -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ravb: kill usless initializers
Some of the local variable intializers in the driver turned out to be pointless, kill them. Signed-off-by: Sergei Shtylyov --- The patch is against the Dave Miller's 'net-next.git' repo. drivers/net/ethernet/renesas/ravb_main.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) Index: net-next/drivers/net/ethernet/renesas/ravb_main.c === --- net-next.orig/drivers/net/ethernet/renesas/ravb_main.c +++ net-next/drivers/net/ethernet/renesas/ravb_main.c @@ -223,9 +223,9 @@ static void ravb_ring_free(struct net_de static void ravb_ring_format(struct net_device *ndev, int q) { struct ravb_private *priv = netdev_priv(ndev); - struct ravb_ex_rx_desc *rx_desc = NULL; - struct ravb_tx_desc *tx_desc = NULL; - struct ravb_desc *desc = NULL; + struct ravb_ex_rx_desc *rx_desc; + struct ravb_tx_desc *tx_desc; + struct ravb_desc *desc; int rx_ring_size = sizeof(*rx_desc) * priv->num_rx_ring[q]; int tx_ring_size = sizeof(*tx_desc) * priv->num_tx_ring[q]; struct sk_buff *skb; @@ -435,7 +435,7 @@ static int ravb_tx_free(struct net_devic struct net_device_stats *stats = &priv->stats[q]; struct ravb_tx_desc *desc; int free_num = 0; - int entry = 0; + int entry; u32 size; for (; priv->cur_tx[q] - priv->dirty_tx[q] > 0; priv->dirty_tx[q]++) { @@ -508,8 +508,8 @@ static bool ravb_rx(struct net_device *n struct sk_buff *skb; dma_addr_t dma_addr; struct timespec64 ts; - u16 pkt_len = 0; u8 desc_status; + u16 pkt_len; int limit; boguscnt = min(boguscnt, *quota); @@ -1272,8 +1272,8 @@ static void ravb_tx_timeout_work(struct static netdev_tx_t ravb_start_xmit(struct sk_buff *skb, struct net_device *ndev) { struct ravb_private *priv = netdev_priv(ndev); - struct ravb_tstamp_skb *ts_skb = NULL; u16 q = skb_get_queue_mapping(skb); + struct ravb_tstamp_skb *ts_skb; struct ravb_tx_desc *desc; unsigned long flags; u32 dma_addr; -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] nohz: prevent tilegx network driver interrupts
In nohz_full mode, by default distribute networking shim interrupts across the housekeeping cores, not all the cores. Signed-off-by: Chris Metcalf --- The alternate approaches to this might be: 1. "#define housekeeping_mask cpu_online_mask" in the non-nohz_full arm in , then just unconditionally use "housekeeping_mask". 2. Provide an accessor that returns the cpumask to use for housekeeping chores and implement it in the obvious ways for both nohz_full and non-nohz_full. The latter seems like arguably the most satisfying approach, but the patch below is, if nothing else, suitable to push for 4.3 without any further API development work. Frederic (or others), comments? drivers/net/ethernet/tile/tilegx.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/tile/tilegx.c b/drivers/net/ethernet/tile/tilegx.c index a3f7610002aa..7687c62e7d75 100644 --- a/drivers/net/ethernet/tile/tilegx.c +++ b/drivers/net/ethernet/tile/tilegx.c @@ -40,6 +40,7 @@ #include #include #include +#include #include #include @@ -2272,8 +2273,13 @@ static int __init tile_net_init_module(void) for (i = 0; gxio_mpipe_link_enumerate_mac(i, name, mac) >= 0; i++) tile_net_dev_init(name, mac); - if (!network_cpus_init()) + if (!network_cpus_init()) { +#ifdef CONFIG_NO_HZ_FULL + network_cpus_map = *housekeeping_mask; +#else network_cpus_map = *cpu_online_mask; +#endif + } return 0; } -- 2.1.2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 02/11] soc/fsl: Introduce DPAA BMan device management driver
On Fri, 2015-07-10 at 13:36 +0200, Paul Bolle wrote: > On do, 2015-07-09 at 16:21 -0400, Roy Pledge wrote: > > +#ifdef CONFIG_FSL_DPA_CHECKING > > +#define DPA_ASSERT(x) \ > > + do { \ > > + if (!(x)) { \ > > + pr_crit("ASSERT: (%s:%d) %s\n", __FILE__, __LINE__, \ > > + __stringify_1(x)); \ > > + dump_stack(); \ > > + panic("assertion failure"); \ > > Not my call, but why panic() here? I'm pretty sure I've complained about this before (as well as all the BUG_ONs). -Scott -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 02/11] soc/fsl: Introduce DPAA BMan device management driver
On Fri, 2015-07-10 at 10:38 +0200, Paul Bolle wrote: > On do, 2015-07-09 at 16:21 -0400, Roy Pledge wrote: > > --- /dev/null > > +++ b/drivers/soc/fsl/qbman/Kconfig > > @@ -0,0 +1,33 @@ > > +menuconfig FSL_DPA > > + bool "Freescale DPAA support" > > + depends on FSL_SOC || COMPILE_TEST > > Are you sure about COMPILE_TEST? > > > + default n > > + help > > + FSL Data-Path Acceleration Architecture drivers > > + > > + These are not the actual Ethernet driver(s) > > + > > +if FSL_DPA > > [...] > > > +config FSL_BMAN > > + tristate "BMan device management" > > + default n > > + help > > + FSL DPAA BMan driver > > + > > +endif # FSL_DPA > > Because with FSL_BMAN set to 'm' testing things with > make -C ../../../.. M=$PWD bman.ko > > will not actually compile on x86_64: > > make: Entering directory '[...]' > CC [M] [...]/drivers/soc/fsl/qbman/bman.o > In file included from [...]/drivers/soc/fsl/qbman/bman_priv.h:33:0, > from [...]/drivers/soc/fsl/qbman/bman.c:31: > [...]/drivers/soc/fsl/qbman/dpaa_sys.h: In function ‘mfatb’: > [...]/drivers/soc/fsl/qbman/dpaa_sys.h:134:8: error: implicit declaration > of function ‘mfspr’ [-Werror=implicit-function-declaration] >hi = mfspr(SPRN_ATBU); > ^ > [...]/drivers/soc/fsl/qbman/dpaa_sys.h:134:14: error: ‘SPRN_ATBU’ > undeclared (first use in this function) >hi = mfspr(SPRN_ATBU); > ^ > [...]/drivers/soc/fsl/qbman/dpaa_sys.h:134:14: note: each undeclared > identifier is reported only once for each function it appears in > [...]/drivers/soc/fsl/qbman/dpaa_sys.h:135:14: error: ‘SPRN_ATBL’ > undeclared (first use in this function) >lo = mfspr(SPRN_ATBL); > ^ > [...]/drivers/soc/fsl/qbman/bman.c: In function ‘__bm_in’: > [...]/drivers/soc/fsl/qbman/bman.c:168:9: error: implicit declaration of > function ‘in_be32’ [-Werror=implicit-function-declaration] > return in_be32((void *)bm + offset); > ^ > [...]/drivers/soc/fsl/qbman/bman.c: In function ‘__bm_out’: > [...]/drivers/soc/fsl/qbman/bman.c:172:2: error: implicit declaration of > function ‘out_be32’ [-Werror=implicit-function-declaration] > out_be32((void *)bm + offset, val); These PPCisms will need to be fixed. LS1043A is an ARM chip with DPAA 1.0. > ^ > [...]/drivers/soc/fsl/qbman/bman.c: In function ‘of_fsl_bman_probe’: > [...]/drivers/soc/fsl/qbman/bman.c:463:17: error: ‘NO_IRQ’ undeclared > (first use in this function) > if (err_irq == NO_IRQ) { This isn't even a PPCism. It's just wrong. Compare to zero instead. -Scott -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] net: dsa: mv88e6xxx: call _mv88e6xxx_stats_wait with SMI lock held
On Fri, Jul 10, 2015 at 12:57:29PM -0400, Vivien Didelot wrote: > At switch setup, _mv88e6xxx_stats_wait was called without holding the > SMI mutex. Fix this by requesting the lock for this call. > > Also, return the _mv88e6xxx_stats_wait code, since it may fail. > > Signed-off-by: Vivien Didelot Good catch. Reviewed-by: Guenter Roeck -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH net-next] sctp: fix src address selection if using secondary addresses
On 07/10/2015 12:17 PM, Marcelo Ricardo Leitner wrote: > On Fri, Jul 10, 2015 at 11:35:28AM -0400, Vlad Yasevich wrote: >> On 07/10/2015 07:53 AM, Marcelo Ricardo Leitner wrote: >>> On Thu, Jul 09, 2015 at 09:55:19PM +0200, Michael Tuexen wrote: > On 09 Jul 2015, at 18:54, Marcelo Ricardo Leitner > wrote: > > Cc'ing Michael too. I'm not familiar with the Linux kernel code, so I can't comment on it. But making sure to use a source address belonging to the emitting interface makes sense for me. Best regards Michael >>> >>> That's pretty much what I was looking for, in case I missed something on >>> SCTP RFCs. >>> >> >> Well, the RFCs do not really specify what the source address should be, and >> there > > That's why I was afraid of having missed something ;) > >> have been numerous times where I've seen weak host model in use on the wire >> even with a BSD peer. >> >> This also puts a very big nail through many suggestions we've had over the >> years >> to allow source based path multihoming in addition to destination based >> multihoming >> we currently support. >> >> It might be a good idea to make rp-filter like behavior best effort, and have >> the old behavior as fallback. I am still trying to think up different >> scenarios >> where rp-filter behavior will cause things to fail prematurely... > > The old behavior is like "if we don't have a src yet and can't find a > preferred src for this dst, use the 1st bound address". We can add it > but as I said, I'm afraid it is just doing wrong and not worth. If such > randomly src addressed packet is meant to be routed, the router will > likely drop it as it is seen as a spoof. And if it reaches the peer, it > will probably come back through a different path. > > I'm tempted to say that current usual use cases are handled by the first > check on this function, which returns the preferred/primary address for > the interface and checks against bound addresses. Whenever you reach the > second check, it just allows you to use that 1st bound address that is > checked. I mean, I can't see use cases that we would be breaking with > this change. Yes, the secondary check didn't amount to much, but we've kept it since 2.5 days (when sctp was introduced). I've made attempts over the years to try to make it stricter, but that never amounted to anything that worked well. > > But yeah, it impacts source based routing, and I'm not aware of previous > discussions on it. I'll try to dig some up but if possible, please share > some pointers on it. It's been suggested a few times that we should support source based multihoming particularly for the case where one peer has only 1 address. We've always punted on this, but people still ask every now and then. I do have a question about the code though.. Have you tried with mutlipath routing enabled. I see rp_filter checks have special code to handle that. Seem like we might get false negatives in sctp. -vlad > > Marcelo > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] net: dsa: mv88e6xxx: sleep in _mv88e6xxx_stats_wait
On Fri, Jul 10, 2015 at 12:57:28PM -0400, Vivien Didelot wrote: > The current _mv88e6xxx_stats_wait function does not sleep while testing > the stats busy bit. Fix this by using the generic _mv88e6xxx_wait > function. > > Note that it requires to move _mv88e6xxx_wait on top of > _mv88e6xxx_stats_wait to avoid undefined reference compilation error. > > Signed-off-by: Vivien Didelot > --- > drivers/net/dsa/mv88e6xxx.c | 52 > +++-- > 1 file changed, 22 insertions(+), 30 deletions(-) > > diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c > index f6c7409..7753db1 100644 > --- a/drivers/net/dsa/mv88e6xxx.c > +++ b/drivers/net/dsa/mv88e6xxx.c > @@ -557,19 +557,31 @@ static bool mv88e6xxx_6352_family(struct dsa_switch *ds) > return false; > } > > +/* Must be called with SMI lock held */ > +static int _mv88e6xxx_wait(struct dsa_switch *ds, int reg, int offset, > +u16 mask) > +{ > + unsigned long timeout = jiffies + HZ / 10; > + > + while (time_before(jiffies, timeout)) { > + int ret; > + > + ret = _mv88e6xxx_reg_read(ds, reg, offset); > + if (ret < 0) > + return ret; > + if (!(ret & mask)) > + return 0; > + > + usleep_range(1000, 2000); > + } > + return -ETIMEDOUT; > +} > + > /* Must be called with SMI mutex held */ > static int _mv88e6xxx_stats_wait(struct dsa_switch *ds) > { > - int ret; > - int i; > - > - for (i = 0; i < 10; i++) { > - ret = _mv88e6xxx_reg_read(ds, REG_GLOBAL, GLOBAL_STATS_OP); > - if ((ret & GLOBAL_STATS_OP_BUSY) == 0) > - return 0; > - } Hi Vivien, is this really beneficial and/or needed ? It adds at least 1ms delay to a loop which did not have any delay at all unless the register read itself was sleeping. Is the original function seen to return a timeout error under some circumstances ? Thanks, Guenter -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC net-next 03/22] ipv4: support for fib route lwtunnel encap attributes
On 7/10/15, 9:54 AM, Thomas Graf wrote: On 07/10/15 at 05:36pm, Eric Dumazet wrote: On Fri, 2015-07-10 at 16:19 +0200, Thomas Graf wrote: From: Roopa Prabhu + if (oif) + dev = __dev_get_by_index(net, oif); + ret = lwtunnel_build_state(dev, encap_type, + encap, &lwtstate); + if (!ret) { + lwtunnel_state_get(lwtstate); + ret = lwtunnel_cmp_encap(lwtstate, nh->nh_lwtstate); + lwtunnel_state_put(lwtstate); + + return ret; These _get()/_put() calls do not seem necessary, or buggy. If refcounting is needed the _get() should be done at the time lwstate is fetched. I'll let Roopa comment but it seems like these can just be removed. Agree, They do seem unnecessary here. I will get rid of them. thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] net: dsa: mv88e6xxx: call _mv88e6xxx_stats_wait with SMI lock held
At switch setup, _mv88e6xxx_stats_wait was called without holding the SMI mutex. Fix this by requesting the lock for this call. Also, return the _mv88e6xxx_stats_wait code, since it may fail. Signed-off-by: Vivien Didelot --- drivers/net/dsa/mv88e6xxx.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c index 7753db1..cbbc33a 100644 --- a/drivers/net/dsa/mv88e6xxx.c +++ b/drivers/net/dsa/mv88e6xxx.c @@ -1961,6 +1961,7 @@ int mv88e6xxx_setup_common(struct dsa_switch *ds) int mv88e6xxx_setup_global(struct dsa_switch *ds) { struct mv88e6xxx_priv_state *ps = ds_to_priv(ds); + int ret; int i; /* Set the default address aging time to 5 minutes, and @@ -2059,9 +2060,11 @@ int mv88e6xxx_setup_global(struct dsa_switch *ds) REG_WRITE(REG_GLOBAL, GLOBAL_STATS_OP, GLOBAL_STATS_OP_FLUSH_ALL); /* Wait for the flush to complete. */ - _mv88e6xxx_stats_wait(ds); + mutex_lock(&ps->smi_mutex); + ret = _mv88e6xxx_stats_wait(ds); + mutex_unlock(&ps->smi_mutex); - return 0; + return ret; } int mv88e6xxx_switch_reset(struct dsa_switch *ds, bool ppu_active) -- 2.4.5 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] net: dsa: mv88e6xxx: sleep in _mv88e6xxx_stats_wait
The current _mv88e6xxx_stats_wait function does not sleep while testing the stats busy bit. Fix this by using the generic _mv88e6xxx_wait function. Note that it requires to move _mv88e6xxx_wait on top of _mv88e6xxx_stats_wait to avoid undefined reference compilation error. Signed-off-by: Vivien Didelot --- drivers/net/dsa/mv88e6xxx.c | 52 +++-- 1 file changed, 22 insertions(+), 30 deletions(-) diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c index f6c7409..7753db1 100644 --- a/drivers/net/dsa/mv88e6xxx.c +++ b/drivers/net/dsa/mv88e6xxx.c @@ -557,19 +557,31 @@ static bool mv88e6xxx_6352_family(struct dsa_switch *ds) return false; } +/* Must be called with SMI lock held */ +static int _mv88e6xxx_wait(struct dsa_switch *ds, int reg, int offset, + u16 mask) +{ + unsigned long timeout = jiffies + HZ / 10; + + while (time_before(jiffies, timeout)) { + int ret; + + ret = _mv88e6xxx_reg_read(ds, reg, offset); + if (ret < 0) + return ret; + if (!(ret & mask)) + return 0; + + usleep_range(1000, 2000); + } + return -ETIMEDOUT; +} + /* Must be called with SMI mutex held */ static int _mv88e6xxx_stats_wait(struct dsa_switch *ds) { - int ret; - int i; - - for (i = 0; i < 10; i++) { - ret = _mv88e6xxx_reg_read(ds, REG_GLOBAL, GLOBAL_STATS_OP); - if ((ret & GLOBAL_STATS_OP_BUSY) == 0) - return 0; - } - - return -ETIMEDOUT; + return _mv88e6xxx_wait(ds, REG_GLOBAL, GLOBAL_STATS_OP, + GLOBAL_STATS_OP_BUSY); } /* Must be called with SMI mutex held */ @@ -856,26 +868,6 @@ error: } #endif /* CONFIG_NET_DSA_HWMON */ -/* Must be called with SMI lock held */ -static int _mv88e6xxx_wait(struct dsa_switch *ds, int reg, int offset, - u16 mask) -{ - unsigned long timeout = jiffies + HZ / 10; - - while (time_before(jiffies, timeout)) { - int ret; - - ret = _mv88e6xxx_reg_read(ds, reg, offset); - if (ret < 0) - return ret; - if (!(ret & mask)) - return 0; - - usleep_range(1000, 2000); - } - return -ETIMEDOUT; -} - static int mv88e6xxx_wait(struct dsa_switch *ds, int reg, int offset, u16 mask) { struct mv88e6xxx_priv_state *ps = ds_to_priv(ds); -- 2.4.5 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC net-next 03/22] ipv4: support for fib route lwtunnel encap attributes
On 07/10/15 at 05:36pm, Eric Dumazet wrote: > On Fri, 2015-07-10 at 16:19 +0200, Thomas Graf wrote: > > From: Roopa Prabhu > > > > + if (oif) > > + dev = __dev_get_by_index(net, oif); > > + ret = lwtunnel_build_state(dev, encap_type, > > + encap, &lwtstate); > > + if (!ret) { > > + lwtunnel_state_get(lwtstate); > > + ret = lwtunnel_cmp_encap(lwtstate, nh->nh_lwtstate); > > + lwtunnel_state_put(lwtstate); > > + > > + return ret; > > > These _get()/_put() calls do not seem necessary, or buggy. > > If refcounting is needed the _get() should be done at the time lwstate > is fetched. I'll let Roopa comment but it seems like these can just be removed. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 03/11] soc/fsl: Introduce the DPAA BMan portal driver
Hi Roy, On vr, 2015-07-10 at 15:19 +, Roy Pledge wrote: > Thanks you for your valuable feedback so far. You're welcome. Please note that I just scan for, well, common build issues. Ie, stuff that requires no domain specific knowledge. > Let me try to address a general issue you mention below: unused > exported APIs. Good timing. I was pondering how to handle 04/11, a big patch that exports 79 (!) symbols. Many of those appear to have no users. > The QMan and BMan drivers provide a base layer for other blocks built > on top of them, for instance an Ethernet Driver, an Encrypt/Decrypt > Engine, a pattern matcher, a compress/decompress engine, etc... > Some of these drivers will be presented for review in the near future, > but in order to try and layer the review/up streaming process we're > presenting each layer as independently as possible. > If you really were to start dissecting which APIs are called you would > come to a very small set of pieces that merely initialize the hardware > but don't provide any opportunity for other users to invoke that HW. > > I hope that this helps you understand our goals in trying to upstream > these drivers. At the end of the day, what matters is what the people that need to sign off on these drivers (ppc? netdev?) are comfortable with. I know that some maintainers won't even bother looking at code that has no callers. In the mean time I'll skip the exports for the remainder of this series. Thanks, Paul Bolle -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] mvneta: use inband status only when explicitly enabled
The commit 898b2970e2c9 ("mvneta: implement SGMII-based in-band link state signaling") implemented the link parameters auto-negotiation unconditionally. Unfortunately it appears that some HW that implements SGMII protocol, doesn't generate the inband status, so it is not possible to auto-negotiate anything with such HW. This patch enables the auto-negotiation only if the 'autoneg' DT property is set to 1. For old configurations where the 'autoneg' property is not specified, the default is to not use auto-negotiation. This patch fixes the following regression: https://lkml.org/lkml/2015/7/8/865 and is therefore CCed to stable. Signed-off-by: Stas Sergeev CC: Thomas Petazzoni CC: netdev@vger.kernel.org CC: linux-ker...@vger.kernel.org CC: sta...@vger.kernel.org --- drivers/net/ethernet/marvell/mvneta.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c index 74176ec..d4c29a3 100644 --- a/drivers/net/ethernet/marvell/mvneta.c +++ b/drivers/net/ethernet/marvell/mvneta.c @@ -3009,7 +3009,7 @@ static int mvneta_probe(struct platform_device *pdev) char hw_mac_addr[ETH_ALEN]; const char *mac_from; int phy_mode; - int fixed_phy = 0; + int autoneg_link = 0; int err; /* Our multiqueue support is not complete, so for now, only @@ -3043,7 +3043,7 @@ static int mvneta_probe(struct platform_device *pdev) dev_err(&pdev->dev, "cannot register fixed PHY\n"); goto err_free_irq; } - fixed_phy = 1; + autoneg_link = of_phy_is_autoneg_link(dn); /* In the case of a fixed PHY, the DT node associated * to the PHY is the Ethernet MAC DT node. @@ -3067,8 +3067,7 @@ static int mvneta_probe(struct platform_device *pdev) pp = netdev_priv(dev); pp->phy_node = phy_node; pp->phy_interface = phy_mode; - pp->use_inband_status = (phy_mode == PHY_INTERFACE_MODE_SGMII) && - fixed_phy; + pp->use_inband_status = autoneg_link; pp->clk = devm_clk_get(&pdev->dev, NULL); if (IS_ERR(pp->clk)) { -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] of_mdio: add new DT property 'autoneg' for fixed-link
Currently for fixed-link the MAC driver decides whether to use the link status auto-negotiation or not. Unfortunately the auto-negotiation may not work when expected by the MAC driver. Sebastien Rannou explains: << Yes, I confirm that my HW does not generate an in-band status. AFAIK, it's a PHY that aggregates 4xSGMIIs to 1xQSGMII ; the MAC side of the PHY (with inband status) is connected to the switch through QSGMII, and in this context we are on the media side of the PHY. >> https://lkml.org/lkml/2015/7/10/206 This patch introduces the new boolean property 'autoneg' that allows the user to request the auto-negotiation explicitly. Signed-off-by: Stas Sergeev CC: Rob Herring CC: Pawel Moll CC: Mark Rutland CC: Ian Campbell CC: Kumar Gala CC: Florian Fainelli CC: Grant Likely CC: devicet...@vger.kernel.org CC: linux-ker...@vger.kernel.org CC: netdev@vger.kernel.org --- .../devicetree/bindings/net/fixed-link.txt | 6 +- drivers/of/of_mdio.c | 23 -- include/linux/of_mdio.h| 5 + 3 files changed, 31 insertions(+), 3 deletions(-) diff --git a/Documentation/devicetree/bindings/net/fixed-link.txt b/Documentation/devicetree/bindings/net/fixed-link.txt index 82bf7e0..e2959a8 100644 --- a/Documentation/devicetree/bindings/net/fixed-link.txt +++ b/Documentation/devicetree/bindings/net/fixed-link.txt @@ -9,8 +9,12 @@ Such a fixed link situation is described by creating a 'fixed-link' sub-node of the Ethernet MAC device node, with the following properties: +* 'autoneg' (boolean, optional), to enable the auto-negotiation of link + state. Auto-negotiation is MII protocol, HW and driver-specific and is + not supported in many cases, so use it only when you know what you do. * 'speed' (integer, mandatory), to indicate the link speed. Accepted - values are 10, 100 and 1000 + values are 10, 100 and 1000. If the auto-negotiation is enabled, + 'speed' may not be set. It will then be auto-negotiated, if possible. * 'full-duplex' (boolean, optional), to indicate that full duplex is used. When absent, half duplex is assumed. * 'pause' (boolean, optional), to indicate that pause should be diff --git a/drivers/of/of_mdio.c b/drivers/of/of_mdio.c index 1bd4305..12b2ede 100644 --- a/drivers/of/of_mdio.c +++ b/drivers/of/of_mdio.c @@ -280,6 +280,22 @@ bool of_phy_is_fixed_link(struct device_node *np) } EXPORT_SYMBOL(of_phy_is_fixed_link); +bool of_phy_is_autoneg_link(struct device_node *np) +{ + struct device_node *dn; + bool ret; + + dn = of_get_child_by_name(np, "fixed-link"); + if (!dn) + return false; + + ret = of_property_read_bool(dn, "autoneg"); + + of_node_put(dn); + return ret; +} +EXPORT_SYMBOL(of_phy_is_autoneg_link); + int of_phy_register_fixed_link(struct device_node *np) { struct fixed_phy_status status = {}; @@ -291,10 +307,13 @@ int of_phy_register_fixed_link(struct device_node *np) /* New binding */ fixed_link_node = of_get_child_by_name(np, "fixed-link"); if (fixed_link_node) { - status.link = 1; + bool autoneg = of_property_read_bool(fixed_link_node, +"autoneg"); + status.link = !autoneg; status.duplex = of_property_read_bool(fixed_link_node, "full-duplex"); - if (of_property_read_u32(fixed_link_node, "speed", &status.speed)) + if (of_property_read_u32(fixed_link_node, "speed", +&status.speed) != 0 && !autoneg) return -EINVAL; status.pause = of_property_read_bool(fixed_link_node, "pause"); status.asym_pause = of_property_read_bool(fixed_link_node, diff --git a/include/linux/of_mdio.h b/include/linux/of_mdio.h index d449018..647f348 100644 --- a/include/linux/of_mdio.h +++ b/include/linux/of_mdio.h @@ -65,6 +65,7 @@ static inline struct mii_bus *of_mdio_find_bus(struct device_node *mdio_np) #if defined(CONFIG_OF) && defined(CONFIG_FIXED_PHY) extern int of_phy_register_fixed_link(struct device_node *np); extern bool of_phy_is_fixed_link(struct device_node *np); +extern bool of_phy_is_autoneg_link(struct device_node *np); #else static inline int of_phy_register_fixed_link(struct device_node *np) { @@ -74,6 +75,10 @@ static inline bool of_phy_is_fixed_link(struct device_node *np) { return false; } +static inline bool of_phy_is_autoneg_link(struct device_node *np) +{ + return false; +} #endif -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] fixed_phy: handle link-down case
Currently fixed_phy driver recognizes only the link-up state. This simple patch adds an implementation of link-down state. The actual change is 1-line, the rest is an indentation. Signed-off-by: Stas Sergeev CC: Florian Fainelli CC: netdev@vger.kernel.org CC: linux-ker...@vger.kernel.org --- drivers/net/phy/fixed_phy.c | 99 +++-- 1 file changed, 50 insertions(+), 49 deletions(-) diff --git a/drivers/net/phy/fixed_phy.c b/drivers/net/phy/fixed_phy.c index 1960b46..a5d93cf 100644 --- a/drivers/net/phy/fixed_phy.c +++ b/drivers/net/phy/fixed_phy.c @@ -52,58 +52,59 @@ static int fixed_phy_update_regs(struct fixed_phy *fp) u16 lpagb = 0; u16 lpa = 0; - if (fp->status.duplex) { - bmcr |= BMCR_FULLDPLX; - - switch (fp->status.speed) { - case 1000: - bmsr |= BMSR_ESTATEN; - bmcr |= BMCR_SPEED1000; - lpagb |= LPA_1000FULL; - break; - case 100: - bmsr |= BMSR_100FULL; - bmcr |= BMCR_SPEED100; - lpa |= LPA_100FULL; - break; - case 10: - bmsr |= BMSR_10FULL; - lpa |= LPA_10FULL; - break; - default: - pr_warn("fixed phy: unknown speed\n"); - return -EINVAL; - } - } else { - switch (fp->status.speed) { - case 1000: - bmsr |= BMSR_ESTATEN; - bmcr |= BMCR_SPEED1000; - lpagb |= LPA_1000HALF; - break; - case 100: - bmsr |= BMSR_100HALF; - bmcr |= BMCR_SPEED100; - lpa |= LPA_100HALF; - break; - case 10: - bmsr |= BMSR_10HALF; - lpa |= LPA_10HALF; - break; - default: - pr_warn("fixed phy: unknown speed\n"); - return -EINVAL; - } - } - - if (fp->status.link) + if (fp->status.link) { bmsr |= BMSR_LSTATUS | BMSR_ANEGCOMPLETE; - if (fp->status.pause) - lpa |= LPA_PAUSE_CAP; + if (fp->status.duplex) { + bmcr |= BMCR_FULLDPLX; + + switch (fp->status.speed) { + case 1000: + bmsr |= BMSR_ESTATEN; + bmcr |= BMCR_SPEED1000; + lpagb |= LPA_1000FULL; + break; + case 100: + bmsr |= BMSR_100FULL; + bmcr |= BMCR_SPEED100; + lpa |= LPA_100FULL; + break; + case 10: + bmsr |= BMSR_10FULL; + lpa |= LPA_10FULL; + break; + default: + pr_warn("fixed phy: unknown speed\n"); + return -EINVAL; + } + } else { + switch (fp->status.speed) { + case 1000: + bmsr |= BMSR_ESTATEN; + bmcr |= BMCR_SPEED1000; + lpagb |= LPA_1000HALF; + break; + case 100: + bmsr |= BMSR_100HALF; + bmcr |= BMCR_SPEED100; + lpa |= LPA_100HALF; + break; + case 10: + bmsr |= BMSR_10HALF; + lpa |= LPA_10HALF; + break; + default: + pr_warn("fixed phy: unknown speed\n"); + return -EINVAL; + } + } - if (fp->status.asym_pause) - lpa |= LPA_PAUSE_ASYM; + if (fp->status.pause) + lpa |= LPA_PAUSE_CAP; + + if (fp->status.asym_pause) + lpa |= LPA_PAUSE_ASYM; + } fp->regs[MII_PHYSID1] = 0; fp->regs[MII_PHYSID2] = 0; -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel Oops in __inet_twsk_kill()
> > > The problem has been fixed. It is introduced by a third party patch, > > which decreases the refcnt of timewait socket. > What is the fix? pl share the gerrit. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 0/2] net: enable inband link state negotiation only when explicitly requested
Hello. Currently the link status auto-negotiation is enabled for any SGMII link with fixed-link DT binding. The regression was reported: https://lkml.org/lkml/2015/7/8/865 Apparently not all HW that implements SGMII protocol, generates the inband status for the auto-negotiation to work. More details here: https://lkml.org/lkml/2015/7/10/206 The following patches reverts to the old behavior by default, which is to not enable the auto-negotiation for fixed-link. The new DT property is added that allows to explicitly request the auto-negotiation. Those who were affected by the change, please send your Tested-by, Thanks! -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH net-next] sctp: fix src address selection if using secondary addresses
On Fri, Jul 10, 2015 at 11:35:28AM -0400, Vlad Yasevich wrote: > On 07/10/2015 07:53 AM, Marcelo Ricardo Leitner wrote: > > On Thu, Jul 09, 2015 at 09:55:19PM +0200, Michael Tuexen wrote: > >>> On 09 Jul 2015, at 18:54, Marcelo Ricardo Leitner > >>> wrote: > >>> > >>> Cc'ing Michael too. > >> I'm not familiar with the Linux kernel code, so I can't comment on it. > >> But making sure to use a source address belonging to the emitting > >> interface makes sense for me. > >> > >> Best regards > >> Michael > > > > That's pretty much what I was looking for, in case I missed something on > > SCTP RFCs. > > > > Well, the RFCs do not really specify what the source address should be, and > there That's why I was afraid of having missed something ;) > have been numerous times where I've seen weak host model in use on the wire > even with a BSD peer. > > This also puts a very big nail through many suggestions we've had over the > years > to allow source based path multihoming in addition to destination based > multihoming > we currently support. > > It might be a good idea to make rp-filter like behavior best effort, and have > the old behavior as fallback. I am still trying to think up different > scenarios > where rp-filter behavior will cause things to fail prematurely... The old behavior is like "if we don't have a src yet and can't find a preferred src for this dst, use the 1st bound address". We can add it but as I said, I'm afraid it is just doing wrong and not worth. If such randomly src addressed packet is meant to be routed, the router will likely drop it as it is seen as a spoof. And if it reaches the peer, it will probably come back through a different path. I'm tempted to say that current usual use cases are handled by the first check on this function, which returns the preferred/primary address for the interface and checks against bound addresses. Whenever you reach the second check, it just allows you to use that 1st bound address that is checked. I mean, I can't see use cases that we would be breaking with this change. But yeah, it impacts source based routing, and I'm not aware of previous discussions on it. I'll try to dig some up but if possible, please share some pointers on it. Marcelo -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC net-next 03/22] ipv4: support for fib route lwtunnel encap attributes
On Fri, 2015-07-10 at 16:19 +0200, Thomas Graf wrote: > From: Roopa Prabhu > + if (oif) > + dev = __dev_get_by_index(net, oif); > + ret = lwtunnel_build_state(dev, encap_type, > +encap, &lwtstate); > + if (!ret) { > + lwtunnel_state_get(lwtstate); > + ret = lwtunnel_cmp_encap(lwtstate, nh->nh_lwtstate); > + lwtunnel_state_put(lwtstate); > + > + return ret; These _get()/_put() calls do not seem necessary, or buggy. If refcounting is needed the _get() should be done at the time lwstate is fetched. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH net-next] sctp: fix src address selection if using secondary addresses
On 07/10/2015 07:53 AM, Marcelo Ricardo Leitner wrote: > On Thu, Jul 09, 2015 at 09:55:19PM +0200, Michael Tuexen wrote: >>> On 09 Jul 2015, at 18:54, Marcelo Ricardo Leitner >>> wrote: >>> >>> Cc'ing Michael too. >> I'm not familiar with the Linux kernel code, so I can't comment on it. >> But making sure to use a source address belonging to the emitting >> interface makes sense for me. >> >> Best regards >> Michael > > That's pretty much what I was looking for, in case I missed something on > SCTP RFCs. > Well, the RFCs do not really specify what the source address should be, and there have been numerous times where I've seen weak host model in use on the wire even with a BSD peer. This also puts a very big nail through many suggestions we've had over the years to allow source based path multihoming in addition to destination based multihoming we currently support. It might be a good idea to make rp-filter like behavior best effort, and have the old behavior as fallback. I am still trying to think up different scenarios where rp-filter behavior will cause things to fail prematurely... -vlad > Thanks Michael! > >>> On Tue, Jul 07, 2015 at 02:42:25PM -0300, Marcelo Ricardo Leitner wrote: Hi folks, This is an attempt to better choose a src address for sctp packets as peers with rp_filter could be dropping our packets in some situations. With this patch, we try to respect and use a src address that belongs to the interface we are putting the packet out. I have that feeling that there is be a better way to do this, but I just couldn't see it. This patch has been tested with and without gateways between the peers and also just two peers connected via two subnets and results were pretty good. One could think that this limits the address combination we can use, but such combinations probably are just bogus anyway. Like, if you have an host with address A1 and B1 and another with A2 and B2, you cannot expect that A can use A1 to reach B2 through subnet B, because the return path would be via the other link which, when this switch happens, we are thinking it's broken. Thanks, Marcelo ---8<--- In short, sctp is likely to incorrectly choose src address if socket is bound to secondary addresses. This patch fixes it by adding a new check that tries to anticipate if the src address would be expected by the next hop/peer on this interface by doing reverse routing. Also took the shot to reduce the indentation level on this code. Details: Currently, sctp will do a routing attempt without specifying the src address and compare the returned value (preferred source) with the addresses that the socket is bound to. When using secondary addresses, this will not match. Then it will try specifying each of the addresses that the socket is bound to and re-routing, checking if that address is valid as src for that dst. Thing is, this check alone is weak: # ip r l 192.168.100.0/24 dev eth1 proto kernel scope link src 192.168.100.149 192.168.122.0/24 dev eth0 proto kernel scope link src 192.168.122.147 # ip a l 1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: eth0: mtu 1500 qdisc pfifo_fast state UP group default qlen 1000 link/ether 52:54:00:15:18:6a brd ff:ff:ff:ff:ff:ff inet 192.168.122.147/24 brd 192.168.122.255 scope global dynamic eth0 valid_lft 2160sec preferred_lft 2160sec inet 192.168.122.148/24 scope global secondary eth0 valid_lft forever preferred_lft forever inet6 fe80::5054:ff:fe15:186a/64 scope link valid_lft forever preferred_lft forever 3: eth1: mtu 1500 qdisc pfifo_fast state UP group default qlen 1000 link/ether 52:54:00:b3:91:46 brd ff:ff:ff:ff:ff:ff inet 192.168.100.149/24 brd 192.168.100.255 scope global dynamic eth1 valid_lft 2162sec preferred_lft 2162sec inet 192.168.100.148/24 scope global secondary eth1 valid_lft forever preferred_lft forever inet6 fe80::5054:ff:feb3:9146/64 scope link valid_lft forever preferred_lft forever 4: ens9: mtu 1500 qdisc pfifo_fast state UP group default qlen 1000 link/ether 52:54:00:05:47:ee brd ff:ff:ff:ff:ff:ff inet6 fe80::5054:ff:fe05:47ee/64 scope link valid_lft forever preferred_lft forever # ip r g 192.168.100.193 from 192.168.122.148 192.168.100.193 from 192.168.122.148 dev eth1 cache Even if you specify an interface: # ip r g
RE: [PATCH 03/11] soc/fsl: Introduce the DPAA BMan portal driver
Paul, Thanks you for your valuable feedback so far. Let me try to address a general issue you mention below: unused exported APIs. The QMan and BMan drivers provide a base layer for other blocks built on top of them, for instance an Ethernet Driver, an Encrypt/Decrypt Engine, a pattern matcher, a compress/decompress engine, etc... Some of these drivers will be presented for review in the near future, but in order to try and layer the review/up streaming process we're presenting each layer as independently as possible. If you really were to start dissecting which APIs are called you would come to a very small set of pieces that merely initialize the hardware but don't provide any opportunity for other users to invoke that HW. I hope that this helps you understand our goals in trying to upstream these drivers. Roy > -Original Message- > From: Paul Bolle [mailto:pebo...@tiscali.nl] > Sent: Friday, July 10, 2015 9:33 AM > To: Pledge Roy-R01356 > Cc: linuxppc-...@lists.ozlabs.org; Wood Scott-B07421; > netdev@vger.kernel.org; linux-ker...@vger.kernel.org > Subject: Re: [PATCH 03/11] soc/fsl: Introduce the DPAA BMan portal driver > > On do, 2015-07-09 at 16:21 -0400, Roy Pledge wrote: > > --- a/drivers/soc/fsl/qbman/Kconfig > > +++ b/drivers/soc/fsl/qbman/Kconfig > > > +config FSL_DPA_PIRQ_FAST > > + bool > > + default y > > First used in 04/11. > > > +config FSL_DPA_PIRQ_SLOW > > + bool > > + default y > > + > > +config FSL_DPA_PORTAL_SHARE > > + bool > > + default y > > As in 02/11: these three symbols function as aliases for FSL_DPA. Why are > they needed? > > > config FSL_BMAN > > tristate "BMan device management" > > default n > > help > > FSL DPAA BMan driver > > > > +config FSL_BMAN_PORTAL > > + tristate "BMan portal(s)" > > + default n > > + help > > + FSL BMan portal driver > > > --- /dev/null > > +++ b/drivers/soc/fsl/qbman/bman_api.c > > > +struct bman_portal { > > + [...] > > +#ifdef CONFIG_FSL_DPA_CAN_WAIT_SYNC > > This check will always evaluate to true, right? (I'll only report this > once.) > > > + struct bman_pool *rcri_owned; /* only 1 release WAIT_SYNC at > > a time */ > > +#endif > > +#ifdef CONFIG_FSL_DPA_PORTAL_SHARE > > Ditto. > > > + raw_spinlock_t sharing_lock; /* only used if is_shared */ > > + int is_shared; > > + struct bman_portal *sharing_redirect; #endif > > + [...] > > +}; > > > +const struct bman_portal_config *bman_get_portal_config(void) { > > + [...] > > +} > > I couldn't find callers of this function. > > > +EXPORT_SYMBOL(bman_get_portal_config); > > Nor users of this export, obviously. > > > + > > +u32 bman_irqsource_get(void) > > +{ > > + [...] > > +} > > Ditto. > > > +EXPORT_SYMBOL(bman_irqsource_get); > > Ditto. > > > +int bman_p_irqsource_add(struct bman_portal *p, __maybe_unused u32 > > +bits) { > > + [...] > > +} > > +EXPORT_SYMBOL(bman_p_irqsource_add); > > There seem to be no users of this export. > > > +int bman_irqsource_add(__maybe_unused u32 bits) { > > + [...] > > +} > > Unused. > > > +EXPORT_SYMBOL(bman_irqsource_add); > > Ditto. > > > +int bman_irqsource_remove(u32 bits) > > +{ > > + [...] > > +} > > Ditto. > > > +EXPORT_SYMBOL(bman_irqsource_remove); > > Ditto. > > > +u32 bman_poll_slow(void) > > +{ > > + [...] > > +} > > Ditto. > > > +EXPORT_SYMBOL(bman_poll_slow); > > Ditto. > > > +void bman_poll(void) > > +{ > > + [...] > > +} > > Ditto. > > > +EXPORT_SYMBOL(bman_poll); > > Ditto. > > > +static inline struct bm_rcr_entry *try_rel_start(struct bman_portal > > +**p, #ifdef CONFIG_FSL_DPA_CAN_WAIT > > Always true, right? > > > + __maybe_unused struct bman_pool > *pool, #endif > > + __maybe_unused unsigned long > *irqflags, > > + __maybe_unused u32 flags) > > And this is a, well, novel way to declare a function. > > > +{ > > + [...] > > +} > > > +int bman_flush_stockpile(struct bman_pool *pool, u32 flags) { > > + [...] > > +} > > Unused function. > > > +EXPORT_SYMBOL(bman_flush_stockpile); > > So unused export too. > > > +#ifdef CONFIG_FSL_BMAN > > +u32 bman_query_free_buffers(struct bman_pool *pool) { > > + return bm_pool_free_buffers(pool->params.bpid); > > +} > > +EXPORT_SYMBOL(bman_query_free_buffers); > > + > > +int bman_update_pool_thresholds(struct bman_pool *pool, const u32 > > +*thresholds) { > > + u32 bpid; > > + > > + bpid = bman_get_params(pool)->bpid; > > + > > + return bm_pool_set(bpid, thresholds); } > > +EXPORT_SYMBOL(bman_update_pool_thresholds); > > More of the same. > > > +#endif > > > --- /dev/null > > +++ b/drivers/soc/fsl/qbman/bman_portal.c > > > > +module_driver(bman_portal_driver, > > + bman_portal_driver_register, platform_driver_unregister); > > No MODULE_LICENSE() here, nor in the other files that make up this module. > So loading this module will trigger a wa
[PATCH net-next] bridge: mdb: add vlan support for user entries
Until now all user mdb entries were added in vlan 0, this patch adds support to allow the user to specify the vlan for the entry. About the uapi change a hole in struct br_mdb_entry is used so the size and offsets are kept the same (verified with pahole and tested with older iproute2). Example: $ bridge mdb dev br0 port eth1 grp 239.0.0.1 permanent vlan 2000 dev br0 port eth1 grp 239.0.0.1 permanent vlan 200 dev br0 port eth1 grp 239.0.0.1 permanent Signed-off-by: Nikolay Aleksandrov --- note: an iproute2 patch will follow if this one is accepted include/uapi/linux/if_bridge.h | 1 + net/bridge/br_mdb.c| 6 ++ 2 files changed, 7 insertions(+) diff --git a/include/uapi/linux/if_bridge.h b/include/uapi/linux/if_bridge.h index eaaea6208b42..3635b7797508 100644 --- a/include/uapi/linux/if_bridge.h +++ b/include/uapi/linux/if_bridge.h @@ -182,6 +182,7 @@ struct br_mdb_entry { #define MDB_TEMPORARY 0 #define MDB_PERMANENT 1 __u8 state; + __u16 vid; struct { union { __be32 ip4; diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c index 1fb7d076f15c..a8d0e93d43f2 100644 --- a/net/bridge/br_mdb.c +++ b/net/bridge/br_mdb.c @@ -85,6 +85,7 @@ static int br_mdb_fill_info(struct sk_buff *skb, struct netlink_callback *cb, memset(&e, 0, sizeof(e)); e.ifindex = port->dev->ifindex; e.state = p->state; + e.vid = p->addr.vid; if (p->addr.proto == htons(ETH_P_IP)) e.addr.u.ip4 = p->addr.u.ip4; #if IS_ENABLED(CONFIG_IPV6) @@ -242,6 +243,7 @@ void br_mdb_notify(struct net_device *dev, struct net_bridge_port *port, entry.addr.u.ip6 = group->u.ip6; #endif entry.state = state; + entry.vid = group->vid; __br_mdb_notify(dev, &entry, type); } @@ -264,6 +266,8 @@ static bool is_valid_mdb_entry(struct br_mdb_entry *entry) return false; if (entry->state != MDB_PERMANENT && entry->state != MDB_TEMPORARY) return false; + if (entry->vid >= VLAN_VID_MASK) + return false; return true; } @@ -372,6 +376,7 @@ static int __br_mdb_add(struct net *net, struct net_bridge *br, if (!p || p->br != br || p->state == BR_STATE_DISABLED) return -EINVAL; + ip.vid = entry->vid; ip.proto = entry->addr.proto; if (ip.proto == htons(ETH_P_IP)) ip.u.ip4 = entry->addr.u.ip4; @@ -418,6 +423,7 @@ static int __br_mdb_del(struct net_bridge *br, struct br_mdb_entry *entry) if (!netif_running(br->dev) || br->multicast_disabled) return -EINVAL; + ip.vid = entry->vid; ip.proto = entry->addr.proto; if (ip.proto == htons(ETH_P_IP)) { if (timer_pending(&br->ip4_other_query.timer)) -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC net-next 16/22] fib: Add fib rule match on tunnel id
This add the ability to select a routing table based on the tunnel id which allows to maintain separate routing tables for each virtual tunnel network. ip rule add from all tunnel-id 100 lookup 100 ip rule add from all tunnel-id 200 lookup 200 Signed-off-by: Thomas Graf --- include/net/fib_rules.h| 1 + include/uapi/linux/fib_rules.h | 2 +- net/core/fib_rules.c | 17 +++-- 3 files changed, 17 insertions(+), 3 deletions(-) diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h index 903a55e..4e8f804 100644 --- a/include/net/fib_rules.h +++ b/include/net/fib_rules.h @@ -19,6 +19,7 @@ struct fib_rule { u8 action; /* 3 bytes hole, try to use */ u32 target; + __be64 tun_id; struct fib_rule __rcu *ctarget; struct net *fr_net; diff --git a/include/uapi/linux/fib_rules.h b/include/uapi/linux/fib_rules.h index 2b82d7e..96161b8 100644 --- a/include/uapi/linux/fib_rules.h +++ b/include/uapi/linux/fib_rules.h @@ -43,7 +43,7 @@ enum { FRA_UNUSED5, FRA_FWMARK, /* mark */ FRA_FLOW, /* flow/class id */ - FRA_UNUSED6, + FRA_TUN_ID, FRA_SUPPRESS_IFGROUP, FRA_SUPPRESS_PREFIXLEN, FRA_TABLE, /* Extended table id */ diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c index 9a12668..6da78c9 100644 --- a/net/core/fib_rules.c +++ b/net/core/fib_rules.c @@ -186,6 +186,9 @@ static int fib_rule_match(struct fib_rule *rule, struct fib_rules_ops *ops, if ((rule->mark ^ fl->flowi_mark) & rule->mark_mask) goto out; + if (rule->tun_id && (rule->tun_id != fl->flowi_tun_key.tun_id)) + goto out; + ret = ops->match(rule, fl, flags); out: return (rule->flags & FIB_RULE_INVERT) ? !ret : ret; @@ -330,6 +333,9 @@ static int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr* nlh) if (tb[FRA_FWMASK]) rule->mark_mask = nla_get_u32(tb[FRA_FWMASK]); + if (tb[FRA_TUN_ID]) + rule->tun_id = nla_get_be64(tb[FRA_TUN_ID]); + rule->action = frh->action; rule->flags = frh->flags; rule->table = frh_get_table(frh, tb); @@ -473,6 +479,10 @@ static int fib_nl_delrule(struct sk_buff *skb, struct nlmsghdr* nlh) (rule->mark_mask != nla_get_u32(tb[FRA_FWMASK]))) continue; + if (tb[FRA_TUN_ID] && + (rule->tun_id != nla_get_be64(tb[FRA_TUN_ID]))) + continue; + if (!ops->compare(rule, frh, tb)) continue; @@ -535,7 +545,8 @@ static inline size_t fib_rule_nlmsg_size(struct fib_rules_ops *ops, + nla_total_size(4) /* FRA_SUPPRESS_PREFIXLEN */ + nla_total_size(4) /* FRA_SUPPRESS_IFGROUP */ + nla_total_size(4) /* FRA_FWMARK */ -+ nla_total_size(4); /* FRA_FWMASK */ ++ nla_total_size(4) /* FRA_FWMASK */ ++ nla_total_size(8); /* FRA_TUN_ID */ if (ops->nlmsg_payload) payload += ops->nlmsg_payload(rule); @@ -591,7 +602,9 @@ static int fib_nl_fill_rule(struct sk_buff *skb, struct fib_rule *rule, ((rule->mark_mask || rule->mark) && nla_put_u32(skb, FRA_FWMASK, rule->mark_mask)) || (rule->target && -nla_put_u32(skb, FRA_GOTO, rule->target))) +nla_put_u32(skb, FRA_GOTO, rule->target)) || + (rule->tun_id && +nla_put_be64(skb, FRA_TUN_ID, rule->tun_id))) goto nla_put_failure; if (rule->suppress_ifgroup != -1) { -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC net-next 19/22] openvswitch: Move dev pointer into vport itself
This is the first step in representing all OVS vports as regular struct net_devices. Move the net_device pointer into the vport structure itself to get rid of struct vport_netdev. Signed-off-by: Thomas Graf Signed-off-by: Pravin B Shelar --- net/openvswitch/datapath.c | 7 +-- net/openvswitch/dp_notify.c | 5 +-- net/openvswitch/vport-internal_dev.c | 37 +++- net/openvswitch/vport-netdev.c | 86 net/openvswitch/vport-netdev.h | 12 - net/openvswitch/vport.h | 3 +- 6 files changed, 59 insertions(+), 91 deletions(-) diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c index 0208210..19df28e 100644 --- a/net/openvswitch/datapath.c +++ b/net/openvswitch/datapath.c @@ -188,7 +188,7 @@ static int get_dpifindex(const struct datapath *dp) local = ovs_vport_rcu(dp, OVSP_LOCAL); if (local) - ifindex = netdev_vport_priv(local)->dev->ifindex; + ifindex = local->dev->ifindex; else ifindex = 0; @@ -2219,13 +2219,10 @@ static void __net_exit list_vports_from_net(struct net *net, struct net *dnet, struct vport *vport; hlist_for_each_entry(vport, &dp->ports[i], dp_hash_node) { - struct netdev_vport *netdev_vport; - if (vport->ops->type != OVS_VPORT_TYPE_INTERNAL) continue; - netdev_vport = netdev_vport_priv(vport); - if (dev_net(netdev_vport->dev) == dnet) + if (dev_net(vport->dev) == dnet) list_add(&vport->detach_list, head); } } diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c index 2c631fe..a7a80a6 100644 --- a/net/openvswitch/dp_notify.c +++ b/net/openvswitch/dp_notify.c @@ -58,13 +58,10 @@ void ovs_dp_notify_wq(struct work_struct *work) struct hlist_node *n; hlist_for_each_entry_safe(vport, n, &dp->ports[i], dp_hash_node) { - struct netdev_vport *netdev_vport; - if (vport->ops->type != OVS_VPORT_TYPE_NETDEV) continue; - netdev_vport = netdev_vport_priv(vport); - if (!(netdev_vport->dev->priv_flags & IFF_OVS_DATAPATH)) + if (!(vport->dev->priv_flags & IFF_OVS_DATAPATH)) dp_detach_port_notify(vport); } } diff --git a/net/openvswitch/vport-internal_dev.c b/net/openvswitch/vport-internal_dev.c index 6a55f71..a2c205d 100644 --- a/net/openvswitch/vport-internal_dev.c +++ b/net/openvswitch/vport-internal_dev.c @@ -156,49 +156,44 @@ static void do_setup(struct net_device *netdev) static struct vport *internal_dev_create(const struct vport_parms *parms) { struct vport *vport; - struct netdev_vport *netdev_vport; struct internal_dev *internal_dev; int err; - vport = ovs_vport_alloc(sizeof(struct netdev_vport), - &ovs_internal_vport_ops, parms); + vport = ovs_vport_alloc(0, &ovs_internal_vport_ops, parms); if (IS_ERR(vport)) { err = PTR_ERR(vport); goto error; } - netdev_vport = netdev_vport_priv(vport); - - netdev_vport->dev = alloc_netdev(sizeof(struct internal_dev), -parms->name, NET_NAME_UNKNOWN, -do_setup); - if (!netdev_vport->dev) { + vport->dev = alloc_netdev(sizeof(struct internal_dev), + parms->name, NET_NAME_UNKNOWN, do_setup); + if (!vport->dev) { err = -ENOMEM; goto error_free_vport; } - dev_net_set(netdev_vport->dev, ovs_dp_get_net(vport->dp)); - internal_dev = internal_dev_priv(netdev_vport->dev); + dev_net_set(vport->dev, ovs_dp_get_net(vport->dp)); + internal_dev = internal_dev_priv(vport->dev); internal_dev->vport = vport; /* Restrict bridge port to current netns. */ if (vport->port_no == OVSP_LOCAL) - netdev_vport->dev->features |= NETIF_F_NETNS_LOCAL; + vport->dev->features |= NETIF_F_NETNS_LOCAL; rtnl_lock(); - err = register_netdevice(netdev_vport->dev); + err = register_netdevice(vport->dev); if (err) goto error_free_netdev; - dev_set_promiscuity(netdev_vport->dev, 1); + dev_set_promiscuity(vport->dev, 1); rtnl_unlock(); - netif_start_queue(netdev_vport->dev); + netif_start_queue(vport->dev);
[RFC net-next 06/22] ipv4: redirect dst output to lwtunnel output
From: Roopa Prabhu For input routes with tunnel encap state this patch redirects dst output functions to lwtunnel_output which later resolves to the corresponding lwtunnel output function. This has been tested to work with mpls ip tunnels. Open items: Support for tunnel mtu, pmtu, fragmentation can be added by hooking into the corresponding (ipv4, ipv6) dst ops. We may do this differently when lwtstate moves to dst or dst_metadata as per upstream discussions. Signed-off-by: Roopa Prabhu --- net/ipv4/route.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/ipv4/route.c b/net/ipv4/route.c index bfc0a18..d3964fa 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -1634,6 +1634,8 @@ static int __mkroute_input(struct sk_buff *skb, rth->dst.output = ip_output; rt_set_nexthop(rth, daddr, res, fnhe, res->fi, res->type, itag); + if (lwtunnel_output_redirect(rth->rt_lwtstate)) + rth->dst.output = lwtunnel_output; skb_dst_set(skb, &rth->dst); out: err = 0; -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC net-next 17/22] vxlan: Factor out device configuration
This factors out the device configuration out of the RTNL newlink API which allows for in-kernel creation of VXLAN net_devices. Signed-off-by: Thomas Graf --- drivers/net/vxlan.c | 332 include/net/vxlan.h | 59 ++ 2 files changed, 236 insertions(+), 155 deletions(-) diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index 773b6bf..020c941 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -55,10 +55,6 @@ #define PORT_HASH_BITS 8 #define PORT_HASH_SIZE (1remote_port && rdst->remote_port != vxlan->dst_port && + if (rdst->remote_port && rdst->remote_port != vxlan->cfg.dst_port && nla_put_be16(skb, NDA_PORT, rdst->remote_port)) goto nla_put_failure; if (rdst->remote_vni != vxlan->default_dst.remote_vni && @@ -750,7 +707,8 @@ static int vxlan_fdb_create(struct vxlan_dev *vxlan, if (!(flags & NLM_F_CREATE)) return -ENOENT; - if (vxlan->addrmax && vxlan->addrcnt >= vxlan->addrmax) + if (vxlan->cfg.addrmax && + vxlan->addrcnt >= vxlan->cfg.addrmax) return -ENOSPC; /* Disallow replace to add a multicast entry */ @@ -836,7 +794,7 @@ static int vxlan_fdb_parse(struct nlattr *tb[], struct vxlan_dev *vxlan, return -EINVAL; *port = nla_get_be16(tb[NDA_PORT]); } else { - *port = vxlan->dst_port; + *port = vxlan->cfg.dst_port; } if (tb[NDA_VNI]) { @@ -1022,7 +980,7 @@ static bool vxlan_snoop(struct net_device *dev, vxlan_fdb_create(vxlan, src_mac, src_ip, NUD_REACHABLE, NLM_F_EXCL|NLM_F_CREATE, -vxlan->dst_port, +vxlan->cfg.dst_port, vxlan->default_dst.remote_vni, 0, NTF_SELF); spin_unlock(&vxlan->hash_lock); @@ -1951,7 +1909,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev, info = skb_tunnel_info(skb, AF_INET); if (rdst) { - dst_port = rdst->remote_port ? rdst->remote_port : vxlan->dst_port; + dst_port = rdst->remote_port ? rdst->remote_port : vxlan->cfg.dst_port; vni = rdst->remote_vni; dst = &rdst->remote_ip; } else { @@ -1961,7 +1919,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev, goto drop; } - dst_port = info->key.tp_dst ? : vxlan->dst_port; + dst_port = info->key.tp_dst ? : vxlan->cfg.dst_port; vni = be64_to_cpu(info->key.tun_id); remote_ip.sin.sin_family = AF_INET; remote_ip.sin.sin_addr.s_addr = info->key.ipv4_dst; @@ -1979,16 +1937,16 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev, old_iph = ip_hdr(skb); - ttl = vxlan->ttl; + ttl = vxlan->cfg.ttl; if (!ttl && vxlan_addr_multicast(dst)) ttl = 1; - tos = vxlan->tos; + tos = vxlan->cfg.tos; if (tos == 1) tos = ip_tunnel_get_dsfield(old_iph, skb); - src_port = udp_flow_src_port(dev_net(dev), skb, vxlan->port_min, -vxlan->port_max, true); + src_port = udp_flow_src_port(dev_net(dev), skb, vxlan->cfg.port_min, +vxlan->cfg.port_max, true); if (dst->sa.sa_family == AF_INET) { if (info) { @@ -2014,7 +1972,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev, fl4.flowi4_mark = skb->mark; fl4.flowi4_proto = IPPROTO_UDP; fl4.daddr = dst->sin.sin_addr.s_addr; - fl4.saddr = vxlan->saddr.sin.sin_addr.s_addr; + fl4.saddr = vxlan->cfg.saddr.sin.sin_addr.s_addr; rt = ip_route_output_key(vxlan->net, &fl4); if (IS_ERR(rt)) { @@ -2070,7 +2028,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev, memset(&fl6, 0, sizeof(fl6)); fl6.flowi6_oif = rdst ? rdst->remote_ifindex : 0; fl6.daddr = dst->sin6.sin6_addr; - fl6.saddr = vxlan->saddr.sin6.sin6_addr; + fl6.saddr = vxlan->cfg.saddr.sin6.sin6_addr; fl6.flowi6_mark = skb->mark; fl6.flowi6_proto = IPPROTO_UDP; @@ -2241,7 +2199,7 @@ static void vxlan_cleanup(unsigned long arg) if (f->state & NUD_PERMANENT) continue; - timeo
[RFC net-next 15/22] route: Per route IP tunnel metadata via lightweight tunnel
This introduces a new IP tunnel lightweight tunnel type which allows to specify IP tunnel instructions per route. Only IPv4 is supported at this point. Signed-off-by: Thomas Graf --- drivers/net/vxlan.c| 10 +++- include/net/dst_metadata.h | 12 - include/net/ip_tunnels.h | 7 ++- include/uapi/linux/lwtunnel.h | 1 + include/uapi/linux/rtnetlink.h | 15 ++ net/ipv4/ip_tunnel_core.c | 114 + net/ipv4/route.c | 2 +- net/openvswitch/vport.h| 1 + 8 files changed, 157 insertions(+), 5 deletions(-) diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index 4dfb8a7..773b6bf 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -1930,7 +1930,7 @@ static void vxlan_encap_bypass(struct sk_buff *skb, struct vxlan_dev *src_vxlan, static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev, struct vxlan_rdst *rdst, bool did_rsc) { - struct ip_tunnel_info *info = skb_tunnel_info(skb); + struct ip_tunnel_info *info; struct vxlan_dev *vxlan = netdev_priv(dev); struct sock *sk = vxlan->vn_sock->sock->sk; struct rtable *rt = NULL; @@ -1947,6 +1947,9 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev, int err; u32 flags = vxlan->flags; + /* FIXME: Support IPv6 */ + info = skb_tunnel_info(skb, AF_INET); + if (rdst) { dst_port = rdst->remote_port ? rdst->remote_port : vxlan->dst_port; vni = rdst->remote_vni; @@ -2136,12 +2139,15 @@ tx_free: static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev) { struct vxlan_dev *vxlan = netdev_priv(dev); - const struct ip_tunnel_info *info = skb_tunnel_info(skb); + const struct ip_tunnel_info *info; struct ethhdr *eth; bool did_rsc = false; struct vxlan_rdst *rdst, *fdst = NULL; struct vxlan_fdb *f; + /* FIXME: Support IPv6 */ + info = skb_tunnel_info(skb, AF_INET); + skb_reset_mac_header(skb); eth = eth_hdr(skb); diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h index e843937..fc03491 100644 --- a/include/net/dst_metadata.h +++ b/include/net/dst_metadata.h @@ -23,13 +23,23 @@ static inline struct metadata_dst *skb_metadata_dst(struct sk_buff *skb) return NULL; } -static inline struct ip_tunnel_info *skb_tunnel_info(struct sk_buff *skb) +static inline struct ip_tunnel_info *skb_tunnel_info(struct sk_buff *skb, +int family) { struct metadata_dst *md_dst = skb_metadata_dst(skb); + struct rtable *rt; if (md_dst) return &md_dst->u.tun_info; + switch(family) { + case AF_INET: + rt = (struct rtable *)skb_dst(skb); + if (rt && rt->rt_lwtstate) + return lwt_tun_info(rt->rt_lwtstate); + break; + } + return NULL; } diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h index d11530f..0b7e18c 100644 --- a/include/net/ip_tunnels.h +++ b/include/net/ip_tunnels.h @@ -9,9 +9,9 @@ #include #include #include -#include #include #include +#include #if IS_ENABLED(CONFIG_IPV6) #include @@ -298,6 +298,11 @@ static inline void *ip_tunnel_info_opts(struct ip_tunnel_info *info, size_t n) return info + 1; } +static inline struct ip_tunnel_info *lwt_tun_info(struct lwtunnel_state *lwtstate) +{ + return (struct ip_tunnel_info *)lwtstate->data; +} + #endif /* CONFIG_INET */ #endif /* __NET_IP_TUNNELS_H */ diff --git a/include/uapi/linux/lwtunnel.h b/include/uapi/linux/lwtunnel.h index aa611d9..31377bb 100644 --- a/include/uapi/linux/lwtunnel.h +++ b/include/uapi/linux/lwtunnel.h @@ -6,6 +6,7 @@ enum lwtunnel_encap_types { LWTUNNEL_ENCAP_NONE, LWTUNNEL_ENCAP_MPLS, + LWTUNNEL_ENCAP_IP, __LWTUNNEL_ENCAP_MAX, }; diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h index 0d3d3cc..47d24cb 100644 --- a/include/uapi/linux/rtnetlink.h +++ b/include/uapi/linux/rtnetlink.h @@ -286,6 +286,21 @@ enum rt_class_t { /* Routing message attributes */ +enum ip_tunnel_t { + IP_TUN_UNSPEC, + IP_TUN_ID, + IP_TUN_DST, + IP_TUN_SRC, + IP_TUN_TTL, + IP_TUN_TOS, + IP_TUN_SPORT, + IP_TUN_DPORT, + IP_TUN_FLAGS, + __IP_TUN_MAX, +}; + +#define IP_TUN_MAX (__IP_TUN_MAX - 1) + enum rtattr_type_t { RTA_UNSPEC, RTA_DST, diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c index 6a51a71..f4f2100 100644 --- a/net/ipv4/ip_tunnel_core.c +++ b/net/ipv4/ip_tunnel_core.c @@ -190,3 +190,117 @@ struct rtnl_link_stats64 *ip_tunnel_get_stats64(struct net_device *dev, return tot; } EXPORT_SYMBOL_GPL(ip_tunnel_get_stats64); + +static const struc
[RFC net-next 07/22] ipv6: rt6_info output redirect to tunnel output
From: Roopa Prabhu This is similar to ipv4 redirect of dst output to lwtunnel output function for encapsulation and xmit. Signed-off-by: Roopa Prabhu --- net/ipv6/route.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/ipv6/route.c b/net/ipv6/route.c index acfe25d..6c328bb 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -1783,6 +1783,7 @@ int ip6_route_add(struct fib6_config *cfg) goto out; lwtunnel_state_get(lwtstate); rt->rt6i_lwtstate = lwtstate; + rt->dst.output = lwtunnel_output6; } ipv6_addr_prefix(&rt->rt6i_dst.addr, &cfg->fc_dst, cfg->fc_dst_len); -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC net-next 18/22] openvswitch: Make tunnel set action attach a metadata dst
Utilize the new metadata dst to attach encapsulation instructions to the skb. The existing egress_tun_info via the OVS_CB() is left in place until all tunnel vports have been converted to the new method. Signed-off-by: Thomas Graf Signed-off-by: Pravin B Shelar --- net/openvswitch/actions.c | 10 ++- net/openvswitch/datapath.c | 8 +++--- net/openvswitch/flow.h | 6 + net/openvswitch/flow_netlink.c | 61 +- net/openvswitch/flow_netlink.h | 1 + 5 files changed, 74 insertions(+), 12 deletions(-) diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c index 27c1687..cf04c2f 100644 --- a/net/openvswitch/actions.c +++ b/net/openvswitch/actions.c @@ -733,7 +733,15 @@ static int execute_set_action(struct sk_buff *skb, { /* Only tunnel set execution is supported without a mask. */ if (nla_type(a) == OVS_KEY_ATTR_TUNNEL_INFO) { - OVS_CB(skb)->egress_tun_info = nla_data(a); + struct ovs_tunnel_info *tun = nla_data(a); + + skb_dst_drop(skb); + dst_hold((struct dst_entry *)tun->tun_dst); + skb_dst_set(skb, (struct dst_entry *)tun->tun_dst); + + /* FIXME: Remove when all vports have been converted */ + OVS_CB(skb)->egress_tun_info = &tun->tun_dst->u.tun_info; + return 0; } diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c index ff8c4a4..0208210 100644 --- a/net/openvswitch/datapath.c +++ b/net/openvswitch/datapath.c @@ -1018,7 +1018,7 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info) } ovs_unlock(); - ovs_nla_free_flow_actions(old_acts); + ovs_nla_free_flow_actions_rcu(old_acts); ovs_flow_free(new_flow, false); } @@ -1030,7 +1030,7 @@ err_unlock_ovs: ovs_unlock(); kfree_skb(reply); err_kfree_acts: - kfree(acts); + ovs_nla_free_flow_actions(acts); err_kfree_flow: ovs_flow_free(new_flow, false); error: @@ -1157,7 +1157,7 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info) if (reply) ovs_notify(&dp_flow_genl_family, reply, info); if (old_acts) - ovs_nla_free_flow_actions(old_acts); + ovs_nla_free_flow_actions_rcu(old_acts); return 0; @@ -1165,7 +1165,7 @@ err_unlock_ovs: ovs_unlock(); kfree_skb(reply); err_kfree_acts: - kfree(acts); + ovs_nla_free_flow_actions(acts); error: return error; } diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h index cadc6c5..3fbf874 100644 --- a/net/openvswitch/flow.h +++ b/net/openvswitch/flow.h @@ -33,6 +33,7 @@ #include #include #include +#include struct sk_buff; @@ -45,6 +46,11 @@ struct sk_buff; #define TUN_METADATA_OPTS(flow_key, opt_len) \ ((void *)((flow_key)->tun_opts + TUN_METADATA_OFFSET(opt_len))) +struct ovs_tunnel_info +{ + struct metadata_dst *tun_dst; +}; + #define OVS_SW_FLOW_KEY_METADATA_SIZE \ (offsetof(struct sw_flow_key, recirc_id) + \ FIELD_SIZEOF(struct sw_flow_key, recirc_id)) diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c index ecfa530..05fe46b 100644 --- a/net/openvswitch/flow_netlink.c +++ b/net/openvswitch/flow_netlink.c @@ -1548,11 +1548,45 @@ static struct sw_flow_actions *nla_alloc_flow_actions(int size, bool log) return sfa; } +static void ovs_nla_free_set_action(const struct nlattr *a) +{ + const struct nlattr *ovs_key = nla_data(a); + struct ovs_tunnel_info *ovs_tun; + + switch (nla_type(ovs_key)) { + case OVS_KEY_ATTR_TUNNEL_INFO: + ovs_tun = nla_data(ovs_key); + dst_release((struct dst_entry *)ovs_tun->tun_dst); + break; + } +} + +void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts) +{ + const struct nlattr *a; + int rem; + + nla_for_each_attr(a, sf_acts->actions, sf_acts->actions_len, rem) { + switch (nla_type(a)) { + case OVS_ACTION_ATTR_SET: + ovs_nla_free_set_action(a); + break; + } + } + + kfree(sf_acts); +} + +static void __ovs_nla_free_flow_actions(struct rcu_head *head) +{ + ovs_nla_free_flow_actions(container_of(head, struct sw_flow_actions, rcu)); +} + /* Schedules 'sf_acts' to be freed after the next RCU grace period. * The caller must hold rcu_read_lock for this to be sensible. */ -void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts) +void ovs_nla_free_flow_actions_rcu(struct sw_flow_actions *sf_acts) { - kfree_rcu(sf_acts, rcu); + call_rcu(&sf_acts->rcu, __ovs_nla_free_flow_actions); } static struct nlattr *reserve_sfa_size(struct sw_flow_actions **sfa, @@ -1746,7 +
[RFC net-next 13/22] vxlan: Flow based tunneling
Allows putting a VXLAN device into a new flow-based mode in which it will populate a ip_tunnel_info struct for each packet received and attach it to the skb using the new metadata dst. The metadata structure will contain the outer header and tunnel header fields which have been stripped off. Layers further up in the stack such as routing, tc or netfitler can later match on these fields and perform forwarding. Similar on the transmit side, while in flow based mode, skbs with a ip_tunnel_info dst metadata attached will be encapsulated according to the instructions stored in there with the VXLAN device defaults taken into consideration. This prepares the VXLAN device to be steered by the routing and other subsystems which allows to support encapsulation for a large number of tunnel endpoints and tunnel ids through a single net_device which improves the scalability. It also allows for OVS to leverage this mode which in turn allows for the removal of the OVS specific VXLAN code. Because the skb is currently scrubed in vxlan_rcv(), the attachment of the new dst metadata is postponed until after scrubing which requires the temporary addition of a new member to vxlan_metadata. This member is removed again in a later commit after the indirect VXLAN receive API has been removed. Signed-off-by: Thomas Graf Signed-off-by: Pravin B Shelar --- drivers/net/vxlan.c | 149 --- include/linux/skbuff.h | 1 + include/net/dst_metadata.h | 13 include/net/ip_tunnels.h | 14 include/net/vxlan.h | 8 ++- include/uapi/linux/if_link.h | 1 + 6 files changed, 163 insertions(+), 23 deletions(-) diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index 34c519e..4dfb8a7 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -49,6 +49,7 @@ #include #include #endif +#include #define VXLAN_VERSION "0.1" @@ -1164,10 +1165,13 @@ static struct vxlanhdr *vxlan_remcsum(struct sk_buff *skb, struct vxlanhdr *vh, /* Callback from net/ipv4/udp.c to receive packets */ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb) { + struct metadata_dst *tun_dst = NULL; + struct ip_tunnel_info *info; struct vxlan_sock *vs; struct vxlanhdr *vxh; u32 flags, vni; - struct vxlan_metadata md = {0}; + struct vxlan_metadata _md; + struct vxlan_metadata *md = &_md; /* Need Vxlan and inner Ethernet header to be present */ if (!pskb_may_pull(skb, VXLAN_HLEN)) @@ -1202,6 +1206,33 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb) vni &= VXLAN_VNI_MASK; } + if (vs->flags & VXLAN_F_FLOW_BASED) { + const struct iphdr *iph = ip_hdr(skb); + + tun_dst = metadata_dst_alloc(sizeof(*md), GFP_ATOMIC); + if (!tun_dst) + goto drop; + + info = &tun_dst->u.tun_info; + info->key.ipv4_src = iph->saddr; + info->key.ipv4_dst = iph->daddr; + info->key.ipv4_tos = iph->tos; + info->key.ipv4_ttl = iph->ttl; + info->key.tp_src = udp_hdr(skb)->source; + info->key.tp_dst = udp_hdr(skb)->dest; + + info->mode = IP_TUNNEL_INFO_RX; + info->key.tun_flags = TUNNEL_KEY; + info->key.tun_id = cpu_to_be64(vni >> 8); + if (udp_hdr(skb)->check != 0) + info->key.tun_flags |= TUNNEL_CSUM; + + md = ip_tunnel_info_opts(info, sizeof(*md)); + md->tun_dst = tun_dst; + } else { + memset(md, 0, sizeof(*md)); + } + /* For backwards compatibility, only allow reserved fields to be * used by VXLAN extensions if explicitly requested. */ @@ -1209,13 +1240,16 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb) struct vxlanhdr_gbp *gbp; gbp = (struct vxlanhdr_gbp *)vxh; - md.gbp = ntohs(gbp->policy_id); + md->gbp = ntohs(gbp->policy_id); + + if (tun_dst) + info->key.tun_flags |= TUNNEL_VXLAN_OPT; if (gbp->dont_learn) - md.gbp |= VXLAN_GBP_DONT_LEARN; + md->gbp |= VXLAN_GBP_DONT_LEARN; if (gbp->policy_applied) - md.gbp |= VXLAN_GBP_POLICY_APPLIED; + md->gbp |= VXLAN_GBP_POLICY_APPLIED; flags &= ~VXLAN_GBP_USED_BITS; } @@ -1233,8 +1267,8 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb) goto bad_flags; } - md.vni = vxh->vx_vni; - vs->rcv(vs, skb, &md); + md->vni = vxh->vx_vni; + vs->rcv(vs, skb, md); return 0; drop: @@ -1247,6 +1281,9 @@ bad_flags: ntohl(vxh->vx_flags), ntohl(vxh->
[RFC net-next 20/22] openvswitch: Abstract vport name through ovs_vport_name()
This allows to get rid of the get_name() vport ops later on. Signed-off-by: Thomas Graf --- net/openvswitch/datapath.c | 4 ++-- net/openvswitch/vport-internal_dev.c | 1 - net/openvswitch/vport-netdev.c | 6 -- net/openvswitch/vport-netdev.h | 1 - net/openvswitch/vport.c | 4 ++-- net/openvswitch/vport.h | 5 + 6 files changed, 9 insertions(+), 12 deletions(-) diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c index 19df28e..ffe984f 100644 --- a/net/openvswitch/datapath.c +++ b/net/openvswitch/datapath.c @@ -176,7 +176,7 @@ static inline struct datapath *get_dp(struct net *net, int dp_ifindex) const char *ovs_dp_name(const struct datapath *dp) { struct vport *vport = ovs_vport_ovsl_rcu(dp, OVSP_LOCAL); - return vport->ops->get_name(vport); + return ovs_vport_name(vport); } static int get_dpifindex(const struct datapath *dp) @@ -1800,7 +1800,7 @@ static int ovs_vport_cmd_fill_info(struct vport *vport, struct sk_buff *skb, if (nla_put_u32(skb, OVS_VPORT_ATTR_PORT_NO, vport->port_no) || nla_put_u32(skb, OVS_VPORT_ATTR_TYPE, vport->ops->type) || nla_put_string(skb, OVS_VPORT_ATTR_NAME, - vport->ops->get_name(vport))) + ovs_vport_name(vport))) goto nla_put_failure; ovs_vport_get_stats(vport, &vport_stats); diff --git a/net/openvswitch/vport-internal_dev.c b/net/openvswitch/vport-internal_dev.c index a2c205d..c058bbf 100644 --- a/net/openvswitch/vport-internal_dev.c +++ b/net/openvswitch/vport-internal_dev.c @@ -242,7 +242,6 @@ static struct vport_ops ovs_internal_vport_ops = { .type = OVS_VPORT_TYPE_INTERNAL, .create = internal_dev_create, .destroy= internal_dev_destroy, - .get_name = ovs_netdev_get_name, .send = internal_dev_recv, }; diff --git a/net/openvswitch/vport-netdev.c b/net/openvswitch/vport-netdev.c index 1c96966..e682bdc 100644 --- a/net/openvswitch/vport-netdev.c +++ b/net/openvswitch/vport-netdev.c @@ -171,11 +171,6 @@ static void netdev_destroy(struct vport *vport) call_rcu(&vport->rcu, free_port_rcu); } -const char *ovs_netdev_get_name(const struct vport *vport) -{ - return vport->dev->name; -} - static unsigned int packet_length(const struct sk_buff *skb) { unsigned int length = skb->len - ETH_HLEN; @@ -223,7 +218,6 @@ static struct vport_ops ovs_netdev_vport_ops = { .type = OVS_VPORT_TYPE_NETDEV, .create = netdev_create, .destroy= netdev_destroy, - .get_name = ovs_netdev_get_name, .send = netdev_send, }; diff --git a/net/openvswitch/vport-netdev.h b/net/openvswitch/vport-netdev.h index 1c52aed..684fb88 100644 --- a/net/openvswitch/vport-netdev.h +++ b/net/openvswitch/vport-netdev.h @@ -26,7 +26,6 @@ struct vport *ovs_netdev_get_vport(struct net_device *dev); -const char *ovs_netdev_get_name(const struct vport *); void ovs_netdev_detach_dev(struct vport *); int __init ovs_netdev_init(void); diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c index af23ba0..d14f594 100644 --- a/net/openvswitch/vport.c +++ b/net/openvswitch/vport.c @@ -113,7 +113,7 @@ struct vport *ovs_vport_locate(const struct net *net, const char *name) struct vport *vport; hlist_for_each_entry_rcu(vport, bucket, hash_node) - if (!strcmp(name, vport->ops->get_name(vport)) && + if (!strcmp(name, ovs_vport_name(vport)) && net_eq(ovs_dp_get_net(vport->dp), net)) return vport; @@ -226,7 +226,7 @@ struct vport *ovs_vport_add(const struct vport_parms *parms) } bucket = hash_bucket(ovs_dp_get_net(vport->dp), -vport->ops->get_name(vport)); +ovs_vport_name(vport)); hlist_add_head_rcu(&vport->hash_node, bucket); return vport; } diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h index e05ec68..1a689c2 100644 --- a/net/openvswitch/vport.h +++ b/net/openvswitch/vport.h @@ -237,6 +237,11 @@ static inline void ovs_skb_postpush_rcsum(struct sk_buff *skb, skb->csum = csum_add(skb->csum, csum_partial(start, len, 0)); } +static inline const char *ovs_vport_name(struct vport *vport) +{ + return vport->dev ? vport->dev->name : vport->ops->get_name(vport); +} + int ovs_vport_ops_register(struct vport_ops *ops); void ovs_vport_ops_unregister(struct vport_ops *ops); -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC net-next 11/22] dst: Metadata destinations
Introduces a new dst_metadata which enables to carry per packet metadata between forwarding and processing elements via the skb->dst pointer. The structure is set up to be a union. Thus, each separate type of metadata requires its own dst instance. If demand arises to carry multiple types of metadata concurrently, metadata dst entries can be made stackable. The metadata dst entry is refcnt'ed as expected for now but a non reference counted use is possible if the reference is forced before queueing the skb. In order to allow allocating dsts with variable length, the existing dst_alloc() is split into a dst_alloc() and dst_init() function. The existing dst_init() function to initialize the subsystem is being renamed to dst_subsys_init() to make it clear what is what. The check before ip_route_input() is changed to ignore metadata dsts and drop the dst inside the routing function thus allowing to interpret metadata in a later commit. Signed-off-by: Thomas Graf --- include/net/dst.h | 5 ++- include/net/dst_metadata.h | 32 +++ net/core/dev.c | 2 +- net/core/dst.c | 76 ++ net/ipv4/ip_input.c| 3 +- net/ipv4/route.c | 2 ++ 6 files changed, 104 insertions(+), 16 deletions(-) create mode 100644 include/net/dst_metadata.h diff --git a/include/net/dst.h b/include/net/dst.h index 2bc73f8a..fd69bb9 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -57,6 +57,7 @@ struct dst_entry { #define DST_FAKE_RTABLE0x0040 #define DST_XFRM_TUNNEL0x0080 #define DST_XFRM_QUEUE 0x0100 +#define DST_METADATA 0x0200 unsigned short pending_confirm; @@ -356,6 +357,8 @@ static inline int dst_discard(struct sk_buff *skb) } void *dst_alloc(struct dst_ops *ops, struct net_device *dev, int initial_ref, int initial_obsolete, unsigned short flags); +void dst_init(struct dst_entry *dst, struct dst_ops *ops, struct net_device *dev, + int initial_ref, int initial_obsolete, unsigned short flags); void __dst_free(struct dst_entry *dst); struct dst_entry *dst_destroy(struct dst_entry *dst); @@ -457,7 +460,7 @@ static inline struct dst_entry *dst_check(struct dst_entry *dst, u32 cookie) return dst; } -void dst_init(void); +void dst_subsys_init(void); /* Flags for xfrm_lookup flags argument. */ enum { diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h new file mode 100644 index 000..4f7694f --- /dev/null +++ b/include/net/dst_metadata.h @@ -0,0 +1,32 @@ +#ifndef __NET_DST_METADATA_H +#define __NET_DST_METADATA_H 1 + +#include +#include +#include + +struct metadata_dst { + struct dst_entrydst; + size_t opts_len; +}; + +static inline struct metadata_dst *skb_metadata_dst(struct sk_buff *skb) +{ + struct metadata_dst *md_dst = (struct metadata_dst *) skb_dst(skb); + + if (md_dst && md_dst->dst.flags & DST_METADATA) + return md_dst; + + return NULL; +} + +static inline bool skb_valid_dst(const struct sk_buff *skb) +{ + struct dst_entry *dst = skb_dst(skb); + + return dst && !(dst->flags & DST_METADATA); +} + +struct metadata_dst *metadata_dst_alloc(u8 optslen, gfp_t flags); + +#endif /* __NET_DST_METADATA_H */ diff --git a/net/core/dev.c b/net/core/dev.c index 6778a99..97f6d47 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -7640,7 +7640,7 @@ static int __init net_dev_init(void) open_softirq(NET_RX_SOFTIRQ, net_rx_action); hotcpu_notifier(dev_cpu_callback, 0); - dst_init(); + dst_subsys_init(); rc = 0; out: return rc; diff --git a/net/core/dst.c b/net/core/dst.c index e956ce6..5372450 100644 --- a/net/core/dst.c +++ b/net/core/dst.c @@ -22,6 +22,7 @@ #include #include +#include /* * Theory of operations: @@ -158,19 +159,9 @@ const u32 dst_default_metrics[RTAX_MAX + 1] = { [RTAX_MAX] = 0xdeadbeef, }; - -void *dst_alloc(struct dst_ops *ops, struct net_device *dev, - int initial_ref, int initial_obsolete, unsigned short flags) +void dst_init(struct dst_entry *dst, struct dst_ops *ops, struct net_device *dev, + int initial_ref, int initial_obsolete, unsigned short flags) { - struct dst_entry *dst; - - if (ops->gc && dst_entries_get_fast(ops) > ops->gc_thresh) { - if (ops->gc(ops)) - return NULL; - } - dst = kmem_cache_alloc(ops->kmem_cachep, GFP_ATOMIC); - if (!dst) - return NULL; dst->child = NULL; dst->dev = dev; if (dev) @@ -200,6 +191,25 @@ void *dst_alloc(struct dst_ops *ops, struct net_device *dev, dst->next = NULL; if (!(flags & DST_NOCOUNT)) dst_entries_add(ops, 1); +} +EXPORT_SYMBOL(dst_init); + +void *dst_alloc(struct dst_ops *ops, struct ne
[RFC net-next 12/22] arp: Inherit metadata dst when creating ARP requests
If output device wants to see the dst, inherit the dst of the original skb and pass it on to generate the ARP request. Signed-off-by: Thomas Graf --- net/ipv4/arp.c | 71 +++--- 1 file changed, 43 insertions(+), 28 deletions(-) diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c index 933a928..3400aea 100644 --- a/net/ipv4/arp.c +++ b/net/ipv4/arp.c @@ -291,6 +291,46 @@ static void arp_error_report(struct neighbour *neigh, struct sk_buff *skb) kfree_skb(skb); } +/* + * Create and send an arp packet. + */ +static void arp_send_dst(int type, int ptype, __be32 dest_ip, +struct net_device *dev, __be32 src_ip, +const unsigned char *dest_hw, +const unsigned char *src_hw, +const unsigned char *target_hw, struct sk_buff *oskb) +{ + struct sk_buff *skb; + + /* +* No arp on this interface. +*/ + + if (dev->flags&IFF_NOARP) + return; + + skb = arp_create(type, ptype, dest_ip, dev, src_ip, +dest_hw, src_hw, target_hw); + if (!skb) + return; + + if (oskb) + skb_dst_copy(skb, oskb); + + arp_xmit(skb); +} + +void arp_send(int type, int ptype, __be32 dest_ip, + struct net_device *dev, __be32 src_ip, + const unsigned char *dest_hw, const unsigned char *src_hw, + const unsigned char *target_hw) +{ + arp_send_dst(type, ptype, dest_ip, dev, src_ip, dest_hw, src_hw, +target_hw, NULL); +} +EXPORT_SYMBOL(arp_send); + + static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb) { __be32 saddr = 0; @@ -346,8 +386,9 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb) } } - arp_send(ARPOP_REQUEST, ETH_P_ARP, target, dev, saddr, -dst_hw, dev->dev_addr, NULL); + arp_send_dst(ARPOP_REQUEST, ETH_P_ARP, target, dev, saddr, +dst_hw, dev->dev_addr, NULL, +dev->priv_flags & IFF_XMIT_DST_RELEASE ? NULL : skb); } static int arp_ignore(struct in_device *in_dev, __be32 sip, __be32 tip) @@ -597,32 +638,6 @@ void arp_xmit(struct sk_buff *skb) EXPORT_SYMBOL(arp_xmit); /* - * Create and send an arp packet. - */ -void arp_send(int type, int ptype, __be32 dest_ip, - struct net_device *dev, __be32 src_ip, - const unsigned char *dest_hw, const unsigned char *src_hw, - const unsigned char *target_hw) -{ - struct sk_buff *skb; - - /* -* No arp on this interface. -*/ - - if (dev->flags&IFF_NOARP) - return; - - skb = arp_create(type, ptype, dest_ip, dev, src_ip, -dest_hw, src_hw, target_hw); - if (!skb) - return; - - arp_xmit(skb); -} -EXPORT_SYMBOL(arp_send); - -/* * Process an arp request. */ -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC net-next 21/22] openvswitch: Use regular VXLAN net_device device
This gets rid of all OVS specific VXLAN code in the receive and transmit path by using a VXLAN net_device to represent the vport. Only a small shim layer remains which takes care of handling the VXLAN specific OVS Netlink configuration. Unexports vxlan_sock_add(), vxlan_sock_release(), vxlan_xmit_skb() since they are no longer needed. Signed-off-by: Thomas Graf Signed-off-by: Pravin B Shelar --- drivers/net/vxlan.c| 242 +++ include/net/vxlan.h| 24 +-- net/openvswitch/Kconfig| 12 -- net/openvswitch/Makefile | 1 - net/openvswitch/flow_netlink.c | 6 +- net/openvswitch/vport-netdev.c | 177 +- net/openvswitch/vport-vxlan.c | 322 - net/openvswitch/vport-vxlan.h | 11 -- 8 files changed, 298 insertions(+), 497 deletions(-) delete mode 100644 net/openvswitch/vport-vxlan.c delete mode 100644 net/openvswitch/vport-vxlan.h diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index 020c941..dba7a7d 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -75,6 +75,9 @@ static struct rtnl_link_ops vxlan_link_ops; static const u8 all_zeros_mac[ETH_ALEN]; +static struct vxlan_sock *vxlan_sock_add(struct net *net, __be16 port, +bool no_share, u32 flags); + /* per-network namespace private data for this module */ struct vxlan_net { struct list_head vxlan_list; @@ -1021,7 +1024,7 @@ static bool vxlan_group_used(struct vxlan_net *vn, struct vxlan_dev *dev) return false; } -void vxlan_sock_release(struct vxlan_sock *vs) +static void vxlan_sock_release(struct vxlan_sock *vs) { struct sock *sk = vs->sock->sk; struct net *net = sock_net(sk); @@ -1037,7 +1040,6 @@ void vxlan_sock_release(struct vxlan_sock *vs) queue_work(vxlan_wq, &vs->del_work); } -EXPORT_SYMBOL_GPL(vxlan_sock_release); /* Update multicast group membership when first VNI on * multicast address is brought up @@ -1120,6 +1122,102 @@ static struct vxlanhdr *vxlan_remcsum(struct sk_buff *skb, struct vxlanhdr *vh, return vh; } +static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb, + struct vxlan_metadata *md, u32 vni, + struct metadata_dst *tun_dst) +{ + struct iphdr *oip = NULL; + struct ipv6hdr *oip6 = NULL; + struct vxlan_dev *vxlan; + struct pcpu_sw_netstats *stats; + union vxlan_addr saddr; + int err = 0; + union vxlan_addr *remote_ip; + + /* For flow based devices, map all packets to VNI 0 */ + if (vs->flags & VXLAN_F_FLOW_BASED) + vni = 0; + + /* Is this VNI defined? */ + vxlan = vxlan_vs_find_vni(vs, vni); + if (!vxlan) + goto drop; + + remote_ip = &vxlan->default_dst.remote_ip; + skb_reset_mac_header(skb); + skb_scrub_packet(skb, !net_eq(vxlan->net, dev_net(vxlan->dev))); + skb->protocol = eth_type_trans(skb, vxlan->dev); + skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); + + /* Ignore packet loops (and multicast echo) */ + if (ether_addr_equal(eth_hdr(skb)->h_source, vxlan->dev->dev_addr)) + goto drop; + + /* Re-examine inner Ethernet packet */ + if (remote_ip->sa.sa_family == AF_INET) { + oip = ip_hdr(skb); + saddr.sin.sin_addr.s_addr = oip->saddr; + saddr.sa.sa_family = AF_INET; +#if IS_ENABLED(CONFIG_IPV6) + } else { + oip6 = ipv6_hdr(skb); + saddr.sin6.sin6_addr = oip6->saddr; + saddr.sa.sa_family = AF_INET6; +#endif + } + + if (tun_dst) { + skb_dst_set(skb, (struct dst_entry *)tun_dst); + tun_dst = NULL; + } + + if ((vxlan->flags & VXLAN_F_LEARN) && + vxlan_snoop(skb->dev, &saddr, eth_hdr(skb)->h_source)) + goto drop; + + skb_reset_network_header(skb); + /* In flow-based mode, GBP is carried in dst_metadata */ + if (!(vs->flags & VXLAN_F_FLOW_BASED)) + skb->mark = md->gbp; + + if (oip6) + err = IP6_ECN_decapsulate(oip6, skb); + if (oip) + err = IP_ECN_decapsulate(oip, skb); + + if (unlikely(err)) { + if (log_ecn_error) { + if (oip6) + net_info_ratelimited("non-ECT from %pI6\n", +&oip6->saddr); + if (oip) + net_info_ratelimited("non-ECT from %pI4 with TOS=%#x\n", +&oip->saddr, oip->tos); + } + if (err > 1) { + ++vxlan->dev->stats.rx_frame_errors; + ++vxlan->dev->stats.rx_errors; + goto drop; + } + } + +
[RFC net-next 22/22] openvswitch: Use regular GRE net_device instead of vport
From: Pravin Shelar Removes all of the OVS specific GRE code and makes OVS use a GRE net_device. Signed-off-by: Pravin B Shelar --- net/core/dev.c | 5 +- net/ipv4/ip_gre.c | 165 +- net/openvswitch/Makefile | 1 - net/openvswitch/vport-gre.c| 313 - net/openvswitch/vport-netdev.c | 5 +- 5 files changed, 170 insertions(+), 319 deletions(-) delete mode 100644 net/openvswitch/vport-gre.c diff --git a/net/core/dev.c b/net/core/dev.c index 97f6d47..67a8cac 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -6966,6 +6966,9 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name, INIT_LIST_HEAD(&dev->ptype_all); INIT_LIST_HEAD(&dev->ptype_specific); dev->priv_flags = IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM; + + strcpy(dev->name, name); + dev->name_assign_type = name_assign_type; setup(dev); dev->num_tx_queues = txqs; @@ -6980,8 +6983,6 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name, goto free_all; #endif - strcpy(dev->name, name); - dev->name_assign_type = name_assign_type; dev->group = INIT_NETDEV_GROUP; if (!dev->ethtool_ops) dev->ethtool_ops = &default_ethtool_ops; diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c index 5fd7064..d285fb4 100644 --- a/net/ipv4/ip_gre.c +++ b/net/ipv4/ip_gre.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include #include @@ -47,6 +48,7 @@ #include #include #include +#include #if IS_ENABLED(CONFIG_IPV6) #include @@ -115,6 +117,8 @@ static bool log_ecn_error = true; module_param(log_ecn_error, bool, 0644); MODULE_PARM_DESC(log_ecn_error, "Log packets received with corrupted ECN"); +#define GRE_TAP_FB_NAME "gretap0" + static struct rtnl_link_ops ipgre_link_ops __read_mostly; static int ipgre_tunnel_init(struct net_device *dev); @@ -217,7 +221,20 @@ static int ipgre_rcv(struct sk_buff *skb, const struct tnl_ptk_info *tpi) iph->saddr, iph->daddr, tpi->key); if (tunnel) { + skb_pop_mac_header(skb); + if (tunnel->dev == itn->fb_tunnel_dev) { + struct metadata_dst *tun_dst; + + tun_dst = metadata_dst_alloc(0, GFP_ATOMIC); + if (!tun_dst) + return PACKET_REJECT; + + /* TODO: setup tun info from tpi */ + skb_dst_drop(skb); + skb_dst_set(skb, (struct dst_entry *)tun_dst); + } + ip_tunnel_rcv(tunnel, skb, tpi, log_ecn_error); return PACKET_RCVD; } @@ -287,6 +304,135 @@ out: return NETDEV_TX_OK; } +/* TODO: share xmit code */ +static inline struct rtable *tunnel_route_lookup(struct net *net, +const struct ip_tunnel_key *key, +u32 mark, +struct flowi4 *fl, +u8 protocol) +{ + struct rtable *rt; + + memset(fl, 0, sizeof(*fl)); + fl->daddr = key->ipv4_dst; + fl->saddr = key->ipv4_src; + fl->flowi4_tos = RT_TOS(key->ipv4_tos); + fl->flowi4_mark = mark; + fl->flowi4_proto = protocol; + + rt = ip_route_output_key(net, fl); + return rt; +} + + +/* Returns the least-significant 32 bits of a __be64. */ +static __be32 be64_get_low32(__be64 x) +{ +#ifdef __BIG_ENDIAN + return (__force __be32)x; +#else + return (__force __be32)((__force u64)x >> 32); +#endif +} + +static __be16 filter_tnl_flags(__be16 flags) +{ + return flags & (TUNNEL_CSUM | TUNNEL_KEY); +} + + +static struct sk_buff *__build_header(struct sk_buff *skb, + const struct ip_tunnel_info *tun_info, + int tunnel_hlen) +{ + struct tnl_ptk_info tpi; + + skb = gre_handle_offloads(skb, !!(tun_info->key.tun_flags & TUNNEL_CSUM)); + if (IS_ERR(skb)) + return skb; + + tpi.flags = filter_tnl_flags(tun_info->key.tun_flags); + tpi.proto = htons(ETH_P_TEB); + tpi.key = be64_get_low32(tun_info->key.tun_id); + tpi.seq = 0; + gre_build_header(skb, &tpi, tunnel_hlen); + + return skb; +} + +static netdev_tx_t gre_fb_xmit(struct sk_buff *skb, + struct net_device *dev) +{ + struct net *net = dev_net(dev); + struct ip_tunnel_info *tun_info; + const struct ip_tunnel_key *key; + struct flowi4 fl; + struct rtable *rt; + int min_headroom; + int tunnel_hlen; + __be16 df; + int err; + + tun_info = skb_tunnel_info(skb, AF_INET); +
[RFC net-next 14/22] route: Extend flow representation with tunnel key
Add a new flowi_tunnel structure which is a subset of ip_tunnel_key to allow routes to match on tunnel metadata. For now, the tunnel id is added to flowi_tunnel which allows for routes to be bound to specific virtual tunnels. Signed-off-by: Thomas Graf --- include/net/flow.h | 7 +++ net/ipv4/route.c | 6 ++ 2 files changed, 13 insertions(+) diff --git a/include/net/flow.h b/include/net/flow.h index 8109a15..c15fb5e 100644 --- a/include/net/flow.h +++ b/include/net/flow.h @@ -19,6 +19,10 @@ #define LOOPBACK_IFINDEX 1 +struct flowi_tunnel { + __be64 tun_id; +}; + struct flowi_common { int flowic_oif; int flowic_iif; @@ -30,6 +34,7 @@ struct flowi_common { #define FLOWI_FLAG_ANYSRC 0x01 #define FLOWI_FLAG_KNOWN_NH0x02 __u32 flowic_secid; + struct flowi_tunnel flowic_tun_key; }; union flowi_uli { @@ -66,6 +71,7 @@ struct flowi4 { #define flowi4_proto __fl_common.flowic_proto #define flowi4_flags __fl_common.flowic_flags #define flowi4_secid __fl_common.flowic_secid +#define flowi4_tun_key __fl_common.flowic_tun_key /* (saddr,daddr) must be grouped, same order as in IP header */ __be32 saddr; @@ -165,6 +171,7 @@ struct flowi { #define flowi_protou.__fl_common.flowic_proto #define flowi_flagsu.__fl_common.flowic_flags #define flowi_secidu.__fl_common.flowic_secid +#define flowi_tun_key u.__fl_common.flowic_tun_key } __attribute__((__aligned__(BITS_PER_LONG/8))); static inline struct flowi *flowi4_to_flowi(struct flowi4 *fl4) diff --git a/net/ipv4/route.c b/net/ipv4/route.c index ea48880..bf84164c 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -91,6 +91,7 @@ #include #include #include +#include #include #include #include @@ -110,6 +111,7 @@ #include #endif #include +#include #define RT_FL_TOS(oldflp4) \ ((oldflp4)->flowi4_tos & (IPTOS_RT_MASK | RTO_ONLINK)) @@ -1674,6 +1676,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr, { struct fib_result res; struct in_device *in_dev = __in_dev_get_rcu(dev); + struct ip_tunnel_info *tun_info; struct flowi4 fl4; unsigned intflags = 0; u32 itag = 0; @@ -1691,6 +1694,9 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr, by fib_lookup. */ + tun_info = skb_tunnel_info(skb); + if (tun_info && tun_info->mode == IP_TUNNEL_INFO_RX) + fl4.flowi4_tun_key.tun_id = tun_info->key.tun_id; skb_dst_drop(skb); if (ipv4_is_multicast(saddr) || ipv4_is_lbcast(saddr)) -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html