Re: [PATCH] Doc: z8530book: Fix typo in API-z8530-sync-txdma-open.html

2015-07-10 Thread David Miller
From: Masanari Iida 
Date: Fri, 10 Jul 2015 21:20:28 +0900

> This patch fix a spelling typo found in API-z8530-sync-txdma-open.html.
> It is because this file was generated from comment in source,
> I have to fix comment in source.
> 
> Signed-off-by: Masanari Iida 

Applied, t hanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Panic with demuxed ipv4 multicast udp sockets on 4.0.4

2015-07-10 Thread Eric Dumazet
On Fri, 2015-07-10 at 13:31 -0700, Alex Gartrell wrote:
> Hey everyone,
> 
> There's some kind of nasty condition in which sk_rx_dst points to an
> apparently garbage datastructure and it's blowing up in the early
> demux code because dst->ops is NULL.  The packet in question was for
> bit torrent local peer discovery
> https://en.wikipedia.org/wiki/Local_Peer_Discovery .  We're seeing
> this on about a 1/200 chance of panic per day.
> 
> crash> bt
> PID: 1899532  TASK: 88000826cf00  CPU: 9   COMMAND: "hhvm.node.1"
>  #0 [88047fc23990] machine_kexec at 8103bf05
>  #1 [88047fc239e0] crash_kexec at 810cb4e8
>  #2 [88047fc23ab0] oops_end at 81006468
>  #3 [88047fc23ae0] no_context at 8167aac1
>  #4 [88047fc23b40] __bad_area_nosemaphore at 8167acb9
>  #5 [88047fc23b90] bad_area_nosemaphore at 8167aceb
>  #6 [88047fc23ba0] __do_page_fault at 81044ac5
>  #7 [88047fc23c10] do_page_fault at 81044eec
>  #8 [88047fc23c20] page_fault at 81686c02
> [exception RIP: udp_v4_early_demux+481]
> RIP: 816249a1  RSP: 88047fc23cd8  RFLAGS: 00010246
> RAX: 880248ee4500  RBX: 093a  RCX: 0002
> RDX:   RSI:   RDI: 880248ee4500
> RBP: 88047fc23d48   R8:    R9: 
> R10: 0001  R11: c9000199f3a0  R12: 88006f8a6300
> R13: 81cbb1c0  R14: 0001  R15: 880474798f00
> ORIG_RAX:   CS: 0010  SS: 
>  #9 [88047fc23cd0] udp_v4_early_demux at 81624bb3
> #10 [88047fc23d50] ip_rcv_finish at 815f3055
> #11 [88047fc23d80] ip_rcv at 815f3952
> #12 [88047fc23dc0] __netif_receive_skb_core at 815b96d4
> #13 [88047fc23e30] __netif_receive_skb at 815b9911
> #14 [88047fc23e50] process_backlog at 815b99f0
> #15 [88047fc23ea0] net_rx_action at 815ba1e8
> #16 [88047fc23f30] __do_softirq at 81054ce6
> #17 [88047fc23f90] irq_exit at 81055075
> #18 [88047fc23fa0] smp_call_function_single_interrupt at 810319f5
> #19 [88047fc23fb0] call_function_single_interrupt at 8168637a
> ---  ---
> #20 [8800792dff58] call_function_single_interrupt at 8168637a
> RIP: 006e7b4c  RSP: 7f4c8ba38b80  RFLAGS: 0216
> RAX: 006b  RBX: 816851f2  RCX: 7f49f4de84d6
> RDX: 7f49f4de84d8  RSI: 7f48dbcce731  RDI: 
> RBP: 7f4c8ba38bd0   R8: 006b   R9: 
> R10: 7f48dbcce737  R11: 7f49f4de84e0  R12: 7f4adab85198
> R13: 0014  R14: 7f4adaaa4c00  R15: 
> ORIG_RAX: ff04  CS: 0033  SS: 002b
> crash> print *(struct *dst_entry)0x880248ee4500
> A syntax error in expression, near `*dst_entry)0x880248ee4500'.
> gdb: gdb request failed: print *(struct *dst_entry)0x880248ee4500
> crash> print *(struct dst_entry*)0x880248ee4500
> $1 = {
>   callback_head = {
> next = 0x880248ee4d00,
> func = 0x0
>   },
>   child = 0x13eacdfb7df67f6b,
>   dev = 0x880113975d00,
>   ops = 0x0,
>   _metrics = 13729079323838086211,
>   expires = 103079215104,
>   path = 0x24c8d3baa,
>   from = 0x0,
>   xfrm = 0x6,
>   input = 0x0,
>   output = 0x0,
>   flags = 5536,
>   pending_confirm = 33114,
>   error = -1,
>   obsolete = -1,
>   header_len = 0,
>   trailer_len = 0,
>   tclassid = 0,
>   __pad_to_align_refcnt = {0, 704374636708},
>   __refcnt = {
> counter = 14
>   },
>   __use = 2097153,
>   lastuse = 536576,
>   {
> next = 0x0,
> rt_next = 0x0,
> rt6_next = 0x0,
> dn_next = 0x0
>   }
> }

Seems similar to what commit d0c294c53a771 fixed

Have you tried following patch ?

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 83aa604f9273..35c0a4ac540c 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1995,7 +1995,7 @@ void udp_v4_early_demux(struct sk_buff *skb)
 
skb->sk = sk;
skb->destructor = sock_efree;
-   dst = sk->sk_rx_dst;
+   dst = READ_ONCE(sk->sk_rx_dst);
 
if (dst)
dst = dst_check(dst, 0);


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fighting out-of-order reception with RPS?

2015-07-10 Thread Eric Dumazet
On Fri, 2015-07-10 at 22:36 +0200, Oliver Hartkopp wrote:
> On 07/10/2015 04:48 AM, Tom Herbert wrote:
> > On Wed, Jul 8, 2015 at 10:55 PM, Oliver Hartkopp  
> > wrote:
> >> Both drivers do not use NAPI. The just follow the way
> >>
> >> interrupt -> alloc_skb() -> fill skb -> netif_rx(skb)
> >>
> >> I'm usually testing with the USB adapters as the PCIe setup is not very
> >> handy.
> >>
> > Okay, I see what is happening. In netif_rx when RPS is not enabled
> > that packet is queued to the backlog queue for the local CPU. Since
> > you're doing round robin on the interrupts then OOO packets can be a
> > result. Unfortunately, this is the expected behavior. The correct
> > kernel fix would be to move to these drivers to use NAPI.
> 
> Hm. Doesn't sound like a good solution when there's a difference between NAPI
> and non-NAPI drivers in matters of OOO, right?

Isn't OOO a problem for you ? Then you either have to :

1) Use a single CPU to handle IRQ from the device
2) Use NAPI

> 
> > RPS
> > eliminates the OOO, but if there is no ability to derive a flow hash
> > from packets everything will wind up one queue without load balancing.
> 
> Correct.
> 
> That's why I added
> 
>   skb_set_hash(skb, dev->ifindex, PKT_HASH_TYPE_L2);
> 
> in my driver, because the only relevant flow identifiction is the number of
> the incoming CAN interface.
> 
> > Besides that, automatically setting RPS in drivers is a difficult
> > proposition since there is no definitively "correct" way to do that in
> > an arbitrary configuration.
> 
> What about checking in netif_rx() if the non-NAPI driver has set a hash (aka
> the driver is OOO sensitive)?
> And if so we could automatically set rps_cpus for this interface in a way that
> all CPUs are enabled to take skbs following the hash.

Wow, netif_rx() is packet processing fast path, certainly not the place
to add controlling path decisions.

Please convert your driver to NAPI. You might then even benefit from
GRO.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] net: inet_diag: always export IPV6_V6ONLY sockopt for listening sockets

2015-07-10 Thread David Miller
From: Phil Sutter 
Date: Fri, 10 Jul 2015 11:39:57 +0200

> Reconsidering my commit 20462155 "net: inet_diag: export IPV6_V6ONLY
> sockopt", I am not happy with the limitations it causes for socket
> analysing code in userspace. Exporting the value only if it is set makes
> it hard for userspace to decide whether the option is not set or the
> kernel does not support exporting the option at all.
> 
> From an auditor's perspective, the interesting question for listening
> AF_INET6 sockets is: "Does it NOT have IPV6_V6ONLY set?" Because it is
> the unexpected case. This patch allows to answer this question reliably.
> 
> Signed-off-by: Phil Sutter 
> Cc: Eric Dumazet 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 0/9] be2net: patch set

2015-07-10 Thread David Miller
From: Sathya Perla 
Date: Fri, 10 Jul 2015 05:32:42 -0400

> Hi David, the following patch set has code cleanup patches, minor enhancements
> and non-critical fixes. Pls consider applying to the net-next tree. Thanks!

Series applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next v2] ipv6: Do not iterate over all interfaces when finding source address on specific interface.

2015-07-10 Thread David Miller
From: YOSHIFUJI Hideaki/吉藤英明 
Date: Fri, 10 Jul 2015 16:58:31 +0900

> If outgoing interface is specified and the candidate address is
> restricted to the outgoing interface, it is enough to iterate
> over that given interface only.
> 
> Signed-off-by: YOSHIFUJI Hideaki 
> Acked-by: Erik Kline 

Applied, thanks!
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] net: Drop owner assignment from platform_driver

2015-07-10 Thread David Miller
From: Krzysztof Kozlowski 
Date: Fri, 10 Jul 2015 15:29:23 +0900

> platform_driver does not need to set an owner because
> platform_driver_register() will set it.
> 
> Signed-off-by: Krzysztof Kozlowski 

Applied to net-next, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] net: phy: Support setting polarity in marvell phy driver

2015-07-10 Thread David Miller
From: David Thomson 
Date: Fri, 10 Jul 2015 16:28:25 +1200

> Support manually setting the polarity to mdi or mdix
> 
> Signed-off-by: David Thomson 

Applied to net-next.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] net: phy: Pass mdix ethtool setting through to phy driver

2015-07-10 Thread David Miller
From: David Thomson 
Date: Fri, 10 Jul 2015 13:56:54 +1200

> Pass the mdix setting from ethtool down to the phy driver, to allow
> driver specific implementations of manually setting the polarity.
> 
> Signed-off-by: David Thomson 

Applied to net-next.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 4/4] gianfar: Add paged allocation and Rx S/G

2015-07-10 Thread David Miller
From: Claudiu Manoil 
Date: Thu, 9 Jul 2015 19:24:44 +0300

> @@ -2839,6 +2815,94 @@ static irqreturn_t gfar_transmit(int irq, void *grp_id)
>   return IRQ_HANDLED;
>  }
>  
> +static bool gfar_add_rx_frag(struct gfar_rx_buff *rxb, unsigned int lstatus,
> +  struct sk_buff *skb, bool first)
> +{

You really are again playing Russian Roulette with the type of
'lstatus' here.  Use 'u32' consistently.

> +static struct sk_buff *gfar_get_next_rxbuff(struct gfar_priv_rx_q *rx_queue,
> + unsigned long lstatus,
> + struct sk_buff *skb)

Likewise.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 2/4] gianfar: Fix and cleanup rxbd status handling

2015-07-10 Thread David Miller
From: Claudiu Manoil 
Date: Thu, 9 Jul 2015 19:24:42 +0300

> @@ -2921,6 +2921,7 @@ int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, 
> int rx_work_limit)
>   i = rx_queue->next_to_clean;
>  
>   while (rx_work_limit--) {
> + unsigned long lstatus;
>  
>   if (cleaned_cnt >= GFAR_RX_BUFF_ALLOC) {
>   gfar_alloc_rx_buffs(rx_queue, cleaned_cnt);
> @@ -2928,7 +2929,8 @@ int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, 
> int rx_work_limit)
>   }
>  
>   bdp = &rx_queue->rx_bd_base[i];
> - if (be16_to_cpu(bdp->status) & RXBD_EMPTY)
> + lstatus = be32_to_cpu(bdp->lstatus);
> + if (lstatus & BD_LFLAG(RXBD_EMPTY))
>   break;
>  
>   /* order rx buffer descriptor reads */

"lstatus" is not an "unsigned long", it's a "u32".

Don't use ambiguous types for objects with exact known sizes.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] net: dsa: mv88e6xxx: add write access to debugfs regs file

2015-07-10 Thread David Miller
From: Vivien Didelot 
Date: Thu,  9 Jul 2015 17:13:29 -0400

> Allow write access to the regs file in the debugfs interface, with the
> following parameters:
> 
> echo> regs
> 
> Where "name" is the register name (as shown in the header row), "reg" is
> the register address (as shown in the first column) and "value" is the
> 16-bit value. e.g.:
> 
> echo GLOBAL 1a 5550 > regs
> 
> Signed-off-by: Vivien Didelot 

I don't know about this.

This starts to smell like a back door for proprietary userspace SDKs to
program the switch hardware.

Yes, they can do it via other mechanisms, but we don't have to make it
any eaiser for them either.

If you want to poke registers, hack the module just like any other
person with appropriate privileges can do.

Frankly, all of this debugfs crap in the DSA drivers smells like poo.
I don't like it _AT_ _ALL_, and I shouldn't have allowed any of it
into the tree in the first place.

I might just remove it all myself, it bothers me so much.

Fetching information should be done by well typed, generic, interfaces
that apply to any similar device or object.  All of this debugfs stuff
smells of hacks and special case crap that's only usable for one
device type and that makes it the single most terrible interface to
give to users.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] sctp: SCTP_SOCKOPT_PEELOFF return socket pointer for kernel users

2015-07-10 Thread David Miller
From: Marcelo Ricardo Leitner 
Date: Thu,  9 Jul 2015 11:15:19 -0300

> SCTP has this operation to peel off associations from a given socket and
> create a new socket using this association. We currently have two ways
> to use this operation:
> - via getsockopt(), on which it will also create and return a file
>   descriptor for this new socket
> - via sctp_do_peeloff(), which is for kernel only
> 
> The caveat with using sctp_do_peeloff() directly is that it creates a
> dependency to SCTP module, while all other operations are handled via
> kernel_{socket,sendmsg,getsockopt...}() interface. This causes the
> kernel to load SCTP module even when it's not directly used
> 
> This patch then updates SCTP_SOCKOPT_PEELOFF so that for kernel users of
> this protocol it will not allocate a file descriptor but instead just
> return the socket pointer directly.
> 
> If called by an user application it will work as before.
> 
> Signed-off-by: Marcelo Ricardo Leitner 

I do not like this at all.

Socket option implementations should not change their behavior or what
datastructures they consume or return just because the socket happens
to be a kernel socket.

I'm not applying this series, sorry.

Also, your patch series lacked an intial "PATCH 0/N" posting, so you
could at least spend the time to discuss this patch series at a high
level and explain your overall motivations.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -next] Revert "ipv4: use skb coalescing in defragmentation"

2015-07-10 Thread Eric Dumazet
On Sat, 2015-07-11 at 01:37 +0200, Florian Westphal wrote:
> This reverts commit 3cc4949269e01f39443d0fcfffb5bc6b47878d45.
> 
> There is nothing wrong with coalescing during defragmentation, it
> reduces truesize overhead and simplifies things for the receiving
> socket (no fraglist walk needed).
> 
> However, it also destroys geometry of the original fragments.
> While that doesn't cause any breakage (we make sure to not exceed largest
> original size) ip_do_fragment contains a 'fastpath' that takes advantage
> of a present frag list and results in fragments that (in most cases)
> match what was received.
> 
> In case its needed the coalescing could be done later, when we're sure
> the skb is not forwarded.  But discussion during NFWS resulted in
> 'lets just remove this for now'.
> 
> Cc: Eric Dumazet 
> Signed-off-by: Florian Westphal 
> ---

Acked-by: Eric Dumazet 


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv3 net 0/2] net: fixes for device unregistration

2015-07-10 Thread David Miller
From: Julian Anastasov 
Date: Thu,  9 Jul 2015 09:59:08 +0300

> Test script from Eric W. Biederman can catch a problem
> where packets from backlog are processed long after the last
> synchronize_net call. This can be reproduced after few tests
> if commit 381c759d9916 ("ipv4: Avoid crashing in ip_error")
> is reverted for the test. Incoming packets do not hold
> reference to device but even if they do, subsystems do not
> expect packets to fly during and after the NETDEV_UNREGISTER
> event.
> 
> The first fix has the cost of netif_running check in fast path.
> The second fix calls rcu_read_lock while local IRQ is disabled,
> I hope this is not against the rules.

Thanks for fixing this tricky bug.

Applied and queued up for -stable.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC net-next] net: Build IPv6 into kernel by default

2015-07-10 Thread David Miller
From: Tom Herbert 
Date: Thu, 9 Jul 2015 13:42:29 -0700

> This patch makes the default to build IPv6 into the kernel. IPv6
> now has significant traction and any remaining vestiges of IPv6
> not being provided parity with IPv4 should be swept away. IPv6 is now
> core to the Internet and kernel.

I guess I'm fine with this, just fix up the doc error Dave Jones
pointed out.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net] bridge: mdb: allow the user to delete mdb entry if there's a querier

2015-07-10 Thread David Miller
From: Nikolay Aleksandrov 
Date: Thu,  9 Jul 2015 04:12:45 -0700

> From: Satish Ashok 
> 
> Until now when a querier was present static entries couldn't be deleted.
> Fix this and allow the user to manipulate the mdb with or without a
> querier.
> 
> Signed-off-by: Satish Ashok 
> Signed-off-by: Nikolay Aleksandrov 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH -next] Revert "ipv4: use skb coalescing in defragmentation"

2015-07-10 Thread Florian Westphal
This reverts commit 3cc4949269e01f39443d0fcfffb5bc6b47878d45.

There is nothing wrong with coalescing during defragmentation, it
reduces truesize overhead and simplifies things for the receiving
socket (no fraglist walk needed).

However, it also destroys geometry of the original fragments.
While that doesn't cause any breakage (we make sure to not exceed largest
original size) ip_do_fragment contains a 'fastpath' that takes advantage
of a present frag list and results in fragments that (in most cases)
match what was received.

In case its needed the coalescing could be done later, when we're sure
the skb is not forwarded.  But discussion during NFWS resulted in
'lets just remove this for now'.

Cc: Eric Dumazet 
Signed-off-by: Florian Westphal 
---
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index a50dc6d..4d3fffa 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -522,7 +522,6 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff 
*prev,
int len;
int ihlen;
int err;
-   int sum_truesize;
u8 ecn;
 
ipq_kill(qp);
@@ -590,32 +589,19 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff 
*prev,
add_frag_mem_limit(&qp->q, clone->truesize);
}
 
+   skb_shinfo(head)->frag_list = head->next;
skb_push(head, head->data - skb_network_header(head));
 
-   sum_truesize = head->truesize;
-   for (fp = head->next; fp;) {
-   bool headstolen;
-   int delta;
-   struct sk_buff *next = fp->next;
-
-   sum_truesize += fp->truesize;
+   for (fp=head->next; fp; fp = fp->next) {
+   head->data_len += fp->len;
+   head->len += fp->len;
if (head->ip_summed != fp->ip_summed)
head->ip_summed = CHECKSUM_NONE;
else if (head->ip_summed == CHECKSUM_COMPLETE)
head->csum = csum_add(head->csum, fp->csum);
-
-   if (skb_try_coalesce(head, fp, &headstolen, &delta)) {
-   kfree_skb_partial(fp, headstolen);
-   } else {
-   if (!skb_shinfo(head)->frag_list)
-   skb_shinfo(head)->frag_list = fp;
-   head->data_len += fp->len;
-   head->len += fp->len;
-   head->truesize += fp->truesize;
-   }
-   fp = next;
+   head->truesize += fp->truesize;
}
-   sub_frag_mem_limit(&qp->q, sum_truesize);
+   sub_frag_mem_limit(&qp->q, head->truesize);
 
head->next = NULL;
head->dev = dev;
-- 
2.0.5

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH -next 6/6] netfilter: nftables: Only run the nftables chains in the proper netns

2015-07-10 Thread Eric W. Biederman

- Register the nftables chains in the network namespace that they need
  to run in.

- Remove the hacks that stopped chains running in the wrong network
  namespace.

Signed-off-by: "Eric W. Biederman" 
---
 net/netfilter/nf_tables_api.c  | 6 --
 net/netfilter/nf_tables_core.c | 5 -
 2 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index cfe636808541..12e65ec46bd4 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -130,20 +130,22 @@ static void nft_trans_destroy(struct nft_trans *trans)
 int nft_register_basechain(struct nft_base_chain *basechain,
   unsigned int hook_nops)
 {
+   struct net *net = read_pnet(&basechain->pnet);
if (basechain->flags & NFT_BASECHAIN_DISABLED)
return 0;
 
-   return nf_register_hooks(basechain->ops, hook_nops);
+   return nf_register_net_hooks(net, basechain->ops, hook_nops);
 }
 EXPORT_SYMBOL_GPL(nft_register_basechain);
 
 void nft_unregister_basechain(struct nft_base_chain *basechain,
  unsigned int hook_nops)
 {
+   struct net *net = read_pnet(&basechain->pnet);
if (basechain->flags & NFT_BASECHAIN_DISABLED)
return;
 
-   nf_unregister_hooks(basechain->ops, hook_nops);
+   nf_unregister_net_hooks(net, basechain->ops, hook_nops);
 }
 EXPORT_SYMBOL_GPL(nft_unregister_basechain);
 
diff --git a/net/netfilter/nf_tables_core.c b/net/netfilter/nf_tables_core.c
index f77bad46ac68..05d0b03530f6 100644
--- a/net/netfilter/nf_tables_core.c
+++ b/net/netfilter/nf_tables_core.c
@@ -114,7 +114,6 @@ unsigned int
 nft_do_chain(struct nft_pktinfo *pkt, const struct nf_hook_ops *ops)
 {
const struct nft_chain *chain = ops->priv, *basechain = chain;
-   const struct net *chain_net = 
read_pnet(&nft_base_chain(basechain)->pnet);
const struct net *net = dev_net(pkt->in ? pkt->in : pkt->out);
const struct nft_rule *rule;
const struct nft_expr *expr, *last;
@@ -125,10 +124,6 @@ nft_do_chain(struct nft_pktinfo *pkt, const struct 
nf_hook_ops *ops)
int rulenum;
unsigned int gencursor = nft_genmask_cur(net);
 
-   /* Ignore chains that are not for the current network namespace */
-   if (!net_eq(net, chain_net))
-   return NF_ACCEPT;
-
 do_chain:
rulenum = 0;
rule = list_entry(&chain->rules, struct nft_rule, list);
-- 
2.2.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH -next 5/6] netfilter: Per network namespace netfilter hooks.

2015-07-10 Thread Eric W. Biederman

- Add a new set of functions for registering and unregistering per
  network namespace hooks.

- Modify the old global namespace hook functions to use the per
  network namespace hooks in their implementation, so their remains a
  single list that needs to be walked for any hook (this is important
  for keeping the hook priority working and for keeping the code
  walking the hooks simple).

- Only allow registering the per netdevice hooks in the network
  namespace where the network device lives.

- Dynamically allocate the structures in the per network namespace
  hook list in nf_register_net_hook, and unregister them in
  nf_unregister_net_hook.

  Dynamic allocate is required somewhere as the number of network
  namespaces are not fixed so we might as well allocate them in the
  registration function.

  The chain of registered hooks on any list is expected to be small so
  the cost of walking that list to find the entry we are unregistering
  should also be small.

  Performing the management of the dynamically allocated list entries
  in the registration and unregistration functions keeps the complexity
  from spreading.

Signed-off-by: "Eric W. Biederman" 
---
 include/linux/netfilter.h |  15 +++-
 include/net/netns/netfilter.h |   1 +
 net/netfilter/core.c  | 184 +-
 3 files changed, 175 insertions(+), 25 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index d7826e59779e..4903e392a6ac 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -11,6 +11,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #ifdef CONFIG_NETFILTER
 static inline int NF_DROP_GETERR(int verdict)
@@ -118,6 +120,13 @@ struct nf_sockopt_ops {
 };
 
 /* Function to register/unregister hook points. */
+int nf_register_net_hook(struct net *net, const struct nf_hook_ops *ops);
+void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *ops);
+int nf_register_net_hooks(struct net *net, const struct nf_hook_ops *reg,
+ unsigned int n);
+void nf_unregister_net_hooks(struct net *net, const struct nf_hook_ops *reg,
+unsigned int n);
+
 int nf_register_hook(struct nf_hook_ops *reg);
 void nf_unregister_hook(struct nf_hook_ops *reg);
 int nf_register_hooks(struct nf_hook_ops *reg, unsigned int n);
@@ -128,8 +137,6 @@ void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned 
int n);
 int nf_register_sockopt(struct nf_sockopt_ops *reg);
 void nf_unregister_sockopt(struct nf_sockopt_ops *reg);
 
-extern struct list_head nf_hooks[NFPROTO_NUMPROTO][NF_MAX_HOOKS];
-
 #ifdef HAVE_JUMP_LABEL
 extern struct static_key nf_hooks_needed[NFPROTO_NUMPROTO][NF_MAX_HOOKS];
 
@@ -167,10 +174,10 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned 
int hook,
 int (*okfn)(struct sock *, struct sk_buff *),
 int thresh)
 {
-   struct list_head *nf_hook_list = &nf_hooks[pf][hook];
+   struct net *net = dev_net(indev?indev:outdev);
+   struct list_head *nf_hook_list = &net->nf.hooks[pf][hook];
if (nf_hook_list_active(nf_hook_list, pf, hook)) {
struct nf_hook_state state;
-
nf_hook_state_init(&state, nf_hook_list, hook, thresh,
   pf, indev, outdev, sk, okfn);
return nf_hook_slow(skb, &state);
diff --git a/include/net/netns/netfilter.h b/include/net/netns/netfilter.h
index 532e4ba64f49..38aa4983e2a9 100644
--- a/include/net/netns/netfilter.h
+++ b/include/net/netns/netfilter.h
@@ -14,5 +14,6 @@ struct netns_nf {
 #ifdef CONFIG_SYSCTL
struct ctl_table_header *nf_log_dir_header;
 #endif
+   struct list_head hooks[NFPROTO_NUMPROTO][NF_MAX_HOOKS];
 };
 #endif
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 8bbcdfcdbfbd..320ae65dc09a 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -52,9 +52,6 @@ void nf_unregister_afinfo(const struct nf_afinfo *afinfo)
 }
 EXPORT_SYMBOL_GPL(nf_unregister_afinfo);
 
-struct list_head nf_hooks[NFPROTO_NUMPROTO][NF_MAX_HOOKS] __read_mostly;
-EXPORT_SYMBOL(nf_hooks);
-
 #ifdef HAVE_JUMP_LABEL
 struct static_key nf_hooks_needed[NFPROTO_NUMPROTO][NF_MAX_HOOKS];
 EXPORT_SYMBOL(nf_hooks_needed);
@@ -62,27 +59,40 @@ EXPORT_SYMBOL(nf_hooks_needed);
 
 static DEFINE_MUTEX(nf_hook_mutex);
 
-static struct list_head *find_nf_hook_list(const struct nf_hook_ops *reg)
+static struct list_head *find_nf_hook_list(struct net *net,
+  const struct nf_hook_ops *reg)
 {
struct list_head *nf_hook_list = NULL;
 
if (reg->pf != NFPROTO_NETDEV)
-   nf_hook_list = &nf_hooks[reg->pf][reg->hooknum];
+   nf_hook_list = &net->nf.hooks[reg->pf][reg->hooknum];
else if (reg->hooknum == NF_NETDEV_INGRESS) {
 #ifdef CONFIG_NETFILTER_INGRESS
-   if (reg->dev)
+   if (reg->de

[PATCH net v2] net: bcmgenet: fix accounting of packet drops vs errors

2015-07-10 Thread Petri Gynther
bcmgenet driver needs to separate packet drops from packet errors.

When the driver has to drop a *good* packet, due to lack of buffers or
replacement skbs, increment only dev->stats.[rx|tx]_dropped.

When the driver encounters a bad Rx packet or Tx error, increment only
dev->stats.[rx|tx]_errors + relevant detailed error counter.

Signed-off-by: Petri Gynther 
---
 drivers/net/ethernet/broadcom/genet/bcmgenet.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c 
b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index b43b2cb..64c1e9d 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -1230,7 +1230,6 @@ static struct sk_buff *bcmgenet_put_tx_csum(struct 
net_device *dev,
new_skb = skb_realloc_headroom(skb, sizeof(*status));
dev_kfree_skb(skb);
if (!new_skb) {
-   dev->stats.tx_errors++;
dev->stats.tx_dropped++;
return NULL;
}
@@ -1465,7 +1464,6 @@ static unsigned int bcmgenet_desc_rx(struct 
bcmgenet_rx_ring *ring,
 
if (unlikely(!skb)) {
dev->stats.rx_dropped++;
-   dev->stats.rx_errors++;
goto next;
}
 
@@ -1493,7 +1491,6 @@ static unsigned int bcmgenet_desc_rx(struct 
bcmgenet_rx_ring *ring,
if (unlikely(!(dma_flag & DMA_EOP) || !(dma_flag & DMA_SOP))) {
netif_err(priv, rx_status, dev,
  "dropping fragmented packet!\n");
-   dev->stats.rx_dropped++;
dev->stats.rx_errors++;
dev_kfree_skb_any(skb);
goto next;
@@ -1515,7 +1512,6 @@ static unsigned int bcmgenet_desc_rx(struct 
bcmgenet_rx_ring *ring,
dev->stats.rx_frame_errors++;
if (dma_flag & DMA_RX_LG)
dev->stats.rx_length_errors++;
-   dev->stats.rx_dropped++;
dev->stats.rx_errors++;
dev_kfree_skb_any(skb);
goto next;
-- 
2.4.3.573.g4eafbef

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH -next 4/6] netfilter: Factor out the hook list selection from nf_register_hook

2015-07-10 Thread Eric W. Biederman

- Add a new function find_nf_hook_list to select the nf_hook_list

- Fail nf_register_hook when asked for a per netdevice hook list when
  support for per netdevice hook lists is not built into the kernel.

- Move the hook list head selection outside of nf_hook_mutex as
  nothing in the selection requires the hook list, and error handling
  is simpler if a mutex is not held.

Signed-off-by: "Eric W. Biederman" 
---
 net/netfilter/core.c | 32 ++--
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 7e3bc12f1c62..8bbcdfcdbfbd 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -62,27 +62,31 @@ EXPORT_SYMBOL(nf_hooks_needed);
 
 static DEFINE_MUTEX(nf_hook_mutex);
 
-int nf_register_hook(struct nf_hook_ops *reg)
+static struct list_head *find_nf_hook_list(const struct nf_hook_ops *reg)
 {
-   struct list_head *nf_hook_list;
-   struct nf_hook_ops *elem;
+   struct list_head *nf_hook_list = NULL;
 
-   mutex_lock(&nf_hook_mutex);
-   switch (reg->pf) {
-   case NFPROTO_NETDEV:
+   if (reg->pf != NFPROTO_NETDEV)
+   nf_hook_list = &nf_hooks[reg->pf][reg->hooknum];
+   else if (reg->hooknum == NF_NETDEV_INGRESS) {
 #ifdef CONFIG_NETFILTER_INGRESS
-   if (reg->hooknum == NF_NETDEV_INGRESS) {
-   BUG_ON(reg->dev == NULL);
+   if (reg->dev)
nf_hook_list = ®->dev->nf_hooks_ingress;
-   break;
-   }
 #endif
-   /* Fall through. */
-   default:
-   nf_hook_list = &nf_hooks[reg->pf][reg->hooknum];
-   break;
}
+   return nf_hook_list;
+}
+
+int nf_register_hook(struct nf_hook_ops *reg)
+{
+   struct list_head *nf_hook_list;
+   struct nf_hook_ops *elem;
 
+   nf_hook_list = find_nf_hook_list(reg);
+   if (!nf_hook_list)
+   return -ENOENT;
+
+   mutex_lock(&nf_hook_mutex);
list_for_each_entry(elem, nf_hook_list, list) {
if (reg->priority < elem->priority)
break;
-- 
2.2.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH -next 3/6] netfilter: Simply the tests for enabling and disabling the ingress queue hook

2015-07-10 Thread Eric W. Biederman

Replace an overcomplicated switch statement with a simple if statement.

This also removes the ingress queue enable outside of nf_hook_mutex as
the protection provided by the mutex is not necessary and the code is
clearer having both of the static key increments together.

Signed-off-by: "Eric W. Biederman" 
---
 net/netfilter/core.c | 17 ++---
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index a0e54974e2c9..7e3bc12f1c62 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -74,7 +74,6 @@ int nf_register_hook(struct nf_hook_ops *reg)
if (reg->hooknum == NF_NETDEV_INGRESS) {
BUG_ON(reg->dev == NULL);
nf_hook_list = ®->dev->nf_hooks_ingress;
-   net_inc_ingress_queue();
break;
}
 #endif
@@ -90,6 +89,10 @@ int nf_register_hook(struct nf_hook_ops *reg)
}
list_add_rcu(®->list, elem->list.prev);
mutex_unlock(&nf_hook_mutex);
+#ifdef CONFIG_NETFILTER_INGRESS
+   if ((reg->pf == NFPROTO_NETDEV) && (reg->hooknum == NF_NETDEV_INGRESS))
+   net_inc_ingress_queue();
+#endif
 #ifdef HAVE_JUMP_LABEL
static_key_slow_inc(&nf_hooks_needed[reg->pf][reg->hooknum]);
 #endif
@@ -102,18 +105,10 @@ void nf_unregister_hook(struct nf_hook_ops *reg)
mutex_lock(&nf_hook_mutex);
list_del_rcu(®->list);
mutex_unlock(&nf_hook_mutex);
-   switch (reg->pf) {
-   case NFPROTO_NETDEV:
 #ifdef CONFIG_NETFILTER_INGRESS
-   if (reg->hooknum == NF_NETDEV_INGRESS) {
-   net_dec_ingress_queue();
-   break;
-   }
-   break;
+   if ((reg->pf == NFPROTO_NETDEV) && (reg->hooknum == NF_NETDEV_INGRESS))
+   net_dec_ingress_queue();
 #endif
-   default:
-   break;
-   }
 #ifdef HAVE_JUMP_LABEL
static_key_slow_dec(&nf_hooks_needed[reg->pf][reg->hooknum]);
 #endif
-- 
2.2.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH -next 2/6] netfilter: kill nf_hooks_active

2015-07-10 Thread Eric W. Biederman

The function obscures what is going on in nf_hook_thresh and it's existence
requires computing the hook list twice.

Signed-off-by: "Eric W. Biederman" 
---
 include/linux/netfilter.h | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 00050dfd9f23..d7826e59779e 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -150,11 +150,6 @@ static inline bool nf_hook_list_active(struct list_head 
*nf_hook_list,
 }
 #endif
 
-static inline bool nf_hooks_active(u_int8_t pf, unsigned int hook)
-{
-   return nf_hook_list_active(&nf_hooks[pf][hook], pf, hook);
-}
-
 int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state);
 
 /**
@@ -172,10 +167,11 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned 
int hook,
 int (*okfn)(struct sock *, struct sk_buff *),
 int thresh)
 {
-   if (nf_hooks_active(pf, hook)) {
+   struct list_head *nf_hook_list = &nf_hooks[pf][hook];
+   if (nf_hook_list_active(nf_hook_list, pf, hook)) {
struct nf_hook_state state;
 
-   nf_hook_state_init(&state, &nf_hooks[pf][hook], hook, thresh,
+   nf_hook_state_init(&state, nf_hook_list, hook, thresh,
   pf, indev, outdev, sk, okfn);
return nf_hook_slow(skb, &state);
}
-- 
2.2.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH -next 1/6] netfilter: nf_queue: Don't recompute the hook_list head

2015-07-10 Thread Eric W. Biederman

If someone sends packets from one of the netdevice ingress hooks to
the a userspace queue, and then userspace later accepts the packet,
the netfilter code can enter an infinite loop as the list head will
never be found.

Pass in the saved list_head to avoid this.

Signed-off-by: "Eric W. Biederman" 
---
 net/netfilter/nf_queue.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nf_queue.c b/net/netfilter/nf_queue.c
index cd60d397fe05..8a8b2abc35ff 100644
--- a/net/netfilter/nf_queue.c
+++ b/net/netfilter/nf_queue.c
@@ -213,7 +213,7 @@ void nf_reinject(struct nf_queue_entry *entry, unsigned int 
verdict)
 
if (verdict == NF_ACCEPT) {
next_hook:
-   verdict = 
nf_iterate(&nf_hooks[entry->state.pf][entry->state.hook],
+   verdict = nf_iterate(entry->state.hook_list,
 skb, &entry->state, &elem);
}
 
-- 
2.2.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH -next 0/6] Per network namespace netfilter chains

2015-07-10 Thread Eric W. Biederman

By maintining a set of functions to register and unregister netfilter
hooks both globally and per network namespace I have managed to write a
compact patchset that maintain per network netfilter chains, and
registers the nftables netfilter hooks per network namespace.

There are lots of other possible and desirable cleanups but this one is
a core change needed to make the other changes independent small
changes.

Eric W. Biederman (6):
  netfilter: nf_queue: Don't recompute the hook_list head
  netfilter: kill nf_hooks_active
  netfilter: Simply the tests for enabling and disabling the ingress queue 
hook
  netfilter: Factor out the hook list selection from nf_register_hook
  netfilter: Per network namespace netfilter hooks.
  netfilter: nftables: Only run the nftables chains in the proper netns

 include/linux/netfilter.h  |  23 +++--
 include/net/netns/netfilter.h  |   1 +
 net/netfilter/core.c   | 221 +
 net/netfilter/nf_queue.c   |   2 +-
 net/netfilter/nf_tables_api.c  |   6 +-
 net/netfilter/nf_tables_core.c |   5 -
 6 files changed, 200 insertions(+), 58 deletions(-)

Eric
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] nohz: prevent tilegx network driver interrupts

2015-07-10 Thread Josh Cartwright
On Fri, Jul 10, 2015 at 07:06:23PM -0400, Chris Metcalf wrote:
> On 7/10/2015 6:45 PM, Josh Cartwright wrote:
> >>+static inline const struct cpumask *housekeeping_cpumask(void)
> >>>+{
> >>>+#ifdef CONFIG_NO_HZ_FULL
> >>>+  if (tick_nohz_full_enabled())
> >>>+  return housekeeping_mask;
> >>>+#endif
> >Just a small comment:
> >
> >We can take these checks out from under a #ifdef CONFIG_NO_HZ_FULL
> >check, given that are stubbed tick_nohz_full_enabled() defined above.
> 
> True for the "if" clause, but the "housekeeping_mask" variable is only defined
> under CONFIG_NO_HZ_FULL.

Indeed!  I should have read more carefully.

Sorry for the noise.

  Josh


signature.asc
Description: PGP signature


Re: [PATCH v2] nohz: prevent tilegx network driver interrupts

2015-07-10 Thread Chris Metcalf

On 7/10/2015 6:45 PM, Josh Cartwright wrote:

+static inline const struct cpumask *housekeeping_cpumask(void)
>+{
>+#ifdef CONFIG_NO_HZ_FULL
>+   if (tick_nohz_full_enabled())
>+   return housekeeping_mask;
>+#endif

Just a small comment:

We can take these checks out from under a #ifdef CONFIG_NO_HZ_FULL
check, given that are stubbed tick_nohz_full_enabled() defined above.


True for the "if" clause, but the "housekeeping_mask" variable is only defined
under CONFIG_NO_HZ_FULL.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] ixgbe: Remove bimodal SR-IOV disabling

2015-07-10 Thread Rose, Gregory V

> -Original Message-
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Friday, July 10, 2015 3:44 PM
> To: Rose, Gregory V
> Cc: intel-wired-...@lists.osuosl.org; Kirsher, Jeffrey T;
> netdev@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: [PATCH] ixgbe: Remove bimodal SR-IOV disabling
> 
> On Fri, 2015-07-10 at 21:36 +, Rose, Gregory V wrote:
> >
> > > -Original Message-
> > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > Sent: Friday, July 10, 2015 2:32 PM
> > > To: intel-wired-...@lists.osuosl.org; Kirsher, Jeffrey T
> > > Cc: netdev@vger.kernel.org; linux-ker...@vger.kernel.org; Rose,
> > > Gregory V
> > > Subject: [PATCH] ixgbe: Remove bimodal SR-IOV disabling
> > >
> > > When unbinding an SR-IOV device with VFs configured from ixgbe, the
> > > driver behaves in one of two ways.  If max_vfs was specified, the
> > > SR-IOV state is disabled, removing the VFs.  The occurs regardless
> > > of whether the VF count was later modified through sysfs.  If
> > > however max_vfs is zero, such as by not specifying the module
> > > parameter, the VFs persist after the PF is unbound from ixgbe.  If
> > > the PF is then bound to vfio-pci to be assigned to a VM, the PF is
> non-functional.
> > >
> > > >From the comment, commit da36b64736cf ("ixgbe: Implement PCI SR-IOV
> > > sysfs callback operation") clearly intended this alternate behavior,
> > > but probably didn't realize the PF doesn't work in this mode.
> > >
> > > This bimodal behavior is confusing to users and results in a state
> > > where the PF is broken for other uses unless the user sets
> > > sriov_numvfs to zero prior to unbinding the device.  Remove this
> > > behavior so that VFs are removed and the PF is functional for other
> > > uses after unbind, regardless of the way VFs are enabled.
> > >
> > > Signed-off-by: Alex Williamson 
> > > Cc: Greg Rose 
> > > Cc: Jeff Kirsher 
> > > ---
> > >
> > > I can only think that not disabling SR-IOV was meant to enable some
> > > sort of persistence for VFs, but that's probably better accomplished
> > > with either udev rules and/or modprobe.d install scripts.
> > >
> > >  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |7 +--
> > >  1 file changed, 1 insertion(+), 6 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > > b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > > index 5be12a0..de04e3e 100644
> > > --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > > +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > > @@ -8810,12 +8810,7 @@ static void ixgbe_remove(struct pci_dev *pdev)
> > >   unregister_netdev(netdev);
> > >
> > >  #ifdef CONFIG_PCI_IOV
> > > - /*
> > > -  * Only disable SR-IOV on unload if the user specified the now
> > > -  * deprecated max_vfs module parameter.
> > > -  */
> > > - if (max_vfs)
> > > - ixgbe_disable_sriov(adapter);
> > > + ixgbe_disable_sriov(adapter);
> > >  #endif
> > >   ixgbe_clear_interrupt_scheme(adapter);
> > >
> >
> > Please remove max_vfs module parameter - it is deprecated and should be
> removed from upstream builds.  Dave let us get away with a kernel module a
> few years ago because the other necessary infrastructure to enable SR-IOV
> virtual functions via the PCIe interface was not available.  Now that it's
> there it should be removed and vendors/end users should be forced to move
> away from this.
> 
> I can't really say I'm in favor of removing that option.  It's probably
> going to break a lot of people because doing the udev rules right is hard.
> The sysfs sriov interface has been tossed over the wall as the right way
> to do things, but there's really no infrastructure to facilitate even the
> simple peanut butter, everybody gets the same number of VFs, interface
> that max_vfs provides.  I think the existence of this bug is probably a
> good indication that the sysfs interface has not really been adopted yet.
> Thanks,

Alright, I'll go with that reasoning.

Acked-by: Greg Rose 


> 
> Alex

N�r��yb�X��ǧv�^�)޺{.n�+���z�^�)w*jg����ݢj/���z�ޖ��2�ޙ&�)ߡ�a�����G���h��j:+v���w��٥

Re: [PATCH] vhost-blk: Add vhost-blk support v6

2015-07-10 Thread Ming Lin
On Sat, Dec 1, 2012 at 5:33 PM, Asias He  wrote:
> vhost-blk is an in-kernel virito-blk device accelerator.
>
> Due to lack of proper in-kernel AIO interface, this version converts
> guest's I/O request to bio and use submit_bio() to submit I/O directly.
> So this version any supports raw block device as guest's disk image,
> e.g. /dev/sda, /dev/ram0. We can add file based image support to
> vhost-blk once we have in-kernel AIO interface. There are some work in
> progress for in-kernel AIO interface from Dave Kleikamp and Zach Brown:
>
>http://marc.info/?l=linux-fsdevel&m=133312234313122
>
> Performance evaluation:
> -
> LKVM: Fio with libaio ioengine on 1 Fusion IO device
> IOPS(k)BeforeAfterImprovement
> seq-read   107   121  +13.0%
> seq-write  130   179  +37.6%
> rnd-read   102   122  +19.6%
> rnd-write  125   159  +27.0%
>
> QEMU: Fio with libaio ioengine on 1 Fusion IO device
> IOPS(k)BeforeAfterImprovement
> seq-read   76123  +61.8%
> seq-write  139   173  +24.4%
> rnd-read   73120  +64.3%
> rnd-write  75156  +108.0%
>
> QEMU: Fio with libaio ioengine on 1 Ramdisk device
> IOPS(k)BeforeAfterImprovement
> seq-read   138   437  +216%
> seq-write  191   436  +128%
> rnd-read   137   426  +210%
> rnd-write  140   415  +196%
>
> QEMU: Fio with libaio ioengine on 8 Ramdisk device
> 50% read + 50% write
> IOPS(k)BeforeAfterImprovement
> randrw 64/64 189/189  +195%/+195%
>
> Userspace bits:
> -
> 1) LKVM
> The latest vhost-blk userspace bits for kvm tool can be found here:
> g...@github.com:asias/linux-kvm.git blk.vhost-blk
>
> 2) QEMU
> The latest vhost-blk userspace prototype for QEMU can be found here:
> g...@github.com:asias/qemu.git blk.vhost-blk
>
> Changes in v6:
> - Use inline req_page_list to reduce kmalloc
> - Switch to single thread model, thanks mst!
> - Wait until requests fired before vhost_blk_flush to be finished
>
> Changes in v5:
> - Do not assume the buffer layout
> - Fix wakeup race
>
> Changes in v4:
> - Mark req->status as userspace pointer
> - Use __copy_to_user() instead of copy_to_user() in vhost_blk_set_status()
> - Add if (need_resched()) schedule() in blk thread
> - Kill vhost_blk_stop_vq() and move it into vhost_blk_stop()
> - Use vq_err() instead of pr_warn()
> - Fail un Unsupported request
> - Add flush in vhost_blk_set_features()
>
> Changes in v3:
> - Sending REQ_FLUSH bio instead of vfs_fsync, thanks Christoph!
> - Check file passed by user is a raw block device file
>
> Acked-by: David S. Miller 
> Signed-off-by: Asias He 

Hi Asias,

Is this still under development or stopped?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] nohz: prevent tilegx network driver interrupts

2015-07-10 Thread Josh Cartwright
On Fri, Jul 10, 2015 at 03:37:25PM -0400, Chris Metcalf wrote:
> Normally the tilegx networking shim sends irqs to all the cores
> to distribute the load of processing incoming-packet interrupts,
> so that you can get to multiple Gb's of traffic inbound.
> 
> However, in nohz_full mode we don't want to interrupt the
> nohz_full cores by default, so we limit the set of cores we use
> to only the online housekeeping cores.
> 
> To make client code easier to read, we introduce a new nohz_full
> accessor, housekeeping_cpumask(), which returns a pointer to the
> housekeeping_mask if nohz_full is enabled, and otherwise returns
> the cpu_possible_mask.
> 
> Signed-off-by: Chris Metcalf 
> ---
[..]
> +static inline const struct cpumask *housekeeping_cpumask(void)
> +{
> +#ifdef CONFIG_NO_HZ_FULL
> + if (tick_nohz_full_enabled())
> + return housekeeping_mask;
> +#endif

Just a small comment:

We can take these checks out from under a #ifdef CONFIG_NO_HZ_FULL
check, given that are stubbed tick_nohz_full_enabled() defined above.

  Josh


signature.asc
Description: PGP signature


Re: [PATCH] ixgbe: Remove bimodal SR-IOV disabling

2015-07-10 Thread Alex Williamson
On Fri, 2015-07-10 at 21:36 +, Rose, Gregory V wrote:
> 
> > -Original Message-
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Friday, July 10, 2015 2:32 PM
> > To: intel-wired-...@lists.osuosl.org; Kirsher, Jeffrey T
> > Cc: netdev@vger.kernel.org; linux-ker...@vger.kernel.org; Rose, Gregory V
> > Subject: [PATCH] ixgbe: Remove bimodal SR-IOV disabling
> > 
> > When unbinding an SR-IOV device with VFs configured from ixgbe, the driver
> > behaves in one of two ways.  If max_vfs was specified, the SR-IOV state is
> > disabled, removing the VFs.  The occurs regardless of whether the VF count
> > was later modified through sysfs.  If however max_vfs is zero, such as by
> > not specifying the module parameter, the VFs persist after the PF is
> > unbound from ixgbe.  If the PF is then bound to vfio-pci to be assigned to
> > a VM, the PF is non-functional.
> > 
> > >From the comment, commit da36b64736cf ("ixgbe: Implement PCI SR-IOV
> > sysfs callback operation") clearly intended this alternate behavior, but
> > probably didn't realize the PF doesn't work in this mode.
> > 
> > This bimodal behavior is confusing to users and results in a state where
> > the PF is broken for other uses unless the user sets sriov_numvfs to zero
> > prior to unbinding the device.  Remove this behavior so that VFs are
> > removed and the PF is functional for other uses after unbind, regardless
> > of the way VFs are enabled.
> > 
> > Signed-off-by: Alex Williamson 
> > Cc: Greg Rose 
> > Cc: Jeff Kirsher 
> > ---
> > 
> > I can only think that not disabling SR-IOV was meant to enable some sort
> > of persistence for VFs, but that's probably better accomplished with
> > either udev rules and/or modprobe.d install scripts.
> > 
> >  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |7 +--
> >  1 file changed, 1 insertion(+), 6 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > index 5be12a0..de04e3e 100644
> > --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > @@ -8810,12 +8810,7 @@ static void ixgbe_remove(struct pci_dev *pdev)
> > unregister_netdev(netdev);
> > 
> >  #ifdef CONFIG_PCI_IOV
> > -   /*
> > -* Only disable SR-IOV on unload if the user specified the now
> > -* deprecated max_vfs module parameter.
> > -*/
> > -   if (max_vfs)
> > -   ixgbe_disable_sriov(adapter);
> > +   ixgbe_disable_sriov(adapter);
> >  #endif
> > ixgbe_clear_interrupt_scheme(adapter);
> > 
> 
> Please remove max_vfs module parameter - it is deprecated and should be 
> removed from upstream builds.  Dave let us get away with a kernel module a 
> few years ago because the other necessary infrastructure to enable SR-IOV 
> virtual functions via the PCIe interface was not available.  Now that it's 
> there it should be removed and vendors/end users should be forced to move 
> away from this.

I can't really say I'm in favor of removing that option.  It's probably
going to break a lot of people because doing the udev rules right is
hard.  The sysfs sriov interface has been tossed over the wall as the
right way to do things, but there's really no infrastructure to
facilitate even the simple peanut butter, everybody gets the same number
of VFs, interface that max_vfs provides.  I think the existence of this
bug is probably a good indication that the sysfs interface has not
really been adopted yet.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] ixgbe: Remove bimodal SR-IOV disabling

2015-07-10 Thread Rose, Gregory V

> -Original Message-
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Friday, July 10, 2015 2:32 PM
> To: intel-wired-...@lists.osuosl.org; Kirsher, Jeffrey T
> Cc: netdev@vger.kernel.org; linux-ker...@vger.kernel.org; Rose, Gregory V
> Subject: [PATCH] ixgbe: Remove bimodal SR-IOV disabling
> 
> When unbinding an SR-IOV device with VFs configured from ixgbe, the driver
> behaves in one of two ways.  If max_vfs was specified, the SR-IOV state is
> disabled, removing the VFs.  The occurs regardless of whether the VF count
> was later modified through sysfs.  If however max_vfs is zero, such as by
> not specifying the module parameter, the VFs persist after the PF is
> unbound from ixgbe.  If the PF is then bound to vfio-pci to be assigned to
> a VM, the PF is non-functional.
> 
> >From the comment, commit da36b64736cf ("ixgbe: Implement PCI SR-IOV
> sysfs callback operation") clearly intended this alternate behavior, but
> probably didn't realize the PF doesn't work in this mode.
> 
> This bimodal behavior is confusing to users and results in a state where
> the PF is broken for other uses unless the user sets sriov_numvfs to zero
> prior to unbinding the device.  Remove this behavior so that VFs are
> removed and the PF is functional for other uses after unbind, regardless
> of the way VFs are enabled.
> 
> Signed-off-by: Alex Williamson 
> Cc: Greg Rose 
> Cc: Jeff Kirsher 
> ---
> 
> I can only think that not disabling SR-IOV was meant to enable some sort
> of persistence for VFs, but that's probably better accomplished with
> either udev rules and/or modprobe.d install scripts.
> 
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |7 +--
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 5be12a0..de04e3e 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -8810,12 +8810,7 @@ static void ixgbe_remove(struct pci_dev *pdev)
>   unregister_netdev(netdev);
> 
>  #ifdef CONFIG_PCI_IOV
> - /*
> -  * Only disable SR-IOV on unload if the user specified the now
> -  * deprecated max_vfs module parameter.
> -  */
> - if (max_vfs)
> - ixgbe_disable_sriov(adapter);
> + ixgbe_disable_sriov(adapter);
>  #endif
>   ixgbe_clear_interrupt_scheme(adapter);
> 

Please remove max_vfs module parameter - it is deprecated and should be removed 
from upstream builds.  Dave let us get away with a kernel module a few years 
ago because the other necessary infrastructure to enable SR-IOV virtual 
functions via the PCIe interface was not available.  Now that it's there it 
should be removed and vendors/end users should be forced to move away from this.

Thanks,

- Greg Rose
Intel Corp.
Networking Division



[PATCH] ixgbe: Remove bimodal SR-IOV disabling

2015-07-10 Thread Alex Williamson
When unbinding an SR-IOV device with VFs configured from ixgbe, the
driver behaves in one of two ways.  If max_vfs was specified, the
SR-IOV state is disabled, removing the VFs.  The occurs regardless of
whether the VF count was later modified through sysfs.  If however
max_vfs is zero, such as by not specifying the module parameter, the
VFs persist after the PF is unbound from ixgbe.  If the PF is then
bound to vfio-pci to be assigned to a VM, the PF is non-functional.

>From the comment, commit da36b64736cf ("ixgbe: Implement PCI SR-IOV
sysfs callback operation") clearly intended this alternate behavior,
but probably didn't realize the PF doesn't work in this mode.

This bimodal behavior is confusing to users and results in a state
where the PF is broken for other uses unless the user sets
sriov_numvfs to zero prior to unbinding the device.  Remove this
behavior so that VFs are removed and the PF is functional for other
uses after unbind, regardless of the way VFs are enabled.

Signed-off-by: Alex Williamson 
Cc: Greg Rose 
Cc: Jeff Kirsher 
---

I can only think that not disabling SR-IOV was meant to enable some
sort of persistence for VFs, but that's probably better accomplished
with either udev rules and/or modprobe.d install scripts.

 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 5be12a0..de04e3e 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8810,12 +8810,7 @@ static void ixgbe_remove(struct pci_dev *pdev)
unregister_netdev(netdev);
 
 #ifdef CONFIG_PCI_IOV
-   /*
-* Only disable SR-IOV on unload if the user specified the now
-* deprecated max_vfs module parameter.
-*/
-   if (max_vfs)
-   ixgbe_disable_sriov(adapter);
+   ixgbe_disable_sriov(adapter);
 #endif
ixgbe_clear_interrupt_scheme(adapter);
 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/11] soc/fsl: Introduce DPAA BMan device management driver

2015-07-10 Thread Scott Wood
On Fri, 2015-07-10 at 15:57 -0500, Pledge Roy-R01356 wrote:
> > 
> > On Fri, 2015-07-10 at 13:36 +0200, Paul Bolle wrote:
> > > On do, 2015-07-09 at 16:21 -0400, Roy Pledge wrote:
> > > > +#ifdef CONFIG_FSL_DPA_CHECKING
> > > > +#define DPA_ASSERT(x) \
> > > > +   do { \
> > > > +   if (!(x)) { \
> > > > +   pr_crit("ASSERT: (%s:%d) %s\n", __FILE__, 
> > > > __LINE__, \
> > > > +   __stringify_1(x)); \
> > > > +   dump_stack(); \
> > > > +   panic("assertion failure"); \
> > > 
> > > Not my call, but why panic() here?
> > 
> > I'm pretty sure I've complained about this before (as well as all the
> > BUG_ONs).
> > 
> Is the concern here just the call to panic()?  I'm happy to change what 
> happens when an issue is detected but the DPA_ASSERT() calls are very 
> useful when testing changes to the driver and when bringing up the drivers 
> on new silicon variants. 

Use WARN_ON() or a variant thereof.

-Scott

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] fixed_phy: handle link-down case

2015-07-10 Thread Stas Sergeev

10.07.2015 23:44, Florian Fainelli пишет:

On 10/07/15 09:41, Stas Sergeev wrote:

Currently fixed_phy driver recognizes only the link-up state.
This simple patch adds an implementation of link-down state.
The actual change is 1-line, the rest is an indentation.

It is not clear to me how this is useful, if you have a link_update
callback manipulating the link state, the fixed PHY driver returns
appropriate MII_BMSR values and always re-initializes everything.

It returns the appropriate values only for link status (when its down),
but it still returns speed, duplex etc as if the link is up. I had hard
times finding the relevant specs, but from what I have googled,
when link is down, the speed/duplex/etc status fields should _also_
be zero, which is what my patch does.
What is more important is that fixed_phy_add() would return
-EINVAL if you didn't specify speed while the link is down.
This is an absolute must-fix, or I will have to add an arbitrary
speed value again, on which you already objected.


Is this meant to be some sort of optimization? If so, you could just
avoid the re-intendation completely and do a goto instead?

Oh, c'mon... Adding goto just to keep the _patch_ smaller?
(not smaller code, just a smaller patch)
Well, this is certainly something that can be done, feel free
to request that explicitly and I'll release v3 next week.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] of_mdio: add new DT property 'autoneg' for fixed-link

2015-07-10 Thread Stas Sergeev

10.07.2015 23:39, Florian Fainelli пишет:

- in-band status is an implementation delail, and it is
specific to a particular protocols. If you request the
in-band status for some protocol that doesn't support
it, perhaps you should get -EINVAL, because such a
config makes no sense. With autonegotiation, the rules
are not that strict: it can be "unimplemented", which doesn't
necessary mean nonsense in the config.

So by specifying "autoneg", you are not specific about the kind of
auto-negotiation protocol available, which is precisely my point: you
need to go down to that level of detail for this to be useful. So maybe
something like:

autoneg = "in-band-status" would actually be a better thing in terms of
description because then you would tell what can be made available/working?

I would agree with this if your argument below is true (see below).


- autonegotiation is a wider term, and may be implemented
by some other means than the in-band status (which is
probably impossible for a fixed-link though).

- In the terms that the driver uses, it is autonegotiation, eg
MVNETA_GMAC_AUTONEG_CONFIG. And when you go down
the implementation details, you see MVNETA_GMAC_INBAND_AN_ENABLE,
which is just one AN bit of many.

But arguably, there could be another auto-negotiation method, which is
not in-band status related, which means that you would need a way to
distinguish between using in-band status, or using something else or
nothing, would not you?

"something else" is a big question here.
Can you think of _any_ other way that is both not an MDIO
(suits to fixed-link) and not an in-band?
If the answer is yes (even theoretically), then
autoneg = "in-band" | "off"
may make sense. Otherwise boolean just looks enough.
If we would implement autoneg outside of the fixed-link,
then its semantic would likely be
autoneg = "mdio" | "in-band" | "off"
But the fact that we put it under fixed-link where only a
single AN possibility exist, may probably be underlined by
a semantic specific to fixed-link.

One may also argue that
autoneg = "any-possible-autoneg-that-works" is better than
specifying it explicitly, which is exactly what the boolean does.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 02/11] soc/fsl: Introduce DPAA BMan device management driver

2015-07-10 Thread Roy Pledge
> 
> On Fri, 2015-07-10 at 13:36 +0200, Paul Bolle wrote:
> > On do, 2015-07-09 at 16:21 -0400, Roy Pledge wrote:
> > > +#ifdef CONFIG_FSL_DPA_CHECKING
> > > +#define DPA_ASSERT(x) \
> > > +   do { \
> > > +   if (!(x)) { \
> > > +   pr_crit("ASSERT: (%s:%d) %s\n", __FILE__, __LINE__, \
> > > +   __stringify_1(x)); \
> > > +   dump_stack(); \
> > > +   panic("assertion failure"); \
> >
> > Not my call, but why panic() here?
> 
> I'm pretty sure I've complained about this before (as well as all the
> BUG_ONs).
> 
Is the concern here just the call to panic()?  I'm happy to change what happens 
when an issue is detected but the DPA_ASSERT() calls are very useful when 
testing changes to the driver and when bringing up the drivers on new silicon 
variants. 


Re: [PATCH 1/3] fixed_phy: handle link-down case

2015-07-10 Thread Florian Fainelli
On 10/07/15 09:41, Stas Sergeev wrote:
> 
> Currently fixed_phy driver recognizes only the link-up state.
> This simple patch adds an implementation of link-down state.
> The actual change is 1-line, the rest is an indentation.

It is not clear to me how this is useful, if you have a link_update
callback manipulating the link state, the fixed PHY driver returns
appropriate MII_BMSR values and always re-initializes everything.

Is this meant to be some sort of optimization? If so, you could just
avoid the re-intendation completely and do a goto instead?

> 
> Signed-off-by: Stas Sergeev 
> 
> CC: Florian Fainelli 
> CC: netdev@vger.kernel.org
> CC: linux-ker...@vger.kernel.org
> ---
>  drivers/net/phy/fixed_phy.c | 99 
> +++--
>  1 file changed, 50 insertions(+), 49 deletions(-)
> 
> diff --git a/drivers/net/phy/fixed_phy.c b/drivers/net/phy/fixed_phy.c
> index 1960b46..a5d93cf 100644
> --- a/drivers/net/phy/fixed_phy.c
> +++ b/drivers/net/phy/fixed_phy.c
> @@ -52,58 +52,59 @@ static int fixed_phy_update_regs(struct fixed_phy *fp)
>   u16 lpagb = 0;
>   u16 lpa = 0;
> 
> - if (fp->status.duplex) {
> - bmcr |= BMCR_FULLDPLX;
> -
> - switch (fp->status.speed) {
> - case 1000:
> - bmsr |= BMSR_ESTATEN;
> - bmcr |= BMCR_SPEED1000;
> - lpagb |= LPA_1000FULL;
> - break;
> - case 100:
> - bmsr |= BMSR_100FULL;
> - bmcr |= BMCR_SPEED100;
> - lpa |= LPA_100FULL;
> - break;
> - case 10:
> - bmsr |= BMSR_10FULL;
> - lpa |= LPA_10FULL;
> - break;
> - default:
> - pr_warn("fixed phy: unknown speed\n");
> - return -EINVAL;
> - }
> - } else {
> - switch (fp->status.speed) {
> - case 1000:
> - bmsr |= BMSR_ESTATEN;
> - bmcr |= BMCR_SPEED1000;
> - lpagb |= LPA_1000HALF;
> - break;
> - case 100:
> - bmsr |= BMSR_100HALF;
> - bmcr |= BMCR_SPEED100;
> - lpa |= LPA_100HALF;
> - break;
> - case 10:
> - bmsr |= BMSR_10HALF;
> - lpa |= LPA_10HALF;
> - break;
> - default:
> - pr_warn("fixed phy: unknown speed\n");
> - return -EINVAL;
> - }
> - }
> -
> - if (fp->status.link)
> + if (fp->status.link) {
>   bmsr |= BMSR_LSTATUS | BMSR_ANEGCOMPLETE;
> 
> - if (fp->status.pause)
> - lpa |= LPA_PAUSE_CAP;
> + if (fp->status.duplex) {
> + bmcr |= BMCR_FULLDPLX;
> +
> + switch (fp->status.speed) {
> + case 1000:
> + bmsr |= BMSR_ESTATEN;
> + bmcr |= BMCR_SPEED1000;
> + lpagb |= LPA_1000FULL;
> + break;
> + case 100:
> + bmsr |= BMSR_100FULL;
> + bmcr |= BMCR_SPEED100;
> + lpa |= LPA_100FULL;
> + break;
> + case 10:
> + bmsr |= BMSR_10FULL;
> + lpa |= LPA_10FULL;
> + break;
> + default:
> + pr_warn("fixed phy: unknown speed\n");
> + return -EINVAL;
> + }
> + } else {
> + switch (fp->status.speed) {
> + case 1000:
> + bmsr |= BMSR_ESTATEN;
> + bmcr |= BMCR_SPEED1000;
> + lpagb |= LPA_1000HALF;
> + break;
> + case 100:
> + bmsr |= BMSR_100HALF;
> + bmcr |= BMCR_SPEED100;
> + lpa |= LPA_100HALF;
> + break;
> + case 10:
> + bmsr |= BMSR_10HALF;
> + lpa |= LPA_10HALF;
> + break;
> + default:
> + pr_warn("fixed phy: unknown speed\n");
> + return -EINVAL;
> + }
> + }
> 
> - if (fp->status.asym_pause)
> - lpa |= LPA_PAUSE_ASYM;
> + if (fp->status.pause)
> + lpa |= LPA_PAUSE_CAP;
> +
> + if (fp->status.asym_pause)
> +

Re: [PATCH v2 0/2] net: enable inband link state negotiation only when explicitly requested

2015-07-10 Thread Stas Sergeev

10.07.2015 23:31, Florian Fainelli пишет:

On 10/07/15 09:38, Stas Sergeev wrote:

Hello.

Currently the link status auto-negotiation is enabled
for any SGMII link with fixed-link DT binding.
The regression was reported:
https://lkml.org/lkml/2015/7/8/865
Apparently not all HW that implements SGMII protocol, generates the
inband status for the auto-negotiation to work.
More details here:
https://lkml.org/lkml/2015/7/10/206

The following patches reverts to the old behavior by default,
which is to not enable the auto-negotiation for fixed-link.
The new DT property is added that allows to explicitly request
the auto-negotiation.

Those who were affected by the change, please send your Tested-by,
Thanks!

For future submissions, would you mind CC'ing everybody for the entire
patch series and not just on a per-patch basis?

I used get_maintainers.pl on a per-patch basis.
This is what SubmittingPatches.txt seems to suggest.
It doesn't say how to produce the CC lists for patch
series, it seems...
Anyway, I'll try to add more recipients to the announce
e-mail next time.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] of_mdio: add new DT property 'autoneg' for fixed-link

2015-07-10 Thread Florian Fainelli
On 10/07/15 13:08, Stas Sergeev wrote:
> 10.07.2015 21:37, Florian Fainelli пишет:
>> On 10/07/15 09:43, Stas Sergeev wrote:
>>> Currently for fixed-link the MAC driver decides whether to use the
>>> link status auto-negotiation or not.
>>> Unfortunately the auto-negotiation may not work when expected by
>>> the MAC driver. Sebastien Rannou explains:
>>> << Yes, I confirm that my HW does not generate an in-band status.
>>> AFAIK, it's
>>> a PHY that aggregates 4xSGMIIs to 1xQSGMII ; the MAC side of the PHY
>>> (with
>>> inband status) is connected to the switch through QSGMII, and in this
>>> context
>>> we are on the media side of the PHY. >>
>>> https://lkml.org/lkml/2015/7/10/206
>>>
>>> This patch introduces the new boolean property 'autoneg' that allows
>>> the user to request the auto-negotiation explicitly.
>> The implementation looks better, but the name might still be slightly
>> controversial. I would go with "use-in-band-status" which is more
>> strictly defined than "autoneg" which could mean anything and everything.
>>
>> What do you think?
> I actually think autoneg is a bit better.
> 
> - Autonegotiation is a widely used and known term:
> https://en.wikipedia.org/wiki/Autonegotiation
> And who knows what in-band status is?

You and I apparently do because otherwise you would not have ran into
this problem and more generally, anyone having some mild exposure to the
(S|R)GMII protocols should at some point, but that is a pointless argument.

> And, more importantly, who knows what is it used for?
> Who even knows it is used for autonegotiation?

It is not about what do people know most, it is about being accurate and
specific.

> 
> - When we set autoneg for fixed-link, we basically just
> say "no MDIO here, but please do autoneg by any other
> means, if possible".

I agree with this.

> 
> - in-band status is an implementation delail, and it is
> specific to a particular protocols. If you request the
> in-band status for some protocol that doesn't support
> it, perhaps you should get -EINVAL, because such a
> config makes no sense. With autonegotiation, the rules
> are not that strict: it can be "unimplemented", which doesn't
> necessary mean nonsense in the config.

So by specifying "autoneg", you are not specific about the kind of
auto-negotiation protocol available, which is precisely my point: you
need to go down to that level of detail for this to be useful. So maybe
something like:

autoneg = "in-band-status" would actually be a better thing in terms of
description because then you would tell what can be made available/working?

> 
> - autonegotiation is a wider term, and may be implemented
> by some other means than the in-band status (which is
> probably impossible for a fixed-link though).
> 
> - In the terms that the driver uses, it is autonegotiation, eg
> MVNETA_GMAC_AUTONEG_CONFIG. And when you go down
> the implementation details, you see MVNETA_GMAC_INBAND_AN_ENABLE,
> which is just one AN bit of many.

But arguably, there could be another auto-negotiation method, which is
not in-band status related, which means that you would need a way to
distinguish between using in-band status, or using something else or
nothing, would not you?

> 
> So I really would prefer to keep things as is.
> But if you insist, I can rename, but there will still be no
> -EINVAL checks for obviously wrong configs.


-- 
Florian
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fighting out-of-order reception with RPS?

2015-07-10 Thread Oliver Hartkopp
On 07/10/2015 04:48 AM, Tom Herbert wrote:
> On Wed, Jul 8, 2015 at 10:55 PM, Oliver Hartkopp  
> wrote:
>> Both drivers do not use NAPI. The just follow the way
>>
>> interrupt -> alloc_skb() -> fill skb -> netif_rx(skb)
>>
>> I'm usually testing with the USB adapters as the PCIe setup is not very
>> handy.
>>
> Okay, I see what is happening. In netif_rx when RPS is not enabled
> that packet is queued to the backlog queue for the local CPU. Since
> you're doing round robin on the interrupts then OOO packets can be a
> result. Unfortunately, this is the expected behavior. The correct
> kernel fix would be to move to these drivers to use NAPI.

Hm. Doesn't sound like a good solution when there's a difference between NAPI
and non-NAPI drivers in matters of OOO, right?

> RPS
> eliminates the OOO, but if there is no ability to derive a flow hash
> from packets everything will wind up one queue without load balancing.

Correct.

That's why I added

skb_set_hash(skb, dev->ifindex, PKT_HASH_TYPE_L2);

in my driver, because the only relevant flow identifiction is the number of
the incoming CAN interface.

> Besides that, automatically setting RPS in drivers is a difficult
> proposition since there is no definitively "correct" way to do that in
> an arbitrary configuration.

What about checking in netif_rx() if the non-NAPI driver has set a hash (aka
the driver is OOO sensitive)?
And if so we could automatically set rps_cpus for this interface in a way that
all CPUs are enabled to take skbs following the hash.

Best regards,
Oliver


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/2] net: enable inband link state negotiation only when explicitly requested

2015-07-10 Thread Florian Fainelli
On 10/07/15 09:38, Stas Sergeev wrote:
> Hello.
> 
> Currently the link status auto-negotiation is enabled
> for any SGMII link with fixed-link DT binding.
> The regression was reported:
> https://lkml.org/lkml/2015/7/8/865
> Apparently not all HW that implements SGMII protocol, generates the
> inband status for the auto-negotiation to work.
> More details here:
> https://lkml.org/lkml/2015/7/10/206
> 
> The following patches reverts to the old behavior by default,
> which is to not enable the auto-negotiation for fixed-link.
> The new DT property is added that allows to explicitly request
> the auto-negotiation.
> 
> Those who were affected by the change, please send your Tested-by,
> Thanks!

For future submissions, would you mind CC'ing everybody for the entire
patch series and not just on a per-patch basis?
-- 
Florian
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Panic with demuxed ipv4 multicast udp sockets on 4.0.4

2015-07-10 Thread Alex Gartrell
Hey everyone,

There's some kind of nasty condition in which sk_rx_dst points to an
apparently garbage datastructure and it's blowing up in the early
demux code because dst->ops is NULL.  The packet in question was for
bit torrent local peer discovery
https://en.wikipedia.org/wiki/Local_Peer_Discovery .  We're seeing
this on about a 1/200 chance of panic per day.

crash> bt
PID: 1899532  TASK: 88000826cf00  CPU: 9   COMMAND: "hhvm.node.1"
 #0 [88047fc23990] machine_kexec at 8103bf05
 #1 [88047fc239e0] crash_kexec at 810cb4e8
 #2 [88047fc23ab0] oops_end at 81006468
 #3 [88047fc23ae0] no_context at 8167aac1
 #4 [88047fc23b40] __bad_area_nosemaphore at 8167acb9
 #5 [88047fc23b90] bad_area_nosemaphore at 8167aceb
 #6 [88047fc23ba0] __do_page_fault at 81044ac5
 #7 [88047fc23c10] do_page_fault at 81044eec
 #8 [88047fc23c20] page_fault at 81686c02
[exception RIP: udp_v4_early_demux+481]
RIP: 816249a1  RSP: 88047fc23cd8  RFLAGS: 00010246
RAX: 880248ee4500  RBX: 093a  RCX: 0002
RDX:   RSI:   RDI: 880248ee4500
RBP: 88047fc23d48   R8:    R9: 
R10: 0001  R11: c9000199f3a0  R12: 88006f8a6300
R13: 81cbb1c0  R14: 0001  R15: 880474798f00
ORIG_RAX:   CS: 0010  SS: 
 #9 [88047fc23cd0] udp_v4_early_demux at 81624bb3
#10 [88047fc23d50] ip_rcv_finish at 815f3055
#11 [88047fc23d80] ip_rcv at 815f3952
#12 [88047fc23dc0] __netif_receive_skb_core at 815b96d4
#13 [88047fc23e30] __netif_receive_skb at 815b9911
#14 [88047fc23e50] process_backlog at 815b99f0
#15 [88047fc23ea0] net_rx_action at 815ba1e8
#16 [88047fc23f30] __do_softirq at 81054ce6
#17 [88047fc23f90] irq_exit at 81055075
#18 [88047fc23fa0] smp_call_function_single_interrupt at 810319f5
#19 [88047fc23fb0] call_function_single_interrupt at 8168637a
---  ---
#20 [8800792dff58] call_function_single_interrupt at 8168637a
RIP: 006e7b4c  RSP: 7f4c8ba38b80  RFLAGS: 0216
RAX: 006b  RBX: 816851f2  RCX: 7f49f4de84d6
RDX: 7f49f4de84d8  RSI: 7f48dbcce731  RDI: 
RBP: 7f4c8ba38bd0   R8: 006b   R9: 
R10: 7f48dbcce737  R11: 7f49f4de84e0  R12: 7f4adab85198
R13: 0014  R14: 7f4adaaa4c00  R15: 
ORIG_RAX: ff04  CS: 0033  SS: 002b
crash> print *(struct *dst_entry)0x880248ee4500
A syntax error in expression, near `*dst_entry)0x880248ee4500'.
gdb: gdb request failed: print *(struct *dst_entry)0x880248ee4500
crash> print *(struct dst_entry*)0x880248ee4500
$1 = {
  callback_head = {
next = 0x880248ee4d00,
func = 0x0
  },
  child = 0x13eacdfb7df67f6b,
  dev = 0x880113975d00,
  ops = 0x0,
  _metrics = 13729079323838086211,
  expires = 103079215104,
  path = 0x24c8d3baa,
  from = 0x0,
  xfrm = 0x6,
  input = 0x0,
  output = 0x0,
  flags = 5536,
  pending_confirm = 33114,
  error = -1,
  obsolete = -1,
  header_len = 0,
  trailer_len = 0,
  tclassid = 0,
  __pad_to_align_refcnt = {0, 704374636708},
  __refcnt = {
counter = 14
  },
  __use = 2097153,
  lastuse = 536576,
  {
next = 0x0,
rt_next = 0x0,
rt6_next = 0x0,
dn_next = 0x0
  }
}

-- 
Alex Gartrell 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] of_mdio: add new DT property 'autoneg' for fixed-link

2015-07-10 Thread Stas Sergeev

10.07.2015 21:37, Florian Fainelli пишет:

On 10/07/15 09:43, Stas Sergeev wrote:

Currently for fixed-link the MAC driver decides whether to use the
link status auto-negotiation or not.
Unfortunately the auto-negotiation may not work when expected by
the MAC driver. Sebastien Rannou explains:
<< Yes, I confirm that my HW does not generate an in-band status. AFAIK, it's
a PHY that aggregates 4xSGMIIs to 1xQSGMII ; the MAC side of the PHY (with
inband status) is connected to the switch through QSGMII, and in this context
we are on the media side of the PHY. >>
https://lkml.org/lkml/2015/7/10/206

This patch introduces the new boolean property 'autoneg' that allows
the user to request the auto-negotiation explicitly.

The implementation looks better, but the name might still be slightly
controversial. I would go with "use-in-band-status" which is more
strictly defined than "autoneg" which could mean anything and everything.

What do you think?

I actually think autoneg is a bit better.

- Autonegotiation is a widely used and known term:
https://en.wikipedia.org/wiki/Autonegotiation
And who knows what in-band status is?
And, more importantly, who knows what is it used for?
Who even knows it is used for autonegotiation?

- When we set autoneg for fixed-link, we basically just
say "no MDIO here, but please do autoneg by any other
means, if possible".

- in-band status is an implementation delail, and it is
specific to a particular protocols. If you request the
in-band status for some protocol that doesn't support
it, perhaps you should get -EINVAL, because such a
config makes no sense. With autonegotiation, the rules
are not that strict: it can be "unimplemented", which doesn't
necessary mean nonsense in the config.

- autonegotiation is a wider term, and may be implemented
by some other means than the in-band status (which is
probably impossible for a fixed-link though).

- In the terms that the driver uses, it is autonegotiation, eg
MVNETA_GMAC_AUTONEG_CONFIG. And when you go down
the implementation details, you see MVNETA_GMAC_INBAND_AN_ENABLE,
which is just one AN bit of many.

So I really would prefer to keep things as is.
But if you insist, I can rename, but there will still be no
-EINVAL checks for obviously wrong configs.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] net: dsa: mv88e6xxx: sleep in _mv88e6xxx_stats_wait

2015-07-10 Thread Guenter
Hi Vivien,

On Fri, Jul 10, 2015 at 03:21:47PM -0400, Vivien Didelot wrote:
> Hi Guenter,
> >> I must have missed where is the benefit from spin reading 10 times this
> >> register, rather than sleeping 1ms between tests. Does this busy bit
> >> behaves differently from the phy, atu, scratch, or vtu busy bits?
> >> 
> > Benefit is reaction time, mostly. If the result isn't ready after the
> > first spin, the new code path adds a mandatory 1-2ms delay. This could
> > add up to a lot if that kind of retry is seen a lot.
> 
> To me, it looks like if this mandatory 1-2ms delay is an issue, then
> _mv88e6xxx_wait must be fixed. Maybe reducing this delay is an option?
> 
Good point. The timeout is most definitely quite large and for sure on
the safe side. It might make sense to add some statistics gathering to
see how long the maximum observed delay actually is.

Guenter
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] nohz: prevent tilegx network driver interrupts

2015-07-10 Thread Chris Metcalf
Normally the tilegx networking shim sends irqs to all the cores
to distribute the load of processing incoming-packet interrupts,
so that you can get to multiple Gb's of traffic inbound.

However, in nohz_full mode we don't want to interrupt the
nohz_full cores by default, so we limit the set of cores we use
to only the online housekeeping cores.

To make client code easier to read, we introduce a new nohz_full
accessor, housekeeping_cpumask(), which returns a pointer to the
housekeeping_mask if nohz_full is enabled, and otherwise returns
the cpu_possible_mask.

Signed-off-by: Chris Metcalf 
---
 drivers/net/ethernet/tile/tilegx.c | 4 +++-
 include/linux/tick.h   | 9 +
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/tile/tilegx.c 
b/drivers/net/ethernet/tile/tilegx.c
index a3f7610002aa..0a15acc075b3 100644
--- a/drivers/net/ethernet/tile/tilegx.c
+++ b/drivers/net/ethernet/tile/tilegx.c
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -2273,7 +2274,8 @@ static int __init tile_net_init_module(void)
tile_net_dev_init(name, mac);
 
if (!network_cpus_init())
-   network_cpus_map = *cpu_online_mask;
+   cpumask_and(&network_cpus_map, housekeeping_cpumask(),
+   cpu_online_mask);
 
return 0;
 }
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 3741ba1a652c..d3b00f0b2221 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -160,6 +160,15 @@ static inline void tick_nohz_full_kick_all(void) { }
 static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
 #endif
 
+static inline const struct cpumask *housekeeping_cpumask(void)
+{
+#ifdef CONFIG_NO_HZ_FULL
+   if (tick_nohz_full_enabled())
+   return housekeeping_mask;
+#endif
+   return cpu_possible_mask;
+}
+
 static inline bool is_housekeeping_cpu(int cpu)
 {
 #ifdef CONFIG_NO_HZ_FULL
-- 
2.1.2

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 08/22] fjes: net_device_ops.ndo_start_xmit

2015-07-10 Thread Yasuaki Ishimatsu
Hi Izumi-san,

On Wed, 24 Jun 2015 11:55:40 +0900
Taku Izumi  wrote:

> This patch adds net_device_ops.ndo_start_xmit callback,
> which is called when sending packets.
> 
> Signed-off-by: Taku Izumi 
> ---
>  drivers/net/fjes/fjes.h  |   1 +
>  drivers/net/fjes/fjes_hw.c   |  58 ++
>  drivers/net/fjes/fjes_hw.h   |  12 +++
>  drivers/net/fjes/fjes_main.c | 177 
> +++
>  4 files changed, 248 insertions(+)
> 
> diff --git a/drivers/net/fjes/fjes.h b/drivers/net/fjes/fjes.h
> index f182ed3..7af4304 100644
> --- a/drivers/net/fjes/fjes.h
> +++ b/drivers/net/fjes/fjes.h
> @@ -29,6 +29,7 @@
>  #define FJES_ACPI_SYMBOL "Extended Socket"
>  #define FJES_MAX_QUEUES  1
>  #define FJES_TX_RETRY_INTERVAL   (20 * HZ)
> +#define FJES_TX_RETRY_TIMEOUT(100)
>  #define FJES_OPEN_ZONE_UPDATE_WAIT   (300) /* msec */
>  
>  /* board specific private data structure */
> diff --git a/drivers/net/fjes/fjes_hw.c b/drivers/net/fjes/fjes_hw.c
> index a17cb96..b2e7509 100644
> --- a/drivers/net/fjes/fjes_hw.c
> +++ b/drivers/net/fjes/fjes_hw.c
> @@ -793,3 +793,61 @@ int fjes_hw_wait_epstop(struct fjes_hw *hw)
>   return (wait_time < FJES_COMMAND_EPSTOP_WAIT_TIMEOUT * 1000)
>   ? 0 : -EBUSY;
>  }
> +
> +bool fjes_hw_check_epbuf_version(struct epbuf_handler *epbh, u32 version)
> +{
> + union ep_buffer_info *info = epbh->info;
> +
> + return (info->common.version == version);
> +}
> +
> +bool fjes_hw_check_mtu(struct epbuf_handler *epbh, u32 mtu)
> +{
> + union ep_buffer_info *info = epbh->info;
> +
> + return (info->v1i.frame_max == FJES_MTU_TO_FRAME_SIZE(mtu));
> +}
> +
> +bool fjes_hw_check_vlan_id(struct epbuf_handler *epbh, u16 vlan_id)
> +{
> + union ep_buffer_info *info = epbh->info;
> + int i;
> + bool ret = false;
> +
> + if (vlan_id == 0) {
> + ret = true;
> + } else {
> + for (i = 0; i < EP_BUFFER_SUPPORT_VLAN_MAX; i++) {
> + if (vlan_id == info->v1i.vlan_id[i]) {
> + ret = true;
> + break;
> + }
> + }
> + }
> + return ret;
> +}
> +
> +int fjes_hw_epbuf_tx_pkt_send(struct epbuf_handler *epbh,
> +   void *frame, size_t size)
> +{
> + union ep_buffer_info *info = epbh->info;
> + struct esmem_frame *ring_frame;
> +
> + if (EP_RING_FULL(info->v1i.head, info->v1i.tail, info->v1i.count_max))
> + return -ENOBUFS;
> +
> + ring_frame =
> + (struct esmem_frame *)&(epbh->
> +   ring[EP_RING_INDEX
> +(info->v1i.tail - 1,
> + info->v1i.count_max) *
> +info->v1i.frame_max]);
> +
> + ring_frame->frame_size = size;
> + memcpy((void *)(ring_frame->frame_data), (void *)frame, size);
> +
> + EP_RING_INDEX_INC(epbh->info->v1i.tail, info->v1i.count_max);
> +
> + return 0;
> +}
> +
> diff --git a/drivers/net/fjes/fjes_hw.h b/drivers/net/fjes/fjes_hw.h
> index 823d802..95b9732 100644
> --- a/drivers/net/fjes/fjes_hw.h
> +++ b/drivers/net/fjes/fjes_hw.h
> @@ -49,6 +49,9 @@ struct fjes_hw;
>  
>  #define FJES_ZONING_ZONE_TYPE_NONE (0xFF)
>  
> +#define FJES_TX_DELAY_SEND_NONE  (0)
> +#define FJES_TX_DELAY_SEND_PENDING   (1)
> +
>  #define FJES_RX_STOP_REQ_NONE(0x0)
>  #define FJES_RX_STOP_REQ_DONE(0x1)
>  #define FJES_RX_STOP_REQ_REQUEST (0x2)
> @@ -60,6 +63,11 @@ struct fjes_hw;
>  
>  #define EP_RING_NUM(buffer_size, frame_size) \
>   (u32)((buffer_size) / (frame_size))
> +#define EP_RING_INDEX(_num, _max) (((_num) + (_max)) % (_max))
> +#define EP_RING_INDEX_INC(_num, _max) \
> + ((_num) = EP_RING_INDEX((_num) + 1, (_max)))
> +#define EP_RING_FULL(_head, _tail, _max) \
> + (0 == EP_RING_INDEX(((_tail) - (_head)), (_max)))
>  
>  #define FJES_MTU_TO_BUFFER_SIZE(mtu) \
>   (ETH_HLEN + VLAN_HLEN + (mtu) + ETH_FCS_LEN)
> @@ -308,5 +316,9 @@ enum ep_partner_status
>  
>  bool fjes_hw_epid_is_same_zone(struct fjes_hw *, int);
>  int fjes_hw_epid_is_shared(struct fjes_device_shared_info *, int);
> +bool fjes_hw_check_epbuf_version(struct epbuf_handler *, u32);
> +bool fjes_hw_check_mtu(struct epbuf_handler *, u32);
> +bool fjes_hw_check_vlan_id(struct epbuf_handler *, u16);
> +int fjes_hw_epbuf_tx_pkt_send(struct epbuf_handler *, void *, size_t);
>  
>  #endif /* FJES_HW_H_ */
> diff --git a/drivers/net/fjes/fjes_main.c b/drivers/net/fjes/fjes_main.c
> index 832439b..f2ef286 100644
> --- a/drivers/net/fjes/fjes_main.c
> +++ b/drivers/net/fjes/fjes_main.c
> @@ -51,6 +51,7 @@ static int fjes_open(struct net_device *);
>  static int fjes_close(struct net_device *);
>  static int fjes_setup_resources(struct fjes_adapter *);
>  static void fjes_free_resource

Re: [PATCH 1/2] net: dsa: mv88e6xxx: sleep in _mv88e6xxx_stats_wait

2015-07-10 Thread Vivien Didelot
Hi Guenter,

On Jul 10, 2015, at 2:36 PM, Guenter Roeck li...@roeck-us.net wrote:

> Hi Vivien,
> 
> On Fri, Jul 10, 2015 at 02:20:47PM -0400, Vivien Didelot wrote:
>> > 
>> > is this really beneficial and/or needed ?
>> 
>> Except using existing generic code, no.
>> 
>> > It adds at least 1ms delay to a loop which did not have any delay at
>> > all unless the register read itself was sleeping.
>> 
>> I must have missed where is the benefit from spin reading 10 times this
>> register, rather than sleeping 1ms between tests. Does this busy bit
>> behaves differently from the phy, atu, scratch, or vtu busy bits?
>> 
> Benefit is reaction time, mostly. If the result isn't ready after the
> first spin, the new code path adds a mandatory 1-2ms delay. This could
> add up to a lot if that kind of retry is seen a lot.

To me, it looks like if this mandatory 1-2ms delay is an issue, then
_mv88e6xxx_wait must be fixed. Maybe reducing this delay is an option?

> I don't now if there is a specific time limit for this busy bit,
> and/or if it behaves differently than the others in terms of timing.
> 
>> > Is the original function seen to return a timeout error under some
>> > circumstances ?
>> 
>> I didn't experience it myself, but I guess it may happen. In addition to
>> that, the current implementation doesn't check eventual read error.
>> That's why I saw a benefit in using _mv88e6xxx_wait().
> 
> Checking for a read error (or a timeout) is definitely a good thing.
> I could also imagine that, for example, a "clear statistics" request
> takes more time than currently supported. This is why I asked if you
> had seen a timeout with the old code.
> 
> Personally I'd rather leave the wait loop alone and only introduce
> error checking unless there is a reason to introduce a sleep,
> but I'd like to hear Andrew's and/or Florian's opinion.

Andrew may not reply since he's on vacation, but I add Florian in Cc.

Thanks,
-v
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] nohz: prevent tilegx network driver interrupts

2015-07-10 Thread Chris Metcalf

On 07/10/2015 02:24 PM, Frederic Weisbecker wrote:

On Fri, Jul 10, 2015 at 01:33:44PM -0400, Chris Metcalf wrote:

In nohz_full mode, by default distribute networking shim
interrupts across the housekeeping cores, not all the cores.

I can't really tell, I have no idea what this driver does. It seems
to be about networking CPUs but I have no idea what we are affining
here. Whether it is task, interrupt, ... And what those affine things do,
if it is safe to do that reduce affinity etc..

I looked at the driver but I can't make my way there. I think you
need a more detailed changelog :-)


Fair enough! :-)  See updated changelog to follow.


Signed-off-by: Chris Metcalf 
---
The alternate approaches to this might be:

1. "#define housekeeping_mask cpu_online_mask" in the non-nohz_full
arm in , then just unconditionally use
"housekeeping_mask".

Indeed we are doing more and more references on housekeeping_mask, so
we should probably think about an off-case.

Now the nohz-full off-case should rather be cpu_possible_mask than
cpu_online_mask. housekeeping_mask doesn't take into account onlining
at all.


That suggests that in this case, we might want to default to
something like "housekeeping_mask & cpu_online_mask",
since you really don't want to send irqs to offline cores to
process your packets :-)

The tilegx chips typically don't do cpu offlining anyway, since
we've never really found a usecase, so whatever you boot with
you always have available.  We do have support for a bare-metal
mode which you can run on some of the cores, so you may start
with fewer than cpu_possible actually running, but it will always
be that same set of cores.

So this does suggest that my original patch is wrong for that
same reason.


2. Provide an accessor that returns the cpumask to use for housekeeping
chores and implement it in the obvious ways for both nohz_full
and non-nohz_full.

The latter seems like arguably the most satisfying approach, but
the patch below is, if nothing else, suitable to push for 4.3
without any further API development work.

I don't know. 1) looks easier.


On reflection, the problem with (1) is that if you are in NO_HZ_FULL
mode but !tick_nohz_full_enabled(), you want to fall back to just
using cpu_possible_mask anyway.  So I think a simple accessor that
returns an appropriate cpumask pointer is probably the best bet
(along the lines of the existing is_housekeeping_cpu() accessor).

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC net-next 11/22] dst: Metadata destinations

2015-07-10 Thread Thomas Graf
On 07/10/15 at 09:57pm, Julian Anastasov wrote:
> 
>   Hello,
> 
> On Fri, 10 Jul 2015, Thomas Graf wrote:
> 
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -1691,6 +1691,8 @@ static int ip_route_input_slow(struct sk_buff *skb, 
> > __be32 daddr, __be32 saddr,
> >by fib_lookup.
> >  */
> >  
> > +   skb_dst_drop(skb);
> > +
> 
>   This is not very safe. There are places that
> call ip_route_input() for temporary lookups and they
> do not set NULL. For example, icmp_route_lookup(),
> may be there are other such places...

Wow. What a tremendous hack ;-) It saves and restores the original
dst ref to avoid the leaked reference.

>   OTOH, ip_options_rcv_srr() looks correct to use
> skb_dst_set(skb, NULL), may be such call should be
> added if it is missing...

Agreed. This seems to be the right fix. I'll apply to this
icmp_route_lookup() as well.

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


re: net: fec: Ensure clocks are enabled while using mdio bus

2015-07-10 Thread Tyler Baker
Hi Andrew,

The kernelci.org system reported[1][2][3] new imx6 boot failures in
next-20150710. I've bisected[4] these failures to 6c3e921b18ed "net:
fec: Ensure clocks are enabled while using mdio bus". Reverting this
commit on top of next-20150710, resolves the boot failures. I have not
investigated any further, but did add the imx maintainer to the CC
list. Apologies in advanced for not replying to the original thread,
as I was not subscribed to the netdev list until now.

Any chance you can confirm this regression on your end?

Cheers,

Tyler

[1] http://kernelci.org/boot/all/job/next/kernel/next-20150710/
[2] 
http://storage.kernelci.org/next/next-20150710/arm-multi_v7_defconfig/lab-collabora/boot-imx6q-sabrelite.html
[3] 
http://storage.kernelci.org/next/next-20150710/arm-imx_v6_v7_defconfig/lab-tbaker/boot-imx6q-cm-fx6.html
[4] http://hastebin.com/uzofovupey.pas
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC net-next 11/22] dst: Metadata destinations

2015-07-10 Thread Julian Anastasov

Hello,

On Fri, 10 Jul 2015, Thomas Graf wrote:

> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1691,6 +1691,8 @@ static int ip_route_input_slow(struct sk_buff *skb, 
> __be32 daddr, __be32 saddr,
>  by fib_lookup.
>*/
>  
> + skb_dst_drop(skb);
> +

This is not very safe. There are places that
call ip_route_input() for temporary lookups and they
do not set NULL. For example, icmp_route_lookup(),
may be there are other such places...

OTOH, ip_options_rcv_srr() looks correct to use
skb_dst_set(skb, NULL), may be such call should be
added if it is missing...

>   if (ipv4_is_multicast(saddr) || ipv4_is_lbcast(saddr))
>   goto martian_source;

Regards

--
Julian Anastasov 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC net-next 12/22] arp: Inherit metadata dst when creating ARP requests

2015-07-10 Thread Sergei Shtylyov

On 07/10/2015 05:19 PM, Thomas Graf wrote:


If output device wants to see the dst, inherit the dst of the
original skb and pass it on to generate the ARP request.



Signed-off-by: Thomas Graf 
---
  net/ipv4/arp.c | 71 +++---
  1 file changed, 43 insertions(+), 28 deletions(-)



diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 933a928..3400aea 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -291,6 +291,46 @@ static void arp_error_report(struct neighbour *neigh, 
struct sk_buff *skb)
kfree_skb(skb);
  }

+/*
+ * Create and send an arp packet.
+ */
+static void arp_send_dst(int type, int ptype, __be32 dest_ip,
+struct net_device *dev, __be32 src_ip,
+const unsigned char *dest_hw,
+const unsigned char *src_hw,
+const unsigned char *target_hw, struct sk_buff *oskb)
+{
+   struct sk_buff *skb;
+
+   /*
+*  No arp on this interface.
+*/


   The networking code, we do multi-line comments this way:

/* bla
 * bla
 */


+
+   if (dev->flags&IFF_NOARP)


   Please surround & with spaces.

[...]

@@ -597,32 +638,6 @@ void arp_xmit(struct sk_buff *skb)
  EXPORT_SYMBOL(arp_xmit);

  /*
- * Create and send an arp packet.
- */
-void arp_send(int type, int ptype, __be32 dest_ip,
- struct net_device *dev, __be32 src_ip,
- const unsigned char *dest_hw, const unsigned char *src_hw,
- const unsigned char *target_hw)
-{
-   struct sk_buff *skb;
-
-   /*
-*  No arp on this interface.
-*/
-
-   if (dev->flags&IFF_NOARP)
-   return;


   Ah, you're moving the code...

[...]

WBR, Sergei

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/11] soc/fsl: Introduce DPAA BMan device management driver

2015-07-10 Thread Scott Wood
On Fri, 2015-07-10 at 13:29 -0500, Pledge Roy-R01356 wrote:
> >   return in_be32((void *)bm + offset);
> > >  ^
> > > [...]/drivers/soc/fsl/qbman/bman.c: In function ‘__bm_out’:
> > > [...]/drivers/soc/fsl/qbman/bman.c:172:2: error: implicit declaration
> > > of function ‘out_be32’ [-Werror=implicit-function-declaration]
> > >   out_be32((void *)bm + offset, val);
> > 
> > These PPCisms will need to be fixed.  LS1043A is an ARM chip with DPAA 
> > 1.0.
> 
> LS1043 (ARM, Little Endian) support is still work in progress.  The patches 
> for that are being tested now but are based on the SDK version of the 
> driver and will need to be massaged in order to get them applied here.  Our 
> plan of record is to add upstream support for this at a later time.

If you're already reworking this for upstream acceptance, why not fix the 
more obvious PPCisms now?

-Scott

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC net-next 3/6] net: Introduce VRF device driver - v2

2015-07-10 Thread David Ahern

On 7/9/15 10:56 PM, Eric W. Biederman wrote:

I have given specific areas of concern, and explained myself and you are
blowing me off.


You have not had answered my question with any additional details or 
code references -- ie., a specific example. Asking you for clarification 
and details is not blowing you off.


To recap:

Eric: "With respect to sockets there is also the issue that ip addresses 
are not per vrf."


David: "IP addresses are per interface and interfaces are uniquely 
assigned to a VRF so why do you think IP addresses are not per VRF?"


Eric: "I have read large swaths of the linux networking code over the 
years. Further I was thinking more about non-local addresses ip 
addresses, but I would not be surprised if there are also issues with 
local addresses."


David: "Well, if someone has a specific example I'll take a look."

So, let me try this again: All of the IPv4 and IPv6 addresses I am aware 
of are held in structs linked to a specific netdevice. Can you give me a 
specific example of what you mean here? I can't respond to your feedback 
based on the little information you have given me.




Besides the fragment reassembly and xfrm there are things like the
ineetpeer cache.


noted.




   Which means things like packet fragmentation reassembly
can easily do the wrong thing.  Similarly things like the xfrm for ipsec
tunnels are not hooked into this mix.

So I really do not see how this VRF/MRF thing as designed can support
general purpose sockets.  I am not certain it can correctly support any
kind of socket except perhaps SOCK_RAW.


Sockets bound to the VRF device work properly. Why do you think they won't?


Because there are many locations in the network stack (like fragment
reassembly) that make the assumption that ip addresses are unique and
do not bother looking at network device or anything else.  If fragments
manage to come into play I don't expect it would be hard to poision a
connections with fragments from another routing domain with overlapping
ip addresses.


If that is true it is a problem with the networking stack today and is
completely independent of this VRF proposal.


Not at all.  It is required functionality for reassembly of ip fragments
when the packets come in via different paths.  This is required to
support multi-path ip reception.

This only becomes a bug in the scenario you have proposed.


Can you please point to a specific line in one of these patches that 
impacts fragmentation?


This patch -- patch 3 -- adds a VRF driver. It has not mucked with 
packets at all.


Patch 4 (again I have a better split in a forthcoming revision) tweaks a 
few places in the IPv4 stack with respect to socket lookups (dif 
modified) and FIB table lookups (specifying a table to use or tweaking 
oif/iif).


Since the VRF device has not touched the packet and does not introduce a 
tunnel how has its use/existence impacted fragmentation?


I do plan tests to include ipsec for example and fragmentation; it's a 
matter of time. If you have suggestions on a setup/test case I should 
include please let me know.


David
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] of_mdio: add new DT property 'autoneg' for fixed-link

2015-07-10 Thread Florian Fainelli
On 10/07/15 09:43, Stas Sergeev wrote:
> 
> Currently for fixed-link the MAC driver decides whether to use the
> link status auto-negotiation or not.
> Unfortunately the auto-negotiation may not work when expected by
> the MAC driver. Sebastien Rannou explains:
> << Yes, I confirm that my HW does not generate an in-band status. AFAIK, it's
> a PHY that aggregates 4xSGMIIs to 1xQSGMII ; the MAC side of the PHY (with
> inband status) is connected to the switch through QSGMII, and in this context
> we are on the media side of the PHY. >>
> https://lkml.org/lkml/2015/7/10/206
> 
> This patch introduces the new boolean property 'autoneg' that allows
> the user to request the auto-negotiation explicitly.

The implementation looks better, but the name might still be slightly
controversial. I would go with "use-in-band-status" which is more
strictly defined than "autoneg" which could mean anything and everything.

What do you think?

> 
> Signed-off-by: Stas Sergeev 
> 
> CC: Rob Herring 
> CC: Pawel Moll 
> CC: Mark Rutland 
> CC: Ian Campbell 
> CC: Kumar Gala 
> CC: Florian Fainelli 
> CC: Grant Likely 
> CC: devicet...@vger.kernel.org
> CC: linux-ker...@vger.kernel.org
> CC: netdev@vger.kernel.org
> ---
>  .../devicetree/bindings/net/fixed-link.txt |  6 +-
>  drivers/of/of_mdio.c   | 23 
> --
>  include/linux/of_mdio.h|  5 +
>  3 files changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/devicetree/bindings/net/fixed-link.txt 
> b/Documentation/devicetree/bindings/net/fixed-link.txt
> index 82bf7e0..e2959a8 100644
> --- a/Documentation/devicetree/bindings/net/fixed-link.txt
> +++ b/Documentation/devicetree/bindings/net/fixed-link.txt
> @@ -9,8 +9,12 @@ Such a fixed link situation is described by creating a 
> 'fixed-link'
>  sub-node of the Ethernet MAC device node, with the following
>  properties:
> 
> +* 'autoneg' (boolean, optional), to enable the auto-negotiation of link
> +  state. Auto-negotiation is MII protocol, HW and driver-specific and is
> +  not supported in many cases, so use it only when you know what you do.
>  * 'speed' (integer, mandatory), to indicate the link speed. Accepted
> -  values are 10, 100 and 1000
> +  values are 10, 100 and 1000. If the auto-negotiation is enabled,
> +  'speed' may not be set. It will then be auto-negotiated, if possible.
>  * 'full-duplex' (boolean, optional), to indicate that full duplex is
>used. When absent, half duplex is assumed.
>  * 'pause' (boolean, optional), to indicate that pause should be
> diff --git a/drivers/of/of_mdio.c b/drivers/of/of_mdio.c
> index 1bd4305..12b2ede 100644
> --- a/drivers/of/of_mdio.c
> +++ b/drivers/of/of_mdio.c
> @@ -280,6 +280,22 @@ bool of_phy_is_fixed_link(struct device_node *np)
>  }
>  EXPORT_SYMBOL(of_phy_is_fixed_link);
> 
> +bool of_phy_is_autoneg_link(struct device_node *np)
> +{
> + struct device_node *dn;
> + bool ret;
> +
> + dn = of_get_child_by_name(np, "fixed-link");
> + if (!dn)
> + return false;
> +
> + ret = of_property_read_bool(dn, "autoneg");
> +
> + of_node_put(dn);
> + return ret;
> +}
> +EXPORT_SYMBOL(of_phy_is_autoneg_link);
> +
>  int of_phy_register_fixed_link(struct device_node *np)
>  {
>   struct fixed_phy_status status = {};
> @@ -291,10 +307,13 @@ int of_phy_register_fixed_link(struct device_node *np)
>   /* New binding */
>   fixed_link_node = of_get_child_by_name(np, "fixed-link");
>   if (fixed_link_node) {
> - status.link = 1;
> + bool autoneg = of_property_read_bool(fixed_link_node,
> +  "autoneg");
> + status.link = !autoneg;
>   status.duplex = of_property_read_bool(fixed_link_node,
> "full-duplex");
> - if (of_property_read_u32(fixed_link_node, "speed", 
> &status.speed))
> + if (of_property_read_u32(fixed_link_node, "speed",
> +  &status.speed) != 0 && !autoneg)
>   return -EINVAL;
>   status.pause = of_property_read_bool(fixed_link_node, "pause");
>   status.asym_pause = of_property_read_bool(fixed_link_node,
> diff --git a/include/linux/of_mdio.h b/include/linux/of_mdio.h
> index d449018..647f348 100644
> --- a/include/linux/of_mdio.h
> +++ b/include/linux/of_mdio.h
> @@ -65,6 +65,7 @@ static inline struct mii_bus *of_mdio_find_bus(struct 
> device_node *mdio_np)
>  #if defined(CONFIG_OF) && defined(CONFIG_FIXED_PHY)
>  extern int of_phy_register_fixed_link(struct device_node *np);
>  extern bool of_phy_is_fixed_link(struct device_node *np);
> +extern bool of_phy_is_autoneg_link(struct device_node *np);
>  #else
>  static inline int of_phy_register_fixed_link(struct device_node *np)
>  {
> @@ -74,6 +75,10 @@ static inline bool of_phy_is_fixed_link(struct device_node 

Re: [PATCH 1/2] net: dsa: mv88e6xxx: sleep in _mv88e6xxx_stats_wait

2015-07-10 Thread Guenter Roeck
Hi Vivien,

On Fri, Jul 10, 2015 at 02:20:47PM -0400, Vivien Didelot wrote:
> > 
> > is this really beneficial and/or needed ?
> 
> Except using existing generic code, no.
> 
> > It adds at least 1ms delay to a loop which did not have any delay at
> > all unless the register read itself was sleeping.
> 
> I must have missed where is the benefit from spin reading 10 times this
> register, rather than sleeping 1ms between tests. Does this busy bit
> behaves differently from the phy, atu, scratch, or vtu busy bits?
> 
Benefit is reaction time, mostly. If the result isn't ready after the
first spin, the new code path adds a mandatory 1-2ms delay. This could
add up to a lot if that kind of retry is seen a lot.

I don't now if there is a specific time limit for this busy bit,
and/or if it behaves differently than the others in terms of timing.

> > Is the original function seen to return a timeout error under some
> > circumstances ?
> 
> I didn't experience it myself, but I guess it may happen. In addition to
> that, the current implementation doesn't check eventual read error.
> That's why I saw a benefit in using _mv88e6xxx_wait().

Checking for a read error (or a timeout) is definitely a good thing.
I could also imagine that, for example, a "clear statistics" request
takes more time than currently supported. This is why I asked if you
had seen a timeout with the old code.

Personally I'd rather leave the wait loop alone and only introduce
error checking unless there is a reason to introduce a sleep,
but I'd like to hear Andrew's and/or Florian's opinion.

Thanks,
Guenter
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] nohz: prevent tilegx network driver interrupts

2015-07-10 Thread Frederic Weisbecker
On Fri, Jul 10, 2015 at 01:33:44PM -0400, Chris Metcalf wrote:
> In nohz_full mode, by default distribute networking shim
> interrupts across the housekeeping cores, not all the cores.

I can't really tell, I have no idea what this driver does. It seems
to be about networking CPUs but I have no idea what we are affining
here. Whether it is task, interrupt, ... And what those affine things do,
if it is safe to do that reduce affinity etc..

I looked at the driver but I can't make my way there. I think you
need a more detailed changelog :-)

> Signed-off-by: Chris Metcalf 
> ---
> The alternate approaches to this might be:
> 
> 1. "#define housekeeping_mask cpu_online_mask" in the non-nohz_full
>arm in , then just unconditionally use
>"housekeeping_mask".

Indeed we are doing more and more references on housekeeping_mask, so
we should probably think about an off-case.

Now the nohz-full off-case should rather be cpu_possible_mask than
cpu_online_mask. housekeeping_mask doesn't take into account onlining
at all.

> 
> 2. Provide an accessor that returns the cpumask to use for housekeeping
>chores and implement it in the obvious ways for both nohz_full
>and non-nohz_full.
> 
> The latter seems like arguably the most satisfying approach, but
> the patch below is, if nothing else, suitable to push for 4.3
> without any further API development work.

I don't know. 1) looks easier.

> 
> Frederic (or others), comments?
> 
>  drivers/net/ethernet/tile/tilegx.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/tile/tilegx.c 
> b/drivers/net/ethernet/tile/tilegx.c
> index a3f7610002aa..7687c62e7d75 100644
> --- a/drivers/net/ethernet/tile/tilegx.c
> +++ b/drivers/net/ethernet/tile/tilegx.c
> @@ -40,6 +40,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -2272,8 +2273,13 @@ static int __init tile_net_init_module(void)
>   for (i = 0; gxio_mpipe_link_enumerate_mac(i, name, mac) >= 0; i++)
>   tile_net_dev_init(name, mac);
>  
> - if (!network_cpus_init())
> + if (!network_cpus_init()) {
> +#ifdef CONFIG_NO_HZ_FULL
> + network_cpus_map = *housekeeping_mask;
> +#else
>   network_cpus_map = *cpu_online_mask;
> +#endif
> + }
>  
>   return 0;
>  }
> -- 
> 2.1.2
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 02/11] soc/fsl: Introduce DPAA BMan device management driver

2015-07-10 Thread Roy Pledge
>   return in_be32((void *)bm + offset);
> >  ^
> > [...]/drivers/soc/fsl/qbman/bman.c: In function ‘__bm_out’:
> > [...]/drivers/soc/fsl/qbman/bman.c:172:2: error: implicit declaration
> > of function ‘out_be32’ [-Werror=implicit-function-declaration]
> >   out_be32((void *)bm + offset, val);
> 
> These PPCisms will need to be fixed.  LS1043A is an ARM chip with DPAA 1.0.

LS1043 (ARM, Little Endian) support is still work in progress.  The patches for 
that are being tested now but are based on the SDK version of the driver and 
will need to be massaged in order to get them applied here.  Our plan of record 
is to add upstream support for this at a later time.

> 
> >   ^
> > [...]/drivers/soc/fsl/qbman/bman.c: In function ‘of_fsl_bman_probe’:
> > [...]/drivers/soc/fsl/qbman/bman.c:463:17: error: ‘NO_IRQ’ undeclared
> > (first use in this function)
> >   if (err_irq == NO_IRQ) {
> 
> This isn't even a PPCism.  It's just wrong.  Compare to zero instead.

Yeah - I recall fixing this when doing the ARM port but I guess I didn't make 
the same change in this version.  I will address this in the next version.


> 
> -Scott



Re: [RFC PATCH net-next] sctp: fix src address selection if using secondary addresses

2015-07-10 Thread Marcelo Ricardo Leitner
On Fri, Jul 10, 2015 at 01:14:21PM -0400, Vlad Yasevich wrote:
> On 07/10/2015 12:17 PM, Marcelo Ricardo Leitner wrote:
> > On Fri, Jul 10, 2015 at 11:35:28AM -0400, Vlad Yasevich wrote:
> >> On 07/10/2015 07:53 AM, Marcelo Ricardo Leitner wrote:
> >>> On Thu, Jul 09, 2015 at 09:55:19PM +0200, Michael Tuexen wrote:
> > On 09 Jul 2015, at 18:54, Marcelo Ricardo Leitner  
> > wrote:
> >
> > Cc'ing Michael too.
>  I'm not familiar with the Linux kernel code, so I can't comment on it.
>  But making sure to use a source address belonging to the emitting
>  interface makes sense for me.
> 
>  Best regards
>  Michael
> >>>
> >>> That's pretty much what I was looking for, in case I missed something on
> >>> SCTP RFCs. 
> >>>
> >>
> >> Well, the RFCs do not really specify what the source address should be, 
> >> and there
> > 
> > That's why I was afraid of having missed something ;)
> > 
> >> have been numerous times where I've seen weak host model in use on the wire
> >> even with a BSD peer.
> >>
> >> This also puts a very big nail through many suggestions we've had over the 
> >> years
> >> to allow source based path multihoming in addition to destination based 
> >> multihoming
> >> we currently support.
> >>
> >> It might be a good idea to make rp-filter like behavior best effort, and 
> >> have
> >> the old behavior as fallback.  I am still trying to think up different 
> >> scenarios
> >> where rp-filter behavior will cause things to fail prematurely...
> > 
> > The old behavior is like "if we don't have a src yet and can't find a
> > preferred src for this dst, use the 1st bound address". We can add it
> > but as I said, I'm afraid it is just doing wrong and not worth. If such
> > randomly src addressed packet is meant to be routed, the router will
> > likely drop it as it is seen as a spoof. And if it reaches the peer, it
> > will probably come back through a different path.
> > 
> > I'm tempted to say that current usual use cases are handled by the first
> > check on this function, which returns the preferred/primary address for
> > the interface and checks against bound addresses. Whenever you reach the
> > second check, it just allows you to use that 1st bound address that is
> > checked. I mean, I can't see use cases that we would be breaking with
> > this change. 
> 
> Yes,  the secondary check didn't amount to much, but we've kept it since 2.5
> days (when sctp was introduced).  I've made attempts over the years to
> try to make it stricter, but that never amounted to anything that worked well.
> 
> > 
> > But yeah, it impacts source based routing, and I'm not aware of previous
> > discussions on it. I'll try to dig some up but if possible, please share
> > some pointers on it.
> 
> It's been suggested a few times that we should support source based 
> multihoming
> particularly for the case where one peer has only 1 address.
> We've always punted on this, but people still ask every now and then.

Ah okay, now I see it.
 
> I do have a question about the code though.. Have you tried with mutlipath 
> routing
> enabled.  I see rp_filter checks have special code to handle that.  Seem like 
> we
> might get false negatives in sctp.

In the sense of CONFIG_IP_ROUTE_MULTIPATH=y, yes, but just that. My
routes were simple ones, either 2 peers attaches to 2 local subnets, or
with a gateway in the middle (with 2 subnets on each side, but mapped
1-1, no crossing. Aka subnet1<->subnet2 and subnet3<->subnet4 while not
(subnet1<->subnet4 or subnet3<->subnet2).

Note that this is not rp_filter strictly speaking, as it's mirrored.
rp_filter needs to calculate all possible output routes (actually until
it finds a valid one) for finding one that would match the one used for
incoming. 

This check already has an output path, and it's calculating if such
input would be acceptable. We can't really expect/check for other hits
because it invalidates the chosen output path.

Hmmm... but we could support multipath in the output selection, ie in
the outputs of ip_route_output_key(), probably in another patch then?

  Marcelo

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] of_mdio: add new DT property 'link' for fixed-link

2015-07-10 Thread Florian Fainelli
On 10/07/15 04:20, Stas Sergeev wrote:
> 10.07.2015 11:46, Sebastien Rannou пишет:
>> On Fri, 10 Jul 2015, Stas Sergeev wrote:
>>
>>> 10.07.2015 00:15, Florian Fainelli пишет:
 Then, if the in-band status indication is not reliable (which really
 should be completely understood),
>>> Agree!
>>> But this is not something I can help with.
>>> Sebastien Rannou reports the problem, please ask him whatever
>>> you see fits to get a better understanding of a problem.
>>> The fact that his HW does not generate the inband status, is
>>> _my own guess_.
>>
>> Yes, I confirm that my HW does not generate an in-band status. AFAIK, it's
>> a PHY that aggregates 4xSGMIIs to 1xQSGMII ; the MAC side of the PHY (with
>> inband status) is connected to the switch through QSGMII, and in this context
>> we are on the media side of the PHY.
> Hmm, interesting.
> So if I parse the above correctly, you have something like 88E1340S set
> up into a mode when SGMII is used as media interface and QSGMII as system
> interface (terms are from datasheet page 5), then you connect the media
> interface to armada-xp and system interface to the switch.
> 
> I wonder if it is the right thing to do.
> AFAIK you could as well set up armada-xp into QSGMII mode and connect
> that to switch. The driver would then disable the use of inband status
> and everything would be fine.
> Either way, your use-case proves that only DT can decide the use of an
> inband status.

I do not think there is any debate around the need for a property that
defines whether in-band-status is both reliable and usable, the debate
is about *where* to put it.

I still think this does not belong in the fixed-link property, but now
that you have explained a bit more in the other patch what your
understanding of "fixed-link" is, I can see the confusion.

Instead of having a link = "auto", property, how about just something
like this:

fixed-link {
speed = <1000>;
full-duplex;
use-in-band-status;
};

or event this:

fixed-link {
use-in-band-status;
};

If you parse the 'use-in-band-status' which means that it is reliable
information, then you can override whatever was defined in the DT under
the 'fixed-link' property?
-- 
Florian
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] net: dsa: mv88e6xxx: sleep in _mv88e6xxx_stats_wait

2015-07-10 Thread Vivien Didelot
Hi Guenter,

On Jul 10, 2015, at 1:10 PM, Guenter Roeck li...@roeck-us.net wrote:

> On Fri, Jul 10, 2015 at 12:57:28PM -0400, Vivien Didelot wrote:
>> The current _mv88e6xxx_stats_wait function does not sleep while testing
>> the stats busy bit. Fix this by using the generic _mv88e6xxx_wait
>> function.
>> 
>> Note that it requires to move _mv88e6xxx_wait on top of
>> _mv88e6xxx_stats_wait to avoid undefined reference compilation error.
>> 
>> Signed-off-by: Vivien Didelot 
>> ---
>>  drivers/net/dsa/mv88e6xxx.c | 52 
>> +++--
>>  1 file changed, 22 insertions(+), 30 deletions(-)
>> 
>> diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
>> index f6c7409..7753db1 100644
>> --- a/drivers/net/dsa/mv88e6xxx.c
>> +++ b/drivers/net/dsa/mv88e6xxx.c
>> @@ -557,19 +557,31 @@ static bool mv88e6xxx_6352_family(struct dsa_switch 
>> *ds)
>>  return false;
>>  }
>>  
>> +/* Must be called with SMI lock held */
>> +static int _mv88e6xxx_wait(struct dsa_switch *ds, int reg, int offset,
>> +   u16 mask)
>> +{
>> +unsigned long timeout = jiffies + HZ / 10;
>> +
>> +while (time_before(jiffies, timeout)) {
>> +int ret;
>> +
>> +ret = _mv88e6xxx_reg_read(ds, reg, offset);
>> +if (ret < 0)
>> +return ret;
>> +if (!(ret & mask))
>> +return 0;
>> +
>> +usleep_range(1000, 2000);
>> +}
>> +return -ETIMEDOUT;
>> +}
>> +
>>  /* Must be called with SMI mutex held */
>>  static int _mv88e6xxx_stats_wait(struct dsa_switch *ds)
>>  {
>> -int ret;
>> -int i;
>> -
>> -for (i = 0; i < 10; i++) {
>> -ret = _mv88e6xxx_reg_read(ds, REG_GLOBAL, GLOBAL_STATS_OP);
>> -if ((ret & GLOBAL_STATS_OP_BUSY) == 0)
>> -return 0;
>> -}
> 
> Hi Vivien,
> 
> is this really beneficial and/or needed ?

Except using existing generic code, no.

> It adds at least 1ms delay to a loop which did not have any delay at
> all unless the register read itself was sleeping.

I must have missed where is the benefit from spin reading 10 times this
register, rather than sleeping 1ms between tests. Does this busy bit
behaves differently from the phy, atu, scratch, or vtu busy bits?

> Is the original function seen to return a timeout error under some
> circumstances ?

I didn't experience it myself, but I guess it may happen. In addition to
that, the current implementation doesn't check eventual read error.
That's why I saw a benefit in using _mv88e6xxx_wait().

Thanks,
-v
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ravb: kill useless initializers

2015-07-10 Thread Sergei Shtylyov
Some of the local variable intializers in the driver turned out to be pointless,
kill them.

Signed-off-by: Sergei Shtylyov 

---
The patch is against the Dave Miller's 'net-next.git' repo.

Changes in version 2:
- fixed typo in the subject.

 drivers/net/ethernet/renesas/ravb_main.c |   12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

Index: net-next/drivers/net/ethernet/renesas/ravb_main.c
===
--- net-next.orig/drivers/net/ethernet/renesas/ravb_main.c
+++ net-next/drivers/net/ethernet/renesas/ravb_main.c
@@ -223,9 +223,9 @@ static void ravb_ring_free(struct net_de
 static void ravb_ring_format(struct net_device *ndev, int q)
 {
struct ravb_private *priv = netdev_priv(ndev);
-   struct ravb_ex_rx_desc *rx_desc = NULL;
-   struct ravb_tx_desc *tx_desc = NULL;
-   struct ravb_desc *desc = NULL;
+   struct ravb_ex_rx_desc *rx_desc;
+   struct ravb_tx_desc *tx_desc;
+   struct ravb_desc *desc;
int rx_ring_size = sizeof(*rx_desc) * priv->num_rx_ring[q];
int tx_ring_size = sizeof(*tx_desc) * priv->num_tx_ring[q];
struct sk_buff *skb;
@@ -435,7 +435,7 @@ static int ravb_tx_free(struct net_devic
struct net_device_stats *stats = &priv->stats[q];
struct ravb_tx_desc *desc;
int free_num = 0;
-   int entry = 0;
+   int entry;
u32 size;
 
for (; priv->cur_tx[q] - priv->dirty_tx[q] > 0; priv->dirty_tx[q]++) {
@@ -508,8 +508,8 @@ static bool ravb_rx(struct net_device *n
struct sk_buff *skb;
dma_addr_t dma_addr;
struct timespec64 ts;
-   u16 pkt_len = 0;
u8  desc_status;
+   u16 pkt_len;
int limit;
 
boguscnt = min(boguscnt, *quota);
@@ -1272,8 +1272,8 @@ static void ravb_tx_timeout_work(struct
 static netdev_tx_t ravb_start_xmit(struct sk_buff *skb, struct net_device 
*ndev)
 {
struct ravb_private *priv = netdev_priv(ndev);
-   struct ravb_tstamp_skb *ts_skb = NULL;
u16 q = skb_get_queue_mapping(skb);
+   struct ravb_tstamp_skb *ts_skb;
struct ravb_tx_desc *desc;
unsigned long flags;
u32 dma_addr;
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ravb: kill usless initializers

2015-07-10 Thread Sergei Shtylyov
Some of the local variable intializers in the driver turned out to be pointless,
kill them.

Signed-off-by: Sergei Shtylyov 

---
The patch is against the Dave Miller's 'net-next.git' repo.

 drivers/net/ethernet/renesas/ravb_main.c |   12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

Index: net-next/drivers/net/ethernet/renesas/ravb_main.c
===
--- net-next.orig/drivers/net/ethernet/renesas/ravb_main.c
+++ net-next/drivers/net/ethernet/renesas/ravb_main.c
@@ -223,9 +223,9 @@ static void ravb_ring_free(struct net_de
 static void ravb_ring_format(struct net_device *ndev, int q)
 {
struct ravb_private *priv = netdev_priv(ndev);
-   struct ravb_ex_rx_desc *rx_desc = NULL;
-   struct ravb_tx_desc *tx_desc = NULL;
-   struct ravb_desc *desc = NULL;
+   struct ravb_ex_rx_desc *rx_desc;
+   struct ravb_tx_desc *tx_desc;
+   struct ravb_desc *desc;
int rx_ring_size = sizeof(*rx_desc) * priv->num_rx_ring[q];
int tx_ring_size = sizeof(*tx_desc) * priv->num_tx_ring[q];
struct sk_buff *skb;
@@ -435,7 +435,7 @@ static int ravb_tx_free(struct net_devic
struct net_device_stats *stats = &priv->stats[q];
struct ravb_tx_desc *desc;
int free_num = 0;
-   int entry = 0;
+   int entry;
u32 size;
 
for (; priv->cur_tx[q] - priv->dirty_tx[q] > 0; priv->dirty_tx[q]++) {
@@ -508,8 +508,8 @@ static bool ravb_rx(struct net_device *n
struct sk_buff *skb;
dma_addr_t dma_addr;
struct timespec64 ts;
-   u16 pkt_len = 0;
u8  desc_status;
+   u16 pkt_len;
int limit;
 
boguscnt = min(boguscnt, *quota);
@@ -1272,8 +1272,8 @@ static void ravb_tx_timeout_work(struct
 static netdev_tx_t ravb_start_xmit(struct sk_buff *skb, struct net_device 
*ndev)
 {
struct ravb_private *priv = netdev_priv(ndev);
-   struct ravb_tstamp_skb *ts_skb = NULL;
u16 q = skb_get_queue_mapping(skb);
+   struct ravb_tstamp_skb *ts_skb;
struct ravb_tx_desc *desc;
unsigned long flags;
u32 dma_addr;

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] nohz: prevent tilegx network driver interrupts

2015-07-10 Thread Chris Metcalf
In nohz_full mode, by default distribute networking shim
interrupts across the housekeeping cores, not all the cores.

Signed-off-by: Chris Metcalf 
---
The alternate approaches to this might be:

1. "#define housekeeping_mask cpu_online_mask" in the non-nohz_full
   arm in , then just unconditionally use
   "housekeeping_mask".

2. Provide an accessor that returns the cpumask to use for housekeeping
   chores and implement it in the obvious ways for both nohz_full
   and non-nohz_full.

The latter seems like arguably the most satisfying approach, but
the patch below is, if nothing else, suitable to push for 4.3
without any further API development work.

Frederic (or others), comments?

 drivers/net/ethernet/tile/tilegx.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/tile/tilegx.c 
b/drivers/net/ethernet/tile/tilegx.c
index a3f7610002aa..7687c62e7d75 100644
--- a/drivers/net/ethernet/tile/tilegx.c
+++ b/drivers/net/ethernet/tile/tilegx.c
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -2272,8 +2273,13 @@ static int __init tile_net_init_module(void)
for (i = 0; gxio_mpipe_link_enumerate_mac(i, name, mac) >= 0; i++)
tile_net_dev_init(name, mac);
 
-   if (!network_cpus_init())
+   if (!network_cpus_init()) {
+#ifdef CONFIG_NO_HZ_FULL
+   network_cpus_map = *housekeeping_mask;
+#else
network_cpus_map = *cpu_online_mask;
+#endif
+   }
 
return 0;
 }
-- 
2.1.2

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/11] soc/fsl: Introduce DPAA BMan device management driver

2015-07-10 Thread Scott Wood
On Fri, 2015-07-10 at 13:36 +0200, Paul Bolle wrote:
> On do, 2015-07-09 at 16:21 -0400, Roy Pledge wrote:
> > +#ifdef CONFIG_FSL_DPA_CHECKING
> > +#define DPA_ASSERT(x) \
> > +   do { \
> > +   if (!(x)) { \
> > +   pr_crit("ASSERT: (%s:%d) %s\n", __FILE__, __LINE__, \
> > +   __stringify_1(x)); \
> > +   dump_stack(); \
> > +   panic("assertion failure"); \
> 
> Not my call, but why panic() here?

I'm pretty sure I've complained about this before (as well as all the 
BUG_ONs).

-Scott

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/11] soc/fsl: Introduce DPAA BMan device management driver

2015-07-10 Thread Scott Wood
On Fri, 2015-07-10 at 10:38 +0200, Paul Bolle wrote:
> On do, 2015-07-09 at 16:21 -0400, Roy Pledge wrote:
> > --- /dev/null
> > +++ b/drivers/soc/fsl/qbman/Kconfig
> > @@ -0,0 +1,33 @@
> > +menuconfig FSL_DPA
> > +   bool "Freescale DPAA support"
> > +   depends on FSL_SOC || COMPILE_TEST
> 
> Are you sure about COMPILE_TEST?
> 
> > +   default n
> > +   help
> > +   FSL Data-Path Acceleration Architecture drivers
> > +
> > +   These are not the actual Ethernet driver(s)
> > +
> > +if FSL_DPA
> 
> [...]
> 
> > +config FSL_BMAN
> > +   tristate "BMan device management"
> > +   default n
> > +   help
> > +   FSL DPAA BMan driver
> > +
> > +endif # FSL_DPA
> 
> Because with FSL_BMAN set to 'm' testing things with
> make -C ../../../.. M=$PWD bman.ko
> 
> will not actually compile on x86_64:
> 
> make: Entering directory '[...]'
>   CC [M]  [...]/drivers/soc/fsl/qbman/bman.o
> In file included from [...]/drivers/soc/fsl/qbman/bman_priv.h:33:0,
>  from [...]/drivers/soc/fsl/qbman/bman.c:31:
> [...]/drivers/soc/fsl/qbman/dpaa_sys.h: In function ‘mfatb’:
> [...]/drivers/soc/fsl/qbman/dpaa_sys.h:134:8: error: implicit declaration 
> of function ‘mfspr’ [-Werror=implicit-function-declaration]
>hi = mfspr(SPRN_ATBU);
> ^
> [...]/drivers/soc/fsl/qbman/dpaa_sys.h:134:14: error: ‘SPRN_ATBU’ 
> undeclared (first use in this function)
>hi = mfspr(SPRN_ATBU);
>   ^
> [...]/drivers/soc/fsl/qbman/dpaa_sys.h:134:14: note: each undeclared 
> identifier is reported only once for each function it appears in
> [...]/drivers/soc/fsl/qbman/dpaa_sys.h:135:14: error: ‘SPRN_ATBL’ 
> undeclared (first use in this function)
>lo = mfspr(SPRN_ATBL);
>   ^
> [...]/drivers/soc/fsl/qbman/bman.c: In function ‘__bm_in’:
> [...]/drivers/soc/fsl/qbman/bman.c:168:9: error: implicit declaration of 
> function ‘in_be32’ [-Werror=implicit-function-declaration]
>   return in_be32((void *)bm + offset);
>  ^
> [...]/drivers/soc/fsl/qbman/bman.c: In function ‘__bm_out’:
> [...]/drivers/soc/fsl/qbman/bman.c:172:2: error: implicit declaration of 
> function ‘out_be32’ [-Werror=implicit-function-declaration]
>   out_be32((void *)bm + offset, val);

These PPCisms will need to be fixed.  LS1043A is an ARM chip with DPAA 1.0.

>   ^
> [...]/drivers/soc/fsl/qbman/bman.c: In function ‘of_fsl_bman_probe’:
> [...]/drivers/soc/fsl/qbman/bman.c:463:17: error: ‘NO_IRQ’ undeclared 
> (first use in this function)
>   if (err_irq == NO_IRQ) {

This isn't even a PPCism.  It's just wrong.  Compare to zero instead.

-Scott

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] net: dsa: mv88e6xxx: call _mv88e6xxx_stats_wait with SMI lock held

2015-07-10 Thread Guenter
On Fri, Jul 10, 2015 at 12:57:29PM -0400, Vivien Didelot wrote:
> At switch setup, _mv88e6xxx_stats_wait was called without holding the
> SMI mutex. Fix this by requesting the lock for this call.
> 
> Also, return the _mv88e6xxx_stats_wait code, since it may fail.
> 
> Signed-off-by: Vivien Didelot 

Good catch.

Reviewed-by: Guenter Roeck 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH net-next] sctp: fix src address selection if using secondary addresses

2015-07-10 Thread Vlad Yasevich
On 07/10/2015 12:17 PM, Marcelo Ricardo Leitner wrote:
> On Fri, Jul 10, 2015 at 11:35:28AM -0400, Vlad Yasevich wrote:
>> On 07/10/2015 07:53 AM, Marcelo Ricardo Leitner wrote:
>>> On Thu, Jul 09, 2015 at 09:55:19PM +0200, Michael Tuexen wrote:
> On 09 Jul 2015, at 18:54, Marcelo Ricardo Leitner  
> wrote:
>
> Cc'ing Michael too.
 I'm not familiar with the Linux kernel code, so I can't comment on it.
 But making sure to use a source address belonging to the emitting
 interface makes sense for me.

 Best regards
 Michael
>>>
>>> That's pretty much what I was looking for, in case I missed something on
>>> SCTP RFCs. 
>>>
>>
>> Well, the RFCs do not really specify what the source address should be, and 
>> there
> 
> That's why I was afraid of having missed something ;)
> 
>> have been numerous times where I've seen weak host model in use on the wire
>> even with a BSD peer.
>>
>> This also puts a very big nail through many suggestions we've had over the 
>> years
>> to allow source based path multihoming in addition to destination based 
>> multihoming
>> we currently support.
>>
>> It might be a good idea to make rp-filter like behavior best effort, and have
>> the old behavior as fallback.  I am still trying to think up different 
>> scenarios
>> where rp-filter behavior will cause things to fail prematurely...
> 
> The old behavior is like "if we don't have a src yet and can't find a
> preferred src for this dst, use the 1st bound address". We can add it
> but as I said, I'm afraid it is just doing wrong and not worth. If such
> randomly src addressed packet is meant to be routed, the router will
> likely drop it as it is seen as a spoof. And if it reaches the peer, it
> will probably come back through a different path.
> 
> I'm tempted to say that current usual use cases are handled by the first
> check on this function, which returns the preferred/primary address for
> the interface and checks against bound addresses. Whenever you reach the
> second check, it just allows you to use that 1st bound address that is
> checked. I mean, I can't see use cases that we would be breaking with
> this change. 

Yes,  the secondary check didn't amount to much, but we've kept it since 2.5
days (when sctp was introduced).  I've made attempts over the years to
try to make it stricter, but that never amounted to anything that worked well.

> 
> But yeah, it impacts source based routing, and I'm not aware of previous
> discussions on it. I'll try to dig some up but if possible, please share
> some pointers on it.

It's been suggested a few times that we should support source based multihoming
particularly for the case where one peer has only 1 address.
We've always punted on this, but people still ask every now and then.

I do have a question about the code though.. Have you tried with mutlipath 
routing
enabled.  I see rp_filter checks have special code to handle that.  Seem like we
might get false negatives in sctp.

-vlad

> 
>   Marcelo
> 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] net: dsa: mv88e6xxx: sleep in _mv88e6xxx_stats_wait

2015-07-10 Thread Guenter
On Fri, Jul 10, 2015 at 12:57:28PM -0400, Vivien Didelot wrote:
> The current _mv88e6xxx_stats_wait function does not sleep while testing
> the stats busy bit. Fix this by using the generic _mv88e6xxx_wait
> function.
> 
> Note that it requires to move _mv88e6xxx_wait on top of
> _mv88e6xxx_stats_wait to avoid undefined reference compilation error.
> 
> Signed-off-by: Vivien Didelot 
> ---
>  drivers/net/dsa/mv88e6xxx.c | 52 
> +++--
>  1 file changed, 22 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
> index f6c7409..7753db1 100644
> --- a/drivers/net/dsa/mv88e6xxx.c
> +++ b/drivers/net/dsa/mv88e6xxx.c
> @@ -557,19 +557,31 @@ static bool mv88e6xxx_6352_family(struct dsa_switch *ds)
>   return false;
>  }
>  
> +/* Must be called with SMI lock held */
> +static int _mv88e6xxx_wait(struct dsa_switch *ds, int reg, int offset,
> +u16 mask)
> +{
> + unsigned long timeout = jiffies + HZ / 10;
> +
> + while (time_before(jiffies, timeout)) {
> + int ret;
> +
> + ret = _mv88e6xxx_reg_read(ds, reg, offset);
> + if (ret < 0)
> + return ret;
> + if (!(ret & mask))
> + return 0;
> +
> + usleep_range(1000, 2000);
> + }
> + return -ETIMEDOUT;
> +}
> +
>  /* Must be called with SMI mutex held */
>  static int _mv88e6xxx_stats_wait(struct dsa_switch *ds)
>  {
> - int ret;
> - int i;
> -
> - for (i = 0; i < 10; i++) {
> - ret = _mv88e6xxx_reg_read(ds, REG_GLOBAL, GLOBAL_STATS_OP);
> - if ((ret & GLOBAL_STATS_OP_BUSY) == 0)
> - return 0;
> - }

Hi Vivien,

is this really beneficial and/or needed ? It adds at least 1ms delay
to a loop which did not have any delay at all unless the register
read itself was sleeping. Is the original function seen to return
a timeout error under some circumstances ?

Thanks,
Guenter
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC net-next 03/22] ipv4: support for fib route lwtunnel encap attributes

2015-07-10 Thread roopa

On 7/10/15, 9:54 AM, Thomas Graf wrote:

On 07/10/15 at 05:36pm, Eric Dumazet wrote:

On Fri, 2015-07-10 at 16:19 +0200, Thomas Graf wrote:

From: Roopa Prabhu 



+   if (oif)
+   dev = __dev_get_by_index(net, oif);
+   ret = lwtunnel_build_state(dev, encap_type,
+  encap, &lwtstate);
+   if (!ret) {
+   lwtunnel_state_get(lwtstate);
+   ret = lwtunnel_cmp_encap(lwtstate, nh->nh_lwtstate);
+   lwtunnel_state_put(lwtstate);
+
+   return ret;


These _get()/_put() calls do not seem necessary, or buggy.

If refcounting is needed the _get() should be done at the time lwstate
is fetched.

I'll let Roopa comment but it seems like these can just be removed.

Agree, They do seem unnecessary here.  I will get rid of them. thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] net: dsa: mv88e6xxx: call _mv88e6xxx_stats_wait with SMI lock held

2015-07-10 Thread Vivien Didelot
At switch setup, _mv88e6xxx_stats_wait was called without holding the
SMI mutex. Fix this by requesting the lock for this call.

Also, return the _mv88e6xxx_stats_wait code, since it may fail.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index 7753db1..cbbc33a 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -1961,6 +1961,7 @@ int mv88e6xxx_setup_common(struct dsa_switch *ds)
 int mv88e6xxx_setup_global(struct dsa_switch *ds)
 {
struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
+   int ret;
int i;
 
/* Set the default address aging time to 5 minutes, and
@@ -2059,9 +2060,11 @@ int mv88e6xxx_setup_global(struct dsa_switch *ds)
REG_WRITE(REG_GLOBAL, GLOBAL_STATS_OP, GLOBAL_STATS_OP_FLUSH_ALL);
 
/* Wait for the flush to complete. */
-   _mv88e6xxx_stats_wait(ds);
+   mutex_lock(&ps->smi_mutex);
+   ret = _mv88e6xxx_stats_wait(ds);
+   mutex_unlock(&ps->smi_mutex);
 
-   return 0;
+   return ret;
 }
 
 int mv88e6xxx_switch_reset(struct dsa_switch *ds, bool ppu_active)
-- 
2.4.5

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] net: dsa: mv88e6xxx: sleep in _mv88e6xxx_stats_wait

2015-07-10 Thread Vivien Didelot
The current _mv88e6xxx_stats_wait function does not sleep while testing
the stats busy bit. Fix this by using the generic _mv88e6xxx_wait
function.

Note that it requires to move _mv88e6xxx_wait on top of
_mv88e6xxx_stats_wait to avoid undefined reference compilation error.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx.c | 52 +++--
 1 file changed, 22 insertions(+), 30 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index f6c7409..7753db1 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -557,19 +557,31 @@ static bool mv88e6xxx_6352_family(struct dsa_switch *ds)
return false;
 }
 
+/* Must be called with SMI lock held */
+static int _mv88e6xxx_wait(struct dsa_switch *ds, int reg, int offset,
+  u16 mask)
+{
+   unsigned long timeout = jiffies + HZ / 10;
+
+   while (time_before(jiffies, timeout)) {
+   int ret;
+
+   ret = _mv88e6xxx_reg_read(ds, reg, offset);
+   if (ret < 0)
+   return ret;
+   if (!(ret & mask))
+   return 0;
+
+   usleep_range(1000, 2000);
+   }
+   return -ETIMEDOUT;
+}
+
 /* Must be called with SMI mutex held */
 static int _mv88e6xxx_stats_wait(struct dsa_switch *ds)
 {
-   int ret;
-   int i;
-
-   for (i = 0; i < 10; i++) {
-   ret = _mv88e6xxx_reg_read(ds, REG_GLOBAL, GLOBAL_STATS_OP);
-   if ((ret & GLOBAL_STATS_OP_BUSY) == 0)
-   return 0;
-   }
-
-   return -ETIMEDOUT;
+   return _mv88e6xxx_wait(ds, REG_GLOBAL, GLOBAL_STATS_OP,
+  GLOBAL_STATS_OP_BUSY);
 }
 
 /* Must be called with SMI mutex held */
@@ -856,26 +868,6 @@ error:
 }
 #endif /* CONFIG_NET_DSA_HWMON */
 
-/* Must be called with SMI lock held */
-static int _mv88e6xxx_wait(struct dsa_switch *ds, int reg, int offset,
-  u16 mask)
-{
-   unsigned long timeout = jiffies + HZ / 10;
-
-   while (time_before(jiffies, timeout)) {
-   int ret;
-
-   ret = _mv88e6xxx_reg_read(ds, reg, offset);
-   if (ret < 0)
-   return ret;
-   if (!(ret & mask))
-   return 0;
-
-   usleep_range(1000, 2000);
-   }
-   return -ETIMEDOUT;
-}
-
 static int mv88e6xxx_wait(struct dsa_switch *ds, int reg, int offset, u16 mask)
 {
struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
-- 
2.4.5

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC net-next 03/22] ipv4: support for fib route lwtunnel encap attributes

2015-07-10 Thread Thomas Graf
On 07/10/15 at 05:36pm, Eric Dumazet wrote:
> On Fri, 2015-07-10 at 16:19 +0200, Thomas Graf wrote:
> > From: Roopa Prabhu 
> 
> 
> > +   if (oif)
> > +   dev = __dev_get_by_index(net, oif);
> > +   ret = lwtunnel_build_state(dev, encap_type,
> > +  encap, &lwtstate);
> > +   if (!ret) {
> > +   lwtunnel_state_get(lwtstate);
> > +   ret = lwtunnel_cmp_encap(lwtstate, nh->nh_lwtstate);
> > +   lwtunnel_state_put(lwtstate);
> > +
> > +   return ret;
> 
> 
> These _get()/_put() calls do not seem necessary, or buggy.
> 
> If refcounting is needed the _get() should be done at the time lwstate
> is fetched.

I'll let Roopa comment but it seems like these can just be removed.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/11] soc/fsl: Introduce the DPAA BMan portal driver

2015-07-10 Thread Paul Bolle
Hi Roy,

On vr, 2015-07-10 at 15:19 +, Roy Pledge wrote:
> Thanks you for your valuable feedback so far.

You're welcome. Please note that I just scan for, well, common build
issues. Ie, stuff that requires no domain specific knowledge.

> Let me try to address a general issue you mention below: unused 
> exported APIs.

Good timing. I was pondering how to handle 04/11, a big patch that
exports 79 (!) symbols. Many of those appear to have no users.

> The QMan and BMan drivers provide a base layer for other blocks built 
> on top of them, for instance an Ethernet Driver, an Encrypt/Decrypt 
> Engine,  a pattern matcher, a compress/decompress engine, etc...
> Some of these drivers will be presented for review in the near future, 
> but in order to try and layer the review/up streaming process we're 
> presenting each layer as independently as possible.
> If you really were to start dissecting which APIs are called you would 
> come to a very small set of pieces that merely initialize the hardware 
> but don't provide any opportunity for other users to invoke that HW. 
> 
> I hope that this helps you understand our goals in trying to upstream 
> these drivers.

At the end of the day, what matters is what the people that need to sign
off on these drivers (ppc? netdev?) are comfortable with. I know that
some maintainers won't even bother looking at code that has no callers.

In the mean time I'll skip the exports for the remainder of this series.

Thanks,


Paul Bolle
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] mvneta: use inband status only when explicitly enabled

2015-07-10 Thread Stas Sergeev

The commit 898b2970e2c9 ("mvneta: implement SGMII-based in-band link state
signaling") implemented the link parameters auto-negotiation unconditionally.
Unfortunately it appears that some HW that implements SGMII protocol,
doesn't generate the inband status, so it is not possible to auto-negotiate
anything with such HW.

This patch enables the auto-negotiation only if the 'autoneg' DT property
is set to 1.
For old configurations where the 'autoneg' property is not specified, the
default is to not use auto-negotiation.

This patch fixes the following regression:
https://lkml.org/lkml/2015/7/8/865
and is therefore CCed to stable.

Signed-off-by: Stas Sergeev 

CC: Thomas Petazzoni 
CC: netdev@vger.kernel.org
CC: linux-ker...@vger.kernel.org
CC: sta...@vger.kernel.org
---
 drivers/net/ethernet/marvell/mvneta.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c 
b/drivers/net/ethernet/marvell/mvneta.c
index 74176ec..d4c29a3 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -3009,7 +3009,7 @@ static int mvneta_probe(struct platform_device *pdev)
char hw_mac_addr[ETH_ALEN];
const char *mac_from;
int phy_mode;
-   int fixed_phy = 0;
+   int autoneg_link = 0;
int err;

/* Our multiqueue support is not complete, so for now, only
@@ -3043,7 +3043,7 @@ static int mvneta_probe(struct platform_device *pdev)
dev_err(&pdev->dev, "cannot register fixed PHY\n");
goto err_free_irq;
}
-   fixed_phy = 1;
+   autoneg_link = of_phy_is_autoneg_link(dn);

/* In the case of a fixed PHY, the DT node associated
 * to the PHY is the Ethernet MAC DT node.
@@ -3067,8 +3067,7 @@ static int mvneta_probe(struct platform_device *pdev)
pp = netdev_priv(dev);
pp->phy_node = phy_node;
pp->phy_interface = phy_mode;
-   pp->use_inband_status = (phy_mode == PHY_INTERFACE_MODE_SGMII) &&
-   fixed_phy;
+   pp->use_inband_status = autoneg_link;

pp->clk = devm_clk_get(&pdev->dev, NULL);
if (IS_ERR(pp->clk)) {
-- 
1.9.1
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] of_mdio: add new DT property 'autoneg' for fixed-link

2015-07-10 Thread Stas Sergeev

Currently for fixed-link the MAC driver decides whether to use the
link status auto-negotiation or not.
Unfortunately the auto-negotiation may not work when expected by
the MAC driver. Sebastien Rannou explains:
<< Yes, I confirm that my HW does not generate an in-band status. AFAIK, it's
a PHY that aggregates 4xSGMIIs to 1xQSGMII ; the MAC side of the PHY (with
inband status) is connected to the switch through QSGMII, and in this context
we are on the media side of the PHY. >>
https://lkml.org/lkml/2015/7/10/206

This patch introduces the new boolean property 'autoneg' that allows
the user to request the auto-negotiation explicitly.

Signed-off-by: Stas Sergeev 

CC: Rob Herring 
CC: Pawel Moll 
CC: Mark Rutland 
CC: Ian Campbell 
CC: Kumar Gala 
CC: Florian Fainelli 
CC: Grant Likely 
CC: devicet...@vger.kernel.org
CC: linux-ker...@vger.kernel.org
CC: netdev@vger.kernel.org
---
 .../devicetree/bindings/net/fixed-link.txt |  6 +-
 drivers/of/of_mdio.c   | 23 --
 include/linux/of_mdio.h|  5 +
 3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/fixed-link.txt 
b/Documentation/devicetree/bindings/net/fixed-link.txt
index 82bf7e0..e2959a8 100644
--- a/Documentation/devicetree/bindings/net/fixed-link.txt
+++ b/Documentation/devicetree/bindings/net/fixed-link.txt
@@ -9,8 +9,12 @@ Such a fixed link situation is described by creating a 
'fixed-link'
 sub-node of the Ethernet MAC device node, with the following
 properties:

+* 'autoneg' (boolean, optional), to enable the auto-negotiation of link
+  state. Auto-negotiation is MII protocol, HW and driver-specific and is
+  not supported in many cases, so use it only when you know what you do.
 * 'speed' (integer, mandatory), to indicate the link speed. Accepted
-  values are 10, 100 and 1000
+  values are 10, 100 and 1000. If the auto-negotiation is enabled,
+  'speed' may not be set. It will then be auto-negotiated, if possible.
 * 'full-duplex' (boolean, optional), to indicate that full duplex is
   used. When absent, half duplex is assumed.
 * 'pause' (boolean, optional), to indicate that pause should be
diff --git a/drivers/of/of_mdio.c b/drivers/of/of_mdio.c
index 1bd4305..12b2ede 100644
--- a/drivers/of/of_mdio.c
+++ b/drivers/of/of_mdio.c
@@ -280,6 +280,22 @@ bool of_phy_is_fixed_link(struct device_node *np)
 }
 EXPORT_SYMBOL(of_phy_is_fixed_link);

+bool of_phy_is_autoneg_link(struct device_node *np)
+{
+   struct device_node *dn;
+   bool ret;
+
+   dn = of_get_child_by_name(np, "fixed-link");
+   if (!dn)
+   return false;
+
+   ret = of_property_read_bool(dn, "autoneg");
+
+   of_node_put(dn);
+   return ret;
+}
+EXPORT_SYMBOL(of_phy_is_autoneg_link);
+
 int of_phy_register_fixed_link(struct device_node *np)
 {
struct fixed_phy_status status = {};
@@ -291,10 +307,13 @@ int of_phy_register_fixed_link(struct device_node *np)
/* New binding */
fixed_link_node = of_get_child_by_name(np, "fixed-link");
if (fixed_link_node) {
-   status.link = 1;
+   bool autoneg = of_property_read_bool(fixed_link_node,
+"autoneg");
+   status.link = !autoneg;
status.duplex = of_property_read_bool(fixed_link_node,
  "full-duplex");
-   if (of_property_read_u32(fixed_link_node, "speed", 
&status.speed))
+   if (of_property_read_u32(fixed_link_node, "speed",
+&status.speed) != 0 && !autoneg)
return -EINVAL;
status.pause = of_property_read_bool(fixed_link_node, "pause");
status.asym_pause = of_property_read_bool(fixed_link_node,
diff --git a/include/linux/of_mdio.h b/include/linux/of_mdio.h
index d449018..647f348 100644
--- a/include/linux/of_mdio.h
+++ b/include/linux/of_mdio.h
@@ -65,6 +65,7 @@ static inline struct mii_bus *of_mdio_find_bus(struct 
device_node *mdio_np)
 #if defined(CONFIG_OF) && defined(CONFIG_FIXED_PHY)
 extern int of_phy_register_fixed_link(struct device_node *np);
 extern bool of_phy_is_fixed_link(struct device_node *np);
+extern bool of_phy_is_autoneg_link(struct device_node *np);
 #else
 static inline int of_phy_register_fixed_link(struct device_node *np)
 {
@@ -74,6 +75,10 @@ static inline bool of_phy_is_fixed_link(struct device_node 
*np)
 {
return false;
 }
+static inline bool of_phy_is_autoneg_link(struct device_node *np)
+{
+   return false;
+}
 #endif


-- 
1.9.1
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] fixed_phy: handle link-down case

2015-07-10 Thread Stas Sergeev

Currently fixed_phy driver recognizes only the link-up state.
This simple patch adds an implementation of link-down state.
The actual change is 1-line, the rest is an indentation.

Signed-off-by: Stas Sergeev 

CC: Florian Fainelli 
CC: netdev@vger.kernel.org
CC: linux-ker...@vger.kernel.org
---
 drivers/net/phy/fixed_phy.c | 99 +++--
 1 file changed, 50 insertions(+), 49 deletions(-)

diff --git a/drivers/net/phy/fixed_phy.c b/drivers/net/phy/fixed_phy.c
index 1960b46..a5d93cf 100644
--- a/drivers/net/phy/fixed_phy.c
+++ b/drivers/net/phy/fixed_phy.c
@@ -52,58 +52,59 @@ static int fixed_phy_update_regs(struct fixed_phy *fp)
u16 lpagb = 0;
u16 lpa = 0;

-   if (fp->status.duplex) {
-   bmcr |= BMCR_FULLDPLX;
-
-   switch (fp->status.speed) {
-   case 1000:
-   bmsr |= BMSR_ESTATEN;
-   bmcr |= BMCR_SPEED1000;
-   lpagb |= LPA_1000FULL;
-   break;
-   case 100:
-   bmsr |= BMSR_100FULL;
-   bmcr |= BMCR_SPEED100;
-   lpa |= LPA_100FULL;
-   break;
-   case 10:
-   bmsr |= BMSR_10FULL;
-   lpa |= LPA_10FULL;
-   break;
-   default:
-   pr_warn("fixed phy: unknown speed\n");
-   return -EINVAL;
-   }
-   } else {
-   switch (fp->status.speed) {
-   case 1000:
-   bmsr |= BMSR_ESTATEN;
-   bmcr |= BMCR_SPEED1000;
-   lpagb |= LPA_1000HALF;
-   break;
-   case 100:
-   bmsr |= BMSR_100HALF;
-   bmcr |= BMCR_SPEED100;
-   lpa |= LPA_100HALF;
-   break;
-   case 10:
-   bmsr |= BMSR_10HALF;
-   lpa |= LPA_10HALF;
-   break;
-   default:
-   pr_warn("fixed phy: unknown speed\n");
-   return -EINVAL;
-   }
-   }
-
-   if (fp->status.link)
+   if (fp->status.link) {
bmsr |= BMSR_LSTATUS | BMSR_ANEGCOMPLETE;

-   if (fp->status.pause)
-   lpa |= LPA_PAUSE_CAP;
+   if (fp->status.duplex) {
+   bmcr |= BMCR_FULLDPLX;
+
+   switch (fp->status.speed) {
+   case 1000:
+   bmsr |= BMSR_ESTATEN;
+   bmcr |= BMCR_SPEED1000;
+   lpagb |= LPA_1000FULL;
+   break;
+   case 100:
+   bmsr |= BMSR_100FULL;
+   bmcr |= BMCR_SPEED100;
+   lpa |= LPA_100FULL;
+   break;
+   case 10:
+   bmsr |= BMSR_10FULL;
+   lpa |= LPA_10FULL;
+   break;
+   default:
+   pr_warn("fixed phy: unknown speed\n");
+   return -EINVAL;
+   }
+   } else {
+   switch (fp->status.speed) {
+   case 1000:
+   bmsr |= BMSR_ESTATEN;
+   bmcr |= BMCR_SPEED1000;
+   lpagb |= LPA_1000HALF;
+   break;
+   case 100:
+   bmsr |= BMSR_100HALF;
+   bmcr |= BMCR_SPEED100;
+   lpa |= LPA_100HALF;
+   break;
+   case 10:
+   bmsr |= BMSR_10HALF;
+   lpa |= LPA_10HALF;
+   break;
+   default:
+   pr_warn("fixed phy: unknown speed\n");
+   return -EINVAL;
+   }
+   }

-   if (fp->status.asym_pause)
-   lpa |= LPA_PAUSE_ASYM;
+   if (fp->status.pause)
+   lpa |= LPA_PAUSE_CAP;
+
+   if (fp->status.asym_pause)
+   lpa |= LPA_PAUSE_ASYM;
+   }

fp->regs[MII_PHYSID1] = 0;
fp->regs[MII_PHYSID2] = 0;
-- 
1.9.1
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel Oops in __inet_twsk_kill()

2015-07-10 Thread Phani


> 
> > The problem has been fixed. It is introduced by a third party patch,
> > which decreases the refcnt of timewait socket.
>



What is the fix?
pl share the gerrit.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 0/2] net: enable inband link state negotiation only when explicitly requested

2015-07-10 Thread Stas Sergeev
Hello.

Currently the link status auto-negotiation is enabled
for any SGMII link with fixed-link DT binding.
The regression was reported:
https://lkml.org/lkml/2015/7/8/865
Apparently not all HW that implements SGMII protocol, generates the
inband status for the auto-negotiation to work.
More details here:
https://lkml.org/lkml/2015/7/10/206

The following patches reverts to the old behavior by default,
which is to not enable the auto-negotiation for fixed-link.
The new DT property is added that allows to explicitly request
the auto-negotiation.

Those who were affected by the change, please send your Tested-by,
Thanks!
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH net-next] sctp: fix src address selection if using secondary addresses

2015-07-10 Thread Marcelo Ricardo Leitner
On Fri, Jul 10, 2015 at 11:35:28AM -0400, Vlad Yasevich wrote:
> On 07/10/2015 07:53 AM, Marcelo Ricardo Leitner wrote:
> > On Thu, Jul 09, 2015 at 09:55:19PM +0200, Michael Tuexen wrote:
> >>> On 09 Jul 2015, at 18:54, Marcelo Ricardo Leitner  
> >>> wrote:
> >>>
> >>> Cc'ing Michael too.
> >> I'm not familiar with the Linux kernel code, so I can't comment on it.
> >> But making sure to use a source address belonging to the emitting
> >> interface makes sense for me.
> >>
> >> Best regards
> >> Michael
> > 
> > That's pretty much what I was looking for, in case I missed something on
> > SCTP RFCs. 
> > 
> 
> Well, the RFCs do not really specify what the source address should be, and 
> there

That's why I was afraid of having missed something ;)

> have been numerous times where I've seen weak host model in use on the wire
> even with a BSD peer.
> 
> This also puts a very big nail through many suggestions we've had over the 
> years
> to allow source based path multihoming in addition to destination based 
> multihoming
> we currently support.
> 
> It might be a good idea to make rp-filter like behavior best effort, and have
> the old behavior as fallback.  I am still trying to think up different 
> scenarios
> where rp-filter behavior will cause things to fail prematurely...

The old behavior is like "if we don't have a src yet and can't find a
preferred src for this dst, use the 1st bound address". We can add it
but as I said, I'm afraid it is just doing wrong and not worth. If such
randomly src addressed packet is meant to be routed, the router will
likely drop it as it is seen as a spoof. And if it reaches the peer, it
will probably come back through a different path.

I'm tempted to say that current usual use cases are handled by the first
check on this function, which returns the preferred/primary address for
the interface and checks against bound addresses. Whenever you reach the
second check, it just allows you to use that 1st bound address that is
checked. I mean, I can't see use cases that we would be breaking with
this change. 

But yeah, it impacts source based routing, and I'm not aware of previous
discussions on it. I'll try to dig some up but if possible, please share
some pointers on it.

  Marcelo

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC net-next 03/22] ipv4: support for fib route lwtunnel encap attributes

2015-07-10 Thread Eric Dumazet
On Fri, 2015-07-10 at 16:19 +0200, Thomas Graf wrote:
> From: Roopa Prabhu 


> + if (oif)
> + dev = __dev_get_by_index(net, oif);
> + ret = lwtunnel_build_state(dev, encap_type,
> +encap, &lwtstate);
> + if (!ret) {
> + lwtunnel_state_get(lwtstate);
> + ret = lwtunnel_cmp_encap(lwtstate, nh->nh_lwtstate);
> + lwtunnel_state_put(lwtstate);
> +
> + return ret;


These _get()/_put() calls do not seem necessary, or buggy.

If refcounting is needed the _get() should be done at the time lwstate
is fetched.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH net-next] sctp: fix src address selection if using secondary addresses

2015-07-10 Thread Vlad Yasevich
On 07/10/2015 07:53 AM, Marcelo Ricardo Leitner wrote:
> On Thu, Jul 09, 2015 at 09:55:19PM +0200, Michael Tuexen wrote:
>>> On 09 Jul 2015, at 18:54, Marcelo Ricardo Leitner  
>>> wrote:
>>>
>>> Cc'ing Michael too.
>> I'm not familiar with the Linux kernel code, so I can't comment on it.
>> But making sure to use a source address belonging to the emitting
>> interface makes sense for me.
>>
>> Best regards
>> Michael
> 
> That's pretty much what I was looking for, in case I missed something on
> SCTP RFCs. 
> 

Well, the RFCs do not really specify what the source address should be, and 
there
have been numerous times where I've seen weak host model in use on the wire
even with a BSD peer.

This also puts a very big nail through many suggestions we've had over the years
to allow source based path multihoming in addition to destination based 
multihoming
we currently support.

It might be a good idea to make rp-filter like behavior best effort, and have
the old behavior as fallback.  I am still trying to think up different scenarios
where rp-filter behavior will cause things to fail prematurely...

-vlad

> Thanks Michael!
> 
>>> On Tue, Jul 07, 2015 at 02:42:25PM -0300, Marcelo Ricardo Leitner wrote:
 Hi folks,

 This is an attempt to better choose a src address for sctp packets as
 peers with rp_filter could be dropping our packets in some situations.
 With this patch, we try to respect and use a src address that belongs to
 the interface we are putting the packet out.

 I have that feeling that there is be a better way to do this, but I
 just couldn't see it.

 This patch has been tested with and without gateways between the peers
 and also just two peers connected via two subnets and results were
 pretty good.

 One could think that this limits the address combination we can use, but
 such combinations probably are just bogus anyway. Like, if you have an
 host with address A1 and B1 and another with A2 and B2, you cannot
 expect that A can use A1 to reach B2 through subnet B, because the
 return path would be via the other link which, when this switch happens,
 we are thinking it's broken.

 Thanks,
 Marcelo

 ---8<---

 In short, sctp is likely to incorrectly choose src address if socket is
 bound to secondary addresses. This patch fixes it by adding a new check
 that tries to anticipate if the src address would be expected by the
 next hop/peer on this interface by doing reverse routing.

 Also took the shot to reduce the indentation level on this code.

 Details:

 Currently, sctp will do a routing attempt without specifying the src
 address and compare the returned value (preferred source) with the
 addresses that the socket is bound to. When using secondary addresses,
 this will not match.

 Then it will try specifying each of the addresses that the socket is
 bound to and re-routing, checking if that address is valid as src for
 that dst. Thing is, this check alone is weak:

 # ip r l
 192.168.100.0/24 dev eth1  proto kernel  scope link  src 192.168.100.149
 192.168.122.0/24 dev eth0  proto kernel  scope link  src 192.168.122.147

 # ip a l
 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group 
 default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
   valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
   valid_lft forever preferred_lft forever
 2: eth0:  mtu 1500 qdisc pfifo_fast state 
 UP group default qlen 1000
link/ether 52:54:00:15:18:6a brd ff:ff:ff:ff:ff:ff
inet 192.168.122.147/24 brd 192.168.122.255 scope global dynamic eth0
   valid_lft 2160sec preferred_lft 2160sec
inet 192.168.122.148/24 scope global secondary eth0
   valid_lft forever preferred_lft forever
inet6 fe80::5054:ff:fe15:186a/64 scope link
   valid_lft forever preferred_lft forever
 3: eth1:  mtu 1500 qdisc pfifo_fast state 
 UP group default qlen 1000
link/ether 52:54:00:b3:91:46 brd ff:ff:ff:ff:ff:ff
inet 192.168.100.149/24 brd 192.168.100.255 scope global dynamic eth1
   valid_lft 2162sec preferred_lft 2162sec
inet 192.168.100.148/24 scope global secondary eth1
   valid_lft forever preferred_lft forever
inet6 fe80::5054:ff:feb3:9146/64 scope link
   valid_lft forever preferred_lft forever
 4: ens9:  mtu 1500 qdisc pfifo_fast state 
 UP group default qlen 1000
link/ether 52:54:00:05:47:ee brd ff:ff:ff:ff:ff:ff
inet6 fe80::5054:ff:fe05:47ee/64 scope link
   valid_lft forever preferred_lft forever

 # ip r g 192.168.100.193 from 192.168.122.148
 192.168.100.193 from 192.168.122.148 dev eth1
cache

 Even if you specify an interface:

 # ip r g 

RE: [PATCH 03/11] soc/fsl: Introduce the DPAA BMan portal driver

2015-07-10 Thread Roy Pledge
Paul,

Thanks you for your valuable feedback so far.  Let me try to address a general 
issue you mention below: unused exported APIs.

The QMan and BMan drivers provide a base layer for other blocks built on top of 
them, for instance an Ethernet Driver, an Encrypt/Decrypt Engine,  a pattern 
matcher, a compress/decompress engine, etc...
Some of these drivers will be presented for review in the near future, but in 
order to try and layer the review/up streaming process we're presenting each 
layer as independently as possible.
If you really were to start dissecting which APIs are called you would come to 
a very small set of pieces that merely initialize the hardware but don't 
provide any opportunity for other users to invoke that HW. 

I hope that this helps you understand our goals in trying to upstream these 
drivers.


Roy

> -Original Message-
> From: Paul Bolle [mailto:pebo...@tiscali.nl]
> Sent: Friday, July 10, 2015 9:33 AM
> To: Pledge Roy-R01356
> Cc: linuxppc-...@lists.ozlabs.org; Wood Scott-B07421;
> netdev@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: [PATCH 03/11] soc/fsl: Introduce the DPAA BMan portal driver
> 
> On do, 2015-07-09 at 16:21 -0400, Roy Pledge wrote:
> > --- a/drivers/soc/fsl/qbman/Kconfig
> > +++ b/drivers/soc/fsl/qbman/Kconfig
> 
> > +config FSL_DPA_PIRQ_FAST
> > +   bool
> > +   default y
> 
> First used in 04/11.
> 
> > +config FSL_DPA_PIRQ_SLOW
> > +   bool
> > +   default y
> > +
> > +config FSL_DPA_PORTAL_SHARE
> > +   bool
> > +   default y
> 
> As in 02/11: these three symbols function as aliases for FSL_DPA. Why are
> they needed?
> 
> >  config FSL_BMAN
> > tristate "BMan device management"
> > default n
> > help
> > FSL DPAA BMan driver
> >
> > +config FSL_BMAN_PORTAL
> > +   tristate "BMan portal(s)"
> > +   default n
> > +   help
> > +   FSL BMan portal driver
> 
> > --- /dev/null
> > +++ b/drivers/soc/fsl/qbman/bman_api.c
> 
> > +struct bman_portal {
> > +   [...]
> > +#ifdef CONFIG_FSL_DPA_CAN_WAIT_SYNC
> 
> This check will always evaluate to true, right? (I'll only report this
> once.)
> 
> > +   struct bman_pool *rcri_owned; /* only 1 release WAIT_SYNC at
> > a time */
> > +#endif
> > +#ifdef CONFIG_FSL_DPA_PORTAL_SHARE
> 
> Ditto.
> 
> > +   raw_spinlock_t sharing_lock; /* only used if is_shared */
> > +   int is_shared;
> > +   struct bman_portal *sharing_redirect; #endif
> > +   [...]
> > +};
> 
> > +const struct bman_portal_config *bman_get_portal_config(void) {
> > +   [...]
> > +}
> 
> I couldn't find callers of this function.
> 
> > +EXPORT_SYMBOL(bman_get_portal_config);
> 
> Nor users of this export, obviously.
> 
> > +
> > +u32 bman_irqsource_get(void)
> > +{
> > +   [...]
> > +}
> 
> Ditto.
> 
> > +EXPORT_SYMBOL(bman_irqsource_get);
> 
> Ditto.
> 
> > +int bman_p_irqsource_add(struct bman_portal *p, __maybe_unused u32
> > +bits) {
> > +   [...]
> > +}
> > +EXPORT_SYMBOL(bman_p_irqsource_add);
> 
> There seem to be no users of this export.
> 
> > +int bman_irqsource_add(__maybe_unused u32 bits) {
> > +   [...]
> > +}
> 
> Unused.
> 
> > +EXPORT_SYMBOL(bman_irqsource_add);
> 
> Ditto.
> 
> > +int bman_irqsource_remove(u32 bits)
> > +{
> > +   [...]
> > +}
> 
> Ditto.
> 
> > +EXPORT_SYMBOL(bman_irqsource_remove);
> 
> Ditto.
> 
> > +u32 bman_poll_slow(void)
> > +{
> > +   [...]
> > +}
> 
> Ditto.
> 
> > +EXPORT_SYMBOL(bman_poll_slow);
> 
> Ditto.
> 
> > +void bman_poll(void)
> > +{
> > +   [...]
> > +}
> 
> Ditto.
> 
> > +EXPORT_SYMBOL(bman_poll);
> 
> Ditto.
> 
> > +static inline struct bm_rcr_entry *try_rel_start(struct bman_portal
> > +**p, #ifdef CONFIG_FSL_DPA_CAN_WAIT
> 
> Always true, right?
> 
> > +   __maybe_unused struct bman_pool
> *pool, #endif
> > +   __maybe_unused unsigned long
> *irqflags,
> > +   __maybe_unused u32 flags)
> 
> And this is a, well, novel way to declare a function.
> 
> > +{
> > +   [...]
> > +}
> 
> > +int bman_flush_stockpile(struct bman_pool *pool, u32 flags) {
> > +   [...]
> > +}
> 
> Unused function.
> 
> > +EXPORT_SYMBOL(bman_flush_stockpile);
> 
> So unused export too.
> 
> > +#ifdef CONFIG_FSL_BMAN
> > +u32 bman_query_free_buffers(struct bman_pool *pool) {
> > +   return bm_pool_free_buffers(pool->params.bpid);
> > +}
> > +EXPORT_SYMBOL(bman_query_free_buffers);
> > +
> > +int bman_update_pool_thresholds(struct bman_pool *pool, const u32
> > +*thresholds) {
> > +   u32 bpid;
> > +
> > +   bpid = bman_get_params(pool)->bpid;
> > +
> > +   return bm_pool_set(bpid, thresholds); }
> > +EXPORT_SYMBOL(bman_update_pool_thresholds);
> 
> More of the same.
> 
> > +#endif
> 
> > --- /dev/null
> > +++ b/drivers/soc/fsl/qbman/bman_portal.c
> >
> > +module_driver(bman_portal_driver,
> > +  bman_portal_driver_register, platform_driver_unregister);
> 
> No MODULE_LICENSE() here, nor in the other files that make up this module.
> So loading this module will trigger a wa

[PATCH net-next] bridge: mdb: add vlan support for user entries

2015-07-10 Thread Nikolay Aleksandrov
Until now all user mdb entries were added in vlan 0, this patch adds
support to allow the user to specify the vlan for the entry.
About the uapi change a hole in struct br_mdb_entry is used so the size
and offsets are kept the same (verified with pahole and tested with older
iproute2).

Example:
$ bridge mdb
dev br0 port eth1 grp 239.0.0.1 permanent vlan 2000
dev br0 port eth1 grp 239.0.0.1 permanent vlan 200
dev br0 port eth1 grp 239.0.0.1 permanent

Signed-off-by: Nikolay Aleksandrov 
---
note: an iproute2 patch will follow if this one is accepted

 include/uapi/linux/if_bridge.h | 1 +
 net/bridge/br_mdb.c| 6 ++
 2 files changed, 7 insertions(+)

diff --git a/include/uapi/linux/if_bridge.h b/include/uapi/linux/if_bridge.h
index eaaea6208b42..3635b7797508 100644
--- a/include/uapi/linux/if_bridge.h
+++ b/include/uapi/linux/if_bridge.h
@@ -182,6 +182,7 @@ struct br_mdb_entry {
 #define MDB_TEMPORARY 0
 #define MDB_PERMANENT 1
__u8 state;
+   __u16 vid;
struct {
union {
__be32  ip4;
diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index 1fb7d076f15c..a8d0e93d43f2 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -85,6 +85,7 @@ static int br_mdb_fill_info(struct sk_buff *skb, struct 
netlink_callback *cb,
memset(&e, 0, sizeof(e));
e.ifindex = port->dev->ifindex;
e.state = p->state;
+   e.vid = p->addr.vid;
if (p->addr.proto == htons(ETH_P_IP))
e.addr.u.ip4 = p->addr.u.ip4;
 #if IS_ENABLED(CONFIG_IPV6)
@@ -242,6 +243,7 @@ void br_mdb_notify(struct net_device *dev, struct 
net_bridge_port *port,
entry.addr.u.ip6 = group->u.ip6;
 #endif
entry.state = state;
+   entry.vid = group->vid;
__br_mdb_notify(dev, &entry, type);
 }
 
@@ -264,6 +266,8 @@ static bool is_valid_mdb_entry(struct br_mdb_entry *entry)
return false;
if (entry->state != MDB_PERMANENT && entry->state != MDB_TEMPORARY)
return false;
+   if (entry->vid >= VLAN_VID_MASK)
+   return false;
 
return true;
 }
@@ -372,6 +376,7 @@ static int __br_mdb_add(struct net *net, struct net_bridge 
*br,
if (!p || p->br != br || p->state == BR_STATE_DISABLED)
return -EINVAL;
 
+   ip.vid = entry->vid;
ip.proto = entry->addr.proto;
if (ip.proto == htons(ETH_P_IP))
ip.u.ip4 = entry->addr.u.ip4;
@@ -418,6 +423,7 @@ static int __br_mdb_del(struct net_bridge *br, struct 
br_mdb_entry *entry)
if (!netif_running(br->dev) || br->multicast_disabled)
return -EINVAL;
 
+   ip.vid = entry->vid;
ip.proto = entry->addr.proto;
if (ip.proto == htons(ETH_P_IP)) {
if (timer_pending(&br->ip4_other_query.timer))
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC net-next 16/22] fib: Add fib rule match on tunnel id

2015-07-10 Thread Thomas Graf
This add the ability to select a routing table based on the tunnel
id which allows to maintain separate routing tables for each virtual
tunnel network.

ip rule add from all tunnel-id 100 lookup 100
ip rule add from all tunnel-id 200 lookup 200

Signed-off-by: Thomas Graf 
---
 include/net/fib_rules.h|  1 +
 include/uapi/linux/fib_rules.h |  2 +-
 net/core/fib_rules.c   | 17 +++--
 3 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
index 903a55e..4e8f804 100644
--- a/include/net/fib_rules.h
+++ b/include/net/fib_rules.h
@@ -19,6 +19,7 @@ struct fib_rule {
u8  action;
/* 3 bytes hole, try to use */
u32 target;
+   __be64  tun_id;
struct fib_rule __rcu   *ctarget;
struct net  *fr_net;
 
diff --git a/include/uapi/linux/fib_rules.h b/include/uapi/linux/fib_rules.h
index 2b82d7e..96161b8 100644
--- a/include/uapi/linux/fib_rules.h
+++ b/include/uapi/linux/fib_rules.h
@@ -43,7 +43,7 @@ enum {
FRA_UNUSED5,
FRA_FWMARK, /* mark */
FRA_FLOW,   /* flow/class id */
-   FRA_UNUSED6,
+   FRA_TUN_ID,
FRA_SUPPRESS_IFGROUP,
FRA_SUPPRESS_PREFIXLEN,
FRA_TABLE,  /* Extended table id */
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 9a12668..6da78c9 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -186,6 +186,9 @@ static int fib_rule_match(struct fib_rule *rule, struct 
fib_rules_ops *ops,
if ((rule->mark ^ fl->flowi_mark) & rule->mark_mask)
goto out;
 
+   if (rule->tun_id && (rule->tun_id != fl->flowi_tun_key.tun_id))
+   goto out;
+
ret = ops->match(rule, fl, flags);
 out:
return (rule->flags & FIB_RULE_INVERT) ? !ret : ret;
@@ -330,6 +333,9 @@ static int fib_nl_newrule(struct sk_buff *skb, struct 
nlmsghdr* nlh)
if (tb[FRA_FWMASK])
rule->mark_mask = nla_get_u32(tb[FRA_FWMASK]);
 
+   if (tb[FRA_TUN_ID])
+   rule->tun_id = nla_get_be64(tb[FRA_TUN_ID]);
+
rule->action = frh->action;
rule->flags = frh->flags;
rule->table = frh_get_table(frh, tb);
@@ -473,6 +479,10 @@ static int fib_nl_delrule(struct sk_buff *skb, struct 
nlmsghdr* nlh)
(rule->mark_mask != nla_get_u32(tb[FRA_FWMASK])))
continue;
 
+   if (tb[FRA_TUN_ID] &&
+   (rule->tun_id != nla_get_be64(tb[FRA_TUN_ID])))
+   continue;
+
if (!ops->compare(rule, frh, tb))
continue;
 
@@ -535,7 +545,8 @@ static inline size_t fib_rule_nlmsg_size(struct 
fib_rules_ops *ops,
 + nla_total_size(4) /* FRA_SUPPRESS_PREFIXLEN */
 + nla_total_size(4) /* FRA_SUPPRESS_IFGROUP */
 + nla_total_size(4) /* FRA_FWMARK */
-+ nla_total_size(4); /* FRA_FWMASK */
++ nla_total_size(4) /* FRA_FWMASK */
++ nla_total_size(8); /* FRA_TUN_ID */
 
if (ops->nlmsg_payload)
payload += ops->nlmsg_payload(rule);
@@ -591,7 +602,9 @@ static int fib_nl_fill_rule(struct sk_buff *skb, struct 
fib_rule *rule,
((rule->mark_mask || rule->mark) &&
 nla_put_u32(skb, FRA_FWMASK, rule->mark_mask)) ||
(rule->target &&
-nla_put_u32(skb, FRA_GOTO, rule->target)))
+nla_put_u32(skb, FRA_GOTO, rule->target)) ||
+   (rule->tun_id &&
+nla_put_be64(skb, FRA_TUN_ID, rule->tun_id)))
goto nla_put_failure;
 
if (rule->suppress_ifgroup != -1) {
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC net-next 19/22] openvswitch: Move dev pointer into vport itself

2015-07-10 Thread Thomas Graf
This is the first step in representing all OVS vports as regular
struct net_devices. Move the net_device pointer into the vport
structure itself to get rid of struct vport_netdev.

Signed-off-by: Thomas Graf 
Signed-off-by: Pravin B Shelar 
---
 net/openvswitch/datapath.c   |  7 +--
 net/openvswitch/dp_notify.c  |  5 +--
 net/openvswitch/vport-internal_dev.c | 37 +++-
 net/openvswitch/vport-netdev.c   | 86 
 net/openvswitch/vport-netdev.h   | 12 -
 net/openvswitch/vport.h  |  3 +-
 6 files changed, 59 insertions(+), 91 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 0208210..19df28e 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -188,7 +188,7 @@ static int get_dpifindex(const struct datapath *dp)
 
local = ovs_vport_rcu(dp, OVSP_LOCAL);
if (local)
-   ifindex = netdev_vport_priv(local)->dev->ifindex;
+   ifindex = local->dev->ifindex;
else
ifindex = 0;
 
@@ -2219,13 +2219,10 @@ static void __net_exit list_vports_from_net(struct net 
*net, struct net *dnet,
struct vport *vport;
 
hlist_for_each_entry(vport, &dp->ports[i], 
dp_hash_node) {
-   struct netdev_vport *netdev_vport;
-
if (vport->ops->type != OVS_VPORT_TYPE_INTERNAL)
continue;
 
-   netdev_vport = netdev_vport_priv(vport);
-   if (dev_net(netdev_vport->dev) == dnet)
+   if (dev_net(vport->dev) == dnet)
list_add(&vport->detach_list, head);
}
}
diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c
index 2c631fe..a7a80a6 100644
--- a/net/openvswitch/dp_notify.c
+++ b/net/openvswitch/dp_notify.c
@@ -58,13 +58,10 @@ void ovs_dp_notify_wq(struct work_struct *work)
struct hlist_node *n;
 
hlist_for_each_entry_safe(vport, n, &dp->ports[i], 
dp_hash_node) {
-   struct netdev_vport *netdev_vport;
-
if (vport->ops->type != OVS_VPORT_TYPE_NETDEV)
continue;
 
-   netdev_vport = netdev_vport_priv(vport);
-   if (!(netdev_vport->dev->priv_flags & 
IFF_OVS_DATAPATH))
+   if (!(vport->dev->priv_flags & 
IFF_OVS_DATAPATH))
dp_detach_port_notify(vport);
}
}
diff --git a/net/openvswitch/vport-internal_dev.c 
b/net/openvswitch/vport-internal_dev.c
index 6a55f71..a2c205d 100644
--- a/net/openvswitch/vport-internal_dev.c
+++ b/net/openvswitch/vport-internal_dev.c
@@ -156,49 +156,44 @@ static void do_setup(struct net_device *netdev)
 static struct vport *internal_dev_create(const struct vport_parms *parms)
 {
struct vport *vport;
-   struct netdev_vport *netdev_vport;
struct internal_dev *internal_dev;
int err;
 
-   vport = ovs_vport_alloc(sizeof(struct netdev_vport),
-   &ovs_internal_vport_ops, parms);
+   vport = ovs_vport_alloc(0, &ovs_internal_vport_ops, parms);
if (IS_ERR(vport)) {
err = PTR_ERR(vport);
goto error;
}
 
-   netdev_vport = netdev_vport_priv(vport);
-
-   netdev_vport->dev = alloc_netdev(sizeof(struct internal_dev),
-parms->name, NET_NAME_UNKNOWN,
-do_setup);
-   if (!netdev_vport->dev) {
+   vport->dev = alloc_netdev(sizeof(struct internal_dev),
+ parms->name, NET_NAME_UNKNOWN, do_setup);
+   if (!vport->dev) {
err = -ENOMEM;
goto error_free_vport;
}
 
-   dev_net_set(netdev_vport->dev, ovs_dp_get_net(vport->dp));
-   internal_dev = internal_dev_priv(netdev_vport->dev);
+   dev_net_set(vport->dev, ovs_dp_get_net(vport->dp));
+   internal_dev = internal_dev_priv(vport->dev);
internal_dev->vport = vport;
 
/* Restrict bridge port to current netns. */
if (vport->port_no == OVSP_LOCAL)
-   netdev_vport->dev->features |= NETIF_F_NETNS_LOCAL;
+   vport->dev->features |= NETIF_F_NETNS_LOCAL;
 
rtnl_lock();
-   err = register_netdevice(netdev_vport->dev);
+   err = register_netdevice(vport->dev);
if (err)
goto error_free_netdev;
 
-   dev_set_promiscuity(netdev_vport->dev, 1);
+   dev_set_promiscuity(vport->dev, 1);
rtnl_unlock();
-   netif_start_queue(netdev_vport->dev);
+   netif_start_queue(vport->dev);

[RFC net-next 06/22] ipv4: redirect dst output to lwtunnel output

2015-07-10 Thread Thomas Graf
From: Roopa Prabhu 

For input routes with tunnel encap state this patch redirects
dst output functions to lwtunnel_output which later resolves to
the corresponding lwtunnel output function.

This has been tested to work with mpls ip tunnels.

Open items: Support for tunnel mtu, pmtu, fragmentation can be
added by hooking into the corresponding (ipv4, ipv6) dst ops.
We may do this differently when lwtstate moves to dst or dst_metadata
as per upstream discussions.

Signed-off-by: Roopa Prabhu 
---
 net/ipv4/route.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index bfc0a18..d3964fa 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1634,6 +1634,8 @@ static int __mkroute_input(struct sk_buff *skb,
rth->dst.output = ip_output;
 
rt_set_nexthop(rth, daddr, res, fnhe, res->fi, res->type, itag);
+   if (lwtunnel_output_redirect(rth->rt_lwtstate))
+   rth->dst.output = lwtunnel_output;
skb_dst_set(skb, &rth->dst);
 out:
err = 0;
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC net-next 17/22] vxlan: Factor out device configuration

2015-07-10 Thread Thomas Graf
This factors out the device configuration out of the RTNL newlink
API which allows for in-kernel creation of VXLAN net_devices.

Signed-off-by: Thomas Graf 
---
 drivers/net/vxlan.c | 332 
 include/net/vxlan.h |  59 ++
 2 files changed, 236 insertions(+), 155 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 773b6bf..020c941 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -55,10 +55,6 @@
 
 #define PORT_HASH_BITS 8
 #define PORT_HASH_SIZE  (1remote_port && rdst->remote_port != vxlan->dst_port &&
+   if (rdst->remote_port && rdst->remote_port != vxlan->cfg.dst_port &&
nla_put_be16(skb, NDA_PORT, rdst->remote_port))
goto nla_put_failure;
if (rdst->remote_vni != vxlan->default_dst.remote_vni &&
@@ -750,7 +707,8 @@ static int vxlan_fdb_create(struct vxlan_dev *vxlan,
if (!(flags & NLM_F_CREATE))
return -ENOENT;
 
-   if (vxlan->addrmax && vxlan->addrcnt >= vxlan->addrmax)
+   if (vxlan->cfg.addrmax &&
+   vxlan->addrcnt >= vxlan->cfg.addrmax)
return -ENOSPC;
 
/* Disallow replace to add a multicast entry */
@@ -836,7 +794,7 @@ static int vxlan_fdb_parse(struct nlattr *tb[], struct 
vxlan_dev *vxlan,
return -EINVAL;
*port = nla_get_be16(tb[NDA_PORT]);
} else {
-   *port = vxlan->dst_port;
+   *port = vxlan->cfg.dst_port;
}
 
if (tb[NDA_VNI]) {
@@ -1022,7 +980,7 @@ static bool vxlan_snoop(struct net_device *dev,
vxlan_fdb_create(vxlan, src_mac, src_ip,
 NUD_REACHABLE,
 NLM_F_EXCL|NLM_F_CREATE,
-vxlan->dst_port,
+vxlan->cfg.dst_port,
 vxlan->default_dst.remote_vni,
 0, NTF_SELF);
spin_unlock(&vxlan->hash_lock);
@@ -1951,7 +1909,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct 
net_device *dev,
info = skb_tunnel_info(skb, AF_INET);
 
if (rdst) {
-   dst_port = rdst->remote_port ? rdst->remote_port : 
vxlan->dst_port;
+   dst_port = rdst->remote_port ? rdst->remote_port : 
vxlan->cfg.dst_port;
vni = rdst->remote_vni;
dst = &rdst->remote_ip;
} else {
@@ -1961,7 +1919,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct 
net_device *dev,
goto drop;
}
 
-   dst_port = info->key.tp_dst ? : vxlan->dst_port;
+   dst_port = info->key.tp_dst ? : vxlan->cfg.dst_port;
vni = be64_to_cpu(info->key.tun_id);
remote_ip.sin.sin_family = AF_INET;
remote_ip.sin.sin_addr.s_addr = info->key.ipv4_dst;
@@ -1979,16 +1937,16 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct 
net_device *dev,
 
old_iph = ip_hdr(skb);
 
-   ttl = vxlan->ttl;
+   ttl = vxlan->cfg.ttl;
if (!ttl && vxlan_addr_multicast(dst))
ttl = 1;
 
-   tos = vxlan->tos;
+   tos = vxlan->cfg.tos;
if (tos == 1)
tos = ip_tunnel_get_dsfield(old_iph, skb);
 
-   src_port = udp_flow_src_port(dev_net(dev), skb, vxlan->port_min,
-vxlan->port_max, true);
+   src_port = udp_flow_src_port(dev_net(dev), skb, vxlan->cfg.port_min,
+vxlan->cfg.port_max, true);
 
if (dst->sa.sa_family == AF_INET) {
if (info) {
@@ -2014,7 +1972,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct 
net_device *dev,
fl4.flowi4_mark = skb->mark;
fl4.flowi4_proto = IPPROTO_UDP;
fl4.daddr = dst->sin.sin_addr.s_addr;
-   fl4.saddr = vxlan->saddr.sin.sin_addr.s_addr;
+   fl4.saddr = vxlan->cfg.saddr.sin.sin_addr.s_addr;
 
rt = ip_route_output_key(vxlan->net, &fl4);
if (IS_ERR(rt)) {
@@ -2070,7 +2028,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct 
net_device *dev,
memset(&fl6, 0, sizeof(fl6));
fl6.flowi6_oif = rdst ? rdst->remote_ifindex : 0;
fl6.daddr = dst->sin6.sin6_addr;
-   fl6.saddr = vxlan->saddr.sin6.sin6_addr;
+   fl6.saddr = vxlan->cfg.saddr.sin6.sin6_addr;
fl6.flowi6_mark = skb->mark;
fl6.flowi6_proto = IPPROTO_UDP;
 
@@ -2241,7 +2199,7 @@ static void vxlan_cleanup(unsigned long arg)
if (f->state & NUD_PERMANENT)
continue;
 
-   timeo

[RFC net-next 15/22] route: Per route IP tunnel metadata via lightweight tunnel

2015-07-10 Thread Thomas Graf
This introduces a new IP tunnel lightweight tunnel type which allows
to specify IP tunnel instructions per route. Only IPv4 is supported
at this point.

Signed-off-by: Thomas Graf 
---
 drivers/net/vxlan.c|  10 +++-
 include/net/dst_metadata.h |  12 -
 include/net/ip_tunnels.h   |   7 ++-
 include/uapi/linux/lwtunnel.h  |   1 +
 include/uapi/linux/rtnetlink.h |  15 ++
 net/ipv4/ip_tunnel_core.c  | 114 +
 net/ipv4/route.c   |   2 +-
 net/openvswitch/vport.h|   1 +
 8 files changed, 157 insertions(+), 5 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 4dfb8a7..773b6bf 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1930,7 +1930,7 @@ static void vxlan_encap_bypass(struct sk_buff *skb, 
struct vxlan_dev *src_vxlan,
 static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
   struct vxlan_rdst *rdst, bool did_rsc)
 {
-   struct ip_tunnel_info *info = skb_tunnel_info(skb);
+   struct ip_tunnel_info *info;
struct vxlan_dev *vxlan = netdev_priv(dev);
struct sock *sk = vxlan->vn_sock->sock->sk;
struct rtable *rt = NULL;
@@ -1947,6 +1947,9 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct 
net_device *dev,
int err;
u32 flags = vxlan->flags;
 
+   /* FIXME: Support IPv6 */
+   info = skb_tunnel_info(skb, AF_INET);
+
if (rdst) {
dst_port = rdst->remote_port ? rdst->remote_port : 
vxlan->dst_port;
vni = rdst->remote_vni;
@@ -2136,12 +2139,15 @@ tx_free:
 static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
 {
struct vxlan_dev *vxlan = netdev_priv(dev);
-   const struct ip_tunnel_info *info = skb_tunnel_info(skb);
+   const struct ip_tunnel_info *info;
struct ethhdr *eth;
bool did_rsc = false;
struct vxlan_rdst *rdst, *fdst = NULL;
struct vxlan_fdb *f;
 
+   /* FIXME: Support IPv6 */
+   info = skb_tunnel_info(skb, AF_INET);
+
skb_reset_mac_header(skb);
eth = eth_hdr(skb);
 
diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
index e843937..fc03491 100644
--- a/include/net/dst_metadata.h
+++ b/include/net/dst_metadata.h
@@ -23,13 +23,23 @@ static inline struct metadata_dst *skb_metadata_dst(struct 
sk_buff *skb)
return NULL;
 }
 
-static inline struct ip_tunnel_info *skb_tunnel_info(struct sk_buff *skb)
+static inline struct ip_tunnel_info *skb_tunnel_info(struct sk_buff *skb,
+int family)
 {
struct metadata_dst *md_dst = skb_metadata_dst(skb);
+   struct rtable *rt;
 
if (md_dst)
return &md_dst->u.tun_info;
 
+   switch(family) {
+   case AF_INET:
+   rt = (struct rtable *)skb_dst(skb);
+   if (rt && rt->rt_lwtstate)
+   return lwt_tun_info(rt->rt_lwtstate);
+   break;
+   }
+
return NULL;
 }
 
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index d11530f..0b7e18c 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -9,9 +9,9 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
+#include 
 
 #if IS_ENABLED(CONFIG_IPV6)
 #include 
@@ -298,6 +298,11 @@ static inline void *ip_tunnel_info_opts(struct 
ip_tunnel_info *info, size_t n)
return info + 1;
 }
 
+static inline struct ip_tunnel_info *lwt_tun_info(struct lwtunnel_state 
*lwtstate)
+{
+   return (struct ip_tunnel_info *)lwtstate->data;
+}
+
 #endif /* CONFIG_INET */
 
 #endif /* __NET_IP_TUNNELS_H */
diff --git a/include/uapi/linux/lwtunnel.h b/include/uapi/linux/lwtunnel.h
index aa611d9..31377bb 100644
--- a/include/uapi/linux/lwtunnel.h
+++ b/include/uapi/linux/lwtunnel.h
@@ -6,6 +6,7 @@
 enum lwtunnel_encap_types {
LWTUNNEL_ENCAP_NONE,
LWTUNNEL_ENCAP_MPLS,
+   LWTUNNEL_ENCAP_IP,
__LWTUNNEL_ENCAP_MAX,
 };
 
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 0d3d3cc..47d24cb 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -286,6 +286,21 @@ enum rt_class_t {
 
 /* Routing message attributes */
 
+enum ip_tunnel_t {
+   IP_TUN_UNSPEC,
+   IP_TUN_ID,
+   IP_TUN_DST,
+   IP_TUN_SRC,
+   IP_TUN_TTL,
+   IP_TUN_TOS,
+   IP_TUN_SPORT,
+   IP_TUN_DPORT,
+   IP_TUN_FLAGS,
+   __IP_TUN_MAX,
+};
+
+#define IP_TUN_MAX (__IP_TUN_MAX - 1)
+
 enum rtattr_type_t {
RTA_UNSPEC,
RTA_DST,
diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index 6a51a71..f4f2100 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -190,3 +190,117 @@ struct rtnl_link_stats64 *ip_tunnel_get_stats64(struct 
net_device *dev,
return tot;
 }
 EXPORT_SYMBOL_GPL(ip_tunnel_get_stats64);
+
+static const struc

[RFC net-next 07/22] ipv6: rt6_info output redirect to tunnel output

2015-07-10 Thread Thomas Graf
From: Roopa Prabhu 

This is similar to ipv4 redirect of dst output to lwtunnel
output function for encapsulation and xmit.

Signed-off-by: Roopa Prabhu 
---
 net/ipv6/route.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index acfe25d..6c328bb 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1783,6 +1783,7 @@ int ip6_route_add(struct fib6_config *cfg)
goto out;
lwtunnel_state_get(lwtstate);
rt->rt6i_lwtstate = lwtstate;
+   rt->dst.output = lwtunnel_output6;
}
 
ipv6_addr_prefix(&rt->rt6i_dst.addr, &cfg->fc_dst, cfg->fc_dst_len);
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC net-next 18/22] openvswitch: Make tunnel set action attach a metadata dst

2015-07-10 Thread Thomas Graf
Utilize the new metadata dst to attach encapsulation instructions to
the skb. The existing egress_tun_info via the OVS_CB() is left in
place until all tunnel vports have been converted to the new method.

Signed-off-by: Thomas Graf 
Signed-off-by: Pravin B Shelar 
---
 net/openvswitch/actions.c  | 10 ++-
 net/openvswitch/datapath.c |  8 +++---
 net/openvswitch/flow.h |  6 +
 net/openvswitch/flow_netlink.c | 61 +-
 net/openvswitch/flow_netlink.h |  1 +
 5 files changed, 74 insertions(+), 12 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 27c1687..cf04c2f 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -733,7 +733,15 @@ static int execute_set_action(struct sk_buff *skb,
 {
/* Only tunnel set execution is supported without a mask. */
if (nla_type(a) == OVS_KEY_ATTR_TUNNEL_INFO) {
-   OVS_CB(skb)->egress_tun_info = nla_data(a);
+   struct ovs_tunnel_info *tun = nla_data(a);
+
+   skb_dst_drop(skb);
+   dst_hold((struct dst_entry *)tun->tun_dst);
+   skb_dst_set(skb, (struct dst_entry *)tun->tun_dst);
+
+   /* FIXME: Remove when all vports have been converted */
+   OVS_CB(skb)->egress_tun_info = &tun->tun_dst->u.tun_info;
+
return 0;
}
 
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index ff8c4a4..0208210 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -1018,7 +1018,7 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct 
genl_info *info)
}
ovs_unlock();
 
-   ovs_nla_free_flow_actions(old_acts);
+   ovs_nla_free_flow_actions_rcu(old_acts);
ovs_flow_free(new_flow, false);
}
 
@@ -1030,7 +1030,7 @@ err_unlock_ovs:
ovs_unlock();
kfree_skb(reply);
 err_kfree_acts:
-   kfree(acts);
+   ovs_nla_free_flow_actions(acts);
 err_kfree_flow:
ovs_flow_free(new_flow, false);
 error:
@@ -1157,7 +1157,7 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct 
genl_info *info)
if (reply)
ovs_notify(&dp_flow_genl_family, reply, info);
if (old_acts)
-   ovs_nla_free_flow_actions(old_acts);
+   ovs_nla_free_flow_actions_rcu(old_acts);
 
return 0;
 
@@ -1165,7 +1165,7 @@ err_unlock_ovs:
ovs_unlock();
kfree_skb(reply);
 err_kfree_acts:
-   kfree(acts);
+   ovs_nla_free_flow_actions(acts);
 error:
return error;
 }
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index cadc6c5..3fbf874 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct sk_buff;
 
@@ -45,6 +46,11 @@ struct sk_buff;
 #define TUN_METADATA_OPTS(flow_key, opt_len) \
((void *)((flow_key)->tun_opts + TUN_METADATA_OFFSET(opt_len)))
 
+struct ovs_tunnel_info
+{
+   struct metadata_dst *tun_dst;
+};
+
 #define OVS_SW_FLOW_KEY_METADATA_SIZE  \
(offsetof(struct sw_flow_key, recirc_id) +  \
FIELD_SIZEOF(struct sw_flow_key, recirc_id))
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index ecfa530..05fe46b 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -1548,11 +1548,45 @@ static struct sw_flow_actions 
*nla_alloc_flow_actions(int size, bool log)
return sfa;
 }
 
+static void ovs_nla_free_set_action(const struct nlattr *a)
+{
+   const struct nlattr *ovs_key = nla_data(a);
+   struct ovs_tunnel_info *ovs_tun;
+
+   switch (nla_type(ovs_key)) {
+   case OVS_KEY_ATTR_TUNNEL_INFO:
+   ovs_tun = nla_data(ovs_key);
+   dst_release((struct dst_entry *)ovs_tun->tun_dst);
+   break;
+   }
+}
+
+void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts)
+{
+   const struct nlattr *a;
+   int rem;
+
+   nla_for_each_attr(a, sf_acts->actions, sf_acts->actions_len, rem) {
+   switch (nla_type(a)) {
+   case OVS_ACTION_ATTR_SET:
+   ovs_nla_free_set_action(a);
+   break;
+   }
+   }
+
+   kfree(sf_acts);
+}
+
+static void __ovs_nla_free_flow_actions(struct rcu_head *head)
+{
+   ovs_nla_free_flow_actions(container_of(head, struct sw_flow_actions, 
rcu));
+}
+
 /* Schedules 'sf_acts' to be freed after the next RCU grace period.
  * The caller must hold rcu_read_lock for this to be sensible. */
-void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts)
+void ovs_nla_free_flow_actions_rcu(struct sw_flow_actions *sf_acts)
 {
-   kfree_rcu(sf_acts, rcu);
+   call_rcu(&sf_acts->rcu, __ovs_nla_free_flow_actions);
 }
 
 static struct nlattr *reserve_sfa_size(struct sw_flow_actions **sfa,
@@ -1746,7 +

[RFC net-next 13/22] vxlan: Flow based tunneling

2015-07-10 Thread Thomas Graf
Allows putting a VXLAN device into a new flow-based mode in which it
will populate a ip_tunnel_info struct for each packet received and
attach it to the skb using the new metadata dst. The metadata structure
will contain the outer header and tunnel header fields which have been
stripped off. Layers further up in the stack such as routing, tc or
netfitler can later match on these fields and perform forwarding.

Similar on the transmit side, while in flow based mode, skbs with a
ip_tunnel_info dst metadata attached will be encapsulated according
to the instructions stored in there with the VXLAN device defaults
taken into consideration.

This prepares the VXLAN device to be steered by the routing and other
subsystems which allows to support encapsulation for a large number
of tunnel endpoints and tunnel ids through a single net_device which
improves the scalability.

It also allows for OVS to leverage this mode which in turn allows for
the removal of the OVS specific VXLAN code.

Because the skb is currently scrubed in vxlan_rcv(), the attachment of
the new dst metadata is postponed until after scrubing which requires
the temporary addition of a new member to vxlan_metadata. This member
is removed again in a later commit after the indirect VXLAN receive API
has been removed.

Signed-off-by: Thomas Graf 
Signed-off-by: Pravin B Shelar 
---
 drivers/net/vxlan.c  | 149 ---
 include/linux/skbuff.h   |   1 +
 include/net/dst_metadata.h   |  13 
 include/net/ip_tunnels.h |  14 
 include/net/vxlan.h  |   8 ++-
 include/uapi/linux/if_link.h |   1 +
 6 files changed, 163 insertions(+), 23 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 34c519e..4dfb8a7 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -49,6 +49,7 @@
 #include 
 #include 
 #endif
+#include 
 
 #define VXLAN_VERSION  "0.1"
 
@@ -1164,10 +1165,13 @@ static struct vxlanhdr *vxlan_remcsum(struct sk_buff 
*skb, struct vxlanhdr *vh,
 /* Callback from net/ipv4/udp.c to receive packets */
 static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
 {
+   struct metadata_dst *tun_dst = NULL;
+   struct ip_tunnel_info *info;
struct vxlan_sock *vs;
struct vxlanhdr *vxh;
u32 flags, vni;
-   struct vxlan_metadata md = {0};
+   struct vxlan_metadata _md;
+   struct vxlan_metadata *md = &_md;
 
/* Need Vxlan and inner Ethernet header to be present */
if (!pskb_may_pull(skb, VXLAN_HLEN))
@@ -1202,6 +1206,33 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct 
sk_buff *skb)
vni &= VXLAN_VNI_MASK;
}
 
+   if (vs->flags & VXLAN_F_FLOW_BASED) {
+   const struct iphdr *iph = ip_hdr(skb);
+
+   tun_dst = metadata_dst_alloc(sizeof(*md), GFP_ATOMIC);
+   if (!tun_dst)
+   goto drop;
+
+   info = &tun_dst->u.tun_info;
+   info->key.ipv4_src = iph->saddr;
+   info->key.ipv4_dst = iph->daddr;
+   info->key.ipv4_tos = iph->tos;
+   info->key.ipv4_ttl = iph->ttl;
+   info->key.tp_src = udp_hdr(skb)->source;
+   info->key.tp_dst = udp_hdr(skb)->dest;
+
+   info->mode = IP_TUNNEL_INFO_RX;
+   info->key.tun_flags = TUNNEL_KEY;
+   info->key.tun_id = cpu_to_be64(vni >> 8);
+   if (udp_hdr(skb)->check != 0)
+   info->key.tun_flags |= TUNNEL_CSUM;
+
+   md = ip_tunnel_info_opts(info, sizeof(*md));
+   md->tun_dst = tun_dst;
+   } else {
+   memset(md, 0, sizeof(*md));
+   }
+
/* For backwards compatibility, only allow reserved fields to be
 * used by VXLAN extensions if explicitly requested.
 */
@@ -1209,13 +1240,16 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct 
sk_buff *skb)
struct vxlanhdr_gbp *gbp;
 
gbp = (struct vxlanhdr_gbp *)vxh;
-   md.gbp = ntohs(gbp->policy_id);
+   md->gbp = ntohs(gbp->policy_id);
+
+   if (tun_dst)
+   info->key.tun_flags |= TUNNEL_VXLAN_OPT;
 
if (gbp->dont_learn)
-   md.gbp |= VXLAN_GBP_DONT_LEARN;
+   md->gbp |= VXLAN_GBP_DONT_LEARN;
 
if (gbp->policy_applied)
-   md.gbp |= VXLAN_GBP_POLICY_APPLIED;
+   md->gbp |= VXLAN_GBP_POLICY_APPLIED;
 
flags &= ~VXLAN_GBP_USED_BITS;
}
@@ -1233,8 +1267,8 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct 
sk_buff *skb)
goto bad_flags;
}
 
-   md.vni = vxh->vx_vni;
-   vs->rcv(vs, skb, &md);
+   md->vni = vxh->vx_vni;
+   vs->rcv(vs, skb, md);
return 0;
 
 drop:
@@ -1247,6 +1281,9 @@ bad_flags:
   ntohl(vxh->vx_flags), ntohl(vxh->

[RFC net-next 20/22] openvswitch: Abstract vport name through ovs_vport_name()

2015-07-10 Thread Thomas Graf
This allows to get rid of the get_name() vport ops later on.

Signed-off-by: Thomas Graf 
---
 net/openvswitch/datapath.c   | 4 ++--
 net/openvswitch/vport-internal_dev.c | 1 -
 net/openvswitch/vport-netdev.c   | 6 --
 net/openvswitch/vport-netdev.h   | 1 -
 net/openvswitch/vport.c  | 4 ++--
 net/openvswitch/vport.h  | 5 +
 6 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 19df28e..ffe984f 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -176,7 +176,7 @@ static inline struct datapath *get_dp(struct net *net, int 
dp_ifindex)
 const char *ovs_dp_name(const struct datapath *dp)
 {
struct vport *vport = ovs_vport_ovsl_rcu(dp, OVSP_LOCAL);
-   return vport->ops->get_name(vport);
+   return ovs_vport_name(vport);
 }
 
 static int get_dpifindex(const struct datapath *dp)
@@ -1800,7 +1800,7 @@ static int ovs_vport_cmd_fill_info(struct vport *vport, 
struct sk_buff *skb,
if (nla_put_u32(skb, OVS_VPORT_ATTR_PORT_NO, vport->port_no) ||
nla_put_u32(skb, OVS_VPORT_ATTR_TYPE, vport->ops->type) ||
nla_put_string(skb, OVS_VPORT_ATTR_NAME,
-  vport->ops->get_name(vport)))
+  ovs_vport_name(vport)))
goto nla_put_failure;
 
ovs_vport_get_stats(vport, &vport_stats);
diff --git a/net/openvswitch/vport-internal_dev.c 
b/net/openvswitch/vport-internal_dev.c
index a2c205d..c058bbf 100644
--- a/net/openvswitch/vport-internal_dev.c
+++ b/net/openvswitch/vport-internal_dev.c
@@ -242,7 +242,6 @@ static struct vport_ops ovs_internal_vport_ops = {
.type   = OVS_VPORT_TYPE_INTERNAL,
.create = internal_dev_create,
.destroy= internal_dev_destroy,
-   .get_name   = ovs_netdev_get_name,
.send   = internal_dev_recv,
 };
 
diff --git a/net/openvswitch/vport-netdev.c b/net/openvswitch/vport-netdev.c
index 1c96966..e682bdc 100644
--- a/net/openvswitch/vport-netdev.c
+++ b/net/openvswitch/vport-netdev.c
@@ -171,11 +171,6 @@ static void netdev_destroy(struct vport *vport)
call_rcu(&vport->rcu, free_port_rcu);
 }
 
-const char *ovs_netdev_get_name(const struct vport *vport)
-{
-   return vport->dev->name;
-}
-
 static unsigned int packet_length(const struct sk_buff *skb)
 {
unsigned int length = skb->len - ETH_HLEN;
@@ -223,7 +218,6 @@ static struct vport_ops ovs_netdev_vport_ops = {
.type   = OVS_VPORT_TYPE_NETDEV,
.create = netdev_create,
.destroy= netdev_destroy,
-   .get_name   = ovs_netdev_get_name,
.send   = netdev_send,
 };
 
diff --git a/net/openvswitch/vport-netdev.h b/net/openvswitch/vport-netdev.h
index 1c52aed..684fb88 100644
--- a/net/openvswitch/vport-netdev.h
+++ b/net/openvswitch/vport-netdev.h
@@ -26,7 +26,6 @@
 
 struct vport *ovs_netdev_get_vport(struct net_device *dev);
 
-const char *ovs_netdev_get_name(const struct vport *);
 void ovs_netdev_detach_dev(struct vport *);
 
 int __init ovs_netdev_init(void);
diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index af23ba0..d14f594 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -113,7 +113,7 @@ struct vport *ovs_vport_locate(const struct net *net, const 
char *name)
struct vport *vport;
 
hlist_for_each_entry_rcu(vport, bucket, hash_node)
-   if (!strcmp(name, vport->ops->get_name(vport)) &&
+   if (!strcmp(name, ovs_vport_name(vport)) &&
net_eq(ovs_dp_get_net(vport->dp), net))
return vport;
 
@@ -226,7 +226,7 @@ struct vport *ovs_vport_add(const struct vport_parms *parms)
}
 
bucket = hash_bucket(ovs_dp_get_net(vport->dp),
-vport->ops->get_name(vport));
+ovs_vport_name(vport));
hlist_add_head_rcu(&vport->hash_node, bucket);
return vport;
}
diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h
index e05ec68..1a689c2 100644
--- a/net/openvswitch/vport.h
+++ b/net/openvswitch/vport.h
@@ -237,6 +237,11 @@ static inline void ovs_skb_postpush_rcsum(struct sk_buff 
*skb,
skb->csum = csum_add(skb->csum, csum_partial(start, len, 0));
 }
 
+static inline const char *ovs_vport_name(struct vport *vport)
+{
+   return vport->dev ? vport->dev->name : vport->ops->get_name(vport);
+}
+
 int ovs_vport_ops_register(struct vport_ops *ops);
 void ovs_vport_ops_unregister(struct vport_ops *ops);
 
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC net-next 11/22] dst: Metadata destinations

2015-07-10 Thread Thomas Graf
Introduces a new dst_metadata which enables to carry per packet metadata
between forwarding and processing elements via the skb->dst pointer.

The structure is set up to be a union. Thus, each separate type of
metadata requires its own dst instance. If demand arises to carry
multiple types of metadata concurrently, metadata dst entries can be
made stackable.

The metadata dst entry is refcnt'ed as expected for now but a non
reference counted use is possible if the reference is forced before
queueing the skb.

In order to allow allocating dsts with variable length, the existing
dst_alloc() is split into a dst_alloc() and dst_init() function. The
existing dst_init() function to initialize the subsystem is being
renamed to dst_subsys_init() to make it clear what is what.

The check before ip_route_input() is changed to ignore metadata dsts
and drop the dst inside the routing function thus allowing to interpret
metadata in a later commit.

Signed-off-by: Thomas Graf 
---
 include/net/dst.h  |  5 ++-
 include/net/dst_metadata.h | 32 +++
 net/core/dev.c |  2 +-
 net/core/dst.c | 76 ++
 net/ipv4/ip_input.c|  3 +-
 net/ipv4/route.c   |  2 ++
 6 files changed, 104 insertions(+), 16 deletions(-)
 create mode 100644 include/net/dst_metadata.h

diff --git a/include/net/dst.h b/include/net/dst.h
index 2bc73f8a..fd69bb9 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -57,6 +57,7 @@ struct dst_entry {
 #define DST_FAKE_RTABLE0x0040
 #define DST_XFRM_TUNNEL0x0080
 #define DST_XFRM_QUEUE 0x0100
+#define DST_METADATA   0x0200
 
unsigned short  pending_confirm;
 
@@ -356,6 +357,8 @@ static inline int dst_discard(struct sk_buff *skb)
 }
 void *dst_alloc(struct dst_ops *ops, struct net_device *dev, int initial_ref,
int initial_obsolete, unsigned short flags);
+void dst_init(struct dst_entry *dst, struct dst_ops *ops, struct net_device 
*dev,
+ int initial_ref, int initial_obsolete, unsigned short flags);
 void __dst_free(struct dst_entry *dst);
 struct dst_entry *dst_destroy(struct dst_entry *dst);
 
@@ -457,7 +460,7 @@ static inline struct dst_entry *dst_check(struct dst_entry 
*dst, u32 cookie)
return dst;
 }
 
-void dst_init(void);
+void dst_subsys_init(void);
 
 /* Flags for xfrm_lookup flags argument. */
 enum {
diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
new file mode 100644
index 000..4f7694f
--- /dev/null
+++ b/include/net/dst_metadata.h
@@ -0,0 +1,32 @@
+#ifndef __NET_DST_METADATA_H
+#define __NET_DST_METADATA_H 1
+
+#include 
+#include 
+#include 
+
+struct metadata_dst {
+   struct dst_entrydst;
+   size_t  opts_len;
+};
+
+static inline struct metadata_dst *skb_metadata_dst(struct sk_buff *skb)
+{
+   struct metadata_dst *md_dst = (struct metadata_dst *) skb_dst(skb);
+
+   if (md_dst && md_dst->dst.flags & DST_METADATA)
+   return md_dst;
+
+   return NULL;
+}
+
+static inline bool skb_valid_dst(const struct sk_buff *skb)
+{
+   struct dst_entry *dst = skb_dst(skb);
+
+   return dst && !(dst->flags & DST_METADATA);
+}
+
+struct metadata_dst *metadata_dst_alloc(u8 optslen, gfp_t flags);
+
+#endif /* __NET_DST_METADATA_H */
diff --git a/net/core/dev.c b/net/core/dev.c
index 6778a99..97f6d47 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7640,7 +7640,7 @@ static int __init net_dev_init(void)
open_softirq(NET_RX_SOFTIRQ, net_rx_action);
 
hotcpu_notifier(dev_cpu_callback, 0);
-   dst_init();
+   dst_subsys_init();
rc = 0;
 out:
return rc;
diff --git a/net/core/dst.c b/net/core/dst.c
index e956ce6..5372450 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -22,6 +22,7 @@
 #include 
 
 #include 
+#include 
 
 /*
  * Theory of operations:
@@ -158,19 +159,9 @@ const u32 dst_default_metrics[RTAX_MAX + 1] = {
[RTAX_MAX] = 0xdeadbeef,
 };
 
-
-void *dst_alloc(struct dst_ops *ops, struct net_device *dev,
-   int initial_ref, int initial_obsolete, unsigned short flags)
+void dst_init(struct dst_entry *dst, struct dst_ops *ops, struct net_device 
*dev,
+ int initial_ref, int initial_obsolete, unsigned short flags)
 {
-   struct dst_entry *dst;
-
-   if (ops->gc && dst_entries_get_fast(ops) > ops->gc_thresh) {
-   if (ops->gc(ops))
-   return NULL;
-   }
-   dst = kmem_cache_alloc(ops->kmem_cachep, GFP_ATOMIC);
-   if (!dst)
-   return NULL;
dst->child = NULL;
dst->dev = dev;
if (dev)
@@ -200,6 +191,25 @@ void *dst_alloc(struct dst_ops *ops, struct net_device 
*dev,
dst->next = NULL;
if (!(flags & DST_NOCOUNT))
dst_entries_add(ops, 1);
+}
+EXPORT_SYMBOL(dst_init);
+
+void *dst_alloc(struct dst_ops *ops, struct ne

[RFC net-next 12/22] arp: Inherit metadata dst when creating ARP requests

2015-07-10 Thread Thomas Graf
If output device wants to see the dst, inherit the dst of the
original skb and pass it on to generate the ARP request.

Signed-off-by: Thomas Graf 
---
 net/ipv4/arp.c | 71 +++---
 1 file changed, 43 insertions(+), 28 deletions(-)

diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 933a928..3400aea 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -291,6 +291,46 @@ static void arp_error_report(struct neighbour *neigh, 
struct sk_buff *skb)
kfree_skb(skb);
 }
 
+/*
+ * Create and send an arp packet.
+ */
+static void arp_send_dst(int type, int ptype, __be32 dest_ip,
+struct net_device *dev, __be32 src_ip,
+const unsigned char *dest_hw,
+const unsigned char *src_hw,
+const unsigned char *target_hw, struct sk_buff *oskb)
+{
+   struct sk_buff *skb;
+
+   /*
+*  No arp on this interface.
+*/
+
+   if (dev->flags&IFF_NOARP)
+   return;
+
+   skb = arp_create(type, ptype, dest_ip, dev, src_ip,
+dest_hw, src_hw, target_hw);
+   if (!skb)
+   return;
+
+   if (oskb)
+   skb_dst_copy(skb, oskb);
+
+   arp_xmit(skb);
+}
+
+void arp_send(int type, int ptype, __be32 dest_ip,
+ struct net_device *dev, __be32 src_ip,
+ const unsigned char *dest_hw, const unsigned char *src_hw,
+ const unsigned char *target_hw)
+{
+   arp_send_dst(type, ptype, dest_ip, dev, src_ip, dest_hw, src_hw,
+target_hw, NULL);
+}
+EXPORT_SYMBOL(arp_send);
+
+
 static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
 {
__be32 saddr = 0;
@@ -346,8 +386,9 @@ static void arp_solicit(struct neighbour *neigh, struct 
sk_buff *skb)
}
}
 
-   arp_send(ARPOP_REQUEST, ETH_P_ARP, target, dev, saddr,
-dst_hw, dev->dev_addr, NULL);
+   arp_send_dst(ARPOP_REQUEST, ETH_P_ARP, target, dev, saddr,
+dst_hw, dev->dev_addr, NULL,
+dev->priv_flags & IFF_XMIT_DST_RELEASE ? NULL : skb);
 }
 
 static int arp_ignore(struct in_device *in_dev, __be32 sip, __be32 tip)
@@ -597,32 +638,6 @@ void arp_xmit(struct sk_buff *skb)
 EXPORT_SYMBOL(arp_xmit);
 
 /*
- * Create and send an arp packet.
- */
-void arp_send(int type, int ptype, __be32 dest_ip,
- struct net_device *dev, __be32 src_ip,
- const unsigned char *dest_hw, const unsigned char *src_hw,
- const unsigned char *target_hw)
-{
-   struct sk_buff *skb;
-
-   /*
-*  No arp on this interface.
-*/
-
-   if (dev->flags&IFF_NOARP)
-   return;
-
-   skb = arp_create(type, ptype, dest_ip, dev, src_ip,
-dest_hw, src_hw, target_hw);
-   if (!skb)
-   return;
-
-   arp_xmit(skb);
-}
-EXPORT_SYMBOL(arp_send);
-
-/*
  * Process an arp request.
  */
 
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC net-next 21/22] openvswitch: Use regular VXLAN net_device device

2015-07-10 Thread Thomas Graf
This gets rid of all OVS specific VXLAN code in the receive and
transmit path by using a VXLAN net_device to represent the vport.
Only a small shim layer remains which takes care of handling the
VXLAN specific OVS Netlink configuration.

Unexports vxlan_sock_add(), vxlan_sock_release(), vxlan_xmit_skb()
since they are no longer needed.

Signed-off-by: Thomas Graf 
Signed-off-by: Pravin B Shelar 
---
 drivers/net/vxlan.c| 242 +++
 include/net/vxlan.h|  24 +--
 net/openvswitch/Kconfig|  12 --
 net/openvswitch/Makefile   |   1 -
 net/openvswitch/flow_netlink.c |   6 +-
 net/openvswitch/vport-netdev.c | 177 +-
 net/openvswitch/vport-vxlan.c  | 322 -
 net/openvswitch/vport-vxlan.h  |  11 --
 8 files changed, 298 insertions(+), 497 deletions(-)
 delete mode 100644 net/openvswitch/vport-vxlan.c
 delete mode 100644 net/openvswitch/vport-vxlan.h

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 020c941..dba7a7d 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -75,6 +75,9 @@ static struct rtnl_link_ops vxlan_link_ops;
 
 static const u8 all_zeros_mac[ETH_ALEN];
 
+static struct vxlan_sock *vxlan_sock_add(struct net *net, __be16 port,
+bool no_share, u32 flags);
+
 /* per-network namespace private data for this module */
 struct vxlan_net {
struct list_head  vxlan_list;
@@ -1021,7 +1024,7 @@ static bool vxlan_group_used(struct vxlan_net *vn, struct 
vxlan_dev *dev)
return false;
 }
 
-void vxlan_sock_release(struct vxlan_sock *vs)
+static void vxlan_sock_release(struct vxlan_sock *vs)
 {
struct sock *sk = vs->sock->sk;
struct net *net = sock_net(sk);
@@ -1037,7 +1040,6 @@ void vxlan_sock_release(struct vxlan_sock *vs)
 
queue_work(vxlan_wq, &vs->del_work);
 }
-EXPORT_SYMBOL_GPL(vxlan_sock_release);
 
 /* Update multicast group membership when first VNI on
  * multicast address is brought up
@@ -1120,6 +1122,102 @@ static struct vxlanhdr *vxlan_remcsum(struct sk_buff 
*skb, struct vxlanhdr *vh,
return vh;
 }
 
+static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb,
+ struct vxlan_metadata *md, u32 vni,
+ struct metadata_dst *tun_dst)
+{
+   struct iphdr *oip = NULL;
+   struct ipv6hdr *oip6 = NULL;
+   struct vxlan_dev *vxlan;
+   struct pcpu_sw_netstats *stats;
+   union vxlan_addr saddr;
+   int err = 0;
+   union vxlan_addr *remote_ip;
+
+   /* For flow based devices, map all packets to VNI 0 */
+   if (vs->flags & VXLAN_F_FLOW_BASED)
+   vni = 0;
+
+   /* Is this VNI defined? */
+   vxlan = vxlan_vs_find_vni(vs, vni);
+   if (!vxlan)
+   goto drop;
+
+   remote_ip = &vxlan->default_dst.remote_ip;
+   skb_reset_mac_header(skb);
+   skb_scrub_packet(skb, !net_eq(vxlan->net, dev_net(vxlan->dev)));
+   skb->protocol = eth_type_trans(skb, vxlan->dev);
+   skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
+
+   /* Ignore packet loops (and multicast echo) */
+   if (ether_addr_equal(eth_hdr(skb)->h_source, vxlan->dev->dev_addr))
+   goto drop;
+
+   /* Re-examine inner Ethernet packet */
+   if (remote_ip->sa.sa_family == AF_INET) {
+   oip = ip_hdr(skb);
+   saddr.sin.sin_addr.s_addr = oip->saddr;
+   saddr.sa.sa_family = AF_INET;
+#if IS_ENABLED(CONFIG_IPV6)
+   } else {
+   oip6 = ipv6_hdr(skb);
+   saddr.sin6.sin6_addr = oip6->saddr;
+   saddr.sa.sa_family = AF_INET6;
+#endif
+   }
+
+   if (tun_dst) {
+   skb_dst_set(skb, (struct dst_entry *)tun_dst);
+   tun_dst = NULL;
+   }
+
+   if ((vxlan->flags & VXLAN_F_LEARN) &&
+   vxlan_snoop(skb->dev, &saddr, eth_hdr(skb)->h_source))
+   goto drop;
+
+   skb_reset_network_header(skb);
+   /* In flow-based mode, GBP is carried in dst_metadata */
+   if (!(vs->flags & VXLAN_F_FLOW_BASED))
+   skb->mark = md->gbp;
+
+   if (oip6)
+   err = IP6_ECN_decapsulate(oip6, skb);
+   if (oip)
+   err = IP_ECN_decapsulate(oip, skb);
+
+   if (unlikely(err)) {
+   if (log_ecn_error) {
+   if (oip6)
+   net_info_ratelimited("non-ECT from %pI6\n",
+&oip6->saddr);
+   if (oip)
+   net_info_ratelimited("non-ECT from %pI4 with 
TOS=%#x\n",
+&oip->saddr, oip->tos);
+   }
+   if (err > 1) {
+   ++vxlan->dev->stats.rx_frame_errors;
+   ++vxlan->dev->stats.rx_errors;
+   goto drop;
+   }
+   }
+
+

[RFC net-next 22/22] openvswitch: Use regular GRE net_device instead of vport

2015-07-10 Thread Thomas Graf
From: Pravin Shelar 

Removes all of the OVS specific GRE code and makes OVS use a
GRE net_device.

Signed-off-by: Pravin B Shelar 
---
 net/core/dev.c |   5 +-
 net/ipv4/ip_gre.c  | 165 +-
 net/openvswitch/Makefile   |   1 -
 net/openvswitch/vport-gre.c| 313 -
 net/openvswitch/vport-netdev.c |   5 +-
 5 files changed, 170 insertions(+), 319 deletions(-)
 delete mode 100644 net/openvswitch/vport-gre.c

diff --git a/net/core/dev.c b/net/core/dev.c
index 97f6d47..67a8cac 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6966,6 +6966,9 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, 
const char *name,
INIT_LIST_HEAD(&dev->ptype_all);
INIT_LIST_HEAD(&dev->ptype_specific);
dev->priv_flags = IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM;
+
+   strcpy(dev->name, name);
+   dev->name_assign_type = name_assign_type;
setup(dev);
 
dev->num_tx_queues = txqs;
@@ -6980,8 +6983,6 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, 
const char *name,
goto free_all;
 #endif
 
-   strcpy(dev->name, name);
-   dev->name_assign_type = name_assign_type;
dev->group = INIT_NETDEV_GROUP;
if (!dev->ethtool_ops)
dev->ethtool_ops = &default_ethtool_ops;
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 5fd7064..d285fb4 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -47,6 +48,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #if IS_ENABLED(CONFIG_IPV6)
 #include 
@@ -115,6 +117,8 @@ static bool log_ecn_error = true;
 module_param(log_ecn_error, bool, 0644);
 MODULE_PARM_DESC(log_ecn_error, "Log packets received with corrupted ECN");
 
+#define GRE_TAP_FB_NAME "gretap0"
+
 static struct rtnl_link_ops ipgre_link_ops __read_mostly;
 static int ipgre_tunnel_init(struct net_device *dev);
 
@@ -217,7 +221,20 @@ static int ipgre_rcv(struct sk_buff *skb, const struct 
tnl_ptk_info *tpi)
  iph->saddr, iph->daddr, tpi->key);
 
if (tunnel) {
+
skb_pop_mac_header(skb);
+   if (tunnel->dev == itn->fb_tunnel_dev) {
+   struct metadata_dst *tun_dst;
+
+   tun_dst = metadata_dst_alloc(0, GFP_ATOMIC);
+   if (!tun_dst)
+   return PACKET_REJECT;
+
+   /* TODO: setup tun info from tpi */
+   skb_dst_drop(skb);
+   skb_dst_set(skb, (struct dst_entry *)tun_dst);
+   }
+
ip_tunnel_rcv(tunnel, skb, tpi, log_ecn_error);
return PACKET_RCVD;
}
@@ -287,6 +304,135 @@ out:
return NETDEV_TX_OK;
 }
 
+/* TODO: share xmit code */
+static inline struct rtable *tunnel_route_lookup(struct net *net,
+const struct ip_tunnel_key 
*key,
+u32 mark,
+struct flowi4 *fl,
+u8 protocol)
+{
+   struct rtable *rt;
+
+   memset(fl, 0, sizeof(*fl));
+   fl->daddr = key->ipv4_dst;
+   fl->saddr = key->ipv4_src;
+   fl->flowi4_tos = RT_TOS(key->ipv4_tos);
+   fl->flowi4_mark = mark;
+   fl->flowi4_proto = protocol;
+
+   rt = ip_route_output_key(net, fl);
+   return rt;
+}
+
+
+/* Returns the least-significant 32 bits of a __be64. */
+static __be32 be64_get_low32(__be64 x)
+{
+#ifdef __BIG_ENDIAN
+   return (__force __be32)x;
+#else
+   return (__force __be32)((__force u64)x >> 32);
+#endif
+}
+
+static __be16 filter_tnl_flags(__be16 flags)
+{
+   return flags & (TUNNEL_CSUM | TUNNEL_KEY);
+}
+
+
+static struct sk_buff *__build_header(struct sk_buff *skb,
+ const struct ip_tunnel_info *tun_info,
+ int tunnel_hlen)
+{
+   struct tnl_ptk_info tpi;
+
+   skb = gre_handle_offloads(skb, !!(tun_info->key.tun_flags & 
TUNNEL_CSUM));
+   if (IS_ERR(skb))
+   return skb;
+
+   tpi.flags = filter_tnl_flags(tun_info->key.tun_flags);
+   tpi.proto = htons(ETH_P_TEB);
+   tpi.key = be64_get_low32(tun_info->key.tun_id);
+   tpi.seq = 0;
+   gre_build_header(skb, &tpi, tunnel_hlen);
+
+   return skb;
+}
+
+static netdev_tx_t gre_fb_xmit(struct sk_buff *skb,
+   struct net_device *dev)
+{
+   struct net *net = dev_net(dev);
+   struct ip_tunnel_info *tun_info;
+   const struct ip_tunnel_key *key;
+   struct flowi4 fl;
+   struct rtable *rt;
+   int min_headroom;
+   int tunnel_hlen;
+   __be16 df;
+   int err;
+
+   tun_info = skb_tunnel_info(skb, AF_INET);
+   

[RFC net-next 14/22] route: Extend flow representation with tunnel key

2015-07-10 Thread Thomas Graf
Add a new flowi_tunnel structure which is a subset of ip_tunnel_key to
allow routes to match on tunnel metadata. For now, the tunnel id is
added to flowi_tunnel which allows for routes to be bound to specific
virtual tunnels.

Signed-off-by: Thomas Graf 
---
 include/net/flow.h | 7 +++
 net/ipv4/route.c   | 6 ++
 2 files changed, 13 insertions(+)

diff --git a/include/net/flow.h b/include/net/flow.h
index 8109a15..c15fb5e 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -19,6 +19,10 @@
 
 #define LOOPBACK_IFINDEX   1
 
+struct flowi_tunnel {
+   __be64  tun_id;
+};
+
 struct flowi_common {
int flowic_oif;
int flowic_iif;
@@ -30,6 +34,7 @@ struct flowi_common {
 #define FLOWI_FLAG_ANYSRC  0x01
 #define FLOWI_FLAG_KNOWN_NH0x02
__u32   flowic_secid;
+   struct flowi_tunnel flowic_tun_key;
 };
 
 union flowi_uli {
@@ -66,6 +71,7 @@ struct flowi4 {
 #define flowi4_proto   __fl_common.flowic_proto
 #define flowi4_flags   __fl_common.flowic_flags
 #define flowi4_secid   __fl_common.flowic_secid
+#define flowi4_tun_key __fl_common.flowic_tun_key
 
/* (saddr,daddr) must be grouped, same order as in IP header */
__be32  saddr;
@@ -165,6 +171,7 @@ struct flowi {
 #define flowi_protou.__fl_common.flowic_proto
 #define flowi_flagsu.__fl_common.flowic_flags
 #define flowi_secidu.__fl_common.flowic_secid
+#define flowi_tun_key  u.__fl_common.flowic_tun_key
 } __attribute__((__aligned__(BITS_PER_LONG/8)));
 
 static inline struct flowi *flowi4_to_flowi(struct flowi4 *fl4)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index ea48880..bf84164c 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -91,6 +91,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -110,6 +111,7 @@
 #include 
 #endif
 #include 
+#include 
 
 #define RT_FL_TOS(oldflp4) \
((oldflp4)->flowi4_tos & (IPTOS_RT_MASK | RTO_ONLINK))
@@ -1674,6 +1676,7 @@ static int ip_route_input_slow(struct sk_buff *skb, 
__be32 daddr, __be32 saddr,
 {
struct fib_result res;
struct in_device *in_dev = __in_dev_get_rcu(dev);
+   struct ip_tunnel_info *tun_info;
struct flowi4   fl4;
unsigned intflags = 0;
u32 itag = 0;
@@ -1691,6 +1694,9 @@ static int ip_route_input_slow(struct sk_buff *skb, 
__be32 daddr, __be32 saddr,
   by fib_lookup.
 */
 
+   tun_info = skb_tunnel_info(skb);
+   if (tun_info && tun_info->mode == IP_TUNNEL_INFO_RX)
+   fl4.flowi4_tun_key.tun_id = tun_info->key.tun_id;
skb_dst_drop(skb);
 
if (ipv4_is_multicast(saddr) || ipv4_is_lbcast(saddr))
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   >