date:20150717

[net-next 08/14] dp83640: only report generic filters in ts_info

2015-07-17 Thread Jeff Kirsher

From: Jacob Keller jacob.e.kel...@intel.com

CC: Richard Cochran richardcoch...@gmail.com
Signed-off-by: Jacob Keller jacob.e.kel...@intel.com
Signed-off-by: Jeff Kirsher jeffrey.t.kirs...@intel.com
---
 drivers/net/phy/dp83640.c | 10 +-
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/drivers/net/phy/dp83640.c b/drivers/net/phy/dp83640.c
index 00cb41e..185b03c 100644
--- a/drivers/net/phy/dp83640.c
+++ b/drivers/net/phy/dp83640.c
@@ -1449,17 +1449,9 @@ static int dp83640_ts_info(struct phy_device *dev, struct ethtool_ts_info *info)
info->rx_filters =
(1 << HWTSTAMP_FILTER_NONE) |
(1 << HWTSTAMP_FILTER_PTP_V1_L4_EVENT) |
-   (1 << HWTSTAMP_FILTER_PTP_V1_L4_SYNC) |
-   (1 << HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ) |
(1 << HWTSTAMP_FILTER_PTP_V2_L4_EVENT) |
-   (1 << HWTSTAMP_FILTER_PTP_V2_L4_SYNC) |
-   (1 << HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ) |
(1 << HWTSTAMP_FILTER_PTP_V2_L2_EVENT) |
-   (1 << HWTSTAMP_FILTER_PTP_V2_L2_SYNC) |
-   (1 << HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ) |
-   (1 << HWTSTAMP_FILTER_PTP_V2_EVENT) |
-   (1 << HWTSTAMP_FILTER_PTP_V2_SYNC) |
-   (1 << HWTSTAMP_FILTER_PTP_V2_DELAY_REQ);
+   (1 << HWTSTAMP_FILTER_PTP_V2_EVENT);
return 0;
 }
 
-- 
2.4.3

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[net-next 04/14] i40e: only report generic filters in get_ts_info

2015-07-17 Thread Jeff Kirsher

From: Jacob Keller jacob.e.kel...@intel.com

Signed-off-by: Jacob Keller jacob.e.kel...@intel.com
Tested-by: Jim Young james.m.yo...@intel.com
Signed-off-by: Jeff Kirsher jeffrey.t.kirs...@intel.com
---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 13 ++---
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 0b68f61..f2075d5 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -1467,17 +1467,8 @@ static int i40e_get_ts_info(struct net_device *dev,
info->tx_types = (1 << HWTSTAMP_TX_OFF) | (1 << HWTSTAMP_TX_ON);
 
info->rx_filters = (1 << HWTSTAMP_FILTER_NONE) |
-  (1 << HWTSTAMP_FILTER_PTP_V1_L4_SYNC) |
-  (1 << HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ) |
-  (1 << HWTSTAMP_FILTER_PTP_V2_EVENT) |
-  (1 << HWTSTAMP_FILTER_PTP_V2_L2_EVENT) |
-  (1 << HWTSTAMP_FILTER_PTP_V2_L4_EVENT) |
-  (1 << HWTSTAMP_FILTER_PTP_V2_SYNC) |
-  (1 << HWTSTAMP_FILTER_PTP_V2_L2_SYNC) |
-  (1 << HWTSTAMP_FILTER_PTP_V2_L4_SYNC) |
-  (1 << HWTSTAMP_FILTER_PTP_V2_DELAY_REQ) |
-  (1 << HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ) |
-  (1 << HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ);
+  (1 << HWTSTAMP_FILTER_PTP_V1_L4_EVENT) |
+  (1 << HWTSTAMP_FILTER_PTP_V2_EVENT);
 
return 0;
 }
-- 
2.4.3


[net-next 00/14][pull request] Intel Wired LAN Driver Updates 2015-07-17

2015-07-17 Thread Jeff Kirsher

This series contains updates to igb, ixgbe, ixgbevf, i40e, bnx2x,
freescale, siena and dp83640.

Jacob provides several patches to clarify the intended way to implement
both SIOCSHWTSTAMP and ethtool's get_ts_info().  It is okay to support
the specific filters in SIOCSHWTSTAMP by upscaling them to the generic
filters.

Alex Duyck provides an igb patch that pulls the time stamp from the fragment
before it is added to the skb.  This avoids an issue in which the fragment
could end up shorter than IGB_RX_HDR_LEN because the time stamp was pulled
after the copybreak check.  He also provides an ixgbevf patch that folds the
ixgbevf_pull_tail() call into ixgbevf_add_rx_frag(), so the fragment no
longer has to be modified after it is added to the skb.

Fan provides patches for ixgbe/ixgbevf to set the receive hash type
based on receive descriptor RSS type.

Todd provides a fix for igb where link on any media other than copper was
never detected, because the check looked at the wrong PHY page (the page
was switched back before the link-check function ran).

The following are changes since commit c15df306fc79c672573f1cc2ebdfcb32d7e68780:
  ipv6: Remove unused arguments for __ipv6_dev_get_saddr().
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue master

Alexander Duyck (2):
  igb: Pull timestamp from fragment before adding it to skb
  ixgbevf: fold ixgbevf_pull_tail into ixgbevf_add_rx_frag

Fan Du (3):
  ixgbe: Specify Rx hash type WRT Rx desc RSS type
  ixgbevf: Set Rx hash type for ingress packets
  ixgbe: Don't report flow director filter's status

Jacob Keller (8):
  clarify implementation of ethtool's get_ts_info op
  freescale: remove incorrect copied comment
  bnx2x: only report most generic filters in get_ts_info
  i40e: only report generic filters in get_ts_info
  igb: only report generic filters in get_ts_info
  ixgbe: only report generic filters in get_ts_info
  siena: only report generic filters in get_ts_info
  dp83640: only report generic filters in ts_info

Todd Fujinaka (1):
  igb: Fix i354 88E1112 PHY on RCC boards using AutoMediaDetect

 Documentation/networking/timestamping.txt  |  7 ++
 .../net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c| 11 +--
 drivers/net/ethernet/freescale/fec_ptp.c   |  6 --
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 13 +--
 drivers/net/ethernet/intel/igb/e1000_82575.c   | 18 +++--
 drivers/net/ethernet/intel/igb/igb_ethtool.c   |  4 -
 drivers/net/ethernet/intel/igb/igb_main.c  | 94 ++
 drivers/net/ethernet/intel/ixgbe/ixgbe_82599.c |  2 -
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c   |  8 --
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  | 25 +-
 drivers/net/ethernet/intel/ixgbevf/defines.h   | 12 +++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  | 93 +++--
 drivers/net/ethernet/sfc/siena.c   |  6 +-
 drivers/net/phy/dp83640.c  | 10 +--
 include/uapi/linux/ethtool.h   |  5 ++
 15 files changed, 134 insertions(+), 180 deletions(-)

-- 
2.4.3


[net-next 12/14] ixgbevf: Set Rx hash type for ingress packets

2015-07-17 Thread Jeff Kirsher

From: Fan Du fan...@intel.com

Set hash type for ingress packets according to NIC
advanced receive descriptors RSS type part.
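
As a rough illustration of the classification described above, the
standalone sketch below reproduces the bitmask test outside the driver.
The RSSTYPE_* codes are copied from the defines added by this patch;
RSSTYPE_MASK (0xF), enum hash_type and classify_rss() are assumptions made
up for this example and are not the driver's own definitions.

```c
#include <assert.h>
#include <stdint.h>

/* RSS type codes, mirroring the IXGBE_RXDADV_RSSTYPE_* defines in this patch. */
#define RSSTYPE_NONE        0x0
#define RSSTYPE_IPV4_TCP    0x1
#define RSSTYPE_IPV4        0x2
#define RSSTYPE_IPV6_TCP    0x3
#define RSSTYPE_IPV6_EX     0x4
#define RSSTYPE_IPV6        0x5
#define RSSTYPE_IPV6_TCP_EX 0x6
#define RSSTYPE_IPV4_UDP    0x7
#define RSSTYPE_IPV6_UDP    0x8
#define RSSTYPE_IPV6_UDP_EX 0x9

/* Assumption: the RSS type sits in the low 4 bits of pkt_info. */
#define RSSTYPE_MASK 0xF

/* One bit per RSS type whose hash covers the L4 ports. */
#define RSS_L4_TYPES_MASK \
	((1ul << RSSTYPE_IPV4_TCP) | \
	 (1ul << RSSTYPE_IPV4_UDP) | \
	 (1ul << RSSTYPE_IPV6_TCP) | \
	 (1ul << RSSTYPE_IPV6_UDP))

/* Hypothetical stand-in for the PKT_HASH_TYPE_* values. */
enum hash_type { HASH_NONE, HASH_L3, HASH_L4 };

static enum hash_type classify_rss(uint16_t pkt_info)
{
	uint16_t rss_type = pkt_info & RSSTYPE_MASK;

	/* the shift below must stay inside the width of unsigned long */
	assert(rss_type < 8 * sizeof(unsigned long));

	if (!rss_type)
		return HASH_NONE;	/* no hash reported, nothing to set */

	return (RSS_L4_TYPES_MASK & (1ul << rss_type)) ? HASH_L4 : HASH_L3;
}
```

With the mask as written in the patch, the *_EX types are not part of the
L4 set, so classify_rss(RSSTYPE_IPV6_TCP_EX) yields HASH_L3 in this sketch.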

Signed-off-by: Fan Du fan...@intel.com
Tested-by: Phil Schmitt phillip.j.schm...@intel.com
Signed-off-by: Jeff Kirsher jeffrey.t.kirs...@intel.com
---
 drivers/net/ethernet/intel/ixgbevf/defines.h  | 12 ++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 27 +++
 2 files changed, 39 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbevf/defines.h b/drivers/net/ethernet/intel/ixgbevf/defines.h
index 770e21a..5843458 100644
--- a/drivers/net/ethernet/intel/ixgbevf/defines.h
+++ b/drivers/net/ethernet/intel/ixgbevf/defines.h
@@ -161,6 +161,18 @@ typedef u32 ixgbe_link_speed;
 #define IXGBE_RXDADV_SPLITHEADER_EN 0x1000
 #define IXGBE_RXDADV_SPH   0x8000
 
+/* RSS Hash results */
+#define IXGBE_RXDADV_RSSTYPE_NONE  0x0000
+#define IXGBE_RXDADV_RSSTYPE_IPV4_TCP  0x0001
+#define IXGBE_RXDADV_RSSTYPE_IPV4  0x0002
+#define IXGBE_RXDADV_RSSTYPE_IPV6_TCP  0x0003
+#define IXGBE_RXDADV_RSSTYPE_IPV6_EX   0x0004
+#define IXGBE_RXDADV_RSSTYPE_IPV6  0x0005
+#define IXGBE_RXDADV_RSSTYPE_IPV6_TCP_EX   0x0006
+#define IXGBE_RXDADV_RSSTYPE_IPV4_UDP  0x0007
+#define IXGBE_RXDADV_RSSTYPE_IPV6_UDP  0x0008
+#define IXGBE_RXDADV_RSSTYPE_IPV6_UDP_EX   0x0009
+
 #define IXGBE_RXD_ERR_FRAME_ERR_MASK ( \
  IXGBE_RXD_ERR_CE |  \
  IXGBE_RXD_ERR_LE |  \
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index acfa051..b2c86f1 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -457,6 +457,32 @@ static void ixgbevf_rx_skb(struct ixgbevf_q_vector *q_vector,
napi_gro_receive(&q_vector->napi, skb);
 }
 
+#define IXGBE_RSS_L4_TYPES_MASK \
+   ((1ul << IXGBE_RXDADV_RSSTYPE_IPV4_TCP) | \
+(1ul << IXGBE_RXDADV_RSSTYPE_IPV4_UDP) | \
+(1ul << IXGBE_RXDADV_RSSTYPE_IPV6_TCP) | \
+(1ul << IXGBE_RXDADV_RSSTYPE_IPV6_UDP))
+
+static inline void ixgbevf_rx_hash(struct ixgbevf_ring *ring,
+  union ixgbe_adv_rx_desc *rx_desc,
+  struct sk_buff *skb)
+{
+   u16 rss_type;
+
+   if (!(ring->netdev->features & NETIF_F_RXHASH))
+   return;
+
+   rss_type = le16_to_cpu(rx_desc->wb.lower.lo_dword.hs_rss.pkt_info) &
+  IXGBE_RXDADV_RSSTYPE_MASK;
+
+   if (!rss_type)
+   return;
+
+   skb_set_hash(skb, le32_to_cpu(rx_desc->wb.lower.hi_dword.rss),
+(IXGBE_RSS_L4_TYPES_MASK & (1ul << rss_type)) ?
+PKT_HASH_TYPE_L4 : PKT_HASH_TYPE_L3);
+}
+
 /**
  * ixgbevf_rx_checksum - indicate in skb if hw indicated a good cksum
  * @ring: structure containig ring specific data
@@ -506,6 +532,7 @@ static void ixgbevf_process_skb_fields(struct ixgbevf_ring *rx_ring,
   union ixgbe_adv_rx_desc *rx_desc,
   struct sk_buff *skb)
 {
+   ixgbevf_rx_hash(rx_ring, rx_desc, skb);
ixgbevf_rx_checksum(rx_ring, rx_desc, skb);
 
if (ixgbevf_test_staterr(rx_desc, IXGBE_RXD_STAT_VP)) {
-- 
2.4.3


[net-next 09/14] igb: Pull timestamp from fragment before adding it to skb

2015-07-17 Thread Jeff Kirsher

From: Alexander Duyck alexander.h.du...@redhat.com

This change pulls the timestamp from the fragment before we add it to the
skb.  This avoids an issue in which the fragment could end up shorter than
IGB_RX_HDR_LEN because the timestamp was pulled after the copybreak check.

While making this change I realized we could also pull the rest of the
igb_pull_tail function into igb_add_rx_frag since in the case of igb,
unlike ixgbe, we are able to unmap the entire buffer before calling
add_rx_frag so merging the two allows for sharing of code between the two
merged functions.
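
To make the ordering concrete, here is a minimal standalone sketch of the
decision described above: strip the timestamp header first, and only then
compare the remaining size against the header length. The constants
TS_HDR_LEN and RX_HDR_LEN are assumed stand-ins for IGB_TS_HDR_LEN and
IGB_RX_HDR_LEN, and struct rx_plan and plan_first_buffer() are hypothetical
names for this illustration, not code from the driver.

```c
#include <assert.h>
#include <stdbool.h>

#define TS_HDR_LEN 16	/* assumed stand-in for IGB_TS_HDR_LEN */
#define RX_HDR_LEN 256	/* assumed stand-in for IGB_RX_HDR_LEN */

struct rx_plan {
	unsigned int offset;	/* where packet data starts in the buffer */
	unsigned int size;	/* packet bytes left once the timestamp is gone */
	bool copy_whole;	/* true: copy everything into the linear area;
				 * false: pull headers, attach the rest as a frag */
};

static struct rx_plan plan_first_buffer(unsigned int size, bool has_timestamp)
{
	struct rx_plan plan = { 0, size, false };

	/* Step 1: account for the timestamp header before any size check. */
	if (has_timestamp) {
		assert(size >= TS_HDR_LEN);
		plan.offset += TS_HDR_LEN;
		plan.size -= TS_HDR_LEN;
	}

	/* Step 2: the copybreak decision now sees the real payload size. */
	plan.copy_whole = plan.size <= RX_HDR_LEN;

	return plan;
}
```

Under these assumed sizes, a 260-byte buffer that carries a timestamp
leaves 244 bytes of packet data and is copied whole, while the same 260
bytes without a timestamp exceed the header length and take the fragment
path.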

Reported-by: Cong Wang xiyou.wangc...@gmail.com
Signed-off-by: Alexander Duyck alexander.h.du...@redhat.com
Tested-by: Aaron Brown aaron.f.br...@intel.com
Signed-off-by: Jeff Kirsher jeffrey.t.kirs...@intel.com
---
 drivers/net/ethernet/intel/igb/igb_main.c | 94 ---
 1 file changed, 25 insertions(+), 69 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 2f70a9b..fc7729e 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6621,22 +6621,25 @@ static bool igb_add_rx_frag(struct igb_ring *rx_ring,
struct sk_buff *skb)
 {
struct page *page = rx_buffer->page;
+   unsigned char *va = page_address(page) + rx_buffer->page_offset;
unsigned int size = le16_to_cpu(rx_desc->wb.upper.length);
 #if (PAGE_SIZE < 8192)
unsigned int truesize = IGB_RX_BUFSZ;
 #else
-   unsigned int truesize = ALIGN(size, L1_CACHE_BYTES);
+   unsigned int truesize = SKB_DATA_ALIGN(size);
 #endif
+   unsigned int pull_len;
 
-   if ((size <= IGB_RX_HDR_LEN) && !skb_is_nonlinear(skb)) {
-   unsigned char *va = page_address(page) + rx_buffer->page_offset;
+   if (unlikely(skb_is_nonlinear(skb)))
+   goto add_tail_frag;
 
-   if (igb_test_staterr(rx_desc, E1000_RXDADV_STAT_TSIP)) {
-   igb_ptp_rx_pktstamp(rx_ring->q_vector, va, skb);
-   va += IGB_TS_HDR_LEN;
-   size -= IGB_TS_HDR_LEN;
-   }
+   if (unlikely(igb_test_staterr(rx_desc, E1000_RXDADV_STAT_TSIP))) {
+   igb_ptp_rx_pktstamp(rx_ring->q_vector, va, skb);
+   va += IGB_TS_HDR_LEN;
+   size -= IGB_TS_HDR_LEN;
+   }
 
+   if (likely(size <= IGB_RX_HDR_LEN)) {
memcpy(__skb_put(skb, size), va, ALIGN(size, sizeof(long)));
 
/* page is not reserved, we can reuse buffer as-is */
@@ -6648,8 +6651,21 @@ static bool igb_add_rx_frag(struct igb_ring *rx_ring,
return false;
}
 
+   /* we need the header to contain the greater of either ETH_HLEN or
+* 60 bytes if the skb->len is less than 60 for skb_pad.
+*/
+   pull_len = eth_get_headlen(va, IGB_RX_HDR_LEN);
+
+   /* align pull length to size of long to optimize memcpy performance */
+   memcpy(__skb_put(skb, pull_len), va, ALIGN(pull_len, sizeof(long)));
+
+   /* update all of the pointers */
+   va += pull_len;
+   size -= pull_len;
+
+add_tail_frag:
skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
-   rx_buffer->page_offset, size, truesize);
+   (unsigned long)va & ~PAGE_MASK, size, truesize);
 
return igb_can_reuse_rx_page(rx_buffer, page, truesize);
 }
@@ -6791,62 +6807,6 @@ static bool igb_is_non_eop(struct igb_ring *rx_ring,
 }
 
 /**
- *  igb_pull_tail - igb specific version of skb_pull_tail
- *  @rx_ring: rx descriptor ring packet is being transacted on
- *  @rx_desc: pointer to the EOP Rx descriptor
- *  @skb: pointer to current skb being adjusted
- *
- *  This function is an igb specific version of __pskb_pull_tail.  The
- *  main difference between this version and the original function is that
- *  this function can make several assumptions about the state of things
- *  that allow for significant optimizations versus the standard function.
- *  As a result we can do things like drop a frag and maintain an accurate
- *  truesize for the skb.
- */
-static void igb_pull_tail(struct igb_ring *rx_ring,
- union e1000_adv_rx_desc *rx_desc,
- struct sk_buff *skb)
-{
-   struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[0];
-   unsigned char *va;
-   unsigned int pull_len;
-
-   /* it is valid to use page_address instead of kmap since we are
-* working with pages allocated out of the lomem pool per
-* alloc_page(GFP_ATOMIC)
-*/
-   va = skb_frag_address(frag);
-
-   if (igb_test_staterr(rx_desc, E1000_RXDADV_STAT_TSIP)) {
-   /* retrieve timestamp from buffer */
-   igb_ptp_rx_pktstamp(rx_ring->q_vector, va, skb);
-
-   /* update pointers to remove timestamp header */
-

[net-next 11/14] ixgbe: Specify Rx hash type WRT Rx desc RSS type

2015-07-17 Thread Jeff Kirsher

From: Fan Du fan...@intel.com

RSS can take the L4 src/dst ports into account as hash inputs, so the
Rx hash type set on an ingress skb should reflect the RSS type the
hardware actually used.

Signed-off-by: Fan Du fan...@intel.com
Tested-by: Phil Schmitt phillip.j.schm...@intel.com
Signed-off-by: Jeff Kirsher jeffrey.t.kirs...@intel.com
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 25 +
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 9aa6104..3e6a931 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1360,14 +1360,31 @@ static int __ixgbe_notify_dca(struct device *dev, void *data)
 }
 
 #endif /* CONFIG_IXGBE_DCA */
+
+#define IXGBE_RSS_L4_TYPES_MASK \
+   ((1ul << IXGBE_RXDADV_RSSTYPE_IPV4_TCP) | \
+(1ul << IXGBE_RXDADV_RSSTYPE_IPV4_UDP) | \
+(1ul << IXGBE_RXDADV_RSSTYPE_IPV6_TCP) | \
+(1ul << IXGBE_RXDADV_RSSTYPE_IPV6_UDP))
+
 static inline void ixgbe_rx_hash(struct ixgbe_ring *ring,
 union ixgbe_adv_rx_desc *rx_desc,
 struct sk_buff *skb)
 {
-   if (ring->netdev->features & NETIF_F_RXHASH)
-   skb_set_hash(skb,
-le32_to_cpu(rx_desc->wb.lower.hi_dword.rss),
-PKT_HASH_TYPE_L3);
+   u16 rss_type;
+
+   if (!(ring->netdev->features & NETIF_F_RXHASH))
+   return;
+
+   rss_type = le16_to_cpu(rx_desc->wb.lower.lo_dword.hs_rss.pkt_info) &
+  IXGBE_RXDADV_RSSTYPE_MASK;
+
+   if (!rss_type)
+   return;
+
+   skb_set_hash(skb, le32_to_cpu(rx_desc->wb.lower.hi_dword.rss),
+(IXGBE_RSS_L4_TYPES_MASK & (1ul << rss_type)) ?
+PKT_HASH_TYPE_L4 : PKT_HASH_TYPE_L3);
 }
 
 #ifdef IXGBE_FCOE
-- 
2.4.3


[net-next 01/14] clarify implementation of ethtool's get_ts_info op

2015-07-17 Thread Jeff Kirsher

From: Jacob Keller jacob.e.kel...@intel.com

This patch adds some clarification about the intended way to implement
both SIOCSHWTSTAMP and ethtool's get_ts_info. The HWTSTAMP API has
several Rx filters which are very specific, as well as more general
filters. The specific filters really only exist to support some broken
hardware which can't fully implement the generic filters. This patch
adds clarification that it is okay to support the specific filters in
SIOCSHWTSTAMP by upscaling them to the generic filters. In addition,
update the header for ethtool_ts_info to specify that drivers ought to
only report the filters they support without upscaling in this manner.
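
The split between the two interfaces can be sketched as follows, for a
hypothetical device whose hardware only implements the V1 L4 event and V2
event filters. The enum below is a local stand-in that follows the order of
the HWTSTAMP_FILTER_* values; a real driver would use the definitions from
linux/net_tstamp.h. The function names and the -1 return (standing in for
an error such as -ERANGE) are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the HWTSTAMP_FILTER_* values (assumed same order). */
enum rx_filter {
	FILTER_NONE,
	FILTER_ALL,
	FILTER_SOME,
	FILTER_PTP_V1_L4_EVENT,
	FILTER_PTP_V1_L4_SYNC,
	FILTER_PTP_V1_L4_DELAY_REQ,
	FILTER_PTP_V2_L4_EVENT,
	FILTER_PTP_V2_L4_SYNC,
	FILTER_PTP_V2_L4_DELAY_REQ,
	FILTER_PTP_V2_L2_EVENT,
	FILTER_PTP_V2_L2_SYNC,
	FILTER_PTP_V2_L2_DELAY_REQ,
	FILTER_PTP_V2_EVENT,
	FILTER_PTP_V2_SYNC,
	FILTER_PTP_V2_DELAY_REQ,
};

static uint32_t filter_bit(enum rx_filter f)
{
	assert((unsigned int)f < 32);	/* keep the shift within 32 bits */
	return UINT32_C(1) << f;
}

/* get_ts_info side: report only what the hardware implements directly. */
static uint32_t reported_rx_filters(void)
{
	return filter_bit(FILTER_NONE) |
	       filter_bit(FILTER_PTP_V1_L4_EVENT) |
	       filter_bit(FILTER_PTP_V2_EVENT);
}

/* SIOCSHWTSTAMP side: accept specific requests by upscaling them to the
 * generic filter that covers them; returns the filter actually programmed,
 * or -1 if the request cannot be honoured.
 */
static int upscale_rx_filter(enum rx_filter requested)
{
	switch (requested) {
	case FILTER_NONE:
		return FILTER_NONE;
	case FILTER_PTP_V1_L4_EVENT:
	case FILTER_PTP_V1_L4_SYNC:
	case FILTER_PTP_V1_L4_DELAY_REQ:
		return FILTER_PTP_V1_L4_EVENT;
	case FILTER_PTP_V2_L4_EVENT:
	case FILTER_PTP_V2_L4_SYNC:
	case FILTER_PTP_V2_L4_DELAY_REQ:
	case FILTER_PTP_V2_L2_EVENT:
	case FILTER_PTP_V2_L2_SYNC:
	case FILTER_PTP_V2_L2_DELAY_REQ:
	case FILTER_PTP_V2_EVENT:
	case FILTER_PTP_V2_SYNC:
	case FILTER_PTP_V2_DELAY_REQ:
		return FILTER_PTP_V2_EVENT;
	default:
		return -1;
	}
}
```

In this sketch a request for FILTER_PTP_V2_L2_SYNC is still accepted and
written back as FILTER_PTP_V2_EVENT, yet the bit for FILTER_PTP_V2_L2_SYNC
never appears in the reported mask.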

Signed-off-by: Jacob Keller jacob.e.kel...@intel.com
Tested-by: Phil Schmitt phillip.j.schm...@intel.com
Reviewed-by: Aaron Brown aaron.f.br...@intel.com
Signed-off-by: Jeff Kirsher jeffrey.t.kirs...@intel.com
---
 Documentation/networking/timestamping.txt | 7 +++
 include/uapi/linux/ethtool.h  | 5 +
 2 files changed, 12 insertions(+)

diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt
index 5f09226..a977339 100644
--- a/Documentation/networking/timestamping.txt
+++ b/Documentation/networking/timestamping.txt
@@ -359,6 +359,13 @@ the requested fine-grained filtering for incoming packets is not
 supported, the driver may time stamp more than just the requested types
 of packets.
 
+Drivers are free to use a more permissive configuration than the requested
+configuration. It is expected that drivers should only implement directly the
+most generic mode that can be supported. For example if the hardware can
+support HWTSTAMP_FILTER_PTP_V2_EVENT, then it should generally always upscale
+HWTSTAMP_FILTER_PTP_V2_L2_SYNC, and so forth, as HWTSTAMP_FILTER_PTP_V2_EVENT
+is more generic (and more useful to applications).
+
 A driver which supports hardware time stamping shall update the struct
 with the actual, possibly more permissive configuration. If the
 requested packets cannot be time stamped, then nothing should be
diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index cd67aec..cd16291 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -1093,6 +1093,11 @@ struct ethtool_sfeatures {
  * the 'hwtstamp_tx_types' and 'hwtstamp_rx_filters' enumeration values,
  * respectively.  For example, if the device supports HWTSTAMP_TX_ON,
  * then (1 << HWTSTAMP_TX_ON) in 'tx_types' will be set.
+ *
+ * Drivers should only report the filters they actually support without
+ * upscaling in the SIOCSHWTSTAMP ioctl. If the SIOCSHWTSTAMP request for
+ * HWTSTAMP_FILTER_PTP_V1_L4_SYNC is supported by upscaling it to
+ * HWTSTAMP_FILTER_PTP_V1_L4_EVENT, then report only the latter in this op.
  */
 struct ethtool_ts_info {
__u32   cmd;
-- 
2.4.3


[net-next 14/14] igb: Fix i354 88E1112 PHY on RCC boards using AutoMediaDetect

2015-07-17 Thread Jeff Kirsher

From: Todd Fujinaka todd.fujin...@intel.com

e1000_check_for_link_media_swap() checks PHY page 0 for copper and PHY
page 1 for other (fiber) link. The switch back from page 1 to page 0
happened too soon, before e1000_check_for_link_82575() is executed, and
link on fiber (other) was never detected. Check for link while still on
the proper PHY page.
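
The effect of the ordering can be shown with a small toy model. Everything
below is hypothetical: struct fake_phy and its helpers only model a PHY
whose link status is visible on the currently selected page, and none of it
is code from the igb driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy PHY: link status is only visible on the currently selected page. */
struct fake_phy {
	int page;		/* currently selected register page */
	bool link_up[2];	/* [0] copper link, [1] other (fiber) link */
};

static void select_page(struct fake_phy *phy, int page)
{
	assert(page == 0 || page == 1);
	phy->page = page;
}

/* Stand-in for the link check: reads whatever page is selected. */
static bool check_link(const struct fake_phy *phy)
{
	return phy->link_up[phy->page];
}

/* Ordering described as the bug: the page goes back to 0 too soon. */
static bool detect_link_reset_first(struct fake_phy *phy, bool other_port)
{
	select_page(phy, other_port ? 1 : 0);
	select_page(phy, 0);		/* reset page before checking */
	return check_link(phy);		/* always looks at page 0 */
}

/* Ordering after the fix: check on the proper page, then reset. */
static bool detect_link_check_first(struct fake_phy *phy, bool other_port)
{
	bool link;

	if (!other_port) {
		select_page(phy, 0);
		return check_link(phy);
	}

	select_page(phy, 1);
	link = check_link(phy);		/* still on page 1 */
	select_page(phy, 0);		/* now it is safe to reset */
	return link;
}
```

In this model, a PHY with link up only on the fiber side reports no link
with the first ordering and reports link with the second, which matches the
behaviour the commit message describes.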

Signed-off-by: Todd Fujinaka todd.fujin...@intel.com
Tested-by: Aaron Brown aaron.f.br...@intel.com
Signed-off-by: Jeff Kirsher jeffrey.t.kirs...@intel.com
---
 drivers/net/ethernet/intel/igb/e1000_82575.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_82575.c b/drivers/net/ethernet/intel/igb/e1000_82575.c
index b0182dd..d192569 100644
--- a/drivers/net/ethernet/intel/igb/e1000_82575.c
+++ b/drivers/net/ethernet/intel/igb/e1000_82575.c
@@ -139,10 +139,6 @@ static s32 igb_check_for_link_media_swap(struct e1000_hw *hw)
if (ret_val)
return ret_val;
 
-   /* reset page to 0 */
-   ret_val = phy->ops.write_reg(hw, E1000_M88E1112_PAGE_ADDR, 0);
-   if (ret_val)
-   return ret_val;
 
if (data & E1000_M88E1112_STATUS_LINK)
port = E1000_MEDIA_PORT_OTHER;
@@ -151,8 +147,20 @@ static s32 igb_check_for_link_media_swap(struct e1000_hw *hw)
if (port && (hw->dev_spec._82575.media_port != port)) {
hw->dev_spec._82575.media_port = port;
hw->dev_spec._82575.media_changed = true;
+   }
+
+   if (port == E1000_MEDIA_PORT_COPPER) {
+   /* reset page to 0 */
+   ret_val = phy->ops.write_reg(hw, E1000_M88E1112_PAGE_ADDR, 0);
+   if (ret_val)
+   return ret_val;
+   igb_check_for_link_82575(hw);
} else {
-   ret_val = igb_check_for_link_82575(hw);
+   igb_check_for_link_82575(hw);
+   /* reset page to 0 */
+   ret_val = phy->ops.write_reg(hw, E1000_M88E1112_PAGE_ADDR, 0);
+   if (ret_val)
+   return ret_val;
}
 
return 0;
-- 
2.4.3


[net-next 05/14] igb: only report generic filters in get_ts_info

2015-07-17 Thread Jeff Kirsher

From: Jacob Keller jacob.e.kel...@intel.com

Signed-off-by: Jacob Keller jacob.e.kel...@intel.com
Tested-by: Aaron Brown aaron.f.br...@intel.com
Signed-off-by: Jeff Kirsher jeffrey.t.kirs...@intel.com
---
 drivers/net/ethernet/intel/igb/igb_ethtool.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c b/drivers/net/ethernet/intel/igb/igb_ethtool.c
index d5673eb..109cad9 100644
--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
@@ -2396,10 +2396,6 @@ static int igb_get_ts_info(struct net_device *dev,
info->rx_filters |=
(1 << HWTSTAMP_FILTER_PTP_V1_L4_SYNC) |
(1 << HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ) |
-   (1 << HWTSTAMP_FILTER_PTP_V2_L2_SYNC) |
-   (1 << HWTSTAMP_FILTER_PTP_V2_L4_SYNC) |
-   (1 << HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ) |
-   (1 << HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ) |
(1 << HWTSTAMP_FILTER_PTP_V2_EVENT);
 
return 0;
-- 
2.4.3


[net-next 02/14] freescale: remove incorrect copied comment

2015-07-17 Thread Jeff Kirsher

From: Jacob Keller jacob.e.kel...@intel.com

The comment in question is word-for-word copied from ixgbe, and clearly
has no meaning in freescale's driver. (it even says 'return an error'
when the code clearly does not). Remove the comment as it is obviously
incorrect and not applicable to the code as it is today.

CC: Pantelis Antoniou pantelis.anton...@gmail.com
CC: Vitaly Bordug vbor...@ru.mvista.com
CC: linuxppc-...@lists.ozlabs.org
Signed-off-by: Jacob Keller jacob.e.kel...@intel.com
Signed-off-by: Jeff Kirsher jeffrey.t.kirs...@intel.com
---
 drivers/net/ethernet/freescale/fec_ptp.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec_ptp.c b/drivers/net/ethernet/freescale/fec_ptp.c
index a15663a..7a8386a 100644
--- a/drivers/net/ethernet/freescale/fec_ptp.c
+++ b/drivers/net/ethernet/freescale/fec_ptp.c
@@ -506,12 +506,6 @@ int fec_ptp_set(struct net_device *ndev, struct ifreq *ifr)
break;
 
default:
-   /*
-* register RXMTRL must be set in order to do V1 packets,
-* therefore it is not possible to time stamp both V1 Sync and
-* Delay_Req messages and hardware does not support
-* timestamping all packets => return error
-*/
fep->hwts_rx_en = 1;
config.rx_filter = HWTSTAMP_FILTER_ALL;
break;
-- 
2.4.3


[net-next 10/14] ixgbevf: fold ixgbevf_pull_tail into ixgbevf_add_rx_frag

2015-07-17 Thread Jeff Kirsher

From: Alexander Duyck alexander.h.du...@redhat.com

This change folds the ixgbevf_pull_tail call into ixgbevf_add_rx_frag.  The
advantage to doing this is that the fragment doesn't have to be modified
after it is added to the skb.

Signed-off-by: Alexander Duyck alexander.h.du...@redhat.com
Tested-by: Phil Schmitt phillip.j.schm...@intel.com
Signed-off-by: Jeff Kirsher jeffrey.t.kirs...@intel.com
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 66 +++
 1 file changed, 19 insertions(+), 47 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index e71cdde..acfa051 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -649,46 +649,6 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_ring *rx_ring,
 }
 
 /**
- * ixgbevf_pull_tail - ixgbevf specific version of skb_pull_tail
- * @rx_ring: rx descriptor ring packet is being transacted on
- * @skb: pointer to current skb being adjusted
- *
- * This function is an ixgbevf specific version of __pskb_pull_tail.  The
- * main difference between this version and the original function is that
- * this function can make several assumptions about the state of things
- * that allow for significant optimizations versus the standard function.
- * As a result we can do things like drop a frag and maintain an accurate
- * truesize for the skb.
- **/
-static void ixgbevf_pull_tail(struct ixgbevf_ring *rx_ring,
- struct sk_buff *skb)
-{
-   struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[0];
-   unsigned char *va;
-   unsigned int pull_len;
-
-   /* it is valid to use page_address instead of kmap since we are
-* working with pages allocated out of the lomem pool per
-* alloc_page(GFP_ATOMIC)
-*/
-   va = skb_frag_address(frag);
-
-   /* we need the header to contain the greater of either ETH_HLEN or
-* 60 bytes if the skb->len is less than 60 for skb_pad.
-*/
-   pull_len = eth_get_headlen(va, IXGBEVF_RX_HDR_SIZE);
-
-   /* align pull length to size of long to optimize memcpy performance */
-   skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));
-
-   /* update all of the pointers */
-   skb_frag_size_sub(frag, pull_len);
-   frag->page_offset += pull_len;
-   skb->data_len -= pull_len;
-   skb->tail += pull_len;
-}
-
-/**
  * ixgbevf_cleanup_headers - Correct corrupted or empty headers
  * @rx_ring: rx descriptor ring packet is being transacted on
  * @rx_desc: pointer to the EOP Rx descriptor
@@ -721,10 +681,6 @@ static bool ixgbevf_cleanup_headers(struct ixgbevf_ring *rx_ring,
}
}
 
-   /* place header in linear portion of buffer */
-   if (skb_is_nonlinear(skb))
-   ixgbevf_pull_tail(rx_ring, skb);
-
/* if eth_skb_pad returns an error the skb was freed */
if (eth_skb_pad(skb))
return true;
@@ -789,16 +745,19 @@ static bool ixgbevf_add_rx_frag(struct ixgbevf_ring *rx_ring,
struct sk_buff *skb)
 {
struct page *page = rx_buffer->page;
+   unsigned char *va = page_address(page) + rx_buffer->page_offset;
unsigned int size = le16_to_cpu(rx_desc->wb.upper.length);
 #if (PAGE_SIZE < 8192)
unsigned int truesize = IXGBEVF_RX_BUFSZ;
 #else
unsigned int truesize = ALIGN(size, L1_CACHE_BYTES);
 #endif
+   unsigned int pull_len;
 
-   if ((size <= IXGBEVF_RX_HDR_SIZE) && !skb_is_nonlinear(skb)) {
-   unsigned char *va = page_address(page) + rx_buffer->page_offset;
+   if (unlikely(skb_is_nonlinear(skb)))
+   goto add_tail_frag;
 
+   if (likely(size <= IXGBEVF_RX_HDR_SIZE)) {
memcpy(__skb_put(skb, size), va, ALIGN(size, sizeof(long)));
 
/* page is not reserved, we can reuse buffer as is */
@@ -810,8 +769,21 @@ static bool ixgbevf_add_rx_frag(struct ixgbevf_ring *rx_ring,
return false;
}
 
+   /* we need the header to contain the greater of either ETH_HLEN or
+* 60 bytes if the skb->len is less than 60 for skb_pad.
+*/
+   pull_len = eth_get_headlen(va, IXGBEVF_RX_HDR_SIZE);
+
+   /* align pull length to size of long to optimize memcpy performance */
+   memcpy(__skb_put(skb, pull_len), va, ALIGN(pull_len, sizeof(long)));
+
+   /* update all of the pointers */
+   va += pull_len;
+   size -= pull_len;
+
+add_tail_frag:
skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
-   rx_buffer->page_offset, size, truesize);
+   (unsigned long)va & ~PAGE_MASK, size, truesize);
 
/* avoid re-using remote pages */
if (unlikely(ixgbevf_page_is_reserved(page)))
-- 
2.4.3


[net-next 07/14] siena: only report generic filters in get_ts_info

2015-07-17 Thread Jeff Kirsher

From: Jacob Keller jacob.e.kel...@intel.com

CC: Solarflare linux maintainers linux-net-driv...@solarflare.com
CC: Shradha Shah ss...@solarflare.com
Signed-off-by: Jacob Keller jacob.e.kel...@intel.com
Signed-off-by: Jeff Kirsher jeffrey.t.kirs...@intel.com
---
 drivers/net/ethernet/sfc/siena.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/sfc/siena.c b/drivers/net/ethernet/sfc/siena.c
index b323b91..b2f886d 100644
--- a/drivers/net/ethernet/sfc/siena.c
+++ b/drivers/net/ethernet/sfc/siena.c
@@ -1042,9 +1042,5 @@ const struct efx_nic_type siena_a0_nic_type = {
.max_rx_ip_filters = FR_BZ_RX_FILTER_TBL0_ROWS,
.hwtstamp_filters = (1 << HWTSTAMP_FILTER_NONE |
 1 << HWTSTAMP_FILTER_PTP_V1_L4_EVENT |
-1 << HWTSTAMP_FILTER_PTP_V1_L4_SYNC |
-1 << HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ |
-1 << HWTSTAMP_FILTER_PTP_V2_L4_EVENT |
-1 << HWTSTAMP_FILTER_PTP_V2_L4_SYNC |
-1 << HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ),
+1 << HWTSTAMP_FILTER_PTP_V2_L4_EVENT),
 };
-- 
2.4.3


[net-next 06/14] ixgbe: only report generic filters in get_ts_info

2015-07-17 Thread Jeff Kirsher

From: Jacob Keller jacob.e.kel...@intel.com

Signed-off-by: Jacob Keller jacob.e.kel...@intel.com
Tested-by: Phil Schmitt phillip.j.schm...@intel.com
Signed-off-by: Jeff Kirsher jeffrey.t.kirs...@intel.com
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c | 8 
 1 file changed, 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index ec7b232..f7aeb56 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -2938,14 +2938,6 @@ static int ixgbe_get_ts_info(struct net_device *dev,
(1 << HWTSTAMP_FILTER_NONE) |
(1 << HWTSTAMP_FILTER_PTP_V1_L4_SYNC) |
(1 << HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ) |
-   (1 << HWTSTAMP_FILTER_PTP_V2_L2_EVENT) |
-   (1 << HWTSTAMP_FILTER_PTP_V2_L4_EVENT) |
-   (1 << HWTSTAMP_FILTER_PTP_V2_SYNC) |
-   (1 << HWTSTAMP_FILTER_PTP_V2_L2_SYNC) |
-   (1 << HWTSTAMP_FILTER_PTP_V2_L4_SYNC) |
-   (1 << HWTSTAMP_FILTER_PTP_V2_DELAY_REQ) |
-   (1 << HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ) |
-   (1 << HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ) |
(1 << HWTSTAMP_FILTER_PTP_V2_EVENT);
break;
default:
-- 
2.4.3


[net-next 13/14] ixgbe: Don't report flow director filter's status

2015-07-17 Thread Jeff Kirsher

From: Fan Du fan...@intel.com

I want to disable this for two reasons:
1. Nothing actually checks the reported status (Alexander Duyck).
2. The hash values that could be reported to the stack differ in width:
   RSS - 32-bit hash value
   Perfect match fdir filter - 13-bit hash value
   Hash-based fdir filter - 31-bit hash value

   The fdir filter might also hash on masked tuples for the IP address,
   so its hash is still not desirable to use.

So for now, just stick to the RSS 32-bit hash value.

Signed-off-by: Fan Du fan...@intel.com
Suggested-by: Alexander Duyck alexander.h.du...@redhat.com
Tested-by: Phil Schmitt phillip.j.schm...@intel.com
Signed-off-by: Jeff Kirsher jeffrey.t.kirs...@intel.com
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_82599.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_82599.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_82599.c
index 6b87d96..b1e364d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_82599.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_82599.c
@@ -1394,14 +1394,12 @@ s32 ixgbe_init_fdir_perfect_82599(struct ixgbe_hw *hw, u32 fdirctrl)
 	/*
 	 * Continue setup of fdirctrl register bits:
 	 *  Turn perfect match filtering on
-	 *  Report hash in RSS field of Rx wb descriptor
 	 *  Initialize the drop queue
 	 *  Move the flexible bytes to use the ethertype - shift 6 words
 	 *  Set the maximum length per hash bucket to 0xA filters
 	 *  Send interrupt when 64 (0x4 * 16) filters are left
 	 */
 	fdirctrl |= IXGBE_FDIRCTRL_PERFECT_MATCH |
-		    IXGBE_FDIRCTRL_REPORT_STATUS |
 		    (IXGBE_FDIR_DROP_QUEUE << IXGBE_FDIRCTRL_DROP_Q_SHIFT) |
 		    (0x6 << IXGBE_FDIRCTRL_FLEX_SHIFT) |
 		    (0xA << IXGBE_FDIRCTRL_MAX_LENGTH_SHIFT) |
-- 
2.4.3

[net-next 03/14] bnx2x: only report most generic filters in get_ts_info

2015-07-17 Thread Jeff Kirsher

From: Jacob Keller jacob.e.kel...@intel.com

CC: Ariel Elior ariel.el...@qlogic.com
Signed-off-by: Jacob Keller jacob.e.kel...@intel.com
Signed-off-by: Jeff Kirsher jeffrey.t.kirs...@intel.com
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c | 11 +--
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
index 76b9052..c783b57 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
@@ -3562,17 +3562,8 @@ static int bnx2x_get_ts_info(struct net_device *dev,
 
 	info->rx_filters = (1 << HWTSTAMP_FILTER_NONE) |
 			   (1 << HWTSTAMP_FILTER_PTP_V1_L4_EVENT) |
-			   (1 << HWTSTAMP_FILTER_PTP_V1_L4_SYNC) |
-			   (1 << HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ) |
 			   (1 << HWTSTAMP_FILTER_PTP_V2_L4_EVENT) |
-			   (1 << HWTSTAMP_FILTER_PTP_V2_L4_SYNC) |
-			   (1 << HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ) |
-			   (1 << HWTSTAMP_FILTER_PTP_V2_L2_EVENT) |
-			   (1 << HWTSTAMP_FILTER_PTP_V2_L2_SYNC) |
-			   (1 << HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ) |
-			   (1 << HWTSTAMP_FILTER_PTP_V2_EVENT) |
-			   (1 << HWTSTAMP_FILTER_PTP_V2_SYNC) |
-			   (1 << HWTSTAMP_FILTER_PTP_V2_DELAY_REQ);
+			   (1 << HWTSTAMP_FILTER_PTP_V2_EVENT);
 
 	info->tx_types = (1 << HWTSTAMP_TX_OFF)|(1 << HWTSTAMP_TX_ON);
 
-- 
2.4.3

Re: [PATCH net-next v2 1/5] net: don't reforward packets already forwarded by offload device

2015-07-17 Thread Nicolas Dichtel


Le 16/07/2015 10:04, sfel...@gmail.com a écrit :

From: Scott Feldman sfel...@gmail.com

Just before queuing skb for xmit on port, check if skb has been marked by
switchdev port driver as already forwarded by device.  If so, drop skb.  A
non-zero skb->offload_fwd_mark field is set by the switchdev port
driver/device on ingress to indicate the skb has already been forwarded by
the device to egress ports with matching dev->skb_mark.  The switchdev port
driver would assign a non-zero dev->skb_mark for each device port netdev
during registration, for example.
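
The check described above can be sketched in plain userspace C. This is only
an illustration under stated assumptions: struct sketch_skb, struct
sketch_netdev and already_forwarded() are hypothetical stand-ins, not kernel
definitions, and the field name offload_fwd_mark follows the diff below
rather than the dev->skb_mark wording of the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel structures; only the mark matters. */
struct sketch_netdev {
	uint32_t offload_fwd_mark;	/* assigned by the port driver, 0 = unset */
};

struct sketch_skb {
	uint32_t offload_fwd_mark;	/* set on ingress, 0 = not forwarded by device */
};

/*
 * Returns true when the software path should drop the packet instead of
 * transmitting it again: the skb carries a non-zero mark and that mark
 * equals the mark of the egress port.
 */
static bool already_forwarded(const struct sketch_skb *skb,
			      const struct sketch_netdev *dev)
{
	assert(skb && dev);
	return skb->offload_fwd_mark != 0 &&
	       skb->offload_fwd_mark == dev->offload_fwd_mark;
}
```

An unmarked skb (mark 0) is never dropped by this test, so ports that do not
take part in offload keep their normal transmit path.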

Signed-off-by: Scott Feldman sfel...@gmail.com
---
  include/linux/netdevice.h |6 ++
  include/linux/skbuff.h|   11 ++-
  net/core/dev.c|   10 ++
  3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 45cfd79..8364f29 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1456,6 +1456,8 @@ enum netdev_priv_flags {
   *
   *@xps_maps:  XXX: need comments on this one
   *
+ * @offload_fwd_mark:  Offload device fwding mark
+ *
   *@trans_start:   Time (in jiffies) of last Tx
   *@watchdog_timeo:Represents the timeout that is used by
   *the watchdog ( see dev_watchdog() )
@@ -1697,6 +1699,10 @@ struct net_device {
struct xps_dev_maps __rcu *xps_maps;
  #endif

+#ifdef CONFIG_NET_SWITCHDEV
+   u32 offload_fwd_mark;
+#endif
+
/* These may be needed for future network-power-down code. */

/*
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d6cdd6e..2edcf50 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -506,6 +506,7 @@ static inline u32 skb_mstamp_us_delta(const struct 
skb_mstamp *t1,
   *@no_fcs:  Request NIC to treat last 4 bytes as Ethernet FCS
*   @napi_id: id of the NAPI struct this skb came from
   *@secmark: security marking
+ * @offload_fwd_mark: fwding offload mark
   *@mark: Generic packet mark
   *@vlan_proto: vlan encapsulation protocol
   *@vlan_tci: vlan tag control information
@@ -650,9 +651,17 @@ struct sk_buff {
unsigned intsender_cpu;
};
  #endif
+   union {
  #ifdef CONFIG_NETWORK_SECMARK
-   __u32   secmark;
+   __u32   secmark;
+#endif
+#ifdef CONFIG_NET_SWITCHDEV
+   __u32   offload_fwd_mark;
  #endif
+   };
+
+   union {};
+

Everybody seems to ack. For my knowledge, why did you put this empty union?


Thank you,
Nicolas

Re: [PATCH v2 net-next 1/3] rhashtable: Allow lookup function to have compare function argument

2015-07-17 Thread Thomas Graf

On 07/14/15 at 04:45pm, Tom Herbert wrote:
> Added rhashtable_lookup_fast_cmpfn which does a lookup in an rhash table
> with the compare function being taken from an argument. This allows
> different compare functions to be used on the same table.
>
> Signed-off-by: Tom Herbert t...@herbertland.com

Acked-by: Thomas Graf tg...@suug.ch
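
The idea in the quoted description, a lookup that takes its compare function
as an argument, can be sketched with a tiny chained hash table. This is not
the rhashtable API: every sketch_* name is hypothetical, and the convention
that the callback returns 0 on a match is an assumption made for this
example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BUCKETS 8

struct sketch_key {
	const char *name;
	int netns;
};

struct sketch_node {
	struct sketch_node *next;
	struct sketch_key key;
};

struct sketch_table {
	struct sketch_node *buckets[SKETCH_BUCKETS];
};

/* Compare callback: returns 0 when obj matches key (assumed convention). */
typedef int (*sketch_cmp_fn)(const struct sketch_key *key,
			     const struct sketch_node *obj);

/* Hash on the name only, so both compare functions share one bucket layout. */
static unsigned int sketch_hash(const char *name)
{
	unsigned int h = 5381;

	while (*name)
		h = h * 33 + (unsigned char)*name++;
	return h % SKETCH_BUCKETS;
}

static void sketch_insert(struct sketch_table *t, struct sketch_node *n)
{
	unsigned int b = sketch_hash(n->key.name);

	n->next = t->buckets[b];
	t->buckets[b] = n;
}

/* Walk one bucket and let the caller-supplied function decide what matches. */
static struct sketch_node *sketch_lookup_cmpfn(const struct sketch_table *t,
					       const struct sketch_key *key,
					       sketch_cmp_fn cmp)
{
	struct sketch_node *n;

	assert(t && key && cmp);
	for (n = t->buckets[sketch_hash(key->name)]; n; n = n->next)
		if (cmp(key, n) == 0)
			return n;
	return NULL;
}

/* Loose match: name only. */
static int sketch_cmp_name(const struct sketch_key *key,
			   const struct sketch_node *obj)
{
	return strcmp(key->name, obj->key.name);
}

/* Strict match: name and namespace. */
static int sketch_cmp_name_netns(const struct sketch_key *key,
				 const struct sketch_node *obj)
{
	return strcmp(key->name, obj->key.name) != 0 ||
	       key->netns != obj->key.netns;
}
```

Both compare functions work against the same table because the hash covers
only the part of the key they have in common.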

Re: [PATCH v2 net-next 2/3] rhashtable: Add a function for in order insertion and lookup in buckets

2015-07-17 Thread Thomas Graf

On 07/15/15 at 12:46pm, Tom Herbert wrote:
> On Tue, Jul 14, 2015 at 10:54 PM, Herbert Xu
>> The memory cost is merely 8 bytes per local port, is it really too
>> much?
>
> Okay, it looks like there is already an additional hlist_node in
> skc_common that can be used for a secondary hash. It's conceivable
> this can be generalized and used in the TCP listeners also in
> combination with rhashtable.

Are you dropping this series entirely then?

Re: [PATCH] net: wireless: reduce log level of CRDA related messages

2015-07-17 Thread Johannes Berg

On Thu, 2015-07-09 at 15:35 +0200, Thomas Petazzoni wrote:
> With a basic Linux userspace, the message "Calling CRDA to update
> world regulatory domain" appears 10 times after boot, every second or
> so, followed by a final "Exceeded CRDA call max attempts. Not calling
> CRDA". For those of us not having the corresponding userspace parts,
> having those messages repeatedly displayed at boot time is a bit
> annoying, so this commit reduces their log level to pr_debug().
 
Applied.

johannes

[PATCH net-next 22/22] openvswitch: Use regular VXLAN net_device device

2015-07-17 Thread Thomas Graf

This gets rid of all OVS specific VXLAN code in the receive and
transmit path by using a VXLAN net_device to represent the vport.
Only a small shim layer remains which takes care of handling the
VXLAN specific OVS Netlink configuration.

Unexports vxlan_sock_add(), vxlan_sock_release(), vxlan_xmit_skb()
since they are no longer needed.

Signed-off-by: Thomas Graf tg...@suug.ch
Signed-off-by: Pravin B Shelar pshe...@nicira.com
---
 drivers/net/vxlan.c| 242 +++
 include/net/rtnetlink.h|   1 +
 include/net/vxlan.h|  24 +--
 net/core/rtnetlink.c   |  26 ++--
 net/openvswitch/Kconfig|  12 --
 net/openvswitch/Makefile   |   1 -
 net/openvswitch/flow_netlink.c |   6 +-
 net/openvswitch/vport-netdev.c | 201 -
 net/openvswitch/vport-vxlan.c  | 322 -
 net/openvswitch/vport-vxlan.h  |  11 --
 10 files changed, 339 insertions(+), 507 deletions(-)
 delete mode 100644 net/openvswitch/vport-vxlan.c
 delete mode 100644 net/openvswitch/vport-vxlan.h

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 5ae6c0c..76466ef 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -75,6 +75,9 @@ static struct rtnl_link_ops vxlan_link_ops;
 
 static const u8 all_zeros_mac[ETH_ALEN];
 
+static struct vxlan_sock *vxlan_sock_add(struct net *net, __be16 port,
+bool no_share, u32 flags);
+
 /* per-network namespace private data for this module */
 struct vxlan_net {
struct list_head  vxlan_list;
@@ -1027,7 +1030,7 @@ static bool vxlan_group_used(struct vxlan_net *vn, struct vxlan_dev *dev)
 	return false;
 }
 
-void vxlan_sock_release(struct vxlan_sock *vs)
+static void vxlan_sock_release(struct vxlan_sock *vs)
 {
 	struct sock *sk = vs->sock->sk;
 	struct net *net = sock_net(sk);
@@ -1043,7 +1046,6 @@ void vxlan_sock_release(struct vxlan_sock *vs)
 
 	queue_work(vxlan_wq, &vs->del_work);
 }
-EXPORT_SYMBOL_GPL(vxlan_sock_release);
 
 /* Update multicast group membership when first VNI on
  * multicast address is brought up
@@ -1126,6 +1128,102 @@ static struct vxlanhdr *vxlan_remcsum(struct sk_buff *skb, struct vxlanhdr *vh,
 	return vh;
 }
 
+static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb,
+		      struct vxlan_metadata *md, u32 vni,
+		      struct metadata_dst *tun_dst)
+{
+	struct iphdr *oip = NULL;
+	struct ipv6hdr *oip6 = NULL;
+	struct vxlan_dev *vxlan;
+	struct pcpu_sw_netstats *stats;
+	union vxlan_addr saddr;
+	int err = 0;
+	union vxlan_addr *remote_ip;
+
+	/* For flow based devices, map all packets to VNI 0 */
+	if (vs->flags & VXLAN_F_FLOW_BASED)
+		vni = 0;
+
+	/* Is this VNI defined? */
+	vxlan = vxlan_vs_find_vni(vs, vni);
+	if (!vxlan)
+		goto drop;
+
+	remote_ip = &vxlan->default_dst.remote_ip;
+	skb_reset_mac_header(skb);
+	skb_scrub_packet(skb, !net_eq(vxlan->net, dev_net(vxlan->dev)));
+	skb->protocol = eth_type_trans(skb, vxlan->dev);
+	skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
+
+	/* Ignore packet loops (and multicast echo) */
+	if (ether_addr_equal(eth_hdr(skb)->h_source, vxlan->dev->dev_addr))
+		goto drop;
+
+	/* Re-examine inner Ethernet packet */
+	if (remote_ip->sa.sa_family == AF_INET) {
+		oip = ip_hdr(skb);
+		saddr.sin.sin_addr.s_addr = oip->saddr;
+		saddr.sa.sa_family = AF_INET;
+#if IS_ENABLED(CONFIG_IPV6)
+	} else {
+		oip6 = ipv6_hdr(skb);
+		saddr.sin6.sin6_addr = oip6->saddr;
+		saddr.sa.sa_family = AF_INET6;
+#endif
+	}
+
+	if (tun_dst) {
+		skb_dst_set(skb, (struct dst_entry *)tun_dst);
+		tun_dst = NULL;
+	}
+
+	if ((vxlan->flags & VXLAN_F_LEARN) &&
+	    vxlan_snoop(skb->dev, &saddr, eth_hdr(skb)->h_source))
+		goto drop;
+
+	skb_reset_network_header(skb);
+	/* In flow-based mode, GBP is carried in dst_metadata */
+	if (!(vs->flags & VXLAN_F_FLOW_BASED))
+		skb->mark = md->gbp;
+
+	if (oip6)
+		err = IP6_ECN_decapsulate(oip6, skb);
+	if (oip)
+		err = IP_ECN_decapsulate(oip, skb);
+
+	if (unlikely(err)) {
+		if (log_ecn_error) {
+			if (oip6)
+				net_info_ratelimited("non-ECT from %pI6\n",
+						     &oip6->saddr);
+			if (oip)
+				net_info_ratelimited("non-ECT from %pI4 with TOS=%#x\n",
+						     &oip->saddr, oip->tos);
+		}
+		if (err > 1) {
+			++vxlan->dev->stats.rx_frame_errors;
+

[PATCH net-next 20/22] openvswitch: Move dev pointer into vport itself

2015-07-17 Thread Thomas Graf

This is the first step in representing all OVS vports as regular
struct net_devices. Move the net_device pointer into the vport
structure itself to get rid of struct vport_netdev.

Signed-off-by: Thomas Graf tg...@suug.ch
Signed-off-by: Pravin B Shelar pshe...@nicira.com
---
 net/openvswitch/datapath.c   |  7 +--
 net/openvswitch/dp_notify.c  |  5 +--
 net/openvswitch/vport-internal_dev.c | 37 +++-
 net/openvswitch/vport-netdev.c   | 86 
 net/openvswitch/vport-netdev.h   | 12 -
 net/openvswitch/vport.h  |  3 +-
 6 files changed, 59 insertions(+), 91 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 0208210..19df28e 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -188,7 +188,7 @@ static int get_dpifindex(const struct datapath *dp)
 
 	local = ovs_vport_rcu(dp, OVSP_LOCAL);
 	if (local)
-		ifindex = netdev_vport_priv(local)->dev->ifindex;
+		ifindex = local->dev->ifindex;
 	else
 		ifindex = 0;
 
@@ -2219,13 +2219,10 @@ static void __net_exit list_vports_from_net(struct net *net, struct net *dnet,
 			struct vport *vport;
 
 			hlist_for_each_entry(vport, &dp->ports[i], dp_hash_node) {
-				struct netdev_vport *netdev_vport;
-
 				if (vport->ops->type != OVS_VPORT_TYPE_INTERNAL)
 					continue;
 
-				netdev_vport = netdev_vport_priv(vport);
-				if (dev_net(netdev_vport->dev) == dnet)
+				if (dev_net(vport->dev) == dnet)
 					list_add(&vport->detach_list, head);
 			}
 		}
diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c
index 2c631fe..a7a80a6 100644
--- a/net/openvswitch/dp_notify.c
+++ b/net/openvswitch/dp_notify.c
@@ -58,13 +58,10 @@ void ovs_dp_notify_wq(struct work_struct *work)
 			struct hlist_node *n;
 
 			hlist_for_each_entry_safe(vport, n, &dp->ports[i], dp_hash_node) {
-				struct netdev_vport *netdev_vport;
-
 				if (vport->ops->type != OVS_VPORT_TYPE_NETDEV)
 					continue;
 
-				netdev_vport = netdev_vport_priv(vport);
-				if (!(netdev_vport->dev->priv_flags & IFF_OVS_DATAPATH))
+				if (!(vport->dev->priv_flags & IFF_OVS_DATAPATH))
 					dp_detach_port_notify(vport);
 			}
 		}
diff --git a/net/openvswitch/vport-internal_dev.c b/net/openvswitch/vport-internal_dev.c
index 6a55f71..a2c205d 100644
--- a/net/openvswitch/vport-internal_dev.c
+++ b/net/openvswitch/vport-internal_dev.c
@@ -156,49 +156,44 @@ static void do_setup(struct net_device *netdev)
 static struct vport *internal_dev_create(const struct vport_parms *parms)
 {
 	struct vport *vport;
-	struct netdev_vport *netdev_vport;
 	struct internal_dev *internal_dev;
 	int err;
 
-	vport = ovs_vport_alloc(sizeof(struct netdev_vport),
-				&ovs_internal_vport_ops, parms);
+	vport = ovs_vport_alloc(0, &ovs_internal_vport_ops, parms);
 	if (IS_ERR(vport)) {
 		err = PTR_ERR(vport);
 		goto error;
 	}
 
-	netdev_vport = netdev_vport_priv(vport);
-
-	netdev_vport->dev = alloc_netdev(sizeof(struct internal_dev),
-					 parms->name, NET_NAME_UNKNOWN,
-					 do_setup);
-	if (!netdev_vport->dev) {
+	vport->dev = alloc_netdev(sizeof(struct internal_dev),
+				  parms->name, NET_NAME_UNKNOWN, do_setup);
+	if (!vport->dev) {
 		err = -ENOMEM;
 		goto error_free_vport;
 	}
 
-	dev_net_set(netdev_vport->dev, ovs_dp_get_net(vport->dp));
-	internal_dev = internal_dev_priv(netdev_vport->dev);
+	dev_net_set(vport->dev, ovs_dp_get_net(vport->dp));
+	internal_dev = internal_dev_priv(vport->dev);
 	internal_dev->vport = vport;
 
 	/* Restrict bridge port to current netns. */
 	if (vport->port_no == OVSP_LOCAL)
-		netdev_vport->dev->features |= NETIF_F_NETNS_LOCAL;
+		vport->dev->features |= NETIF_F_NETNS_LOCAL;
 
 	rtnl_lock();
-	err = register_netdevice(netdev_vport->dev);
+	err = register_netdevice(vport->dev);
 	if (err)
 		goto error_free_netdev;
 
-	dev_set_promiscuity(netdev_vport->dev, 1);
+	dev_set_promiscuity(vport->dev, 1);
 	rtnl_unlock();
-	netif_start_queue(netdev_vport->dev);
+	netif_start_queue(vport->dev);
 
 	return

[PATCH net-next 19/22] openvswitch: Make tunnel set action attach a metadata dst

2015-07-17 Thread Thomas Graf

Utilize the new metadata dst to attach encapsulation instructions to
the skb. The existing egress_tun_info via the OVS_CB() is left in
place until all tunnel vports have been converted to the new method.
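
As a rough illustration of the reference counting this implies (the flow
action owns one reference to the metadata dst, and each skb that has it
attached takes another), here is a userspace sketch. All sketch_* names are
hypothetical; the real code uses dst_hold(), skb_dst_drop(), skb_dst_set()
and dst_release() as shown in the diff below.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: a refcounted dst and an skb that may point at one. */
struct sketch_dst {
	int refcnt;
};

struct sketch_skb {
	struct sketch_dst *dst;
};

static void sketch_dst_hold(struct sketch_dst *dst)
{
	dst->refcnt++;
}

static void sketch_dst_release(struct sketch_dst *dst)
{
	assert(dst->refcnt > 0);
	dst->refcnt--;
}

/* Detach whatever dst the skb currently carries. */
static void sketch_skb_dst_drop(struct sketch_skb *skb)
{
	if (skb->dst) {
		sketch_dst_release(skb->dst);
		skb->dst = NULL;
	}
}

/*
 * Attach the action's tunnel dst to the skb. The skb takes its own
 * reference, so the action can be freed later without invalidating
 * packets that are still in flight.
 */
static void sketch_set_tunnel(struct sketch_skb *skb, struct sketch_dst *tun_dst)
{
	sketch_skb_dst_drop(skb);
	sketch_dst_hold(tun_dst);
	skb->dst = tun_dst;
}
```

Freeing the action then only drops the action's own reference, which is the
reason the diff replaces a plain kfree() of the actions with a helper that
walks them first.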

Signed-off-by: Thomas Graf tg...@suug.ch
Signed-off-by: Pravin B Shelar pshe...@nicira.com
---
 net/openvswitch/actions.c  | 10 ++-
 net/openvswitch/datapath.c |  8 +++---
 net/openvswitch/flow.h |  5 
 net/openvswitch/flow_netlink.c | 64 +-
 net/openvswitch/flow_netlink.h |  1 +
 net/openvswitch/flow_table.c   |  4 ++-
 6 files changed, 79 insertions(+), 13 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 27c1687..cf04c2f 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -733,7 +733,15 @@ static int execute_set_action(struct sk_buff *skb,
 {
 	/* Only tunnel set execution is supported without a mask. */
 	if (nla_type(a) == OVS_KEY_ATTR_TUNNEL_INFO) {
-		OVS_CB(skb)->egress_tun_info = nla_data(a);
+		struct ovs_tunnel_info *tun = nla_data(a);
+
+		skb_dst_drop(skb);
+		dst_hold((struct dst_entry *)tun->tun_dst);
+		skb_dst_set(skb, (struct dst_entry *)tun->tun_dst);
+
+		/* FIXME: Remove when all vports have been converted */
+		OVS_CB(skb)->egress_tun_info = &tun->tun_dst->u.tun_info;
+
 		return 0;
 	}
 
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index ff8c4a4..0208210 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -1018,7 +1018,7 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 		}
 		ovs_unlock();
 
-		ovs_nla_free_flow_actions(old_acts);
+		ovs_nla_free_flow_actions_rcu(old_acts);
 		ovs_flow_free(new_flow, false);
 	}
 
@@ -1030,7 +1030,7 @@ err_unlock_ovs:
 	ovs_unlock();
 	kfree_skb(reply);
 err_kfree_acts:
-	kfree(acts);
+	ovs_nla_free_flow_actions(acts);
 err_kfree_flow:
 	ovs_flow_free(new_flow, false);
 error:
@@ -1157,7 +1157,7 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
 	if (reply)
 		ovs_notify(&dp_flow_genl_family, reply, info);
 	if (old_acts)
-		ovs_nla_free_flow_actions(old_acts);
+		ovs_nla_free_flow_actions_rcu(old_acts);
 
 	return 0;
 
@@ -1165,7 +1165,7 @@ err_unlock_ovs:
 	ovs_unlock();
 	kfree_skb(reply);
 err_kfree_acts:
-	kfree(acts);
+	ovs_nla_free_flow_actions(acts);
 error:
 	return error;
 }
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index cadc6c5..b62cdb3 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -33,6 +33,7 @@
 #include <linux/flex_array.h>
 #include <net/inet_ecn.h>
 #include <net/ip_tunnels.h>
+#include <net/dst_metadata.h>
 
 struct sk_buff;
 
@@ -45,6 +46,10 @@ struct sk_buff;
 #define TUN_METADATA_OPTS(flow_key, opt_len) \
 	((void *)((flow_key)->tun_opts + TUN_METADATA_OFFSET(opt_len)))
 
+struct ovs_tunnel_info {
+   struct metadata_dst *tun_dst;
+};
+
 #define OVS_SW_FLOW_KEY_METADATA_SIZE  \
(offsetof(struct sw_flow_key, recirc_id) +  \
FIELD_SIZEOF(struct sw_flow_key, recirc_id))
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index ecfa530..e7906df 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -1548,11 +1548,48 @@ static struct sw_flow_actions *nla_alloc_flow_actions(int size, bool log)
 	return sfa;
 }
 
+static void ovs_nla_free_set_action(const struct nlattr *a)
+{
+	const struct nlattr *ovs_key = nla_data(a);
+	struct ovs_tunnel_info *ovs_tun;
+
+	switch (nla_type(ovs_key)) {
+	case OVS_KEY_ATTR_TUNNEL_INFO:
+		ovs_tun = nla_data(ovs_key);
+		dst_release((struct dst_entry *)ovs_tun->tun_dst);
+		break;
+	}
+}
+
+void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts)
+{
+	const struct nlattr *a;
+	int rem;
+
+	if (!sf_acts)
+		return;
+
+	nla_for_each_attr(a, sf_acts->actions, sf_acts->actions_len, rem) {
+		switch (nla_type(a)) {
+		case OVS_ACTION_ATTR_SET:
+			ovs_nla_free_set_action(a);
+			break;
+		}
+	}
+
+	kfree(sf_acts);
+}
+
+static void __ovs_nla_free_flow_actions(struct rcu_head *head)
+{
+	ovs_nla_free_flow_actions(container_of(head, struct sw_flow_actions, rcu));
+}
+
 /* Schedules 'sf_acts' to be freed after the next RCU grace period.
  * The caller must hold rcu_read_lock for this to be sensible. */
-void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts)
+void ovs_nla_free_flow_actions_rcu(struct sw_flow_actions *sf_acts)
 {
-

[PATCH net-next 08/22] mpls: export mpls functions for use by mpls iptunnels

2015-07-17 Thread Thomas Graf

From: Roopa Prabhu ro...@cumulusnetworks.com

Signed-off-by: Roopa Prabhu ro...@cumulusnetworks.com
---
 net/mpls/af_mpls.c  | 11 ---
 net/mpls/internal.h |  9 +++--
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 1f93a59..6e66911 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -58,10 +58,11 @@ static inline struct mpls_dev *mpls_dev_get(const struct net_device *dev)
 	return rcu_dereference_rtnl(dev->mpls_ptr);
 }
 
-static bool mpls_output_possible(const struct net_device *dev)
+bool mpls_output_possible(const struct net_device *dev)
 {
 	return dev && (dev->flags & IFF_UP) && netif_carrier_ok(dev);
 }
+EXPORT_SYMBOL_GPL(mpls_output_possible);
 
 static unsigned int mpls_rt_header_size(const struct mpls_route *rt)
 {
@@ -69,13 +70,14 @@ static unsigned int mpls_rt_header_size(const struct mpls_route *rt)
 	return rt->rt_labels * sizeof(struct mpls_shim_hdr);
 }
 
-static unsigned int mpls_dev_mtu(const struct net_device *dev)
+unsigned int mpls_dev_mtu(const struct net_device *dev)
 {
 	/* The amount of data the layer 2 frame can hold */
 	return dev->mtu;
 }
+EXPORT_SYMBOL_GPL(mpls_dev_mtu);
 
-static bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned int mtu)
+bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned int mtu)
 {
 	if (skb->len <= mtu)
 		return false;
@@ -85,6 +87,7 @@ static bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned int mtu)
 
 	return true;
 }
+EXPORT_SYMBOL_GPL(mpls_pkt_too_big);
 
 static bool mpls_egress(struct mpls_route *rt, struct sk_buff *skb,
struct mpls_entry_decoded dec)
@@ -626,6 +629,7 @@ int nla_put_labels(struct sk_buff *skb, int attrtype,
 
return 0;
 }
+EXPORT_SYMBOL_GPL(nla_put_labels);
 
 int nla_get_labels(const struct nlattr *nla,
   u32 max_labels, u32 *labels, u32 label[])
@@ -671,6 +675,7 @@ int nla_get_labels(const struct nlattr *nla,
*labels = nla_labels;
return 0;
 }
+EXPORT_SYMBOL_GPL(nla_get_labels);
 
 static int rtm_to_route_config(struct sk_buff *skb,  struct nlmsghdr *nlh,
   struct mpls_route_config *cfg)
diff --git a/net/mpls/internal.h b/net/mpls/internal.h
index 8cabeb5..2681a4b 100644
--- a/net/mpls/internal.h
+++ b/net/mpls/internal.h
@@ -50,7 +50,12 @@ static inline struct mpls_entry_decoded mpls_entry_decode(struct mpls_shim_hdr *
 	return result;
 }
 
-int nla_put_labels(struct sk_buff *skb, int attrtype,  u8 labels, const u32 label[]);
-int nla_get_labels(const struct nlattr *nla, u32 max_labels, u32 *labels, u32 label[]);
+int nla_put_labels(struct sk_buff *skb, int attrtype,  u8 labels,
+  const u32 label[]);
+int nla_get_labels(const struct nlattr *nla, u32 max_labels, u32 *labels,
+  u32 label[]);
+bool mpls_output_possible(const struct net_device *dev);
+unsigned int mpls_dev_mtu(const struct net_device *dev);
+bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned int mtu);
 
 #endif /* MPLS_INTERNAL_H */
-- 
2.4.3

[PATCH net-next 21/22] openvswitch: Abstract vport name through ovs_vport_name()

2015-07-17 Thread Thomas Graf

This allows getting rid of the get_name() vport op later on.
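
A minimal userspace sketch of the accessor this patch introduces: prefer the
name of the attached net_device, and fall back to the per-type get_name()
callback for vports that have no device pointer yet. The sketch_* types and
sketch_legacy_name() are hypothetical stand-ins, not the OVS structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sketch_vport;

struct sketch_vport_ops {
	const char *(*get_name)(const struct sketch_vport *vport);
};

struct sketch_netdev {
	char name[16];
};

struct sketch_vport {
	struct sketch_netdev *dev;		/* NULL for unconverted vport types */
	const struct sketch_vport_ops *ops;
};

/* Single place that decides where the name comes from. */
static const char *sketch_vport_name(const struct sketch_vport *vport)
{
	assert(vport != NULL);
	if (vport->dev)
		return vport->dev->name;
	return vport->ops->get_name(vport);
}

/* Example callback for a vport type that still provides its own name. */
static const char *sketch_legacy_name(const struct sketch_vport *vport)
{
	(void)vport;
	return "legacy0";
}
```

Once every vport type carries a device pointer, the fallback branch and the
callback can be removed without touching any caller.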

Signed-off-by: Thomas Graf tg...@suug.ch
---
 net/openvswitch/datapath.c   | 4 ++--
 net/openvswitch/vport-internal_dev.c | 1 -
 net/openvswitch/vport-netdev.c   | 6 --
 net/openvswitch/vport-netdev.h   | 1 -
 net/openvswitch/vport.c  | 4 ++--
 net/openvswitch/vport.h  | 5 +
 6 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 19df28e..ffe984f 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -176,7 +176,7 @@ static inline struct datapath *get_dp(struct net *net, int dp_ifindex)
 const char *ovs_dp_name(const struct datapath *dp)
 {
 	struct vport *vport = ovs_vport_ovsl_rcu(dp, OVSP_LOCAL);
-	return vport->ops->get_name(vport);
+	return ovs_vport_name(vport);
 }
 
 static int get_dpifindex(const struct datapath *dp)
@@ -1800,7 +1800,7 @@ static int ovs_vport_cmd_fill_info(struct vport *vport, struct sk_buff *skb,
 	if (nla_put_u32(skb, OVS_VPORT_ATTR_PORT_NO, vport->port_no) ||
 	    nla_put_u32(skb, OVS_VPORT_ATTR_TYPE, vport->ops->type) ||
 	    nla_put_string(skb, OVS_VPORT_ATTR_NAME,
-			   vport->ops->get_name(vport)))
+			   ovs_vport_name(vport)))
 		goto nla_put_failure;
 
 	ovs_vport_get_stats(vport, &vport_stats);
diff --git a/net/openvswitch/vport-internal_dev.c 
b/net/openvswitch/vport-internal_dev.c
index a2c205d..c058bbf 100644
--- a/net/openvswitch/vport-internal_dev.c
+++ b/net/openvswitch/vport-internal_dev.c
@@ -242,7 +242,6 @@ static struct vport_ops ovs_internal_vport_ops = {
.type   = OVS_VPORT_TYPE_INTERNAL,
.create = internal_dev_create,
.destroy= internal_dev_destroy,
-   .get_name   = ovs_netdev_get_name,
.send   = internal_dev_recv,
 };
 
diff --git a/net/openvswitch/vport-netdev.c b/net/openvswitch/vport-netdev.c
index 1c96966..e682bdc 100644
--- a/net/openvswitch/vport-netdev.c
+++ b/net/openvswitch/vport-netdev.c
@@ -171,11 +171,6 @@ static void netdev_destroy(struct vport *vport)
 	call_rcu(&vport->rcu, free_port_rcu);
 }
 
-const char *ovs_netdev_get_name(const struct vport *vport)
-{
-	return vport->dev->name;
-}
-
 static unsigned int packet_length(const struct sk_buff *skb)
 {
 	unsigned int length = skb->len - ETH_HLEN;
@@ -223,7 +218,6 @@ static struct vport_ops ovs_netdev_vport_ops = {
.type   = OVS_VPORT_TYPE_NETDEV,
.create = netdev_create,
.destroy= netdev_destroy,
-   .get_name   = ovs_netdev_get_name,
.send   = netdev_send,
 };
 
diff --git a/net/openvswitch/vport-netdev.h b/net/openvswitch/vport-netdev.h
index 1c52aed..684fb88 100644
--- a/net/openvswitch/vport-netdev.h
+++ b/net/openvswitch/vport-netdev.h
@@ -26,7 +26,6 @@
 
 struct vport *ovs_netdev_get_vport(struct net_device *dev);
 
-const char *ovs_netdev_get_name(const struct vport *);
 void ovs_netdev_detach_dev(struct vport *);
 
 int __init ovs_netdev_init(void);
diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index af23ba0..d14f594 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -113,7 +113,7 @@ struct vport *ovs_vport_locate(const struct net *net, const char *name)
 	struct vport *vport;
 
 	hlist_for_each_entry_rcu(vport, bucket, hash_node)
-		if (!strcmp(name, vport->ops->get_name(vport)) &&
+		if (!strcmp(name, ovs_vport_name(vport)) &&
 		    net_eq(ovs_dp_get_net(vport->dp), net))
 			return vport;
 
@@ -226,7 +226,7 @@ struct vport *ovs_vport_add(const struct vport_parms *parms)
 			}
 
 			bucket = hash_bucket(ovs_dp_get_net(vport->dp),
-					     vport->ops->get_name(vport));
+					     ovs_vport_name(vport));
 			hlist_add_head_rcu(&vport->hash_node, bucket);
 			return vport;
 		}
diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h
index e05ec68..1a689c2 100644
--- a/net/openvswitch/vport.h
+++ b/net/openvswitch/vport.h
@@ -237,6 +237,11 @@ static inline void ovs_skb_postpush_rcsum(struct sk_buff *skb,
 	skb->csum = csum_add(skb->csum, csum_partial(start, len, 0));
 }
 
+static inline const char *ovs_vport_name(struct vport *vport)
+{
+	return vport->dev ? vport->dev->name : vport->ops->get_name(vport);
+}
+
 int ovs_vport_ops_register(struct vport_ops *ops);
 void ovs_vport_ops_unregister(struct vport_ops *ops);
 
-- 
2.4.3

[PATCH net-next 17/22] fib: Add fib rule match on tunnel id

2015-07-17 Thread Thomas Graf

This adds the ability to select a routing table based on the tunnel
id, which allows maintaining separate routing tables for each virtual
tunnel network.

ip rule add from all tunnel-id 100 lookup 100
ip rule add from all tunnel-id 200 lookup 200

A new static key controls the collection of metadata at tunnel level
upon demand.

Signed-off-by: Thomas Graf tg...@suug.ch
---
 drivers/net/vxlan.c|  3 ++-
 include/net/fib_rules.h|  1 +
 include/net/ip_tunnels.h   | 11 +++
 include/uapi/linux/fib_rules.h |  2 +-
 net/core/fib_rules.c   | 24 ++--
 net/ipv4/ip_tunnel_core.c  | 16 
 6 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index a350afb..23378db 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -143,7 +143,8 @@ static struct workqueue_struct *vxlan_wq;
 
 static inline bool vxlan_collect_metadata(struct vxlan_sock *vs)
 {
-	return vs->flags & VXLAN_F_COLLECT_METADATA;
+	return vs->flags & VXLAN_F_COLLECT_METADATA ||
+	       ip_tunnel_collect_metadata();
 }
 
 #if IS_ENABLED(CONFIG_IPV6)
diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
index 903a55e..4e8f804 100644
--- a/include/net/fib_rules.h
+++ b/include/net/fib_rules.h
@@ -19,6 +19,7 @@ struct fib_rule {
u8  action;
/* 3 bytes hole, try to use */
u32 target;
+   __be64  tun_id;
struct fib_rule __rcu   *ctarget;
struct net  *fr_net;
 
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 0b7e18c..0a5a776 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -303,6 +303,17 @@ static inline struct ip_tunnel_info *lwt_tun_info(struct lwtunnel_state *lwtstat
 	return (struct ip_tunnel_info *)lwtstate->data;
 }
 
+extern struct static_key ip_tunnel_metadata_cnt;
+
+/* Returns > 0 if metadata should be collected */
+static inline int ip_tunnel_collect_metadata(void)
+{
+	return static_key_false(&ip_tunnel_metadata_cnt);
+}
+
+void ip_tunnel_need_metadata(void);
+void ip_tunnel_unneed_metadata(void);
+
 #endif /* CONFIG_INET */
 
 #endif /* __NET_IP_TUNNELS_H */
diff --git a/include/uapi/linux/fib_rules.h b/include/uapi/linux/fib_rules.h
index 2b82d7e..96161b8 100644
--- a/include/uapi/linux/fib_rules.h
+++ b/include/uapi/linux/fib_rules.h
@@ -43,7 +43,7 @@ enum {
FRA_UNUSED5,
FRA_FWMARK, /* mark */
FRA_FLOW,   /* flow/class id */
-   FRA_UNUSED6,
+   FRA_TUN_ID,
FRA_SUPPRESS_IFGROUP,
FRA_SUPPRESS_PREFIXLEN,
FRA_TABLE,  /* Extended table id */
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 9a12668..ae8306e 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -16,6 +16,7 @@
 #include <net/net_namespace.h>
 #include <net/sock.h>
 #include <net/fib_rules.h>
+#include <net/ip_tunnels.h>
 
 int fib_default_rule_add(struct fib_rules_ops *ops,
 u32 pref, u32 table, u32 flags)
@@ -186,6 +187,9 @@ static int fib_rule_match(struct fib_rule *rule, struct fib_rules_ops *ops,
 	if ((rule->mark ^ fl->flowi_mark) & rule->mark_mask)
 		goto out;
 
+	if (rule->tun_id && (rule->tun_id != fl->flowi_tun_key.tun_id))
+		goto out;
+
 	ret = ops->match(rule, fl, flags);
 out:
 	return (rule->flags & FIB_RULE_INVERT) ? !ret : ret;
@@ -330,6 +334,9 @@ static int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr* nlh)
 	if (tb[FRA_FWMASK])
 		rule->mark_mask = nla_get_u32(tb[FRA_FWMASK]);
 
+	if (tb[FRA_TUN_ID])
+		rule->tun_id = nla_get_be64(tb[FRA_TUN_ID]);
+
 	rule->action = frh->action;
 	rule->flags = frh->flags;
 	rule->table = frh_get_table(frh, tb);
@@ -407,6 +414,9 @@ static int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr* nlh)
 	if (unresolved)
 		ops->unresolved_rules++;
 
+	if (rule->tun_id)
+		ip_tunnel_need_metadata();
+
 	notify_rule_change(RTM_NEWRULE, rule, ops, nlh, NETLINK_CB(skb).portid);
 	flush_route_cache(ops);
 	rules_ops_put(ops);
@@ -473,6 +483,10 @@ static int fib_nl_delrule(struct sk_buff *skb, struct nlmsghdr* nlh)
 		    (rule->mark_mask != nla_get_u32(tb[FRA_FWMASK])))
 			continue;
 
+		if (tb[FRA_TUN_ID] &&
+		    (rule->tun_id != nla_get_be64(tb[FRA_TUN_ID])))
+			continue;
+
 		if (!ops->compare(rule, frh, tb))
 			continue;
 
@@ -487,6 +501,9 @@ static int fib_nl_delrule(struct sk_buff *skb, struct nlmsghdr* nlh)
 			goto errout;
 		}
 
+		if (rule->tun_id)
+			ip_tunnel_unneed_metadata();
+
 		list_del_rcu(&rule->list);
 
 		if (rule->action == FR_ACT_GOTO) {

[PATCH net-next 13/22] arp: Inherit metadata dst when creating ARP requests

2015-07-17 Thread Thomas Graf

If the output device wants to see the dst, inherit the dst of the
original skb and pass it on when generating the ARP request.

Signed-off-by: Thomas Graf tg...@suug.ch
---
 net/ipv4/arp.c | 65 +-
 1 file changed, 37 insertions(+), 28 deletions(-)

diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 933a928..1d59e50 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -291,6 +291,40 @@ static void arp_error_report(struct neighbour *neigh, struct sk_buff *skb)
kfree_skb(skb);
 }
 
+/* Create and send an arp packet. */
+static void arp_send_dst(int type, int ptype, __be32 dest_ip,
+struct net_device *dev, __be32 src_ip,
+const unsigned char *dest_hw,
+const unsigned char *src_hw,
+const unsigned char *target_hw, struct sk_buff *oskb)
+{
+   struct sk_buff *skb;
+
+   /* arp on this interface. */
+   if (dev->flags & IFF_NOARP)
+   return;
+
+   skb = arp_create(type, ptype, dest_ip, dev, src_ip,
+dest_hw, src_hw, target_hw);
+   if (!skb)
+   return;
+
+   if (oskb)
+   skb_dst_copy(skb, oskb);
+
+   arp_xmit(skb);
+}
+
+void arp_send(int type, int ptype, __be32 dest_ip,
+ struct net_device *dev, __be32 src_ip,
+ const unsigned char *dest_hw, const unsigned char *src_hw,
+ const unsigned char *target_hw)
+{
+   arp_send_dst(type, ptype, dest_ip, dev, src_ip, dest_hw, src_hw,
+target_hw, NULL);
+}
+EXPORT_SYMBOL(arp_send);
+
 static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
 {
__be32 saddr = 0;
@@ -346,8 +380,9 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
}
}
 
-   arp_send(ARPOP_REQUEST, ETH_P_ARP, target, dev, saddr,
-dst_hw, dev->dev_addr, NULL);
+   arp_send_dst(ARPOP_REQUEST, ETH_P_ARP, target, dev, saddr,
+dst_hw, dev->dev_addr, NULL,
+dev->priv_flags & IFF_XMIT_DST_RELEASE ? NULL : skb);
 }
 
 static int arp_ignore(struct in_device *in_dev, __be32 sip, __be32 tip)
@@ -597,32 +632,6 @@ void arp_xmit(struct sk_buff *skb)
 EXPORT_SYMBOL(arp_xmit);
 
 /*
- * Create and send an arp packet.
- */
-void arp_send(int type, int ptype, __be32 dest_ip,
- struct net_device *dev, __be32 src_ip,
- const unsigned char *dest_hw, const unsigned char *src_hw,
- const unsigned char *target_hw)
-{
-   struct sk_buff *skb;
-
-   /*
-*  No arp on this interface.
-*/
-
-   if (dev->flags&IFF_NOARP)
-   return;
-
-   skb = arp_create(type, ptype, dest_ip, dev, src_ip,
-dest_hw, src_hw, target_hw);
-   if (!skb)
-   return;
-
-   arp_xmit(skb);
-}
-EXPORT_SYMBOL(arp_send);
-
-/*
  * Process an arp request.
  */
 
-- 
2.4.3


[PATCH net-next 16/22] route: Per route IP tunnel metadata via lightweight tunnel

2015-07-17 Thread Thomas Graf

This introduces a new IP tunnel lightweight tunnel type which allows
specifying IP tunnel instructions per route. Only IPv4 is supported
at this point.

Signed-off-by: Thomas Graf tg...@suug.ch
---
 drivers/net/vxlan.c|  10 +++-
 include/net/dst_metadata.h |  12 -
 include/net/ip_tunnels.h   |   7 ++-
 include/uapi/linux/lwtunnel.h  |   1 +
 include/uapi/linux/rtnetlink.h |  15 ++
 net/ipv4/ip_tunnel_core.c  | 114 +
 net/ipv4/route.c   |   2 +-
 net/openvswitch/vport.h|   1 +
 8 files changed, 157 insertions(+), 5 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 994d89c..a350afb 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1935,7 +1935,7 @@ static void vxlan_encap_bypass(struct sk_buff *skb, struct vxlan_dev *src_vxlan,
 static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
   struct vxlan_rdst *rdst, bool did_rsc)
 {
-   struct ip_tunnel_info *info = skb_tunnel_info(skb);
+   struct ip_tunnel_info *info;
struct vxlan_dev *vxlan = netdev_priv(dev);
struct sock *sk = vxlan->vn_sock->sock->sk;
struct rtable *rt = NULL;
@@ -1952,6 +1952,9 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
int err;
u32 flags = vxlan->flags;
 
+   /* FIXME: Support IPv6 */
+   info = skb_tunnel_info(skb, AF_INET);
+
if (rdst) {
dst_port = rdst->remote_port ? rdst->remote_port : vxlan->dst_port;
vni = rdst->remote_vni;
@@ -2141,12 +2144,15 @@ tx_free:
 static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
 {
struct vxlan_dev *vxlan = netdev_priv(dev);
-   const struct ip_tunnel_info *info = skb_tunnel_info(skb);
+   const struct ip_tunnel_info *info;
struct ethhdr *eth;
bool did_rsc = false;
struct vxlan_rdst *rdst, *fdst = NULL;
struct vxlan_fdb *f;
 
+   /* FIXME: Support IPv6 */
+   info = skb_tunnel_info(skb, AF_INET);
+
skb_reset_mac_header(skb);
eth = eth_hdr(skb);
 
diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
index e843937..7b03068 100644
--- a/include/net/dst_metadata.h
+++ b/include/net/dst_metadata.h
@@ -23,13 +23,23 @@ static inline struct metadata_dst *skb_metadata_dst(struct sk_buff *skb)
return NULL;
 }
 
-static inline struct ip_tunnel_info *skb_tunnel_info(struct sk_buff *skb)
+static inline struct ip_tunnel_info *skb_tunnel_info(struct sk_buff *skb,
+int family)
 {
struct metadata_dst *md_dst = skb_metadata_dst(skb);
+   struct rtable *rt;
 
if (md_dst)
return &md_dst->u.tun_info;
 
+   switch (family) {
+   case AF_INET:
+   rt = (struct rtable *)skb_dst(skb);
+   if (rt && rt->rt_lwtstate)
+   return lwt_tun_info(rt->rt_lwtstate);
+   break;
+   }
+
return NULL;
 }
 
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index d11530f..0b7e18c 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -9,9 +9,9 @@
 #include <net/dsfield.h>
 #include <net/gro_cells.h>
 #include <net/inet_ecn.h>
-#include <net/ip.h>
 #include <net/netns/generic.h>
 #include <net/rtnetlink.h>
+#include <net/lwtunnel.h>
 
 #if IS_ENABLED(CONFIG_IPV6)
 #include <net/ipv6.h>
@@ -298,6 +298,11 @@ static inline void *ip_tunnel_info_opts(struct ip_tunnel_info *info, size_t n)
return info + 1;
 }
 
+static inline struct ip_tunnel_info *lwt_tun_info(struct lwtunnel_state *lwtstate)
+{
+   return (struct ip_tunnel_info *)lwtstate->data;
+}
+
 #endif /* CONFIG_INET */
 
 #endif /* __NET_IP_TUNNELS_H */
diff --git a/include/uapi/linux/lwtunnel.h b/include/uapi/linux/lwtunnel.h
index aa611d9..31377bb 100644
--- a/include/uapi/linux/lwtunnel.h
+++ b/include/uapi/linux/lwtunnel.h
@@ -6,6 +6,7 @@
 enum lwtunnel_encap_types {
LWTUNNEL_ENCAP_NONE,
LWTUNNEL_ENCAP_MPLS,
+   LWTUNNEL_ENCAP_IP,
__LWTUNNEL_ENCAP_MAX,
 };
 
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 0d3d3cc..47d24cb 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -286,6 +286,21 @@ enum rt_class_t {
 
 /* Routing message attributes */
 
+enum ip_tunnel_t {
+   IP_TUN_UNSPEC,
+   IP_TUN_ID,
+   IP_TUN_DST,
+   IP_TUN_SRC,
+   IP_TUN_TTL,
+   IP_TUN_TOS,
+   IP_TUN_SPORT,
+   IP_TUN_DPORT,
+   IP_TUN_FLAGS,
+   __IP_TUN_MAX,
+};
+
+#define IP_TUN_MAX (__IP_TUN_MAX - 1)
+
 enum rtattr_type_t {
RTA_UNSPEC,
RTA_DST,
diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index 6a51a71..025b76e 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -190,3 +190,117 @@ struct rtnl_link_stats64

Re: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be processed during probe

2015-07-17 Thread Vitaly Kuznetsov

K. Y. Srinivasan k...@microsoft.com writes:

 The current code returns from probe without waiting for the proper handling
 of subchannels that may be requested. If the netvsc driver were to be rapidly
 loaded/unloaded, we can trigger a panic as the unload will be tearing
 down state that may not have been fully set up yet. We fix this issue by making
 sure that we return from the probe call only after ensuring that the
 sub-channel offers in flight are properly handled.

 Signed-off-by: K. Y. Srinivasan k...@microsoft.com
 Reviewed-and-tested-by: Haiyang Zhang haiya...@microsoft.com
 ---
  drivers/net/hyperv/hyperv_net.h   |2 ++
  drivers/net/hyperv/rndis_filter.c |   25 +
  2 files changed, 27 insertions(+), 0 deletions(-)

 diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
 index 26cd14c..925b75d 100644
 --- a/drivers/net/hyperv/hyperv_net.h
 +++ b/drivers/net/hyperv/hyperv_net.h
 @@ -671,6 +671,8 @@ struct netvsc_device {
   u32 send_table[VRSS_SEND_TAB_SIZE];
   u32 max_chn;
   u32 num_chn;
 + spinlock_t sc_lock; /* Protects num_sc_offered variable */
 + u32 num_sc_offered;
   atomic_t queue_sends[NR_CPUS];

   /* Holds rndis device info */
 diff --git a/drivers/net/hyperv/rndis_filter.c 
 b/drivers/net/hyperv/rndis_filter.c
 index 2e40417..2e09f3f 100644
 --- a/drivers/net/hyperv/rndis_filter.c
 +++ b/drivers/net/hyperv/rndis_filter.c
 @@ -984,9 +984,16 @@ static void netvsc_sc_open(struct vmbus_channel *new_sc)
   struct netvsc_device *nvscdev;
   u16 chn_index = new_sc->offermsg.offer.sub_channel_index;
   int ret;
 + unsigned long flags;

   nvscdev = hv_get_drvdata(new_sc->primary_channel->device_obj);

 + spin_lock_irqsave(&nvscdev->sc_lock, flags);
 + nvscdev->num_sc_offered--;
 + spin_unlock_irqrestore(&nvscdev->sc_lock, flags);
 + if (nvscdev->num_sc_offered == 0)
 + complete(&nvscdev->channel_init_wait);
 +
   if (chn_index >= nvscdev->num_chn)
   return;

 @@ -1015,8 +1022,10 @@ int rndis_filter_device_add(struct hv_device *dev,
   u32 rsscap_size = sizeof(struct ndis_recv_scale_cap);
   u32 mtu, size;
   u32 num_rss_qs;
 + u32 sc_delta;
   const struct cpumask *node_cpu_mask;
   u32 num_possible_rss_qs;
 + unsigned long flags;

   rndis_device = get_rndis_device();
   if (!rndis_device)
 @@ -1039,6 +1048,8 @@ int rndis_filter_device_add(struct hv_device *dev,
   net_device->max_chn = 1;
   net_device->num_chn = 1;

 + spin_lock_init(&net_device->sc_lock);
 +
   net_device->extension = rndis_device;
   rndis_device->net_dev = net_device;

 @@ -1116,6 +1127,9 @@ int rndis_filter_device_add(struct hv_device *dev,
   num_possible_rss_qs = cpumask_weight(node_cpu_mask);
   net_device->num_chn = min(num_possible_rss_qs, num_rss_qs);

 + num_rss_qs = net_device->num_chn - 1;
 + net_device->num_sc_offered = num_rss_qs;
 +
   if (net_device->num_chn == 1)
   goto out;

 @@ -1157,11 +1171,22 @@ int rndis_filter_device_add(struct hv_device *dev,

   ret = rndis_filter_set_rss_param(rndis_device, net_device->num_chn);

 + /*
 +  * Wait for the host to send us the sub-channel offers.
 +  */
 + spin_lock_irqsave(&net_device->sc_lock, flags);
 + sc_delta = net_device->num_chn - 1 - num_rss_qs;
 + net_device->num_sc_offered -= sc_delta;
 + spin_unlock_irqrestore(&net_device->sc_lock, flags);
 +
 + if (net_device->num_sc_offered != 0)
 + wait_for_completion(&net_device->channel_init_wait);

I'd suggest we add an essential timeout (big, let's say 30 sec.)
here. In case something goes wrong we don't really want to hang the
whole kernel forever. Such bugs are hard to debug: if a 'kernel
hang' is reported, we can't be sure which wait caused it. We could even
have something like:

 t = wait_for_completion_timeout(&net_device->channel_init_wait, 30*HZ);
 BUG_ON(t == 0);

This is much better as we'll be sure what went wrong. (I know other
pieces of Hyper-V code use wait_for_completion() without a timeout; this
is rather a general suggestion for all of them.)

  out:
   if (ret) {
   net_device->max_chn = 1;
   net_device->num_chn = 1;
   }
 +
   return 0; /* return 0 because primary channel can be used alone */

  err_dev_remv:

-- 
  Vitaly

[PATCH nf-next v2] netfilter: nf_ct_sctp: minimal multihoming support

2015-07-17 Thread Michal Kubecek

Currently nf_conntrack_proto_sctp module handles only packets between
primary addresses used to establish the connection. Any packets between
secondary addresses are classified as invalid so that usual firewall
configurations drop them. Allowing HEARTBEAT and HEARTBEAT-ACK chunks to
establish a new conntrack would allow traffic between secondary
addresses to pass through. A more sophisticated solution based on the
addresses advertised in the initial handshake (and possibly also later
dynamic address addition and removal) would be much harder to implement.
Moreover, in general we cannot assume to always see the initial
handshake as it can be routed through a different path.

The patch adds two new conntrack states:

  SCTP_CONNTRACK_HEARTBEAT_SENT  - a HEARTBEAT chunk seen but not acked
  SCTP_CONNTRACK_HEARTBEAT_ACKED - a HEARTBEAT acked by HEARTBEAT-ACK

State transition rules:

- HEARTBEAT_SENT responds to usual chunks the same way as NONE (so that
  the behaviour changes as little as possible)
- HEARTBEAT_ACKED responds to usual chunks the same way as ESTABLISHED
  does, except the resulting state is HEARTBEAT_ACKED rather than
  ESTABLISHED
- previously existing states except NONE are preserved when HEARTBEAT or
  HEARTBEAT-ACK is seen
- NONE (in the initial direction) changes to HEARTBEAT_SENT on HEARTBEAT
  and to CLOSED on HEARTBEAT-ACK
- HEARTBEAT_SENT changes to HEARTBEAT_ACKED on HEARTBEAT-ACK in the
  reply direction
- HEARTBEAT_SENT and HEARTBEAT_ACKED are preserved on HEARTBEAT and
  HEARTBEAT-ACK otherwise
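
The HEARTBEAT-related rules above can be sketched as a small transition
function. This is an illustration only, not the patch's actual
implementation: the state set is reduced (ST_ESTABLISHED stands in for any
previously existing state), all names (hb_next_state, ST_*, CH_*, DIR_*) are
made up for the example, and the reaction of the new states to other chunk
types is left out.

```c
/* Reduced state set: only what the rules above mention. */
enum ct_state {
	ST_NONE,
	ST_ESTABLISHED,		/* stands in for any pre-existing state */
	ST_CLOSED,
	ST_HEARTBEAT_SENT,
	ST_HEARTBEAT_ACKED
};

enum hb_chunk { CH_HEARTBEAT, CH_HEARTBEAT_ACK };
enum ct_dir { DIR_ORIGINAL, DIR_REPLY };

/* Next state when a HEARTBEAT or HEARTBEAT-ACK chunk is seen. */
static enum ct_state hb_next_state(enum ct_state cur, enum hb_chunk chunk,
				   enum ct_dir dir)
{
	switch (cur) {
	case ST_NONE:
		/* The rules only cover the initial direction; anything
		 * else is assumed to leave the state unchanged here.
		 */
		if (dir != DIR_ORIGINAL)
			return cur;
		return chunk == CH_HEARTBEAT ? ST_HEARTBEAT_SENT : ST_CLOSED;
	case ST_HEARTBEAT_SENT:
		/* Only an ACK coming back in the reply direction promotes. */
		if (chunk == CH_HEARTBEAT_ACK && dir == DIR_REPLY)
			return ST_HEARTBEAT_ACKED;
		return cur;
	default:
		/* HEARTBEAT_ACKED and all older states are preserved. */
		return cur;
	}
}
```

For example, hb_next_state(ST_NONE, CH_HEARTBEAT, DIR_ORIGINAL) yields
ST_HEARTBEAT_SENT, and a following HEARTBEAT-ACK in the reply direction
yields ST_HEARTBEAT_ACKED.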

Normally, vtag is set from the INIT chunk for the reply direction and
from the INIT-ACK chunk for the originating direction (i.e. each of
these defines vtag value for the opposite direction). For secondary
conntracks, we can't rely on seeing INIT/INIT-ACK and even if we have
seen them, we would need to connect two different conntracks. Therefore
simplified logic is applied: vtag of first packet in each direction
(HEARTBEAT in the originating and HEARTBEAT-ACK in reply direction) is
saved and all following packets in that direction are compared with this
saved value. While INIT and INIT-ACK define vtag for the opposite
direction, vtags extracted from HEARTBEAT and HEARTBEAT-ACK are always
for their direction.
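
The simplified vtag logic for secondary conntracks can be sketched as
follows. This is a standalone illustration under assumed names
(struct hb_vtag_track, hb_vtag_check, VT_DIR_*), not the code from the patch:
the first vtag seen in a direction is stored, and later packets in that
direction must carry the same value.

```c
#include <stdbool.h>
#include <stdint.h>

enum { VT_DIR_ORIGINAL, VT_DIR_REPLY, VT_DIR_MAX };

/* Hypothetical per-conntrack bookkeeping for the example. */
struct hb_vtag_track {
	uint32_t vtag[VT_DIR_MAX];	/* vtag saved per direction */
	bool seen[VT_DIR_MAX];		/* has a packet been seen yet? */
};

/* Returns true if the packet's vtag is acceptable for its direction. */
static bool hb_vtag_check(struct hb_vtag_track *t, int dir, uint32_t vtag)
{
	if (!t->seen[dir]) {
		/* First packet in this direction defines the vtag for
		 * this same direction (unlike INIT/INIT-ACK).
		 */
		t->vtag[dir] = vtag;
		t->seen[dir] = true;
		return true;
	}
	return t->vtag[dir] == vtag;
}
```

Each direction is tracked independently, so a vtag accepted in the
originating direction says nothing about the reply direction.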

Default timeout values for new states are

  HEARTBEAT_SENT: 30 seconds (default hb_interval)
  HEARTBEAT_ACKED: 210 seconds (hb_interval * path_max_retry + max_rto)

(We cannot expect to see the shutdown sequence so that, unlike
ESTABLISHED, the HEARTBEAT_ACKED timeout shouldn't be too long.)
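
The 210 second figure follows from the formula above. A minimal check,
assuming the usual SCTP defaults of 5 for path_max_retry and 60 seconds for
max_rto (the message names the parameters but not their values); the constant
and function names are made up for the example:

```c
/* Assumed SCTP defaults, in seconds where applicable. */
enum {
	HB_INTERVAL_SECS = 30,	/* default hb_interval, as stated above */
	PATH_MAX_RETRY   = 5,	/* assumed default */
	MAX_RTO_SECS     = 60	/* assumed default */
};

/* HEARTBEAT_SENT: one heartbeat interval. */
static unsigned int heartbeat_sent_timeout_secs(void)
{
	return HB_INTERVAL_SECS;
}

/* HEARTBEAT_ACKED: hb_interval * path_max_retry + max_rto. */
static unsigned int heartbeat_acked_timeout_secs(void)
{
	return HB_INTERVAL_SECS * PATH_MAX_RETRY + MAX_RTO_SECS;
}
```

With these values the second function returns 30 * 5 + 60 = 210.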

Signed-off-by: Michal Kubecek mkube...@suse.cz
---

v2:
- add new timeouts to nla policy interface
- explain vtag handling in the commit message
- for consistency, rename *_HB_* constants to *_HEARTBEAT_*

 include/uapi/linux/netfilter/nf_conntrack_sctp.h   |   2 +
 include/uapi/linux/netfilter/nfnetlink_cttimeout.h |   2 +
 net/netfilter/nf_conntrack_proto_sctp.c| 115 -
 3 files changed, 95 insertions(+), 24 deletions(-)

diff --git a/include/uapi/linux/netfilter/nf_conntrack_sctp.h 
b/include/uapi/linux/netfilter/nf_conntrack_sctp.h
index ceeefe6681b5..ed4e776e1242 100644
--- a/include/uapi/linux/netfilter/nf_conntrack_sctp.h
+++ b/include/uapi/linux/netfilter/nf_conntrack_sctp.h
@@ -13,6 +13,8 @@ enum sctp_conntrack {
SCTP_CONNTRACK_SHUTDOWN_SENT,
SCTP_CONNTRACK_SHUTDOWN_RECD,
SCTP_CONNTRACK_SHUTDOWN_ACK_SENT,
+   SCTP_CONNTRACK_HEARTBEAT_SENT,
+   SCTP_CONNTRACK_HEARTBEAT_ACKED,
SCTP_CONNTRACK_MAX
 };
 
diff --git a/include/uapi/linux/netfilter/nfnetlink_cttimeout.h 
b/include/uapi/linux/netfilter/nfnetlink_cttimeout.h
index 1ab0b97b3a1e..f2c10dc140d6 100644
--- a/include/uapi/linux/netfilter/nfnetlink_cttimeout.h
+++ b/include/uapi/linux/netfilter/nfnetlink_cttimeout.h
@@ -92,6 +92,8 @@ enum ctattr_timeout_sctp {
CTA_TIMEOUT_SCTP_SHUTDOWN_SENT,
CTA_TIMEOUT_SCTP_SHUTDOWN_RECD,
CTA_TIMEOUT_SCTP_SHUTDOWN_ACK_SENT,
+   CTA_TIMEOUT_SCTP_HEARTBEAT_SENT,
+   CTA_TIMEOUT_SCTP_HEARTBEAT_ACKED,
__CTA_TIMEOUT_SCTP_MAX
 };
 #define CTA_TIMEOUT_SCTP_MAX (__CTA_TIMEOUT_SCTP_MAX - 1)
diff --git a/net/netfilter/nf_conntrack_proto_sctp.c 
b/net/netfilter/nf_conntrack_proto_sctp.c
index b45da90fad32..1aac57e45319 100644
--- a/net/netfilter/nf_conntrack_proto_sctp.c
+++ b/net/netfilter/nf_conntrack_proto_sctp.c
@@ -42,6 +42,8 @@ static const char *const sctp_conntrack_names[] = {
"SHUTDOWN_SENT",
"SHUTDOWN_RECD",
"SHUTDOWN_ACK_SENT",
+   "HEARTBEAT_SENT",
+   "HEARTBEAT_ACKED",
 };
 
 #define SECS  * HZ
@@ -57,6 +59,8 @@ static unsigned int sctp_timeouts[SCTP_CONNTRACK_MAX] __read_mostly = {
[SCTP_CONNTRACK_SHUTDOWN_SENT]  = 300 SECS / 1000,
[SCTP_CONNTRACK_SHUTDOWN_RECD]  = 300 SECS / 1000,
[SCTP_CONNTRACK_SHUTDOWN_ACK_SENT]  = 3 SECS,
+   [SCTP_CONNTRACK_HEARTBEAT_SENT] = 30 SECS,
+

Re: [PATCH v2] jhash: Deinline jhash, jhash2 and __jhash_nwords

2015-07-17 Thread Hagen Paul Pfeifer

 On July 16, 2015 at 9:23 PM Joe Perches j...@perches.com wrote:
 
 It might be useful to have these performance impacting
 changes guarded by something like CONFIG_CC_OPTIMIZE_FOR_SIZE
 with another static __always_inline __func and a function 
 EXPORT_SYMBOL or just a static inline so that where code size
 is critical it's uninlined.

But keep in mind that jhash, jhash2 and __jhash_nwords are *not*
one-instruction functions. Inlining them duplicates code over and over,
probably resulting in more cache misses. __always_inline__ is probably
too strict, and for 99% of all distribution builds a plain inline
already behaves like __always_inline__; see
ARCH_SUPPORTS_OPTIMIZED_INLINING and CONFIG_CC_OPTIMIZE_FOR_SIZE.

The answer depends on the specific workload. Sometimes an enforced inline
performs better and sometimes a call is the better solution (read: fewer
cache misses). General-purpose vendors with a larger working set size
should reduce cache misses by deinlining many functions. For
high-performance special fast-path operations, a strongly inlined kernel
build is probably faster. __always_inline__ takes away the user's choice
of whether such functions are deinlined.
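
To make the trade-off concrete, here is a rough sketch of a build-time
switch between an inlined and an out-of-line hash helper. Everything in it
is hypothetical: TOY_OPTIMIZE_FOR_SIZE stands in for a config option such as
CONFIG_CC_OPTIMIZE_FOR_SIZE, toy_hash is an FNV-1a style mixer rather than
the real jhash, and the noinline attribute assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>

/* Pick the linkage once, so every call site follows the same policy. */
#ifdef TOY_OPTIMIZE_FOR_SIZE
#define TOY_HASH_LINKAGE static __attribute__((noinline))
#else
#define TOY_HASH_LINKAGE static inline
#endif

/* Toy byte-wise hash (FNV-1a style mixing); not jhash. */
TOY_HASH_LINKAGE uint32_t toy_hash(const void *key, size_t len, uint32_t seed)
{
	const uint8_t *p = key;
	uint32_t h = seed;
	size_t i;

	for (i = 0; i < len; i++)
		h = (h ^ p[i]) * 16777619u;
	return h;
}
```

Building with -DTOY_OPTIMIZE_FOR_SIZE keeps a single copy of the function
body; without it the compiler remains free to inline the body at each
caller.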

Hagen

Re: [PATCH] mac80211: Deinline rate_control_rate_init, rate_control_rate_update

2015-07-17 Thread Johannes Berg

On Wed, 2015-07-15 at 14:56 +0200, Denys Vlasenko wrote:
 With this .config: http://busybox.net/~vda/kernel_config,
 after deinlining these functions have sizes and callsite counts
 as follows:
 
 rate_control_rate_init: 554 bytes, 8 calls
 rate_control_rate_update: 1596 bytes, 5 calls
 
 Total size reduction: about 11 kbytes.
 
Both applied.

johannes

Re: [PATCH net-next 0/6] net: bcmgenet: PHY initialization rework

2015-07-17 Thread Jaedon Shin

 On Jul 17, 2015, at 7:51 AM, Florian Fainelli f.faine...@gmail.com wrote:
 
 Hi David, Petri, Jaedon,
 
 This patch series reworks how we perform PHY initialization and resets in the
 GENET driver. Although this contains mostly fixes, some of the changes are a
 bit too intrusive to be backported to 'net' at the moment.
 
 Some of the motivations behind these changes were to reduce the time spent
 performing MDIO transactions, since it is better to perform them when we have
 interrupts enabled. This reduces the bring-up time of GENET from ~600 msecs
 down to ~8 msecs, and about the same time for suspend/resume.
 
 Since I do not currently have a system which is not DT-aware, can you (Petri,
 Jaedon) give this a try and confirm things keep working as expected?
 
 Thanks!
 

I tested your patch series on a Broadcom 40nm set-top box platform that uses
the internal PHY. I do not have exact measurements, but I expect it to improve
the interface-up or link-up time. I compared the changes roughly using the
kernel print times; please see below.

- before patching
[1.865126] bcmgenet 1043.ethernet eth0: Link is Down
[3.941132] bcmgenet 1043.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx

- after patching
[3.145127] bcmgenet 1043.ethernet eth0: Link is Down
[4.189140] bcmgenet 1043.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
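
As a rough check of the numbers above, the gap between the "Link is Down"
and "Link is Up" messages can be computed directly. A small sketch with
hypothetical helper names, assuming the printed timestamps are seconds since
boot with microsecond precision:

```c
/* Convert a printed "secs.usecs" timestamp to microseconds. */
static long long ts_us(long long secs, long long usecs)
{
	return secs * 1000000LL + usecs;
}

/* Time between the "Link is Down" and "Link is Up" messages. */
static long long link_up_delay_us(long long down_us, long long up_us)
{
	return up_us - down_us;
}
```

With the timestamps quoted above, the delay is 2076006 us (about 2.08 s)
before patching and 1044013 us (about 1.04 s) after patching.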

 Florian Fainelli (6):
  net: bcmgenet: Remove excessive PHY reset
  net: bcmgenet: Use correct dev_id for free_irq
  net: bcmgenet: Power on integrated GPHY in bcmgenet_power_up()
  net: bcmgenet: Determine PHY type before scanning MDIO bus
  net: bcmgenet: Delay PHY initialization to bcmgenet_open()
  net: bcmgenet: Remove init parameter from bcmgenet_mii_config
 
 drivers/net/ethernet/broadcom/genet/bcmgenet.c | 33 +-
 drivers/net/ethernet/broadcom/genet/bcmgenet.h |  5 +-
 drivers/net/ethernet/broadcom/genet/bcmmii.c   | 84 --
 3 files changed, 59 insertions(+), 63 deletions(-)
 
 -- 
 2.1.0
 


Re: pull-request: mac80211 2015-07-17

2015-07-17 Thread Johannes Berg

On Fri, 2015-07-17 at 15:31 +0200, Johannes Berg wrote:
 Hi Dave,
 
 We've accumulated some wireless fixes, please pull. Arik's fix is a bit
 bigger than I might like, but it fixes a real locking issue and we
 didn't really see a good way to make a smaller version.
 
 Let me know if there's any problem.

Also, I'm going to be on vacation starting Tuesday, back on August 10.

I'm merging things to mac80211-next, but I'll hold the pull request
until after I return so I can deal with any possible issues in net-next
more quickly.

Kalle has graciously agreed to handle any urgent bugfixes to mac80211
while I'm out, and will probably send them to you as patches (if
necessary.)

johannes

[PATCH net-next 04/22] ipv6: support for fib route lwtunnel encap attributes

2015-07-17 Thread Thomas Graf

From: Roopa Prabhu ro...@cumulusnetworks.com

This patch adds support in ipv6 fib functions to parse Netlink
RTA encap attributes and attach encap state data to rt6_info.

Signed-off-by: Roopa Prabhu ro...@cumulusnetworks.com
---
 include/net/ip6_fib.h |  3 +++
 net/ipv6/ip6_fib.c|  2 ++
 net/ipv6/route.c  | 33 ++---
 3 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 3b76849..276328e 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -51,6 +51,8 @@ struct fib6_config {
struct nlattr   *fc_mp;
 
struct nl_info  fc_nlinfo;
+   struct nlattr   *fc_encap;
+   u16 fc_encap_type;
 };
 
 struct fib6_node {
@@ -131,6 +133,7 @@ struct rt6_info {
/* more non-fragment space at head required */
unsigned short  rt6i_nfheader_len;
u8  rt6i_protocol;
+   struct lwtunnel_state   *rt6i_lwtstate;
 };
 
 static inline struct inet6_dev *ip6_dst_idev(struct dst_entry *dst)
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 55d1986..d715f2e 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -32,6 +32,7 @@
 #include <net/ipv6.h>
 #include <net/ndisc.h>
 #include <net/addrconf.h>
+#include <net/lwtunnel.h>
 
 #include <net/ip6_fib.h>
 #include <net/ip6_route.h>
@@ -177,6 +178,7 @@ static void rt6_free_pcpu(struct rt6_info *non_pcpu_rt)
 static void rt6_release(struct rt6_info *rt)
 {
if (atomic_dec_and_test(&rt->rt6i_ref)) {
+   lwtunnel_state_put(rt->rt6i_lwtstate);
rt6_free_pcpu(rt);
dst_free(&rt->dst);
}
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 6090969..b3431b7 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -58,6 +58,7 @@
 #include <net/netevent.h>
 #include <net/netlink.h>
 #include <net/nexthop.h>
+#include <net/lwtunnel.h>
 
 #include <asm/uaccess.h>
 
@@ -1770,6 +1771,17 @@ int ip6_route_add(struct fib6_config *cfg)
 
rt->dst.output = ip6_output;
 
+   if (cfg->fc_encap) {
+   struct lwtunnel_state *lwtstate;
+
+   err = lwtunnel_build_state(dev, cfg->fc_encap_type,
+  cfg->fc_encap, &lwtstate);
+   if (err)
+   goto out;
+   lwtunnel_state_get(lwtstate);
+   rt->rt6i_lwtstate = lwtstate;
+   }
+
ipv6_addr_prefix(&rt->rt6i_dst.addr, &cfg->fc_dst, cfg->fc_dst_len);
rt->rt6i_dst.plen = cfg->fc_dst_len;
if (rt->rt6i_dst.plen == 128)
@@ -2595,6 +2607,8 @@ static const struct nla_policy rtm_ipv6_policy[RTA_MAX+1] = {
[RTA_METRICS]   = { .type = NLA_NESTED },
[RTA_MULTIPATH] = { .len = sizeof(struct rtnexthop) },
[RTA_PREF]  = { .type = NLA_U8 },
+   [RTA_ENCAP_TYPE]= { .type = NLA_U16 },
+   [RTA_ENCAP] = { .type = NLA_NESTED },
 };
 
 static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
@@ -2689,6 +2703,12 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
cfg->fc_flags |= RTF_PREF(pref);
}
 
+   if (tb[RTA_ENCAP])
+   cfg->fc_encap = tb[RTA_ENCAP];
+
+   if (tb[RTA_ENCAP_TYPE])
+   cfg->fc_encap_type = nla_get_u16(tb[RTA_ENCAP_TYPE]);
+
err = 0;
 errout:
return err;
@@ -2721,6 +2741,10 @@ beginning:
r_cfg.fc_gateway = nla_get_in6_addr(nla);
r_cfg.fc_flags |= RTF_GATEWAY;
}
+   r_cfg.fc_encap = nla_find(attrs, attrlen, RTA_ENCAP);
+   nla = nla_find(attrs, attrlen, RTA_ENCAP_TYPE);
+   if (nla)
+   r_cfg.fc_encap_type = nla_get_u16(nla);
}
err = add ? ip6_route_add(&r_cfg) : ip6_route_del(&r_cfg);
if (err) {
@@ -2783,7 +2807,7 @@ static int inet6_rtm_newroute(struct sk_buff *skb, struct nlmsghdr *nlh)
return ip6_route_add(&cfg);
 }
 
-static inline size_t rt6_nlmsg_size(void)
+static inline size_t rt6_nlmsg_size(struct rt6_info *rt)
 {
return NLMSG_ALIGN(sizeof(struct rtmsg))
   + nla_total_size(16) /* RTA_SRC */
@@ -2797,7 +2821,8 @@ static inline size_t rt6_nlmsg_size(void)
   + RTAX_MAX * nla_total_size(4) /* RTA_METRICS */
   + nla_total_size(sizeof(struct rta_cacheinfo))
   + nla_total_size(TCP_CA_NAME_MAX) /* RTAX_CC_ALGO */
-  + nla_total_size(1); /* RTA_PREF */
+  + nla_total_size(1) /* RTA_PREF */
+  + lwtunnel_get_encap_size(rt->rt6i_lwtstate);
 }
 
 static int rt6_fill_node(struct net *net,
@@ -2945,6 +2970,8 @@ static int rt6_fill_node(struct net *net,
if (nla_put_u8(skb, RTA_PREF, IPV6_EXTRACT_PREF(rt->rt6i_flags)))
goto

[PATCH net-next 03/22] ipv4: support for fib route lwtunnel encap attributes

2015-07-17 Thread Thomas Graf

From: Roopa Prabhu ro...@cumulusnetworks.com

This patch adds support in ipv4 fib functions to parse user
provided encap attributes and attach encap state data to fib_nh
and rtable.

Signed-off-by: Roopa Prabhu ro...@cumulusnetworks.com
---
 include/net/ip_fib.h |  5 ++-
 include/net/route.h  |  1 +
 net/ipv4/fib_frontend.c  |  8 
 net/ipv4/fib_semantics.c | 96 +++-
 net/ipv4/route.c | 16 +++-
 5 files changed, 122 insertions(+), 4 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 49c142b..5e01960 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -44,7 +44,9 @@ struct fib_config {
u32 fc_flow;
u32 fc_nlflags;
struct nl_info  fc_nlinfo;
- };
+   struct nlattr   *fc_encap;
+   u16 fc_encap_type;
+};
 
 struct fib_info;
 struct rtable;
@@ -89,6 +91,7 @@ struct fib_nh {
struct rtable __rcu * __percpu *nh_pcpu_rth_output;
struct rtable __rcu *nh_rth_input;
struct fnhe_hash_bucket __rcu *nh_exceptions;
+   struct lwtunnel_state   *nh_lwtstate;
 };
 
 /*
diff --git a/include/net/route.h b/include/net/route.h
index fe22d03..2d45f41 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -66,6 +66,7 @@ struct rtable {
 
struct list_headrt_uncached;
struct uncached_list*rt_uncached_list;
+   struct lwtunnel_state   *rt_lwtstate;
 };
 
 static inline bool rt_is_input_route(const struct rtable *rt)
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 6bbc549..9b2019c 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -591,6 +591,8 @@ const struct nla_policy rtm_ipv4_policy[RTA_MAX + 1] = {
[RTA_METRICS]   = { .type = NLA_NESTED },
[RTA_MULTIPATH] = { .len = sizeof(struct rtnexthop) },
[RTA_FLOW]  = { .type = NLA_U32 },
+   [RTA_ENCAP_TYPE]= { .type = NLA_U16 },
+   [RTA_ENCAP] = { .type = NLA_NESTED },
 };
 
 static int rtm_to_fib_config(struct net *net, struct sk_buff *skb,
@@ -656,6 +658,12 @@ static int rtm_to_fib_config(struct net *net, struct sk_buff *skb,
case RTA_TABLE:
cfg->fc_table = nla_get_u32(attr);
break;
+   case RTA_ENCAP:
+   cfg->fc_encap = attr;
+   break;
+   case RTA_ENCAP_TYPE:
+   cfg->fc_encap_type = nla_get_u16(attr);
+   break;
}
}
 
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index c7358ea..6754c64 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -42,6 +42,7 @@
 #include <net/ip_fib.h>
 #include <net/netlink.h>
 #include <net/nexthop.h>
+#include <net/lwtunnel.h>
 
 #include "fib_lookup.h"
 
@@ -208,6 +209,7 @@ static void free_fib_info_rcu(struct rcu_head *head)
change_nexthops(fi) {
if (nexthop_nh->nh_dev)
dev_put(nexthop_nh->nh_dev);
+   lwtunnel_state_put(nexthop_nh->nh_lwtstate);
free_nh_exceptions(nexthop_nh);
rt_fibinfo_free_cpus(nexthop_nh->nh_pcpu_rth_output);
rt_fibinfo_free(&nexthop_nh->nh_rth_input);
@@ -266,6 +268,7 @@ static inline int nh_comp(const struct fib_info *fi, const struct fib_info *ofi)
 #ifdef CONFIG_IP_ROUTE_CLASSID
nh->nh_tclassid != onh->nh_tclassid ||
 #endif
+   lwtunnel_cmp_encap(nh->nh_lwtstate, onh->nh_lwtstate) ||
((nh->nh_flags ^ onh->nh_flags) & ~RTNH_COMPARE_MASK))
return -1;
onh++;
@@ -366,6 +369,7 @@ static inline size_t fib_nlmsg_size(struct fib_info *fi)
payload += nla_total_size((RTAX_MAX * nla_total_size(4)));
 
if (fi->fib_nhs) {
+   size_t nh_encapsize = 0;
/* Also handles the special case fib_nhs == 1 */
 
/* each nexthop is packed in an attribute */
@@ -374,8 +378,21 @@ static inline size_t fib_nlmsg_size(struct fib_info *fi)
/* may contain flow and gateway attribute */
nhsize += 2 * nla_total_size(4);
 
+   /* grab encap info */
+   for_nexthops(fi) {
+   if (nh->nh_lwtstate) {
+   /* RTA_ENCAP_TYPE */
+   nh_encapsize += lwtunnel_get_encap_size(
+   nh->nh_lwtstate);
+   /* RTA_ENCAP */
+   nh_encapsize +=  nla_total_size(2);
+   }
+   } endfor_nexthops(fi);
+
/* all nexthops are packed in a nested attribute */
-   payload += nla_total_size(fi->fib_nhs * nhsize);
+   payload +=

[PATCH net-next 18/22] vxlan: Factor out device configuration

2015-07-17 Thread Thomas Graf

This factors the device configuration out of the RTNL newlink API,
which allows for in-kernel creation of VXLAN net_devices.

Signed-off-by: Thomas Graf tg...@suug.ch
---
 drivers/net/vxlan.c | 332 
 include/net/vxlan.h |  59 ++
 2 files changed, 236 insertions(+), 155 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 23378db..5ae6c0c 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -55,10 +55,6 @@
 
 #define PORT_HASH_BITS 8
 #define PORT_HASH_SIZE  (1<<PORT_HASH_BITS)
-#define VNI_HASH_BITS  10
-#define VNI_HASH_SIZE  (1<<VNI_HASH_BITS)
-#define FDB_HASH_BITS  8
-#define FDB_HASH_SIZE  (1<<FDB_HASH_BITS)
 #define FDB_AGE_DEFAULT 300 /* 5 min */
 #define FDB_AGE_INTERVAL (10 * HZ) /* rescan interval */
 
@@ -75,6 +71,7 @@ module_param(log_ecn_error, bool, 0644);
 MODULE_PARM_DESC(log_ecn_error, "Log packets received with corrupted ECN");
 
 static int vxlan_net_id;
+static struct rtnl_link_ops vxlan_link_ops;
 
 static const u8 all_zeros_mac[ETH_ALEN];
 
@@ -85,21 +82,6 @@ struct vxlan_net {
spinlock_t sock_lock;
 };
 
-union vxlan_addr {
-   struct sockaddr_in sin;
-   struct sockaddr_in6 sin6;
-   struct sockaddr sa;
-};
-
-struct vxlan_rdst {
-   union vxlan_addr remote_ip;
-   __be16   remote_port;
-   u32  remote_vni;
-   u32  remote_ifindex;
-   struct list_head list;
-   struct rcu_head  rcu;
-};
-
 /* Forwarding table entry */
 struct vxlan_fdb {
struct hlist_node hlist;/* linked list of entries */
@@ -112,31 +94,6 @@ struct vxlan_fdb {
u8 eth_addr[ETH_ALEN];
 };
 
-/* Pseudo network device */
-struct vxlan_dev {
-   struct hlist_node hlist;    /* vni hash table */
-   struct list_head  next;     /* vxlan's per namespace list */
-   struct vxlan_sock *vn_sock; /* listening socket */
-   struct net_device *dev;
-   struct net        *net;     /* netns for packet i/o */
-   struct vxlan_rdst default_dst;  /* default destination */
-   union vxlan_addr  saddr;    /* source address */
-   __be16            dst_port;
-   __u16             port_min; /* source port range */
-   __u16             port_max;
-   __u8              tos;      /* TOS override */
-   __u8              ttl;
-   u32               flags;    /* VXLAN_F_* in vxlan.h */
-
-   unsigned long     age_interval;
-   struct timer_list age_timer;
-   spinlock_t        hash_lock;
-   unsigned int      addrcnt;
-   unsigned int      addrmax;
-
-   struct hlist_head fdb_head[FDB_HASH_SIZE];
-};
-
 /* salt for hash table */
 static u32 vxlan_salt __read_mostly;
 static struct workqueue_struct *vxlan_wq;
@@ -352,7 +309,7 @@ static int vxlan_fdb_info(struct sk_buff *skb, struct vxlan_dev *vxlan,
if (send_ip && vxlan_nla_put_addr(skb, NDA_DST, &rdst->remote_ip))
goto nla_put_failure;
 
-   if (rdst->remote_port && rdst->remote_port != vxlan->dst_port &&
+   if (rdst->remote_port && rdst->remote_port != vxlan->cfg.dst_port &&
nla_put_be16(skb, NDA_PORT, rdst->remote_port))
goto nla_put_failure;
if (rdst->remote_vni != vxlan->default_dst.remote_vni &&
@@ -756,7 +713,8 @@ static int vxlan_fdb_create(struct vxlan_dev *vxlan,
if (!(flags & NLM_F_CREATE))
return -ENOENT;
 
-   if (vxlan->addrmax && vxlan->addrcnt >= vxlan->addrmax)
+   if (vxlan->cfg.addrmax &&
+   vxlan->addrcnt >= vxlan->cfg.addrmax)
return -ENOSPC;
 
/* Disallow replace to add a multicast entry */
@@ -842,7 +800,7 @@ static int vxlan_fdb_parse(struct nlattr *tb[], struct vxlan_dev *vxlan,
return -EINVAL;
*port = nla_get_be16(tb[NDA_PORT]);
} else {
-   *port = vxlan->dst_port;
+   *port = vxlan->cfg.dst_port;
}
 
if (tb[NDA_VNI]) {
@@ -1028,7 +986,7 @@ static bool vxlan_snoop(struct net_device *dev,
vxlan_fdb_create(vxlan, src_mac, src_ip,
 NUD_REACHABLE,
 NLM_F_EXCL|NLM_F_CREATE,
-vxlan->dst_port,
+vxlan->cfg.dst_port,
 vxlan->default_dst.remote_vni,
 0, NTF_SELF);
spin_unlock(&vxlan->hash_lock);
@@ -1957,7 +1915,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
info = skb_tunnel_info(skb, AF_INET);
 
if (rdst) {
-   dst_port = rdst->remote_port ? rdst->remote_port : vxlan->dst_port;
+   dst_port = rdst->remote_port ? rdst->remote_port :

[PATCH net-next 14/22] vxlan: Flow based tunneling

2015-07-17 Thread Thomas Graf

Allows putting a VXLAN device into a new flow-based mode in which
skbs with an ip_tunnel_info dst metadata attached will be encapsulated
according to the instructions stored in there, with the VXLAN device
defaults taken into consideration.

Similarly, on the receive side, if the VXLAN_F_COLLECT_METADATA flag is
set, the packet processing will populate an ip_tunnel_info struct for
each packet received and attach it to the skb using the new metadata
dst.  The metadata structure will contain the outer header and tunnel
header fields which have been stripped off. Layers further up in the
stack such as routing, tc or netfilter can later match on these fields
and perform forwarding. It is the responsibility of upper layers to
ensure that the flag is set if the metadata is needed. The flag limits
the additional cost of metadata collection to cases where it is needed.
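
A minimal user-space sketch of the receive-side behaviour described
above, not the driver code itself: metadata is only collected when a
flag is set, and the 24-bit VNI is read from bytes 4..6 of the 8-byte
VXLAN header. All sketch_* names and SKETCH_F_COLLECT_METADATA are
hypothetical stand-ins, and the struct is a simplified assumption
rather than the real ip_tunnel_info layout.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for VXLAN_F_COLLECT_METADATA. */
#define SKETCH_F_COLLECT_METADATA 0x1u

/* Hypothetical, simplified stand-in for struct ip_tunnel_info. */
struct sketch_tunnel_info {
	uint32_t vni;       /* 24-bit VNI, host byte order */
	uint32_t ipv4_src;  /* outer source address, as passed in */
	uint32_t ipv4_dst;  /* outer destination address, as passed in */
	uint8_t  tos;
	uint8_t  ttl;
	bool     valid;     /* true only if metadata was collected */
};

/* The VNI occupies bytes 4..6 of the 8-byte VXLAN header. */
static uint32_t sketch_vxlan_vni(const uint8_t hdr[8])
{
	return ((uint32_t)hdr[4] << 16) | ((uint32_t)hdr[5] << 8) | hdr[6];
}

/* Fill *info from the outer headers, but only if collection is enabled. */
static void sketch_vxlan_rx(unsigned int sock_flags, const uint8_t hdr[8],
			    uint32_t saddr, uint32_t daddr,
			    uint8_t tos, uint8_t ttl,
			    struct sketch_tunnel_info *info)
{
	memset(info, 0, sizeof(*info));
	if (!(sock_flags & SKETCH_F_COLLECT_METADATA))
		return;	/* nobody asked for metadata: skip the extra work */

	info->vni = sketch_vxlan_vni(hdr);
	info->ipv4_src = saddr;
	info->ipv4_dst = daddr;
	info->tos = tos;
	info->ttl = ttl;
	info->valid = true;
}
```

With the flag cleared the function returns after the memset, which is
the "cost only on demand" property the paragraph above refers to.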

This prepares the VXLAN device to be steered by the routing and other
subsystems, which allows supporting encapsulation for a large number
of tunnel endpoints and tunnel ids through a single net_device and
thereby improves scalability.

It also allows for OVS to leverage this mode which in turn allows for
the removal of the OVS specific VXLAN code.

Because the skb is currently scrubbed in vxlan_rcv(), the attachment of
the new dst metadata is postponed until after scrubbing, which requires
the temporary addition of a new member to vxlan_metadata. This member
is removed again in a later commit after the indirect VXLAN receive API
has been removed.

Signed-off-by: Thomas Graf tg...@suug.ch
Signed-off-by: Pravin B Shelar pshe...@nicira.com
---
 drivers/net/vxlan.c  | 155 +--
 include/linux/skbuff.h   |   1 +
 include/net/dst_metadata.h   |  13 
 include/net/ip_tunnels.h |  14 
 include/net/vxlan.h  |  10 ++-
 include/uapi/linux/if_link.h |   1 +
 6 files changed, 171 insertions(+), 23 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 34c519e..994d89c 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -49,6 +49,7 @@
 #include <net/ip6_tunnel.h>
 #include <net/ip6_checksum.h>
 #endif
+#include <net/dst_metadata.h>
 
 #define VXLAN_VERSION  "0.1"
 
@@ -140,6 +141,11 @@ struct vxlan_dev {
 static u32 vxlan_salt __read_mostly;
 static struct workqueue_struct *vxlan_wq;
 
+static inline bool vxlan_collect_metadata(struct vxlan_sock *vs)
+{
+   return vs->flags & VXLAN_F_COLLECT_METADATA;
+}
+
 #if IS_ENABLED(CONFIG_IPV6)
 static inline
 bool vxlan_addr_equal(const union vxlan_addr *a, const union vxlan_addr *b)
@@ -1164,10 +1170,13 @@ static struct vxlanhdr *vxlan_remcsum(struct sk_buff *skb, struct vxlanhdr *vh,
 /* Callback from net/ipv4/udp.c to receive packets */
 static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
 {
+   struct metadata_dst *tun_dst = NULL;
+   struct ip_tunnel_info *info;
struct vxlan_sock *vs;
struct vxlanhdr *vxh;
u32 flags, vni;
-   struct vxlan_metadata md = {0};
+   struct vxlan_metadata _md;
+   struct vxlan_metadata *md = &_md;
 
/* Need Vxlan and inner Ethernet header to be present */
if (!pskb_may_pull(skb, VXLAN_HLEN))
@@ -1202,6 +1211,33 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
vni &= VXLAN_VNI_MASK;
}
 
+   if (vxlan_collect_metadata(vs)) {
+   const struct iphdr *iph = ip_hdr(skb);
+
+   tun_dst = metadata_dst_alloc(sizeof(*md), GFP_ATOMIC);
+   if (!tun_dst)
+   goto drop;
+
+   info = &tun_dst->u.tun_info;
+   info->key.ipv4_src = iph->saddr;
+   info->key.ipv4_dst = iph->daddr;
+   info->key.ipv4_tos = iph->tos;
+   info->key.ipv4_ttl = iph->ttl;
+   info->key.tp_src = udp_hdr(skb)->source;
+   info->key.tp_dst = udp_hdr(skb)->dest;
+
+   info->mode = IP_TUNNEL_INFO_RX;
+   info->key.tun_flags = TUNNEL_KEY;
+   info->key.tun_id = cpu_to_be64(vni >> 8);
+   if (udp_hdr(skb)->check != 0)
+   info->key.tun_flags |= TUNNEL_CSUM;
+
+   md = ip_tunnel_info_opts(info, sizeof(*md));
+   md->tun_dst = tun_dst;
+   } else {
+   memset(md, 0, sizeof(*md));
+   }
+
/* For backwards compatibility, only allow reserved fields to be
 * used by VXLAN extensions if explicitly requested.
 */
@@ -1209,13 +1245,16 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
struct vxlanhdr_gbp *gbp;
 
gbp = (struct vxlanhdr_gbp *)vxh;
-   md.gbp = ntohs(gbp->policy_id);
+   md->gbp = ntohs(gbp->policy_id);
+
+   if (tun_dst)
+   info->key.tun_flags |= TUNNEL_VXLAN_OPT;
 
if (gbp->dont_learn)
-   md.gbp |= VXLAN_GBP_DONT_LEARN;
+

[PATCH net-next 15/22] route: Extend flow representation with tunnel key

2015-07-17 Thread Thomas Graf

Add a new flowi_tunnel structure which is a subset of ip_tunnel_key to
allow routes to match on tunnel metadata. For now, the tunnel id is
added to flowi_tunnel which allows for routes to be bound to specific
virtual tunnels.
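
A small user-space sketch of what binding a route to a tunnel id can
look like: the flow key carries a tun_id next to the destination, and a
route only matches when both agree. The sketch_* types are hypothetical
and much simpler than flowi4 or the FIB; treating a route tun_id of 0
as a wildcard is an assumption of this sketch, not something this
patch states.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified flow key: destination plus tunnel id. */
struct sketch_flow {
	uint32_t daddr;   /* host byte order, for simplicity */
	uint64_t tun_id;  /* tunnel id the packet was received on */
};

/* Hypothetical route entry bound to an optional tunnel id. */
struct sketch_route {
	uint32_t prefix;
	uint32_t mask;
	uint64_t tun_id;  /* assumption: 0 matches any tunnel id */
	int      table;
};

/* Return the first route whose prefix and tunnel id both match. */
static const struct sketch_route *
sketch_route_lookup(const struct sketch_route *routes, size_t n,
		    const struct sketch_flow *fl)
{
	for (size_t i = 0; i < n; i++) {
		const struct sketch_route *rt = &routes[i];

		if ((fl->daddr & rt->mask) != rt->prefix)
			continue;
		if (rt->tun_id && rt->tun_id != fl->tun_id)
			continue;
		return rt;
	}
	return NULL;
}
```

Two routes for the same prefix can then lead to different tables
depending on which virtual tunnel the packet arrived on.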

Signed-off-by: Thomas Graf tg...@suug.ch
---
 include/net/flow.h | 7 +++
 net/ipv4/route.c   | 6 ++
 2 files changed, 13 insertions(+)

diff --git a/include/net/flow.h b/include/net/flow.h
index 8109a15..c15fb5e 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -19,6 +19,10 @@
 
 #define LOOPBACK_IFINDEX   1
 
+struct flowi_tunnel {
+   __be64  tun_id;
+};
+
 struct flowi_common {
int flowic_oif;
int flowic_iif;
@@ -30,6 +34,7 @@ struct flowi_common {
 #define FLOWI_FLAG_ANYSRC  0x01
 #define FLOWI_FLAG_KNOWN_NH 0x02
__u32   flowic_secid;
+   struct flowi_tunnel flowic_tun_key;
 };
 
 union flowi_uli {
@@ -66,6 +71,7 @@ struct flowi4 {
 #define flowi4_proto   __fl_common.flowic_proto
 #define flowi4_flags   __fl_common.flowic_flags
 #define flowi4_secid   __fl_common.flowic_secid
+#define flowi4_tun_key __fl_common.flowic_tun_key
 
/* (saddr,daddr) must be grouped, same order as in IP header */
__be32  saddr;
@@ -165,6 +171,7 @@ struct flowi {
 #define flowi_proto    u.__fl_common.flowic_proto
 #define flowi_flags    u.__fl_common.flowic_flags
 #define flowi_secid    u.__fl_common.flowic_secid
+#define flowi_tun_key  u.__fl_common.flowic_tun_key
 } __attribute__((__aligned__(BITS_PER_LONG/8)));
 
 static inline struct flowi *flowi4_to_flowi(struct flowi4 *fl4)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 4c8e84e..931015c 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -91,6 +91,7 @@
 #include <linux/slab.h>
 #include <linux/jhash.h>
 #include <net/dst.h>
+#include <net/dst_metadata.h>
 #include <net/net_namespace.h>
 #include <net/protocol.h>
 #include <net/ip.h>
@@ -110,6 +111,7 @@
 #include <linux/kmemleak.h>
 #endif
 #include <net/secure_seq.h>
+#include <net/ip_tunnels.h>
 
 #define RT_FL_TOS(oldflp4) \
((oldflp4)->flowi4_tos & (IPTOS_RT_MASK | RTO_ONLINK))
@@ -1673,6 +1675,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 {
struct fib_result res;
struct in_device *in_dev = __in_dev_get_rcu(dev);
+   struct ip_tunnel_info *tun_info;
struct flowi4   fl4;
unsigned int flags = 0;
u32 itag = 0;
@@ -1690,6 +1693,9 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
   by fib_lookup.
 */
 
+   tun_info = skb_tunnel_info(skb);
+   if (tun_info && tun_info->mode == IP_TUNNEL_INFO_RX)
+   fl4.flowi4_tun_key.tun_id = tun_info->key.tun_id;
skb_dst_drop(skb);
 
if (ipv4_is_multicast(saddr) || ipv4_is_lbcast(saddr))
-- 
2.4.3


[PATCH net-next 12/22] dst: Metadata destinations

2015-07-17 Thread Thomas Graf

Introduces a new dst_metadata which enables carrying per-packet metadata
between forwarding and processing elements via the skb->dst pointer.

The structure is set up to be a union. Thus, each separate type of
metadata requires its own dst instance. If demand arises to carry
multiple types of metadata concurrently, metadata dst entries can be
made stackable.

The metadata dst entry is refcnt'ed as expected for now but a non
reference counted use is possible if the reference is forced before
queueing the skb.

In order to allow allocating dsts with variable length, the existing
dst_alloc() is split into a dst_alloc() and dst_init() function. The
existing dst_init() function to initialize the subsystem is being
renamed to dst_subsys_init() to make it clear what is what.
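
A rough user-space sketch of the alloc/init split described above:
once initialisation is its own function, a second allocator can reserve
extra trailing space for options and still reuse the same init path.
The sketch_* names are hypothetical, malloc()/calloc() stand in for the
kernel allocators, and the trailing opts[] array is an assumption of
this sketch rather than the layout used by the patch.

```c
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_DST_METADATA 0x0200  /* mirrors DST_METADATA from the patch */

/* Hypothetical, heavily reduced stand-in for struct dst_entry. */
struct sketch_dst {
	int            refcnt;
	unsigned short flags;
};

/* Metadata variant: a regular dst followed by variable-length options. */
struct sketch_metadata_dst {
	struct sketch_dst dst;
	size_t            opts_len;
	unsigned char     opts[];  /* sized at allocation time */
};

/* Initialisation only; the caller owns the memory. */
static void sketch_dst_init(struct sketch_dst *dst, int initial_ref,
			    unsigned short flags)
{
	dst->refcnt = initial_ref;
	dst->flags = flags;
}

/* Fixed-size allocation, then the shared init. */
static struct sketch_dst *sketch_dst_alloc(int initial_ref,
					   unsigned short flags)
{
	struct sketch_dst *dst = malloc(sizeof(*dst));

	if (!dst)
		return NULL;
	sketch_dst_init(dst, initial_ref, flags);
	return dst;
}

/* Variable-size allocation reusing the same init. */
static struct sketch_metadata_dst *sketch_metadata_dst_alloc(size_t optslen)
{
	struct sketch_metadata_dst *md = calloc(1, sizeof(*md) + optslen);

	if (!md)
		return NULL;
	sketch_dst_init(&md->dst, 1, SKETCH_DST_METADATA);
	md->opts_len = optslen;
	return md;
}

/* A metadata dst is not a usable routing result. */
static int sketch_valid_dst(const struct sketch_dst *dst)
{
	return dst && !(dst->flags & SKETCH_DST_METADATA);
}
```

The flag check at the end is the same idea as skb_valid_dst() in the
diff below: callers that expect a real route treat a metadata dst as
"no dst".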

The check before ip_route_input() is changed to ignore metadata dsts
and drop the dst inside the routing function, thus allowing metadata
to be interpreted in a later commit.

Signed-off-by: Thomas Graf tg...@suug.ch
---
 include/net/dst.h  |  6 +++-
 include/net/dst_metadata.h | 32 ++
 net/core/dev.c |  2 +-
 net/core/dst.c | 84 ++
 net/ipv4/ip_input.c|  3 +-
 net/ipv4/route.c   |  2 ++
 6 files changed, 112 insertions(+), 17 deletions(-)
 create mode 100644 include/net/dst_metadata.h

diff --git a/include/net/dst.h b/include/net/dst.h
index 2bc73f8a..2578811 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -57,6 +57,7 @@ struct dst_entry {
 #define DST_FAKE_RTABLE        0x0040
 #define DST_XFRM_TUNNEL        0x0080
 #define DST_XFRM_QUEUE         0x0100
+#define DST_METADATA           0x0200
 
unsigned short  pending_confirm;
 
@@ -356,6 +357,9 @@ static inline int dst_discard(struct sk_buff *skb)
 }
 void *dst_alloc(struct dst_ops *ops, struct net_device *dev, int initial_ref,
int initial_obsolete, unsigned short flags);
+void dst_init(struct dst_entry *dst, struct dst_ops *ops,
+ struct net_device *dev, int initial_ref, int initial_obsolete,
+ unsigned short flags);
 void __dst_free(struct dst_entry *dst);
 struct dst_entry *dst_destroy(struct dst_entry *dst);
 
@@ -457,7 +461,7 @@ static inline struct dst_entry *dst_check(struct dst_entry *dst, u32 cookie)
return dst;
 }
 
-void dst_init(void);
+void dst_subsys_init(void);
 
 /* Flags for xfrm_lookup flags argument. */
 enum {
diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
new file mode 100644
index 000..4f7694f
--- /dev/null
+++ b/include/net/dst_metadata.h
@@ -0,0 +1,32 @@
+#ifndef __NET_DST_METADATA_H
+#define __NET_DST_METADATA_H 1
+
+#include <linux/skbuff.h>
+#include <net/ip_tunnels.h>
+#include <net/dst.h>
+
+struct metadata_dst {
+   struct dst_entry    dst;
+   size_t              opts_len;
+};
+
+static inline struct metadata_dst *skb_metadata_dst(struct sk_buff *skb)
+{
+   struct metadata_dst *md_dst = (struct metadata_dst *) skb_dst(skb);
+
+   if (md_dst && md_dst->dst.flags & DST_METADATA)
+   return md_dst;
+
+   return NULL;
+}
+
+static inline bool skb_valid_dst(const struct sk_buff *skb)
+{
+   struct dst_entry *dst = skb_dst(skb);
+
+   return dst && !(dst->flags & DST_METADATA);
+}
+
+struct metadata_dst *metadata_dst_alloc(u8 optslen, gfp_t flags);
+
+#endif /* __NET_DST_METADATA_H */
diff --git a/net/core/dev.c b/net/core/dev.c
index 8810b6b..61e3dcb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7659,7 +7659,7 @@ static int __init net_dev_init(void)
open_softirq(NET_RX_SOFTIRQ, net_rx_action);
 
hotcpu_notifier(dev_cpu_callback, 0);
-   dst_init();
+   dst_subsys_init();
rc = 0;
 out:
return rc;
diff --git a/net/core/dst.c b/net/core/dst.c
index e956ce6..917364f 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -22,6 +22,7 @@
 #include <linux/prefetch.h>
 
 #include <net/dst.h>
+#include <net/dst_metadata.h>
 
 /*
  * Theory of operations:
@@ -158,19 +159,10 @@ const u32 dst_default_metrics[RTAX_MAX + 1] = {
[RTAX_MAX] = 0xdeadbeef,
 };
 
-
-void *dst_alloc(struct dst_ops *ops, struct net_device *dev,
-   int initial_ref, int initial_obsolete, unsigned short flags)
+void dst_init(struct dst_entry *dst, struct dst_ops *ops,
+ struct net_device *dev, int initial_ref, int initial_obsolete,
+ unsigned short flags)
 {
-   struct dst_entry *dst;
-
-   if (ops->gc && dst_entries_get_fast(ops) > ops->gc_thresh) {
-   if (ops->gc(ops))
-   return NULL;
-   }
-   dst = kmem_cache_alloc(ops->kmem_cachep, GFP_ATOMIC);
-   if (!dst)
-   return NULL;
dst->child = NULL;
dst->dev = dev;
if (dev)
@@ -200,6 +192,25 @@ void *dst_alloc(struct dst_ops *ops, struct net_device *dev,
dst->next = NULL;
if (!(flags & DST_NOCOUNT))

[PATCH net-next 09/22] mpls: ip tunnel support

2015-07-17 Thread Thomas Graf

From: Roopa Prabhu ro...@cumulusnetworks.com

This implementation uses lwtunnel infrastructure to register
hooks for mpls tunnel encaps.

It takes cues from the iptunnel_encaps infrastructure and previous
mpls iptunnel RFC patches from Eric W. Biederman and Robert Shearman.
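
To make the encap state more concrete, here is a user-space sketch of
how a small label array like the one in struct mpls_iptunnel_encap
(see the diff below) could be expanded into MPLS label stack entries.
The bit layout (20-bit label, 3-bit traffic class, bottom-of-stack bit,
8-bit TTL) is the standard one from RFC 3032; the sketch_* names, the
zero traffic class and the host-byte-order output are assumptions of
this sketch, and it does not claim to match what mpls_iptunnel.c does.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_NEW_LABELS 2  /* mirrors MAX_NEW_LABELS in the patch */

/* Hypothetical copy of the encap state: a short list of labels. */
struct sketch_mpls_encap {
	uint32_t label[SKETCH_MAX_NEW_LABELS];
	uint32_t labels;  /* number of valid entries in label[] */
};

/* Build one label stack entry in host byte order:
 * label (20 bits) | traffic class (3) | bottom-of-stack (1) | TTL (8).
 */
static uint32_t sketch_mpls_entry(uint32_t label, uint8_t tc, int bos,
				  uint8_t ttl)
{
	return ((label & 0xFFFFFu) << 12) |
	       ((uint32_t)(tc & 0x7u) << 9) |
	       ((bos ? 1u : 0u) << 8) |
	       ttl;
}

/* Expand the encap state into stack entries, outermost label first.
 * Only the last entry carries the bottom-of-stack bit.
 */
static size_t sketch_mpls_build_stack(const struct sketch_mpls_encap *encap,
				      uint8_t ttl,
				      uint32_t out[SKETCH_MAX_NEW_LABELS])
{
	size_t n = encap->labels;

	if (n > SKETCH_MAX_NEW_LABELS)
		n = SKETCH_MAX_NEW_LABELS;
	for (size_t i = 0; i < n; i++)
		out[i] = sketch_mpls_entry(encap->label[i], 0, i == n - 1, ttl);
	return n;
}
```

Each entry would still need converting to network byte order before
being written in front of the IP header.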

Signed-off-by: Roopa Prabhu ro...@cumulusnetworks.com
---
 include/linux/mpls_iptunnel.h  |   6 +
 include/net/mpls_iptunnel.h|  29 +
 include/uapi/linux/mpls_iptunnel.h |  28 +
 net/mpls/Kconfig   |   8 +-
 net/mpls/Makefile  |   1 +
 net/mpls/mpls_iptunnel.c   | 233 +
 6 files changed, 304 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/mpls_iptunnel.h
 create mode 100644 include/net/mpls_iptunnel.h
 create mode 100644 include/uapi/linux/mpls_iptunnel.h
 create mode 100644 net/mpls/mpls_iptunnel.c

diff --git a/include/linux/mpls_iptunnel.h b/include/linux/mpls_iptunnel.h
new file mode 100644
index 000..ef29eb2
--- /dev/null
+++ b/include/linux/mpls_iptunnel.h
@@ -0,0 +1,6 @@
+#ifndef _LINUX_MPLS_IPTUNNEL_H
+#define _LINUX_MPLS_IPTUNNEL_H
+
+#include <uapi/linux/mpls_iptunnel.h>
+
+#endif  /* _LINUX_MPLS_IPTUNNEL_H */
diff --git a/include/net/mpls_iptunnel.h b/include/net/mpls_iptunnel.h
new file mode 100644
index 000..4757997
--- /dev/null
+++ b/include/net/mpls_iptunnel.h
@@ -0,0 +1,29 @@
+/*
+ * Copyright (c) 2015 Cumulus Networks, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#ifndef _NET_MPLS_IPTUNNEL_H
+#define _NET_MPLS_IPTUNNEL_H 1
+
+#define MAX_NEW_LABELS 2
+
+struct mpls_iptunnel_encap {
+   u32 label[MAX_NEW_LABELS];
+   u32 labels;
+};
+
+static inline struct mpls_iptunnel_encap *mpls_lwtunnel_encap(struct lwtunnel_state *lwtstate)
+{
+   return (struct mpls_iptunnel_encap *)lwtstate->data;
+}
+
+#endif
diff --git a/include/uapi/linux/mpls_iptunnel.h b/include/uapi/linux/mpls_iptunnel.h
new file mode 100644
index 000..d80a049
--- /dev/null
+++ b/include/uapi/linux/mpls_iptunnel.h
@@ -0,0 +1,28 @@
+/*
+ * mpls tunnel api
+ *
+ * Authors:
+ * Roopa Prabhu ro...@cumulusnetworks.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _UAPI_LINUX_MPLS_IPTUNNEL_H
+#define _UAPI_LINUX_MPLS_IPTUNNEL_H
+
+/* MPLS tunnel attributes
+ * [RTA_ENCAP] = {
+ * [MPLS_IPTUNNEL_DST]
+ * }
+ */
+enum {
+   MPLS_IPTUNNEL_UNSPEC,
+   MPLS_IPTUNNEL_DST,
+   __MPLS_IPTUNNEL_MAX,
+};
+#define MPLS_IPTUNNEL_MAX (__MPLS_IPTUNNEL_MAX - 1)
+
+#endif /* _UAPI_LINUX_MPLS_IPTUNNEL_H */
diff --git a/net/mpls/Kconfig b/net/mpls/Kconfig
index 17bde79..5c467ef 100644
--- a/net/mpls/Kconfig
+++ b/net/mpls/Kconfig
@@ -24,7 +24,13 @@ config NET_MPLS_GSO
 
 config MPLS_ROUTING
tristate "MPLS: routing support"
-   help
+   ---help---
 Add support for forwarding of mpls packets.
 
+config MPLS_IPTUNNEL
+   tristate "MPLS: IP over MPLS tunnel support"
+   depends on LWTUNNEL && MPLS_ROUTING
+   ---help---
+mpls ip tunnel support.
+
 endif # MPLS
diff --git a/net/mpls/Makefile b/net/mpls/Makefile
index 65bbe68..9ca9236 100644
--- a/net/mpls/Makefile
+++ b/net/mpls/Makefile
@@ -3,5 +3,6 @@
 #
 obj-$(CONFIG_NET_MPLS_GSO) += mpls_gso.o
 obj-$(CONFIG_MPLS_ROUTING) += mpls_router.o
+obj-$(CONFIG_MPLS_IPTUNNEL) += mpls_iptunnel.o
 
 mpls_router-y := af_mpls.o
diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
new file mode 100644
index 000..eea096f
--- /dev/null
+++ b/net/mpls/mpls_iptunnel.c
@@ -0,0 +1,233 @@
+/*
+ * mpls tunnels    An implementation mpls tunnels using the light weight tunnel
+ * infrastructure
+ *
+ * Authors:    Roopa Prabhu, ro...@cumulusnetworks.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ */
+#include <linux/types.h>
+#include <linux/skbuff.h>
+#include <linux/net.h>
+#include <linux/module.h>
+#include <linux/mpls.h>
+#include <linux/vmalloc.h>
+#include <net/ip.h>
+#include <net/dst.h>
+#include <net/lwtunnel.h>
+#include <net/netevent.h>
+#include <net/netns/generic.h>
+#include <net/ip6_fib.h>
+#include

[PATCH net-next 11/22] icmp: Don't leak original dst into ip_route_input()

2015-07-17 Thread Thomas Graf

ip_route_input() unconditionally overwrites the dst. Hide the original
dst attached to the skb by calling skb_dst_set(skb, NULL) prior to
ip_route_input().

Reported-by: Julian Anastasov j...@ssi.bg
Signed-off-by: Thomas Graf tg...@suug.ch
---
 net/ipv4/icmp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index f5203fb..c0556f1 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -496,6 +496,7 @@ static struct rtable *icmp_route_lookup(struct net *net,
}
/* Ugh! */
orefdst = skb_in->_skb_refdst; /* save old refdst */
+   skb_dst_set(skb_in, NULL);
err = ip_route_input(skb_in, fl4_dec.daddr, fl4_dec.saddr,
 RT_TOS(tos), rt2->dst.dev);
 
-- 
2.4.3


[PATCH net-next 00/22] Lightweight flow based encapsulation

2015-07-17 Thread Thomas Graf

This series combines the work previously posted by Roopa, Robert and
myself. It follows what we discussed at NFWS. The motivation
of this series is to:

 * Consolidate code between OVS and the rest of the kernel and get
   rid of OVS vports and instead represent them as pure net_devices.
 * Introduce a lightweight tunneling mechanism which enables flow
   based encapsulation to improve scalability on both RX and TX.
 * Do the above in an encapsulation unspecific way so that the
   encapsulation type is eventually abstracted away from the user.
 * Use the same forwarding decision for both native forwarding and
   encapsulation thus allowing to switch between native IPv6 and
   UDP encapsulation based on endpoint without requiring additional
   logic

The fundamental changes introduced in this series are:
 * A new RTA_ENCAP Netlink attribute for routes carrying encapsulation
   instructions. Depending on the specified type, the instructions
   apply to UDP encapsulations, MPLS, and possibly others in the future.
 * Depending on the encapsulation type, the output function of the
   dst is directly overwritten or the dst merely attaches metadata and
   relies on a subsequent net_device to apply it to the packet. The
   latter is typically used if an inner and outer IP header exist which
   require two subsequent routing lookups to be performed.
 * A new metadata_dst structure which can be attached to skbs to
   carry metadata in between subsystems. This new metadata transport
   is used to provide a single interface for VXLAN, routing and OVS
   to communicate through metadata.

The OVS interfaces remain as-is but will transparently create a real
VXLAN net_device in the background. iproute2 is extended with new
use cases:

  VXLAN:
  ip route add 40.1.1.1/32 encap vxlan id 10 dst 50.1.1.2 dev vxlan0

  MPLS:
  ip route add 10.1.1.0/30 encap mpls 200 via inet 10.1.1.1 dev swp1

Changes since RFC:
 * Addressed comments
 * Folded in various fixes provided by Roopa, Joe, and Wei-Chun Chao
 * New static key to only collect metadata on receive if a filter exists
   which matches on the relevant fields.

Roopa Prabhu (9):
  rtnetlink: introduce new RTA_ENCAP_TYPE and RTA_ENCAP attributes
  lwtunnel: infrastructure for handling light weight tunnels like mpls
  ipv4: support for fib route lwtunnel encap attributes
  ipv6: support for fib route lwtunnel encap attributes
  lwtunnel: support dst output redirect function
  ipv4: redirect dst output to lwtunnel output
  ipv6: rt6_info output redirect to tunnel output
  mpls: export mpls functions for use by mpls iptunnels
  mpls: ip tunnel support

Thomas Graf (13):
  ip_tunnel: Make ovs_tunnel_info and ovs_key_ipv4_tunnel generic
  icmp: Don't leak original dst into ip_route_input()
  dst: Metadata destinations
  arp: Inherit metadata dst when creating ARP requests
  vxlan: Flow based tunneling
  route: Extend flow representation with tunnel key
  route: Per route IP tunnel metadata via lightweight tunnel
  fib: Add fib rule match on tunnel id
  vxlan: Factor out device configuration
  openvswitch: Make tunnel set action attach a metadata dst
  openvswitch: Move dev pointer into vport itself
  openvswitch: Abstract vport name through ovs_vport_name()
  openvswitch: Use regular VXLAN net_device device

 drivers/net/vxlan.c  | 678 +--
 include/linux/lwtunnel.h |   6 +
 include/linux/mpls_iptunnel.h|   6 +
 include/linux/skbuff.h   |   1 +
 include/net/dst.h|   6 +-
 include/net/dst_metadata.h   |  55 +++
 include/net/fib_rules.h  |   1 +
 include/net/flow.h   |   7 +
 include/net/ip6_fib.h|   3 +
 include/net/ip_fib.h |   5 +-
 include/net/ip_tunnels.h |  95 -
 include/net/lwtunnel.h   | 144 
 include/net/mpls_iptunnel.h  |  29 ++
 include/net/route.h  |   1 +
 include/net/rtnetlink.h  |   1 +
 include/net/vxlan.h  |  85 -
 include/uapi/linux/fib_rules.h   |   2 +-
 include/uapi/linux/if_link.h |   1 +
 include/uapi/linux/lwtunnel.h|  16 +
 include/uapi/linux/mpls_iptunnel.h   |  28 ++
 include/uapi/linux/openvswitch.h |   2 +-
 include/uapi/linux/rtnetlink.h   |  17 +
 net/Kconfig  |   7 +
 net/core/Makefile|   1 +
 net/core/dev.c   |   2 +-
 net/core/dst.c   |  84 -
 net/core/fib_rules.c |  24 +-
 net/core/lwtunnel.c  | 235 
 net/core/rtnetlink.c |  26 +-
 net/ipv4/arp.c   |  65 ++--
 net/ipv4/fib_frontend.c  |   8 +
 net/ipv4/fib_semantics.c |  96 -
 net/ipv4/icmp.c  |   1 +
 net/ipv4/ip_input.c  |   3 +-
 net/ipv4/ip_tunnel_core.c| 130

[PATCH net-next 06/22] ipv4: redirect dst output to lwtunnel output

2015-07-17 Thread Thomas Graf

From: Roopa Prabhu ro...@cumulusnetworks.com

For input routes with tunnel encap state this patch redirects
dst output functions to lwtunnel_output which later resolves to
the corresponding lwtunnel output function.
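
The redirect amounts to swapping a function pointer when the route's
encap state asks for it. A minimal user-space sketch, with hypothetical
sketch_* types standing in for rtable, lwtunnel_state and the output
handlers (the real handlers take different arguments):

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for LWTUNNEL_STATE_OUTPUT_REDIRECT. */
#define SKETCH_STATE_OUTPUT_REDIRECT 0x1

struct sketch_pkt {
	bool encapsulated;
};

struct sketch_lwtstate {
	unsigned short flags;
};

struct sketch_route {
	struct sketch_lwtstate *lwtstate;
	int (*output)(struct sketch_pkt *pkt);
};

/* Default path: plain IP output, no encapsulation. */
static int sketch_ip_output(struct sketch_pkt *pkt)
{
	(void)pkt;
	return 0;
}

/* Tunnel path: a real implementation would dispatch on the encap type. */
static int sketch_lwtunnel_output(struct sketch_pkt *pkt)
{
	pkt->encapsulated = true;
	return 0;
}

static bool sketch_output_redirect(const struct sketch_lwtstate *state)
{
	return state && (state->flags & SKETCH_STATE_OUTPUT_REDIRECT);
}

/* Set the default handler first, then override it if requested. */
static void sketch_mkroute_input(struct sketch_route *rt,
				 struct sketch_lwtstate *state)
{
	rt->lwtstate = state;
	rt->output = sketch_ip_output;
	if (sketch_output_redirect(state))
		rt->output = sketch_lwtunnel_output;
}
```

Routes without encap state keep the default handler, so the extra
check is the only cost on the non-tunnel path.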

This has been tested to work with mpls ip tunnels.

Open items: Support for tunnel mtu, pmtu, fragmentation can be
added by hooking into the corresponding (ipv4, ipv6) dst ops.
We may do this differently when lwtstate moves to dst or dst_metadata
as per upstream discussions.

Signed-off-by: Roopa Prabhu ro...@cumulusnetworks.com
---
 net/ipv4/route.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 226570b..cd3157c 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1633,6 +1633,8 @@ static int __mkroute_input(struct sk_buff *skb,
rth->dst.output = ip_output;
 
rt_set_nexthop(rth, daddr, res, fnhe, res->fi, res->type, itag);
+   if (lwtunnel_output_redirect(rth->rt_lwtstate))
+   rth->dst.output = lwtunnel_output;
skb_dst_set(skb, &rth->dst);
 out:
err = 0;
-- 
2.4.3


[PATCH net-next 10/22] ip_tunnel: Make ovs_tunnel_info and ovs_key_ipv4_tunnel generic

2015-07-17 Thread Thomas Graf

Rename the tunnel metadata data structures currently internal to
OVS and make them generic for use by all IP tunnels.

Both structures are kernel internal and will stay that way. Their
members are exposed to user space through individual Netlink
attributes by OVS. It will therefore be possible to extend/modify
these structures without affecting user ABI.

Signed-off-by: Thomas Graf tg...@suug.ch
---
 include/net/ip_tunnels.h | 63 +
 include/uapi/linux/openvswitch.h |  2 +-
 net/openvswitch/actions.c|  2 +-
 net/openvswitch/datapath.h   |  5 +--
 net/openvswitch/flow.c   |  4 +--
 net/openvswitch/flow.h   | 76 ++--
 net/openvswitch/flow_netlink.c   | 16 -
 net/openvswitch/flow_netlink.h   |  2 +-
 net/openvswitch/vport-geneve.c   | 17 +
 net/openvswitch/vport-gre.c  | 16 -
 net/openvswitch/vport-vxlan.c| 18 +-
 net/openvswitch/vport.c  | 30 
 net/openvswitch/vport.h  | 12 +++
 13 files changed, 128 insertions(+), 135 deletions(-)

diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index d8214cb..6b9d559 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -22,6 +22,28 @@
 /* Keep error state on tunnel for 30 sec */
 #define IPTUNNEL_ERR_TIMEO (30*HZ)
 
+/* Used to memset ip_tunnel padding. */
+#define IP_TUNNEL_KEY_SIZE \
+   (offsetof(struct ip_tunnel_key, tp_dst) +   \
+FIELD_SIZEOF(struct ip_tunnel_key, tp_dst))
+
+struct ip_tunnel_key {
+   __be64  tun_id;
+   __be32  ipv4_src;
+   __be32  ipv4_dst;
+   __be16  tun_flags;
+   __u8    ipv4_tos;
+   __u8    ipv4_ttl;
+   __be16  tp_src;
+   __be16  tp_dst;
+} __packed __aligned(4); /* Minimize padding. */
+
+struct ip_tunnel_info {
+   struct ip_tunnel_key    key;
+   const void  *options;
+   u8  options_len;
+};
+
 /* 6rd prefix/relay information */
 #ifdef CONFIG_IPV6_SIT_6RD
 struct ip_tunnel_6rd_parm {
@@ -136,6 +158,47 @@ int ip_tunnel_encap_add_ops(const struct ip_tunnel_encap_ops *op,
 int ip_tunnel_encap_del_ops(const struct ip_tunnel_encap_ops *op,
unsigned int num);
 
+static inline void __ip_tunnel_info_init(struct ip_tunnel_info *tun_info,
+__be32 saddr, __be32 daddr,
+u8 tos, u8 ttl,
+__be16 tp_src, __be16 tp_dst,
+__be64 tun_id, __be16 tun_flags,
+const void *opts, u8 opts_len)
+{
+   tun_info->key.tun_id = tun_id;
+   tun_info->key.ipv4_src = saddr;
+   tun_info->key.ipv4_dst = daddr;
+   tun_info->key.ipv4_tos = tos;
+   tun_info->key.ipv4_ttl = ttl;
+   tun_info->key.tun_flags = tun_flags;
+
+   /* For the tunnel types on the top of IPsec, the tp_src and tp_dst of
+* the upper tunnel are used.
+* E.g: GRE over IPSEC, the tp_src and tp_port are zero.
+*/
+   tun_info->key.tp_src = tp_src;
+   tun_info->key.tp_dst = tp_dst;
+
+   /* Clear struct padding. */
+   if (sizeof(tun_info->key) != IP_TUNNEL_KEY_SIZE)
+   memset((unsigned char *)&tun_info->key + IP_TUNNEL_KEY_SIZE,
+  0, sizeof(tun_info->key) - IP_TUNNEL_KEY_SIZE);
+
+   tun_info->options = opts;
+   tun_info->options_len = opts_len;
+}
+
+static inline void ip_tunnel_info_init(struct ip_tunnel_info *tun_info,
+  const struct iphdr *iph,
+  __be16 tp_src, __be16 tp_dst,
+  __be64 tun_id, __be16 tun_flags,
+  const void *opts, u8 opts_len)
+{
+   __ip_tunnel_info_init(tun_info, iph->saddr, iph->daddr,
+ iph->tos, iph->ttl, tp_src, tp_dst,
+ tun_id, tun_flags, opts, opts_len);
+}
+
 #ifdef CONFIG_INET
 
 int ip_tunnel_init(struct net_device *dev);
diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 1dab776..d6b8854 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -321,7 +321,7 @@ enum ovs_key_attr {
 * the accepted length of the array. */
 
 #ifdef __KERNEL__
-   OVS_KEY_ATTR_TUNNEL_INFO,  /* struct ovs_tunnel_info */
+   OVS_KEY_ATTR_TUNNEL_INFO,  /* struct ip_tunnel_info */
 #endif
__OVS_KEY_ATTR_MAX
 };
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 8a8c0b8..27c1687 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -611,7

[PATCH net-next 02/22] lwtunnel: infrastructure for handling light weight tunnels like mpls

2015-07-17 Thread Thomas Graf

From: Roopa Prabhu ro...@cumulusnetworks.com

Provides infrastructure to parse/dump/store encap information for
light weight tunnels like mpls. Encap information for such tunnels
is associated with fib routes.

This infrastructure is based on previous suggestions from
Eric Biederman to follow the xfrm infrastructure.
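
As a rough picture of the kind of registry this provides, the
user-space sketch below keeps one ops pointer per encap type and
dispatches through it. It only loosely mirrors the
lwtunnel_encap_add_ops()/lwtunnel_encap_del_ops() prototypes in the
diff below: the sketch_* names, the error codes and the lack of any
RCU or atomic protection are assumptions of this sketch.

```c
#include <errno.h>
#include <stddef.h>

#define SKETCH_ENCAP_MAX 4  /* hypothetical stand-in for LWTUNNEL_ENCAP_MAX */

/* Reduced ops table: one callback is enough to show the dispatch. */
struct sketch_encap_ops {
	int (*get_encap_size)(const void *state);
};

/* One slot per encap type; NULL means "not registered". */
static const struct sketch_encap_ops *sketch_encaps[SKETCH_ENCAP_MAX + 1];

static int sketch_encap_add_ops(const struct sketch_encap_ops *op,
				unsigned int num)
{
	if (num > SKETCH_ENCAP_MAX)
		return -ERANGE;
	if (sketch_encaps[num])
		return -EEXIST;  /* slot already taken */
	sketch_encaps[num] = op;
	return 0;
}

static int sketch_encap_del_ops(const struct sketch_encap_ops *op,
				unsigned int num)
{
	if (num > SKETCH_ENCAP_MAX)
		return -ERANGE;
	if (sketch_encaps[num] != op)
		return -ENOENT;  /* only the owner may unregister */
	sketch_encaps[num] = NULL;
	return 0;
}

/* Dispatch by type; unknown or unregistered types contribute no size. */
static int sketch_get_encap_size(unsigned int type, const void *state)
{
	if (type > SKETCH_ENCAP_MAX || !sketch_encaps[type])
		return 0;
	return sketch_encaps[type]->get_encap_size(state);
}

/* Example provider: pretends every MPLS encap needs 8 bytes. */
static int sketch_mpls_encap_size(const void *state)
{
	(void)state;
	return 8;
}

static const struct sketch_encap_ops sketch_mpls_ops = {
	.get_encap_size = sketch_mpls_encap_size,
};
```

An encap module would register its ops at init time and unregister
them on exit; route code then only needs the numeric encap type.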

Signed-off-by: Roopa Prabhu ro...@cumulusnetworks.com
---
 include/linux/lwtunnel.h  |   6 ++
 include/net/lwtunnel.h| 132 +++
 include/uapi/linux/lwtunnel.h |  15 
 net/Kconfig   |   7 ++
 net/core/Makefile |   1 +
 net/core/lwtunnel.c   | 179 ++
 6 files changed, 340 insertions(+)
 create mode 100644 include/linux/lwtunnel.h
 create mode 100644 include/net/lwtunnel.h
 create mode 100644 include/uapi/linux/lwtunnel.h
 create mode 100644 net/core/lwtunnel.c

diff --git a/include/linux/lwtunnel.h b/include/linux/lwtunnel.h
new file mode 100644
index 000..97f32f8
--- /dev/null
+++ b/include/linux/lwtunnel.h
@@ -0,0 +1,6 @@
+#ifndef _LINUX_LWTUNNEL_H_
+#define _LINUX_LWTUNNEL_H_
+
+#include <uapi/linux/lwtunnel.h>
+
+#endif /* _LINUX_LWTUNNEL_H_ */
diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
new file mode 100644
index 000..df24b36
--- /dev/null
+++ b/include/net/lwtunnel.h
@@ -0,0 +1,132 @@
+#ifndef __NET_LWTUNNEL_H
+#define __NET_LWTUNNEL_H 1
+
+#include <linux/lwtunnel.h>
+#include <linux/netdevice.h>
+#include <linux/skbuff.h>
+#include <linux/types.h>
+#include <net/route.h>
+
+#define LWTUNNEL_HASH_BITS   7
+#define LWTUNNEL_HASH_SIZE   (1 << LWTUNNEL_HASH_BITS)
+
+/* lw tunnel state flags */
+#define LWTUNNEL_STATE_OUTPUT_REDIRECT 0x1
+
+struct lwtunnel_state {
+   __u16   type;
+   __u16   flags;
+   atomic_t refcnt;
+   int     len;
+   __u8    data[0];
+};
+
+struct lwtunnel_encap_ops {
+   int (*build_state)(struct net_device *dev, struct nlattr *encap,
+  struct lwtunnel_state **ts);
+   int (*output)(struct sock *sk, struct sk_buff *skb);
+   int (*fill_encap)(struct sk_buff *skb,
+ struct lwtunnel_state *lwtstate);
+   int (*get_encap_size)(struct lwtunnel_state *lwtstate);
+   int (*cmp_encap)(struct lwtunnel_state *a, struct lwtunnel_state *b);
+};
+
+extern const struct lwtunnel_encap_ops __rcu *
+   lwtun_encaps[LWTUNNEL_ENCAP_MAX+1];
+
+#ifdef CONFIG_LWTUNNEL
+static inline void lwtunnel_state_get(struct lwtunnel_state *lws)
+{
+   atomic_inc(&lws->refcnt);
+}
+
+static inline void lwtunnel_state_put(struct lwtunnel_state *lws)
+{
+   if (!lws)
+   return;
+
+   if (atomic_dec_and_test(&lws->refcnt))
+   kfree(lws);
+}
+
+static inline bool lwtunnel_output_redirect(struct lwtunnel_state *lwtstate)
+{
+   if (lwtstate && (lwtstate->flags & LWTUNNEL_STATE_OUTPUT_REDIRECT))
+   return true;
+
+   return false;
+}
+
+int lwtunnel_encap_add_ops(const struct lwtunnel_encap_ops *op,
+  unsigned int num);
+int lwtunnel_encap_del_ops(const struct lwtunnel_encap_ops *op,
+  unsigned int num);
+int lwtunnel_build_state(struct net_device *dev, u16 encap_type,
+struct nlattr *encap,
+struct lwtunnel_state **lws);
+int lwtunnel_fill_encap(struct sk_buff *skb,
+   struct lwtunnel_state *lwtstate);
+int lwtunnel_get_encap_size(struct lwtunnel_state *lwtstate);
+struct lwtunnel_state *lwtunnel_state_alloc(int hdr_len);
+int lwtunnel_cmp_encap(struct lwtunnel_state *a, struct lwtunnel_state *b);
+
+#else
+
+static inline void lwtunnel_state_get(struct lwtunnel_state *lws)
+{
+}
+
+static inline void lwtunnel_state_put(struct lwtunnel_state *lws)
+{
+}
+
+static inline bool lwtunnel_output_redirect(struct lwtunnel_state *lwtstate)
+{
+   return false;
+}
+
+static inline int lwtunnel_encap_add_ops(const struct lwtunnel_encap_ops *op,
+unsigned int num)
+{
+   return -EOPNOTSUPP;
+
+}
+
+static inline int lwtunnel_encap_del_ops(const struct lwtunnel_encap_ops *op,
+unsigned int num)
+{
+   return -EOPNOTSUPP;
+}
+
+static inline int lwtunnel_build_state(struct net_device *dev, u16 encap_type,
+  struct nlattr *encap,
+  struct lwtunnel_state **lws)
+{
+   return -EOPNOTSUPP;
+}
+
+static inline int lwtunnel_fill_encap(struct sk_buff *skb,
+ struct lwtunnel_state *lwtstate)
+{
+   return 0;
+}
+
+static inline int lwtunnel_get_encap_size(struct lwtunnel_state *lwtstate)
+{
+   return 0;
+}
+
+static inline struct lwtunnel_state *lwtunnel_state_alloc(int hdr_len)
+{
+   return NULL;
+}
+
+static inline int lwtunnel_cmp_encap(struct lwtunnel_state *a,
+

[PATCH net-next 07/22] ipv6: rt6_info output redirect to tunnel output

2015-07-17 Thread Thomas Graf

From: Roopa Prabhu ro...@cumulusnetworks.com

This is similar to ipv4 redirect of dst output to lwtunnel
output function for encapsulation and xmit.

Signed-off-by: Roopa Prabhu ro...@cumulusnetworks.com
---
 net/ipv6/route.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index b3431b7..7f2214f 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1780,6 +1780,7 @@ int ip6_route_add(struct fib6_config *cfg)
goto out;
lwtunnel_state_get(lwtstate);
rt->rt6i_lwtstate = lwtstate;
+   rt->dst.output = lwtunnel_output6;
}
 
ipv6_addr_prefix(&rt->rt6i_dst.addr, &cfg->fc_dst, cfg->fc_dst_len);
-- 
2.4.3

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH net-next 05/22] lwtunnel: support dst output redirect function

2015-07-17 Thread Thomas Graf

From: Roopa Prabhu ro...@cumulusnetworks.com

This patch introduces lwtunnel_output function to call corresponding
lwtunnels output function to xmit the packet.

It adds two variants lwtunnel_output and lwtunnel_output6 for ipv4 and
ipv6 respectively today. But this is subject to change when lwtstate will
reside in dst or dst_metadata (as per upstream discussions).

Signed-off-by: Roopa Prabhu ro...@cumulusnetworks.com
---
 include/net/lwtunnel.h | 12 +++
 net/core/lwtunnel.c| 56 ++
 2 files changed, 68 insertions(+)

diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
index df24b36..918e03c 100644
--- a/include/net/lwtunnel.h
+++ b/include/net/lwtunnel.h
@@ -69,6 +69,8 @@ int lwtunnel_fill_encap(struct sk_buff *skb,
 int lwtunnel_get_encap_size(struct lwtunnel_state *lwtstate);
 struct lwtunnel_state *lwtunnel_state_alloc(int hdr_len);
 int lwtunnel_cmp_encap(struct lwtunnel_state *a, struct lwtunnel_state *b);
+int lwtunnel_output(struct sock *sk, struct sk_buff *skb);
+int lwtunnel_output6(struct sock *sk, struct sk_buff *skb);
 
 #else
 
@@ -127,6 +129,16 @@ static inline int lwtunnel_cmp_encap(struct lwtunnel_state *a,
return 0;
 }
 
+static inline int lwtunnel_output(struct sock *sk, struct sk_buff *skb)
+{
+   return -EOPNOTSUPP;
+}
+
+static inline int lwtunnel_output6(struct sock *sk, struct sk_buff *skb)
+{
+   return -EOPNOTSUPP;
+}
+
 #endif
 
 #endif /* __NET_LWTUNNEL_H */
diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index d7ae3a2..bb58826 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -25,6 +25,7 @@
 
 #include <net/lwtunnel.h>
 #include <net/rtnetlink.h>
+#include <net/ip6_fib.h>
 
 struct lwtunnel_state *lwtunnel_state_alloc(int encap_len)
 {
@@ -177,3 +178,58 @@ int lwtunnel_cmp_encap(struct lwtunnel_state *a, struct lwtunnel_state *b)
return ret;
 }
 EXPORT_SYMBOL(lwtunnel_cmp_encap);
+
+int __lwtunnel_output(struct sock *sk, struct sk_buff *skb,
+ struct lwtunnel_state *lwtstate)
+{
+   const struct lwtunnel_encap_ops *ops;
+   int ret = -EINVAL;
+
+   if (!lwtstate)
+   goto drop;
+
+   if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
+   lwtstate->type > LWTUNNEL_ENCAP_MAX)
+   return 0;
+
+   ret = -EOPNOTSUPP;
+   rcu_read_lock();
+   ops = rcu_dereference(lwtun_encaps[lwtstate->type]);
+   if (likely(ops && ops->output))
+   ret = ops->output(sk, skb);
+   rcu_read_unlock();
+
+   if (ret == -EOPNOTSUPP)
+   goto drop;
+
+   return ret;
+
+drop:
+   kfree(skb);
+
+   return ret;
+}
+
+int lwtunnel_output6(struct sock *sk, struct sk_buff *skb)
+{
+   struct rt6_info *rt = (struct rt6_info *)skb_dst(skb);
+   struct lwtunnel_state *lwtstate = NULL;
+
+   if (rt)
+   lwtstate = rt->rt6i_lwtstate;
+
+   return __lwtunnel_output(sk, skb, lwtstate);
+}
+EXPORT_SYMBOL(lwtunnel_output6);
+
+int lwtunnel_output(struct sock *sk, struct sk_buff *skb)
+{
+   struct rtable *rt = (struct rtable *)skb_dst(skb);
+   struct lwtunnel_state *lwtstate = NULL;
+
+   if (rt)
+   lwtstate = rt->rt_lwtstate;
+
+   return __lwtunnel_output(sk, skb, lwtstate);
+}
+EXPORT_SYMBOL(lwtunnel_output);
-- 
2.4.3
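
The dispatch logic of __lwtunnel_output() in the patch above can be tried outside the kernel with a small userspace sketch. Everything below is an assumption made for illustration: the names encap_table, encap_register, tunnel_output, struct packet and struct tunnel_state are hypothetical stand-ins, RCU protection is omitted, and "dropping" a packet is modelled by setting a flag rather than freeing an skb.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are not the kernel definitions. */
enum { ENCAP_NONE = 0, ENCAP_MPLS = 1, ENCAP_MAX = 1 };

struct packet { int len; int dropped; };
struct tunnel_state { unsigned short type; };

struct encap_ops {
	int (*output)(struct packet *pkt);
};

/* One slot per encap type; NULL means no handler is registered. */
static const struct encap_ops *encap_table[ENCAP_MAX + 1];

/* Register a handler for an encap type (no locking in this sketch). */
static int encap_register(const struct encap_ops *ops, unsigned int type)
{
	if (type == ENCAP_NONE || type > ENCAP_MAX)
		return -ERANGE;
	if (encap_table[type])
		return -EEXIST;
	encap_table[type] = ops;
	return 0;
}

/* Same control flow as the patch: missing state or missing handler
 * drops the packet, an out-of-range type is passed through with 0. */
static int tunnel_output(struct packet *pkt, const struct tunnel_state *st)
{
	const struct encap_ops *ops;
	int ret = -EINVAL;

	assert(pkt != NULL);
	if (!st)
		goto drop;
	if (st->type == ENCAP_NONE || st->type > ENCAP_MAX)
		return 0;

	ret = -EOPNOTSUPP;
	ops = encap_table[st->type];
	if (ops && ops->output)
		ret = ops->output(pkt);
	if (ret == -EOPNOTSUPP)
		goto drop;
	return ret;

drop:
	pkt->dropped = 1;	/* stands in for freeing the skb */
	return ret;
}

/* Example handler: pretend to push a 4-byte label. */
static int mpls_like_output(struct packet *pkt)
{
	pkt->len += 4;
	return 0;
}

static const struct encap_ops mpls_like_ops = { .output = mpls_like_output };
```

Calling tunnel_output() before encap_register(&mpls_like_ops, ENCAP_MPLS) returns -EOPNOTSUPP and marks the packet dropped; after registration the same call returns 0 and the length grows by four.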


[PATCH net-next 01/22] rtnetlink: introduce new RTA_ENCAP_TYPE and RTA_ENCAP attributes

2015-07-17 Thread Thomas Graf

From: Roopa Prabhu ro...@cumulusnetworks.com

This patch introduces two new RTA attributes to attach encap
data to fib routes.

Example iproute2 command to attach mpls encap data to ipv4 routes

$ip route add 10.1.1.0/30 encap mpls 200 via inet 10.1.1.1 dev swp1

Signed-off-by: Roopa Prabhu ro...@cumulusnetworks.com
Suggested-by: Eric W. Biederman ebied...@xmission.com
---
 include/uapi/linux/rtnetlink.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index fdd8f07..0d3d3cc 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -308,6 +308,8 @@ enum rtattr_type_t {
RTA_VIA,
RTA_NEWDST,
RTA_PREF,
+   RTA_ENCAP_TYPE,
+   RTA_ENCAP,
__RTA_MAX
 };
 
-- 
2.4.3


pull-request: mac80211 2015-07-17

2015-07-17 Thread Johannes Berg

Hi Dave,

We've accumulated some wireless fixes, please pull. Arik's fix is a bit
bigger than I might like, but it fixes a real locking issue and we
didn't really see a good way to make a smaller version.

Let me know if there's any problem.

johannes


The following changes since commit f760b87f8f12eb262f14603e65042996fe03720e:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2015-07-13 11:18:25 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git tags/mac80211-for-davem-2015-07-17

for you to fetch changes up to 923b352f19d9ea971ae2536eab55f5fc9e95fedf:

  cfg80211: use RTNL locked reg_can_beacon for IR-relaxation (2015-07-17 15:02:02 +0200)


Some fixes for the current cycle:

 1. Arik introduced an rtnl-locked regulatory API to be able
to differentiate between places that do/don't hold the RTNL;
this fixes missing locking in some of the code paths

 2. Two small mesh bugfixes from Bob, one to avoid processing
a certain malformed over-the-air frame and one to avoid
sending a garbage field over the air.

 3. A fix for powersave during WoWLAN suspend from Krishna Chaitanya.

 4. A fix for a powersave vs. aggregation teardown race, from Michal.

 5. Thomas reduced the loglevel of CRDA messages to avoid spamming
the kernel log with mostly irrelevant information.

 6. Tom fixed a dangling debugfs directory pointer that could cause
crashes if subsequent addition of the same interface to debugfs
failed for some reason.

 7. A fix from myself for a list corruption issue in mac80211 during
combined interface shutdown/removal - shut down interfaces first
and only then remove them to avoid that.


Arik Nemtsov (1):
  cfg80211: use RTNL locked reg_can_beacon for IR-relaxation

Bob Copeland (2):
  mac80211: correct aid location in peering frames
  mac80211: add missing length check for confirm frames

Chaitanya T K (1):
  mac80211: wowlan: enable powersave if suspend while ps-polling

Johannes Berg (1):
  mac80211: shut down interfaces before destroying interface list

Michal Kazior (1):
  mac80211: don't clear all tx flags when requeing

Thomas Petazzoni (1):
  wireless: regulatory: reduce log level of CRDA related messages

Tom Hughes (1):
  mac80211: clear subdir_stations when removing debugfs

 include/net/cfg80211.h| 17 
 net/mac80211/debugfs_netdev.c |  1 +
 net/mac80211/iface.c  | 25 +---
 net/mac80211/mesh_plink.c |  5 -
 net/mac80211/pm.c | 16 +++
 net/mac80211/tdls.c   |  6 +++---
 net/mac80211/tx.c |  4 +++-
 net/wireless/chan.c   | 45 ---
 net/wireless/nl80211.c| 14 --
 net/wireless/reg.c|  8 
 net/wireless/trace.h  | 11 +++
 11 files changed, 111 insertions(+), 41 deletions(-)

RE: [PATCH v7 3/3] ixgbe, ixgbevf: Add new mbox API xcast mode

2015-07-17 Thread Skidmore, Donald C



 -Original Message-
 From: Hiroshi Shimamoto [mailto:h-shimam...@ct.jp.nec.com]
 Sent: Thursday, July 16, 2015 3:36 AM
 To: Alexander Duyck; Skidmore, Donald C; Rose, Gregory V; Kirsher, Jeffrey
 T; intel-wired-...@lists.osuosl.org
 Cc: nhor...@redhat.com; jogre...@redhat.com; Linux Netdev List; Choi,
 Sy Jong; Rony Efraim; Or Gerlitz; Edward Cree; David Miller;
 sassm...@redhat.com
 Subject: [PATCH v7 3/3] ixgbe, ixgbevf: Add new mbox API xcast mode
 
 From: Hiroshi Shimamoto h-shimam...@ct.jp.nec.com
 
 The limitation of the number of multicast address for VF is not enough for
 the large scale server with SR-IOV feature. IPv6 requires the multicast MAC
 address for each IP address to handle the Neighbor Solicitation message. We
 couldn't assign over 30 IPv6 addresses to a single VF.
 
 This patch introduces the new mailbox API,
 IXGBE_VF_UPDATE_XCAST_MODE, to update multicast mode of VF. This
 adds 3 modes;
   - NONE     only L2 exact match addresses or Flow Director enabled
   - MULTI    BAM and ROMPE set
   - ALLMULTI BAM, ROMPE and MPE set
 
 If a guest VF user wants over 30 MAC multicast addresses, set IFF_ALLMULTI
 to request PF to update xcast mode to enable VF multicast promiscuous
 mode.
 
 On the other hand, enabling VF multicast promiscuous mode may affect
 security and performance in the network of the NIC. Only trusted VF can
 enable multicast promiscuous mode. The behavior of untrusted VF is the
 same as previous version.
 
 Signed-off-by: Hiroshi Shimamoto h-shimam...@ct.jp.nec.com
 ---
  drivers/net/ethernet/intel/ixgbe/ixgbe.h  |  7 +++
  drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h  |  2 +
  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c| 59
 +++
  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |  6 +++
  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  8 +++
  drivers/net/ethernet/intel/ixgbevf/mbx.h  |  2 +
  drivers/net/ethernet/intel/ixgbevf/vf.c   | 41 
  drivers/net/ethernet/intel/ixgbevf/vf.h   |  1 +
  8 files changed, 126 insertions(+)
 
 diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
 b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
 index fb72622..17250ef 100644
 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
 +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
 @@ -153,9 +153,16 @@ struct vf_data_storage {
   u8 spoofchk_enabled;
   bool rss_query_enabled;
   u8 trusted;
 + int xcast_mode;
   unsigned int vf_api;
  };
 
 +enum ixgbevf_xcast_modes {
 + IXGBEVF_XCAST_MODE_NONE = 0,
 + IXGBEVF_XCAST_MODE_MULTI,
 + IXGBEVF_XCAST_MODE_ALLMULTI,
 +};
 +
  struct vf_macvlans {
   struct list_head l;
   int vf;
 diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
 b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
 index b1e4703..8daa95f 100644
 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
 +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
 @@ -102,6 +102,8 @@ enum ixgbe_pfvf_api_rev {
  #define IXGBE_VF_GET_RETA0x0a/* VF request for RETA */
  #define IXGBE_VF_GET_RSS_KEY 0x0b/* get RSS key */
 
 +#define IXGBE_VF_UPDATE_XCAST_MODE   0x0c
 +
  /* length of permanent address message returned from PF */  #define
 IXGBE_VF_PERMADDR_MSG_LEN 4
  /* word in permanent address message with the current multicast type */
 diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
 b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
 index 65aeb58..ac071e5 100644
 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
 +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
 @@ -119,6 +119,9 @@ static int __ixgbe_enable_sriov(struct ixgbe_adapter
 *adapter)
 
   /* Untrust all VFs */
   adapter->vfinfo[i].trusted = false;
 +
 + /* set the default xcast mode */
 + adapter->vfinfo[i].xcast_mode =
 IXGBEVF_XCAST_MODE_NONE;
   }
 
   return 0;
 @@ -1004,6 +1007,59 @@ static int ixgbe_get_vf_rss_key(struct
 ixgbe_adapter *adapter,
   return 0;
  }
 
 +static int ixgbe_update_vf_xcast_mode(struct ixgbe_adapter *adapter,
 +   u32 *msgbuf, u32 vf)
 +{
 + struct ixgbe_hw *hw = &adapter->hw;
 + int xcast_mode = msgbuf[1];
 + u32 vmolr, disable, enable;
 +
 + /* verify the PF is supporting the correct APIs */
 + switch (adapter->vfinfo[vf].vf_api) {
 + case ixgbe_mbox_api_12:
 + break;
 + default:
 + return -1;

Shouldn't you return -EOPNOTSUPP?

 + }
 +
 + if (xcast_mode > IXGBEVF_XCAST_MODE_MULTI &&
 + !adapter->vfinfo[vf].trusted) {
 + xcast_mode = IXGBEVF_XCAST_MODE_MULTI;
 + }
 +
 + if (adapter->vfinfo[vf].xcast_mode == xcast_mode)
 + goto out;
 +
 + switch (xcast_mode) {
 + case IXGBEVF_XCAST_MODE_NONE:
 + disable = IXGBE_VMOLR_BAM | IXGBE_VMOLR_ROMPE |
 IXGBE_VMOLR_MPE;
 + enable = 0;
 +

Re: [V2 6/7] hvsock: introduce Hyper-V VM Sockets feature

2015-07-17 Thread Vitaly Kuznetsov

Dexuan Cui de...@microsoft.com writes:

>> From: David Miller
>> Sent: Thursday, July 16, 2015 12:19
>>
>> From: Dexuan Cui
>> Date: Tue, 14 Jul 2015 03:00:48 -0700
>>
>> > +  pr_debug("hvsock_sk_destruct: called\n");
>>
>> Debug logging just to state that a function is called is not appropriate,
>> we have very sophisticated tracing facilities in the kernel that can do
>> that transparently, and more.
>>
>> Please remove this.
> OK.
>
>> > +  if (hvsk->channel) {
>> > +  pr_debug("hvsock_sk_destruct: calling vmbus_close()\n");
>>
>> Likewise, these kinds of debug logs are totally inappropriate.
> OK, I'll remove all the pr_debug() in the patch.

I'd suggest we rather use something like net_dbg_ratelimited()
instead. The driver is new so issues are expected. Some debugging may
be useful)

[...]

-- 
  Vitaly

[PATCH -next] net: fib: use fib result when zero-length prefix aliases exist

2015-07-17 Thread Florian Westphal

default route selection is not deterministic when TOS keys are used:

ip route del default

ip route add tos 0x00 via 10.2.100.100
ip route add tos 0x04 via 10.2.100.101
ip route add tos 0x08 via 10.2.100.102
ip route add tos 0x0C via 10.2.100.103
ip route add tos 0x10 via 10.2.100.104

[ i.e. 5 routes with prefix length 0, differentiated via TOS key ]

ip route get 10.3.1.1 tos 0x4
-> 10.2.100.101
ip route get 10.3.1.1 tos 0x8
-> 10.2.100.102
ip route get tos 0x0C
-> 10.2.100.103

But for 0x10, we'll get round-robin results among all the aliases.
Repeated queries return .100, .101, .102, etc. in turn.

This behaviour is not new  -- fib_select_default can be traced back to
fn_hash_select_default in CVS history.

Routing cache made 'round-robin' behaviour less visible.

This changes fib_select_default to not change the FIB chosen result EXCEPT
if this nexthop appears to be unreachable.

fib_detect_death() logic is reversed -- we consider a nexthop 'dead' only
if it has a neigh entry in unreachable state.

Only then we search fib_aliases for an alternative and use one of these in
a round-robin fashion.  If all are believed to be unreachable, no change is
made and fib-chosen nh_gw is used.

Reported-by: Hagen Paul Pfeifer ha...@jauu.net
Cc: Alexander Duyck alexander.h.du...@redhat.com
Signed-off-by: Florian Westphal f...@strlen.de
---
 net/ipv4/fib_semantics.c | 71 
 1 file changed, 36 insertions(+), 35 deletions(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index c7358ea..83b485b 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -410,28 +410,24 @@ errout:
rtnl_set_sk_err(info->nl_net, RTNLGRP_IPV4_ROUTE, err);
 }
 
-static int fib_detect_death(struct fib_info *fi, int order,
-   struct fib_info **last_resort, int *last_idx,
-   int dflt)
+static bool fib_nud_is_unreach(const struct fib_info *fi)
 {
struct neighbour *n;
int state = NUD_NONE;
 
-   n = neigh_lookup(&arp_tbl, &fi->fib_nh[0].nh_gw, fi->fib_dev);
-   if (n) {
+   local_bh_disable();
+
+   n = __neigh_lookup_noref(&arp_tbl, &fi->fib_nh[0].nh_gw, fi->fib_dev);
+   if (n)
state = n->nud_state;
-   neigh_release(n);
-   }
-   if (state == NUD_REACHABLE)
-   return 0;
-   if ((state & NUD_VALID) && order != dflt)
-   return 0;
-   if ((state & NUD_VALID) ||
-   (*last_idx < 0 && order > dflt)) {
-   *last_resort = fi;
-   *last_idx = order;
-   }
-   return 1;
+
+   local_bh_enable();
+
+   /* Caller might be able to find alternate (reachable) nexthop */
+   if (state & (NUD_INCOMPLETE | NUD_FAILED))
+   return true;
+
+   return false;
 }
 
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
@@ -1204,12 +1200,17 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long event)
 /* Must be invoked inside of an RCU protected region.  */
 void fib_select_default(struct fib_result *res)
 {
-   struct fib_info *fi = NULL, *last_resort = NULL;
struct hlist_head *fa_head = res->fa_head;
+   struct fib_info *last_resort = NULL;
struct fib_table *tb = res->table;
int order = -1, last_idx = -1;
struct fib_alias *fa;
+   bool unreach = fib_nud_is_unreach(res->fi);
 
+   if (likely(!unreach))
+   return;
+
+   /* attempt to pick another nexthop */
hlist_for_each_entry_rcu(fa, fa_head, fa_list) {
struct fib_info *next_fi = fa->fa_info;
 
@@ -1223,33 +1224,33 @@ void fib_select_default(struct fib_result *res)
next_fi->fib_nh[0].nh_scope != RT_SCOPE_LINK)
continue;
 
+   order++;
+
+   if (next_fi == res->fi) /* already tested, not reachable */
+   continue;
+
fib_alias_accessed(fa);
 
-   if (!fi) {
-   if (next_fi != res->fi)
+   unreach = fib_nud_is_unreach(next_fi);
+   if (unreach)
+   continue;
+
+   /* try to round-robin among all fa_aliases in case
+* res->fi nexthop is unreachable.
+*/
+   if (last_idx  0 || order  tb-tb_default) {
+   last_resort = next_fi;
+   last_idx = order;
+   if (order  tb-tb_default)
break;
-   } else if (!fib_detect_death(fi, order, &last_resort,
-&last_idx, tb->tb_default)) {
-   fib_result_assign(res, fi);
-   tb->tb_default = order;
-   goto out;
}
-   fi = next_fi;
-   order++;
}
 
-   if (order = 0 || !fi) {
+   if (order  0) {
tb-tb_default = -1;

RE: [PATCH 0/7] introduce Hyper-V VM Sockets(hvsock)

2015-07-17 Thread Dexuan Cui

 -Original Message-
 From: Stefan Hajnoczi
 Sent: Thursday, July 16, 2015 23:59
 On Mon, Jul 06, 2015 at 07:39:35AM -0700, Dexuan Cui wrote:
  Hyper-V VM Sockets (hvsock) is a byte-stream based communication
 mechanism
  between Windows 10 (or later) host and a guest. It's kind of TCP over
  VMBus, but the transportation layer (VMBus) is much simpler than IP.
  With Hyper-V VM Sockets, applications between the host and a guest can
  talk with each other directly by the traditional BSD-style socket APIs.
 
  The patchset implements the necessary support in the guest side by adding
  the necessary new APIs in the vmbus driver, and introducing a new driver
  hv_sock.ko, which implements a new socket address family AF_HYPERV.
 
 
  I know the kernel has already had a VM Sockets driver (AF_VSOCK) based
  on VMware's VMCI (net/vmw_vsock/, drivers/misc/vmw_vmci), and KVM is
  proposing AF_VSOCK of virtio version:
  http://thread.gmane.org/gmane.linux.network/365205.
 
  However, though Hyper-V VM Sockets may seem conceptually similar to
  AF_VSOCK, there are differences in the transportation layer, and IMO these
  make the direct code reusing impractical:
 
  1. In AF_VSOCK, the endpoint type is: <u32 ContextID, u32 Port>, but in
  AF_HYPERV, the endpoint type is: <GUID VM_ID, GUID ServiceID>. Here GUID
  is 128-bit.
 
  2. AF_VSOCK supports SOCK_DGRAM, while AF_HYPERV doesn't.
 
  3. AF_VSOCK supports some special sock opts, like
 SO_VM_SOCKETS_BUFFER_SIZE,
  SO_VM_SOCKETS_BUFFER_MIN/MAX_SIZE and
 SO_VM_SOCKETS_CONNECT_TIMEOUT.
  These are meaningless to AF_HYPERV.
 
  4. Some AF_VSOCK's VMCI transportation ops are meaningless to
 AF_HYPERV/VMBus,
  like .notify_recv_init
  .notify_recv_pre_block
  .notify_recv_pre_dequeue
  .notify_recv_post_dequeue
  .notify_send_init
  .notify_send_pre_block
  .notify_send_pre_enqueue
  .notify_send_post_enqueue
  etc.
 
  So I think we'd better introduce a new address family: AF_HYPERV.
 
 Points 2-4 are not critical.  I think there are solutions to them.
 
 Point 1 is the main issue: hvsock has <GUID, GUID> addresses instead of
 vsock's <u32, u32> addresses.  Perhaps a mapping could be used but that
 is pretty ugly.
Hi Stefan,
Exactly!

In the current AF_VSOCK code and the related transport layer (the wrapper
ops of VMware's VMCI), the <u32, u32> endpoint is widely used by
struct sockaddr_vm (this struct is exported to the user space).

So, anyway, the user space application has to explicitly handle the different
endpoint sizes.

And in the driver side, IMO there is no way to reuse the code of
AF_VSOCK with clean changes.

 One idea is something like a userspace <GUID, GUID> ->
 <u32, u32> lookup function that applications can use if they want to
 accept GUIDs.
Thanks for the suggestion!
While this is technically possible, IMO it would mess up the driver side's
AF_VSOCK code: in many places, we'll have to add ugly code like:

IF the endpoint size is <u32, u32> THEN
use the existing logic;
ELSE
use the new logic;

 I don't have a workable alternative to propose, so I agree that a new
 address family is justified.
Thanks for your exact understanding! :-)

-- Dexuan
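
The size mismatch discussed in this thread, and the userspace lookup idea that was floated, can be made concrete with a small sketch. All type and function names here (guid128, hv_endpoint, vsock_endpoint, ep_map_entry, ep_map_lookup) are hypothetical and chosen only for illustration; they are neither the real struct sockaddr_vm nor anything from the hvsock patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layouts, for illustration only. */
struct guid128 { uint8_t b[16]; };

/* <GUID VM_ID, GUID ServiceID>: 32 bytes */
struct hv_endpoint { struct guid128 vm_id; struct guid128 service_id; };

/* <u32 ContextID, u32 Port>: 8 bytes */
struct vsock_endpoint { uint32_t cid; uint32_t port; };

/* One row of a userspace translation table. */
struct ep_map_entry { struct hv_endpoint hv; struct vsock_endpoint vs; };

/* Linear search: returns 0 and fills *out on a match, -1 otherwise. */
static int ep_map_lookup(const struct ep_map_entry *tbl, size_t n,
			 const struct hv_endpoint *key,
			 struct vsock_endpoint *out)
{
	assert(tbl != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		/* hv_endpoint holds only byte arrays, so memcmp is safe */
		if (memcmp(&tbl[i].hv, key, sizeof(*key)) == 0) {
			*out = tbl[i].vs;
			return 0;
		}
	}
	return -1;
}
```

The 32-byte key cannot be packed into the 8-byte pair, so any such table has to be maintained and distributed out of band, which is the extra burden the thread calls ugly.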


Re: [PATCH net-next] rhashtable: Allow other tasks to be scheduled in large lookup loops

2015-07-17 Thread Eric Dumazet

On Fri, 2015-07-17 at 10:07 +0200, Thomas Graf wrote:
 Depending on system speed, the large lookup loop can take a considerable
 amount of time to complete causing watchdog warnings to appear. Allow
 other tasks to be scheduled after every batch of 1000 lookups.
 
 Reported-by: Meelis Roos mr...@linux.ee
 Signed-off-by: Thomas Graf tg...@suug.ch
 ---
  lib/test_rhashtable.c | 9 -
  1 file changed, 8 insertions(+), 1 deletion(-)
 
 diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
 index c90777e..5ed6211 100644
 --- a/lib/test_rhashtable.c
 +++ b/lib/test_rhashtable.c
 @@ -20,8 +20,10 @@
  #include <linux/rcupdate.h>
  #include <linux/rhashtable.h>
  #include <linux/slab.h>
 +#include <linux/sched.h>
  
  #define MAX_ENTRIES  100
 +#define RELAX_CPU_AFTER  1000
  #define TEST_INSERT_FAIL INT_MAX
  
  static int entries = 5;
 @@ -61,7 +63,7 @@ static struct rhashtable_params test_rht_params = {
  
  static int __init test_rht_lookup(struct rhashtable *ht)
  {
 - unsigned int i;
 + unsigned int i, relax_cnt = RELAX_CPU_AFTER;
  
   for (i = 0; i < entries * 2; i++) {
   struct test_obj *obj;
 @@ -87,6 +89,11 @@ static int __init test_rht_lookup(struct rhashtable *ht)
   return -EINVAL;
   }
   }
 +
 + if (!relax_cnt--) {
 + schedule();
 + relax_cnt = RELAX_CPU_AFTER;
 + }
   }
  
   return 0;

Please simply use cond_resched() without counting and magic value.

 



Re: [PATCH net-next] rhashtable: Allow other tasks to be scheduled in large lookup loops

2015-07-17 Thread Thomas Graf

On 07/17/15 at 10:28am, Eric Dumazet wrote:
> On Fri, 2015-07-17 at 10:24 +0200, Eric Dumazet wrote:
>
> > Please simply use cond_resched() without counting and magic value.

Done

> Also use cond_resched() in insert and delete phases ?

When I tried that it made the walker duplicates disappear which weakens
the test case a little bit but it's probably safer this way.  I'll include
it in the v2.

[PATCH net-next v2] rhashtable: Allow other tasks to be scheduled in large lookup loops

2015-07-17 Thread Thomas Graf

Depending on system speed, the large lookup/insert/delete loops of the
testsuite can take a considerable amount of time to complete causing
watchdog warnings to appear. Allow other tasks to be scheduled
throughout the loops.

Reported-by: Meelis Roos mr...@linux.ee
Signed-off-by: Thomas Graf tg...@suug.ch
---
v2: Use cond_resched() instead of schedule()

 lib/test_rhashtable.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index c90777e..9af7cef 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -20,6 +20,7 @@
 #include <linux/rcupdate.h>
 #include <linux/rhashtable.h>
 #include <linux/slab.h>
+#include <linux/sched.h>
 
 #define MAX_ENTRIES100
 #define TEST_INSERT_FAIL INT_MAX
@@ -87,6 +88,8 @@ static int __init test_rht_lookup(struct rhashtable *ht)
return -EINVAL;
}
}
+
+   cond_resched_rcu();
}
 
return 0;
@@ -160,6 +163,8 @@ static s64 __init test_rhashtable(struct rhashtable *ht)
} else if (err) {
return err;
}
+
+   cond_resched();
}
 
if (insert_fails)
@@ -183,6 +188,8 @@ static s64 __init test_rhashtable(struct rhashtable *ht)
 
rhashtable_remove_fast(ht, &obj->node, test_rht_params);
}
+
+   cond_resched();
}
 
end = ktime_get_ns();
-- 
2.4.3
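
The shape of the v2 change (an unconditional yield point at the end of every loop iteration, with no counter or batch size) can be mimicked in userspace. This is only a loose analogy and rests on stated assumptions: sched_yield() always relinquishes the CPU, whereas the kernel's cond_resched() reschedules only when needed, and the names yield_point, yield_points and sum_with_yield are invented for this sketch.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Counts how often the loop offered the CPU to other tasks. */
static unsigned long yield_points;

/* Userspace stand-in for a cheap, always-safe yield point. */
static void yield_point(void)
{
	yield_points++;
	sched_yield();
}

/* Sum n values, offering to yield after every element. */
static long sum_with_yield(const int *vals, size_t n)
{
	long sum = 0;

	assert(vals != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		sum += vals[i];
		yield_point();	/* no counter, no magic batch size */
	}
	return sum;
}
```

Compared with the v1 approach of counting down from RELAX_CPU_AFTER, the loop body stays free of bookkeeping; the cost of deciding whether to yield is left to the yield primitive itself.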


Re: 4.1 regression in resizable hashtable tests

2015-07-17 Thread Thomas Graf

On 07/02/15 at 10:09pm, Meelis Roos wrote:
> [   33.425061] Running rhashtable test nelem=8, max_size=65536, shrinking=0
> [   33.425154] Test 00:
> [   33.534470]   Adding 5 keys
> [   34.743553] Info: encountered resize
> [   34.743698] Info: encountered resize
> [   34.743838] Info: encountered resize
> [   34.744057] Info: encountered resize
> [   34.744430] Info: encountered resize
> [   34.745139] Info: encountered resize
> [   34.746441] Info: encountered resize
> [   34.749055] Info: encountered resize
> [   34.754469] Info: encountered resize
> [   34.764836] Info: encountered resize
> [   34.785696] Info: encountered resize
> [   34.827448] Info: encountered resize
> [   34.896936]   Traversal complete: counted=49993, nelems=5,
> entries=5, table-jumps=12
> [   34.897056] Test failed: Total count mismatch ^^^

I do see count mismatches as well due to the design of the walker
which restarts and thus sees certain entries multiple times.

Do you have this commit as well?

Author: Phil Sutter p...@nwl.cc
Date:   Mon Jul 6 15:51:20 2015 +0200

rhashtable: fix for resize events during table walk

[PATCH net] caif: fix leaks and race in caif_queue_rcv_skb()

2015-07-17 Thread Eric Dumazet

From: Eric Dumazet eduma...@google.com

1) If sk_filter() is applied, skb was leaked (not freed)
2) Testing SOCK_DEAD twice is racy :
   packet could be freed while already queued.
3) Remove obsolete comment about caching skb->len

Signed-off-by: Eric Dumazet eduma...@google.com
---
 net/caif/caif_socket.c |   19 ---
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/net/caif/caif_socket.c b/net/caif/caif_socket.c
index 3cc71b9f5517..cc858919108e 100644
--- a/net/caif/caif_socket.c
+++ b/net/caif/caif_socket.c
@@ -121,12 +121,13 @@ static void caif_flow_ctrl(struct sock *sk, int mode)
  * Copied from sock.c:sock_queue_rcv_skb(), but changed so packets are
  * not dropped, but CAIF is sending flow off instead.
  */
-static int caif_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
+static void caif_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
int err;
unsigned long flags;
struct sk_buff_head *list = &sk->sk_receive_queue;
struct caifsock *cf_sk = container_of(sk, struct caifsock, sk);
+   bool queued = false;
 
if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
(unsigned int)sk->sk_rcvbuf && rx_flow_is_on(cf_sk)) {
@@ -139,7 +140,8 @@ static int caif_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 
err = sk_filter(sk, skb);
if (err)
-   return err;
+   goto out;
+
if (!sk_rmem_schedule(sk, skb, skb->truesize) && rx_flow_is_on(cf_sk)) {
set_rx_flow_off(cf_sk);
net_dbg_ratelimited("sending flow OFF due to rmem_schedule\n");
@@ -147,21 +149,16 @@ static int caif_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
}
skb->dev = NULL;
skb_set_owner_r(skb, sk);
-   /* Cache the SKB length before we tack it onto the receive
-* queue. Once it is added it no longer belongs to us and
-* may be freed by other threads of control pulling packets
-* from the queue.
-*/
spin_lock_irqsave(&list->lock, flags);
-   if (!sock_flag(sk, SOCK_DEAD))
+   queued = !sock_flag(sk, SOCK_DEAD);
+   if (queued)
__skb_queue_tail(list, skb);
spin_unlock_irqrestore(&list->lock, flags);
-
-   if (!sock_flag(sk, SOCK_DEAD))
+out:
+   if (queued)
sk->sk_data_ready(sk);
else
kfree_skb(skb);
-   return 0;
 }
 
 /* Packet Receive Callback function called from CAIF Stack */
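
Point 2 of the commit message (testing SOCK_DEAD twice is racy) comes down to sampling the flag once while the lock is held and reusing that answer afterwards. A minimal userspace sketch of the pattern follows, assuming a pthread mutex in place of the queue spinlock; struct rx_queue, its counters and rx_enqueue are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical receive queue; counters replace real packet handling. */
struct rx_queue {
	pthread_mutex_t lock;
	bool dead;	/* plays the role of SOCK_DEAD */
	int queued;	/* packets appended to the queue */
	int notified;	/* "data ready" callbacks issued */
	int freed;	/* packets released instead of queued */
};

/* Decide once, under the lock, whether the packet is queued. The same
 * answer then selects between notifying the reader and freeing the
 * packet, so the two steps can never disagree. */
static void rx_enqueue(struct rx_queue *q)
{
	bool queued;

	assert(q != NULL);
	pthread_mutex_lock(&q->lock);
	queued = !q->dead;
	if (queued)
		q->queued++;
	pthread_mutex_unlock(&q->lock);

	if (queued)
		q->notified++;
	else
		q->freed++;
}
```

Had the flag been re-read after unlocking, a concurrent close could flip it in between, and a packet already sitting on the queue would also be freed.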



Re: [PATCH net-next] rhashtable: Allow other tasks to be scheduled in large lookup loops

2015-07-17 Thread Eric Dumazet

On Fri, 2015-07-17 at 10:24 +0200, Eric Dumazet wrote:

> Please simply use cond_resched() without counting and magic value.

Also use cond_resched() in insert and delete phases ?



Re: [PATCH] ravb: do not invalidate cache for RX buffer twice

2015-07-17 Thread David Miller

From: Sergei Shtylyov sergei.shtyl...@cogentembedded.com
Date: Wed, 15 Jul 2015 00:56:52 +0300

> First, dma_sync_single_for_cpu() shouldn't have been called in the first place
> (it's a streaming DMA API). dma_unmap_single() should have been called
> instead.
> Second, dma_unmap_single() call after handing the buffer to napi_gro_receive()
> makes little sense.
>
> Signed-off-by: Sergei Shtylyov sergei.shtyl...@cogentembedded.com

Applied with fixed up commit log message, thanks.

[PATCH net-next] rhashtable: Allow other tasks to be scheduled in large lookup loops

2015-07-17 Thread Thomas Graf

Depending on system speed, the large lookup loop can take a considerable
amount of time to complete causing watchdog warnings to appear. Allow
other tasks to be scheduled after every batch of 1000 lookups.

Reported-by: Meelis Roos mr...@linux.ee
Signed-off-by: Thomas Graf tg...@suug.ch
---
 lib/test_rhashtable.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index c90777e..5ed6211 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -20,8 +20,10 @@
 #include <linux/rcupdate.h>
 #include <linux/rhashtable.h>
 #include <linux/slab.h>
+#include <linux/sched.h>
 
 #define MAX_ENTRIES	1000000
+#define RELAX_CPU_AFTER	1000
 #define TEST_INSERT_FAIL INT_MAX
 
 static int entries = 50000;
@@ -61,7 +63,7 @@ static struct rhashtable_params test_rht_params = {
 
 static int __init test_rht_lookup(struct rhashtable *ht)
 {
-	unsigned int i;
+	unsigned int i, relax_cnt = RELAX_CPU_AFTER;
 
 	for (i = 0; i < entries * 2; i++) {
 		struct test_obj *obj;
@@ -87,6 +89,11 @@ static int __init test_rht_lookup(struct rhashtable *ht)
 				return -EINVAL;
 			}
 		}
+
+		if (!relax_cnt--) {
+			schedule();
+			relax_cnt = RELAX_CPU_AFTER;
+		}
 	}
 
 	return 0;
-- 
2.4.3
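
The batching idea in this patch can be tried outside the kernel. The following is only a userspace sketch, not the patch itself: sched_yield() stands in for the kernel's schedule(), and YIELD_EVERY and run_lookups() are hypothetical names made up for the illustration. Inside the kernel, cond_resched() without a counter (as suggested in the review reply earlier in this digest) would presumably replace the whole counting scheme.

```c
#include <sched.h>

/* Hypothetical batch size, mirroring RELAX_CPU_AFTER in the patch. */
#define YIELD_EVERY 1000u

/*
 * Run n iterations of a stand-in workload and give up the CPU after
 * every YIELD_EVERY iterations. Returns how many times it yielded.
 */
static unsigned int run_lookups(unsigned int n)
{
	unsigned int i, yields = 0;
	volatile unsigned int sink = 0;	/* placeholder for the real lookup */

	for (i = 0; i < n; i++) {
		sink += i;

		/* Yield once each full batch has been completed. */
		if ((i + 1) % YIELD_EVERY == 0) {
			sched_yield();
			yields++;
		}
	}
	return yields;
}
```

Unlike the countdown in the patch, this variant tests the loop index directly, so it needs no separate counter that has to be reset.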


Re: [PATCH net-next 1/2] bpf: introduce bpf_skb_vlan_push/pop() helpers

2015-07-17 Thread Eric Dumazet

On Thu, 2015-07-16 at 19:58 -0700, Alexei Starovoitov wrote:
> In order to let eBPF programs call skb_vlan_push/pop via helper functions

Why should eBPF program do such thing ?

Are BPF users in the kernel expecting skb being changed, and are we sure
they reload all cached values when/if needed ?

> eBPF JITs need to recognize helpers that change skb->data, since
> skb->data and hlen are cached as part of JIT code generation.
> - arm64 JIT is using bpf_load_pointer() without caching, so it's ok as-is.
> - x64 JIT recognizes bpf_skb_vlan_push/pop() calls and re-caches skb->data/hlen
>   after such calls (experiments showed that conditional re-caching is slower).
> - s390 JIT falls back to interpreter for now when bpf_skb_vlan_push() is present
>   in the program (re-caching is tbd).



> +static u64 bpf_skb_vlan_push(u64 r1, u64 vlan_proto, u64 vlan_tci, u64 r4, u64 r5)
> +{
> +	struct sk_buff *skb = (struct sk_buff *) (long) r1;
> +
> +	if (unlikely(vlan_proto != htons(ETH_P_8021Q) &&
> +		     vlan_proto != htons(ETH_P_8021AD)))
> +		vlan_proto = htons(ETH_P_8021Q);


This would raise a sparse error, as vlan_proto is u64 and htons() returns __be16.

make C=2 CF=-D__CHECK_ENDIAN__ net/core/filter.o
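
One way to sidestep the type mismatch is to narrow the register value once and compare like with like. This is a userspace sketch under stated assumptions, not the helper from the patch: normalize_vlan_proto() is a hypothetical name, the EtherType constants are defined locally, and plain uint16_t is used where kernel code would declare the narrowed variable as __be16 (probably with an explicit cast) so that sparse can check it.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* EtherType values for 802.1Q and 802.1ad VLAN tags. */
#define ETH_P_8021Q  0x8100
#define ETH_P_8021AD 0x88A8

/*
 * Hypothetical helper: take the protocol as it arrives in a 64-bit
 * register, narrow it once to a 16-bit network-order value, and fall
 * back to 802.1Q for anything that is not a known VLAN EtherType.
 */
static uint16_t normalize_vlan_proto(uint64_t reg)
{
	uint16_t proto = (uint16_t)reg;	/* explicit narrowing, network order */

	if (proto != htons(ETH_P_8021Q) && proto != htons(ETH_P_8021AD))
		proto = htons(ETH_P_8021Q);
	return proto;
}
```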




Re: sit: Set SKB_GSO_SIT bit when performing GRO

2015-07-17 Thread Wolfgang Walter

On Friday, 17 July 2015, 09:56:51, Herbert Xu wrote:
> On Thu, Jul 16, 2015 at 12:58:45PM +0200, Wolfgang Walter wrote:
> > On Thursday, 16 July 2015, 08:23:50, Herbert Xu wrote:
> > > On Wed, Jul 15, 2015 at 02:25:59PM +0200, Wolfgang Walter wrote:
> > > > Yes. Switching TSO off and leaving GRO on works, too.
> > >
> > > OK, could you please try this patch?
> >
> > Patch works here.
>
> Thanks for the confirmation.  Let's add a tag for patchwork:
>
> Tested-by: Wolfgang Walter li...@stwm.de

It seems that this patch may cause a problem with another one of our routers.
Without the patch it had no problem, so I had not tested it there.

With the patch, one interface blocks after some time. Not even ARP requests
get answered. It still receives packets, though. Restarting the interface fixes
the problem.

Switching off GRO for the other interface helps.

This router is different from the other ones. It does not directly route
ISATAP packets. It may route ISATAP packets encapsulated in GRE packets,
though. It is not itself a GRE endpoint.

The router does NAT. Basically it routes the GRE tunnel packets without NAT and
NATs most of the rest.
Not doing NAT and conntrack (and unloading all modules like nf_conntrack_ipv4,
nf_defrag_ipv4) does not help.

eth0: external
eth1: internal

One (IPv4) GRE tunnel is routed between eth0 and eth1.
IPv6 ESP tunnels are routed between eth0 and eth1.
IPv4 UDP/TCP/ICMP from the internal side is NATted with netfilter.

eth1 stops sending with the patch after some time.
Disabling GRO on eth0 helps.
Disabling TSO or GSO on eth0 and/or eth1 does not help.

eth0 and eth1 are both intel I350.


Regards,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts

[PATCH v2 0/2] sctp: fix src address selection if using secondary address

2015-07-17 Thread Marcelo Ricardo Leitner

This series improves the way SCTP chooses its src address so that the
chosen one will always belong to the interface being used for output.

v1->v2:
 - split out the refactoring from the fix itself
 - Doing a full reverse routing as in v1 is not necessary. Only looking
   for the interface that has the address and comparing its number is
   enough.

Marcelo Ricardo Leitner (2):
  sctp: reduce indent level on sctp_v4_get_dst
  sctp: fix src address selection if using secondary addresses

 net/sctp/protocol.c | 42 +++++++++++++++++++++++++++---------------
 1 file changed, 27 insertions(+), 15 deletions(-)

-- 
2.4.1


[PATCH v2 1/2] sctp: reduce indent level on sctp_v4_get_dst

2015-07-17 Thread Marcelo Ricardo Leitner

Paves the way for the next patch. Functionality stays untouched.

Signed-off-by: Marcelo Ricardo Leitner marcelo.leit...@gmail.com
---
 net/sctp/protocol.c | 32 +++++++++++++++++---------------
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 59e80356672bdf89777265ae1f8c384792dfb98c..fa80fe4f23629fc3c3f5c44f99dbf3cc524cc6a0 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -489,21 +489,23 @@ static void sctp_v4_get_dst(struct sctp_transport *t, union sctp_addr *saddr,
 	list_for_each_entry_rcu(laddr, &bp->address_list, list) {
 		if (!laddr->valid)
 			continue;
-		if ((laddr->state == SCTP_ADDR_SRC) &&
-		    (AF_INET == laddr->a.sa.sa_family)) {
-			fl4->fl4_sport = laddr->a.v4.sin_port;
-			flowi4_update_output(fl4,
-					     asoc->base.sk->sk_bound_dev_if,
-					     RT_CONN_FLAGS(asoc->base.sk),
-					     daddr->v4.sin_addr.s_addr,
-					     laddr->a.v4.sin_addr.s_addr);
-
-			rt = ip_route_output_key(sock_net(sk), fl4);
-			if (!IS_ERR(rt)) {
-				dst = &rt->dst;
-				goto out_unlock;
-			}
-		}
+		if (laddr->state != SCTP_ADDR_SRC ||
+		    AF_INET != laddr->a.sa.sa_family)
+			continue;
+
+		fl4->fl4_sport = laddr->a.v4.sin_port;
+		flowi4_update_output(fl4,
+				     asoc->base.sk->sk_bound_dev_if,
+				     RT_CONN_FLAGS(asoc->base.sk),
+				     daddr->v4.sin_addr.s_addr,
+				     laddr->a.v4.sin_addr.s_addr);
+
+		rt = ip_route_output_key(sock_net(sk), fl4);
+		if (IS_ERR(rt))
+			continue;
+
+		dst = &rt->dst;
+		break;
 	}
 
 out_unlock:
-- 
2.4.1


[PATCH v2 2/2] sctp: fix src address selection if using secondary addresses

2015-07-17 Thread Marcelo Ricardo Leitner

In short, sctp is likely to incorrectly choose src address if socket is
bound to secondary addresses. This patch fixes it by adding a new check
that checks if such src address belongs to the interface that routing
identified as output.

This is enough to avoid rp_filter drops on remote peer.

Details:

Currently, sctp will do a routing attempt without specifying the src
address and compare the returned value (preferred source) with the
addresses that the socket is bound to. When using secondary addresses,
this will not match.

Then it will try specifying each of the addresses that the socket is
bound to and re-routing, checking if that address is valid as src for
that dst. Thing is, this check alone is weak:

# ip r l
192.168.100.0/24 dev eth1  proto kernel  scope link  src 192.168.100.149
192.168.122.0/24 dev eth0  proto kernel  scope link  src 192.168.122.147

# ip a l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:15:18:6a brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.147/24 brd 192.168.122.255 scope global dynamic eth0
       valid_lft 2160sec preferred_lft 2160sec
    inet 192.168.122.148/24 scope global secondary eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe15:186a/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:b3:91:46 brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.149/24 brd 192.168.100.255 scope global dynamic eth1
       valid_lft 2162sec preferred_lft 2162sec
    inet 192.168.100.148/24 scope global secondary eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:feb3:9146/64 scope link
       valid_lft forever preferred_lft forever
4: ens9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:05:47:ee brd ff:ff:ff:ff:ff:ff
    inet6 fe80::5054:ff:fe05:47ee/64 scope link
       valid_lft forever preferred_lft forever

# ip r g 192.168.100.193 from 192.168.122.148
192.168.100.193 from 192.168.122.148 dev eth1
cache

Even if you specify an interface:

# ip r g 192.168.100.193 from 192.168.122.148 oif eth1
192.168.100.193 from 192.168.122.148 dev eth1
cache

Although this would be valid, peers using rp_filter will drop such
packets as their src doesn't match the routes for that interface.

Signed-off-by: Marcelo Ricardo Leitner marcelo.leit...@gmail.com
---
 net/sctp/protocol.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index fa80fe4f23629fc3c3f5c44f99dbf3cc524cc6a0..4345790ad3266c353eeac5398593c2a9ce4effda 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -487,6 +487,8 @@ static void sctp_v4_get_dst(struct sctp_transport *t, union sctp_addr *saddr,
 	 */
 	rcu_read_lock();
 	list_for_each_entry_rcu(laddr, &bp->address_list, list) {
+		struct net_device *odev;
+
 		if (!laddr->valid)
 			continue;
 		if (laddr->state != SCTP_ADDR_SRC ||
@@ -504,6 +506,14 @@ static void sctp_v4_get_dst(struct sctp_transport *t, union sctp_addr *saddr,
 		if (IS_ERR(rt))
 			continue;
 
+		/* Ensure the src address belongs to the output
+		 * interface.
+		 */
+		odev = __ip_dev_find(sock_net(sk), laddr->a.v4.sin_addr.s_addr,
+				     false);
+		if (!odev || odev->ifindex != fl4->flowi4_oif)
+			continue;
+
 		dst = &rt->dst;
 		break;
 	}
-- 
2.4.1
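
The extra check can be modelled in userspace to see its effect on the example setup from the commit message. This is a minimal sketch, not the kernel code: struct if_addr, host_addrs, addr_to_ifindex() and pick_src() are hypothetical stand-ins (addr_to_ifindex() plays the role that __ip_dev_find() has in the patch), and the interface indexes 2 and 3 are taken from the "ip a l" listing in the commit message.

```c
#include <stddef.h>
#include <stdint.h>

/* Build a host-order IPv4 address from its four octets. */
#define IP4(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
	 ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Hypothetical stand-in for the kernel's per-interface address list. */
struct if_addr {
	uint32_t addr;
	int ifindex;
};

/* Addresses from the commit message: eth0 is index 2, eth1 is index 3. */
static const struct if_addr host_addrs[] = {
	{ IP4(192, 168, 122, 147), 2 },
	{ IP4(192, 168, 122, 148), 2 },
	{ IP4(192, 168, 100, 149), 3 },
	{ IP4(192, 168, 100, 148), 3 },
};

/* Return the ifindex that owns addr, or 0 if no interface has it. */
static int addr_to_ifindex(uint32_t addr)
{
	size_t i;

	for (i = 0; i < sizeof(host_addrs) / sizeof(host_addrs[0]); i++)
		if (host_addrs[i].addr == addr)
			return host_addrs[i].ifindex;
	return 0;
}

/*
 * Pick the first bound address that lives on the output interface
 * chosen by routing; return 0 when none qualifies.
 */
static uint32_t pick_src(const uint32_t *bound, size_t n, int oif)
{
	size_t i;

	for (i = 0; i < n; i++) {
		int ifindex = addr_to_ifindex(bound[i]);

		if (ifindex && ifindex == oif)
			return bound[i];
	}
	return 0;
}
```

With a socket bound to 192.168.122.148 and 192.168.100.148 and a route that leaves through eth1, the sketch skips the eth0 address even though it is listed first, which is the behaviour the patch aims for.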


RE: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be processed during probe

2015-07-17 Thread KY Srinivasan



> -----Original Message-----
> From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
> Sent: Friday, July 17, 2015 7:13 AM
> To: KY Srinivasan
> Cc: da...@davemloft.net; netdev@vger.kernel.org; linux-
> ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de;
> a...@canonical.com; jasow...@redhat.com; Dexuan Cui
> Subject: Re: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be
> processed during probe
>
> K. Y. Srinivasan k...@microsoft.com writes:
>
> > The current code returns from probe without waiting for the proper handling
> > of subchannels that may be requested. If the netvsc driver were to be rapidly
> > loaded/unloaded, we can  trigger a panic as the unload will be tearing
> > down state that may not have been fully setup yet. We fix this issue by making
> > sure that we return from the probe call only after ensuring that the
> > sub-channel offers in flight are properly handled.
> >
> > Signed-off-by: K. Y. Srinivasan k...@microsoft.com
> > Reviewed-and-tested-by: Haiyang Zhang haiya...@microsoft.com
> > ---
> >  drivers/net/hyperv/hyperv_net.h   |    2 ++
> >  drivers/net/hyperv/rndis_filter.c |   25 +++++++++++++++++++++++++
> >  2 files changed, 27 insertions(+), 0 deletions(-)
> >
> > diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> > index 26cd14c..925b75d 100644
> > --- a/drivers/net/hyperv/hyperv_net.h
> > +++ b/drivers/net/hyperv/hyperv_net.h
> > @@ -671,6 +671,8 @@ struct netvsc_device {
> >  	u32 send_table[VRSS_SEND_TAB_SIZE];
> >  	u32 max_chn;
> >  	u32 num_chn;
> > +	spinlock_t sc_lock; /* Protects num_sc_offered variable */
> > +	u32 num_sc_offered;
> >  	atomic_t queue_sends[NR_CPUS];
> >
> >  	/* Holds rndis device info */
> > diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
> > index 2e40417..2e09f3f 100644
> > --- a/drivers/net/hyperv/rndis_filter.c
> > +++ b/drivers/net/hyperv/rndis_filter.c
> > @@ -984,9 +984,16 @@ static void netvsc_sc_open(struct vmbus_channel *new_sc)
> >  	struct netvsc_device *nvscdev;
> >  	u16 chn_index = new_sc->offermsg.offer.sub_channel_index;
> >  	int ret;
> > +	unsigned long flags;
> >
> >  	nvscdev = hv_get_drvdata(new_sc->primary_channel->device_obj);
> >
> > +	spin_lock_irqsave(&nvscdev->sc_lock, flags);
> > +	nvscdev->num_sc_offered--;
> > +	spin_unlock_irqrestore(&nvscdev->sc_lock, flags);
> > +	if (nvscdev->num_sc_offered == 0)
> > +		complete(&nvscdev->channel_init_wait);
> > +
> >  	if (chn_index >= nvscdev->num_chn)
> >  		return;
> >
> > @@ -1015,8 +1022,10 @@ int rndis_filter_device_add(struct hv_device *dev,
> >  	u32 rsscap_size = sizeof(struct ndis_recv_scale_cap);
> >  	u32 mtu, size;
> >  	u32 num_rss_qs;
> > +	u32 sc_delta;
> >  	const struct cpumask *node_cpu_mask;
> >  	u32 num_possible_rss_qs;
> > +	unsigned long flags;
> >
> >  	rndis_device = get_rndis_device();
> >  	if (!rndis_device)
> > @@ -1039,6 +1048,8 @@ int rndis_filter_device_add(struct hv_device *dev,
> >  	net_device->max_chn = 1;
> >  	net_device->num_chn = 1;
> >
> > +	spin_lock_init(&net_device->sc_lock);
> > +
> >  	net_device->extension = rndis_device;
> >  	rndis_device->net_dev = net_device;
> >
> > @@ -1116,6 +1127,9 @@ int rndis_filter_device_add(struct hv_device *dev,
> >  	num_possible_rss_qs = cpumask_weight(node_cpu_mask);
> >  	net_device->num_chn = min(num_possible_rss_qs, num_rss_qs);
> >
> > +	num_rss_qs = net_device->num_chn - 1;
> > +	net_device->num_sc_offered = num_rss_qs;
> > +
> >  	if (net_device->num_chn == 1)
> >  		goto out;
> >
> > @@ -1157,11 +1171,22 @@ int rndis_filter_device_add(struct hv_device *dev,
> >
> >  	ret = rndis_filter_set_rss_param(rndis_device, net_device->num_chn);
> >
> > +	/*
> > +	 * Wait for the host to send us the sub-channel offers.
> > +	 */
> > +	spin_lock_irqsave(&net_device->sc_lock, flags);
> > +	sc_delta = net_device->num_chn - 1 - num_rss_qs;
> > +	net_device->num_sc_offered -= sc_delta;
> > +	spin_unlock_irqrestore(&net_device->sc_lock, flags);
> > +
> > +	if (net_device->num_sc_offered != 0)
> > +		wait_for_completion(&net_device->channel_init_wait);
>
> I'd suggest we add an essential timeout (big, let's say 30 sec.)
> here. In case something goes wrong we don't really want to hang the
> whole kernel for forever. Such bugs are hard to debug as if a 'kernel
> hangs' is reported we can't be sure which wait caused it. We can even
> have something like:
>
>  t = wait_for_completion_timeout(&net_device->channel_init_wait, 30*HZ);
>  BUG_ON(t == 0);
>
> This is much better as we'll be sure what went wrong. (I know other
> pieces of hyper-v code use wait_for_completion() without a timeout, this
> is rather a general suggestion for all of them).

There is some history here. Initially, I had timeout for calls where we could
reasonably rollback state if we timed out. Some calls were subsequently changed
to unconditional wait because under some load conditions, these timeouts would
trigger (granted I did not have 30 second timeout; it was a 5 second timeout).

Greg was opposed to calls to BUG_ON() in general for drivers.

RE: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be processed during probe

2015-07-17 Thread KY Srinivasan

> -----Original Message-----
> From: Dexuan Cui
> Sent: Friday, July 17, 2015 3:01 AM
> To: KY Srinivasan; da...@davemloft.net; netdev@vger.kernel.org; linux-
> ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de;
> a...@canonical.com; jasow...@redhat.com; vkuzn...@redhat.com
> Cc: KY Srinivasan
> Subject: RE: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be
> processed during probe
>
> > From: K. Y. Srinivasan
> > Sent: Friday, July 17, 2015 3:17
> > Subject: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be processed
> > during probe
> > diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> > ...
> > @@ -1116,6 +1127,9 @@ int rndis_filter_device_add(struct hv_device *dev,
> >  	num_possible_rss_qs = cpumask_weight(node_cpu_mask);
> >  	net_device->num_chn = min(num_possible_rss_qs, num_rss_qs);
> >
> > +	num_rss_qs = net_device->num_chn - 1;
> > +	net_device->num_sc_offered = num_rss_qs;
> > +
> >  	if (net_device->num_chn == 1)
> >  		goto out;
> >
> > @@ -1157,11 +1171,22 @@ int rndis_filter_device_add(struct hv_device *dev,
> >
> >  	ret = rndis_filter_set_rss_param(rndis_device, net_device->num_chn);
> >
> > +	/*
> > +	 * Wait for the host to send us the sub-channel offers.
> > +	 */
> > +	spin_lock_irqsave(&net_device->sc_lock, flags);
> > +	sc_delta = net_device->num_chn - 1 - num_rss_qs;
> > +	net_device->num_sc_offered -= sc_delta;
>
> Hi KY,
> IMO here the -= should be +=?
>
> I think sc_delta is usually <= 0, meaning the host may allocate less
> subchannels than we expect.
> With -=, net_device->num_sc_offered can become bigger -- this doesn't
> seem correct.

We control how many sub-channels we want the host to offer (say sc_requested).
Based on this number we begin to track how many have actually been processed -
we decrement sc_requested each time a sub-channel offer is processed. If the
host were to actually offer all that we have requested, then checking for
sc_requested to be zero is sufficient to ensure that we have processed all the
potentially in-flight sub-channels. However, the host may choose to offer less
than what we had asked for and the variable delta is tracking this difference.
Since we are counting down from what we had asked for, we have to subtract
delta for proper accounting.
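
The accounting described in this paragraph can be written out as plain arithmetic. This is a sketch of the description only, not of the driver: offers_outstanding() and its parameter names are hypothetical, and delta is taken as requested minus offered, which is an assumption chosen so that the unsigned value cannot wrap as long as processed <= offered <= requested.

```c
#include <stdint.h>

/*
 * Hypothetical model of the countdown described above.
 *   requested: sub-channels the guest asked the host for
 *   processed: offers already handled (each one decrements the counter)
 *   offered:   sub-channels the host actually granted (<= requested)
 * Returns how many offers are still outstanding.
 */
static uint32_t offers_outstanding(uint32_t requested, uint32_t processed,
				   uint32_t offered)
{
	uint32_t counter = requested - processed; /* counted down so far */
	uint32_t delta = requested - offered;     /* shortfall, never negative */

	return counter - delta;                   /* equals offered - processed */
}
```

For example, with 7 requested, 5 offered and 2 already processed, 3 offers are still outstanding; the wait can end once the result reaches zero.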

> Why not use
> net_device->num_sc_offered = net_device->num_chn - 1; directly?
> At this point, net_device->num_chn has been the number of the actual
> channels.

I am not sure what the question here is. num_sc_offered is initialized to the
number we are going to ask for, and this is the number that will be decremented
each time a sub-channel is processed. Since the host may decide to offer us
less than what we had asked for, and some sub-channels may have already been
processed (num_sc_offered decremented accordingly) by the time we discover that
the host has offered us less than what we asked for, we adjust num_sc_offered
accordingly.

> > +	spin_unlock_irqrestore(&net_device->sc_lock, flags);
> > +
> > +	if (net_device->num_sc_offered != 0)
> > +		wait_for_completion(&net_device->channel_init_wait);
>
> BTW, I also tested the patch and I can confirm the panic I saw disappeared
> with the patch.

Thank you.

K. Y

> -- Dexuan

Re: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be processed during probe

2015-07-17 Thread Vitaly Kuznetsov

KY Srinivasan k...@microsoft.com writes:

>> -----Original Message-----
>> From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
>> Sent: Friday, July 17, 2015 7:13 AM
>> To: KY Srinivasan
>> Cc: da...@davemloft.net; netdev@vger.kernel.org; linux-
>> ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de;
>> a...@canonical.com; jasow...@redhat.com; Dexuan Cui
>> Subject: Re: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be
>> processed during probe
>>
>> K. Y. Srinivasan k...@microsoft.com writes:
>>
>> > The current code returns from probe without waiting for the proper handling
>> > of subchannels that may be requested. If the netvsc driver were to be rapidly
>> > loaded/unloaded, we can  trigger a panic as the unload will be tearing
>> > down state that may not have been fully setup yet. We fix this issue by making
>> > sure that we return from the probe call only after ensuring that the
>> > sub-channel offers in flight are properly handled.
>> >
>> > Signed-off-by: K. Y. Srinivasan k...@microsoft.com
>> > Reviewed-and-tested-by: Haiyang Zhang haiya...@microsoft.com
>> > ---
>> >  drivers/net/hyperv/hyperv_net.h   |    2 ++
>> >  drivers/net/hyperv/rndis_filter.c |   25 +++++++++++++++++++++++++
>> >  2 files changed, 27 insertions(+), 0 deletions(-)
>> >
>> > diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
>> > index 26cd14c..925b75d 100644
>> > --- a/drivers/net/hyperv/hyperv_net.h
>> > +++ b/drivers/net/hyperv/hyperv_net.h
>> > @@ -671,6 +671,8 @@ struct netvsc_device {
>> >  	u32 send_table[VRSS_SEND_TAB_SIZE];
>> >  	u32 max_chn;
>> >  	u32 num_chn;
>> > +	spinlock_t sc_lock; /* Protects num_sc_offered variable */
>> > +	u32 num_sc_offered;
>> >  	atomic_t queue_sends[NR_CPUS];
>> >
>> >  	/* Holds rndis device info */
>> > diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
>> > index 2e40417..2e09f3f 100644
>> > --- a/drivers/net/hyperv/rndis_filter.c
>> > +++ b/drivers/net/hyperv/rndis_filter.c
>> > @@ -984,9 +984,16 @@ static void netvsc_sc_open(struct vmbus_channel *new_sc)
>> >  	struct netvsc_device *nvscdev;
>> >  	u16 chn_index = new_sc->offermsg.offer.sub_channel_index;
>> >  	int ret;
>> > +	unsigned long flags;
>> >
>> >  	nvscdev = hv_get_drvdata(new_sc->primary_channel->device_obj);
>> >
>> > +	spin_lock_irqsave(&nvscdev->sc_lock, flags);
>> > +	nvscdev->num_sc_offered--;
>> > +	spin_unlock_irqrestore(&nvscdev->sc_lock, flags);
>> > +	if (nvscdev->num_sc_offered == 0)
>> > +		complete(&nvscdev->channel_init_wait);
>> > +
>> >  	if (chn_index >= nvscdev->num_chn)
>> >  		return;
>> >
>> > @@ -1015,8 +1022,10 @@ int rndis_filter_device_add(struct hv_device *dev,
>> >  	u32 rsscap_size = sizeof(struct ndis_recv_scale_cap);
>> >  	u32 mtu, size;
>> >  	u32 num_rss_qs;
>> > +	u32 sc_delta;
>> >  	const struct cpumask *node_cpu_mask;
>> >  	u32 num_possible_rss_qs;
>> > +	unsigned long flags;
>> >
>> >  	rndis_device = get_rndis_device();
>> >  	if (!rndis_device)
>> > @@ -1039,6 +1048,8 @@ int rndis_filter_device_add(struct hv_device *dev,
>> >  	net_device->max_chn = 1;
>> >  	net_device->num_chn = 1;
>> >
>> > +	spin_lock_init(&net_device->sc_lock);
>> > +
>> >  	net_device->extension = rndis_device;
>> >  	rndis_device->net_dev = net_device;
>> >
>> > @@ -1116,6 +1127,9 @@ int rndis_filter_device_add(struct hv_device *dev,
>> >  	num_possible_rss_qs = cpumask_weight(node_cpu_mask);
>> >  	net_device->num_chn = min(num_possible_rss_qs, num_rss_qs);
>> >
>> > +	num_rss_qs = net_device->num_chn - 1;
>> > +	net_device->num_sc_offered = num_rss_qs;
>> > +
>> >  	if (net_device->num_chn == 1)
>> >  		goto out;
>> >
>> > @@ -1157,11 +1171,22 @@ int rndis_filter_device_add(struct hv_device *dev,
>> >
>> >  	ret = rndis_filter_set_rss_param(rndis_device, net_device->num_chn);
>> >
>> > +	/*
>> > +	 * Wait for the host to send us the sub-channel offers.
>> > +	 */
>> > +	spin_lock_irqsave(&net_device->sc_lock, flags);
>> > +	sc_delta = net_device->num_chn - 1 - num_rss_qs;
>> > +	net_device->num_sc_offered -= sc_delta;
>> > +	spin_unlock_irqrestore(&net_device->sc_lock, flags);
>> > +
>> > +	if (net_device->num_sc_offered != 0)
>> > +		wait_for_completion(&net_device->channel_init_wait);
>>
>> I'd suggest we add an essential timeout (big, let's say 30 sec.)
>> here. In case something goes wrong we don't really want to hang the
>> whole kernel for forever. Such bugs are hard to debug as if a 'kernel
>> hangs' is reported we can't be sure which wait caused it. We can even
>> have something like:
>>
>>  t = wait_for_completion_timeout(&net_device->channel_init_wait, 30*HZ);
>>  BUG_ON(t == 0);
>>
>> This is much better as we'll be sure what went wrong. (I know other
>> pieces of hyper-v code use wait_for_completion() without a timeout, this
>> is rather a general suggestion for all of them).
>
> There is some history here. Initially, I had timeout for calls where we could
> reasonably rollback state if we timed out. Some calls were subsequently changed
> to unconditional wait because under some load conditions, these timeouts would
> trigger (granted I did not have 30 second timeout; it was a 5 second timeout).
>
> Greg was opposed to calls to BUG_ON() in general for drivers.

RE: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be processed during probe

2015-07-17 Thread Dexuan Cui

> From: K. Y. Srinivasan
> Sent: Friday, July 17, 2015 3:17
> Subject: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be processed
> during probe
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> ...
> @@ -1116,6 +1127,9 @@ int rndis_filter_device_add(struct hv_device *dev,
>  	num_possible_rss_qs = cpumask_weight(node_cpu_mask);
>  	net_device->num_chn = min(num_possible_rss_qs, num_rss_qs);
>
> +	num_rss_qs = net_device->num_chn - 1;
> +	net_device->num_sc_offered = num_rss_qs;
> +
>  	if (net_device->num_chn == 1)
>  		goto out;
>
> @@ -1157,11 +1171,22 @@ int rndis_filter_device_add(struct hv_device *dev,
>
>  	ret = rndis_filter_set_rss_param(rndis_device, net_device->num_chn);
>
> +	/*
> +	 * Wait for the host to send us the sub-channel offers.
> +	 */
> +	spin_lock_irqsave(&net_device->sc_lock, flags);
> +	sc_delta = net_device->num_chn - 1 - num_rss_qs;
> +	net_device->num_sc_offered -= sc_delta;

Hi KY,
IMO here the -= should be +=?

I think sc_delta is usually <= 0, meaning the host may allocate less
subchannels than we expect.
With -=, net_device->num_sc_offered can become bigger -- this doesn't seem
correct.

Why not use
net_device->num_sc_offered = net_device->num_chn - 1; directly?
At this point, net_device->num_chn has been the number of the actual channels.


> +	spin_unlock_irqrestore(&net_device->sc_lock, flags);
> +
> +	if (net_device->num_sc_offered != 0)
> +		wait_for_completion(&net_device->channel_init_wait);

BTW, I also tested the patch and I can confirm the panic I saw disappeared with
the patch.
-- Dexuan

Re: 4.1 regression in resizable hashtable tests

2015-07-17 Thread Thomas Graf

On 07/17/15 at 12:26pm, Phil Sutter wrote:
> On Fri, Jul 17, 2015 at 10:04:56AM +0200, Thomas Graf wrote:
> > On 07/02/15 at 10:09pm, Meelis Roos wrote:
> > > [   33.425061] Running rhashtable test nelem=8, max_size=65536, shrinking=0
> > > [   33.425154] Test 00:
> > > [   33.534470]   Adding 50000 keys
> > > [   34.743553] Info: encountered resize
> > > [   34.743698] Info: encountered resize
> > > [   34.743838] Info: encountered resize
> > > [   34.744057] Info: encountered resize
> > > [   34.744430] Info: encountered resize
> > > [   34.745139] Info: encountered resize
> > > [   34.746441] Info: encountered resize
> > > [   34.749055] Info: encountered resize
> > > [   34.754469] Info: encountered resize
> > > [   34.764836] Info: encountered resize
> > > [   34.785696] Info: encountered resize
> > > [   34.827448] Info: encountered resize
> > > [   34.896936]   Traversal complete: counted=49993, nelems=50000, entries=50000, table-jumps=12
> > > [   34.897056] Test failed: Total count mismatch ^^^
> >
> > I do see count mismatches as well due to the design of the walker
> > which restarts and thus sees certain entries multiple times.
> >
> > Do you have this commit as well?
> >
> > Author: Phil Sutter p...@nwl.cc
> > Date:   Mon Jul 6 15:51:20 2015 +0200
> >
> >     rhashtable: fix for resize events during table walk
>
> Thomas, this should be resolved already. Meelis replied[1] to my patch,
> stating it fixes that problem for him. Though he's still waiting for
> your proposed patch to add a schedule() call so the kernel won't
> complain on his slow UltraSparc. :)
>
> Cheers, Phil
>
> [1]: http://www.spinics.net/lists/netdev/msg335767.html

OK, good to know. I've posted the schedule patch today:
https://patchwork.ozlabs.org/patch/497035/

Blackhole route not enough for proxy-arp

2015-07-17 Thread Jörg Pommnitz

Hello all,
it seems that a blackhole route is not enough to enable proxy arp for
the routing target.
I tried
ip route add blackhole 192.168.66.3/32
and
ip route add 192.168.66.3/32 dev lo

arping failed with the blackhole route but got responses with the
route through the loopback interface.
Is this behaviour a bug or a feature?

-- Regards
Joerg

Re: 4.1 regression in resizable hashtable tests

2015-07-17 Thread Phil Sutter

On Fri, Jul 17, 2015 at 10:04:56AM +0200, Thomas Graf wrote:
> On 07/02/15 at 10:09pm, Meelis Roos wrote:
> > [   33.425061] Running rhashtable test nelem=8, max_size=65536, shrinking=0
> > [   33.425154] Test 00:
> > [   33.534470]   Adding 50000 keys
> > [   34.743553] Info: encountered resize
> > [   34.743698] Info: encountered resize
> > [   34.743838] Info: encountered resize
> > [   34.744057] Info: encountered resize
> > [   34.744430] Info: encountered resize
> > [   34.745139] Info: encountered resize
> > [   34.746441] Info: encountered resize
> > [   34.749055] Info: encountered resize
> > [   34.754469] Info: encountered resize
> > [   34.764836] Info: encountered resize
> > [   34.785696] Info: encountered resize
> > [   34.827448] Info: encountered resize
> > [   34.896936]   Traversal complete: counted=49993, nelems=50000, entries=50000, table-jumps=12
> > [   34.897056] Test failed: Total count mismatch ^^^
>
> I do see count mismatches as well due to the design of the walker
> which restarts and thus sees certain entries multiple times.
>
> Do you have this commit as well?
>
> Author: Phil Sutter p...@nwl.cc
> Date:   Mon Jul 6 15:51:20 2015 +0200
>
>     rhashtable: fix for resize events during table walk

Thomas, this should be resolved already. Meelis replied[1] to my patch,
stating it fixes that problem for him. Though he's still waiting for
your proposed patch to add a schedule() call so the kernel won't
complain on his slow UltraSparc. :)

Cheers, Phil

[1]: http://www.spinics.net/lists/netdev/msg335767.html

Re: [RFC net-next 22/22] openvswitch: Use regular GRE net_device instead of vport

2015-07-17 Thread Thomas Graf

On 07/16/15 at 02:36pm, Pravin Shelar wrote:
> On Thu, Jul 16, 2015 at 7:52 AM, Thomas Graf tg...@suug.ch wrote:
> > I'm inclined to change this and use an in-kernel API as well to
> > create the net_device just like VXLAN does in patch 21.
> >
> > Pravin, what do you think?
>
> About the vxlan APIs we also need to direct netlink interface for
> userspace to configure vxlan device. This will allow us to remove
> vxlan compat code from ovs vport-netdev.c in future.

Do you mean creating the tunnel devices from user space? This would
break existing users of the OVS Netlink interface. How do you want
to prevent that?

Re: [PATCH net-next v2] rhashtable: Allow other tasks to be scheduled in large lookup loops

2015-07-17 Thread Eric Dumazet

On Fri, 2015-07-17 at 10:52 +0200, Thomas Graf wrote:
> Depending on system speed, the large lookup/insert/delete loops of the
> testsuite can take a considerable amount of time to complete causing
> watchdog warnings to appear.
> Allow other tasks to be scheduled throughout the loops.
>
> Reported-by: Meelis Roos mr...@linux.ee
> Signed-off-by: Thomas Graf tg...@suug.ch
> ---
> v2: Use cond_resched() instead schedule()

Acked-by: Eric Dumazet eduma...@google.com



[PATCH] net/mdio: fix mdio_bus_match for c45 PHY

2015-07-17 Thread shh.xie

From: Shaohui Xie shaohui@freescale.com

We store c45 PHY's id information in c45_ids, so it should be used to
check the matching between PHY driver and PHY device for c45 PHY.

Signed-off-by: Shaohui Xie shaohui@freescale.com
---
 drivers/net/phy/mdio_bus.c | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
index 095ef3f..46a14cb 100644
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -421,6 +421,8 @@ static int mdio_bus_match(struct device *dev, struct device_driver *drv)
 {
 	struct phy_device *phydev = to_phy_device(dev);
 	struct phy_driver *phydrv = to_phy_driver(drv);
+	const int num_ids = ARRAY_SIZE(phydev->c45_ids.device_ids);
+	int i;
 
 	if (of_driver_match_device(dev, drv))
 		return 1;
@@ -428,8 +430,21 @@ static int mdio_bus_match(struct device *dev, struct device_driver *drv)
 	if (phydrv->match_phy_device)
 		return phydrv->match_phy_device(phydev);
 
-	return (phydrv->phy_id & phydrv->phy_id_mask) ==
-		(phydev->phy_id & phydrv->phy_id_mask);
+	if (phydev->is_c45) {
+		for (i = 1; i < num_ids; i++) {
+			if (!(phydev->c45_ids.devices_in_package & (1 << i)))
+				continue;
+
+			if ((phydrv->phy_id & phydrv->phy_id_mask) ==
+			    (phydev->c45_ids.device_ids[i] &
+			     phydrv->phy_id_mask))
+				return 1;
+		}
+		return 0;
+	} else {
+		return (phydrv->phy_id & phydrv->phy_id_mask) ==
+			(phydev->phy_id & phydrv->phy_id_mask);
+	}
 }
 
 #ifdef CONFIG_PM
-- 
2.1.0.27.g96db324
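
For readers following the matching rule, the comparison described in the
commit message can be sketched in userspace as below. This is only an
illustration, not the kernel code: the *_sketch struct names and
NUM_C45_IDS are assumptions made up for this example. Only the masked ID
comparison and the devices_in_package bit test are taken from the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_C45_IDS 8 /* assumed array size, chosen for this sketch only */

/* Simplified stand-ins for the kernel structures (hypothetical names). */
struct c45_ids_sketch {
	uint32_t devices_in_package;      /* bit i set: device i is present */
	uint32_t device_ids[NUM_C45_IDS]; /* one ID per device in the package */
};

struct phy_dev_sketch {
	bool is_c45;
	uint32_t phy_id;                  /* used for non-c45 PHYs */
	struct c45_ids_sketch c45_ids;    /* used for c45 PHYs */
};

struct phy_drv_sketch {
	uint32_t phy_id;
	uint32_t phy_id_mask;
};

/* Return true when the driver's masked ID matches the device. */
static bool phy_id_match_sketch(const struct phy_dev_sketch *dev,
				const struct phy_drv_sketch *drv)
{
	uint32_t want;
	size_t i;

	assert(dev != NULL && drv != NULL);
	want = drv->phy_id & drv->phy_id_mask;

	if (!dev->is_c45)
		return (dev->phy_id & drv->phy_id_mask) == want;

	/* Like the patch, start at index 1 and skip absent devices. */
	for (i = 1; i < NUM_C45_IDS; i++) {
		if (!(dev->c45_ids.devices_in_package & (UINT32_C(1) << i)))
			continue;
		if ((dev->c45_ids.device_ids[i] & drv->phy_id_mask) == want)
			return true;
	}
	return false;
}
```

A stale ID in a slot whose devices_in_package bit is clear is ignored,
which is the point of the bit test in the loop.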


[PATCH v2] net: ratelimit warnings about dst entry refcount underflow or overflow

2015-07-17 Thread Konstantin Khlebnikov

The kernel generates a lot of warnings when a dst entry reference counter
overflows and becomes negative. This bug was seen several times on
machines with outdated 3.10.y kernels. Most likely it is already fixed
upstream. In any case, the flood completely kills the machine and makes
further debugging impossible.

Signed-off-by: Konstantin Khlebnikov khlebni...@yandex-team.ru
---
 net/core/dst.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/core/dst.c b/net/core/dst.c
index e956ce6d1378..002144bea935 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -284,7 +284,9 @@ void dst_release(struct dst_entry *dst)
 		int newrefcnt;
 
 		newrefcnt = atomic_dec_return(&dst->__refcnt);
-		WARN_ON(newrefcnt < 0);
+		if (unlikely(newrefcnt < 0))
+			net_warn_ratelimited("%s: dst:%p refcnt:%d\n",
+					     __func__, dst, newrefcnt);
 		if (unlikely(dst->flags & DST_NOCACHE) && !newrefcnt)
 			call_rcu(&dst->rcu_head, dst_destroy_rcu);
 	}

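To illustrate why rate limiting keeps the machine usable, here is a
minimal userspace sketch of an interval/burst limiter. It is not the
kernel's implementation behind net_warn_ratelimited(); the struct, the
function name and the tick-based clock are assumptions made for this
example only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limiter: at most `burst` messages per `interval` ticks. */
struct ratelimit_sketch {
	long interval; /* window length in caller-defined ticks */
	int burst;     /* messages allowed per window */
	long begin;    /* tick at which the current window started */
	int printed;   /* messages emitted in the current window */
	int missed;    /* messages suppressed in the current window */
};

/* Return true if the caller may print a message at tick `now`. */
static bool ratelimit_ok_sketch(struct ratelimit_sketch *rs, long now)
{
	assert(rs != NULL);

	/* Open a new window once the old one has expired. */
	if (now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->printed = 0;
		rs->missed = 0;
	}

	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}

	/* Over budget: count the drop instead of printing. */
	rs->missed++;
	return false;
}
```

With a limiter like this, a refcount that stays negative costs a bounded
number of log lines per window instead of one full warning per
dst_release() call.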

Re: 4.1.0, kernel panic, pppoe_release

2015-07-17 Thread Denys Fedoryshchenko


I suspect this kernel panic is caused by recent changes to pppoe.
The problem appears with accel-pppd (server) on loaded servers (2k
users and more).
It is most probably related to the change "pppoe: Use workqueue to die
properly when a PADT is received".

I will try to revert this and related patches.

On 2015-07-14 13:57, Denys Fedoryshchenko wrote:

Here is panic message from netconsole. Please let me know if any
additional information required.

Jul 14 13:49:16 10.0.252.10 [76078.867822] BUG: unable to handle kernel
Jul 14 13:49:16 10.0.252.10 NULL pointer dereference
Jul 14 13:49:16 10.0.252.10 at 03f0
Jul 14 13:49:16 10.0.252.10 [76078.868280] IP:
Jul 14 13:49:16 10.0.252.10 [a011e12a]
pppoe_release+0x56/0x142 [pppoe]
Jul 14 13:49:16 10.0.252.10 [76078.868541] PGD 336e4a067
Jul 14 13:49:16 10.0.252.10 PUD 333f17067
Jul 14 13:49:16 10.0.252.10 PMD 0
Jul 14 13:49:16 10.0.252.10
Jul 14 13:49:16 10.0.252.10 [76078.868918] Oops:  [#1]
Jul 14 13:49:16 10.0.252.10 SMP
Jul 14 13:49:16 10.0.252.10
Jul 14 13:49:16 10.0.252.10 [76078.869226] Modules linked in:
Jul 14 13:49:16 10.0.252.10 netconsole
Jul 14 13:49:16 10.0.252.10 configfs
Jul 14 13:49:16 10.0.252.10 coretemp
Jul 14 13:49:16 10.0.252.10 sch_fq
Jul 14 13:49:16 10.0.252.10 cls_fw
Jul 14 13:49:16 10.0.252.10 act_police
Jul 14 13:49:16 10.0.252.10 cls_u32
Jul 14 13:49:16 10.0.252.10 sch_ingress
Jul 14 13:49:16 10.0.252.10 sch_sfq
Jul 14 13:49:16 10.0.252.10 sch_htb
Jul 14 13:49:16 10.0.252.10 pppoe
Jul 14 13:49:16 10.0.252.10 pppox
Jul 14 13:49:16 10.0.252.10 ppp_generic
Jul 14 13:49:16 10.0.252.10 slhc
Jul 14 13:49:16 10.0.252.10 nf_nat_pptp
Jul 14 13:49:16 10.0.252.10 nf_nat_proto_gre
Jul 14 13:49:16 10.0.252.10 nf_conntrack_pptp
Jul 14 13:49:16 10.0.252.10 nf_conntrack_proto_gre
Jul 14 13:49:16 10.0.252.10 tun
Jul 14 13:49:16 10.0.252.10 xt_REDIRECT
Jul 14 13:49:16 10.0.252.10 nf_nat_redirect
Jul 14 13:49:16 10.0.252.10 xt_set
Jul 14 13:49:16 10.0.252.10 xt_TCPMSS
Jul 14 13:49:16 10.0.252.10 ipt_REJECT
Jul 14 13:49:16 10.0.252.10 nf_reject_ipv4
Jul 14 13:49:16 10.0.252.10 ts_bm
Jul 14 13:49:16 10.0.252.10 xt_string
Jul 14 13:49:16 10.0.252.10 xt_connmark
Jul 14 13:49:16 10.0.252.10 xt_DSCP
Jul 14 13:49:16 10.0.252.10 xt_mark
Jul 14 13:49:16 10.0.252.10 xt_tcpudp
Jul 14 13:49:16 10.0.252.10 iptable_mangle
Jul 14 13:49:16 10.0.252.10 iptable_filter
Jul 14 13:49:16 10.0.252.10 iptable_nat
Jul 14 13:49:16 10.0.252.10 nf_conntrack_ipv4
Jul 14 13:49:16 10.0.252.10 nf_defrag_ipv4
Jul 14 13:49:16 10.0.252.10 nf_nat_ipv4
Jul 14 13:49:16 10.0.252.10 nf_nat
Jul 14 13:49:16 10.0.252.10 nf_conntrack
Jul 14 13:49:16 10.0.252.10 ip_tables
Jul 14 13:49:16 10.0.252.10 x_tables
Jul 14 13:49:16 10.0.252.10 ip_set_hash_ip
Jul 14 13:49:16 10.0.252.10 ip_set
Jul 14 13:49:16 10.0.252.10 nfnetlink
Jul 14 13:49:16 10.0.252.10 8021q
Jul 14 13:49:16 10.0.252.10 garp
Jul 14 13:49:16 10.0.252.10 mrp
Jul 14 13:49:16 10.0.252.10 stp
Jul 14 13:49:16 10.0.252.10 llc
Jul 14 13:49:16 10.0.252.10 [last unloaded: netconsole]
Jul 14 13:49:16 10.0.252.10
Jul 14 13:49:16 10.0.252.10 [76078.873195] CPU: 3 PID: 2940 Comm:
accel-pppd Not tainted 4.1.0-build-0074 #7
Jul 14 13:49:16 10.0.252.10 [76078.873396] Hardware name: HP ProLiant
DL320e Gen8 v2, BIOS P80 04/02/2015
Jul 14 13:49:16 10.0.252.10 [76078.873598] task: 8800b1886ba0 ti:
8800b09f4000 task.ti: 8800b09f4000
Jul 14 13:49:16 10.0.252.10 [76078.873929] RIP: 
0010:[a011e12a]

Jul 14 13:49:16 10.0.252.10 [a011e12a]
pppoe_release+0x56/0x142 [pppoe]
Jul 14 13:49:16 10.0.252.10 [76078.874317] RSP: 0018:8800b09f7e28
EFLAGS: 00010202
Jul 14 13:49:16 10.0.252.10 [76078.874512] RAX:  RBX:
88032a214400 RCX: 
Jul 14 13:49:16 10.0.252.10 [76078.874709] RDX: 000d RSI:
fe01 RDI: 8180d6da
Jul 14 13:49:16 10.0.252.10 [76078.874906] RBP: 8800b09f7e68 R08:
 R09: 
Jul 14 13:49:16 10.0.252.10 [76078.875102] R10: 88031ef6a110 R11:
0293 R12: 88030f8d8fc0
Jul 14 13:49:16 10.0.252.10 [76078.875299] R13: 88030f8d8ff0 R14:
88033115ee40 R15: 8803394e4920
Jul 14 13:49:16 10.0.252.10 [76078.875499] FS:  7f79b602c700()
GS:88034746() knlGS:
Jul 14 13:49:16 10.0.252.10 [76078.875837] CS:  0010 DS:  ES: 
CR0: 80050033
Jul 14 13:49:16 10.0.252.10 [76078.876036] CR2: 03f0 CR3:
000335425000 CR4: 001407e0
Jul 14 13:49:16 10.0.252.10 [76078.876239] Stack:
Jul 14 13:49:16 10.0.252.10 [76078.876434]  88033ac45c80
Jul 14 13:49:16 10.0.252.10 
Jul 14 13:49:16 10.0.252.10 0001
Jul 14 13:49:16 10.0.252.10 88030f8d8fc0
Jul 14 13:49:16 10.0.252.10
Jul 14 13:49:16 10.0.252.10 [76078.877001]  a0120260
Jul 14 13:49:16 10.0.252.10 88030f8d8ff0
Jul 14 13:49:16 10.0.252.10 88033115ee40
Jul 14 13:49:16 10.0.252.10 8803394e4920
Jul 14 13:49:16 10.0.252.10
Jul 14 13:49:16

RE: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be processed during probe

2015-07-17 Thread Dexuan Cui

> From: K. Y. Srinivasan
> Sent: Friday, July 17, 2015 3:17
> Subject: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be processed
> during probe
>
> The current code returns from probe without waiting for the proper handling
> of subchannels that may be requested. If the netvsc driver were to be rapidly
> loaded/unloaded, we can trigger a panic as the unload will be tearing
> down state that may not have been fully setup yet. We fix this issue by making
> sure that we return from the probe call only after ensuring that the
> sub-channel offers in flight are properly handled.
>
> ---
>  drivers/net/hyperv/hyperv_net.h   |2 ++
>  drivers/net/hyperv/rndis_filter.c |   25 +
>  2 files changed, 27 insertions(+), 0 deletions(-)

BTW, not sure if we should make the same fix to storvsc.

IMO storvsc should have the same issue, at least in theory, though usually it's
unlikely to unload storvsc.  :-)

Thanks,
-- Dexuan

Re: 4.1 regression in resizable hashtable tests

2015-07-17 Thread Phil Sutter

On Fri, Jul 17, 2015 at 12:26:36PM +0200, Phil Sutter wrote:
> On Fri, Jul 17, 2015 at 10:04:56AM +0200, Thomas Graf wrote:
> > On 07/02/15 at 10:09pm, Meelis Roos wrote:
> > > [   33.425061] Running rhashtable test nelem=8, max_size=65536, shrinking=0
> > > [   33.425154] Test 00:
> > > [   33.534470]   Adding 5 keys
> > > [   34.743553] Info: encountered resize
> > > [   34.743698] Info: encountered resize
> > > [   34.743838] Info: encountered resize
> > > [   34.744057] Info: encountered resize
> > > [   34.744430] Info: encountered resize
> > > [   34.745139] Info: encountered resize
> > > [   34.746441] Info: encountered resize
> > > [   34.749055] Info: encountered resize
> > > [   34.754469] Info: encountered resize
> > > [   34.764836] Info: encountered resize
> > > [   34.785696] Info: encountered resize
> > > [   34.827448] Info: encountered resize
> > > [   34.896936]   Traversal complete: counted=49993, nelems=5, entries=5, table-jumps=12
> > > [   34.897056] Test failed: Total count mismatch ^^^
> >
> > I do see count mismatches as well due to the design of the walker
> > which restarts and thus sees certain entries multiple times.
> >
> > Do you have this commit as well?
> >
> > Author: Phil Sutter p...@nwl.cc
> > Date:   Mon Jul 6 15:51:20 2015 +0200
> >
> > rhashtable: fix for resize events during table walk
>
> Thomas, this should be resolved already. Meelis replied[1] to my patch,
> stating it fixes that problem for him. Though he's still waiting for
> your proposed patch to add a schedule() call so the kernel won't
> complain on his slow UltraSparc. :)

Ah, nevermind. You sent it already with him in Cc.

[PATCH RFC net-next] net/vxlan: Fix kernel unaligned access in __vxlan_find_mac

2015-07-17 Thread Sowmini Varadhan


__vxlan_find_mac invokes ether_addr_equal on the eth_addr field,
which triggers unaligned access messages, so rearrange vxlan_fdb
to avoid this as non-intrusively as possible. 

Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com
---
 drivers/net/vxlan.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 34c519e..c9790a2 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -107,8 +107,8 @@ struct vxlan_fdb {
 	unsigned long	  used;
 	struct list_head  remotes;
 	u16		  state;	/* see ndm_state */
-	u8		  flags;	/* see ndm_flags */
 	u8		  eth_addr[ETH_ALEN];
+	u8		  flags;	/* see ndm_flags */
 };
 
 /* Pseudo network device */
-- 
1.7.1
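
The effect of the reordering can be checked with offsetof() in
userspace. The sketch below is an illustration only: the *_sketch
structs are assumptions that contain just the fields visible in the
hunk, not the full struct vxlan_fdb. ether_addr_equal() requires both
addresses to be u16-aligned, so an odd offset for eth_addr is what
causes the unaligned access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_ALEN 6

/* Stand-in for the kernel's struct list_head (two pointers). */
struct list_head_sketch {
	void *next, *prev;
};

/* Field order before the patch (only the fields shown in the hunk). */
struct fdb_before_sketch {
	unsigned long used;
	struct list_head_sketch remotes;
	uint16_t state;
	uint8_t flags;
	uint8_t eth_addr[ETH_ALEN];
};

/* Field order after the patch: eth_addr directly follows the u16. */
struct fdb_after_sketch {
	unsigned long used;
	struct list_head_sketch remotes;
	uint16_t state;
	uint8_t eth_addr[ETH_ALEN];
	uint8_t flags;
};

/* Assert the alignment property that the reordering is meant to give. */
static void check_fdb_layouts_sketch(void)
{
	/* Before: state (2 bytes) + flags (1 byte) put eth_addr at an odd offset. */
	assert(offsetof(struct fdb_before_sketch, eth_addr) % 2 == 1);
	/* After: eth_addr starts right behind the u16, so it is 2-byte aligned. */
	assert(offsetof(struct fdb_after_sketch, eth_addr) % 2 == 0);
}
```

Moving the one-byte flags field behind the address changes only the
order of the last two members, which is why the diff is a one-line move.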


Re: [PATCH 1/3] fixed_phy: handle link-down case

2015-07-17 Thread Stas Sergeev

17.07.2015 02:25, Florian Fainelli wrote:
> On 16/07/15 07:50, Stas Sergeev wrote:
>>
>> Currently fixed_phy driver recognizes only the link-up state.
>> This simple patch adds an implementation of link-down state.
>> It fixes the status registers when link is down, and also allows
>> to register the fixed-phy with link down without specifying the speed.
>
> This patch still breaks my setups here, e.g: drivers/net/dsa/bcm_sf2.c,
> but I will look into it.
>
> Do we really need this for now for your two other patches to work
> properly, or is it just nicer to have?
Yes, absolutely.
Otherwise registering the fixed PHY will return -EINVAL
because of the missing link speed (even though the link
is down).

Please check what causes the problem. I can't reproduce what you report.

Re: [PATCH] bcmsysport:Fix error handling in the function bcm_sysport_init_rx_ring

2015-07-17 Thread Florian Fainelli

On 17/07/15 05:13, Nicholas Krause wrote:
> This fixes the error handling in the function bcm_sysport_init_rx_ring
> after calling the function rdma_enable_set to make sure the return value
> is equal to zero and if not print on the console failed to enable RDMA
> for the device and return the failed error code returned by
> rdma_enable_set.

The subject should start with "net: systemport: "; otherwise, this
looks good to me:

Reviewed-by: Florian Fainelli f.faine...@gmail.com
Tested-by: Florian Fainelli f.faine...@gmail.com

 
> Signed-off-by: Nicholas Krause xerofo...@gmail.com
> ---
>  drivers/net/ethernet/broadcom/bcmsysport.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c
> index 4566cdf..27a7b36 100644
> --- a/drivers/net/ethernet/broadcom/bcmsysport.c
> +++ b/drivers/net/ethernet/broadcom/bcmsysport.c
> @@ -1365,7 +1365,12 @@ static int bcm_sysport_init_rx_ring(struct bcm_sysport_priv *priv)
>  	/* Initialize HW, ensure RDMA is disabled */
>  	reg = rdma_readl(priv, RDMA_STATUS);
>  	if (!(reg & RDMA_DISABLED))
> -		rdma_enable_set(priv, 0);
> +		ret = rdma_enable_set(priv, 0);
> +
> +	if (ret) {
> +		netif_err(priv, hw, priv->netdev, "failed to enable RDMA\n");
> +		return ret;
> +	}
>  
>  	rdma_writel(priv, 0, RDMA_WRITE_PTR_LO);
>  	rdma_writel(priv, 0, RDMA_WRITE_PTR_HI);
 


-- 
Florian

Re: [RFC net-next 22/22] openvswitch: Use regular GRE net_device instead of vport

2015-07-17 Thread Pravin Shelar

On Fri, Jul 17, 2015 at 3:58 AM, Thomas Graf tg...@suug.ch wrote:
> On 07/16/15 at 02:36pm, Pravin Shelar wrote:
>> On Thu, Jul 16, 2015 at 7:52 AM, Thomas Graf tg...@suug.ch wrote:
>>> I'm inclined to change this and use an in-kernel API as well to
>>> create the net_device just like VXLAN does in patch 21.
>>>
>>> Pravin, what do you think?
>>
>> About the vxlan APIs we also need to direct netlink interface for
>> userspace to configure vxlan device. This will allow us to remove
>> vxlan compat code from ovs vport-netdev.c in future.
>
> Do you mean creating the tunnel devices from user space? This would
> break existing users of the OVS Netlink interface. How do you want
> to prevent that?
To handle the old interface, there is compat code in netdev-vport in patch 22.

OVS userspace should be able to create any type of tunneling device
and then add it as a netdev-type vport, so that OVS has only two types
of vport, i.e. netdev and internal, rather than a vport for each type of
tunnel.
This way we can keep the compat code simple. All enhancements can be
made directly to the new interface.

[PATCH v2 -next] net: fib: use fib result when zero-length prefix aliases exist

2015-07-17 Thread Florian Westphal

default route selection is not deterministic when TOS keys are used:

ip route del default

ip route add tos 0x00 via 10.2.100.100
ip route add tos 0x04 via 10.2.100.101
ip route add tos 0x08 via 10.2.100.102
ip route add tos 0x0C via 10.2.100.103
ip route add tos 0x10 via 10.2.100.104

[ i.e. 5 routes with prefix length 0, differentiated via TOS key ]

ip route get 10.3.1.1 tos 0x4
-> 10.2.100.101
ip route get 10.3.1.1 tos 0x8
-> 10.2.100.102
ip route get tos 0x0C
-> 10.2.100.103

But for 0x10, we'll get round-robin behaviour among all the aliases.
Repeated invocations return .100, .101, .102, etc. in turn.

This behaviour is not new  -- fib_select_default can be traced back to
fn_hash_select_default in CVS history.

Routing cache made 'round-robin' behaviour less visible.

This changes fib_select_default to not change the FIB chosen result EXCEPT
if this nexthop appears to be unreachable.

fib_detect_death() logic is reversed -- we consider a nexthop 'dead' only
if it has a neigh entry in unreachable state.

Only then we search fib_aliases for an alternative and use one of these in
a round-robin fashion.  If all are believed to be unreachable, no change is
made and fib-chosen nh_gw is used.

Reported-by: Hagen Paul Pfeifer ha...@jauu.net
Cc: Alexander Duyck alexander.h.du...@redhat.com
Signed-off-by: Florian Westphal f...@strlen.de
---
 Changes since v1:
  Address comments from Alex Duyck:
  - use if (fib_nud_is_unreach( .. rather than temporary boolean retval
  - rename last_* variables to fi_, they're not the last item in the list...
  - kill pointless if() statement, if order  0, then fi_last is  0 too

 net/ipv4/fib_semantics.c | 80 ++--
 1 file changed, 36 insertions(+), 44 deletions(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index c7358ea..2cdf8d7 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -410,28 +410,24 @@ errout:
 		rtnl_set_sk_err(info->nl_net, RTNLGRP_IPV4_ROUTE, err);
 }
 
-static int fib_detect_death(struct fib_info *fi, int order,
-			    struct fib_info **last_resort, int *last_idx,
-			    int dflt)
+static bool fib_nud_is_unreach(const struct fib_info *fi)
 {
 	struct neighbour *n;
 	int state = NUD_NONE;
 
-	n = neigh_lookup(&arp_tbl, &fi->fib_nh[0].nh_gw, fi->fib_dev);
-	if (n) {
+	local_bh_disable();
+
+	n = __neigh_lookup_noref(&arp_tbl, &fi->fib_nh[0].nh_gw, fi->fib_dev);
+	if (n)
 		state = n->nud_state;
-		neigh_release(n);
-	}
-	if (state == NUD_REACHABLE)
-		return 0;
-	if ((state & NUD_VALID) && order != dflt)
-		return 0;
-	if ((state & NUD_VALID) ||
-	    (*last_idx < 0 && order > dflt)) {
-		*last_resort = fi;
-		*last_idx = order;
-	}
-	return 1;
+
+	local_bh_enable();
+
+	/* Caller might be able to find alternate (reachable) nexthop */
+	if (state & (NUD_INCOMPLETE | NUD_FAILED))
+		return true;
+
+	return false;
 }
 
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
@@ -1204,12 +1200,16 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long event)
 /* Must be invoked inside of an RCU protected region.  */
 void fib_select_default(struct fib_result *res)
 {
-	struct fib_info *fi = NULL, *last_resort = NULL;
 	struct hlist_head *fa_head = res->fa_head;
+	struct fib_info *fi = NULL;
 	struct fib_table *tb = res->table;
-	int order = -1, last_idx = -1;
+	int order = -1, fi_idx = -1;
 	struct fib_alias *fa;
 
+	if (likely(!fib_nud_is_unreach(res->fi)))
+		return;
+
+	/* attempt to pick another nexthop */
 	hlist_for_each_entry_rcu(fa, fa_head, fa_list) {
 		struct fib_info *next_fi = fa->fa_info;
 
@@ -1223,38 +1223,30 @@ void fib_select_default(struct fib_result *res)
 		    next_fi->fib_nh[0].nh_scope != RT_SCOPE_LINK)
 			continue;
 
+		order++;
+
+		if (next_fi == res->fi) /* already tested, not reachable */
+			continue;
+
 		fib_alias_accessed(fa);
 
-		if (!fi) {
-			if (next_fi != res->fi)
+		if (fib_nud_is_unreach(next_fi))
+			continue;
+
+		/* try to round-robin among all fa_aliases in case
+		 * res->fi nexthop is unreachable.
+		 */
+		if (fi == NULL || order > tb->tb_default) {
+			fi = next_fi;
+			fi_idx = order;
+			if (order > tb->tb_default)
 				break;
-		} else if (!fib_detect_death(fi, order, &last_resort,
-					     &last_idx, tb->tb_default)) {
-			fib_result_assign(res, fi);
-			tb->tb_default

Re: [PATCH -next] net: fib: use fib result when zero-length prefix aliases exist

2015-07-17 Thread Alexander Duyck


On 07/17/2015 08:17 AM, Florian Westphal wrote:

default route selection is not deterministic when TOS keys are used:

ip route del default

ip route add tos 0x00 via 10.2.100.100
ip route add tos 0x04 via 10.2.100.101
ip route add tos 0x08 via 10.2.100.102
ip route add tos 0x0C via 10.2.100.103
ip route add tos 0x10 via 10.2.100.104

[ i.e. 5 routes with prefix length 0, differentiated via TOS key ]

ip route get 10.3.1.1 tos 0x4
-> 10.2.100.101
ip route get 10.3.1.1 tos 0x8
-> 10.2.100.102
ip route get tos 0x0C
-> 10.2.100.103

But for 0x10, we'll get round-robin results among all the aliases.
Repeated queries return .100, 101, 102, etc. in turn.

This behaviour is not new  -- fib_select_default can be traced back to
fn_hash_select_default in CVS history.

Routing cache made 'round-robin' behaviour less visible.

This changes fib_select_default to not change the FIB chosen result EXCEPT
if this nexthop appears to be unreachable.

fib_detect_death() logic is reversed -- we consider a nexthop 'dead' only
if it has a neigh entry in unreachable state.

Only then we search fib_aliases for an alternative and use one of these in
a round-robin fashion.  If all are believed to be unreachable, no change is
made and fib-chosen nh_gw is used.

Reported-by: Hagen Paul Pfeifer ha...@jauu.net
Cc: Alexander Duyck alexander.h.du...@redhat.com
Signed-off-by: Florian Westphal f...@strlen.de
---
  net/ipv4/fib_semantics.c | 71 
  1 file changed, 36 insertions(+), 35 deletions(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index c7358ea..83b485b 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -410,28 +410,24 @@ errout:
 		rtnl_set_sk_err(info->nl_net, RTNLGRP_IPV4_ROUTE, err);
  }

-static int fib_detect_death(struct fib_info *fi, int order,
-			    struct fib_info **last_resort, int *last_idx,
-			    int dflt)
+static bool fib_nud_is_unreach(const struct fib_info *fi)
  {
 	struct neighbour *n;
 	int state = NUD_NONE;

-	n = neigh_lookup(&arp_tbl, &fi->fib_nh[0].nh_gw, fi->fib_dev);
-	if (n) {
+	local_bh_disable();
+
+	n = __neigh_lookup_noref(&arp_tbl, &fi->fib_nh[0].nh_gw, fi->fib_dev);
+	if (n)
 		state = n->nud_state;
-		neigh_release(n);
-	}
-	if (state == NUD_REACHABLE)
-		return 0;
-	if ((state & NUD_VALID) && order != dflt)
-		return 0;
-	if ((state & NUD_VALID) ||
-	    (*last_idx < 0 && order > dflt)) {
-		*last_resort = fi;
-		*last_idx = order;
-	}
-	return 1;
+
+	local_bh_enable();
+
+	/* Caller might be able to find alternate (reachable) nexthop */
+	if (state & (NUD_INCOMPLETE | NUD_FAILED))
+		return true;
+
+	return false;
  }

  #ifdef CONFIG_IP_ROUTE_MULTIPATH
@@ -1204,12 +1200,17 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long event)
  /* Must be invoked inside of an RCU protected region.  */
  void fib_select_default(struct fib_result *res)
  {
-	struct fib_info *fi = NULL, *last_resort = NULL;
 	struct hlist_head *fa_head = res->fa_head;
+	struct fib_info *last_resort = NULL;
 	struct fib_table *tb = res->table;
 	int order = -1, last_idx = -1;
 	struct fib_alias *fa;
+	bool unreach = fib_nud_is_unreach(res->fi);

+	if (likely(!unreach))
+		return;
+


There probably isn't any need for the boolean variable.  You could just 
place the function in the if statement itself.



+	/* attempt to pick another nexthop */
 	hlist_for_each_entry_rcu(fa, fa_head, fa_list) {
 		struct fib_info *next_fi = fa->fa_info;

@@ -1223,33 +1224,33 @@ void fib_select_default(struct fib_result *res)
 		    next_fi->fib_nh[0].nh_scope != RT_SCOPE_LINK)
 			continue;

+		order++;
+
+		if (next_fi == res->fi) /* already tested, not reachable */
+			continue;
+
 		fib_alias_accessed(fa);

-		if (!fi) {
-			if (next_fi != res->fi)
+		unreach = fib_nud_is_unreach(next_fi);
+		if (unreach)
+			continue;
+


Same here.  It seems like this is just an extra variable that isn't 
really needed.



+		/* try to round-robin among all fa_aliases in case
+		 * res->fi nexthop is unreachable.
+		 */
+		if (last_idx < 0 || order > tb->tb_default) {
+			last_resort = next_fi;
+			last_idx = order;
+			if (order > tb->tb_default)
 				break;


You might want to update the variable naming as it can be a bit 
confusing.  The last_resort and last_idx represent either the first 
fib_info and index, or the next one after current entry in

[PATCH] sctp: fix cut and paste issue in comment

2015-07-17 Thread Marcelo Ricardo Leitner

Cookie ACK is always received by the association initiator, so fix the
comment to avoid confusion.

Signed-off-by: Marcelo Ricardo Leitner marcelo.leit...@gmail.com
---
 net/sctp/sm_statefuns.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index 
3ee27b7704ffb95430541507e83973e9207f9672..d7eaa7354cf76148d1a2c9ee3af4fff9a24990fb
 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -853,7 +853,7 @@ nomem:
 
 /*
  * Respond to a normal COOKIE ACK chunk.
- * We are the side that is being asked for an association.
+ * We are the side that is asking for an association.
  *
  * RFC 2960 5.1 Normal Establishment of an Association
  *
-- 
2.4.1


RE: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be processed during probe

2015-07-17 Thread KY Srinivasan

> -----Original Message-----
> From: Dexuan Cui
> Sent: Friday, July 17, 2015 3:07 AM
> To: KY Srinivasan; da...@davemloft.net; netdev@vger.kernel.org;
> linux-ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de;
> a...@canonical.com; jasow...@redhat.com; vkuzn...@redhat.com
> Cc: KY Srinivasan
> Subject: RE: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be
> processed during probe
>
> > From: K. Y. Srinivasan
> > Sent: Friday, July 17, 2015 3:17
> > Subject: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be
> > processed during probe
> >
> > The current code returns from probe without waiting for the proper
> > handling of subchannels that may be requested. If the netvsc driver were
> > to be rapidly loaded/unloaded, we can trigger a panic as the unload will
> > be tearing down state that may not have been fully setup yet. We fix this
> > issue by making sure that we return from the probe call only after
> > ensuring that the sub-channel offers in flight are properly handled.
> >
> > ---
> >  drivers/net/hyperv/hyperv_net.h   |2 ++
> >  drivers/net/hyperv/rndis_filter.c |   25 +
> >  2 files changed, 27 insertions(+), 0 deletions(-)
>
> BTW, not sure if we should make the same fix to storvsc.
>
> IMO storvsc should have the same issue, at least in theory, though usually
> it's unlikely to unload storvsc.  :-)

You are right; I am planning to submit a similar patch for storvsc. As you note,
this scenario is unlikely for storvsc.

K. Y

> Thanks,
> -- Dexuan

RE: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be processed during probe

2015-07-17 Thread KY Srinivasan



 -Original Message-
 From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
 Sent: Friday, July 17, 2015 9:10 AM
 To: KY Srinivasan
 Cc: da...@davemloft.net; netdev@vger.kernel.org; linux-
 ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de;
 a...@canonical.com; jasow...@redhat.com; Dexuan Cui
 Subject: Re: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be
 processed during probe
 
 KY Srinivasan k...@microsoft.com writes:
 
  -Original Message-
  From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
  Sent: Friday, July 17, 2015 7:13 AM
  To: KY Srinivasan
  Cc: da...@davemloft.net; netdev@vger.kernel.org; linux-
  ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de;
  a...@canonical.com; jasow...@redhat.com; Dexuan Cui
  Subject: Re: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be
  processed during probe
 
  K. Y. Srinivasan k...@microsoft.com writes:
 
   The current code returns from probe without waiting for the proper
  handling
   of subchannels that may be requested. If the netvsc driver were to be
  rapidly
   loaded/unloaded, we can  trigger a panic as the unload will be tearing
   down state that may not have been fully setup yet. We fix this issue by
  making
   sure that we return from the probe call only after ensuring that the
   sub-channel offers in flight are properly handled.
  
   Signed-off-by: K. Y. Srinivasan k...@microsoft.com
   Reviewed-and-tested-by: Haiyang Zhang haiya...@microsoft.com
   ---
drivers/net/hyperv/hyperv_net.h   |2 ++
drivers/net/hyperv/rndis_filter.c |   25 +
2 files changed, 27 insertions(+), 0 deletions(-)
  
   diff --git a/drivers/net/hyperv/hyperv_net.h
  b/drivers/net/hyperv/hyperv_net.h
   index 26cd14c..925b75d 100644
   --- a/drivers/net/hyperv/hyperv_net.h
   +++ b/drivers/net/hyperv/hyperv_net.h
   @@ -671,6 +671,8 @@ struct netvsc_device {
u32 send_table[VRSS_SEND_TAB_SIZE];
u32 max_chn;
u32 num_chn;
   +spinlock_t sc_lock; /* Protects num_sc_offered variable */
   +u32 num_sc_offered;
atomic_t queue_sends[NR_CPUS];
  
/* Holds rndis device info */
   diff --git a/drivers/net/hyperv/rndis_filter.c
  b/drivers/net/hyperv/rndis_filter.c
   index 2e40417..2e09f3f 100644
   --- a/drivers/net/hyperv/rndis_filter.c
   +++ b/drivers/net/hyperv/rndis_filter.c
   @@ -984,9 +984,16 @@ static void netvsc_sc_open(struct
 vmbus_channel
  *new_sc)
struct netvsc_device *nvscdev;
u16 chn_index = new_sc->offermsg.offer.sub_channel_index;
int ret;
   +unsigned long flags;
  
nvscdev = hv_get_drvdata(new_sc->primary_channel->device_obj);
  
   +spin_lock_irqsave(&nvscdev->sc_lock, flags);
   +nvscdev->num_sc_offered--;
   +spin_unlock_irqrestore(&nvscdev->sc_lock, flags);
   +if (nvscdev->num_sc_offered == 0)
   +complete(&nvscdev->channel_init_wait);
   +
if (chn_index >= nvscdev->num_chn)
return;
  
   @@ -1015,8 +1022,10 @@ int rndis_filter_device_add(struct hv_device
  *dev,
u32 rsscap_size = sizeof(struct ndis_recv_scale_cap);
u32 mtu, size;
u32 num_rss_qs;
   +u32 sc_delta;
const struct cpumask *node_cpu_mask;
u32 num_possible_rss_qs;
   +unsigned long flags;
  
rndis_device = get_rndis_device();
if (!rndis_device)
   @@ -1039,6 +1048,8 @@ int rndis_filter_device_add(struct hv_device
  *dev,
net_device->max_chn = 1;
net_device->num_chn = 1;
  
   +spin_lock_init(&net_device->sc_lock);
   +
net_device->extension = rndis_device;
rndis_device->net_dev = net_device;
  
   @@ -1116,6 +1127,9 @@ int rndis_filter_device_add(struct hv_device
  *dev,
num_possible_rss_qs = cpumask_weight(node_cpu_mask);
net_device->num_chn = min(num_possible_rss_qs, num_rss_qs);
  
   +num_rss_qs = net_device->num_chn - 1;
   +net_device->num_sc_offered = num_rss_qs;
   +
if (net_device->num_chn == 1)
goto out;
  
   @@ -1157,11 +1171,22 @@ int rndis_filter_device_add(struct hv_device
  *dev,
  
ret = rndis_filter_set_rss_param(rndis_device, net_device->num_chn);
  
   +/*
   + * Wait for the host to send us the sub-channel offers.
   + */
   +spin_lock_irqsave(&net_device->sc_lock, flags);
   +sc_delta = net_device->num_chn - 1 - num_rss_qs;
   +net_device->num_sc_offered -= sc_delta;
   +spin_unlock_irqrestore(&net_device->sc_lock, flags);
   +
   +if (net_device->num_sc_offered != 0)
   +wait_for_completion(&net_device->channel_init_wait);
 
  I'd suggest we add an essential timeout (big, let's say 30 sec.)
  here. In case something goes wrong we don't really want to hang the
  whole
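
A bounded wait along the lines suggested above can be illustrated outside the kernel. The following is only a userspace sketch, not the hv_netvsc code: `wait_pending_timeout()` and `now_ms()` are hypothetical names, and the polling loop merely imitates the return convention of the kernel's wait_for_completion_timeout() (0 on timeout, a positive remainder otherwise).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Milliseconds from a monotonic clock, so wall-clock jumps do not matter. */
int64_t now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Hypothetical helper: poll *pending until it drops to zero or until
 * timeout_ms has elapsed. Returns 0 on timeout, otherwise the time
 * left in milliseconds (at least 1).
 */
int64_t wait_pending_timeout(const volatile uint32_t *pending,
			     int64_t timeout_ms)
{
	const struct timespec nap = { .tv_sec = 0, .tv_nsec = 1000000 };
	int64_t deadline;

	assert(pending != 0 && timeout_ms > 0);
	deadline = now_ms() + timeout_ms;

	for (;;) {
		if (*pending == 0) {
			int64_t left = deadline - now_ms();

			return left > 0 ? left : 1;
		}
		if (now_ms() >= deadline)
			return 0;	/* give up instead of hanging forever */
		nanosleep(&nap, NULL);	/* sleep about 1 ms between polls */
	}
}
```

A caller would treat a return value of 0 as "the offers never arrived" and fail the probe with an error rather than block indefinitely.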

Re: [PATCH 1/3] fixed_phy: handle link-down case

2015-07-17 Thread Florian Fainelli

On 17/07/15 04:26, Stas Sergeev wrote:
 17.07.2015 02:25, Florian Fainelli wrote:
 On 16/07/15 07:50, Stas Sergeev wrote:

 Currently fixed_phy driver recognizes only the link-up state.
 This simple patch adds an implementation of link-down state.
 It fixes the status registers when link is down, and also allows
 to register the fixed-phy with link down without specifying the speed.

 This patch still breaks my setups here, e.g: drivers/net/dsa/bcm_sf2.c,
 but I will look into it.

 Do we really need this for now for your two other patches to work
 properly, or is it just nicer to have?
 Yes, absolutely.
 Otherwise registering fixed phy will return -EINVAL
 because of the missing link speed (even though the link
 is down).

Ok, I see the problem that you have now. Arguably you could say that
according to the fixed-link binding, speed needs to be specified and the
code correctly errors out with such an error if you do not specify it. I
also agree that having to specify speed and duplex for something you
will end-up auto-negotiating has no useful purpose.

 
 Please, see what makes a problem. I can't reproduce what you report.
 

So what is different is that I use a link_update callback, and so we rely on
at least one call of this function to initialize the hardware in
drivers/net/dsa/bcm_sf2.c for this to work, after that, the hardware
reflects the fixed link parameters we configured, and we feed the
fixed_phy_status information from the hardware directly.

From there I see two different ways to fix this:

- we ignore the fixed_phy_update_regs return value in fixed_phy_add(),
but that will make us avoid doing verification on the speed, which is
not so great, but is essentially what your patch does anyway

- we update the use of the fixed PHY link_update in drivers using it and
convert them to use fixed_phy_update_state instead, which can take some
time and effort to convert

What do you think? I would go with option 1 and eventually introduce a
special switch() case on the speed settings just to validate we know them.

Thanks
-- 
Florian
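
The "switch() on the speed settings" idea from option 1 might look roughly like the sketch below. It is standalone userspace C under stated assumptions: `struct fixed_link_state` and `fixed_link_validate()` are hypothetical names, and the accepted speeds (10/100/1000) are chosen only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the fixed-link parameters discussed above. */
struct fixed_link_state {
	bool link;	/* true when the link is up */
	int speed;	/* Mb/s, only meaningful when link is up */
	bool duplex;	/* true for full duplex */
};

/*
 * Accept any speed while the link is down; when the link is up, only
 * accept speeds we know about. Returns 0 or -EINVAL.
 */
int fixed_link_validate(const struct fixed_link_state *st)
{
	assert(st != 0);

	if (!st->link)
		return 0;	/* link down: speed is irrelevant */

	switch (st->speed) {
	case 10:
	case 100:
	case 1000:
		return 0;
	default:
		return -EINVAL;
	}
}
```

With this shape, registering a link-down PHY without a speed succeeds, while a link-up PHY with an unknown speed is still rejected.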

Re: [PATCH net-next v2] rhashtable: Allow other tasks to be scheduled in large lookup loops

2015-07-17 Thread mroos

 Depending on system speed, the large lookup/insert/delete loops of the 
 testsuite can
 take a considerable amount of time to complete causing watchdog warnings to 
 appear.
 Allow other tasks to be scheduled throughout the loops.
 
 Reported-by: Meelis Roos mr...@linux.ee
 Signed-off-by: Thomas Graf tg...@suug.ch
 ---
 v2: Use cond_resched() instead of schedule()

Tested it. The warning is gone from rhashtable test but now it is 
present in rbtree test (it was not there before). Same kernel, just your 
patch applied - but it should not change rbtree test???

[0.00] PROMLIB: Sun IEEE Boot Prom 'OBP 3.31.0 2001/07/25 20:36'
[0.00] PROMLIB: Root node compatible: 
[0.00] Linux version 4.2.0-rc2-00077-gf760b87-dirty (mroos@u5) (gcc 
version 4.9.3 (Debian 4.9.3-1) ) #21 Fri Jul 17 20:15:21 EEST 2015
[0.00] bootconsole [earlyprom0] enabled
[0.00] ARCH: SUN4U
[0.00] Ethernet address: 08:00:20:f8:c7:72
[0.00] MM: PAGE_OFFSET is 0xf800 (max_phys_bits == 40)
[0.00] MM: VMALLOC [0x0001 -- 0x0600]
[0.00] MM: VMEMMAP [0x0600 -- 0x0c00]
[0.00] Kernel: Using 10 locked TLB entries for main kernel image.
[0.00] Remapping the kernel... done.
[0.00] kmemleak: Kernel memory leak detector disabled
[0.00] OF stdout device is: /pci@1f,0/pci@1,1/ebus@1/se@14,40:a
[0.00] PROM: Built device tree with 70266 bytes of memory.
[0.00] Top of RAM: 0x1ff2c000, Total RAM: 0x1ff2a000
[0.00] Memory hole size: 0MB
[0.00] Allocated 16384 bytes for kernel page tables.
[0.00] Zone ranges:
[0.00]   Normal   [mem 0x-0x1ff2bfff]
[0.00] Movable zone start for each node
[0.00] Early memory node ranges
[0.00]   node   0: [mem 0x-0x1fefdfff]
[0.00]   node   0: [mem 0x1ff0-0x1ff2bfff]
[0.00] Initmem setup node 0 [mem 0x-0x1ff2bfff]
[0.00] On node 0 totalpages: 65429
[0.00]   Normal zone: 512 pages used for memmap
[0.00]   Normal zone: 0 pages reserved
[0.00]   Normal zone: 65429 pages, LIFO batch:15
[0.00] Booting Linux...
[0.00] CPU CAPS: [flush,stbar,swap,muldiv,v9,mul32,div32,v8plus]
[0.00] CPU CAPS: [vis]
[0.00] pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
[0.00] pcpu-alloc: [0] 0 
[0.00] Built 1 zonelists in Zone order, mobility grouping on.  Total 
pages: 64917
[0.00] Kernel command line: root=/dev/sda1 ro
[0.00] PID hash table entries: 2048 (order: 1, 16384 bytes)
[0.00] Dentry cache hash table entries: 65536 (order: 6, 524288 bytes)
[0.00] Inode-cache hash table entries: 32768 (order: 5, 262144 bytes)
[0.00] Sorting __ex_table...
[0.00] Memory: 475912K/523432K available (5270K kernel code, 516K 
rwdata, 1672K rodata, 520K init, 30210K bss, 47520K reserved, 0K cma-reserved)
[0.00] Running RCU self tests
[0.00] Testing tracer nop: PASSED
[0.00] NR_IRQS:2048 nr_irqs:2048 1
[   26.882478] clocksource: tick: mask: 0x max_cycles: 
0x5306eb473f, max_idle_ns: 440795213232 ns
[   26.986192] clocksource: mult[2c71c72] shift[24]
[   27.025729] clockevent: mult[5c28f5c3] shift[32]
[   27.067997] Console: colour dummy device 80x25
[   27.104149] console [tty0] enabled
[   27.128868] bootconsole [earlyprom0] disabled
[   27.165340] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., 
Ingo Molnar
[   27.165405] ... MAX_LOCKDEP_SUBCLASSES:  8
[   27.165445] ... MAX_LOCK_DEPTH:  48
[   27.165486] ... MAX_LOCKDEP_KEYS:8191
[   27.165529] ... CLASSHASH_SIZE:  4096
[   27.165574] ... MAX_LOCKDEP_ENTRIES: 32768
[   27.165617] ... MAX_LOCKDEP_CHAINS:  65536
[   27.165662] ... CHAINHASH_SIZE:  32768
[   27.165706]  memory used by lock dependency info: 8159 kB
[   27.165756]  per task-struct memory footprint: 1920 bytes
[   27.165802] 
[   27.165838] | Locking API testsuite:
[   27.165873] 

[   27.165932]  | spin |wlock |rlock |mutex | 
wsem | rsem |
[   27.165993]   
--
[   27.166092]  A-A deadlock:  ok  |  ok  |  ok  |  ok  |  
ok  |  ok  |
[   27.232682]  A-B-B-A deadlock:  ok  |  ok  |  ok  |  ok  |  
ok  |  ok  |
[   27.299789]  A-B-B-C-C-A deadlock:  ok  |  ok  |  ok  |  ok  |  
ok  |  ok  |
[   27.367295]  A-B-C-A-B-C deadlock:  ok  |  ok  |  ok  |  ok  |  
ok  |  ok  |
[   27.434877]  A-B-B-C-C-D-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  
ok  |  ok  |
[   27.502857]  A-B-C-D-B-D-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  
ok  |

Re: [net-next 01/14] clarify implementation of ethtool's get_ts_info op

2015-07-17 Thread Richard Cochran

On Fri, Jul 17, 2015 at 07:25:10AM -0700, Jeff Kirsher wrote:
 From: Jacob Keller jacob.e.kel...@intel.com
 
 This patch adds some clarification about the intended way to implement
 both SIOCSHWTSTAMP and ethtool's get_ts_info. The HWTSTAMP API has
 several Rx filters which are very specific, as well as more general
 filters. The specific filters really only exist to support some broken
 hardware which can't fully implement the generic filters. This patch
 adds clarification that it is okay to support the specific filters in
 SIOCSHWTSTAMP by upscaling them to the generic filters. In addition,
 update the header for ethtool_ts_info to specify that drivers ought to
 only report the filters they support without upscaling in this manner.

Acked-by: Richard Cochran richardcoch...@gmail.com

(for this patch and the other get_ts_info patches)
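
The split described in the quoted commit message (accept specific filters when they are requested, but report only the generic ones) can be sketched as follows. This is an illustration in standalone C, not driver code: the `RX_FILTER_*` enum, `reported_rx_filters()` and `set_rx_filter()` are hypothetical names loosely modeled on the HWTSTAMP_FILTER_* constants, and -ERANGE for an unsupported filter is used here as an assumption about the usual convention.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical filter IDs; a real driver would use HWTSTAMP_FILTER_*. */
enum rx_filter {
	RX_FILTER_NONE,
	RX_FILTER_V1_L4_EVENT,
	RX_FILTER_V1_L4_SYNC,
	RX_FILTER_V1_L4_DELAY_REQ,
	RX_FILTER_V2_EVENT,
	RX_FILTER_V2_SYNC,
	RX_FILTER_V2_DELAY_REQ,
	RX_FILTER_COUNT
};

/* What a get_ts_info-style query would report: generic filters only. */
uint32_t reported_rx_filters(void)
{
	return (1u << RX_FILTER_NONE) |
	       (1u << RX_FILTER_V1_L4_EVENT) |
	       (1u << RX_FILTER_V2_EVENT);
}

/*
 * What a SIOCSHWTSTAMP-style request would do: accept a specific
 * filter and upscale it to the generic one actually programmed.
 */
int set_rx_filter(int requested, enum rx_filter *granted)
{
	assert(granted != NULL);

	switch (requested) {
	case RX_FILTER_NONE:
		*granted = RX_FILTER_NONE;
		return 0;
	case RX_FILTER_V1_L4_EVENT:
	case RX_FILTER_V1_L4_SYNC:
	case RX_FILTER_V1_L4_DELAY_REQ:
		*granted = RX_FILTER_V1_L4_EVENT;
		return 0;
	case RX_FILTER_V2_EVENT:
	case RX_FILTER_V2_SYNC:
	case RX_FILTER_V2_DELAY_REQ:
		*granted = RX_FILTER_V2_EVENT;
		return 0;
	default:
		return -ERANGE;	/* unknown or unsupported filter */
	}
}
```

The caller learns which filter was really applied through `granted`, which is why the reported mask does not need to list the specific variants.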

Re: [PATCH] net: bcmgenet: Return the variable ret rather then zero for the function bcmgenet_power_down

2015-07-17 Thread Florian Fainelli

On 16/07/15 12:38, Nicholas Krause wrote:
 This makes the function bcmgenet_power_down return the variable ret
 rather than zero in order to make this function be able to signal its
 caller with an error code when a failure occurs internally rather than
 always appearing to run successfully to its caller.
 
 Signed-off-by: Nicholas Krause xerofo...@gmail.com

Reviewed-by: Florian Fainelli f.faine...@gmail.com

 ---
  drivers/net/ethernet/broadcom/genet/bcmgenet.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c 
 b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
 index 64c1e9d..129e5b5 100644
 --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
 +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
 @@ -877,7 +877,7 @@ static int bcmgenet_power_down(struct bcmgenet_priv *priv,
   break;
   }
  
 - return 0;
 + return ret;
  }
  
  static void bcmgenet_power_up(struct bcmgenet_priv *priv,
 


-- 
Florian

Re: 4.1.0, kernel panic, pppoe_release

2015-07-17 Thread Denys Fedoryshchenko

Probably my knowledge of the kernel is not sufficient, but I will try a few 
approaches.

One of them is to add to pppoe_unbind_sock_work:

pppox_unbind_sock(sk);
+/* Signal the death of the socket. */
+sk->sk_state = PPPOX_DEAD;

I will wait first, to make sure this patch was causing the kernel panic (it 
needs a 24h testing cycle), then I will try this fix.


On 2015-07-17 18:36, Dan Williams wrote:

On Fri, 2015-07-17 at 12:24 +0300, Denys Fedoryshchenko wrote:

As I suspected, this kernel panic is caused by recent changes to pppoe.
The problem appears in accel-pppd (server), on loaded servers (2k
users and more).
Most probably it is related to the change "pppoe: Use workqueue to die properly
when a PADT is received".
I will try to revert this and related patches.


While I didn't write the patch, I'm the one that started the process
that got it submitted...  Could you review the patch quickly too to see
if you can spot anything amiss with it, so that it could get fixed up?
The original patch does fix a real problem so ideally we don't have to
revert the whole thing upstream.

Dan


On 2015-07-14 13:57, Denys Fedoryshchenko wrote:
 Here is panic message from netconsole. Please let me know if any
 additional information required.

 Jul 14 13:49:16 10.0.252.10 [76078.867822] BUG: unable to handle kernel
 Jul 14 13:49:16 10.0.252.10 NULL pointer dereference
 Jul 14 13:49:16 10.0.252.10 at 03f0
 Jul 14 13:49:16 10.0.252.10 [76078.868280] IP:
 Jul 14 13:49:16 10.0.252.10 [a011e12a]
 pppoe_release+0x56/0x142 [pppoe]
 Jul 14 13:49:16 10.0.252.10 [76078.868541] PGD 336e4a067
 Jul 14 13:49:16 10.0.252.10 PUD 333f17067
 Jul 14 13:49:16 10.0.252.10 PMD 0
 Jul 14 13:49:16 10.0.252.10
 Jul 14 13:49:16 10.0.252.10 [76078.868918] Oops:  [#1]
 Jul 14 13:49:16 10.0.252.10 SMP
 Jul 14 13:49:16 10.0.252.10
 Jul 14 13:49:16 10.0.252.10 [76078.869226] Modules linked in:
 Jul 14 13:49:16 10.0.252.10 netconsole
 Jul 14 13:49:16 10.0.252.10 configfs
 Jul 14 13:49:16 10.0.252.10 coretemp
 Jul 14 13:49:16 10.0.252.10 sch_fq
 Jul 14 13:49:16 10.0.252.10 cls_fw
 Jul 14 13:49:16 10.0.252.10 act_police
 Jul 14 13:49:16 10.0.252.10 cls_u32
 Jul 14 13:49:16 10.0.252.10 sch_ingress
 Jul 14 13:49:16 10.0.252.10 sch_sfq
 Jul 14 13:49:16 10.0.252.10 sch_htb
 Jul 14 13:49:16 10.0.252.10 pppoe
 Jul 14 13:49:16 10.0.252.10 pppox
 Jul 14 13:49:16 10.0.252.10 ppp_generic
 Jul 14 13:49:16 10.0.252.10 slhc
 Jul 14 13:49:16 10.0.252.10 nf_nat_pptp
 Jul 14 13:49:16 10.0.252.10 nf_nat_proto_gre
 Jul 14 13:49:16 10.0.252.10 nf_conntrack_pptp
 Jul 14 13:49:16 10.0.252.10 nf_conntrack_proto_gre
 Jul 14 13:49:16 10.0.252.10 tun
 Jul 14 13:49:16 10.0.252.10 xt_REDIRECT
 Jul 14 13:49:16 10.0.252.10 nf_nat_redirect
 Jul 14 13:49:16 10.0.252.10 xt_set
 Jul 14 13:49:16 10.0.252.10 xt_TCPMSS
 Jul 14 13:49:16 10.0.252.10 ipt_REJECT
 Jul 14 13:49:16 10.0.252.10 nf_reject_ipv4
 Jul 14 13:49:16 10.0.252.10 ts_bm
 Jul 14 13:49:16 10.0.252.10 xt_string
 Jul 14 13:49:16 10.0.252.10 xt_connmark
 Jul 14 13:49:16 10.0.252.10 xt_DSCP
 Jul 14 13:49:16 10.0.252.10 xt_mark
 Jul 14 13:49:16 10.0.252.10 xt_tcpudp
 Jul 14 13:49:16 10.0.252.10 iptable_mangle
 Jul 14 13:49:16 10.0.252.10 iptable_filter
 Jul 14 13:49:16 10.0.252.10 iptable_nat
 Jul 14 13:49:16 10.0.252.10 nf_conntrack_ipv4
 Jul 14 13:49:16 10.0.252.10 nf_defrag_ipv4
 Jul 14 13:49:16 10.0.252.10 nf_nat_ipv4
 Jul 14 13:49:16 10.0.252.10 nf_nat
 Jul 14 13:49:16 10.0.252.10 nf_conntrack
 Jul 14 13:49:16 10.0.252.10 ip_tables
 Jul 14 13:49:16 10.0.252.10 x_tables
 Jul 14 13:49:16 10.0.252.10 ip_set_hash_ip
 Jul 14 13:49:16 10.0.252.10 ip_set
 Jul 14 13:49:16 10.0.252.10 nfnetlink
 Jul 14 13:49:16 10.0.252.10 8021q
 Jul 14 13:49:16 10.0.252.10 garp
 Jul 14 13:49:16 10.0.252.10 mrp
 Jul 14 13:49:16 10.0.252.10 stp
 Jul 14 13:49:16 10.0.252.10 llc
 Jul 14 13:49:16 10.0.252.10 [last unloaded: netconsole]
 Jul 14 13:49:16 10.0.252.10
 Jul 14 13:49:16 10.0.252.10 [76078.873195] CPU: 3 PID: 2940 Comm:
 accel-pppd Not tainted 4.1.0-build-0074 #7
 Jul 14 13:49:16 10.0.252.10 [76078.873396] Hardware name: HP ProLiant
 DL320e Gen8 v2, BIOS P80 04/02/2015
 Jul 14 13:49:16 10.0.252.10 [76078.873598] task: 8800b1886ba0 ti:
 8800b09f4000 task.ti: 8800b09f4000
 Jul 14 13:49:16 10.0.252.10 [76078.873929] RIP:
 0010:[a011e12a]
 Jul 14 13:49:16 10.0.252.10 [a011e12a]
 pppoe_release+0x56/0x142 [pppoe]
 Jul 14 13:49:16 10.0.252.10 [76078.874317] RSP: 0018:8800b09f7e28
 EFLAGS: 00010202
 Jul 14 13:49:16 10.0.252.10 [76078.874512] RAX:  RBX:
 88032a214400 RCX: 
 Jul 14 13:49:16 10.0.252.10 [76078.874709] RDX: 000d RSI:
 fe01 RDI: 8180d6da
 Jul 14 13:49:16 10.0.252.10 [76078.874906] RBP: 8800b09f7e68 R08:
  R09: 
 Jul 14 13:49:16 10.0.252.10 [76078.875102] R10: 88031ef6a110 R11:
 0293 R12: 88030f8d8fc0
 Jul 14 13:49:16 10.0.252.10 [76078.875299] R13: 88030f8d8ff0

Re: [PATCH net-next 1/2] bpf: introduce bpf_skb_vlan_push/pop() helpers

2015-07-17 Thread Alexei Starovoitov


On 7/17/15 1:12 AM, Eric Dumazet wrote:

On Thu, 2015-07-16 at 19:58 -0700, Alexei Starovoitov wrote:

In order to let eBPF programs call skb_vlan_push/pop via helper functions


Why should eBPF program do such thing ?

Are BPF users in the kernel expecting skb being changed, and are we sure
they reload all cached values when/if needed ?


well, classic BPF and even extended BPF with socket filters cannot use
these helpers. They are for TC ingress/egress only. There different
actions can already change skb->data while classifiers/actions are
running. btw, before we started discussing this topic at nfws,
I thought that bpf programs will never be able to change skb->data from
inside the programs, but turned out it's only JITs that needed
re-caching. Programs cannot see skb->data. They can access packet
only via ld_abs/ld_ind instructions and helper functions.
So any changes to internal fields of skb are invisible to programs.
skb->data/hlen cache that is part of JIT is also invisible to the
programs. It's an optimization that some JITs do. arm64 JIT doesn't
do this optimization, for example.
I'll reword commit log to better explain this.

One of the use cases is this phys2virt gateway I presented. There I
need to do vlan-learning and src mac forwarding. Currently I'm
creating as many as I can vlan netdevs on top of regular eth0
and attach tc+bpf to all of them. That's very inefficient.
With these helpers I'll attach tc+bpf to eth0 only and skip creation
of thousands of vlan netdevs.


eBPF JITs need to recognize helpers that change skb->data, since
skb->data and hlen are cached as part of JIT code generation.
- arm64 JIT is using bpf_load_pointer() without caching, so it's ok as-is.
- x64 JIT recognizes bpf_skb_vlan_push/pop() calls and re-caches skb->data/hlen
   after such calls (experiments showed that conditional re-caching is slower).
- s390 JIT falls back to interpreter for now when bpf_skb_vlan_push() is present
   in the program (re-caching is tbd).





+static u64 bpf_skb_vlan_push(u64 r1, u64 vlan_proto, u64 vlan_tci, u64 r4, u64 
r5)
+{
+   struct sk_buff *skb = (struct sk_buff *) (long) r1;
+
+   if (unlikely(vlan_proto != htons(ETH_P_8021Q) &&
+vlan_proto != htons(ETH_P_8021AD)))
+   vlan_proto = htons(ETH_P_8021Q);



This would raise sparse error, as vlan_proto is u64, and htons() __be16

make C=2 CF=-D__CHECK_ENDIAN__ net/core/filter.o


yes.
When I wrote these lines I thought of the same, so I did run the sparse
and it spewed a lot of false positives and stopped on 'too many errors'
before reaching these lines.
So I downloaded the latest sparse, hacked it a little and tried again.
Still it didn't complain about the endianness. That was puzzling,
so I left the above lines as-is.
but since your eagle eyes caught it, I will add casts :)
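
For illustration, the comparison with explicit casts could look something like the userspace sketch below. It is not the code from the patch: `vlan_proto_or_default()` and the `VLAN_PROTO_*` macros are hypothetical names, and plain `uint16_t`/`uint64_t` stand in for the kernel's `__be16`/`u64` types, so sparse's endianness checking itself is not reproduced here.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

#define VLAN_PROTO_8021Q  0x8100	/* same value as ETH_P_8021Q */
#define VLAN_PROTO_8021AD 0x88A8	/* same value as ETH_P_8021AD */

/*
 * The helper argument arrives as a 64-bit register value, so the
 * network-order constants are widened explicitly before comparing.
 * Anything that is not a known VLAN protocol falls back to 802.1Q.
 */
uint16_t vlan_proto_or_default(uint64_t vlan_proto)
{
	const uint64_t q = (uint64_t)htons(VLAN_PROTO_8021Q);
	const uint64_t ad = (uint64_t)htons(VLAN_PROTO_8021AD);

	if (vlan_proto != q && vlan_proto != ad)
		vlan_proto = q;

	/* Explicit narrowing back to a 16-bit network-order value. */
	return (uint16_t)vlan_proto;
}
```

Values with stray upper bits set are treated as unknown and replaced by the 802.1Q default, matching the behaviour of the quoted comparison on the full 64-bit value.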


Re: [PATCH net-next 14/22] vxlan: Flow based tunneling

2015-07-17 Thread Alexei Starovoitov


On 7/17/15 5:55 AM, Thomas Graf wrote:

@@ -2373,6 +2470,12 @@ static void vxlan_setup(struct net_device *dev)
netif_keep_dst(dev);
dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;

+   /* If in flow based mode, keep the dst including encapsulation
+* instructions for vxlan_xmit().
+*/
+   if (vxlan->flags & VXLAN_F_FLOW_BASED)
+   netif_keep_dst(dev);


hmm, isn't this done already few lines above? ;)

Re: [PATCH net-next v2] rhashtable: Allow other tasks to be scheduled in large lookup loops

2015-07-17 Thread Eric Dumazet

On Fri, 2015-07-17 at 22:07 +0300, mr...@linux.ee wrote:
  Depending on system speed, the large lookup/insert/delete loops of the 
  testsuite can
  take a considerable amount of time to complete causing watchdog warnings to 
  appear.
  Allow other tasks to be scheduled throughout the loops.
  
  Reported-by: Meelis Roos mr...@linux.ee
  Signed-off-by: Thomas Graf tg...@suug.ch
  ---
  v2: Use cond_resched() instead of schedule()
 
 Tested it. The warning is gone from rhashtable test but now it is 
 present in rbtree test (it was not there before). Same kernel, just your 
 patch applied - but it should not change rbtree test???

Why not ?

rbtree tests need same kind of patches.
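
The pattern in question, giving other tasks a chance to run inside a long test loop, can be imitated in userspace. The sketch below is only an analogy: it uses POSIX sched_yield() where kernel code would call cond_resched(), and `long_loop_with_yield()` is a hypothetical name.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/*
 * Run a long loop, yielding the CPU every `yield_every` iterations.
 * Stores the sum 0 + 1 + ... + (iterations - 1) in *sum and returns
 * the number of times the loop yielded.
 */
size_t long_loop_with_yield(size_t iterations, size_t yield_every,
			    unsigned long *sum)
{
	size_t yields = 0;
	size_t i;

	assert(sum != NULL && yield_every > 0);
	*sum = 0;

	for (i = 0; i < iterations; i++) {
		*sum += i;	/* stand-in for one lookup/insert/delete */

		if ((i + 1) % yield_every == 0) {
			sched_yield();	/* kernel code would use cond_resched() */
			yields++;
		}
	}
	return yields;
}
```

The same periodic yield would have to be added to every long-running test loop, which is the point being made about the rbtree test.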





Re: [PATCHv2 RFC net-next] net/vxlan: Fix kernel unaligned access in __vxlan_find_mac

2015-07-17 Thread Sowmini Varadhan

On (07/17/15 16:07), Joe Perches wrote:
 On Fri, 2015-07-17 at 22:00 +0200, Sowmini Varadhan wrote:
  __vxlan_find_mac invokes ether_addr_equal on the eth_addr field,
  which triggers unaligned access messages, so rearrange vxlan_fdb
  to avoid this in the most non-intrusive way.
 
 What arch does this?

sparc.

BTW, I was also getting a lot of alignment errors from 
vxlan_xmit_skb (vxh comes out unaligned) for the IPv6 path.
I did not have time to investigate/fix this correctly- not sure
if this is specific to v6.

--Sowmini
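
Why rearranging a structure helps can be shown with a small layout sketch. The two structs below are hypothetical and are not the real vxlan_fdb layout; they only show how moving a 6-byte address to an even offset gives it the 16-bit alignment that word-based MAC comparisons expect, with a byte-wise memcmp() as the alignment-agnostic fallback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6

/* Hypothetical layout with the address at an odd offset. */
struct fdb_odd {
	uint8_t state;
	uint8_t eth_addr[MAC_LEN];
	uint8_t flags;
};

/* Same fields, rearranged so the address starts at an even offset. */
struct fdb_even {
	uint8_t eth_addr[MAC_LEN];
	uint8_t state;
	uint8_t flags;
};

/* True when a field offset is suitable for 16-bit loads. */
bool offset_is_u16_aligned(size_t offset)
{
	return offset % 2 == 0;
}

/* Byte-wise comparison that is safe at any alignment. */
bool mac_equal_any_align(const uint8_t *a, const uint8_t *b)
{
	assert(a != NULL && b != NULL);
	return memcmp(a, b, MAC_LEN) == 0;
}
```

On strict-alignment architectures such as sparc, only the second layout lets a word-based comparison run without unaligned-access traps.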


Re: [PATCH net-next 1/2] tcp: don't extend RTO on failed loss probe attempts

2015-07-17 Thread Eric Dumazet

On Fri, 2015-07-17 at 14:22 -0700, Yuchung Cheng wrote:
 If TLP was unable to send a probe, it extended the RTO to
 now + icsk_rto. But extending the RTO makes little sense
 if no TLP probe went out. With this commit, instead of
 extending the RTO we re-arm it relative to the transmit time
 of the write queue head.

But what was the reason the probe could not be sent ?

If it is local congestion or memory allocation error, it does make sense
to not add fuel to the fire.
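
For reference, the difference between the two timer policies in the quoted commit message comes down to simple arithmetic. The sketch below is not the TCP code: `rto_extend_delay()` and `rto_rearm_delay()` are hypothetical names, time is in arbitrary ticks, and firing after one tick when the deadline has already passed is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Old behaviour as described: push the timer out to now + rto. */
uint32_t rto_extend_delay(uint32_t rto)
{
	return rto;
}

/*
 * New behaviour as described: re-arm relative to when the head of the
 * write queue was transmitted, so time already spent waiting counts.
 * Returns the delay from `now` until the timer should fire.
 */
uint32_t rto_rearm_delay(uint32_t now, uint32_t head_tx_time, uint32_t rto)
{
	uint32_t expires = head_tx_time + rto;
	int32_t delta = (int32_t)(expires - now);	/* wrap-safe difference */

	assert(rto > 0);
	return delta > 0 ? (uint32_t)delta : 1;	/* assumed: fire on next tick */
}
```

With the head sent at tick 900, an RTO of 300 and the current time at tick 1000, re-arming yields a delay of 200 ticks, whereas extending would wait the full 300.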



