[net-next 08/14] dp83640: only report generic filters in ts_info
From: Jacob Keller <jacob.e.kel...@intel.com>

CC: Richard Cochran <richardcoch...@gmail.com>
Signed-off-by: Jacob Keller <jacob.e.kel...@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
---
 drivers/net/phy/dp83640.c | 10 +-
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/drivers/net/phy/dp83640.c b/drivers/net/phy/dp83640.c
index 00cb41e..185b03c 100644
--- a/drivers/net/phy/dp83640.c
+++ b/drivers/net/phy/dp83640.c
@@ -1449,17 +1449,9 @@ static int dp83640_ts_info(struct phy_device *dev, struct ethtool_ts_info *info)
 	info->rx_filters =
 		(1 << HWTSTAMP_FILTER_NONE) |
 		(1 << HWTSTAMP_FILTER_PTP_V1_L4_EVENT) |
-		(1 << HWTSTAMP_FILTER_PTP_V1_L4_SYNC) |
-		(1 << HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ) |
 		(1 << HWTSTAMP_FILTER_PTP_V2_L4_EVENT) |
-		(1 << HWTSTAMP_FILTER_PTP_V2_L4_SYNC) |
-		(1 << HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ) |
 		(1 << HWTSTAMP_FILTER_PTP_V2_L2_EVENT) |
-		(1 << HWTSTAMP_FILTER_PTP_V2_L2_SYNC) |
-		(1 << HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ) |
-		(1 << HWTSTAMP_FILTER_PTP_V2_EVENT) |
-		(1 << HWTSTAMP_FILTER_PTP_V2_SYNC) |
-		(1 << HWTSTAMP_FILTER_PTP_V2_DELAY_REQ);
+		(1 << HWTSTAMP_FILTER_PTP_V2_EVENT);
 
 	return 0;
 }
-- 
2.4.3

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[net-next 04/14] i40e: only report generic filters in get_ts_info
From: Jacob Keller <jacob.e.kel...@intel.com>

Signed-off-by: Jacob Keller <jacob.e.kel...@intel.com>
Tested-by: Jim Young <james.m.yo...@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 13 ++---
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 0b68f61..f2075d5 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -1467,17 +1467,8 @@ static int i40e_get_ts_info(struct net_device *dev,
 	info->tx_types = (1 << HWTSTAMP_TX_OFF) | (1 << HWTSTAMP_TX_ON);
 
 	info->rx_filters = (1 << HWTSTAMP_FILTER_NONE) |
-			   (1 << HWTSTAMP_FILTER_PTP_V1_L4_SYNC) |
-			   (1 << HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ) |
-			   (1 << HWTSTAMP_FILTER_PTP_V2_EVENT) |
-			   (1 << HWTSTAMP_FILTER_PTP_V2_L2_EVENT) |
-			   (1 << HWTSTAMP_FILTER_PTP_V2_L4_EVENT) |
-			   (1 << HWTSTAMP_FILTER_PTP_V2_SYNC) |
-			   (1 << HWTSTAMP_FILTER_PTP_V2_L2_SYNC) |
-			   (1 << HWTSTAMP_FILTER_PTP_V2_L4_SYNC) |
-			   (1 << HWTSTAMP_FILTER_PTP_V2_DELAY_REQ) |
-			   (1 << HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ) |
-			   (1 << HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ);
+			   (1 << HWTSTAMP_FILTER_PTP_V1_L4_EVENT) |
+			   (1 << HWTSTAMP_FILTER_PTP_V2_EVENT);
 
 	return 0;
 }
-- 
2.4.3
[net-next 00/14][pull request] Intel Wired LAN Driver Updates 2015-07-17
This series contains updates to igb, ixgbe, ixgbevf, i40e, bnx2x,
freescale, siena and dp83640.

Jacob provides several patches to clarify the intended way to implement
both SIOCSHWTSTAMP and ethtool's get_ts_info(). It is okay to support
the specific filters in SIOCSHWTSTAMP by upscaling them to the generic
filters.

Alex Duyck provides an igb patch to pull the time stamp from the
fragment before it gets added to the skb. This avoids a possible issue
in which the fragment ends up shorter than IGB_RX_HDR_LEN because the
time stamp is pulled after the copybreak check. He also provides an
ixgbevf patch to fold the ixgbevf_pull_tail() call into
ixgbevf_add_rx_frag(), so that the fragment does not have to be
modified after it is added to the skb.

Fan provides patches for ixgbe/ixgbevf to set the receive hash type
based on the receive descriptor RSS type.

Todd provides a fix for igb where link on any media other than copper
was not being detected, because the check looked at the incorrect PHY
page (the page in use was switched before the function that checks
link was executed).

The following are changes since commit c15df306fc79c672573f1cc2ebdfcb32d7e68780:
  ipv6: Remove unused arguments for __ipv6_dev_get_saddr().
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue master

Alexander Duyck (2):
  igb: Pull timestamp from fragment before adding it to skb
  ixgbevf: fold ixgbevf_pull_tail into ixgbevf_add_rx_frag

Fan Du (3):
  ixgbe: Specify Rx hash type WRT Rx desc RSS type
  ixgbevf: Set Rx hash type for ingress packets
  ixgbe: Don't report flow director filter's status

Jacob Keller (8):
  clarify implementation of ethtool's get_ts_info op
  freescale: remove incorrect copied comment
  bnx2x: only report most generic filters in get_ts_info
  i40e: only report generic filters in get_ts_info
  igb: only report generic filters in get_ts_info
  ixgbe: only report generic filters in get_ts_info
  siena: only report generic filters in get_ts_info
  dp83640: only report generic filters in ts_info

Todd Fujinaka (1):
  igb: Fix i354 88E1112 PHY on RCC boards using AutoMediaDetect

 Documentation/networking/timestamping.txt           |  7 ++
 .../net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c     | 11 +--
 drivers/net/ethernet/freescale/fec_ptp.c            |  6 --
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c      | 13 +--
 drivers/net/ethernet/intel/igb/e1000_82575.c        | 18 +++--
 drivers/net/ethernet/intel/igb/igb_ethtool.c        |  4 -
 drivers/net/ethernet/intel/igb/igb_main.c           | 94 ++
 drivers/net/ethernet/intel/ixgbe/ixgbe_82599.c      |  2 -
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c    |  8 --
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c       | 25 +-
 drivers/net/ethernet/intel/ixgbevf/defines.h        | 12 +++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c   | 93 +++--
 drivers/net/ethernet/sfc/siena.c                    |  6 +-
 drivers/net/phy/dp83640.c                           | 10 +--
 include/uapi/linux/ethtool.h                        |  5 ++
 15 files changed, 134 insertions(+), 180 deletions(-)

-- 
2.4.3
[net-next 12/14] ixgbevf: Set Rx hash type for ingress packets
From: Fan Du <fan...@intel.com>

Set the hash type for ingress packets according to the RSS type field
of the NIC's advanced receive descriptor.

Signed-off-by: Fan Du <fan...@intel.com>
Tested-by: Phil Schmitt <phillip.j.schm...@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/defines.h      | 12 ++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 27 +++
 2 files changed, 39 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbevf/defines.h b/drivers/net/ethernet/intel/ixgbevf/defines.h
index 770e21a..5843458 100644
--- a/drivers/net/ethernet/intel/ixgbevf/defines.h
+++ b/drivers/net/ethernet/intel/ixgbevf/defines.h
@@ -161,6 +161,18 @@ typedef u32 ixgbe_link_speed;
 #define IXGBE_RXDADV_SPLITHEADER_EN	0x00001000
 #define IXGBE_RXDADV_SPH		0x8000
 
+/* RSS Hash results */
+#define IXGBE_RXDADV_RSSTYPE_NONE		0x00000000
+#define IXGBE_RXDADV_RSSTYPE_IPV4_TCP		0x00000001
+#define IXGBE_RXDADV_RSSTYPE_IPV4		0x00000002
+#define IXGBE_RXDADV_RSSTYPE_IPV6_TCP		0x00000003
+#define IXGBE_RXDADV_RSSTYPE_IPV6_EX		0x00000004
+#define IXGBE_RXDADV_RSSTYPE_IPV6		0x00000005
+#define IXGBE_RXDADV_RSSTYPE_IPV6_TCP_EX	0x00000006
+#define IXGBE_RXDADV_RSSTYPE_IPV4_UDP		0x00000007
+#define IXGBE_RXDADV_RSSTYPE_IPV6_UDP		0x00000008
+#define IXGBE_RXDADV_RSSTYPE_IPV6_UDP_EX	0x00000009
+
 #define IXGBE_RXD_ERR_FRAME_ERR_MASK ( \
 				      IXGBE_RXD_ERR_CE | \
 				      IXGBE_RXD_ERR_LE | \
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index acfa051..b2c86f1 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -457,6 +457,32 @@ static void ixgbevf_rx_skb(struct ixgbevf_q_vector *q_vector,
 	napi_gro_receive(&q_vector->napi, skb);
 }
 
+#define IXGBE_RSS_L4_TYPES_MASK \
+	((1ul << IXGBE_RXDADV_RSSTYPE_IPV4_TCP) | \
+	 (1ul << IXGBE_RXDADV_RSSTYPE_IPV4_UDP) | \
+	 (1ul << IXGBE_RXDADV_RSSTYPE_IPV6_TCP) | \
+	 (1ul << IXGBE_RXDADV_RSSTYPE_IPV6_UDP))
+
+static inline void ixgbevf_rx_hash(struct ixgbevf_ring *ring,
+				   union ixgbe_adv_rx_desc *rx_desc,
+				   struct sk_buff *skb)
+{
+	u16 rss_type;
+
+	if (!(ring->netdev->features & NETIF_F_RXHASH))
+		return;
+
+	rss_type = le16_to_cpu(rx_desc->wb.lower.lo_dword.hs_rss.pkt_info) &
+		   IXGBE_RXDADV_RSSTYPE_MASK;
+
+	if (!rss_type)
+		return;
+
+	skb_set_hash(skb, le32_to_cpu(rx_desc->wb.lower.hi_dword.rss),
+		     (IXGBE_RSS_L4_TYPES_MASK & (1ul << rss_type)) ?
+		     PKT_HASH_TYPE_L4 : PKT_HASH_TYPE_L3);
+}
+
 /**
  * ixgbevf_rx_checksum - indicate in skb if hw indicated a good cksum
  * @ring: structure containig ring specific data
@@ -506,6 +532,7 @@ static void ixgbevf_process_skb_fields(struct ixgbevf_ring *rx_ring,
 				       union ixgbe_adv_rx_desc *rx_desc,
 				       struct sk_buff *skb)
 {
+	ixgbevf_rx_hash(rx_ring, rx_desc, skb);
 	ixgbevf_rx_checksum(rx_ring, rx_desc, skb);
 
 	if (ixgbevf_test_staterr(rx_desc, IXGBE_RXD_STAT_VP)) {
-- 
2.4.3
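The L3/L4 decision in the patch above can be tried outside the kernel. The following is a minimal userspace sketch, not driver code: the `DEMO_` names and the `demo_hash_type` enum are hypothetical stand-ins, the RSS type codes mirror the ones added to defines.h, and the 0xF mask value is an assumption because IXGBE_RXDADV_RSSTYPE_MASK is not defined in the diff shown here.

```c
#include <stdint.h>

/* RSS type codes as listed in the patch (names shortened for the demo). */
#define DEMO_RSSTYPE_IPV4_TCP	0x1
#define DEMO_RSSTYPE_IPV4	0x2
#define DEMO_RSSTYPE_IPV6_TCP	0x3
#define DEMO_RSSTYPE_IPV4_UDP	0x7
#define DEMO_RSSTYPE_IPV6_UDP	0x8

/* Assumption: the RSS type occupies the low four bits of pkt_info. */
#define DEMO_RSSTYPE_MASK	0xF

/* One bit per RSS type whose hash covers the L4 ports. */
#define DEMO_RSS_L4_TYPES_MASK \
	((1ul << DEMO_RSSTYPE_IPV4_TCP) | \
	 (1ul << DEMO_RSSTYPE_IPV4_UDP) | \
	 (1ul << DEMO_RSSTYPE_IPV6_TCP) | \
	 (1ul << DEMO_RSSTYPE_IPV6_UDP))

/* Hypothetical stand-in for the kernel's PKT_HASH_TYPE_* values. */
enum demo_hash_type { DEMO_HASH_NONE, DEMO_HASH_L3, DEMO_HASH_L4 };

/* Classify a descriptor's pkt_info word the way the patch does:
 * type 0 means no hash was computed, types in the L4 mask report an
 * L4 hash, and everything else is reported as L3.
 */
enum demo_hash_type demo_rx_hash_type(uint16_t pkt_info)
{
	uint16_t rss_type = pkt_info & DEMO_RSSTYPE_MASK;

	if (!rss_type)
		return DEMO_HASH_NONE;

	return (DEMO_RSS_L4_TYPES_MASK & (1ul << rss_type)) ?
	       DEMO_HASH_L4 : DEMO_HASH_L3;
}
```

Using a bitmask indexed by the type code keeps the check to one shift and one AND per packet, which is presumably why the patch prefers it over a switch statement.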
[net-next 09/14] igb: Pull timestamp from fragment before adding it to skb
From: Alexander Duyck <alexander.h.du...@redhat.com>

This change makes it so that we pull the timestamp from the fragment
before we add it to the skb. By doing this we can avoid a possible
issue in which the fragment can possibly be less than IGB_RX_HDR_LEN
due to the timestamp being pulled after the copybreak check.

While making this change I realized we could also pull the rest of the
igb_pull_tail function into igb_add_rx_frag since in the case of igb,
unlike ixgbe, we are able to unmap the entire buffer before calling
add_rx_frag so merging the two allows for sharing of code between the
two merged functions.

Reported-by: Cong Wang <xiyou.wangc...@gmail.com>
Signed-off-by: Alexander Duyck <alexander.h.du...@redhat.com>
Tested-by: Aaron Brown <aaron.f.br...@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
---
 drivers/net/ethernet/intel/igb/igb_main.c | 94 ---
 1 file changed, 25 insertions(+), 69 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 2f70a9b..fc7729e 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6621,22 +6621,25 @@ static bool igb_add_rx_frag(struct igb_ring *rx_ring,
 			    struct sk_buff *skb)
 {
 	struct page *page = rx_buffer->page;
+	unsigned char *va = page_address(page) + rx_buffer->page_offset;
 	unsigned int size = le16_to_cpu(rx_desc->wb.upper.length);
 #if (PAGE_SIZE < 8192)
 	unsigned int truesize = IGB_RX_BUFSZ;
 #else
-	unsigned int truesize = ALIGN(size, L1_CACHE_BYTES);
+	unsigned int truesize = SKB_DATA_ALIGN(size);
 #endif
+	unsigned int pull_len;
 
-	if ((size <= IGB_RX_HDR_LEN) && !skb_is_nonlinear(skb)) {
-		unsigned char *va = page_address(page) + rx_buffer->page_offset;
+	if (unlikely(skb_is_nonlinear(skb)))
+		goto add_tail_frag;
 
-		if (igb_test_staterr(rx_desc, E1000_RXDADV_STAT_TSIP)) {
-			igb_ptp_rx_pktstamp(rx_ring->q_vector, va, skb);
-			va += IGB_TS_HDR_LEN;
-			size -= IGB_TS_HDR_LEN;
-		}
+	if (unlikely(igb_test_staterr(rx_desc, E1000_RXDADV_STAT_TSIP))) {
+		igb_ptp_rx_pktstamp(rx_ring->q_vector, va, skb);
+		va += IGB_TS_HDR_LEN;
+		size -= IGB_TS_HDR_LEN;
+	}
 
+	if (likely(size <= IGB_RX_HDR_LEN)) {
 		memcpy(__skb_put(skb, size), va, ALIGN(size, sizeof(long)));
 
 		/* page is not reserved, we can reuse buffer as-is */
@@ -6648,8 +6651,21 @@ static bool igb_add_rx_frag(struct igb_ring *rx_ring,
 		return false;
 	}
 
+	/* we need the header to contain the greater of either ETH_HLEN or
+	 * 60 bytes if the skb->len is less than 60 for skb_pad.
+	 */
+	pull_len = eth_get_headlen(va, IGB_RX_HDR_LEN);
+
+	/* align pull length to size of long to optimize memcpy performance */
+	memcpy(__skb_put(skb, pull_len), va, ALIGN(pull_len, sizeof(long)));
+
+	/* update all of the pointers */
+	va += pull_len;
+	size -= pull_len;
+
+add_tail_frag:
 	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
-			rx_buffer->page_offset, size, truesize);
+			(unsigned long)va & ~PAGE_MASK, size, truesize);
 
 	return igb_can_reuse_rx_page(rx_buffer, page, truesize);
 }
@@ -6791,62 +6807,6 @@ static bool igb_is_non_eop(struct igb_ring *rx_ring,
 }
 
 /**
- * igb_pull_tail - igb specific version of skb_pull_tail
- * @rx_ring: rx descriptor ring packet is being transacted on
- * @rx_desc: pointer to the EOP Rx descriptor
- * @skb: pointer to current skb being adjusted
- *
- * This function is an igb specific version of __pskb_pull_tail.  The
- * main difference between this version and the original function is that
- * this function can make several assumptions about the state of things
- * that allow for significant optimizations versus the standard function.
- * As a result we can do things like drop a frag and maintain an accurate
- * truesize for the skb.
- */
-static void igb_pull_tail(struct igb_ring *rx_ring,
-			  union e1000_adv_rx_desc *rx_desc,
-			  struct sk_buff *skb)
-{
-	struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[0];
-	unsigned char *va;
-	unsigned int pull_len;
-
-	/* it is valid to use page_address instead of kmap since we are
-	 * working with pages allocated out of the lomem pool per
-	 * alloc_page(GFP_ATOMIC)
-	 */
-	va = skb_frag_address(frag);
-
-	if (igb_test_staterr(rx_desc, E1000_RXDADV_STAT_TSIP)) {
-		/* retrieve timestamp from buffer */
-		igb_ptp_rx_pktstamp(rx_ring->q_vector, va, skb);
-
-		/* update pointers to remove timestamp header */
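The ordering described in the commit message (strip the timestamp header first, then apply the copybreak size check) can be sketched in isolation. This is a simplified userspace model under stated assumptions: `demo_frag` and `demo_strip_ts_then_check` are hypothetical names, and the values 16 and 256 for the timestamp header length and the copybreak threshold are assumptions standing in for IGB_TS_HDR_LEN and IGB_RX_HDR_LEN.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_TS_HDR_LEN	16	/* assumed size of the in-buffer timestamp header */
#define DEMO_RX_HDR_LEN	256	/* assumed copybreak threshold */

/* Hypothetical view of one received buffer: start address and length. */
struct demo_frag {
	const unsigned char *va;
	size_t size;
};

/* Remove the timestamp header (if the descriptor says one is present)
 * before deciding whether the payload is small enough to be copied
 * into the linear area. Assumes size >= DEMO_TS_HDR_LEN whenever
 * has_ts is set. Returns true when the copybreak path applies.
 */
bool demo_strip_ts_then_check(struct demo_frag *frag, bool has_ts)
{
	if (has_ts) {
		frag->va += DEMO_TS_HDR_LEN;
		frag->size -= DEMO_TS_HDR_LEN;
	}

	return frag->size <= DEMO_RX_HDR_LEN;
}
```

With this ordering the size that reaches the comparison is already the payload size, so a buffer whose raw length is slightly above the threshold only because of the timestamp header still takes the copy path.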
[net-next 11/14] ixgbe: Specify Rx hash type WRT Rx desc RSS type
From: Fan Du <fan...@intel.com>

RSS can hash on the L4 source/destination ports as well, so the Rx
hash type of an ingress skb should reflect the actual RSS
configuration.

Signed-off-by: Fan Du <fan...@intel.com>
Tested-by: Phil Schmitt <phillip.j.schm...@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 25 +
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 9aa6104..3e6a931 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1360,14 +1360,31 @@ static int __ixgbe_notify_dca(struct device *dev, void *data)
 }
 
 #endif /* CONFIG_IXGBE_DCA */
+
+#define IXGBE_RSS_L4_TYPES_MASK \
+	((1ul << IXGBE_RXDADV_RSSTYPE_IPV4_TCP) | \
+	 (1ul << IXGBE_RXDADV_RSSTYPE_IPV4_UDP) | \
+	 (1ul << IXGBE_RXDADV_RSSTYPE_IPV6_TCP) | \
+	 (1ul << IXGBE_RXDADV_RSSTYPE_IPV6_UDP))
+
 static inline void ixgbe_rx_hash(struct ixgbe_ring *ring,
 				 union ixgbe_adv_rx_desc *rx_desc,
 				 struct sk_buff *skb)
 {
-	if (ring->netdev->features & NETIF_F_RXHASH)
-		skb_set_hash(skb,
-			     le32_to_cpu(rx_desc->wb.lower.hi_dword.rss),
-			     PKT_HASH_TYPE_L3);
+	u16 rss_type;
+
+	if (!(ring->netdev->features & NETIF_F_RXHASH))
+		return;
+
+	rss_type = le16_to_cpu(rx_desc->wb.lower.lo_dword.hs_rss.pkt_info) &
+		   IXGBE_RXDADV_RSSTYPE_MASK;
+
+	if (!rss_type)
+		return;
+
+	skb_set_hash(skb, le32_to_cpu(rx_desc->wb.lower.hi_dword.rss),
+		     (IXGBE_RSS_L4_TYPES_MASK & (1ul << rss_type)) ?
+		     PKT_HASH_TYPE_L4 : PKT_HASH_TYPE_L3);
 }
 
 #ifdef IXGBE_FCOE
-- 
2.4.3
[net-next 01/14] clarify implementation of ethtool's get_ts_info op
From: Jacob Keller <jacob.e.kel...@intel.com>

This patch adds some clarification about the intended way to implement
both SIOCSHWTSTAMP and ethtool's get_ts_info. The HWTSTAMP API has
several Rx filters which are very specific, as well as more general
filters. The specific filters really only exist to support some broken
hardware which can't fully implement the generic filters. This patch
adds clarification that it is okay to support the specific filters in
SIOCSHWTSTAMP by upscaling them to the generic filters. In addition,
update the header for ethtool_ts_info to specify that drivers ought to
only report the filters they support without upscaling in this manner.

Signed-off-by: Jacob Keller <jacob.e.kel...@intel.com>
Tested-by: Phil Schmitt <phillip.j.schm...@intel.com>
Reviewed-by: Aaron Brown <aaron.f.br...@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
---
 Documentation/networking/timestamping.txt | 7 +++
 include/uapi/linux/ethtool.h              | 5 +
 2 files changed, 12 insertions(+)

diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt
index 5f09226..a977339 100644
--- a/Documentation/networking/timestamping.txt
+++ b/Documentation/networking/timestamping.txt
@@ -359,6 +359,13 @@ the requested fine-grained filtering for incoming packets is not
 supported, the driver may time stamp more than just the requested types
 of packets.
 
+Drivers are free to use a more permissive configuration than the requested
+configuration. It is expected that drivers should only implement directly the
+most generic mode that can be supported. For example if the hardware can
+support HWTSTAMP_FILTER_V2_EVENT, then it should generally always upscale
+HWTSTAMP_FILTER_V2_L2_SYNC_MESSAGE, and so forth, as HWTSTAMP_FILTER_V2_EVENT
+is more generic (and more useful to applications).
+
 A driver which supports hardware time stamping shall update the struct
 with the actual, possibly more permissive configuration. If the
 requested packets cannot be time stamped, then nothing should be
diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index cd67aec..cd16291 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -1093,6 +1093,11 @@ struct ethtool_sfeatures {
  * the 'hwtstamp_tx_types' and 'hwtstamp_rx_filters' enumeration values,
  * respectively.  For example, if the device supports HWTSTAMP_TX_ON,
  * then (1 << HWTSTAMP_TX_ON) in 'tx_types' will be set.
+ *
+ * Drivers should only report the filters they actually support without
+ * upscaling in the SIOCSHWTSTAMP ioctl. If the SIOCSHWSTAMP request for
+ * HWTSTAMP_FILTER_V1_SYNC is supported by HWTSTAMP_FILTER_V1_EVENT, then the
+ * driver should only report HWTSTAMP_FILTER_V1_EVENT in this op.
  */
 struct ethtool_ts_info {
 	__u32	cmd;
-- 
2.4.3
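To make the split between the two interfaces concrete, here is a small userspace sketch of the convention the patch documents: the ioctl side accepts specific filters and upscales them, while the get_ts_info side advertises only the generic ones. Everything prefixed `demo_`/`DEMO_` is hypothetical; the enum is an illustrative subset and does not reproduce the numbering of the real HWTSTAMP_FILTER_* values in <linux/net_tstamp.h>.

```c
#include <stdint.h>

/* Illustrative subset of Rx filters; numbering is local to this demo. */
enum demo_rx_filter {
	DEMO_FILTER_NONE,
	DEMO_FILTER_PTP_V1_L4_EVENT,
	DEMO_FILTER_PTP_V1_L4_SYNC,
	DEMO_FILTER_PTP_V1_L4_DELAY_REQ,
	DEMO_FILTER_PTP_V2_EVENT,
	DEMO_FILTER_PTP_V2_L2_SYNC,
	DEMO_FILTER_PTP_V2_L4_DELAY_REQ,
};

/* What a get_ts_info implementation would advertise: generic modes only. */
static const uint32_t demo_reported_filters =
	(1u << DEMO_FILTER_NONE) |
	(1u << DEMO_FILTER_PTP_V1_L4_EVENT) |
	(1u << DEMO_FILTER_PTP_V2_EVENT);

/* What a SIOCSHWTSTAMP handler would do with a request: map specific
 * filters to the generic filter that covers them. Returns the filter
 * actually configured, or -1 when the request cannot be satisfied.
 */
int demo_upscale_rx_filter(int requested)
{
	switch (requested) {
	case DEMO_FILTER_NONE:
		return DEMO_FILTER_NONE;
	case DEMO_FILTER_PTP_V1_L4_EVENT:
	case DEMO_FILTER_PTP_V1_L4_SYNC:
	case DEMO_FILTER_PTP_V1_L4_DELAY_REQ:
		return DEMO_FILTER_PTP_V1_L4_EVENT;
	case DEMO_FILTER_PTP_V2_EVENT:
	case DEMO_FILTER_PTP_V2_L2_SYNC:
	case DEMO_FILTER_PTP_V2_L4_DELAY_REQ:
		return DEMO_FILTER_PTP_V2_EVENT;
	default:
		return -1;
	}
}

/* Returns 1 if the filter is advertised by the demo get_ts_info mask. */
int demo_filter_is_reported(int filter)
{
	if (filter < 0 || filter >= 32)
		return 0;

	return (int)((demo_reported_filters >> filter) & 1u);
}
```

In this model a request for DEMO_FILTER_PTP_V2_L2_SYNC succeeds and comes back as DEMO_FILTER_PTP_V2_EVENT, yet the specific filter never appears in the advertised mask, which matches the behaviour the remaining patches in the series move the drivers toward.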
[net-next 14/14] igb: Fix i354 88E1112 PHY on RCC boards using AutoMediaDetect
From: Todd Fujinaka <todd.fujin...@intel.com>

e1000_check_for_link_media_swap() checks PHY page 0 for copper and PHY
page 1 for other (fiber) link. The switch back from page 1 to page 0
happened too soon, before e1000_check_for_link_82575() is executed, and
link on fiber (other) was never detected. Check for link while still on
the proper PHY page.

Signed-off-by: Todd Fujinaka <todd.fujin...@intel.com>
Tested-by: Aaron Brown <aaron.f.br...@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
---
 drivers/net/ethernet/intel/igb/e1000_82575.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_82575.c b/drivers/net/ethernet/intel/igb/e1000_82575.c
index b0182dd..d192569 100644
--- a/drivers/net/ethernet/intel/igb/e1000_82575.c
+++ b/drivers/net/ethernet/intel/igb/e1000_82575.c
@@ -139,10 +139,6 @@ static s32 igb_check_for_link_media_swap(struct e1000_hw *hw)
 	if (ret_val)
 		return ret_val;
 
-	/* reset page to 0 */
-	ret_val = phy->ops.write_reg(hw, E1000_M88E1112_PAGE_ADDR, 0);
-	if (ret_val)
-		return ret_val;
 
 	if (data & E1000_M88E1112_STATUS_LINK)
 		port = E1000_MEDIA_PORT_OTHER;
@@ -151,8 +147,20 @@ static s32 igb_check_for_link_media_swap(struct e1000_hw *hw)
 	if (port && (hw->dev_spec._82575.media_port != port)) {
 		hw->dev_spec._82575.media_port = port;
 		hw->dev_spec._82575.media_changed = true;
+	}
+
+	if (port == E1000_MEDIA_PORT_COPPER) {
+		/* reset page to 0 */
+		ret_val = phy->ops.write_reg(hw, E1000_M88E1112_PAGE_ADDR, 0);
+		if (ret_val)
+			return ret_val;
+		igb_check_for_link_82575(hw);
 	} else {
-		ret_val = igb_check_for_link_82575(hw);
+		igb_check_for_link_82575(hw);
+		/* reset page to 0 */
+		ret_val = phy->ops.write_reg(hw, E1000_M88E1112_PAGE_ADDR, 0);
+		if (ret_val)
+			return ret_val;
 	}
 
 	return 0;
-- 
2.4.3
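The ordering problem this patch fixes can be reproduced with a toy model. The sketch below is a deliberate simplification, not the driver's logic: `demo_phy` and the `demo_` functions are hypothetical, and it assumes the link check simply reads whichever page is currently selected, with page 0 standing for copper and page 1 for the other (fiber) port.

```c
#include <stdbool.h>

/* Toy PHY: one selected page and a link bit visible on each page. */
struct demo_phy {
	int page;		/* currently selected register page */
	bool link_on_page[2];	/* [0] copper link, [1] other/fiber link */
};

/* Stand-in for the link check: reads status from the selected page. */
bool demo_check_link(const struct demo_phy *phy)
{
	return phy->link_on_page[phy->page];
}

/* Old ordering: the page is reset to 0 before the link check, so the
 * check only ever sees the copper page, whatever port was detected.
 */
bool demo_check_link_old(struct demo_phy *phy, bool port_is_other)
{
	(void)port_is_other;
	phy->page = 0;
	return demo_check_link(phy);
}

/* New ordering: for copper, reset the page and then check; for the
 * other port, check while still on page 1 and reset afterwards.
 */
bool demo_check_link_new(struct demo_phy *phy, bool port_is_other)
{
	bool link;

	if (!port_is_other) {
		phy->page = 0;
		return demo_check_link(phy);
	}

	link = demo_check_link(phy);
	phy->page = 0;
	return link;
}
```

Starting from page 1 with link present only on the fiber side, the old ordering reports no link while the new one reports link and still leaves the PHY on page 0 afterwards.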
[net-next 05/14] igb: only report generic filters in get_ts_info
From: Jacob Keller <jacob.e.kel...@intel.com>

Signed-off-by: Jacob Keller <jacob.e.kel...@intel.com>
Tested-by: Aaron Brown <aaron.f.br...@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
---
 drivers/net/ethernet/intel/igb/igb_ethtool.c | 4
 1 file changed, 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c b/drivers/net/ethernet/intel/igb/igb_ethtool.c
index d5673eb..109cad9 100644
--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
@@ -2396,10 +2396,6 @@ static int igb_get_ts_info(struct net_device *dev,
 		info->rx_filters |=
 			(1 << HWTSTAMP_FILTER_PTP_V1_L4_SYNC) |
 			(1 << HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ) |
-			(1 << HWTSTAMP_FILTER_PTP_V2_L2_SYNC) |
-			(1 << HWTSTAMP_FILTER_PTP_V2_L4_SYNC) |
-			(1 << HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ) |
-			(1 << HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ) |
 			(1 << HWTSTAMP_FILTER_PTP_V2_EVENT);
 
 		return 0;
-- 
2.4.3
[net-next 02/14] freescale: remove incorrect copied comment
From: Jacob Keller <jacob.e.kel...@intel.com>

The comment in question is word-for-word copied from ixgbe, and clearly
has no meaning in freescale's driver. (it even says 'return an error'
when the code clearly does not). Remove the comment as it is obviously
incorrect and not applicable to the code as it is today.

CC: Pantelis Antoniou <pantelis.anton...@gmail.com>
CC: Vitaly Bordug <vbor...@ru.mvista.com>
CC: linuxppc-...@lists.ozlabs.org
Signed-off-by: Jacob Keller <jacob.e.kel...@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
---
 drivers/net/ethernet/freescale/fec_ptp.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec_ptp.c b/drivers/net/ethernet/freescale/fec_ptp.c
index a15663a..7a8386a 100644
--- a/drivers/net/ethernet/freescale/fec_ptp.c
+++ b/drivers/net/ethernet/freescale/fec_ptp.c
@@ -506,12 +506,6 @@ int fec_ptp_set(struct net_device *ndev, struct ifreq *ifr)
 		break;
 
 	default:
-		/*
-		 * register RXMTRL must be set in order to do V1 packets,
-		 * therefore it is not possible to time stamp both V1 Sync and
-		 * Delay_Req messages and hardware does not support
-		 * timestamping all packets => return error
-		 */
 		fep->hwts_rx_en = 1;
 		config.rx_filter = HWTSTAMP_FILTER_ALL;
 		break;
-- 
2.4.3
[net-next 10/14] ixgbevf: fold ixgbevf_pull_tail into ixgbevf_add_rx_frag
From: Alexander Duyck <alexander.h.du...@redhat.com>

This change folds the ixgbevf_pull_tail call into ixgbevf_add_rx_frag.
The advantage to doing this is that the fragment doesn't have to be
modified after it is added to the skb.

Signed-off-by: Alexander Duyck <alexander.h.du...@redhat.com>
Tested-by: Phil Schmitt <phillip.j.schm...@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 66 +++
 1 file changed, 19 insertions(+), 47 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index e71cdde..acfa051 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -649,46 +649,6 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_ring *rx_ring,
 }
 
 /**
- * ixgbevf_pull_tail - ixgbevf specific version of skb_pull_tail
- * @rx_ring: rx descriptor ring packet is being transacted on
- * @skb: pointer to current skb being adjusted
- *
- * This function is an ixgbevf specific version of __pskb_pull_tail.  The
- * main difference between this version and the original function is that
- * this function can make several assumptions about the state of things
- * that allow for significant optimizations versus the standard function.
- * As a result we can do things like drop a frag and maintain an accurate
- * truesize for the skb.
- **/
-static void ixgbevf_pull_tail(struct ixgbevf_ring *rx_ring,
-			      struct sk_buff *skb)
-{
-	struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[0];
-	unsigned char *va;
-	unsigned int pull_len;
-
-	/* it is valid to use page_address instead of kmap since we are
-	 * working with pages allocated out of the lomem pool per
-	 * alloc_page(GFP_ATOMIC)
-	 */
-	va = skb_frag_address(frag);
-
-	/* we need the header to contain the greater of either ETH_HLEN or
-	 * 60 bytes if the skb->len is less than 60 for skb_pad.
-	 */
-	pull_len = eth_get_headlen(va, IXGBEVF_RX_HDR_SIZE);
-
-	/* align pull length to size of long to optimize memcpy performance */
-	skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));
-
-	/* update all of the pointers */
-	skb_frag_size_sub(frag, pull_len);
-	frag->page_offset += pull_len;
-	skb->data_len -= pull_len;
-	skb->tail += pull_len;
-}
-
-/**
  * ixgbevf_cleanup_headers - Correct corrupted or empty headers
  * @rx_ring: rx descriptor ring packet is being transacted on
  * @rx_desc: pointer to the EOP Rx descriptor
@@ -721,10 +681,6 @@ static bool ixgbevf_cleanup_headers(struct ixgbevf_ring *rx_ring,
 		}
 	}
 
-	/* place header in linear portion of buffer */
-	if (skb_is_nonlinear(skb))
-		ixgbevf_pull_tail(rx_ring, skb);
-
 	/* if eth_skb_pad returns an error the skb was freed */
 	if (eth_skb_pad(skb))
 		return true;
@@ -789,16 +745,19 @@ static bool ixgbevf_add_rx_frag(struct ixgbevf_ring *rx_ring,
 				struct sk_buff *skb)
 {
 	struct page *page = rx_buffer->page;
+	unsigned char *va = page_address(page) + rx_buffer->page_offset;
 	unsigned int size = le16_to_cpu(rx_desc->wb.upper.length);
 #if (PAGE_SIZE < 8192)
 	unsigned int truesize = IXGBEVF_RX_BUFSZ;
 #else
 	unsigned int truesize = ALIGN(size, L1_CACHE_BYTES);
 #endif
+	unsigned int pull_len;
 
-	if ((size <= IXGBEVF_RX_HDR_SIZE) && !skb_is_nonlinear(skb)) {
-		unsigned char *va = page_address(page) + rx_buffer->page_offset;
+	if (unlikely(skb_is_nonlinear(skb)))
+		goto add_tail_frag;
 
+	if (likely(size <= IXGBEVF_RX_HDR_SIZE)) {
 		memcpy(__skb_put(skb, size), va, ALIGN(size, sizeof(long)));
 
 		/* page is not reserved, we can reuse buffer as is */
@@ -810,8 +769,21 @@ static bool ixgbevf_add_rx_frag(struct ixgbevf_ring *rx_ring,
 		return false;
 	}
 
+	/* we need the header to contain the greater of either ETH_HLEN or
+	 * 60 bytes if the skb->len is less than 60 for skb_pad.
+	 */
+	pull_len = eth_get_headlen(va, IXGBEVF_RX_HDR_SIZE);
+
+	/* align pull length to size of long to optimize memcpy performance */
+	memcpy(__skb_put(skb, pull_len), va, ALIGN(pull_len, sizeof(long)));
+
+	/* update all of the pointers */
+	va += pull_len;
+	size -= pull_len;
+
+add_tail_frag:
 	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
-			rx_buffer->page_offset, size, truesize);
+			(unsigned long)va & ~PAGE_MASK, size, truesize);
 
 	/* avoid re-using remote pages */
 	if (unlikely(ixgbevf_page_is_reserved(page)))
-- 
2.4.3
[net-next 07/14] siena: only report generic filters in get_ts_info
From: Jacob Keller <jacob.e.kel...@intel.com>

CC: Solarflare linux maintainers <linux-net-driv...@solarflare.com>
CC: Shradha Shah <ss...@solarflare.com>
Signed-off-by: Jacob Keller <jacob.e.kel...@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
---
 drivers/net/ethernet/sfc/siena.c | 6 +
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/sfc/siena.c b/drivers/net/ethernet/sfc/siena.c
index b323b91..b2f886d 100644
--- a/drivers/net/ethernet/sfc/siena.c
+++ b/drivers/net/ethernet/sfc/siena.c
@@ -1042,9 +1042,5 @@ const struct efx_nic_type siena_a0_nic_type = {
 	.max_rx_ip_filters = FR_BZ_RX_FILTER_TBL0_ROWS,
 	.hwtstamp_filters = (1 << HWTSTAMP_FILTER_NONE |
 			     1 << HWTSTAMP_FILTER_PTP_V1_L4_EVENT |
-			     1 << HWTSTAMP_FILTER_PTP_V1_L4_SYNC |
-			     1 << HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ |
-			     1 << HWTSTAMP_FILTER_PTP_V2_L4_EVENT |
-			     1 << HWTSTAMP_FILTER_PTP_V2_L4_SYNC |
-			     1 << HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ),
+			     1 << HWTSTAMP_FILTER_PTP_V2_L4_EVENT),
 };
-- 
2.4.3
[net-next 06/14] ixgbe: only report generic filters in get_ts_info
From: Jacob Keller <jacob.e.kel...@intel.com>

Signed-off-by: Jacob Keller <jacob.e.kel...@intel.com>
Tested-by: Phil Schmitt <phillip.j.schm...@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c | 8
 1 file changed, 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index ec7b232..f7aeb56 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -2938,14 +2938,6 @@ static int ixgbe_get_ts_info(struct net_device *dev,
 			(1 << HWTSTAMP_FILTER_NONE) |
 			(1 << HWTSTAMP_FILTER_PTP_V1_L4_SYNC) |
 			(1 << HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ) |
-			(1 << HWTSTAMP_FILTER_PTP_V2_L2_EVENT) |
-			(1 << HWTSTAMP_FILTER_PTP_V2_L4_EVENT) |
-			(1 << HWTSTAMP_FILTER_PTP_V2_SYNC) |
-			(1 << HWTSTAMP_FILTER_PTP_V2_L2_SYNC) |
-			(1 << HWTSTAMP_FILTER_PTP_V2_L4_SYNC) |
-			(1 << HWTSTAMP_FILTER_PTP_V2_DELAY_REQ) |
-			(1 << HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ) |
-			(1 << HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ) |
 			(1 << HWTSTAMP_FILTER_PTP_V2_EVENT);
 		break;
 	default:
-- 
2.4.3
[net-next 13/14] ixgbe: Don't report flow director filter's status
From: Fan Du <fan...@intel.com>

I want to disable this for two reasons:

1. No part of the driver actually checks the reported status
   (Alexander Duyck).
2. Of the hash values that could be reported to the stack:
     RSS                       - 32-bit hash value
     Perfect match fdir filter - 13-bit hash value
     Hash-based fdir filter    - 31-bit hash value
   The fdir filter might hash on masked tuples for the IP address, so
   its hash is still not desirable for this use.

So for now, just stick to the RSS 32-bit hash value.

Signed-off-by: Fan Du <fan...@intel.com>
Suggested-by: Alexander Duyck <alexander.h.du...@redhat.com>
Tested-by: Phil Schmitt <phillip.j.schm...@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_82599.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_82599.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_82599.c
index 6b87d96..b1e364d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_82599.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_82599.c
@@ -1394,14 +1394,12 @@ s32 ixgbe_init_fdir_perfect_82599(struct ixgbe_hw *hw, u32 fdirctrl)
 	/*
 	 * Continue setup of fdirctrl register bits:
 	 *  Turn perfect match filtering on
-	 *  Report hash in RSS field of Rx wb descriptor
 	 *  Initialize the drop queue
 	 *  Move the flexible bytes to use the ethertype - shift 6 words
 	 *  Set the maximum length per hash bucket to 0xA filters
 	 *  Send interrupt when 64 (0x4 * 16) filters are left
 	 */
 	fdirctrl |= IXGBE_FDIRCTRL_PERFECT_MATCH |
-		    IXGBE_FDIRCTRL_REPORT_STATUS |
 		    (IXGBE_FDIR_DROP_QUEUE << IXGBE_FDIRCTRL_DROP_Q_SHIFT) |
 		    (0x6 << IXGBE_FDIRCTRL_FLEX_SHIFT) |
 		    (0xA << IXGBE_FDIRCTRL_MAX_LENGTH_SHIFT) |
-- 
2.4.3
[net-next 03/14] bnx2x: only report most generic filters in get_ts_info
From: Jacob Keller jacob.e.kel...@intel.com

CC: Ariel Elior ariel.el...@qlogic.com
Signed-off-by: Jacob Keller jacob.e.kel...@intel.com
Signed-off-by: Jeff Kirsher jeffrey.t.kirs...@intel.com
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
index 76b9052..c783b57 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
@@ -3562,17 +3562,8 @@ static int bnx2x_get_ts_info(struct net_device *dev,

 	info->rx_filters = (1 << HWTSTAMP_FILTER_NONE) |
 			   (1 << HWTSTAMP_FILTER_PTP_V1_L4_EVENT) |
-			   (1 << HWTSTAMP_FILTER_PTP_V1_L4_SYNC) |
-			   (1 << HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ) |
 			   (1 << HWTSTAMP_FILTER_PTP_V2_L4_EVENT) |
-			   (1 << HWTSTAMP_FILTER_PTP_V2_L4_SYNC) |
-			   (1 << HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ) |
-			   (1 << HWTSTAMP_FILTER_PTP_V2_L2_EVENT) |
-			   (1 << HWTSTAMP_FILTER_PTP_V2_L2_SYNC) |
-			   (1 << HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ) |
-			   (1 << HWTSTAMP_FILTER_PTP_V2_EVENT) |
-			   (1 << HWTSTAMP_FILTER_PTP_V2_SYNC) |
-			   (1 << HWTSTAMP_FILTER_PTP_V2_DELAY_REQ);
+			   (1 << HWTSTAMP_FILTER_PTP_V2_EVENT);

 	info->tx_types = (1 << HWTSTAMP_TX_OFF)|(1 << HWTSTAMP_TX_ON);

--
2.4.3
Re: [PATCH net-next v2 1/5] net: don't reforward packets already forwarded by offload device
On 16/07/2015 10:04, sfel...@gmail.com wrote:
> From: Scott Feldman sfel...@gmail.com
>
> Just before queuing skb for xmit on port, check if skb has been marked by
> switchdev port driver as already forwarded by device. If so, drop skb. A
> non-zero skb->offload_fwd_mark field is set by the switchdev port
> driver/device on ingress to indicate the skb has already been forwarded by
> the device to egress ports with matching dev->skb_mark. The switchdev port
> driver would assign a non-zero dev->skb_mark for each device port netdev
> during registration, for example.
>
> Signed-off-by: Scott Feldman sfel...@gmail.com
> ---
>  include/linux/netdevice.h |  6 ++++++
>  include/linux/skbuff.h    | 11 ++++++++++-
>  net/core/dev.c            | 10 ++++++++++
>  3 files changed, 26 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 45cfd79..8364f29 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1456,6 +1456,8 @@ enum netdev_priv_flags {
>   *
>   *	@xps_maps:	XXX: need comments on this one
>   *
> + *	@offload_fwd_mark:	Offload device fwding mark
> + *
>   *	@trans_start:		Time (in jiffies) of last Tx
>   *	@watchdog_timeo:	Represents the timeout that is used by
>   *				the watchdog ( see dev_watchdog() )
> @@ -1697,6 +1699,10 @@ struct net_device {
>  	struct xps_dev_maps __rcu *xps_maps;
>  #endif
>
> +#ifdef CONFIG_NET_SWITCHDEV
> +	u32			offload_fwd_mark;
> +#endif
> +
>  	/* These may be needed for future network-power-down code. */
>
>  	/*
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index d6cdd6e..2edcf50 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -506,6 +506,7 @@ static inline u32 skb_mstamp_us_delta(const struct skb_mstamp *t1,
>   *	@no_fcs:  Request NIC to treat last 4 bytes as Ethernet FCS
>   *	@napi_id: id of the NAPI struct this skb came from
>   *	@secmark: security marking
> + *	@offload_fwd_mark: fwding offload mark
>   *	@mark: Generic packet mark
>   *	@vlan_proto: vlan encapsulation protocol
>   *	@vlan_tci: vlan tag control information
> @@ -650,9 +651,17 @@ struct sk_buff {
>  		unsigned int	sender_cpu;
>  	};
>  #endif
> +	union {
>  #ifdef CONFIG_NETWORK_SECMARK
> -	__u32			secmark;
> +		__u32		secmark;
> +#endif
> +#ifdef CONFIG_NET_SWITCHDEV
> +		__u32		offload_fwd_mark;
>  #endif
> +	};
> +
> +	union {};
> +

Everybody seems to ack.
For my knowledge, why did you put this empty union?

Thank you,
Nicolas
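For context, the check described in the commit message can be sketched in userspace as below. The net/core/dev.c hunk is not quoted in this reply, so the exact condition is an assumption derived from the description only. struct pkt and struct port are minimal stand-ins for struct sk_buff and struct net_device, and already_forwarded() is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal stand-ins for struct sk_buff and struct net_device. */
struct pkt  { uint32_t offload_fwd_mark; };
struct port { uint32_t offload_fwd_mark; };

/* Assumed check: the packet is dropped before xmit only when it carries
 * a non-zero mark that matches the mark of the egress port. A zero mark
 * means the device did not forward the packet itself. */
static bool already_forwarded(const struct pkt *p, const struct port *out)
{
	return p->offload_fwd_mark != 0 &&
	       p->offload_fwd_mark == out->offload_fwd_mark;
}
```

The non-zero test matters: without it, an unmarked packet would match every port whose mark was never assigned.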
Re: [PATCH v2 net-next 1/3] rhashtable: Allow lookup function to have compare function agument
On 07/14/15 at 04:45pm, Tom Herbert wrote:
> Added rhashtable_lookup_fast_cmpfn which does a lookup in an rhash table
> with the compare function being taken from an argument. This allows
> different compare functions to be used on the same table.
>
> Signed-off-by: Tom Herbert t...@herbertland.com

Acked-by: Thomas Graf tg...@suug.ch
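The idea of passing the compare function at lookup time can be shown with a toy chained hash table. This is only a userspace sketch of the concept and does not use the rhashtable API; every name in it (struct table, table_lookup_cmpfn(), cmp_exact(), and so on) is hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define NBUCKETS 8

struct node {
	struct node *next;
	const char *key;
	int value;
};

struct table {
	struct node *buckets[NBUCKETS];
};

/* Compare callback: returns 0 on match, like memcmp(). */
typedef int (*cmp_fn)(const void *lookup_key, const struct node *n);

static unsigned int hash_str(const char *s)
{
	unsigned int h = 5381;

	while (*s)
		h = h * 33 + (unsigned char)*s++;
	return h % NBUCKETS;
}

static void table_insert(struct table *t, struct node *n)
{
	unsigned int b = hash_str(n->key);

	n->next = t->buckets[b];
	t->buckets[b] = n;
}

/* Lookup that takes the compare function as an argument instead of
 * using one fixed at table creation time. */
static struct node *table_lookup_cmpfn(struct table *t, const char *key,
				       cmp_fn cmp)
{
	struct node *n;

	for (n = t->buckets[hash_str(key)]; n; n = n->next)
		if (cmp(key, n) == 0)
			return n;
	return NULL;
}

/* Plain key comparison. */
static int cmp_exact(const void *key, const struct node *n)
{
	return strcmp(key, n->key);
}

/* Stricter comparison: key must match and value must be positive. */
static int cmp_exact_positive(const void *key, const struct node *n)
{
	return strcmp(key, n->key) != 0 || n->value <= 0;
}

/* Usage sketch: one table, one entry, two different compare functions.
 * Returns 1 if the entry is found with the given compare function. */
static int demo_found(cmp_fn cmp)
{
	struct table t = { { NULL } };
	struct node a = { NULL, "eth0", -1 };

	table_insert(&t, &a);
	return table_lookup_cmpfn(&t, "eth0", cmp) != NULL;
}
```

The same entry is found with cmp_exact() and rejected by cmp_exact_positive(), which is the property the patch description is after: different match criteria on one table.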
Re: [PATCH v2 net-next 2/3] rhashtable: Add a function for in order insertion and lookup in buckets
On 07/15/15 at 12:46pm, Tom Herbert wrote:
> On Tue, Jul 14, 2015 at 10:54 PM, Herbert Xu wrote:
> > The memory cost is merely 8 bytes per local port, is it really
> > too much?
>
> Okay, it looks like there is already an additional hlist_node in
> skc_common that can be used for a secondary hash. It's conceivable
> this can be generalized and used in the TCP listeners also in
> combination with rhashtable.

Are you dropping this series entirely then?
Re: [PATCH] net: wireless: reduce log level of CRDA related messages
On Thu, 2015-07-09 at 15:35 +0200, Thomas Petazzoni wrote:
> With a basic Linux userspace, the message "Calling CRDA to update
> world regulatory domain" appears 10 times after boot every second or
> so, followed by a final "Exceeded CRDA call max attempts. Not calling
> CRDA". For those of us not having the corresponding userspace parts,
> having those messages repeatedly displayed at boot time is a bit
> annoying, so this commit reduces their log level to pr_debug().

Applied.

johannes
[PATCH net-next 22/22] openvswitch: Use regular VXLAN net_device device
This gets rid of all OVS specific VXLAN code in the receive and transmit path by using a VXLAN net_device to represent the vport. Only a small shim layer remains which takes care of handling the VXLAN specific OVS Netlink configuration. Unexports vxlan_sock_add(), vxlan_sock_release(), vxlan_xmit_skb() since they are no longer needed. Signed-off-by: Thomas Graf tg...@suug.ch Signed-off-by: Pravin B Shelar pshe...@nicira.com --- drivers/net/vxlan.c| 242 +++ include/net/rtnetlink.h| 1 + include/net/vxlan.h| 24 +-- net/core/rtnetlink.c | 26 ++-- net/openvswitch/Kconfig| 12 -- net/openvswitch/Makefile | 1 - net/openvswitch/flow_netlink.c | 6 +- net/openvswitch/vport-netdev.c | 201 - net/openvswitch/vport-vxlan.c | 322 - net/openvswitch/vport-vxlan.h | 11 -- 10 files changed, 339 insertions(+), 507 deletions(-) delete mode 100644 net/openvswitch/vport-vxlan.c delete mode 100644 net/openvswitch/vport-vxlan.h diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index 5ae6c0c..76466ef 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -75,6 +75,9 @@ static struct rtnl_link_ops vxlan_link_ops; static const u8 all_zeros_mac[ETH_ALEN]; +static struct vxlan_sock *vxlan_sock_add(struct net *net, __be16 port, +bool no_share, u32 flags); + /* per-network namespace private data for this module */ struct vxlan_net { struct list_head vxlan_list; @@ -1027,7 +1030,7 @@ static bool vxlan_group_used(struct vxlan_net *vn, struct vxlan_dev *dev) return false; } -void vxlan_sock_release(struct vxlan_sock *vs) +static void vxlan_sock_release(struct vxlan_sock *vs) { struct sock *sk = vs-sock-sk; struct net *net = sock_net(sk); @@ -1043,7 +1046,6 @@ void vxlan_sock_release(struct vxlan_sock *vs) queue_work(vxlan_wq, vs-del_work); } -EXPORT_SYMBOL_GPL(vxlan_sock_release); /* Update multicast group membership when first VNI on * multicast address is brought up @@ -1126,6 +1128,102 @@ static struct vxlanhdr *vxlan_remcsum(struct sk_buff *skb, struct vxlanhdr *vh, return vh; } 
+static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb, + struct vxlan_metadata *md, u32 vni, + struct metadata_dst *tun_dst) +{ + struct iphdr *oip = NULL; + struct ipv6hdr *oip6 = NULL; + struct vxlan_dev *vxlan; + struct pcpu_sw_netstats *stats; + union vxlan_addr saddr; + int err = 0; + union vxlan_addr *remote_ip; + + /* For flow based devices, map all packets to VNI 0 */ + if (vs-flags VXLAN_F_FLOW_BASED) + vni = 0; + + /* Is this VNI defined? */ + vxlan = vxlan_vs_find_vni(vs, vni); + if (!vxlan) + goto drop; + + remote_ip = vxlan-default_dst.remote_ip; + skb_reset_mac_header(skb); + skb_scrub_packet(skb, !net_eq(vxlan-net, dev_net(vxlan-dev))); + skb-protocol = eth_type_trans(skb, vxlan-dev); + skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); + + /* Ignore packet loops (and multicast echo) */ + if (ether_addr_equal(eth_hdr(skb)-h_source, vxlan-dev-dev_addr)) + goto drop; + + /* Re-examine inner Ethernet packet */ + if (remote_ip-sa.sa_family == AF_INET) { + oip = ip_hdr(skb); + saddr.sin.sin_addr.s_addr = oip-saddr; + saddr.sa.sa_family = AF_INET; +#if IS_ENABLED(CONFIG_IPV6) + } else { + oip6 = ipv6_hdr(skb); + saddr.sin6.sin6_addr = oip6-saddr; + saddr.sa.sa_family = AF_INET6; +#endif + } + + if (tun_dst) { + skb_dst_set(skb, (struct dst_entry *)tun_dst); + tun_dst = NULL; + } + + if ((vxlan-flags VXLAN_F_LEARN) + vxlan_snoop(skb-dev, saddr, eth_hdr(skb)-h_source)) + goto drop; + + skb_reset_network_header(skb); + /* In flow-based mode, GBP is carried in dst_metadata */ + if (!(vs-flags VXLAN_F_FLOW_BASED)) + skb-mark = md-gbp; + + if (oip6) + err = IP6_ECN_decapsulate(oip6, skb); + if (oip) + err = IP_ECN_decapsulate(oip, skb); + + if (unlikely(err)) { + if (log_ecn_error) { + if (oip6) + net_info_ratelimited(non-ECT from %pI6\n, +oip6-saddr); + if (oip) + net_info_ratelimited(non-ECT from %pI4 with TOS=%#x\n, +oip-saddr, oip-tos); + } + if (err 1) { + ++vxlan-dev-stats.rx_frame_errors; +
[PATCH net-next 20/22] openvswitch: Move dev pointer into vport itself
This is the first step in representing all OVS vports as regular struct net_devices. Move the net_device pointer into the vport structure itself to get rid of struct vport_netdev. Signed-off-by: Thomas Graf tg...@suug.ch Signed-off-by: Pravin B Shelar pshe...@nicira.com --- net/openvswitch/datapath.c | 7 +-- net/openvswitch/dp_notify.c | 5 +-- net/openvswitch/vport-internal_dev.c | 37 +++- net/openvswitch/vport-netdev.c | 86 net/openvswitch/vport-netdev.h | 12 - net/openvswitch/vport.h | 3 +- 6 files changed, 59 insertions(+), 91 deletions(-) diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c index 0208210..19df28e 100644 --- a/net/openvswitch/datapath.c +++ b/net/openvswitch/datapath.c @@ -188,7 +188,7 @@ static int get_dpifindex(const struct datapath *dp) local = ovs_vport_rcu(dp, OVSP_LOCAL); if (local) - ifindex = netdev_vport_priv(local)-dev-ifindex; + ifindex = local-dev-ifindex; else ifindex = 0; @@ -2219,13 +2219,10 @@ static void __net_exit list_vports_from_net(struct net *net, struct net *dnet, struct vport *vport; hlist_for_each_entry(vport, dp-ports[i], dp_hash_node) { - struct netdev_vport *netdev_vport; - if (vport-ops-type != OVS_VPORT_TYPE_INTERNAL) continue; - netdev_vport = netdev_vport_priv(vport); - if (dev_net(netdev_vport-dev) == dnet) + if (dev_net(vport-dev) == dnet) list_add(vport-detach_list, head); } } diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c index 2c631fe..a7a80a6 100644 --- a/net/openvswitch/dp_notify.c +++ b/net/openvswitch/dp_notify.c @@ -58,13 +58,10 @@ void ovs_dp_notify_wq(struct work_struct *work) struct hlist_node *n; hlist_for_each_entry_safe(vport, n, dp-ports[i], dp_hash_node) { - struct netdev_vport *netdev_vport; - if (vport-ops-type != OVS_VPORT_TYPE_NETDEV) continue; - netdev_vport = netdev_vport_priv(vport); - if (!(netdev_vport-dev-priv_flags IFF_OVS_DATAPATH)) + if (!(vport-dev-priv_flags IFF_OVS_DATAPATH)) dp_detach_port_notify(vport); } } diff --git 
a/net/openvswitch/vport-internal_dev.c b/net/openvswitch/vport-internal_dev.c index 6a55f71..a2c205d 100644 --- a/net/openvswitch/vport-internal_dev.c +++ b/net/openvswitch/vport-internal_dev.c @@ -156,49 +156,44 @@ static void do_setup(struct net_device *netdev) static struct vport *internal_dev_create(const struct vport_parms *parms) { struct vport *vport; - struct netdev_vport *netdev_vport; struct internal_dev *internal_dev; int err; - vport = ovs_vport_alloc(sizeof(struct netdev_vport), - ovs_internal_vport_ops, parms); + vport = ovs_vport_alloc(0, ovs_internal_vport_ops, parms); if (IS_ERR(vport)) { err = PTR_ERR(vport); goto error; } - netdev_vport = netdev_vport_priv(vport); - - netdev_vport-dev = alloc_netdev(sizeof(struct internal_dev), -parms-name, NET_NAME_UNKNOWN, -do_setup); - if (!netdev_vport-dev) { + vport-dev = alloc_netdev(sizeof(struct internal_dev), + parms-name, NET_NAME_UNKNOWN, do_setup); + if (!vport-dev) { err = -ENOMEM; goto error_free_vport; } - dev_net_set(netdev_vport-dev, ovs_dp_get_net(vport-dp)); - internal_dev = internal_dev_priv(netdev_vport-dev); + dev_net_set(vport-dev, ovs_dp_get_net(vport-dp)); + internal_dev = internal_dev_priv(vport-dev); internal_dev-vport = vport; /* Restrict bridge port to current netns. */ if (vport-port_no == OVSP_LOCAL) - netdev_vport-dev-features |= NETIF_F_NETNS_LOCAL; + vport-dev-features |= NETIF_F_NETNS_LOCAL; rtnl_lock(); - err = register_netdevice(netdev_vport-dev); + err = register_netdevice(vport-dev); if (err) goto error_free_netdev; - dev_set_promiscuity(netdev_vport-dev, 1); + dev_set_promiscuity(vport-dev, 1); rtnl_unlock(); - netif_start_queue(netdev_vport-dev); + netif_start_queue(vport-dev); return
[PATCH net-next 19/22] openvswitch: Make tunnel set action attach a metadata dst
Utilize the new metadata dst to attach encapsulation instructions to the skb. The existing egress_tun_info via the OVS_CB() is left in place until all tunnel vports have been converted to the new method. Signed-off-by: Thomas Graf tg...@suug.ch Signed-off-by: Pravin B Shelar pshe...@nicira.com --- net/openvswitch/actions.c | 10 ++- net/openvswitch/datapath.c | 8 +++--- net/openvswitch/flow.h | 5 net/openvswitch/flow_netlink.c | 64 +- net/openvswitch/flow_netlink.h | 1 + net/openvswitch/flow_table.c | 4 ++- 6 files changed, 79 insertions(+), 13 deletions(-) diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c index 27c1687..cf04c2f 100644 --- a/net/openvswitch/actions.c +++ b/net/openvswitch/actions.c @@ -733,7 +733,15 @@ static int execute_set_action(struct sk_buff *skb, { /* Only tunnel set execution is supported without a mask. */ if (nla_type(a) == OVS_KEY_ATTR_TUNNEL_INFO) { - OVS_CB(skb)-egress_tun_info = nla_data(a); + struct ovs_tunnel_info *tun = nla_data(a); + + skb_dst_drop(skb); + dst_hold((struct dst_entry *)tun-tun_dst); + skb_dst_set(skb, (struct dst_entry *)tun-tun_dst); + + /* FIXME: Remove when all vports have been converted */ + OVS_CB(skb)-egress_tun_info = tun-tun_dst-u.tun_info; + return 0; } diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c index ff8c4a4..0208210 100644 --- a/net/openvswitch/datapath.c +++ b/net/openvswitch/datapath.c @@ -1018,7 +1018,7 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info) } ovs_unlock(); - ovs_nla_free_flow_actions(old_acts); + ovs_nla_free_flow_actions_rcu(old_acts); ovs_flow_free(new_flow, false); } @@ -1030,7 +1030,7 @@ err_unlock_ovs: ovs_unlock(); kfree_skb(reply); err_kfree_acts: - kfree(acts); + ovs_nla_free_flow_actions(acts); err_kfree_flow: ovs_flow_free(new_flow, false); error: @@ -1157,7 +1157,7 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info) if (reply) ovs_notify(dp_flow_genl_family, reply, info); if (old_acts) - 
ovs_nla_free_flow_actions(old_acts); + ovs_nla_free_flow_actions_rcu(old_acts); return 0; @@ -1165,7 +1165,7 @@ err_unlock_ovs: ovs_unlock(); kfree_skb(reply); err_kfree_acts: - kfree(acts); + ovs_nla_free_flow_actions(acts); error: return error; } diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h index cadc6c5..b62cdb3 100644 --- a/net/openvswitch/flow.h +++ b/net/openvswitch/flow.h @@ -33,6 +33,7 @@ #include linux/flex_array.h #include net/inet_ecn.h #include net/ip_tunnels.h +#include net/dst_metadata.h struct sk_buff; @@ -45,6 +46,10 @@ struct sk_buff; #define TUN_METADATA_OPTS(flow_key, opt_len) \ ((void *)((flow_key)-tun_opts + TUN_METADATA_OFFSET(opt_len))) +struct ovs_tunnel_info { + struct metadata_dst *tun_dst; +}; + #define OVS_SW_FLOW_KEY_METADATA_SIZE \ (offsetof(struct sw_flow_key, recirc_id) + \ FIELD_SIZEOF(struct sw_flow_key, recirc_id)) diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c index ecfa530..e7906df 100644 --- a/net/openvswitch/flow_netlink.c +++ b/net/openvswitch/flow_netlink.c @@ -1548,11 +1548,48 @@ static struct sw_flow_actions *nla_alloc_flow_actions(int size, bool log) return sfa; } +static void ovs_nla_free_set_action(const struct nlattr *a) +{ + const struct nlattr *ovs_key = nla_data(a); + struct ovs_tunnel_info *ovs_tun; + + switch (nla_type(ovs_key)) { + case OVS_KEY_ATTR_TUNNEL_INFO: + ovs_tun = nla_data(ovs_key); + dst_release((struct dst_entry *)ovs_tun-tun_dst); + break; + } +} + +void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts) +{ + const struct nlattr *a; + int rem; + + if (!sf_acts) + return; + + nla_for_each_attr(a, sf_acts-actions, sf_acts-actions_len, rem) { + switch (nla_type(a)) { + case OVS_ACTION_ATTR_SET: + ovs_nla_free_set_action(a); + break; + } + } + + kfree(sf_acts); +} + +static void __ovs_nla_free_flow_actions(struct rcu_head *head) +{ + ovs_nla_free_flow_actions(container_of(head, struct sw_flow_actions, rcu)); +} + /* Schedules 'sf_acts' to be freed 
after the next RCU grace period. * The caller must hold rcu_read_lock for this to be sensible. */ -void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts) +void ovs_nla_free_flow_actions_rcu(struct sw_flow_actions *sf_acts) { -
[PATCH net-next 08/22] mpls: export mpls functions for use by mpls iptunnels
From: Roopa Prabhu ro...@cumulusnetworks.com

Signed-off-by: Roopa Prabhu ro...@cumulusnetworks.com
---
 net/mpls/af_mpls.c  | 11 ++++++++---
 net/mpls/internal.h |  9 +++++++--
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 1f93a59..6e66911 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -58,10 +58,11 @@ static inline struct mpls_dev *mpls_dev_get(const struct net_device *dev)
 	return rcu_dereference_rtnl(dev->mpls_ptr);
 }

-static bool mpls_output_possible(const struct net_device *dev)
+bool mpls_output_possible(const struct net_device *dev)
 {
 	return dev && (dev->flags & IFF_UP) && netif_carrier_ok(dev);
 }
+EXPORT_SYMBOL_GPL(mpls_output_possible);

 static unsigned int mpls_rt_header_size(const struct mpls_route *rt)
 {
@@ -69,13 +70,14 @@ static unsigned int mpls_rt_header_size(const struct mpls_route *rt)
 	return rt->rt_labels * sizeof(struct mpls_shim_hdr);
 }

-static unsigned int mpls_dev_mtu(const struct net_device *dev)
+unsigned int mpls_dev_mtu(const struct net_device *dev)
 {
 	/* The amount of data the layer 2 frame can hold */
 	return dev->mtu;
 }
+EXPORT_SYMBOL_GPL(mpls_dev_mtu);

-static bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned int mtu)
+bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned int mtu)
 {
 	if (skb->len <= mtu)
 		return false;
@@ -85,6 +87,7 @@ static bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned int mtu)

 	return true;
 }
+EXPORT_SYMBOL_GPL(mpls_pkt_too_big);

 static bool mpls_egress(struct mpls_route *rt, struct sk_buff *skb,
 			struct mpls_entry_decoded dec)
@@ -626,6 +629,7 @@ int nla_put_labels(struct sk_buff *skb, int attrtype,

 	return 0;
 }
+EXPORT_SYMBOL_GPL(nla_put_labels);

 int nla_get_labels(const struct nlattr *nla,
 		   u32 max_labels, u32 *labels, u32 label[])
@@ -671,6 +675,7 @@ int nla_get_labels(const struct nlattr *nla,
 	*labels = nla_labels;
 	return 0;
 }
+EXPORT_SYMBOL_GPL(nla_get_labels);

 static int rtm_to_route_config(struct sk_buff *skb, struct nlmsghdr *nlh,
 			       struct mpls_route_config *cfg)
diff --git a/net/mpls/internal.h b/net/mpls/internal.h
index 8cabeb5..2681a4b 100644
--- a/net/mpls/internal.h
+++ b/net/mpls/internal.h
@@ -50,7 +50,12 @@ static inline struct mpls_entry_decoded mpls_entry_decode(struct mpls_shim_hdr *
 	return result;
 }

-int nla_put_labels(struct sk_buff *skb, int attrtype, u8 labels, const u32 label[]);
-int nla_get_labels(const struct nlattr *nla, u32 max_labels, u32 *labels, u32 label[]);
+int nla_put_labels(struct sk_buff *skb, int attrtype, u8 labels,
+		   const u32 label[]);
+int nla_get_labels(const struct nlattr *nla, u32 max_labels, u32 *labels,
+		   u32 label[]);
+bool mpls_output_possible(const struct net_device *dev);
+unsigned int mpls_dev_mtu(const struct net_device *dev);
+bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned int mtu);

 #endif /* MPLS_INTERNAL_H */
--
2.4.3
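As a rough guide to what a caller of the newly exported helpers would be checking, here is a userspace sketch of the two predicates. struct dev and struct pkt are stand-ins for struct net_device and struct sk_buff, FLAG_UP stands in for IFF_UP, and the carrier_ok field replaces netif_carrier_ok(); all of these are assumptions made for illustration. Only the length check visible in the diff context is modeled, and whatever sits in the lines elided between the hunks is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

#define FLAG_UP 0x1u	/* stand-in for IFF_UP */

struct dev {
	unsigned int flags;
	bool carrier_ok;	/* stand-in for netif_carrier_ok(dev) */
	unsigned int mtu;
};

struct pkt {
	unsigned int len;
};

/* Output is possible only on an existing device that is up and has
 * carrier. */
static bool output_possible(const struct dev *dev)
{
	return dev && (dev->flags & FLAG_UP) && dev->carrier_ok;
}

/* Simplified: a packet is too big when its length exceeds the MTU.
 * Checks in the elided part of the original function are not modeled. */
static bool pkt_too_big(const struct pkt *p, unsigned int mtu)
{
	return p->len > mtu;
}
```

A tunnel output path would typically test output_possible() first and then pkt_too_big() against the device MTU before pushing labels.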
[PATCH net-next 21/22] openvswitch: Abstract vport name through ovs_vport_name()
This allows getting rid of the get_name() vport op later on.

Signed-off-by: Thomas Graf tg...@suug.ch
---
 net/openvswitch/datapath.c           | 4 ++--
 net/openvswitch/vport-internal_dev.c | 1 -
 net/openvswitch/vport-netdev.c       | 6 ------
 net/openvswitch/vport-netdev.h       | 1 -
 net/openvswitch/vport.c              | 4 ++--
 net/openvswitch/vport.h              | 5 +++++
 6 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 19df28e..ffe984f 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -176,7 +176,7 @@ static inline struct datapath *get_dp(struct net *net, int dp_ifindex)
 const char *ovs_dp_name(const struct datapath *dp)
 {
 	struct vport *vport = ovs_vport_ovsl_rcu(dp, OVSP_LOCAL);
-	return vport->ops->get_name(vport);
+	return ovs_vport_name(vport);
 }

 static int get_dpifindex(const struct datapath *dp)
@@ -1800,7 +1800,7 @@ static int ovs_vport_cmd_fill_info(struct vport *vport, struct sk_buff *skb,
 	if (nla_put_u32(skb, OVS_VPORT_ATTR_PORT_NO, vport->port_no) ||
 	    nla_put_u32(skb, OVS_VPORT_ATTR_TYPE, vport->ops->type) ||
 	    nla_put_string(skb, OVS_VPORT_ATTR_NAME,
-			   vport->ops->get_name(vport)))
+			   ovs_vport_name(vport)))
 		goto nla_put_failure;

 	ovs_vport_get_stats(vport, &vport_stats);
diff --git a/net/openvswitch/vport-internal_dev.c b/net/openvswitch/vport-internal_dev.c
index a2c205d..c058bbf 100644
--- a/net/openvswitch/vport-internal_dev.c
+++ b/net/openvswitch/vport-internal_dev.c
@@ -242,7 +242,6 @@ static struct vport_ops ovs_internal_vport_ops = {
 	.type		= OVS_VPORT_TYPE_INTERNAL,
 	.create		= internal_dev_create,
 	.destroy	= internal_dev_destroy,
-	.get_name	= ovs_netdev_get_name,
 	.send		= internal_dev_recv,
 };

diff --git a/net/openvswitch/vport-netdev.c b/net/openvswitch/vport-netdev.c
index 1c96966..e682bdc 100644
--- a/net/openvswitch/vport-netdev.c
+++ b/net/openvswitch/vport-netdev.c
@@ -171,11 +171,6 @@ static void netdev_destroy(struct vport *vport)
 	call_rcu(&vport->rcu, free_port_rcu);
 }

-const char *ovs_netdev_get_name(const struct vport *vport)
-{
-	return vport->dev->name;
-}
-
 static unsigned int packet_length(const struct sk_buff *skb)
 {
 	unsigned int length = skb->len - ETH_HLEN;
@@ -223,7 +218,6 @@ static struct vport_ops ovs_netdev_vport_ops = {
 	.type		= OVS_VPORT_TYPE_NETDEV,
 	.create		= netdev_create,
 	.destroy	= netdev_destroy,
-	.get_name	= ovs_netdev_get_name,
 	.send		= netdev_send,
 };

diff --git a/net/openvswitch/vport-netdev.h b/net/openvswitch/vport-netdev.h
index 1c52aed..684fb88 100644
--- a/net/openvswitch/vport-netdev.h
+++ b/net/openvswitch/vport-netdev.h
@@ -26,7 +26,6 @@

 struct vport *ovs_netdev_get_vport(struct net_device *dev);

-const char *ovs_netdev_get_name(const struct vport *);
 void ovs_netdev_detach_dev(struct vport *);

 int __init ovs_netdev_init(void);
diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index af23ba0..d14f594 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -113,7 +113,7 @@ struct vport *ovs_vport_locate(const struct net *net, const char *name)
 	struct vport *vport;

 	hlist_for_each_entry_rcu(vport, bucket, hash_node)
-		if (!strcmp(name, vport->ops->get_name(vport)) &&
+		if (!strcmp(name, ovs_vport_name(vport)) &&
 		    net_eq(ovs_dp_get_net(vport->dp), net))
 			return vport;

@@ -226,7 +226,7 @@ struct vport *ovs_vport_add(const struct vport_parms *parms)
 	}

 	bucket = hash_bucket(ovs_dp_get_net(vport->dp),
-			     vport->ops->get_name(vport));
+			     ovs_vport_name(vport));
 	hlist_add_head_rcu(&vport->hash_node, bucket);
 	return vport;
 }
diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h
index e05ec68..1a689c2 100644
--- a/net/openvswitch/vport.h
+++ b/net/openvswitch/vport.h
@@ -237,6 +237,11 @@ static inline void ovs_skb_postpush_rcsum(struct sk_buff *skb,
 	skb->csum = csum_add(skb->csum, csum_partial(start, len, 0));
 }

+static inline const char *ovs_vport_name(struct vport *vport)
+{
+	return vport->dev ? vport->dev->name : vport->ops->get_name(vport);
+}
+
 int ovs_vport_ops_register(struct vport_ops *ops);
 void ovs_vport_ops_unregister(struct vport_ops *ops);

--
2.4.3
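The fallback in ovs_vport_name() can be tried out in isolation with the sketch below. The structs are trimmed-down stand-ins for struct vport, struct vport_ops and struct net_device, and vport_name(), legacy_name() and the demo objects are hypothetical names used only for illustration.

```c
#include <stddef.h>

struct vport;

struct netdev {
	const char *name;
};

struct vport_ops {
	const char *(*get_name)(const struct vport *);
};

struct vport {
	struct netdev *dev;		/* NULL for vports without a netdev */
	const struct vport_ops *ops;
};

/* Prefer the netdev name; fall back to the per-type callback only when
 * the vport has no netdev attached. */
static const char *vport_name(const struct vport *vport)
{
	return vport->dev ? vport->dev->name : vport->ops->get_name(vport);
}

/* Callback for a vport type that has not been converted yet. */
static const char *legacy_name(const struct vport *vport)
{
	(void)vport;
	return "legacy0";
}

/* Demo objects: one vport backed by a netdev, one without. */
static const struct vport_ops legacy_ops = { legacy_name };
static struct netdev demo_dev = { "eth0" };
static const struct vport converted = { &demo_dev, &legacy_ops };
static const struct vport unconverted = { NULL, &legacy_ops };
```

Once every vport type carries a netdev, the second branch is never taken, which is what makes removing the get_name() callback possible later.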
[PATCH net-next 17/22] fib: Add fib rule match on tunnel id
This add the ability to select a routing table based on the tunnel id which allows to maintain separate routing tables for each virtual tunnel network. ip rule add from all tunnel-id 100 lookup 100 ip rule add from all tunnel-id 200 lookup 200 A new static key controls the collection of metadata at tunnel level upon demand. Signed-off-by: Thomas Graf tg...@suug.ch --- drivers/net/vxlan.c| 3 ++- include/net/fib_rules.h| 1 + include/net/ip_tunnels.h | 11 +++ include/uapi/linux/fib_rules.h | 2 +- net/core/fib_rules.c | 24 ++-- net/ipv4/ip_tunnel_core.c | 16 6 files changed, 53 insertions(+), 4 deletions(-) diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index a350afb..23378db 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -143,7 +143,8 @@ static struct workqueue_struct *vxlan_wq; static inline bool vxlan_collect_metadata(struct vxlan_sock *vs) { - return vs-flags VXLAN_F_COLLECT_METADATA; + return vs-flags VXLAN_F_COLLECT_METADATA || + ip_tunnel_collect_metadata(); } #if IS_ENABLED(CONFIG_IPV6) diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h index 903a55e..4e8f804 100644 --- a/include/net/fib_rules.h +++ b/include/net/fib_rules.h @@ -19,6 +19,7 @@ struct fib_rule { u8 action; /* 3 bytes hole, try to use */ u32 target; + __be64 tun_id; struct fib_rule __rcu *ctarget; struct net *fr_net; diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h index 0b7e18c..0a5a776 100644 --- a/include/net/ip_tunnels.h +++ b/include/net/ip_tunnels.h @@ -303,6 +303,17 @@ static inline struct ip_tunnel_info *lwt_tun_info(struct lwtunnel_state *lwtstat return (struct ip_tunnel_info *)lwtstate-data; } +extern struct static_key ip_tunnel_metadata_cnt; + +/* Returns 0 if metadata should be collected */ +static inline int ip_tunnel_collect_metadata(void) +{ + return static_key_false(ip_tunnel_metadata_cnt); +} + +void ip_tunnel_need_metadata(void); +void ip_tunnel_unneed_metadata(void); + #endif /* CONFIG_INET */ #endif /* __NET_IP_TUNNELS_H */ 
diff --git a/include/uapi/linux/fib_rules.h b/include/uapi/linux/fib_rules.h index 2b82d7e..96161b8 100644 --- a/include/uapi/linux/fib_rules.h +++ b/include/uapi/linux/fib_rules.h @@ -43,7 +43,7 @@ enum { FRA_UNUSED5, FRA_FWMARK, /* mark */ FRA_FLOW, /* flow/class id */ - FRA_UNUSED6, + FRA_TUN_ID, FRA_SUPPRESS_IFGROUP, FRA_SUPPRESS_PREFIXLEN, FRA_TABLE, /* Extended table id */ diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c index 9a12668..ae8306e 100644 --- a/net/core/fib_rules.c +++ b/net/core/fib_rules.c @@ -16,6 +16,7 @@ #include net/net_namespace.h #include net/sock.h #include net/fib_rules.h +#include net/ip_tunnels.h int fib_default_rule_add(struct fib_rules_ops *ops, u32 pref, u32 table, u32 flags) @@ -186,6 +187,9 @@ static int fib_rule_match(struct fib_rule *rule, struct fib_rules_ops *ops, if ((rule-mark ^ fl-flowi_mark) rule-mark_mask) goto out; + if (rule-tun_id (rule-tun_id != fl-flowi_tun_key.tun_id)) + goto out; + ret = ops-match(rule, fl, flags); out: return (rule-flags FIB_RULE_INVERT) ? 
!ret : ret; @@ -330,6 +334,9 @@ static int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr* nlh) if (tb[FRA_FWMASK]) rule-mark_mask = nla_get_u32(tb[FRA_FWMASK]); + if (tb[FRA_TUN_ID]) + rule-tun_id = nla_get_be64(tb[FRA_TUN_ID]); + rule-action = frh-action; rule-flags = frh-flags; rule-table = frh_get_table(frh, tb); @@ -407,6 +414,9 @@ static int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr* nlh) if (unresolved) ops-unresolved_rules++; + if (rule-tun_id) + ip_tunnel_need_metadata(); + notify_rule_change(RTM_NEWRULE, rule, ops, nlh, NETLINK_CB(skb).portid); flush_route_cache(ops); rules_ops_put(ops); @@ -473,6 +483,10 @@ static int fib_nl_delrule(struct sk_buff *skb, struct nlmsghdr* nlh) (rule-mark_mask != nla_get_u32(tb[FRA_FWMASK]))) continue; + if (tb[FRA_TUN_ID] + (rule-tun_id != nla_get_be64(tb[FRA_TUN_ID]))) + continue; + if (!ops-compare(rule, frh, tb)) continue; @@ -487,6 +501,9 @@ static int fib_nl_delrule(struct sk_buff *skb, struct nlmsghdr* nlh) goto errout; } + if (rule-tun_id) + ip_tunnel_unneed_metadata(); + list_del_rcu(rule-list); if (rule-action == FR_ACT_GOTO) {
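The matching rule and the on-demand metadata switch from this patch can be modeled in userspace roughly as follows. The structs are stand-ins for struct fib_rule and the flow key, plain uint64_t replaces the __be64 tunnel id, and an ordinary counter stands in for the static key; all function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* tun_id == 0 means the rule does not match on the tunnel id at all. */
struct rule { uint64_t tun_id; };
struct flow { uint64_t tun_id; };

/* A rule with a tunnel id only matches flows carrying the same id. */
static bool rule_matches_tunnel(const struct rule *r, const struct flow *fl)
{
	if (r->tun_id && r->tun_id != fl->tun_id)
		return false;
	return true;
}

/* Plain counter standing in for the static key: metadata is collected
 * only while at least one rule with a tunnel id is installed. */
static int metadata_users;

static void need_metadata(void)    { metadata_users++; }
static void unneed_metadata(void)  { metadata_users--; }
static bool collect_metadata(void) { return metadata_users > 0; }
```

Adding a rule with a tunnel id would call need_metadata() and deleting it would call unneed_metadata(), so tunnels pay the collection cost only while such a rule exists.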
[PATCH net-next 13/22] arp: Inherit metadata dst when creating ARP requests
If output device wants to see the dst, inherit the dst of the original skb
and pass it on to generate the ARP request.

Signed-off-by: Thomas Graf tg...@suug.ch
---
 net/ipv4/arp.c | 65 +-
 1 file changed, 37 insertions(+), 28 deletions(-)

diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 933a928..1d59e50 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -291,6 +291,40 @@ static void arp_error_report(struct neighbour *neigh, struct sk_buff *skb)
 	kfree_skb(skb);
 }

+/* Create and send an arp packet. */
+static void arp_send_dst(int type, int ptype, __be32 dest_ip,
+			 struct net_device *dev, __be32 src_ip,
+			 const unsigned char *dest_hw,
+			 const unsigned char *src_hw,
+			 const unsigned char *target_hw, struct sk_buff *oskb)
+{
+	struct sk_buff *skb;
+
+	/* arp on this interface. */
+	if (dev->flags & IFF_NOARP)
+		return;
+
+	skb = arp_create(type, ptype, dest_ip, dev, src_ip,
+			 dest_hw, src_hw, target_hw);
+	if (!skb)
+		return;
+
+	if (oskb)
+		skb_dst_copy(skb, oskb);
+
+	arp_xmit(skb);
+}
+
+void arp_send(int type, int ptype, __be32 dest_ip,
+	      struct net_device *dev, __be32 src_ip,
+	      const unsigned char *dest_hw, const unsigned char *src_hw,
+	      const unsigned char *target_hw)
+{
+	arp_send_dst(type, ptype, dest_ip, dev, src_ip, dest_hw, src_hw,
+		     target_hw, NULL);
+}
+EXPORT_SYMBOL(arp_send);
+
 static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
 {
 	__be32 saddr = 0;
@@ -346,8 +380,9 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
 		}
 	}

-	arp_send(ARPOP_REQUEST, ETH_P_ARP, target, dev, saddr,
-		 dst_hw, dev->dev_addr, NULL);
+	arp_send_dst(ARPOP_REQUEST, ETH_P_ARP, target, dev, saddr,
+		     dst_hw, dev->dev_addr, NULL,
+		     dev->priv_flags & IFF_XMIT_DST_RELEASE ? NULL : skb);
 }

 static int arp_ignore(struct in_device *in_dev, __be32 sip, __be32 tip)
@@ -597,32 +632,6 @@ void arp_xmit(struct sk_buff *skb)
 EXPORT_SYMBOL(arp_xmit);

 /*
- *	Create and send an arp packet.
- */
-void arp_send(int type, int ptype, __be32 dest_ip,
-	      struct net_device *dev, __be32 src_ip,
-	      const unsigned char *dest_hw, const unsigned char *src_hw,
-	      const unsigned char *target_hw)
-{
-	struct sk_buff *skb;
-
-	/*
-	 *	No arp on this interface.
-	 */
-
-	if (dev->flags&IFF_NOARP)
-		return;
-
-	skb = arp_create(type, ptype, dest_ip, dev, src_ip,
-			 dest_hw, src_hw, target_hw);
-	if (!skb)
-		return;
-
-	arp_xmit(skb);
-}
-EXPORT_SYMBOL(arp_send);
-
-/*
  *	Process an arp request.
  */

--
2.4.3
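The shape of this refactoring, an extended helper plus a thin wrapper that keeps the old signature, can be sketched in userspace as below. The structs are stand-ins for struct sk_buff, struct net_device and the dst entry, XMIT_DST_RELEASE stands in for IFF_XMIT_DST_RELEASE, and all function names are hypothetical. The sketch copies a bare pointer where the kernel uses skb_dst_copy(), so reference counting is not modeled.

```c
#include <stdbool.h>
#include <stddef.h>

#define XMIT_DST_RELEASE 0x1u	/* stand-in for IFF_XMIT_DST_RELEASE */

struct dst { int id; };
struct pkt { const struct dst *dst; };
struct dev { unsigned int priv_flags; };

/* Which original packet, if any, should donate its dst to the request?
 * Devices that release the dst on xmit do not look at it, so none. */
static const struct pkt *dst_donor(const struct dev *dev,
				   const struct pkt *orig)
{
	return (dev->priv_flags & XMIT_DST_RELEASE) ? NULL : orig;
}

/* Extended helper: build the request and inherit the donor's dst. */
static struct pkt make_request_dst(const struct pkt *donor)
{
	struct pkt req = { NULL };

	if (donor)
		req.dst = donor->dst;
	return req;
}

/* Wrapper with the old signature: existing callers see no change and
 * get a request without a dst. */
static struct pkt make_request(void)
{
	return make_request_dst(NULL);
}
```

Keeping the wrapper means only the one call site that has an original packet at hand needs to switch to the extended helper.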
[PATCH net-next 16/22] route: Per route IP tunnel metadata via lightweight tunnel
This introduces a new IP tunnel lightweight tunnel type which allows to specify IP tunnel instructions per route. Only IPv4 is supported at this point. Signed-off-by: Thomas Graf tg...@suug.ch --- drivers/net/vxlan.c| 10 +++- include/net/dst_metadata.h | 12 - include/net/ip_tunnels.h | 7 ++- include/uapi/linux/lwtunnel.h | 1 + include/uapi/linux/rtnetlink.h | 15 ++ net/ipv4/ip_tunnel_core.c | 114 + net/ipv4/route.c | 2 +- net/openvswitch/vport.h| 1 + 8 files changed, 157 insertions(+), 5 deletions(-) diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index 994d89c..a350afb 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -1935,7 +1935,7 @@ static void vxlan_encap_bypass(struct sk_buff *skb, struct vxlan_dev *src_vxlan, static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev, struct vxlan_rdst *rdst, bool did_rsc) { - struct ip_tunnel_info *info = skb_tunnel_info(skb); + struct ip_tunnel_info *info; struct vxlan_dev *vxlan = netdev_priv(dev); struct sock *sk = vxlan-vn_sock-sock-sk; struct rtable *rt = NULL; @@ -1952,6 +1952,9 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev, int err; u32 flags = vxlan-flags; + /* FIXME: Support IPv6 */ + info = skb_tunnel_info(skb, AF_INET); + if (rdst) { dst_port = rdst-remote_port ? 
rdst-remote_port : vxlan-dst_port; vni = rdst-remote_vni; @@ -2141,12 +2144,15 @@ tx_free: static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev) { struct vxlan_dev *vxlan = netdev_priv(dev); - const struct ip_tunnel_info *info = skb_tunnel_info(skb); + const struct ip_tunnel_info *info; struct ethhdr *eth; bool did_rsc = false; struct vxlan_rdst *rdst, *fdst = NULL; struct vxlan_fdb *f; + /* FIXME: Support IPv6 */ + info = skb_tunnel_info(skb, AF_INET); + skb_reset_mac_header(skb); eth = eth_hdr(skb); diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h index e843937..7b03068 100644 --- a/include/net/dst_metadata.h +++ b/include/net/dst_metadata.h @@ -23,13 +23,23 @@ static inline struct metadata_dst *skb_metadata_dst(struct sk_buff *skb) return NULL; } -static inline struct ip_tunnel_info *skb_tunnel_info(struct sk_buff *skb) +static inline struct ip_tunnel_info *skb_tunnel_info(struct sk_buff *skb, +int family) { struct metadata_dst *md_dst = skb_metadata_dst(skb); + struct rtable *rt; if (md_dst) return md_dst-u.tun_info; + switch (family) { + case AF_INET: + rt = (struct rtable *)skb_dst(skb); + if (rt rt-rt_lwtstate) + return lwt_tun_info(rt-rt_lwtstate); + break; + } + return NULL; } diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h index d11530f..0b7e18c 100644 --- a/include/net/ip_tunnels.h +++ b/include/net/ip_tunnels.h @@ -9,9 +9,9 @@ #include net/dsfield.h #include net/gro_cells.h #include net/inet_ecn.h -#include net/ip.h #include net/netns/generic.h #include net/rtnetlink.h +#include net/lwtunnel.h #if IS_ENABLED(CONFIG_IPV6) #include net/ipv6.h @@ -298,6 +298,11 @@ static inline void *ip_tunnel_info_opts(struct ip_tunnel_info *info, size_t n) return info + 1; } +static inline struct ip_tunnel_info *lwt_tun_info(struct lwtunnel_state *lwtstate) +{ + return (struct ip_tunnel_info *)lwtstate-data; +} + #endif /* CONFIG_INET */ #endif /* __NET_IP_TUNNELS_H */ diff --git a/include/uapi/linux/lwtunnel.h 
b/include/uapi/linux/lwtunnel.h index aa611d9..31377bb 100644 --- a/include/uapi/linux/lwtunnel.h +++ b/include/uapi/linux/lwtunnel.h @@ -6,6 +6,7 @@ enum lwtunnel_encap_types { LWTUNNEL_ENCAP_NONE, LWTUNNEL_ENCAP_MPLS, + LWTUNNEL_ENCAP_IP, __LWTUNNEL_ENCAP_MAX, }; diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h index 0d3d3cc..47d24cb 100644 --- a/include/uapi/linux/rtnetlink.h +++ b/include/uapi/linux/rtnetlink.h @@ -286,6 +286,21 @@ enum rt_class_t { /* Routing message attributes */ +enum ip_tunnel_t { + IP_TUN_UNSPEC, + IP_TUN_ID, + IP_TUN_DST, + IP_TUN_SRC, + IP_TUN_TTL, + IP_TUN_TOS, + IP_TUN_SPORT, + IP_TUN_DPORT, + IP_TUN_FLAGS, + __IP_TUN_MAX, +}; + +#define IP_TUN_MAX (__IP_TUN_MAX - 1) + enum rtattr_type_t { RTA_UNSPEC, RTA_DST, diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c index 6a51a71..025b76e 100644 --- a/net/ipv4/ip_tunnel_core.c +++ b/net/ipv4/ip_tunnel_core.c @@ -190,3 +190,117 @@ struct rtnl_link_stats64
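The lookup order that the patch gives skb_tunnel_info() can be modelled in userspace as follows. This is only a sketch: the demo_* types and fields are hypothetical stand-ins for the skb, the metadata dst and the route's lwtunnel state, and the only thing taken from the patch is the order of the checks (metadata dst first, then the per-route state for AF_INET, otherwise NULL).

```c
#include <stddef.h>
#include <sys/socket.h>	/* AF_INET, AF_INET6 */

/* Hypothetical stand-in for struct ip_tunnel_info. */
struct demo_tun_info {
	unsigned int id;
};

/* Hypothetical stand-in for an skb: either source of tunnel info may be absent. */
struct demo_pkt {
	struct demo_tun_info *md_info;		/* from a metadata dst */
	struct demo_tun_info *route_info;	/* from the route's lwtunnel state */
};

/* Mirrors the order of checks in the patched helper. */
static struct demo_tun_info *demo_tunnel_info(const struct demo_pkt *pkt,
					      int family)
{
	if (pkt->md_info)
		return pkt->md_info;

	switch (family) {
	case AF_INET:
		if (pkt->route_info)
			return pkt->route_info;
		break;
	}

	return NULL;	/* other families are not handled yet */
}
```

Passing the family explicitly is what lets callers such as vxlan_xmit() stay IPv4-only for now while leaving room for an AF_INET6 case later.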
Re: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be processed during probe
"K. Y. Srinivasan" <k...@microsoft.com> writes:

> The current code returns from probe without waiting for the proper handling
> of subchannels that may be requested. If the netvsc driver were to be rapidly
> loaded/unloaded, we can trigger a panic as the unload will be tearing
> down state that may not have been fully setup yet. We fix this issue by making
> sure that we return from the probe call only after ensuring that the
> sub-channel offers in flight are properly handled.
>
> Signed-off-by: K. Y. Srinivasan <k...@microsoft.com>
> Reviewed-and-tested-by: Haiyang Zhang <haiya...@microsoft.com>
> ---
>  drivers/net/hyperv/hyperv_net.h   |    2 ++
>  drivers/net/hyperv/rndis_filter.c |   25 +
>  2 files changed, 27 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index 26cd14c..925b75d 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -671,6 +671,8 @@ struct netvsc_device {
>  	u32 send_table[VRSS_SEND_TAB_SIZE];
>  	u32 max_chn;
>  	u32 num_chn;
> +	spinlock_t sc_lock; /* Protects num_sc_offered variable */
> +	u32 num_sc_offered;
>  	atomic_t queue_sends[NR_CPUS];
>  
>  	/* Holds rndis device info */
> diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
> index 2e40417..2e09f3f 100644
> --- a/drivers/net/hyperv/rndis_filter.c
> +++ b/drivers/net/hyperv/rndis_filter.c
> @@ -984,9 +984,16 @@ static void netvsc_sc_open(struct vmbus_channel *new_sc)
>  	struct netvsc_device *nvscdev;
>  	u16 chn_index = new_sc->offermsg.offer.sub_channel_index;
>  	int ret;
> +	unsigned long flags;
>  
>  	nvscdev = hv_get_drvdata(new_sc->primary_channel->device_obj);
>  
> +	spin_lock_irqsave(&nvscdev->sc_lock, flags);
> +	nvscdev->num_sc_offered--;
> +	spin_unlock_irqrestore(&nvscdev->sc_lock, flags);
> +	if (nvscdev->num_sc_offered == 0)
> +		complete(&nvscdev->channel_init_wait);
> +
>  	if (chn_index >= nvscdev->num_chn)
>  		return;
>  
> @@ -1015,8 +1022,10 @@ int rndis_filter_device_add(struct hv_device *dev,
>  	u32 rsscap_size = sizeof(struct ndis_recv_scale_cap);
>  	u32 mtu, size;
>  	u32 num_rss_qs;
> +	u32 sc_delta;
>  	const struct cpumask *node_cpu_mask;
>  	u32 num_possible_rss_qs;
> +	unsigned long flags;
>  
>  	rndis_device = get_rndis_device();
>  	if (!rndis_device)
> @@ -1039,6 +1048,8 @@ int rndis_filter_device_add(struct hv_device *dev,
>  	net_device->max_chn = 1;
>  	net_device->num_chn = 1;
>  
> +	spin_lock_init(&net_device->sc_lock);
> +
>  	net_device->extension = rndis_device;
>  	rndis_device->net_dev = net_device;
>  
> @@ -1116,6 +1127,9 @@ int rndis_filter_device_add(struct hv_device *dev,
>  	num_possible_rss_qs = cpumask_weight(node_cpu_mask);
>  	net_device->num_chn = min(num_possible_rss_qs, num_rss_qs);
>  
> +	num_rss_qs = net_device->num_chn - 1;
> +	net_device->num_sc_offered = num_rss_qs;
> +
>  	if (net_device->num_chn == 1)
>  		goto out;
>  
> @@ -1157,11 +1171,22 @@ int rndis_filter_device_add(struct hv_device *dev,
>  
>  	ret = rndis_filter_set_rss_param(rndis_device, net_device->num_chn);
>  
> +	/*
> +	 * Wait for the host to send us the sub-channel offers.
> +	 */
> +	spin_lock_irqsave(&net_device->sc_lock, flags);
> +	sc_delta = net_device->num_chn - 1 - num_rss_qs;
> +	net_device->num_sc_offered -= sc_delta;
> +	spin_unlock_irqrestore(&net_device->sc_lock, flags);
> +
> +	if (net_device->num_sc_offered != 0)
> +		wait_for_completion(&net_device->channel_init_wait);

I'd suggest we add an essential timeout (big, let's say 30 sec.) here. In
case something goes wrong we don't really want to hang the whole kernel
forever. Such bugs are hard to debug: if a 'kernel hangs' is reported, we
can't be sure which wait caused it. We can even have something like:

	t = wait_for_completion_timeout(&net_device->channel_init_wait, 30*HZ);
	BUG_ON(t == 0);

This is much better as we'll be sure what went wrong. (I know other pieces
of hyper-v code use wait_for_completion() without a timeout; this is rather
a general suggestion for all of them.)

>  out:
>  	if (ret) {
>  		net_device->max_chn = 1;
>  		net_device->num_chn = 1;
>  	}
> +
>  	return 0; /* return 0 because primary channel can be used alone */
>  
>  err_dev_remv:

-- 
  Vitaly
[PATCH nf-next v2] netfilter: nf_ct_sctp: minimal multihoming support
Currently nf_conntrack_proto_sctp module handles only packets between
primary addresses used to establish the connection. Any packets between
secondary addresses are classified as invalid so that usual firewall
configurations drop them. Allowing HEARTBEAT and HEARTBEAT-ACK chunks to
establish a new conntrack would allow traffic between secondary
addresses to pass through. A more sophisticated solution based on the
addresses advertised in the initial handshake (and possibly also later
dynamic address addition and removal) would be much harder to
implement. Moreover, in general we cannot assume to always see the
initial handshake as it can be routed through a different path.

The patch adds two new conntrack states:

  SCTP_CONNTRACK_HEARTBEAT_SENT  - a HEARTBEAT chunk seen but not acked
  SCTP_CONNTRACK_HEARTBEAT_ACKED - a HEARTBEAT acked by HEARTBEAT-ACK

State transition rules:

- HEARTBEAT_SENT responds to usual chunks the same way as NONE (so that
  the behaviour changes as little as possible)
- HEARTBEAT_ACKED responds to usual chunks the same way as ESTABLISHED
  does, except the resulting state is HEARTBEAT_ACKED rather than
  ESTABLISHED
- previously existing states except NONE are preserved when HEARTBEAT or
  HEARTBEAT-ACK is seen
- NONE (in the initial direction) changes to HEARTBEAT_SENT on HEARTBEAT
  and to CLOSED on HEARTBEAT-ACK
- HEARTBEAT_SENT changes to HEARTBEAT_ACKED on HEARTBEAT-ACK in the
  reply direction
- HEARTBEAT_SENT and HEARTBEAT_ACKED are preserved on HEARTBEAT and
  HEARTBEAT-ACK otherwise

Normally, vtag is set from the INIT chunk for the reply direction and
from the INIT-ACK chunk for the originating direction (i.e. each of
these defines vtag value for the opposite direction). For secondary
conntracks, we can't rely on seeing INIT/INIT-ACK and even if we have
seen them, we would need to connect two different conntracks. Therefore
simplified logic is applied: vtag of first packet in each direction
(HEARTBEAT in the originating and HEARTBEAT-ACK in reply direction) is
saved and all following packets in that direction are compared with this
saved value. While INIT and INIT-ACK define vtag for the opposite
direction, vtags extracted from HEARTBEAT and HEARTBEAT-ACK are always
for their direction.

Default timeout values for new states are

  HEARTBEAT_SENT:   30 seconds (default hb_interval)
  HEARTBEAT_ACKED: 210 seconds (hb_interval * path_max_retry + max_rto)

(We cannot expect to see the shutdown sequence so that, unlike
ESTABLISHED, the HEARTBEAT_ACKED timeout shouldn't be too long.)

Signed-off-by: Michal Kubecek <mkube...@suse.cz>
---
v2:
  - add new timeouts to nla policy interface
  - explain vtag handling in the commit message
  - for consistency, rename *_HB_* constants to *_HEARTBEAT_*

 include/uapi/linux/netfilter/nf_conntrack_sctp.h   |   2 +
 include/uapi/linux/netfilter/nfnetlink_cttimeout.h |   2 +
 net/netfilter/nf_conntrack_proto_sctp.c            | 115 -
 3 files changed, 95 insertions(+), 24 deletions(-)

diff --git a/include/uapi/linux/netfilter/nf_conntrack_sctp.h b/include/uapi/linux/netfilter/nf_conntrack_sctp.h
index ceeefe6681b5..ed4e776e1242 100644
--- a/include/uapi/linux/netfilter/nf_conntrack_sctp.h
+++ b/include/uapi/linux/netfilter/nf_conntrack_sctp.h
@@ -13,6 +13,8 @@ enum sctp_conntrack {
 	SCTP_CONNTRACK_SHUTDOWN_SENT,
 	SCTP_CONNTRACK_SHUTDOWN_RECD,
 	SCTP_CONNTRACK_SHUTDOWN_ACK_SENT,
+	SCTP_CONNTRACK_HEARTBEAT_SENT,
+	SCTP_CONNTRACK_HEARTBEAT_ACKED,
 	SCTP_CONNTRACK_MAX
 };
 
diff --git a/include/uapi/linux/netfilter/nfnetlink_cttimeout.h b/include/uapi/linux/netfilter/nfnetlink_cttimeout.h
index 1ab0b97b3a1e..f2c10dc140d6 100644
--- a/include/uapi/linux/netfilter/nfnetlink_cttimeout.h
+++ b/include/uapi/linux/netfilter/nfnetlink_cttimeout.h
@@ -92,6 +92,8 @@ enum ctattr_timeout_sctp {
 	CTA_TIMEOUT_SCTP_SHUTDOWN_SENT,
 	CTA_TIMEOUT_SCTP_SHUTDOWN_RECD,
 	CTA_TIMEOUT_SCTP_SHUTDOWN_ACK_SENT,
+	CTA_TIMEOUT_SCTP_HEARTBEAT_SENT,
+	CTA_TIMEOUT_SCTP_HEARTBEAT_ACKED,
 	__CTA_TIMEOUT_SCTP_MAX
 };
 #define CTA_TIMEOUT_SCTP_MAX (__CTA_TIMEOUT_SCTP_MAX - 1)
diff --git a/net/netfilter/nf_conntrack_proto_sctp.c b/net/netfilter/nf_conntrack_proto_sctp.c
index b45da90fad32..1aac57e45319 100644
--- a/net/netfilter/nf_conntrack_proto_sctp.c
+++ b/net/netfilter/nf_conntrack_proto_sctp.c
@@ -42,6 +42,8 @@ static const char *const sctp_conntrack_names[] = {
 	"SHUTDOWN_SENT",
 	"SHUTDOWN_RECD",
 	"SHUTDOWN_ACK_SENT",
+	"HEARTBEAT_SENT",
+	"HEARTBEAT_ACKED",
 };
 
 #define SECS  * HZ
@@ -57,6 +59,8 @@ static unsigned int sctp_timeouts[SCTP_CONNTRACK_MAX] __read_mostly = {
 	[SCTP_CONNTRACK_SHUTDOWN_SENT]		= 300 SECS / 1000,
 	[SCTP_CONNTRACK_SHUTDOWN_RECD]		= 300 SECS / 1000,
 	[SCTP_CONNTRACK_SHUTDOWN_ACK_SENT]	= 3 SECS,
+	[SCTP_CONNTRACK_HEARTBEAT_SENT]		= 30 SECS,
+
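The HEARTBEAT-related transition rules from the commit message can be written down as a small function. This is a sketch of the rules as stated in the message, not the table in nf_conntrack_proto_sctp.c: the demo_* names are hypothetical, only a few of the existing states are listed, and the behaviour of NONE in the reply direction is an assumption (the message does not specify it, so the state is left unchanged here).

```c
/* Hypothetical, reduced state set; the real enum has more entries. */
enum demo_state {
	DEMO_NONE,
	DEMO_CLOSED,
	DEMO_ESTABLISHED,
	DEMO_HEARTBEAT_SENT,
	DEMO_HEARTBEAT_ACKED,
};

enum demo_chunk { DEMO_HEARTBEAT, DEMO_HEARTBEAT_ACK };
enum demo_dir { DEMO_ORIGINAL, DEMO_REPLY };

/* Next state when a HEARTBEAT or HEARTBEAT-ACK chunk is seen. */
static enum demo_state demo_next(enum demo_state s, enum demo_chunk c,
				 enum demo_dir d)
{
	switch (s) {
	case DEMO_NONE:
		if (d == DEMO_ORIGINAL)
			return c == DEMO_HEARTBEAT ? DEMO_HEARTBEAT_SENT
						   : DEMO_CLOSED;
		return s;	/* assumption: not covered by the message */
	case DEMO_HEARTBEAT_SENT:
		if (c == DEMO_HEARTBEAT_ACK && d == DEMO_REPLY)
			return DEMO_HEARTBEAT_ACKED;
		return s;
	default:
		/* All other states are preserved on these two chunks. */
		return s;
	}
}

/* HEARTBEAT_ACKED timeout: hb_interval * path_max_retry + max_rto. */
static unsigned int demo_acked_timeout(unsigned int hb_interval,
				       unsigned int path_max_retry,
				       unsigned int max_rto)
{
	return hb_interval * path_max_retry + max_rto;
}
```

With a 30 second heartbeat interval, 5 path retries and a 60 second maximum RTO (assumed here to be the defaults the message has in mind), the formula gives 30 * 5 + 60 = 210 seconds, matching the value quoted in the commit message.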
Re: [PATCH v2] jhash: Deinline jhash, jhash2 and __jhash_nwords
On July 16, 2015 at 9:23 PM Joe Perches <j...@perches.com> wrote:

> It might be useful to have these performance impacting
> changes guarded by something like CONFIG_CC_OPTIMIZE_FOR_SIZE
> with another static __always_inline __func and a function
> EXPORT_SYMBOL or just a static inline so that where code size
> is critical it's uninlined.

But keep in mind that jhash, jhash2 and __jhash_nwords are *not*
one-instruction functions. We duplicate the code over and over, which
probably results in more cache misses. __always_inline__ is probably too
strict, and in 99% of all distribution builds a plain inline already
behaves like __always_inline__; see ARCH_SUPPORTS_OPTIMIZED_INLINING and
CONFIG_CC_OPTIMIZE_FOR_SIZE.

The answer depends on the specific workload. Sometimes an enforced inline
performs better and sometimes a call is the better solution (read: fewer
cache misses). General purpose vendors with a larger working set size
should reduce cache misses by deinlining many functions. For
high-performance special fast-path operations a strongly inlined kernel
build is probably faster. __always_inline__ makes it impossible for the
user to choose whether functions are deinlined or not.

Hagen
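The guarded variant discussed above could look roughly like this in plain C. It is a sketch under assumptions: demo_mix_body() is a hypothetical stand-in for a multi-instruction hash body (it is not jhash), DEMO_OPTIMIZE_FOR_SIZE stands in for a config symbol, and the attributes are the GCC/Clang spellings.

```c
#include <stdint.h>

/* The body is written once and always inlined into its wrappers. */
static inline __attribute__((always_inline))
uint32_t demo_mix_body(uint32_t key, uint32_t seed)
{
	uint32_t h = key ^ seed;

	h *= 0x9e3779b1u;
	h ^= h >> 15;
	return h;
}

#ifdef DEMO_OPTIMIZE_FOR_SIZE
/* Size-critical build: one shared out-of-line copy, callers emit a call. */
__attribute__((noinline))
uint32_t demo_mix(uint32_t key, uint32_t seed)
{
	return demo_mix_body(key, seed);
}
#else
/* Default build: leave the inlining decision to the compiler. */
static inline uint32_t demo_mix(uint32_t key, uint32_t seed)
{
	return demo_mix_body(key, seed);
}
#endif
```

Both variants compute the same value; only the code placement differs, which is the trade-off between text size and call overhead that the thread is weighing. In a real tree the out-of-line copy would live in a single .c file with a prototype in the header.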
Re: [PATCH] mac80211: Deinline rate_control_rate_init, rate_control_rate_update
On Wed, 2015-07-15 at 14:56 +0200, Denys Vlasenko wrote:
> With this .config: http://busybox.net/~vda/kernel_config,
> after deinlining these functions have sizes and callsite counts
> as follows:
>
> rate_control_rate_init: 554 bytes, 8 calls
> rate_control_rate_update: 1596 bytes, 5 calls
>
> Total size reduction: about 11 kbytes.

Both applied.

johannes
Re: [PATCH net-next 0/6] net: bcmgenet: PHY initialization rework
On Jul 17, 2015, at 7:51 AM, Florian Fainelli <f.faine...@gmail.com> wrote:

> Hi David, Petri, Jaedon,
>
> This patch series reworks how we perform PHY initialization and resets in the
> GENET driver. Although this contains mostly fixes, some of the changes are a bit
> too intrusive to be backported to 'net' at the moment.
>
> Some of the motivations behind these changes were to reduce the time spent in
> how performing MDIO transactions, since it is better to perform then when
> we have interrupts enabled. This reduces the bring-up time of GENET from ~600
> msecs down to ~8 msecs, and about the same time for suspend/resume.
>
> Since I do not currently have a system which is not DT-aware, can you (Petri,
> Jaedon) give this a try and confirm things keep working as expected?
>
> Thanks!

I tested your patch series on a Broadcom 40nm set-top box platform that
uses the internal PHY. I do not have exact measurements, but I expect it
to improve the interface-up or link-up time. I compared the changes
roughly using the kernel log timestamps; please see below.

- before patching
[1.865126] bcmgenet 1043.ethernet eth0: Link is Down
[3.941132] bcmgenet 1043.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx

- after patching
[3.145127] bcmgenet 1043.ethernet eth0: Link is Down
[4.189140] bcmgenet 1043.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx

> Florian Fainelli (6):
>   net: bcmgenet: Remove excessive PHY reset
>   net: bcmgenet: Use correct dev_id for free_irq
>   net: bcmgenet: Power on integrated GPHY in bcmgenet_power_up()
>   net: bcmgenet: Determine PHY type before scanning MDIO bus
>   net: bcmgenet: Delay PHY initialization to bcmgenet_open()
>   net: bcmgenet: Remove init parameter from bcmgenet_mii_config
>
>  drivers/net/ethernet/broadcom/genet/bcmgenet.c | 33 +-
>  drivers/net/ethernet/broadcom/genet/bcmgenet.h |  5 +-
>  drivers/net/ethernet/broadcom/genet/bcmmii.c   | 84 --
>  3 files changed, 59 insertions(+), 63 deletions(-)
>
> --
> 2.1.0
Re: pull-request: mac80211 2015-07-17
On Fri, 2015-07-17 at 15:31 +0200, Johannes Berg wrote:
> Hi Dave,
>
> We've accumulated some wireless fixes, please pull. Arik's fix is a bit
> bigger than I might like, but it fixes a real locking issue and we
> didn't really see a good way to make a smaller version.
>
> Let me know if there's any problem.

Also, I'm going to be on vacation starting Tuesday, back on August 10.
I'm merging things to mac80211-next, but I'll hold the pull request
until after I return so I can deal with any possible issues in net-next
more quickly.

Kalle has graciously agreed to handle any urgent bugfixes to mac80211
while I'm out, and will probably send them to you as patches (if
necessary.)

johannes
[PATCH net-next 04/22] ipv6: support for fib route lwtunnel encap attributes
From: Roopa Prabhu ro...@cumulusnetworks.com This patch adds support in ipv6 fib functions to parse Netlink RTA encap attributes and attach encap state data to rt6_info. Signed-off-by: Roopa Prabhu ro...@cumulusnetworks.com --- include/net/ip6_fib.h | 3 +++ net/ipv6/ip6_fib.c| 2 ++ net/ipv6/route.c | 33 ++--- 3 files changed, 35 insertions(+), 3 deletions(-) diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h index 3b76849..276328e 100644 --- a/include/net/ip6_fib.h +++ b/include/net/ip6_fib.h @@ -51,6 +51,8 @@ struct fib6_config { struct nlattr *fc_mp; struct nl_info fc_nlinfo; + struct nlattr *fc_encap; + u16 fc_encap_type; }; struct fib6_node { @@ -131,6 +133,7 @@ struct rt6_info { /* more non-fragment space at head required */ unsigned short rt6i_nfheader_len; u8 rt6i_protocol; + struct lwtunnel_state *rt6i_lwtstate; }; static inline struct inet6_dev *ip6_dst_idev(struct dst_entry *dst) diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c index 55d1986..d715f2e 100644 --- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -32,6 +32,7 @@ #include net/ipv6.h #include net/ndisc.h #include net/addrconf.h +#include net/lwtunnel.h #include net/ip6_fib.h #include net/ip6_route.h @@ -177,6 +178,7 @@ static void rt6_free_pcpu(struct rt6_info *non_pcpu_rt) static void rt6_release(struct rt6_info *rt) { if (atomic_dec_and_test(rt-rt6i_ref)) { + lwtunnel_state_put(rt-rt6i_lwtstate); rt6_free_pcpu(rt); dst_free(rt-dst); } diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 6090969..b3431b7 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -58,6 +58,7 @@ #include net/netevent.h #include net/netlink.h #include net/nexthop.h +#include net/lwtunnel.h #include asm/uaccess.h @@ -1770,6 +1771,17 @@ int ip6_route_add(struct fib6_config *cfg) rt-dst.output = ip6_output; + if (cfg-fc_encap) { + struct lwtunnel_state *lwtstate; + + err = lwtunnel_build_state(dev, cfg-fc_encap_type, + cfg-fc_encap, lwtstate); + if (err) + goto out; + lwtunnel_state_get(lwtstate); + 
rt-rt6i_lwtstate = lwtstate; + } + ipv6_addr_prefix(rt-rt6i_dst.addr, cfg-fc_dst, cfg-fc_dst_len); rt-rt6i_dst.plen = cfg-fc_dst_len; if (rt-rt6i_dst.plen == 128) @@ -2595,6 +2607,8 @@ static const struct nla_policy rtm_ipv6_policy[RTA_MAX+1] = { [RTA_METRICS] = { .type = NLA_NESTED }, [RTA_MULTIPATH] = { .len = sizeof(struct rtnexthop) }, [RTA_PREF] = { .type = NLA_U8 }, + [RTA_ENCAP_TYPE]= { .type = NLA_U16 }, + [RTA_ENCAP] = { .type = NLA_NESTED }, }; static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh, @@ -2689,6 +2703,12 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh, cfg-fc_flags |= RTF_PREF(pref); } + if (tb[RTA_ENCAP]) + cfg-fc_encap = tb[RTA_ENCAP]; + + if (tb[RTA_ENCAP_TYPE]) + cfg-fc_encap_type = nla_get_u16(tb[RTA_ENCAP_TYPE]); + err = 0; errout: return err; @@ -2721,6 +2741,10 @@ beginning: r_cfg.fc_gateway = nla_get_in6_addr(nla); r_cfg.fc_flags |= RTF_GATEWAY; } + r_cfg.fc_encap = nla_find(attrs, attrlen, RTA_ENCAP); + nla = nla_find(attrs, attrlen, RTA_ENCAP_TYPE); + if (nla) + r_cfg.fc_encap_type = nla_get_u16(nla); } err = add ? ip6_route_add(r_cfg) : ip6_route_del(r_cfg); if (err) { @@ -2783,7 +2807,7 @@ static int inet6_rtm_newroute(struct sk_buff *skb, struct nlmsghdr *nlh) return ip6_route_add(cfg); } -static inline size_t rt6_nlmsg_size(void) +static inline size_t rt6_nlmsg_size(struct rt6_info *rt) { return NLMSG_ALIGN(sizeof(struct rtmsg)) + nla_total_size(16) /* RTA_SRC */ @@ -2797,7 +2821,8 @@ static inline size_t rt6_nlmsg_size(void) + RTAX_MAX * nla_total_size(4) /* RTA_METRICS */ + nla_total_size(sizeof(struct rta_cacheinfo)) + nla_total_size(TCP_CA_NAME_MAX) /* RTAX_CC_ALGO */ - + nla_total_size(1); /* RTA_PREF */ + + nla_total_size(1) /* RTA_PREF */ + + lwtunnel_get_encap_size(rt-rt6i_lwtstate); } static int rt6_fill_node(struct net *net, @@ -2945,6 +2970,8 @@ static int rt6_fill_node(struct net *net, if (nla_put_u8(skb, RTA_PREF, IPV6_EXTRACT_PREF(rt-rt6i_flags))) goto
[PATCH net-next 03/22] ipv4: support for fib route lwtunnel encap attributes
From: Roopa Prabhu ro...@cumulusnetworks.com This patch adds support in ipv4 fib functions to parse user provided encap attributes and attach encap state data to fib_nh and rtable. Signed-off-by: Roopa Prabhu ro...@cumulusnetworks.com --- include/net/ip_fib.h | 5 ++- include/net/route.h | 1 + net/ipv4/fib_frontend.c | 8 net/ipv4/fib_semantics.c | 96 +++- net/ipv4/route.c | 16 +++- 5 files changed, 122 insertions(+), 4 deletions(-) diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index 49c142b..5e01960 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -44,7 +44,9 @@ struct fib_config { u32 fc_flow; u32 fc_nlflags; struct nl_info fc_nlinfo; - }; + struct nlattr *fc_encap; + u16 fc_encap_type; +}; struct fib_info; struct rtable; @@ -89,6 +91,7 @@ struct fib_nh { struct rtable __rcu * __percpu *nh_pcpu_rth_output; struct rtable __rcu *nh_rth_input; struct fnhe_hash_bucket __rcu *nh_exceptions; + struct lwtunnel_state *nh_lwtstate; }; /* diff --git a/include/net/route.h b/include/net/route.h index fe22d03..2d45f41 100644 --- a/include/net/route.h +++ b/include/net/route.h @@ -66,6 +66,7 @@ struct rtable { struct list_headrt_uncached; struct uncached_list*rt_uncached_list; + struct lwtunnel_state *rt_lwtstate; }; static inline bool rt_is_input_route(const struct rtable *rt) diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index 6bbc549..9b2019c 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -591,6 +591,8 @@ const struct nla_policy rtm_ipv4_policy[RTA_MAX + 1] = { [RTA_METRICS] = { .type = NLA_NESTED }, [RTA_MULTIPATH] = { .len = sizeof(struct rtnexthop) }, [RTA_FLOW] = { .type = NLA_U32 }, + [RTA_ENCAP_TYPE]= { .type = NLA_U16 }, + [RTA_ENCAP] = { .type = NLA_NESTED }, }; static int rtm_to_fib_config(struct net *net, struct sk_buff *skb, @@ -656,6 +658,12 @@ static int rtm_to_fib_config(struct net *net, struct sk_buff *skb, case RTA_TABLE: cfg-fc_table = nla_get_u32(attr); break; + case RTA_ENCAP: + cfg-fc_encap 
= attr; + break; + case RTA_ENCAP_TYPE: + cfg-fc_encap_type = nla_get_u16(attr); + break; } } diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index c7358ea..6754c64 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -42,6 +42,7 @@ #include net/ip_fib.h #include net/netlink.h #include net/nexthop.h +#include net/lwtunnel.h #include fib_lookup.h @@ -208,6 +209,7 @@ static void free_fib_info_rcu(struct rcu_head *head) change_nexthops(fi) { if (nexthop_nh-nh_dev) dev_put(nexthop_nh-nh_dev); + lwtunnel_state_put(nexthop_nh-nh_lwtstate); free_nh_exceptions(nexthop_nh); rt_fibinfo_free_cpus(nexthop_nh-nh_pcpu_rth_output); rt_fibinfo_free(nexthop_nh-nh_rth_input); @@ -266,6 +268,7 @@ static inline int nh_comp(const struct fib_info *fi, const struct fib_info *ofi) #ifdef CONFIG_IP_ROUTE_CLASSID nh-nh_tclassid != onh-nh_tclassid || #endif + lwtunnel_cmp_encap(nh-nh_lwtstate, onh-nh_lwtstate) || ((nh-nh_flags ^ onh-nh_flags) ~RTNH_COMPARE_MASK)) return -1; onh++; @@ -366,6 +369,7 @@ static inline size_t fib_nlmsg_size(struct fib_info *fi) payload += nla_total_size((RTAX_MAX * nla_total_size(4))); if (fi-fib_nhs) { + size_t nh_encapsize = 0; /* Also handles the special case fib_nhs == 1 */ /* each nexthop is packed in an attribute */ @@ -374,8 +378,21 @@ static inline size_t fib_nlmsg_size(struct fib_info *fi) /* may contain flow and gateway attribute */ nhsize += 2 * nla_total_size(4); + /* grab encap info */ + for_nexthops(fi) { + if (nh-nh_lwtstate) { + /* RTA_ENCAP_TYPE */ + nh_encapsize += lwtunnel_get_encap_size( + nh-nh_lwtstate); + /* RTA_ENCAP */ + nh_encapsize += nla_total_size(2); + } + } endfor_nexthops(fi); + /* all nexthops are packed in a nested attribute */ - payload += nla_total_size(fi-fib_nhs * nhsize); + payload +=
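The size accounting added to fib_nlmsg_size() relies on how netlink attributes are padded. The sketch below reproduces that arithmetic in userspace under stated assumptions: a 4-byte attribute header and 4-byte alignment, as in the kernel's netlink attribute helpers. The demo_* names are hypothetical, and the per-nexthop encap payload is passed in as a plain number rather than queried from a tunnel state.

```c
#include <stddef.h>

#define DEMO_NLA_ALIGNTO	4
#define DEMO_NLA_HDRLEN		4	/* u16 length + u16 type */
#define DEMO_NLA_ALIGN(len) \
	(((len) + DEMO_NLA_ALIGNTO - 1) & ~(size_t)(DEMO_NLA_ALIGNTO - 1))

/* Space one attribute occupies: header plus payload, padded to 4 bytes. */
static size_t demo_nla_total_size(size_t payload)
{
	return DEMO_NLA_ALIGN(DEMO_NLA_HDRLEN + payload);
}

/* Extra room needed when every nexthop carries encap information. */
static size_t demo_encap_size(size_t nexthops, size_t encap_payload)
{
	size_t total = 0;
	size_t i;

	for (i = 0; i < nexthops; i++) {
		total += demo_nla_total_size(2);		/* u16 encap type */
		total += demo_nla_total_size(encap_payload);	/* encap data */
	}
	return total;
}
```

Because of the padding, a 2-byte encap type still costs 8 bytes on the wire, which is why the patch sizes it with nla_total_size(2) rather than adding 2.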
[PATCH net-next 18/22] vxlan: Factor out device configuration
This factors out the device configuration out of the RTNL newlink API which allows for in-kernel creation of VXLAN net_devices. Signed-off-by: Thomas Graf tg...@suug.ch --- drivers/net/vxlan.c | 332 include/net/vxlan.h | 59 ++ 2 files changed, 236 insertions(+), 155 deletions(-) diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index 23378db..5ae6c0c 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -55,10 +55,6 @@ #define PORT_HASH_BITS 8 #define PORT_HASH_SIZE (1PORT_HASH_BITS) -#define VNI_HASH_BITS 10 -#define VNI_HASH_SIZE (1VNI_HASH_BITS) -#define FDB_HASH_BITS 8 -#define FDB_HASH_SIZE (1FDB_HASH_BITS) #define FDB_AGE_DEFAULT 300 /* 5 min */ #define FDB_AGE_INTERVAL (10 * HZ) /* rescan interval */ @@ -75,6 +71,7 @@ module_param(log_ecn_error, bool, 0644); MODULE_PARM_DESC(log_ecn_error, Log packets received with corrupted ECN); static int vxlan_net_id; +static struct rtnl_link_ops vxlan_link_ops; static const u8 all_zeros_mac[ETH_ALEN]; @@ -85,21 +82,6 @@ struct vxlan_net { spinlock_tsock_lock; }; -union vxlan_addr { - struct sockaddr_in sin; - struct sockaddr_in6 sin6; - struct sockaddr sa; -}; - -struct vxlan_rdst { - union vxlan_addr remote_ip; - __be16 remote_port; - u32 remote_vni; - u32 remote_ifindex; - struct list_head list; - struct rcu_head rcu; -}; - /* Forwarding table entry */ struct vxlan_fdb { struct hlist_node hlist;/* linked list of entries */ @@ -112,31 +94,6 @@ struct vxlan_fdb { u8eth_addr[ETH_ALEN]; }; -/* Pseudo network device */ -struct vxlan_dev { - struct hlist_node hlist;/* vni hash table */ - struct list_head next; /* vxlan's per namespace list */ - struct vxlan_sock *vn_sock; /* listening socket */ - struct net_device *dev; - struct net*net; /* netns for packet i/o */ - struct vxlan_rdst default_dst; /* default destination */ - union vxlan_addr saddr;/* source address */ - __be16dst_port; - __u16 port_min; /* source port range */ - __u16 port_max; - __u8 tos; /* TOS override */ - __u8 ttl; - u32 flags;/* VXLAN_F_* 
in vxlan.h */ - - unsigned long age_interval; - struct timer_list age_timer; - spinlock_thash_lock; - unsigned int addrcnt; - unsigned int addrmax; - - struct hlist_head fdb_head[FDB_HASH_SIZE]; -}; - /* salt for hash table */ static u32 vxlan_salt __read_mostly; static struct workqueue_struct *vxlan_wq; @@ -352,7 +309,7 @@ static int vxlan_fdb_info(struct sk_buff *skb, struct vxlan_dev *vxlan, if (send_ip vxlan_nla_put_addr(skb, NDA_DST, rdst-remote_ip)) goto nla_put_failure; - if (rdst-remote_port rdst-remote_port != vxlan-dst_port + if (rdst-remote_port rdst-remote_port != vxlan-cfg.dst_port nla_put_be16(skb, NDA_PORT, rdst-remote_port)) goto nla_put_failure; if (rdst-remote_vni != vxlan-default_dst.remote_vni @@ -756,7 +713,8 @@ static int vxlan_fdb_create(struct vxlan_dev *vxlan, if (!(flags NLM_F_CREATE)) return -ENOENT; - if (vxlan-addrmax vxlan-addrcnt = vxlan-addrmax) + if (vxlan-cfg.addrmax + vxlan-addrcnt = vxlan-cfg.addrmax) return -ENOSPC; /* Disallow replace to add a multicast entry */ @@ -842,7 +800,7 @@ static int vxlan_fdb_parse(struct nlattr *tb[], struct vxlan_dev *vxlan, return -EINVAL; *port = nla_get_be16(tb[NDA_PORT]); } else { - *port = vxlan-dst_port; + *port = vxlan-cfg.dst_port; } if (tb[NDA_VNI]) { @@ -1028,7 +986,7 @@ static bool vxlan_snoop(struct net_device *dev, vxlan_fdb_create(vxlan, src_mac, src_ip, NUD_REACHABLE, NLM_F_EXCL|NLM_F_CREATE, -vxlan-dst_port, +vxlan-cfg.dst_port, vxlan-default_dst.remote_vni, 0, NTF_SELF); spin_unlock(vxlan-hash_lock); @@ -1957,7 +1915,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev, info = skb_tunnel_info(skb, AF_INET); if (rdst) { - dst_port = rdst-remote_port ? rdst-remote_port : vxlan-dst_port; + dst_port = rdst-remote_port ? rdst-remote_port :
[PATCH net-next 14/22] vxlan: Flow based tunneling
Allows putting a VXLAN device into a new flow-based mode in which skbs with a ip_tunnel_info dst metadata attached will be encapsulated according to the instructions stored in there with the VXLAN device defaults taken into consideration. Similar on the receive side, if the VXLAN_F_COLLECT_METADATA flag is set, the packet processing will populate a ip_tunnel_info struct for each packet received and attach it to the skb using the new metadata dst. The metadata structure will contain the outer header and tunnel header fields which have been stripped off. Layers further up in the stack such as routing, tc or netfitler can later match on these fields and perform forwarding. It is the responsibility of upper layers to ensure that the flag is set if the metadata is needed. The flag limits the additional cost of metadata collecting based on demand. This prepares the VXLAN device to be steered by the routing and other subsystems which allows to support encapsulation for a large number of tunnel endpoints and tunnel ids through a single net_device which improves the scalability. It also allows for OVS to leverage this mode which in turn allows for the removal of the OVS specific VXLAN code. Because the skb is currently scrubed in vxlan_rcv(), the attachment of the new dst metadata is postponed until after scrubing which requires the temporary addition of a new member to vxlan_metadata. This member is removed again in a later commit after the indirect VXLAN receive API has been removed. 
Signed-off-by: Thomas Graf tg...@suug.ch Signed-off-by: Pravin B Shelar pshe...@nicira.com --- drivers/net/vxlan.c | 155 +-- include/linux/skbuff.h | 1 + include/net/dst_metadata.h | 13 include/net/ip_tunnels.h | 14 include/net/vxlan.h | 10 ++- include/uapi/linux/if_link.h | 1 + 6 files changed, 171 insertions(+), 23 deletions(-) diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index 34c519e..994d89c 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -49,6 +49,7 @@ #include net/ip6_tunnel.h #include net/ip6_checksum.h #endif +#include net/dst_metadata.h #define VXLAN_VERSION 0.1 @@ -140,6 +141,11 @@ struct vxlan_dev { static u32 vxlan_salt __read_mostly; static struct workqueue_struct *vxlan_wq; +static inline bool vxlan_collect_metadata(struct vxlan_sock *vs) +{ + return vs-flags VXLAN_F_COLLECT_METADATA; +} + #if IS_ENABLED(CONFIG_IPV6) static inline bool vxlan_addr_equal(const union vxlan_addr *a, const union vxlan_addr *b) @@ -1164,10 +1170,13 @@ static struct vxlanhdr *vxlan_remcsum(struct sk_buff *skb, struct vxlanhdr *vh, /* Callback from net/ipv4/udp.c to receive packets */ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb) { + struct metadata_dst *tun_dst = NULL; + struct ip_tunnel_info *info; struct vxlan_sock *vs; struct vxlanhdr *vxh; u32 flags, vni; - struct vxlan_metadata md = {0}; + struct vxlan_metadata _md; + struct vxlan_metadata *md = _md; /* Need Vxlan and inner Ethernet header to be present */ if (!pskb_may_pull(skb, VXLAN_HLEN)) @@ -1202,6 +1211,33 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb) vni = VXLAN_VNI_MASK; } + if (vxlan_collect_metadata(vs)) { + const struct iphdr *iph = ip_hdr(skb); + + tun_dst = metadata_dst_alloc(sizeof(*md), GFP_ATOMIC); + if (!tun_dst) + goto drop; + + info = tun_dst-u.tun_info; + info-key.ipv4_src = iph-saddr; + info-key.ipv4_dst = iph-daddr; + info-key.ipv4_tos = iph-tos; + info-key.ipv4_ttl = iph-ttl; + info-key.tp_src = udp_hdr(skb)-source; + 
info-key.tp_dst = udp_hdr(skb)-dest; + + info-mode = IP_TUNNEL_INFO_RX; + info-key.tun_flags = TUNNEL_KEY; + info-key.tun_id = cpu_to_be64(vni 8); + if (udp_hdr(skb)-check != 0) + info-key.tun_flags |= TUNNEL_CSUM; + + md = ip_tunnel_info_opts(info, sizeof(*md)); + md-tun_dst = tun_dst; + } else { + memset(md, 0, sizeof(*md)); + } + /* For backwards compatibility, only allow reserved fields to be * used by VXLAN extensions if explicitly requested. */ @@ -1209,13 +1245,16 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb) struct vxlanhdr_gbp *gbp; gbp = (struct vxlanhdr_gbp *)vxh; - md.gbp = ntohs(gbp-policy_id); + md-gbp = ntohs(gbp-policy_id); + + if (tun_dst) + info-key.tun_flags |= TUNNEL_VXLAN_OPT; if (gbp-dont_learn) - md.gbp |= VXLAN_GBP_DONT_LEARN; +
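The receive path in this patch stores the VNI in the tunnel key as a 64-bit big-endian tunnel id (the cpu_to_be64() call on the shifted vni). As a rough userspace illustration of that arithmetic only, the sketch below pulls the 24-bit VNI out of an 8-byte VXLAN header and serializes it as a 64-bit big-endian value. The helper names (load_be32, vxlan_vni_from_header, store_be64) are made up for this example and do not appear in the patch; the header layout assumed is the usual one (flags byte, 24 reserved bits, 24-bit VNI, 8 reserved bits).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit big-endian word from a byte buffer. */
static uint32_t load_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper: extract the 24-bit VNI from an 8-byte VXLAN header.
 * The second 32-bit word carries the VNI in its upper 24 bits, so the
 * low reserved byte is shifted out.
 */
static uint32_t vxlan_vni_from_header(const uint8_t *hdr)
{
	assert(hdr != NULL);
	return load_be32(hdr + 4) >> 8;
}

/* Userspace stand-in for cpu_to_be64(): write v as 8 big-endian bytes. */
static void store_be64(uint8_t *out, uint64_t v)
{
	for (int i = 7; i >= 0; i--) {
		out[i] = (uint8_t)(v & 0xff);
		v >>= 8;
	}
}
```

With a header whose VNI bytes are 00 00 0a, vxlan_vni_from_header() yields 10, which corresponds to the "id 10" used in the iproute2 example of the cover letter.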
[PATCH net-next 15/22] route: Extend flow representation with tunnel key
Add a new flowi_tunnel structure which is a subset of ip_tunnel_key to allow routes to match on tunnel metadata. For now, the tunnel id is added to flowi_tunnel which allows for routes to be bound to specific virtual tunnels.

Signed-off-by: Thomas Graf tg...@suug.ch
---
 include/net/flow.h | 7 +++
 net/ipv4/route.c   | 6 ++
 2 files changed, 13 insertions(+)

diff --git a/include/net/flow.h b/include/net/flow.h
index 8109a15..c15fb5e 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -19,6 +19,10 @@
 
 #define LOOPBACK_IFINDEX	1
 
+struct flowi_tunnel {
+	__be64			tun_id;
+};
+
 struct flowi_common {
 	int	flowic_oif;
 	int	flowic_iif;
@@ -30,6 +34,7 @@ struct flowi_common {
 #define FLOWI_FLAG_ANYSRC		0x01
 #define FLOWI_FLAG_KNOWN_NH		0x02
 	__u32	flowic_secid;
+	struct flowi_tunnel flowic_tun_key;
 };
 
 union flowi_uli {
@@ -66,6 +71,7 @@ struct flowi4 {
 #define flowi4_proto		__fl_common.flowic_proto
 #define flowi4_flags		__fl_common.flowic_flags
 #define flowi4_secid		__fl_common.flowic_secid
+#define flowi4_tun_key		__fl_common.flowic_tun_key
 
 	/* (saddr,daddr) must be grouped, same order as in IP header */
 	__be32			saddr;
@@ -165,6 +171,7 @@ struct flowi {
 #define flowi_proto	u.__fl_common.flowic_proto
 #define flowi_flags	u.__fl_common.flowic_flags
 #define flowi_secid	u.__fl_common.flowic_secid
+#define flowi_tun_key	u.__fl_common.flowic_tun_key
 } __attribute__((__aligned__(BITS_PER_LONG/8)));
 
 static inline struct flowi *flowi4_to_flowi(struct flowi4 *fl4)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 4c8e84e..931015c 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -91,6 +91,7 @@
 #include <linux/slab.h>
 #include <linux/jhash.h>
 #include <net/dst.h>
+#include <net/dst_metadata.h>
 #include <net/net_namespace.h>
 #include <net/protocol.h>
 #include <net/ip.h>
@@ -110,6 +111,7 @@
 #include <linux/kmemleak.h>
 #endif
 #include <net/secure_seq.h>
+#include <net/ip_tunnels.h>
 
 #define RT_FL_TOS(oldflp4) \
 	((oldflp4)->flowi4_tos & (IPTOS_RT_MASK | RTO_ONLINK))
@@ -1673,6 +1675,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 {
 	struct fib_result res;
 	struct in_device *in_dev = __in_dev_get_rcu(dev);
+	struct ip_tunnel_info *tun_info;
 	struct flowi4	fl4;
 	unsigned int	flags = 0;
 	u32		itag = 0;
@@ -1690,6 +1693,9 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	   by fib_lookup.
 	 */
 
+	tun_info = skb_tunnel_info(skb);
+	if (tun_info && tun_info->mode == IP_TUNNEL_INFO_RX)
+		fl4.flowi4_tun_key.tun_id = tun_info->key.tun_id;
 	skb_dst_drop(skb);
 
 	if (ipv4_is_multicast(saddr) || ipv4_is_lbcast(saddr))
-- 
2.4.3
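To illustrate what binding routes to a specific virtual tunnel could look like once the tunnel id is part of the flow key, here is a small userspace model. It is only a sketch under stated assumptions: the types (flow, rule), the helper names, and the convention that a rule with tun_id 0 matches any flow are invented for this example and are not taken from this patch. The id is treated as an opaque 64-bit value, so byte order does not matter for the comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for flowi_tunnel and the flow that carries it. */
struct flow_tunnel {
	uint64_t tun_id;
};

struct flow {
	struct flow_tunnel tun_key;
};

/* Hypothetical rule: tun_id 0 is assumed to mean "match any tunnel". */
struct rule {
	uint64_t tun_id;
	int table;
};

static int rule_match(const struct rule *r, const struct flow *fl)
{
	return r->tun_id == 0 || r->tun_id == fl->tun_key.tun_id;
}

/* Return the table of the first matching rule, or fallback if none match. */
static int flow_lookup_table(const struct rule *rules, size_t n,
			     const struct flow *fl, int fallback)
{
	assert(rules != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (rule_match(&rules[i], fl))
			return rules[i].table;
	}
	return fallback;
}
```

A flow whose tunnel key carries id 10 would select the table of the rule bound to id 10, while flows from other tunnels fall through to later rules or to the fallback.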
[PATCH net-next 12/22] dst: Metadata destinations
Introduces a new dst_metadata which enables to carry per packet metadata between forwarding and processing elements via the skb-dst pointer. The structure is set up to be a union. Thus, each separate type of metadata requires its own dst instance. If demand arises to carry multiple types of metadata concurrently, metadata dst entries can be made stackable. The metadata dst entry is refcnt'ed as expected for now but a non reference counted use is possible if the reference is forced before queueing the skb. In order to allow allocating dsts with variable length, the existing dst_alloc() is split into a dst_alloc() and dst_init() function. The existing dst_init() function to initialize the subsystem is being renamed to dst_subsys_init() to make it clear what is what. The check before ip_route_input() is changed to ignore metadata dsts and drop the dst inside the routing function thus allowing to interpret metadata in a later commit. Signed-off-by: Thomas Graf tg...@suug.ch --- include/net/dst.h | 6 +++- include/net/dst_metadata.h | 32 ++ net/core/dev.c | 2 +- net/core/dst.c | 84 ++ net/ipv4/ip_input.c| 3 +- net/ipv4/route.c | 2 ++ 6 files changed, 112 insertions(+), 17 deletions(-) create mode 100644 include/net/dst_metadata.h diff --git a/include/net/dst.h b/include/net/dst.h index 2bc73f8a..2578811 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -57,6 +57,7 @@ struct dst_entry { #define DST_FAKE_RTABLE0x0040 #define DST_XFRM_TUNNEL0x0080 #define DST_XFRM_QUEUE 0x0100 +#define DST_METADATA 0x0200 unsigned short pending_confirm; @@ -356,6 +357,9 @@ static inline int dst_discard(struct sk_buff *skb) } void *dst_alloc(struct dst_ops *ops, struct net_device *dev, int initial_ref, int initial_obsolete, unsigned short flags); +void dst_init(struct dst_entry *dst, struct dst_ops *ops, + struct net_device *dev, int initial_ref, int initial_obsolete, + unsigned short flags); void __dst_free(struct dst_entry *dst); struct dst_entry *dst_destroy(struct dst_entry 
*dst); @@ -457,7 +461,7 @@ static inline struct dst_entry *dst_check(struct dst_entry *dst, u32 cookie) return dst; } -void dst_init(void); +void dst_subsys_init(void); /* Flags for xfrm_lookup flags argument. */ enum { diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h new file mode 100644 index 000..4f7694f --- /dev/null +++ b/include/net/dst_metadata.h @@ -0,0 +1,32 @@ +#ifndef __NET_DST_METADATA_H +#define __NET_DST_METADATA_H 1 + +#include linux/skbuff.h +#include net/ip_tunnels.h +#include net/dst.h + +struct metadata_dst { + struct dst_entrydst; + size_t opts_len; +}; + +static inline struct metadata_dst *skb_metadata_dst(struct sk_buff *skb) +{ + struct metadata_dst *md_dst = (struct metadata_dst *) skb_dst(skb); + + if (md_dst md_dst-dst.flags DST_METADATA) + return md_dst; + + return NULL; +} + +static inline bool skb_valid_dst(const struct sk_buff *skb) +{ + struct dst_entry *dst = skb_dst(skb); + + return dst !(dst-flags DST_METADATA); +} + +struct metadata_dst *metadata_dst_alloc(u8 optslen, gfp_t flags); + +#endif /* __NET_DST_METADATA_H */ diff --git a/net/core/dev.c b/net/core/dev.c index 8810b6b..61e3dcb 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -7659,7 +7659,7 @@ static int __init net_dev_init(void) open_softirq(NET_RX_SOFTIRQ, net_rx_action); hotcpu_notifier(dev_cpu_callback, 0); - dst_init(); + dst_subsys_init(); rc = 0; out: return rc; diff --git a/net/core/dst.c b/net/core/dst.c index e956ce6..917364f 100644 --- a/net/core/dst.c +++ b/net/core/dst.c @@ -22,6 +22,7 @@ #include linux/prefetch.h #include net/dst.h +#include net/dst_metadata.h /* * Theory of operations: @@ -158,19 +159,10 @@ const u32 dst_default_metrics[RTAX_MAX + 1] = { [RTAX_MAX] = 0xdeadbeef, }; - -void *dst_alloc(struct dst_ops *ops, struct net_device *dev, - int initial_ref, int initial_obsolete, unsigned short flags) +void dst_init(struct dst_entry *dst, struct dst_ops *ops, + struct net_device *dev, int initial_ref, int initial_obsolete, + 
unsigned short flags) { - struct dst_entry *dst; - - if (ops-gc dst_entries_get_fast(ops) ops-gc_thresh) { - if (ops-gc(ops)) - return NULL; - } - dst = kmem_cache_alloc(ops-kmem_cachep, GFP_ATOMIC); - if (!dst) - return NULL; dst-child = NULL; dst-dev = dev; if (dev) @@ -200,6 +192,25 @@ void *dst_alloc(struct dst_ops *ops, struct net_device *dev, dst-next = NULL; if (!(flags DST_NOCOUNT))
[PATCH net-next 09/22] mpls: ip tunnel support
From: Roopa Prabhu ro...@cumulusnetworks.com This implementation uses lwtunnel infrastructure to register hooks for mpls tunnel encaps. It picks cues from iptunnel_encaps infrastructure and previous mpls iptunnel RFC patches from Eric W. Biederman and Robert Shearman Signed-off-by: Roopa Prabhu ro...@cumulusnetworks.com --- include/linux/mpls_iptunnel.h | 6 + include/net/mpls_iptunnel.h| 29 + include/uapi/linux/mpls_iptunnel.h | 28 + net/mpls/Kconfig | 8 +- net/mpls/Makefile | 1 + net/mpls/mpls_iptunnel.c | 233 + 6 files changed, 304 insertions(+), 1 deletion(-) create mode 100644 include/linux/mpls_iptunnel.h create mode 100644 include/net/mpls_iptunnel.h create mode 100644 include/uapi/linux/mpls_iptunnel.h create mode 100644 net/mpls/mpls_iptunnel.c diff --git a/include/linux/mpls_iptunnel.h b/include/linux/mpls_iptunnel.h new file mode 100644 index 000..ef29eb2 --- /dev/null +++ b/include/linux/mpls_iptunnel.h @@ -0,0 +1,6 @@ +#ifndef _LINUX_MPLS_IPTUNNEL_H +#define _LINUX_MPLS_IPTUNNEL_H + +#include uapi/linux/mpls_iptunnel.h + +#endif /* _LINUX_MPLS_IPTUNNEL_H */ diff --git a/include/net/mpls_iptunnel.h b/include/net/mpls_iptunnel.h new file mode 100644 index 000..4757997 --- /dev/null +++ b/include/net/mpls_iptunnel.h @@ -0,0 +1,29 @@ +/* + * Copyright (c) 2015 Cumulus Networks, Inc. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. 
+ */ + +#ifndef _NET_MPLS_IPTUNNEL_H +#define _NET_MPLS_IPTUNNEL_H 1 + +#define MAX_NEW_LABELS 2 + +struct mpls_iptunnel_encap { + u32 label[MAX_NEW_LABELS]; + u32 labels; +}; + +static inline struct mpls_iptunnel_encap *mpls_lwtunnel_encap(struct lwtunnel_state *lwtstate) +{ + return (struct mpls_iptunnel_encap *)lwtstate-data; +} + +#endif diff --git a/include/uapi/linux/mpls_iptunnel.h b/include/uapi/linux/mpls_iptunnel.h new file mode 100644 index 000..d80a049 --- /dev/null +++ b/include/uapi/linux/mpls_iptunnel.h @@ -0,0 +1,28 @@ +/* + * mpls tunnel api + * + * Authors: + * Roopa Prabhu ro...@cumulusnetworks.com + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ + +#ifndef _UAPI_LINUX_MPLS_IPTUNNEL_H +#define _UAPI_LINUX_MPLS_IPTUNNEL_H + +/* MPLS tunnel attributes + * [RTA_ENCAP] = { + * [MPLS_IPTUNNEL_DST] + * } + */ +enum { + MPLS_IPTUNNEL_UNSPEC, + MPLS_IPTUNNEL_DST, + __MPLS_IPTUNNEL_MAX, +}; +#define MPLS_IPTUNNEL_MAX (__MPLS_IPTUNNEL_MAX - 1) + +#endif /* _UAPI_LINUX_MPLS_IPTUNNEL_H */ diff --git a/net/mpls/Kconfig b/net/mpls/Kconfig index 17bde79..5c467ef 100644 --- a/net/mpls/Kconfig +++ b/net/mpls/Kconfig @@ -24,7 +24,13 @@ config NET_MPLS_GSO config MPLS_ROUTING tristate MPLS: routing support - help + ---help--- Add support for forwarding of mpls packets. +config MPLS_IPTUNNEL + tristate MPLS: IP over MPLS tunnel support + depends on LWTUNNEL MPLS_ROUTING + ---help--- +mpls ip tunnel support. 
+ endif # MPLS diff --git a/net/mpls/Makefile b/net/mpls/Makefile index 65bbe68..9ca9236 100644 --- a/net/mpls/Makefile +++ b/net/mpls/Makefile @@ -3,5 +3,6 @@ # obj-$(CONFIG_NET_MPLS_GSO) += mpls_gso.o obj-$(CONFIG_MPLS_ROUTING) += mpls_router.o +obj-$(CONFIG_MPLS_IPTUNNEL) += mpls_iptunnel.o mpls_router-y := af_mpls.o diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c new file mode 100644 index 000..eea096f --- /dev/null +++ b/net/mpls/mpls_iptunnel.c @@ -0,0 +1,233 @@ +/* + * mpls tunnelsAn implementation mpls tunnels using the light weight tunnel + * infrastructure + * + * Authors:Roopa Prabhu, ro...@cumulusnetworks.com + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + * + */ +#include linux/types.h +#include linux/skbuff.h +#include linux/net.h +#include linux/module.h +#include linux/mpls.h +#include linux/vmalloc.h +#include net/ip.h +#include net/dst.h +#include net/lwtunnel.h +#include net/netevent.h +#include net/netns/generic.h +#include net/ip6_fib.h +#include
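For reference, each label held in mpls_iptunnel_encap ends up on the wire as a 32-bit label stack entry. The following is a minimal sketch of that encoding, assuming the RFC 3032 layout (20-bit label, 3-bit traffic class, 1-bit bottom-of-stack, 8-bit TTL). The names mpls_lse_encode and mpls_lse_label are hypothetical and are not functions from this patch; the value is built in host byte order and would still need conversion to network byte order before transmission.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LSE_LABEL_SHIFT	12
#define LSE_TC_SHIFT	9
#define LSE_BOS_SHIFT	8

/* Hypothetical helper: pack one label stack entry in host byte order. */
static uint32_t mpls_lse_encode(uint32_t label, unsigned int tc, bool bos,
				uint8_t ttl)
{
	assert(label < (1u << 20));	/* label is a 20-bit field */
	assert(tc < 8);			/* traffic class is a 3-bit field */

	return (label << LSE_LABEL_SHIFT) |
	       ((uint32_t)tc << LSE_TC_SHIFT) |
	       ((uint32_t)bos << LSE_BOS_SHIFT) |
	       (uint32_t)ttl;
}

/* Hypothetical helper: recover the label from a host-order entry. */
static uint32_t mpls_lse_label(uint32_t lse)
{
	return lse >> LSE_LABEL_SHIFT;
}
```

For the "encap mpls 200" example, a single entry with label 200, bottom-of-stack set and TTL 64 encodes to 0x000C8140.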
[PATCH net-next 11/22] icmp: Don't leak original dst into ip_route_input()
ip_route_input() unconditionally overwrites the dst. Hide the original dst attached to the skb by calling skb_dst_set(skb, NULL) prior to ip_route_input().

Reported-by: Julian Anastasov j...@ssi.bg
Signed-off-by: Thomas Graf tg...@suug.ch
---
 net/ipv4/icmp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index f5203fb..c0556f1 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -496,6 +496,7 @@ static struct rtable *icmp_route_lookup(struct net *net,
 		}
 		/* Ugh! */
 		orefdst = skb_in->_skb_refdst; /* save old refdst */
+		skb_dst_set(skb_in, NULL);
 		err = ip_route_input(skb_in, fl4_dec.daddr, fl4_dec.saddr,
 				     RT_TOS(tos), rt2->dst.dev);
 
-- 
2.4.3
[PATCH net-next 00/22] Lightweight flow based encapsulation
This series combines the work previously posted by Roopa, Robert and myself. It's according to what we discussed at NFWS.

The motivation of this series is to:

 * Consolidate code between OVS and the rest of the kernel and get rid of OVS vports and instead represent them as pure net_devices.
 * Introduce a lightweight tunneling mechanism which enables flow based encapsulation to improve scalability on both RX and TX.
 * Do the above in an encapsulation unspecific way so that the encapsulation type is eventually abstracted away from the user.
 * Use the same forwarding decision for both native forwarding and encapsulation, thus allowing switching between native IPv6 and UDP encapsulation based on endpoint without requiring additional logic.

The fundamental changes introduced in this series are:

 * A new RTA_ENCAP Netlink attribute for routes carrying encapsulation instructions. Depending on the specified type, the instructions apply to UDP encapsulations, MPLS and possibly others in the future.
 * Depending on the encapsulation type, the output function of the dst is directly overwritten or the dst merely attaches metadata and relies on a subsequent net_device to apply it to the packet. The latter is typically used if an inner and outer IP header exist which require two subsequent routing lookups to be performed.
 * A new metadata_dst structure which can be attached to skbs to carry metadata in between subsystems. This new metadata transport is used to provide a single interface for VXLAN, routing and OVS to communicate through metadata.

The OVS interfaces remain as-is but will transparently create a real VXLAN net_device in the background.

iproute2 is extended with new use cases:

VXLAN:
  ip route add 40.1.1.1/32 encap vxlan id 10 dst 50.1.1.2 dev vxlan0

MPLS:
  ip route add 10.1.1.0/30 encap mpls 200 via inet 10.1.1.1 dev swp1

Changes since RFC:
 * Addressed comments
 * Folded in various fixes provided by Roopa, Joe, and Wei-Chun Chao
 * New static key to only collect metadata on receive if a filter exists which matches on the relevant fields.

Roopa Prabhu (9):
  rtnetlink: introduce new RTA_ENCAP_TYPE and RTA_ENCAP attributes
  lwtunnel: infrastructure for handling light weight tunnels like mpls
  ipv4: support for fib route lwtunnel encap attributes
  ipv6: support for fib route lwtunnel encap attributes
  lwtunnel: support dst output redirect function
  ipv4: redirect dst output to lwtunnel output
  ipv6: rt6_info output redirect to tunnel output
  mpls: export mpls functions for use by mpls iptunnels
  mpls: ip tunnel support

Thomas Graf (13):
  ip_tunnel: Make ovs_tunnel_info and ovs_key_ipv4_tunnel generic
  icmp: Don't leak original dst into ip_route_input()
  dst: Metadata destinations
  arp: Inherit metadata dst when creating ARP requests
  vxlan: Flow based tunneling
  route: Extend flow representation with tunnel key
  route: Per route IP tunnel metadata via lightweight tunnel
  fib: Add fib rule match on tunnel id
  vxlan: Factor out device configuration
  openvswitch: Make tunnel set action attach a metadata dst
  openvswitch: Move dev pointer into vport itself
  openvswitch: Abstract vport name through ovs_vport_name()
  openvswitch: Use regular VXLAN net_device device

 drivers/net/vxlan.c                | 678 +--
 include/linux/lwtunnel.h           |   6 +
 include/linux/mpls_iptunnel.h      |   6 +
 include/linux/skbuff.h             |   1 +
 include/net/dst.h                  |   6 +-
 include/net/dst_metadata.h         |  55 +++
 include/net/fib_rules.h            |   1 +
 include/net/flow.h                 |   7 +
 include/net/ip6_fib.h              |   3 +
 include/net/ip_fib.h               |   5 +-
 include/net/ip_tunnels.h           |  95 -
 include/net/lwtunnel.h             | 144
 include/net/mpls_iptunnel.h        |  29 ++
 include/net/route.h                |   1 +
 include/net/rtnetlink.h            |   1 +
 include/net/vxlan.h                |  85 -
 include/uapi/linux/fib_rules.h     |   2 +-
 include/uapi/linux/if_link.h       |   1 +
 include/uapi/linux/lwtunnel.h      |  16 +
 include/uapi/linux/mpls_iptunnel.h |  28 ++
 include/uapi/linux/openvswitch.h   |   2 +-
 include/uapi/linux/rtnetlink.h     |  17 +
 net/Kconfig                        |   7 +
 net/core/Makefile                  |   1 +
 net/core/dev.c                     |   2 +-
 net/core/dst.c                     |  84 -
 net/core/fib_rules.c               |  24 +-
 net/core/lwtunnel.c                | 235
 net/core/rtnetlink.c               |  26 +-
 net/ipv4/arp.c                     |  65 ++--
 net/ipv4/fib_frontend.c            |   8 +
 net/ipv4/fib_semantics.c           |  96 -
 net/ipv4/icmp.c                    |   1 +
 net/ipv4/ip_input.c                |   3 +-
 net/ipv4/ip_tunnel_core.c          | 130
[PATCH net-next 06/22] ipv4: redirect dst output to lwtunnel output
From: Roopa Prabhu ro...@cumulusnetworks.com

For input routes with tunnel encap state this patch redirects dst output functions to lwtunnel_output which later resolves to the corresponding lwtunnel output function.

This has been tested to work with mpls ip tunnels.

Open items: Support for tunnel mtu, pmtu, fragmentation can be added by hooking into the corresponding (ipv4, ipv6) dst ops. We may do this differently when lwtstate moves to dst or dst_metadata as per upstream discussions.

Signed-off-by: Roopa Prabhu ro...@cumulusnetworks.com
---
 net/ipv4/route.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 226570b..cd3157c 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1633,6 +1633,8 @@ static int __mkroute_input(struct sk_buff *skb,
 	rth->dst.output = ip_output;
 
 	rt_set_nexthop(rth, daddr, res, fnhe, res->fi, res->type, itag);
+	if (lwtunnel_output_redirect(rth->rt_lwtstate))
+		rth->dst.output = lwtunnel_output;
 	skb_dst_set(skb, &rth->dst);
 out:
 	err = 0;
-- 
2.4.3
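The redirect in this patch boils down to conditionally swapping a function pointer on the route. The userspace model below is a sketch of that pattern only; the types (packet, tunnel_state, route), the flag name and the helpers are simplified stand-ins invented for illustration, not the kernel structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define STATE_OUTPUT_REDIRECT	0x1	/* stand-in for the lwtunnel state flag */

struct packet {
	bool encapsulated;
};

struct tunnel_state {
	unsigned int flags;
};

struct route {
	int (*output)(struct packet *pkt);
	const struct tunnel_state *state;
};

/* Default output path: send the packet as-is. */
static int plain_output(struct packet *pkt)
{
	pkt->encapsulated = false;
	return 0;
}

/* Tunnel output path: a real implementation would push encap headers. */
static int tunnel_output(struct packet *pkt)
{
	pkt->encapsulated = true;
	return 0;
}

/* True only if tunnel state exists and asks for the output redirect. */
static bool output_redirect(const struct tunnel_state *state)
{
	return state && (state->flags & STATE_OUTPUT_REDIRECT);
}

/* Set the default output first, then override it when the state asks. */
static void route_init(struct route *rt, const struct tunnel_state *state)
{
	assert(rt != NULL);

	rt->state = state;
	rt->output = plain_output;
	if (output_redirect(state))
		rt->output = tunnel_output;
}
```

Routes without tunnel state keep the default output, so the common path pays only for one NULL check and one flag test at route creation time.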
[PATCH net-next 10/22] ip_tunnel: Make ovs_tunnel_info and ovs_key_ipv4_tunnel generic
Rename the tunnel metadata data structures currently internal to OVS and make them generic for use by all IP tunnels. Both structures are kernel internal and will stay that way. Their members are exposed to user space through individual Netlink attributes by OVS. It will therefore be possible to extend/modify these structures without affecting user ABI. Signed-off-by: Thomas Graf tg...@suug.ch --- include/net/ip_tunnels.h | 63 + include/uapi/linux/openvswitch.h | 2 +- net/openvswitch/actions.c| 2 +- net/openvswitch/datapath.h | 5 +-- net/openvswitch/flow.c | 4 +-- net/openvswitch/flow.h | 76 ++-- net/openvswitch/flow_netlink.c | 16 - net/openvswitch/flow_netlink.h | 2 +- net/openvswitch/vport-geneve.c | 17 + net/openvswitch/vport-gre.c | 16 - net/openvswitch/vport-vxlan.c| 18 +- net/openvswitch/vport.c | 30 net/openvswitch/vport.h | 12 +++ 13 files changed, 128 insertions(+), 135 deletions(-) diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h index d8214cb..6b9d559 100644 --- a/include/net/ip_tunnels.h +++ b/include/net/ip_tunnels.h @@ -22,6 +22,28 @@ /* Keep error state on tunnel for 30 sec */ #define IPTUNNEL_ERR_TIMEO (30*HZ) +/* Used to memset ip_tunnel padding. */ +#define IP_TUNNEL_KEY_SIZE \ + (offsetof(struct ip_tunnel_key, tp_dst) + \ +FIELD_SIZEOF(struct ip_tunnel_key, tp_dst)) + +struct ip_tunnel_key { + __be64 tun_id; + __be32 ipv4_src; + __be32 ipv4_dst; + __be16 tun_flags; + __u8ipv4_tos; + __u8ipv4_ttl; + __be16 tp_src; + __be16 tp_dst; +} __packed __aligned(4); /* Minimize padding. 
*/ + +struct ip_tunnel_info { + struct ip_tunnel_keykey; + const void *options; + u8 options_len; +}; + /* 6rd prefix/relay information */ #ifdef CONFIG_IPV6_SIT_6RD struct ip_tunnel_6rd_parm { @@ -136,6 +158,47 @@ int ip_tunnel_encap_add_ops(const struct ip_tunnel_encap_ops *op, int ip_tunnel_encap_del_ops(const struct ip_tunnel_encap_ops *op, unsigned int num); +static inline void __ip_tunnel_info_init(struct ip_tunnel_info *tun_info, +__be32 saddr, __be32 daddr, +u8 tos, u8 ttl, +__be16 tp_src, __be16 tp_dst, +__be64 tun_id, __be16 tun_flags, +const void *opts, u8 opts_len) +{ + tun_info-key.tun_id = tun_id; + tun_info-key.ipv4_src = saddr; + tun_info-key.ipv4_dst = daddr; + tun_info-key.ipv4_tos = tos; + tun_info-key.ipv4_ttl = ttl; + tun_info-key.tun_flags = tun_flags; + + /* For the tunnel types on the top of IPsec, the tp_src and tp_dst of +* the upper tunnel are used. +* E.g: GRE over IPSEC, the tp_src and tp_port are zero. +*/ + tun_info-key.tp_src = tp_src; + tun_info-key.tp_dst = tp_dst; + + /* Clear struct padding. */ + if (sizeof(tun_info-key) != IP_TUNNEL_KEY_SIZE) + memset((unsigned char *)tun_info-key + IP_TUNNEL_KEY_SIZE, + 0, sizeof(tun_info-key) - IP_TUNNEL_KEY_SIZE); + + tun_info-options = opts; + tun_info-options_len = opts_len; +} + +static inline void ip_tunnel_info_init(struct ip_tunnel_info *tun_info, + const struct iphdr *iph, + __be16 tp_src, __be16 tp_dst, + __be64 tun_id, __be16 tun_flags, + const void *opts, u8 opts_len) +{ + __ip_tunnel_info_init(tun_info, iph-saddr, iph-daddr, + iph-tos, iph-ttl, tp_src, tp_dst, + tun_id, tun_flags, opts, opts_len); +} + #ifdef CONFIG_INET int ip_tunnel_init(struct net_device *dev); diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index 1dab776..d6b8854 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -321,7 +321,7 @@ enum ovs_key_attr { * the accepted length of the array. 
*/ #ifdef __KERNEL__ - OVS_KEY_ATTR_TUNNEL_INFO, /* struct ovs_tunnel_info */ + OVS_KEY_ATTR_TUNNEL_INFO, /* struct ip_tunnel_info */ #endif __OVS_KEY_ATTR_MAX }; diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c index 8a8c0b8..27c1687 100644 --- a/net/openvswitch/actions.c +++ b/net/openvswitch/actions.c @@ -611,7
[PATCH net-next 02/22] lwtunnel: infrastructure for handling light weight tunnels like mpls
From: Roopa Prabhu ro...@cumulusnetworks.com Provides infrastructure to parse/dump/store encap information for light weight tunnels like mpls. Encap information for such tunnels is associated with fib routes. This infrastructure is based on previous suggestions from Eric Biederman to follow the xfrm infrastructure. Signed-off-by: Roopa Prabhu ro...@cumulusnetworks.com --- include/linux/lwtunnel.h | 6 ++ include/net/lwtunnel.h| 132 +++ include/uapi/linux/lwtunnel.h | 15 net/Kconfig | 7 ++ net/core/Makefile | 1 + net/core/lwtunnel.c | 179 ++ 6 files changed, 340 insertions(+) create mode 100644 include/linux/lwtunnel.h create mode 100644 include/net/lwtunnel.h create mode 100644 include/uapi/linux/lwtunnel.h create mode 100644 net/core/lwtunnel.c diff --git a/include/linux/lwtunnel.h b/include/linux/lwtunnel.h new file mode 100644 index 000..97f32f8 --- /dev/null +++ b/include/linux/lwtunnel.h @@ -0,0 +1,6 @@ +#ifndef _LINUX_LWTUNNEL_H_ +#define _LINUX_LWTUNNEL_H_ + +#include uapi/linux/lwtunnel.h + +#endif /* _LINUX_LWTUNNEL_H_ */ diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h new file mode 100644 index 000..df24b36 --- /dev/null +++ b/include/net/lwtunnel.h @@ -0,0 +1,132 @@ +#ifndef __NET_LWTUNNEL_H +#define __NET_LWTUNNEL_H 1 + +#include linux/lwtunnel.h +#include linux/netdevice.h +#include linux/skbuff.h +#include linux/types.h +#include net/route.h + +#define LWTUNNEL_HASH_BITS 7 +#define LWTUNNEL_HASH_SIZE (1 LWTUNNEL_HASH_BITS) + +/* lw tunnel state flags */ +#define LWTUNNEL_STATE_OUTPUT_REDIRECT 0x1 + +struct lwtunnel_state { + __u16 type; + __u16 flags; + atomic_trefcnt; + int len; + __u8data[0]; +}; + +struct lwtunnel_encap_ops { + int (*build_state)(struct net_device *dev, struct nlattr *encap, + struct lwtunnel_state **ts); + int (*output)(struct sock *sk, struct sk_buff *skb); + int (*fill_encap)(struct sk_buff *skb, + struct lwtunnel_state *lwtstate); + int (*get_encap_size)(struct lwtunnel_state *lwtstate); + int (*cmp_encap)(struct 
lwtunnel_state *a, struct lwtunnel_state *b); +}; + +extern const struct lwtunnel_encap_ops __rcu * + lwtun_encaps[LWTUNNEL_ENCAP_MAX+1]; + +#ifdef CONFIG_LWTUNNEL +static inline void lwtunnel_state_get(struct lwtunnel_state *lws) +{ + atomic_inc(lws-refcnt); +} + +static inline void lwtunnel_state_put(struct lwtunnel_state *lws) +{ + if (!lws) + return; + + if (atomic_dec_and_test(lws-refcnt)) + kfree(lws); +} + +static inline bool lwtunnel_output_redirect(struct lwtunnel_state *lwtstate) +{ + if (lwtstate (lwtstate-flags LWTUNNEL_STATE_OUTPUT_REDIRECT)) + return true; + + return false; +} + +int lwtunnel_encap_add_ops(const struct lwtunnel_encap_ops *op, + unsigned int num); +int lwtunnel_encap_del_ops(const struct lwtunnel_encap_ops *op, + unsigned int num); +int lwtunnel_build_state(struct net_device *dev, u16 encap_type, +struct nlattr *encap, +struct lwtunnel_state **lws); +int lwtunnel_fill_encap(struct sk_buff *skb, + struct lwtunnel_state *lwtstate); +int lwtunnel_get_encap_size(struct lwtunnel_state *lwtstate); +struct lwtunnel_state *lwtunnel_state_alloc(int hdr_len); +int lwtunnel_cmp_encap(struct lwtunnel_state *a, struct lwtunnel_state *b); + +#else + +static inline void lwtunnel_state_get(struct lwtunnel_state *lws) +{ +} + +static inline void lwtunnel_state_put(struct lwtunnel_state *lws) +{ +} + +static inline bool lwtunnel_output_redirect(struct lwtunnel_state *lwtstate) +{ + return false; +} + +static inline int lwtunnel_encap_add_ops(const struct lwtunnel_encap_ops *op, +unsigned int num) +{ + return -EOPNOTSUPP; + +} + +static inline int lwtunnel_encap_del_ops(const struct lwtunnel_encap_ops *op, +unsigned int num) +{ + return -EOPNOTSUPP; +} + +static inline int lwtunnel_build_state(struct net_device *dev, u16 encap_type, + struct nlattr *encap, + struct lwtunnel_state **lws) +{ + return -EOPNOTSUPP; +} + +static inline int lwtunnel_fill_encap(struct sk_buff *skb, + struct lwtunnel_state *lwtstate) +{ + return 0; +} + +static inline int 
lwtunnel_get_encap_size(struct lwtunnel_state *lwtstate) +{ + return 0; +} + +static inline struct lwtunnel_state *lwtunnel_state_alloc(int hdr_len) +{ + return NULL; +} + +static inline int lwtunnel_cmp_encap(struct lwtunnel_state *a, +
[PATCH net-next 07/22] ipv6: rt6_info output redirect to tunnel output
From: Roopa Prabhu ro...@cumulusnetworks.com

This is similar to ipv4 redirect of dst output to lwtunnel output function for encapsulation and xmit.

Signed-off-by: Roopa Prabhu ro...@cumulusnetworks.com
---
 net/ipv6/route.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index b3431b7..7f2214f 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1780,6 +1780,7 @@ int ip6_route_add(struct fib6_config *cfg)
 			goto out;
 		lwtunnel_state_get(lwtstate);
 		rt->rt6i_lwtstate = lwtstate;
+		rt->dst.output = lwtunnel_output6;
 	}
 
 	ipv6_addr_prefix(&rt->rt6i_dst.addr, &cfg->fc_dst, cfg->fc_dst_len);
-- 
2.4.3
[PATCH net-next 05/22] lwtunnel: support dst output redirect function
From: Roopa Prabhu ro...@cumulusnetworks.com

This patch introduces lwtunnel_output function to call corresponding lwtunnels output function to xmit the packet.

It adds two variants lwtunnel_output and lwtunnel_output6 for ipv4 and ipv6 respectively today. But this is subject to change when lwtstate will reside in dst or dst_metadata (as per upstream discussions).

Signed-off-by: Roopa Prabhu ro...@cumulusnetworks.com
---
 include/net/lwtunnel.h | 12 +++
 net/core/lwtunnel.c    | 56 ++
 2 files changed, 68 insertions(+)

diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
index df24b36..918e03c 100644
--- a/include/net/lwtunnel.h
+++ b/include/net/lwtunnel.h
@@ -69,6 +69,8 @@ int lwtunnel_fill_encap(struct sk_buff *skb,
 int lwtunnel_get_encap_size(struct lwtunnel_state *lwtstate);
 struct lwtunnel_state *lwtunnel_state_alloc(int hdr_len);
 int lwtunnel_cmp_encap(struct lwtunnel_state *a, struct lwtunnel_state *b);
+int lwtunnel_output(struct sock *sk, struct sk_buff *skb);
+int lwtunnel_output6(struct sock *sk, struct sk_buff *skb);
 
 #else
 
@@ -127,6 +129,16 @@ static inline int lwtunnel_cmp_encap(struct lwtunnel_state *a,
 	return 0;
 }
 
+static inline int lwtunnel_output(struct sock *sk, struct sk_buff *skb)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int lwtunnel_output6(struct sock *sk, struct sk_buff *skb)
+{
+	return -EOPNOTSUPP;
+}
+
 #endif
 
 #endif /* __NET_LWTUNNEL_H */
diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index d7ae3a2..bb58826 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -25,6 +25,7 @@
 
 #include <net/lwtunnel.h>
 #include <net/rtnetlink.h>
+#include <net/ip6_fib.h>
 
 struct lwtunnel_state *lwtunnel_state_alloc(int encap_len)
 {
@@ -177,3 +178,58 @@ int lwtunnel_cmp_encap(struct lwtunnel_state *a, struct lwtunnel_state *b)
 	return ret;
 }
 EXPORT_SYMBOL(lwtunnel_cmp_encap);
+
+int __lwtunnel_output(struct sock *sk, struct sk_buff *skb,
+		      struct lwtunnel_state *lwtstate)
+{
+	const struct lwtunnel_encap_ops *ops;
+	int ret = -EINVAL;
+
+	if (!lwtstate)
+		goto drop;
+
+	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
+	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
+		return 0;
+
+	ret = -EOPNOTSUPP;
+	rcu_read_lock();
+	ops = rcu_dereference(lwtun_encaps[lwtstate->type]);
+	if (likely(ops && ops->output))
+		ret = ops->output(sk, skb);
+	rcu_read_unlock();
+
+	if (ret == -EOPNOTSUPP)
+		goto drop;
+
+	return ret;
+
+drop:
+	kfree(skb);
+
+	return ret;
+}
+
+int lwtunnel_output6(struct sock *sk, struct sk_buff *skb)
+{
+	struct rt6_info *rt = (struct rt6_info *)skb_dst(skb);
+	struct lwtunnel_state *lwtstate = NULL;
+
+	if (rt)
+		lwtstate = rt->rt6i_lwtstate;
+
+	return __lwtunnel_output(sk, skb, lwtstate);
+}
+EXPORT_SYMBOL(lwtunnel_output6);
+
+int lwtunnel_output(struct sock *sk, struct sk_buff *skb)
+{
+	struct rtable *rt = (struct rtable *)skb_dst(skb);
+	struct lwtunnel_state *lwtstate = NULL;
+
+	if (rt)
+		lwtstate = rt->rt_lwtstate;
+
+	return __lwtunnel_output(sk, skb, lwtstate);
+}
+EXPORT_SYMBOL(lwtunnel_output);
-- 
2.4.3
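The dispatch in this patch looks up the per-encap ops by type and calls their output hook if one is registered. A simplified userspace model of that lookup follows. It is a sketch under assumptions: there is no RCU or locking, the packet is reduced to an int, and all names (encap_ops, encap_add_ops, encap_output, demo_mpls_output) as well as the return codes of encap_add_ops are invented for this example rather than taken from the lwtunnel code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum {
	ENCAP_NONE,
	ENCAP_MPLS,
	ENCAP_IP,
	__ENCAP_MAX,
};
#define ENCAP_MAX (__ENCAP_MAX - 1)

struct encap_ops {
	int (*output)(int pkt);
};

/* One slot per encap type, NULL until a handler registers. */
static const struct encap_ops *encaps[ENCAP_MAX + 1];

/* Register ops for a type; refuses invalid types and occupied slots. */
static int encap_add_ops(const struct encap_ops *ops, unsigned int type)
{
	assert(ops != NULL);

	if (type == ENCAP_NONE || type > ENCAP_MAX)
		return -ERANGE;
	if (encaps[type])
		return -EEXIST;

	encaps[type] = ops;
	return 0;
}

/*
 * Dispatch to the registered output hook. As in the patch, the "none"
 * type and out-of-range types return 0, and a missing handler yields
 * -EOPNOTSUPP.
 */
static int encap_output(unsigned int type, int pkt)
{
	const struct encap_ops *ops;

	if (type == ENCAP_NONE || type > ENCAP_MAX)
		return 0;

	ops = encaps[type];
	if (ops && ops->output)
		return ops->output(pkt);

	return -EOPNOTSUPP;
}

/* Example handler: pretends to transmit and returns pkt + 1. */
static int demo_mpls_output(int pkt)
{
	return pkt + 1;
}

static const struct encap_ops demo_mpls_ops = {
	.output = demo_mpls_output,
};
```

Indexing a fixed array by type keeps the per-packet cost to a bounds check and one indirect call, which is why the type has to be validated before the table is touched.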
[PATCH net-next 01/22] rtnetlink: introduce new RTA_ENCAP_TYPE and RTA_ENCAP attributes
From: Roopa Prabhu ro...@cumulusnetworks.com

This patch introduces two new RTA attributes to attach encap data to fib routes.

Example iproute2 command to attach mpls encap data to ipv4 routes:

$ ip route add 10.1.1.0/30 encap mpls 200 via inet 10.1.1.1 dev swp1

Signed-off-by: Roopa Prabhu ro...@cumulusnetworks.com
Suggested-by: Eric W. Biederman ebied...@xmission.com
---
 include/uapi/linux/rtnetlink.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index fdd8f07..0d3d3cc 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -308,6 +308,8 @@ enum rtattr_type_t {
 	RTA_VIA,
 	RTA_NEWDST,
 	RTA_PREF,
+	RTA_ENCAP_TYPE,
+	RTA_ENCAP,
 	__RTA_MAX
 };
 
-- 
2.4.3
pull-request: mac80211 2015-07-17
Hi Dave,

We've accumulated some wireless fixes, please pull. Arik's fix is a bit
bigger than I might like, but it fixes a real locking issue and we didn't
really see a good way to make a smaller version.

Let me know if there's any problem.

johannes

The following changes since commit f760b87f8f12eb262f14603e65042996fe03720e:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2015-07-13 11:18:25 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git tags/mac80211-for-davem-2015-07-17

for you to fetch changes up to 923b352f19d9ea971ae2536eab55f5fc9e95fedf:

  cfg80211: use RTNL locked reg_can_beacon for IR-relaxation (2015-07-17 15:02:02 +0200)

Some fixes for the current cycle:
 1. Arik introduced an rtnl-locked regulatory API to be able to
    differentiate between places that do/don't hold the RTNL; this
    fixes missing locking in some of the code paths.
 2. Two small mesh bugfixes from Bob, one adding a missing length check
    so that a certain malformed over-the-air frame is not processed,
    and one to avoid sending a garbage field over the air.
 3. A fix for powersave during WoWLAN suspend from Krishna Chaitanya.
 4. A fix for a powersave vs. aggregation teardown race, from Michal.
 5. Thomas reduced the loglevel of CRDA messages to avoid spamming the
    kernel log with mostly irrelevant information.
 6. Tom fixed a dangling debugfs directory pointer that could cause
    crashes if subsequent addition of the same interface to debugfs
    failed for some reason.
 7. A fix from myself for a list corruption issue in mac80211 during
    combined interface shutdown/removal: shut down interfaces first
    and only then remove them to avoid that.

Arik Nemtsov (1):
      cfg80211: use RTNL locked reg_can_beacon for IR-relaxation

Bob Copeland (2):
      mac80211: correct aid location in peering frames
      mac80211: add missing length check for confirm frames

Chaitanya T K (1):
      mac80211: wowlan: enable powersave if suspend while ps-polling

Johannes Berg (1):
      mac80211: shut down interfaces before destroying interface list

Michal Kazior (1):
      mac80211: don't clear all tx flags when requeing

Thomas Petazzoni (1):
      wireless: regulatory: reduce log level of CRDA related messages

Tom Hughes (1):
      mac80211: clear subdir_stations when removing debugfs

 include/net/cfg80211.h        | 17
 net/mac80211/debugfs_netdev.c |  1
 net/mac80211/iface.c          | 25
 net/mac80211/mesh_plink.c     |  5
 net/mac80211/pm.c             | 16
 net/mac80211/tdls.c           |  6
 net/mac80211/tx.c             |  4
 net/wireless/chan.c           | 45
 net/wireless/nl80211.c        | 14
 net/wireless/reg.c            |  8
 net/wireless/trace.h          | 11
 11 files changed, 111 insertions(+), 41 deletions(-)
RE: [PATCH v7 3/3] ixgbe, ixgbevf: Add new mbox API xcast mode
> -----Original Message-----
> From: Hiroshi Shimamoto [mailto:h-shimam...@ct.jp.nec.com]
> Sent: Thursday, July 16, 2015 3:36 AM
> To: Alexander Duyck; Skidmore, Donald C; Rose, Gregory V; Kirsher, Jeffrey T;
>     intel-wired-...@lists.osuosl.org
> Cc: nhor...@redhat.com; jogre...@redhat.com; Linux Netdev List; Choi, Sy Jong;
>     Rony Efraim; Or Gerlitz; Edward Cree; David Miller; sassm...@redhat.com
> Subject: [PATCH v7 3/3] ixgbe, ixgbevf: Add new mbox API xcast mode
>
> From: Hiroshi Shimamoto <h-shimam...@ct.jp.nec.com>
>
> The limitation of the number of multicast address for VF is not enough
> for the large scale server with SR-IOV feature. IPv6 requires the multicast
> MAC address for each IP address to handle the Neighbor Solicitation
> message. We couldn't assign over 30 IPv6 addresses to a single VF.
>
> This patch introduces the new mailbox API, IXGBE_VF_UPDATE_XCAST_MODE,
> to update multicast mode of VF. This adds 3 modes;
>   - NONE     only L2 exact match addresses or Flow Director enabled
>   - MULTI    BAM and ROMPE set
>   - ALLMULTI BAM, ROMPE and MPE set
>
> If a guest VF user wants over 30 MAC multicast addresses, set IFF_ALLMULTI
> to request PF to update xcast mode to enable VF multicast promiscuous mode.
>
> On the other hand, enabling VF multicast promiscuous mode may affect
> security and performance in the network of the NIC. Only trusted VF can
> enable multicast promiscuous mode. The behavior of untrusted VF is the
> same as previous version.
>
> Signed-off-by: Hiroshi Shimamoto <h-shimam...@ct.jp.nec.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe.h          |  7
>  drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h      |  2
>  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c    | 59
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      |  6
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  8
>  drivers/net/ethernet/intel/ixgbevf/mbx.h          |  2
>  drivers/net/ethernet/intel/ixgbevf/vf.c           | 41
>  drivers/net/ethernet/intel/ixgbevf/vf.h           |  1
>  8 files changed, 126 insertions(+)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> index fb72622..17250ef 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> @@ -153,9 +153,16 @@ struct vf_data_storage {
>  	u8 spoofchk_enabled;
>  	bool rss_query_enabled;
>  	u8 trusted;
> +	int xcast_mode;
>  	unsigned int vf_api;
>  };
>
> +enum ixgbevf_xcast_modes {
> +	IXGBEVF_XCAST_MODE_NONE = 0,
> +	IXGBEVF_XCAST_MODE_MULTI,
> +	IXGBEVF_XCAST_MODE_ALLMULTI,
> +};
> +
>  struct vf_macvlans {
>  	struct list_head l;
>  	int vf;
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
> index b1e4703..8daa95f 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
> @@ -102,6 +102,8 @@ enum ixgbe_pfvf_api_rev {
>  #define IXGBE_VF_GET_RETA	0x0a	/* VF request for RETA */
>  #define IXGBE_VF_GET_RSS_KEY	0x0b	/* get RSS key */
>
> +#define IXGBE_VF_UPDATE_XCAST_MODE	0x0c
> +
>  /* length of permanent address message returned from PF */
>  #define IXGBE_VF_PERMADDR_MSG_LEN 4
>  /* word in permanent address message with the current multicast type */
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> index 65aeb58..ac071e5 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> @@ -119,6 +119,9 @@ static int __ixgbe_enable_sriov(struct ixgbe_adapter *adapter)
>
>  			/* Untrust all VFs */
>  			adapter->vfinfo[i].trusted = false;
> +
> +			/* set the default xcast mode */
> +			adapter->vfinfo[i].xcast_mode = IXGBEVF_XCAST_MODE_NONE;
>  		}
>
>  		return 0;
> @@ -1004,6 +1007,59 @@ static int ixgbe_get_vf_rss_key(struct ixgbe_adapter *adapter,
>  	return 0;
>  }
>
> +static int ixgbe_update_vf_xcast_mode(struct ixgbe_adapter *adapter,
> +				      u32 *msgbuf, u32 vf)
> +{
> +	struct ixgbe_hw *hw = &adapter->hw;
> +	int xcast_mode = msgbuf[1];
> +	u32 vmolr, disable, enable;
> +
> +	/* verify the PF is supporting the correct APIs */
> +	switch (adapter->vfinfo[vf].vf_api) {
> +	case ixgbe_mbox_api_12:
> +		break;
> +	default:
> +		return -1;

Shouldn't you return -EOPNOTSUPP?

> +	}
> +
> +	if (xcast_mode > IXGBEVF_XCAST_MODE_MULTI &&
> +	    !adapter->vfinfo[vf].trusted) {
> +		xcast_mode = IXGBEVF_XCAST_MODE_MULTI;
> +	}
> +
> +	if (adapter->vfinfo[vf].xcast_mode == xcast_mode)
> +		goto out;
> +
> +	switch (xcast_mode) {
> +	case IXGBEVF_XCAST_MODE_NONE:
> +		disable = IXGBE_VMOLR_BAM | IXGBE_VMOLR_ROMPE | IXGBE_VMOLR_MPE;
> +		enable = 0;
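The trust check in the quoted patch can be read in isolation as a small clamping rule: an untrusted VF that asks for anything above MULTI is silently given MULTI. The sketch below models that rule and the mode-to-filter mapping from the commit message in plain userspace C. The `DEMO_*` flag values and the function names are assumptions made for illustration; they are not the driver's register bits or functions.

```c
#include <stdbool.h>

enum xcast_mode {		/* mirrors the enum added by the patch */
	XCAST_MODE_NONE = 0,
	XCAST_MODE_MULTI,
	XCAST_MODE_ALLMULTI,
};

/* Hypothetical flag bits standing in for IXGBE_VMOLR_BAM/ROMPE/MPE. */
#define DEMO_BAM	0x1u
#define DEMO_ROMPE	0x2u
#define DEMO_MPE	0x4u

/* An untrusted VF may not go beyond MULTI. */
static enum xcast_mode effective_mode(enum xcast_mode requested, bool trusted)
{
	if (requested > XCAST_MODE_MULTI && !trusted)
		return XCAST_MODE_MULTI;
	return requested;
}

/* Filter bits enabled for each mode, per the commit message. */
static unsigned int mode_to_flags(enum xcast_mode mode)
{
	switch (mode) {
	case XCAST_MODE_MULTI:
		return DEMO_BAM | DEMO_ROMPE;
	case XCAST_MODE_ALLMULTI:
		return DEMO_BAM | DEMO_ROMPE | DEMO_MPE;
	case XCAST_MODE_NONE:
	default:
		return 0;
	}
}
```

Note that the clamp is silent: the VF is not told that its request was downgraded, which is a separate question from the error code raised in the review comment.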
Re: [V2 6/7] hvsock: introduce Hyper-V VM Sockets feature
Dexuan Cui <de...@microsoft.com> writes:

>> From: David Miller
>> Sent: Thursday, July 16, 2015 12:19
>>
>> From: Dexuan Cui
>> Date: Tue, 14 Jul 2015 03:00:48 -0700
>>
>> > +	pr_debug("hvsock_sk_destruct: called\n");
>>
>> Debug logging just to state that a function is called is not
>> appropriate, we have very sophisticated tracing facilities in the
>> kernel that can do that transparently, and more.
>>
>> Please remove this.
>
> OK.
>
>> > +	if (hvsk->channel) {
>> > +		pr_debug("hvsock_sk_destruct: calling vmbus_close()\n");
>>
>> Likewise, these kinds of debug logs are totally inappropriate.
>
> OK, I'll remove all the pr_debug() in the patch.

I'd suggest we rather use something like net_dbg_ratelimited() instead.
The driver is new so issues are expected. Some debugging may be useful.

[...]

-- 
  Vitaly
[PATCH -next] net: fib: use fib result when zero-length prefix aliases exist
default route selection is not deterministic when TOS keys are used:

ip route del default
ip route add tos 0x00 via 10.2.100.100
ip route add tos 0x04 via 10.2.100.101
ip route add tos 0x08 via 10.2.100.102
ip route add tos 0x0C via 10.2.100.103
ip route add tos 0x10 via 10.2.100.104

[ i.e. 5 routes with prefix length 0, differentiated via TOS key ]

ip route get 10.3.1.1 tos 0x4 -> 10.2.100.101
ip route get 10.3.1.1 tos 0x8 -> 10.2.100.102
ip route get tos 0x0C -> 10.2.100.103

But for 0x10, we'll get round-robin results among all the aliases.
Repeated queries return .100, 101, 102, etc. in turn.

This behaviour is not new -- fib_select_default can be traced back
to fn_hash_select_default in CVS history.

Routing cache made 'round-robin' behaviour less visible.

This changes fib_select_default to not change the FIB chosen result
EXCEPT if this nexthop appears to be unreachable.

fib_detect_death() logic is reversed -- we consider a nexthop 'dead' only
if it has a neigh entry in unreachable state.

Only then we search fib_aliases for an alternative and use one of these
in a round-robin fashion.

If all are believed to be unreachable, no change is made and fib-chosen
nh_gw is used.

Reported-by: Hagen Paul Pfeifer <ha...@jauu.net>
Cc: Alexander Duyck <alexander.h.du...@redhat.com>
Signed-off-by: Florian Westphal <f...@strlen.de>
---
 net/ipv4/fib_semantics.c | 71
 1 file changed, 36 insertions(+), 35 deletions(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index c7358ea..83b485b 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -410,28 +410,24 @@ errout:
 		rtnl_set_sk_err(info->nl_net, RTNLGRP_IPV4_ROUTE, err);
 }

-static int fib_detect_death(struct fib_info *fi, int order,
-			    struct fib_info **last_resort, int *last_idx,
-			    int dflt)
+static bool fib_nud_is_unreach(const struct fib_info *fi)
 {
 	struct neighbour *n;
 	int state = NUD_NONE;

-	n = neigh_lookup(&arp_tbl, &fi->fib_nh[0].nh_gw, fi->fib_dev);
-	if (n) {
+	local_bh_disable();
+
+	n = __neigh_lookup_noref(&arp_tbl, &fi->fib_nh[0].nh_gw, fi->fib_dev);
+	if (n)
 		state = n->nud_state;
-		neigh_release(n);
-	}
-	if (state == NUD_REACHABLE)
-		return 0;
-	if ((state & NUD_VALID) && order != dflt)
-		return 0;
-	if ((state & NUD_VALID) ||
-	    (*last_idx < 0 && order > dflt)) {
-		*last_resort = fi;
-		*last_idx = order;
-	}
-	return 1;
+
+	local_bh_enable();
+
+	/* Caller might be able to find alternate (reachable) nexthop */
+	if (state & (NUD_INCOMPLETE | NUD_FAILED))
+		return true;
+
+	return false;
 }

 #ifdef CONFIG_IP_ROUTE_MULTIPATH
@@ -1204,12 +1200,17 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long event)
 /* Must be invoked inside of an RCU protected region. */
 void fib_select_default(struct fib_result *res)
 {
-	struct fib_info *fi = NULL, *last_resort = NULL;
 	struct hlist_head *fa_head = res->fa_head;
+	struct fib_info *last_resort = NULL;
 	struct fib_table *tb = res->table;
 	int order = -1, last_idx = -1;
 	struct fib_alias *fa;
+	bool unreach = fib_nud_is_unreach(res->fi);

+	if (likely(!unreach))
+		return;
+
+	/* attempt to pick another nexthop */
 	hlist_for_each_entry_rcu(fa, fa_head, fa_list) {
 		struct fib_info *next_fi = fa->fa_info;

@@ -1223,33 +1224,33 @@ void fib_select_default(struct fib_result *res)
 		    next_fi->fib_nh[0].nh_scope != RT_SCOPE_LINK)
 			continue;

+		order++;
+
+		if (next_fi == res->fi) /* already tested, not reachable */
+			continue;
+
 		fib_alias_accessed(fa);

-		if (!fi) {
-			if (next_fi != res->fi)
+		unreach = fib_nud_is_unreach(next_fi);
+		if (unreach)
+			continue;
+
+		/* try to round-robin among all fa_aliases in case
+		 * res->fi nexthop is unreachable.
+		 */
+		if (last_idx < 0 || order > tb->tb_default) {
+			last_resort = next_fi;
+			last_idx = order;
+			if (order > tb->tb_default)
 				break;
-		} else if (!fib_detect_death(fi, order, &last_resort,
-					     &last_idx, tb->tb_default)) {
-			fib_result_assign(res, fi);
-			tb->tb_default = order;
-			goto out;
 		}
-		fi = next_fi;
-		order++;
 	}

-	if (order <= 0 || !fi) {
+	if (order < 0) {
 		tb->tb_default = -1;
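The selection policy described in the commit message can be modelled outside the kernel as follows. This is a simplified sketch, not the patch's code: it keeps the FIB-chosen nexthop unless its neighbour state looks unreachable, then falls back to the first alternative that does not look unreachable, and it leaves out the round-robin bookkeeping via `tb_default`. The NUD_* values match the kernel's neighbour state bits; `nud_is_unreach` and `pick_nexthop` are illustrative names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Neighbour state bits (same values as the kernel's NUD_* definitions). */
#define NUD_NONE	0x00
#define NUD_INCOMPLETE	0x01
#define NUD_REACHABLE	0x02
#define NUD_STALE	0x04
#define NUD_FAILED	0x20

/* A nexthop counts as dead only if resolution is pending or has failed. */
static bool nud_is_unreach(int state)
{
	return (state & (NUD_INCOMPLETE | NUD_FAILED)) != 0;
}

/* states[chosen] is the FIB result; returns the index to actually use. */
static size_t pick_nexthop(const int *states, size_t n, size_t chosen)
{
	size_t i;

	if (!nud_is_unreach(states[chosen]))
		return chosen;		/* common case: keep the FIB result */

	for (i = 0; i < n; i++) {
		if (i != chosen && !nud_is_unreach(states[i]))
			return i;	/* first alternative that may work */
	}
	return chosen;			/* all look unreachable: no change */
}
```

The important difference from the old fib_detect_death() logic is visible in nud_is_unreach(): a nexthop with no neighbour entry at all (NUD_NONE) or a stale one is no longer treated as a reason to look elsewhere.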
RE: [PATCH 0/7] introduce Hyper-V VM Sockets(hvsock)
> -----Original Message-----
> From: Stefan Hajnoczi
> Sent: Thursday, July 16, 2015 23:59
>
> On Mon, Jul 06, 2015 at 07:39:35AM -0700, Dexuan Cui wrote:
> > Hyper-V VM Sockets (hvsock) is a byte-stream based communication
> > mechanism between Windows 10 (or later) host and a guest. It's kind of
> > TCP over VMBus, but the transportation layer (VMBus) is much simpler
> > than IP. With Hyper-V VM Sockets, applications between the host and a
> > guest can talk with each other directly by the traditional BSD-style
> > socket APIs.
> >
> > The patchset implements the necessary support in the guest side by
> > adding the necessary new APIs in the vmbus driver, and introducing a
> > new driver hv_sock.ko, which implements a new socket address family
> > AF_HYPERV.
> >
> > I know the kernel has already had a VM Sockets driver (AF_VSOCK) based
> > on VMware's VMCI (net/vmw_vsock/, drivers/misc/vmw_vmci), and KVM is
> > proposing AF_VSOCK of virtio version:
> > http://thread.gmane.org/gmane.linux.network/365205.
> >
> > However, though Hyper-V VM Sockets may seem conceptually similar to
> > AF_VSOCK, there are differences in the transportation layer, and IMO
> > these make the direct code reusing impractical:
> >
> > 1. In AF_VSOCK, the endpoint type is: <u32 ContextID, u32 Port>, but in
> > AF_HYPERV, the endpoint type is: <GUID VM_ID, GUID ServiceID>. Here GUID
> > is 128-bit.
> >
> > 2. AF_VSOCK supports SOCK_DGRAM, while AF_HYPERV doesn't.
> >
> > 3. AF_VSOCK supports some special sock opts, like
> > SO_VM_SOCKETS_BUFFER_SIZE, SO_VM_SOCKETS_BUFFER_MIN/MAX_SIZE and
> > SO_VM_SOCKETS_CONNECT_TIMEOUT. These are meaningless to AF_HYPERV.
> >
> > 4. Some AF_VSOCK's VMCI transportation ops are meaningless to
> > AF_HYPERV/VMBus, like
> >     .notify_recv_init
> >     .notify_recv_pre_block
> >     .notify_recv_pre_dequeue
> >     .notify_recv_post_dequeue
> >     .notify_send_init
> >     .notify_send_pre_block
> >     .notify_send_pre_enqueue
> >     .notify_send_post_enqueue
> > etc.
> >
> > So I think we'd better introduce a new address family: AF_HYPERV.
>
> Points 2-4 are not critical. I think there are solutions to them.
>
> Point 1 is the main issue: hvsock has <GUID, GUID> addresses instead of
> vsock's <u32, u32> addresses. Perhaps a mapping could be used but that
> is pretty ugly.

Hi Stefan,
Exactly!

In the current AF_VSOCK code and the related transport layer (the wrapper
ops of VMware's VMCI), the <u32, u32> endpoint is widely used by
"struct sockaddr_vm" (this struct is exported to the user space).
So, anyway, the user space application has to explicitly handle the
different endpoint sizes.

And in the driver side, IMO there is no way to reuse the code of AF_VSOCK
with clean changes.

> One idea is something like a userspace <GUID, GUID> -> <u32, u32>
> lookup function that applications can use if they want to accept GUIDs.

Thanks for the suggestion!
While this is technically possible, IMO it would mess up the driver side's
AF_VSOCK code: in many places, we'll have to add ugly code like:

IF the endpoint size is <u32, u32> THEN
	use the existing logic;
ELSE
	use the new logic;

> I don't have a workable alternative to propose, so I agree that a new
> address family is justified.

Thanks for your exact understanding! :-)

-- Dexuan
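To make point 1 concrete, the sketch below lays out the two endpoint shapes side by side. The struct names and fields are hypothetical and exist only to show the size difference (8 bytes versus 32 bytes); they are not `struct sockaddr_vm` nor any structure from the hv_sock patches.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layouts for illustration only. */
struct demo_vsock_endpoint {
	uint32_t context_id;
	uint32_t port;
};

struct demo_guid {
	uint8_t b[16];			/* 128 bits */
};

struct demo_hvsock_endpoint {
	struct demo_guid vm_id;
	struct demo_guid service_id;
};

/* GUIDs have no natural integer form, so comparison is bytewise. */
static bool demo_guid_equal(const struct demo_guid *a,
			    const struct demo_guid *b)
{
	return memcmp(a->b, b->b, sizeof(a->b)) == 0;
}
```

Any code path that stores, hashes or compares endpoints would need a variant for each shape, which is the "IF ... ELSE" duplication mentioned in the reply above.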
Re: [PATCH net-next] rhashtable: Allow other tasks to be scheduled in large lookup loops
On Fri, 2015-07-17 at 10:07 +0200, Thomas Graf wrote:
> Depending on system speed, the large lookup loop can take a considerable
> amount of time to complete causing watchdog warnings to appear. Allow
> other tasks to be scheduled after every batch of 1000 lookups.
>
> Reported-by: Meelis Roos <mr...@linux.ee>
> Signed-off-by: Thomas Graf <tg...@suug.ch>
> ---
>  lib/test_rhashtable.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
> index c90777e..5ed6211 100644
> --- a/lib/test_rhashtable.c
> +++ b/lib/test_rhashtable.c
> @@ -20,8 +20,10 @@
>  #include <linux/rcupdate.h>
>  #include <linux/rhashtable.h>
>  #include <linux/slab.h>
> +#include <linux/sched.h>
>
>  #define MAX_ENTRIES	1000000
> +#define RELAX_CPU_AFTER	1000
>  #define TEST_INSERT_FAIL INT_MAX
>
>  static int entries = 50000;
> @@ -61,7 +63,7 @@ static struct rhashtable_params test_rht_params = {
>
>  static int __init test_rht_lookup(struct rhashtable *ht)
>  {
> -	unsigned int i;
> +	unsigned int i, relax_cnt = RELAX_CPU_AFTER;
>
>  	for (i = 0; i < entries * 2; i++) {
>  		struct test_obj *obj;
> @@ -87,6 +89,11 @@ static int __init test_rht_lookup(struct rhashtable *ht)
>  				return -EINVAL;
>  			}
>  		}
> +
> +		if (!relax_cnt--) {
> +			schedule();
> +			relax_cnt = RELAX_CPU_AFTER;
> +		}
>  	}
>
>  	return 0;

Please simply use cond_resched() without counting and magic value.
Re: [PATCH net-next] rhashtable: Allow other tasks to be scheduled in large lookup loops
On 07/17/15 at 10:28am, Eric Dumazet wrote:
> On Fri, 2015-07-17 at 10:24 +0200, Eric Dumazet wrote:
>
> > Please simply use cond_resched() without counting and magic value.

Done

> Also use cond_resched() in insert and delete phases ?

When I tried that it made the walker duplicates disappear which
weakens the test case a little bit but it's probably safer this
way. I'll include it in the v2.
[PATCH net-next v2] rhashtable: Allow other tasks to be scheduled in large lookup loops
Depending on system speed, the large lookup/insert/delete loops of
the testsuite can take a considerable amount of time to complete
causing watchdog warnings to appear. Allow other tasks to be
scheduled throughout the loops.

Reported-by: Meelis Roos <mr...@linux.ee>
Signed-off-by: Thomas Graf <tg...@suug.ch>
---
v2: Use cond_resched() instead of schedule()

 lib/test_rhashtable.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index c90777e..9af7cef 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -20,6 +20,7 @@
 #include <linux/rcupdate.h>
 #include <linux/rhashtable.h>
 #include <linux/slab.h>
+#include <linux/sched.h>

 #define MAX_ENTRIES	1000000
 #define TEST_INSERT_FAIL INT_MAX
@@ -87,6 +88,8 @@ static int __init test_rht_lookup(struct rhashtable *ht)
 				return -EINVAL;
 			}
 		}
+
+		cond_resched_rcu();
 	}

 	return 0;
@@ -160,6 +163,8 @@ static s64 __init test_rhashtable(struct rhashtable *ht)
 		} else if (err) {
 			return err;
 		}
+
+		cond_resched();
 	}

 	if (insert_fails)
@@ -183,6 +188,8 @@ static s64 __init test_rhashtable(struct rhashtable *ht)

 			rhashtable_remove_fast(ht, &obj->node, test_rht_params);
 		}
+
+		cond_resched();
 	}

 	end = ktime_get_ns();
-- 
2.4.3
Re: 4.1 regression in resizable hashtable tests
On 07/02/15 at 10:09pm, Meelis Roos wrote:
> [   33.425061] Running rhashtable test nelem=8, max_size=65536, shrinking=0
> [   33.425154] Test 00:
> [   33.534470]   Adding 50000 keys
> [   34.743553] Info: encountered resize
> [   34.743698] Info: encountered resize
> [   34.743838] Info: encountered resize
> [   34.744057] Info: encountered resize
> [   34.744430] Info: encountered resize
> [   34.745139] Info: encountered resize
> [   34.746441] Info: encountered resize
> [   34.749055] Info: encountered resize
> [   34.754469] Info: encountered resize
> [   34.764836] Info: encountered resize
> [   34.785696] Info: encountered resize
> [   34.827448] Info: encountered resize
> [   34.896936]   Traversal complete: counted=49993, nelems=50000, entries=50000, table-jumps=12
> [   34.897056] Test failed: Total count mismatch ^^^

I do see count mismatches as well due to the design of the walker
which restarts and thus sees certain entries multiple times. Do you
have this commit as well?

Author: Phil Sutter <p...@nwl.cc>
Date:   Mon Jul 6 15:51:20 2015 +0200

    rhashtable: fix for resize events during table walk
[PATCH net] caif: fix leaks and race in caif_queue_rcv_skb()
From: Eric Dumazet <eduma...@google.com>

1) If sk_filter() is applied, skb was leaked (not freed)
2) Testing SOCK_DEAD twice is racy :
   packet could be freed while already queued.
3) Remove obsolete comment about caching skb->len

Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 net/caif/caif_socket.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/net/caif/caif_socket.c b/net/caif/caif_socket.c
index 3cc71b9f5517..cc858919108e 100644
--- a/net/caif/caif_socket.c
+++ b/net/caif/caif_socket.c
@@ -121,12 +121,13 @@ static void caif_flow_ctrl(struct sock *sk, int mode)
  * Copied from sock.c:sock_queue_rcv_skb(), but changed so packets are
  * not dropped, but CAIF is sending flow off instead.
  */
-static int caif_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
+static void caif_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	int err;
 	unsigned long flags;
 	struct sk_buff_head *list = &sk->sk_receive_queue;
 	struct caifsock *cf_sk = container_of(sk, struct caifsock, sk);
+	bool queued = false;

 	if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
 		(unsigned int)sk->sk_rcvbuf && rx_flow_is_on(cf_sk)) {
@@ -139,7 +140,8 @@ static int caif_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)

 	err = sk_filter(sk, skb);
 	if (err)
-		return err;
+		goto out;
+
 	if (!sk_rmem_schedule(sk, skb, skb->truesize) && rx_flow_is_on(cf_sk)) {
 		set_rx_flow_off(cf_sk);
 		net_dbg_ratelimited("sending flow OFF due to rmem_schedule\n");
@@ -147,21 +149,16 @@ static int caif_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	}
 	skb->dev = NULL;
 	skb_set_owner_r(skb, sk);
-	/* Cache the SKB length before we tack it onto the receive
-	 * queue. Once it is added it no longer belongs to us and
-	 * may be freed by other threads of control pulling packets
-	 * from the queue.
-	 */
 	spin_lock_irqsave(&list->lock, flags);
-	if (!sock_flag(sk, SOCK_DEAD))
+	queued = !sock_flag(sk, SOCK_DEAD);
+	if (queued)
 		__skb_queue_tail(list, skb);
 	spin_unlock_irqrestore(&list->lock, flags);
-
-	if (!sock_flag(sk, SOCK_DEAD))
+out:
+	if (queued)
 		sk->sk_data_ready(sk);
 	else
 		kfree_skb(skb);
-	return 0;
 }

 /* Packet Receive Callback function called from CAIF Stack */
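The core of the fix is to sample the "socket is dead" flag once, while the queue lock is held, and then act on that cached decision after the lock is dropped. The single-threaded model below only illustrates the resulting control flow (including the sk_filter() path that previously leaked the packet). `demo_sock` and `demo_queue_rcv` are invented for this sketch, and counters stand in for the real queueing, wakeup and free operations.

```c
#include <stdbool.h>

struct demo_sock {
	bool dead;		/* stands in for sock_flag(sk, SOCK_DEAD) */
	int queue_len;		/* packets on the receive queue */
	int notified;		/* calls to sk_data_ready() */
	int freed;		/* calls to kfree_skb() */
};

static void demo_queue_rcv(struct demo_sock *sk, bool filter_rejects)
{
	bool queued = false;

	if (filter_rejects)
		goto out;	/* rejected packets must still be freed */

	/* The kernel code holds the queue lock around these lines. */
	queued = !sk->dead;
	if (queued)
		sk->queue_len++;

out:
	/* Use the cached decision; re-reading sk->dead here would be racy. */
	if (queued)
		sk->notified++;
	else
		sk->freed++;
}
```

Every packet ends up on exactly one of the two paths, queued and signalled, or freed, which is the invariant the old code could violate when the flag changed between its two tests.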
Re: [PATCH net-next] rhashtable: Allow other tasks to be scheduled in large lookup loops
On Fri, 2015-07-17 at 10:24 +0200, Eric Dumazet wrote:

> Please simply use cond_resched() without counting and magic value.

Also use cond_resched() in insert and delete phases ?
Re: [PATCH] ravb: do not invalidate cache for RX buffer twice
From: Sergei Shtylyov <sergei.shtyl...@cogentembedded.com>
Date: Wed, 15 Jul 2015 00:56:52 +0300

> First, dma_sync_single_for_cpu() shouldn't have been called in the first place
> (it's a streaming DMA API). dma_unmap_single() should have been called instead.
> Second, dma_unmap_single() call after handing the buffer to napi_gro_receive()
> makes little sense.
>
> Signed-off-by: Sergei Shtylyov <sergei.shtyl...@cogentembedded.com>

Applied with fixed up commit log message, thanks.
[PATCH net-next] rhashtable: Allow other tasks to be scheduled in large lookup loops
Depending on system speed, the large lookup loop can take a considerable
amount of time to complete causing watchdog warnings to appear. Allow
other tasks to be scheduled after every batch of 1000 lookups.

Reported-by: Meelis Roos <mr...@linux.ee>
Signed-off-by: Thomas Graf <tg...@suug.ch>
---
 lib/test_rhashtable.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index c90777e..5ed6211 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -20,8 +20,10 @@
 #include <linux/rcupdate.h>
 #include <linux/rhashtable.h>
 #include <linux/slab.h>
+#include <linux/sched.h>

 #define MAX_ENTRIES	1000000
+#define RELAX_CPU_AFTER	1000
 #define TEST_INSERT_FAIL INT_MAX

 static int entries = 50000;
@@ -61,7 +63,7 @@ static struct rhashtable_params test_rht_params = {

 static int __init test_rht_lookup(struct rhashtable *ht)
 {
-	unsigned int i;
+	unsigned int i, relax_cnt = RELAX_CPU_AFTER;

 	for (i = 0; i < entries * 2; i++) {
 		struct test_obj *obj;
@@ -87,6 +89,11 @@ static int __init test_rht_lookup(struct rhashtable *ht)
 				return -EINVAL;
 			}
 		}
+
+		if (!relax_cnt--) {
+			schedule();
+			relax_cnt = RELAX_CPU_AFTER;
+		}
 	}

 	return 0;
-- 
2.4.3
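As a side note on the counter in this version, the userspace sketch below replays the `if (!relax_cnt--)` logic and counts how often it would fire. `count_yields` is an illustrative helper, not part of the test module, and the increment stands in for the schedule() call. Because the test uses post-decrement, the condition is true only when the counter was already zero, so the yield happens on every 1001st iteration rather than every 1000th.

```c
/* Replays the v1 counter logic and reports how many times it would yield. */
static unsigned int count_yields(unsigned int iterations, unsigned int batch)
{
	unsigned int i, relax_cnt = batch, yields = 0;

	for (i = 0; i < iterations; i++) {
		/* ... one lookup would happen here ... */

		if (!relax_cnt--) {	/* true only when the counter was 0 */
			yields++;	/* stands in for schedule() */
			relax_cnt = batch;
		}
	}
	return yields;
}
```

For example, 100000 iterations with a batch of 1000 yield 99 times, not 100. The difference is harmless here, but it shows why a plain per-iteration call needs neither the counter nor the constant.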
Re: [PATCH net-next 1/2] bpf: introduce bpf_skb_vlan_push/pop() helpers
On Thu, 2015-07-16 at 19:58 -0700, Alexei Starovoitov wrote:
> In order to let eBPF programs call skb_vlan_push/pop via helper functions

Why should eBPF program do such thing ?

Are BPF users in the kernel expecting skb being changed, and are we sure
they reload all cached values when/if needed ?

> eBPF JITs need to recognize helpers that change skb->data, since
> skb->data and hlen are cached as part of JIT code generation.
> - arm64 JIT is using bpf_load_pointer() without caching, so it's ok as-is.
> - x64 JIT recognizes bpf_skb_vlan_push/pop() calls and re-caches skb->data/hlen
>   after such calls (experiments showed that conditional re-caching is slower).
> - s390 JIT falls back to interpreter for now when bpf_skb_vlan_push() is present
>   in the program (re-caching is tbd).

> +static u64 bpf_skb_vlan_push(u64 r1, u64 vlan_proto, u64 vlan_tci, u64 r4, u64 r5)
> +{
> +	struct sk_buff *skb = (struct sk_buff *) (long) r1;
> +
> +	if (unlikely(vlan_proto != htons(ETH_P_8021Q) &&
> +		     vlan_proto != htons(ETH_P_8021AD)))
> +		vlan_proto = htons(ETH_P_8021Q);

This would raise sparse error, as vlan_proto is u64, and htons() __be16

make C=2 CF=-D__CHECK_ENDIAN__ net/core/filter.o
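One way to address a width/endianness mismatch like the one reported by sparse is to narrow the 64-bit helper argument to a 16-bit value first and only then compare it with network-order constants. The sketch below shows that shape in userspace C. It is an assumption about how such a fix could look, not the code that was eventually merged; sparse's `__be16` and `__force` annotations exist only in kernel builds and are therefore omitted, and the `DEMO_*` names are local to the example (the EtherType values themselves are the standard ones).

```c
#include <stdint.h>
#include <arpa/inet.h>

#define DEMO_ETH_P_8021Q	0x8100	/* 802.1Q VLAN EtherType */
#define DEMO_ETH_P_8021AD	0x88A8	/* 802.1ad service VLAN EtherType */

/* Narrow the raw helper argument, then validate it in network byte order. */
static uint16_t demo_normalize_vlan_proto(uint64_t arg)
{
	uint16_t proto = (uint16_t)arg;	/* compare like with like */

	if (proto != htons(DEMO_ETH_P_8021Q) &&
	    proto != htons(DEMO_ETH_P_8021AD))
		proto = htons(DEMO_ETH_P_8021Q);	/* fall back to 802.1Q */

	return proto;
}
```

In kernel code the narrowed variable would carry the `__be16` type so that sparse can check every later use against other big-endian values.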
Re: sit: Set SKB_GSO_SIT bit when performing GRO
On Friday, 17 July 2015, 09:56:51, Herbert Xu wrote:
> On Thu, Jul 16, 2015 at 12:58:45PM +0200, Wolfgang Walter wrote:
> > On Thursday, 16 July 2015, 08:23:50, Herbert Xu wrote:
> > > On Wed, Jul 15, 2015 at 02:25:59PM +0200, Wolfgang Walter wrote:
> > > > Yes. Switching TSO off and leaving GRO on works, too.
> > >
> > > OK, could you please try this patch?
> >
> > Patch works here.
>
> Thanks for the confirmation. Let's add a tag for patchwork:
>
> Tested-by: Wolfgang Walter <li...@stwm.de>

It seems that this patch may cause a problem with another one of our routers.
Without the patch it had no problem, so I hadn't tested it there.

With that patch one interface blocks after some time. Not even ARP requests
get answered. It still receives packets, though.

Restarting the interface fixes the problem. Switching off GRO for the other
interface helps.

This router is different from the other ones. It does not directly route
ISATAP packets. It may route ISATAP packets encapsulated in GRE packets,
though. It is itself not a GRE endpoint.

The router does NAT. Basically it routes the GRE tunnel packets without NAT
and NATs most of the rest.

Not doing NAT and conntrack (and unloading all modules like nf_conntrack_ipv4,
nf_defrag_ipv4) does not help.

eth0: external
eth1: internal

One (IPv4) GRE tunnel is routed between eth0 and eth1.
IPv6 ESP tunnels are routed between eth0 and eth1.
IPv4 UDP/TCP/ICMP from the internal side is NATted with netfilter.

eth1 stops sending with the patch after some time
disabling gro on eth0 helps
disabling tso or gso on eth0 and/or eth1 or both does not help

eth0 and eth1 are both Intel I350.

Regards,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
[PATCH v2 0/2] sctp: fix src address selection if using secondary address
This series improves the way SCTP chooses its src address so that the
chosen one will always belong to the interface being used for output.

v1->v2:
 - split out the refactoring from the fix itself
 - Doing a full reverse routing as in v1 is not necessary. Only looking
   for the interface that has the address and comparing its number is
   enough.

Marcelo Ricardo Leitner (2):
  sctp: reduce indent level on sctp_v4_get_dst
  sctp: fix src address selection if using secondary addresses

 net/sctp/protocol.c | 42 +++++++++++++++++++++++++++---------------
 1 file changed, 27 insertions(+), 15 deletions(-)

-- 
2.4.1
[PATCH v2 1/2] sctp: reduce indent level on sctp_v4_get_dst
Paves the way for the next patch. Functionality stays untouched.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leit...@gmail.com>
---
 net/sctp/protocol.c | 32 +++++++++++++++++---------------
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 59e80356672bdf89777265ae1f8c384792dfb98c..fa80fe4f23629fc3c3f5c44f99dbf3cc524cc6a0 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -489,21 +489,23 @@ static void sctp_v4_get_dst(struct sctp_transport *t, union sctp_addr *saddr,
 	list_for_each_entry_rcu(laddr, &bp->address_list, list) {
 		if (!laddr->valid)
 			continue;
-		if ((laddr->state == SCTP_ADDR_SRC) &&
-		    (AF_INET == laddr->a.sa.sa_family)) {
-			fl4->fl4_sport = laddr->a.v4.sin_port;
-			flowi4_update_output(fl4,
-					     asoc->base.sk->sk_bound_dev_if,
-					     RT_CONN_FLAGS(asoc->base.sk),
-					     daddr->v4.sin_addr.s_addr,
-					     laddr->a.v4.sin_addr.s_addr);
-
-			rt = ip_route_output_key(sock_net(sk), fl4);
-			if (!IS_ERR(rt)) {
-				dst = &rt->dst;
-				goto out_unlock;
-			}
-		}
+		if (laddr->state != SCTP_ADDR_SRC ||
+		    AF_INET != laddr->a.sa.sa_family)
+			continue;
+
+		fl4->fl4_sport = laddr->a.v4.sin_port;
+		flowi4_update_output(fl4,
+				     asoc->base.sk->sk_bound_dev_if,
+				     RT_CONN_FLAGS(asoc->base.sk),
+				     daddr->v4.sin_addr.s_addr,
+				     laddr->a.v4.sin_addr.s_addr);
+
+		rt = ip_route_output_key(sock_net(sk), fl4);
+		if (IS_ERR(rt))
+			continue;
+
+		dst = &rt->dst;
+		break;
 	}

 out_unlock:
-- 
2.4.1
[PATCH v2 2/2] sctp: fix src address selection if using secondary addresses
In short, sctp is likely to incorrectly choose src address if socket is
bound to secondary addresses. This patch fixes it by adding a new check
that checks if such src address belongs to the interface that routing
identified as output. This is enough to avoid rp_filter drops on remote
peer.

Details:

Currently, sctp will do a routing attempt without specifying the src
address and compare the returned value (preferred source) with the
addresses that the socket is bound to. When using secondary addresses,
this will not match.

Then it will try specifying each of the addresses that the socket is
bound to and re-routing, checking if that address is valid as src for
that dst. Thing is, this check alone is weak:

# ip r l
192.168.100.0/24 dev eth1 proto kernel scope link src 192.168.100.149
192.168.122.0/24 dev eth0 proto kernel scope link src 192.168.122.147

# ip a l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:15:18:6a brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.147/24 brd 192.168.122.255 scope global dynamic eth0
       valid_lft 2160sec preferred_lft 2160sec
    inet 192.168.122.148/24 scope global secondary eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe15:186a/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:b3:91:46 brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.149/24 brd 192.168.100.255 scope global dynamic eth1
       valid_lft 2162sec preferred_lft 2162sec
    inet 192.168.100.148/24 scope global secondary eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:feb3:9146/64 scope link
       valid_lft forever preferred_lft forever
4: ens9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:05:47:ee brd ff:ff:ff:ff:ff:ff
    inet6 fe80::5054:ff:fe05:47ee/64 scope link
       valid_lft forever preferred_lft forever

# ip r g 192.168.100.193 from 192.168.122.148
192.168.100.193 from 192.168.122.148 dev eth1
    cache

Even if you specify an interface:

# ip r g 192.168.100.193 from 192.168.122.148 oif eth1
192.168.100.193 from 192.168.122.148 dev eth1
    cache

Although this would be valid, peers using rp_filter will drop such
packets as their src doesn't match the routes for that interface.

Signed-off-by: Marcelo Ricardo Leitner marcelo.leit...@gmail.com
---
 net/sctp/protocol.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index fa80fe4f23629fc3c3f5c44f99dbf3cc524cc6a0..4345790ad3266c353eeac5398593c2a9ce4effda 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -487,6 +487,8 @@ static void sctp_v4_get_dst(struct sctp_transport *t, union sctp_addr *saddr,
 	 */
 	rcu_read_lock();
 	list_for_each_entry_rcu(laddr, &bp->address_list, list) {
+		struct net_device *odev;
+
 		if (!laddr->valid)
 			continue;
 		if (laddr->state != SCTP_ADDR_SRC ||
@@ -504,6 +506,14 @@ static void sctp_v4_get_dst(struct sctp_transport *t, union sctp_addr *saddr,
 		if (IS_ERR(rt))
 			continue;
 
+		/* Ensure the src address belongs to the output
+		 * interface.
+		 */
+		odev = __ip_dev_find(sock_net(sk), laddr->a.v4.sin_addr.s_addr,
+				     false);
+		if (!odev || odev->ifindex != fl4->flowi4_oif)
+			continue;
+
 		dst = &rt->dst;
 		break;
 	}
-- 
2.4.1
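
The added check can be illustrated outside the kernel. What follows is only a user-space model, not the SCTP code: the address table, IP4(), owner_ifindex() and src_usable_on() are hypothetical names invented for this sketch, and the table is filled in from the "ip a l" output quoted in the commit message (eth0 is ifindex 2, eth1 is ifindex 3). A candidate source address passes only if the interface that owns it is the one routing picked as output.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Build an IPv4 address as a host-order integer (illustrative helper). */
#define IP4(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
	 ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Hypothetical stand-in for one address configured on an interface. */
struct if_addr {
	int ifindex;
	uint32_t addr;
};

/* Addresses taken from the example output: eth0 is 2, eth1 is 3. */
static const struct if_addr addr_table[] = {
	{ 2, IP4(192, 168, 122, 147) },
	{ 2, IP4(192, 168, 122, 148) },	/* secondary on eth0 */
	{ 3, IP4(192, 168, 100, 149) },
	{ 3, IP4(192, 168, 100, 148) },	/* secondary on eth1 */
};

/* Return the ifindex that owns addr, or 0 if no interface has it. */
static int owner_ifindex(uint32_t addr)
{
	for (size_t i = 0; i < sizeof(addr_table) / sizeof(addr_table[0]); i++)
		if (addr_table[i].addr == addr)
			return addr_table[i].ifindex;
	return 0;
}

/* Accept a candidate source only if it lives on the output interface. */
static bool src_usable_on(uint32_t src, int oif)
{
	int owner = owner_ifindex(src);

	return owner != 0 && owner == oif;
}
```

With this model, 192.168.122.148 is rejected for output on eth1 even though the routing lookup alone accepted it, while 192.168.100.148 is accepted.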
RE: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be processed during probe
> -----Original Message-----
> From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
> Sent: Friday, July 17, 2015 7:13 AM
> To: KY Srinivasan
> Cc: da...@davemloft.net; netdev@vger.kernel.org;
> linux-ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de;
> a...@canonical.com; jasow...@redhat.com; Dexuan Cui
> Subject: Re: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be
> processed during probe
>
> K. Y. Srinivasan k...@microsoft.com writes:
>
> > The current code returns from probe without waiting for the proper
> > handling of subchannels that may be requested. If the netvsc driver were
> > to be rapidly loaded/unloaded, we can trigger a panic as the unload will
> > be tearing down state that may not have been fully setup yet. We fix
> > this issue by making sure that we return from the probe call only after
> > ensuring that the sub-channel offers in flight are properly handled.
> >
> > Signed-off-by: K. Y. Srinivasan k...@microsoft.com
> > Reviewed-and-tested-by: Haiyang Zhang haiya...@microsoft.com
> > ---
> >  drivers/net/hyperv/hyperv_net.h   |    2 ++
> >  drivers/net/hyperv/rndis_filter.c |   25 +++++++++++++++++++++++++
> >  2 files changed, 27 insertions(+), 0 deletions(-)
> >
> > diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> > index 26cd14c..925b75d 100644
> > --- a/drivers/net/hyperv/hyperv_net.h
> > +++ b/drivers/net/hyperv/hyperv_net.h
> > @@ -671,6 +671,8 @@ struct netvsc_device {
> >  	u32 send_table[VRSS_SEND_TAB_SIZE];
> >  	u32 max_chn;
> >  	u32 num_chn;
> > +	spinlock_t sc_lock; /* Protects num_sc_offered variable */
> > +	u32 num_sc_offered;
> >  	atomic_t queue_sends[NR_CPUS];
> >
> >  	/* Holds rndis device info */
> > diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
> > index 2e40417..2e09f3f 100644
> > --- a/drivers/net/hyperv/rndis_filter.c
> > +++ b/drivers/net/hyperv/rndis_filter.c
> > @@ -984,9 +984,16 @@ static void netvsc_sc_open(struct vmbus_channel *new_sc)
> >  	struct netvsc_device *nvscdev;
> >  	u16 chn_index = new_sc->offermsg.offer.sub_channel_index;
> >  	int ret;
> > +	unsigned long flags;
> >
> >  	nvscdev = hv_get_drvdata(new_sc->primary_channel->device_obj);
> >
> > +	spin_lock_irqsave(&nvscdev->sc_lock, flags);
> > +	nvscdev->num_sc_offered--;
> > +	spin_unlock_irqrestore(&nvscdev->sc_lock, flags);
> > +	if (nvscdev->num_sc_offered == 0)
> > +		complete(&nvscdev->channel_init_wait);
> > +
> >  	if (chn_index >= nvscdev->num_chn)
> >  		return;
> >
> > @@ -1015,8 +1022,10 @@ int rndis_filter_device_add(struct hv_device *dev,
> >  	u32 rsscap_size = sizeof(struct ndis_recv_scale_cap);
> >  	u32 mtu, size;
> >  	u32 num_rss_qs;
> > +	u32 sc_delta;
> >  	const struct cpumask *node_cpu_mask;
> >  	u32 num_possible_rss_qs;
> > +	unsigned long flags;
> >
> >  	rndis_device = get_rndis_device();
> >  	if (!rndis_device)
> > @@ -1039,6 +1048,8 @@ int rndis_filter_device_add(struct hv_device *dev,
> >  	net_device->max_chn = 1;
> >  	net_device->num_chn = 1;
> >
> > +	spin_lock_init(&net_device->sc_lock);
> > +
> >  	net_device->extension = rndis_device;
> >  	rndis_device->net_dev = net_device;
> >
> > @@ -1116,6 +1127,9 @@ int rndis_filter_device_add(struct hv_device *dev,
> >  	num_possible_rss_qs = cpumask_weight(node_cpu_mask);
> >  	net_device->num_chn = min(num_possible_rss_qs, num_rss_qs);
> >
> > +	num_rss_qs = net_device->num_chn - 1;
> > +	net_device->num_sc_offered = num_rss_qs;
> > +
> >  	if (net_device->num_chn == 1)
> >  		goto out;
> >
> > @@ -1157,11 +1171,22 @@ int rndis_filter_device_add(struct hv_device *dev,
> >  	ret = rndis_filter_set_rss_param(rndis_device, net_device->num_chn);
> >
> > +	/*
> > +	 * Wait for the host to send us the sub-channel offers.
> > +	 */
> > +	spin_lock_irqsave(&net_device->sc_lock, flags);
> > +	sc_delta = net_device->num_chn - 1 - num_rss_qs;
> > +	net_device->num_sc_offered -= sc_delta;
> > +	spin_unlock_irqrestore(&net_device->sc_lock, flags);
> > +
> > +	if (net_device->num_sc_offered != 0)
> > +		wait_for_completion(&net_device->channel_init_wait);
>
> I'd suggest we add an essential timeout (big, let's say 30 sec.) here. In
> case something goes wrong we don't really want to hang the whole kernel
> forever. Such bugs are hard to debug: if a 'kernel hangs' is reported we
> can't be sure which wait caused it. We can even have something like:
>
>   t = wait_for_completion_timeout(&net_device->channel_init_wait, 30*HZ);
>   BUG_ON(t == 0);
>
> This is much better as we'll be sure what went wrong. (I know other
> pieces of hyper-v code use wait_for_completion() without a timeout, this
> is rather a general suggestion for all of them).

There is some history here. Initially, I had timeout for calls where we
could reasonably rollback state if we timed out. Some calls were
subsequently changed to unconditional wait because under some load
conditions, these timeouts would trigger (granted I did not have 30
second timeout; it was a 5 second timeout). Greg was opposed to calls to
BUG_ON() in general for drivers.
RE: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be processed during probe
> -----Original Message-----
> From: Dexuan Cui
> Sent: Friday, July 17, 2015 3:01 AM
> To: KY Srinivasan; da...@davemloft.net; netdev@vger.kernel.org;
> linux-ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de;
> a...@canonical.com; jasow...@redhat.com; vkuzn...@redhat.com
> Cc: KY Srinivasan
> Subject: RE: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be
> processed during probe
>
> > From: K. Y. Srinivasan
> > Sent: Friday, July 17, 2015 3:17
> > Subject: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be
> > processed during probe
> > diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> > ...
> > @@ -1116,6 +1127,9 @@ int rndis_filter_device_add(struct hv_device *dev,
> >  	num_possible_rss_qs = cpumask_weight(node_cpu_mask);
> >  	net_device->num_chn = min(num_possible_rss_qs, num_rss_qs);
> >
> > +	num_rss_qs = net_device->num_chn - 1;
> > +	net_device->num_sc_offered = num_rss_qs;
> > +
> >  	if (net_device->num_chn == 1)
> >  		goto out;
> >
> > @@ -1157,11 +1171,22 @@ int rndis_filter_device_add(struct hv_device *dev,
> >  	ret = rndis_filter_set_rss_param(rndis_device, net_device->num_chn);
> >
> > +	/*
> > +	 * Wait for the host to send us the sub-channel offers.
> > +	 */
> > +	spin_lock_irqsave(&net_device->sc_lock, flags);
> > +	sc_delta = net_device->num_chn - 1 - num_rss_qs;
> > +	net_device->num_sc_offered -= sc_delta;
>
> Hi KY,
> IMO here the "-=" should be "+="?
>
> I think sc_delta is usually <= 0, meaning the host may allocate less
> subchannels than we expect.
>
> With "-=", net_device->num_sc_offered can become bigger -- this doesn't
> seem correct.

We control how many sub-channels we want the host to offer (say
sc_requested). Based on this number we begin to track how many have
actually been processed - we decrement sc_requested each time a
sub-channel offer is processed. If the host were to actually offer all
that we have requested, then checking for sc_requested to be zero is
sufficient to ensure that we have processed all the potentially
in-flight sub-channels. However, the host may choose to offer less than
what we had asked for and the variable delta is tracking this
difference. Since we are counting down from what we had asked for we
have to subtract delta for proper accounting.

> Why not use
> net_device->num_sc_offered = net_device->num_chn - 1;
> directly?
> At this point, net_device->num_chn has been the number of the actual
> channels.

I am not sure what the question here is. num_sc_offered is initialized
to the number we are going to ask and this is the number that will be
decremented each time a sub-channel is processed. Since the host may
decide to offer us less than what we had asked and some sub-channels may
have already been processed (num_sc_offered decremented accordingly) by
the time we discover that the host has offered us less than what we
asked for, we adjust num_sc_offered accordingly.

> > +	spin_unlock_irqrestore(&net_device->sc_lock, flags);
> > +
> > +	if (net_device->num_sc_offered != 0)
> > +		wait_for_completion(&net_device->channel_init_wait);
>
> BTW, I also tested the patch and I can confirm the panic I saw
> disappeared with the patch.

Thank you.

K. Y

> -- Dexuan
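
The countdown being discussed can be modelled in isolation. This is only a sketch with hypothetical names (struct sc_tracker, sc_request(), sc_processed(), sc_adjust(), sc_all_done()), without the locking, and it is not the driver code. It assumes the shortfall is written as "requested minus granted", so subtracting it is what shrinks the counter; which way round the delta is computed is exactly what the "-=" versus "+=" question turns on.

```c
#include <stdint.h>

/* Hypothetical model of the countdown described in the thread. */
struct sc_tracker {
	uint32_t pending;	/* plays the role of num_sc_offered */
};

/* Start counting down from the number of sub-channels requested. */
static void sc_request(struct sc_tracker *t, uint32_t requested)
{
	t->pending = requested;
}

/* One sub-channel offer has been processed (may run before sc_adjust). */
static void sc_processed(struct sc_tracker *t)
{
	t->pending--;
}

/* The host granted fewer than requested: drop the offers that never come. */
static void sc_adjust(struct sc_tracker *t, uint32_t requested,
		      uint32_t granted)
{
	/* Assumes granted <= requested, so the unsigned value cannot wrap. */
	uint32_t shortfall = requested - granted;

	t->pending -= shortfall;
}

/* Nonzero once every offer that will arrive has been processed. */
static int sc_all_done(const struct sc_tracker *t)
{
	return t->pending == 0;
}
```

For example, with four sub-channels requested, two granted, and one offer already processed before the adjustment, one offer is still pending after sc_adjust(), and the counter reaches zero on the next sc_processed().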
RE: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be processed during probe
> From: K. Y. Srinivasan
> Sent: Friday, July 17, 2015 3:17
> Subject: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be
> processed during probe
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> ...
> @@ -1116,6 +1127,9 @@ int rndis_filter_device_add(struct hv_device *dev,
>  	num_possible_rss_qs = cpumask_weight(node_cpu_mask);
>  	net_device->num_chn = min(num_possible_rss_qs, num_rss_qs);
>
> +	num_rss_qs = net_device->num_chn - 1;
> +	net_device->num_sc_offered = num_rss_qs;
> +
>  	if (net_device->num_chn == 1)
>  		goto out;
>
> @@ -1157,11 +1171,22 @@ int rndis_filter_device_add(struct hv_device *dev,
>  	ret = rndis_filter_set_rss_param(rndis_device, net_device->num_chn);
>
> +	/*
> +	 * Wait for the host to send us the sub-channel offers.
> +	 */
> +	spin_lock_irqsave(&net_device->sc_lock, flags);
> +	sc_delta = net_device->num_chn - 1 - num_rss_qs;
> +	net_device->num_sc_offered -= sc_delta;

Hi KY,
IMO here the "-=" should be "+="?

I think sc_delta is usually <= 0, meaning the host may allocate less
subchannels than we expect.

With "-=", net_device->num_sc_offered can become bigger -- this doesn't
seem correct.

Why not use
net_device->num_sc_offered = net_device->num_chn - 1;
directly?
At this point, net_device->num_chn has been the number of the actual
channels.

> +	spin_unlock_irqrestore(&net_device->sc_lock, flags);
> +
> +	if (net_device->num_sc_offered != 0)
> +		wait_for_completion(&net_device->channel_init_wait);

BTW, I also tested the patch and I can confirm the panic I saw
disappeared with the patch.

-- Dexuan
Re: 4.1 regression in resizable hashtable tests
On 07/17/15 at 12:26pm, Phil Sutter wrote:
> On Fri, Jul 17, 2015 at 10:04:56AM +0200, Thomas Graf wrote:
> > On 07/02/15 at 10:09pm, Meelis Roos wrote:
> > > [ 33.425061] Running rhashtable test nelem=8, max_size=65536, shrinking=0
> > > [ 33.425154] Test 00:
> > > [ 33.534470] Adding 5 keys
> > > [ 34.743553] Info: encountered resize
> > > [ 34.743698] Info: encountered resize
> > > [ 34.743838] Info: encountered resize
> > > [ 34.744057] Info: encountered resize
> > > [ 34.744430] Info: encountered resize
> > > [ 34.745139] Info: encountered resize
> > > [ 34.746441] Info: encountered resize
> > > [ 34.749055] Info: encountered resize
> > > [ 34.754469] Info: encountered resize
> > > [ 34.764836] Info: encountered resize
> > > [ 34.785696] Info: encountered resize
> > > [ 34.827448] Info: encountered resize
> > > [ 34.896936] Traversal complete: counted=49993, nelems=5, entries=5, table-jumps=12
> > > [ 34.897056] Test failed: Total count mismatch
> > > ^^^
> >
> > I do see count mismatches as well due to the design of the walker
> > which restarts and thus sees certain entries multiple times. Do you
> > have this commit as well?
> >
> > Author: Phil Sutter p...@nwl.cc
> > Date:   Mon Jul 6 15:51:20 2015 +0200
> >
> >     rhashtable: fix for resize events during table walk
>
> Thomas, this should be resolved already. Meelis replied[1] to my patch,
> stating it fixes that problem for him. Though he's still waiting for
> your proposed patch to add a schedule() call so the kernel won't
> complain on his slow UltraSparc. :)
>
> Cheers, Phil
>
> [1]: http://www.spinics.net/lists/netdev/msg335767.html

OK, good to know. I've posted the schedule patch today:
https://patchwork.ozlabs.org/patch/497035/
Blackhole route not enough for proxy-arp
Hello all,

it seems that a blackhole route is not enough to enable proxy arp for
the routing target. I tried

    ip route add blackhole 192.168.66.3/32

and

    ip route add 192.168.66.3/32 dev lo

arping failed with the blackhole route but got responses with the route
through the loopback interface.

Is this behaviour a bug or a feature?

-- 
Regards
Joerg
Re: 4.1 regression in resizable hashtable tests
On Fri, Jul 17, 2015 at 10:04:56AM +0200, Thomas Graf wrote:
> On 07/02/15 at 10:09pm, Meelis Roos wrote:
> > [ 33.425061] Running rhashtable test nelem=8, max_size=65536, shrinking=0
> > [ 33.425154] Test 00:
> > [ 33.534470] Adding 5 keys
> > [ 34.743553] Info: encountered resize
> > [ 34.743698] Info: encountered resize
> > [ 34.743838] Info: encountered resize
> > [ 34.744057] Info: encountered resize
> > [ 34.744430] Info: encountered resize
> > [ 34.745139] Info: encountered resize
> > [ 34.746441] Info: encountered resize
> > [ 34.749055] Info: encountered resize
> > [ 34.754469] Info: encountered resize
> > [ 34.764836] Info: encountered resize
> > [ 34.785696] Info: encountered resize
> > [ 34.827448] Info: encountered resize
> > [ 34.896936] Traversal complete: counted=49993, nelems=5, entries=5, table-jumps=12
> > [ 34.897056] Test failed: Total count mismatch
> > ^^^
>
> I do see count mismatches as well due to the design of the walker
> which restarts and thus sees certain entries multiple times. Do you
> have this commit as well?
>
> Author: Phil Sutter p...@nwl.cc
> Date:   Mon Jul 6 15:51:20 2015 +0200
>
>     rhashtable: fix for resize events during table walk

Thomas, this should be resolved already. Meelis replied[1] to my patch,
stating it fixes that problem for him. Though he's still waiting for
your proposed patch to add a schedule() call so the kernel won't
complain on his slow UltraSparc. :)

Cheers, Phil

[1]: http://www.spinics.net/lists/netdev/msg335767.html
Re: [RFC net-next 22/22] openvswitch: Use regular GRE net_device instead of vport
On 07/16/15 at 02:36pm, Pravin Shelar wrote:
> On Thu, Jul 16, 2015 at 7:52 AM, Thomas Graf tg...@suug.ch wrote:
> > I'm inclined to change this and use an in-kernel API as well to
> > create the net_device just like VXLAN does in patch 21. Pravin,
> > what do you think?
>
> About the vxlan APIs we also need to direct netlink interface for
> userspace to configure vxlan device. This will allow us to remove
> vxlan compat code from ovs vport-netdev.c in future.

Do you mean creating the tunnel devices from user space? This would
break existing users of the OVS Netlink interface. How do you want to
prevent that?
Re: [PATCH net-next v2] rhashtable: Allow other tasks to be scheduled in large lookup loops
On Fri, 2015-07-17 at 10:52 +0200, Thomas Graf wrote:
> Depending on system speed, the large lookup/insert/delete loops of
> the testsuite can take a considerable amount of time to complete
> causing watchdog warnings to appear. Allow other tasks to be
> scheduled throughout the loops.
>
> Reported-by: Meelis Roos mr...@linux.ee
> Signed-off-by: Thomas Graf tg...@suug.ch
> ---
> v2: Use cond_resched() instead schedule()

Acked-by: Eric Dumazet eduma...@google.com
[PATCH] net/mdio: fix mdio_bus_match for c45 PHY
From: Shaohui Xie shaohui@freescale.com

We store c45 PHY's id information in c45_ids, so it should be used to
check the matching between PHY driver and PHY device for c45 PHY.

Signed-off-by: Shaohui Xie shaohui@freescale.com
---
 drivers/net/phy/mdio_bus.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
index 095ef3f..46a14cb 100644
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -421,6 +421,8 @@ static int mdio_bus_match(struct device *dev, struct device_driver *drv)
 {
 	struct phy_device *phydev = to_phy_device(dev);
 	struct phy_driver *phydrv = to_phy_driver(drv);
+	const int num_ids = ARRAY_SIZE(phydev->c45_ids.device_ids);
+	int i;
 
 	if (of_driver_match_device(dev, drv))
 		return 1;
@@ -428,8 +430,21 @@ static int mdio_bus_match(struct device *dev, struct device_driver *drv)
 	if (phydrv->match_phy_device)
 		return phydrv->match_phy_device(phydev);
 
-	return (phydrv->phy_id & phydrv->phy_id_mask) ==
-		(phydev->phy_id & phydrv->phy_id_mask);
+	if (phydev->is_c45) {
+		for (i = 1; i < num_ids; i++) {
+			if (!(phydev->c45_ids.devices_in_package & (1 << i)))
+				continue;
+
+			if ((phydrv->phy_id & phydrv->phy_id_mask) ==
+			    (phydev->c45_ids.device_ids[i] &
+			     phydrv->phy_id_mask))
+				return 1;
+		}
+		return 0;
+	} else {
+		return (phydrv->phy_id & phydrv->phy_id_mask) ==
+			(phydev->phy_id & phydrv->phy_id_mask);
+	}
 }
 
 #ifdef CONFIG_PM
-- 
2.1.0.27.g96db324
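
The matching rule in the patch can be tried out in user space. The following is a minimal sketch under stated assumptions, not the kernel code: struct mock_phydev, struct mock_phydrv, mock_bus_match() and the array size NUM_C45_IDS are hypothetical stand-ins that keep only the fields the patch touches. For a c45 device, each ID whose bit is set in devices_in_package is compared against the driver's masked ID; otherwise the single phy_id is used as before.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_C45_IDS 8	/* assumed array size for this sketch */

/* Hypothetical, trimmed-down stand-ins for the kernel structures. */
struct c45_ids {
	uint32_t devices_in_package;		/* bit i set: device i present */
	uint32_t device_ids[NUM_C45_IDS];
};

struct mock_phydev {
	bool is_c45;
	uint32_t phy_id;
	struct c45_ids c45_ids;
};

struct mock_phydrv {
	uint32_t phy_id;
	uint32_t phy_id_mask;
};

/* Return true when the driver's masked ID matches the device. */
static bool mock_bus_match(const struct mock_phydev *dev,
			   const struct mock_phydrv *drv)
{
	uint32_t want = drv->phy_id & drv->phy_id_mask;

	if (!dev->is_c45)
		return (dev->phy_id & drv->phy_id_mask) == want;

	/* Walk every device present in the package, starting at index 1. */
	for (int i = 1; i < NUM_C45_IDS; i++) {
		if (!(dev->c45_ids.devices_in_package & (UINT32_C(1) << i)))
			continue;
		if ((dev->c45_ids.device_ids[i] & drv->phy_id_mask) == want)
			return true;
	}
	return false;
}
```

A c45 device whose phy_id is zero still matches as long as one present device in the package carries a matching ID, which is the case the original comparison missed.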
[PATCH v2] net: ratelimit warnings about dst entry refcount underflow or overflow
Kernel generates a lot of warnings when dst entry reference counter
overflows and becomes negative. That bug was seen several times on
machines with outdated 3.10.y kernels. Most likely it's already fixed
in upstream. Anyway that flood completely kills the machine and makes
further debugging impossible.

Signed-off-by: Konstantin Khlebnikov khlebni...@yandex-team.ru
---
 net/core/dst.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/core/dst.c b/net/core/dst.c
index e956ce6d1378..002144bea935 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -284,7 +284,9 @@ void dst_release(struct dst_entry *dst)
 		int newrefcnt;
 
 		newrefcnt = atomic_dec_return(&dst->__refcnt);
-		WARN_ON(newrefcnt < 0);
+		if (unlikely(newrefcnt < 0))
+			net_warn_ratelimited("%s: dst:%p refcnt:%d\n",
+					     __func__, dst, newrefcnt);
 		if (unlikely(dst->flags & DST_NOCACHE) && !newrefcnt)
 			call_rcu(&dst->rcu_head, dst_destroy_rcu);
 	}
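
To show why rate limiting keeps a flood like this from taking the machine down, here is a small fixed-window limiter. It is only an illustration of the general idea and is not the kernel's ratelimit implementation: struct ratelimit, ratelimit_allow() and the tick-based clock passed in by the caller are all assumptions made for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fixed-window limiter: at most `burst` messages per window. */
struct ratelimit {
	uint64_t interval;	/* window length, in caller-defined ticks */
	unsigned int burst;	/* messages allowed per window */
	uint64_t window_start;	/* tick at which the current window began */
	unsigned int printed;	/* messages emitted in the current window */
	unsigned int missed;	/* messages suppressed in the current window */
};

/* Return true if the caller may emit a message at time `now`. */
static bool ratelimit_allow(struct ratelimit *rl, uint64_t now)
{
	if (now - rl->window_start >= rl->interval) {
		/* A new window begins: reset the counters. */
		rl->window_start = now;
		rl->printed = 0;
		rl->missed = 0;
	}
	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}
	rl->missed++;
	return false;
}
```

A caller would wrap its warning in `if (ratelimit_allow(&rl, now))`, so a burst of underflow reports costs a bounded number of log lines per window instead of one per event.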
Re: 4.1.0, kernel panic, pppoe_release
As I suspected, this kernel panic is caused by recent changes to pppoe.
The problem appears with accel-pppd (server) on loaded servers (2k
users and more). Most probably it is related to the change
"pppoe: Use workqueue to die properly when a PADT is received".
I will try to revert this and related patches.

On 2015-07-14 13:57, Denys Fedoryshchenko wrote:
> Here is panic message from netconsole. Please let me know if any
> additional information required.
>
> Jul 14 13:49:16 10.0.252.10 [76078.867822] BUG: unable to handle kernel NULL pointer dereference at 03f0
> Jul 14 13:49:16 10.0.252.10 [76078.868280] IP: [a011e12a] pppoe_release+0x56/0x142 [pppoe]
> Jul 14 13:49:16 10.0.252.10 [76078.868541] PGD 336e4a067 PUD 333f17067 PMD 0
> Jul 14 13:49:16 10.0.252.10 [76078.868918] Oops: [#1] SMP
> Jul 14 13:49:16 10.0.252.10 [76078.869226] Modules linked in: netconsole configfs coretemp sch_fq cls_fw act_police cls_u32 sch_ingress sch_sfq sch_htb pppoe pppox ppp_generic slhc nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre tun xt_REDIRECT nf_nat_redirect xt_set xt_TCPMSS ipt_REJECT nf_reject_ipv4 ts_bm xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp iptable_mangle iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables ip_set_hash_ip ip_set nfnetlink 8021q garp mrp stp llc [last unloaded: netconsole]
> Jul 14 13:49:16 10.0.252.10 [76078.873195] CPU: 3 PID: 2940 Comm: accel-pppd Not tainted 4.1.0-build-0074 #7
> Jul 14 13:49:16 10.0.252.10 [76078.873396] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 04/02/2015
> Jul 14 13:49:16 10.0.252.10 [76078.873598] task: 8800b1886ba0 ti: 8800b09f4000 task.ti: 8800b09f4000
> Jul 14 13:49:16 10.0.252.10 [76078.873929] RIP: 0010:[a011e12a] [a011e12a] pppoe_release+0x56/0x142 [pppoe]
> Jul 14 13:49:16 10.0.252.10 [76078.874317] RSP: 0018:8800b09f7e28 EFLAGS: 00010202
> Jul 14 13:49:16 10.0.252.10 [76078.874512] RAX: RBX: 88032a214400 RCX:
> Jul 14 13:49:16 10.0.252.10 [76078.874709] RDX: 000d RSI: fe01 RDI: 8180d6da
> Jul 14 13:49:16 10.0.252.10 [76078.874906] RBP: 8800b09f7e68 R08: R09:
> Jul 14 13:49:16 10.0.252.10 [76078.875102] R10: 88031ef6a110 R11: 0293 R12: 88030f8d8fc0
> Jul 14 13:49:16 10.0.252.10 [76078.875299] R13: 88030f8d8ff0 R14: 88033115ee40 R15: 8803394e4920
> Jul 14 13:49:16 10.0.252.10 [76078.875499] FS: 7f79b602c700() GS:88034746() knlGS:
> Jul 14 13:49:16 10.0.252.10 [76078.875837] CS: 0010 DS: ES: CR0: 80050033
> Jul 14 13:49:16 10.0.252.10 [76078.876036] CR2: 03f0 CR3: 000335425000 CR4: 001407e0
> Jul 14 13:49:16 10.0.252.10 [76078.876239] Stack:
> Jul 14 13:49:16 10.0.252.10 [76078.876434] 88033ac45c80 0001 88030f8d8fc0
> Jul 14 13:49:16 10.0.252.10 [76078.877001] a0120260 88030f8d8ff0 88033115ee40 8803394e4920
RE: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be processed during probe
> From: K. Y. Srinivasan
> Sent: Friday, July 17, 2015 3:17
> Subject: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be
> processed during probe
>
> The current code returns from probe without waiting for the proper
> handling of subchannels that may be requested. If the netvsc driver were
> to be rapidly loaded/unloaded, we can trigger a panic as the unload will
> be tearing down state that may not have been fully setup yet. We fix
> this issue by making sure that we return from the probe call only after
> ensuring that the sub-channel offers in flight are properly handled.
> ---
>  drivers/net/hyperv/hyperv_net.h   |    2 ++
>  drivers/net/hyperv/rndis_filter.c |   25 +++++++++++++++++++++++++
>  2 files changed, 27 insertions(+), 0 deletions(-)

BTW, not sure if we should make the same fix to storvsc. IMO storvsc
should have the same issue, at least in theory, though usually it's
unlikely to unload storvsc. :-)

Thanks,
-- Dexuan
Re: 4.1 regression in resizable hashtable tests
On Fri, Jul 17, 2015 at 12:26:36PM +0200, Phil Sutter wrote:
> On Fri, Jul 17, 2015 at 10:04:56AM +0200, Thomas Graf wrote:
> > On 07/02/15 at 10:09pm, Meelis Roos wrote:
> > > [ 33.425061] Running rhashtable test nelem=8, max_size=65536, shrinking=0
> > > [ 33.425154] Test 00:
> > > [ 33.534470] Adding 5 keys
> > > [ 34.743553] Info: encountered resize
> > > [ 34.743698] Info: encountered resize
> > > [ 34.743838] Info: encountered resize
> > > [ 34.744057] Info: encountered resize
> > > [ 34.744430] Info: encountered resize
> > > [ 34.745139] Info: encountered resize
> > > [ 34.746441] Info: encountered resize
> > > [ 34.749055] Info: encountered resize
> > > [ 34.754469] Info: encountered resize
> > > [ 34.764836] Info: encountered resize
> > > [ 34.785696] Info: encountered resize
> > > [ 34.827448] Info: encountered resize
> > > [ 34.896936] Traversal complete: counted=49993, nelems=5, entries=5, table-jumps=12
> > > [ 34.897056] Test failed: Total count mismatch
> > > ^^^
> >
> > I do see count mismatches as well due to the design of the walker
> > which restarts and thus sees certain entries multiple times. Do you
> > have this commit as well?
> >
> > Author: Phil Sutter p...@nwl.cc
> > Date:   Mon Jul 6 15:51:20 2015 +0200
> >
> >     rhashtable: fix for resize events during table walk
>
> Thomas, this should be resolved already. Meelis replied[1] to my patch,
> stating it fixes that problem for him. Though he's still waiting for
> your proposed patch to add a schedule() call so the kernel won't
> complain on his slow UltraSparc. :)

Ah, nevermind. You sent it already with him in Cc.
[PATCH RFC net-next] net/vxlan: Fix kernel unaligned access in __vxlan_find_mac
__vxlan_find_mac invokes ether_addr_equal on the eth_addr field, which triggers unaligned access messages, so rearrange vxlan_fdb to avoid this as non-intrusively as possible.

Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com
---
 drivers/net/vxlan.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 34c519e..c9790a2 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -107,8 +107,8 @@ struct vxlan_fdb {
 	unsigned long	  used;
 	struct list_head  remotes;
 	u16		  state;	/* see ndm_state */
-	u8		  flags;	/* see ndm_flags */
 	u8		  eth_addr[ETH_ALEN];
+	u8		  flags;	/* see ndm_flags */
 };

 /* Pseudo network device */
--
1.7.1
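To make the effect of the field swap concrete, here is a minimal userspace sketch. The two structs are hypothetical, cut-down stand-ins for the tail of struct vxlan_fdb (only the last three fields, with stdint types instead of kernel types); they are not the kernel definition. The helper name offset_is_u16_aligned is likewise made up for illustration. The sketch relies on the documented requirement that both arguments of ether_addr_equal() be aligned to u16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_ALEN 6

/* Hypothetical stand-in for the old field order: flags precedes eth_addr. */
struct fdb_tail_old {
	uint16_t state;
	uint8_t  flags;
	uint8_t  eth_addr[ETH_ALEN];
};

/* Hypothetical stand-in for the new field order: eth_addr follows state. */
struct fdb_tail_new {
	uint16_t state;
	uint8_t  eth_addr[ETH_ALEN];
	uint8_t  flags;
};

/*
 * Returns nonzero when a field at this offset can be read as 16-bit words,
 * given that the enclosing struct itself is at least 2-byte aligned.
 */
static inline int offset_is_u16_aligned(size_t offset)
{
	return (offset % sizeof(uint16_t)) == 0;
}
```

With the old order the address starts at an odd offset right after the one-byte flags field, so 16-bit loads from it are unaligned. With the new order it starts directly after the u16 state field, and on common ABIs the struct size does not change because flags simply takes over the byte that used to be tail padding.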
Re: [PATCH 1/3] fixed_phy: handle link-down case
17.07.2015 02:25, Florian Fainelli wrote:
On 16/07/15 07:50, Stas Sergeev wrote:
Currently fixed_phy driver recognizes only the link-up state. This simple patch adds an implementation of link-down state. It fixes the status registers when link is down, and also allows to register the fixed-phy with link down without specifying the speed.

This patch still breaks my setups here, e.g: drivers/net/dsa/bcm_sf2.c, but I will look into it. Do we really need this for now for your two other patches to work properly, or is it just nicer to have?

Yes, absolutely. Otherwise registering fixed phy will return -EINVAL because of the missing link speed (even though the link is down). Please check what causes the problem. I can't reproduce what you report.
Re: [PATCH] bcmsysport:Fix error handling in the function bcm_sysport_init_rx_ring
On 17/07/15 05:13, Nicholas Krause wrote:
This fixes the error handling in the function bcm_sysport_init_rx_ring after calling the function rdma_enable_set to make sure the return value is equal to zero and, if not, print "failed to enable RDMA" on the console for the device and return the error code returned by rdma_enable_set.

Subject should be starting with net: systemport: , otherwise, this looks good to me:

Reviewed-by: Florian Fainelli f.faine...@gmail.com
Tested-by: Florian Fainelli f.faine...@gmail.com

Signed-off-by: Nicholas Krause xerofo...@gmail.com
---
 drivers/net/ethernet/broadcom/bcmsysport.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c
index 4566cdf..27a7b36 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -1365,7 +1365,12 @@ static int bcm_sysport_init_rx_ring(struct bcm_sysport_priv *priv)
 	/* Initialize HW, ensure RDMA is disabled */
 	reg = rdma_readl(priv, RDMA_STATUS);
 	if (!(reg & RDMA_DISABLED))
-		rdma_enable_set(priv, 0);
+		ret = rdma_enable_set(priv, 0);
+
+	if (ret) {
+		netif_err(priv, hw, priv->netdev, "failed to enable RDMA\n");
+		return ret;
+	}

 	rdma_writel(priv, 0, RDMA_WRITE_PTR_LO);
 	rdma_writel(priv, 0, RDMA_WRITE_PTR_HI);

-- Florian
Re: [RFC net-next 22/22] openvswitch: Use regular GRE net_device instead of vport
On Fri, Jul 17, 2015 at 3:58 AM, Thomas Graf tg...@suug.ch wrote:
On 07/16/15 at 02:36pm, Pravin Shelar wrote:
On Thu, Jul 16, 2015 at 7:52 AM, Thomas Graf tg...@suug.ch wrote:
I'm inclined to change this and use an in-kernel API as well to create the net_device just like VXLAN does in patch 21. Pravin, what do you think?

About the vxlan APIs, we also need a direct netlink interface for userspace to configure the vxlan device. This will allow us to remove the vxlan compat code from ovs vport-netdev.c in future.

Do you mean creating the tunnel devices from user space? This would break existing users of the OVS Netlink interface. How do you want to prevent that?

To handle the old interface there is compat code in netdev-vport in patch 22. OVS userspace should be able to create any type of tunneling device and then add it as a netdev type vport, so that OVS has two types of vport, i.e. netdev and internal, rather than a vport for each type of tunnel. This way we can keep the compat code simple. All enhancements can be done directly to the new interface.
[PATCH v2 -next] net: fib: use fib result when zero-length prefix aliases exist
default route selection is not deterministic when TOS keys are used:

ip route del default
ip route add tos 0x00 via 10.2.100.100
ip route add tos 0x04 via 10.2.100.101
ip route add tos 0x08 via 10.2.100.102
ip route add tos 0x0C via 10.2.100.103
ip route add tos 0x10 via 10.2.100.104

[ i.e. 5 routes with prefix length 0, differentiated via TOS key ]

ip route get 10.3.1.1 tos 0x4 -> 10.2.100.101
ip route get 10.3.1.1 tos 0x8 -> 10.2.100.102
ip route get tos 0x0C -> 10.2.100.103

But for 0x10, we'll get round-robin behaviour among all the aliases. Repeated invocations return .100, 101, 102, etc. in turn.

This behaviour is not new -- fib_select_default can be traced back to fn_hash_select_default in CVS history. Routing cache made 'round-robin' behaviour less visible.

This changes fib_select_default to not change the FIB chosen result EXCEPT if this nexthop appears to be unreachable. fib_detect_death() logic is reversed -- we consider a nexthop 'dead' only if it has a neigh entry in unreachable state. Only then we search fib_aliases for an alternative and use one of these in a round-robin fashion. If all are believed to be unreachable, no change is made and fib-chosen nh_gw is used.

Reported-by: Hagen Paul Pfeifer ha...@jauu.net
Cc: Alexander Duyck alexander.h.du...@redhat.com
Signed-off-by: Florian Westphal f...@strlen.de
---
Changes since v1: Address comments from Alex Duyck:
- use if (fib_nud_is_unreach( .. rather than temporary boolean retval
- rename last_* variables to fi_, they're not the last item in the list...
- kill pointless if() statement, if order 0, then fi_last is 0 too net/ipv4/fib_semantics.c | 80 ++-- 1 file changed, 36 insertions(+), 44 deletions(-) diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index c7358ea..2cdf8d7 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -410,28 +410,24 @@ errout: rtnl_set_sk_err(info-nl_net, RTNLGRP_IPV4_ROUTE, err); } -static int fib_detect_death(struct fib_info *fi, int order, - struct fib_info **last_resort, int *last_idx, - int dflt) +static bool fib_nud_is_unreach(const struct fib_info *fi) { struct neighbour *n; int state = NUD_NONE; - n = neigh_lookup(arp_tbl, fi-fib_nh[0].nh_gw, fi-fib_dev); - if (n) { + local_bh_disable(); + + n = __neigh_lookup_noref(arp_tbl, fi-fib_nh[0].nh_gw, fi-fib_dev); + if (n) state = n-nud_state; - neigh_release(n); - } - if (state == NUD_REACHABLE) - return 0; - if ((state NUD_VALID) order != dflt) - return 0; - if ((state NUD_VALID) || - (*last_idx 0 order dflt)) { - *last_resort = fi; - *last_idx = order; - } - return 1; + + local_bh_enable(); + + /* Caller might be able to find alternate (reachable) nexthop */ + if (state (NUD_INCOMPLETE | NUD_FAILED)) + return true; + + return false; } #ifdef CONFIG_IP_ROUTE_MULTIPATH @@ -1204,12 +1200,16 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long event) /* Must be invoked inside of an RCU protected region. 
*/ void fib_select_default(struct fib_result *res) { - struct fib_info *fi = NULL, *last_resort = NULL; struct hlist_head *fa_head = res-fa_head; + struct fib_info *fi = NULL; struct fib_table *tb = res-table; - int order = -1, last_idx = -1; + int order = -1, fi_idx = -1; struct fib_alias *fa; + if (likely(!fib_nud_is_unreach(res-fi))) + return; + + /* attempt to pick another nexthop */ hlist_for_each_entry_rcu(fa, fa_head, fa_list) { struct fib_info *next_fi = fa-fa_info; @@ -1223,38 +1223,30 @@ void fib_select_default(struct fib_result *res) next_fi-fib_nh[0].nh_scope != RT_SCOPE_LINK) continue; + order++; + + if (next_fi == res-fi) /* already tested, not reachable */ + continue; + fib_alias_accessed(fa); - if (!fi) { - if (next_fi != res-fi) + if (fib_nud_is_unreach(next_fi)) + continue; + + /* try to round-robin among all fa_aliases in case +* res-fi nexthop is unreachable. +*/ + if (fi == NULL || order tb-tb_default) { + fi = next_fi; + fi_idx = order; + if (order tb-tb_default) break; - } else if (!fib_detect_death(fi, order, last_resort, -last_idx, tb-tb_default)) { - fib_result_assign(res, fi); - tb-tb_default
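The selection policy described in the commit message can be sketched in plain userspace C. This is only an illustration under stated assumptions: nud_is_unreach mirrors the mask test in fib_nud_is_unreach() (the archive dropped the `&` from the diff), the NUD_* values are copied from include/uapi/linux/neighbour.h, and pick_nexthop is a hypothetical helper that works on an array of neighbour states. It leaves out the tb_default round-robin bookkeeping, the scope and type filtering, and all RCU/BH handling that the real function needs.

```c
#include <assert.h>
#include <stddef.h>

/* Neighbour states, values as in include/uapi/linux/neighbour.h. */
#define NUD_INCOMPLETE 0x01
#define NUD_REACHABLE  0x02
#define NUD_STALE      0x04
#define NUD_FAILED     0x20
#define NUD_NONE       0x00

/*
 * Same test as in the patch: only a neighbour entry that is INCOMPLETE or
 * FAILED counts as unreachable. A missing entry (NUD_NONE) does not.
 */
static int nud_is_unreach(int state)
{
	return (state & (NUD_INCOMPLETE | NUD_FAILED)) != 0;
}

/*
 * Hypothetical simplification: states[i] is the neighbour state of the
 * gateway of alias i, chosen is the index the FIB lookup picked.
 * Keep the FIB result unless it looks unreachable; otherwise return the
 * first alias that does not look unreachable; if there is none, keep the
 * FIB result after all.
 */
static size_t pick_nexthop(const int *states, size_t n, size_t chosen)
{
	size_t i;

	if (!nud_is_unreach(states[chosen]))
		return chosen;

	for (i = 0; i < n; i++) {
		if (i != chosen && !nud_is_unreach(states[i]))
			return i;
	}

	return chosen;
}
```

The point of the inverted logic is visible in the first branch: a gateway without any neighbour entry yet is treated as usable, so a lookup with a matching TOS key keeps returning the route the FIB chose.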
Re: [PATCH -next] net: fib: use fib result when zero-length prefix aliases exist
On 07/17/2015 08:17 AM, Florian Westphal wrote: default route selection is not deterministic when TOS keys are used: ip route del default ip route add tos 0x00 via 10.2.100.100 ip route add tos 0x04 via 10.2.100.101 ip route add tos 0x08 via 10.2.100.102 ip route add tos 0x0C via 10.2.100.103 ip route add tos 0x10 via 10.2.100.104 [ i.e. 5 routes with prefix length 0, differentiated via TOS key ] ip route get 10.3.1.1 tos 0x4 - 10.2.100.101 ip route get 10.3.1.1 tos 0x8 - 10.2.100.102 ip route get tos 0x0C - 10.2.100.103 But for 0x10, we'll get round-robin results among all the aliases. Repeated queries return .100, 101, 102, etc. in turn. This behaviour is not new -- fib_select_default can be traced back to fn_hash_select_default in CVS history. Routing cache made 'round-robin' behaviour less visible. This changes fib_select_default to not change the FIB chosen result EXCEPT if this nexthop appears to be unreachable. fib_detect_death() logic is reversed -- we consider a nexthop 'dead' only if it has a neigh entry in unreachable state. Only then we search fib_aliases for an alternative and use one of these in a round-robin fashion. If all are believed to be unreachable, no change is made and fib-chosen nh_gw is used. 
Reported-by: Hagen Paul Pfeifer ha...@jauu.net Cc: Alexander Duyck alexander.h.du...@redhat.com Signed-off-by: Florian Westphal f...@strlen.de --- net/ipv4/fib_semantics.c | 71 1 file changed, 36 insertions(+), 35 deletions(-) diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index c7358ea..83b485b 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -410,28 +410,24 @@ errout: rtnl_set_sk_err(info-nl_net, RTNLGRP_IPV4_ROUTE, err); } -static int fib_detect_death(struct fib_info *fi, int order, - struct fib_info **last_resort, int *last_idx, - int dflt) +static bool fib_nud_is_unreach(const struct fib_info *fi) { struct neighbour *n; int state = NUD_NONE; - n = neigh_lookup(arp_tbl, fi-fib_nh[0].nh_gw, fi-fib_dev); - if (n) { + local_bh_disable(); + + n = __neigh_lookup_noref(arp_tbl, fi-fib_nh[0].nh_gw, fi-fib_dev); + if (n) state = n-nud_state; - neigh_release(n); - } - if (state == NUD_REACHABLE) - return 0; - if ((state NUD_VALID) order != dflt) - return 0; - if ((state NUD_VALID) || - (*last_idx 0 order dflt)) { - *last_resort = fi; - *last_idx = order; - } - return 1; + + local_bh_enable(); + + /* Caller might be able to find alternate (reachable) nexthop */ + if (state (NUD_INCOMPLETE | NUD_FAILED)) + return true; + + return false; } #ifdef CONFIG_IP_ROUTE_MULTIPATH @@ -1204,12 +1200,17 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long event) /* Must be invoked inside of an RCU protected region. */ void fib_select_default(struct fib_result *res) { - struct fib_info *fi = NULL, *last_resort = NULL; struct hlist_head *fa_head = res-fa_head; + struct fib_info *last_resort = NULL; struct fib_table *tb = res-table; int order = -1, last_idx = -1; struct fib_alias *fa; + bool unreach = fib_nud_is_unreach(res-fi); + if (likely(!unreach)) + return; + There probably isn't any need for the boolean variable. You could just place the function in the if statement itself. 
+ /* attempt to pick another nexthop */ hlist_for_each_entry_rcu(fa, fa_head, fa_list) { struct fib_info *next_fi = fa-fa_info; @@ -1223,33 +1224,33 @@ void fib_select_default(struct fib_result *res) next_fi-fib_nh[0].nh_scope != RT_SCOPE_LINK) continue; + order++; + + if (next_fi == res-fi) /* already tested, not reachable */ + continue; + fib_alias_accessed(fa); - if (!fi) { - if (next_fi != res-fi) + unreach = fib_nud_is_unreach(next_fi); + if (unreach) + continue; + Same here. It seems like this is just an extra variable that isn't really needed. + /* try to round-robin among all fa_aliases in case +* res-fi nexthop is unreachable. +*/ + if (last_idx 0 || order tb-tb_default) { + last_resort = next_fi; + last_idx = order; + if (order tb-tb_default) break; You might want to update the variable naming as it can be a bit confusing. The last_resort and last_idx represent either the first fib_info and index, or the next one after current entry in
[PATCH] sctp: fix cut and paste issue in comment
Cookie ACK is always received by the association initiator, so fix the comment to avoid confusion.

Signed-off-by: Marcelo Ricardo Leitner marcelo.leit...@gmail.com
---
 net/sctp/sm_statefuns.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index 3ee27b7704ffb95430541507e83973e9207f9672..d7eaa7354cf76148d1a2c9ee3af4fff9a24990fb 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -853,7 +853,7 @@ nomem:
 /*
  * Respond to a normal COOKIE ACK chunk.
- * We are the side that is being asked for an association.
+ * We are the side that is asking for an association.
  *
  * RFC 2960 5.1 Normal Establishment of an Association
  *
--
2.4.1
RE: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be processed during probe
-----Original Message-----
From: Dexuan Cui
Sent: Friday, July 17, 2015 3:07 AM
To: KY Srinivasan; da...@davemloft.net; netdev@vger.kernel.org; linux-ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; jasow...@redhat.com; vkuzn...@redhat.com
Cc: KY Srinivasan
Subject: RE: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be processed during probe

From: K. Y. Srinivasan Sent: Friday, July 17, 2015 3:17 Subject: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be processed during probe

The current code returns from probe without waiting for the proper handling of subchannels that may be requested. If the netvsc driver were to be rapidly loaded/unloaded, we can trigger a panic as the unload will be tearing down state that may not have been fully setup yet. We fix this issue by making sure that we return from the probe call only after ensuring that the sub-channel offers in flight are properly handled.
---
 drivers/net/hyperv/hyperv_net.h |2 ++
 drivers/net/hyperv/rndis_filter.c | 25 +
 2 files changed, 27 insertions(+), 0 deletions(-)

BTW, not sure if we should make the same fix to storvsc. IMO storvsc should have the same issue, at least in theory, though usually it's unlikely to unload storvsc. :-)

You are right; I am planning to submit a similar patch for storvsc. As you note, this scenario is unlikely for storvsc.

K. Y

Thanks,
-- Dexuan
RE: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be processed during probe
-Original Message- From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com] Sent: Friday, July 17, 2015 9:10 AM To: KY Srinivasan Cc: da...@davemloft.net; netdev@vger.kernel.org; linux- ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; jasow...@redhat.com; Dexuan Cui Subject: Re: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be processed during probe KY Srinivasan k...@microsoft.com writes: -Original Message- From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com] Sent: Friday, July 17, 2015 7:13 AM To: KY Srinivasan Cc: da...@davemloft.net; netdev@vger.kernel.org; linux- ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; jasow...@redhat.com; Dexuan Cui Subject: Re: [PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be processed during probe K. Y. Srinivasan k...@microsoft.com writes: The current code returns from probe without waiting for the proper handling of subchannels that may be requested. If the netvsc driver were to be rapidly loaded/unloaded, we can trigger a panic as the unload will be tearing down state that may not have been fully setup yet. We fix this issue by making sure that we return from the probe call only after ensuring that the sub-channel offers in flight are properly handled. Signed-off-by: K. Y. 
Srinivasan k...@microsoft.com Reviewed-and-tested-by: Haiyang Zhang haiya...@microsoft.com --- drivers/net/hyperv/hyperv_net.h |2 ++ drivers/net/hyperv/rndis_filter.c | 25 + 2 files changed, 27 insertions(+), 0 deletions(-) diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h index 26cd14c..925b75d 100644 --- a/drivers/net/hyperv/hyperv_net.h +++ b/drivers/net/hyperv/hyperv_net.h @@ -671,6 +671,8 @@ struct netvsc_device { u32 send_table[VRSS_SEND_TAB_SIZE]; u32 max_chn; u32 num_chn; +spinlock_t sc_lock; /* Protects num_sc_offered variable */ +u32 num_sc_offered; atomic_t queue_sends[NR_CPUS]; /* Holds rndis device info */ diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c index 2e40417..2e09f3f 100644 --- a/drivers/net/hyperv/rndis_filter.c +++ b/drivers/net/hyperv/rndis_filter.c @@ -984,9 +984,16 @@ static void netvsc_sc_open(struct vmbus_channel *new_sc) struct netvsc_device *nvscdev; u16 chn_index = new_sc-offermsg.offer.sub_channel_index; int ret; +unsigned long flags; nvscdev = hv_get_drvdata(new_sc-primary_channel-device_obj); +spin_lock_irqsave(nvscdev-sc_lock, flags); +nvscdev-num_sc_offered--; +spin_unlock_irqrestore(nvscdev-sc_lock, flags); +if (nvscdev-num_sc_offered == 0) +complete(nvscdev-channel_init_wait); + if (chn_index = nvscdev-num_chn) return; @@ -1015,8 +1022,10 @@ int rndis_filter_device_add(struct hv_device *dev, u32 rsscap_size = sizeof(struct ndis_recv_scale_cap); u32 mtu, size; u32 num_rss_qs; +u32 sc_delta; const struct cpumask *node_cpu_mask; u32 num_possible_rss_qs; +unsigned long flags; rndis_device = get_rndis_device(); if (!rndis_device) @@ -1039,6 +1048,8 @@ int rndis_filter_device_add(struct hv_device *dev, net_device-max_chn = 1; net_device-num_chn = 1; +spin_lock_init(net_device-sc_lock); + net_device-extension = rndis_device; rndis_device-net_dev = net_device; @@ -1116,6 +1127,9 @@ int rndis_filter_device_add(struct hv_device *dev, num_possible_rss_qs = 
cpumask_weight(node_cpu_mask);
 	net_device->num_chn = min(num_possible_rss_qs, num_rss_qs);

+	num_rss_qs = net_device->num_chn - 1;
+	net_device->num_sc_offered = num_rss_qs;
+
 	if (net_device->num_chn == 1)
 		goto out;

@@ -1157,11 +1171,22 @@ int rndis_filter_device_add(struct hv_device *dev,
 	ret = rndis_filter_set_rss_param(rndis_device, net_device->num_chn);

+	/*
+	 * Wait for the host to send us the sub-channel offers.
+	 */
+	spin_lock_irqsave(&net_device->sc_lock, flags);
+	sc_delta = net_device->num_chn - 1 - num_rss_qs;
+	net_device->num_sc_offered -= sc_delta;
+	spin_unlock_irqrestore(&net_device->sc_lock, flags);
+
+	if (net_device->num_sc_offered != 0)
+		wait_for_completion(&net_device->channel_init_wait);

I'd suggest we add an essential timeout (big, let's say 30 sec.) here. In case something goes wrong we don't really want to hang the whole
Re: [PATCH 1/3] fixed_phy: handle link-down case
On 17/07/15 04:26, Stas Sergeev wrote:
17.07.2015 02:25, Florian Fainelli wrote:
On 16/07/15 07:50, Stas Sergeev wrote:
Currently fixed_phy driver recognizes only the link-up state. This simple patch adds an implementation of link-down state. It fixes the status registers when link is down, and also allows to register the fixed-phy with link down without specifying the speed.

This patch still breaks my setups here, e.g: drivers/net/dsa/bcm_sf2.c, but I will look into it. Do we really need this for now for your two other patches to work properly, or is it just nicer to have?

Yes, absolutely. Otherwise registering fixed phy will return -EINVAL because of the missing link speed (even though the link is down).

Ok, I see the problem that you have now. Arguably you could say that according to the fixed-link binding, speed needs to be specified and the code correctly errors out with such an error if you do not specify it. I also agree that having to specify speed and duplex for something you will end up auto-negotiating has no useful purpose.

Please, see what makes a problem. I can't reproduce what you report.

So what is different is that I use a link_update callback, and so we rely on at least one call of this function to initialize the hardware in drivers/net/dsa/bcm_sf2.c for this to work. After that, the hardware reflects the fixed link parameters we configured, and we feed the fixed_phy_status information from the hardware directly.

From there I see two different ways to fix this:

- we ignore the fixed_phy_update_regs return value in fixed_phy_add(), but that will make us avoid doing verification on the speed, which is not so great, but is essentially what your patch does anyway

- we update the use of the fixed PHY link_update in drivers using it and convert them to use fixed_phy_update_state instead, which can take some time and effort to convert

What do you think? I would go with option 1 and eventually introduce a special switch() case on the speed settings just to validate we know them.

Thanks
-- Florian
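The "special switch() case on the speed settings" could look roughly like the sketch below. It is a guess at the shape of such a check, not code from the fixed_phy driver: fixed_link_check_speed is a hypothetical helper, the accepted speeds (10, 100, 1000) are an assumption, and skipping the check while the link is down is meant to reflect the behaviour the patch under discussion is after.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical validation helper, not an existing fixed_phy function.
 * link:  nonzero if the fixed link is reported as up
 * speed: link speed in Mbit/s
 * Returns 0 if the settings are acceptable, -EINVAL otherwise.
 */
static int fixed_link_check_speed(int link, int speed)
{
	/* A link that is down carries no meaningful speed, so accept it. */
	if (!link)
		return 0;

	switch (speed) {
	case 10:
	case 100:
	case 1000:
		return 0;
	default:
		/* Unknown or missing speed on a link that claims to be up. */
		return -EINVAL;
	}
}
```

Keeping the validation in a separate helper like this would let fixed_phy_add() ignore the register-update result, as in option 1, while still rejecting speeds nobody knows how to encode.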
Re: [PATCH net-next v2] rhashtable: Allow other tasks to be scheduled in large lookup loops
Depending on system speed, the large lookup/insert/delete loops of the testsuite can take a considerable amount of time to complete causing watchdog warnings to appear. Allow other tasks to be scheduled throughout the loops. Reported-by: Meelis Roos mr...@linux.ee Signed-off-by: Thomas Graf tg...@suug.ch --- v2: Use cond_resched() instead schedule() Tested it. The warning is gone from rhashtable test but now it is present in rbtree test (it was not there before). Same kernel, just your patch applied - but it should not change rbtree test??? [0.00] PROMLIB: Sun IEEE Boot Prom 'OBP 3.31.0 2001/07/25 20:36' [0.00] PROMLIB: Root node compatible: [0.00] Linux version 4.2.0-rc2-00077-gf760b87-dirty (mroos@u5) (gcc version 4.9.3 (Debian 4.9.3-1) ) #21 Fri Jul 17 20:15:21 EEST 2015 [0.00] bootconsole [earlyprom0] enabled [0.00] ARCH: SUN4U [0.00] Ethernet address: 08:00:20:f8:c7:72 [0.00] MM: PAGE_OFFSET is 0xf800 (max_phys_bits == 40) [0.00] MM: VMALLOC [0x0001 -- 0x0600] [0.00] MM: VMEMMAP [0x0600 -- 0x0c00] [0.00] Kernel: Using 10 locked TLB entries for main kernel image. [0.00] Remapping the kernel... done. [0.00] kmemleak: Kernel memory leak detector disabled [0.00] OF stdout device is: /pci@1f,0/pci@1,1/ebus@1/se@14,40:a [0.00] PROM: Built device tree with 70266 bytes of memory. [0.00] Top of RAM: 0x1ff2c000, Total RAM: 0x1ff2a000 [0.00] Memory hole size: 0MB [0.00] Allocated 16384 bytes for kernel page tables. [0.00] Zone ranges: [0.00] Normal [mem 0x-0x1ff2bfff] [0.00] Movable zone start for each node [0.00] Early memory node ranges [0.00] node 0: [mem 0x-0x1fefdfff] [0.00] node 0: [mem 0x1ff0-0x1ff2bfff] [0.00] Initmem setup node 0 [mem 0x-0x1ff2bfff] [0.00] On node 0 totalpages: 65429 [0.00] Normal zone: 512 pages used for memmap [0.00] Normal zone: 0 pages reserved [0.00] Normal zone: 65429 pages, LIFO batch:15 [0.00] Booting Linux... 
[0.00] CPU CAPS: [flush,stbar,swap,muldiv,v9,mul32,div32,v8plus] [0.00] CPU CAPS: [vis] [0.00] pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768 [0.00] pcpu-alloc: [0] 0 [0.00] Built 1 zonelists in Zone order, mobility grouping on. Total pages: 64917 [0.00] Kernel command line: root=/dev/sda1 ro [0.00] PID hash table entries: 2048 (order: 1, 16384 bytes) [0.00] Dentry cache hash table entries: 65536 (order: 6, 524288 bytes) [0.00] Inode-cache hash table entries: 32768 (order: 5, 262144 bytes) [0.00] Sorting __ex_table... [0.00] Memory: 475912K/523432K available (5270K kernel code, 516K rwdata, 1672K rodata, 520K init, 30210K bss, 47520K reserved, 0K cma-reserved) [0.00] Running RCU self tests [0.00] Testing tracer nop: PASSED [0.00] NR_IRQS:2048 nr_irqs:2048 1 [ 26.882478] clocksource: tick: mask: 0x max_cycles: 0x5306eb473f, max_idle_ns: 440795213232 ns [ 26.986192] clocksource: mult[2c71c72] shift[24] [ 27.025729] clockevent: mult[5c28f5c3] shift[32] [ 27.067997] Console: colour dummy device 80x25 [ 27.104149] console [tty0] enabled [ 27.128868] bootconsole [earlyprom0] disabled [ 27.165340] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar [ 27.165405] ... MAX_LOCKDEP_SUBCLASSES: 8 [ 27.165445] ... MAX_LOCK_DEPTH: 48 [ 27.165486] ... MAX_LOCKDEP_KEYS:8191 [ 27.165529] ... CLASSHASH_SIZE: 4096 [ 27.165574] ... MAX_LOCKDEP_ENTRIES: 32768 [ 27.165617] ... MAX_LOCKDEP_CHAINS: 65536 [ 27.165662] ... 
CHAINHASH_SIZE: 32768 [ 27.165706] memory used by lock dependency info: 8159 kB [ 27.165756] per task-struct memory footprint: 1920 bytes [ 27.165802] [ 27.165838] | Locking API testsuite: [ 27.165873] [ 27.165932] | spin |wlock |rlock |mutex | wsem | rsem | [ 27.165993] -- [ 27.166092] A-A deadlock: ok | ok | ok | ok | ok | ok | [ 27.232682] A-B-B-A deadlock: ok | ok | ok | ok | ok | ok | [ 27.299789] A-B-B-C-C-A deadlock: ok | ok | ok | ok | ok | ok | [ 27.367295] A-B-C-A-B-C deadlock: ok | ok | ok | ok | ok | ok | [ 27.434877] A-B-B-C-C-D-D-A deadlock: ok | ok | ok | ok | ok | ok | [ 27.502857] A-B-C-D-B-D-D-A deadlock: ok | ok | ok | ok | ok |
Re: [net-next 01/14] clarify implementation of ethtool's get_ts_info op
On Fri, Jul 17, 2015 at 07:25:10AM -0700, Jeff Kirsher wrote:
From: Jacob Keller jacob.e.kel...@intel.com

This patch adds some clarification about the intended way to implement both SIOCSHWTSTAMP and ethtool's get_ts_info. The HWTSTAMP API has several Rx filters which are very specific, as well as more general filters. The specific filters really only exist to support some broken hardware which can't fully implement the generic filters. This patch adds clarification that it is okay to support the specific filters in SIOCSHWTSTAMP by upscaling them to the generic filters. In addition, update the header for ethtool_ts_info to specify that drivers ought to only report the filters they support without upscaling in this manner.

Acked-by: Richard Cochran richardcoch...@gmail.com (for this patch and the other get_ts_info patches)
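A small sketch may help to see the split between the two interfaces. Everything prefixed EX_ or ex_ is hypothetical: the enum is a local stand-in for a subset of enum hwtstamp_rx_filters (the numeric values are not the kernel's), ex_upscale models what a driver's SIOCSHWTSTAMP handler might do with a requested filter, and ex_reported_filters models what its get_ts_info might advertise, assuming hardware that supports only the generic V1 L4 and V2 event filters.

```c
#include <assert.h>

/* Local stand-ins for a few hwtstamp_rx_filters entries (values differ). */
enum ex_rx_filter {
	EX_FILTER_NONE,
	EX_FILTER_PTP_V1_L4_EVENT,
	EX_FILTER_PTP_V1_L4_SYNC,
	EX_FILTER_PTP_V1_L4_DELAY_REQ,
	EX_FILTER_PTP_V2_EVENT,
	EX_FILTER_PTP_V2_SYNC,
	EX_FILTER_PTP_V2_DELAY_REQ,
};

/*
 * SIOCSHWTSTAMP side: a specific filter is accepted and upscaled to the
 * generic filter that covers it. The caller is told which filter is
 * actually in effect.
 */
static enum ex_rx_filter ex_upscale(enum ex_rx_filter requested)
{
	switch (requested) {
	case EX_FILTER_PTP_V1_L4_EVENT:
	case EX_FILTER_PTP_V1_L4_SYNC:
	case EX_FILTER_PTP_V1_L4_DELAY_REQ:
		return EX_FILTER_PTP_V1_L4_EVENT;
	case EX_FILTER_PTP_V2_EVENT:
	case EX_FILTER_PTP_V2_SYNC:
	case EX_FILTER_PTP_V2_DELAY_REQ:
		return EX_FILTER_PTP_V2_EVENT;
	default:
		return EX_FILTER_NONE;
	}
}

/*
 * get_ts_info side: only the filters the hardware really implements are
 * reported, not the specific ones that merely get upscaled.
 */
static unsigned int ex_reported_filters(void)
{
	return (1u << EX_FILTER_NONE) |
	       (1u << EX_FILTER_PTP_V1_L4_EVENT) |
	       (1u << EX_FILTER_PTP_V2_EVENT);
}
```

A request for a specific SYNC filter therefore still succeeds, but the reported mask only contains the generic filters, which is the shape the driver patches in this series move towards.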
Re: [PATCH] net: bcmgenet: Return the variable ret rather then zero for the function bcmgenet_power_down
On 16/07/15 12:38, Nicholas Krause wrote:
This makes the function bcmgenet_power_down return the variable ret rather than zero, so that it can signal an error code to its caller when a failure occurs internally rather than always appearing to run successfully.

Signed-off-by: Nicholas Krause xerofo...@gmail.com

Reviewed-by: Florian Fainelli f.faine...@gmail.com

---
 drivers/net/ethernet/broadcom/genet/bcmgenet.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index 64c1e9d..129e5b5 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -877,7 +877,7 @@ static int bcmgenet_power_down(struct bcmgenet_priv *priv,
 		break;
 	}

-	return 0;
+	return ret;
 }

 static void bcmgenet_power_up(struct bcmgenet_priv *priv,

-- Florian
Re: 4.1.0, kernel panic, pppoe_release
Probably my knowledge of the kernel is not sufficient, but I will try a few approaches. One of them is to add to pppoe_unbind_sock_work:

 pppox_unbind_sock(sk);
+/* Signal the death of the socket. */
+sk->sk_state = PPPOX_DEAD;

I will wait first, to make sure this patch was causing the kernel panic (it needs a 24h testing cycle), then I will try this fix.

On 2015-07-17 18:36, Dan Williams wrote:
On Fri, 2015-07-17 at 12:24 +0300, Denys Fedoryshchenko wrote:
As I suspect, this kernel panic is caused by recent changes to pppoe. The problem appears in accel-pppd (server), on loaded servers (2k users and more). Most probably it is related to the change "pppoe: Use workqueue to die properly when a PADT is received". I will try to revert this and related patches.

While I didn't write the patch, I'm the one that started the process that got it submitted... Could you review the patch quickly too to see if you can spot anything amiss with it, so that it could get fixed up? The original patch does fix a real problem so ideally we don't have to revert the whole thing upstream.

Dan

On 2015-07-14 13:57, Denys Fedoryshchenko wrote:
Here is the panic message from netconsole. Please let me know if any additional information is required.
Jul 14 13:49:16 10.0.252.10 [76078.867822] BUG: unable to handle kernel Jul 14 13:49:16 10.0.252.10 NULL pointer dereference Jul 14 13:49:16 10.0.252.10 at 03f0 Jul 14 13:49:16 10.0.252.10 [76078.868280] IP: Jul 14 13:49:16 10.0.252.10 [a011e12a] pppoe_release+0x56/0x142 [pppoe] Jul 14 13:49:16 10.0.252.10 [76078.868541] PGD 336e4a067 Jul 14 13:49:16 10.0.252.10 PUD 333f17067 Jul 14 13:49:16 10.0.252.10 PMD 0 Jul 14 13:49:16 10.0.252.10 Jul 14 13:49:16 10.0.252.10 [76078.868918] Oops: [#1] Jul 14 13:49:16 10.0.252.10 SMP Jul 14 13:49:16 10.0.252.10 Jul 14 13:49:16 10.0.252.10 [76078.869226] Modules linked in: Jul 14 13:49:16 10.0.252.10 netconsole Jul 14 13:49:16 10.0.252.10 configfs Jul 14 13:49:16 10.0.252.10 coretemp Jul 14 13:49:16 10.0.252.10 sch_fq Jul 14 13:49:16 10.0.252.10 cls_fw Jul 14 13:49:16 10.0.252.10 act_police Jul 14 13:49:16 10.0.252.10 cls_u32 Jul 14 13:49:16 10.0.252.10 sch_ingress Jul 14 13:49:16 10.0.252.10 sch_sfq Jul 14 13:49:16 10.0.252.10 sch_htb Jul 14 13:49:16 10.0.252.10 pppoe Jul 14 13:49:16 10.0.252.10 pppox Jul 14 13:49:16 10.0.252.10 ppp_generic Jul 14 13:49:16 10.0.252.10 slhc Jul 14 13:49:16 10.0.252.10 nf_nat_pptp Jul 14 13:49:16 10.0.252.10 nf_nat_proto_gre Jul 14 13:49:16 10.0.252.10 nf_conntrack_pptp Jul 14 13:49:16 10.0.252.10 nf_conntrack_proto_gre Jul 14 13:49:16 10.0.252.10 tun Jul 14 13:49:16 10.0.252.10 xt_REDIRECT Jul 14 13:49:16 10.0.252.10 nf_nat_redirect Jul 14 13:49:16 10.0.252.10 xt_set Jul 14 13:49:16 10.0.252.10 xt_TCPMSS Jul 14 13:49:16 10.0.252.10 ipt_REJECT Jul 14 13:49:16 10.0.252.10 nf_reject_ipv4 Jul 14 13:49:16 10.0.252.10 ts_bm Jul 14 13:49:16 10.0.252.10 xt_string Jul 14 13:49:16 10.0.252.10 xt_connmark Jul 14 13:49:16 10.0.252.10 xt_DSCP Jul 14 13:49:16 10.0.252.10 xt_mark Jul 14 13:49:16 10.0.252.10 xt_tcpudp Jul 14 13:49:16 10.0.252.10 iptable_mangle Jul 14 13:49:16 10.0.252.10 iptable_filter Jul 14 13:49:16 10.0.252.10 iptable_nat Jul 14 13:49:16 10.0.252.10 nf_conntrack_ipv4 Jul 14 13:49:16 
10.0.252.10 nf_defrag_ipv4 Jul 14 13:49:16 10.0.252.10 nf_nat_ipv4 Jul 14 13:49:16 10.0.252.10 nf_nat Jul 14 13:49:16 10.0.252.10 nf_conntrack Jul 14 13:49:16 10.0.252.10 ip_tables Jul 14 13:49:16 10.0.252.10 x_tables Jul 14 13:49:16 10.0.252.10 ip_set_hash_ip Jul 14 13:49:16 10.0.252.10 ip_set Jul 14 13:49:16 10.0.252.10 nfnetlink Jul 14 13:49:16 10.0.252.10 8021q Jul 14 13:49:16 10.0.252.10 garp Jul 14 13:49:16 10.0.252.10 mrp Jul 14 13:49:16 10.0.252.10 stp Jul 14 13:49:16 10.0.252.10 llc Jul 14 13:49:16 10.0.252.10 [last unloaded: netconsole] Jul 14 13:49:16 10.0.252.10 Jul 14 13:49:16 10.0.252.10 [76078.873195] CPU: 3 PID: 2940 Comm: accel-pppd Not tainted 4.1.0-build-0074 #7 Jul 14 13:49:16 10.0.252.10 [76078.873396] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 04/02/2015 Jul 14 13:49:16 10.0.252.10 [76078.873598] task: 8800b1886ba0 ti: 8800b09f4000 task.ti: 8800b09f4000 Jul 14 13:49:16 10.0.252.10 [76078.873929] RIP: 0010:[a011e12a] Jul 14 13:49:16 10.0.252.10 [a011e12a] pppoe_release+0x56/0x142 [pppoe] Jul 14 13:49:16 10.0.252.10 [76078.874317] RSP: 0018:8800b09f7e28 EFLAGS: 00010202 Jul 14 13:49:16 10.0.252.10 [76078.874512] RAX: RBX: 88032a214400 RCX: Jul 14 13:49:16 10.0.252.10 [76078.874709] RDX: 000d RSI: fe01 RDI: 8180d6da Jul 14 13:49:16 10.0.252.10 [76078.874906] RBP: 8800b09f7e68 R08: R09: Jul 14 13:49:16 10.0.252.10 [76078.875102] R10: 88031ef6a110 R11: 0293 R12: 88030f8d8fc0 Jul 14 13:49:16 10.0.252.10 [76078.875299] R13: 88030f8d8ff0
Re: [PATCH net-next 1/2] bpf: introduce bpf_skb_vlan_push/pop() helpers
On 7/17/15 1:12 AM, Eric Dumazet wrote:
> On Thu, 2015-07-16 at 19:58 -0700, Alexei Starovoitov wrote:
>> In order to let eBPF programs call skb_vlan_push/pop via helper functions
>
> Why should an eBPF program do such a thing?
> Are BPF users in the kernel expecting the skb to be changed, and are we sure they reload all cached values when/if needed?

Well, classic BPF and even extended BPF with socket filters cannot use these helpers. They are for TC ingress/egress only. There, different actions can already change skb->data while classifiers/actions are running. BTW, before we started discussing this topic at nfws, I thought that bpf programs would never be able to change skb->data from inside the programs, but it turned out that it's only the JITs that needed re-caching. Programs cannot see skb->data. They can access the packet only via ld_abs/ld_ind instructions and helper functions. So any changes to internal fields of the skb are invisible to programs. The skb->data/hlen cache that is part of the JIT is also invisible to the programs. It's an optimization that some JITs do. The arm64 JIT doesn't do this optimization, for example. I'll reword the commit log to better explain this.

One of the use cases is this phys2virt gateway I presented. There I need to do vlan-learning and src mac forwarding. Currently I'm creating as many vlan netdevs as I can on top of regular eth0 and attaching tc+bpf to all of them. That's very inefficient. With these helpers I'll attach tc+bpf to eth0 only and skip the creation of thousands of vlan netdevs.

>> eBPF JITs need to recognize helpers that change skb->data, since skb->data and hlen are cached as part of JIT code generation.
>> - arm64 JIT is using bpf_load_pointer() without caching, so it's ok as-is.
>> - x64 JIT recognizes bpf_skb_vlan_push/pop() calls and re-caches skb->data/hlen after such calls (experiments showed that conditional re-caching is slower).
>> - s390 JIT falls back to interpreter for now when bpf_skb_vlan_push() is present in the program (re-caching is tbd).
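The re-caching point above can be illustrated outside the kernel. The following is only a userspace sketch of the general hazard, not the JIT code from the patch: `struct pkt`, `pkt_tag_push()` and `first_tag_byte_after_push()` are hypothetical names, and the helper always allocates a new buffer so that the stale-pointer case is guaranteed to occur.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an skb: a heap buffer plus its length. */
struct pkt {
	uint8_t *data;
	size_t   len;
};

/*
 * Hypothetical helper: insert a 4-byte tag after the first 12 bytes.
 * Like a helper that changes skb->data, it moves the buffer.
 */
static int pkt_tag_push(struct pkt *p, const uint8_t tag[4])
{
	uint8_t *n;

	assert(p->len >= 12);
	n = malloc(p->len + 4);	/* always a new buffer, to force the move */
	if (!n)
		return -1;
	memcpy(n, p->data, 12);
	memcpy(n + 12, tag, 4);
	memcpy(n + 16, p->data + 12, p->len - 12);
	free(p->data);
	p->data = n;
	p->len += 4;
	return 0;
}

/* Push a tag, then return the byte at offset 12 (or -1 on failure). */
static int first_tag_byte_after_push(struct pkt *p)
{
	const uint8_t tag[4] = { 0x81, 0x00, 0x00, 0x05 };
	const uint8_t *data = p->data;	/* cached, like a JIT register */
	size_t len = p->len;

	if (pkt_tag_push(p, tag) < 0)
		return -1;

	/* The old 'data' points at freed memory now: re-cache both values. */
	data = p->data;
	len = p->len;

	return len > 12 ? data[12] : -1;
}
```

Reloading unconditionally after the call mirrors the x64 choice described in the commit message, where conditional re-caching was reported to be slower.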
>> +static u64 bpf_skb_vlan_push(u64 r1, u64 vlan_proto, u64 vlan_tci, u64 r4, u64 r5)
>> +{
>> +	struct sk_buff *skb = (struct sk_buff *) (long) r1;
>> +
>> +	if (unlikely(vlan_proto != htons(ETH_P_8021Q) &&
>> +		     vlan_proto != htons(ETH_P_8021AD)))
>> +		vlan_proto = htons(ETH_P_8021Q);
>
> This would raise a sparse error, as vlan_proto is u64 and htons() returns __be16:
>
> make C=2 CF=-D__CHECK_ENDIAN__ net/core/filter.o

Yes. When I wrote these lines I thought of the same, so I did run sparse and it spewed a lot of false positives and stopped on 'too many errors' before reaching these lines. So I downloaded the latest sparse, hacked it a little and tried again. Still it didn't complain about the endianness. That was puzzling, so I left the above lines as-is. But since your eagle eyes caught it, I will add casts :)
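The thread does not show the casts that were eventually added. As a rough userspace sketch of the idea, the 64-bit helper argument can be narrowed explicitly to a 16-bit network-order value before it is compared with htons() results. `normalize_vlan_proto()` is a hypothetical name, and the constants are defined locally only so the example stands alone. In kernel code the narrowing would presumably go through the sparse-annotated `__be16` type with a `__force` cast, but that is an assumption here.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

#ifndef ETH_P_8021Q
#define ETH_P_8021Q  0x8100	/* 802.1Q VLAN tag */
#endif
#ifndef ETH_P_8021AD
#define ETH_P_8021AD 0x88A8	/* 802.1ad service VLAN tag */
#endif

static_assert(ETH_P_8021Q <= 0xFFFF && ETH_P_8021AD <= 0xFFFF,
	      "protocol IDs must fit in 16 bits");

/*
 * Helper arguments arrive as 64-bit values; narrow explicitly so both
 * sides of each comparison are 16-bit values in network byte order.
 * Unknown protocols fall back to 802.1Q, as in the quoted patch.
 */
static uint16_t normalize_vlan_proto(uint64_t vlan_proto_arg)
{
	uint16_t vlan_proto = (uint16_t)vlan_proto_arg;

	if (vlan_proto != htons(ETH_P_8021Q) &&
	    vlan_proto != htons(ETH_P_8021AD))
		vlan_proto = htons(ETH_P_8021Q);
	return vlan_proto;
}
```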
Re: [PATCH net-next 14/22] vxlan: Flow based tunneling
On 7/17/15 5:55 AM, Thomas Graf wrote:
> @@ -2373,6 +2470,12 @@ static void vxlan_setup(struct net_device *dev)
>  	netif_keep_dst(dev);
>  	dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
>
> +	/* If in flow based mode, keep the dst including encapsulation
> +	 * instructions for vxlan_xmit().
> +	 */
> +	if (vxlan->flags & VXLAN_F_FLOW_BASED)
> +		netif_keep_dst(dev);

Hmm, isn't this done already a few lines above? ;)
Re: [PATCH net-next v2] rhashtable: Allow other tasks to be scheduled in large lookup loops
On Fri, 2015-07-17 at 22:07 +0300, mr...@linux.ee wrote:
>> Depending on system speed, the large lookup/insert/delete loops of the testsuite can take a considerable amount of time to complete causing watchdog warnings to appear. Allow other tasks to be scheduled throughout the loops.
>>
>> Reported-by: Meelis Roos mr...@linux.ee
>> Signed-off-by: Thomas Graf tg...@suug.ch
>> ---
>> v2: Use cond_resched() instead of schedule()
>
> Tested it. The warning is gone from the rhashtable test but now it is present in the rbtree test (it was not there before). Same kernel, just your patch applied - but it should not change the rbtree test???

Why not? The rbtree tests need the same kind of patches.
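cond_resched() is only available inside the kernel, so the pattern from the patch cannot be run as-is in userspace. A loose analogy is a long loop that calls sched_yield() every so often. This is a sketch under that assumption, not the rhashtable test code: `sum_with_yield()` and `YIELD_EVERY` are hypothetical, and sched_yield() always gives up the CPU, whereas cond_resched() reschedules only when it is needed.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

#define YIELD_EVERY 1024	/* hypothetical batch size between yields */

/* Sum a large array, letting other tasks run between batches. */
static unsigned long sum_with_yield(const unsigned int *v, size_t n)
{
	unsigned long sum = 0;

	assert(v != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		sum += v[i];
		if ((i + 1) % YIELD_EVERY == 0)
			sched_yield();	/* userspace stand-in for cond_resched() */
	}
	return sum;
}
```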
Re: [PATCHv2 RFC net-next] net/vxlan: Fix kernel unaligned access in __vxlan_find_mac
On (07/17/15 16:07), Joe Perches wrote:
> On Fri, 2015-07-17 at 22:00 +0200, Sowmini Varadhan wrote:
>> __vxlan_find_mac invokes ether_addr_equal on the eth_addr field, which triggers unaligned access messages, so rearrange vxlan_fdb to avoid this in the most non-intrusive way.
>
> What arch does this?

sparc. BTW, I was also getting a lot of alignment errors from vxlan_xmit_skb (vxh comes out unaligned) for the IPv6 path. I did not have time to investigate/fix this correctly- not sure if this is specific to v6.

--Sowmini
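The actual vxlan_fdb layout is not shown in this thread, so the structs below are hypothetical. They only sketch why field order matters: a comparison that reads the address in 16-bit units needs the field to start on an even offset, and moving the field within the struct can provide that without touching the callers. `mac_equal()` is an alignment-agnostic alternative based on memcmp(), shown for contrast.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/* Hypothetical layout: a 1-byte field pushes the address to an odd offset. */
struct fdb_odd {
	uint8_t  flags;
	uint8_t  eth_addr[ETH_ALEN];
	uint16_t state;
};

/* Rearranged: the address comes first, so it starts at offset 0. */
struct fdb_even {
	uint8_t  eth_addr[ETH_ALEN];
	uint16_t state;
	uint8_t  flags;
};

/* Compile-time check that 16-bit reads of eth_addr would be aligned. */
static_assert(offsetof(struct fdb_even, eth_addr) % 2 == 0,
	      "eth_addr must start on a 2-byte boundary");

/* Byte-wise comparison that has no alignment requirement. */
static int mac_equal(const uint8_t *a, const uint8_t *b)
{
	return memcmp(a, b, ETH_ALEN) == 0;
}
```

Both layouts contain the same fields; only the position of `eth_addr` changes.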
Re: [PATCH net-next 1/2] tcp: don't extend RTO on failed loss probe attempts
On Fri, 2015-07-17 at 14:22 -0700, Yuchung Cheng wrote:
> If TLP was unable to send a probe, it extended the RTO to now + icsk_rto. But extending the RTO makes little sense if no TLP probe went out. With this commit, instead of extending the RTO we re-arm it relative to the transmit time of the write queue head.

But what was the reason the probe could not be sent? If it is local congestion or a memory allocation error, it does make sense to not add fuel to the fire.
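The difference between the two timer policies in the quoted commit message comes down to which timestamp the RTO is added to. A minimal arithmetic sketch, with hypothetical function names and abstract time ticks, might look like this. Clamping a deadline that has already passed to `now` is an assumption of this sketch, not something stated in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Old behaviour described above: push the deadline out from "now". */
static uint32_t rto_extended(uint32_t now, uint32_t rto)
{
	return now + rto;
}

/*
 * New behaviour: measure the RTO from the time the head of the write
 * queue was transmitted. The signed difference keeps the comparison
 * correct across counter wrap-around.
 */
static uint32_t rto_rearmed(uint32_t now, uint32_t head_tx_time, uint32_t rto)
{
	uint32_t expires;

	assert(rto > 0);
	expires = head_tx_time + rto;
	if ((int32_t)(expires - now) <= 0)
		return now;	/* assumption: already due, fire right away */
	return expires;
}
```

With now = 1000, an RTO of 200 and a head segment sent at 900, the extended timer would fire at 1200 while the re-armed one fires at 1100.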