Re: [net-next PATCH v2 5/8] net: Track start of busy loop instead of when it should end

2017-03-23 Thread Alexander Duyck
On Thu, Mar 23, 2017 at 9:27 PM, Eric Dumazet  wrote:
> On Thu, 2017-03-23 at 20:42 -0700, Alexander Duyck wrote:
>> On Thu, Mar 23, 2017 at 6:24 PM, Eric Dumazet  wrote:
>> > On Thu, 2017-03-23 at 14:37 -0700, Alexander Duyck wrote:
>> >> From: Alexander Duyck 
>> >>
>> >
>> >> The last bit I changed is to move from using a shift by 10 to just using
>> >> NSEC_PER_USEC and using multiplication for any run time calculations and
>> >> division for a few compile time ones.  This should be more accurate and
>> >> perform about the same on most architectures since modern CPUs typically
>> >> handle multiplication without too much overhead.
>> >
>> >
>> > A busy polling thread can be preempted for more than 2 seconds.
>>
>> If it is preempted, is the timing value even valid anymore?  I was
>> wondering about that.  Also, when preemption is enabled, is there
>> anything to prevent us from being migrated to a different CPU?  If so,
>> what do we do about architectures that allow drift between the clocks?
>>
>> > Using usec instead of nanoseconds gave us 3 orders of magnitude cushion.
>>
>> Yes, but the problem is we also opened up an issue where if the clock
>> was approaching a roll-over we could add a value to it that would put
>> us in a state where we would never time out.
>
> If you believe there is a bug, send a fix for net tree ?
>
> I really do not see a bug, given we use time_after(now, end_time) which
> handles roll-over just fine.

Right, but time_after assumes both values wrap.  When you are using a
time value based off of local_clock() >> 10, the end time you compute
via addition doesn't ever roll over; just the clock itself rolls over.
At least on 64 bit systems.

So if local time approaches something like all 1's, then since we have
shifted it by 10 the max it can ever reach is 0x003FFFFFFFFFFFFF.  I can
add our loop time to that and it won't roll over.  In the mean time
busy_loop_us_clock() can never exceed whatever I added to that, so we
are now locked into a loop.  I realize I am probably being pedantic, and
it will have an exceedingly small rate of occurrence, but it is still an
issue.

>>
>> > We do not need nsec accuracy for busy polling users, if this restricts
>> > range and usability under stress.
>>
>> Yes and no.  So the standard use cases suggest using values of 50 to
>> 100 microseconds.  I suspect that for most people that is probably
>> what they are using.  The addition of preemption kind of threw a
>> wrench in the works because now instead of spending that time busy
>> polling you can get preempted and then are off doing something else
>> for the entire period of time.
>
> That is fine. Busy polling heavy users are pinning threads to cpu, and
> the re-schedule check is basically a way to make sure ksoftirqd will get
> a chance to service BH.
>
>>
>> What would you think of changing this so that instead of tracking the
>> total time this function is active, we instead tracked the total time
>> we spent with preemption disabled?  What I would do is move the
>> start_time configuration to just after the preempt_disable() call.
>> Then if we end up yielding to another thread we would just reset the
>> start_time when we restarted, instead of trying to deal with all the
>> extra clock bookkeeping we would otherwise need, since I don't know if
>> we actually want to count time where we aren't actually doing anything
>> (sketched at the end of this message).  In addition this would bring
>> us closer to how NAPI already works, since it essentially will either
>> find an event, or if we time out we hand it off to the softirq, which
>> in turn can handle it or hand it off to ksoftirqd.  The only item that
>> might be a bit more difficult to deal with then would be the way the
>> times are used in fs/select.c, but I don't know if that is really the
>> right way to go anyway.  With the preemption changes and such it might
>> just make sense to drop those bits and rely on the socket polling alone.
>>
>> The other option is to switch over everything from using unsigned long
>> to using uint64_t and time_after64.  Then we can guarantee the range
>> needed and then some, but then we are playing with a u64 time value on
>> 32b architectures which might be a bit more expensive.  Even with that
>> though I still need to clean up the sysctl since it doesn't make sense
>> to allow negative values for the busy_poll usec to be used which is
>> currently the case.
>>
>> Anyway let me know what you think and I can probably spin out a new
>> set tomorrow.
>
>
> Leave usec resolution, I see no point switching to nsec, as long we
> support 32bit kernels.

I wonder how much it really matters in the grand scheme of things.
The question I would have is which is more expensive: we either have to
do a set of double precision shifts, or a multiplication, double
precision addition, and double precision compare.  I need to take a look
at some instruction tables to see which comes out ahead.
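For reference, a rough sketch of the "track time with preemption
disabled" idea mentioned above (pseudo-kernel code for illustration
only; busy_loop_us() and napi_poll_once() are hypothetical stand-ins
for whatever helpers we settle on):

	preempt_disable();
	start_time = busy_loop_us();        /* hypothetical clock helper */
	for (;;) {
		found = napi_poll_once();   /* hypothetical poll step */
		if (found || busy_loop_us() - start_time > budget_us)
			break;
		if (need_resched()) {
			preempt_enable();
			cond_resched();     /* may run something else */
			preempt_disable();
			/* don't count time spent scheduled out */
			start_time = busy_loop_us();
		}
	}
	preempt_enable();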

My 

Re: [PATCHv2] net: usbnet: support 64bit stats in qmi_wwan driver

2017-03-23 Thread Greg Ungerer
Hi Eric,

On 24/03/17 14:59, Eric Dumazet wrote:
> On Fri, 2017-03-24 at 11:27 +1000, Greg Ungerer wrote:
>> Add support for the net stats64 counters to the usbnet core and then to
>> the qmi_wwan driver.
>>
>> This is a straightforward addition of 64bit counters for RX and TX packets
>> and byte counts. It is done in the same style as for the other net drivers
>> that support stats64.
>>
>> The bulk of the change is to the usbnet core. Then it is trivial to use
>> that in the qmi_wwan.c driver. It would be very simple to extend this
>> support to other usbnet based drivers.
>>
>> The motivation to add this is that it is not particularly difficult to
>> get the RX and TX byte counts to wrap on 32bit platforms.
>>
>> Signed-off-by: Greg Ungerer 
>> ---
>>  drivers/net/usb/qmi_wwan.c |  1 +
>>  drivers/net/usb/usbnet.c   | 28 
>>  include/linux/usb/usbnet.h | 12 
>>  3 files changed, 41 insertions(+)
>>
>> v2: EXPORT symbol usbnet_get_stats64()
>> rebase on top of net-next
>>
>> diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
>> index 8056745..db12a66 100644
>> --- a/drivers/net/usb/qmi_wwan.c
>> +++ b/drivers/net/usb/qmi_wwan.c
>> @@ -251,6 +251,7 @@ static int qmi_wwan_mac_addr(struct net_device *dev, 
>> void *p)
>>  .ndo_change_mtu = usbnet_change_mtu,
>>  .ndo_set_mac_address= qmi_wwan_mac_addr,
>>  .ndo_validate_addr  = eth_validate_addr,
>> +.ndo_get_stats64= usbnet_get_stats64,
>>  };
>>  
>>  /* using a counter to merge subdriver requests with our own into a
>> diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
>> index 13d4ec5..4499d89 100644
>> --- a/drivers/net/usb/usbnet.c
>> +++ b/drivers/net/usb/usbnet.c
>> @@ -330,6 +330,11 @@ void usbnet_skb_return (struct usbnet *dev, struct 
>> sk_buff *skb)
>>  dev->net->stats.rx_packets++;
>>  dev->net->stats.rx_bytes += skb->len;
> 
> Why updating dev->net->stats.rx_{packets|bytes} and
> dev->stats.rx_{packets|bytes}  ?
> 
> Seems redundant.

The usbnet core is used by a number of drivers. This patch only
updates the qmi-wwan driver to use stats64. If you remove the
dev->stats.rx_* updates all those other driver users will have
no counts.


>> +u64_stats_update_begin(&dev->stats.syncp);
>> +dev->stats.rx_packets++;
>> +dev->stats.rx_bytes += skb->len;
>> +u64_stats_update_end(&dev->stats.syncp);
>> +
>>  netif_dbg(dev, rx_status, dev->net, "< rx, len %zu, type 0x%x\n",
>>skb->len + sizeof (struct ethhdr), skb->protocol);
>>  memset (skb->cb, 0, sizeof (struct skb_data));
>> @@ -981,6 +986,23 @@ int usbnet_set_link_ksettings(struct net_device *net,
>>  }
>>  EXPORT_SYMBOL_GPL(usbnet_set_link_ksettings);
>>  
>> +void usbnet_get_stats64(struct net_device *net, struct rtnl_link_stats64 
>> *stats)
>> +{
>> +struct usbnet *dev = netdev_priv(net);
>> +unsigned int start;
>> +
>> +netdev_stats_to_stats64(stats, &net->stats);
>> +
>> +do {
>> +start = u64_stats_fetch_begin_irq(&dev->stats.syncp);
>> +stats->rx_packets = dev->stats.rx_packets;
>> +stats->rx_bytes = dev->stats.rx_bytes;
>> +stats->tx_packets = dev->stats.tx_packets;
>> +stats->tx_bytes = dev->stats.tx_bytes;
>> +} while (u64_stats_fetch_retry_irq(&dev->stats.syncp, start));
>> +}
>> +EXPORT_SYMBOL_GPL(usbnet_get_stats64);
>> +
>>  u32 usbnet_get_link (struct net_device *net)
>>  {
>>  struct usbnet *dev = netdev_priv(net);
>> @@ -1214,6 +1236,11 @@ static void tx_complete (struct urb *urb)
>>  if (urb->status == 0) {
>>  dev->net->stats.tx_packets += entry->packets;
>>  dev->net->stats.tx_bytes += entry->length;
> 
> 
> Why updating dev->net->stats.tx_{packets|bytes} and
> dev->stats.tx_{packets|bytes}  ?
> 
>> +
>> +u64_stats_update_begin(&dev->stats.syncp);
>> +dev->stats.tx_packets += entry->packets;
>> +dev->stats.tx_bytes += entry->length;
>> +u64_stats_update_end(&dev->stats.syncp);
>>  } else {
>>  dev->net->stats.tx_errors++;
>>  
>> @@ -1658,6 +1685,7 @@ void usbnet_disconnect (struct usb_interface *intf)
>>  init_timer (&dev->delay);
>>  mutex_init (&dev->phy_mutex);
>>  mutex_init(&dev->interrupt_mutex);
>> +u64_stats_init(&dev->stats.syncp);
>>  dev->interrupt_count = 0;
>>  
>>  dev->net = net;
>> diff --git a/include/linux/usb/usbnet.h b/include/linux/usb/usbnet.h
>> index e2b5691..d1501b8 100644
>> --- a/include/linux/usb/usbnet.h
>> +++ b/include/linux/usb/usbnet.h
>> @@ -22,6 +22,14 @@
>>  #ifndef __LINUX_USB_USBNET_H
>>  #define __LINUX_USB_USBNET_H
>>  
>> +struct usbnet_stats64 {
>> +struct u64_stats_sync   syncp;
>> +u64 rx_packets;
>> +u64 rx_bytes;
>> +u64 tx_packets;
>> +u64 tx_bytes;
>> +};
>> +
> 
> You use the same syncp for TX and RX.
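
One way that observation could be addressed (an illustration only, not
the eventual patch) is to give each direction its own sync structure, so
the RX path and the TX completion path never share a writer-side
seqcount:

	struct usbnet_stats64 {
		struct u64_stats_sync	rx_syncp;
		u64			rx_packets;
		u64			rx_bytes;

		struct u64_stats_sync	tx_syncp;
		u64			tx_packets;
		u64			tx_bytes;
	};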

Re: r8169 regression: UDP packets dropped intermittantly

2017-03-23 Thread Jonathan Woithe
On Thu, Jun 23, 2016 at 01:22:50AM +0200, Francois Romieu wrote:
> Jonathan Woithe  :
> [...]
> > to mainline (in which case I'll keep watching out for it)?  Or is the
> > out-of-tree workaround mentioned above considered to be the long term
> > fix for those who encounter the problem?
> 
> It's a workaround. Nothing less, nothing more.

Recently I have had a chance to revisit this issue.  I have written a
program (r8169-test, source is included below) which recreates the problem
without requiring our external hardware devices.  That is, this program
triggers the fault when run between two networked computers.  To use, two
PCs are needed.  One (the "master") has an rtl8169 network card fitted (ours
is a Netgear GA311, but the problem has been seen with others too from
memory).  The network hardware of the other computer (the "slave") isn't
important.  First run

  ./r8169-test

on the slave, followed by 

  ./r8169-test <slave IP address>

on the master.  When running stock kernel version 4.3 the master stops
reliably within a minute or so with a timeout, indicating (in this case)
that the response packet never arrived within the 0.5 second timeout period. 
The ID whose response was never received by the master is reported as having
been seen (and a response sent) by the slave.

If I substitute the forward ported r8169 driver mentioned earlier in this
thread into kernel 4.3, the above program sequence runs seemingly
indefinitely without any timeouts (runtime is beyond two hours as of this
writing, compared to tens of seconds with the standard driver).

This demonstrates that the problem is independent of our custom network
devices and allows the fault to be recreated using commodity hardware.

Does this make it any easier to develop a mainline fix for the regression?

Regards
  jonathan

/*
 * To test, the "master" mode is run on a PC with an RTL-8169 card.
 * The "slave" mode is run on any other PC.  "Master" mode is activated
 * by providing the IP of the slave PC on the command line.  The slave
 * should be started before the master; without a running slave the master
 * will time out.
 *
 * This code is in the public domain.
 */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>

#include <arpa/inet.h>

unsigned char ping_payload[] = {
0x00, 0x00,
0x00, 0x00, 0x00, 0x00,
};

#define PING_PAYLOAD_SIZE 6

unsigned char ack_payload[] = {
0x12, 0x34,
0x01, 0x01, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00,
};

#define ACK_PAYLOAD_SIZE 14

#define UDP_PORT 49491

signed int open_udp(const char *target_addr)
{
struct sockaddr_in local_addr;
struct timeval tv;
int sock;

sock = socket(PF_INET,SOCK_DGRAM, 0);
if (sock < 0) {
return -1;
}

tv.tv_sec = 0;
tv.tv_usec = 500000;	/* 0.5 second timeout */
setsockopt(sock, SOL_SOCKET, SO_SNDTIMEO, (char *)&tv, sizeof(tv));
setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, (char *)&tv, sizeof(tv));

memset(&local_addr, 0, sizeof(local_addr));
local_addr.sin_family = AF_INET;
local_addr.sin_addr.s_addr = INADDR_ANY;
local_addr.sin_port = htons(UDP_PORT);
if (bind(sock, (struct sockaddr *)&local_addr,
 sizeof(struct sockaddr)) < 0) {
return -1;
}

if (target_addr != NULL) {
struct sockaddr_in dest_addr;
memset(&dest_addr, 0, sizeof(dest_addr));
dest_addr.sin_family = AF_INET;
dest_addr.sin_port = htons(UDP_PORT);
if (inet_aton(target_addr, &dest_addr.sin_addr) < 0) {
return -1;
}
if (connect(sock, (struct sockaddr *)&dest_addr,
sizeof(dest_addr)) < 0) {
return -1;
}
}
return sock;
}

void master(const char *target_addr)
{
signed int id = 0;
int sock = open_udp(target_addr);

printf("master()\n");
if (sock < 0) {
return;
}

for (;; id++) {
unsigned char buf[1024];
signed int n;
ping_payload[0] = id & 0xff;
if (send(sock, ping_payload, PING_PAYLOAD_SIZE, 0) < 0) {
break;
}
n = recv(sock, buf, sizeof(buf), 0);
if (n == -1) {
if (errno == EAGAIN) {
printf("id 0x%02x: no response received (timeout)\n", 
   ping_payload[0]);
break;
}
} else {
printf("id 0x%02x: recv %d\n", buf[0], n);
}
usleep(1);
}
close(sock);
}

void slave()
{
int sock = open_udp(NULL);

printf("slave()\n");
if (sock < 0) {
return;
}

for ( ; ; ) {
struct sockaddr master_addr;
unsigned char buf[1024];
signed int n;

socklen_t len = sizeof(master_addr);
n = recvfrom(sock, buf, sizeof(buf), 0, &master_addr, &len);
if (n == PING_PAYLOAD_SIZE) {
printf("id 0x%02x: recv %d, sending %d\n", buf[0], n,
   ACK_PAYLOAD_SIZE);
ack_payload[0] = buf[0];
sendto(sock, ack_payload, ACK_PAYLOAD_SIZE, 

[net-next 09/10] i40e: document drivers use of ntuple filters

2017-03-23 Thread Jeff Kirsher
From: Jacob Keller 

Add documentation describing the drivers use of ethtool ntuple filters,
including the limitations that it has due to hardware, as well as how it
reads and parses the user-def data block.

Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 Documentation/networking/i40e.txt | 72 +++
 1 file changed, 72 insertions(+)

diff --git a/Documentation/networking/i40e.txt 
b/Documentation/networking/i40e.txt
index a251bf4fe9c9..57e616ed10b0 100644
--- a/Documentation/networking/i40e.txt
+++ b/Documentation/networking/i40e.txt
@@ -63,6 +63,78 @@ Additional Configurations
   The latest release of ethtool can be found from
   https://www.kernel.org/pub/software/network/ethtool
 
+
+  Flow Director ntuple traffic filters (FDir)
+  --------------------------------------------
+  The driver utilizes the ethtool interface for configuring ntuple filters,
+  via "ethtool -N  ".
+
+  The sctp4, ip4, udp4, and tcp4 flow types are supported with the standard
+  fields including src-ip, dst-ip, src-port and dst-port. The driver only
+  supports fully enabling or fully masking the fields, so use of the mask
+  fields for partial matches is not supported.
+
+  Additionally, the driver supports using the action to specify filters for a
+  Virtual Function. You can specify the action as a 64bit value, where the
+  lower 32 bits represent the queue number, while the next 8 bits represent
+  which VF. Note that 0 is the PF, so the VF identifier is offset by 1. For
+  example:
+
+    ... action 0x800000002 ...
+
+  would indicate to direct traffic for Virtual Function 7 (8 minus 1) on queue
+  2 of that VF.
+
+  The driver also supports using the user-defined field to specify 2 bytes of
+  arbitrary data to match within the packet payload in addition to the regular
+  fields. The data is specified in the lower 32bits of the user-def field in
+  the following way:
+
+  +----------------------------+--------------------------+
+  | 31    28    24    20    16 | 15    12     8     4    0|
+  +----------------------------+--------------------------+
+  | offset into packet payload | 2 bytes of flexible data |
+  +----------------------------+--------------------------+
+
+  As an example,
+
+    ... user-def 0x4FFFF ...
+
+  means to match the value 0xFFFF 4 bytes into the packet payload. Note that
+  the offset is based on the beginning of the payload, and not the beginning
+  of the packet. Thus
+
+    flow-type tcp4 ... user-def 0x8BEAF ...
+
+  would match TCP/IPv4 packets which have the value 0xBEAF 8 bytes into the
+  TCP/IPv4 payload.
+
+  For ICMP, the hardware parses the ICMP header as 4 bytes of header and 4
+  bytes of payload, so if you want to match an ICMP frame's payload you may need
+  to add 4 to the offset in order to match the data.
+
+  Furthermore, the offset can only be up to a value of 64, as the hardware
+  will only read up to 64 bytes of data from the payload. It must also be even
+  as the flexible data is 2 bytes long and must be aligned to byte 0 of the
+  packet payload.
+
+  When programming filters, the hardware is limited to using a single input
+  set for each flow type. This means that it is an error to program two
+  different filters with the same type that don't match on the same fields.
+  Thus the second of the following two commands will fail:
+
+    ethtool -N <device> flow-type tcp4 src-ip 192.168.0.7 action 5
+    ethtool -N <device> flow-type tcp4 dst-ip 192.168.15.18 action 1
+
+  This is because the first filter will be accepted and reprogram the input
+  set for TCPv4 filters, but the second filter will be unable to reprogram the
+  input set until all the conflicting TCPv4 filters are first removed.
+
+  Note that the user-defined flexible offset is also considered part of the
+  input set and cannot be programmed separately for multiple filters of the
+  same type. However, the flexible data is not part of the input set and
+  multiple filters may use the same offset but match against different data.
+
   Data Center Bridging (DCB)
   --
   DCB configuration is not currently supported.
-- 
2.12.0



[net-next 07/10] i40e: implement support for flexible word payload

2017-03-23 Thread Jeff Kirsher
From: Jacob Keller 

Add support for flexible payloads passed via ethtool user-def field.
This support is somewhat limited due to hardware design. The input set
can only be programmed once per filter type, and the flexible offset is
part of this filter input set. This means that the user cannot program
both a regular and a flexible filter at the same time for a given flow
type. Additionally, the user may not program two flexible filters of the
same flow type with different offsets, although they are allowed to
configure different values at that offset location.

We support a single flexible word (2-byte) value per protocol type, and
we handle the FLX_PIT register using a list of flexible entries so that
each flow type may be configured separately.

Due to hardware implementation, the flexible data is offset from the
start of the packet payload, and thus may not be part of the header
data. For this reason, the offset provided by the user-defined data is
interpreted as a byte offset from the start of the matching payload.
Previous implementations have tried to represent the offset as from the
start of the frame, but this is not feasible because header sizes may
change due to options.

Change-Id: 36ed27995e97de63f9aea5ade5778ff038d6f811
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h |  83 
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 499 -
 drivers/net/ethernet/intel/i40e/i40e_main.c|  16 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c|  27 ++
 4 files changed, 613 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index 3bf0e7ed4eda..72c4a740b432 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -222,6 +222,12 @@ struct i40e_fdir_filter {
__be16 src_port;
__be16 dst_port;
__be32 sctp_v_tag;
+
+   /* Flexible data to match within the packet payload */
+   __be16 flex_word;
+   u16 flex_offset;
+   bool flex_filter;
+
/* filter control */
u16 q_index;
u8  flex_off;
@@ -258,6 +264,75 @@ struct i40e_udp_port_config {
u8 type;
 };
 
+/* macros related to FLX_PIT */
+#define I40E_FLEX_SET_FSIZE(fsize) (((fsize) << \
+   I40E_PRTQF_FLX_PIT_FSIZE_SHIFT) & \
+   I40E_PRTQF_FLX_PIT_FSIZE_MASK)
+#define I40E_FLEX_SET_DST_WORD(dst) (((dst) << \
+I40E_PRTQF_FLX_PIT_DEST_OFF_SHIFT) & \
+I40E_PRTQF_FLX_PIT_DEST_OFF_MASK)
+#define I40E_FLEX_SET_SRC_WORD(src) (((src) << \
+I40E_PRTQF_FLX_PIT_SOURCE_OFF_SHIFT) & \
+I40E_PRTQF_FLX_PIT_SOURCE_OFF_MASK)
+#define I40E_FLEX_PREP_VAL(dst, fsize, src) (I40E_FLEX_SET_DST_WORD(dst) | \
+I40E_FLEX_SET_FSIZE(fsize) | \
+I40E_FLEX_SET_SRC_WORD(src))
+
+#define I40E_FLEX_PIT_GET_SRC(flex) (((flex) & \
+I40E_PRTQF_FLX_PIT_SOURCE_OFF_MASK) >> \
+I40E_PRTQF_FLX_PIT_SOURCE_OFF_SHIFT)
+#define I40E_FLEX_PIT_GET_DST(flex) (((flex) & \
+I40E_PRTQF_FLX_PIT_DEST_OFF_MASK) >> \
+I40E_PRTQF_FLX_PIT_DEST_OFF_SHIFT)
+#define I40E_FLEX_PIT_GET_FSIZE(flex) (((flex) & \
+  I40E_PRTQF_FLX_PIT_FSIZE_MASK) >> \
+  I40E_PRTQF_FLX_PIT_FSIZE_SHIFT)
+
+#define I40E_MAX_FLEX_SRC_OFFSET 0x1F
+
+/* macros related to GLQF_ORT */
+#define I40E_ORT_SET_IDX(idx)  (((idx) << \
+ I40E_GLQF_ORT_PIT_INDX_SHIFT) & \
+I40E_GLQF_ORT_PIT_INDX_MASK)
+
+#define I40E_ORT_SET_COUNT(count)  (((count) << \
+ I40E_GLQF_ORT_FIELD_CNT_SHIFT) & \
+I40E_GLQF_ORT_FIELD_CNT_MASK)
+
+#define I40E_ORT_SET_PAYLOAD(payload)  (((payload) << \
+ I40E_GLQF_ORT_FLX_PAYLOAD_SHIFT) & \
+I40E_GLQF_ORT_FLX_PAYLOAD_MASK)
+
+#define I40E_ORT_PREP_VAL(idx, count, payload) (I40E_ORT_SET_IDX(idx) | \
+   I40E_ORT_SET_COUNT(count) | \
+   I40E_ORT_SET_PAYLOAD(payload))
+
+#define I40E_L3_GLQF_ORT_IDX   34
+#define I40E_L4_GLQF_ORT_IDX   35
+
+/* Flex PIT register index */
+#define I40E_FLEX_PIT_IDX_START_L2 0
+#define I40E_FLEX_PIT_IDX_START_L3 3
+#define 

[net-next 10/10] i40e: make use of hlist_for_each_entry_continue

2017-03-23 Thread Jeff Kirsher
From: Jacob Keller 

Replace a complex if->continue->else->break construction in
i40e_next_filter. We can simply use hlist_for_each_entry_continue
instead. This drops a lot of confusing code. The resulting code makes
the intention much easier to understand, and follows the more normal
pattern for using hlist loops. We could have also used a break with a "return
next" at the end of the function, instead of return NULL, but the
current implementation is explicitly clear that when you reach the end
of the loop you get a NULL value. The alternative construction is less
clear since the reader would have to know that next is NULL at the end
of the loop.

Change-Id: Ife74ca451dd79d7f0d93c672bd42092d324d4a03
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 15 ---
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 114481b67ad8..1d8febd721ac 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -1883,19 +1883,12 @@ static void i40e_undo_add_filter_entries(struct 
i40e_vsi *vsi,
 static
 struct i40e_new_mac_filter *i40e_next_filter(struct i40e_new_mac_filter *next)
 {
-   while (next) {
-   next = hlist_entry(next->hlist.next,
-  typeof(struct i40e_new_mac_filter),
-  hlist);
-
-   /* keep going if we found a broadcast filter */
-   if (next && is_broadcast_ether_addr(next->f->macaddr))
-   continue;
-
-   break;
+   hlist_for_each_entry_continue(next, hlist) {
+   if (!is_broadcast_ether_addr(next->f->macaddr))
+   return next;
}
 
-   return next;
+   return NULL;
 }
 
 /**
-- 
2.12.0



[net-next 06/10] i40e: add parsing of flexible filter fields from userdef

2017-03-23 Thread Jeff Kirsher
From: Jacob Keller 

Add code to parse the user-def field into a data structure format. This
code is intended to allow future extensions of the user-def field by
keeping all code that actually reads and writes the field into a single
location. This ensures that we do not litter the driver with references
to the user-def field and minimizes the amount of bitwise operations we
need to do on the data.

Add code which parses the lower 32bits into a flexible word and its
offset. This will be used in a future patch to enable flexible filters
which can match on some arbitrary data in the packet payload. For now,
we just return -EOPNOTSUPP when this is used.

Add code to fill in the user-def field when reporting the filter back,
even though we don't actually implement any user-def fields yet.

Additionally, ensure that we mask the extended FLOW_EXT bit from the
flow_type now that we will be accepting filters which have the FLOW_EXT
bit set (and thus make use of the user-def field).

Change-Id: I238845035c179380a347baa8db8223304f5f6dd7
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h |   9 ++
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 110 -
 2 files changed, 118 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index 95b1a1e7b906..3bf0e7ed4eda 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -202,6 +202,15 @@ enum i40e_fd_stat_idx {
 #define I40E_FD_ATR_TUNNEL_STAT_IDX(pf_id) \
(I40E_FD_STAT_PF_IDX(pf_id) + I40E_FD_STAT_ATR_TUNNEL)
 
+/* The following structure contains the data parsed from the user-defined
+ * field of the ethtool_rx_flow_spec structure.
+ */
+struct i40e_rx_flow_userdef {
+   bool flex_filter;
+   u16 flex_word;
+   u16 flex_offset;
+};
+
 struct i40e_fdir_filter {
struct hlist_node fdir_node;
	/* filter input set */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 44b4a2fe9fb6..f04a06d0dbf8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -2332,6 +2332,102 @@ static int i40e_get_rss_hash_opts(struct i40e_pf *pf, 
struct ethtool_rxnfc *cmd)
 }
 
 /**
+ * i40e_check_mask - Check whether a mask field is set
+ * @mask: the full mask value
+ * @field: mask of the field to check
+ *
+ * If the given mask is fully set, return positive value. If the mask for the
+ * field is fully unset, return zero. Otherwise return a negative error code.
+ **/
+static int i40e_check_mask(u64 mask, u64 field)
+{
+   u64 value = mask & field;
+
+   if (value == field)
+   return 1;
+   else if (!value)
+   return 0;
+   else
+   return -1;
+}
+
+/**
+ * i40e_parse_rx_flow_user_data - Deconstruct user-defined data
+ * @fsp: pointer to rx flow specification
+ * @data: pointer to userdef data structure for storage
+ *
+ * Read the user-defined data and deconstruct the value into a structure. No
+ * other code should read the user-defined data, so as to ensure that every
+ * place consistently reads the value correctly.
+ *
+ * The user-defined field is a 64bit Big Endian format value, which we
+ * deconstruct by reading bits or bit fields from it. Single bit flags shall
+ * be defined starting from the highest bits, while small bit field values
+ * shall be defined starting from the lowest bits.
+ *
+ * Returns 0 if the data is valid, and non-zero if the userdef data is invalid
+ * and the filter should be rejected. The data structure will always be
+ * modified even if FLOW_EXT is not set.
+ *
+ **/
+static int i40e_parse_rx_flow_user_data(struct ethtool_rx_flow_spec *fsp,
+   struct i40e_rx_flow_userdef *data)
+{
+   u64 value, mask;
+   int valid;
+
+   /* Zero memory first so it's always consistent. */
+   memset(data, 0, sizeof(*data));
+
+   if (!(fsp->flow_type & FLOW_EXT))
+   return 0;
+
+   value = be64_to_cpu(*((__be64 *)fsp->h_ext.data));
+   mask = be64_to_cpu(*((__be64 *)fsp->m_ext.data));
+
+#define I40E_USERDEF_FLEX_WORD GENMASK_ULL(15, 0)
+#define I40E_USERDEF_FLEX_OFFSET   GENMASK_ULL(31, 16)
+#define I40E_USERDEF_FLEX_FILTER   GENMASK_ULL(31, 0)
+
+   valid = i40e_check_mask(mask, I40E_USERDEF_FLEX_FILTER);
+   if (valid < 0) {
+   return -EINVAL;
+   } else if (valid) {
+   data->flex_word = value & I40E_USERDEF_FLEX_WORD;
+   data->flex_offset =
+   (value & I40E_USERDEF_FLEX_OFFSET) >> 16;
+   data->flex_filter = true;
+   }
+
+   return 0;
+}
+
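
For reference, a worked example of the deconstruction above (the value
is made up for illustration):

	u64 value  = 0x4FFFFULL;                 /* lower 32 bits of user-def */
	u16 word   = value & I40E_USERDEF_FLEX_WORD;             /* 0xFFFF */
	u16 offset = (value & I40E_USERDEF_FLEX_OFFSET) >> 16;   /* 4 */
	/* i.e. match 0xFFFF at byte offset 4 of the packet payload */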

[net-next 05/10] i40e: partition the ring_cookie to get VF index

2017-03-23 Thread Jeff Kirsher
From: Jacob Keller 

Do not use the user-def field for determining the VF target. Instead,
similar to ixgbe, partition the ring_cookie value into 8 bits of VF
index, along with 32 bits of queue number. This is better than using the
user-def field, because it leaves the field open for extension in
a future patch which will enable flexible data. Also, this matches the
convention used by ixgbe and other drivers.
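
For illustration, using the ethtool.h helper constants that appear in
the patch below (vf_id and queue are example variables), the
partitioning works out as:

	/* the VF value in the cookie is one-based; 0 selects the PF */
	u64 ring_cookie = ((u64)(vf_id + 1) << ETHTOOL_RX_FLOW_SPEC_RING_VF_OFF)
			  | queue;
	/* e.g. VF 7, queue 2: (8ULL << 32) | 2 == 0x800000002 */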

Change-Id: Ie36745186d817216b12f0313b99ec95cb8a9130c
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 74 ++
 1 file changed, 40 insertions(+), 34 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index a2ef49084d46..44b4a2fe9fb6 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -2458,8 +2458,13 @@ static int i40e_get_ethtool_fdir_entry(struct i40e_pf 
*pf,
 
vsi = i40e_find_vsi_from_id(pf, rule->dest_vsi);
if (vsi && vsi->type == I40E_VSI_SRIOV) {
-   fsp->h_ext.data[1] = htonl(vsi->vf_id);
-   fsp->m_ext.data[1] = htonl(0x1);
+   /* VFs are zero-indexed by the driver, but ethtool
+* expects them to be one-indexed, so add one here
+*/
+   u64 ring_vf = vsi->vf_id + 1;
+
+   ring_vf <<= ETHTOOL_RX_FLOW_SPEC_RING_VF_OFF;
+   fsp->ring_cookie |= ring_vf;
}
}
 
@@ -3038,9 +3043,10 @@ static int i40e_add_fdir_ethtool(struct i40e_vsi *vsi,
 {
struct ethtool_rx_flow_spec *fsp;
struct i40e_fdir_filter *input;
+   u16 dest_vsi = 0, q_index = 0;
struct i40e_pf *pf;
int ret = -EINVAL;
-   u16 vf_id;
+   u8 dest_ctl;
 
if (!vsi)
return -EINVAL;
@@ -3074,9 +3080,32 @@ static int i40e_add_fdir_ethtool(struct i40e_vsi *vsi,
return -EINVAL;
}
 
-   if ((fsp->ring_cookie != RX_CLS_FLOW_DISC) &&
-   (fsp->ring_cookie >= vsi->num_queue_pairs))
-   return -EINVAL;
+   /* ring_cookie is either the drop index, or is a mask of the queue
+* index and VF id we wish to target.
+*/
+   if (fsp->ring_cookie == RX_CLS_FLOW_DISC) {
+   dest_ctl = I40E_FILTER_PROGRAM_DESC_DEST_DROP_PACKET;
+   } else {
+   u32 ring = ethtool_get_flow_spec_ring(fsp->ring_cookie);
+   u8 vf = ethtool_get_flow_spec_ring_vf(fsp->ring_cookie);
+
+   if (!vf) {
+   if (ring >= vsi->num_queue_pairs)
+   return -EINVAL;
+   dest_vsi = vsi->id;
+   } else {
+   /* VFs are zero-indexed, so we subtract one here */
+   vf--;
+
+   if (vf >= pf->num_alloc_vfs)
+   return -EINVAL;
+   if (ring >= pf->vf[vf].num_queue_pairs)
+   return -EINVAL;
+   dest_vsi = pf->vf[vf].lan_vsi_id;
+   }
+   dest_ctl = I40E_FILTER_PROGRAM_DESC_DEST_DIRECT_PACKET_QINDEX;
+   q_index = ring;
+   }
 
input = kzalloc(sizeof(*input), GFP_KERNEL);
 
@@ -3084,19 +3113,13 @@ static int i40e_add_fdir_ethtool(struct i40e_vsi *vsi,
return -ENOMEM;
 
input->fd_id = fsp->location;
-
-   if (fsp->ring_cookie == RX_CLS_FLOW_DISC)
-   input->dest_ctl = I40E_FILTER_PROGRAM_DESC_DEST_DROP_PACKET;
-   else
-   input->dest_ctl =
-I40E_FILTER_PROGRAM_DESC_DEST_DIRECT_PACKET_QINDEX;
-
-   input->q_index = fsp->ring_cookie;
-   input->flex_off = 0;
-   input->pctype = 0;
-   input->dest_vsi = vsi->id;
+   input->q_index = q_index;
+   input->dest_vsi = dest_vsi;
+   input->dest_ctl = dest_ctl;
input->fd_status = I40E_FILTER_PROGRAM_DESC_FD_STATUS_FD_ID;
input->cnt_index  = I40E_FD_SB_STAT_IDX(pf->hw.pf_id);
+   input->dst_ip = fsp->h_u.tcp_ip4_spec.ip4src;
+   input->src_ip = fsp->h_u.tcp_ip4_spec.ip4dst;
input->flow_type = fsp->flow_type;
input->ip4_proto = fsp->h_u.usr_ip4_spec.proto;
 
@@ -3108,23 +3131,6 @@ static int i40e_add_fdir_ethtool(struct i40e_vsi *vsi,
input->dst_ip = fsp->h_u.tcp_ip4_spec.ip4src;
input->src_ip = fsp->h_u.tcp_ip4_spec.ip4dst;
 
-   if (ntohl(fsp->m_ext.data[1])) {
-   vf_id = ntohl(fsp->h_ext.data[1]);
-   if (vf_id >= pf->num_alloc_vfs) {
-   netif_info(pf, drv, vsi->netdev,
-  "Invalid 

[net-next 08/10] i40e: add support for SCTPv4 FDir filters

2017-03-23 Thread Jeff Kirsher
From: Jacob Keller 

Enable FDir filters for SCTPv4 packets using the ethtool ntuple
interface. Note that the ethtool API does not allow masking on
the verification tag.

Change-Id: I093e88a8143994c7e6f4b7b17a0bd5cf861d18e4
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h |  1 +
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 10 
 drivers/net/ethernet/intel/i40e/i40e_main.c|  2 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c| 80 ++
 4 files changed, 93 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index 72c4a740b432..3133a1a8b8b3 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -377,6 +377,7 @@ struct i40e_pf {
 */
u16 fd_tcp4_filter_cnt;
u16 fd_udp4_filter_cnt;
+   u16 fd_sctp4_filter_cnt;
u16 fd_ip4_filter_cnt;
 
/* Flexible filter table values that need to be programmed into
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 93b7854c220d..8fac124ebcb5 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -2509,6 +2509,9 @@ static int i40e_get_ethtool_fdir_entry(struct i40e_pf *pf,
fsp->h_u.tcp_ip4_spec.ip4dst = rule->src_ip;
 
switch (rule->flow_type) {
+   case SCTP_V4_FLOW:
+   index = I40E_FILTER_PCTYPE_NONF_IPV4_SCTP;
+   break;
case TCP_V4_FLOW:
index = I40E_FILTER_PCTYPE_NONF_IPV4_TCP;
break;
@@ -3336,6 +3339,10 @@ static int i40e_check_fdir_input_set(struct i40e_vsi 
*vsi,
int err;
 
switch (fsp->flow_type & ~FLOW_EXT) {
+   case SCTP_V4_FLOW:
+   index = I40E_FILTER_PCTYPE_NONF_IPV4_SCTP;
+   fdir_filter_count = &pf->fd_sctp4_filter_cnt;
+   break;
case TCP_V4_FLOW:
index = I40E_FILTER_PCTYPE_NONF_IPV4_TCP;
fdir_filter_count = &pf->fd_tcp4_filter_cnt;
@@ -3367,6 +3374,9 @@ static int i40e_check_fdir_input_set(struct i40e_vsi *vsi,
 * ip4dst fields.
 */
switch (fsp->flow_type & ~FLOW_EXT) {
+   case SCTP_V4_FLOW:
+   new_mask &= ~I40E_VERIFY_TAG_MASK;
+   /* Fall through */
case TCP_V4_FLOW:
case UDP_V4_FLOW:
+   tcp_ip4_spec = &fsp->m_u.tcp_ip4_spec;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 7f8b929c90bf..114481b67ad8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3286,6 +3286,7 @@ static void i40e_fdir_filter_restore(struct i40e_vsi *vsi)
/* Reset FDir counters as we're replaying all existing filters */
pf->fd_tcp4_filter_cnt = 0;
pf->fd_udp4_filter_cnt = 0;
+   pf->fd_sctp4_filter_cnt = 0;
pf->fd_ip4_filter_cnt = 0;
 
hlist_for_each_entry_safe(filter, node,
@@ -5771,6 +5772,7 @@ static void i40e_fdir_filter_exit(struct i40e_pf *pf)
pf->fdir_pf_active_filters = 0;
pf->fd_tcp4_filter_cnt = 0;
pf->fd_udp4_filter_cnt = 0;
+   pf->fd_sctp4_filter_cnt = 0;
pf->fd_ip4_filter_cnt = 0;
 
/* Reprogram the default input set for TCP/IPv4 */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 855ae1e359df..0ca307a6c731 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -346,6 +346,80 @@ static int i40e_add_del_fdir_tcpv4(struct i40e_vsi *vsi,
return 0;
 }
 
+#define I40E_SCTPIP_DUMMY_PACKET_LEN 46
+/**
+ * i40e_add_del_fdir_sctpv4 - Add/Remove SCTPv4 Flow Director filters for
+ * a specific flow spec
+ * @vsi: pointer to the targeted VSI
+ * @fd_data: the flow director data required for the FDir descriptor
+ * @add: true adds a filter, false removes it
+ *
+ * Returns 0 if the filters were successfully added or removed
+ **/
+static int i40e_add_del_fdir_sctpv4(struct i40e_vsi *vsi,
+   struct i40e_fdir_filter *fd_data,
+   bool add)
+{
+   struct i40e_pf *pf = vsi->back;
+   struct sctphdr *sctp;
+   struct iphdr *ip;
+   u8 *raw_packet;
+   int ret;
+   /* Dummy packet */
+   static char packet[] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0x08, 0,
+   0x45, 0, 0, 0x20, 0, 0, 0x40, 0, 0x40, 0x84, 0, 0, 0, 0, 0, 0,
+   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+
+   raw_packet = kzalloc(I40E_FDIR_MAX_RAW_PACKET_SIZE, GFP_KERNEL);
+   if (!raw_packet)
+   return -ENOMEM;
+

[net-next 03/10] i40e: restore default input set for each flow type

2017-03-23 Thread Jeff Kirsher
From: Jacob Keller 

Ensure that the default input set is correctly reprogrammed when
cleaning up after disabling flow director support. This ensures that the
programmed value will be in a clean state.

Although we do not yet have support for SCTPv4 filters, a future patch
will add support for this protocol, so we will correctly restore the
SCTPv4 input set here as well. Note that strictly speaking the default
hardware value for SCTP includes matching the verification tag. However,
the ethtool API does not have support for specifying this value, so
there is no reason to keep the verification field enabled.

This patch is the next step on the way to enabling partial tuple filters
which will be implemented in a following patch.

Change-Id: Ic22e1c267ae37518bb036aca4a5694681449f283
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h  | 18 ++
 drivers/net/ethernet/intel/i40e/i40e_main.c | 19 +++
 2 files changed, 37 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index 0cd21ea48e1d..95b1a1e7b906 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -753,6 +753,24 @@ static inline u64 i40e_read_fd_input_set(struct i40e_pf 
*pf, u16 addr)
return val;
 }
 
+/**
+ * i40e_write_fd_input_set - writes value into flow director input set register
+ * @pf: pointer to the PF struct
+ * @addr: register addr
+ * @val: value to be written
+ *
+ * This function writes specified value to the register specified by 'addr'.
+ * This register is input set register based on flow-type.
+ **/
+static inline void i40e_write_fd_input_set(struct i40e_pf *pf,
+  u16 addr, u64 val)
+{
+   i40e_write_rx_ctl(&pf->hw, I40E_PRTQF_FD_INSET(addr, 1),
+ (u32)(val >> 32));
+   i40e_write_rx_ctl(&pf->hw, I40E_PRTQF_FD_INSET(addr, 0),
+ (u32)(val & 0xFFFFFFFFULL));
+}
+
 /* needed by i40e_ethtool.c */
 int i40e_up(struct i40e_vsi *vsi);
 void i40e_down(struct i40e_vsi *vsi);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index caccb8e97f1b..3fbecaa10286 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5759,6 +5759,25 @@ static void i40e_fdir_filter_exit(struct i40e_pf *pf)
pf->fd_tcp4_filter_cnt = 0;
pf->fd_udp4_filter_cnt = 0;
pf->fd_ip4_filter_cnt = 0;
+
+   /* Reprogram the default input set for TCP/IPv4 */
+   i40e_write_fd_input_set(pf, I40E_FILTER_PCTYPE_NONF_IPV4_TCP,
+   I40E_L3_SRC_MASK | I40E_L3_DST_MASK |
+   I40E_L4_SRC_MASK | I40E_L4_DST_MASK);
+
+   /* Reprogram the default input set for UDP/IPv4 */
+   i40e_write_fd_input_set(pf, I40E_FILTER_PCTYPE_NONF_IPV4_UDP,
+   I40E_L3_SRC_MASK | I40E_L3_DST_MASK |
+   I40E_L4_SRC_MASK | I40E_L4_DST_MASK);
+
+   /* Reprogram the default input set for SCTP/IPv4 */
+   i40e_write_fd_input_set(pf, I40E_FILTER_PCTYPE_NONF_IPV4_SCTP,
+   I40E_L3_SRC_MASK | I40E_L3_DST_MASK |
+   I40E_L4_SRC_MASK | I40E_L4_DST_MASK);
+
+   /* Reprogram the default input set for Other/IPv4 */
+   i40e_write_fd_input_set(pf, I40E_FILTER_PCTYPE_NONF_IPV4_OTHER,
+   I40E_L3_SRC_MASK | I40E_L3_DST_MASK);
 }
 
 /**
-- 
2.12.0



[net-next 02/10] i40e: check current configured input set when adding ntuple filters

2017-03-23 Thread Jeff Kirsher
From: Jacob Keller 

Do not assume that hardware has been programmed with the default mask,
but instead read the input set registers to determine what is currently
programmed. This ensures that all programmed filters match exactly how
the hardware will interpret them, avoiding confusion regarding filter
behavior.

This sets the initial ground-work for allowing custom input sets where
some fields are disabled. A future patch will fully implement this
feature.

Instead of using bitwise negation, we'll just explicitly check for the
correct value. htonl and htons are used to silence sparse
warnings. The compiler should be able to handle the constant value and
avoid actually performing a byteswap.

Change-Id: I3d8db46cb28ea0afdaac8c5b31a2bfb90e3a4102
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h |  19 
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 117 +
 2 files changed, 121 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index c0f2286c2b72..0cd21ea48e1d 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -734,6 +734,25 @@ static inline int i40e_get_fd_cnt_all(struct i40e_pf *pf)
return pf->hw.fdir_shared_filter_count + pf->fdir_pf_filter_count;
 }
 
+/**
+ * i40e_read_fd_input_set - reads value of flow director input set register
+ * @pf: pointer to the PF struct
+ * @addr: register addr
+ *
+ * This function reads value of flow director input set register
+ * specified by 'addr' (which is specific to flow-type)
+ **/
+static inline u64 i40e_read_fd_input_set(struct i40e_pf *pf, u16 addr)
+{
+   u64 val;
+
+   val = i40e_read_rx_ctl(&pf->hw, I40E_PRTQF_FD_INSET(addr, 1));
+   val <<= 32;
+   val += i40e_read_rx_ctl(&pf->hw, I40E_PRTQF_FD_INSET(addr, 0));
+
+   return val;
+}
+
 /* needed by i40e_ethtool.c */
 int i40e_up(struct i40e_vsi *vsi);
 void i40e_down(struct i40e_vsi *vsi);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index ac8d5cf720aa..1815c149040f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -2384,6 +2384,8 @@ static int i40e_get_ethtool_fdir_entry(struct i40e_pf *pf,
(struct ethtool_rx_flow_spec *)&cmd->fs;
struct i40e_fdir_filter *rule = NULL;
struct hlist_node *node2;
+   u64 input_set;
+   u16 index;
 
hlist_for_each_entry_safe(rule, node2,
  &pf->fdir_filter_list, fdir_node) {
@@ -2409,11 +2411,42 @@ static int i40e_get_ethtool_fdir_entry(struct i40e_pf 
*pf,
fsp->h_u.tcp_ip4_spec.ip4src = rule->dst_ip;
fsp->h_u.tcp_ip4_spec.ip4dst = rule->src_ip;
 
-   /* Set the mask fields */
-   fsp->m_u.tcp_ip4_spec.psrc = htons(0xFFFF);
-   fsp->m_u.tcp_ip4_spec.pdst = htons(0xFFFF);
-   fsp->m_u.tcp_ip4_spec.ip4src = htonl(0xFFFFFFFF);
-   fsp->m_u.tcp_ip4_spec.ip4dst = htonl(0xFFFFFFFF);
+   switch (rule->flow_type) {
+   case TCP_V4_FLOW:
+   index = I40E_FILTER_PCTYPE_NONF_IPV4_TCP;
+   break;
+   case UDP_V4_FLOW:
+   index = I40E_FILTER_PCTYPE_NONF_IPV4_UDP;
+   break;
+   case IP_USER_FLOW:
+   index = I40E_FILTER_PCTYPE_NONF_IPV4_OTHER;
+   break;
+   default:
+   /* If we have stored a filter with a flow type not listed here
+* it is almost certainly a driver bug. WARN(), and then
+* assign the input_set as if all fields are enabled to avoid
+* reading unassigned memory.
+*/
+   WARN(1, "Missing input set index for flow_type %d\n",
+rule->flow_type);
+   input_set = 0xFFFFFFFFFFFFFFFFULL;
+   goto no_input_set;
+   }
+
+   input_set = i40e_read_fd_input_set(pf, index);
+
+no_input_set:
+   if (input_set & I40E_L3_SRC_MASK)
+   fsp->m_u.tcp_ip4_spec.ip4src = htonl(0xFFFFFFFF);
+
+   if (input_set & I40E_L3_DST_MASK)
+   fsp->m_u.tcp_ip4_spec.ip4dst = htonl(0xFFFFFFFF);
+
+   if (input_set & I40E_L4_SRC_MASK)
+   fsp->m_u.tcp_ip4_spec.psrc = htons(0xFFFF);
+
+   if (input_set & I40E_L4_DST_MASK)
+   fsp->m_u.tcp_ip4_spec.pdst = htons(0xFFFF);
 
if (rule->dest_ctl == I40E_FILTER_PROGRAM_DESC_DEST_DROP_PACKET)
fsp->ring_cookie = RX_CLS_FLOW_DISC;
@@ -2725,36 +2758,74 @@ static int i40e_del_fdir_entry(struct i40e_vsi *vsi,
 
 /**
  * i40e_check_fdir_input_set - Check that a given rx_flow_spec mask is valid
+ * @vsi: pointer to the targeted VSI
  * @fsp: pointer to Rx flow specification

[net-next 01/10] i40e: correctly honor the mask fields for ETHTOOL_SRXCLSRLINS

2017-03-23 Thread Jeff Kirsher
From: Jacob Keller 

The current implementation of .set_rxnfc does not properly read the mask
field for filter entries. This results in incorrect driver behavior, as
we do not reject filters which have masks set to ignore some fields. The
current implementation simply assumes that every part of the tuple or
"input set" is specified. This results in filters not behaving as
expected, and not working correctly.

As a first step in supporting some partial filters, add code which
checks the mask fields and rejects any filters which do not have an
acceptable mask. For now, we just assume that all fields must be set.

This will get the driver one step towards allowing some partial filters.
At a minimum, the ethtool commands which previously installed filters
that would not function will now return a non-zero exit code indicating
failure instead.

We should now be meeting the minimum requirements of the .set_rxnfc API,
by ensuring that all filters we program have a valid mask value for each
field.

Finally, add code to report the mask correctly so that the ethtool
command properly reports the mask to the user.

Note that the typecast to (__be16) when checking source and destination
port masks is required because the ~ bitwise negation operator promotes
its operand to int, so the result must be truncated back to 16 bits.
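
A quick userspace demonstration of that promotion rule (illustrative
only, not driver code):

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint16_t mask = 0xFFFF;   /* a fully-set 16-bit mask */

		/* ~mask promotes to int, giving 0xFFFF0000 (non-zero) */
		printf("%d\n", ~mask != 0);           /* 1: would wrongly reject */
		/* truncating back to 16 bits yields 0, as intended */
		printf("%d\n", (uint16_t)~mask != 0); /* 0: seen as fully set */
		return 0;
	}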

Change-Id: Ia020149e07c87aa3fcec7b2283621b887ef0546f
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 83 ++
 1 file changed, 83 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 1c3805b4fcf3..ac8d5cf720aa 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -2409,6 +2409,12 @@ static int i40e_get_ethtool_fdir_entry(struct i40e_pf 
*pf,
fsp->h_u.tcp_ip4_spec.ip4src = rule->dst_ip;
fsp->h_u.tcp_ip4_spec.ip4dst = rule->src_ip;
 
+   /* Set the mask fields */
+   fsp->m_u.tcp_ip4_spec.psrc = htons(0xFFFF);
+   fsp->m_u.tcp_ip4_spec.pdst = htons(0xFFFF);
+   fsp->m_u.tcp_ip4_spec.ip4src = htonl(0xFFFFFFFF);
+   fsp->m_u.tcp_ip4_spec.ip4dst = htonl(0xFFFFFFFF);
+
if (rule->dest_ctl == I40E_FILTER_PROGRAM_DESC_DEST_DROP_PACKET)
fsp->ring_cookie = RX_CLS_FLOW_DISC;
else
@@ -2718,6 +2724,79 @@ static int i40e_del_fdir_entry(struct i40e_vsi *vsi,
 }
 
 /**
+ * i40e_check_fdir_input_set - Check that a given rx_flow_spec mask is valid
+ * @fsp: pointer to Rx flow specification
+ *
+ * Ensures that a given ethtool_rx_flow_spec has a valid mask.
+ **/
+static int i40e_check_fdir_input_set(struct ethtool_rx_flow_spec *fsp)
+{
+   struct ethtool_tcpip4_spec *tcp_ip4_spec;
+   struct ethtool_usrip4_spec *usr_ip4_spec;
+
+   /* Verify the provided mask is valid. */
+   switch (fsp->flow_type & ~FLOW_EXT) {
+   case SCTP_V4_FLOW:
+   case TCP_V4_FLOW:
+   case UDP_V4_FLOW:
+   tcp_ip4_spec = >m_u.tcp_ip4_spec;
+
+   /* IPv4 source address */
+   if (!tcp_ip4_spec->ip4src || ~tcp_ip4_spec->ip4src)
+   return -EOPNOTSUPP;
+
+   /* IPv4 destination address */
+   if (!tcp_ip4_spec->ip4dst || ~tcp_ip4_spec->ip4dst)
+   return -EOPNOTSUPP;
+
+   /* L4 source port */
+   if (!tcp_ip4_spec->psrc || (__be16)~tcp_ip4_spec->psrc)
+   return -EOPNOTSUPP;
+
+   /* L4 destination port */
+   if (!tcp_ip4_spec->pdst || (__be16)~tcp_ip4_spec->pdst)
+   return -EOPNOTSUPP;
+
+   /* Filtering on Type of Service is not supported. */
+   if (tcp_ip4_spec->tos)
+   return -EOPNOTSUPP;
+
+   break;
+   case IP_USER_FLOW:
+   usr_ip4_spec = >m_u.usr_ip4_spec;
+
+   /* IPv4 source address */
+   if (!usr_ip4_spec->ip4src || ~usr_ip4_spec->ip4src)
+   return -EOPNOTSUPP;
+
+   /* IPv4 destination address */
+   if (!usr_ip4_spec->ip4dst || ~usr_ip4_spec->ip4dst)
+   return -EOPNOTSUPP;
+
+   /* First 4 bytes of L4 header */
+   if (!usr_ip4_spec->l4_4_bytes || ~usr_ip4_spec->l4_4_bytes)
+   return -EOPNOTSUPP;
+
+   /* Filtering on Type of Service is not supported. */
+   if (usr_ip4_spec->tos)
+   return -EOPNOTSUPP;
+
+   /* IP version does not have a mask field. */
+   if (usr_ip4_spec->ip_ver)
+   return -EINVAL;
+
+   /* L4 protocol doesn't have a mask field. */
+   if (usr_ip4_spec->proto)
+  

[net-next 04/10] i40e: allow changing input set for ntuple filters

2017-03-23 Thread Jeff Kirsher
From: Jacob Keller 

Add support to detect when we can update the input set for each flow
type.

Because the hardware only supports a single input set for all flows of
that matching type, the driver shall only allow the input set to change
if there are no other configured filters for that flow type.

Thus, the first filter added for each flow type is allowed to change the
input set, and all future filters must match the same input set. Display
a diagnostic message whenever the filter input set changes, and
a warning whenever a filter cannot be accepted because it does not match
the configured input set.

Change-Id: Ic22e1c267ae37518bb036aca4a5694681449f283
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 148 -
 1 file changed, 145 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 1815c149040f..a2ef49084d46 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -2757,11 +2757,107 @@ static int i40e_del_fdir_entry(struct i40e_vsi *vsi,
 }
 
 /**
+ * i40e_flow_str - Converts a flow_type into a human readable string
+ * @fsp: the flow specification containing the flow type to convert
+ *
+ * Currently only flow types we support are included here, and the string
+ * value attempts to match what ethtool would use to configure this flow type.
+ **/
+static const char *i40e_flow_str(struct ethtool_rx_flow_spec *fsp)
+{
+   switch (fsp->flow_type & ~FLOW_EXT) {
+   case TCP_V4_FLOW:
+   return "tcp4";
+   case UDP_V4_FLOW:
+   return "udp4";
+   case SCTP_V4_FLOW:
+   return "sctp4";
+   case IP_USER_FLOW:
+   return "ip4";
+   default:
+   return "unknown";
+   }
+}
+
+/**
+ * i40e_print_input_set - Show changes between two input sets
+ * @vsi: the vsi being configured
+ * @old: the old input set
+ * @new: the new input set
+ *
+ * Print the difference between old and new input sets by showing which series
+ * of words are toggled on or off. Only displays the bits we actually support
+ * changing.
+ **/
+static void i40e_print_input_set(struct i40e_vsi *vsi, u64 old, u64 new)
+{
+   struct i40e_pf *pf = vsi->back;
+   bool old_value, new_value;
+
+   old_value = !!(old & I40E_L3_SRC_MASK);
+   new_value = !!(new & I40E_L3_SRC_MASK);
+   if (old_value != new_value)
+   netif_info(pf, drv, vsi->netdev, "L3 source address: %s -> 
%s\n",
+  old_value ? "ON" : "OFF",
+  new_value ? "ON" : "OFF");
+
+   old_value = !!(old & I40E_L3_DST_MASK);
+   new_value = !!(new & I40E_L3_DST_MASK);
+   if (old_value != new_value)
+   netif_info(pf, drv, vsi->netdev, "L3 destination address: %s -> 
%s\n",
+  old_value ? "ON" : "OFF",
+  new_value ? "ON" : "OFF");
+
+   old_value = !!(old & I40E_L4_SRC_MASK);
+   new_value = !!(new & I40E_L4_SRC_MASK);
+   if (old_value != new_value)
+   netif_info(pf, drv, vsi->netdev, "L4 source port: %s -> %s\n",
+  old_value ? "ON" : "OFF",
+  new_value ? "ON" : "OFF");
+
+   old_value = !!(old & I40E_L4_DST_MASK);
+   new_value = !!(new & I40E_L4_DST_MASK);
+   if (old_value != new_value)
+   netif_info(pf, drv, vsi->netdev, "L4 destination port: %s -> 
%s\n",
+  old_value ? "ON" : "OFF",
+  new_value ? "ON" : "OFF");
+
+   old_value = !!(old & I40E_VERIFY_TAG_MASK);
+   new_value = !!(new & I40E_VERIFY_TAG_MASK);
+   if (old_value != new_value)
+   netif_info(pf, drv, vsi->netdev, "SCTP verification tag: %s -> 
%s\n",
+  old_value ? "ON" : "OFF",
+  new_value ? "ON" : "OFF");
+
+   netif_info(pf, drv, vsi->netdev, "  Current input set: %0llx\n",
+  old);
+   netif_info(pf, drv, vsi->netdev, "Requested input set: %0llx\n",
+  new);
+}
+
+/**
  * i40e_check_fdir_input_set - Check that a given rx_flow_spec mask is valid
  * @vsi: pointer to the targeted VSI
  * @fsp: pointer to Rx flow specification
  *
- * Ensures that a given ethtool_rx_flow_spec has a valid mask.
+ * Ensures that a given ethtool_rx_flow_spec has a valid mask. Some support
+ * for partial matches exists with a few limitations. First, hardware only
+ * supports masking by word boundary (2 bytes) and not per individual bit.
+ * Second, hardware is limited to using one mask for a flow type and cannot
+ * use a separate mask for each filter.
+ *
+ * To support these limitations, if we 

[net-next 00/10][pull request] 40GbE Intel Wired LAN Driver Updates 2017-03-23

2017-03-23 Thread Jeff Kirsher
This series contains updates to i40e and i40e.txt documentation.

Jake provides all the changes in the series which are centered around
ntuple filter fixes and additional support.  Fixed the current
implementation of .set_rxnfc, where we were not reading the mask field
for filter entries which was resulting in filters not behaving as
expected and not working correctly.  When cleaning up after disabling
flow director support, ensure that the default input set is correctly
reprogrammed.  Since the hardware only supports a single input set for
all flows of that type, the driver shall only allow the input set to
change if there are no other configured filters for that flow type, so
add support to detect when we can update the input set for each flow
type.  Aligned the driver with other drivers by partitioning the
ring_cookie value into 8 bits of VF index, along with 32 bits of queue
number, instead of using the user-def field.  Added support to parse the
user-def field into a data structure format to allow future extensions
of the user-def field by keeping all the code that reads/writes the
field in a single location.  Added support for flexible payloads passed
via ethtool user-def field.  We support a single flexible word (2-byte)
value per
protocol type, and we handle the FLX_PIT register using a list of
flexible entries so that each flow type may be configured separately.
Enabled flow director filters for SCTPv4 packets using the ethtool
ntuple interface to enable filters.  Updated the documentation on the
i40e driver to include the newly added support to ntuple filters.
Reduced complexity of an if-continue-else-break section of code by
taking advantage of hlist_for_each_entry_continue() instead.

The following are changes since commit add641e7dee31b36aee83412c29e39dd1f5e0c9c:
  sched: act_csum: don't mangle TCP and UDP GSO packets
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 40GbE

Jacob Keller (10):
  i40e: correctly honor the mask fields for ETHTOOL_SRXCLSRLINS
  i40e: check current configured input set when adding ntuple filters
  i40e: restore default input set for each flow type
  i40e: allow changing input set for ntuple filters
  i40e: partition the ring_cookie to get VF index
  i40e: add parsing of flexible filter fields from userdef
  i40e: implement support for flexible word payload
  i40e: add support for SCTPv4 FDir filters
  i40e: document drivers use of ntuple filters
  i40e: make use of hlist_for_each_entry_continue

 Documentation/networking/i40e.txt  |  72 ++
 drivers/net/ethernet/intel/i40e/i40e.h | 130 
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 977 -
 drivers/net/ethernet/intel/i40e/i40e_main.c|  52 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c| 107 +++
 5 files changed, 1294 insertions(+), 44 deletions(-)

-- 
2.12.0



Re: [net-next PATCH v2 8/8] net: Introduce SO_INCOMING_NAPI_ID

2017-03-23 Thread Eric Dumazet
On Thu, Mar 23, 2017 at 9:47 PM, Andy Lutomirski  wrote:

> So don't we want queue id, not NAPI id?  Or am I still missing something?
>
> But I'm also a bit confused as to the overall performance effect.
> Suppose I have an rx queue that has its interrupt bound to cpu 0.  For
> whatever reason (random chance if I'm hashing, for example), I end up
> with the epoll caller on cpu 1.  Suppose further that cpus 0 and 1 are
> on different NUMA nodes.
>
> Now, let's suppose that I get lucky and *all* the packets are pulled
> off the queue by epoll busy polling.  Life is great [1].  But suppose
> that, due to a tiny hiccup or simply user code spending some cycles
> processing those packets, an rx interrupt fires.  Now cpu 0 starts
> pulling packets off the queue via NAPI, right?  So both NUMA nodes are
> fighting over all the cachelines involved in servicing the queue *and*
> the packets just got dequeued on the wrong NUMA node.
>
> ISTM this would work better if the epoll busy polling could handle the
> case where one epoll set polls sockets on different queues as long as
> those queues are all owned by the same CPU.  Then user code could use
> SO_INCOMING_CPU to sort out the sockets.
>

Of course you can do that already.

SO_REUSEPORT + an appropriate eBPF filter can select the best socket to
receive your packets, based on various SMP/NUMA affinities
(BPF_FUNC_get_smp_processor_id or BPF_FUNC_get_numa_node_id).

This new socket option is simply _allowing_ other schemes, based on
queue IDs, in the case where each NIC queue can be managed by a group
of cores (presumably on the same NUMA node).
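For illustration, a hedged sketch of that SO_REUSEPORT + eBPF scheme; the
section name, the worker count, and the assumption that each worker's
socket index matches its pinned CPU are hypothetical:

/* Sketch only: steer each packet to the reuseport socket whose index
 * matches the receiving CPU.  Assumes NUM_WORKERS sockets were created
 * with SO_REUSEPORT, one per worker pinned to CPUs 0..NUM_WORKERS-1,
 * and that the program is attached with SO_ATTACH_REUSEPORT_EBPF; the
 * return value is used as an index into the reuseport group. */
#include <linux/bpf.h>
#include "bpf_helpers.h"	/* SEC() and helper stubs, as in samples/bpf */

#define NUM_WORKERS 4		/* hypothetical worker/socket count */

SEC("socket")
int select_by_cpu(struct __sk_buff *skb)
{
	return bpf_get_smp_processor_id() % NUM_WORKERS;
}

char _license[] SEC("license") = "GPL";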


> Am I missing something?
>
> [1] Maybe.  How smart is direct cache access?  If it's smart enough,
> it'll pre-populate node 0's LLC, which means that life isn't so great
> after all.


Re: [PATCHv2] net: usbnet: support 64bit stats in qmi_wwan driver

2017-03-23 Thread Eric Dumazet
On Fri, 2017-03-24 at 11:27 +1000, Greg Ungerer wrote:
> Add support for the net stats64 counters to the usbnet core and then to
> the qmi_wwan driver.
> 
> This is a straightforward addition of 64-bit counters for RX and TX packets
> and byte counts. It is done in the same style as for the other net drivers
> that support stats64.
> 
> The bulk of the change is to the usbnet core. Then it is trivial to use
> that in the qmi_wwan.c driver. It would be very simple to extend this
> support to other usbnet based drivers.
> 
> The motivation to add this is that it is not particularly difficult to
> get the RX and TX byte counts to wrap on 32bit platforms.
> 
> Signed-off-by: Greg Ungerer 
> ---
>  drivers/net/usb/qmi_wwan.c |  1 +
>  drivers/net/usb/usbnet.c   | 28 
>  include/linux/usb/usbnet.h | 12 
>  3 files changed, 41 insertions(+)
> 
> v2: EXPORT symbol usbnet_get_stats64()
> rebase on top of net-next
> 
> diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
> index 8056745..db12a66 100644
> --- a/drivers/net/usb/qmi_wwan.c
> +++ b/drivers/net/usb/qmi_wwan.c
> @@ -251,6 +251,7 @@ static int qmi_wwan_mac_addr(struct net_device *dev, void *p)
>   .ndo_change_mtu = usbnet_change_mtu,
>   .ndo_set_mac_address= qmi_wwan_mac_addr,
>   .ndo_validate_addr  = eth_validate_addr,
> + .ndo_get_stats64= usbnet_get_stats64,
>  };
>  
>  /* using a counter to merge subdriver requests with our own into a
> diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
> index 13d4ec5..4499d89 100644
> --- a/drivers/net/usb/usbnet.c
> +++ b/drivers/net/usb/usbnet.c
> @@ -330,6 +330,11 @@ void usbnet_skb_return (struct usbnet *dev, struct sk_buff *skb)
>   dev->net->stats.rx_packets++;
>   dev->net->stats.rx_bytes += skb->len;

Why update both dev->net->stats.rx_{packets|bytes} and
dev->stats.rx_{packets|bytes}?

Seems redundant.


> + u64_stats_update_begin(&dev->stats.syncp);
> + dev->stats.rx_packets++;
> + dev->stats.rx_bytes += skb->len;
> + u64_stats_update_end(&dev->stats.syncp);
> +
>   netif_dbg(dev, rx_status, dev->net, "< rx, len %zu, type 0x%x\n",
> skb->len + sizeof (struct ethhdr), skb->protocol);
>   memset (skb->cb, 0, sizeof (struct skb_data));
> @@ -981,6 +986,23 @@ int usbnet_set_link_ksettings(struct net_device *net,
>  }
>  EXPORT_SYMBOL_GPL(usbnet_set_link_ksettings);
>  
> +void usbnet_get_stats64(struct net_device *net, struct rtnl_link_stats64 *stats)
> +{
> + struct usbnet *dev = netdev_priv(net);
> + unsigned int start;
> +
> + netdev_stats_to_stats64(stats, &net->stats);
> +
> + do {
> + start = u64_stats_fetch_begin_irq(&dev->stats.syncp);
> + stats->rx_packets = dev->stats.rx_packets;
> + stats->rx_bytes = dev->stats.rx_bytes;
> + stats->tx_packets = dev->stats.tx_packets;
> + stats->tx_bytes = dev->stats.tx_bytes;
> + } while (u64_stats_fetch_retry_irq(&dev->stats.syncp, start));
> +}
> +EXPORT_SYMBOL_GPL(usbnet_get_stats64);
> +
>  u32 usbnet_get_link (struct net_device *net)
>  {
>   struct usbnet *dev = netdev_priv(net);
> @@ -1214,6 +1236,11 @@ static void tx_complete (struct urb *urb)
>   if (urb->status == 0) {
>   dev->net->stats.tx_packets += entry->packets;
>   dev->net->stats.tx_bytes += entry->length;


Why update both dev->net->stats.tx_{packets|bytes} and
dev->stats.tx_{packets|bytes}?

> +
> + u64_stats_update_begin(&dev->stats.syncp);
> + dev->stats.tx_packets += entry->packets;
> + dev->stats.tx_bytes += entry->length;
> + u64_stats_update_end(&dev->stats.syncp);
>   } else {
>   dev->net->stats.tx_errors++;
>  
> @@ -1658,6 +1685,7 @@ void usbnet_disconnect (struct usb_interface *intf)
> init_timer (&dev->delay);
> mutex_init (&dev->phy_mutex);
> mutex_init(&dev->interrupt_mutex);
> + u64_stats_init(&dev->stats.syncp);
>   dev->interrupt_count = 0;
>  
>   dev->net = net;
> diff --git a/include/linux/usb/usbnet.h b/include/linux/usb/usbnet.h
> index e2b5691..d1501b8 100644
> --- a/include/linux/usb/usbnet.h
> +++ b/include/linux/usb/usbnet.h
> @@ -22,6 +22,14 @@
>  #ifndef  __LINUX_USB_USBNET_H
>  #define  __LINUX_USB_USBNET_H
>  
> +struct usbnet_stats64 {
> + struct u64_stats_sync   syncp;
> + u64 rx_packets;
> + u64 rx_bytes;
> + u64 tx_packets;
> + u64 tx_bytes;
> +};
> +

You use the same syncp for TX and RX.

Do you have a guarantee that the TX and RX paths cannot run
concurrently (on different cpus)?
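If not, one hedged way to address the concern would be per-direction
sync points; a minimal sketch, not part of the posted patch:

struct usbnet_stats64 {
	struct u64_stats_sync	rx_syncp;	/* guards the RX counters */
	u64			rx_packets;
	u64			rx_bytes;
	struct u64_stats_sync	tx_syncp;	/* guards the TX counters */
	u64			tx_packets;
	u64			tx_bytes;
};

Each path would then begin/end updates only on its own syncp, so
concurrent RX and TX updates on different CPUs never share a sequence
counter.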





Re: [net-next PATCH v2 8/8] net: Introduce SO_INCOMING_NAPI_ID

2017-03-23 Thread Andy Lutomirski
On Thu, Mar 23, 2017 at 5:58 PM, Alexander Duyck
 wrote:
> On Thu, Mar 23, 2017 at 3:43 PM, Andy Lutomirski  wrote:
>> On Thu, Mar 23, 2017 at 2:38 PM, Alexander Duyck
>>  wrote:
>>> From: Sridhar Samudrala 
>>>
>>> This socket option returns the NAPI ID associated with the queue on which
>>> the last frame is received. This information can be used by the apps to
>>> split the incoming flows among the threads based on the Rx queue on which
>>> they are received.
>>>
>>> If the NAPI ID actually represents a sender_cpu then the value is ignored
>>> and 0 is returned.
>>
>> This may be more of a naming / documentation issue than a
>> functionality issue, but to me this reads as:
>>
>> "This socket option returns an internal implementation detail that, if
>> you are sufficiently clueful about the current performance heuristics
>> used by the Linux networking stack, just might give you a hint as to
>> which epoll set to put the socket in."  I've done some digging into
>> Linux networking stuff, but not nearly enough to have the slighest
>> clue what you're supposed to do with the NAPI ID.
>
> Really the NAPI ID is an arbitrary number that will be unique per
> device queue, though multiple Rx queues can share a NAPI ID if they
> are meant to be processed in the same call to poll.
>
> If we wanted we could probably rename it to something like Device Poll
> Identifier or Device Queue Identifier, DPID or DQID, if that would
> work for you.  Essentially it is just a unique u32 value that should
> not identify any other queue in the system while this device queue is
> active.  Really the number itself is mostly arbitrary, the main thing
> is that it doesn't change and uniquely identifies the queue in the
> system.

That seems reasonably sane to me.

>
>> It would be nice to make this a bit more concrete and a bit less tied
>> in Linux innards.  Perhaps a socket option could instead return a hint
>> saying "for best results, put this socket in an epoll set that's on
>> cpu N"?  After all, you're unlikely to do anything productive with
>> busy polling at all, even on a totally different kernel
>> implementation, if you have more than one epoll set per CPU.  I can
>> see cases where you could plausibly poll with fewer than one set per
>> CPU, I suppose.
>
> Really we kind of already have an option that does what you are
> implying called SO_INCOMING_CPU.  The problem is it requires pinning
> the interrupts to the CPUs in order to keep the values consistent,

Some day the kernel should just solve this problem once and for all.
Have root give a basic policy for mapping queues to CPUs (one per
physical core / one per logical core / use this subset of cores) and
perhaps forcibly prevent irqbalanced from even seeing it.  I'm sure
other solutions are possible.

> and
> even then busy polling can mess that up if the busy poll thread is
> running on a different CPU.  With the NAPI ID we have to do a bit of
> work on the application end, but we can uniquely identify each
> incoming queue and interrupt migration and busy polling don't have any
> effect on it.  So for example we could stack all the interrupts on CPU
> 0, and have our main thread located there doing the sorting of
> incoming requests and handing them out to epoll listener threads on
> other CPUs.  When those epoll listener threads start doing busy
> polling the NAPI ID won't change even though the packet is being
> processed on a different CPU.
>
>> Again, though, from the description, it's totally unclear what a user
>> is supposed to do.
>
> What you end up having to do is essentially create a hash of sorts so
> that you can go from NAPI IDs to threads.  In an ideal setup what you
> end up with is multiple threads, each one running one epoll, and each
> epoll polling on one specific queue.

So don't we want queue id, not NAPI id?  Or am I still missing something?

But I'm also a but confused as to the overall performance effect.
Suppose I have an rx queue that has its interrupt bound to cpu 0.  For
whatever reason (random chance if I'm hashing, for example), I end up
with the epoll caller on cpu 1.  Suppose further that cpus 0 and 1 are
on different NUMA nodes.

Now, let's suppose that I get lucky and *all* the packets are pulled
off the queue by epoll busy polling.  Life is great [1].  But suppose
that, due to a tiny hiccup or simply user code spending some cycles
processing those packets, an rx interrupt fires.  Now cpu 0 starts
pulling packets off the queue via NAPI, right?  So both NUMA nodes are
fighting over all the cachelines involved in servicing the queue *and*
the packets just got dequeued on the wrong NUMA node.

ISTM this would work better if the epoll busy polling could handle the
case where one epoll set polls sockets on different queues as long as
those queues are all owned by the same CPU.  Then user code could use
SO_INCOMING_CPU to sort out the sockets.

Re: [net-next PATCH v2 5/8] net: Track start of busy loop instead of when it should end

2017-03-23 Thread Eric Dumazet
On Thu, 2017-03-23 at 21:27 -0700, Eric Dumazet wrote:

> If you believe min/max values should be added to the sysctls, because we
> do not trust root anymore, please send patches only addressing that. 

extern unsigned int sysctl_net_busy_read;
extern unsigned int sysctl_net_busy_poll;
...
unsigned int    sk_ll_usec;


These are unsigned values, so no min/max would be required; only
maybe use proc_douintvec() instead of the proc_dointvec() most sysctls
use, because we were lazy and had to wait for linux-4.8 to actually get
proc_douintvec() into the kernel ;)
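For reference, a hedged sketch of what that change could look like; the
entry layout is assumed from net/core/sysctl_net_core.c:

/* sketch: busy-poll sysctl entry switched to the unsigned handler */
{
	.procname	= "busy_poll",
	.data		= &sysctl_net_busy_poll,
	.maxlen		= sizeof(unsigned int),
	.mode		= 0644,
	.proc_handler	= proc_douintvec,	/* rejects negative input */
},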







Re: [patch net-next] mlxsw: Remove debugfs interface

2017-03-23 Thread David Miller
From: Jiri Pirko 
Date: Thu, 23 Mar 2017 11:14:24 +0100

> From: Ido Schimmel 
> 
> We don't use it during development and we can't extend it either, so
> remove it.
> 
> Signed-off-by: Ido Schimmel 
> Signed-off-by: Jiri Pirko 

Applied, thank you.


Re: [PATCH net] net: neigh: guard against NULL solicit() method

2017-03-23 Thread David Miller
From: Eric Dumazet 
Date: Thu, 23 Mar 2017 12:39:21 -0700

> From: Eric Dumazet 
> 
> Dmitry posted a nice reproducer of a bug triggering in neigh_probe()
> when dereferencing a NULL neigh->ops->solicit method.
> 
> This can happen for arp_direct_ops/ndisc_direct_ops and similar,
> which can be used for NUD_NOARP neighbours (created when dev->header_ops
> is NULL). Admin can then force changing nud_state to some other state
> that would fire neigh timer.
> 
> Signed-off-by: Eric Dumazet 
> Reported-by: Dmitry Vyukov 

Applied and queued up for -stable, thanks Eric.


Re: [net-next PATCH v2 5/8] net: Track start of busy loop instead of when it should end

2017-03-23 Thread Eric Dumazet
On Thu, 2017-03-23 at 20:42 -0700, Alexander Duyck wrote:
> On Thu, Mar 23, 2017 at 6:24 PM, Eric Dumazet  wrote:
> > On Thu, 2017-03-23 at 14:37 -0700, Alexander Duyck wrote:
> >> From: Alexander Duyck 
> >>
> >
> >> The last bit I changed is to move from using a shift by 10 to just using
> >> NSEC_PER_USEC and using multiplication for any run time calculations and
> >> division for a few compile time ones.  This should be more accurate and
> >> perform about the same on most architectures since modern CPUs typically
> >> handle multiplication without too much overhead.
> >
> >
> > busy polling thread can be preempted for more than 2 seconds.
> 
> If it is preempted is the timing value even valid anymore?  I was
> wondering about that.  Also when preemption is enabled is there
> anything to prevent us from being migrated to a different CPU?  If not, what do we
> do about architectures that allow drift between the clocks?
> 
> > Using usec instead of nanoseconds gave us 3 orders of magnitude cushion.
> 
> Yes, but the problem is we also opened up an issue where if the clock
> was approaching a roll-over we could add a value to it that would put
> us in a state where we would never time out.

If you believe there is a bug, send a fix for net tree ?

I really do not see a bug, given we use time_after(now, end_time) which
handles roll-over just fine.



> 
> > We do not need nsec accuracy for busy polling users, if this restricts
> > range and usability under stress.
> 
> Yes and no.  So the standard use cases suggest using values of 50 to
> 100 microseconds.  I suspect that for most people that is probably
> what they are using.  The addition of preemption kind of threw a
> wrench in the works because now instead of spending that time busy
> polling you can get preempted and then are off doing something else
> for the entire period of time.

That is fine. Busy polling heavy users are pinning threads to cpu, and
the re-schedule check is basically a way to make sure ksoftirqd will get
a chance to service BH.

> 
> What would you think of changing this so that instead of tracking the
> total time this function is active, instead we tracked the total time
> we spent with preemption disabled?  What I would do is move the
> start_time configuration to just after the preempt_disable() call.
> Then if we end up yielding to another thread we would just reset the
> start_time when we restarted instead of trying to deal with all the
> extra clock nonsense that we would have to deal with otherwise since I
> don't know if we actually want to count time where we aren't actually
> doing anything.  In addition this would bring us closer to how NAPI
> already works since it essentially will either find an event, or if we
> time out we hand it off to the softirq which in turn can handle it or
> hand it off to softirqd.  The only item that might be a bit more
> difficult to deal with then would be the way the times are used in
> fs/select.c but I don't know if that is really the right way to go
> anyway. With the preemption changes and such it might just make sense
> to drop those bits and rely on just the socket polling alone.
> 
> The other option is to switch over everything from using unsigned long
> to using uint64_t and time_after64.  Then we can guarantee the range
> needed and then some, but then we are playing with a u64 time value on
> 32b architectures which might be a bit more expensive.  Even with that
> though I still need to clean up the sysctl since it doesn't make sense
> to allow negative values for the busy_poll usec to be used which is
> currently the case.
> 
> Anyway let me know what you think and I can probably spin out a new
> set tomorrow.


Leave usec resolution; I see no point switching to nsec as long as we
support 32-bit kernels.

If you believe min/max values should be added to the sysctls, because we
do not trust root anymore, please send patches only addressing that. 

Thanks.



Re: [net] Revert "e1000e: driver trying to free already-free irq"

2017-03-23 Thread Jeff Kirsher
On Thu, 2017-03-23 at 20:47 -0700, Jeff Kirsher wrote:
> This reverts commit 7e54d9d063fa239c95c21548c5267f0ef419ff56.
> 
> After additional regression testing, several users are experiencing
> kernel panics during shutdown on e1000e devices.  Reverting this
> change resolves the issue.
> 
> Signed-off-by: Jeff Kirsher 

CC: 

Sorry, meant to queue this up for stable as well.

> ---
>  drivers/net/ethernet/intel/e1000e/netdev.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> index 2175cced402f..e9af89ad039c 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -6274,8 +6274,8 @@ static int e1000e_pm_freeze(struct device *dev)
>   /* Quiesce the device without resetting the hardware */
>   e1000e_down(adapter, false);
>   e1000_free_irq(adapter);
> - e1000e_reset_interrupt_capability(adapter);
>   }
> + e1000e_reset_interrupt_capability(adapter);
>  
>   /* Allow time for pending master requests to run */
> e1000e_disable_pcie_master(&adapter->hw);




[net] Revert "e1000e: driver trying to free already-free irq"

2017-03-23 Thread Jeff Kirsher
This reverts commit 7e54d9d063fa239c95c21548c5267f0ef419ff56.

After additional regression testing, several users are experiencing
kernel panics during shutdown on e1000e devices.  Reverting this
change resolves the issue.

Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 2175cced402f..e9af89ad039c 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -6274,8 +6274,8 @@ static int e1000e_pm_freeze(struct device *dev)
/* Quiesce the device without resetting the hardware */
e1000e_down(adapter, false);
e1000_free_irq(adapter);
-   e1000e_reset_interrupt_capability(adapter);
}
+   e1000e_reset_interrupt_capability(adapter);
 
/* Allow time for pending master requests to run */
e1000e_disable_pcie_master(&adapter->hw);
-- 
2.12.0



Re: [net-next PATCH v2 5/8] net: Track start of busy loop instead of when it should end

2017-03-23 Thread Alexander Duyck
On Thu, Mar 23, 2017 at 6:24 PM, Eric Dumazet  wrote:
> On Thu, 2017-03-23 at 14:37 -0700, Alexander Duyck wrote:
>> From: Alexander Duyck 
>>
>
>> The last bit I changed is to move from using a shift by 10 to just using
>> NSEC_PER_USEC and using multiplication for any run time calculations and
>> division for a few compile time ones.  This should be more accurate and
>> perform about the same on most architectures since modern CPUs typically
>> handle multiplication without too much overhead.
>
>
> busy polling thread can be preempted for more than 2 seconds.

If it is preempted is the timing value even valid anymore?  I was
wondering about that.  Also when preemption is enabled is there
anything to prevent us from being migrated to a different CPU?  If not, what do we
do about architectures that allow drift between the clocks?

> Using usec instead of nanoseconds gave us 3 orders of magnitude cushion.

Yes, but the problem is we also opened up an issue where if the clock
was approaching a roll-over we could add a value to it that would put
us in a state where we would never time out.

> We do not need nsec accuracy for busy polling users, if this restricts
> range and usability under stress.

Yes and no.  So the standard use cases suggest using values of 50 to
100 microseconds.  I suspect that for most people that is probably
what they are using.  The addition of preemption kind of threw a
wrench in the works because now instead of spending that time busy
polling you can get preempted and then are off doing something else
for the entire period of time.

What would you think of changing this so that instead of tracking the
total time this function is active, instead we tracked the total time
we spent with preemption disabled?  What I would do is move the
start_time configuration to just after the preempt_disable() call.
Then if we end up yielding to another thread we would just reset the
start_time when we restarted instead of trying to deal with all the
extra clock nonsense that we would have to deal with otherwise since I
don't know if we actually want to count time where we aren't actually
doing anything.  In addition this would bring us closer to how NAPI
already works since it essentially will either find an event, or if we
time out we hand it off to the softirq which in turn can handle it or
hand it off to softirqd.  The only item that might be a bit more
difficult to deal with then would be the way the times are used in
fs/select.c but I don't know if that is really the right way to go
anyway. With the preemption changes and such it might just make sense
to drop those bits and rely on just the socket polling alone.

The other option is to switch over everything from using unsigned long
to using uint64_t and time_after64.  Then we can guarantee the range
needed and then some, but then we are playing with a u64 time value on
32b architectures which might be a bit more expensive.  Even with that
though I still need to clean up the sysctl since it doesn't make sense
to allow negative values for the busy_poll usec to be used which is
currently the case.
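For reference, a minimal sketch of that second option; the function
names mirror the busy_poll.h style but are assumptions, not the posted
code:

static inline u64 busy_loop_start_time(void)
{
	return local_clock();	/* u64 nanoseconds, no 32-bit wrap concern */
}

static inline bool busy_loop_timeout(u64 start_time)
{
	u64 timeout_ns = (u64)READ_ONCE(sysctl_net_busy_poll) * NSEC_PER_USEC;

	return time_after64(local_clock(), start_time + timeout_ns);
}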

Anyway let me know what you think and I can probably spin out a new
set tomorrow.

- Alex


[iproute2 net-next v2 1/3] netlink: Add flag to suppress print of nlmsg error

2017-03-23 Thread David Ahern
Allow callers of the dump API to handle nlmsg errors (e.g., an
unsupported feature). Setting RTNL_HANDLE_F_SUPPRESS_NLERR in the
rtnl_handle avoids unnecessary messages to the user in some cases.
For example,

  RTNETLINK answers: Operation not supported

when probing for support of a new feature.

Signed-off-by: David Ahern 
---
 include/libnetlink.h | 1 +
 lib/libnetlink.c | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/libnetlink.h b/include/libnetlink.h
index bd0267dfcc02..c43ab0a2d9d9 100644
--- a/include/libnetlink.h
+++ b/include/libnetlink.h
@@ -21,6 +21,7 @@ struct rtnl_handle {
int proto;
FILE   *dump_fp;
 #define RTNL_HANDLE_F_LISTEN_ALL_NSID  0x01
+#define RTNL_HANDLE_F_SUPPRESS_NLERR   0x02
int flags;
 };
 
diff --git a/lib/libnetlink.c b/lib/libnetlink.c
index 9303b6686e2c..5b75b2db4e0b 100644
--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -299,7 +299,8 @@ static void rtnl_dump_error(const struct rtnl_handle *rth,
 errno == EOPNOTSUPP))
return;
 
-   perror("RTNETLINK answers");
+   if (!(rth->flags & RTNL_HANDLE_F_SUPPRESS_NLERR))
+   perror("RTNETLINK answers");
}
 }
 
-- 
2.1.4



[iproute2 net-next v2 2/3] ip netconf: Show all address families by default in dumps

2017-03-23 Thread David Ahern
Currently, 'ip netconf' only shows ipv4 and ipv6 netconf settings. If IPv6
is not enabled, the dump ends with
RTNETLINK answers: Operation not supported

when IPv6 request is attempted. Further, if the mpls_router module is also
loaded a separate request is needed to get MPLS settings.

To make this better going forward, use the new PF_UNSPEC dump all option
if the kernel supports it. If the kernel does not, it sets NLMSG_ERROR and
returns EOPNOTSUPP which is trapped and we fall back to the existing output
to maintain compatibility with existing kernels.

Signed-off-by: David Ahern 
---
 ip/ipnetconf.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/ip/ipnetconf.c b/ip/ipnetconf.c
index af539f5e945c..dc0851025223 100644
--- a/ip/ipnetconf.c
+++ b/ip/ipnetconf.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "rt_names.h"
 #include "utils.h"
@@ -197,16 +198,26 @@ static int do_show(int argc, char **argv)
}
rtnl_listen(&rth, print_netconf, stdout);
} else {
+   rth.flags = RTNL_HANDLE_F_SUPPRESS_NLERR;
 dump:
if (rtnl_wilddump_request(&rth, filter.family, RTM_GETNETCONF) < 0) {
perror("Cannot send dump request");
exit(1);
}
if (rtnl_dump_filter(&rth, print_netconf2, stdout) < 0) {
+   /* kernel does not support netconf dump on AF_UNSPEC;
+* fall back to requesting by family
+*/
+   if (errno == EOPNOTSUPP &&
+   filter.family == AF_UNSPEC) {
+   filter.family = AF_INET;
+   goto dump;
+   }
+   perror("RTNETLINK answers");
fprintf(stderr, "Dump terminated\n");
exit(1);
}
-   if (preferred_family == AF_UNSPEC) {
+   if (preferred_family == AF_UNSPEC && filter.family == AF_INET) {
preferred_family = AF_INET6;
filter.family = AF_INET6;
goto dump;
-- 
2.1.4



[iproute2 net-next v2 3/3] ip netconf: show all families on dev request

2017-03-23 Thread David Ahern
Currently, specifying a device to ip netconf dumps only values
for IPv4. Change this to dump data for all families unless a specific
family is given.

Signed-off-by: David Ahern 
---
 ip/ipnetconf.c | 23 +--
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/ip/ipnetconf.c b/ip/ipnetconf.c
index dc0851025223..696e3dd51a1c 100644
--- a/ip/ipnetconf.c
+++ b/ip/ipnetconf.c
@@ -56,6 +56,7 @@ int print_netconf(const struct sockaddr_nl *who, struct rtnl_ctrl_data *ctrl,
struct netconfmsg *ncm = NLMSG_DATA(n);
int len = n->nlmsg_len;
struct rtattr *tb[NETCONFA_MAX+1];
+   int ifindex = 0;
 
if (n->nlmsg_type == NLMSG_ERROR)
return -1;
@@ -77,6 +78,12 @@ int print_netconf(const struct sockaddr_nl *who, struct rtnl_ctrl_data *ctrl,
parse_rtattr(tb, NETCONFA_MAX, netconf_rta(ncm),
 NLMSG_PAYLOAD(n, sizeof(*ncm)));
 
+   if (tb[NETCONFA_IFINDEX])
+   ifindex = rta_getattr_u32(tb[NETCONFA_IFINDEX]);
+
+   if (filter.ifindex && filter.ifindex != ifindex)
+   return 0;
+
switch (ncm->ncm_family) {
case AF_INET:
fprintf(fp, "ipv4 ");
@@ -93,9 +100,7 @@ int print_netconf(const struct sockaddr_nl *who, struct rtnl_ctrl_data *ctrl,
}
 
if (tb[NETCONFA_IFINDEX]) {
-   int *ifindex = (int *)rta_getattr_str(tb[NETCONFA_IFINDEX]);
-
-   switch (*ifindex) {
+   switch (ifindex) {
case NETCONFA_IFINDEX_ALL:
fprintf(fp, "all ");
break;
@@ -103,7 +108,7 @@ int print_netconf(const struct sockaddr_nl *who, struct rtnl_ctrl_data *ctrl,
fprintf(fp, "default ");
break;
default:
-   fprintf(fp, "dev %s ", ll_index_to_name(*ifindex));
+   fprintf(fp, "dev %s ", ll_index_to_name(ifindex));
break;
}
}
@@ -169,8 +174,6 @@ static int do_show(int argc, char **argv)
 
ipnetconf_reset_filter(0);
filter.family = preferred_family;
-   if (filter.family == AF_UNSPEC)
-   filter.family = AF_INET;
 
while (argc > 0) {
if (strcmp(*argv, "dev") == 0) {
@@ -186,11 +189,11 @@ static int do_show(int argc, char **argv)
}
 
ll_init_map();
-   if (filter.ifindex) {
+
+   if (filter.ifindex && filter.family != AF_UNSPEC) {
req.ncm.ncm_family = filter.family;
-   if (filter.ifindex)
-   addattr_l(&req, sizeof(req), NETCONFA_IFINDEX,
- &filter.ifindex, sizeof(filter.ifindex));
+   addattr_l(&req, sizeof(req), NETCONFA_IFINDEX,
+ &filter.ifindex, sizeof(filter.ifindex));
 
if (rtnl_send(&rth, &req, req.n.nlmsg_len) < 0) {
perror("Can not send request");
-- 
2.1.4



[iproute2 net-next v2 0/3] ip netconf improvements

2017-03-23 Thread David Ahern
Currently, ip netconf only shows data for ipv4 and ipv6 for dumps
and just ipv4 for device requests. Improve the user experience by
using the new kernel patch to dump all address families that have
registered. For example, if mpls_router module is loaded then mpls
values are displayed along with ipv4 and ipv6.

If the new feature is not supported (new iproute2 on an older kernel),
the kernel returns the nlmsg error EOPNOTSUPP, which is trapped so we
can fall back to the existing behavior.

v2
- fixed index conversion in patch 3 per Nicholas' comment

David Ahern (3):
  netlink: Add flag to suppress print of nlmsg error
  ip netconf: Show all address families by default in dumps
  ip netconf: show all families on dev request

 include/libnetlink.h |  1 +
 ip/ipnetconf.c   | 36 +---
 lib/libnetlink.c |  3 ++-
 3 files changed, 28 insertions(+), 12 deletions(-)

-- 
2.1.4



Re: [PATCH net-next v4 1/2] New kernel function to get IP overhead on a socket.

2017-03-23 Thread R. Parameswaran

Hi Dave,

Please see inline:

On Thu, 23 Mar 2017, David Miller wrote:

> From: "R. Parameswaran" 
> Date: Wed, 22 Mar 2017 15:59:13 -0700 (PDT)
> 
> > A new function, kernel_sock_ip_overhead(), is provided
> > to calculate the cumulative overhead imposed by the IP
> > Header and IP options, if any, on a socket's payload.
> > The new function returns an overhead of zero for sockets
> > that do not belong to the IPv4 or IPv6 address families.
> > 
> > Signed-off-by: R. Parameswaran 
> 
> Just use the IPv4/IPv6 header size for now, just like the VXLAN
> driver does.
>

Actually, that's how the original posting was - it was changed in 
response to a review comment from James Chapman requesting the IP
Options overhead to be factored in and for this to be calculated in
a new standalone function that can be reused in other situations. 
The review comment makes sense to me - the kernel seems to do a 
good job of accounting for the cumulative size of IP Options and
if the information is available, it may make sense to factor it in.

I guess you are concerned about compatibility between vxlan and
L2TP? There may be one difference  - the socket for vxlan
appears to be opened/controlled entirely within kernel code (seems
to call udp_sock_create() which does not appear to turn on any options), 
but in the case of L2TP, it is possible for the tunnel socket to be 
opened from user space, if a user space control plane daemon is running.
Regardless of how user space daemons are written right now, it is 
possible in theory for the user space code to turn on options on the 
L2TP tunnel socket. So it seems that IP options might be enabled on the 
L2TP socket, but are probably unlikely on the vxlan socket? 
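For reference, a hedged sketch of the IPv4 side of that accounting; the
field names are assumed from the posted kernel_sock_ip_overhead() patch,
not quoted from it:

/* sketch: base IPv4 header plus any options currently set on the socket */
struct inet_sock *inet = inet_sk(sk);
struct ip_options_rcu *opt;
u32 overhead = sizeof(struct iphdr);

opt = rcu_dereference_protected(inet->inet_opt, sock_owned_by_user(sk));
if (opt)
	overhead += opt->opt.optlen;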

I'd suggest giving this a few days for James to respond. 
At that time if there is agreement that we don't need to factor options, 
I can rework it.

thanks,

Ramkumar
  
 
> Thanks.
> 


[PATCHv2] net: usbnet: support 64bit stats in qmi_wwan driver

2017-03-23 Thread Greg Ungerer
Add support for the net stats64 counters to the usbnet core and then to
the qmi_wwan driver.

This is a straightforward addition of 64-bit counters for RX and TX packets
and byte counts. It is done in the same style as for the other net drivers
that support stats64.

The bulk of the change is to the usbnet core. Then it is trivial to use
that in the qmi_wwan.c driver. It would be very simple to extend this
support to other usbnet based drivers.

The motivation to add this is that it is not particularly difficult to
get the RX and TX byte counts to wrap on 32bit platforms.

Signed-off-by: Greg Ungerer 
---
 drivers/net/usb/qmi_wwan.c |  1 +
 drivers/net/usb/usbnet.c   | 28 
 include/linux/usb/usbnet.h | 12 
 3 files changed, 41 insertions(+)

v2: EXPORT symbol usbnet_get_stats64()
rebase on top of net-next

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index 8056745..db12a66 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -251,6 +251,7 @@ static int qmi_wwan_mac_addr(struct net_device *dev, void *p)
.ndo_change_mtu = usbnet_change_mtu,
.ndo_set_mac_address= qmi_wwan_mac_addr,
.ndo_validate_addr  = eth_validate_addr,
+   .ndo_get_stats64= usbnet_get_stats64,
 };
 
 /* using a counter to merge subdriver requests with our own into a
diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index 13d4ec5..4499d89 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -330,6 +330,11 @@ void usbnet_skb_return (struct usbnet *dev, struct sk_buff *skb)
dev->net->stats.rx_packets++;
dev->net->stats.rx_bytes += skb->len;
 
+   u64_stats_update_begin(&dev->stats.syncp);
+   dev->stats.rx_packets++;
+   dev->stats.rx_bytes += skb->len;
+   u64_stats_update_end(&dev->stats.syncp);
+
netif_dbg(dev, rx_status, dev->net, "< rx, len %zu, type 0x%x\n",
  skb->len + sizeof (struct ethhdr), skb->protocol);
memset (skb->cb, 0, sizeof (struct skb_data));
@@ -981,6 +986,23 @@ int usbnet_set_link_ksettings(struct net_device *net,
 }
 EXPORT_SYMBOL_GPL(usbnet_set_link_ksettings);
 
+void usbnet_get_stats64(struct net_device *net, struct rtnl_link_stats64 *stats)
+{
+   struct usbnet *dev = netdev_priv(net);
+   unsigned int start;
+
+   netdev_stats_to_stats64(stats, &net->stats);
+
+   do {
+   start = u64_stats_fetch_begin_irq(&dev->stats.syncp);
+   stats->rx_packets = dev->stats.rx_packets;
+   stats->rx_bytes = dev->stats.rx_bytes;
+   stats->tx_packets = dev->stats.tx_packets;
+   stats->tx_bytes = dev->stats.tx_bytes;
+   } while (u64_stats_fetch_retry_irq(&dev->stats.syncp, start));
+}
+EXPORT_SYMBOL_GPL(usbnet_get_stats64);
+
 u32 usbnet_get_link (struct net_device *net)
 {
struct usbnet *dev = netdev_priv(net);
@@ -1214,6 +1236,11 @@ static void tx_complete (struct urb *urb)
if (urb->status == 0) {
dev->net->stats.tx_packets += entry->packets;
dev->net->stats.tx_bytes += entry->length;
+
+   u64_stats_update_begin(&dev->stats.syncp);
+   dev->stats.tx_packets += entry->packets;
+   dev->stats.tx_bytes += entry->length;
+   u64_stats_update_end(&dev->stats.syncp);
} else {
dev->net->stats.tx_errors++;
 
@@ -1658,6 +1685,7 @@ void usbnet_disconnect (struct usb_interface *intf)
init_timer (&dev->delay);
mutex_init (&dev->phy_mutex);
mutex_init(&dev->interrupt_mutex);
+   u64_stats_init(&dev->stats.syncp);
dev->interrupt_count = 0;
 
dev->net = net;
diff --git a/include/linux/usb/usbnet.h b/include/linux/usb/usbnet.h
index e2b5691..d1501b8 100644
--- a/include/linux/usb/usbnet.h
+++ b/include/linux/usb/usbnet.h
@@ -22,6 +22,14 @@
#ifndef __LINUX_USB_USBNET_H
#define __LINUX_USB_USBNET_H
 
+struct usbnet_stats64 {
+   struct u64_stats_sync   syncp;
+   u64 rx_packets;
+   u64 rx_bytes;
+   u64 tx_packets;
+   u64 tx_bytes;
+};
+
 /* interface from usbnet core to each USB networking link we handle */
 struct usbnet {
/* housekeeping */
@@ -64,6 +72,8 @@ struct usbnet {
struct usb_anchor   deferred;
struct tasklet_struct   bh;
 
+   struct usbnet_stats64   stats;
+
struct work_struct  kevent;
unsigned long   flags;
 #  define EVENT_TX_HALT0
@@ -278,5 +288,7 @@ extern int usbnet_set_link_ksettings(struct net_device *net,
 extern void usbnet_status_stop(struct usbnet *dev);
 
 extern void usbnet_update_max_qlen(struct usbnet *dev);
+extern void usbnet_get_stats64(struct net_device *dev,
+  struct rtnl_link_stats64 *stats);
 
 #endif /* __LINUX_USB_USBNET_H */
-- 
1.9.1



Re: linux-next: manual merge of the net-next tree with Linus' tree

2017-03-23 Thread Alexei Starovoitov

On 3/23/17 5:10 PM, David Miller wrote:

From: Stephen Rothwell 
Date: Fri, 24 Mar 2017 11:05:14 +1100


Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  kernel/bpf/hashtab.c

between commit:

  8c290e60fa2a ("bpf: fix hashmap extra_elems logic")

from Linus' tree and commit:

  bcc6b1b7ebf8 ("bpf: Add hash of maps support")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.


I did the same resolution just an hour ago when merging net into
net-next.


yes. that's correct merge conflict resolution.
Just rebuilt and retested. All looks good.
Thanks!


Re: [net-next PATCH v2 5/8] net: Track start of busy loop instead of when it should end

2017-03-23 Thread Eric Dumazet
On Thu, 2017-03-23 at 14:37 -0700, Alexander Duyck wrote:
> From: Alexander Duyck 
> 

> The last bit I changed is to move from using a shift by 10 to just using
> NSEC_PER_USEC and using multiplication for any run time calculations and
> division for a few compile time ones.  This should be more accurate and
> perform about the same on most architectures since modern CPUs typically
> handle multiplication without too much overhead.


busy polling thread can be preempted for more than 2 seconds.

Using usec instead of nanoseconds gave us 3 orders of magnitude cushion.

We do not need nsec accuracy for busy polling users, if this restricts
range and usability under stress.






Re: [PATCH net-next v4 1/2] New kernel function to get IP overhead on a socket.

2017-03-23 Thread kbuild test robot
Hi Parameswaran,

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/R-Parameswaran/New-kernel-function-to-get-IP-overhead-on-a-socket/20170324-065003
config: x86_64-randconfig-s1-03240701 (attached as .config)
compiler: gcc-4.4 (Debian 4.4.7-8) 4.4.7
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   net/socket.c: In function 'kernel_sock_ip_overhead':
>> net/socket.c:3359: warning: unused variable 'optv6'
>> net/socket.c:3357: warning: unused variable 'np'

vim +/optv6 +3359 net/socket.c

  3351   * this is an IPv4 or IPv6 socket and the length from IP options turned
  3352   * on at the socket.
  3353   */
  3354  u32 kernel_sock_ip_overhead(struct sock *sk)
  3355  {
  3356  struct inet_sock *inet;
> 3357  struct ipv6_pinfo *np;
  3358  struct ip_options_rcu *opt;
> 3359  struct ipv6_txoptions *optv6 = NULL;
  3360  u32 overhead = 0;
  3361  bool owned_by_user;
  3362  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: [net-next PATCH v2 3/8] net: Only define skb_mark_napi_id in one spot instead of two

2017-03-23 Thread Eric Dumazet
On Thu, 2017-03-23 at 14:36 -0700, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> Instead of defining two versions of skb_mark_napi_id I think it is more
> readable to just match the format of the sk_mark_napi_id functions and just
> wrap the contents of the function instead of defining two versions of the
> function.  This way we can save a few lines of code since we only need 2 of
> the ifdef/endif but needed 5 for the extra function declaration.
> 
> Signed-off-by: Alexander Duyck 
> ---
>  include/net/busy_poll.h |   22 +-
>  1 file changed, 9 insertions(+), 13 deletions(-)

Acked-by: Eric Dumazet 
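For reference, a minimal sketch of the single-definition pattern the
description refers to, assuming the usual CONFIG_NET_RX_BUSY_POLL guard:

static inline void skb_mark_napi_id(struct sk_buff *skb,
				    struct napi_struct *napi)
{
#ifdef CONFIG_NET_RX_BUSY_POLL
	skb->napi_id = napi->napi_id;
#endif
}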




Re: [net-next PATCH v2 4/8] net: Change return type of sk_busy_loop from bool to void

2017-03-23 Thread Eric Dumazet
On Thu, 2017-03-23 at 14:37 -0700, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> From what I can tell there are only a couple of spots where we are actually
> checking the return value of sk_busy_loop. As there are only a few
> consumers of that data, and the data being checked for can be replaced
> with a check for !skb_queue_empty() we might as well just pull the code
> out of sk_busy_loop and place it in the spots that actually need it.
> 
> Signed-off-by: Alexander Duyck 
> ---

Acked-by: Eric Dumazet 




[PATCH net-next] net: mpls: Fix setting ttl_propagate for rt2

2017-03-23 Thread David Ahern
Fix copy and paste error setting rt_ttl_propagate.

Fixes: 5b441ac8784c1 ("mpls: allow TTL propagation to IP packets to be configured")
Signed-off-by: David Ahern 
---
 net/mpls/af_mpls.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 82589b2abf3c..cd8be8d5e4ad 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -1949,7 +1949,7 @@ static int resize_platform_label_table(struct net *net, size_t limit)
RCU_INIT_POINTER(rt2->rt_nh->nh_dev, lo);
rt2->rt_protocol = RTPROT_KERNEL;
rt2->rt_payload_type = MPT_IPV6;
-   rt0->rt_ttl_propagate = MPLS_TTL_PROP_DEFAULT;
+   rt2->rt_ttl_propagate = MPLS_TTL_PROP_DEFAULT;
rt2->rt_nh->nh_via_table = NEIGH_LINK_TABLE;
rt2->rt_nh->nh_via_alen = lo->addr_len;
memcpy(__mpls_nh_via(rt2, rt2->rt_nh), lo->dev_addr,
-- 
2.10.1 (Apple Git-78)



Re: [net-next PATCH v2 2/8] tcp: Record Rx hash and NAPI ID in tcp_child_process

2017-03-23 Thread Eric Dumazet
On Thu, 2017-03-23 at 14:36 -0700, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> While working on some recent busy poll changes we found that child sockets
> were being instantiated without NAPI ID being set.  In our first attempt to
> fix it, it was suggested that we should just pull programming the NAPI ID
> into the function itself since all callers will need to have it set.
> 
> In addition to the NAPI ID change I have dropped the code that was
> populating the Rx hash since it was actually being populated in
> tcp_get_cookie_sock.
> 
> Reported-by: Sridhar Samudrala 
> Suggested-by: Eric Dumazet 
> Signed-off-by: Alexander Duyck 
> ---

Acked-by: Eric Dumazet 




Re: [net-next PATCH v2 8/8] net: Introduce SO_INCOMING_NAPI_ID

2017-03-23 Thread Alexander Duyck
On Thu, Mar 23, 2017 at 3:43 PM, Andy Lutomirski  wrote:
> On Thu, Mar 23, 2017 at 2:38 PM, Alexander Duyck
>  wrote:
>> From: Sridhar Samudrala 
>>
>> This socket option returns the NAPI ID associated with the queue on which
>> the last frame is received. This information can be used by the apps to
>> split the incoming flows among the threads based on the Rx queue on which
>> they are received.
>>
>> If the NAPI ID actually represents a sender_cpu then the value is ignored
>> and 0 is returned.
>
> This may be more of a naming / documentation issue than a
> functionality issue, but to me this reads as:
>
> "This socket option returns an internal implementation detail that, if
> you are sufficiently clueful about the current performance heuristics
> used by the Linux networking stack, just might give you a hint as to
> which epoll set to put the socket in."  I've done some digging into
> Linux networking stuff, but not nearly enough to have the slighest
> clue what you're supposed to do with the NAPI ID.

Really the NAPI ID is an arbitrary number that will be unique per
device queue, though multiple Rx queues can share a NAPI ID if they
are meant to be processed in the same call to poll.

If we wanted we could probably rename it to something like Device Poll
Identifier or Device Queue Identifier, DPID or DQID, if that would
work for you.  Essentially it is just a unique u32 value that should
not identify any other queue in the system while this device queue is
active.  Really the number itself is mostly arbitrary, the main thing
is that it doesn't change and uniquely identifies the queue in the
system.

> It would be nice to make this a bit more concrete and a bit less tied
> in Linux innards.  Perhaps a socket option could instead return a hint
> saying "for best results, put this socket in an epoll set that's on
> cpu N"?  After all, you're unlikely to do anything productive with
> busy polling at all, even on a totally different kernel
> implementation, if you have more than one epoll set per CPU.  I can
> see cases where you could plausibly poll with fewer than one set per
> CPU, I suppose.

Really we kind of already have an option that does what you are
implying called SO_INCOMING_CPU.  The problem is it requires pinning
the interrupts to the CPUs in order to keep the values consistent, and
even then busy polling can mess that up if the busy poll thread is
running on a different CPU.  With the NAPI ID we have to do a bit of
work on the application end, but we can uniquely identify each
incoming queue and interrupt migration and busy polling don't have any
effect on it.  So for example we could stack all the interrupts on CPU
0, and have our main thread located there doing the sorting of
incoming requests and handing them out to epoll listener threads on
other CPUs.  When those epoll listener threads start doing busy
polling the NAPI ID won't change even though the packet is being
processed on a different CPU.

> Again, though, from the description, it's totally unclear what a user
> is supposed to do.

What you end up having to do is essentially create a hash of sorts so
that you can go from NAPI IDs to threads.  In an ideal setup what you
end up with is multiple threads, each one running one epoll, and each
epoll polling on one specific queue.
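As a hedged userspace illustration of that setup (the option value and
the worker lookup function are assumptions, not part of this series):

#include <sys/epoll.h>
#include <sys/socket.h>

#ifndef SO_INCOMING_NAPI_ID
#define SO_INCOMING_NAPI_ID 56	/* assumed value; check asm/socket.h */
#endif

int worker_epfd_for(unsigned int napi_id);	/* app-defined NAPI ID -> epoll fd */

void dispatch_to_worker(int fd)
{
	unsigned int napi_id = 0;
	socklen_t len = sizeof(napi_id);
	struct epoll_event ev = { .events = EPOLLIN };

	if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID, &napi_id, &len) < 0)
		napi_id = 0;	/* unsupported or no frame seen yet */

	ev.data.fd = fd;
	epoll_ctl(worker_epfd_for(napi_id), EPOLL_CTL_ADD, fd, &ev);
}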

Hope that helps to clarify it.

- Alex


Re: [PATCH net-next] sched: act_csum: don't mangle TCP and UDP GSO packets

2017-03-23 Thread David Miller
From: Davide Caratti 
Date: Thu, 23 Mar 2017 10:39:40 +0100

> after act_csum computes the checksum on skbs carrying GSO TCP/UDP packets,
> subsequent segmentation fails because skb_needs_check(skb, true) returns
> true. Because of that, skb_warn_bad_offload() is invoked and the following
> message is displayed:
> 
> WARNING: CPU: 3 PID: 28 at net/core/dev.c:2553 skb_warn_bad_offload+0xf0/0xfd
> <...>
> 
>   [] skb_warn_bad_offload+0xf0/0xfd
>   [] __skb_gso_segment+0xec/0x110
>   [] validate_xmit_skb+0x12d/0x2b0
>   [] validate_xmit_skb_list+0x42/0x70
>   [] sch_direct_xmit+0xd0/0x1b0
>   [] __qdisc_run+0x120/0x270
>   [] __dev_queue_xmit+0x23d/0x690
>   [] dev_queue_xmit+0x10/0x20
> 
> Since GSO is able to compute checksum on individual segments of such skbs,
> we can simply skip mangling the packet.
> 
> Signed-off-by: Davide Caratti 

Seems reasonable, applied, thanks.


RE: [PATCH] e1000e: fix timing for 82579 Gigabit Ethernet controller

2017-03-23 Thread Brown, Aaron F
> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org] On Behalf Of Bernd Faust
> Sent: Thursday, February 16, 2017 10:42 AM
> To: Kirsher, Jeffrey T ; Lubetkin, YanirX
> ; intel-wired-...@lists.osuosl.org;
> netdev@vger.kernel.org; linux-ker...@vger.kernel.org
> Cc: Bernd Faust 
> Subject: [PATCH] e1000e: fix timing for 82579 Gigabit Ethernet controller
> 
> After an upgrade to Linux kernel v4.x the hardware timestamps of the
> 82579 Gigabit Ethernet Controller are different than expected.
> The values that are being read are almost four times as big as before
> the kernel upgrade.
> 
> The difference is that after the upgrade the driver sets the clock
> frequency to 25MHz, where before the upgrade it was set to 96MHz. Intel
> confirmed that the correct frequency for this network adapter is 96MHz.
> 
> Signed-off-by: Bernd Faust 
> ---
>  drivers/net/ethernet/intel/e1000e/netdev.c | 6 ++
>  1 file changed, 6 insertions(+)

Tested-by: Aaron Brown 


Re: [PATCH net] net: phy: Export mdiobus_register_board_info()

2017-03-23 Thread David Miller
From: Florian Fainelli 
Date: Wed, 22 Mar 2017 22:40:30 -0700

> We can build modular code that uses mdiobus_register_board_info() which would
> lead to a linking failure since this symbol is not exported.
> 
> Fixes: 648ea0134069 ("net: phy: Allow pre-declaration of MDIO devices")
> Signed-off-by: Florian Fainelli 

Applied, thanks.


Re: linux-next: manual merge of the net-next tree with Linus' tree

2017-03-23 Thread David Miller
From: Stephen Rothwell 
Date: Fri, 24 Mar 2017 11:05:14 +1100

> Hi all,
> 
> Today's linux-next merge of the net-next tree got a conflict in:
> 
>   kernel/bpf/hashtab.c
> 
> between commit:
> 
>   8c290e60fa2a ("bpf: fix hashmap extra_elems logic")
> 
> from Linus' tree and commit:
> 
>   bcc6b1b7ebf8 ("bpf: Add hash of maps support")
> 
> from the net-next tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.

I did the same resolution just an hour ago when merging net into
net-next.

Thanks!


linux-next: manual merge of the net-next tree with Linus' tree

2017-03-23 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  kernel/bpf/hashtab.c

between commit:

  8c290e60fa2a ("bpf: fix hashmap extra_elems logic")

from Linus' tree and commit:

  bcc6b1b7ebf8 ("bpf: Add hash of maps support")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc kernel/bpf/hashtab.c
index 361a69dfe543,343fb5394c95..
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@@ -582,7 -609,20 +616,15 @@@ static void htab_elem_free_rcu(struct r
  
  static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
  {
+   struct bpf_map *map = &htab->map;
+ 
+   if (map->ops->map_fd_put_ptr) {
+   void *ptr = fd_htab_map_get_ptr(map, l);
+ 
+   map->ops->map_fd_put_ptr(ptr);
+   }
+ 
 -  if (l->state == HTAB_EXTRA_ELEM_USED) {
 -  l->state = HTAB_EXTRA_ELEM_FREE;
 -  return;
 -  }
 -
 -  if (!(htab->map.map_flags & BPF_F_NO_PREALLOC)) {
 +  if (htab_is_prealloc(htab)) {
pcpu_freelist_push(&htab->freelist, &l->fnode);
} else {
atomic_dec(&htab->count);


Re: [PATCH net-next 1/2] net: dwc-xlgmac: declaration of dual license in headers

2017-03-23 Thread David Miller
From: Jie Deng 
Date: Thu, 23 Mar 2017 12:03:45 +0800

> The driver "dwc-xlgmac" is dual-licensed. This patch adds
> declaration of dual license in file headers.
> 
> Signed-off-by: Jie Deng 

Applied.


Re: [PATCH net-next 2/2] net: dwc-xlgmac: use dual license

2017-03-23 Thread David Miller
From: Jie Deng 
Date: Thu, 23 Mar 2017 12:03:46 +0800

> The driver "dwc-xlgmac" is dual-licensed.
> Declare the dual license with MODULE_LICENSE().
> 
> Signed-off-by: Jie Deng 

Applied.


Re: [PATCH net-next v8 0/3] net: core: Two Helper function about socket information

2017-03-23 Thread David Miller
From: Chenbo Feng 
Date: Wed, 22 Mar 2017 17:27:33 -0700

Introduce two eBPF helper functions to get the socket cookie and
socket uid for each packet. The helper functions are useful when
the *sk field inside sk_buff is not empty. They can be used by
socket- and uid-based traffic monitoring programs.

Series applied, thanks.


Re: [PATCH 1/3] soc: qcom: smd: Transition client drivers from smd to rpmsg

2017-03-23 Thread David Miller
From: Bjorn Andersson 
Date: Wed, 22 Mar 2017 14:57:33 -0700

> On Wed 22 Mar 11:44 PDT 2017, David Miller wrote:
> 
>> From: Bjorn Andersson 
>> Date: Mon, 20 Mar 2017 16:35:42 -0700
>> 
>> What is the status of the Kconfig dependency fix and how will I be
>> getting it?
>> 
> 
> There are two Kconfig dependencies in play here, the first is
> c3104aae5d8c ("remoteproc: qcom: fix QCOM_SMD dependencies"), this was
> picked up by Linus yesterday and will as such be in v4.10-rc4.
> 
> The other dependency, is the one Marcel wants you to pick up here is
> https://patchwork.kernel.org/patch/9635385/. It's on LKML, but if you
> want I can resend it with you as direct recipient, with Marcel's ack.
> 
> Likely Arnd would like this fix to be sent upstream for v4.11 already.
> 
>> Second, should I merge all three of these patches to net-next or just
>> this one?
>> 
> 
> I would like all three to be merged in this cycle and in addition I have
> a couple of patches coming up that will cause some minor conflicts with
> patch 2 - so I would prefer if patch 2 was available in a tag I can
> merge into my tree.

I should have all the dependencies in net-next now, but when I apply
this series I get undefined symbols:

[davem@localhost net-next]$ time make -s -j8
Kernel: arch/x86/boot/bzImage is ready  (#578)
ERROR: "qcom_rpm_smd_write" [drivers/regulator/qcom_smd-regulator.ko] undefined!
ERROR: "qcom_wcnss_open_channel" [drivers/net/wireless/ath/wcn36xx/wcn36xx.ko] undefined!
ERROR: "qcom_rpm_smd_write" [drivers/clk/qcom/clk-smd-rpm.ko] undefined!
ERROR: "qcom_wcnss_open_channel" [drivers/bluetooth/btqcomsmd.ko] undefined!
scripts/Makefile.modpost:91: recipe for target '__modpost' failed

Please fix this up.


[PATCH] xfrm: branchless addr4_match() on 64-bit

2017-03-23 Thread Alexey Dobriyan
Current addr4_match() code has a special test for /0 prefixes because
u32 << 32 is undefined behaviour per the C standard. However, it is
possible to omit the test on 64-bit because the shift can be done in a
64-bit register and then truncated to the expected value (which is a 0 mask).

Implicit truncation by htonl() fits nicely into R32-within-R64 model
on x86-64.

Space savings: none (coincidence)
Branch savings: 1


Before:

movzx  eax,BYTE PTR [rdi+0x2a]  # ->prefixlen_d
test   al,al
jne    xfrm_selector_match + 0x23f
...
movzx  eax,BYTE PTR [rbx+0x2b]  # ->prefixlen_s
test   al,al
je xfrm_selector_match + 0x1c7


After (no branches):

mov    r8d,0x20
mov    rdx,0xffffffffffffffff
mov    esi,DWORD PTR [rsi+0x2c]
mov    ecx,r8d
sub    cl,BYTE PTR [rdi+0x2a]
xor    esi,DWORD PTR [rbx]
mov    rdi,rdx
xor    eax,eax
shl    rdi,cl
bswap  edi


Signed-off-by: Alexey Dobriyan 
---

 include/net/xfrm.h |5 +
 1 file changed, 5 insertions(+)

--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -844,10 +844,15 @@ static inline bool addr_match(const void *token1, const 
void *token2,
 
 static inline bool addr4_match(__be32 a1, __be32 a2, u8 prefixlen)
 {
+#ifdef CONFIG_64BIT
+   /* L in UL is not a typo. */
+   return !((a1 ^ a2) & htonl(~0UL << (32 - prefixlen)));
+#else
/* C99 6.5.7 (3): u32 << 32 is undefined behaviour */
if (prefixlen == 0)
return true;
return !((a1 ^ a2) & htonl(0xFFFFFFFFu << (32 - prefixlen)));
+#endif
 }
 
 static __inline__
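A small stand-alone demonstration of the trick (assumes a 64-bit host,
where the 64-bit shift is well defined even for prefixlen == 0):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	unsigned int prefixlen = 0;

	/* 0xFFFFFFFFu << 32 would be undefined behaviour (C99 6.5.7p3);
	 * on x86 the hardware masks the shift count, yielding ~0 instead
	 * of the 0 mask a /0 prefix requires, hence the old special case. */

	/* shifting in 64 bits and truncating gives the expected 0 mask */
	uint32_t mask = (uint32_t)(~0UL << (32 - prefixlen));

	printf("/0 mask = 0x%08x\n", mask);	/* prints 0x00000000 */
	return 0;
}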


[PATCH] xfrm: use "unsigned int" in addr_match()

2017-03-23 Thread Alexey Dobriyan
x86_64 is a zero-extending arch, so "unsigned int" is preferred over "int"
for address calculations and for extending to size_t.

Space savings:

add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-24 (-24)
function old new   delta
xfrm_state_walk  708 696 -12
xfrm_selector_match  918 906 -12

Signed-off-by: Alexey Dobriyan 
---

 include/net/xfrm.h |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -816,12 +816,12 @@ static inline void xfrm_state_hold(struct xfrm_state *x)
 }
 
 static inline bool addr_match(const void *token1, const void *token2,
- int prefixlen)
+ unsigned int prefixlen)
 {
const __be32 *a1 = token1;
const __be32 *a2 = token2;
-   int pdw;
-   int pbi;
+   unsigned int pdw;
+   unsigned int pbi;
 
pdw = prefixlen >> 5; /* num of whole u32 in prefix */
pbi = prefixlen &  0x1f;  /* num of bits in incomplete u32 in prefix */


[PATCH net-next 1/1] tcp: sysctl: Fix a race to avoid unexpected 0 window from space

2017-03-23 Thread gfree . wind
From: Gao Feng 

Because sysctl_tcp_adv_win_scale can be changed at any time, there
is a race in tcp_win_from_space.
For example,
1. sysctl_tcp_adv_win_scale <= 0 is tested (sysctl_tcp_adv_win_scale is negative now)
2. space >> (-sysctl_tcp_adv_win_scale) is evaluated (sysctl_tcp_adv_win_scale is positive now)

As a result, tcp_win_from_space returns 0, which is unexpected.

Certainly, if the compiler loaded sysctl_tcp_adv_win_scale into a
register first and then used the register directly, it would be ok.
But we cannot depend on that compiler behavior.

Signed-off-by: Gao Feng 
---
 include/net/tcp.h | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6ec4ea6..119b592 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1252,9 +1252,11 @@ void tcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
 
 static inline int tcp_win_from_space(int space)
 {
-   return sysctl_tcp_adv_win_scale<=0 ?
-   (space>>(-sysctl_tcp_adv_win_scale)) :
-   space - (space>>sysctl_tcp_adv_win_scale);
+   int tcp_adv_win_scale = sysctl_tcp_adv_win_scale;
+
+   return tcp_adv_win_scale <= 0 ?
+   (space>>(-tcp_adv_win_scale)) :
+   space - (space>>tcp_adv_win_scale);
 }
 
 /* Note: caller must be prepared to deal with negative returns */
-- 
1.9.1
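An alternative sketch using READ_ONCE() to make the single load
explicit; hedged, this is not what the posted patch does:

static inline int tcp_win_from_space(int space)
{
	/* one explicit load; both uses below see the same value */
	int tcp_adv_win_scale = READ_ONCE(sysctl_tcp_adv_win_scale);

	return tcp_adv_win_scale <= 0 ?
	       (space >> (-tcp_adv_win_scale)) :
	       space - (space >> tcp_adv_win_scale);
}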






[PATCH] xfrm: use "unsigned int" in __xfrm6_pref_hash()

2017-03-23 Thread Alexey Dobriyan
x86_64 is a zero-extending arch, so "unsigned int" is preferred over "int"
for address calculations.

Space savings:

add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-58 (-58)
function old new   delta
xfrm_hash_resize27522743  -9
policy_hash_bysel985 973 -12
policy_hash_direct  1036 999 -37

Signed-off-by: Alexey Dobriyan 
---

 net/xfrm/xfrm_hash.h |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/net/xfrm/xfrm_hash.h
+++ b/net/xfrm/xfrm_hash.h
@@ -54,8 +54,8 @@ static inline unsigned int __xfrm4_dpref_spref_hash(const 
xfrm_address_t *daddr,
 static inline unsigned int __xfrm6_pref_hash(const xfrm_address_t *addr,
 __u8 prefixlen)
 {
-   int pdw;
-   int pbi;
+   unsigned int pdw;
+   unsigned int pbi;
u32 initval = 0;
 
pdw = prefixlen >> 5; /* num of whole u32 in prefix */


Re: [net-next PATCH v2 0/8] Add busy poll support for epoll

2017-03-23 Thread Eric Dumazet
On Thu, Mar 23, 2017 at 3:38 PM, Alexander Duyck
 wrote:
> On Thu, Mar 23, 2017 at 3:07 PM, Alexei Starovoitov

>> it all sounds awesome, but i cannot quite visualize the impact.
>> Can you post some sample code/minibenchmark and numbers before/after?
>>
>> Thanks!
>>
>
> Anything specific you are looking for?  I can probably work with our
> team internally to setup the benchmark in the next day or so.
>
> I've been doing most of my benchmarking at my desk with sockperf with
> just one thread and multiple sockets and seeing some modest savings
> with my round trip times dropping from something like 18 microseconds
> on average to 8 microseconds for ping-pong tests.

Same reduction for me, on 1000/1000 bytes RPC.

26 usec -> 16 usec  per transaction

(If you use sockperf, beware that it displays half round trips)

Also note that the gains also depend on how the interrupt mitigation
parameters are set.


Re: [net-next PATCH v2 8/8] net: Introduce SO_INCOMING_NAPI_ID

2017-03-23 Thread Andy Lutomirski
On Thu, Mar 23, 2017 at 2:38 PM, Alexander Duyck
 wrote:
> From: Sridhar Samudrala 
>
> This socket option returns the NAPI ID associated with the queue on which
> the last frame is received. This information can be used by the apps to
> split the incoming flows among the threads based on the Rx queue on which
> they are received.
>
> If the NAPI ID actually represents a sender_cpu then the value is ignored
> and 0 is returned.

This may be more of a naming / documentation issue than a
functionality issue, but to me this reads as:

"This socket option returns an internal implementation detail that, if
you are sufficiently clueful about the current performance heuristics
used by the Linux networking stack, just might give you a hint as to
which epoll set to put the socket in."  I've done some digging into
Linux networking stuff, but not nearly enough to have the slightest
clue what you're supposed to do with the NAPI ID.

It would be nice to make this a bit more concrete and a bit less tied
to Linux innards.  Perhaps a socket option could instead return a hint
saying "for best results, put this socket in an epoll set that's on
cpu N"?  After all, you're unlikely to do anything productive with
busy polling at all, even on a totally different kernel
implementation, if you have more than one epoll set per CPU.  I can
see cases where you could plausibly poll with fewer than one set per
CPU, I suppose.

Again, though, from the description, it's totally unclear what a user
is supposed to do.

--Andy


Re: [net-next PATCH v2 0/8] Add busy poll support for epoll

2017-03-23 Thread Alexander Duyck
On Thu, Mar 23, 2017 at 3:07 PM, Alexei Starovoitov
 wrote:
> On Thu, Mar 23, 2017 at 02:36:29PM -0700, Alexander Duyck wrote:
>> This is my second pass at trying to add support for busy polling when using
>> epoll. It is pretty much a full rewrite as I have made serious changes to
>> most of the patches.
>>
>> In the v1 series I had submitted we only allowed epoll to make use of busy
>> poll when all NAPI IDs were the same. I gave this some more thought and
>> after making several other changes based on feedback from Eric Dumazet I
>> decided to try changing the main concept a bit and instead we will now
>> attempt to busy poll on the NAPI ID of the last socket added to the ready
>> list. By doing it this way we are able to take advantage of the changes
>> Eric has already made so that we get woken up by the softirq, we then pull
>> packets via busy poll, and will return to the softirq until we are woken up
>> and a new socket has been added to the ready list.
>>
>> Most of the changes in this set authored by me are meant to be cleanup or
>> fixes for various things. For example, I am trying to make it so that we
>> don't perform hash look-ups for the NAPI instance when we are only working
>> with sender_cpu and the like.
>>
>> The most complicated change of the set is probably the clean-ups for the
>> timeout. I realized that the timeout could potentially get into a state
>> where it would never timeout if the local_clock() was approaching a
>> rollover and the added busy poll usecs interval would be enough to roll it
>> over.  Because we were shifting the value you would never be able to get
>> time_after to actually trigger.
>>
>> At the heart of this set is the last 3 patches which enable epoll support
>> and add support for obtaining the NAPI ID of a given socket.  With these
>> it becomes possible for an application to make use of epoll and get optimal
>> busy poll utilization by stacking multiple sockets with the same NAPI ID on
>> the same epoll context.
>
> it all sounds awesome, but i cannot quite visualize the impact.
> Can you post some sample code/minibenchmark and numbers before/after?
>
> Thanks!
>

Anything specific you are looking for?  I can probably work with our
team internally to setup the benchmark in the next day or so.

I've been doing most of my benchmarking at my desk with sockperf with
just one thread and multiple sockets and seeing some modest savings
with my round trip times dropping from something like 18 microseconds
on average to 8 microseconds for ping-pong tests.

Thanks.

- Alex


[PATCH] xfrm: remove unused struct xfrm_mgr::id

2017-03-23 Thread Alexey Dobriyan
Signed-off-by: Alexey Dobriyan 
---

 include/net/xfrm.h   |1 -
 net/key/af_key.c |1 -
 net/xfrm/xfrm_user.c |1 -
 3 files changed, 3 deletions(-)

--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -586,7 +586,6 @@ struct xfrm_migrate {
 
 struct xfrm_mgr {
	struct list_head	list;
-	char			*id;
int (*notify)(struct xfrm_state *x, const struct 
km_event *c);
int (*acquire)(struct xfrm_state *x, struct 
xfrm_tmpl *, struct xfrm_policy *xp);
struct xfrm_policy  *(*compile_policy)(struct sock *sk, int opt, u8 
*data, int len, int *dir);
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -3792,7 +3792,6 @@ static inline void pfkey_exit_proc(struct net *net)
 
 static struct xfrm_mgr pfkeyv2_mgr =
 {
-   .id = "pfkeyv2",
.notify = pfkey_send_notify,
.acquire= pfkey_send_acquire,
.compile_policy = pfkey_compile_policy,
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -3101,7 +3101,6 @@ static bool xfrm_is_alive(const struct km_event *c)
 }
 
 static struct xfrm_mgr netlink_mgr = {
-   .id = "netlink",
.notify = xfrm_send_state_notify,
.acquire= xfrm_send_acquire,
.compile_policy = xfrm_compile_policy,


Re: [net-next PATCH v2 6/8] net: Commonize busy polling code to focus on napi_id instead of socket

2017-03-23 Thread Eric Dumazet
On Thu, 2017-03-23 at 14:37 -0700, Alexander Duyck wrote:
> From: Sridhar Samudrala 
> 
> Move the core functionality in sk_busy_loop() to napi_busy_loop() and
> make it independent of sk.
> 
> This enables re-using this function in epoll busy loop implementation.
> 
> Signed-off-by: Sridhar Samudrala 
> Signed-off-by: Alexander Duyck 
> ---

Acked-by: Eric Dumazet 




Re: [net-next PATCH v2 8/8] net: Introduce SO_INCOMING_NAPI_ID

2017-03-23 Thread Eric Dumazet
On Thu, 2017-03-23 at 14:38 -0700, Alexander Duyck wrote:
> From: Sridhar Samudrala 
> 
> This socket option returns the NAPI ID associated with the queue on which
> the last frame is received. This information can be used by the apps to
> split the incoming flows among the threads based on the Rx queue on which
> they are received.
> 
> If the NAPI ID actually represents a sender_cpu then the value is ignored
> and 0 is returned.

Acked-by: Eric Dumazet 




Re: Problem: net: mvneta: auto-negotiation with disconnected network cable

2017-03-23 Thread Stas Sergeev

On 23.03.2017 14:30, Maxime Morin wrote:

Hi,
Thank you very much for your help and your reactivity! See my answer below:


On 22.03.2017 16:23, Maxime Morin wrote:

Hi all,

I work on an embedded platform based on the Marvell Armada 88F6707, that is 
connected to a Marvell Alaska 88E1512 ethernet transceiver. A defect has 
appeared recently, and it turns out to be a regression on the network part. 
There is a complete loss of the network when following these steps:
  1) boot the board with the network cable disconnected
  2) run the following commands (or equivalent):
  ip link set eth0 up
  ip addr add 10.0.0.80/24 dev eth0
  ethtool -s eth0 autoneg on #this is the command that really breaks 
the network

Why do you call it a regression, if previously
this command did nothing at all?

I called that a regression because we still pass through the function 
phy_ethtool_sset(), which I thought was also doing something about the 
auto-negotiation. But apparently not.

It does, but with mdio.
But in the MAC driver we have another autoneg
now, which is a great confusion. mvneta_set_autoneg()
could have been named mvneta_set_inband_autoneg(),
which would already help a lot. If you prepare the patch-set,
you can include such a renaming.


I did a "git bisect" to find when the regression was introduced, because it 
previously worked with kernel 4.4, but not with the recent ones. The commit that 
introduced the issue is this one: 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0c0744fc1dd5b
If I remove on mvneta.c the part that was added on this commit on the function 
mvneta_ethtool_set_link_ksettings (mvneta_ethtool_set_settings at that time), I 
do not have the issue, but I can't call that a fix...
So, could it be a driver issue, or maybe a wrong configuration somewhere? If 
you need additional information to reproduce the problem please ask me, I will 
be as responsive as possible.

It seems mvneta_set_autoneg() does some non-symmetric
things. It clears
MVNETA_GMAC_FORCE_LINK_PASS |
MVNETA_GMAC_FORCE_LINK_DOWN |
MVNETA_GMAC_AN_FLOW_CTRL_EN
when enabling autoneg, and does not restore these flags
when disabling it. Try making it set or restore these
flags and see if that makes "ethtool -s eth0 autoneg off"
bring the network back alive.

As you suggested, I just set these three flags when the function is called with 
"enable" set to 0. And it works!

Am I right that it works only when you set autoneg
to "off", while "on" still does not give you any link detect?
So this is a partial fix, right?


Actually, I did not even have to set autoneg to off. When the module is probed, the 
default params are applied (mvneta_defaults_set()), and mvneta_set_autoneg() is called 
with "enable" set to 0, and it seems to fix everything.

It turns off SGMII-specific autoneg and starts using
mdio autoneg instead... a great confusion.


I tested it then, by setting autoneg to off/on, booting with or without the 
cable plugged, and I failed to break it. It seems to be fixed. Should I submit 
a patch? (would be the first time...)

After looking up my other patches to this driver, I
can see that

MVNETA_GMAC_FORCE_LINK_PASS |
MVNETA_GMAC_FORCE_LINK_DOWN

are left cleared willingly. I suspect that MVNETA_GMAC_AN_FLOW_CTRL_EN
breaks things for you. Could you please try with setting
back only this flag?
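For reference, that experiment amounts to something like this in the driver
(a sketch only: struct mvneta_port, mvreg_read()/mvreg_write() and the
MVNETA_GMAC_AUTONEG_CONFIG register name are assumptions about the driver
source, not taken from this thread; only the three flag names above appear
here):

static void mvneta_set_autoneg(struct mvneta_port *pp, int enable)
{
	u32 val = mvreg_read(pp, MVNETA_GMAC_AUTONEG_CONFIG);

	if (enable) {
		/* current behavior: clear forced-link and AN flow-control bits */
		val &= ~(MVNETA_GMAC_FORCE_LINK_PASS |
			 MVNETA_GMAC_FORCE_LINK_DOWN |
			 MVNETA_GMAC_AN_FLOW_CTRL_EN);
	} else {
		/* experiment: restore only the AN flow-control flag */
		val |= MVNETA_GMAC_AN_FLOW_CTRL_EN;
	}
	mvreg_write(pp, MVNETA_GMAC_AUTONEG_CONFIG, val);
}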


Re: [net-next PATCH v2 0/8] Add busy poll support for epoll

2017-03-23 Thread Alexei Starovoitov
On Thu, Mar 23, 2017 at 02:36:29PM -0700, Alexander Duyck wrote:
> This is my second pass at trying to add support for busy polling when using
> epoll. It is pretty much a full rewrite as I have made serious changes to
> most of the patches.
> 
> In the v1 series I had submitted we only allowed epoll to make use of busy
> poll when all NAPI IDs were the same. I gave this some more thought and
> after making several other changes based on feedback from Eric Dumazet I
> decided to try changing the main concept a bit and instead we will now
> attempt to busy poll on the NAPI ID of the last socket added to the ready
> list. By doing it this way we are able to take advantage of the changes
> Eric has already made so that we get woken up by the softirq, we then pull
> packets via busy poll, and will return to the softirq until we are woken up
> and a new socket has been added to the ready list.
> 
> Most of the changes in this set authored by me are meant to be cleanup or
> fixes for various things. For example, I am trying to make it so that we
> don't perform hash look-ups for the NAPI instance when we are only working
> with sender_cpu and the like.
> 
> The most complicated change of the set is probably the clean-ups for the
> timeout. I realized that the timeout could potentially get into a state
> where it would never timeout if the local_clock() was approaching a
> rollover and the added busy poll usecs interval would be enough to roll it
> over.  Because we were shifting the value you would never be able to get
> time_after to actually trigger.
> 
> At the heart of this set is the last 3 patches which enable epoll support
> and add support for obtaining the NAPI ID of a given socket.  With these
> it becomes possible for an application to make use of epoll and get optimal
> busy poll utilization by stacking multiple sockets with the same NAPI ID on
> the same epoll context.

it all sounds awesome, but i cannot quite visualize the impact.
Can you post some sample code/minibenchmark and numbers before/after?

Thanks!



Re: [PATCH net-next v4 1/2] New kernel function to get IP overhead on a socket.

2017-03-23 Thread David Miller
From: "R. Parameswaran" 
Date: Wed, 22 Mar 2017 15:59:13 -0700 (PDT)

> A new function, kernel_sock_ip_overhead(), is provided
> to calculate the cumulative overhead imposed by the IP
> Header and IP options, if any, on a socket's payload.
> The new function returns an overhead of zero for sockets
> that do not belong to the IPv4 or IPv6 address families.
> 
> Signed-off-by: R. Parameswaran 

Just use the IPv4/IPv6 header size for now, just like the VXLAN
driver does.

Thanks.


Re: [net-next PATCH v2 1/8] net: Busy polling should ignore sender CPUs

2017-03-23 Thread Eric Dumazet
On Thu, 2017-03-23 at 14:36 -0700, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> This patch is a cleanup/fix for NAPI IDs following the changes that made it
> so that sender_cpu and napi_id were doing a better job of sharing the same
> location in the sk_buff.
> 
> One issue I found is that we weren't validating the napi_id as being valid
> before we started trying to setup the busy polling.  This change corrects
> that by using the MIN_NAPI_ID value that is now used in both allocating the
> NAPI IDs, as well as validating them.
> 
> Signed-off-by: Alexander Duyck 
> ---

Acked-by: Eric Dumazet 




[PATCH] net: make in_aton() 32-bit internally

2017-03-23 Thread Alexey Dobriyan
Converting an IPv4 address doesn't need 64-bit arithmetic.

Space savings: 10 bytes!

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-10 (-10)
function  old new   delta
in_aton96  86 -10

Signed-off-by: Alexey Dobriyan 
---

 net/core/utils.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/net/core/utils.c
+++ b/net/core/utils.c
@@ -51,7 +51,7 @@ EXPORT_SYMBOL(net_ratelimit);
 
 __be32 in_aton(const char *str)
 {
-   unsigned long l;
+   unsigned int l;
unsigned int val;
int i;
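As an illustration of why 32 bits are enough, a simplified userspace mirror
of the accumulation loop in in_aton() (a sketch, not the kernel code): four
octets are shifted into one 32-bit value, so the old "unsigned long" bought
nothing on 64-bit:

#include <stdio.h>
#include <arpa/inet.h>

static unsigned int aton(const char *str)
{
	unsigned int l = 0;
	unsigned int val;
	int i;

	for (i = 0; i < 4; i++) {
		l <<= 8;
		val = 0;
		while (*str >= '0' && *str <= '9')
			val = val * 10 + (*str++ - '0');
		l |= val & 0xff;
		if (*str == '.')
			str++;
	}
	return htonl(l);
}

int main(void)
{
	printf("%08x\n", ntohl(aton("10.0.0.80")));	/* 0a000050 */
	return 0;
}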
 


Re: [PATCH] net: phy: handle state correctly in phy_stop_machine

2017-03-23 Thread Florian Fainelli
On 03/23/2017 02:41 PM, Florian Fainelli wrote:
> On 03/22/2017 01:27 PM, Zach Brown wrote:
>> From: Nathan Sullivan 
>>
>> If the PHY is halted on stop, then do not set the state to PHY_UP.  This
>> ensures the phy will be restarted later in phy_start when the machine is
>> started again.
> 
> So essentially what you want to "defeat" here is entering phy_start()
> with PHY_UP therefore not run this part:
> 
> case PHY_HALTED:
> /* make sure interrupts are re-enabled for the PHY */
> if (phydev->irq != PHY_POLL) {
> err = phy_enable_interrupts(phydev);
> if (err < 0)
> break;
> }
> 
> phydev->state = PHY_RESUMING;
> do_resume = true;
> break;
> 
> which is what re-enables interrupts and makes sure the PHY is resumed,
> right?
> 
> If that's the scenario, I guess:
> 
> Reviewed-by: Florian Fainelli 

And almost forgot:

Fixes: 00db8189d984 ("This patch adds a PHY Abstraction Layer to the
Linux Kernel, enabling ethernet drivers to remain as ignorant as is
reasonable of the connected PHY's design and operation details.")
-- 
Florian


Re: [PATCH] net: phy: handle state correctly in phy_stop_machine

2017-03-23 Thread Florian Fainelli
On 03/22/2017 01:27 PM, Zach Brown wrote:
> From: Nathan Sullivan 
> 
> If the PHY is halted on stop, then do not set the state to PHY_UP.  This
> ensures the phy will be restarted later in phy_start when the machine is
> started again.

So essentially what you want to "defeat" here is entering phy_start()
with PHY_UP therefore not run this part:

case PHY_HALTED:
/* make sure interrupts are re-enabled for the PHY */
if (phydev->irq != PHY_POLL) {
err = phy_enable_interrupts(phydev);
if (err < 0)
break;
}

phydev->state = PHY_RESUMING;
do_resume = true;
break;

which is what re-enables interrupts and makes sure the PHY is resumed,
right?

If that's the scenario, I guess:

Reviewed-by: Florian Fainelli 

> 
> Signed-off-by: Nathan Sullivan 
> Signed-off-by: Brad Mouring 
> Acked-by: Xander Huff 
> Acked-by: Kyle Roeschley 
> ---
>  drivers/net/phy/phy.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index 7cc1b7d..fe2d4c4 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -678,7 +678,7 @@ void phy_stop_machine(struct phy_device *phydev)
>   cancel_delayed_work_sync(&phydev->state_queue);
>  
>   mutex_lock(&phydev->lock);
> - if (phydev->state > PHY_UP)
> + if (phydev->state > PHY_UP && phydev->state != PHY_HALTED)
>   phydev->state = PHY_UP;
>   mutex_unlock(&phydev->lock);
>  }
> 


-- 
Florian


[net-next PATCH v2 6/8] net: Commonize busy polling code to focus on napi_id instead of socket

2017-03-23 Thread Alexander Duyck
From: Sridhar Samudrala 

Move the core functionality in sk_busy_loop() to napi_busy_loop() and
make it independent of sk.

This enables re-using this function in epoll busy loop implementation.

Signed-off-by: Sridhar Samudrala 
Signed-off-by: Alexander Duyck 
---
 include/net/busy_poll.h |   20 +++-
 net/core/dev.c  |   21 -
 net/core/sock.c |   11 +++
 3 files changed, 34 insertions(+), 18 deletions(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 4626cb22f625..336a874c0fed 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -61,7 +61,11 @@ static inline bool sk_can_busy_loop(const struct sock *sk)
return sk->sk_ll_usec && !signal_pending(current);
 }
 
-void sk_busy_loop(struct sock *sk, int nonblock);
+bool sk_busy_loop_end(void *p, unsigned long start_time);
+
+void napi_busy_loop(unsigned int napi_id,
+   bool (*loop_end)(void *, unsigned long),
+   void *loop_end_arg);
 
 #else /* CONFIG_NET_RX_BUSY_POLL */
 static inline unsigned long net_busy_loop_on(void)
@@ -74,10 +78,6 @@ static inline bool sk_can_busy_loop(struct sock *sk)
return false;
 }
 
-static inline void sk_busy_loop(struct sock *sk, int nonblock)
-{
-}
-
 #endif /* CONFIG_NET_RX_BUSY_POLL */
 
 static inline unsigned long busy_loop_current_time(void)
@@ -121,6 +121,16 @@ static inline bool sk_busy_loop_timeout(struct sock *sk,
return true;
 }
 
+static inline void sk_busy_loop(struct sock *sk, int nonblock)
+{
+#ifdef CONFIG_NET_RX_BUSY_POLL
+   unsigned int napi_id = READ_ONCE(sk->sk_napi_id);
+
+   if (napi_id >= MIN_NAPI_ID)
+   napi_busy_loop(napi_id, nonblock ? NULL : sk_busy_loop_end, sk);
+#endif
+}
+
 /* used in the NIC receive handler to mark the skb */
 static inline void skb_mark_napi_id(struct sk_buff *skb,
struct napi_struct *napi)
diff --git a/net/core/dev.c b/net/core/dev.c
index 73ebf2f5600e..b082f70e50c5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5060,20 +5060,17 @@ static void busy_poll_stop(struct napi_struct *napi, 
void *have_poll_lock)
do_softirq();
 }
 
-void sk_busy_loop(struct sock *sk, int nonblock)
+void napi_busy_loop(unsigned int napi_id,
+   bool (*loop_end)(void *, unsigned long),
+   void *loop_end_arg)
 {
-   unsigned long start_time = nonblock ? 0 : busy_loop_current_time();
+   unsigned long start_time = loop_end ? busy_loop_current_time() : 0;
int (*napi_poll)(struct napi_struct *napi, int budget);
void *have_poll_lock = NULL;
struct napi_struct *napi;
-   unsigned int napi_id;
int work;
 
 restart:
-   napi_id = READ_ONCE(sk->sk_napi_id);
-   if (napi_id < MIN_NAPI_ID)
-   return;
-
work = 0;
napi_poll = NULL;
 
@@ -5107,12 +5104,11 @@ void sk_busy_loop(struct sock *sk, int nonblock)
trace_napi_poll(napi, work, BUSY_POLL_BUDGET);
 count:
if (work > 0)
-   __NET_ADD_STATS(sock_net(sk),
+   __NET_ADD_STATS(dev_net(napi->dev),
LINUX_MIB_BUSYPOLLRXPACKETS, work);
local_bh_enable();
 
-   if (nonblock || !skb_queue_empty(&sk->sk_receive_queue) ||
-   sk_busy_loop_timeout(sk, start_time))
+   if (!loop_end || loop_end(loop_end_arg, start_time))
break;
 
if (unlikely(need_resched())) {
@@ -5121,8 +5117,7 @@ void sk_busy_loop(struct sock *sk, int nonblock)
preempt_enable();
rcu_read_unlock();
cond_resched();
-   if (!skb_queue_empty(&sk->sk_receive_queue) ||
-   sk_busy_loop_timeout(sk, start_time))
+   if (loop_end(loop_end_arg, start_time))
return;
goto restart;
}
@@ -5134,7 +5129,7 @@ void sk_busy_loop(struct sock *sk, int nonblock)
 out:
rcu_read_unlock();
 }
-EXPORT_SYMBOL(sk_busy_loop);
+EXPORT_SYMBOL(napi_busy_loop);
 
 #endif /* CONFIG_NET_RX_BUSY_POLL */
 
diff --git a/net/core/sock.c b/net/core/sock.c
index f8c0373a3a74..0aa725cb3dd6 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3231,3 +3231,14 @@ static int __init proto_init(void)
 subsys_initcall(proto_init);
 
 #endif /* PROC_FS */
+
+#ifdef CONFIG_NET_RX_BUSY_POLL
+bool sk_busy_loop_end(void *p, unsigned long start_time)
+{
+   struct sock *sk = p;
+
+   return !skb_queue_empty(&sk->sk_receive_queue) ||
+  sk_busy_loop_timeout(sk, start_time);
+}
+EXPORT_SYMBOL(sk_busy_loop_end);
+#endif /* CONFIG_NET_RX_BUSY_POLL */



[net-next PATCH v2 0/8] Add busy poll support for epoll

2017-03-23 Thread Alexander Duyck
This is my second pass at trying to add support for busy polling when using
epoll. It is pretty much a full rewrite as I have made serious changes to
most of the patches.

In the v1 series I had submitted we only allowed epoll to make use of busy
poll when all NAPI IDs were the same. I gave this some more thought and
after making several other changes based on feedback from Eric Dumazet I
decided to try changing the main concept a bit and instead we will now
attempt to busy poll on the NAPI ID of the last socket added to the ready
list. By doing it this way we are able to take advantage of the changes
Eric has already made so that we get woken up by the softirq, we then pull
packets via busy poll, and will return to the softirq until we are woken up
and a new socket has been added to the ready list.

Most of the changes in this set authored by me are meant to be cleanup or
fixes for various things. For example, I am trying to make it so that we
don't perform hash look-ups for the NAPI instance when we are only working
with sender_cpu and the like.

The most complicated change of the set is probably the clean-ups for the
timeout. I realized that the timeout could potentially get into a state
where it would never timeout if the local_clock() was approaching a
rollover and the added busy poll usecs interval would be enough to roll it
over.  Because we were shifting the value you would never be able to get
time_after to actually trigger.

At the heart of this set is the last 3 patches which enable epoll support
and add support for obtaining the NAPI ID of a given socket.  With these
it becomes possible for an application to make use of epoll and get optimal
busy poll utilization by stacking multiple sockets with the same NAPI ID on
the same epoll context.

---

Alexander Duyck (5):
  net: Busy polling should ignore sender CPUs
  tcp: Record Rx hash and NAPI ID in tcp_child_process
  net: Only define skb_mark_napi_id in one spot instead of two
  net: Change return type of sk_busy_loop from bool to void
  net: Track start of busy loop instead of when it should end

Sridhar Samudrala (3):
  net: Commonize busy polling code to focus on napi_id instead of socket
  epoll: Add busy poll support to epoll with socket fds.
  net: Introduce SO_INCOMING_NAPI_ID


 arch/alpha/include/uapi/asm/socket.h   |2 +
 arch/avr32/include/uapi/asm/socket.h   |2 +
 arch/frv/include/uapi/asm/socket.h |2 +
 arch/ia64/include/uapi/asm/socket.h|2 +
 arch/m32r/include/uapi/asm/socket.h|2 +
 arch/mips/include/uapi/asm/socket.h|1 
 arch/mn10300/include/uapi/asm/socket.h |2 +
 arch/parisc/include/uapi/asm/socket.h  |2 +
 arch/powerpc/include/uapi/asm/socket.h |2 +
 arch/s390/include/uapi/asm/socket.h|2 +
 arch/sparc/include/uapi/asm/socket.h   |2 +
 arch/xtensa/include/uapi/asm/socket.h  |2 +
 fs/eventpoll.c |   93 +++
 fs/select.c|   16 ++---
 include/net/busy_poll.h|  112 
 include/uapi/asm-generic/socket.h  |2 +
 net/core/datagram.c|8 ++
 net/core/dev.c |   42 ++--
 net/core/sock.c|   23 +++
 net/core/sysctl_net_core.c |   11 +++
 net/ipv4/tcp_ipv4.c|2 -
 net/ipv4/tcp_minisocks.c   |4 +
 net/ipv6/tcp_ipv6.c|2 -
 net/sctp/socket.c  |9 ++-
 24 files changed, 264 insertions(+), 83 deletions(-)

--


[net-next PATCH v2 8/8] net: Introduce SO_INCOMING_NAPI_ID

2017-03-23 Thread Alexander Duyck
From: Sridhar Samudrala 

This socket option returns the NAPI ID associated with the queue on which
the last frame is received. This information can be used by the apps to
split the incoming flows among the threads based on the Rx queue on which
they are received.

If the NAPI ID actually represents a sender_cpu then the value is ignored
and 0 is returned.

Signed-off-by: Sridhar Samudrala 
Signed-off-by: Alexander Duyck 
---
 arch/alpha/include/uapi/asm/socket.h   |2 ++
 arch/avr32/include/uapi/asm/socket.h   |2 ++
 arch/frv/include/uapi/asm/socket.h |2 ++
 arch/ia64/include/uapi/asm/socket.h|2 ++
 arch/m32r/include/uapi/asm/socket.h|2 ++
 arch/mips/include/uapi/asm/socket.h|1 +
 arch/mn10300/include/uapi/asm/socket.h |2 ++
 arch/parisc/include/uapi/asm/socket.h  |2 ++
 arch/powerpc/include/uapi/asm/socket.h |2 ++
 arch/s390/include/uapi/asm/socket.h|2 ++
 arch/sparc/include/uapi/asm/socket.h   |2 ++
 arch/xtensa/include/uapi/asm/socket.h  |2 ++
 include/uapi/asm-generic/socket.h  |2 ++
 net/core/sock.c|   12 
 14 files changed, 37 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/socket.h 
b/arch/alpha/include/uapi/asm/socket.h
index 089db42c1b40..1bb8cac61a28 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -101,4 +101,6 @@
 
 #define SO_MEMINFO 55
 
+#define SO_INCOMING_NAPI_ID 56
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/avr32/include/uapi/asm/socket.h 
b/arch/avr32/include/uapi/asm/socket.h
index 6eabcbd2f82a..f824eeb0f2e4 100644
--- a/arch/avr32/include/uapi/asm/socket.h
+++ b/arch/avr32/include/uapi/asm/socket.h
@@ -94,4 +94,6 @@
 
 #define SO_MEMINFO 55
 
+#define SO_INCOMING_NAPI_ID 56
+
 #endif /* _UAPI__ASM_AVR32_SOCKET_H */
diff --git a/arch/frv/include/uapi/asm/socket.h 
b/arch/frv/include/uapi/asm/socket.h
index bd497f8356b9..a8ad9bebfc47 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -94,5 +94,7 @@
 
 #define SO_MEMINFO 55
 
+#define SO_INCOMING_NAPI_ID 56
+
 #endif /* _ASM_SOCKET_H */
 
diff --git a/arch/ia64/include/uapi/asm/socket.h 
b/arch/ia64/include/uapi/asm/socket.h
index f1bb54686168..6af3253e4209 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -103,4 +103,6 @@
 
 #define SO_MEMINFO 55
 
+#define SO_INCOMING_NAPI_ID 56
+
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h 
b/arch/m32r/include/uapi/asm/socket.h
index 459c46076f6f..e98b6bb897c0 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -94,4 +94,6 @@
 
 #define SO_MEMINFO 55
 
+#define SO_INCOMING_NAPI_ID 56
+
 #endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h 
b/arch/mips/include/uapi/asm/socket.h
index 688c18dd62ef..ae2b62e39d4d 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -112,5 +112,6 @@
 
 #define SO_MEMINFO 55
 
+#define SO_INCOMING_NAPI_ID 56
 
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h 
b/arch/mn10300/include/uapi/asm/socket.h
index 312d2c457a04..e4ac1843ee01 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -94,4 +94,6 @@
 
 #define SO_MEMINFO 55
 
+#define SO_INCOMING_NAPI_ID 56
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h 
b/arch/parisc/include/uapi/asm/socket.h
index b98ec38f2083..f754c793e82a 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -93,4 +93,6 @@
 
 #define SO_MEMINFO 0x4030
 
+#define SO_INCOMING_NAPI_ID 0x4031
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/powerpc/include/uapi/asm/socket.h 
b/arch/powerpc/include/uapi/asm/socket.h
index 099a889240f6..5f84af7dcb2e 100644
--- a/arch/powerpc/include/uapi/asm/socket.h
+++ b/arch/powerpc/include/uapi/asm/socket.h
@@ -101,4 +101,6 @@
 
 #define SO_MEMINFO 55
 
+#define SO_INCOMING_NAPI_ID 56
+
 #endif /* _ASM_POWERPC_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h 
b/arch/s390/include/uapi/asm/socket.h
index 6199bb34f7fa..25ac4960e707 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -100,4 +100,6 @@
 
#define SO_MEMINFO 55
 
+#define SO_INCOMING_NAPI_ID 56
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h 
b/arch/sparc/include/uapi/asm/socket.h
index 12cd8c2ec422..b05513acd589 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -90,6 +90,8 @@
 
 #define SO_MEMINFO 0x0039
 
+#define 
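For the user-side view, a minimal sketch of consuming the new option (it
assumes a kernel with this series; the fallback 56 is the asm-generic value
added above, and the helper name is made up for the example):

#include <sys/socket.h>

#ifndef SO_INCOMING_NAPI_ID
#define SO_INCOMING_NAPI_ID 56	/* asm-generic value from this patch */
#endif

/* Returns the NAPI ID recorded on the socket, 0 if none was recorded
 * (e.g. the field held a sender_cpu), or -1 on getsockopt() failure.
 * An app can then keep sockets with the same ID on the same epoll set.
 */
static int sock_napi_id(int fd)
{
	unsigned int napi_id = 0;
	socklen_t len = sizeof(napi_id);

	if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID, &napi_id, &len))
		return -1;
	return (int)napi_id;
}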

[net-next PATCH v2 7/8] epoll: Add busy poll support to epoll with socket fds.

2017-03-23 Thread Alexander Duyck
From: Sridhar Samudrala 

This patch adds busy poll support to epoll. The implementation is meant to
be opportunistic in that it will take the NAPI ID from the last socket
that is added to the ready list that contains a valid NAPI ID and it will
use that for busy polling until the ready list goes empty.  Once the ready
list goes empty the NAPI ID is reset and busy polling is disabled until a
new socket is added to the ready list.

In addition when we insert a new socket into the epoll we record the NAPI
ID and assume we are going to receive events on it.  If that doesn't occur
it will be evicted as the active NAPI ID and we will resume normal
behavior.

An application can use SO_INCOMING_CPU or SO_REUSEPORT_ATTACH_C/EBPF socket
options to spread the incoming connections to specific worker threads
based on the incoming queue. This enables epoll for each worker thread
to have only sockets that receive packets from a single queue. So when an
application calls epoll_wait() and there are no events available to report,
busy polling is done on the associated queue to pull the packets.

Signed-off-by: Sridhar Samudrala 
Signed-off-by: Alexander Duyck 
---
 fs/eventpoll.c |   93 
 1 file changed, 93 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 341251421ced..5420767c9b68 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * LOCKING:
@@ -224,6 +225,11 @@ struct eventpoll {
/* used to optimize loop detection check */
int visited;
struct list_head visited_list_link;
+
+#ifdef CONFIG_NET_RX_BUSY_POLL
+   /* used to track busy poll napi_id */
+   unsigned int napi_id;
+#endif
 };
 
 /* Wait structure used by the poll hooks */
@@ -384,6 +390,77 @@ static inline int ep_events_available(struct eventpoll *ep)
return !list_empty(&ep->rdllist) || ep->ovflist != EP_UNACTIVE_PTR;
 }
 
+#ifdef CONFIG_NET_RX_BUSY_POLL
+static bool ep_busy_loop_end(void *p, unsigned long start_time)
+{
+   struct eventpoll *ep = p;
+
+   return ep_events_available(ep) || busy_loop_timeout(start_time);
+}
+#endif /* CONFIG_NET_RX_BUSY_POLL */
+
+/*
+ * Busy poll if globally on and supporting sockets found && no events,
+ * busy loop will return if need_resched or ep_events_available.
+ *
+ * we must do our busy polling with irqs enabled
+ */
+static void ep_busy_loop(struct eventpoll *ep, int nonblock)
+{
+#ifdef CONFIG_NET_RX_BUSY_POLL
+   unsigned int napi_id = READ_ONCE(ep->napi_id);
+
+   if ((napi_id >= MIN_NAPI_ID) && net_busy_loop_on())
+   napi_busy_loop(napi_id, nonblock ? NULL : ep_busy_loop_end, ep);
+#endif
+}
+
+static inline void ep_reset_busy_poll_napi_id(struct eventpoll *ep)
+{
+#ifdef CONFIG_NET_RX_BUSY_POLL
+   if (ep->napi_id)
+   ep->napi_id = 0;
+#endif
+}
+
+/*
+ * Set epoll busy poll NAPI ID from sk.
+ */
+static inline void ep_set_busy_poll_napi_id(struct epitem *epi)
+{
+#ifdef CONFIG_NET_RX_BUSY_POLL
+   struct eventpoll *ep;
+   unsigned int napi_id;
+   struct socket *sock;
+   struct sock *sk;
+   int err;
+
+   if (!net_busy_loop_on())
+   return;
+
+   sock = sock_from_file(epi->ffd.file, &err);
+   if (!sock)
+   return;
+
+   sk = sock->sk;
+   if (!sk)
+   return;
+
+   napi_id = READ_ONCE(sk->sk_napi_id);
+   ep = epi->ep;
+
+   /* Non-NAPI IDs can be rejected
+    *  or
+    * Nothing to do if we already have this ID
+    */
+   if (napi_id < MIN_NAPI_ID || napi_id == ep->napi_id)
+   return;
+
+   /* record NAPI ID for use in next busy poll */
+   ep->napi_id = napi_id;
+#endif
+}
+
 /**
  * ep_call_nested - Perform a bound (possibly) nested call, by checking
  *  that the recursion limit is not exceeded, and that
@@ -1022,6 +1099,8 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned 
mode, int sync, void *k
 
spin_lock_irqsave(&ep->lock, flags);
 
+   ep_set_busy_poll_napi_id(epi);
+
/*
 * If the event mask does not contain any poll(2) event, we consider the
 * descriptor to be disabled. This condition is likely the effect of the
@@ -1363,6 +1442,9 @@ static int ep_insert(struct eventpoll *ep, struct 
epoll_event *event,
/* We have to drop the new item inside our item list to keep track of 
it */
spin_lock_irqsave(&ep->lock, flags);
 
+   /* record NAPI ID of new item if present */
+   ep_set_busy_poll_napi_id(epi);
+
/* If the file is already "ready" we drop it inside the ready list */
if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
list_add_tail(&epi->rdllink, &ep->rdllist);
@@ -1637,10 +1719,21 @@ static int ep_poll(struct eventpoll *ep, struct 
epoll_event __user 

[net-next PATCH v2 5/8] net: Track start of busy loop instead of when it should end

2017-03-23 Thread Alexander Duyck
From: Alexander Duyck 

This patch flips the logic we were using to determine if the busy polling
has timed out.  The main motivation for this is that we will need to
support two different possible timeout values in the future and by
recording the start time rather than when we would want to end we can focus
on making the end_time specific to the task be it epoll or socket based
polling.

I am also flipping the logic a bit.  The previous code was taking
local_clock() and shifting it by 10 to get the time value in microseconds.
That works for most values but has a side effect of us potentially
encountering time values that will never time out as the current time will
never exceed the recorded clock value plus the timeout usecs if the clock
value was approaching a roll-over.  To account for that I am leaving
start_time as a nanoseconds value instead of shifting it down to a
microseconds value.  In addition I am limiting the busy_poll and busy_read
values so that they cannot exceed about 2 seconds.  By doing this I can add
the timeout value into the nanosecond value and be guaranteed that we will
not prematurely trigger timeouts, even on 32 bit architectures.

The last bit I changed is to move from using a shift by 10 to just using
NSEC_PER_USEC and using multiplication for any run time calculations and
division for a few compile time ones.  This should be more accurate and
perform about the same on most architectures since modern CPUs typically
handle multiplication without too much overhead.

Signed-off-by: Alexander Duyck 
---
 fs/select.c|   16 +
 include/net/busy_poll.h|   78 +++-
 net/core/dev.c |6 ++-
 net/core/sysctl_net_core.c |   11 +-
 4 files changed, 68 insertions(+), 43 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index e2112270d75a..9287d3a96e35 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -409,7 +409,7 @@ int do_select(int n, fd_set_bits *fds, struct timespec64 
*end_time)
int retval, i, timed_out = 0;
u64 slack = 0;
unsigned int busy_flag = net_busy_loop_on() ? POLL_BUSY_LOOP : 0;
-   unsigned long busy_end = 0;
+   unsigned long busy_start = 0;
 
rcu_read_lock();
retval = max_select_fd(n, fds);
@@ -512,11 +512,11 @@ int do_select(int n, fd_set_bits *fds, struct timespec64 
*end_time)
 
/* only if found POLL_BUSY_LOOP sockets && not out of time */
if (can_busy_loop && !need_resched()) {
-   if (!busy_end) {
-   busy_end = busy_loop_end_time();
+   if (!busy_start) {
+   busy_start = busy_loop_current_time();
continue;
}
-   if (!busy_loop_timeout(busy_end))
+   if (!busy_loop_timeout(busy_start))
continue;
}
busy_flag = 0;
@@ -800,7 +800,7 @@ static int do_poll(struct poll_list *list, struct 
poll_wqueues *wait,
int timed_out = 0, count = 0;
u64 slack = 0;
unsigned int busy_flag = net_busy_loop_on() ? POLL_BUSY_LOOP : 0;
-   unsigned long busy_end = 0;
+   unsigned long busy_start = 0;
 
/* Optimise the no-wait case */
if (end_time && !end_time->tv_sec && !end_time->tv_nsec) {
@@ -853,11 +853,11 @@ static int do_poll(struct poll_list *list, struct 
poll_wqueues *wait,
 
/* only if found POLL_BUSY_LOOP sockets && not out of time */
if (can_busy_loop && !need_resched()) {
-   if (!busy_end) {
-   busy_end = busy_loop_end_time();
+   if (!busy_start) {
+   busy_start = busy_loop_current_time();
continue;
}
-   if (!busy_loop_timeout(busy_end))
+   if (!busy_loop_timeout(busy_start))
continue;
}
busy_flag = 0;
diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index c55760f4820f..4626cb22f625 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -41,67 +41,85 @@
  */
 #define MIN_NAPI_ID ((unsigned int)(NR_CPUS + 1))
 
+/* The timer checks are using time_after() to determine if we have
+ * exceeded our timeout period. In order to support that we have to limit
+ * ourselves to the smallest possible signed type and then in addition
+ * account for the fact that we are recording our value in microseconds
+ * instead of nanoseconds.
+ *
+ * This limit should be just a little over 2 seconds.
+ */
+#define MAX_BUSY_POLL (INT_MAX / NSEC_PER_USEC)
+
 static inline bool net_busy_loop_on(void)
 {
return sysctl_net_busy_poll;
 }
 
-static inline u64 
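To make the rollover scenario from the commit message concrete, a standalone
userspace sketch (it mimics the kernel's time_after() for a 64-bit unsigned
long): the old busy_loop_us_clock() value, local_clock() >> 10, can never
exceed ~0UL >> 10, so an end time computed just below that ceiling is never
reached and the loop never times out:

#include <stdio.h>

#define time_after(a, b)	((long)((b) - (a)) < 0)

int main(void)
{
	unsigned long max_us_clock = ~0UL >> 10;	/* shifted-clock ceiling */
	unsigned long end_time = max_us_clock - 5 + 100;	/* past the ceiling */

	/* even the largest value the shifted clock can produce ... */
	printf("%d\n", time_after(max_us_clock, end_time));	/* 0 */
	/* ... and after local_clock() itself wraps, "now" restarts near 0 */
	printf("%d\n", time_after(0UL, end_time));		/* 0 */
	return 0;
}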

[net-next PATCH v2 4/8] net: Change return type of sk_busy_loop from bool to void

2017-03-23 Thread Alexander Duyck
From: Alexander Duyck 

From what I can tell there are only a couple of spots where we are actually
checking the return value of sk_busy_loop. As there are only a few
consumers of that data, and the data being checked for can be replaced
with a check for !skb_queue_empty() we might as well just pull the code
out of sk_busy_loop and place it in the spots that actually need it.

Signed-off-by: Alexander Duyck 
---
 include/net/busy_poll.h |5 ++---
 net/core/datagram.c |8 ++--
 net/core/dev.c  |   26 --
 net/sctp/socket.c   |9 ++---
 4 files changed, 26 insertions(+), 22 deletions(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index b82d6ba70a14..c55760f4820f 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -74,7 +74,7 @@ static inline bool busy_loop_timeout(unsigned long end_time)
return time_after(now, end_time);
 }
 
-bool sk_busy_loop(struct sock *sk, int nonblock);
+void sk_busy_loop(struct sock *sk, int nonblock);
 
 #else /* CONFIG_NET_RX_BUSY_POLL */
 static inline unsigned long net_busy_loop_on(void)
@@ -97,9 +97,8 @@ static inline bool busy_loop_timeout(unsigned long end_time)
return true;
 }
 
-static inline bool sk_busy_loop(struct sock *sk, int nonblock)
+static inline void sk_busy_loop(struct sock *sk, int nonblock)
 {
-   return false;
 }
 
 #endif /* CONFIG_NET_RX_BUSY_POLL */
diff --git a/net/core/datagram.c b/net/core/datagram.c
index ea633342ab0d..4608aa245410 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -256,8 +256,12 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, 
unsigned int flags,
}
 
spin_unlock_irqrestore(&queue->lock, cpu_flags);
-   } while (sk_can_busy_loop(sk) &&
-sk_busy_loop(sk, flags & MSG_DONTWAIT));
+
+   if (!sk_can_busy_loop(sk))
+   break;
+
+   sk_busy_loop(sk, flags & MSG_DONTWAIT);
+   } while (!skb_queue_empty(&sk->sk_receive_queue));
 
error = -EAGAIN;
 
diff --git a/net/core/dev.c b/net/core/dev.c
index ab337bf5bbf4..6b0458b5afe0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5060,21 +5060,21 @@ static void busy_poll_stop(struct napi_struct *napi, 
void *have_poll_lock)
do_softirq();
 }
 
-bool sk_busy_loop(struct sock *sk, int nonblock)
+void sk_busy_loop(struct sock *sk, int nonblock)
 {
unsigned long end_time = !nonblock ? sk_busy_loop_end_time(sk) : 0;
int (*napi_poll)(struct napi_struct *napi, int budget);
void *have_poll_lock = NULL;
struct napi_struct *napi;
unsigned int napi_id;
-   int rc;
+   int work;
 
 restart:
napi_id = READ_ONCE(sk->sk_napi_id);
if (napi_id < MIN_NAPI_ID)
-   return 0;
+   return;
 
-   rc = false;
+   work = 0;
napi_poll = NULL;
 
rcu_read_lock();
@@ -5085,7 +5085,7 @@ bool sk_busy_loop(struct sock *sk, int nonblock)
 
preempt_disable();
for (;;) {
-   rc = 0;
+   work = 0;
local_bh_disable();
if (!napi_poll) {
unsigned long val = READ_ONCE(napi->state);
@@ -5103,12 +5103,12 @@ bool sk_busy_loop(struct sock *sk, int nonblock)
have_poll_lock = netpoll_poll_lock(napi);
napi_poll = napi->poll;
}
-   rc = napi_poll(napi, BUSY_POLL_BUDGET);
-   trace_napi_poll(napi, rc, BUSY_POLL_BUDGET);
+   work = napi_poll(napi, BUSY_POLL_BUDGET);
+   trace_napi_poll(napi, work, BUSY_POLL_BUDGET);
 count:
-   if (rc > 0)
+   if (work > 0)
__NET_ADD_STATS(sock_net(sk),
-   LINUX_MIB_BUSYPOLLRXPACKETS, rc);
+   LINUX_MIB_BUSYPOLLRXPACKETS, work);
local_bh_enable();
 
if (nonblock || !skb_queue_empty(&sk->sk_receive_queue) ||
@@ -5121,9 +5121,9 @@ bool sk_busy_loop(struct sock *sk, int nonblock)
preempt_enable();
rcu_read_unlock();
cond_resched();
-   rc = !skb_queue_empty(&sk->sk_receive_queue);
-   if (rc || busy_loop_timeout(end_time))
-   return rc;
+   if (!skb_queue_empty(&sk->sk_receive_queue) ||
+   busy_loop_timeout(end_time))
+   return;
goto restart;
}
cpu_relax();
@@ -5131,10 +5131,8 @@ bool sk_busy_loop(struct sock *sk, int nonblock)
if (napi_poll)
busy_poll_stop(napi, have_poll_lock);
preempt_enable();
-   rc = !skb_queue_empty(&sk->sk_receive_queue);
 out:

[net-next PATCH v2 3/8] net: Only define skb_mark_napi_id in one spot instead of two

2017-03-23 Thread Alexander Duyck
From: Alexander Duyck 

Instead of defining two versions of skb_mark_napi_id I think it is more
readable to just match the format of the sk_mark_napi_id functions and just
wrap the contents of the function instead of defining two versions of the
function.  This way we can save a few lines of code since we only need 2 of
the ifdef/endif but needed 5 for the extra function declaration.

Signed-off-by: Alexander Duyck 
---
 include/net/busy_poll.h |   22 +-
 1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 3fcda9e70c3f..b82d6ba70a14 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -76,14 +76,6 @@ static inline bool busy_loop_timeout(unsigned long end_time)
 
 bool sk_busy_loop(struct sock *sk, int nonblock);
 
-/* used in the NIC receive handler to mark the skb */
-static inline void skb_mark_napi_id(struct sk_buff *skb,
-   struct napi_struct *napi)
-{
-   skb->napi_id = napi->napi_id;
-}
-
-
 #else /* CONFIG_NET_RX_BUSY_POLL */
 static inline unsigned long net_busy_loop_on(void)
 {
@@ -100,11 +92,6 @@ static inline bool sk_can_busy_loop(struct sock *sk)
return false;
 }
 
-static inline void skb_mark_napi_id(struct sk_buff *skb,
-   struct napi_struct *napi)
-{
-}
-
 static inline bool busy_loop_timeout(unsigned long end_time)
 {
return true;
@@ -117,6 +104,15 @@ static inline bool sk_busy_loop(struct sock *sk, int 
nonblock)
 
 #endif /* CONFIG_NET_RX_BUSY_POLL */
 
+/* used in the NIC receive handler to mark the skb */
+static inline void skb_mark_napi_id(struct sk_buff *skb,
+   struct napi_struct *napi)
+{
+#ifdef CONFIG_NET_RX_BUSY_POLL
+   skb->napi_id = napi->napi_id;
+#endif
+}
+
/* used in the protocol handler to propagate the napi_id to the socket */
 static inline void sk_mark_napi_id(struct sock *sk, const struct sk_buff *skb)
 {



[net-next PATCH v2 1/8] net: Busy polling should ignore sender CPUs

2017-03-23 Thread Alexander Duyck
From: Alexander Duyck 

This patch is a cleanup/fix for NAPI IDs following the changes that made it
so that sender_cpu and napi_id were doing a better job of sharing the same
location in the sk_buff.

One issue I found is that we weren't validating the napi_id as being valid
before we started trying to setup the busy polling.  This change corrects
that by using the MIN_NAPI_ID value that is now used in both allocating the
NAPI IDs, as well as validating them.

Signed-off-by: Alexander Duyck 
---
 include/net/busy_poll.h |9 +++--
 net/core/dev.c  |   13 +
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index c0452de83086..3fcda9e70c3f 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -35,6 +35,12 @@
 extern unsigned int sysctl_net_busy_read __read_mostly;
 extern unsigned int sysctl_net_busy_poll __read_mostly;
 
+/* 0 - Reserved to indicate value not set
+ * 1..NR_CPUS - Reserved for sender_cpu
+ *  NR_CPUS+1..~0 - Region available for NAPI IDs
+ */
+#define MIN_NAPI_ID ((unsigned int)(NR_CPUS + 1))
+
 static inline bool net_busy_loop_on(void)
 {
return sysctl_net_busy_poll;
@@ -58,10 +64,9 @@ static inline unsigned long busy_loop_end_time(void)
 
 static inline bool sk_can_busy_loop(const struct sock *sk)
 {
-   return sk->sk_ll_usec && sk->sk_napi_id && !signal_pending(current);
+   return sk->sk_ll_usec && !signal_pending(current);
 }
 
-
 static inline bool busy_loop_timeout(unsigned long end_time)
 {
unsigned long now = busy_loop_us_clock();
diff --git a/net/core/dev.c b/net/core/dev.c
index 7869ae3837ca..ab337bf5bbf4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5066,15 +5066,20 @@ bool sk_busy_loop(struct sock *sk, int nonblock)
int (*napi_poll)(struct napi_struct *napi, int budget);
void *have_poll_lock = NULL;
struct napi_struct *napi;
+   unsigned int napi_id;
int rc;
 
 restart:
+   napi_id = READ_ONCE(sk->sk_napi_id);
+   if (napi_id < MIN_NAPI_ID)
+   return 0;
+
rc = false;
napi_poll = NULL;
 
rcu_read_lock();
 
-   napi = napi_by_id(sk->sk_napi_id);
+   napi = napi_by_id(napi_id);
if (!napi)
goto out;
 
@@ -5143,10 +5148,10 @@ static void napi_hash_add(struct napi_struct *napi)
 
spin_lock(_hash_lock);
 
-   /* 0..NR_CPUS+1 range is reserved for sender_cpu use */
+   /* 0..NR_CPUS range is reserved for sender_cpu use */
do {
-   if (unlikely(++napi_gen_id < NR_CPUS + 1))
-   napi_gen_id = NR_CPUS + 1;
+   if (unlikely(++napi_gen_id < MIN_NAPI_ID))
+   napi_gen_id = MIN_NAPI_ID;
} while (napi_by_id(napi_gen_id));
napi->napi_id = napi_gen_id;
 



[net-next PATCH v2 2/8] tcp: Record Rx hash and NAPI ID in tcp_child_process

2017-03-23 Thread Alexander Duyck
From: Alexander Duyck 

While working on some recent busy poll changes we found that child sockets
were being instantiated without NAPI ID being set.  In our first attempt to
fix it, it was suggested that we should just pull programming the NAPI ID
into the function itself since all callers will need to have it set.

In addition to the NAPI ID change I have dropped the code that was
populating the Rx hash since it was actually being populated in
tcp_get_cookie_sock.

Reported-by: Sridhar Samudrala 
Suggested-by: Eric Dumazet 
Signed-off-by: Alexander Duyck 
---
 net/ipv4/tcp_ipv4.c  |2 --
 net/ipv4/tcp_minisocks.c |4 
 net/ipv6/tcp_ipv6.c  |2 --
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 7482b5d11861..20cbd2f07f28 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1409,8 +1409,6 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
if (!nsk)
goto discard;
if (nsk != sk) {
-   sock_rps_save_rxhash(nsk, skb);
-   sk_mark_napi_id(nsk, skb);
if (tcp_child_process(sk, nsk, skb)) {
rsk = nsk;
goto reset;
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 692f974e5abe..40f125b19988 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 
 int sysctl_tcp_abort_on_overflow __read_mostly;
 
@@ -798,6 +799,9 @@ int tcp_child_process(struct sock *parent, struct sock 
*child,
int ret = 0;
int state = child->sk_state;
 
+   /* record NAPI ID of child */
+   sk_mark_napi_id(child, skb);
+
tcp_segs_in(tcp_sk(child), skb);
if (!sock_owned_by_user(child)) {
ret = tcp_rcv_state_process(child, skb);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 0f08d718a002..ee13e380c0dd 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1293,8 +1293,6 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff 
*skb)
goto discard;
 
if (nsk != sk) {
-   sock_rps_save_rxhash(nsk, skb);
-   sk_mark_napi_id(nsk, skb);
if (tcp_child_process(sk, nsk, skb))
goto reset;
if (opt_skb)



Re: [PATCH v2 1/2] gtp: rename SGSN netlink attribute

2017-03-23 Thread Pablo Neira Ayuso
On Thu, Mar 23, 2017 at 02:16:22PM -0700, David Miller wrote:
> Can I get a review of these two patches from some GTP experts?

I asked Jonas to resend indicating [PATCH net-next] in the subject and
with some minor cosmetic changes.

Apart from that, the patches look good, so Jonas, you can add this in
your follow-up version.

Acked-by: Pablo Neira Ayuso 


Re: [PATCH] net: phy: handle state correctly in phy_stop_machine

2017-03-23 Thread David Miller
From: Zach Brown 
Date: Wed, 22 Mar 2017 15:27:01 -0500

> From: Nathan Sullivan 
> 
> If the PHY is halted on stop, then do not set the state to PHY_UP.  This
> ensures the phy will be restarted later in phy_start when the machine is
> started again.
> 
> Signed-off-by: Nathan Sullivan 
> Signed-off-by: Brad Mouring 
> Acked-by: Xander Huff 
> Acked-by: Kyle Roeschley 

Florian, please review.


Re: [PATCH v2 1/2] gtp: rename SGSN netlink attribute

2017-03-23 Thread David Miller

Can I get a review of these two patches from some GTP experts?

Thanks.


[PATCH net-next] liquidio: do not reset Octeon if NIC firmware was preloaded

2017-03-23 Thread Felix Manlunas
The PF driver is incorrectly resetting Octeon when the module parameter
"fw_type=none" is there.  "fw_type=none" means the PF should not load any
firmware to the NIC because Octeon is already running preloaded firmware.

Fix it by putting an if (fw_type != none) around the reset code.

Because the Octeon reset is now conditionally gone, when unloading the
driver, conditionally send the RESET_PF command to the firmware, which will
then free up PF-related data structures.

Signed-off-by: Felix Manlunas 
Signed-off-by: Satanand Burla 
---
 drivers/net/ethernet/cavium/liquidio/lio_main.c| 45 +++---
 .../net/ethernet/cavium/liquidio/liquidio_common.h |  1 +
 2 files changed, 33 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c 
b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 2b89ec2..1b0381f 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -1360,6 +1360,12 @@ liquidio_probe(struct pci_dev *pdev,
return 0;
 }
 
+static bool fw_type_is_none(void)
+{
+   return strncmp(fw_type, LIO_FW_NAME_TYPE_NONE,
+  sizeof(LIO_FW_NAME_TYPE_NONE)) == 0;
+}
+
 /**
  *\brief Destroy resources associated with octeon device
  * @param pdev PCI device structure
@@ -1499,9 +1505,12 @@ static void octeon_destroy_resources(struct 
octeon_device *oct)
 
/* fallthrough */
case OCT_DEV_PCI_MAP_DONE:
-   /* Soft reset the octeon device before exiting */
-   if ((!OCTEON_CN23XX_PF(oct)) || !oct->octeon_id)
-   oct->fn_list.soft_reset(oct);
+   if (!fw_type_is_none()) {
+   /* Soft reset the octeon device before exiting */
+   if (!OCTEON_CN23XX_PF(oct) ||
+   (OCTEON_CN23XX_PF(oct) && !oct->octeon_id))
+   oct->fn_list.soft_reset(oct);
+   }
 
octeon_unmap_pci_barx(oct, 0);
octeon_unmap_pci_barx(oct, 1);
@@ -1634,6 +1643,15 @@ static void liquidio_destroy_nic_device(struct 
octeon_device *oct, int ifidx)
if (atomic_read(&lio->ifstate) & LIO_IFSTATE_RUNNING)
liquidio_stop(netdev);
 
+   if (fw_type_is_none()) {
+   struct octnic_ctrl_pkt nctrl;
+
memset(&nctrl, 0, sizeof(struct octnic_ctrl_pkt));
nctrl.ncmd.s.cmd = OCTNET_CMD_RESET_PF;
nctrl.iq_no = lio->linfo.txpciq[0].s.q_no;
octnet_send_nic_ctrl_pkt(oct, &nctrl);
+   }
+
if (oct->props[lio->ifidx].napi_enabled == 1) {
list_for_each_entry_safe(napi, n, &netdev->napi_list, dev_list)
napi_disable(napi);
@@ -2123,8 +2141,7 @@ static int load_firmware(struct octeon_device *oct)
char fw_name[LIO_MAX_FW_FILENAME_LEN];
char *tmp_fw_type;
 
-   if (strncmp(fw_type, LIO_FW_NAME_TYPE_NONE,
-   sizeof(LIO_FW_NAME_TYPE_NONE)) == 0) {
+   if (fw_type_is_none()) {
dev_info(&oct->pci_dev->dev, "Skipping firmware load\n");
return ret;
}
@@ -4443,14 +4460,16 @@ static int octeon_device_init(struct octeon_device 
*octeon_dev)
if (OCTEON_CN23XX_PF(octeon_dev)) {
if (!cn23xx_fw_loaded(octeon_dev)) {
fw_loaded = 0;
-   /* Do a soft reset of the Octeon device. */
-   if (octeon_dev->fn_list.soft_reset(octeon_dev))
-   return 1;
-   /* things might have changed */
-   if (!cn23xx_fw_loaded(octeon_dev))
-   fw_loaded = 0;
-   else
-   fw_loaded = 1;
+   if (!fw_type_is_none()) {
+   /* Do a soft reset of the Octeon device. */
+   if (octeon_dev->fn_list.soft_reset(octeon_dev))
+   return 1;
+   /* things might have changed */
+   if (!cn23xx_fw_loaded(octeon_dev))
+   fw_loaded = 0;
+   else
+   fw_loaded = 1;
+   }
} else {
fw_loaded = 1;
}
diff --git a/drivers/net/ethernet/cavium/liquidio/liquidio_common.h 
b/drivers/net/ethernet/cavium/liquidio/liquidio_common.h
index bc0af8a..1bc0385 100644
--- a/drivers/net/ethernet/cavium/liquidio/liquidio_common.h
+++ b/drivers/net/ethernet/cavium/liquidio/liquidio_common.h
@@ -180,6 +180,7 @@ static inline void add_sg_size(struct octeon_sg_entry 
*sg_entry,
#define   OCTNET_CMD_Q 0
 
 /* NIC Command types */
+#define   OCTNET_CMD_RESET_PF 0x0
 

Re: [PATCH net-next 0/2] net: bridge: allow user-space to add ext learned entries

2017-03-23 Thread Ido Schimmel
On Thu, Mar 23, 2017 at 12:27:11PM +0200, Nikolay Aleksandrov wrote:
> Hi,
> This set adds the ability to add externally learned entries from
> user-space. For symmetry and proper function we need to allow SW entries
> to take over HW learned ones (similar to how HW can take over SW entries
> currently) which is needed for our use case (evpn) where we have pure SW
> ports and HW ports mixed in a single bridge. This does not play well with
> switchdev devices currently because there's no feedback when the entry is
> taken over, but this case has never worked anyway and feedback can be
> easily added when needed.

Yea, correct. I think we should handle FDB offload in a similar fashion
to route offload. FDBs aren't only of interest to the port to which they
point, but also to the other ports in the bridge. In your example use
case we would actually need to forward to the CPU packets that hit FDB
entries pointing to the SW ports. What would currently happen is that we
would simply flood such packets via the HW ports.


Re: [PATCH net-next 2/2] net: bridge: allow to add externally learned entries from user-space

2017-03-23 Thread Ido Schimmel
On Thu, Mar 23, 2017 at 12:27:13PM +0200, Nikolay Aleksandrov wrote:
> The NTF_EXT_LEARNED flag was added for switchdev and externally learned
> entries, but it can also be used for entries learned via software
> in user-space which requires dynamic entries that do not expire.
> One such case that we have is with quagga and evpn, which need dynamic
> entries but also need to age them themselves.
> 
> Signed-off-by: Nikolay Aleksandrov 

Reviewed-by: Ido Schimmel 


Re: [PATCH net-next 1/2] net: bridge: allow SW learn to take over HW fdb entries

2017-03-23 Thread Ido Schimmel
On Thu, Mar 23, 2017 at 12:27:12PM +0200, Nikolay Aleksandrov wrote:
> Allow to take over an entry which was previously learned via HW when it
> shows up from a SW port. This is analogous to how HW takes over SW learned
> entries already.
> 
> Suggested-by: Roopa Prabhu 
> Signed-off-by: Nikolay Aleksandrov 

Reviewed-by: Ido Schimmel 


[PATCH net-next v5] net: Add sysctl to toggle early demux for tcp and udp

2017-03-23 Thread Subash Abhinov Kasiviswanathan
Certain systems process a significant unconnected UDP workload.
It would be preferable to disable UDP early demux for those systems
and enable it for TCP only.

By disabling UDP demux, we see these slight gains on an ARM64 system:
782 -> 788 Mbps unconnected single stream UDPv4
633 -> 654 Mbps unconnected UDPv4 different sources

The performance impact can change based on CPU architecture and cache
sizes. There will not be much difference seen if the entire UDP hash
table is in cache.

Both sysctls are enabled by default to preserve existing behavior.

v1->v2: Change function pointer instead of adding conditional as
suggested by Stephen.

v2->v3: Read once in callers to avoid issues due to compiler
optimizations. Also update commit message with the tests.

v3->v4: Store and use read once result instead of querying pointer
again incorrectly.

v4->v5: Refactor to avoid errors due to compilation with IPV6={m,n}

Signed-off-by: Subash Abhinov Kasiviswanathan 
Suggested-by: Eric Dumazet 
Cc: Stephen Hemminger 
Cc: Tom Herbert 
Cc: David Miller 
---
 Documentation/networking/ip-sysctl.txt | 11 +-
 include/net/netns/ipv4.h   |  2 +
 include/net/protocol.h |  7 ++--
 include/net/udp.h  |  1 +
 net/ipv4/af_inet.c |  8 +++-
 net/ipv4/ip_input.c|  5 ++-
 net/ipv4/protocol.c|  2 +-
 net/ipv4/sysctl_net_ipv4.c | 67 ++
 net/ipv6/ip6_input.c   |  6 ++-
 net/ipv6/protocol.c|  2 +-
 net/ipv6/tcp_ipv6.c|  3 +-
 net/ipv6/udp.c |  3 +-
 12 files changed, 103 insertions(+), 14 deletions(-)
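
The v3/v4 changelog notes above refer to a read-once dispatch. A minimal
sketch of the assumed shape (illustrative only, not the exact hunk from
the diff below):

static void early_demux_dispatch(const struct net_protocol *ipprot,
				 struct sk_buff *skb)
{
	void (*edemux)(struct sk_buff *skb);

	/* Read the pointer once so the NULL test and the indirect call
	 * see the same value even if a sysctl write swaps the handler
	 * concurrently. */
	edemux = READ_ONCE(ipprot->early_demux);
	if (edemux)
		edemux(skb);
}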

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index eaee2c8..b1c6500 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -856,12 +856,21 @@ ip_dynaddr - BOOLEAN
 ip_early_demux - BOOLEAN
Optimize input packet processing down to one demux for
certain kinds of local sockets.  Currently we only do this
-   for established TCP sockets.
+   for established TCP and connected UDP sockets.
 
It may add an additional cost for pure routing workloads that
reduces overall throughput, in such case you should disable it.
Default: 1
 
+tcp_early_demux - BOOLEAN
+   Enable early demux for established TCP sockets.
+   Default: 1
+
+udp_early_demux - BOOLEAN
+   Enable early demux for connected UDP sockets. Disable this if
+   your system could experience more unconnected load.
+   Default: 1
+
 icmp_echo_ignore_all - BOOLEAN
If set non-zero, then the kernel will ignore all ICMP ECHO
requests sent to it.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index a0e8919..cd686c4 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -95,6 +95,8 @@ struct netns_ipv4 {
/* Shall we try to damage output packets if routing dev changes? */
int sysctl_ip_dynaddr;
int sysctl_ip_early_demux;
+   int sysctl_tcp_early_demux;
+   int sysctl_udp_early_demux;
 
int sysctl_fwmark_reflect;
int sysctl_tcp_fwmark_accept;
diff --git a/include/net/protocol.h b/include/net/protocol.h
index bf36ca3..65ba335 100644
--- a/include/net/protocol.h
+++ b/include/net/protocol.h
@@ -40,6 +40,7 @@
 /* This is used to register protocols. */
 struct net_protocol {
void(*early_demux)(struct sk_buff *skb);
+   void(*early_demux_handler)(struct sk_buff *skb);
int (*handler)(struct sk_buff *skb);
void(*err_handler)(struct sk_buff *skb, u32 info);
	unsigned int	no_policy:1,
@@ -54,7 +55,7 @@ struct net_protocol {
 #if IS_ENABLED(CONFIG_IPV6)
 struct inet6_protocol {
void(*early_demux)(struct sk_buff *skb);
-
+   void(*early_demux_handler)(struct sk_buff *skb);
int (*handler)(struct sk_buff *skb);
 
void(*err_handler)(struct sk_buff *skb,
@@ -92,12 +93,12 @@ struct inet_protosw {
 #define INET_PROTOSW_PERMANENT 0x02  /* Permanent protocols are unremovable. */
 #define INET_PROTOSW_ICSK  0x04  /* Is this an inet_connection_sock? */
 
-extern const struct net_protocol __rcu *inet_protos[MAX_INET_PROTOS];
+extern struct net_protocol __rcu *inet_protos[MAX_INET_PROTOS];
 extern const struct net_offload __rcu *inet_offloads[MAX_INET_PROTOS];
 extern const struct net_offload __rcu *inet6_offloads[MAX_INET_PROTOS];
 
 #if IS_ENABLED(CONFIG_IPV6)
-extern const struct inet6_protocol __rcu *inet6_protos[MAX_INET_PROTOS];
+extern struct inet6_protocol __rcu *inet6_protos[MAX_INET_PROTOS];
 #endif
 
 int inet_add_protocol(const struct net_protocol 

[PATCH net] net: neigh: guard against NULL solicit() method

2017-03-23 Thread Eric Dumazet
From: Eric Dumazet 

Dmitry posted a nice reproducer of a bug triggering in neigh_probe()
when dereferencing a NULL neigh->ops->solicit method.

This can happen for arp_direct_ops/ndisc_direct_ops and similar,
which can be used for NUD_NOARP neighbours (created when dev->header_ops
is NULL). Admin can then force changing nud_state to some other state
that would fire neigh timer.

Signed-off-by: Eric Dumazet 
Reported-by: Dmitry Vyukov 
---
 net/core/neighbour.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 
e7c12caa20c88acc9a5dd86f07d11644fb58341d..4526cbd7e28a1fcdecfc06a41985fd4d19634457
 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -860,7 +860,8 @@ static void neigh_probe(struct neighbour *neigh)
if (skb)
skb = skb_clone(skb, GFP_ATOMIC);
	write_unlock(&neigh->lock);
-   neigh->ops->solicit(neigh, skb);
+   if (neigh->ops->solicit)
+   neigh->ops->solicit(neigh, skb);
	atomic_inc(&neigh->probes);
kfree_skb(skb);
 }




Re: netlink: NULL timer crash

2017-03-23 Thread Eric Dumazet
On Thu, 2017-03-23 at 12:00 -0700, David Miller wrote:
> From: Eric Dumazet 
> Date: Thu, 23 Mar 2017 09:00:58 -0700
> 
> > On Thu, 2017-03-23 at 07:53 -0700, Eric Dumazet wrote:
> > 
> >> Nice !
> >> 
> >> Looks like neigh->ops->solicit is NULL
> > 
> > Apparently we allow admins to do really stupid things with neighbours
> > on tunnels.
> > 
> > Following patch should avoid the crash.
> > 
> > Anyone has better ideas ?
> 
> This is probably good enough for now, but you need to also handle
> dn_neigh_ops.
> 
> Another way to solve this is to add a NULL method check to the
> one spot where we invoke this method.  That clearly shows that
> the method is optional.

Yes, this would be a one liner. I will post this in a minute.

Thanks.





Re: [PATCH V8 1/3] irq: Add flags to request_percpu_irq function

2017-03-23 Thread Daniel Lezcano
Hi Mark,

On Thu, Mar 23, 2017 at 06:54:52PM +, Mark Rutland wrote:
> Hi Daniel,
> 
> On Thu, Mar 23, 2017 at 06:42:01PM +0100, Daniel Lezcano wrote:
> > In the next changes, we track the interrupts but we discard the timers as
> > that does not make sense. The next interrupt on a timer is predictable.
> 
> Sorry, but I could not parse this. 

I meant we are measuring when the interrupts happen by reading the local
clock in the interrupt handler. But if the interrupts come from a timer,
that is not necessary because we already know when they will occur.

So, in order to sort out which interrupt we measure, we use the IRQF_TIMER flag.

Unfortunately, this flag is missing when we do a request_percpu_irq. The
purpose of this patch is to fix that.

> > diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
> > index 9612b84..0f5ab4a 100644
> > --- a/drivers/perf/arm_pmu.c
> > +++ b/drivers/perf/arm_pmu.c
> > @@ -661,7 +661,7 @@ static int cpu_pmu_request_irq(struct arm_pmu *cpu_pmu, 
> > irq_handler_t handler)
> >  
> > irq = platform_get_irq(pmu_device, 0);
> > if (irq > 0 && irq_is_percpu(irq)) {
> > -   err = request_percpu_irq(irq, handler, "arm-pmu",
> > +   err = request_percpu_irq(irq, 0, handler, "arm-pmu",
> >  &hw_events->percpu_pmu);
> > if (err) {
> > pr_err("unable to request IRQ%d for ARM PMU counters\n",
> 
> Please Cc myself and Will Deacon when modifying the arm_pmu driver, as
> per MAINTAINERS. I only spotted this patch by chance.

Ah, ok, sorry for that. Thanks for spotting this, you should have been Cc'ed by
my cccmd script. I will check that.

> This conflicts with arm_pmu changes I have queued for v4.12 [1].
> 
> So, can we leave the prototype of request_percpu_irq() as-is?
> 
> Why not add a new request_percpu_irq_flags() function, and leave
> request_percpu_irq() as a wrapper for that? e.g.

[ ... ]

> static inline int
> request_percpu_irq(unsigned int irq, irq_handler_t handler,
>  const char *devname, void __percpu *percpu_dev_id)
> {
>   return request_percpu_irq_flags(irq, handler, devname,
>   percpu_dev_id, 0);
> }
> 
> ... that would avoid having to touch any non-timer driver for now.

Mmh, yes. That's a good suggestion.

> [...]
> 
> > -request_percpu_irq(unsigned int irq, irq_handler_t handler,
> > -  const char *devname, void __percpu *percpu_dev_id);
> > +request_percpu_irq(unsigned int irq, unsigned long flags,
> > +  irq_handler_t handler,  const char *devname,
> > +  void __percpu *percpu_dev_id);
> >  
> 
> Looking at request_irq, the prototype is:
> 
> int __must_check
> request_irq(unsigned int irq, irq_handler_t handler,
>   unsigned long flags, const char *name,
>   void *dev);
> 
> ... surely it would be better to share the same argument order? i.e.
> 
> int __must_check
> request_percpu_irq(unsigned int irq, irq_handler_t handler,
>  unsigned long flags, const char *devname,
>  void __percpu *percpu_dev_id);
> 

Agree.

Thanks for the review.

  -- Daniel




Re: net/sched: GPF in qdisc_hash_add

2017-03-23 Thread Eric Dumazet
On Thu, Mar 23, 2017 at 12:06 PM, Dmitry Vyukov  wrote:
>
> On Thu, Mar 23, 2017 at 8:00 PM, Cong Wang  wrote:
> > On Thu, Mar 23, 2017 at 9:06 AM, Dmitry Vyukov  wrote:
> >> kasan: CONFIG_KASAN_INLINE enabled
> >> kasan: GPF could be caused by NULL-ptr deref or user memory access
> >> general protection fault:  [#1] SMP KASAN
> >> Dumping ftrace buffer:
> >>(ftrace buffer empty)
> >> Modules linked in:
> >> CPU: 2 PID: 12732 Comm: syz-executor6 Not tainted 4.11.0-rc3+ #365
> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 
> >> 01/01/2011
> >> task: 880062b7a2c0 task.stack: 88003348
> >> RIP: 0010:qdisc_hash_add.part.19+0xb6/0x3c0 net/sched/sch_api.c:280
> >> RSP: 0018:880033487820 EFLAGS: 00010202
> >> RAX: dc00 RBX: 85357e00 RCX: c90002b24000
> >> RDX: 007a RSI: 835a523a RDI: 03d0
> >> RBP: 8800334878b8 R08: fbfff0a6afeb R09: fbfff0a6afeb
> >> R10: 0001 R11: fbfff0a6afea R12: 85357e48
> >> R13: 110006690f06 R14: 880033487890 R15: 
> >> FS:  7f68665d0700() GS:88006e20() 
> >> knlGS:
> >> CS:  0010 DS:  ES:  CR0: 80050033
> >> CR2: 004c2d44 CR3: 3c6f8000 CR4: 26e0
> >> Call Trace:
> >>  qdisc_hash_add+0x76/0x90 net/sched/sch_api.c:279
> >>  attach_default_qdiscs net/sched/sch_generic.c:798 [inline]
> >>  dev_activate+0x6ca/0x920 net/sched/sch_generic.c:829
> >>  __dev_open+0x25b/0x360 net/core/dev.c:1348
> >>  __dev_change_flags+0x159/0x3d0 net/core/dev.c:6460
> >>  dev_change_flags+0x88/0x140 net/core/dev.c:6525
> >>  dev_ifsioc+0x51f/0x9b0 net/core/dev_ioctl.c:254
> >>  dev_ioctl+0x1fe/0x1030 net/core/dev_ioctl.c:532
> >>  sock_do_ioctl+0x94/0xb0 net/socket.c:902
> >>  sock_ioctl+0x2c2/0x440 net/socket.c:993
> >>  vfs_ioctl fs/ioctl.c:45 [inline]
> >>  do_vfs_ioctl+0x1af/0x16d0 fs/ioctl.c:685
> >>  SYSC_ioctl fs/ioctl.c:700 [inline]
> >>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
> >>  entry_SYSCALL_64_fastpath+0x1f/0xc2
> >
> > The interesting part is why the NULL dereference is in
> > qdisc_hash_add(), since we have a check before calling
> > it:
> >
> > #ifdef CONFIG_NET_SCHED
> > if (dev->qdisc)
> > qdisc_hash_add(dev->qdisc);
> > #endif
> >
> >
> > When attach_one_default_qdisc() fails, we should trigger
> > the NULL pointer dereference bug at:
> >
> > atomic_inc(&dev->qdisc->refcnt);
>
> I think qdisc is not NULL, it's something _in_ qdisc that is NULL. The
> crash happens here:
>
> struct Qdisc *root = qdisc_dev(q)->qdisc;
>
> so it's probably device.



Looks like this bug came with commit 59cc1f61f09c
("net: sched: convert qdisc linked list to hashtable")

I would simply guard qdisc_hash_add()

(Against &noop_qdisc)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 
bcf49cd2278670197f2a7e9d4e9a62ae8d117468..2bb34b51cffdb05434a488b9f45c344d57868253
100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -276,6 +276,8 @@ static struct Qdisc *qdisc_match_from_root(struct
Qdisc *root, u32 handle)

 void qdisc_hash_add(struct Qdisc *q)
 {
+ if (q == &noop_qdisc)
+ return;
  if ((q->parent != TC_H_ROOT) && !(q->flags & TCQ_F_INGRESS)) {
  struct Qdisc *root = qdisc_dev(q)->qdisc;


Re: Extending socket timestamping API for NTP

2017-03-23 Thread Richard Cochran
On Thu, Mar 23, 2017 at 05:21:45PM +0100, Miroslav Lichvar wrote:
> A better approach might be a control message that would provide the
> original interface index together with the length of the packet, so
> the application could transpose the HW timestamp and map the HW
> interface to the PHC.

This sounds better than trying to auto-magically transpose and correct
for link speed.

BTW, isn't there already a control message for "original interface
index"?
 
> The two values could be saved in the skb_shared_info structure. Now
> my question is if they could be useful also for other things than
> timestamping

such as?

> and if it should be a new socket option which would work
> on any socket independently from timestamping, or if it should rather
> be a new flag for the SO_TIMESTAMPING option. If the latter, would it
> make sense to put them in the skb_shared_hwtstamps structure and
> modify all drivers to set the values when a HW timestamp is captured
> instead of adding more code to __netif_receive_skb_core() or similar?

This information is solely for a highly specialized NTP application.
No normal program would ever need this, AFAICT.  So, if possible,
getting the original frame length should be done in a way that doesn't
affect users that don't need it.

Thanks,
Richard


Re: net/sched: GPF in qdisc_hash_add

2017-03-23 Thread Dmitry Vyukov
On Thu, Mar 23, 2017 at 8:00 PM, Cong Wang  wrote:
> On Thu, Mar 23, 2017 at 9:06 AM, Dmitry Vyukov  wrote:
>> kasan: CONFIG_KASAN_INLINE enabled
>> kasan: GPF could be caused by NULL-ptr deref or user memory access
>> general protection fault:  [#1] SMP KASAN
>> Dumping ftrace buffer:
>>(ftrace buffer empty)
>> Modules linked in:
>> CPU: 2 PID: 12732 Comm: syz-executor6 Not tainted 4.11.0-rc3+ #365
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>> task: 880062b7a2c0 task.stack: 88003348
>> RIP: 0010:qdisc_hash_add.part.19+0xb6/0x3c0 net/sched/sch_api.c:280
>> RSP: 0018:880033487820 EFLAGS: 00010202
>> RAX: dc00 RBX: 85357e00 RCX: c90002b24000
>> RDX: 007a RSI: 835a523a RDI: 03d0
>> RBP: 8800334878b8 R08: fbfff0a6afeb R09: fbfff0a6afeb
>> R10: 0001 R11: fbfff0a6afea R12: 85357e48
>> R13: 110006690f06 R14: 880033487890 R15: 
>> FS:  7f68665d0700() GS:88006e20() knlGS:
>> CS:  0010 DS:  ES:  CR0: 80050033
>> CR2: 004c2d44 CR3: 3c6f8000 CR4: 26e0
>> Call Trace:
>>  qdisc_hash_add+0x76/0x90 net/sched/sch_api.c:279
>>  attach_default_qdiscs net/sched/sch_generic.c:798 [inline]
>>  dev_activate+0x6ca/0x920 net/sched/sch_generic.c:829
>>  __dev_open+0x25b/0x360 net/core/dev.c:1348
>>  __dev_change_flags+0x159/0x3d0 net/core/dev.c:6460
>>  dev_change_flags+0x88/0x140 net/core/dev.c:6525
>>  dev_ifsioc+0x51f/0x9b0 net/core/dev_ioctl.c:254
>>  dev_ioctl+0x1fe/0x1030 net/core/dev_ioctl.c:532
>>  sock_do_ioctl+0x94/0xb0 net/socket.c:902
>>  sock_ioctl+0x2c2/0x440 net/socket.c:993
>>  vfs_ioctl fs/ioctl.c:45 [inline]
>>  do_vfs_ioctl+0x1af/0x16d0 fs/ioctl.c:685
>>  SYSC_ioctl fs/ioctl.c:700 [inline]
>>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
>>  entry_SYSCALL_64_fastpath+0x1f/0xc2
>
> The interesting part is why the NULL dereference is in
> qdisc_hash_add(), since we have a check before calling
> it:
>
> #ifdef CONFIG_NET_SCHED
> if (dev->qdisc)
> qdisc_hash_add(dev->qdisc);
> #endif
>
>
> When attach_one_default_qdisc() fails, we should trigger
> the NULL pointer dereference bug at:
>
> atomic_inc(&dev->qdisc->refcnt);

I think qdisc is not NULL, it's something _in_ qdisc that is NULL. The
crash happens here:

struct Qdisc *root = qdisc_dev(q)->qdisc;

so it's probably device.


Re: netlink: NULL timer crash

2017-03-23 Thread David Miller
From: Eric Dumazet 
Date: Thu, 23 Mar 2017 09:00:58 -0700

> On Thu, 2017-03-23 at 07:53 -0700, Eric Dumazet wrote:
> 
>> Nice !
>> 
>> Looks like neigh->ops->solicit is NULL
> 
> Apparently we allow admins to do really stupid things with neighbours
> on tunnels.
> 
> Following patch should avoid the crash.
> 
> Anyone has better ideas ?

This is probably good enough for now, but you need to also handle
dn_neigh_ops.

Another way to solve this is to add a NULL method check to the
one spot where we invoke this method.  That clearly shows that
the method is optional.



Re: net/sched: GPF in qdisc_hash_add

2017-03-23 Thread Cong Wang
On Thu, Mar 23, 2017 at 9:06 AM, Dmitry Vyukov  wrote:
> kasan: CONFIG_KASAN_INLINE enabled
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault:  [#1] SMP KASAN
> Dumping ftrace buffer:
>(ftrace buffer empty)
> Modules linked in:
> CPU: 2 PID: 12732 Comm: syz-executor6 Not tainted 4.11.0-rc3+ #365
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> task: 880062b7a2c0 task.stack: 88003348
> RIP: 0010:qdisc_hash_add.part.19+0xb6/0x3c0 net/sched/sch_api.c:280
> RSP: 0018:880033487820 EFLAGS: 00010202
> RAX: dc00 RBX: 85357e00 RCX: c90002b24000
> RDX: 007a RSI: 835a523a RDI: 03d0
> RBP: 8800334878b8 R08: fbfff0a6afeb R09: fbfff0a6afeb
> R10: 0001 R11: fbfff0a6afea R12: 85357e48
> R13: 110006690f06 R14: 880033487890 R15: 
> FS:  7f68665d0700() GS:88006e20() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 004c2d44 CR3: 3c6f8000 CR4: 26e0
> Call Trace:
>  qdisc_hash_add+0x76/0x90 net/sched/sch_api.c:279
>  attach_default_qdiscs net/sched/sch_generic.c:798 [inline]
>  dev_activate+0x6ca/0x920 net/sched/sch_generic.c:829
>  __dev_open+0x25b/0x360 net/core/dev.c:1348
>  __dev_change_flags+0x159/0x3d0 net/core/dev.c:6460
>  dev_change_flags+0x88/0x140 net/core/dev.c:6525
>  dev_ifsioc+0x51f/0x9b0 net/core/dev_ioctl.c:254
>  dev_ioctl+0x1fe/0x1030 net/core/dev_ioctl.c:532
>  sock_do_ioctl+0x94/0xb0 net/socket.c:902
>  sock_ioctl+0x2c2/0x440 net/socket.c:993
>  vfs_ioctl fs/ioctl.c:45 [inline]
>  do_vfs_ioctl+0x1af/0x16d0 fs/ioctl.c:685
>  SYSC_ioctl fs/ioctl.c:700 [inline]
>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
>  entry_SYSCALL_64_fastpath+0x1f/0xc2

The interesting part is why the NULL dereference is in
qdisc_hash_add(), since we have a check before calling
it:

#ifdef CONFIG_NET_SCHED
if (dev->qdisc)
qdisc_hash_add(dev->qdisc);
#endif


When attach_one_default_qdisc() fails, we should trigger
the NULL pointer dereference bug at:

atomic_inc(&dev->qdisc->refcnt);


Re: [PATCH V8 1/3] irq: Add flags to request_percpu_irq function

2017-03-23 Thread Mark Rutland
Hi Daniel,

On Thu, Mar 23, 2017 at 06:42:01PM +0100, Daniel Lezcano wrote:
> In the next changes, we track the interrupts but we discard the timers as
> that does not make sense. The next interrupt on a timer is predictable.

Sorry, but I could not parse this. 

[...]

> diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
> index 9612b84..0f5ab4a 100644
> --- a/drivers/perf/arm_pmu.c
> +++ b/drivers/perf/arm_pmu.c
> @@ -661,7 +661,7 @@ static int cpu_pmu_request_irq(struct arm_pmu *cpu_pmu, 
> irq_handler_t handler)
>  
>   irq = platform_get_irq(pmu_device, 0);
>   if (irq > 0 && irq_is_percpu(irq)) {
> - err = request_percpu_irq(irq, handler, "arm-pmu",
> + err = request_percpu_irq(irq, 0, handler, "arm-pmu",
> 	 &hw_events->percpu_pmu);
>   if (err) {
>   pr_err("unable to request IRQ%d for ARM PMU counters\n",

Please Cc myself and Will Deacon when modifying the arm_pmu driver, as
per MAINTAINERS. I only spotted this patch by chance.

This conflicts with arm_pmu changes I have queued for v4.12 [1].

So, can we leave the prototype of request_percpu_irq() as-is?

Why not add a new request_percpu_irq_flags() function, and leave
request_percpu_irq() as a wrapper for that? e.g.

static inline int
request_percpu_irq(unsigned int irq, irq_handler_t handler,
   const char *devname, void __percpu *percpu_dev_id)
{
return request_percpu_irq_flags(irq, handler, devname,
percpu_dev_id, 0);
}

... that would avoid having to touch any non-timer driver for now.

[...]

> -request_percpu_irq(unsigned int irq, irq_handler_t handler,
> -const char *devname, void __percpu *percpu_dev_id);
> +request_percpu_irq(unsigned int irq, unsigned long flags,
> +irq_handler_t handler,  const char *devname,
> +void __percpu *percpu_dev_id);
>  

Looking at request_irq, the prototype is:

int __must_check
request_irq(unsigned int irq, irq_handler_t handler,
unsigned long flags, const char *name,
void *dev);

... surely it would be better to share the same argument order? i.e.

int __must_check
request_percpu_irq(unsigned int irq, irq_handler_t handler,
   unsigned long flags, const char *devname,
   void __percpu *percpu_dev_id);

Thanks,
Mark.

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm/perf/refactoring


Re: [PATCH 1/3] net: hns: avoid gcc-7.0.1 warning for uninitialized data

2017-03-23 Thread David Miller

Arnd, I only see 2 patches out of 3 and no series header posting.


Re: Extending socket timestamping API for NTP

2017-03-23 Thread Denny Page
[Resend as plain text for netdev]


> On Mar 23, 2017, at 09:21, Miroslav Lichvar  wrote:
> 
> After becoming a bit more familiar with the code I don't think this is
> a good idea anymore :). I suspect there would be a noticeable
> performance impact if each timestamped packet could trigger reading of
> the current link speed. If the value had to be cached it would make
> more sense to do it in the application.


I am very surprised at this. The application caching approach requires that
the application retrieve the value via a system call. The system call
overhead is huge in comparison to everything else. More importantly, the
application-cached value may be wrong. If the application takes a sample
every 5 seconds, there are 5 seconds of timestamps that can be wildly wrong.

At the driver level, if the speed check is done on packet receive, retrieving
the link speed is a single register read, which is a small overhead compared
with processing the timestamp. The alternative approach of caching still makes
more sense in the driver than in the application. The driver receives an
interrupt when negotiation happens, and it is trivial to cache the value at
that point. A value cached by the driver will always be correct. Implementing
it in the driver also allows hardware to provide the functionality where
available. Yes, there is only one chip that provides this currently, but if
there is sufficient demand others will appear. There is no way to take
advantage of this functionality unless it is handled by the driver.

I think it makes a lot of sense to leave this to the driver developer.
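
As a sketch of the driver-side caching argued for here (foo_priv and
cached_speed are hypothetical placeholders; phylib's adjust_link callback
genuinely runs on link (re)negotiation):

struct foo_priv {
	unsigned int cached_speed;	/* hypothetical cache field */
};

static void foo_adjust_link(struct net_device *ndev)
{
	struct foo_priv *priv = netdev_priv(ndev);
	struct phy_device *phydev = ndev->phydev;

	/* phylib calls adjust_link whenever the link renegotiates, so
	 * the cached speed stays correct without per-packet reads. */
	WRITE_ONCE(priv->cached_speed, phydev ? phydev->speed : 0);
}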

Re: [PATCH net-next] liquidio: allocate RX buffers in OOM conditions in PF and VF

2017-03-23 Thread Burla, Satananda
The 03/22/2017 19:37, David Miller wrote:
> From: Felix Manlunas 
> Date: Wed, 22 Mar 2017 11:31:13 -0700
> 
> > From: Satanand Burla 
> >
> > Add workqueue that is periodically run to try to allocate RX buffers in OOM
> > conditions in PF and VF.
> >
> > Signed-off-by: Satanand Burla 
> > Signed-off-by: Felix Manlunas 
> 
> Applied, but I'm really not so sure you want to poll these queue states
> 4 times a second all the time.
> 
> Why don't you trigger the workqueue when you actually get an allocation
> failure?
That is certainly a better option. We will incorporate that in the
coming series.
-- 
Thanks
Satanand
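
For reference, a minimal sketch of the on-demand refill suggested above
(rx_ring, alloc_rx_buffer and post_rx_buffer are hypothetical placeholders,
not the driver's real symbols):

struct rx_ring {
	struct delayed_work refill_work;	/* hypothetical per-ring work */
};

static bool rx_try_refill(struct rx_ring *ring)
{
	void *buf = alloc_rx_buffer(ring);	/* hypothetical helper */

	if (!buf) {
		/* Retry later only because an allocation actually failed,
		 * instead of polling unconditionally four times a second. */
		schedule_delayed_work(&ring->refill_work, HZ / 4);
		return false;
	}
	post_rx_buffer(ring, buf);		/* hypothetical helper */
	return true;
}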


Re: [PATCH V8 1/3] irq: Add flags to request_percpu_irq function

2017-03-23 Thread Vineet Gupta
On 03/23/2017 10:42 AM, Daniel Lezcano wrote:
> In the next changes, we track the interrupts but we discard the timers as
> that does not make sense. The next interrupt on a timer is predictable.
>
> But, the API request_percpu_irq does not allow to pass a flag, hence 
> specifying
> if the interrupt type is a timer.
>
> Solve this by passing a 'flags' parameter to the function and change all the
> callers to pass IRQF_TIMER when the interrupt is a timer interrupt, zero
> otherwise.
>
> For now, in order to prevent a misusage of this parameter, only the IRQF_TIMER
> flag is a valid parameter to be passed to the request_percpu_irq function.
>
> Signed-off-by: Daniel Lezcano 

Acked-by: Vineet Gupta    # for arch/arc, arc_timer bits


Re: [PATCH net-next v4] net: Add sysctl to toggle early demux for tcp and udp

2017-03-23 Thread kbuild test robot
Hi Subash,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Subash-Abhinov-Kasiviswanathan/net-Add-sysctl-to-toggle-early-demux-for-tcp-and-udp/20170323-205131
config: arm-allmodconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget 
https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=arm 

All errors (new ones prefixed by >>):

   net/built-in.o: In function `proc_tcp_early_demux':
>> ncsi-manage.c:(.text+0xdffd4): undefined reference to 
>> `tcp_v6_early_demux_configure'
   net/built-in.o: In function `proc_udp_early_demux':
>> ncsi-manage.c:(.text+0xe0040): undefined reference to 
>> `udp_v6_early_demux_configure'

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: [v2,net-next,1/3] net: stmmac: enable multiple buffers

2017-03-23 Thread Thierry Reding
On Thu, Mar 23, 2017 at 05:27:08PM +, Joao Pinto wrote:
> Hi Thierry,
> 
> Às 5:17 PM de 3/23/2017, Thierry Reding escreveu:
> > On Fri, Mar 17, 2017 at 04:11:05PM +, Joao Pinto wrote:
> >> This patch creates 2 new structures (stmmac_tx_queue and stmmac_rx_queue)
> >> in include/linux/stmmac.h, enabling that each RX and TX queue has its
> >> own buffers and data.
> >>
> >> Signed-off-by: Joao Pinto <jpi...@synopsys.com>
> >> ---
> >> changes v1->v2:
> >> - just to keep up version
> >>
> >>  drivers/net/ethernet/stmicro/stmmac/chain_mode.c  |   45 +-
> >>  drivers/net/ethernet/stmicro/stmmac/ring_mode.c   |   46 +-
> >>  drivers/net/ethernet/stmicro/stmmac/stmmac.h  |   49 +-
> >>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 1306 
> >> ++---
> >>  4 files changed, 973 insertions(+), 473 deletions(-)
> > 
> > Hi Joao,
> > 
> > This seems to break support on Tegra186 again. I've gone through this
> > patch multiple times and I can't figure out what could be causing it.
> > Any ideas?
> > 
> > What I'm seeing is that the transmit queue 0 times out:
> > 
> > [  101.121774] Sending DHCP requests ...
> > [  111.841763] NETDEV WATCHDOG: eth0 (dwc-eth-dwmac): transmit queue 0 
> > timed out
> 
> You are using a GMAC or GMAC4 aka QoS?

Yes. It's called EQOS (or EAVB) on Tegra186.

> > and then I also see this:
> > 
> > [  112.252024] dwc-eth-dwmac 249.ethernet: DMA-API: device driver 
> > tries to free DMA memory it has not allocated [device 
> > address=0x57ac6e9d] [size=0 bytes]
> 
> Humm... Something in stmmac_free_tx_buffers... I'll need to check.
> 
> > [  112.266606] [ cut here ]
> > [  112.271220] WARNING: CPU: 0 PID: 0 at 
> > /home/thierry.reding/src/kernel/linux-tegra.git/lib/dma-debug.c:1106 
> > check_unmap+0x7b0/0x930
> > [  112.282934] Modules linked in:
> > [  112.285985]
> > [  112.287474] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G S  W   
> > 4.11.0-rc3-next-20170323-00060-g2eab4557749b-dirty #400
> > [  112.298581] Hardware name: NVIDIA Tegra186 P2771- Development 
> > Board (DT)
> > [  112.305615] task: 08f87b00 task.stack: 08f7
> > [  112.311523] PC is at check_unmap+0x7b0/0x930
> > [  112.315785] LR is at check_unmap+0x7b0/0x930
> > [  112.320046] pc : [] lr : [] 
> > pstate: 6145
> > [  112.327426] sp : 8001f5e50c50
> > [  112.330733] x29: 8001f5e50c50 x28: 08f75180
> > [  112.336042] x27: 08f87b00 x26: 0020
> > [  112.341351] x25: 0140 x24: 08f81000
> > [  112.346660] x23: 8001ec4b0810 x22: 57ac6e9d
> > [  112.351969] x21: 57ac6e9d x20: 8001f5e50cb0
> > [  112.357277] x19: 8001ec4b0810 x18: 0010
> > [  112.362586] x17: 262ea01f x16: 0f48bf67
> > [  112.367895] x15: 0006 x14: 5d64396536636137
> > [  112.373203] x13: 3530303030303030 x12: 3078303d73736572
> > [  112.378511] x11: 6464612065636976 x10: 65645b2064657461
> > [  112.383819] x9 : 0852c238 x8 : 01fb
> > [  112.389126] x7 :  x6 : 0810ad58
> > [  112.394434] x5 :  x4 : 
> > [  112.399743] x3 :  x2 : 08f99258
> > [  112.405050] x1 : 08f87b00 x0 : 0097
> > [  112.410358]
> > [  112.411846] ---[ end trace 48028f96a0e990fb ]---
> > [  112.416453] Call trace:
> > [  112.418895] Exception stack(0x8001f5e50a80 to 0x8001f5e50bb0)
> > [  112.425324] 0a80: 8001ec4b0810 0001 8001f5e50c50 
> > 083d75f0
> > [  112.433139] 0aa0: 01c0   
> > 08d1c0c0
> > [  112.440954] 0ac0: 8001f5e50c50 8001f5e50c50 8001f5e50c10 
> > ffc8
> > [  112.448769] 0ae0: 8001f5e50b10 0810c3a8 8001f5e50c50 
> > 8001f5e50c50
> > [  112.456585] 0b00: 8001f5e50c10 ffc8 8001f5e50bc0 
> > 08178388
> > [  112.464399] 0b20: 0097 08f87b00 08f99258 
> > 
> > [  112.472215] 0b40:   0810ad58 
> > 
> > [  112.480030] 0b60: 01fb 0852c238 656

Re: net/kcm: double free of kcm inode

2017-03-23 Thread Cong Wang
On Thu, Mar 23, 2017 at 5:09 AM, Dmitry Vyukov  wrote:
> Hello,
>
> I've got the following report while running syzkaller fuzzer. Note the
> preceding kmem_cache_alloc injected failure, it's most likely the root
> cause.
>
> FAULT_INJECTION: forcing a failure.
> name failslab, interval 1, probability 0, space 0, times 0
> CPU: 1 PID: 21839 Comm: syz-executor4 Not tainted 4.11.0-rc3+ #364
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:16 [inline]
>  dump_stack+0x1b8/0x28d lib/dump_stack.c:52
>  fail_dump lib/fault-inject.c:45 [inline]
>  should_fail+0x78a/0x870 lib/fault-inject.c:154
>  should_failslab+0xec/0x120 mm/failslab.c:31
>  slab_pre_alloc_hook mm/slab.h:434 [inline]
>  slab_alloc mm/slab.c:3394 [inline]
>  kmem_cache_alloc+0x200/0x720 mm/slab.c:3570
>  sk_prot_alloc+0x65/0x2a0 net/core/sock.c:1331
>  sk_alloc+0x8c/0x710 net/core/sock.c:1393
>  kcm_clone net/kcm/kcmsock.c:1655 [inline]
>  kcm_ioctl+0xb65/0x17e0 net/kcm/kcmsock.c:1713
>  sock_do_ioctl+0x65/0xb0 net/socket.c:895
>  sock_ioctl+0x2c2/0x440 net/socket.c:993
>  vfs_ioctl fs/ioctl.c:45 [inline]
>  do_vfs_ioctl+0x1af/0x16d0 fs/ioctl.c:685
>  SYSC_ioctl fs/ioctl.c:700 [inline]
>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
>  entry_SYSCALL_64_fastpath+0x1f/0xc2

I don't know if this patch could fix this bug or not:
https://patchwork.ozlabs.org/patch/742860/

This is why I don't add your Reported-by. But it could be related.

Thanks.

> RIP: 0033:0x445b79
> RSP: 002b:7f05eb28e858 EFLAGS: 0286 ORIG_RAX: 0010
> RAX: ffda RBX: 00708000 RCX: 00445b79
> RDX: 20001000 RSI: 89e2 RDI: 0005
> RBP: 0086 R08:  R09: 
> R10:  R11: 0286 R12: 004a7e31
> R13:  R14: 7f05eb28e618 R15: 7f05eb28e788
> ==
> BUG: KASAN: use-after-free in __fput+0x6b0/0x7f0 fs/file_table.c:211
> at addr 880037a25670
> Read of size 2 by task syz-executor4/21839
> CPU: 1 PID: 21839 Comm: syz-executor4 Not tainted 4.11.0-rc3+ #364
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:16 [inline]
>  dump_stack+0x1b8/0x28d lib/dump_stack.c:52
>  kasan_object_err+0x1c/0x70 mm/kasan/report.c:166
>  print_address_description mm/kasan/report.c:210 [inline]
>  kasan_report_error mm/kasan/report.c:294 [inline]
>  kasan_report.part.2+0x1be/0x480 mm/kasan/report.c:316
>  kasan_report mm/kasan/report.c:335 [inline]
>  __asan_report_load2_noabort+0x29/0x30 mm/kasan/report.c:335
>  __fput+0x6b0/0x7f0 fs/file_table.c:211
>  fput+0x15/0x20 fs/file_table.c:245
>  task_work_run+0x1a4/0x270 kernel/task_work.c:116
>  tracehook_notify_resume include/linux/tracehook.h:191 [inline]
>  exit_to_usermode_loop+0x24d/0x2d0 arch/x86/entry/common.c:161
>  prepare_exit_to_usermode arch/x86/entry/common.c:191 [inline]
>  syscall_return_slowpath+0x3bd/0x460 arch/x86/entry/common.c:260
>  entry_SYSCALL_64_fastpath+0xc0/0xc2
> RIP: 0033:0x445b79
> RSP: 002b:7f05eb28e858 EFLAGS: 0286 ORIG_RAX: 0010
> RAX: fff4 RBX: 00708000 RCX: 00445b79
> RDX: 20001000 RSI: 89e2 RDI: 0005
> RBP: 2170 R08:  R09: 
> R10:  R11: 0286 R12: 006e0230
> R13: 89e2 R14: 20001000 R15: 0005
> Object at 880037a25640, in cache sock_inode_cache size: 944
> Allocated:
> PID = 21839
>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:517
>  set_track mm/kasan/kasan.c:529 [inline]
>  kasan_kmalloc+0xbc/0xf0 mm/kasan/kasan.c:620
>  kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:559
>  kmem_cache_alloc+0x110/0x720 mm/slab.c:3572
>  sock_alloc_inode+0x70/0x300 net/socket.c:250
>  alloc_inode+0x65/0x180 fs/inode.c:207
>  new_inode_pseudo+0x69/0x190 fs/inode.c:889
>  sock_alloc+0x41/0x270 net/socket.c:565
>  kcm_clone net/kcm/kcmsock.c:1634 [inline]
>  kcm_ioctl+0x990/0x17e0 net/kcm/kcmsock.c:1713
>  sock_do_ioctl+0x65/0xb0 net/socket.c:895
>  sock_ioctl+0x2c2/0x440 net/socket.c:993
>  vfs_ioctl fs/ioctl.c:45 [inline]
>  do_vfs_ioctl+0x1af/0x16d0 fs/ioctl.c:685
>  SYSC_ioctl fs/ioctl.c:700 [inline]
>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
>  entry_SYSCALL_64_fastpath+0x1f/0xc2
> Freed:
> PID = 21839
>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:517
>  set_track mm/kasan/kasan.c:529 [inline]
>  kasan_slab_free+0x81/0xc0 mm/kasan/kasan.c:593
>  __cache_free mm/slab.c:3514 [inline]
>  kmem_cache_free+0x71/0x240 mm/slab.c:3774
>  sock_destroy_inode+0x56/0x70 net/socket.c:280
>  destroy_inode+0x15d/0x200 fs/inode.c:264
>  evict+0x57e/0x920 fs/inode.c:570
>  

[Patch net] kcm: return immediately after copy_from_user() failure

2017-03-23 Thread Cong Wang
There is no reason to continue after a copy_from_user()
failure.

Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module")
Cc: Tom Herbert 
Signed-off-by: Cong Wang 
---
 net/kcm/kcmsock.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index 309062f..31762f7 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -1687,7 +1687,7 @@ static int kcm_ioctl(struct socket *sock, unsigned int 
cmd, unsigned long arg)
struct kcm_attach info;
 
	if (copy_from_user(&info, (void __user *)arg, sizeof(info)))
-   err = -EFAULT;
+   return -EFAULT;
 
	err = kcm_attach_ioctl(sock, &info);
 
@@ -1697,7 +1697,7 @@ static int kcm_ioctl(struct socket *sock, unsigned int 
cmd, unsigned long arg)
struct kcm_unattach info;
 
	if (copy_from_user(&info, (void __user *)arg, sizeof(info)))
-   err = -EFAULT;
+   return -EFAULT;
 
	err = kcm_unattach_ioctl(sock, &info);
 
@@ -1708,7 +1708,7 @@ static int kcm_ioctl(struct socket *sock, unsigned int 
cmd, unsigned long arg)
struct socket *newsock = NULL;
 
	if (copy_from_user(&info, (void __user *)arg, sizeof(info)))
-   err = -EFAULT;
+   return -EFAULT;
 
	err = kcm_clone(sock, &info, &newsock);
 
-- 
2.5.5



[PATCH V8 1/3] irq: Add flags to request_percpu_irq function

2017-03-23 Thread Daniel Lezcano
In the next changes, we track the interrupts but we discard the timers as
that does not make sense. The next interrupt on a timer is predictable.

But, the API request_percpu_irq does not allow to pass a flag, hence specifying
if the interrupt type is a timer.

Solve this by passing a 'flags' parameter to the function and change all the
callers to pass IRQF_TIMER when the interrupt is a timer interrupt, zero
otherwise.

For now, in order to prevent a misusage of this parameter, only the IRQF_TIMER
flag is a valid parameter to be passed to the request_percpu_irq function.

Signed-off-by: Daniel Lezcano 
---
 arch/arc/kernel/perf_event.c |  2 +-
 arch/arc/kernel/smp.c|  2 +-
 arch/arm/kernel/smp_twd.c|  3 ++-
 arch/arm/xen/enlighten.c |  2 +-
 drivers/clocksource/arc_timer.c  |  2 +-
 drivers/clocksource/arm_arch_timer.c | 15 +--
 drivers/clocksource/arm_global_timer.c   |  2 +-
 drivers/clocksource/exynos_mct.c |  2 +-
 drivers/clocksource/qcom-timer.c |  2 +-
 drivers/clocksource/time-armada-370-xp.c |  2 +-
 drivers/clocksource/timer-nps.c  |  2 +-
 drivers/net/ethernet/marvell/mvneta.c|  2 +-
 drivers/perf/arm_pmu.c   |  2 +-
 include/linux/interrupt.h|  5 +++--
 kernel/irq/manage.c  | 11 ---
 virt/kvm/arm/arch_timer.c|  5 +++--
 virt/kvm/arm/vgic/vgic-init.c|  2 +-
 17 files changed, 37 insertions(+), 26 deletions(-)

diff --git a/arch/arc/kernel/perf_event.c b/arch/arc/kernel/perf_event.c
index 2ce24e7..2a90c7a 100644
--- a/arch/arc/kernel/perf_event.c
+++ b/arch/arc/kernel/perf_event.c
@@ -525,7 +525,7 @@ static int arc_pmu_device_probe(struct platform_device 
*pdev)
arc_pmu->irq = irq;
 
/* intc map function ensures irq_set_percpu_devid() called */
-   request_percpu_irq(irq, arc_pmu_intr, "ARC perf counters",
+   request_percpu_irq(irq, 0, arc_pmu_intr, "ARC perf counters",
		   this_cpu_ptr(&arc_pmu_cpu));
 
	on_each_cpu(arc_cpu_pmu_irq_init, &irq, 1);
diff --git a/arch/arc/kernel/smp.c b/arch/arc/kernel/smp.c
index f462671..5cdd3c9 100644
--- a/arch/arc/kernel/smp.c
+++ b/arch/arc/kernel/smp.c
@@ -381,7 +381,7 @@ int smp_ipi_irq_setup(int cpu, irq_hw_number_t hwirq)
if (!cpu) {
int rc;
 
-   rc = request_percpu_irq(virq, do_IPI, "IPI Interrupt", dev);
+   rc = request_percpu_irq(virq, 0, do_IPI, "IPI Interrupt", dev);
if (rc)
panic("Percpu IRQ request failed for %u\n", virq);
}
diff --git a/arch/arm/kernel/smp_twd.c b/arch/arm/kernel/smp_twd.c
index 895ae51..988f9b9 100644
--- a/arch/arm/kernel/smp_twd.c
+++ b/arch/arm/kernel/smp_twd.c
@@ -332,7 +332,8 @@ static int __init twd_local_timer_common_register(struct 
device_node *np)
goto out_free;
}
 
-   err = request_percpu_irq(twd_ppi, twd_handler, "twd", twd_evt);
+   err = request_percpu_irq(twd_ppi, IRQF_TIMER, twd_handler, "twd",
+twd_evt);
if (err) {
pr_err("twd: can't register interrupt %d (%d)\n", twd_ppi, err);
goto out_free;
diff --git a/arch/arm/xen/enlighten.c b/arch/arm/xen/enlighten.c
index 81e3217..2897f94 100644
--- a/arch/arm/xen/enlighten.c
+++ b/arch/arm/xen/enlighten.c
@@ -400,7 +400,7 @@ static int __init xen_guest_init(void)
 
xen_init_IRQ();
 
-   if (request_percpu_irq(xen_events_irq, xen_arm_callback,
+   if (request_percpu_irq(xen_events_irq, 0, xen_arm_callback,
   "events", _vcpu)) {
pr_err("Error request IRQ %d\n", xen_events_irq);
return -EINVAL;
diff --git a/drivers/clocksource/arc_timer.c b/drivers/clocksource/arc_timer.c
index 7517f95..e78e306 100644
--- a/drivers/clocksource/arc_timer.c
+++ b/drivers/clocksource/arc_timer.c
@@ -301,7 +301,7 @@ static int __init arc_clockevent_setup(struct device_node 
*node)
}
 
/* Needs apriori irq_set_percpu_devid() done in intc map function */
-   ret = request_percpu_irq(arc_timer_irq, timer_irq_handler,
+   ret = request_percpu_irq(arc_timer_irq, IRQF_TIMER, timer_irq_handler,
 "Timer0 (per-cpu-tick)", evt);
if (ret) {
pr_err("clockevent: unable to request irq\n");
diff --git a/drivers/clocksource/arm_arch_timer.c 
b/drivers/clocksource/arm_arch_timer.c
index 7a8a411..11398ff 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -768,16 +768,19 @@ static int __init arch_timer_register(void)
ppi = arch_timer_ppi[arch_timer_uses_ppi];
switch (arch_timer_uses_ppi) {
case VIRT_PPI:
-   err = request_percpu_irq(ppi, 

[PATCH net-next 3/3] net: systemport: Simplify circular pointer arithmetic

2017-03-23 Thread Florian Fainelli
Similar to c298ede2fe21 ("net: bcmgenet: simplify circular pointer
arithmetic"), we don't need the complex arithmetic since we always have a
ring size that is a power of 2.
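
The masked subtraction works because the indices live in unsigned modular
arithmetic; a standalone demo (RING_SIZE is an assumed example value, not
the hardware's):

#include <assert.h>

#define RING_SIZE 2048u			/* must be a power of 2 */
#define RING_MASK (RING_SIZE - 1u)

/* Wrap-safe distance from consumer index c to producer index p:
 * (p - c) wraps modulo 2^32 and the mask reduces it modulo the
 * ring size. */
static unsigned int ring_distance(unsigned int p, unsigned int c)
{
	return (p - c) & RING_MASK;
}

int main(void)
{
	assert(ring_distance(10, 5) == 5);		/* no wrap */
	assert(ring_distance(3, RING_SIZE - 2) == 5);	/* wrapped */
	return 0;
}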

Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/broadcom/bcmsysport.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index c915bcfae0af..61e26c6b26ab 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -682,11 +682,7 @@ static unsigned int bcm_sysport_desc_rx(struct 
bcm_sysport_priv *priv,
p_index = rdma_readl(priv, RDMA_CONS_INDEX);
p_index &= RDMA_PROD_INDEX_MASK;
 
-   if (p_index < priv->rx_c_index)
-   to_process = (RDMA_CONS_INDEX_MASK + 1) -
-   priv->rx_c_index + p_index;
-   else
-   to_process = p_index - priv->rx_c_index;
+   to_process = (p_index - priv->rx_c_index) & RDMA_CONS_INDEX_MASK;
 
netif_dbg(priv, rx_status, ndev,
  "p_index=%d rx_c_index=%d to_process=%d\n",
-- 
2.9.3


