RE: [PATCHSET] printk, netconsole: implement reliable netconsole

2015-04-20 Thread David Laight
From: Of Rob Landley
 Sent: 19 April 2015 08:25
 On Thu, Apr 16, 2015 at 6:03 PM, Tejun Heo t...@kernel.org wrote:
  In a lot of configurations, netconsole is a useful way to collect
  system logs; however, all netconsole does is simply emitting UDP
  packets for the raw messages and there's no way for the receiver to
  find out whether the packets were lost and/or reordered in flight.
 
 Except a modern nonsaturated LAN shouldn't do that.
 
 If you have two machines plugged into a hub, and that's _all_ that's
 plugged in, packets should never get dropped. The original use case of
 netconsole was that the sender and the receiver were plugged into the
 same router.
 
 However, even on a quite active LAN the days of ethernet doing CSMA/CD
 collisions requiring retransmits are long gone, even 100baseT routers have been
 caching and retransmitting data internally so each connection can go
 at the full 11 megabytes/second with the backplane running fast enough
 to keep them all active at the same time. (That's why it's so hard to
 find a _hub_ anymore, it's all routers
...

Most machines are plugged into switches (not routers), many of them
will send 'pause' frames which the host mac may act on.
In which case packet loss is not expected (unless you have broadcast storms
when all bets are off).

Additionally, within a local network you shouldn't really get any packet
loss since no segments should be 100% loaded.
So for testing it is not unreasonable to expect no lost packets in netconsole
traffic.

David




RE: [PATCH v3] Renesas Ethernet AVB driver

2015-04-24 Thread David Laight
From: Sergei Shtylyov
 Sent: 22 April 2015 22:39
 On 04/22/2015 11:42 PM, David Miller wrote:
 
  Hmm, I've been digging in the net core, and was unable to see where TX
  skb's get their NET_IP_ALIGN bytes reserved. Have I missed something?
  Probably need to print out skb's fields...
 
  NET_IP_ALIGN is for receive, not transmit.
 
 But when I print 'skb->data' from the ndo_start_xmit() method (in the
 'sh_eth' driver), all addresses end with 2, so it looks like NET_IP_ALIGN gets
 added somewhere...

For a locally generated message:
The TCP userdata is likely to be 4 byte aligned.
The TCP and IP headers are multiples of 4 bytes.
The MAC header is 14 bytes.
So you end up with a buffer that starts on a 4n+2 boundary or an initial
short fragment that is 4n+2 bytes long.
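
For illustration, a minimal standalone sketch of that arithmetic (example
address only, assuming 20-byte TCP and IP headers with no options):

	#include <stdio.h>

	int main(void)
	{
		unsigned long payload = 0x1000;                /* 4-byte aligned TCP user data */
		unsigned long frame = payload - 20 - 20 - 14;  /* TCP + IP + MAC headers */

		printf("frame start mod 4 = %lu\n", frame % 4); /* prints 2 */
		return 0;
	}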

If a message is being forwarded the alignment probably depends on where
it came from.

If you have ethernet hardware that requires tx or rx buffers to be on
4n boundaries you should send it back as 'not fit for purpose'.

David



RE: [Y2038] [PATCH 04/11] posix timers:Introduce the 64bit methods with timespec64 type for k_clock structure

2015-04-22 Thread David Laight
From: Thomas Gleixner
 Sent: 22 April 2015 09:45
 On Tue, 21 Apr 2015, Thomas Gleixner wrote:
  On Tue, 21 Apr 2015, Arnd Bergmann wrote:
   I know there are concerns about this, in particular because C11 and
    POSIX both require tv_nsec to be 'long', unlike timeval->tv_usec,
   which is a 'suseconds_t' and can be defined as 'long long'.
  
   a)
  
   struct timespec {
 time_t tv_sec;
 long long tv_nsec; /* or typedef long long snseconds_t */
   };
  
   This is not directly compatible with C11 or POSIX.1-2008, but it
   matches what we do inside of 64-bit kernels, so probably has the
   highest chance of working correctly in practice
 
  After reading Linus rant in the x32 thread again (thanks for the
  reminder), and looking at b/c/d - which rate between ugly and butt
  ugly - I think we should go for a) and screw POSIX and C11 as those
  committee dinosaurs seem to completely ignore the 2038 problem on
  32bit machines. At least I have not found any hint that these folks
  care at all. So why should we comply to something which is completely
  useless?
 
  That also makes the question about the upper 32bits check moot, so
  it's the simplest and clearest of the possible solutions.
 
 Second thoughts after some sleep.
 
 So the outcome of this is going to be that user space libraries will
 not expose the syscall variant of
 
 syscall_timespec64 {
s64 tv_sec;
  s64 tv_nsec;
 };
 
 to applications. The libs will translate them to spec conforming
 
timespec {
time_t tv_sec;
  long   tv_nsec;
};
 
 anyway. That means we have two translation steps on 32bit systems:
 
    1) user space timespec -> syscall timespec64
 
    2) syscall timespec64 -> scalar nsec s64 (ktime_t)
 
 and the other way round. The kernel internal representation is simply
 s64 (nsec) based all over the place.

Do you need the double-translation?
If all the kernel uses a 64bit nsec value the in-kernel syscall stub
can convert the user-supplied values appropriately before calling
the standard function.
Not that a syscall that takes a linear nsec value isn't useful.
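
As a rough sketch of that single-step conversion (hypothetical type and
function names, not the actual y2038 patches):

	#include <stdint.h>

	struct user_timespec32 {        /* what 32-bit userland passes today */
		int32_t tv_sec;
		int32_t tv_nsec;
	};

	/* convert straight to the kernel-internal 64-bit nanosecond count */
	static int64_t timespec32_to_ns(const struct user_timespec32 *ts)
	{
		return (int64_t)ts->tv_sec * 1000000000LL + ts->tv_nsec;
	}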

FWIW I can't remember what NetBSD did when they extended time_t to 64bits.

David



RE: [PATCH iproute2 1/3] tc: fix compilation warning on 32bits arch

2015-04-21 Thread David Laight
From: Nicolas Dichtel
 Sent: 21 April 2015 17:07
 The warning was:
 m_simple.c: In function 'parse_simple':
 m_simple.c:142:4: warning: format '%ld' expects argument of type 'long int',
 but argument 3 has type 'size_t' [-Wformat]
 
 Useful to be able to compile with -Werror.
 
 Signed-off-by: Nicolas Dichtel nicolas.dich...@6wind.com
 ---
  tc/m_simple.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/tc/m_simple.c b/tc/m_simple.c
 index 866552f559b3..3b6d7beb769c 100644
 --- a/tc/m_simple.c
 +++ b/tc/m_simple.c
 @@ -139,7 +139,7 @@ parse_simple(struct action_util *a, int *argc_p, char 
 ***argv_p, int tca_id,
 
   if (strlen(simpdata) > (SIMP_MAX_DATA - 1)) {
   fprintf(stderr, "simple: Illegal string len %ld <%s> \n",
 - strlen(simpdata), simpdata);
 + (long)strlen(simpdata), simpdata);
   return -1;

Isn't the correct fix to use %zu ?
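
For reference, a small standalone illustration of the %zu form (hypothetical
string, not the iproute2 code):

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		const char *simpdata = "example";

		/* %zu is the printf conversion for size_t, so no cast is needed */
		fprintf(stderr, "simple: Illegal string len %zu <%s>\n",
			strlen(simpdata), simpdata);
		return 0;
	}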

David



RE: [PATCH v3] Renesas Ethernet AVB driver

2015-04-27 Thread David Laight
From: Sergei Shtylyov 
 Sent: 24 April 2015 19:27
...
  If you have ethernet hardware that requires tx or rx buffers to be on
  4n boundaries you should send it back as 'not fit for purpose'.
 
 The RX buffers can be adjusted with skb_reserve(), it's only the TX
 buffers that need to be copied...

If the processor can't perform misaligned reads (don't know what is on
your SoC, but I suspect it can't - crossing page boundaries is hard)
then the rx buffer will have to be re-aligned in software.
Even the 'userdata' part will typically end up with an expensive
misaligned buffer copy.

Even on x86 the misaligned transfers are probably measurable.

 I'm afraid we can't. :-)
 However, my colleague has suggested a scheme minimizing the copying:
 only up to 3 first bytes need to be copied to the driver's internal buffers,
 the rest can be sent from an skb itself. That would require substantial
 changes to the driver though...

There might be a restriction on the length of buffer fragments.

You might be able to alternate 14 and 1500+ byte receive buffers.
The frame following a slightly overlong one would be 'wrong'.

David


RE: [PATCH 1/4] ozwpan: Use proper check to prevent heap overflow

2015-05-15 Thread David Laight
From: Jason A. Donenfeld
 Sent: 13 May 2015 19:34
 Since elt-length is a u8, we can make this variable a u8. Then we can
 do proper bounds checking more easily. Without this, a potentially
 negative value is passed to the memcpy inside oz_hcd_get_desc_cnf,
 resulting in a remotely exploitable heap overflow with network
 supplied data.
...
 diff --git a/drivers/staging/ozwpan/ozusbsvc1.c 
 b/drivers/staging/ozwpan/ozusbsvc1.c
 index d434d8c..cd6c63e 100644
 --- a/drivers/staging/ozwpan/ozusbsvc1.c
 +++ b/drivers/staging/ozwpan/ozusbsvc1.c
 @@ -390,8 +390,10 @@ void oz_usb_rx(struct oz_pd *pd, struct oz_elt *elt)
   case OZ_GET_DESC_RSP: {
   struct oz_get_desc_rsp *body =
   (struct oz_get_desc_rsp *)usb_hdr;
 - int data_len = elt->length -
 + u8 data_len = elt->length -
   sizeof(struct oz_get_desc_rsp) + 1;
 + if (data_len > elt->length)
 + break;

Why not just check the length, eg:
	unsigned int data_len = elt->length;
	if (data_len < sizeof(struct oz_get_desc_rsp) + 1)
		break;

   u16 offs = le16_to_cpu(get_unaligned(&body->offset));
   u16 total_size =
   le16_to_cpu(get_unaligned(&body->total_size));

Don't put variable definitions after code.

You don't really want to do arithmetic on local variables that are
smaller than a machine word (eg u8 and u16), doing so can require
the compiler to generate a lot more code.
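
A small standalone illustration of that point (generic code, not from the
ozwpan driver): the u8 accumulator forces repeated truncation to 8 bits,
which on many targets means extra masking or extension instructions compared
with a machine-word accumulator.

	#include <stdint.h>

	uint8_t sum_u8(const uint8_t *p, int n)
	{
		uint8_t s = 0;          /* forces repeated 8-bit truncation */
		for (int i = 0; i < n; i++)
			s += p[i];
		return s;
	}

	unsigned int sum_uint(const uint8_t *p, int n)
	{
		unsigned int s = 0;     /* machine-word arithmetic */
		for (int i = 0; i < n; i++)
			s += p[i];
		return s;
	}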

David



RE: [PATCH 1/7] net: refactor __netif_receive_skb_core

2015-04-15 Thread David Laight
From: Alexander Duyck
 Sent: 10 April 2015 20:56
 On 04/10/2015 05:15 AM, Pablo Neira Ayuso wrote:
  +another_round:
  +   ret = __netif_receive_skb_ingress(skb, pfmemalloc, orig_dev);
  +   switch (ret) {
  +   case NET_RX_SUCCESS:
  +   case NET_RX_DROP:
  +   break;
  +   case __NET_RX_ANOTHER_ROUND:
  +   goto another_round;
  +   }
  rcu_read_unlock();
  +
  return ret;
   }
 
 
 
 Couldn't this just be done as a do while?  It would probably be easier
 to read and there wouldn't be any need for the another_round label anymore.

Or an infinite loop with a break at the bottom, as in:
	for (;;) {
		switch (...) {
		case again:
			continue;
		default:
			break;
		}
		break;
	}

David




RE: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen

2015-04-16 Thread David Laight
From: George Dunlap
 Sent: 16 April 2015 09:56
 On 04/15/2015 07:19 PM, Eric Dumazet wrote:
  On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote:
 
  Maybe you should stop wasting all of our time and just tell us what
  you're thinking.
 
  I think you make me wasting my time.
 
  I already gave all the hints in prior discussions.
 
 Right, and I suggested these two options:
 
 Obviously one solution would be to allow the drivers themselves to set
 the tcp_limit_output_bytes, but that seems like a maintenance
 nightmare.
 
 Another simple solution would be to allow drivers to indicate whether
 they have a high transmit latency, and have the kernel use a higher
 value by default when that's the case. [1]
 
 Neither of which you commented on.  Instead you pointed me to a comment
 that only partially described what the limitations were. (I.e., it
 described the two packets or 1ms, but not how they related, nor how
 they related to the max of 2 64k packets outstanding of the default
 tcp_limit_output_bytes setting.)

ISTM that you are changing the wrong knob.
You need to change something that affects the global amount of pending tx data,
not the amount that can be buffered by a single connection.

If you change tcp_limit_output_bytes and then have 1000 connections trying
to send data you'll suffer 'bufferbloat'.

If you call skb_orphan() in the tx setup path then the total number of
buffers is limited, but a single connection can (and will) fill the tx
ring leading to incorrect RTT calculations and additional latency for
other connections.
This will give high single connection throughput but isn't ideal.

One possibility might be to call skb_orphan() when enough time has
elapsed since the packet was queued for transmit that it is very likely
to have actually been transmitted - even though 'transmit done' has
not yet been signalled.
Not at all sure how this would fit in though...

David




RE: [PATCH 15/17] switch kernel_sendmsg() and kernel_recvmsg() to iov_iter_kvec()

2015-04-15 Thread David Laight
From: Al Viro
 Sent: 14 April 2015 17:59
 On Tue, Apr 14, 2015 at 04:36:36PM +, David Laight wrote:
  From: Al Viro
   Sent: 14 April 2015 17:34
   On Tue, Apr 14, 2015 at 04:21:02PM +, David Laight wrote:
  
Massive NAK.
This breaks any code that is using msg_control to set SCTP parameters
when sending data.
  
  Huh?  ->sendmsg() expects ->msg_control already in kernel space;
    it's ->recvmsg() that plays silly buggers with userland pointers there.
 
  I read your commit message as implying that you hadn't found any
  users of kernel_sendmsg() that used msg_control.
  Not that the data was always read from kernel space.
 
 Sigh...  The situation is:
   * ->sendmsg() expects ->msg_control copied from userland.  sendmsg(2),
 sendto(2), etc. do that copying.  See ___sys_sendmsg() - there we have
	/*
	 * Careful! Before this, msg_sys->msg_control contains a user pointer.
	 * Afterwards, it will be a kernel pointer. Thus the compiler-assisted
	 * checking falls down on this.
	 */
	if (copy_from_user(ctl_buf,
			   (void __user __force *)msg_sys->msg_control,
			   ctl_len))
		goto out_freectl;
	msg_sys->msg_control = ctl_buf;
 As the result, ->sendmsg() instances access ->msg_control contents as normal
 kernel data.
   * ->recvmsg() expects ->msg_control to point to userland.  See
 net/core/scm.c for the helpers used to store into it.  recvmsg(2) et.al.
 simply leave the userland pointer there; worse, that pointer might be
 to native or to compat variants, and layouts _are_ different.  Thus those
 if (MSG_CMSG_COMPAT & msg->msg_flags) in net/core/scm.c...
   * kernel-side users of ->sendmsg() do not depend on setfs() for
 access to their ->msg_control, simply because ->sendmsg() won't be using
 copy_from_user()/get_user() to access it anyway.
   * kernel-side users of ->recvmsg() are less lucky - most of them
 don't give a damn either (they have NULL ->msg_control), but there's an
 exception (somewhere in sunrpc, IIRC).  So there we need to keep
 playing with setfs(), even though the data side would be just fine without
 that.

Apart from any other code that is using the interface.
I know you guys don't do anything to help out of tree code, but removing the
setfs() stuff from the kernel_recvmsg() code would break anything using sctp.
It shouldn't need some code lurking in sunrpc for you to leave the setfs().

In any case, how much does the setfs() cost?
I suspect it is just modifying a flag in 'current'.

A comment in kernel_recvmsg() saying that the setfs() is for msg_control
might be useful.
Then one in kernel_sendmsg() saying that setfs() isn't needed because
msg_control is always kernel - just to avoid any confusion.

David


RE: [PATCH 15/17] switch kernel_sendmsg() and kernel_recvmsg() to iov_iter_kvec()

2015-04-15 Thread David Laight
From: Daniel Borkmann
 Sent: 15 April 2015 10:37
 On 04/15/2015 11:08 AM, David Laight wrote:
 ...
  Apart from any other code that is using the interface.
  I know you guys don't do anything to help out of tree code, but removing the
  setfs() stuff from the kernel_recvmsg() code would break anything using sctp.
 
 Then that might just be one more incentive to work towards upstreaming
 your out-of-tree SCTP bits ... ;)

You really wouldn't want the many MB of code that generates several MB
of driver object to support a telephony application using custom hardware.

Almost all the code is completely OS agnostic, the linux 'glue' code is
a few 1000 lines of wrapper functions (compiled as part of the installation
process).
Much the same wrappers exist for windows and solaris.

David



RE: [PATCH 03/12] netfilter: x_tables: align per cpu xt_counter

2015-06-22 Thread David Laight
From: Pablo Neira Ayuso
 Sent: 19 June 2015 18:18
 From: Eric Dumazet eduma...@google.com
 
 Let's force a 16 bytes alignment on xt_counter percpu allocations,
 so that bytes and packets sit in same cache line.
 
 xt_counter being exported to user space, we cannot add __align(16) on
 the structure itself.

You could allocate a wrapper structure that contains xt_counter
and is itself marked __align(16).
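
A minimal sketch of that suggestion (the xt_counters layout below is a
stand-in for the real UAPI struct, and the wrapper name is made up):

	struct xt_counters {                 /* stand-in for the UAPI definition */
		unsigned long long pcnt;     /* packets */
		unsigned long long bcnt;     /* bytes */
	};

	/* kernel-internal wrapper: only this type carries the alignment */
	struct xt_counters_padded {
		struct xt_counters cnt;
	} __attribute__((aligned(16)));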

David



RE: [net 06/10] e1000e: i219 - increase IPG for speed 10/100 full duplex

2015-06-26 Thread David Laight
From: Of Jeff Kirsher
 Sent: 26 June 2015 11:21
 In SPT/i219, there were CRC errors in speed 10/100 full duplex.
 The solution given by the HW team is to increase the IPG from 8 to 0xC
 
 Signed-off-by: Yanir Lubetkin yanirx.lubet...@intel.com
 Tested-by: Aaron Brown aaron.f.br...@intel.com
 Signed-off-by: Jeff Kirsher jeffrey.t.kirs...@intel.com
 ---
  drivers/net/ethernet/intel/e1000e/ich8lan.c | 10 +++---
  1 file changed, 7 insertions(+), 3 deletions(-)
 
 diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c 
 b/drivers/net/ethernet/intel/e1000e/ich8lan.c
 index c5bb1dd..983f5bf 100644
 --- a/drivers/net/ethernet/intel/e1000e/ich8lan.c
 +++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c
 @@ -1400,16 +1400,20 @@ static s32 e1000_check_for_copper_link_ich8lan(struct 
 e1000_hw *hw)
   if (((hw->mac.type == e1000_pch2lan) ||
    (hw->mac.type == e1000_pch_lpt) ||
    (hw->mac.type == e1000_pch_spt)) && link) {
 - u32 reg;
 + u16 speed, duplex;
 
 - reg = er32(STATUS);
 + e1000e_get_speed_and_duplex_copper(hw, &speed, &duplex);
   tipg_reg = er32(TIPG);
   tipg_reg &= ~E1000_TIPG_IPGT_MASK;
 
 - if (!(reg & (E1000_STATUS_FD | E1000_STATUS_SPEED_MASK))) {
 + if (duplex == HALF_DUPLEX && speed == SPEED_10) {
   tipg_reg |= 0xFF;
   /* Reduce Rx latency in analog PHY */
   emi_val = 0;
 + } else if (hw->mac.type == e1000_pch_spt &&
 +            duplex == FULL_DUPLEX && speed != SPEED_1000) {

You really ought to have a comment here explaining why the inter-packet
gap is increased.

 + tipg_reg |= 0xC;

Shouldn't you mask out the old IPG value - just in case it isn't 8.

 + emi_val = 1;
   } else {
 
   /* Roll back the default values */

David



RE: [PATCH net-next v2 05/19] bna: use BIT(x) instead of (1 x)

2015-06-12 Thread David Laight
From: Ivan Vecera
...
 diff --git a/drivers/net/ethernet/brocade/bna/bfa_defs_mfg_comm.h
 b/drivers/net/ethernet/brocade/bna/bfa_defs_mfg_comm.h
 index 679a503..16090fd 100644
 --- a/drivers/net/ethernet/brocade/bna/bfa_defs_mfg_comm.h
 +++ b/drivers/net/ethernet/brocade/bna/bfa_defs_mfg_comm.h
 @@ -75,7 +75,7 @@ enum {
   CB_GPIO_FC4P2   = (4),  /*! 4G 2port FC card   */
   CB_GPIO_FC4P1   = (5),  /*! 4G 1port FC card   */
   CB_GPIO_DFLY= (6),  /*! 8G 2port FC mezzanine card */
 - CB_GPIO_PROTO   = (1 << 7)  /*! 8G 2port FC prototypes */
 + CB_GPIO_PROTO   = BIT(7)/*! 8G 2port FC prototypes */

That doesn't look like a BIT() value to me, just a large number.
Should the release driver even have support for the prototype hardware?

...
 - if (rx_enet_mask & ((u32)(1 << i))) {
 + if (rx_enet_mask & ((u32)BIT(i))) {

The (u32) cast looks superfluous.
There are also too many ().

...
 - int bit = (1 << (vlan_id & BFI_VLAN_WORD_MASK));
 + int bit = BIT((vlan_id & BFI_VLAN_WORD_MASK));

Too many ()

David



RE: Fw: [Bug 98781] New: WWAN: TX bytes counter shows very huge impossible value

2015-06-16 Thread David Laight
From: Kevin
 Sent: 13 June 2015 01:29
...
 Linux uranis 3.19.0-20-generic #20-Ubuntu SMP Fri May 29 10:10:47 UTC 2015
 x86_64 x86_64 x86_64 GNU/Linux
 
 wwan0 Link encap:Ethernet  HWaddr 26:03:a9:e3:88:2e
   inet addr:41.150.225.132  Bcast:41.150.225.135  Mask:255.255.255.248
   inet6 addr: fe80::2403:a9ff:fee3:882e/64 Scope:Link
   UP BROADCAST RUNNING NOARP MULTICAST  MTU:1500  Metric:1
   RX packets:3366 errors:0 dropped:0 overruns:0 frame:0
   TX packets:3497 errors:0 dropped:0 overruns:0 carrier:0
   collisions:0 txqueuelen:1000
   RX bytes:1459963 (1.4 MB)  TX bytes:15019500985395 (15.0 TB)
 
 Kernel internal or module (driver) bug, shows up everywhere including
 'system monitor'. This problem has come up a few years back and was solved,
 but seems to be back again...
 
 Any ideas on fix ?

$ printf '%x\n' 15019500985395
da900055c33
$ echo $((0xda9))
3497
$ echo $((0x55c33))
351283

Looks like a 32-64bit fubar somewhere.

David



RE: [PATCH] sctp: Fix mangled IPv4 addresses on a IPv6 listening socket

2015-05-27 Thread David Laight
From: Daniel Borkmann
 Sent: 27 May 2015 10:34
...
  Fixes: 299ee123e198 (sctp: Fixup v4mapped behaviour to comply with Sock 
  API)
  ...
  This bugfix should be a candidate for -stable
 
  Anyone know off-hand which kernel releases are affected?
  I'm going to have to note this in the release notes for one of our products.
 
 Ever heard of git-describe(1) ? ;)
 
 git describe 299ee123e19889d511092347f5fc14db0f10e3a6
 v3.16-rc7-1525-g299ee12

Probably last time I asked the same question :-)
Since I don't spend all day fighting git, I don't know all the sub-commands.

In any case it looks like I can escape by turning off
SCTP_I_WANT_MAPPED_V4_ADDR for kernels 3.17 through 4.0.

David



RE: [PATCH] net: tcp: Fix a PTO timing granularity issue

2015-05-27 Thread David Laight
From: Of Ido Yariv
 Sent: 26 May 2015 21:17
 The Tail Loss Probe RFC specifies that the PTO value should be set to
 max(2 * SRTT, 10ms), where SRTT is the smoothed round-trip time.
 
 The PTO value is converted to jiffies, so the timer may expire
 prematurely.
 
 This is especially problematic on systems in which HZ = 100, so work
 around this by setting the timeout to at least 2 jiffies on such
 systems.
 
 The 10ms figure was originally selected based on tests performed with
 the current implementation and HZ = 1000. Thus, leave the behavior on
 systems with HZ > 100 unchanged.
 
 Signed-off-by: Ido Yariv idox.ya...@intel.com
 ---
  net/ipv4/tcp_output.c | 3 +++
  1 file changed, 3 insertions(+)
 
 diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
 index 534e5fd..5321df8 100644
 --- a/net/ipv4/tcp_output.c
 +++ b/net/ipv4/tcp_output.c
 @@ -2208,6 +2208,9 @@ bool tcp_schedule_loss_probe(struct sock *sk)
   timeout = max_t(u32, timeout,
   (rtt + (rtt  1) + TCP_DELACK_MAX));
   timeout = max_t(u32, timeout, msecs_to_jiffies(10));
 +#if HZ <= 100
 + timeout = max_t(u32, timeout, 2);
 +#endif

Why not:
	timeout = max_t(u32, timeout, max_t(u32, 2, msecs_to_jiffies(10)));
I think the right-hand max_t() is a compile time constant.

You need 2 jiffies to guarantee a non-zero timeout.
Even if HZ=199 with a 'rounding down' msecs_to_jiffies() you get 1 jiffy
and a possible immediate timeout.

There are probably other places where msecs_to_jiffies() better not return 1.
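
A standalone illustration of the clamp being suggested (hypothetical helper
name; the rounding-down msecs_to_jiffies() below is a simplified stand-in for
the kernel's):

	#include <stdio.h>

	#define HZ 100

	static unsigned long msecs_to_jiffies(unsigned int m)
	{
		return (unsigned long)m * HZ / 1000;   /* rounding-down variant */
	}

	static unsigned long msecs_to_jiffies_min2(unsigned int m)
	{
		unsigned long j = msecs_to_jiffies(m);

		return j < 2 ? 2 : j;   /* never allow an (almost) immediate expiry */
	}

	int main(void)
	{
		/* prints "1 2" with HZ = 100 */
		printf("%lu %lu\n", msecs_to_jiffies(10), msecs_to_jiffies_min2(10));
		return 0;
	}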

David



RE: [PATCH net-next 02/14] sfc: Add sysfs entry for flags (link control and primary)

2015-05-29 Thread David Laight
From: Shradha Shah
 Sent: 29 May 2015 11:01
 On  every adapter there will be one primary PF per adaptor and
 one link control PF per port.
...
 + return sprintf(buf, "%d\n",
 +                ((efx->mcdi->fn_flags) &
 +                 (1 << MC_CMD_DRV_ATTACH_EXT_OUT_FLAG_LINKCTRL))
 +                ? 1 : 0);

Horrid expression.
Why not:
(efx->mcdi->fn_flags >> MC_CMD_DRV_ATTACH_EXT_OUT_FLAG_LINKCTRL) & 1

using sprintf() is also excessive. Maybe:
*buf = '0' + (expression);
return 1;

You may also need to check for buffer overrun.

David



RE: [PATCH] sctp: Fix mangled IPv4 addresses on a IPv6 listening socket

2015-05-27 Thread David Laight
From: Jason Gunthorpe
 Sent: 27 May 2015 16:32
 On Wed, May 27, 2015 at 10:11:22AM +, David Laight wrote:
 
  In any case it looks like I can escape by turning off
  SCTP_I_WANT_MAPPED_V4_ADDR for kernels 3.17 through 4.0.
 
 Just be aware that option is unusable on kernels without 299ee.
 
 I fixed everything wrong I saw, but that doesn't mean it works
 100%. Honestly, I don't think anyone has ever used it.

I'm now confused.

I've just done a test using a 4.0.0-rc1 kernel.
I'm binding an IPv6 listening socket and then connecting to it
from 127.0.0.1.
I don't know if I'm being given an IPv4 format address or a
v6mapped one (I shorten the latter before tracing it) - but
it contains 127.0.0.1 (not 0.0.0.0).
(That is without changing any socket options.)

The kernel code I have seems to default to 'v4mapped = 1', which means
it should (if I read the code correctly) hit sctp_v4_map_v6().
But I'm not seeing the corruption.
Does 127.0.0.1 go through a different path that means the address
is already in IPv6 format?

Testing without using the loopback address is rather harder (for automated
test scripts).

(If anyone suggests that network namespaces might help, they can then
explain how a single kernel entity is going to choose between different
namespaces on an individual connection basis.
Think of something that is receiving data on one connection, decoding the
requests, re-encapsulating the data in a different protocol and then
sending the data onwards.)

David



RE: [PATCH] sctp: Fix mangled IPv4 addresses on a IPv6 listening socket

2015-05-27 Thread David Laight
From: Jason Gunthorpe
 Sent: 27 May 2015 17:32
 On Wed, May 27, 2015 at 04:16:44PM +, David Laight wrote:
  From: Jason Gunthorpe
   Sent: 27 May 2015 16:32
   On Wed, May 27, 2015 at 10:11:22AM +, David Laight wrote:
  
In any case it looks like I can escape by turning off
SCTP_I_WANT_MAPPED_V4_ADDR for kernels 3.17 through 4.0.
  
   Just be aware that option is unusable on kernels without 299ee.
  
   I fixed everything wrong I saw, but that doesn't mean it works
   100%. Honestly, I don't think anyone has ever used it.
 
  I'm now confused.
 
  I've just done a test using a 4.0.0-rc1 kernel.
  I'm binding an IPv6 listening socket and then connecting to it
  from 127.0.0.1.
  I don't know if I'm being given an IPv4 format address or a
  v6mapped one (I shorten the latter before tracing it) - but
  it contains 127.0.0.1 (not 0.0.0.0).
  (That is without changing any socket options.)
 
 I don't know what your test does, but I used the same basic idea with
 loopback to find this issue. You should confirm the kernel is
 returning a AF_INET6 socket type, if it is AF_INET then there is a
 path I missed in 299ee and I should fix it..

The code will be sleeping in kernel_accept() and later calls
kernel_getpeername().
The code is used for both TCP and SCTP and this part is common (using
the TCP semantics).

I'll look at the actual raw address format tomorrow.
I suspect it is already AF_INET6 before getting to the code that would
badly translate it.

David



RE: [PATCH] net: tcp: Fix a PTO timing granularity issue

2015-05-28 Thread David Laight
From: Ido Yariv
 Sent: 28 May 2015 05:37
...
 +/* Convert msecs to jiffies, ensuring that the return value is at least 2
 + * jiffies.
 + * This can be used when setting tick-based timers to guarantee that they 
 won't
 + * expire right away.
 + */
 +static inline unsigned long tcp_safe_msecs_to_jiffies(const unsigned int m)

I don't like using 'safe' in function names, being 'safe' depends on what
the caller wants.
Maybe tcp_msecs_to_jiffies_min_2() would be better.

David



RE: [PATCH] sctp: Fix mangled IPv4 addresses on a IPv6 listening socket

2015-05-28 Thread David Laight
From: Jason Gunthorpe
 Sent: 27 May 2015 18:05
 On Wed, May 27, 2015 at 04:41:18PM +, David Laight wrote:
 
  The code will be sleeping in kernel_accept() and later calls
  kernel_getpeername().
  The code is used for both TCP and SCTP and this part is common (using
  the TCP semantics).
 
 getpeername uses a different flow, it calls into inet6_getname which
 will always return the AF_INET6 version.

Ok, that explains why I hadn't seen the problem.
It also means I don't have to worry about it.

David



RE: [PATCH/RFC net-next] rocker: remove rocker parameter from functions that have rocker_port parameter

2015-05-28 Thread David Laight
From: Simon Horman
 Sent: 28 May 2015 04:23
 The rocker (switch) of a rocker_port may be trivially obtained from
 the latter so it seems cleaner not to pass the former to a function when
 the latter is being passed anyway.

If the arguments are passed in registers (they almost certainly are)
or the function is inlined (possible since they are static) and
the calling code already has both values in registers then
passing both values saves a memory read inside the called code.

So on 'hot paths' it probably makes sense to pass both values.

David



RE: [PATCH net-next] net: macb: replace literal constant with NET_IP_ALIGN

2015-07-03 Thread David Laight
From: Eric Dumazet
 Sent: 01 July 2015 23:14
 To: Joe Perches
...
   Then please use netdev_alloc_skb_ip_align(), so that you get rid of
   skb_reserve()
 
  It seems there are ~50 of these in the kernel tree
  that could be converted.
 
 
 Make sure the 2 is really NET_IP_ALIGN
 
 Some hardwares need 2, even if NET_IP_ALIGN is 0 (on x86 arches for
 example)

Even on x86 aligning the ethernet receive data on a 4n+2
boundary is likely to give marginally better performance
than aligning on a 4n boundary.

David



RE: [PATCH net-next] net: macb: replace literal constant with NET_IP_ALIGN

2015-07-06 Thread David Laight
From: Eric Dumazet [mailto:eric.duma...@gmail.com]
 Sent: 03 July 2015 17:39
 On Fri, 2015-07-03 at 16:18 +, David Laight wrote:
 
  Even on x86 aligning the ethernet receive data on a 4n+2
  boundary is likely to give marginally better performance
  than aligning on a 4n boundary.
 
 You are coming late to the party.

I've been to many parties at many different times

Going back many years, Sun's original sbus DMA part generated a lot
of single sbus transfers for 4n+2 aligned buffers - so it was necessary
to do a 'realignment' copy. The later DMA+ (definitely the DMA2) part
did sbus burst transfers even when the buffer was 4n+2 aligned.
So with the later parts you could correctly align the buffer.

 Intel guys decided to change NET_IP_ALIGN to 0 (it was 2 in the past)
...
 x86: Align skb w/ start of cacheline on newer core 2/Xeon Arch
 
 x86 architectures can handle unaligned accesses in hardware, and it has
 been shown that unaligned DMA accesses can be expensive on Nehalem
 architectures.  As such we should overwrite NET_IP_ALIGN to resolve
 this issue.

My 2 cents:

I'd have thought it would depend on the nature of the 'DMA' requests
generated by the hardware - so ethernet hardware dependant.

The above may be correct for PCI masters - especially those that do
paired 16bit accesses for every 32bit word.
If the hardware generated cache line aligned PCI bursts I wouldn't
have thought it would matter.

I doubt it is valid for PCIe transfers - where the ethernet frame
will be split into (probably) 128byte TLPs. Even if it starts on
a 64n+2 boundary the splits will be on 64 byte boundaries since the
first and last 32bit words of the TLP have separate byte enables.

So I'd expect to see a cache line RMW for the first and last cache
lines - That may, or may not, be slower than the misaligned accesses
for the entire frame (1 clock data delay per access?)

Of course, modern nics will write 2 bytes of 'crap' before the frame.
Rounding up the transfer to the end of a cache line might also help
(especially if only a few extra words are needed).

David



RE: [PATCH net] x86: bpf_jit: fix compilation of large bpf programs

2015-05-26 Thread David Laight
From: Alexei Starovoitov
 Sent: 22 May 2015 23:43
 x86 has variable length encoding. x86 JIT compiler is trying
 to pick the shortest encoding for given bpf instruction.
 While doing so the jump targets are changing, so JIT is doing
 multiple passes over the program. Typical program needs 3 passes.
 Some very short programs converge with 2 passes. Large programs
 may need 4 or 5. But specially crafted bpf programs may hit the
 pass limit and if the program converges on the last iteration
 the JIT compiler will be producing an image full of 'int 3' insns.
 Fix this corner case by doing final iteration over bpf program.

If the JIT compiler is only changing the encoding of the constants
in the x86 instructions (rather than changing the instructions themselves)
then there is likely to be an unmeasurable change in the execution time.
For instance I don't remember there being a difference in execution time
between long and short branches - the only difference is the amount of
cache they use.

David



RE: [PATCH net] x86: bpf_jit: fix compilation of large bpf programs

2015-05-26 Thread David Laight
From: Eric Dumazet
 Sent: 26 May 2015 15:35
 On Tue, 2015-05-26 at 13:40 +, David Laight wrote:
 
  If the JIT compiler is only changing the encoding of the constants
  in the x86 instructions (rather than changing the instructions themselves)
  then there is likely to me an unmeasurable change in the execution time.
  For instance I don't remember there being a difference in execution time
  between long and short branches - the only difference is the amount of
  cache they use.
 
 icache is precisely the matter here. In the end, it makes a difference.
 
 You could check this interesting study Ingo did recently :
 
 https://lkml.org/lkml/2015/5/19/1009

Yes, interesting, a benchmark that manages to run a lot of code 'cold cache'.
Possibly dominated by having a full cache line of code to execute at the
beginning of each function.

My guess is that aligning function bodies helps, aligning branch targets
(as some old versions of gcc used to do aggressively) is more likely to
have a negative effect (increases the icache footprint too much).
Unrolling loops further than needed to avoid data stalls needs very
careful study.

For JIT, once the obviously short branches have been reduced, further code
size reduction might not be worth while.

David


RE: [PATCH net] x86: bpf_jit: fix compilation of large bpf programs

2015-05-26 Thread David Laight
From: Eric Dumazet 
 Sent: 26 May 2015 16:30
 
  Yes, interesting, a benchmark that manages to run a lot of code 'cold 
  cache'.
 
 We have binaries here at Google with 400 or 500 MBytes of text.
 
 Not benchmark, super real workloads you know.

Indeed, and a lot of the code is likely to be running 'cold cache'.

I was alluding to the problem where people will benchmark a small function
by running it 1000s of times in a tight loop with exactly the same data.
Not only is it 'hot cache' but any dynamic branch prediction is 'trained'
to the specific data.

David


 



RE: [PATCH 4/6] dlm: use sctp 1-to-1 API

2015-08-12 Thread David Laight
From: Marcelo Ricardo Leitner
 Sent: 12 August 2015 14:16
 Em 12-08-2015 07:23, David Laight escreveu:
  From: Marcelo Ricardo Leitner
  Sent: 11 August 2015 23:22
  DLM is using 1-to-many API but in a 1-to-1 fashion. That is, it's not
  needed but this causes it to use sctp_do_peeloff() to mimic a
  kernel_accept() and this causes a symbol dependency on sctp module.
 
  By switching it to 1-to-1 API we can avoid this dependency and also
  reduce quite a lot of SCTP-specific code in lowcomms.c.
  ...
 
  You still need to enable sctp notifications (I think the patch deleted
  that code).
  Otherwise you don't get any kind of indication if the remote system
  'resets' (ie sends an new INIT chunk) on an existing connection.
 
 Right, it would miss the restart event and could generate a corrupted
 tx/rx buffers by glueing parts of old messages with new ones.

Except that it is SCTP so you'd expect DATA chunks to contain entire
messages and so get unexpected message sequences rather than corrupt
messages.
The problem is that the recovery is likely to be another reset.
(Particularly with M3UA where the source and destination port numbers
are likely to be the same and fixed.)

  It is probably enough to treat the MSG_NOTIFICATION as a fatal error
  and close the socket.
 
 Just so we are on the same page, you mean that after accepting the new
 association and enabling notifications on it, any further notification
 on it can be treated as fatal errors, right? Seems reasonable to me.

That's what I had to do.
The far end will probably see an additional disconnect, but it shouldn't
matter.

  This is probably a bug in the sctp stack - if a connection is reset
  but the user hasn't requested notifications then it should be
  converted to a disconnect indication and a new incoming connection.
 
 Maybe in such case resets shouldn't be allowed at all? Because unless it
 happens on a moment of silence it will always lead to application buffer
 corruption. Checked the RFCs now but couldn't find anything restricting
 them to some condition.

I certainly expected the 'reset' to cause an inwards abortive disconnect
on the old socket and a new indication on the listening socket.
I think (hope) that is what you get for a TCP SYN that matches an existing
connection.

In our case I think they were happening when the remote system was power
cycled.

David



RE: [PATCH net] inet: fix potential deadlock in reqsk_queue_unlink()

2015-08-18 Thread David Laight
From: Eric Dumazet
 Sent: 13 August 2015 23:45
 When replacing del_timer() with del_timer_sync(), I introduced
 a deadlock condition :
 
 reqsk_queue_unlink() is called from inet_csk_reqsk_queue_drop()
 
 inet_csk_reqsk_queue_drop() can be called from many contexts,
 one being the timer handler itself (reqsk_timer_handler()).
 
 In this case, del_timer_sync() loops forever.
 
 Simple fix is to test if timer is pending.

Doesn't that mean you fail to wait for the timer function
to finish if you are calling from a different context?

What you need to know is whether the current context
is that running the timer itself.
In which case you need to mark the timer 'to be deleted'
and actually delete it when the timer function returns.
(so other code can still wait for completion.)

David


RE: [PATCH net-next 2/3] qeth: Convert use of __constant_htons to htons

2015-08-20 Thread David Laight
From: Ursula Braun
 Sent: 19 August 2015 09:21
 In little endian cases, the macro htons unfolds to __swab16 which
 provides special case for constants. In big endian cases,
 __constant_htons and htons expand directly to the same expression.
 So, replace __constant_htons with htons with the goal of getting
 rid of the definition of __constant_htons completely.
...
 diff --git a/drivers/s390/net/qeth_l3_main.c b/drivers/s390/net/qeth_l3_main.c
 index 70eb2f6..ecfe622 100644
 --- a/drivers/s390/net/qeth_l3_main.c
 +++ b/drivers/s390/net/qeth_l3_main.c
 @@ -1887,13 +1887,13 @@ static inline int qeth_l3_rebuild_skb(struct 
 qeth_card *card,
   case QETH_CAST_MULTICAST:
   switch (prot) {
  #ifdef CONFIG_QETH_IPV6
 - case __constant_htons(ETH_P_IPV6):
 + case htons(ETH_P_IPV6):

I didn't think htons() was 'constant enough' to be used as a case label.

Using byteswapped constants in a case statement can change it from being
implemented as a jump table to a branch tree.
This might be more expensive than byteswapping the value (even on systems
that don't have cheap byteswap instructions).

David



RE: [PATCH net-next 2/3] qeth: Convert use of __constant_htons to htons

2015-08-20 Thread David Laight
From: Ursula Braun [mailto:ubr...@linux.vnet.ibm.com]
 Sent: 20 August 2015 12:44
 On Thu, 2015-08-20 at 10:46 +, David Laight wrote:
  From: Ursula Braun
   Sent: 19 August 2015 09:21
   In little endian cases, the macro htons unfolds to __swab16 which
   provides special case for constants. In big endian cases,
   __constant_htons and htons expand directly to the same expression.
   So, replace __constant_htons with htons with the goal of getting
   rid of the definition of __constant_htons completely.
  ...
   diff --git a/drivers/s390/net/qeth_l3_main.c 
   b/drivers/s390/net/qeth_l3_main.c
   index 70eb2f6..ecfe622 100644
   --- a/drivers/s390/net/qeth_l3_main.c
   +++ b/drivers/s390/net/qeth_l3_main.c
   @@ -1887,13 +1887,13 @@ static inline int qeth_l3_rebuild_skb(struct 
   qeth_card *card,
 case QETH_CAST_MULTICAST:
 switch (prot) {
#ifdef CONFIG_QETH_IPV6
   - case __constant_htons(ETH_P_IPV6):
   + case htons(ETH_P_IPV6):
 
  I didn't think htons() was 'constant enough' to be used as a case label.
 
  Using byteswapped constants in a case statement can change it from being
  implemented as a jump table to a branch tree.
  This might be more expensive than byteswapping the value (even on systems
  that don't have cheap byteswap instructions).
 
  David
 
 For big endian systems both __constant_htons(x) and htons(x) are
 resolved to ((__force __be16)(__u16)(x)). Thus I do not see a reason to
 reject the patch proposal from Vaishali Thakkar.

Look at a little-endian one (eg amd64).
I think you'll find a C ?: expression that uses __builtin_constant_p() to
select between an expression the compiler can evaluate and a call to an
inline function that uses some appropriate asm.
The latter isn't a compile-time constant so can't be used as a case
label or as an initialiser.
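
A rough sketch of the pattern being described, with made-up names (not the
exact kernel definition): the ?: selects a compile-time expression for
constants and a runtime byte swap otherwise, and strictly speaking the whole
thing is not an integer constant expression.

	#include <stdint.h>

	static inline uint16_t my_swab16(uint16_t x)
	{
		return __builtin_bswap16(x);    /* runtime byte swap */
	}

	#define my_htons(x)                                             \
		(__builtin_constant_p(x) ?                              \
			(uint16_t)((((x) & 0xff) << 8) | ((x) >> 8)) :  \
			my_swab16(x))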

David




RE: [PATCH net] inet: fix potential deadlock in reqsk_queue_unlink()

2015-08-18 Thread David Laight
From: Eric Dumazet 
 Sent: 18 August 2015 14:37
 On Tue, 2015-08-18 at 11:04 +, David Laight wrote:
  From: Eric Dumazet
   Sent: 13 August 2015 23:45
   When replacing del_timer() with del_timer_sync(), I introduced
   a deadlock condition :
  
   reqsk_queue_unlink() is called from inet_csk_reqsk_queue_drop()
  
   inet_csk_reqsk_queue_drop() can be called from many contexts,
   one being the timer handler itself (reqsk_timer_handler()).
  
   In this case, del_timer_sync() loops forever.
  
   Simple fix is to test if timer is pending.
 
  Doesn't that mean you fail to wait for the timer function
  to finish if you are calling from a different context?
 
 No, this is the purpose of del_timer_sync()

I wasn't worried about del_timer_sync().
The problem is the call to timer_pending().

If the timer has just expired, and the timeout function is running
(on another cpu) then timer_pending() will return false.
So any tidyup path (apart from that called by the timeout function itself)
will fail to wait for the function to finish.

So, in effect, you've converted the code back into a call to del_timer().

timer_pending() should probably never be called without the relevant
timer lock held - because the result is stale.

David



RE: Fix skb_set_peeked use-after-free bug

2015-08-04 Thread David Laight
From: Herbert Xu
 Sent: 04 August 2015 10:21
 On Tue, Aug 04, 2015 at 09:15:13AM +, David Laight wrote:
 
  You've introduced a memory leak if skb_clone() fails.
 
 No I have not.
 
 nskb = skb_clone(skb, GFP_ATOMIC);
 if (!nskb)
   - return -ENOMEM;
   + return ERR_PTR(-ENOMEM);
 
  Here the original skb is still allocated.
 
   - error = skb_set_peeked(skb);
   - if (error)
   + skb = skb_set_peeked(skb);
 
  You've now lost the address of the original skb.
 
 It doesn't matter because we will take the error path and return
 the ENOMEM error.  We must not free the skb as it's still on the
 recv queue.

In that case, what happens to the receive queue when skb_clone()
takes a copy of the skb - freeing the original one?

David



RE: Fix skb_set_peeked use-after-free bug

2015-08-04 Thread David Laight
From: Herbert Xu
 Sent: 04 August 2015 08:43
 Brenden Blanco bbla...@plumgrid.com wrote:
  [  318.244596] BUG: unable to handle kernel NULL pointer dereference
  at 008e
  [  318.245182] IP: [81455e7c] __skb_recv_datagram+0xbc/0x5a0
 
  Replying to myself, and adding commit interested parties...
 
  I went through the git log for the function in question, and
  positively identified that the following commit introduces the crash:
 
  738ac1e net: Clone skb before setting peeked flag
 
  Null dereference is at line 224 of net/core/datagram.c (according to
  my objdump dis-assembly):
 
 Sorry, brain fart on my part.  Please try this patch.

You've introduced a memory leak if skb_clone() fails.

 The commit 738ac1ebb96d02e0d23bc320302a6ea94c612dec (net: Clone
 skb before setting peeked flag) introduced a use-after-free bug
 in skb_recv_datagram.  This is because skb_set_peeked may create
 a new skb and free the existing one.  As it stands the caller will
 continue to use the old freed skb.
 
 This patch fixes it by making skb_set_peeked return the new skb
 (or the old one if unchanged).
...
   nskb = skb_clone(skb, GFP_ATOMIC);
   if (!nskb)
 - return -ENOMEM;
 + return ERR_PTR(-ENOMEM);

Here the original skb is still allocated.

 - error = skb_set_peeked(skb);
 - if (error)
 + skb = skb_set_peeked(skb);

You've now lost the address of the original skb.

 + error = PTR_ERR(skb);
 + if (IS_ERR(skb))
   goto unlock_err;

David




RE: [PATCH 08/15] drivers: net: Drop unlikely before IS_ERR(_OR_NULL)

2015-07-31 Thread David Laight
From: Murali Karicheri
 Sent: 31 July 2015 16:04
 On 07/31/2015 04:38 AM, Viresh Kumar wrote:
  IS_ERR(_OR_NULL) already contain an 'unlikely' compiler flag and there
  is no need to do that again from its callers. Drop it.
 
 
 IS_ERR_OR_NULL() is defined as
 
 static inline bool __must_check IS_ERR_OR_NULL(__force const void *ptr)
 {
  return !ptr || IS_ERR_VALUE((unsigned long)ptr);
 }
 
 So the unlikely() applies only to second part. Wouldn't that be a
 problem for optimization?

Ugg...

The unlikely() in IS_ERR_VALUE() is likely to stop the compiler
doing a single 'window' comparison for the range [-MAX_ERROR .. 0].
So you are likely to end up with 2 comparisons.
I suspect that:

return IS_ERR_VALUE((unsigned long)ptr - 1);

would be a much better test.
(Ignoring the off-by-one for the highest error.)
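
A standalone illustration of that single 'window' test, using userspace
stand-ins for the kernel macros (and shifting the threshold by one so the
highest error value is still caught):

	#include <stdbool.h>
	#include <stdio.h>

	#define MAX_ERRNO	4095

	static bool is_err_or_null_window(const void *ptr)
	{
		/* NULL wraps to ULONG_MAX after the subtraction, so one compare
		 * covers both NULL and the [-MAX_ERRNO, -1] error-pointer range.
		 */
		return (unsigned long)ptr - 1 >= (unsigned long)-MAX_ERRNO - 1;
	}

	int main(void)
	{
		printf("%d %d %d\n",
		       is_err_or_null_window(NULL),          /* 1 */
		       is_err_or_null_window((void *)-12L),  /* 1, ERR_PTR(-ENOMEM) style */
		       is_err_or_null_window("hello"));      /* 0 */
		return 0;
	}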

David



RE: [PATCH 2/4] batman-adv: fix kernel crash due to missing NULL checks

2015-08-05 Thread David Laight
From: Of Antonio Quartulli
 Sent: 05 August 2015 13:52
 From: Marek Lindner mareklind...@neomailbox.ch
 
 batadv_softif_vlan_get() may return NULL which has to be verified
 by the caller.
 
...
 diff --git a/net/batman-adv/soft-interface.c b/net/batman-adv/soft-interface.c
 index c002961..a2fc843 100644
 --- a/net/batman-adv/soft-interface.c
 +++ b/net/batman-adv/soft-interface.c
 @@ -479,6 +479,9 @@ out:
   */
  void batadv_softif_vlan_free_ref(struct batadv_softif_vlan *vlan)
  {
 + if (!vlan)
 + return;
 +

This bit doesn't look necessary.
You've added checks to some callers, the others probably don't need the check.

 @@ -1066,6 +1069,9 @@ uint16_t batadv_tt_local_remove(struct batadv_priv 
 *bat_priv,
 
   /* decrease the reference held for this vlan */
   vlan = batadv_softif_vlan_get(bat_priv, vid);
 + if (!vlan)
 + goto out;
 +
   batadv_softif_vlan_free_ref(vlan);
   batadv_softif_vlan_free_ref(vlan);

That code is ringing alarm bells.
If you expect to have a reference count, the object had better exist.
If you can remove a reference count from a 'random' object then
you can break the reference counting of objects.

So is this test just hiding another bug somewhere??

David



RE: [PATCH 2/4] batman-adv: fix kernel crash due to missing NULL checks

2015-08-12 Thread David Laight
From: Antonio Quartulli
 Sent: 11 August 2015 17:43
 On 05/08/15 15:15, David Laight wrote:
   So is this test just hiding another bug somewhere??
 
 Hi David and thanks for your feedback.
 
 The point is that we got several bug reports of kernel crashes due to
  NULL pointer dereferences in these lines and after having debugged this
 problem for quite a while we preferred to move on and propose this patch.

That is what I thought.

 Still, I am personally debugging this part of the code to understand if
 we really have something wrong or if this NULL pointer is something we
 should expect (and therefore check).
 
 For the time being we think this patch is better than having horrible
 kernel crashes, but I hope to come to a definitive conclusion soon.

If you don't know why you are seeing the NULL pointer then you are
just papering over some cracks and it is likely that something
is really badly wrong somewhere (ie missing locking).
This could easily mean that there are some much harder to find
bugs lurking there.

David




RE: [PATCH 4/6] dlm: use sctp 1-to-1 API

2015-08-12 Thread David Laight
From: Marcelo Ricardo Leitner
 Sent: 11 August 2015 23:22
 DLM is using 1-to-many API but in a 1-to-1 fashion. That is, it's not
  needed but this causes it to use sctp_do_peeloff() to mimic a
 kernel_accept() and this causes a symbol dependency on sctp module.
 
 By switching it to 1-to-1 API we can avoid this dependency and also
 reduce quite a lot of SCTP-specific code in lowcomms.c.
...

You still need to enable sctp notifications (I think the patch deleted
that code).
Otherwise you don't get any kind of indication if the remote system
'resets' (ie sends an new INIT chunk) on an existing connection.

It is probably enough to treat the MSG_NOTIFICATION as a fatal error
and close the socket.

This is probably a bug in the sctp stack - if a connection is reset
but the user hasn't requested notifications then it should be
converted to a disconnect indication and a new incoming connection.
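
As a userspace-flavoured sketch of that idea (lksctp-tools API names; the
in-kernel DLM code would go through kernel_setsockopt() instead):

	#include <string.h>
	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <netinet/sctp.h>

	static int enable_assoc_events(int fd)
	{
		struct sctp_event_subscribe ev;

		memset(&ev, 0, sizeof(ev));
		ev.sctp_association_event = 1;  /* report restarts/aborts/comm-lost */
		return setsockopt(fd, IPPROTO_SCTP, SCTP_EVENTS, &ev, sizeof(ev));
	}

	/* In the receive path, after recvmsg():
	 *	if (msg.msg_flags & MSG_NOTIFICATION)
	 *		treat it as fatal and close the socket;
	 */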

David



RE: [PATCH iproute2] ip: replace exit with return

2015-08-12 Thread David Laight
From: netdev-ow...@vger.kernel.org
 Sent: 11 August 2015 10:40
 In our manual, we have this description of 'EXIT STATUS':
 Exit status is 0 if command was successful, and 1 if there is a syntax
 error.
 
 But we exit in command functions with code -1 when there is a syntax error.
 It's better to use return.

Eh?
Using exit() makes it much more obvious that the program is going to exit.

I've not looked at the call site (I'm not entirely sure where this code is
in the source tree), but main() shouldn't return -1 any more than exit(-1)
is invalid.
The domain for both is 0..127.
So the code should be using a valid value.

...
 diff --git a/ip/ipaddress.c b/ip/ipaddress.c
 index b7b4e3e..6d29a69 100644
 --- a/ip/ipaddress.c
 +++ b/ip/ipaddress.c
 @@ -1879,5 +1879,5 @@ int do_ipaddr(int argc, char **argv)
   if (matches(*argv, "help") == 0)
   usage();
   fprintf(stderr, "Command \"%s\" is unknown, try \"ip addr help\".\n",
 *argv);
 - exit(-1);
 + return -1;
  }
...

David


RE: [RFC] bmac:change to use bitrev8() generic function

2015-08-10 Thread David Laight
 From: Tobias Klauser
 Sent: 10 August 2015 12:49
 On 2015-08-10 at 11:53:41 +0200, yalin wang yalin.wang2...@gmail.com wrote:
  This change to use generic bitrev8() for bmac driver.
...
  @@ -871,7 +860,7 @@ bmac_addhash(struct bmac_data *bp, unsigned char *addr)
 
  if (!(*addr)) return;
   crc = bmac_crc((unsigned short *)addr) & 0x3f; /* Big-endian alert! */

Why not *((u8 *)addr + 1) & 0x3f

  -   crc = reverse6[crc];/* Hyperfast bit-reversing algorithm */
  +   crc = bitrev8((u8)crc);
 
 No, this won't work. reverse6 works on 6 bit values, while bitrev8 works
 on 8bit values, e.g. reverse6[1] = 0x20, while bitrev8(1) = 0x80

You could use bitrev8(n) >> 2.

But that is a 'strange' map of a 7-bit value to a 6-bit one.

I thought it was more common for ethernet hardware to use the
value of the crc register after all 6 bytes of the mac address
had been processed.

David



RE: [PATCH v2] ravb: minimize TX data copying

2015-07-27 Thread David Laight
From: Sergei Shtylyov
 Sent: 25 July 2015 21:42
 Renesas Ethernet AVB controller requires that all data are aligned on 4-byte
 boundary.  While it's  easily achievable for  the RX  data with  the help of
 skb_reserve() (we even align on 128-byte boundary as recommended by the 
 manual),
 we  can't  do the same with the TX data, and it always comes  unaligned from
 the networking core. Originally we solved it an easy way, copying all packet
 to  a  preallocated  aligned buffer; however, it's enough to copy only up to
 3 first bytes from each packet, doing the transfer using 2 TX descriptors
 instead of just 1. Here's an implementation of the new  TX algorithm that
 significantly reduces the driver's memory requirements.
 
...
 - buffer = PTR_ALIGN(priv->tx_buffers[q][entry], RAVB_ALIGN);
 - memcpy(buffer, skb->data, skb->len);
 - desc = priv->tx_ring[q][entry];
 - desc->ds_tagl = cpu_to_le16(skb->len);
 - dma_addr = dma_map_single(&ndev->dev, buffer, skb->len, DMA_TO_DEVICE);
 + buffer = PTR_ALIGN(priv->tx_align[q], DPTR_ALIGN) +
 +  entry / NUM_TX_DESC * DPTR_ALIGN;

The above would be clearer if tx_align was char[DPTR_ALIGN][].

 + len = PTR_ALIGN(skb->data, DPTR_ALIGN) - skb->data;
 + memcpy(buffer, skb->data, len);

Does this imply there has been an skb_linearize() ???
The old version didn't really need it (it was doing a copy anyway).

 + dma_addr = dma_map_single(&ndev->dev, buffer, len, DMA_TO_DEVICE);
   if (dma_mapping_error(&ndev->dev, dma_addr))
   goto drop;
 +
 + desc = priv->tx_ring[q][entry];
 + desc->ds_tagl = cpu_to_le16(len);
 + desc->dptr = cpu_to_le32(dma_addr);
 +
 + buffer = skb->data + len;
 + len = skb->len - len;
 + dma_addr = dma_map_single(&ndev->dev, buffer, len, DMA_TO_DEVICE);
 + if (dma_mapping_error(&ndev->dev, dma_addr))
 + goto unmap;
 +
 + desc++;
 + desc->ds_tagl = cpu_to_le16(len);

What happens if a fragment is less than DPTR_ALIGN bytes ???
Actually it looks like you're relying on having a linear skb.

On systems with expensive dma_map it might be worth copying
small fragments.
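Something along these lines (a rough sketch; the threshold and the bounce
buffer fields are illustrative, not from the driver):

	if (len <= COPY_BREAK) {	/* e.g. 256 bytes, illustrative */
		/* bounce buffer was mapped once at ring setup time */
		memcpy(priv->bounce[entry], data, len);
		dma_addr = priv->bounce_dma[entry];
	} else {
		dma_addr = dma_map_single(&ndev->dev, data, len, DMA_TO_DEVICE);
		if (dma_mapping_error(&ndev->dev, dma_addr))
			goto drop;
	}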

David




RE: [PATCH v2 1/2] sctp: add new getsockopt option SCTP_SOCKOPT_PEELOFF_KERNEL

2015-07-22 Thread David Laight
From: Marcelo Ricardo Leitner
 Sent: 14 July 2015 18:13
 SCTP has this operation to peel off associations from a given socket and
 create a new socket using this association. We currently have two ways
 to use this operation:
 - via getsockopt(), on which it will also create and return a file
   descriptor for this new socket
 - via sctp_do_peeloff(), which is for kernel only
 
 The caveat with using sctp_do_peeloff() directly is that it creates a
 dependency to SCTP module, while all other operations are handled via
 kernel_{socket,sendmsg,getsockopt...}() interface. This causes the
 kernel to load SCTP module even when it's not really used.
 
 This patch then creates a new sockopt that is to be used only by kernel
 users of this protocol. This new sockopt will not allocate a file
 descriptor but instead just return the socket pointer directly.
 
 Kernel users are actually identified by whether the parent socket has an
 fd attached to it. If not, it's a kernel user.
 
 If called by an user application, it will just return -EPERM.
 
 Even though it's not intended for user applications, it's listed under
 uapi header. That's because hiding this wouldn't add any extra security
 and to keep the sockopt list in one place, so it's easy to check
 available numbers to use.
 
 Signed-off-by: Marcelo Ricardo Leitner marcelo.leit...@gmail.com
...
 +static int sctp_getsockopt_peeloff_kernel(struct sock *sk, int len,
 +   char __user *optval, int __user 
 *optlen)
 +{
 + sctp_peeloff_kernel_arg_t peeloff;
 + struct socket *newsock;
 + int retval = 0;
 +
 + /* We only allow this operation if parent socket also hadn't a
 +  * file descriptor allocated to it, mainly as a way to make sure
 +  * that this is really a kernel socket.
 +  */
 + if (sk->sk_socket->file)
 + return -EPERM;
 +
 + if (len < sizeof(sctp_peeloff_kernel_arg_t))
 + return -EINVAL;
 + len = sizeof(sctp_peeloff_kernel_arg_t);
 + if (copy_from_user(&peeloff, optval, len))
 + return -EFAULT;

You can't need copy_from_user() here, the buffer would surely be kernel.

David



RE: [RFC net-next 1/1] : sctp: denoise deprecation log on SCTP_EVENTS

2015-07-22 Thread David Laight
From: Jamal Hadi Salim
 Sent: 09 July 2015 11:38
 
 In the newer kernels this message is extremely noisy. After a quick
 discussion with Daniel it seems to me it will be very hard to get
 existing apps that nobody is going to update to continue to work
 (i.e no forward compat). And newer apps that desire to play in both
 older kernels and new kernels will have to play some tricks to work
 (i.e weak backward compatibility).  These are good reasons
 to totally get rid of this message. At a minimum, to neuter it.
 The attached change tries to do that. However, if you had multiple
 apps, you will only get warning for the first one.

Never mind spamming the kernel log, there will be performance
issues with systems that are sending significant amounts of SCTP data.
I'm not sure you'll get 1+/sec through the trace system.

I'm going to have to find which version this trace went into and
make sure we tell customers not to use the relevant kernel versions.

David


RE: [iproute PATCH 05/12] ip{,6}tunnel: align do_tunnels_list() a bit

2015-11-13 Thread David Laight
From: Phil Sutter
> Sent: 13 November 2015 17:09
> In iptunnel, declare loop variables inside the loop as done in
> ip6tunnel.
...
> @@ -396,14 +396,8 @@ static void print_tunnel(struct ip_tunnel_parm *p)
> 
>  static int do_tunnels_list(struct ip_tunnel_parm *p)
>  {
> - char name[IFNAMSIZ];
> - unsigned long  rx_bytes, rx_packets, rx_errs, rx_drops,
> - rx_fifo, rx_frame,
> - tx_bytes, tx_packets, tx_errs, tx_drops,
> - tx_fifo, tx_colls, tx_carrier, rx_multi;
> - struct ip_tunnel_parm p1;
> -
...
>   while (fgets(buf, sizeof(buf), fp) != NULL) {
> + char name[IFNAMSIZ];
>   int index, type;
> + unsigned long rx_bytes, rx_packets, rx_errs, rx_drops,
> + rx_fifo, rx_frame,
> + tx_bytes, tx_packets, tx_errs, tx_drops,
> + tx_fifo, tx_colls, tx_carrier, rx_multi;
> + struct ip_tunnel_parm p1;
>   char *ptr;
> +

Personally I find that just makes it harder to find where the
variables are defined.
Since the Linux kernel cannot be compiled with -Wshadow, declaring
variables in inner scopes can easily lead to very strange bugs.
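A small illustration of the kind of bug -Wshadow would otherwise catch
(buf and fp as in do_tunnels_list()):

	int len = 0;

	while (fgets(buf, sizeof(buf), fp) != NULL) {
		int len = strlen(buf);	/* silently shadows the outer 'len' */

		if (len && buf[len - 1] == '\n')
			buf[len - 1] = '\0';
	}
	/* the outer 'len' is still 0 here */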

David


RE: irq_fpu_usable() is false in ndo_start_xmit() for UDP packets

2015-11-17 Thread David Laight
From: David Miller
> Sent: 16 November 2015 20:32
> From: "Jason A. Donenfeld" 
> Date: Mon, 16 Nov 2015 20:52:28 +0100
> 
> > This works fine with, say, iperf3 in TCP mode. The AVX performance
> > is great. However, when using iperf3 in UDP mode, irq_fpu_usable()
> > is mostly false! I added a dump_stack() call to see why, except
> > nothing looks strange; the initial call in the stack trace is
> > entry_SYSCALL_64_fastpath. Why would irq_fpu_usable() return false
> > when we're in a syscall? Doesn't that mean this is in process
> > context?
> 
> Network device driver transmit executes with software interrupts
> disabled.
> 
> Therefore on x86, you cannot use the FPU.

I had some thoughts about driver access to AVX instructions when
I was adding AVX support to NetBSD.

The fpu state is almost certainly 'lazy switched'; this means that
the fpu registers can contain data for a process that is currently
running on a different cpu.
At any time the other cpu might request (by IPI) that they be flushed
to the process data area so that it can reload them.
Kernel code could request them be flushed, but that can only happen once.
If a nested function requires them it would have to supply a local
save area. But the full save area is too big to go on stack.
Not only that, the save and restore instructions are slow.

It is also worth noting that all the AVX registers are caller saved.
This means that the syscall entry need not preserve them, instead
it can mark that they will be 'restored as zero'. However this
isn't true of any other kernel entry points.

Back to my thoughts...

Kernel code is likely to want to use special SSE/AVX instructions (eg
the crc and crypto ones) rather than generic FP calculations.
As such, just saving two or three of the AVX registers would suffice.
This could be done using a small on-stack save structure that
can be referenced from the process's save area so that any IPI
can copy over the correct values after saving the full state.

This would allow kernel code (including interrupts) to execute
some AVX instructions without having to save the entire cpu
extended state anywhere.

I suspect there is a big hole in the above though...
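For reference, a minimal sketch of the existing x86 guard pattern the
original question started from (the scalar fallback is assumed):

	#include <asm/fpu/api.h>

	if (irq_fpu_usable()) {
		kernel_fpu_begin();
		/* SSE/AVX work, e.g. a CRC or crypto kernel */
		kernel_fpu_end();
	} else {
		/* scalar fallback, needed because this can run in softirq */
	}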

David


RE: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-11-04 Thread David Laight
From: Al Viro
> Sent: 30 October 2015 21:10
> On Fri, Oct 30, 2015 at 05:43:21PM +, David Laight wrote:
> 
> > ISTM that the correct call should be listen(fd, 0);
> > Although that doesn't help a thread stuck in recvmsg() for a datagram.
> >
> > It is also tempting to think that close(fd) should sleep until all
> > io activities using that fd have completed - whether or not blocking
> > calls are woken.
> 
> Sigh...  The kernel has no idea when other threads are done with "all
> io activities using that fd" - it can wait for them to leave the
> kernel mode, but there's fuck-all it can do about e.g. a userland
> loop doing write() until there's more data to send.  And no, you can't
> rely upon them catching EBADF on the next iteration - by the time they
> get there, close() might very well have returned and open() from yet
> another thread might've grabbed the same descriptor.  Welcome to your
> data being written to hell knows what...

That just means that the application must use dup2() rather than close().
It must do that anyway since the thread it is trying to stop might be
sleeping in the system call stub in libc at the time a close() and open()
happen.
The listening (in this case) thread would need to look at its global
data to determine that it is supposed to exit, and then close the fd itself.

David




RE: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-11-02 Thread David Laight
From: casper@oracle.com
> Sent: 21 October 2015 21:33
..
> >Er...  So fd2 = dup(fd);accept(fd)/close(fd) should *not* trigger that
> >behaviour, in your opinion?  Because fd is sure as hell not the last
> >descriptor refering to that socket - fd2 remains alive and well.
> >
> >Behaviour you describe below matches the _third_ option.
> 
> >> >BTW, for real fun, consider this:
> >> >7)
> >> >// fd is a socket
> >> >fd2 = dup(fd);
> >> >in thread A: accept(fd);
> >> >in thread B: accept(fd);
> >> >in thread C: accept(fd2);
> >> >in thread D: close(fd);

If D is going to do this, it ought to be using dup2() to clone
a copy of (say) /dev/null onto 'fd'.

> >> >Which threads (if any), should get hit where it hurts?
> >>
> >> A & B should return from the accept with an error. C should
> >> continue. Which is what happens on Solaris.

That seems to completely break the normal *nix file scheme...

> >> To this end each thread keeps a list of file descriptors
> >> in use by the current active system call.
> >

Remember, Solaris (and SYSV) has more levels of multiplexing between
userspace and char special drivers (and probably sockets) than Linux does.
As well as having multiple fd referencing the same struct FILE, multiple
FILE can point to the same inode.
If you have two different /dev entries for the same major/minor you
also end up with separate inodes - all finally referencing the same
driver data (indexed only by minor number).
(You actually get two inodes in the chain, one for the disk filesystem
and one char-special. The ref counts on both can be greater than 1.)

David



RE: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)

2015-11-02 Thread David Laight
From: Bendik Rønning Opstad
> Sent: 23 October 2015 21:50
> RDB is a mechanism that enables a TCP sender to bundle redundant
> (already sent) data with TCP packets containing new data. By bundling
> (retransmitting) already sent data with each TCP packet containing new
> data, the connection will be more resistant to sporadic packet loss
> which reduces the application layer latency significantly in congested
> scenarios.

What sort of traffic flows do you expect this to help?

An ssh (or similar) connection will get additional data to send,
but that sort of data flow needs Nagle in order to reduce the
number of packets sent.
OTOH it might benefit from including unacked data if the Nagle
timer expires.
Being able to set the Nagle timer on a per-connection basis
(or maybe using something based on the RTT instead of 2 secs)
might make packet loss less problematic.

Data flows that already have Nagle disabled (probably anything that
isn't command-response and isn't unidirectional bulk data) are
likely to generate a lot of packets within the RTT.
Resending unacked data will just eat into available network bandwidth
and could easily make any congestion worse.

I think that means you shouldn't resend data more than once, and/or
should make sure that the resent data isn't a significant overhead
on the packet being sent.

David




RE: [PATCH RFC net-next 2/2] tcp: Add Redundant Data Bundling (RDB)

2015-11-02 Thread David Laight
From: Bendik Rønning Opstad
> Sent: 29 October 2015 22:54
...
> > > > The semantics of the tp->nonagle bits are already a bit complex. My
> > > > sense is that having a setsockopt of TCP_RDB transparently modify the
> > > > nagle behavior is going to add more extra complexity and unanticipated
> > > > behavior than is warranted given the slight possible gain in
> > > > convenience to the app writer. What about a model where the
> > > > application user just needs to remember to call
> > > > setsockopt(TCP_NODELAY) if they want the TCP_RDB behavior to be
> > > > sensible? I see your nice tests at
> > > >
> > > >   https://github.com/bendikro/packetdrill/commit/9916b6c53e33dd04329d29b
> > > >   7d8baf703b2c2ac1b> >
> > > > are already doing that. And my sense is that likewise most
> > > > well-engineered "thin stream" apps will already be using
> > > > setsockopt(TCP_NODELAY). Is that workable?
> 
> This is definitely workable. I agree that it may not be an ideal solution to
> have TCP_RDB disable Nagle, however, it would be useful with a way to easily
> enable RDB and disable Nagle.

If enabling RDB disables Nagle, then what happens when you turn RDB back off?

David



RE: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-11-06 Thread David Laight
From: Al Viro
> Sent: 04 November 2015 16:28
> On Wed, Nov 04, 2015 at 03:54:09PM +, David Laight wrote:
> > > Sigh...  The kernel has no idea when other threads are done with "all
> > > io activities using that fd" - it can wait for them to leave the
> > > kernel mode, but there's fuck-all it can do about e.g. a userland
> > > loop doing write() until there's more data to send.  And no, you can't
> > > rely upon them catching EBADF on the next iteration - by the time they
> > > get there, close() might very well have returned and open() from yet
> > > another thread might've grabbed the same descriptor.  Welcome to your
> > > data being written to hell knows what...
> >
> > That just means that the application must use dup2() rather than close().
> > It must do that anyway since the thread it is trying to stop might be
> > sleeping in the system call stub in libc at the time a close() and open()
> > happen.
> 
> Oh, _lovely_.  So instead of continuation of that write(2) going down
> the throat of something opened by unrelated thread, it (starting from a
> pretty arbitrary point) will go into the descriptor the closing thread
> passed to dup2().  Until it, in turn, gets closed, at which point we
> are back to square one.  That, of course, makes it so much better -
> whatever had I been thinking about that made me miss that?

Do I detect a note of sarcasm?

You'd open "/dev/null" (lacking a "/dev/error") and use that as the source fd.
So any writes (etc) would be discarded in a safe manner, and nothing will block.
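A minimal sketch of that idiom (error handling elided):

	int nullfd = open("/dev/null", O_RDWR);

	dup2(nullfd, fd);	/* 'fd' now refers to /dev/null, atomically */
	close(nullfd);
	/* the worker thread notices its 'shutting down' flag, stops looping
	 * and does the final close(fd) itself */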

David



RE: [GIT] Networking

2015-11-06 Thread David Laight
> From: Linus Torvalds
> Sent: 03 November 2015 20:45
> On Tue, Nov 3, 2015 at 12:05 PM, Linus Torvalds
>  wrote:
> >  result = add_overflow(
> > mul_overflow(sec, SEC_CONVERSION, &overflow),
> > mul_overflow(nsec, NSEC_CONVERSION, &overflow),
> > &overflow);
> >
> >  return overflow ? MAX_JIFFIES : result;
> 
> Thinking more about this example, I think the gcc interface for
> multiplication overflow is fine.
> 
> It would end up something like
> 
> if (mul_overflow(sec, SEC_CONVERSION, &sec))
> return MAX_JIFFY_OFFSET;
> if (mul_overflow(nsec, NSEC_CONVERSION, &nsec))
> return MAX_JIFFY_OFFSET;
> sum = sec + nsec;
> if (sum < sec || sum > MAX_JIFFY_OFFSET)
> return MAX_JIFFY_OFFSET;
> return sum;
> 
> and that doesn't look horribly ugly to me.

If mul_overflow() is a real function you've just forced some of the
values out to memory, generating a 'clobber' for all memory
(unless 'strict-aliasing' is enabled) and making a mess of other
optimisations.
(If it is a static inline that might not happen.)
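For what it's worth, a sketch of keeping it as a static inline over the
gcc 5+ builtin, so the operands can stay in registers:

	static inline bool mul_overflow(unsigned long a, unsigned long b,
					unsigned long *res)
	{
		return __builtin_mul_overflow(a, b, res);
	}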

If you assume that no one is stupid enough to multiply very large
values by 1 and not get an error you could have mul_overflow()
return the largest prime if the multiply overflowed.

David



RE: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-30 Thread David Laight
From: David Holland
> Sent: 29 October 2015 14:59
> On Tue, Oct 27, 2015 at 10:52:46AM +, Alan Burlison wrote:
>  > >But in general, this is basically a problem with the application: the file
>  > >descriptor space is shared between threads and having one thread sniping
>  > >at open files, you do have a problem and whatever the kernel does in that
>  > >case perhaps doesn't matter all that much: the application needs to be
>  > >fixed anyway.
>  >
>  > The scenario in Hadoop is that the FD is being used by a thread that's
>  > waiting in accept and another thread wants to shut it down, e.g. because
>  > the application is terminating and needs to stop all threads cleanly.
> 
> ISTM that the best way to do this is to post a signal to the thread so
> accept bails with EINTR, at which point it can check to see if it's
> supposed to be exiting.

Actually, just send it a connect indication.

ISTM that the correct call should be listen(fd, 0);
Although that doesn't help a thread stuck in recvmsg() for a datagram.

It is also tempting to think that close(fd) should sleep until all
io activities using that fd have completed - whether or not blocking
calls are woken.

David



RE: [PATCH] net: hix5hd2_gmac: avoid integer overload warning

2015-10-16 Thread David Laight
From: Arnd Bergmann
> Sent: 16 October 2015 11:01
> BITS_RX_EN is an 'unsigned long' constant, so the ones complement of that
> has bits set that do not fit into a 32-bit variable on 64-bit architectures,
> which causes a harmless gcc warning:
...
>  static void hix5hd2_port_disable(struct hix5hd2_priv *priv)
>  {
> - writel_relaxed(~(BITS_RX_EN | BITS_TX_EN), priv->base + PORT_EN);
> + writel_relaxed(~(u32)(BITS_RX_EN | BITS_TX_EN), priv->base + PORT_EN);
>   writel_relaxed(0, priv->base + DESC_WR_RD_ENA);

ISTM that just means that the constants shouldn't be 'long'.
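i.e. something like this (bit positions illustrative, not the real
register layout):

	#define BITS_RX_EN	(1U << 1)	/* plain 32-bit constants */
	#define BITS_TX_EN	(1U << 2)

	writel_relaxed(~(BITS_RX_EN | BITS_TX_EN), priv->base + PORT_EN);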

David



RE: [PATCH net-next v3] bnxt_en: New Broadcom ethernet driver.

2015-10-20 Thread David Laight
From: Stephen Hemminger
...
> On Sat, 17 Oct 2015 00:21:44 -0400
> Michael Chan  wrote:
> 
> > +static bool bnxt_vf_pciid(int idx)
> > +{
> > +   if (idx == BCM57304_VF || idx == BCM57404_VF)
> > +   return true;
> > +   return false;
> > +}
> > +
> 
> I prefer just returning result of logical operation
> rather than adding unnecessary if statement.
> And never use (signed) int when unsigned is the real
> data type. Also avoid any unnecessary expansion of 16 bit
> value.

Actually 16-bit (and 8-bit) values always have to be extended to 'int'
before any arithmetic operations.
On systems with 16-bit arithmetic instructions (like x86) the
compiler can only use them if it can determine that the result would be
the same as if the values had been extended.

This can mean a lot of extra instructions if you do arithmetic
on 16-bit values the compiler is holding in registers.

Similarly using 16-bit values as function parameters/results
can require additional masking instructions.

IMHO This really all means that you should use [unsigned] int for
local variables even when the domain of the value is known
to be much smaller.
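A small illustration of the extra work the compiler may be forced into:

	u16 a, b, c;
	c = a + b;	/* 'a' and 'b' are promoted to int; keeping 'c' in a
			 * register may need an extra mask or extension */

	unsigned int x, y, z;
	z = x + y;	/* no masking or extension needed */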

David


RE: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-21 Thread David Laight
From: Alan Burlison
> Sent: 20 October 2015 19:31
...
> The problem with poll() is that it returns immediately when passed a FD
> that is in the listening state. rather than waiting until there's an
> incoming connection to handle. As I said, that means you can't use
> poll() to multiplex between read/write/accept sockets.

That seems to work for me...
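e.g. the usual pattern (sketch):

	struct pollfd pfd = { .fd = listen_fd, .events = POLLIN };
	int client_fd;

	if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN))
		client_fd = accept(listen_fd, NULL, NULL);
	/* POLLIN is only reported once a connection is ready to accept() */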

David



RE: next build: 235 warnings 3 failures (next/next-20151117)

2015-11-18 Thread David Laight
From: Will Deacon
> Sent: 18 November 2015 12:28
> On Wed, Nov 18, 2015 at 12:11:25PM +, David Laight wrote:
> > From: Will Deacon
> > > Sent: 18 November 2015 10:14
> > > On Tue, Nov 17, 2015 at 08:17:17PM +0100, Arnd Bergmann wrote:
> > > > On Tuesday 17 November 2015 17:12:37 Will Deacon wrote:
> > > > > On Tue, Nov 17, 2015 at 06:03:40PM +0100, Arnd Bergmann wrote:
> > > > > > On Tuesday 17 November 2015 16:44:53 Will Deacon wrote:
> > > > > > > > 8<
> > > > > > > > Subject: ARM64: make smp_load_acquire() work with const 
> > > > > > > > arguments
> > > > > > > >
> > > > > > > > smp_load_acquire() uses typeof() to declare a local variable 
> > > > > > > > for temporarily
> > > > > > > > storing the output of the memory access. This fails when the 
> > > > > > > > argument is
> > > > > > > > constant, because the assembler complains about using a 
> > > > > > > > constant register
> > > > > > > > as output:
> > > > > > > >
> > > > > > > >  arch/arm64/include/asm/barrier.h:71:3: error: read-only 
> > > > > > > > variable '___p1'
> > > > > > > >  used as 'asm' output
> > > > > > >
> > > > > > > Do you know the usage in the kernel causing this warning?
> > > > > >
> > > > > > A newly introduced function in include/net/sock.h:
> > > > > >
> > > > > > static inline int sk_state_load(const struct sock *sk)
> > > > > > {
> > > > > > return smp_load_acquire(&sk->sk_state);
> > > > > > }
> > > > >
> > > > > Hmm, maybe we could play a similar trick to READ_ONCE by declaring an
> > > > > anonymous union and writing through the non-const member?
> > > >
> > > > Yes, I think that would work, if you think we need to care about the
> > > > case where we read into a structure.
> > > >
> > > > Can you come up with a patch for that?
> > >
> > > Done:
> > >
> > >   
> > > http://lists.infradead.org/pipermail/linux-arm-kernel/2015-November/386094.html
> >
> > That patch forces a memory write-read and returns uninitialised stack
> > for short reads.
> 
> Really? The disassembly looks fine to me. Do you have a concrete example
> of where you think it goes wrong, please?
> 
> > Who knows what happens on big-endian systems.
> 
> The same thing as READ_ONCE? I'll test it there to make sure, but I
> don't see a problem.

Ah, god, it is absolutely horrid. But probably right :-(

Do all the lda variants zero extend to 64 bits ?
If so maybe you could use a single 64 bit variable for the result of the read
and then cast it to typeof(*p) to get the required sign extension for
small integer types.

David



RE: next build: 235 warnings 3 failures (next/next-20151117)

2015-11-18 Thread David Laight
From: Will Deacon [mailto:will.dea...@arm.com]
> Sent: 18 November 2015 15:37
> On Wed, Nov 18, 2015 at 03:21:19PM +, David Laight wrote:
> > From: Will Deacon
> > > Sent: 18 November 2015 12:28
> > > On Wed, Nov 18, 2015 at 12:11:25PM +, David Laight wrote:
> > > > From: Will Deacon
> > > > >   
> > > > > http://lists.infradead.org/pipermail/linux-arm-kernel/2015-November/386094.html
> > > >
> > > > That patch forces a memory write-read and returns uninitialised stack
> > > > for short reads.
> > >
> > > Really? The disassembly looks fine to me. Do you have a concrete example
> > > of where you think it goes wrong, please?
> > >
> > > > Who knows what happens on big-endian systems.
> > >
> > > The same thing as READ_ONCE? I'll test it there to make sure, but I
> > > don't see a problem.
> >
> > Ah, god, it is absolutely horrid. But probably right :-(
> 
> Yeah, I wasn't pretending it was nice :) FWIW, I've given it a reasonable
> testing in both little-endian and big-endian configurations and it seems
> to be happy.

I was missing the fact that the *(int_type *) is always reading
the full union.
The next version of the compiler might decide to barf at the code
that appears to be reading beyond the end of the union.

> > Do all the lda variants zero extend to 64 bits ?
> 
> Yes.
> 
> > If so maybe you could use a single 64 bit variable for the result of the 
> > read
> > and then cast it to typeof(*p) to get the required sign extension for
> > small integer types.
> 
> That was the original proposal from Arnd, but I want this to work with
> structures smaller than 64-bit (e.g. arch_spinlock_t), so that's why
> I decided to follow the approach laid down by READ_ONCE.

That would still be ok.
You'd have something that is effectively:
	u64 val = *p;
	return (typeof(*p))val;
The compiler might mask unsigned values - but it may be able to
determine it isn't needed (which is probably true of your version).
For signed types both versions require the compiler to sign-extend
the value.

David



RE: [PATCH] net/core: use netdev name in warning if no parent

2015-11-17 Thread David Laight
From: Bjørn Mork
> Sent: 16 November 2015 18:17
> A recent flaw in the netdev feature setting resulted in warnings
> like this one from VLAN interfaces:
> 
>  WARNING: CPU: 1 PID: 4975 at net/core/dev.c:2419 
> skb_warn_bad_offload+0xbc/0xcb()
>  : caps=(0x001b5820, 0x001b5829) len=2782 data_len=0 
> gso_size=1348 gso_type=16
> ip_summed=3
> 
> The ":" is supposed to be preceded by a driver name, but in this
> case it is an empty string since the device has no parent.
...

There'll also be no name if skb->dev is NULL.
There may not be anything interesting to print instead, but something
other than "" would be more useful.

David




RE: [PATCH net-next 3/5] net: Add helper function to compare inetpeer addresses

2015-08-28 Thread David Laight
From: David Ahern
 Sent: 27 August 2015 22:17
 
 tcp_metrics and inetpeer both have functions to compare inetpeer
 addresses. Consolidate into 1 version.
 
 Signed-off-by: David Ahern d...@cumulusnetworks.com
 ---
...
 diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h
 index f75b9e7036a2..9d9b3446731d 100644
 --- a/include/net/inetpeer.h
 +++ b/include/net/inetpeer.h
 @@ -121,6 +121,22 @@ static inline struct inet_peer *inet_getpeer_v6(struct 
 inet_peer_base *base,
   return inet_getpeer(base, &daddr, create);
  }
 
 +static inline int inetpeer_addr_cmp(const struct inetpeer_addr *a,
 + const struct inetpeer_addr *b)
 +{
 + int i, n = (a->family == AF_INET ? 1 : 4);
 +
 + for (i = 0; i < n; i++) {
 + if (a->addr.a6[i] == b->addr.a6[i])
 + continue;
 + if ((__force u32)a->addr.a6[i] < (__force u32)b->addr.a6[i])
 + return -1;
 + return 1;
 + }
 +
 + return 0;
 +}

If the performance of this matters then I'd not use the loop for IPv4
and use u64 comparisons (esp. on 64 bit systems) in an unrolled loop for IPv6.
(Might need to worry about the alignment.)

I presume nothing cares that the ordering relation is endian dependent?
With either byteswapping memory reads or a byteswapping instruction,
an endian-independent ordering should be almost as quick.
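A sketch of the unrolled IPv6 comparison (assuming suitable alignment;
the ordering stays endian dependent, as noted):

	const u64 *pa = (const u64 *)a->addr.a6;
	const u64 *pb = (const u64 *)b->addr.a6;

	if (pa[0] != pb[0])
		return pa[0] < pb[0] ? -1 : 1;
	if (pa[1] != pb[1])
		return pa[1] < pb[1] ? -1 : 1;
	return 0;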

David



RE: [GIT] Networking

2015-09-04 Thread David Laight
> I find them useful as syntactic sugar. We have not used them a lot, but there 
> are cases in our crypto
> handling code where we have fixed size array inputs/outputs and there we 
> opted to use them. They make
> it easy to remember what the expected sizes of input and output are without 
> having to read through the
> implementation (of course we never even tried to use sizeof on these 
> pointers).
> 
> static int smp_ah(struct crypto_blkcipher *tfm, const u8 irk[16],
>   const u8 r[3], u8 res[3])

Except that it looks like you are passing arrays by value,
when instead you are passing by reference.

Explicitly pass by reference and sizeof works.
The object code will be the same.
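i.e. (illustration only):

	static int by_value_looking(const u8 irk[16])
	{
		return sizeof(irk);	/* sizeof a pointer, not 16 */
	}

	static int by_reference(const u8 (*irk)[16])
	{
		return sizeof(*irk);	/* 16 */
	}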

David




RE: [GIT] Networking

2015-09-07 Thread David Laight
From: Rustad, Mark D
...
> >> static int smp_ah(struct crypto_blkcipher *tfm, const u8 irk[16],
> >>  const u8 r[3], u8 res[3])
> >
> > Expect that it looks like you are passing arrays by value,
> > but instead you are passing by reference.
> >
> > Explicitly pass by reference and sizeof works.
> 
> It depends on what you mean by works. It at least doesn't look so misleading 
> when passing by reference
> and so works more as expected. The sizeof in either case will never return 
> the size of the array. To
> have sizeof return the size of the array would require a typedef of the array 
> to pass by reference. In
> some cases that could be the right thing to do.

Feed:
int bar(int (*f)[10]) { return sizeof *f; }
into cc -O2 -S and look at the generated code - returns 40 not 4.

David



RE: [PATCH] x86: Wire up 32-bit direct socket calls

2015-09-03 Thread David Laight
From: Peter Anvin
> Sent: 02 September 2015 21:16
> On 09/02/2015 02:48 AM, Geert Uytterhoeven wrote:
> >
> > Should all other architectures follow suit?
> > Or should we follow the s390 approach:
> >
> 
> It is up to the maintainer(s), largely dependent on how likely you are
> going to want to support this in your libc, but in general, socketcall
> is an abomination which there is no reason not to bypass.

The other (worse) abomination is the way SCTP overloads setsockopt()
to perform actions that change state.
Rather unfortunately that got documented in the protocol standard :-(

David


RE: [PATCH] mwifiex: Make mwifiex_dbg a function, reduce object size

2015-09-01 Thread David Laight
From: Joe Perches
> Sent: 31 August 2015 18:47
> 
> The mwifiex_dbg macro has two tests that could be consolidated
> into a function reducing overall object size ~10KB (~4%).
> 
> So convert the macro into a function.

This looks like it will slow things down somewhat.
Maybe inline the tests and then do a call for the dev_info().
Or at least inline the test for the debug flag.
So you end up with something like:

#define mwifiex_dbg(adapter, dbg_mask, fmt, ...)			\
do {									\
	if ((adapter)->debug_mask & MWIFIEX_DBG_##dbg_mask)		\
		_mwifiex_dbg(adapter, fmt, ##__VA_ARGS__);		\
} while (0)

David

> 
> $ size drivers/net/wireless/mwifiex/built-in.o* (x86-64 defconfig)
>text  data bss dec hex filename
>  233102  86284809  246539   3c30b 
> drivers/net/wireless/mwifiex/built-
> in.o.new
>  243949  86284809  257386   3ed6a 
> drivers/net/wireless/mwifiex/built-
> in.o.old
> 
> Signed-off-by: Joe Perches 
> ---
>  drivers/net/wireless/mwifiex/main.c | 20 
>  drivers/net/wireless/mwifiex/main.h | 17 -
>  2 files changed, 28 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/wireless/mwifiex/main.c 
> b/drivers/net/wireless/mwifiex/main.c
> index 278dc94..4fa8ca3 100644
> --- a/drivers/net/wireless/mwifiex/main.c
> +++ b/drivers/net/wireless/mwifiex/main.c
> @@ -1447,6 +1447,26 @@ exit_sem_err:
>  }
>  EXPORT_SYMBOL_GPL(mwifiex_remove_card);
> 
> +void _mwifiex_dbg(const struct mwifiex_adapter *adapter, int mask,
> +   const char *fmt, ...)
> +{
> + struct va_format vaf;
> + va_list args;
> +
> + if (!adapter->dev || !(adapter->debug_mask & mask))
> + return;
> +
> + va_start(args, fmt);
> +
> + vaf.fmt = fmt;
> + vaf.va = &args;
> +
> + dev_info(adapter->dev, "%pV", &vaf);
> +
> + va_end(args);
> +}
> +EXPORT_SYMBOL_GPL(_mwifiex_dbg);
> +
>  /*
>   * This function initializes the module.
>   *
> diff --git a/drivers/net/wireless/mwifiex/main.h 
> b/drivers/net/wireless/mwifiex/main.h
> index 6b95121..96663214 100644
> --- a/drivers/net/wireless/mwifiex/main.h
> +++ b/drivers/net/wireless/mwifiex/main.h
> @@ -48,6 +48,9 @@
> 
>  extern const char driver_version[];
> 
> +struct mwifiex_adapter;
> +struct mwifiex_private;
> +
>  enum {
>   MWIFIEX_ASYNC_CMD,
>   MWIFIEX_SYNC_CMD
> @@ -180,12 +183,11 @@ enum MWIFIEX_DEBUG_LEVEL {
>   MWIFIEX_DBG_FATAL | \
>   MWIFIEX_DBG_ERROR)
> 
> -#define mwifiex_dbg(adapter, dbg_mask, fmt, args...) \
> -do { \
> - if ((adapter)->debug_mask & MWIFIEX_DBG_##dbg_mask) \
> - if ((adapter)->dev) \
> - dev_info((adapter)->dev, fmt, ## args); \
> -} while (0)
> +__printf(3, 4)
> +void _mwifiex_dbg(const struct mwifiex_adapter *adapter, int mask,
> +   const char *fmt, ...);
> +#define mwifiex_dbg(adapter, mask, fmt, ...) \
> + _mwifiex_dbg(adapter, MWIFIEX_DBG_##mask, fmt, ##__VA_ARGS__)
> 
...


RE: [PATCH net] sctp: support global vtag assochash and per endpoint s(d)port assochash table

2015-09-01 Thread David Laight
From: Xin Long
> Sent: 31 August 2015 18:44
>
> for telecom center, the usual case is that a server is connected by thousands
> of clients. but if the server with only one enpoint(udp style) use the same
> sport and dport to communicate with every clients, and every assoc in server
> will be hashed in the same chain of global assoc hashtable due to currently we
> choose dport and sport as the hash key.

I'm not sure the above is true...

Although a server might be connected to 1000s of clients, if the sctp
'application' is M3UA then it is very likely that the source and
destination ports for all the associations are all the same (2905).

An M3UA application might also be written to use a single socket
(eg: in order to reduce the number of fds and/or poll() list size).

David



RE: [PATCH net] sctp: fix race on protocol/netns initialization

2015-09-10 Thread David Laight
From: Marcelo Ricardo Leitner
> Sent: 10 September 2015 13:54
> Em 09-09-2015 21:16, David Miller escreveu:
> > From: Marcelo Ricardo Leitner 
> > Date: Wed,  9 Sep 2015 17:03:01 -0300
> >
> >> So the fix then is to invert the initialization order inside
> >> register_pernet_subsys() so that the control socket is created by last
> >> and also block socket creation if netns initialization wasn't yet
> >> performed.
> >
> > If we really need to we could make ->create() fail with -EAFNOSUPPORT
> > if kern==1 until the protocol is fully setup.
> >
> > Or, instead of failing, we could make such ->create() calls block
> > until the control sock init is complete or fails.
> 
> I guess I should have written that paragraph in another order, perhaps like:
> So the fix then is to deny any sctp socket creation until netns
> initialization is sufficiently done. And due to that, we have to
> initialize the control socket as last step in netns initialization, as
> now it can't be created earlier anymore.
> 
> Is it clearer on the intention?
> 
> And my emphasis on userspace sockets was to highlight that a random user
> could trigger it, but yes both users are affected by the issue.
> 
> Strictly speaking, we would have to block ->create() not until the
> control socket init is done but until the protocol is fully loaded. Such
> condition, with this patch, is after net->sctp.auto_asconf_splist is
> initialized. But for blocking until instead of just denying, we would
> need some other mechanism.
> 
> It would be better from the (sctp) user point of view but then such
> solution may better belong to another layer instead and protect all
> protocols at once. (I checked and couldn't find other protocols at risk
> like sctp)

Given that the first ->create() blocks while the protocol code loads
it really wouldn't be right to error a subsequent ->create() because
the load hasn't completed.

I hold a semaphore across sock_create_kern() because of issues with sctp.
(Current kernels are nowhere near as bad as really old ones though.)
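The workaround is just (sketch; the mutex name is illustrative, and recent
kernels take the netns as the first argument):

	static DEFINE_MUTEX(sctp_create_mutex);

	mutex_lock(&sctp_create_mutex);
	err = sock_create_kern(net, PF_INET, SOCK_STREAM, IPPROTO_SCTP, &sock);
	mutex_unlock(&sctp_create_mutex);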

David



RE: [PATCH net] sctp: fix race on protocol/netns initialization

2015-09-10 Thread David Laight
From: Marcelo Ricardo 
> Sent: 10 September 2015 15:36
...
> > Given that the first ->create() blocks while the protocol code loads
> > it really wouldn't be right to error a subsequent ->create() because
> > the load hasn't completed.
> 
> Can't say I don't agree with you, but at the same time, there are other
> temporary errors that can happen and that the user should just retry.
> This would be just another condition in a trade off for avoiding complexity.

We do retry, but the delay messes up our test scripts :-(

> > I hold a semaphore across sock_create_kern() because of issues with sctp.
> > (Current kernels are nowhere near as bad as really old ones though.)
> 
> Oh, that's not good to hear. I'll experiment with that later, try to
> catch some bugs. :)

I mean REALLY old - like 2.6.12 (FC3).
I'm pretty sure I've seen oops as well as create failing.

We don't create enough sockets for the semaphore to be a problem.

OTOH I've a current problem with a customer using RHEL5.8 (basically 2.6.18).
They might manage to move to RHEL6 (2.6.32) - but that could take a year or two.
RH might be pulling some of the SCTP fixes, but I doubt they get priority.

David



RE: [PATCH net-next RFC] net: increase LL_MAX_HEADER for Hyper-V

2015-09-16 Thread David Laight
From: Haiyang Zhang
> Sent: 16 September 2015 17:09
> > -Original Message-
> > From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
> > Sent: Wednesday, September 16, 2015 11:50 AM
> > To: netdev@vger.kernel.org
> > Cc: David S. Miller ; linux-ker...@vger.kernel.org;
> > KY Srinivasan ; Haiyang Zhang
> > ; Jason Wang 
> > Subject: [PATCH net-next RFC] net: increase LL_MAX_HEADER for Hyper-V
> >
> > Commit b08cc79155fc26d0d112b1470d1ece5034651a4b ("hv_netvsc: Eliminate
> >  memory allocation in the packet send path") introduced skb headroom
> > request for Hyper-V netvsc driver:
> >
> >max_needed_headroom = sizeof(struct hv_netvsc_packet) +
> >sizeof(struct rndis_message) +
> >NDIS_VLAN_PPI_SIZE + NDIS_CSUM_PPI_SIZE +
> >NDIS_LSO_PPI_SIZE + NDIS_HASH_PPI_SIZE;
> >...
> >net->needed_headroom = max_needed_headroom;
> >
> > max_needed_headroom is 220 bytes, it significantly exceeds the
> > LL_MAX_HEADER setting. This causes each skb to be cloned on send path,
> > e.g. for IPv4 case we fall into the following clause
> > (ip_finish_output2()):
> >
> > if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
> > ...
> > skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));
> > ...
> > }
> >
> > leading to a significant performance regression. Increase LL_MAX_HEADER
> > to make it suitable for netvsc, make it 224 to be 16-aligned.
> > Alternatively we could (partially) revert the commit which introduced
> > skb
> > headroom request restoring manual memory allocation on transmit path.
> >
> > Signed-off-by: Vitaly Kuznetsov 
> > ---
> >  include/linux/netdevice.h | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 88a0069..7233790 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -132,7 +132,9 @@ static inline bool dev_xmit_complete(int rc)
> >   * used.
> >   */
> >
> > -#if defined(CONFIG_WLAN) || IS_ENABLED(CONFIG_AX25)
> > +#if IS_ENABLED(CONFIG_HYPERV_NET)
> > +# define LL_MAX_HEADER 224
> > +#elif defined(CONFIG_WLAN) || IS_ENABLED(CONFIG_AX25)
> >  # if defined(CONFIG_MAC80211_MESH)
> >  #  define LL_MAX_HEADER 128
> >  # else
> 
> Thanks for the patch.
> To avoid we forget to update that 224 number when we add more things
> into netvsc header, I suggest that we define a macro in netdevice.h such
> as:
> #define HVNETVSC_MAX_HEADER 224
> #define LL_MAX_HEADER HVNETVSC_MAX_HEADER
> 
> And, put a note in netvsc code saying the header reservation shouldn't
> exceed HVNETVSC_MAX_HEADER, or you need to update HVNETVSC_MAX_HEADER.

Am I right in thinking this is adding an extra 96 unused bytes to the front
of almost all skb just so that hyper-v can make its link level header
contiguous with whatever follows (IP header ?).

Doesn't sound ideal.

David



RE: PATCH: netdev: add a cast NLMSG_OK to avoid a GCC warning in users' code

2015-09-11 Thread David Laight
From: D. Hugh Redelmeier
> Sent: 09 September 2015 21:24
...
> 2) if you use the type "unsigned int" on a 32-bit machine, you get the
>warning for an earlier conjunct:
> 
> #define NLMSG_OK(nlh,len) ((len) >= (int)sizeof(struct nlmsghdr) && \
>  (nlh)->nlmsg_len >= sizeof(struct nlmsghdr) && \
>  (nlh)->nlmsg_len <= (len))
> 
>(len) >= (int)sizeof(struct nlmsghdr)  <=== unsigned >= int
> 
> 3) on a 32-bit machine, size_t is likely "unsigned int" so the
>same problem as (2) should arise.
> 
> 4) on a 64-bit machine with 64-bit ints, the same problems are likely.
>I don't have one to test on.
> 
> Casting to "short" or "unsigned short" works, but I don't know that
> the value is guaranteed to fit in either of them.

Why not cast (nlh)->nlmsg_len instead?
Or subtract the two values and compare against zero?
Perhaps:
(typeof (len))(nlh)->nlmsg_len <= (len)
which is almost certainly safe unless 'len' is 'signed char'.

David



RE: [PATCH net-next RFC] net: increase LL_MAX_HEADER for Hyper-V

2015-09-17 Thread David Laight
From: KY Srinivasan
> Sent: 16 September 2015 23:58
...
> > I think we get that.  The question is does the Remote NDIS header and
> > packet info actually need to be a part of the header data?  I would
> > argue that it probably doesn't.
> >
> > So for example in netvsc_start_xmit it looks like you are calling
> > init_page_array in order to populate a set of page buffers, but the
> > first buffer for the Remote NDIS protocol is populated as a separate
> > page and offset.  As such it doesn't seem like it necessarily needs to
> > be a part of the header data but could be maintained perhaps in a
> > separate ring buffer, or perhaps just be a separate page that you break
> > up to use for each header.
> 
> You are right; the rndis header can be built as a separate fragment and sent.
> Indeed this is what we were doing earlier - on the outgoing path we would 
> allocate
> memory for the rndis header. My goal was to avoid this allocation on every 
> packet being
> sent and I decided to use the headroom instead. If we can completely avoid 
> all memory
> allocation for rndis header, it makes a significant perf difference:
...


So just preallocate the header space as a fixed buffer for each ring entry
(or tx frame).

If you allocate a fixed buffer for each ring entry you may find there are
performance gains from copying small fragments into the buffer instead
of doing whatever mapping operations are required.
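i.e. something along these lines (sketch; the names and the header-building
helper are hypothetical):

	/* one coherent allocation at ring setup, one HDR_SLOT per entry */
	hdr_area = dma_alloc_coherent(dev, num_entries * HDR_SLOT,
				      &hdr_area_dma, GFP_KERNEL);

	/* per packet: no allocation and no extra mapping */
	hdr     = hdr_area     + entry * HDR_SLOT;
	hdr_dma = hdr_area_dma + entry * HDR_SLOT;
	build_rndis_header(hdr, skb);	/* hypothetical helper */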

David



RE: [PATCH v4 net-next 0/2] ipv4: Hash-based multipath routing

2015-09-29 Thread David Laight
From: Peter Nørlund
> Sent: 29 September 2015 12:29
...
> As for using L4 hashing with anycast, CloudFlare apparently does L4
> hashing - they could have disabled it, but they didn't. Besides,
> analysis of my own load balancers showed that only one in every
> 500,000,000 packets is fragmented. And even if I hit a fragmented
> packet, it is only a problem if the packet hits the wrong load
> balancer, and if that load balancer haven't been updated with the state
> from another load balancer (that is, one of the very first packets). It
> is still a possible scenario though - especially with large HTTP
> cookies or file uploads. But apparently it is a common problem that IP
> fragments gets dropped on the Internet, so I suspect that ECMP+Anycast
> sites are just part of the pool of problematic sites for people with
> fragments.

Fragmentation is usually more of an issue with UDP than TCP.
Some SIP messages can get fragmented...

David



RE: [PATCH v2] net: sctp: Don't use 64 kilobyte lookup table for four elements

2015-09-28 Thread David Laight
From: Eric Dumazet
> Sent: 28 September 2015 15:27
> On Mon, 2015-09-28 at 14:12 +, David Laight wrote:
> > From: Neil Horman
> > > Sent: 28 September 2015 14:51
> > > On Mon, Sep 28, 2015 at 02:34:04PM +0200, Denys Vlasenko wrote:
> > > > Seemingly innocuous sctp_trans_state_to_prio_map[] array
> > > > is way bigger than it looks, since
> > > > "[SCTP_UNKNOWN] = 2" expands into "[0x] = 2" !
> > > >
> > > > This patch replaces it with switch() statement.
> >
> > What about just adding 1 (and masking) before indexing the array?
> > That might require a static inline function with a local static array.
> >
> > Or define the array as (say) [16] and just mask the state before using
> > it as an index?
> 
> Just let the compiler do its job, instead of obfuscating source.
> 
> Compilers can transform a switch into an (optimal) table if it is really
> a gain.

The compiler can choose between a jump table and nested ifs for a switch
statement. I've never seen it convert one into a data array index.

David



RE: [PATCH v2] net: sctp: Don't use 64 kilobyte lookup table for four elements

2015-09-28 Thread David Laight
From: Neil Horman
> Sent: 28 September 2015 14:51
> On Mon, Sep 28, 2015 at 02:34:04PM +0200, Denys Vlasenko wrote:
> > Seemingly innocuous sctp_trans_state_to_prio_map[] array
> > is way bigger than it looks, since
> > "[SCTP_UNKNOWN] = 2" expands into "[0x] = 2" !
> >
> > This patch replaces it with switch() statement.

What about just adding 1 (and masking) before indexing the array?
That might require a static inline function with a local static array.

Or define the array as (say) [16] and just mask the state before using
it as an index?
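i.e. something like (sketch; the priority values are illustrative, and
SCTP_UNKNOWN is 0xffff so it masks down to 0xf):

	static const u8 sctp_trans_state_to_prio_map[16] = {
		[SCTP_ACTIVE & 0xf]	= 3,
		[SCTP_UNKNOWN & 0xf]	= 2,
		[SCTP_PF & 0xf]		= 1,
		[SCTP_INACTIVE & 0xf]	= 0,
	};

	prio = sctp_trans_state_to_prio_map[state & 0xf];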

David



RE: [PATCH V4 1/2] ACPI / EC: Fix broken 64bit big-endian users of 'global_lock'

2015-09-28 Thread David Laight
From: James Bottomley [mailto:james.bottom...@hansenpartnership.com]
> Sent: 28 September 2015 15:27
> On Mon, 2015-09-28 at 08:58 +, David Laight wrote:
> > From: Rafael J. Wysocki
> > > Sent: 27 September 2015 15:09
> > ...
> > > > > Say you have three adjacent fields in a structure, x, y, z, each one 
> > > > > byte long.
> > > > > Initially, all of them are equal to 0.
> > > > >
> > > > > CPU A writes 1 to x and CPU B writes 2 to y at the same time.
> > > > >
> > > > > What's the result?
> > > >
> > > > I think every CPU's  cache architecure guarantees adjacent store
> > > > integrity, even in the face of SMP, so it's x==1 and y==2.  If you're
> > > > thinking of old alpha SMP system where the lowest store width is 32 bits
> > > > and thus you have to do RMW to update a byte, this was usually fixed by
> > > > padding (assuming the structure is not packed).  However, it was such a
> > > > problem that even the later alpha chips had byte extensions.
> >
> > Does linux still support those old Alphas?
> >
> > The x86 cpus will also do 32bit wide rmw cycles for the 'bit' operations.
> 
> That's different: it's an atomic RMW operation.  The problem with the
> alpha was that the operation wasn't atomic (meaning that it can't be
> interrupted and no intermediate output states are visible).

It is only atomic if prefixed by the 'lock' prefix.
Normally the read and write are separate bus cycles.
 
> > You still have to ensure the compiler doesn't do wider rmw cycles.
> > I believe the recent versions of gcc won't do wider accesses for volatile 
> > data.
> 
> I don't understand this comment.  You seem to be implying gcc would do a
> 64 bit RMW for a 32 bit store ... that would be daft when a single
> instruction exists to perform the operation on all architectures.

Read the object code and weep...
It is most likely to happen for operations that are rmw (eg bit set).
For instance the arm cpu has limited offsets for 16bit accesses, for
normal structures the compiler is likely to use a 32bit rmw sequence
for a 16bit field that has a large offset.
The C language allows the compiler to do it for any access (IIRC including
volatiles).

David



RE: [PATCH V4 1/2] ACPI / EC: Fix broken 64bit big-endian users of 'global_lock'

2015-09-28 Thread David Laight
From: James Bottomley 
> Sent: 28 September 2015 16:12
> > > > The x86 cpus will also do 32bit wide rmw cycles for the 'bit' 
> > > > operations.
> > >
> > > That's different: it's an atomic RMW operation.  The problem with the
> > > alpha was that the operation wasn't atomic (meaning that it can't be
> > > interrupted and no intermediate output states are visible).
> >
> > It is only atomic if prefixed by the 'lock' prefix.
> > Normally the read and write are separate bus cycles.
> 
> The essential point is that x86 has atomic bit ops and byte writes.
> Early alpha did not.

Early alpha didn't have any byte accesses.

On x86 if you have the following:
	struct {
		char  a;
		volatile char b;
	} *foo;
	foo->a |= 4;

The compiler is likely to generate a 'bis #4, 0(rbx)' (or similar)
and the cpu will do two 32bit memory cycles that read and write
the 'volatile' field 'b'.
(gcc definitely used to do this...)

A lot of fields were made 32bit (and probably not bitfields) in the linux
kernel tree a year or two ago to avoid this very problem.

> > > > You still have to ensure the compiler doesn't do wider rmw cycles.
> > > > I believe the recent versions of gcc won't do wider accesses for 
> > > > volatile data.
> > >
> > > I don't understand this comment.  You seem to be implying gcc would do a
> > > 64 bit RMW for a 32 bit store ... that would be daft when a single
> > > instruction exists to perform the operation on all architectures.
> >
> > Read the object code and weep...
> > It is most likely to happen for operations that are rmw (eg bit set).
> > For instance the arm cpu has limited offsets for 16bit accesses, for
> > normal structures the compiler is likely to use a 32bit rmw sequence
> > for a 16bit field that has a large offset.
> > The C language allows the compiler to do it for any access (IIRC including
> > volatiles).
> 
> I think you might be confusing different things.  Most RISC CPUs can't
> do 32 bit store immediates because there aren't enough bits in their
> arsenal, so they tend to split 32 bit loads into a left and right part
> (first the top then the offset).  This (and other things) are mostly
> what you see in code.  However, 32 bit register stores are still atomic,
> which is all we require.  It's not really the compiler's fault, it's
> mostly an architectural limitation.

No, I'm not talking about how 32bit constants are generated.
I'm talking about structure offsets.

David


RE: [PATCH v2 00/14] RDS: connection scalability and performance improvements

2015-10-01 Thread David Laight
From: Santosh Shilimkar
> Sent: 30 September 2015 18:24
...
> This is being addressed by simply using per bucket rw lock which makes the
> locking simple and very efficient. The hash table size is still an issue and
> I plan to address it by using re-sizable hash tables as suggested on the list.

If the hash chains are short do you need the expense of a rw lock
for each chain?
A simple spinlock may be faster.

If you use the hash chain lock for the reference count on the hashed
objects you should be able to release the lock before locking the
object itself.

David



RE: [PATCH V4 1/2] ACPI / EC: Fix broken 64bit big-endian users of 'global_lock'

2015-09-28 Thread David Laight
From: Rafael J. Wysocki
> Sent: 27 September 2015 15:09
...
> > > Say you have three adjacent fields in a structure, x, y, z, each one byte 
> > > long.
> > > Initially, all of them are equal to 0.
> > >
> > > CPU A writes 1 to x and CPU B writes 2 to y at the same time.
> > >
> > > What's the result?
> >
> > I think every CPU's  cache architecure guarantees adjacent store
> > integrity, even in the face of SMP, so it's x==1 and y==2.  If you're
> > thinking of old alpha SMP system where the lowest store width is 32 bits
> > and thus you have to do RMW to update a byte, this was usually fixed by
> > padding (assuming the structure is not packed).  However, it was such a
> > problem that even the later alpha chips had byte extensions.

Does linux still support those old Alphas?

The x86 cpus will also do 32bit wide rmw cycles for the 'bit' operations.

> OK, thanks!

You still have to ensure the compiler doesn't do wider rmw cycles.
I believe the recent versions of gcc won't do wider accesses for volatile data.

David


RE: [PATCH net-next 3/9] net: Remove e_nobufs label from ip_route_input_slow

2015-09-24 Thread David Laight
From: Eric W. Biederman
> Sent: 23 September 2015 03:15
> David Ahern  writes:
> 
> > e_nobufs has 1 user. Move setting err to -ENOBUFS for the 1 user and
> > use the goto out label instead of e_nobufs. Stepping stone patch; next
> > one moves rth code into a helper function.
> 
> Ick you are pessimizing the code.
> 
> You will almost certainly have better code generation if you hoist
> the assignment of "err = -ENOBUFS" above the rt_dst_alloc then you
> don't need to do anything in your error path except "goto out;"

Unlikely to make a difference.
The compiler is very likely to set 'err' before the conditional
branch anyway. It will use an extra register to make it all work.

David



RE: [PATCH 05/15] RDS: increase size of hash-table to 8K

2015-09-21 Thread David Laight
From: Santosh Shilimkar
> Sent: 20 September 2015 00:05
> Even with per bucket locking scheme, in a massive parallel
> system with active rds sockets which could be in excess of multiple
> of 10K, rds_bin_lookup() workload is siginificant because of smaller
> hashtable size.
> 
> With some tests, it was found that we get modest but still nice
> reduction in rds_bind_lookup with bigger bucket.
> 
>   Hashtable   Baseline(1k)Delta
>   2048:   8.28%   -2.45%
>   4096:   8.28%   -4.60%
>   8192:   8.28%   -6.46%
>   16384:  8.28%   -6.75%
> 
> Based on the data, we set 8K as the bind hash-table size.

Can't you use one of the dynamically sizing hash tables?
8k hash table entries is OTT for a lot of systems.
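
A minimal sketch of what using the resizable rhashtable might look like
here (names invented, not the RDS code):

#include <linux/rhashtable.h>

struct bound_sock {
	u32			key;
	struct rhash_head	node;
};

static const struct rhashtable_params bind_params = {
	.head_offset	= offsetof(struct bound_sock, node),
	.key_offset	= offsetof(struct bound_sock, key),
	.key_len	= sizeof(u32),
};

static struct rhashtable bind_table;

static int bind_table_init(void)
{
	/* the table starts small and is grown by the rhashtable code
	 * as entries are added, instead of a fixed 8K buckets */
	return rhashtable_init(&bind_table, &bind_params);
}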

David



RE: [patch net-next 10/14] rocker: pass "learning" value as a parameter to rocker_port_set_learning

2015-10-05 Thread David Laight
From: Jiri Pirko
> Sent: 04 October 2015 22:26
> Be consistent with the rest of the setting functions, and pass
> "learning" as a bool function parameter.
...
> diff --git a/drivers/net/ethernet/rocker/rocker_main.c 
> b/drivers/net/ethernet/rocker/rocker_main.c
> index fb7e8c2..d9329a7 100644
> --- a/drivers/net/ethernet/rocker/rocker_main.c
> +++ b/drivers/net/ethernet/rocker/rocker_main.c
> @@ -1634,6 +1634,7 @@ rocker_cmd_set_port_learning_prep(const struct 
> rocker_port *rocker_port,
> struct rocker_desc_info *desc_info,
> void *priv)
>  {
> + bool learning = *(int *)priv;
...
> static int rocker_port_set_learning(struct rocker_port *rocker_port,
> - struct switchdev_trans *trans)
> + struct switchdev_trans *trans,
> + bool learning)
> {
>   return rocker_cmd_exec(rocker_port, trans, 0,
>  rocker_cmd_set_port_learning_prep,
> -NULL, NULL, NULL);
> +&learning, NULL, NULL);

This hit my 'casting between integer pointer types' bell.
It is clearly wrong if 'sizeof (bool) != sizeof (int)'.

It is much safer to only ever cast structure types to/from 'void *'.
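
A stand-alone sketch of the hazard (not the rocker code): on a target
where sizeof(bool) == 1 and sizeof(int) == 4, '*(int *)priv' would read
three bytes beyond the bool the caller actually passed.

#include <stdbool.h>
#include <stdio.h>

static void set_learning(void *priv)
{
	bool learning = *(bool *)priv;	/* same type as the caller used */

	printf("learning=%d\n", learning);
}

int main(void)
{
	bool learning = true;

	set_learning(&learning);	/* '*(int *)priv' here would overread */
	return 0;
}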

David



RE: [PATCH v2 8/8] treewide: Remove newlines inside DEFINE_PER_CPU() macros

2015-12-07 Thread David Laight
From: Michal Marek
> Sent: 04 December 2015 15:26
> Otherwise make tags can't parse them:
> 
> ctags: Warning: arch/ia64/kernel/smp.c:60: null expansion of name pattern "\1"
...

Seems to me you need to fix ctags.

David


RE: 4.4-rc2 xfrm_lookup kasan trace

2015-12-08 Thread David Laight
From: Of David Miller
> Sent: 03 December 2015 16:59
> From: Eric Dumazet 
> Date: Mon, 30 Nov 2015 17:22:11 -0800
> 
> > @@ -2198,6 +2198,7 @@ struct dst_entry *xfrm_lookup(struct net *net, struct 
> > dst_entry *dst_orig,
> > xdst = NULL;
> > route = NULL;
> >
> > +   sk = sk_to_full_sk((struct sock *)sk);
> 
> The war against const...
> 
> I know this is the only instance where const is input, but you may want to
> consider adding the const verion of the helper anyways to avoid ugly casts
> like this.

In that case you could use something like:

#define SK_TO_FULL_SK(sk) ((typeof (sk))sk_to_full_sk(sk))

With the helper arg and result being 'const struct sock *'.

David


RE: bpf: undefined shift in __bpf_prog_run

2015-12-07 Thread David Laight
From: Dmitry Vyukov
> Sent: 04 December 2015 19:49
...
> 3.4.3
> undefined behavior
> 1 behavior, upon use of a nonportable or erroneous program construct
> or of erroneous data, for which this International Standard imposes no
> requirements
> 2 NOTE Possible undefined behavior ranges from ignoring the situation
> completely with unpredictable results, to behaving during translation
> or program execution in a documented manner characteristic of the
> environment (with or without the issuance of a diagnostic message), to
> terminating a translation or execution

While 'undefined behaviour' is allowed to include 'firing an ICBM at
the current location of the person who wrote the code' it is very
unlikely to result in anything other than an unexpected value
and the compiler making false assumptions about the value.

eg the compiler can assume this is an infinite loop:
int i;
for (i = 0; i >= 0; i++)
...
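
A stand-alone version of that example (what a given compiler does with
it is its own business - which is the point):

#include <stdio.h>

int main(void)
{
	int i;

	/* signed overflow is undefined, so the compiler may assume
	 * 'i >= 0' is always true and emit an endless loop; built with
	 * -fwrapv it terminates once i wraps to INT_MIN */
	for (i = 0; i >= 0; i++)
		;
	printf("terminated with i = %d\n", i);
	return 0;
}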

David


RE: [PATCH net] ipv6: sctp: clone options to avoid use after free

2015-12-09 Thread David Laight
> SCTP is lacking proper np->opt cloning at accept() time.
> 
> TCP and DCCP use ipv6_dup_options() helper, do the same in SCTP.
> 
> We might later factorize this code in a common helper to avoid
> future mistakes.

I'm wondering what the real impact of this and the other recent
SCTP bugs/patches is on real workloads?
We have enough trouble getting our customers to use kernels
later than the 2.6.18 based RHEL5 - without having to persuade
them to use kernels that contain very recent fixes.

David



RE: [PATCH net] ipv6: sctp: clone options to avoid use after free

2015-12-09 Thread David Laight
From: Eric Dumazet [mailto:eric.duma...@gmail.com]
> Sent: 09 December 2015 16:00
> On Wed, 2015-12-09 at 15:49 +, David Laight wrote:
> > > SCTP is lacking proper np->opt cloning at accept() time.
> > >
> > > TCP and DCCP use ipv6_dup_options() helper, do the same in SCTP.
> > >
> > > We might later factorize this code in a common helper to avoid
> > > future mistakes.
> >
> > I'm wondering what the real impact of this and the other recent
> > SCTP bugs/patches is on real workloads?
> > We have enough trouble getting our customers to use kernels
> > later that the 2.6.18 based RHEL5 - without having to persuade
> > them to use kernels that contain very recent fixes.
> 
> It all depends if your customers let (hostile ?) people run programs on
> the boxes.

If they require hostile programs I'm not worried.

But it isn't entirely clear from these oops reports what the
test program is actually doing.
Some of them might be valid scenarios.
Not that our code does anything clever.

David



RE: [PATCH (net-next.git) 13/18] stmmac: perf, remove modulo in stmmac_rx()

2015-12-09 Thread David Laight
From: Giuseppe Cavallaro
> Sent: 09 December 2015 08:38
> The indexes into the ring buffer are always incremented, and
> the entry is accessed via doing a modulo to find the "real" index.
> Modulo is an expensive operation.
> 
> This patch replaces the modulo with a simple if clamp.
> It helps improve stmmac RX path as it's being called inside RX loop.

Is dma_rx_size always a power of 2 ?
If so you can replace the % by & and remove the conditionals.

If you have changed the read and write values to indexes
then you need to be certain that the 'ring full' and 'ring empty'
cases are correctly distinguished.
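
A minimal sketch of the mask-based scheme with free-running counters
(names invented, not the stmmac code; DMA_RX_SIZE must be a power of 2):

#define DMA_RX_SIZE	256
#define RX_IDX(x)	((x) & (DMA_RX_SIZE - 1))

struct dma_desc;

struct rx_ring {
	unsigned int	head;	/* next entry to refill, never masked */
	unsigned int	tail;	/* next entry to process, never masked */
	struct dma_desc	*desc[DMA_RX_SIZE];
};

/* free-running counters keep 'full' and 'empty' distinct */
static inline unsigned int rx_ring_count(const struct rx_ring *r)
{
	return r->head - r->tail;	/* 0 == empty, DMA_RX_SIZE == full */
}

static inline struct dma_desc *rx_ring_next(struct rx_ring *r)
{
	return r->desc[RX_IDX(r->tail++)];
}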

David


RE: rhashtable: ENOMEM errors when hit with a flood of insertions

2015-12-03 Thread David Laight
From: Herbert Xu
> Sent: 03 December 2015 12:51
> On Mon, Nov 30, 2015 at 06:18:59PM +0800, Herbert Xu wrote:
> >
> > OK that's better.  I think I see the problem.  The test in
> > rhashtable_insert_rehash is racy and if two threads both try
> > to grow the table one of them may be tricked into doing a rehash
> > instead.
> >
> > I'm working on a fix.
> 
> While the EBUSY errors are gone for me, I can still see plenty
> of ENOMEM errors.  In fact it turns out that the reason is quite
> understandable.  When you pound the rhashtable hard so that it
> doesn't actually get a chance to grow the table in process context,
> then the table will only grow with GFP_ATOMIC allocations.
> 
> For me this starts failing regularly at around 2^19 entries, which
> requires about 1024 contiguous pages if I'm not mistaken.

ISTM that you should always let the insert succeed - even if it makes
the average/maximum chain length increase beyond some limit.
Any limit on the number of hashed items should have been done earlier
by the calling code.
The slight performance decrease caused by scanning longer chains
is almost certainly more 'user friendly' than an error return.

Hoping to get 1024+ contiguous VA pages does seem over-optimistic.

With a 2-level lookup you could make all the 2nd level tables
a fixed size (maybe 4 or 8 pages?) and extend the first level
table as needed.
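
A minimal sketch of that two-level layout (illustrative only, not the
rhashtable code):

#include <linux/slab.h>
#include <linux/list.h>
#include <linux/types.h>

#define L2_SHIFT	11			/* 2048 buckets per block */
#define L2_SIZE		(1 << L2_SHIFT)

struct bucket_block {
	struct hlist_head buckets[L2_SIZE];	/* 4 pages on 64-bit, fixed size */
};

struct two_level_table {
	unsigned int		 nblocks;
	struct bucket_block	**blocks;	/* only this small array is reallocated */
};

static struct hlist_head *tl_bucket(struct two_level_table *t, u32 hash)
{
	struct bucket_block *blk = t->blocks[(hash >> L2_SHIFT) % t->nblocks];

	return &blk->buckets[hash & (L2_SIZE - 1)];
}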

David


RE: ipsec impact on performance

2015-12-02 Thread David Laight
From: Sowmini Varadhan
> Sent: 01 December 2015 18:37
...
> I was using esp-null merely to not have the crypto itself perturb
> the numbers (i.e., just focus on the s/w overhead for now), but here
> are the numbers for the stock linux kernel stack
> Gbps  peak cpu util
> esp-null 1.8   71%
> aes-gcm-c-2561.6   79%
> aes-ccm-a-1280.7   96%
> 
> That trend made me think that if we can get esp-null to be as close
> as possible to GSO/GRO, the rest will follow closely behind.

That's not how I read those figures.
They imply to me that there is a massive cost for the actual encryption
(particularly for aes-ccm-a-128) - so whatever you do to the esp-null
case won't help.

One way to get a view of the cost of the encryption (and copies)
is to do the operation twice.

David



RE: ipsec impact on performance

2015-12-02 Thread David Laight
From: Sowmini Varadhan
> Sent: 02 December 2015 12:12
> On (12/02/15 11:56), David Laight wrote:
> > > Gbps  peak cpu util
> > > esp-null 1.8   71%
> > > aes-gcm-c-2561.6   79%
> > > aes-ccm-a-1280.7   96%
> > >
> > > That trend made me think that if we can get esp-null to be as close
> > > as possible to GSO/GRO, the rest will follow closely behind.
> >
> > That's not how I read those figures.
> > They imply to me that there is a massive cost for the actual encryption
> > (particularly for aes-ccm-a-128) - so whatever you do to the esp-null
> > case won't help.
> 
> I'm not a crypto expert, but my understanding is that the CCM mode
> is the "older" encryption algorithm, and GCM is the way of the future.
> Plus, I think the GCM mode has some type of h/w support (hence the
> lower cpu util)
> 
> I'm sure that crypto has a cost, not disputing that, but my point
> was that 1.8 -> 1.6 -> 0.7 is a curve with a much gentler slope than
> the 9 Gbps (clear traffic, GSO, GRO)
> -> 4 Gbps (clear, no gro, gso)
>-> 1.8 (esp-null)
> That steeper slope smells of s/w perf that we need to resolve first,
> before getting into the work of faster crypto?

That isn't the way cpu cost works.
You are getting 0.7 Gbps with aes-ccm-a-128, scale the esp-null back to
that and it would use 7/18*71 = 27% of the cpu.
So 69% of the cpu in the a-128 case is probably caused by the
encryption itself.
Even if the rest of the code cost nothing you'd not increase
above 1Gbps.

The sums for aes-gcm-c-256 are slightly better, about 15%.

Ok, things aren't quite that simple since you are probably changing
the way data flows through the system as well.

Also, what/how are you measuring cpu use?
I'm not sure anything on Linux gives you a truly accurate value
when processes are running for very short periods.

On an SMP system you also get big effects when work is switched
between cpus. I've got some tests that run a lot faster if I
put all but one of the cpus into a busy-loop in userspace
(eg: while :; do :; done)!

David




RE: [PATCH 1/1] net: sctp: dynamically enable or disable pf state

2015-12-14 Thread David Laight
From: zyjzyj2...@gmail.com
> Sent: 11 December 2015 09:06
...
> +pf_enable - INTEGER
> + Enable or disable pf state. A value of pf_retrans > path_max_retrans
> + also disables pf state. That is, one of both pf_enable and
> + pf_retrans > path_max_retrans can disable pf state. Since pf_retrans
> + and path_max_retrans can be changed by userspace application, sometimes
> + user expects to disable pf state by the value of
> + pf_retrans > path_max_retrans, but ocassionally the value of pf_retrans
> + or path_max_retrans is changed by the user application, this pf state is
> + enabled. As such, it is necessary to add this to dynamically enable
> + and disable pf state.
> +
> + 1: Enable pf.
> +
> + 0: Disable pf.
> +
> + Default: 1

You ought to say what 'pf' is short for.

David



RE: use-after-free in sctp_do_sm

2015-12-14 Thread David Laight
From: Vlad Yasevich
> Sent: 11 December 2015 18:38
...
> > Found a similar place in abort primitive handling like in this last
> > patch update, it's probably the issue you're still triggering.
> >
> > Also found another place that may lead to this use after free, in case
> > we receive a packet with a chunk that has no data.
> >
> > Oh my.. :)
> 
> Yes.  This is what I was worried about...  Anything that triggers
> a DELTE_TCB command has to return a code that we can trap.
> 
> The other way is to do what Dmitri suggested, but even there, we
> need to be very careful.

I'm always wary of anything that queues actions up for later processing.
It is far too easy (as found here) to end up processing actions
in invalid states, or to process actions in 'unusual' orders when
specific events happen close together.

I wonder how much fallout there'd be from getting the sctp code
to immediately action things, instead of queuing the actions for later.
It would certainly remove a lot of the unusual combinations of events.

David




RE: [PATCH net] ipv6: sctp: clone options to avoid use after free

2015-12-10 Thread David Laight
From: Daniel Borkmann
> Sent: 09 December 2015 19:19
> On 12/09/2015 06:11 PM, Marcelo Ricardo Leitner wrote:
> > Em 09-12-2015 14:31, David Laight escreveu:
> >> From: Eric Dumazet [mailto:eric.duma...@gmail.com]
> >>> Sent: 09 December 2015 16:00
> >>> On Wed, 2015-12-09 at 15:49 +, David Laight wrote:
> >>>>> SCTP is lacking proper np->opt cloning at accept() time.
> >>>>>
> >>>>> TCP and DCCP use ipv6_dup_options() helper, do the same in SCTP.
> >>>>>
> >>>>> We might later factorize this code in a common helper to avoid
> >>>>> future mistakes.
> >>>>
> >>>> I'm wondering what the real impact of this and the other recent
> >>>> SCTP bugs/patches is on real workloads?
> >>>> We have enough trouble getting our customers to use kernels
> >>>> later that the 2.6.18 based RHEL5 - without having to persuade
> >>>> them to use kernels that contain very recent fixes.
> >>>
> >>> It all depends if your customers let (hostile ?) people run programs on
> >>> the boxes.
> >>
> >> If they require hostile programs I'm not worried.
> >
> > Not really "require", but "allow", as in: allowing third-party applications 
> > to run on it.
> 
> Yeah :/ given distros enable almost everything anyway, the first unpriv'ed
> socket(..., IPPROTO_SCTP) call auto-loads SCTP module. But to be honest, I'd
> be surprised if Cloud providers allow for this. Most of this might only run
> on dedicated boxes with telco appliances.

Yes, I'm worried about whether our M3UA code is likely to crash customer
systems, not whether hostile applications can crash it.
These boxes ought to be on private networks since the sigtran protocols
themselves have nothing that even gives a hint of security.

David


RE: [PATCH net] ipv6: sctp: clone options to avoid use after free

2015-12-10 Thread David Laight
From: Eric Dumazet
> Sent: 10 December 2015 15:58
>
> BTW, are you even using IPv6 SCTP sessions ?

Our M3UA/SCTP protocol stack supports them and defaults to using
IPv6 listening sockets for IPv4 connections.

I very much doubt than any customers have used them yet.
So most of the IPv6 connections will have been to ::1
during internal regression testing.

We don't even try to set any IPv6 (or IPv4) options.

Just SO_REUSEADDR, TCP/SCTP_NODELAY, SCTP_EVENTS, SCTP_INITMSG,
SO_KEEPALIVE (tcp), IPV6_V6ONLY (if binding separate listeners),
SCTP_SOCKOPT_BINDX_ADD (WTF is this a 'socket option') and
SO_LINGER (to get abortive close on SCTP connections on kernels
before 3.18).

David



RE: Checksum offload queries

2015-12-09 Thread David Laight
From: Edward Cree
> Sent: 09 December 2015 17:29

You also need to stop (probably outluck?) from deleting newlines and
flowing the text.

The message below is completely unreadable.

David


> On 09/12/15 16:01, Tom Herbert wrote:
> > On Wed, Dec 9, 2015 at 4:14 AM, Edward Cree  > wrote: 
> > >> Convincing hardware
> designers to go the HW_CSUM way and only fill >> in the inner checksum, when 
> their current approach
> can fill in >> both inner and outer checksums (though admittedly only for the 
> >> protocols the
> hardware knows about), might be difficult. >> > But again, 
> NETIF_F_IP[V6]_CSUM and NETIF_F_HW_CSUM
> describe > capabilities._not_ the interface. The interface currently allows 
> only > one checksum to be
> offloaded at time, if we want to be able to > offload two checksums then the 
> interface needs to be
> changed-- > probably something like defining a new capability like > 
> NETIF_F_HW_2CSUMS, adding another
> csum_start,csum_offset pair into > the sk_buff.
> Which only pushes the problem onto when someone wants to nest
> encapsulations.  (I heard you like tunnels, so I put a tunnel in your
> tunnel so you can encapsulate while you encapsulate.)
> Or to put it another way, 2 isn't a number; the only numbers are 0, 1
> and infinity ;)
> Perhaps in practice 2 csums would be enough, for now.  But isn't the
> whole point of the brave new world of generic checksums that it should
> be future-proof?
> 
> > The stack will need to be modified also wherever CHECKSUM_PARTIAL is > 
> > handled.
> Naturally.
> 
> > If your device is trying do offload more than one checksum on its own > 
> > accord without being asked
> to do so by the stack it is doing the > wrong thing!
> From the stack's perspective: yes, it is doing the wrong thing.  (I've
> been discussing with colleagues today how we could change that, and I
> think we can, but it involves having _three_ hardware TXQs per kernel
> queue, instead of the two we have now...)
> But from the outside perspective, the system as a whole isn't doing
> anything bad - the packet going on the network is valid and just
> happens to have both inner and outer checksums filled in.  Is there a
> good reason _why_ the stack forbids a device to do this?  (Sure, it's
> not necessary, and makes the hardware more complex.  But the hardware's
> already been made, and it's not a *completely* useless thing to do...)
> 
> > Please stop adding this disclaimer to your messages, it is not > 
> > appropriate for the list.
> Actually, the copy that goes the list doesn't have the disclaimer.  But
> thanks to a combination of crappy email server and corporate politics,
> it still sticks it on any CCs.  If it were up to me (it's not) we
> wouldn't add that disclaimer to anything ever.
> I'm now attempting (for the nth time) to convince our mgmt to get rid
> of the disclaimer, but I don't hold out much hope :(
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-06 Thread David Laight
From: Eric Dumazet
> Sent: 05 January 2016 22:19
> To: Tom Herbert
> You might add a comment telling the '4' comes from length of 'adcq
> 6*8(%rdi),%rax' instruction, and that the 'nop' is to compensate that
> 'adcq0*8(%rdi),%rax' is using 3 bytes instead.
> 
> We also could use .byte 0x48, 0x13, 0x47, 0x00 to force a 4 bytes
> instruction and remove the nop.
> 
> 
> +   lea 20f(, %rcx, 4), %r11
> +   clc
> +   jmp *%r11
> +
> +.align 8
> +   adcq    6*8(%rdi),%rax
> +   adcq    5*8(%rdi),%rax
> +   adcq    4*8(%rdi),%rax
> +   adcq    3*8(%rdi),%rax
> +   adcq    2*8(%rdi),%rax
> +   adcq    1*8(%rdi),%rax
> +   adcq    0*8(%rdi),%rax // could force a 4 byte instruction (.byte 0x48, 0x13, 0x47, 0x00)
> +   nop
> +20:/* #quads % 8 jump table base */

Or move label at the top (after the .align) and adjust the maths.
You could add a second label after the first adcq and use the
difference between them for the '4'.

David



RE: [PATCH] net: add per device sg_max_frags for skb

2016-01-06 Thread David Laight
From: Hans Westgaard Ry
> Sent: 06 January 2016 13:16
> Devices may have limits on the number of fragments in an skb they
> support. Current codebase uses a constant as maximum for number of
> fragments (MAX_SKB_FRAGS) one skb can hold and use.
> 
> When enabling scatter/gather and running traffic with many small
> messages the codebase uses the maximum number of fragments and thereby
> violates the max for certain devices.
> 
> An example of such a violation is when running IPoIB on a HCA
> supporting 16 SGE on an architecture with 4K pagesize. The
> MAX_SKB_FRAGS will be 17 (64K/4K+1) and because IPoIB adds yet another
> segment we end up with send_requests with 18 SGE resulting in
> kernel-panic.
> 
> The patch allows the device to limit the maximum number fragments used
> in one skb.

This doesn't seem to me to be the correct way to fix this.
Anything that adds an extra fragment (in this case IPoIB) should allow
for the skb already having the maximum number of fragments.
Fully linearising the skb is overkill, but I think the first fragment
can be added to the linear part of the skb.
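
A hedged sketch of that approach ('hard_max_frags' is an invented name
for the device limit, not an existing field):

#include <linux/skbuff.h>

static int make_room_for_extra_frag(struct sk_buff *skb, int hard_max_frags)
{
	if (skb_shinfo(skb)->nr_frags < hard_max_frags)
		return 0;

	/* pull the first fragment's bytes into the linear area; once the
	 * fragment is fully consumed it is released and nr_frags drops */
	if (!__pskb_pull_tail(skb, skb_frag_size(&skb_shinfo(skb)->frags[0])))
		return -ENOMEM;

	return 0;
}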

David




RE: [PATCH v2 net-next] net: Implement fast csum_partial for x86_64

2016-01-06 Thread David Laight
From: Eric Dumazet
> Sent: 06 January 2016 14:25
> On Wed, 2016-01-06 at 10:16 +, David Laight wrote:
> > From: Eric Dumazet
> > > Sent: 05 January 2016 22:19
> > > To: Tom Herbert
> > > You might add a comment telling the '4' comes from length of 'adcq
> > > 6*8(%rdi),%rax' instruction, and that the 'nop' is to compensate that
> > > 'adcq0*8(%rdi),%rax' is using 3 bytes instead.
> > >
> > > We also could use .byte 0x48, 0x13, 0x47, 0x00 to force a 4 bytes
> > > instruction and remove the nop.
> > >
> > >
> > > +   lea 20f(, %rcx, 4), %r11
> > > +   clc
> > > +   jmp *%r11
> > > +
> > > +.align 8
> > > +   adcq    6*8(%rdi),%rax
> > > +   adcq    5*8(%rdi),%rax
> > > +   adcq    4*8(%rdi),%rax
> > > +   adcq    3*8(%rdi),%rax
> > > +   adcq    2*8(%rdi),%rax
> > > +   adcq    1*8(%rdi),%rax
> > > +   adcq    0*8(%rdi),%rax // could force a 4 byte instruction (.byte 0x48, 0x13, 0x47, 0x00)
> > > +   nop
> > > +20:/* #quads % 8 jump table base */
> >
> > Or move label at the top (after the .align) and adjust the maths.
> > You could add a second label after the first adcq and use the
> > difference between them for the '4'.
> 
> Not really.
> 
> We could not use the trick it the length was 5.
> 
> Only 1, 2, 4 and 8 are supported.

Indeed, and 'lea  20f(, %rcx, 5), %r11' will generate an error from the
assembler.
Seems appropriate to get the assembler to verify this for you.

Assuming this code block is completely skipped for aligned lengths
the nop isn't needed provided the '20:' label is at the right place.

Someone also pointed out that the code is memory limited (dual add
chains making no difference), so why is it unrolled at all?

OTOH I'm sure I remember something about false dependencies on the
x86 flags register because of instructions only changing some bits.
So it might be that you can't (or couldn't) get concurrency between
instructions that update different parts of the flags register.

David




RE: [PATCH] arm64: net: bpf: don't BUG() on large shifts

2016-01-07 Thread David Laight
From: Alexei Starovoitov
> Sent: 06 January 2016 22:13
> On Wed, Jan 06, 2016 at 09:31:27PM +0100, Rabin Vincent wrote:
> > On Tue, Jan 05, 2016 at 09:55:58AM -0800, Alexei Starovoitov wrote:
> > > this one is better to be addressed in verifier instead of eBPF JITs.
> > > Please reject it in check_alu_op() instead.
> >
> > AFAICS the eBPF verifier is not called on the eBPF filters generated by
> > the BPF->eBPF conversion in net/core/filter.c, so performing this check
> > only in check_alu_op() will be insufficient.  So I think we'd need to
> > add this check to bpf_check_classic() too.  Or am I missing something?
> 
> correct. the check is needed in bpf_check_classic() too and I believe
> it's ok to tighten it up in this case, since >32 shift is
> invalid/undefined anyway. We can either accept it as nop or K&=31
> or error. I think returning error is more user friendly long term, though
> there is a small risk of rejecting previously loadable broken programs.

Or replace with an assignment of zero?
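
For reference, a minimal sketch of the bpf_check_classic()-style check
being discussed (rejecting constant shift counts >= 32; illustrative,
not the actual patch):

#include <linux/filter.h>

static bool shift_k_valid(const struct sock_filter *ftest)
{
	switch (ftest->code) {
	case BPF_ALU | BPF_LSH | BPF_K:
	case BPF_ALU | BPF_RSH | BPF_K:
		return ftest->k < 32;	/* the classic BPF accumulator is 32 bits */
	default:
		return true;
	}
}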

David



RE: [PATCH v2 net-next 1/5] net: local checksum offload for encapsulation

2016-01-07 Thread David Laight
From: Edward Cree
> Sent: 07 January 2016 17:12
> The arithmetic properties of the ones-complement checksum mean that a
>  correctly checksummed inner packet, including its checksum, has a ones
>  complement sum depending only on whatever value was used to initialise
>  the checksum field before checksumming (in the case of TCP and UDP,
>  this is the ones complement sum of the pseudo header, complemented).
> Consequently, if we are going to offload the inner checksum with
>  CHECKSUM_PARTIAL, we can compute the outer checksum based only on the
>  packed data not covered by the inner checksum, and the initial value of
>  the inner checksum field.

Isn't it even simpler than that?
The checksum of the inner packet (including its header) is ~0 (ie 0).
So the checksum of the whole packet (for the outer header) is the same
as that of the packet down to the start of the inner header.
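
A quick user-space demonstration of that property (stand-alone, not
kernel code): once the embedded checksum is correct, the ones-complement
sum over the whole inner packet is 0xffff, so it adds nothing to the
outer sum.

#include <stdint.h>
#include <stdio.h>

static uint16_t csum16(const uint8_t *p, size_t len, uint32_t sum)
{
	while (len > 1) {
		sum += (p[0] << 8) | p[1];
		p += 2;
		len -= 2;
	}
	if (len)
		sum += p[0] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return sum;
}

int main(void)
{
	uint8_t inner[8] = { 0x45, 0x00, 0x12, 0x34, 0x00, 0x00, 0xab, 0xcd };
	uint16_t c = ~csum16(inner, sizeof(inner), 0);

	inner[4] = c >> 8;		/* store the inner checksum */
	inner[5] = c & 0xff;

	printf("sum over inner packet = 0x%04x\n",
	       csum16(inner, sizeof(inner), 0));	/* prints 0xffff */
	return 0;
}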

David



RE: [PATCH] netfilter: nf_conntrack: use safer way to lock all buckets

2016-01-05 Thread David Laight
From: Sasha Levin
> Sent: 05 January 2016 02:26
> When we need to lock all buckets in the connection hashtable we'd attempt to
> lock 1024 spinlocks, which is way more preemption levels than supported by
> the kernel. Furthermore, this behavior was hidden by checking if lockdep is
> enabled, and if it was - use only 8 buckets(!).
> 
> Fix this by using a global lock and synchronize all buckets on it when we
> need to lock them all. This is pretty heavyweight, but is only done when we
> need to resize the hashtable, and that doesn't happen often enough (or at 
> all).
...
> +static void nf_conntrack_lock_nested(spinlock_t *lock)
> +{
> + spin_lock_nested(lock, SINGLE_DEPTH_NESTING);
> + while (unlikely(nf_conntrack_locks_all)) {
> + spin_unlock(lock);
> + spin_lock(&nf_conntrack_locks_all_lock);
> + spin_unlock(&nf_conntrack_locks_all_lock);
> + spin_lock_nested(lock, SINGLE_DEPTH_NESTING);
> + }
> +}
...
> @@ -102,16 +126,19 @@ static void nf_conntrack_all_lock(void)
>  {
>   int i;
> 
> - for (i = 0; i < CONNTRACK_LOCKS; i++)
> - spin_lock_nested(&nf_conntrack_locks[i], i);
> + spin_lock(&nf_conntrack_locks_all_lock);
> + nf_conntrack_locks_all = true;
> +
> + for (i = 0; i < CONNTRACK_LOCKS; i++) {
> + spin_lock(&nf_conntrack_locks[i]);
> + spin_unlock(&nf_conntrack_locks[i]);
> + }
>  }

If spin_lock_nested() does anything like what I think its
name suggests then I suspect that deadlocks.

David




RE: [Resend PATCH] RDS: fix race condition when sending a message on unbound socket

2015-11-25 Thread David Laight
From: Santosh Shilimkar
> Sent: 24 November 2015 22:13
...
> Sasha's found a NULL pointer dereference in the RDS connection code when
> sending a message to an apparently unbound socket.  The problem is caused
> by the code checking if the socket is bound in rds_sendmsg(), which checks
> the rs_bound_addr field without taking a lock on the socket.  This opens a
> race where rs_bound_addr is temporarily set but where the transport is not
> in rds_bind(), leading to a NULL pointer dereference when trying to
> dereference 'trans' in __rds_conn_create().
> 
> Vegard wrote a reproducer for this issue, so kindly ask him to share if
> you're interested.
...
> diff --git a/net/rds/send.c b/net/rds/send.c
> index 827155c..c9cdb35 100644
> --- a/net/rds/send.c
> +++ b/net/rds/send.c
> @@ -1013,11 +1013,13 @@ int rds_sendmsg(struct socket *sock, struct msghdr 
> *msg, size_t payload_len)
>   release_sock(sk);

This is falling though into an unconditional lock_sock().
No need to unlock and relock immediately.

>   }
> 
> - /* racing with another thread binding seems ok here */
> + lock_sock(sk);
>   if (daddr == 0 || rs->rs_bound_addr == 0) {
> + release_sock(sk);
>   ret = -ENOTCONN; /* XXX not a great errno */
>   goto out;
>   }
> + release_sock(sk);
> 

On the face of it the above looks somewhat dubious.
Locks usually tie together two actions (eg a test and a use of a value).
In this case you only have a test inside the lock.
That either means that the state can change after you release the lock
(ie rs->rs_bound_addr = 0 is executed somewhere), or you don't
really need the lock.
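
A generic sketch of the distinction (invented names, not the RDS code):
the test and the use of the bound address sit under the same lock.

#include <net/sock.h>

struct example_sock {
	__be32	bound_addr;
};

static int do_send(struct example_sock *es, __be32 addr)
{
	return 0;	/* stand-in for the real transmit path */
}

static int send_checked(struct sock *sk, struct example_sock *es)
{
	int ret;

	lock_sock(sk);
	if (es->bound_addr == 0)			/* test ... */
		ret = -ENOTCONN;
	else
		ret = do_send(es, es->bound_addr);	/* ... and use, same lock */
	release_sock(sk);
	return ret;
}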

David



