RE: [RFC PATCH v1 5/5] wave: Added basic version of TCP Wave

2017-07-31 Thread David Laight
From: Natale Patriciello
> Sent: 28 July 2017 20:59
..
> +static __always_inline bool test_flag(u8 value, const u8 *flags)
> +{
> + return (*flags & value) == value;
> +}
...
> + if (!test_flag(FLAG_INIT, >flags))
> + return;
...

That is a completely unnecessary 'helper'.
It has its arguments in the wrong order.
Doesn't need to pass by reference.
Since you only ever check one bit you don't need the '=='.
Any error seems to be silently ignored.
I bet they can't actually happen at all.

David




RE: [PATCH] netfilter: fix stringop-overflow warning with UBSAN

2017-08-01 Thread David Laight
From: Arnd Bergmann
> Sent: 31 July 2017 11:09
> Using gcc-7 with UBSAN enabled, we get this false-positive warning:
> 
> net/netfilter/ipset/ip_set_core.c: In function 'ip_set_sockfn_get':
> net/netfilter/ipset/ip_set_core.c:1998:3: error: 'strncpy' writing 32 bytes 
> into a region of size 2
> overflows the destination [-Werror=stringop-overflow=]
>strncpy(req_get->set.name, set ? set->name : "",
>^~~~
> sizeof(req_get->set.name));
> ~~
> 
> This seems completely bogus, and I could not find a nice workaround.
> To work around it in a less elegant way, I change the ?: operator
> into an if()/else() construct.
> 
> Signed-off-by: Arnd Bergmann 
> ---
>  net/netfilter/ipset/ip_set_core.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/ipset/ip_set_core.c 
> b/net/netfilter/ipset/ip_set_core.c
> index e495b5e484b1..d7ebb021003b 100644
> --- a/net/netfilter/ipset/ip_set_core.c
> +++ b/net/netfilter/ipset/ip_set_core.c
> @@ -1995,8 +1995,12 @@ ip_set_sockfn_get(struct sock *sk, int optval, void 
> __user *user, int *len)
>   }
>   nfnl_lock(NFNL_SUBSYS_IPSET);
>   set = ip_set(inst, req_get->set.index);
> - strncpy(req_get->set.name, ,
> - IPSET_MAXNAMELEN);
> + if (set)
> + strncpy(req_get->set.name, set->name,
> + sizeof(req_get->set.name));
> + else
> + memset(req_get->set.name, '\0',
> +sizeof(req_get->set.name));

If you use strncpy() here, the compiler might optimise the code
back to 'how it was before'.

Or, maybe an explicit temporary: 'const char *name = set ? set->name : "";

David



RE: [PATCH v2] iwlwifi: Demote messages about fw flags size to info

2017-08-03 Thread David Laight
From: João Paulo Rechi Vita
> Sent: 03 August 2017 15:30
> These messages are not reporting a real error, just the fact that the
> firmware knows about more flags then the driver.
  than
> 
> Currently these messages are presented to the user during boot if there
> is no bootsplash covering the console, even when booting the kernel with
> "quiet".
> 
> Demoting it to the warn level helps having a clean boot process.
> 
> Signed-off-by: Joo Paulo Rechi Vita 

David



RE: [PATCH v4 iproute2 2/7] rdma: Add dev object

2017-08-15 Thread David Laight
From: Stephen Hemminger
> Sent: 15 August 2017 17:12
> On Tue, 15 Aug 2017 16:00:15 +0300
> Leon Romanovsky  wrote:
> 
> > +
> > +static const char *dev_caps_to_str(uint32_t idx)
> > +{
> > +   uint64_t cap = 1 << idx;
> > +
> > +   switch (cap) {
> > +   case RDMA_DEV_RESIZE_MAX_WR: return "RESIZE_MAX_WR";
> > +   case RDMA_DEV_BAD_PKEY_CNTR: return "BAD_PKEY_CNTR";
...
> > +   case RDMA_DEV_RAW_SCATTER_FCS: return "RAW_SCATTER_FCS";
> > +   case RDMA_DEV_RDMA_NETDEV_OPA_VNIC: return "RDMA_NETDEV_OPA_VNIC";
> > +   default: return "UNKNOWN";
> > +   }
> 
> Could this be a table in future versions?

Potentially you could define the constants using some pre-processor
'magic' that would create the table for you.
Something like (but not compiled):

#define RDMA_DEV_FLAGS(x) \
x(RESIZE_MAX_WR, 0) \
x(BAD_PKEY_CNTR, 1) \
(continue for all the bits)

#define RDMA_DEV_ENUM(name, bit_no) RDMA_DEV_##name = BIT(bit_no),
enum {RDMA_DEV_FLAGS(RDMA_DEV_ENUM)};
#undef RDMA_DEV_ENUM

#define RDMA_DEV_NAMES(name, bit_no) [bit_no] = #name,
static const char rdma_dev_names[] = {RDMA_DEV_FLAGS(RDMA_DEV_NAMES)};

David



RE: [PATCH v4 iproute2 2/7] rdma: Add dev object

2017-08-15 Thread David Laight
From: Leon Romanovsky
> Sent: 15 August 2017 17:47
> On Tue, Aug 15, 2017 at 04:23:11PM +, David Laight wrote:
> > From: Stephen Hemminger
> > > Sent: 15 August 2017 17:12
> > > On Tue, 15 Aug 2017 16:00:15 +0300
> > > Leon Romanovsky <leo...@mellanox.com> wrote:
> > >
> > > > +
> > > > +static const char *dev_caps_to_str(uint32_t idx)
> > > > +{
> > > > +   uint64_t cap = 1 << idx;
> > > > +
> > > > +   switch (cap) {
> > > > +   case RDMA_DEV_RESIZE_MAX_WR: return "RESIZE_MAX_WR";
> > > > +   case RDMA_DEV_BAD_PKEY_CNTR: return "BAD_PKEY_CNTR";
> > ...
> > > > +   case RDMA_DEV_RAW_SCATTER_FCS: return "RAW_SCATTER_FCS";
> > > > +   case RDMA_DEV_RDMA_NETDEV_OPA_VNIC: return 
> > > > "RDMA_NETDEV_OPA_VNIC";
> > > > +   default: return "UNKNOWN";
> > > > +   }
> > >
> > > Could this be a table in future versions?
> >
> > Potentially you could define the constants using some pre-processor
> > 'magic' that would create the table for you.
> > Something like (but not compiled):
> >
> > #define RDMA_DEV_FLAGS(x) \
> > x(RESIZE_MAX_WR, 0) \
> > x(BAD_PKEY_CNTR, 1) \
> > (continue for all the bits)
> >
> > #define RDMA_DEV_ENUM(name, bit_no) RDMA_DEV_##name = BIT(bit_no),
> > enum {RDMA_DEV_FLAGS(RDMA_DEV_ENUM)};
> > #undef RDMA_DEV_ENUM
> >
> > #define RDMA_DEV_NAMES(name, bit_no) [bit_no] = #name,
> > static const char rdma_dev_names[] = {RDMA_DEV_FLAGS(RDMA_DEV_NAMES)};
I missed the #undef
> >
> 
> David,
> 
> How should I handle "uknown" fields without names?

The function that accesses rdma_dev_names[] checks ARRAY_SIZE()
and for a NULL pointer.

WRT the other comment, the spare pointers consume far less space than
the code for your switch statement.

David



RE: [PATCH] New Chapter on CodingStyle .

2017-08-15 Thread David Laight
From: Stephen Hemminger
> Sent: 15 August 2017 17:21
> On Tue, 15 Aug 2017 10:42:39 +0000
> David Laight <david.lai...@aculab.com> wrote:
> 
> > From: Jonathan Corbet
> > > Sent: 12 August 2017 15:55
> > ...
> > > > +   Chapter 20: Put values on initialisers without exception
> > > > +
> > > > +When declaring variables on functions must put values:
> > >
> > > Thanks for sending a patch for the kernel's documentation.
> > > Unfortunately, I can't accept this patch for a couple of reasons:
> > ...
> > > - The coding style document is there to describe the community's
> > >   standards for kernel code.  It is *not* a mechanism for imposing new
> > >   standards.  If you really think that the kernel community should adopt
> > >   this rule, you will need to argue for it on the mailing lists.  I will
> > >   say, though, that I do not expect that this effort would be successful.
> >
> > I'd even go as far as suggesting almost the opposite.
> > Declarations should only have initialisers if the value is constant.
> 
> Yup. This new rule sound like something taught to people in coding schools.
> But initializing everything defeats the compiler detection of uninitialized 
> variables
> which is more useful for catching errors.

You'll also get:
Values being read the wrong side of locks.
Values being read early so requiring spilling to stack.

Next someone will be suggesting that all pointers should be checked
against NULL every time they are used.

David



RE: [PATCH net-next 3/3 v4] drivers: net: ethernet: qualcomm: rmnet: Initial implementation

2017-08-16 Thread David Laight
From: David Miller
> Sent: 16 August 2017 05:24
> From: Subash Abhinov Kasiviswanathan 
> Date: Tue, 15 Aug 2017 22:15:53 -0600
> 
> > +static int rmnet_unregister_real_device(struct net_device *dev)
> > +{
> > +   int config_id = RMNET_LOCAL_LOGICAL_ENDPOINT;
> > +   struct rmnet_logical_ep_conf_s *epconfig_l;
> > +   struct rmnet_phys_ep_conf_s *config;
> > +
> > +   ASSERT_RTNL();
> > +
> > +   netdev_info(dev, "Removing device %s\n", dev->name);
> > +
> > +   if (!rmnet_is_real_dev_registered(dev))
> > +   return -EINVAL;
> > +
> > +   for (; config_id < RMNET_MAX_LOGICAL_EP; config_id++) {
> 
> This loop is so much harder to understand because you initialize
> the loop index several lines above the for() statement.  Just
> initialize it here at the beginning of the for() loop and deal
> with the fact that this will have to therefore be a multi-line
> for() statement the best you can.
...

One way to split the multi-line for() is to put the initialiser
on the immediately preceding line:
config_id = RMNET_LOCAL_LOGICAL_ENDPOINT;
for (; config_id < RMNET_MAX_LOGICAL_EP; config_id++) {

David



RE: [net-next 11/12] igbvf: convert msleep to mdelay in atomic context

2017-08-16 Thread David Laight
From: Greg Edwards
> Sent: 15 August 2017 20:32
> On Mon, Aug 14, 2017 at 10:17:31AM +, David Laight wrote:
> > From: Jeff Kirsher
> >> Sent: 09 August 2017 22:48
> >> From: Greg Edwards <gedwa...@ddn.com>
> >>
> >> This fixes a "scheduling while atomic" splat seen with
> >> CONFIG_DEBUG_ATOMIC_SLEEP enabled.
> >>
> >> Signed-off-by: Greg Edwards <gedwa...@ddn.com>
> >> Tested-by: Aaron Brown <aaron.f.br...@intel.com>
> >> Signed-off-by: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
> >> ---
> >>  drivers/net/ethernet/intel/igbvf/vf.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/net/ethernet/intel/igbvf/vf.c 
> >> b/drivers/net/ethernet/intel/igbvf/vf.c
> >> index 1d3aa9adcaa8..9577ccf4b26a 100644
> >> --- a/drivers/net/ethernet/intel/igbvf/vf.c
> >> +++ b/drivers/net/ethernet/intel/igbvf/vf.c
> >> @@ -149,7 +149,7 @@ static s32 e1000_reset_hw_vf(struct e1000_hw *hw)
> >>msgbuf[0] = E1000_VF_RESET;
> >>mbx->ops.write_posted(hw, msgbuf, 1);
> >>
> >> -  msleep(10);
> >> +  mdelay(10);
> >
> > Spinning for 10ms seems somewhat sub-optimal
> 
> Jeff,
> 
> Do we even need this delay?  The subsequent read_posted() will poll for
> the PF's mailbox reply for up to 1s.

A 1 second loop?
Who is kidding who that this code is sensible.
If this code is ever executed and has to wait at all other interfaces
are likely to lose packets.

David



RE: [patch net-next 0/3] net/sched: Improve getting objects by indexes

2017-08-16 Thread David Laight
From: Christian König
> Sent: 16 August 2017 09:32
> Am 16.08.2017 um 10:16 schrieb Jiri Pirko:
> > Wed, Aug 16, 2017 at 09:49:07AM CEST, christian.koe...@amd.com wrote:
> >> Am 16.08.2017 um 04:12 schrieb Chris Mi:
...
> >>> - ret = idr_alloc(_minor_idr, bcd, 0, BSG_MAX_DEVS, GFP_KERNEL);
> >>> - if (ret < 0) {
> >>> + ret = idr_alloc(_minor_idr, bcd, _index, 0, BSG_MAX_DEVS,
> >>> + GFP_KERNEL);
> >>> + if (ret) {
> >>>   if (ret == -ENOSPC) {
> >>>   printk(KERN_ERR "bsg: too many bsg devices\n");
> >>>   ret = -EINVAL;
> >> The condition "if (ret)" will now always be true after the first allocation
> >> and so we always run into the error handling after that.
> > On success, idr_alloc returns 0.
> 
> Ah, I see. You change the idr_alloc to return the resulting index as
> separate parameter.

Returning values by reference typically generates considerably worse code
that using the function return value.
It isn't just the extra parameter, it can constrain the generated code
in other ways.
That is why ERR_PTR() and friends exist.
IMHO You need a really good reason to make this change.

David




RE: [PATCH] New Chapter on CodingStyle .

2017-08-15 Thread David Laight
From: Jonathan Corbet
> Sent: 12 August 2017 15:55
...
> > +   Chapter 20: Put values on initialisers without exception
> > +
> > +When declaring variables on functions must put values:
> 
> Thanks for sending a patch for the kernel's documentation.
> Unfortunately, I can't accept this patch for a couple of reasons:
...
> - The coding style document is there to describe the community's
>   standards for kernel code.  It is *not* a mechanism for imposing new
>   standards.  If you really think that the kernel community should adopt
>   this rule, you will need to argue for it on the mailing lists.  I will
>   say, though, that I do not expect that this effort would be successful.

I'd even go as far as suggesting almost the opposite.
Declarations should only have initialisers if the value is constant.

David



RE: [PATCH] of_mdio: merge branch tails in of_phy_register_fixed_link()

2017-08-15 Thread David Laight
From: David Miller
> Sent: 14 August 2017 04:09
> From: Sergei Shtylyov 
> Date: Sun, 13 Aug 2017 00:03:06 +0300
> 
> > Looks  like gcc isn't always able to figure  out that 3 *if* branches in
> > of_phy_register_fixed_link() calling fixed_phy_register() at their ends
> > are similar enough and thus can be merged. The "manual" merge saves 40
> > bytes of the object code (AArch64 gcc 4.8.5), and still saves 12 bytes
> > even  if gcc was able to merge the branch tails (ARM gcc 4.8.5)...
> >
> > Signed-off-by: Sergei Shtylyov 
> 
> Applied, but if two instances of the "same" compiler just with
> different targets changes the optimization, it could be because of a
> tradeoff which is specific to parameters expressed in that target's
> backend.
> 
> So in the future we should probably back away from trying to "help"
> the compiler in this way.

Probably a trade off between code size and execution speed.
I've had 'fun' trying to stop gcc merging tail code paths
in order to avoid the cost of the branch instruction.

David



RE: [iproute PATCH 51/51] lib/bpf: Check return value of write()

2017-08-15 Thread David Laight
From: Phil Sutter
> Sent: 12 August 2017 13:05
> This is merely to silence the compiler warning. If write to stderr
> failed, assume that printing an error message will fail as well so don't
> even try.
> 
> Signed-off-by: Phil Sutter 
> ---
>  lib/bpf.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/bpf.c b/lib/bpf.c
> index 1dcb261dc915f..825e071cea572 100644
> --- a/lib/bpf.c
> +++ b/lib/bpf.c
> @@ -591,7 +591,8 @@ int bpf_trace_pipe(void)
> 
>   ret = read(fd, buff, sizeof(buff) - 1);
>   if (ret > 0) {
> - write(2, buff, ret);
> + if (write(STDERR_FILENO, buff, ret) != ret)
> + return -1;
>   fflush(stderr);
>   }

WTF is this code doing anyway?
write() is a system call, fflush() writes out any data buffered in the
stdio stream.
If there was anything buffered you'd want to output it earlier.
Otherwise if it is going to use fflush() it should be using fwrite().

I presume the function is allowed to write to stderr - since in general
library functions shouldn't assume fd 0/1/2 or stdin/out/err are valid.
There is a lot of code out there that does close(0); close(1); close(2);
but leaves stdout/err valid. Call printf() instead of sprint() and eventually
10k of data gets written somewhere rather unexpected.

If it is a copy loop, what is wrong with the last byte of buff[].
It is valid for write() to return a partial length - the code should
probably loop until all the data is accepted (or error).

David




RE: [net-next 08/15] i40e/i40evf: organize and re-number feature flags

2017-08-15 Thread David Laight
From: Keller, Jacob E
> Sent: 14 August 2017 23:11
> > From: David Miller [mailto:da...@davemloft.net]
> > Sent: Saturday, August 12, 2017 1:04 PM
> > From: Jeff Kirsher 
> > Date: Sat, 12 Aug 2017 04:08:41 -0700
> >
> > > Also ensure that the flags variable is actually a u64 to guarantee
> > > 64bits of space on all architectures.
> >
> > Why?  You don't need 64-bits, you only need 27.
> >
> > This will be unnecessarily expensive on 32-bit platforms.
> >
> > Please don't do this.
> 
> I suppose a better method would be to switch to using a declare_bitmap 
> instead, so that it
> automatically sizes based on the number of flags we have. The reason we chose 
> 64bits is because we
> will add flags in the future, as we originally had more than 32 flags prior 
> to this patch until we
> moved some into a separate field.
> 
> But now that I think about it, using DECLARE_BITMAP makes more sense, though 
> it's a bit more invasive
> of the code.

And horribly stupid unless you really need dynamic indexes.

David



RE: [net-next 11/12] igbvf: convert msleep to mdelay in atomic context

2017-08-14 Thread David Laight
From: Jeff Kirsher
> Sent: 09 August 2017 22:48
> From: Greg Edwards 
> 
> This fixes a "scheduling while atomic" splat seen with
> CONFIG_DEBUG_ATOMIC_SLEEP enabled.
> 
> Signed-off-by: Greg Edwards 
> Tested-by: Aaron Brown 
> Signed-off-by: Jeff Kirsher 
> ---
>  drivers/net/ethernet/intel/igbvf/vf.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/igbvf/vf.c 
> b/drivers/net/ethernet/intel/igbvf/vf.c
> index 1d3aa9adcaa8..9577ccf4b26a 100644
> --- a/drivers/net/ethernet/intel/igbvf/vf.c
> +++ b/drivers/net/ethernet/intel/igbvf/vf.c
> @@ -149,7 +149,7 @@ static s32 e1000_reset_hw_vf(struct e1000_hw *hw)
>   msgbuf[0] = E1000_VF_RESET;
>   mbx->ops.write_posted(hw, msgbuf, 1);
> 
> - msleep(10);
> + mdelay(10);

Spinning for 10ms seems somewhat sub-optimal

David



RE: [RFC PATCH 2/2] bpf: Initialise mod[] in bpf_trace_printk

2017-08-14 Thread David Laight
From: Daniel Borkmann
> Sent: 11 August 2017 17:47
> On 08/09/2017 10:34 PM, Daniel Borkmann wrote:
> > On 08/09/2017 09:39 AM, James Hogan wrote:
> > [...]
> >> time (but please consider looking at the other patch which is certainly
> >> a more real issue).
> >
> > Sorry for the delay, started looking into that and whether we
> > have some other options, I'll get back to you on this.
> 
> Could we solve this more generically (as in: originally intended) in
> the sense that we don't need to trust the gcc va_list handling; I feel
> this is relying on an implementation detail? Perhaps something like
> below poc patch?

That patch still has 'cond ? arg : cond1 ? (long)arg : (u32)arg' so
probably has the same warning as the original version.

The va_list handling is defined by the relevant ABI, not gcc.

It is ok on x86-64 because all 32bit values are extended to 64bits
before being passed as arguments (either in registers, or on the stack).
Nothing in the C language requires that, so other 64bit architectures
could pass 32bit values in 4 bytes of stack.

David




RE: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs

2017-08-17 Thread David Laight
From: Willem de Bruijn
> Sent: 17 August 2017 00:27
> Actually, it is safe even without the check. Overflow of the signed integer
> is benign here.

IIRC the C language states that 'signed integer overflow' is undefined.
So 'MAXINT + 1' doesn't have to equal '-MAXINT - 1' (as one would
expect on a 2's compliment system).

While the linux kernel probably won't run on systems where this isn't true
(eg where signed arithmetic saturates) gcc will assume it can't happen
and optimise code with that assumption.

This may not matter here ...

David



RE: [PATCH 04/22] scsi: fusion: fix string overflow warning

2017-07-17 Thread David Laight
From: Arnd Bergmann
> Sent: 14 July 2017 13:07
> gcc points out a theorerical string overflow:
> 
> drivers/message/fusion/mptbase.c: In function 'mpt_detach':
> drivers/message/fusion/mptbase.c:2103:17: error: '%s' directive writing up to 
> 31 bytes into a region
> of size 28 [-Werror=format-overflow=]
> sprintf(pname, MPT_PROCFS_MPTBASEDIR "/%s/summary", ioc->name);
>^
> drivers/message/fusion/mptbase.c:2103:2: note: 'sprintf' output between 13 
> and 44 bytes into a
> destination of size 32
> 
> We can simply double the size of the local buffer here to be on the
> safe side.

I think I'd change it to snprintf() as well.
Saves any worries if ioc->name isn't '\0' terminated.

David



RE: [PATCH iproute2] ip: change flag names to an array

2017-07-10 Thread David Laight
From:  Stephen Hemminger
> Sent: 07 July 2017 16:40
> For the most of the address flags, use a table of bit values rather
> than open coding every value.  This allows for easier inevitable
> expansion of flags.
> 
> This also fixes the missing stable-privacy flag.
> 
> Signed-off-by: Stephen Hemminger 
> ---
>  ip/ipaddress.c | 152 
> -
>  1 file changed, 65 insertions(+), 87 deletions(-)
> 
> diff --git a/ip/ipaddress.c b/ip/ipaddress.c
> index f06f5829fb61..b4efb9fedcd2 100644
> --- a/ip/ipaddress.c
> +++ b/ip/ipaddress.c
> @@ -1013,14 +1013,67 @@ static unsigned int get_ifa_flags(struct ifaddrmsg 
> *ifa,
>   ifa->ifa_flags;
>  }
> 
> +
> +static const char *ifa_flag_names[] = {
> + "secondary",
> + "nodad",
> + "optimistic",
> + "dadfailed",
> + "home",
> + "deprecated",
> + "tentative",
> + "permanent",
> + "mngtmpaddr",
> + "noprefixroute",
> + "autojoin",
> + "stable-privacy",
> +};

It would be safer to set up a table of the constant - string pairs
instead of relying on the table being in the right order.

David



RE: [PATCH v2 RFC 0/13] Remove UDP Fragmentation Offload support

2017-07-07 Thread David Laight
From: Michal Kubecek
> Sent: 07 July 2017 13:46
> On Fri, Jul 07, 2017 at 10:43:26AM +0100, David Miller wrote:
> >
> > This is an RFC patch series, based upon some discussions with
> > various developers, that removes UFO offloading.
> >
> > Very few devices support this operation, it's usefullness is
> > quesitonable at best, and it adds a non-trivial amount of
> > complexity to our data paths.
> 
> My understanding from the communication with the customer whose reports
> resulted in commits acf8dd0a9d0b ("udp: only allow UFO for packets from
> SOCK_DGRAM sockets") and a5cb659bbc1c ("net: account for current skb
> length when deciding about UFO") was that the real benefit from UFO is
> in the case when UFO allows to avoid the need to actually fragment the
> packets. In their case it's when UDP packets are sent via virtio_net
> either between a guest and its host or between two guests on the same
> host.
> ...

Could that be done with a large path-specific mtu?

David



RE: [PATCH V4 net-next 7/8] net: hns3: Add Ethtool support to HNS3 driver

2017-07-25 Thread David Laight
From: Rustad, Mark D
> Sent: 24 July 2017 21:32
> > On Jul 23, 2017, at 10:05 AM, Florian Fainelli  wrote:
> >> +
> >> +  strncpy(drvinfo->version, HNAE_DRIVER_VERSION,
> >> +  sizeof(drvinfo->version));
> >> +  drvinfo->version[sizeof(drvinfo->version) - 1] = '\0';
> >
> > strlcpy() would probably do that for you.
> 
> You need to be careful about strlcpy - it does not completely write the 
> destination buffer as strncpy
> does, and so can result in a kernel memory leak if the destination is not 
> known to already be cleared.

Not only that, strlcpy() is defined to return the size of the
source string.
So if the source buffers isn't '\0' terminated it can fault.
(Not a problem above.)

Neither function is the one you were looking for.

David



RE: [PATCH] netns: more input validation

2017-07-25 Thread David Laight
From: Matteo Croce
> Sent: 25 July 2017 14:31
> ip netns accepts invalid input as namespace name like an empty string or a
> string longer than the maximum file name length.
> Check that the netns name is not empty and less than or equal to NAME_MAX.
> 
> Signed-off-by: Matteo Croce 
> ---
>  ip/ipnetns.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/ip/ipnetns.c b/ip/ipnetns.c
> index 42549944..198e9de8 100644
> --- a/ip/ipnetns.c
> +++ b/ip/ipnetns.c
> @@ -768,7 +768,8 @@ static int netns_monitor(int argc, char **argv)
> 
>  static int invalid_name(const char *name)
>  {
> - return strchr(name, '/') || !strcmp(name, ".") || !strcmp(name, "..");
> + return !*name || strlen(name) > NAME_MAX ||
> + strchr(name, '/') || !strcmp(name, ".") || !strcmp(name, "..");

Think I'd check:
!name[0] || !memchr(name, 0, NAME_MAX) || strchr(name, '/') ||
(name[0] == '.' && (!name[1] || (name[1] == '.' && !name[2])))

David



RE: [PATCH net-next 2/3] nfp: look for firmware image by device serial number and PCI name

2017-07-27 Thread David Laight
From: Jakub Kicinski
> Sent: 26 July 2017 19:10
> We generally look up firmware by card type, but that doesn't allow
> users who have more than one card of the same type in their system
> to select firmware per adapter.
> 
> Unfortunately user space firmware helper seems fraught with
> difficulties and to be on its way out.  In particular support for
> handling firmware uevents have been dropped from systemd and most
> distributions don't enable the FW fallback by default any more.
> 
> To allow users selecting firmware for a particular device look up
> firmware names by serial and pci_name().  Use the direct lookup to
> disable generating uevents when enabled in Kconfig and not print
> any warnings to logs if adapter-specific files are missing.  Users
> can place in /lib/firmware/netronome files named:
> 
> pci-${pci_name}.nffw
> serial-${serial}.nffw
> 
> to target a specific card.  E.g.:
> 
> pci-:04:00.0.nffw
> pci-:82:00.0.nffw
> serial-00-aa-bb-11-22-33-10-ff.nffw
> 
> We use the full serial number including the interface id, as it
> appears in lspci output (bytes separated by '-').

Where does lspci get that from?

> Signed-off-by: Jakub Kicinski 
> ---
>  drivers/net/ethernet/netronome/nfp/nfp_main.c | 20 +++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c
> b/drivers/net/ethernet/netronome/nfp/nfp_main.c
> index d67969d3e484..13d056da0765 100644
> --- a/drivers/net/ethernet/netronome/nfp/nfp_main.c
> +++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c
> @@ -188,9 +188,27 @@ nfp_net_fw_find(struct pci_dev *pdev, struct nfp_pf *pf)
>   struct nfp_eth_table_port *port;
>   const char *fw_model;
>   char fw_name[256];
> + const u8 *serial;
>   int spc, err = 0;
> + u16 interface;
>   int i, j;
> 
> + /* First try to find a firmware image specific for this device */
> + interface = nfp_cpp_interface(pf->cpp);
> + nfp_cpp_serial(pf->cpp, );
> + sprintf(fw_name, "netronome/serial-%pMF-%02hhx-%02hhx.nffw",
> + serial, interface >> 8, interface & 0xff);

WTF???
- use snprintf().
- kill those hh, the arguments are of type 'int'.
In fact make 'interface' 'unsigned int' as well.

David



RE: [PATCH net-next 2/3] nfp: look for firmware image by device serial number and PCI name

2017-07-27 Thread David Laight
From: Jakub Kicinski
> Sent: 27 July 2017 10:26
...
> > - use snprintf().
> 
> To effectively print an integer into an amply sized array?  I need to
> guarantee that the string will fit otherwise I would request a FW image
> with a wrong name.  snprintf() would only mask such a bug.

Eh?
If, for any reason, the buffer isn't long enough snprintf() won't
write over random memory - sprint() will, and you may not notice.

> > - kill those hh, the arguments are of type 'int'.
> 
> It doesn't matter.  I will be more careful in the future, though.
> 
> > In fact make 'interface' 'unsigned int' as well.
> 
> It's a value read from the hardware, and it's 16 bits wide, therefore
> my preference it to explicitly size the variable.

Right, so you keep asking the compiler to generate code to mask
any arithmetic results (written into the variable) back down to
16 bits.

OTOH, looking at some of the functions in nfp_cppcore.c you don't
care about performance at all.

David



RE: [PATCH 25/29] netfilter, kbuild: use canonical method to specify objs.

2017-06-30 Thread David Laight
From: Pablo Neira Ayuso
> Sent: 29 June 2017 23:53
> Should use ":=" instead of "+=".
...
> diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
> index c9b78e7b342f..913380919301 100644
> --- a/net/netfilter/Makefile
> +++ b/net/netfilter/Makefile
> @@ -70,10 +70,9 @@ obj-$(CONFIG_NETFILTER_SYNPROXY) += nf_synproxy_core.o
>  obj-$(CONFIG_NF_DUP_NETDEV)  += nf_dup_netdev.o
> 
>  # nf_tables
> -nf_tables-objs += nf_tables_core.o nf_tables_api.o nf_tables_trace.o
> -nf_tables-objs += nft_immediate.o nft_cmp.o nft_range.o
> -nf_tables-objs += nft_bitwise.o nft_byteorder.o nft_payload.o
> -nf_tables-objs += nft_lookup.o nft_dynset.o
> +nf_tables-objs := nf_tables_core.o nf_tables_api.o nf_tables_trace.o \
> +   nft_immediate.o nft_cmp.o nft_range.o nft_bitwise.o \
> +   nft_byteorder.o nft_payload.o nft_lookup.o nft_dynset.o

Why?
My preference is to add each file on its own line.
Generates much less churn when files are added or removed.

David



RE: [PATCH net-next v2 9/9] nfp: add control message passing capabilities to flower offloads

2017-06-29 Thread David Laight
From: Or Gerlitz
> Sent: 29 June 2017 14:47
...
> > +   nfp_flow->meta.key_len /= NFP_FL_LW_SIZ;
> > +   nfp_flow->meta.mask_len /= NFP_FL_LW_SIZ;
> > +   nfp_flow->meta.act_len /= NFP_FL_LW_SIZ;
> 
> better to use use x >>= 2 instead x /= 4
> 
> I saw it in other places across the patch/set

Provided the values are unsigned it makes no difference.
Having 2 matching constants just makes the code harder to read.

David



RE: [PATCH 1/1] net sched: Added the TC_LINKLAYER_CUSTOM linklayer type

2017-07-06 Thread David Laight
From: McCabe, Robert J
> Sent: 04 July 2017 01:14
> + if (nla_put(skb, TCA_STAB_DATA, sizeof(stab->szopts)*sizeof(u16), 
> >data))
> + goto nla_put_failure;

Multiplying sizeof(a) by sizeof(b) really doesn't look right at all.

David



RE: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread David Laight
From: Paul E. McKenney
> Sent: 06 July 2017 00:30
> There is no agreed-upon definition of spin_unlock_wait()'s semantics,
> and it appears that all callers could do just as well with a lock/unlock
> pair.  This series therefore removes spin_unlock_wait() and changes
> its users to instead use a lock/unlock pair.  The commits are as follows,
> in three groups:
> 
> 1-7.  Change uses of spin_unlock_wait() and raw_spin_unlock_wait()
>   to instead use a spin_lock/spin_unlock pair.  These may be
>   applied in any order, but must be applied before any later
>   commits in this series.  The commit logs state why I believe
>   that these commits won't noticeably degrade performance.

I can't help feeling that it might be better to have a spin_lock_sync()
call that is equivalent to a spin_lock/spin_unlock pair.
The default implementation being an inline function that does exactly that.
This would let an architecture implement a more efficient version.

It might even be that this is the defined semantics of spin_unlock_wait().

Note that it can only be useful to do a spin_lock/unlock pair if it is
impossible for another code path to try to acquire the lock.
(Or, at least, the code can't care if the lock is acquired just after.)
So if it can de determined that the lock isn't held (a READ_ONCE()
might be enough) the lock itself need not be acquired (with the
associated slow bus cycles).

David



RE: [PATCH net-next v2 4/9] nfp: extend flower add flow offload

2017-06-29 Thread David Laight
From: Jakub Kicinski
> Sent: 29 June 2017 07:48
> On Thu, 29 Jun 2017 14:18:07 +0800, Yunsheng Lin wrote:
> > > + if (mask_basic->n_proto) {
> > cpu_to_be16(mask_basic->n_proto)

Should be be16_to_cpu()

> > remove cpu_to_be16 in case.
> 
> Thanks, but this is incorrect.  Byte swapping constants is done at
> compilation time - therefore it's preferred.

Except that the 'cpu' values are likely to be dense so the compiler
is likely to generate a jump table for the switch statement instead
of sequence of conditionals.

OTOH the jump table is almost certainly a data cache miss, whereas
the conditionals might be predicted correctly.

The best code might come from an explicitly ordered sequence of
conditionals.

David



RE: [PATCH v6 1/5] skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow

2017-04-27 Thread David Laight
From: Jason A. Donenfeld
> On Thu, Apr 27, 2017 at 1:30 PM, Sabrina Dubroca  wrote:
> > Hmm, I think this can actually happen:
> 
> Alright, perhaps better to err on the side of caution, then.

You only need to recurse if both pointers are set.

David



RE: [PATCH net-next 8/9] ipvlan: improve compiler hints

2017-04-27 Thread David Laight
From: Duyck, Alexander H
> Sent: 27 April 2017 16:21
...
> > -unsigned int ipvlan_mac_hash(const unsigned char *addr)
> > +inline unsigned int ipvlan_mac_hash(const unsigned char *addr)
> >  {
> > u32 hash = jhash_1word(__get_unaligned_cpu32(addr + 2),
> >ipvlan_jhash_secret);
> 
> I'm kind of surprised this isn't causing a problem with differing 
> declarations between the declaration
> here and the declaration in ipvlan.h. Normally for inlining something like 
> this you would change it to
> a "static inline" and move the entire declaration into the header file.

You get a callable copy for external callers and local calls inlined.
Not usually what you want.

David



RE: [PATCH] macsec: avoid heap overflow in skb_to_sgvec

2017-04-24 Thread David Laight
From: Jason A. Donenfeld
> Sent: 21 April 2017 22:15
> While this may appear as a humdrum one line change, it's actually quite
> important. An sk_buff stores data in three places:
> 
> 1. A linear chunk of allocated memory in skb->data. This is the easiest
>one to work with, but it precludes using scatterdata since the memory
>must be linear.
> 2. The array skb_shinfo(skb)->frags, which is of maximum length
>MAX_SKB_FRAGS. This is nice for scattergather, since these fragments
>can point to different pages.
> 3. skb_shinfo(skb)->frag_list, which is a pointer to another sk_buff,
>which in turn can have data in either (1) or (2).
> 
> The first two are rather easy to deal with, since they're of a fixed
> maximum length, while the third one is not, since there can be
> potentially limitless chains of fragments. Fortunately dealing with
> frag_list is opt-in for drivers, so drivers don't actually have to deal
> with this mess. For whatever reason, macsec decided it wanted pain, and
> so it explicitly specified NETIF_F_FRAGLIST.
> 
> Because dealing with (1), (2), and (3) is insane, most users of sk_buff
> doing any sort of crypto or paging operation calls a convenient function
> called skb_to_sgvec (which happens to be recursive if (3) is in use!).
> This takes a sk_buff as input, and writes into its output pointer an
> array of scattergather list items. Sometimes people like to declare a
> fixed size scattergather list on the stack; othertimes people like to
> allocate a fixed size scattergather list on the heap. However, if you're
> doing it in a fixed-size fashion, you really shouldn't be using
> NETIF_F_FRAGLIST too (unless you're also ensuring the sk_buff and its
> frag_list children arent't shared and then you check the number of
> fragments in total required.)
> 
> Macsec specifically does this:
> 
> size += sizeof(struct scatterlist) * (MAX_SKB_FRAGS + 1);
> tmp = kmalloc(size, GFP_ATOMIC);
> *sg = (struct scatterlist *)(tmp + sg_offset);
>   ...
> sg_init_table(sg, MAX_SKB_FRAGS + 1);
> skb_to_sgvec(skb, sg, 0, skb->len);
> 
> Specifying MAX_SKB_FRAGS + 1 is the right answer usually, but not if you're
> using NETIF_F_FRAGLIST, in which case the call to skb_to_sgvec will
> overflow the heap, and disaster ensues.
...

Shouldn't skb_to_sgvec() be checking the number of fragments against
the size of the sg list?
The callers would then all need auditing to allow for failure.

David




RE: [PATCH net-next 3/3] samples/bpf: check before defining offsetof

2017-04-25 Thread David Laight
From: Daniel Borkmann
> Sent: 24 April 2017 15:41
> To: Alexander Alemayhu; netdev@vger.kernel.org
> Cc: a...@fb.com
> Subject: Re: [PATCH net-next 3/3] samples/bpf: check before defining offsetof
> 
> On 04/24/2017 03:31 PM, Alexander Alemayhu wrote:
> > Fixes the following warning
> >
> > samples/bpf/test_lru_dist.c:28:0: warning: "offsetof" redefined
> >   #define offsetof(TYPE, MEMBER) ((size_t)&((TYPE *)0)->MEMBER)
> >
> > In file included from ./tools/lib/bpf/bpf.h:25:0,
> >   from samples/bpf/libbpf.h:5,
> >   from samples/bpf/test_lru_dist.c:24:
> > /usr/lib/gcc/x86_64-redhat-linux/6.3.1/include/stddef.h:417:0: note: this 
> > is the location of the
> previous definition
> >   #define offsetof(TYPE, MEMBER) __builtin_offsetof (TYPE, MEMBER)
> >
> > Signed-off-by: Alexander Alemayhu 
> 
> Acked-by: Daniel Borkmann 

Isn't the correct fix to include stddef.h ?

David



RE: [PATCH] iov_iter: don't revert if csum error

2017-04-28 Thread David Laight
From: Sabrina Dubroca
> Sent: 28 April 2017 14:17
...
> > if (__skb_checksum_complete(skb))
> > -   goto csum_error;
> > +   goto fault;
> 
> With this patch, skb_copy_and_csum_datagram_msg() will return -EFAULT
> for an incorrect checksum, that doesn't seem right.

Especially since (IIRC) -EFAULT generates SIGSEGV.

David



RE: [PATCH net] tcp: avoid bogus gcc-7 array-bounds warning

2017-07-28 Thread David Laight
From: Arnd Bergmann
> Sent: 28 July 2017 15:42
...
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2202,9 +2202,10 @@ static bool tcp_small_queue_check(struct sock *sk, 
> const struct sk_buff *skb,
>  static void tcp_chrono_set(struct tcp_sock *tp, const enum tcp_chrono new)
>  {
>   const u32 now = tcp_jiffies32;
> + enum tcp_chrono old = tp->chrono_type;
> 
> - if (tp->chrono_type > TCP_CHRONO_UNSPEC)
> - tp->chrono_stat[tp->chrono_type - 1] += now - tp->chrono_start;
> + if (old > TCP_CHRONO_UNSPEC)
> + tp->chrono_stat[old - 1] += now - tp->chrono_start;
>   tp->chrono_start = now;
>   tp->chrono_type = new;

What a horrid combination of enum and integers.
Also have u32 chrono_stat[3]; - should probably be [__TCP_CHRONO_MAX - 1]
(or - CHRONO_FIRST which is defined to be 1).

Checking if (old != 0) would make the code more readable.

David



RE: [PATCH RFC, iproute2] tc/mirred: Extend the mirred/redirect action to accept additional traffic class parameter

2017-08-01 Thread David Laight
From: Stephen Hemminger
> Sent: 01 August 2017 04:52
> On Mon, 31 Jul 2017 17:40:50 -0700
> Amritha Nambiar  wrote:
> The concept is fine, bu t the code looks different than the rest which
> is never a good sign.
> 
> 
> > +   if ((argc > 0) && (matches(*argv, "tc") == 0)) {
> 
> Extra () are unnecessary in compound conditional.
> 
> > +   tc = atoi(*argv);
> 
> Prefer using strtoul since it has better error handling than atoi()
> 
> > +   argc--;
> > +   argv++;
> > +   }
> 
> 
> Use NEXT_ARG() construct like rest of the code.

Why bother faffing about with argc at all?
The argument list terminates when *argv == NULL.

David



RE: [PATCH net-next 00/10] net: l3mdev: Support for sockets bound to enslaved device

2017-08-01 Thread David Laight
From: David Ahern
> Sent: 01 August 2017 04:13
...
> Existing code for socket lookups already pass in 6+ arguments. Rather
> than add another for the enslaved device index, the existing lookups
> are converted to use a new sk_lookup struct. From there, the enslaved
> device index becomes another element of the struct.
> 
> Patch 1 introduces sk_lookup struct and helper.

I guess that socket lookup happens quite often!
Passing the lookup parameters in a structure might have a
measurable negative effect on performance - especially if the
structure isn't passed through to other functions.

Have you made any performance mearurements?

David



RE: [PATCH] ss: Enclose IPv6 address in brackets

2017-08-01 Thread David Laight
From: Florian Lehner
> Sent: 29 July 2017 13:29
> This patch adds support for RFC2732 IPv6 address format with brackets
> for the tool ss. So output for ss changes from
> 2a00:1450:400a:804::200e:443 to [2a00:1450:400a:804::200e]:443 for IPv6
> addresses with attached port number.
> 
> Signed-off-by: Lehner Florian 
> ---
>  misc/ss.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/misc/ss.c b/misc/ss.c
> index 12763c9..db39c93 100644
> --- a/misc/ss.c
> +++ b/misc/ss.c
> @@ -1059,7 +1059,11 @@ static void inet_addr_print(const inet_prefix *a,
> int port, unsigned int ifindex
>   ap = format_host(AF_INET, 4, a->data);
>   }
>   } else {
> - ap = format_host(a->family, 16, a->data);
> + if (a->family == AF_INET6) {
> + sprintf(buf, "[%s]", format_host(a->family, 16, 
> a->data));
> + } else {
> + ap = format_host(a->family, 16, a->data);
> + }
>   est_len = strlen(ap);
...

There are some strange things going on with global variables if this works at 
all.
The text form of the address is in buf[] in one path and *ap in the other.

One option might be to call format_host() then use strchr(ap, ':')
to add [] if the string contains any ':'.

David



RE: [PATCH 1/2] [for 4.13] net: qcom/emac: disable flow control autonegotiation by default

2017-08-02 Thread David Laight
From: Timur Tabi
> Sent: 02 August 2017 15:22
> On 8/2/17 8:48 AM, David Laight wrote:
> > If the nearby switches cannot handle pause frames, then the MAC shouldn't
> > be sending them at all.
> 
> There's no way for me to know whether the switches can handle the pause
> frames or not.  You would think that sending one multicast pause frame
> ever 33ms would not overload a switch, but in our lab it does.
> 
> This is why I changed the default to disable sending pause frames.
...

Thinks ...
Sending pause frames just tells the adjacent switch not to send you packets
(because you'll discard them).
Since the idea is to avoid the discards, the switch will buffer the
packets it would have sent.
The buffers in the switch then fill up with packets it isn't sending you.
The switch then runs out of buffers, it has 2 choices:
1) Throw the packets away.
2) Send 'pause' frames to the sources.
If it sends 'pause' frames the entire network will very quickly lock up.
If it discards the packets they might as well have been discarded by the
receiving MAC.

Doesn't this mean that pause frames are 99.999% useless??

David



RE: [PATCH 1/2] [for 4.13] net: qcom/emac: disable flow control autonegotiation by default

2017-08-02 Thread David Laight
From: Timur Tabi
> Sent: 02 August 2017 16:09
> On 08/02/2017 09:51 AM, David Laight wrote:
> > Sending pause frames just tells the adjacent switch not to send you packets
> > (because you'll discard them).
> > Since the idea is to avoid the discards, the switch will buffer the
> > packets it would have sent.
> > The buffers in the switch then fill up with packets it isn't sending you.
> 
> I was under the impression that the switch forwards the pause frames to
> other devices, so that the transmitting NIC can stop sending the data,

Most of my ethernet knowledge predates FDX :-)
It wouldn't make any sense to try to use the source address of a pause
frame to suppress traffic to a specific MAC - that would have to go way down
into IP on all the receiving systems.
Also ISTR that pause frames are very short and don't even contain MAC
addresses.

> but your explanation makes a lot more sense.  If the EMAC never stops
> sending pause frames, then the switch's buffers will fill up, disabling
> all other devices.  If the switch does not have per-port buffers, then
> it makes sense when the buffer is full, it blocks all ports.

A switch will (probably) either have buffers for each input port, or for
each output port (or maybe both).
Output port buffers are less likely to cause grief when an output port is
running at a slower speed than the input port.

But is a switch is going to send pause frames it doesn't really matter
how the buffers are arranged. The cascade will still happen.

It would take very careful heuristics in a switch to manage pause
frames properly.

Of course, once the kernel has crashed, even multicast packets will
eventually run the MAC out of buffers.
(Unless you run on private network and manage to reinstall and reboot
while the MAC is still active and then eventually die when a receive
frame overwrites what used to be a receive buffer.)

David



RE: [PATCH 1/2] [for 4.13] net: qcom/emac: disable flow control autonegotiation by default

2017-08-02 Thread David Laight
From: Timur Tabi
> Sent: 01 August 2017 22:38
> The EMAC has a curious qwirk when RX flow control is enabled and the
> kernel hangs.  With the kernel hung, the EMAC's RX queue soon fills.
> If RX flow control is enabled, the EMAC will then send a non-stop
> stream of pause frames until the system is reset.  The EMAC does not
> have a built-in watchdog.
> 
> In various tests, the pause frame stream sometimes overloads nearby
> switches, effectively disabling the network.
...

If the nearby switches cannot handle pause frames, then the MAC shouldn't
be sending them at all.
They

I suspect that they should only ever be sent if the phy autonegotiation
indicates that they are supported.
You might want to avoid sending them even if allowed, or advertise
non-support on hardware that could support them, but sending them
anyway is likely to cause grief.

This is similar to the problems that arise when the 'speed' is forced
instead of limiting the advertised speed to one value.

David




RE: [PATCH] Adding Agile-SD TCP module and modifying Kconfig and Makefile

2017-08-02 Thread David Laight
From: mohamedalrshah
> Sent: 02 August 2017 05:44
> Published:
> Alrshah, M.A., Othman, M., Ali, B. and Hanapi, Z.M., 2015. Agile-SD: a 
> Linux-based
> TCP congestion control algorithm for supporting high-speed and short-distance 
> networks.
> Journal of Network and Computer Applications, 55, pp.181-190.
> 
> Agile-SD is a new loss-based and RTT-independent TCP congestion
> control algorithm designed to support high-speed and Short-Distance (SD)
> networks. It mainly contributes the agility factor mechanism, which allows
> Agile-SD to deal with small buffer sizes while reducing its sensitivity to
> packet loss. Due to the use of this mechanism, Agile-SD improves the
> throughput of TCP up to 50% compared to Cubic-TCP and Compound-TCP. Its
> performance was evaluated using the well-known NS2 simulator to measure
> the average throughput, loss ratio and fairness.
> 
> Agile-SD is designed and implemented by Mohamed A. Alrshah et al. (2015) as
> a Linux kernel module in the real Linux operating system (openSUSE 42.1 Leap),
> which is implemented at the Network, Parallel and Distributed Computing 
> Laboratory,
> A2.10, second floor, Block A, Faculty of Computer Science and Information 
> Technology,
> Universiti Putra Malaysia, over real PCs connected to the Internet for daily 
> uses and
> evaluation purposes.
> 
> The main motivation behind Agile-SD is to support the short-distance networks 
> such
> as local area networks and data center networks in order to increase the 
> bandwidth
> utilization over high-speed networks. Moreover, Agile-SD is introduced in 
> support
> of the open source community, where it can be used under the OpenGL agreement.

What a pile of bullshit ...

David



RE: [RFC PATCH v2 0/2] nb8800 suspend/resume support

2017-08-02 Thread David Laight
From: Måns Rullgård
> Sent: 02 August 2017 17:10
...
> ping -f is limited to 100 packets per second.  Use something like iperf
> in UDP mode instead.

Or break a MAC driver so it just saturates the network.

You might actually need bursts of traffic - otherwise the receiver
will be continuously out of rx buffers and no dma will be active.

David



RE: [PATCH] PCI: Update ACS quirk for more Intel 10G NICs

2017-08-04 Thread David Laight
From: Bjorn Helgaas
> Sent: 03 August 2017 22:49
> On Thu, Jul 20, 2017 at 02:41:01PM -0700, Roland Dreier wrote:
> > From: Roland Dreier 
> >
> > Add one more variant of the 82599 plus the device IDs for X540 and X550
> > variants.  Intel has confirmed that none of these devices does peer-to-peer
> > between functions.  The X540 and X550 have added ACS capabilities in their
> > PCI config space, but the ACS control register is hard-wired to 0 for both
> > devices, so we still need the quirk for IOMMU grouping to allow assignment
> > of individual SR-IOV functions.
...
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -4335,12 +4335,33 @@ static const struct pci_dev_acs_enabled {
> > { PCI_VENDOR_ID_INTEL, 0x1507, pci_quirk_mf_endpoint_acs },
> > { PCI_VENDOR_ID_INTEL, 0x1514, pci_quirk_mf_endpoint_acs },
> > { PCI_VENDOR_ID_INTEL, 0x151C, pci_quirk_mf_endpoint_acs },
> > +   { PCI_VENDOR_ID_INTEL, 0x1528, pci_quirk_mf_endpoint_acs },
> > { PCI_VENDOR_ID_INTEL, 0x1529, pci_quirk_mf_endpoint_acs },
> > +   { PCI_VENDOR_ID_INTEL, 0x154A, pci_quirk_mf_endpoint_acs },
...

That list is looking a bit long.
Is it possible to run-time determine that the ACS control register is hard wired
to zero, and apply the quirk to all such devices.
Or even changing to a (device & mask) == value test??

David




RE: [PATCH] net: phy: marvell: logical vs bitwise OR typo

2017-08-04 Thread David Laight
From: Dan Carpenter
> Sent: 04 August 2017 09:17
> This was supposed to be a bitwise OR but there is a || vs | typo.
> 
> Fixes: 864dc729d528 ("net: phy: marvell: Refactor m88e1121 RGMII delay 
> configuration")
> Signed-off-by: Dan Carpenter 
> 
> diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
> index 361fe9927ef2..15cbcdba618a 100644
> --- a/drivers/net/phy/marvell.c
> +++ b/drivers/net/phy/marvell.c
> @@ -83,7 +83,7 @@
>  #define MII_88E1121_PHY_MSCR_REG 21
>  #define MII_88E1121_PHY_MSCR_RX_DELAYBIT(5)
>  #define MII_88E1121_PHY_MSCR_TX_DELAYBIT(4)
> -#define MII_88E1121_PHY_MSCR_DELAY_MASK  (~(BIT(5) || BIT(4)))
> +#define MII_88E1121_PHY_MSCR_DELAY_MASK  (~(BIT(5) | BIT(4)))

Wouldn't:
+#define MII_88E1121_PHY_MSCR_DELAY_MASK
(~(MII_88E1121_PHY_MSCR_RX_DELAY | MII_88E1121_PHY_MSCR_TX_DELAY))
be more correct?
If a little long?
Actually the ~ looks odd here
Reads code
Kill the define and explicitly mask off the two values just before
conditionally setting them.

David



RE: [PATCH v7 2/3] PCI: Enable PCIe Relaxed Ordering if supported

2017-08-07 Thread David Laight
From: Casey Leedom
> Sent: 04 August 2017 21:49
...
> Whenever our Hardware Designers implement new functionality in our hardware,
> they almost always put in A. several "knobs" which can control fundamental
> parameters of the new Hardware Feature, and B.  a mechanism of completely
> disabling it if necessary.  This stems from the incredibly long Design ->
> Deployment cyle for Hardware (as opposed to the edit->compile->run cycle for 
> s!

Indeed, I'd also expect there to be an undocumented flag to turn
it on (broken) in earlier parts to allow testing.

David



RE: [PATCH net-next 03/14] sctp: remove the typedef sctp_scope_policy_t

2017-08-07 Thread David Laight
From: Xin Long
> Sent: 05 August 2017 13:00
> This patch is to remove the typedef sctp_scope_policy_t and keep
> it's members as an anonymous enum.
> 
> It is also to define SCTP_SCOPE_POLICY_MAX to replace the num 3
> in sysctl.c to make codes clear.
> 
> Signed-off-by: Xin Long 
> ---
>  include/net/sctp/constants.h | 6 --
>  net/sctp/sysctl.c| 2 +-
>  2 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/include/net/sctp/constants.h b/include/net/sctp/constants.h
> index 922fba5..acb03eb 100644
> --- a/include/net/sctp/constants.h
> +++ b/include/net/sctp/constants.h
> @@ -341,12 +341,14 @@ typedef enum {
>   SCTP_SCOPE_UNUSABLE,/* IPv4 unusable addresses */
>  } sctp_scope_t;
> 
> -typedef enum {
> +enum {
>   SCTP_SCOPE_POLICY_DISABLE,  /* Disable IPv4 address scoping */
>   SCTP_SCOPE_POLICY_ENABLE,   /* Enable IPv4 address scoping */
>   SCTP_SCOPE_POLICY_PRIVATE,  /* Follow draft but allow IPv4 private 
> addresses */
>   SCTP_SCOPE_POLICY_LINK, /* Follow draft but allow IPv4 link 
> local addresses */
> -} sctp_scope_policy_t;
> +};
> +
> +#define SCTP_SCOPE_POLICY_MAXSCTP_SCOPE_POLICY_LINK

Perhaps slightly better to end the enum with:
SCTP_SCOPE_POLICY_COUNT,/* Number of policies */
SCTP_SCOPE_POLICY_MAX = SCTP_SCOPE_POLICY_COUNT - 1 /* Last policy 
*/
};

David



RE: [iproute PATCH v2 2/2] ifcfg: Quote left-hand side of [ ] expression

2017-08-18 Thread David Laight
From: Phil Sutter
> Sent: 18 August 2017 12:24
...
> > > -if [ `ip link ls $dev | grep -c MULTICAST` -ge 1 ]; then
> > > +if [ "`ip link ls $dev | grep -c MULTICAST`" -ge 1 ]; then
> >
> > You could drag all these scripts into the 1990's by using $(...)
> > instead of `...`.
> 
> That's a different kettle of fish IMO, using $() doesn't change the
> situation this addresses:
> 
> | $ [ $(echo foo bar) -eq 0 ] && echo ok
> | bash: [: too many arguments
> | $ [ "$(echo foo bar)" -eq 0 ] && echo ok
> | bash: [: foo bar: integer expression expected

I didn't say it would.
IFS=
set -o noglob
would though - not that I'd suggest it here.
protecting the 'eval' is somewhat harder.

David



RE: [net-next 11/15] net/mlx5e: Properly indent within conditional statements

2017-08-17 Thread David Laight
From: Saeed Mahameed
> Sent: 17 August 2017 14:30
> To: David S. Miller
> Cc: netdev@vger.kernel.org; Leon Romanovsky; Or Gerlitz; Saeed Mahameed
> Subject: [net-next 11/15] net/mlx5e: Properly indent within conditional 
> statements
> 
> From: Or Gerlitz 
> 
> To fix these checkpatch complaints:
> 
> WARNING: suspect code indent for conditional statements (8, 24)
> +   if (eth_proto & (MLX5E_PROT_MASK(MLX5E_10GBASE_SR)
> [...]
> +   return PORT_FIBRE;
> 
> Signed-off-by: Or Gerlitz 
> Signed-off-by: Saeed Mahameed 
> ---
>  .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   | 22 
> +++---
>  1 file changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
> b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
> index a75ac4d11c5b..ed161312a773 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
> @@ -988,23 +988,23 @@ static u8 get_connector_port(u32 eth_proto, u8 
> connector_type)
>   return ptys2connector_type[connector_type];
> 
>   if (eth_proto & (MLX5E_PROT_MASK(MLX5E_10GBASE_SR)
> -  | MLX5E_PROT_MASK(MLX5E_40GBASE_SR4)
> -  | MLX5E_PROT_MASK(MLX5E_100GBASE_SR4)
> -  | MLX5E_PROT_MASK(MLX5E_1000BASE_CX_SGMII))) {
> - return PORT_FIBRE;
> + | MLX5E_PROT_MASK(MLX5E_40GBASE_SR4)
> + | MLX5E_PROT_MASK(MLX5E_100GBASE_SR4)
> + | MLX5E_PROT_MASK(MLX5E_1000BASE_CX_SGMII))) {
> + return PORT_FIBRE;

Gah, that is why the rules are stupid.
If anything the continuation lines want indenting a few more bytes.

David



RE: [iproute PATCH v2 1/7] ipntable: Make sure filter.name is NULL-terminated

2017-08-18 Thread David Laight
From: Phil Sutter
> Sent: 17 August 2017 18:09
> To: Stephen Hemminger
> Cc: netdev@vger.kernel.org
> Subject: [iproute PATCH v2 1/7] ipntable: Make sure filter.name is 
> NULL-terminated
> 
> Signed-off-by: Phil Sutter 
> ---
>  ip/ipntable.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/ip/ipntable.c b/ip/ipntable.c
> index 879626ee4f491..7be1f04d33d90 100644
> --- a/ip/ipntable.c
> +++ b/ip/ipntable.c
> @@ -633,7 +633,8 @@ static int ipntable_show(int argc, char **argv)
>   } else if (strcmp(*argv, "name") == 0) {
>   NEXT_ARG();
> 
> - strncpy(filter.name, *argv, sizeof(filter.name));
> + strncpy(filter.name, *argv, sizeof(filter.name) - 1);
> + filter.name[sizeof(filter.name) - 1] = '\0';

Why not check for overflow instead?
if (filter.name[sizeof(filter.name) - 1])
usage("filer name too long");

David


RE: [iproute PATCH v2 2/3] iproute_lwtunnel: Argument to strerror must be positive

2017-08-18 Thread David Laight
From: Phil Sutter
> Sent: 17 August 2017 18:10
> Signed-off-by: Phil Sutter 
> ---
>  ip/iproute_lwtunnel.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/ip/iproute_lwtunnel.c b/ip/iproute_lwtunnel.c
> index 398ab5e077ed8..1a3dc4d4c0ed9 100644
> --- a/ip/iproute_lwtunnel.c
> +++ b/ip/iproute_lwtunnel.c
> @@ -643,7 +643,7 @@ static int lwt_parse_bpf(struct rtattr *rta, size_t len,
>   err = bpf_parse_common(bpf_type, , _cb_ops, );
>   if (err < 0) {
>   fprintf(stderr, "Failed to parse eBPF program: %s\n",
> - strerror(err));
> + strerror(-err));

If we are in userspace I'd expect errno values to be +ve.
Returning a -ve errno is very non-standard.

David



RE: [iproute PATCH v2 4/5] tc/tc_filter: Make sure filter name is not empty

2017-08-18 Thread David Laight
From: Phil Sutter
> Sent: 17 August 2017 18:10
> The later check for 'k[0] != 0' requires a non-empty filter name,
> otherwise NULL pointer dereference in 'q' might happen.
> 
> Signed-off-by: Phil Sutter 
> ---
>  tc/tc_filter.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/tc/tc_filter.c b/tc/tc_filter.c
> index b13fb9185d4fd..a799edb35886d 100644
> --- a/tc/tc_filter.c
> +++ b/tc/tc_filter.c
> @@ -412,6 +412,9 @@ static int tc_filter_get(int cmd, unsigned int flags, int 
> argc, char **argv)
>   usage();
>   return 0;
>   } else {
> + if (!strlen(*argv))
> + invarg("invalid filter name", *argv);

That is nearly as bad as:
p[strlen(p)] = 0;

> +
>   strncpy(k, *argv, sizeof(k)-1);
> 
>   q = get_filter_kind(k);
> --
> 2.13.1



RE: [iproute PATCH v2 2/2] ifcfg: Quote left-hand side of [ ] expression

2017-08-18 Thread David Laight
From: Phil Sutter
> Sent: 17 August 2017 18:10
> This prevents word-splitting and therefore leads to more accurate error
> message in case 'grep -c' prints something other than a number.
> 
> Signed-off-by: Phil Sutter 
> ---
>  ip/ifcfg | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/ip/ifcfg b/ip/ifcfg
> index 083d9df36742f..30a2dc49816cd 100644
> --- a/ip/ifcfg
> +++ b/ip/ifcfg
> @@ -131,7 +131,7 @@ noarp=$?
> 
>  ip route add unreachable 224.0.0.0/24 >& /dev/null
>  ip route add unreachable 255.255.255.255 >& /dev/null
> -if [ `ip link ls $dev | grep -c MULTICAST` -ge 1 ]; then
> +if [ "`ip link ls $dev | grep -c MULTICAST`" -ge 1 ]; then

You could drag all these scripts into the 1990's by using $(...)
instead of `...`.

David



RE: [PATCH 1/3] Fix ERROR: trailing statements should be on next line

2017-05-15 Thread David Laight
From: Alex Williamson
> Sent: 15 May 2017 04:21
...
> > >   /* Find end of list, sew whole thing into vi->rq.pages. */
> > > - for (end = page; end->private; end = (struct page *)end->private);
> > > + for (end = page; end->private; end = (struct page *)end->private)
> > > + ;
> 
> FWIW, I generally like to put a comment on the next line to make it
> abundantly clear that there's nothing in the body of the loop, it's
> also more aesthetically pleasing than a semi-colon on the line by
> itself, ex. /* Nothing */;  It's just too easy to misinterpret the
> loop otherwise, especially without gratuitous white space.  Thanks,

My preference is to put 'continue;' on a line by itself.
Or even move the termination condition into the loop:
for (end = page;; end = (struct page *)end->private)
if (!end->private)
break;

(oh, is that cast needed??)

David



RE: [PATCH 1/4] hamradio: Combine two seq_printf() calls into one in yam_seq_show()

2017-05-10 Thread David Laight
From: SF Markus Elfring
> Sent: 09 May 2017 15:22
> A bit of data was put into a sequence by two separate function calls.
> Print the same data by a single function call instead.
> 
> This issue was detected by using the Coccinelle software.
> 
> Signed-off-by: Markus Elfring 
> ---
>  drivers/net/hamradio/yam.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/net/hamradio/yam.c b/drivers/net/hamradio/yam.c
> index b6891ada1d7b..542f1e511df1 100644
> --- a/drivers/net/hamradio/yam.c
> +++ b/drivers/net/hamradio/yam.c
> @@ -830,8 +830,7 @@ static int yam_seq_show(struct seq_file *seq, void *v)
>   seq_printf(seq, "  RxFrames %lu\n", dev->stats.rx_packets);
>   seq_printf(seq, "  TxInt%u\n", yp->nb_mdint);
>   seq_printf(seq, "  RxInt%u\n", yp->nb_rxint);
> - seq_printf(seq, "  RxOver   %lu\n", dev->stats.rx_fifo_errors);
> - seq_printf(seq, "\n");
> + seq_printf(seq, "  RxOver   %lu\n\n", dev->stats.rx_fifo_errors);
>   return 0;

The code was consistently (and probably deliberately) using one seq_printf()
call for each line so that the source looks 'a bit like' the output.

These changes are all stupid.

David



RE: bpf pointer alignment validation

2017-05-10 Thread David Laight
From: Alexei Starovoitov
> Sent: 10 May 2017 06:58
> > +static u32 calc_align(u32 imm)
> > +{
> > +   u32 align = 1;
> > +
> > +   if (!imm)
> > +   return 1U << 31;
> > +
> > +   while (!(imm & 1)) {
> > +   imm >>= 1;
> > +   align <<= 1;
> > +   }
> > +   return align;
> > +}
> 
> same question as in previous reply.
> Why not to use something like:
> static u32 calc_align(u32 n)
> {
> if (!n)
> return 1U << 31;
> return n - ((n - 1) & n);
> }

That function needs a comment saying what it returns.
Think I'd write it as:
return n & ~(n & (n - 1));
(even though that might be one more instruction)

David



RE: [PATCH net-next 9/9] nfp: eliminate an if statement in calculation of completed frames

2017-05-17 Thread David Laight
From: Jakub Kicinski
> Sent: 16 May 2017 01:55
> Given that our rings are always a power of 2, we can simplify the
> calculation of number of completed TX descriptors by using masking
> instead of if statement based on whether the index have wrapped
> or not.
> 
> Signed-off-by: Jakub Kicinski 
> ---
>  drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 10 ++
>  1 file changed, 2 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
> b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
> index c64514f8ee65..da83e17b8b20 100644
> --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
> +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
> @@ -940,10 +940,7 @@ static void nfp_net_tx_complete(struct nfp_net_tx_ring 
> *tx_ring)
>   if (qcp_rd_p == tx_ring->qcp_rd_p)
>   return;
> 
> - if (qcp_rd_p > tx_ring->qcp_rd_p)
> - todo = qcp_rd_p - tx_ring->qcp_rd_p;
> - else
> - todo = qcp_rd_p + tx_ring->cnt - tx_ring->qcp_rd_p;
> + todo = D_IDX(tx_ring, qcp_rd_p + tx_ring->cnt - tx_ring->qcp_rd_p);

I'm not sure you need to add tx_ring->cnt here.
I bet D_IDX() masks it away.

>   while (todo--) {
>   idx = D_IDX(tx_ring, tx_ring->rd_p++);

That '++' looks suspicious.
I think you need to decide whether you are incrementing pointers into the ring
or indexes into it.
Sometimes it is safer to use a non-wrapping index and mask when accessing the 
entry.
entry_ptr = [idx & (RING_SIZE - 1)]
Ring full is then (read_idx == write_idx + RING_SIZE),
ring empty (read_idx == write_idx).
So the index just wrap at (probably)_2^32.

David



RE: sky2: Use seq_putc() in sky2_debug_show()

2017-05-09 Thread David Laight
From: Stephen Hemminger
> Sent: 09 May 2017 06:50
> On Mon, 8 May 2017 19:42:46 +0200
> SF Markus Elfring  wrote:
> 
> > > Which issue do you mean? I dont see any issue you fix here.
> >
> > Are the run time characteristics a bit nicer for the function seq_putc
> > in comparison to the function seq_puts for printing a single line break 
> > here?
> >
> > Regards,
> > Markus
> 
> I would put this in why bother category. seq_puts is correct and this is only
> in diagnostic output useful to developer and disabled on most distro kernels

Sometimes consistency is best.
Output everything with seq_printf(), using a format "%s" if necessary.
The performance really doesn't matter here at all.

It is also (probably) possible to get gcc to do the conversions - as it does 
for printf().
(A right PITA for embedded systems where only printf() exists.)

David



RE: [PATCH net-next 1/8] qed: LL2 to use packed information for tx

2017-06-09 Thread David Laight
From: David Miller
> Sent: 09 June 2017 00:24
> 
> From: Yuval Mintz 
> Date: Thu, 8 Jun 2017 19:13:16 +0300
> 
> > @@ -67,6 +79,21 @@ struct qed_ll2_stats {
> > u64 sent_bcast_pkts;
> >  };
> >
> > +struct qed_ll2_tx_pkt_info {
> > +   u8 num_of_bds;
> > +   u16 vlan;
> > +   u8 bd_flags;
> > +   u16 l4_hdr_offset_w;/* from start of packet */
> > +   enum qed_ll2_tx_dest tx_dest;
> > +   enum qed_ll2_roce_flavor_type qed_roce_flavor;
> > +   dma_addr_t first_frag;
> > +   u16 first_frag_len;
> > +   bool enable_ip_cksum;
> > +   bool enable_l4_cksum;
> > +   bool calc_ip_len;
> > +   void *cookie;
> > +};
> > +
> 
> This layout is extremely inefficient, with lots of padding in between
> struct members.
> 
> Group small u8 members and u16 members together so that they consume
> full 32-bit areas so you can eliminate all of the padding.

I'd also query the use of u16 sizes/lengths, any arithmetic on u16 (and u8)
variables is likely to generate extra code (on non-x86).
You are using 32 bits for the 'enum' - I bet the values fit in 8 bits,
so aren't really worried about size.

If size did matter you can easily get the above down to 32 bytes.

David



RE: [PATCH net-next 1/8] qed: LL2 to use packed information for tx

2017-06-09 Thread David Laight
From: Mintz, Yuval
> Sent: 09 June 2017 08:52
> > From: David Laight [mailto:david.lai...@aculab.com]
> > Sent: Friday, June 09, 2017 10:28 AM
> > To: 'David Miller' <da...@davemloft.net>; Mintz, Yuval
> > <yuval.mi...@cavium.com>
> > Cc: netdev@vger.kernel.org; linux-r...@vger.kernel.org; Kalderon, Michal
> > <michal.kalde...@cavium.com>
> > Subject: RE: [PATCH net-next 1/8] qed: LL2 to use packed information for tx
> >
> > From: David Miller
> > > Sent: 09 June 2017 00:24
> > >
> > > From: Yuval Mintz <yuval.mi...@cavium.com>
> > > Date: Thu, 8 Jun 2017 19:13:16 +0300
> > >
> > > > @@ -67,6 +79,21 @@ struct qed_ll2_stats {
> > > > u64 sent_bcast_pkts;
> > > >  };
> > > >
> > > > +struct qed_ll2_tx_pkt_info {
> > > > +   u8 num_of_bds;
> > > > +   u16 vlan;
> > > > +   u8 bd_flags;
> > > > +   u16 l4_hdr_offset_w;/* from start of packet */
> > > > +   enum qed_ll2_tx_dest tx_dest;
> > > > +   enum qed_ll2_roce_flavor_type qed_roce_flavor;
> > > > +   dma_addr_t first_frag;
> > > > +   u16 first_frag_len;
> > > > +   bool enable_ip_cksum;
> > > > +   bool enable_l4_cksum;
> > > > +   bool calc_ip_len;
> > > > +   void *cookie;
> > > > +};
> > > > +
> > >
> > > This layout is extremely inefficient, with lots of padding in between
> > > struct members.
> > >
> > > Group small u8 members and u16 members together so that they consume
> > > full 32-bit areas so you can eliminate all of the padding.
> >
> > I'd also query the use of u16 sizes/lengths, any arithmetic on u16 (and u8)
> > variables is likely to generate extra code (on non-x86).
> > You are using 32 bits for the 'enum' - I bet the values fit in 8 bits, so 
> > aren't
> > really worried about size.
> >
> > If size did matter you can easily get the above down to 32 bytes.
> 
> You're right, and that's exactly the point -  since this is not data-path 
> critical
> I don't see why the size/efficiency should matter [greatly].

It is just good practise so that it happens automatically when it does matter.
Just swapping 'vlan' and 'bd_flags' would make it look much better.

David



RE: [PATCH net-next 09/10] net/mlx4_en: Replace TXBB_SIZE multiplications with shift operations

2017-06-20 Thread David Laight
From: Tariq Toukan
> Sent: 15 June 2017 12:36
> Define LOG_TXBB_SIZE, log of TXBB_SIZE, and use it with a shift
> operation instead of a multiplication with TXBB_SIZE.
> Operations are equivalent as TXBB_SIZE is a power of two.
> 
> Performance tests:
> Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
> 
> Gain is too small to be measurable, no degradation sensed.
> Results are similar for IPv4 and IPv6.

I can't imagine there is any difference at all.
The compiler will use a shift for a 'multiply by a constant power of 2'.

...
If you want to save a few cycles I think the loop:
> - for (i = 0; i < tx_info->nr_txbb * TXBB_SIZE;
> + for (i = 0; i < tx_info->nr_txbb << LOG_TXBB_SIZE;
requires the compiler generate code to read nr_txbb every
iteration.
Caching the values might help (unless it causes a different
register spill).

David



RE: [PATCH 00/51] rtc: stop using rtc deprecated functions

2017-06-21 Thread David Laight
From: Russell King - ARM Linux
> Sent: 20 June 2017 22:16
..
> Consider that at the moment, we define the 32-bit RTC representation to
> start at a well known epoch.  We _could_ decide that when it wraps to
> 0x8000 seconds, we'll define the lower 0x4000 seconds to mean
> dates in the future - and keep rolling that forward each time we cross
> another 0x4000 seconds.  Unless someone invents a real time machine,
> we shouldn't need to set a modern RTC back to 1970.

True, just treating the value as unsigned gives another 67 years.

If a 32bit RTC is programmed with the low 32bits of the 64bit 'seconds
since 1970' the kernel should have no real difficulty sorting out the
high bits from other available information.

Problems with things like the x86 bios setting the rtc to stupid values
are another matter.
ISTR the rtc chip has a bit for 'summertime' that is never set, on a
multi-os system you can get multiple summer time changes.

David



RE: [net-next 11/15] net/mlx5e: Reduce number of heap allocated buffers for update stats

2017-06-20 Thread David Laight
From: Saeed Mahameed
> Sent: 15 June 2017 22:43
> Allocating buffers on the heap every 200ms is something we should avoid,
> let's use buffers located on the stack instead.
...
> + u32 in[MLX5_ST_SZ_DW(ppcnt_reg)] = {0};

How big is that?

Allocating large on-stack buffers is even worse.

One kzalloc() every 200ms isn't going to be noticeable.

David



RE: remove dma_alloc_noncoherent

2017-06-20 Thread David Laight
From: Christoph Hellwig
> Sent: 16 June 2017 08:17
>
> For many years we've had the dma_alloc_attrs API that is more flexible
> than dma_alloc_noncoherent.  This series moves the remaining users over
> to the attrs API.

And most of the callers probably only want to specify 'noncoherent'.
Grepping the source for other uses is easier if the wrapper is left.

David



RE: Faster TCP keepalive

2017-06-26 Thread David Laight
> From: Stephen Suryaputra Lin
> Sent: 23 June 2017 20:58
...
> Suggestions on other ways to quickly tearing down TCP connections to a
> rebooted host in the application above are welcomed.

Arrange to send an RST for each connection during shutdown.

David



RE: [PATCH] net: usb: asix88179_178a: Add support for the Belkin B2B128

2017-06-27 Thread David Laight
From: Andrew F. Davis
> Sent: 26 June 2017 18:41
> The Belkin B2B128 is a USB 3.0 Hub + Gigabit Ethernet Adapter, the
> Ethernet adapter uses the ASIX AX88179 USB 3.0 to Gigabit Ethernet
> chip supported by this driver, add the USB ID for the same.
...

I've just had a look at the current version of ax88179_178a.c.
It still makes me pull my hair out

Not the least of the problems is that it lies about skb->truesize.
All the receive skb are longer than 16k - so will be 64k, but
it sets skb->truesize based on the actual receive frame size.

A lot of the code is also 'over complicated' - making it slower
that strictly necessary.

There is also the more general problem that usbnet is horribly
inefficient for anything trying to run at Ge speeds (never mind
anything faster.

David



RE: [RFC V1 1/1] net: cdc_ncm: Reduce memory use when kernel memory low

2017-05-19 Thread David Laight
From: linux-usb-ow...@vger.kernel.org [mailto:linux-usb-ow...@vger.kernel.org] 
On Behalf Of Jim Baxter
> Sent: 16 May 2017 18:41
> 
> The CDC-NCM driver can require large amounts of memory to create
> skb's and this can be a problem when the memory becomes fragmented.
> 
> This especially affects embedded systems that have constrained
> resources but wish to maximise the throughput of CDC-NCM with 16KiB
> NTB's.

Why is this driver copying multiple tx messages into a single skb.
Surely there are better ways to do this??

I think it is generating a 'multi-ethernet frame' URB with an
overall header for each URB and a header for each ethernet frame.

Given that the USB stack allows multiple concurrent transmits I'm
surprised that batching large ethernet frames makes much difference.

Also the USB target can't actually tell when URB that contain
multiples of the USB packet size end.
So it is possible to send a single NTB as multiple URB.
Of course, the usb_net might have other ideas.

David



RE: [PATCH net-next 9/9] nfp: eliminate an if statement in calculation of completed frames

2017-05-19 Thread David Laight
From: Jakub Kicinski
> Sent: 17 May 2017 18:37
..
> > >   while (todo--) {
> > >   idx = D_IDX(tx_ring, tx_ring->rd_p++);
> >
> > That '++' looks suspicious.
> > I think you need to decide whether you are incrementing pointers into the 
> > ring
> > or indexes into it.
> > Sometimes it is safer to use a non-wrapping index and mask when accessing 
> > the entry.
> > entry_ptr = [idx & (RING_SIZE - 1)]
> > Ring full is then (read_idx == write_idx + RING_SIZE),
> > ring empty (read_idx == write_idx).
> > So the index just wrap at (probably)_2^32.
> 
> I may be missing the point.  I use a mix of the two, actually, the
> software pointers are free running (non-wrapping) but the HW QCP
> pointers wrap.  Because HW pointers wrap I always keep one entry on
> the rings empty, see nfp_net_tx_full().

Ah, I'd assumed that rd_p was a pointer, not an index.

David



RE: [PATCH net-next] cxgb4: fix incorrect cim_la output for T6

2017-05-19 Thread David Laight
From: Ganesh Goudar
> Sent: 19 May 2017 11:12
T6
> 
> take care of UpDbgLaRdPtr[0-3] restriction for T6
> 
> Signed-off-by: Ganesh Goudar 
> ---
>  drivers/net/ethernet/chelsio/cxgb4/t4_hw.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c 
> b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
> index aded42b96..917b46b 100644
> --- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
> +++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
> @@ -8268,6 +8268,13 @@ int t4_cim_read_la(struct adapter *adap, u32 *la_buf, 
> unsigned int *wrptr)
>   if (ret)
>   break;
>   idx = (idx + 1) & UPDBGLARDPTR_M;
> +
> + /* Bits 0-3 of UpDbgLaRdPtr can be between  to 1001 to
> +  * identify the 32-bit portion of the full 312-bit data
> +  */
> + if (is_t6(adap->params.chip))
> + while ((idx & 0xf) > 9)
> + idx = (idx + 1) % UPDBGLARDPTR_M;

Why the loop, maybe:
if (is_t6(adap->params.chip) && (idx & 0xf) >= 9)
idx = (idx & 0xf0) + 0x10;
else
idx++;
idx &= UPDBGLARDPTR_M;

David



RE: [RFC V1 1/1] net: cdc_ncm: Reduce memory use when kernel memory low

2017-05-19 Thread David Laight
From: Bjørn Mork
> Sent: 19 May 2017 14:56
...
> Unless someone has a nice way to just collect a list of skbs and have
> them converted to proper framing on the fly when transmitting, without
> having to care about USB packet boundaries.

skb can be linked into arbitrary chains (or even trees), but I suspect
the usbnet code would need to be taught about them.

For XHCI it isn't too bad because it will do arbitrary scatter-gather
(apart from the ring end).
But I believe the earlier controllers only support fragments that
end on usb packet boundaries.

I looked at the usbnet code a few years ago, I'm sure it ought to
be possible to shortcut most of the code that uses URB and directly
write to the xhci (in particular) ring descriptors.

For receive you probably want to feed the USB stack multiple (probably)
2k buffers, and handle the debatching into ethernet frames yourself.

One of the ASIX drivers used to be particularly bad, it allocated 64k
skb for receive (the hardware can merge IP packets) and then hacked
the true_size before passing upstream.

David



RE: [PATCH nf-next] netns: add and use net_ns_barrier

2017-06-02 Thread David Laight
From: Florian Westphal
> Sent: 30 May 2017 10:38
> 
> Quoting Joe Stringer:
>   If a user loads nf_conntrack_ftp, sends FTP traffic through a network
>   namespace, destroys that namespace then unloads the FTP helper module,
>   then the kernel will crash.
> 
> Events that lead to the crash:
> 1. conntrack is created with ftp helper in netns x
> 2. This netns is destroyed
> 3. netns destruction is scheduled
> 4. netns destruction wq starts, removes netns from global list
> 5. ftp helper is unloaded, which resets all helpers of the conntracks
> via for_each_net()
> 
> but because netns is already gone from list the for_each_net() loop
> doesn't include it, therefore all of these conntracks are unaffected.
> 
> 6. helper module unload finishes
> 7. netns wq invokes destructor for rmmod'ed helper
> 
...
>  void
>  nf_ct_iterate_destroy(int (*iter)(struct nf_conn *i, void *data), void *data)
> @@ -1734,6 +1736,13 @@ nf_ct_iterate_destroy(int (*iter)(struct nf_conn *i, 
> void *data), void *data)
>   }
>   rtnl_unlock();
> 
> + /* Need to wait for netns cleanup worker to finish, if its
> +  * running -- it might have deleted a net namespace from
> +  * the global list, so our __nf_ct_unconfirmed_destroy() might
> +  * not have affected all namespaces.
> +  */
> + net_ns_barrier();
> +

A problem I see is that nothing obvious guarantees that the cleanup worker
has actually started.

David



RE: [PATCH net-next 10/12] nfp: fix print format for ring pointers in ring dumps

2017-05-31 Thread David Laight
From: Jakub Kicinski
> Sent: 28 May 2017 01:34
> Ring pointers are unsigned.  Fix the print formats to avoid
> showing users negative values.
...
> 
> - seq_printf(file, "RX[%02d,%02d]: cnt=%d dma=%pad host=%p   H_RD=%d 
> H_WR=%d FL_RD=%d FL_WR=%d\n",
> + seq_printf(file, "RX[%02d,%02d]: cnt=%u dma=%pad host=%p   H_RD=%u 
> H_WR=%u FL_RD=%u FL_WR=%u\n",

I can't help feeling that %x might be better here.
The wrap points and ring full conditions will be rather more obvious.

David



RE: [patch net-next 3/3] mlxsw: spectrum_router: Align RIF index allocation with existing code

2017-06-06 Thread David Laight
From: Jiri Pirko
> Sent: 04 June 2017 15:54
> The way we usually allocate an index is by letting the allocation
> function return an error instead of an invalid index.
...
> -static int mlxsw_sp_avail_rif_get(struct mlxsw_sp *mlxsw_sp)
> +static int mlxsw_sp_rif_index_alloc(struct mlxsw_sp *mlxsw_sp, u16 
> *p_rif_index)

You typically get much better code from returning an invalid index.
This is why PTR_ERR() exists.

David



RE: [PATCH v4 03/11] net: mvmdio: use GENMASK for masks

2017-06-19 Thread David Laight
From: Antoine Tenart
> Sent: 14 June 2017 16:49
> Cosmetic patch to use the GENMASK helper for masks.
...
> - ret = val & 0x;
> + ret = val & GENMASK(15, 0);

My 2c: It isn't at all clear to me that changes like this in anyway
improve the code readability.
In some sense the '15' should be a named constant - but that just makes
it even less obvious what is going on.

David



RE: sock_create_kern() and (lack of) get_net()

2017-05-04 Thread David Laight
From: Cong Wang
> Sent: 03 May 2017 17:33
> On Wed, May 3, 2017 at 4:39 AM, David Laight <david.lai...@aculab.com> wrote:
> > I suspect that many of the sockets created with 'kern=1' are not 'special'
> > and should hold a reference to the namespace.
> 
> They are special if they are created in net init, which means they
> have the same life-time with netns. They should NOT hold a refcnt,
> otherwise who would release the last netns refcnt? net exit is called
> when refcnt reaches 0, if you really held it, it is always at least 1.

Right, so some are 'special' but some aren't?

At the moment no sockets created in the kernel hold a reference to the
namespace, deleting the namespace won't magically cause the kernel
module that created the socket to delete it.
If the socket is being used for a TCP connection then in can sit in
one of the FIN-WAIT states after sock_release() has been called
(I'm not sure exactly when the 'sock' and 'socket' get freed).

I guess it would be possible for the namespace deleting code to
traverse the list of sockets that reference the namespace and 'unhook'
then all from the relevant protocol stack so that any 'user' calls
error out.
It would need to allow for threads blocked inside the socket functions.

Does any of that happen?

David



RE: [PATCH v4 net-next 04/10] net/ncsi: Ethtool operation to get NCSI topology

2017-05-04 Thread David Laight
From: Gavin Shan
> Sent: 04 May 2017 07:16
> On Wed, May 03, 2017 at 10:19:44PM -0700, Stephen Hemminger wrote:
> >On Wed,  3 May 2017 14:44:35 +1000
> >Gavin Shan  wrote:
...
> >> +{
> >> +  struct ethtool_ncsi_channels *enc;
> >> +  short nr_channels;
> >Should be __u16 or unsigned not short.
> >
> 
> Nope, It's for signed number. User expects to get number of available
> channels when negative number is passed in. When it's positive, it's
> going to get the channels' information.

Why 16 bits?
You are just making life hard for the compiler and possibly generating
random padding.

I guess the user is expected to pass -1 first to get the number of
channels, then allocate an appropriate sized array and call again
specifying the number of channels?

What happens if the number of channels changes between the two requests?

I'd also suggest passing the size of each entry (in at least one direction).
That way additional channel information can be added.

David



RE: [PATCH] net: dsa: loop: Check for memory allocation failure

2017-05-08 Thread David Laight
From: Christophe JAILLET
> Sent: 06 May 2017 06:30
> If 'devm_kzalloc' fails, a NULL pointer will be dereferenced.
> Return -ENOMEM instead, as done for some other memory allocation just a
> few lines above.
...
> --- a/drivers/net/dsa/dsa_loop.c
> +++ b/drivers/net/dsa/dsa_loop.c
> @@ -256,6 +256,9 @@ static int dsa_loop_drv_probe(struct mdio_device *mdiodev)
>   return -ENOMEM;
> 
>   ps = devm_kzalloc(>dev, sizeof(*ps), GFP_KERNEL);
> + if (!ps)
> + return -ENOMEM;
> +
>   ps->netdev = dev_get_by_name(_net, pdata->netdev);
>   if (!ps->netdev)
>   return -EPROBE_DEFER;

On the face if it this code leaks like a sieve.

David



sock_create_kern() and (lack of) get_net()

2017-05-03 Thread David Laight
sock_create_kern() passes 'kern=1' to __sock_create().
sock_create() passes 'kern=0' and uses current->nsproxy->ns_net.

The 'kern' parameter is passed to security_socket_create() and
security_socket_post_create() - I think this is just checking
whether the call is allowed.

The 'kern' parameter is also passed through to sk_alloc() and
controls whether the socket holds a reference count to the namespace.

The latter 'feature' is there because some sockets are used within
the protocol stack itself and the network namespace needs to be
deleteable while those sockets exits.
Prior to 4.2 get_get() was called when all sockets were created and
a 'dance' was done is a few places to drop the reference.
These sockets are still inside the namespace - so must be deleted
by the code that deletes the namespace.

I suspect that many of the sockets created with 'kern=1' are not 'special'
and should hold a reference to the namespace.

In particular code that calls sock_create_kern() and then uses the
kernel_xxx() socket functions at the bottom of net/socket.c probably
want to hold a reference to the network namespace.
I'm pretty sure the socket can still exist (eg draining data) after
sock_release() is called - so the driver can't hold the namespace
reference on behalf of the socket.

A quick audit shows calls to __sock_create(..., 1) at:
  ./fs/cifs/connect.c:3176
  ./net/wireless/nl80211.c:10022
  ./net/sunrpc/svcsock.c:1516
  ./net/sunrpc/clnt.c:1247
  ./net/sunrpc/xprtsock.c:1952
  ./net/sunrpc/xprtsock.c:2019
  ./net/9p/trans_fd.c:948
  ./net/9p/trans_fd.c:996
and calls to sock_create_kern() at:
  ./drivers/infiniband/sw/rxe/rxe_qp.c:233
  ./drivers/block/drbd/drbd_receiver.c:631
  ./drivers/block/drbd/drbd_receiver.c:726
  ./fs/dlm/lowcomms.c:732
  ./fs/dlm/lowcomms.c:1053
  ./fs/dlm/lowcomms.c:1134
  ./fs/dlm/lowcomms.c:1221
  ./fs/dlm/lowcomms.c:1303
  ./fs/afs/rxrpc.c:68
  ./net/ceph/messenger.c:480
  ./net/rds/tcp_connect.c
  ./net/rds/tcp_connect.c:108
  ./net/rds/tcp_listen.c:128
  ./net/rds/tcp_listen.c:247
  ./net/rxrpc/local_object.c:117
  ./net/smc/af_smc.c:1317
  ./net/l2tp/l2tp_core.c:1506
  ./net/l2tp/l2tp_core.c:1534
All of which look to me like code that is using IP connections and
would need to be shut down before any namespace could be deleted.

There are also calls to sock_create_kern() in:
  ./net/tipc/server.c:330
  ./net/ipv6/ip6_udp_tunnel.c:22
  ./net/ipv4/udp_tunnel.c:19
  ./net/ipv4/af_inet.c:1529
  ./net/bluetooth/rfcomm/core.c:203
  ./net/netfilter/ipvs/ip_vs_sync.c:1503
  ./net/netfilter/ipvs/ip_vs_sync.c:1560
These might all be internal to the protocol stack.

I suspect that the 'kern' parameter to __sock_create() needs changing
to 'flags' with:
  1 - traditional 'kernel' socket, pass '1' to security_socket_create().
  2 - 'protocol internal' socket, don't hold a net_ns reference count.
The call sites would then need auditing to see which value they should
pass.

As usual I've probably missed something obvious...

David



RE: Question about SOCK_SEQPACKET

2017-05-05 Thread David Laight
From: Steven Whitehouse
> Sent: 05 May 2017 11:34
...
> Just before the part that you've quoted, the description for
> SOCK_SEQPACKET says:
> "The SOCK_SEQPACKET socket type is similar to the SOCK_STREAM type, and
> is also connection-oriented. The only difference between these types is
> that record boundaries are maintained using the SOCK_SEQPACKET type. A
> record can be sent using one or more output operations and received
> using one or more input operations, but a single operation never
> transfers parts of more than one record."

Right SOCK_SEQPACKET is for protocols like ISO transport.
There is no limit on the length of a 'record'.
I've written file transfer programs that put the entire file
data into a single 'record'. The receiver disconnected on
receipt of the 'end of record'.

The socket man pages could easily be wrong - they are very IP-centric.
Remember ISO transport use declined when Unix became more popular
back in the mid 1980s. Unix sockets have never really been used for
it - the address information needed just doesn't match that IP
(especially IPv4).

David



RE: SSE instructions for fast packet copy?

2017-05-05 Thread David Laight
From: Tom Herbert
> Sent: 05 May 2017 06:51
> To: Linux Kernel Network Developers
> Subject: SSE instructions for fast packet copy?
> 
> Hi,
> 
> I am thinking about the possibility of using SSE in kernel for
> speeding up the kernel memcpy particularly for copy to userspace
> emeory, and maybe even using the string instructions (like if we
> supported regex in something like eBPF). AFAIK we don't use SSE in
> kernel because of xmm register state needing to be saved across
> context switch. However, if we start busy-polling a CPU in kernel on
> network queues then there might not be any context switches to worry
> about. In this model we'd want to enable SSE per CPU.
> 
> Has this ever been tried before? Is this at all feasible? :-) Is it
> possible to enable SSE for kernel for just one CPU? (I found CPUID
> will return SSE supported, but don't see how to enable other than
> -msse for compiling).

Not even worth thinking about.
With recent intel cpus 'rep movsb' is optimised in the hardware
(for cached memory) and will run as fast as any other copy.

(There is a related fubar that memcopytoio() is implemented
as memcpy() and then as 'rep movsb' so generates repeated
byte accesses to io memory.)

I'm pretty sure the FP registers are 'lazy saved'.
The cpu's sse registers (the entire FP register set) might
contain life values for a process that is running on a different cpu.
If that process executes an FP instruction it will fault and an IPI
issued to get the registers written to the processes fp save area
from where they can be loaded.
Any use of the sse registers would have to interact correctly
with that IPI code.

David



RE: [PATCH v4 net-next 04/10] net/ncsi: Ethtool operation to get NCSI topology

2017-05-08 Thread David Laight
From: Gavin Shan
> Sent: 08 May 2017 01:20
...
> >Why 16 bits?
> >You are just making life hard for the compiler and possibly generating
> >random padding.
> >
> 
> It's because there are 256 NCSI channels to maximal degree. So 16-bits
> is the minial data width to hold it in signed format. Yes, I think
> __s32 would be better in this case. However, I would like to discard
> the negotiation mechanism in next respin.

Just because the domain of a value fits in 16 bits doesn't mean
that a 16bit type is appropriate.

It is generally much better to use 32 (aka machine word) sized
items unless you have an array or are trying to fit a lot of
items into a small memory area.

David



RE: [PATCH] x86/asm: Don't use rbp as temp register in csum_partial_copy_generic()

2017-05-04 Thread David Laight
From: Josh Poimboeuf
> Sent: 04 May 2017 15:52
> Andrey Konovalov reported the following warning while fuzzing the kernel
> with syzkaller:
> 
>   WARNING: kernel stack regs at 8800686869f8 in a.out:4933 has bad 'bp' 
> value c3fc855a10167ec0
> 
> The unwinder dump revealed that rbp had a bad value when an interrupt
> occurred in csum_partial_copy_generic().
> 
> That function saves rbp on the stack and then overwrites it, using it as
> a scratch register.  That's problematic because it breaks stack traces
> if an interrupt occurs in the middle of the function.

Does gcc guarantee not to use bp as a scratch register in leaf functions?

David



RE: [PATCH] net_sched: use explicit size of struct tcmsg, remove need to declare tcm

2017-09-18 Thread David Laight
From: Eric Dumazet
> Sent: 18 September 2017 16:01
...
> > > - err = nlmsg_parse(nlh, sizeof(*tcm), tca, TCA_MAX, NULL, NULL);
> > > + err = nlmsg_parse(nlh, sizeof(struct tcmsg), tca, TCA_MAX, NULL, NULL);
> >
> > Would sizeof(*nlmsg_data(nlh)) be cleaner??
> 
> Not really, since
> 
> static inline void *nlmsg_data(const struct nlmsghdr *nlh)

I thought about that after posting :-(

David



RE: [PATCH] net_sched: use explicit size of struct tcmsg, remove need to declare tcm

2017-09-18 Thread David Laight
From: Colin King
> Sent: 18 September 2017 12:41
> Pointer tcm is being initialized and is never read, it is only being used
> to determine the size of struct tcmsg.  Clean this up by removing
> variable tcm and explicitly using the sizeof struct tcmsg rather than *tcm.
> Cleans up clang warning:
> 
> warning: Value stored to 'tcm' during its initialization is never read
> 
> Signed-off-by: Colin Ian King 
> ---
>  net/sched/sch_api.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index c6deb74e3d2f..aa82116ed10c 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1500,7 +1500,6 @@ static int tc_dump_qdisc(struct sk_buff *skb, struct 
> netlink_callback *cb)
>   int s_idx, s_q_idx;
>   struct net_device *dev;
>   const struct nlmsghdr *nlh = cb->nlh;
> - struct tcmsg *tcm = nlmsg_data(nlh);
>   struct nlattr *tca[TCA_MAX + 1];
>   int err;
> 
> @@ -1510,7 +1509,7 @@ static int tc_dump_qdisc(struct sk_buff *skb, struct 
> netlink_callback *cb)
>   idx = 0;
>   ASSERT_RTNL();
> 
> - err = nlmsg_parse(nlh, sizeof(*tcm), tca, TCA_MAX, NULL, NULL);
> + err = nlmsg_parse(nlh, sizeof(struct tcmsg), tca, TCA_MAX, NULL, NULL);

Would sizeof(*nlmsg_data(nlh)) be cleaner??

David



RE: [PATCH net-next 05/12] net: dsa: b53: Use a macro to define I/O operations

2017-09-19 Thread David Laight
From: Florian Fainelli
> Sent: 18 September 2017 22:41
> Instead of repeating the same pattern: acquire mutex, read/write, release
> mutex, define a macro: b53_build_op() which takes the type (read|write), I/O
> size, and value (scalar or pointer). This helps with fixing bugs that could
> exit (e.g: missing barrier, lock etc.).

> +#define b53_build_op(type, op_size, val_type)\
> +static inline int b53_##type##op_size(struct b53_device *dev, u8 page,   
> \
> +   u8 reg, val_type val) 
> \
> +{
> \
> + int ret;
> \
> + 
> \
> + mutex_lock(>reg_mutex);
> \
> + ret = dev->ops->type##op_size(dev, page, reg, val); 
> \
> + mutex_unlock(>reg_mutex);  
> \
> + 
> \
> + return ret; 
> \
>  }

Why separate the 'type' and 'op_size' arguments since they
are always pasted together?

David



RE: [PATCH net-next 05/12] net: dsa: b53: Use a macro to define I/O operations

2017-09-19 Thread David Laight
> >>> +#define b53_build_op(type, op_size, val_type)\
> >>> +static inline int b53_##type##op_size(struct b53_device *dev, u8
> >page,\
> >>> +   u8 reg, val_type val) 
> >>> \
> >>> +{
> >>> \
> >>> + int ret;
> >>> \
> >>> + 
> >>> \
> >>> + mutex_lock(>reg_mutex);
> >>> \
> >>> + ret = dev->ops->type##op_size(dev, page, reg, val); 
> >>> \
> >>> + mutex_unlock(>reg_mutex);  
> >>> \
> >>> + 
> >>> \
> >>> + return ret; 
> >>> \
> >>>  }
> >>
> >> Why separate the 'type' and 'op_size' arguments since they
> >> are always pasted together?
> >
> >For read/write48, the value type is u64.
> 
> The way I read David's comment is that instead of calling the macro with 
> read, 48, just combine that
> in a single argument: read48. I don't have a preference about that and can 
> respin eventually.

Indeed, factoring in the type is harder because reads want 'u64 *' not 'u64'.
While that could be factored, it would take more source lines and make
things very obfuscated.

David



RE: [PATCH iproute2 1/3] ss: allow AF_FAMILY constants >32

2017-10-04 Thread David Laight
From: Stefan Hajnoczi
> Sent: 04 October 2017 16:01
...
> > > --- a/misc/ss.c
> > > +++ b/misc/ss.c
> > > @@ -170,55 +170,57 @@ enum {
> > >  struct filter {
> > >   int dbs;
> > >   int states;
> > > - int families;
> > > + __u64 families;
> >
> > Since this isn't a value that is coming from kernel. It should be uint64_t
> > rather than __u64.
> 
> Okay, will fix in v2.

But that looks like a kernel structure, not one exposed to userspace.
So surely __u64 is correct.

David



RE: [PATCH net-next v2 1/2] libbpf: parse maps sections of varying size

2017-10-04 Thread David Laight
From: Craig Gallek
> Sent: 04 October 2017 14:58
> >> + obj->maps[map_idx].def = *def;
> >
> > I'm not too happy/comfortable with this way of copying the memory of
> > "def" (the type-cased struct bpf_map_def).  I guess it works, and is
> > part of the C-standard(?).
> 
> I believe this is a C++-ism.  I'm not sure if it was pulled into the
> C99 standard or if it's just a gcc 'feature' now.  I kept it because
> it was in the initial code, but I'm happy to do an explicit copy for
> v3 if that looks better.

Structure copies have been valid C for ages.

Annoyingly gcc will call memcpy() directly (not a 'private' function).

David



RE: [patch net-next 2/4] net: sched: introduce per-egress action device callbacks

2017-10-10 Thread David Laight
From: Jiri Pirko
> Sent: 10 October 2017 08:30
> Introduce infrastructure that allows drivers to register callbacks that
> are called whenever tc would offload inserted rule and specified device
> acts as tc action egress device.

How does a driver safely unregister a callback?
(to avoid a race with the callback being called.)

Usually this requires a callback in the context that makes the
notification callbacks indicating that no more such callbacks
will be made.

David



RE: [net-next 05/14] i40e/i40evf: always set the CLEARPBA flag when re-enabling interrupts

2017-10-10 Thread David Laight
From: Jeff Kirsher
> Sent: 09 October 2017 23:39
> In the past we changed driver behavior to not clear the PBA when
> re-enabling interrupts. This change was motivated by the flawed belief
> that clearing the PBA would cause a lost interrupt if a receive
> interrupt occurred while interrupts were disabled.
> 
> According to empirical testing this isn't the case. Additionally, the
> data sheet specifically says that we should set the CLEARPBA bit when
> re-enabling interrupts in a polling setup.

I presume this if the MSI-X Pending Bit Array?
Normally this bit is cleared when the interrupt is actioned.

If request the device clear the PBA then it (probably) won't
raise an interrupt when it is unmasked (by clearing the 'masked' bit).

If you've just checked all the rings (with the interrupt masked)
and you clear the PBA bit when you unmask interrupts then you will
need to do another scan of the rings to pick up any packets that
arrived (or tried to signal an interrupt) in that small gap.

'Empirical testing' probably won't hit the timing window.

David



RE: [patch net-next 2/4] net: sched: introduce per-egress action device callbacks

2017-10-10 Thread David Laight
From: Jiri Pirko
> Sent: 10 October 2017 15:32
> To: David Laight
> Cc: netdev@vger.kernel.org; da...@davemloft.net; j...@mojatatu.com; 
> xiyou.wangc...@gmail.com;
> sae...@mellanox.com; mat...@mellanox.com; leo...@mellanox.com; 
> ml...@mellanox.com
> Subject: Re: [patch net-next 2/4] net: sched: introduce per-egress action 
> device callbacks
> 
> Tue, Oct 10, 2017 at 03:31:59PM CEST, david.lai...@aculab.com wrote:
> >From: Jiri Pirko
> >> Sent: 10 October 2017 08:30
> >> Introduce infrastructure that allows drivers to register callbacks that
> >> are called whenever tc would offload inserted rule and specified device
> >> acts as tc action egress device.
> >
> >How does a driver safely unregister a callback?
> >(to avoid a race with the callback being called.)
> >
> >Usually this requires a callback in the context that makes the
> >notification callbacks indicating that no more such callbacks
> >will be made.
> 
> rtnl is your answer. It is being held during register/unregister/cb

Do you mean 'acquired during register/unregister' and 'held across the
callback' ?

So the unregister sleeps (or spins?) until any callbacks complete?
So the driver mustn't hold any locks (etc) across the unregister that
it acquires in the callback.
That ought to be noted somewhere.

David



RE: [PATCH net-next v2 2/4] net: dsa: mv88e6060: setup random mac address

2017-10-16 Thread David Laight
From: woojung@microchip.com
> Sent: 13 October 2017 18:59
> > >> > +  REG_WRITE(REG_GLOBAL, GLOBAL_MAC_01, (addr[0] << 9) |
> > >> addr[1]);
> > >>
> > >> Is that supposed to be 9 ?
> > >
> > > Looks like it.
> > > Check
> > http://www.marvell.com/switching/assets/marvell_linkstreet_88E6060_data
> > sheet.pdf
> >
> > Hum, David is correct, there is a bug in the driver which needs to be
> > addressed first. MAC address bit 40 is addr[0] & 0x1, thus we must
> > shift byte 0 by 8 and mask it against 0xfe.
> >
> > I'll respin this serie including a fix for both net and net-next.
> 
> Yes, you are right. Missed description about bit 40.

Since they are all random numbers it doesn't matter.

Actually isn't this just masking off the non-unicast bit?
So it won't be set in the data - and so doesn't need masking.

David



RE: [PATCH net-next v3 3/5] net: dsa: mv88e6060: setup random mac address

2017-10-16 Thread David Laight
From: Vivien Didelot
> Sent: 13 October 2017 19:18
> As for mv88e6xxx, setup the switch from within the mv88e6060 driver with
> a random MAC address, and remove the .set_addr implementation.
> 
> Signed-off-by: Vivien Didelot 
> ---
>  drivers/net/dsa/mv88e6060.c | 43 ++-
>  1 file changed, 26 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/net/dsa/mv88e6060.c b/drivers/net/dsa/mv88e6060.c
> index d64be2b83d3c..6173be889d95 100644
> --- a/drivers/net/dsa/mv88e6060.c
> +++ b/drivers/net/dsa/mv88e6060.c
> @@ -9,6 +9,7 @@
>   */
> 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -188,6 +189,27 @@ static int mv88e6060_setup_port(struct dsa_switch *ds, 
> int p)
>   return 0;
>  }
> 
> +static int mv88e6060_setup_addr(struct dsa_switch *ds)
> +{
> + u8 addr[ETH_ALEN];
> + u16 val;
> +
> + eth_random_addr(addr);
> +
> + val = addr[0] << 8 | addr[1];
> +
> + /* The multicast bit is always transmitted as a zero, so the switch uses
> +  * bit 8 for "DiffAddr", where 0 means all ports transmit the same SA.
> +  */
> + val &= 0xfeff;

The comment is probably ok, but the mask isn't needed.
eth_randmon_addr() won't return and address with the multicast bit set.

David



RE: [next-queue PATCH v7 4/6] net/sched: Introduce Credit Based Shaper (CBS) qdisc

2017-10-16 Thread David Laight
From: Ivan Khoronzhuk
> Sent: 13 October 2017 20:59
> On Thu, Oct 12, 2017 at 05:40:03PM -0700, Vinicius Costa Gomes wrote:
> > This queueing discipline implements the shaper algorithm defined by
> > the 802.1Q-2014 Section 8.6.8.2 and detailed in Annex L.
> >
> > It's primary usage is to apply some bandwidth reservation to user
> > defined traffic classes, which are mapped to different queues via the
> > mqprio qdisc.
> >
> > Only a simple software implementation is added for now.
> >
> > Signed-off-by: Vinicius Costa Gomes 
> > Signed-off-by: Jesus Sanchez-Palencia 
> > ---
> >  include/uapi/linux/pkt_sched.h |  18 +++
> >  net/sched/Kconfig  |  11 ++
> >  net/sched/Makefile |   1 +
> >  net/sched/sch_cbs.c| 314 
> > +
> >  4 files changed, 344 insertions(+)
> >  create mode 100644 net/sched/sch_cbs.c
> >
> > diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
> > index 099bf5528fed..41e349df4bf4 100644
> > --- a/include/uapi/linux/pkt_sched.h
> > +++ b/include/uapi/linux/pkt_sched.h
> > @@ -871,4 +871,22 @@ struct tc_pie_xstats {
> > __u32 maxq; /* maximum queue size */
> > __u32 ecn_mark; /* packets marked with ecn*/
> >  };
> > +
> > +/* CBS */
> > +struct tc_cbs_qopt {
> > +   __u8 offload;

You probably don't want unnamed padding in a uapi structure.

> > +   __s32 hicredit;
> > +   __s32 locredit;
> > +   __s32 idleslope;
> > +   __s32 sendslope;
> > +};
> > +
> > +enum {
> > +   TCA_CBS_UNSPEC,
> > +   TCA_CBS_PARMS,
> > +   __TCA_CBS_MAX,
> > +};
> > +
> > +#define TCA_CBS_MAX (__TCA_CBS_MAX - 1)

Why not:
TCA_CBS_PARMS,
TCA_CBS_NEXT,
TCA_CBS_MAX = TCA_CBS_NEXT - 1,

...
David



RE: [net-next 6/9] e1000e: fix buffer overrun while the I219 is processing DMA transactions

2017-10-16 Thread David Laight
From: Neftin, Sasha
> Sent: 16 October 2017 11:40
> On 10/11/2017 12:07, David Laight wrote:
> > From: Jeff Kirsher
> >> Sent: 10 October 2017 18:22
> >> Intel 100/200 Series Chipset platforms reduced the round-trip
> >> latency for the LAN Controller DMA accesses, causing in some high
> >> performance cases a buffer overrun while the I219 LAN Connected
> >> Device is processing the DMA transactions. I219LM and I219V devices
> >> can fall into unrecovered Tx hang under very stressfully UDP traffic
> >> and multiple reconnection of Ethernet cable. This Tx hang of the LAN
> >> Controller is only recovered if the system is rebooted. Slightly slow
> >> down DMA access by reducing the number of outstanding requests.
> >> This workaround could have an impact on TCP traffic performance
> >> on the platform. Disabling TSO eliminates performance loss for TCP
> >> traffic without a noticeable impact on CPU performance.
> >>
> >> Please, refer to I218/I219 specification update:
> >> https://www.intel.com/content/www/us/en/embedded/products/networking/
> >> ethernet-connection-i218-family-documentation.html
> >>
> >> Signed-off-by: Sasha Neftin <sasha.nef...@intel.com>
> >> Reviewed-by: Dima Ruinskiy <dima.ruins...@intel.com>
> >> Reviewed-by: Raanan Avargil <raanan.avar...@intel.com>
> >> Tested-by: Aaron Brown <aaron.f.br...@intel.com>
> >> Signed-off-by: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
> >> ---
> >>   drivers/net/ethernet/intel/e1000e/netdev.c | 8 +---
> >>   1 file changed, 5 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c
> b/drivers/net/ethernet/intel/e1000e/netdev.c
> >> index ee9de3500331..14b096f3d1da 100644
> >> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> >> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> >> @@ -3021,8 +3021,8 @@ static void e1000_configure_tx(struct e1000_adapter 
> >> *adapter)
> >>
> >>hw->mac.ops.config_collision_dist(hw);
> >>
> >> -  /* SPT and CNP Si errata workaround to avoid data corruption */
> >> -  if (hw->mac.type >= e1000_pch_spt) {
> >> +  /* SPT and KBL Si errata workaround to avoid data corruption */
> >> +  if (hw->mac.type == e1000_pch_spt) {
> >>u32 reg_val;
> >>
> >>reg_val = er32(IOSFPC);
> >> @@ -3030,7 +3030,9 @@ static void e1000_configure_tx(struct e1000_adapter 
> >> *adapter)
> >>ew32(IOSFPC, reg_val);
> >>
> >>reg_val = er32(TARC(0));
> >> -  reg_val |= E1000_TARC0_CB_MULTIQ_3_REQ;
> >> +  /* SPT and KBL Si errata workaround to avoid Tx hang */
> >> +  reg_val &= ~BIT(28);
> >> +  reg_val |= BIT(29);

> > Shouldn't some more of the commit message about what this is doing
> > be in the comment?

> There is provided link on specification update:
> https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/i218-i219-ethernet-
> connection-spec-update.pdf?asset=9561.
> This is Intel's public release.

And sometime next week the marketing people will decide to reorganise the
web site and the link will become invalid.

> > And shouldn't the 28 and 28 be named constants?

> (28 and 29) - you can easy understand from code that same value has been
> changed from 3 to 2. There is no point add flag here I thought.

Oh, there is. The 'workaround is':
  Slightly slow down DMA access by reducing the number of outstanding requests.
  This workaround could have an impact on TCP traffic performance and could
  reduce performance up to 5 to 15% (depending) on the platform.
  Disabling TSO eliminates performance loss for TCP traffic without a 
  noticeable impact on CPU performance.

I wonder what tests they did to show that TSO doesn't save cpu cycles!

So my guess is that you are changing the number of outstanding PCIe reads
(or reads for tx buffers, or ???) from 3 to 2.

Lets read between the lines a little further
(since you are at Intel you can probably check this):
Assuming that TSO is 'Transmit Segmentation Offload' and that TSO packets
might be 64k, then reading 3 TSO packets might issue PCIe reads for 196k
bytes of data (under 4k for non-TSO).
If the internal buffer that this data is stored in isn't that big then
that internal buffer would overflow.
It might be that data is removed from this buffer as soon as the last
completion TLP arrives - but they can be interleaved with other
outstanding PCIe reads.
It all rather depends on the negotiated maximum TLP size and number
of tags.

Perhaps reducing the maximum TSO packet to 32k stops the overflow
as well...

David



RE: [PATCH net-next 00/10] korina cleanups/optimizations

2017-10-16 Thread David Laight
From: Roman Yeryomin
> Sent: 15 October 2017 17:22
> TX optimizations have led to ~15% performance increase (35->40Mbps)
> in local tx usecase (tested with iperf v3.2).

Indicate which patches give the improvement.
IIRC some just changed the source without changing what the code really did.
(Not a problem in itself.)

...
>   net: korina: optimize tx/rx interrupt handlers

You'd probably get a noticeable improvement from caching the
value of the interrupt mask - instead of reading it from the hardware.
...

David



RE: [PATCH net-next] ip_gre: check packet length and mtu correctly in erspan_fb_xmit

2017-10-05 Thread David Laight
From: William Tu
> Sent: 05 October 2017 01:14
> Similarly to early patch for erspan_xmit(), the ARPHDR_ETHER device
> is the length of the whole ether packet.  So skb->len should subtract
> the dev->hard_header_len.
> 
> Fixes: 1a66a836da63 ("gre: add collect_md mode to ERSPAN tunnel")
> Signed-off-by: William Tu 
> Cc: Xin Long 
> ---
>  net/ipv4/ip_gre.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
> index b279c325c7f6..10b21fe5b3a6 100644
> --- a/net/ipv4/ip_gre.c
> +++ b/net/ipv4/ip_gre.c
> @@ -579,7 +579,7 @@ static void erspan_fb_xmit(struct sk_buff *skb, struct 
> net_device *dev,
>   if (gre_handle_offloads(skb, false))
>   goto err_free_rt;
> 
> - if (skb->len > dev->mtu) {
> + if (skb->len - dev->hard_header_len > dev->mtu) {

Can you guarantee that skb->len > dev_hard_header_len?
It is probably safer to check skb->len > dev->hard_header_len + dev->mtu
since that addition isn't going to overflow.

>   pskb_trim(skb, dev->mtu);
>   truncate = true;

Is that pskb_trim() now truncating to the correct size?

David



RE: [PATCH] rsi: fix integer overflow warning

2017-10-05 Thread David Laight
From: Joe Perches
> Sent: 05 October 2017 13:19
> On Thu, 2017-10-05 at 14:05 +0200, Arnd Bergmann wrote:
> > gcc produces a harmless warning about a recently introduced
> > signed integer overflow:
> >
> > drivers/net/wireless/rsi/rsi_91x_hal.c: In function 'rsi_prepare_mgmt_desc':
> > include/uapi/linux/swab.h:13:15: error: integer overflow in expression 
> > [-Werror=overflow]
> >   (((__u16)(x) & (__u16)0x00ffU) << 8) |   \
> >^
> > include/uapi/linux/swab.h:104:2: note: in expansion of macro 
> > '___constant_swab16'
> >   ___constant_swab16(x) :   \
> >   ^~
> > include/uapi/linux/byteorder/big_endian.h:34:43: note: in expansion of 
> > macro '__swab16'
> >  #define __cpu_to_le16(x) ((__force __le16)__swab16((x)))
> 
> []
> 
> > The problem is that the 'mask' value is a signed integer that gets
> > turned into a negative number when truncated to 16 bits. Making it
> > an unsigned constant avoids this.
> 
> I would expect there are more of these.
> 
> Perhaps this define in include/uapi/linux/swab.h:
> 
> #define __swab16(x)   \
>   (__builtin_constant_p((__u16)(x)) ? \
>   ___constant_swab16(x) : \
>   __fswab16(x))
> 
> should be
> 
> #define __swab16(x)   \
>   (__builtin_constant_p((__u16)(x)) ? \
>   ___constant_swab16((__u16)(x)) :\
>   __fswab16((__u16)(x)))

You probably don't want the cast in the call to __fswab16() since
that is likely to generate an explicit and with 0x.
You will likely also get one if the argument is _u16 (not unsigned int).

David




RE: [PATCH net-next] ip_gre: check packet length and mtu correctly in erspan_fb_xmit

2017-10-06 Thread David Laight
From: William Tu
> Sent: 05 October 2017 22:21
...
> >> - if (skb->len > dev->mtu) {
> >> + if (skb->len - dev->hard_header_len > dev->mtu) {
> >
> > Can you guarantee that skb->len > dev_hard_header_len?
> > It is probably safer to check skb->len > dev->hard_header_len + dev->mtu
> > since that addition isn't going to overflow.
> Sure, I will fix it.
> 
> >
> >>   pskb_trim(skb, dev->mtu);
> >>   truncate = true;
> >
> > Is that pskb_trim() now truncating to the correct size?
> 
> You're right, now I should truncate to (dev->mtu + dev_hard_header_len)

It might be worth caching that length in the dev structure
to avoid the arithmetic on every packet.

David



RE: [PATCH net-next v3 0/5] net: dsa: remove .set_addr

2017-10-16 Thread David Laight
From: Andrew Lunn
> Sent: 16 October 2017 17:10
...
> So, received Pause frames never leave the MAC. They don't get bridged,
> nor do they get passed up for host processing. They are purely point
> to point between two MAC peers. The destination is unambiguous. It is
> simple the other MAC peer. The destination address makes it clear it
> is a pause frame, the the source address seems to be unneeded.
> 
> In this context, a random MAC addresses are safe.

Is there any reason why a fixed value (say 00:00:00:00:00:00)
can't be used?

> In the more general case, i would agree with you. Collisions are
> possible, causing problems.

For IP MAC addresses only go as far as the first router.
So the duplicates would (typically) have to be within the same subnet.
This makes the chance of a duplicate random address unlikely.

(Unless you have an un-subnetted class A network consisting of
multiple 1km long coax segments connected by 1km long repeaters
making a single collision domain.)

David



RE: [PATCH net 0/3] Fix for BPF devmap percpu allocation splat

2017-10-17 Thread David Laight
From: Daniel Borkmann
> Sent: 17 October 2017 15:56
> 
> The set fixes a splat in devmap percpu allocation when we alloc
> the flush bitmap. Patch 1 is a prerequisite for the fix in patch 2,
> patch 1 is rather small, so if this could be routed via -net, for
> example, with Tejun's Ack that would be good. Patch 3 gets rid of
> remaining PCPU_MIN_UNIT_SIZE checks, which are percpu allocator
> internals and should not be used.

Does it make sense to allow the user program to try to allocate ever
smaller very large maps until it finds one that succeeds - thus
using up all the percpu space?

Or is this a 'root only' 'shoot self in foot' job?

David



<    1   2   3   4   5   6   7   >