RE: [PATCH net 1/2] net/mlx4_en: Change min MTU size to ETH_MIN_MTU

2018-12-04 Thread David Laight
From: Eric Dumazet
> Sent: 04 December 2018 17:04
> On 12/04/2018 08:59 AM, David Laight wrote:
> > From: Tariq Toukan
> >> Sent: 02 December 2018 12:35
> >> From: Eran Ben Elisha 
> >>
> >> NIC driver minimal MTU size shall be set to ETH_MIN_MTU, as defined in
> >> the RFC791 and in the network stack. Remove old mlx4_en only define for
> >> it, which was set to wrong value.
> > ...
> >>
> >> -  /* MTU range: 46 - hw-specific max */
> >> -  dev->min_mtu = MLX4_EN_MIN_MTU;
> >> +  /* MTU range: 68 - hw-specific max */
> >> +  dev->min_mtu = ETH_MIN_MTU;
> >>dev->max_mtu = priv->max_mtu;
> >
> > Where does 68 come from?
> 
> Min IPv4 MTU per RFC791

Maybe I'm just confused and these are the ranges that the 'maximum mtu'
can be set to.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH net 1/2] net/mlx4_en: Change min MTU size to ETH_MIN_MTU

2018-12-04 Thread David Laight
From: Eric Dumazet
> Sent: 04 December 2018 17:04
> 
> On 12/04/2018 08:59 AM, David Laight wrote:
> > From: Tariq Toukan
> >> Sent: 02 December 2018 12:35
> >> From: Eran Ben Elisha 
> >>
> >> NIC driver minimal MTU size shall be set to ETH_MIN_MTU, as defined in
> >> the RFC791 and in the network stack. Remove old mlx4_en only define for
> >> it, which was set to wrong value.
> > ...
> >>
> >> -  /* MTU range: 46 - hw-specific max */
> >> -  dev->min_mtu = MLX4_EN_MIN_MTU;
> >> +  /* MTU range: 68 - hw-specific max */
> >> +  dev->min_mtu = ETH_MIN_MTU;
> >>dev->max_mtu = priv->max_mtu;
> >
> > Where does 68 come from?
> 
> Min IPv4 MTU per RFC791

Which has nothing to do with an ethernet driver.
Indeed, IIRC, it is the smallest maximum frame size that IPv4
can work over.

David



RE: [PATCH net 1/2] net/mlx4_en: Change min MTU size to ETH_MIN_MTU

2018-12-04 Thread David Laight
From: Tariq Toukan
> Sent: 02 December 2018 12:35
> From: Eran Ben Elisha 
> 
> NIC driver minimal MTU size shall be set to ETH_MIN_MTU, as defined in
> the RFC791 and in the network stack. Remove old mlx4_en only define for
> it, which was set to wrong value.
...
> 
> - /* MTU range: 46 - hw-specific max */
> - dev->min_mtu = MLX4_EN_MIN_MTU;
> + /* MTU range: 68 - hw-specific max */
> + dev->min_mtu = ETH_MIN_MTU;
>   dev->max_mtu = priv->max_mtu;

Where does 68 come from?
The minimum size of an ethernet packet including the mac addresses
and CRC is 64 bytes - but that would never be an 'mtu'.

Since 64 - 46 = 18, the 46 probably excludes both MAC addresses,
the ethertype/length and the CRC.
This is 'sort of' the minimum mtu for an ethernet frame.

I'm not sure which values are supposed to be in dev->min/max_mtu.

David




RE: [PATCH net-next 2/2] net/sched: act_police: don't use spinlock in the data path

2018-11-16 Thread David Laight
From: Eric Dumazet
> Sent: 16 November 2018 14:35
...
> I suggest to use a single cache line with a dedicated spinlock and these 
> three s64
> 
>   spinlock_t  tcfp_lock cacheline_aligned_in_smp;
>   s64 ...
>   s64 ...
>   s64 ...

Doesn't this do something really stupid when cache lines are big?
If the spinlock is 8 bytes you never want more than 32 byte alignment.
If cache lines are 256 bytes you don't even need that.

Also ISTR that kmalloc() only guarantees 8 byte alignment on x86_64.
So aligning structure members to larger offsets is rather pointless.

David



SCTP on RH 5.7

2018-11-05 Thread David Laight
Why do our customers insist on trying to use SCTP on RH 5.7 with its
ancient 2.6.18 kernel?
Not surprising they are getting issues!

David




RE: [PATCH net-next v2] net: core: change bool members of struct net_device to bitfield members

2018-10-10 Thread David Laight
From: Eric Dumazet
> Sent: 09 October 2018 21:52
> 
> On 10/09/2018 01:24 PM, Heiner Kallweit wrote:
> 
> > Reordering the struct members to fill the holes could be a little tricky
> > and could have side effects because it may make a performance difference
> > whether certain members are in one cacheline or not.
> > And whether it's worth to spend this effort (incl. the related risks)
> > just to save a few bytes (also considering that typically we have quite
> > few instances of struct net_device)?
> 
> Not really.
> 
> In fact we probably should spend time reordering fields for performance,
> since some new fields were added a bit randomly, breaking the goal of data 
> locality.
> 
> Some fields are used in control path only and could be moved out of the cache 
> lines
> needed in data path (fast path).

Interesting thought.
The memory allocator rounds sizes up to a power of 2 and gives out memory
aligned to that value.
This means that the cache lines just above powers of 2 are used far
more frequently than those below one.
This will be made worse because the commonly used fields are normally at
the start of a structure.
This ought to be measurable?

Has anyone tried randomly splitting the padding between the start
and end of the allocation (while maintaining cache alignment)?
(Not sure how this would affect kfree().)

Or splitting pages (or small groups of pages) into non-power of 2
size blocks?
For instance you get three 1344 (21*64) byte blocks, or five 768 byte
blocks into a 4k page.
These could give a significant footprint reduction as well as
balancing out cache line usage. 

I also wonder whether it is right to add a lot of padding to cache-line
align structure members on systems with large cache lines.
The intention is probably to get a few fields into the same cache line
not to add padding that may be larger than aggregate size of the fields.

Oh - and it is somewhat pointless because kmalloc() isn't guaranteed
to give out cache-line aligned buffers.

David



RE: [PATCH net-next] net: core: change bool members of struct net_device to bitfield members

2018-10-09 Thread David Laight
From: Heiner Kallweit
> Sent: 08 October 2018 21:01
> 
> bool is good as parameter type or function return type, but if used
> for struct members it consumes more memory than needed.

Actually it can generate extra code when used as a parameter or
return type - but DM doesn't seem to worry about that.

> Changing the bool members of struct net_device to bitfield members
> allows to decrease the memory footprint of this struct.

On archs where bool is 8 bit does this actually make any difference?
Especially since the structure is probably malloced.

> Signed-off-by: Heiner Kallweit 
> ---
>  include/linux/netdevice.h | 24 +---
>  1 file changed, 13 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 76603ee13..3d7b8df2e 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
...
> @@ -1879,7 +1882,6 @@ struct net_device {
>   unsigned short  dev_port;
>   spinlock_t  addr_list_lock;
>   unsigned char   name_assign_type;
> - booluc_promisc;

You've a great big hole here.
I suspect there is one after dev_port as well.

>   struct netdev_hw_addr_list  uc;
>   struct netdev_hw_addr_list  mc;
>   struct netdev_hw_addr_list  dev_addrs;
> @@ -1986,14 +1988,11 @@ struct net_device {
>  NETREG_DUMMY,/* dummy device for NAPI poll */
>   } reg_state:8;
> 
> - bool dismantle;
> -
>   enum {
>   RTNL_LINK_INITIALIZED,
>   RTNL_LINK_INITIALIZING,
>   } rtnl_link_state:16;
> 
> - bool needs_free_netdev;

This one might remove some padding.
But it could be moved into one of the holes.

>   void (*priv_destructor)(struct net_device *dev);
> 
>  #ifdef CONFIG_NETPOLL
> @@ -2046,7 +2045,10 @@ struct net_device {
>   struct sfp_bus  *sfp_bus;
>   struct lock_class_key   *qdisc_tx_busylock;
>   struct lock_class_key   *qdisc_running_key;
> - boolproto_down;
> + unsigneduc_promisc:1;
> + unsigneddismantle:1;
> + unsignedneeds_free_netdev:1;
> + unsignedproto_down:1;
>   unsignedwol_enabled:1;
>  };
>  #define to_net_dev(d) container_of(d, struct net_device, dev)
> --
> 2.19.1

David



RE: [RFC PATCH] skb: Define NET_IP_ALIGN based on CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS

2018-10-05 Thread David Laight
From: Ben Hutchings
> Sent: 04 October 2018 18:37
> 
> NET_IP_ALIGN is supposed to be defined as 0 if DMA writes to an
> unaligned buffer would be more expensive than CPU access to unaligned
> header fields, and otherwise defined as 2.
> 
> Currently only ppc64 and x86 configurations define it to be 0.
> However several other architectures (conditionally) define
> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS, which seems to imply that
> NET_IP_ALIGN should be 0.
> 
> Remove the overriding definitions for ppc64 and x86 and define
> NET_IP_ALIGN solely based on CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.

Even if CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set unaligned
accesses are likely to be slightly slower than aligned ones.
So having NET_IP_ALIGN set to 2 might make sense even on x86.
(ISTR DM saying why this isn't done.)

I've also met systems where misaligned DMA transfers (for ethernet receive)
were so bad that it was necessary to DMA to a 4n aligned buffer and
then do a misaligned copy to the real rx buffer (skb equiv) for the
network stack - which required the buffer be 4n+2 aligned.
(sparc sbus with the original DMA part.)

David




RE: [PATCH v2 1/5] netlink: remove NLA_NESTED_COMPAT

2018-09-20 Thread David Laight
From: Johannes Berg
> Sent: 19 September 2018 20:49
> This isn't used anywhere, so we might as well get rid of it.
> 
> Reviewed-by: David Ahern 
> Signed-off-by: Johannes Berg 
> ---
>  include/net/netlink.h |  2 --
>  lib/nlattr.c  | 11 ---
>  2 files changed, 13 deletions(-)
> 
> diff --git a/include/net/netlink.h b/include/net/netlink.h
> index 318b1ded3833..b680fe365e91 100644
> --- a/include/net/netlink.h
> +++ b/include/net/netlink.h
> @@ -172,7 +172,6 @@ enum {
>   NLA_FLAG,
>   NLA_MSECS,
>   NLA_NESTED,
> - NLA_NESTED_COMPAT,
>   NLA_NUL_STRING,
>   NLA_BINARY,
>   NLA_S8,
...

Is it safe to remove an item from this enum?

David




RE: [PATCH 1/2] staging: rtl8192e: Fix compiler warning about strncpy

2018-08-21 Thread David Laight
From: Larry Finger
> Sent: 20 August 2018 18:51
> When strncpy() is called with source and destination strings the same
> length, gcc 8 warns that there may be an unterminated string. Using
> strlcpy() rather than strncpy() forces a null at the end and quiets the
> warning.
> 
> Signed-off-by: Larry Finger 
> ---
>  drivers/staging/rtl8192e/rtllib_softmac.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/staging/rtl8192e/rtllib_softmac.c 
> b/drivers/staging/rtl8192e/rtllib_softmac.c
> index 919231fec09c..95a8390cb7ac 100644
> --- a/drivers/staging/rtl8192e/rtllib_softmac.c
> +++ b/drivers/staging/rtl8192e/rtllib_softmac.c
> @@ -1684,14 +1684,14 @@ inline void rtllib_softmac_new_net(struct 
> rtllib_device *ieee,
>* essid provided by the user.
>*/
>   if (!ssidbroad) {
> - strncpy(tmp_ssid, ieee->current_network.ssid,
> + strlcpy(tmp_ssid, ieee->current_network.ssid,
>   IW_ESSID_MAX_SIZE);
>   tmp_ssid_len = ieee->current_network.ssid_len;

If there is a length, why not use it?
Depending on where the data came from the length might need validating.
Depending on how tmp_ssid is used it might need zero filling.

>   }
>   memcpy(>current_network, net,
>  sizeof(struct rtllib_network));

Gah - should be sizeof(ieee->current_network).
Or better still a structure assignment.

>   if (!ssidbroad) {
> - strncpy(ieee->current_network.ssid, tmp_ssid,
> + strlcpy(ieee->current_network.ssid, tmp_ssid,
>   IW_ESSID_MAX_SIZE);
>   ieee->current_network.ssid_len = tmp_ssid_len;

Hmmm... this looks like it is restoring the fields.
So why not have a temporary ssid buffer that is the size of the
actual buffer and use memcpy().

David




RE: [iproute PATCH 4/4] lib: Enable colored output only for TTYs

2018-08-15 Thread David Laight
From: Stephen Hemminger
> Sent: 15 August 2018 16:04
...
> > This also disables color sequence when the output is piped to a pager
> > such as less which with the -R argument can handle it just fine.
> >
> > ie., the user needs to remove the color arg when that output is not wanted.
> 
> If you are going to change the color enabling, why not make it compatible
> with what ls does?

Indeed - otherwise it is very hard to debug the colour escape sequences.

David




RE: [PATCH v2 0/2] net/sctp: Avoid allocating high order memory with kmalloc()

2018-08-06 Thread David Laight
From: Michael Tuexen
> Sent: 03 August 2018 21:57
...
> >> Given how useless SCTP streams are, does anything actually use
> >> more than about 4?
> >
> > Maybe Michael can help us with that. I'm also curious now.
> In the context of SIGTRAN I have seen 17 streams...

Ok, I've seen 17 there as well, 5 is probably more common.

> In the context of WebRTC I have seen more streams. In general,
> the streams concept seems to be useful. QUIC has lots of streams.
> 
> So I'm wondering why they are considered useless.
> David, can you elaborate on this?

I don't think a lot of people know what they actually are.

Streams just allow some receive data to be forwarded to applications when receive
message(s) on stream(s) are lost and have to be retransmitted.

I suspect some people think that the separate streams have separate flow 
control,
not just separate data sequences.

M2PA separates control message (stream 0) from user data (stream 1).
I think the spec even suggests this is so control messages get through when
user data is flow controlled off - not true (it would be true for ISO
transport's 'expedited data').

M3UA will use 16 streams (one for each (ITU) SLS), but uses stream 0 for 
control.
If a data message is lost then data for the other sls can be passed to the
userpart/mtp3 - this might save bursty processing when the SACK-requested
retransmission arrives. But I doubt you'd want to run M3UA on anything lossy
enough for more than 4 data streams to make sense.

Even M3UA separating control onto stream 0 data onto 1-n doesn't seem useful to 
me.

If QUIC is using 'lots of streams' is it just using the stream-id as a qualifier
for the data? Rather than requiring the 'not head of line blocking' feature
of sctp streams?

A thought:
Could we let the application set large stream-ids, but actually mask them
down to (say) 32 for the protocol code?

David




RE: [PATCH v2 0/2] net/sctp: Avoid allocating high order memory with kmalloc()

2018-08-03 Thread David Laight
From: Konstantin Khorenko
> Sent: 03 August 2018 17:21
> 
> Each SCTP association can have up to 65535 input and output streams.
> For each stream type an array of sctp_stream_in or sctp_stream_out
> structures is allocated using kmalloc_array() function. This function
> allocates physically contiguous memory regions, so this can lead
> to allocation of memory regions of very high order, i.e.:
...

Given how useless SCTP streams are, does anything actually use
more than about 4?

David




RE: [PATCH v2 1/2] net/sctp: Make wrappers for accessing in/out streams

2018-08-03 Thread David Laight
From: Konstantin Khorenko
> Sent: 03 August 2018 17:21
...
> --- a/net/sctp/stream.c
> +++ b/net/sctp/stream.c
> @@ -37,6 +37,18 @@
>  #include 
>  #include 
> 
> +struct sctp_stream_out *sctp_stream_out(const struct sctp_stream *stream,
> + __u16 sid)
> +{
> + return ((struct sctp_stream_out *)(stream->out)) + sid;
> +}
> +
> +struct sctp_stream_in *sctp_stream_in(const struct sctp_stream *stream,
> +   __u16 sid)
> +{
> + return ((struct sctp_stream_in *)(stream->in)) + sid;
> +}
> +

Those look like they ought to be static inlines in the header file.
Otherwise you'll be making SCTP performance worse than it is already.

David




RE: [PATCH net-next] sctp: add support for SCTP_REUSE_PORT sockopt

2018-05-22 Thread David Laight
...
> > >> the reason this was added is to have a specified way to allow a system to
> > >> behave like a client and server making use of the INIT collision.
> > >>
> > >> For 1-to-many style sockets you can do this by creating a socket, 
> > >> binding it,
> > >> calling listen on it and trying to connect to the peer.
> > >>
> > >> For 1-to-1 style sockets you need two sockets for it. One listener and 
> > >> one
> > >> you use to connect (and close it in case of failure, open a new one...).
> > >>
> > >> It was not clear if one can achieve this with SO_REUSEPORT and/or 
> > >> SO_REUSEADDR
> > >> on all platforms. We left that unspecified.
> > >>
> > >> I hope this makes the intention clearer.

That really doesn't help for 1-1 sockets.
You need a way of accepting connections that come from a specific remote host.

Otherwise the application has to verify that the remote address on the incoming
connection is the one it is expecting.

It is also a problem if two different applications (instances) want to
generate 'INIT collisions' for the same local addresses but different remote
ones.

The only way 'INIT collisions' are going to work is with a different
option that stops the receipt of an ABORT chunk erroring a connect.

David



RE: [PATCH v2] net: dsa: drop some VLAs in switch.c

2018-05-08 Thread David Laight
From: Salvatore Mesoraca
> Sent: 07 May 2018 20:03
...
> This optimization will save us an allocation when number of ports is
> less than 32 or 64 (depending on arch).
> IMHO it's useful, if you consider that, right now, DSA works only with
> 12-ports switches.

Why not just error out if the number of ports is greater than the compile-time
limit?

Worry about dynamic allocation if you need a lot more than 64 ports.

David



RE: [PATCH 000/109] remove in-kernel calls to syscalls

2018-03-29 Thread David Laight
From: Dominik Brodowski
> Sent: 29 March 2018 15:42
> On Thu, Mar 29, 2018 at 07:20:27AM -0700, Matthew Wilcox wrote:
> > On Thu, Mar 29, 2018 at 01:22:37PM +0200, Dominik Brodowski wrote:
> > > At least on 64-bit x86, it will likely be a hard requirement from v4.17
> > > onwards to not call system call functions in the kernel: It is better to
> > > use use a different calling convention for system calls there, where
> > > struct pt_regs is decoded on-the-fly in a syscall wrapper which then hands
> > > processing over to the actual syscall function. This means that only those
> > > parameters which are actually needed for a specific syscall are passed on
> > > during syscall entry, instead of filling in six CPU registers with random
> > > user space content all the time (which may cause serious trouble down the
> > > call chain).[*]
> >
> > How do we stop new ones from springing up?  Some kind of linker trick
> > like was used to, er, "dissuade" people from using gets()?
> 
> Once the patches which modify the syscall calling convention are merged,
> it won't compile on 64-bit x86, but bark loudly. That should frighten anyone.
> Meow.

Should be pretty easy to ensure the prototypes aren't in any normal header.
Renaming the global symbols (to not match the function name) will make it
much harder to call them as well.

David



RE: RFC on writel and writel_relaxed

2018-03-28 Thread David Laight
From: Benjamin Herrenschmidt
> Sent: 28 March 2018 16:13
...
> > I've always wondered exactly what the twi;isync were for - always seemed
> > very heavy handed for most mmio reads.
> > Particularly if you are doing mmio reads from a data fifo.
> 
> If you do that you should use the "s" version of the accessors. Those
> will only do the above trick at the end of the access series. Also a
> FIFO needs special care about endianness anyway, so you should use
> those accessors regardless. (Hint: you never endian swap a FIFO even on
> BE on a LE device, unless something's been wired very badly in HW).

That was actually a 64 bit wide fifo connected to a 16bit wide PIO interface.
Reading the high address 'clocked' the fifo.
So the first 3 reads could happen in any order, but the 4th had to be last.
This is a small ppc and we shovel a lot of data through that fifo.

Whether it needed byteswapping depended completely on how our hardware people
had built the pcb (not made easy by some docs using the ibm bit numbering).
In fact it didn't.

While that driver only had to run on a very specific small ppc, generic drivers
might have similar issues.

I suspect that writel() is always (or should always be):
barrier_before_writel()
writel_relaxed()
barrier_after_writel()
So if a driver needs to do multiple writes (without strong ordering)
it should be able to repeat the writel_relaxed() with only one set
of barriers.
Similarly for readl().
In addition a lesser barrier is probably enough between a readl_relaxed()
and a writel_relaxed() that is conditional on the read value.

David



RE: RFC on writel and writel_relaxed

2018-03-28 Thread David Laight
From: Benjamin Herrenschmidt
> Sent: 28 March 2018 10:56
...
> For example, let's say I have a device with a reset bit and the spec
> says the reset bit needs to be set for at least 10us.
> 
> This is wrong:
> 
>   writel(1, RESET_REG);
>   usleep(10);
>   writel(0, RESET_REG);
> 
> Because of write posting, the first write might arrive to the device
> right before the second one.
> 
> The typical "fix" is to turn that into:
> 
>   writel(1, RESET_REG);
>   readl(RESET_REG); /* Flush posted writes */

Would a writel(1, RESET_REG) here provide enough synchronsiation?

>   usleep(10);
>   writel(0, RESET_REG);
> 
> *However* the issue here, at least on power, is that the CPU can issue
> that readl but doesn't necessarily wait for it to complete (ie, the
> data to return), before proceeding to the usleep. Now a usleep contains
> a bunch of loads and stores and is probably fine, but a udelay which
> just loops on the timebase may not be.
> 
> Thus we may still violate the timing requirement.

I've seen that sort of code (with udelay() and no read back) quite often.
How many were in linux I don't know.

For small delays I usually fix it by repeated writes (of the same value)
to the device register. That can guarantee very short intervals.

The only time I've actually seen buffered writes break timing was
between a 286 and an 8259 interrupt controller.
If you wrote to the mask then enabled interrupts the first IACK cycle
could be too close to write and break the cycle recovery time.
That clobbered most of the interrupt controller registers.
That probably affected every 286 board ever built!
Not sure how much software added the required extra bus cycle.

> What we did inside readl, with the twi;isync sequence (which basically
> means, trap on return value with "trap never" as a condition, followed
> by isync that ensures all excpetion conditions are resolved), is force
> the CPU to "consume" the data from the read before moving on.
> 
> This effectively makes readl fully synchronous (we would probably avoid
> that if we were to implement a readl_relaxed).

I've always wondered exactly what the twi;isync were for - always seemed
very heavy handed for most mmio reads.
Particularly if you are doing mmio reads from a data fifo.

Perhaps there should be a writel_wait() that is allowed to do a read back
for such code paths?

David



RE: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-22 Thread David Laight
From: Kees Cook
> Sent: 22 March 2018 15:01
...
> >   /* Glory to Martin Uecker  */
> >   #define __is_constant(a) \
> > (sizeof(int) == sizeof(*(1 ? ((void*)((a) * 0l)) : (int*)1)))
...
> So, this time it's not a catastrophic failure with gcc 4.4. Instead it
> fails in 11 distinct places:
...
> Seems like it doesn't like void * arguments:
> 
> mm/percpu.c:
> void *ptr;
> ...
> base = min(ptr, base);

Try adding (unsigned long) before the (a).

David



RE: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-03-22 Thread David Laight
From: David Laight
> Sent: 22 March 2018 10:36
...
> Any code would need to be in memcpy_fromio(), not in every driver that
> might benefit.
> Then fallback code can be used if the registers aren't available.
> 
> >  (b) we can't guarantee that %ymm register write will show up on any
> > bus as a single write transaction anyway
> 
> Misaligned 8 byte accesses generate a single PCIe TLP.
> I can look at what happens for AVX2 transfers later.
> I've got code that mmap()s PCIe addresses into user space, and can
> look at the TLP (indirectly through tracing on an fpga target).
> Just need to set something up that uses AVX copies.

On my i7-7700 reads into xmm registers generate 16 byte TLP and
reads into ymm registers 32 byte TLP.
I don't think I've a system that supports avx-512.

With my lethargic fpga slave 32 bytes reads happen every 144 clocks
and 16 byte ones every 126 (+/- the odd clock).
Single bytes ones happen every 108, 8 byte 117.
So I have (about) 110 clock overhead on every read cycle.
These clocks are the 62.5MHz clock on the fpga.

So if we needed to do PIO reads using the AVX2 (or better AVX-512)
registers would make a significant difference.
Fortunately we can 'dma' most of the data we need to transfer.

I've traced writes before, they are a lot faster and are limited
by things in the fpga fabric (they appear back to back).

David



RE: [RFC PATCH 2/3] x86/io: implement 256-bit IO read and write

2018-03-22 Thread David Laight
From: Linus Torvalds
> Sent: 22 March 2018 01:27
> On Tue, Mar 20, 2018 at 7:42 AM, Alexander Duyck
>  wrote:
> >
> > Instead of framing this as an enhanced version of the read/write ops
> > why not look at replacing or extending something like the
> > memcpy_fromio or memcpy_toio operations?
> 
> Yes, doing something like "memcpy_fromio_avx()" is much more
> palatable, in that it works like the crypto functions do - if you do
> big chunks, the "kernel_fpu_begin/end()" isn't nearly the issue it can
> be otherwise.
> 
> Note that we definitely have seen hardware that *depends* on the
> regular memcpy_fromio()" not doing big reads. I don't know how
> hardware people screw it up, but it's clearly possible.

I wonder if that hardware works with the current kernel on recent cpus?
I bet it doesn't like the byte accesses that get generated either.

> So it really needs to be an explicitly named function that basically a
> driver can use to say "my hardware really likes big aligned accesses"
> and explicitly ask for some AVX version if possible.

For x86 being able to request a copy done as 'rep movsx' (for each x)
would be useful.
For io copies the cost of the memory access is probably much smaller
that the io access, so really fancy copies are unlikely make much
difference unless the width of the io access changes.

David



RE: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-03-22 Thread David Laight
From: Sent: 21 March 2018 18:16
> To: Ingo Molnar
...
> All this to do a 32-byte PIO access, with absolutely zero data right
> now on what the win is?
> 
> Yes, yes, I can find an Intel white-paper that talks about setting WC
> and then using xmm and ymm instructions to write a single 64-byte
> burst over PCIe, and I assume that is where the code and idea came
> from. But I don't actually see any reason why a burst of 8 regular
> quad-word bytes wouldn't cause a 64-byte burst write too.

The big gain is from wide PCIe reads, not writes.
Writes to uncached locations (without WC) are almost certainly required
to generate the 'obvious' PCIe TLP, otherwise things are likely to break.

> So right now this is for _one_ odd rdma controller, with absolutely
> _zero_ performance numbers, and a very high likelihood that it does
> not matter in the least.
> 
> And if there are some atomicity concerns ("we need to guarantee a
> single atomic access for race conditions with the hardware"), they are
> probably bogus and misplaced anyway, since
> 
>  (a) we can't guarantee that AVX2 exists in the first place

Any code would need to be in memcpy_fromio(), not in every driver that
might benefit.
Then fallback code can be used if the registers aren't available.

>  (b) we can't guarantee that %ymm register write will show up on any
> bus as a single write transaction anyway

Misaligned 8 byte accesses generate a single PCIe TLP.
I can look at what happens for AVX2 transfers later.
I've got code that mmap()s PCIe addresses into user space, and can
look at the TLP (indirectly through tracing on an fpga target).
Just need to set something up that uses AVX copies.

> So as far as I can tell, there are basically *zero* upsides, and a lot
> of potential downsides.

There are some upsides.
I'll do a performance measurement for reads.

David



RE: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-03-20 Thread David Laight
From: Andy Lutomirski
> Sent: 20 March 2018 14:57
...
> I'd rather see us finally finish the work that Rik started to rework
> this differently.  I'd like kernel_fpu_begin() to look like:
> 
> if (test_thread_flag(TIF_NEED_FPU_RESTORE)) {
>   return; // we're already okay.  maybe we need to check
> in_interrupt() or something, though?
> } else {
>   XSAVES/XSAVEOPT/XSAVE;
>   set_thread_flag(TIF_NEED_FPU_RESTORE):
> }
> 
> and kernel_fpu_end() does nothing at all.

I guess it might need to set (clear?) the CFLAGS bit for a process
that isn't using the fpu at all - which seems a sensible feature.
 
> We take the full performance hit for a *single* kernel_fpu_begin() on
> an otherwise short syscall or interrupt, but there's no additional
> cost for more of them or for long-enough-running things that we
> schedule in the middle.

It might be worth adding a parameter to kernel_fpu_begin() to indicate
which registers are needed, and a return value to say what has been
granted.
Then a driver could request AVX2 (for example) and use a fallback path
if the register set isn't available (for any reason).
A call from an ISR could always fail.

> As I remember, the main hangup was that this interacts a bit oddly
> with PKRU, but that's manageable.

WTF PKRU ??

David



RE: [RFC PATCH 2/3] x86/io: implement 256-bit IO read and write

2018-03-20 Thread David Laight
From: Rahul Lakkireddy
> Sent: 20 March 2018 13:32
...
> On High Availability Server, the logs of the failing system must be
> collected as quickly as possible.  So, we're concerned with the amount
> of time taken to collect our large on-chip memory.  We see improvement
> in doing 256-bit reads at a time.

Two other options:

1) Get the device to DMA into host memory.

2) Use mmap() (and vm_iomap_memory() in your driver) to get direct
   userspace access to the (I assume) PCIe memory space.
   You can then use whatever copy instructions the cpu has.
   (Just don't use memcpy().)

David



RE: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-03-20 Thread David Laight
From: Ingo Molnar
> Sent: 20 March 2018 10:54
...
> Note that a generic version might still be worth trying out, if and only if it's
> safe to access those vector registers directly: modern x86 CPUs will do their
> non-constant memcpy()s via the common memcpy_erms() function - which could in
> theory be an easy common point to be (cpufeatures-) patched to an AVX2 variant,
> if size (and alignment, perhaps) is a multiple of 32 bytes or so.
> 
> Assuming it's correct with arbitrary user-space FPU state and if it results in
> any measurable speedups, which might not be the case: ERMS is supposed to be
> very fast.
> 
> So even if it's possible (which it might not be), it could end up being slower
> than the ERMS version.

Last I checked memcpy() was implemented as 'rep movsb' on the latest Intel cpus.
Since memcpy_to/fromio() get aliased to memcpy() this generates byte copies.
The previous 'fastest' version of memcpy() was ok for uncached locations.

For PCIe I suspect that the actual instructions don't make a massive difference.
I'm not even sure interleaving two transfers makes any difference.
What makes a huge difference for memcpy_fromio() is the size of the register:
the time taken for each read is largely independent of the width of the
register used, so wider registers mean fewer reads for the same data.

David



RE: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-03-20 Thread David Laight
From: Thomas Gleixner
> Sent: 20 March 2018 09:41
> On Tue, 20 Mar 2018, Ingo Molnar wrote:
> > * Thomas Gleixner  wrote:
...
> > > And if we go down that road then we want a AVX based memcpy()
> > > implementation which is runtime conditional on the feature bit(s) and
> > > length dependent. Just slapping a readqq() at it and use it in a loop does
> > > not make any sense.
> >
> > Yeah, so generic memcpy() replacement is only feasible I think if the most
> > optimistic implementation is actually correct:
> >
> >  - if no preempt disable()/enable() is required
> >
> >  - if direct access to the AVX[2] registers does not disturb legacy FPU
> >    state in any fashion
> >
> >  - if direct access to the AVX[2] registers cannot raise weird exceptions
> >    or have weird behavior if the FPU control word is modified to
> >    non-standard values by untrusted user-space
> >
> > If we have to touch the FPU tag or control words then it's probably only
> > good for a specialized API.
> 
> I did not mean to have a general memcpy replacement. Rather something like
> magic_memcpy() which falls back to memcpy when AVX is not usable or the
> length does not justify the AVX stuff at all.

There is probably no point for memcpy().

Where it would make a big difference is memcpy_fromio() for PCIe devices
(where longer TLP make a big difference).
But any code belongs in its implementation not in every driver.
The implementation of memcpy_toio() is nothing like as critical.

It might be that the code would need to fall back to 64bit accesses
if the AVX(2) registers can't currently be accessed - maybe some
obscure state.

However memcpy_to/fromio() are both horrid at the moment because
they result in byte copies!

David




RE: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-03-19 Thread David Laight
From: Thomas Gleixner
> Sent: 19 March 2018 15:37
...
> > If system call entry reset the AVX registers then any FP save/restore
> > would be faster because the AVX registers wouldn't need to be saved
> > (and the cpu won't save them).
> > I believe the instruction to reset the AVX registers is fast.
> > The AVX registers only ever need saving if the process enters the
> > kernel through an interrupt.
> 
> Wrong. The x8664 ABI clearly states:
> 
>Linux Kernel code is not allowed to change the x87 and SSE units. If
>those are changed by kernel code, they have to be restored properly
>before sleeping or leaving the kernel.
> 
> That means the syscall interface relies on FPU state being not changed by
> the kernel. So if you want to clear AVX on syscall entry you need to save
> it first and then restore before returning. That would be a huge
> performance hit.

The x87 and SSE registers can't be changed - they can contain
callee-saved values.
But (IIRC) the AVX and AVX2 registers are all caller-saved.
So the system call entry stub functions are allowed to change them.
Which means that the syscall entry code can also change them.
Of course it must not leak kernel values back to userspace.

It is a few years since I looked at the AVX and fpu save code.

David


RE: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-03-19 Thread David Laight
From: Thomas Gleixner
> Sent: 19 March 2018 15:05
> 
> On Mon, 19 Mar 2018, David Laight wrote:
> > From: Rahul Lakkireddy
> > In principle it ought to be possible to get access to one or two
> > (eg) AVX registers by saving them to stack and telling the fpu
> > save code where you've put them.
> 
> No. We have functions for this and we are not adding new ad hoc magic.

I was thinking that a real API might do this...
Useful also for code that needs AVX-like registers to do things like CRCs.

> > OTOH, for x86, if the code always runs in process context (eg from a
> > system call) then, since the ABI defines the AVX(2) registers as
> > all caller-saved, it is only necessary to ensure that the current
> > FPU registers belong to the current process once.
> > The registers can be set to zero by an 'invalidate' instruction on
> > system call entry (hope this is done) and after use.
> 
> Why would a system call touch the FPU registers? The kernel normally does
> not use FPU instructions and the code which explicitely does has to take
> care of save/restore. It would be performance madness to fiddle with the
> FPU stuff unconditionally if nothing uses it.

If system call entry reset the AVX registers then any FP save/restore
would be faster because the AVX registers wouldn't need to be saved
(and the cpu won't save them).
I believe the instruction to reset the AVX registers is fast.
The AVX registers only ever need saving if the process enters the
kernel through an interrupt.

David



RE: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-03-19 Thread David Laight
From: Rahul Lakkireddy
> Sent: 19 March 2018 14:21
> 
> This series of patches add support for 256-bit IO read and write.
> The APIs are readqq and writeqq (quad quadword - 4 x 64), that read
> and write 256-bits at a time from IO, respectively.

Why not use the AVX-512 registers to get 512bit accesses?

> Patch 1 adds u256 type and adds necessary non-atomic accessors.  Also
> adds byteorder conversion APIs.
> 
> Patch 2 adds 256-bit read and write to x86 via VMOVDQU AVX CPU
> instructions.
> 
> Patch 3 updates cxgb4 driver to use the readqq API to speed up
> reading on-chip memory 256-bits at a time.

Calling kernel_fpu_begin() is likely to be slow.
I doubt you want to do it every time around a loop of accesses.

In principle it ought to be possible to get access to one or two
(eg) AVX registers by saving them to stack and telling the fpu
save code where you've put them.
Then the IPI fp save code could copy the saved values over
the current values if asked to save the fp state for a process.
This should be reasonably cheap - especially if there isn't an
fp save IPI.

OTOH, for x86, if the code always runs in process context (eg from a
system call) then, since the ABI defines the AVX(2) registers as
all caller-saved, it is only necessary to ensure that the current
FPU registers belong to the current process once.
The registers can be set to zero by an 'invalidate' instruction on
system call entry (hope this is done) and after use.

David



RE: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-19 Thread David Laight
From: linus...@gmail.com [mailto:linus...@gmail.com] On Behalf Of Linus Torvalds
> Sent: 18 March 2018 23:36
...
> 
> Yeah, and since we're in the situation that *new* gcc versions work
> for us anyway, and we only have issues with older gcc's (that sadly
> people still use), even if there was a new cool feature we couldn't
> use it.

Is it necessary to have the full checks for old versions of gcc?

Even -Wvla could be predicated on very recent gcc - since we aren't
worried about whether gcc decides to generate a vla, but whether
the source requests one.

David



RE: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-16 Thread David Laight
From: Linus Torvalds
> Sent: 16 March 2018 17:29
> On Fri, Mar 16, 2018 at 4:47 AM, Florian Weimer  wrote:
> >
> > If you want to catch stack frames which have unbounded size,
> > -Werror=stack-usage=1000 or -Werror=vla-larger-than=1000 (with the constant
> > adjusted as needed) might be the better approach.
> 
> No, we want to catch *variable* stack sizes.
> 
> Does "-Werror=vla-larger-than=0" perhaps work for that? No, because
> the stupid compiler says that is "meaningless".
> 
> And no, using "-Werror=vla-larger-than=1" doesn't work either, because
> the moronic compiler continues to think that "vla" is about the
> _type_, not the code:
> 
>t.c: In function ‘test’:
>t.c:6:6: error: argument to variable-length array is too large
> [-Werror=vla-larger-than=]
>  int array[(1,100)];
> 
> Gcc people are crazy.
> 
> Is there really no way to just say "shut up about the stupid _syntax_
> issue that is entirely irrelevant, and give us the _code_ issue".

I looked at the generated code for one of the constant sized VLA that
the compiler barfed at.
It seemed to subtract constants from %sp separately for the VLA.
So it looks like the compiler treats them as VLA even though it
knows the size.
That is probably a missed optimisation.

David



RE: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-14 Thread David Laight
From: Kees Cook
> Sent: 13 March 2018 22:15
...
> I'll send a "const_max()" which will refuse to work on
> non-constant-values (so it doesn't get accidentally used on variables
> that could be exposed to double-evaluation), and will work for stack
> array declarations (to avoid the overly-sensitive -Wvla checks).

ISTR the definitions were of the form:
char foo[max(sizeof (struct bah), sizeof (struct baz))];
This doesn't generate a 'foo' with the required alignment.
It would be much better to use a union.
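A small sketch of the point (the struct names echo the hypothetical 'bah'/'baz' above): a char array sized with max() has alignment 1, whereas a union automatically gets both the maximal size and the strictest alignment of its members.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

struct bah { long l; char tail[10]; };   /* hypothetical example types */
struct baz { double d; };

/* char-array version: the size is right, but the alignment is only 1 */
typedef char foo_array[sizeof(struct bah) > sizeof(struct baz) ?
                       sizeof(struct bah) : sizeof(struct baz)];

/* union version: max size AND the strictest member alignment, for free */
union foo_union { struct bah bah; struct baz baz; };
```

Any code that casts `foo_array` to one of the struct types relies on luck for alignment; the union makes it correct by construction.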

David



RE: [PATCH] net: dsa: drop some VLAs in switch.c

2018-03-14 Thread David Laight
From: Salvatore Mesoraca
> Sent: 13 March 2018 22:01
> 2018-03-13 20:58 GMT+01:00 Vivien Didelot 
> :
> > Hi Salvatore,
> 
> Hi Vivien,
> 
> > Salvatore Mesoraca  writes:
> >
> >> dsa_switch's num_ports is currently fixed to DSA_MAX_PORTS. So we avoid
> >> 2 VLAs[1] by using DSA_MAX_PORTS instead of ds->num_ports.
> >>
> >> [1] https://lkml.org/lkml/2018/3/7/621
> >>
> >> Signed-off-by: Salvatore Mesoraca 
> >
> > NAK.
> >
> > We are in the process to remove hardcoded limits such as DSA_MAX_PORTS
> > and DSA_MAX_SWITCHES, so we have to stick with ds->num_ports.
> 
> I can rewrite the patch using kmalloc.
> Although, if ds->num_ports will always be less than or equal to 12, it
> should be better to
> just use DSA_MAX_PORTS.

Isn't using DECLARE_BITMAP() completely OTT when the maximum size is less
than the number of bits in a word?

David



RE: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-13 Thread David Laight
The amount of replicated defines could also be reduced by passing > or <
to a min_max() macro.
So you start off with something like:
#define min(x, y) __min_max(x, <, y)
#define max(x, y) __min_max(x, >, y)
then have:
#define __min_max(x, cond, y) ((x) cond (y) ? (x) : (y))
in all its associated flavours.
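The shape being suggested, as a minimal standalone sketch (note the kernel's real min()/max() also guard against double evaluation and mixed types, which this deliberately omits, so arguments must be side-effect free):

```c
/* One comparison template, instantiated with < or > */
#define __min_max(x, cond, y)  ((x) cond (y) ? (x) : (y))

#define min(x, y)  __min_max(x, <, y)
#define max(x, y)  __min_max(x, >, y)
```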

David




RE: [PATCH] drivers: net: wireless: ath: ath9: dfs: remove VLA usage

2018-03-13 Thread David Laight
From: Daniel Micay
> Sent: 10 March 2018 23:45
> 
> > Just wondering. Is this actually a VLA. FFT_NUM_SAMPLES was static const so
> > not really going to show a lot of variation. This array will always have the
> > same size on the stack.
> 
> The issue is that unlike in C++, a `static const` can't be used in a
> constant expression in C. It's unclear why C is defined that way but
> it's how it is and there isn't currently a GCC extension making more
> things into constant expressions like C++.

'static' and 'const' are both just qualifiers on a 'variable'.
You can still take its address.
The language allows you to cast away the 'const' and write to
the variable - the effect is probably 'implementation defined'.

It is probably required to be valid for 'static const' items
to be patchable.

David



RE: [PATCH net 3/4] net: dsa: microchip: Utilize strncpy() for ethtool::get_strings

2018-03-02 Thread David Laight
From: Florian Fainelli
>
> Do not use memcpy() which is not safe, but instead use strncpy() which
> will make sure that the string is NUL terminated (in the Linux
> implementation) if the string is smaller than the length specified. This
> fixes KASAN out of bounds warnings while fetching port statistics.

You really ought to use a copy function that will truncate the
string if it is too long.
Just assuming the string isn't too long is asking for trouble.
You might (almost) just use strcpy().

strlcpy() will probably work best here.
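strlcpy() is a BSD extension (also carried by the kernel at the time), not ISO C; the semantics being recommended - truncate, always NUL-terminate, and return the source length so the caller can detect truncation - can be sketched portably (helper name `copy_string` is made up here):

```c
#include <string.h>
#include <stddef.h>

/* strlcpy-style copy: copies at most size-1 bytes, always
 * NUL-terminates (when size > 0), and returns strlen(src) so the
 * caller can tell whether truncation happened (return >= size). */
static size_t copy_string(char *dst, const char *src, size_t size)
{
    size_t len = strlen(src);

    if (size) {
        size_t n = len < size - 1 ? len : size - 1;

        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return len;
}
```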

David



RE: [PATCH RFC 1/2] virtio: introduce packed ring defines

2018-02-27 Thread David Laight
From: Tiwei Bie
> Sent: 23 February 2018 11:18
...
> +struct vring_packed_desc_event {
> + /* Descriptor Event Offset */
> + __virtio16 desc_event_off   : 15,
> + /* Descriptor Event Wrap Counter */
> +desc_event_wrap  : 1;
> + /* Descriptor Event Flags */
> + __virtio16 desc_event_flags : 2;
> +};

This looks like you are assuming that a bit-field has a defined
layout and can be used to map a 'hardware' structure.
They don't, so don't use them like that.
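The portable alternative is explicit shifts and masks on fixed-width words, so the wire layout is spelled out rather than left to the compiler's implementation-defined bit-field rules. A sketch using the field names from the quoted struct (the packing shown is an illustrative little-endian layout, not taken from the virtio spec):

```c
#include <stdint.h>

#define DESC_EVENT_OFF_MASK    0x7fffu  /* bits 14:0 of word 0 */
#define DESC_EVENT_WRAP_SHIFT  15       /* bit 15 of word 0 */
#define DESC_EVENT_FLAGS_MASK  0x0003u  /* bits 1:0 of word 1 */

/* Pack offset and wrap counter into word 0; the layout is now defined
 * by this code, not by the compiler's bit-field allocation order. */
static inline uint16_t event_word0(uint16_t off, unsigned int wrap)
{
    return (uint16_t)((off & DESC_EVENT_OFF_MASK) |
                      ((wrap & 1u) << DESC_EVENT_WRAP_SHIFT));
}

static inline uint16_t event_off(uint16_t w0)
{
    return w0 & DESC_EVENT_OFF_MASK;
}

static inline unsigned int event_wrap(uint16_t w0)
{
    return w0 >> DESC_EVENT_WRAP_SHIFT;
}
```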

David



RE: [PATCH v5 07/14] net: pch_gbe: Fix handling of TX padding

2018-02-19 Thread David Laight
From: Paul Burton
> Sent: 17 February 2018 20:11
> 
> The ethernet controller found in the Intel EG20T Platform Controller
> Hub requires that we place 2 bytes of padding between the ethernet
> header & the packet payload. Our pch_gbe driver handles this by copying
> packets to be transmitted to a temporary struct skb with the padding
> bytes inserted
...

Uggg... WTF is the driver doing that for?

I'd guess that the two byte pad is there so that a 4 byte aligned
frame is still 4 byte aligned when the 14 byte ethernet header is added.
So instead of copying the entire frame the MAC header should be built
(or rebuilt?) two bytes further from the actual data.

David





RE: [net-next v2 1/1] tipc: avoid unnecessary copying of bundled messages

2018-02-19 Thread David Laight
From: Jon Maloy 
Date: Thu, 15 Feb 2018 14:14:37 +0100

> A received sk buffer may contain dozens of smaller 'bundled' messages
> which after extraction go each in their own direction.
>
> Unfortunately, when we extract those messages using skb_clone() each
> of the extracted buffers inherit the truesize value of the original
> buffer. Apart from causing massive overaccounting of the base buffer's
> memory, this often causes tipc_msg_validate() to come to the false
> conclusion that the ratio truesize/datasize > 4, and perform an
> unnecessary copying of the extracted buffer.
>
> We now fix this problem by explicitly correcting the truesize value of
> the buffer clones to be the truesize of the clone itself plus a
> calculated fraction of the base buffer's overhead. This change
> eliminates the overaccounting and at least mitigates the occurrence
> of unnecessary buffer copying.

Have you actually checked that copying the data when you extract the
messages isn't faster than cloning and trying to avoid the copy?
Copying at that point is probably cheaper because it leads to
a simpler message structure.

David



RE: [PATCH][V2] net: dsa: mv88e6xxx: avoid unintended sign extension on a 16 bit shift

2018-02-19 Thread David Laight
From: Colin King
> Sent: 16 February 2018 16:55
> 
> From: Colin Ian King 
> 
> The shifting of timehi by 16 bits to the left will be promoted to
> a 32 bit signed int and then sign-extended to an u64. If the top bit
> of timehi is set then all the upper bits of ns end up also
> being set because of the sign-extension. Fix this by making timehi and
> timelo u64.  Also move the declaration of ns.
> 
> Detected by CoverityScan, CID#1465288 ("Unintended sign extension")
> 
> Fixes: c6fe0ad2c349 ("net: dsa: mv88e6xxx: add rx/tx timestamping support")
> Signed-off-by: Colin Ian King 
> ---
>  drivers/net/dsa/mv88e6xxx/hwtstamp.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/dsa/mv88e6xxx/hwtstamp.c b/drivers/net/dsa/mv88e6xxx/hwtstamp.c
> index b251d534b70d..2149d332dea0 100644
> --- a/drivers/net/dsa/mv88e6xxx/hwtstamp.c
> +++ b/drivers/net/dsa/mv88e6xxx/hwtstamp.c
> @@ -293,7 +293,8 @@ static void mv88e6xxx_get_rxts(struct mv88e6xxx_chip *chip,
>  struct sk_buff *skb, u16 reg,
>  struct sk_buff_head *rxq)
>  {
> - u16 buf[4] = { 0 }, status, timelo, timehi, seq_id;
> + u16 buf[4] = { 0 }, status, seq_id;
> + u64 ns, timelo, timehi;
>   struct skb_shared_hwtstamps *shwt;
>   int err;
> 
> @@ -321,7 +322,7 @@ static void mv88e6xxx_get_rxts(struct mv88e6xxx_chip *chip,
>    */
>   for ( ; skb; skb = skb_dequeue(rxq)) {
>   if (mv88e6xxx_ts_valid(status) && seq_match(skb, seq_id)) {
> - u64 ns = timehi << 16 | timelo;
> + ns = timehi << 16 | timelo;

This seems to be somewhat excessive 64bit maths on a 32bit system.
It is more than enough to make timelo/timehi 'unsigned int'.
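The pitfall in miniature: a u16 is promoted to (signed) int before the shift, so the 32-bit intermediate is signed and sign-extends when widened to u64. The buggy variant below models that promotion explicitly (via an `int32_t` conversion, which keeps the example well-defined under gcc/clang) rather than relying on the formally undefined shift into the sign bit:

```c
#include <stdint.h>

static uint64_t combine_buggy(uint16_t timehi, uint16_t timelo)
{
    /* Model of `u64 ns = timehi << 16 | timelo;`: timehi promotes to a
     * *signed* int, so when bit 15 is set the 32-bit intermediate is
     * negative and sign-extends into the upper 32 bits of the u64. */
    int32_t promoted = (int32_t)((uint32_t)timehi << 16 | timelo);

    return (uint64_t)promoted;
}

static uint64_t combine_fixed(uint16_t timehi, uint16_t timelo)
{
    /* An unsigned 32-bit intermediate is sufficient - as noted above,
     * full 64-bit arithmetic is not actually needed for the fix. */
    return (uint32_t)timehi << 16 | timelo;
}
```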

David



RE: [PATCH net-next 0/7] tcp: implement rb-tree based retransmit queue

2018-02-06 Thread David Laight
From: Eric Dumazet
> Sent: 06 February 2018 14:20
...
> Please give exact details.
> Sending 64, 128, 256 or 512 bytes at a time on TCP_STREAM makes little sense.
> We are not optimizing stack for pathological cases, sorry.

There are plenty of workloads which are not bulk data and where multiple
small buffers get sent at unknown intervals (which may be back to back).
Such connections have to have Nagle disabled because the Nagle delays
are 'horrid'.
Clearly lost packets can cause delays, but they are rare on local networks.

David



RE: [PATCH] net: cxgb4: avoid memcpy beyond end of source buffer

2018-02-02 Thread David Laight
From: Arnd Bergmann
> Sent: 02 February 2018 15:19
> 
> Building with link-time-optimizations revealed that the cxgb4 driver does
> a fixed-size memcpy() from a variable-length constant string into the
> network interface name:
...
> I can see two equally workable solutions: either we use a strncpy() instead
> of the memcpy() to stop at the end of the input, or we make the source buffer
> fixed length as well. This implements the latter.
> 
> Signed-off-by: Arnd Bergmann 
> ---
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h
> index 1d37672902da..a14e8db51cdc 100644
> --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h
> +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h
> @@ -355,7 +355,7 @@ struct cxgb4_lld_info {
>  };
> 
>  struct cxgb4_uld_info {
> - const char *name;
> + char name[IFNAMSIZ];
>   void *handle;
>   unsigned int nrxq;
>   unsigned int rxq_size;
> --
> 2.9.0

Surely there is another part to this patch?

David



RE: [PATCH] tcp_lp: use 64-bit arithmetic instead of 32-bit

2018-02-01 Thread David Laight
> > The question you need to ask is 'can it overflow 32bit maths', otherwise
> > you are potentially making the system do extra work for no reason.
> >
> 
> Yeah, I get your point and it seems that in this particular case there
> is no risk of a 32bit overflow, but in general and IMHO as the code
> evolves, the use of incorrect arithmetic may have security
> implications in the future, so I advocate for code correctness in this
> case.

Even if the variables are 64bit you still need to worry (maybe less)
about arithmetic overflow.
The only real way to avoid overflow is to understand the domain
of the values being used.
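A concrete illustration of why widening the variables alone doesn't settle the question: the destination being 64-bit does not widen the arithmetic, so a product of two 32-bit values can still wrap before it is ever stored (helper names are made up for the example):

```c
#include <stdint.h>

static uint64_t product_narrow(uint32_t a, uint32_t b)
{
    /* The multiply itself is done in 32 bits; the high bits are
     * already gone before the 64-bit assignment happens. */
    uint64_t p = a * b;

    return p;
}

static uint64_t product_wide(uint32_t a, uint32_t b)
{
    /* Widening one operand first makes the multiply 64-bit. */
    return (uint64_t)a * b;
}
```

Whether the widening is needed at all depends, as above, on the domain of the values: if the operands can never exceed 16 bits each, the narrow multiply is perfectly safe and cheaper on 32-bit systems.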

David



RE: [PATCH 04/36] usercopy: Prepare for usercopy whitelisting

2018-01-12 Thread David Laight
From: Christopher Lameter
> Sent: 10 January 2018 18:28
> On Tue, 9 Jan 2018, Kees Cook wrote:
> 
> > +struct kmem_cache *kmem_cache_create_usercopy(const char *name,
> > +   size_t size, size_t align, slab_flags_t flags,
> > +   size_t useroffset, size_t usersize,
> > +   void (*ctor)(void *));
> 
> Hmmm... At some point we should switch kmem_cache_create to pass a struct
> containing all the parameters. Otherwise the API will blow up with
> additional functions.

Or add an extra function to 'configure' the kmem_cache with the
extra parameters.

David



RE: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread David Laight
From: Alan Cox
> Sent: 08 January 2018 12:13
...
> > Try over 35% slowdown
> > Given that AWS instance runs known code (user and kernel) why do we
> > need to worry about any of these sideband attacks?
> 
> You may not need to. Amazon themselves obviously need to worry that no
> other VM steals your data (or vice versa) but above that (and with raw
> hardware appliances) if you control all the code you run then the nopti
> and other disables may be useful (At the end of the day as with anything
> else you do your own risk assessment).

I believe AWS allows VM kernels to load user-written device drivers
so the security of other VMs cannot rely on whether a VM is booted
with PTI=yes or PTI=no.

David



RE: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread David Laight
From: Linus Torvalds
> Sent: 07 January 2018 19:47
...
> Also, people definitely *are* noticing the performance issues with the
> current set of patches, and they are causing real problems. Go search
> for reports of Amazon AWS slowdowns.

Try over 35% slowdown
Given that AWS instance runs known code (user and kernel) why do we
need to worry about any of these sideband attacks?

David



RE: [PATCHv2 net-next 04/12] sctp: implement make_datafrag for sctp_stream_interleave

2017-12-08 Thread David Laight
From: Xin Long ...
> Hi, David, Sorry,  I'm not sure we're worrying about the cpu cost or
> codes style now ?
> 
> For cpu cost,  I think 0x848(%r13) operation must be better than the
> generated code of if-else.

Nope - the call xxx(%ryyy) is likely to be a data cache miss and a complete
cpu pipeline stall.

The conditional will be a data cache hit and the code (for one branch)
will be prefetched and speculatively executed.

Some very modern cpu might manage to predict indirect jumps, but for
most it is a full pipeline stall.

David



RE: [PATCHv2 net-next 04/12] sctp: implement make_datafrag for sctp_stream_interleave

2017-12-08 Thread David Laight
From: Xin Long
> Sent: 08 December 2017 16:18
> 
...
> >> Alternatively you could perform the dereference in two steps (i.e. declare
> >> an si pointer on the stack and set it equal to asoc->stream.si, then deref
> >> si->make_datafrag at call time).  That will at least give the compiler an
> >> opportunity to preload the first pointer.

You want to save the function pointer itself.

...
> Another small difference:
>   as you can see, comparing to (X), (Y) is using 0x28(%rsp) in the loop,
>   instead of %r13.
> 
> So that's what I can see from the related generated code.
> If 0x848(%r13) is not worse than 0x28(%rsp) for cpu, I think
> asoc->stream.si->make_datafrag() is even better. No ?

That code must have far too many live local variables.
Otherwise there'd be a caller-saved register available.

David



RE: [PATCHv2 net-next 04/12] sctp: implement make_datafrag for sctp_stream_interleave

2017-12-08 Thread David Laight
From: Marcelo Ricardo Leitner
> Sent: 08 December 2017 16:00
...
> > Is it worth replacing the si struct with an index/enum value, and indexing
> > an array of method pointer structs?  That would save you at least one
> > dereference.
> 
> Hmmm, maybe, yes. It would be like
> sctp_stream_interleave[asoc->stream.si].make_datafrag(...)

If you only expect 2 choices then an if () is likely
to produce better code than the above.

The actual implementation can be hidden inside a #define
or static inline function.

David



RE: [PATCHv2 net-next 04/12] sctp: implement make_datafrag for sctp_stream_interleave

2017-12-08 Thread David Laight
From: 'Marcelo Ricardo Leitner'
> Sent: 08 December 2017 15:16
> On Fri, Dec 08, 2017 at 03:01:31PM +, David Laight wrote:
> > From: Marcelo Ricardo Leitner
> > > Sent: 08 December 2017 14:57
> > >
> > > On Fri, Dec 08, 2017 at 02:06:04PM +, David Laight wrote:
> > > > From: Xin Long
> > > > > Sent: 08 December 2017 13:04
> > > > ...
> > > > > @@ -264,8 +264,8 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
> > > > >   frag |= SCTP_DATA_SACK_IMM;
> > > > >   }
> > > > >
> > > > > - chunk = sctp_make_datafrag_empty(asoc, sinfo, len, frag,
> > > > > -  0, GFP_KERNEL);
> > > > > + chunk = asoc->stream.si->make_datafrag(asoc, sinfo, len, frag,
> > > > > +GFP_KERNEL);
> > > >
> > > > I know that none of the sctp code is very optimised, but that indirect
> > > > call is going to be horrid.
> > >
> > > Yeah.. but there is no way to avoid the double dereference
> > > considering we only have the asoc pointer in there and we have to
> > > reach the contents of the data chunk operations struct, and the .si
> > > part is the same as 'stream' part as it's a constant offset.
> > ...
> >
> > It isn't only the double indirect, the indirect call itself isn't 'fun'.
> 
> I meant in this context.
> 
> The indirect call is so we don't have to flood the stack with
> if (old data chunk fmt) {
>   ...
> } else {
>   ...
> }
> 
> So instead of this, we now have some key operations identified and
> wrapped up behind this struct, allowing us to abstract whatever data
> chunk format it is.

Nothing wrong with:
#define foo(asoc, ...) \
	do { \
		if ((asoc)->new_fmt) \
			foo_new(asoc, __VA_ARGS__); \
		else \
			foo_old(asoc, __VA_ARGS__); \
	} while (0)

> > I think there are other hot paths where you've replaced a sizeof()
> > with a ?: clause.
> > Caching the result might be much better.
> 
> The only new ?: clause I could find this patchset is on patch 12 and
> has nothing to do with sizeof().
> 
> The sizeof() results are indeed cached, as you can see in patch 4:
> +static struct sctp_stream_interleave sctp_stream_interleave_0 = {
> +   .data_chunk_len = sizeof(struct sctp_data_chunk),
> and the two helpers on it at the begining of the patch.

I was getting two bits mixed up.
But the code that reads data_chunk_len is following an awful lot of pointers.

David




RE: [PATCHv2 net-next 04/12] sctp: implement make_datafrag for sctp_stream_interleave

2017-12-08 Thread David Laight
From: Marcelo Ricardo Leitner
> Sent: 08 December 2017 14:57
> 
> On Fri, Dec 08, 2017 at 02:06:04PM +0000, David Laight wrote:
> > From: Xin Long
> > > Sent: 08 December 2017 13:04
> > ...
> > > @@ -264,8 +264,8 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
> > >   frag |= SCTP_DATA_SACK_IMM;
> > >   }
> > >
> > > - chunk = sctp_make_datafrag_empty(asoc, sinfo, len, frag,
> > > -  0, GFP_KERNEL);
> > > + chunk = asoc->stream.si->make_datafrag(asoc, sinfo, len, frag,
> > > +GFP_KERNEL);
> >
> > I know that none of the sctp code is very optimised, but that indirect
> > call is going to be horrid.
> 
> Yeah.. but there is no way to avoid the double dereference
> considering we only have the asoc pointer in there and we have to
> reach the contents of the data chunk operations struct, and the .si
> part is the same as 'stream' part as it's a constant offset.
...

It isn't only the double indirect, the indirect call itself isn't 'fun'.

I think there are other hot paths where you've replaced a sizeof()
with a ?: clause.
Caching the result might be much better.

David



RE: [PATCHv2 net-next 04/12] sctp: implement make_datafrag for sctp_stream_interleave

2017-12-08 Thread David Laight
From: Xin Long
> Sent: 08 December 2017 13:04
...
> @@ -264,8 +264,8 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
>   frag |= SCTP_DATA_SACK_IMM;
>   }
> 
> - chunk = sctp_make_datafrag_empty(asoc, sinfo, len, frag,
> -  0, GFP_KERNEL);
> + chunk = asoc->stream.si->make_datafrag(asoc, sinfo, len, frag,
> +GFP_KERNEL);

I know that none of the sctp code is very optimised, but that indirect
call is going to be horrid.

David



RE: [PATCH V11 3/5] printk: hash addresses printed with %p

2017-12-06 Thread David Laight
From: David Miller
> Sent: 05 December 2017 20:31
...
> > Would it make sense to keep the 3 lowest bits of the address?
> >
> > Currently printed pointers no longer have any correlation with the actual
> > alignment in memory of the object, which is a typical cause of a class of 
> > bugs.
> 
> Yeah, this is driving people nuts who wonder why pointers are aligned
> all weird now.

I can also imagine issues where you want to know whether 2 pointers point
into the same structure (like an skb).
Useful if you suspect that code is using a stale pointer because the
skb got reallocated by some internal function.
I'm not sure such pointers are printed by default though.

I know I have debugged things that required hexdumps of memory areas
referenced by traced pointers.
I couldn't have done that for anything written with the kernel printf because
getting any more info for a linux kernel oops is actually quite hard.

David



RE: [PATCH V11 4/5] vsprintf: add printk specifier %px

2017-11-30 Thread David Laight
From: Kees Cook
> Sent: 29 November 2017 22:28
> On Wed, Nov 29, 2017 at 2:07 AM, David Laight <david.lai...@aculab.com> wrote:
> > From: Linus Torvalds
> >> Sent: 29 November 2017 02:29
> >>
> >> On Tue, Nov 28, 2017 at 6:05 PM, Tobin C. Harding <m...@tobin.cc> wrote:
> >> >
> >> >Let's add specifier %px as a
> >> > clear, opt-in, way to print a pointer and maintain some level of
> >> > isolation from all the other hex integer output within the Kernel.
> >>
> >> Yes, I like this model. It's easy and it's obvious ("'x' for hex"),
> >> and it gives people a good way to say "yes, I really want the actual
> >> address as hex" for if/when the hashed pointer doesn't work for some
> >> reason.
> >
> > Remind me to change every %p to %px on kernels that support it.
> >
> > Although the absolute values of pointers may not be useful, knowing
> > that two pointer differ by a small amount is useful.
> > It is also useful to know whether pointers are to stack, code, static
> > data or heap.
> >
> > This change to %p is going to make debugging a nightmare.
> 
> In the future, maybe we could have a knob: unhashed, hashed (default),
> or zeroed.

Add a 4th, hashed_page+offset.

Isn't there already a knob for %pK?  Bits in the same value could be used.
That would make it easy to ensure that %pK is more restrictive than %p.

David



RE: [PATCH V11 0/5] hash addresses printed with %p

2017-11-30 Thread David Laight
From: Andrew Morton
> Sent: 29 November 2017 23:21
> >
> > The added advantage of hashing %p is that security is now opt-out, if
> > you _really_ want the address you have to work a little harder and use
> > %px.

You need a system-wide opt-out that prints the actual values.
Otherwise developers will use something else to print addresses and
the code will remain in the released drivers.

> > The idea for creating the printk specifier %px to print the actual
> > address was suggested by Kees Cook (see below for email threads by
> > subject).
> 
> Maybe I'm being thick, but...  if we're rendering these addresses
> unusable by hashing them, why not just print something like
> "" in their place?  That loses the uniqueness thing but I
> wonder how valuable that is in practice?

My worry is that if you get a kernel 'oops' print with actual register
values you have no easy way of tying an address or address+offset to
the corresponding hash(address) printed elsewhere.

David



RE: [PATCH V11 4/5] vsprintf: add printk specifier %px

2017-11-29 Thread David Laight
From: Linus Torvalds
> Sent: 29 November 2017 02:29
> 
> On Tue, Nov 28, 2017 at 6:05 PM, Tobin C. Harding  wrote:
> >
> >Let's add specifier %px as a
> > clear, opt-in, way to print a pointer and maintain some level of
> > isolation from all the other hex integer output within the Kernel.
> 
> Yes, I like this model. It's easy and it's obvious ("'x' for hex"),
> and it gives people a good way to say "yes, I really want the actual
> address as hex" for if/when the hashed pointer doesn't work for some
> reason.

Remind me to change every %p to %px on kernels that support it.

Although the absolute values of pointers may not be useful, knowing
that two pointers differ by a small amount is useful.
It is also useful to know whether pointers are to stack, code, static
data or heap.

This change to %p is going to make debugging a nightmare.

David



RE: [PATCH net 3/5] sctp: only allow the asoc reset when the asoc outq is empty

2017-11-27 Thread David Laight
From: Xin Long
> Sent: 25 November 2017 13:06
> 
> As it says in rfc6525#section5.1.4, before sending the request,
> 
>C2:  The sender has either no outstanding TSNs or considers all
> outstanding TSNs abandoned.
> 
> Prior to this patch, it tried to consider all outstanding TSNs abandoned
> by dropping all chunks in all outqs with sctp_outq_free (even including
> sacked, retransmit and transmitted queues) when doing this reset, which
> is too aggressive.
> 
> To make it work gently, this patch will only allow the asoc reset when
> the sender has no outstanding TSNs by checking if unsent, transmitted
> and retransmit are all empty with sctp_outq_is_empty before sending
> and processing the request.

Doesn't that rather defeat the point of a reset?

David



RE: [PATCH] net: bridge: add max_fdb_count

2017-11-21 Thread David Laight
From: Sarah Newman
> Sent: 15 November 2017 19:27
> Current memory and CPU usage for managing bridge fdb entries is unbounded.
> Add a parameter max_fdb_count, controlled from sysfs, which places an upper
> limit on the number of entries. Defaults to 1024.
> 
> When max_fdb_count is met or exceeded, whether traffic is sent out a
> given port should depend on its flooding behavior.

Does it make sense for a bridge to run in a mode where it doesn't
remember (all the) MAC addresses from one of its interfaces?
Rather than flood unknown addresses they are just sent to the
'everywhere else' interface.

David



RE: [net-next 1/1] tipc: enforce valid ratio between skb truesize and contents

2017-11-21 Thread David Laight
From: Jon Maloy
> Sent: 15 November 2017 20:24
> The socket level flow control is based on the assumption that incoming
> buffers meet the condition (skb->truesize / roundup(skb->len) <= 4),
> where the latter value is rounded off upwards to the nearest 1k number.
> This does empirically hold true for the device drivers we know, but we
> cannot trust that it will always be so, e.g., in a system with jumbo
> frames and very small packets.

I'd be more worried about drivers that set skb->truesize to the amount
of data in the skb rather than the memory used by the skb.

Last I looked some of the usbnet drivers lied badly.

David



RE: [PATCH 0/9] use network namespace for iSCSI control interfaces

2017-11-21 Thread David Laight
From: Chris Leech
> Sent: 15 November 2017 00:25
> To: David Laight
> Cc: netdev@vger.kernel.org; contain...@lists.linux-foundation.org
> Subject: Re: [PATCH 0/9] use network namespace for iSCSI control interfaces
> 
> On Wed, Nov 08, 2017 at 10:31:04AM +, David Laight wrote:
> > From: Chris Leech
> > > Sent: 07 November 2017 22:45
> > >
> > > I've posted these changes to allow iSCSI management within a container
> > > using a network namespace to the SCSI and Open-iSCSI lists, but seeing
> > > as it's not really SCSI/block related I'm casting a wider net looking
> > > for reviews.
> >
> > I didn't spot you acquiring and releasing references to the namespace.
> > (I might have missed it, the relevant patch is difficult to read).
> >
> > If the sockets are created in the context of the process whose namespace
> > you are using you don't need it, but given the hooks and callbacks
> > I'm not at all sure that is obviously true.
> 
> Thanks David,
> 
> Looking at it again, you're right and I think I need to hold a reference
> for the iSCSI host and handle namespace deletion.  Even for iscsi_tcp
> the socket gets handed off from the creating process to the transport
> and can outlive iscsid.

It isn't that simple. IIRC:

The namespace delete callback isn't made until the reference count is zero.
Sockets created with sock_create_kern() don't hold a reference to the
namespace.

This is all fine for sockets used for admin purposes, but rather hopeless
if you really need the namespace to continue to exist while the connections
are open - if only for long enough to close the connection.

To make matters even more annoying the functions for holding and
releasing a namespace are GPL_ONLY :-(

David



RE: [PATCH] net: mvneta: fix handling of the Tx descriptor counter

2017-11-20 Thread David Laight
From: Simon Guinot
> Sent: 13 November 2017 15:36
> To: David Miller
> Cc: thomas.petazz...@free-electrons.com; netdev@vger.kernel.org; m...@gmx.de;
> andreas.tob...@cloudguard.ch; gregory.clem...@free-electrons.com; 
> antoine.ten...@free-electrons.com;
> m...@semihalf.com; sta...@vger.kernel.org
> Subject: Re: [PATCH] net: mvneta: fix handling of the Tx descriptor counter
> 
> On Mon, Nov 13, 2017 at 11:54:14PM +0900, David Miller wrote:
> > From: Simon Guinot 
> > Date: Mon, 13 Nov 2017 15:51:15 +0100
> >
> > > IIUC the driver stops the queue if a threshold of 316 Tx descriptors is
> > > reached (default and worst value).
> >
> > That's a lot of latency.
> 
> OK, then I'll keep the "tx_pending > 255" flushing condition. But note
> there is no other software mechanism to limit the Tx latency inside the
> mvneta driver. Should we add something ? And is that not rather the job
> of the network stack to keep track of the latency and to limit the txq
> size ?

This is 'first packet transmit latency'.

If the 'doorbell write' is just a PCIe write then, on most systems,
that is cheap and pipelined/posted.
I'd almost be surprised if you see any 'improvement' from not doing
it every packet.

The overall tx queue size is a different issue - usually needs
limiting by BQL if TSO is done.

David



RE: [PATCH net-next] net: thunderbolt: Clear finished Tx frame bus address in tbnet_tx_callback()

2017-11-20 Thread David Laight
From: Mika Westerberg
> Sent: 13 November 2017 10:22
> To: David Miller
> Cc: michael.ja...@intel.com; yehezkel.ber...@intel.com; netdev@vger.kernel.org
> Subject: Re: [PATCH net-next] net: thunderbolt: Clear finished Tx frame bus 
> address in
> tbnet_tx_callback()
> 
> On Sat, Nov 11, 2017 at 07:21:24PM +0900, David Miller wrote:
> > From: Mika Westerberg 
> > Date: Thu,  9 Nov 2017 13:46:28 +0300
> >
> > > When Thunderbolt network interface is disabled or when the cable is
> > > unplugged the driver releases all allocated buffers by calling
> > > tbnet_free_buffers() for each ring. This function then calls
> > > dma_unmap_page() for each buffer it finds where bus address is non-zero.
> > > Now, we only clear this bus address when the Tx buffer is sent to the
> > > hardware so it is possible that the function finds an entry that has
> > > already been unmapped.
> > >
> > > Enabling DMA-API debugging catches this as well:
> > >
> > >   thunderbolt :06:00.0: DMA-API: device driver tries to free DMA
> > > memory it has not allocated [device address=0x68321000] 
> > > [size=4096 bytes]
> > >
> > > Fix this by clearing the bus address of a Tx frame right after we have
> > > unmapped the buffer.
> > >
> > > Signed-off-by: Mika Westerberg 
> >
> > Applied, but assuming zero is a non-valid DMA address is never a good
> > idea.  That's why we have the DMA error code signaling abstracted.
> 
> There does not seem to be a way to mark DMA address invalid in a driver
> so we probably need to add a flag to struct tbnet_frame instead.

Can you use the length?

David



RE: [PATCH 3/7] net: core: eliminate dev_alloc_name{,_ns} code duplication

2017-11-20 Thread David Laight
From: Rasmus Villemoes
> Sent: 12 November 2017 23:15
> dev_alloc_name contained a BUG_ON(), which I moved to dev_alloc_name_ns;
> the only other caller of that already has the same BUG_ON.
> 
> Signed-off-by: Rasmus Villemoes 
> ---
>  net/core/dev.c | 12 ++--
>  1 file changed, 2 insertions(+), 10 deletions(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 240ae6bc1097..1077bfe97bde 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1112,6 +1112,7 @@ static int dev_alloc_name_ns(struct net *net,
>   char buf[IFNAMSIZ];
>   int ret;
> 
> + BUG_ON(!net);
>   ret = __dev_alloc_name(net, name, buf);

Just delete it.
The NULL pointer dereference is as easy to debug as the BUG().

David



RE: [PATCH net-next v2 1/3] net: bgmac: Pad packets to a minimum size

2017-11-10 Thread David Laight
From: Florian Fainelli
> Sent: 09 November 2017 22:35
> 
> In preparation for enabling Broadcom tags with b53, pad packets to a
> minimum size of 64 bytes (sans FCS) in order for the Broadcom switch to
> accept ingressing frames. Without this, we would typically be able to
> DHCP, but not resolve with ARP because packets are too small and get
> rejected by the switch.
> 
> Signed-off-by: Florian Fainelli 
> ---
>  drivers/net/ethernet/broadcom/bgmac.c | 13 +
>  1 file changed, 13 insertions(+)
> 
> diff --git a/drivers/net/ethernet/broadcom/bgmac.c 
> b/drivers/net/ethernet/broadcom/bgmac.c
> index 48d672b204a4..5130fc96940d 100644
> --- a/drivers/net/ethernet/broadcom/bgmac.c
> +++ b/drivers/net/ethernet/broadcom/bgmac.c
> @@ -127,6 +127,8 @@ bgmac_dma_tx_add_buf(struct bgmac *bgmac, struct 
> bgmac_dma_ring *ring,
>   dma_desc->ctl1 = cpu_to_le32(ctl1);
>  }
> 
> +#define ENET_BRCM_TAG_LEN 4
> +
>  static netdev_tx_t bgmac_dma_tx_add(struct bgmac *bgmac,
>   struct bgmac_dma_ring *ring,
>   struct sk_buff *skb)
> @@ -139,6 +141,16 @@ static netdev_tx_t bgmac_dma_tx_add(struct bgmac *bgmac,
>   u32 flags;
>   int i;
> 
> + /* The Ethernet switch we are interfaced with needs packets to be at
> +  * least 64 bytes (including FCS) otherwise they will be discarded when
> +  * they enter the switch port logic. When Broadcom tags are enabled, we
> +  * need to make sure that packets are at least 68 bytes
> +  * (including FCS and tag) because the length verification is done after
> +  * the Broadcom tag is stripped off the ingress packet.
> +  */

I think that that would be better as:
/* Ethernet packets are padded to 64 bytes (including FCS).
 * If 'Broadcom tags' are enabled they must still be 64 bytes
 * long after the 4 byte tag is removed.
 * Since the hardware doesn't do it, we must pad them before
 * transmit.
 */

Which seems to be to be a bug in the chip.

> + if (skb_put_padto(skb, ETH_ZLEN + ENET_BRCM_TAG_LEN))
> + goto err_stats;
> +

But you shouldn't overpad packets that don't have the extra tag.

David



RE: [PATCH x86 v2] uprobe: emulate push insns for uprobe on x86

2017-11-09 Thread David Laight
From: Yonghong Song
> Sent: 09 November 2017 00:55
>
> Uprobe is a tracing mechanism for userspace programs.
> Typical uprobe will incur overhead of two traps.
> First trap is caused by replaced trap insn, and
> the second trap is to execute the original displaced
> insn in user space.
> 
> To reduce the overhead, kernel provides hooks
> for architectures to emulate the original insn
> and skip the second trap. In x86, emulation
> is done for certain branch insns.
> 
> This patch extends the emulation to "push <reg>"
> insns. These insns are typical in the beginning
> of the function. For example, bcc
...
> diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
> index 74f4c2f..f9d2b43 100644
> --- a/arch/x86/include/asm/uprobes.h
> +++ b/arch/x86/include/asm/uprobes.h
> @@ -33,6 +33,11 @@ typedef u8 uprobe_opcode_t;
...
> @@ -53,6 +59,10 @@ struct arch_uprobe {
>   u8  fixups;
>   u8  ilen;
>   }   defparam;
> + struct {
> + u8  rex_prefix;

Just call this 'reg_high' and set to 0 or 1.

> + u8  opc1;
> + }   push;
>   };
>  };
> 
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index a3755d2..5ace65c 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -640,11 +640,71 @@ static bool check_jmp_cond(struct arch_uprobe *auprobe, 
> struct pt_regs *regs)
>  #undef   COND
>  #undef   CASE_COND
> 
> -static bool branch_emulate_op(struct arch_uprobe *auprobe, struct pt_regs 
> *regs)
> +static unsigned long *get_push_reg_ptr(struct arch_uprobe *auprobe,
> +struct pt_regs *regs)
>  {
> - unsigned long new_ip = regs->ip += auprobe->branch.ilen;
> - unsigned long offs = (long)auprobe->branch.offs;
> +#if defined(CONFIG_X86_64)
> + switch (auprobe->push.opc1) {
> + case 0x50:
> + return auprobe->push.rex_prefix ? &regs->r8 : &regs->ax;
> + case 0x51:
> + return auprobe->push.rex_prefix ? &regs->r9 : &regs->cx;
> + case 0x52:
> + return auprobe->push.rex_prefix ? &regs->r10 : &regs->dx;
> + case 0x53:
> + return auprobe->push.rex_prefix ? &regs->r11 : &regs->bx;
> + case 0x54:
> + return auprobe->push.rex_prefix ? &regs->r12 : &regs->sp;
> + case 0x55:
> + return auprobe->push.rex_prefix ? &regs->r13 : &regs->bp;
> + case 0x56:
> + return auprobe->push.rex_prefix ? &regs->r14 : &regs->si;
> + }
> +
> + /* opc1 0x57 */
> + return auprobe->push.rex_prefix ? &regs->r15 : &regs->di;

The bottom of that switch statement is horrid.
Actually, why can't you sort out this address in the code that
sets up 'rex_prefix' (etc.)?

David



RE: [PATCH] net: mvneta: fix handling of the Tx descriptor counter

2017-11-08 Thread David Laight
From: Simon Guinot
> Sent: 08 November 2017 16:59
> 
> The mvneta controller provides a 8-bit register to update the pending
> Tx descriptor counter. Then, a maximum of 255 Tx descriptors can be
> added at once. In the current code the mvneta_txq_pend_desc_add function
> assumes the caller takes care of this limit. But it is not the case. In
> some situations (xmit_more flag), more than 255 descriptors are added.
> When this happens, the Tx descriptor counter register is updated with a
> wrong value, which breaks the whole Tx queue management.

Even with xmit_more it is usually best to limit the number
(in order to reduce tx latency).

OTOH there are probably paths where it 'just happens'.

> This patch fixes the issue by allowing the mvneta_txq_pend_desc_add
> function to process more than 255 Tx descriptors.
> 
> Fixes: 2a90f7e1d5d0 ("net: mvneta: add xmit_more support")
> Cc: sta...@vger.kernel.org # 4.11+
> Signed-off-by: Simon Guinot 
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 16 +---
>  1 file changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/ethernet/marvell/mvneta.c 
> b/drivers/net/ethernet/marvell/mvneta.c
> index 64a04975bcf8..027c08ce4e5d 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -816,11 +816,14 @@ static void mvneta_txq_pend_desc_add(struct mvneta_port 
> *pp,
>  {
>   u32 val;
> 
> - /* Only 255 descriptors can be added at once ; Assume caller
> -  * process TX desriptors in quanta less than 256
> -  */
> - val = pend_desc + txq->pending;
> - mvreg_write(pp, MVNETA_TXQ_UPDATE_REG(txq->id), val);
> + pend_desc += txq->pending;
> +
> + /* Only 255 Tx descriptors can be added at once */
> + while (pend_desc > 0) {
> + val = min(pend_desc, 255);
> + mvreg_write(pp, MVNETA_TXQ_UPDATE_REG(txq->id), val);
> + pend_desc -= val;
> + }

do { ... } while () ??

David



RE: mlx5 broken affinity

2017-11-08 Thread David Laight
From: Sagi Grimberg
> Sent: 08 November 2017 07:28
...
> > Why would you give the user a knob to destroy what you carefully optimized?
> 
> Well, looks like someone relies on this knob, the question is if he is
> doing something better for his workload. I don't know, its really up to
> the user to say.

Maybe the user wants to ensure that nothing except some very specific
processing happens on some (or most) of the cpu cores.

If the expected overall ethernet data rate isn't exceptionally large,
is there any reason to allocate a queue (etc.) for every cpu?
David



RE: [PATCH 0/9] use network namespace for iSCSI control interfaces

2017-11-08 Thread David Laight
From: Chris Leech
> Sent: 07 November 2017 22:45
> 
> I've posted these changes to allow iSCSI management within a container
> using a network namespace to the SCSI and Open-iSCSI lists, but seeing
> as it's not really SCSI/block related I'm casting a wider net looking
> for reviews.

I didn't spot you acquiring and releasing references to the namespace.
(I might have missed it, the relevant patch is difficult to read).

If the sockets are created in the context of the process whose namespace
you are using you don't need it, but given the hooks and callbacks
I'm not at all sure that is obviously true.

David



RE: [PATCH v4] scripts: add leaking_addresses.pl

2017-11-07 Thread David Laight
From: Tobin C. Harding
> Sent: 07 November 2017 10:32
>
> Currently we are leaking addresses from the kernel to user space. This
> script is an attempt to find some of those leakages. Script parses
> `dmesg` output and /proc and /sys files for hex strings that look like
> kernel addresses.
...

Maybe the %p values that end up in dmesg (via the kernel message buffer) should
be converted to text in a form that allows the code that reads them to
substitute alternate text for non-root users?

Then the actual addresses will be available to root (who can probably
get most by other means) but not to the casual observer.

David



RE: [PATCH] [net-next,v2] ibmvnic: Feature implementation of Vital Product Data (VPD) for the ibmvnic driver

2017-11-06 Thread David Laight
From: David Miller
> Sent: 04 November 2017 13:21
> From: Desnes Augusto Nunes do Rosario 
> Date: Wed,  1 Nov 2017 19:03:32 -0200
> 
> > +   substr = strnstr(adapter->vpd->buff, "RM", adapter->vpd->len);
> > +   if (!substr) {
> > +   dev_info(dev, "No FW level provided by VPD\n");
> > +   complete(&adapter->fw_done);
> > +   return;
> > +   }
> > +
> > +   /* get length of firmware level ASCII substring */
> > +   fw_level_len = *(substr + 2);
> > +
> > +   /* copy firmware version string from vpd into adapter */
> > +   ptr = strncpy((char *)adapter->fw_version,
> > + substr + 3, fw_level_len);
> 
> You have to be more careful here, making sure first that
> (substr + 2) < (adapter->vpd->buff + adapter->vpd->len),
> and next that (substr + 2 + fw_level_len) is in range
> as well.

And that the copy isn't longer than the target buffer.

David



RE: [PATCH net-next v2 1/3] enic: reset fetch index

2017-11-03 Thread David Laight
From: Parvi Kaustubhi
> Sent: 01 November 2017 15:45
> Since we are allowing rx ring size modification, reset fetch index
> everytime. Otherwise it could have a stale value that can lead to a null
> pointer dereference.
> 
> Signed-off-by: Govindarajulu Varadarajan 
> Signed-off-by: Parvi Kaustubhi 
> ---
> v2: remove unused variable to fix build warning
> 
>  drivers/net/ethernet/cisco/enic/vnic_rq.c | 16 ++--
>  1 file changed, 2 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/net/ethernet/cisco/enic/vnic_rq.c 
> b/drivers/net/ethernet/cisco/enic/vnic_rq.c
> index 36bc2c7..f8aa326 100644
> --- a/drivers/net/ethernet/cisco/enic/vnic_rq.c
> +++ b/drivers/net/ethernet/cisco/enic/vnic_rq.c
> @@ -139,20 +139,8 @@ void vnic_rq_init(struct vnic_rq *rq, unsigned int 
> cq_index,
>   unsigned int error_interrupt_enable,
>   unsigned int error_interrupt_offset)
>  {
> - u32 fetch_index = 0;
> -
> - /* Use current fetch_index as the ring starting point */
> - fetch_index = ioread32(&rq->ctrl->fetch_index);
> -
> - if (fetch_index == 0x) { /* check for hardware gone  */
> - /* Hardware surprise removal: reset fetch_index */
> - fetch_index = 0;
> - }
> -
> - vnic_rq_init_start(rq, cq_index,
> - fetch_index, fetch_index,
> - error_interrupt_enable,
> - error_interrupt_offset);
> + vnic_rq_init_start(rq, cq_index, 0, 0, error_interrupt_enable,
> +error_interrupt_offset);

ISTM that the two '0' arguments can be removed.

David



RE: [PATCH net-next 1/1] forcedeth: replace pci_alloc_consistent with dma_alloc_coherent

2017-10-30 Thread David Laight
From: Zhu Yanjun
> Sent: 28 October 2017 13:26
> The functions pci_alloc_consistent is obsolete. So it is replaced
> with dma_alloc_coherent
...
> + dma_free_coherent(&np->pci_dev->dev,
> +   sizeof(struct ring_desc) *
> +   (np->rx_ring_size +
> +   np->tx_ring_size),
> +   np->rx_ring.orig, np->ring_addr);
...

That can't possibly be the best way to split those long lines.

David



RE: [PATCH net 4/4] sctp: fix some type cast warnings introduced since very beginning

2017-10-30 Thread David Laight
From: Xin Long
> Sent: 28 October 2017 12:44
> These warnings were found by running 'make C=2 M=net/sctp/'.
> They are there since very beginning.
...
> - param.crr_id = i;
> + param.crr_id = htonl(i);
...

Isn't this a bug fix, not just removing a warning??

David



RE: [PATCH net-next 1/3] ibmvnic: Enable scatter-gather support

2017-10-27 Thread David Laight
From: Thomas Falcon
> Sent: 26 October 2017 18:52
...
> >>memset(dst, 0, adapter->req_mtu);
> > Seems unnecessary.
> 
> I removed that bit, and so far you seem to be right :) .

I'd check that short frames are padded with zeros.

David



RE: v4.14-rc3/arm64 DABT exception in atomic_inc() / __skb_clone()

2017-10-26 Thread David Laight
From: Willem de Bruijn
> Sent: 25 October 2017 19:50
...
> From skb->dev and netdev_priv, the tun device has flags 0x1002 ==
> IFF_TAP | IFF_NO_PI. This kernel precedes the recent support for
> IFF_NAPI and IFF_NAPI_FRAGS. The allocation most likely happened
> in tun_build_skb from current->task_frag. It would be a previous
> allocation that left alloc_frag->offset unaligned. But perhaps this code
> needs to perform alignment before setting skb->head.
>
> At least on platforms where atomic on dataref must be aligned.

Isn't that true of almost everything?
I'm not even sure x86 always (ever?) manages locked cycles on
misaligned addresses.

David



RE: Get rid of RCU callbacks in TC filters?

2017-10-23 Thread David Laight
From: Cong Wang
> Sent: 20 October 2017 21:52
> To: Paul E. McKenney
> Cc: Jamal Hadi Salim; Chris Mi; Linux Kernel Network Developers; Daniel 
> Borkmann; Eric Dumazet; David
> Miller; Jiri Pirko
> Subject: Re: Get rid of RCU callbacks in TC filters?
> 
> On Fri, Oct 20, 2017 at 1:31 PM, Cong Wang  wrote:
> > Actually, I just tried this approach, this way makes the core tc filter
> > code harder to wait for flying callbacks, currently rcu_barrier() is
> > enough, with one more schedule_work() added we probably
> > need flush_workqueue()... Huh, this also means I can't use the
> > global workqueue so should add a new workqueue for tc filters.
> >
> > Good news is I seem to make it work without adding much code.
> > Stay tuned. ;)
> 
> Hmm, lockdep already complains I have a deadlock when flushing
> workqueue while hodling RTNL lock, because callbacks need RTNL
> too... See the rcu_barrier() in tcf_block_put().
> 
> So this approach seems to be a dead end. Is there any way out?
> 
> Probably we have to convert call_rcu() to synchronize_rcu(),
> but with batching of course.

Locking for callbacks is always a PITA.
You also need to sort out how to 'unhook' the callbacks.

If you require the caller to hold the rtnl_lock prior to adding
or removing callbacks, and guarantee that the functions will
be called with the rtnl_lock held, you might be ok.

Otherwise the one side can't do its 'expected' locking without
a potential deadlock or use-after-free.

The only way I know to avoid the issues removing the callbacks
is to get the callback function called from its normal scheduling
point with an argument (probably NULL) that indicates it will
never be called again.

David



RE: [PATCH net-next 6/8] tools: bpftool: print all relevant byte opcodes for "load double word"

2017-10-20 Thread David Laight
From: Jakub Kicinski
> Sent: 19 October 2017 23:46
> The eBPF instruction permitting to load double words (8 bytes) into a
> register need 8-byte long "immediate" field, and thus occupy twice the
> space of other instructions. bpftool was aware of this and would
> increment the instruction counter only once on meeting such instruction,
> but it would only print the first four bytes of the immediate value to
> load. Make it able to dump the whole 16 byte-long double instruction
> instead (as would `llvm-objdump -d `).

Guess why most modern instruction sets use a 'load high' instruction
to generate big constants...

Interestingly, is there anything special in the rest of the
second instruction in order to make it an identifiable no-op?

...
> diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
> index 355c14325622..57edbea2fbe8 100644
> --- a/tools/bpf/bpftool/prog.c
> +++ b/tools/bpf/bpftool/prog.c
> @@ -313,20 +313,29 @@ static void print_insn(struct bpf_verifier_env *env, 
> const char *fmt, ...)
>  static void dump_xlated(void *buf, unsigned int len, bool opcodes)
>  {
>   struct bpf_insn *insn = buf;
> + bool double_insn = false;
>   unsigned int i;
> 
>   for (i = 0; i < len / sizeof(*insn); i++) {
> + if (double_insn) {
> + double_insn = false;
> + continue;
> + }

Why not just:
for (i = 0; i < len / sizeof(*insn); i += 1 + double_insn) {

...

David



RE: [PATCH v2 net-next 1/2] net: dsa: lan9303: Add port_fast_age and port_fdb_dump methods

2017-10-20 Thread David Laight
From: Egil Hjelmeland
> Sent: 19 October 2017 17:53
...
> >> IMHO it is best to define a struct for the 'ctx and then do:
> >> ..., void *v_ctx)
> >> {
> >> foo_ctx *ctx = v_ctx;
> >> int port = ctx->port;
> >>
> >> That stops anyone having to double-check that the *(int *)
> >> is operating on a pointer to an integer of the correct size.
> >>
> >
> > Does casting to a struct pointer require less manual double-check than
> > to a int-pointer? In neither cases the compiler can protect us, IFAIK.
> > But on the other hand, a the text "foo_ctx" can searched in the editor.
> > So in that respect it can somewhat aid to the double-checking.
> >
> > So I can do that.
> >
> >
> 
> I understand now that the caller side (lan9303_port_fast_age) is
> vulnerable. Say somebody has the idea to change the "port" param
> of .port_fast_age from int to u8, then my code is a trap.

Worse, change it to a long and it will work on everything except
64bit big-endian systems.

David



RE: [PATCH v2 net-next 1/2] net: dsa: lan9303: Add port_fast_age and port_fdb_dump methods

2017-10-19 Thread David Laight
From: Andrew Lunn
> Sent: 19 October 2017 15:15
> > > +/* Clear learned (non-static) entry on given port */
> > > +static void alr_loop_cb_del_port_learned(struct lan9303 *chip, u32 dat0,
> > > +  u32 dat1, int portmap, void *ctx)
> > > +{
> > > + int *port = ctx;
> >
> > You can get the value directly to make the line below more readable:
> >
> > int port = *(int *)ctx;
> 
> You have to be a bit careful with this. You often see people
> submitting patches taking away casts for void * pointers.
> If they do that here, it should at least not compile...
> 
> So maybe do it in two steps?
> 
>int * pport = ctx;
>int port = *pport;

IMHO it is best to define a struct for the 'ctx and then do:
..., void *v_ctx)
{
foo_ctx *ctx = v_ctx;
int port = ctx->port;

That stops anyone having to double-check that the *(int *)
is operating on a pointer to an integer of the correct size.

One of the syntax checkers probably ought to generate a warning
for *(integer_type *)foo since it is often a bug.

David



RE: [PATCH] lib/dynamic_queue_limits.c: relax BUG_ON to WARN_ON in dql_complete()

2017-10-18 Thread David Laight
From: Ard Biesheuvel
> Sent: 18 October 2017 16:45
> Even though calling dql_completed() with a count that exceeds the
> queued count is a serious error, it still does not justify bringing
> down the entire kernel with a BUG_ON(). So relax it to a WARN_ON()
> instead.
> 
> Signed-off-by: Ard Biesheuvel 
> ---
>  lib/dynamic_queue_limits.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
> index f346715e2255..24ce495d78f3 100644
> --- a/lib/dynamic_queue_limits.c
> +++ b/lib/dynamic_queue_limits.c
> @@ -23,7 +23,7 @@ void dql_completed(struct dql *dql, unsigned int count)
>   num_queued = ACCESS_ONCE(dql->num_queued);
> 
>   /* Can't complete more than what's in queue */
> - BUG_ON(count > num_queued - dql->num_completed);
> + WARN_ON(count > num_queued - dql->num_completed);
> 
>   completed = dql->num_completed + count;

Don't you need to bound 'count' so that horrid things don't
happen further down the code?

David



RE: [PATCH net-next 1/3] ibmvnic: Enable scatter-gather support

2017-10-18 Thread David Laight
From: Thomas Falcon
> Sent: 17 October 2017 18:37
> This patch enables scatter gather support. Since there is no
> HW/FW scatter-gather support at this time, the driver needs to
> loop through each fragment and copy it to a contiguous, pre-mapped
> buffer entry.
...
>   offset = index * adapter->req_mtu;
>   dst = tx_pool->long_term_buff.buff + offset;

You should be able to treat the pre-allocated data area as a
big ring buffer.
So it can hold a lot of small frames or a few big ones.
This slightly complicates the 'is there enough space for
this packet' check since you need buffer space as well
as a ring entry.

You also really want to align each tx buffer on a 4n+2
boundary so that most of the copy is aligned.

>   memset(dst, 0, adapter->req_mtu);
Seems unnecessary.

David



RE: [PATCH net 0/3] Fix for BPF devmap percpu allocation splat

2017-10-17 Thread David Laight
From: Daniel Borkmann
> Sent: 17 October 2017 15:56
> 
> The set fixes a splat in devmap percpu allocation when we alloc
> the flush bitmap. Patch 1 is a prerequisite for the fix in patch 2,
> patch 1 is rather small, so if this could be routed via -net, for
> example, with Tejun's Ack that would be good. Patch 3 gets rid of
> remaining PCPU_MIN_UNIT_SIZE checks, which are percpu allocator
> internals and should not be used.

Does it make sense to allow the user program to try to allocate ever
smaller very large maps until it finds one that succeeds - thus
using up all the percpu space?

Or is this a 'root only' 'shoot self in foot' job?

David



RE: [PATCH net-next v3 0/5] net: dsa: remove .set_addr

2017-10-16 Thread David Laight
From: Andrew Lunn
> Sent: 16 October 2017 17:10
...
> So, received Pause frames never leave the MAC. They don't get bridged,
> nor do they get passed up for host processing. They are purely point
> to point between two MAC peers. The destination is unambiguous. It is
> simple the other MAC peer. The destination address makes it clear it
> is a pause frame, the the source address seems to be unneeded.
> 
> In this context, a random MAC addresses are safe.

Is there any reason why a fixed value (say 00:00:00:00:00:00)
can't be used?

> In the more general case, i would agree with you. Collisions are
> possible, causing problems.

For IP, MAC addresses only go as far as the first router.
So the duplicates would (typically) have to be within the same subnet.
This makes the chance of a duplicate random address unlikely.

(Unless you have an un-subnetted class A network consisting of
multiple 1km long coax segments connected by 1km long repeaters
making a single collision domain.)

David



RE: [net-next 6/9] e1000e: fix buffer overrun while the I219 is processing DMA transactions

2017-10-16 Thread David Laight
From: Neftin, Sasha
> Sent: 16 October 2017 11:40
> On 10/11/2017 12:07, David Laight wrote:
> > From: Jeff Kirsher
> >> Sent: 10 October 2017 18:22
> >> Intel 100/200 Series Chipset platforms reduced the round-trip
> >> latency for the LAN Controller DMA accesses, causing in some high
> >> performance cases a buffer overrun while the I219 LAN Connected
> >> Device is processing the DMA transactions. I219LM and I219V devices
> >> can fall into unrecovered Tx hang under very stressful UDP traffic
> >> and multiple reconnection of Ethernet cable. This Tx hang of the LAN
> >> Controller is only recovered if the system is rebooted. Slightly slow
> >> down DMA access by reducing the number of outstanding requests.
> >> This workaround could have an impact on TCP traffic performance
> >> on the platform. Disabling TSO eliminates performance loss for TCP
> >> traffic without a noticeable impact on CPU performance.
> >>
> >> Please, refer to I218/I219 specification update:
> >> https://www.intel.com/content/www/us/en/embedded/products/networking/
> >> ethernet-connection-i218-family-documentation.html
> >>
> >> Signed-off-by: Sasha Neftin <sasha.nef...@intel.com>
> >> Reviewed-by: Dima Ruinskiy <dima.ruins...@intel.com>
> >> Reviewed-by: Raanan Avargil <raanan.avar...@intel.com>
> >> Tested-by: Aaron Brown <aaron.f.br...@intel.com>
> >> Signed-off-by: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
> >> ---
> >>   drivers/net/ethernet/intel/e1000e/netdev.c | 8 +---
> >>   1 file changed, 5 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c
> b/drivers/net/ethernet/intel/e1000e/netdev.c
> >> index ee9de3500331..14b096f3d1da 100644
> >> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> >> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> >> @@ -3021,8 +3021,8 @@ static void e1000_configure_tx(struct e1000_adapter 
> >> *adapter)
> >>
> >>hw->mac.ops.config_collision_dist(hw);
> >>
> >> -  /* SPT and CNP Si errata workaround to avoid data corruption */
> >> -  if (hw->mac.type >= e1000_pch_spt) {
> >> +  /* SPT and KBL Si errata workaround to avoid data corruption */
> >> +  if (hw->mac.type == e1000_pch_spt) {
> >>u32 reg_val;
> >>
> >>reg_val = er32(IOSFPC);
> >> @@ -3030,7 +3030,9 @@ static void e1000_configure_tx(struct e1000_adapter 
> >> *adapter)
> >>ew32(IOSFPC, reg_val);
> >>
> >>reg_val = er32(TARC(0));
> >> -  reg_val |= E1000_TARC0_CB_MULTIQ_3_REQ;
> >> +  /* SPT and KBL Si errata workaround to avoid Tx hang */
> >> +  reg_val &= ~BIT(28);
> >> +  reg_val |= BIT(29);

> > Shouldn't some more of the commit message about what this is doing
> > be in the comment?

> There is provided link on specification update:
> https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/i218-i219-ethernet-
> connection-spec-update.pdf?asset=9561.
> This is Intel's public release.

And sometime next week the marketing people will decide to reorganise the
web site and the link will become invalid.

> > And shouldn't the 28 and 29 be named constants?

> (28 and 29) - you can easily understand from the code that the value has
> been changed from 3 to 2. There is no point adding a flag here, I thought.

Oh, there is. The 'workaround' is:
  Slightly slow down DMA access by reducing the number of outstanding requests.
  This workaround could have an impact on TCP traffic performance and could
  reduce performance by 5 to 15%, depending on the platform.
  Disabling TSO eliminates performance loss for TCP traffic without a 
  noticeable impact on CPU performance.

I wonder what tests they did to show that TSO doesn't save cpu cycles!

So my guess is that you are changing the number of outstanding PCIe reads
(or reads for tx buffers, or ???) from 3 to 2.

Let's read between the lines a little further
(since you are at Intel you can probably check this):
Assuming that TSO is 'Transmit Segmentation Offload' and that TSO packets
might be 64k, then reading 3 TSO packets might issue PCIe reads for 196k
bytes of data (under 4k for non-TSO).
If the internal buffer that this data is stored in isn't that big then
that internal buffer would overflow.
It might be that data is removed from this buffer as soon as the last
completion TLP arrives - but they can be interleaved with other
outstanding PCIe reads.
It all rather depends on the negotiated maximum TLP size and number
of tags.

Perhaps reducing the maximum TSO packet to 32k stops the overflow
as well...
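To put rough numbers on that guess (the 64k TSO size and the 3-vs-2
outstanding-request counts are assumptions from this thread, not from the
errata), the bytes potentially in flight work out like this:

```c
#include <assert.h>

/* Back-of-envelope: bytes of read data potentially in flight if the NIC
 * issues PCIe reads for N whole TSO packets at once.  With 3 outstanding
 * 64 KiB packets that is 196608 bytes; dropping to 2 cuts it by a third,
 * which could be enough to keep a smallish internal buffer from
 * overflowing. */
static unsigned long inflight_bytes(unsigned int outstanding,
                                    unsigned long pkt_bytes)
{
        return (unsigned long)outstanding * pkt_bytes;
}
```

A non-TSO frame is under 4k, so the same 3 outstanding requests would stay
far below that figure.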

David



RE: [PATCH net-next 00/10] korina cleanups/optimizations

2017-10-16 Thread David Laight
From: Roman Yeryomin
> Sent: 15 October 2017 17:22
> TX optimizations have led to ~15% performance increase (35->40Mbps)
> in local tx usecase (tested with iperf v3.2).

Indicate which patches give the improvement.
IIRC some just changed the source without changing what the code really did.
(Not a problem in itself.)

...
>   net: korina: optimize tx/rx interrupt handlers

You'd probably get a noticeable improvement from caching the
value of the interrupt mask - instead of reading it from the hardware.
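The shadow-copy pattern I mean looks roughly like this (a userspace sketch;
the struct and accessor names are invented, not korina's actual ones):

```c
#include <assert.h>
#include <stdint.h>

/* Keep the last value written to the interrupt-mask register in memory so
 * the hot path never has to read it back over the (slow) device bus.
 * The raw pointer write stands in for the driver's MMIO accessor. */
struct korina_like_priv {
        uint32_t irq_mask_shadow;       /* cached copy of the mask register */
        volatile uint32_t *irq_mask_reg;
};

static void irq_mask_set(struct korina_like_priv *p, uint32_t bits)
{
        p->irq_mask_shadow |= bits;
        *p->irq_mask_reg = p->irq_mask_shadow;  /* write-through */
}

static void irq_mask_clear(struct korina_like_priv *p, uint32_t bits)
{
        p->irq_mask_shadow &= ~bits;
        *p->irq_mask_reg = p->irq_mask_shadow;
}

/* Readers use the shadow instead of an MMIO read. */
static uint32_t irq_mask_get(const struct korina_like_priv *p)
{
        return p->irq_mask_shadow;
}
```

The only requirement is that nothing else (hardware or another context)
modifies the register behind the driver's back.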
...

David



RE: [next-queue PATCH v7 4/6] net/sched: Introduce Credit Based Shaper (CBS) qdisc

2017-10-16 Thread David Laight
From: Ivan Khoronzhuk
> Sent: 13 October 2017 20:59
> On Thu, Oct 12, 2017 at 05:40:03PM -0700, Vinicius Costa Gomes wrote:
> > This queueing discipline implements the shaper algorithm defined by
> > the 802.1Q-2014 Section 8.6.8.2 and detailed in Annex L.
> >
> > It's primary usage is to apply some bandwidth reservation to user
> > defined traffic classes, which are mapped to different queues via the
> > mqprio qdisc.
> >
> > Only a simple software implementation is added for now.
> >
> > Signed-off-by: Vinicius Costa Gomes 
> > Signed-off-by: Jesus Sanchez-Palencia 
> > ---
> >  include/uapi/linux/pkt_sched.h |  18 +++
> >  net/sched/Kconfig  |  11 ++
> >  net/sched/Makefile |   1 +
> >  net/sched/sch_cbs.c| 314 +
> >  4 files changed, 344 insertions(+)
> >  create mode 100644 net/sched/sch_cbs.c
> >
> > diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
> > index 099bf5528fed..41e349df4bf4 100644
> > --- a/include/uapi/linux/pkt_sched.h
> > +++ b/include/uapi/linux/pkt_sched.h
> > @@ -871,4 +871,22 @@ struct tc_pie_xstats {
> > __u32 maxq; /* maximum queue size */
> > __u32 ecn_mark; /* packets marked with ecn*/
> >  };
> > +
> > +/* CBS */
> > +struct tc_cbs_qopt {
> > +   __u8 offload;

You probably don't want unnamed padding in a uapi structure.

> > +   __s32 hicredit;
> > +   __s32 locredit;
> > +   __s32 idleslope;
> > +   __s32 sendslope;
> > +};
> > +
> > +enum {
> > +   TCA_CBS_UNSPEC,
> > +   TCA_CBS_PARMS,
> > +   __TCA_CBS_MAX,
> > +};
> > +
> > +#define TCA_CBS_MAX (__TCA_CBS_MAX - 1)

Why not:
TCA_CBS_PARMS,
TCA_CBS_NEXT,
TCA_CBS_MAX = TCA_CBS_NEXT - 1,
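Side by side, the two spellings produce the same MAX value; the difference
is only that the second keeps everything inside the enum (suffixes added
here so both can coexist in one sketch):

```c
#include <assert.h>

/* The upstream spelling: a hidden sentinel, MAX derived via a macro. */
enum {
        TCA_CBS_UNSPEC_A,
        TCA_CBS_PARMS_A,
        __TCA_CBS_MAX_A,
};
#define TCA_CBS_MAX_A (__TCA_CBS_MAX_A - 1)

/* The suggested spelling: NEXT is the sentinel, MAX is pinned
 * explicitly as an enumerator. */
enum {
        TCA_CBS_UNSPEC_B,
        TCA_CBS_PARMS_B,
        TCA_CBS_NEXT_B,
        TCA_CBS_MAX_B = TCA_CBS_NEXT_B - 1,
};
```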

...
David



RE: [PATCH net-next v3 3/5] net: dsa: mv88e6060: setup random mac address

2017-10-16 Thread David Laight
From: Vivien Didelot
> Sent: 13 October 2017 19:18
> As for mv88e6xxx, setup the switch from within the mv88e6060 driver with
> a random MAC address, and remove the .set_addr implementation.
> 
> Signed-off-by: Vivien Didelot 
> ---
>  drivers/net/dsa/mv88e6060.c | 43 ++-
>  1 file changed, 26 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/net/dsa/mv88e6060.c b/drivers/net/dsa/mv88e6060.c
> index d64be2b83d3c..6173be889d95 100644
> --- a/drivers/net/dsa/mv88e6060.c
> +++ b/drivers/net/dsa/mv88e6060.c
> @@ -9,6 +9,7 @@
>   */
> 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -188,6 +189,27 @@ static int mv88e6060_setup_port(struct dsa_switch *ds, 
> int p)
>   return 0;
>  }
> 
> +static int mv88e6060_setup_addr(struct dsa_switch *ds)
> +{
> + u8 addr[ETH_ALEN];
> + u16 val;
> +
> + eth_random_addr(addr);
> +
> + val = addr[0] << 8 | addr[1];
> +
> + /* The multicast bit is always transmitted as a zero, so the switch uses
> +  * bit 8 for "DiffAddr", where 0 means all ports transmit the same SA.
> +  */
> + val &= 0xfeff;

The comment is probably ok, but the mask isn't needed.
eth_random_addr() won't return an address with the multicast bit set.
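A userspace stand-in for eth_random_addr() makes the point: the multicast
bit (bit 0 of octet 0) is always cleared at generation time, so the driver's
`& 0xfeff` is a no-op (ETH_ALEN and the bit masks match the kernel's; the
rest is a sketch):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define ETH_ALEN 6

/* Stand-in for the kernel's eth_random_addr(): random bytes, then clear
 * the multicast bit and set the locally-administered bit of octet 0. */
static void eth_random_addr_sketch(uint8_t addr[ETH_ALEN])
{
        for (int i = 0; i < ETH_ALEN; i++)
                addr[i] = (uint8_t)rand();
        addr[0] &= 0xfe;        /* clear multicast */
        addr[0] |= 0x02;        /* set locally administered */
}

/* The register value the driver builds; masking with 0xfeff only clears
 * bit 8, which holds addr[0]'s (already-clear) multicast bit. */
static uint16_t mac01_regval(const uint8_t addr[ETH_ALEN])
{
        uint16_t val = (uint16_t)((addr[0] << 8) | addr[1]);
        return val & 0xfeff;
}
```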

David



RE: [PATCH net-next v2 2/4] net: dsa: mv88e6060: setup random mac address

2017-10-16 Thread David Laight
From: woojung@microchip.com
> Sent: 13 October 2017 18:59
> > >> > +  REG_WRITE(REG_GLOBAL, GLOBAL_MAC_01, (addr[0] << 9) |
> > >> addr[1]);
> > >>
> > >> Is that supposed to be 9 ?
> > >
> > > Looks like it.
> > > Check
> > http://www.marvell.com/switching/assets/marvell_linkstreet_88E6060_data
> > sheet.pdf
> >
> > Hum, David is correct, there is a bug in the driver which needs to be
> > addressed first. MAC address bit 40 is addr[0] & 0x1, thus we must
> > shift byte 0 by 8 and mask it against 0xfe.
> >
> > I'll respin this serie including a fix for both net and net-next.
> 
> Yes, you are right. Missed description about bit 40.

Since they are all random numbers it doesn't matter.

Actually isn't this just masking off the non-unicast bit?
So it won't be set in the data - and so doesn't need masking.

David



RE: Kernel Performance Tuning for High Volume SCTP traffic

2017-10-13 Thread David Laight
From: Traiano Welcome
> Sent: 13 October 2017 17:04
> On Fri, Oct 13, 2017 at 11:56 PM, David Laight <david.lai...@aculab.com> 
> wrote:
> > From: Traiano Welcome
> >
> > (copied to netdev)
> >> Sent: 13 October 2017 07:16
> >> To: linux-s...@vger.kernel.org
> >> Subject: Kernel Performance Tuning for High Volume SCTP traffic
> >>
> >> Hi List
> >>
> >> I'm running a linux server processing high volumes of SCTP traffic and
> >> am seeing large numbers of packet overruns (ifconfig output).
> >
> > I'd guess that overruns indicate that the ethernet MAC is failing to
> > copy the receive frames into kernel memory.
> > It is probably running out of receive buffers, but might be
> > suffering from a lack of bus bandwidth.
> > MAC drivers usually discard receive frames if they can't get
> > a replacement buffer - so you shouldn't run out of rx buffers.
> >
> > This means the errors are probably below SCTP - so changing SCTP parameters
> > is unlikely to help.
> 
> Does this mean that tuning UDP performance could help ? Or do you mean
> hardware (NIC) performance could be the issue?

I'd certainly check UDP performance.

David



RE: [PATCH net-next 5/5] net: dsa: split dsa_port's netdev member

2017-10-13 Thread David Laight
From: Vivien Didelot
> Sent: 13 October 2017 16:29
> Vivien Didelot  writes:
> 
> >>> How about using:
> >>>
> >>>   union {
> >>>   struct net_device *master;
> >>>   struct net_device *slave;
> >>>   } netdev;
> >> ...
> >>
> >> You can remove the 'netdev'; all the compilers support unnamed unions.
> >
> > There are issues with older GCC versions, see the commit 42275bd8fcb3
> > ("switchdev: don't use anonymous union on switchdev attr/obj structs")
> >
> > That's why I kept it in the v2 I sent.
> 
> At the same time, I can see that struct sk_buff uses anonym union a lot.
> 
> It seems weird that one raised a compiler issue for switchdev but not
> for skbuff.h... Do you think it is viable to drop the name here then?

I believe the problem is with initialisers for static structures
that contain anonymous unions.
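Concretely (a sketch; the attribute names are invented): with a named
union, a static initialiser can address the member through the name on any
compiler vintage, whereas the anonymous form was rejected by older GCC:

```c
#include <assert.h>

/* Named union: portable static initialisation. */
struct attr_named {
        int id;
        union {
                int  small;
                long large;
        } u;
};
static const struct attr_named ok = { .id = 1, .u.small = 7 };

/* The anonymous-union equivalent,
 *     struct attr_anon { int id; union { int small; long large; }; };
 *     static const struct attr_anon x = { .id = 1, .small = 7 };
 * compiles on current compilers but tripped up older GCC, which is what
 * commit 42275bd8fcb3 worked around for switchdev. */
```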

David



RE: Kernel Performance Tuning for High Volume SCTP traffic

2017-10-13 Thread David Laight
From: Traiano Welcome

(copied to netdev)
> Sent: 13 October 2017 07:16
> To: linux-s...@vger.kernel.org
> Subject: Kernel Performance Tuning for High Volume SCTP traffic
> 
> Hi List
> 
> I'm running a linux server processing high volumes of SCTP traffic and
> am seeing large numbers of packet overruns (ifconfig output).

I'd guess that overruns indicate that the ethernet MAC is failing to
copy the receive frames into kernel memory.
It is probably running out of receive buffers, but might be
suffering from a lack of bus bandwidth.
MAC drivers usually discard receive frames if they can't get
a replacement buffer - so you shouldn't run out of rx buffers.

This means the errors are probably below SCTP - so changing SCTP parameters
is unlikely to help.

I'd make sure any receive interrupt coalescing/mitigation is turned off.
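It's also worth looking at the overrun *rate* rather than the running
totals ifconfig prints. From two samples of the counter taken a known
interval apart (the numbers in the test below are taken from the stats
quoted further down):

```c
#include <assert.h>

/* Per-second overrun rate from two snapshots of a monotonically
 * increasing counter, dt_sec seconds apart. */
static unsigned long overruns_per_sec(unsigned long long prev,
                                      unsigned long long cur,
                                      unsigned int dt_sec)
{
        return (unsigned long)((cur - prev) / dt_sec);
}
```

A rate in the hundreds of thousands per second, as here, points firmly at
the MAC/driver level rather than anything SCTP does.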

David
 

> I think a large amount of performance tuning can probably be done to
> improve the linux kernel's SCTP handling performance, but there seem
> to be no guides on this available. Could anyone advise on this?
> 
> 
> Here are my current settings, and below, some stats:
> 
> 
> -
> net.sctp.addip_enable = 0
> net.sctp.addip_noauth_enable = 0
> net.sctp.addr_scope_policy = 1
> net.sctp.association_max_retrans = 10
> net.sctp.auth_enable = 0
> net.sctp.cookie_hmac_alg = sha1
> net.sctp.cookie_preserve_enable = 1
> net.sctp.default_auto_asconf = 0
> net.sctp.hb_interval = 3
> net.sctp.max_autoclose = 8589934
> net.sctp.max_burst = 40
> net.sctp.max_init_retransmits = 8
> net.sctp.path_max_retrans = 5
> net.sctp.pf_enable = 1
> net.sctp.pf_retrans = 0
> net.sctp.prsctp_enable = 1
> net.sctp.rcvbuf_policy = 0
> net.sctp.rto_alpha_exp_divisor = 3
> net.sctp.rto_beta_exp_divisor = 2
> net.sctp.rto_initial = 3000
> net.sctp.rto_max = 6
> net.sctp.rto_min = 1000
> net.sctp.rwnd_update_shift = 4
> net.sctp.sack_timeout = 50
> net.sctp.sctp_mem = 61733040 82310730 123466080
> net.sctp.sctp_rmem = 40960 8655000 41943040
> net.sctp.sctp_wmem = 40960 8655000 41943040
> net.sctp.sndbuf_policy = 0
> net.sctp.valid_cookie_life = 6
> -
> 
> 
> I'm seeing a high rate of packet errors (almost all overruns) on both
> 10gb NICs attached to my linux server.
> 
> The system is handling high volumes of network traffic, so this is
> likely a linux kernel tuning problem.
> 
> All the normal tuning parameters I've tried thus far seems to be
> having little effect and I'm still seeing high volumes of packet
> overruns.
> 
> Any pointers on other things I could try to get the system handling
> SCTP packets efficiently would be much appreciated!
> 
> -
> :~# ifconfig ens4f1
> 
> ens4f1Link encap:Ethernet  HWaddr 5c:b9:01:de:0d:4c
>   UP BROADCAST RUNNING PROMISC MULTICAST  MTU:9000  Metric:1
>   RX packets:22313514162 errors:17598241316 dropped:68
> overruns:17598241316 frame:0
>   TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:1000
>   RX bytes:31767480894219 (31.7 TB)  TX bytes:0 (0.0 B)
>   Interrupt:17 Memory:c980-c9ff
> -
> 
> System details:
> 
> OS: Ubuntu Linux (4.11.0-14-generic #20~16.04.1-Ubuntu SMP x86_64 )
> CPU Cores : 72
> NIC Model : NetXtreme II BCM57810 10 Gigabit Ethernet
> RAM   : 240 GiB
> 
> NIC sample stats showing packet error rate:
> 
> 
> 
> for i in `seq 1 10`;do echo "$i) `date`" - $(ifconfig ens4f0| egrep
> "RX"| egrep overruns;sleep 5);done
> 
> 1) Thu Oct 12 19:50:40 SGT 2017 - RX packets:8364065830
> errors:2594507718 dropped:215 overruns:2594507718 frame:0
> 2) Thu Oct 12 19:50:45 SGT 2017 - RX packets:8365336060
> errors:2596662672 dropped:215 overruns:2596662672 frame:0
> 3) Thu Oct 12 19:50:50 SGT 2017 - RX packets:8366602087
> errors:2598840959 dropped:215 overruns:2598840959 frame:0
> 4) Thu Oct 12 19:50:55 SGT 2017 - RX packets:8367881271
> errors:2600989229 dropped:215 overruns:2600989229 frame:0
> 5) Thu Oct 12 19:51:01 SGT 2017 - RX packets:8369147536
> errors:2603157030 dropped:215 overruns:2603157030frame:0
> 6) Thu Oct 12 19:51:06 SGT 2017 - RX packets:8370149567
> errors:2604904183 dropped:215 overruns:2604904183frame:0
> 7) Thu Oct 12 19:51:11 SGT 2017 - RX packets:8371298018
> errors:2607183939 dropped:215 overruns:2607183939frame:0
> 8) Thu Oct 12 19:51:16 SGT 2017 - RX packets:8372455587
> errors:2609411186 dropped:215 overruns:2609411186frame:0
> 9) Thu Oct 12 19:51:21 SGT 2017 - RX packets:8373585102
> errors:2611680597 dropped:215 overruns:2611680597 frame:0
> 10) Thu Oct 12 19:51:26 SGT 2017 - RX packets:8374678508
> errors:2614053000 dropped:215 overruns:2614053000 frame:0
> 
> 
> 
> However, checking (with tc) shows no ring buffer overruns on NIC:
> 
> 
> 
> tc -s qdisc show dev ens4f0|egrep drop
> 
> Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
> Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
> Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
> Sent 0 bytes 0 pkt (dropped 0, 

RE: [PATCH net-next v2 2/4] net: dsa: mv88e6060: setup random mac address

2017-10-13 Thread David Laight
From: Vivien Didelot
> Sent: 13 October 2017 02:41
> As for mv88e6xxx, setup the switch from within the mv88e6060 driver with
> a random MAC address, and remove the .set_addr implementation.
> 
> Signed-off-by: Vivien Didelot 
> ---
>  drivers/net/dsa/mv88e6060.c | 30 +++---
>  1 file changed, 19 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/net/dsa/mv88e6060.c b/drivers/net/dsa/mv88e6060.c
> index 621cdc46ad81..2f9d5e6a0f97 100644
> --- a/drivers/net/dsa/mv88e6060.c
> +++ b/drivers/net/dsa/mv88e6060.c
...
> + REG_WRITE(REG_GLOBAL, GLOBAL_MAC_01, (addr[0] << 9) | addr[1]);

Is that supposed to be 9 ?
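Assuming the 16-bit MAC_01 register wants addr[0] in the high byte and
addr[1] in the low byte (my reading, not the datasheet's words), the two
shifts diverge like this:

```c
#include <assert.h>
#include <stdint.h>

/* With << 9, addr[0]'s top bit is shifted out of the 16-bit register
 * and every remaining addr[0] bit lands one position high, overlapping
 * where addr[1]'s top bit should be. */
static uint16_t mac01_shift9(uint8_t a0, uint8_t a1)
{
        return (uint16_t)((a0 << 9) | a1);
}

/* The straightforward byte packing. */
static uint16_t mac01_shift8(uint8_t a0, uint8_t a1)
{
        return (uint16_t)((a0 << 8) | a1);
}
```

(The eventual fix shifted by 8 and handled the stolen bit 8 - "DiffAddr" -
separately, per the later messages in this thread.)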

David



RE: [PATCH net-next 5/5] net: dsa: split dsa_port's netdev member

2017-10-13 Thread David Laight
From: Florian Fainelli
> Sent: 13 October 2017 00:05
...
> How about using:
> 
>   union {
>   struct net_device *master;
>   struct net_device *slave;
>   } netdev;
...

You can remove the 'netdev'; all the compilers support unnamed unions.

David



RE: [PATCH 3/4] net: qcom/emac: enforce DMA address restrictions

2017-10-12 Thread David Laight
From: Timur Tabi
> Sent: 12 October 2017 15:13
> On 10/12/17 4:30 AM, David Laight wrote:
> > Isn't the memory allocated by a single kzalloc() call?
> 
> dma_alloc_coherenent, actually.
> 
> > IIRC that guarantees it doesn't cross a power of 2 boundary less than
> > the size.
> 
> I'm pretty sure that kzalloc does not make that guarantee, and I don't
> think dma_alloc_coherent does either.

dma_alloc_coherent() definitely does.
And I've a driver that relies on it (for 16k blocks).

David



RE: [PATCH 3/4] net: qcom/emac: enforce DMA address restrictions

2017-10-12 Thread David Laight
From: Timur Tabi
> Sent: 11 October 2017 20:52
> The EMAC has a restriction that the upper 32 bits of the base addresses
> for the RFD and RRD rings must be the same.  To ensure that restriction,
> we allocate twice the space for the RRD and locate it at an appropriate
> address.
> 
> We also re-arrange the allocations so that invalid addresses are even
> less likely.
> 
> Signed-off-by: Timur Tabi 
> ---
>  drivers/net/ethernet/qualcomm/emac/emac-mac.c | 39 
> ---
>  1 file changed, 24 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/net/ethernet/qualcomm/emac/emac-mac.c 
> b/drivers/net/ethernet/qualcomm/emac/emac-
> mac.c
> index 9cbb27263742..0f5ece5d9507 100644
> --- a/drivers/net/ethernet/qualcomm/emac/emac-mac.c
> +++ b/drivers/net/ethernet/qualcomm/emac/emac-mac.c
> @@ -734,6 +734,11 @@ static int emac_rx_descs_alloc(struct emac_adapter *adpt)
>   rx_q->rrd.size = rx_q->rrd.count * (adpt->rrd_size * 4);
>   rx_q->rfd.size = rx_q->rfd.count * (adpt->rfd_size * 4);
> 
> + /* Check if the RRD and RFD are aligned properly, and if not, adjust. */
> + if (upper_32_bits(ring_header->dma_addr) !=
> + upper_32_bits(ring_header->dma_addr + ALIGN(rx_q->rrd.size, 8)))
> + ring_header->used = ALIGN(rx_q->rrd.size, 8);
> +

Isn't the memory allocated by a single kzalloc() call?
IIRC that guarantees it doesn't cross a power of 2 boundary less than
the size.
So if you allocate any size between 4k and 8k it won't cross an odd
4k boundary (etc).

So these checks are entirely pointless.
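The check the patch does, and the alignment fact above, in sketch form: a
buffer aligned to a power of two at least as large as its size lies wholly
within one such window, and since 2^32 is a multiple of any smaller power
of two, it can never straddle an upper-32-bits boundary.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace equivalent of the patch's upper_32_bits() comparison:
 * does [start, start + len) straddle a 4 GiB boundary? */
static int crosses_4g(uint64_t start, uint64_t len)
{
        return (start >> 32) != ((start + len - 1) >> 32);
}
```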

David



RE: [PATCH] rtl8xxxu: mark expected switch fall-throughs

2017-10-11 Thread David Laight
From: Joe Perches
> Sent: 11 October 2017 11:21
> On Tue, 2017-10-10 at 14:30 -0500, Gustavo A. R. Silva wrote:
> > In preparation to enabling -Wimplicit-fallthrough, mark switch cases
> > where we are expecting to fall through.
> 
> perhaps use Arnaldo's idea:
> 
> https://lkml.org/lkml/2017/2/9/845
> https://lkml.org/lkml/2017/2/10/485

gah, that is even uglier and requires a chase through
headers to find out what it means.

David



RE: [PATCH v2] xdp: Sample xdp program implementing ip forward

2017-10-11 Thread David Laight
From: Jesper Dangaard Brouer
> Sent: 10 October 2017 20:06
...
> > +   int src_ip = 0, dest_ip = 0;
...
> > +   key4.b8[4] = dest_ip % 0x100;
> > +   key4.b8[5] = (dest_ip >> 8) % 0x100;
> > +   key4.b8[6] = (dest_ip >> 16) % 0x100;
> > +   key4.b8[7] = (dest_ip >> 24) % 0x100;

Do you really want signed remainders done here?

David



RE: [net-next 6/9] e1000e: fix buffer overrun while the I219 is processing DMA transactions

2017-10-11 Thread David Laight
From: Jeff Kirsher
> Sent: 10 October 2017 18:22
> Intel 100/200 Series Chipset platforms reduced the round-trip
> latency for the LAN Controller DMA accesses, causing in some high
> performance cases a buffer overrun while the I219 LAN Connected
> Device is processing the DMA transactions. I219LM and I219V devices
> can fall into unrecovered Tx hang under very stressful UDP traffic
> and multiple reconnection of Ethernet cable. This Tx hang of the LAN
> Controller is only recovered if the system is rebooted. Slightly slow
> down DMA access by reducing the number of outstanding requests.
> This workaround could have an impact on TCP traffic performance
> on the platform. Disabling TSO eliminates performance loss for TCP
> traffic without a noticeable impact on CPU performance.
> 
> Please, refer to I218/I219 specification update:
> https://www.intel.com/content/www/us/en/embedded/products/networking/
> ethernet-connection-i218-family-documentation.html
> 
> Signed-off-by: Sasha Neftin 
> Reviewed-by: Dima Ruinskiy 
> Reviewed-by: Raanan Avargil 
> Tested-by: Aaron Brown 
> Signed-off-by: Jeff Kirsher 
> ---
>  drivers/net/ethernet/intel/e1000e/netdev.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
> b/drivers/net/ethernet/intel/e1000e/netdev.c
> index ee9de3500331..14b096f3d1da 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -3021,8 +3021,8 @@ static void e1000_configure_tx(struct e1000_adapter 
> *adapter)
> 
>   hw->mac.ops.config_collision_dist(hw);
> 
> - /* SPT and CNP Si errata workaround to avoid data corruption */
> - if (hw->mac.type >= e1000_pch_spt) {
> + /* SPT and KBL Si errata workaround to avoid data corruption */
> + if (hw->mac.type == e1000_pch_spt) {
>   u32 reg_val;
> 
>   reg_val = er32(IOSFPC);
> @@ -3030,7 +3030,9 @@ static void e1000_configure_tx(struct e1000_adapter 
> *adapter)
>   ew32(IOSFPC, reg_val);
> 
>   reg_val = er32(TARC(0));
> - reg_val |= E1000_TARC0_CB_MULTIQ_3_REQ;
> + /* SPT and KBL Si errata workaround to avoid Tx hang */
> + reg_val &= ~BIT(28);
> + reg_val |= BIT(29);

Shouldn't some more of the commit message about what this is doing
be in the comment?
And shouldn't the 28 and 29 be named constants?

>   ew32(TARC(0), reg_val);

David



RE: [patch net-next 2/4] net: sched: introduce per-egress action device callbacks

2017-10-10 Thread David Laight
From: Jiri Pirko
> Sent: 10 October 2017 15:32
> To: David Laight
> Cc: netdev@vger.kernel.org; da...@davemloft.net; j...@mojatatu.com; 
> xiyou.wangc...@gmail.com;
> sae...@mellanox.com; mat...@mellanox.com; leo...@mellanox.com; 
> ml...@mellanox.com
> Subject: Re: [patch net-next 2/4] net: sched: introduce per-egress action 
> device callbacks
> 
> Tue, Oct 10, 2017 at 03:31:59PM CEST, david.lai...@aculab.com wrote:
> >From: Jiri Pirko
> >> Sent: 10 October 2017 08:30
> >> Introduce infrastructure that allows drivers to register callbacks that
> >> are called whenever tc would offload inserted rule and specified device
> >> acts as tc action egress device.
> >
> >How does a driver safely unregister a callback?
> >(to avoid a race with the callback being called.)
> >
> >Usually this requires a callback in the context that makes the
> >notification callbacks indicating that no more such callbacks
> >will be made.
> 
> rtnl is your answer. It is being held during register/unregister/cb

Do you mean 'acquired during register/unregister' and 'held across the
callback' ?

So the unregister sleeps (or spins?) until any callbacks complete?
So the driver mustn't hold any locks (etc) across the unregister that
it acquires in the callback.
That ought to be noted somewhere.
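A minimal userspace model of the scheme (pthread mutex standing in for
rtnl; all names invented): the same lock is held while invoking callbacks
and across register/unregister, so unregister cannot return while a
callback is mid-flight - and, as noted, a callback must therefore never
take a lock the driver holds while unregistering.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t rtnl_like = PTHREAD_MUTEX_INITIALIZER;
static void (*egress_cb)(void *);

/* Example callback just counts invocations. */
static int cb_hits;
static void count_cb(void *arg) { (void)arg; cb_hits++; }

static void cb_register(void (*cb)(void *))
{
        pthread_mutex_lock(&rtnl_like);
        egress_cb = cb;
        pthread_mutex_unlock(&rtnl_like);
}

static void cb_unregister(void)
{
        pthread_mutex_lock(&rtnl_like); /* blocks until any running cb ends */
        egress_cb = NULL;
        pthread_mutex_unlock(&rtnl_like);
}

static void cb_invoke(void *arg)
{
        pthread_mutex_lock(&rtnl_like); /* cb runs under the lock */
        if (egress_cb)
                egress_cb(arg);
        pthread_mutex_unlock(&rtnl_like);
}
```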

David


