from:"David Laight"

RE: [PATCH v3 06/10] net/macb: clean up ring buffer logic

2012-10-30 Thread David Laight

 Instead of masking head and tail every time we increment them, just let them
 wrap through UINT_MAX and mask them when subscripting. Add simple accessor
 functions to do the subscripting properly to minimize the chances of messing
 this up.
...
 +static unsigned int macb_tx_ring_avail(struct macb *bp)
 +{
 + return TX_RING_SIZE - (bp-tx_head - bp-tx_tail);
 +}

That one doesn't look quite right to me.
Surely it should be masking with 'TX_RING_SIZE - 1'

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH v3 06/10] net/macb: clean up ring buffer logic

2012-10-31 Thread David Laight

 On Tue, Oct 30, 2012 at 4:12 AM, David Laight david.lai...@aculab.com wrote:
  Instead of masking head and tail every time we increment them, just let 
  them
  wrap through UINT_MAX and mask them when subscripting. Add simple accessor
  functions to do the subscripting properly to minimize the chances of 
  messing
  this up.
  ...
  +static unsigned int macb_tx_ring_avail(struct macb *bp)
  +{
  + return TX_RING_SIZE - (bp-tx_head - bp-tx_tail);
  +}
 
  That one doesn't look quite right to me.
  Surely it should be masking with 'TX_RING_SIZE - 1'
 
 Why is that? head and tail can never be more than TX_RING_SIZE apart,
 so it shouldn't make any difference.

It's a ring buffer (I presume) the pointers can be in either order.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH v3 06/10] net/macb: clean up ring buffer logic

2012-10-31 Thread David Laight

   return (TX_RING_SIZE - (bp-tx_head - bp-tx_tail)  (TX_RING_SIZE - 
 1));

Is equivalent to:

return (bp-tx_tail - bp-tx_head)  (TX_RING_SIZE - 1));

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH v8 01/16] hashtable: introduce a small and naive hashtable

2012-10-31 Thread David Laight

   On Tue, Oct 30, 2012 at 02:45:57PM -0400, Sasha Levin wrote:
   +/* Use hash_32 when possible to allow for fast 32bit hashing in 64bit 
   kernels. */
   +#define hash_min(val, bits) 
\
   +({  
\
   + sizeof(val) = 4 ? 
\
   + hash_32(val, bits) :   
\
   + hash_long(val, bits);  
\
   +})
  
   Doesn't the above fit in 80 column.  Why is it broken into multiple
   lines?  Also, you probably want () around at least @val.  In general,
   it's a good idea to add () around any macro argument to avoid nasty
   surprises.
 
  It was broken to multiple lines because it looks nicer that way (IMO).
 
  If we wrap it with () it's going to go over 80, so it's going to stay
  broken down either way :)
 
 ({  \
   sizeof(val) = 4 ? hash_32(val, bits) : hash_long(val, bits); \
 })
 
 Is the better way to go. We are C programmers, we like to see the ?: on
 a single line if possible. The way you have it, looks like three
 statements run consecutively.

To add some more colour (not color):

In any case, this is a normal C #define, it doesn't need the {}.
So it can just be:
# define hash_min(val, bits) \
(sizeof(val) = 4 ? hash_32(val, bits) : hash_long(val, bits))

I don't think that s/val/(val)/g and s/bits/(bits)/g are needed
because the tokens are already ',' separated.

I do actually wonder how many of these hash lists should be replaced
with some kind of tree structure in order to get O(log(n)) searches.
After all hashing is still O(n).
(apologies if I mean o(n) not O(n) - it's a long time since I did
my maths degree!)

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 3/9] net: xfrm: use this_cpu_ptr per-cpu helper

2012-11-01 Thread David Laight

 this_cpu_read
 |-_this_cpu_generic_read
 
 #define _this_cpu_generic_read(pcp) \
 ({  typeof(pcp) ret__;  \
 preempt_disable();  \
 ret__ = *this_cpu_ptr((pcp));  \
 preempt_enable();   \
 ret__;  \
 })
 
 
 this_cpu_read operations locate per-cpu variable with preemption safe， not
 disable interrupts. why is it atomic vs interrupts?

Hmmm...  what effect do those preemt_dis/enable() actually have?
Since a pre-empt can happen either side of them, the value
the caller sees can be for the wrong cpu anyway.

The only time I could see them being necessary is if
*this_cpu_ptr() itself needs mutex protection in order to
function correctly - and that is likely to be port specific.
On i386/amd64 where (I guess) it is an access offset by fs/gs
this isn't necessary and just wastes cpu cycles.

If the caller cares which cpu the value comes from (eg to
increment a counter) then the caller would need to disable
pre-emption across the whole operation.

David

RE: [PATCH v4] lxt PHY: Support for the buggy LXT973 rev A2

2012-09-24 Thread David Laight

 This patch adds proper handling of the buggy revision A2 of LXT973 phy, adding
 precautions linked to ERRATA Item 4:
 
 Revision A2 of LXT973 chip randomly returns the contents of the previous even
 register when you read a odd register regularly

Does reading the PHY registers involve bit-banging an MII interface?
If so this code is likely to stall the system for significant
periods (ready phy registers at all can be a problem).

I know some ethernet mac have hardware blocks for reading MII
and even polling one MII register for changes.

Maybe some of this code ought to be using async software
bit-bang - especially when just polling for link status change.
I'm sure it ought to be possible to do one bit-bang action
per clock tick instead of spinning for the required delays. 

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH v6] hashtable: introduce a small and naive hashtable

2012-09-26 Thread David Laight

Amazing how something simple gets lots of comments and versions :-)

 ...
 + * This has to be a macro since HASH_BITS() will not work on pointers since
 + * it calculates the size during preprocessing.
 + */
 +#define hash_empty(hashtable)
 \
 +({   
 \
 + int __i;
 \
 + bool __ret = true;  
 \
 + 
 \
 + for (__i = 0; __i  HASH_SIZE(hashtable); __i++)
 \
 + if (!hlist_empty(hashtable[__i]))  
 \
 + __ret = false;  
 \
 + 
 \
 + __ret;  
 \
 +})

Actually you could have a #define that calls a function
passing in the address and size.
Also, should the loop have a 'break' in it?

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH v6] hashtable: introduce a small and naive hashtable

2012-09-27 Thread David Laight

   And even then, if we would do:
  
 for (i = 0; i  HASH_SIZE(hashtable); i++)
 if (!hlist_empty(hashtable[i]))
 break;
  
 return i = HASH_SIZE(hashtable);
  
   What happens if the last entry of the table is non-empty ?
 
  It still works, as 'i' is not incremented due to the break. And i will
  still be less than HASH_SIZE(hashtable). Did you have *your* cup of
  coffee today? ;-)
 
 Ahh, right! Actually I had it already ;-)

I tend to dislike the repeated test, gcc might be able to optimise
it away, but the code is cleaner written as:

for (i = 0; i  HASH_SIZE(hashtable); i++)
if (!hlist_empty(hashtable[i]))
return false;
return true;

 Agreed that the flags should be removed. Moving to define + static
 inline is still important though.

Not sure I'd bother making the function inline.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH v6] hashtable: introduce a small and naive hashtable

2012-09-27 Thread David Laight

 Moreover, if your thinking is that we do not need a static inline
 function replicated at every caller, maybe we should introduce a
 lib/hashtable.c that implements those 2 functions.

That was my thought...
Given their nature, I'd guess they aren't critical path.
Probably not worth adding an entire source file though.

With regard to 'inline', when I say it, I really mean it!
Unfortunately some people seem to just type it anyway (rather
like 'register' used to get used) - so the compilers start
ignoring the request.
At least some versions of gcc will usually inline static
functions that are only called once - but even then it can
change its mind for almost no reason.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH v2 00/11] introduce random32_get_bytes() and random32_get_bytes_state()

2012-11-06 Thread David Laight

 On Sun, 04 Nov 2012 00:43:31 +0900, Akinobu Mita said:
  This patchset introduces new functions into random32 library for
  getting the requested number of pseudo-random bytes.
 
  Before introducing these new functions into random32 library,
  prandom32() and prandom32_seed() with prandom32 prefix are
  renamed to random32_state() and srandom32_state() respectively.
 
  The purpose of this renaming is to prevent some kernel developers
  from assuming that prandom32() and random32() might imply that only
  prandom32() was the one using a pseudo-random number generator by
  prandom32's p, and the result may be a very embarassing security
  exposure.
 
 Out of curiosity, why the '32'?  I'm just waiting for some kernel developer to
 do something stupid with this on a 64-bit arch because they think it's a 
 32-bit API. ;)
 
 Should we bite the bullet and lose the 32, as long as we're churning the code 
 *anyhow*?

Also why remove the 'pseudo' part of the name?
It is an important part of the name.

David




--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [RFC PATCH v2 2/6] PM / Runtime: introduce pm_runtime_set_memalloc_noio()

2012-10-24 Thread David Laight

 Looks the problem is worse than above, not only bitfields are affected, the
 adjacent fields might be involved too, see:
 
http://lwn.net/Articles/478657/

Not mentioned in there is that even with x86/amd64 given
a struct with the following adjacent fields:
char a;
char b;
char c;
then foo-b |= 0x80; might do a 32bit RMW cycle.
This will (well might - but probably does) happen
if compiled to a 'BTS' instruction.
The x86 instruction set docs are actually unclear
as to whether the 32bit cycle might even be misaligned!
amd64 might do a 64bit cycle (not checked the docs).

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH net-next?] pktgen: Use simpler test for non-zero ipv6 address

2012-10-11 Thread David Laight

 - if (pkt_dev-min_in6_daddr.s6_addr32[0] == 0 
 - pkt_dev-min_in6_daddr.s6_addr32[1] == 0 
 - pkt_dev-min_in6_daddr.s6_addr32[2] == 0 
 - pkt_dev-min_in6_daddr.s6_addr32[3] == 0) ;
 - else {
 + if (pkt_dev-min_in6_daddr.s6_addr32[0] |
 + pkt_dev-min_in6_daddr.s6_addr32[1] |
 + pkt_dev-min_in6_daddr.s6_addr32[2] |
 + pkt_dev-min_in6_daddr.s6_addr32[3]) {

Given the likely values, it might be worth using:
if (pkt_dev-min_in6_daddr.s6_addr32[0] == 0 
pkt_dev-min_in6_daddr.s6_addr32[1] |
pkt_dev-min_in6_daddr.s6_addr32[2] |
pkt_dev-min_in6_daddr.s6_addr32[3]) {

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [RFC net-next] treewide: s/ipv4_is_foo()/ipv4_addr_foo/

2012-10-11 Thread David Laight

 ipv4 and ipv6 use different styles for these tests.
 
 ipv4_is_foo(__be32)
 ipv6_addr_foo(struct in6_addr *)

I presume there is a 'const' in there ...

 Perhaps it'd be good to convert the ipv4 tests to the ipv6 style.

You don't want to force an IPv4 address (which might be in a register)
be written out to stack.
Taking the address also has implications for the optimiser.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH] SUNRPC: Prevent kernel stack corruption on long values of flush

2012-10-18 Thread David Laight

 ...
 long is always the same or bigger then a pointer
 (A pointer must always fit in a long)
 ...

Linux may make that assumption, but it doesn't have
to be true. 64bit windows still has 32bit long.
C99 inttypes.h defines [u]intptr_t to be an integral type
that is large enough to hold a pointer to any data item.
(That in itself is problematic for implementations that
encode multiple characters into a machine word and need
to use 'fat' pointers in order to encode the offset.)

David

RE: [PATCH BUGFIX 2/6] pkt_sched: fix the update of eligible-group sets

2013-03-06 Thread David Laight

 Between two invocations of make_eligible, the system virtual time may
 happen to grow enough that, in its binary representation, a bit with
 higher order than 31 flips. This happens especially with
 TSO/GSO. Before this fix, the mask used in make_eligible was computed
 as (1ULindex_of_last_flipped_bit)-1, whose value is well defined on
 a 64-bit architecture, because index_of_flipped_bit = 63, but is in
 general undefined on a 32-bit architecture if index_of_flipped_bit  31.
 The fix just replaces 1UL with 1ULL.
...
 - unsigned long mask = (1UL  fls(vslot ^ old_vslot)) - 1;
 + unsigned long mask = (1ULL  fls(vslot ^ old_vslot)) - 1;

I'm not sure you really want to be doing 64bit shifts on 32 bit
systems just to generate ~0ul.
probably worth a conditional.

David


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH v2] net/macb: Use non-coherent memory for rx buffers

2012-12-03 Thread David Laight

 Allocate regular pages to use as backing for the RX ring and use the
 DMA API to sync the caches. This should give a bit better performance
 since it allows the CPU to do burst transfers from memory. It is also
 a necessary step on the way to reduce the amount of copying done by
 the driver.

I've not tried to understand the patches, but you have to be
very careful using non-snooped memory for descriptor rings.
No amount of DMA API calls can sort out some of the issues.

Basically you must not dirty a cache line that contains data
that the MAC unit might still write to.

For the receive ring this means that you must not setup
new rx buffers for ring entries until the MAC unit has
filled all the ring entries in the same cache line.
This probably means only adding rx buffers in blocks
of 8 or 16 (or even more if there are large cache lines).

I can't see any code in the patch that does this.

Doing the same for the tx ring is more difficult, especially
if you can't stop the MAC unit polling the TX ring on a
timer basis.
Basically you can only give the MAX tx packets if either
it is idle, or if the tx ring containing the new entries
starts on a cache line.
If the MAC unit is polling the ring, then to give it
multiple items you may need to update the 'owner' bit
in the first ring entry last - just in case the cache
line gets written out before you've finished.

David




--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH v2] net/macb: Use non-coherent memory for rx buffers

2012-12-03 Thread David Laight

 On 12/03/2012 01:43 PM, David Laight :
  Allocate regular pages to use as backing for the RX ring and use the
  DMA API to sync the caches. This should give a bit better performance
  since it allows the CPU to do burst transfers from memory. It is also
  a necessary step on the way to reduce the amount of copying done by
  the driver.
 
  I've not tried to understand the patches, but you have to be
  very careful using non-snooped memory for descriptor rings.
  No amount of DMA API calls can sort out some of the issues.
 
 David,
 
 Maybe I have not described the patch properly but the non-coherent
 memory is not used for descriptor rings. It is used for DMA buffers
 pointed out by descriptors (that are allocated as coherent memory).
 
 As buffers are filled up by the interface DMA and then, afterwards, used
 by the driver to pass data to the net layer, it seems to me that the use
 of non-coherent memory is sensible.

Ah, ok - difficult to actually determine from a fast read of the code.
So you invalidate (I think that is the right term) all the cache lines
that are part of each rx buffer before giving it back to the MAC unit.
(Maybe that first time, and just those cache lines that might have been
written to after reception - I'd worry about whether the CRC is written
into the rx buffer!)

I was wondering if the code needs to do per page allocations?
Perhaps that is necessary to avoid needing a large block of
contiguous physical memory (and virtual addresses)?

I know from some experiments done many years ago that a data
copy in the MAC tx and rx path isn't necessarily as bad as
people may think - especially if it removes complicated
'buffer loaning' schemes and/or iommu setup (or bounce
buffers due to limited hardware memory addressing).

The rx copy can usually be made to be a 'whole word' copy
(ie you copy the two bytes of garbage that (mis)align the
destination MAC address, and some bytes after the CRC.
With some hardware I believe it is possible for the cache
controller to do cache-line aligned copies very quickly!
(Some very new x86 cpus might be doing this for 'rep movsd'.)

The copy in the rx path is also better for short packets
the can end up queued for userspace (although a copy in
the socket code would solve that one.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH v2] net/macb: Use non-coherent memory for rx buffers

2012-12-05 Thread David Laight

 If I understand well, you mean that the call to:
 
   dma_sync_single_range_for_device(bp-pdev-dev, phys,
   pg_offset, frag_len, DMA_FROM_DEVICE);
 
 in the rx path after having copied the data to skb is not needed?
 That is also the conclusion that I found after having thinking about
 this again... I will check this.

You need to make sure that the memory isn't in the data cache
when you give the rx buffer back to the MAC.
(and ensure the cpu doesn't read it until the rx is complete.)
I've NFI what that dma_sync call does - you need to invalidate
the cache lines.
 
 For the CRC, my driver is not using the CRC offloading feature for the
 moment. So no CRC is written by the device.

I was thinking it would matter if the MAC wrote the CRC into the
buffer (even though it was excluded from the length).
It doesn't - you only need to worry about data you've read.

  I was wondering if the code needs to do per page allocations?
  Perhaps that is necessary to avoid needing a large block of
  contiguous physical memory (and virtual addresses)?
 
 The page management seems interesting for future management of RX
 buffers as skb fragments: that will allow to avoid copying received data.

Dunno - the complexities of such buffer loaning schemes often
exceed the gain of avoiding the data copy.
Using buffers allocated to the skb is a bit different - since
you completely forget about the memory once you pass the skb
upstream.

Some quick sums indicate you might want to allocate 8k memory
blocks and split into 5 buffers.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH v2] net/macb: Use non-coherent memory for rx buffers

2012-12-05 Thread David Laight

 Well, for the 10/100 MACB interface, I am stuck with 128 Bytes buffers!
 So this use of pages seems sensible.

If you have dma coherent memory you can make the rx buffer space
be an array of short buffers referenced by adjacent ring entries
(possibly with the last one slightly short to allow for the
2 byte offset).
Then, when a big frame ends up in multiple buffers, you need
a maximum of two copies to extract the data.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH, resubmit] ax88179_178a: ASIX AX88179_178A USB 3.0/2.0 to gigabit ethernet adapter driver

2013-02-08 Thread David Laight

  +struct ax88179_rx_pkt_header {
  +   u8  l4_csum_err:1,
  +   l3_csum_err:1,
  +   l4_type:3,
  +   l3_type:2,
  +   ce:1;
  +
  +   u8  vlan_ind:3,
  +   rx_ok:1,
  +   pri:3,
  +   bmc:1;
  +
  +   u16 len:13,
  +   crc:1,
  +   mii:1,
  +   drop:1;
  +} __packed;
 
 This won't work on both big-endian systems (assuming this works on x86).
 You apparently try to convert the structure in-place in
 ax88179_rx_fixup() by calling le32_to_cpus(); that may work if you
 define all the bitfields to be part of a u32 but it won't work with the
 current definition.

Trying to use bit fields to map hardware registers (etc) isn't a
good idea at all. The C standard says absolutely nothing about
the order in which the bits are allocated to the field names.
(There are at least 4 possible orederings.)

It is much better to define constants for the bit values and
explicitly mask them as required.

David

Re: [PATCH v3 0/7] Add O_DENY* support for VFS and CIFS/NFS

2013-03-01 Thread David Laight

On Thu, Feb 28, 2013 at 01:53:25PM -0800, Andy Lutomirski wrote:
  O_DENYMAND - to switch on/off three flags above.
 
 O_DENYMAND doesn't deny anything.  Would a name like O_RESPECT_DENY be
 better?

Possibly rename to O_CHECK_DENY ?

David

-- 
David Laight: da...@l8s.co.uk
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH] vhost-net: fall back to vmalloc if high-order allocation fails

2013-01-24 Thread David Laight

 + n = kmalloc(sizeof *n, GFP_KERNEL | __GFP_NOWARN);
 + if (!n)
 + n = vmalloc(sizeof *n);

I'm slightly confused by this construct.
I thought kmalloc(... GFP_KERNEL) would sleep waiting for
memory (rather than return NULL).

I realise that (for multi-page sizes) that kmalloc() and
vmalloc() both need to find a contiguous block of kernel
virtual addresses - in different address ranges, and
that vmalloc() then has to find physical memory pages
(which will not be contiguous).

I think this means that kmalloc() is likely to be faster
(if it doesn't have to sleep), but that vmalloc() is
allocating from a much larger resource.

This make me that that large allocations that are not
temporary should probably be allocated with vmalloc().

Is there a 'NO_SLEEP' flag to kmalloc()? is that all
GFP_ATOMIC requests? If so you might try a non-sleeping
kmalloc() with a vmalloc() if it fails.

This all looks as though there should be a GFP_NONCONTIG
flag (or similar) so that kmalloc() can make a decision itself.

Of at least a wrapper - like the one for free().

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH] vhost-net: fall back to vmalloc if high-order allocation fails

2013-01-24 Thread David Laight

  I think this means that kmalloc() is likely to be faster
  (if it doesn't have to sleep), but that vmalloc() is
  allocating from a much larger resource.
 
  This make me that that large allocations that are not
  temporary should probably be allocated with vmalloc().
 
 vmalloc has some issues for example afaik it's not backed by
 a huge page so  I think its use puts more stress on the TLB cache.

Thinks further ...
64bit systems are likely to have enough kernel VA to be able
to map all of physical memory into contiguous VA.

I don't know if Linux does that, but I know code to map it was added
to NetBSD amd64 in order to speed up kernel accesses to 'random'
pages - it might have been partially backed out due to bugs!

If physical memory is mapped like that then kmalloc() requests
can any of physical memory and be unlikely to fail - since user
pages can be moved in order to generate contiguous free blocks.

Doesn't help with 32bit systems - they had to stop mapping all
of physical memory into kernel space a long time ago.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [patch v2] b43: N-PHY: fix gain in b43_nphy_get_gain_ctl_workaround_ent()

2013-01-18 Thread David Laight

 + int gain_data[] = {0x0062, 0x0064, 0x006a, 0x106a, 0x106c,
 +0x1074, 0x107c, 0x207c};

static const int ...

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH v6 08/46] CPU hotplug: Provide APIs to prevent CPU offline from atomic context

2013-02-19 Thread David Laight

 I wouldn't go that far... ;-) Unfairness is not a show-stopper right?
 IMHO, the warning/documentation should suffice for anybody wanting to
 try out this locking scheme for other use-cases.

I presume that by 'fairness' you mean 'write preference'?

I'd not sure how difficult it would be, but maybe have two functions
for acquiring the lock for read, one blocks if there is a writer
waiting, the other doesn't.

That way you can change the individual call sites separately.

The other place I can imagine a per-cpu rwlock being used
is to allow a driver to disable 'sleep' or software controlled
hardware removal while it performs a sequence of operations.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: Disable IPv4-mapped - enforce IPV6_V6ONLY

2013-02-25 Thread David Laight

  A proper solution would be to either return false if net.ipv6.bindv6only is 
  true and optval is false
 (which would break downward compatibility because it wouldn't just be a 
 default and setsockopt might
 return an error) or to introduce a new sysctl variable like 
 net.ipv6.bindv6only_enforced_silently.
 (silently because setsockopt() wouldn't return an error if 
 net.ipv6.bindv6only is true and optval
 (v6only in the example above) is false.)
 
  I would volunteer to write a patch which introduces something like
 net.ipv6.bindv6only_enforced_silently if some maintainer would give me his ok.
 
  If so, the question remains if
 
  systemctl net.ipv6.bindv6only_enforced_silently = 1
 
  should set systemctl.net.ipv6.bindv6only too or if an error should be 
  returned if
 net.ipv6.bindv6only is false.
 
 I am not convinced why you need this, and I am not in favor of
 enfocing IPV6_V6ONLY, but... some points:
 
 - We should allow system-admin to enforce IPV6_V6ONLY to 0 as well.
 - CAP_NET_ADMIN users should always be able to use both modes
   (They can do sysctl anyway.)
 - setsockopt should fail w/ EPERM if user tries to override.

I can imagine that some programs will always try to clear IPV6_V6ONLY
(maybe for portability with other OS which default to setting it
for security reasons) and will error-exit if it fails.
So non-silent enforcing could be a PITA.

OTOH there might be programs/systems where silent failure is wrong.

You really don't want to (globally) stop an application setting
IPV6_V6ONLY, such a program may well be creating separate IPv4
and IPv6 sockets.

Some of this needs to be part of some application wide 'security'
framework - that probably doesn't exist!

Should there also be similar controls for the use of IPv4
mapped addresses in actual on-the-wire IPv6 packets - eg those
destined for a remote gateway on an IPv6 only system?

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH v3 01/17] hashtable: introduce a small and naive hashtable

2012-09-06 Thread David Laight

 My solution to making 'break' work in the iterator is:
 
   for (bkt = 0, node = NULL; bkt  HASH_SIZE(name)  node ==
NULL; bkt++)
   hlist_for_each_entry(obj, node, name[bkt], member)

I'd take a look at the generated code.
Might come out a bit better if the condition is changed to:
node == NULL  bkt  HASH_SIZE(name)
you might find the compiler always optimises out the
node == NULL comparison.
(It might anyway, but switching the order gives it a better
chance.)

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 2/2] netprio_cgroup: Optimize the priomap copy loop slightly

2012-09-11 Thread David Laight

 - for (i = 0;
 -  old_priomap  (i  old_priomap-priomap_len);
 -  i++)
 - new_priomap-priomap[i] = old_priomap-priomap[i];
 + if (old_priomap) {
 + old_len = old_priomap-priomap_len;
 +
 + for (i = 0; i  old_len; i++)
 + new_priomap-priomap[i] = old_priomap-priomap[i];
 + }

Or:
memcpy(new_priomap-priomap, old_priomap-priomap,
old_priomap-priomap_len * sizeof old_priomap-priomap[0]);

David

N�r��yb�X��ǧv�^�)޺{.n�+{zX����ܨ}���Ơz�j:+v���zZ+��+zf���h���~i���z��w���?��)ߢf��^jǫy�m��@A�a���
0��h���i

RE: [PATCH 9/9] drivers/isdn/gigaset/common.c: Remove useless kfree

2012-09-13 Thread David Laight

 Remove useless kfree() and clean up code related to the removal.
...
 diff --git a/drivers/isdn/gigaset/common.c b/drivers/isdn/gigaset/common.c
 index aa41485..30a6b17 100644
 --- a/drivers/isdn/gigaset/common.c
 +++ b/drivers/isdn/gigaset/common.c
 @@ -1123,7 +1123,6 @@ struct gigaset_driver *gigaset_initdriver(unsigned 
 minor, unsigned minors,
   return drv;
 
  error:
 - kfree(drv-cs);
   kfree(drv);
   return NULL;
  }
 

Seems to me that (assuming kfree(NULL) is ok) the kfree()
is best left in - just in case some other error path is
added after drv-cs is assigned.
Better safe than a memory leak.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH] sata_fsl: add workaround for data length mismatch on freescale V2 controller

2012-09-04 Thread David Laight

 + /* Read command completed register */
 + done_mask = ioread32(hcr_base + CC);
 +
 + if (host_priv-quirks  SATA_FSL_QUIRK_V2_ERRATA) {
 + if (unlikely(hstatus  INT_ON_DATA_LENGTH_MISMATCH)) {
 + for (tag = 0; tag  ATA_MAX_QUEUE; tag++) {
 + qc = ata_qc_from_tag(ap, tag);
 + if (qc  ata_is_atapi(qc-tf.protocol))
{
 + atapi_flag = 1;
 + break;
 + }
 + }
 + }
 + }
 +
 + /* Workaround for data length mismatch errata */
 + if (atapi_flag) {

Seems to me like the conditionals for this code are all
in the wrong order - adding code to the normal path.

The whole lot should probably be inside:
if (unlikely(hstatus  INT_ON_DATA_LENGTH_MISMATCH)) {
and the 'atapi_flag' boolean removed.

I also wonder it this is worthy of an actual quirk?
Might be worth doing anyway.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 02/16] user_ns: use new hashtable implementation

2012-08-15 Thread David Laight

 Yes hash_32 seems reasonable for the uid hash.   With those long hash
 chains I wouldn't like to be on a machine with 10,000 processes with
 each with a different uid, and a processes calling setuid in the fast
 path.
 
 The uid hash that we are playing with is one that I sort of wish that
 the hash table could grow in size, so that we could scale up better.

Since uids are likely to be allocated in dense blocks, maybe an
unhashed multi-level lookup scheme might be appropriate.

Index an array with the low 8 (say) bits of the uid.
Each item can be either:  
  1) NULL = free entry.
  2) a pointer to a uid structure (check uid value).
  3) a pointer to an array to index with the next 8 bits.
(2) and (3) can be differentiated by the low address bit.
I think that is updateable with cmpxchg.

Clearly this is a bad algorithm if uids are all multiples of 2^24
but that is true or any hash function.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 1/1] tcp: Wrong timeout for SYN segments

2012-08-23 Thread David Laight

 I would suggest to increase TCP_SYN_RETRIES from 5 to 6.
 
 180 secs is eternity, but 31 secs is too small.

Wasn't the intention of the long delay to allow a system
acting as a router to reboot?
I suspect that is why it (and some other TCP timers)
are in minutes.

David

RE: [PATCH v2 3/3] pppoatm: protect against freeing of vcc

2012-11-28 Thread David Laight

 On Tue, 27 Nov 2012 18:02:29 +
 David Woodhouse dw...@infradead.org wrote:
 
  In solos-pci at least, the ops-close() function doesn't flush all
  pending skbs for this vcc before returning. So can be a tasklet
  somewhere which has loaded the address of the vcc-pop function from one
  of them, and is going to call it in some unspecified amount of time.
 
  Should we make the device's -close function wait for all TX and RX skbs
  for this vcc to complete?
 
 the driver's close routine should wait for any of the pending tx and rx
 to complete.  take a look at the he.c in driver/atm

I'm not sure that sleeping for long periods in close() is always a
good idea. If the process is event driven it will be unable to
handle events on other fd until the close completes.
This may be known not to be true in this case, but is more generally
a problem.
In this case the close should probably (IMHO at least) only sleep
while pending tx and rx are aborted/discarded.

Even when it might make sense to sleep in close until tx drains
there needs to be a finite timeout before it become abortive.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/3] Add O_DENY* flags to fcntl and cifs

2012-12-12 Thread David Laight

On Sat, Dec 08, 2012 at 12:43:14AM +0400, Pavel Shilovsky wrote:
 
 The problem is the possibility of denial-of-service attacks here. We 
 can try to prevent them by:

FWIW I already see a DoS 'attack'.
I have some filestore shared using NFS (to Linux and Solaris) and 
using samba (to Windows).

I use it for release builds of a product to ensure the versions
built for the different operating systems match, and because some
files have to be built on an 'alien' system (eg gcc targetted at
embedded card).

I can't run the windows build at the same time as the others
because the windows C compiler manages to obtain exclusive access
to the source files - stopping the other systems from reading them.

David

-- 
David Laight: da...@l8s.co.uk
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 0/2] fs: supply inode uid/gid setting interface

2013-08-27 Thread David Laight

 Subject: Re: [PATCH 0/2] fs: supply inode uid/gid setting interface
 
 On 2013/8/23 12:10, Greg KH wrote:
  On Fri, Aug 23, 2013 at 10:48:36AM +0800, Rui Xiang wrote:
  This patchset implements an accessor functions to set uid/gid
  in inode struct. Just finish code clean up.
 
  Why?
 
 It can introduce a new function to reduce some codes.
  Just clean up.

In what sense is it a 'cleanup' ?

To my mind it just means that anyone reading the code has
to go and look at another file in order to see what the
function does (it might be expected to be more complex).

It also adds run time cost, while probably not directly
measurable I suspect it more than doubles the execution
time of that code fragment - do that everywhere and the
system will run like a sick pig.

The only real use for accessor functions is when you
don't want the structure offset to be public.
In this case that is obviously not the case since
all the drivers are directly accessing other members
of the structure.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [net-next rfc 1/3] net: avoid high order memory allocation for queues by using flex array

2013-06-19 Thread David Laight

 +static void netif_free_tx_queues(struct net_device *dev)
 +{
 + if (is_vmalloc_addr(dev-_tx))
 + vfree(dev-_tx);
 + else
 + kfree(dev-_tx);
 +}
 +
  static int netif_alloc_netdev_queues(struct net_device *dev)
  {
   unsigned int count = dev-num_tx_queues;
 @@ -5261,9 +5270,11 @@ static int netif_alloc_netdev_queues(struct net_device 
 *dev)
   BUG_ON(count  1);
 
   tx = kcalloc(count, sizeof(struct netdev_queue), GFP_KERNEL);
 - if (!tx)
 - return -ENOMEM;
 -
 + if (!tx) {
 + tx = vzalloc(count * sizeof(struct netdev_queue));
 + if (!tx)
 + return -ENOMEM;
 + }
   dev-_tx = tx;

Given the number of places I've seen this code added, why
not put it in a general helper?

I also thought that malloc() with GFP_KERNEL would sleep.
Under what conditions does it fail instead?

David

RE: [PATCH 2/2] perf tools: Make Power7 events available for perf

2013-06-20 Thread David Laight

 I think we should be able to do something better using the C
 preprocessor, this is exactly the sort of thing it's good at.
 
 What I mean is something like we do with arch/powerpc/include/asm/systbl.h,
 where we define the list of syscalls once, and then include it in
 multiple places, using different macro definitions to get different
 outputs.

There is a 'neat' trick - you can pass a #define macro the
name of another #define - which is then expanded after the
initial expansion. A description I happen to have is pasted below.

David

Consider what happens when #define foo(x) x(args) is expanded: foo(bar)
clearly becomes bar(args). If we also have #define bar(args) then bar()
is expanded AFTER foo() allowing us to generate any text including args.
So we have passed the name of one #define as a parameter to a different #define.

If we replace the definition of foo with
#define foo(x) x(args1) x(args2) then foo(bar) is equivalent to
bar(args1) bar(args2).
This is useful because foo(baz) expands to baz(args1) baz(args2)
allowing us to feed the same set of arguments to more than one #define.

A simple example:
#define lights(x) x(red) x(orange) x(green)
#define xx(colour) LIGHT_##colour,
enum { lights(xx) NUM_LIGHTS };
#undef xx
#define xx(colour) #colour,
static const char light_names[] = { lights(xx) };
#undef xx

This expands to:
enum { LIGHT_red, LIGHT_orange, LIGHT_green, NUM_LIGHTS };
static const char light_names[] = { ”red”, ”orange”, ”green”, };
(We needed to add NUM_LIGHTS because a trailing comma isn’t valid in a C++ 
enum.)


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH v7] ethernet/arc/arc_emac - Add new driver

2013-06-21 Thread David Laight

 So it should be declared dma_addr_t then,
 
  +   addr = dma_map_single(ndev-dev, (void 
  *)rx_buff-skb-data,
  + buflen, DMA_FROM_DEVICE);
  +   if (dma_mapping_error(ndev-dev, addr)) {
  +   if (net_ratelimit())
  +   netdev_err(ndev, cannot dma map\n);
  +   dev_kfree_skb(rx_buff-skb);
  +   stats-rx_errors++;
  +   continue;
  +   }
  +   dma_unmap_addr_set(rx_buff, mapping, addr);
  +   dma_unmap_len_set(rx_buff, len, buflen);
  +
  +   rxbd-data = (unsigned char 
  *)cpu_to_le32(rx_buff-skb-data);
 
  the 'addr' returned by dma_map_single is what the device really
  needs, although it is the same as rx_buff-skb-data with the
  trivial definition of dma_map_single.
 
  The last line here needs to be
  rxbd-data = cpu_to_le32(addr);
 
  which fixes the bug, and has no dependency on a 32 bit CPU.
 
 It still has a dependency on dma_addr_t size being 32 bit

No - just a dependency on memory being mapped to a 32bit
physical address accessible from the device.
Many 64 bit systems have devices that can only access 32bit
address space, either the memory has to be allocated from the
correct arena, of DMA bounce buffers are used.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [RFC PATCH] vsnprintf: Remove use of %n and convert existing uses

2013-09-12 Thread David Laight

 On Wed, Sep 11, 2013 at 05:04:17PM -0700, Joe Perches wrote:
  On Thu, 2013-09-12 at 08:40 +0900, Tetsuo Handa wrote:
   Joe Perches wrote:
-   seq_printf(m, %s%d%n, con-name, con-index, len);
+   len = seq_printf(m, %s%d, con-name, con-index);
  
   Isn't len always 0 or -1 ?
 
  Right.  Well you're no fun...
 
  These uses would seem broken anyway because the
  seq_printf isn't itself tested for correctness.
 
  Hmm.
 
  Also, there's a large amount of code that appears
  to do calculations with pos or len like:
 
  pos += seq_printf(handle, fmt. ...)
 
 ... and most of that code proceeds to ignore pos completely.
 Note that -show() is *NOT* supposed to return the number of
 characters it has/would like to have produced.  Just return
 0 and be done with that; overflows are dealt with just fine.
 The large amount, BTW, is below 100 lines, AFAICS, in rather
 few files.
 
  There are very few that seem to use it correctly
  like netfilter.
 
  Suggestions?

Change the return type of seq_printf() to void and require that
code use access functions/macros to find the length and error
status. Possibly save the length of the last call separately.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH v3 1/3] phy: Add new Exynos USB PHY driver

2013-11-06 Thread David Laight

  I just did a diff of registers in exynos 4210 and 4212 PHY drivers [1]
  and couldn't find that big a difference in register layout. Of course
  there are a few changes in HSIC bit fields and PHYFSEL but that's only
  minimal and could well be handled in a single driver.
 
  [1] - http://diffchecker.com/py3tp68m
 
 This is quite a lot of differences, especially including shifted
 bitfields... In addition there is another set of available PHYs
 and inter-dependencies between them.

Might be worth feeding both files through sed -e 's/_421[02]_/_421x_/'
prior to doing the diff.
And maybe changing the order of some definitions so they match.
Then the actual differences will be more obvious.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-04 Thread David Laight

  It seems to me that a more useful interface would take a minimum and
  maximum number of vectors from the driver.  This wouldn't allow the
  driver to specify that it could only accept, say, any even number within
  a certain range, but you could still leave the current functions
  available for any driver that needs that.
 
 Mmmm.. I am not sure I am getting it. Could you please rephrase?

One possibility is for drivers than can use a lot of interrupts to
request a minimum number initially and then request the additional
ones much later on.
That would make it less likely that none will be available for
devices/drivers that need them but are initialised later.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 02/10][v6] powerpc/Power7: detect load/store instructions

2013-10-16 Thread David Laight

 Implement instr_is_load_store_2_06() to detect whether a given instruction
 is one of the fixed-point or floating-point load/store instructions in the
 POWER Instruction Set Architecture v2.06.
...
 +int instr_is_load_store_2_06(const unsigned int *instr)
 +{
 + unsigned int op, upper, lower;
 +
 + op = instr_opcode(*instr);
 +
 + if ((op = 32  op = 58) || (op == 61 || op == 62))
 + return true;
 +
 + if (op != 31)
 + return false;
 +
 + upper = op  5;
 + lower = op  0x1f;
 +
 + /* Short circuit as many misses as we can */
 + if (lower  3 || lower  23)
 + return false;
 +
 + if (lower == 3) {
 + if (upper = 16)
 + return true;
 +
 + return false;
 + }
 +
 + if (lower == 7 || lower == 12)
 + return true;
 +
 + if (lower = 20) /*  lower = 23 (implicit) */
 + return true;
 +
 + return false;
 +}

I can't help feeling the code could do with some comments about
which actual instructions are selected where.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH] X.25: Fix address field length calculation

2013-10-16 Thread David Laight

 On Tue, 2013-10-15 at 14:29 +, Kelleter, Günther wrote:
  Addresses are BCD encoded, not ASCII. x25_addr_ntoa got it right.
 []
  Wrong length calculation leads to rejection of CALL ACCEPT packets.
 []
  diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c
 []
  @@ -98,7 +98,7 @@ int x25_parse_address_block(struct sk_buff *skb,
  }
  len = *skb-data;
  -   needed = 1 + (len  4) + (len  0x0f);
  +   needed = 1 + ((len  4) + (len  0x0f) + 1) / 2;
 
 This calculation looks odd.

Looks correct to me...
In X.25 the lengths (in digits) of the called and calling addresses
are encoded in the high and low nibbles of one byte and then
followed by both addresses with a digit in each nibble.
If the length of the first address is odd, the second one
isn't byte aligned.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH RFC v2 29/29] vmxnet3: Make use of pcim_enable_msix_range() interface

2013-10-21 Thread David Laight

 Subject: [PATCH RFC v2 29/29] vmxnet3: Make use of pcim_enable_msix_range() 
 interface
...
 diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c 
 b/drivers/net/vmxnet3/vmxnet3_drv.c
 index d33802c..e552d2b 100644
 --- a/drivers/net/vmxnet3/vmxnet3_drv.c
 +++ b/drivers/net/vmxnet3/vmxnet3_drv.c
 @@ -2735,39 +2735,19 @@ vmxnet3_read_mac_addr(struct vmxnet3_adapter 
 *adapter, u8 *mac)
   */
 
  static int
 -vmxnet3_acquire_msix_vectors(struct vmxnet3_adapter *adapter,
 -  int vectors)
 +vmxnet3_acquire_msix_vectors(struct vmxnet3_adapter *adapter, int vectors)
  {
 - int err = -EINVAL, vector_threshold;
 - vector_threshold = VMXNET3_LINUX_MIN_MSIX_VECT;
 -
 - while (vectors = vector_threshold) {
 - err = pci_enable_msix(adapter-pdev, adapter-intr.msix_entries,
 -   vectors);
 - if (!err) {
 - adapter-intr.num_intrs = vectors;
 - return 0;
 - } else if (err  0) {
 - dev_err(adapter-netdev-dev,
 -Failed to enable MSI-X, error: %d\n, err);
 - return err;
 - } else if (err  vector_threshold) {
 - dev_info(adapter-pdev-dev,
 -  Number of MSI-Xs which can be allocated 
 -  is lower than min threshold required.\n);
 - return -ENOSPC;
 - } else {
 - /* If fails to enable required number of MSI-x vectors
 -  * try enabling minimum number of vectors required.
 -  */
 - dev_err(adapter-netdev-dev,
 - Failed to enable %d MSI-X, trying %d 
 instead\n,
 - vectors, vector_threshold);
 - vectors = vector_threshold;
 - }
 + vectors = pcim_enable_msix_range(adapter-pdev,
 +  adapter-intr.msix_entries, vectors,
 +  VMXNET3_LINUX_MIN_MSIX_VECT);
 + if (vectors  0) {
 + dev_err(adapter-netdev-dev,
 + Failed to enable MSI-X, error: %d\n, vectors);
 + return vectors;
   }
 
 - return err;
 + adapter-intr.num_intrs = vectors;
 + return 0;
  }

AFAICT the old code either used the requested number or the minimum number.
The new code seems to claim an intermediate number of interrupts - but probably
only uses the minimum number.
This wastes the last few MSI-X interrupts.
The code (especially the calling code) would be easier to read if the 'vectors'
value wasn't explicitly passed.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 1/1] mfd: omap-usb-host: Fix USB device detection problems on OMAP4 Panda

2013-11-29 Thread David Laight

 From: Of Roger Quadros
 With u-boot 2013.10, USB devices are sometimes not detected
 on OMAP4 Panda. To make us independent of what bootloader does
 with the USB Host module, we must RESET it to get it to a known
 good state. This patch Soft RESETs the USB Host module.
...
 +++ b/drivers/mfd/omap-usb-host.c
 @@ -43,14 +43,18 @@
  /* UHH Register Set */
  #define  OMAP_UHH_REVISION   (0x00)
  #define  OMAP_UHH_SYSCONFIG  (0x10)
 -#define  OMAP_UHH_SYSCONFIG_MIDLEMODE(1  12)
 +#define  OMAP_UHH_SYSCONFIG_MIDLEMASK(3  12)
 +#define OMAP_UHH_SYSCONFIG_MIDLESHIFT(12)

(tab/space issue)

Wouldn't it be clearer to define these in the opposite order with:
+#defineOMAP_UHH_SYSCONFIG_MIDLEMASK(3  
OMAP_UHH_SYSCONFIG_MIDLESHIFT)

... 
 +static void omap_usbhs_rev1_reset(struct device *dev)
 +{
 + struct usbhs_hcd_omap *omap = dev_get_drvdata(dev);
 + u32 reg;
 + unsigned long timeout;
 +
 + reg = usbhs_read(omap-uhh_base, OMAP_UHH_SYSCONFIG);
 +
 + /* Soft Reset */
 + usbhs_write(omap-uhh_base, OMAP_UHH_SYSCONFIG,
 + reg | OMAP_UHH_SYSCONFIG_SOFTRESET);
 +
 + timeout = jiffies + msecs_to_jiffies(100);
 + while (!(usbhs_read(omap-uhh_base, OMAP_UHH_SYSSTATUS)
 +  OMAP_UHH_SYSSTATUS_RESETDONE)) {
 + cpu_relax();
 +
 + if (time_after(jiffies, timeout)) {
 + dev_err(dev, Soft RESET operation timed out\n);
 + break;
 + }
 + }
 +
 + /* Set No-Standby */
 + reg = ~OMAP_UHH_SYSCONFIG_MIDLEMASK;
 + reg |= OMAP_UHH_SYSCONFIG_MIDLE_NOSTANDBY
 +  OMAP_UHH_SYSCONFIG_MIDLESHIFT;
 +
 + /* Set No-Idle */
 + reg = ~OMAP_UHH_SYSCONFIG_SIDLEMASK;
 + reg |= OMAP_UHH_SYSCONFIG_SIDLE_NOIDLE
 +  OMAP_UHH_SYSCONFIG_SIDLESHIFT;

Why not pass in the mask and value and avoid replicating the
entire function. I can't see any other significant differences,
the udelay(2) won't be significant.

I'm not sure of the context this code runs in, but if the reset
is likely to take a significant portion of the 100ms timeout
period, why not just sleep for a few ms between status polls.

David




--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 1/4] pch_gbe: Fix transmit queue management

2013-12-02 Thread David Laight

 From: David Miller
 
  According to Documentation/networking/driver.txt the ndo_start_xmit
  method should not return NETDEV_TX_BUSY under normal circumstances.
  Stop the transmit queue when tx_ring is full.
... 
 You should be instead preventing the transmit method from being invoked
 when it might be possible that a request cannot be satisfied.
 
 This means that at the end of a transmit request, you must stop the
 queue if a packet with the maximum number of possible descriptors
 cannot be satisfied.  Then it is impossible for the transmit function
 to be invoked in a situation where it would need to fail for lack of
 available transmit descriptors.

I know you like ethernet drivers to work this way, but requiring enough
descriptors for a maximally fragmented packet requires a difficult
calculation and will cause the tx queue to be stopped unnecessarily.

IIRC a normal skb can have a maximum of 17 fragments, if these end up
being transmitted over USB using xhci they might need 34 descriptors
(because descriptors can't cross 64k boundaries).
However a typical 64k TCP packet just needs 4.
xhci has a further constraint that the ring end can only be at
specific alignments, in effect thus means that the descriptors for
a single packet cannot cross the end of a ring segment.
So if the available space in the xhci ring was used to stop the
ethernet tx queue (which it isn't at the moment) it would have to
be stopped very early.

Isn't there also a new skb structure that allows lists of fragments?
That might need even more descriptors for a worst case transmit.

I would have thought it would make sense for the ethernet driver
to stop the queue when there is a reasonable expectation that there
won't be enough descriptors for the next packet, but have a reasonable
code path when a packet with a large number of fragments won't fit
in the tx ring. This either requires the driver itself to hold the
skb, or to return NETDEV_TX_BUSY.
It might make sense to monitor recent traffic to find the 'expected
maximum number of fragments'.

David




--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 1/1] mfd: omap-usb-host: Fix USB device detection problems on OMAP4 Panda

2013-12-02 Thread David Laight

 From: Roger Quadros [mailto:rog...@ti.com]
 On 11/29/2013 03:17 PM, David Laight wrote:
...
  +  timeout = jiffies + msecs_to_jiffies(100);
  +  while (!(usbhs_read(omap-uhh_base, OMAP_UHH_SYSSTATUS)
  +   OMAP_UHH_SYSSTATUS_RESETDONE)) {
  +  cpu_relax();
 
 You mean use msleep(1) here instead of cpu_relax()?
 Shouldn't be a problem IMO, but can you please tell me why that is better
 as the reset seems to complete usually in the first iteration.

If it doesn't finish in the first iteration you don't want to
spin the cpu for 100ms.

If it hasn't finished in the first millisecond, you probably expect
it to actually time out - so you might as well look (say) every 10ms.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH] ipvs: Remove unused variable ret from sync_thread_master()

2013-11-12 Thread David Laight

  @@ -1637,7 +1637,7 @@ static int sync_thread_master(void *data)
  continue;
  }
  while (ip_vs_send_sync_msg(tinfo-sock, sb-mesg)  0) {
  -   int ret = __wait_event_interruptible(*sk_sleep(sk),
 
 So ideally there's be a comment here why we're using interruptible but
 then ignore interruptions.
 
 Julian said (
 http://lkml.kernel.org/r/alpine.lfd.2.00.1310012245020.1...@ja.ssi.bg ):
 
  Yes, your patch looks ok to me. In the past
 we used ssleep() but IPVS users were confused why
 IPVS threads increase the load average. So, we
 switched to _interruptible calls and later the socket
 polling was added.  

I've done this in the past so that the code sleeps interruptibly
unless there is a signal pending - which would cause it to return
early.

/* Tell scheduler we are going to sleep... */
if (signal_pending(current))
/* We don't want waking immediately (again) */
sleep_state = TASK_UNINTERRUPTIBLE;
else
sleep_state = TASK_INTERRUPTIBLE;
set_current_state(sleep_state);

Shame there isn't a process flag to indicate that the process
will sleep uninterruptibly and that it doesn't matter.
So don't count to the load average and don't emit a warning
if it has been sleeping for a long time.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH] ipvs: Remove unused variable ret from sync_thread_master()

2013-11-12 Thread David Laight

  I've done this in the past so that the code sleeps interruptibly
  unless there is a signal pending - which would cause it to return
  early.
 
  /* Tell scheduler we are going to sleep... */
  if (signal_pending(current))
  /* We don't want waking immediately (again) */
  sleep_state = TASK_UNINTERRUPTIBLE;
  else
  sleep_state = TASK_INTERRUPTIBLE;
  set_current_state(sleep_state);
 
 If this is for kernel threads, I think you can wipe the pending state;
 not entirely sure how signals interact with kthreads; Oleg will know.

I did take me a while to manage to use kthreads, and we probably still
have customers who insist on using pre 2.6.18 kernels!
(In which case the processes are children of a daemon that only
return to userspace in order to exit.)

We also need to send signals (to get the process out of blocking
socket calls), so I have to use force_sig() to get the signal noticed.

The other issue was tidyup - I had to find module_put_and_exit().

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]

2013-11-13 Thread David Laight

 Sure, I modified the code so that we only prefetched 2 cache lines ahead, but
 only if the overall length of the input buffer is more than 2 cache lines.
 Below are the results (all counts are the average of 100 iterations of the
 csum operation, as previous tests were, I just omitted that column).

Hmmm averaging over 10 iterations means that all the code
is in the i-cache and the branch predictor will be correctly primed.

For short checksum requests I'd guess that the relevant data
has just been written and is already in the cpu cache (unless
there has been a process and cpu switch).
So prefetch is likely to be unnecessary.

If you assume that the checksum code isn't in the i-cache then
small requests are likely to be dominated by the code size.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]

2013-11-13 Thread David Laight

  I'm not sure, whats the typical capacity for the branch predictors
  ability to remember code paths?
...
 
 For such simple single-target branches it goes near or over a thousand for
 recent Intel and AMD microarchitectures. Thousands for really recent CPUs.

IIRC the x86 can also correctly predict simple sequences - like a branch
in a loop that is taken every other iteration, or only after a previous
branch is taken.

Much simpler cpus may use a much simpler strategy.
I think one I've used (a fpga soft-core cpu) just uses the low
bits of the instruction address to index a single bit table.
This means that branches alias each other.
In order to get the consistent cycle counts in order to minimise
the worst case code path we had to disable the dynamic prediction.

For the checksum code the loop branch isn't a problem.
Tests on entry to the function might get mispredicted.

So if you have conditional prefetch when the buffer is long
then time a short buffer after a 100 long ones you'll almost
certainly see the mispredition penalty.

FWIW I remember speeding up a copy (I think) loop on a strongarm by
adding an extra instruction to fetch a word from later in the buffer
into a register I never otherwise used.
(That was an unpaged system so I knew it couldn't fault.)

David




--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: net/usb/ax88179_178a driver broken in linux-3.12

2013-11-19 Thread David Laight

 From: Eric Dumazet [mailto:eric.duma...@gmail.com]
 On Tue, 2013-11-19 at 09:02 -0500, Mark Lord wrote:
  On 13-11-19 05:04 AM, David Laight wrote:
   From: Mark Lord
  ..
   except the ax88179_178a driver still does not work in linux-3.12,
   whereas it works fine in all earlier kernels.

I've seen lost packets in IIRC 3.2

   That's a regression.
   And a simple revert (earlier in this thread) fixes it.
   So.. let's revert it for now, until a proper xhci compatible patch is 
   produced.
  ...
   There is a patch to xhci-ring.c that should fix the SG problem.
   http://www.spinics.net/lists/linux-usb/msg97176.html

   I think it should apply to the 3.12 sources.

  I am running with that patch here now (thanks),
  and it too appears to prevent the lockups.

  But is this patch upstream already?
  If yes, then it needs to get pushed out to -stable for 3.12 at least.

Having someone else confirm that there is a bug, and that the
patch fixes it should help it get pushed to -stable.

  If not upstream, then the revert is probably safest for -stable,
  rather than new code that has never been upstream before.

  Both patches are attached to this email.
  One or the other is required for the USB 3.0 network adapters to function 
  in 3.12.

 I do not see any error in commit f27070158d6754765f2
 (ax88179_178a: avoid copy of tx tcp packets)

 Quite the contrary in fact...

 I suspect a TSO bug, and would rather disable TSO for this nic.

 Have you tried to revert 3804fad45411b482
 (USBNET: ax88179_178a: enable tso if usb host supports sg dma)

It isn't directly a TSO problem.
There has always been a bug in the xhci driver for fragmented buffers.
TSO just means it is given a lot of fragmented buffers.

As well as user-supplied fragmented buffers, the bug affects
internal fragmentation that happens whenever a buffer crosses a 64k
byte boundary (please hw engineers - stop doing this!)

I'm not sure whether usbnet would ever pass buffers that cross 64k
boundaries. I've not seen one - even with TSO. But the rx buffers
are 20k (doesn't seem ideal!) and could also be problematical.

USB mass storage has used SG for ages, the buffers must all be
adequately aligned for the hardware - they won't meet the constraint
for USB3 itself, but the documented restriction may be more severe
than the actual one.

David

RE: net/usb/ax88179_178a driver broken in linux-3.12

2013-11-19 Thread David Laight

 From: Eric Dumazet [mailto:eric.duma...@gmail.com]
 On Tue, 2013-11-19 at 14:43 +, David Laight wrote:
 
  It isn't directly a TSO problem.
  There has always been a bug in the xhci driver for fragmented buffers.
  TSO just means it is given a lot of fragmented buffers.
 
  As well as user-supplied fragmented buffers, the bug affects
  internal fragmentation that happens whenever a buffer crosses a 64k
  byte boundary (please hw engineers - stop doing this!)
 
 TCP stack uses order-3 allocations.
 
 This means 32KB for x86 (PAGE_SIZE=4096)
 
 What is PAGE_SIZE for your arches ?
 
 If there is a 6KB limit, we might adapt TCP to make sure we do not cross
 a 64KB boundary.
 
 Other strategy would be to detect this case in the driver and split a
 problematic segment into two parts.

The xhci code does all the checks for buffer fragments crossing 64k boundaries.
I've posted a patch that uses a conservative upper limit for the
number of fragments: 2 * number_of_sg_buffs + xfer_len/65536.
This saves scanning the sg list twice.

For those not reading linux-usb:

The xhci 'bug' is that an SG list may only cross the end of a ring
segment at an aligned length.
For USB2 devices this will be 512 bytes. For USB3 the documented alignment
could be 16k (depends on a burst size), but might only be 1k.
(This is another place where the hw engineers haven't made life easy.)
The only simple solution is to ensure that a SG list doesn't cross
the end of a ring segment.

David

RE: Supporting 4 way connections in LKSCTP

2013-12-04 Thread David Laight

  In normal operation, IP-A sends INIT to IP-X, IP-X returns INIT_ACK to
  IP-A. IP-A then sends HB to IP-X, IP-X then returns HB_ACK to IP-A. In
  the meantime, IP-B sends HB to IP-Y and IPY returns HB_ACK.
 
  In case of the path between IP-A and IP-X is broken, IP-B sends INIT
  to IP-X, NODE-B uses IP-Y to return INIT_ACK to IP-B. Then IP-B sends
  HB to IP-X, and IP-Y returns HB_ACK to IP-B. In the meantime, the HB
  communication between IP-B and IP-Y follows the normal flow.
 
  Can I confirm, is it really valid?
 
 As long as NODE-B knows about both IP-A and IP-B, and NODE-A knows about
 both IP-X and IP-Y (meaning all the addresses were exchanged inside INIT
 and INIT-ACK), then this situation is perfectly valid.  In fact, this
 has been tested an multiple interops.

There are some network configurations that do cause problems.
Consider 4 systems with 3 LAN segments:
A) 10.10.10.1 on LAN X and 192.168.1.1 on LAN Y.
B) 10.10.10.2 on LAN X and 192.168.1.2 on LAN Y.
C) 10.10.10.3 on LAN X.
D) 10.10.10.4 on LAN X and 192.168.1.2 on LAN Z.
There are no routers between the networks (and none of the systems
are running IP forwarding).

If A connects to B everything is fine - traffic can use either LAN.

Connections from A to C are problematic if C tries to send anything
(except a HB) to 192.168.1.1 before receiving a HB response.
One of the SCTP stacks we've used did send messages to an
inappropriate address, but I've forgotten which one.

Connections between A and D fail unless the HB errors A receives
for 192.168.1.2 are ignored.

Of course the application could explicitly bind to only the 10.x address
but that requires the application know the exact network topology
and may be difficult for incoming calls.

David

RE: Supporting 4 way connections in LKSCTP

2013-12-04 Thread David Laight

  There are some network configurations that do cause problems.
  Consider 4 systems with 3 LAN segments:
  A) 10.10.10.1 on LAN X and 192.168.1.1 on LAN Y.
  B) 10.10.10.2 on LAN X and 192.168.1.2 on LAN Y.
  C) 10.10.10.3 on LAN X.
  D) 10.10.10.4 on LAN X and 192.168.1.2 on LAN Z.
  There are no routers between the networks (and none of the systems
  are running IP forwarding).
 
  If A connects to B everything is fine - traffic can use either LAN.
 
  Connections from A to C are problematic if C tries to send anything
  (except a HB) to 192.168.1.1 before receiving a HB response.
  One of the SCTP stacks we've used did send messages to an
  inappropriate address, but I've forgotten which one.
 
 I guess that would be problematic if A can not receive traffic for
 192.168.1.1 on the interface connected to LAN X.  I shouldn't
 technically be a problem for C as it should mark the path to 192.168.1.1
 as down.  For A, as long as it doesn't decide to ABORT the association,
 it shouldn't be a problem either.  It would be interesting to know more
 about what problems you've observed.

It was a long time ago, we don't actually do much SCTP testing even though
our product (SS7 M3UA) uses it. Mostly because we don't want to find
bugs that are hard to fix!

What we saw was C using the 192.x address (as supplied in the INIT) and
these not being routable to A (and not getting a response from a 3rd system).
So the application layer immediately timed everything out.

What I can't remember is whether this was Linux, Solaris or the SCTP stack
we took from freebsd and rammed into the windows kernel.

David

RE: Supporting 4 way connections in LKSCTP

2013-12-04 Thread David Laight

 The point is that address scoping should be used. When sending an
 INIT from 10.10.10.1 to 10.10.10.4 you should not list 192.168.1.1,
 since you are transmitting an address to a node which might or might
 not be in the same scope.

You might have two machines that are connected via the public
internet and also via a private network.
The two sets of cabling being completely separate giving you
resilience to network failure.
In which case you definitely don't want address scoping.

While you may not want the SCTP traffic on the public network
itself, it could easily be routed separately.

We have systems that 'sort of' designate one interface for SIP/RTP
and the other for 'management'. They might run M3UA/SCTP but no one
has really thought enough about which interface(s) the M3UA traffic
should use.
(Think of an ISUP/SIP gateway using M3UA for ISUP signalling.)

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: Supporting 4 way connections in LKSCTP

2013-12-05 Thread David Laight

  the configured addresses could be:
  System A) 10.0.0.1 on Lan X, 10.10.0.1 on Lan Y
  System B) 10.0.0.2 on Lan X, 10.10.0.2 on Lan Y
  System C) 10.0.0.3 on Lan X, 10.10.0.2 on Lan Z
 
  Same problem will occur.
...
 With that, Sys A talking to Sys C will get an abort
 from Sys B when trying to talk to 10.10.0.2.  With /8, it'll be
 even worse since SysB and SysC will have duplicate addresses
 within the subnet. :)
 
 The point is that you don't always know that the same private subnet
 is in reality 2 different subnets with duplicate addresses.
 
 I've had to debug an actual production issue similar to this where
 customer had a very similar configuration to above, and their
 associations kept getting aborted.  When I tried accessing the
 system that kept sending aborts, I found it was some windows
 server and not a Diameter station they were expecting.

Does seem that the addresses listed in INIT and INIT_ACK chunks
should be ignored until a valid HB response has been received.
If an abort is received before a valid HB response then the
address should be ignored rather than the connection aborted.
Then it wouldn't matter anywhere near as much if addresses are
advertised that are not reachable from the remote system.

All this should have been thought about when the original RFC
was written.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH net-next v2 1/2] r8152: add RTL8152_EARLY_AGG_TIMEOUT_SUPER

2014-03-13 Thread David Laight

From: Hayes Wang
 For slow CPU, the frequent bulk transfer would cause poor throughput.
 One solution is to increase the timeout of the aggregation. It let
 the hw could complete the bulk transfer later and fill more packets
 into the buffer. Besides, it could reduce the frequency of the bulk
 transfer efficiently and improve the performance.
 
 However, the optimization value of the timeout depends on the
 capability of the hardware, especially the CPU. For example, according
 to the experiment, the timeout 164 us is better than the default
 value for the chromebook with the ARM CPU.

The best value probably depends on the workload as well as the
cpu speed.
I wonder if a sane algorithm for dynamically setting this value exists.
It really needs a CBU (crystal ball unit), but working ones are
difficult to come by.

Passing the buck to the 'user' is rather a cop-out (but we all do it).

 Now add RTL8152_EARLY_AGG_TIMEOUT_SUPER to let someone could choose
 desired timeout value if he wants to get the best performance.
...
  /* USB_RX_EARLY_AGG */
 -#define EARLY_AGG_SUPPER 0x0e832981
 +#define EARLY_AGG_SUPER  rx_buf_sz - 1522) / 4)  16) | \
 + (u32)(CONFIG_RTL8152_EARLY_AGG_TIMEOUT_SUPER = 0 ? 0x2981 : \
 + ((CONFIG_RTL8152_EARLY_AGG_TIMEOUT_SUPER * 125)  0x ? \
 + CONFIG_RTL8152_EARLY_AGG_TIMEOUT_SUPER * 125 : 0x)))

The 0x2981 would be better written as (85 * 125).
But maybe replace with something like:
min((CONFIG_RTL8152_EARLY_AGG_TIMEOUT_SUPER = 0 ? 85 :
CONFIG_RTL8152_EARLY_AGG_TIMEOUT_SUPER) * 125u, 0xu)

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH net-next v3 1/2] r8152: add RTL8152_EARLY_AGG_TIMEOUT_SUPER

2014-03-13 Thread David Laight

From: Hayes Wang
...

I should have spotted this before.

  /* USB_RX_EARLY_AGG */
 -#define EARLY_AGG_SUPPER 0x0e832981
 +#define EARLY_AGG_SUPER  rx_buf_sz - 1522) / 4)  16) | \
 + (u32)(CONFIG_RTL8152_EARLY_AGG_TIMEOUT_SUPER = 0 ? 85 * 125 : \
 + min(CONFIG_RTL8152_EARLY_AGG_TIMEOUT_SUPER * 125, 0x)))
  #define EARLY_AGG_HIGH   0x0e837a12
  #define EARLY_AGG_SLOW   0x0e83
 
 @@ -1978,7 +1980,7 @@ static void r8153_set_rx_agg(struct r8152 *tp)
   ocp_write_dword(tp, MCU_TYPE_USB, USB_RX_BUF_TH,
   RX_THR_SUPPER);
   ocp_write_dword(tp, MCU_TYPE_USB, USB_RX_EARLY_AGG,
 - EARLY_AGG_SUPPER);
 + EARLY_AGG_SUPER);
   } else {
   ocp_write_dword(tp, MCU_TYPE_USB, USB_RX_BUF_TH,
   RX_THR_HIGH);

It looks as though rx_buf_sz should be a parameter to EARLY_AGG_SUPER.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH v3 1/4] net: add name_assign_type netdev attribute

2014-03-17 Thread David Laight

From: David Herrmann
 The name_assign_type attribute gives hints where the interface name of a
 given net-device comes from. Three different values are currently defined:
   NET_NAME_ENUM:
 This is the default. The ifname is provided by the kernel with an
 enumerated suffix. Names may be reused and unstable.
   NET_NAME_USER:
 The ifname was provided by user-space during net-device setup.
   NET_NAME_RENAMED:
 The net-device has been renamed via RTNL. Once this type is set, it
 cannot change again.
...
 diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
 index b8d8c80..6698e87 100644
 --- a/include/linux/netdevice.h
 +++ b/include/linux/netdevice.h
 @@ -1248,6 +1248,7 @@ struct net_device {
* of the interface.
*/
   charname[IFNAMSIZ];
 + unsigned char   name_assign_type; /* name assignment type */
 
   /* device name hash chain, please keep it close to name[] */
   struct hlist_node   name_hlist;

Do you really need to add 7 byte of padding here?
There seems to be some padding lurking elsewhere that really ought
to be mergable.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH v2] net: netfilter: LLVMLinux: vlais-netfilter

2014-03-18 Thread David Laight

From:  beh...@converseincode.com
 From: Mark Charlebois charl...@gmail.com

 Replaced non-standard C use of Variable Length Arrays In Structs (VLAIS) in
 xt_repldata.h with a C99 compliant flexible array member and then calculated
 offsets to the other struct members. These other members aren't referenced by
 name in this code, however this patch maintains the same memory layout and
 padding as was previously accomplished using VLAIS.

 Had the original structure been ordered differently, with the entries VLA at
 the end, then it could have been a flexible member, and this patch would have
 been a lot simpler. However since the data stored in this structure is
 ultimately exported to userspace, the order of this structure can't be 
 changed.

Why not just remove the last element and allocate space for it after the
structure?
That would reduce the complexity of the patch and the unreadability
of the new code.
I realise that the alignment of type##_error is 'tricky' to determine.

David

N�r��yb�X��ǧv�^�)޺{.n�+{zX����ܨ}���Ơz�j:+v���zZ+��+zf���h���~i���z��w���?��)ߢf��^jǫy�m��@A�a���
0��h���i

RE: [PATCH v2] net: netfilter: LLVMLinux: vlais-netfilter

2014-03-18 Thread David Laight

From: Behan Webster 
 On 03/18/14 02:41, David Laight wrote:
  From:  beh...@converseincode.com
  From: Mark Charlebois charl...@gmail.com

  Replaced non-standard C use of Variable Length Arrays In Structs (VLAIS) in
  xt_repldata.h with a C99 compliant flexible array member and then 
  calculated
  offsets to the other struct members. These other members aren't referenced 
  by
  name in this code, however this patch maintains the same memory layout and
  padding as was previously accomplished using VLAIS.

  Had the original structure been ordered differently, with the entries VLA 
  at
  the end, then it could have been a flexible member, and this patch would 
  have
  been a lot simpler. However since the data stored in this structure is
  ultimately exported to userspace, the order of this structure can't be 
  changed.
  Why not just remove the last element and allocate space for it after the
  structure?
 Because that would still be employing VLAIS to solve the problem. The
 last element may be a zero-length array (a flexible member), not a VLA.
 Sadly both the last 2 elements in the struct need to be manually
 calculated, which is what we've done.

So make the last element a 'flexible member' and then work out where
the final field goes.
Something like:
struct p {
struct a a;
struct b b[];
} p = malloc(sizeof *p + n * sizeof (struct b) + alignof (struct c)
+ sizeof (struct c);
struct c *c = (void *)p-b[n] + (-offsetof(struct p, b[n])  
(alignof(struct c) - 1);

David

RE: sisusb: Use static const, fix typo

2014-02-24 Thread David Laight

From: Joe Perches
 Reduce text a bit by using static const.

If you want to save a few bytes remove the pointers.
(and the fixed RAM text to get below 7 chars).
eg:

 - const char *ramtypetext2[] = {  SDR SDRAM, SDR SGRAM,
 - DDR SDRAM, DDR SGRAM };

static const char ramtypetext2[8][] = {
SDR SD, SDR SG, DDR SD, DDR SG

...
 - dev_info(sisusb-sisusb_dev-dev, %dMB %s %s, bus width %d\n, 
 (sisusb-vramsize  20),
 ramtypetext1,
 - ramtypetext2[ramtype], bw);

dev_info(sisusb-sisusb_dev-dev, %dMB %s %sRAM, bus width %d\n,
 sisusb-vramsize  20, ramtypetext1, ramtypetext2[ramtype],
 bw);

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: sisusb: Use static const, fix typo

2014-02-25 Thread David Laight

From: Joe Perches [ 
 On Mon, 2014-02-24 at 10:26 +, David Laight wrote:
  From: Joe Perches
   Reduce text a bit by using static const.

  If you want to save a few bytes remove the pointers.
  (and the fixed RAM text to get below 7 chars).

 Hi David.

  eg:

   - const char *ramtypetext2[] = {  SDR SDRAM, SDR SGRAM,
   - DDR SDRAM, DDR SGRAM };

 The idea was use static to avoid the array reload
 on each function entry.

  static const char ramtypetext2[8][] = {
  SDR SD, SDR SG, DDR SD, DDR SG

 8 is an odd number here.

Brain fade - not enough coffee.
I meant 'char ramtypetext2[][8]' so that you avoid the pointers as well.

David

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH] net/usb: Add Lenovo ThinkPad OneLink GigaLAN USB ID to ax88179 driver

2014-02-25 Thread David Laight

From: Keith Packard
 The Lenovo OneLink dock includes a USB ethernet adapter using the
 AX88179 chip, but with a different USB ID. Add this new USB id to the
 driver so that it will autodetect the adapter correctly.
 
 Signed-off-by: Keith Packard kei...@keithp.com
 Tested-by: Carl Worth cwo...@cworth.org
 ---
  drivers/net/usb/ax88179_178a.c | 17 +
  1 file changed, 17 insertions(+)
 
 diff --git a/drivers/net/usb/ax88179_178a.c b/drivers/net/usb/ax88179_178a.c
 index 8e8d0fc..dcf974f 100644
 --- a/drivers/net/usb/ax88179_178a.c
 +++ b/drivers/net/usb/ax88179_178a.c
 @@ -1418,6 +1418,19 @@ static const struct driver_info samsung_info = {
   .tx_fixup = ax88179_tx_fixup,
  };
 
 +static const struct driver_info lenovo_info = {
 + .description = ThinkPad OneLink Dock USB GigaLAN,
 + .bind = ax88179_bind,
 + .unbind = ax88179_unbind,
 + .status = ax88179_status,
 + .link_reset = ax88179_link_reset,
 + .reset = ax88179_reset,
 + .stop = ax88179_stop,
 + .flags = FLAG_ETHER | FLAG_FRAMING_AX,
 + .rx_fixup = ax88179_rx_fixup,
 + .tx_fixup = ax88179_tx_fixup,
 +};
 +
  static const struct usb_device_id products[] = {
  {
   /* ASIX AX88179 10/100/1000 */
 @@ -1435,6 +1448,10 @@ static const struct usb_device_id products[] = {
   /* Samsung USB Ethernet Adapter */
   USB_DEVICE(0x04e8, 0xa100),
   .driver_info = (unsigned long)samsung_info,
 +}, {
 + /* Lenovo ThinkPad OneLink GigaLAN */
 + USB_DEVICE(0x17ef, 0x304b),
 + .driver_info = (unsigned long)samsung_info,
 
I think you meant lenovo_info.

Actually it looks like the initialiser should be factored somehow
to that the list of functions and flags doesn't need repeating for
every clone.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH net-next 08/12] r8152: support TSO

2014-03-04 Thread David Laight

From: Hayes Wang
 Support scatter gather and TSO.
 
 Adjust the tx checksum function and set the max gso size to fix the
 size of the tx aggregation buffer.

There is little point supporting TSO unless the usb host controller
supports arbitrary aligned scatter-gather.
All you do is require that a large skb be allocated (that may well
fail), and add it another data copy.

The xhci controller is almost capable of arbitrary scatter-gather,
but it is currently disabled because the data must be aligned at
the end of the transfer ring (the hardware designers have made it
almost impossible to write efficient software).

Note that the various xhci controllers behave subtly differently.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH net-next 08/12] r8152: support TSO

2014-03-04 Thread David Laight

From: hayeswang 
  David Laight [mailto:david.lai...@aculab.com]
  Sent: Tuesday, March 04, 2014 8:12 PM
  To: 'Hayes Wang'; net...@vger.kernel.org
  Cc: nic_s...@realtek.com; linux-kernel@vger.kernel.org;
  linux-...@vger.kernel.org
  Subject: RE: [PATCH net-next 08/12] r8152: support TSO

  From: Hayes Wang
   Support scatter gather and TSO.

   Adjust the tx checksum function and set the max gso size to fix the
   size of the tx aggregation buffer.

  There is little point supporting TSO unless the usb host controller
  supports arbitrary aligned scatter-gather.
  All you do is require that a large skb be allocated (that may well
  fail), and add it another data copy.

 I think I have done it. For also working for EHCI usb host controller,
 I allocate 16 KB continuous buffer and copy the fragmented packets to
 it and bulk out the buffer.

Does that mean you are splitting the 64k 'ethernet packet' from TCP
is software? I've looked at the ax88179 where the hardware can do it.

Is there really a gain doing segmentation here if you have to do the
extra data copy?

It might be worth packing multiple short packets into a single USB bulk
data packet, but that might require a CBU (crystal ball unit).

I did measure a maximum transmit ethernet frame rate of (IIRC) 25
frames/sec for the ax88179 - probably limited by the USB3 frame rate.
Exceeding that would probably require putting multiple tx packets into
a single URB. OTOH that limit probably doesn't matter

What might be more useful is to set the rx side up to receive into
page-aligned 2k (or 4k) buffers and then separate out the ethernet
frames into the required skb - probably as page list fragments
(I'm not entirely sure how such skb can be created).

I don't know what the r8152 does, but the usbnet code encourages it to
allocate long skb, pass them to the usb stack to fill, and then clone
them if they contain multiple frames.

This isn't really good use of memory or cpu cycles.
The ax88179 driver also lies badly about the skb's truesize.

David

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH net-next 08/12] r8152: support TSO

2014-03-04 Thread David Laight

From: Eric Dumazet 
 On Tue, 2014-03-04 at 14:35 +, David Laight wrote:
 
  Does that mean you are splitting the 64k 'ethernet packet' from TCP
  is software? I've looked at the ax88179 where the hardware can do it.
 
  Is there really a gain doing segmentation here if you have to do the
  extra data copy?
 
 There is no 'extra' copy.
 
 There is _already_ a copy, so what's the deal of doing this copy in a SG
 enabled way ?

Ok.

But there is one more copy than for a normal ethernet chipset which
can use multiple ring entries to send the MAC+IP+TCP headers from a
different buffer to the TCP userdata.

David

N�r��yb�X��ǧv�^�)޺{.n�+{zX����ܨ}���Ơz�j:+v���zZ+��+zf���h���~i���z��w���?��)ߢf��^jǫy�m��@A�a���
0��h���i

RE: [PATCH] phy: fix compiler array bounds warning on settings[]

2014-03-05 Thread David Laight

From: Bjorn Helgaas
 With -Werror=array-bounds, gcc v4.7.x warns that in phy_find_valid(), the
 settings[] array subscript is above array bounds, I think because idx is
 a signed integer and if the caller supplied idx  0, we pass the guard but
 still reference out of bounds.

Not rejecting the patch but...

Just indexing an array with 'int' shouldn't cause this warning,
so somewhere a caller must actually be passing an idx  0.

While changing the type to unsigned will make the comparison
against the array bound reject the -1, I suspect that the
specific call path didn't really intend passing a hard-coded -1.

It might be worth trying to locate the call site that passes -1.

David

RE: [PATCH] phy: fix compiler array bounds warning on settings[]

2014-03-05 Thread David Laight

From: Bjorn Helgaas
 On Wed, Mar 5, 2014 at 2:10 AM, David Laight david.lai...@aculab.com wrote:
  From: Bjorn Helgaas
  With -Werror=array-bounds, gcc v4.7.x warns that in phy_find_valid(), the
  settings[] array subscript is above array bounds, I think because idx is
  a signed integer and if the caller supplied idx  0, we pass the guard but
  still reference out of bounds.

  Not rejecting the patch but...

  Just indexing an array with 'int' shouldn't cause this warning,
  so somewhere a caller must actually be passing an idx  0.

  While changing the type to unsigned will make the comparison
  against the array bound reject the -1, I suspect that the
  specific call path didn't really intend passing a hard-coded -1.

  It might be worth trying to locate the call site that passes -1.

 I agree 100%.  If that's the case, we definitely should find that
 caller rather than apply this patch.  I'll look more today.

You might want to apply the patch as well :-)

David

RE: [PATCH] phy: fix compiler array bounds warning on settings[]

2014-03-06 Thread David Laight

From: Bjorn Helgaas
 I'm stumped.  phy_find_valid() is static and only called from one
 place.  The 'idx' argument is always the result of phy_find_setting(),
 which should always return something between 0 and
 ARRAY_SIZE(settings), so I don't see any way idx can be  0.
 
 I stripped this down as far as I could; the resulting test code is at
 http://pastebin.com/pp1zMEWu if anybody else wants to look at it.  I'm
 using gcc 4.8.x 20131105 (prerelease), with -Warray-bounds -O2
 flags.
 
 I hesitate to suspect a compiler bug, but it is very strange.  For
 example, in my test code, replacing MAX_NUM_SETTINGS with 2 gets
 rid of the warnings.  MAX_NUM_SETTINGS is known to be 2 at
 compile-time, so I don't know why this should make a difference.

I can't get an array bounds error from the gcc 4.8.1-10ubuntu9 at all.
Not even when I index the array with a constant 3.
I wonder if they compiled it out!

David

N�r��yb�X��ǧv�^�)޺{.n�+{zX����ܨ}���Ơz�j:+v���zZ+��+zf���h���~i���z��w���?��)ߢf��^jǫy�m��@A�a���
0��h���i

RE: [PATCH 3/3] perf: Use 64-bit value when comparing sample_regs

2014-03-06 Thread David Laight

From: Sukadev Bhattiprolu
 When checking whether a bit representing a register is set in
 sample_regs, a 64-bit mask, use 64-bit value (1LL).
 
 Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 ---
  tools/perf/util/unwind.c |4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)
 
 diff --git a/tools/perf/util/unwind.c b/tools/perf/util/unwind.c
 index 742f23b..2b888c6 100644
 --- a/tools/perf/util/unwind.c
 +++ b/tools/perf/util/unwind.c
 @@ -396,11 +396,11 @@ static int reg_value(unw_word_t *valp, struct regs_dump 
 *regs, int id,
  {
   int i, idx = 0;
 
 - if (!(sample_regs  (1  id)))
 + if (!(sample_regs  (1LL  id)))
   return -EINVAL;
 
   for (i = 0; i  id; i++) {
 - if (sample_regs  (1  i))
 + if (sample_regs  (1LL  i))
   idx++;
   }

There are much faster ways to count the number of set bits, especially
if you might need to check a significant number of bits.
There might even be a function defined somewhere to do it.
Basically you just add up the bits, for 16 bit it would be:
val = (val  0x) + (val  1)  0x;
val = (val  0x) + (val  2)  0x;
val = (val  0x0f0f) + (val  4)  0x0f0f;
val = (val  0x00ff) + (val  8)  0x00ff;
As the size of the work increases the improvement is more significant.
(Some of the later masking can probably be proven unnecessary.)

David

RE: [PATCH 2/2] Revert USBNET: ax88179_178a: enable tso if usb host supports sg dma

2014-03-07 Thread David Laight

From: Mathias Nyman
 This reverts commit 3804fad45411b48233b48003e33a78f290d227c8.
 
 This commit, together with commit 247bf557273dd775505fb9240d2d152f4f20d304
 xhci 1.0: Limit arbitrarily-aligned scatter gather. were
 origially added to get xHCI 1.0 hosts and usb ethernet ax88179_178a devices
 working together with scatter gather. xHCI 1.0 hosts pose some requirement on 
 how transfer
 buffers are aligned, setting this requirement for 1.0 hosts caused USB 3.0 
 mass
 storage devices to fail more frequently.

This patch doesn't need to be reverted.
Provided the xhci driver doesn't set the flag to say that arbitrary scatter
gather is supported (ie usb_device_no_sg_constraint(dev-udev)) is false)
the ax88179_178a driver won't request transmits that need fragmenting.

There is a separate issue in that it does request receives that need
fragmenting.
However if the fragmented receives end up being split by a LINK TRB
the device driver recovers.
The ax88179 hardware doesn't recover when a tx is split.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 1/2] Revert xhci 1.0: Limit arbitrarily-aligned scatter gather.

2014-03-07 Thread David Laight

From: Mathias Nyman
 This reverts commit 247bf557273dd775505fb9240d2d152f4f20d304.

You need to revert further.
Just don’t set hcd-self.no_sg_constraint ever - since it just
doesn't work.
That will stop the ax88179_178a driver sending fragmented packets.

With the check for aligned non-terminal fragments removed the 
code will be as bad as the earlier releases.

 This commit, together with commit 3804fad45411b48233b48003e33a78f290d227c8
 USBNET: ax88179_178a: enable tso if usb host supports sg dma were
 origially added to get xHCI 1.0 hosts and usb ethernet ax88179_178a devices
 working together with scatter gather. xHCI 1.0 hosts pose some requirement on 
 how transfer
 buffers are aligned, setting this requirement for 1.0 hosts caused USB 3.0 
 mass
 storage devices to fail more frequently.
 
 USB 3.0 mass storage devices used to work before 3.14-rc1.  Theoretically,
 the TD fragment rules could have caused an occasional disk glitch.
 Now the devices *will* fail, instead of theoretically failing.
 From a user perspective, this looks like a regression; the USB device 
 obviously
 fails on 3.14-rc1, and may sometimes silently fail on prior kernels.
 
 The proper soluition is to implement the TD fragment rules required, but for 
 now
 this patch needs to be reverted to get USB 3.0 mass storage devices working 
 at the
 level they used to.
 
 Signed-off-by: Mathias Nyman mathias.ny...@linux.intel.com
 Cc: stable sta...@vger.kernel.org
 ---
  drivers/usb/host/xhci.c | 14 +++---
  1 file changed, 3 insertions(+), 11 deletions(-)
 
 diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
 index 6fe577d..924a6cc 100644
 --- a/drivers/usb/host/xhci.c
 +++ b/drivers/usb/host/xhci.c
 @@ -4733,6 +4733,9 @@ int xhci_gen_setup(struct usb_hcd *hcd, 
 xhci_get_quirks_t get_quirks)
   /* Accept arbitrarily long scatter-gather lists */
   hcd-self.sg_tablesize = ~0;
 
 + /* support to build packet from discontinuous buffers */
 + hcd-self.no_sg_constraint = 1;
 +

Don't add the above - it only looked at by the usbnet code
and in particular the ax88179_178a driver

   /* XHCI controllers don't stop the ep queue on short packets :| */
   hcd-self.no_stop_on_short = 1;
 
 @@ -4757,14 +4760,6 @@ int xhci_gen_setup(struct usb_hcd *hcd, 
 xhci_get_quirks_t get_quirks)
   /* xHCI private pointer was set in xhci_pci_probe for the second
* registered roothub.
*/
 - xhci = hcd_to_xhci(hcd);
 - /*
 -  * Support arbitrarily aligned sg-list entries on hosts without
 -  * TD fragment rules (which are currently unsupported).
 -  */
 - if (xhci-hci_version  0x100)
 - hcd-self.no_sg_constraint = 1;
 -
   return 0;
   }
 
 @@ -4793,9 +4788,6 @@ int xhci_gen_setup(struct usb_hcd *hcd, 
 xhci_get_quirks_t get_quirks)
   if (xhci-hci_version  0x96)
   xhci-quirks |= XHCI_SPURIOUS_SUCCESS;
 
 - if (xhci-hci_version  0x100)
 - hcd-self.no_sg_constraint = 1;
 -
   /* Make sure the HC is halted. */
   retval = xhci_halt(xhci);
   if (retval)
 --
 1.8.1.2
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-usb in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 2/2] Revert USBNET: ax88179_178a: enable tso if usb host supports sg dma

2014-03-07 Thread David Laight

From: Alan Stern 
 On Fri, 7 Mar 2014, David Laight wrote:

  From: Mathias Nyman
   This reverts commit 3804fad45411b48233b48003e33a78f290d227c8.

   This commit, together with commit 247bf557273dd775505fb9240d2d152f4f20d304
   xhci 1.0: Limit arbitrarily-aligned scatter gather. were
   origially added to get xHCI 1.0 hosts and usb ethernet ax88179_178a 
   devices
   working together with scatter gather. xHCI 1.0 hosts pose some 
   requirement on how transfer
   buffers are aligned, setting this requirement for 1.0 hosts caused USB 
   3.0 mass
   storage devices to fail more frequently.

  This patch doesn't need to be reverted.

 Yes, it does.

  Provided the xhci driver doesn't set the flag to say that arbitrary scatter
  gather is supported (ie usb_device_no_sg_constraint(dev-udev)) is false)
  the ax88179_178a driver won't request transmits that need fragmenting.

 True.  But xhci-hcd _will_ set the flag, because of patch 1 in this
 series.  In other words, patch 1 makes patch 2 necessary.

I was reading patch 2 first.
It would seem to be better to modify patch 1 - so it doesn't set the flag.
Then the changes are limited to the usb tree, and the change to net wouldn't
need to be reapplied in order to test scatter-gather when it is properly fixed.

David

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 1/2] Revert xhci 1.0: Limit arbitrarily-aligned scatter gather.

2014-03-07 Thread David Laight

From: Alan Stern
 On Fri, 7 Mar 2014, David Laight wrote:

  From: Mathias Nyman
   This reverts commit 247bf557273dd775505fb9240d2d152f4f20d304.

  You need to revert further.
  Just don?t set hcd-self.no_sg_constraint ever - since it just
  doesn't work.

 No; it does work most of the time.

 If hcd-self.no_sg_constraint wasn't sent, then certain commands would
 fail always, instead of failing occasionally.

The point is that the no_sg_constraint was added in order to allow
the ax88179_178a driver to send arbitrarily aligned transfers.
Nothing else looks at it (well didn't when I scanned the tree a while back).

In effect all the other transfer requests were assumed to be suitably
aligned.

The check that caused things to fail was the one added to xhci relatively
recently that verified the alignment of the fragments.

David

RE: [PATCH 2/2] Revert USBNET: ax88179_178a: enable tso if usb host supports sg dma

2014-03-07 Thread David Laight

From: 
 On Fri, 7 Mar 2014, David Laight wrote:

  From: Alan Stern
   On Fri, 7 Mar 2014, David Laight wrote:

From: Mathias Nyman
 This reverts commit 3804fad45411b48233b48003e33a78f290d227c8.

 This commit, together with commit 
 247bf557273dd775505fb9240d2d152f4f20d304
 xhci 1.0: Limit arbitrarily-aligned scatter gather. were
 origially added to get xHCI 1.0 hosts and usb ethernet ax88179_178a 
 devices
 working together with scatter gather. xHCI 1.0 hosts pose some 
 requirement on how transfer
 buffers are aligned, setting this requirement for 1.0 hosts caused 
 USB 3.0 mass
 storage devices to fail more frequently.

This patch doesn't need to be reverted.

   Yes, it does.

Provided the xhci driver doesn't set the flag to say that arbitrary 
scatter
gather is supported (ie usb_device_no_sg_constraint(dev-udev)) is 
false)
the ax88179_178a driver won't request transmits that need fragmenting.

   True.  But xhci-hcd _will_ set the flag, because of patch 1 in this
   series.  In other words, patch 1 makes patch 2 necessary.

  I was reading patch 2 first.
  It would seem to be better to modify patch 1 - so it doesn't set the flag.
  Then the changes are limited to the usb tree, and the change to net wouldn't
  need to be reapplied in order to test scatter-gather when it is properly 
  fixed.

  The point is that the no_sg_constraint was added in order to allow
  the ax88179_178a driver to send arbitrarily aligned transfers.
  Nothing else looks at it (well didn't when I scanned the tree a while back).

  In effect all the other transfer requests were assumed to be suitably
  aligned.

  The check that caused things to fail was the one added to xhci relatively
  recently that verified the alignment of the fragments.

 (Actually the check was added to usbcore, not to xhci-hcd.)

 The _real_ problem here seems to be that no_sg_constraint is
 ambiguous.  Originally it was meant to refer to the constraint that all
 SG elements except the last must be a multiple of the maxpacket size.
 For that purpose, the check added to usbcore was entirely appropriate,
 as was setting the flag in xhci-hcd.

Probably true - but the only code that looked at it was in usbnet.
The check in usbcore is very recent.

 But now it turns out that the ax88179 driver is violating a _different_
 constraint: that Link TRBs must occur only at 1-KB boundaries.  The
 no_sg_constraint flag was never meant to describe this.  In other
 words, the flag issue is separate from the problem facing ax88179.

Really that means that the xhci controller couldn't actually support the
alignments it said it could - rather than the ax88179 driver sending
packets that didn't meet that the rules.

 The appropriate way to address this new problem for the future is to
 remove the second constraint by adding correct support for TD
 fragments into xhci-hcd.

Indeed.

 The appropriate way to address the problem
 right now in the -stable kernels is to prevent ax88179 from using SG --
 and not by abusing the no_sg_constraint flag, which is not directly
 related to the TD fragment problem.

I'd say that if the no_sg_constraint flag is set the ax88179 driver
could reasonably expect to send its fragment lists.

So it is best not to set the no_sg_constraint flag, and then remove
the checks against it that are needed to allow the transfers from
the disk system not be rejected - even though we know that some
of them might not actually work.

Otherwise you'll need to add yet another flag when the sg support
is fixed.

David

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: rfc: checkpatch logical line continuations (was IBM Akebono: Add support for a new PHY interface to the IBM emac driver)

2014-03-10 Thread David Laight

From: j...@joshtriplett.org
 On Fri, Mar 07, 2014 at 01:02:44PM -0800, Joe Perches wrote:
  On Fri, 2014-03-07 at 15:41 -0500, David Miller wrote:
   From: Alistair Popple alist...@popple.id.au
   Date: Thu,  6 Mar 2014 14:52:25 +1100

+   out_be32(dev-reg, in_be32(dev-reg) | WKUP_ETH_RGMIIEN
+| WKUP_ETH_TX_OE | WKUP_ETH_RX_IE);

   When an expression spans multiple lines, the lines should end with
   operators rather than begin with them.

  That's not in CodingStyle currently.

 It's also not even remotely consistent across existing kernel code, and
 it isn't obvious that there's a general developer consensus on the
 right way to write it.

My personal preference (which counts for nothing here) is to put
the operators at the start of the continuation like in order to
make it more obvious that it is a continuation.

The netdev rules are particularly problematical for code like:
if (tst(foo, foo2, foo3, ...)  ... 
tst2(..)  tst3()) {
baz();
where a scan read of the LHS gives the wrong logic.

At least we don't have a coding style that allows very long lnes
an puts } and { on their own lines - leading to:
...
}
while (foo(...)  bar(...)  . /* very long line falls off screen 
*/
{
int x;
Is that the top or bottom of a loop?

David

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH v8 net-next 1/3] filter: add Extended BPF interpreter and converter

2014-03-11 Thread David Laight

From: David Miller
 From: Linus Torvalds torva...@linux-foundation.org
 Date: Mon, 10 Mar 2014 19:02:18 -0700

  I would generally suggest that people only use bool for function
  return types, and absolutely nothing else. Seriously.

 I think it makes sense for function arguments too.

'bool' doesn't necessarily even make sense for function arguments
and results.
It might require the same additional masking of high bits that
are required for 'short' and 'char' parameters.

Also, IIRC, one of the arm ABIs used 32bit 'bool'.

Even the extra code that gets added to ensure that the 'bool'
values are either 0 or 1 can be a problem.
I don't know whether the recent compilers (or standards) say
whether the reader or writer is expected to enforce the domain.
But if the compiler uses a bit-wise AND for bool_a  bool_b there
is definitely scope for unexpected behaviour, and if the compiler
does anything else you get unwanted instructions and possibly branches.
Some old versions of gcc made a 'pigs breakfast' of bool_a |= bool_b.

David

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH RFC] netfilter: cacheline aligning in struct netns_ct

2014-03-12 Thread David Laight

From: Zhouyi Zhou
 not frequently changing components should share same cachelines
 
 Signed-off-by: Zhouyi Zhou yizhouz...@ict.ac.cn
 ---
  include/net/netns/conntrack.h |   12 
  1 file changed, 8 insertions(+), 4 deletions(-)
 
 diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h
 index fbcc7fa..69d2d58 100644
 --- a/include/net/netns/conntrack.h
 +++ b/include/net/netns/conntrack.h
 @@ -65,6 +65,12 @@ struct nf_ip_net {
  struct netns_ct {
   atomic_tcount;
   unsigned intexpect_count;
 + struct hlist_nulls_head unconfirmed;
 + struct hlist_nulls_head dying;
 + struct hlist_nulls_head tmpl;
 +
 + /*  not frequently changing components should share same cachelines */
 + struct nf_ip_netnf_ct_proto cacheline_aligned_in_smp;

I'm not sure this has the effect you are trying to achieve.
Adding an __attribute__((aligned(n))) to a structure member not only
adds padding before the member, but also pads the member to a multiple
of the specified size.

With large cache lines you probably want neither.
IIRC one of the ppc systems has something like 256 byte cache lines
(and if it doesn't now, it might have soon).
So arbitrarily cache-line aligning structure members may not be a good idea.

In any case reducing the number of cache lines accessed, and the number
dirtied is probably more important than putting 'readonly' data in its
own cache lines.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH -next] qlcnic: fix compiler warning

2014-01-10 Thread David Laight

From: Shahed Shaikh
 Adding netdev.
 
  -Original Message-
  From: Martin Kaiser,,, [mailto:mar...@reykholt.kaiser.cx] On Behalf Of
  Martin Kaiser
  Sent: Thursday, January 09, 2014 9:29 PM
  To: Himanshu Madhani; Rajesh Borundia
  Cc: linux-kernel; triv...@kernel.org
  Subject: [PATCH -next] qlcnic: fix compiler warning
 
  Add an explicit cast to fix the following warning (seen on Debian Wheezy, 
  gcc
  4.7.2)
 
  CC [M]  drivers/net/wireless/rtlwifi/rtl8192ce/trx.o
  drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c: In function
  qlcnic_send_filter:
  drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c:349:3: warning:
  passing argument 2 of ether_addr_equal from incompatible pointer type
  [enabled by default]
  In file included from include/linux/if_vlan.h:16:0,
  from drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c:9:
  include/linux/etherdevice.h:244:20: note: expected const u8 * but
  argument is of type u64 *
 
 
 If I am not wrong, this patch should go to David's net-next tree.
 
  Signed-off-by: Martin Kaiser mar...@kaiser.cx
 
 Acked-by: Shahed Shaikh shahed.sha...@qlogic.com
 
 Thanks,
 Shahed
  ---
   drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c |2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)
 
  diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c
  b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c
  index 6373f60..3557154 100644
  --- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c
  +++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c
  @@ -346,7 +346,7 @@ static void qlcnic_send_filter(struct qlcnic_adapter
  *adapter,
  head = (adapter-fhash.fhead[hindex]);
 
  hlist_for_each_entry_safe(tmp_fil, n, head, fnode) {
  -   if (ether_addr_equal(tmp_fil-faddr, src_addr) 
  +   if (ether_addr_equal(tmp_fil-faddr, (const u8 *)src_addr)
  
  tmp_fil-vlan_id == vlan_id) {
  if (jiffies  (QLCNIC_READD_AGE * HZ + tmp_fil-
  ftime))
  qlcnic_change_filter(adapter, src_addr,
  --
  1.7.10.4

A quick look at the code seems to imply that this code is somewhat buggy.
'src_addr' is defined a u64 even though it is a 6 byte mac address.
On big-endian systems the wrong bytes get accessed all over the place.

Also when ether_addr_equal() in inlined the code ends up accessing src_addr
as 32bit and 16bit data - which don't have to be synchronised against
and writes to the 64bit item.

If might be that ether_addr_equal() (etc) should have a gcc asm
statement with a memory constraint against the mac address buffers.

David

RE: [PATCH v2 3/4] net: make tcp_cleanup_rbuf private

2014-01-10 Thread David Laight

From: David Miller
...
  On Thu, Jan 9, 2014 at 12:26 PM, Neal Cardwell ncardw...@google.com wrote:
  On Thu, Jan 9, 2014 at 3:16 PM, Dan Williams dan.j.willi...@intel.com 
  wrote:
  net_dma was the only external user so this can become local to tcp.c
  again.
  ...
  -void tcp_cleanup_rbuf(struct sock *sk, int copied)
  +static void cleanup_rbuf(struct sock *sk, int copied)
 
  I would vote to keep the tcp_ prefix. In the TCP code base that is the
  more common idiom, even for internal/static TCP functions, and
  personally I find it easier to read and work with in stack traces,
  etc. My 2 cents.
 
 
  Ok.  It was cleanup_rbuf() in a former life, but one vote for leaving
  the name as is is enough for me.
 
 You can make that two votes :)

I suspect DM adds 2000 votes :-)

Keeping the prefix makes it easier to grep for, and makes it more
obvious that it isn't a generic function (when being called).

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-10 Thread David Laight

From: walt
  In the meantime, try this patch, which is something of a long shot.
 
 No difference.  But I notice the code enables the TRB quirk only if the
 xhci_version is specifically 0x95.  My debug messages claim that xHCI
 doesn't need link TRB QUIRK so I'm wondering if adding my asmedia device
 to the quirks list really doesn't change anything unless it's xhci 0.95.

I think the intention was that the quirk would be applied for your
hardware even though it claims to be version 0.96.
That was gust a hopeful guess that the same change would help.

 Does lspci provide that information?  I'm not seeing anything obvious.
I've only seen it in the diagnostic prints (that aren't normally output).

David

RE: [RFC 00/10] xhci: re-work command queue management

2014-01-13 Thread David Laight

From: Mathias Nyman
 This is an attempt to re-work and solve the issues in xhci command
 queue management that Sarah has descibed earlier:
 
 Right now, the command management in the xHCI driver is rather ad-hock.
 Different parts of the driver all submit commands, including interrupt
 handling routines, functions called from the USB core (with or without the
 bus bandwidth mutex held).
 Some times they need to wait for the command to complete, and sometimes
 they just issue the command and don't care about the result of the command.
 
...
 
 The Implementation:
 ---
 
 First step is to create a list of the commands submitted to the command queue.
 To accomplish this each command is required to be submitted with a properly
 filled command structure containing completion, status variable and a pointer 
 to
 the command TRB that will be used.
 
 The first 7 patches are all about creating these command structures and
 submitting them when we queue commands.
 The command structures are allocated on the fly, the commands that are 
 submitted
 in interrupt context are allocated with GFP_ATOMIC.
 
 Next, the global command queue is introduced. Commands are added to the queue
 when trb's are queued, and remove when the commad completes.
 Also switch to use the status variable and completion in the command struct.
...

IMHO the xhci driver is already far too complicated, and this probably just 
makes
it even worse.

The fact that you are having to allocate memory ion an ISR ought also to be
ringing alarm bells.

Have you considered adding a 'software command ring' (indexed with the
same value as the hardware one) and using it to hold additional parameters?

It might even be worth only putting a single command into the hardware ring!
That might simplify the timer code.

This still has a fixed constraint on the number of queued commands, but
I suspect that is bounded anyway (a few per device?).
If not you can almost certainly arrange to grow the soft-ring before
the isr code can run out of entries.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-14 Thread David Laight

From: walt
 On 01/09/2014 03:50 PM, Sarah Sharp wrote:
 
  On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote:
 
  I've wondered if my xhci problems might be caused by hardware quirks, and
  wondering why I seem to be the only one who has this problem.
 
  Maybe I could take one for the team by buying new hardware toys that I
  don't really need but I could use to test the xhci driver?  (I do enjoy
  buying new toys, necessary, or, um, maybe not :)
 
  It would be appreciated if you could see if your device causes other
  host controllers to fail.  Who am I to keep a geek from new toys? ;)
 
 Sarah, I just fixed my xhci bug for US$19.99 :)
 
 #lspci | tail -1
 04:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller 
 (rev 03)

Do you know which version of the xhci spec it conforms to?
(Will probably be 0.96 or 1.00).

Of course, ASMedia will probably change the silicon they are using without
changing anything on the packaging.

David

RE: [RFC 00/10] xhci: re-work command queue management

2014-01-14 Thread David Laight

From: Mathias Nyman 
...
  The fact that you are having to allocate memory ion an ISR ought also to be
  ringing alarm bells.
 
 It did.
 Other options are as you said to use a 'software command ring'. Either
 just pre-allocate a full command ring (64 trbs * sizeof(struct
 xhci_command)), roughly 2300 bytes when driver loads,
 or then a smaller pool of commands just used for ISR submitted commands.
 (We then need to either either expand pool, or start allocating in isr
 if pool is empty)

If you pre-allocate mapping 1-1 with the command ring entries
you don't need many of the fields of xhci_command at all.
So the allocated size will be much smaller.

Also not that all the rings have been increased to 256 entries,
not that there is any requirement for them to match.

 These solutions require some ring management and possibly expansion
 code, and increases the complexity again. I selected this simpler
 approach as I understood that the frequency of commands allocated in ISR
 is quite low.

The frequency isn't the problem - the fact that it is allowed is a problem.

 This also feels like trying to optimize before we get the main
 changes working.

I would call it 'not realising the full scope of the problem' and thus
leaving the 'difficult bits' for last.
Sorting out how to solve the difficult bits up front can lead to a
nicer solution.

 
 
  Have you considered adding a 'software command ring' (indexed with the
  same value as the hardware one) and using it to hold additional parameters?
 
  It might even be worth only putting a single command into the hardware ring!
  That might simplify the timer code.
 
 This is something I haven't thought about. Could be one possibility, but
 feels like artificially restricting something the HW is designed to care
 of.

It might resolve any issues about the fixed finite size of the hardware
command ring.
OTOH you probably don't need that many entries in the hardware command
ring for 'normal' processing - provided you have a software ring/queue
that can buffer all the requests.

Looking at the number of fields in xci_command, why do you need the
caller to pass a structure at all?
Just make the fields parameters to the 'send a command' function
along with a parameter that says whether it can sleep (in kmalloc()
or if the ring is full depending on the implementation).

...
  This still has a fixed constraint on the number of queued commands, but
  I suspect that is bounded anyway (a few per device?).
 
 64 commands fit on the command ring segment, I'm not aware of any
 per-device limits?

I was wondering about how many commands the software might try to queue,
not the number that the hardware allowed to be queued.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH] usbnet: remove generic hard_header_len check

2014-02-13 Thread David Laight

From: Emil Goode
int rndis_rx_fixup(struct usbnet *dev, struct sk_buff *skb)
{
   + /* This check is no longer done by usbnet */
   + if (skb-len  dev-net-hard_header_len)
   + return 0;
   +
 
  Wouldn't it be better to test against ETH_HLEN, since that is a constant
  and obviously correct in this case?
 
 Some minidrivers change the default hard_header_len value so using it
 guarantees that the patch will not make any change to how the code is
 currently working. Using ETH_HLEN could be more informative about what
 the minidriver should check before passing skbs to usbnet_skb_return().
 Then I think the comment should be changed as well. My intention was to
 not make any changes that affect how the code works for devices I cannot
 test, but I think either way is fine and if you insist on changing it
 let me know.

I think that test is to ensure that the data passed to the mini-driver
contains the ethernet frame encapsulation header (this typically
contains the actual frame length and some flags) so that the minidriver
won't read off the end of the usb data.

Any check for stupidly short ethernet frames would be later on.
IIRC the absolute minimum 802.3 ethernet frame is 17 bytes
(after frame padding has been stripped).

David

RE: [PATCH v2] usbnet: remove generic hard_header_len check

2014-02-13 Thread David Laight

From: Of Emil Goode
 This patch removes a generic hard_header_len check from the usbnet
 module that is causing dropped packages under certain circumstances
 for devices that send rx packets that cross urb boundaries.
 
 One example is the AX88772B which occasionally send rx packets that
 cross urb boundaries where the remaining partial packet is sent with
 no hardware header. When the buffer with a partial packet is of less
 number of octets than the value of hard_header_len the buffer is
 discarded by the usbnet module.
...
 diff --git a/drivers/net/usb/ax88179_178a.c b/drivers/net/usb/ax88179_178a.c
 index d6f64da..955df81 100644
 --- a/drivers/net/usb/ax88179_178a.c
 +++ b/drivers/net/usb/ax88179_178a.c
 @@ -1118,6 +1118,10 @@ static int ax88179_rx_fixup(struct usbnet *dev, struct 
 sk_buff *skb)
   u16 hdr_off;
   u32 *pkt_hdr;
 
 + /* This check is no longer done by usbnet */
 + if (skb-len  dev-net-hard_header_len)
 + return 0;
 +

The ax88179 driver can also receive ethernet frames that cross the
end of rx URB.
It should have the code to save the last fragment (of a urb) until
the next data arrives, and then correctly merge the fragments.

It is likely that any sub-driver that sets the receive urb length
to a multiple of 1k (rather than leaving it at hard_hdr+mtu) can defrag
rx data that crosses urb boundaries.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH RFC] sctp: Update HEARTBEAT timer immediately after user changed HB.interval

2014-02-18 Thread David Laight

  +
  +   /* Update the heartbeat timer immediately. */
  +   if (!mod_timer(trans-hb_timer,
  +   sctp_transport_timeout(trans)))
  +   sctp_transport_hold(trans);
 
 This is causes a rather inconsistent behavior in that it doesn't
 account of the amount of time left on the hb timer.  Thus it doesn't
 always do what commit log implies.  If the remaining time is shorter
 then the new timeout value, it will take longer till the next HB
 the application expects.

Being able to stop heartbeats being sent by repeatedly configuring
the interval is certainly not a good idea!

 If the application has very strict timing requirement on the HBs, it
 should trigger the HB manually.
 
 We could rework the code to allow the combination of
 SPP_HB_DEMAND | SPP_HB_ENABLE to work as follows:
   1) update the timer to the new value
   2) Send an immediate HB
  a) Update the hb timeout.
 
 2a should probably be done regardless since it can cause 2 HB to be
 outstanding at the same time.

Sending a heartbeat 'on demand' might be problematic if one
has also just been sent (from the timer).

I'd have thought that you wouldn't want to send a heartbeat while
still expecting a response from the previous one (this might require
splitting the time interval in two).

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 1/2 v2] usbnet: fix bad header length bug

2014-02-10 Thread David Laight

From: Oliver Neukum
 censored. Oh well. But how about merging it with FLAG_MULTI_PACKET?
 I really don't want to add more flags. There is a point where enough
 flags make absurd having a common code. We are closing in on that point.

Any sub-driver that supports multi-packet either has to use stupidly
long URB and/or set the rx_urb_size to a multiple of the usb transfer
size.

It will also have to detect illegal short headers.

Actually it might be worth double-checking the encapsulations used.
IIRC the ax88179_178a uses different headers for tx and rx.
So there might be some that support multi-packet but still have
a short(ish) limit on the bulk receive size (before the short fragment).

I'm sat at the wrong desk to look at the code...

David

RE: [PATCH 05/14] net: axienet: Service completion interrupts ASAP

2014-02-12 Thread David Laight

From: Michal Simek
 From: Peter Crosthwaite peter.crosthwa...@xilinx.com

 The packet completion interrupts for TX and RX should be serviced before
 the packets are consumed. This ensures against the degenerate case when a
 new completion interrupt is raised after the handler has exited but before
 the interrupts are cleared. In this case its possible for the ISR to clear
 an unhandled interrupt (leading to potential deadlock).

I would clear the IRQ after processing the last packet, and then do a final
check for another packet.

That reduces the number of interrupts you take and then find there is
no work (because it was done on the previous interrupt).

There is a slight 'gotcha' in that the write to clear the IRQ can easily
get delayed enough the that cpu exits the ISR before the IRQ line
actually drops - leading the unwanted and unclaimed interrupts.
A posted write over PCIe could easily take long enough.

Maybe a hybrid scheme where the IRQ is cleared when the next entry
is still owned by the device would work.

David

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH v3] net: netfilter: LLVMLinux: vlais-netfilter

2014-03-19 Thread David Laight

From: beh...@converseincode.com 
 From: Mark Charlebois charl...@gmail.com

 Replaced non-standard C use of Variable Length Arrays In Structs (VLAIS) in
 xt_repldata.h with a C99 compliant flexible array member and then calculated
 offsets to the other struct members. These other members aren't referenced by
 name in this code, however this patch maintains the same memory layout and
 padding as was previously accomplished using VLAIS.

 Had the original structure been ordered differently, with the entries VLA at
 the end, then it could have been a flexible member, and this patch would have
 been a lot simpler. However since the data stored in this structure is
 ultimately exported to userspace, the order of this structure can't be 
 changed.

 This patch makes no attempt to change the existing behavior, merely the way in
 which the current layout is accomplished using standard C99 constructs. As 
 such
 the code can now be compiled with either gcc or clang.

 Author: Mark Charlebois charl...@gmail.com
 Signed-off-by: Mark Charlebois charl...@gmail.com
 Signed-off-by: Behan Webster beh...@converseincode.com
 Signed-off-by: Vinícius Tinti viniciusti...@gmail.com
 ---
  net/netfilter/xt_repldata.h | 27 ++-
  1 file changed, 22 insertions(+), 5 deletions(-)

 diff --git a/net/netfilter/xt_repldata.h b/net/netfilter/xt_repldata.h
 index 6efe4e5..343599e 100644
 --- a/net/netfilter/xt_repldata.h
 +++ b/net/netfilter/xt_repldata.h
 @@ -5,23 +5,40 @@
   * they serve as the hanging-off data accessed through repl.data[].
   */

 +/* tbl has the following structure equivalent, but is C99 compliant:
 + * struct {
 + *   struct type##_replace repl;
 + *   struct type##_standard entries[nhooks];
 + *   struct type##_error term;
 + * } *tbl;
 + */
 +
  #define xt_alloc_initial_table(type, typ2) ({ \
   unsigned int hook_mask = info-valid_hooks; \
   unsigned int nhooks = hweight32(hook_mask); \
   unsigned int bytes = 0, hooknum = 0, i = 0; \
   struct { \
   struct type##_replace repl; \
 - struct type##_standard entries[nhooks]; \
 - struct type##_error term; \
 - } *tbl = kzalloc(sizeof(*tbl), GFP_KERNEL); \
 + struct type##_standard entries[]; \
 + } *tbl; \
 + struct type##_error *term; \
 + size_t entries_end = offsetof(typeof(*tbl), \
 + entries[nhooks-1]) + sizeof(tbl-entries[0]); \

Is the compiler complaining about:
offsetof(typeof(*tbl), entries[nhooks])
If it does it is a PITA.

 + size_t term_offset = (entries_end + __alignof__(*term) - 1) \
 +  ~(__alignof__(*term) - 1); \

You've not tested this - the () are in the wrong places.

 + size_t term_end = term_offset + sizeof(*term); \
 + size_t tbl_sz = (term_end + __alignof__(tbl-repl) - 1) \
 +  ~(__alignof__(tbl-repl) - 1); \
 + tbl = kzalloc(tbl_sz, GFP_KERNEL); \

The number of temporary variables make the above hard to read.
I'm not at all sure you need to worry about the trailing alignment.
It rather depends on how the final data is used.
If the combined buffer is copied to userspace you may not
be copying all of the required data.
It might be easier to call copytouser() twice.

   if (tbl == NULL) \
   return NULL; \
 + term = (struct type##_error *)(((char *)tbl)[term_offset]); \
   strncpy(tbl-repl.name, info-name, sizeof(tbl-repl.name)); \
 - tbl-term = (struct type##_error)typ2##_ERROR_INIT;  \
 + *term = (struct type##_error)typ2##_ERROR_INIT;  \

David

RE: [PATCH v3] mac80211: LLVMLinux: Remove VLAIS usage from mac80211

2014-03-19 Thread David Laight

From: beh...@converseincode.com
 Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
 compliant equivalent. This is the original VLAIS struct.
 
 struct {
   struct aead_request req;
   u8  priv[crypto_aead_reqsize(tfm)];
 } aead_req;
 
 This patch instead allocates the appropriate amount of memory using an char
 array.
 
 The new code can be compiled with both gcc and clang.
 
 Signed-off-by: Jan-Simon Mller dl...@gmx.de
 Signed-off-by: Behan Webster beh...@converseincode.com
 Signed-off-by: Vincius Tinti viniciusti...@gmail.com
 Signed-off-by: Mark Charlebois charl...@gmail.com
 ---
  net/mac80211/aes_ccm.c | 40 ++--
  1 file changed, 22 insertions(+), 18 deletions(-)
 
 diff --git a/net/mac80211/aes_ccm.c b/net/mac80211/aes_ccm.c
 index 7c7df47..20521f9 100644
 --- a/net/mac80211/aes_ccm.c
 +++ b/net/mac80211/aes_ccm.c
 @@ -23,12 +23,14 @@ void ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, 
 u8 *b_0, u8 *aad,
  u8 *data, size_t data_len, u8 *mic)
  {
   struct scatterlist assoc, pt, ct[2];
 - struct {
 - struct aead_request req;
 - u8  priv[crypto_aead_reqsize(tfm)];
 - } aead_req;
 
 - memset(aead_req, 0, sizeof(aead_req));
 + char aead_req_data[sizeof(struct aead_request) +
 + crypto_aead_reqsize(tfm)]
 + __aligned(__alignof__(struct aead_request));
 +
 + struct aead_request *aead_req = (void *) aead_req_data;
 +
 + memset(aead_req, 0, sizeof(aead_req_data));

It seems to me that the underlying function(s) ought to be changed so that
this isn't needed.
Passing an extra parameter would probably cost very little.

David

RE: [PATCH 9/9] powerpc/pm: support deep sleep feature on T1040

2014-03-20 Thread David Laight

From: Kevin Hao
 Sent: 20 March 2014 11:48
 To: Scott Wood
 Cc: linuxppc-...@lists.ozlabs.org; Chenhui Zhao; jason@freescale.com; 
 linux-kernel@vger.kernel.org
 Subject: Re: [PATCH 9/9] powerpc/pm: support deep sleep feature on T1040

 On Tue, Mar 18, 2014 at 06:18:54PM -0500, Scott Wood wrote:
The sequence write, readback, sync will guarantee this according to 
   the manual.

  If you're referring to the section you quoted above, the manual does not
  say that.  It only talks about when accesses to memory regions affected
  by the configuration register write can be safely made.

   To guarantee that the results of any sequence of writes to configuration
   registers are in effect, the final configuration register write should be
   immediately followed by a read of the same register, and that should be
   followed by a SYNC instruction. Then accesses can safely be made to memory
   regions affected by the configuration register write.

That sort of sequence is need to force the operations through any
external bus - after the cpu itself has issued the bus cycles.
Mostly required because writes are often 'posted' (ie address and data
latched, and then performed synchronously).

 According to the above description in t4240 RM manual (2.6.1 Accessing CCSR
 Memory from the Local Processor), that the writing to configuration register
 takes effect is a prerequisite for the memory access to the affected regions.
...
 OK, so the intention of 'twi, isync' following the load is not to order the
 following storage access, but order the following delay loop instructions,
 right

I tried to work out what the 'twi, isync' instructions were for (in in_le32()).
The best I could come up with was to ensure a synchronous bus-fault.
But bus faults are probably only expected during device probing - not
normal operation, and the instructions will have a significant cost.

Additionally in_le32() and out_le32() both start with a 'sync' instruction.
In many cases that isn't needed either - an explicit iosync() can be
used after groups of instructions.

David

RE: [PATCH 9/9] powerpc/pm: support deep sleep feature on T1040

2014-03-21 Thread David Laight

From: Scott Wood [mailto:scottw...@freescale.com]
 On Thu, 2014-03-20 at 11:59 +, David Laight wrote:
  I tried to work out what the 'twi, isync' instructions were for (in 
  in_le32()).
  The best I could come up with was to ensure a synchronous bus-fault.
  But bus faults are probably only expected during device probing - not
  normal operation, and the instructions will have a significant cost.
 
  Additionally in_le32() and out_le32() both start with a 'sync' instruction.
  In many cases that isn't needed either - an explicit iosync() can be
  used after groups of instructions.
 
 The idea is that it's better to be maximally safe by default, and let
 performance critical sections be optimized using raw accessors and
 explicit synchronization if needed, than to have hard-to-debug bugs due
 to missing/wrong sync.  A lot of I/O is slow enough that the performance
 impact doesn't really matter, but the brain-time cost of getting the
 sync right is still there.

Hmmm

That might be an excuse for the 'sync', but not the twi and isync.

I was setting up a dma request (for the ppc 83xx PCIe bridge) and
was doing back to back little-endian writes into memory.
I had difficulty finding and including header files containing
the definitions for byteswapped accesses I needed.
arch/powerpc/include/asm/swab.h contains some - but I couldn't
work out how to get it included (apart from giving the full path).

In any case you need to understand when synchronisation is
required - otherwise you will get it wrong.
Especially since non-byteswapped accesses are done by direct
access.

David

RE: [RFC] csum experts, csum_replace2() is too expensive

2014-03-21 Thread David Laight

From: Eric Dumazet
 On Thu, 2014-03-20 at 18:56 -0700, Andi Kleen wrote:
  Eric Dumazet eric.duma...@gmail.com writes:
  
   I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
   is insane...
 
 
  Couldn't it just be the cache miss?
 
 Or the fact that we mix 16 bit stores and 32bit loads ?
 
 BTW, any idea why ip_fast_csum() on x86 contains a memory constraint ?

The correct constraint would be one that told gcc that it
accesses the 20 bytes from the source pointer.

Without it gcc won't necessarily write out the values before
the asm instructions execute.

David

RE: [PATCH 07/21] netfilter: nf_nat: remove inline marking of EXPORT_SYMBOL functions

2013-05-09 Thread David Laight

 EXPORT_SYMBOL and inline directives are contradictory to each other.
 The patch fixes this inconsistency.
...
 -inline const struct nf_nat_l4proto *
 +const struct nf_nat_l4proto *
  __nf_nat_l4proto_find(u8 family, u8 protonum)
  {
   return rcu_dereference(nf_nat_l4protos[family][protonum]);

If it makes sense to inline the local calls (ie the cost
of the call is significant) then possibly add an inlined
(or inlinable) static function that is called locally and
by the exported one?

I'm not sure that gcc is allowed to make the assumption
that the local exported function will be called - and
thus inline it.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: Scaling problem with a lot of AF_PACKET sockets on different interfaces

2013-06-07 Thread David Laight

  Looks like the ptype_base[] should be per 'dev'?
  Or just put entries where ptype-dev != null_or_dev on a per-interface
  list and do two searches?
 
 Yes, but then we would have two searches instead of one in fast path.

Usually it would be empty - so the search would be very quick!

David

RE: [PATCH v9 net-next 5/7] net: simple poll/select low latency socket poll

2013-06-05 Thread David Laight

 I am a bit uneasy with this one, because an applicatio polling() on one
 thousand file descriptors using select()/poll(), will call sk_poll_ll()
 one thousand times.

Anything calling poll() on 1000 fds probably has performance
issues already! Which is why kevent schemes have been added.

At least the Linux code doesn't use a linked list for
the fd - 'struct file' map which made poll() O(n^2),
and getting to that number of open fds O(n^3) on
some versions of SVR4.

David

RE: Scaling problem with a lot of AF_PACKET sockets on different interfaces

2013-06-07 Thread David Laight

  I have a Linux router with a lot of interfaces (hundreds or
  thousands of VLANs) and an application that creates AF_PACKET
  socket per interface and bind()s sockets to interfaces.
...
  I noticed that box has strange performance problems with
  most of the CPU time spent in __netif_receive_skb:
86.15%  [k] __netif_receive_skb
 1.41%  [k] _raw_spin_lock
 1.09%  [k] fib_table_lookup
 0.99%  [k] local_bh_enable_ip
...
  This corresponds to:
 
  net/core/dev.c:
   type = skb-protocol;
   list_for_each_entry_rcu(ptype,
   ptype_base[ntohs(type)  PTYPE_HASH_MASK], list) {
   if (ptype-type == type 
   (ptype-dev == null_or_dev || ptype-dev == skb-dev ||
ptype-dev == orig_dev)) {
   if (pt_prev)
   ret = deliver_skb(skb, pt_prev, orig_dev);
   pt_prev = ptype;
   }
   }
 
  Which works perfectly OK until there are a lot of AF_PACKET sockets, since
  the socket adds a protocol to ptype list:

Presumably the 'ethertype' is the same for all the sockets?
(And probably the ' PTYPE_HASH_MASH' doesn't separate it from 0800
or 0806 (IIRC IP and ICMP))

How often is that deliver_skb() inside the loop called?
If the code could be arranged so that the scan loop didn't contain
a function call then the loop code would be a lot faster since
the compiler can cache values in registers.
While that woukd speed the code up somewhat, there would still be a
significant cost to iterate 1000+ times.

Looks like the ptype_base[] should be per 'dev'?
Or just put entries where ptype-dev != null_or_dev on a per-interface
list and do two searches?

David

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 2841 matches

Mail list logo