RE: [PATCH net-next v3 0/5] lan743x speed boost

2021-02-17 Thread Bryan.Whitehead
> From: Sven Van Asbroeck 
> 
> Tree: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git #
> 9ec5eea5b6ac
> 
> v2 -> v3:
> - Bryan Whitehead:
> + add Bryan's reviewed-by tag to patch 1/5.
> + Only use FRAME_LENGTH if the LS bit is set:
>   if set, use the smaller of FRAME_LENGTH or the buffer length;
>   if clear, use the buffer length.
> + Correct typo in cover letter history (swap "packet" <-> "buffer").
> 
> v1 -> v2:
> - Andrew Lunn:
> + always keep to Reverse Christmas Tree.
> + "changing the cache operations to operate on the received length"
> should
>   go in its own, separate patch, so it can be easily backed out if
>   "interesting things" should happen with it.
> 
> - Bryan Whitehead:
> + multi-buffer patch concept "looks good".
>   As a result, I will squash the intermediate "dma buffer only" patch which
>   demonstrated the speed boost using an inflexible solution (w/o multi-buffers).
> + Rename lan743x_rx_process_packet() to lan743x_rx_process_buffer()
> + Remove unused RX_PROCESS_RESULT_PACKET_DROPPED
> + Rename RX_PROCESS_RESULT_PACKET_RECEIVED to
>   RX_PROCESS_RESULT_BUFFER_RECEIVED
> + Fold "unmap from dma" into lan743x_rx_init_ring_element() to prevent
>   use-after-dma-unmap issue
> + ensure that skb allocation issues do not result in the driver sending
>   incomplete packets to the OS. E.g. a three-buffer packet, with the
>   middle buffer missing
> 
> - Willem De Bruyn: skb_hwtstamps(skb) always returns a non-null value, if
>   the skb parameter points to a valid skb.
> 
...
> Sven Van Asbroeck (5):
>   lan743x: boost performance on cpu archs w/o dma cache snooping
>   lan743x: sync only the received area of an rx ring buffer
>   TEST ONLY: lan743x: limit rx ring buffer size to 500 bytes
>   TEST ONLY: lan743x: skb_alloc failure test
>   TEST ONLY: lan743x: skb_trim failure test
> 
>  drivers/net/ethernet/microchip/lan743x_main.c | 352 +-
>  drivers/net/ethernet/microchip/lan743x_main.h |   5 +-
>  2 files changed, 174 insertions(+), 183 deletions(-)
> 
> --
> 2.17.1

Hi Sven,

Just to let you know, my colleague tested patches 1 and 2 on an x86 PC and we
are satisfied with the results.
We confirmed performance improvements.
We also confirmed that PTP is working.

Thanks for your work on this.

Tested-by: unglinuxdri...@microchip.com



RE: [PATCH net-next v3 2/5] lan743x: sync only the received area of an rx ring buffer

2021-02-16 Thread Bryan.Whitehead
> From: Sven Van Asbroeck 
> 
> On cpu architectures w/o dma cache snooping, dma_unmap() is a very
> expensive operation, because its resulting sync needs to invalidate cpu
> caches.
> 
> Increase efficiency/performance by syncing only those sections of the
> lan743x's rx ring buffers that are actually in use.
> 
> Signed-off-by: Sven Van Asbroeck 
> ---
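For readers skimming the thread, the sync-then-unmap pattern this patch introduces boils down to roughly the following. This is an illustrative sketch, not the driver's actual code: the helper name and parameters are invented, and the real hunks (quoted further down in this thread) operate on the driver's buffer_info structures.

#include <linux/dma-mapping.h>

/* Sync only the bytes the hardware actually wrote, then tear down the full
 * mapping while skipping the (redundant) second CPU sync.
 */
static void rx_sync_and_unmap(struct device *dev, dma_addr_t dma_ptr,
                              size_t used_length, size_t mapped_length)
{
        dma_sync_single_for_cpu(dev, dma_ptr, used_length, DMA_FROM_DEVICE);
        dma_unmap_single_attrs(dev, dma_ptr, mapped_length, DMA_FROM_DEVICE,
                               DMA_ATTR_SKIP_CPU_SYNC);
}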

Looks good, thanks Sven.
Our testing is in progress; we will let you know our results soon.

Reviewed-by: Bryan Whitehead 



RE: [PATCH net-next v2 2/5] lan743x: sync only the received area of an rx ring buffer

2021-02-13 Thread Bryan.Whitehead
 
> Will do. Are you planning to hold off your tests until v3? It shouldn't take
> too long.

Sure, we will wait for v3


RE: [PATCH net-next v2 2/5] lan743x: sync only the received area of an rx ring buffer

2021-02-12 Thread Bryan.Whitehead
Hi Sven, see below.

> +   if (buffer_info->dma_ptr) {
> +   /* unmap from dma */
> +   packet_length = RX_DESC_DATA0_FRAME_LENGTH_GET_
> +   (le32_to_cpu(descriptor->data0));
> +   if (packet_length == 0 ||
> +   packet_length > buffer_info->buffer_length)
> +   /* buffer is part of multi-buffer packet: fully used */
> +   packet_length = buffer_info->buffer_length;

According to the document I have, FRAME_LENGTH is only valid when the LS bit is
set, and reserved otherwise.
Therefore, I'm not sure you can rely on it being zero when LS is not set, even
if your experiments say it is.
Future chip revisions might use those bits differently.

Can you change this so that the LS bit is checked?
If it is set, use the smaller of FRAME_LENGTH or the buffer length.
If it is clear, just use the buffer length.

> +   /* sync used part of buffer only */
> +   dma_sync_single_for_cpu(dev, buffer_info->dma_ptr,
> +   packet_length,
> +   DMA_FROM_DEVICE);
> +   dma_unmap_single_attrs(dev, buffer_info->dma_ptr,
> +  buffer_info->buffer_length,
> +  DMA_FROM_DEVICE,
> +  DMA_ATTR_SKIP_CPU_SYNC);
> +   }
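A minimal sketch of the change requested above, assuming the descriptor word has already been converted with le32_to_cpu() and that an LS-bit macro exists alongside RX_DESC_DATA0_FRAME_LENGTH_GET_ in lan743x_main.h (the name RX_DESC_DATA0_LS_ below is assumed for illustration):

/* Trust FRAME_LENGTH only when the Last Segment (LS) bit is set, and clamp it
 * to the mapped buffer length; otherwise the buffer is fully used.
 */
static unsigned int rx_buffer_used_length(u32 data0, unsigned int buffer_length)
{
        unsigned int frame_length;

        if (!(data0 & RX_DESC_DATA0_LS_))       /* assumed LS-bit macro name */
                return buffer_length;

        frame_length = RX_DESC_DATA0_FRAME_LENGTH_GET_(data0);
        return min(frame_length, buffer_length);
}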



RE: [PATCH net-next v2 1/5] lan743x: boost performance on cpu archs w/o dma cache snooping

2021-02-12 Thread Bryan.Whitehead
Hi Sven,

> Subject: [PATCH net-next v2 1/5] lan743x: boost performance on cpu archs
> w/o dma cache snooping
> 
> From: Sven Van Asbroeck 
> 
> The buffers in the lan743x driver's receive ring are always 9K, even when the
> largest packet that can be received (the mtu) is much smaller. This performs
> particularly badly on cpu archs without dma cache snooping (such as ARM):
> each received packet results in a 9K dma_{map|unmap} operation, which is
> very expensive because cpu caches need to be invalidated.
> 
> Careful measurement of the driver rx path on armv7 reveals that the cpu
> spends the majority of its time waiting for cache invalidation.
> 
> Optimize by keeping the rx ring buffer size as close as possible to the mtu.
> This limits the amount of cache that requires invalidation.
> 
> This optimization would normally force us to re-allocate all ring buffers when
> the mtu is changed - a disruptive event, because it can only happen when
> the network interface is down.
> 
> Remove the need to re-allocate all ring buffers by adding support for multi-
> buffer frames. Now any combination of mtu and ring buffer size will work.
> When the mtu changes from mtu1 to mtu2, consumed buffers of size mtu1
> are lazily replaced by newly allocated buffers of size mtu2.
> 
> These optimizations double the rx performance on armv7.
> Third parties report 3x rx speedup on armv8.
> 
> Tested with iperf3 on a freescale imx6qp + lan7430, both sides set to mtu
> 1500 bytes, measure rx performance:
> 
> Before:
> [ ID] Interval   Transfer Bandwidth   Retr
> [  4]   0.00-20.00  sec   550 MBytes   231 Mbits/sec0
> After:
> [ ID] Interval   Transfer Bandwidth   Retr
> [  4]   0.00-20.00  sec  1.33 GBytes   570 Mbits/sec0
> 
> Signed-off-by: Sven Van Asbroeck 
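The buffer-sizing idea described above amounts to something like the helper below. RX_HEAD_PADDING and the extra 4 bytes mirror the pseudo-code discussed later in this thread; the helper itself is an illustrative sketch, not a verified excerpt of the patch:

#include <linux/if_ether.h>
#include <linux/netdevice.h>

/* Size each rx ring buffer to the current mtu instead of a fixed 9K;
 * RX_HEAD_PADDING comes from the driver's own header.
 */
static unsigned int rx_ring_buffer_length(const struct net_device *netdev)
{
        return netdev->mtu + ETH_HLEN + 4 + RX_HEAD_PADDING;
}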

Looks good

Reviewed-by: Bryan Whitehead 



RE: [PATCH net-next v2 0/5] lan743x speed boost

2021-02-12 Thread Bryan.Whitehead
Hi Sven, see below
 
> - Bryan Whitehead:
> + multi-buffer patch concept "looks good".
>   As a result, I will squash the intermediate "dma buffer only" patch which
>   demonstrated the speed boost using an inflexible solution (w/o multi-buffers).
> + Rename lan743x_rx_process_buffer() to lan743x_rx_process_packet()
You meant "Rename lan743x_rx_process_packet() to lan743x_rx_process_buffer()"

> + Remove unused RX_PROCESS_RESULT_PACKET_DROPPED
> + Rename RX_PROCESS_RESULT_BUFFER_RECEIVED to
>   RX_PROCESS_RESULT_PACKET_RECEIVED
You meant "Rename RX_PROCESS_RESULT_PACKET_RECEIVED to 
RX_PROCESS_RESULT_BUFFER_RECEIVED"

I don't think you need a new version just for these typos, because the patch
itself is correct. But if you do send a new version, you can fix them there.

Regards,
Bryan


RE: [PATCH net-next v1 2/6] lan743x: support rx multi-buffer packets

2021-02-03 Thread Bryan.Whitehead
> On Wed, Feb 3, 2021 at 3:14 PM  wrote:
> >
> > We can test on x86 PC. We will just need about a week after you release
> your next version.
> >
> 
> That's great. If you have any suggestions on how I can improve testing on my
> end, feel free to reach out.

If you are able, then in addition to basic rx and tx iperf tests, I would
recommend PTP tests.
PTP relies on the time stamps extracted from the extension descriptors, which
are handled directly in the RX path you are modifying.

If you are not able, we will at least cover that on the x86 PC.

Thanks,
Bryan



RE: [PATCH net-next v1 2/6] lan743x: support rx multi-buffer packets

2021-02-03 Thread Bryan.Whitehead
Hi Sven,

We can test on x86 PC. We will just need about a week after you release your 
next version.

Thanks,
Bryan

> -Original Message-
> From: Sven Van Asbroeck 
> Sent: Wednesday, February 3, 2021 1:53 PM
> To: Bryan Whitehead - C21958 
> Cc: UNGLinuxDriver ; David Miller
> ; Jakub Kicinski ; Andrew Lunn
> ; Alexey Denisov ; Sergej Bauer
> ; Tim Harvey ; Anders
> Rønningen ; netdev
> ; Linux Kernel Mailing List  ker...@vger.kernel.org>
> Subject: Re: [PATCH net-next v1 2/6] lan743x: support rx multi-buffer
> packets
> 
> Thank you Bryan. I will prepare a v2 early next week.
> 
> Would Microchip be able to donate some time to test v2? My own tests are
> certainly not perfect. Various stress tests across architectures
> (intel/arm) would help create confidence in the multi-buffer frame
> implementation. Perhaps Microchip has various test rigs already set up?
> 
> After all, absence of bugs is more important than speed improvements.


RE: [PATCH net-next v1 2/6] lan743x: support rx multi-buffer packets

2021-02-01 Thread Bryan.Whitehead
Hi Sven, see below

> >
> > If lan743x_rx_init_ring_element fails to allocate an skb, then
> > lan743x_rx_reuse_ring_element will be called.
> > But that function expects the skb to already be allocated and dma mapped,
> > and the dma was unmapped above.
> 
> Good catch. I think you're right: the skb allocation always has to come before
> the unmap, because if we unmap and the skb allocation then fails, there is no
> guarantee that we can remap the old skb we've just unmapped (the remap could
> fail). And then we'd end up with a broken driver.
> 
> BUT I actually joined skb alloc and init_ring_element because of a very
> subtle synchronization bug I was seeing: if someone changes the mtu
> _in_between_ skb alloc and init_ring_element, things will go haywire,
> because the skb and mapping lengths would differ!
> 
> We could fix that with a spinlock, I guess, but synchronization primitives in
> "hot paths" like these are far from ideal... It would be nice if we could
> avoid that.
> 
> Here's an idea: what if we fold "unmap from dma" into init_ring_element()?
> That way, we get the best of both worlds: length cannot change in the
> middle, and the function can always "back out" without touching the ring
> element in case an allocation or mapping fails.
> 
> Pseudo-code:
> 
> init_ring_element() {
> /* single "sampling" of mtu, so no synchronization required */
> length = netdev->mtu + ETH_HLEN + 4 + RX_HEAD_PADDING;
> 
> skb = alloc(length);
> if (!skb) return FAIL;
> dma_ptr = dma_map(skb, length);
> if (!dma_ptr) {
> free(skb);
> return FAIL;
> }
> if (buffer_info->dma_ptr)
> dma_unmap(buffer_info->dma_ptr, buffer_info->buffer_length);
> buffer_info->skb = skb;
> buffer_info->dma_ptr = dma_ptr;
> buffer_info->buffer_length = length;
> 
> return SUCCESS;
> }
> 
> What do you think?

Yes, I believe this will work.
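For completeness, here is how the pseudo-code above might look when fleshed out with the usual skb/DMA calls. Field names are taken from the quoted hunks where possible; the buffer_info struct name, the GFP flags, and the omitted descriptor programming are assumptions rather than excerpts from the real driver:

static int lan743x_rx_init_ring_element(struct lan743x_rx *rx, int index)
{
        struct lan743x_rx_buffer_info *buffer_info = &rx->buffer_info[index];
        struct net_device *netdev = rx->adapter->netdev;
        struct device *dev = &rx->adapter->pdev->dev;
        struct sk_buff *skb;
        dma_addr_t dma_ptr;
        int length;

        /* single "sampling" of the mtu, so no locking against mtu changes */
        length = netdev->mtu + ETH_HLEN + 4 + RX_HEAD_PADDING;

        skb = __netdev_alloc_skb(netdev, length, GFP_ATOMIC);
        if (!skb)
                return -ENOMEM;

        dma_ptr = dma_map_single(dev, skb->data, length, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, dma_ptr)) {
                dev_kfree_skb_any(skb);
                return -ENOMEM;
        }

        /* only now is it safe to drop the old mapping: any failure above
         * leaves the ring element untouched
         */
        if (buffer_info->dma_ptr)
                dma_unmap_single(dev, buffer_info->dma_ptr,
                                 buffer_info->buffer_length, DMA_FROM_DEVICE);

        buffer_info->skb = skb;
        buffer_info->dma_ptr = dma_ptr;
        buffer_info->buffer_length = length;
        /* (re)programming the rx descriptor with dma_ptr/length is omitted */

        return 0;
}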

> 
> >
> > Also, if lan743x_rx_init_ring_element fails to allocate an skb,
> > then control will jump to process_extension and therefore the currently
> > received skb will not be added to the skb list.
> > I assume that would corrupt the packet? Or am I missing something?
> >
> 
> Yes, an skb alloc failure in the middle of a multi-buffer frame will corrupt
> the packet inside the frame: a chunk will be missing. I had assumed that this
> would be caught by an upper network layer, because some checksum would be
> incorrect?
> 
> Are there current networking devices that would send a corrupted packet to
> Linux if there is a corruption on the physical link? Especially if they don't
> support checksumming?
> 
> Maybe my assumption is naive.
> I'll fix this up if you believe that it could be an issue.

Yes, the upper layers will catch it and drop it.
But if we already know the packet is corrupted, I think it would be better to
drop it here, to avoid unnecessary processing upstream.
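A hedged sketch of what "drop it here" could look like; all names below are illustrative, and the real driver would also need to track the partially assembled frame and skip the remaining buffers of that frame:

/* Discard a partially assembled multi-buffer frame instead of handing a
 * corrupted packet up the stack, and account for it in the rx statistics.
 */
static void rx_drop_partial_frame(struct net_device *netdev,
                                  struct sk_buff *partial_skb)
{
        if (partial_skb)
                dev_kfree_skb_any(partial_skb);
        netdev->stats.rx_dropped++;
}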

...
> 
> RX_PROCESS_RESULT_XXX can now only take two values (RECEIVED and
> NOTHING_TO_DO), so in theory it could be replaced by a bool. But perhaps
> we should keep the current names, because they are clearer to the reader?
> 
I'm ok if you want to change it to a bool.



RE: [PATCH net-next v1 2/6] lan743x: support rx multi-buffer packets

2021-01-30 Thread Bryan.Whitehead
Hi Sven, 

Looks good.
see comments below.

>  static int lan743x_rx_process_packet(struct lan743x_rx *rx)  {
It looks like this function no longer processes a packet, but rather only 
processes a single buffer.
So perhaps it should be renamed to lan743x_rx_process_buffer, so it is not 
misleading.

...
> +   /* unmap from dma */
> +   if (buffer_info->dma_ptr) {
> +   dma_unmap_single(&rx->adapter->pdev->dev,
> +buffer_info->dma_ptr,
> +buffer_info->buffer_length,
> +DMA_FROM_DEVICE);
> +   buffer_info->dma_ptr = 0;
> +   buffer_info->buffer_length = 0;
> +   }
> +   skb = buffer_info->skb;
> 
> -process_extension:
> -   if (extension_index >= 0) {
> -   descriptor = &rx->ring_cpu_ptr[extension_index];
> -   buffer_info = &rx->buffer_info[extension_index];
> -
> -   ts_sec = le32_to_cpu(descriptor->data1);
> -   ts_nsec = (le32_to_cpu(descriptor->data2) &
> - RX_DESC_DATA2_TS_NS_MASK_);
> -   lan743x_rx_reuse_ring_element(rx, extension_index);
> -   real_last_index = extension_index;
> -   }
> +   /* allocate new skb and map to dma */
> +   if (lan743x_rx_init_ring_element(rx, rx->last_head)) {

If lan743x_rx_init_ring_element fails to allocate an skb,
then lan743x_rx_reuse_ring_element will be called.
But that function expects the skb to already be allocated and dma mapped,
and the dma was unmapped above.

Also, if lan743x_rx_init_ring_element fails to allocate an skb,
then control will jump to process_extension and therefore
the currently received skb will not be added to the skb list.
I assume that would corrupt the packet? Or am I missing something?

...
> -   if (!skb) {
> -   result = RX_PROCESS_RESULT_PACKET_DROPPED;
It looks like this return value is no longer used.
If there is no longer a case where a packet will be dropped,
then maybe this return value should be deleted from the header file.

... 
>  move_forward:
> -   /* push tail and head forward */
> -   rx->last_tail = real_last_index;
> -   rx->last_head = lan743x_rx_next_index(rx, real_last_index);
> -   }
> +   /* push tail and head forward */
> +   rx->last_tail = rx->last_head;
> +   rx->last_head = lan743x_rx_next_index(rx, rx->last_head);
> +   result = RX_PROCESS_RESULT_PACKET_RECEIVED;

Since this function handles one buffer at a time,
the return value RX_PROCESS_RESULT_PACKET_RECEIVED is now misleading.
Can you change it to RX_PROCESS_RESULT_BUFFER_RECEIVED?




RE: [PATCH net-next v1 1/6] lan743x: boost performance on cpu archs w/o dma cache snooping

2021-01-30 Thread Bryan.Whitehead
Sven, see comments below.

> @@ -2148,11 +2149,18 @@ static int lan743x_rx_process_packet(struct lan743x_rx *rx)
> descriptor = &rx->ring_cpu_ptr[first_index];
> 
> /* unmap from dma */
> +   packet_length = RX_DESC_DATA0_FRAME_LENGTH_GET_
> +   (descriptor->data0);
It appears you moved this packet_length assignment from just below the
following if block; however, you left out the le32_to_cpu. See the next comment.

> if (buffer_info->dma_ptr) {
> -   dma_unmap_single(&rx->adapter->pdev->dev,
> -buffer_info->dma_ptr,
> -buffer_info->buffer_length,
> -DMA_FROM_DEVICE);
> +   dma_sync_single_for_cpu(&rx->adapter->pdev->dev,
> +   buffer_info->dma_ptr,
> +   packet_length,
> +   DMA_FROM_DEVICE);
> +   dma_unmap_single_attrs(&rx->adapter->pdev->dev,
> +  buffer_info->dma_ptr,
> +  buffer_info->buffer_length,
> +  DMA_FROM_DEVICE,
> +  DMA_ATTR_SKIP_CPU_SYNC);
> buffer_info->dma_ptr = 0;
> buffer_info->buffer_length = 0;
> }
Just below here is the following line:
packet_length = RX_DESC_DATA0_FRAME_LENGTH_GET_(le32_to_cpu(descriptor->data0));
This line was moved above the previous if block, but the le32_to_cpu was
removed. Is that intentional?
Also, I don't see any mention of this packet_length assignment (after the if
block) being removed.
Since packet_length already contains this value, it seems unnecessary to keep
this assignment.

> @@ -2167,8 +2175,8 @@ static int lan743x_rx_process_packet(struct lan743x_rx *rx)
> int index = first_index;
> 
> /* multi buffer packet not supported */
> -   /* this should not happen since
> -* buffers are allocated to be at least jumbo size
> +   /* this should not happen since buffers are allocated
> +* to be at least the mtu size configured in the mac.
>  */
> 
> /* clean up buffers */
>
> @@ -2628,6 +2636,9 @@ static int lan743x_netdev_change_mtu(struct net_device *netdev, int new_mtu)
> struct lan743x_adapter *adapter = netdev_priv(netdev);
> int ret = 0;
> 
> +   if (netif_running(netdev))
> +   return -EBUSY;
> +
> ret = lan743x_mac_set_mtu(adapter, new_mtu);
> if (!ret)
> netdev->mtu = new_mtu;
> --
> 2.17.1



RE: [PATCH] lan743x: Added fixed link support

2020-05-18 Thread Bryan.Whitehead
> -Original Message-
> From: Roelof Berg 
> Sent: Sunday, May 17, 2020 4:45 PM
> To: Andrew Lunn 
> Cc: Bryan Whitehead - C21958 ;
> UNGLinuxDriver ; David S. Miller
> ; net...@vger.kernel.org; linux-
> ker...@vger.kernel.org
> Subject: Re: [PATCH] lan743x: Added fixed link support
> 
> To Everyone: I need a test hardware recommendation for a lan7431/0 NIC in
> normal mode (not fixed-link mode). In prior patches this was not necessary,
> because I was able to ensure 100% backwards compatibility by careful coding
> alone. But I might soon come to a point where I need to test phy-connected
> devices as well.

Hi Roelof,

I believe I can find the hardware back at the office. However, at this time, due
to virus fears, I'm working from home.

Can hardware testing wait until we return to the office?

Regards,
Bryan